cs.AR @ 2025-07-25: 044
-
00 07-24 (4) PRACtical: Subarray-Level Counter Update and Bank-Level Recovery Isolation for Efficient PRAC Rowhammer Mitigation PRACtical: Subarray-Level Counter Update und Bank-Level Recovery Isolation für effiziente PRAC Rowhammer Mitigation PRAC 有效缓解减贫和减贫方案下级反反更新和银行级复苏孤立措施 2507.18581v1 -
01 07-24 DiP: A Scalable, Energy-Efficient Systolic Array for Matrix Multiplication Acceleration DiP: Ein skalierbarer, energieeffizienter Systolischer Array für Matrix-Multiplikationsbeschleunigung DiP:一个可缩放的、节能的、用于加速矩阵乘法加速的节能收缩阵列阵列 2412.09709v3 -
02 07-24 GNN-ACLP: Graph Neural Networks Based Analog Circuit Link Prediction GNN-ACLP: Graph Neural Networks Based Analog Circuit Link Prediction GNN-ACLP:基于模拟电路链接预测的图表神经网络 2504.10240v4 -
03 07-24 A 55-nm SRAM Chip Scanning Errors Every 125 ns for Event-Wise Soft Error Measurement Ein 55-nm-SRAM-Chip-Scannfehler alle 125 ns für Event-Wise-Soft-Error-Messung A 55-nm SRAM 芯片扫描错误 2504.08305v2 -
04 07-24 Explicit Sign-Magnitude Encoders Enable Power-Efficient Multipliers Explizite Zeichen-Magnituden-Encoder aktivieren leistungsfähige Multiplikatoren 显性信号- 数学编码器启用功率功率乘数器 2507.18179v1 -
05 07-24 Real-Time Object Detection and Classification using YOLO for Edge FPGAs Echtzeit-Objekterkennung und -Klassifizierung mit YOLO für Edge-FPGAs 实时物体探测和分类,用YOLO对边缘的FPGAs进行实时物体探测和分类 2507.18174v1 -
06 07-24 Designing High-Performance and Thermally Feasible Multi-Chiplet Architectures enabled by Non-bendable Glass Interposer Designing High-Performance und Thermisch Machbare Multi-Chiplet-Architekturen durch nicht-biegbare Glasinterposer ermöglicht 设计高性能和热能多芯结构,由不可移植的玻璃干涉器启用 2507.18040v1 -
07 07-23 (3) The New LLM Bottleneck: A Systems Perspective on Latent Attention and Mixture-of-Experts Der neue LLM-Bottleneck: Eine Systemperspektive auf latente Aufmerksamkeit und Mixture-of-Experts 新的LLM瓶颈:对长期关注和混合专家的系统观点 2507.15465v2 -
08 07-23 Neuromorphic Computing: A Theoretical Framework for Time, Space, and Energy Scaling Neuromorphes Rechnen: Ein theoretisches Framework für Zeit-, Raum- und Energieskalierung 神经形态计算:时间、空间和能源规模的理论框架 2507.17886v1 -
09 07-23 Ironman: Accelerating Oblivious Transfer Extension for Privacy-Preserving AI with Near-Memory Processing Ironman: Beschleunigen Sie Oblivious Transfer Erweiterung für Datenschutz-Erhaltung KI mit Nah-Memory-Verarbeitung 铁人:加快隐私保护协会的 “ 加速不明显转让扩展 “ ,用于近中处理 2507.16391v2 -
10 07-23 Efficient Precision-Scalable Hardware for Microscaling (MX) Processing in Robotics Learning Effiziente Präzisionsskalierbare Hardware für die Mikroskalierung (MX) Verarbeitung in der Robotik 用于机器人学习微缩缩缩(MX)处理的高效精密缩放硬件 2505.22404v2 -
11 07-23 Hardware-Efficient Photonic Tensor Core: Accelerating Deep Neural Networks with Structured Compression Hardware-Effizient Photonic Tensor Core: Beschleunigen von tiefen neuralen Netzwerken mit strukturierter Kompression 硬件-高效光学光学时标核心:有结构压缩的加速深神经网络 2502.01670v2 -
12 07-23 Enabling Efficient Transaction Processing on CXL-Based Memory Sharing Effiziente Transaktionsverarbeitung auf CXL-Basis ermöglichen 促进基于CXL的记忆共享方面的高效率交易处理 2502.11046v2 -
13 07-22 (2) MTU: The Multifunction Tree Unit in zkSpeed for Accelerating HyperPlonk MTU: Multifunktionsbaum in zkSpeed zur Beschleunigung von HyperPlonk MTU: 以zkSpeed为单位的多功能树单位, 用于加速超隆 2507.16793v1 -
14 07-22 Custom Algorithm-based Fault Tolerance for Attention Layers in Transformers Benutzerdefinierte Algorithmen-basierte Fehlertoleranz für Aufmerksamkeitsschichten in Transformatoren 自定义基于 ALgorithm 的对变换器中注意层的不宽容 2507.16676v1 -
15 07-22 GCC: A 3DGS Inference Architecture with Gaussian-Wise and Cross-Stage Conditional Processing GCC: Eine 3DGS-Inferenzarchitektur mit Gaußian-Wise- und Cross-Stage-Bedingung 海合会:3DGS推理结构,带有高西-怀斯和跨标准条件处理 2507.15300v2 -
16 07-22 Augmenting Von Neumann’s Architecture for an Intelligent Future Augmenting von Neumanns Architektur für eine intelligente Zukunft 增强冯·诺伊曼的智慧未来建筑 2507.16628v1 -
17 07-22 Optimization of DNN-based HSI Segmentation FPGA-based SoC for ADS: A Practical Approach Optimierung der DNN-basierten HSI-Segmentierung FPGA-basierten SoC für ADS: Ein praktischer Ansatz 优化基于DNN 的基于DNNHSIHSI的ADS的基于FPGA的FPGA SoC分类:一种实用办法 2507.16556v1 -
18 07-22 Balancing Robustness and Efficiency in Embedded DNNs Through Activation Function Selection Ausbalancierung von Robustheit und Effizienz in eingebetteten DNNs durch Aktivierungsfunktionsauswahl 通过启动职能选择,在嵌入的DNN 中平衡稳健和效率 2504.05119v2 -
19 07-22 ApproxGNN: A Pretrained GNN for Parameter Prediction in Design Space Exploration for Approximate Computing ApproxGNN: Ein prätrainiertes GNN für Parameter-Vorhersage in der Design-Weltraum-Exploration für annäherndes Rechnen ApproxGNN: 设计近光计算空间探索中的参数预测预培训的GNN 2507.16379v1 -
20 07-22 Hourglass Sorting: A novel parallel sorting algorithm and its implementation Sanduhr-Sortierung: Ein neuartiger paralleler Sortieralgorithmus und seine Implementierung 沙漏分类:新颖的平行分类算法及其实施 2507.16326v1 -
21 07-22 SVAgent: AI Agent for Hardware Security Verification Assertion SVAgent: KI-Agent für Hardware-Sicherheitsprüfung Assertion AI 硬件安全核查认证代理商 2507.16203v1 -
22 07-22 RealBench: Benchmarking Verilog Generation Models with Real-World IP Designs RealBench: Benchmarking von Verilog-Generationsmodellen mit Real-World-IP-Designs RealBench:以现实世界的IP设计为标准,将风险生成模型与现实世界的IP设计作为基准 2507.16200v1 -
23 07-22 A Sparsity-Aware Autonomous Path Planning Accelerator with HW/SW Co-Design and Multi-Level Dataflow Optimization A Sparsity-Aware Autonomous Path Planning Accelerator mit HW/SW Co-Design und Multi-Level-Datenflussoptimierung 配有HW/SW共同设计和多层次数据流优化的公平-软件自主路径规划加速器 2507.16177v1 -
24 07-21 (1) Autocomp: LLM-Driven Code Optimization for Tensor Accelerators Autocomp: LLM-gesteuerte Code-Optimierung für Tensor-Beschleuniger 自动comp: LLM- Driven 代码对 Tensor 加速器的优化 2505.18574v3 -
25 07-21 Per-Bank Bandwidth Regulation of Shared Last-Level Cache for Real-Time Systems Per-Bank-Bandbreitenregulierung des geteilten Last-Level-Cache für Echtzeit-Systeme 关于实时系统共享最后一级缓存的每家银行宽幅度条例 2410.14003v2 -
26 07-21 VeriRAG: A Retrieval-Augmented Framework for Automated RTL Testability Repair VeriRAG: Ein Retrieval-Augmented Framework für automatisierte RTL-Testbarkeits-Reparatur VeriRAG: 自动RTL可测试性修理检索增强框架 2507.15664v1 -
27 07-21 When Pipelined In-Memory Accelerators Meet Spiking Direct Feedback Alignment: A Co-Design for Neuromorphic Edge Computing Wenn pipelined In-Memory Accelerators treffen Spiking Direct Feedback Alignment: Ein Co-Design für neuromorphe Edge Computing 当射线内模拟加速器与Spiking直接反馈对齐:神经边缘计算共同设计时 2507.15603v1 -
28 07-21 HEPPO-GAE: Hardware-Efficient Proximal Policy Optimization with Generalized Advantage Estimation HEPPO-GAE: Hardwareeffiziente proximale Politikoptimierung mit generalisierter Vorteilsschätzung HEPPO-GAE: 采用通用的先进估计法优化政策 2501.12703v2 -
29 07-21 Unicorn-CIM: Uncovering the Vulnerability and Improving the Resilience of High-Precision Compute-in-Memory Einhorn-CIM: Enthüllen der Schwachstelle und Verbesserung der Resilienz von Hochpräzisions-Compute-in-Memory 独角兽-CIM: 消除脆弱性和提高高精确度计量的复原力 2506.02311v2 -
30 07-19 (6) Iceberg: Enhancing HLS Modeling with Synthetic Data Iceberg: Verbesserung der HLS-Modellierung mit synthetischen Daten 冰山:加强利用合成数据建立HLS模型 2507.09948v2 -
31 07-19 Enabling Efficient Hardware Acceleration of Hybrid Vision Transformer (ViT) Networks at the Edge Effiziente Hardware-Beschleunigung von Hybrid Vision Transformer (ViT)-Netzwerken am Rand ermöglichen 使边缘的混合愿景变异器(VT)网络的高效硬件加速 2507.14651v1 -
32 07-19 Characterizing State Space Model (SSM) and SSM-Transformer Hybrid Language Model Performance with Long Context Length Charakterisieren von State Space Model (SSM) und SSM-Transformer Hybrid Language Model Performance mit langer Kontextlänge 确定国家空间模型(SSM)和SSM-过渡混合语言模型长内性性能特点 2507.12442v2 -
33 07-18 (5) Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity are all you need Effiziente LLM-Inferenz: Bandbreite, Berechnung, Synchronisierung und Kapazität sind alles, was Sie brauchen 高效率的LLM 推论: 带宽、计算、同步、能力是您所需要的 2507.14397v1 -
34 07-18 Hardware-Compatible Single-Shot Feasible-Space Heuristics for Solving the Quadratic Assignment Problem Hardware-kompatible Single-Shot-Feasible-Space-Heuristiken zur Lösung des quadratischen Zuordnungsproblems 用于解决四压指派问题的可兼容的硬件兼容单制制式实际空间- 超光速空间方法 2503.09676v2 -
35 07-18 An End-to-End DNN Inference Framework for the SpiNNaker2 Neuromorphic MPSoC Ein End-to-End DNN-Inferenz-Framework für den SpiNNaker2 Neuromorphic MPSoC SpiNNaker2神经地态 MPSoC 的端对端 DNN 推推框架 2507.13736v1 -
36 07-18 Fast Graph Vector Search via Hardware Acceleration and Delayed-Synchronization Traversal Schnelle Graphen-Vektorsuche über Hardware-Beschleunigung und verzögerte Synchronisation Traversal 通过硬件加速和延迟同步同步搜索快速图表矢量 2406.12385v2 -
37 07-18 4T2R X-ReRAM CiM Array for Variation-tolerant, Low-power, Massively Parallel MAC Operation 4T2R X-ReRAM CiM Array für variantentolerante, leistungsarme, massiv parallele MAC-Operation 4T2R X-RERRAM CIM 用于机动性、耐变、低功率、大规模平行的MAC行动 2507.13631v1 -
38 07-18 LASANA: Large-Scale Surrogate Modeling for Analog Neuromorphic Architecture Exploration LASANA: großflächige Surrogatmodellierung für die Erforschung der analogen neuromorphen Architektur LASANA:模拟神经成形建筑勘探大型代谢模型 2507.10748v2 -
39 07-17 (4) GPU Performance Portability needs Autotuning GPU Performance Portability benötigt Autotuning GPU 性能表现 便捷性需要自动调节 2505.03780v3 -
40 07-17 WIP: Turning Fake Chips into Learning Opportunities WIP: Fake Chips in Lernmöglichkeiten verwandeln WIP:将假芯片转化为学习机会 2507.13281v1 -
41 07-17 High-Performance Pipelined NTT Accelerators with Homogeneous Digit-Serial Modulo Arithmetic Hochleistungspipelined-NTT-Beschleuniger mit homogener Digit-Serialmodulo-Arithmetik NTT 高性能加速器 2507.12418v2 -
42 07-17 MC$^2$A: Enabling Algorithm-Hardware Co-Design for Efficient Markov Chain Monte Carlo Acceleration MC$^2$A: Algorithm-Hardware Co-Design für effiziente Markov-Kette Monte Carlo Beschleunigung MC$$2$A: 提高Markov链节蒙特卡洛速度加速速度的辅助算法-Hardware共同设计 2507.12935v1 -
43 07-17 An ultra-low-power CGRA for accelerating Transformers at the edge Ein ultra-low-power CGRA zur Beschleunigung von Transformern am Rand 用于加速边缘变压器的超低功率CGRAA 2507.12904v1
Article 0
Title@2025-07-24 (4): PRACtical: Subarray-Level Counter Update and Bank-Level Recovery Isolation for Efficient PRAC Rowhammer Mitigation
Title: PRACtical: Subarray-Level Counter Update and Bank-Level Recovery Isolation for Efficient PRAC Rowhammer Mitigation | PRACtical: Subarray-Level Counter Update und Bank-Level Recovery Isolation für effiziente PRAC Rowhammer Mitigation | PRAC 有效缓解减贫和减贫方案下级反反更新和银行级复苏孤立措施 2507.18581v1 |
Authors (4): Ravan Nazaraliyev, Saber Ganjisaffar, Nurlan Nazaraliyev, Nael Abu-Ghazaleh
As DRAM density increases, Rowhammer becomes more severe due to heightened charge leakage, reducing the number of activations needed to induce bit flips. The DDR5 standard addresses this threat with in-DRAM per-row activation counters (PRAC) and the Alert Back-Off (ABO) signal to trigger mitigation. However, PRAC adds performance overhead by incrementing counters during the precharge phase, and recovery refreshes stall the entire memory channel, even if only one bank is under attack. We propose PRACtical, a performance-optimized approach to PRAC+ABO that maintains the same security guarantees. First, we reduce counter update latency by introducing a centralized increment circuit, enabling overlap between counter updates and subsequent row activations in other subarrays. Second, we enhance the $RFM_{ab}$ mitigation by enabling bank-level granularity: instead of stalling the entire channel, only affected banks are paused. This is achieved through a DRAM-resident register that identifies attacked banks. PRACtical improves performance by 8% on average (up to 20%) over the state-of-the-art, reduces energy by 19%, and limits performance degradation from aggressive performance attacks to less than 6%, all while preserving Rowhammer protection.
随着DRAM密度的增加,Rowhammer公司由于高排爆泄漏而变得更为严重,从而降低了催化比试所需的激活次数。DDL15标准(DR5)用DRAM每行启动计数器(PRAC)和警报回退信号来应对这一威胁,以触发缓解。然而,由于DRAM密度增加,DRAM密度增加,Rowhammer公司由于高排泄漏而变得更加严重。DRAM公司在充电前阶段通过加注计数器增加业绩管理费,而恢复更新系统则使整个记忆频道陷于停顿,即使只有一家银行受到攻击。我们提议对PRAC+ABO公司采用业绩优化的方法,以维持同样的安全保障。首先,我们通过中央递增电路减少反更新,使反调更新和随后在其他次列内启动行的信号发生重叠。第二,我们通过允许银行级颗粒化器增加$RFMZab}减速率:不是拖延整个频道,而是暂停受影响的银行。我们通过DRAM公司登记册来查明受攻击的银行。
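The bank-level recovery idea in the abstract can be illustrated with a toy model. The sketch below is only an assumption-laden illustration, not the authors' design: the class name `PracChannel`, the threshold value, and the counter-reset policy are all hypothetical, and the set `attacked_banks` merely stands in for the DRAM-resident register the abstract mentions.

```python
from collections import defaultdict


class PracChannel:
    """Toy PRAC model: one activation counter per (bank, row). Hypothetical sketch."""

    def __init__(self, num_banks, alert_threshold):
        self.num_banks = num_banks
        self.alert_threshold = alert_threshold
        self.counters = defaultdict(int)   # (bank, row) -> activation count
        self.attacked_banks = set()        # stand-in for the DRAM-resident register

    def activate(self, bank, row):
        """Count one row activation; return True when an alert should be raised."""
        self.counters[(bank, row)] += 1
        if self.counters[(bank, row)] >= self.alert_threshold:
            self.attacked_banks.add(bank)
            return True
        return False

    def recover(self, bank_level):
        """Return the set of banks stalled during recovery, then clear the alert state."""
        if bank_level:
            stalled = set(self.attacked_banks)      # only flagged banks pause
        else:
            stalled = set(range(self.num_banks))    # channel-wide stall
        # Assumed policy: reset counters that reached the threshold in flagged banks.
        for (bank, row), count in list(self.counters.items()):
            if bank in self.attacked_banks and count >= self.alert_threshold:
                self.counters[(bank, row)] = 0
        self.attacked_banks.clear()
        return stalled


channel = PracChannel(num_banks=4, alert_threshold=3)
for _ in range(3):
    channel.activate(bank=2, row=7)
stalled_banks = channel.recover(bank_level=True)    # only bank 2 is paused
```

With `bank_level=False` the same attack would stall all four banks, which is the channel-wide behaviour the abstract describes as the baseline.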
Article 1
Title@2025-07-24 (4): DiP: A Scalable, Energy-Efficient Systolic Array for Matrix Multiplication Acceleration
Title: DiP: A Scalable, Energy-Efficient Systolic Array for Matrix Multiplication Acceleration | DiP: Ein skalierbarer, energieeffizienter Systolischer Array für Matrix-Multiplikationsbeschleunigung | DiP:一个可缩放的、节能的、用于加速矩阵乘法加速的节能收缩阵列阵列 2412.09709v3 |
Authors (3): Ahmed J. Abdelmaksoud, Shady Agwa, Themis Prodromakis
Transformers are gaining increasing attention across Natural Language Processing (NLP) application domains due to their outstanding accuracy. However, these data-intensive models add significant performance demands to the existing computing architectures. Systolic array architectures, adopted by commercial AI computing platforms like Google TPUs, offer energy-efficient data reuse but face throughput and energy penalties due to input-output synchronization via First-In-First-Out (FIFO) buffers. This paper proposes a novel scalable systolic array architecture featuring Diagonal-Input and Permutated weight stationary (DiP) dataflow for matrix multiplication acceleration. The proposed architecture eliminates the synchronization FIFOs required by state-of-the-art weight stationary systolic arrays. Beyond the area, power, and energy savings achieved by eliminating these FIFOs, DiP architecture maximizes the computational resource utilization, achieving up to 50% throughput improvement over conventional weight stationary architectures. Analytical models are developed for both weight stationary and DiP architectures, including latency, throughput, time to full PE utilization (TFPU), and FIFO overhead. A comprehensive hardware design space exploration using 22nm commercial technology demonstrates DiP’s scalability advantages, achieving up to a 2.02x improvement in energy efficiency per area. Furthermore, DiP outperforms TPU-like architectures on transformer workloads from widely used models, delivering energy improvement up to 1.81x and latency improvement up to 1.49x. At a 64x64 size with 4096 PEs, DiP achieves a peak throughput of 8.192 TOPS with an energy efficiency of 9.548 TOPS/W.
由于出色的准确性,Transformer 在自然语言处理(NLP)各应用领域受到越来越多的关注。然而,这些数据密集型模型对现有计算架构提出了很高的性能要求。谷歌 TPU 等商用 AI 计算平台采用的脉动阵列(systolic array)架构提供了高能效的数据复用,但由于通过先进先出(FIFO)缓冲区进行输入-输出同步,会带来吞吐量和能耗上的损失。本文提出一种新的可扩展脉动阵列架构,采用对角输入与置换权重固定(DiP)数据流来加速矩阵乘法。该架构消除了现有权重固定脉动阵列所需的同步 FIFO。除了消除这些 FIFO 带来的面积、功耗和能耗节省之外,DiP 架构还最大化了计算资源利用率,相比传统权重固定架构吞吐量最多提高 50%。本文为权重固定架构和 DiP 架构建立了分析模型,涵盖延迟、吞吐量、达到全部 PE 利用的时间(TFPU)以及 FIFO 开销。基于 22nm 商用工艺的全面硬件设计空间探索展示了 DiP 的可扩展性优势,单位面积能效最多提高 2.02 倍。此外,在常用模型的 Transformer 工作负载上,DiP 优于类 TPU 架构,能耗最多改善 1.81 倍,延迟最多改善 1.49 倍。在 64x64 规模(4096 个 PE)下,DiP 的峰值吞吐量达到 8.192 TOPS,能效为 9.548 TOPS/W。
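The closing figures of the abstract (64x64 array, 4096 PEs, 8.192 TOPS, 9.548 TOPS/W) can be cross-checked with simple arithmetic. The sketch below assumes the common convention that one multiply-accumulate counts as two operations; the clock frequency and power it derives are implied values under that assumption and are not stated in the abstract.

```python
def peak_tops(rows, cols, freq_hz, ops_per_mac=2):
    """Peak throughput in TOPS for a rows x cols array doing one MAC per PE per cycle."""
    return rows * cols * ops_per_mac * freq_hz / 1e12


num_pes = 64 * 64                                # 4096 PEs, as in the abstract
reported_tops = 8.192
reported_tops_per_watt = 9.548

# Implied clock frequency, assuming 2 operations per MAC (an assumption).
implied_freq_hz = reported_tops * 1e12 / (num_pes * 2)

# Implied power at peak throughput: TOPS divided by TOPS/W.
implied_power_w = reported_tops / reported_tops_per_watt
```

Under these assumptions the reported peak corresponds to a clock of about 1 GHz and a power draw a little under 0.86 W.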
Article 2
Title@2025-07-24 (4): GNN-ACLP: Graph Neural Networks Based Analog Circuit Link Prediction
Title: GNN-ACLP: Graph Neural Networks Based Analog Circuit Link Prediction | GNN-ACLP: Graph Neural Networks Based Analog Circuit Link Prediction | GNN-ACLP:基于模拟电路链接预测的图表神经网络 2504.10240v4 |
Authors (9): Guanyuan Pan, Tiansheng Zhou, Bingtao Ma, Yaqi Wang, Jianxiang Zhao, Zhi Li, Yugui Lin, Pietro Lio, Shuai Wang
Circuit link prediction, which identifies missing component connections from incomplete netlists, is crucial in analog circuit design automation. However, existing methods face three main challenges: 1) Insufficient use of topological patterns in circuit graphs reduces prediction accuracy; 2) Data scarcity due to the complexity of annotations hinders model generalization; 3) Limited adaptability to various netlist formats. We propose GNN-ACLP, a graph neural network (GNN)-based method featuring three innovations to tackle these challenges. First, we introduce the SEAL (learning from Subgraphs, Embeddings, and Attributes for Link prediction) framework and achieve port-level accuracy in circuit link prediction. Second, we propose Netlist Babel Fish, a netlist format conversion tool leveraging retrieval-augmented generation (RAG) with a large language model (LLM) to improve the compatibility of netlist formats. Finally, we construct SpiceNetlist, a comprehensive dataset that contains 775 annotated circuits across 10 different component classes. Experiments demonstrate accuracy improvements of 16.08% on SpiceNetlist, 11.38% on Image2Net, and 16.01% on Masala-CHAI compared to the baseline in intra-dataset evaluation, while maintaining accuracy from 92.05% to 99.07% in cross-dataset evaluation, exhibiting robust feature transfer capabilities.
从不完整网表中识别缺失元件连接的电路链接预测,在模拟电路设计自动化中至关重要。然而,现有方法面临三大挑战:(1) 对电路图中拓扑模式的利用不足,降低了预测准确性;(2) 标注的复杂性导致数据稀缺,妨碍了模型泛化;(3) 对各种网表格式的适应性有限。我们提出 GNN-ACLP,一种基于图神经网络(GNN)的方法,通过三项创新应对这些挑战。首先,我们引入 SEAL(learning from Subgraphs, Embeddings, and Attributes for Link prediction)框架,在电路链接预测中实现端口级准确性。其次,我们提出 Netlist Babel Fish,一种利用检索增强生成(RAG)和大语言模型(LLM)的网表格式转换工具,以提高网表格式的兼容性。最后,我们构建了 SpiceNetlist,一个包含 10 种不同元件类别、共 775 个带标注电路的综合数据集。实验表明,在数据集内评估中,与基线相比,准确率在 SpiceNetlist 上提高 16.08%,在 Image2Net 上提高 11.38%,在 Masala-CHAI 上提高 16.01%;在跨数据集评估中,准确率保持在 92.05% 至 99.07%,显示出稳健的特征迁移能力。
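SEAL-style link prediction typically starts by extracting an enclosing subgraph around the candidate link and hiding the link itself before a GNN scores it. The sketch below shows only that extraction step on a tiny, made-up netlist graph; the node names, the function names, and the choice of a 1-hop radius are illustrative assumptions, and nothing here reproduces the paper's actual pipeline.

```python
from collections import deque


def k_hop_nodes(adj, start, k):
    """Return all nodes within k hops of start (breadth-first search)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if dist[node] == k:
            continue
        for neigh in adj.get(node, ()):
            if neigh not in dist:
                dist[neigh] = dist[node] + 1
                queue.append(neigh)
    return set(dist)


def enclosing_subgraph(edges, u, v, k=1):
    """Extract the k-hop enclosing subgraph of candidate link (u, v), hiding the link."""
    kept = [(a, b) for a, b in edges if {a, b} != {u, v}]
    adj = {}
    for a, b in kept:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    nodes = k_hop_nodes(adj, u, k) | k_hop_nodes(adj, v, k)
    sub_edges = [(a, b) for a, b in kept if a in nodes and b in nodes]
    return nodes, sub_edges


# Hypothetical netlist: components (R1, C1, M1) connected to nets (n1, n2, gnd, vdd).
netlist_edges = [
    ("R1", "n1"), ("R1", "n2"),
    ("C1", "n2"), ("C1", "gnd"),
    ("M1", "n2"), ("M1", "vdd"),
]
nodes, sub_edges = enclosing_subgraph(netlist_edges, "R1", "n2", k=1)
```

A model would then be asked whether the hidden connection between `R1` and `n2` should exist, given only the surrounding subgraph.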
Article 3
Title@2025-07-24 (4): A 55-nm SRAM Chip Scanning Errors Every 125 ns for Event-Wise Soft Error Measurement
Title: A 55-nm SRAM Chip Scanning Errors Every 125 ns for Event-Wise Soft Error Measurement | Ein 55-nm-SRAM-Chip-Scannfehler alle 125 ns für Event-Wise-Soft-Error-Messung | A 55-nm SRAM 芯片扫描错误 2504.08305v2 |
Authors (7): Yuibi Gomi, Akira Sato, Waleed Madany, Kenichi Okada, Satoshi Adachi, Masatoshi Itoh, Masanori Hashimoto
We developed a 55 nm CMOS SRAM chip that scans all data every 125 ns and outputs timestamped soft error data via an SPI interface through a FIFO. The proposed system, consisting of the developed chip and particle detectors, enables event-wise soft error measurement and precise identification of SBUs and MCUs, thus resolving misclassifications such as Pseudo- and Distant MCUs that conventional methods cannot distinguish. An 80-MeV proton irradiation experiment at RARiS, Tohoku University verified the system operation. Timestamps between the SRAM chip and the particle detectors were successfully synchronized, accounting for PLL disturbances caused by radiation. Event building was achieved by determining a reset offset with sub-ns resolution, and spatial synchronization was maintained within several tens of micrometers.
我们开发了55 nm CMOS SRAM芯片,通过FIFO通过SPI接口扫描每125 ns和输出每125 个输出时间戳的软错误数据,由开发的芯片和粒子探测器组成的拟议系统可以对事件进行软误测和精确识别SBU和MCU,从而解决常规方法无法区分的Psedo-和Distant MCUs等错误分类问题。在Tohoku大学RARiS进行了80-MeV质子辐照实验,对系统操作进行了核查。SRAM芯片和粒子探测器之间的时间戳是同步的,对辐射引起的PLL扰动进行了核算。通过确定子分辨率的重新设置和分辨分辨率,实现了活动建设,并在数个微米内保持了空间同步。
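Event building, as described above, amounts to pairing each timestamped SRAM error with the particle-detector hits that fall in the matching time window. The following is a minimal sketch under stated assumptions: errors are identified by scan index, an error found at scan n is attributed to the 125 ns interval ending at that scan, and `offset_ns` is a hypothetical stand-in for the reset offset between the two clocks. The real system's matching logic is not given in the abstract.

```python
SCAN_PERIOD_NS = 125.0  # scan interval stated in the abstract


def build_events(error_scans, hit_times_ns, offset_ns=0.0):
    """Pair each SRAM error (by scan index) with detector hits in its scan window."""
    events = []
    for scan in error_scans:
        window_end = scan * SCAN_PERIOD_NS + offset_ns
        window_start = window_end - SCAN_PERIOD_NS
        hits = [t for t in hit_times_ns if window_start < t <= window_end]
        events.append((scan, hits))
    return events


# Hypothetical data: one error seen at scan 8, three detector hits (times in ns).
events = build_events(error_scans=[8], hit_times_ns=[100.0, 900.0, 990.0])
```

In this example the window for scan 8 is (875 ns, 1000 ns], so the hits at 900 ns and 990 ns are associated with the error and the hit at 100 ns is not.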
Article 4
Title@2025-07-24 (4): Explicit Sign-Magnitude Encoders Enable Power-Efficient Multipliers
Title: Explicit Sign-Magnitude Encoders Enable Power-Efficient Multipliers | Explizite Zeichen-Magnituden-Encoder aktivieren leistungsfähige Multiplikatoren | 显性信号- 数学编码器启用功率功率乘数器 2507.18179v1 |
Authors (5): Felix Arnold, Maxence Bouvier, Ryan Amaudruz, Renzo Andri, Lukas Cavigelli
This work presents a method to maximize power-efficiency of fixed point multiplier units by decomposing them into sub-components. First, an encoder block converts the operands from two’s complement to a sign-magnitude representation, followed by a multiplier module which performs the compute operation and outputs the resulting value in the original format. This makes it possible to leverage the power-efficiency of the sign-magnitude encoding for the multiplication. To ensure the computing format is not altered, those two components are synthesized and optimized separately. Our method leads to significant power savings for input values centered around zero, as commonly encountered in AI workloads. Under a realistic input stream with values normally distributed with a standard deviation of 3.0, post-synthesis simulations of the 4-bit multiplier design show up to 12.9% lower switching activity compared to synthesis without decomposition. Those gains are achieved while ensuring compatibility with any production-ready system, as the overall circuit stays logic-equivalent. With this compatibility constraint lifted and a slightly smaller input range of -7 to +7, switching activity reductions can reach up to 33%. Additionally, we demonstrate that synthesis optimization methods based on switching-activity-driven design space exploration can yield a further 5-10% improvement in power-efficiency compared to a power-agnostic approach.
本文提出一种方法,通过将定点乘法器单元分解为子组件来最大化其功率效率。首先,编码器模块将操作数从二进制补码转换为符号-幅值表示;随后,乘法器模块执行计算,并以原始格式输出结果。这样就可以在乘法中利用符号-幅值编码的功率效率。为确保计算格式不被改变,这两个组件分别进行综合和优化。对于以零为中心的输入值(AI 工作负载中很常见),该方法可显著节省功耗。在数值服从标准差为 3.0 的正态分布的真实输入流下,4 位乘法器设计的综合后仿真显示,与不分解的综合相比,开关活动最多降低 12.9%。由于整体电路保持逻辑等价,这些收益是在保证可集成到任何量产系统的前提下取得的。如果放宽这一兼容性要求,并将输入范围略微缩小到 -7 至 +7,开关活动最多可降低 33%。此外,我们证明,与不考虑功耗的方法相比,基于开关活动驱动的设计空间探索的综合优化方法可使功率效率进一步提高 5-10%。
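Why sign-magnitude helps for values centered around zero can be seen by counting how many input bits flip between consecutive operands. The sketch below compares 4-bit two's complement with 4-bit sign-magnitude on a short, made-up sequence that alternates in sign. It counts toggles on the operand bits only, which is a rough proxy and not the post-synthesis switching activity the abstract reports.

```python
def twos_complement(x, bits=4):
    """Encode a signed integer as a two's-complement bit pattern."""
    return x & ((1 << bits) - 1)


def sign_magnitude(x, bits=4):
    """Encode a signed integer as sign bit plus magnitude (range -7..+7 for 4 bits)."""
    assert abs(x) < (1 << (bits - 1)), "magnitude does not fit"
    sign = 1 if x < 0 else 0
    return (sign << (bits - 1)) | abs(x)


def toggles(codes):
    """Total number of bit flips between consecutive code words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(codes, codes[1:]))


# Hypothetical operand stream with small values that change sign often.
stream = [1, -1, 2, -2, 1, -1]
tc_toggles = toggles([twos_complement(x) for x in stream])   # 15 bit flips
sm_toggles = toggles([sign_magnitude(x) for x in stream])    # 9 bit flips
```

In two's complement a sign change on a small value flips most of the upper bits, whereas in sign-magnitude it flips only the sign bit. Note that 4-bit sign-magnitude cannot represent -8, which is consistent with the -7 to +7 range mentioned in the abstract.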
Article 5
Title@2025-07-24 (4): Real-Time Object Detection and Classification using YOLO for Edge FPGAs
Title: Real-Time Object Detection and Classification using YOLO for Edge FPGAs | Echtzeit-Objekterkennung und -Klassifizierung mit YOLO für Edge-FPGAs | 实时物体探测和分类,用YOLO对边缘的FPGAs进行实时物体探测和分类 2507.18174v1 |
Authors (2): Rashed Al Amin, Roman Obermaisser
Object detection and classification are crucial tasks across various application domains, particularly in the development of safe and reliable Advanced Driver Assistance Systems (ADAS). Existing deep learning-based methods such as Convolutional Neural Networks (CNNs), Single Shot Detectors (SSDs), and You Only Look Once (YOLO) have demonstrated high performance in terms of accuracy and computational speed when deployed on Field-Programmable Gate Arrays (FPGAs). However, despite these advances, state-of-the-art YOLO-based object detection and classification systems continue to face challenges in achieving resource efficiency suitable for edge FPGA platforms. To address this limitation, this paper presents a resource-efficient real-time object detection and classification system based on YOLOv5 optimized for FPGA deployment. The proposed system is trained on the COCO and GTSRD datasets and implemented on the Xilinx Kria KV260 FPGA board. Experimental results demonstrate a classification accuracy of 99%, with a power consumption of 3.5W and a processing speed of 9 frames per second (FPS). These findings highlight the effectiveness of the proposed approach in enabling real-time, resource-efficient object detection and classification for edge computing applications.
现有深层次的学习方法,如进化神经网络(CNNs)、单一射击探测器(SSDS)和“只看一眼”(YOLO)等,在部署在外地可编程门阵列(FPGAs)上时,在准确性和计算速度方面表现良好;然而,尽管取得了这些进步,基于最新工艺的YOLO物体探测和分类系统在实现适合FPGA边缘平台的资源效率方面继续面临挑战。为了应对这一限制,本文件展示了以YOLOv5为最佳优化的用于FPGA部署的资源高效实时物体探测和分类系统。拟议系统在CCOCO和GTSRD数据集方面得到了培训,并在Xilinx Kria KV260 FPGA板上实施。实验结果显示,分类准确率为99%,电力消耗为3.5W,处理速度为每秒9个目标(FPS),这些结果突出表明了实时目标探测和升级应用系统的效率。
Article 6
Title@2025-07-24 (4): Designing High-Performance and Thermally Feasible Multi-Chiplet Architectures enabled by Non-bendable Glass Interposer
Title: Designing High-Performance and Thermally Feasible Multi-Chiplet Architectures enabled by Non-bendable Glass Interposer | Designing High-Performance und Thermisch Machbare Multi-Chiplet-Architekturen durch nicht-biegbare Glasinterposer ermöglicht | 设计高性能和热能多芯结构,由不可移植的玻璃干涉器启用 2507.18040v1 |
Authors (4): Harsh Sharma, Janardhan Rao Doppa, Umit Y. Ogras, Partha Pratim Pande
Multi-chiplet architectures enabled by glass interposers offer superior electrical performance, enable higher bus widths due to reduced crosstalk, and have lower capacitance in the redistribution layer than current silicon interposer-based systems. These advantages result in lower energy per bit, higher communication frequencies, and extended interconnect range. However, deformation of the package (warpage) in glass interposer-based systems becomes a critical challenge as system size increases, leading to severe mechanical stress and reliability concerns. Beyond a certain size, conventional packaging techniques fail to manage warpage effectively, necessitating new approaches to mitigate warpage-induced bending with scalable performance for glass interposer-based multi-chiplet systems. To address these intertwined challenges, we propose a thermal-, warpage-, and performance-aware design framework that employs architecture and packaging co-optimization. The proposed framework disintegrates the surface and embedded chiplets to balance conflicting design objectives, ensuring optimal trade-offs between performance, power, and structural reliability. Our experiments demonstrate that optimized multi-chiplet architectures from our design framework achieve up to 64.7% performance improvement and 40% power reduction compared to traditional 2.5D systems when executing deep neural network workloads, with lower fabrication costs.
由玻璃干涉器促成的多晶体外结构能够提供较高的电力性能,由于交叉对谈减少而使公共汽车宽度提高,并且比目前硅干涉器系统在再分配层中的能力较低。这些优势导致每位能量降低,通信频率提高,互连范围扩大。然而,玻璃干涉器系统中包件(隔热器)的变形(隔热器)随着系统规模的扩大而成为一个关键挑战,导致严重的机械压力和可靠性问题。除了一定的尺寸外,常规包装技术无法有效地管理损耗页,因此有必要采取新的办法来减轻因玻璃隔热器多晶体内系统的可缩缩放而导致的曲折。为了应对这些相互交错的挑战,我们提议了一个热、隔热、隔热和有性能的设计框架,利用建筑和包装共同优化。拟议框架将表面和嵌入的芯片分离,以平衡相互矛盾的设计目标,确保最佳的性能、功率和结构可靠性之间的折。我们的实验表明,从设计框架中优化多晶体外结构结构与玻璃隔开来,以可伸缩性能结构的性能结构,以25.7%至低度的性能改进。
Article 7
Title@2025-07-23 (3): The New LLM Bottleneck: A Systems Perspective on Latent Attention and Mixture-of-Experts
Title: The New LLM Bottleneck: A Systems Perspective on Latent Attention and Mixture-of-Experts | Der neue LLM-Bottleneck: Eine Systemperspektive auf latente Aufmerksamkeit und Mixture-of-Experts | 新的LLM瓶颈:对长期关注和混合专家的系统观点 2507.15465v2 |
Authors (13): Sungmin Yun, Seonyong Park, Hwayong Nam, Younjoo Lee, Gunjun Lee, Kwanhee Kyung, Sangpyo Kim, Nam Sung Kim, Jongmin Kim, Hyungyo Kim, Juhwan Cho, Seungmin Baek, Jung Ho Ahn
Computational workloads composing traditional Transformer models are starkly bifurcated. Multi-Head Attention (MHA) is memory-bound, with low arithmetic intensity, while feedforward layers are compute-bound. This dichotomy has long motivated research into specialized hardware to mitigate the MHA bottleneck. This paper argues that recent architectural shifts, namely Multi-head Latent Attention (MLA) and Mixture-of-Experts (MoE), challenge the premise of specialized attention hardware. We make two key observations. First, the arithmetic intensity of MLA is over two orders of magnitude greater than that of MHA, shifting it close to a compute-bound regime well-suited for modern accelerators like GPUs. Second, by distributing MoE experts across a pool of accelerators, their arithmetic intensity can be tuned through batching to match that of the dense layers, creating a more balanced computational profile. These findings reveal a diminishing need for specialized attention hardware. The central challenge for next-generation Transformers is no longer accelerating a single memory-bound layer. Instead, the focus must shift to designing balanced systems with sufficient compute, memory capacity, memory bandwidth, and high-bandwidth interconnects to manage the diverse demands of large-scale models.
由传统变换模型构成的计算工作量是两极分明的。 多负责人注意(MHA) 具有记忆性, 低算术强度, 进进进进的层是任意的。 这种二分法长期推动对专门硬件的研究, 以缓解MHA的瓶颈。 本文认为, 最近的建筑变化, 即多头低端注意(MLA) 和混合专家(MOE) , 挑战了专门注意硬件的前提。 我们做了两点关键观察。 首先, MAL的算术强度超过MHA的两级, 将它移到适合像GPUs这样的现代加速器的计算系统。 其次, 通过将MOE专家分散到一个加速器的集合库中, 他们的计算强度可以通过批量来调整, 与稠密层相匹配, 创造更平衡的计算配置。 这些结论显示对专门注意硬件的需求正在减少。 下一代变换的中央挑战不会比MHA更加速一个单一的缩层, 将它更接近于一个精确的系统。 相反, 重点必须转换为设计一个高度的内存的高度的高度的高度的内存到设计。
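The memory-bound versus compute-bound split in the abstract can be made concrete with a back-of-the-envelope arithmetic-intensity estimate and a roofline bound. The sketch below uses simplified, assumed cost models: for a decode step of conventional MHA it counts only the K and V cache reads, and for a dense layer it counts only the weight reads. The accelerator numbers in the usage lines are hypothetical, and MLA is not modeled.

```python
def mha_decode_intensity(context_len, head_dim, bytes_per_elem=2):
    """FLOPs per byte for one decode step of one attention head (simplified)."""
    flops = 4 * context_len * head_dim                          # QK^T and AV, 2 FLOPs per MAC
    bytes_moved = 2 * context_len * head_dim * bytes_per_elem   # read K and V caches
    return flops / bytes_moved


def dense_layer_intensity(batch, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte for a dense layer when weight traffic dominates (simplified)."""
    flops = 2 * batch * d_in * d_out
    bytes_moved = d_in * d_out * bytes_per_elem                 # weights only
    return flops / bytes_moved


def attainable_flops(intensity, peak_flops, mem_bandwidth):
    """Roofline bound: limited by compute or by memory bandwidth times intensity."""
    return min(peak_flops, intensity * mem_bandwidth)


# Hypothetical accelerator: 1e15 FLOP/s peak, 3e12 B/s memory bandwidth.
mha_ai = mha_decode_intensity(context_len=4096, head_dim=128)   # 1.0 FLOP/byte at 16-bit
ffn_ai = dense_layer_intensity(batch=256, d_in=4096, d_out=4096)
mha_bound = attainable_flops(mha_ai, peak_flops=1e15, mem_bandwidth=3e12)
```

In this simplified model the MHA decode intensity is independent of context length and head size, so batching does not help, while the dense-layer intensity grows with batch size. That asymmetry is the dichotomy the abstract starts from.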
Article 8
Title@2025-07-23 (3): Neuromorphic Computing: A Theoretical Framework for Time, Space, and Energy Scaling
Title: Neuromorphic Computing: A Theoretical Framework for Time, Space, and Energy Scaling | Neuromorphes Rechnen: Ein theoretisches Framework für Zeit-, Raum- und Energieskalierung | 神经形态计算:时间、空间和能源规模的理论框架 2507.17886v1 |
Authors (1): James B Aimone
Neuromorphic computing (NMC) is increasingly viewed as a low-power alternative to conventional von Neumann architectures such as central processing units (CPUs) and graphics processing units (GPUs); however, the computational value proposition has been difficult to define precisely. Here, we explain how NMC should be seen as general-purpose and programmable even though it differs considerably from a conventional stored-program architecture. We show that the time and space scaling of NMC is equivalent to that of a theoretically infinite-processor conventional system; however, the energy scaling is significantly different. Specifically, the energy of conventional systems scales with absolute algorithm work, whereas the energy of neuromorphic systems scales with the derivative of algorithm state. The unique characteristics of NMC architectures make them well suited for different classes of algorithms than conventional multi-core systems like GPUs, which have been optimized for dense numerical applications such as linear algebra. In contrast, the unique characteristics of NMC make it ideally suited for scalable and sparse algorithms whose activity is proportional to an objective function, such as iterative optimization and large-scale sampling (e.g., Monte Carlo).
神经地貌计算(NMC)日益被视为常规的von Neumann结构,如中央处理器和图形处理器(GPUs)的低功率替代物,但计算价值建议却难以精确界定。这里我们解释NMC如何被视为一般用途和可编程,尽管它与传统的储存程序结构大不相同。我们表明NMC的时间和空间规模相当于理论上无限的处理器常规系统,但能源规模却大不相同。具体地说,常规系统规模的能量具有绝对算法工作,而神经形态系统规模的能量则具有算法状态的衍生物。NMC结构的独特性使它适合于与传统的多核心系统(如GPUs)不同的不同类别的算法,而GPUs已经优化地用于大量应用,例如线性代数。相比之下,NMC的独特性使得它适合于适合可缩放和稀少的算法,其活动与客观功能成正比,例如迭接式优化和大规模取样(如蒙特卡洛)。
Article 9
Title@2025-07-23 (3): Ironman: Accelerating Oblivious Transfer Extension for Privacy-Preserving AI with Near-Memory Processing
Title: Ironman: Accelerating Oblivious Transfer Extension for Privacy-Preserving AI with Near-Memory Processing | Ironman: Beschleunigen Sie Oblivious Transfer Erweiterung für Datenschutz-Erhaltung KI mit Nah-Memory-Verarbeitung | 铁人:加快隐私保护协会的 “ 加速不明显转让扩展 “ ,用于近中处理 2507.16391v2 |
Authors (9): Chenqi Lin, Kang Yang, Tianshi Xu, Ling Liang, Yufei Wang, Zhaohui Chen, Runsheng Wang, Mingyu Gao, Meng Li
With the wide application of machine learning (ML), privacy concerns arise with user data as they may contain sensitive information. Privacy-preserving ML (PPML) based on cryptographic primitives has emerged as a promising solution in which an ML model is directly computed on the encrypted data to provide a formal privacy guarantee. However, PPML frameworks heavily rely on the oblivious transfer (OT) primitive to compute nonlinear functions. OT mainly involves the computation of single-point correlated OT (SPCOT) and learning parity with noise (LPN) operations. As OT is still computed extensively on general-purpose CPUs, it becomes the latency bottleneck of modern PPML frameworks. In this paper, we propose a novel OT accelerator, dubbed Ironman, to significantly increase the efficiency of OT and the overall PPML framework. We observe that SPCOT is compute-bound, and thus propose a hardware-friendly SPCOT algorithm with a customized accelerator to improve SPCOT computation throughput. In contrast, LPN is memory-bandwidth-bound due to irregular memory access patterns. Hence, we further leverage the near-memory processing (NMP) architecture equipped with memory-side cache and index sorting to improve effective memory bandwidth. With extensive experiments, we demonstrate that Ironman achieves a 39.2-237.4x improvement in OT throughput across different NMP configurations compared to the full-thread CPU implementation. For different PPML frameworks, Ironman demonstrates a 2.1-3.4x reduction in end-to-end latency for both CNN and Transformer models.
随着机器学习(ML)的广泛应用,用户数据可能包含敏感信息,由此引发隐私问题。基于密码学原语的隐私保护机器学习(PPML)是一种很有前景的解决方案:直接在加密数据上计算 ML 模型,从而提供形式化的隐私保证。然而,PPML 框架严重依赖不经意传输(OT)原语来计算非线性函数。OT 主要涉及单点相关 OT(SPCOT)和带噪声奇偶校验学习(LPN)运算。由于 OT 仍主要在通用 CPU 上计算,它成为现代 PPML 框架的延迟瓶颈。本文提出一种新的 OT 加速器 Ironman,以显著提高 OT 以及整个 PPML 框架的效率。我们观察到 SPCOT 受计算限制,因此提出一种硬件友好的 SPCOT 算法和定制加速器,以提高 SPCOT 的计算吞吐量。相比之下,LPN 由于不规则的内存访问模式而受内存带宽限制。因此,我们进一步利用配备内存侧缓存和索引排序的近内存处理(NMP)架构来提高有效内存带宽。大量实验表明,与全线程 CPU 实现相比,Ironman 在不同 NMP 配置下的 OT 吞吐量提高 39.2 至 237.4 倍。对于不同的 PPML 框架,Ironman 使 CNN 和 Transformer 模型的端到端延迟降低 2.1 至 3.4 倍。
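The irregular memory accesses attributed to LPN come from a sparse linear map: each output is the XOR of a handful of input entries at scattered indices, plus a noise term. The toy sketch below works on single bits with two hand-picked rows; it also shows, purely as an illustration, how sorting the indices within a row leaves the result unchanged while making consecutive accesses land in the same memory line more often. The function names and the "line switch" metric are assumptions of this sketch and do not describe Ironman's actual index-sorting hardware.

```python
def lpn_encode(rows, x, noise):
    """y[i] = noise[i] XOR (XOR of x[j] for every index j in rows[i])."""
    out = []
    for indices, e in zip(rows, noise):
        acc = e
        for j in indices:
            acc ^= x[j]
        out.append(acc)
    return out


def line_switches(rows, line_size):
    """Count how often two consecutive accesses fall in different memory lines."""
    lines = [j // line_size for indices in rows for j in indices]
    return sum(1 for a, b in zip(lines, lines[1:]) if a != b)


# Hypothetical tiny instance: 12 input bits, 2 outputs, 4 indices per row.
x = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
noise = [0, 1]
rows = [[9, 0, 8, 1], [2, 11, 3, 10]]
sorted_rows = [sorted(r) for r in rows]

y = lpn_encode(rows, x, noise)
y_sorted = lpn_encode(sorted_rows, x, noise)        # same result, XOR is order-independent
before = line_switches(rows, line_size=4)           # 6 switches between lines
after = line_switches(sorted_rows, line_size=4)     # 3 switches between lines
```

Real LPN instances operate on wide blocks rather than single bits and on far larger vectors, so the locality effect matters much more than this toy suggests.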
Article 10
Title@2025-07-23 (3): Efficient Precision-Scalable Hardware for Microscaling (MX) Processing in Robotics Learning
Title: Efficient Precision-Scalable Hardware for Microscaling (MX) Processing in Robotics Learning | Effiziente Präzisionsskalierbare Hardware für die Mikroskalierung (MX) Verarbeitung in der Robotik | 用于机器人学习微缩缩缩(MX)处理的高效精密缩放硬件 2505.22404v2 |
Authors (5): Stef Cuyckens, Xiaoling Yi, Nitish Satya Murthy, Chao Fang, Marian Verhelst
Autonomous robots require efficient on-device learning to adapt to new environments without cloud dependency. For this edge training, Microscaling (MX) data types offer a promising solution by combining integer and floating-point representations with shared exponents, reducing energy consumption while maintaining accuracy. However, the state-of-the-art continuous learning processor, namely Dacapo, faces limitations with its MXINT-only support and inefficient vector-based grouping during backpropagation. In this paper, we present, to the best of our knowledge, the first work that addresses these limitations with two key innovations: (1) a precision-scalable arithmetic unit that supports all six MX data types by exploiting sub-word parallelism and unified integer and floating-point processing; and (2) support for square shared exponent groups to enable efficient weight handling during backpropagation, removing storage redundancy and quantization overhead. We evaluate our design against Dacapo under iso-peak-throughput on four robotics workloads in TSMC 16nm FinFET technology at 400MHz, reaching a 51% lower memory footprint, and 4x higher effective training throughput, while achieving comparable energy efficiency, enabling efficient robotics continual learning at the edge.
自主机器人需要高效的在线学习,以适应没有云层依赖性的新环境。对于这一边缘培训,微缩缩缩放(MX)数据类型提供了有希望的解决办法,将整数和浮点代表与共享推手相结合,从而降低能源消耗,同时保持准确性。然而,最先进的连续学习处理器Dacapo,即Dacapo,由于它只提供MXINT支持,且在后向反射期间基于病媒的分组效率低下,因此面临限制。在本文中,我们根据我们的知识,介绍了第一个以两种关键创新办法解决这些限制的工作:(1) 一个精确的可缩放算术单位,该单位通过利用子词平行法以及统一的整数和浮动点处理来支持所有六个MX数据类型;(2) 支持平方共享的组合,以便在后向调整期间能够高效地处理重量,消除储存冗余和四等离式间接费用。我们对照Dacappoo的软件设计,在TSMC 16nm FinFET技术在400MHz的4个机器人工作量,达到51%的低水平的存储率,同时实现具有可比性的更高水平的学习效率。
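A small sketch may help show why square shared-exponent groups simplify backpropagation. It quantizes a tile to 8-bit integer mantissas with one shared power-of-two scale, then checks that the transposed tile (as needed for the backward pass) reuses the same exponent and the transposed mantissas. This is a simplified illustration: the function names, the 8x8 tile size, and the scaling rule are assumptions and do not follow the MX specification or the paper's hardware exactly.

```python
import numpy as np

def shared_exp_quantize(block):
    """Quantize a block to int8 mantissas with one shared power-of-two scale."""
    block = np.asarray(block, dtype=np.float64)
    max_abs = np.max(np.abs(block))
    if max_abs == 0.0:
        return np.zeros(block.shape, dtype=np.int8), 0
    # Smallest exponent such that max_abs / 2**exp fits in [-127, 127].
    shared_exp = int(np.ceil(np.log2(max_abs / 127.0)))
    mant = np.clip(np.round(block / 2.0**shared_exp), -127, 127).astype(np.int8)
    return mant, shared_exp

def shared_exp_dequantize(mant, shared_exp):
    """Reconstruct approximate real values from mantissas and the shared exponent."""
    return mant.astype(np.float64) * 2.0**shared_exp

rng = np.random.default_rng(0)
tile = rng.normal(size=(8, 8))          # hypothetical square weight group

mant, exp = shared_exp_quantize(tile)
mant_t, exp_t = shared_exp_quantize(tile.T)
# With a square group, the transpose needs no requantization:
# mant_t equals mant.T and exp_t equals exp.
```

With vector-shaped groups (for example one exponent per row segment), transposing the weights regroups the elements, so a second quantized copy or a requantization step would be needed.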
Article 11
Title@2025-07-23 (3): Hardware-Efficient Photonic Tensor Core: Accelerating Deep Neural Networks with Structured Compression
Title: Hardware-Efficient Photonic Tensor Core: Accelerating Deep Neural Networks with Structured Compression | Hardware-Effizient Photonic Tensor Core: Beschleunigen von tiefen neuralen Netzwerken mit strukturierter Kompression | 硬件-高效光学光学时标核心:有结构压缩的加速深神经网络 2502.01670v2 |
Authors (6): Shupeng Ning, Hanqing Zhu, Chenghao Feng, Jiaqi Gu, David Z. Pan, Ray T. Chen
The rapid growth in computing demands, particularly driven by artificial intelligence applications, has begun to exceed the capabilities of traditional electronic hardware. Optical computing offers a promising alternative due to its parallelism, high computational speed, and low power consumption. However, existing photonic integrated circuits are constrained by large footprints, costly electro-optical interfaces, and complex control mechanisms, limiting the practical scalability of optical neural networks (ONNs). To address these limitations, we introduce a block-circulant photonic tensor core for a structure-compressed optical neural network (StrC-ONN) architecture. The structured compression technique substantially reduces both model complexity and hardware resources without sacrificing the versatility of neural networks, and achieves accuracy comparable to uncompressed models. Additionally, we propose a hardware-aware training framework to compensate for on-chip nonidealities to improve model robustness and accuracy. Experimental validation through image processing and classification tasks demonstrates that our StrC-ONN achieves a reduction in trainable parameters of up to 74.91%, while still maintaining competitive accuracy levels. Performance analyses further indicate that this hardware-software co-design approach is expected to yield a 3.56 times improvement in power efficiency. By reducing both hardware requirements and control complexity across multiple dimensions, this work explores a new pathway toward practical and scalable ONNs, highlighting a promising route to address future computational efficiency challenges.
计算机需求的快速增长,特别是由人工智能应用驱动的需求的快速增长,已开始超过传统电子硬件的能力。光化计算由于其平行性、高计算速度和低电耗,提供了一个有希望的替代方案。然而,现有的光电综合电路受到大脚印、昂贵的电光界面和复杂的控制机制的限制,限制了光导神经网络的实际缩放能力。为解决这些限制,我们为结构压抑的光电神经网络(StrC-ONN)架构引入了块-电环光电点核心。结构化压缩技术大大降低了模型复杂性和硬件资源,同时又不牺牲神经网络的多功能,并实现了与未压缩模型相当的准确性。此外,我们提议了一个硬件认知培训框架,以补偿光电网络上的不理想性,提高模型的稳健性和准确性。我们通过图像处理和分类任务的实验性验证表明,我们SstrC-N公司在可培训的参数上降低了74.91%,同时保持了竞争性的准确性水平。绩效分析还进一步表明,在提高成本方面,预期的方法是提高成本的路径上,同时要求。
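To make the block-circulant compression concrete, the sketch below stores only the first column of each k x k circulant block and computes the matrix-vector product with FFT-based circular convolution. It is a generic numerical illustration of block-circulant weights, not the photonic implementation; the function names and the sizes (a 2 x 3 grid of 4 x 4 blocks) are assumptions.

```python
import numpy as np

def circulant(first_col):
    """Dense k x k circulant matrix whose first column is first_col."""
    k = len(first_col)
    return np.stack([np.roll(first_col, s) for s in range(k)], axis=1)

def block_circulant_matvec(blocks, x):
    """Compute y = W x for W made of a p x q grid of k x k circulant blocks.

    blocks[i, j] holds only the k values of the first column of block (i, j).
    """
    p, q, k = blocks.shape
    xs = x.reshape(q, k)
    # A circulant block times a vector is a circular convolution,
    # which becomes an elementwise product in the Fourier domain.
    prod = np.einsum("pqk,qk->pk", np.fft.fft(blocks, axis=2), np.fft.fft(xs, axis=1))
    return np.fft.ifft(prod, axis=1).real.reshape(p * k)

rng = np.random.default_rng(0)
blocks = rng.normal(size=(2, 3, 4))     # p=2, q=3, k=4: 24 stored parameters
x = rng.normal(size=12)

# Dense reference with 8 x 12 = 96 entries, built only for checking.
W = np.block([[circulant(blocks[i, j]) for j in range(3)] for i in range(2)])
y = block_circulant_matvec(blocks, x)
```

Each block needs k parameters instead of k squared, so the stored weights shrink by a factor of k (here 24 values instead of 96).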
Article 12
Title@2025-07-23 (3): Enabling Efficient Transaction Processing on CXL-Based Memory Sharing
Title: Enabling Efficient Transaction Processing on CXL-Based Memory Sharing | Effiziente Transaktionsverarbeitung auf CXL-Basis ermöglichen | 促进基于CXL的记忆共享方面的高效率交易处理 2502.11046v2 |
Authors (8): Zhao Wang, Yiqi Chen, Cong Li, Dimin Niu, Tianchan Guan, Zhaoyang Du, Xingda Wei, Guangyu Sun
Transaction processing systems are the crux of modern data-center applications, yet current multi-node systems are slow due to network overheads. This paper advocates for Compute Express Link (CXL) as a network alternative, which enables low-latency and cache-coherent shared memory accesses. However, directly adopting standard CXL primitives leads to performance degradation due to the high cost of maintaining cross-node cache coherence. To address the CXL challenges, this paper introduces CtXnL, a software-hardware co-designed system that implements a novel hybrid coherence primitive tailored to the loosely coherent nature of transactional data. The core innovation of CtXnL is empowering transaction system developers with the ability to selectively achieve data coherence. Our evaluations on OLTP workloads demonstrate that CtXnL enhances performance, outperforming current network-based systems and achieving up to 2.08x greater throughput than vanilla CXL memory sharing architectures across universal transaction processing policies.
然而,直接采用标准的 CXL 原始系统会导致性能退化,因为维护交叉节点缓存一致性的成本很高。为了应对 CXL 的挑战,本文件引入了CtXnL 软件硬件联合设计系统CtXnL,这是一个软件硬件共同设计系统,用于实施一种新颖的混合一致性系统,它原始地适应交易数据的松散一致性性质。 CtXnL 的核心创新是赋予交易系统开发者能力,使其能够有选择地实现数据一致性。我们对 OLTP 工作量的评估表明, CtXnL 在整个通用交易处理政策中,在使用最多2.08x 的超过香草 CXL 存储共享结构的情况下,提高性能,优于当前基于网络的系统,并实现高达2.08x的吞吐量。
Article 13
Title@2025-07-22 (2): MTU: The Multifunction Tree Unit in zkSpeed for Accelerating HyperPlonk
Title: MTU: The Multifunction Tree Unit in zkSpeed for Accelerating HyperPlonk | MTU: Multifunktionsbaum in zkSpeed zur Beschleunigung von HyperPlonk | MTU: 以zkSpeed为单位的多功能树单位, 用于加速超隆 2507.16793v1 |
Authors (7): Jianqiao Mo, Alhad Daftardar, Joey Ah-kiow, Kaiyue Guo, Benedikt Bünz, Siddharth Garg, Brandon Reagen
Zero-Knowledge Proofs (ZKPs) are critical for privacy preservation and verifiable computation. Many ZKPs rely on kernels such as the SumCheck protocol and Merkle Tree commitments, which enable their security properties. These kernels exhibit balanced binary tree computational patterns, which enable efficient hardware acceleration. Prior work has investigated accelerating these kernels as part of an overarching ZKP protocol; however, a focused study of how to best exploit the underlying tree pattern for hardware efficiency remains limited. We conduct a systematic evaluation of these tree-based workloads under different traversal strategies, analyzing performance on multi-threaded CPUs and a hardware accelerator, the Multifunction Tree Unit (MTU). We introduce a hardware-friendly Hybrid Traversal for binary trees that improves parallelism and scalability while significantly reducing memory traffic on hardware. Our results show that MTU achieves up to 1478$\times$ speedup over CPU at DDR-level bandwidth and that our hybrid traversal outperforms standalone approaches by up to 3$\times$. These findings offer practical guidance for designing efficient hardware accelerators for ZKP workloads with binary tree structures.
对隐私保护和可核查计算而言,零知识验证(ZKPs)对隐私保护和可核查计算至关重要。 许多 ZKPs 依靠Sumcheck 协议和Merkle树承诺等内核,这些内核能够使其安全性能得以实现。 这些内核展示了平衡的双树计算模式,能够高效加速硬件的加速。 先前的工作调查了加速这些内核作为总体 ZKP 协议的一部分; 然而,关于如何最佳利用基本树型模式提高硬件效率的集中研究仍然有限。 我们系统评估了不同行进战略下的这些树型工作量,分析了多读CPU和硬件加速器(多功能树股)的性能。 我们为双树引入了一个硬件友好混合轨迹,既能改善平行性和可缩放性,又大大减少硬件的记忆流量。 我们的结果显示, MOTU在解甲返航级带宽度上达到1,478美元的速度比CPU高出1美元。 我们的混合车道超越了独立路段,在多读CPS和多功能加速器加速器加速器的操作上,最多3美元。
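The traversal trade-off mentioned in the abstract can be illustrated with a Merkle root computed in two standard orders. The abstract does not spell out the Hybrid Traversal, so this sketch only contrasts a level-by-level (breadth-first) reduction with a streaming (depth-first) reduction; the function names are hypothetical and the leaf count is assumed to be a power of two.

```python
import hashlib

def _hash_pair(left, right):
    """Parent digest of two child digests."""
    return hashlib.sha256(left + right).digest()

def merkle_root_level_order(leaves):
    """Breadth-first: reduce one full level at a time.

    Every pair in a level is independent (high parallelism), but the
    whole level must be buffered, which costs memory traffic.
    """
    level = list(leaves)
    while len(level) > 1:
        level = [_hash_pair(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

def merkle_root_streaming(leaves):
    """Depth-first: consume leaves in order, merging equal-height subtrees.

    Only about log2(n) partial digests are live at once, but merges
    along one path are sequentially dependent.
    """
    stack = []  # entries are (height, digest)
    for leaf in leaves:
        node = (0, leaf)
        while stack and stack[-1][0] == node[0]:
            height, left = stack.pop()
            node = (height + 1, _hash_pair(left, node[1]))
        stack.append(node)
    if len(stack) != 1:
        raise ValueError("leaf count must be a power of two")
    return stack[0][1]

leaves = [hashlib.sha256(bytes([i])).digest() for i in range(8)]
root_bfs = merkle_root_level_order(leaves)
root_dfs = merkle_root_streaming(leaves)
```

Both orders produce the same root; they differ only in how much parallel work and how much buffering each step exposes, which is the axis a combined traversal would balance.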
Article 14
Title@2025-07-22 (2): Custom Algorithm-based Fault Tolerance for Attention Layers in Transformers
Title: Custom Algorithm-based Fault Tolerance for Attention Layers in Transformers | Benutzerdefinierte Algorithmen-basierte Fehlertoleranz für Aufmerksamkeitsschichten in Transformatoren | 自定义基于 ALgorithm 的对变换器中注意层的不宽容 2507.16676v1 |
Authors (3): Vasileios Titopoulos, Kosmas Alexandridis, Giorgos Dimitrakopoulos
Transformers and large language models (LLMs), powered by the attention mechanism, have transformed numerous AI applications, driving the need for specialized hardware accelerators. A major challenge in these accelerators is efficiently detecting errors caused by random hardware faults. Traditional algorithm-based fault tolerance (ABFT) techniques verify individual matrix multiplications but fall short in handling the full attention mechanism, particularly due to intermediate softmax normalization. This work proposes Flash-ABFT, a novel method that computes an online checksum across the entire three-matrix product of the query, key, and value matrices of an attention layer, including the softmax operation, with a single check. This approach significantly reduces overhead by eliminating redundant checks while maintaining high fault-detection accuracy. Experimental results demonstrate that Flash-ABFT incurs only 5.3% hardware area overhead and less than 1.9% energy overhead, making it a cost-effective and robust solution for error detection in attention accelerators.
由关注机制驱动的变换器和大型语言模型(LLMS)已经改变了许多AI应用程序,从而需要专门的硬件加速器。这些加速器面临的一个主要挑战是有效检测随机硬件故障造成的错误。传统的基于算法的缺陷容忍(ABFT)技术可以核查单个矩阵的乘法,但在处理全关注机制方面却做得不够,特别是由于中间软通缩的正常化。这项工作提出了Flash-ABFT,这是一个新颖的方法,它计算出一个包括软式加速器操作在内的关注层的整个三个矩阵产品(查询、关键和价值矩阵)的在线校验和,包括一个软式检查。这种方法通过消除冗余检查,同时保持高度的缺陷探测精确度,大大减少了间接费用。实验结果表明,Flaf-ABFT只产生5.3%的硬件管理区,低于1.9%的能源管理费,因此它是一个成本有效和稳健的解决方案,用于在关注加速器中检测错误。
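The checksum idea behind ABFT can be sketched numerically. For an attention output O = P V with P = softmax(Q K^T / sqrt(d)), the sum of all elements of O equals the dot product of the column sums of P with the row sums of V, so a single scalar comparison can flag a corrupted output. This is a simplified software illustration of a one-check scheme, not the Flash-ABFT hardware algorithm; the function names and the tolerance are assumptions.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention; also returns the softmax matrix P."""
    S = Q @ K.T / np.sqrt(Q.shape[1])
    P = np.exp(S - S.max(axis=1, keepdims=True))
    P /= P.sum(axis=1, keepdims=True)
    return P @ V, P

def predicted_checksum(P, V):
    """Predicted sum of all output elements: (column sums of P) . (row sums of V)."""
    return P.sum(axis=0) @ V.sum(axis=1)

def checksum_ok(O, P, V, tol=1e-9):
    """Single check covering the whole P V product."""
    return abs(O.sum() - predicted_checksum(P, V)) <= tol

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
O, P = attention(Q, K, V)

O_faulty = O.copy()
O_faulty[1, 2] += 0.5   # inject a hypothetical fault into one output element
```

Here `checksum_ok(O, P, V)` holds for the fault-free output and fails for `O_faulty`. In hardware the predicted checksum would be accumulated alongside the main computation rather than recomputed from a stored P.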
Article 15
Title@2025-07-22 (2): GCC: A 3DGS Inference Architecture with Gaussian-Wise and Cross-Stage Conditional Processing
Title: GCC: A 3DGS Inference Architecture with Gaussian-Wise and Cross-Stage Conditional Processing | GCC: Eine 3DGS-Inferenzarchitektur mit Gaußian-Wise- und Cross-Stage-Bedingung | 海合会:3DGS推理结构,带有高西-怀斯和跨标准条件处理 2507.15300v2 |
Authors (9): Minnan Pei, Gang Li, Junwen Si, Zeyu Zhu, Zitao Mo, Peisong Wang, Zhuoran Song, Xiaoyao Liang, Jian Cheng
3D Gaussian Splatting (3DGS) has emerged as a leading neural rendering technique for high-fidelity view synthesis, prompting the development of dedicated 3DGS accelerators for mobile applications. Through in-depth analysis, we identify two major limitations in the conventional decoupled preprocessing-rendering dataflow adopted by existing accelerators: 1) a significant portion of preprocessed Gaussians are not used in rendering, and 2) the same Gaussian gets repeatedly loaded across different tile renderings, resulting in substantial computational and data movement overhead. To address these issues, we propose GCC, a novel accelerator designed for fast and energy-efficient 3DGS inference. At the dataflow level, GCC introduces: 1) cross-stage conditional processing, which interleaves preprocessing and rendering to dynamically skip unnecessary Gaussian preprocessing; and 2) Gaussian-wise rendering, ensuring that all rendering operations for a given Gaussian are completed before moving to the next, thereby eliminating duplicated Gaussian loading. We also propose an alpha-based boundary identification method to derive compact and accurate Gaussian regions, thereby reducing rendering costs. We implement our GCC accelerator in 28nm technology. Extensive experiments demonstrate that GCC significantly outperforms the state-of-the-art 3DGS inference accelerator, GSCore, in both performance and energy efficiency.
3D Gaussian Splateting (3DGS) 已成为高不忠实视图合成的主要神经化技术,促使开发了专用的3DGS加速器,用于移动应用。通过深入分析,我们确定了现有加速器采用的传统脱钩预处理前复制数据流的两大限制:(1) 很大一部分预处理前高标不用于制作;(2) 同一高标反复装入不同的瓷块,导致大量计算和数据移动间接费用。为了解决这些问题,我们建议海合会为快速和节能3DGS推断设计一个新的加速器。在数据流层面,海合会提出:(1) 跨阶段有条件处理,将预处理前处理和使不必要高标预处理动态跳过;(2) 高调演算,确保特定高标的所有操作在移动到下一个时完成,从而消除重复的戈斯装载。我们还提议在海合会3DGS上大幅降低成本。我们还提议在海合会3DGA中大幅降低成本。
Article 16
Title@2025-07-22 (2): Augmenting Von Neumann’s Architecture for an Intelligent Future
Title: Augmenting Von Neumann’s Architecture for an Intelligent Future | Augmenting von Neumanns Architektur für eine intelligente Zukunft | 增强冯·诺伊曼的智慧未来建筑 2507.16628v1 |
Authors (2): Rajpreet Singh, Vidhi Kothari
This work presents a novel computer architecture that extends the Von Neumann model with a dedicated Reasoning Unit (RU) to enable native artificial general intelligence capabilities. The RU functions as a specialized co-processor that executes symbolic inference, multi-agent coordination, and hybrid symbolic-neural computation as fundamental architectural primitives. This hardware-embedded approach allows autonomous agents to perform goal-directed planning, dynamic knowledge manipulation, and introspective reasoning directly within the computational substrate at system scale. The architecture incorporates a reasoning-specific instruction set architecture, parallel symbolic processing pipelines, agent-aware kernel abstractions, and a unified memory hierarchy that seamlessly integrates cognitive and numerical workloads. Through systematic co-design across hardware, operating system, and agent runtime layers, this architecture establishes a computational foundation where reasoning, learning, and adaptation emerge as intrinsic execution properties rather than software abstractions, potentially enabling the development of general-purpose intelligent machines.
这项工作提出了一个新的计算机结构,将冯纽曼模型与专门的理性单元(RU)相扩展,以建立本地人工一般智能能力。RU作为一个专业的共同处理器,作为执行象征性推论、多剂协调以及作为基本建筑原始材料的混合象征性神经计算的共同处理器。这一硬件组合法使自主代理商能够在系统规模的计算基数中直接进行目标导向的规划、动态知识操作和反省推理。该结构包含一个针对具体推理的教学结构、平行的象征性处理管道、感知内核抽取以及一个能够无缝整合认知和数字工作量的统一的记忆级。通过系统共同设计跨硬件、操作系统和代理运行时间层,这一结构建立了一个计算基础,使推理、学习和适应成为内在执行特性,而不是软件抽象,从而有可能促进通用智能机器的发展。
Article 17
Title@2025-07-22 (2): Optimization of DNN-based HSI Segmentation FPGA-based SoC for ADS: A Practical Approach
Title: Optimization of DNN-based HSI Segmentation FPGA-based SoC for ADS: A Practical Approach | Optimierung der DNN-basierten HSI-Segmentierung FPGA-basierten SoC für ADS: Ein praktischer Ansatz | 优化基于DNN 的基于DNNHSIHSI的ADS的基于FPGA的FPGA SoC分类:一种实用办法 2507.16556v1 |
Authors (3): Jon Gutiérrez-Zaballa, Koldo Basterretxea, Javier Echanobe
The use of hyperspectral imaging (HSI) for autonomous navigation is a promising research field aimed at improving the accuracy and robustness of detection, tracking, and scene understanding systems based on vision sensors. Combining advanced computer algorithms, such as DNNs, with small-size snapshot HSI cameras enhances the reliability of these systems. HSI overcomes intrinsic limitations of greyscale and RGB imaging in depicting physical properties of targets, particularly regarding spectral reflectance and metamerism. Despite promising results in HSI-based vision developments, safety-critical systems like autonomous driving systems (ADS) demand strict constraints on latency, resource consumption, and security, motivating the shift of ML workloads to edge platforms. This involves a thorough software/hardware co-design scheme to distribute and optimize the tasks efficiently among the limited resources of computing platforms. With respect to inference, the over-parameterized nature of DNNs poses significant computational challenges for real-time on-the-edge deployment. In addition, the intensive data preprocessing required by HSI, which is frequently overlooked, must be carefully managed in terms of memory arrangement and inter-task communication to enable an efficient integrated pipeline design on a SoC. This work presents a set of optimization techniques for the practical co-design of a DNN-based HSI segmentation processor deployed on an FPGA-based SoC targeted at ADS, including key optimizations such as functional software/hardware task distribution, hardware-aware preprocessing, ML model compression, and a complete pipelined deployment. Applied compression techniques significantly reduce the complexity of the designed DNN to 24.34% of the original operations and to 1.02% of the original number of parameters, achieving a 2.86x speed-up in the inference task without noticeable degradation of the segmentation accuracy.
将高级计算机算法(如DNNS)与小型快照HSI照相机相结合,可以提高这些系统的可靠性。HSI克服了灰度和RGB成像在描述目标物理属性方面的内在局限性,特别是在光谱反射和元化方面。尽管基于HSI的愿景开发取得了可喜成果,但像ADS这样的安全关键系统要求严格限制延缓度、资源消耗和安全,促使ML工作量转移到边缘平台。这涉及一个彻底的软件/硬件共同设计计划,以便在计算机平台的有限资源中高效地分配和优化任务。关于推论,DNNNRS的超度性能在描述目标的物理属性方面,特别是在光谱反射和元化方面,构成了巨大的计算挑战。此外,Hawa的预处理模型要求(经常被忽视)要求对延缩、资源消耗和安全进行严格的限制,促使ML工作量转移到边缘平台上的工作量。这涉及一个彻底的软件/硬件联合设计计划,在计算机平台上实现一个高效的通用的管道设计,包括高清晰的硬度的硬化部分。
Article 18
Title@2025-07-22 (2): Balancing Robustness and Efficiency in Embedded DNNs Through Activation Function Selection
Title: Balancing Robustness and Efficiency in Embedded DNNs Through Activation Function Selection | Ausbalancierung von Robustheit und Effizienz in eingebetteten DNNs durch Aktivierungsfunktionsauswahl | 通过启动职能选择,在嵌入的DNN 中平衡稳健和效率 2504.05119v2 |
Authors (3): Jon Gutiérrez-Zaballa, Koldo Basterretxea, Javier Echanobe
Machine learning-based embedded systems for safety-critical applications, such as aerospace and autonomous driving, must be robust to perturbations caused by soft errors. As transistor geometries shrink and voltages decrease, modern electronic devices become more susceptible to background radiation, increasing the concern about failures produced by soft errors. The resilience of deep neural networks (DNNs) to these errors depends not only on target device technology but also on model structure and the numerical representation and arithmetic precision of their parameters. Compression techniques like pruning and quantization, used to reduce memory footprint and computational complexity, alter both model structure and representation, affecting soft error robustness. In this regard, although often overlooked, the choice of activation functions (AFs) impacts not only accuracy and trainability but also compressibility and error resilience. This paper explores the use of bounded AFs to enhance robustness against parameter perturbations, while evaluating their effects on model accuracy, compressibility, and computational load with a technology-agnostic approach. We focus on encoder-decoder convolutional models developed for semantic segmentation of hyperspectral images with application to autonomous driving systems. Experiments are conducted on an AMD-Xilinx KV260 SoM.
用于安全关键应用的基于机械学习的嵌入系统,如航空航天和自主驾驶等,必须能够抵御软差错造成的扰动。随着晶体管地貌缩缩和电压下降,现代电子装置更容易受到本底辐射的影响,增加了对软差错失的关注。深神经网络(DNN)对这些错误的抗御能力不仅取决于目标装置技术,而且还取决于模型结构及其参数的数值表示和计算精确度。压缩技术,如修剪和量化,用于减少记忆足迹和计算复杂性,改变模型结构和代表性,影响软差错稳健性。在这方面,尽管往往被忽视,但激活功能的选择不仅影响准确性和可训练性,而且影响易缩和错错复原力。本文探讨了如何使用捆绑的AFS(D)来增强对参数扰动的稳健性,同时评价其对模型准确性、可压缩性和计算性的影响,以及用技术-认知法方法计算负荷。我们注重为超光谱光谱图像的解变变变式模型,我们注重为超光谱图像的自动分析系统。
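A short experiment shows why a bounded activation can limit the damage of a soft error in a parameter. The sketch flips one exponent bit of a float32 weight and compares how an unbounded ReLU and a bounded ReLU6 pass the resulting value on. ReLU6, the chosen bit, and the helper names are illustrative assumptions; the abstract does not state which bounded AFs or fault model the paper uses.

```python
import numpy as np

def relu(x):
    """Unbounded activation: passes arbitrarily large values through."""
    return np.maximum(x, 0.0)

def relu6(x):
    """Bounded activation: output is clamped to [0, 6]."""
    return np.clip(x, 0.0, 6.0)

def flip_bit_float32(value, bit):
    """Return value with one bit of its IEEE-754 float32 encoding flipped."""
    raw = np.array([value], dtype=np.float32).view(np.uint32)
    raw ^= np.uint32(1 << bit)
    return raw.view(np.float32)[0]

weight = 0.5
activation_in = 1.0

# Bit 30 is the most significant exponent bit of a float32.
corrupted = flip_bit_float32(weight, 30)

unbounded_out = relu(corrupted * activation_in)
bounded_out = relu6(corrupted * activation_in)
```

For this weight the flipped value is on the order of 1e38; the ReLU output carries that magnitude into later layers, while the ReLU6 output saturates at 6.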
Article 19
Title@2025-07-22 (2): ApproxGNN: A Pretrained GNN for Parameter Prediction in Design Space Exploration for Approximate Computing
Title: ApproxGNN: A Pretrained GNN for Parameter Prediction in Design Space Exploration for Approximate Computing | ApproxGNN: Ein vortrainiertes GNN für Parameter-Vorhersage in der Entwurfsraumexploration für approximatives Rechnen | ApproxGNN: 用于近似计算设计空间探索中参数预测的预训练GNN 2507.16379v1 |
Authors (2): Ondrej Vlcek, Vojtech Mrazek
Approximate computing offers promising energy efficiency benefits for error-tolerant applications, but discovering optimal approximations requires extensive design space exploration (DSE). Predicting the accuracy of circuits composed of approximate components without performing complete synthesis remains a challenging problem. Current machine learning approaches used to automate this task require retraining for each new circuit configuration, making them computationally expensive and time-consuming. This paper presents ApproxGNN, a construction methodology for a pre-trained graph neural network model predicting QoR and HW cost of approximate accelerators employing approximate adders from a library. This approach is applicable in DSE for assignment of approximate components to operations in an accelerator. Our approach introduces novel component feature extraction based on learned embeddings rather than traditional error metrics, enabling improved transferability to unseen circuits. ApproxGNN models can be trained with a small number of approximate components, support transfer to multiple prediction tasks, utilize precomputed embeddings for efficiency, and significantly improve accuracy of the prediction of approximation error. On a set of image convolutional filters, our experimental results demonstrate that the proposed embeddings improve prediction accuracy (mean square error) by 50% compared to conventional methods. Furthermore, the overall prediction accuracy is 30% better than statistical machine learning approaches without fine-tuning and 54% better with fast fine-tuning.
近距离计算为错误容忍应用提供了大有希望的能源效率效益,但发现最佳近似需要广泛的设计空间探索。 预测由近似部件构成的电路的准确性而不进行完整的合成仍是一个棘手的问题。 当前用于将这项任务自动化的机器学习方法需要为每个新电路配置进行再培训,使其在计算上成本昂贵和耗时。 本文展示了ApproxGNNN, 用于预培训的图形神经网络模型的建筑方法,预测使用图书馆近似加速器的QoR和HW近似加速器的成本。 这种方法适用于DSE, 用于在加速器中为操作分配近似部件。 我们的方法采用基于学习嵌入而非传统误差指标的新颖的部件提取功能,从而能够改进向隐蔽电路的可转换性。 ApproxGNNNM 模型可以用少量的近似部件进行培训,支持向多个预测任务转移,使用预编集的嵌入式嵌入器提高效率,并大大提高近似误差的准确性。 在一套图像进缩过滤器中,我们的实验结果显示,以学习率为54%的精确度的方法比快速的精确度改进了50。
Article 20
Title@2025-07-22 (2): Hourglass Sorting: A novel parallel sorting algorithm and its implementation
Title: Hourglass Sorting: A novel parallel sorting algorithm and its implementation | Sanduhr-Sortierung: Ein neuartiger paralleler Sortieralgorithmus und seine Implementierung | 沙漏分类:新颖的平行分类算法及其实施 2507.16326v1 |
Authors (2): Daniel Bascones, Borja Morcillo
Sorting is one of the fundamental problems in computer science. Playing a role in many processes, its complexity is bounded below by $\mathcal{O}(n\log{n})$ when executing on a sequential machine. This limit can be brought down to sub-linear times thanks to parallelization techniques that increase the number of comparisons done in parallel. This, however, increases the cost of implementation, which limits the application of such techniques. Moreover, as the size of the arrays increases, a bottleneck arises in moving the vast quantities of data required at the input, and generated at the output, of such a sorter. This might impose time requirements much stricter than those of the sorting itself. In this paper, a novel parallel sorter is proposed for the specific case where the input is parallel, but the output is serial. The design is then implemented and verified on an FPGA within the context of a quantum LDPC decoder. A latency of $\log{n}$ is achieved for the output of the first element, after which the rest stream out for a total sorting time of $n+\log{n}$. Contrary to other parallel sorting methods, clock speed does not degrade with $n$, and resources scale linearly with input size.
排序是计算机科学的根本问题之一。 在许多过程中, 执行相继机器时, 它被 $\ mathcal{O} (n\log{n}) ($mathcal{O}) (n\log{n}) 所约束的复杂度较低。 由于平行技术增加了平行比较的次数, 这一限制可以降为亚线时间。 但是, 这增加了执行成本, 限制了这些技术的应用。 此外, 数组的大小增加, 在移动输入所需的大量数据时会出现瓶颈, 并在这种排序器的输出时生成。 这可能会给自己带来比排序要求更严格得多的时间要求 。 在本文中, 为输入平行但输出为序列的具体案例建议了一个新的平行排序器 。 然后在量子 LDPC decoder 的背景下在 FPGA 上执行和校验。 在第一个元素的输出中, 将达到 $\log{n} $( $) 的宽度, 在总排序时间后, 平流流流以 $ * 和直线性资源的速度排序。
Article 21
Title@2025-07-22 (2): SVAgent: AI Agent for Hardware Security Verification Assertion
Title: SVAgent: AI Agent for Hardware Security Verification Assertion | SVAgent: KI-Agent für Hardware-Sicherheitsprüfung Assertion | AI 硬件安全核查认证代理商 2507.16203v1 |
Authors (6): Rui Guo, Avinash Ayalasomayajula, Henian Li, Jingbo Zhou, Sujan Kumar Saha, Farimah Farahmandi
Verification using SystemVerilog assertions (SVA) is one of the most popular methods for detecting circuit design vulnerabilities. However, with the globalization of integrated circuit design and the continuous upgrading of security requirements, the SVA development model has exposed major limitations. It is not only inefficient in development, but also unable to effectively deal with the increasing number of security vulnerabilities in modern complex integrated circuits. In response to these challenges, this paper proposes an innovative SVA automatic generation framework, SVAgent. SVAgent introduces a requirement decomposition mechanism to transform the original complex requirements into a structured, gradually solvable fine-grained problem-solving chain. Experiments have shown that SVAgent can effectively suppress the influence of hallucinations and random answers, and the key evaluation indicators such as the accuracy and consistency of the SVA are significantly better than existing frameworks. More importantly, we successfully integrated SVAgent into the most mainstream integrated circuit vulnerability assessment framework and verified its practicality and reliability in a real engineering design environment.
使用SystemVerilog(SVA)进行核查是发现电路设计弱点最常用的方法之一,然而,随着集成电路设计全球化和安全要求不断升级,SVA开发模型暴露出重大局限性,不仅在发展过程中效率低下,而且无法有效处理现代复杂集成电路中日益增多的安全弱点。为了应对这些挑战,本文件建议建立一个创新的SVA自动生成框架SVAVGent。SVangerent引入了一种要求分解机制,将原有的复杂要求转化为结构化的、逐步溶解的精细微解决问题链。实验表明,SVAGENT能够有效抑制幻觉和随机回答的影响,而SVA的准确性和一致性等关键评价指标比现有框架要好得多。更重要的是,我们成功地将SVAgent纳入最主流的综合电路脆弱性评估框架,并在一个真正的工程设计环境中核实其实用性和可靠性。
Article 22
Title@2025-07-22 (2): RealBench: Benchmarking Verilog Generation Models with Real-World IP Designs
Title: RealBench: Benchmarking Verilog Generation Models with Real-World IP Designs | RealBench: Benchmarking von Verilog-Generationsmodellen mit Real-World-IP-Designs | ReealBeonch:以现实世界的IP设计为标准,将风险生成模型与现实世界的IP设计作为基准 2507.16200v1 |
Authors (13): Pengwei Jin, Di Huang, Chongxiao Li, Shuyao Cheng, Yang Zhao, Xinyao Zheng, Jiaguo Zhu, Shuyi Xing, Bohan Dou, Rui Zhang, Zidong Du, Qi Guo, Xing Hu
The automatic generation of Verilog code using Large Language Models (LLMs) has garnered significant interest in hardware design automation. However, existing benchmarks for evaluating LLMs in Verilog generation fall short in replicating real-world design workflows due to their designs’ simplicity, inadequate design specifications, and less rigorous verification environments. To address these limitations, we present RealBench, the first benchmark aiming at real-world IP-level Verilog generation tasks. RealBench features complex, structured, real-world open-source IP designs, multi-modal and formatted design specifications, and rigorous verification environments, including 100% line coverage testbenches and a formal checker. It supports both module-level and system-level tasks, enabling comprehensive assessments of LLM capabilities. Evaluations on various LLMs and agents reveal that even one of the best-performing LLMs, o1-preview, achieves only a 13.3% pass@1 on module-level tasks and 0% on system-level tasks, highlighting the need for stronger Verilog generation models in the future. The benchmark is open-sourced at https://github.com/IPRC-DIP/RealBench.
使用大语言模型自动生成Verilog码的做法引起了对硬件设计自动化的极大兴趣,然而,目前用于评价Verilog生成的LLMS的现有基准,由于设计简单、设计规格不足和核查环境不那么严格,在复制现实世界设计工作流程方面不足。为了解决这些局限性,我们介绍RealBench,这是旨在执行真实世界IP级Verilog生成任务的第一个基准。Rebench具有复杂、结构化、真实世界开放源码的IP设计设计、多式和格式化的设计规格以及严格的核查环境,包括100%的线路覆盖测试器和正式检查器。它支持模块一级和系统一级的任务,从而能够全面评估LLM能力。对各种LMmm和代理人的评价显示,即使是最优秀的LMS(o1-preview)之一,模块级任务也只达到13.3%的通行证@1,系统一级任务则达到0%,强调未来需要更强大的Verilogs生成模型。该基准在https://github.com/IP-ReparalDIP/DIP。
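The pass@1 figures quoted above are an instance of the pass@k metric. The abstract does not say how RealBench estimates it, so the sketch below shows the commonly used unbiased estimator from code-generation evaluation, computed from n generated samples of which c pass verification; the function name is an assumption.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: probability that at least one of k
    samples drawn without replacement from n samples (c correct) passes."""
    if n - c < k:
        # Fewer than k failing samples exist, so any draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical numbers: 10 generated designs per task, 3 pass the testbench.
estimate_k1 = pass_at_k(10, 3, 1)
estimate_k5 = pass_at_k(10, 3, 5)
```

For k = 1 the estimator reduces to c / n, the fraction of generated designs that pass.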
Article 23
Title@2025-07-22 (2): A Sparsity-Aware Autonomous Path Planning Accelerator with HW/SW Co-Design and Multi-Level Dataflow Optimization
Title: A Sparsity-Aware Autonomous Path Planning Accelerator with HW/SW Co-Design and Multi-Level Dataflow Optimization | A Sparsity-Aware Autonomous Path Planning Accelerator mit HW/SW Co-Design und Multi-Level-Datenflussoptimierung | 配有HW/SW共同设计和多层次数据流优化的公平-软件自主路径规划加速器 2507.16177v1 |
Authors (7): Yifan Zhang, Xiaoyu Niu, Hongzheng Tian, Yanjun Zhang, Bo Yu, Shaoshan Liu, Sitao Huang
Path planning is critical for autonomous driving, generating smooth, collision-free, feasible paths based on perception and localization inputs. However, its computationally intensive nature poses significant challenges for resource-constrained autonomous driving hardware. This paper presents an end-to-end FPGA-based acceleration framework targeting quadratic programming (QP), the core of optimization-based path planning. We employ a hardware-friendly alternating direction method of multipliers (ADMM) for QP solving and a parallelizable preconditioned conjugate gradient (PCG) method for linear systems. By analyzing sparse matrix patterns, we propose customized storage schemes and efficient sparse matrix multiplication units, significantly reducing resource usage and accelerating matrix operations. Our multi-level dataflow optimization strategy incorporates intra-operator parallelization and pipelining, inter-operator fine-grained pipelining, and CPU-FPGA system-level task mapping. Implemented on the AMD ZCU102 platform, our framework achieves state-of-the-art latency and energy efficiency, including 1.48x faster performance than the best FPGA-based design, 2.89x over an Intel i7-11800H CPU, 5.62x over an ARM Cortex-A57 embedded CPU, and 1.56x over a state-of-the-art GPU solution, along with a 2.05x throughput improvement over existing FPGA-based designs.
路径规划对自动驾驶至关重要,它根据感知和定位输入生成平滑、无碰撞且可行的路径。然而,其计算密集的特性给资源受限的自动驾驶硬件带来了重大挑战。本文提出一个端到端的基于FPGA的加速框架,面向基于优化的路径规划的核心,即二次规划(QP)。我们采用硬件友好的交替方向乘子法(ADMM)求解QP,并采用可并行的预条件共轭梯度(PCG)法求解线性系统。通过分析稀疏矩阵模式,我们提出了定制的存储方案和高效的稀疏矩阵乘法单元,显著减少了资源占用并加速了矩阵运算。我们的多级数据流优化策略包括算子内并行化与流水线、算子间细粒度流水线以及CPU-FPGA系统级任务映射。在AMD ZCU102平台上实现后,我们的框架达到了最先进的延迟和能效水平:比最佳的基于FPGA的设计快1.48倍,比Intel i7-11800H CPU快2.89倍,比ARM Cortex-A57嵌入式CPU快5.62倍,比最先进的GPU方案快1.56倍,吞吐量比现有基于FPGA的设计提高2.05倍。
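The linear systems that arise inside an ADMM-based QP solver are symmetric positive definite, which is what makes conjugate gradient applicable. The sketch below is a textbook preconditioned conjugate gradient solver, shown only to make the PCG step concrete. The Jacobi (diagonal) preconditioner, the dense test matrix, and the function name are assumptions; the abstract does not state which preconditioner or sparse format the accelerator uses.

```python
import numpy as np

def pcg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A.

    Uses conjugate gradient with a Jacobi preconditioner M = diag(A).
    """
    m_inv = 1.0 / np.diag(A)        # applying M^-1 is an elementwise scale
    x = np.zeros_like(b, dtype=np.float64)
    r = b - A @ x                   # residual
    z = m_inv * r                   # preconditioned residual
    p = z.copy()                    # search direction
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p                  # the only matrix-vector product per iteration
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = m_inv * r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

rng = np.random.default_rng(0)
M = rng.normal(size=(20, 20))
A = M.T @ M + 20.0 * np.eye(20)     # hypothetical SPD system matrix
b = rng.normal(size=20)
x = pcg(A, b)
```

Each iteration needs one matrix-vector product plus a few vector operations, which is why sparse matrix multiplication units and pipelined vector operators map well onto this loop.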
Article 24
Title@2025-07-21 (1): Autocomp: LLM-Driven Code Optimization for Tensor Accelerators
Title: Autocomp: LLM-Driven Code Optimization for Tensor Accelerators | Autocomp: LLM-gesteuerte Code-Optimierung für Tensor-Beschleuniger | 自动comp: LLM- Driven 代码对 Tensor 加速器的优化 2505.18574v3 |
Authors (4): Charles Hong, Sahil Bhatia, Alvin Cheung, Yakun Sophia Shao
Hardware accelerators, especially those designed for tensor processing, have become ubiquitous in today’s computing landscape. However, even with significant efforts in building compilers, programming these tensor accelerators remains challenging, leaving much of their potential underutilized. Recently, large language models (LLMs), trained on large amounts of code, have shown significant promise in code generation and optimization tasks, but generating low-resource languages like specialized tensor accelerator code still poses a significant challenge. We tackle this challenge with Autocomp, an approach that empowers accelerator programmers to leverage domain knowledge and hardware feedback to optimize code via an automated LLM-driven search. We accomplish this by: 1) formulating each optimization pass as a structured two-phase prompt, divided into planning and code generation phases, 2) inserting domain knowledge during planning via a concise and adaptable optimization menu, and 3) integrating correctness and performance metrics from hardware as feedback at each search iteration. Across three categories of representative workloads and two different accelerators, we demonstrate that Autocomp-optimized code runs 5.6x (GEMM) and 2.7x (convolution) faster than the vendor-provided library, and outperforms expert-level hand-tuned code by 1.4x (GEMM), 1.1x (convolution), and 1.3x (fine-grained linear algebra). Additionally, we demonstrate that optimization schedules generated from Autocomp can be reused across similar tensor operations, improving speedups by up to 24% under a fixed sample budget.
硬件加速器,特别是为张量处理而设计的加速器,在当今的计算环境中已无处不在。然而,即便在构建编译器方面投入了大量努力,为这些张量加速器编程仍然具有挑战性,使其大部分潜力未得到充分利用。最近,在大量代码上训练的大型语言模型(LLM)在代码生成和优化任务中展现出巨大潜力,但生成专用张量加速器代码这类低资源语言仍是一项重大挑战。我们用Autocomp应对这一挑战,该方法使加速器程序员能够利用领域知识和硬件反馈,通过LLM驱动的自动搜索来优化代码。我们通过以下方式实现:1)将每次优化过程表述为结构化的两阶段提示,分为规划阶段和代码生成阶段;2)在规划阶段通过简洁且可调整的优化菜单注入领域知识;3)在每次搜索迭代中将来自硬件的正确性和性能指标作为反馈。在三类代表性工作负载和两种不同的加速器上,我们证明Autocomp优化的代码比厂商提供的库快5.6倍(GEMM)和2.7倍(卷积),并比专家级手工调优代码快1.4倍(GEMM)、1.1倍(卷积)和1.3倍(细粒度线性代数)。此外,我们证明Autocomp生成的优化调度可以在相似的张量运算之间复用,在固定样本预算下将加速比最多提高24%。
Article 25
Title@2025-07-21 (1): Per-Bank Bandwidth Regulation of Shared Last-Level Cache for Real-Time Systems
Title: Per-Bank Bandwidth Regulation of Shared Last-Level Cache for Real-Time Systems | Per-Bank-Bandbreitenregulierung des geteilten Last-Level-Cache für Echtzeit-Systeme | 面向实时系统的共享末级缓存按存储体带宽调节 2410.14003v2 |
Authors (4): Connor Sullivan, Alex Manley, Mohammad Alian, Heechul Yun
Modern commercial-off-the-shelf (COTS) multicore processors have advanced memory hierarchies that enhance memory-level parallelism (MLP), which is crucial for high performance. To support high MLP, shared last-level caches (LLCs) are divided into multiple banks, allowing parallel access. However, uneven distribution of cache requests from the cores, especially when requests from multiple cores are concentrated on a single bank, can result in significant contention affecting all cores that access the cache. Such cache bank contention can even be maliciously induced, in what are known as cache bank-aware denial-of-service (DoS) attacks, in order to jeopardize the system’s timing predictability. In this paper, we propose a per-bank bandwidth regulation approach for multicore real-time systems with multi-banked shared LLCs. By regulating bandwidth on a per-bank basis, the approach aims to prevent unnecessary throttling of cache accesses to non-contended banks, thus improving overall performance (throughput) without compromising the isolation benefits of throttling. We implement our approach on a RISC-V system-on-chip (SoC) platform using FireSim and evaluate it extensively using both synthetic and real-world workloads. Our evaluation results show that the proposed per-bank regulation approach effectively protects real-time tasks from co-running cache bank-aware DoS attacks, and offers up to a 3.66x performance improvement for the throttled benign best-effort tasks compared to prior bank-oblivious bandwidth throttling approaches.
现代商用现成(COTS)多核处理器具有先进的存储层次结构,可增强对高性能至关重要的内存级并行(MLP)。为支持高MLP,共享末级缓存(LLC)被划分为多个存储体(bank),允许并行访问。然而,来自各核心的缓存请求分布不均,特别是当多个核心的请求集中在单个存储体上时,可能导致严重的争用,影响所有访问该缓存的核心。这种缓存存储体争用甚至可能被恶意诱发,即所谓缓存存储体感知的拒绝服务(DoS)攻击,以破坏系统的时序可预测性。本文针对采用多存储体共享LLC的多核实时系统,提出一种按存储体的带宽调节方法。通过按存储体调节带宽,该方法旨在避免对无争用存储体的缓存访问进行不必要的限流,从而在不损害限流隔离效果的前提下提升整体性能(吞吐量)。我们使用FireSim在RISC-V片上系统(SoC)平台上实现了该方法,并用合成和真实工作负载进行了广泛评估。评估结果表明,所提出的按存储体调节方法能有效保护实时任务免受并行运行的缓存存储体感知DoS攻击的影响,并且与先前不区分存储体的带宽限流方法相比,可为被限流的良性尽力而为任务带来最高3.66倍的性能提升。
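The intuition behind per-bank regulation can be illustrated with a small software model. This is only a sketch under stated assumptions, not the paper's hardware design: the class names, the single regulation period, and the budget values are all hypothetical. It contrasts a bank-oblivious budget, which counts every access against one limit, with a per-bank budget, which throttles only the bank that is over its limit.

```python
class PerBankRegulator:
    """Hypothetical model: each bank has its own access budget per period."""

    def __init__(self, num_banks, budget_per_bank):
        self.budget = budget_per_bank
        self.used = [0] * num_banks

    def try_access(self, bank):
        if self.used[bank] >= self.budget:
            return False  # throttle only the over-budget bank
        self.used[bank] += 1
        return True


class BankObliviousRegulator:
    """Hypothetical model: one budget shared by all banks."""

    def __init__(self, budget):
        self.budget = budget
        self.used = 0

    def try_access(self, bank):
        if self.used >= self.budget:
            return False  # throttle every access, whatever the bank
        self.used += 1
        return True


def served(regulator, trace):
    """Count how many accesses in `trace` (a list of bank ids) are admitted."""
    return sum(regulator.try_access(bank) for bank in trace)


# A best-effort core hammers bank 0, then touches the uncontended banks.
trace = [0] * 8 + [1, 2, 3, 1]
print(served(PerBankRegulator(4, 2), trace))     # 6
print(served(BankObliviousRegulator(2), trace))  # 2
```

To bound the load on any one bank at two accesses, the bank-oblivious regulator must cap total accesses at two, so it also blocks the harmless accesses to banks 1 to 3. The per-bank version admits them.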
Article 26
Title@2025-07-21 (1): VeriRAG: A Retrieval-Augmented Framework for Automated RTL Testability Repair
Title: VeriRAG: A Retrieval-Augmented Framework for Automated RTL Testability Repair | VeriRAG: Ein Retrieval-Augmented Framework für automatisierte RTL-Testbarkeits-Reparatur | VeriRAG: 自动RTL可测试性修理检索增强框架 2507.15664v1 |
Authors (6): Haomin Qi, Yuyang Du, Lihao Zhang, Soung Chang Liew, Kexin Chen, Yining Du
Large language models (LLMs) have demonstrated immense potential in computer-aided design (CAD), particularly for automated debugging and verification within electronic design automation (EDA) tools. However, Design for Testability (DFT) remains a relatively underexplored area. This paper presents VeriRAG, the first LLM-assisted DFT-EDA framework. VeriRAG leverages a Retrieval-Augmented Generation (RAG) approach to enable an LLM to revise code to ensure DFT compliance. VeriRAG integrates (1) an autoencoder-based similarity measurement model for precise retrieval of reference RTL designs for the LLM, and (2) an iterative code revision pipeline that allows the LLM to ensure DFT compliance while maintaining synthesizability. To support VeriRAG, we introduce VeriDFT, a Verilog-based DFT dataset curated for DFT-aware RTL repairs. VeriRAG retrieves structurally similar RTL designs from VeriDFT, each paired with a rigorously validated correction, as references for code repair. With VeriRAG and VeriDFT, we achieve fully automated DFT correction, resulting in a 7.72-fold improvement in successful repair rate compared to the zero-shot baseline (Fig. 5 in Section V). Ablation studies further confirm the contribution of each component of the VeriRAG framework. We open-source our data, models, and scripts at https://github.com/yuyangdu01/LLM4DFT.
大型语言模型(LLM)在计算机辅助设计(CAD)中已展现出巨大潜力,特别是在电子设计自动化(EDA)工具中的自动调试与验证方面。然而,可测试性设计(DFT)仍是一个相对探索不足的领域。本文提出VeriRAG,首个LLM辅助的DFT-EDA框架。VeriRAG利用检索增强生成(RAG)方法,使LLM能够修改代码以确保符合DFT要求。VeriRAG集成了(1)一个基于自编码器的相似度度量模型,用于为LLM精确检索参考RTL设计,以及(2)一个迭代式代码修订流水线,使LLM在保持可综合性的同时确保符合DFT要求。为支持VeriRAG,我们引入VeriDFT,一个为面向DFT的RTL修复而整理的基于Verilog的DFT数据集。VeriRAG从VeriDFT中检索结构相似的RTL设计,每个设计都配有经过严格验证的修正,作为代码修复的参考。借助VeriRAG和VeriDFT,我们实现了全自动DFT修正,成功修复率相比零样本基线提高了7.72倍(第V节图5)。消融研究进一步证实了VeriRAG框架各组成部分的贡献。我们在 https://github.com/yuyangdu01/LLM4DFT 开源了数据、模型和脚本。
Article 27
Title@2025-07-21 (1): When Pipelined In-Memory Accelerators Meet Spiking Direct Feedback Alignment: A Co-Design for Neuromorphic Edge Computing
Title: When Pipelined In-Memory Accelerators Meet Spiking Direct Feedback Alignment: A Co-Design for Neuromorphic Edge Computing | Wenn pipelined In-Memory Accelerators treffen Spiking Direct Feedback Alignment: Ein Co-Design für neuromorphe Edge Computing | 当射线内模拟加速器与Spiking直接反馈对齐:神经边缘计算共同设计时 2507.15603v1 |
Authors (7): Haoxiong Ren, Yangu He, Kwunhang Wong, Rui Bao, Ning Lin, Zhongrui Wang, Dashan Shang
Spiking Neural Networks (SNNs) are increasingly favored for deployment on resource-constrained edge devices due to their energy-efficient and event-driven processing capabilities. However, training SNNs remains challenging because of the computational intensity of traditional backpropagation algorithms adapted for spike-based systems. In this paper, we propose a novel software-hardware co-design that introduces a hardware-friendly training algorithm, Spiking Direct Feedback Alignment (SDFA), and implements it on a Resistive Random Access Memory (RRAM)-based In-Memory Computing (IMC) architecture, referred to as PipeSDFA, to accelerate SNN training. Software-wise, SDFA reduces the computational complexity of SNN training by eliminating sequential error propagation. Hardware-wise, a three-level pipelined dataflow is designed on top of the IMC architecture to parallelize the training process. Experimental results demonstrate that the PipeSDFA training accelerator incurs less than 2% accuracy loss on five datasets compared to baselines, while achieving 1.1X~10.5X and 1.37X~2.1X reductions in training time and energy consumption, respectively, compared to PipeLayer.
由于具有节能和事件驱动的处理能力,Spik Spik NealNetwork(SNN)越来越倾向于在资源限制的边缘装置上部署,原因是它们具有节能和事件驱动的处理能力。然而,培训SNNN仍然具有挑战性,因为适应以钉钉为基础的系统的传统反反反向反向分析算法的计算强度。在本文中,我们提议采用一个新的软件硬件硬件软件-硬件共同设计,引入硬件友好型培训算法,Spiking直接反馈对齐(SDFA),并在以耐受随机访问内存(RRAM)为基础的内镜化计算机(IMC)结构上实施,以加速SNNN培训。软件方面,SDFA通过消除连续的错误传播,降低了SNN培训的计算复杂性。在IMC结构的基础上设计了三级管道数据流,以平行的培训进程。实验结果显示,PipeSDFA培训加速器与基线相比,在5个数据集(PipESDA-10.5X)和1X节能-2.1x的消耗中分别实现了1.X时间和1xLAX的削减。
Article 28
Title@2025-07-21 (1): HEPPO-GAE: Hardware-Efficient Proximal Policy Optimization with Generalized Advantage Estimation
Title: HEPPO-GAE: Hardware-Efficient Proximal Policy Optimization with Generalized Advantage Estimation | HEPPO-GAE: Hardwareeffiziente proximale Politikoptimierung mit generalisierter Vorteilsschätzung | HEPPO-GAE: 采用通用的先进估计法优化政策 2501.12703v2 |
Authors (2): Hazem Taha, Ameer M. S. Abdelhadi
This paper introduces HEPPO-GAE, an FPGA-based accelerator designed to optimize the Generalized Advantage Estimation (GAE) stage in Proximal Policy Optimization (PPO). Unlike previous approaches that focused on trajectory collection and actor-critic updates, HEPPO-GAE addresses GAE’s computational demands with a parallel, pipelined architecture implemented on a single System-on-Chip (SoC). This design allows for the adaptation of various hardware accelerators tailored for different PPO phases. A key innovation is our strategic standardization technique, which combines dynamic reward standardization and block standardization for values, followed by 8-bit uniform quantization. This method stabilizes learning, enhances performance, and manages memory bottlenecks, achieving a 4x reduction in memory usage and a 1.5x increase in cumulative rewards. We propose a solution on a single SoC device with programmable logic and embedded processors, delivering throughput orders of magnitude higher than traditional CPU-GPU systems. Our single-chip solution minimizes communication latency and throughput bottlenecks, significantly boosting PPO training efficiency. Experimental results show a 30% increase in PPO speed and a substantial reduction in memory access time, underscoring HEPPO-GAE’s potential for broad applicability in hardware-efficient reinforcement learning algorithms.
本文介绍了HEPOPO-GAE,这是一个基于FPGA的加速器,旨在优化Proximal政策优化(PPPO)中通用优势估计(GAE)阶段。 与以往侧重于轨迹采集和行为者-批评更新的方法不同,HEPPO-GAE处理GAE的计算要求,在单一系统对齐系统(SOC)上实施平行的编程结构。这一设计允许为不同的PPPO阶段量身定制的各种硬件加速器进行改造。一个关键的创新是,我们的战略标准化技术,将动态的奖励标准化和价值观的区级标准化结合起来,随后是8比标准的统一四级四分级四分位四舍四入。这种方法稳定了学习、提高绩效和管理记忆瓶颈,实现了记忆用量的4x减少,累积收益增加了1.5x。 我们提出了一个单一的SOC装置的解决办法,其逻辑和嵌入式处理器比传统的CUU-GPUPS系统高得多。我们的单级解决方案最大限度地减少了通信延迟和通量,大大提升了PPO30级止步步,大大提升了PPPO-CEPO的快速学习效率。
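For readers unfamiliar with the stage being accelerated, the standard GAE recurrence is short enough to sketch in software. This is the textbook formulation, not the paper's pipelined hardware design; the function name and default hyperparameters are illustrative assumptions. The backward dependence of each advantage on the next one is what makes the computation a natural target for a pipelined architecture.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    `values` must have len(rewards) + 1 entries; the last entry is the
    bootstrap value of the state that follows the final reward.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards: A_t = delta_t + gamma * lam * A_{t+1}
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# With gamma = lam = 1 and zero value estimates, the advantages reduce to
# the reward-to-go at each step.
print(gae([1.0, 2.0, 3.0], [0.0, 0.0, 0.0, 0.0], gamma=1.0, lam=1.0))  # [6.0, 5.0, 3.0]
```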
Article 29
Title@2025-07-21 (1): Unicorn-CIM: Uncovering the Vulnerability and Improving the Resilience of High-Precision Compute-in-Memory
Title: Unicorn-CIM: Uncovering the Vulnerability and Improving the Resilience of High-Precision Compute-in-Memory | Einhorn-CIM: Enthüllen der Schwachstelle und Verbesserung der Resilienz von Hochpräzisions-Compute-in-Memory | 独角兽-CIM: 消除脆弱性和提高高精确度计量的复原力 2506.02311v2 |
Authors (3): Qiufeng Li, Yiwen Liang, Weidong Cao
Compute-in-memory (CIM) architecture has been widely explored to address the von Neumann bottleneck in accelerating deep neural networks (DNNs). However, its reliability remains largely understudied, particularly in the emerging domain of floating-point (FP) CIM, which is crucial for speeding up high-precision inference and on-device training. This paper introduces Unicorn-CIM, a framework to uncover the vulnerability and improve the resilience of high-precision CIM, built on a static random-access memory (SRAM)-based FP CIM architecture. Through the development of fault injection and extensive characterizations across multiple DNNs, Unicorn-CIM reveals how soft errors manifest in FP operations and impact overall model performance. Specifically, we find that high-precision DNNs are extremely sensitive to errors in the exponent part of FP numbers. Building on this insight, Unicorn-CIM develops an efficient algorithm-hardware co-design method that optimizes model exponent distribution through fine-tuning and incorporates a lightweight Error Correcting Code (ECC) scheme to safeguard high-precision DNNs on FP CIM. Comprehensive experiments show that our approach introduces a logic overhead of just 8.98% on the exponent processing path while providing robust error protection and maintaining model accuracy. This work paves the way for developing more reliable and efficient CIM hardware.
在加速深神经网络(DNN)中,人们广泛探讨了如何解决冯纽曼瓶颈的深神经网络(CIM)结构问题。然而,其可靠性仍然在很大程度上未得到充分研究,特别是在浮点(FP) CIM这一新兴领域,这对加速高精度推断和装置培训至关重要。本文介绍了Uncorn-CIM这一发现高精度CIM脆弱性和提高高精度CIM抗御力的框架,这一框架建立在静态随机存取存储(SRAM)基于FP CIM的静态随机存储(FP CIM)结构上。通过在多个DNNS中开发错误注入和广泛描述,独角兽CIM揭示了FP操作和总体模型性表现中出现的软错误。具体地说,我们发现高精度的DNNNN对于F数字的显性部分错误极为敏感。基于这一洞察,Uncorn-CIM开发了一种高效的模型联合设计方法,通过微调优化优化优化模型的配置,并纳入了轻度错误校准的校正规则,纠正了FP操作中的软误度,同时引入了我们高精度的CIM(EC) 的CIM流程,为高精度的路径,为高精度的CIM提供了高精度的节制。
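The sensitivity of the exponent field is easy to reproduce in software. The sketch below is a generic IEEE-754 single-precision bit-flip experiment, not the paper's fault-injection framework; the helper name `flip_bit` and the example weight are assumptions chosen for illustration.

```python
import struct


def flip_bit(x, bit):
    """Flip one bit of the float32 encoding of x.

    Bits 0..22 are the mantissa, bits 23..30 the exponent, bit 31 the sign.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    (y,) = struct.unpack("<f", struct.pack("<I", bits ^ (1 << bit)))
    return y


w = 0.5  # a weight of typical DNN magnitude
print(flip_bit(w, 22))  # top mantissa bit flipped: 0.75
print(flip_bit(w, 23))  # lowest exponent bit flipped: 1.0
print(flip_bit(w, 30))  # top exponent bit flipped: 2**127, about 1.7e38
```

A mantissa flip perturbs the weight by at most a factor of two, while a single flip in the top exponent bit moves it by dozens of orders of magnitude, which is consistent with the abstract's focus on protecting the exponent path.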
Article 30
Title@2025-07-19 (6): Iceberg: Enhancing HLS Modeling with Synthetic Data
Title: Iceberg: Enhancing HLS Modeling with Synthetic Data | Iceberg: Verbesserung der HLS-Modellierung mit synthetischen Daten | 冰山:加强利用合成数据建立HLS模型 2507.09948v2 |
Authors (6): Zijian Ding, Tung Nguyen, Weikai Li, Aditya Grover, Yizhou Sun, Jason Cong
Deep learning-based prediction models for High-Level Synthesis (HLS) of hardware designs often struggle to generalize. In this paper, we study how to close the generalizability gap of these models through pretraining on synthetic data and introduce Iceberg, a synthetic data augmentation approach that expands both large language model (LLM)-generated programs and weak labels of unseen design configurations. Our weak label generation method is integrated with an in-context model architecture, enabling meta-learning from actual and proximate labels. Iceberg improves the geometric mean modeling accuracy by 86.4% when adapting to six real-world applications with few-shot examples and achieves 2.47x and 1.12x better offline DSE performance when adapting to two different test datasets. Our open-sourced code is here: https://github.com/UCLA-VAST/iceberg
硬件设计高层次综合(HLS)的基于深度学习的预测模型往往难以泛化。在本文中,我们研究如何通过在合成数据上预训练来缩小这些模型的泛化差距,并提出Iceberg,这是一种合成数据增强方法,可同时扩充大型语言模型(LLM)生成的程序和未见设计配置的弱标签。我们的弱标签生成方法与上下文内(in-context)模型架构相结合,使模型能够从真实标签和近似标签中进行元学习。在用少样本示例适配六个真实应用时,Iceberg将几何平均建模精度提高了86.4%;在适配两个不同的测试数据集时,离线DSE性能分别提升2.47倍和1.12倍。我们的开源代码在这里:https://github.com/UCLA-VAST/iceberg
Article 31
Title@2025-07-19 (6): Enabling Efficient Hardware Acceleration of Hybrid Vision Transformer (ViT) Networks at the Edge
Title: Enabling Efficient Hardware Acceleration of Hybrid Vision Transformer (ViT) Networks at the Edge | Effiziente Hardware-Beschleunigung von Hybrid Vision Transformer (ViT)-Netzwerken am Rand ermöglichen | 使边缘的混合愿景变异器(VT)网络的高效硬件加速 2507.14651v1 |
Authors (4): Joren Dumoulin, Pouya Houshmand, Vikram Jain, Marian Verhelst
Hybrid vision transformers combine the elements of conventional neural networks (NN) and vision transformers (ViT) to enable lightweight and accurate detection. However, several challenges remain for their efficient deployment on resource-constrained edge devices. The hybrid models suffer from a widely diverse set of NN layer types and large intermediate data tensors, hampering efficient hardware acceleration. To enable their execution at the edge, this paper proposes innovations across the hardware-scheduling stack: (a) at the lowest level, a configurable PE array supports all hybrid ViT layer types; (b) temporal loop re-ordering within one layer enables hardware support for normalization and softmax layers while minimizing on-chip data transfers; (c) further scheduling optimization employs layer fusion across inverted bottleneck layers to drastically reduce off-chip memory transfers. The resulting accelerator is implemented in 28nm CMOS, achieving a peak energy efficiency of 1.39 TOPS/W at 25.6 GMACs/s.
常规神经网络和视觉变压器结合了常规神经网络和视觉变压器的元素,以便能够轻量度和准确检测。但是,在资源限制的边缘装置上有效部署这些变压器方面仍然存在若干挑战。混合模型有多种多样的NN层类型和大型中间数据变压器,妨碍高效硬件加速。为了能够在边缘执行这些变压器,本文件提议在硬件排位堆中进行创新:a. 在最低水平上,可配置的 PE阵列支持所有混合VIT层类型;b. 在一个层内重新排列时环环,使硬件支持正常化和软式模层,最大限度地减少芯数据传输;c. 进一步安排优化在倒置瓶颈层进行层融合,以大幅减少离子内存储传输。因此产生的加速器在28nm CMOS实施,在25.6 GMAC/s实现1.39 TOPS/W的最高能效。
Article 32
Title@2025-07-19 (6): Characterizing State Space Model (SSM) and SSM-Transformer Hybrid Language Model Performance with Long Context Length
Title: Characterizing State Space Model (SSM) and SSM-Transformer Hybrid Language Model Performance with Long Context Length | Charakterisieren von State Space Model (SSM) und SSM-Transformer Hybrid Language Model Performance mit langer Kontextlänge | 长上下文下状态空间模型(SSM)与SSM-Transformer混合语言模型的性能表征 2507.12442v2 |
Authors (5): Saptarshi Mitra, Rachid Karami, Haocheng Xu, Sitao Huang, Hyoukjun Kwon
The demand for machine intelligence capable of processing continuous, long-context inputs on local devices is growing rapidly. However, the quadratic complexity and memory requirements of traditional Transformer architectures make them inefficient and often unusable for these tasks. This has spurred a paradigm shift towards new architectures like State Space Models (SSMs) and hybrids, which promise near-linear scaling. While most current research focuses on the accuracy and theoretical throughput of these models, a systematic performance characterization on practical consumer hardware is critically needed to guide system-level optimization and unlock new applications. To address this gap, we present a comprehensive, comparative benchmarking of carefully selected Transformer, SSM, and hybrid models specifically for long-context inference on consumer and embedded GPUs. Our analysis reveals that SSMs are not only viable but superior for this domain, capable of processing sequences up to 220K tokens on a 24GB consumer GPU, approximately 4x longer than comparable Transformers. While Transformers may be up to 1.8x faster at short sequences, SSMs demonstrate a dramatic performance inversion, becoming up to 4x faster at very long contexts (~57K tokens). Our operator-level analysis reveals that custom, hardware-aware SSM kernels dominate the inference runtime, accounting for over 55% of latency on edge platforms, identifying them as a primary target for future hardware acceleration. We also provide detailed, device-specific characterization results to guide system co-design for the edge. To foster further research, we will open-source our characterization framework.
对能够处理当地设备连续、长文本投入的机器情报的需求正在迅速增长,然而,传统变异器结构的四重复杂和记忆要求使得传统变异器结构效率低,往往无法用于这些任务。这促使向国家空间模型(SSMs)和混合体等新结构的范式转变,这些结构有望近线缩放。虽然大多数目前的研究侧重于这些模型的准确性和理论输送量,但对实用消费者硬件的系统性能定性至关重要,以指导系统层面的优化和打开新的应用程序。为弥补这一差距,我们提出了精心选择的变异器、SSSM和混合模型的全面、比较基准化基准,具体针对消费者和嵌入的GPUPs进行长的推断。我们的分析显示,SSMs不仅可行,而且优于此领域,能够对24GB消费者GPU值约4x比可比变异的变异器进行高达220K的顺序处理。虽然变异器在短的顺序上可能达到1.8x开放度优化,但SSMs展示了惊人的反向,在非常长的背景环境背景下的递化,在非常长的背景中将更快地达到4x的递化的递化水平上,在非常的硬化的硬化的系统上显示我们的硬化分析。
Article 33
Title@2025-07-18 (5): Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity are all you need
Title: Efficient LLM Inference: Bandwidth, Compute, Synchronization, and Capacity are all you need | Effiziente LLM-Inferenz: Bandbreite, Berechnung, Synchronisierung und Kapazität sind alles, was Sie brauchen | 高效率的LLM 推论: 带宽、计算、同步、能力是您所需要的 2507.14397v1 |
Authors (4): Michael Davies, Neal Crago, Karthikeyan Sankaralingam, Christos Kozyrakis
This paper presents a limit study of transformer-based large language model (LLM) inference, focusing on the fundamental performance bottlenecks imposed by memory bandwidth, memory capacity, and synchronization overhead in distributed inference systems. We develop a hardware-agnostic performance model that abstracts away implementation details, enabling the analysis of a wide range of current and near-future hardware technologies. Our analysis spans from current HBM3 memory technology used in AI accelerators like GPUs and TPUs to systems based on advanced HBM4 and advanced 3D-stacked DRAM technology. It also covers SRAM-based designs and scaling techniques from distributed clusters with varying numbers of chips to wafer-scale integration. Our key findings for auto-regressive decoding are: i) serving LLMs requires hundreds of GB per server to serve a model instance; ii) high memory bandwidth is critical for high per-user throughput; iii) exposed synchronization latencies for collective communication must be around 1 µs, or else they make the memory bandwidth ineffective; iv) DRAM-based designs have a fundamental advantage in terms of system-level efficiency as measured in throughput per cost or watt; and v) hardware designs can easily reach 2000+ user tokens/sec, but getting to 10,000+ tokens/sec will need smaller models, smaller context, or other forms of algorithmic advances. This study provides valuable insights into the fundamental performance limits of LLM inference, highlighting the potential benefits of future hardware advancements and guiding the optimization of LLM deployment strategies.
本文介绍了对基于变压器的大语言模型(LLM)推断的限值研究,重点是记忆带宽、记忆能力以及分布式推算系统中同步管理管理管理器造成的基本性能瓶颈。我们开发了一个硬件认知性性性性能模型,该模型可以摘述实施细节,从而能够分析各种当前和近未来硬件技术。我们的分析范围包括目前用于AI加速器的HBM3内存技术,如GPU和TPU的内存技术,以基于先进的HBM4和高级的3D-堆叠式 DRAM技术为基础的系统。它还涵盖分布式组群的基于SRAM的设计和缩放技术,其芯片数量和瓦弗规模集集集集的分布式集。我们关于自动递增性解解码的关键性性性性能模型的发现是:(i) 服务LLMMS每台服务器需要100GB才能为模型服务;(ii) 高用户的内存带带宽度对于高操作器至关重要;(iii) 实现集体沟通的时,必须大约一us 使存储带宽带带宽技术失效; IV-RAM的内存储带宽宽宽带宽宽宽带技术。 以较小的设计在2000年基本面上具有基本性前导值前导值前导值和最低值-直达标码-平面值-平面值-平平平平平面的系统/平面的精度研究中,但通过测量度研究将可测量度上,在2000年的精度上,通过测量度上调低的硬度上调低的系统-平平面度能中可以测量度能。
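Findings ii) and iii) can be made concrete with a first-order estimate. The sketch below is not the paper's performance model; it is a simple bound that assumes every decoded token streams all weights once from memory and pays a fixed number of exposed synchronizations. The function name and all numeric inputs are hypothetical.

```python
def decode_tokens_per_sec(weight_bytes, num_chips, bw_per_chip,
                          syncs_per_token, sync_latency_s):
    """First-order per-user decode rate (assumption: batch size 1,
    weights sharded evenly, no overlap of communication and compute)."""
    stream_time = weight_bytes / (num_chips * bw_per_chip)
    sync_time = syncs_per_token * sync_latency_s
    return 1.0 / (stream_time + sync_time)


# Hypothetical system: 140 GB of weights, 8 chips at 4 TB/s each,
# 160 exposed synchronizations per token.
for latency in (1e-6, 10e-6):
    rate = decode_tokens_per_sec(140e9, 8, 4e12, 160, latency)
    print(f"sync latency {latency * 1e6:.0f} us -> {rate:.0f} tokens/sec")
```

Under these assumed numbers the weight-streaming time is about 4.4 ms per token. Synchronization adds 0.16 ms at 1 µs per sync but 1.6 ms at 10 µs, which shows how exposed latency can erode the benefit of high memory bandwidth.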
Article 34
Title@2025-07-18 (5): Hardware-Compatible Single-Shot Feasible-Space Heuristics for Solving the Quadratic Assignment Problem
Title: Hardware-Compatible Single-Shot Feasible-Space Heuristics for Solving the Quadratic Assignment Problem | Hardware-kompatible Single-Shot-Feasible-Space-Heuristiken zur Lösung des quadratischen Zuordnungsproblems | 用于解决四压指派问题的可兼容的硬件兼容单制制式实际空间- 超光速空间方法 2503.09676v2 |
Authors (13): Haesol Im, Chan-Woo Yang, Moslem Noori, Dmitrii Dobrynin, Elisabetta Valiante, Giacomo Pedretti, Arne Heittmann, Thomas Van Vaerenbergh, Masoud Mohseni, John Paul Strachan, Dmitri Strukov, Ray Beausoleil, Ignacio Rozada
Research into the development of special-purpose computing architectures designed to solve quadratic unconstrained binary optimization (QUBO) problems has flourished in recent years. It has been demonstrated in the literature that such special-purpose solvers can outperform traditional CMOS architectures by orders of magnitude with respect to timing metrics on synthetic problems. However, they face challenges with constrained problems such as the quadratic assignment problem (QAP), where mapping to binary formulations such as QUBO introduces overhead and limits parallelism. In-memory computing (IMC) devices, such as memristor-based analog Ising machines, offer significant speedups and efficiency gains over traditional CPU-based solvers, particularly for solving combinatorial optimization problems. In this work, we present a novel local search heuristic designed for IMC hardware to tackle the QAP. Our approach enables massive parallelism that allows for computing of full neighbourhoods simultaneously to make update decisions. We ensure binary solutions remain feasible by selecting local moves that lead to neighbouring feasible solutions, leveraging feasible-space search heuristics and the underlying structure of a given problem. Our approach is compatible with both digital computers and analog hardware. We demonstrate its effectiveness in CPU implementations by comparing it with state-of-the-art heuristics for solving the QAP.
研究开发特殊目的计算机结构,旨在解决四级不受限制的二进制优化问题(QUBO)的研究近年来已经蓬勃发展,文献表明,这类特殊用途解决方案通过合成问题的定时度指标,能够以数量级比传统的 CMOS 结构优于传统的 CMOS 结构;然而,它们面临着一些制约问题的挑战,如四级分配问题(QAP)等,即对四级分配问题(QAP)等二进制配制的映像像QUBO(QUBO)的映像引入了间接费用和限制的平行关系。内模化计算(IMC)设备,如以模模模模为基的模拟Ising机器,对传统的基于CPU的解决方案提供显著的超速和效率增益,特别是解决组合优化问题。在这项工作中,我们为IMC硬件设计了一种新的本地搜索超额。 我们的方法使得大规模平行化,能够同时计算整个街区,从而做出更新决定。我们确保二进式解决方案仍然可行,方法是选择邻近可行解决方案的本地动作,利用可行的空间搜索超模量搜索和基质计算机的基本结构,以比较其硬件。我们的方法与数字式计算机兼容兼容。
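The idea of searching only in the feasible space can be sketched in a few lines. This is a plain CPU illustration and not the paper's heuristic: it keeps the assignment as a permutation, so every pairwise swap yields another feasible assignment, and it scores the full swap neighbourhood before each update. The function names and the tiny instance are assumptions for illustration.

```python
from itertools import combinations


def qap_cost(flow, dist, perm):
    """QAP objective: facility i is placed at location perm[i]."""
    n = len(perm)
    return sum(flow[i][j] * dist[perm[i]][perm[j]]
               for i in range(n) for j in range(n))


def local_search(flow, dist, perm):
    """Best-improvement search over the full pairwise-swap neighbourhood."""
    perm = list(perm)
    cost = qap_cost(flow, dist, perm)
    while True:
        best_cost, best_perm = cost, perm
        # On parallel hardware these neighbours would be scored at once;
        # here they are scored one by one.
        for a, b in combinations(range(len(perm)), 2):
            cand = list(perm)
            cand[a], cand[b] = cand[b], cand[a]  # a swap stays a permutation
            c = qap_cost(flow, dist, cand)
            if c < best_cost:
                best_cost, best_perm = c, cand
        if best_cost >= cost:
            return perm, cost  # local optimum reached
        perm, cost = best_perm, best_cost


flow = [[0, 5, 0], [5, 0, 1], [0, 1, 0]]
dist = [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
print(local_search(flow, dist, [0, 2, 1]))  # ([0, 1, 2], 12)
```

Because moves are restricted to swaps, no penalty terms are needed to enforce the one-facility-per-location constraints that a QUBO encoding would otherwise require.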
Article 35
Title@2025-07-18 (5): An End-to-End DNN Inference Framework for the SpiNNaker2 Neuromorphic MPSoC
Title: An End-to-End DNN Inference Framework for the SpiNNaker2 Neuromorphic MPSoC | Ein End-to-End DNN-Inferenz-Framework für den SpiNNaker2 Neuromorphic MPSoC | 面向SpiNNaker2神经形态MPSoC的端到端DNN推理框架 2507.13736v1 |
Authors (6): Matthias Jobst, Tim Langer, Chen Liu, Mehmet Alici, Hector A. Gonzalez, Christian Mayr
This work presents a multi-layer DNN scheduling framework as an extension of OctopuScheduler, providing an end-to-end flow from PyTorch models to inference on a single SpiNNaker2 chip. Together with a front-end comprised of quantization and lowering steps, the proposed framework enables the edge-based execution of large and complex DNNs up to transformer scale using the neuromorphic platform SpiNNaker2.
本文提出一个多层DNN调度框架,作为OctopuScheduler的扩展,提供从PyTorch模型到在单个SpiNNaker2芯片上进行推理的端到端流程。结合由量化和下沉(lowering)步骤组成的前端,所提出的框架使神经形态平台SpiNNaker2能够在边缘执行规模可达Transformer级别的大型复杂DNN。
Article 36
Title@2025-07-18 (5): Fast Graph Vector Search via Hardware Acceleration and Delayed-Synchronization Traversal
Title: Fast Graph Vector Search via Hardware Acceleration and Delayed-Synchronization Traversal | Schnelle Graphen-Vektorsuche über Hardware-Beschleunigung und verzögerte Synchronisation Traversal | 通过硬件加速和延迟同步同步搜索快速图表矢量 2406.12385v2 |
Authors (4): Wenqi Jiang, Hang Hu, Torsten Hoefler, Gustavo Alonso
Vector search systems are indispensable in large language model (LLM) serving, search engines, and recommender systems, where minimizing online search latency is essential. Among various algorithms, graph-based vector search (GVS) is particularly popular due to its high search performance and quality. However, reducing GVS latency by intra-query parallelization remains challenging due to limitations imposed by both existing hardware architectures (CPUs and GPUs) and the inherent difficulty of parallelizing graph traversals. To efficiently serve low-latency GVS, we co-design hardware and algorithm by proposing Falcon and Delayed-Synchronization Traversal (DST). Falcon is a hardware GVS accelerator that implements efficient GVS operators, pipelines these operators, and reduces memory accesses by tracking search states with an on-chip Bloom filter. DST is an efficient graph traversal algorithm that simultaneously improves search performance and quality by relaxing traversal orders to maximize accelerator utilization. Evaluation across various graphs and datasets shows that Falcon, prototyped on FPGAs, together with DST, achieves up to 4.3x and 19.5x lower latency and up to 8.0x and 26.9x improvements in energy efficiency over CPU- and GPU-based GVS systems.
在大型语言模型(LLM)服务、搜索引擎和建议系统中,矢量搜索系统是不可或缺的,在大型语言模型(LLM)服务、搜索引擎和建议系统中,最大限度地减少在线搜索延迟时间是不可或缺的。在各种算法中,基于图形的矢量搜索(GVS)因其高搜索性能和质量而特别受欢迎。然而,由于现有硬件结构(CPU和GPU)的限制以及平行图形穿透系统的内在困难,降低GVS的悬浮性能仍然具有挑战性。为了高效地为低纬度的GVS服务,我们通过提出“猎鹰”和“延迟同步轨道”(DST)来共同设计硬件和算法硬件和算法。 鹰是一个硬件GVS的加速器,它使用高效的GVS操作者、管道、这些操作者,通过跟踪搜索状态和电流过滤器(CPU和GPGPG5)来减少记忆的存取。DST是一种高效的图解算法,同时通过放松旅行订单来提高搜索性业绩和质量,从而最大限度地利用加速器。对各种图表和数据集进行评估。
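The visited-set idea mentioned in the abstract can be sketched with a generic Bloom filter. This is a software illustration only, not Falcon's on-chip design; the class name, bit-array size, hash count, and the choice of BLAKE2b as the hash are all assumptions.

```python
import hashlib


class BloomFilter:
    """Compact approximate set: no false negatives, rare false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions by salting the hash input.
        for seed in range(self.num_hashes):
            digest = hashlib.blake2b(f"{seed}:{item}".encode(),
                                     digest_size=8).digest()
            yield int.from_bytes(digest, "little") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


visited = BloomFilter()
for node_id in (3, 17, 42):
    visited.add(node_id)
print(17 in visited)  # True: an inserted node is always reported as visited
```

In a graph traversal, a node reported as visited is skipped. A false positive therefore only skips a candidate that was never actually evaluated; it never triggers a repeated distance computation. The fixed-size bit array is what makes the structure attractive for on-chip memory.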
Article 37
Title@2025-07-18 (5): 4T2R X-ReRAM CiM Array for Variation-tolerant, Low-power, Massively Parallel MAC Operation
Title: 4T2R X-ReRAM CiM Array for Variation-tolerant, Low-power, Massively Parallel MAC Operation | 4T2R X-ReRAM CiM Array für variantentolerante, leistungsarme, massiv parallele MAC-Operation | 4T2R X-RERRAM CIM 用于机动性、耐变、低功率、大规模平行的MAC行动 2507.13631v1 |
Authors (6): Fuyuki Kihara, Seiji Uenohara, Satoshi Awamura, Naoko Misawa, Chihiro Matsui, Ken Takeuchi
Computation-in-Memory (CiM) is attracting attention as a technology that can perform the MAC calculations required for AI accelerators at high speed and with low power consumption. However, power consumption and device-derived errors increase as row parallelism increases. In this paper, a 4T2R ReRAM cell and an 8T SRAM CiM suitable for CiM are proposed. It is shown that adopting the proposed 4T2R ReRAM cell reduces the errors due to variation in ReRAM devices compared to conventional 4T4R ReRAM cells.
模拟计算(CiM)作为一种技术吸引了人们的注意,这种技术能够以低电耗的高速度对AI加速器进行MAC计算,但是,电耗和装置产生的错误随着行平行增加而增加,存在问题。本文建议使用一个4T2R ReRAM单元和一个适合Cim的8TSRAM 系统。 事实证明,采用拟议的 4T2R ReRAM 单元可以减少由于ReRAM装置与常规的 4T4R ReRAM 电池相比出现差异而产生的错误。
Article 38
Title@2025-07-18 (5): LASANA: Large-Scale Surrogate Modeling for Analog Neuromorphic Architecture Exploration
Title: LASANA: Large-Scale Surrogate Modeling for Analog Neuromorphic Architecture Exploration | LASANA: großflächige Surrogatmodellierung für die Erforschung der analogen neuromorphen Architektur | LASNA:模拟神经成形建筑勘探大型代谢模型 2507.10748v2 |
Authors (4): Jason Ho, James A. Boyle, Linshen Liu, Andreas Gerstlauer
Neuromorphic systems using in-memory or event-driven computing are motivated by the need for more energy-efficient processing of artificial intelligence workloads. Emerging neuromorphic architectures aim to combine traditional digital designs with the computational efficiency of analog computing and novel device technologies. A crucial problem in the rapid exploration and co-design of such architectures is the lack of tools for fast and accurate modeling and simulation. Typical mixed-signal design tools integrate a digital simulator with an analog solver like SPICE, which is prohibitively slow for large systems. By contrast, behavioral modeling of analog components is faster, but existing approaches are fixed to specific architectures with limited energy and performance modeling. In this paper, we propose LASANA, a novel approach that leverages machine learning to derive data-driven surrogate models of analog sub-blocks in a digital backend architecture. LASANA uses SPICE-level simulations of a circuit to train ML models that predict circuit energy, performance, and behavior at analog/digital interfaces. Such models can provide energy and performance annotation on top of existing behavioral models or function as replacements to analog simulation. We apply LASANA to an analog crossbar array and a spiking neuron circuit. Running MNIST and spiking MNIST, LASANA surrogates demonstrate up to three orders of magnitude speedup over SPICE, with energy, latency, and behavioral error less than 7%, 8%, and 2%, respectively.
使用模拟或事件驱动的内晶系统,其动机是需要以更节能的方式处理人工智能工作量。新兴神经形态结构的目的是将传统数字设计与模拟计算和新设备技术的计算效率相结合。快速探索和共同设计这类结构的一个关键问题是缺乏快速和精确建模和模拟的工具。典型的混合信号设计工具将数字模拟器与类似解答器(如SPICE)结合在一起,而对于大型系统来说,这种模拟器速度极慢。相比之下,模拟组件的行为建模更快,但现有方法固定在能源和性能建模有限的特定结构上。在本文件中,我们提出LASANA,这是利用机器学习在数字后端结构中生成模拟小块的数据驱动代谢模型的一个新办法。LASANA使用SPICE级电路模拟来培训ML模型,预测电路能、性能和模拟/数字界面的动作。这类模型可以提供能源和性能建模,在现有行为压速度和性能2级定型模型的顶端上,我们用SARIMISA的模拟模型和运行模型替换了三级的SIMLA。
Article 39
Title@2025-07-17 (4): GPU Performance Portability needs Autotuning
Title: GPU Performance Portability needs Autotuning | GPU Performance Portability benötigt Autotuning | GPU 性能表现 便捷性需要自动调节 2505.03780v3 |
Authors (3): Burkhard Ringlein, Thomas Parnell, Radu Stoica
As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. Today’s reliance on a single dominant platform limits portability, creates vendor lock-in, and raises barriers for new AI hardware. In this work, we make the case for combining just-in-time (JIT) compilation with comprehensive kernel parameter autotuning to enable portable LLM inference with state-of-the-art performance without code changes. Focusing on performance-critical LLM kernels, we demonstrate that this approach explores up to 15x more kernel parameter configurations, produces significantly more diverse code across multiple dimensions, and even outperforms vendor-optimized implementations by up to 230%, all while reducing kernel code size by 70x and eliminating manual code optimizations. Our results highlight autotuning as a promising path to unlocking model portability across GPU vendors.
随着LLMS的复杂程度不断增长,实现最先进的LLM性能要求严格地共同设计各种算法、软件和硬件。今天,对单一主导平台的依赖限制了可移动性,创建了供应商锁定,并为新的AI硬件制造了障碍。在这项工作中,我们有理由将即时(JIT)汇编与全面的内核参数自动调整结合起来,以便允许便携式LLM在不改变代码的情况下对最新性能进行推论。聚焦于性能临界的LLM内核,我们证明这一方法可以探索多达15x更多的内核参数配置,在多个维度上产生更多样化得多的代码,甚至比供应商优化执行率高出230 % , 而同时将内核代码的尺寸减少70x,并消除手动代码优化。我们的结果突出表明,自动化是释放GPU供应商模型可移植性能的一条有希望的道路。
Article 40
Title@2025-07-17 (4): WIP: Turning Fake Chips into Learning Opportunities
Title: WIP: Turning Fake Chips into Learning Opportunities | WIP: Fake Chips in Lernmöglichkeiten verwandeln | WIP:将假芯片转化为学习机会 2507.13281v1 |
Authors (3): Haniye Mehraban, Saad Azmeen-ur-Rahman, John Hu
This work-in-progress paper presents a case study in which counterfeit TL074 operational amplifiers, discovered in a junior-level electronics course, became the basis for a hands-on learning experience. Counterfeit integrated circuits (ICs) are increasingly common, posing a significant threat to the integrity of undergraduate electronics laboratories. Instead of simply replacing the counterfeit components, we turned the issue into a teaching moment. Students engaged in hands-on diagnostics: measuring current, analyzing waveforms, and troubleshooting. By working with fake chip components, they gained deeper insight into analog circuits, supply chain security, and practical engineering.
这份进行中的文件提出了一个案例研究,其中在初级电子课程中发现的伪造TL074操作放大器成为学习经验的基础;伪造综合电路(IC)日益普遍,对本科本科电子实验室的完整性构成重大威胁;我们没有简单地取代伪造部件,而是将问题变成了一个教学时刻;学生们从事手动诊断,测量电流,分析波形和排除故障;他们利用假芯片部件,更深入地了解模拟电路、供应链安全和实际工程。
Article 41
Title@2025-07-17 (4): High-Performance Pipelined NTT Accelerators with Homogeneous Digit-Serial Modulo Arithmetic
Title: High-Performance Pipelined NTT Accelerators with Homogeneous Digit-Serial Modulo Arithmetic | Hochleistungspipelined-NTT-Beschleuniger mit homogener Digit-Serialmodulo-Arithmetik | NTT 高性能加速器 2507.12418v2 |
Authors (3): George Alexakis, Dimitrios Schoinianakis, Giorgos Dimitrakopoulos
The Number Theoretic Transform (NTT) is a fundamental operation in privacy-preserving technologies, particularly within fully homomorphic encryption (FHE). The efficiency of NTT computation directly impacts the overall performance of FHE, making hardware acceleration a critical technology that will enable realistic FHE applications. Custom accelerators, in FPGAs or ASICs, offer significant performance advantages due to their ability to exploit massive parallelism and specialized optimizations. However, the operation of NTT over large moduli requires large word-length modulo arithmetic that limits achievable clock frequencies in hardware and increases hardware area costs. To overcome such deficits, digit-serial arithmetic has been explored for modular multiplication and addition independently. The goal of this work is to leverage digit-serial modulo arithmetic combined with appropriate redundant data representation to design modular pipelined NTT accelerators that operate uniformly on arbitrary small digits, without the need for intermediate (de)serialization. The proposed architecture enables high clock frequencies through regular pipelining while maintaining parallelism. Experimental results demonstrate that the proposed approach outperforms state-of-the-art implementations and reduces hardware complexity under equal performance and input-output bandwidth constraints.
数字理论变换(NTT)是保护隐私技术的基本操作,特别是在完全同质加密(FHE)中。NTT计算的效率直接影响FHE的总体性能,使硬件加速成为使FHE应用符合现实的关键技术。在FPGAs或ACIS中,定制加速器由于能够利用大规模平行和专门优化,因此具有很大的性能优势。然而,NTT在大型模末机上的运作需要大量的单长模数计算,从而限制硬件中可实现的时钟频率,增加硬件成本。为了克服这种缺陷,数字序列计算已经为模块化倍增和独立地探索了。这项工作的目标是利用数字序列计算法和适当的冗余数据代表来设计模块化的编审动式NTTT加速器,这些加速器在任意的小数字上统一运作,而不需要中间(分级)级化。拟议的结构通过常规管线,使高时钟频率得以保持平行状态。实验结果表明,拟议的方法比标准化了州-艺术倍化的配置和高频频度。
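To make the digit-serial idea in the abstract above concrete, here is a minimal software sketch. It is an assumption-laden illustration, not the proposed architecture: `digit_serial_modmul` consumes one operand a few bits at a time (most significant digit first) and reduces after every step, so no full-width product is ever formed. It does not model the redundant data representation or the pipelining described in the paper. The function names, the 4-bit digit size, and the toy parameters (modulus 17, length 8, root 2) are all chosen here for illustration.

```python
def digit_serial_modmul(a, b, q, digit_bits=4, width=16):
    """Compute (a * b) % q, consuming b one digit at a time, MSB digit first."""
    assert 0 <= a < q and 0 <= b < (1 << width)
    assert width % digit_bits == 0
    mask = (1 << digit_bits) - 1
    acc = 0
    for shift in range(width - digit_bits, -1, -digit_bits):
        digit = (b >> shift) & mask
        # Horner step: shift the partial result by one digit, add, reduce.
        acc = ((acc << digit_bits) + digit * a) % q
    return acc

def ntt(values, root, q):
    """Naive O(n^2) NTT: X[k] = sum_j x[j] * root^(j*k) mod q."""
    out = []
    for k in range(len(values)):
        total = 0
        for j, x in enumerate(values):
            twiddle = pow(root, j * k, q)
            total = (total + digit_serial_modmul(x, twiddle, q)) % q
        out.append(total)
    return out

# Toy parameters: 2 has multiplicative order 8 modulo the prime 17.
q, n, root = 17, 8, 2
x = [1, 2, 3, 4, 5, 6, 7, 8]
X = ntt(x, root, q)

# Inverse transform: use the inverse root, then scale by n^-1 (Fermat inverses).
root_inv = pow(root, q - 2, q)
n_inv = pow(n, q - 2, q)
recovered = [digit_serial_modmul(v, n_inv, q) for v in ntt(X, root_inv, q)]
```

A practical design would use a fast O(n log n) butterfly network and much larger moduli; the quadratic loop is kept only so the round trip is easy to check by hand.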
Article 42
Title@2025-07-17 (4): MC$^2$A: Enabling Algorithm-Hardware Co-Design for Efficient Markov Chain Monte Carlo Acceleration
Title: MC$^2$A: Enabling Algorithm-Hardware Co-Design for Efficient Markov Chain Monte Carlo Acceleration | MC$^2$A: Algorithm-Hardware Co-Design für effiziente Markov-Kette Monte Carlo Beschleunigung | MC$$2$A: 提高Markov链节蒙特卡洛速度加速速度的辅助算法-Hardware共同设计 2507.12935v1 |
Authors (6): Shirui Zhao, Jun Yin, Lingyun Yao, Martin Andraud, Wannes Meert, Marian Verhelst
An increasing number of applications are exploiting sampling-based algorithms for planning, optimization, and inference. The Markov Chain Monte Carlo (MCMC) algorithms form the computational backbone of this emerging branch of machine learning. Unfortunately, the high computational cost limits their feasibility for large-scale problems and real-world applications, and the existing MCMC acceleration solutions are either limited in hardware flexibility or fail to maintain efficiency at the system level across a variety of end-to-end applications. This paper introduces MC$^2$A, an algorithm-hardware co-design framework, enabling efficient and flexible optimization for MCMC acceleration. First, MC$^2$A analyzes the MCMC workload diversity through an extension of the processor performance roofline model with a third dimension to derive the optimal balance between the compute, sampling, and memory parameters. Second, MC$^2$A proposes a parametrized hardware accelerator architecture with flexible and efficient support of MCMC kernels with a pipeline of ISA-programmable tree-structured processing units, reconfigurable samplers, and a crossbar interconnect to support irregular access. Third, the core of MC$^2$A is powered by a novel Gumbel sampler that eliminates exponential and normalization operations. In the end-to-end case study, MC$^2$A achieves an overall $307.6\times$, $1.4\times$, $2.0\times$, and $84.2\times$ speedup compared to the CPU, GPU, TPU, and state-of-the-art MCMC accelerator, respectively. Evaluated on various representative MCMC workloads, this work demonstrates and exploits the feasibility of general hardware acceleration to popularize MCMC-based solutions in diverse application domains.
越来越多的应用程序正在利用基于取样的算法来进行规划、优化和推断。 Markov 链链 Monte Carlo( MCMC) 算法构成了这个新兴机器学习分支的计算主干。 不幸的是, 高计算成本限制了大规模问题和现实世界应用的可行性, 以及现有的 MMC加速解决方案在硬件灵活性上受到限制, 或者未能在各种端至端应用中保持系统一级的效率。 本文引入了 textbf{ MC$2$2, 3美元 。 本文引入了一种基于算法的硬软件共同设计框架, 使 MMC 加速的高效和灵活优化。 首先,\ textb{ MC$2$A} 通过扩展进程或性能操作模型来分析 MMC 的多样化。
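The abstract above does not describe the internals of its Gumbel sampler, so the following is only a sketch of the standard Gumbel-max trick that such samplers build on. Given unnormalized log-weights, adding independent Gumbel noise to each and taking the argmax draws an index with probability proportional to the exponentiated weight, without ever exponentiating the weights or computing a normalizing sum. The function name `gumbel_max_sample` and the example weights are assumptions made for illustration.

```python
import math
import random

def gumbel_max_sample(log_weights, rng):
    """Draw index i with probability proportional to exp(log_weights[i])."""
    best_index, best_score = -1, -math.inf
    for i, lw in enumerate(log_weights):
        u = rng.random()
        while u == 0.0:  # rng.random() is in [0, 1); avoid log(0)
            u = rng.random()
        gumbel_noise = -math.log(-math.log(u))
        score = lw + gumbel_noise  # no exp() of the weight, no normalization
        if score > best_score:
            best_index, best_score = i, score
    return best_index

# Unnormalized weights 1 : 2 : 7, supplied directly in the log domain.
rng = random.Random(0)
log_weights = [0.0, math.log(2.0), math.log(7.0)]
counts = [0, 0, 0]
for _ in range(20000):
    counts[gumbel_max_sample(log_weights, rng)] += 1
freqs = [c / 20000 for c in counts]  # should approach 0.1, 0.2, 0.7
```

The noise generation still uses logarithms in this software version; how the hardware produces Gumbel noise is not stated in the abstract and is not modeled here.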
Article 43
Title@2025-07-17 (4): An ultra-low-power CGRA for accelerating Transformers at the edge
Title: An ultra-low-power CGRA for accelerating Transformers at the edge | Ein ultra-low-power CGRA zur Beschleunigung von Transformern am Rand | 用于加速边缘变压器的超低功率CGRAA 2507.12904v1 |
Authors (1): Rohit Prasad
Transformers have revolutionized deep learning with applications in natural language processing, computer vision, and beyond. However, their computational demands make it challenging to deploy them on low-power edge devices. This paper introduces an ultra-low-power, Coarse-Grained Reconfigurable Array (CGRA) architecture specifically designed to accelerate General Matrix Multiplication (GEMM) operations in transformer models tailored for the energy and resource constraints of edge applications. The proposed architecture integrates a 4 x 4 array of Processing Elements (PEs) for efficient parallel computation and dedicated 4 x 2 Memory Operation Blocks (MOBs) for optimized LOAD/STORE operations, reducing memory bandwidth demands and enhancing data reuse. A switchless mesh torus interconnect network further minimizes power and latency by enabling direct communication between PEs and MOBs, eliminating the need for centralized switching. Through its heterogeneous array design and efficient dataflow, this CGRA architecture addresses the unique computational needs of transformers, offering a scalable pathway to deploy sophisticated machine learning models on edge devices.
变异器通过自然语言处理、计算机视觉等应用和自然语言处理、计算机视觉等应用的深层次学习实现了革命。 但是,它们的计算要求使得在低功率边缘装置上部署它们具有挑战性。 本文介绍了一个超低功率的、 粗略的可重新配置的阵列( CGRA) 结构, 专门旨在加速通用矩阵乘法( GEMM) 变异模型中为边缘应用的能源和资源限制而设计的变异器操作。 拟议的结构整合了4x4组处理元素(PE) , 用于高效平行计算, 并专门为优化 LOAD/STOR 操作、 减少记忆带宽需求和加强数据再利用专门设计 4x 2 记忆操作块(MOBs) 。 一个无开关网际连接, 使 PE 和 MOBs 之间能够直接通信, 从而消除对中央转换的需要, 从而进一步将能量和耐拉特。 通过其混合阵列设计和高效的数据流, 该结构解决变异器的独特计算需要, 提供了在边缘装置上部署尖端机器学习模型的可扩缩路径。
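The data-reuse argument in the abstract above can be sketched with a small software model. The abstract gives the 4 x 4 PE array size but not the dataflow, so the output-stationary mapping below is an assumption: each PE keeps one accumulator of a 4 x 4 output tile, and every operand fetched is shared along a PE row or column. The function name `gemm_on_pe_array` and the load-counting scheme are hypothetical and do not model the MOBs or the torus interconnect.

```python
PE_ROWS, PE_COLS = 4, 4  # the 4 x 4 PE array size given in the abstract

def gemm_on_pe_array(a, b):
    """Return (C, loads, macs) for C = A @ B using 4 x 4 output tiles."""
    m, k_dim, n = len(a), len(b), len(b[0])
    assert len(a[0]) == k_dim
    assert m % PE_ROWS == 0 and n % PE_COLS == 0
    c = [[0] * n for _ in range(m)]
    loads = macs = 0
    for i0 in range(0, m, PE_ROWS):
        for j0 in range(0, n, PE_COLS):
            # One accumulator per PE, kept local for the whole tile.
            acc = [[0] * PE_COLS for _ in range(PE_ROWS)]
            for k in range(k_dim):
                a_col = [a[i0 + r][k] for r in range(PE_ROWS)]  # 4 loads
                b_row = [b[k][j0 + s] for s in range(PE_COLS)]  # 4 loads
                loads += PE_ROWS + PE_COLS
                for r in range(PE_ROWS):
                    for s in range(PE_COLS):
                        # Each loaded operand is reused by 4 PEs.
                        acc[r][s] += a_col[r] * b_row[s]
                        macs += 1
            for r in range(PE_ROWS):
                for s in range(PE_COLS):
                    c[i0 + r][j0 + s] = acc[r][s]
    return c, loads, macs

# Small integer example so the result can be checked exactly.
a = [[(i * 8 + j) % 5 for j in range(8)] for i in range(8)]
b = [[(i + 2 * j) % 7 for j in range(8)] for i in range(8)]
c, loads, macs = gemm_on_pe_array(a, b)
```

In this model each step fetches 8 operands and performs 16 multiply-accumulates, whereas a scheme with no sharing would fetch two operands per multiply-accumulate; that gap is the kind of reuse a PE array is meant to exploit.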