cs.AR @ 2025-06-25: 046
-
00 06-24 (2) Robotics Under Construction: Challenges on Job Sites Robotik im Bau: Herausforderungen auf Job-Sites 正在建设中的机器人:工作场所的挑战 2506.19597v1 -
01 06-24 Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference Mischung aus Cache-Conditional Experts für effiziente mobile Geräteableitung 高效移动设备引力缓存-条件专家混合 2412.00099v2 -
02 06-24 ChatModel: Automating Reference Model Design and Verification with LLMs ChatModel: Automatisieren von Referenzmodell-Design und Überprüfung mit LLMs 聊天模式:使用LLMs自动使用参考模型设计和核查 2506.15066v2 -
03 06-24 In-Memory Sorting-Searching with Cayley Tree In-Memory-Sortierung-Suche mit Cayley-Baum 与 Cayley 树一起在记忆中排序- 使用 Cayley 树 2506.19379v1 -
04 06-24 Efficient Hardware Implementation of Modular Multiplier over GF (2m) on FPGA Effiziente Hardware-Implementierung des Modularen Multiplikators über GF (2m) auf FPGA 在FPGA上实施超过2米的GF(2GF)的模块倍增器的高效硬件 2506.09464v3 -
05 06-23 (1) MEDEA: A Design-Time Multi-Objective Manager for Energy-Efficient DNN Inference on Heterogeneous Ultra-Low Power Platforms MEDEA: Design-Time Multi-Objective Manager für energieeffiziente DNN-Inferenz auf heterogenen Ultra-Low-Power-Plattformen MEDEA: 能源效率DNN热能超低层动力平台的热源性DN推论设计-时间多目标管理员 2506.19067v1 -
06 06-23 Design high-confidence computers using trusted instructional set architecture and emulators Design von Computern mit hohem Vertrauen mit vertrauenswürdigen Lehrset-Architektur und Emulatoren 使用可信赖的教学系统架构和模拟器设计高自信计算机 2506.18780v1 -
07 06-23 EasyDRAM: An FPGA-based Infrastructure for Fast and Accurate End-to-End Evaluation of Emerging DRAM Techniques EasyDRAM: Eine FPGA-basierte Infrastruktur für eine schnelle und präzise End-to-End-Bewertung neu entstehender DRAM-Techniken EasyDRAM: 以FPGA为基础的快速和准确端至端评价新兴DRAM技术基础设施 2506.10441v2 -
08 06-23 Embedded FPGA Acceleration of Brain-Like Neural Networks: Online Learning to Scalable Inference Embedded FPGA Beschleunigung von Gehirn-ähnlichen Neuronalen Netzwerken: Online-Lernen zu skalierbarer Inferenz FPGA 脑如神经网络加速FPGA嵌入式FPGA:在线学习可缩放推论 2506.18530v1 -
09 06-23 AnalogNAS-Bench: A NAS Benchmark for Analog In-Memory Computing AnalogNAS-Bench: Ein NAS-Benchmark für analoges In-Memory Computing AnalogNAS-Bench:NAS模拟计算基准 2506.18495v1 -
10 06-22 (7) AMD Versal Implementations of FAM and SSCA Estimators AMD-Versal-Implementierungen von FAM und SSCA-Schätzern AMD 快速实施FAM和SSCA模拟器 2506.18003v1 -
11 06-22 Cost-Effective Optimization and Implementation of the CRT-Paillier Decryption Algorithm for Enhanced Performance Kosteneffiziente Optimierung und Implementierung des CRT-Paillier Entschlüsselungs-Algorithmus für verbesserte Leistung 以成本-效益高的方式优化和实施CRT-Paillier 解密算法,提高性能 2506.17935v1 -
12 06-22 GainSight: Application-Guided Profiling for Composing Heterogeneous On-Chip Memories in AI Hardware Accelerators GainSight: Anwendungsorientierte Profilierung für das Komponieren heterogener On-Chip-Erinnerungen in KI-Hardwarebeschleunigern GainSight: 在 AI 硬件加速器中组合不同质在芯片上的记忆的应用指导分析 2504.14866v3 -
13 06-21 (6) Residue Number System (RNS) based Distributed Quantum Multiplication Rückstandszahlsystem (RNS) basiert auf verteilter Quanten-Multiplikation 基于残余数字系统(RNS)的分布量乘法 2506.17588v1 -
14 06-21 LLM-Aided Testbench Generation and Bug Detection for Finite-State Machines LLM-Aided Testbench Generation und Fehlererkennung für Finite-State-Maschinen LLM 辅助有限国机器的LLM辅助试验箱生成和虫虫检测 2406.17132v2 -
15 06-21 ASC-Hook: fast and transparent system call hook for Arm ASC-Hook: Schneller und transparenter Systemrufhaken für Arm ASC-Hook: 快速和透明的系统调用武器钩 2412.05784v4 -
16 06-20 (5) $Δ$-Nets: Interaction-Based System for Optimal Parallel $λ$-Reduction $Δ$-Nets: Interaktionsbasiertes System für eine optimale parallele $λ$-Reduktion $Δ$-Nets:基于交互的最优并行 $λ$-归约系统 2505.20314v3 -
17 06-20 Assessing Tenstorrent’s RISC-V MatMul Acceleration Capabilities Bewertung der RISC-V-MatMul-Beschleunigungsfähigkeiten von Tenstorrent 评估出租人的RISC-V MatMul加速能力 2505.06085v3 -
18 06-20 RCNet: $ΔΣ$ IADCs as Recurrent AutoEncoders RCNet: $ΔΣ$ IADCs als rekurrente AutoEncoder RCNet:作为循环自编码器的 $ΔΣ$ IADC 2506.16903v1 -
19 06-20 Lookup Table-based Multiplication-free All-digital DNN Accelerator Featuring Self-Synchronous Pipeline Accumulation Lookup Tischbasierte Multiplikationsfreies All-Digital-DNN-Beschleuniger mit selbstsynchroner Pipeline-Akkumulation 基于表格的无乘法化表格 全数字 DNN 加速器 自动同步管道累积 2506.16800v1 -
20 06-20 CodeV-R1: Reasoning-Enhanced Verilog Generation CodeV-R1: Reasoning-gestützte Verilog-Generierung CodeV-R1:推理增强的 Verilog 生成 2505.24183v2 -
21 06-20 Zoozve: A Strip-Mining-Free RISC-V Vector Extension with Arbitrary Register Grouping Compilation Support (WIP) Zoozve: Eine Strip-Mining-freie RISC-V-Vektorerweiterung mit arbitrary Register-Gruppierungsunterstützung (WIP) Zoozve:利用任意登记册组汇编支助,扩大一个带分解自由的RISC-V矢量扩展区(WIP) 2504.15678v2 -
22 06-19 (4) SparseDPD: A Sparse Neural Network-based Digital Predistortion FPGA Accelerator for RF Power Amplifier Linearization SparseDPD: Ein Sparse Neural Network-basierter Digital Predistortion FPGA Accelerator für RF Leistungsverstärker Linearisierung SparseDPD: 一种基于松散神经网络的FPGA 用于RF Power 放大器线性化的松式神经网络数字预感加速器 2506.16591v1 -
23 06-19 REIS: A High-Performance and Energy-Efficient Retrieval System with In-Storage Processing REIS: Ein leistungsstarkes und energieeffizientes Retrieval-System mit In-Storage-Verarbeitung REIS:具有在系统内处理的高效能和节能检索系统 2506.16444v1 -
24 06-19 All-in-One Analog AI Hardware: On-Chip Training and Inference with Conductive-Metal-Oxide/HfOx ReRAM Devices All-in-One Analog-KI-Hardware: On-Chip-Training und Schlussfolgerung mit leitfähig-Metall-Oxid/HfOx-ReRAM-Geräten A. AI 硬件:与导体-金属-氧化物/HfOx雷射装置的芯片培训和推断 2502.04524v4 -
25 06-19 Serving Large Language Models on Huawei CloudMatrix384 Große Sprachmodelle auf Huawei CloudMatrix384 瓦威云马特列克384 2506.12708v3 -
26 06-19 Microcanonical simulated annealing: Massively parallel Monte Carlo simulations with sporadic random-number generation Mikrokanonisches simuliertes Glühen: Massiv parallele Monte-Carlo-Simulationen mit sporadischer Zufalls-Zahlen-Generierung 模拟微气候模拟Annealing:以零星随机数生成的大规模平行的蒙特卡洛模拟模拟 2506.16240v1 -
27 06-19 How to Increase Energy Efficiency with a Single Linux Command Wie man die Energieeffizienz mit einem einzigen Linux-Befehl erhöht 如何通过单Linux指挥来提高能源效率 2506.16046v1 -
28 06-19 HetGPU: The pursuit of making binary compatibility towards GPUs HetGPU: Das Streben nach binärer Kompatibilität gegenüber GPUs HETGPU: 努力使二进制兼容到 GPUs 2506.15993v1 -
29 06-19 Profile-Guided Temporal Prefetching Profilgeführtes Temporales Prefetching 预测性指导时间间隔期 2506.15985v1 -
30 06-18 (3) Sudoku: Decomposing DRAM Address Mapping into Component Functions Sudoku: Dekomponieren von DRAM-Adressen Mapping in Komponentenfunktionen 数独: 将 DRAM 地址映射解析成元函数 2506.15918v1 -
31 06-18 Bias Variation Compensation in Perimeter-Gated SPAD TRNGs Bias Variation Compensation in Perimeter-Gated SPAD TRNGs 围地的SPAD TRNGs中的比值变化赔偿 2506.15888v1 -
32 06-18 A System Level Compiler for Massively-Parallel, Spatial, Dataflow Architectures Ein System Level Compiler für massiv-Parallel-, räumliche, Datenflussarchitekturen 用于Massalizy-Parallel、空间、数据流结构的系统级汇编器 2506.15875v1 -
33 06-18 SR-NCL: an Area-/Energy-Efficient Resilient NCL Architecture Based on Selective Redundancy SR-NCL: eine flächen-/energieeffiziente NCL-Architektur basierend auf selektiver Redundanz SR-NCL:基于选择性冗余的地区/能源/能有效抗力的NCL建筑 2506.15634v1 -
34 06-18 From Block to Byte: Transforming PCIe SSDs with CXL Memory Protocol and Instruction Annotation Von Block zu Byte: Transformation von PCIe SSDs mit CXL Memory Protocol und Instruction Annotation 从块到字节: 用 CXL 内存协议和指令注释转换 PCIe SSD 2506.15613v1 -
35 06-18 CXL-GPU: Pushing GPU Memory Boundaries with the Integration of CXL Technologies CXL-GPU: GPU-Speichergrenzen mit der Integration von CXL-Technologien drücken CXL-GPU:推动GPU记忆边界与CXL技术一体化 2506.15601v1 -
36 06-18 Acore-CIM: build accurate and reliable mixed-signal CIM cores with RISC-V controlled self-calibration Acore-CIM: Aufbau präziser und zuverlässiger CIM-Kerne mit RISC-V-gesteuerter Selbstkalibrierung 核心CIM:建立准确可靠的混合信号CIM核心,并配有RISC-V控制的自我校准 2506.15440v1 -
37 06-18 J3DAI: A tiny DNN-Based Edge AI Accelerator for 3D-Stacked CMOS Image Sensor J3DAI: Ein winziger DNN-basierter Edge-KI-Beschleuniger für 3D-gestapelten CMOS-Bildsensor J3DAI: 用于 3D 堆放 CMOS 图像传感器的小型 DNN 以 DNN 为基础的 AI 自动加速器 2506.15316v1 -
38 06-17 (2) Scaling Intelligence: Designing Data Centers for Next-Gen Language Models Scaling Intelligence: Konzipieren von Rechenzentren für Sprachmodelle der nächsten Generation 扩大情报范围:为下一代语言模型设计数据中心 2506.15006v1 -
39 06-17 VeriLeaky: Navigating IP Protection vs Utility in Fine-Tuning for LLM-Driven Verilog Coding VeriLeaky: Navigieren von IP-Schutz gegen Utility in Feinsteuerung für LLM-getriebene Verilog Coding VeriLeaky:在LLM-Driven Verilog编码精密设计中导航 IP保护相对于实用性 2503.13116v4 -
40 06-17 ASAP-FE: Energy-Efficient Feature Extraction Enabling Multi-Channel Keyword Spotting on Edge Processors ASAP-FE: Energieeffiziente Feature-Extraktion Ermöglichung von Multi-Channel-Schlüsselwort-Spoting auf Edge-Prozessoren ASAP-FE:能源效率地貌采掘促进多通道关键词 2506.14657v1 -
41 06-17 Guaranteed Guess: A Language Modeling Approach for CISC-to-RISC Transpilation with Testing Guarantees Guaranteed Guess: Ein Sprachmodellierungsansatz für CISC-to-RISC Transpilation mit Testgarantien 有担保的猜测:具有测试保证的CISC到RISC传输语言模拟方法 2506.14606v1 -
42 06-17 Empirically-Calibrated H100 Node Power Models for Reducing Uncertainty in AI Training Energy Estimation Empirisch kalibrierte H100-Knotenstrommodelle zur Reduzierung von Unsicherheiten in der KI-Trainingsenergieschätzung AI 培训能源估算中减少不确定性的活性校准 H100 节点动力模型 2506.14551v1 -
43 06-17 Tensor Manipulation Unit (TMU): Reconfigurable, Near-Memory Tensor Manipulation for High-Throughput AI SoC Tensor-Manipulationseinheit (TMU): Rekonfigurierbare, Nahspeicher-Tensor-Manipulation für High-Throughput-KI-SoC Tensor 操纵装置(TMU):高干压 AI SoC 的可重新配置、近近中间的Tensor 操纵装置 2506.14364v1 -
44 06-17 CXLMemSim: A pure software simulated CXL.mem for performance characterization CXLMemSim: Eine reine Software simulierte CXL.mem für die Leistungscharakterisierung CXLMemSim:一种模拟CXL.mem用于性能定性的纯软件 2303.06153v2 -
45 06-17 Comprehensive Verilog Design Problems: A Next-Generation Benchmark Dataset for Evaluating Large Language Models and Agents on RTL Design and Verification Umfassende Verilog-Design-Probleme: Ein Benchmark-Datensatz für die nächste Generation zur Bewertung großer Sprachmodelle und -agenten auf RTL-Design und -Verifikation 全面的纵向设计问题:用于评价关于RTL设计和核查的大型语言模型和代理人的下一个基准数据集 2506.14074v1
Article 0
Title@2025-06-24 (2): Robotics Under Construction: Challenges on Job Sites
Title: Robotics Under Construction: Challenges on Job Sites | Robotik im Bau: Herausforderungen auf Job-Sites | 正在建设中的机器人:工作场所的挑战 2506.19597v1 |
Authors (8): Haruki Uchiito, Akhilesh Bhat, Koji Kusaka, Xiaoya Zhang, Hiraku Kinjo, Honoka Uehara, Motoki Koyama, Shinji Natsume
As labor shortages and productivity stagnation increasingly challenge the construction industry, automation has become essential for sustainable infrastructure development. This paper presents an autonomous payload transportation system as an initial step toward fully unmanned construction sites. Our system, based on the CD110R-3 crawler carrier, integrates autonomous navigation, fleet management, and GNSS-based localization to facilitate material transport in construction site environments. While the current system does not yet incorporate dynamic environment adaptation algorithms, we have begun fundamental investigations into external-sensor-based perception and mapping systems. Preliminary results highlight the potential challenges, including navigation in evolving terrain, environmental perception under construction-specific conditions, and sensor placement optimization for improving autonomy and efficiency. Looking forward, we envision a construction ecosystem where collaborative autonomous agents dynamically adapt to site conditions, optimizing workflow and reducing human intervention. This paper provides foundational insights into the future of robotics-driven construction automation and identifies critical areas for further technological development.
由于劳动力短缺和生产力停滞日益对建筑业构成挑战,自动化已成为可持续基础设施发展的关键。本文件介绍了一个自主有效载荷运输系统,作为向完全无人驾驶建筑工地迈出的第一步。我们基于CD110R-3爬行载体的系统整合了自主导航、车队管理和基于全球导航卫星系统的本地化,以便利建筑工地环境中的物质运输。虽然目前的系统尚未纳入动态环境适应算法,但我们已开始对基于外部传感器的感知和绘图系统进行根本性调查。初步结果凸显了潜在挑战,包括地貌变化中的导航、特定建筑条件下的环境观以及改善自主和效率的传感器优化。展望未来,我们设想了一个自主协作剂动态适应场地条件、优化工作流程和减少人类干预的建筑生态系统。这份文件为机器人驱动的建筑自动化的未来提供了基础见解,并确定了进一步技术发展的关键领域。
Article 1
Title@2025-06-24 (2): Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference
Title: Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference | Mischung aus Cache-Conditional Experts für effiziente mobile Geräteableitung | 高效移动设备引力缓存-条件专家混合 2412.00099v2 |
Authors (8): Andrii Skliar, Ties van Rozendaal, Romain Lepert, Todor Boinovski, Mart van Baalen, Markus Nagel, Paul Whatmough, Babak Ehteshami Bejnordi
Mixture of Experts (MoE) LLMs have recently gained attention for their ability to enhance performance by selectively engaging specialized subnetworks or “experts” for each input. However, deploying MoEs on memory-constrained devices remains challenging, particularly when generating tokens sequentially with a batch size of one, as opposed to typical high-throughput settings involving long sequences or large batches. In this work, we optimize MoE on memory-constrained devices where only a subset of expert weights fit in DRAM. We introduce a novel cache-aware routing strategy that leverages expert reuse during token generation to improve cache locality. We evaluate our approach on language modeling, MMLU, and GSM8K benchmarks and present on-device results demonstrating 2$\times$ speedups on mobile devices, offering a flexible, training-free solution to extend MoE’s applicability across real-world applications.
专家混合公司(MoE)LLMs最近通过有选择地使用专门的子网络或“专家”来提高每个输入的性能的能力得到了关注。然而,在记忆限制装置上部署MOE仍然具有挑战性,特别是在按批量大小生成标牌时,与涉及长序列或大批量的典型高通量设置相反,在这项工作中,我们优化了存储限制装置上的MOE,因为只有一组专家重量才适合DRAM。我们引入了一种新的缓存系统战略,利用专家在象征性生成过程中的再利用来改进缓存地点。我们评估了我们在语言建模、MMLU和GSM8K基准方面的做法,并展示了在移动装置上显示2美元时间的超速率,提供了灵活、无培训的解决方案,以扩大MOE在现实应用中的应用范围。
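The abstract above does not spell out the routing rule, so the following is only a minimal sketch of the general idea of cache-aware routing, under stated assumptions: router logits of experts already resident in DRAM receive a fixed `bonus` before top-k selection, and residency is tracked with an LRU cache. The names `route_cache_aware`, `ExpertCache`, `run`, and the `bonus` parameter are hypothetical and are not taken from the paper.

```python
from collections import OrderedDict


class ExpertCache:
    """LRU set of expert ids that are assumed to be resident in DRAM."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()
        self.misses = 0  # number of expert loads from slower storage

    def __contains__(self, expert):
        return expert in self.slots

    def touch(self, expert):
        """Mark an expert as used, loading it (and evicting the LRU one) on a miss."""
        if expert in self.slots:
            self.slots.move_to_end(expert)
            return
        self.misses += 1
        if len(self.slots) >= self.capacity:
            self.slots.popitem(last=False)  # evict least recently used
        self.slots[expert] = True


def route_cache_aware(logits, cache, top_k=2, bonus=1.0):
    """Hypothetical rule: add `bonus` to cached experts' logits, then take top_k."""
    adjusted = [l + (bonus if e in cache else 0.0) for e, l in enumerate(logits)]
    return sorted(range(len(logits)), key=lambda e: adjusted[e], reverse=True)[:top_k]


def run(token_logits, capacity, bonus):
    """Route a token sequence one token at a time and count cache misses."""
    cache = ExpertCache(capacity)
    for logits in token_logits:
        for expert in route_cache_aware(logits, cache, top_k=2, bonus=bonus):
            cache.touch(expert)
    return cache.misses


# Toy router outputs for three tokens and four experts (made-up numbers).
tokens = [
    [2.0, 1.9, 1.8, 1.7],
    [1.7, 1.8, 1.9, 2.0],
    [2.0, 1.9, 1.8, 1.7],
]
print(run(tokens, capacity=2, bonus=0.0))  # plain top-k routing
print(run(tokens, capacity=2, bonus=0.5))  # cache-aware routing
```

With these toy logits, the biased router keeps reusing the two cached experts, while plain top-k routing reloads experts on every token. A bias of this kind changes which experts are selected, so in practice it trades cache locality against model quality.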
Article 2
Title@2025-06-24 (2): ChatModel: Automating Reference Model Design and Verification with LLMs
Title: ChatModel: Automating Reference Model Design and Verification with LLMs | ChatModel: Automatisieren von Referenzmodell-Design und Überprüfung mit LLMs | 聊天模式:使用LLMs自动使用参考模型设计和核查 2506.15066v2 |
Authors (6): Jianmin Ye, Tianyang Liu, Qi Tian, Shengchu Su, Zhe Jiang, Xi Wang
As the complexity of integrated circuit designs continues to escalate, the functional verification becomes increasingly challenging. Reference models, critical for accelerating the verification process, are themselves becoming more intricate and time-consuming to develop. Despite the promise shown by large language models (LLMs) in code programming, effectively generating complex reference models remains a significant hurdle. To address these challenges, we introduce ChatModel, the first LLM-aided agile reference model generation and verification platform. ChatModel streamlines the transition from design specifications to fully functional reference models by integrating design standardization and hierarchical agile modeling. Employing a building-block generation strategy, it not only enhances the design capabilities of LLMs for reference models but also significantly boosts verification efficiency. We evaluated ChatModel on 300 designs of varying complexity, demonstrating substantial improvements in both efficiency and quality of reference model generation. ChatModel achieved a peak performance improvement of 55.02% compared to alternative methods, with notable enhancements in generation stability, and delivered a 9.18x increase in its capacity to produce reference model designs. Furthermore, it accelerated the iterative process of reference model design and validation by an average of 5.90x compared to traditional approaches. These results highlight the potential of ChatModel to significantly advance the automation of reference model generation and validation.
由于集成电路设计的复杂性继续升级,功能性核查变得日益具有挑战性。对于加速核查进程至关重要的参考模型本身正在变得更加复杂和耗时地开发。尽管大型语言模型(LLMs)在代码编程中显示了希望,但有效生成复杂的参考模型仍是一个重大障碍。为了应对这些挑战,我们引入了ChatModel,即第一个由LLM协助的LLM型快速参考模型生成和核查平台。ChatModel通过整合设计标准化和等级灵活建模,简化了从设计规格向完全功能性能参考模型的过渡。采用建筑区块生成战略,不仅提高了LLMs用于参考模型的设计能力,而且还大大提高了核查效率。我们评估了300种不同复杂设计的ChatModel,表明在创建参考模型的效率和质量方面都有很大改进。ChatModel实现了与替代方法相比最高性能改进55.02%,并显著加强了生产参考模型设计的能力。此外,它加速了参考模型设计和验证的迭接过程,比传统模型的参照率平均提高了5.90x。这些结果突出表明了Chadel的自动化。
Article 3
Title@2025-06-24 (2): In-Memory Sorting-Searching with Cayley Tree
Title: In-Memory Sorting-Searching with Cayley Tree | In-Memory-Sortierung-Suche mit Cayley-Baum | 与 Cayley 树一起在记忆中排序- 使用 Cayley 树 2506.19379v1 |
Authors (3): Subrata Paul, Sukanta Das, Biplab K Sikdar
This work proposes a computing model to reduce the workload of the CPU. It performs data-intensive computation in memory, where the data reside, and thereby realizes an in-memory computing (IMC) platform. Each memory word, with additional logic, acts as a tiny processing element that forms a node of a Cayley tree. The Cayley tree in turn defines the framework for solving data-intensive computational problems. It provides solutions for in-memory searching, computing the max (min) in memory, and in-memory sorting while reducing the involvement of the CPU. The worst-case time complexity of the IMC-based solutions for in-memory searching and computing the max (min) in memory is $\mathcal{O}(\log n)$. Such solutions are independent of the order of elements in the list. The worst-case time complexity of in-memory sorting, on the other hand, is $\mathcal{O}(n \log n)$. Two types of hardware implementations of the IMC platform are proposed. One is based on the existing/conventional memory architecture, and the other on a newly defined memory architecture. The solutions are further implemented on an FPGA platform to demonstrate the effectiveness of the IMC architecture in comparison with state-of-the-art designs.
这项工作建议了一个计算模型来减少 CPU 的工作量。 它依赖于存储中的数据密集计算, 数据所在的位置, 并有效地实现一个模拟计算平台。 每个记忆单词, 加上额外的逻辑, 都作为构成 Cayley 树节点的微小处理元素。 Cayley 树反过来又定义了解决数据密集计算问题的框架。 它在减少 CCPU 参与的同时, 依靠记忆中的数据密集计算。 IMC 基础的内模搜索和计算( 最小) 最大( 最小) 计算( 最小) 最差的复杂时间是$\ mathcal{ Olog{ n} 。 这些解决方案独立于列表中元素的顺序。 模拟分类中最差的个案时间复杂性是 $\ mathcal{ minal{O} (n\log{n} $。 IMC 平台的硬件实施两种类型。 IMC 平台的硬件实施方法最坏的时间复杂性是提议。 其中一种基于现有/ 常规存储结构, 和新定义的IMCFA 格式结构。
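To illustrate why a tree of comparing nodes can find the maximum in a logarithmic number of steps, here is a small software sketch. It assumes a generic balanced binary reduction tree in which all comparisons of one level happen in the same round; it is not the paper's Cayley-tree hardware, and the function name `tree_max` is hypothetical.

```python
def tree_max(values):
    """Return (maximum, rounds) using a pairwise tree reduction.

    Each round models one level of the tree: neighbouring pairs are
    compared "in parallel" and only the winners move up one level.
    """
    if not values:
        raise ValueError("tree_max() needs at least one value")
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # An odd element at the end is carried up unchanged.
        level = [max(level[i:i + 2]) for i in range(0, len(level), 2)]
        rounds += 1
    return level[0], rounds


print(tree_max([7, 3, 9, 1, 4, 8, 2, 6]))
```

For 8 values the loop runs 3 rounds (8, 4, 2, 1 candidates), and the round count depends only on the list length, not on the order of the elements, which matches the order-independence stated in the abstract.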
Article 4
Title@2025-06-24 (2): Efficient Hardware Implementation of Modular Multiplier over GF (2m) on FPGA
Title: Efficient Hardware Implementation of Modular Multiplier over GF (2m) on FPGA | Effiziente Hardware-Implementierung des Modularen Multiplikators über GF (2m) auf FPGA | 在FPGA上实施超过2米的GF(2GF)的模块倍增器的高效硬件 2506.09464v3 |
Authors (3): Ruby Kumari, Gaurav Purohit, Abhijit Karmakar
Elliptic curve cryptography (ECC) has emerged as the dominant public-key protocol, with NIST standardizing parameters for binary field GF(2^m) ECC systems. This work presents a hardware implementation of a Hybrid Multiplication technique for modular multiplication over the binary field GF(2^m), targeting the NIST B-163, 233, 283, and 571 parameters. The design optimizes the combination of conventional multiplication (CM) and Karatsuba multiplication (KM) to enhance elliptic curve point multiplication (ECPM). The key innovation uses CM for smaller operands (up to 41 bits for m=163) and KM for larger ones, reducing computational complexity and enhancing efficiency. The design is evaluated in three areas. (1) Resource utilization: for m=163, the hybrid design uses 6,812 LUTs, a 39.82% reduction compared to conventional methods; for m=233, LUT usage drops by 45.53% and 70.70% compared to overlap-free and bit-parallel implementations, respectively. (2) Delay performance: for m=163, the design achieves a 13.31 ns delay, improving by 37.60% over bit-parallel implementations; for m=233, it maintains a 13.39 ns delay. (3) Area-delay product (ADP): for m=163, the design achieves an ADP of 90,860, outperforming bit-parallel (75,337) and digit-serial (43,179) implementations; for m=233, it demonstrates a 16.86% improvement over overlap-free and a 96.10% improvement over bit-parallel designs. Results show that the hybrid technique significantly improves speed, hardware efficiency, and resource utilization for ECC cryptographic systems.
剪切曲线加密法(ECC)已成为主要的公用钥匙协议,NIST对二进制实地GF(2^m)ECC系统的标准化参数,这项工作展示了对二进制实地GF(2^m)的模块倍增组合倍增技术的硬件实施,目标是NIST B-163、233、283和571参数。设计优化了常规倍增(CM)和Karatsuba倍增(KM)的组合,目的是提高椭圆曲线点的倍增(ECPM) 。关键创新使用内存用于小型剧团(M=163至41位位元)和知识管理,降低计算复杂性和提高效率。设计在三个领域进行了评价:M=163资源利用混合倍倍倍倍倍倍增,混合设计使用6 812 LUTs,比常规方法减少了39.82%。M=233,LUT的利用率比重减少了45.53%和70.70%。
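The hybrid CM/KM idea can be sketched in software as follows. This is a minimal illustration, not the paper's hardware design: polynomials over GF(2) are stored as Python integers, Karatsuba splitting is applied recursively, and the recursion falls back to shift-and-XOR (conventional) multiplication once operands are at most `threshold` bits wide. The threshold of 41 bits comes from the abstract; the function names are hypothetical, and the reduction polynomial is the standard NIST B-163 one, f(x) = x^163 + x^7 + x^6 + x^3 + 1.

```python
M = 163
# NIST B-163 reduction polynomial: x^163 + x^7 + x^6 + x^3 + 1
F_B163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1


def clmul_schoolbook(a, b):
    """Conventional carry-less multiplication in GF(2)[x] (shift and XOR)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def clmul_hybrid(a, b, n, threshold=41):
    """Karatsuba over GF(2)[x] for n-bit operands; schoolbook at <= threshold bits.

    `threshold` must be at least 1 so that the recursion terminates.
    """
    if n <= threshold:
        return clmul_schoolbook(a, b)
    h = (n + 1) // 2                      # width of the low halves
    mask = (1 << h) - 1
    a0, a1 = a & mask, a >> h
    b0, b1 = b & mask, b >> h
    lo = clmul_hybrid(a0, b0, h, threshold)
    hi = clmul_hybrid(a1, b1, n - h, threshold)
    mid = clmul_hybrid(a0 ^ a1, b0 ^ b1, h, threshold)
    # In GF(2), addition and subtraction are both XOR.
    return lo ^ ((mid ^ lo ^ hi) << h) ^ (hi << (2 * h))


def reduce_mod(p, f=F_B163, m=M):
    """Reduce polynomial p modulo f, clearing bits from the top down to bit m."""
    for i in range(p.bit_length() - 1, m - 1, -1):
        if (p >> i) & 1:
            p ^= f << (i - m)
    return p


def gf_mul(a, b):
    """Multiply two field elements of GF(2^163) with the hybrid multiplier."""
    return reduce_mod(clmul_hybrid(a, b, M))


# x * x^162 = x^163, which reduces to x^7 + x^6 + x^3 + 1.
print(hex(gf_mul(2, 1 << 162)))
```

For m=163 the recursion splits 163 bits into 82 and 81, and those into 41 and 40, so with the default threshold the schoolbook multiplier only ever sees operands of at most 41 bits, consistent with the figure quoted in the abstract.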
Article 5
Title@2025-06-23 (1): MEDEA: A Design-Time Multi-Objective Manager for Energy-Efficient DNN Inference on Heterogeneous Ultra-Low Power Platforms
Title: MEDEA: A Design-Time Multi-Objective Manager for Energy-Efficient DNN Inference on Heterogeneous Ultra-Low Power Platforms | MEDEA: Design-Time Multi-Objective Manager für energieeffiziente DNN-Inferenz auf heterogenen Ultra-Low-Power-Plattformen | MEDEA: 能源效率DNN热能超低层动力平台的热源性DN推论设计-时间多目标管理员 2506.19067v1 |
Authors (4): Hossein Taji, José Miranda, Miguel Peón-Quirós, David Atienza
The growing demand for on-device AI necessitates energy-efficient execution of DNN based applications on resource-constrained ultra-low power (ULP) platforms. Heterogeneous architectures, combining specialized processing elements (PEs), have emerged as a key solution for achieving the required performance and energy efficiency. However, optimizing energy while executing applications on these platforms requires efficiently managing platform resources like PEs, power features, and memory footprint, all while adhering to critical application deadlines. This paper presents MEDEA, a novel design-time multi-objective manager for energy-efficient DNN inference on Heterogeneous ULP (HULP) platforms. MEDEA uniquely integrates: kernel-level dynamic voltage and frequency scaling (DVFS) for dynamic energy adaptation; kernel-level granularity scheduling, suitable for specialized accelerators; memory-aware adaptive tiling to navigate severe memory constraints; and all within a timing constraint-based optimization strategy, which minimizes energy based on application deadline. To showcase practical viability, we evaluate MEDEA on HEEPtimize, a heterogeneous ULP platform (22 nm, FPGA-prototyped) featuring a RISC-V processor besides Near-Memory Computing (NMC) and Coarse-Grained Reconfigurable Array (CGRA) accelerators. Experimental results, using a biomedical seizure detection case study, demonstrate that MEDEA achieves overall energy reductions of up to 38% compared to representative state-of-the-art methods, while consistently meeting all timing and memory requirements. This effectiveness is attributed to its integrated features, with our analysis showing that kernel-level DVFS alone can be responsible for over 31% of the energy savings in specific scenarios.
由于对在线AI的需求不断增长,因此需要在资源限制的超低功率平台上执行基于DNN的节能应用,这需要以节能方式执行基于资源限制的超低功率平台的DNN应用程序。混合式结构,结合专门处理元素(PE),已成为实现所需性能和能效的关键解决方案。然而,在这些平台上执行应用时优化能源,需要高效率地管理平台资源,如PE、电力功能和记忆足迹,同时遵守严格的应用期限。本文展示了MEDEA,这是在高能效超常超低功率超低功率平台(HULP)上设计时时时的多目标预感调 DNNN(DNN)平台。我们评价了MEA(HEPEG-VLP) 长期动态动态电压和频率缩放量(DVFS)。
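The deadline-constrained energy minimization described above can be illustrated with a toy model. This is only a sketch under stated assumptions: each kernel has a short list of hypothetical (time, energy) operating points, for example one per voltage/frequency setting, and an exhaustive search picks one point per kernel so that total energy is minimal while total time stays within the deadline. MEDEA's actual formulation, including tiling and accelerator assignment, is not reproduced here, and all numbers and names are made up.

```python
from itertools import product


def pick_operating_points(kernels, deadline):
    """Choose one (time, energy) option per kernel, minimizing total energy.

    kernels  : list of option lists; each option is a (time, energy) pair
    deadline : maximum allowed sum of the chosen times
    Returns (total_energy, chosen_option_indices), or None if no combination fits.
    """
    best = None
    for choice in product(*[range(len(options)) for options in kernels]):
        total_time = sum(kernels[i][c][0] for i, c in enumerate(choice))
        total_energy = sum(kernels[i][c][1] for i, c in enumerate(choice))
        if total_time <= deadline and (best is None or total_energy < best[0]):
            best = (total_energy, choice)
    return best


# Two kernels with made-up operating points: faster settings cost more energy.
kernels = [
    [(4, 10), (6, 7), (9, 5)],  # kernel 0: fast, medium, slow
    [(3, 8), (5, 5)],           # kernel 1: fast, slow
]
print(pick_operating_points(kernels, deadline=11))
print(pick_operating_points(kernels, deadline=7))
```

A looser deadline lets more kernels run at slow, low-energy points, while a tight one forces the fast points. Exhaustive search grows exponentially with the number of kernels, so it serves only to make the objective concrete.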
Article 6
Title@2025-06-23 (1): Design high-confidence computers using trusted instructional set architecture and emulators
Title: Design high-confidence computers using trusted instructional set architecture and emulators | Design von Computern mit hohem Vertrauen mit vertrauenswürdigen Lehrset-Architektur und Emulatoren | 使用可信赖的教学系统架构和模拟器设计高自信计算机 2506.18780v1 |
Authors (1): Shuangbao Paul Wang
High-confidence computing relies on trusted instructional set architecture, sealed kernels, and secure operating systems. Cloud computing depends on trusted systems for virtualization tasks. Branch prediction and pipelining are essential for improving the performance of a CPU/GPU. However, Spectre and Meltdown leave modern processors vulnerable to exploitation. Disabling prediction and pipelining is not a good solution. On the other hand, current software patches can only address non-essential issues around Meltdown. This paper introduces a holistic approach to trusted computer architecture design and emulation.
高自信计算依赖于可信赖的指令设置架构、密封内核和安全操作系统。 云计算依赖于可信任的虚拟化任务系统。 分支预测和管道对于改进CPU/GPU的性能至关重要。 但是, 光谱和熔点下使现代处理器容易被利用。 禁用预测和管道绝对不是一个好的解决办法。 另一方面, 目前的软件补丁只能解决围绕Meltdown的非必要问题。 本文在可信任的计算机架构设计和模拟中引入了整体方法 。
Article 7
Title@2025-06-23 (1): EasyDRAM: An FPGA-based Infrastructure for Fast and Accurate End-to-End Evaluation of Emerging DRAM Techniques
Title: EasyDRAM: An FPGA-based Infrastructure for Fast and Accurate End-to-End Evaluation of Emerging DRAM Techniques | EasyDRAM: Eine FPGA-basierte Infrastruktur für eine schnelle und präzise End-to-End-Bewertung neu entstehender DRAM-Techniken | EasyDRAM: 以FPGA为基础的快速和准确端至端评价新兴DRAM技术基础设施 2506.10441v2 |
Authors (5): Oğuzhan Canpolat, Ataberk Olgun, David Novo, Oğuz Ergin, Onur Mutlu
DRAM is a critical component of modern computing systems. Recent works propose numerous techniques (that we call DRAM techniques) to enhance DRAM-based computing systems’ throughput, reliability, and computing capabilities (e.g., in-DRAM bulk data copy). Evaluating the system-wide benefits of DRAM techniques is challenging as they often require modifications across multiple layers of the computing stack. Prior works propose FPGA-based platforms for rapid end-to-end evaluation of DRAM techniques on real DRAM chips. Unfortunately, existing platforms fall short in two major aspects: (1) they require deep expertise in hardware description languages, limiting accessibility; and (2) they are not designed to accurately model modern computing systems. We introduce EasyDRAM, an FPGA-based framework for rapid and accurate end-to-end evaluation of DRAM techniques on real DRAM chips. EasyDRAM overcomes the main drawbacks of prior FPGA-based platforms with two key ideas. First, EasyDRAM removes the need for hardware description language expertise by enabling developers to implement DRAM techniques using a high-level language (C++). At runtime, EasyDRAM executes the software-defined memory system design in a programmable memory controller. Second, EasyDRAM tackles a fundamental challenge in accurately modeling modern systems: real processors typically operate at higher clock frequencies than DRAM, a disparity that is difficult to replicate on FPGA platforms. EasyDRAM addresses this challenge by decoupling the processor-DRAM interface and advancing the system state using a novel technique we call time scaling, which faithfully captures the timing behavior of the modeled system. We believe and hope that EasyDRAM will enable innovative ideas in memory system design to rapidly come to fruition. To aid future research, the EasyDRAM implementation is open-sourced at https://github.com/CMU-SAFARI/EasyDRAM.
DRAM 是现代计算系统的关键组成部分。 最近的工作提出了许多技术( 我们称之为 DRAM 技术) , 以加强 DRAM 的计算系统输送量、 可靠性和计算能力( 如 DRAM 批量数据副本 ) 。 评估 DRAM 技术的全系统效益具有挑战性, 因为这些技术往往需要在计算堆的多个层面进行修改。 先前的工作提出了基于 FCGA 的平台, 用于对 DRAM 技术进行快速端对端对端评价。 不幸的是, 现有平台在两个主要方面落后:(1) 它们需要硬件描述语言方面的深层次专业知识, 限制无障碍; 和 (2) 它们的设计不是为了精确地模拟现代计算系统。 我们引入了 EasyDRAM, 以快速和准确的方式对 DRAM 进行数据分析。
Article 8
Title@2025-06-23 (1): Embedded FPGA Acceleration of Brain-Like Neural Networks: Online Learning to Scalable Inference
Title: Embedded FPGA Acceleration of Brain-Like Neural Networks: Online Learning to Scalable Inference | Embedded FPGA Beschleunigung von Gehirn-ähnlichen Neuronalen Netzwerken: Online-Lernen zu skalierbarer Inferenz | FPGA 脑如神经网络加速FPGA嵌入式FPGA:在线学习可缩放推论 2506.18530v1 |
Authors (5): Muhammad Ihsan Al Hafiz, Naresh Ravichandran, Anders Lansner, Pawel Herman, Artur Podobas
Edge AI applications increasingly require models that can learn and adapt on-device with minimal energy budget. Traditional deep learning models, while powerful, are often overparameterized, energy-hungry, and dependent on cloud connectivity. Brain-Like Neural Networks (BLNNs), such as the Bayesian Confidence Propagation Neural Network (BCPNN), propose a neuromorphic alternative by mimicking cortical architecture and biologically-constrained learning. They offer sparse architectures with local learning rules and unsupervised/semi-supervised learning, making them well-suited for low-power edge intelligence. However, existing BCPNN implementations rely on GPUs or datacenter FPGAs, limiting their applicability to embedded systems. This work presents the first embedded FPGA accelerator for BCPNN on a Zynq UltraScale+ SoC using High-Level Synthesis. We implement both online learning and inference-only kernels with support for variable and mixed precision. Evaluated on MNIST, Pneumonia, and Breast Cancer datasets, our accelerator achieves up to 17.5x latency and 94% energy savings over ARM baselines, without sacrificing accuracy. This work enables practical neuromorphic computing on edge devices, bridging the gap between brain-like learning and real-world deployment.
AI 应用程序日益需要能够学习和适应以最低能源预算为基础的智能模型。传统的深层次学习模式虽然强大,但往往过于强大、能源饥饿,并依赖于云层连接。脑力与神经网络(BLNNNs),如巴伊西亚信任促进神经网络(BCPNN),通过模仿皮层建筑和生物限制的学习,提出神经变形替代方案。它们提供有本地学习规则和不受监管/半监控学习的稀疏结构,使其适合低功率边缘智能。然而,现有的 BCPNN 执行依靠GPUs或数据中心FPGAs,将其应用范围限制在嵌入系统。这项工作提出了第一个嵌入的FPGA A加速器,在Zynq Ultramation+SoC 上,使用高水平合成。我们实施了在线学习和推断专用的内核内核内核内存,支持变和混合的精确度。评估了MNIST、中空和乳腺癌中空的中层数据配置,从而实现了我们17个实际的系统数据库。
Article 9
Title@2025-06-23 (1): AnalogNAS-Bench: A NAS Benchmark for Analog In-Memory Computing
Title: AnalogNAS-Bench: A NAS Benchmark for Analog In-Memory Computing | AnalogNAS-Bench: Ein NAS-Benchmark für analoges In-Memory Computing | AnalogNAS-Bench:NAS模拟计算基准 2506.18495v1 |
Authors (4): Aniss Bessalah, Hatem Mohamed Abdelmoumen, Karima Benatchba, Hadjer Benmeziane
Analog In-memory Computing (AIMC) has emerged as a highly efficient paradigm for accelerating Deep Neural Networks (DNNs), offering significant energy and latency benefits over conventional digital hardware. However, state-of-the-art neural networks are not inherently designed for AIMC, as they fail to account for its unique non-idealities. Neural Architecture Search (NAS) is thus needed to systematically discover neural architectures optimized explicitly for AIMC constraints. However, comparing NAS methodologies and extracting insights about robust architectures for AIMC requires a dedicated NAS benchmark that explicitly accounts for AIMC-specific hardware non-idealities. To address this, we introduce AnalogNAS-Bench, the first NAS benchmark tailored specifically for AIMC. Our study reveals three key insights: (1) standard quantization techniques fail to capture AIMC-specific noises, (2) robust architectures tend to feature wider and branched blocks, (3) skip connections improve resilience to temporal drift noise. These insights highlight the limitations of current NAS benchmarks for AIMC and pave the way for future analog-aware NAS. All the implementations used in this paper can be found at https://github.com/IBM/analog-nas/tree/main/analognasbench.
模拟计算机(AIMC)已成为加速深神经网络(DNNS)的一个高效范例,为常规数字硬件提供了巨大的能量和延缓效应;然而,最先进的神经网络并非天生设计给AIMC的,因为它们没有说明其独特的非理想性;因此,需要神经结构搜索(NAS)系统发现明确为AIMC限制优化的神经结构;然而,比较NAS的方法和为AIMC提取关于强健结构的洞察力需要专门的NAS基准,明确说明AIMC特定硬件非理想性。为此,我们引入了AnalogNAS-Bench,这是专门为AIMC量身定制的第一个NAS基准。我们的研究揭示了三个主要的洞察:(1) 标准石化技术未能捕捉到AIMC特有的噪音,(2) 坚固的建筑往往具有更宽广和分块的特征,(3) 跳过连接来提高对时间流动噪音的复原力。这些洞察显示目前NAS基准的局限性,并为未来NAMC特定硬件非理想性硬件铺路。
Article 10
Title@2025-06-22 (7): AMD Versal Implementations of FAM and SSCA Estimators
Title: AMD Versal Implementations of FAM and SSCA Estimators | AMD-Versal-Implementierungen von FAM und SSCA-Schätzern | AMD 快速实施FAM和SCA模拟器 2506.18003v1 |
Authors (3): Carol Jingyi Li, Ruilin Wu, Philip H. W. Leong
Cyclostationary analysis is widely used in signal processing, particularly in the analysis of human-made signals, and spectral correlation density (SCD) is often used to characterise cyclostationarity. Unfortunately, for real-time applications, even utilising the fast Fourier transform (FFT), the high computational complexity associated with estimating the SCD limits its applicability. In this work, we present optimised, high-speed field-programmable gate array (FPGA) implementations of two SCD estimation techniques. Specifically, we present an implementation of the FFT accumulation method (FAM) running entirely on the AMD Versal AI engine (AIE) array. We also introduce an efficient implementation of the strip spectral correlation analyser (SSCA) that can be used for window sizes up to $2^{20}$. For both techniques, a generalised methodology is presented to parallelise the computation while respecting memory size and data bandwidth constraints. Compared to an NVIDIA GeForce RTX 3090 graphics processing unit (GPU) which uses a similar 7nm technology to our FPGA, for the same accuracy, our FAM/SSCA implementations achieve speedups of 4.43x/1.90x and a 30.5x/24.5x improvement in energy efficiency.
在信号处理中,特别是在分析人造信号时,广泛使用恒定分析,光谱相关密度(SCD)常常用来描述周期性周期性。不幸的是,对于实时应用,甚至使用快速傅里叶变换(FFT),与估计SCD有关的高计算复杂性限制了其适用性。在这项工作中,我们介绍了两种SCD估计技术的优化、高速的外地可编程门阵列(FPGA)实施。具体地说,我们介绍了完全在AMD Versal AI(AIE)阵列上运行的FFFT积累法(FAM)的实施情况。我们还介绍了对光谱相关分析器(SSCA)的高效应用,这些应用的窗口大小可达220美元。对于这两种技术,我们提出了一种通用的方法,在测量记忆大小和数据带宽限制的同时,同时进行计算。比照NVIDIA GEFFORce RTX 3090的图形处理器(GPU),它使用类似于7nm技术来改进我们FA/4A的速率。
Article 11
Title@2025-06-22 (7): Cost-Effective Optimization and Implementation of the CRT-Paillier Decryption Algorithm for Enhanced Performance
Title: Cost-Effective Optimization and Implementation of the CRT-Paillier Decryption Algorithm for Enhanced Performance | Kosteneffiziente Optimierung und Implementierung des CRT-Paillier Entschlüsselungs-Algorithmus für verbesserte Leistung | 以成本-效益高的方式优化和实施CRT-PALIer 解密算法,提高性能 2506.17935v1 |
Authors (5): Zhengwu Huang, Ding Deng, Pengyue Sun, Guangfu Sun, Xiaomei Tang
To address the privacy protection problem in cloud computing, privacy enhancement techniques such as the Paillier additive homomorphism algorithm are receiving widespread attention. The Paillier algorithm allows addition and scalar multiplication operations in the encrypted state, which can effectively protect privacy. However, its computational efficiency is limited by complex modulo operations due to the ciphertext expansion that follows encryption. To accelerate its decryption operation, the Chinese Remainder Theorem (CRT) is often used to optimize these modulo operations, which in turn lengthens the decryption computation chain. To address this issue, we propose an eCRT-Paillier decryption algorithm that shortens the decryption computation chain by combining precomputed parameters and eliminating extra judgment operations introduced by Montgomery modular multiplications. These two improvements reduce modular multiplications by 50% and judgment operations by 60% in the postprocessing of the CRT-Paillier decryption algorithm. Based on these improvements, we propose a highly parallel full-pipeline architecture to eliminate stalls caused by multiplier reuse in traditional modular exponentiation operations. This architecture also adopts optimizations such as simplifying modular exponentiation units by dividing the exponent into segments and parallelizing data flow by multi-core instantiation. Finally, a high-throughput and efficient Paillier accelerator named MESA was implemented on the Xilinx Virtex-7 FPGA for evaluation, which can complete a decryption using a 2048-bit key within 0.577 ms at a 100 MHz clock frequency. Compared to prior works, MESA demonstrates a throughput improvement of 1.16× to 313.21× under identical conditions, along with enhancements in area efficiency for LUT, DSP, and FF of 3.32× to 117.55×, 1.49× to 1.64×, and 2.94× to 9.94×, respectively.
为了解决云计算中的隐私保护问题,诸如Paillier Actracing dolmondism 算法等隐私增强技术正在得到广泛关注。 Paillier 算法允许在加密状态下增加和加速倍增操作,这可以有效保护隐私。然而,由于加密后扩展,其计算效率因复杂的模版操作而受到限制。为了加速解密操作,中国保存器 Theorem (CRT) 经常用于优化这些模版操作,它反过来延长解密计算链。为了解决这一问题,我们建议使用一个 eCRT-Pailler 解密算法, 使解密计算链在加密状态下增加和加速, 从而缩短解密计算链, 因为它结合了预合成参数,并消除了由蒙哥德摩托尔多式倍增益的多功能。 这两项改进使CRT-Paillier器解密算算法的后处理过程减少了50%的倍增倍数和60%。 基于这些改进,我们提议用一个高度平行的全管结构来消除在传统模块下进行自动再处理后再利用的Dlickral-licklifildalx 3 的自动升级操作。这个结构,最后将一个通过快速化的压结构进行,将一些压压压的压压压压压压压压压压压到直压到直压成平流,将一些压的MFDrmax。
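As a rough illustration of the baseline that the abstract above starts from, the following Python sketch compares textbook Paillier decryption with the standard CRT-based variant. It is not the eCRT algorithm or the MESA architecture from the paper: the function names, the toy primes, and the choice g = n + 1 are assumptions made for this example, and the key size is far too small to be secure.

```python
import math


def l_func(x, d):
    """Paillier's L function: L_d(x) = (x - 1) / d, exact for valid inputs."""
    return (x - 1) // d


def keygen(p, q):
    """Build a toy key from two small primes (illustrative only, not secure)."""
    n = p * q
    g = n + 1  # common simplifying choice of generator (assumption)
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    return {
        "p": p, "q": q, "n": n, "n2": n * n, "g": g, "lam": lam,
        "mu": pow(l_func(pow(g, lam, n * n), n), -1, n),
        # Parameters precomputed once for CRT decryption.
        "hp": pow(l_func(pow(g, p - 1, p * p), p), -1, p),
        "hq": pow(l_func(pow(g, q - 1, q * q), q), -1, q),
        "q_inv": pow(q, -1, p),
    }


def encrypt(key, m, r):
    """c = g^m * r^n mod n^2, with r coprime to n."""
    assert math.gcd(r, key["n"]) == 1
    return pow(key["g"], m, key["n2"]) * pow(r, key["n"], key["n2"]) % key["n2"]


def decrypt_standard(key, c):
    """Textbook decryption: one exponentiation modulo n^2."""
    return l_func(pow(c, key["lam"], key["n2"]), key["n"]) * key["mu"] % key["n"]


def decrypt_crt(key, c):
    """CRT decryption: two half-size exponentiations, then recombination."""
    p, q = key["p"], key["q"]
    m_p = l_func(pow(c, p - 1, p * p), p) * key["hp"] % p
    m_q = l_func(pow(c, q - 1, q * q), q) * key["hq"] % q
    # Garner-style recombination of the residues modulo p and q.
    return m_q + q * ((m_p - m_q) * key["q_inv"] % p)


key = keygen(1009, 1013)
ciphertext = encrypt(key, 42, r=17)
print(decrypt_standard(key, ciphertext), decrypt_crt(key, ciphertext))
```

The recombination step at the end of `decrypt_crt` is the kind of postprocessing that the abstract describes as lengthening the decryption chain.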
Article 12
Title@2025-06-22 (7): GainSight: Application-Guided Profiling for Composing Heterogeneous On-Chip Memories in AI Hardware Accelerators
Title: GainSight: Application-Guided Profiling for Composing Heterogeneous On-Chip Memories in AI Hardware Accelerators | GainSight: Anwendungsorientierte Profilierung für das Komponieren heterogener On-Chip-Erinnerungen in KI-Hardwarebeschleunigern | GainSight: 在 AI 硬件加速器中组合不同质在芯片上的记忆的应用指导分析 2504.14866v3 |
Authors (10): Peijing Li, Matthew Hung, Yiming Tan, Konstantin Hoßfeld, Jake Cheng Jiajun, Shuhan Liu, Lixian Yan, Xinxin Wang, H. -S. Philip Wong, Thierry Tambe
As AI workloads drive soaring memory requirements, higher-density on-chip memory is needed for domain-specific accelerators beyond what current SRAM technology can provide. We argue that algorithms and application behavior should guide the composition of heterogeneous on-chip memories. However, little work has incorporated dynamic application profiles into these design decisions, and no existing tools are expressly designed for this purpose. We present GainSight, a profiling framework that analyzes fine-grained memory access patterns and data lifetimes in domain-specific accelerators. By instrumenting retargetable architectural simulator backends with application- and device-agnostic analytical frontends, GainSight aligns workload-specific traffic and lifetime metrics with mockups of emerging memory devices, informing system-level heterogeneous memory design. We also present a set of case studies on MLPerf Inference and PolyBench workloads using simulated GPU and systolic array architectures, highlighting the utility of GainSight and the insights it provides: (1) 64% of L1 and 18% of L2 GPU cache accesses, and 79% of systolic array scratchpad accesses across profiled workloads are short-lived and suitable for silicon-based gain cell RAM (Si-GCRAM); (2) heterogeneous memory arrays that augment SRAM with GCRAM can reduce active energy consumption by up to 66.8%.
AI 工作量导致记忆要求飞涨, 需要超出当前 SRAM 技术所能提供的域特定加速器, 需要更高密度的芯片内存。 我们鼓励算法和应用行为应该指导各种芯片内存储器的构成。 但是, 很少有工作将动态应用剖面纳入这些设计决定, 也没有为此目的明确设计任何现有工具。 我们展示了一个剖析框架, 分析具体域加速器中细微刻度内存访问模式和数据寿命。 通过将可重新定位的建筑模拟后端与应用程序和装置分析前端进行仪器化, 增益视觉应该将特定工作量的流量和寿命指标与新兴存储器的模拟组合结合起来, 向系统级内存设计提供信息。 我们还展示了一套关于 MLPerf 推断和 PolyBench 工作量的案例研究, 使用模拟的GainSightSightSight的阵列结构, 突出GARES的效用及其洞察到的L1和18%的L2 PUPO的内存内存空间, 以及SL2 CRA 的短期访问阵列, 和79 % Slaimal- cal- sal- clas- clasma- slappy 。
Article 13
Title@2025-06-21 (6): Residue Number System (RNS) based Distributed Quantum Multiplication
Title: Residue Number System (RNS) based Distributed Quantum Multiplication | Rückstandszahlsystem (RNS) basiert auf verteilter Quanten-Multiplikation | 基于残余数字系统(RNS)的分布量乘法 2506.17588v1 |
Authors (2): Bhaskar Gaur, Himanshu Thapliyal
Multiplication of quantum states is a frequently used function or subroutine in quantum algorithms and applications, making quantum multipliers an essential component of quantum arithmetic. However, quantum multiplier circuits suffer from high Toffoli depth and T gate usage, which ultimately affects their scalability and applicability on quantum computers. To address these issues, we propose utilizing the Residue Number System (RNS) based distributed quantum multiplication, which executes multiple quantum modulo multiplication circuits across quantum computers or jobs with lower Toffoli depth and T gate usage. Towards this end, we propose a design of Quantum Diminished-1 Modulo $(2^n+1)$ Multiplier, an essential component of RNS based distributed quantum multiplication. We provide estimates of quantum resource usage and compare them with those of an existing non-distributed quantum multiplier for 6 to 16 qubit sized output. Our comparative analysis estimates up to 46.018% lower Toffoli depth, and reduction in T gates of 34.483% to 86.25%.
量子状态的乘法是量子算法和应用程序中常用的函数或次常规,使量子乘数成为量子算法和应用程序的一个基本组成部分。然而,量子倍化电路受到高 Toffoli 深度和T门使用的影响,最终影响到其在量子计算机上的可缩放性和可应用性。为了解决这些问题,我们提议使用基于残余数字系统分布量子乘法,在量子计算机或具有低 Toffoli 深度和T门使用量子计算机或工作之间执行多种量子模数倍化电路。为此,我们提议设计量子倍化电路,这是基于 RNS 分布量子乘法的一个必要组成部分。我们提供了量子资源使用估计数,并将其与现有的非分配量子乘数乘数乘法的6至16 公尺输出量乘法进行比较。我们的比较分析估计,其深度可达46.018%,低 Toffoli 深度为46.483 %至86.25%。
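The arithmetic idea behind the abstract above can be sketched classically. The Python example below splits a multiplication into three independent residue channels and recombines them with the Chinese Remainder Theorem, using diminished-1 arithmetic for the modulo $2^n+1$ channel. The moduli set $\{2^n-1, 2^n, 2^n+1\}$, the function names, and the classical (non-quantum) setting are all assumptions of this sketch; it does not model the paper's quantum circuits or their Toffoli and T-gate costs.

```python
from math import prod


def rns_moduli(n):
    """A common pairwise-coprime moduli set (an assumption of this sketch)."""
    return (2**n - 1, 2**n, 2**n + 1)


def dim1_encode(a, n):
    """Diminished-1 form: store a - 1 modulo 2^n + 1 (zero maps to 2^n)."""
    return (a - 1) % (2**n + 1)


def dim1_decode(d, n):
    return (d + 1) % (2**n + 1)


def dim1_mul(ad, bd, n):
    """(ad + 1)(bd + 1) - 1 = ad*bd + ad + bd, reduced modulo 2^n + 1."""
    return (ad * bd + ad + bd) % (2**n + 1)


def rns_mul(a, b, n):
    """Multiply channel by channel; each channel could run as a separate job."""
    m1, m2, m3 = rns_moduli(n)
    r1 = (a % m1) * (b % m1) % m1
    r2 = (a % m2) * (b % m2) % m2
    r3 = dim1_decode(
        dim1_mul(dim1_encode(a % m3, n), dim1_encode(b % m3, n), n), n
    )
    return (r1, r2, r3)


def from_rns(residues, moduli):
    """CRT reconstruction; exact as long as the product is below prod(moduli)."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_i = big_m // m
        total += r * m_i * pow(m_i, -1, m)
    return total % big_m


n = 4  # moduli (15, 16, 17), dynamic range 4080
print(from_rns(rns_mul(51, 77, n), rns_moduli(n)))
```

Each channel works on roughly n-bit values instead of the full product width, which is the property that distributed schemes of this kind rely on.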
Article 14
Title@2025-06-21 (6): LLM-Aided Testbench Generation and Bug Detection for Finite-State Machines
Title: LLM-Aided Testbench Generation and Bug Detection for Finite-State Machines | LLM-Aided Testbench Generation und Fehlererkennung für Finite-State-Maschinen | LLM 辅助有限国机器的LLM辅助试验箱生成和虫虫检测 2406.17132v2 |
Authors (5): Jitendra Bhandari, Johann Knechtel, Ramesh Narayanaswamy, Siddharth Garg, Ramesh Karri
This work investigates the potential of tailoring Large Language Models (LLMs), specifically GPT3.5 and GPT4, for the domain of chip testing. A key aspect of chip design is functional testing, which relies on testbenches to evaluate the functionality and coverage of Register-Transfer Level (RTL) designs. We aim to enhance testbench generation by incorporating feedback from commercial-grade Electronic Design Automation (EDA) tools into LLMs. Through iterative feedback from these tools, we refine the testbenches to achieve improved test coverage. Our case studies present promising results, demonstrating that this approach can effectively enhance test coverage. By integrating EDA tool feedback, the generated testbenches become more accurate in identifying potential issues in the RTL design. Furthermore, we extended our study to use this enhanced test coverage framework for detecting bugs in the RTL implementations.
这项工作调查了为芯片测试领域定制大语言模型(LLMS,特别是GPT3.5和GPT4)的潜力,芯片设计的一个关键方面是功能性测试,它依靠测试箱来评价登记册-转让水平设计(RTL)的功能和覆盖范围;我们的目标是通过将商业级电子设计自动化工具的反馈纳入LMS,加强测试生成。我们通过这些工具的反复反馈,改进测试箱,以扩大测试范围。我们的案例研究表明,这一方法可有效提高测试范围。通过整合EDA工具反馈,产生的测试箱在确定RTL设计中的潜在问题方面变得更加准确。此外,我们扩大了研究,利用这一强化测试范围框架来检测RTL实施中的错误。
Article 15
Title@2025-06-21 (6): ASC-Hook: fast and transparent system call hook for Arm
Title: ASC-Hook: fast and transparent system call hook for Arm | ASC-Hook: Schneller und transparenter Systemrufhaken für Arm | ASC-Hook: 快速和透明的系统调用武器钩 2412.05784v4 |
Authors (4): Yang Shen, Min Xie, Wenzhe Zhang, Tao Wu
Intercepting system calls is crucial for tools that aim to modify or monitor application behavior. However, existing system call interception tools on the ARM platform still suffer from limitations in terms of performance and completeness. This paper presents an efficient and comprehensive binary rewriting framework, ASC-Hook, specifically designed for intercepting system calls on the ARM platform. ASC-Hook addresses two key challenges on the ARM architecture: the misalignment of the target address caused by directly replacing the SVC instruction with br x8, and the return to the original control flow after system call interception. This is achieved through a hybrid replacement strategy and our specially designed trampoline mechanism. By implementing multiple completeness strategies specifically for system calls, we ensured comprehensive and thorough interception. Experimental results show that ASC-Hook reduces overhead to at least 1/29 of that of existing system call interception tools. We conducted extensive performance evaluations of ASC-Hook, and the average performance loss for system call-intensive applications is 3.7%.
截取系统呼叫对于旨在修改或监测应用行为的工具至关重要,然而,在ARM平台上现有的系统呼叫截取工具仍然在性能和完整性方面受到限制。本文件介绍了一个高效而全面的二进制重写框架(ASC-Hook,专门设计用于截取系统呼叫ARM平台的ASC-Hook,专为截取系统而设计的ASC-Hook)。ASC-Hook应对了ARM架构的两大挑战:直接用brx8取代SVC指令导致目标地址的错配,以及系统呼叫拦截后返回原控制流程。这是通过混合替换战略和我们专门设计的蹦床机制实现的。我们通过执行针对系统呼叫的多重完整战略,确保了全面彻底的截取。实验结果表明,ASC-Hook将现有系统呼叫拦截工具的间接费用减少到至少1/29。我们对ASC-Hook进行了广泛的绩效评估,而系统呼叫密集应用的平均性能损失是3.7。
Article 16
Title@2025-06-20 (5): $Δ$-Nets: Interaction-Based System for Optimal Parallel $λ$-Reduction
Title: $Δ$-Nets: Interaction-Based System for Optimal Parallel $λ$-Reduction | $Δ$-Nets: Interaktionsbasiertes System für eine optimale parallele $λ$-Reduktion | \(-净额:最佳平行互动系统\)$美元-削减 2505.20314v3 |
Authors (1): Daniel Augusto Rizzi Salvadori
I present a model of universal parallel computation called $\Delta$-Nets, and a method to translate $\lambda$-terms into $\Delta$-nets and back. Together, the model and the method constitute an algorithm for optimal parallel $\lambda$-reduction, solving the longstanding enigma with groundbreaking clarity. I show that the $\lambda$-calculus can be understood as a projection of $\Delta$-Nets, one that severely restricts the structure of sharing, among other drawbacks. Unhindered by these restrictions, the $\Delta$-Nets model opens the door to new parallel programming language implementations and computer architectures that are more efficient and performant than previously possible.
我提出了一个称为$Delta$-Nets的通用平行计算模型,以及将$lambda$-terms 转换成$Delta$-nets和回调的方法。模型和方法共同构成一个优化平行$\lambda$降值的算法,以突破性清晰度解决长期谜题。我表明,$Limbda$-calculus可以被理解为一个严重限制共享结构的$Delta$-Nets-one的预测,除其他缺陷外。由于这些限制,$Delta$-Nets模型为平行语言实施和计算机结构打开了大门,这些语言实施和计算机结构比以前可能更有效和更实用。
Article 17
Title@2025-06-20 (5): Assessing Tenstorrent’s RISC-V MatMul Acceleration Capabilities
Title: Assessing Tenstorrent’s RISC-V MatMul Acceleration Capabilities | Bewertung der RISC-V-MatMul-Beschleunigungsfähigkeiten von Tenstorrent | 评估出租人的RISC-V MatMul加速能力 2505.06085v3 |
Authors (3): Hiari Pizzini Cavagna, Daniele Cesarini, Andrea Bartolini
The increasing demand for generative AI services such as Large Language Models (LLMs) has driven the need for specialized hardware architectures that optimize computational efficiency and energy consumption. This paper evaluates the performance of the Tenstorrent Grayskull e75 RISC-V accelerator for basic linear algebra kernels at reduced numerical precision, a fundamental operation in LLM computations. We present a detailed characterization of Grayskull’s execution model and of how grid size, matrix dimensions, data formats, and numerical precision impact computational efficiency. Furthermore, we compare Grayskull’s performance against state-of-the-art architectures with tensor acceleration, including Intel Sapphire Rapids processors and two NVIDIA GPUs (V100 and A100). Whilst NVIDIA GPUs dominate raw performance, Grayskull demonstrates a competitive trade-off between power consumption and computational throughput, reaching a peak of 1.55 TFLOPs/Watt with BF16.
由于对作为大语言模型(LLMs)服务的基因化AI的需求日益增加,因此有必要建立优化计算效率和能源消耗的专门硬件结构,本文件评估了Tenstorrent Grayskull e75 RISC-V 加速器的性能,该加速器用于降低数值精确度,这是LLM计算中的一项基本操作。我们对Grayskull的执行模型、网格大小、矩阵尺寸、数据格式和数字精确影响计算效率作了详细的定性。此外,我们比较了Grayskull的性能与最新工艺结构的加速性能,包括Intel Sapphire Rapids处理器和两个NVIDIA GPU(V100和A100)。虽然NVIDIA GPU主导了原始性能,但Grayskull展示了电力消耗和计算性载荷之间的竞争性交易,达到1.55 TFLOPs/Watt的峰值,与BF16。
Article 18
Title@2025-06-20 (5): RCNet: $ΔΣ$ IADCs as Recurrent AutoEncoders
Title: RCNet: $ΔΣ$ IADCs as Recurrent AutoEncoders | RCNet: $Δω$ IADCs als recurrent AutoEncoder | RCNet:作为经常性自动编码器的空间碎片协委会 2506.16903v1 |
Authors (3): Arnaud Verdant, William Guicquero, Jérôme Chossat
This paper proposes a deep learning model (RCNet) for Delta-Sigma ($\Delta\Sigma$) ADCs. Recurrent Neural Networks (RNNs) can describe both modulators and filters. This analogy is applied to Incremental ADCs (IADCs). High-end optimizers combined with full-custom losses are used to define additional hardware design constraints: quantized weights, signal saturation, temporal noise injection, and device area. Focusing on DC conversion, our early results demonstrate that the SNR, expressed as an Effective Number Of Bits (ENOB), can be optimized under a certain hardware mapping complexity. The proposed RCNet succeeded in providing design tradeoffs in terms of SNR (>13 bit) versus area constraints (<14 pF total capacitor) at a given OSR (80 samples). Interestingly, it appears that the best RCNet architectures do not necessarily rely on high-order modulators, leveraging additional topology exploration degrees of freedom.
本文为德尔塔-西格玛(Delta\Sigma$\Delta\Sigma$) ADC提出一个深层次学习模型(RCNet) 。 经常性神经网络允许描述调制器和过滤器。 这个类比适用于递增ADC(IADC) 。 高端优化器加上全海关损失用来界定额外的硬件设计限制: 量化重量、 信号饱和度、 时间噪音注射、 装置区。 以DC 转换为重点, 我们早期的结果显示, 在一定的硬件绘图复杂度下, 定义为有效比特数( ENOB) 的 $SNR$( ENOB) 可以优化。 拟议的RCNet成功地提供了设计交换条件, 以美元( $ > 13bit) 相对于地区限制 ( < 14pF 总电容器) , 以给定美元( 80 样本) 。 有意思的是, , 最好的RCNet 结构似乎不一定依赖高级的调制器, 利用额外的表层自由探索度。
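The recurrent view mentioned in the abstract above can be made concrete with a toy model. The Python sketch below simulates a first-order incremental $\Delta\Sigma$ converter for a DC input as a simple state recurrence (integrator, 1-bit quantizer, averaging filter). It is a textbook first-order loop, not the RCNet model; the function name, the normalized input range of [-1, 1], and the reuse of 80 samples as the oversampling ratio are assumptions of this example.

```python
def incremental_adc(x, osr=80):
    """First-order incremental delta-sigma conversion of a DC input x in [-1, 1].

    The loop is a recurrence over one hidden state (the integrator), which is
    why such converters can be read as small recurrent networks.
    """
    state = 0.0        # integrator state, reset at the start of each conversion
    bits = []
    for _ in range(osr):
        bit = 1 if state >= 0 else -1   # 1-bit quantizer (feedback DAC value)
        bits.append(bit)
        state += x - bit                # integrate the quantization error
    estimate = sum(bits) / osr          # simple averaging (counting) filter
    return estimate, bits


value, stream = incremental_adc(0.3)
print(value, len(stream))
```

For this first-order loop the integrator state stays within [-2, 2], so the estimate differs from the input by at most 2/osr. That is only a few bits at 80 samples, far below the resolution quoted in the abstract, which is one reason richer topologies are worth exploring.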
Article 19
Title@2025-06-20 (5): Lookup Table-based Multiplication-free All-digital DNN Accelerator Featuring Self-Synchronous Pipeline Accumulation
Title: Lookup Table-based Multiplication-free All-digital DNN Accelerator Featuring Self-Synchronous Pipeline Accumulation | Lookup Tischbasierte Multiplikationsfreies All-Digital-DNN-Beschleuniger mit selbstsynchroner Pipeline-Akkumulation | 基于表格的无乘法化表格 全数字 DNN 加速器 自动同步管道累积 2506.16800v1 |
Authors (3): Hiroto Tagata, Takashi Sato, Hiromitsu Awano
Deep neural networks (DNNs) have been widely applied in our society, yet reducing power consumption due to large-scale matrix computations remains a critical challenge. MADDNESS is a known approach to improving energy efficiency by substituting matrix multiplication with table lookup operations. Previous research has employed large analog computing circuits to convert inputs into LUT addresses, which presents challenges to area efficiency and computational accuracy. This paper proposes a novel MADDNESS-based all-digital accelerator featuring a self-synchronous pipeline accumulator, resulting in a compact, energy-efficient, and PVT-invariant computation. Post-layout simulation using a commercial 22nm process showed that 2.5 times higher energy efficiency (174 TOPS/W) and 5 times higher area efficiency (2.01 TOPS/mm2) can be achieved compared to the conventional accelerator.
深神经网络(DNNS)在我们的社会中被广泛应用,然而,由于大规模矩阵计算而降低电力消耗仍然是一个重大挑战。MADDNES是已知的提高能源效率的方法,它用表格查看操作取代矩阵乘法。以前的研究使用大型模拟计算电路将输入转换成LUT地址,这对地区效率和计算准确性提出了挑战。本文提议了一个新的以MADDNESS为基础的所有数字加速器,以自同步管道加速器为主,导致一个紧凑、节能和PVT变量计算。使用商业22nm进程进行后外延模拟显示,与常规加速器相比,能效(174 TOPS/W)和地区效率(2.01 TOPS/mm2)可以提高2.5倍,区域效率(2.01 TOPS/mm2)提高5倍。
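The general idea of replacing multiplications with table lookups, as described in the abstract above, can be sketched in a few lines of Python. This is a simplified product-quantization-style example, not MADDNESS itself: the real method encodes inputs with learned hash trees, whereas this sketch uses a nearest-prototype search (which itself still multiplies), and the prototypes, weights, and function names are made up for illustration.

```python
def build_luts(prototypes, weights):
    """Offline step: precompute dot(prototype, weight slice) for every prototype.

    prototypes[c][k] is the k-th prototype of subspace c; weights is flat.
    """
    luts, offset = [], 0
    for protos in prototypes:
        width = len(protos[0])
        w_sub = weights[offset:offset + width]
        luts.append([sum(p * w for p, w in zip(proto, w_sub)) for proto in protos])
        offset += width
    return luts


def encode(x, prototypes):
    """Map each input slice to the index of its nearest prototype."""
    codes, offset = [], 0
    for protos in prototypes:
        width = len(protos[0])
        sub = x[offset:offset + width]
        dists = [sum((s - p) ** 2 for s, p in zip(sub, proto)) for proto in protos]
        codes.append(dists.index(min(dists)))
        offset += width
    return codes


def lut_dot(x, prototypes, luts):
    """Online step: the dot product becomes lookups plus accumulation."""
    return sum(lut[k] for lut, k in zip(luts, encode(x, prototypes)))


prototypes = [[[0, 0], [1, 2], [3, 1]], [[1, 1], [2, 0], [0, 4]]]
weights = [2, -1, 3, 5]
luts = build_luts(prototypes, weights)
print(lut_dot([1.1, 1.9, 0.1, 3.9], prototypes, luts))
```

In an accelerator the encoding step is the part that needs a cheap hardware mapping from inputs to LUT addresses, and the final accumulation is where a pipelined accumulator would sit.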
Article 20
Title@2025-06-20 (5): CodeV-R1: Reasoning-Enhanced Verilog Generation
Title: CodeV-R1: Reasoning-Enhanced Verilog Generation | CodeV-R1: Grundlegende Verilog-Generierung | 代码V-R1:有理性的增强性性性性性性性生殖器生成 2505.24183v2 |
Authors (19): Yaoyu Zhu, Di Huang, Hanqi Lyu, Xiaoyun Zhang, Chongxiao Li, Wenxuan Shi, Yutong Wu, Jianan Mu, Jinghua Wang, Yang Zhao, Pengwei Jin, Shuyao Cheng, Shengwen Liang, Xishan Zhang, Rui Zhang, Zidong Du, Qi Guo, Xing Hu, Yunji Chen
Large language models (LLMs) trained via reinforcement learning with verifiable reward (RLVR) have achieved breakthroughs on tasks with explicit, automatable verification, such as software programming and mathematical problems. Extending RLVR to electronic design automation (EDA), especially automatically generating hardware description languages (HDLs) like Verilog from natural-language (NL) specifications, however, poses three key challenges: the lack of automated and accurate verification environments, the scarcity of high-quality NL-code pairs, and the prohibitive computation cost of RLVR. To this end, we introduce CodeV-R1, an RLVR framework for training Verilog generation LLMs. First, we develop a rule-based testbench generator that performs robust equivalence checking against golden references. Second, we propose a round-trip data synthesis method that pairs open-source Verilog snippets with LLM-generated NL descriptions, verifies code-NL-code consistency via the generated testbench, and filters out inequivalent examples to yield a high-quality dataset. Third, we employ a two-stage “distill-then-RL” training pipeline: distillation for the cold start of reasoning abilities, followed by adaptive DAPO, our novel RLVR algorithm that can reduce training cost by adaptively adjusting sampling rate. The resulting model, CodeV-R1-7B, achieves 68.6% and 72.9% pass@1 on VerilogEval v2 and RTLLM v1.1, respectively, surpassing prior state-of-the-art by 12 to 20%, while matching or even exceeding the performance of 671B DeepSeek-R1. We will release our model, training pipeline, and dataset to facilitate research in EDA and LLM communities.
通过可核实的奖励(RLVR)强化学习培训的大型语言模型(LLM)在明确、可自动化的核查任务(如软件编程和数学问题)方面取得了突破。将RLVR扩大到电子设计自动化(EDA),特别是自动生成硬件描述语言(HDLs),如来自自然语言(NL)规格的Verilog(HDLL),然而,这带来了三大挑战:缺乏自动和准确的核查环境,缺少高质量的NLVR码配对,以及RLVR的计算成本过高。为此,我们引入了代码V-R1,一个用于培训VRR1的RV-7框架,用于培训Verilog的LLLMM。首先,我们开发了一个基于规则的测试引擎,对金参考进行强有力的等值检查。第二,我们提出了一个圆柱数据合成方法,将开放源的 Verilog布片与LM生成的NLLS描述配对,通过生成的测试对代码-NLF20码一致性进行校准,并通过等示例进行过滤, 提供高品质数据集。第三,我们开始两个阶段的“DLVR-R-R-R-R-R-R-R-R-R-d-d-d-d-drodustris-d-d-d-d-dal”的升级的测试,然后进行成本的升级的升级的校程的校程的校程的校程的校程的校制算算法。
Article 21
Title@2025-06-20 (5): Zoozve: A Strip-Mining-Free RISC-V Vector Extension with Arbitrary Register Grouping Compilation Support (WIP)
Title: Zoozve: A Strip-Mining-Free RISC-V Vector Extension with Arbitrary Register Grouping Compilation Support (WIP) | Zoozve: Eine Strip-Mining-freie RISC-V-Vektorerweiterung mit arbitrary Register-Gruppierungsunterstützung (WIP) | Zooozve:利用任意登记册组汇编支助,扩大一个带分解自由的RISC-V矢量扩展区(WIP) 2504.15678v2 |
Authors (7): Siyi Xu, Limin Jiang, Yintao Liu, Yihao Shen, Yi Shi, Shan Cao, Zhiyuan Jiang
Vector processing is crucial for boosting processor performance and efficiency, particularly with data-parallel tasks. The RISC-V “V” Vector Extension (RVV) enhances algorithm efficiency by supporting vector registers of dynamic sizes and their grouping. Nevertheless, for very long vectors, the static number of RVV vector registers and its power-of-two grouping can lead to performance restrictions. To counteract this limitation, this work introduces Zoozve, a RISC-V vector instruction extension that eliminates the need for strip-mining. Zoozve allows for flexible vector register length and count configurations to boost data computation parallelism. With a data-adaptive register allocation approach, Zoozve permits any register groupings and accurately aligns vector lengths, cutting down register overhead and alleviating performance declines from strip-mining. Additionally, the paper details Zoozve’s compiler and hardware implementations using LLVM and SystemVerilog. Initial results indicate Zoozve yields a minimum 10.10× reduction in dynamic instruction count for fast Fourier transform (FFT), with a mere 5.2% increase in overall silicon area.
矢量处理对于提高处理器的性能和效率至关重要,特别是在数据平行任务方面。RISC-V“V”矢量扩展(RVVV)通过支持动态尺寸的矢量登记册及其分组来提高算法效率。然而,对于非常长的矢量而言,RV矢量登记册的静数及其2组的功率可能导致性能限制。为了消除这一限制,这项工作引入了Zooozve,一个RISC-V矢量指示扩展,消除了脱衣采掘的需要。Zoozve允许灵活矢量登记长度和计数配置,以推进数据的平行计算。在数据适应性登记册分配方法下,Zoozve允许任何登记组并准确地调整矢量长度,削减登记册的间接费用并减缓脱衣采伐的性能下降。此外,文件细节Zozve使用LVM和系统Verilog, 显示Zozve能产生最低10.10美元的时间减少动态指示值,用于快速四变换(FFT),在一个小的面积上增加了一个5.2个区域。
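For readers unfamiliar with strip-mining, the Python sketch below models the loop structure that the abstract above aims to remove. A long vector operation is cut into chunks no longer than a hardware maximum, with one loop iteration (and its bookkeeping instructions) per chunk. The function names are hypothetical, and taking each chunk length as min(remaining, vlmax) is a simplification assumed for this example rather than a statement about any specific RVV implementation or about Zoozve.

```python
def strip_mine_chunks(n, vlmax):
    """Chunk lengths used to cover n elements when at most vlmax fit per pass."""
    chunks = []
    remaining = n
    while remaining > 0:
        vl = min(remaining, vlmax)  # simplified model of the granted length
        chunks.append(vl)
        remaining -= vl
    return chunks


def vector_add_strip_mined(a, b, vlmax):
    """Element-wise add done chunk by chunk; returns result and loop trip count."""
    out = []
    start = 0
    iterations = 0
    for vl in strip_mine_chunks(len(a), vlmax):
        out.extend(x + y for x, y in zip(a[start:start + vl], b[start:start + vl]))
        start += vl
        iterations += 1
    return out, iterations


a = list(range(1000))
b = list(range(1000, 2000))
result, trips = vector_add_strip_mined(a, b, vlmax=64)
print(trips, result[:3])
```

Every trip through the loop repeats the length setup, address updates, and branch, so a scheme that covers the whole vector with a single configuration saves that per-chunk overhead.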
Article 22
Title@2025-06-19 (4): SparseDPD: A Sparse Neural Network-based Digital Predistortion FPGA Accelerator for RF Power Amplifier Linearization
Title: SparseDPD: A Sparse Neural Network-based Digital Predistortion FPGA Accelerator for RF Power Amplifier Linearization | SparseDPD: Ein Sparse Neural Network-basierter Digital Predistortion FPGA Accelerator für RF Leistungsverstärker Linearisierung | SpararseDPD: 一种基于松散神经网络的FPGA 用于RF Power 放大器线性化的松式神经网络数字预感加速器 2506.16591v1 |
Authors (3): Manno Versluis, Yizhuo Wu, Chang Gao
Digital predistortion (DPD) is crucial for linearizing radio frequency (RF) power amplifiers (PAs), improving signal integrity and efficiency in wireless systems. Neural network (NN)-based DPD methods surpass traditional polynomial models but face computational challenges limiting their practical deployment. This paper introduces SparseDPD, an FPGA accelerator employing a spatially sparse phase-normalized time-delay neural network (PNTDNN), optimized through unstructured pruning to reduce computational load without accuracy loss. Implemented on a Xilinx Zynq-7Z010 FPGA, SparseDPD operates at 170 MHz, achieving exceptional linearization performance (ACPR: -59.4 dBc, EVM: -54.0 dBc, NMSE: -48.2 dB) with only 241 mW dynamic power, using 64 parameters with 74% sparsity. This work demonstrates FPGA-based acceleration, making NN-based DPD practical and efficient for real-time wireless communication applications. Code is publicly available at https://github.com/MannoVersluis/SparseDPD.
数字粒变器(DPD)对于无线电频率(RF)功率放大器(PAs)的线性化、提高无线系统的信号完整性和效率至关重要。神经网络(NN)的DPD方法超越了传统的多元模型,但面临限制其实际部署的计算挑战。本文介绍SpraseDPD(PSARID),这是一个FPGA加速器,它使用空间稀少的阶段性标准化时间分流神经网络(PNTDN),通过无结构的剪接机优化,以减少计算负载,而不会造成精确损失。在Xilinx Zynq-7Z010 FPGA上实施,SprassDPD在170MHz上运行,实现特殊的线化性能(ACR:-59.4 dBc,EVM:-54. 0 dBc,NMSE:-48.2 dB),只有241 mW动态功率,使用74%的音量参数。这项工作显示以FGA为基础的加速度,使NDDDPD成为实用和高效的实时无线通信应用。代码可在https://gthub.com/ManVlus/Sluis/sar上公开查阅。
Article 23
Title@2025-06-19 (4): REIS: A High-Performance and Energy-Efficient Retrieval System with In-Storage Processing
Title: REIS: A High-Performance and Energy-Efficient Retrieval System with In-Storage Processing | REIS: Ein leistungsstarkes und energieeffizientes Retrieval-System mit In-Storage-Verarbeitung | REIS:具有在系统内处理的高效能和节能检索系统 2506.16444v1 |
Authors (10): Kangqi Chen, Andreas Kosmas Kakolyris, Rakesh Nadig, Manos Frouzakis, Nika Mansouri Ghiasi, Yu Liang, Haiyu Mao, Jisung Park, Mohammad Sadrosadati, Onur Mutlu
Large Language Models (LLMs) face an inherent challenge: their knowledge is confined to the data that they have been trained on. To overcome this issue, Retrieval-Augmented Generation (RAG) complements the static training-derived knowledge of LLMs with an external knowledge repository. RAG consists of three stages: indexing, retrieval, and generation. The retrieval stage of RAG becomes a significant bottleneck in inference pipelines. In this stage, a user query is mapped to an embedding vector and an Approximate Nearest Neighbor Search (ANNS) algorithm searches for similar vectors in the database to identify relevant items. Due to the large database sizes, ANNS incurs significant data movement overheads between the host and the storage system. To alleviate these overheads, prior works propose In-Storage Processing (ISP) techniques that accelerate ANNS by performing computations inside storage. However, existing works that leverage ISP for ANNS (i) employ algorithms that are not tailored to ISP systems, (ii) do not accelerate data retrieval operations for data selected by ANNS, and (iii) introduce significant hardware modifications, limiting performance and hindering their adoption. We propose REIS, the first ISP system tailored for RAG that addresses these limitations with three key mechanisms. First, REIS employs a database layout that links database embedding vectors to their associated documents, enabling efficient retrieval. Second, it enables efficient ANNS by introducing an ISP-tailored data placement technique that distributes embeddings across the planes of the storage system and employs a lightweight Flash Translation Layer. Third, REIS leverages an ANNS engine that uses the existing computational resources inside the storage system. Compared to a server-grade system, REIS improves the performance (energy efficiency) of retrieval by an average of 13x (55x).
大型语言模型(LLMS) 面临一个固有的挑战: 他们的知识仅限于他们所培训的数据。 为了克服这一问题, Retrealval- Auged Ging(RAG) 以外部知识库补充了LLMS的静态培训派生知识。 RAG 由三个阶段组成: 索引、 检索和生成。 RAG 的检索阶段在推导管道中成为一个很大的瓶颈。 在现阶段, 用户查询被映射为嵌入矢量和接近近距离 Neighbor 搜索(ANNS) 算法搜索数据库中类似的矢量,以识别相关项目。 由于数据库规模大, ARGS 在主机和存储系统之间产生了大量的数据移动管理管理。 为了减轻这些管理费, 先前的工程提议通过内部存储计算加速ANNS 13 管道的加速 。 然而, 将 ISP 用于 NANS 的系统(i) 使用不适应 ISP 系统 的算法, (ii) 不加速为ANNS 所选择的数据检索运行的 RELS 流流流流流流流 系统 的检索操作 , , , 系统 将 升级 系统 升级 系统 系统 升级 系统 系统 升级 将 升级 系统 系统 系统 升级 系统 升级 升级 。
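The retrieval stage described in the abstract above can be illustrated with a minimal host-side baseline in Python. The sketch stores each embedding next to its document, so that a nearest-neighbour hit leads directly to the text to return. It uses exhaustive exact search as a stand-in for an ANNS algorithm and does not model in-storage processing or any REIS mechanism; the function names and toy data are assumptions of this example.

```python
import heapq


def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def retrieve(query, database, k):
    """Return the documents of the k embeddings closest to the query.

    database is a list of (embedding, document) pairs, so each embedding is
    stored together with the document it was derived from.
    """
    nearest = heapq.nsmallest(
        k, database, key=lambda entry: squared_distance(query, entry[0])
    )
    return [document for _, document in nearest]


database = [
    ([0.0, 0.0], "doc about caches"),
    ([1.0, 1.0], "doc about flash storage"),
    ([5.0, 5.0], "doc about compilers"),
    ([0.9, 1.2], "doc about SSD controllers"),
]
print(retrieve([1.0, 1.0], database, k=2))
```

On a real system every distance computation in `retrieve` requires moving an embedding from storage to the host, which is the data-movement cost that in-storage approaches try to avoid.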
Article 24
Title@2025-06-19 (4): All-in-One Analog AI Hardware: On-Chip Training and Inference with Conductive-Metal-Oxide/HfOx ReRAM Devices
Title: All-in-One Analog AI Hardware: On-Chip Training and Inference with Conductive-Metal-Oxide/HfOx ReRAM Devices | All-in-One Analog-KI-Hardware: On-Chip-Training und Schlussfolgerung mit leitfähig-Metall-Oxid/HfOx-ReRAM-Geräten | A. AI 硬件:与导体-金属-氧化物/HfOx雷射装置的芯片培训和推断 2502.04524v4 |
Authors (12): Donato Francesco Falcone, Victoria Clerico, Wooseok Choi, Tommaso Stecconi, Folkert Horst, Laura Begon-Lours, Matteo Galetta, Antonio La Porta, Nikhil Garg, Fabien Alibart, Bert Jan Offrein, Valeria Bragaglia
Analog in-memory computing is an emerging paradigm designed to efficiently accelerate deep neural network workloads. Recent advancements have focused on either inference or training acceleration. However, a unified analog in-memory technology platform, capable of on-chip training, weight retention, and long-term inference acceleration, has yet to be reported. This work presents an all-in-one analog AI accelerator, combining these capabilities to enable energy-efficient, continuously adaptable AI systems. The platform leverages an array of analog filamentary conductive-metal-oxide (CMO)/HfOx resistive switching memory cells (ReRAM) integrated into the back-end-of-line (BEOL). The array demonstrates reliable resistive switching with voltage amplitudes below 1.5V, compatible with advanced technology nodes. The array multi-bit capability (over 32 stable states) and low programming noise (down to 10nS) enable a nearly ideal weight transfer process, more than an order of magnitude better than other memristive technologies. Inference performance is validated through matrix-vector multiplication simulations on a 64x64 array, achieving a root-mean-square error improvement by a factor of 20 at 1 second and 3 at 10 years after programming, compared to state-of-the-art. Training accuracy closely matching the software equivalent is achieved across different datasets. The CMO/HfOx ReRAM technology lays the foundation for efficient analog systems accelerating both inference and training in deep neural networks.
模拟模拟计算是一个新兴范例,旨在高效加快深神经网络工作量。最近的进展侧重于推论或培训加速。然而,一个统一的模拟模拟模拟技术平台,能够进行芯片培训、保重和长期加速推力,但尚未报告。这项工作是一个全在的模拟AI加速器,将这些能力结合起来,以促成节能、持续适应的AI系统。平台利用一系列模拟丝状导电离电动/HfOx阻动存储器(ReRAM),将其整合到后端(BEL)中。不过,该阵列展示了可靠的阻力转换,其电流加速度低于1.5V,与先进技术节点相容。阵列多位加速器能力(超过32个稳定状态)和低编程噪音(降至10nS)使得一个接近理想的权重转换过程,比其他模拟技术要高得多一个级级的深度和级的深度转换存储存储存储存储存储存储存储器存储器(RERAM)的性性性能通过等量的矩阵-模拟10年之后的直径直径镜模拟,在10年的模拟中,在10年的模拟中,在10年的模拟中,通过对等量的模拟中,在10年的模拟中,通过等量的模拟中,通过对等量的精确的精确度的精确的精确的模拟,在10年的模拟,对等量的模拟,在10次的精确的模拟数据进行了10次的模拟进行数据序列的模拟的模拟,在10年的模拟的模拟的模拟的模拟进行。
Article 25
Title@2025-06-19 (4): Serving Large Language Models on Huawei CloudMatrix384
Title: Serving Large Language Models on Huawei CloudMatrix384 | Große Sprachmodelle auf Huawei CloudMatrix384 | 瓦威云马特列克384 2506.12708v3 |
Authors (46): Pengfei Zuo, Huimin Lin, Junbo Deng, Nan Zou, Xingkun Yang, Yingyu Diao, Weifeng Gao, Ke Xu, Zhangyu Chen, Shirui Lu, Zhao Qiu, Peiyang Li, Xianyu Chang, Zhengzhong Yu, Fangzheng Miao, Jia Zheng, Ying Li, Yuan Feng, Bei Wang, Zaijian Zong, Mosong Zhou, Wenli Zhou, Houjiang Chen, Xingyu Liao, Yipeng Li, Wenxiao Zhang, Ping Zhu, Yinggang Wang, Chuanjie Xiao, Depeng Liang, Dong Cao, Juncheng Liu, Yongqiang Yang, Xiaolong Bai, Yi Li, Huaguo Xie, Huatao Wu, Zhibin Yu, Lv Chen, Hu Liu, Yujun Ding, Haipei Zhu, Jing Xia, Yi Xiong, Zhou Yu, Heng Liao
The rapid evolution of large language models (LLMs), driven by growing parameter scales, adoption of mixture-of-experts (MoE) architectures, and expanding context lengths, imposes unprecedented demands on AI infrastructure. Traditional AI clusters face limitations in compute intensity, memory bandwidth, inter-chip communication, and latency, compounded by variable workloads and strict service-level objectives. Addressing these issues requires fundamentally redesigned hardware-software integration. This paper introduces Huawei CloudMatrix, a next-generation AI datacenter architecture, realized in the production-grade CloudMatrix384 supernode. It integrates 384 Ascend 910 NPUs and 192 Kunpeng CPUs interconnected via an ultra-high-bandwidth Unified Bus (UB) network, enabling direct all-to-all communication and dynamic pooling of resources. These features optimize performance for communication-intensive operations, such as large-scale MoE expert parallelism and distributed key-value cache access. To fully leverage CloudMatrix384, we propose CloudMatrix-Infer, an advanced LLM serving solution incorporating three core innovations: a peer-to-peer serving architecture that independently scales prefill, decode, and caching; a large-scale expert parallelism strategy supporting EP320 via efficient UB-based token dispatch; and hardware-aware optimizations including specialized operators, microbatch-based pipelining, and INT8 quantization. Evaluation with the DeepSeek-R1 model shows CloudMatrix-Infer achieves state-of-the-art efficiency: prefill throughput of 6,688 tokens/s per NPU and decode throughput of 1,943 tokens/s per NPU (<50 ms TPOT). It effectively balances throughput and latency, sustaining 538 tokens/s per NPU even under stringent 15 ms latency constraints, while INT8 quantization maintains model accuracy across benchmarks.
大型语言模型(LLM)的快速演进，由不断增长的参数规模、混合专家(MoE)架构的采用以及不断扩展的上下文长度所驱动，对AI基础设施提出了前所未有的要求。传统AI集群在计算强度、内存带宽、芯片间通信和延迟方面面临限制，加上多变的工作负载和严格的服务级目标。解决这些问题需要从根本上重新设计软硬件集成。本文介绍了华为CloudMatrix，一种下一代AI数据中心架构，并在生产级CloudMatrix384超节点中实现。它整合了384个Ascend 910 NPU和192个Kunpeng CPU，通过超高带宽的统一总线(UB)网络互连，实现直接的全对全通信和资源的动态池化。这些特性优化了通信密集型操作的性能，如大规模MoE专家并行和分布式键值缓存访问。为了充分利用CloudMatrix384，我们提出CloudMatrix-Infer，一种包含三项核心创新的先进LLM服务方案。
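The decode figures quoted in this abstract can be related by a rough steady-state estimate. The sketch below assumes that every active decode stream emits one token per TPOT, so per-NPU throughput is approximately the number of concurrent streams divided by TPOT. The function name is illustrative, and the abstract does not state the actual batch sizes used in the paper.

```python
# Rough steady-state estimate (an assumption, not the paper's methodology):
# tokens/s per NPU ~= concurrent decode streams per NPU / TPOT.
def implied_streams_per_npu(tokens_per_s: float, tpot_ms: float) -> float:
    """Concurrent decode streams per NPU implied by a throughput and a TPOT bound."""
    return tokens_per_s * (tpot_ms / 1000.0)

# Throughput and latency figures as quoted in the abstract.
loose = implied_streams_per_npu(1943, 50)  # <50 ms TPOT operating point
tight = implied_streams_per_npu(538, 15)   # 15 ms TPOT operating point

print(f"~{loose:.0f} streams/NPU at 50 ms, ~{tight:.0f} streams/NPU at 15 ms")
```

Under this assumption the tighter latency target corresponds to a much smaller per-NPU batch, which is consistent with the lower throughput reported at 15 ms.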
Article 26
Title@2025-06-19 (4): Microcanonical simulated annealing: Massively parallel Monte Carlo simulations with sporadic random-number generation
Title: Microcanonical simulated annealing: Massively parallel Monte Carlo simulations with sporadic random-number generation | Mikrokanonisches simuliertes Glühen: Massiv parallele Monte-Carlo-Simulationen mit sporadischer Zufalls-Zahlen-Generierung | 模拟微气候模拟Annealing:以零星随机数生成的大规模平行的蒙特卡洛模拟模拟 2506.16240v1 |
Authors (9): M. Bernaschi, L. A. Fernandez, I. González-Adalid Pemartín, E. Marinari, V. Martin-Mayor, G. Parisi, F. Ricci-Tersenghi, J. J. Ruiz-Lorenzo, D. Yllanes
Numerical simulations of models and theories that describe complex experimental systems, in fields like high-energy and condensed-matter physics, are becoming increasingly important. Examples include lattice gauge theories, which can describe, among others, quantum chromodynamics (the Standard Model description of strong interactions between elementary particles), and spin-glass systems. Beyond fundamental research, these computational methods also find practical applications, among many others, in optimization, finance, and complex biological problems. However, Monte Carlo simulations, an important subcategory of these methods, are plagued by a major drawback: they are extremely greedy for (pseudo) random numbers. The total fraction of computer time dedicated to random-number generation increases as the hardware grows more sophisticated, and can become prohibitive for special-purpose computing platforms. We propose here a general-purpose microcanonical simulated annealing (mic.SA) formalism that dramatically reduces such a burden. The algorithm is fully adapted to a massively parallel computation, as we show in the particularly demanding benchmark of the three-dimensional Ising spin glass. We carry out very stringent numerical tests of the new algorithm by comparing our results, obtained on GPUs, with high-precision standard (i.e., random-number-greedy) simulations performed on the Janus II custom-built supercomputer. In those cases where thermal equilibrium is reachable (i.e., in the paramagnetic phase), both simulations reach compatible values. More significantly, barring short-time corrections, a simple time rescaling suffices to map the mic.SA off-equilibrium dynamics onto the results obtained with standard simulations.
描述复杂实验系统(如高能物理和凝聚态物理领域)的模型和理论的数值模拟正变得越来越重要。例子包括格点规范理论(可用于描述量子色动力学，即标准模型对基本粒子间强相互作用的描述)和自旋玻璃系统。除基础研究外，这些计算方法还在优化、金融和复杂生物问题等方面有实际应用。然而，蒙特卡洛模拟作为这些方法的一个重要子类，有一个重大缺陷：它们对(伪)随机数的需求极大。随着硬件日益复杂，用于生成随机数的计算时间占比不断上升，对专用计算平台而言可能高得难以承受。我们在此提出一种通用的微正则模拟退火(mic.SA)形式体系，可大幅减轻这一负担。该算法完全适合大规模并行计算，我们在要求特别苛刻的三维伊辛自旋玻璃基准测试中展示了这一点。我们将GPU上获得的结果与在Janus II定制超级计算机上进行的高精度标准(即大量消耗随机数的)模拟进行比较，对新算法进行了非常严格的数值检验。在可达到热平衡的情况下(即顺磁相)，两种模拟得到相容的数值。更重要的是，除短时修正外，只需简单的时间重标度即可将mic.SA的非平衡动力学映射到标准模拟的结果上。
Article 27
Title@2025-06-19 (4): How to Increase Energy Efficiency with a Single Linux Command
Title: How to Increase Energy Efficiency with a Single Linux Command | Wie man die Energieeffizienz mit einem einzigen Linux-Befehl erhöht | 如何通过单Linux指挥来提高能源效率 2506.16046v1 |
Authors (3): Alborz Jelvani, Richard P Martin, Santosh Nagarakatte
Processors with dynamic power management provide a variety of settings to control energy efficiency. However, tuning these settings does not achieve optimal energy savings. We highlight how existing power capping mechanisms can address these limitations without requiring any changes to current power governors. We validate this approach using system measurements across a month-long data acquisition campaign from SPEC CPU 2017 benchmarks on a server-class system equipped with dual Intel Xeon Scalable processors. Our results indicate that setting a simple power cap can improve energy efficiency by up to 25% over traditional energy-saving system configurations with little performance loss, as most default settings focus on thermal regulation and performance rather than compute efficiency. Power capping is very accessible compared to other approaches, as it can be implemented with a single Linux command. Our results point to programmers and administrators using power caps as a primary mechanism to maintain significant energy efficiency while retaining acceptable performance, as opposed to deploying complex DVFS algorithms.
具有动态电源管理的处理器提供了控制能源效率的各种环境。 但是, 调整这些环境并不能实现最佳节能。 我们强调现有的电力封顶机制如何在不要求改变当前电力主管的情况下解决这些限制。 我们用SPEC CPU 2017 标准对一个为期一个月的数据采集活动进行系统测量来验证这一方法,该标准来自配备双 Intel Xeon 可缩放处理器的服务器级系统2017 的基准。 我们的结果表明,建立简单的电力封顶可以比传统的节能系统配置提高能源效率高达25%,但性能损失很少,因为大多数默认设置侧重于热调和性能,而不是计算效率。 与其它方法相比,电封顶非常容易使用,因为它可以使用单一的Linux指令执行。 我们的成果点是,程序员和行政管理人员使用电源封顶作为主要机制来保持高能效,同时保持可接受的性能,而不是使用复杂的DVFS算法。
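The abstract does not name the command it refers to. As a minimal sketch, assuming the cap is applied through the Linux powercap sysfs interface for Intel RAPL, the snippet below builds the file path and the microwatt value and prints an equivalent shell one-liner. The zone name `intel-rapl:0` and the 120 W value are hypothetical and vary by machine; writing the file requires root.

```python
from pathlib import Path

# Root of the Linux powercap sysfs tree. The zone name used below is an
# assumption; list this directory on the target machine to find real zones.
POWERCAP_ROOT = Path("/sys/class/powercap")

def cap_file(zone: str = "intel-rapl:0", constraint: int = 0) -> Path:
    """Path of the power-limit file for one constraint of one RAPL zone."""
    return POWERCAP_ROOT / zone / f"constraint_{constraint}_power_limit_uw"

def watts_to_microwatts(watts: float) -> int:
    """The sysfs file expects an integer number of microwatts."""
    return int(round(watts * 1_000_000))

def set_power_cap(watts: float, zone: str = "intel-rapl:0") -> None:
    """Write the cap. Needs root and a kernel exposing the RAPL powercap driver."""
    cap_file(zone).write_text(str(watts_to_microwatts(watts)))

# Print the equivalent single shell command instead of writing to sysfs here.
# 120 W is a hypothetical cap, not a value taken from the paper.
print(f"echo {watts_to_microwatts(120)} | sudo tee {cap_file()}")
```

Calling `set_power_cap(120)` as root would apply the same cap from Python; the printed one-liner is the form closest to the "single command" framing of the abstract.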
Article 28
Title@2025-06-19 (4): HetGPU: The pursuit of making binary compatibility towards GPUs
Title: HetGPU: The pursuit of making binary compatibility towards GPUs | HetGPU: Das Streben nach binärer Kompatibilität gegenüber GPUs | HETGPU: 努力使二进制兼容到 GPUs 2506.15993v1 |
Authors (4): Yiwei Yang, Yusheng Zheng, Tong Yu, Andi Quinn
Heterogeneous GPU infrastructures present a binary compatibility challenge: code compiled for one vendor’s GPU will not run on another due to divergent instruction sets, execution models, and driver stacks. We propose hetGPU, a new system comprising a compiler, runtime, and abstraction layer that together enable a single GPU binary to execute on NVIDIA, AMD, Intel, and Tenstorrent hardware. The hetGPU compiler emits an architecture-agnostic GPU intermediate representation (IR) and inserts metadata for managing execution state. The hetGPU runtime then dynamically translates this IR to the target GPU’s native code and provides a uniform abstraction of threads, memory, and synchronization. Our design tackles key challenges: differing SIMT vs. MIMD execution (warps on NVIDIA/AMD vs. many-core RISC-V on Tenstorrent), varied instruction sets, scheduling and memory model discrepancies, and the need for state serialization for live migration. We detail the hetGPU architecture, including the IR transformation pipeline, a state capture/reload mechanism for live GPU migration, and an abstraction layer that bridges warp-centric and core-centric designs. Preliminary evaluation demonstrates that unmodified GPU binaries compiled with hetGPU can be migrated across disparate GPUs with minimal overhead, opening the door to vendor-agnostic GPU computing.
异构GPU基础设施存在二进制兼容性难题：由于指令集、执行模型和驱动栈各不相同，为某一厂商GPU编译的代码无法在另一厂商的GPU上运行。我们提出hetGPU，这是一个由编译器、运行时和抽象层组成的新系统，使单个GPU二进制文件能够在NVIDIA、AMD、Intel和Tenstorrent硬件上执行。hetGPU编译器生成与架构无关的GPU中间表示(IR)，并插入用于管理执行状态的元数据。hetGPU运行时随后将该IR动态翻译为目标GPU的本机代码，并提供线程、内存和同步的统一抽象。我们的设计应对了若干关键挑战：SIMT与MIMD执行方式的差异(NVIDIA/AMD上的warp与Tenstorrent上的众核RISC-V)、不同的指令集、调度和内存模型的差异，以及实时迁移所需的状态序列化。我们详细介绍了hetGPU架构，包括IR转换流水线、用于GPU实时迁移的状态捕获/重载机制，以及衔接以warp为中心和以核心为中心两类设计的抽象层。初步评估表明，使用hetGPU编译的未经修改的GPU二进制文件能够以极小的开销在不同GPU之间迁移，为厂商无关的GPU计算打开了大门。
Article 29
Title@2025-06-19 (4): Profile-Guided Temporal Prefetching
Title: Profile-Guided Temporal Prefetching | Profilgeführtes Temporales Prefetching | 预测性指导时间间隔期 2506.15985v1 |
Authors (7): Mengming Li, Qijun Zhang, Yichuan Gao, Wenji Fang, Yao Lu, Yongqing Ren, Zhiyao Xie
Temporal prefetching shows promise for handling irregular memory access patterns, which are common in data-dependent and pointer-based data structures. Recent studies introduced on-chip metadata storage to reduce the memory traffic caused by accessing metadata from off-chip DRAM. However, existing prefetching schemes struggle to efficiently utilize the limited on-chip storage. An alternative solution, software indirect access prefetching, remains ineffective for optimizing temporal prefetching. In this work, we propose Prophet, a hardware-software co-designed framework that leverages profile-guided methods to optimize metadata storage management. Prophet profiles programs using counters instead of traces, injects hints into programs to guide metadata storage management, and dynamically tunes these hints to enable the optimized binary to adapt to different program inputs. Prophet is designed to coexist with existing hardware temporal prefetchers, delivering efficient, high-performance solutions for frequently executed workloads while preserving the original runtime scheme for less frequently executed workloads. Prophet outperforms the state-of-the-art temporal prefetcher, Triangel, by 14.23%, effectively addressing complex temporal patterns where prior profile-guided solutions fall short (achieving only a 0.1% performance gain). Prophet delivers superior performance across all evaluated workload inputs, introducing negligible profiling, analysis, and instruction overhead.
最新的研究引入了芯片元数据存储系统,以减少获取离芯片 DRAM 元数据导致的记忆流量。然而,现有的预发计划难以有效利用芯片储存的有限数据。 另一种解决方案,即软件间接存取预发,对于优化时间累积而言,仍然无效。 在这项工作中,我们提出先知-一个硬件软件共同设计的框架,利用剖面设计方法优化元数据储存管理。 先知剖面设计程序使用反跟踪,向程序输入提示,以指导元数据存储管理,并动态调试这些提示,以使优化的双环图能够适应不同的程序投入。先知的设计是与现有的硬件时间累积器共存,为经常完成的工作量提供高效、高绩效解决方案,同时保留原运行时间计划,以降低执行工作量的频率。 先知超越了配置配置工具优化元数据储存管理的方法。 先知剖面设计程序使用反跟踪,向程序输入提示提示,向程序输入提示,以指导元数据存储管理程序输入提示,并动态调调这些提示,使优化的双环图能够适应不同的程序投入。 先知设计所有复杂的时间分析,通过14.23度分析, 有效评估了所有高级分析,从而实现了所有高级分析。
Article 30
Title@2025-06-18 (3): Sudoku: Decomposing DRAM Address Mapping into Component Functions
Title: Sudoku: Decomposing DRAM Address Mapping into Component Functions | Sudoku: Dekomponieren von DRAM-Adressen Mapping in Komponentenfunktionen | 数独: 将 DRAM 地址映射解析成元函数 2506.15918v1 |
Authors (5): Minbok Wi, Seungmin Baek, Seonyong Park, Mattan Erez, Jung Ho Ahn
Decomposing DRAM address mappings into component-level functions is critical for understanding memory behavior and enabling precise RowHammer attacks, yet existing reverse-engineering methods fall short. We introduce novel timing-based techniques leveraging DRAM refresh intervals and consecutive access latencies to infer component-specific functions. Based on this, we present Sudoku, the first software-based tool to automatically decompose full DRAM address mappings into channel, rank, bank group, and bank functions while identifying row and column bits. We validate Sudoku’s effectiveness, successfully decomposing mappings on recent Intel and AMD processors.
将 DRAM 地址映射解析成组件级功能对于理解记忆行为和精确的 RowHammer 袭击至关重要,但现有的反向工程方法却不尽人意。 我们引入了基于时间的新技术,利用 DRAM 的更新间隔和连续存取延迟来推断组件特定功能。 在此基础上, 我们展示了Sudoku, 这是第一个基于软件的工具, 可以自动将 DRAM 地址映射完全分解成频道、 级别、 银行集团和银行功能, 同时识别行和列位位。 我们验证了Sudoku 的效能, 成功地将最近的 Intel 和 AMD 处理器的映射解析 。
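To illustrate what a decomposed address mapping looks like, the sketch below assumes that each bit of a component index (channel, bank group, bank) is the XOR of a subset of physical-address bits, a form commonly reported for recent processors. The abstract does not give the recovered functions, so the masks in `HYPOTHETICAL_FUNCTIONS` are made up for illustration only.

```python
def xor_fold(addr: int, mask: int) -> int:
    """Parity (XOR) of the physical-address bits selected by mask."""
    return bin(addr & mask).count("1") & 1

def decode(addr: int, functions: dict[str, list[int]]) -> dict[str, int]:
    """Evaluate every component function; masks[i] yields bit i of the index."""
    result = {}
    for component, masks in functions.items():
        index = 0
        for bit, mask in enumerate(masks):
            index |= xor_fold(addr, mask) << bit
        result[component] = index
    return result

# Hypothetical masks, NOT taken from the paper or from any real processor.
HYPOTHETICAL_FUNCTIONS = {
    "channel": [(1 << 8) | (1 << 12)],
    "bank_group": [(1 << 6) | (1 << 13), (1 << 14) | (1 << 17)],
    "bank": [(1 << 15) | (1 << 18), (1 << 16) | (1 << 19)],
}

print(decode((1 << 14) | (1 << 15), HYPOTHETICAL_FUNCTIONS))
```

A tool in the spirit of Sudoku would recover the masks from timing measurements; this snippet only shows how such masks, once known, map an address to its components.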
Article 31
Title@2025-06-18 (3): Bias Variation Compensation in Perimeter-Gated SPAD TRNGs
Title: Bias Variation Compensation in Perimeter-Gated SPAD TRNGs | Bias Variation Compensation in Perimeter-Gated SPAD TRNGs | 围地的SPAD TRNGs中的比值变化赔偿 2506.15888v1 |
Authors (3): Md Sakibur Sajal, Hunter Guthrie, Marc Dandin
Random number generators that utilize arrays of entropy source elements suffer from bias variation (BV). Despite the availability of efficient debiasing algorithms, optimized implementations of hardware-friendly options depend on the bit bias in the raw bit streams and cannot accommodate a wide BV. In this work, we present a 64 x 64 array of perimeter-gated single photon avalanche diodes (pgSPADs), fabricated in a 0.35 µm standard CMOS technology, as a source of entropy to generate random binary strings with a BV compensation technique. By applying proper gate voltages based on the devices’ native dark count rates, we demonstrate less than 1% BV for a raw-bit generation rate of 2 kHz/pixel at room temperature. The raw bits were debiased using the classical iterative Von Neumann’s algorithm and the debiased bits were found to pass all of the 16 tests from NIST’s Statistical Test Suite.
使用 entropy 源元素阵列的随机数生成器存在偏差差异( BV) 。 尽管有高效的降低偏差算法, 硬件友好选项的优化实施取决于原始位数流中的比特偏差, 无法容纳宽度 BV 。 在这项工作中, 我们展示了64 x 64 个周边封闭的单光电雪崩二极管( pgSPADs)阵列, 以0. 35 ~ mu}mum 标准 CMOS 技术制成, 以此为一种诱变源, 用 BV 补偿技术生成随机的双弦字符。 通过应用基于设备本地暗数率的适当门电压, 我们显示在室温下2 kHz/ pixel 的原始位数代率低于1% BV 。 原始位数使用古典迭代 Von Neumann 的算法和除位数位元元元数通过 NIST 统计测试套的全部16 测试。
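For readers unfamiliar with the debiasing step, the sketch below shows a single pass of the basic Von Neumann extractor: raw bits are read in non-overlapping pairs, unequal pairs emit their first bit, and equal pairs are discarded. This is the textbook single-pass form, not necessarily the iterated variant the authors implemented, and the function name is illustrative.

```python
def von_neumann(bits):
    """One pass of Von Neumann debiasing over non-overlapping bit pairs.

    (0, 1) -> 0, (1, 0) -> 1, and (0, 0) or (1, 1) are dropped.
    A trailing unpaired bit is ignored.
    """
    out = []
    for first, second in zip(bits[0::2], bits[1::2]):
        if first != second:
            out.append(first)
    return out

raw = [0, 1, 1, 0, 1, 1, 0, 0]
print(von_neumann(raw))
```

For independent raw bits with probability p of being 1, each pair is kept with probability 2p(1 - p), so the output rate depends on the bias. This is one reason a small bias variation across the array helps hardware implementations.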
Article 32
Title@2025-06-18 (3): A System Level Compiler for Massively-Parallel, Spatial, Dataflow Architectures
Title: A System Level Compiler for Massively-Parallel, Spatial, Dataflow Architectures | Ein System Level Compiler für massiv-Parallel-, räumliche, Datenflussarchitekturen | 用于Massalizy-Parallel、空间、数据流结构的系统级汇编器 2506.15875v1 |
Authors (5): Dirk Van Essendelft, Patrick Wingo, Terry Jordan, Ryan Smith, Wissam Saidi
We have developed a novel compiler called the Multiple-Architecture Compiler for Advanced Computing Hardware (MACH) designed specifically for massively-parallel, spatial, dataflow architectures like the Wafer Scale Engine. Additionally, MACH can execute code on traditional unified-memory devices. MACH addresses the complexities in compiling for spatial architectures through a conceptual Virtual Machine, a flexible domain-specific language, and a compiler that can lower high-level languages to machine-specific code in compliance with the Virtual Machine concept. While MACH is designed to be operable on several architectures and provide the flexibility for several standard and user-defined data mappings, we introduce the concept with dense tensor examples from NumPy and show lowering to the Wafer Scale Engine by targeting Cerebras’ hardware specific languages.
我们开发了一个小说编译器,名为高级计算机硬件多建筑汇编器(MACH),专门设计用于大规模平行、空间、数据流结构,如Wafer比例式引擎。此外,MACH可以执行传统统一模拟装置的代码。MACH通过概念虚拟机器、灵活的域名语言和能够根据虚拟机器概念将高语言降为机器专用代码的编译器,解决空间结构汇编的复杂性。MACH设计在几个结构上可以操作,并为若干标准和用户定义的数据绘图提供灵活性。我们用NumPy的密度高的高压示例来介绍这一概念,并以Cerebras硬件特定语言为对象,向瓦费尔比例式引擎展示低调的概念。
Article 33
Title@2025-06-18 (3): SR-NCL: an Area-/Energy-Efficient Resilient NCL Architecture Based on Selective Redundancy
Title: SR-NCL: an Area-/Energy-Efficient Resilient NCL Architecture Based on Selective Redundancy | SR-NCL: eine flächen-/energieeffiziente NCL-Architektur basierend auf selektiver Redundanz | SR-NCL:基于选择性冗余的地区/能源/能有效抗力的NCL建筑 2506.15634v1 |
Authors (3): Hasnain A. Ziad, Alexander C. Bodoh, Ashiq A. Sakib
Duplication-based redundancy schemes have proven to be effective in designing fully-resilient Quasi-delay Insensitive (QDI) asynchronous circuits. The complete resiliency, however, is accompanied by significant energy, latency, and area overhead. This paper presents a novel error-tolerant Null Convention Logic (NCL) architecture based on selective redundancy. Results demonstrate the efficacy of the proposed method in terms of area and energy utilization as compared to existing duplication-based NCL designs, targeting an image processing application.
事实证明,基于重复的冗余计划在设计完全耐受力的准易变性不敏感电路(QDI)方面是有效的,但是,在完全恢复能力的同时,还有大量的能量、潜伏和地区间接费用,本文件提出了基于选择性冗余的新颖的耐误 Null公约逻辑(NCL)架构,结果显示,与现有的基于重复的NCL设计相比,拟议方法在面积和能源利用方面的效力,目标是图像处理应用程序。
Article 34
Title@2025-06-18 (3): From Block to Byte: Transforming PCIe SSDs with CXL Memory Protocol and Instruction Annotation
Title: From Block to Byte: Transforming PCIe SSDs with CXL Memory Protocol and Instruction Annotation | Von Block zu Byte: Transformation von PCIe SSDs mit CXL Memory Protocol und Instruction Annotation | 从块到字节: 用 CXL 内存协议和指令注释转换 PCIe SSD 2506.15613v1 |
Authors (11): Miryeong Kwon, Donghyun Gouk, Junhyeok Jang, Jinwoo Baek, Hyunwoo You, Sangyoon Ji, Hongjoo Jung, Junseok Moon, Seungkwan Kang, Seungjun Lee, Myoungsoo Jung
This paper explores how Compute Express Link (CXL) can transform PCIe-based block storage into a scalable, byte-addressable working memory. We address the challenges of adapting block storage to CXL’s memory-centric model by emphasizing cacheability as a key enabler and advocating for Type 3 endpoint devices, referred to as CXL-SSDs. To validate our approach, we prototype a CXL-SSD on a custom FPGA platform and propose annotation mechanisms, Determinism and Bufferability, to enhance performance while preserving data persistency. Our simulation-based evaluation demonstrates that CXL-SSD achieves 10.9x better performance than PCIe-based memory expanders and further reduces latency by 5.4x with annotation enhancements. In workloads with high locality, CXL-SSD approaches DRAM-like performance due to efficient on-chip caching. This work highlights the feasibility of integrating block storage into CXL’s ecosystem and provides a foundation for future memory-storage convergence.
本文探讨了计算Express Link (CXL) 如何将基于 PCIe 的块状存储转换成可缩放、可字数处理的工作内存。 我们通过强调作为关键助推器的缓存性,并提倡使用称为 CXL-SSD 的3型端点设备来应对将块状存储转换成CXL-SSD 的挑战。 为了验证我们的方法,我们将CXL-SSD 原型放在一个定制的 FPGA 平台上,并提议说明机制,即确定性和缓冲性,以便在保存数据持久性的同时提高性能。 我们的模拟评价表明, CXL-SSD比基于 PCI 的内存扩展器取得10.9x 更好的性能,并通过加注度进一步减少5.4x 的耐久性。 在高地点工作量中, CXL-SDD接近DRAM 一样的性能,以高效的电荷堆积为目的。 这项工作突出了将块状存储并入 CXL 生态系统的可行性,并为未来存储存储中提供基础。
Article 35
Title@2025-06-18 (3): CXL-GPU: Pushing GPU Memory Boundaries with the Integration of CXL Technologies
Title: CXL-GPU: Pushing GPU Memory Boundaries with the Integration of CXL Technologies | CXL-GPU: GPU-Speichergrenzen mit der Integration von CXL-Technologien drücken | CXL-GPU:推动GPU记忆边界与CXL技术一体化 2506.15601v1 |
Authors (11): Donghyun Gouk, Seungkwan Kang, Seungjun Lee, Jiseon Kim, Kyungkuk Nam, Eojin Ryu, Sangwon Lee, Dongpyung Kim, Junhyeok Jang, Hanyeoreum Bae, Myoungsoo Jung
This work introduces a GPU storage expansion solution utilizing CXL, featuring a novel GPU system design with multiple CXL root ports for integrating diverse storage media (DRAMs and/or SSDs). We developed and siliconized a custom CXL controller integrated at the hardware RTL level, achieving two-digit nanosecond roundtrip latency, the first in the field. This study also includes speculative read and deterministic store mechanisms to efficiently manage read and write operations to hide the endpoint’s backend media latency variation. Performance evaluations reveal our approach significantly outperforms existing methods, marking a substantial advancement in GPU storage technology.
这项工作引入了使用 CXL 的 GPU 存储扩展解决方案, 其特点是具有多个 CXL 根端端的新型 GPU 系统设计, 用于整合多种存储介质( DRAMs 和/或 SSDs) 。 我们开发并整合了硬件 RTL 级的自定义 CXL 控制器, 实现两位数的 纳米第二轮回转延迟, 首先是实地的。 此研究还包括用于有效管理读写操作以隐藏端点后端介质延迟的读写操作的投机性和确定性存储器机制 。 绩效评估显示, 我们的方法大大超过现有方法, 标志着 GPU 存储技术的显著进步 。
Article 36
Title@2025-06-18 (3): Acore-CIM: build accurate and reliable mixed-signal CIM cores with RISC-V controlled self-calibration
Title: Acore-CIM: build accurate and reliable mixed-signal CIM cores with RISC-V controlled self-calibration | Acore-CIM: Aufbau präziser und zuverlässiger CIM-Kerne mit RISC-V-gesteuerter Selbstkalibrierung | 核心CIM:建立准确可靠的混合信号CIM核心,并配有RISC-V控制的自我校准 2506.15440v1 |
Authors (9): Omar Numan, Gaurav Singh, Kazybek Adam, Jelin Leslin, Aleksi Korsman, Otto Simola, Marko Kosunen, Jussi Ryynänen, Martin Andraud
Developing accurate and reliable Compute-In-Memory (CIM) architectures is becoming a key research focus to accelerate Artificial Intelligence (AI) tasks on hardware, particularly Deep Neural Networks (DNNs). In that regard, there has been significant interest in analog and mixed-signal CIM architectures aimed at increasing the efficiency of data storage and computation to handle the massive amount of data needed by DNNs. Specifically, resistive mixed-signal CIM cores are driven by recent progress in emerging Non-Volatile Memory (eNVM) solutions. Yet, mixed-signal CIM computing cores still face several integration and reliability challenges that hinder their large-scale adoption into end-to-end AI computing systems. In terms of integration, resistive and eNVM-based CIM cores need to be integrated with a control processor to realize end-to-end AI acceleration. Moreover, SRAM-based CIM architectures are still more efficient and easier to program than their eNVM counterparts. In terms of reliability, analog circuits are more susceptible to variations, leading to computation errors and degraded accuracy. This work addresses these two challenges by proposing a self-calibrated mixed-signal CIM accelerator SoC, fabricated in 22-nm FDSOI technology. The integration is facilitated by (1) the CIM architecture, combining the density and ease of SRAM-based weight storage with multi-bit computation using linear resistors, and (2) an open-source programming and testing strategy for CIM systems. The accuracy and reliability are enabled through automated, RISC-V-controlled on-chip calibration, allowing us to improve the compute SNR by 25% to 45% across multiple columns to reach 18-24 dB. To showcase further integration possibilities, we show how our proof-of-concept SoC can be extended to recent high-density linear resistor technologies for enhanced computing performance.
开发准确和可靠的计算计量(CIM)架构正在成为一个关键的研究焦点,以加速人工智能(AI)硬件任务,特别是深神经网络(DNN)。在这方面,人们对模拟和混合信号的CIM架构非常感兴趣,目的是提高数据存储和计算效率,以处理DNNS所需要的大量数据。具体地说,在新兴非流动性存储(eNVM)的解决方案中,耐性混合信号 CIM核心正在成为关键的研究焦点。然而,混合信号的 CIM计算核心仍然面临着若干整合和可靠性挑战,这阻碍了它们被大规模采用端到端到端的AI计算系统。在整合、耐性和基于 ENM CIM 的核心方面,需要与控制处理器整合,以实现端到端的AI加速。此外,基于SRAM 的 CIM 架构仍然比其基于直径直的对等系统更高效和更容易进行编程。在可靠性方面,类比电路更易变,导致计算错误,并导致直径直径的直径直径的计算,在SIM IM 系统上,这需要整合一个控制流程测试,通过SIM 快速的SIM 将SIM 显示两个挑战通过SIM 。
Article 37
Title@2025-06-18 (3): J3DAI: A tiny DNN-Based Edge AI Accelerator for 3D-Stacked CMOS Image Sensor
Title: J3DAI: A tiny DNN-Based Edge AI Accelerator for 3D-Stacked CMOS Image Sensor | J3DAI: Ein winziger DNN-basierter Edge-KI-Beschleuniger für 3D-gestapelten CMOS-Bildsensor | J3DAI: 用于 3D 堆放 CMOS 图像传感器的小型 DNN 以 DNN 为基础的 AI 自动加速器 2506.15316v1 |
Authors (12): Benoit Tain, Raphael Millet, Romain Lemaire, Michal Szczepanski, Laurent Alacoque, Emmanuel Pluchart, Sylvain Choisnet, Rohit Prasad, Jerome Chossat, Pascal Pierunek, Pascal Vivet, Sebastien Thuries
This paper presents J3DAI, a tiny deep neural network-based hardware accelerator for a 3-layer 3D-stacked CMOS image sensor featuring an artificial intelligence (AI) chip integrating a Deep Neural Network (DNN)-based accelerator. The DNN accelerator is designed to efficiently perform neural network tasks such as image classification and segmentation. This paper focuses on the digital system of J3DAI, highlighting its Performance-Power-Area (PPA) characteristics and showcasing advanced edge AI capabilities on a CMOS image sensor. To support hardware, we utilized the Aidge comprehensive software framework, which enables the programming of both the host processor and the DNN accelerator. Aidge supports post-training quantization, significantly reducing memory footprint and computational complexity, making it crucial for deploying models on resource-constrained hardware like J3DAI. Our experimental results demonstrate the versatility and efficiency of this innovative design in the field of edge AI, showcasing its potential to handle both simple and computationally intensive tasks. Future work will focus on further optimizing the architecture and exploring new applications to fully leverage the capabilities of J3DAI. As edge AI continues to grow in importance, innovations like J3DAI will play a crucial role in enabling real-time, low-latency, and energy-efficient AI processing at the edge.
本文介绍了基于 J3DAI 的微小深神经网络硬件加速器 J3DAI , 是一个用于3层3D堆叠的 CMOS 图像传感器的微小神经网络硬件加速器, 其特点是人工智能芯片(AI) 集深神经网络(DNN) 的加速器。 DNN 加速器旨在高效完成图像分类和分解等神经网络任务。 本文侧重于 J3DAI 的数字系统, 突出其性能- 区域( PPAA) 特性, 展示CMOS 图像传感器上高级高级的高级AI能力。 为了支持硬件, 我们使用了 Aidge 综合软件框架, 使主机处理器和 DNNC 加速器能够同时整合。 Aidge 支持培训后的评分, 大幅减少记忆足迹和计算复杂性, 使J3DAI 等受资源限制的硬件的部署模式变得至关重要。 我们的实验结果显示, 在边缘AI 领域, 展示其处理简单和计算密集的任务的潜力。
Article 38
Title@2025-06-17 (2): Scaling Intelligence: Designing Data Centers for Next-Gen Language Models
Title: Scaling Intelligence: Designing Data Centers for Next-Gen Language Models | Scaling Intelligence: Konzipieren von Rechenzentren für Sprachmodelle der nächsten Generation | 扩大情报范围:为下一代语言模型设计数据中心 2506.15006v1 |
Authors (4): Jesmin Jahan Tithi, Hanjiang Wu, Avishaii Abuhatzera, Fabrizio Petrini
The explosive growth of Large Language Models (LLMs), such as GPT-4 with 1.8 trillion parameters, demands a radical rethinking of data center architecture to ensure scalability, efficiency, and cost-effectiveness. Our work provides a comprehensive co-design framework that jointly explores FLOPS, HBM bandwidth and capacity, multiple network topologies (two-tier vs. FullFlat optical), the size of the scale-out domain, and popular parallelism/optimization strategies used in LLMs. We introduce and evaluate FullFlat network architectures, which provide uniform high-bandwidth, low-latency connectivity between all nodes, and demonstrate their transformative impact on performance and scalability. Through detailed sensitivity analyses, we quantify the benefits of overlapping compute and communication, leveraging hardware-accelerated collectives, wider scale-out domains, and larger memory capacity. Our study spans both sparse (mixture of experts) and dense transformer-based LLMs, revealing how system design choices affect Model FLOPS Utilization (MFU = model FLOPs per token × observed tokens per second / peak FLOPS of the hardware) and overall throughput. For the co-design study, we extended and validated a performance modeling tool capable of predicting LLM runtime within 10% of real-world measurements. Our findings offer actionable insights and a practical roadmap for designing AI data centers that can efficiently support trillion-parameter models, reduce optimization complexity, and sustain the rapid evolution of AI capabilities.
大型语言模型(LLM)的爆炸式增长(例如拥有1.8万亿参数的GPT-4)要求从根本上重新思考数据中心架构，以确保可扩展性、效率和成本效益。我们的工作提供了一个全面的协同设计框架，共同探索FLOPS、HBM带宽和容量、多种网络拓扑(两层与FullFlat光互连)、横向扩展域的规模，以及LLM中常用的并行/优化策略。我们提出并评估了FullFlat网络架构，它在所有节点之间提供统一的高带宽、低延迟连接，并展示了其对性能和可扩展性的变革性影响。通过详细的敏感性分析，我们量化了计算与通信重叠、利用硬件加速的集合通信、更大的横向扩展域以及更大内存容量所带来的收益。我们的研究涵盖稀疏(混合专家)和稠密的基于Transformer的LLM，揭示了系统设计选择如何影响模型FLOPS利用率(MFU = 每token的模型FLOPs × 每秒观测到的token数 / 硬件峰值FLOPS)和整体吞吐量。在协同设计研究中，我们扩展并验证了一个性能建模工具，其对LLM运行时间的预测与实际测量的误差在10%以内。我们的研究结果为设计能够高效支持万亿参数模型、降低优化复杂度并支撑AI能力快速演进的AI数据中心提供了可操作的见解和实用路线图。
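The MFU definition given in this abstract translates directly into a one-line computation. The sketch below follows that definition; the function name and the numbers in the example call are hypothetical and are not measurements from the paper.

```python
def mfu(model_flops_per_token: float,
        observed_tokens_per_s: float,
        peak_flops_per_s: float) -> float:
    """Model FLOPS Utilization as defined in the abstract:
    (model FLOPs per token * observed tokens per second) / peak hardware FLOPS.
    """
    if peak_flops_per_s <= 0:
        raise ValueError("peak_flops_per_s must be positive")
    return model_flops_per_token * observed_tokens_per_s / peak_flops_per_s

# Hypothetical inputs chosen only to exercise the formula.
example = mfu(model_flops_per_token=2e12,
              observed_tokens_per_s=400,
              peak_flops_per_s=2e15)
print(f"MFU = {example:.0%}")
```

Because the numerator counts only the FLOPs the model needs, any time spent on exposed communication or idle hardware lowers MFU, which is why the abstract ties it to network topology and compute/communication overlap.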
Article 39
Title@2025-06-17 (2): VeriLeaky: Navigating IP Protection vs Utility in Fine-Tuning for LLM-Driven Verilog Coding
Title: VeriLeaky: Navigating IP Protection vs Utility in Fine-Tuning for LLM-Driven Verilog Coding | VeriLeaky: Navigieren von IP-Schutz gegen Utility in Feinsteuerung für LLM-getriebene Verilog Coding | VeriLeaky:在LLM-Driven Verilog编码精密设计中导航 IP保护相对于实用性 2503.13116v4 |
Authors (10): Zeng Wang, Minghao Shao, Mohammed Nabeel, Prithwish Basu Roy, Likhitha Mankali, Jitendra Bhandari, Ramesh Karri, Ozgur Sinanoglu, Muhammad Shafique, Johann Knechtel
Large language models (LLMs) offer significant potential for coding, yet fine-tuning (FT) with curated data is essential for niche languages like Verilog. Using proprietary intellectual property (IP) for FT presents a serious risk, as FT data can be leaked through LLM inference. This leads to a critical dilemma for design houses: seeking to build externally accessible LLMs offering competitive Verilog coding, how can they leverage in-house IP to enhance FT utility while ensuring IP protection? For the first time in the literature, we study this dilemma. Using LLaMA 3.1-8B, we conduct in-house FT on a baseline Verilog dataset (RTLCoder) supplemented with our own in-house IP, which is validated through multiple tape-outs. To rigorously assess IP leakage, we quantify structural similarity (AST/Dolos) and functional equivalence (Synopsys Formality) between generated codes and our in-house IP. We show that our IP can indeed be leaked, confirming the threat. As defense, we evaluate logic locking of Verilog codes (ASSURE). This offers some level of protection, yet reduces the IP’s utility for FT and degrades the LLM’s performance. Our study shows the need for novel strategies that are both effective and minimally disruptive to FT, an essential effort for enabling design houses to fully utilize their proprietary IP toward LLM-driven Verilog coding.
大型语言模型( LLMs) 提供了重要的编码潜力, 但是微调( FT ) 却对Verilog 等利基语言至关重要。 使用专有知识产权( IP) 来为FT 使用基底 Verilog 数据库( RTLCoder ) 带来严重的风险, 因为FT 数据可以通过 LLM 推断泄漏。 这导致设计房屋面临一个关键性的两难: 寻求建立外部可获取的LLMs , 提供竞争性 Verilog 编码, 如何利用内部IP 来提高FT 效用,同时确保IP 保护? 我们在文献中首次研究了这一难题。 使用LalaMA 3.1-8B , 我们用一个基底的 Verilog 数据库( RTLCorder) 进行内部FT , 补充我们的内部IP IP 数据库( RTLC ) 数据库( RTLC RT) , 通过多层磁带图解的验证验证。 为了充分评估 IPM 的实用性设计策略, 需要某种程度的精确性研究。
Article 40
Title@2025-06-17 (2): ASAP-FE: Energy-Efficient Feature Extraction Enabling Multi-Channel Keyword Spotting on Edge Processors
Title: ASAP-FE: Energy-Efficient Feature Extraction Enabling Multi-Channel Keyword Spotting on Edge Processors | ASAP-FE: Energieeffiziente Feature-Extraktion Ermöglichung von Multi-Channel-Schlüsselwort-Spoting auf Edge-Prozessoren | ASAP-FE:能源效率地貌采掘促进多通道关键词 2506.14657v1 |
Authors (5): Jongin Choi, Jina Park, Woojoo Lee, Jae-Jin Lee, Massoud Pedram
Multi-channel keyword spotting (KWS) has become crucial for voice-based applications in edge environments. However, its substantial computational and energy requirements pose significant challenges. We introduce ASAP-FE (Agile Sparsity-Aware Parallelized-Feature Extractor), a hardware-oriented front-end designed to address these challenges. Our framework incorporates three key innovations: (1) Half-overlapped Infinite Impulse Response (IIR) Framing: This reduces redundant data by approximately 25% while maintaining essential phoneme transition cues. (2) Sparsity-aware Data Reduction: We exploit frame-level sparsity to achieve an additional 50% data reduction by combining frame skipping with stride-based filtering. (3) Dynamic Parallel Processing: We introduce a parameterizable filter cluster and a priority-based scheduling algorithm that allows parallel execution of IIR filtering tasks, reducing latency and optimizing energy efficiency. ASAP-FE is implemented with various filter cluster sizes on edge processors, with functionality verified on FPGA prototypes and designs synthesized at 45 nm. Experimental results using TC-ResNet8, DS-CNN, and KWT-1 demonstrate that ASAP-FE reduces the average workload by 62.73% while supporting real-time processing for up to 32 channels. Compared to a conventional fully overlapped baseline, ASAP-FE achieves less than a 1% accuracy drop (e.g., 96.22% vs. 97.13% for DS-CNN), which is well within acceptable limits for edge AI. By adjusting the number of filter modules, our design optimizes the trade-off between performance and energy, with 15 parallel filters providing optimal performance for up to 25 channels. Overall, ASAP-FE offers a practical and efficient solution for multi-channel KWS on energy-constrained edge devices.
多通道关键词检测(KWS)对边缘环境中的语音应用已变得至关重要。然而，其巨大的计算和能耗需求带来了严峻挑战。我们提出ASAP-FE(Agile Sparsity-Aware Parallelized-Feature Extractor)，一种面向硬件的前端，旨在应对这些挑战。我们的框架包含三项关键创新：(1)半重叠无限脉冲响应(IIR)分帧：在保留关键音素过渡线索的同时，将冗余数据减少约25%。(2)稀疏感知的数据缩减：通过将跳帧与基于步长的滤波相结合，利用帧级稀疏性再减少50%的数据。(3)动态并行处理：我们引入可参数化的滤波器簇和基于优先级的调度算法，使IIR滤波任务能够并行执行，从而降低延迟并优化能效。ASAP-FE以多种滤波器簇规模在边缘处理器上实现，功能在FPGA原型上得到验证，设计在45 nm工艺下综合。使用TC-ResNet8、DS-CNN和KWT-1的实验结果表明，ASAP-FE将平均工作负载降低62.73%，同时支持多达32个通道的实时处理。与传统的全重叠基线相比，ASAP-FE的准确率下降不到1%(例如DS-CNN为96.22%对97.13%)，完全处于边缘AI可接受的范围内。通过调整滤波器模块的数量，我们的设计可优化性能与能耗之间的权衡，其中15个并行滤波器可为多达25个通道提供最佳性能。总体而言，ASAP-FE为能量受限的边缘设备上的多通道KWS提供了实用而高效的解决方案。
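The two data-reduction figures in this abstract (about 25%, then an additional 50%) can be combined in a back-of-the-envelope way. The sketch below assumes the stages compose multiplicatively, meaning each stage removes its share of whatever data the previous stage left. The abstract does not state that the reported 62.73% workload reduction is derived this way, so the comparison is only indicative.

```python
def combined_reduction(*stage_reductions: float) -> float:
    """Overall fraction removed when each stage removes a fraction of what remains."""
    remaining = 1.0
    for reduction in stage_reductions:
        remaining *= 1.0 - reduction
    return 1.0 - remaining

# Stage figures as quoted in the abstract: ~25% from half-overlapped framing,
# then a further 50% from sparsity-aware data reduction.
overall = combined_reduction(0.25, 0.50)
print(f"combined reduction = {overall:.1%}")
```

Under this assumption the combined figure is 62.5%, close to the 62.73% average workload reduction the abstract reports.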
Article 41
Title@2025-06-17 (2): Guaranteed Guess: A Language Modeling Approach for CISC-to-RISC Transpilation with Testing Guarantees
Title: Guaranteed Guess: A Language Modeling Approach for CISC-to-RISC Transpilation with Testing Guarantees | Guaranteed Guess: Ein Sprachmodellierungsansatz für CISC-to-RISC Transpilation mit Testgarantien | 有担保的猜测:具有测试保证的CISC到RISC传输语言模拟方法 2506.14606v1 |
Authors (5): Ahmed Heakl, Sarim Hashmi, Chaimaa Abi, Celine Lee, Abdulrahman Mahmoud
The hardware ecosystem is rapidly evolving, with increasing interest in translating low-level programs across different instruction set architectures (ISAs) in a quick, flexible, and correct way to enhance the portability and longevity of existing code. A particularly challenging class of this transpilation problem is translating between complex- (CISC) and reduced- (RISC) hardware architectures, due to fundamental differences in instruction complexity, memory models, and execution paradigms. In this work, we introduce GG (Guaranteed Guess), an ISA-centric transpilation pipeline that combines the translation power of pre-trained large language models (LLMs) with the rigor of established software testing constructs. Our method generates candidate translations using an LLM from one ISA to another, and embeds such translations within a software-testing framework to build quantifiable confidence in the translation. We evaluate our GG approach over two diverse datasets, enforce high code coverage (>98%) across unit tests, and achieve functional/semantic correctness of 99% on HumanEval programs and 49% on BringupBench programs, respectively. Further, we compare our approach to the state-of-the-art Rosetta 2 framework on Apple Silicon, showcasing 1.73x faster runtime performance, 1.47x better energy efficiency, and 2.41x better memory usage for our transpiled code, demonstrating the effectiveness of GG for real-world CISC-to-RISC translation tasks. We will open-source our codes, data, models, and benchmarks to establish a common foundation for ISA-level code translation research.
硬件生态系统正在迅速演变，人们越来越希望以快速、灵活且正确的方式在不同指令集架构(ISA)之间翻译底层程序，以增强现有代码的可移植性和生命周期。在这类转译问题中，复杂指令集(CISC)与精简指令集(RISC)硬件架构之间的翻译尤其具有挑战性，因为二者在指令复杂度、内存模型和执行范式上存在根本差异。在这项工作中，我们提出 GG(Guaranteed Guess)，一条以 ISA 为中心的转译流水线，将预训练大型语言模型(LLM)的翻译能力与成熟软件测试方法的严谨性结合起来。我们的方法使用 LLM 生成从一种 ISA 到另一种 ISA 的候选翻译，并将这些翻译嵌入软件测试框架，以建立对翻译结果的可量化信心。我们在两个不同的数据集上评估 GG 方法，在单元测试中保证高代码覆盖率(>98%)，并在 HumanEval 程序和 BringupBench 程序上分别达到 99% 和 49% 的功能/语义正确率。此外，我们将该方法与 Apple Silicon 上最先进的 Rosetta 2 框架进行比较，转译后的代码运行速度快 1.73 倍，能效高 1.47 倍，内存使用优 2.41 倍，证明了 GG 在真实 CISC 到 RISC 翻译任务中的有效性。我们将开源代码、数据、模型和基准，为 ISA 级代码翻译研究建立共同基础。
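The generate-then-test idea in the GG abstract can be sketched in a few lines. This is only an illustration of the general pattern, not the authors' pipeline: candidates here are plain Python callables standing in for transpiled programs, and `first_passing_candidate`, `buggy_add`, and `good_add` are hypothetical names.

```python
def first_passing_candidate(candidates, tests):
    """Return the first candidate that passes every (args, expected) test, else None."""
    for candidate in candidates:
        try:
            if all(candidate(*args) == expected for args, expected in tests):
                return candidate
        except Exception:
            # A crashing candidate is treated the same as a failing one.
            continue
    return None


# Two stand-in "translations" of an addition routine; the first is wrong.
def buggy_add(a, b):
    return a - b


def good_add(a, b):
    return a + b


# Unit tests play the role of the testing framework that gates each translation.
tests = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]
chosen = first_passing_candidate([buggy_add, good_add], tests)
print(chosen.__name__ if chosen else "no candidate passed")
```

The confidence such a gate gives is bounded by the test suite, which is why the abstract stresses high code coverage across the unit tests.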
Article 42
Title@2025-06-17 (2): Empirically-Calibrated H100 Node Power Models for Reducing Uncertainty in AI Training Energy Estimation
Title: Empirically-Calibrated H100 Node Power Models for Reducing Uncertainty in AI Training Energy Estimation | Empirisch kalibrierte H100-Knotenstrommodelle zur Reduzierung von Unsicherheiten in der KI-Trainingsenergieschätzung | AI 培训能源估算中减少不确定性的活性校准 H100 节点动力模型 2506.14551v1 |
Authors (7): Alex C. Newkirk, Jared Fernandez, Jonathan Koomey, Imran Latif, Emma Strubell, Arman Shehabi, Constantine Samaras
As AI’s energy demand continues to grow, it is critical to enhance the understanding of characteristics of this demand, to improve grid infrastructure planning and environmental assessment. By combining empirical measurements from Brookhaven National Laboratory during AI training on 8-GPU H100 systems with open-source benchmarking data, we develop statistical models relating computational intensity to node-level power consumption. We measure the gap between manufacturer-rated thermal design power (TDP) and actual power demand during AI training. Our analysis reveals that even computationally intensive workloads operate at only 76% of the 10.2 kW TDP rating. Our architecture-specific model, calibrated to floating-point operations, predicts energy consumption with 11.4% mean absolute percentage error, significantly outperforming TDP-based approaches (27-37% error). We identified distinct power signatures between transformer and CNN architectures, with transformers showing characteristic fluctuations that may impact grid stability.
随着 AI 的能源需求持续增长，加深对这一需求特征的理解，对改进电网基础设施规划和环境评估至关重要。我们将 Brookhaven 国家实验室在 8-GPU H100 系统上进行 AI 训练期间的实测数据与开源基准数据相结合，建立了将计算强度与节点级功耗关联起来的统计模型。我们测量了制造商标称的热设计功耗(TDP)与 AI 训练期间实际功率需求之间的差距。分析表明，即使是计算密集型工作负载，其功率也仅为 10.2 kW TDP 额定值的 76%。我们针对特定架构、按浮点运算校准的模型预测能耗的平均绝对百分比误差为 11.4%，显著优于基于 TDP 的方法(误差 27-37%)。我们还发现 Transformer 与 CNN 架构具有不同的功率特征，其中 Transformer 表现出可能影响电网稳定性的特征性波动。
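The two quantities this abstract leans on, the share of TDP actually drawn and the mean absolute percentage error (MAPE), are simple to compute. The sketch below uses the 10.2 kW and 76% figures from the abstract; the power samples in `measured_kw` are made up purely for illustration and are not from the paper.

```python
TDP_KW = 10.2             # node TDP rating quoted in the abstract
OBSERVED_FRACTION = 0.76  # share of TDP reached by intensive workloads (abstract)


def mape(actual, predicted):
    """Mean absolute percentage error, expressed in percent."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


# Implied node power for a computationally intensive workload.
observed_peak_kw = TDP_KW * OBSERVED_FRACTION
print(f"implied draw: {observed_peak_kw:.2f} kW")

# Hypothetical measurements (kW), only to show how a TDP-based estimate is scored.
measured_kw = [7.0, 7.5, 6.0]
tdp_baseline_error = mape(measured_kw, [TDP_KW] * len(measured_kw))
print(f"TDP-as-estimate MAPE on the made-up samples: {tdp_baseline_error:.1f}%")
```

Assuming every node always draws its full TDP overestimates energy whenever real draw sits below the rating, which is the gap the calibrated model is meant to close.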
Article 43
Title@2025-06-17 (2): Tensor Manipulation Unit (TMU): Reconfigurable, Near-Memory Tensor Manipulation for High-Throughput AI SoC
Title: Tensor Manipulation Unit (TMU): Reconfigurable, Near-Memory Tensor Manipulation for High-Throughput AI SoC | Tensor-Manipulationseinheit (TMU): Rekonfigurierbare, Nahspeicher-Tensor-Manipulation für High-Throughput-KI-SoC | Tensor 操纵装置(TMU):高干压 AI SoC 的可重新配置、近近中间的Tensor 操纵装置 2506.14364v1 |
Authors (7): Weiyu Zhou, Zheng Wang, Chao Chen, Yike Li, Yongkui Yang, Zhuoyu Wu, Anupam Chattopadhyay
While recent advances in AI SoC design have focused heavily on accelerating tensor computation, the equally critical task of tensor manipulation, centered on high-volume data movement with minimal computation, remains underexplored. This work addresses that gap by introducing the Tensor Manipulation Unit (TMU), a reconfigurable, near-memory hardware block designed to efficiently execute data-movement-intensive operators. TMU manipulates long data streams in a memory-to-memory fashion using a RISC-inspired execution model and a unified addressing abstraction, enabling broad support for both coarse- and fine-grained tensor transformations. Integrated alongside a TPU within a high-throughput AI SoC, the TMU leverages double buffering and output forwarding to improve pipeline utilization. Fabricated in SMIC 40 nm technology, the TMU occupies only 0.019 mm² while supporting over 10 representative tensor manipulation operators. Benchmarking shows that TMU alone achieves up to 1413× and 8.54× operator-level latency reduction compared to ARM A72 and NVIDIA Jetson TX2, respectively. When integrated with the in-house TPU, the complete system achieves a 34.6% reduction in end-to-end inference latency, demonstrating the effectiveness and scalability of reconfigurable tensor manipulation in modern AI SoCs.
尽管 AI SoC 设计的最新进展主要集中于加速张量计算，但同样关键的张量操纵任务(以大批量数据搬移为主、计算量极少)仍未得到充分研究。本文通过引入张量操纵单元(TMU)来填补这一空白，TMU 是一种可重构的近存硬件模块，旨在高效执行数据搬移密集型算子。TMU 采用受 RISC 启发的执行模型和统一的寻址抽象，以内存到内存的方式处理长数据流，从而广泛支持粗粒度和细粒度的张量变换。TMU 与 TPU 一同集成在高吞吐量 AI SoC 中，利用双缓冲和输出转发提高流水线利用率。TMU 采用 SMIC 40 nm 工艺制造，面积仅为 0.019 mm²，同时支持 10 多种具有代表性的张量操纵算子。基准测试表明，与 ARM A72 和 NVIDIA Jetson TX2 相比，单独使用 TMU 可分别实现最高 1413 倍和 8.54 倍的算子级延迟降低。与自研 TPU 集成后，整个系统的端到端推理延迟降低 34.6%，证明了可重构张量操纵在现代 AI SoC 中的有效性和可扩展性。
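To make "tensor manipulation as data movement under a unified addressing abstraction" concrete, the sketch below expresses a transpose and a column slice as the same memory-to-memory copy driven by a base offset and strides. It illustrates the general idea only; `strided_copy` and its parameters are hypothetical and do not describe the TMU's actual instruction set or address generator.

```python
def strided_copy(src, out_shape, src_strides, base=0):
    """Copy from a flat source buffer into a new flat buffer.

    Output element (i, j) is read from src[base + i*s0 + j*s1]; no arithmetic
    is performed on the data itself, only on addresses.
    """
    rows, cols = out_shape
    s0, s1 = src_strides
    return [src[base + i * s0 + j * s1] for i in range(rows) for j in range(cols)]


# A 2x3 row-major matrix [[0, 1, 2], [3, 4, 5]] stored as a flat buffer.
src = [0, 1, 2, 3, 4, 5]

# Transpose to 3x2: step 1 element per output row, 3 elements per output column.
transposed = strided_copy(src, (3, 2), (1, 3))

# Slice out column 1 as a 2x1 tensor: start at offset 1, step one source row (3).
column_1 = strided_copy(src, (2, 1), (3, 1), base=1)

print(transposed)
print(column_1)
```

Because both operators reduce to one parameterized copy loop, a single reconfigurable datapath can plausibly cover many such operators by changing only the address parameters.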
Article 44
Title@2025-06-17 (2): CXLMemSim: A pure software simulated CXL.mem for performance characterization
Title: CXLMemSim: A pure software simulated CXL.mem for performance characterization | CXLMemSim: Eine reine Software simulierte CXL.mem für die Leistungscharakterisierung | CXLMemSim:一种模拟CXL.mem用于性能定性的纯软件 2303.06153v2 |
Authors (6): Yiwei Yang, Brian Zhao, Yusheng Zheng, Pooneh Safayenikoo, Tanvir Ahmed Khan, Andi Quinn
CXLMemSim is a fast, lightweight simulation framework that enables performance characterization of memory systems based on the Compute Express Link (CXL) memory protocol, CXL.mem. CXL.mem allows disaggregation and pooling of memory to mitigate memory stranding (underutilized memory trapped on fully loaded servers) in cloud and datacenter environments. However, CXL-attached memory introduces additional latency and bandwidth constraints compared to local DRAM, and real CXL.mem hardware is not yet widely available for empirical evaluation. CXLMemSim addresses this gap by attaching to unmodified applications and simulating CXL-based memory pools in software. It operates by tracing memory allocations and accesses using efficient kernel probes and hardware performance counters, dividing execution into epochs, and injecting timing delays to emulate various CXL.mem latency/bandwidth characteristics. This approach incurs modest runtime overhead while preserving realistic load/store memory access patterns. We implement CXLMemSim on commodity hardware without special devices, and our evaluation shows that it runs orders of magnitude faster than cycle-accurate simulators (e.g., gem5) for real-world workloads, while accurately modeling the performance impact of CXL.mem. We demonstrate use cases where CXLMemSim enables experimentation with memory pooling configurations, scheduling policies, data migration strategies, and caching techniques that were previously infeasible to evaluate at scale. Key findings include the viability of software-based CXL.mem emulation with low overhead, insights into latency and congestion effects in memory pools, and guidance for system designers to optimize memory disaggregation. Overall, CXLMemSim provides a practical and extensible platform for researchers and practitioners to explore CXL.mem innovations before real hardware becomes commonplace.
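CXLMemSim 是一个快速、轻量的仿真框架，用于对基于 Compute Express Link(CXL)内存协议 CXL.mem 的内存系统进行性能表征。CXL.mem 支持内存的解耦与池化，以缓解云和数据中心环境中的内存搁浅(即满载服务器上未被充分利用的内存)。然而，与本地 DRAM 相比，CXL 附加内存会带来额外的延迟和带宽限制，而真实的 CXL.mem 硬件尚未广泛可用于实证评估。CXLMemSim 通过附加到未经修改的应用程序并在软件中模拟基于 CXL 的内存池来填补这一空白。它使用高效的内核探针和硬件性能计数器跟踪内存分配与访问，将执行划分为若干 epoch，并注入时间延迟以模拟各种 CXL.mem 延迟/带宽特性。这种方法只带来适度的运行时开销，同时保留真实的 load/store 内存访问模式。我们在无需特殊设备的通用硬件上实现了 CXLMemSim，评估表明，对于真实工作负载，它比周期精确模拟器(如 gem5)快若干个数量级，同时能准确建模 CXL.mem 的性能影响。我们展示了若干用例，其中 CXLMemSim 使得对内存池化配置、调度策略、数据迁移策略和缓存技术的实验成为可能，而这些此前难以进行大规模评估。主要发现包括:基于软件的低开销 CXL.mem 仿真是可行的，对内存池中延迟和拥塞效应的洞察，以及为系统设计者优化内存解耦提供的指导。总体而言，CXLMemSim 为研究人员和从业者提供了一个实用且可扩展的平台，以便在真实硬件普及之前探索 CXL.mem 创新。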
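The epoch-based delay injection described in the CXLMemSim abstract can be illustrated with a deliberately simplified model. This is a sketch under stated assumptions, not the tool's actual formula: `epoch_delay_ns` and all of its parameters are hypothetical, each access routed to the emulated pool is charged the extra latency over local DRAM, and a bandwidth penalty applies only when the epoch's traffic could not have fit through the emulated link in time.

```python
def epoch_delay_ns(remote_accesses, bytes_moved, local_latency_ns,
                   cxl_latency_ns, cxl_bandwidth_gb_s, epoch_ns):
    """Estimate the delay to inject at the end of one epoch (simplified model)."""
    # Extra latency per access that would have gone to the emulated CXL pool.
    latency_penalty = remote_accesses * max(0.0, cxl_latency_ns - local_latency_ns)

    # 1 GB/s equals 1 byte/ns (with GB = 1e9 bytes), so this is the link time in ns.
    transfer_ns = bytes_moved / cxl_bandwidth_gb_s
    # Penalize only the portion of the transfer that exceeds the epoch length.
    bandwidth_penalty = max(0.0, transfer_ns - epoch_ns)

    return latency_penalty + bandwidth_penalty


# Hypothetical epoch: 1,000 remote accesses of 64 bytes within a 1 ms epoch.
delay = epoch_delay_ns(
    remote_accesses=1_000,
    bytes_moved=64_000,
    local_latency_ns=90.0,
    cxl_latency_ns=250.0,
    cxl_bandwidth_gb_s=32.0,
    epoch_ns=1_000_000.0,
)
print(f"inject {delay:.0f} ns")
```

Charging delays once per epoch rather than per access is what keeps overhead low: the traced application runs at native speed inside an epoch and is stalled only at epoch boundaries.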
Article 45
Title@2025-06-17 (2): Comprehensive Verilog Design Problems: A Next-Generation Benchmark Dataset for Evaluating Large Language Models and Agents on RTL Design and Verification
Title: Comprehensive Verilog Design Problems: A Next-Generation Benchmark Dataset for Evaluating Large Language Models and Agents on RTL Design and Verification | Umfassende Verilog-Design-Probleme: Ein Benchmark-Datensatz für die nächste Generation zur Bewertung großer Sprachmodelle und -agenten auf RTL-Design und -Verifikation | 全面的纵向设计问题:用于评价关于RTL设计和核查的大型语言模型和代理人的下一个基准数据集 2506.14074v1 |
Authors (8): Nathaniel Pinckney, Chenhui Deng, Chia-Tung Ho, Yun-Da Tsai, Mingjie Liu, Wenfei Zhou, Brucek Khailany, Haoxing Ren
We present the Comprehensive Verilog Design Problems (CVDP) benchmark, a new dataset and infrastructure to advance LLM and agent research in hardware design and verification. CVDP includes 783 problems across 13 task categories, covering RTL generation, verification, debugging, specification alignment, and technical Q&A authored by experienced hardware engineers. Problems are offered in both non-agentic and agentic formats. The benchmark introduces more realistic and challenging contexts than prior work, with state-of-the-art models achieving no more than 34% pass@1 on code generation. Agentic tasks, especially those involving RTL reuse and verification, are particularly difficult. Evaluation uses open-source tools and model scoring infrastructure, with comprehension tasks assessed via BLEU and LLM-based judging. CVDP reveals substantial gaps in current model capabilities, underscoring the need for continued research toward robust, real-world hardware design automation.
我们提出 Comprehensive Verilog Design Problems(CVDP)基准，这是一个新的数据集和基础设施，用于推进硬件设计与验证领域的 LLM 和智能体研究。CVDP 包含 13 个任务类别的 783 个问题，涵盖 RTL 生成、验证、调试、规范对齐以及技术问答，均由经验丰富的硬件工程师编写。问题同时提供非智能体和智能体两种形式。该基准引入了比以往工作更真实、更具挑战性的场景，最先进的模型在代码生成上的 pass@1 不超过 34%。智能体任务，尤其是涉及 RTL 复用和验证的任务，特别困难。评估使用开源工具和模型评分基础设施，其中理解类任务通过 BLEU 和基于 LLM 的评判进行评估。CVDP 揭示了当前模型能力的巨大差距，凸显了继续开展研究以实现稳健的、面向真实场景的硬件设计自动化的必要性。
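For readers unfamiliar with the pass@1 figure quoted in the CVDP abstract, the sketch below shows the commonly used unbiased pass@k estimator from the code-generation literature: given n sampled solutions of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). Whether CVDP computes its scores exactly this way is an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased estimate of the chance that at least one of k samples passes.

    n: total samples drawn for a problem, c: samples that passed, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 5 samples and 2 passing, pass@1 reduces to the plain pass rate c / n.
score = pass_at_k(n=5, c=2, k=1)
print(f"pass@1 = {score:.2f}")
```

A benchmark-level score is then the mean of this per-problem value over all problems.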