cs.AR @ 2025-07-11: 040
-
00 07-10 (4) DiP: A Scalable, Energy-Efficient Systolic Array for Matrix Multiplication Acceleration DiP: Ein skalierbarer, energieeffizienter Systolischer Array für Matrix-Multiplikationsbeschleunigung DiP:一个可缩放的、节能的、用于加速矩阵乘法加速的节能收缩阵列阵列 2412.09709v2 -
01 07-10 Accelerating Transposed Convolutions on FPGA-based Edge Devices Beschleunigung transponierter Konvolutionen auf FPGA-basierten Edge-Geräten 加速基于 FPGA 的边缘设备的转换变速 2507.07683v1 -
02 07-09 (3) Compute Can’t Handle the Truth: Why Communication Tax Prioritizes Memory and Interconnects in Modern AI Infrastructure Berechnen kann nicht mit der Wahrheit umgehen: Warum Kommunikationssteuer das Gedächtnis und die Verbindungen in der modernen KI-Infrastruktur priorisiert 计算无法处理真相:为什么通讯税在现代AI基础设施中将记忆和相互联系放在优先地位? 2507.07223v1 -
03 07-09 Opto-ViT: Architecting a Near-Sensor Region of Interest-Aware Vision Transformer Accelerator with Silicon Photonics Opto-ViT: Bau einer nah-Sensor-Region von Interesse-Aware Vision Transformer Accelerator mit Silicon Photonics Opto-ViT: 设计具有硅光谱仪的近传感器区域 2507.07044v1 -
04 07-09 AHCPTQ: Accurate and Hardware-Compatible Post-Training Quantization for Segment Anything Model AHCPTQ: Genaue und hardwarekompatible Nachschulungs-Quantisierung für Segment-Anything-Modell AHCPTQ: 分片 “ 任何 “ 模式的准确和硬件兼容的训练后量化 2503.03088v2 -
05 07-09 Deep-Learning-Based Pre-Layout Parasitic Capacitance Prediction on SRAM Designs Deep-Learning-based Pre-Layout Parasitic Capacitance Prediction auf SRAM-Designs 基于深度学习的SRAM设计版图前寄生电容预测 2507.06549v1 -
06 07-09 Towards LLM-based Root Cause Analysis of Hardware Design Failures Auf dem Weg zu einer LLM-basierten Root-Cause-Analyse von Hardware-Design-Fehlern 基于LLM的硬件设计故障根本原因分析 2507.06512v1 -
07 07-08 (2) SLDB: An End-To-End Heterogeneous System-on-Chip Benchmark Suite for LLM-Aided Design SLDB: Eine End-to-End Heterogene System-on-Chip Benchmark Suite für LLM-Aided Design SLDB: 面向LLM辅助设计的端到端异构片上系统基准套件 2507.06376v1 -
08 07-08 hdl2v: A Code Translation Dataset for Enhanced LLM Verilog Generation hdl2v: Ein Code-Übersetzungsdatensatz für verbesserte LLM Verilog-Generierung hdl2v: 用于强化LLM Verilog 生成的代码翻译数据集 2506.04544v2 -
09 07-08 Multi-Queue SSD I/O Modeling & Its Implications for Data Structure Design Multi-Queue SSD I/O Modellierung & seine Implikationen für die Datenstrukturgestaltung 多队队 SSD I/O 建模及其对数据结构设计的影响 2507.06349v1 -
10 07-08 PrefixAgent: An LLM-Powered Design Framework for Efficient Prefix Adder Optimization PrefixAgent: Ein LLM-Powered Design Framework für effiziente Prefix Adder-Optimierung 前缀:高效前缀添加器优化的LLM授权设计框架 2507.06127v1 -
11 07-08 RTGPU: Real-Time Computing with Graphics Processing Units RTGPU: Echtzeit-Computing mit Grafikverarbeitungseinheiten RTGPU: 配有图形处理股的实时计算机 2507.06069v1 -
12 07-08 OLAF: Programmable Data Plane Acceleration for Asynchronous Distributed Reinforcement Learning OLAF: Programmierbare Datenplanbeschleunigung für asynchrones, verteiltes Weiterbildungslernen OLAF: 可编程数据计划加速非同步分布式加强学习 2507.05876v1 -
13 07-08 GATMesh: Clock Mesh Timing Analysis using Graph Neural Networks GATMesh: Uhr Mesh Timing Analyse mit Hilfe von Graph Neural Networks GATMesh:利用图形神经网络分析时钟网时间 2507.05681v1 -
14 07-08 A Runtime-Adaptive Transformer Neural Network Accelerator on FPGAs Laufzeit-Adaptive Transformer Neural Network Accelerator auf FPGAs FPGAs 运行时间- 适应性变革器神经网络加速器 2411.18148v3 -
15 07-08 iThermTroj: Exploiting Intermittent Thermal Trojans in Multi-Processor System-on-Chips iThermTroj: Ausnutzung intermittierender thermischer Trojaner in Multi-Prozessor-System-on-Chips iThermTroj:在多处理器系统芯片中利用间隔热热特洛伊 2507.05576v1 -
16 07-08 Per-Row Activation Counting on Real Hardware: Demystifying Performance Overheads Per-Row-Aktivierung auf echte Hardware zählen: Demystifying Performance Overheads 实时硬件的每减激活计数:解开对性能的神秘化 2507.05556v1 -
17 07-07 (1) Bit-Flip Fault Attack: Crushing Graph Neural Networks via Gradual Bit Search Bit-Flip-Fault-Angriff: Zerkleinernde Graphen-Neural-Netzwerke über schrittweise Bitsuche Bit- Flip 错误攻击: 通过渐变位搜索粉碎图形神经网络 2507.05531v1 -
18 07-07 ViPSN 2.0: A Reconfigurable Battery-free IoT Platform for Vibration Energy Harvesting ViPSN 2.0: Eine neu konfigurierbare, batteriefreie IoT-Plattform für Vibrationsenergieernte VIPSN 2.0: 一个无电池的振动能源收集再配置无电的 IOT 平台 2507.05081v1 -
19 07-07 Optimizing Scalable Multi-Cluster Architectures for Next-Generation Wireless Sensing and Communication Optimierung skalierbarer Multi-Cluster-Architekturen für drahtloses Sensing und Kommunikation der nächsten Generation 优化用于下一代无线遥感和通信的可缩放多集群建筑 2507.05012v1 -
20 07-07 AXI-REALM: Safe, Modular and Lightweight Traffic Monitoring and Regulation for Heterogeneous Mixed-Criticality Systems AXI-REALM: Sichere, modulare und leichte Verkehrsüberwachung und Regulierung für Heterogene Mixed-Criticality-Systeme AXI-REALM: 安全、模块和轻量量的交通监测和管理 2501.10161v2 -
21 07-07 Jack Unit: An Area- and Energy-Efficient Multiply-Accumulate (MAC) Unit Supporting Diverse Data Formats Jack Unit: Eine flächen- und energieeffiziente Multiplizierungsakkumulation (MAC)-Einheit, die unterschiedliche Datenformate unterstützt 杰克单位:一个区域和能源效率乘数累积(MAC)单位,支持多种数据格式 2507.04772v1 -
22 07-07 ChipSeek-R1: Generating Human-Surpassing RTL with LLM via Hierarchical Reward-Driven Reinforcement Learning ChipSeek-R1: Generierung von Mensch-Überwindungs-RTL mit LLM über hierarchisches Reward-getriebenes Verstärkungs-Lernen ChipSeek-R1:通过分层奖励驱动的强化学习,利用LLM生成超越人类的RTL 2507.04736v1 -
23 07-07 FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale+ FPGAs FAMOUS: Flexibler Beschleuniger für den Aufmerksamkeitsmechanismus des Transformators auf UltraScale+ FPGAs FAMOUS: UltraScale+ FPGAs上Transformer注意力机制的灵活加速器 2409.14023v3 -
24 07-07 NeuroPDE: A Neuromorphic PDE Solver Based on Spintronic and Ferroelectric Devices NeuroPDE: Ein neuromorphes PDE-Lösemittel auf Basis spintronischer und Ferroelektrischer Geräte NeuroPDE: 一种基于自旋电子和铁电器件的神经形态PDE求解器 2507.04677v1 -
25 07-06 (7) da4ml: Distributed Arithmetic for Real-time Neural Networks on FPGAs da4ml: Verteilte Arithmetik für Echtzeit-Neurale Netzwerke auf FPGAs da4ml: FPGAs 实时神经网络的分布式重新测量 2507.04535v1 -
26 07-06 HLStrans: Dataset for LLM-Driven C-to-HLS Hardware Code Synthesis HLStrans: Datensatz für LLM-getriebene C-zu-HLS-Hardware-Codesynthese HLStrans:LLM-Driven C-to-HLS硬件代码合成数据集 2507.04315v1 -
27 07-06 FIXME: Towards End-to-End Benchmarking of LLM-Aided Design Verification FIXME: Zur End-to-End-Benchmarkierung der LLM-Aided Design-Überprüfung FIXME:走向LLM辅助设计核查的终至终基准基准 2507.04276v1 -
28 07-05 (6) Heterogeneous Memory Benchmarking Toolkit Heterogenes Memory Benchmarking Toolkit 不同记忆基准衡量工具包 2505.00901v2 -
29 07-05 Reducing the Cost of Dropout in Flash-Attention by Hiding RNG with GEMM Reduzierung der Kosten des Ausfalls in Flash-Achtung durch Verstecken von RNG mit GEMM 通过与GEMM躲入RNNG降低 “ 闪启动 “ 中的辍学费用 2410.07531v2 -
30 07-04 (5) A Flexible Instruction Set Architecture for Efficient GEMMs Flexible Instruktions-Set-Architektur für effiziente GEMMs 高效的通用环管机制的灵活教学结构 2507.03522v1 -
31 07-04 High-Level Surface Code Decoding via Parallel FFNNs on CIM Platforms High-Level-Oberflächencode-Dekodierung über parallele FFNNs auf CIM-Plattformen 通过CIM平台上的并行FFNN进行高层表面码解码 2411.18090v2 -
32 07-04 Hummingbird: A Smaller and Faster Large Language Model Accelerator on Embedded FPGA Hummingbird: Ein kleinerer und schnellerer Large Language Model Accelerator auf Embedded FPGA 蜂鸟:在嵌入的FPGA上用更小、更快的大型语言模型加速器 2507.03308v1 -
33 07-04 ForgeHLS: A Large-Scale, Open-Source Dataset for High-Level Synthesis ForgeHLS: Ein großformatiger, Open-Source-Datensatz für High-Level-Synthese ForgeHLS: 用于高级别综合的大型、开放源码数据集 2507.03255v1 -
34 07-03 (4) Hey AI, Generate Me a Hardware Code! Agentic AI-based Hardware Design & Verification Hey KI, Generieren Sie mir einen Hardware-Code! Agentische KI-basierte Hardware-Design & Verifizierung AI, 生成一个硬件代码! Agentic AI 的硬件设计和验证 2507.02660v1 -
35 07-03 Breaking the HBM Bit Cost Barrier: Domain-Specific ECC for AI Inference Infrastructure Durchbrechen der HBM Bit Cost Barrier: Domainspezifisches ECC für KI-Inferenz-Infrastruktur 打破HBM比位成本壁垒:AI推理基础设施特定域ECC 2507.02654v1 -
36 07-03 MARS: Processing-In-Memory Acceleration of Raw Signal Genome Analysis Inside the Storage Subsystem MARS: Processing-in-Memory Acceleration of Raw Signal Genome Analysis Inside the Storage Subsystem MARS: 储存子系统内原始信号基因组分析的处理-中间加速 2506.10931v2 -
37 07-03 AC-Refiner: Efficient Arithmetic Circuit Optimization Using Conditional Diffusion Models AC-Refiner: Effiziente Arithmetische Schaltungsoptimierung mit bedingten Diffusionsmodellen AC-Refineer:使用有条件扩散模型高效亚氏电路优化 2507.02598v1 -
38 07-03 System-performance and cost modeling of Large Language Model training and inference Systemperformance und Kostenmodellierung von Large Language Model Training und Schlussfolgerung 大语言模式培训和推论的系统业绩和成本模型化 2507.02456v1 -
39 07-03 DecoRTL: A Run-time Decoding Framework for RTL Code Generation with LLMs DecoRTL: Ein Laufzeit-Decoding-Framework für RTL-Code-Generierung mit LLMs DecoRTL: 使用LLMs的RTL代码生成运行时间解码框架 2507.02226v1
Article 0
Title@2025-07-10 (4): DiP: A Scalable, Energy-Efficient Systolic Array for Matrix Multiplication Acceleration
Title: DiP: A Scalable, Energy-Efficient Systolic Array for Matrix Multiplication Acceleration | DiP: Ein skalierbarer, energieeffizienter Systolischer Array für Matrix-Multiplikationsbeschleunigung | DiP:一个可缩放的、节能的、用于加速矩阵乘法加速的节能收缩阵列阵列 2412.09709v2 |
Authors (3): Ahmed J. Abdelmaksoud, Shady Agwa, Themis Prodromakis
Transformers are gaining increasing attention across different application domains due to their outstanding accuracy. However, these data-intensive models add significant performance demands to the existing computing architectures. Systolic arrays are spatial architectures that have been adopted by commercial AI computing platforms (like Google TPUs), due to their energy-efficient approach to data reuse. However, these spatial architectures face a penalty in throughput and energy efficiency due to the need for input and output synchronization using First-In-First-Out (FIFO) buffers. This paper proposes a novel scalable systolic-array architecture featuring Diagonal-Input and Permutated weight-stationary (DiP) dataflow for the acceleration of matrix multiplication. The proposed architecture eliminates the synchronization FIFOs required by state-of-the-art weight-stationary systolic arrays. Aside from the area, power, and energy savings achieved by eliminating these FIFOs, the DiP architecture maximizes utilization of the computational resources (PEs). Thus, it outperforms the weight-stationary counterparts in terms of throughput by up to 50%. A comprehensive hardware design space exploration is demonstrated using commercial 22nm technology, highlighting the scalability advantages of DiP over the conventional approach across various dimensions, where DiP improves energy efficiency per area by up to 2.02x. Furthermore, DiP is evaluated using various transformer workloads from widely-used models, consistently outperforming TPU-like architectures, achieving energy improvements of up to 1.81x and latency improvements of up to 1.49x. At a 64x64 size with 4096 PEs, DiP achieves a peak performance of 8.2 TOPS with an energy efficiency of 9.55 TOPS/W.
Transformer因其出色的精度在不同应用领域受到越来越多的关注。然而,这些数据密集型模型增加了现有计算结构的显著性能要求。脉动阵列是商业AI计算平台(如Google TPUs)采用的空间结构,因为其数据重用方式具有节能性。然而,这些空间结构由于需要使用先进先出(FIFO)缓冲进行输入和输出同步,在吞吐量和能源效率方面面临损失。本文提出了一个新的可扩展脉动阵列结构,其特点是对角输入和置换的权重固定(DiP)数据流,用于加速矩阵乘法。所提出的结构消除了最新权重固定脉动阵列所需的同步FIFO。除了消除这些FIFO所带来的面积、功耗和能量节省之外,DiP结构还最大化了计算资源(PE)的利用率,因此其吞吐量比权重固定的对应结构最多高出50%。本文采用商用22nm工艺进行了全面的硬件设计空间探索,突出了DiP相对于传统方法在各种规模下的可扩展性优势,其单位面积能效最多提高2.02倍。此外,在来自广泛使用的模型的多种Transformer工作负载上,DiP始终优于类TPU结构,能耗最多改善1.81倍,延迟最多改善1.49倍。在64x64规模、4096个PE时,DiP的峰值性能达到8.2 TOPS,能效为9.55 TOPS/W。
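As a quick sanity check on the closing figures of the DiP abstract above, the following sketch derives the clock rate and power that the quoted peak numbers would imply. It assumes that each PE completes one multiply-accumulate per cycle and that a MAC is counted as two operations; this is a common convention but is not stated in the abstract, and the function names are illustrative only.

```python
def implied_clock_hz(peak_ops_per_s, num_pes, ops_per_pe_per_cycle=2):
    """Clock rate implied by a peak throughput figure.

    ASSUMPTION: every PE retires `ops_per_pe_per_cycle` operations each
    cycle (2 = one multiply + one add); the abstract does not state this.
    """
    return peak_ops_per_s / (num_pes * ops_per_pe_per_cycle)


def implied_power_w(peak_tops, tops_per_watt):
    """Power implied by a peak throughput and an energy-efficiency figure."""
    return peak_tops / tops_per_watt


num_pes = 64 * 64                          # 64x64 array quoted in the abstract
clock = implied_clock_hz(8.2e12, num_pes)  # 8.2 TOPS peak
power = implied_power_w(8.2, 9.55)         # 9.55 TOPS/W

print(f"PEs: {num_pes}")
print(f"implied clock: {clock / 1e9:.3f} GHz")
print(f"implied power: {power:.3f} W")
```

Under these assumptions the quoted numbers are mutually consistent with a clock of roughly 1 GHz and a power draw below 1 W; neither value is reported in the abstract itself.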
Article 1
Title@2025-07-10 (4): Accelerating Transposed Convolutions on FPGA-based Edge Devices
Title: Accelerating Transposed Convolutions on FPGA-based Edge Devices | Beschleunigung transponierter Konvolutionen auf FPGA-basierten Edge-Geräten | 加速基于 FPGA 的边缘设备的转换变速 2507.07683v1 |
Authors (2): Jude Haris, José Cano
Transposed Convolutions (TCONV) enable the up-scaling mechanism within generative Artificial Intelligence (AI) models. However, the predominant Input-Oriented Mapping (IOM) method for implementing TCONV has complex output mapping, overlapping sums, and ineffectual computations. These inefficiencies further exacerbate the performance bottleneck of TCONV and generative models on resource-constrained edge devices. To address this problem, in this paper we propose MM2IM, a hardware-software co-designed accelerator that combines Matrix Multiplication (MatMul) with col2IM to process TCONV layers on resource-constrained edge devices efficiently. Using the SECDA-TFLite design toolkit, we implement MM2IM and evaluate its performance across 261 TCONV problem configurations, achieving an average speedup of 1.9x against a dual-thread ARM Neon optimized CPU baseline. We then evaluate the performance of MM2IM on a range of TCONV layers from well-known generative models achieving up to 4.2x speedup, and compare it against similar resource-constrained TCONV accelerators, outperforming them by at least 2x GOPs/DSP. Finally, we evaluate MM2IM on the DCGAN and pix2pix GAN models, achieving up to 3x speedup and 2.4x energy reduction against the CPU baseline.
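转置卷积(TCONV)实现了生成式人工智能(AI)模型中的上采样机制。然而,实现TCONV的主流输入导向映射(IOM)方法存在复杂的输出映射、重叠求和以及无效计算。这些低效问题进一步加剧了TCONV和生成式模型在资源受限边缘设备上的性能瓶颈。为了解决这个问题,我们在本文中提出MM2IM,一个软硬件协同设计的加速器,将矩阵乘法(MatMul)与col2IM结合起来,以高效地处理资源受限边缘设备上的TCONV层。我们使用SECDA-TFLite设计工具包实现MM2IM,并在261种TCONV问题配置上评估其性能,相对于双线程ARM Neon优化的CPU基准实现了1.9x的平均加速。然后,我们在来自知名生成式模型的一系列TCONV层上评估MM2IM,实现最高4.2x的加速,并将其与类似的资源受限TCONV加速器进行比较,其GOPs/DSP至少高出2x。最后,我们在DCGAN和pix2pix GAN模型上评估MM2IM,相对于CPU基准实现最高3x的加速和2.4x的能耗降低。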
Article 2
Title@2025-07-09 (3): Compute Can’t Handle the Truth: Why Communication Tax Prioritizes Memory and Interconnects in Modern AI Infrastructure
Title: Compute Can’t Handle the Truth: Why Communication Tax Prioritizes Memory and Interconnects in Modern AI Infrastructure | Berechnen kann nicht mit der Wahrheit umgehen: Warum Kommunikationssteuer das Gedächtnis und die Verbindungen in der modernen KI-Infrastruktur priorisiert | 计算无法处理真相:为什么通讯税在现代AI基础设施中将记忆和相互联系放在优先地位? 2507.07223v1 |
Authors (1): Myoungsoo Jung
Modern AI workloads such as large language models (LLMs) and retrieval-augmented generation (RAG) impose severe demands on memory, communication bandwidth, and resource flexibility. Traditional GPU-centric architectures struggle to scale due to growing inter-GPU communication overheads. This report introduces key AI concepts and explains how Transformers revolutionized data representation in LLMs. We analyze large-scale AI hardware and data center designs, identifying scalability bottlenecks in hierarchical systems. To address these, we propose a modular data center architecture based on Compute Express Link (CXL) that enables disaggregated scaling of memory, compute, and accelerators. We further explore accelerator-optimized interconnects, collectively termed XLink (e.g., UALink, NVLink, NVLink Fusion), and introduce a hybrid CXL-over-XLink design to reduce long-distance data transfers while preserving memory coherence. We also propose a hierarchical memory model that combines local and pooled memory, and evaluate lightweight CXL implementations, HBM, and silicon photonics for efficient scaling. Our evaluations demonstrate improved scalability, throughput, and flexibility in AI infrastructure.
大型语言模型(LLMS)和检索增强的生成(RAG)等现代AI工作量,如大型语言模型(LLMS)和检索增强的生成(RAG),对记忆、通信带宽和资源灵活性提出了严重的要求。传统的GPU中心建筑由于GPU之间的通信管理费用不断增加而难以扩大规模。本报告介绍主要的AI概念,并解释变异器如何在LLMS中使数据代表发生革命。我们分析大型AI硬件和数据中心设计,找出等级系统中的可缩放瓶颈。为了解决这些问题,我们提议基于计算快递链接(CXL)的模块式数据中心结构,以便能够对记忆、计算和加速器进行分解的缩。我们进一步探索加速器-优化的互联互通-集体称为XLink(例如, ALink, NVVLink, NVLink Fulsion)- 并采用混合的 CXL-over-XLink设计,以减少长距离数据传输,同时保持记忆的一致性。我们还提议一个等级记忆模型,将本地和集合记忆结合起来,并评价轻型的CXLL(轻重 CXL)执行、HBMMM, 和硅灵活性,展示我们通过高效的升级和智能基础设施。
Article 3
Title@2025-07-09 (3): Opto-ViT: Architecting a Near-Sensor Region of Interest-Aware Vision Transformer Accelerator with Silicon Photonics
Title: Opto-ViT: Architecting a Near-Sensor Region of Interest-Aware Vision Transformer Accelerator with Silicon Photonics | Opto-ViT: Bau einer nah-Sensor-Region von Interesse-Aware Vision Transformer Accelerator mit Silicon Photonics | Opto-ViT: 设计具有硅光谱仪的近传感器区域 2507.07044v1 |
Authors (10): Mehrdad Morsali, Chengwei Zhou, Deniz Najafi, Sreetama Sarkar, Pietro Mercati, Navid Khoshavi, Peter Beerel, Mahdi Nikdast, Gourav Datta, Shaahin Angizi
Vision Transformers (ViTs) have emerged as a powerful architecture for computer vision tasks due to their ability to model long-range dependencies and global contextual relationships. However, their substantial compute and memory demands hinder efficient deployment in scenarios with strict energy and bandwidth limitations. In this work, we propose Opto-ViT, the first near-sensor, region-aware ViT accelerator leveraging silicon photonics (SiPh) for real-time and energy-efficient vision processing. Opto-ViT features a hybrid electronic-photonic architecture, where the optical core handles compute-intensive matrix multiplications using Vertical-Cavity Surface-Emitting Lasers (VCSELs) and Microring Resonators (MRs), while nonlinear functions and normalization are executed electronically. To reduce redundant computation and patch processing, we introduce a lightweight Mask Generation Network (MGNet) that identifies regions of interest in the current frame and prunes irrelevant patches before ViT encoding. We further co-optimize the ViT backbone using quantization-aware training and matrix decomposition tailored for photonic constraints. Experiments spanning device fabrication, circuit and architecture co-design, and classification, detection, and video tasks demonstrate that Opto-ViT achieves 100.4 KFPS/W with up to 84% energy savings and less than 1.6% accuracy loss, while enabling scalable and efficient ViT deployment at the edge.
视觉转换器(ViTs)因能够建模长距离依赖关系和全局上下文关系,已成为计算机视觉任务的强大架构。然而,它们的大量计算和存储需求阻碍了在能源和带宽限制严格的场景下的高效部署。在这项工作中,我们提出Opto-ViT,这是第一个利用硅光子学(SiPh)实现实时、节能视觉处理的近传感器、区域感知ViT加速器。Opto-ViT采用混合电子-光子结构,其中光学核心使用垂直腔面发射激光器(VCSELs)和微环谐振器(MRs)处理计算密集的矩阵乘法,而非线性函数和归一化则以电子方式执行。为减少冗余计算和图像块处理,我们引入一个轻量级的掩码生成网络(MGNet),用于识别当前帧中的感兴趣区域,并在ViT编码之前剪除不相关的图像块。我们还使用量化感知训练和针对光子约束定制的矩阵分解对ViT主干进行协同优化。涵盖器件制造、电路与架构协同设计以及分类、检测和视频任务的实验表明,Opto-ViT达到100.4 KFPS/W,节能最多84%,精度损失低于1.6%,同时实现了可扩展且高效的边缘ViT部署。
Article 4
Title@2025-07-09 (3): AHCPTQ: Accurate and Hardware-Compatible Post-Training Quantization for Segment Anything Model
Title: AHCPTQ: Accurate and Hardware-Compatible Post-Training Quantization for Segment Anything Model | AHCPTQ: Genaue und hardwarekompatible Nachschulungs-Quantisierung für Segment-Anything-Modell | AHCPTQ: 分片 “ 任何 “ 模式的准确和硬件兼容的训练后量化 2503.03088v2 |
Authors (4): Wenlun Zhang, Yunshan Zhong, Shimpei Ando, Kentaro Yoshioka
The Segment Anything Model (SAM) has demonstrated strong versatility across various visual tasks. However, its large storage requirements and high computational cost pose challenges for practical deployment. Post-training quantization (PTQ) has emerged as an effective strategy for efficient deployment, but we identify two key challenges in SAM that hinder the effectiveness of existing PTQ methods: the heavy-tailed and skewed distribution of post-GELU activations, and significant inter-channel variation in linear projection activations. To address these challenges, we propose AHCPTQ, an accurate and hardware-efficient PTQ method for SAM. AHCPTQ introduces hardware-compatible Hybrid Log-Uniform Quantization (HLUQ) to manage post-GELU activations, employing log2 quantization for dense small values and uniform quantization for sparse large values to enhance quantization resolution. Additionally, AHCPTQ incorporates Channel-Aware Grouping (CAG) to mitigate inter-channel variation by progressively clustering activation channels with similar distributions, enabling them to share quantization parameters and improving hardware efficiency. The combination of HLUQ and CAG not only enhances quantization effectiveness but also ensures compatibility with efficient hardware execution. For instance, under the W4A4 configuration on the SAM-L model, AHCPTQ achieves 36.6% mAP on instance segmentation with the DINO detector, while achieving a 7.89x speedup and 8.64x energy efficiency over its floating-point counterpart in FPGA implementation.
分层信息传输模式(SAM)在各种视觉任务中表现出很强的多功能性,然而,它的大量储存要求和高计算成本对实际部署构成挑战;培训后量化(PTQ)已成为高效部署的有效战略,但我们查明了SAM中阻碍现有PTQ方法有效性的两个关键挑战:GELU启动后启动的量的重尾和偏斜分布,以及线性投影启动中的重大渠道差异。为了应对这些挑战,我们提议AHPTQ(AHCPTQ),即SAM的准确和硬件高效的PTQQ方法。AHPTQ(HLQ)引入了硬件兼容的混合日志-统一量化(PTQQQ),以管理GELU启动后启动,使用对密度小值的正对正对齐的二次量化,对大值的稀释进行统一量化,以加强静态解解决方案。此外,ACPTQ(CGA) 与类似分发的组合,使它们能够分享QODRIV参数,同时提高SA-QA的稳定性执行。
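The hybrid log-uniform idea described in the AHCPTQ abstract above (log2 quantization for the dense small values, uniform quantization for the sparse large ones) can be illustrated with a toy scalar quantizer. This is a minimal sketch under stated assumptions, not the paper's HLUQ formulation: the split point `threshold`, the level counts, the clipping value `x_max`, and the restriction to non-negative inputs are all hypothetical choices made for illustration.

```python
import math


def hybrid_log_uniform_quantize(x, threshold=1.0, log_levels=8,
                                uniform_levels=8, x_max=8.0):
    """Toy hybrid quantizer for a single non-negative activation value.

    ASSUMPTIONS (not taken from the paper): values below `threshold` snap
    to threshold * 2**(-k) for k in 0..log_levels-1; values at or above it
    snap to a uniform grid on [threshold, x_max]; non-positive inputs map
    to 0 for simplicity.
    """
    if x <= 0.0:
        return 0.0
    if x < threshold:
        # Log2 region: resolution gets finer as values approach zero.
        k = round(-math.log2(x / threshold))
        k = min(max(k, 0), log_levels - 1)
        return threshold * 2.0 ** (-k)
    # Uniform region: equally spaced levels for the sparse large values.
    step = (x_max - threshold) / uniform_levels
    idx = round((min(x, x_max) - threshold) / step)
    return threshold + idx * step


for value in (0.01, 0.3, 0.9, 2.0, 4.0, 100.0):
    print(value, "->", hybrid_log_uniform_quantize(value))
```

The point of the split is that small values are separated by a constant ratio while large values are separated by a constant step, so quantization levels are spent where the activation mass is concentrated.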
Article 5
Title@2025-07-09 (3): Deep-Learning-Based Pre-Layout Parasitic Capacitance Prediction on SRAM Designs
Title: Deep-Learning-Based Pre-Layout Parasitic Capacitance Prediction on SRAM Designs | Deep-Learning-based Pre-Layout Parasitic Capacitance Prediction auf SRAM-Designs | 基于深度学习的SRAM设计版图前寄生电容预测 2507.06549v1 |
Authors (6): Shan Shen, Dingcheng Yang, Yuyang Xie, Chunyan Pei, Wenjian Yu, Bei Yu
To achieve higher system energy efficiency, SRAM in SoCs is often customized. The parasitic effects cause notable discrepancies between pre-layout and post-layout circuit simulations, leading to difficulty in converging design parameters and excessive design iterations. Is it possible to predict the parasitics accurately from the pre-layout circuit, so as to perform parasitic-aware pre-layout simulation? In this work, we propose a deep-learning-based 2-stage model to accurately predict these parasitics in pre-layout stages. The model combines a Graph Neural Network (GNN) classifier and Multi-Layer Perceptron (MLP) regressors, effectively managing class imbalance of the net parasitics in SRAM circuits. We also employ Focal Loss to mitigate the impact of abundant internal net samples and integrate subcircuit information into the graph to abstract the hierarchical structure of schematics. Experiments on 4 real SRAM designs show that our approach not only surpasses the state-of-the-art model in parasitic prediction, reducing error by up to 19X, but also significantly accelerates the simulation process, with up to 598X speedup.
为了实现更高的系统能效, SoCs 中的 SRAM 往往是定制的。 寄生效应导致预置和后布设电路模拟之间的显著差异,导致设计参数和过度设计迭代的难以调和。 是否有可能很好地预测基于预置电路的寄生虫体, 以便进行寄生虫意识预留机模拟? 在这项工作中, 我们提出一个基于深层次学习的2级模型, 以准确预测这些在预置阶段的寄生虫。 该模型将一个图形神经网络(GNN) 分类器和多射线接收器(MLP) 的递增器结合起来, 从而有效地管理SRAM电路中净寄生虫体的分类不平衡。 我们还利用Concle Loss 来减轻大量内部网样的影响, 并将亚电路信息纳入图中, 以抽取示学结构。 在 4个真实的 SRAM 设计的实验显示, 我们的方法不仅超过寄生模型的状态, 并且大大地将误差减少19X 98 的速度提升到5X 模拟过程。
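The abstract above mentions Focal Loss as the remedy for class imbalance. As a reminder of how that loss down-weights easy, abundant samples, here is a minimal sketch of the standard binary form, FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t). The default `gamma` and `alpha` values are illustrative assumptions; the abstract does not say which settings the authors use.

```python
import math


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one sample, given the predicted probability of its true class.

    With gamma = 0 and alpha = 1 this reduces to ordinary cross-entropy.
    The defaults here are illustrative, not taken from the paper.
    """
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


# An easy sample (p = 0.9) is down-weighted far more than a hard one (p = 0.1).
for p in (0.9, 0.5, 0.1):
    ce = focal_loss(p, gamma=0.0)
    fl = focal_loss(p, gamma=2.0)
    print(f"p_true={p}: cross-entropy={ce:.4f}, focal={fl:.4f}")
```

Because the modulating factor (1 - p_t)^gamma shrinks toward zero for confidently classified samples, the many easy samples contribute little to the total loss and training focuses on the rarer, harder ones.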
Article 6
Title@2025-07-09 (3): Towards LLM-based Root Cause Analysis of Hardware Design Failures
Title: Towards LLM-based Root Cause Analysis of Hardware Design Failures | Auf dem Weg zu einer LLM-basierten Root-Cause-Analyse von Hardware-Design-Fehlern | 基于LLM的硬件设计故障根本原因分析 2507.06512v1 |
Authors (6): Siyu Qiu, Muzhi Wang, Raheel Afsharmazayejani, Mohammad Moradi Shahmiri, Benjamin Tan, Hammond Pearce
With advances in large language models (LLMs), new opportunities have emerged to develop tools that support the digital hardware design process. In this work, we explore how LLMs can assist with explaining the root cause of design issues and bugs that are revealed during synthesis and simulation, a necessary milestone on the pathway towards widespread use of LLMs in the hardware design process and for hardware security analysis. We find promising results: for our corpus of 34 different buggy scenarios, OpenAI’s o3-mini reasoning model reached a correct determination 100% of the time under pass@5 scoring, with other state-of-the-art models and configurations usually achieving more than 80% performance and more than 90% when assisted with retrieval-augmented generation.
随着大型语言模型的进步,出现了开发支持数字硬件设计过程的工具的新机会。在这项工作中,我们探索了LLMS如何帮助解释在合成和模拟过程中揭示的设计问题和错误的根源,这是在硬件设计过程和硬件安全分析中广泛使用LLMS道路上一个必要的里程碑。我们发现有希望的结果:在我们总共34种不同的错误假设情景中,OpenAI的O3-mini推理模型在通过@5评分的时间里达到了100%的正确确定,其他艺术模型和配置通常达到80%以上,在协助检索生成时达到90%以上。
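The pass@5 figure quoted above is a pass@k score. The abstract does not say how the authors compute it; the sketch below shows the widely used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. Treat it as an illustration of the metric, not as the paper's evaluation code.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples, drawn without replacement
    from n generated samples of which c are correct, is correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples per problem, evaluated at k = 5.
for correct in (0, 1, 3, 10):
    print(f"c={correct}: pass@5 = {pass_at_k(10, correct, 5):.3f}")
```

A benchmark-level score is then the mean of this value over all problems, so 100% under pass@5 means every scenario had at least one correct answer among any five sampled responses.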
Article 7
Title@2025-07-08 (2): SLDB: An End-To-End Heterogeneous System-on-Chip Benchmark Suite for LLM-Aided Design
Title: SLDB: An End-To-End Heterogeneous System-on-Chip Benchmark Suite for LLM-Aided Design | SLDB: Eine End-to-End Heterogene System-on-Chip Benchmark Suite für LLM-Aided Design | SLDB: 面向LLM辅助设计的端到端异构片上系统基准套件 2507.06376v1 |
Authors (3): Elisavet Lydia Alvanaki, Kevin Lee, Luca P. Carloni
Over the last few years, Large Language Models (LLMs) have emerged as a valuable tool for Electronic Design Automation (EDA). State-of-the-art research in LLM-aided design has demonstrated the ability of LLMs to generate syntactically correct RTL code, showcasing encouraging prospects for integrating AI into the hardware design process. A key enabler of these advancements is the availability of high-quality benchmarks to evaluate new approaches. However, existing datasets and benchmarks fall short of system-level design, as they focus primarily on component-level information and low-complexity designs. To address this gap, we introduce the System-Level Design Benchmark (SLDB), a dataset tailored for evaluating LLMs in system-level integration and configuration tasks. SLDB includes a curated benchmark suite of 10 baseline SoC designs, whose components can be combined into an exponential number of distinct tile-based SoCs through a synthetic library. The dataset provides full SoC configurations, accelerator integration code, communication parameters, and accelerator-aware system configurations, along with testing-application code, compatible with the ESP platform[1].
过去几年来,大语言模型(LLMS)已成为电子设计自动化(EDA)的宝贵工具。LLM辅助设计中的最新研究表明,LLMS有能力生成综合正确的RTL代码,展示了将AI纳入硬件设计过程的令人鼓舞的前景。这些进展的一个关键推动因素是具备高质量基准来评价新方法。然而,现有的数据集和基准没有达到系统一级的设计,因为它们主要侧重于组件级信息和低兼容度设计。为弥补这一差距,我们采用了系统级设计基准,这是一套专门为系统级整合和配置任务评估LMS而设计的数据集。SLDB包括一套10个基线SoC设计基准套件,其组成部分可以通过合成图书馆与不同基于语言的索尔的指数数相结合。该数据集提供完整的 SoC配置、加速器集成码、通信参数和加速器-系统配置,同时提供测试-应用码兼容的平台[可兼容性1]。
Article 8
Title@2025-07-08 (2): hdl2v: A Code Translation Dataset for Enhanced LLM Verilog Generation
Title: hdl2v: A Code Translation Dataset for Enhanced LLM Verilog Generation | hdl2v: Ein Code-Übersetzungsdatensatz für verbesserte LLM Verilog-Generierung | hdl2v: 用于强化LLM Verilog 生成的代码翻译数据集 2506.04544v2 |
Authors (6): Charles Hong, Brendan Roberts, Huijae An, Alex Um, Advay Ratan, Yakun Sophia Shao
Large language models (LLMs) are playing an increasingly large role in domains such as code generation, including hardware code generation, where Verilog is the key language. However, the amount of publicly available Verilog code pales in comparison to the amount of code available for software languages like Python. In this work, we present hdl2v (“HDL-to-Verilog”), a dataset which seeks to increase the amount of available human-written Verilog data by translating or compiling three other hardware description languages (VHDL, Chisel, and PyMTL3) to Verilog. Furthermore, we demonstrate the value of hdl2v in enhancing LLM Verilog generation by improving performance of a 32 billion-parameter open-weight model by up to 23% (pass@10) in VerilogEvalV2, without utilizing any data augmentation or knowledge distillation from larger models. We also show hdl2v’s ability to boost the performance of a data augmentation-based fine-tuning approach by 63%. Finally, we characterize and analyze our dataset to better understand which characteristics of HDL-to-Verilog datasets can be expanded upon in future work for even better performance.
大型语言模型(LLMS)在诸如代码生成(包括硬件代码生成)等领域发挥着越来越重要的作用, 包括硬件代码生成( Verilog 是 Verilog 的关键语言 ) 。 然而, 公开提供的 Verilog 代码数量与 Python 等软件语言可用的代码数量相比, 与可用代码数量相比, Vython 等软件语言的代码数量是苍白的。 在这项工作中, 我们提供了 hdl2v (“ HDL- 到 Verilog ” ) , 这个数据集试图通过翻译或汇编其他三种硬件描述语言( VHDL、 Chisel 和 PyMTL3 - 至 Verilog ) 来增加现有的人文版 Verilog 数据数量。 此外, 我们用 HDL- VerivalV 2 改进了320亿 参数开放度模型的性能模型的性能( passel@10) 。 我们还展示了 hdl2 能力, 63% 来提升基于数据增强基于 微调方法的性能的性能。 最后, 我们分析和分析了我们的数据数据集, 以便更好地了解未来数据系统如何改进了HDL- 。
Article 9
Title@2025-07-08 (2): Multi-Queue SSD I/O Modeling & Its Implications for Data Structure Design
Title: Multi-Queue SSD I/O Modeling & Its Implications for Data Structure Design | Multi-Queue SSD I/O Modellierung & seine Implikationen für die Datenstrukturgestaltung | 多队队 SSD I/O 建模及其对数据结构设计的影响 2507.06349v1 |
Authors (3): Erin Ransom, Andrew Lim, Michael Mitzenmacher
Understanding the performance profiles of storage devices and how best to utilize them has always been non-trivial due to factors such as seek times, caching, scheduling, concurrent access, flash wear-out, and garbage collection. However, analytical frameworks that provide simplified abstractions of storage performance can still be accurate enough to evaluate external memory algorithms and data structures at the design stage. For example, the Disk Access Machine (DAM) model assumes that a storage device transfers data in fixed-size blocks of size B and that all transfers have unit latency. This abstraction is already sufficient to explain some of the benefits of data structures such as B-trees and Log-Structured Merge trees (LSM trees); however, storage technology advances have significantly reduced current models’ accuracy and utility. This paper introduces the Multi-Queue Solid State Drive (MQSSD) model, a new storage abstraction. This model builds upon previous models and aims to more accurately represent the performance characteristics of modern storage hardware. We identify key performance-critical aspects of modern multi-queue solid-state drives on which we base our model and demonstrate these characteristics on actual hardware. We then show how our model can be applied to LSM-tree-based storage engines to optimize them for modern storage hardware. We highlight that leveraging concurrent access is crucial for fully utilizing the high throughput of multi-queue SSDs, enabling designs that may appear counterintuitive under traditional paradigms. We then validate these insights through experiments using Facebook’s LSM-tree-based key-value store, RocksDB. We conclude that the MQSSD model offers a more accurate abstraction of modern hardware than previous models, allowing for greater insight and optimization.
了解存储装置的性能剖析以及如何最好地使用存储装置的性能剖析,总是非三进制的,因为寻找时间、缓存、时间安排、同时访问、闪光磨损和垃圾收集等因素。然而,提供储存性能简化抽象的分析性框架仍然可以准确到足以评价设计阶段的外部内存算法和数据结构。例如,磁盘存取机器模型(DAM)假设存储装置传输B大小的固定尺寸块的数据,而且所有传输都具有单位性能。这种传统直观已经足以解释数据结构的一些好处,如B-树和Log-Strucal-rocked Merge 树(LSM树);然而,储存技术的进步大大降低了当前模式的准确性和实用性。本文介绍了多Que Solid State Livestal(MSSDSD)模型(MSSSD),这个模型以以前的模型为基础,旨在更准确地代表现代储存硬件的性能特征。我们确定了基于现代的硬体数据数据库的关键-关键方面。我们用这个模型来建立我们的模型,然后用SD-SD-SDSDl的精度模型来结束的精度模型来完成。我们用这些硬储存的精度模型来完成。
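To make the DAM abstraction mentioned above concrete, the sketch below counts block transfers for a sequential scan and for a single tree search when data moves in blocks of B items at unit cost per transfer. The last function is a purely hypothetical toy, not the MQSSD model from the paper: it only illustrates why issuing several independent transfers at once can matter, under the assumption that up to `queue_depth` transfers complete per time step.

```python
def dam_scan_transfers(n_items, block_size):
    """Block transfers needed to read n_items sequentially in the DAM model."""
    return -(-n_items // block_size)  # ceiling division


def dam_search_transfers(n_items, block_size):
    """Transfers for one root-to-leaf search in a tree with fanout block_size.

    Simplification: each node fills exactly one block and holds
    block_size keys, so the height is the smallest h with B**h >= N.
    """
    height, capacity = 1, block_size
    while capacity < n_items:
        capacity *= block_size
        height += 1
    return height


def toy_parallel_steps(transfers, queue_depth):
    """HYPOTHETICAL toy, not the paper's model: time steps needed when up to
    queue_depth independent transfers can be in flight per step."""
    return -(-transfers // queue_depth)


print("scan 1000 items, B=64:", dam_scan_transfers(1000, 64))
print("search 10**6 items, B=64:", dam_search_transfers(10**6, 64))
print("16 transfers at queue depth 8:", toy_parallel_steps(16, 8))
```

In the plain DAM view every transfer costs one unit regardless of how many are outstanding, so the first two counts are the whole story; the toy third function hints at why a model that accounts for concurrent access can rank designs differently.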
Article 10
Title@2025-07-08 (2): PrefixAgent: An LLM-Powered Design Framework for Efficient Prefix Adder Optimization
Title: PrefixAgent: An LLM-Powered Design Framework for Efficient Prefix Adder Optimization | PrefixAgent: Ein LLM-Powered Design Framework für effiziente Prefix Adder-Optimierung | 前缀:高效前缀添加器优化的LLM授权设计框架 2507.06127v1 |
Authors (4): Dongsheng Zuo, Jiadong Zhu, Yang Luo, Yuzhe Ma
Prefix adders are fundamental arithmetic circuits, but their design space grows exponentially with bit-width, posing significant optimization challenges. Previous works face limitations in performance, generalization, and scalability. To address these challenges, we propose PrefixAgent, a large language model (LLM)-powered framework that enables efficient prefix adder optimization. Specifically, PrefixAgent reformulates the problem into subtasks including backbone synthesis and structure refinement, which effectively reduces the search space. More importantly, this new design perspective enables us to efficiently collect enormous high-quality data and reasoning traces with E-graphs, which in turn enables effective fine-tuning of the LLM. Experimental results show that PrefixAgent synthesizes prefix adders with consistently smaller areas compared to baseline methods, while maintaining scalability and generalization in commercial EDA flows.
前置语言添加器是基本的算术电路,但其设计空间随着位宽而成,成倍增长,形成巨大的优化挑战。以前的工程在性能、一般化和可缩放性方面面临着限制。为了应对这些挑战,我们提议使用一个大语言模型(LLM)驱动框架,使前置添加器能够高效地优化。具体地说,前置添加器将问题重新配置为子任务,包括主干合成和结构改进,从而有效地减少了搜索空间。更重要的是,这一新的设计视角使我们能够有效地收集大量高质量的数据和通过电子绘图推理的痕迹,这进一步导致对LLLM进行有效的微调。 实验结果表明,前置组合器与基线方法相比,在商业的EDA流中保持可缩放性和一般化性。
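For readers unfamiliar with prefix adders, the sketch below models one well-known point in the design space, a Kogge-Stone prefix network, in software. It is an illustration of what a prefix adder computes, not PrefixAgent's method or output; the function name `kogge_stone_add` is hypothetical.

```python
def kogge_stone_add(a, b, width):
    """Add two unsigned integers with a Kogge-Stone parallel-prefix carry network.

    Returns (sum modulo 2**width, carry_out).
    """
    # Bitwise generate (g) and propagate (p) signals.
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]

    # Prefix stage: after the loop, G[i] is the carry out of bit i.
    G, P = g[:], p[:]
    dist = 1
    while dist < width:
        new_G, new_P = G[:], P[:]
        for i in range(dist, width):
            new_G[i] = G[i] | (P[i] & G[i - dist])
            new_P[i] = P[i] & P[i - dist]
        G, P = new_G, new_P
        dist *= 2

    # Sum bit i is p[i] XOR the carry into bit i (carry out of bit i-1).
    total = 0
    for i in range(width):
        carry_in = G[i - 1] if i > 0 else 0
        total |= (p[i] ^ carry_in) << i
    return total, G[width - 1]


print(kogge_stone_add(13, 7, 4))  # (4, 1): 13 + 7 = 20 = 16 + 4
```

Other prefix structures compute the same carries with different trade-offs between logic depth, wiring, and area, which is why the search space grows so quickly with bit-width.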
Article 11
Title@2025-07-08 (2): RTGPU: Real-Time Computing with Graphics Processing Units
Title: RTGPU: Real-Time Computing with Graphics Processing Units | RTGPU: Echtzeit-Computing mit Grafikverarbeitungseinheiten | RTGPU: 配有图形处理股的实时计算机 2507.06069v1 |
Authors (11): Atiyeh Gheibi-Fetrat, Amirsaeed Ahmadi-Tonekaboni, Farzam Koohi-Ronaghi, Pariya Hajipour, Sana Babayan-Vanestan, Fatemeh Fotouhi, Elahe Mortazavian-Farsani, Pouria Khajehpour-Dezfouli, Sepideh Safari, Shaahin Hessabi, Hamid Sarbazi-Azad
In this work, we survey the role of GPUs in real-time systems. Originally designed for parallel graphics workloads, GPUs are now widely used in time-critical applications such as machine learning, autonomous vehicles, and robotics due to their high computational throughput. Their parallel architecture is well-suited for accelerating complex tasks under strict timing constraints. However, their integration into real-time systems presents several challenges, including non-preemptive execution, execution time variability, and resource contention, all of which can lead to unpredictable delays and deadline violations. We examine existing solutions that address these challenges, including scheduling algorithms, resource management techniques, and synchronization methods, and highlight open research directions to improve GPU predictability and performance in real-time environments.
在这项工作中,我们调查了GPU在实时系统中的作用。GPU最初是为平行图形工作量设计的,现在由于机器学习、自主车辆和机器人等时间紧迫的应用程序的计算量很高,GPU现在被广泛使用。它们的平行结构非常适合在严格的时间限制下加速复杂的任务。然而,将其纳入实时系统带来了若干挑战,包括非先发制人的执行、执行时间的变异性和资源争议;可能导致无法预测的延误和违反最后期限的因素。我们研究了应对这些挑战的现有解决办法,包括时间安排算法、资源管理技术和同步方法,并突出强调了提高GPU在实时环境中的可预测性和绩效的开放研究方向。
Article 12
Title@2025-07-08 (2): OLAF: Programmable Data Plane Acceleration for Asynchronous Distributed Reinforcement Learning
Title: OLAF: Programmable Data Plane Acceleration for Asynchronous Distributed Reinforcement Learning | OLAF: Programmierbare Datenplanbeschleunigung für asynchrones, verteiltes Weiterbildungslernen | OLAF: 可编程数据计划加速非同步分布式加强学习 2507.05876v1 |
Authors (6): Nehal Baganal Krishna, Anam Tahir, Firas Khamis, Mina Tahmasbi Arashloo, Michael Zink, Amr Rizk
Asynchronous Distributed Reinforcement Learning (DRL) can suffer from degraded convergence when model updates become stale, often the result of network congestion and packet loss during large-scale training. This work introduces a network data-plane acceleration architecture that mitigates such staleness by enabling inline processing of DRL model updates as they traverse the accelerator engine. To this end, we design and prototype a novel queueing mechanism that opportunistically combines compatible updates sharing a network element, reducing redundant traffic and preserving update utility. Complementing this, we provide a lightweight transmission control mechanism at the worker nodes that is guided by feedback from the in-network accelerator. To assess model utility at line rate, we introduce the Age-of-Model (AoM) metric as a proxy for staleness and verify global fairness and responsiveness properties using a formal verification method. Our evaluations demonstrate that this architecture significantly reduces update staleness and congestion, ultimately improving the convergence rate in asynchronous DRL workloads.
当模型更新变得僵化,往往是网络拥堵和大规模培训期间包包丢失的结果,这种分散分布式强化学习(DRL)可能会受到退化的趋同。 这项工作引入了一个网络数据-机加速结构,通过对 DRL 模型更新进行线性处理,从而缓解这种累滞性。 为此,我们设计并原型了一个新颖的排队机制,将共享网络元素的兼容性更新、减少冗余流量和保持更新效用的机会结合起来。 补充这一机制,我们为工人节提供了一种轻量的传输控制机制,以网络加速器的反馈为指导。 为了以线速评估模型的实用性,我们引入了AoM(AoM) 标准,作为固态的代理,并使用正式的核查方法核实全球公平性和反应性特性。 我们的评估表明,这一结构极大地减少了更新的粘稠和堵塞性,最终提高了不同步的DRL工作量的趋同率。
Article 13
Title@2025-07-08 (2): GATMesh: Clock Mesh Timing Analysis using Graph Neural Networks
Title: GATMesh: Clock Mesh Timing Analysis using Graph Neural Networks | GATMesh: Uhr Mesh Timing Analyse mit Hilfe von Graph Neural Networks | GATMesh:利用图形神经网络分析时钟网时间 2507.05681v1 |
Authors (2): Muhammad Hadir Khan, Matthew Guthaus
Clock meshes are essential in high-performance VLSI systems for minimizing skew and handling PVT variations, but analyzing them is difficult due to reconvergent paths, multi-source driving, and input mesh buffer skew. SPICE simulations are accurate but slow, while simplified models miss key effects like slew and input skew. We propose GATMesh, a Graph Neural Network (GNN)-based framework that models the clock mesh as a graph with augmented structural and physical features. Trained on SPICE data, GATMesh achieves high accuracy, with an average delay error of 5.27 ps on unseen benchmarks, while achieving speed-ups of 47146x over multi-threaded SPICE simulation.
在高性能VLSI系统中,锁环环片对于最大限度地减少扭曲和处理PVT变异至关重要,但由于路径对齐、多源驱动和输入网格缓冲键,分析这些变异是困难的。 SPICE模拟是准确但缓慢的; 但是简化模型忽略了关键效应, 如行刑和输入键盘。 我们提议GATMesh, 一个基于图形神经网络(GNN)的框架, 以图形的形式模拟时钟网, 以强化的结构和物理特征为模型。 在SPICE数据上培训, GATMesh 实现高度精确, 平均延迟误差为5. 27秒, 而在多读的SPICE 模拟中, 速度为47146x。
Article 14
Title@2025-07-08 (2): A Runtime-Adaptive Transformer Neural Network Accelerator on FPGAs
Title: A Runtime-Adaptive Transformer Neural Network Accelerator on FPGAs | Laufzeit-Adaptive Transformer Neural Network Accelerator auf FPGAs | FPGAs 运行时间- 适应性变革器神经网络加速器 2411.18148v3 |
Authors (5): Ehsan Kabir, Austin R. J. Downey, Jason D. Bakos, David Andrews, Miaoqing Huang
Transformer neural networks (TNNs) excel in natural language processing (NLP), machine translation, and computer vision (CV) without relying on recurrent or convolutional layers. However, they have high computational and memory demands, particularly on resource-constrained devices like FPGAs. Moreover, transformer models vary in processing time across applications, requiring custom models with specific parameters. Designing custom accelerators for each model is complex and time-intensive. Some custom accelerators exist with no runtime adaptability, and they often rely on sparse matrices to reduce latency. However, hardware designs become more challenging due to the need for application-specific sparsity patterns. This paper introduces ADAPTOR, a runtime-adaptive accelerator for dense matrix computations in transformer encoders and decoders on FPGAs. ADAPTOR enhances the utilization of processing elements and on-chip memory, enhancing parallelism and reducing latency. It incorporates efficient matrix tiling to distribute resources across FPGA platforms and is fully quantized for computational efficiency and portability. Evaluations on Xilinx Alveo U55C data center cards and embedded platforms like VC707 and ZCU102 show that our design is 1.2x and 2.87x more power efficient than the NVIDIA K80 GPU and the i7-8700K CPU, respectively. Additionally, it achieves a speedup of 1.7x to 2.25x compared to some state-of-the-art FPGA-based accelerators.
变压器神经网络(TNNN)在自然语言处理(NLP)、机器翻译和计算机视觉(CV)方面优于自然语言处理(NLP)、机器翻译和计算机视觉(CV),不依赖于经常性或进化层。然而,它们具有很高的计算和记忆需求,特别是FPGAs等资源限制的装置。此外,在各种应用程序的处理时间上,变压器模型各不相同,要求有特定参数的定制模型。为每个模型设计定制加速器既复杂又费时费时。有些定制加速器没有运行时间适应性,它们往往依靠稀薄的矩阵来降低悬浮度。然而,硬件设计由于需要具体应用的偏斜度模式,因此更具挑战性。本文引入了ADAPtor,这是在FPGGGAs的变压器编码和分解码器中进行密集矩阵计算的一个运行时间适应性加速加速加速加速器。
Article 15
Title@2025-07-08 (2): iThermTroj: Exploiting Intermittent Thermal Trojans in Multi-Processor System-on-Chips
Title: iThermTroj: Exploiting Intermittent Thermal Trojans in Multi-Processor System-on-Chips | iThermTroj: Ausnutzung intermittierender thermischer Trojaner in Multi-Prozessor-System-on-Chips | iThermTroj:在多处理器系统芯片中利用间隔热热特洛伊 2507.05576v1 |
Authors (4): Mehdi Elahi, Mohamed R. Elshamy, Abdel-Hameed Badawy, Ahmad Patooghy
Thermal Trojan attacks present a pressing concern for the security and reliability of System-on-Chips (SoCs), especially in mobile applications. The situation becomes more complicated when such attacks are more evasive and operate sporadically to stay hidden from detection mechanisms. In this paper, we introduce Intermittent Thermal Trojans (iThermTroj) that exploit the chips’ thermal information in a random time-triggered manner. According to our experiments, an iThermTroj attack can easily bypass available threshold-based thermal Trojan detection solutions. We investigate SoC vulnerabilities to variations of iThermTroj through an in-depth analysis of Trojan activation and duration scenarios. We also propose a set of tiny Machine Learning classifiers for run-time anomaly detection to protect SoCs against such intermittent thermal Trojan attacks. Compared to existing methods, our approach improves the attack detection rate by 29.4%, 17.2%, and 14.3% in scenarios where iThermTroj manipulates up to 80%, 60%, and 40% of the SoC’s thermal data, respectively. Additionally, our method increases the full protection resolution to 0.8 degrees Celsius, meaning that any temperature manipulations exceeding ±0.8 degrees will be detected with 100% accuracy.
热热的Trojan(iThermTroj)袭击对芯片(SOSC)的安全和可靠性提出了紧迫的关注,特别是在移动应用中。当这种袭击更加隐蔽,并且零星运作以隐藏于探测机制之外时,情况就变得更加复杂。在本文中,我们引入了以随机时间触发的方式利用芯片热信息的不定期热热热信息。根据我们的实验,iThermTroj袭击可以很容易地绕过基于阈值的热质Trojan探测解决方案。我们通过深入分析Trojan的启动和持续时间设想方案,调查 SoC对iThermTroj变异的脆弱性。我们还提出一套小型机器学习分类器,用于实时探测SOC,以防止这种间歇性热质袭击。与现有方法相比,我们的方法将攻击探测率提高29.4、17.2和14.3。在iThermTroj操纵到80、60和4040°C热量数据将分别提升到0.8度的防护度。
Article 16
Title@2025-07-08 (2): Per-Row Activation Counting on Real Hardware: Demystifying Performance Overheads
Title: Per-Row Activation Counting on Real Hardware: Demystifying Performance Overheads | Per-Row-Aktivierung auf echte Hardware zählen: Demystifying Performance Overheads | 实时硬件的每减激活计数:解开对性能的神秘化 2507.05556v1 |
Authors (8): Jumin Kim, Seungmin Baek, Minbok Wi, Hwayong Nam, Michael Jaemin Kim, Sukhan Lee, Kyomin Sohn, Jung Ho Ahn
Per-Row Activation Counting (PRAC), a DRAM read disturbance mitigation method, modifies key DRAM timing parameters, reportedly causing significant performance overheads in simulator-based studies. However, given known discrepancies between simulators and real hardware, real-machine experiments are vital for accurate PRAC performance estimation. We present the first real-machine performance analysis of PRAC. After verifying timing modifications on the latest CPUs using microbenchmarks, our analysis shows that PRAC’s average and maximum overheads are just 1.06% and 3.28% for the SPEC CPU2017 workloads – up to 9.15x lower than simulator-based reports. Further, we show that the close page policy minimizes this overhead by effectively hiding the elongated DRAM row precharge operations due to PRAC from the critical path.
一台DRAM读解扰动计数法(PRAC),DRAM读解扰动计数法(PRAC),它修改关键的DRAM计时参数,据报在模拟器研究中造成大量性能间接费用。然而,鉴于模拟器与实际硬件之间存在已知差异,实机实验对于准确的PRAC性能估计至关重要。我们对PRAC首次进行实机性能分析。在用微子标记来核查最新CPU的计时修改后,我们的分析显示,SPEC CPU 2017工作量的平均和最高间接费用只有1.06%和3.28%,比模拟器报告低9.15x。此外,我们显示,近页政策有效地隐藏了由于PRAC从关键路径进行的长期DRAM行前置业务,从而将这一间接费用降到最低。
Article 17
Title@2025-07-07 (1): Bit-Flip Fault Attack: Crushing Graph Neural Networks via Gradual Bit Search
Title: Bit-Flip Fault Attack: Crushing Graph Neural Networks via Gradual Bit Search | Bit-Flip-Fault-Angriff: Zerkleinernde Graphen-Neural-Netzwerke über schrittweise Bitsuche | Bit- Flip 错误攻击: 通过渐变位搜索粉碎图形神经网络 2507.05531v1 |
Authors (2): Sanaz Kazemi Abharian, Sai Manoj Pudukotai Dinakarrao
Graph Neural Networks (GNNs) have emerged as a powerful machine learning method for graph-structured data. A plethora of hardware accelerators has been introduced to meet the performance demands of GNNs in real-world applications. However, security challenges of hardware-based attacks have been generally overlooked. In this paper, we investigate the vulnerability of GNN models to hardware-based fault attack, wherein an attacker attempts to misclassify output by modifying trained weight parameters through fault injection in a memory device. Thus, we propose Gradual Bit-Flip Fault Attack (GBFA), a layer-aware bit-flip fault attack, selecting a vulnerable bit in each selected weight gradually to compromise the GNN’s performance by flipping a minimal number of bits. To achieve this, GBFA operates in two steps. First, a Markov model is created to predict the execution sequence of layers based on features extracted from memory access patterns, enabling the launch of the attack within a specific layer. Subsequently, GBFA identifies vulnerable bits within the selected weights using gradient ranking through an in-layer search. We evaluate the effectiveness of the proposed GBFA attack on various GNN models for node classification tasks using the Cora and PubMed datasets. Our findings show that GBFA significantly degrades prediction accuracy, and the variation in its impact across different layers highlights the importance of adopting a layer-aware attack strategy in GNNs. For example, GBFA degrades GraphSAGE’s prediction accuracy by 17% on the Cora dataset with only a single bit flip in the last layer.
内建网络(GNNs) 已成为一个强大的图形结构数据的机器学习方法。 大量硬件加速器已被引入, 以满足实际应用中 GNNs 的性能需求。 然而, 硬件攻击的安全挑战一般被忽略。 在本文中, 我们调查GNN 模型对基于硬件的过失攻击的脆弱性, 攻击者试图通过在记忆设备中注入错误来修改经过训练的重量参数, 从而错误地分类输出。 因此, 我们提议Gradual Bit- Flip Fault 攻击( GBFA) , 这是一种有一层觉悟的位断层断层攻击, 每选择的重量选择一个脆弱的精确度, 通过翻转少量的位数来逐渐降低 GNNNS 的性能。 要做到这一点, GBFA 模型创建了一个基于从记忆访问模式中提取的特征, 使攻击在特定层内启动的17个攻击。 随后, GBFA 仅通过层搜索的梯级排序, 在选定的最后重量中, 选择了脆弱的部分, 一种低度的精确度的精确度的精确度, 我们的GFADA 的精确度 的精确度, 。 在搜索模型中, 在不同的GVA 上, 显示中, 我们的数值 的数值 的数值 的数值 以不同的数值 的 以不同的数值 数字的精确度 的精确度 。
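To illustrate why the choice of bit matters in a bit-flip fault attack, the sketch below flips a single bit of a weight stored as a signed 8-bit integer. The int8 storage format is an assumption made here for illustration (the abstract does not state the weight format), and the helper `flip_bit_int8` is hypothetical; it shows only the effect of one flip, not GBFA's Markov-model layer prediction or gradient-ranked bit search.

```python
def flip_bit_int8(value, bit):
    """Flip one bit (0 = LSB, 7 = sign bit) of a two's-complement int8 value."""
    raw = value & 0xFF          # view the value as an unsigned byte
    raw ^= 1 << bit             # the injected fault: a single bit flip
    return raw - 256 if raw >= 128 else raw


weight = 3
print(flip_bit_int8(weight, 0))  # 2: flipping the LSB barely changes the weight
print(flip_bit_int8(weight, 7))  # -125: flipping the sign bit changes it drastically
```

A single flip in a high-order bit can move a weight across most of its representable range, which is consistent with the abstract's observation that very few flips can suffice.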
Article 18
Title@2025-07-07 (1): ViPSN 2.0: A Reconfigurable Battery-free IoT Platform for Vibration Energy Harvesting
Title: ViPSN 2.0: A Reconfigurable Battery-free IoT Platform for Vibration Energy Harvesting | ViPSN 2.0: Eine neu konfigurierbare, batteriefreie IoT-Plattform für Vibrationsenergieernte | VIPSN 2.0: 一个无电池的振动能源收集再配置无电的 IOT 平台 2507.05081v1 |
Authors (26): Xin Li, Mianxin Xiao, Xi Shen, Jiaqing Chu, Weifeng Huang, Jiashun Li, Yaoyi Li, Mingjing Cai, Jiaming Chen, Xinming Zhang, Daxing Zhang, Congsi Wang, Hong Tang, Bao Zhao, Qitao Lu, Yilong Wang, Jianjun Wang, Minyi Xu, Shitong Fang, Xuanyu Huang, Chaoyang Zhao, Zicheng Liu, Yaowen Yang, Guobiao Hu, Junrui Liang, Wei-Hsin Liao
Vibration energy harvesting is a promising solution for powering battery-free IoT systems; however, the instability of ambient vibrations presents significant challenges, such as limited harvested energy, intermittent power supply, and poor adaptability to various applications. To address these challenges, this paper proposes ViPSN 2.0, a modular and reconfigurable IoT platform that supports multiple vibration energy harvesters (piezoelectric, electromagnetic, and triboelectric) and accommodates sensing tasks with varying application requirements through standardized hot-swappable interfaces. ViPSN 2.0 incorporates an energy-indication power management framework tailored to various application demands, including light-duty discrete sampling, heavy-duty high-power sensing, and complex-duty streaming tasks, thereby effectively managing fluctuating energy availability. The platform’s versatility and robustness are validated through three representative applications: ViPSN-Beacon, enabling ultra-low-power wireless beacon transmission from a single transient fingertip press; ViPSN-LoRa, supporting high-power, long-range wireless communication powered by wave vibrations in actual marine environments; and ViPSN-Cam, enabling intermittent image capture and wireless transfer. Experimental results demonstrate that ViPSN 2.0 can reliably meet a wide range of requirements in practical battery-free IoT deployments under energy-constrained conditions.
振动能量采集是为无电池物联网(IoT)系统供电的一种有前景的方案;然而,环境振动的不稳定性带来了重大挑战,例如采集能量有限、供电间歇以及对不同应用的适应性差。为应对这些挑战,本文提出 ViPSN 2.0,一个模块化、可重构的物联网平台,它支持多种振动能量采集器(压电、电磁和摩擦电),并通过标准化的热插拔接口满足应用需求各异的传感任务。ViPSN 2.0 采用面向多种应用需求的能量指示电源管理框架,涵盖轻载离散采样、重载高功率传感和复杂负载流式任务,从而有效应对波动的能量供给。平台的通用性和鲁棒性通过三个代表性应用得到验证:ViPSN-Beacon,可由一次瞬时指尖按压实现超低功耗无线信标发送;ViPSN-LoRa,在真实海洋环境中由波浪振动供电,支持高功率、远距离无线通信;ViPSN-Cam,实现间歇式图像采集和无线传输。实验结果表明,ViPSN 2.0 能够在能量受限条件下可靠地满足实际无电池物联网部署中的多种需求。
Article 19
Title@2025-07-07 (1): Optimizing Scalable Multi-Cluster Architectures for Next-Generation Wireless Sensing and Communication
Title: Optimizing Scalable Multi-Cluster Architectures for Next-Generation Wireless Sensing and Communication | Optimierung skalierbarer Multi-Cluster-Architekturen für drahtloses Sensing und Kommunikation der nächsten Generation | 优化用于下一代无线遥感和通信的可缩放多集群建筑 2507.05012v1 |
Authors (4): Samuel Riedel, Yichao Zhang, Marco Bertuletti, Luca Benini
Next-generation wireless technologies (for immersive-massive communication, joint communication and sensing) demand highly parallel architectures for massive data processing. A common architectural template scales up by grouping tens to hundreds of cores into shared-memory clusters, which are then scaled out as multi-cluster manycore systems. This hierarchical design, used in GPUs and accelerators, requires a balancing act between fewer large clusters and more smaller clusters, affecting design complexity, synchronization, communication efficiency, and programmability. While all multi-cluster architectures must balance these trade-offs, there is limited insight into optimal cluster sizes. This paper analyzes various cluster configurations, focusing on synchronization, data movement overhead, and programmability for typical wireless sensing and communication workloads. We extend the open-source shared-memory cluster MemPool into a multi-cluster architecture and propose a novel double-buffering barrier that decouples processor and DMA. Our results show that a single 256-core cluster can be twice as fast as sixteen 16-core clusters for memory-bound kernels and up to 24% faster for compute-bound kernels due to reduced synchronization and communication overheads.
下一代无线技术(用于暗中移动通信、联合通信和感测)需要高度平行的大规模数据处理结构。通用的建筑模板将数万至数百个核心分组成共享模拟集群,然后作为多集群多核心系统予以扩大。这种在GPU和加速器中使用的等级设计需要在较少大型集群和较小集群之间采取平衡行动,影响设计复杂性、同步性、通信效率和程序可操作性。虽然所有多集群结构必须平衡这些权衡,但对最佳集群规模的洞察力有限。本文分析各种集群配置,重点是同步、数据移动管理以及典型无线感应和通信工作量的可编程性。我们把开放源共享集群MemPool 扩展为多集群结构,并提议一个新型的双重缓冲屏障,以拆分解相容处理器和DMA。我们的结果显示,单一的256核心集群可以比16个核心集群的内嵌式集群快一倍,以至超过24 %的内置式同步。
Article 20
Title@2025-07-07 (1): AXI-REALM: Safe, Modular and Lightweight Traffic Monitoring and Regulation for Heterogeneous Mixed-Criticality Systems
Title: AXI-REALM: Safe, Modular and Lightweight Traffic Monitoring and Regulation for Heterogeneous Mixed-Criticality Systems | AXI-REALM: Sichere, modulare und leichte Verkehrsüberwachung und Regulierung für Heterogene Mixed-Criticality-Systeme | AXI-REALM: 安全、模块和轻量量的交通监测和管理 2501.10161v2 |
Authors (9): Thomas Benz, Alessandro Ottaviano, Chaoqun Liang, Robert Balas, Angelo Garofalo, Francesco Restuccia, Alessandro Biondi, Davide Rossi, Luca Benini
The automotive industry is transitioning from federated, homogeneous, interconnected devices to integrated, heterogeneous, mixed-criticality systems (MCS). This leads to challenges in achieving timing predictability due to access contention on shared resources, which can be mitigated using hardware-based spatial and temporal isolation techniques. Focusing on the interconnect as the point of access for shared resources, we propose AXI-REALM, a lightweight, modular, technology-independent, and open-source real-time extension to AXI4 interconnects. AXI-REALM uses a budget-based mechanism enforced on periodic time windows and transfer fragmentation to provide fair arbitration, coupled with execution predictability on real-time workloads. AXI-REALM features a comprehensive bandwidth and latency monitor at both the ingress and egress of the interconnect system. Latency information is also used to detect and reset malfunctioning subordinates, preventing missed deadlines. We provide a detailed cost assessment in a 12 nm node and an end-to-end case study implementing AXI-REALM into an open-source MCS, incurring an area overhead of less than 2%. When running a mixed-criticality workload, with a time-critical application sharing the interconnect with non-critical applications, we demonstrate that the critical application can achieve up to 68.2% of the isolated performance by enforcing fairness on the interconnect traffic through burst fragmentation, thus reducing the subordinate access latency by up to 24 times. Near-ideal performance (above 95% of the isolated performance) can be achieved by distributing the available bandwidth in favor of the critical application.
汽车业正在从联邦、单一、相互连接的装置向综合、多样化和混合临界系统(MCS)过渡。这导致由于在共享资源上存在争议,实现时间可预测性技术的挑战,而使用基于硬件的空间和时间隔离技术可以缓解这种争议。侧重于连接作为共享资源的接入点,我们提议将AXI-REALM这一轻量、模块化、技术独立和开放源实时扩展至AXI4互联系统。AXI-REALM利用一个基于预算的机制,在定期时间窗口和传输分散上实施一个基于预算的机制,以提供公平的仲裁,同时在实时工作量上实施执行可预测性。AXI-REALM在互连系统的反反反反反反向和反向两方面都有一个全面的带宽度监测。 延迟信息还用于探测和重置下属的故障,防止错过最后期限。 我们对AXI-REAM进行详细的成本评估,在开放源的窗口和传输系统上实施基于预算的分散式配置,在实时工作量上保持公平性,从而通过不精确的连通性应用实现关键的连通性,同时进行我们之间的连通性应用。
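The two ideas named in the abstract, a budget enforced over periodic time windows and burst fragmentation, can be sketched in software as follows. This is a behavioural illustration under stated assumptions, not the AXI-REALM hardware: the names `fragment` and `BudgetRegulator`, the unit of "beats", and the replenish-at-window-boundary policy are all hypothetical choices made for this example.

```python
def fragment(burst_beats, max_fragment):
    """Split one long burst into fragments of at most max_fragment beats."""
    full, rest = divmod(burst_beats, max_fragment)
    return [max_fragment] * full + ([rest] if rest else [])


class BudgetRegulator:
    """Grants transfers until a per-window budget (in beats) is used up."""

    def __init__(self, budget_beats, period_cycles):
        self.budget = budget_beats
        self.period = period_cycles
        self.window = 0
        self.remaining = budget_beats

    def try_issue(self, beats, now):
        # Assumed policy: the budget is replenished at each window boundary.
        window = now // self.period
        if window != self.window:
            self.window = window
            self.remaining = self.budget
        if beats <= self.remaining:
            self.remaining -= beats
            return True
        return False  # the manager must wait for the next window


reg = BudgetRegulator(budget_beats=8, period_cycles=100)
print(fragment(10, 4))                              # [4, 4, 2]
print([reg.try_issue(n, now=0) for n in fragment(10, 4)])  # [True, True, False]
```

Fragmenting long bursts keeps any single manager from holding the interconnect for a long stretch, while the budget bounds how much bandwidth it can consume per window.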
Article 21
Title@2025-07-07 (1): Jack Unit: An Area- and Energy-Efficient Multiply-Accumulate (MAC) Unit Supporting Diverse Data Formats
Title: Jack Unit: An Area- and Energy-Efficient Multiply-Accumulate (MAC) Unit Supporting Diverse Data Formats | Jack Unit: Eine flächen- und energieeffiziente Multiplizierungsakkumulation (MAC)-Einheit, die unterschiedliche Datenformate unterstützt | 杰克单位:一个区域和能源效率乘数累积(MAC)单位,支持多种数据格式 2507.04772v1 |
Authors (6): Seock-Hwan Noh, Sungju Kim, Seohyun Kim, Daehoon Kim, Jaeha Kung, Yeseong Kim
In this work, we introduce an area- and energy-efficient multiply-accumulate (MAC) unit, named Jack unit, that is a jack-of-all-trades, supporting various data formats such as integer (INT), floating point (FP), and microscaling data format (MX). It provides bit-level flexibility and enhances hardware efficiency by i) replacing the carry-save multiplier (CSM) in the FP multiplier with a precision-scalable CSM, ii) performing the adjustment of significands based on the exponent differences within the CSM, and iii) utilizing 2D sub-word parallelism. To assess effectiveness, we implemented the layout of the Jack unit and three baseline MAC units. Additionally, we designed an AI accelerator equipped with our Jack units to compare with a state-of-the-art AI accelerator supporting various data formats. The proposed MAC unit occupies 1.17x to 2.01x smaller area and consumes 1.05x to 1.84x lower power compared to the baseline MAC units. On five AI benchmarks, the accelerator designed with our Jack units improves energy efficiency by 1.32x to 5.41x over the baseline across various data formats.
在这项工作中,我们引入了一个面积和节能乘积(MAC)单位,名为Jack 单位,称为 Jack 单位,这是一个全方位交换点,支持各种数据格式,如整数(INT)、浮点(FP)和微缩缩缩缩缩数据格式(MX)。它提供了位级灵活性,提高了硬件效率,通过一)用精确可缩放的CSM,取代FP乘数中的承载储蓄乘数(CSM),二)根据CSM内部的排出差异对标志和标志进行调整,以及三)利用2D分词平行法,利用2D分词。为了评估效果,我们实施了Jack 单位和3个基线MAC 单位的布局。此外,我们设计了一个配有我们的杰克单位的AI 加速器,以便与支持各种数据格式的最新的AI 加速器进行比较。拟议的MAC单位的面积为1.17~2.01x小面积,消耗1.84x比MAC单位的基线低1.05~5.8x功率。在5个基准基准线上,我们设计的自动加速器改进了1x1美元格式。
Article 22
Title@2025-07-07 (1): ChipSeek-R1: Generating Human-Surpassing RTL with LLM via Hierarchical Reward-Driven Reinforcement Learning
Title: ChipSeek-R1: Generating Human-Surpassing RTL with LLM via Hierarchical Reward-Driven Reinforcement Learning | ChipSeek-R1: Generierung von Mensch-Überwindungs-RTL mit LLM über hierarchisches Reward-getriebenes Verstärkungs-Lernen | ChipSeek-R1:通过等级制奖励强化学习,与LLM一道产生载人超越越越越越越越越越权 2507.04736v1 |
Authors (10): Zhirong Chen, Kaiyan Chang, Zhuolin Li, Xinyang He, Chujie Chen, Cangyuan Li, Mengdi Wang, Haobo Xu, Yinhe Han, Ying Wang
Large Language Models (LLMs) show significant potential for automating Register-Transfer Level (RTL) code generation. However, current approaches face a critical challenge: they cannot simultaneously optimize for functional correctness and hardware quality (Power, Performance, Area - PPA). Methods based on supervised fine-tuning often generate functionally correct but PPA-suboptimal code, lacking mechanisms to learn optimization principles. In contrast, post-processing techniques that attempt to improve PPA metrics after generation are often inefficient because they operate externally without updating the LLM’s parameters, thus failing to enhance the model’s intrinsic design capabilities. To bridge this gap, we introduce ChipSeek-R1, a hierarchical reward-driven reinforcement learning framework to train LLMs to generate RTL code that achieves both functional correctness and optimized PPA metrics. ChipSeek-R1 employs a hierarchical reward system, which incorporates direct feedback on syntax, functional correctness (from simulators) and PPA metrics (from synthesis tools) during reinforcement learning. This enables the model to learn complex hardware design trade-offs via trial-and-error, generating RTL code that is both functionally correct and PPA-optimized. Evaluating ChipSeek-R1 on standard benchmarks (VerilogEval, RTLLM), we achieve state-of-the-art results in functional correctness. Notably, on the RTLLM benchmark, ChipSeek-R1 generated 27 RTL designs surpassing the PPA metrics of the original human-written code. Our findings demonstrate the effectiveness of integrating toolchain feedback into LLM training and highlight the potential for reinforcement learning to enable automated generation of human-surpassing RTL code. We open-source our code in anonymous github.
大型语言模型(LLM)在自动生成寄存器传输级(RTL)代码方面显示出巨大潜力。然而,现有方法面临一个关键挑战:它们无法同时优化功能正确性和硬件质量(功耗、性能、面积,即 PPA)。基于监督微调的方法往往生成功能正确但 PPA 次优的代码,缺乏学习优化原则的机制。相比之下,试图在生成之后改进 PPA 指标的后处理技术往往效率低下,因为它们在外部运行而不更新 LLM 的参数,因而无法提升模型内在的设计能力。为弥合这一差距,我们提出 ChipSeek-R1,一个分层奖励驱动的强化学习框架,用于训练 LLM 生成既功能正确又具有优化 PPA 指标的 RTL 代码。ChipSeek-R1 采用分层奖励机制,在强化学习过程中纳入关于语法、功能正确性(来自仿真器)和 PPA 指标(来自综合工具)的直接反馈。这使模型能够通过试错学习复杂的硬件设计权衡,生成功能正确且 PPA 优化的 RTL 代码。在标准基准(VerilogEval、RTLLM)上评估 ChipSeek-R1,我们在功能正确性方面取得了最先进的结果。值得注意的是,在 RTLLM 基准上,ChipSeek-R1 生成了 27 个在 PPA 指标上超越原始人工编写代码的 RTL 设计。我们的结果证明了将工具链反馈融入 LLM 训练的有效性,并凸显了强化学习实现自动生成超越人类水平的 RTL 代码的潜力。我们在匿名 GitHub 上开源了代码。
Article 23
Title@2025-07-07 (1): FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale+ FPGAs
Title: FAMOUS: Flexible Accelerator for the Attention Mechanism of Transformer on UltraScale+ FPGAs | FAMOUS: Flexibler Beschleuniger für den Aufmerksamkeitsmechanismus des Transformators auf UltraScale+ FPGAs | FANOUS: 超标准+FPGAs变异器注意机制灵活加速器 2409.14023v3 |
Authors (6): Ehsan Kabir, Md. Arafat Kabir, Austin R. J. Downey, Jason D. Bakos, David Andrews, Miaoqing Huang
Transformer neural networks (TNNs) are being applied across a widening range of application domains, including natural language processing (NLP), machine translation, and computer vision (CV). Their popularity is largely attributed to the exceptional performance of their multi-head self-attention blocks when analyzing sequential data and extracting features. To date, there are limited hardware accelerators tailored for this mechanism, which is the first step before designing an accelerator for a complete model. This paper proposes FAMOUS, a flexible hardware accelerator for dense multi-head attention (MHA) computation of TNNs on field-programmable gate arrays (FPGAs). It is optimized for high utilization of processing elements and on-chip memories to improve parallelism and reduce latency. An efficient tiling of large matrices has been employed to distribute memory and computing resources across different modules on various FPGA platforms. The design is evaluated on Xilinx Alveo U55C and U200 data center cards containing Ultrascale+ FPGAs. Experimental results show that it can attain a maximum throughput, number of parallel attention heads, embedding dimension, and tile size of 328 giga operations per second (GOPS), 8, 768, and 64, respectively, on the U55C. Furthermore, it is 3.28x and 2.6x faster than the Intel Xeon Gold 5220R CPU and NVIDIA V100 GPU, respectively. It is also 1.3x faster than the fastest state-of-the-art FPGA-based accelerator.
Transformer 神经网络(TNN)正被应用于越来越广泛的领域,包括自然语言处理(NLP)、机器翻译和计算机视觉(CV)。其流行主要归功于多头自注意力模块在分析序列数据和提取特征时的出色性能。迄今为止,为该机制定制的硬件加速器仍然有限,而这是为完整模型设计加速器之前的第一步。本文提出 FAMOUS,一种在现场可编程门阵列(FPGA)上对 TNN 的稠密多头注意力(MHA)进行计算的灵活硬件加速器。它针对处理单元和片上存储器的高利用率进行了优化,以提高并行度并降低延迟。设计采用了对大矩阵的高效分块,以便在各种 FPGA 平台上的不同模块之间分配存储和计算资源。该设计在搭载 UltraScale+ FPGA 的 Xilinx Alveo U55C 和 U200 数据中心加速卡上进行了评估。实验结果表明,在 U55C 上,其最大吞吐量、并行注意力头数、嵌入维度和分块大小分别可达 328 GOPS(每秒十亿次运算)、8、768 和 64。此外,它分别比 Intel Xeon Gold 5220R CPU 和 NVIDIA V100 GPU 快 3.28 倍和 2.6 倍,并且比最快的现有 FPGA 加速器快 1.3 倍。
Article 24
Title@2025-07-07 (1): NeuroPDE: A Neuromorphic PDE Solver Based on Spintronic and Ferroelectric Devices
Title: NeuroPDE: A Neuromorphic PDE Solver Based on Spintronic and Ferroelectric Devices | NeuroPDE: Ein neuromorphes PDE-Lösemittel auf Basis spintronischer und Ferroelektrischer Geräte | NeuroPDE: 一种基于 Spentronic 和 Ferroweze 装置的神经形态PDE 溶解器 2507.04677v1 |
Authors (8): Siqing Fu, Lizhou Wu, Tiejun Li, Chunyuan Zhang, Sheng Ma, Jianmin Zhang, Yuhan Tang, Jixuan Tang
In recent years, new methods for solving partial differential equations (PDEs) such as Monte Carlo random walk methods have gained considerable attention. However, due to the lack of hardware-intrinsic randomness in the conventional von Neumann architecture, the performance of PDE solvers is limited. In this paper, we introduce NeuroPDE, a hardware design for neuromorphic PDE solvers that utilizes emerging spintronic and ferroelectric devices. NeuroPDE incorporates spin neurons that are capable of probabilistic transmission to emulate random walks, along with ferroelectric synapses that store continuous weights non-volatilely. The proposed NeuroPDE achieves a variance of less than 1e-2 compared to analytical solutions when solving diffusion equations, demonstrating a performance advantage of 3.48x to 315x speedup in execution time and an energy consumption advantage of 2.7x to 29.8x over advanced CMOS-based neuromorphic chips. By leveraging the inherent physical stochasticity of emerging devices, this study paves the way for future probabilistic neuromorphic computing systems.
近年来,解决部分差异方程式(PDE)的新方法,如Monte Carlo随机行走方法等,引起了相当多的关注,然而,由于常规的冯纽曼建筑缺乏硬件内分泌随机性,PDE解答器的性能有限。在本文中,我们引入了NeuroPDE,这是神经形态式PDE解答器的硬件设计,利用新兴的脊柱和电电动装置;NeuroPDE包含能够进行随机传导的旋转神经元,以及不挥发性地储存连续重量的铁电动突触觉。拟议的NeuroPDE在解决扩散方程式时,与分析解决方案相比,取得了不到1E-2的差异,这表明执行时间的性能优势为3.48x至315x加速,而能量消耗优势为2.7x至29.8x高于先进的CMOS型神经形态芯片。通过利用新兴装置固有的物理随机性,为未来的不稳定性神经形态计算系统铺平了道路。
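A minimal software sketch of the Monte Carlo random-walk idea referred to above, under simplifying assumptions: it solves the steady-state 1D problem u'' = 0 on the grid 0..n with u(0) = 0 and u(n) = 1, whose exact solution is u(k) = k/n. The function name `walk_estimate` is hypothetical, and the pseudo-random generator stands in for the device-level stochasticity that NeuroPDE is described as exploiting; the paper's actual problem setup is not given in the abstract.

```python
import random


def walk_estimate(start, n, walks, seed=0):
    """Estimate u(start) as the fraction of unbiased walks absorbed at x = n."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(walks):
        pos = start
        while 0 < pos < n:                      # walk until a boundary is hit
            pos += 1 if rng.random() < 0.5 else -1
        hits += pos == n                        # boundary value 1 at x = n, 0 at x = 0
    return hits / walks


estimate = walk_estimate(start=3, n=10, walks=4000)
print(estimate)  # close to the exact value 3/10
```

Every step consumes one random decision, so on a conventional processor the cost is dominated by random-number generation and sequential stepping, which is the bottleneck the abstract attributes to von Neumann hardware.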
Article 25
Title@2025-07-06 (7): da4ml: Distributed Arithmetic for Real-time Neural Networks on FPGAs
Title: da4ml: Distributed Arithmetic for Real-time Neural Networks on FPGAs | da4ml: Verteilte Arithmetik für Echtzeit-Neurale Netzwerke auf FPGAs | da4ml: FPGAs 实时神经网络的分布式重新测量 2507.04535v1 |
Authors (5): Chang Sun, Zhiqiang Que, Vladimir Loncar, Wayne Luk, Maria Spiropulu
Neural networks with a latency requirement on the order of microseconds, like the ones used at the CERN Large Hadron Collider, are typically deployed on FPGAs fully unrolled and pipelined. A bottleneck for the deployment of such neural networks is area utilization, which is directly related to the required constant matrix-vector multiplication (CMVM) operations. In this work, we propose an efficient algorithm for implementing CMVM operations with distributed arithmetic (DA) on FPGAs that simultaneously optimizes for area consumption and latency. The algorithm achieves resource reduction similar to state-of-the-art algorithms while being significantly faster to compute. The proposed algorithm is open-sourced and integrated into the hls4ml library, a free and open-source library for running real-time neural network inference on FPGAs. We show that the proposed algorithm can reduce on-chip resources by up to a third for realistic, highly quantized neural networks while simultaneously reducing latency, enabling the implementation of previously infeasible networks.
对微秒顺序有延缓要求的神经网络,如CERN大型远距对流器所使用的网络,通常部署在完全无动于衷和编织的FPGAs上。部署这种神经网络的瓶颈是地区利用,这与所需的恒定矩阵-矢量倍增(CMVM)操作直接相关。在这项工作中,我们建议一种高效的算法,用于在FPGAs上实施带有分布式算术(DA)的CMM操作,同时优化地区消费和延缓。算法实现了类似于最新算法的资源削减,同时大大加快了计算速度。提议的算法是开放源,并融入了\textt{hls4ml}图书馆,这是一个用于运行FPGAs实时神经网络的自由和开源图书馆。我们表明,拟议的算法可以将实时神经网络的资源减少三分之一,在现实和高度四分化的神经网络上减少三分之一,同时减少延缓度,使先前无法使用的网络得以实施。
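The building block behind multiplier-free CMVM can be sketched as follows: each constant is decomposed into signed digits so that multiplication becomes a handful of shifts and additions. This is only a minimal illustration using the non-adjacent (canonical signed digit) form; it is not da4ml's optimization algorithm, and all function names here are hypothetical.

```python
def csd_digits(c):
    """Signed digits in {-1, 0, 1}, least significant first, in non-adjacent form."""
    digits = []
    while c != 0:
        if c & 1:
            d = 2 - (c & 3)  # +1 if c mod 4 == 1, -1 if c mod 4 == 3
            c -= d
        else:
            d = 0
        digits.append(d)
        c >>= 1
    return digits

def shift_add_multiply(c, x):
    """Multiply integer x by constant c using only shifts, adds, and subtracts."""
    acc = 0
    for shift, d in enumerate(csd_digits(c)):
        if d == 1:
            acc += x << shift
        elif d == -1:
            acc -= x << shift
    return acc

def cmvm(matrix, vector):
    """Constant matrix-vector product built from shift-add multiplications."""
    return [sum(shift_add_multiply(c, x) for c, x in zip(row, vector))
            for row in matrix]
```

In hardware, each nonzero digit costs one adder, which is why fewer nonzero digits (and shared subexpressions, not shown here) translate into lower area.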
Article 26
Title@2025-07-06 (7): HLStrans: Dataset for LLM-Driven C-to-HLS Hardware Code Synthesis
Title: HLStrans: Dataset for LLM-Driven C-to-HLS Hardware Code Synthesis | HLStrans: Datensatz für LLM-getriebene C-zu-HLS-Hardware-Codesynthese | HLStrans:LLM-Driven C-to-HLS硬件代码合成数据集 2507.04315v1 |
Authors (5): Qingyun Zou, Nuo Chen, Yao Chen, Bingsheng He, WengFei Wong
High-level synthesis (HLS) enables software developers to describe and implement hardware at a higher level of abstraction by using C/C++ instead of traditional hardware description languages to automatically generate FPGA-ready designs. However, HLS code differs significantly from standard C/C++: it disallows certain coding idioms, relies on specialized libraries, and critically requires fine-grained transformations and the insertion of optimization directives (pragmas) to achieve high performance. Large language models (LLMs) have shown promise in automating such transformations, yet existing open-source datasets lack sufficient complexity and optimization diversity. To address this gap, we introduce the HLStrans dataset, a comprehensive collection of 137 distinct real-world programs, each annotated with a variety of C-to-HLS transformations that yield over 23K labeled design variants. These include a broad spectrum of pragmas and code-level optimizations. We benchmark state-of-the-art LLMs on this dataset to evaluate their ability to generate synthesizable, high-performance HLS code. As part of an ongoing effort, we plan to expand the HLStrans dataset in both scale and program variety, further empowering research at the intersection of AI and hardware synthesis.
高级合成(HLS)使软件开发者能够通过使用C/C++,而不是传统的硬件描述语言,在更高的抽象水平上描述和实施硬件,从而使用C/C++,而不是传统硬件描述语言,自动生成FPGA准备的设计。然而,生成HLS代码与标准的C/C++ 有很大不同:它不允许某些编码的音格,依赖专门图书馆,并非常需要细微的转换和插入优化指令(pragmas)以实现高性能。大型语言模型(LLLLMS)显示了实现这种转换自动化的希望,但现有的开放源数据集缺乏足够的复杂性和优化多样性。为了解决这一差距,我们引入了HLStrans数据集,这是一个包含137个截然不同的真实字程序的综合集,每个都有各种C–HLS转换的附加说明,产生超过23K标签设计变异体。其中包括一系列宽度的壁画和代码级优化。我们在这个数据集上对状态的LLLMS进行了基准,以评价它们生成合成高性能的HLS码的能力。作为正在进行的努力的一部分,我们计划在相互交织的硬件中进一步扩展的硬件程序。
Article 27
Title@2025-07-06 (7): FIXME: Towards End-to-End Benchmarking of LLM-Aided Design Verification
Title: FIXME: Towards End-to-End Benchmarking of LLM-Aided Design Verification | FIXME: Zur End-to-End-Benchmarkierung der LLM-Aided Design-Überprüfung | FIXME:走向LLM辅助设计核查的终至终基准基准 2507.04276v1 |
Authors (18): Gwok-Waa Wan, Shengchu Su, Ruihu Wang, Qixiang Chen, Sam-Zaak Wong, Mengnv Xing, Hefei Feng, Yubo Wang, Yinan Zhu, Jingyi Zhang, Jianmin Ye, Xinlai Wan, Tao Ni, Qiang Xu, Nan Guan, Zhe Jiang, Xi Wang, Yang Jun
Despite the transformative potential of Large Language Models (LLMs) in hardware design, a comprehensive evaluation of their capabilities in design verification remains underexplored. Current efforts predominantly focus on RTL generation and basic debugging, overlooking the critical domain of functional verification, which is the primary bottleneck in modern design methodologies due to the rapid escalation of hardware complexity. We present FIXME, the first end-to-end, multi-model, and open-source evaluation framework for assessing LLM performance in hardware functional verification (FV) to address this crucial gap. FIXME introduces a structured three-level difficulty hierarchy spanning six verification sub-domains and 180 diverse tasks, enabling in-depth analysis across the design lifecycle. Leveraging a collaborative AI-human approach, we construct a high-quality dataset using 100% silicon-proven designs, ensuring comprehensive coverage of real-world challenges. Furthermore, we enhance the functional coverage by 45.57% through expert-guided optimization. By rigorously evaluating state-of-the-art LLMs such as GPT-4, Claude3, and LLaMA3, we identify key areas for improvement and outline promising research directions to unlock the full potential of LLM-driven automation in hardware design verification. The benchmark is available at https://github.com/ChatDesignVerification/FIXME.
尽管大语言模型(LLM)在硬件设计方面具有变革潜力,但对其设计验证能力的全面评估仍未得到充分探讨。目前的工作主要侧重于RTL生成和基本调试,忽略了功能验证这一关键领域;由于硬件复杂性迅速上升,功能验证已成为现代设计方法中的主要瓶颈。为弥补这一关键差距,我们提出FIXME,这是第一个用于评估LLM在硬件功能验证(FV)中表现的端到端、多模型、开源评估框架。FIXME引入了结构化的三级难度层次,涵盖六个验证子领域和180个不同任务,从而能够在整个设计生命周期中进行深入分析。我们采用AI与人类协作的方法,使用100%经过流片验证的设计构建了高质量数据集,确保全面覆盖现实世界的挑战。此外,我们通过专家指导的优化将功能覆盖率提高了45.57%。通过严格评估GPT-4、Claude3和LLaMA3等最先进的LLM,我们确定了需要改进的关键领域,并概述了有前景的研究方向,以充分释放LLM驱动的自动化在硬件设计验证中的潜力。该基准可在 https://github.com/ChatDesignVerification/FIXME 获取。
Article 28
Title@2025-07-05 (6): Heterogeneous Memory Benchmarking Toolkit
Title: Heterogeneous Memory Benchmarking Toolkit | Heterogenes Memory Benchmarking Toolkit | 不同记忆基准衡量工具包 2505.00901v2 |
Authors (4): Golsana Ghaemi, Gabriel Franco, Kazem Taram, Renato Mancuso
This paper presents an open-source kernel-level heterogeneous memory characterization framework (MemScope) for embedded systems. MemScope enables precise characterization of the temporal behavior of available memory modules under configurable contention stress scenarios. MemScope leverages kernel-level control over physical memory allocation, cache maintenance, CPU state, interrupts, and I/O device activity to accurately benchmark heterogeneous memory subsystems. This privileged access lets us directly map pieces of contiguous physical memory and instantiate allocators, allowing us to finely control cores to create and eliminate interference. Additionally, we can minimize noise and interruptions, guaranteeing more consistent and precise results compared to equivalent user-space solutions. Running our framework on a Xilinx Zynq UltraScale+ ZCU102 CPU-FPGA platform demonstrates its capability to precisely benchmark bandwidth and latency across various memory types, including PL-side DRAM and BRAM, in a multi-core system.
本文为嵌入系统提供了一个开放源码级不同存储特性框架( MemScope ) 。 MemScope 能够精确地描述在可配置争议压力假设情景下可用存储模块的时间行为。 MemScope 对物理内存分配、缓存维护、 缓存状态、 CPU 状态、 中断和 I/ O 设备活动进行内嵌系统内部内嵌系统的开放源码级不同存储特性框架( MemScope ) 。 这给了我们直接绘制毗连物理内存和即时分配器碎片图的特权, 使我们能够微小地控制核心, 以创造和消除干扰。 此外, 我们可以将噪音和干扰降到最低, 保证与同等的用户空间解决方案相比, 更加一致和精确的结果。 我们在 Xilinx Zynq Ultrasus + ZCUCU102 CPU- FPGA 平台上运行我们的框架, 表明它有能力在一个多核心系统中精确地测量各种记忆类型( 包括PL- 侧 DRAM 和 BRAM) 的带宽度和宽度。
Article 29
Title@2025-07-05 (6): Reducing the Cost of Dropout in Flash-Attention by Hiding RNG with GEMM
Title: Reducing the Cost of Dropout in Flash-Attention by Hiding RNG with GEMM | Reduzierung der Kosten des Ausfalls in Flash-Achtung durch Verstecken von RNG mit GEMM | 通过与GEMM躲入RNNG降低 “ 闪启动 “ 中的辍学费用 2410.07531v2 |
Authors (3): Haiyue Ma, Jian Liu, Ronny Krashinsky
Dropout, a network operator, is likely to dramatically degrade the performance of Flash-Attention when enabled, which in turn increases the end-to-end training time of large language models (LLMs). The main contributor to such performance degradation is the Random Number Generation (RNG) phase. The state-of-the-art optimization is to fuse RNG into the Flash-Attention kernel. However, while RNG and Attention do not compete for compute or memory resources, they are bound by the same lower-level architecture bottlenecks, so fusion can hardly hide RNG latency within the Attention kernel. We propose overlapping RNG with previous GEMM layers in the network to hide RNG latency and improve end-to-end performance. RNG and GEMM have distinct resource requirements and hardware bottlenecks, so they can run together without compromising each other’s performance. We propose a fine-grained analytical performance model that analyzes low-level architecture resource utilization to evaluate RNG-GEMM overlapping performance benefits. This model, cross-validated by silicon results, shows 1.26x speedup for overlapping RNG and GEMM layers over a sequential implementation on one Transformer Block (one LLM layer including multi-head attention and feed-forward layers), and 1.22x over state-of-the-art fusion implementation, for Llama3 on GH100 GPUs with FP8 precision. Because the kernel patterns are regular, the findings of the shared bottlenecks, as well as the achievable performance benefits, can be generalized to different model architectures, software implementations and hardware configurations.
当启用时,Dropout这一网络算子很可能显著降低Flash-Attention的性能,进而增加大语言模型(LLM)的端到端训练时间。造成这种性能下降的主要原因是随机数生成(RNG)阶段。目前最先进的优化是将RNG融合到Flash-Attention内核中。然而,虽然RNG和Attention并不竞争计算或内存资源,但它们受限于相同的底层架构瓶颈,因此融合几乎无法在Attention内核中隐藏RNG延迟。我们提议将RNG与网络中此前的GEMM层重叠,以隐藏RNG延迟并提升端到端性能。RNG和GEMM具有不同的资源需求和硬件瓶颈,因此可以同时运行而互不影响性能。我们提出一个细粒度的分析性能模型,通过分析底层架构资源利用率来评估RNG与GEMM重叠带来的性能收益。该模型经芯片实测结果交叉验证,表明在GH100 GPU上以FP8精度运行Llama3时,在一个Transformer块(一个LLM层,包括多头注意力层和前馈层)上,重叠RNG与GEMM层相对于顺序实现可获得1.26倍加速,相对于最先进的融合实现可获得1.22倍加速。由于内核模式较为规则,关于共享瓶颈的发现以及可实现的性能收益可以推广到不同的模型架构、软件实现和硬件配置。
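The intuition of hiding RNG behind a preceding GEMM can be sketched with a deliberately coarse timeline model. This is not the paper's fine-grained analytical model: the function names, the optional `slowdown` contention factor, and the time units are all assumptions made for illustration.

```python
def sequential_time(t_gemm, t_rng, t_attention):
    """GEMM, then RNG, then attention, executed back to back."""
    return t_gemm + t_rng + t_attention

def overlapped_time(t_gemm, t_rng, t_attention, slowdown=1.0):
    """RNG runs concurrently with the preceding GEMM.

    `slowdown` (>= 1.0) is a hypothetical factor for residual contention
    while the two kernels share the device.
    """
    return max(t_gemm, t_rng) * slowdown + t_attention

def overlap_speedup(t_gemm, t_rng, t_attention, slowdown=1.0):
    """Ratio of sequential to overlapped execution time."""
    return (sequential_time(t_gemm, t_rng, t_attention)
            / overlapped_time(t_gemm, t_rng, t_attention, slowdown))
```

With arbitrary units such as `overlap_speedup(8.0, 2.0, 5.0)`, the RNG time disappears from the critical path as long as it is shorter than the GEMM it overlaps with.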
Article 30
Title@2025-07-04 (5): A Flexible Instruction Set Architecture for Efficient GEMMs
Title: A Flexible Instruction Set Architecture for Efficient GEMMs | Flexible Instruktions-Set-Architektur für effiziente GEMMs | 高效的通用环管机制的灵活教学结构 2507.03522v1 |
Authors (5): Alexandre de Limas Santana, Adrià Armejach, Francesc Martinez, Erich Focht, Marc Casas
General Matrix Multiplications (GEMMs) are recurrent in high-performance computing and deep learning workloads. Typically, high-end CPUs accelerate GEMM workloads with Single-Instruction Multiple Data (SIMD) or vector Instruction Set Architectures (ISAs). Since these ISAs face significant issues when running GEMM workloads, particularly when dealing with small, tall, or skinny matrices, matrix ISAs have been proposed and implemented by major hardware vendors in recent years. Although these matrix ISAs deliver larger throughput when running GEMMs than their SIMD/vector counterparts, they are rigid solutions unable to dynamically adapt themselves to application-specific aspects like the data format. This paper demonstrates that the state-of-the-art matrix ISAs deliver suboptimal performance when running the most commonly used convolution and transformer models. This paper proposes the Matrix Tile Extension (MTE), the first matrix ISA that completely decouples the instruction set architecture from the microarchitecture and seamlessly interacts with existing vector ISAs. MTE incurs minimal implementation overhead since it only requires a few additional instructions and a 64-bit Control Status Register (CSR) to keep its state. Specifically, MTE can i) vectorize GEMMs across the three dimensions M, N, and K; ii) leverage the capacity of the existing vector register file; and iii) decouple the tile shape from the underlying microarchitecture. MTE achieves speed-ups of 1.35x over the best state-of-the-art matrix ISA.
基因矩阵乘数( GEMMs) 在高性能计算和深层学习工作量中经常出现。 通常, 高端 CPU 加速 GEM 工作量, 使用单向多重数据( SIMD) 或矢量指令设置架构( ISAs) 。 由于这些ISA 在运行 GEM 工作量时面临重大问题, 特别是处理小、 高或瘦矩阵时, 过去几年中主要硬件供应商提议并实施了矩阵 ISA 。 虽然这些矩阵在运行 GEM 时提供比 SIMD/ 矢量对应人员更大的输送量, 但它们是僵硬的解决方案, 无法动态地适应数据格式等特定应用程序。 本文显示, 最先进的矩阵 ISA 在运行最常用的变动和变异模型时, 提供亚优性表现。 本文提出了矩阵扩展( MTEA) 第一个矩阵, 将指令设置的结构与现有的矢量 ISA 进行完全分解, 与现有的矢量媒介进行无缝的交互互动。 MOTE 实施最短的解决方案是最低的, 因为其顶部的形状, 仅需要多少的 IMC MAC 和MSL 。
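To make the idea of tiling a GEMM across M, N, and K with a software-chosen tile shape concrete, here is a minimal pure-Python sketch. It does not model MTE instructions or its CSR; the function name and tile parameters are assumptions for illustration only.

```python
def tiled_gemm(a, b, tile_m, tile_n, tile_k):
    """Compute C = A x B by iterating over tiles of the M, N, and K dimensions."""
    m, k = len(a), len(a[0])
    n = len(b[0])
    c = [[0] * n for _ in range(m)]
    for i0 in range(0, m, tile_m):
        for j0 in range(0, n, tile_n):
            for k0 in range(0, k, tile_k):
                # Each (i0, j0, k0) triple is one tile-level multiply-accumulate;
                # min() handles edge tiles of small, tall, or skinny matrices.
                for i in range(i0, min(i0 + tile_m, m)):
                    for j in range(j0, min(j0 + tile_n, n)):
                        acc = c[i][j]
                        for kk in range(k0, min(k0 + tile_k, k)):
                            acc += a[i][kk] * b[kk][j]
                        c[i][j] = acc
    return c
```

The result is independent of the tile shape, which is the property that allows the tile geometry to be chosen per workload rather than fixed by the hardware.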
Article 31
Title@2025-07-04 (5): High-Level Surface Code Decoding via Parallel FFNNs on CIM Platforms
Title: High-Level Surface Code Decoding via Parallel FFNNs on CIM Platforms | High-Level-Oberflächencode-Dekodierung über parallele FFNNs auf CIM-Plattformen | 通过平行的FFNN就CIM平台通过平行的FFNN就CIM平台进行认证 2411.18090v2 |
Authors (12): Hao Wang, Erjia Xiao, Wenbo Mu, Songhuan He, Zhongyi Ni, Lingfeng Zhang, Xiaokun Zhan, Yifei Cui, Jinguo Liu, Cheng Wang, Zhongrui Wang, Renjing Xu
Due to the high sensitivity of qubits to environmental noise, which leads to decoherence and information loss, active quantum error correction (QEC) is essential. Surface codes represent one of the most promising fault-tolerant QEC schemes, but they require decoders that are accurate, fast, and scalable to large-scale quantum platforms. Among all types of decoders, fully neural-network-based high-level decoders offer decoding thresholds that surpass the baseline decoder, Minimum Weight Perfect Matching (MWPM), and exhibit strong scalability, making them one of the ideal solutions for addressing surface code challenges. However, current fully neural-network-based high-level decoders can only operate serially and do not meet the current latency requirements (below 440 ns). To address these challenges, we first propose a parallel fully feedforward neural network (FFNN) high-level surface code decoder, and comprehensively measure its decoding performance on a computing-in-memory (CIM) hardware simulation platform. With the currently available hardware specifications, our work achieves a decoding threshold of 14.22%, surpassing the MWPM baseline of 10.3%, and achieves high pseudo-thresholds of 10.4%, 11.3%, 12%, and 11.6% with decoding latencies of 197.03 ns, 234.87 ns, 243.73 ns, and 251.65 ns for distances of 3, 5, 7 and 9, respectively. The impact of hardware parameters and non-idealities on these results is discussed, and the hardware simulation results are extrapolated to a 4K quantum cryogenic environment.
由于对环境噪声的高度敏感性,这会导致分解和信息丢失,主动量误差校正(QEC)至关重要。地表代码代表着最有希望的防故障QEC方案之一,但需要精确、快速且可缩到大型量子平台的解码器。在所有类型的解码器中,完全以神经网络为基础的高级解码器提供了超过基线脱coder-最小值73 超标的解码阈值(MWPM),并显示出强大的可变性,使其成为应对表面代码挑战的理想解决方案之一。然而,目前完全神经网络为基础的高解码器只能连续操作,不能满足当前拉动要求(低于440 ns)。为了应对这些挑战,我们首先提出一个平行的全向上线网络(FFNNNN)高级地码解码解码,并全面测量其在计算内值(NWMMM)硬件模拟平台上的解码性能,使其成为解决地表代码挑战的理想解决方案之一。随着目前可用的硬件规格(5,以神经网络为基础的高级解码 3, 高级解码高级解码码器操作器) 3, 高级解码高级解码仪(NMWDMW) 的值为9,我们的工作将达到12个基底基底基底值(MWDMARD%) 底值值环境的底值值值和14.
Article 32
Title@2025-07-04 (5): Hummingbird: A Smaller and Faster Large Language Model Accelerator on Embedded FPGA
Title: Hummingbird: A Smaller and Faster Large Language Model Accelerator on Embedded FPGA | Hummingbird: Ein kleinerer und schnellerer Large Language Model Accelerator auf Embedded FPGA | 蜂鸟:在嵌入的FPGA上用更小、更快的大型语言模型加速器 2507.03308v1 |
Authors (7): Jindong Li, Tenglong Li, Ruiqi Chen, Guobin Shen, Dongcheng Zhao, Qian Zhang, Yi Zeng
Deploying large language models (LLMs) on embedded devices remains a significant research challenge due to the high computational and memory demands of LLMs and the limited hardware resources available in such environments. While embedded FPGAs have demonstrated performance and energy efficiency in traditional deep neural networks, their potential for LLM inference remains largely unexplored. Recent efforts to deploy LLMs on FPGAs have primarily relied on large, expensive cloud-grade hardware and have only shown promising results on relatively small LLMs, limiting their real-world applicability. In this work, we present Hummingbird, a novel FPGA accelerator designed specifically for LLM inference on embedded FPGAs. Hummingbird is smaller, targeting embedded FPGAs such as the KV260 and ZCU104 with 67% LUT, 39% DSP, and 42% power savings over existing research. Hummingbird is stronger, targeting LLaMA3-8B and supporting longer contexts, overcoming the typical 4GB memory constraint of embedded FPGAs through offloading strategies. Finally, Hummingbird is faster, achieving 4.8 tokens/s and 8.6 tokens/s for LLaMA3-8B on the KV260 and ZCU104 respectively, with 93-94% model bandwidth utilization, outperforming the prior baseline of 4.9 tokens/s for LLaMA2-7B with 84% bandwidth utilization. We further demonstrate the viability of industrial applications by deploying Hummingbird on a cost-optimized Spartan UltraScale FPGA, paving the way for affordable LLM solutions at the edge.
由于LLM对计算和内存的需求很高,而嵌入式环境中可用的硬件资源有限,在嵌入式设备上部署大语言模型(LLM)仍然是一项重大的研究挑战。嵌入式FPGA已在传统深度神经网络中展现出性能和能效,但其在LLM推理方面的潜力基本上尚未得到探索。近期在FPGA上部署LLM的工作主要依赖大型、昂贵的云级硬件,并且只在相对较小的LLM上显示出有希望的结果,限制了其实际适用性。在这项工作中,我们提出Hummingbird,一种专为嵌入式FPGA上的LLM推理设计的新型FPGA加速器。Hummingbird更小,面向KV260和ZCU104等嵌入式FPGA,与现有研究相比节省67%的LUT、39%的DSP和42%的功耗。Hummingbird更强,面向LLaMA3-8B并支持更长的上下文,通过卸载策略克服了嵌入式FPGA典型的4GB内存限制。最后,Hummingbird更快,在KV260和ZCU104上运行LLaMA3-8B分别达到4.8 tokens/s和8.6 tokens/s,模型带宽利用率为93-94%,优于此前LLaMA2-7B上4.9 tokens/s、带宽利用率84%的基线。我们还在成本优化的Spartan UltraScale FPGA上部署Hummingbird,进一步证明了其工业应用的可行性,为边缘端经济实惠的LLM解决方案铺平了道路。
Article 33
Title@2025-07-04 (5): ForgeHLS: A Large-Scale, Open-Source Dataset for High-Level Synthesis
Title: ForgeHLS: A Large-Scale, Open-Source Dataset for High-Level Synthesis | ForgeHLS: Ein großformatiger, Open-Source-Datensatz für High-Level-Synthese | ForgeHLS: 用于高级别综合的大型、开放源码数据集 2507.03255v1 |
Authors (6): Zedong Peng, Zeju Li, Mingzhe Gao, Qiang Xu, Chen Zhang, Jieru Zhao
We introduce ForgeEDA, an open-source comprehensive circuit dataset across various categories. ForgeEDA includes diverse circuit representations such as Register Transfer Level (RTL) code, Post-mapping (PM) netlists, And-Inverter Graphs (AIGs), and placed netlists, enabling comprehensive analysis and development. We demonstrate ForgeEDA’s utility by benchmarking state-of-the-art EDA algorithms on critical tasks such as Power, Performance, and Area (PPA) optimization, highlighting its ability to expose performance gaps and drive advancements. Additionally, ForgeEDA’s scale and diversity facilitate the training of AI models for EDA tasks, demonstrating its potential to improve model performance and generalization. By addressing limitations in existing datasets, ForgeEDA aims to catalyze breakthroughs in modern IC design and support the next generation of innovations in EDA.
我们引入了开放源码综合电路数据集ForgeEDA(ForgeEDA),这是一个跨越不同类别的开放源码综合电路数据集;ForgeEDA(ForgeEDA)包括多种电路代表,如登记册传输级别代码(RTL),后绘图网名单(PM),和内向图(AIGs),以及放置的网络名单,从而能够进行全面分析和开发;我们通过对电力、性能和地区(PPPA)优化等关键任务的最新EDA算法进行基准化,展示ForgeEDA的效用,突出其暴露性能差距和驱动力进步的能力;此外,ForgeEDA的规模和多样性有助于培训AI模型,展示其改进模型性能和概括化的潜力;通过解决现有数据集的局限性,ForgeEDA(FOREDA)旨在催化现代IC设计方面的突破,并支持EDA的下一代创新。
Article 34
Title@2025-07-03 (4): Hey AI, Generate Me a Hardware Code! Agentic AI-based Hardware Design & Verification
Title: Hey AI, Generate Me a Hardware Code! Agentic AI-based Hardware Design & Verification | Hey KI, Generieren Sie mir einen Hardware-Code! Agentische KI-basierte Hardware-Design & Verifizierung | AI, 生成一个硬件代码! Agentic AI 的硬件设计和验证 2507.02660v1 |
Authors (7): Deepak Narayan Gadde, Keerthan Kopparam Radhakrishna, Vaisakh Naduvodi Viswambharan, Aman Kumar, Djones Lettnin, Wolfgang Kunz, Sebastian Simon
Modern Integrated Circuits (ICs) are becoming increasingly complex, and so is their development process. Hardware design verification entails a methodical and disciplined approach to the planning, development, execution, and sign-off of functionally correct hardware designs. This tedious process requires significant effort and time to ensure a bug-free tape-out. The field of Natural Language Processing has undergone a significant transformation with the advent of Large Language Models (LLMs). These powerful models, often referred to as Generative AI (GenAI), have revolutionized how machines understand and generate human language, enabling unprecedented advancements in a wide array of applications, including hardware design verification. This paper presents an agentic AI-based approach to hardware design verification, which empowers AI agents, in collaboration with Human-in-the-Loop (HITL) intervention, to engage in a more dynamic, iterative, and self-reflective process, ultimately performing end-to-end hardware design and verification. This methodology is evaluated on five open-source designs, achieving over 95% coverage with reduced verification time while demonstrating superior performance, adaptability, and configurability.
现代集成电路(IC)越来越复杂,其开发过程也越来越复杂。硬件设计验证要求以有条不紊、严谨的方法来规划、开发、执行和签核功能正确的硬件设计。这一繁琐的过程需要投入大量精力和时间,以确保无缺陷流片。随着大语言模型(LLM)的出现,自然语言处理领域发生了重大转变。这些强大的模型通常被称为生成式AI(GenAI),它们彻底改变了机器理解和生成人类语言的方式,使包括硬件设计验证在内的众多应用取得了前所未有的进步。本文提出了一种基于智能体AI的硬件设计验证方法,使AI智能体能够在人在回路(Human-in-the-Loop,HITL)干预的配合下,进行更加动态、迭代和自我反思的过程,最终完成端到端的硬件设计与验证。该方法在五个开源设计上进行了评估,在缩短验证时间的同时实现了超过95%的覆盖率,并展现出更优的性能、适应性和可配置性。
Article 35
Title@2025-07-03 (4): Breaking the HBM Bit Cost Barrier: Domain-Specific ECC for AI Inference Infrastructure
Title: Breaking the HBM Bit Cost Barrier: Domain-Specific ECC for AI Inference Infrastructure | Durchbrechen der HBM Bit Cost Barrier: Domainspezifisches ECC für KI-Inferenz-Infrastruktur | 打破HBM比位成本壁垒:AI推理基础设施特定域ECC 2507.02654v1 |
Authors (8): Rui Xie, Asad Ul Haq, Yunhua Fang, Linsen Ma, Sanchari Sen, Swagath Venkataramani, Liu Liu, Tong Zhang
High-Bandwidth Memory (HBM) delivers exceptional bandwidth and energy efficiency for AI workloads, but its high cost per bit, driven in part by stringent on-die reliability requirements, poses a growing barrier to scalable deployment. This work explores a system-level approach to cost reduction by eliminating on-die ECC and shifting all fault management to the memory controller. We introduce a domain-specific ECC framework combining large-codeword Reed–Solomon (RS) correction with lightweight fine-grained CRC detection, differential parity updates to mitigate write amplification, and tunable protection based on data importance. Our evaluation using LLM inference workloads shows that, even under raw HBM bit error rates up to 10^-3, the system retains over 78% of throughput and 97% of model accuracy compared with systems equipped with ideal error-free HBM. By treating reliability as a tunable system parameter rather than a fixed hardware constraint, our design opens a new path toward low-cost, high-performance HBM deployment in AI infrastructure.
高带宽内存(HBM)为AI工作量提供了特殊的带宽和能源效率,但是其高昂的每部分成本部分地由严格的地面可靠性要求驱动,对可缩放部署构成越来越大的障碍。这项工作探索了一种系统级的降低成本办法,办法是消除现场ECC,并将所有故障管理权转移给记忆控制器。我们引入了一个针对具体域的ECC框架,将大型字词Reed-Solomon~(RS)校正与轻量级微微分CRC检测、差异等值更新以减缓写作放大和基于数据重要性的金枪鱼保护结合起来。我们使用LLM推论的结果表明,即使在原始HBM位误差率最高达10-3美元的情况下,与配备理想无误HBM的系统相比,系统仍保留了78的吞吐量和97的模型精度。我们的设计将可靠性视为一个金枪鱼分系统参数,而不是固定的硬件制约,开辟了一条在AI基础设施中低成本、高效的HBM部署的新道路。
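Two ingredients named in the abstract above, differential parity updates and fine-grained CRC detection, can be sketched in a few lines. As a stated simplification, plain XOR parity stands in for the Reed–Solomon code here (both are linear, which is what makes a differential update possible), and all function names are hypothetical.

```python
import zlib

def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

def full_parity(blocks):
    """Recompute parity from scratch by reading every block in the codeword."""
    parity = bytes(len(blocks[0]))
    for blk in blocks:
        parity = xor_bytes(parity, blk)
    return parity

def differential_parity_update(old_parity, old_block, new_block):
    """Update parity using only the changed block, avoiding a full codeword read."""
    return xor_bytes(old_parity, xor_bytes(old_block, new_block))

def crc_detects_error(stored_crc, block):
    """Lightweight detection: a CRC mismatch flags the block for correction."""
    return zlib.crc32(block) != stored_crc
```

The differential form touches one data block plus the parity, rather than the whole large codeword, which is the mechanism that limits write amplification.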
Article 36
Title@2025-07-03 (4): MARS: Processing-In-Memory Acceleration of Raw Signal Genome Analysis Inside the Storage Subsystem
Title: MARS: Processing-In-Memory Acceleration of Raw Signal Genome Analysis Inside the Storage Subsystem | MARS: Processing-in-Memory Acceleration of Raw Signal Genome Analysis Inside the Storage Subsystem | MARS: 储存子系统内原始信号基因组分析的处理-中间加速 2506.10931v2 |
Authors (11): Melina Soysal, Konstantina Koliogeorgi, Can Firtina, Nika Mansouri Ghiasi, Rakesh Nadig, Haiyu Mao, Geraldo F. Oliveira, Yu Liang, Klea Zambaku, Mohammad Sadrosadati, Onur Mutlu
Raw signal genome analysis (RSGA) has emerged as a promising approach to enable real-time genome analysis by directly analyzing raw electrical signals. However, rapid advancements in sequencing technologies make it increasingly difficult for software-based RSGA to match the throughput of raw signal generation. This paper demonstrates that while hardware acceleration techniques can significantly accelerate RSGA, the high volume of genomic data shifts the performance and energy bottleneck from computation to I/O data movement. As sequencing throughput increases, I/O overhead becomes the main contributor to both runtime and energy consumption. Therefore, there is a need to design a high-performance, energy-efficient system for RSGA that can both alleviate the data movement bottleneck and provide large acceleration capabilities. We propose MARS, a storage-centric system that leverages the heterogeneous resources within modern storage systems (e.g., storage-internal DRAM, storage controller, flash chips) alongside their large storage capacity to tackle both data movement and computational overheads of RSGA in an area-efficient and low-cost manner. MARS accelerates RSGA through a novel hardware/software co-design approach. First, MARS modifies the RSGA pipeline via two filtering mechanisms and a quantization scheme, reducing hardware demands and optimizing for in-storage execution. Second, MARS accelerates the RSGA steps directly within the storage by leveraging both Processing-Near-Memory and Processing-Using-Memory paradigms. Third, MARS orchestrates the execution of all steps to fully exploit in-storage parallelism and minimize data movement. Our evaluation shows that MARS outperforms basecalling-based software and hardware-accelerated state-of-the-art read mapping pipelines by 93x and 40x, on average across different datasets, while reducing their energy consumption by 427x and 72x.
原始信号基因组分析(RSGA)已成为通过直接分析原始电信号实现实时基因组分析的一个很有希望的方法,但测序技术的迅速进展使得基于软件的RSGA越来越难以匹配原始信号生成的输送量。本文表明,虽然硬件加速技术可以大大加速RSGA,但大量的基因组数据可以将性能和能量瓶颈从计算转换到I/O数据移动。随着测序的通过量的增加,I/O间接费用成为运行时间和能源消耗的主要推动者。因此,需要为RSGA设计一个高性能、节能的系统,既能减轻数据流动的瓶颈,又能提供巨大的加速能力。我们提议MARS是一个储存中心系统,它利用现代储存系统(例如储存-内部DRAM、储存控制器、闪存芯片)的多样化资源,同时利用其庞大的储存能力,以地区效率和低成本的方式处理RSGA的数据移动和计算基的第三次消耗量。 IMS通过新型硬件/软件联合设计,加速RSGA的评估,在IMAS运行过程中,通过S IMA系统内部的升级和不断升级的存储系统,在不断优化的存储系统中进行数据流中进行数据流中进行数据流数据流数据流中,通过不断升级的系统,通过不断升级的系统进行数据流数据流数据流数据转换,以降低的系统,以降低其内部的存储和冲压流数据流数据流数据流数据流化的系统,以降低。
Article 37
Title@2025-07-03 (4): AC-Refiner: Efficient Arithmetic Circuit Optimization Using Conditional Diffusion Models
Title: AC-Refiner: Efficient Arithmetic Circuit Optimization Using Conditional Diffusion Models | AC-Refiner: Effiziente Arithmetische Schaltungsoptimierung mit bedingten Diffusionsmodellen | AC-Refineer:使用有条件扩散模型高效亚氏电路优化 2507.02598v1 |
Authors (10): Chenhao Xue, Kezhi Li, Jiaxing Zhang, Yi Ren, Zhengyuan Shi, Chen Zhang, Yibo Lin, Lining Zhang, Qiang Xu, Guangyu Sun
Arithmetic circuits, such as adders and multipliers, are fundamental components of digital systems, directly impacting the performance, power efficiency, and area footprint. However, optimizing these circuits remains challenging due to the vast design space and complex physical constraints. While recent deep learning-based approaches have shown promise, they struggle to consistently explore high-potential design variants, limiting their optimization efficiency. To address this challenge, we propose AC-Refiner, a novel arithmetic circuit optimization framework leveraging conditional diffusion models. Our key insight is to reframe arithmetic circuit synthesis as a conditional image generation task. By carefully conditioning the denoising diffusion process on target quality-of-results (QoRs), AC-Refiner consistently produces high-quality circuit designs. Furthermore, the explored designs are used to fine-tune the diffusion model, which focuses the exploration near the Pareto frontier. Experimental results demonstrate that AC-Refiner generates designs with superior Pareto optimality, outperforming state-of-the-art baselines. The performance gain is further validated by integrating AC-Refiner into practical applications.
诸如添加器和乘数等亚学电路是数字系统的基本组成部分,直接影响到性能、功率和面积足迹。然而,由于设计空间巨大和复杂的物理限制,优化这些电路仍具有挑战性。虽然最近的深层次学习方法已显示出希望,但它们努力不断地探索高潜力设计变体,限制其优化效率。为了应对这一挑战,我们提议AC-Refineer,这是一个利用有条件扩散模型的新型算术回路优化框架。我们的关键见解是将计算回路合成重新设定为一项有条件的图像生成任务。通过将目标质量(QoRs)解密扩散进程小心地调整,AC-Refineer始终生产高质量的电路设计。此外,探索的设计被用于微调扩散模型,该模型的重点是在Pareto前沿的勘探。实验结果表明,AC-Refineer生成了高级Pareto最佳性、优于最新水平的基线。通过将AC-Refineer纳入实际应用,进一步验证了绩效收益。
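Since the abstract above evaluates designs by Pareto optimality, a small sketch of extracting the Pareto frontier from explored designs may help. The (delay, area) tuple layout and the function name are assumptions for illustration; this is not AC-Refiner's code and involves no diffusion model.

```python
def pareto_front(designs):
    """Return the non-dominated (delay, area) points; lower is better for both."""
    front = []
    for d in designs:
        # d is dominated if some other point is no worse in both metrics.
        dominated = any(
            o[0] <= d[0] and o[1] <= d[1] and o != d for o in designs
        )
        if not dominated:
            front.append(d)
    return sorted(set(front))
```

Points on the returned frontier are the natural candidates to feed back as fine-tuning data when the goal is to concentrate exploration near the frontier.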
Article 38
Title@2025-07-03 (4): System-performance and cost modeling of Large Language Model training and inference
Title: System-performance and cost modeling of Large Language Model training and inference | Systemperformance und Kostenmodellierung von Large Language Model Training und Schlussfolgerung | 大语言模式培训和推论的系统业绩和成本模型化 2507.02456v1 |
Authors (7): Wenzhe Guo, Joyjit Kundu, Uras Tos, Weijiang Kong, Giuliano Sisto, Timon Evenblij, Manu Perumkunnil
Large language models (LLMs), based on transformer architectures, have revolutionized numerous domains within artificial intelligence, science, and engineering due to their exceptional scalability and adaptability. However, the exponential growth in LLM size and complexity has outpaced advancements in compute capacity, memory bandwidth, network performance, and cost efficiency, posing significant challenges to their scalability on distributed systems. To address these limitations, alternative model architectures, optimization strategies, communication-aware network topologies, and novel system design approaches have been proposed in the literature. This paper introduces a performance-cost modeling methodology for LLM training and inference that integrates state-of-the-art compute techniques with memory optimizations and the latest communication techniques. Building on an analytical performance model, our approach incorporates recent innovations such as the flash attention technique and mixture of experts models to address the memory bandwidth and compute bottlenecks. It also considers the impact of different network topologies and topology-specific communication algorithms with 5D parallelism. The framework also integrates a chiplet cost model. The proposed modeling methodology provides valuable insights to guide future compute system design and facilitates hardware-software co-development, in particular due to its ability to analyze performance-cost trade-offs for various system architectural configurations.
以变压器结构为基础的大型语言模型(LLMS),由于人造智能、科学和工程的可扩缩性和适应性不同,使人造智能、科学和工程领域的众多领域发生了革命性革命性。然而,LLM规模和复杂性的指数增长超过了计算能力、记忆带宽、网络性能和成本效益方面的进步,对分布式系统的可扩缩性提出了重大挑战。为克服这些局限性,在文献中提出了替代模型结构、优化战略、通信-有意识网络地形和新型系统设计方法。本文件还介绍了LLM培训和推论的绩效成本模型方法,将最新计算技术与记忆优化和最新通信技术相结合。在分析性能模型的基础上,我们的方法结合了近期的创新,例如快速关注技术和专家模型的组合,以解决记忆带宽度和压缩瓶颈问题。还考虑了不同网络地形和具有5D平行特征的顶层通信算法的影响。框架还结合了芯片成本模型。拟议的模型提供了宝贵的洞察力,指导未来系统配置系统设计,并便利了各种硬件软件的配置能力,从而进行适当的分析。
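A minimal roofline-style sketch shows the kind of compute-versus-memory-bandwidth trade-off that an analytical performance-cost model has to capture. It is far simpler than the methodology described above (no communication, parallelism, or chiplet cost), and every name and parameter is a hypothetical placeholder.

```python
def step_time(flops, bytes_moved, peak_flops, mem_bandwidth):
    """Return (seconds, limiting resource) for one kernel or training step.

    Assumes compute and memory traffic overlap perfectly, so the slower
    of the two determines the execution time.
    """
    compute_s = flops / peak_flops
    memory_s = bytes_moved / mem_bandwidth
    bound = "compute" if compute_s >= memory_s else "memory"
    return max(compute_s, memory_s), bound

def cost_per_step(time_s, dollars_per_hour):
    """Convert execution time into a monetary cost at a given hourly rate."""
    return time_s * dollars_per_hour / 3600.0
```

Sweeping `peak_flops` and `mem_bandwidth` for a fixed workload shows where additional compute stops paying off because the step has become memory-bound.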
Article 39
Title@2025-07-03 (4): DecoRTL: A Run-time Decoding Framework for RTL Code Generation with LLMs
Title: DecoRTL: A Run-time Decoding Framework for RTL Code Generation with LLMs | DecoRTL: Ein Laufzeit-Decoding-Framework für RTL-Code-Generierung mit LLMs | DecoRTL: 使用LLMs的RTL代码生成运行时间解码框架 2507.02226v1 |
Authors (3): Mohammad Akyash, Kimia Azar, Hadi Kamali
As one of their many applications, large language models (LLMs) have recently shown promise in automating register transfer level (RTL) code generation. However, conventional LLM decoding strategies, originally designed for natural language, often fail to meet the structural and semantic demands of RTL, leading to hallucinated, repetitive, or invalid code outputs. In this paper, we first investigate the root causes of these decoding failures through an empirical analysis of token-level entropy during RTL generation. Our findings reveal that LLMs exhibit low confidence in regions of structural ambiguity or semantic complexity, showing that standard decoding strategies fail to differentiate between regions requiring determinism (syntax-critical regions) and those that benefit from creative exploratory variability (design-critical regions). Then, to overcome this, we introduce DecoRTL, a novel run-time decoding strategy, that is both syntax-aware and contrastive for RTL code generation. DecoRTL integrates two complementary components: (i) self-consistency sampling, which generates multiple candidates and re-ranks them based on token-level agreement to promote correctness while maintaining diversity; and (ii) syntax-aware temperature adaptation, which classifies tokens by their syntactical and functional roles and adjusts the sampling temperature accordingly, enforcing low temperature for syntax-critical tokens and higher temperature for exploratory ones. Our approach operates entirely at inference time without requiring any additional model fine-tuning. Through evaluations on multiple open-source LLMs using the VerilogEval benchmark, we demonstrate significant improvements in syntactic validity, functional correctness, and output diversity, while the execution overhead (performance overhead) is imperceptible.
作为许多应用之一,大型语言模型(LLMS)最近显示,在注册传输水平(RTL)代码生成自动化方面,大型语言模型(LLM)最近显示出了前景;然而,原本为自然语言设计的常规LLM解码战略往往无法满足RTL的结构和语义要求,导致幻灭、重复或无效代码输出。在本文中,我们首先通过对RTL生成过程中的代币性激素进行实验性分析来调查这些解码失败的根源。我们的调查结果显示,LLMS在结构模糊或语义复杂度区域中表现出了多度的可靠性,表明标准解码战略未能区分需要确定性(合成关键区域)的区域和那些受益于创造性探索性变异(设计关键区域)的区域。为了克服这些差异,我们引入了DecoRTL(一种新型运行时间解码战略),这是对RTL生成的代币种性调和对比性的。
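The syntax-aware temperature adaptation described above can be sketched as temperature-scaled sampling where the temperature depends on a token class. The class labels, the temperature values, and the way a class is supplied are assumptions for illustration; the abstract does not specify how DecoRTL classifies tokens or which temperatures it uses.

```python
import math
import random

# Hypothetical temperatures: low for syntax-critical tokens, higher for exploratory ones.
TEMPERATURES = {"syntax": 0.2, "design": 1.0}

def pick_temperature(token_class):
    """Look up the sampling temperature for a token class (default is an assumption)."""
    return TEMPERATURES.get(token_class, 0.7)

def softmax_with_temperature(logits, temperature):
    """Numerically stable softmax over logits divided by the temperature."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, token_class, rng):
    """Sample a token index using the class-dependent temperature."""
    probs = softmax_with_temperature(logits, pick_temperature(token_class))
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

A call such as `sample_next([2.0, 1.0, 0.0], "syntax", random.Random(0))` draws from a sharply peaked distribution, while the `"design"` class leaves more probability on alternatives.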