cs.AR @ 2025-08-01: 034
-
00 07-31 (4) Preprocessing Methods for Memristive Reservoir Computing for Image Recognition Vorverarbeitungsverfahren für Memoristive Reservoir Computing für die Bilderkennung 用于识别图像的中间储量计算预处理方法 2506.05588v3 -
01 07-31 Hardware-Aware Fine-Tuning of Spiking Q-Networks on the SpiNNaker2 Neuromorphic Platform Hardware-Aware Feintuning von Spiking Q-Netzwerken auf der SpiNNaker2 Neuromorphic Platform SpinNNNAK2 神经变形平台SpiNNAKK QNetwork 的硬件- 硬件- 软件精密配置 2507.23562v1 -
02 07-31 TrIM, Triangular Input Movement Systolic Array for Convolutional Neural Networks: Dataflow and Analytical Modelling TrIM, Dreieckige Eingabebewegung Systolisches Array für konvolutionäre neurale Netzwerke: Datenfluss und analytische Modellierung TrIM, 用于进进进神经网络的三角输入运动轮流阵列: 数据流和分析模型 2408.01254v3 -
03 07-31 Smart Video Capsule Endoscopy: Raw Image-Based Localization for Enhanced GI Tract Investigation Smart Video Capsule Endoskopie: Raw Image-basierte Lokalisierung für verbesserte GI Tract Untersuchung 智能视频胶囊内窥视:用于强化GITract调查的原始基于图像的本地化 2507.23398v1 -
04 07-30 (3) KLLM: Fast LLM Inference with K-Means Quantization KLLM: Schnelle LLM-Inferenz mit K-Means-Quantisierung KLLM: 快速LLM 与 K- Means 量化的推论 2507.23035v1 -
05 07-30 No Redundancy, No Stall: Lightweight Streaming 3D Gaussian Splatting for Real-time Rendering Keine Redundanz, kein Stall: Leichtes Streaming 3D Gaussian Splatting für Echtzeit-Rendering 无冗余,无拖延:轻量级流 3D Gaussian Splating 用于实时招标 2507.21572v2 -
06 07-29 (2) A Customized Memory-aware Architecture for Biological Sequence Alignment Eine maßgeschneiderte, speicherbewusste Architektur für die biologische Sequenzausrichtung 用于生物序列对齐的自定义内存意识架构 2507.22221v1 -
07 07-29 A Multi-Agent Generative AI Framework for IC Module-Level Verification Automation Multi-Agent Generatives KI-Framework für die IC-Modul-Level-Verifikationsautomatisierung IC 模块级核查自动化多机构生成AI框架 2507.21694v1 -
08 07-29 SLTarch: Towards Scalable Point-Based Neural Rendering by Taming Workload Imbalance and Memory Irregularity SLTarch: Auf dem Weg zu einer skalierbaren point-based Neural Rendering durch Taming Workload Unbalance und Memory Irregularity SLSTARK:通过修补工作负荷的不平衡和记忆不规则性,走向可缩放的点基心神经成形 2507.21499v1 -
09 07-29 Automated HEMT Model Construction from Datasheets via Multi-Modal Intelligence and Prior-Knowledge-Free Optimization Automatisierte HEMT-Modellkonstruktion aus Datenblättern über Multi-Modal-Intelligenz und vorherige wissensfreie Optimierung 通过多模式情报和事先知识自由优化,从数据表格中自动建立电子内容管理模型 2507.21430v1 -
10 07-28 (1) FastMamba: A High-Speed and Efficient Mamba Accelerator on FPGA with Accurate Quantization FastMamba: Hochgeschwindigkeits- und Effizienter Mamba-Beschleuniger auf FPGA mit präziser Quantisierung FPGA 快速Mamba:一个高速度、高效的 Mamba 加速器,使用准确量化的 FPGA 加速器 2505.18975v4 -
11 07-28 LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference LUT Tensor Core: Ein Software-Hardware-Co-Design für LUT-basierte Low-Bit LLM-Inferenz LUT 信标核心:基于 LUT的低比低 LLM 推断的软件-硬件共同设计 2408.06003v3 -
12 07-27 (7) Demystifying the 7-D Convolution Loop Nest for Data and Instruction Streaming in Reconfigurable AI Accelerators Entmystifizierung des 7-D-Konvolutionsschleifen-Nests für Daten- und Instruktionsstreaming in rekonfigurierbaren KI-Beschleunigern 解开重新配置 AI 加速器中数据和指示流7D 革命圈点的神秘化 2507.20420v1 -
13 07-27 RoCE BALBOA: Service-enhanced Data Center RDMA for SmartNICs RoCE BALBOA: Service-verstärktes Rechenzentrum RDMA für SmartNICs RoCE BALBOA:增强服务数据中心 2507.20412v1 -
14 07-27 ACCESS-AV: Adaptive Communication-Computation Codesign for Sustainable Autonomous Vehicle Localization in Smart Factories ACCESS-AV: Adaptive Kommunikations-Computation Codesign für nachhaltige autonome Fahrzeuglokalisierung in Smart Factories ACCES-AV: 智能工厂可持续自主车辆本地化适应性通信-计算代码符号 2507.20399v1 -
15 07-26 (6) AxOSyn: An Open-source Framework for Synthesizing Novel Approximate Arithmetic Operators AxOSyn: Ein Open-Source-Framework zur Synthese von Romanen Ungefähre Arithmetische Operatoren AxOSyn: 一个用于合成新奇近光亚学天文操作员的开放源码框架 2507.20007v1 -
16 07-26 A Scalable Resource Management Layer for FPGA SoCs in 6G Radio Units Eine skalierbare Ressourcenverwaltungsschicht für FPGA-SoCs in 6G-Funkgeräten 6G无线电台FPGA SoCs可缩放资源管理层 2507.19963v1 -
17 07-26 ChipletPart: Scalable Cost-Aware Partitioning for 2.5D Systems ChipletPart: Skalierbare Kosten-Bewusst-Partitionierung für 2.5D-Systeme ChipletPart: 2.5D 系统可缩放成本软件分割 2507.19819v1 -
18 07-26 Smaller, Faster, Cheaper: Architectural Designs for Efficient Machine Learning Kleiner, schneller, billiger: Architekturdesigns für effizientes maschinelles Lernen 更小、更快、更便宜:高效机械学习的建筑设计 2507.19795v1 -
19 07-26 LaMAGIC2: Advanced Circuit Formulations for Language Model-Based Analog Topology Generation LaMAGIC2: Erweiterte Schaltungsformulierungen für sprachmodellbasierte analoge Topologie-Generierung LaMAGIC2:语言模拟模拟模拟地形生成的先进电路配制 2506.10235v2 -
20 07-25 (5) MCP4EDA: LLM-Powered Model Context Protocol RTL-to-GDSII Automation with Backend Aware Synthesis Optimization MCP4EDA: LLM-Powered Model Context Protocol RTL-to-GDSII Automation mit Backend Aware Syntheseoptimierung MCP4EDA: LLM 授权示范背景议定书RTL-GDSII 2507.19570v1 -
21 07-25 A3D-MoE: Acceleration of Large Language Models with Mixture of Experts via 3D Heterogeneous Integration A3D-MoE: Beschleunigung großer Sprachmodelle mit Expertenmix über 3D Heterogene Integration A3D-MOE:通过3D异变融合加速采用专家混合型大语言模式 2507.19142v1 -
22 07-25 3DGauCIM: Accelerating Static/Dynamic 3D Gaussian Splatting via Digital CIM for High Frame Rate Real-Time Edge Rendering 3DGauCIM: Statische/Dynamische 3D Gaussian Splatting über Digital CIM beschleunigen für hohe Framerate Echtzeit Edge Rendering 3DGauCIM:通过数字 CIM加速静电/动态 3D Gaussian Splating, 用于高框架速实时边缘下调 2507.19133v1 -
23 07-25 Stella Nera: A Differentiable Maddness-Based Hardware Accelerator for Efficient Approximate Matrix Multiplication Stella Nera: Ein differenzierter Maddness-basierter Hardware-Beschleuniger für eine effiziente, annähernde Matrix-Multiplikation Stella Nera: 高效近光矩阵乘法的有区别的基于 Maddness 的硬件加速器 2311.10207v2 -
24 07-25 GENIAL: Generative Design Space Exploration via Network Inversion for Low Power Algorithmic Logic Units GENIAL: Generative Design Space Exploration über Netzwerk-Inversion für stromarme algorithmische Logische Einheiten GENIAL:通过网络转换生成设计空间探索,用于低功率测算仪 2507.18989v1 -
25 07-25 RailX: A Flexible, Scalable, and Low-Cost Network Architecture for Hyper-Scale LLM Training Systems RailX: Eine flexible, skalierbare und kostenarme Netzwerkarchitektur für Hyper-Scale LLM Trainingssysteme RailX:超大型有限LM培训系统灵活、可缩放和低成本网络架构 2507.18889v1 -
26 07-25 GCC: A 3DGS Inference Architecture with Gaussian-Wise and Cross-Stage Conditional Processing GCC: Eine 3DGS-Inferenzarchitektur mit Gaußian-Wise- und Cross-Stage-Bedingung 海合会:3DGS推理结构,带有高西-怀斯和跨标准条件处理 2507.15300v3 -
27 07-24 (4) PRACtical: Subarray-Level Counter Update and Bank-Level Recovery Isolation for Efficient PRAC Rowhammer Mitigation PRACtical: Subarray-Level Counter Update und Bank-Level Recovery Isolation für effiziente PRAC Rowhammer Mitigation PRAC 有效缓解减贫和减贫方案下级反反更新和银行级复苏孤立措施 2507.18581v1 -
28 07-24 DiP: A Scalable, Energy-Efficient Systolic Array for Matrix Multiplication Acceleration DiP: Ein skalierbarer, energieeffizienter Systolischer Array für Matrix-Multiplikationsbeschleunigung DiP:一个可缩放的、节能的、用于加速矩阵乘法加速的节能收缩阵列阵列 2412.09709v3 -
29 07-24 GNN-ACLP: Graph Neural Networks Based Analog Circuit Link Prediction GNN-ACLP: Graph Neural Networks Based Analog Circuit Link Prediction GNN-ALLP:基于模拟电路链接预测的图表神经网络 2504.10240v4 -
30 07-24 A 55-nm SRAM Chip Scanning Errors Every 125 ns for Event-Wise Soft Error Measurement Ein 55-nm-SRAM-Chip-Scannfehler alle 125 ns für Event-Wise-Soft-Error-Messung A 55-nm SRAM 芯片扫描错误 2504.08305v2 -
31 07-24 Explicit Sign-Magnitude Encoders Enable Power-Efficient Multipliers Explizite Zeichen-Magnituden-Encoder aktivieren leistungsfähige Multiplikatoren 显性信号- 数学编码器启用功率功率乘数器 2507.18179v1 -
32 07-24 Real-Time Object Detection and Classification using YOLO for Edge FPGAs Echtzeit-Objekterkennung und -Klassifizierung mit YOLO für Edge-FPGAs 实时物体探测和分类,用YOLO对边缘的FPGAs进行实时物体探测和分类 2507.18174v1 -
33 07-24 Designing High-Performance and Thermally Feasible Multi-Chiplet Architectures enabled by Non-bendable Glass Interposer Designing High-Performance und Thermisch Machbare Multi-Chiplet-Architekturen durch nicht-biegbare Glasinterposer ermöglicht 设计高性能和热能多芯结构,由不可移植的玻璃干涉器启用 2507.18040v1
Article 0
Title@2025-07-31 (4): Preprocessing Methods for Memristive Reservoir Computing for Image Recognition
Title: Preprocessing Methods for Memristive Reservoir Computing for Image Recognition | Vorverarbeitungsverfahren für Memoristive Reservoir Computing für die Bilderkennung | 用于识别图像的中间储量计算预处理方法 2506.05588v3 |
Authors (5): Rishona Daniels, Duna Wattad, Ronny Ronen, David Saad, Shahar Kvatinsky
Reservoir computing (RC) has attracted attention as an efficient recurrent neural network architecture due to its simplified training, requiring only its last perceptron readout layer to be trained. When implemented with memristors, RC systems benefit from their dynamic properties, which make them ideal for reservoir construction. However, achieving high performance in memristor-based RC remains challenging, as it critically depends on the input preprocessing method and reservoir size. Despite growing interest, a comprehensive evaluation that quantifies the impact of these factors is still lacking. This paper systematically compares various preprocessing methods for memristive RC systems, assessing their effects on accuracy and energy consumption. We also propose a parity-based preprocessing method that improves accuracy by 2-6% while requiring only a modest increase in device count compared to other methods. Our findings highlight the importance of informed preprocessing strategies to improve the efficiency and scalability of memristive RC systems.
储量计算(RC)作为一个高效的经常性神经网络结构吸引了人们的注意,因为其培训简化了,只要求其最后的过敏读出层接受培训。在与存储器一起实施时,RC系统受益于其动态特性,这些特性使得它们成为储油层建造的理想条件。然而,在以memristor为基础的RC实现高性能仍具有挑战性,因为它关键取决于输入预处理方法和储油层大小。尽管人们日益感兴趣,但仍然缺乏全面评价,以量化这些因素的影响。本文系统地比较了存储器系统的各种预处理方法,评估其对精度和能源消耗的影响。我们还提出了一个基于对等的预处理方法,即提高精度2-6%,而仅要求与其他方法相比,设备计数略有增加。我们的调查结果强调了知情的预处理战略对于提高存储器系统的效率和可扩展性的重要性。
Article 1
Title@2025-07-31 (4): Hardware-Aware Fine-Tuning of Spiking Q-Networks on the SpiNNaker2 Neuromorphic Platform
Title: Hardware-Aware Fine-Tuning of Spiking Q-Networks on the SpiNNaker2 Neuromorphic Platform | Hardware-Aware Feintuning von Spiking Q-Netzwerken auf der SpiNNaker2 Neuromorphic Platform | SpinNNNAK2 神经变形平台SpiNNAKK QNetwork 的硬件- 硬件- 软件精密配置 2507.23562v1 |
Authors (3): Sirine Arfa, Bernhard Vogginger, Christian Mayr
Spiking Neural Networks (SNNs) promise orders-of-magnitude lower power consumption and low-latency inference on neuromorphic hardware for a wide range of robotic tasks. In this work, we present an energy-efficient implementation of a reinforcement learning (RL) algorithm using quantized SNNs to solve two classical control tasks. The network is trained using the Q-learning algorithm, then fine-tuned and quantized to low-bit (8-bit) precision for embedded deployment on the SpiNNaker2 neuromorphic chip. To evaluate the comparative advantage of SpiNNaker2 over conventional computing platforms, we analyze inference latency, dynamic power consumption, and energy cost per inference for our SNN models, comparing performance against a GTX 1650 GPU baseline. Our results demonstrate SpiNNaker2’s strong potential for scalable, low-energy neuromorphic computing, achieving up to 32x reduction in energy consumption. Inference latency remains on par with GPU-based execution, with improvements observed in certain task settings, reinforcing SpiNNaker2’s viability for real-time neuromorphic control and making the neuromorphic approach a compelling direction for efficient deep Q-learning.
Spik NealNetworks(SNNS)承诺对神经形态硬件进行高压低电耗和低长推导,用于一系列广泛的机器人任务。在这项工作中,我们展示了使用四分制 SNNS 解决两个古典控制任务的强化学习(RL)算法的节能应用。该网络使用Q-学习算法进行了培训,然后对SpinnNAker2神经形态芯片的嵌入部署精确度进行了微调(8-位)微调和量化。为了评估SpinNNAker2相对于常规计算平台的比较优势,我们分析了SnNNE2模型的推断力、动态电耗和能源成本,对照GTX 1650 GPU基准对性能进行了比较。我们的结果表明SpinNNARC2在可缩放电、低能量神经形态计算方面的巨大潜力,达到32x能源消耗量的减少量。推导力仍与基于GPU的操作法相近,在某些任务环境中观察到的改进,加强了Spinne-national lacer 方法,加强了Spal-stable-stregregregrostal laction Qs for the rest for the recialction aprevental
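The abstract for Article 1 states that the trained network is quantized to 8-bit precision but does not say which scheme is used. As a minimal sketch only, assuming symmetric per-tensor quantization (an assumption, not the authors' stated method), the mapping from float weights to signed 8-bit integers could look like this. The function names and the sample weights are hypothetical.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization to the signed range [-127, 127].

    Assumed scheme for illustration; the abstract only says "8-bit".
    """
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the 8-bit range.
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to approximate float values."""
    return [v * scale for v in q]


# Hypothetical weights of one layer.
weights = [0.5, -1.27, 0.0, 0.64]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, scale, restored)
```

With this scheme the reconstruction error per weight is bounded by half a quantization step, which is why fine-tuning after quantization is often enough to recover accuracy.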
Article 2
Title@2025-07-31 (4): TrIM, Triangular Input Movement Systolic Array for Convolutional Neural Networks: Dataflow and Analytical Modelling
Title: TrIM, Triangular Input Movement Systolic Array for Convolutional Neural Networks: Dataflow and Analytical Modelling | TrIM, Dreieckige Eingabebewegung Systolisches Array für konvolutionäre neurale Netzwerke: Datenfluss und analytische Modellierung | TrIM, 用于进进进神经网络的三角输入运动轮流阵列: 数据流和分析模型 2408.01254v3 |
Authors (3): Cristian Sestito, Shady Agwa, Themis Prodromakis
In order to follow the ever-growing computational complexity and data intensity of state-of-the-art AI models, new computing paradigms are being proposed. These paradigms aim at achieving high energy efficiency by mitigating the Von Neumann bottleneck that relates to the energy cost of moving data between the processing cores and the memory. Convolutional Neural Networks (CNNs) are susceptible to this bottleneck, given the massive data they have to manage. Systolic arrays (SAs) are promising architectures to mitigate data transmission cost, thanks to high data utilization of Processing Elements (PEs). These PEs continuously exchange and process data locally based on specific dataflows (such as weight stationary and row stationary), in turn reducing the number of memory accesses to the main memory. In SAs, convolutions are managed either as matrix multiplications or exploiting the raster-order scan of sliding windows. However, data redundancy is a primary concern affecting area, power, and energy. In this paper, we propose TrIM: a novel dataflow for SAs based on a Triangular Input Movement and compatible with CNN computing. TrIM maximizes the local input utilization, minimizes the weight data movement, and solves the data redundancy problem. Furthermore, TrIM does not incur the significant on-chip memory penalty introduced by the row stationary dataflow. When compared to state-of-the-art SA dataflows, the high data utilization offered by TrIM guarantees ~10X less memory access. Furthermore, considering that PEs continuously overlap multiplications and accumulations, TrIM achieves high throughput (up to 81.8% higher than row stationary), other than requiring a limited number of registers (up to 15.6X fewer registers than row stationary).
为了跟踪最新AI模型不断增长的计算复杂性和数据密度,正在提出新的计算模式。这些模式的目的是通过减少Von Neumann瓶颈,实现高能效,因为Von Neumann瓶颈与处理核心和记忆中数据移动的能量成本有关。进化神经网络(CNNs)容易受到这一瓶颈的影响,因为需要管理大量数据。Systolic 数组(SAs)是降低数据传输成本的有希望的结构,因为处理要素(PE)的数据利用率很高。这些PE基于特定数据流(例如重量固定和行定点)不断在本地交换和处理数据效率高能效。在SAls, 变动要么是矩阵变倍化,要么是利用滑动窗口的光序扫描。然而,数据冗余是影响区域、电力和能源的一个主要问题。在本文中,我们提议TrIM:以三角轴输入要素的内存数据流为基础,不是将TIM-行内存数据流最小化,而是将TIM的内存数据最大化到数据使用率数据。Sal-deal Modal-deal Moveal Movemental Movemental 需要通过高数据。Sal IM IM IM 需要通过高数据存储数据存储数据存储数据流, 进行最慢化的内化的内存取取取数据。
Article 3
Title@2025-07-31 (4): Smart Video Capsule Endoscopy: Raw Image-Based Localization for Enhanced GI Tract Investigation
Title: Smart Video Capsule Endoscopy: Raw Image-Based Localization for Enhanced GI Tract Investigation | Smart Video Capsule Endoskopie: Raw Image-basierte Lokalisierung für verbesserte GI Tract Untersuchung | 智能视频胶囊内窥视:用于强化GITract调查的原始基于图像的本地化 2507.23398v1 |
Authors (4): Oliver Bause, Julia Werner, Paul Palomero Bernardo, Oliver Bringmann
For many real-world applications involving low-power sensor edge devices, deep neural networks used for image classification might not be suitable. This is due to their typically large model size and requirement of operations, which often exceed the capabilities of such resource-limited devices. Furthermore, camera sensors usually capture images with a Bayer color filter applied, which are subsequently converted to RGB images that are commonly used for neural network training. However, on resource-constrained devices, such conversions demand their share of energy and should ideally be skipped if possible. This work addresses the need for hardware-suitable AI targeting sensor edge devices by means of Video Capsule Endoscopy, an important medical procedure for the investigation of the small intestine, which is strongly limited by its battery lifetime. Accurate organ classification is performed with a final accuracy of 93.06% evaluated directly on Bayer images, involving a CNN with only 63,000 parameters and time-series analysis in the form of Viterbi decoding. Finally, the process of capturing images with a camera and raw image processing is demonstrated with a customized PULPissimo System-on-Chip with a RISC-V core and an ultra-low power hardware accelerator providing an energy-efficient AI-based image classification approach requiring just 5.31 µJ per image. As a result, it is possible to save an average of 89.9% of energy before entering the small intestine compared to classic video capsules.
对于许多用于图像分类的低能传感器边缘装置深神经网络的实际应用来说,这种转换可能不合适。这是因为其典型的模型规模巨大,需要操作时量往往超过这种资源液化设备的能力。此外,摄像传感器通常用贝氏色过滤器捕获图像,然后将其转换成用于神经网络培训的RGB图像。然而,在资源限制的装置上,这种转换要求其份额的能量和尽可能最优的方式加以跳动。这项工作满足了通过视频胶囊(Endoscopy)对传感器边缘装置进行硬件易行用的AI的需要。视频胶囊(Endoscopy)是调查小肠的重要医学前置值,其电池寿命极受限制。对Bayer图像进行的最后精确度评价为93.06%,而CNN仅以63,000个参数为基础进行时间序列分析,并以Vterbi解码形式进行。最后,通过摄像机和原始图像处理用89个小图像的过程,这是用于调查小型肠脏的重要医学前的精度,因为其电池寿命极短度受电压的图像进入了一台智能系统,因此需要一台SIPLILILIC的精度。
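The abstract for Article 3 combines per-frame CNN classification with time-series analysis in the form of Viterbi decoding. The sketch below illustrates how Viterbi decoding can smooth noisy per-frame predictions. The three states, the forward-only transition matrix and all probabilities are hypothetical values chosen for illustration and are not taken from the paper.

```python
import math


def _log(p):
    """Logarithm that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(emissions, trans, start):
    """Return the most likely state sequence.

    emissions[t][s]: classifier probability of state s at frame t
    trans[i][j]:     probability of moving from state i to state j
    start[s]:        probability of starting in state s
    """
    n = len(start)
    score = [_log(start[s]) + _log(emissions[0][s]) for s in range(n)]
    back = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + _log(trans[i][j]))
            new_score.append(score[best_i] + _log(trans[best_i][j])
                             + _log(emissions[t][j]))
            ptr.append(best_i)
        score = new_score
        back.append(ptr)
    # Trace back from the best final state.
    state = max(range(n), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return path[::-1]


# Hypothetical setup: states 0, 1, 2 are consecutive GI tract sections and
# the capsule can only stay in a section or move forward to the next one.
trans = [[0.9, 0.1, 0.0],
         [0.0, 0.9, 0.1],
         [0.0, 0.0, 1.0]]
start = [1.0, 0.0, 0.0]
emissions = [[0.80, 0.10, 0.10],
             [0.70, 0.20, 0.10],
             [0.20, 0.70, 0.10],
             [0.60, 0.30, 0.10],   # noisy frame: the classifier falls back to state 0
             [0.10, 0.80, 0.10],
             [0.05, 0.05, 0.90]]

naive = [max(range(3), key=lambda s: frame[s]) for frame in emissions]
decoded = viterbi(emissions, trans, start)
print(naive)    # [0, 0, 1, 0, 1, 2]
print(decoded)  # [0, 0, 1, 1, 1, 2]
```

The per-frame argmax jumps backwards at the noisy frame, whereas the decoded path respects the forward-only transition structure.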
Article 4
Title@2025-07-30 (3): KLLM: Fast LLM Inference with K-Means Quantization
Title: KLLM: Fast LLM Inference with K-Means Quantization | KLLM: Schnelle LLM-Inferenz mit K-Means-Quantisierung | KLLM: 快速LLM 与 K- Means 量化的推论 2507.23035v1 |
Authors (7): Xueying Wu, Baijun Zhou, Zhihui Gao, Yuzhe Fu, Qilin Zheng, Yintao He, Hai Li
Large language model (LLM) inference poses significant challenges due to its intensive memory and computation demands. Weight and activation quantization (WAQ) offers a promising solution by reducing both memory footprint and arithmetic complexity. However, two key challenges remain in the existing WAQ designs. (1) Traditional WAQ designs rely on uniform integer-based quantization for hardware efficiency, but this often results in significant accuracy degradation at low precision. K-Means-based quantization, a non-uniform quantization technique, achieves higher accuracy by matching the Gaussian-like distributions of weights and activations in LLMs. However, its non-uniform nature prevents direct execution on low-precision compute units, requiring dequantization and floating-point matrix multiplications (MatMuls) during inference. (2) Activation outliers further hinder effective low-precision WAQ. Offline thresholding methods for outlier detection can lead to significant model performance degradation, while existing online detection techniques introduce substantial runtime overhead. To address the aforementioned challenges and fully unleash the potential of WAQ with K-Means quantization for LLM inference, in this paper, we propose KLLM, a hardware-software co-design framework. KLLM features an index-based computation scheme for efficient execution of MatMuls and nonlinear operations on K-Means-quantized data, which avoids most of the dequantization and full-precision computations. Moreover, KLLM incorporates a novel outlier detection engine, Orizuru, that efficiently identifies the top-$k$ largest and smallest elements in the activation data stream during online inference. Extensive experiments show that, on average, KLLM achieves speedups of 9.67x, 7.03x and energy efficiency improvements of 229.50x, 150.21x compared to the A100 GPU and Atom, respectively.
大型语言模型(LLM)的推论因其密集的内存和计算需求而构成重大挑战。 重力和激活四分制(WAQ)通过减少记忆足迹和算术复杂性提供了一个大有希望的解决方案。 然而,现有的WAQ设计中仍存在两个关键挑战。 (1) 传统的WAQ设计依靠统一的整数量化来提高硬件效率,但通常导致低精度的精确度大幅下降。 K-Means基四分制(一种非单式四分制技术),通过匹配像高斯一样的重量分布和LLMM的激活(WAQQ)。 然而,它的不统一性质使得无法直接执行低精确度的计算单位。 (1) 传统的WAQQQQQQ 设计依靠统一的整分级整分级量化,但是这往往导致低精度有效低精度的WAQQ。 K-MLUIQ的离值临界分级分级测试方法可以导致显著的模型性能退化,而现有的在线检测技术可以大量运行。 为了应对上述挑战,并且完全释放了K-MA公司最低的操作,K-MLULLLLL的中的一项不精度操作,在K-MA的中可以显示一个不精度框架。
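The abstract for Article 4 describes an index-based computation scheme that runs MatMuls on K-Means-quantized data while avoiding most dequantization. The paper's actual scheme is not given in the abstract, so the following is only a sketch of one plausible idea: quantize weights and activations to codebook indices, count how often each index pair occurs, and multiply each count by a precomputed centroid product. All function names and data are hypothetical.

```python
def kmeans_1d(values, k, iters=20):
    """Plain Lloyd's algorithm on scalars with deterministic initialization."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * (i + 0.5) / k for i in range(k)]

    def nearest(v):
        return min(range(k), key=lambda c: abs(v - centroids[c]))

    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            buckets[nearest(v)].append(v)
        # Keep the old centroid when a cluster receives no points.
        centroids = [sum(b) / len(b) if b else centroids[i]
                     for i, b in enumerate(buckets)]
    return centroids, [nearest(v) for v in values]


def index_dot(w_idx, a_idx, table):
    """Dot product on indices: count index pairs, then weight by the table."""
    counts = {}
    for pair in zip(w_idx, a_idx):
        counts[pair] = counts.get(pair, 0) + 1
    return sum(n * table[i][j] for (i, j), n in counts.items())


# Hypothetical weight and activation vectors.
weights = [-0.90, -0.80, -0.10, 0.00, 0.10, 0.85, 0.90, 0.95]
acts = [0.20, 1.90, 0.10, 2.10, 0.30, 0.25, 2.00, 0.15]

cw, w_idx = kmeans_1d(weights, k=4)
ca, a_idx = kmeans_1d(acts, k=2)

# Product table of centroid pairs: k_w * k_a multiplications, done once.
table = [[w * a for a in ca] for w in cw]

result = index_dot(w_idx, a_idx, table)
reference = sum(cw[i] * ca[j] for i, j in zip(w_idx, a_idx))
print(result, reference)
```

The index-based result matches the dequantize-then-multiply reference up to floating-point rounding, while the inner loop only touches small integer indices.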
Article 5
Title@2025-07-30 (3): No Redundancy, No Stall: Lightweight Streaming 3D Gaussian Splatting for Real-time Rendering
Title: No Redundancy, No Stall: Lightweight Streaming 3D Gaussian Splatting for Real-time Rendering | Keine Redundanz, kein Stall: Leichtes Streaming 3D Gaussian Splatting für Echtzeit-Rendering | 无冗余,无拖延:轻量级流 3D Gaussian Splating 用于实时招标 2507.21572v2 |
Authors (6): Linye Wei, Jiajun Tang, Fan Fei, Boxin Shi, Runsheng Wang, Meng Li
3D Gaussian Splatting (3DGS) enables high-quality rendering of 3D scenes and is getting increasing adoption in domains like autonomous driving and embodied intelligence. However, 3DGS still faces major efficiency challenges when faced with high frame rate requirements and resource-constrained edge deployment. To enable efficient 3DGS, in this paper, we propose LS-Gaussian, an algorithm/hardware co-design framework for lightweight streaming 3D rendering. LS-Gaussian is motivated by the core observation that 3DGS suffers from substantial computation redundancy and stalls. On one hand, in practical scenarios, high-frame-rate 3DGS is often applied in settings where a camera observes and renders the same scene continuously but from slightly different viewpoints. Therefore, instead of rendering each frame separately, LS-Gaussian proposes a viewpoint transformation algorithm that leverages inter-frame continuity for efficient sparse rendering. On the other hand, as different tiles within an image are rendered in parallel but have imbalanced workloads, frequent hardware stalls also slow down the rendering process. LS-Gaussian predicts the workload for each tile based on viewpoint transformation to enable more balanced parallel computation and co-designs a customized 3DGS accelerator to support the workload-aware mapping in real-time. Experimental results demonstrate that LS-Gaussian achieves 5.41x speedup over the edge GPU baseline on average and up to 17.3x speedup with the customized accelerator, while incurring only minimal visual quality degradation.
3D Gausian Splateting (3DDGS) 能够高质量地展示3D场景,并在自动驱动和内装智能等领域得到越来越多的采用。 但是, 3DDS 在面临高框架速率要求和资源限制边缘部署时,仍然面临主要的效率挑战。 因此, 为使3DDS能够高效地配置3DGS, 我们在本文件中提议, 3D S- Gausian 算法/硬件共同设计框架, 用于轻量流3D 。 LS- Gausian 的核心观察显示, 3DGS 存在大量计算冗余和摊位。 一方面, 一方面, 在实际情景中, 高框架3 3DGS 经常在相机观察和连续地显示相同场景时, 高框架3DGS 应用高框架。 LS- Gussiator 提出一个观点转换算法, 利用跨框架的连续性来创造高效的3DV。 另一方面, 图像中不同的瓷体只是平行但有不均匀的工作量, 经常的硬件摊子也拖慢了转换过程。 LS- GAussiexxxxeralalalalalalalalalimalalalalalalalaliming 。
Article 6
Title@2025-07-29 (2): A Customized Memory-aware Architecture for Biological Sequence Alignment
Title: A Customized Memory-aware Architecture for Biological Sequence Alignment | Eine maßgeschneiderte, speicherbewusste Architektur für die biologische Sequenzausrichtung | 用于生物序列对齐的自定义内存意识架构 2507.22221v1 |
Authors (3): Nasrin Akbari, Mehdi Modarressi, Alireza Khadem
Sequence alignment is a fundamental process in computational biology which identifies regions of similarity in biological sequences. With the exponential growth in the volume of data in bioinformatics databases, the time, processing power, and memory bandwidth for comparing a query sequence with the available databases grows proportionally. The sequence alignment algorithms often involve simple arithmetic operations and feature high degrees of inherent fine-grained and coarse-grained parallelism. These features can be potentially exploited by a massive parallel processor, such as a GPU, to increase throughput. In this paper, we show that the excessive memory bandwidth demand of the sequence alignment algorithms prevents exploiting the maximum achievable throughput on conventional parallel machines. We then propose a memory-aware architecture to reduce the bandwidth demand of the sequence alignment algorithms, effectively pushing the memory wall to extract higher throughput. The design is integrated at the logic layer of an emerging 3D DRAM as a processing-in-memory architecture to further increase the available bandwidth. The experimental results show that the proposed architecture results in up to 2.4x speedup over a GPU-based design. Moreover, by moving the computation closer to the memory, power consumption is reduced by 37%, on average.
序列序列对齐是一个基本的计算生物学过程,它确定生物序列中相近的区域。 随着生物信息数据库中数据量的指数增长, 时间、 处理力和内存带宽与可用数据库比较查询序列的量成比例增长。 序列对齐算法往往涉及简单的算术操作, 并具有高层次的内在细微细和粗粗微的平行特征。 这些特征有可能被大型平行处理器( 如GPU) 利用, 以增加输送量。 本文显示, 序列对齐算法的过度记忆带宽需求无法利用常规平行机器上的最大可实现的吞吐量。 我们然后提议一个记忆- 觉结构, 以减少序列对齐算法的带宽需求, 有效地推动记忆墙以提取更高的吞吐量。 设计可以在正在形成的 3D DRAM 的逻辑层中整合成一个处理- 模块, 以进一步增加可用的带宽。 实验结果显示, 拟议的结构在GPU 设计上达到2.4x 速度。 此外, 通过更接近平均的计算, 将电流到 。
Article 7
Title@2025-07-29 (2): A Multi-Agent Generative AI Framework for IC Module-Level Verification Automation
Title: A Multi-Agent Generative AI Framework for IC Module-Level Verification Automation | Multi-Agent Generatives KI-Framework für die IC-Modul-Level-Verifikationsautomatisierung | IC 模块级核查自动化多机构生成AI框架 2507.21694v1 |
Authors (5): Wenbo Liu, Forbes Hou, Jon Zhang, Hong Liu, Allen Lei
As large language models demonstrate enormous potential in the field of Electronic Design Automation (EDA), generative AI-assisted chip design is attracting widespread attention from academia and industry. Although these technologies have made preliminary progress in tasks such as code generation, their application in chip verification – a critical bottleneck in the chip development cycle – remains at an exploratory stage. This paper proposes an innovative Multi-Agent Verification Framework (MAVF) aimed at addressing the limitations of current single-LLM approaches in complex verification tasks. Our framework builds an automated transformation system from design specifications to testbench through the collaborative work of multiple specialized agents, including specification parsing, verification strategy generation, and code implementation. Through verification experiments on multiple chip modules of varying complexity, results show that MAVF significantly outperforms traditional manual methods and single-dialogue generative AI approaches in verification document parsing and generation, as well as automated testbench generation. This research opens new directions for exploring generative AI applications in verification automation, potentially providing effective approaches to solving the most challenging bottleneck issues in chip design.
由于大型语言模型在电子设计自动化领域显示出巨大潜力,因此基因型的AI辅助芯片设计正在吸引学术界和工业界的广泛关注,尽管这些技术在代码生成等任务方面取得了初步进展,但在芯片核查(芯片开发周期中一个关键瓶颈)中的应用仍处于探索阶段。本文件提议了一个创新的多机构核查框架(MAVF),旨在解决当前单一LLM方法在复杂核查任务中的局限性。我们的框架通过多个专门剂的协作工作,包括规格分解、核查战略的制定和代码的执行,建立了一个从设计规格到测试的自动转换系统。通过对复杂程度不同的多个芯片模块的核查实验,结果显示MAVF在核查文件分解和生成以及自动测试箱生成方面大大超越了传统的手工方法和单一对话型人工方法。这一研究为探索核查自动化中具有基因特征的AI应用开辟了新的方向,有可能为解决芯片设计中最具挑战性的瓶颈问题提供有效的方法。
Article 8
Title@2025-07-29 (2): SLTarch: Towards Scalable Point-Based Neural Rendering by Taming Workload Imbalance and Memory Irregularity
Title: SLTarch: Towards Scalable Point-Based Neural Rendering by Taming Workload Imbalance and Memory Irregularity | SLTarch: Auf dem Weg zu einer skalierbaren point-based Neural Rendering durch Taming Workload Unbalance und Memory Irregularity | SLSTARK:通过修补工作负荷的不平衡和记忆不规则性,走向可缩放的点基心神经成形 2507.21499v1 |
Authors (8): Xingyang Li, Jie Jiang, Yu Feng, Yiming Gan, Jieru Zhao, Zihan Liu, Jingwen Leng, Minyi Guo
Rendering is critical in fields like 3D modeling, AR/VR, and autonomous driving, where high-quality, real-time output is essential. Point-based neural rendering (PBNR) offers a photorealistic and efficient alternative to conventional methods, yet it is still challenging to achieve real-time rendering on mobile platforms. We pinpoint two major bottlenecks in PBNR pipelines: LoD search and splatting. LoD search suffers from workload imbalance and irregular memory access, making it inefficient on off-the-shelf GPUs. Meanwhile, splatting introduces severe warp divergence across GPU threads due to its inherent sparsity. To tackle these challenges, we propose SLTarch, an algorithm-architecture co-designed framework. At its core, SLTarch introduces SLTree, a dedicated subtree-based data structure, and LTcore, a specialized hardware architecture tailored for efficient LoD search. Additionally, we co-design a divergence-free splatting algorithm with our simple yet principled hardware augmentation, SPcore, to existing PBNR accelerators. Compared to a mobile GPU, SLTarch achieves 3.9x speedup and 98% energy savings with negligible architecture overhead. Compared to existing accelerator designs, SLTarch achieves 1.8x speedup with 54% energy savings.
3D 建模、 AR/ VR 和 自主驱动等领域, 高品质的实时实时输出至关重要。 基于点的神经转换( PBNR) 提供了常规方法的光现实和高效的替代方法, 但实现移动平台的实时覆盖仍然具有挑战性。 我们在 PBNR 管道中发现了两个主要瓶颈: 搜索和扩展。 LoD 搜索存在工作量不平衡和不规则的记忆访问, 使得它无法在现成的 GPU 上实现效率。 与此同时, 螺旋式的螺旋形使GPU 线由于其内在的松散性而出现严重的扭曲差异。 为了应对这些挑战, 我们提议SLTarch, 是一个算法- 共同设计的框架。 在其核心, SLSTarch 引入了SLTree, 一个专门的次树基数据结构, 以及一个专门为高效的LOD搜索而设计的专用硬件结构。 此外, 我们共同设计了一个与我们简单但有原则的硬件增强、 SPCore 的无差异的螺旋式算法, 至现有的PBNC$ 。 。 。 为了应对这些挑战, 将SLTER 的节能速度提高到 与一个可达98 标准 和可达98 节能 的节能结构, 达到98 达到98 的节能的节能。
Article 9
Title@2025-07-29 (2): Automated HEMT Model Construction from Datasheets via Multi-Modal Intelligence and Prior-Knowledge-Free Optimization
Title: Automated HEMT Model Construction from Datasheets via Multi-Modal Intelligence and Prior-Knowledge-Free Optimization | Automatisierte HEMT-Modellkonstruktion aus Datenblättern über Multi-Modal-Intelligenz und vorherige wissensfreie Optimierung | 通过多模式情报和事先知识自由优化,从数据表格中自动建立电子内容管理模型 2507.21430v1 |
Authors (4): Yuang Peng, Jiarui Zhong, Yang Zhang, Hong Cai Chen
Parameter extraction for industry-standard device models like ASM-HEMT is crucial in circuit design workflows. However, many manufacturers do not provide such models, leaving users to build them using only datasheets. Unfortunately, datasheets lack sufficient information for standard step-by-step extraction. Moreover, manual data extraction from datasheets is highly time-consuming, and the absence of a fully automated method forces engineers to perform tedious manual work. To address this challenge, this paper introduces a novel, end-to-end framework that fully automates the generation of simulation-ready ASM-HEMT SPICE models directly from PDF datasheets. Our framework is founded on two core innovations: 1) a multi-modal AI pipeline that integrates computer vision with a large language model (LLM) to robustly parse heterogeneous datasheet layouts and digitize characteristic curves, and 2) a novel Iterative-Focusing Tree-structured Parzen Estimator (IF-TPE) optimization algorithm, specifically designed for device parameter extraction under high-dimensional, sparse-data conditions, that adaptively refines the parameter search space. Experimental validation on a diverse set of 17 commercial HEMT devices from 10 manufacturers confirms the framework’s accuracy and robustness. The generated models demonstrate excellent agreement with published DC and RF characteristics. As the first fully automated workflow of its kind, our proposed solution offers a transformative approach to device modeling, poised to significantly accelerate the circuit design cycle by eliminating the need for manual parameter extraction.
工业标准设备模型(如ASM-HEMT-HEMT)的人工提取参数在电路设计工作流程中至关重要。然而,许多制造商没有提供这种模型,让用户仅使用数据表来建立这些模型。不幸的是,数据表缺乏足够的标准逐步提取信息。此外,数据表的手工数据提取耗时甚多,而且缺乏完全自动化的方法迫使工程师进行枯燥的手工工作。为了应对这一挑战,本文件引入了一个新型的、端到端框架,充分自动化地利用PDF数据表直接生成模拟的ASM-HEMT SPICE模型。我们的框架建立在两个核心创新的基础上:1)多式人工智能管道,将计算机图像与大语言模型(LLLM)相结合,以稳健的混杂的组合数据表布局和数字化的特征曲线。2)新型的热吸附方法优化算法(IF-TPE) 优化算法是专门设计在高维度、低密度数据条件下,通过适应性地改进参数搜索空间的精确度设计参数的参数提取。实验性测试模型将10号的精确性测试模型和模拟模拟模拟模型的多样化地展示了模型。
Article 10
Title@2025-07-28 (1): FastMamba: A High-Speed and Efficient Mamba Accelerator on FPGA with Accurate Quantization
Title: FastMamba: A High-Speed and Efficient Mamba Accelerator on FPGA with Accurate Quantization | FastMamba: Hochgeschwindigkeits- und Effizienter Mamba-Beschleuniger auf FPGA mit präziser Quantisierung | FPGA 快速Mamba:一个高速度、高效的 Mamba 加速器,使用准确量化的 FPGA 加速器 2505.18975v4 |
Authors (4): Aotao Wang, Haikuo Shao, Shaobo Ma, Zhongfeng Wang
State Space Models (SSMs), such as the recent Mamba2, have achieved remarkable performance and received extensive attention. However, deploying Mamba2 on resource-constrained edge devices encounters many problems: severe outliers within the linear layers that make quantization difficult, diverse and irregular element-wise tensor operations, and hardware-unfriendly nonlinear functions in the SSM block. To address these issues, this paper presents FastMamba, a dedicated accelerator on FPGA with hardware-algorithm co-design to promote the deployment efficiency of Mamba2. Specifically, we successfully achieve 8-bit quantization for linear layers through Hadamard transformation to eliminate outliers. Moreover, a hardware-friendly and fine-grained power-of-two quantization framework is presented for the SSM block and convolution layer, and a first-order linear approximation is developed to optimize the nonlinear functions. Based on the accurate algorithm quantization, we propose an accelerator that integrates parallel vector processing units, pipelined execution dataflow, and an efficient SSM Nonlinear Approximation Unit, which enhances computational efficiency and reduces hardware complexity. Finally, we evaluate FastMamba on Xilinx VC709 FPGA. For the input prefill task on Mamba2-130M, FastMamba achieves 68.80$\times$ and 8.90$\times$ speedup over Intel Xeon 4210R CPU and NVIDIA RTX 3090 GPU, respectively. In the output decode experiment with Mamba2-2.7B, FastMamba attains 6$\times$ higher energy efficiency than RTX 3090 GPU.
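状态空间模型(SSM),如近期的Mamba2,已取得显著性能并受到广泛关注。然而,在资源受限的边缘设备上部署Mamba2面临诸多问题:线性层中的严重离群值给量化带来挑战,逐元素张量运算多样且不规则,SSM模块中的非线性函数对硬件不友好。为解决这些问题,本文提出FastMamba,一种基于FPGA的专用加速器,通过硬件-算法协同设计提升Mamba2的部署效率。具体而言,我们通过Hadamard变换消除离群值,成功实现线性层的8位量化。此外,针对SSM模块和卷积层提出了硬件友好的细粒度2的幂次量化框架,并开发了一阶线性近似来优化非线性函数。基于精确的算法量化,我们提出了一种集成并行向量处理单元、流水线执行数据流和高效SSM非线性近似单元的加速器,提高了计算效率并降低了硬件复杂度。最后,我们在Xilinx VC709 FPGA上评估了FastMamba。在Mamba2-130M的输入预填充任务中,FastMamba相对Intel Xeon 4210R CPU和NVIDIA RTX 3090 GPU分别实现68.80倍和8.90倍加速。在Mamba2-2.7B的输出解码实验中,FastMamba的能效比RTX 3090 GPU高6倍。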
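The outlier-removal idea in the FastMamba abstract can be illustrated with a small sketch. It assumes a normalized Walsh-Hadamard rotation followed by symmetric per-tensor 8-bit quantization; the activation vector, the function names, and the per-tensor scaling are illustrative assumptions, not details taken from the paper.

```python
import math

def fwht(vec):
    """Normalized fast Walsh-Hadamard transform; len(vec) must be a power of two."""
    out = list(vec)
    n = len(out)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]

def quantize_int8(vec):
    """Symmetric per-tensor 8-bit quantization; returns the dequantized values."""
    max_abs = max(abs(v) for v in vec)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [max(-127, min(127, round(v / scale))) * scale for v in vec]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Hypothetical activation vector: small values plus one severe outlier.
x = [0.11, -0.23, 0.35, -0.07, 0.19, -0.31, 0.27, 40.0]

# Quantizing directly: the outlier dictates the scale and crushes the small values.
direct_error = mse(x, quantize_int8(x))

# Rotate, quantize, rotate back (the normalized transform is its own inverse).
rotated_error = mse(x, fwht(quantize_int8(fwht(x))))
```

Because the rotation is orthogonal, it spreads the outlier's energy across all coordinates, which shrinks the quantization step; in this toy case `rotated_error` comes out smaller than `direct_error`.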
Article 11
Title@2025-07-28 (1): LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference
Title: LUT Tensor Core: A Software-Hardware Co-Design for LUT-Based Low-Bit LLM Inference | LUT Tensor Core: Ein Software-Hardware-Co-Design für LUT-basierte Low-Bit LLM-Inferenz | LUT 信标核心:基于 LUT的低比低 LLM 推断的软件-硬件共同设计 2408.06003v3 |
Authors (11): Zhiwen Mo, Lei Wang, Jianyu Wei, Zhichen Zeng, Shijie Cao, Lingxiao Ma, Naifeng Jing, Ting Cao, Jilong Xue, Fan Yang, Mao Yang
Large Language Model (LLM) inference becomes resource-intensive, prompting a shift toward low-bit model weights to reduce the memory footprint and improve efficiency. Such low-bit LLMs necessitate the mixed-precision matrix multiplication (mpGEMM), an important yet underexplored operation involving the multiplication of lower-precision weights with higher-precision activations. Off-the-shelf hardware does not support this operation natively, leading to indirect, thus inefficient, dequantization-based implementations. In this paper, we study the lookup table (LUT)-based approach for mpGEMM and find that a conventional LUT implementation fails to achieve the promised gains. To unlock the full potential of LUT-based mpGEMM, we propose LUT Tensor Core, a software-hardware co-design for low-bit LLM inference. LUT Tensor Core differentiates itself from conventional LUT designs through: 1) software-based optimizations to minimize table precompute overhead and weight reinterpretation to reduce table storage; 2) a LUT-based Tensor Core hardware design with an elongated tiling shape to maximize table reuse and a bit-serial design to support diverse precision combinations in mpGEMM; 3) a new instruction set and compilation optimizations for LUT-based mpGEMM. LUT Tensor Core significantly outperforms existing pure software LUT implementations and achieves a 1.44$\times$ improvement in compute density and energy efficiency compared to previous state-of-the-art LUT-based accelerators.
大型语言模型(LLM) 推论成为资源密集型的大型语言模型(LLM) 推论成为资源密集型,促使转向低位模型重量,以减少记忆足迹并提高效率。这种低位LLMM 低位LOMS 需要混合精度矩阵倍增(mpGEMM ) , 这是一项重要但尚未充分探索的操作, 涉及低精度重量的倍增, 以及高精度启动。 现成硬件无法支持这项本地操作, 导致间接的、 效率低下的分解实施。 在本文中, 我们研究基于查找表( LUT) 的方法, 以降低记忆足足足量的存储量, 发现常规 LUT 执行无法实现承诺的收益。 为了释放基于 LUT 精度矩阵的组合, 我们提议LUT Tensor Core- Core Compressional 将 LGEODA 与现有精度精度版本的硬度组合组合, 一种基于LUT-S-BS-S-BS-BS-BS-BS-BS-SLEDRILEDF 版本的系统配置, 一个基于现有精度的硬度配置版本版本的版本版本版本版本版本版本版本版本版本版本版本的版本版本版本。
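A minimal sketch of the lookup-table idea behind mpGEMM follows, assuming unsigned integer weights, a group size of four activations, and bit-serial processing of weight bit planes. The function names and the table layout are hypothetical and do not reproduce the LUT Tensor Core design.

```python
def build_tables(activations, group=4):
    """For each group of activations, precompute the partial sum for every bit pattern."""
    tables = []
    for start in range(0, len(activations), group):
        chunk = activations[start:start + group]
        table = [0.0] * (1 << len(chunk))
        for pattern in range(1 << len(chunk)):
            table[pattern] = sum(a for j, a in enumerate(chunk) if (pattern >> j) & 1)
        tables.append(table)
    return tables

def lut_dot(weights, tables, bits, group=4):
    """Dot product of unsigned `bits`-bit weights with the activations behind `tables`."""
    total = 0.0
    for b in range(bits):                      # bit-serial over weight bit planes
        plane_sum = 0.0
        for g, table in enumerate(tables):
            chunk = weights[g * group:(g + 1) * group]
            pattern = 0
            for j, w in enumerate(chunk):      # pack bit b of each weight into an index
                pattern |= ((w >> b) & 1) << j
            plane_sum += table[pattern]        # one lookup replaces `group` multiplies
        total += plane_sum * (1 << b)
    return total

# Hypothetical higher-precision activations and 2-bit weights.
activations = [0.5, -1.25, 2.0, 0.75, -0.5, 1.5, 0.25, -2.0]
weights = [3, 0, 1, 2, 1, 3, 2, 0]

tables = build_tables(activations)
result = lut_dot(weights, tables, bits=2)
reference = sum(w * a for w, a in zip(weights, activations))
```

The tables depend only on the activations, so in a matrix product they can be built once and reused for every weight row, which is the table reuse the abstract refers to.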
Article 12
Title@2025-07-27 (7): Demystifying the 7-D Convolution Loop Nest for Data and Instruction Streaming in Reconfigurable AI Accelerators
Title: Demystifying the 7-D Convolution Loop Nest for Data and Instruction Streaming in Reconfigurable AI Accelerators | Entmystifizierung des 7-D-Konvolutionsschleifen-Nests für Daten- und Instruktionsstreaming in rekonfigurierbaren KI-Beschleunigern | 解开重新配置 AI 加速器中数据和指示流7D 革命圈点的神秘化 2507.20420v1 |
Authors (2): Md Rownak Hossain Chowdhury, Mostafizur Rahman
Convolution remains the most compute-intensive operation in AI acceleration, often constituting over 80-90% of the workload. Existing approaches in spatial architectures such as coarse-grained reconfigurable arrays (CGRAs) and field-programmable gate arrays (FPGAs) frequently rely on loop unrolling or GEMM-based matrix transformations, introducing significant overhead in both data movement and instruction control. This paper presents a new framework designed to systematically demystify the 7-dimensional convolution loop nest by reinterpreting it as a hardware-centric data and instruction streaming problem. Instead of treating the loop nest as a fixed computational construct, our approach exposes its structure as a set of spatial and temporal mappings governed by hardware parameters such as compute element distribution, interconnect topology, and reconfigurability. This abstraction supports lightweight, flexible deployment of convolution without reliance on heavyweight transformations or reordering schemes. We demonstrate the application of our approach on the MAVeC accelerator. We detail the implementation of convolution operations in MAVeC and extend the framework to support full model execution on VGG-16. Our profiling reveals high PE utilization (over 90%), significant fold reuse, and scalable throughput up to 1.56 TFLOPs/sec and 12.7 KIPS for end-to-end VGG-16 inference. These results validate the efficacy of our approach in minimizing control overhead, improving data locality, and enabling efficient large-scale convolution execution without reliance on conventional transformation-based methods.
在AI加速中,电离层仍然是最大的计算密集操作,通常占工作量的80-90%以上。现有空间结构方法,如粗微可重新配置的阵列(CGRAs)和实地可编程的门阵列(FPGAs)经常依赖环状滚动或GEMM基基矩阵变换,在数据移动和教学控制中引入大量间接费用。本文提出了一个新框架,旨在通过重新将7维共流环巢解释为硬件中心的数据和教学流问题,系统拆解七维共流环巢。我们的方法不是将环状网当作固定的计算结构,而是将其结构暴露为一套由硬件参数规范的空间和时间图解结构,这些参数包括:可编集元素分布、互连的地形和可重新配置。这种抽象化支持轻度、灵活地部署电流,而无需依靠重量转换或重新排序计划。我们运用了MAVAVIC acerator方法。我们详细介绍了MAVERC的变动操作情况,在可编程、可编程的节流操作框架中,在不改进POPS-LAFLTADRIDFLS 上,在不进行高的升级的升级和扩展中,在不使用中,这些格式上,这些框架将支持了高的升级数据流流流流数据流流式的升级,在S-LVGVGFIFLDFAFADFAV。
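For reference, the 7-dimensional loop nest can be written out directly. The sketch below assumes the conventional dimension names (batch N, output channels K, input channels C, output rows P, output columns Q, filter rows R, filter columns S), stride 1, and no padding; the abstract does not name the dimensions, so this labeling is an assumption, and the code is a plain software reference rather than the MAVeC mapping.

```python
def conv2d_loop_nest(inputs, weights):
    """Direct convolution (stride 1, no padding) written as a 7-deep loop nest.

    inputs:  [N][C][H][W] nested lists
    weights: [K][C][R][S] nested lists
    returns: [N][K][P][Q] with P = H - R + 1 and Q = W - S + 1
    """
    N, C = len(inputs), len(inputs[0])
    H, W = len(inputs[0][0]), len(inputs[0][0][0])
    K, R, S = len(weights), len(weights[0][0]), len(weights[0][0][0])
    P, Q = H - R + 1, W - S + 1
    out = [[[[0] * Q for _ in range(P)] for _ in range(K)] for _ in range(N)]
    for n in range(N):                              # 1: batch
        for k in range(K):                          # 2: output channel
            for c in range(C):                      # 3: input channel
                for p in range(P):                  # 4: output row
                    for q in range(Q):              # 5: output column
                        for r in range(R):          # 6: filter row
                            for s in range(S):      # 7: filter column
                                out[n][k][p][q] += (
                                    inputs[n][c][p + r][q + s] * weights[k][c][r][s]
                                )
    return out

# Tiny example: one 3x3 single-channel image and one 2x2 filter.
image = [[[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]]
kernel = [[[[1, 0], [0, 1]]]]
feature_map = conv2d_loop_nest(image, kernel)
```

Any of the seven loops can be reordered, tiled, or assigned to space instead of time without changing the result, which is what makes the nest a mapping problem.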
Article 13
Title@2025-07-27 (7): RoCE BALBOA: Service-enhanced Data Center RDMA for SmartNICs
Title: RoCE BALBOA: Service-enhanced Data Center RDMA for SmartNICs | RoCE BALBOA: Service-verstärktes Rechenzentrum RDMA für SmartNICs | RoCE BALBOA:增强服务数据中心 2507.20412v1 |
Authors (7): Maximilian Jakob Heer, Benjamin Ramhorst, Yu Zhu, Luhao Liu, Zhiyi Hu, Jonas Dann, Gustavo Alonso
Data-intensive applications in data centers, especially machine learning (ML), have made the network a bottleneck, which in turn has motivated the development of more efficient network protocols and infrastructure. For instance, remote direct memory access (RDMA) has become the standard protocol for data transport in the cloud because it minimizes data copies and reduces CPU utilization via host bypassing. Similarly, an increasing number of network functions and infrastructure components have moved to accelerators, SmartNICs, and in-network computing to bypass the CPU. In this paper we explore the implementation and deployment of RoCE BALBOA, an open-source RDMA stack that is RoCE v2-compatible, scales to hundreds of queue pairs, is 100G-capable, and can be used as the basis for building accelerators and SmartNICs. RoCE BALBOA is customizable, opening up a design space and offering a degree of adaptability not available in commercial products. We have deployed BALBOA in a cluster using FPGAs and show that it has latency and performance characteristics comparable to commercial NICs. We demonstrate its potential by exploring two classes of use cases. One involves enhancements to the protocol for infrastructure purposes (encryption, deep packet inspection using ML). The other showcases the ability to perform line-rate compute offloads with deep pipelines by implementing commercial data preprocessing pipelines for recommender systems that process the data as it arrives from the network before transferring it directly to the GPU. These examples demonstrate how BALBOA enables the exploration and development of SmartNICs and accelerators operating on network data streams.
数据中心的数据密集应用,特别是机器学习(ML),使网络成为瓶颈,这反过来推动了更有效的网络协议和基础设施的发展。例如,远程直接存储访问(RDMA)成为云中数据传输的标准协议,因为它最大限度地减少数据复制量,减少通过主机浏览使用CPU。同样,越来越多的网络功能和基础设施已经转向加速器、智能网络功能和基础设施以及绕过CPU的网络计算。在这份文件中,我们探索了RoCE BALBOA的安装和部署,这是一个开放源,RoCE V2可兼容,可扩缩至数百个排垫,以及100G可升级的RDMA堆成为云中数据传输的标准协议,可用作建立加速器和智能智能用户访问的基础。RoCE BALBOA是定制的,打开了设计空间,提供了商业产品中无法使用的适应度。我们利用FGAs的集群将BALBOA从透明流和性运行能力从透明管道转换到智能网络,我们用OrentalA的运行能力演示了ODRA的运行能力。我们用BROBA进行直接测试,用ORRA的进度展示了BRBRM的运行。我们使用了它的运行能力。
Article 14
Title@2025-07-27 (7): ACCESS-AV: Adaptive Communication-Computation Codesign for Sustainable Autonomous Vehicle Localization in Smart Factories
Title: ACCESS-AV: Adaptive Communication-Computation Codesign for Sustainable Autonomous Vehicle Localization in Smart Factories | ACCESS-AV: Adaptive Kommunikations-Computation Codesign für nachhaltige autonome Fahrzeuglokalisierung in Smart Factories | ACCES-AV: 智能工厂可持续自主车辆本地化适应性通信-计算代码符号 2507.20399v1 |
Authors (5): Rajat Bhattacharjya, Arnab Sarkar, Ish Kool, Sabur Baidya, Nikil Dutt
Autonomous Delivery Vehicles (ADVs) are increasingly used for transporting goods in 5G network-enabled smart factories, with the compute-intensive localization module presenting a significant opportunity for optimization. We propose ACCESS-AV, an energy-efficient Vehicle-to-Infrastructure (V2I) localization framework that leverages existing 5G infrastructure in smart factory environments. By opportunistically accessing the periodically broadcast 5G Synchronization Signal Blocks (SSBs) for localization, ACCESS-AV obviates the need for dedicated Roadside Units (RSUs) or additional onboard sensors to achieve energy efficiency as well as cost reduction. We implement an Angle-of-Arrival (AoA)-based estimation method using the Multiple Signal Classification (MUSIC) algorithm, optimized for resource-constrained ADV platforms through an adaptive communication-computation strategy that dynamically balances energy consumption with localization accuracy based on environmental conditions such as Signal-to-Noise Ratio (SNR) and vehicle velocity. Experimental results demonstrate that ACCESS-AV achieves an average energy reduction of 43.09% compared to non-adaptive systems employing AoA algorithms such as vanilla MUSIC, ESPRIT, and Root-MUSIC. It maintains sub-30 cm localization accuracy while also delivering substantial reductions in infrastructure and operational costs, establishing its viability for sustainable smart factory environments.
自动配送车辆(ADV)越来越多地用于支持5G网络的智能工厂中的货物运输,其中计算密集的定位模块存在显著的优化空间。我们提出ACCESS-AV,一种节能的车辆到基础设施(V2I)定位框架,利用智能工厂环境中已有的5G基础设施。通过机会性地接收周期性广播的5G同步信号块(SSB)进行定位,ACCESS-AV无需专用路侧单元(RSU)或额外的车载传感器,从而实现节能并降低成本。我们采用多重信号分类(MUSIC)算法实现基于到达角(AoA)的估计方法,并通过自适应通信-计算策略针对资源受限的ADV平台进行优化,该策略根据信噪比(SNR)和车辆速度等环境条件,在能耗与定位精度之间动态权衡。实验结果表明,与采用原始MUSIC、ESPRIT和Root-MUSIC等AoA算法的非自适应系统相比,ACCESS-AV平均降低能耗43.09%。它保持30厘米以内的定位精度,同时大幅降低基础设施和运营成本,证明了其在可持续智能工厂环境中的可行性。
Article 15
Title@2025-07-26 (6): AxOSyn: An Open-source Framework for Synthesizing Novel Approximate Arithmetic Operators
Title: AxOSyn: An Open-source Framework for Synthesizing Novel Approximate Arithmetic Operators | AxOSyn: Ein Open-Source-Framework zur Synthese von Romanen Ungefähre Arithmetische Operatoren | AxOSyn: 一个用于合成新奇近光亚学天文操作员的开放源码框架 2507.20007v1 |
Authors (3): Siva Satyendra Sahoo, Salim Ullah, Akash Kumar
Edge AI deployments are becoming increasingly complex, necessitating energy-efficient solutions for resource-constrained embedded systems. Approximate computing, which allows for controlled inaccuracies in computations, is emerging as a promising approach for improving power and energy efficiency. Among the key techniques in approximate computing are approximate arithmetic operators (AxOs), which enable application-specific optimizations beyond traditional computer arithmetic hardware reduction-based methods, such as quantization and precision scaling. Existing design space exploration (DSE) frameworks for approximate computing limit themselves to selection-based approaches or custom synthesis at fixed abstraction levels, which restricts the flexibility required for finding application-specific optimal solutions. Further, the tools available for the DSE of AxOs are quite limited in terms of exploring different approximation models and extending the analysis to different granularities. To this end, we propose AxOSyn, an open-source framework for the DSE of AxOs that supports both selection and synthesis approaches at various abstraction levels. AxOSyn allows researchers to integrate custom methods for evaluating approximations and facilitates DSE at both the operator level and the application-specific level. Our framework provides an effective methodology for achieving energy-efficient, approximate operators.
人工智能部署正变得越来越复杂,需要为资源限制的嵌入系统提供节能解决方案。近似计算(允许有控制的计算不准确)正在成为改善电力和能源效率的一个有希望的方法。近似计算的关键技术包括近似算术操作员(AxOs),它使得在传统的计算机算术硬件削减法方法(如量化和精确度缩放)之外,能够实现具体应用优化,超越传统的计算机算术硬件削减法(如量化和精确度缩放)。现有的估计计算的空间设计探索(DSE)框架限制在固定抽象级别上选择基于选择的方法或定制合成,这限制了寻找具体应用最佳解决方案所需的灵活性。此外,用于AxOsDSE的现有工具在探索不同的近似模型和将分析扩大到不同的微粒方面非常有限。为此,我们提议了AxOSyn,即支持不同抽象级别选择和合成方法的开放源框架。AxOSyn使研究人员能够将评估近似值的定制方法整合到操作员一级和具体应用的DSE。我们的框架提供了实现能源效率的有效方法。
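To make the notion of an approximate arithmetic operator concrete, here is a small sketch of one simple approximation, operand truncation in an unsigned multiplier, together with an exhaustive error evaluation. The operator and the metrics are illustrative assumptions and are not taken from AxOSyn.

```python
def truncated_multiply(a, b, dropped_bits):
    """Approximate unsigned multiply: zero the low `dropped_bits` of each operand first."""
    mask = ~((1 << dropped_bits) - 1)
    return (a & mask) * (b & mask)

def error_metrics(dropped_bits, width=8):
    """Compare against the exact product for all pairs of `width`-bit operands.

    Returns (mean absolute error, maximum absolute error).
    """
    total_abs_err = 0
    max_abs_err = 0
    count = 0
    for a in range(1 << width):
        for b in range(1 << width):
            err = abs(a * b - truncated_multiply(a, b, dropped_bits))
            total_abs_err += err
            max_abs_err = max(max_abs_err, err)
            count += 1
    return total_abs_err / count, max_abs_err

# Sweep a tiny design space: more dropped bits means less hardware, more error.
sweep = {bits: error_metrics(bits, width=4) for bits in range(3)}
```

A design space exploration loop would pair error numbers like these with a hardware cost estimate for each candidate operator and keep the Pareto-optimal points.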
Article 16
Title@2025-07-26 (6): A Scalable Resource Management Layer for FPGA SoCs in 6G Radio Units
Title: A Scalable Resource Management Layer for FPGA SoCs in 6G Radio Units | Eine skalierbare Ressourcenverwaltungsschicht für FPGA-SoCs in 6G-Funkgeräten | 6G无线电台FPGA SoCs可缩放资源管理层 2507.19963v1 |
Authors (4): Nikolaos Bartzoudis, José Rubio Fernández, David López-Bueno, Antonio Román Villarroel
This work presents a perspective on addressing the underutilization of computing resources in FPGA SoC devices deployed in 5G radio and edge computing infrastructure. The initial step in this approach involves developing a resource management layer capable of dynamically migrating and scaling functions within these devices in response to contextual events. This layer serves as the foundation for designing a hierarchical, data-driven micro-orchestrator responsible for managing the lifecycle of functions in FPGA SoC devices. In this paper, the proposed resource management layer is utilized to reconfigure a function based on events identified by a computer vision edge application.
这项工作提出了解决在5G无线电和边缘计算基础设施中部署的FPGA SoC设备中计算资源利用不足问题的视角,这一方法的第一步是开发一个资源管理层,以便能够根据背景事件在这些设备中动态迁移和扩展功能,该层是设计一个管理FPGA SoC设备功能生命周期的分级、数据驱动微操作器的基础,在本文件中,拟议的资源管理层用于根据计算机视觉边缘应用程序确定的事件重新配置一个功能。
Article 17
Title@2025-07-26 (6): ChipletPart: Scalable Cost-Aware Partitioning for 2.5D Systems
Title: ChipletPart: Scalable Cost-Aware Partitioning for 2.5D Systems | ChipletPart: Skalierbare Kosten-Bewusst-Partitionierung für 2.5D-Systeme | ChipletPart: 2.5D 系统可缩放成本软件分割 2507.19819v1 |
Authors (5): Alexander Graening, Puneet Gupta, Andrew B. Kahng, Bodhisatta Pramanik, Zhiang Wang
Industry adoption of chiplets has been increasing as a cost-effective option for making larger high-performance systems. Consequently, partitioning large systems into chiplets is increasingly important. In this work, we introduce ChipletPart - a cost-driven 2.5D system partitioner that addresses the unique constraints of chiplet systems, including complex objective functions, limited reach of inter-chiplet I/O transceivers, and the assignment of heterogeneous manufacturing technologies to different chiplets. ChipletPart integrates a sophisticated chiplet cost model with its underlying genetic algorithm-based technology assignment and partitioning methodology, along with a simulated annealing-based chiplet floorplanner. Our results show that: (i) ChipletPart reduces chiplet cost by up to 58% (20% geometric mean) compared to state-of-the-art min-cut partitioners, which often yield floorplan-infeasible solutions; (ii) ChipletPart generates partitions with up to 47% (6% geometric mean) lower cost as compared to the prior work Floorplet; and (iii) for the testcases we study, heterogeneous integration reduces cost by up to 43% (15% geometric mean) compared to homogeneous implementations. We also present case studies that show how changes in packaging or inter-chiplet signaling technologies can affect partitioning solutions. Finally, we make ChipletPart, the underlying chiplet cost model, and a chiplet testcase generator available as open-source tools for the community.
工业采用芯片作为一种具有成本效益的办法来建立更大型高性能系统,因此,将大型系统分割成芯片越来越重要。在这项工作中,我们引入了芯片部分 – – 由成本驱动的2.5D系统分割器,处理芯片系统的独特限制,包括复杂的客观功能,跨芯片I/O收发器的有限范围,以及向不同的芯片分配多种制造技术。芯片部分将精密的芯片成本模型与基于遗传算法的技术分配和分配方法相结合,同时采用模拟的芯片基底板板。我们的成果显示:(一) 芯片部分将芯片成本降低到58 % (20%的几何值平均值) ,与最先进的芯片系统分割器相比,它往往产生地板不可行的解决方案;(二) 芯片部分生成了高达47 % (6%的模型平均值) 与先前的基于基因算法的技术分配和分配方法相比,公开成本较低;以及(三) 测试室,基于我们的研究,混成的混成品整合将芯片成本降低到58%的芯片成本, 也影响我们目前的平质化测试模型研究。
Article 18
Title@2025-07-26 (6): Smaller, Faster, Cheaper: Architectural Designs for Efficient Machine Learning
Title: Smaller, Faster, Cheaper: Architectural Designs for Efficient Machine Learning | Kleiner, schneller, billiger: Architekturdesigns für effizientes maschinelles Lernen | 更小、更快、更便宜:高效机械学习的建筑设计 2507.19795v1 |
Authors (1): Steven Walton
Major advancements in the capabilities of computer vision models have been primarily fueled by rapid expansion of datasets, model parameters, and computational budgets, leading to ever-increasing demands on computational infrastructure. However, as these models are deployed in increasingly diverse and resource-constrained environments, there is a pressing need for architectures that can deliver high performance while requiring fewer computational resources. This dissertation focuses on architectural principles through which models can achieve increased performance while reducing their computational demands. We discuss strides towards this goal through three directions. First, we focus on data ingress and egress, investigating how information may be passed into and retrieved from our core neural processing units. This ensures that our models make the most of available data, allowing smaller architectures to become more performant. Second, we investigate modifications to the core neural architecture, applied to restricted attention in vision transformers. This section explores how removing uniform context windows in restricted attention increases the expressivity of the underlying neural architecture. Third, we explore the natural structures of Normalizing Flows and how we can leverage these properties to better distill model knowledge. These contributions demonstrate that careful design of neural architectures can increase the efficiency of machine learning algorithms, allowing them to become smaller, faster, and cheaper.
计算机愿景模型能力的重大进步主要得益于数据集、模型参数和计算预算的迅速扩展,导致对计算基础设施的需求不断增加。然而,随着这些模型在日益多样化和资源紧张的环境中部署,迫切需要能够提供高性能的建筑,同时需要较少的计算资源。这一论文侧重于建筑原则,通过这些建筑原则,模型可以提高性能,同时减少其计算需求。我们从三个方向讨论实现这一目标的进展。首先,我们注重进进和进数据,研究如何将信息传递到核心神经处理器中,从而调查如何从核心神经处理器中检索信息。这确保了我们的模型能够充分利用现有数据,使较小结构变得更加实用。第二,我们研究核心神经结构的修改,将其应用于限制对视觉变异器的关注。本节探讨了在有限情况下消除统一的环境窗口如何提高基本神经结构的清晰度。第三,我们探索正常化流程的自然结构,以及我们如何利用这些属性更好地提取模型知识。这些贡献表明,谨慎地设计神经结构,使其更廉价,可以提高质量。
Article 19
Title@2025-07-26 (6): LaMAGIC2: Advanced Circuit Formulations for Language Model-Based Analog Topology Generation
Title: LaMAGIC2: Advanced Circuit Formulations for Language Model-Based Analog Topology Generation | LaMAGIC2: Erweiterte Schaltungsformulierungen für sprachmodellbasierte analoge Topologie-Generierung | LaMAGIC2:语言模拟模拟模拟地形生成的先进电路配制 2506.10235v2 |
Authors (5): Chen-Chia Chang, Wan-Hsuan Lin, Yikang Shen, Yiran Chen, Xin Zhang
Automation of analog topology design is crucial because modern applications have customized requirements that currently demand heavily manual engineering effort. The state-of-the-art work applies a sequence-to-sequence approach and supervised finetuning on language models to generate topologies given user specifications. However, its circuit formulation is inefficient due to $O(|V|^2)$ token length and suffers from low precision sensitivity to numeric inputs. In this work, we introduce LaMAGIC2, a succinct float-input canonical formulation with identifier (SFCI) for language model-based analog topology generation. SFCI addresses these challenges by improving component-type recognition through identifier-based representations, reducing token length complexity to $O(|V|)$, and enhancing numeric precision sensitivity for better performance under tight tolerances. Our experiments demonstrate that LaMAGIC2 achieves 34% higher success rates under a tight tolerance of 0.01 and 10X lower MSEs compared to a prior method. LaMAGIC2 also exhibits better transferability for circuits with more vertices, with up to 58.5% improvement. These advancements establish LaMAGIC2 as a robust framework for analog topology generation.
模拟地形设计自动化至关重要,因为现代应用需要大量手工工程,因此模拟地形设计自动化至关重要。最先进的工作采用顺序到顺序的方法,对语言模型进行有监督的微调,以产生符合用户规格的地形。然而,由于O(V 2)象征性长度,电路配制效率低下,对数字输入的精确度低。在这项工作中,我们引入了LaMAGIC2, 一种简明的、具有基于语言模型的模拟地形生成识别标志(SFCI)的浮式输入式集成式配方。SFCI通过基于识别特征的表示方式改进部件类型识别,降低O(V ) 的象征性复杂性,提高数字精度,以便在紧紧的容度下提高性能。我们的实验表明,LaMAGIC2在0.0和10X较低的MSE的严格耐受容度下,取得了34%更高的成功率。LaMAGIC2还表明,具有更高程度58.5%改进的脊椎的电路路路的可转移性。这些进步使LAMAGIC2成为了最高生成结构的坚固框架。
Article 20
Title@2025-07-25 (5): MCP4EDA: LLM-Powered Model Context Protocol RTL-to-GDSII Automation with Backend Aware Synthesis Optimization
Title: MCP4EDA: LLM-Powered Model Context Protocol RTL-to-GDSII Automation with Backend Aware Synthesis Optimization | MCP4EDA: LLM-Powered Model Context Protocol RTL-to-GDSII Automation mit Backend Aware Syntheseoptimierung | MCP4EDA: LLM 授权示范背景议定书RTL-GDSII 2507.19570v1 |
Authors (6): Yiting Wang, Wanghao Ye, Yexiao He, Yiran Chen, Gang Qu, Ang Li
This paper presents MCP4EDA, the first Model Context Protocol server that enables Large Language Models (LLMs) to control and optimize the complete open-source RTL-to-GDSII design flow through natural language interaction. The system integrates Yosys synthesis, Icarus Verilog simulation, OpenLane place-and-route, GTKWave analysis, and KLayout visualization into a unified LLM-accessible interface, enabling designers to execute complex multi-tool EDA workflows conversationally via AI assistants such as Claude Desktop and Cursor IDE. The principal contribution is a backend-aware synthesis optimization methodology wherein LLMs analyze actual post-layout timing, power, and area metrics from OpenLane results to iteratively refine synthesis TCL scripts, establishing a closed-loop optimization system that bridges the traditional gap between synthesis estimates and physical implementation reality. In contrast to conventional flows that rely on wire-load models, this methodology leverages real backend performance data to guide synthesis parameter tuning, optimization sequence selection, and constraint refinement, with the LLM functioning as an intelligent design space exploration agent. Experimental evaluation on representative digital designs demonstrates 15-30% improvements in timing closure and 10-20% area reduction compared to default synthesis flows, establishing MCP4EDA as the first practical LLM-controlled end-to-end open-source EDA automation system. The code and demo are available at: http://www.agent4eda.com/
本文件介绍MCP4EDA,这是第一个使大语言模型(LLMS)能够通过自然语言互动控制和优化完整的开放源码 RTL-到GDSII设计流程的模拟背景协议服务器。该系统整合了Yosys合成、Icarus Verilog模拟、OpenLane place-and-route、GTKWave分析、KLayout可视化成一个统一的LLMM-可访问界面,使设计师能够通过诸如Claude桌面和Cursor IDE等AI助理进行复杂的多工具 EDA工作流程对话。主要贡献是一个后端综合优化合成优化方法,其中LMS分析OpenLane结果中的实际延期后时间、权力和地区指标,以迭接方式完善TCLOCR脚本,建立一个闭式优化系统与实际操作的LMM-S-S-SLA系统之间传统差距,与依赖线载模型的常规流动形成对照,这种方法利用真正的后端性业绩数据来指导合成参数的调整、优化序列选择选择和制约性改进,LM-20-20号合成系统作为智能的透明系统,在智能的LMA-ralde-ralde-ration-ral-rental 流中运行中运行中,测试中,将第15A 将自动演示流中,将自动演示的缩缩缩缩缩算。实验性分析。
Article 21
Title@2025-07-25 (5): A3D-MoE: Acceleration of Large Language Models with Mixture of Experts via 3D Heterogeneous Integration
Title: A3D-MoE: Acceleration of Large Language Models with Mixture of Experts via 3D Heterogeneous Integration | A3D-MoE: Beschleunigung großer Sprachmodelle mit Expertenmix über 3D Heterogene Integration | A3D-MOE:通过3D异变融合加速采用专家混合型大语言模式 2507.19142v1 |
Authors (8): Wei-Hsing Huang, Janak Sharda, Cheng-Jhih Shih, Yuyao Kong, Faaiq Waqar, Pin-Jun Chen, Yingyan Lin, Shimeng Yu
Conventional large language models (LLMs) are equipped with dozens of GB to TB of model parameters, making inference highly energy-intensive and costly as all the weights need to be loaded to onboard processing elements during computation. Recently, the Mixture-of-Experts (MoE) architecture has emerged as an efficient alternative, promising efficient inference with fewer activated weights per token. Nevertheless, fine-grained MoE-based LLMs face several challenges: 1) variable workloads during runtime create arbitrary GEMV-GEMM ratios that reduce hardware utilization; 2) traditional MoE-based scheduling for LLM serving cannot fuse attention operations with MoE operations, leading to increased latency and decreased hardware utilization; and 3) despite being more efficient than conventional LLMs, loading experts from DRAM still consumes significant energy and requires substantial DRAM bandwidth. Addressing these challenges, we propose: 1) A3D-MoE, a 3D Heterogeneous Integration system that employs state-of-the-art vertical integration technology to significantly enhance memory bandwidth while reducing Network-on-Chip (NoC) overhead and energy consumption; 2) a 3D-Adaptive GEMV-GEMM-ratio systolic array with V-Cache efficient data reuse and a novel unified 3D dataflow to solve the problem of reduced hardware utilization caused by arbitrary GEMV-GEMM ratios from different workloads; 3) a hardware resource-aware operation fusion scheduler that fuses attention operations with MoE operations to enhance hardware performance; and 4) MoE Score-Aware HBM access reduction with even-odd expert placement that reduces DRAM access and bandwidth requirements. Our evaluation results indicate that A3D-MoE delivers significant performance enhancements, reducing latency by a factor of 1.8x to 2x and energy consumption by 2x to 4x, while improving throughput by 1.44x to 1.8x compared to the state-of-the-art.
传统大语言模型(LLM)拥有数十GB乃至TB级的模型参数,由于计算时需要将全部权重加载到板载处理单元,推理能耗高、成本大。近来,混合专家(MoE)架构作为一种高效替代方案出现,每个token激活的权重更少,有望实现高效推理。然而,细粒度的基于MoE的LLM面临若干挑战:1)运行时工作负载多变,产生任意的GEMV-GEMM比例,降低硬件利用率;2)传统的基于MoE的LLM服务调度无法将注意力运算与MoE运算融合,导致延迟增加、硬件利用率下降;3)尽管比传统LLM更高效,从DRAM加载专家仍消耗大量能量并需要可观的DRAM带宽。针对这些挑战,我们提出:1)A3D-MoE,一种3D异构集成系统,采用最先进的垂直集成技术,显著提升内存带宽,同时降低片上网络(NoC)开销和能耗;2)一种3D自适应GEMV-GEMM比例脉动阵列,配合V-Cache高效数据复用和新颖的统一3D数据流,解决不同工作负载的任意GEMV-GEMM比例导致硬件利用率下降的问题;3)一种硬件资源感知的算子融合调度器,将注意力运算与MoE运算融合以提升硬件性能;4)MoE得分感知的HBM访问削减及奇偶专家放置,降低DRAM访问和带宽需求。评估结果表明,与最先进方案相比,A3D-MoE带来显著的性能提升,延迟降低1.8倍至2倍,能耗降低2倍至4倍,吞吐量提升1.44倍至1.8倍。
Article 22
Title@2025-07-25 (5): 3DGauCIM: Accelerating Static/Dynamic 3D Gaussian Splatting via Digital CIM for High Frame Rate Real-Time Edge Rendering
Title: 3DGauCIM: Accelerating Static/Dynamic 3D Gaussian Splatting via Digital CIM for High Frame Rate Real-Time Edge Rendering | 3DGauCIM: Statische/Dynamische 3D Gaussian Splatting über Digital CIM beschleunigen für hohe Framerate Echtzeit Edge Rendering | 3DGauCIM:通过数字 CIM加速静电/动态 3D Gaussian Splating, 用于高框架速实时边缘下调 2507.19133v1 |
Authors (12): Wei-Hsing Huang, Cheng-Jhih Shih, Jian-Wei Su, Samuel Wade Wang, Vaidehi Garg, Yuyao Kong, Jen-Chun Tien, Nealson Li, Arijit Raychowdhury, Meng-Fan Chang, Yingyan Lin, Shimeng Yu
Dynamic 3D Gaussian splatting (3DGS) extends static 3DGS to render dynamic scenes, enabling AR/VR applications with moving objects. However, implementing dynamic 3DGS on edge devices faces challenges: (1) Loading all Gaussian parameters from DRAM for frustum culling incurs high energy costs. (2) Increased parameters for dynamic scenes elevate sorting latency and energy consumption. (3) Limited on-chip buffer capacity with higher parameters reduces buffer reuse, causing frequent DRAM access. (4) Dynamic 3DGS operations are not readily compatible with digital compute-in-memory (DCIM). These challenges hinder real-time performance and power efficiency on edge devices, leading to reduced battery life or requiring bulky batteries. To tackle these challenges, we propose algorithm-hardware co-design techniques. At the algorithmic level, we introduce three optimizations: (1) DRAM-access reduction frustum culling to lower DRAM access overhead, (2) Adaptive tile grouping to enhance on-chip buffer reuse, and (3) Adaptive interval initialization Bucket-Bitonic sort to reduce sorting latency. At the hardware level, we present a DCIM-friendly computation flow that is evaluated using the measured data from a 16nm DCIM prototype chip. Our experimental results on Large-Scale Real-World Static/Dynamic Datasets demonstrate the ability to achieve high frame rate real-time rendering exceeding 200 frames per second (FPS) with minimal power consumption, merely 0.28 W for static Large-Scale Real-World scenes and 0.63 W for dynamic Large-Scale Real-World scenes. This work successfully addresses the significant challenges of implementing static/dynamic 3DGS technology on resource-constrained edge devices.
动态3D高斯泼溅(3DGS)将静态3DGS扩展到动态场景渲染,使包含运动物体的AR/VR应用成为可能。然而,在边缘设备上实现动态3DGS面临挑战:(1)为进行视锥剔除而从DRAM加载全部高斯参数会带来高能耗。(2)动态场景的参数增多,推高了排序延迟和能耗。(3)片上缓冲容量有限而参数更多,降低了缓冲复用,导致频繁访问DRAM。(4)动态3DGS运算不易与数字存内计算(DCIM)兼容。这些挑战妨碍边缘设备上的实时性能和能效,导致电池续航缩短或需要笨重的电池。为应对这些挑战,我们提出算法-硬件协同设计技术。在算法层面,我们引入三项优化:(1)减少DRAM访问的视锥剔除,以降低DRAM访问开销;(2)自适应图块分组,以增强片上缓冲复用;(3)自适应区间初始化的Bucket-Bitonic排序,以降低排序延迟。在硬件层面,我们提出一种对DCIM友好的计算流程,并使用16nm DCIM原型芯片的实测数据进行评估。在大规模真实世界静态/动态数据集上的实验结果表明,该方案能够以极低功耗实现超过每秒200帧(FPS)的高帧率实时渲染,静态大规模真实场景仅需0.28 W,动态大规模真实场景仅需0.63 W。这项工作成功应对了在资源受限的边缘设备上实现静态/动态3DGS技术的重大挑战。
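The bucket-plus-bitonic sorting idea can be sketched as follows. This version assumes uniform depth intervals for the buckets and pads each bucket to a power of two before running a bitonic network; the adaptive interval initialization described in the abstract is not reproduced, and the function names and depth values are hypothetical.

```python
import math

def bitonic_sort(values):
    """Sort ascending with a bitonic network; len(values) must be a power of two."""
    data = list(values)
    n = len(data)
    k = 2
    while k <= n:                      # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:                  # compare-exchange distance within a merge
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    out_of_order = (data[i] > data[partner]) if ascending \
                        else (data[i] < data[partner])
                    if out_of_order:
                        data[i], data[partner] = data[partner], data[i]
            j //= 2
        k *= 2
    return data

def bucket_bitonic_sort(depths, num_buckets=4):
    """Split a non-empty list of depths into uniform buckets, then sort each bucket."""
    lo, hi = min(depths), max(depths)
    span = (hi - lo) or 1.0
    buckets = [[] for _ in range(num_buckets)]
    for d in depths:
        idx = min(num_buckets - 1, int((d - lo) / span * num_buckets))
        buckets[idx].append(d)
    result = []
    for bucket in buckets:
        if not bucket:
            continue
        size = 1 << (len(bucket) - 1).bit_length()          # next power of two
        padded = bucket + [math.inf] * (size - len(bucket))  # padding sorts to the end
        result.extend(bitonic_sort(padded)[:len(bucket)])
    return result

# Hypothetical per-Gaussian depth values for one tile.
depths = [5.2, 0.3, 9.9, 4.1, 7.7, 2.5, 0.1, 6.0, 3.3]
ordered = bucket_bitonic_sort(depths)
```

Bucketing keeps each bitonic network small, and the fixed compare-exchange pattern of a bitonic network is data independent, which is why it maps well to parallel hardware.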
Article 23
Title@2025-07-25 (5): Stella Nera: A Differentiable Maddness-Based Hardware Accelerator for Efficient Approximate Matrix Multiplication
Title: Stella Nera: A Differentiable Maddness-Based Hardware Accelerator for Efficient Approximate Matrix Multiplication | Stella Nera: Ein differenzierter Maddness-basierter Hardware-Beschleuniger für eine effiziente, annähernde Matrix-Multiplikation | Stella Nera: 高效近光矩阵乘法的有区别的基于 Maddness 的硬件加速器 2311.10207v2 |
Authors (5): Jannis Schönleber, Lukas Cavigelli, Matteo Perotti, Luca Benini, Renzo Andri
Artificial intelligence has surged in recent years, with advancements in machine learning rapidly impacting nearly every area of life. However, the growing complexity of these models has far outpaced advancements in available hardware accelerators, leading to significant computational and energy demands, primarily due to matrix multiplications, which dominate the compute workload. Maddness (i.e., Multiply-ADDitioN-lESS) presents a hash-based version of product quantization, which renders matrix multiplications into lookups and additions, eliminating the need for multipliers entirely. We present Stella Nera, the first Maddness-based accelerator achieving an energy efficiency of 161 TOp/s/W@0.55V, 25x better than conventional MatMul accelerators due to its small components and reduced computational complexity. We further enhance Maddness with a differentiable approximation, allowing for gradient-based fine-tuning and achieving an end-to-end performance of 92.5% Top-1 accuracy on CIFAR-10.
近年来人工智能迅猛发展,机器学习的进步迅速影响到生活的几乎每个领域。然而,这些模型日益增长的复杂度已远远超过现有硬件加速器的进步,带来巨大的计算和能耗需求,其主要原因是在计算负载中占主导地位的矩阵乘法。Maddness(即 Multiply-ADDitioN-lESS)是一种基于哈希的乘积量化方法,它把矩阵乘法转化为查表和加法,从而完全不需要乘法器。我们提出 Stella Nera,这是首个基于 Maddness 的加速器,能效达到 161 TOp/s/W@0.55V,得益于其小巧的组件和更低的计算复杂度,比传统 MatMul 加速器高 25 倍。我们还通过可微分近似进一步改进 Maddness,使其支持基于梯度的微调,并在 CIFAR-10 上实现 92.5% 的端到端 Top-1 准确率。
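The lookup-and-add idea in the abstract can be illustrated with a minimal product-quantization sketch. This is not the Stella Nera design: the encoder below picks the nearest prototype (and therefore still multiplies), whereas Maddness, per the abstract, uses a hash-based encoder. All function names, the prototypes, and the input values are assumptions made up for illustration.

```python
# Sketch of product-quantization matmul: at inference, a dot product becomes
# one table lookup per codebook plus additions.
# Assumption: nearest-prototype encoding stands in for Maddness's hash function.

def split(vec, num_codebooks):
    """Split a vector into equally sized subvectors, one per codebook."""
    step = len(vec) // num_codebooks
    return [vec[i * step:(i + 1) * step] for i in range(num_codebooks)]

def encode(vec, prototypes):
    """Map each subvector to the index of its nearest prototype."""
    codes = []
    for sub, protos in zip(split(vec, len(prototypes)), prototypes):
        dists = [sum((s - p) ** 2 for s, p in zip(sub, proto)) for proto in protos]
        codes.append(dists.index(min(dists)))
    return codes

def build_tables(prototypes, weight_column):
    """Offline step: precompute dot(prototype, weight subvector) for every prototype."""
    w_subs = split(weight_column, len(prototypes))
    return [[sum(p * w for p, w in zip(proto, w_sub)) for proto in protos]
            for protos, w_sub in zip(prototypes, w_subs)]

def approx_dot(codes, tables):
    """Inference step: lookups and additions only."""
    return sum(table[code] for code, table in zip(codes, tables))

# Two codebooks with two hypothetical prototypes each.
prototypes = [[[0, 0], [1, 1]], [[0, 0], [2, 2]]]
weights = [1, 2, 3, 4]
x = [1.0, 0.9, 2.1, 2.0]

tables = build_tables(prototypes, weights)
codes = encode(x, prototypes)
approx = approx_dot(codes, tables)
exact = sum(a * b for a, b in zip(x, weights))
```

With these toy values the approximation lands close to the exact dot product; how close it gets in practice depends on how well the prototypes fit the data.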
Article 24
Title@2025-07-25 (5): GENIAL: Generative Design Space Exploration via Network Inversion for Low Power Algorithmic Logic Units
Title: GENIAL: Generative Design Space Exploration via Network Inversion for Low Power Algorithmic Logic Units | GENIAL: Generative Design Space Exploration über Netzwerk-Inversion für stromsparende algorithmische Logikeinheiten | GENIAL:通过网络反演进行生成式设计空间探索,用于低功耗算法逻辑单元 2507.18989v1 |
Authors (5): Maxence Bouvier, Ryan Amaudruz, Felix Arnold, Renzo Andri, Lukas Cavigelli
As AI workloads proliferate, optimizing arithmetic units is becoming increasingly important to reduce the footprint of digital systems. Conventional design flows, which often rely on manual or heuristics-based optimization, are limited in their ability to thoroughly explore the vast design space. In this paper, we introduce GENIAL, a machine learning-based framework for the automatic generation and optimization of arithmetic units, more specifically multipliers. At the core of GENIAL is a Transformer-based surrogate model trained in two stages, involving self-supervised pretraining followed by supervised finetuning, to robustly forecast key hardware metrics such as power and area from abstracted design representations. By inverting the surrogate model, GENIAL efficiently searches for new operand encodings that directly minimize power consumption in arithmetic units for specific input data distributions. Extensive experiments on large datasets demonstrate that GENIAL is consistently more sample-efficient than other methods, and converges faster towards optimized designs. This makes it possible to deploy a high-effort logic synthesis optimization flow in the loop, improving the accuracy of the surrogate model. Notably, GENIAL automatically discovers encodings that achieve up to 18% switching activity savings within multipliers on representative AI workloads compared with the conventional two’s complement. We also demonstrate the versatility of our approach by achieving significant improvements on Finite State Machines, highlighting GENIAL’s applicability for a wide spectrum of logic functions. Together, these advances mark a significant step toward automated Quality-of-Results-optimized combinational circuit generation for digital systems.
随着 AI 工作负载的激增,优化算术单元对于减小数字系统的开销变得日益重要。传统设计流程通常依赖人工或基于启发式的优化,难以充分探索庞大的设计空间。本文提出 GENIAL,一个基于机器学习的框架,用于自动生成和优化算术单元,具体而言是乘法器。GENIAL 的核心是一个基于 Transformer 的代理模型,它分两个阶段训练,先进行自监督预训练,再进行有监督微调,从而根据抽象的设计表示稳健地预测功耗和面积等关键硬件指标。通过对代理模型求逆,GENIAL 可以高效搜索新的操作数编码,针对特定的输入数据分布直接最小化算术单元的功耗。在大型数据集上的大量实验表明,GENIAL 的样本效率始终高于其他方法,并能更快地收敛到优化的设计。这使得在循环中部署高投入的逻辑综合优化流程成为可能,从而提高代理模型的精度。值得注意的是,与传统的二进制补码相比,GENIAL 自动发现的编码在代表性 AI 工作负载上可使乘法器内部的翻转活动最多减少 18%。我们还在有限状态机上取得了显著改进,展示了该方法的通用性,说明 GENIAL 适用于广泛的逻辑函数。总之,这些进展标志着向面向数字系统、以结果质量(Quality-of-Results)为优化目标的组合电路自动生成迈出了重要一步。
Article 25
Title@2025-07-25 (5): RailX: A Flexible, Scalable, and Low-Cost Network Architecture for Hyper-Scale LLM Training Systems
Title: RailX: A Flexible, Scalable, and Low-Cost Network Architecture for Hyper-Scale LLM Training Systems | RailX: Eine flexible, skalierbare und kostengünstige Netzwerkarchitektur für Hyper-Scale LLM Trainingssysteme | RailX:面向超大规模 LLM 训练系统的灵活、可扩展、低成本网络架构 2507.18889v1 |
Authors (8): Yinxiao Feng, Tiancheng Chen, Yuchen Wei, Siyuan Shen, Shiju Wang, Wei Li, Kaisheng Ma, Torsten Hoefler
Increasingly large AI workloads are calling for hyper-scale infrastructure; however, traditional interconnection network architecture is neither scalable nor cost-effective enough. Tree-based topologies such as the Rail-optimized network are extremely expensive, while direct topologies such as Torus have insufficient bisection bandwidth and flexibility. In this paper, we propose RailX, a reconfigurable network architecture based on intra-node direct connectivity and inter-node circuit switching. Nodes and optical switches are physically 2D-organized, achieving better scalability than existing centralized circuit switching networks. We propose a novel interconnection method based on Hamiltonian Decomposition theory to organize separate rail-based rings into an all-to-all topology, simultaneously optimizing ring-collective and all-to-all communication. More than 100K chips with hyper bandwidth can be interconnected with a flat switching layer, and the diameter is only 2 to 4 inter-node hops. The network cost per injection/All-Reduce bandwidth of RailX is less than 10% of the Fat-Tree, and the cost per bisection/All-to-All bandwidth is less than 50% of the Fat-Tree. Specifically, only about $1.3B is required to interconnect 200K chips with 1.8TB bandwidth. RailX can also be used in the ML-as-a-service (MLaaS) scenario, where single or multiple training workloads with various shapes, scales, and parallelism strategies can be flexibly mapped, and failures can be worked around.
规模日益增大的 AI 工作负载需要超大规模基础设施;然而,传统的互连网络架构既不够可扩展,也不够经济。Rail-optimized 网络等基于树的拓扑极其昂贵,而 Torus 等直连拓扑的对分带宽和灵活性不足。本文提出 RailX,一种基于节点内直连和节点间电路交换的可重构网络架构。节点和光交换机在物理上按二维方式组织,比现有的集中式电路交换网络具有更好的可扩展性。我们提出一种基于 Hamiltonian Decomposition(哈密顿分解)理论的新型互连方法,把各自独立的基于 rail 的环组织成 all-to-all 拓扑,同时优化环形集合通信和 all-to-all 通信。仅用一层扁平交换层即可互连超过 10 万颗具有超高带宽的芯片,网络直径仅为 2 至 4 个节点间跳数。RailX 每单位注入/All-Reduce 带宽的网络成本不到 Fat-Tree 的 10%,每单位对分/All-to-All 带宽的成本不到 Fat-Tree 的 50%。具体而言,互连 20 万颗带宽为 1.8TB 的芯片仅需约 13 亿美元。RailX 还可用于 ML-as-a-service(MLaaS)场景,灵活映射具有不同形状、规模和并行策略的单个或多个训练负载,并能绕开故障。
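To make the link between rings and all-to-all connectivity concrete, the sketch below builds a textbook Hamiltonian decomposition (the Walecki construction) of a complete graph on 2m+1 nodes into m edge-disjoint rings. It only illustrates the underlying graph-theory fact; the abstract does not say which construction RailX uses, and the function names are hypothetical.

```python
# Walecki-style decomposition of the complete graph on 2m+1 nodes into
# m edge-disjoint Hamiltonian cycles ("rings").
# Assumption: textbook construction for illustration, not RailX's actual method.

def hamiltonian_rings(m):
    """Return m rings that together use every edge of K_(2m+1) exactly once."""
    n = 2 * m                      # nodes 0..n-1 sit on a circle; node n is a hub
    zigzag = [0]
    for k in range(1, m):
        zigzag += [k, n - k]       # alternate clockwise and counter-clockwise
    zigzag.append(m)
    rings = []
    for shift in range(m):
        path = [(v + shift) % n for v in zigzag]   # rotate the zigzag path
        rings.append([n] + path)                   # the hub closes it into a ring
    return rings

def ring_edges(ring):
    """Undirected edges of a cycle given as a node list."""
    return {frozenset((ring[i], ring[(i + 1) % len(ring)]))
            for i in range(len(ring))}

def covers_all_pairs(m):
    """Check that the rings are edge-disjoint and cover all node pairs."""
    rings = hamiltonian_rings(m)
    edges = [e for ring in rings for e in ring_edges(ring)]
    total_pairs = (2 * m + 1) * (2 * m) // 2
    return len(edges) == total_pairs and len(set(edges)) == total_pairs

rings_for_5_nodes = hamiltonian_rings(2)
```

Because every pair of nodes shares exactly one ring edge, the union of the rings is a full all-to-all graph, while each individual ring remains usable for ring-style collectives.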
Article 26
Title@2025-07-25 (5): GCC: A 3DGS Inference Architecture with Gaussian-Wise and Cross-Stage Conditional Processing
Title: GCC: A 3DGS Inference Architecture with Gaussian-Wise and Cross-Stage Conditional Processing | GCC: Eine 3DGS-Inferenzarchitektur mit Gauß-weiser und stufenübergreifender bedingter Verarbeitung | GCC:采用逐高斯与跨阶段条件处理的 3DGS 推理架构 2507.15300v3 |
Authors (9): Minnan Pei, Gang Li, Junwen Si, Zeyu Zhu, Zitao Mo, Peisong Wang, Zhuoran Song, Xiaoyao Liang, Jian Cheng
3D Gaussian Splatting (3DGS) has emerged as a leading neural rendering technique for high-fidelity view synthesis, prompting the development of dedicated 3DGS accelerators for resource-constrained platforms. The conventional decoupled preprocessing-rendering dataflow in existing accelerators has two major limitations: 1) a significant portion of preprocessed Gaussians are not used in rendering, and 2) the same Gaussian gets repeatedly loaded across different tile renderings, resulting in substantial computational and data movement overhead. To address these issues, we propose GCC, a novel accelerator designed for fast and energy-efficient 3DGS inference. GCC introduces a novel dataflow featuring: 1) cross-stage conditional processing, which interleaves preprocessing and rendering to dynamically skip unnecessary Gaussian preprocessing; and 2) Gaussian-wise rendering, ensuring that all rendering operations for a given Gaussian are completed before moving to the next, thereby eliminating duplicated Gaussian loading. We also propose an alpha-based boundary identification method to derive compact and accurate Gaussian regions, thereby reducing rendering costs. We implement our GCC accelerator in 28nm technology. Extensive experiments demonstrate that GCC significantly outperforms the state-of-the-art 3DGS inference accelerator, GSCore, in both performance and energy efficiency.
3D Gaussian Splatting (3DGS) 已成为高保真视图合成的主流神经渲染技术,推动了面向资源受限平台的专用 3DGS 加速器的发展。现有加速器采用传统的预处理与渲染解耦的数据流,存在两大局限:1)相当一部分经过预处理的高斯并未用于渲染;2)同一个高斯在不同图块的渲染中被重复加载,造成大量计算和数据搬运开销。为解决这些问题,我们提出 GCC,一种为快速、高能效 3DGS 推理设计的新型加速器。GCC 引入了一种新的数据流,其特点是:1)跨阶段条件处理(cross-stage conditional processing),将预处理与渲染交错进行,动态跳过不必要的高斯预处理;2)逐高斯渲染(Gaussian-wise rendering),确保在处理下一个高斯之前完成当前高斯的全部渲染操作,从而消除高斯的重复加载。我们还提出一种基于 alpha 的边界识别方法,以得到紧凑而准确的高斯区域,从而降低渲染成本。我们采用 28nm 工艺实现了 GCC 加速器。大量实验表明,GCC 在性能和能效上均显著优于最先进的 3DGS 推理加速器 GSCore。
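The second limitation named in the abstract (repeated loading of the same Gaussian across tiles) can be seen in a small counting sketch. It assumes each Gaussian is reduced to an axis-aligned pixel bounding box and that a tile-wise renderer fetches a Gaussian once per overlapped tile; the tile size, the boxes, and the function names are invented for illustration and say nothing about GCC's actual hardware.

```python
# Counting Gaussian loads: tile-wise rendering vs Gaussian-wise rendering.
# Assumptions: each Gaussian is an axis-aligned bounding box in pixels;
# tile size and scene are made up for illustration.

def tiles_covered(box, tile):
    """Tiles (tx, ty) overlapped by box = (x0, y0, x1, y1), with x1/y1 exclusive."""
    x0, y0, x1, y1 = box
    return {(tx, ty)
            for tx in range(x0 // tile, (x1 - 1) // tile + 1)
            for ty in range(y0 // tile, (y1 - 1) // tile + 1)}

def tile_wise_loads(boxes, tile):
    """Tile-wise: a Gaussian is fetched once for every tile it overlaps."""
    return sum(len(tiles_covered(box, tile)) for box in boxes)

def gaussian_wise_loads(boxes):
    """Gaussian-wise: each Gaussian is fetched once and finished before the next."""
    return len(boxes)

boxes = [(0, 0, 40, 40), (10, 10, 14, 14), (30, 0, 34, 20)]
loads_tile = tile_wise_loads(boxes, tile=16)
loads_gaussian = gaussian_wise_loads(boxes)
```

In this toy scene the large Gaussian alone spans nine 16x16 tiles, which is where most of the redundant fetches come from.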
Article 27
Title@2025-07-24 (4): PRACtical: Subarray-Level Counter Update and Bank-Level Recovery Isolation for Efficient PRAC Rowhammer Mitigation
Title: PRACtical: Subarray-Level Counter Update and Bank-Level Recovery Isolation for Efficient PRAC Rowhammer Mitigation | PRACtical: Subarray-Level Counter Update und Bank-Level Recovery Isolation für effiziente PRAC Rowhammer Mitigation | PRACtical:面向高效 PRAC Rowhammer 缓解的子阵列级计数器更新与 bank 级恢复隔离 2507.18581v1 |
Authors (4): Ravan Nazaraliyev, Saber Ganjisaffar, Nurlan Nazaraliyev, Nael Abu-Ghazaleh
As DRAM density increases, Rowhammer becomes more severe due to heightened charge leakage, reducing the number of activations needed to induce bit flips. The DDR5 standard addresses this threat with in-DRAM per-row activation counters (PRAC) and the Alert Back-Off (ABO) signal to trigger mitigation. However, PRAC adds performance overhead by incrementing counters during the precharge phase, and recovery refreshes stall the entire memory channel, even if only one bank is under attack. We propose PRACtical, a performance-optimized approach to PRAC+ABO that maintains the same security guarantees. First, we reduce counter update latency by introducing a centralized increment circuit, enabling overlap between counter updates and subsequent row activations in other subarrays. Second, we enhance the $RFM_{ab}$ mitigation by enabling bank-level granularity: instead of stalling the entire channel, only affected banks are paused. This is achieved through a DRAM-resident register that identifies attacked banks. PRACtical improves performance by 8% on average (up to 20%) over the state-of-the-art, reduces energy by 19%, and limits performance degradation from aggressive performance attacks to less than 6%, all while preserving Rowhammer protection.
随着 DRAM 密度的提高,电荷泄漏加剧,Rowhammer 问题变得更加严重,诱发比特翻转所需的激活次数随之减少。DDR5 标准通过 DRAM 内的每行激活计数器(PRAC)和 Alert Back-Off(ABO)信号触发缓解措施来应对这一威胁。然而,PRAC 在预充电阶段递增计数器,带来性能开销;而恢复刷新会使整个内存通道停顿,即使只有一个 bank 受到攻击。我们提出 PRACtical,一种针对 PRAC+ABO 的性能优化方案,同时保持相同的安全保证。首先,我们引入集中式递增电路来降低计数器更新延迟,使计数器更新可以与其他子阵列中后续的行激活重叠。其次,我们以 bank 级粒度增强 $RFM_{ab}$ 缓解机制:不再停顿整个通道,而只暂停受影响的 bank。这是通过一个驻留在 DRAM 中、用于标识受攻击 bank 的寄存器实现的。与现有最先进方案相比,PRACtical 的性能平均提升 8%(最高 20%),能耗降低 19%,并将激进的性能攻击造成的性能下降限制在 6% 以内,同时保持对 Rowhammer 的防护。
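The difference between channel-wide and bank-level recovery can be sketched with a toy counter model. Everything here is a simplification under stated assumptions: the class, method names, and threshold are hypothetical, timing is not modeled, and the real PRAC/ABO protocol in the DDR5 standard is considerably more involved. The `attacked_banks` set only stands in for the DRAM-resident register the abstract mentions.

```python
# Toy model of per-row activation counting with an alert threshold.
# Assumptions: threshold value, class and method names are hypothetical;
# command timing and refresh behavior are not modeled.
from collections import defaultdict

class PracModel:
    def __init__(self, alert_threshold):
        self.alert_threshold = alert_threshold
        self.counters = defaultdict(int)   # (bank, row) -> activation count
        self.attacked_banks = set()        # stand-in for the DRAM-resident register

    def activate(self, bank, row):
        """Count one activation; return True if it raises an alert."""
        self.counters[(bank, row)] += 1
        if self.counters[(bank, row)] >= self.alert_threshold:
            self.attacked_banks.add(bank)
            return True
        return False

    def recover(self, num_banks, bank_level):
        """Reset counters that reached the threshold; return the paused banks."""
        paused = set(self.attacked_banks) if bank_level else set(range(num_banks))
        for key, count in self.counters.items():
            if count >= self.alert_threshold:
                self.counters[key] = 0
        self.attacked_banks.clear()
        return paused

def hammer(bank_level):
    """Hammer one row in bank 2 and report alerts plus the banks paused for recovery."""
    model = PracModel(alert_threshold=4)
    model.activate(bank=0, row=3)          # unrelated access in another bank
    alerts = [model.activate(bank=2, row=7) for _ in range(4)]
    return alerts, model.recover(num_banks=8, bank_level=bank_level)

alerts_channel, paused_channel = hammer(bank_level=False)
alerts_bank, paused_bank = hammer(bank_level=True)
```

In the channel-wide case all eight banks are paused for one hammered row, whereas the bank-level variant pauses only the bank recorded as attacked.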
Article 28
Title@2025-07-24 (4): DiP: A Scalable, Energy-Efficient Systolic Array for Matrix Multiplication Acceleration
Title: DiP: A Scalable, Energy-Efficient Systolic Array for Matrix Multiplication Acceleration | DiP: Ein skalierbares, energieeffizientes systolisches Array für Matrix-Multiplikationsbeschleunigung | DiP:用于矩阵乘法加速的可扩展、高能效脉动阵列 2412.09709v3 |
Authors (3): Ahmed J. Abdelmaksoud, Shady Agwa, Themis Prodromakis
Transformers are gaining increasing attention across Natural Language Processing (NLP) application domains due to their outstanding accuracy. However, these data-intensive models add significant performance demands to the existing computing architectures. Systolic array architectures, adopted by commercial AI computing platforms like Google TPUs, offer energy-efficient data reuse but face throughput and energy penalties due to input-output synchronization via First-In-First-Out (FIFO) buffers. This paper proposes a novel scalable systolic array architecture featuring Diagonal-Input and Permutated weight stationary (DiP) dataflow for matrix multiplication acceleration. The proposed architecture eliminates the synchronization FIFOs required by state-of-the-art weight stationary systolic arrays. Beyond the area, power, and energy savings achieved by eliminating these FIFOs, DiP architecture maximizes the computational resource utilization, achieving up to 50% throughput improvement over conventional weight stationary architectures. Analytical models are developed for both weight stationary and DiP architectures, including latency, throughput, time to full PEs utilization (TFPU), and FIFOs overhead. A comprehensive hardware design space exploration using 22nm commercial technology demonstrates DiP’s scalability advantages, achieving up to a 2.02x improvement in energy efficiency per area. Furthermore, DiP outperforms TPU-like architectures on transformer workloads from widely-used models, delivering energy improvement up to 1.81x and latency improvement up to 1.49x. At a 64x64 size with 4096 PEs, DiP achieves a peak throughput of 8.192 TOPS with energy efficiency 9.548 TOPS/W.
Transformer 凭借出色的准确率,在自然语言处理(NLP)的各个应用领域受到越来越多的关注。然而,这些数据密集型模型给现有计算架构带来了很高的性能要求。Google TPU 等商用 AI 计算平台采用的脉动阵列架构能够实现高能效的数据复用,但由于需要通过先进先出(FIFO)缓冲区进行输入输出同步,会带来吞吐量和能耗上的损失。本文提出一种新型可扩展脉动阵列架构,采用 Diagonal-Input and Permutated weight stationary(DiP)数据流来加速矩阵乘法。该架构消除了现有最先进的权重固定(weight stationary)脉动阵列所需的同步 FIFO。除了去掉这些 FIFO 带来的面积、功耗和能耗节省之外,DiP 架构还最大化了计算资源利用率,相比传统权重固定架构吞吐量最多提升 50%。我们为权重固定架构和 DiP 架构建立了分析模型,涵盖延迟、吞吐量、达到 PE 满利用率的时间(TFPU)以及 FIFO 开销。基于 22nm 商用工艺的全面硬件设计空间探索展示了 DiP 在可扩展性方面的优势,单位面积能效最多提升 2.02 倍。此外,在来自广泛使用的模型的 Transformer 工作负载上,DiP 优于类 TPU 架构,能耗最多改善 1.81 倍,延迟最多改善 1.49 倍。在 64x64 规模(4096 个 PE)下,DiP 的峰值吞吐量达到 8.192 TOPS,能效为 9.548 TOPS/W。
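The headline figures in the abstract can be cross-checked with a few lines of arithmetic. The per-PE rate and the implied power follow directly from the stated numbers; the reading that each PE performs one multiply-accumulate (counted as 2 operations) per cycle at 1 GHz is an assumption, since the abstract does not state a clock frequency.

```python
# Consistency check on the DiP abstract's peak figures.
# Assumption: "1 MAC = 2 ops per cycle at 1 GHz" is one reading that fits the
# numbers; the clock frequency is not given in the abstract.

pes = 64 * 64                               # 64x64 systolic array
peak_ops_per_s = 8.192e12                   # 8.192 TOPS
ops_per_pe_per_s = peak_ops_per_s / pes     # operations per PE per second

efficiency_tops_per_w = 9.548
implied_power_w = 8.192 / efficiency_tops_per_w   # peak throughput / efficiency

print(f"{pes} PEs, {ops_per_pe_per_s:.1e} ops/s per PE, "
      f"about {implied_power_w:.3f} W at peak")
```

The 2e9 operations per PE per second would match one MAC per cycle at 1 GHz, and the two stated figures together imply a little under 0.86 W at peak throughput.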
Article 29
Title@2025-07-24 (4): GNN-ACLP: Graph Neural Networks Based Analog Circuit Link Prediction
Title: GNN-ACLP: Graph Neural Networks Based Analog Circuit Link Prediction | GNN-ACLP: Auf Graph Neural Networks basierende Link-Vorhersage für Analogschaltungen | GNN-ACLP:基于图神经网络的模拟电路链接预测 2504.10240v4 |
Authors (9): Guanyuan Pan, Tiansheng Zhou, Bingtao Ma, Yaqi Wang, Jianxiang Zhao, Zhi Li, Yugui Lin, Pietro Lio, Shuai Wang
Circuit link prediction, which identifies missing component connections from incomplete netlists, is crucial in analog circuit design automation. However, existing methods face three main challenges: 1) Insufficient use of topological patterns in circuit graphs reduces prediction accuracy; 2) Data scarcity due to the complexity of annotations hinders model generalization; 3) Limited adaptability to various netlist formats. We propose GNN-ACLP, a method based on graph neural networks (GNNs) featuring three innovations to tackle these challenges. First, we introduce the SEAL (learning from Subgraphs, Embeddings, and Attributes for Link prediction) framework and achieve port-level accuracy in circuit link prediction. Second, we propose Netlist Babel Fish, a netlist format conversion tool leveraging retrieval-augmented generation (RAG) with a large language model (LLM) to improve the compatibility of netlist formats. Finally, we construct SpiceNetlist, a comprehensive dataset that contains 775 annotated circuits across 10 different component classes. Experiments demonstrate accuracy improvements of 16.08% on SpiceNetlist, 11.38% on Image2Net, and 16.01% on Masala-CHAI compared to the baseline in intra-dataset evaluation, while maintaining accuracy from 92.05% to 99.07% in cross-dataset evaluation, exhibiting robust feature transfer capabilities.
电路链接预测旨在从不完整的网表中找出缺失的元件连接,在模拟电路设计自动化中至关重要。然而,现有方法面临三大挑战:1)对电路图中拓扑模式利用不足,降低了预测精度;2)标注复杂导致数据稀缺,妨碍模型泛化;3)对各种网表格式的适应性有限。我们提出 GNN-ACLP,一种基于图神经网络(GNN)的方法,通过三项创新应对这些挑战。首先,我们引入 SEAL(learning from Subgraphs, Embeddings, and Attributes for Link prediction)框架,在电路链接预测中达到端口级精度。其次,我们提出 Netlist Babel Fish,一种利用检索增强生成(RAG)和大语言模型(LLM)的网表格式转换工具,以提高网表格式的兼容性。最后,我们构建了 SpiceNetlist,这是一个包含 775 个带标注电路、涵盖 10 种不同元件类别的综合数据集。实验表明,在数据集内评估中,与基线相比,准确率在 SpiceNetlist 上提升 16.08%,在 Image2Net 上提升 11.38%,在 Masala-CHAI 上提升 16.01%;在跨数据集评估中,准确率保持在 92.05% 至 99.07% 之间,显示出稳健的特征迁移能力。
Article 30
Title@2025-07-24 (4): A 55-nm SRAM Chip Scanning Errors Every 125 ns for Event-Wise Soft Error Measurement
Title: A 55-nm SRAM Chip Scanning Errors Every 125 ns for Event-Wise Soft Error Measurement | Ein 55-nm-SRAM-Chip, der alle 125 ns nach Fehlern scannt, für die ereignisweise Soft-Error-Messung | 每 125 ns 扫描一次错误的 55 nm SRAM 芯片,用于逐事件软错误测量 2504.08305v2 |
Authors (7): Yuibi Gomi, Akira Sato, Waleed Madany, Kenichi Okada, Satoshi Adachi, Masatoshi Itoh, Masanori Hashimoto
We developed a 55 nm CMOS SRAM chip that scans all data every 125 ns and outputs timestamped soft error data via an SPI interface through a FIFO. The proposed system, consisting of the developed chip and particle detectors, enables event-wise soft error measurement and precise identification of SBUs and MCUs, thus resolving misclassifications such as Pseudo- and Distant MCUs that conventional methods cannot distinguish. An 80-MeV proton irradiation experiment at RARiS, Tohoku University verified the system operation. Timestamps between the SRAM chip and the particle detectors were successfully synchronized, accounting for PLL disturbances caused by radiation. Event building was achieved by determining a reset offset with sub-ns resolution, and spatial synchronization was maintained within several tens of micrometers.
我们开发了一款 55 nm CMOS SRAM 芯片,它每 125 ns 扫描一次全部数据,并经由 FIFO 通过 SPI 接口输出带时间戳的软错误数据。所提出的系统由该芯片和粒子探测器组成,能够进行逐事件的软错误测量,并精确识别 SBU 和 MCU,从而解决传统方法无法区分的 Pseudo-MCU 和 Distant MCU 等误分类问题。在东北大学(Tohoku University)RARiS 进行的 80-MeV 质子辐照实验验证了系统的运行。在考虑辐射引起的 PLL 扰动的情况下,SRAM 芯片与粒子探测器之间的时间戳成功实现了同步。通过以亚纳秒分辨率确定复位偏移完成了事件构建,空间同步精度保持在数十微米以内。
Article 31
Title@2025-07-24 (4): Explicit Sign-Magnitude Encoders Enable Power-Efficient Multipliers
Title: Explicit Sign-Magnitude Encoders Enable Power-Efficient Multipliers | Explizite Vorzeichen-Betrag-Encoder ermöglichen energieeffiziente Multiplizierer | 显式符号-幅值编码器实现高能效乘法器 2507.18179v1 |
Authors (5): Felix Arnold, Maxence Bouvier, Ryan Amaudruz, Renzo Andri, Lukas Cavigelli
This work presents a method to maximize power-efficiency of fixed-point multiplier units by decomposing them into sub-components. First, an encoder block converts the operands from a two’s complement to a sign-magnitude representation, followed by a multiplier module which performs the compute operation and outputs the resulting value in the original format. This makes it possible to leverage the power-efficiency of the sign-magnitude encoding for the multiplication. To ensure the computing format is not altered, those two components are synthesized and optimized separately. Our method leads to significant power savings for input values centered around zero, as commonly encountered in AI workloads. Under a realistic input stream with values normally distributed with a standard deviation of 3.0, post-synthesis simulations of the 4-bit multiplier design show up to 12.9% lower switching activity compared to synthesis without decomposition. Those gains are achieved while ensuring compliance with any production-ready system, as the overall circuit stays logic-equivalent. With the compliance requirement lifted and a slightly smaller input range of -7 to +7, switching activity reductions can reach up to 33%. Additionally, we demonstrate that synthesis optimization methods based on switching-activity-driven design space exploration can yield a further 5-10% improvement in power-efficiency compared to a power-agnostic approach.
本文提出一种方法,通过把定点乘法器单元分解为子组件来最大化其能效。首先,编码器模块把操作数从二进制补码转换为符号-幅值表示,随后由乘法器模块执行计算,并以原始格式输出结果。这样就可以在乘法运算中利用符号-幅值编码的能效优势。为确保计算所用的格式不被改变,这两个组件分别进行综合和优化。对于以零为中心的输入值(在 AI 工作负载中很常见),我们的方法可显著节省功耗。在符合实际的输入流下(数值服从标准差为 3.0 的正态分布),4 位乘法器设计的综合后仿真显示,与不做分解的综合相比,翻转活动最多降低 12.9%。由于整体电路保持逻辑等价,这些收益是在保证可以兼容任何可量产系统的前提下取得的。如果取消这一兼容性约束,并把输入范围略微缩小到 -7 至 +7,翻转活动最多可降低 33%。此外,我们还证明,与不考虑功耗的方法相比,基于翻转活动驱动的设计空间探索的综合优化方法可以使能效再提高 5-10%。
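Why sign-magnitude helps for values centered around zero can be seen by counting bit flips between consecutive operands. The sketch below compares 4-bit two's complement with 4-bit sign-magnitude on a short, made-up input stream. It only counts toggles on the operand bits, which is a rough proxy and not the post-synthesis switching activity reported in the abstract; the helper names and the stream are assumptions.

```python
# Bit toggles between consecutive operands: two's complement vs sign-magnitude.
# Assumptions: 4-bit words, a made-up near-zero input stream, and toggles
# counted on the operand bits only (not inside a multiplier).

BITS = 4

def twos_complement(value):
    """Encode a signed integer in BITS-bit two's complement."""
    return value & ((1 << BITS) - 1)

def sign_magnitude(value):
    """Encode as one sign bit plus BITS-1 magnitude bits (range -7..+7 for 4 bits)."""
    sign = 1 << (BITS - 1) if value < 0 else 0
    return sign | abs(value)

def toggles(stream, encode):
    """Total number of bit flips between consecutive encoded words."""
    words = [encode(v) for v in stream]
    return sum(bin(a ^ b).count("1") for a, b in zip(words, words[1:]))

stream = [1, -1, 2, -2, 0, -1, 1]
flips_twos = toggles(stream, twos_complement)
flips_sm = toggles(stream, sign_magnitude)
```

In two's complement a sign change around zero flips most of the word (for example 0001 to 1111), while in sign-magnitude it mostly flips the sign bit alone (0001 to 1001), which is the effect the decomposition exploits.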
Article 32
Title@2025-07-24 (4): Real-Time Object Detection and Classification using YOLO for Edge FPGAs
Title: Real-Time Object Detection and Classification using YOLO for Edge FPGAs | Echtzeit-Objekterkennung und -Klassifizierung mit YOLO für Edge-FPGAs | 面向边缘 FPGA 的基于 YOLO 的实时目标检测与分类 2507.18174v1 |
Authors (2): Rashed Al Amin, Roman Obermaisser
Object detection and classification are crucial tasks across various application domains, particularly in the development of safe and reliable Advanced Driver Assistance Systems (ADAS). Existing deep learning-based methods such as Convolutional Neural Networks (CNNs), Single Shot Detectors (SSDs), and You Only Look Once (YOLO) have demonstrated high performance in terms of accuracy and computational speed when deployed on Field-Programmable Gate Arrays (FPGAs). However, despite these advances, state-of-the-art YOLO-based object detection and classification systems continue to face challenges in achieving resource efficiency suitable for edge FPGA platforms. To address this limitation, this paper presents a resource-efficient real-time object detection and classification system based on YOLOv5 optimized for FPGA deployment. The proposed system is trained on the COCO and GTSRD datasets and implemented on the Xilinx Kria KV260 FPGA board. Experimental results demonstrate a classification accuracy of 99%, with a power consumption of 3.5W and a processing speed of 9 frames per second (FPS). These findings highlight the effectiveness of the proposed approach in enabling real-time, resource-efficient object detection and classification for edge computing applications.
目标检测与分类是各类应用领域中的关键任务,在开发安全可靠的高级驾驶辅助系统(ADAS)时尤为重要。现有基于深度学习的方法,如卷积神经网络(CNN)、单次检测器(SSD)和 You Only Look Once(YOLO),部署在现场可编程门阵列(FPGA)上时,在准确率和计算速度方面表现出色。然而,尽管取得了这些进展,最先进的基于 YOLO 的目标检测与分类系统在达到适合边缘 FPGA 平台的资源效率方面仍面临挑战。针对这一局限,本文提出一种基于 YOLOv5、针对 FPGA 部署优化的资源高效的实时目标检测与分类系统。该系统在 COCO 和 GTSRD 数据集上训练,并在 Xilinx Kria KV260 FPGA 板上实现。实验结果表明,其分类准确率为 99%,功耗为 3.5W,处理速度为每秒 9 帧(FPS)。这些结果说明所提方法能够为边缘计算应用实现实时、资源高效的目标检测与分类。
Article 33
Title@2025-07-24 (4): Designing High-Performance and Thermally Feasible Multi-Chiplet Architectures enabled by Non-bendable Glass Interposer
Title: Designing High-Performance and Thermally Feasible Multi-Chiplet Architectures enabled by Non-bendable Glass Interposer | Entwurf leistungsstarker und thermisch realisierbarer Multi-Chiplet-Architekturen auf Basis nicht biegbarer Glas-Interposer | 基于不可弯曲玻璃中介层的高性能、热可行多芯粒架构设计 2507.18040v1 |
Authors (4): Harsh Sharma, Janardhan Rao Doppa, Umit Y. Ogras, Partha Pratim Pande
Multi-chiplet architectures enabled by glass interposers offer superior electrical performance, enable higher bus widths due to reduced crosstalk, and have lower capacitance in the redistribution layer than current silicon interposer-based systems. These advantages result in lower energy per bit, higher communication frequencies, and extended interconnect range. However, deformation of the package (warpage) in glass interposer-based systems becomes a critical challenge as system size increases, leading to severe mechanical stress and reliability concerns. Beyond a certain size, conventional packaging techniques fail to manage warpage effectively, necessitating new approaches to mitigate warpage-induced bending with scalable performance for glass interposer-based multi-chiplet systems. To address these intertwined challenges, we propose a thermal-, warpage-, and performance-aware design framework that employs architecture and packaging co-optimization. The proposed framework disintegrates the surface and embedded chiplets to balance conflicting design objectives, ensuring optimal trade-offs between performance, power, and structural reliability. Our experiments demonstrate that optimized multi-chiplet architectures from our design framework achieve up to 64.7% performance improvement and 40% power reduction compared to traditional 2.5D systems when executing deep neural network workloads, with lower fabrication costs.
基于玻璃中介层的多芯粒架构具有更优的电气性能,由于串扰减小而支持更大的总线位宽,并且其重布线层的电容低于当前基于硅中介层的系统。这些优势带来更低的每比特能耗、更高的通信频率和更长的互连距离。然而,随着系统尺寸增大,基于玻璃中介层的系统中的封装形变(翘曲)成为关键挑战,会引起严重的机械应力和可靠性问题。超过一定尺寸后,传统封装技术无法有效控制翘曲,因此需要新的方法,在为基于玻璃中介层的多芯粒系统提供可扩展性能的同时,缓解翘曲引起的弯曲。为应对这些相互交织的挑战,我们提出一个兼顾热、翘曲和性能的设计框架,对架构和封装进行协同优化。该框架对表面芯粒和嵌入式芯粒进行拆分,以平衡相互冲突的设计目标,在性能、功耗和结构可靠性之间取得最优折中。实验表明,在执行深度神经网络工作负载时,与传统 2.5D 系统相比,由我们的设计框架优化得到的多芯粒架构性能最多提升 64.7%,功耗降低 40%,且制造成本更低。