cs.DC @ 2025-06-20: 110
-
00 06-18 (3) Federated Learning for MRI-based BrainAGE: a multicenter study on post-stroke functional outcome prediction Föderated Learning for MRI-based BrainAGE: Eine multizentrische Studie zur post-stroke funktionellen Ergebnisvorhersage 为基于MRI的脑力智能学习联合会学习:关于打击后功能性结果预测的多中心研究 2506.15626v1 -
01 06-18 LiteGD: Lightweight and dynamic GPU Dispatching for Large-scale Heterogeneous Clusters LiteGD: Leichte und dynamische GPU Dispatching für großflächige Heterogene Cluster LiteGD: 大型异源集散发轻量和动态GPU 2506.15595v1 -
02 06-18 DAGs for the Masses DAGs für die Massen 质量的 DAG 值 2506.13998v2 -
03 06-18 Automatic Metadata Capture and Processing for High-Performance Workflows Automatische Metadatenerfassung und -verarbeitung für Hochleistungs-Workflows 高绩效工作流程自动获取和处理元元数据 2506.15537v1 -
04 06-18 Minimizing Communication for Parallel Symmetric Tensor Times Same Vector Computation Minimierung der Kommunikation für parallele symmetrische Tensor-Zeiten gleiche Vektor-Computation 最大限度地减少平行对称日光时同步矢量计算通信 2506.15488v1 -
05 06-18 All is Not Lost: LLM Recovery without Checkpoints Alles ist nicht verloren: LLM Erholung ohne Checkpoints 并非全部丢失:LLM 恢复没有检查站 2506.15461v1 -
06 06-18 Parallel Paradigms in Modern HPC: A Comparative Analysis of MPI, OpenMP, and CUDA Parallele Paradigmen im modernen HPC: Eine vergleichende Analyse von MPI, OpenMP und CUDA 现代HPC的平行范例:对MPI、OpenMP和CUDA的比较分析 2506.15454v1 -
07 06-18 Exploring Fast Fourier Transforms on the Tenstorrent Wormhole Schnell Fourier-Transformationen auf dem Tenstorrent Wormhole erkunden 探索登山洞上的快速傅里叶变形 2506.15437v1 -
08 06-18 Computing the Schulze Method for Large-Scale Preference Data Sets Berechnung der Schulze-Methode für großformatige Präferenzdatensätze 计算大尺度优先数据集的平板法 2505.12976v2 -
09 06-18 RISC-V for HPC: An update of where we are and main action points RISC-V für HPC: Ein Update, wo wir sind und die wichtigsten Aktionspunkte HPC的RISC-V:关于我们目前的最新情况和主要行动要点的最新情况 2506.15418v1 -
10 06-18 An Efficient Candidate-Free R-S Set Similarity Join Algorithm with the Filter-and-Verification Tree and MapReduce Eine effiziente, kandidatfreie R-S-Set-Ähnlichkeit Begleiten Sie den Algorithmus mit dem Filter-und-Verifikationsbaum und MapReduce 与过滤和核查树和地图显示的高效无候选人候选人 R-S 设置相似性 2506.03893v2 -
11 06-18 Reconfigurable Intelligent Surface Aided Vehicular Edge Computing: Joint Phase-shift Optimization and Multi-User Power Allocation Rekonfigurierbares intelligentes Surface Aided Vehicular Edge Computing: Joint Phase-Shift Optimization und Multi-User Power Allocation 重新配置的智能地面辅助车辆边缘 电子计算:联合阶段-轮-优化和多用户电力配置 2407.13123v2 -
12 06-18 Serving Large Language Models on Huawei CloudMatrix384 Große Sprachmodelle auf Huawei CloudMatrix384 瓦威云马特列克384 2506.12708v2 -
13 06-18 Centroid Approximation for Byzantine-Tolerant Federated Learning Centroid Approximation für Byzantinisch-Tolerant-Federated Learning 拜占庭 – – 协调联邦学习 2506.15264v1 -
14 06-18 eLLM: Elastic Memory Management Framework for Efficient LLM Serving eLLM: Elastic Memory Management Framework für effizientes LLM Serving eLLM:高效LLM服务 Elastic记忆管理框架 2506.15155v1 -
15 06-18 Parallel Data Object Creation: Towards Scalable Metadata Management in High-Performance I/O Library Parallel Data Object Creation: Auf dem Weg zu einem skalierbaren Metadaten-Management in der Hochleistungs-I/O-Bibliothek 平行数据对象的生成:在高绩效一/O图书馆中实现可缩放元数据管理 2506.15114v1 -
16 06-17 (2) Supporting the development of Machine Learning for fundamental science in a federated Cloud with the AI_INFN platform Unterstützung der Entwicklung von Machine Learning für die Grundlagenforschung in einer föderierten Cloud mit der AI_INFN-Plattform 与AI_INFN平台合作,支持在联合云层中发展用于基础科学的机器学习 2502.21266v2 -
17 06-17 Scaling Intelligence: Designing Data Centers for Next-Gen Language Models Scaling Intelligence: Konzipieren von Rechenzentren für Sprachmodelle der nächsten Generation 扩大情报范围:为下一代语言模型设计数据中心 2506.15006v1 -
18 06-17 Zarr-Based Chunk-Level Cumulative Sums in Reduced Dimensions ZARR-Based Chunk-Level Kumulative Summen in reduzierten Abmessungen 减量尺寸中Zarr 基铜级累计总和 2506.14981v1 -
19 06-17 Exploring Dynamic Load Balancing Algorithms for Block-Structured Mesh-and-Particle Simulations in AMReX Dynamische Lastausgleichsalgorithmen für blockstrukturierte Mesh-and-Particle-Simulationen in AMReX erforschen 探索AMReX 中块结构化网状和粒子模拟的动态负载平衡算法 2505.15122v2 -
20 06-17 Event-Driven Online Vertical Federated Learning Event-getriebenes Online-Vertical-Federated-Learning 在线纵向联邦学习 2506.14911v1 -
21 06-17 Resource Optimization with MPI Process Malleability for Dynamic Workloads in HPC Clusters Ressourcenoptimierung mit MPI-Prozess Malleability für dynamische Workloads in HPC Clustern HPC 集群中动态工作量的 MPI 进程最小性 2506.14743v1 -
22 06-17 SETI@home: Data Acquisition and Front-End Processing SETI@home: Datenerfassung und Front-End-Verarbeitung SETI@home:数据采集和前端处理 2506.14718v1 -
23 06-17 Keigo: Co-designing Log-Structured Merge Key-Value Stores with a Non-Volatile, Concurrency-aware Storage Hierarchy (Extended Version) Keigo: Co-Designing Log-Structured Merge Key-Value Stores mit einer nicht-volatilen, concurrency-aware Speicherhierarchie (erweiterte Version) Keigo:共同设计有非流动、具有通货币通汇存储等级的逻辑结构合并金价库(扩展版本) 2506.14630v1 -
24 06-17 Concepts for designing modern C++ interfaces for MPI Konzepte für die Gestaltung moderner C++-Schnittstellen für MPI 为MPI设计现代 C+++ 界面的概念 2506.14610v1 -
25 06-17 Consensus Power Inequality: A Comparative Study of Blockchain Networks Consensus Power Inequality: Eine vergleichende Studie über Blockchain-Netzwerke 协商一致的 “ 权力不平等:对供应链网络的比较研究 “ 2506.14393v1 -
26 06-17 Decoupling Generation and Evaluation for Parallel Greedy Best-First Search(extended version) Entkoppelung von Generation und Evaluation für parallele Greedy-Best-First-Suche (erweiterte Version) 平行贪婪最佳第一搜索的脱钩生成和评估(扩展版) 2408.05682v2 -
27 06-17 HarMoEny: Efficient Multi-GPU Inference of MoE Models HarMoEny: Effiziente Multi-GPU-Schlussfolgerung von MoE-Modellen HarMoEny:教育部模型的高效多指数指数多推推 2506.12417v2 -
28 06-17 Convergence-Privacy-Fairness Trade-Off in Personalized Federated Learning Convergence-Privacy-Fairness Trade-Off im personalisierten Federated Learning 个人化联邦学习中统一-私人-公平贸易-个人化联邦学习 2506.14251v1 -
29 06-17 A Novel Indicator for Quantifying and Minimizing Information Utility Loss of Robot Teams Ein neuer Indikator für die Quantifizierung und Minimierung von Informationen Dienstprogramm Verlust von Roboter-Teams 计算和尽量减少机器人小组信息效用损失的新指标 2506.14237v1 -
30 06-17 The Redundancy of Full Nodes in Bitcoin: A Network-Theoretic Demonstration of Miner-Centric Propagation Topologies Die Redundanz von Vollknoten in Bitcoin: Eine netzwerktheoretische Demonstration von Miner-Centric Propagation Topologien Bittcoin中完全节点的冗余:矿工-Centric传承型体的网络理论示范 2506.14197v1 -
31 06-17 Cost-Efficient Serving of LLM Agents via Test-Time Plan Caching Kosteneffiziente Bedienung von LLM-Agenten über Test-Zeitplan-Caching 通过试验-时间计划缓冲,以成本效率高的方式服务LLM代理物 2506.14852v1 -
32 06-17 Efficient Serving of LLM Applications with Probabilistic Demand Modeling Effizientes Servieren von LLM-Anwendungen mit probabilistischer Nachfragemodellierung 高效率地利用概率需求建模来服务LLM应用程序 2506.14851v1 -
33 06-17 Déjà Vu: Efficient Video-Language Query Engine with Learning-based Inter-Frame Computation Reuse Déjà Vu: Effiziente Video-Sprachen-Abfrage-Engine mit Learning-based Inter-Frame Computation Reuse Déjà Vu:高效视频语言查询引擎,以学习为基础的基于学习的网络间计算再使用 2506.14107v1 -
34 06-16 (1) ReinDSplit: Reinforced Dynamic Split Learning for Pest Recognition in Precision Agriculture ReinDSplit: Dynamisches Split-Lernen für Pesterkennung in der Precision Agriculture verstärkt ReinDSplit:强化动态分散学习,以便在精密农业中承认害虫特征 2506.13935v1 -
35 06-16 A Terminology for Scientific Workflow Systems Eine Terminologie für wissenschaftliche Workflow-Systeme 科学工作流程系统术语术语 2506.07838v4 -
36 06-16 BanditWare: A Contextual Bandit-based Framework for Hardware Prediction BanditWare: Ein Kontextbandit-basiertes Framework für Hardware-Vorhersage BanditWare:基于背景的硬硬件预测土匪框架 2506.13730v1 -
37 06-16 POPQC: Parallel Optimization for Quantum Circuits (Extended Version) POPQC: Parallele Optimierung für Quantenkreise (erweiterte Version) POPQC: 量子电路平行优化(扩展版本) 2506.13720v1 -
38 06-16 EBS-CFL: Efficient and Byzantine-robust Secure Clustered Federated Learning EBS-CFL: Effizientes und Byzantinisch-Robustes Sicheres Cluster-Federiertes Lernen EBS-CFL: 高效和拜占庭-怒火安全分组联邦学习 2506.13612v1 -
39 06-16 ILVES: Accurate and efficient bond length and angle constraints in molecular dynamics ILVES: Genaue und effiziente Bindungslänge und Winkelbeschränkungen in der molekularen Dynamik ILVES: 分子动态的精确而高效的联结长度和角限制 2503.13075v3 -
40 06-16 Perfect Privacy for Discriminator-Based Byzantine-Resilient Federated Learning Perfekte Privatsphäre für diskriminatorbasiertes Byzantinisch-Resilientes Federated Learning 具有抵抗力的联邦学习组织 2506.13561v1 -
41 06-16 EvalNet: A Practical Toolchain for Generation and Analysis of Extreme-Scale Interconnects EvalNet: Eine praktische Toolchain für die Generierung und Analyse von Extrem-Scale-Verbindungen EvalNet:生成和分析极端系统互联的实用工具链 2105.12663v3 -
42 06-16 Byzantine-Tolerant Consensus in GPU-Inspired Shared Memory Byzantinisch-Tolerant Konsens in GPU-inspiriert gemeinsamen Speicher 在GPU-受GPU启发的共同记忆中,拜占庭-容忍共识 2503.12788v2 -
43 06-16 Blockchain and Biometrics: Survey, GDPR Elements, and Future Directions Blockchain und Biometrie: Umfrage, GDPR-Elemente und Zukunftsrichtung 块链和生物计量:调查、GDPR要素和未来方向 2302.10883v3 -
44 06-16 DDiT: Dynamic Resource Allocation for Diffusion Transformer Model Serving DDiT: Dynamische Ressourcenzuteilung für Diffusionstransformator-Modelldienst DDIT:为传播变异模型服务提供动态资源配置 2506.13497v1 -
45 06-16 On Immutable Memory Systems for Artificial Agents: A Blockchain-Indexed Automata-Theoretic Framework Using ECDH-Keyed Merkle Chains Auf unveränderliche Speichersysteme für künstliche Agenten: Ein Blockchain-indexed Automata-Theoretic Framework mit ECDH-Keyed Merkle Chains 人工制剂可变记忆系统:使用ECDH-Keyed Merkle 链条的链链式内浸式自动成像-理论框架 2506.13246v1 -
46 06-16 DFPL: Decentralized Federated Prototype Learning Across Heterogeneous Data Distributions DFPL: Dezentrales Federated Prototype Learning über unterschiedliche Datenverteilungen hinweg DFPL: 分散的联邦原型学习,跨异种数据分布 2505.04947v3 -
47 06-16 A Hybrid Heuristic Framework for Resource-Efficient Querying of Scientific Experiments Data Ein hybrider Heuristischer Rahmen für eine ressourceneffiziente Abfrage wissenschaftlicher Experimentdaten 资源效率科学实验数据调查混合元框架 2506.10422v2 -
48 06-15 (7) Distributed Computing From First Principles Verteiltes Rechnen von den ersten Prinzipien 从原始原则中分配的计算 2506.12959v1 -
49 06-15 Self-Stabilizing Replicated State Machine Coping with Byzantine and Recurring Transient Faults Selbststabilisierende replizierte Staatsmaschine, die mit byzantinischen und wiederkehrenden transienten Fehlern fertig wird 应对拜占庭和经常性中转过失的自稳定复制国家机器 2506.12900v1 -
50 06-15 BLITZSCALE: Fast and Live Large Model Autoscaling with O(1) Host Caching BLITZSCALE: Schnelle und Live-Großmodellautoskalierung mit O(1) Host-Caching BLITZSCALE: 与 O(1) 主机缓存快速和活的大型模型自动缩放 2412.17246v2 -
51 06-15 zkMixer: A Configurable Zero-Knowledge Mixer with Anti-Money Laundering Consensus Protocols zkMixer: Ein konfigurierbarer Null-Knowledge-Mixer mit Anti-Money Laundering Consensus-Protokollen zkMixer:一个与反洗钱共识议定书的可配置零知识混合器 2503.14729v2 -
52 06-15 Cross-architecture universal feature coding via distribution alignment Cross-architecture universal feature coding via distribution alignment 通过分配协调进行跨建筑跨建筑通用特征编码 2506.12737v1 -
53 06-15 Energy-Efficient Real-Time Job Mapping and Resource Management in Mobile-Edge Computing Energieeffizientes Echtzeit-Job Mapping und Ressourcenmanagement im Mobile-Edge Computing 移动电子计算中的能源高效实时工作绘图和资源管理 2506.12686v1 -
54 06-14 (6) Accelerating Cloud-Based Transcriptomics: Performance Analysis and Optimization of the STAR Aligner Workflow Beschleunigung der Cloud-basierten Transkription: Leistungsanalyse und Optimierung des STAR Aligner Workflows 加速以云为基础的云基转换器:STAR升空者工作流程的绩效分析和优化 2506.12611v1 -
55 06-14 Performance optimization of BLAS algorithms with band matrices for RISC-V processors Leistungsoptimierung von BLAS-Algorithmen mit Bandmatrizen für RISC-V-Prozessoren BLAS算法的性能优化,为RISC-V处理器提供带宽矩阵 2502.13839v2 -
56 06-14 Decentralized Distributed Graph Coloring: Cluster Graphs Dezentrale verteilte Graphen-Färbung: Clustergraphen 分散分布的图表颜色:群集图表 2405.07725v2 -
57 06-14 AI Flow: Perspectives, Scenarios, and Approaches AI Flow: Perspektiven, Szenarien und Ansätze AI 流动:观点、设想和方法 2506.12479v1 -
58 06-14 Optimizing Federated Learning using Remote Embeddings for Graph Neural Networks Optimierung des Federated Learning mit Remote Embeddings für Graph Neural Networks 利用远程嵌入模型神经网络优化联邦学习 2506.12425v1 -
59 06-14 Optimizing Resource Allocation and Energy Efficiency in Federated Fog Computing for IoT Optimierung der Ressourcenallokation und Energieeffizienz im Federated Fog Computing für IoT IoT的联雾计算器优化资源分配和能源效率 2504.00791v3 -
60 06-14 Boosting Resource-Constrained Federated Learning Systems with Guessed Updates Ressourcenkonzentrierte Föderierte Lernsysteme mit geschätzten Updates fördern 推动资源限制的联邦学习系统并猜测最新情况 2110.11486v3 -
61 06-14 QoS-aware Scheduling of Periodic Real-time Task Graphs on Heterogeneous Pre-occupied MECs QoS-aware Planung von periodischen Echtzeit-Taskgraphen zu heterogenen vorbesetzten MECs 不同不同类型预先使用和未使用MEC的定期实时任务图的定期实时布局 2506.12415v1 -
62 06-14 Efficient Unified Caching for Accelerating Heterogeneous AI Workloads Effizientes Unified Caching für die Beschleunigung heterogener KI-Workloads 加速异异性人工智能加速加速加速加速统一快递 2506.12370v1 -
63 06-14 Decouple and Decompose: Scaling Resource Allocation with DeDe Entkoppeln und Zersetzen: Skalierung der Ressourcenzuteilung mit DeDe 分化和分解:扩大资源分配与除去 2412.11447v3 -
64 06-14 EdgeProfiler: A Fast Profiling Framework for Lightweight LLMs on Edge Using Analytical Model EdgeProfiler: Ein schnelles Profiling-Framework für leichte LLMs am Rand mit analytischem Modell 边缘推进器:利用分析模型分析边缘的轻量LMs的快速分析框架 2506.09061v2 -
65 06-14 KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider KVCache Cache in der Wildnis: KVCache Cache bei einem großen Cloud-Anbieter charakterisieren und optimieren KVcache 野生缓存: 大云提供方的 KVcache 缓存的特性和优化 KVcache 缓存 2506.02634v2 -
66 06-14 GroupNL: Low-Resource and Robust CNN Design over Cloud and Device GroupNL: Low-Resource und robustes CNN-Design über Cloud und Device GroupNL: 低资源资源和强力有线电视新闻网关于云和装置的设计 2506.12335v1 -
67 06-14 Towards Energy-Efficient Distributed Agreement Auf dem Weg zu einem energieeffizienten, verteilten Abkommen 争取实现节能分配协议 2506.12282v1 -
68 06-13 (5) Fed-HeLLo: Efficient Federated Foundation Model Fine-Tuning with Heterogeneous LoRA Allocation Fed-HeLlo: Effizientes Federated Foundation Model Feintuning mit heterogener LoRA-Zuteilung Fed-HELLo:高效联邦基金会 2506.12213v1 -
69 06-13 MindFlayer SGD: Efficient Parallel SGD in the Presence of Heterogeneous and Random Worker Compute Times MindFlayer SGD: Effiziente parallele SGD in der Gegenwart von heterogenen und zufälligen Arbeiter-Berechnungszeiten MindFlayer SGD: 存在异基因和随机工人时的有效平行SGD计算 2410.04285v2 -
70 06-13 Secure API-Driven Research Automation to Accelerate Scientific Discovery Sichere API-gesteuerte Forschungsautomatisierung zur Beschleunigung der wissenschaftlichen Entdeckung 安全 API-驱动研究自动化加速科学发现 2506.11950v1 -
71 06-13 A retrospective on DISPEED – Leveraging heterogeneity in a drone swarm for IDS execution Eine Retrospektive über DISPEED – Heterogenität in einem Drohnenschwarm für IDS-Exekution 在无人驾驶飞机群群中利用异异性进行IDS处决 2506.11800v1 -
72 06-13 PIPO: Pipelined Offloading for Efficient Inference on Consumer Devices PIPO: Pipelined Offloading für effiziente Schlussfolgerungen auf Consumer Devices PIPO:为有效推断消费者设备而喷射的卸载 2504.03664v2 -
73 06-13 CoBRA: A Universal Strategyproof Confirmation Protocol for Quorum-based Proof-of-Stake Blockchains CoBRA: Ein universelles Strategie-Proof-of-Stake-Blockchains-Protokoll für quorum-basierte Proof-of-Stake-Blockchains CoBRA: 一项基于法定人数的 “ 制片检验 “ 的通用战略防战略确认议定书 2503.16783v2 -
74 06-13 Auctions with Tokens: Monetary Policy as a Mechanism Design Choice Auktionen mit Tokens: Geldpolitik als Mechanism Design Choice 与Tokons的拍卖:货币政策作为一种机制设计选择 2301.13794v4 -
75 06-13 Bounded Memory in Distributed Networks Gebundenes Gedächtnis in verteilten Netzwerken 分布式网络中的环绕内存 2506.11644v1 -
76 06-13 Capsule: Efficient Player Isolation for Datacenters Kapsel: Effiziente Spielerisolierung für Rechenzentren Capsule: 数据中心的有效玩家隔离 2506.11483v1 -
77 06-13 Level set-based inverse homogenisation of three-dimensional piezoelectric materials Inverse Homogenisierung von dreidimensionalen piezoelektrischen Werkstoffen auf stufenweiser Basis 三维压电压材料的 水平定级反同质化 2410.03148v3 -
78 06-13 Topology-Aware Virtualization over Inter-Core Connected Neural Processing Units Topologie-Bewusst-Virtualisierung über kernverbundene Neuralverarbeitungseinheiten 核心间连接神经元处理单位的地形-认知虚拟化 2506.11446v1 -
79 06-13 Advancing Hybrid Defense for Byzantine Attacks in Federated Learning Förderung der Hybrid-Verteidigung für byzantinische Angriffe im Federated Learning 推进联邦学习联盟拜占庭攻击事件混合防御 2409.06474v3 -
80 06-13 You can lie but not deny: SWMR registers with signature properties in systems with Byzantine processes Sie können lügen, aber nicht leugnen: SWMR Register mit Signatur Eigenschaften in Systemen mit byzantinischen Prozessen 你可以说谎,但不能否认:在拜占庭程序系统中,有签名属性的SWMR登记系统登记系统登记系统。 2504.09805v2 -
81 06-13 WindVE: Collaborative CPU-NPU Vector Embedding WindVE: Kollaborative CPU-NPU-Vektor-Einbettung Windeve:协作式CPU-NPU 矢量嵌入 2504.14941v4 -
82 06-12 (4) SwiftSpec: Ultra-Low Latency LLM Decoding by Scaling Asynchronous Speculative Decoding SwiftSpec: Ultra-Low Latency LLM Decodierung durch Skalierung asynchroner spekulativer Decodierung SwiftSpecle: 通过缩放非同步的投机性代号来解码超低纬度LLM LLM 2506.11309v1 -
83 06-12 Byzantine-Resilient Secure Aggregation for Federated Learning Without Privacy Compromises Byzantinisch-Resilient Sichere Aggregation für Federated Learning ohne Datenschutz Kompromisse Byzantine-抗拜占庭-无隐私障碍联邦学习安全聚合 2405.08698v3 -
84 06-12 LoByITFL: Low Communication Secure and Private Federated Learning LoByITFL: Niedrige Kommunikation Sicheres und Privates Federated Learning LoByITFL: 低通信安全和私营联邦学习 2405.19217v2 -
85 06-12 To Compress or Not To Compress: Energy Trade-Offs and Benefits of Lossy Compressed I/O Um zu komprimieren oder nicht zu komprimieren: Energie-Handels-Offs und Vorteile von Lossy Compressed I/O 压缩或非压缩:能源贸易额和损失压缩 I/O 2410.23497v2 -
86 06-12 TimberStrike: Dataset Reconstruction Attack Revealing Privacy Leakage in Federated Tree-Based Systems TimberStrike: Datensatz-Rekonstruktion Angriff Enthüllen der Privatsphäre Leckage in Federated Tree-Based Systems 木材三角:联邦树基系统中数据集重建攻击清除隐私渗漏 2506.07605v2 -
87 06-12 Adaptive Job Scheduling in Quantum Clouds Using Reinforcement Learning Adaptive Jobplanung in Quantenwolken mittels Verstärkungslernen 利用强化学习在量云中进行适应性就业安排 2506.10889v1 -
88 06-12 The Impact of Partial Computations on the Red-Blue Pebble Game Der Einfluss von partiellen Berechnungen auf das rot-blaue Pebble-Spiel 部分计算对红蓝色球游戏的影响 2506.10854v1 -
89 06-12 Faster CONGEST Approximation Algorithms for Maximum Weighted Independent Set in Sparse Graphs Schnellere CONGEST-Annäherung Algorithmen für maximal gewichtete unabhängige Satz in Sparse Graphen 快速 CONEEST 粗图中最大加权独立设置的 CONEST 近似比度值 2506.10845v1 -
90 06-12 Proteus: Enabling High-Performance Processing-Using-DRAM with Dynamic Bit-Precision, Adaptive Data Representation, and Flexible Arithmetic Proteus: Leistungsstarkes Processing-Using-DRAM mit dynamischer Bit-Präzision, adaptiver Datendarstellung und flexibler Arithmetik Proteus: 具有动态比精确度、适应性数据表示和弹性亚光学的能动高性能处理-Using-DRAM 2501.17466v2 -
91 06-12 Towards Sustainable Computing: Exploring Energy Consumption Efficiency of Alternative Configurations and Workloads in an Open Source Messaging System Auf dem Weg zu nachhaltigem Rechnen: Energieeffizienz von alternativen Konfigurationen und Workloads in einem Open Source Messaging System untersuchen 实现可持续计算:探索开放源码通信系统中替代配置和工作量的能源消耗效率 2506.10693v1 -
92 06-12 Fully Energy-Efficient Randomized Backoff: Slow Feedback Loops Yield Fast Contention Resolution Vollenergieeffizienter Randomized Backoff: Langsame Rückkopplungsschleifen liefern schnelle Streitbeilegung 完全节能随机后退:慢速反馈循环 2302.07751v4 -
93 06-12 Deployment of Containerized Simulations in an API-Driven Distributed Infrastructure Bereitstellung von containerisierten Simulationen in einer API-getriebenen verteilten Infrastruktur 在API-驱动分配基础设施中部署集装箱化模拟设备 2506.10642v1 -
94 06-12 Graph-based Gossiping for Communication Efficiency in Decentralized Federated Learning Graphbasiertes Gossing für Kommunikationseffizienz im dezentralisierten Föderierten Lernen 以图表为基础的分散式联邦学习传播效率Gossiping 2506.10607v1 -
95 06-12 Model Discovery and Graph Simulation: A Lightweight Alternative to Chaos Engineering Modellentdeckung und Graphensimulation: Eine leichte Alternative zur Chaos-Engineering 模型发现和图示模拟:解决混乱工程的轻量替代方法 2506.11176v1 -
96 06-12 6G Infrastructures for Edge AI: An Analytical Perspective 6G-Infrastrukturen für Edge AI: Eine analytische Perspektive 6G 供异地边缘使用的基础设施:分析角度 2506.10570v1 -
97 06-12 GPU-Accelerated Distributed QAOA on Large-scale HPC Ecosystems GPU-beschleunigte verteilte QAOA auf großflächige HPC-Ökosysteme GPU-加速加速的大型高氯苯生态系统分布式QAOA 2506.10531v1 -
98 06-12 HP2C-DT: High-Precision High-Performance Computer-enabled Digital Twin HP2C-DT: High-Precision High-Performance-Computer-fähiger Digital Twin HP2C-DT:高精确度高绩效计算机化数字双双 2506.10523v1 -
99 06-12 Understanding the Performance and Power of LLM Inferencing on Edge Accelerators Die Leistung und Leistung von LLM-Inferenzen auf Edge-Beschleuniger verstehen 了解LLM LLM对边缘加速器的推论的性能和功率 2506.09554v2 -
100 06-12 TD-Pipe: Temporally-Disaggregated Pipeline Parallelism Architecture for High-Throughput LLM Inference TD-Pipe: Vorübergehend disaggregierte Pipeline-Parallelismus-Architektur für High-Throughput-LLM-Inferenz TD-Pipe:高干压压压下LLM推论的热分解管道平行结构 2506.10470v1 -
101 06-12 Automating Multi-Tenancy Performance Evaluation on Edge Compute Nodes Automatisieren von Multi-Tenancy-Performance-Evaluierung auf Edge Compute Nodes 将多层计算节点的多层业绩评价自动化 2506.10461v1 -
102 06-12 Multi-dimensional Autoscaling of Processing Services: A Comparison of Agent-based Methods Mehrdimensionale Autoskalierung von Verarbeitungsdienstleistungen: Ein Vergleich von agentenbasierten Methoden 处理服务多维多维自动升级:以代理为基础的方法比较 2506.10420v1 -
103 06-12 Federated Learning within Global Energy Budget over Heterogeneous Edge Accelerators Föderiertes Lernen im globalen Energiebudget über Heterogene Edge-Beschleuniger 全球能源预算内关于异异异系边缘加速器的联邦学习 2506.10413v1 -
104 06-12 HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration HPCTransCompile: Ein KI-Compiler-generierter Datensatz für Hochleistungs-CUDA-Transpilation und LLM-Voruntersuchung HPC Transtranscompility: AI CUDA 高性能 CUDA 转换和 LLM 初步探索的人工智能汇编器生成数据集 2506.10401v1 -
105 06-12 Bug Classification in Quantum Software: A Rule-Based Framework and Its Evaluation Fehlerklassifizierung in der Quantensoftware: Ein regelbasiertes Framework und seine Bewertung 量子软件中的臭虫分类:基于规则的框架及其评价 2506.10397v1 -
106 06-12 Is Sparse Matrix Reordering Effective for Sparse Matrix-Vector Multiplication? Ist Sparse Matrix Reordering wirksam für Sparse Matrix-Vector Multiplikation? 粗缩矩阵重新排序是否对 粗略矩阵- Vector 乘法有效? 2506.10356v1 -
107 06-12 PerfTracker: Online Performance Troubleshooting for Large-scale Model Training in Production PerfTracker: Online-Performance-Fehlersuche für großformatige Modellschulungen in der Produktion PerfTracker:大规模生产示范培训在线绩效问题解决 2506.08528v3 -
108 06-12 SLO-Aware Scheduling for Large Language Model Inferences SLO-Aware Scheduling für große Sprachmodell-Schlussfolgerungen 大语言示范推理大语言示范推理的 SLO-Aware 排程 2504.14966v2 -
109 06-12 Resilience through Automated Adaptive Configuration for Distribution and Replication Resilienz durch Automatisierte Adaptive Konfiguration für Verteilung und Replizierung 通过自动适应配置进行分发和复制的复原力 2506.10248v1
Article 0
Title@2025-06-18 (3): Federated Learning for MRI-based BrainAGE: a multicenter study on post-stroke functional outcome prediction
Title: Federated Learning for MRI-based BrainAGE: a multicenter study on post-stroke functional outcome prediction | Föderated Learning for MRI-based BrainAGE: Eine multizentrische Studie zur post-stroke funktionellen Ergebnisvorhersage | 为基于MRI的脑力智能学习联合会学习:关于打击后功能性结果预测的多中心研究 2506.15626v1 |
Authors (11): Vincent Roca, Marc Tommasi, Paul Andrey, Aurélien Bellet, Markus D. Schirmer, Hilde Henon, Laurent Puy, Julien Ramon, Grégory Kuchcinski, Martin Bretzner, Renaud Lopes
Objective: Brain-predicted age difference (BrainAGE) is a neuroimaging biomarker reflecting brain health. However, training robust BrainAGE models requires large datasets, often restricted by privacy concerns. This study evaluates the performance of federated learning (FL) for BrainAGE estimation in ischemic stroke patients treated with mechanical thrombectomy, and investigates its association with clinical phenotypes and functional outcomes. Methods: We used FLAIR brain images from 1674 stroke patients across 16 hospital centers. We implemented standard machine learning and deep learning models for BrainAGE estimates under three data management strategies: centralized learning (pooled data), FL (local training at each site), and single-site learning. We reported prediction errors and examined associations between BrainAGE and vascular risk factors (e.g., diabetes mellitus, hypertension, smoking), as well as functional outcomes at three months post-stroke. Logistic regression evaluated BrainAGE’s predictive value for these outcomes, adjusting for age, sex, vascular risk factors, stroke severity, time between MRI and arterial puncture, prior intravenous thrombolysis, and recanalisation outcome. Results: While centralized learning yielded the most accurate predictions, FL consistently outperformed single-site models. BrainAGE was significantly higher in patients with diabetes mellitus across all models. Comparisons between patients with good and poor functional outcomes, and multivariate predictions of these outcomes showed the significance of the association between BrainAGE and post-stroke recovery. Conclusion: FL enables accurate age predictions without data centralization. The strong association between BrainAGE, vascular risk factors, and post-stroke recovery highlights its potential for prognostic modeling in stroke care.
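The federated setting in this abstract (local training at each site, no pooling of images) is commonly realized with a server-side averaging step. The sketch below illustrates one such step, sample-size-weighted parameter averaging in the style of FedAvg. The abstract does not state which aggregation rule the study used, so this is an assumption, and the function name, site counts, and parameter values are all hypothetical.

```python
from typing import List


def federated_average(site_weights: List[List[float]],
                      site_sizes: List[int]) -> List[float]:
    """Average per-site parameter vectors, weighting each site by its sample count.

    Assumption: a FedAvg-style rule; the study's actual aggregation is not
    described in the abstract.
    """
    total = sum(site_sizes)
    dim = len(site_weights[0])
    return [
        sum(w[i] * n for w, n in zip(site_weights, site_sizes)) / total
        for i in range(dim)
    ]


# Three hypothetical hospital sites, each holding a 2-parameter local model.
local_models = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
n_patients = [100, 100, 200]

# Only parameters leave the sites; the images themselves stay local.
global_model = federated_average(local_models, n_patients)
print(global_model)
```

Weighting by sample count lets larger sites contribute proportionally more to the shared model; other weightings are possible.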
Article 1
Title@2025-06-18 (3): LiteGD: Lightweight and dynamic GPU Dispatching for Large-scale Heterogeneous Clusters
Title: LiteGD: Lightweight and dynamic GPU Dispatching for Large-scale Heterogeneous Clusters | LiteGD: Leichte und dynamische GPU Dispatching für großflächige Heterogene Cluster | LiteGD: 大型异源集散发轻量和动态GPU 2506.15595v1 |
Authors (3): Kunming Zhang, Hanlong Liao, Guoming Tang
Parallel computing with multiple GPUs has become the dominant paradigm for machine learning tasks, especially those of large language models (LLMs). To reduce the latency incurred by inter-GPU communication, a common practice for parallel tasks has been to allocate GPUs based on their physical proximity. However, this long-standing assumption has notable limitations, particularly in large-scale, heterogeneous GPU clusters where bandwidth distribution among GPUs is irregular. In this paper, we introduce LiteGD, a lightweight and dynamic GPU dispatching system based on global perspectives. To tackle the difficulty of storing massive GPU topology information, LiteGD adopts a computation-aware design that leverages a lightweight Transformer network trained on sampled data. Our customized design for network structure ensures both transferability and scalability. LiteGD also employs a bidirectional tree search over the data generated in the previous step to select a GPU dispatching, which identifies near-optimal solutions while reducing search overhead. We implement and evaluate LiteGD in both real and simulated GPU clusters with homogeneous and heterogeneous interconnects, respectively. Experimental results demonstrate that LiteGD consistently achieves high GPU bandwidth efficacy (approximately 90% across various cluster configurations and 80% in a real-world H100 cluster), significantly outperforming conventional default and interconnect topology-aware dispatching methods, particularly in large-scale heterogeneous environments.
Article 2
Title@2025-06-18 (3): DAGs for the Masses
Title: DAGs for the Masses | DAGs für die Massen | 质量的 DAG 值 2506.13998v2 |
Authors (4): Michael Anoprenko, Andrei Tonkikh, Alexander Spiegelman, Petr Kuznetsov
A recent approach to building consensus protocols on top of Directed Acyclic Graphs (DAGs) shows much promise due to its simplicity and stable throughput. However, as each node in the DAG typically includes a linear number of references to the nodes in the previous round, prior DAG protocols scale only up to the point at which the overhead of maintaining the graph becomes the bottleneck. To enable large-scale deployments of DAG-based protocols, we propose a sparse DAG architecture, where each node includes only a constant number of references to random nodes in the previous round. We present a sparse version of Bullshark, one of the most prominent DAG-based consensus protocols, and demonstrate its improved scalability. Remarkably, unlike other protocols that use random sampling to reduce communication complexity, we manage to avoid sacrificing resilience: the protocol can tolerate up to $f<n/3$ Byzantine faults (where $n$ is the number of participants), the same as its less scalable deterministic counterpart. The proposed "sparse" methodology can be applied to any protocol that maintains disseminated system updates and causal relations between them in a graph-like structure. Our simulations show that the considerable reduction of transmitted metadata in sparse DAGs results in more efficient network utilization and better scalability.
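The core structural idea in this abstract, a constant number of references to random nodes of the previous round instead of a linear number, can be sketched as follows. This is a toy illustration only: the function names, vertex labels, and the choice of uniform sampling without replacement are assumptions, and the actual sparse Bullshark protocol may impose further rules that the abstract does not describe.

```python
import random
from typing import Dict, List, Tuple


def build_sparse_round(prev_round: List[str], creators: List[str], k: int,
                       rng: random.Random) -> Dict[str, List[str]]:
    """Give each creator's new vertex k distinct random parents from the previous round.

    Hypothetical helper: uniform sampling without replacement is an assumption.
    """
    k = min(k, len(prev_round))
    return {c: rng.sample(prev_round, k) for c in creators}


def reference_counts(n: int, k: int) -> Tuple[int, int]:
    """References carried per round: dense DAG (n per vertex) vs. sparse DAG (k per vertex)."""
    return n * n, n * k


creators = [f"p{i}" for i in range(100)]
prev_round = [f"r1:{c}" for c in creators]
edges = build_sparse_round(prev_round, creators, k=5, rng=random.Random(0))

dense, sparse = reference_counts(n=100, k=5)
print(dense, sparse)
```

With 100 participants and 5 references per vertex, per-round reference metadata drops from quadratic to linear in the number of participants, which is the overhead reduction the abstract attributes to sparse DAGs.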
Article 3
Title@2025-06-18 (3): Automatic Metadata Capture and Processing for High-Performance Workflows
Title: Automatic Metadata Capture and Processing for High-Performance Workflows | Automatische Metadatenerfassung und -verarbeitung für Hochleistungs-Workflows | 高绩效工作流程自动获取和处理元元数据 2506.15537v1 |
Authors (2): Polina Shpilker, Line Pouchard
Modern workflows run on increasingly heterogeneous computing architectures, and this heterogeneity brings additional complexity. We aim to apply the FAIR principles for research reproducibility by developing software to collect metadata annotations for workflows run on HPC systems. We experiment with two possible formats to uniformly store these metadata, and reorganize the collected metadata to be as easy to use as possible for researchers studying their workflow performance.
现代工作流程运行在日益多样化的计算机结构上,随着这种差异性化的出现,情况更加复杂。 我们的目标是通过开发软件来收集HPC系统运行的工作流程的元数据说明来应用FAIR研究再复制原则。 我们尝试两种可能的格式来统一存储这些元数据,并重组收集的元数据,以便尽可能方便研究人员研究其工作流程绩效时使用。
Article 4
Title@2025-06-18 (3): Minimizing Communication for Parallel Symmetric Tensor Times Same Vector Computation
Title: Minimizing Communication for Parallel Symmetric Tensor Times Same Vector Computation | Minimierung der Kommunikation für parallele symmetrische Tensor-Zeiten gleiche Vektor-Computation | 最大限度地减少平行对称日光时同步矢量计算通信 2506.15488v1 |
Authors (6): Hussam Al Daas, Grey Ballard, Laura Grigori, Suraj Kumar, Kathryn Rouse, Mathieu Vérité
In this article, we focus on the parallel communication cost of multiplying the same vector along two modes of a $3$-dimensional symmetric tensor. This is a key computation in the higher-order power method for determining eigenpairs of a $3$-dimensional symmetric tensor and in gradient-based methods for computing a symmetric CP decomposition. We establish communication lower bounds that determine how much data movement is required to perform the specified computation in parallel. The core idea of the proof relies on extending a key geometric inequality for $3$-dimensional symmetric computations. We demonstrate that the communication lower bounds are tight by presenting an optimal algorithm where the data distribution is a natural extension of the triangle block partition scheme for symmetric matrices to 3-dimensional symmetric tensors.
在本篇文章中,我们侧重于按照3美元维对称振幅两种模式乘以同一矢量的平行通信成本。 这是用于确定3美元维对称振幅的高级功率法和用于计算对称CP分解的梯度法中确定3美元维对称振幅的高级功率法和以梯度为基础的计算法中的关键计算。 我们设置了通信下限,以确定平行进行规定的计算需要多少数据移动。 证据的核心理念取决于在3美元维对称计算中扩大关键几何不平等。 我们通过展示一种最佳算法,显示通信的下限是紧凑的,因为数据分布是对称矩阵的三角区分隔法的自然延伸至3维对称矩的三角区分隔法。
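The computation in question is $y_i = \sum_{j,k} A_{ijk} x_j x_k$ for a symmetric tensor $A$. The following single-process sketch only illustrates how symmetry lets one touch each unique entry ($i \ge j \ge k$) once; it is not the paper's parallel algorithm or its triangle block partition, and all function names are hypothetical.

```python
from itertools import permutations

def make_symmetric_tensor(n):
    """Small symmetric n x n x n tensor: the value depends only on the sorted index triple."""
    def entry(i, j, k):
        a, b, c = sorted((i, j, k))
        return float(a + 2 * b + 3 * c + 1)
    return [[[entry(i, j, k) for k in range(n)] for j in range(n)] for i in range(n)]

def sttsv_naive(A, x):
    """y_i = sum_{j,k} A[i][j][k] * x[j] * x[k], reading every entry."""
    n = len(x)
    return [sum(A[i][j][k] * x[j] * x[k] for j in range(n) for k in range(n))
            for i in range(n)]

def sttsv_symmetric(A, x):
    """Same result, reading only entries with i >= j >= k and scattering
    each one to every distinct permutation of its index triple."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):
        for j in range(i + 1):
            for k in range(j + 1):
                a = A[i][j][k]
                for p, q, r in set(permutations((i, j, k))):
                    y[p] += a * x[q] * x[r]
    return y

A = make_symmetric_tensor(4)
x = [1.0, 2.0, -1.0, 0.5]
print(sttsv_naive(A, x))
print(sttsv_symmetric(A, x))
```

Reading only the unique entries is what makes the data distribution question interesting: each stored entry contributes to up to three output components, so where it lives determines which parts of `x` and `y` must be communicated.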
Article 5
Title@2025-06-18 (3): All is Not Lost: LLM Recovery without Checkpoints
Title: All is Not Lost: LLM Recovery without Checkpoints | Alles ist nicht verloren: LLM Erholung ohne Checkpoints | 并非全部丢失:LLM 恢复没有检查站 2506.15461v1 |
Authors (3): Nikolay Blagoev, Oğuzhan Ersoy, Lydia Yiyu Chen
Training LLMs on decentralized and wimpy computation nodes, e.g., multiple on-spot instances, lowers the training cost and enables model democratization. The inevitable challenge here is the churn of nodes due to failures and the operator’s scheduling policies, which leads to the loss of a stage, i.e., a part of the model. The conventional approaches to recovering from failures are either checkpointing, where a copy of the entire model is periodically sent to additional storage, or redundant computation. These approaches incur significant communication and/or computation overhead even in non-failure cases and scale poorly in settings with large models. In this paper, we propose CheckFree, an efficient recovery method where a failing stage is substituted by a weighted average of the closest neighboring stages. In contrast to the state of the art, CheckFree requires no additional computation or storage. However, because it averages neighboring stages, it can only recover failures of intermediate stages. We further extend our method to CheckFree+, which uses out-of-order pipeline execution to tolerate crashes of the first and last stages. Thanks to out-of-order pipelining, the behavior of those stages is mimicked by their neighboring ones, which allows CheckFree+ to recover them by simply copying the weights from the immediate neighbor. To be able to recover the (de)embedding layers, CheckFree+ copies those layers to the neighboring stages, which requires relatively small storage overhead. We extensively evaluate our method on LLaMa models with sizes from 124M to 1.5B under varying failure frequencies. In the case of low and medium failure rates (5-10%), CheckFree and CheckFree+ outperform both checkpointing and redundant computation in terms of convergence in wall-clock time by over 12%. Both of our proposals can be run via our code available at: https://github.com/gensyn-ai/CheckFree.
在去中心化且算力较弱的计算节点(例如多个 on-spot 实例)上训练LLM,可以降低训练成本并推动模型的普及。这里不可避免的挑战是节点因故障和运营方调度策略而频繁变动,从而导致丢失一个阶段(即模型的一部分)。传统的故障恢复方法要么使用检查点(定期将整个模型的副本发送到额外的存储),要么使用冗余计算。这些方法即使在无故障的情况下也会带来显著的通信和/或计算开销,并且在大模型场景下扩展性较差。本文提出CheckFree,一种高效的恢复方法,用最邻近阶段的加权平均来替代发生故障的阶段。与现有技术相比,CheckFree不需要额外的计算或存储。然而,由于其对相邻阶段取平均的特性,它只能恢复中间阶段的故障。我们进一步将该方法扩展为CheckFree+,通过乱序流水线执行来容忍首尾阶段的崩溃。借助乱序流水线,这些阶段的行为由其相邻阶段模拟,因此CheckFree+只需从直接相邻的阶段复制权重即可将其恢复。为了能够恢复(解)嵌入层,CheckFree+将这些层复制到相邻阶段,这只需要相对较小的存储开销。我们在模型规模从124M到1.5B的LLaMa模型上,以不同的故障频率对该方法进行了广泛评估。在中低故障率(5-10%)的情况下,CheckFree和CheckFree+在按墙钟时间衡量的收敛速度上比检查点和冗余计算都高出12%以上。我们的两种方案均可通过以下代码运行:https://github.com/gensyn-ai/CheckFree。
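The two recovery rules described in the abstract can be sketched on flat lists of weights. This is a minimal illustration, not the code from the CheckFree repository: the function names are hypothetical, and the averaging weights `w_prev` and `w_next` are placeholders because the abstract does not say how they are chosen.

```python
def recover_intermediate_stage(prev_stage, next_stage, w_prev=0.5, w_next=0.5):
    """Re-create a lost intermediate stage as a weighted average of its two
    neighbouring stages. Both stages must hold the same number of parameters.
    w_prev / w_next are assumed values, not taken from the paper."""
    if len(prev_stage) != len(next_stage):
        raise ValueError("neighbouring stages must have the same shape")
    total = w_prev + w_next
    return [(w_prev * p + w_next * q) / total
            for p, q in zip(prev_stage, next_stage)]

def recover_edge_stage(neighbour_stage):
    """CheckFree+-style recovery of the first or last stage:
    copy the weights of the immediate neighbour."""
    return list(neighbour_stage)

stage_1 = [1.0, 2.0]
stage_3 = [3.0, 6.0]
stage_2 = recover_intermediate_stage(stage_1, stage_3)   # lost stage between 1 and 3
print(stage_2)
```

Neither rule reads anything from external storage, which is why the approach avoids the communication cost of periodic checkpointing.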
Article 6
Title@2025-06-18 (3): Parallel Paradigms in Modern HPC: A Comparative Analysis of MPI, OpenMP, and CUDA
Title: Parallel Paradigms in Modern HPC: A Comparative Analysis of MPI, OpenMP, and CUDA | Parallele Paradigmen im modernen HPC: Eine vergleichende Analyse von MPI, OpenMP und CUDA | 现代HPC的平行范例:对MPI、OpenMP和CUDA的比较分析 2506.15454v1 |
Authors (2): Nizar ALHafez, Ahmad Kurdi
This paper presents a comprehensive comparison of three dominant parallel programming models in High Performance Computing (HPC): Message Passing Interface (MPI), Open Multi-Processing (OpenMP), and Compute Unified Device Architecture (CUDA). Selecting optimal programming approaches for modern heterogeneous HPC architectures has become increasingly critical. We systematically analyze these models across multiple dimensions: architectural foundations, performance characteristics, domain-specific suitability, programming complexity, and recent advancements. We examine each model’s strengths, weaknesses, and optimization techniques. Our investigation demonstrates that MPI excels in distributed memory environments with near-linear scalability for communication-intensive applications, but faces communication overhead challenges. OpenMP provides strong performance and usability in shared-memory systems and loop-centric tasks, though it is limited by shared memory contention. CUDA offers substantial performance gains for data-parallel GPU workloads, but is restricted to NVIDIA GPUs and requires specialized expertise. Performance evaluations across scientific simulations, machine learning, and data analytics reveal that hybrid approaches combining two or more models often yield optimal results in heterogeneous environments. The paper also discusses implementation challenges, optimization best practices, and emerging trends such as performance portability frameworks, task-based programming, and the convergence of HPC and Big Data. This research helps developers and researchers make informed decisions when selecting programming models for modern HPC applications, emphasizing that the best choice depends on application requirements, hardware, and development constraints.
本文全面比较了高性能计算(HPC)中三大主要平行编程模式:信息传递界面(MPI)、开放多处理(Open-Processing)和计算统一设备架构(CUDA),为现代多元的HPC架构选择最佳编程方法已变得日益重要。我们系统地分析这些模式的多个层面:建筑基础、性能特点、具体领域适合性、程序设计复杂性和最新进展。我们审查了每个模型的优势、弱点和优化技术。我们的调查显示,MPI在分布式记忆环境中的成绩优异,通信密集应用的分布式缩略缩缩缩缩缩版(MP),但面临通信间接挑战。OpenMP在共享模拟系统和环球中心任务中提供了很强的业绩和可用性强。CUDA为数据平行组合工作量提供了巨大的绩效收益,但仅限于NVDIMAA GUPPS, 需要专门的专业知识。我们的调查显示,将两种或两种或两种以上模型相结合的混合方法在互不相同的环境中产生最佳结果。本文还强调了实施方面的挑战、最佳的趋同性,即HPC的研发者在选择程序制定过程中选择了BIPLMLA的格局、最佳做法,这时,有助于选择了BDRDFDRDRDR的进度,使BDFDFDFDFL决定成为了H的进度和H的进度,成为了H的进度和H的进度,作为HFFFFFF的进度。
Article 7
Title@2025-06-18 (3): Exploring Fast Fourier Transforms on the Tenstorrent Wormhole
Title: Exploring Fast Fourier Transforms on the Tenstorrent Wormhole | Schnell Fourier-Transformationen auf dem Tenstorrent Wormhole erkunden | 探索登山洞上的快速傅里叶变形 2506.15437v1 |
Authors (3): Nick Brown, Jake Davies, Felix LeClair
Whilst numerous areas of computing have adopted the RISC-V Instruction Set Architecture (ISA) wholesale in recent years, it is yet to become widespread in HPC. RISC-V accelerators offer a compelling option where the HPC community can benefit from the specialisation offered by the open nature of the standard but without the extensive ecosystem changes required when adopting RISC-V CPUs. In this paper we explore porting the Cooley-Tukey Fast Fourier Transform (FFT) algorithm to the Tenstorrent Wormhole PCIe RISC-V based accelerator. Built upon Tenstorrent’s Tensix architecture, this technology decouples the movement of data from compute, potentially offering increased control to the programmer. Exploring different optimisation techniques to address the bottlenecks inherent in data movement, we demonstrate that for a 2D FFT whilst the Wormhole n300 is slower than a server-grade 24-core Xeon Platinum CPU, the Wormhole draws around 8 times less power and consumes around 2.8 times less energy than the CPU when computing the Fourier transform.
尽管近年来许多计算领域都采用了RISC-V指令设置架构(ISA)批发,但在HPC中尚未普及。 RISC-V加速器提供了一个令人信服的选项,使HPC社区能够从标准的开放性质所提供的专门化中受益,但是在采用RISC-VCPU时没有所需的广泛的生态系统变化。 在本文中,我们探索将Cooley-Tukey Fast Fourier 变换算法移植到基于Tensorent Wormhole PCIe RISC-V 加速器。 建在Tenstorrent的十六个架构上,这一技术将数据从计算中分离出来,可能增加对程序员的控制。 探索不同的优化技术,以解决数据流动中固有的瓶颈问题,我们证明对于2D FFT,而Wormhole n300 慢于服务器级24-cent Xeon Platinum CPU, 烟雾洞的电量大约比CPU少8倍左右。
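For reference, the radix-2 Cooley-Tukey algorithm and its row-column extension to 2D can be sketched in plain Python. This is only the textbook recursion on the host; it does not use any Tenstorrent API and says nothing about how the authors map data movement onto the Tensix cores.

```python
import cmath

def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])          # transform of even-indexed samples
    odd = fft(x[1::2])           # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle factor
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def naive_dft(x):
    """O(n^2) reference DFT used only to check the FFT."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def fft2d(grid):
    """2D FFT by transforming every row, then every column."""
    rows = [fft(r) for r in grid]
    cols = [fft(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

signal = [1, 2, 3, 4, 0, -1, -2, 5]
print(max(abs(a - b) for a, b in zip(fft(signal), naive_dft(signal))))
```

The column pass in `fft2d` is where a 2D transform becomes data-movement bound: it needs a transpose (or strided access) between the two 1D passes.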
Article 8
Title@2025-06-18 (3): Computing the Schulze Method for Large-Scale Preference Data Sets
Title: Computing the Schulze Method for Large-Scale Preference Data Sets | Berechnung der Schulze-Methode für großformatige Präferenzdatensätze | 计算大尺度优先数据集的平板法 2505.12976v2 |
Authors (3): Theresa Csar, Martin Lackner, Reinhard Pichler
The Schulze method is a voting rule that is widely used in practice and enjoys many positive axiomatic properties. While it is computable in polynomial time, its straightforward implementation does not scale well for large elections. In this paper, we develop a highly optimised algorithm for computing the Schulze method with Pregel, a framework for massively parallel computation of graph problems, and demonstrate its applicability for large preference data sets. In addition, our theoretical analysis shows that the Schulze method is indeed particularly well-suited for parallel computation, in stark contrast to the related ranked pairs method. More precisely, we show that winner determination subject to the Schulze method is NL-complete, whereas this problem is P-complete for the ranked pairs method.
舒尔兹法是一种在实践中广泛使用的投票规则,并具有许多积极的非毒性特性。 虽然它在多式时间可以计算,但其直向前执行在大规模选举时并不很好。 在本文中,我们开发了一种非常优化的算法,用Pregel计算舒尔兹法,这是大量平行计算图表问题的一个框架,并展示了它对大型优先数据集的适用性。 此外,我们的理论分析表明,舒尔兹法确实特别适合平行计算,这与相关的排名配对法形成鲜明对比。更准确地说,我们表明,受舒尔兹法制约的优胜者确定是NL-完成的,而这个问题对于排名配对法来说是P-完整的。
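The straightforward implementation that the abstract refers to is usually written as a Floyd-Warshall-style widest-path computation over the pairwise preference matrix. The sketch below shows that sequential cubic-time baseline, not the Pregel algorithm developed in the paper; the function name and the example matrices are illustrative.

```python
def schulze_winners(d):
    """d[i][j] is the number of voters who prefer candidate i to candidate j.
    Returns the indices of all Schulze winners."""
    n = len(d)
    # Strength of the direct link i -> j (0 if i does not beat j head-to-head).
    p = [[d[i][j] if i != j and d[i][j] > d[j][i] else 0 for j in range(n)]
         for i in range(n)]
    # Widest (strongest) path between every pair, Floyd-Warshall style.
    for k in range(n):
        for i in range(n):
            if i == k:
                continue
            for j in range(n):
                if j == i or j == k:
                    continue
                p[i][j] = max(p[i][j], min(p[i][k], p[k][j]))
    # i wins if its strongest path to every j is at least as strong as j's to i.
    return [i for i in range(n)
            if all(p[i][j] >= p[j][i] for j in range(n) if j != i)]

# Ten voters, candidate 0 beats everyone head-to-head.
condorcet = [[0, 6, 7], [4, 0, 8], [3, 2, 0]]
# Ten voters, majority cycle 0 > 1 > 2 > 0 with strengths 8, 7, 6.
cycle = [[0, 8, 4], [2, 0, 7], [6, 3, 0]]
print(schulze_winners(condorcet), schulze_winners(cycle))
```

The triple loop is what does not scale for large elections; the strongest-path formulation is also what makes the problem a natural fit for graph-parallel frameworks.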
Article 9
Title@2025-06-18 (3): RISC-V for HPC: An update of where we are and main action points
Title: RISC-V for HPC: An update of where we are and main action points | RISC-V für HPC: Ein Update, wo wir sind und die wichtigsten Aktionspunkte | HPC的RISC-V:关于我们目前的最新情况和主要行动要点的最新情况 2506.15418v1 |
Authors (1): Nick Brown
This extended abstract is submitted on behalf of the RISC-V HPC SIG who have been undertaking an analysis to explore the current state and limitations of the RISC-V ecosystem for HPC. Whilst it is right to celebrate that there has been great progress made in recent years, we also highlight limitations and where effort should be focussed.
本扩展摘要代表RISC-V HPC SIG提交,该小组一直在开展分析,以探讨面向HPC的RISC-V生态系统的现状和局限。近年来取得的巨大进展固然值得庆祝,但我们也指出了存在的局限以及应当集中投入精力的方向。
Article 10
Title@2025-06-18 (3): An Efficient Candidate-Free R-S Set Similarity Join Algorithm with the Filter-and-Verification Tree and MapReduce
Title: An Efficient Candidate-Free R-S Set Similarity Join Algorithm with the Filter-and-Verification Tree and MapReduce | Eine effiziente, kandidatfreie R-S-Set-Ähnlichkeit Begleiten Sie den Algorithmus mit dem Filter-und-Verifikationsbaum und MapReduce | 与过滤和核查树和地图显示的高效无候选人候选人 R-S 设置相似性 2506.03893v2 |
Authors (7): Yuhong Feng, Fangcao Jian, Yixuan Cao, Xiaobin Jian, Jia Wang, Haiyue Feng, Chunyan Miao
Given two different collections of sets, the exact set similarity R-S Join finds all set pairs with similarity no less than a given threshold, an operation that has widespread applications. While existing algorithms accelerate large-scale R-S Joins using a two-stage filter-and-verification framework along with the parallel and distributed MapReduce framework, they suffer from excessive candidate set pairs, leading to significant I/O, data transfer, and verification overhead, and ultimately degrading the performance. This paper proposes novel candidate-free R-S Join (CF-RS-Join) algorithms that integrate filtering and verification into a single stage through filter-and-verification trees (FVTs) and their linear variants (LFVTs). First, CF-RS-Join with FVT (CF-RS-Join/FVT) is proposed to leverage an innovative FVT structure that compresses elements and associated sets in memory, enabling single-stage processing that eliminates candidate set generation, provides fast lookups, and reduces database scans. Correctness proofs are provided. Second, CF-RS-Join with LFVT (CF-RS-Join/LFVT) is proposed to exploit a more compact Linear FVT, which compresses non-branching paths into single nodes and stores them in linear arrays for optimized traversal. Third, MR-CF-RS-Join/FVT and MR-CF-RS-Join/LFVT are proposed to extend our approaches using MapReduce for parallel processing. Empirical studies on 7 real-world datasets have been conducted to evaluate the performance of the proposed algorithms against selected existing algorithms in terms of execution time, scalability, memory usage, and disk usage. Experimental results demonstrate that our algorithm using MapReduce, i.e., MR-CF-RS-Join/LFVT, achieves the best performance.
给定两个不同的集合族,精确集合相似性R-S连接(R-S Join)找出相似度不低于给定阈值的所有集合对,这一操作有着广泛的应用。现有算法借助两阶段的过滤-验证框架以及并行分布式的MapReduce框架来加速大规模R-S连接,但它们会产生过多的候选集合对,带来显著的I/O、数据传输和验证开销,最终降低性能。本文提出新的无候选R-S连接(CF-RS-Join)算法,通过过滤-验证树(FVT)及其线性变体(LFVT)将过滤和验证整合到单一阶段。首先,提出基于FVT的CF-RS-Join(CF-RS-Join/FVT),它利用一种创新的FVT结构在内存中压缩元素及其关联的集合,从而实现单阶段处理:无需生成候选集,查找速度快,并减少数据库扫描次数。文中给出了正确性证明。其次,提出基于LFVT的CF-RS-Join(CF-RS-Join/LFVT),它利用更紧凑的线性FVT,将无分支路径压缩为单个节点,并将其存储在线性数组中以优化遍历。第三,提出MR-CF-RS-Join/FVT和MR-CF-RS-Join/LFVT,利用MapReduce将我们的方法扩展到并行处理。我们在7个真实数据集上进行了实证研究,从执行时间、可扩展性、内存使用和磁盘使用等方面,将所提算法与选定的现有算法进行比较。实验结果表明,我们使用MapReduce的算法MR-CF-RS-Join/LFVT取得了最佳性能。
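To make the problem statement concrete, here is a brute-force R-S join that compares every pair of sets. It is the quadratic baseline that filter-and-verification methods try to avoid, not the FVT or LFVT structure from the paper. Jaccard similarity is assumed here only because it is a common choice; the abstract does not name the similarity function.

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| (assumed similarity measure)."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0

def naive_rs_join(R, S, threshold):
    """Return all index pairs (i, j) with sim(R[i], S[j]) >= threshold.
    Every pair is verified, so the cost is O(|R| * |S|) similarity computations."""
    return [(i, j)
            for i, r in enumerate(R)
            for j, s in enumerate(S)
            if jaccard(r, s) >= threshold]

R = [{1, 2, 3}, {4, 5}]
S = [{1, 2, 3, 4}, {4, 5}, {9}]
pairs = naive_rs_join(R, S, 0.75)
print(pairs)
```

A two-stage method would first produce candidate pairs with a cheap filter and then verify them; the candidate-free approach in the abstract aims to skip materialising that intermediate candidate set altogether.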
Article 11
Title@2025-06-18 (3): Reconfigurable Intelligent Surface Aided Vehicular Edge Computing: Joint Phase-shift Optimization and Multi-User Power Allocation
Title: Reconfigurable Intelligent Surface Aided Vehicular Edge Computing: Joint Phase-shift Optimization and Multi-User Power Allocation | Rekonfigurierbares intelligentes Surface Aided Vehicular Edge Computing: Joint Phase-Shift Optimization und Multi-User Power Allocation | 重新配置的智能地面辅助车辆边缘 电子计算:联合阶段-轮-优化和多用户电力配置 2407.13123v2 |
Authors (6): Kangwei Qi, Qiong Wu, Pingyi Fan, Nan Cheng, Wen Chen, Khaled B. Letaief
Vehicular edge computing (VEC) is an emerging technology with significant potential in the field of internet of vehicles (IoV), enabling vehicles to perform intensive computational tasks locally or offload them to nearby edge devices. However, the quality of communication links may be severely deteriorated due to obstacles such as buildings, impeding the offloading process. To address this challenge, we introduce the use of Reconfigurable Intelligent Surfaces (RIS), which provide alternative communication pathways to assist vehicular communication. By dynamically adjusting the phase-shift of the RIS, the performance of VEC systems can be substantially improved. In this work, we consider a RIS-assisted VEC system, and design an optimal scheme for local execution power, offloading power, and RIS phase-shift, where random task arrivals and channel variations are taken into account. To address the scheme, we propose an innovative deep reinforcement learning (DRL) framework that combines the Deep Deterministic Policy Gradient (DDPG) algorithm for optimizing RIS phase-shift coefficients and the Multi-Agent Deep Deterministic Policy Gradient (MADDPG) algorithm for optimizing the power allocation of vehicle user (VU). Simulation results show that our proposed scheme outperforms the traditional centralized DDPG, Twin Delayed Deep Deterministic Policy Gradient (TD3) and some typical stochastic schemes.
视觉边缘计算(Vec)是一种新兴技术,在车辆互联网领域具有巨大的潜力,使车辆能够在当地执行密集的计算任务或将其卸载到附近的边缘装置;然而,通信连接的质量可能由于建筑等障碍而严重恶化,妨碍卸载过程。为了应对这一挑战,我们采用可重新配置的智能表面(RIS),提供替代通信途径,协助车辆通信。通过动态调整RIS的阶段性班级,VEC系统的性能可以大大改善。在这项工作中,我们考虑一个RIS辅助VEC系统,并设计一个最佳的当地执行力、卸载力和RIS阶段性班班式计划,其中考虑到随机任务到达和渠道变异。为了应对这个方案,我们建议采用一个创新的深层强化学习框架,将深层威慑政策分级算法(DDPG)结合,以优化RIS阶段性调值系数和多位的低度政策性能定位系统。在这项工作中,我们为优化SIMDGDGS-DFA(MADGDG) 和最佳机动性机动化机动化车辆配置计划,以优化SMADGDUDFIDSDSDLADLA(MADRDRDRDRDRDRDRDRDRDLADL)的拟议系统。
Article 12
Title@2025-06-18 (3): Serving Large Language Models on Huawei CloudMatrix384
Title: Serving Large Language Models on Huawei CloudMatrix384 | Große Sprachmodelle auf Huawei CloudMatrix384 | 瓦威云马特列克384 2506.12708v2 |
Authors (46): Pengfei Zuo, Huimin Lin, Junbo Deng, Nan Zou, Xingkun Yang, Yingyu Diao, Weifeng Gao, Ke Xu, Zhangyu Chen, Shirui Lu, Zhao Qiu, Peiyang Li, Xianyu Chang, Zhengzhong Yu, Fangzheng Miao, Jia Zheng, Ying Li, Yuan Feng, Bei Wang, Zaijian Zong, Mosong Zhou, Wenli Zhou, Houjiang Chen, Xingyu Liao, Yipeng Li, Wenxiao Zhang, Ping Zhu, Yinggang Wang, Chuanjie Xiao, Depeng Liang, Dong Cao, Juncheng Liu, Yongqiang Yang, Xiaolong Bai, Yi Li, Huaguo Xie, Huatao Wu, Zhibin Yu, Lv Chen, Hu Liu, Yujun Ding, Haipei Zhu, Jing Xia, Yi Xiong, Zhou Yu, Heng Liao
The rapid evolution of large language models (LLMs), driven by growing parameter scales, adoption of mixture-of-experts (MoE) architectures, and expanding context lengths, imposes unprecedented demands on AI infrastructure. Traditional AI clusters face limitations in compute intensity, memory bandwidth, inter-chip communication, and latency, compounded by variable workloads and strict service-level objectives. Addressing these issues requires fundamentally redesigned hardware-software integration. This paper introduces Huawei CloudMatrix, a next-generation AI datacenter architecture, realized in the production-grade CloudMatrix384 supernode. It integrates 384 Ascend 910C NPUs and 192 Kunpeng CPUs interconnected via an ultra-high-bandwidth Unified Bus (UB) network, enabling direct all-to-all communication and dynamic pooling of resources. These features optimize performance for communication-intensive operations, such as large-scale MoE expert parallelism and distributed key-value cache access. To fully leverage CloudMatrix384, we propose CloudMatrix-Infer, an advanced LLM serving solution incorporating three core innovations: a peer-to-peer serving architecture that independently scales prefill, decode, and caching; a large-scale expert parallelism strategy supporting EP320 via efficient UB-based token dispatch; and hardware-aware optimizations including specialized operators, microbatch-based pipelining, and INT8 quantization. Evaluation with the DeepSeek-R1 model shows CloudMatrix-Infer achieves state-of-the-art efficiency: prefill throughput of 6,688 tokens/s per NPU and decode throughput of 1,943 tokens/s per NPU (<50 ms TPOT). It effectively balances throughput and latency, sustaining 538 tokens/s per NPU even under stringent 15 ms latency constraints, while INT8 quantization maintains model accuracy across benchmarks.
大语言模型(LLM)的快速演进由不断增长的参数规模、混合专家(MoE)架构的采用以及不断扩展的上下文长度所驱动,对AI基础设施提出了前所未有的要求。传统AI集群在计算强度、内存带宽、芯片间通信和延迟方面存在局限,而多变的工作负载和严格的服务级别目标又使问题更加突出。解决这些问题需要从根本上重新设计软硬件的融合。本文介绍Huawei CloudMatrix,一种下一代AI数据中心架构,并已在生产级的CloudMatrix384超节点中实现。它集成了384个Ascend 910C NPU和192个Kunpeng CPU,通过超高带宽的统一总线(UB)网络互连,支持直接的全对全通信和资源的动态池化。这些特性优化了通信密集型操作的性能,例如大规模MoE专家并行和分布式键值缓存访问。为充分发挥CloudMatrix384的能力,我们提出CloudMatrix-Infer,一种先进的LLM服务方案,包含三项核心创新:可独立扩展预填充、解码和缓存的对等服务架构;通过基于UB的高效token分发支持EP320的大规模专家并行策略;以及硬件感知的优化,包括专用算子、基于微批次的流水线和INT8量化。使用DeepSeek-R1模型的评估表明,CloudMatrix-Infer达到了业界领先的效率:预填充吞吐量为每NPU 6,688 tokens/s,解码吞吐量为每NPU 1,943 tokens/s(TPOT低于50 ms)。它有效地平衡了吞吐量和延迟,即使在严格的15 ms延迟约束下仍能维持每NPU 538 tokens/s,同时INT8量化在各项基准测试中保持了模型精度。
Article 13
Title@2025-06-18 (3): Centroid Approximation for Byzantine-Tolerant Federated Learning
Title: Centroid Approximation for Byzantine-Tolerant Federated Learning | Centroid Approximation für Byzantinisch-Tolerant-Federated Learning | 拜占庭 – – 协调联邦学习 2506.15264v1 |
Authors (4): Mélanie Cambus, Darya Melnyk, Tijana Milentijević, Stefan Schmid
Federated learning allows each client to keep its data locally when training machine learning models in a distributed setting. Significant recent research established the requirements that the input must satisfy in order to guarantee convergence of the training loop. This line of work uses averaging as the aggregation rule for the training models. In particular, we are interested in whether federated learning is robust to Byzantine behavior, and observe and investigate a tradeoff between the average/centroid and the validity conditions from distributed computing. We show that the various validity conditions alone do not guarantee a good approximation of the average. Furthermore, we show that reaching good approximation does not give good results in experimental settings due to possible Byzantine outliers. Our main contribution is the first lower bound of $\min\{\frac{n-t}{t},\sqrt{d}\}$ on the centroid approximation under box validity that is often considered in the literature, where $n$ is the number of clients, $t$ the upper bound on the number of Byzantine faults, and $d$ is the dimension of the machine learning model. We complement this lower bound by an upper bound of $2\min\{n,\sqrt{d}\}$, by providing a new analysis for the case $n<d$. In addition, we present a new algorithm that achieves a $\sqrt{2d}$-approximation under convex validity, which also proves that the existing lower bound in the literature is tight. We show that all presented bounds can also be achieved in the distributed peer-to-peer setting. We complement our analytical results with empirical evaluations in federated stochastic gradient descent and federated averaging settings.
联邦学习允许每个客户端在分布式环境中训练机器学习模型时将数据保留在本地。近期的重要研究确立了输入必须满足的要求,以保证训练循环的收敛。这类工作使用平均作为训练模型的聚合规则。我们特别关注联邦学习对拜占庭行为是否鲁棒,并观察和研究平均值/质心与分布式计算中的有效性条件之间的权衡。我们证明,仅凭各种有效性条件并不能保证对平均值的良好近似。此外,我们还表明,由于可能存在拜占庭离群值,达到良好的近似并不能在实验环境中带来好的结果。我们的主要贡献是在文献中常被考虑的盒有效性(box validity)下,给出质心近似的第一个下界 $\min\{\frac{n-t}{t},\sqrt{d}\}$,其中 $n$ 为客户端数量,$t$ 为拜占庭故障数量的上界,$d$ 为机器学习模型的维度。我们通过对 $n<d$ 情形的新分析,给出上界 $2\min\{n,\sqrt{d}\}$ 作为对该下界的补充。此外,我们提出一种新算法,在凸有效性(convex validity)下实现 $\sqrt{2d}$ 近似,这也证明了文献中已有的下界是紧的。我们证明,所有给出的界在分布式对等网络环境中同样可以达到。我们还在联邦随机梯度下降和联邦平均的设置下进行了实证评估,作为对理论结果的补充。
Article 14
Title@2025-06-18 (3): eLLM: Elastic Memory Management Framework for Efficient LLM Serving
Title: eLLM: Elastic Memory Management Framework for Efficient LLM Serving | eLLM: Elastic Memory Management Framework für effizientes LLM Serving | eLLM:高效LLM服务 Elastic记忆管理框架 2506.15155v1 |
Authors (14): Jiale Xu, Rui Zhang, Yi Xiong, Cong Guo, Zihan Liu, Yangjie Zhou, Weiming Hu, Hao Wu, Changxu Shao, Ziqing Wang, Yongjie Yuan, Junping Zhao, Minyi Guo, Jingwen Leng
Large Language Models are increasingly being deployed in datacenters. Serving these models requires careful memory management, as their memory usage includes static weights, dynamic activations, and key-value caches. While static weights are constant and predictable, dynamic components such as activations and KV caches change frequently during runtime, presenting significant challenges for efficient memory management. Modern LLM serving systems typically handle runtime memory and KV caches at distinct abstraction levels: runtime memory management relies on static tensor abstractions, whereas KV caches utilize a page table-based virtualization layer built on top of the tensor abstraction. This virtualization dynamically manages KV caches to mitigate memory fragmentation. However, this dual-level approach fundamentally isolates runtime memory and KV cache management, resulting in suboptimal memory utilization under dynamic workloads, which can lead to a nearly 20% drop in throughput. To address these limitations, we propose eLLM, an elastic memory management framework inspired by the classical memory ballooning mechanism in operating systems. The core components of eLLM include: (1) Virtual Tensor Abstraction, which decouples the virtual address space of tensors from the physical GPU memory, creating a unified and flexible memory pool; (2) an Elastic Memory Mechanism that dynamically adjusts memory allocation through runtime memory inflation and deflation, leveraging CPU memory as an extensible buffer; and (3) a Lightweight Scheduling Strategy employing SLO-aware policies to optimize memory utilization and effectively balance performance trade-offs under stringent SLO constraints. Comprehensive evaluations demonstrate that eLLM significantly outperforms state-of-the-art systems, achieving 2.32x higher decoding throughput and supporting 3x larger batch sizes for 128K-token inputs.
大语言模型正越来越多地部署在数据中心中。为这些模型提供服务需要精细的内存管理,因为其内存占用包括静态权重、动态激活和键值(KV)缓存。静态权重恒定且可预测,而激活和KV缓存等动态部分在运行时频繁变化,给高效的内存管理带来了重大挑战。现代LLM服务系统通常在不同的抽象层次上处理运行时内存和KV缓存:运行时内存管理依赖静态张量抽象,而KV缓存则使用构建在张量抽象之上、基于页表的虚拟化层。这种虚拟化动态地管理KV缓存以缓解内存碎片。然而,这种双层方法从根本上割裂了运行时内存和KV缓存的管理,导致动态工作负载下的内存利用率不理想,并可能使吞吐量下降近20%。为了解决这些局限,我们提出eLLM,一种受操作系统中经典内存气球(memory ballooning)机制启发的弹性内存管理框架。eLLM的核心组件包括:(1)虚拟张量抽象,将张量的虚拟地址空间与GPU物理内存解耦,形成统一且灵活的内存池;(2)弹性内存机制,通过运行时的内存膨胀和收缩动态调整内存分配,并利用CPU内存作为可扩展的缓冲区;(3)轻量级调度策略,采用SLO感知的策略来优化内存利用率,并在严格的SLO约束下有效平衡性能上的取舍。全面的评估表明,eLLM显著优于最先进的系统,解码吞吐量提高2.32倍,并且对128K token的输入支持3倍大的批大小。
Article 15
Title@2025-06-18 (3): Parallel Data Object Creation: Towards Scalable Metadata Management in High-Performance I/O Library
Title: Parallel Data Object Creation: Towards Scalable Metadata Management in High-Performance I/O Library | Parallel Data Object Creation: Auf dem Weg zu einem skalierbaren Metadaten-Management in der Hochleistungs-I/O-Bibliothek | 平行数据对象的生成:在高绩效一/O图书馆中实现可缩放元数据管理 2506.15114v1 |
Authors (6): Youjia Li, Robert Latham, Robert Ross, Ankit Agrawal, Alok Choudhary, Wei-Keng Liao
High-level I/O libraries, such as HDF5 and PnetCDF, are commonly used by large-scale scientific applications to perform I/O tasks in parallel. These I/O libraries store metadata, such as data types and dimensionality, along with the raw data in the same files. While these libraries are well-optimized for concurrent access to the raw data, they are designed neither to handle a large number of data objects efficiently nor to let multiple processes create different data objects independently, as they require applications to call data object creation APIs collectively with consistent metadata among all processes. Applications that process data gathered from remote sensors, such as particle collision experiments in high-energy physics, may generate data of different sizes from different sensors and want to store them as separate data objects. For such applications, the I/O library’s requirement of collective data object creation can become very expensive, as the cost of metadata consistency checks increases with the metadata volume as well as the number of processes. To address this limitation, using PnetCDF as an experimental platform, we investigate solutions that abide by the netCDF file format, and we propose a new file header format that enables independent data object creation. The proposed file header consists of two sections: an index table and a list of metadata blocks. The index table contains references to the metadata blocks, and each block stores the metadata of objects that can be created collectively or independently. The new design achieves scalable performance, cutting data object creation times by up to 582x when running on 4096 MPI processes to create 5,684,800 data objects in parallel. Additionally, the new method reduces the memory footprint, with each process requiring an amount of memory space inversely proportional to the number of processes.
HDF5和PnetCDF等高级I/O库常被大规模科学应用用来并行执行I/O任务。这些I/O库将数据类型、维度等元数据与原始数据存储在同一文件中。虽然这些库对原始数据的并发访问做了充分优化,但它们的设计既不能高效处理大量数据对象,也不支持多个进程独立创建不同的数据对象,因为它们要求应用在所有进程上以一致的元数据集体调用数据对象创建API。处理远程传感器采集数据的应用(例如高能物理中的粒子碰撞实验)可能从不同的传感器产生大小不同的数据,并希望将其存储为独立的数据对象。对于这类应用,I/O库对集体创建数据对象的要求可能变得非常昂贵,因为元数据一致性检查的开销会随元数据量和进程数的增加而增长。为了解决这一局限,本文以PnetCDF为实验平台,研究了遵循netCDF文件格式的解决方案,并提出了一种支持独立创建数据对象的新文件头格式。所提出的文件头由两部分组成:索引表和元数据块列表。索引表包含对元数据块的引用,每个块存储可以集体或独立创建的对象的元数据。新设计实现了可扩展的性能:在4096个MPI进程上并行创建5,684,800个数据对象时,数据对象创建时间最多缩短582倍。此外,新方法还减少了内存占用,每个进程所需的内存空间与进程数成反比。
Article 16
Title@2025-06-17 (2): Supporting the development of Machine Learning for fundamental science in a federated Cloud with the AI_INFN platform
Title: Supporting the development of Machine Learning for fundamental science in a federated Cloud with the AI_INFN platform | Unterstützung der Entwicklung von Machine Learning für die Grundlagenforschung in einer föderierten Cloud mit der AI_INFN-Plattform | 与AI_INFN平台合作,支持在联合云层中发展用于基础科学的机器学习 2502.21266v2 |
Authors (10): Lucio Anderlini, Matteo Barbetti, Giulio Bianchini, Diego Ciangottini, Stefano Dal Pra, Diego Michelotto, Carmelo Pellegrino, Rosa Petrini, Alessandro Pascolini, Daniele Spiga
Machine Learning (ML) is driving a revolution in the way scientists design, develop, and deploy data-intensive software. However, the adoption of ML presents new challenges for the computing infrastructure, particularly in terms of provisioning and orchestrating access to hardware accelerators for development, testing, and production. The INFN-funded project AI_INFN (“Artificial Intelligence at INFN”) aims at fostering the adoption of ML techniques within INFN use cases by providing support on multiple aspects, including the provision of AI-tailored computing resources. It leverages cloud-native solutions in the context of INFN Cloud, to share hardware accelerators as effectively as possible, ensuring the diversity of the Institute’s research activities is not compromised. In this contribution, we provide an update on the commissioning of a Kubernetes platform designed to ease the development of GPU-powered data analysis workflows and their scalability on heterogeneous, distributed computing resources, possibly federated as Virtual Kubelets with the interLink provider.
机器学习(ML)正在推动科学家设计、开发和部署数据密集型软件的方式的革命。然而,ML的通过给计算基础设施带来了新的挑战,特别是在提供和安排为开发、测试和生产提供硬件加速器方面。由NINFN资助的项目AI_INFN(“FIN的人工智能”)旨在通过支持在NINFN使用的案例中采用ML技术,在多个方面提供支持,包括提供全自动定制的计算资源。它利用INNFN云背景下的云化解决方案,尽可能有效地分享硬件加速器,确保研究所研究活动的多样性不会受到损害。在这个贡献中,我们提供了启用Kubernetes平台的最新情况,该平台旨在便利开发GPU驱动的数据分析工作流程及其在混合、分布式计算机资源上的可缩放性,并可能作为虚拟Kubetlets与链接提供商联合。
Article 17
Title@2025-06-17 (2): Scaling Intelligence: Designing Data Centers for Next-Gen Language Models
Title: Scaling Intelligence: Designing Data Centers for Next-Gen Language Models | Scaling Intelligence: Konzipieren von Rechenzentren für Sprachmodelle der nächsten Generation | 扩大情报范围:为下一代语言模型设计数据中心 2506.15006v1 |
Authors (4): Jesmin Jahan Tithi, Hanjiang Wu, Avishaii Abuhatzera, Fabrizio Petrini
The explosive growth of Large Language Models (LLMs), such as GPT-4 with 1.8 trillion parameters, demands a radical rethinking of data center architecture to ensure scalability, efficiency, and cost-effectiveness. Our work provides a comprehensive co-design framework that jointly explores FLOPS, HBM bandwidth and capacity, multiple network topologies (two-tier vs. FullFlat optical), the size of the scale-out domain, and popular parallelism/optimization strategies used in LLMs. We introduce and evaluate FullFlat network architectures, which provide uniform high-bandwidth, low-latency connectivity between all nodes, and demonstrate their transformative impact on performance and scalability. Through detailed sensitivity analyses, we quantify the benefits of overlapping compute and communication, leveraging hardware-accelerated collectives, wider scale-out domains, and larger memory capacity. Our study spans both sparse (mixture of experts) and dense transformer-based LLMs, revealing how system design choices affect Model FLOPS Utilization (MFU = (model FLOPs per token × observed tokens per second) / peak FLOPS of the hardware) and overall throughput. For the co-design study, we extended and validated a performance modeling tool capable of predicting LLM runtime within 10% of real-world measurements. Our findings offer actionable insights and a practical roadmap for designing AI data centers that can efficiently support trillion-parameter models, reduce optimization complexity, and sustain the rapid evolution of AI capabilities.
大型语言模型(LLM)的爆炸式增长(例如拥有1.8万亿参数的GPT-4)要求彻底重新思考数据中心架构,以确保可扩展性、效率和成本效益。我们的工作提供了一个全面的协同设计框架,联合探索FLOPS、HBM带宽与容量、多种网络拓扑(两层与FullFlat光学网络)、横向扩展域的规模,以及LLM中常用的并行/优化策略。我们引入并评估了FullFlat网络架构,它在所有节点之间提供统一的高带宽、低延迟连接,并展示了其对性能和可扩展性的变革性影响。通过详细的敏感性分析,我们量化了计算与通信重叠、利用硬件加速的集合通信、更宽的横向扩展域以及更大内存容量所带来的收益。我们的研究涵盖稀疏(专家混合)和稠密的基于Transformer的LLM,揭示了系统设计选择如何影响模型FLOPS利用率(MFU = 每个token的模型FLOPs × 每秒观测到的token数 / 硬件峰值FLOPs)和整体吞吐量。在协同设计研究中,我们扩展并验证了一个性能建模工具,其对LLM运行时间的预测与实际测量的误差在10%以内。我们的研究结果为设计能够高效支持万亿参数模型、降低优化复杂度并支撑AI能力快速演进的AI数据中心提供了可操作的见解和实用的路线图。
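The MFU definition quoted in the abstract can be made concrete with a small sketch. The function name and all three input values below are hypothetical placeholders chosen only to illustrate the arithmetic; none of them come from the paper.

```python
def model_flops_utilization(flops_per_token: float,
                            tokens_per_sec: float,
                            peak_flops: float) -> float:
    """MFU = model FLOPs per token * observed tokens/s / peak hardware FLOP/s."""
    if peak_flops <= 0:
        raise ValueError("peak_flops must be positive")
    return flops_per_token * tokens_per_sec / peak_flops


# Hypothetical inputs, not measurements from the paper.
flops_per_token = 6e9     # assumed model cost per token
tokens_per_sec = 50_000   # assumed observed throughput
peak_flops = 1e15         # assumed aggregate peak of the hardware

mfu = model_flops_utilization(flops_per_token, tokens_per_sec, peak_flops)
print(f"MFU = {mfu:.1%}")
```

Because MFU is a ratio of achieved to peak FLOP/s, anything that raises observed tokens per second on fixed hardware (such as overlapping compute with communication) raises MFU proportionally.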
Article 18
Title@2025-06-17 (2): Zarr-Based Chunk-Level Cumulative Sums in Reduced Dimensions
Title: Zarr-Based Chunk-Level Cumulative Sums in Reduced Dimensions | ZARR-Based Chunk-Level Kumulative Summen in reduzierten Abmessungen | 减量尺寸中Zarr 基铜级累计总和 2506.14981v1 |
Authors (4): Hailiang Zhang, Dieu My T. Nguyen, Christine Smit, Mahabal Hegde
Data analysis on massive multi-dimensional data, such as high-resolution large-region time averaging or area averaging for geospatial data, often involves calculations over a significant number of data points. While performing calculations in scalable and flexible distributed or cloud environments is a viable option, a full scan of large data volumes still serves as a computationally intensive bottleneck, leading to significant cost. This paper introduces a generic and comprehensive method to address these computational challenges. This method generates a small, size-tunable supplementary dataset that stores the cumulative sums along specific subset dimensions on top of the raw data. This minor addition unlocks rapid and cheap high-resolution large-region data analysis, making calculations over large numbers of data points feasible with small instances or even microservices in the cloud. This method is general-purpose, but is particularly well-suited for data stored in chunked, cloud-optimized formats and for services running in distributed or cloud environments. We present a Zarr extension proposal to integrate the specifications of this method and facilitate its straightforward implementation in general-purpose software applications. Benchmark tests demonstrate that this method, implemented in Amazon Web Services (AWS), significantly outperforms the brute-force approach used in on-premises services. With just 5% supplemental storage, this method achieves a performance that is 3-4 orders of magnitude (~10,000 times) faster than the brute-force approach, while incurring significantly reduced computational costs.
对海量多维数据的分析(例如对地理空间数据进行高分辨率、大区域的时间平均或面积平均)往往涉及对大量数据点的计算。虽然在可扩展且灵活的分布式或云环境中进行计算是一个可行的选择,但对大数据量进行全量扫描仍然是计算密集型的瓶颈,会导致高昂的成本。本文介绍了一种通用而全面的方法来应对这些计算挑战。该方法在原始数据之上生成一个小型的、大小可调的补充数据集,用于存储沿特定维度子集的累积和。这一小幅增加使快速而廉价的高分辨率大区域数据分析成为可能,使得对大量数据点的计算可以在云中的小型实例甚至微服务上完成。该方法是通用的,但特别适合以分块、云优化格式存储的数据,以及在分布式或云环境中运行的服务。我们提出了一项Zarr扩展提案,以整合该方法的规范,并便于在通用软件应用中直接实现。基准测试表明,该方法在亚马逊云服务(AWS)上实现后,性能显著优于本地服务中使用的暴力方法。仅需5%的补充存储,该方法就能比暴力方法快3至4个数量级(约10,000倍),同时显著降低计算成本。
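The idea of storing chunk-level cumulative sums can be illustrated with a one-dimensional sketch. This is a generic prefix-sum illustration under simplifying assumptions (a plain Python list stands in for a chunked array, and the names `build_chunk_cumsums` and `range_sum` are hypothetical); it is not the Zarr extension proposed in the paper.

```python
from itertools import accumulate


def build_chunk_cumsums(data, chunk):
    """Cumulative sum at the end of every chunk (the small auxiliary dataset)."""
    running = list(accumulate(data))
    return [running[min(i + chunk, len(data)) - 1]
            for i in range(0, len(data), chunk)]


def range_sum(data, cumsums, chunk, start, stop):
    """Sum of data[start:stop], reading at most two partial chunks of raw data."""
    def prefix(n):
        # Sum of data[:n]: one stored value plus a scan of less than one chunk.
        full = n // chunk
        base = cumsums[full - 1] if full else 0
        return base + sum(data[full * chunk:n])
    return prefix(stop) - prefix(start)


data = list(range(1, 101))   # stand-in for a long time series
chunk = 10                   # larger chunks give a smaller auxiliary dataset
cumsums = build_chunk_cumsums(data, chunk)
print(range_sum(data, cumsums, chunk, 13, 87), sum(data[13:87]))
```

The chunk length is the tuning knob: a longer chunk shrinks the auxiliary dataset but increases the amount of raw data scanned at the two edges of a query.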
Article 19
Title@2025-06-17 (2): Exploring Dynamic Load Balancing Algorithms for Block-Structured Mesh-and-Particle Simulations in AMReX
Title: Exploring Dynamic Load Balancing Algorithms for Block-Structured Mesh-and-Particle Simulations in AMReX | Dynamische Lastausgleichsalgorithmen für blockstrukturierte Mesh-and-Particle-Simulationen in AMReX erforschen | 探索AMReX 中块结构化网状和粒子模拟的动态负载平衡算法 2505.15122v2 |
Authors (4): Amitash Nanda, Md Kamal Hossain Chowdhury, Hannah Ross, Kevin Gott
Load balancing is critical for successful large-scale high-performance computing (HPC) simulations. With modern supercomputers increasing in complexity and variability, dynamic load balancing is becoming more critical to use computational resources efficiently. In this study, performed during a summer collaboration at Lawrence Berkeley National Laboratory, we investigate various standard dynamic load-balancing algorithms. This includes the time evaluation of a brute-force solve for application in algorithmic evaluation, as well as quality and time evaluations of the Knapsack algorithm, an SFC algorithm, and two novel algorithms: a painter’s partition-based SFC algorithm and a combined Knapsack+SFC methodology based on hardware topology. The results suggest that Knapsack and painter’s partition-based algorithms should be among the first algorithms evaluated by HPC codes for cases with limited weight deviation and will perform at least slightly better than AMReX’s percentage-tracking partitioning strategy across most simulations, although effects diminish as weight variety increases.
对成功的大型高性能计算(HPC)模拟来说,负载平衡是成功的大规模高性能计算(HPC)模拟的关键。随着现代超级计算机的复杂性和变异性不断增加,动态负载平衡对于高效使用计算资源越来越重要。在Lawrence Berkele国家实验室夏季合作期间进行的这项研究中,我们调查了各种标准的动态负载平衡算法。其中包括对用于算法评估的粗力解决方案进行时间评估,以及对Knapack算法、SFC算法和两种新奇算法的质量和时间评估:一个基于分区的SFC算法,另一个基于硬件地形的Knapsack+SFC方法组合。结果显示,Knapack和画家基于分区的算法应该是由HPC代码对重量偏差有限的案例进行评估的第一种算法,其效果至少比AMREX在多数模拟中的百分比跟踪分隔战略略微好一些,尽管随着重量的提高,效果会减少。
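Two of the strategies named in the abstract can be sketched in a few lines. The code below shows a common greedy knapsack-style heuristic (heaviest box to the currently lightest rank) and a painter's-partition split of an ordered list into contiguous runs. Both are generic textbook versions with hypothetical function names and integer weights assumed; they are not AMReX's implementations, and the box order passed to `painters_partition` is assumed to already follow a space-filling curve.

```python
import heapq


def knapsack_greedy(weights, nranks):
    """Assign each box to the currently lightest rank, heaviest boxes first."""
    heap = [(0, r) for r in range(nranks)]
    heapq.heapify(heap)
    owner = [None] * len(weights)
    for box in sorted(range(len(weights)), key=lambda b: -weights[b]):
        load, rank = heapq.heappop(heap)
        owner[box] = rank
        heapq.heappush(heap, (load + weights[box], rank))
    return owner


def painters_partition(weights, nranks):
    """Split an ordered list into at most nranks contiguous runs,
    minimising the heaviest run (binary search on the load cap)."""
    def runs_needed(cap):
        runs, current = 1, 0
        for w in weights:
            if current + w > cap:
                runs, current = runs + 1, 0
            current += w
        return runs

    lo, hi = max(weights), sum(weights)
    while lo < hi:
        mid = (lo + hi) // 2
        if runs_needed(mid) <= nranks:
            hi = mid
        else:
            lo = mid + 1

    owner, rank, current = [], 0, 0
    for w in weights:
        if current + w > lo:
            rank, current = rank + 1, 0
        current += w
        owner.append(rank)
    return owner


def rank_loads(weights, owner, nranks):
    """Total weight assigned to each rank."""
    loads = [0] * nranks
    for w, r in zip(weights, owner):
        loads[r] += w
    return loads


weights = [7, 2, 5, 10, 8]   # hypothetical per-box costs
for name, fn in (("knapsack", knapsack_greedy), ("painter", painters_partition)):
    print(name, rank_loads(weights, fn(weights, 2), 2))
```

The knapsack heuristic ignores locality and can therefore balance more tightly, while the contiguous split keeps neighbouring boxes on the same rank at the cost of a coarser balance.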
Article 20
Title@2025-06-17 (2): Event-Driven Online Vertical Federated Learning
Title: Event-Driven Online Vertical Federated Learning | Event-getriebenes Online-Vertical-Federated-Learning | 在线纵向联邦学习 2506.14911v1 |
Authors (4): Ganyu Wang, Boyu Wang, Bin Gu, Charles Ling
Online learning is more adaptable to real-world scenarios in Vertical Federated Learning (VFL) compared to offline learning. However, integrating online learning into VFL presents challenges due to the unique nature of VFL, where clients possess non-intersecting feature sets for the same sample. In real-world scenarios, the clients may not receive data streaming for the disjoint features for the same entity synchronously. Instead, the data are typically generated by an event relevant to only a subset of clients. We are the first to identify these challenges in online VFL, which have been overlooked by previous research. To address these challenges, we proposed an event-driven online VFL framework. In this framework, only a subset of clients were activated during each event, while the remaining clients passively collaborated in the learning process. Furthermore, we incorporated dynamic local regret (DLR) into VFL to address the challenges posed by online learning problems with non-convex models within a non-stationary environment. We conducted a comprehensive regret analysis of our proposed framework, specifically examining the DLR under non-convex conditions with event-driven online VFL. Extensive experiments demonstrated that our proposed framework was more stable than the existing online VFL framework under non-stationary data conditions while also significantly reducing communication and computation costs.
与离线学习相比,在线学习更适应垂直联邦学习(VFL)中的现实世界情景。然而,将在线学习纳入VFL提出了挑战,因为VFL具有独特的性质,因为客户拥有同一样本的非交叉特征。在现实世界情景中,客户可能不会为同一实体的脱节特征获得数据流。相反,数据通常由仅与一组客户相关的一个元素(emph{event})生成。我们首先在网上VFLL中发现这些挑战,而以前的研究已经忽略了这些挑战。为了应对这些挑战,我们提议建立一个由事件驱动的在线VFLL框架。在这个框架中,只有一组客户在每次活动中被激活,而其余客户则在学习过程中被动协作。此外,我们把emph{动态当地遗憾(DLR)}纳入VFLF,以应对在线学习问题在非静止环境中与非 convvex模型构成的挑战。我们对拟议框架进行了全面的遗憾分析,具体审查了非 convex情况下的DLR,在非 Convex网上测试下,在非CFLF 模式下,在活动驱动的在线计算中演示了我们拟议的非稳定的在线成本。
Article 21
Title@2025-06-17 (2): Resource Optimization with MPI Process Malleability for Dynamic Workloads in HPC Clusters
Title: Resource Optimization with MPI Process Malleability for Dynamic Workloads in HPC Clusters | Ressourcenoptimierung mit MPI-Prozess Malleability für dynamische Workloads in HPC Clustern | HPC 集群中动态工作量的 MPI 进程最小性 2506.14743v1 |
Authors (7): Sergio Iserte, Iker Martín-Álvarez, Krzysztof Rojek, José I. Aliaga, Maribel Castillo, Weronika Folwarska, Antonio J. Peña
Dynamic resource management is essential for optimizing computational efficiency in modern high-performance computing (HPC) environments, particularly as systems scale. While research has demonstrated the benefits of malleability in resource management systems (RMS), the adoption of such techniques in production environments remains limited due to challenges in standardization, interoperability, and usability. Addressing these gaps, this paper extends our prior work on the Dynamic Management of Resources (DMR) framework, which provides a modular and user-friendly approach to dynamic resource allocation. Building upon the original DMRlib reconfiguration runtime, this work integrates new methodology from the Malleability Module (MaM) of the Proteo framework, further enhancing reconfiguration capabilities with new spawning strategies and data redistribution methods. In this paper, we explore new malleability strategies in HPC dynamic workloads, such as merging MPI communicators and asynchronous reconfigurations, which offer new opportunities for dramatically reducing memory overhead. The proposed enhancements are rigorously evaluated on a world-class supercomputer, demonstrating improved resource utilization and workload efficiency. Results show that dynamic resource management can reduce the workload completion time by 40% and increase the resource utilization by over 20%, compared to static resource allocation.
在现代高性能计算(HPC)环境中,特别是在系统规模方面,动态资源管理对于优化计算效率至关重要。虽然研究表明,在资源管理系统(RMS)中具有可移动性的好处,但由于标准化、互操作性和可用性方面的挑战,在生产环境中采用这种技术仍然有限。弥补这些差距,本文件扩展了我们先前关于动态资源管理框架的工作,该框架为动态资源配置提供了模块化和方便用户的方法,为动态资源配置提供了一种模块化和方便用户的方法。在最初的DMRlib重组运行时间的基础上,这项工作整合了从Proteo框架的可移动性模块(MAM)中获得的新方法,进一步加强了配置能力,采用新的产卵战略和数据再分配方法。在本文件中,我们探索了HPC动态工作量中新的可移动性战略,如合并MPI通信器和不同步重组,为大幅降低记忆管理管理提供了新的机会。拟议改进将在世界级超级计算机上得到严格评价,表明资源利用率和工作量效率的提高。结果显示,动态资源管理可以将工作量完成时间减少40%,并将资源利用率提高20%以上的资源利用率。
Article 22
Title@2025-06-17 (2): SETI@home: Data Acquisition and Front-End Processing
Title: SETI@home: Data Acquisition and Front-End Processing | SETI@home: Datenerfassung und Front-End-Verarbeitung | SETI@home:数据采集和前端处理 2506.14718v1 |
Authors (6): Eric J. Korpela, David P. Anderson, Jeff Cobb, Matt Lebofsky, Wei Liu, Dan Werthimer
SETI@home is a radio Search for Extraterrestrial Intelligence (SETI) project, looking for technosignatures in data recorded at multiple observatories from 1998 to 2020. Most radio SETI projects analyze data using dedicated processing hardware. SETI@home uses a different approach: time-domain data is distributed over the Internet to $> 10^{5}$ volunteered home computers, which analyze it. The large amount of computing power this affords ($\sim 10^{15}$ floating-point operations per second (FPOP/s)) allows us to increase the sensitivity and generality of our search in three ways. First, we use coherent integration, a technique in which data is transformed so that the power of drifting signals is confined to a single discrete Fourier transform (DFT) bin. We perform this coherent search over 123 000 Doppler drift rates in the range ($\pm$100 Hz s$^{-1}$). Second, we search for a variety of signal types, such as pulsed signals and arbitrary repeated waveforms. The analysis uses a range of DFT sizes, with frequency resolutions ranging from 0.075 Hz to 1221 Hz. The front end of SETI@home produces a set of detections that exceed thresholds in power and goodness of fit. We accumulated $\sim 1.2\times 10^{10}$ such detections. The back end of SETI@home takes these detections, identifies and removes radio frequency interference (RFI), and looks for groups of detections that are consistent with extraterrestrial origin and that persist over long timescales. This paper describes the front end of SETI@home and provides parameters for the primary data source, the Arecibo Observatory; the back end and its results are described in a companion paper.
SETI@home 是一个射电地外文明搜寻(SETI)项目,在1998年至2020年间多个天文台记录的数据中寻找技术特征信号。大多数射电SETI项目使用专用处理硬件分析数据。SETI@home 采用了不同的方法:时域数据通过互联网分发给 $> 10^{5}$ 台志愿者的家用计算机进行分析。由此获得的大量计算能力($\sim 10^{15}$ 次浮点运算每秒(FPOP/s))使我们能够从三个方面提高搜索的灵敏度和通用性。首先,我们使用相干积分,这种技术对数据进行变换,使漂移信号的功率集中在单个离散傅里叶变换(DFT)频点内。我们在($\pm$100 Hz s$^{-1}$)范围内对 123 000 个多普勒漂移率进行这种相干搜索。其次,我们搜索多种信号类型,例如脉冲信号和任意重复波形。分析使用一系列DFT大小,频率分辨率从 0.075 Hz 到 1221 Hz 不等。SETI@home 的前端产生一组在功率和拟合优度上超过阈值的检测结果。我们累计获得了 $\sim 1.2\times 10^{10}$ 个此类检测结果。SETI@home 的后端接收这些检测结果,识别并去除射频干扰(RFI),并寻找与地外来源一致且在长时间尺度上持续存在的检测结果组。本文描述了 SETI@home 的前端,并给出了主要数据来源阿雷西博天文台的参数;后端及其结果在另一篇配套论文中描述。
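The coherent-integration step described in the abstract can be illustrated with a toy de-chirp. The sketch below builds a noiseless linear chirp, removes the quadratic phase for the matching drift rate, and compares how much power lands in the strongest DFT bin before and after. The record length, start frequency, and drift are hypothetical values in normalised units (DFT bins per record), and the direct O(n^2) DFT is used only to keep the example free of dependencies; this is not the SETI@home pipeline.

```python
import cmath
import math


def dft_power(x):
    """Power spectrum |X[k]|^2 from a direct O(n^2) DFT (fine for a small demo)."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n)]


def peak_fraction(power):
    """Fraction of the total power that falls in the strongest bin."""
    return max(power) / sum(power)


n = 128       # samples in the record (assumed)
f0 = 20       # start frequency, in DFT bins (assumed)
drift = 16    # frequency change over the record, in DFT bins (assumed)

# Linear chirp: instantaneous frequency is f0 + drift * (t / n) bins.
signal = [cmath.exp(2j * math.pi * (f0 * t / n + 0.5 * drift * (t / n) ** 2))
          for t in range(n)]

# De-chirp: multiply by the conjugate of the quadratic phase for the trial drift.
dechirped = [s * cmath.exp(-2j * math.pi * 0.5 * drift * (t / n) ** 2)
             for t, s in enumerate(signal)]

raw_power = dft_power(signal)
fixed_power = dft_power(dechirped)
print(f"raw peak fraction:        {peak_fraction(raw_power):.3f}")
print(f"de-chirped peak fraction: {peak_fraction(fixed_power):.3f}")
```

In a real search the drift rate is unknown, so the de-chirp is repeated for every trial drift rate, which is why the number of drift rates searched drives the computing cost.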
Article 23
Title@2025-06-17 (2): Keigo: Co-designing Log-Structured Merge Key-Value Stores with a Non-Volatile, Concurrency-aware Storage Hierarchy (Extended Version)
Title: Keigo: Co-designing Log-Structured Merge Key-Value Stores with a Non-Volatile, Concurrency-aware Storage Hierarchy (Extended Version) | Keigo: Co-Designing Log-Structured Merge Key-Value Stores mit einer nicht-volatilen, concurrency-aware Speicherhierarchie (erweiterte Version) | Keigo:共同设计有非流动、具有通货币通汇存储等级的逻辑结构合并金价库(扩展版本) 2506.14630v1 |
Authors (6): Rúben Adão, Zhongjie Wu, Changjun Zhou, Oana Balmau, João Paulo, Ricardo Macedo
We present Keigo, a concurrency- and workload-aware storage middleware that enhances the performance of log-structured merge key-value stores (LSM KVS) when they are deployed on a hierarchy of storage devices. The key observation behind Keigo is that there is no one-size-fits-all placement of data across the storage hierarchy that optimizes for all workloads. Hence, to leverage the benefits of combining different storage devices, Keigo places files across different devices based on their parallelism, I/O bandwidth, and capacity. We introduce three techniques - concurrency-aware data placement, persistent read-only caching, and context-based I/O differentiation. Keigo is portable across different LSMs, is adaptable to dynamic workloads, and does not require extensive profiling. Our system enables established production KVS such as RocksDB, LevelDB, and Speedb to benefit from heterogeneous storage setups. We evaluate Keigo using synthetic and realistic workloads, showing that it improves the throughput of production-grade LSMs up to 4x for write- and 18x for read-heavy workloads when compared to general-purpose storage systems and specialized LSM KVS.
我们介绍了Keigo, 这是一种货币和工作量均能存储的复合中继器,在储存装置的等级结构上部署时,可以提高按逻辑结构合并的关键价值仓库(LSM KVS)的性能; Keigo的主要观察是,在储存的层次上,没有一刀切地放置所有工作量最佳的全套数据; 因此,利用将不同储存装置合并的好处,Keigo将不同装置的档案放在不同装置的平行性、I/O带宽和容量的基础上; 我们采用了三种技术,即:计算货币识别数据放置、持续只读缓冲和基于背景的I/O差异; Keigo可移动到不同的LSM,可适应于动态工作量,不需要广泛的剖析; 我们的系统使得固定的KVS生产,如RocksDB、级别DB和Spreedb, 能够从各种储存装置中受益; 我们利用合成和现实的工作量来评估Keigo, 表明与通用储存系统相比,它可以将生产级LSMS的吞吐量提高到4x和18x,可读取。
Article 24
Title@2025-06-17 (2): Concepts for designing modern C++ interfaces for MPI
Title: Concepts for designing modern C++ interfaces for MPI | Konzepte für die Gestaltung moderner C++-Schnittstellen für MPI | 为MPI设计现代 C+++ 界面的概念 2506.14610v1 |
Authors (8): C. Nicole Avans, Alfredo A. Correa, Sayan Ghosh, Matthias Schimek, Joseph Schuchart, Anthony Skjellum, Evan D. Suggs, Tim Niklas Uhl
Since the C++ bindings were deleted in 2008, the Message Passing Interface (MPI) community has revived efforts in building high-level modern C++ interfaces. Such interfaces are either built to serve specific scientific application needs (with limited coverage of the underlying MPI functionalities), or as an exercise in general-purpose programming model building, with the hope that bespoke interfaces can be broadly adopted to construct a variety of distributed-memory scientific applications. However, with the advent of modern C++-based heterogeneous programming models, GPUs and widespread Machine Learning (ML) usage in contemporary scientific computing, the role of prospective community-standardized high-level C++ interfaces to MPI is evolving. The success of such an interface clearly will depend on providing robust abstractions and features adhering to the generic programming principles that underpin the C++ programming language, without compromising on either performance or portability, the core principles upon which MPI was founded. However, there is a tension between idiomatic C++ handling of types and lifetimes and MPI’s loose interpretation of object lifetimes/ownership and insistence on maintaining global states. Instead of proposing “yet another” high-level C++ interface to MPI, overlooking or providing partial solutions to work around the key issues concerning the dissonance between MPI semantics and idiomatic C++, this paper focuses on the three fundamental aspects of a high-level interface: type system, object lifetimes and communication buffers, also identifying inconsistencies in the MPI specification. Presumptive solutions can be unrefined, and we hope the broader MPI and C++ communities will engage with us in productive exchange of ideas and concerns.
自2008年C++绑定被删除以来,消息传递接口(MPI)社区重新开始努力构建高层次的现代C++接口。这类接口要么是为满足特定科学应用的需求而构建的(对底层MPI功能的覆盖有限),要么是作为通用编程模型构建的一种尝试,希望定制的接口能够被广泛采用,用于构建各种分布式内存科学应用。然而,随着基于现代C++的异构编程模型、GPU以及机器学习(ML)在当代科学计算中的广泛使用,未来由社区标准化的MPI高层C++接口的角色正在演变。这种接口的成功显然取决于提供稳健的抽象和特性,遵循支撑C++编程语言的泛型编程原则,同时不牺牲性能或可移植性,这是MPI赖以建立的核心原则。然而,C++对类型和生命周期的惯用处理方式,与MPI对对象生命周期/所有权的宽松解释及其对维护全局状态的坚持之间存在张力。本文没有提出“又一个”MPI高层C++接口,从而忽视MPI语义与惯用C++之间不协调的关键问题或仅提供部分变通方案,而是聚焦于高层接口的三个基本方面:类型系统、对象生命周期和通信缓冲区,同时指出MPI规范中的不一致之处。初步的解决方案可能尚不完善,我们希望更广泛的MPI和C++社区与我们进行富有成效的思想和关切交流。
Article 25
Title@2025-06-17 (2): Consensus Power Inequality: A Comparative Study of Blockchain Networks
Title: Consensus Power Inequality: A Comparative Study of Blockchain Networks | Consensus Power Inequality: Eine vergleichende Studie über Blockchain-Netzwerke | 协商一致的 “ 权力不平等:对供应链网络的比较研究 “ 2506.14393v1 |
Authors (3): Kamil Tylinski, Abylay Satybaldy, Paolo Tasca
The distribution of consensus power is a cornerstone of decentralisation, influencing the security, resilience, and fairness of blockchain networks while ensuring equitable impact among participants. This study provides a rigorous evaluation of consensus power inequality across five prominent blockchain networks - Bitcoin, Ethereum, Cardano, Hedera, and Algorand - using data collected from January 2022 to July 2024. Leveraging established economic metrics, including the Gini coefficient and Theil index, the research quantitatively assesses how power is distributed among blockchain network participants. A robust dataset, capturing network-specific characteristics such as mining pools, staking patterns, and consensus nodes, forms the foundation of the analysis, enabling meaningful comparisons across diverse architectures. Through an in-depth comparative study, the paper identifies key disparities in consensus power distribution. Hedera and Bitcoin demonstrate more balanced power distribution, aligning closely with the principles of decentralisation. Ethereum and Cardano demonstrate moderate levels of inequality. However, contrary to expectations, Ethereum has become more concentrated following its transition to Proof-of-Stake. Meanwhile, Algorand shows a pronounced centralisation of power. Moreover, the findings highlight the structural and operational drivers of inequality, including economic barriers, governance models, and network effects, offering actionable insights for more equitable network design. This study establishes a methodological framework for evaluating blockchain consensus power inequality, emphasising the importance of targeted strategies to ensure fairer power distribution and enhancing the sustainability of decentralised systems. Future research will build on these findings by integrating additional metrics and examining the influence of emerging consensus mechanisms.
利用从2022年1月至2024年7月收集的数据,分配协商一致权力是权力下放的基石,影响供应链网络的安全、复原力和公平性,同时确保参与者之间的公平影响。本研究报告对五大主要供应链网络――Bitcoin、Etheum、Cardano、Hedera和Algorand――的共识权力不平等进行了严格的评估。利用既有经济指标,包括基尼系数和Theil指数,研究从数量上评估了供应链参与者之间的权力分配情况。强有力的数据集,捕捉了采矿集合、定型和共识节点等网络特有的特点,构成了分析的基础,使得能够对不同结构进行有意义的比较。通过深入的比较研究,该文件确定了共识权力分配方面的主要差异。Hederra和Bitcoin展示了更平衡的权力分配,与分散原则密切配合。Etium和Cardano的研究表明,与期望不同的是,Etheinum在向验证系统过渡后,更集中了更集中的网络影响。同时,Algoland还展示了对可持续性的明显集中性研究, 和可实现的不平等的网络影响。
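The two inequality metrics named in the abstract are straightforward to compute. The sketch below uses the standard sorted-rank formula for the Gini coefficient and the Theil T index; the function names and the sample shares are hypothetical and are not data from the study.

```python
import math


def gini(shares):
    """Gini coefficient: 0 for perfect equality, (n - 1) / n when one holder has all."""
    xs = sorted(shares)
    n, total = len(xs), sum(xs)
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n


def theil(shares):
    """Theil T index: 0 for perfect equality, ln(n) when one holder has all."""
    n = len(shares)
    mean = sum(shares) / n
    # Terms with x == 0 contribute 0 in the limit, so they are skipped.
    return sum((x / mean) * math.log(x / mean) for x in shares if x > 0) / n


equal = [1, 1, 1, 1]          # hypothetical: four validators with equal power
concentrated = [0, 0, 0, 1]   # hypothetical: one validator holds everything
skewed = [5, 10, 15, 70]      # hypothetical: a dominant pool plus a long tail

for name, shares in (("equal", equal), ("concentrated", concentrated), ("skewed", skewed)):
    print(f"{name:>12}: gini={gini(shares):.3f} theil={theil(shares):.3f}")
```

The two metrics weight the distribution differently: Gini is most sensitive to changes around the middle of the distribution, while Theil reacts more strongly to concentration at the top, which is one reason to report both.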
Article 26
Title@2025-06-17 (2): Decoupling Generation and Evaluation for Parallel Greedy Best-First Search (extended version)
Title: Decoupling Generation and Evaluation for Parallel Greedy Best-First Search (extended version) | Entkoppelung von Generation und Evaluation für parallele Greedy-Best-First-Suche (erweiterte Version) | 平行贪婪最佳第一搜索的脱钩生成和评估(扩展版) 2408.05682v2 |
Authors (2): Takumi Shimoda, Alex Fukunaga
To understand and control the search behavior of parallel search, recent work has proposed a class of constrained parallel greedy best-first search algorithms that expand only states satisfying some constraint. However, enforcing such constraints can be costly, as threads must wait idly until a state that satisfies the expansion constraint is available. We propose an improvement to constrained parallel search that decouples state generation from state evaluation and significantly improves the state evaluation rate, resulting in better search performance.
为了了解和控制平行搜索的搜索行为,最近的工作提出了一组受限制的平行贪婪最佳搜索算法,该算法只能扩大满足某些限制的条件。 然而,实施这些限制可能成本很高,因为线条必须等待满足扩张限制条件的状态才能到位。 我们建议改进受限制的平行搜索,它会分解各州的生成和州的评价,并显著提高州评价率,从而导致更好的搜索性能。
Article 27
Title@2025-06-17 (2): HarMoEny: Efficient Multi-GPU Inference of MoE Models
Title: HarMoEny: Efficient Multi-GPU Inference of MoE Models | HarMoEny: Effiziente Multi-GPU-Schlussfolgerung von MoE-Modellen | HarMoEny:教育部模型的高效多指数指数多推推 2506.12417v2 |
Authors (6): Zachary Doucet, Rishi Sharma, Martijn de Vos, Rafael Pires, Anne-Marie Kermarrec, Oana Balmau
Mixture-of-Experts (MoE) models offer computational efficiency during inference by activating only a subset of specialized experts for a given input. This enables efficient model scaling on multi-GPU systems that use expert parallelism without compromising performance. However, load imbalance among experts and GPUs introduces waiting times, which can significantly increase inference latency. To address this challenge, we propose HarMoEny, a novel solution to address MoE load imbalance through two simple techniques: (i) dynamic token redistribution to underutilized GPUs and (ii) asynchronous prefetching of experts from the system to GPU memory. These techniques achieve a near-perfect load balance among experts and GPUs and mitigate delays caused by overloaded GPUs. We implement HarMoEny and compare its latency and throughput with four MoE baselines using real-world and synthetic datasets. Under heavy load imbalance, HarMoEny increases throughput by 37%-70% and reduces time-to-first-token by 34%-41%, compared to the next-best baseline. Moreover, our ablation study demonstrates that HarMoEny’s scheduling policy reduces the GPU idling time by up to 84% compared to the baseline policies.
为了应对这一挑战,我们建议哈莫尼模型在推论期间提供计算效率,只为特定投入启用一组专门专家。这样可以对使用专家平行但又不损害性能的多GPU系统进行高效的模型缩放。然而,专家和GPU之间的负荷不平衡引入了等待时间,这可以大大增加推推导时间。为了应对这一挑战,我们提议哈莫尼模型,这是一个通过两种简单技术解决MOE负载不平衡的新解决方案:(一) 动态象征性再分配到未充分利用的GPU,以及(二) 系统专家向GPU记忆的不同步前置。这些技术在专家和GPUP上实现了近乎完美的负载平衡,减轻了超载GPU造成的延误。我们实施了HarMoEny,用四个MOE基线用真实世界和合成数据集来比较其衬底线和吞吐量。在重负载不平衡的情况下,HarMony将投入增加37%-70%,并将时间端偏移到GPU存储点。此外,与下一个最佳的基线相比,我们的研究将缩小了该基线。
Article 28
Title@2025-06-17 (2): Convergence-Privacy-Fairness Trade-Off in Personalized Federated Learning
Title: Convergence-Privacy-Fairness Trade-Off in Personalized Federated Learning | Convergence-Privacy-Fairness Trade-Off im personalisierten Federated Learning | 个人化联邦学习中统一-私人-公平贸易-个人化联邦学习 2506.14251v1 |
Authors (8): Xiyu Zhao, Qimei Cui, Weicai Li, Wei Ni, Ekram Hossain, Quan Z. Sheng, Xiaofeng Tao, Ping Zhang
Personalized federated learning (PFL), e.g., the renowned Ditto, strikes a balance between personalization and generalization by conducting federated learning (FL) to guide personalized learning (PL). In Ditto, FL is unaffected by personalized model training, whereas PL depends on the outcome of FL. However, the clients’ concern about their privacy and consequent perturbation of their local models can affect the convergence and (performance) fairness of PL. This paper presents a PFL scheme, called DP-Ditto, which is a non-trivial extension of Ditto under the protection of differential privacy (DP), and analyzes the trade-off among its privacy guarantee, model convergence, and performance distribution fairness. We also analyze the convergence upper bound of the personalized models under DP-Ditto and derive the optimal number of global aggregations given a privacy budget. Further, we analyze the performance fairness of the personalized models, and reveal the feasibility of optimizing DP-Ditto jointly for convergence and fairness. Experiments validate our analysis and demonstrate that DP-Ditto can surpass the DP-perturbed versions of the state-of-the-art PFL models, such as FedAMP, pFedMe, APPLE, and FedALA, by over 32.71% in fairness and 9.66% in accuracy.
个性化联邦学习(PFL),例如著名的Ditto,通过开展联邦学习(FL)来指导个性化学习(PL),在个性化与泛化之间取得平衡。虽然FL不受个性化模型训练的影响,但在Ditto中,PL依赖于FL的结果。然而,客户端对隐私的关切以及由此对其本地模型施加的扰动,会影响PL的收敛性和(性能)公平性。本文提出了一种名为DP-Ditto的PFL方法,它是Ditto在差分隐私(DP)保护下的非平凡扩展,并分析了其隐私保证、模型收敛和性能分布公平性之间的权衡。我们还分析了DP-Ditto下个性化模型的收敛上界,并推导出在给定隐私预算下的最优全局聚合次数。此外,我们分析了个性化模型的性能公平性,并揭示了针对收敛和公平性联合优化DP-Ditto的可行性。实验验证了我们的分析,并表明DP-Ditto在公平性上可超过最先进PFL模型(如FedAMP、pFedMe、APPLE和FedALA)的DP扰动版本32.71%以上,在准确率上超过9.66%。
Article 29
Title@2025-06-17 (2): A Novel Indicator for Quantifying and Minimizing Information Utility Loss of Robot Teams
Title: A Novel Indicator for Quantifying and Minimizing Information Utility Loss of Robot Teams | Ein neuer Indikator für die Quantifizierung und Minimierung von Informationen Dienstprogramm Verlust von Roboter-Teams | 计算和尽量减少机器人小组信息效用损失的新指标 2506.14237v1 |
Authors (8): Xiyu Zhao, Qimei Cui, Wei Ni, Quan Z. Sheng, Abbas Jamalipour, Guoshun Nan, Xiaofeng Tao, Ping Zhang
The timely exchange of information among robots within a team is vital, but it can be constrained by limited wireless capacity. The inability to deliver information promptly can result in estimation errors that impact collaborative efforts among robots. In this paper, we propose a new metric termed Loss of Information Utility (LoIU) to quantify the freshness and utility of information critical for cooperation. The metric enables robots to prioritize information transmissions within bandwidth constraints. We also propose the estimation of LoIU using belief distributions and accordingly optimize both transmission schedule and resource allocation strategy for device-to-device transmissions to minimize the time-average LoIU within a robot team. A semi-decentralized Multi-Agent Deep Deterministic Policy Gradient framework is developed, where each robot functions as an actor responsible for scheduling transmissions among its collaborators while a central critic periodically evaluates and refines the actors in response to mobility and interference. Simulations validate the effectiveness of our approach, demonstrating an enhancement of information freshness and utility by 98%, compared to alternative methods.
团队内部的机器人之间及时交流信息至关重要,但它可能受到有限无线能力的制约。 无法及时提供信息可能导致影响机器人之间协作努力的估计错误。 在本文件中,我们提出了一个新的指标,称为信息效用损失(LOIU),以量化对合作至关重要的信息的新鲜度和效用。该指标使机器人能够在带宽限制范围内确定信息传播的优先次序。我们还提议利用信仰分布对LOIU进行估计,并据此优化设备对设备对设备传输的传输时间表和资源分配战略,以尽量减少机器人团队中平均时间的LOIU。开发了一个半集中化的多显性深层政策优先框架,其中每个机器人都作为负责安排其合作者之间传输的行为者,而中央评论家则定期评估和完善应对流动性和干扰的行为者。模拟验证了我们的方法的有效性,表明与替代方法相比,信息新鲜度和实用性提高了98%。
Article 30
Title@2025-06-17 (2): The Redundancy of Full Nodes in Bitcoin: A Network-Theoretic Demonstration of Miner-Centric Propagation Topologies
Title: The Redundancy of Full Nodes in Bitcoin: A Network-Theoretic Demonstration of Miner-Centric Propagation Topologies | Die Redundanz von Vollknoten in Bitcoin: Eine netzwerktheoretische Demonstration von Miner-Centric Propagation Topologien | Bittcoin中完全节点的冗余:矿工-Centric传承型体的网络理论示范 2506.14197v1 |
Authors (1): Dr Craig S Wright
This paper formally examines the network structure of Bitcoin CORE (BTC) and Bitcoin Satoshi Vision (BSV) using complex graph theory to demonstrate that home-hosted full nodes are incapable of participating in or influencing the propagation topology. Leveraging established models such as scale-free networks and small-world connectivity, we demonstrate that the propagation graph is dominated by a densely interconnected miner clique, while full nodes reside on the periphery, excluded from all transaction-to-block inclusion paths. Using simulation-backed metrics and eigenvalue centrality analysis, we confirm that full nodes are neither critical nor operationally relevant for consensus propagation.
本文正式审查了Bitcoin CORE(BTC)和Bitcoin Satoshi Vision(BSV)的网络结构,使用了复杂的图表理论来证明由家庭主机的全节无法参与或影响传播的地形学。利用无规模网络和小世界连接等既定模型,我们证明传播图以一个密不可分的矿床为主,而完整的节点则位于外围,并被排除在所有交易到区融合路径之外。我们利用模拟支持的衡量标准以及电子价值中心分析,我们确认完全节点对于共识传播既不重要,在业务上也不切合。
Article 31
Title@2025-06-17 (2): Cost-Efficient Serving of LLM Agents via Test-Time Plan Caching
Title: Cost-Efficient Serving of LLM Agents via Test-Time Plan Caching | Kosteneffiziente Bedienung von LLM-Agenten über Test-Zeitplan-Caching | 通过试验-时间计划缓冲,以成本效率高的方式服务LLM代理物 2506.14852v1 |
Authors (3): Qizheng Zhang, Michael Wornow, Kunle Olukotun
LLM-based agentic applications have shown increasingly remarkable capabilities in complex workflows but incur substantial costs due to extensive planning and reasoning requirements. Existing LLM caching techniques (like context caching and semantic caching), primarily designed for serving chatbots, are insufficient for agentic applications where outputs depend on external data or environmental contexts. We propose agentic plan caching, a novel approach that extracts, stores, adapts, and reuses structured plan templates from planning stages of agentic applications across semantically similar tasks to reduce the cost of serving. Unlike traditional semantic caching, our system extracts plan templates from completed agent executions at test-time, employs keyword extraction to match new requests against cached plans, and utilizes lightweight models to adapt these templates to task-specific plans with contexts. Evaluation across multiple real-world agentic applications shows that our system can reduce costs by 46.62% on average while maintaining performance, offering a more efficient solution for serving LLM-based agents that complements existing LLM serving infrastructures.
以LLM为基础的代理应用在复杂的工作流程中表现出日益显著的能力,但由于广泛的规划和推理要求,费用高昂。主要设计用于为聊天室提供服务的现有LLM缓冲技术(如环境缓冲和语义缓冲等)对于产出取决于外部数据或环境背景的代理应用来说是不够的。我们提议了一种新颖的计划缓冲方法,即提取、储存、调整和再利用跨语义类似任务的代理应用规划阶段的结构性计划模板,以降低服务成本。与传统的语义缓冲法不同,我们的系统从测试时已完成的代理处决中提取了计划模板,使用关键词提取方法将新的请求与缓存计划匹配,并利用轻量模型使这些模板适应特定任务计划的环境。跨多个现实世界的代理应用的评估表明,我们的系统可以平均地降低46.62%的成本,同时保持绩效,为补充现有LLM服务基础设施的LM代理提供更有效的解决方案。
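The store, match, and adapt cycle described in the abstract can be sketched with a toy cache. Everything below is a hypothetical stand-in: the stopword-based `extract_keywords` replaces whatever keyword extraction the system uses, exact keyword-set matching replaces its matching rule, and `adapt` replaces the lightweight model that fills in task-specific context. The plan steps are invented for illustration.

```python
import re

STOPWORDS = {"what", "is", "the", "for", "a", "an", "of", "in", "to"}


def extract_keywords(request: str) -> frozenset:
    """Toy keyword extractor: lowercase alphabetic words minus stopwords.
    Numbers are dropped so task-specific values do not change the cache key."""
    words = re.findall(r"[a-z]+", request.lower())
    return frozenset(w for w in words if w not in STOPWORDS)


class PlanCache:
    """Maps a keyword set to a plan template extracted from a finished run."""

    def __init__(self):
        self._templates = {}

    def lookup(self, request):
        return self._templates.get(extract_keywords(request))

    def store(self, request, template):
        self._templates[extract_keywords(request)] = template


def adapt(template, **context):
    """Stand-in for the adaptation step: fill the template's slots."""
    return [step.format(**context) for step in template]


cache = PlanCache()
first = "What is the total revenue for 2023?"
miss = cache.lookup(first)   # cold cache: the agent would plan from scratch here

# After the first run completes, store a generalised template of its plan.
cache.store(first, ["load_table('sales')", "filter(year == {year})", "sum('revenue')"])

second = "What is the total revenue for 2024?"
plan = adapt(cache.lookup(second), year=2024)
print(miss, plan)
```

The saving comes from the second request: it skips the expensive planning call and pays only for the cheap adaptation step, which is where a reduction in serving cost would originate.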
Article 32
Title@2025-06-17 (2): Efficient Serving of LLM Applications with Probabilistic Demand Modeling
Title: Efficient Serving of LLM Applications with Probabilistic Demand Modeling | Effizientes Servieren von LLM-Anwendungen mit probabilistischer Nachfragemodellierung | 高效率地利用概率需求建模来服务LLM应用程序 2506.14851v1 |
Authors (11): Yifei Liu, Zuo Gan, Zhenghao Gan, Weiye Wang, Chen Chen, Yizhou Shan, Xusheng Chen, Zhenhua Han, Yifei Zhu, Shixuan Sun, Minyi Guo
Applications based on Large Language Models (LLMs) contain a series of tasks to address real-world problems with boosted capability, which have dynamic demand volumes on diverse backends. Existing serving systems treat the resource demands of LLM applications as a black box, compromising end-to-end efficiency due to improper queuing order and backend warm-up latency. We find that the resource demands of LLM applications can be modeled in a general and accurate manner with a Probabilistic Demand Graph (PDGraph). We then propose Hermes, which leverages PDGraph for efficient serving of LLM applications. Confronting probabilistic demand description, Hermes applies the Gittins policy to determine the scheduling order that can minimize the average application completion time. It also uses the PDGraph model to help prewarm cold backends at proper moments. Experiments with diverse LLM applications confirm that Hermes can effectively improve the application serving efficiency, reducing the average completion time by over 70% and the P95 completion time by over 80%.
基于大语言模型(LLMs)的应用包含一系列任务,以解决提升能力带来的现实世界问题,这些增强能力对多种后端的需求量是动态的。现有的服务系统将LLM应用程序的资源需求作为黑盒处理,由于不适当的排队顺序和后端暖化延缓度而损害端到端的效率。我们发现LLM应用程序的资源需求可以与概率需求图(PDGraph)以一般和准确的方式建模。我们然后提议Hermes,利用PDGraph来有效服务LLM应用程序。面对概率性需求描述,Hermes应用Gitins政策来确定能够最大限度地减少平均应用完成时间的排期顺序。它也使用PDGraph模型来帮助在适当时间前温后端。与各种LM应用程序的实验证实Hermes能够有效地改进应用程序的效率,将平均完成时间减少70%以上,P95完成时间减少80%以上。
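To make the Gittins policy mentioned in the abstract above concrete, here is a minimal sketch of the textbook Gittins rank for a job whose total demand follows a known discrete distribution. For a job that has already received `attained` units of service, the rank is the minimum over candidate quanta Δ of E[min(S - a, Δ) | S > a] / P(S - a ≤ Δ | S > a), and the job with the lowest rank is served first. The function name and the example distributions are assumptions for illustration; how Hermes derives demand distributions from PDGraph is not shown.

```python
def gittins_rank(dist, attained=0.0):
    """Gittins rank of a job; lower rank means higher scheduling priority.

    dist: list of (total_size, probability) pairs for the job's demand.
    attained: service the job has already received without finishing.
    """
    # Condition on the job not having finished yet.
    remaining = [(size - attained, p) for size, p in dist if size > attained]
    total = sum(p for _, p in remaining)
    if total == 0:
        raise ValueError("job cannot still be running under this distribution")

    best = float("inf")
    # It suffices to try quanta that end exactly at a possible completion point.
    for delta, _ in remaining:
        p_done = sum(p for r, p in remaining if r <= delta) / total
        expected_work = sum(min(r, delta) * p for r, p in remaining) / total
        best = min(best, expected_work / p_done)
    return best


# Job A is usually short but occasionally very long; job B is always medium.
job_a = [(1.0, 0.9), (100.0, 0.1)]
job_b = [(10.0, 1.0)]

rank_a_fresh = gittins_rank(job_a)                 # A is tried first
rank_b_fresh = gittins_rank(job_b)
rank_a_after_1 = gittins_rank(job_a, attained=1.0) # A did not finish in 1 unit
```

The example shows why a distribution-aware order beats a fixed one: job A goes first because it is likely to finish quickly, but once it survives its short quantum its rank rises above job B's, so B overtakes it.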
Article 33
Title@2025-06-17 (2): Déjà Vu: Efficient Video-Language Query Engine with Learning-based Inter-Frame Computation Reuse
Title: Déjà Vu: Efficient Video-Language Query Engine with Learning-based Inter-Frame Computation Reuse | Déjà Vu: Effiziente Video-Sprachen-Abfrage-Engine mit Learning-based Inter-Frame Computation Reuse | Déjà Vu:高效视频语言查询引擎,以学习为基础的基于学习的网络间计算再使用 2506.14107v1 |
Authors (11): Jinwoo Hwang, Daeun Kim, Sangyeop Lee, Yoonsung Kim, Guseul Heo, Hojoon Kim, Yunseok Jeong, Tadiwos Meaza, Eunhyeok Park, Jeongseob Ahn, Jongse Park
Recently, Video-Language Models (VideoLMs) have demonstrated remarkable capabilities, offering significant potential for flexible and powerful video query systems. These models typically rely on Vision Transformers (ViTs), which process video frames individually to extract visual embeddings. However, generating embeddings for large-scale videos requires ViT inferencing across numerous frames, posing a major hurdle to real-world deployment and necessitating solutions for integration into scalable video data management systems. This paper introduces Déjà Vu, a video-language query engine that accelerates ViT-based VideoLMs by reusing computations across consecutive frames. At its core is ReuseViT, a modified ViT model specifically designed for VideoLM tasks, which learns to detect inter-frame reuse opportunities, striking an effective balance between accuracy and reuse. Although ReuseViT significantly reduces computation, these savings do not directly translate into performance gains on GPUs. To overcome this, Déjà Vu integrates memory-compute joint compaction techniques that convert the FLOP savings into tangible performance gains. Evaluations on three VideoLM tasks show that Déjà Vu accelerates embedding generation by up to 2.64x within a 2% error bound, dramatically enhancing the practicality of VideoLMs for large-scale video analytics.
最近,视频语言模型(VideoLMS)展示了非凡的能力,为灵活和强大的视频查询系统提供了巨大的潜力。这些模型通常依赖视觉变换器(View 变换器),这些变换器单独处理视频框架以提取视觉嵌入器。然而,为大型视频生成嵌入器需要ViT跨多个框架进行嵌入,这对现实世界的部署构成重大障碍,并且需要将解决方案纳入可缩放的视频数据管理系统。本文介绍了D'eja Vu,一个视频语言查询引擎,通过在连续的框中重新使用计算来加速ViT的视频LMMS。其核心是ReuseViT,这是专门为视频LM任务设计的经过修改的ViT模型,它学会探测跨框架再利用的机会,在准确性和再利用之间取得有效平衡。虽然再使用ViT会大大降低计算,但这些节省量并不能直接转化为可缩放的视频数据管理系统。要克服这一点,D'64 Vu整合存储联合压缩技术,将FLOP储蓄转换成有形的图像磁带结果,在3VLx级的升级任务中加速了VLu的升级。
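The idea of reusing computation across consecutive frames, as described in the abstract above, can be sketched with a much simpler stand-in. ReuseViT learns when to reuse; the sketch below swaps that for a fixed threshold on the mean absolute difference between a patch and the same patch in the previous frame. `embed_frames`, `mean_abs_diff`, the threshold value, and the toy `encode_patch` are all hypothetical.

```python
def mean_abs_diff(a, b):
    """Mean absolute difference between two equally sized patches."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def embed_frames(frames, encode_patch, threshold=0.05):
    """Embed every patch of every frame, reusing the previous frame's
    embedding when the patch has barely changed.

    frames: list of frames, each a list of patches (lists of floats).
    encode_patch: the expensive per-patch encoder (stand-in for a ViT block).
    Returns (embeddings per frame, number of encoder calls actually made).
    """
    prev_frame, prev_embeddings = None, None
    outputs, encoder_calls = [], 0
    for frame in frames:
        embeddings = []
        for i, patch in enumerate(frame):
            unchanged = (
                prev_frame is not None
                and mean_abs_diff(patch, prev_frame[i]) < threshold
            )
            if unchanged:
                embeddings.append(prev_embeddings[i])   # reuse, no compute
            else:
                embeddings.append(encode_patch(patch))  # recompute
                encoder_calls += 1
        outputs.append(embeddings)
        prev_frame, prev_embeddings = frame, embeddings
    return outputs, encoder_calls


# Two frames of two patches each; only the second patch changes.
video = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [2.0, 2.0]],
]
embeddings, calls = embed_frames(video, encode_patch=sum)
```

A fixed threshold compared against the immediately preceding frame lets small changes accumulate unnoticed, which is one reason a learned reuse decision with an explicit error bound is attractive.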
Article 34
Title@2025-06-16 (1): ReinDSplit: Reinforced Dynamic Split Learning for Pest Recognition in Precision Agriculture
Title: ReinDSplit: Reinforced Dynamic Split Learning for Pest Recognition in Precision Agriculture | ReinDSplit: Dynamisches Split-Lernen für Pesterkennung in der Precision Agriculture verstärkt | ReinDSplit:强化动态分散学习,以便在精密农业中承认害虫特征 2506.13935v1 |
Authors (4): Vishesh Kumar Tanwar, Soumik Sarkar, Asheesh K. Singh, Sajal K. Das
To empower precision agriculture through distributed machine learning (DML), split learning (SL) has emerged as a promising paradigm, partitioning deep neural networks (DNNs) between edge devices and servers to reduce computational burdens and preserve data privacy. However, conventional SL frameworks’ one-split-fits-all strategy is a critical limitation in agricultural ecosystems where edge insect monitoring devices exhibit vast heterogeneity in computational power, energy constraints, and connectivity. This leads to straggler bottlenecks, inefficient resource utilization, and compromised model performance. Bridging this gap, we introduce ReinDSplit, a novel reinforcement learning (RL)-driven framework that dynamically tailors DNN split points for each device, optimizing efficiency without sacrificing accuracy. Specifically, a Q-learning agent acts as an adaptive orchestrator, balancing workloads and latency thresholds across devices to mitigate computational starvation or overload. By framing split layer selection as a finite-state Markov decision process, ReinDSplit convergence ensures that highly constrained devices contribute meaningfully to model training over time. Evaluated on three insect classification datasets using ResNet18, GoogleNet, and MobileNetV2, ReinDSplit achieves 94.31% accuracy with MobileNetV2. Beyond agriculture, ReinDSplit pioneers a paradigm shift in SL by harmonizing RL for resource efficiency, privacy, and scalability in heterogeneous environments.
为了通过分布式机器学习(DML)增强精密农业能力,分化学习(SL)已成为一个大有希望的模式,将边缘装置和服务器之间的深神经网络分隔开来,以减少计算负担并保护数据隐私;然而,常规SL框架的一刀切战略是农业生态系统中的一个关键限制,因为边缘昆虫监测装置在计算能力、能源限制和连通性方面表现出巨大的差异性。这导致分层瓶颈、资源利用效率低下和模型性能受损。缩小这一差距,我们引入了Reindsplit,这是一个新的强化学习框架,它以动态方式为每个装置定制DNNN分离点,在不牺牲准确性的情况下优化效率。具体地说,Q学习代理作为适应性管弦乐器,平衡各种装置之间的工作量和耐久性临界值,以减轻计算性饥饿或超负荷。通过将分层选择设定为有限状态的Markov决策过程,ReindDSpllit 趋同确保高度受限的装置在一段时间内对模式培训做出有意义的贡献。评估了三种节解数据集数据集,使用了ResNet18、GoogNetNet-lafalityDSliftal-DSliftDServiews Slift Slid SliftDS
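The Q-learning formulation in the abstract above can be sketched as a small tabular agent that picks a split layer per device tier. Everything concrete here is an assumption for illustration: the two tiers, the per-layer costs, the 8-layer network, the round-robin order in which devices are served, and the reward (negative total compute cost). It is not the paper's state space or reward.

```python
import random

N_LAYERS = 8                                  # hypothetical network depth
SPLITS = list(range(1, N_LAYERS))             # device runs the first k layers
CLIENT_COST = {"weak": 5.0, "strong": 0.5}    # assumed per-layer cost on device
SERVER_COST = 1.0                             # assumed per-layer cost on server
TIERS = list(CLIENT_COST)


def reward(tier, k):
    """Negative cost of running k layers on the device and the rest on the server."""
    return -(k * CLIENT_COST[tier] + (N_LAYERS - k) * SERVER_COST)


def train(steps=20000, alpha=0.1, gamma=0.5, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {t: {k: 0.0 for k in SPLITS} for t in TIERS}
    i = 0
    for _ in range(steps):
        state = TIERS[i]
        if rng.random() < epsilon:
            action = rng.choice(SPLITS)                    # explore
        else:
            action = max(q[state], key=q[state].get)       # exploit
        i = (i + 1) % len(TIERS)                           # next device in turn
        next_state = TIERS[i]
        # Standard Q-learning update toward the bootstrapped target.
        target = reward(state, action) + gamma * max(q[next_state].values())
        q[state][action] += alpha * (target - q[state][action])
    return q


q_table = train()
best_split = {t: max(q_table[t], key=q_table[t].get) for t in TIERS}
```

With these made-up costs the learned policy keeps few layers on the weak device and many on the strong one, which is the per-device tailoring the abstract contrasts with a one-split-fits-all strategy.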
Article 35
Title@2025-06-16 (1): A Terminology for Scientific Workflow Systems
Title: A Terminology for Scientific Workflow Systems | Eine Terminologie für wissenschaftliche Workflow-Systeme | 科学工作流程系统术语术语 2506.07838v4 |
Authors (26): Frédéric Suter, Tainã Coleman, İlkay Altintaş, Rosa M. Badia, Bartosz Balis, Kyle Chard, Iacopo Colonnelli, Ewa Deelman, Paolo Di Tommaso, Thomas Fahringer, Carole Goble, Shantenu Jha, Daniel S. Katz, Johannes Köster, Ulf Leser, Kshitij Mehta, Hilary Oliver, J. -Luc Peterson, Giovanni Pizzi, Loïc Pottier, Raül Sirvent, Eric Suchyta, Douglas Thain, Sean R. Wilkinson, Justin M. Wozniak, Rafael Ferreira da Silva
The term scientific workflow has evolved over the last two decades to encompass a broad range of compositions of interdependent compute tasks and data movements. It has also become an umbrella term for processing in modern scientific applications. Today, many scientific applications can be considered as workflows made of multiple dependent steps, and hundreds of workflow management systems (WMSs) have been developed to manage and run these workflows. However, no turnkey solution has emerged to address the diversity of scientific processes and the infrastructure on which they are implemented. Instead, new research problems requiring the execution of scientific workflows with some novel feature often lead to the development of an entirely new WMS. A direct consequence is that many existing WMSs share some salient features, offer similar functionalities, and can manage the same categories of workflows but also have some distinct capabilities. This situation makes researchers who develop workflows face the complex question of selecting a WMS. This selection can be driven by technical considerations, to find the system that is the most appropriate for their application and for the resources available to them, or other factors such as reputation, adoption, strong community support, or long-term sustainability. To address this problem, a group of WMS developers and practitioners joined their efforts to produce a community-based terminology of WMSs. This paper summarizes their findings and introduces this new terminology to characterize WMSs. This terminology is composed of five axes: workflow characteristics, composition, orchestration, data management, and metadata capture. Each axis comprises several concepts that capture the prominent features of WMSs. Based on this terminology, this paper also presents a classification of 23 existing WMSs according to the proposed axes and terms.
过去二十年来,科学工作流程这一术语演变为包括相互依存计算任务和数据流动的广泛构成。它也成为现代科学应用中处理的总括术语。今天,许多科学应用可被视为由多个依赖步骤组成的工作流程,数百个工作流程管理系统(WMSs)已经开发出来来管理和运行这些工作流程。然而,没有出现任何统包式解决办法来解决科学流程及其实施基础设施的多样性问题。相反,需要执行具有某些新特点的科学工作流程的新研究问题往往导致形成全新的WMS。一个直接后果是,许多现有的WMS术语具有某些显著特征,提供类似的功能,可以管理相同的工作流程类别,但也有一些不同的能力。这种情况使开发工作流程的研究人员面临选择WMS的复杂问题。这种选择可以由技术因素驱动,找到最适合其应用和资源的系统,或诸如声誉、采用、强有力的社区支持或长期文件可持续性等其他因素。为了解决这个问题,WMSS的当前术语的特性是WMS的每个术语的每个核心, 将WMS的当前定义和每个核心的术语组成了WMS的系统。
Article 36
Title@2025-06-16 (1): BanditWare: A Contextual Bandit-based Framework for Hardware Prediction
Title: BanditWare: A Contextual Bandit-based Framework for Hardware Prediction | BanditWare: Ein Kontextbandit-basiertes Framework für Hardware-Vorhersage | BanditWare:基于背景的硬硬件预测土匪框架 2506.13730v1 |
Authors (5): Tainã Coleman, Hena Ahmed, Ravi Shende, Ismael Perez, İlkay Altintaş
Distributed computing systems are essential for meeting the demands of modern applications, yet transitioning from single-system to distributed environments presents significant challenges. Misallocating resources in shared systems can lead to resource contention, system instability, degraded performance, priority inversion, inefficient utilization, increased latency, and environmental impact. We present BanditWare, an online recommendation system that dynamically selects the most suitable hardware for applications using a contextual multi-armed bandit algorithm. BanditWare balances exploration and exploitation, gradually refining its hardware recommendations based on observed application performance while continuing to explore potentially better options. Unlike traditional statistical and machine learning approaches that rely heavily on large historical datasets, BanditWare operates online, learning and adapting in real-time as new workloads arrive. We evaluated BanditWare on three workflow applications: Cycles (an agricultural science scientific workflow), BurnPro3D (a web-based platform for fire science), and a matrix multiplication application. Designed for seamless integration with the National Data Platform (NDP), BanditWare enables users of all experience levels to optimize resource allocation efficiently.
分散式计算系统对于满足现代应用的需求至关重要,但从单一系统向分布式环境过渡却提出了重大挑战。共享系统中资源分配不当可能导致资源争议、系统不稳定、性能退化、优先倒置、低效率利用、延时率增加和环境影响。我们介绍了BanditWare,这是一个在线建议系统,动态地选择最合适的硬件,用于使用一个符合背景的多武装强盗算法的应用。BanditWare平衡了勘探和开发,根据观察到的应用性能逐步完善其硬件建议,同时继续探索潜在的更好选项。与严重依赖大型历史数据集的传统统计和机器学习方法不同,BanditWare在网上操作,随着新工作量的到来,实时学习和适应。我们评估了BanditWare的三个工作流程应用程序:循环(农业科学工作流程)、BurnPro3D(一个基于网络的消防科学平台)和一个矩阵倍增应用。设计,以便与国家数据平台(NDP)进行无缝整合,BanditWare使所有经验水平的用户能够优化资源分配。
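A contextual bandit for hardware selection, as described in the abstract above, can be sketched with one linear runtime model per hardware option and epsilon-greedy exploration. The specific choices below are assumptions for illustration: a single context feature (input size), a least-squares line per arm, the two hardware names, and the noise-free simulated runtimes. The abstract does not state which exploration rule or model BanditWare uses.

```python
import random


class LinearArm:
    """Running least-squares fit of runtime ~ slope * size + intercept."""

    def __init__(self):
        self.n = self.sx = self.sy = self.sxx = self.sxy = 0.0

    def update(self, x, y):
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def predict(self, x):
        denom = self.n * self.sxx - self.sx ** 2
        if self.n < 2 or denom == 0:
            # Too little data for a line: fall back to the mean (0.0 if unseen,
            # which makes untried hardware look attractive and get explored).
            return self.sy / self.n if self.n else 0.0
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return slope * x + intercept


def true_runtime(hardware, size):
    """Simulated, noise-free runtimes (made-up numbers for illustration)."""
    return 2.0 * size + 1.0 if hardware == "small" else 0.5 * size + 10.0


def recommend(arms, size):
    """Pick the hardware with the lowest predicted runtime for this input size."""
    return min(arms, key=lambda name: arms[name].predict(size))


def run(rounds=500, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    arms = {"small": LinearArm(), "large": LinearArm()}
    for _ in range(rounds):
        size = rng.uniform(1.0, 20.0)                 # context of the new job
        if rng.random() < epsilon:
            choice = rng.choice(list(arms))           # explore
        else:
            choice = recommend(arms, size)            # exploit
        arms[choice].update(size, true_runtime(choice, size))
    return arms


arms = run()
small_job = recommend(arms, 2.0)
large_job = recommend(arms, 20.0)
```

Because the recommendation depends on the context, the same learner sends small inputs to the cheaper machine and large inputs to the faster one, without needing a historical dataset up front.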
Article 37
Title@2025-06-16 (1): POPQC: Parallel Optimization for Quantum Circuits (Extended Version)
Title: POPQC: Parallel Optimization for Quantum Circuits (Extended Version) | POPQC: Parallele Optimierung für Quantenkreise (erweiterte Version) | POPQC: 量子电路平行优化(扩展版本) 2506.13720v1 |
Authors (4): Pengyu Liu, Jatin Arora, Mingkuan Xu, Umut A. Acar
Optimization of quantum programs or circuits is a fundamental problem in quantum computing and remains a major challenge. State-of-the-art quantum circuit optimizers rely on heuristics and typically require superlinear, and even exponential, time. Recent work proposed a new approach that pursues a weaker form of optimality called local optimality. Parameterized by a natural number $\Omega$, local optimality insists that each and every $\Omega$-segment of the circuit is optimal with respect to an external optimizer, called the oracle. Local optimization can be performed using only a linear number of calls to the oracle but still incurs quadratic computational overheads in addition to oracle calls. Perhaps most importantly, the algorithm is sequential. In this paper, we present a parallel algorithm for local optimization of quantum circuits. To ensure efficiency, the algorithm operates by keeping a set of fingers into the circuit and maintains the invariant that a $\Omega$-deep circuit needs to be optimized only if it contains a finger. Operating in rounds, the algorithm selects a set of fingers, optimizes in parallel the segments containing the fingers, and updates the finger set to ensure the invariant. For constant $\Omega$, we prove that the algorithm requires $O(n\lg{n})$ work and $O(r\lg{n})$ span, where $n$ is the circuit size and $r$ is the number of rounds. We prove that the optimized circuit returned by the algorithm is locally optimal in the sense that any $\Omega$-segment of the circuit is optimal with respect to the oracle.
量子程序或电路的优化是量子计算的一个根本性问题, 仍然是一项重大挑战。 最先进的量子电路优化器依赖于超线性, 通常需要超线性, 甚至指数化的时间。 最近的工作提出了一种新的方法, 追求一种较弱的优化形式, 叫做本地优化。 以自然数字 $\ Omega$ 表示的参数, 本地最佳性坚持电路的每个和每个$\ Omega$ 部分与外部优化器( 称为 Ooracle ) 最优化。 本地优化只能使用对甲骨骼的直线数, 但仍然需要进行二次计算。 也许最重要的是, 算法是顺序的。 在本文中, 我们为本地优化电路提供了一种平行的算法。 为了确保效率, 算法的运作方式是将一组手指留在电路中, 保持一个美元 美元 深度的电路流需要优化的电路 。 在圆形中, 算法中, 将一个美元 美元 和 美元 美元 。
Article 38
Title@2025-06-16 (1): EBS-CFL: Efficient and Byzantine-robust Secure Clustered Federated Learning
Title: EBS-CFL: Efficient and Byzantine-robust Secure Clustered Federated Learning | EBS-CFL: Effizientes und Byzantinisch-Robustes Sicheres Cluster-Federiertes Lernen | EBS-CFL: 高效和拜占庭-怒火安全分组联邦学习 2506.13612v1 |
Authors (6): Zhiqiang Li, Haiyong Bao, Menghong Guan, Hao Pan, Cheng Huang, Hong-Ning Dai
Despite the potential of federated learning (FL) for collaborative learning, its performance deteriorates under the data heterogeneity of distributed users. Recently, clustered federated learning (CFL) has emerged to address this challenge by partitioning users into clusters according to their similarity. However, CFL faces difficulties in training when users are unwilling to share their cluster identities due to privacy concerns. To address these issues, we present an innovative Efficient and Robust Secure Aggregation scheme for CFL, dubbed EBS-CFL. The proposed EBS-CFL supports effective CFL training while keeping users’ cluster identities confidential. Moreover, it detects potential poisoning attacks without compromising individual client gradients by discarding negatively correlated gradients and aggregating positively correlated ones using a weighted approach. The server also authenticates correct gradient encoding by clients. EBS-CFL is highly efficient, with client-side overhead of O(ml + m^2) for communication and O(m^2l) for computation, where m is the number of cluster identities and l is the gradient size. When m = 1, the client-side computational efficiency of EBS-CFL is at least O(log n) times better than that of the comparison schemes, where n is the number of clients. In addition, we validate the scheme through extensive experiments. Finally, we theoretically prove the scheme’s security.
尽管联合会学习(FL)具有协作学习的潜力,但由于分布式用户的数据差异性,其业绩却因分布式用户的数据差异而恶化。最近,分组联合学习(CFL)通过根据用户的相似性将用户分成组群来应对这一挑战;然而,CFL由于隐私问题,当用户不愿意分享其集群身份时,在培训方面遇到困难。为了解决这些问题,我们为CFL提出了一个创新的高效和强力安全聚合计划,称为EBS-CFL。拟议的EBS-CFL支持有效培训CFL,同时以保密的方式保持用户的群集身份。此外,它通过放弃负相关梯度并使用加权方法将正相关梯度集合,发现潜在的有毒攻击,但不会损害单个客户的梯度。CFL服务器还验证客户的梯度编码。EBS-CFL与客户端管理O(ml+m%2)效率很高,用于计算通信和O(m%2l)的O(m%2l)数据是群集身份的数量,而L的梯度规模最小。当m= NBS-CFL系统比我们客户最终计算效率计划时,我们的安全率更精确化为我们客户系统。
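The filtering rule in the abstract above, which discards negatively correlated gradients and weights the positively correlated ones, can be sketched in plaintext as follows. This is only the arithmetic: the secure aggregation that hides individual gradients and cluster identities is omitted entirely. Using cosine similarity against a reference direction as both the filter and the weight is an assumption, as are the function names.

```python
import math


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def robust_aggregate(gradients, reference):
    """Weighted average of client gradients.

    Gradients whose cosine similarity with `reference` is not positive get
    weight 0 (discarded); the rest are weighted by that similarity.
    """
    weights = [max(cosine(g, reference), 0.0) for g in gradients]
    total = sum(weights)
    if total == 0:
        return [0.0] * len(reference)   # nothing trustworthy to aggregate
    return [
        sum(w * g[j] for w, g in zip(weights, gradients)) / total
        for j in range(len(reference))
    ]


honest_1 = [1.0, 0.0]
honest_2 = [1.0, 0.0]
poisoned = [-5.0, 0.0]          # points against the reference direction
aggregate = robust_aggregate([honest_1, honest_2, poisoned], reference=[1.0, 0.0])
```

A plain average of the three gradients would be pulled to -1.0 in the first coordinate by the single poisoned update; the filtered aggregate ignores it.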
Article 39
Title@2025-06-16 (1): ILVES: Accurate and efficient bond length and angle constraints in molecular dynamics
Title: ILVES: Accurate and efficient bond length and angle constraints in molecular dynamics | ILVES: Genaue und effiziente Bindungslänge und Winkelbeschränkungen in der molekularen Dynamik | ILVES: 分子动态的精确而高效的联结长度和角限制 2503.13075v3 |
Authors (10): Lorién López-Villellas, Carl Christian Kjelgaard Mikkelsen, Juan José Galano-Frutos, Santiago Marco-Sola, Jesús Alastruey-Benedé, Pablo Ibáñez, Pablo Echenique, Miquel Moretó, Maria Cristina De Rosa, Pablo García-Risueño
All-atom, force field-based molecular dynamics simulations are essential tools in computational chemistry, enabling the prediction and analysis of biomolecular systems with atomic-level resolution. However, as system sizes and simulation timescales increase, so does the associated computational cost. To extend simulated time using the same resources, a common strategy is to constrain the fastest degrees of freedom, such as bond lengths, allowing for larger integration time steps without compromising accuracy. The de facto state-of-the-art algorithms for this purpose (SHAKE, LINCS, and P-LINCS) are integrated into most molecular dynamics packages and widely adopted across the field. Despite their impact, these methods exhibit limitations: all converge slowly when high numerical accuracy is required, and the LINCS and P-LINCS algorithms cannot handle general angular constraints, limiting further increases in time step. In this article, we introduce ILVES, a family of parallel algorithms that converge so rapidly that it is now practical to solve bond length and associated angular constraint equations as accurately as the hardware will allow. We have integrated ILVES into Gromacs and our analysis demonstrates that it is superior to the state-of-the-art when constraining bond lengths. Due to its better convergence properties, we also show that if the time step is increased up to 3.5 fs by enforcing angular constraints, ILVES enables a 1.65x increase in simulated time using the same computational resources and wall-clock time, an outcome unattainable with current methods. This advance can significantly reduce the computational cost of most all-atom molecular dynamics simulations while improving their accuracy and extending access to larger systems and longer timescales.
所有原子, 强制的实地分子动态模拟是计算化学的基本工具, 使得以原子分辨率对生物分子系统进行预测和分析。 然而, 随着系统规模和模拟时间尺度的增加, 相关的计算成本也随之增加。 要使用相同资源延长模拟时间, 共同的战略是限制最快的自由度, 如债券长度, 允许更大的整合时间步骤, 同时又不降低准确性。 用于此目的的事实上最先进的算法( Shake、 LINCS 和 P- LINCS) 已经融入大多数分子动态包, 并且被广泛采用。 尽管这些方法产生了影响, 但它们也显示出局限性: 当需要高精确度时, 所有的模拟时间尺度都缓慢趋同, 而 LINCS 和 P- LINCS 的算法无法处理一般的角限制, 从而限制进一步增加时间步骤。 在本文章中, 我们引入了一套平行的算法, 现在可以快速地解决债券长度问题, 并随着硬件的精确度而降低其直角限制 。 我们把ILVES 更精确的算法, 当我们使用更精确的时间轨的计算到更精确的精确的计算时, , 当我们用一个更精确的递化的递化的递增的递化的递化的递增到硬的递增的递化的递增的递化的递化的递增到硬化的算法时, 。
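For readers unfamiliar with what a bond length constraint solver does, the sketch below shows the classic SHAKE-style iteration for a single bond, which for one constraint coincides with Newton's method on the constraint equation. It is not ILVES, which targets coupled bond and angle constraints in parallel; the function name, tolerance, and example coordinates are assumptions for illustration.

```python
import math


def constrain_bond(r1, r2, r1_old, r2_old, m1, m2, d, tol=1e-12, max_iter=50):
    """Correct positions r1, r2 so that |r1 - r2| equals the bond length d.

    The correction is applied along the bond direction from the previous
    time step (r1_old - r2_old) and is mass-weighted, so the centre of mass
    of the pair does not move.
    """
    r1, r2 = list(r1), list(r2)
    old = [a - b for a, b in zip(r1_old, r2_old)]
    inv_mass = 1.0 / m1 + 1.0 / m2
    for _ in range(max_iter):
        s = [a - b for a, b in zip(r1, r2)]
        violation = d * d - sum(x * x for x in s)      # zero when satisfied
        if abs(violation) < tol:
            break
        # Linearise |s + g * inv_mass * old|^2 = d^2 in g and solve for g.
        g = violation / (2.0 * inv_mass * sum(a * b for a, b in zip(s, old)))
        for j in range(3):
            r1[j] += g * old[j] / m1
            r2[j] -= g * old[j] / m2
    return r1, r2


# Bond of length 1 that an unconstrained integration step has stretched.
p1, p2 = constrain_bond(
    r1=(0.0, 0.0, 0.0), r2=(1.2, 0.1, 0.0),
    r1_old=(0.0, 0.0, 0.0), r2_old=(1.0, 0.0, 0.0),
    m1=1.0, m2=1.0, d=1.0,
)
bond_length = math.dist(p1, p2)
```

In a molecule many such constraints share atoms, so correcting one bond disturbs its neighbours; how quickly the coupled system converges is the property the abstract compares across SHAKE, LINCS, P-LINCS, and ILVES.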
Article 40
Title@2025-06-16 (1): Perfect Privacy for Discriminator-Based Byzantine-Resilient Federated Learning
Title: Perfect Privacy for Discriminator-Based Byzantine-Resilient Federated Learning | Perfekte Privatsphäre für diskriminatorbasiertes Byzantinisch-Resilientes Federated Learning | 具有抵抗力的联邦学习组织 2506.13561v1 |
Authors (4): Yue Xia, Christoph Hofmeister, Maximilian Egger, Rawad Bitar
Federated learning (FL) shows great promise in large-scale machine learning but introduces new privacy and security challenges. We propose ByITFL and LoByITFL, two novel FL schemes that enhance resilience against Byzantine users while keeping the users’ data private from eavesdroppers. To ensure privacy and Byzantine resilience, our schemes build on having a small representative dataset available to the federator and crafting a discriminator function allowing the mitigation of corrupt users’ contributions. ByITFL employs Lagrange coded computing and re-randomization, making it the first Byzantine-resilient FL scheme with perfect Information-Theoretic (IT) privacy, though at the cost of a significant communication overhead. LoByITFL, on the other hand, achieves Byzantine resilience and IT privacy at a significantly reduced communication cost, but requires a Trusted Third Party, used only in a one-time initialization phase before training. We provide theoretical guarantees on privacy and Byzantine resilience, along with convergence guarantees and experimental results validating our findings.
联邦学习(FL)在大型机器学习中表现出巨大的希望,但带来了新的隐私和安全挑战。 我们提议ByITFL和LoByITFL,这是两个新的FL计划,可以增强对拜占庭用户的复原力,同时将用户的数据从窃听者手中保持隐私和拜占庭复原力。 为了确保隐私和拜占庭复原力,我们的计划建立在向联邦提供一个小型代表性数据集的基础上,并设计一个歧视功能,从而可以减少腐败用户的贡献。 ByITFL采用Lagrange编码计算和重新定位,使之成为第一个具有完美信息理论隐私权的Byzantine弹性FL计划,尽管这样做的代价是通信管理费巨大。 另一方面,LoByFLIT以大幅降低通信成本的方式实现Byzantine复原力和信息技术隐私,但需要一个受信任的第三方,仅在培训前的一次性初始阶段使用。 我们为隐私和Byzantine复原力提供理论保障,同时提供趋同保证和实验结果验证我们的调查结果。
Article 41
Title@2025-06-16 (1): EvalNet: A Practical Toolchain for Generation and Analysis of Extreme-Scale Interconnects
Title: EvalNet: A Practical Toolchain for Generation and Analysis of Extreme-Scale Interconnects | EvalNet: Eine praktische Toolchain für die Generierung und Analyse von Extrem-Scale-Verbindungen | EvalNet:生成和分析极端系统互联的实用工具链 2105.12663v3 |
Authors (14): Maciej Besta, Patrick Iff, Marcel Schneider, Nils Blach, Alessandro Maissen, Salvatore Di Girolamo, Jens Domke, Jascha Krattenmacher, Ankit Singla, Kartik Lakhotia, Laura Monroe, Fabrizio Petrini, Robert Gerstenberger, Torsten Hoefler
The diversity of communication paths in a network - especially non-minimal paths - is a key enabler of performance at extreme scales. We present EvalNet, a toolchain for scalable generation and analysis of over 25 important network topologies, such as Slim Fly, PolarFly, and Orthogonal Fat Trees, with a strong focus on path diversity metrics. EvalNet provides an extensive and fine-grained analysis of shortest and non-shortest paths, including their multiplicities, lengths, and interference. It supports exact measurement and visualization of bandwidth and throughput between every router pair, enabling unprecedented insight into routing potential. EvalNet also includes detailed models for construction cost and power consumption, and interfaces seamlessly with established simulators, which we tune to support large-scale evaluations on low-cost hardware. Using EvalNet, we deliver the widest and most comprehensive path diversity study to date, demonstrating how path diversity underpins throughput and scalability, and facilitating progress towards new frontiers in extreme-scale network design.
网络 — — 特别是非最小路径 — — 通信路径的多样性是极端规模性工作的关键促进因素。我们展示了EvalNet,这是一个在25种重要的网络地形上,如Slim Fly、PollarFly和Orthogonal Fat Trees进行可缩放的生成和分析的工具链,其重点是路径多样性指标。EvalNet对最短和非最短路径,包括其多功能、长度和干扰进行了广泛和精细的分析。它支持对每个路由对之间带宽和输送的精确测量和可视化,从而能够对路由潜力进行前所未有的洞察。EvalNet还包括建筑成本和电力消耗的详细模型,以及与既有模拟器的无缝接口,我们通过这些模拟器支持对低成本硬件进行大规模评估。我们利用EvalNet,提供了迄今为止最广泛和最全面的路径多样性研究,展示了路径如何支撑通过量和可扩展性,并促进在极端规模的网络设计中迈向新的边界。
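One of the simplest path diversity metrics named in the abstract above, the multiplicity of shortest paths between router pairs, can be computed with a breadth-first search that counts paths as it goes. The sketch below is a generic graph routine, not EvalNet's API; the function name and the 4-router ring are illustrative assumptions.

```python
from collections import deque


def shortest_path_counts(adj, src):
    """Return (dist, count): hop distance from src to each reachable router,
    and the number of distinct shortest paths that realise it.

    adj: dict mapping each router to a list of its neighbours.
    """
    dist = {src: 0}
    count = {src: 1}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                # First time v is reached: it inherits u's path count.
                dist[v] = dist[u] + 1
                count[v] = count[u]
                queue.append(v)
            elif dist[v] == dist[u] + 1:
                # Another shortest route into v, via u.
                count[v] += count[u]
    return dist, count


# A ring of four routers: 0 - 1 - 2 - 3 - 0.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
dist, count = shortest_path_counts(ring, src=0)
```

In the ring, router 2 is two hops from router 0 and reachable by two equally short routes, one clockwise and one counter-clockwise. Running the routine from every source gives the full multiplicity matrix for a topology.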
Article 42
Title@2025-06-16 (1): Byzantine-Tolerant Consensus in GPU-Inspired Shared Memory
Title: Byzantine-Tolerant Consensus in GPU-Inspired Shared Memory | Byzantinisch-Tolerant Konsens in GPU-inspiriert gemeinsamen Speicher | 在GPU-受GPU启发的共同记忆中,拜占庭-容忍共识 2503.12788v2 |
Authors (3): Chryssis Georgiou, Manaswini Piduguralla, Sathya Peri
In this work, we formalize a novel shared memory model inspired by the popular GPU architecture. Within this model, we develop algorithmic solutions to the Byzantine Consensus problem and analyze their fault-resilience.
在这项工作中,我们正式确定了受广受欢迎的GPU架构启发的新颖的共享记忆模式。 在这个模式中,我们开发了拜占庭共识问题的算法解决方案,并分析了其缺陷的抵抗力。
Article 43
Title@2025-06-16 (1): Blockchain and Biometrics: Survey, GDPR Elements, and Future Directions
Title: Blockchain and Biometrics: Survey, GDPR Elements, and Future Directions | Blockchain und Biometrie: Umfrage, GDPR-Elemente und Zukunftsrichtung | 块链和生物计量:调查、GDPR要素和未来方向 2302.10883v3 |
Authors (7): Mahdi Ghafourian, Ruben Vera-Rodriguez, Julian Fierrez, Bilgesu Sumer, Ruben Tolosana, Aythami Moralez, Els Kindt
Biometric recognition as an efficient and hard-to-forge way of identification and verification has become an indispensable part of the current digital world. The fast evolution of this technology has been a strong incentive for integration into many applications. Meanwhile, blockchain, the decentralized ledger technology, has been widely received by both research and industry in the past few years, and it is being increasingly deployed today in many different applications, such as money transfer, IoT, healthcare, or logistics. Recently, researchers have started to speculate on the pros and cons and what the best applications would be when these two technologies cross paths. This paper provides a survey of the research literature on the combination of blockchain and biometrics and includes a first legal analysis of this integration based on GDPR to shed light on challenges and potentials. Although the integration of blockchain technology into the biometric sector is still in its infancy, with a growing body of literature discussing specific applications and advanced technological setups, this paper aims to provide a holistic understanding of blockchain applicability in biometrics. Based on published studies, this article discusses, among others, practical examples combining blockchain and biometrics for novel applications in PKI systems, distributed trusted services, and identity management. Challenges and limitations when combining blockchain and biometrics that motivate future work will also be discussed; e.g., blockchain networks at their current stage may not be efficient or economical for some real-time biometric applications. Finally, we also discuss key legal aspects of the EU General Data Protection Regulation (GDPR) related to this combination of technologies (blockchain and biometrics); for example, accountability, immutability, anonymity, and data protection elements.
近些年来,研究和产业界广泛接受分散的分类账技术,如今,这种技术正在越来越多地用于许多不同的应用,如资金转移、IOT、医疗保健或物流等。最近,研究人员开始猜测这两种技术的利弊和弊端,以及当这两种技术交汇时,最佳应用将是什么。本文对关于块链和生物鉴别技术相结合的研究文献进行了调查,并包括了以GDPR为基础的关于这一整合的首次法律分析,以揭示挑战和潜力。尽管将块链技术纳入生物鉴别部门的工作仍处于萌芽阶段,越来越多的文献讨论了具体应用和先进技术配置。本文旨在全面了解生物鉴别技术中的链系适用性。根据已出版的研究报告,本文除其他外,还讨论了将块链和生物鉴别技术的结合到块链和生物鉴别技术的组合的研究文献,并讨论了目前基本安全系统的系统、可信任性服务、最终数据管理等要素。
Article 44
Title@2025-06-16 (1): DDiT: Dynamic Resource Allocation for Diffusion Transformer Model Serving
Title: DDiT: Dynamic Resource Allocation for Diffusion Transformer Model Serving | DDiT: Dynamische Ressourcenzuteilung für Diffusionstransformator-Modelldienst | DDIT:为传播变异模型服务提供动态资源配置 2506.13497v1 |
Authors (10): Heyang Huang, Cunchen Hu, Jiaqi Zhu, Ziyuan Gao, Liangliang Xu, Yizhou Shan, Yungang Bao, Sun Ninghui, Tianwei Zhang, Sa Wang
The Text-to-Video (T2V) model aims to generate dynamic and expressive videos from textual prompts. The generation pipeline typically involves multiple modules, such as language encoder, Diffusion Transformer (DiT), and Variational Autoencoders (VAE). Existing serving systems often rely on monolithic model deployment, while overlooking the distinct characteristics of each module, leading to inefficient GPU utilization. In addition, DiT exhibits varying performance gains across different resolutions and degrees of parallelism, and significant optimization potential remains unexplored. To address these problems, we present DDiT, a flexible system that integrates both inter-phase and intra-phase optimizations. DDiT focuses on two key metrics: optimal degree of parallelism, which prevents excessive parallelism for specific resolutions, and starvation time, which quantifies the sacrifice of each request. To this end, DDiT introduces a decoupled control mechanism to minimize the computational inefficiency caused by imbalances in the degree of parallelism between the DiT and VAE phases. It also designs a greedy resource allocation algorithm with a novel scheduling mechanism that operates at the single-step granularity, enabling dynamic and timely resource scaling. Our evaluation on the T5 encoder, OpenSora SDDiT, and OpenSora VAE models across diverse datasets reveals that DDiT significantly outperforms state-of-the-art baselines by up to 1.44x in p99 latency and 1.43x in average latency.
文本到视频模型( T2V) 模式旨在从文本提示中生成动态和表达性视频。 生成管道通常包含多个模块, 如语言编码器、 Diful 变异器( DIT) 和 VAE 。 现有的服务系统往往依赖单体模型部署, 忽略每个模块的特性, 导致对 GPU 的利用效率低下。 此外, DiT 显示不同分辨率和平行度之间的不同性能收益不同, 且显著的优化潜力尚未开发。 为了解决这些问题, 我们介绍了 DDDIT , 这是一个灵活的系统, 整合了跨阶段和内部的优化。 DDIT 侧重于两个关键尺度: 最佳的平行度, 防止特定分辨率的过度平行性, 以及饥饿时间, 从而忽略了每个模块的牺牲。 为此, DDIT 引入了一种分解控制机制, 以最大限度地减少由于 DIT 和 VAEE 阶段之间的平行性差差造成的计算效率。 此外, DDIT 还设计了一个在新版本的 Streal- Streal Streal 系统上, 将一个稳定的资源配置算算算算出一个新的Siltial- sali- sal- sal- silvialvidustral- silvial- silvial- sal- salvialvialvialvial- silvial- sal- silvialviduvald- salvial
Article 45
Title@2025-06-16 (1): On Immutable Memory Systems for Artificial Agents: A Blockchain-Indexed Automata-Theoretic Framework Using ECDH-Keyed Merkle Chains
Title: On Immutable Memory Systems for Artificial Agents: A Blockchain-Indexed Automata-Theoretic Framework Using ECDH-Keyed Merkle Chains | Auf unveränderliche Speichersysteme für künstliche Agenten: Ein Blockchain-indexed Automata-Theoretic Framework mit ECDH-Keyed Merkle Chains | 人工制剂可变记忆系统:使用ECDH-Keyed Merkle 链条的链链式内浸式自动成像-理论框架 2506.13246v1 |
Authors (1): Craig Steven Wright
This paper presents a formalised architecture for synthetic agents designed to retain immutable memory, verifiable reasoning, and constrained epistemic growth. Traditional AI systems rely on mutable, opaque statistical models prone to epistemic drift and historical revisionism. In contrast, we introduce the concept of the Merkle Automaton, a cryptographically anchored, deterministic computational framework that integrates formal automata theory with blockchain-based commitments. Each agent transition, memory fragment, and reasoning step is committed within a Merkle structure rooted on-chain, rendering it non-repudiable and auditably permanent. To ensure selective access and confidentiality, we derive symmetric encryption keys from ECDH exchanges contextualised by hierarchical privilege lattices. This enforces cryptographic access control over append-only DAG-structured knowledge graphs. Reasoning is constrained by formal logic systems and verified through deterministic traversal of policy-encoded structures. Updates are non-destructive and historied, preserving epistemic lineage without catastrophic forgetting. Zero-knowledge proofs facilitate verifiable, privacy-preserving inclusion attestations. Collectively, this architecture reframes memory not as a cache but as a ledger - one whose contents are enforced by protocol, bound by cryptography, and constrained by formal logic. The result is not an intelligent agent that mimics thought, but an epistemic entity whose outputs are provably derived, temporally anchored, and impervious to post hoc revision. This design lays foundational groundwork for legal, economic, and high-assurance computational systems that require provable memory, unforgeable provenance, and structural truth.
本文为合成物剂提供了一个正规化的合成物剂结构,旨在保留不可改变的记忆、可核查的推理和受限的缩略式增长。传统的人工智能系统依赖于易感性、不透明的统计模型,容易被感知性漂移和历史修正。相反,我们引入了Merkle Automaton的概念,这是一个加密的、确定性的计算框架,将正式的自动数据理论与基于链式承诺相结合。每个物剂过渡、记忆碎片和推理步骤都在一个扎根于链的默克尔结构中进行,使其不可持久和可审计的永久。为了确保选择性的获取和保密性,我们从以等级特权为背景的ECDH(ECD)交换中获取对称的加密密钥。这对仅附于DAG结构化知识图件的加密控制。 理由受正式逻辑系统制约,并通过政策编码结构的确定性曲解的曲解来验证。 更新是非破坏性的、具有历史特性的、 缩略性、 缩略性、 直系但不会被灾难性地遗忘的底线线系的线系。
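The append-only, tamper-evident memory described in the abstract above rests on a familiar building block: each record commits to the digest of the one before it. The sketch below shows only that hash chain, using SHA-256 from Python's standard library. The class name, record layout, and genesis value are assumptions; the Merkle tree, on-chain anchoring, ECDH-derived encryption keys, and zero-knowledge proofs mentioned in the abstract are not modelled.

```python
import hashlib
import json

GENESIS = "0" * 64   # assumed placeholder digest for the first link


def link_digest(prev_digest, payload):
    """SHA-256 over the previous digest and the payload, serialised canonically."""
    body = json.dumps({"prev": prev_digest, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class MemoryChain:
    """Append-only log in which every entry commits to its predecessor."""

    def __init__(self):
        self.entries = []

    def append(self, payload):
        prev = self.entries[-1]["digest"] if self.entries else GENESIS
        digest = link_digest(prev, payload)
        self.entries.append({"prev": prev, "payload": payload, "digest": digest})
        return digest

    def verify(self):
        """Recompute every link; any edit to an earlier entry breaks the chain."""
        prev = GENESIS
        for entry in self.entries:
            if entry["prev"] != prev:
                return False
            if entry["digest"] != link_digest(prev, entry["payload"]):
                return False
            prev = entry["digest"]
        return True


chain = MemoryChain()
chain.append({"step": 1, "fact": "door is locked"})
chain.append({"step": 2, "fact": "key is under the mat"})
intact = chain.verify()

chain.entries[0]["payload"]["fact"] = "door is open"   # rewrite history
tampered_detected = not chain.verify()
```

Publishing the latest digest somewhere the agent cannot alter, such as a blockchain transaction, is what turns this local check into a commitment others can audit.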
Article 46
Title@2025-06-16 (1): DFPL: Decentralized Federated Prototype Learning Across Heterogeneous Data Distributions
Title: DFPL: Decentralized Federated Prototype Learning Across Heterogeneous Data Distributions | DFPL: Dezentrales Federated Prototype Learning über unterschiedliche Datenverteilungen hinweg | DFPL: 分散的联邦原型学习,跨异种数据分布 2505.04947v3 |
Authors (6): Hongliang Zhang, Fenghua Xu, Zhongyuan Yu, Shanchen Pang, Chunqiang Hu, Jiguo Yu
Federated learning is a distributed machine learning paradigm through centralized model aggregation. However, standard federated learning relies on a centralized server, making it vulnerable to server failures. While existing solutions utilize blockchain technology to implement Decentralized Federated Learning (DFL), the statistical heterogeneity of data distributions among clients severely degrades the performance of DFL. Driven by this issue, this paper proposes a decentralized federated prototype learning framework, named DFPL, which significantly improves the performance of DFL across heterogeneous data distributions. Specifically, DFPL introduces prototype learning into DFL to mitigate the impact of statistical heterogeneity and reduces the amount of parameters exchanged between clients. Additionally, blockchain is embedded into our framework, enabling the training and mining processes to be implemented locally on each client. From a theoretical perspective, we analyze the convergence of DFPL by modeling the required computational resources during both training and mining processes. The experiment results highlight the superiority of our DFPL in model performance and communication efficiency across four benchmark datasets with heterogeneous data distributions.
联邦学习是一种分散的机械学习模式,通过集中的模型集成,但标准联合学习依赖中央服务器,使其容易受服务器故障的影响。虽然现有解决方案利用链链技术实施分权联邦学习(DFL),但客户之间数据分布的统计差异性严重削弱了DFL的绩效。受这一问题驱动,本文件提出一个名为DFPL的分散化联邦原型学习框架,它大大改善了DFL在不同数据分布中的性能。具体来说,DFPL将原型学习引入DFL,以减轻统计异质的影响,减少客户之间交换的参数数量。此外,将块链嵌入我们的框架中,使培训和采矿过程能够在当地对每个客户实施。从理论角度,我们分析DFPL的趋同,在培训和采矿过程中对所需的计算资源进行建模。实验结果突出了我们的DFPL在模式性业绩和通信效率方面的优势,跨越四个有多种数据分布的基准数据集。
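Prototype learning, as used in the abstract above, replaces the exchange of full model parameters with the exchange of one representative vector per class. A minimal sketch follows, assuming that a prototype is the mean embedding of a class on a client and that prototypes are merged by a sample-count-weighted average. Both are common conventions rather than details confirmed by the abstract, and the blockchain layer is omitted.

```python
def local_prototypes(embeddings, labels):
    """Per-class mean embedding on one client.

    Returns {label: (prototype_vector, sample_count)}.
    """
    sums, counts = {}, {}
    for vec, y in zip(embeddings, labels):
        acc = sums.setdefault(y, [0.0] * len(vec))
        for j, x in enumerate(vec):
            acc[j] += x
        counts[y] = counts.get(y, 0) + 1
    return {y: ([s / counts[y] for s in sums[y]], counts[y]) for y in sums}


def aggregate_prototypes(client_protos):
    """Merge clients' prototypes, weighting each by its sample count."""
    sums, counts = {}, {}
    for protos in client_protos:
        for y, (vec, n) in protos.items():
            acc = sums.setdefault(y, [0.0] * len(vec))
            for j, x in enumerate(vec):
                acc[j] += n * x
            counts[y] = counts.get(y, 0) + n
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


# Client A only has class 0; client B has one sample each of classes 0 and 1.
client_a = local_prototypes([[0.0, 0.0], [2.0, 0.0]], [0, 0])
client_b = local_prototypes([[4.0, 0.0], [10.0, 10.0]], [0, 1])
global_protos = aggregate_prototypes([client_a, client_b])
```

Each client sends one vector per class it has seen, so the payload grows with the number of classes and the embedding width rather than with the model size, and a client that lacks a class still receives a prototype for it.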
Article 47
Title@2025-06-16 (1): A Hybrid Heuristic Framework for Resource-Efficient Querying of Scientific Experiments Data
Title: A Hybrid Heuristic Framework for Resource-Efficient Querying of Scientific Experiments Data | Ein hybrider Heuristischer Rahmen für eine ressourceneffiziente Abfrage wissenschaftlicher Experimentdaten | 资源效率科学实验数据调查混合元框架 2506.10422v2 |
Authors (2): Mayank Patel, Minal Bhise
Scientific experiments and modern applications are generating large amounts of data every day. Most organizations utilize in-house servers or cloud resources to manage application data and workload. The traditional database management system (DBMS) and HTAP systems spend significant time and resources to load the entire dataset into the DBMS before starting query execution. On the other hand, in-situ engines may reparse required data multiple times, increasing resource utilization and data processing costs. Additionally, over- or under-allocation of resources also increases application running costs. This paper proposes a lightweight Resource Availability & Workload aware Hybrid Framework (RAW-HF) to optimize querying raw data by utilizing existing finite resources efficiently. RAW-HF includes modules that help optimize the resources required to execute a given workload and maximize the utilization of existing resources. Applying RAW-HF to real-world scientific dataset workloads such as the Sloan Digital Sky Survey (SDSS) and Linked Observation Data (LOD) reduced workload execution time (WET) by over 90% and 85%, respectively, compared to the widely used traditional DBMS PostgreSQL. The overall CPU utilization, IO resource utilization, and WET have been reduced by 26%, 25%, and 26%, respectively, while improving memory utilization by 33%, compared to the state-of-the-art workload-aware partial loading technique (WA) proposed for hybrid systems. A comparison of the MUAR technique used by RAW-HF with machine learning based resource allocation techniques like PCC is also presented.
多数组织利用内部服务器或云端资源管理应用数据和工作量。传统数据库管理系统(DBMS)和HTAP系统花费大量时间和资源将全部数据集装入DBMS,然后才开始询问执行。另一方面,当地引擎可能多次重新分析所需数据,增加资源利用和数据处理费用。此外,资源过多或分配不足也增加了应用程序运行成本。本文件建议采用一个轻量资源提供和工作认知框架(RAW-HF),以便通过高效利用现有有限资源优化原始查询数据。 RAW-HF系统包括模块,帮助优化执行特定工作量和最大限度地利用现有资源所需的资源。 将RAWHF应用于现实世界科学数据集工作量的影响,如Sloan数字天空测量(SDSS)和链接观测数据(LOD),提出了90%和85%的工作量执行时间,与广泛使用的传统DBMS Post-GSQL(RA-H) 优化原始数据查询数据查询数据。总体CPU、 IO-HW 资源调配技术的利用率和WE-% 的计算方法的利用率分别通过Sloan DSL 和WA-Real-res Madressal 技术的利用率降低
Article 48
Title@2025-06-15 (7): Distributed Computing From First Principles
Title: Distributed Computing From First Principles | Verteiltes Rechnen von den ersten Prinzipien | 从原始原则中分配的计算 2506.12959v1 |
Authors (1): Kenneth Odoh
This book on Distributed Computing aims to benefit a diverse audience, ranging from aspiring engineers and seasoned researchers to a wide range of professionals. Driven by my passion for making the core concepts of distributed computing accessible, this work is a significant undertaking designed to empower individuals from all backgrounds to gain valuable insight. Have you ever wondered how a typical distributed system works under the hood? Are you looking for a pedagogical guide with complete implementations? In this work, we have implemented several foundational algorithms in Distributed Computing. Whether your expertise lies in the theoretical foundations or the practical applications of the principles of Distributed Systems, this book is for you.
这本《分布式计算》的书旨在让不同的读者受益,从有志工程师和老练的研究人员,到广泛的专业人员。 由我热衷于使分布式计算的核心概念无障碍的热情驱动,这项工作是一项重要的任务,旨在增强来自各种背景的个人的能力,以获得宝贵的洞察力。 你有没有想过一个典型的分布式系统如何在兜帽下运作?你是否在寻找一个具有完整执行内容的教学指南?在这个工作中,我们在分布式计算中应用了几种基本算法。无论你的专业知识来自理论基础还是分配式系统原则的实际应用,这本书都是为了你。
Article 49
Title@2025-06-15 (7): Self-Stabilizing Replicated State Machine Coping with Byzantine and Recurring Transient Faults
Title: Self-Stabilizing Replicated State Machine Coping with Byzantine and Recurring Transient Faults | Selbststabilisierende replizierte Staatsmaschine, die mit byzantinischen und wiederkehrenden transienten Fehlern fertig wird | 应对拜占庭和经常性中转过失的自稳定复制国家机器 2506.12900v1 |
Authors (5): Shlomi Dolev, Amit Hendin, Maurice Herlihy, Maria Potop Butucaru, Elad Michael Schiller
The ability to perform repeated Byzantine agreement lies at the heart of important applications such as blockchain price oracles or replicated state machines. Any such protocol requires the following properties: (1) \textit{Byzantine fault-tolerance}, because not all participants can be assumed to be honest, (2) \textit{recurrent transient fault-tolerance}, because even honest participants may be subject to transient "glitches", (3) \textit{accuracy}, because the results of quantitative queries (such as price quotes) must lie within the interval of honest participants' inputs, and (4) \textit{self-stabilization}, because it is infeasible to reboot a distributed system following a fault. This paper presents the first protocol for repeated Byzantine agreement that satisfies the properties listed above. Specifically, starting in an arbitrary system configuration, our protocol establishes consistency. It preserves consistency in the face of up to $\lceil n/3 \rceil -1$ Byzantine participants {\em and} constant recurring ("noise") transient faults, of up to $\lceil n/6 \rceil-1$ additional malicious transient faults, or even more than $\lceil n/6 \rceil-1$ (uniformly distributed) random transient faults, in each repeated Byzantine agreement.
执行重复的拜占庭协议的能力是重要应用的核心,例如块链价格或电动国家机器。任何此类协议都需要以下属性:(1)\ textit{Byzantine 断层容忍},因为并非所有参与者都能被假定为诚实,(2)rtit{Textit{Textentient Traudi容忍},因为即使是诚实的参与者也可能受“滑动’,(3)\textit{acurity}的短暂性“滑动’”,(3)\textit{acurity}的制约,因为量化询问(如价格报价)的结果必须处于诚实参与者投入的间隔之内,(4)\ textit{自我稳定},因为不可能在错误发生后重新启用分布的系统。本文为重复的拜占庭协议提供了第一个满足上述属性的协议。具体地说,从任意的系统配置开始,我们的协议就具有一致性。它保持了在正面面直到$ceil n/3\rcle -1 随机参与者的一致度,以及 不断重复(Nnnoise’reviewrnial_rentirnial firal) nrvient firnial airview exnial axild
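For a concrete feel of the stated bounds, the short sketch below evaluates $\lceil n/3 \rceil - 1$ and $\lceil n/6 \rceil - 1$ for a given number of participants and checks the accuracy property as worded in the abstract (the agreed value lies within the interval of honest inputs). The function names are hypothetical and this is only arithmetic on the quoted formulas, not the protocol itself.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)

def fault_bounds(n):
    """Fault counts quoted in the abstract for n participants."""
    return {
        "byzantine": ceil_div(n, 3) - 1,            # ceil(n/3) - 1
        "malicious_transient": ceil_div(n, 6) - 1,  # ceil(n/6) - 1
    }

def is_accurate(output, honest_inputs):
    """Accuracy as described: output within the range of honest inputs."""
    return min(honest_inputs) <= output <= max(honest_inputs)

bounds_10 = fault_bounds(10)
bounds_7 = fault_bounds(7)
```

With 10 participants the formulas give 3 Byzantine participants plus 1 additional malicious transient fault per agreement instance.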
Article 50
Title@2025-06-15 (7): BLITZSCALE: Fast and Live Large Model Autoscaling with O(1) Host Caching
Title: BLITZSCALE: Fast and Live Large Model Autoscaling with O(1) Host Caching | BLITZSCALE: Schnelle und Live-Großmodellautoskalierung mit O(1) Host-Caching | BLITZSCALE: 与 O(1) 主机缓存快速和活的大型模型自动缩放 2412.17246v2 |
Authors (7): Dingyan Zhang, Haotian Wang, Yang Liu, Xingda Wei, Yizhou Shan, Rong Chen, Haibo Chen
Model autoscaling is the key mechanism to achieve serverless model-as-a-service, but it faces a fundamental trade-off between scaling speed and storage/memory usage to cache parameters, and cannot meet frequent scaling requirements across multiple hosts. The key problem is that data plane performance is slow, and scaled instances remain stopped while parameters are loading. In this paper, we first show that the data plane can be made fast with no or O(1) caching by loading parameters through the compute network between GPUs because: (1) its speed is comparable to host cache and is underutilized, and (2) scaling multiple instances requires no or O(1) caching with network-optimized multicast. Second, autoscaling can be made live by breaking the scaling abstraction for inference from a coarse-grained instance-level to a fine-grained layer-level. This allows us to offload the layer computation from the overloaded serving instances to the scaled ones without waiting for the parameters to be fully loaded. Under real-world workloads, our system BLITZSCALE achieves up to 94% lower tail latency compared to the state-of-the-art autoscaling system (ServerlessLLM), and it reduces the GPU time used for serving by 49% when compared with serving systems that do not support autoscaling, such as DistServe and vLLM, under the same service-level agreement.
模型自动缩放是实现无服务器模式服务的关键机制, 但它在缩放速度和存储/ 模拟使用到缓存参数之间面临着一个根本性的权衡, 并且无法满足多个主机的频繁缩放要求。 关键的问题是, 数据平面性能缓慢, 且当参数装入时, 缩放情况仍然停止。 在本文中, 我们首先显示, 数据平面可以通过计算 GPU 之间网络的加载参数来快速完成, 没有或 O(1) 缓存参数, 因为:(1) 其速度与主机缓存相当, 并且没有得到充分利用; (2) 缩放多个情况要求不使用或 O(1) 缓存网络优化多播放。 其次, 自动缩放可以通过打破缩放的缩放抽象, 从粗略的审点到细化的层级别。 这样, 数据平面可以卸载从超载的服务量到缩放参数, 而不必等待全部装入参数。 在现实世界工作量下, 我们的系统 BLITZSCALLELE 需要与网络优化多盘多盘多盘多盘, 达到94 % 的尾端调调调, 而不是通过自动递减系统, 。 将自动递减系统用来为运行系统降低调系统, 。
Article 51
Title@2025-06-15 (7): zkMixer: A Configurable Zero-Knowledge Mixer with Anti-Money Laundering Consensus Protocols
Title: zkMixer: A Configurable Zero-Knowledge Mixer with Anti-Money Laundering Consensus Protocols | zkMixer: Ein konfigurierbarer Null-Knowledge-Mixer mit Anti-Money Laundering Consensus-Protokollen | zkMixer:一个与反洗钱共识议定书的可配置零知识混合器 2503.14729v2 |
Authors (2): Theodoros Constantinides, John Cartlidge
We introduce a zero-knowledge cryptocurrency mixer framework that allows groups of users to set up a mixing pool with configurable governance conditions, configurable deposit delays, and the ability to refund or confiscate deposits if it is suspected that funds originate from crime. Using a consensus process, group participants can monitor inputs to the mixer and determine whether the inputs satisfy the mixer conditions. If a deposit is accepted by the group, it will enter the mixer and become untraceable. If it is not accepted, the verifiers can freeze the deposit and collectively vote to either refund the deposit back to the user, or confiscate the deposit and send it to a different user. This behaviour can be used to examine deposits, determine if they originate from a legitimate source, and if not, return deposits to victims of crime.
我们引入了一个零知识加密货币混合框架,允许用户群体建立混合池,其中含有可配置的治理条件、可配置的存款延迟,如果怀疑存款来源于犯罪,则能够退还或没收存款;通过协商一致程序,小组参与者可以监测向混合器提供的投入,并确定投入是否满足混合器条件;如果一个存款被该集团接受,它将进入混合器并变得无法追踪;如果不被接受,核查者可以冻结存款,集体投票将存款退还给用户,或者没收存款并寄给另一个用户。这种行为可以用来检查存款,确定这些存款是否来自合法来源,如果不是,则将存款归还犯罪受害者。
Article 52
Title@2025-06-15 (7): Cross-architecture universal feature coding via distribution alignment
Title: Cross-architecture universal feature coding via distribution alignment | Cross-architecture universal feature coding via distribution alignment | 通过分配协调进行跨建筑跨建筑通用特征编码 2506.12737v1 |
Authors (4): Changsheng Gao, Shan Liu, Feng Wu, Weisi Lin
Feature coding has become increasingly important in scenarios where semantic representations rather than raw pixels are transmitted and stored. However, most existing methods are architecture-specific, targeting either CNNs or Transformers. This design limits their applicability in real-world scenarios where features from both architectures coexist. To address this gap, we introduce a new research problem: cross-architecture universal feature coding (CAUFC), which seeks to build a unified codec that can effectively compress features from heterogeneous architectures. To tackle this challenge, we propose a two-step distribution alignment method. First, we design the format alignment method that unifies CNN and Transformer features into a consistent 2D token format. Second, we propose the feature value alignment method that harmonizes statistical distributions via truncation and normalization. As a first attempt to study CAUFC, we evaluate our method on the image classification task. Experimental results demonstrate that our method achieves superior rate-accuracy trade-offs compared to the architecture-specific baseline. This work marks an initial step toward universal feature compression across heterogeneous model architectures.
在语义表达方式而不是原始像素被传输和存储的情况下,地物编码变得日益重要。 但是,大多数现有方法都是针对特定结构的,针对有线电视新闻网或变异器。 这种设计限制了其在两个结构特征共存的现实世界情景中的适用性。 为了解决这一差距,我们引入了一个新的研究问题:跨结构通用特征编码(CAUFC),它寻求建立一个能够有效压缩不同结构特征的统一编码(CAUFC)。为了应对这一挑战,我们提出了一个两步配对方法。首先,我们设计了将CNN和变异器特征统一成统一的 2D 符号格式的格式调整方法。第二,我们提出了通过脱速和正常化协调统计分布的特征值调整方法。作为研究CAUFC的首个尝试,我们评估了图像分类任务的方法。实验结果表明,我们的方法与特定结构的基线相比,实现了更高的率-准确性交易。这项工作标志着跨多种模型结构实现普遍特征压缩的一个初步步骤。
Article 53
Title@2025-06-15 (7): Energy-Efficient Real-Time Job Mapping and Resource Management in Mobile-Edge Computing
Title: Energy-Efficient Real-Time Job Mapping and Resource Management in Mobile-Edge Computing | Energieeffizientes Echtzeit-Job Mapping und Ressourcenmanagement im Mobile-Edge Computing | 移动电子计算中的能源高效实时工作绘图和资源管理 2506.12686v1 |
Authors (3): Chuanchao Gao, Niraj Kumar, Arvind Easwaran
Mobile-edge computing (MEC) has emerged as a promising paradigm for enabling Internet of Things (IoT) devices to handle computation-intensive jobs. Due to the imperfect parallelization of algorithms for job processing on servers and the impact of IoT device mobility on data communication quality in wireless networks, it is crucial to jointly consider server resource allocation and IoT device mobility during job scheduling to fully benefit from MEC, which is often overlooked in existing studies. By jointly considering job scheduling, server resource allocation, and IoT device mobility, we investigate the deadline-constrained job offloading and resource management problem in MEC with both communication and computation contentions, aiming to maximize the total energy saved for IoT devices. For the offline version of the problem, where job information is known in advance, we formulate it as an Integer Linear Programming problem and propose an approximation algorithm, $\mathtt{LHJS}$, with a constant performance guarantee. For the online version, where job information is only known upon release, we propose a heuristic algorithm, $\mathtt{LBS}$, that is invoked whenever a job is released. Finally, we conduct experiments with parameters from real-world applications to evaluate their performance.
由于服务器工作处理算法的不完善平行,以及IoT设备移动对无线网络数据通信质量的影响,在工作时间安排期间,必须共同考虑服务器资源分配和IoT设备移动,以充分受益于MEC, 现有研究中经常忽视这一点。通过共同考虑工作时间安排、服务器资源分配和IoT设备移动,我们通过通信和计算争论来调查在MEC中受时间限制的工作卸载和资源管理问题,目的是尽量扩大为IoT设备节省的全部能量。对于这一问题的离线版本,如果事先知道工作信息,我们把它设计成Integer线性规划问题,提出近似算法,$\matht{LHJS}$,并不断保证工作表现。对于只在发布时才知道工作信息的在线版本,我们建议使用超自然算法, $\matht{LBS}$, 每当工作表现参数被发布时,我们都会用真实的应用来进行。
Article 54
Title@2025-06-14 (6): Accelerating Cloud-Based Transcriptomics: Performance Analysis and Optimization of the STAR Aligner Workflow
Title: Accelerating Cloud-Based Transcriptomics: Performance Analysis and Optimization of the STAR Aligner Workflow | Beschleunigung der Cloud-basierten Transkription: Leistungsanalyse und Optimierung des STAR Aligner Workflows | 加速以云为基础的云基转换器:STAR升空者工作流程的绩效分析和优化 2506.12611v1 |
Authors (4): Piotr Kica, Sabina Lichołai, Michał Orzechowski, Maciej Malawski
In this work, we explore the Transcriptomics Atlas pipeline adapted for cost-efficient and high-throughput computing in the cloud. We propose a scalable, cloud-native architecture designed for running a resource-intensive aligner – STAR – and processing tens or hundreds of terabytes of RNA-sequencing data. We implement multiple optimization techniques that give significant execution time and cost reduction. The impact of particular optimizations is measured in medium-scale experiments followed by a large-scale experiment that leverages all of them and validates the current design. Early stopping optimization allows a reduction in total alignment time by 23%. We analyze the scalability and efficiency of one of the most widely used sequence aligners. For the cloud environment, we identify one of the most suitable EC2 instance types and verify the applicability of spot instance usage.
在这项工作中,我们探索适应云层中成本-效率和高通量计算而改造的Transcriptomics Atlas管道。我们提出一个可扩缩的云型结构,用于运行资源密集型连接器 – – STAR – – 和处理数十或数百兆兆字节 RNA 序列数据。我们采用多种优化技术,使执行时间和成本大大降低。特定优化的影响在中尺度实验中测量,随后进行大规模实验,利用所有这些实验,并验证当前设计。早期停止优化可以使总调整时间减少23%。我们分析了最广泛使用的序列连接器之一的可扩缩性和效率。对于云环境,我们确定了最合适的EC2例类型之一,并核查现场使用的适用性。
Article 55
Title@2025-06-14 (6): Performance optimization of BLAS algorithms with band matrices for RISC-V processors
Title: Performance optimization of BLAS algorithms with band matrices for RISC-V processors | Leistungsoptimierung von BLAS-Algorithmen mit Bandmatrizen für RISC-V-Prozessoren | BLAS算法的性能优化,为RISC-V处理器提供带宽矩阵 2502.13839v2 |
Authors (8): Anna Pirova, Anastasia Vodeneeva, Konstantin Kovalev, Alexander Ustinov, Evgeny Kozinov, Alexey Liniov, Valentin Volokitin, Iosif Meyerov
The rapid development of the RISC-V instruction set architecture presents new opportunities and challenges for software developers. Is it sufficient to simply recompile high-performance software optimized for x86-64 onto RISC-V CPUs? Are current compilers capable of effectively optimizing C and C++ codes or is it necessary to use intrinsics or assembler? Can we analyze and improve performance without well-developed profiling tools? Do standard optimization techniques work? Are there specific RISC-V features that need to be considered? These and other questions require careful consideration. In this paper, we present our experience optimizing four BLAS algorithms for band matrix operations on RISC-V processors. We demonstrate how RISC-V-optimized implementations of OpenBLAS algorithms can be significantly accelerated through improved vectorization of computationally intensive loops. Experiments on Lichee Pi 4A and Banana Pi BPI-F3 devices, using the RVV 0.7.1 and RVV 1.0 vector instruction sets, respectively, show speedups of 1.5x to 10x depending on the operation compared to the OpenBLAS baseline. In particular, the successful use of vector register grouping with RVV can lead to significant performance improvements.
RISC-V 指令设置架构的快速发展为软件开发者带来了新的机遇和挑战。 简单地将高性能软件重新合成为x86-64优化到RISC-V CPU 是否就足够了? 目前的汇编者是否能够有效地优化C和C++代码,或者有必要使用内在的或装配器? 我们能否在没有完善的剖析工具的情况下分析和改进性能? 标准优化技术是否工作? 是否需要分别使用RV 0.7.1和RV V 1. 0的矢量教学组来分别考虑具体的RISC-V特征? 这些以及其他问题需要认真考虑。 在本文件中,我们介绍我们为RISC-V 处理器的带式矩阵操作优化四种 BLAS 算法的经验是否足够? 我们展示了如何通过改进计算密集循环的矢量来大大加速实施 OpenBLAS 算法。 对LAC Pi 4A 和 Banana Pi BPI-F3 的实验分别使用RV 0.7.1 和 RV 1.0 矢量教学组, 显示1.5x 至 10x 的加速度的加速度,这取决于的加速速度取决于操作与 OpBBBLAS 基准基线相比,取决于 成功 成功矢量 成功 成功 成功 成功地登记。
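To show what a band matrix operation looks like, here is a minimal pure-Python sketch of a general band matrix-vector product using the column-oriented band storage convention of BLAS/LAPACK (entry A[i][j] is stored at AB[ku + i - j][j], 0-indexed). It is a reference loop nest for illustration only; the abstract does not name the four routines the authors optimized, and the helper names `to_band` and `band_matvec` are hypothetical.

```python
def to_band(a, kl, ku):
    """Pack a dense matrix with kl sub- and ku super-diagonals into band storage."""
    n_rows, n_cols = len(a), len(a[0])
    ab = [[0.0] * n_cols for _ in range(kl + ku + 1)]
    for j in range(n_cols):
        for i in range(max(0, j - ku), min(n_rows, j + kl + 1)):
            ab[ku + i - j][j] = a[i][j]
    return ab

def band_matvec(ab, kl, ku, n_rows, x):
    """Compute y = A @ x touching only entries inside the band."""
    y = [0.0] * n_rows
    for j, xj in enumerate(x):
        # The inner loop walks a short run of one band column; loops of this
        # shape are the kind of candidates a vector unit can process in chunks.
        for i in range(max(0, j - ku), min(n_rows, j + kl + 1)):
            y[i] += ab[ku + i - j][j] * xj
    return y

dense = [
    [1.0, 2.0, 0.0, 0.0],
    [3.0, 4.0, 5.0, 0.0],
    [0.0, 6.0, 7.0, 8.0],
    [0.0, 0.0, 9.0, 10.0],
]
band = to_band(dense, kl=1, ku=1)
result = band_matvec(band, 1, 1, 4, [1.0, 1.0, 1.0, 1.0])
```

Because the inner loop length is bounded by kl + ku + 1, narrow bands produce very short vectors, which is one plausible reason such kernels need vectorization strategies different from those for dense matrices.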
Article 56
Title@2025-06-14 (6): Decentralized Distributed Graph Coloring: Cluster Graphs
Title: Decentralized Distributed Graph Coloring: Cluster Graphs | Dezentrale verteilte Graphen-Färbung: Clustergraphen | 分散分布的图表颜色:群集图表 2405.07725v2 |
Authors (3): Maxime Flin, Magnus M. Halldorsson, Alexandre Nolin
Graph coloring is fundamental to distributed computing. We give the first sub-logarithmic distributed algorithm for coloring cluster graphs. These graphs are obtained from the underlying communication network by contracting nodes and edges, and they appear frequently as components in the study of distributed algorithms. In particular, we give an $O(\log^* n)$-round algorithm to $(\Delta+1)$-color cluster graphs of at least polylogarithmic degree. The best previously known bound was $\operatorname{poly}(\log n)$ [Flin et al., SODA’24]. This properly generalizes results in the CONGEST model and shows that distributed graph problems can be solved quickly even when the node itself is decentralized.
图形颜色是分布式计算的基础。 我们给出了第一个子对数分布算法, 用于彩色群集图。 这些图表是通过连接节点和边缘从基本通信网络获取的, 并且经常在分布式算法研究中作为组成部分出现。 特别是, 我们给一个$O( log) n) 圆形算法, 给$( delta+ 1) $( $) $( $) , $( $) 彩色组图, 至少在多logric 度上。 之前最著名的是 $\ operatorname{poly} (\log n) [Flin et al., SODA’ 24] 。 这恰当地概括了 CONEST 模型的结果, 并显示即使节点本身是分散的, 也能够快速解决分布式的图表问题 。
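As background on why $\Delta+1$ colors always suffice, the sketch below runs the classic sequential greedy coloring: a node with degree d has at most d colored neighbors, so one of the colors 0..d is always free. This is not the distributed algorithm from the paper, only an illustration of the coloring target; the function name is hypothetical.

```python
def greedy_coloring(adj):
    """Sequential greedy coloring using at most max_degree + 1 colors.

    adj maps each node to the list of its neighbors (undirected graph).
    """
    colors = {}
    for v in adj:
        used = {colors[u] for u in adj[v] if u in colors}
        # deg(v) + 1 candidates and at most deg(v) of them are taken.
        colors[v] = next(c for c in range(len(adj[v]) + 1) if c not in used)
    return colors

# A 5-cycle: maximum degree 2, so at most 3 colors are needed.
cycle = {0: [1, 4], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 0]}
coloring = greedy_coloring(cycle)
```

The difficulty the paper addresses lies elsewhere: doing this in few communication rounds when each "node" is itself a cluster of machines.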
Article 57
Title@2025-06-14 (6): AI Flow: Perspectives, Scenarios, and Approaches
Title: AI Flow: Perspectives, Scenarios, and Approaches | AI Flow: Perspektiven, Szenarien und Ansätze | AI 流动:观点、设想和方法 2506.12479v1 |
Authors (12): Hongjun An, Sida Huang, Siqi Huang, Ruanjun Li, Yuanzhi Liang, Jiawei Shao, Zihan Wang, Cheng Yuan, Chi Zhang, Hongyuan Zhang, Wenhao Zhuang, Xuelong Li
Pioneered by the foundational information theory by Claude Shannon and the visionary framework of machine intelligence by Alan Turing, the convergent evolution of information and communication technologies (IT/CT) has created an unbroken wave of connectivity and computation. This synergy has sparked a technological revolution, now reaching its peak with large artificial intelligence (AI) models that are reshaping industries and redefining human-machine collaboration. However, the realization of ubiquitous intelligence faces considerable challenges due to substantial resource consumption in large models and high communication bandwidth demands. To address these challenges, AI Flow has been introduced as a multidisciplinary framework that integrates cutting-edge IT and CT advancements, with a particular emphasis on the following three key points. First, device-edge-cloud framework serves as the foundation, which integrates end devices, edge servers, and cloud clusters to optimize scalability and efficiency for low-latency model inference. Second, we introduce the concept of familial models, which refers to a series of different-sized models with aligned hidden features, enabling effective collaboration and the flexibility to adapt to varying resource constraints and dynamic scenarios. Third, connectivity- and interaction-based intelligence emergence is a novel paradigm of AI Flow. By leveraging communication networks to enhance connectivity, the collaboration among AI models across heterogeneous nodes achieves emergent intelligence that surpasses the capability of any single model. The innovations of AI Flow provide enhanced intelligence, timely responsiveness, and ubiquitous accessibility to AI services, paving the way for the tighter fusion of AI techniques and communication systems.
由于克劳德·香农的基本信息理论和艾伦·图灵的机智智能远见框架的开创性,信息和通信技术(IT/CT)的趋同性演进形成了一个不间断的连通和计算浪潮,这种协同效应引发了技术革命,现在随着大型人工智能(AI)模型的重新塑造工业和重新界定人体机械合作而达到顶峰。然而,由于大型模型中大量资源消耗和高通信带宽需求,实现无处不在的情报面临巨大挑战。为了应对这些挑战,AI流动被引入了多学科框架,将先进的信息技术和CT进步结合起来,特别强调以下三个关键点。首先,装置-顶尖的云形框架作为基础,将终端装置、边缘服务器和云层集群结合起来,优化低电流模型的伸缩性和效率。第二,我们引入了家庭模型的概念,即一系列规模不同的模型,与一致的隐蔽性特征相适应,使得有效的合作和灵活性能够适应不同的资源限制和动态情景。第三,连接性和互动性框架作为基础基础,将连接性和互动性框架作为基础,将最终的智能升级性模型,从而提升AI系统。
Article 58
Title@2025-06-14 (6): Optimizing Federated Learning using Remote Embeddings for Graph Neural Networks
Title: Optimizing Federated Learning using Remote Embeddings for Graph Neural Networks | Optimierung des Federated Learning mit Remote Embeddings für Graph Neural Networks | 利用远程嵌入模型神经网络优化联邦学习 2506.12425v1 |
Authors (2): Pranjal Naman, Yogesh Simmhan
Graph Neural Networks (GNNs) have experienced rapid advancements in recent years due to their ability to learn meaningful representations from graph data structures. Federated Learning (FL) has emerged as a viable machine learning approach for training a shared model on decentralized data, addressing privacy concerns while leveraging parallelism. Existing methods that address the unique requirements of federated GNN training using remote embeddings to enhance convergence accuracy are limited by their diminished performance due to large communication costs with a shared embedding server. In this paper, we present OpES, an optimized federated GNN training framework that uses remote neighbourhood pruning, and overlaps pushing of embeddings to the server with local training to reduce the network costs and training time. The modest drop in per-round accuracy due to the pre-emptive push of embeddings is outstripped by the reduction in per-round training time for large and dense graphs like Reddit and Products, converging up to $\approx2\times$ faster than the state-of-the-art technique using an embedding server and giving up to $20\%$ better accuracy than vanilla federated GNN learning.
近些年来,由于能够从图表数据结构中学习有意义的表现,神经网络(GNNs)取得了迅速的进步; 联邦学习(FL)已成为一种可行的机械学习方法,用于培训分散化数据的共同模式,在利用平行主义的同时解决隐私问题; 现有方法解决了使用远程嵌入器进行联合化GNN培训以提高趋同准确性的独特要求,但由于与共享嵌入服务器的通信费用巨大,其性能下降,因此受到限制。 本文介绍了OPES,一个优化的GNN培训框架,即使用远程邻里剪接,以及将嵌入器与当地培训相重叠,以降低网络成本和培训时间。 由于先发制式地推进嵌入器,使全方位准确性下降,因为对红化和产品等大型和密集图形的全方位培训时间减少,而聚合到$\approx2\ti time,比使用远程嵌入服务器的州工艺更快,并给予比Vanilla Federate GNNT学习的20美元更高的准确性。
Article 59
Title@2025-06-14 (6): Optimizing Resource Allocation and Energy Efficiency in Federated Fog Computing for IoT
Title: Optimizing Resource Allocation and Energy Efficiency in Federated Fog Computing for IoT | Optimierung der Ressourcenallokation und Energieeffizienz im Federated Fog Computing für IoT | IoT的联雾计算器优化资源分配和能源效率 2504.00791v3 |
Authors (2): Syed Sarmad Shah, Anas Ali
Fog computing significantly enhances the efficiency of IoT applications by providing computation, storage, and networking resources at the edge of the network. In this paper, we propose a federated fog computing framework designed to optimize resource management, minimize latency, and reduce energy consumption across distributed IoT environments. Our framework incorporates predictive scheduling, energy-aware resource allocation, and adaptive mobility management strategies. Experimental results obtained from extensive simulations using the OMNeT++ environment demonstrate that our federated approach outperforms traditional non-federated architectures in terms of resource utilization, latency, energy efficiency, task execution time, and scalability. These findings underline the suitability and effectiveness of the proposed framework for supporting sustainable and high-performance IoT services.
雾计算通过在网络边缘提供计算、储存和联网资源,大大提高了IOT应用的效率。在本文中,我们提议了一个联合雾计算框架,旨在优化资源管理、尽量减少延迟和减少分布式IOT环境中的能源消耗。我们的框架包括预测时间表、能源意识资源分配和适应性流动管理战略。利用OMNET+++环境进行的广泛模拟的实验结果表明,我们的联合方法在资源利用、延缓性、能效、任务执行时间和可扩缩性方面超过了传统的非联合结构。这些结论强调了支持可持续和高性能IOT服务的拟议框架的适宜性和有效性。
Article 60
Title@2025-06-14 (6): Boosting Resource-Constrained Federated Learning Systems with Guessed Updates
Title: Boosting Resource-Constrained Federated Learning Systems with Guessed Updates | Ressourcenkonzentrierte Föderierte Lernsysteme mit geschätzten Updates fördern | 推动资源限制的联邦学习系统并猜测最新情况 2110.11486v3 |
Authors (6): Mohamed Yassine Boukhari, Akash Dhasade, Anne-Marie Kermarrec, Rafael Pires, Othmane Safsafi, Rishi Sharma
Federated learning (FL) enables a set of client devices to collaboratively train a model without sharing raw data. This process, though, operates under the constrained computation and communication resources of edge devices. These constraints combined with systems heterogeneity force some participating clients to perform fewer local updates than expected by the server, thus slowing down convergence. Exhaustive tuning of hyperparameters in FL, furthermore, can be resource-intensive, without which the convergence is adversely affected. In this work, we propose GEL, the guess and learn algorithm. GEL enables constrained edge devices to perform additional learning through guessed updates on top of gradient-based steps. These guesses are gradientless, i.e., participating clients leverage them for free. Our generic guessing algorithm (i) can be flexibly combined with several state-of-the-art algorithms including FEDPROX, FEDNOVA, FEDYOGI or SCALEFL; and (ii) achieves significantly improved performance when the learning rates are not best tuned. We conduct extensive experiments and show that GEL can boost empirical convergence by up to 40% in resource constrained networks while relieving the need for exhaustive learning rate tuning.
联邦学习(FL) 使一组客户设备能够合作训练模型而无需共享原始数据。 虽然这一过程在边缘装置的有限计算和通信资源下运行。 这些限制加上系统差异,迫使一些参与客户进行比服务器预期的更少的本地更新,从而减慢了趋同速度。 FL 中超参数的抽调可能会是资源密集型的,没有这种集成就会受到不利影响。 在这项工作中,我们提议GEL, 猜测和学习算法。 GEL 使有限的边缘装置能够通过梯度基步骤的顶部猜测更新进行更多的学习。 这些猜测是没有梯度的, 即参与的客户可以免费利用它们。 我们的通用算法(i) 可以灵活地结合一些最先进的算法, 包括FEDPROX, FEDNOVA, FEDDENOVA, FEDYOGI 或 SCALELLL; 以及 (ii) 当学习率没有最佳调整时, 业绩就会显著改善。 我们进行了广泛的实验, 并表明GEL 能够通过资源限制网络的40%的升级来推动经验趋同。
Article 61
Title@2025-06-14 (6): QoS-aware Scheduling of Periodic Real-time Task Graphs on Heterogeneous Pre-occupied MECs
Title: QoS-aware Scheduling of Periodic Real-time Task Graphs on Heterogeneous Pre-occupied MECs | QoS-aware Planung von periodischen Echtzeit-Taskgraphen zu heterogenen vorbesetzten MECs | 不同不同类型预先使用和未使用MEC的定期实时任务图的定期实时布局 2506.12415v1 |
Authors (2): Ashutosh Shankar, Astha Kumari
In latency-sensitive applications, efficient task scheduling is crucial for maintaining Quality of Service (QoS) while meeting strict timing constraints. This paper addresses the challenge of scheduling periodic tasks structured as directed acyclic graphs (DAGs) within heterogeneous, pre-occupied Mobile Edge Computing (MEC) networks. We propose a modified version of the Heterogeneous Earliest Finish Time (HEFT) algorithm designed to exploit residual processing capacity in pre-occupied MEC environments. Our approach dynamically identifies idle intervals on processors to create a feasible hyperperiodic schedule that specifies an allocated virtual machine (VM), task version, and start time for each task. This scheduling strategy maximizes the aggregate QoS by optimizing task execution without disrupting the existing periodic workload, while also adhering to periodicity, precedence, and resource constraints. Experimental results demonstrate that our method achieves enhanced load balancing and resource utilization, highlighting its potential to improve performance in heterogeneous MEC infrastructures supporting real-time, periodic applications.
在对隐性敏感的应用中,高效的任务排期对于在满足严格的时间安排限制的同时保持服务质量至关重要,本文件述及将按定向单流图(DAGs)结构构成的定期任务安排在多式、预先投入使用的移动边缘计算(MEC)网络内的挑战。我们提议了一种经修改的超异性刻度断层时间算法,目的是在热心的MEC环境中利用剩余处理能力。我们的方法是动态地确定处理器的闲置间隔,以创造一个可行的超周期时间表,规定分配的虚拟机器(VM)、任务版本和每项任务的启动时间。这一排期战略通过优化任务执行来最大限度地实现聚合QOS,同时不干扰现有周期性工作量,同时遵守周期性、先例和资源限制。研究结果表明,我们的方法提高了工作量的平衡和资源利用,突出了其在支持实时定期应用的混合MEC基础设施中提高性能的潜力。
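The step of fitting a task into idle intervals of an already occupied processor can be sketched as follows. The code assumes each processor's existing load is a sorted list of non-overlapping busy intervals within one hyperperiod, and it picks the processor with the earliest finish time, as in insertion-based HEFT; task versions, QoS values, and precedence handling from the paper are omitted, and all names are hypothetical.

```python
def earliest_start(busy, ready, duration, horizon):
    """Earliest start >= ready at which `duration` fits into an idle gap.

    busy: sorted, non-overlapping (start, end) intervals already occupied.
    horizon: end of the hyperperiod; returns None if the task does not fit.
    """
    t = ready
    for start, end in busy:
        if t + duration <= start:
            break  # the gap before this busy interval is large enough
        t = max(t, end)
    return t if t + duration <= horizon else None

def pick_processor(busy_by_proc, exec_time, ready, horizon):
    """Choose the processor giving the earliest finish time, or None."""
    best = None
    for proc, busy in busy_by_proc.items():
        start = earliest_start(busy, ready, exec_time[proc], horizon)
        if start is None:
            continue
        finish = start + exec_time[proc]
        if best is None or finish < best[2]:
            best = (proc, start, finish)
    return best

busy_by_proc = {"vm0": [(0, 3), (5, 9)], "vm1": [(0, 6)]}
exec_time = {"vm0": 2, "vm1": 1}  # heterogeneous execution times
choice = pick_processor(busy_by_proc, exec_time, ready=1, horizon=12)
```

Here the task is slower on vm0 but still finishes earlier there, because vm0 has an idle gap between its two existing reservations.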
Article 62
Title@2025-06-14 (6): Efficient Unified Caching for Accelerating Heterogeneous AI Workloads
Title: Efficient Unified Caching for Accelerating Heterogeneous AI Workloads | Effizientes Unified Caching für die Beschleunigung heterogener KI-Workloads | 加速异异性人工智能加速加速加速加速统一快递 2506.12370v1 |
Authors (11): Tianze Wang, Yifei Liu, Chen Chen, Pengfei Zuo, Jiawei Zhang, Qizhen Weng, Yin Chen, Zhenhua Han, Jieru Zhao, Quan Chen, Minyi Guo
Modern AI clusters, which host diverse workloads like data pre-processing, training and inference, often store the large-volume data in cloud storage and employ caching frameworks to facilitate remote data access. To avoid code-intrusion complexity and minimize cache space wastage, it is desirable to maintain a unified cache shared by all the workloads. However, existing cache management strategies, designed for specific workloads, struggle to handle the heterogeneous AI workloads in a cluster – which usually exhibit heterogeneous access patterns and item storage granularities. In this paper, we propose IGTCache, a unified, high-efficacy cache for modern AI clusters. IGTCache leverages a hierarchical access abstraction, AccessStreamTree, to organize the recent data accesses in a tree structure, facilitating access pattern detection at various granularities. Using this abstraction, IGTCache applies hypothesis testing to categorize data access patterns as sequential, random, or skewed. Based on these detected access patterns and granularities, IGTCache tailors optimal cache management strategies including prefetching, eviction, and space allocation accordingly. Experimental results show that IGTCache increases the cache hit ratio by 55.6% over state-of-the-art caching frameworks, reducing the overall job completion time by 52.2%.
现代的AI类组群,这些组群包含各种工作量,如数据预处理、培训和推断等,往往将大量数据储存在云层储存中,并采用缓存框架,以便利远程数据访问。为了避免代码入侵的复杂性,并尽量减少缓存空间浪费,可取的做法是保持一个所有工作量共享的统一缓存。然而,为具体工作量设计的现有的缓存管理战略,力求在一个组群中处理不同的AI工作量 – – 通常显示各种不同的访问模式和项目储存颗粒。在本文件中,我们提议IGTCache,一个统一的、高效的现代AI类群群群库缓存器。IGTCache利用一个等级访问抽象,AccessStreamTree,在树结构中组织最新的数据访问,便利在各种工作量中进行访问模式的检测。然而,IGTCache应用假设测试将数据访问模式按顺序、随机或扭曲进行分类。基于这些检测到的存取模式和颗粒体,IGTCache为现代的缓存管理战略,包括前伸缩、驱逐和空间分配空间比例。IGTC, 实验性结果显示,通过55-C的完成率框架,使IGTGTA+C提高了整个缓存率。
Article 63
Title@2025-06-14 (6): Decouple and Decompose: Scaling Resource Allocation with DeDe
Title: Decouple and Decompose: Scaling Resource Allocation with DeDe | Entkoppeln und Zersetzen: Skalierung der Ressourcenzuteilung mit DeDe | 分化和分解:扩大资源分配与除去 2412.11447v3 |
Authors (3): Zhiying Xu, Minlan Yu, Francis Y. Yan
Efficient resource allocation is essential in cloud systems to facilitate resource sharing among tenants. However, the growing scale of these optimization problems has outpaced commercial solvers commonly employed in production. To accelerate resource allocation, prior approaches either customize solutions for narrow domains or impose workload-specific assumptions. In this work, we revisit real-world resource allocation problems and uncover a common underlying structure: the vast majority of these problems are inherently separable, i.e., they optimize the aggregate utility of individual resource and demand allocations, under separate constraints for each resource and each demand. Building on this observation, we develop DeDe, a scalable and theoretically rooted optimization framework for large-scale resource allocation. At the core of DeDe is a decouple-and-decompose approach: it decouples entangled resource and demand constraints and thereby decomposes the overall optimization into alternating per-resource and per-demand subproblems that can be solved efficiently and in parallel. We have implemented and released DeDe as a Python package with a familiar modeling interface. Our experiments on three representative resource allocation tasks – cluster scheduling, traffic engineering, and load balancing – demonstrate that DeDe delivers significant speedups while generating higher-quality allocations.
在云系统中,高效的资源分配对于促进租户之间的资源共享至关重要。然而,这些优化问题的规模不断扩大,已经超出了生产中常用的商业求解器的能力。为了加快资源分配,以往的方法要么为狭窄的领域定制解决方案,要么强加特定于工作负载的假设。在这项工作中,我们重新审视现实世界的资源分配问题,发现了一个共同的基本结构:这些问题绝大多数本质上是可分离的,即在每种资源和每项需求各自的约束下,优化各个资源和需求分配的总体效用。基于这一观察,我们开发了DeDe,这是一个可扩展且有理论依据的大规模资源分配优化框架。DeDe的核心是解耦与分解方法:它将相互纠缠的资源约束和需求约束解耦,从而把整体优化分解为交替求解的按资源和按需求的子问题,这些子问题可以高效地并行求解。我们已将DeDe实现并发布为一个具有常见建模接口的Python包。我们在三个具有代表性的资源分配任务(集群调度、流量工程和负载均衡)上的实验表明,DeDe在生成更高质量分配方案的同时带来了显著加速。
Article 64
Title@2025-06-14 (6): EdgeProfiler: A Fast Profiling Framework for Lightweight LLMs on Edge Using Analytical Model
Title: EdgeProfiler: A Fast Profiling Framework for Lightweight LLMs on Edge Using Analytical Model | EdgeProfiler: Ein schnelles Profiling-Framework für leichte LLMs am Rand mit analytischem Modell | 边缘推进器:利用分析模型分析边缘的轻量LMs的快速分析框架 2506.09061v2 |
Authors (4): Alyssa Pinnock, Shakya Jayakody, Kawsher A Roxy, Md Rubel Ahmed
This paper introduces EdgeProfiler, a fast profiling framework designed for evaluating lightweight Large Language Models (LLMs) on edge systems. While LLMs offer remarkable capabilities in natural language understanding and generation, their high computational, memory, and power requirements often confine them to cloud environments. EdgeProfiler addresses these challenges by providing a systematic methodology for assessing LLM performance in resource-constrained edge settings. The framework profiles compact LLMs, including TinyLLaMA, Gemma3.1B, Llama3.2-1B, and DeepSeek-r1-1.5B, using aggressive quantization techniques and strict memory constraints. Analytical modeling is used to estimate latency, FLOPs, and energy consumption. The profiling reveals that 4-bit quantization reduces model memory usage by approximately 60-70%, while maintaining accuracy within 2-5% of full-precision baselines. Inference speeds are observed to improve by 2-3x compared to FP16 baselines across various edge devices. Power modeling estimates a 35-50% reduction in energy consumption for INT4 configurations, enabling practical deployment on hardware such as Raspberry Pi 4/5 and Jetson Orin Nano Super. Our findings emphasize the importance of efficient profiling tailored to lightweight LLMs in edge environments, balancing accuracy, energy efficiency, and computational feasibility.
本文介绍了EdgeProfiler,这是一种用于评估边缘系统上轻量级大语言模型(LLM)的快速剖析框架。虽然LLM在自然语言理解和生成方面能力出众,但其对计算、内存和功耗的高要求往往将其限制在云环境中。EdgeProfiler提供了一套系统的方法,用于评估LLM在资源受限的边缘环境中的性能,以应对这些挑战。该框架在激进的量化技术和严格的内存限制下,对TinyLLaMA、Gemma3.1B、Llama3.2-1B和DeepSeek-r1-1.5B等紧凑型LLM进行剖析,并利用分析模型估计延迟、FLOPs和能耗。剖析结果表明,4位量化可使模型内存占用减少约60-70%,同时将精度保持在全精度基线的2-5%以内。在各种边缘设备上,推理速度比FP16基线提高2-3倍。功耗模型估计INT4配置可使能耗降低35-50%,从而可以在Raspberry Pi 4/5和Jetson Orin Nano Super等硬件上实际部署。我们的研究结果强调了针对边缘环境中的轻量级LLM进行高效剖析的重要性,需要在精度、能效和计算可行性之间取得平衡。
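The EdgeProfiler abstract above relies on analytical modeling of memory. As a rough illustration of that kind of estimate, the sketch below computes weight-only memory from a parameter count and a bit width. The function names and the 1.1e9 parameter count are hypothetical and are not taken from the paper. Weights alone shrink by 75% when going from 16 to 4 bits; the abstract reports 60-70% for model memory usage, and one plausible reason (an assumption here, not a claim from the paper) is that some components are not reduced proportionally.

```python
def weight_memory_gb(num_params: float, bits_per_weight: int) -> float:
    """Memory needed for the weights alone, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


def weight_reduction(baseline_bits: int, quant_bits: int) -> float:
    """Fraction of weight memory saved by moving to a narrower bit width."""
    return 1.0 - quant_bits / baseline_bits


# Hypothetical model size, chosen only for illustration.
params = 1.1e9
fp16_gb = weight_memory_gb(params, 16)
int4_gb = weight_memory_gb(params, 4)
saving = weight_reduction(16, 4)
```

A fuller model would add terms for activations, the KV cache, and quantization metadata, which this sketch deliberately leaves out.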
Article 65
Title@2025-06-14 (6): KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider
Title: KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider | KVCache Cache in der Wildnis: KVCache Cache bei einem großen Cloud-Anbieter charakterisieren und optimieren | KVcache 野生缓存: 大云提供方的 KVcache 缓存的特性和优化 KVcache 缓存 2506.02634v2 |
Authors (9): Jiahao Wang, Jinbo Han, Xingda Wei, Sijie Shen, Dingyan Zhang, Chenguang Fang, Rong Chen, Wenyuan Yu, Haibo Chen
Serving large language models (LLMs) is important for cloud providers, and caching intermediate results (KV$) after processing each request substantially improves serving throughput and latency. However, there is limited understanding of how LLM serving benefits from KV$ caching, where system design decisions like cache eviction policies are highly workload-dependent. In this paper, we present the first systematic characterization of the KV$ workload patterns from one of the leading LLM service providers. We draw observations that were not covered by previous studies focusing on synthetic workloads, including: KV$ reuses are skewed across requests, where reuses between single-turn requests are as important as those between multi-turn requests; the reuse time and probability are diverse across all requests, but for a specific request category, the pattern tends to be predictable; and the overall cache size required for an ideal cache hit ratio is moderate. Based on the characterization, we further propose a workload-aware cache eviction policy that improves the serving performance under real-world traces, especially with limited cache capacity.
使用大型语言模型(LLMs)对于云端提供者十分重要,在处理每项请求后,缓存中间结果(KV$)对于云端提供者十分重要,但对于LLM服务如何从KV$缓存中受益,了解有限,因为缓存驱逐政策等系统设计决定高度依赖工作量。在本文中,我们从一个主要的LLM服务提供者对KV$工作量模式的首次系统描述中,得出了以往侧重于合成工作量的研究所没有涉及的意见,包括:KV$再利用在各种请求中被扭曲,其中单点再利用与多点请求同样重要;再利用时间和概率各不相同,考虑到所有请求,但具体的请求类别,模式往往可以预测;理想缓存打击比率所需的总体缓存规模是适度的。根据特征,我们进一步提议一项工作量-觉缓存驱逐政策,在现实世界痕迹下改进服务绩效,特别是缓存能力有限。
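The abstract above observes that reuse is predictable within a request category and proposes a workload-aware eviction policy. The sketch below is one hypothetical way to act on that observation, not the policy from the paper: it tracks a smoothed hit rate per category and evicts the entry whose category looks least likely to be reused, breaking ties by age. The class name, the Laplace smoothing, and the category labels are all assumptions.

```python
from collections import defaultdict


class CategoryAwareCache:
    """Toy cache that evicts by estimated per-category reuse probability."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = {}                 # key -> (category, last access tick)
        self.hits = defaultdict(int)      # category -> observed hits
        self.lookups = defaultdict(int)   # category -> observed lookups
        self.tick = 0

    def reuse_estimate(self, category: str) -> float:
        # Laplace-smoothed hit rate, so unseen categories start at 0.5.
        return (self.hits[category] + 1) / (self.lookups[category] + 2)

    def access(self, key: str, category: str) -> bool:
        """Look up `key`, inserting it on a miss. Returns True on a hit."""
        self.tick += 1
        self.lookups[category] += 1
        hit = key in self.entries
        if hit:
            self.hits[category] += 1
        elif len(self.entries) >= self.capacity:
            # Evict the entry with the lowest reuse estimate, oldest first.
            victim = min(
                self.entries,
                key=lambda k: (self.reuse_estimate(self.entries[k][0]),
                               self.entries[k][1]),
            )
            del self.entries[victim]
        self.entries[key] = (category, self.tick)
        return hit


cache = CategoryAwareCache(capacity=2)
cache.access("a", "chat")   # miss
cache.access("a", "chat")   # hit: "chat" now looks reusable
cache.access("b", "api")    # miss
cache.access("c", "api")    # miss, cache full: evicts "b", not the older "a"
```

Plain LRU would have evicted "a" in the last step because it is the oldest entry; the category statistics keep it because entries of its category have been reused before.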
Article 66
Title@2025-06-14 (6): GroupNL: Low-Resource and Robust CNN Design over Cloud and Device
Title: GroupNL: Low-Resource and Robust CNN Design over Cloud and Device | GroupNL: Low-Resource und robustes CNN-Design über Cloud und Device | GroupNL: 低资源资源和强力有线电视新闻网关于云和装置的设计 2506.12335v1 |
Authors (6): Chuntao Ding, Jianhang Xie, Junna Zhang, Salman Raza, Shangguang Wang, Jiannong Cao
It has become mainstream to deploy Convolutional Neural Network (CNN) models on ubiquitous Internet of Things (IoT) devices with the help of the cloud to provide users with a variety of high-quality services. Most existing methods have two limitations: (i) low robustness in handling corrupted image data collected by IoT devices; and (ii) high consumption of computational and transmission resources. To this end, we propose the Grouped NonLinear transformation generation method (GroupNL), which generates diversified feature maps by utilizing data-agnostic Nonlinear Transformation Functions (NLFs) to improve the robustness of the CNN model. Specifically, some of the convolution filters in a convolutional layer are designated as seed filters, and a small set of feature maps, i.e., seed feature maps, is first generated by the vanilla convolution operation. Then, we split the seed feature maps into several groups, each with a set of different NLFs, to generate corresponding diverse feature maps with in-place nonlinear processing. Moreover, GroupNL effectively reduces the parameter transmission between multiple nodes during model training by randomly initializing the hyperparameters of the NLFs and not updating them during training, and reduces the computing resources by using NLFs to generate most of the feature maps that would otherwise be produced by sliding-window convolution. Experimental results on the CIFAR-10, GTSRB, CIFAR-10-C, Icons-50, and ImageNet-1K datasets on NVIDIA RTX GPU platforms show that the proposed GroupNL outperforms other state-of-the-art methods in model robustness and training acceleration. Specifically, on the Icons-50 dataset, the accuracy of GroupNL-ResNet-18 is approximately 2.86% higher than that of the vanilla ResNet-18. GroupNL improves training speed by about 53% compared to a vanilla CNN when trained on a cluster of 8 NVIDIA RTX 4090 GPUs on the ImageNet-1K dataset.
借助云端,在无处不在的物联网(IoT)设备上部署卷积神经网络(CNN)模型,为用户提供各种高质量服务,已成为主流做法。大多数现有方法存在两个局限:(i)处理IoT设备采集的受损图像数据时鲁棒性低;(ii)计算和传输资源消耗高。为此,我们提出分组非线性变换生成方法(GroupNL),利用与数据无关的非线性变换函数(NLF)生成多样化的特征图,以提高CNN模型的鲁棒性。具体而言,在卷积层中将部分卷积滤波器指定为种子滤波器,先通过普通卷积运算生成一小组特征图,即种子特征图。然后,我们将种子特征图分成若干组,每组配有一组不同的NLF,通过就地非线性处理生成相应的多样化特征图。此外,GroupNL将NLF的超参数设为随机初始化且在模型训练期间不更新,从而有效减少模型训练期间多个节点之间的参数传输;并用NLF生成特征图,取代大多数基于滑动窗口生成的特征图,从而减少计算资源。在NVIDIA RTX GPU平台上对CIFAR-10、GTSRB、CIFAR-10-C、Icons-50和ImageNet-1K数据集的实验结果表明,GroupNL在模型鲁棒性和训练加速方面优于其他最先进的方法。具体而言,在Icons-50数据集上,GroupNL-ResNet-18的准确率比普通ResNet-18高约2.86%。在由8块NVIDIA RTX 4090 GPU组成的集群上用ImageNet-1K数据集训练时,GroupNL的训练速度比普通CNN提高约53%。
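The grouping step described in the GroupNL abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: feature maps are flat lists of floats, the nonlinear functions are sinusoids with frozen random hyperparameters (the abstract does not say which functions are used), and the output is assumed to be the seed maps followed by the generated maps. All names are hypothetical.

```python
import math
import random


def make_nlfs(rng: random.Random, count: int):
    """Build `count` element-wise functions with frozen random hyperparameters."""
    nlfs = []
    for _ in range(count):
        a = rng.uniform(0.5, 2.0)    # assumed frequency range
        b = rng.uniform(-1.0, 1.0)   # assumed phase range
        nlfs.append(lambda x, a=a, b=b: math.sin(a * x + b))
    return nlfs


def group_nl(seed_maps, num_groups: int, maps_per_seed: int, seed: int = 0):
    """Expand seed feature maps: each group gets its own set of NLFs."""
    if len(seed_maps) % num_groups != 0:
        raise ValueError("seed maps must split evenly into groups")
    rng = random.Random(seed)        # fixed seed: hyperparameters never change
    group_size = len(seed_maps) // num_groups
    out = [list(m) for m in seed_maps]
    for g in range(num_groups):
        nlfs = make_nlfs(rng, maps_per_seed)
        for fmap in seed_maps[g * group_size:(g + 1) * group_size]:
            for nlf in nlfs:
                out.append([nlf(v) for v in fmap])
    return out


seeds = [[0.0, 0.5, 1.0], [1.0, 2.0, 3.0], [-1.0, 0.0, 1.0], [0.2, 0.4, 0.6]]
features = group_nl(seeds, num_groups=2, maps_per_seed=3)
```

Because the hyperparameters are derived from a fixed seed rather than learned, every node can regenerate them locally, which is the property the abstract credits for the reduced parameter transmission.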
Article 67
Title@2025-06-14 (6): Towards Energy-Efficient Distributed Agreement
Title: Towards Energy-Efficient Distributed Agreement | Auf dem Weg zu einem energieeffizienten, verteilten Abkommen | 争取实现节能分配协议 2506.12282v1 |
Authors (2): Hugo Mirault, Peter Robinson
We study fault-tolerant consensus in a variant of the synchronous message passing model, where, in each round, every node can choose to be awake or asleep. This is known as the sleeping model (Chatterjee, Gmyr, Pandurangan PODC 2020) and defines the awake complexity (also called \emph{energy complexity}), which measures the maximum number of rounds that any node is awake throughout the execution. Only awake nodes can send and receive messages in a given round and all messages sent to sleeping nodes are lost. We present new deterministic consensus algorithms that tolerate up to $f<n$ crash failures, where $n$ is the number of nodes. Our algorithms match the optimal time complexity lower bound of $f+1$ rounds. For multi-value consensus, where the input values are chosen from some possibly large set, we achieve an energy complexity of ${O}(\lceil f^2 / n \rceil)$ rounds, whereas for binary consensus, we show that ${O}(\lceil f / \sqrt{n} \rceil)$ rounds are possible.
我们在同步消息传递模型的一个变体中研究容错共识,其中每个节点在每一轮都可以选择清醒或休眠。这被称为休眠模型(Chatterjee, Gmyr, Pandurangan PODC 2020),并定义了清醒复杂度(也称为能量复杂度),即任何节点在整个执行过程中处于清醒状态的最大轮数。在给定的一轮中,只有清醒的节点才能发送和接收消息,所有发送给休眠节点的消息都会丢失。我们提出了新的确定性共识算法,可以容忍最多$f<n$个崩溃故障,其中$n$是节点数量。我们的算法达到了最优时间复杂度下界$f+1$轮。对于输入值取自某个可能很大的集合的多值共识,我们实现了${O}(\lceil f^2 / n \rceil)$轮的能量复杂度;而对于二值共识,我们证明${O}(\lceil f / \sqrt{n} \rceil)$轮是可以做到的。
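To get a feel for the gap between the time and energy bounds stated in the abstract above, the sketch below evaluates the three expressions for concrete values. It drops the constants hidden by the $O(\cdot)$ notation, so the numbers indicate growth only and are not actual round counts. The function names and the sample values of $n$ and $f$ are assumptions.

```python
import math


def time_rounds(f: int) -> int:
    """Time complexity stated in the abstract: f + 1 rounds."""
    return f + 1


def multi_value_awake(f: int, n: int) -> int:
    """ceil(f^2 / n), the expression inside O(.) for multi-value consensus."""
    return -(-(f * f) // n)   # integer ceiling division


def binary_awake(f: int, n: int) -> int:
    """ceil(f / sqrt(n)), the expression inside O(.) for binary consensus."""
    return math.ceil(f / math.sqrt(n))


n, f = 10_000, 5_000          # hypothetical network size and crash budget
assert f < n                  # the algorithms tolerate up to f < n crashes
rounds = time_rounds(f)
awake_multi = multi_value_awake(f, n)
awake_binary = binary_awake(f, n)
```

With these values the execution lasts 5001 rounds, while the awake-round expressions evaluate to 2500 and 50 respectively, which shows how much of the execution a node can spend asleep.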
Article 68
Title@2025-06-13 (5): Fed-HeLLo: Efficient Federated Foundation Model Fine-Tuning with Heterogeneous LoRA Allocation
Title: Fed-HeLLo: Efficient Federated Foundation Model Fine-Tuning with Heterogeneous LoRA Allocation | Fed-HeLLo: Effizientes Federated Foundation Model Feintuning mit heterogener LoRA-Zuteilung | Fed-HeLLo:基于异构LoRA分配的高效联邦基础模型微调 2506.12213v1 |
Authors (4): Zikai Zhang, Ping Liu, Jiahao Xu, Rui Hu
Federated Learning has recently been utilized to collaboratively fine-tune foundation models (FMs) across multiple clients. Notably, federated fine-tuning methods based on low-rank adaptation (LoRA) have recently gained attention, as they allow clients to fine-tune FMs locally with a small portion of trainable parameters. However, most existing methods do not account for the heterogeneous resources of clients or lack an effective local training strategy to maximize global fine-tuning performance under limited resources. In this work, we propose Fed-HeLLo, a novel federated LoRA-based fine-tuning framework that enables clients to collaboratively fine-tune an FM with different local trainable LoRA layers. To ensure its effectiveness, we develop several heterogeneous LoRA allocation (HLA) strategies that adaptively allocate local trainable LoRA layers based on clients’ resource capabilities and the layer importance. Specifically, based on the dynamic layer importance, we design a Fisher Information Matrix score-based HLA that leverages dynamic gradient norm information. To better stabilize the training process, we consider the intrinsic importance of LoRA layers and design a Geometrically-Defined HLA (GD-HLA) strategy. It shapes the collective distribution of trainable LoRA layers into specific geometric patterns, such as Triangle, Inverted Triangle, Bottleneck, and Uniform. Moreover, we extend GD-HLA into a randomized version, named Randomized Geometrically-Defined HLA, for enhanced model accuracy with randomness. By co-designing the proposed HLA strategies, we incorporate both the dynamic and intrinsic layer importance into the design of our HLA strategy. We evaluate our approach on five datasets under diverse federated LoRA fine-tuning settings, covering three levels of data distribution from IID to extreme Non-IID. Results show that Fed-HeLLo with HLA strategies is both effective and efficient.
联邦学习近来被用于在多个客户端之间协作微调基础模型。值得注意的是,基于低秩适应(LoRA)的联邦微调方法最近受到关注,它使客户端能够在本地仅用一小部分可训练参数微调基础模型(FM)。然而,大多数现有方法没有考虑客户端的异构资源,或者缺乏有效的本地训练策略,无法在有限资源下最大化全局微调性能。在这项工作中,我们提出Fed-HeLLo,一个新颖的基于LoRA的联邦微调框架,使客户端能够使用不同的本地可训练LoRA层协作微调FM。为确保其有效性,我们开发了若干异构LoRA分配(HLA)策略,根据客户端的资源能力和层重要性自适应地分配本地可训练LoRA层。具体而言,基于动态层重要性,我们设计了一种基于Fisher信息矩阵得分的HLA,它利用动态梯度范数信息。为了更好地稳定训练过程,我们考虑LoRA层的内在重要性,设计了几何定义的HLA(GD-HLA)策略,将可训练LoRA层的整体分布塑造成特定的几何形状,如三角形、倒三角形、瓶颈形和均匀形。此外,我们将GD-HLA扩展为随机化版本,称为随机几何定义HLA,借助随机性提高模型精度。通过协同设计上述HLA策略,我们将动态和内在层重要性都纳入HLA策略的设计之中。我们在多种联邦LoRA微调设置下对五个数据集进行了评估,涵盖从IID到极端Non-IID的三种数据分布水平。结果表明,采用HLA策略的Fed-HeLLo既有效又高效。
Article 69
Title@2025-06-13 (5): MindFlayer SGD: Efficient Parallel SGD in the Presence of Heterogeneous and Random Worker Compute Times
Title: MindFlayer SGD: Efficient Parallel SGD in the Presence of Heterogeneous and Random Worker Compute Times | MindFlayer SGD: Effiziente parallele SGD in der Gegenwart von heterogenen und zufälligen Arbeiter-Berechnungszeiten | MindFlayer SGD: 存在异基因和随机工人时的有效平行SGD计算 2410.04285v2 |
Authors (3): Artavazd Maranjyan, Omar Shaikh Omar, Peter Richtárik
We investigate the problem of minimizing the expectation of smooth nonconvex functions in a distributed setting with multiple parallel workers that are able to compute stochastic gradients. A significant challenge in this context is the presence of arbitrarily heterogeneous and stochastic compute times among workers, which can severely degrade the performance of existing parallel stochastic gradient descent (SGD) methods. While some parallel SGD algorithms achieve optimal performance under deterministic but heterogeneous delays, their effectiveness diminishes when compute times are random - a scenario not explicitly addressed in their design. To bridge this gap, we introduce MindFlayer SGD, a novel parallel SGD method specifically designed to handle stochastic and heterogeneous compute times. Through theoretical analysis and empirical evaluation, we demonstrate that MindFlayer SGD consistently outperforms existing baselines, particularly in environments with heavy-tailed noise. Our results highlight its robustness and scalability, making it a compelling choice for large-scale distributed learning tasks.
我们调查了如何在分布式环境中最大限度地减少对光滑非混凝土功能的期望的问题,在分布式环境中,由能够计算随机梯度的多个平行工人组成。在这方面,一个重大挑战是工人之间存在任意的多元性和随机性计算时间,这可以严重降低现有平行随机梯度梯度梯度梯度梯度方法的性能。虽然一些平行的SGD算法在确定性但差异性延迟下取得了最佳效果,但是当计算时间是随机的时,其有效性就会降低,而这种假设在设计中并未明确涉及。为了缩小这一差距,我们引入了一种新型的平行 SGD 方法,即专为处理随机性和多元性计算时间而设计的新型的平行 SGD 方法。我们通过理论分析和经验评估,证明DminFlay SGD始终超越现有基线,特别是在有重尾声的环境中。我们的结果突出其坚固性和可缩性,使得它成为大规模分布式学习任务的令人信服的选择。
Article 70
Title@2025-06-13 (5): Secure API-Driven Research Automation to Accelerate Scientific Discovery
Title: Secure API-Driven Research Automation to Accelerate Scientific Discovery | Sichere API-gesteuerte Forschungsautomatisierung zur Beschleunigung der wissenschaftlichen Entdeckung | 安全 API-驱动研究自动化加速科学发现 2506.11950v1 |
Authors (10): Tyler J. Skluzacek, Paul Bryant, A. J. Ruckman, Daniel Rosendo, Suzanne Prentice, Michael J. Brim, Ryan Adamson, Sarp Oral, Mallikarjun Shankar, Rafael Ferreira da Silva
The Secure Scientific Service Mesh (S3M) provides API-driven infrastructure to accelerate scientific discovery through automated research workflows. By integrating near real-time streaming capabilities, intelligent workflow orchestration, and fine-grained authorization within a service mesh architecture, S3M revolutionizes programmatic access to high performance computing (HPC) while maintaining uncompromising security. This framework allows intelligent agents and experimental facilities to dynamically provision resources and execute complex workflows, accelerating experimental lifecycles, and unlocking the full potential of AI-augmented autonomous science. S3M signals a new era in scientific computing infrastructure that eliminates traditional barriers between researchers, computational resources, and experimental facilities.
安全科学服务网目(S3M)提供由API驱动的基础设施,通过自动化研究工作流程加速科学发现;通过将近实时流能力、智能工作流程管弦和细微授权整合到服务网目结构中,S3M使高性能计算(HPC)方案获取机会发生革命,同时保持无懈可击的安全性;这一框架允许智能代理和实验设施动态地提供资源和实施复杂的工作流程,加速实验生命周期,并释放AI增强的自主科学的全部潜力。 S3M标志着科学计算基础设施的新时代,消除了研究人员、计算资源和实验设施之间的传统障碍。
Article 71
Title@2025-06-13 (5): A retrospective on DISPEED – Leveraging heterogeneity in a drone swarm for IDS execution
Title: A retrospective on DISPEED – Leveraging heterogeneity in a drone swarm for IDS execution | Eine Retrospektive über DISPEED – Heterogenität in einem Drohnenschwarm für IDS-Exekution | 在无人驾驶飞机群群中利用异异性进行IDS处决 2506.11800v1 |
Authors (7): Vincent Lannurien, Camélia Slimani, Louis Morge-Rollet, Laurent Lemarchand, David Espes, Frédéric Le Roy, Jalil Boukhobza
Swarms of drones are gaining more and more autonomy and efficiency during their missions. However, security threats can disrupt their missions’ progression. To overcome this problem, Network Intrusion Detection Systems ((N)IDS) are promising solutions to detect malicious behavior on network traffic. However, modern NIDS rely on resource-hungry machine learning techniques that can be difficult to deploy on a swarm of drones. The goal of the DISPEED project is to leverage the heterogeneity (execution platforms, memory) of the drones composing a swarm to deploy NIDS. It is decomposed into two phases: (1) a characterization phase that characterizes various IDS implementations on diverse embedded platforms, and (2) an IDS implementation mapping phase that seeks to develop selection strategies to choose the most relevant NIDS depending on the context. On the one hand, the characterization phase allowed us to identify 36 relevant IDS implementations on three different embedded platforms: a Raspberry Pi 4B, a Jetson Xavier, and a Pynq-Z2. On the other hand, the IDS implementation mapping phase allowed us to design both standalone and distributed strategies to choose the best NIDSs to deploy depending on the context. The results of the project have led to three publications at international conferences and one publication in a journal.
无人驾驶飞机的群落在执行任务期间越来越具有自主权和效率。但是,安全威胁可能会破坏其飞行任务的进展。为了克服这一问题,网络入侵探测系统((N)IDS)是探知网络交通恶意行为的大有希望的解决办法。不过,现代国家无人驾驶飞机依靠资源饥饿机器学习技术,而这种技术很难在无人驾驶飞机群中部署。DISPEED项目的目标是利用无人驾驶飞机的异质性(执行平台、记忆平台)来部署NIDS。它分两个阶段进行分解:(1) 网络入侵探测系统(NIDS)是确定不同嵌入平台上的各种IDS执行特点的一个特征化阶段,和(2) 国家无人驾驶飞机实施阶段,力求根据具体情况制定选择最相关的NIDS的甄选战略。 一方面,特征分析阶段使我们能够在三个不同的嵌入平台上找到36个相关的 IDS执行平台:Raspberry Pi 4B、Jetson Xavier和Pynq-Z2。另一方面,国际安全数据系统在三个日历上选择了国际数据系统实施成果,在三个版本中选择了国际数据系统设计阶段。
Article 72
Title@2025-06-13 (5): PIPO: Pipelined Offloading for Efficient Inference on Consumer Devices
Title: PIPO: Pipelined Offloading for Efficient Inference on Consumer Devices | PIPO: Pipelined Offloading für effiziente Schlussfolgerungen auf Consumer Devices | PIPO:为有效推断消费者设备而喷射的卸载 2504.03664v2 |
Authors (3): Yangyijian Liu, Jun Li, Wu-Jun Li
The high memory and computation demands of large language models (LLMs) make them challenging to deploy on consumer devices due to limited GPU memory. Offloading can mitigate the memory constraint but often suffers from low GPU utilization, leading to low inference efficiency. In this work, we propose a novel framework, called pipelined offloading (PIPO), for efficient inference on consumer devices. PIPO designs a fine-grained offloading pipeline, complemented with optimized data transfer and computation, to achieve high concurrency and efficient scheduling for inference. Experimental results show that, compared with the state-of-the-art baseline, PIPO increases GPU utilization from below 40% to over 90% and achieves up to 3.1$\times$ higher throughput, running on a laptop equipped with an RTX 3060 GPU with 6GB of memory.
大型语言模型(LLM)对内存和计算的需求很高,而消费级设备的GPU内存有限,因此在这类设备上部署LLM颇具挑战。卸载可以缓解内存限制,但往往GPU利用率低,导致推理效率低下。在这项工作中,我们提出一个名为流水线卸载(PIPO)的新框架,用于在消费级设备上进行高效推理。PIPO设计了细粒度的卸载流水线,并辅以优化的数据传输和计算,以实现高并发和高效的推理调度。实验结果表明,在一台配备6GB显存RTX 3060 GPU的笔记本电脑上,与最先进的基线相比,PIPO将GPU利用率从40%以下提高到90%以上,吞吐量最高提高3.1$\times$。
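The abstract above attributes the utilization gain to overlapping data transfer with computation. The sketch below is a generic two-stage timing model of that idea, not PIPO's actual pipeline: it compares loading and computing each layer back to back against prefetching the next layer while the current one computes. The per-layer times are made-up values, and the model assumes transfers are never stalled by a full GPU buffer, which is optimistic on a memory-constrained device.

```python
def sequential_time(transfer, compute):
    """Each layer is loaded, then computed, with no overlap."""
    return sum(transfer) + sum(compute)


def pipelined_time(transfer, compute):
    """Transfers run back to back; a layer computes once it has arrived
    and the previous layer has finished computing."""
    transfer_done = 0
    compute_done = 0
    for t, c in zip(transfer, compute):
        transfer_done += t
        compute_done = max(compute_done, transfer_done) + c
    return compute_done


def gpu_utilization(compute, makespan):
    """Fraction of the total run during which the GPU is computing."""
    return sum(compute) / makespan


# Hypothetical per-layer times in arbitrary units.
transfer = [1, 1, 1, 1]
compute = [2, 2, 2, 2]
seq = sequential_time(transfer, compute)
pipe = pipelined_time(transfer, compute)
util_seq = gpu_utilization(compute, seq)
util_pipe = gpu_utilization(compute, pipe)
```

In this toy setting only the first transfer is exposed; every later transfer hides behind the previous layer's computation, so utilization rises from 8/12 to 8/9.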
Article 73
Title@2025-06-13 (5): CoBRA: A Universal Strategyproof Confirmation Protocol for Quorum-based Proof-of-Stake Blockchains
Title: CoBRA: A Universal Strategyproof Confirmation Protocol for Quorum-based Proof-of-Stake Blockchains | CoBRA: Ein universelles strategiesicheres Bestätigungsprotokoll für quorumbasierte Proof-of-Stake-Blockchains | CoBRA:面向基于法定人数的权益证明区块链的通用防策略确认协议 2503.16783v2 |
Authors (4): Zeta Avarikioti, Eleftherios Kokoris Kogias, Ray Neiheiser, Christos Stefo
We present a formal analysis of quorum-based State Machine Replication (SMR) protocols in Proof-of-Stake (PoS) systems under a hybrid threat model comprising honest, Byzantine, and rational validators. Our analysis of traditional quorum-based protocols establishes two fundamental impossibility results: (1) in partially synchronous networks, no quorum-based protocol can achieve SMR when rational and Byzantine validators comprise more than $1/3$ of participants, and (2) in synchronous networks, SMR remains impossible when rational and Byzantine validators comprise $2/3$ or more of participants. To overcome these limitations, we propose two complementary solutions in our hybrid model. First, we introduce a protocol that enforces a bound on the volume of the total transacted amount that is finalized within any time window $\Delta$ and prove that this bound is necessary for secure SMR protocols in our model. Second, we present the \emph{strongest chain rule}, which enables efficient finalization of transactions when the majority of honest participants provably support the SMR execution. Through empirical analysis of Ethereum and Cosmos networks, we demonstrate that validator participation consistently exceeds the required ${5}/{6}$ threshold, establishing the practical feasibility of our solution in production PoS systems.
我们在由诚实、拜占庭和理性验证者组成的混合威胁模型下,对权益证明(PoS)系统中基于法定人数的状态机复制(SMR)协议进行了形式化分析。我们对传统的基于法定人数的协议的分析确立了两个基本的不可能性结果:(1)在部分同步网络中,当理性和拜占庭验证者占参与者的比例超过$1/3$时,任何基于法定人数的协议都无法实现SMR;(2)在同步网络中,当理性和拜占庭验证者占参与者的$2/3$或更多时,SMR仍然无法实现。为克服这些限制,我们在混合模型中提出两种互补的解决方案。首先,我们引入一种协议,对任意时间窗口$\Delta$内最终确定的交易总额施加上界,并证明该上界对于我们模型中安全的SMR协议是必要的。其次,我们提出最强链规则,当可证明多数诚实参与者支持SMR执行时,该规则能够高效地最终确定交易。通过对Ethereum和Cosmos网络的实证分析,我们证明验证者参与率始终超过所需的${5}/{6}$阈值,从而确立了我们的方案在生产PoS系统中的实际可行性。
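The two impossibility thresholds stated in the abstract above are easy to misread at the boundary: one is strict, the other is not. The helper below encodes them exactly as worded, using exact fractions to avoid floating-point edge cases. The function name is hypothetical, and a result of `False` only means that these two results do not apply; it does not mean SMR is achievable.

```python
from fractions import Fraction


def smr_ruled_out(byzantine: int, rational: int, total: int,
                  synchronous: bool) -> bool:
    """True if the stated impossibility results rule out quorum-based SMR."""
    share = Fraction(byzantine + rational, total)
    if synchronous:
        # Synchronous networks: impossible at 2/3 or more.
        return share >= Fraction(2, 3)
    # Partially synchronous networks: impossible above 1/3.
    return share > Fraction(1, 3)


partial_34 = smr_ruled_out(20, 14, 100, synchronous=False)   # 34% > 1/3
partial_33 = smr_ruled_out(20, 13, 100, synchronous=False)   # 33% < 1/3
sync_exact = smr_ruled_out(1, 1, 3, synchronous=True)        # exactly 2/3
sync_66 = smr_ruled_out(40, 26, 100, synchronous=True)       # 66% < 2/3
```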
Article 74
Title@2025-06-13 (5): Auctions with Tokens: Monetary Policy as a Mechanism Design Choice
Title: Auctions with Tokens: Monetary Policy as a Mechanism Design Choice | Auktionen mit Tokens: Geldpolitik als Mechanism Design Choice | 与Tokons的拍卖:货币政策作为一种机制设计选择 2301.13794v4 |
Authors (1): Andrea Canidio
I study a repeated auction in which payments are made with a blockchain token created and initially owned by the auction designer. Unlike the "virtual money" previously examined in mechanism design, such tokens can be saved and traded outside the mechanism. I show that the present-discounted value of expected revenues equals that of a conventional dollar auction, but revenues accrue earlier and are less volatile. The optimal monetary policy burns the tokens used for payment, a practice common in blockchain-based protocols. I also show that the same outcome can be reproduced in a dollar auction if the auctioneer issues a suitable dollar-denominated security. This equivalence breaks down with moral hazard and contracting frictions: with severe contracting frictions the token auction dominates, whereas with mild contracting frictions the dollar auction combined with a dollar-denominated financial instrument is preferred.
我研究了一次反复的拍卖,在拍卖中,付款使用的是一个由拍卖设计者创建和最初拥有的链条标码。与以前在机制设计中审查的“虚拟货币”不同,这些标码可以保存和在机制之外交易。我表明,目前预期收入的折扣价值相当于常规美元拍卖的价值,但收入提前累积,波动性较小。最佳货币政策烧毁了用于付款的标码,这是以链条为基础的协议中的一种常见做法。我还表明,如果拍卖者发行一个以美元计价的适当证券,同样的结果也可以在美元拍卖中复制。这种等值与道德风险和合同摩擦打破了:随着严重的合同摩擦,象征性拍卖占主导地位,而随着小规模合同摩擦,美元拍卖加上以美元计价的金融工具则更为可取。
Article 75
Title@2025-06-13 (5): Bounded Memory in Distributed Networks
Title: Bounded Memory in Distributed Networks | Gebundenes Gedächtnis in verteilten Netzwerken | 分布式网络中的环绕内存 2506.11644v1 |
Authors (6): Ran Ben Basat, Keren Censor-Hillel, Yi-Jun Chang, Wenchen Han, Dean Leitersdorf, Gregory Schwartzman
The recent advent of programmable switches makes distributed algorithms readily deployable in real-world datacenter networks. However, there are still gaps between theory and practice that prevent the smooth adaptation of CONGEST algorithms to these environments. In this paper, we focus on the memory restrictions that arise in real-world deployments. We introduce the $\mu$-CONGEST model where on top of the bandwidth restriction, the memory of nodes is also limited to $\mu$ words, in line with real-world systems. We provide fast algorithms of two main flavors. First, we observe that many algorithms in the CONGEST model are memory-intensive and do not work in $\mu$-CONGEST. A prime example of a family of algorithms that use large memory is clique-listing algorithms. We show that the memory issue that arises here cannot be resolved without incurring a cost in the round complexity, by establishing a lower bound on the round complexity of listing cliques in $\mu$-CONGEST. We introduce novel techniques to overcome these issues and generalize the algorithms to work within a given memory bound. Combined with our lower bound, these provide tight tradeoffs between the running time and memory of nodes. Second, we show that it is possible to efficiently simulate various families of streaming algorithms in $\mu$-CONGEST. These include fast simulations of $p$-pass algorithms, random order streams, and various types of mergeable streaming algorithms. Combining our contributions, we show that we can use streaming algorithms to efficiently generate statistics regarding combinatorial structures in the network. An example of an end result of this type is that we can efficiently identify and provide the per-color frequencies of the frequent monochromatic triangles in $\mu$-CONGEST.
最近出现的可编程开关使分布式算法容易在真实世界的数据中心网络中部署。 但是, 理论和实践之间仍然存在着差距, 无法使 CONEST 算法顺利适应这些环境。 在本文中, 我们集中关注在现实世界部署中产生的记忆限制。 我们引入了$mu$- CONEST 模型, 在带宽限制之外, 节点的记忆也限于 $mu$, 符合真实世界的系统。 我们提供了两种主要口味的快速流算法。 首先, 我们观察到, ONEEST 模型中的许多算法是记忆密集的, 不在$mu$- ONEST 中工作。 一个使用大记忆的算法组合的主要例子就是 球列算算算算算算算法。 我们这里产生的记忆问题无法解决, 在圆形的复杂度中, 通过在以$\mu$- CONEST 中设置一个更低的圆序, 我们可以用新的技术来克服这些问题, 将这个算算法 放在一个具有记忆的系统内, 快速的网络结构中, , 显示我们不断 快速的算算算算算 。
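The abstract above mentions simulating mergeable streaming algorithms under a per-node memory bound and reporting frequencies of frequent items. As a concrete example of a mergeable summary, the sketch below implements the classic Misra-Gries frequent-items summary with `k` counters, together with the standard merge step that subtracts the (k+1)-th largest count. It is chosen here for illustration and is not claimed to be the algorithm used in the paper; `k` stands in loosely for the memory budget $\mu$.

```python
def mg_update(summary: dict, item, k: int) -> None:
    """Process one stream item while keeping at most k counters."""
    if item in summary:
        summary[item] += 1
    elif len(summary) < k:
        summary[item] = 1
    else:
        # No free counter: decrement everything and drop zeros.
        for key in list(summary):
            summary[key] -= 1
            if summary[key] == 0:
                del summary[key]


def mg_summarize(stream, k: int) -> dict:
    summary = {}
    for item in stream:
        mg_update(summary, item, k)
    return summary


def mg_merge(a: dict, b: dict, k: int) -> dict:
    """Combine two summaries into one that again uses at most k counters."""
    merged = dict(a)
    for key, count in b.items():
        merged[key] = merged.get(key, 0) + count
    if len(merged) > k:
        cut = sorted(merged.values(), reverse=True)[k]   # (k+1)-th largest
        merged = {key: c - cut for key, c in merged.items() if c > cut}
    return merged


# Two nodes summarize their local streams, then the summaries are merged.
left = mg_summarize(["a"] * 5 + ["b", "c"], k=2)
right = mg_summarize(["a"] * 4 + ["d", "e"], k=2)
combined = mg_merge(left, right, k=2)
```

Each stored count underestimates the true frequency by at most N/(k+1), where N is the total number of items seen, so the merged estimate of 7 for "a" is within that bound of its true count of 9.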
Article 76
Title@2025-06-13 (5): Capsule: Efficient Player Isolation for Datacenters
Title: Capsule: Efficient Player Isolation for Datacenters | Kapsel: Effiziente Spielerisolierung für Rechenzentren | Capsule: 数据中心的有效玩家隔离 2506.11483v1 |
Authors (4): Zhouheng Du, Nima Davari, Li Li, Nodir Kodirov
Cloud gaming is increasingly popular. A challenge for cloud providers is to keep datacenter utilization high, a non-trivial task given the variety of applications. These applications come in different shapes and sizes, as do cloud datacenter resources, e.g., CPUs, GPUs, and NPUs. Part of the challenge stems from game engines being predominantly designed to run only one player. One player in a lightweight game might utilize only a fraction of the cloud server GPU. The remaining GPU capacity will be left underutilized, an undesired outcome for the cloud provider. We introduce Capsule, a mechanism that allows multiple players to seamlessly share one GPU. We implemented Capsule in O3DE, a popular open source game engine. Our evaluations show that Capsule can increase datacenter resource utilization by accommodating up to 2.25x more players, without degrading the player gaming experience. Capsule is also application agnostic. We ran four applications on Capsule-based O3DE with no application changes. Our experiences show that the Capsule design can be adopted by other game engines to increase datacenter utilization across cloud providers.
云层游戏越来越受欢迎。 云端提供商面临的一项挑战是保持高数据中心利用率: 一种非三相任务, 因为应用种类不同。 这些应用程序的形状和大小不同。 云层数据中心资源也是如此, 例如 CPU、 GPUs、 NPUs。 部分挑战来自游戏引擎主要设计只运行一个玩家。 一个轻量级游戏中的玩家可能只使用云服务器 GPU 的一小部分。 其余的 GPU 容量将被留在未充分利用的范围内, 云端提供商不希望看到的结果。 我们引入了 Capsule, 这个机制允许多个玩家无缝共享一个 GPU 。 我们在 O3DE 中应用了 Capsule , 一个受欢迎的开源游戏引擎。 我们的评估显示, Capsule 可以通过容纳2. 25x 更多的玩家来增加数据中心资源的利用, 而不会降低玩家的玩家的玩家的游戏。 Capsule 也应用“ 孔” 。 我们在基于 Capsule 的 O3DE 应用程序上运行了四个应用程序, 没有变化 。 我们的经验显示, Capsule 设计可以被其他游戏引擎采用。
Article 77
Title@2025-06-13 (5): Level set-based inverse homogenisation of three-dimensional piezoelectric materials
Title: Level set-based inverse homogenisation of three-dimensional piezoelectric materials | Inverse Homogenisierung von dreidimensionalen piezoelektrischen Werkstoffen auf stufenweiser Basis | 三维压电压材料的 水平定级反同质化 2410.03148v3 |
Authors (3): Zachary J. Wegert, Anthony P. Roberts, Vivien J. Challis
In this paper we use memory-distributed level set-based topology optimisation to design three-dimensional periodic piezoelectric materials with enhanced properties. We compare and assess several existing iterative solvers with respect to their weak scalability and find that an approximate Schur complement preconditioned generalized minimal residual method demonstrates the best performance and scalability for solving the piezoelectric homogenisation equations. We use the developed techniques to computationally design high-resolution piezoelectric metamaterials with enhanced stiffness and piezoelectric properties that yield new insights into material design for sensor, hydrophone, and actuator applications. We suggest two robust structures with no fine-scale features that exhibit enhanced piezoelectric properties several times larger than those of the base material. We find that level set-based topology optimisation is well suited to problems involving piezoelectricity and has the advantage of avoiding large regions of intermediate density material. Our memory-distributed level-set implementation is open source and provided for practitioners in the community.
在本文中,我们使用基于内存分布水平的定型地形优化来设计具有强化特性的三维定期电动材料。我们比较和评估了几个现有迭代解答器的可伸缩性,发现一个约Schur补充的、具有先决条件的通用最低残余方法表明,解决派生电同质方程式的最佳性能和可伸缩性。我们利用开发的技术来计算设计高分辨率派生电的外生材料,这种材料的坚硬性和派生电特性能够对传感器、水听器和电动器应用的材料设计产生新的洞察力。我们建议了两个没有细微特征的强健结构,这些结构展示出比基本材料大几倍的派生电特性。我们发现,基于定型地形的优化非常适合涉及派电量的问题,并具有避免大面积中间密度材料的优势。我们的记忆分配水平应用是开放的来源,供社区从业人员使用。
Article 78
Title@2025-06-13 (5): Topology-Aware Virtualization over Inter-Core Connected Neural Processing Units
Title: Topology-Aware Virtualization over Inter-Core Connected Neural Processing Units | Topologie-Bewusst-Virtualisierung über kernverbundene Neuralverarbeitungseinheiten | 核心间连接神经元处理单位的地形-认知虚拟化 2506.11446v1 |
Authors (7): Dahu Feng, Erhu Feng, Dong Du, Pinjie Xu, Yubin Xia, Haibo Chen, Rong Zhao
With the rapid development of artificial intelligence (AI) applications, an emerging class of AI accelerators, termed Inter-core Connected Neural Processing Units (NPUs), such as Graphcore IPU and Tenstorrent, has been adopted in both cloud and edge computing environments. Despite their innovative design, these NPUs often demand substantial hardware resources, leading to suboptimal resource utilization due to the imbalance of hardware requirements across various tasks. To address this issue, prior research has explored virtualization techniques for monolithic NPUs, but has neglected inter-core connected NPUs and their hardware topology. This paper introduces vNPU, the first comprehensive virtualization design for inter-core connected NPUs, integrating three novel techniques: (1) NPU route virtualization, which redirects instruction and data flow from virtual NPU cores to physical ones, creating a virtual topology; (2) NPU memory virtualization, designed to minimize translation stalls for SRAM-centric and NoC-equipped NPU cores, thereby maximizing the memory bandwidth; and (3) best-effort topology mapping, which determines the optimal mapping from all candidate virtual topologies, balancing resource utilization with end-to-end performance. We have developed a prototype of vNPU on both an FPGA platform (Chipyard+FireSim) and a simulator (DCRA). Evaluation results indicate that, compared to other virtualization approaches such as unified virtual memory and MIG, vNPU achieves up to a 2x performance improvement across various ML models, with only 2% hardware cost.
随着人工智能(AI)应用的迅速发展,一个新兴的AI加速器类别,即称为跨核心连通神经处理器(NPU),已经在云层和边缘计算环境中被采用,如Iplacore 议会联盟、Tensorrent等。尽管这些NPU设计创新,但往往需要大量硬件资源,由于各种任务的硬件要求不平衡,导致资源利用不够最佳。为了解决这一问题,先前的研究探索了单一式NPU的虚拟化技术,但忽视了与硬件表层的连接核心NPU。本文介绍了VNPU,这是连接的NPU首次综合虚拟化设计,其中结合了三种新颖技术:(1) NPU线路虚拟化,将指令和数据从虚拟NPU核心转向物理资源,创造了虚拟表层学;(2) NPU记忆虚拟化,旨在最大限度地减少SRA中以中和无C装备的NPU核心核心核心的翻译摊位,从而最大限度地改进存储率的带宽度带宽度;(3) 最佳地形图象图,它决定了从所有候选的2级SFAFA级平面模型的优化图像,平衡资源使用。
Article 79
Title@2025-06-13 (5): Advancing Hybrid Defense for Byzantine Attacks in Federated Learning
Title: Advancing Hybrid Defense for Byzantine Attacks in Federated Learning | Förderung der Hybrid-Verteidigung für byzantinische Angriffe im Federated Learning | 推进联邦学习联盟拜占庭攻击事件混合防御 2409.06474v3 |
Authors (4): Kai Yue, Richeng Jin, Chau-Wai Wong, Huaiyu Dai
Federated learning (FL) enables multiple clients to collaboratively train a global model without sharing their local data. Recent studies have highlighted the vulnerability of FL to Byzantine attacks, where malicious clients send poisoned updates to degrade model performance. In particular, many attacks have been developed targeting specific aggregation rules, whereas various defense mechanisms have been designed for dedicated threat models. This paper studies the resilience of attack-agnostic FL scenarios, where the server lacks prior knowledge of both the attackers’ strategies and the number of malicious clients involved. We first introduce hybrid defenses against state-of-the-art attacks. Our goal is to identify a general-purpose aggregation rule that performs well on average while also avoiding worst-case vulnerabilities. By adaptively selecting from available defenses, we demonstrate that the server remains robust even when confronted with a substantial proportion of poisoned updates. We also emphasize that existing FL defenses should not automatically be regarded as secure, as demonstrated by the newly proposed Trapsetter attack. The proposed attack outperforms other state-of-the-art attacks by further increasing the impact of the attack by 5-15%. Our findings highlight the ongoing need for the development of Byzantine-resilient aggregation algorithms in FL.
联邦学习(FL) 使多个客户能够在不分享本地数据的情况下合作训练全球模型。 最近的研究突出表明FL很容易受到拜占庭攻击,恶意客户发送有毒更新,以降低模型性能。特别是,许多袭击针对特定集成规则,而各种防御机制是为专门的威胁模型设计的。本文研究攻击性不可知的FL情景的抗御能力,即服务器事先不了解攻击者的战略和所涉恶意客户的数量。我们首先对最先进的袭击采用混合防御手段。我们的目标是确定一个通用综合规则,该规则平均运行良好,同时避免最坏的弱点。我们通过从现有防御中作出适应性选择,表明即使面对大量有毒更新,服务器仍然强劲。我们还强调,现有的FL防御不应自动被视为安全,正如新提议的Trapstertter攻击所证明的那样。拟议攻击比其他最先进的攻击更接近于其他状态的防御手段,进一步增加5-15 %的攻击影响。我们的调查结果突出表明,需要不断开发Fizangines 。
Article 80
Title@2025-06-13 (5): You can lie but not deny: SWMR registers with signature properties in systems with Byzantine processes
Title: You can lie but not deny: SWMR registers with signature properties in systems with Byzantine processes | Sie können lügen, aber nicht leugnen: SWMR Register mit Signatur Eigenschaften in Systemen mit byzantinischen Prozessen | 你可以说谎,但不能否认:在拜占庭程序系统中,有签名属性的SWMR登记系统登记系统登记系统。 2504.09805v2 |
Authors (2): Xing Hu, Sam Toueg
We define and show how to implement SWMR registers that provide properties of unforgeable digital signatures - without actually using such signatures - in systems with Byzantine processes. Intuitively, processes can use these registers to write values as if they are "signed", such that these "signed values" can be "verified" by any process and "relayed" to any process. All our register implementations are from SWMR registers, and they work in systems with $n > 3f$ processes, $f$ of which can be Byzantine. We show that these implementations are optimal in the number of Byzantine processes they can tolerate: more precisely, we prove that if $3 \le n \le 3f$, the registers that we propose cannot be implemented from SWMR registers without using signatures. The registers that we introduce in this paper can also be implemented without signatures in message-passing systems with $n > 3f$ processes, $f$ of which can be Byzantine: this is because SWMR registers can be implemented in such systems (Mostéfaoui, Petrolia, Raynal, and Jard 2017).
我们定义并展示如何在拜占庭进程的系统中实施提供不可预见数字签名属性的SWMR登记册, 而不实际使用这种签名。 直观地说, 这些进程可以使用这些登记册来写值, 仿佛它们“ 签名 ” , 这样这些“ 指定值” 可以通过任何进程“ 核查” , 并“ 延期” 到任何进程。 我们所有的注册执行情况都来自SWMR登记册, 它们的工作系统有3f$以上进程, 其中1美元可以是Byzantine。 我们表明, 这些执行在他们能够容忍的Byzantine进程数量上是最佳的: 更准确地说, 我们证明, 如果3\le n\ le 3f$, 我们提议的登记册无法在不使用签名的情况下从SWMRMR登记册执行。 我们在本文中引入的登记册也可以在信息接收系统中不签名而执行, $ > 3f$, 其中1美元可以是Byzantine: 这是因为SWMR登记册可以在这种系统中实施, 和RARDAR( )。 (Mostaria, ) 。
Article 81
Title@2025-06-13 (5): WindVE: Collaborative CPU-NPU Vector Embedding
Title: WindVE: Collaborative CPU-NPU Vector Embedding | WindVE: Kollaborative CPU-NPU-Vektor-Einbettung | Windeve:协作式CPU-NPU 矢量嵌入 2504.14941v4 |
Authors (7): Jinqi Huang, Xuebing Yu, Yi Xiong, Wenjie Huang, Entong Li, Li Zeng, Xin chen
Retrieval-Augmented Generation is a technology that enhances large language models by integrating information retrieval. In the industry, inference services based on LLMs are highly sensitive to cost-performance ratio, prompting the need for improving hardware resource utilization in the inference service. Specifically, vector embedding and retrieval processes take up to 20% of the total latency. Therefore, optimizing the utilization of computational resources in vector embeddings is crucial for enhancing the cost-performance ratio of inference processes, which in turn boosts their product competitiveness. In this paper, we analyze the deployment costs of vector embedding technology in inference services, propose a theoretical formula, and determine through the mathematical expression that increasing the capacity to process concurrent queries is the key to reducing the deployment costs of vector embeddings. Therefore, in this paper, we focus on improving the product's capability to process concurrent queries. To optimize concurrency without sacrificing performance, we have designed a queue manager that adeptly offloads CPU peak queries. This manager utilizes a linear regression model to ascertain the optimal queue depths, a critical parameter that significantly influences the efficacy of the system. We further develop a system named WindVE that uses a CPU-NPU heterogeneous architecture to offload peak concurrent queries, which leverages the performance differences between the two processors to effectively manage traffic surges. Through experiments, we compare WindVE to the state-of-the-art vector embedding framework FlagEmbedding, and achieve a concurrency level up to 22.3% higher than the scheme without offloading.
在行业中,基于LLMS的推论服务对成本-性能比率具有高度敏感性,从而促使需要改进推论服务的硬件资源利用率。具体地说,矢量嵌入和检索过程占总延缓量的20%。因此,优化在矢量嵌入过程中对计算资源的利用,对于提高推论过程的成本-性能比率至关重要,而这反过来又会提高它们的产品竞争力。在本文件中,我们分析了基于LLMS的矢量嵌入技术在推论服务中的部署成本,提出了理论公式,并通过数学表达方式确定,提高同时查询的能力是降低矢量嵌入的部署成本的关键。因此,在本文件中,我们侧重于提高产品处理并行查询的能力。为了在不牺牲性能的情况下优化调值,我们设计了一个排队管理器,该排队管理器将CPU峰值查询调低。这位经理利用了线性回归模型来确定最优的排位深度,这是一个关键参数,通过数学表达方式来大大地影响导航系统的效率。我们利用了C-CPRLM的升级系统,我们开发了一个不同时调的系统。
Article 82
Title@2025-06-12 (4): SwiftSpec: Ultra-Low Latency LLM Decoding by Scaling Asynchronous Speculative Decoding
Title: SwiftSpec: Ultra-Low Latency LLM Decoding by Scaling Asynchronous Speculative Decoding | SwiftSpec: Ultra-Low Latency LLM Decodierung durch Skalierung asynchroner spekulativer Decodierung | SwiftSpecle: 通过缩放非同步的投机性代号来解码超低纬度LLM LLM 2506.11309v1 |
Authors (8): Ziyi Zhang, Ziheng Jiang, Chengquan Jiang, Menghan Yu, Size Zheng, Haibin Lin, Henry Hoffmann, Xin Liu
Low-latency decoding for large language models (LLMs) is crucial for applications like chatbots and code assistants, yet generating long outputs remains slow in single-query settings. Prior work on speculative decoding (which combines a small draft model with a larger target model) and tensor parallelism has each accelerated decoding. However, conventional approaches fail to apply both simultaneously due to imbalanced compute requirements (between draft and target models), KV-cache inconsistencies, and communication overheads under small-batch tensor-parallelism. This paper introduces SwiftSpec, a system that targets ultra-low latency for LLM decoding. SwiftSpec redesigns the speculative decoding pipeline in an asynchronous and disaggregated manner, so that each component can be scaled flexibly and draft overhead is removed from the critical path. To realize this design, SwiftSpec proposes parallel tree generation, tree-aware KV cache management, and fused, latency-optimized kernels to overcome the challenges listed above. Across 5 model families and 6 datasets, SwiftSpec achieves an average of 1.75x speedup over state-of-the-art speculative decoding systems and, as a highlight, serves Llama3-70B at 348 tokens/s on 8 Nvidia Hopper GPUs, making it the fastest known system for low-latency LLM serving at this scale.
大型语言模型(LLMS)的低持久性解码对于像聊天机和代码助理这样的应用至关重要,但生成长产出的系统在单件设备环境下仍然缓慢。 先前的投机解码工作( 将小型草案模型与更大的目标模型相结合) 和 超度平行工作各自加速解码。 但是,常规方法未能同时应用,原因是( 草案和目标模型之间) 的计算要求不平衡, KV 缓冲不一致, 以及小批的高压软体下的通信间接费用。 本文介绍了SwiftSpec, 该系统针对超低的LLM解码的脱码系统。 SwiftSpec 以不连贯和分解的方式重新设计投机解码管道( 将小型草案模型模型模型与更大的目标模型结合起来 ) , 使每个部件可以灵活地缩放, 从关键路径上删除间接费用草案。 为了实现这一设计, SweftSpecreatSpec 提议平行的树生成、 树性KV缓冲式缓冲式缓冲管理, 以及已知的软质- Oppict- 内核内, 以克服上面所列的挑战。 在5个模型中, 5个模型中, Swedial- slax- slades 和8-tals平级系统上, 在Speal-deal- slax
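To make the draft/target interplay described in the SwiftSpec abstract concrete, the following is a minimal sketch of plain, synchronous, greedy speculative decoding. It is not SwiftSpec's asynchronous, disaggregated pipeline, and it has no tree generation or KV cache; `toy_target`, `toy_draft`, and all function names are hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, List

Token = int
Model = Callable[[List[Token]], Token]  # maps a context to its greedy next token


def greedy_decode(target: Model, prompt: List[Token], max_new: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_decode(target: Model, draft: Model, prompt: List[Token],
                       max_new: int, k: int = 4) -> List[Token]:
    """Greedy speculative decoding; the output matches greedy_decode(target, ...)."""
    out = list(prompt)
    goal = len(prompt) + max_new
    while len(out) < goal:
        # 1) The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2) The target checks every proposed position. A real system would
        #    batch these checks into a single forward pass.
        for i, tok in enumerate(proposal):
            expected = target(out + proposal[:i])
            if expected != tok:
                # First mismatch: keep the agreed prefix plus the target's token.
                out.extend(proposal[:i])
                out.append(expected)
                break
        else:
            out.extend(proposal)  # every drafted token was accepted
    return out


# Hypothetical toy models: the draft agrees with the target except when the
# context length is a multiple of 3.
def toy_target(ctx: List[Token]) -> Token:
    return (sum(ctx) * 7 + 3) % 11


def toy_draft(ctx: List[Token]) -> Token:
    return toy_target(ctx) if len(ctx) % 3 else (toy_target(ctx) + 1) % 11
```

Because every accepted token is checked against the target's own greedy choice, the draft model affects only speed, never the output. The sketch keeps drafting and verification strictly sequential, which is the kind of coupling the abstract says SwiftSpec removes from the critical path.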
Article 83
Title@2025-06-12 (4): Byzantine-Resilient Secure Aggregation for Federated Learning Without Privacy Compromises
Title: Byzantine-Resilient Secure Aggregation for Federated Learning Without Privacy Compromises | Byzantinisch-Resilient Sichere Aggregation für Federated Learning ohne Datenschutz Kompromisse | Byzantine-抗拜占庭-无隐私障碍联邦学习安全聚合 2405.08698v3 |
Authors (4): Yue Xia, Christoph Hofmeister, Maximilian Egger, Rawad Bitar
Federated learning (FL) shows great promise in large scale machine learning, but brings new risks in terms of privacy and security. We propose ByITFL, a novel scheme for FL that provides resilience against Byzantine users while keeping the users’ data private from the federator and private from other users. The scheme builds on the preexisting non-private FLTrust scheme, which tolerates malicious users through trust scores (TS) that attenuate or amplify the users’ gradients. The trust scores are based on the ReLU function, which we approximate by a polynomial. The distributed and privacy-preserving computation in ByITFL is designed using a combination of Lagrange coded computing, verifiable secret sharing and re-randomization steps. ByITFL is the first Byzantine resilient scheme for FL with full information-theoretic privacy.
联邦学习(FL)在大规模机器学习中表现出巨大的希望,但在隐私和安全方面带来了新的风险。我们提议ByITFL,这是FL的一个新计划,它为Byzantine用户提供抗御能力,同时保持用户从联邦和私人从其他用户获得的私密数据。这个计划以原有的非私人FLTruust计划为基础,它通过减轻或扩大用户梯度的信用分数来容忍恶意用户。信任分数以RLU功能为基础,我们以一个多边词汇为近似值。ByITFL的分布式和隐私保护计算是使用Lagrange编码计算、可核查的秘密共享和重新定位步骤的组合设计的。Byzantine计划是拥有完整信息理论隐私的FL的第一个Byzantine适应性计划。
Article 84
Title@2025-06-12 (4): LoByITFL: Low Communication Secure and Private Federated Learning
Title: LoByITFL: Low Communication Secure and Private Federated Learning | LoByITFL: Niedrige Kommunikation Sicheres und Privates Federated Learning | LoByITFL: 低通信安全和私营联邦学习 2405.19217v2 |
Authors (4): Yue Xia, Maximilian Egger, Christoph Hofmeister, Rawad Bitar
Privacy of the clients’ data and security against Byzantine clients are key challenges in Federated Learning (FL). Existing solutions to joint privacy and security incur sacrifices on the privacy guarantee. We introduce LoByITFL, the first communication-efficient information-theoretically private and secure FL scheme that makes no sacrifices on the privacy guarantees while ensuring security against Byzantine adversaries. The key components are a small and representative dataset available to the federator, a careful modification of the FLTrust algorithm, and the one-time use of a trusted third party during an initialization period. We provide theoretical guarantees on the privacy and Byzantine resilience, as well as experimental results showing the convergence of LoByITFL.
客户对拜占庭客户的数据和安全隐私是联邦学习联盟(FL)面临的主要挑战。共同隐私和安全的现有解决方案对隐私保障造成牺牲。我们引入了LoByITFL,这是第一个通信高效信息理论私营和安全的FL计划,在确保对拜占庭对手的安全的同时,对隐私保障不做牺牲。关键组成部分是联邦培训者可获得的具有代表性的小型数据集,仔细修改FLTruust算法,以及在初始阶段一次性使用受信任的第三方。我们为隐私和拜占庭复原力提供了理论保障,以及显示LoByITFL趋同的实验结果。
Article 85
Title@2025-06-12 (4): To Compress or Not To Compress: Energy Trade-Offs and Benefits of Lossy Compressed I/O
Title: To Compress or Not To Compress: Energy Trade-Offs and Benefits of Lossy Compressed I/O | Um zu komprimieren oder nicht zu komprimieren: Energie-Handels-Offs und Vorteile von Lossy Compressed I/O | 压缩或非压缩:能源贸易额和损失压缩 I/O 2410.23497v2 |
Authors (5): Grant Wilkins, Sheng Di, Jon C. Calhoun, Robert Underwood, Franck Cappello
Modern scientific simulations generate massive volumes of data, creating significant challenges for I/O and storage systems. Error-bounded lossy compression (EBLC) offers a solution by reducing data set sizes while preserving data quality within user-specified limits. This study provides the first comprehensive energy characterization of state-of-the-art EBLC algorithms (SZ2, SZ3, ZFP, QoZ, and SZx) across various scientific data sets, CPU generations, and parallel/serial modes. We analyze the energy consumption patterns of compression and decompression operations, as well as the energy trade-offs in data I/O scenarios. Our work demonstrates the relationships between compression ratios, runtime, energy efficiency, and data quality, highlighting the importance of considering compressors and error bounds for specific use cases. We demonstrate that EBLC can significantly reduce I/O energy consumption, with savings of up to two orders of magnitude compared to uncompressed I/O for large data sets. In multi-node HPC environments, we observe energy reductions of approximately 25% when using EBLC. We also show that EBLC can achieve compression ratios of 10-100x, potentially reducing storage device requirements by nearly two orders of magnitude. This work provides a framework for system operators and computational scientists to make informed decisions about implementing EBLC for energy-efficient data management in HPC environments.
现代科学模拟产生大量数据,给I/O和储存系统带来重大挑战。受错误限制的损失压缩(EBLC)提供了一种解决办法,通过减少数据集规模,同时在用户规定的限度内保留数据质量。这项研究提供了对先进的EBLC算法-SZ2、SZ3、ZFP、QoZ和SZx-跨各种科学数据集、CPU世代和平行/序列模式的首次全面能源特征描述。我们分析了压缩和压缩作业的能源消耗模式以及数据I/O情景中的能源交易。我们的工作展示了压缩比率、运行时间、能源效率和数据质量之间的关系,强调了为具体使用案例考虑压缩和误差约束的重要性。我们证明EBLCC可以大大减少I/O的能源消耗量,而大型数据集的I/O节能量可节省多达两个级。在多角度的HPC环境中,我们观察到在使用EBLC系统降低能源交易量时,能源交易量和数据质量的EBCLC操作者为10级的储存量,我们还显示EB系统实现10级的储存要求,EBLC公司可能使ELC公司达到10级标准。
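The "user-specified limits" in the EBLC abstract are typically an absolute error bound on each reconstructed value. As a minimal sketch of that guarantee only, the snippet below uses plain uniform scalar quantization. This is an assumption made for illustration: it is not how SZ2, SZ3, ZFP, QoZ, or SZx work internally (those add prediction, transforms, and entropy coding), and the function names are hypothetical.

```python
from typing import List


def quantize(values: List[float], error_bound: float) -> List[int]:
    """Map each value to the index of its nearest bin of width 2 * error_bound."""
    step = 2.0 * error_bound
    return [round(v / step) for v in values]


def dequantize(codes: List[int], error_bound: float) -> List[float]:
    """Reconstruct each value as the centre of its bin."""
    step = 2.0 * error_bound
    return [c * step for c in codes]


def max_abs_error(original: List[float], reconstructed: List[float]) -> float:
    """Largest pointwise deviation introduced by the round trip."""
    return max(abs(a - b) for a, b in zip(original, reconstructed))
```

Rounding to the nearest bin moves a value by at most half a bin width, which is exactly `error_bound` (up to floating-point rounding). The resulting integer codes are usually far more repetitive than the raw floats, and that repetition is what a subsequent lossless stage would exploit to reach high compression ratios.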
Article 86
Title@2025-06-12 (4): TimberStrike: Dataset Reconstruction Attack Revealing Privacy Leakage in Federated Tree-Based Systems
Title: TimberStrike: Dataset Reconstruction Attack Revealing Privacy Leakage in Federated Tree-Based Systems | TimberStrike: Datensatz-Rekonstruktion Angriff Enthüllen der Privatsphäre Leckage in Federated Tree-Based Systems | 木材三角:联邦树基系统中数据集重建攻击清除隐私渗漏 2506.07605v2 |
Authors (5): Marco Di Gennaro, Giovanni De Lucia, Stefano Longari, Stefano Zanero, Michele Carminati
Federated Learning has emerged as a privacy-oriented alternative to centralized Machine Learning, enabling collaborative model training without direct data sharing. While extensively studied for neural networks, the security and privacy implications of tree-based models remain underexplored. This work introduces TimberStrike, an optimization-based dataset reconstruction attack targeting horizontally federated tree-based models. Our attack, carried out by a single client, exploits the discrete nature of decision trees by using split values and decision paths to infer sensitive training data from other clients. We evaluate TimberStrike on State-of-the-Art federated gradient boosting implementations across multiple frameworks, including Flower, NVFlare, and FedTree, demonstrating their vulnerability to privacy breaches. On a publicly available stroke prediction dataset, TimberStrike consistently reconstructs between 73.05% and 95.63% of the target dataset across all implementations. We further analyze Differential Privacy, showing that while it partially mitigates the attack, it also significantly degrades model performance. Our findings highlight the need for privacy-preserving mechanisms specifically designed for tree-based Federated Learning systems, and we provide preliminary insights into their design.
联邦学习联合会已成为中央机构学习的以隐私为导向的替代方案,有利于合作模式培训,而没有直接分享数据。尽管对神经网络进行了广泛研究,但基于树的模型对安全和隐私的影响仍未得到充分探讨。这项工作引入了TaultStrike,这是以横向结合的树为基础的模型为对象的基于优化的数据元重建攻击。我们由一个客户进行的攻击,利用决策树的离散性质,利用不同的价值和决定路径从其他客户处推断敏感培训数据。我们评估了木材在包括Flower、NVFFlare和FedTre在内的多个框架的州级联盟梯度促进实施方面发生的碰撞,展示了它们易受隐私破坏的脆弱性。在公开提供的中风预测数据集中,木材Strike持续地重建了所有执行过程中目标数据集的73.05 %至95.63%。我们进一步分析差异隐私,表明它虽然部分减轻了攻击,但也显著地降低了模型性。我们的调查结果突出表明需要专门为基于树木的联邦学习系统设计的隐私保护机制,我们提供了初步的见解。
Article 87
Title@2025-06-12 (4): Adaptive Job Scheduling in Quantum Clouds Using Reinforcement Learning
Title: Adaptive Job Scheduling in Quantum Clouds Using Reinforcement Learning | Adaptive Jobplanung in Quantenwolken mittels Verstärkungslernen | 利用强化学习在量云中进行适应性就业安排 2506.10889v1 |
Authors (4): Waylon Luo, Jiapeng Zhao, Tong Zhan, Qiang Guan
Present-day quantum systems face critical bottlenecks, including limited qubit counts, brief coherence intervals, and high susceptibility to errors, all of which obstruct the execution of large and complex circuits. The advancement of quantum algorithms has outpaced the capabilities of existing quantum hardware, making it difficult to scale computations effectively. Additionally, inconsistencies in hardware performance and pervasive quantum noise undermine system stability and computational accuracy. To optimize quantum workloads under these constraints, strategic approaches to task scheduling and resource coordination are essential. These methods must aim to accelerate processing, retain operational fidelity, and reduce the communication burden inherent to distributed setups. One of the persistent challenges in this domain is how to efficiently divide and execute large circuits across multiple quantum processors (QPUs), especially in error-prone environments. In response, we introduce a simulation-based tool that supports distributed scheduling and concurrent execution of quantum jobs on networked QPUs connected via real-time classical channels. The tool models circuit decomposition for workloads that surpass individual QPU limits, allowing for parallel execution through inter-processor communication. Using this simulation environment, we compare four distinct scheduling techniques, among them a model informed by reinforcement learning. These strategies are evaluated across multiple metrics, including runtime efficiency, fidelity preservation, and communication costs. Our analysis underscores the trade-offs inherent in each approach and highlights how parallelized, noise-aware scheduling can meaningfully improve computational throughput in distributed quantum infrastructures.
当今量子系统面临严重的瓶颈,包括量子计数有限、一致性间隔短暂、容易发生妨碍大型复杂电路执行的错误。量子算法的进步速度超过了现有量子处理器(QPUs)的能力,使得难以有效计算。此外,硬件性能的不一致和普遍的量子噪音破坏了系统稳定性和计算准确性。为了在这些制约因素下优化量子工作量,任务时间安排和资源协调的战略方法至关重要。这些方法必须旨在加速处理、保持业务忠诚和减少分布式装置所固有的通信负担。这一领域持续存在的挑战之一是如何高效率地区分和执行多量子处理器(QPUs)的大型电路,特别是在易出错的环境中。对此,我们采用了基于模拟的工具,支持通过实时古典渠道连接的网络化的量子驱动器分配时间表和同时执行量子工作。工具模型对超过个人量子平流限制的工作量进行电路分解,允许通过处理器的通信进行平行执行。我们利用这一模拟环境,对四种截然不同的列表技术在多个量子处理器处理器(QUPs)之间进行分割和执行大型电路路路路路路段,在易发生交易效率方面进行模型,通过学习,通过模型进行计算。通过虚拟的进度分析,这些模型进行计算,通过虚拟路流分析,通过虚拟路流分析来进行。
Article 88
Title@2025-06-12 (4): The Impact of Partial Computations on the Red-Blue Pebble Game
Title: The Impact of Partial Computations on the Red-Blue Pebble Game | Der Einfluss von partiellen Berechnungen auf das rot-blaue Pebble-Spiel | 部分计算对红蓝色球游戏的影响 2506.10854v1 |
Authors (3): Pál András Papp, Aleksandros Sobczyk, A. N. Yzelman
We study an extension of the well-known red-blue pebble game (RBP) with partial computation steps, inspired by the recent work of Sobczyk. While the original RBP assumes that we need to have all the inputs of an operation in fast memory at the same time, in many concrete computations, the inputs can be aggregated one by one into the final output value. These partial computation steps can enable pebbling strategies with much smaller I/O cost, and in settings where such a step-by-step aggregation is possible, this extended red-blue pebble game offers a much more realistic cost model. We establish the fundamental properties of this partial-computing red-blue pebble game (PRBP), and compare it to the original RBP. We begin with some simple examples where allowing partial computations can decrease the optimal I/O cost. It is also shown that the cost can decrease by up to a linear factor this way, but in general, it is NP-hard to decide whether partial computations allow for a smaller cost in a specific DAG. We then discuss how $S$-partitions, a crucial tool for deriving I/O lower bounds in RBP, can be adapted to the PRBP model. These new tools are then used to establish lower bounds on the I/O cost of some prominent computational tasks. Finally, we also adapt a hardness result from RBP, showing that the optimum cost is still NP-hard to approximate in PRBP to any reasonable factor.
我们研究的是众所周知的红色蓝色泡泡游戏(RBP)的延伸,其部分计算步骤受Sobczyk最近工作的启发。虽然最初的RBP假设我们需要同时将一个操作的所有投入都存储在快速存储中,在许多具体的计算中,投入可以一个一个一个地汇总到最终产出值中。这些部分计算步骤可以使战略以更小的I/O成本进行曲解,在有可能逐步整合的情况下,这种延长的红色蓝色泡泡游戏提供了更现实得多的成本模型。我们建立了这个部分计算红蓝色游戏(PRBP)的基本特性,并将其与原始的RBP进行比较。我们先用一些简单的例子来开始,允许部分计算可以降低最佳I/O成本。还表明成本可以通过线性因素降低成本,但一般来说,仍然很难决定部分计算是否允许在特定的DAG中降低成本。我们然后讨论如何从$S-PBBB游戏(PRBP)的精度部分调整到最终的硬性成本/调整工具。这些在IMBBP/RBRBA的模型中可以显示某种较低的硬性成本。
Article 89
Title@2025-06-12 (4): Faster CONGEST Approximation Algorithms for Maximum Weighted Independent Set in Sparse Graphs
Title: Faster CONGEST Approximation Algorithms for Maximum Weighted Independent Set in Sparse Graphs | Schnellere CONGEST-Annäherung Algorithmen für maximal gewichtete unabhängige Satz in Sparse Graphen | 快速 CONEEST 粗图中最大加权独立设置的 CONEST 近似比度值 2506.10845v1 |
Authors (2): Salwa Faour, Fabian Kuhn
The maximum independent set problem is a classic optimization problem that has also been studied quite intensively in the distributed setting. While the problem is hard to approximate in general, there are good approximation algorithms known for several sparse graph families. In this paper, we consider deterministic distributed CONGEST algorithms for the weighted version of the problem in trees and graphs of bounded arboricity. For trees, we prove that the task of deterministically computing a $(1-\epsilon)$-approximate solution to the maximum weight independent set (MWIS) problem has a tight $\Theta(\log^*(n) / \epsilon)$ complexity. The lower bound already holds on unweighted oriented paths. On the upper bound side, we show that the bound can be achieved even in unrooted trees. For graphs $G=(V,E)$ of arboricity $\beta>1$, we give two algorithms. If the sum of all node weights is $w(V)$, we show that for any $\epsilon>0$, an independent set of weight at least $(1-\epsilon)\cdot \frac{w(V)}{4\beta}$ can be computed in $O(\log^2(\beta/\epsilon)/\epsilon + \log^* n)$ rounds. This result is obtained by a direct application of the local rounding framework of Faour, Ghaffari, Grunau, Kuhn, and Rozhoň [SODA '23]. We further show that for any $\epsilon>0$, an independent set of weight at least $(1-\epsilon)\cdot\frac{w(V)}{2\beta+1}$ can be computed in $O(\log^3(\beta)\cdot\log(1/\epsilon)/\epsilon^2 \cdot\log n)$ rounds. This improves on a recent result of Gil [OPODIS '23], who showed that a $1/\lfloor(2+\epsilon)\beta\rfloor$-approximation to the MWIS problem can be computed in $O(\beta\cdot\log n)$ rounds. As an intermediate step, we design an algorithm to compute an independent set of total weight at least $(1-\epsilon)\cdot\sum_{v\in V}\frac{w(v)}{\deg(v)+1}$ in time $O(\log^3(\Delta)\cdot\log(1/\epsilon)/\epsilon + \log^* n)$, where $\Delta$ is the maximum degree of the graph.
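The intermediate target in the abstract, $\sum_{v\in V}\frac{w(v)}{\deg(v)+1}$, can be illustrated with a small centralized sketch. The code below is not the paper's distributed CONGEST algorithm. It is a classical sequential greedy rule (repeatedly take the vertex maximizing $w(v)/(\deg(v)+1)$ in the remaining graph), which to my understanding is known to reach this bound. The sketch only checks the bound on a toy graph rather than proving it, and all names are hypothetical.

```python
from typing import Dict, Hashable, Set

Graph = Dict[Hashable, Set[Hashable]]  # adjacency sets of an undirected graph


def degree_bound(adj: Graph, w: Dict[Hashable, float]) -> float:
    """The weighted bound sum_v w(v) / (deg(v) + 1) from the abstract."""
    return sum(w[v] / (len(adj[v]) + 1) for v in adj)


def greedy_mwis(adj: Graph, w: Dict[Hashable, float]) -> Set[Hashable]:
    """Sequential greedy: pick the best weight-to-degree vertex, delete its
    closed neighbourhood, repeat. Centralized stand-in, not a CONGEST algorithm."""
    remaining = {v: set(nbrs) for v, nbrs in adj.items()}
    chosen: Set[Hashable] = set()
    while remaining:
        v = max(remaining, key=lambda u: w[u] / (len(remaining[u]) + 1))
        chosen.add(v)
        removed = remaining[v] | {v}
        for u in removed:
            remaining.pop(u)
        for nbrs in remaining.values():
            nbrs -= removed  # keep degrees relative to the remaining graph
    return chosen


def is_independent(adj: Graph, nodes: Set[Hashable]) -> bool:
    """True if no two selected nodes share an edge."""
    return all(adj[v].isdisjoint(nodes) for v in nodes)


# Toy instance: a path 0-1-2-3-4 with heavy nodes 1 and 3.
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
weights = {0: 1.0, 1: 5.0, 2: 1.0, 3: 5.0, 4: 1.0}
solution = greedy_mwis(path, weights)
```

The difficulty the paper addresses is reaching a comparable guarantee deterministically in few communication rounds, where no node sees the whole graph. The sequential loop above sidesteps that entirely.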
Article 90
Title@2025-06-12 (4): Proteus: Enabling High-Performance Processing-Using-DRAM with Dynamic Bit-Precision, Adaptive Data Representation, and Flexible Arithmetic
Title: Proteus: Enabling High-Performance Processing-Using-DRAM with Dynamic Bit-Precision, Adaptive Data Representation, and Flexible Arithmetic | Proteus: Leistungsstarkes Processing-Using-DRAM mit dynamischer Bit-Präzision, adaptiver Datendarstellung und flexibler Arithmetik | Proteus: 具有动态比精确度、适应性数据表示和弹性亚光学的能动高性能处理-Using-DRAM 2501.17466v2 |
Authors (11): Geraldo F. Oliveira, Mayank Kabra, Yuxin Guo, Kangqi Chen, A. Giray Yağlıkçı, Melina Soysal, Mohammad Sadrosadati, Joaquin Olivares Bueno, Saugata Ghose, Juan Gómez-Luna, Onur Mutlu
Processing-using-DRAM (PUD) is a paradigm where the analog operational properties of DRAM are used to perform bulk logic operations. While PUD promises high throughput at low energy and area cost, we uncover three limitations of existing PUD approaches that lead to significant inefficiencies: (i) static data representation, i.e., two’s complement with fixed bit-precision, leading to unnecessary computation over useless (i.e., inconsequential) data; (ii) support for only throughput-oriented execution, where the high latency of individual PUD operations can only be hidden in the presence of bulk data-level parallelism; and (iii) high latency for high-precision (e.g., 32-bit) operations. To address these issues, we propose Proteus, the first hardware framework that addresses the high execution latency of bulk bitwise PUD operations by implementing a data-aware runtime engine for PUD. Proteus reduces the latency of PUD operations in three different ways: (i) Proteus dynamically reduces the bit-precision (and thus the latency and energy consumption) of PUD operations by exploiting narrow values (i.e., values with many leading zeros or ones); (ii) Proteus concurrently executes independent in-DRAM primitives belonging to a single PUD operation across multiple DRAM arrays; (iii) Proteus chooses and uses the most appropriate data representation and arithmetic algorithm implementation for a given PUD instruction transparently to the programmer.
利用DRAM(PUD)进行批量逻辑操作时使用DRAM(PUD)的模拟操作特性是一个范例,DRAM(PUD)的模拟操作特性被用来进行批量逻辑操作。虽然PUD承诺以低能量和地区成本提供高输送量,但我们发现现有的PUD方法有三种限制,导致显著效率低下:(一) 静态数据代表制,即用固定的比特精度来补充两套数据,导致对无用的(即无关联)数据进行不必要的计算;(二) 仅支持以吞吐量为主的执行,而单项PUD行动的高度延缓度只能隐藏在散装数据级平行操作中;(三) 高精确度(例如32位位) 高清晰度(例如32位) 操作。 为了解决这些问题,我们提议普罗特斯(Proteus)第一个硬件框架,通过为PUDUD安装一个有数据运行状态的运行时间引擎,解决PUD业务的延迟性,以三种不同方式降低PUD业务的延度:(二) 动态减少B-DROPROPL) 和最透明运行的运行,并适当使用。
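The "narrow values" idea in the Proteus abstract (values with many leading zeros or ones need fewer bits) can be sketched in a few lines. This is only an illustration of how a required two's-complement bit-precision could be computed for a batch of integers. It says nothing about how Proteus's hardware runtime detects or exploits it, and the function names are hypothetical.

```python
from typing import Iterable


def min_twos_complement_bits(x: int) -> int:
    """Smallest two's-complement width that can represent x.

    A non-negative x needs its magnitude bits plus one sign bit. A negative x
    needs the magnitude bits of ~x (which equals -x - 1) plus one sign bit.
    """
    return (x if x >= 0 else ~x).bit_length() + 1


def required_precision(values: Iterable[int]) -> int:
    """Bit-precision sufficient for every value in the batch."""
    return max(min_twos_complement_bits(v) for v in values)
```

For example, a batch holding only 3, -4, and 5 fits in 4 bits even if it is stored in 32-bit words. In a bit-serial setting, operating at that narrower width would skip work on the redundant leading bits, which is the latency and energy saving the abstract attributes to dynamic bit-precision.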
Article 91
Title@2025-06-12 (4): Towards Sustainable Computing: Exploring Energy Consumption Efficiency of Alternative Configurations and Workloads in an Open Source Messaging System
Title: Towards Sustainable Computing: Exploring Energy Consumption Efficiency of Alternative Configurations and Workloads in an Open Source Messaging System | Auf dem Weg zu nachhaltigem Rechnen: Energieeffizienz von alternativen Konfigurationen und Workloads in einem Open Source Messaging System untersuchen | 实现可持续计算:探索开放源码通信系统中替代配置和工作量的能源消耗效率 2506.10693v1 |
Authors (3): Maria Voreakou, George Kousiouris, Mara Nikolaidou
Energy consumption in current large scale computing infrastructures is becoming a critical issue, especially with the growing demand for centralized systems such as cloud environments. With the advancement of microservice architectures and the Internet of Things, messaging systems have become an integral and mainstream part of modern computing infrastructures, carrying out significant workload in a majority of applications. In this paper, we describe an experimental process to explore energy-based benchmarking for RabbitMQ, one of the main open source messaging frameworks. The involved system is described, as well as required components, and setup scenarios, involving different workloads and configurations among the tests as well as messaging system use cases. Alternative architectures are investigated and compared from an energy consumption point of view, for different message rates and consumer numbers. Differences in architectural selection have been quantified and can lead to up to 31% reduction in power consumption. The resulting dataset is made publicly available and can thus prove helpful for architectures' comparison, energy-based cost modeling, and beyond.
目前大规模计算机基础设施的能源消耗正在成为一个关键问题,特别是随着对云层环境等中央系统的需求日益增加,特别是随着对云层环境等中央系统的需求不断增加,信息通信系统已成为现代计算机基础设施的一个有机组成部分和主流部分,对大多数应用软件的工作量很大。本文描述了一个实验过程,以探索以能源为基础的拉比-Q基准基准,这是主要的开放源信息框架之一。描述了所涉系统以及所需的组成部分和设置设想,涉及测试和通信系统使用案例之间的不同工作量和配置。从能源消耗的角度,对不同信息率和消费者数字的替代结构进行了调查和比较。建筑选择的差异已经量化,并可能导致电力消耗减少31。由此形成的数据集可以公开提供,从而对建筑的比较、能源成本建模以及以后的建模很有帮助。
Article 92
Title@2025-06-12 (4): Fully Energy-Efficient Randomized Backoff: Slow Feedback Loops Yield Fast Contention Resolution
Title: Fully Energy-Efficient Randomized Backoff: Slow Feedback Loops Yield Fast Contention Resolution | Vollenergieeffizienter Randomized Backoff: Langsame Rückkopplungsschleifen liefern schnelle Streitbeilegung | 完全节能随机后退:慢速反馈循环 2302.07751v4 |
Authors (5): Michael A. Bender, Jeremy T. Fineman, Seth Gilbert, John Kuszmaul, Maxwell Young
Contention resolution addresses the problem of coordinating access to a shared channel. Time proceeds in slots, and a packet transmission can be made in any slot. A packet is successfully sent if no other packet is also transmitted during that slot. If two or more packets are sent in the same slot, then none of these transmissions succeed. Listening during a slot gives ternary feedback, indicating if that slot had (0) silence, (1) a successful transmission, or (2+) noise. No other feedback is available. Packets are (adversarially) injected into the system over time. A packet departs the system once it is successful. The goal is to send all packets while optimizing throughput, which is roughly the fraction of successful slots. Most prior algorithms with constant throughput require a short feedback loop, in the sense that a packet’s sending probability in slot t+1 is fully determined by its internal state at slot t and the channel feedback at slot t. An open question is whether these short feedback loops are necessary; that is, how often must listening and updating occur in order to achieve constant throughput? This question addresses energy efficiency, since both listening and sending consume significant energy. The channel can also suffer adversarial noise (“jamming”), which causes any listener to hear noise, even when no packets are sent. How does jamming affect our goal of long feedback loops/energy efficiency? Connecting these questions, we ask: what does a contention-resolution algorithm have to sacrifice to reduce channel accesses? Must we give up on constant throughput or robustness to noise? Here, we show that we need not concede anything. Suppose there are N packets and J jammed slots, where the input is determined by an adaptive adversary. We give an algorithm that, with high probability in N+J, has constant throughput and polylog(N+J) channel accesses per packet.
内容解析解决了对共享频道访问的协调问题 。 时间在空格中持续, 并且可以在任何空格中进行包传输 。 如果在空格中, 没有其它的包也会成功发送 。 如果两个或两个以上的包被在同一空格中发送, 那么这些传输就不会成功 。 在空格中监听会提供永恒的反馈, 表明空格是否有( 0) 沉默, (1) 成功传输, 或 ( 2+) 噪音 。 没有其他反馈 。 封会( 对抗性) 随着时间的推移被注入系统 。 包一旦成功, 就会退出系统 。 封会发送所有的包, 并且优化通量, 这大约是成功空格的一小部分 。 大多数具有恒定输量的算法都需要一个简短的回路圈回路, 也就是说, 一个在空档中发送的概率完全取决于它的内部状态, (1) 沉默, (1) 成功传输或频道反馈。 一个开放的问题是, 我们从这些短的回馈回路是需要什么; 在那里, 需要多少监听和更新 来实现不断的流流流流流流 ? 这个问题会降低 节流 节流, 。
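The channel model in the contention-resolution abstract (slotted time, ternary feedback of silence, success, or noise) is easy to simulate. The sketch below uses a simple capped exponential backoff in which a packet adapts only after its own collisions. It is an assumed baseline for illustration, not the paper's algorithm, and it models neither jamming nor listening costs. The function name, the window cap of 1024, and the returned keys are all hypothetical.

```python
import random
from typing import Dict


def simulate_backoff(n_packets: int, seed: int = 0,
                     max_slots: int = 1_000_000) -> Dict[str, float]:
    """Simulate n_packets (all injected at time 0) on a shared slotted channel."""
    rng = random.Random(seed)
    windows = [1] * n_packets          # packet i sends with probability 1/windows[i]
    active = set(range(n_packets))
    accesses = 0                       # total transmissions, a proxy for energy
    slots = 0
    while active and slots < max_slots:
        slots += 1
        senders = [i for i in active if rng.random() < 1.0 / windows[i]]
        accesses += len(senders)
        if len(senders) == 1:
            active.remove(senders[0])  # feedback 1: success, the packet departs
        elif len(senders) >= 2:
            for i in senders:          # feedback 2+: noise, colliders back off
                windows[i] = min(2 * windows[i], 1024)
        # no senders is feedback 0: silence, and nothing changes
    delivered = n_packets - len(active)
    return {
        "delivered": delivered,
        "slots": slots,
        "throughput": delivered / slots if slots else 0.0,
        "accesses_per_packet": accesses / n_packets if n_packets else 0.0,
    }
```

Running it for growing `n_packets` and comparing `throughput` with `accesses_per_packet` exposes the trade-off the abstract asks about: whether throughput can stay constant while each packet touches the channel only polylogarithmically often.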
Article 93
Title@2025-06-12 (4): Deployment of Containerized Simulations in an API-Driven Distributed Infrastructure
Title: Deployment of Containerized Simulations in an API-Driven Distributed Infrastructure | Bereitstellung von containerisierten Simulationen in einer API-getriebenen verteilten Infrastruktur | 在API-驱动分配基础设施中部署集装箱化模拟设备 2506.10642v1 |
Authors (3): Tim Kraus, Axel Sauer, Ingo Feldner
The increasingly dynamic market for embedded systems makes virtual prototypes an indispensable tool for hardware/software codesign. The broad acceptance of the methodology has led to a diverse range of solutions: from open-source, pure console-based simulators to highly capable commercial simulation tools. In this work we present SUNRISE, an infrastructure to provide users a unified approach to utilizing virtual prototyping solutions, facilitate access to various simulation technologies and boost cooperation by leveraging decentralized compute resources for deployment of simulation workloads and definition of open APIs.
嵌入系统日益活跃的市场使虚拟原型成为硬件/软件编码的不可或缺的工具。该方法得到广泛接受,导致了一系列不同的解决办法:从开放源码、纯控制台模拟器到高能力商业模拟工具。在这项工作中,我们介绍了SUNRISE, 这是一种基础设施,为用户提供使用虚拟原型解决方案的统一方法,便利人们获得各种模拟技术,并通过利用分散计算资源来部署模拟工作量和界定开放式API,促进合作。
Article 94
Title@2025-06-12 (4): Graph-based Gossiping for Communication Efficiency in Decentralized Federated Learning
Title: Graph-based Gossiping for Communication Efficiency in Decentralized Federated Learning | Graphbasiertes Gossiping für Kommunikationseffizienz im dezentralisierten Föderierten Lernen | 以图表为基础的分散式联邦学习传播效率Gossiping 2506.10607v1 |
Authors (5): Huong Nguyen, Hong-Tri Nguyen, Praveen Kumar Donta, Susanna Pirttikangas, Lauri Lovén
Federated learning has emerged as a privacy-preserving technique for collaborative model training across heterogeneously distributed silos. Yet its reliance on a single central server introduces potential bottlenecks and the risk of a single point of failure. Decentralizing the server, often referred to as decentralized learning, addresses this problem by distributing the server role across nodes within the network. One drawback of pure decentralization is that it introduces communication inefficiencies, which arise from increased message exchanges in large-scale setups. Moreover, existing solutions often fail to simulate a real-world distributed and decentralized environment in their experiments, leading to unreliable performance evaluations and limited applicability in practice. To address this gap in prior work, we investigate the correlation between model size and network latency, a critical factor in optimizing decentralized learning communication. We propose a graph-based gossiping mechanism in which a minimum spanning tree and graph coloring are used to optimize network structure and scheduling for efficient communication across various network topologies and message capacities. Our approach configures and manages subnetworks on real physical routers and devices and closely models real-world distributed setups. Experimental results demonstrate that our method significantly improves communication and is compatible with different topologies and data sizes, reducing bandwidth and transfer time by up to approximately 8 and 4.4 times, respectively, compared to naive flooding broadcast methods.
联邦学习已成为在分布各式各式各样的发射井进行合作模式培训的一种隐私保护技术。然而,对单一中央服务器的依赖导致潜在的瓶颈和单一点失败的风险。将服务器(通常称为分散式学习)分散化,通过在网络内各节点之间分配服务器角色来解决这个问题。这种纯粹的分散化的缺点之一是通信效率低下,这是大规模设置中增加的信息交流导致的通信效率低下。然而,现有的拟议解决方案往往无法在其实验中模拟真实世界分布和分散式环境,导致业绩评估不可靠,实际应用性有限。这项工作认识到先前工作的缺乏,因此调查了模型规模和网络延缓度之间的相互关系,这是优化分散式学习通信的一个关键因素。我们提议了一个基于图表的流言机制,具体地说,在这种机制中,最小的覆盖树木和图表的颜色用于优化网络结构以及各种网络地形和信息能力之间高效通信的时间安排。我们的方法在实际物理路由和装置上配置和管理子网络,并密切模拟真实世界分布的设置。实验结果表明,我们的方法大大改进通信,大大改进了时间,与不同高层广播的频率和天平时段之间的频率,分别缩小。
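A rough sketch of how a minimum spanning tree and graph coloring can combine into a gossip schedule follows. It assumes, since the abstract does not say, that link weights are latencies and that the coloring is applied to tree edges so that each color is one communication round in which no node takes part in two transfers. The function names and the toy topology are hypothetical.

```python
def minimum_spanning_tree(n, edges):
    """Kruskal's algorithm. edges: list of (weight, u, v) over nodes 0..n-1."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:                        # edge joins two components
            parent[ru] = rv
            tree.append((w, u, v))
    return tree

def color_edges(tree):
    """Greedy edge coloring: edges that share a node get different rounds."""
    used = {}                               # node -> set of rounds already taken
    schedule = {}
    for _, u, v in tree:
        busy = used.setdefault(u, set()) | used.setdefault(v, set())
        c = 0
        while c in busy:
            c += 1
        schedule[(u, v)] = c
        used[u].add(c)
        used[v].add(c)
    return schedule

# Toy 4-node topology; weights stand in for measured link latency.
links = [(1, 0, 1), (4, 0, 2), (2, 1, 2), (5, 1, 3), (3, 2, 3), (6, 0, 3)]
tree = minimum_spanning_tree(4, links)
rounds = color_edges(tree)
print(tree)
print(rounds)
```

Restricting model exchange to tree edges avoids the duplicate deliveries of naive flooding, and the coloring groups transfers that can safely run in parallel.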
Article 95
Title@2025-06-12 (4): Model Discovery and Graph Simulation: A Lightweight Alternative to Chaos Engineering
Title: Model Discovery and Graph Simulation: A Lightweight Alternative to Chaos Engineering | Modellentdeckung und Graphensimulation: Eine leichte Alternative zur Chaos-Engineering | 模型发现和图示模拟:解决混乱工程的轻量替代方法 2506.11176v1 |
Authors (2): Anatoly A. Krasnovsky, Alexander Zorkin
Microservice applications are prone to cascading failures because of dense inter-service dependencies. Ensuring resilience usually demands fault-injection experiments in production-like setups. We propose model discovery – an automated CI/CD step that extracts a live dependency graph from trace data – and show that this lightweight representation is sufficient for accurate resilience prediction. Using the DeathStarBench Social Network, we build the graph, simulate failures via Monte Carlo, and run matching chaos experiments on the real system. The graph model closely matches reality: with no replication, 16 trials yield an observed resilience of 0.186 versus a predicted 0.161; with replication, both observed and predicted values converge to 0.305 (mean absolute error ≤ 0.0004). These results indicate that even a simple, automatically discovered graph can estimate microservice availability with high fidelity, offering rapid design-time insight without full-scale failure testing.
微服务应用程序容易因大量服务间依赖性而出现连续故障。 确保复原力通常要求在类似生产型的设置中进行错误注射实验。 我们提议 \ textit{ 模范发现} – – 自动的 CI/ CD 步骤,从跟踪数据中提取活的依赖性图表 – – 并显示这种轻量表示方式足以进行准确的复原力预测。 我们使用死亡之星堡社会网络构建这个图,通过蒙特卡罗模拟失败,并在实际系统中进行匹配的混乱实验。 图形模型非常符合现实: 没有复制, 16项试验产生的观察到的抗御力为0.186, 而预测的为0. 161; 复制, 观察到的和预测的数值都接近 0.305 (平均绝对错误\leq 0004)。 这些结果表明,即使简单、 自动发现的图表也能以高度忠诚的方式估计微观服务的可用性, 提供快速的设计- 时间洞察, 而不进行全面的故障测试。
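The graph-plus-Monte-Carlo idea can be illustrated with a small sketch. It assumes a simple failure model that the abstract does not spell out: every replica fails independently with probability `p_fail`, a service is up if at least one replica is up, and a request succeeds only if the entry service and all of its transitive dependencies are up. The service names and the graph are hypothetical, not taken from DeathStarBench.

```python
import random

def reachable(deps, root):
    """All services a request entering at root depends on, including root."""
    seen, stack = set(), [root]
    while stack:
        svc = stack.pop()
        if svc not in seen:
            seen.add(svc)
            stack.extend(deps.get(svc, []))
    return seen

def estimate_resilience(deps, root, p_fail, replicas=1, trials=10_000, seed=0):
    """Fraction of sampled failure scenarios in which the request still succeeds."""
    rng = random.Random(seed)
    needed = sorted(reachable(deps, root))
    ok = 0
    for _ in range(trials):
        # A service is up if at least one of its replicas survives.
        up = all(
            any(rng.random() >= p_fail for _r in range(replicas))
            for _svc in needed
        )
        ok += up
    return ok / trials

# Hypothetical dependency graph, as might be extracted from trace data.
deps = {
    "frontend": ["compose", "timeline"],
    "compose": ["storage"],
    "timeline": ["storage"],
    "storage": [],
}
print(estimate_resilience(deps, "frontend", p_fail=0.1))
print(estimate_resilience(deps, "frontend", p_fail=0.1, replicas=2))
```

Under this model the exact answer is (1 - p_fail^replicas)^k for k required services, so the estimate can be checked against a closed form before it is compared with chaos experiments on a real system.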
Article 96
Title@2025-06-12 (4): 6G Infrastructures for Edge AI: An Analytical Perspective
Title: 6G Infrastructures for Edge AI: An Analytical Perspective | 6G-Infrastrukturen für Edge AI: Eine analytische Perspektive | 6G 供异地边缘使用的基础设施:分析角度 2506.10570v1 |
Authors (6): Kurt Horvath, Shpresa Tuda, Blerta Idrizi, Stojan Kitanov, Fisnik Doko, Dragi Kimovski
The convergence of Artificial Intelligence (AI) and the Internet of Things has accelerated the development of distributed, network-sensitive applications, necessitating ultra-low latency, high throughput, and real-time processing capabilities. While 5G networks represent a significant technological milestone, their ability to support AI-driven edge applications remains constrained by performance gaps observed in real-world deployments. This paper addresses these limitations and highlights critical advancements needed to realize a robust and scalable 6G ecosystem optimized for AI applications. Furthermore, we conduct an empirical evaluation of 5G network infrastructure in central Europe, with latency measurements ranging from 61 ms to 110 ms across different close geographical areas. These values exceed the requirements of latency-critical AI applications by approximately 270%, revealing significant shortcomings in current deployments. Building on these findings, we propose a set of recommendations to bridge the gap between existing 5G performance and the requirements of next-generation AI applications.
人工智能(AI)与物联网加速了分布式、网络敏感、需要超低延迟、高吞吐量和实时处理能力的网络的开发。虽然5G网络是一个重要的技术里程碑,但它们支持AI驱动的边缘应用的能力仍然受到现实世界部署中观察到的绩效差距的制约。本文件讨论了这些局限性,并突出强调了实现为AI应用优化的强大和可扩展的6G生态系统所需的重要进展。此外,我们还对中欧5G网络基础设施进行了实证性评估,对不同近地理区域的5G网络基础设施进行了61米至110米的悬浮度测量,这些值超过对耐久临界的AI应用的要求约270 %,揭示了当前部署的重大缺陷。我们根据这些调查结果提出了一套建议,以弥合5G现有绩效与下一代AI应用要求之间的差距。
Article 97
Title@2025-06-12 (4): GPU-Accelerated Distributed QAOA on Large-scale HPC Ecosystems
Title: GPU-Accelerated Distributed QAOA on Large-scale HPC Ecosystems | GPU-beschleunigte verteilte QAOA auf großflächige HPC-Ökosysteme | GPU-加速加速的大型高氯苯生态系统分布式QAOA 2506.10531v1 |
Authors (9): Zhihao Xu, Srikar Chundury, Seongmin Kim, Amir Shehata, Xinyi Li, Ang Li, Tengfei Luo, Frank Mueller, In-Saeng Suh
Quantum computing holds great potential to accelerate the process of solving complex combinatorial optimization problems. The Distributed Quantum Approximate Optimization Algorithm (DQAOA) addresses high-dimensional, dense problems using current quantum computing techniques and high-performance computing (HPC) systems. In this work, we improve the scalability and efficiency of DQAOA through advanced problem decomposition and parallel execution using message passing on the Frontier CPU/GPU supercomputer. Our approach ensures efficient quantum-classical workload management by distributing large problem instances across classical and quantum resources. Experimental results demonstrate that enhanced decomposition strategies and GPU-accelerated quantum simulations significantly improve DQAOA’s performance, achieving up to 10x speedup over CPU-based simulations. This advancement enables better scalability for large problem instances, supporting the practical deployment of GPU systems for hybrid quantum-classical applications. We also highlight ongoing integration efforts using the Quantum Framework (QFw) to support future HPC-quantum computing systems.
量子计算具有加速解决复杂组合优化问题的巨大潜力。 分布的量子流近优化算法( DQAOA) 解决了使用当前量子计算技术和高性能计算系统( HPC) 的高维、密集的问题。 在这项工作中,我们通过在Front CPU/GPU超级计算机上传递信息来提高DQAOA的可缩放性和效率。 我们的方法通过在古典和量子资源中分配大问题,确保了高效的量子级工作量管理。 实验结果显示,强化的分解战略和GPU加速量子模拟大大提高了DQAOAA的性能,在基于CPU的模拟中达到多达10x加速速度。 这一进步使大型问题案例的可缩放性得到改善,支持在混合量子级应用中实际部署 GPU系统。 我们还强调了目前利用量子框架( QFww) 进行的整合努力,以支持未来的HPC QQQPC QEN计算系统。
Article 98
Title@2025-06-12 (4): HP2C-DT: High-Precision High-Performance Computer-enabled Digital Twin
Title: HP2C-DT: High-Precision High-Performance Computer-enabled Digital Twin | HP2C-DT: High-Precision High-Performance-Computer-fähiger Digital Twin | HP2C-DT:高精确度高绩效计算机化数字双双 2506.10523v1 |
Authors (6): E. Iraola, M. García-Lorenzo, F. Lordan-Gomis, F. Rossi, E. Prieto-Araujo, R. M. Badia
Digital twins are transforming the way we monitor, analyze, and control physical systems, but designing architectures that balance real-time responsiveness with heavy computational demands remains a challenge. Cloud-based solutions often struggle with latency and resource constraints, while edge-based approaches lack the processing power for complex simulations and data-driven optimizations. To address this problem, we propose the High-Precision High-Performance Computer-enabled Digital Twin (HP2C-DT) reference architecture, which integrates High-Performance Computing (HPC) into the computing continuum. Unlike traditional setups that use HPC only for offline simulations, HP2C-DT makes it an active part of digital twin workflows, dynamically assigning tasks to edge, cloud, or HPC resources based on urgency and computational needs. Furthermore, to bridge the gap between theory and practice, we introduce the HP2C-DT framework, a working implementation that uses COMPSs for seamless workload distribution across diverse infrastructures. We test it in a power grid use case, showing how it reduces communication bandwidth by an order of magnitude through edge-side data aggregation, improves response times by up to 2x via dynamic offloading, and maintains near-ideal strong scaling for compute-intensive workflows across a practical range of resources. These results demonstrate how an HPC-driven approach can push digital twins beyond their current limitations, making them smarter, faster, and more capable of handling real-world complexity.
数字双胞胎正在改变我们监测、分析和控制物理系统的方式,但设计使实时反应与大量计算需求平衡的架构仍是一项挑战。基于云型解决方案往往与隐性和资源限制相抗衡。基于云型的解决方案往往与悬浮和资源限制相抗衡,而基于边缘的方法缺乏复杂模拟和数据驱动优化的处理能力。为了解决这一问题,我们建议采用高精度高性能计算机化高性能计算机驱动数字双胞胎(HP2C-DT)参考架构,将高性能计算机(HPC-DT)整合到计算机连续运行中。不同于传统设置,即仅使用高性能电子计算机(HPC)进行离线模拟的传统设置,HPC-DT使它成为数字双工作流程的积极部分,根据紧迫性和计算需求动态向边缘、云型或高电联资源分配任务。此外,为了弥合理论和实践之间的差距,我们引入了HP2C-DT框架,一个运行运行中使用COMS系统在各种基础设施中进行无缝的工作分配。我们用电网进行测试,显示它如何通过边缘端级的顺序减少通信带带宽度,通过接近边缘级数据同步数据同步数据组合,通过高度的深度数据组合提升提高,通过动态驱动,通过动态的动态驱动提升提升提升和快速度提升,使这些动态的工作流程在两个时间显示其动态的动态的动态驱动力,从而显示其动态性地压流流流向上进行。
Article 99
Title@2025-06-12 (4): Understanding the Performance and Power of LLM Inferencing on Edge Accelerators
Title: Understanding the Performance and Power of LLM Inferencing on Edge Accelerators | Die Leistung und Leistung von LLM-Inferenzen auf Edge-Beschleuniger verstehen | 了解LLM LLM对边缘加速器的推论的性能和功率 2506.09554v2 |
Authors (2): Mayank Arya, Yogesh Simmhan
Large Language Models (LLMs) have demonstrated exceptional benefits across a wide range of domains, for tasks as diverse as code generation and robot navigation. While LLMs are usually served from cloud data centers, mission-critical and privacy-sensitive applications may require local hosting of open LLM models. Given the large GPU memory footprint needed for LLMs, edge accelerators such as the NVIDIA Jetson Orin AGX with 64GB of shared GPU-CPU RAM are a compelling choice. However, the feasibility and performance of LLM inference on edge accelerators are under-explored. This study presents a detailed evaluation of LLM inference on the NVIDIA Jetson Orin AGX, using four SOTA models ranging from 2.7B to 32.8B parameters, such as Meta Llama3.1, Microsoft-Phi2, and Deepseek-R1-Qwen. We investigate the impact of varying batch sizes, sequence lengths, and quantization levels on latency, throughput, and perplexity, and also explore various custom power modes on the Orin AGX to perform power and energy consumption analysis. Our findings offer interesting insights on the trade-offs between efficiency, inference speed and resource use; e.g., increasing the sequence length causes a decrease in token throughput, and quantization causes smaller LLMs to be slower. These results can help optimize LLM serving on edge accelerators for practical applications.
大型语言模型(LLMS)为多种领域展示了特殊的好处,包括代码生成和机器人导航等多种任务。虽然LLMS通常由云型数据中心提供,但任务关键和隐私敏感应用可能需要对开放式LMM模型进行本地托管。鉴于LLM公司需要巨大的GPU记忆足迹,Nvidia Jetson Orin AgX和64GB共享GPU-CPU RAM的边缘加速器(64GB)是一个令人信服的选择。然而,LLLM在边缘加速器上的实际推推力的可行性和性能表现没有得到充分的探讨。本研究报告详细评估了LLMM在NVIDIA Jetson Orin AGX上对NVDIA Jetson Orin AGX的推断,这四种SOTA模型从2.7B到32.8B参数的本地托管。鉴于Llama3.1、微软-Phi2、深海-R1-Qwen。我们调查了不同批量的批量尺寸尺寸、序列长度和四级递解度水平水平对LIMS的精度应用的影响。我们探索AGX对更小的精度的精度的精度应用的精度和精度应用的精度进行了深入分析。我们在提高的精度和精度和精度分析,在提高的精度和精度上提供了对精度的精度的精度的精度的精度的精度和精度的精度和精度的精度和精度的精度和精度的精度分析。我们度的精度的精度的精度和精度和精度的精度和精度的精度的精度的精度的精度的精度的精度分析。
Article 100
Title@2025-06-12 (4): TD-Pipe: Temporally-Disaggregated Pipeline Parallelism Architecture for High-Throughput LLM Inference
Title: TD-Pipe: Temporally-Disaggregated Pipeline Parallelism Architecture for High-Throughput LLM Inference | TD-Pipe: Vorübergehend disaggregierte Pipeline-Parallelismus-Architektur für High-Throughput-LLM-Inferenz | TD-Pipe:高干压压压下LLM推论的热分解管道平行结构 2506.10470v1 |
Authors (6): Hongbin Zhang, Taosheng Wei, Zhenyi Zheng, Jiangsu Du, Zhiguang Chen, Yutong Lu
As the model size continuously increases, pipeline parallelism shows great promise in throughput-oriented LLM inference due to its low demand on communications. However, imbalanced pipeline workloads and complex data dependencies in the prefill and decode phases result in massive pipeline bubbles and, in turn, severe performance reduction. To better exploit pipeline parallelism for high-throughput LLM inference, we propose TD-Pipe, whose key idea is a temporally-disaggregated pipeline parallelism architecture. Specifically, this architecture disaggregates the prefill and decode phases in the temporal dimension, so as to eliminate pipeline bubbles caused by phase switching. TD-Pipe identifies potential issues in exploiting the novel architecture and provides solutions. First, a hierarchy-controller structure is used to better coordinate devices in pipeline parallelism by decoupling scheduling from execution. Second, the AI-based greedy prefill approach aggressively performs more prefills by predicting the output length and simulating the memory usage. Third, the inter-batch work stealing approach dynamically balances decode phase workloads between different batches to reduce bubbles. Fourth, the spatial-temporal intensity comparison approach determines the optimal switch from decode to prefill by comparing the performance drop from reduced computational intensity with that from phase switching bubbles. Extensive experiments show that TD-Pipe effectively increases the throughput of LLM inference by up to 1.91x over the existing tensor parallel approach and 2.73x over the existing pipeline parallel approach on GPU nodes with only PCIe interconnection.
随着模型规模的不断扩大,管道平行关系表明,由于对通信的需求低,以吞吐为主的LLM测谎极有可能在吞吐为主的LLM测谎中表现出巨大的希望。然而,由于管道工作量不平衡,而且预装和解码阶段的数据依赖性复杂,导致管道泡沫的泡沫和进一步大幅降低性能。为了更好地利用管道平行关系,我们提议TD-Pipe, 关键理念在于时间分解的管道平行结构。具体地说,这一结构分解了时间层面的预填和解码阶段,从而消除了阶段转换造成的管道泡沫。 TD-Pipe查明了利用新结构并提供解决方案的潜在问题。 首先,使用一个等级控制器结构来更好地协调管道平行状态的装置,将高通量 LLMMM 推理算与执行的时间安排脱钩。 其次,基于AI的贪婪预补方法通过预测产出长度和模拟记忆使用来积极进行预补。 第三,批量法方法仅以动态平衡方式将不同部门之间的阶段工作量分解到降低气泡。 Forloder-deal-comalalalalalalalalalalal 比较了从现有的升级到从标准的升级到从目前的深度的深度分析,从而将标准升级升级到从目前的递化程度的升级到从目前的递化的深度分析,从而决定了从目前的递化的递化的递化程度,从而将缩小了从目前的递化的递化的递化的递化的递化的递化程度。
Article 101
Title@2025-06-12 (4): Automating Multi-Tenancy Performance Evaluation on Edge Compute Nodes
Title: Automating Multi-Tenancy Performance Evaluation on Edge Compute Nodes | Automatisieren von Multi-Tenancy-Performance-Evaluierung auf Edge Compute Nodes | 将多层计算节点的多层业绩评价自动化 2506.10461v1 |
Authors (4): Joanna Georgiou, Moysis Symeonides, George Pallis, Marios D. Dikaiakos
Edge Computing emerges as a promising alternative to Cloud Computing, with scalable compute resources and services deployed in the path between IoT devices and the Cloud. Since virtualization techniques can be applied on Edge compute nodes, administrators can share their Edge infrastructures among multiple users, providing so-called multi-tenancy. Even though multi-tenancy is unavoidable, it raises concerns about security and performance degradation due to resource contention in Edge Computing. To mitigate this, administrators need to deploy services with non-antagonizing profiles and explore workload co-location scenarios to improve performance and energy consumption. Achieving this, however, requires extensive configuration, deployment, iterative testing, and analysis, an effort-intensive and time-consuming process. To address this challenge, we introduce an auto-benchmarking framework designed to streamline the analysis of multi-tenancy performance in Edge environments. Our framework includes a built-in monitoring stack and integrates with widely used benchmarking workloads, such as streaming analytics, database operations, machine learning applications, and component-based stress testing. We perform a case-driven analysis and provide valuable insights into the impact of multi-tenancy on Edge environments with different hardware configurations and diverse workloads. Finally, the implementation of our framework, along with the containerized workloads used for experimentation, is publicly available.
电磁计算作为云形计算的一种有希望的替代方案,在IoT装置和云体之间的路径上部署可缩放的计算资源和服务。由于虚拟化技术可以应用到 Edge 计算节点上,管理员可以在多个用户中共享其边缘基础设施,提供所谓的多重租赁。即使多重租赁是不可避免的,它也引起了对安全和性能退化的关切,因为电磁计算中的资源争议导致的安全和性能退化。为此,管理员需要使用非自动配置配置的服务,并探索工作量的合用情景,以提高性能和能源消耗。然而,要实现这一目标,需要广泛的配置、部署、迭代测试和分析,这是一个耗费大量精力和耗时的过程。为了应对这一挑战,我们引入一个自动标定框架,旨在简化对Edge环境中多重租赁性绩效的分析。我们的框架包括一个内置式监测堆,并与广泛使用的基准工作量相结合,例如流传分析、数据库操作、机器学习应用和基于组件的压力测试。我们根据案例分析进行一项有价值的分析,并对多种硬件和耗时费过程的影响提供有价值的洞察。我们所利用的硬件和实验在不同的环境上的最后配置。
Article 102
Title@2025-06-12 (4): Multi-dimensional Autoscaling of Processing Services: A Comparison of Agent-based Methods
Title: Multi-dimensional Autoscaling of Processing Services: A Comparison of Agent-based Methods | Mehrdimensionale Autoskalierung von Verarbeitungsdienstleistungen: Ein Vergleich von agentenbasierten Methoden | 处理服务多维多维自动升级:以代理为基础的方法比较 2506.10420v1 |
Authors (5): Boris Sedlak, Alireza Furutanpey, Zihang Wang, Víctor Casamayor Pujol, Schahram Dustdar
Edge computing breaks with traditional autoscaling due to strict resource constraints, thus motivating more flexible scaling behaviors using multiple elasticity dimensions. This work introduces an agent-based autoscaling framework that dynamically adjusts both hardware resources and internal service configurations to maximize requirements fulfillment in constrained environments. We compare four types of scaling agents: Active Inference, Deep Q Network, Analysis of Structural Knowledge, and Deep Active Inference, using two real-world processing services running in parallel: YOLOv8 for visual recognition and OpenCV for QR code detection. Results show that all agents achieve acceptable SLO performance with varying convergence patterns. While the Deep Q Network benefits from pre-training, the structural analysis converges quickly, and the deep active inference agent combines theoretical foundations with practical scalability advantages. Our findings provide evidence for the viability of multi-dimensional agent-based autoscaling for edge environments and encourage future work in this research direction.
由于严格的资源限制,计算断层与传统的自动计算断裂,因此,利用多种弹性维度鼓励更灵活的缩放行为。这项工作引入了一个基于代理的自动缩放框架,对硬件资源和内部服务配置进行动态调整,以最大限度地满足受限制环境中的要求。我们比较了四种类型的缩放剂:主动推论、深Q网络、结构知识分析和深活性推理,同时运行两个真实世界的处理服务:用于视觉识别的YOLOv8和用于QR代码检测的 OpenCV。结果显示,所有代理都实现了可接受的 SLO 性能,并有各种不同的趋同模式。虽然深Q网络从培训前获益,但结构分析迅速汇合,深活性推理剂将理论基础与实用的可伸缩性优势结合起来。我们的调查结果为边缘环境基于多维体的自动调整的可行性提供了证据,并鼓励今后在这一研究方向开展工作。
Article 103
Title@2025-06-12 (4): Federated Learning within Global Energy Budget over Heterogeneous Edge Accelerators
Title: Federated Learning within Global Energy Budget over Heterogeneous Edge Accelerators | Föderiertes Lernen im globalen Energiebudget über Heterogene Edge-Beschleuniger | 全球能源预算内关于异异异系边缘加速器的联邦学习 2506.10413v1 |
Authors (4): Roopkatha Banerjee, Tejus Chandrashekar, Ananth Eswar, Yogesh Simmhan
Federated Learning (FL) enables collaborative model training across distributed clients while preserving data privacy. However, optimizing both energy efficiency and model accuracy remains a challenge, given device and data heterogeneity. Further, sustainable AI through a global energy budget for FL has not been explored. We propose a novel optimization problem for client selection in FL that maximizes the model accuracy within an overall energy limit and reduces training time. We solve it efficiently with a unique bi-level ILP formulation that leverages approximate Shapley values and energy-time prediction models. Our FedJoule framework achieves superior training accuracies compared to SOTA and simple baselines for diverse energy budgets, non-IID distributions, and realistic experiment configurations, performing 15% and 48% better on accuracy and time, respectively. The results highlight the effectiveness of our method in achieving a viable trade-off between energy usage and performance in FL environments.
联邦学习联合会(FL)在保护数据隐私的同时,为分布式客户提供合作模式培训,使分布式客户能够进行合作模式培训,同时保护数据隐私。然而,考虑到装置和数据差异性,优化能源效率和模型准确性仍是一项挑战。此外,尚未探讨通过FL全球能源预算实现可持续的AI。我们提议在FL选择客户时出现一个新的优化问题,以便在总的能源限度内最大限度地提高模型准确性并减少培训时间。我们用独特的双级ILP公式解决这个问题,利用大约的损耗值和能源时间预测模型来有效解决这一问题。我们的FedJoule框架实现了与SOTA相比的高级培训便利度,以及不同能源预算、非IID分布和现实实验配置的简单基线,在准确性和时间上分别提高了15%和48%。结果突出表明了我们的方法在FL环境中实现能源使用与性能之间可行的平衡的有效性。
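To make the client-selection problem concrete, the sketch below treats one round as a 0/1 knapsack: pick the subset of clients with the highest total utility whose predicted energy stays within a budget. This is a deliberately simplified stand-in and not the paper's bi-level ILP. The utility values only play the role of approximate Shapley values, the integer energy costs play the role of the energy-prediction model's output, and every number and name is hypothetical.

```python
def select_clients(utility, energy, budget):
    """0/1 knapsack by dynamic programming.

    utility[i]: estimated contribution of client i (assumed given)
    energy[i]:  predicted integer energy cost of client i
    budget:     integer energy limit for the round
    Returns (best_total_utility, sorted list of chosen client indices).
    """
    n = len(utility)
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]                  # skip client i-1
            if energy[i - 1] <= b:
                cand = best[i - 1][b - energy[i - 1]] + utility[i - 1]
                if cand > best[i][b]:
                    best[i][b] = cand                    # take client i-1
    # Walk back through the table to recover the chosen set.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= energy[i - 1]
    return best[n][budget], sorted(chosen)

utility = [0.30, 0.25, 0.20, 0.12]
energy = [5, 4, 3, 2]
total, chosen = select_clients(utility, energy, budget=9)
print(chosen, round(total, 2))
```

In this toy instance the single most useful client is left out, because three cheaper clients together contribute more within the same budget. A global budget across many rounds, as in the abstract, adds a second level of decisions on top of this per-round choice.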
Article 104
Title@2025-06-12 (4): HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration
Title: HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration | HPCTransCompile: Ein KI-Compiler-generierter Datensatz für Hochleistungs-CUDA-Transpilation und LLM-Voruntersuchung | HPC Transtranscompility: AI CUDA 高性能 CUDA 转换和 LLM 初步探索的人工智能汇编器生成数据集 2506.10401v1 |
Authors (6): Jiaqi Lv, Xufeng He, Yanchen Liu, Xu Dai, Yang Hu, Shouyi Yin
The rapid growth of deep learning has driven exponential increases in model parameters and computational demands. NVIDIA GPUs and their CUDA-based software ecosystem provide robust support for parallel computing, significantly alleviating computational bottlenecks. Meanwhile, owing to entrenched user programming habits and the high performance of GPUs, the CUDA ecosystem has established a dominant position in the field of parallel software. This dominance requires other hardware platforms to support CUDA-based software with performance portability. However, translating CUDA code to other platforms poses significant challenges due to differences in parallel programming paradigms and hardware architectures. Existing approaches rely on language extensions, domain-specific languages (DSLs), or compilers but face limitations in workload coverage and generalizability. Moreover, these methods often incur substantial development costs. Recently, LLMs have demonstrated extraordinary potential in various vertical domains, especially in code-related tasks. However, the performance of existing LLMs in CUDA transpilation, particularly for high-performance code, remains suboptimal. The main reason for this limitation is the lack of high-quality training datasets. To address these challenges, we propose a novel framework for generating pairs of high-performance CUDA code and corresponding platform code, leveraging AI compilers and automatic optimization technology. We further enhance the framework with a graph-based data augmentation method and introduce HPCTransEval, a benchmark for evaluating LLM performance on CUDA transpilation. We conduct experiments using CUDA-to-CPU transpilation as a case study on leading LLMs. The results demonstrate that our framework significantly improves CUDA transpilation, highlighting the potential of LLMs to address compatibility challenges within the CUDA ecosystem.
NVIDIA GPU及其基于CUDA的软件生态系统为平行计算提供了强有力的支持,并大大缓解了计算瓶颈。与此同时,由于用户编程习惯的培养以及GPU的高性能,CUDA生态系统在平行软件领域建立了主导地位。这一主导地位要求其他硬件平台支持基于CUDA的可移植软件。然而,将CUDA代码转换到其他平台,由于平行编程模式和硬件结构的差异而构成重大挑战。现有方法依赖于语言扩展、特定域语言(DSLs)或汇编者,但面临着工作量覆盖和可概括性方面的限制。此外,这些方法往往带来巨大的发展成本。最近,LLMMS在各种纵向领域,特别是在与代码有关的任务方面展现了超强的潜力。然而,CUDA的现有LMS的性能平台,特别是高性能代码,仍然不那么,这种局限性的主要原因是缺乏高质量的培训数据集。为了应对这些挑战,我们提出了CUDUDA对CUD的高级性能评估框架,我们提议了一个用于CUDALA的高级性、高性能评估,我们CUDA的CRLUDLUD。我们为CA的升级的升级的高级化数据框架,我们提出一个高性工具,我们为CUDA的高级性能和高性能的CULUD。
Article 105
Title@2025-06-12 (4): Bug Classification in Quantum Software: A Rule-Based Framework and Its Evaluation
Title: Bug Classification in Quantum Software: A Rule-Based Framework and Its Evaluation | Fehlerklassifizierung in der Quantensoftware: Ein regelbasiertes Framework und seine Bewertung | 量子软件中的臭虫分类:基于规则的框架及其评价 2506.10397v1 |
Authors (2): Mir Mohammad Yousuf, Shabir Ahmad Sofi
Accurate classification of software bugs is essential for improving software quality. This paper presents a rule-based automated framework for classifying issues in quantum software repositories by bug type, category, severity, and impacted quality attributes, with additional focus on quantum-specific bug types. The framework applies keyword and heuristic-based techniques tailored to quantum computing. To assess its reliability, we manually classified a stratified sample of 4,984 issues from a dataset of 12,910 issues across 36 Qiskit repositories. Automated classifications were compared with ground truth using accuracy, precision, recall, and F1-score. The framework achieved up to 85.21% accuracy, with F1-scores ranging from 0.7075 (severity) to 0.8393 (quality attribute). Statistical validation via paired t-tests and Cohen’s Kappa showed substantial to almost perfect agreement for bug type (κ = 0.696), category (κ = 0.826), quality attribute (κ = 0.818), and quantum-specific bug type (κ = 0.712). Severity classification showed slight agreement (κ = 0.162), suggesting room for improvement. Large-scale analysis revealed that classical bugs dominate (67.2%), with quantum-specific bugs at 27.3%. Frequent bug categories included compatibility, functional, and quantum-specific defects, while usability, maintainability, and interoperability were the most impacted quality attributes. Most issues (93.7%) were low severity; only 4.3% were critical. A detailed review of 1,550 quantum-specific bugs showed that over half involved quantum circuit-level problems, followed by gate errors and hardware-related issues.
精确的软件错误分类对于提高软件质量至关重要。 本文提供了一个基于规则的自动自动框架, 用于按错误类型、 类别、 严重程度和受影响的质量属性对量子软件库中的问题进行分类, 并额外侧重于量子型错误类型。 框架应用了关键词和基于脂质的量子计算技术。 为了评估其可靠性, 我们手工从36 Qiskit 仓库的12 910个数据集中分类了4 984个问题。 自动分类用准确性、 精确性、 回溯性和 F1 核心来比较了基于规则的自动框架。 框架达到了85. 21% 的准确性, F1 核心从 0. 775 (多样性) 到 0. 8 393( 质量属性)。 通过配对式测试和 Cohen kappa 的统计验证非常接近完美的协议类型( k = 0.696)、 类别( k = 0. 8266)、 质量属性( k=0. 818), 质量属性( k = 0. 0. 8018 ) 质量分类显示轻微的准确性定义( k= 0. 0.162) 准确性分类 准确性质量等级, 质量等级分析( ) A 和直径级( ) 显示1.3) 质量等级分析。 和直径级分析( 质量问题为1 质量问题为1 级) 级分析( 级) 和直径比( 级) 。 。 。 级分析( 0.16级) 和直径级 级分析。
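The agreement figures above are Cohen's kappa values, which correct raw agreement for the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the chance agreement computed from each rater's label frequencies. The short sketch below computes it for two label lists. The sample labels are made up for illustration and are not data from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    n = len(labels_a)
    # Observed agreement: share of items on which both raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from the two raters' marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if p_e == 1.0:          # both raters used one identical label throughout
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Hypothetical severity labels: manual ground truth vs. rule-based output.
manual = ["low", "low", "low", "high", "low", "high", "low", "low"]
auto = ["low", "low", "high", "high", "low", "low", "low", "low"]
print(round(cohens_kappa(manual, auto), 3))
```

Here the two raters agree on 75% of items, yet kappa is only one third, because a skewed label distribution makes chance agreement high. The same effect may help explain how a heavily imbalanced attribute such as severity can combine high accuracy with only slight agreement.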
Article 106
Title@2025-06-12 (4): Is Sparse Matrix Reordering Effective for Sparse Matrix-Vector Multiplication?
Title: Is Sparse Matrix Reordering Effective for Sparse Matrix-Vector Multiplication? | Ist Sparse Matrix Reordering wirksam für Sparse Matrix-Vector Multiplikation? | 粗缩矩阵重新排序是否对 粗略矩阵- Vector 乘法有效? 2506.10356v1 |
Authors (5): Omid Asudeh, Sina Mahdipour Saravani, Gerald Sabin, Fabrice Rastello, P Sadayappan
This work evaluates the impact of sparse matrix reordering on the performance of sparse matrix-vector multiplication across different multicore CPU platforms. Reordering can significantly enhance performance by optimizing the pattern of non-zero elements to reduce total data movement and improve load balancing. We examine how these gains vary over different CPUs for different reordering strategies, focusing on both sequential and parallel execution. We address multiple aspects, including appropriate measurement methodology, comparison across different kinds of reordering strategies, consistency across machines, and impact of load imbalance.
这项工作评估了分散矩阵重新排序对不同多极CPU平台的稀少矩阵矢量倍增性效果的影响,通过优化非零元素模式,减少数据总体流动,改善负载平衡,重新排序可显著提高绩效。我们考察了不同重排战略中不同CPU的收益差异,重点是顺序和平行执行。我们探讨了多个方面,包括适当的计量方法、不同类型重排战略的比较、跨机器的一致性以及负载不平衡的影响。
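As a small illustration of what reordering changes, the sketch below builds a CSR matrix, multiplies it by a vector, and applies a symmetric permutation. The permutation leaves the result unchanged up to relabeling but pulls the non-zeros toward the diagonal, so accesses to `x` become more local. The matrix, the hand-picked permutation, and the helper names are assumptions for this example; the study itself compares established reordering strategies on real matrices and hardware.

```python
def to_csr(n, entries):
    """entries: list of (row, col, value) -> (indptr, indices, data)."""
    rows = [[] for _ in range(n)]
    for r, c, v in entries:
        rows[r].append((c, v))
    indptr, indices, data = [0], [], []
    for row in rows:
        for c, v in sorted(row):
            indices.append(c)
            data.append(v)
        indptr.append(len(indices))
    return indptr, indices, data

def spmv(csr, x):
    """y = A @ x for a matrix in CSR form."""
    indptr, indices, data = csr
    return [
        sum(data[k] * x[indices[k]] for k in range(indptr[i], indptr[i + 1]))
        for i in range(len(indptr) - 1)
    ]

def permute(entries, perm):
    """Symmetric reordering: old index i becomes perm[i]."""
    return [(perm[r], perm[c], v) for r, c, v in entries]

def bandwidth(entries):
    """Largest distance of any non-zero from the diagonal."""
    return max(abs(r - c) for r, c, _ in entries)

# 4x4 symmetric matrix whose off-diagonal couplings form the path 0-3-1-2.
entries = [(i, i, 2.0) for i in range(4)]
for u, v in [(0, 3), (3, 1), (1, 2)]:
    entries += [(u, v, 1.0), (v, u, 1.0)]
perm = [0, 2, 3, 1]                      # number the nodes along the path
x = [1.0, 2.0, 3.0, 4.0]

y = spmv(to_csr(4, entries), x)
reordered = permute(entries, perm)
x_perm = [0.0] * 4
for i, p in enumerate(perm):
    x_perm[p] = x[i]
y_perm = spmv(to_csr(4, reordered), x_perm)
print(bandwidth(entries), bandwidth(reordered))
```

Whether such a change in the non-zero pattern yields a measurable speedup depends on cache behavior and load balance on the target CPU, which is the question the paper measures.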
Article 107
Title@2025-06-12 (4): PerfTracker: Online Performance Troubleshooting for Large-scale Model Training in Production
Title: PerfTracker: Online Performance Troubleshooting for Large-scale Model Training in Production | PerfTracker: Online-Performance-Fehlersuche für großformatige Modellschulungen in der Produktion | PerfTracker:大规模生产示范培训在线绩效问题解决 2506.08528v3 |
Authors (13): Yu Guan, Zhiyu Yin, Haoyu Chen, Sheng Cheng, Chaojie Yang, Kun Qian, Tianyin Xu, Yang Zhang, Hanyu Zhao, Yong Li, Wei Lin, Dennis Cai, Ennan Zhai
Troubleshooting performance problems of large model training (LMT) is immensely challenging, due to unprecedented scales of modern GPU clusters, the complexity of software-hardware interactions, and the data intensity of the training process. Existing troubleshooting approaches designed for traditional distributed systems or datacenter networks fall short and can hardly apply to real-world training systems. In this paper, we present PerfTracker, the first online troubleshooting system utilizing fine-grained profiling, to diagnose performance issues of large-scale model training in production. PerfTracker can diagnose performance issues rooted in both hardware (e.g., GPUs and their interconnects) and software (e.g., Python functions and GPU operations). It scales to LMT on modern GPU clusters. PerfTracker effectively summarizes runtime behavior patterns of fine-grained LMT functions via online profiling, and leverages differential observability to localize the root cause with minimal production impact. PerfTracker has been deployed as a production service for large-scale GPU clusters of O(10,000) GPUs (product homepage https://help.aliyun.com/zh/pai/user-guide/perftracker-online-performance-analysis-diagnostic-tool). It has been used to diagnose a variety of difficult performance issues.
大型模型培训(LMT)的故障排除问题非常艰巨,原因是现代GPU集群规模空前,软件硬件互动的复杂性,以及培训过程的数据强度。为传统分布式系统或数据中心网络设计的现有故障排除方法不足,难以适用于现实世界培训系统。本文介绍PerfTracker,这是第一个使用微小分析的在线故障排除系统,用于诊断生产中大规模模型培训的绩效问题。 PerfTracker可以诊断硬件(例如,GPUs及其内部连接)和软件(例如,Python功能和GPU操作)两方面的绩效问题。在现代GPU组或数据中心中,现有故障排除方法不足,难以适用于现实世界培训系统。在本文中,我们介绍PerfTracker,这是第一个使用微小分析模型分析的在线故障排除系统,用来分析大规模GPUG(例如,10,000)的硬件组合及其内部连接)和软件(例如,Python 功能-stall-stall) GPUS-stall a has hasimal-hillagemental-deviewal ASyalmental a.
Article 108
Title@2025-06-12 (4): SLO-Aware Scheduling for Large Language Model Inferences
Title: SLO-Aware Scheduling for Large Language Model Inferences | SLO-Aware Scheduling für große Sprachmodell-Schlussfolgerungen | 大语言示范推理大语言示范推理的 SLO-Aware 排程 2504.14966v2 |
Authors (7): Jinqi Huang, Yi Xiong, Xuebing Yu, Wenjie Huang, Entong Li, Li Zeng, Xin Chen
Large language models (LLMs) have revolutionized applications such as code completion, chatbots, and online classification. To elevate user experiences, service level objectives (SLOs) serve as crucial benchmarks for assessing the capabilities of inference services. In practice, an inference service processes multiple types of tasks, each with its own distinct SLO. To ensure satisfactory user experiences, each request’s distinct SLOs should be considered in scheduling. However, existing designs lack this consideration, leading to insufficient hardware utility and suboptimal performance. This paper analyzes scenarios for processing tasks with varying SLOs, and introduces a simulated annealing-based scheduler that decides the request priority sequence based on each request’s SLO, input length, and possible output lengths. As the first specialized scheduler for multi-SLO scenarios, this work improves SLO attainment by up to 5x and reduces average latency by 31.6% on the Python-Code-23k-ShareGPT and ShareGPT_Vicuna_unfiltered datasets, compared to the current state-of-the-art framework vLLM and the newer framework LMDeploy.
大型语言模型(LLM)使代码补全、聊天机器人和在线分类等应用发生了革命性变化。为了提升用户体验,服务级别目标(SLO)是评估推理服务能力的关键基准。在实践中,推理服务需要处理多种类型的任务,每种任务都有各自不同的SLO。为了确保令人满意的用户体验,调度时应考虑每个请求各自的SLO。然而,现有设计缺乏这方面的考虑,导致硬件利用不足和性能欠佳。本文分析了处理具有不同SLO的任务的场景,并引入了一个基于模拟退火的调度器,根据请求的SLO、输入长度和可能的输出长度来决定请求的优先级顺序。作为首个面向多SLO场景的专用调度器,与当前最先进的框架vLLM和较新的框架LMDeploy相比,这项工作在Python-Code-23k-ShareGPT和ShareGPT_Vicuna_unfiltered数据集上将SLO达成率最多提高5倍,并将平均延迟降低31.6%。
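To make the idea of a simulated annealing-based priority sequence concrete, here is a minimal sketch under strong simplifying assumptions: requests are served one at a time, service time is a linear function of input and predicted output length, and the cost counts SLO violations first and mean latency second. The `Request` fields, the cost weights, and the cooling schedule are hypothetical; the paper's actual objective and its integration with batching are not described in the abstract.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    name: str
    input_len: int    # prompt tokens
    output_len: int   # predicted output tokens (an estimate)
    slo: float        # end-to-end latency budget in seconds


def service_time(r, prefill=0.001, decode=0.02):
    # Hypothetical linear cost model: seconds per prompt / generated token.
    return prefill * r.input_len + decode * r.output_len


def cost(order):
    """Lower is better: SLO violations dominate, mean latency breaks ties."""
    clock, violations, total_latency = 0.0, 0, 0.0
    for r in order:
        clock += service_time(r)      # requests are served sequentially
        total_latency += clock
        if clock > r.slo:
            violations += 1
    return violations * 1000.0 + total_latency / len(order)


def anneal(requests, steps=5000, t0=50.0, cooling=0.999, seed=0):
    """Search over orderings by swapping two requests per step."""
    current = list(requests)
    if len(current) < 2:
        return current, (cost(current) if current else 0.0)
    rng = random.Random(seed)
    current_cost = cost(current)
    best, best_cost = list(current), current_cost
    temp = t0
    for _ in range(steps):
        i, j = rng.sample(range(len(current)), 2)
        current[i], current[j] = current[j], current[i]
        delta = cost(current) - current_cost
        # Always accept improvements; accept regressions with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current_cost += delta
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
        else:
            current[i], current[j] = current[j], current[i]  # undo the swap
        temp *= cooling
    return best, best_cost


requests = [
    Request("batch-summarize", 4000, 500, 60.0),
    Request("chat", 200, 100, 5.0),
    Request("code-complete", 500, 30, 2.0),
    Request("classify", 100, 1, 1.0),
]
best_order, best_cost = anneal(requests)
```

In this toy workload, first-come-first-served ordering puts the long summarization job first and causes the three short requests to miss their budgets, while an ordering that runs the tight-SLO requests first meets all four. Annealing is used here instead of a fixed sort because the cost mixes several signals (SLO, input length, output length) that a single sort key does not capture.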
Article 109
Title@2025-06-12 (4): Resilience through Automated Adaptive Configuration for Distribution and Replication
Title: Resilience through Automated Adaptive Configuration for Distribution and Replication | Resilienz durch Automatisierte Adaptive Konfiguration für Verteilung und Replizierung | 通过自动适应配置进行分发和复制的复原力 2506.10248v1 |
Authors (3): Scott D. Stoller, Balaji Jayasankar, Yanhong A. Liu
This paper presents a powerful automated framework for making complex systems resilient under failures, by optimized adaptive distribution and replication of interdependent software components across heterogeneous hardware components with widely varying capabilities. A configuration specifies how software is distributed and replicated: which software components to run on each computer, which software components to replicate, which replication protocols to use, etc. We present an algorithm that, given a system model and resilience requirements, (1) determines initial configurations of the system that are resilient, and (2) generates a reconfiguration policy that determines reconfiguration actions to execute in response to failures and recoveries. This model-finding algorithm is based on state-space exploration and incorporates powerful optimizations, including a quotient reduction based on a novel equivalence relation between states. We present experimental results from successfully applying a prototype implementation of our framework to a model of an autonomous driving system.
本文提出了一个强大的自动化框架,通过在能力差异很大的异构硬件组件上对相互依赖的软件组件进行优化的自适应分布和复制,使复杂系统在故障情况下具有韧性。配置规定了软件如何分布和复制:每台计算机上运行哪些软件组件、复制哪些软件组件、使用哪些复制协议等。我们提出了一种算法,在给定系统模型和韧性要求的情况下,(1) 确定具有韧性的系统初始配置,(2) 生成重配置策略,用于确定在发生故障和恢复时要执行的重配置操作。这种模型查找算法基于状态空间探索,并包含强有力的优化,包括基于状态之间一种新的等价关系的商约简。我们给出了将该框架的原型实现成功应用于一个自动驾驶系统模型的实验结果。
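Step (1) of the algorithm, finding resilient initial configurations, can be illustrated with a brute-force sketch. It enumerates every placement of replicated components onto hosts, discards placements that exceed host capacity, and keeps those where every component retains a live replica under each possible set of host failures. The host names, capacities, component names, and the "at least one live replica" criterion are assumptions made for illustration; the paper's quotient reduction and reconfiguration policy generation are not reproduced here.

```python
from itertools import combinations, product

# Hypothetical system model: host -> number of component replicas it can run.
HOSTS = {"ecu_a": 2, "ecu_b": 2, "ecu_c": 1}
COMPONENTS = ["perception", "planning"]
REPLICAS = 2  # replicas per component, placed on distinct hosts


def placements():
    """Enumerate configurations: each component is mapped to REPLICAS hosts."""
    options = list(combinations(sorted(HOSTS), REPLICAS))
    for choice in product(options, repeat=len(COMPONENTS)):
        yield dict(zip(COMPONENTS, choice))


def fits(config):
    """Check that no host runs more replicas than its capacity allows."""
    load = {h: 0 for h in HOSTS}
    for hosts in config.values():
        for h in hosts:
            load[h] += 1
    return all(load[h] <= HOSTS[h] for h in HOSTS)


def survives(config, failed):
    """Assumed criterion: every component keeps at least one live replica."""
    return all(any(h not in failed for h in hosts) for hosts in config.values())


def resilient(config, max_failures=1):
    """Explore every failure state with up to max_failures failed hosts."""
    return all(
        survives(config, set(failed))
        for k in range(max_failures + 1)
        for failed in combinations(HOSTS, k)
    )


def resilient_configs(max_failures=1):
    return [c for c in placements() if fits(c) and resilient(c, max_failures)]
```

Exhaustive enumeration like this grows exponentially with the number of components and hosts, which is the motivation for optimizations such as treating equivalent states as one.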