cs.DC @ 2025-07-04: 116
-
00 07-03 (4) HybridTier: an Adaptive and Lightweight CXL-Memory Tiering System HybridTier: ein adaptives und leichtgewichtiges CXL-Speicher-Tiering-System HybridTier:一种自适应、轻量级的CXL内存分层系统 2312.04789v2 -
01 07-03 PS-WL: A Probability-Sensitive Wear Leveling scheme for SSD array scaling PS-WL: Ein wahrscheinlichkeitssensitives Wear-Leveling-Schema für die Skalierung von SSD-Arrays PS-WL:面向SSD阵列扩展的概率敏感磨损均衡方案 2506.19660v2 -
02 07-03 FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference FlowSpec: Kontinuierliche Pipeline-basierte spekulative Dekodierung für effiziente verteilte LLM-Inferenz FlowSpec:面向高效分布式LLM推理的连续流水线推测解码 2507.02620v1 -
03 07-03 MULTI-SCOUT: Multistatic Integrated Sensing and Communications in 5G and Beyond for Moving Target Detection, Positioning, and Tracking MULTI-SCOUT: Multistatische integrierte Sensorik und Kommunikation in 5G und darüber hinaus zur Erkennung, Positionierung und Verfolgung bewegter Ziele MULTI-SCOUT:面向移动目标检测、定位与跟踪的5G及后5G多基地通感一体化 2507.02613v1 -
04 07-03 AI Flow: Perspectives, Scenarios, and Approaches AI Flow: Perspektiven, Szenarien und Ansätze AI 流动:观点、设想和方法 2506.12479v2 -
05 07-03 Resolving CAP Through Automata-Theoretic Economic Design: A Unified Mathematical Framework for Real-Time Partition-Tolerant Systems Lösung von CAP durch Automata-Theoretisches Wirtschaftsdesign: Ein einheitlicher mathematischer Rahmen für Echtzeit-Partitions-Tolerante Systeme 通过自动化数据理论经济设计解决CAP:实时分区-耐用系统统一数学框架 2507.02464v1 -
06 07-03 Red grape detection with accelerated artificial neural networks in the FPGA’s programmable logic Rote Traubenerkennung mit beschleunigten künstlichen neuronalen Netzwerken in der programmierbaren Logik des FPGA FPGA的可编程逻辑的红葡萄探测与加速人工神经网络 2507.02443v1 -
07 07-03 The Artificial Scientist – in-transit Machine Learning of Plasma Simulations Der Künstliche Wissenschaftler – in-transit maschinelles Lernen von Plasmasimulationen 人造科学家 – – Plasma模拟模拟的中转机器学习 2501.03383v3 -
08 07-03 Alps, a versatile research infrastructure Alpen, eine vielseitige Forschungsinfrastruktur 阿尔卑斯山,多用途研究基础设施 2507.02404v1 -
09 07-03 VeFIA: An Efficient Inference Auditing Framework for Vertical Federated Collaborative Software VeFIA: Ein effizientes Inferenz-Audit-Framework für vertical Federated Collaborative Software VEFIA: 垂直联邦合作软件有效推断审计框架 2507.02376v1 -
10 07-03 Flotilla: A scalable, modular and resilient federated learning framework for heterogeneous resources Flotilla: Ein skalierbarer, modularer und widerstandsfähiger föderierter Lernrahmen für heterogene Ressourcen 船队:多样化资源的可扩展、模块化和有弹性的联邦学习框架 2507.02295v1 -
11 07-03 Domain-Adversarial Transfer Learning for Fault Root Cause Identification in Cloud Computing Systems Domain-Adversarial-Transfer-Lernen für fehlerhafte Root-Cause-Identifikation in Cloud Computing-Systemen 为在云计算系统中查明原因原因而进行校内自动转移学习 2507.02233v1 -
12 07-02 (3) Signalling Health for Improved Kubernetes Microservice Availability Signalisierung der Gesundheit für verbesserte Kubernetes Mikroservice-Verfügbarkeit 改善Kubernetes微服务提供情况 2507.02158v1 -
13 07-02 Fundamental Limits of Hierarchical Secure Aggregation with Cyclic User Association Grundlegende Grenzen der Hierarchischen Sicheren Aggregation mit Cyclic User Association 与cycclic用户协会的等级安全分类基本限制 2503.04564v4 -
14 07-02 SAKURAONE: Empowering Transparent and Open AI Platforms through Private-Sector HPC Investment in Japan SAKURAONE: Stärkung transparenter und offener KI-Plattformen durch Private-Sector HPC-Investitionen in Japan SAKURAONE:通过在日本的私营部门HPC投资增强透明和开放的AI平台的权能 2507.02124v1 -
15 07-02 Parallelization of Network Dynamics Computations in Heterogeneous Distributed Environment Parallelisierung von Network Dynamics Computations in heterogener verteilter Umgebung 不同差异分布环境中网络动态计算平行化 2410.19075v2 -
16 07-02 Analyzing Common Electronic Structure Theory Algorithms for Distributed Quantum Computing Analyse gemeinsamer elektronischer Strukturtheorien Algorithmen für verteiltes Quantenrechnen 分布量量计算法的通用电子结构理论比值 2507.01902v1 -
17 07-02 Evolving HPC services to enable ML workloads on HPE Cray EX Evolving HPC-Dienste, um ML-Workloads auf HPE Cray EX zu ermöglichen 不断演化的HPC服务,使HPE Cray EX 的ML工作量得以完成 2507.01880v1 -
18 07-02 Not eXactly Byzantine: Efficient and Resilient TEE-Based State Machine Replication Nicht eXactly Byzantin: Effiziente und resiliente TEE-basierte State Machine Replication Byzantine:高效且具有弹性的以技术为基础的国家机器复制 2501.11051v3 -
19 07-02 GPU-based complete search for nonlinear minimization subject to bounds GPU-basierte komplette Suche nach nichtlinearer Minimierung unter Grenzen 基于 GPU 的基于 GPU 的完整搜索, 以不受约束的方式对非线性最小化进行搜索 2507.01770v1 -
20 07-02 Deep Recommender Models Inference: Automatic Asymmetric Data Flow Optimization Deep Recommender Models Inferenz: Automatische Asymmetrische Datenflussoptimierung 深建议模型推断:自动对称数据流动优化 2507.01676v1 -
21 07-02 Melding the Serverless Control Plane with the Conventional Cluster Manager for Speed and Compatibility Verschmelzen des serverlosen Steuerplans mit dem konventionellen Clustermanager für Geschwindigkeit und Kompatibilität 与用于速度和兼容性的常规集管理器管理器熔化无服务器控制平面 2505.24551v2 -
22 07-02 EDGChain-E: A Decentralized Git-Based Framework for Versioning Encrypted Energy Data EDGChain-E: Ein dezentralisiertes Git-basiertes Framework zur Versionierung verschlüsselter Energiedaten EDGCHain-E: 以分散式基基基框架取代加密能源数据版本 2507.01615v1 -
23 07-02 Optimal Computation in Anonymous Dynamic Networks Optimale Berechnung in anonymen dynamischen Netzwerken 匿名动态网络的最佳计算 2207.08061v8 -
24 07-02 Rational Censorship Attack: Breaking Blockchain with a Blackboard Rationaler Zensurangriff: Blockchain mit einer Tafel durchbrechen 理性审查攻击:用黑板打破锁链 2507.01453v1 -
25 07-02 EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices EdgeLoRA: Ein effizientes Multi-Tenant LLM Serving System auf Edge-Geräten EdgeLoRA:面向边缘设备的高效多租户LLM服务系统 2507.01438v1 -
26 07-02 Optimal Dispersion Under Asynchrony Optimale Dispersion unter Asynchronie Asynconsrony 下的优化分散 2507.01298v1 -
27 07-02 Far From Sight, Far From Mind: Inverse Distance Weighting for Graph Federated Recommendation Weit weg vom Sehen, weit weg vom Denken: Inverse Distanzgewichtung für Graph Federated Empfehlung 远离视觉,远离心智:对 “ 绿联建议 “ 的反距离加权 2507.01285v1 -
28 07-01 (2) Capacity Planning and Scheduling for Jobs with Uncertainty in Resource Usage and Duration Kapazitätsplanung und Planung für Jobs mit Unsicherheit in Ressourcennutzung und -dauer 资源使用和期限不确定的工作的能力规划和时间安排 2507.01225v1 -
29 07-01 FLARE: A Dataflow-Aware and Scalable Hardware Architecture for Neural-Hybrid Scientific Lossy Compression FLARE: Eine datenflussfähige und skalierbare Hardwarearchitektur für Neural-Hybrid Scientific Lossy Compression FLARE: 用于神经 – – Hybrid科学损失压缩的数据流软件和可缩放硬件结构 2507.01224v1 -
30 07-01 HERCULES: Hardware accElerator foR stoChastic schedULing in hEterogeneous Systems HERCULES: Hardware-Accelerator foR stoChastic SchedULing in hEterogeneous Systems HERCULES:异构系统中用于随机调度的硬件加速器 2507.01113v1 -
31 07-01 A Terminology for Scientific Workflow Systems Eine Terminologie für wissenschaftliche Workflow-Systeme 科学工作流程系统术语术语 2506.07838v5 -
32 07-01 Efficient Gate Reordering for Distributed Quantum Compiling in Data Centers Effiziente Gate-Reorder für verteilte Quantenkompilierung in Rechenzentren 数据中心分配量数汇编高效门(高效门)重新排序 2507.01090v1 -
33 07-01 Not All Water Consumption Is Equal: A Water Stress Weighted Metric for Sustainable Computing Nicht jeder Wasserverbrauch ist gleich: Ein Wasserdruck-gewichtetes Metric für nachhaltiges Rechnen 并非所有水消耗量都相等:可持续计算中的水应激反应加权计量 2506.22773v2 -
34 07-01 How Fast Can Graph Computations Go on Fine-grained Parallel Architectures Wie schnell man Berechnungen graphen kann geht auf feinkörnigen parallelen Architekturen 快速图表计算在精细的平行建筑上如何进行 2507.00949v1 -
35 07-01 Turning AI Data Centers into Grid-Interactive Assets: Results from a Field Demonstration in Phoenix, Arizona Umwandlung von KI-Datenzentren in Grid-Interaktive Vermögenswerte: Ergebnisse einer Felddemonstration in Phoenix, Arizona 将AI数据中心变成网状互动资产:亚利桑那州凤凰城现场示范的成果 2507.00909v1 -
36 07-01 A New Family of Thread to Core Allocation Policies for an SMT ARM Processor Eine neue Thread-Familie für Kernzuteilungsrichtlinien für einen SMT ARM-Prozessor SMT ARM 处理器核心分配政策新一串线索 2507.00855v1 -
37 07-01 Enabling mixed-precision in spectral element codes Ermöglichung der Mischpräzision in Spektralelementcodes 使光谱元代码具有混合精度 2503.02134v2 -
38 07-01 yProv4ML: Effortless Provenance Tracking for Machine Learning Systems yProv4ML: Müheloses Provenienz-Tracking für maschinelle Lernsysteme yProv4ML: 机器学习系统无穷无尽的证明跟踪 2507.01078v1 -
39 07-01 PANDAS: Peer-to-peer, Adaptive Networking for Data Availability Sampling within Ethereum Consensus Timebounds PANDAS: Peer-to-Peer, Adaptive Networking für Datenverfügbarkeit Probenahme innerhalb von Ethereum Consensus Timebounds PANDAS:对等对等网络,为数据提供建立适应性网络,在Eetenum共识时限内抽样 2507.00824v1 -
40 07-01 To Offload or Not To Offload: Model-driven Comparison of Edge-native and On-device Processing Zum Offload oder nicht zum Offload: Modellgetriebener Vergleich von Edge-native und On-Device-Verarbeitung 卸载还是不卸载:边缘原生处理与设备端处理的模型驱动比较 2504.15162v2 -
41 07-01 Provenance Tracking in Large-Scale Machine Learning Systems Provenienzverfolgung in großformatigen Maschinen-Lernsystemen 大型机器学习系统中的证书追踪系统 2507.01075v1 -
42 07-01 Improving the scalability of a high-order atmospheric dynamics solver based on the deal.II library Verbesserung der Skalierbarkeit eines auf dem Deal basierenden atmosphärischen Dynamiklösers.II-Bibliothek 根据协议改善高阶大气动态求解器的可缩放性。 2505.00384v3 -
43 07-01 Safe Low Bandwidth SPV: A Formal Treatment of Simplified Payment Verification Protocols and Security Bounds Safe Low Bandwidth SPV: Eine formale Behandlung von vereinfachten Zahlungsverifikationsprotokollen und Sicherheitsbunden 安全低频带宽度SPV:对简化付款核查议定书和安全圈的正式处理 2507.00740v1 -
44 07-01 Accelerating Loading WebGraphs in ParaGrapher Beschleunigte Laden WebGraphs in ParaGrapher 加速加载 ParaGrapher 中的网页格 2507.00716v1 -
45 07-01 eACGM: Non-instrumented Performance Tracing and Anomaly Detection towards Machine Learning Systems eACGM: Non-instrumented Performance Tracing and Anomalie Detection towards Machine Learning Systems eACGM:非仪器化业绩追踪和异常探测,转向机器学习系统 2506.02007v2 -
46 07-01 Toward Edge General Intelligence with Multiple-Large Language Model (Multi-LLM): Architecture, Trust, and Orchestration Hin zum Rand Allgemeine Intelligenz mit multi-large Sprachmodell (Multi-LLM): Architektur, Vertrauen und Orchestrierung 以多大语言模式(Multi-LLM):建筑、信任和管弦化 2507.00672v1 -
47 07-01 DynoStore: A wide-area distribution system for the management of data over heterogeneous storage DynoStore: Ein weiträumiges Distributionssystem für die Verwaltung von Daten über heterogene Speicherung DynoStore:管理不同储存数据广域分布系统 2507.00576v1 -
48 07-01 Collaborative Multi-Agent Reinforcement Learning Approach for Elastic Cloud Resource Scaling Collaboratives Multi-Agent-Verstärkungs-Lernkonzept für elastische Cloud-Ressourcenskalierung 弹性云层资源扩缩多机构加强学习方法合作 2507.00550v1 -
49 07-01 Edge Computing and its Application in Robotics: A Survey Edge Computing und seine Anwendung in der Robotik: Eine Umfrage 边缘计算及其在机器人学中的应用:调查 2507.00523v1 -
50 07-01 LLM-Mesh: Enabling Elastic Sharing for Serverless LLM Inference LLM-Mesh: Elastische Freigabe für serverlose LLM-Inferenz aktivieren LLM-Mesh:为无服务器的LLM推理提供弹性分享能力 2507.00507v1 -
51 07-01 Real-Time In-Network Machine Learning on P4-Programmable FPGA SmartNICs with Fixed-Point Arithmetic and Taylor Echtzeit-In-Network Machine Learning auf P4-Programmierbaren FPGA SmartNICs mit Fixed-Point Arithmetic und Taylor P4-可编程的PFGA智能计算机计算机计算机与固定点测算机和泰勒的实时网络内机器学习 2507.00428v1 -
52 07-01 Find a Scapegoat: Poisoning Membership Inference Attack and Defense to Federated Learning Find a Scapegoat: Vergiftung der Mitgliedschaft Inferenzangriff und Verteidigung zu Federated Learning 寻找一条“Scamegoat”:毒瘾成员攻击和防御联邦学习组织 2507.00423v1 -
53 07-01 Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and High-Performance GPUs LLMs in HPC-Clustern bedienen: Eine vergleichende Studie von Qualcomm Cloud AI 100 Ultra- und Hochleistungs-GPUs HPC群集中服务长效LLMs:对Qalcomm Cloud AI 100超效和高效GPU的比较研究 2507.00418v1 -
54 07-01 HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism HelixPipe: Effizientes verteiltes Training von Long-Sequence-Transformern mit Attention-parallelem Pipeline-Parallelismus HelixPipe:基于注意力并行流水线并行的长序列Transformer高效分布式训练 2507.00394v1 -
55 06-30 (1) Evaluation of a Foundational Model and Stochastic Models for Forecasting Sporadic or Spiky Production Outages of High-Performance Machine Learning Services Bewertung eines Basismodells und stochastische Modelle zur Vorhersage sporadischer oder würziger Produktionsausfälle hochleistungsfähiger Machine Learning Services 评价预测高性能机器学习服务零星或斯皮生产流出的基础模型和存储模型 2507.01067v1 -
56 06-30 Rust vs. C for Python Libraries: Evaluating Rust-Compatible Bindings Toolchains Rust vs. C für Python Bibliotheken: Bewertung von Rust-kompatiblen Bindungen Toolchains Python图书馆的Rust诉C案:评估Rust-Compable Contracable Contails 工具链 2507.00264v1 -
57 06-30 CrossPipe: Towards Optimal Pipeline Schedules for Cross-Datacenter Training CrossPipe: Auf dem Weg zu optimalen Pipeline-Fahrplänen für Cross-Datacenter-Schulungen CrossPipe:争取为跨数据中心培训制定最佳管道时间表 2507.00217v1 -
58 06-30 Avoid Forgetting by Preserving Global Knowledge Gradients in Federated Learning with Non-IID Data Vermeiden Sie das Vergessen, indem Sie globale Wissensgradienten im Föderierten Lernen mit nicht-ID-Daten bewahren 避免在使用非二二二维数据进行联邦学习时因保留全球知识进步而被遗忘 2505.20485v3 -
59 06-30 Identifying the Truth of Global Model: A Generic Solution to Defend Against Byzantine and Backdoor Attacks in Federated Learning (full version) Die Wahrheit des globalen Modells identifizieren: Eine generische Lösung gegen byzantinische und Hintertürangriffe im Federated Learning (Vollversion) 查明全球模式真相:在联邦学习联盟中防范拜占庭和后门攻击的一般解决办法(全文) 2311.10248v3 -
60 06-30 Agent.xpu: Efficient Scheduling of Agentic LLM Workloads on Heterogeneous SoC Agent.xpu: Effiziente Planung von Agentic LLM Workloads auf heterogenen SoC Agent.xpu: 高效地安排对异基因 soC 的Agentic LLM 工作负荷 2506.24045v1 -
61 06-30 Intelligent Orchestration of Distributed Large Foundation Model Inference at the Edge Intelligente Orchestrierung der verteilten Large Foundation Model Inferenz am Rande 分散在边缘的大基金会模型推断 2504.03668v2 -
62 06-30 QPART: Adaptive Model Quantization and Dynamic Workload Balancing for Accuracy-aware Edge Inference QPART: Adaptive Modell-Quantisierung und dynamische Workload-Balancing für akkurat-bewusste Edge-Inferenz QPART: 适应性模型量化和动态工作量平衡,以利准确度认知边缘推断 2506.23934v1 -
63 06-30 Cuckoo Heavy Keeper and the balancing act of maintaining heavy hitters in stream processing Cuckoo Heavy Keeper und der Balanceakt der Aufrechterhaltung von schweren Hittern in der Stromverarbeitung Cuckoo重物保管器和在溪流处理中保持重击器的平衡做法 2412.12873v3 -
64 06-30 Segmented Operations using Matrix Multiplications Segmentierte Operationen mit Matrix-Multiplikationen 使用矩阵乘法进行分割操作 2506.23906v1 -
65 06-30 Proving the Limited Scalability of Centralized Distributed Optimization via a New Lower Bound Construction Nachweis der begrenzten Skalierbarkeit der zentralisierten verteilten Optimierung durch eine neue untere Bound-Konstruktion 证明通过新建下下界建筑的集中分配最佳优化的有限可扩展性 2506.23836v1 -
66 06-30 Large-scale Neural Network Quantum States for ab initio Quantum Chemistry Simulations on Fugaku Großes neurales Netzwerk Quantenstaaten für ab initio Quantenchemie Simulationen auf Fugaku 用于对富巴库进行初始量子化学模拟的大型神经网络量图州 2506.23809v1 -
67 06-30 When Servers Meet Species: A Fab-to-Grave Lens on Computing’s Biodiversity Impact Wenn Server Arten treffen: Eine Fab-to-Grave-Lens für die Biodiversitätswirkung von Computing 当服务器与物种相遇时:关于计算机的生物多样性影响的一个从宽到宽的镜头 2506.20442v3 -
68 06-30 Towards Building Private LLMs: Exploring Multi-Node Expert Parallelism on Apple Silicon for Mixture-of-Experts Large Language Model Auf dem Weg zum Aufbau privater LLMs: Erforschung von Multi-Node-Experten-Parallelismus auf Apple Silicon für Mixture-of-Experts Large Language Model 走向建设私有私人LLMs:探索关于苹果硅的多节专家平行专家,用于混合专家大语言模型 2506.23635v1 -
69 06-30 FedEx-LoRA: Exact Aggregation for Federated and Efficient Fine-Tuning of Foundation Models FedEx-LoRA: Exakte Aggregation für Federated and Efficient Fine-Tuning of Foundation Models FedEx-LORA:基金会模型的联邦和高效精度 2410.09432v4 -
70 06-30 Detect & Score: Privacy-Preserving Misbehaviour Detection and Contribution Evaluation in Federated Learning Detect & Score: Privacy-Preserving Misbehaviour Detection and Contribution Evaluation in Federated Learning 检测与评分:联邦学习中保护隐私的不当行为检测与贡献评估 2506.23583v1 -
71 06-30 PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization PipeOffload: Verbesserung der Skalierbarkeit von Pipeline Parallelismus mit Speicheroptimierung 管道卸载: 提高管道平行式与内存优化的可缩放性 2503.01328v2 -
72 06-30 VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference VQ-LLM: Hochleistungs-Code-Generierung für Vector Quantization Augmented LLM Inferenz VQ-LLLM: 矢量量化增强LLM 推理高性能代码生成 2503.02236v2 -
73 06-30 Oases: Efficient Large-Scale Model Training on Commodity Servers via Overlapped and Automated Tensor Model Parallelism Oasen: Effiziente großformatige Modellschulung auf Commodity-Servern durch überlappende und automatisierte Tensor-Modellparallelität Oases:通过重叠和自动登光示范平行模式,对商品服务器进行有效的大型大型示范培训 2305.16121v2 -
74 06-29 (7) FastSet: Parallel Claim Settlement FastSet: Parallele Forderungsabrechnung FastSet:平行索赔理赔 2506.23395v1 -
75 06-29 FedRef: Communication-Efficient Bayesian Fine Tuning with Reference Model FedRef: Kommunikation-Effizient Bayesian Feinabstimmung mit Referenzmodell FedRef: 通信-节能贝ysian精密票,参考模型 2506.23210v1 -
76 06-29 Efficient malicious information detection method based on set partitioning for large-scale Internet of Things Effiziente Methode zur Erkennung von bösartigen Informationen, basierend auf der eingestellten Partitionierung für das Internet der Dinge im großen Maßstab 基于大规模物联网的固定分区的高效恶意信息检测方法 2502.11538v2 -
77 06-29 Verifying Properties of Index Arrays in a Purely-Functional Data-Parallel Language Überprüfung der Eigenschaften von Index-Arrays in einer rein funktionalen Daten-Parallel-Sprache 校验纯功能数据- Parallel 语言索引阵列属性 2506.23058v1 -
78 06-28 (6) ATTENTION2D: Communication Efficient Distributed Self-Attention Mechanism ACHTUNG2D: Kommunikation Effizient verteilter Selbstaufmerksamkeitsmechanismus 注意2D: 沟通高效分配自发性传播机制 2503.15758v2 -
79 06-28 Cicada: A Pipeline-Efficient Approach to Serverless Inference with Decoupled Management Cicada: Ein Pipeline-Effizienter Ansatz zur serverlosen Schlussfolgerung mit entkoppelter Verwaltung Cicada:用管道有效处理无服务器推断与拆分管理的方法 2502.20959v2 -
80 06-28 Performance Measurements in the AI-Centric Computing Continuum Systems Leistungsmessungen in den KI-Centric Computing Continuum Systemen AI-Centric 电子计算大陆系统的业绩计量 2506.22884v1 -
81 06-28 Reliable Image Transmission in CPS-based Pub/Sub Zuverlässige Bildübertragung im CPS-basierten Pub/Sub 以CPS为基础的PP/Pub/Sub的可靠图像传输 2506.22875v1 -
82 06-28 Momentum-based Accelerated Algorithm for Distributed Optimization under Sector-Bound Nonlinearity Momentumbasierte beschleunigte Algorithmen zur verteilten Optimierung unter sektorübergreifender Nichtlinearität 部门-基于动力的在部门-健全非线性下分配的优化分配加速计算 2506.22855v1 -
83 06-28 Adaptive Rank Allocation for Federated Parameter-Efficient Fine-Tuning of Language Models Adaptive Rangverteilung für Federated Parameter-Efficient Fine-Tuning of Language Models 联邦准拉米有效精密语言模式调适级分配 2501.14406v3 -
84 06-28 TriADA: Massively Parallel Trilinear Matrix-by-Tensor Multiply-Add Algorithm and Device Architecture for the Acceleration of 3D Discrete Transformations TriADA: Massiv parallel Trilineare Matrix-by-Tensor Multiplizieren von Algorithmen und Gerätearchitektur für die Beschleunigung von 3D-Diskreten Transformationen TriADA: 加速 3D 分立变换的大规模平行平行三线矩阵矩阵逐个传感器乘数加算法和设备结构 2506.22818v1 -
85 06-28 Characterizing GPU Resilience and Impact on AI/HPC Systems Charakterisierung der GPU-Resilienz und Auswirkungen auf AI/HPC-Systeme 确定GPU的复原力和对AI/HPC系统的影响 2503.11901v3 -
86 06-28 Efficiently Serving Large Multimodal Models Using EPD Disaggregation Effizientes Servieren großer multimodaler Modelle mit EPD-Disaggregation 利用EPD拆分有效服务大型多模式模式 2501.05460v4 -
87 06-28 Libra: Synergizing CUDA and Tensor Cores for High-Performance Sparse Matrix Multiplication Waage: CUDA- und Tensorkerne für hochleistungsfähige Sparse-Matrix-Multiplikation synergisieren 激光仪:将CUDA和Tensor核心同步用于高性能散射矩阵乘法 2506.22714v1 -
88 06-27 (5) SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving SLED: Ein spekulatives LLM-Decoding-Framework für effizientes Edge Serving SLED: 有效边缘服务投机性LLM代谢框架 2506.09397v3 -
89 06-27 DistShap: Scalable GNN Explanations with Distributed Shapley Values DistShap: Skalierbare GNN-Erklärungen mit verteilten Shapley-Werten 分布式shap:可缩放的 GNN 解释和分布式形状值 2506.22668v1 -
90 06-27 Reductions in local certification Reduzierung der lokalen Zertifizierung 地方认证减少 2502.01551v2 -
91 06-27 Towards Operational Data Analytics Chatbots – Virtual Knowledge Graph is All You Need Auf dem Weg zu operativen Datenanalytik Chatbots – Virtual Knowledge Graph ist alles, was Sie brauchen 迈向实用数据分析分析聊天器 – – 虚拟知识图是你所需要的全部 2506.22267v1 -
92 06-27 Autonomic Microservice Management via Agentic AI and MAPE-K Integration Autonomes Microservice Management über Agentic AI und MAPE-K Integration 通过Agentic AI和MAPE-K整合进行自动微服务管理 2506.22185v1 -
93 06-27 Reliability Analysis of Smart Contract Execution Architectures: A Comparative Simulation Study Zuverlässigkeitsanalyse von Smart Contract Execution Architectures: Eine vergleichende Simulationsstudie 智能合同执行结构可靠性分析:比较模拟研究 2506.22180v1 -
94 06-27 MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism MPipeMoE: Memory Efficient MoE für vortrainierte Modelle mit adaptivem Pipeline Parallelismus MPIPEMOE: 适应性管道平行主义的预培训模型记忆高效记忆部 2506.22175v1 -
95 06-27 Proof-of-Behavior: Behavior-Driven Consensus for Trustworthy Decentralized Finance Proof-of-Behavior: Behavior-Driven Consensus für vertrauenswürdige dezentralisierte Finanzen 行为证明:可信赖的权力下放金融行为共识 2506.22171v1 -
96 06-27 MCFuser: High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators MCFuser: High-Performance und schnelle Fusion von Memory-Bound Compute-Intensive Operatoren MCFuser: 内存 – – 弹道计算密集操作员的高度性能和迅速扩散 2506.22169v1 -
97 06-27 SPTCStencil: Unleashing Sparse Tensor Cores for Stencil Computation via Strided Swap SPTCStencil: Entleashing Sparse Tensor Cores für Schablone Computation via Strided Swap JSPCtencil: 通过 Strided Swap 解析 Stencils 计算 Stencils 的稀释渗漏天体核心 2506.22035v1 -
98 06-27 SiPipe: Bridging the CPU-GPU Utilization Gap for Efficient Pipeline-Parallel LLM Inference SiPipe: Überbrückung der CPU-GPU-Utilisationslücke für effiziente Pipeline-Parallel-LLM-Inferenz SiPipe:弥合CPU-GPU利用差距,提高管道-Parallel LLM 推理效率 2506.22033v1 -
99 06-27 Programming Distributed Collective Processes in the eXchange Calculus Programmierung verteilter kollektiver Prozesse im eXchange Calculus eXchange Calculus 中的程序编程分配集体进程 2401.11212v4 -
100 06-27 A Survey on Federated Fine-tuning of Large Language Models Eine Umfrage über Federated Fine-Tuning von großen Sprachmodellen 大语言模式联邦微调调查 2503.12016v2 -
101 06-27 Generative AI for Software Architecture. Applications, Challenges, and Future Directions Generative KI für Softwarearchitektur. Anwendungen, Herausforderungen und Zukunftsrichtungen A. 软件结构的生成AI 应用、挑战和未来方向 2503.13310v2 -
102 06-27 AeroDaaS: Towards an Application Programming Framework for Drones-as-a-Service AeroDaaS: Auf dem Weg zu einem Anwendungsprogrammierungsrahmen für Drohnen-as-a-Service AeroDaaS:努力为作为服务对象的无人机制定应用方案框架 2504.03802v2 -
103 06-27 Enabling Bitcoin Smart Contracts on the Internet Computer Ermöglichung von Bitcoin Smart Contracts auf dem Internet-Computer 使因特网计算机上比特币智能合同成为可能 2506.21327v2 -
104 06-26 (4) Benchmarking and Parallelization of Electrostatic Particle-In-Cell for low-temperature Plasma Simulation by particle-thread Binding Benchmarking und Parallelisierung elektrostatischer Partikel-In-Zellen für Niedertemperatur-Plasmasimulation durch Partikel-Thread-Bindung 低温等温等同等量定基准和静电粒子细胞中电静电粒子细胞平行化 2506.21524v1 -
105 06-26 Efficient and Reuseable Cloud Configuration Search Using Discovery Spaces Effiziente und wiederverwendbare Cloud-Konfiguration Suche mit Discovery Spaces 利用发现空间进行高效和可再利用的云层配置搜索 2506.21467v1 -
106 06-26 exa-AMD: A Scalable Workflow for Accelerating AI-Assisted Materials Discovery and Design exa-AMD: Ein skalierbarer Workflow zur Beschleunigung der Entdeckung und des Designs von KI-Assistenten Exa-AMD:加速使用AI辅助材料发现和设计的一个可缩放工作流程 2506.21449v1 -
107 06-26 Carbon-Aware Microservice Deployment for Optimal User Experience on a Budget Carbon-Aware Microservice Bereitstellung für eine optimale Benutzererfahrung auf einem Budget 为最佳预算用户提供最佳预算用户经验的碳软件微型服务部署 2506.21422v1 -
108 06-26 Exploring Micro Frontends: A Case Study Application in E-Commerce Erforschung von Micro Frontends: Eine Anwendungsfallstudie im E-Commerce 探索微观前沿:电子商务案例研究应用 2506.21297v1 -
109 06-26 Balancing Privacy, Robustness, and Efficiency in Machine Learning Ausbalancierende Privatsphäre, Robustheit und Effizienz im maschinellen Lernen 平衡隐私、强健和机器学习效率 2312.14712v3 -
110 06-26 The Autonomy of the Lightning Network: A Mathematical and Economic Proof of Structural Decoupling from BTC Die Autonomie des Blitznetzes: Ein mathematischer und wirtschaftlicher Beweis der strukturellen Entkopplung von BTC 闪电网络的自主性:结构脱钩与BTC的数学和经济证明 2506.19333v2 -
111 06-26 Bridding OT and PaaS in Edge-to-Cloud Continuum Bridding OT und PaaS im Edge-to-Cloud Continuum 边际至环际环礁岛的Briding OT和PaaS 2506.21072v1 -
112 06-26 An Information-Theoretic Analysis for Federated Learning under Concept Drift Eine informationstheoretische Analyse für das Federated Learning unter Konzept Drift 根据 “ 漂流概念 “ 进行的联邦学习信息理论分析 2506.21036v1 -
113 06-26 BLOCKS: Blockchain-supported Cross-Silo Knowledge Sharing for Efficient LLM Services BLOCKS: Blockchain-gestützter Cross-Silo-Wissensaustausch für effiziente LLM-Dienste BLOCKS:为高效率的LLM服务进行链链式支持的跨SIlo知识共享 2506.21033v1 -
114 06-26 Portable High-Performance Kernel Generation for a Computational Fluid Dynamics Code with DaCe Tragbare Hochleistungs-Kernel-Generation für einen numerischen Fluid-Dynamik-Code mit DaCe DaCe 计算流流体动态代码的可携高性能核心生成器 2506.20994v1 -
115 06-26 ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks ParEval-Repo: Eine Benchmark-Suite zur Bewertung von LLMs mit HPC-Übersetzungsaufgaben auf Repository-Ebene PaarEval-Repo:评价拥有仓库级高常委会翻译任务的LLMLM 基准套件 2506.20938v1
Article 0
Title@2025-07-03 (4): HybridTier: an Adaptive and Lightweight CXL-Memory Tiering System
Title: HybridTier: an Adaptive and Lightweight CXL-Memory Tiering System | HybridTier: ein adaptives und leichtgewichtiges CXL-Speicher-Tiering-System | HybridTier:一种自适应、轻量级的CXL内存分层系统 2312.04789v2 |
Authors (6): Kevin Song, Jiacheng Yang, Zixuan Wang, Jishen Zhao, Sihang Liu, Gennady Pekhimenko
Modern workloads demand increasingly large memory capacity. Compute Express Link (CXL)-based memory tiering has emerged as a promising solution to this problem: it uses traditional DRAM alongside slow-tier CXL memory devices. We analyze prior tiering systems and observe two challenges for high-performance memory tiering: adapting to skewed but dynamically varying data hotness distributions, and minimizing the memory and cache overhead caused by tiering. To address these challenges, we propose HybridTier, an adaptive and lightweight tiering system for CXL memory. HybridTier tracks long-term data access frequency and short-term access momentum simultaneously to accurately capture and adapt to shifting hotness distributions. HybridTier reduces the metadata memory overhead by tracking data accesses probabilistically, obtaining higher memory efficiency in exchange for a small amount of tracking inaccuracy that has a negligible impact on application performance. To reduce cache overhead, HybridTier tracks data hotness with lightweight data structures that are optimized for data locality. Our evaluations show that HybridTier outperforms prior systems by up to $91\%$ ($19\%$ geomean), incurring $2.0-7.8\times$ less memory overhead and $1.7-3.5\times$ fewer cache misses.
现代工作负载对内存容量的需求日益增长。基于Compute Express Link(CXL)的内存分层通过将传统DRAM与慢速层CXL内存设备结合使用,已成为解决这一问题的一种有前景的方案。我们分析了已有的分层系统,并指出高性能内存分层面临的两个挑战:既要适应偏斜但动态变化的数据热度分布,又要尽量降低分层带来的内存和缓存开销。为应对这些挑战,我们提出HybridTier,一种面向CXL内存的自适应、轻量级分层系统。HybridTier同时跟踪长期数据访问频率和短期访问动量,以准确捕捉并适应不断变化的热度分布。HybridTier以概率方式跟踪数据访问,从而降低元数据内存开销;它以少量对应用性能影响可忽略的跟踪误差为代价,换取更高的内存效率。为降低缓存开销,HybridTier使用针对数据局部性优化的轻量级数据结构来跟踪数据热度。评估结果表明,HybridTier的性能比已有系统最多高出91%(几何平均19%),内存开销减少2.0-7.8倍,缓存未命中减少1.7-3.5倍。
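The hotness-tracking idea in the HybridTier abstract (long-term frequency plus short-term momentum, both tracked probabilistically) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the count-min sketch, the class names `CountMinSketch` and `HotnessTracker`, the `momentum_weight` parameter, and the additive scoring rule are all assumptions chosen for illustration.

```python
import hashlib


class CountMinSketch:
    """Fixed-memory approximate counter: may overcount on collisions, never undercounts."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _slots(self, key):
        # One hash per row, derived by salting the key with the row index.
        for row in range(self.depth):
            digest = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, key):
        for row, col in self._slots(key):
            self.rows[row][col] += 1

    def estimate(self, key):
        return min(self.rows[row][col] for row, col in self._slots(key))

    def reset(self):
        self.rows = [[0] * self.width for _ in range(self.depth)]


class HotnessTracker:
    """Combines long-term frequency with short-term momentum (hypothetical scoring rule)."""

    def __init__(self, momentum_weight=8):
        self.long_term = CountMinSketch()
        self.short_term = CountMinSketch()
        self.momentum_weight = momentum_weight

    def record_access(self, page):
        self.long_term.add(page)
        self.short_term.add(page)

    def end_epoch(self):
        # Momentum reflects only the current epoch, so it is cleared periodically.
        self.short_term.reset()

    def hotness(self, page):
        recent = self.short_term.estimate(page)
        return self.long_term.estimate(page) + self.momentum_weight * recent


tracker = HotnessTracker()
for _ in range(100):
    tracker.record_access("page_a")  # historically hot page
tracker.end_epoch()
for _ in range(20):
    tracker.record_access("page_b")  # page that became hot in the current epoch
```

In this sketch memory use is fixed by `width * depth` counters regardless of how many pages are tracked, which is the kind of accuracy-for-memory trade the abstract describes; the weighting lets a recent burst on `page_b` compete with the older history of `page_a`.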
Article 1
Title@2025-07-03 (4): PS-WL: A Probability-Sensitive Wear Leveling scheme for SSD array scaling
Title: PS-WL: A Probability-Sensitive Wear Leveling scheme for SSD array scaling | PS-WL: Ein wahrscheinlichkeitssensitives Wear-Leveling-Schema für die Skalierung von SSD-Arrays | PS-WL:面向SSD阵列扩展的概率敏感磨损均衡方案 2506.19660v2 |
Authors (4): Shuhang Xu, Yunfei Gu, Linhui Liu, Chentao Wu
As flash-based Solid State Drive (SSD) arrays become essential to modern data centers, scaling these arrays to meet explosive data growth is a frequent and critical operation. However, the conventional wear-leveling (WL) paradigm applied during scaling suffers from a fundamental flaw: it ignores the non-linear relationship between wear and failure probability, potentially pushing the most vulnerable, aged disks towards premature failure. To address this issue at its root, we propose the Probability-Sensitive Wear Leveling (PS-WL) scheme, which shifts the optimization goal from balancing wear to directly balancing failure risk. At its core, PS-WL introduces an “effective lifetime” model, derived from a realistic failure probability, to assess disk lifetime more accurately. This model guides a PID controller for wear-leveling operations, with a conservative zone that minimizes performance overhead by restricting warm data migration. Comprehensive simulations validate the superiority of PS-WL over state-of-the-art methods. The results demonstrate that our approach significantly reduces performance overhead while, most critically, consistently and effectively lowering the aggregated array failure risk across diverse system configurations and workloads. This shows that by directly optimizing for reliability, PS-WL builds a scalable storage system that is, by design, fundamentally safer, more efficient, and more stable.
由于基于闪光的固态驱动器(SSD)阵列对现代数据中心至关重要,扩大这些阵列以适应爆炸性数据增长是一项经常和关键的操作。然而,在缩放过程中应用的常规磨损等级(WL)模式存在一个根本性缺陷:它忽视了磨损和故障概率之间的非线性关系,有可能将最脆弱的老磁盘推向过早的失败。为了从根本上解决这一关键问题,我们提议了“概率感应性湿分级(PS-WL)计划 ” , 将优化目标从平衡磨损转向直接平衡故障风险。 在其核心方面, PS-WL 引入了一个“ 有效终身” 模式, 其依据是现实性失败概率来更准确地评估磁盘寿命。 这个模式指导了PID控制器的磨损操作, 保守区通过限制热数据迁移而最大限度地减少性能管理。 全面模拟验证了PS-WL优于最新技术方法的优势。 其结果表明,我们的方法极大地降低了绩效管理,同时,最关键地、持续和有效地降低不同系统配置和工作量的汇总阵列失败风险。这个模型通过直接地证明,更稳定的存储是更稳定的安全。
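The argument in the PS-WL abstract, that equal wear does not mean equal risk when failure probability is non-linear in wear, can be made concrete with a small Python sketch. Everything specific here is an assumption for illustration: the Weibull-shaped `failure_probability` curve and its `shape` and `scale` parameters, the independence of disks in `array_failure_risk`, and the textbook `PID` class, whose gains and error signal are not taken from the paper.

```python
import math


def failure_probability(wear, shape=3.0, scale=1.0):
    """Assumed Weibull-style map from wear fraction (0..1) to failure probability."""
    return 1.0 - math.exp(-((wear / scale) ** shape))


def array_failure_risk(wears):
    """Probability that at least one disk fails, assuming independent disks."""
    survive = 1.0
    for wear in wears:
        survive *= 1.0 - failure_probability(wear)
    return 1.0 - survive


class PID:
    """Textbook PID controller; the output could set how much write load to shift."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, error, dt=1.0):
        self.integral += error * dt
        derivative = (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative


# Two arrays with the same mean wear (0.5) but different wear distributions.
balanced = array_failure_risk([0.5, 0.5])
skewed = array_failure_risk([0.1, 0.9])

# Error signal: risk gap between the most-worn and least-worn disk.
controller = PID(kp=1.0, ki=0.0, kd=0.0)
risk_gap = failure_probability(0.9) - failure_probability(0.1)
write_shift = controller.step(risk_gap)
```

Under the assumed curve, the skewed array carries a higher aggregate risk than the balanced one despite identical mean wear, which is why a controller driven by failure risk rather than raw wear may make different migration decisions.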
Article 2
Title@2025-07-03 (4): FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference
Title: FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference | FlowSpec: Kontinuierliche Pipeline-basierte spekulative Dekodierung für effiziente verteilte LLM-Inferenz | FlowSpec:面向高效分布式LLM推理的连续流水线推测解码 2507.02620v1 |
Authors (4): Xing Liu, Lizhuo Luo, Ming Tang, Chao Huang
Distributed inference is a promising approach to enabling the inference of large language models (LLMs) at the network edge. It distributes the inference process across multiple devices to ensure that the LLMs fit into device memory. Recent pipeline-based approaches have the potential to parallelize communication and computation, which helps reduce inference latency. However, the benefit diminishes when inference requests at the network edge are sparse, because the pipeline is then typically underutilized. To enable efficient distributed LLM inference at the edge, we propose FlowSpec, a pipeline-parallel tree-based speculative decoding framework. FlowSpec incorporates three key mechanisms to improve decoding efficiency: 1) score-based step-wise verification, which prioritizes more important draft tokens so that tokens are accepted earlier; 2) efficient draft management, which prunes invalid tokens while maintaining the correct causal relationship during verification; 3) dynamic draft expansion strategies, which supply high-quality speculative inputs. These techniques work in concert to enhance both pipeline utilization and speculative efficiency. We evaluate FlowSpec on a real-world testbed against other baselines. Experimental results demonstrate that our proposed framework significantly improves inference speed across diverse models and configurations, achieving speedup ratios of 1.36$\times$-1.77$\times$ compared to baselines. Our code is publicly available at https://github.com/Leosang-lx/FlowSpec#
在网络边缘,分布式的推论是一种大语言模型(LLMs)的推论方法,在网络边缘是一种大语言模型(LLMs)的推论方法,很有希望,它将推论过程分散到多个设备,以确保LLMs能够与设备内存相容。最近基于管道的方法有可能平行通信和计算,从而有助于减少推论的延缓度。然而,如果网络边缘的推论要求稀少,而管道通常使用率较低,则好处就会减少。为了在边缘有效分布式LLLM推论,我们提议建立一个基于管道-parllel树的投机解码框架。TextSpec包含三个关键机制,以提高解码效率:(1) 基于分数的分数分分分的分步核查办法,将更重要的代号放在更早一点,以带来折叠的标物;(2)在核查期间,在保持正确的因果关系的同时,高效率地管理普兰松无效的代币;(3) 动态的扩展战略草案,以提供高质量的投机性投入。这些技术工作以音乐方式加强管道利用和投机效率。我们评估在现实-美元-数字的Slex_Slex标准上改进了我们SlexSlex/slex标准在1号/s
Article 3
Title@2025-07-03 (4): MULTI-SCOUT: Multistatic Integrated Sensing and Communications in 5G and Beyond for Moving Target Detection, Positioning, and Tracking
Title: MULTI-SCOUT: Multistatic Integrated Sensing and Communications in 5G and Beyond for Moving Target Detection, Positioning, and Tracking | MULTI-SCOUT: Multistatisches integriertes Sensing und Kommunikation in 5G und darüber hinaus für das Verschieben von Zielerkennung, Positionierung und Tracking | 目标探测、定位和跟踪:用于推进目标探测、定位和跟踪的 5G及5G 以外多空间综合遥感和通信 2507.02613v1 |
Authors (6): Yalin E. Sagduyu, Kemal Davaslioglu, Tugba Erpek, Sastry Kompella, Gustave Anderson, Jonathan Ashdown
This paper presents a complete signal-processing chain for multistatic integrated sensing and communications (ISAC) using the 5G Positioning Reference Signal (PRS). We consider a distributed architecture in which one gNB transmits a periodic OFDM-PRS waveform while multiple spatially separated receivers exploit the same signal for target detection, parameter estimation and tracking. A coherent cross-ambiguity function (CAF) is evaluated to form a range-Doppler map from which the bistatic delay and radial velocity are extracted for every target. For a single target, the resulting bistatic delays are fused through nonlinear least-squares trilateration, yielding a geometric position estimate, and a regularized linear inversion of the radial-speed equations yields a two-dimensional velocity vector, from which speed and heading are obtained. The approach is applied to 2D and 3D settings, extended to account for time synchronization bias, and generalized to multiple targets by resolving target association. The sequence of position-velocity estimates is then fed to standard and extended Kalman filters to obtain smoothed tracks. Our results show high-fidelity moving-target detection, positioning, and tracking using 5G PRS signals for multistatic ISAC.
本文用5G定位参考信号(PRS)为多静态综合遥感和通信提供了一个完整的信号处理链(ISAC),用于多静态综合遥感和通信。我们考虑一个分布式结构,即一个GNB传输一个定期的OFDM-PRS波形,而多个空间分离的接收器利用同一信号进行目标探测、参数估计和跟踪。一个连贯的交叉立体功能(CAF)被评价成一个射程-Doppler地图,每个目标都可以从中提取二等缓存延迟和辐射速度。对于一个单一目标,由此产生的二等延迟会通过非线性最小方位的三角连接而成,产生几何位置估计,以及一个定期化线性线性线性线性对射线式转换产生二维速度矢量,获得速度和航向。该方法适用于2D和3D环境环境,用于计算时间同步偏差,并通过解决目标关联而向多个目标普及。对于位置-速度估计的顺序随后被输入到标准和扩大的Kalman过滤器,以获得平稳轨道。我们的成果显示用于高分辨率定位和多分辨率的IS目标探测。
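The single-target positioning step described in the MULTI-SCOUT abstract above can be illustrated with a minimal sketch. It assumes a noise-free 2D geometry, treats each bistatic delay as already converted to a bistatic range (transmitter-to-target plus target-to-receiver distance), and solves for the position with a plain Gauss-Newton iteration. The solver choice, the function names `bistatic_ranges` and `trilaterate`, and the example coordinates are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def bistatic_ranges(p, tx, rxs):
    """Bistatic range per receiver: |p - tx| + |p - rx_i| (hypothetical helper)."""
    return np.linalg.norm(p - tx) + np.linalg.norm(rxs - p, axis=1)

def trilaterate(ranges, tx, rxs, p0, iters=50, tol=1e-9):
    """Gauss-Newton solve of the nonlinear least-squares position problem."""
    p = np.asarray(p0, dtype=float)
    for _ in range(iters):
        d_tx = p - tx                       # vector from transmitter to estimate
        d_rx = p - rxs                      # vectors from each receiver to estimate
        n_tx = np.linalg.norm(d_tx)
        n_rx = np.linalg.norm(d_rx, axis=1)
        residual = (n_tx + n_rx) - ranges   # predicted minus measured range
        # Jacobian row i: unit vector from tx plus unit vector from rx_i
        J = d_tx / n_tx + d_rx / n_rx[:, None]
        step, *_ = np.linalg.lstsq(J, -residual, rcond=None)
        p = p + step
        if np.linalg.norm(step) < tol:
            break
    return p

# Assumed example geometry (metres): one transmitter, three receivers.
tx = np.array([0.0, 0.0])
rxs = np.array([[100.0, 0.0], [0.0, 100.0], [100.0, 100.0]])
target = np.array([40.0, 70.0])
measured = bistatic_ranges(target, tx, rxs)
est = trilaterate(measured, tx, rxs, p0=[50.0, 50.0])
```

With noisy delays the same loop returns the least-squares fit rather than an exact intersection. A robust implementation would also guard against an estimate that coincides with a sensor position, where the unit vectors are undefined.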
Article 4
Title@2025-07-03 (4): AI Flow: Perspectives, Scenarios, and Approaches
Title: AI Flow: Perspectives, Scenarios, and Approaches | AI Flow: Perspektiven, Szenarien und Ansätze | AI 流动:观点、设想和方法 2506.12479v2 |
Authors (14): Hongjun An, Wenhan Hu, Sida Huang, Siqi Huang, Ruanjun Li, Yuanzhi Liang, Jiawei Shao, Yiliang Song, Zihan Wang, Cheng Yuan, Chi Zhang, Hongyuan Zhang, Wenhao Zhuang, Xuelong Li
Pioneered by the foundational information theory of Claude Shannon and the visionary framework of machine intelligence of Alan Turing, the convergent evolution of information and communication technologies (IT/CT) has created an unbroken wave of connectivity and computation. This synergy has sparked a technological revolution, now reaching its peak with large artificial intelligence (AI) models that are reshaping industries and redefining human-machine collaboration. However, the realization of ubiquitous intelligence faces considerable challenges due to substantial resource consumption in large models and high communication bandwidth demands. To address these challenges, AI Flow has been introduced as a multidisciplinary framework that integrates cutting-edge IT and CT advancements, with a particular emphasis on the following three key points. First, the device-edge-cloud framework serves as the foundation, integrating end devices, edge servers, and cloud clusters to optimize scalability and efficiency for low-latency model inference. Second, we introduce the concept of familial models, which refers to a series of different-sized models with aligned hidden features, enabling effective collaboration and the flexibility to adapt to varying resource constraints and dynamic scenarios. Third, connectivity- and interaction-based intelligence emergence is a novel paradigm of AI Flow. By leveraging communication networks to enhance connectivity, the collaboration among AI models across heterogeneous nodes achieves emergent intelligence that surpasses the capability of any single model. The innovations of AI Flow provide enhanced intelligence, timely responsiveness, and ubiquitous accessibility to AI services, paving the way for the tighter fusion of AI techniques and communication systems.
由于克劳德·香农的基本信息理论和艾伦·图灵的机智智能远见框架的开创性,信息和通信技术(IT/CT)的趋同性演进形成了一个不间断的连通和计算浪潮,这种协同效应引发了技术革命,现在随着大型人工智能(AI)模型的重新塑造工业和重新界定人体机械合作而达到顶峰。然而,由于大型模型中大量资源消耗和高通信带宽需求,实现无处不在的情报面临巨大挑战。为了应对这些挑战,AI流动被引入了多学科框架,将先进的信息技术和CT进步结合起来,特别强调以下三个关键点。首先,装置-顶尖的云形框架作为基础,将终端装置、边缘服务器和云层集群结合起来,优化低电流模型的伸缩性和效率。第二,我们引入了家庭模型的概念,即一系列规模不同的模型,与一致的隐蔽性特征相适应,使得有效的合作和灵活性能够适应不同的资源限制和动态情景。第三,连接性和互动性框架作为基础基础,将连接性和互动性框架作为基础,将最终的智能升级性模型,从而提升AI系统。
Article 5
Title@2025-07-03 (4): Resolving CAP Through Automata-Theoretic Economic Design: A Unified Mathematical Framework for Real-Time Partition-Tolerant Systems
Title: Resolving CAP Through Automata-Theoretic Economic Design: A Unified Mathematical Framework for Real-Time Partition-Tolerant Systems | Lösung von CAP durch Automata-Theoretisches Wirtschaftsdesign: Ein einheitlicher mathematischer Rahmen für Echtzeit-Partitions-Tolerante Systeme | 通过自动化数据理论经济设计解决CAP:实时分区-耐用系统统一数学框架 2507.02464v1 |
Authors (1): Craig S Wright
The CAP theorem asserts a trilemma between consistency, availability, and partition tolerance. This paper introduces a rigorous automata-theoretic and economically grounded framework that reframes the CAP trade-off as a constraint optimization problem. We model distributed systems as partition-aware state machines and embed economic incentive layers to stabilize consensus behavior across adversarially partitioned networks. By incorporating game-theoretic mechanisms into the global transition semantics, we define provable bounds on convergence, liveness, and correctness. Our results demonstrate that availability and consistency can be simultaneously preserved within bounded epsilon margins, effectively extending the classical CAP limits through formal economic control.
CAP 定理主张一致性、 可用性和 分区容忍性之间的三角关系。 本文引入了严格的自动数据理论和经济基础框架, 将CAP 的权衡重新设定为限制优化问题。 我们将分布系统建为有分区意识的国家机器, 并嵌入经济激励层以稳定敌对隔离网络的共识行为。 通过将游戏理论机制纳入全球过渡语义, 我们定义了可测量的趋同、 活性和正确性界限。 我们的结果表明, 可用性和一致性可以同时保存在受约束的Epsilon边际范围内, 通过正式的经济控制有效地扩展了典型CAP的界限。
Article 6
Title@2025-07-03 (4): Red grape detection with accelerated artificial neural networks in the FPGA’s programmable logic
Title: Red grape detection with accelerated artificial neural networks in the FPGA’s programmable logic | Rote Traubenerkennung mit beschleunigten künstlichen neuronalen Netzwerken in der programmierbaren Logik des FPGA | FPGA的可编程逻辑的红葡萄探测与加速人工神经网络 2507.02443v1 |
Authors (5): Sandro Costa Magalhães, Marco Almeida, Filipe Neves dos Santos, António Paulo Moreira, Jorge Dias
Robots usually slow down for scanning to detect objects while moving. Additionally, the robot’s camera is configured with a low framerate to keep pace with the speed of the detection algorithms. These constraints apply while executing tasks and exploring, and they increase the robots’ task execution time. AMD has developed the Vitis-AI framework to deploy detection algorithms into FPGAs. However, this tool does not fully use the FPGAs’ programmable logic (PL). In this work, we use the FINN architecture to deploy three ANNs, MobileNet v1 with 4-bit quantisation, CNV with 2-bit quantisation, and CNV with 1-bit quantisation (BNN), inside an FPGA’s PL. The models were trained on the RG2C dataset, a self-acquired dataset released in open access. MobileNet v1 performed best, reaching a success rate of 98% and an inference speed of 6611 FPS. In this work, we demonstrated that FPGAs can be used to speed up ANNs and make them suitable for attention mechanisms.
机器人在移动时通常会慢下来, 以便让罐头在移动时检测物体 。 此外, 机器人的相机配置低框架速率, 以跟踪检测算法的速度 。 这将在任务执行和探索时受到限制, 使机器人增加任务执行时间 。 AMD 开发了 Vitis- AI 框架, 将检测算法部署到 FPGAs 中。 但是, 这个工具没有完全使用 FPGAs PL 。 在这项工作中, 我们使用 FINN 结构来部署 3 个 ANN、 MobilNet v1 和 4 位四位方位数的 ANN、 CPV 和 CNV 和 1 位四位量化 (BNNN) 。 这些模型在 FPGA 的 PL 中被训练为 RG2C 数据集 。 这是在开放访问中释放的自取数据集 。 MOPNet v1 效果更好, 达到 98 和 611 FPS 的推断速度 。 在这项工作中, 我们证明我们可以使用 FPGA 来加速加速 。
Article 7
Title@2025-07-03 (4): The Artificial Scientist – in-transit Machine Learning of Plasma Simulations
Title: The Artificial Scientist – in-transit Machine Learning of Plasma Simulations | Der Künstliche Wissenschaftler – in-transit maschinelles Lernen von Plasmasimulationen | 人造科学家 – – Plasma模拟模拟的中转机器学习 2501.03383v3 |
Authors (22): Jeffrey Kelling, Vicente Bolea, Michael Bussmann, Ankush Checkervarty, Alexander Debus, Jan Ebert, Greg Eisenhauer, Vineeth Gutta, Stefan Kesselheim, Scott Klasky, Vedhas Pandit, Richard Pausch, Norbert Podhorszki, Franz Poschel, David Rogers, Jeyhun Rustamov, Steve Schmerler, Ulrich Schramm, Klaus Steiniger, Rene Widera, Anna Willmann, Sunita Chandrasekaran
Increasing HPC cluster sizes and large-scale simulations that produce petabytes of data per run create massive IO and storage challenges for analysis. Deep learning-based techniques, in particular, make use of these amounts of domain data to extract patterns that help build scientific understanding. Here, we demonstrate a streaming workflow in which simulation data is streamed directly to a machine-learning (ML) framework, circumventing the file system bottleneck. Data is transformed in transit, asynchronously to the simulation and the training of the model. With the presented workflow, data operations can be performed in common and easy-to-use programming languages, freeing the application user from adapting the application output routines. As a proof of concept, we consider a GPU-accelerated particle-in-cell (PIConGPU) simulation of the Kelvin-Helmholtz instability (KHI). We employ experience replay to avoid catastrophic forgetting in learning from this non-steady process in a continual manner. We detail challenges addressed while porting and scaling to the Frontier exascale system.
深度学习技术,特别是利用这些数量的域内数据来提取有助于建立科学理解的模式。在这里,我们展示了一个流流工作流程,模拟数据直接流到一个机器学习(ML)框架,绕过文件系统瓶颈。数据在中转过程中不同步地转换为模拟和培训模型。随着所提供的工作流程,数据操作可以用通用和易用的编程语言进行,使应用程序用户不必适应应用程序输出常规。作为证据,我们考虑对Kelvin-Helmholtz 进行GPU加速细胞中的粒子模拟(PIConGPU),我们利用经验再玩,避免在从非稳定的进程中不断学习过程中灾难性地遗忘。我们详细介绍了在移植和扩展到前沿外观系统时遇到的挑战。
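The experience-replay idea mentioned in the abstract above (mixing previously seen samples into each training batch so that a model trained on a non-steady stream does not forget earlier states) can be sketched as follows. This is a minimal illustration that assumes a fixed-capacity buffer filled by reservoir sampling. The class `ReplayBuffer`, its method names, and the buffer policy are hypothetical and are not taken from the paper.

```python
import random

class ReplayBuffer:
    """Fixed-size buffer holding a uniform random subset of the stream so far."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0                      # number of samples offered so far
        self.rng = random.Random(seed)

    def add(self, sample):
        """Reservoir sampling (Algorithm R): every sample seen so far stays
        in the buffer with equal probability capacity / seen."""
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(sample)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = sample

    def training_batch(self, fresh, n_replay):
        """Combine freshly streamed samples with replayed older ones."""
        k = min(n_replay, len(self.items))
        return list(fresh) + self.rng.sample(self.items, k)

# Assumed usage: integers stand in for streamed simulation samples.
buf = ReplayBuffer(capacity=4)
for t in range(100):
    buf.add(t)
batch = buf.training_batch(fresh=[100, 101], n_replay=3)
```

In a streaming setting the buffer would be filled by the consumer side of the data stream while training proceeds, so the model sees both the current simulation state and a sample of earlier states.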
Article 8
Title@2025-07-03 (4): Alps, a versatile research infrastructure
Title: Alps, a versatile research infrastructure | Alpen, eine vielseitige Forschungsinfrastruktur | 阿尔卑斯山,多用途研究基础设施 2507.02404v1 |
Authors (3): Maxime Martinasso, Mark Klein, Thomas C. Schulthess
The Swiss National Supercomputing Centre (CSCS) has a long-standing tradition of delivering top-tier high-performance computing systems, exemplified by the Piz Daint supercomputer. However, the increasing diversity of scientific needs has exposed limitations in traditional vertically integrated HPC architectures, which often lack flexibility and composability. To address these challenges, CSCS developed Alps, a next-generation HPC infrastructure designed with a transformative principle: resources operate as independent endpoints within a high-speed network. This architecture enables the creation of independent tenant-specific and platform-specific services, tailored to diverse scientific requirements. Alps incorporates heterogeneous hardware, including CPUs and GPUs, interconnected by a high-performance Slingshot network, and offers a modular storage system. A key innovation is the versatile software-defined cluster (vCluster) technology, which bridges cloud and HPC paradigms. By abstracting infrastructure, service management, and user environments into distinct layers, vClusters allow for customized platforms that support diverse workloads. Current platforms on Alps serve various scientific domains, including numerical weather prediction and AI research.
瑞士国家超高速计算中心(CSCS)有着提供顶级高性能计算系统的长期传统,例如Piz Daint超级计算机;然而,科学需求日益多样化,暴露了传统的纵向一体化高电联结构中的局限性,这些结构往往缺乏灵活性和可复性;为了应对这些挑战,CSCS开发了高电联基础设施Alps,这是下一代高电联基础设施,其设计具有变革性原则:资源在高速网络中作为独立的端点运作;这一架构使得能够创建独立、针对租户和平台的服务,适合不同的科学要求;阿尔卑斯综合了多种硬件,包括CPU和GPUPUs,通过高性能闪光网络相互连接,提供了一个模块储存系统;一项关键的创新是多功能软件定义的集聚群技术,将云和高电联模式连接起来;通过将基础设施、服务管理和用户环境抽象地分化到不同的层, vClusters能够建立支持不同工作量的定制平台;目前阿尔卑斯平台为各种科学领域服务,包括数字天气预报和AI研究服务。
Article 9
Title@2025-07-03 (4): VeFIA: An Efficient Inference Auditing Framework for Vertical Federated Collaborative Software
Title: VeFIA: An Efficient Inference Auditing Framework for Vertical Federated Collaborative Software | VeFIA: Ein effizientes Inferenz-Audit-Framework für vertical Federated Collaborative Software | VEFIA: 垂直联邦合作软件有效推断审计框架 2507.02376v1 |
Authors (6): Chung-ju Huang, Ziqi Zhang, Yinggui Wang, Binghui Wang, Tao Wei, Leye Wang
Vertical Federated Learning (VFL) is a distributed AI software deployment mechanism for cross-silo collaboration without accessing participants’ data. However, existing VFL work lacks a mechanism to audit the execution correctness of the inference software of the data party. To address this problem, we design a Vertical Federated Inference Auditing (VeFIA) framework. VeFIA helps the task party to audit whether the data party’s inference software is executed as expected during large-scale inference without leaking the data privacy of the data party or introducing additional latency to the inference system. The core of VeFIA is that the task party can use the inference results from a framework with Trusted Execution Environments (TEE) and the coordinator to validate the correctness of the data party’s computation results. VeFIA guarantees that, as long as the abnormal inference exceeds 5.4%, the task party can detect execution anomalies in the inference software with a probability of 99.99%, without incurring any additional online inference latency. VeFIA’s random sampling validation achieves 100% positive predictive value, negative predictive value, and true positive rate in detecting abnormal inference. To the best of our knowledge, this is the first paper to discuss the correctness of inference software execution in VFL.
VFIA帮助任务方审计数据方的推论软件是否在大规模推论期间按预期执行,而不会泄露数据方的数据隐私,也不会给推断系统带来额外的延迟。VFIA的核心是任务方可以使用信任执行环境框架和数据方计算结果协调员框架的推论结果来验证数据方计算结果的正确性。 VEFIA保证,只要异常推论超过5.4%,任务方可以检测出推论软件中的执行异常情况,其概率为99.99%,而不会给网上推论系统带来任何额外的延迟。VEFIA的核心是任务方可以使用信任执行环境框架和协调员框架的推论结果来验证数据方计算结果的正确性。VEFIA保证,只要异常推论超过5.4%,任务方可以检测出推论软件中的异常性异常性,且有可能达到99.99%,而不会在网上推论任何额外的拉度。VEFIA的随机抽样验证可以使用100%的精确度来检测我们准确度。
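The kind of guarantee quoted in the VeFIA abstract above (a 99.99% detection probability once abnormal inferences exceed 5.4%) can be related to sample size with a short worked calculation. The sketch assumes that audited inferences are drawn independently and uniformly at random, so a sample of n inferences misses every abnormal one with probability (1 - p)^n. This sampling model and the function names are assumptions for illustration, not the paper's derivation.

```python
import math

def detection_probability(abnormal_fraction, n):
    """Probability that n independent random audits hit at least one anomaly."""
    return 1.0 - (1.0 - abnormal_fraction) ** n

def min_samples(abnormal_fraction, target_detection):
    """Smallest n with detection_probability(abnormal_fraction, n) >= target."""
    miss_per_draw = 1.0 - abnormal_fraction
    return math.ceil(math.log(1.0 - target_detection) / math.log(miss_per_draw))

# Figures taken from the abstract: 5.4% abnormal inferences, 99.99% detection.
n_required = min_samples(0.054, 0.9999)
```

Under these assumptions the required sample size depends only on the abnormal fraction and the target probability, not on the total number of inferences, which is consistent with auditing a small random subset during large-scale inference.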
Article 10
Title@2025-07-03 (4): Flotilla: A scalable, modular and resilient federated learning framework for heterogeneous resources
Title: Flotilla: A scalable, modular and resilient federated learning framework for heterogeneous resources | Flotilla: Ein skalierbarer, modularer und widerstandsfähiger föderierter Lernrahmen für heterogene Ressourcen | 船队:多样化资源的可扩展、模块化和有弹性的联邦学习框架 2507.02295v1 |
Authors (8): Roopkatha Banerjee, Prince Modi, Jinal Vyas, Chunduru Sri Abhijit, Tejus Chandrashekar, Harsha Varun Marisetty, Manik Gupta, Yogesh Simmhan
With the recent improvements in mobile and edge computing and rising concerns about data privacy, Federated Learning (FL) has rapidly gained popularity as a privacy-preserving, distributed machine learning methodology. Several FL frameworks have been built for testing novel FL strategies. However, most focus on validating the learning aspects of FL through pseudo-distributed simulation rather than on deploying on real edge hardware in a distributed manner to meaningfully evaluate the federated aspects from a systems perspective. Current frameworks are also inherently not designed to support asynchronous aggregation, which is gaining popularity, and have limited resilience to client and server failures. We introduce Flotilla, a scalable and lightweight FL framework. It adopts a “user-first” modular design to help rapidly compose various synchronous and asynchronous FL strategies while being agnostic to the DNN architecture. It uses stateless clients and a server design that separates out the session state, which is periodically or incrementally checkpointed. We demonstrate the modularity of Flotilla by evaluating five different FL strategies for training five DNN models. We also evaluate the client and server-side fault tolerance on 200+ clients, and showcase its ability to rapidly fail over within seconds. Finally, we show that Flotilla’s resource usage on Raspberry Pis and Nvidia Jetson edge accelerators is comparable to or better than that of three state-of-the-art FL frameworks, Flower, OpenFL and FedML. It also scales significantly better compared to Flower for 1000+ clients. This positions Flotilla as a competitive candidate to build novel FL strategies on, compare them uniformly, rapidly deploy them, and perform systems research and optimizations.
随着移动和边缘计算的最新改进,以及数据隐私的日益关切,Federal Learning(FL)作为一个隐私保护、分布式的机器学习方法迅速获得普及。一些FL框架已经建成,用于测试新的FL战略。然而,大多数框架的重点是通过假分布模拟来验证FL的学习方面,而不是以分布式方式在真实边缘硬件上进行部署,以便从系统的角度有意义地评估FL的加入方方面面。目前框架本身也并非旨在支持非同步整合,这种整合正在越来越受欢迎,而且对客户和服务器故障的适应力也很有限。我们引入了可缩缩缩放和轻轻的FL框架。我们采用了“用户第一”的模块设计来测试FL的学习方面,以帮助快速制定各种同步和无声调的FL战略,同时对DNNN的架构进行定量或递增检查。我们用5种FL战略来评估5种不同的FL战略的模块。我们还大大评价客户和FLFloral-lielder Veral的快速使用能力,我们用Flixal-lax 和Fleveral的功能展示了300的功能,我们用Flder-lix的功能上显示了比R的功能更棒的功能更能展示。
Article 11
Title@2025-07-03 (4): Domain-Adversarial Transfer Learning for Fault Root Cause Identification in Cloud Computing Systems
Title: Domain-Adversarial Transfer Learning for Fault Root Cause Identification in Cloud Computing Systems | Domain-Adversarial-Transfer-Lernen für fehlerhafte Root-Cause-Identifikation in Cloud Computing-Systemen | 为在云计算系统中查明原因原因而进行校内自动转移学习 2507.02233v1 |
Authors (2): Bruce Fang, Danyi Gao
This paper addresses the challenge of fault root cause identification in cloud computing environments. The difficulty arises from complex system structures, dense service coupling, and limited fault information. To solve this problem, an intelligent identification algorithm based on transfer learning is proposed. The method introduces a shared feature extraction module and a domain adversarial mechanism to enable effective knowledge transfer from the source domain to the target domain. This improves the model’s discriminative ability and generalization performance in the target domain. The model incorporates a pseudo-label selection strategy. When labeled samples are lacking in the target domain, high-confidence predictions are used in training. This enhances the model’s ability to recognize minority classes. To evaluate the stability and adaptability of the method in real-world scenarios, experiments are designed under three conditions: label scarcity, class imbalance, and heterogeneous node environments. Experimental results show that the proposed method outperforms existing mainstream approaches in several key metrics, including accuracy, F1-Score, and AUC. The model demonstrates stronger discriminative power and robustness. Notably, under extreme class imbalance and significant structural differences in the target domain, the model still maintains high performance. This validates the effectiveness and practical value of the proposed mechanisms in complex cloud computing systems.
本文探讨云计算环境中的缺陷根源识别挑战。 困难来自复杂的系统结构、密集的服务连接和有限的缺陷信息。 为了解决这一问题, 提出了基于转移学习的智能识别算法。 方法引入了一个共享特征提取模块和一个域对称机制, 以便从源域向目标域进行有效的知识转移。 这改善了模型在目标域的偏差能力和概括性表现。 模型包含一个假标签选择战略。 当标注样本在目标域缺乏时, 使用高可信度预测方法来进行培训。 这增强了模型识别少数群体类别的能力。 为了评估在现实世界情景中方法的稳定性和适应性, 实验是在三个条件下设计的: 标签稀缺性、 阶级不平衡和多变节点环境。 实验结果表明, 拟议的方法超越了在几个关键指标领域( 包括准确性、 F1- 核心和 AUC ) 的现有主流方法。 该模型显示了更强的歧视性力量和强健性。 很明显, 在极端的类别不平衡性和目标域的重大结构差异下, 模型仍然保持高性能。
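The pseudo-label selection strategy mentioned in the abstract above (using high-confidence predictions on unlabeled target-domain samples as training labels) can be sketched in a few lines. The confidence threshold of 0.9, the function name `select_pseudo_labels`, and the toy probabilities are assumptions chosen for illustration. The paper's actual selection rule is not specified in the abstract.

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.9):
    """Return indices and predicted classes of samples whose top class
    probability is at least `threshold` (hypothetical selection rule)."""
    probs = np.asarray(probs, dtype=float)
    confidence = probs.max(axis=1)          # top class probability per sample
    labels = probs.argmax(axis=1)           # predicted class per sample
    keep = np.flatnonzero(confidence >= threshold)
    return keep, labels[keep]

# Assumed toy predictions for three unlabeled target-domain samples.
probs = [[0.95, 0.05],
         [0.60, 0.40],
         [0.08, 0.92]]
keep, pseudo = select_pseudo_labels(probs)
```

Only the first and third samples pass the threshold here. The ambiguous second sample is left out of training, which limits the amount of label noise fed back into the model.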
Article 12
Title@2025-07-02 (3): Signalling Health for Improved Kubernetes Microservice Availability
Title: Signalling Health for Improved Kubernetes Microservice Availability | Signalisierung der Gesundheit für verbesserte Kubernetes Mikroservice-Verfügbarkeit | 改善Kubernetes微服务提供情况 2507.02158v1 |
Authors (3): Jacob Roberts, Blair Archibald, Phil Trinder
Microservices are often deployed and managed by a container orchestrator that can detect and fix failures to maintain the service availability critical in many applications. In Poll-based Container Monitoring (PCM), the orchestrator periodically checks container health. While a common approach, PCM requires careful tuning, may degrade service availability, and can be slow to detect container health changes. An alternative is Signal-based Container Monitoring (SCM), where the container signals the orchestrator when its status changes. We present the design, implementation, and evaluation of an SCM approach for Kubernetes and empirically show that it has benefits over PCM, as predicted by a new mathematical model. We compare the service availability of SCM and PCM over six experiments using the SockShop benchmark. SCM does not require that polling intervals are tuned, and yet detects container failure 86% faster than PCM and container readiness in a comparable time with limited resource overheads. We find PCM can erroneously detect failures, and this reduces service availability by 4%. We propose that orchestrators offer SCM features for faster failure detection than PCM without erroneous detections or careful tuning.
在一个通用办法中,PCM需要仔细调整,可能降低服务的提供,而且可以缓慢地检测集装箱健康的变化。另一个办法是基于信号的集装箱监测,集装箱监测在状态变化时可以向管弦乐队发出信号。我们介绍Kubernetes 的SCM 方法的设计、实施和评估,并用经验证明,它比PCM 和PCM 服务比PCM有益处,正如新的数学模型所预测的那样。我们比较SCM 和PCM 的可用性与使用SockShop基准的六次试验相比。SCC并不要求调整投票间隔时间,而是在资源管理有限的情况下,比PCM和集装箱就绪状态更快86 。我们发现PCM可以错误地探测出故障,从而减少服务供应量4。我们建议,管弦乐队提供SCM的功能,以便在没有错误的检测或仔细调整的情况下,比PCM更快地检测出故障。
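Why poll-based monitoring can be slow to notice a failure, as the abstract above argues, follows from simple arithmetic. The sketch below assumes that a failure occurs at a uniformly random point within a polling interval and that the orchestrator acts only after a given number of consecutive failed probes, in the spirit of the Kubernetes probe settings `periodSeconds` and `failureThreshold`. It ignores probe timeouts and scheduling jitter, and it is not the mathematical model from the paper.

```python
def poll_detection_delay(period_s, failure_threshold=1):
    """Mean and worst-case time from failure to detection under polling.

    The first failed probe arrives period_s / 2 after the failure on average
    (at most period_s), and each further required failed probe adds one
    full period.
    """
    mean = period_s / 2 + (failure_threshold - 1) * period_s
    worst = failure_threshold * period_s
    return mean, worst

# Assumed example: probe every 10 s, act after 3 consecutive failures.
mean_delay, worst_delay = poll_detection_delay(10, 3)
```

Shortening the period reduces the delay but increases probe traffic and the chance of acting on a transient probe failure, which is the tuning burden that a signal-based approach avoids.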
Article 13
Title@2025-07-02 (3): Fundamental Limits of Hierarchical Secure Aggregation with Cyclic User Association
Title: Fundamental Limits of Hierarchical Secure Aggregation with Cyclic User Association | Grundlegende Grenzen der Hierarchischen Sicheren Aggregation mit Cyclic User Association | 与cycclic用户协会的等级安全分类基本限制 2503.04564v4 |
Authors (6): Xiang Zhang, Zhou Li, Kai Wan, Hua Sun, Mingyue Ji, Giuseppe Caire
Secure aggregation is motivated by federated learning (FL), where a cloud server aims to compute an averaged model (i.e., weights of deep neural networks) of the locally-trained models of numerous clients, while adhering to data security requirements. Hierarchical secure aggregation (HSA) extends this concept to a three-layer hierarchical network, where clustered users communicate with the server through an intermediate layer of relays. In HSA, beyond conventional server security, relay security is also enforced to ensure that the relays remain oblivious to the users’ inputs (an abstraction of the local models in FL). Existing work on HSA assumes that each user is associated with only one relay, limiting opportunities for coding across inter-cluster users to achieve efficient communication and key generation. In this paper, we consider HSA with a cyclic association pattern where each user is connected to $B$ consecutive relays in a wrap-around manner. We propose an efficient aggregation scheme which includes a message design for the inputs inspired by gradient coding, a well-known technique for efficient communication in distributed computing, along with a highly non-trivial security key design. We also derive novel converse bounds on the minimum achievable communication and key rates using information-theoretic arguments.
安全聚合的动机是联合学习(FL),云服务器的目的是计算当地培训的众多客户的模型的平均模型(即深神经网络的重量),同时遵守数据安全要求。 等级安全聚合(HSA)将这一概念推广到三级分级网络,集中用户通过中间继电器与服务器进行交流。在HSA中,除了常规服务器安全外,还实施中继安全,以确保中继器不为用户的投入所触动(即FL中当地模型的抽象化)。HSA的现有研究假设,每个用户只与一个中继器相关,从而限制了各组用户为高效通信和关键生成进行编码的机会。在本文件中,我们考虑HSA采用循环联系模式,即每个用户都通过中间继电器连接到$B$的连续继电器。我们提出了一个高效的汇总计划,其中包括由梯度编码驱动的投入信息设计信息设计,这是一种众所周知的技术,用于在分布式计算机时高效通信,同时使用高度非三端关键关键参数,同时使用最低关键参数。
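The cyclic association pattern described in the abstract above, where each user connects to B consecutive relays in a wrap-around manner, can be made concrete with a small sketch. The zero-based indexing, the convention that user u starts at relay u, and the function names are assumptions made for illustration only.

```python
def relays_of_user(user, num_relays, B):
    """Relays serving `user`: B consecutive indices, wrapping around."""
    return [(user + j) % num_relays for j in range(B)]

def users_of_relay(relay, num_users, num_relays, B):
    """Users attached to `relay` under the same cyclic pattern."""
    return [u for u in range(num_users)
            if relay in relays_of_user(u, num_relays, B)]

# Assumed example: 5 users, 5 relays, each user connected to B = 2 relays.
last_user_relays = relays_of_user(4, num_relays=5, B=2)
relay0_users = users_of_relay(0, num_users=5, num_relays=5, B=2)
```

With equal numbers of users and relays, every relay ends up serving exactly B users, so the connectivity is symmetric across relays.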
Article 14
Title@2025-07-02 (3): SAKURAONE: Empowering Transparent and Open AI Platforms through Private-Sector HPC Investment in Japan
Title: SAKURAONE: Empowering Transparent and Open AI Platforms through Private-Sector HPC Investment in Japan | SAKURAONE: Stärkung transparenter und offener KI-Plattformen durch Private-Sector HPC-Investitionen in Japan | SAKURAONE:通过在日本的私营部门HPC投资增强透明和开放的AI平台的权能 2507.02124v1 |
Authors (1): Fumikazu Konishi
SAKURAONE is a managed high performance computing (HPC) cluster developed and operated by the SAKURA Internet Research Center. It reinforces the “KOKARYOKU PHY” configuration of bare-metal GPU servers and is designed as a cluster computing resource optimized for advanced workloads, including large language model (LLM) training. In the ISC 2025 edition of the TOP500 list, SAKURAONE was ranked 49th in the world based on its High Performance Linpack (HPL) score, demonstrating its global competitiveness. In particular, it is the only system within the top 100 that employs a fully open networking stack based on 800 GbE (Gigabit Ethernet) and the SONiC (Software for Open Networking in the Cloud) operating system, highlighting the viability of open and vendor-neutral technologies in large-scale HPC infrastructure. SAKURAONE achieved a sustained performance of 33.95 PFLOP/s on the HPL benchmark (Rmax), and 396.295 TFLOP/s on the High Performance Conjugate Gradient (HPCG) benchmark. For the HPL-MxP benchmark, which targets low-precision workloads representative of AI applications, SAKURAONE delivered an impressive 339.86 PFLOP/s using FP8 precision. The system comprises 100 compute nodes, each equipped with eight NVIDIA H100 GPUs. It is supported by an all-flash Lustre storage subsystem with a total physical capacity of 2 petabytes, providing high-throughput and low-latency data access. Internode communication is enabled by a full-bisection bandwidth interconnect based on a Rail-Optimized topology, where the Leaf and Spine layers are interconnected via 800 GbE links. This topology, in combination with RoCEv2 (RDMA over Converged Ethernet version 2), enables high-speed, lossless data transfers and mitigates communication bottlenecks in large-scale parallel workloads.
SAKURAONE 是SAKURA Internet 研究中心开发并运行的有管理的高性能计算集(HPC) 。 它强化了“ KOKARYOKU PHY” 的光金属 GPU 服务器配置, 设计为用于高级工作量的集束计算资源, 包括大型语言模型( LLIM) 培训。 在 TOP 500 列表的2025 版中, SAKURAONE 根据其高性能 Linpack (HPL) 评分, 在世界排名中排名为\ textbffffret 高性能。 SAKURAF 实现了33- slovelop- suple commission 技术在高性能中持续运行, 在 HPC 高性能中, 使用高性能 IMO/ SDFS 的高级性能( IMO) 和高性能 IMFTFAL , 提供高性能( IMO- TIMLA) IMLA 完全性能 提供 IMBTFIL) 。
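The benchmark figures quoted in the SAKURAONE abstract above can be put in proportion with a short calculation. The sketch uses only numbers stated in the abstract (100 nodes, eight GPUs per node, and the HPL, HPCG and HPL-MxP results). The variable names and the derived ratios are illustrative and are not reported by the author.

```python
# Figures as stated in the abstract.
nodes = 100
gpus_per_node = 8
hpl_flops = 33.95e15        # HPL Rmax, 33.95 PFLOP/s
hpcg_flops = 396.295e12     # HPCG, 396.295 TFLOP/s
mxp_flops = 339.86e15       # HPL-MxP with FP8, 339.86 PFLOP/s

total_gpus = nodes * gpus_per_node

# Derived quantities (not stated in the abstract).
hpl_tflops_per_gpu = hpl_flops / total_gpus / 1e12
hpcg_to_hpl = hpcg_flops / hpl_flops
mxp_to_hpl = mxp_flops / hpl_flops

print(f"GPUs: {total_gpus}")
print(f"HPL per GPU: {hpl_tflops_per_gpu:.2f} TFLOP/s")
print(f"HPCG / HPL: {hpcg_to_hpl:.2%}")
print(f"HPL-MxP / HPL: {mxp_to_hpl:.2f}x")
```

The small HPCG-to-HPL ratio reflects that HPCG is bound by memory and communication rather than by floating-point throughput, while the HPL-MxP ratio shows the gain from low-precision arithmetic on the same hardware.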
Article 15
Title@2025-07-02 (3): Parallelization of Network Dynamics Computations in Heterogeneous Distributed Environment
Title: Parallelization of Network Dynamics Computations in Heterogeneous Distributed Environment | Parallelisierung von Network Dynamics Computations in heterogener verteilter Umgebung | 不同差异分布环境中网络动态计算平行化 2410.19075v2 |
Authors (2): Oleksandr Sudakov, Volodymyr Maistrenko
This paper addresses the problem of parallelizing computations to study non-linear dynamics in large networks of non-locally coupled oscillators using heterogeneous computing resources. The proposed approach can be applied to a variety of non-linear dynamics models with runtime specification of parameters and network topologies. Parallelizing the solution of equations for different network elements is performed transparently and, in contrast to available tools, does not require parallel programming from end-users. The runtime scheduler takes into account the performance of computing and communication resources to reduce downtime and to achieve a quasi-optimal parallelizing speed-up. The proposed approach was implemented, and its efficiency is demonstrated by numerous applications: simulating large dynamical networks with 10^3 to 10^8 elements described by Hodgkin-Huxley, FitzHugh-Nagumo, and Kuramoto models, investigating pathological synchronization during Parkinson’s disease, analyzing multi-stability, studying chimera and solitary states in 3D networks, etc. All the above computations may be performed using symmetrical multiprocessors, graphic processing units, and a network of workstations within the same run. It was demonstrated that near-linear speed-up can be achieved for large networks. The proposed approach is promising for extension to new hardware like edge-computing devices.
本文探讨利用多种计算资源,研究大型网络中非本地联动振动器网络的非线性动力学的平行计算问题。提议的方法可以适用于各种非线性动态模型,并具有运行时间参数和网络地形的规格。不同网络元素的等式解决办法的平行执行是透明的,与现有工具不同,并不需要终端用户的平行编程。运行时间表计时器考虑到计算和通信资源的性能,以减少停机率,实现准最佳的平行加速。拟议的方法已经实施,其效率得到许多应用的证明:模拟大型动态网络的模拟应用,包括Hodgkin-Huxley、FitzHugh-Nagumo和Kuramoto模型所描述的10+3-10-10-10-10+8要素。不同网络元素的平行解决方案是透明的,用于调查帕金森疾病期间的病理学同步性,分析多功能性,用于研究3D网络中的恰门和单独状态等。所有上述计算方法都可以使用对称的多功能处理器、图形处理器和近端端端端网络。在运行中示范的大型硬件网络,运行和升级。
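One of the model classes named in the abstract above, Kuramoto oscillators with non-local coupling, can be written down compactly. The sketch assumes a 1D ring in which each oscillator is coupled to its P nearest neighbours on each side, a phase-lag parameter alpha, and an explicit Euler step. The function names, parameter values, and integration scheme are illustrative assumptions and not the authors' software.

```python
import numpy as np

def kuramoto_rhs(theta, omega, K, P, alpha=0.0):
    """d(theta_i)/dt = omega + K/(2P) * sum over the 2P nearest ring
    neighbours j of sin(theta_j - theta_i - alpha)."""
    coupling = np.zeros_like(theta)
    for k in range(1, P + 1):
        # np.roll shifts the ring so each element lines up with a neighbour
        for neighbour in (np.roll(theta, k), np.roll(theta, -k)):
            coupling += np.sin(neighbour - theta - alpha)
    return omega + (K / (2 * P)) * coupling

def euler_step(theta, dt, **params):
    """One explicit Euler step of the phase dynamics."""
    return theta + dt * kuramoto_rhs(theta, **params)

# Assumed example: 16 oscillators starting fully synchronized.
theta0 = np.zeros(16)
rate = kuramoto_rhs(theta0, omega=1.0, K=0.5, P=3)
theta1 = euler_step(theta0, dt=0.01, omega=1.0, K=0.5, P=3)
```

Each oscillator's update reads only its 2P neighbours, so the ring can be split into contiguous blocks across workers with a halo of P elements exchanged per step. This is the kind of per-element independence that parallel schedulers exploit.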
Article 16
Title@2025-07-02 (3): Analyzing Common Electronic Structure Theory Algorithms for Distributed Quantum Computing
Title: Analyzing Common Electronic Structure Theory Algorithms for Distributed Quantum Computing | Analyse gemeinsamer elektronischer Strukturtheorien Algorithmen für verteiltes Quantenrechnen | 分布量量计算法的通用电子结构理论比值 2507.01902v1 |
Authors (2): Grier M. Jones, Hans-Arno Jacobsen
To move towards the utility era of quantum computing, many corporations have posed distributed quantum computing (DQC) as a framework for scaling the current generation of devices for practical applications. One of these applications is quantum chemistry, also known as electronic structure theory, which has been positioned as a “killer application” of quantum computing. To this end, we analyze five electronic structure methods, found in common packages such as Tequila and ffsim, which can be easily interfaced with the Qiskit Circuit Cutting addon. Herein, we provide insights into cutting these algorithms using local operations (LO) to determine their aptitude for distribution. The key findings of our work are that many of these algorithms cannot be efficiently parallelized using LO, and new methods must be developed to apply electronic structure theory within a DQC framework.
为了向量子计算实用时代迈进,许多公司将分布量子计算(DQC)作为扩大目前生产的实际应用装置的框架。其中一项应用是量子化学,也称为电子结构理论,它一直被定位为量子计算的一种“杀手应用 ” 。为此,我们分析了五种电子结构方法,这些方法见于诸如Tequila和ffsim等通用软件包中,它们很容易与Qiskit Cirectric Cutting 补充软件连接。在这里,我们提供洞察力,说明如何利用当地操作(LO)来截断这些算法,以确定其分配能力。我们工作的关键结论是,许多这些算法无法有效地同时使用LO,必须开发新的方法,以便在DQC框架内应用电子结构理论。
Article 17
Title@2025-07-02 (3): Evolving HPC services to enable ML workloads on HPE Cray EX
Title: Evolving HPC services to enable ML workloads on HPE Cray EX | Evolving HPC-Dienste, um ML-Workloads auf HPE Cray EX zu ermöglichen | 不断演化的HPC服务,使HPE Cray EX 的ML工作量得以完成 2507.01880v1 |
Authors (13): Stefano Schuppli, Fawzi Mohamed, Henrique Mendonça, Nina Mujkanovic, Elia Palme, Dino Conciatore, Lukas Drescher, Miguel Gila, Pim Witlox, Joost VandeVondele, Maxime Martinasso, Thomas C. Schulthess, Torsten Hoefler
The Alps Research Infrastructure leverages GH200 technology at scale, featuring 10,752 GPUs. Accessing Alps provides a significant computational advantage for researchers in Artificial Intelligence (AI) and Machine Learning (ML). While Alps serves a broad range of scientific communities, traditional HPC services alone are not sufficient to meet the dynamic needs of the ML community. This paper presents an initial investigation into extending HPC service capabilities to better support ML workloads. We identify key challenges and gaps we have observed since the early-access phase (2023) of Alps by the Swiss AI community and propose several technological enhancements. These include a user environment designed to facilitate the adoption of HPC for ML workloads, balancing performance with flexibility; a utility for rapid performance screening of ML applications during development; observability capabilities and data products for inspecting ongoing large-scale ML workloads; a utility to simplify the vetting of allocated nodes for compute readiness; a service plane infrastructure to deploy various types of workloads, including support and inference services; and a storage infrastructure tailored to the specific needs of ML workloads. These enhancements aim to facilitate the execution of ML workloads on HPC systems, increase system usability and resilience, and better align with the needs of the ML community. We also discuss our current approach to security aspects. This paper concludes by placing these proposals in the broader context of changes in the communities served by HPC infrastructure like ours.
阿尔卑斯山研究基础设施在规模上利用GH200技术,拥有10,752个全球动力源;阿尔卑斯山利用阿尔卑斯山为人工智能(AI)和机器学习(ML)的研究人员提供了巨大的计算优势;阿尔卑斯山为科学界提供了巨大的计算优势;阿尔卑斯山为大量科学界服务,但传统的高常委会服务本身不足以满足ML社区的动态需求;本文件对扩大高常委会服务能力以更好地支持最低业务水平工作量进行了初步调查;我们确定了瑞士阿尔卑斯山早期进入阶段(2023年)以来我们观察到的主要挑战和差距,并提出了几项技术改进建议,其中包括一个用户环境,旨在便利采用高常委会处理最低业务负荷工作量,平衡业绩与灵活性;在开发过程中快速业绩筛选应用高常力应用软件的效用;用于检查当前大型最低业务预算单位工作量的可耐用性和数据产品;用于简化所分配的无记记号的审评;用于部署各种工作量的服务飞机基础设施,包括支助和推断服务;以及适应高常委工作量具体需要的储存基础设施;这些加强用户环境环境环境,目的是促进在更大范围内执行高常地管理我们的安全要求。
Article 18
Title@2025-07-02 (3): Not eXactly Byzantine: Efficient and Resilient TEE-Based State Machine Replication
Title: Not eXactly Byzantine: Efficient and Resilient TEE-Based State Machine Replication | Nicht eXactly Byzantin: Effiziente und resiliente TEE-basierte State Machine Replication | 不完全拜占庭:高效且有弹性的基于TEE的状态机复制 2501.11051v3 |
Authors (2): Marc Leinweber, Hannes Hartenstein
We propose, implement, and evaluate NxBFT, a resilient and efficient State Machine Replication protocol using Trusted Execution Environments (TEEs). NxBFT focuses on a “Not eXactly Byzantine” (NxB) operating model as a middle ground between crash and Byzantine fault tolerance. NxBFT’s consensus layer is asynchronous, graph-based, leaderless, and optimized for the NxB operating model, enabling load-balancing of requests between replicas and, in fault-free cases, two network round trips between decisions. We identify fundamental issues with crash recovery due to the use of TEEs in asynchrony that can only be circumvented by relying on synchrony for liveness. We provide a throughput-latency trade-off analysis of NxBFT, Chained-Damysus (rotating leader), and MinBFT (static leader) for up to 40 replicas and network round trip latencies up to 150 ms. NxBFT achieves the highest throughput in all scenarios. When small latencies are required, MinBFT and Damysus are at an advantage, with Damysus benefiting from the NxB model in terms of throughput for small deployments. In contrast to leader-based approaches, NxBFT’s performance is almost unaffected when actual crash faults occur.
我们提出、实施并评价NxBFT,这是一个使用信任的执行环境(TEEs)的有弹性和高效的国家机器复制协议。 NxBFT侧重于“非Xactly Byzantine”(NxB)运行模式,作为崩溃和拜占庭断裂容忍度之间的中间地带。 NxBFT的共识层为NxBB业务模式提供了无同步、基于图形、无领头和优化,使复制和网络之间的请求在复制和在无故障情况下的两次决定之间的网络往返中实现负重平衡。我们确定了由于在无系统断层中使用TEE(NxB)系统而导致的崩溃恢复的根本问题,只有依靠同步来避免这种故障。我们提供了对NxBFT、连锁-Damputus(旋转头领)和MinBFT(静态领导)的超常交易分析,因为基于复制和网络的跨周期通度高达40 ms。 NxFTFT在所有情景中实现了最高断层优势。当需要小型部署时,NFTFT和小型部署时,MB公司则从小的比值领先于小型部署时,则要从小的对比。
Article 19
Title@2025-07-02 (3): GPU-based complete search for nonlinear minimization subject to bounds
Title: GPU-based complete search for nonlinear minimization subject to bounds | GPU-basierte komplette Suche nach nichtlinearer Minimierung unter Grenzen | 基于 GPU 的有界约束非线性最小化完全搜索 2507.01770v1 |
Authors (3): Guanglu Zhang, Qihang Shan, Jonathan Cagan
This paper introduces a GPU-based complete search method to enclose the global minimum of a nonlinear function subject to simple bounds on the variables. Using interval analysis, coupled with the computational power and architecture of GPU, the method iteratively rules out the regions in the search domain where the global minimum cannot exist and leaves a finite set of regions where the global minimum must exist. For effectiveness, because of the rigor of interval analysis, the method is guaranteed to enclose the global minimum of the nonlinear function even in the presence of rounding errors. For efficiency, the method employs a novel GPU-based single program, single data parallel programming style to circumvent major GPU performance bottlenecks, and a variable cycling technique is also integrated into the method to reduce computational cost when minimizing large-scale nonlinear functions. The method is validated by minimizing 10 multimodal benchmark test functions with scalable dimensions, including the well-known Ackley function, Griewank function, Levy function, and Rastrigin function. These benchmark test functions represent grand challenges of global optimization, and enclosing the guaranteed global minimum of these benchmark test functions with more than 80 dimensions has not been reported in the literature. Our method completely searches the feasible domain and successfully encloses the guaranteed global minimum of these 10 benchmark test functions with up to 10,000 dimensions using only one GPU in a reasonable computation time, far exceeding the reported results in the literature due to the unique method design and implementation based on GPU architecture.
本文引入了一个基于 GPU 的完整搜索方法, 以包含非线性函数的全球最小值, 并受变量的简单界限 。 使用间隔分析, 加上 GPU 的计算力和架构, 该方法迭代排除了搜索域中无法存在全球最低值的区域, 并留下了一组必须存在全球最低值的区域 。 关于有效性, 由于间隔分析的严格性能, 该方法保证包含非线性函数的全球最低值 , 即使存在四舍五入错误 。 关于效率, 该方法使用一个新的基于 GPU 的单一程序、 单一数据平行编程风格, 以绕过 GPU 的主要性能瓶颈, 以及可变的循环法技术也被纳入了在尽量减少大规模非线性功能时降低计算成本的方法 。 对于该方法的验证方法是将10个具有可缩放尺寸的多模式基准测试功能, 包括众所周知的 Ackley 函数、 Griewank 函数、 Levy 函数 和 Rastrigin 函数 。 这些基准测试功能是全球优化的巨大挑战, , 这些基准性测试功能代表着全球优化的巨大挑战, , 将这些基准性测试功能与超过 80个范围的保证最低值的最低限度的最小值连接值连接值连接值 , 在10 上, 我们的模型在10个基准值中报告了10个基准值的模型的搜索中只基底域域域域域域中, 。
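The rule-out loop described in this abstract can be illustrated with a minimal serial sketch. It is not the authors' GPU method: it runs on the CPU, handles only the 1-D Rastrigin function, and uses plain floating-point interval bounds without the outward rounding that rigorous interval analysis requires. The names `rastrigin_bounds` and `enclose_minimum` are hypothetical.

```python
import heapq
import math


def rastrigin_bounds(lo, hi):
    """Enclose f(x) = 10 + x^2 - 10*cos(2*pi*x) over the interval [lo, hi].

    Assumption: ordinary float arithmetic, no directed (outward) rounding.
    """
    sq_lo = 0.0 if lo <= 0.0 <= hi else min(lo * lo, hi * hi)
    sq_hi = max(lo * lo, hi * hi)
    c_lo, c_hi = math.cos(2 * math.pi * lo), math.cos(2 * math.pi * hi)
    # cos(2*pi*x) reaches 1 at integers and -1 at half-integers.
    cos_max = 1.0 if math.ceil(lo) <= math.floor(hi) else max(c_lo, c_hi)
    cos_min = -1.0 if math.ceil(lo - 0.5) <= math.floor(hi - 0.5) else min(c_lo, c_hi)
    return 10.0 + sq_lo - 10.0 * cos_max, 10.0 + sq_hi - 10.0 * cos_min


def enclose_minimum(lo, hi, tol=1e-6):
    """Best-first branch and bound: discard boxes that cannot hold the minimum."""
    best_upper = math.inf  # smallest function value seen so far
    work = [(rastrigin_bounds(lo, hi)[0], lo, hi)]
    kept = []
    while work:
        f_lo, a, b = heapq.heappop(work)
        if f_lo > best_upper:
            continue  # the global minimum cannot lie in this box
        mid = 0.5 * (a + b)
        best_upper = min(best_upper, rastrigin_bounds(mid, mid)[1])
        if b - a < tol:
            kept.append((f_lo, a, b))  # small enough: keep as a candidate box
            continue
        for c, d in ((a, mid), (mid, b)):
            child_lo = rastrigin_bounds(c, d)[0]
            if child_lo <= best_upper:
                heapq.heappush(work, (child_lo, c, d))
    boxes = [(a, b) for f_lo, a, b in kept if f_lo <= best_upper]
    return best_upper, boxes


best, boxes = enclose_minimum(-5.12, 5.12)
```

The surviving boxes form the finite set of regions that may still contain the global minimum. Each box can be processed independently once the current upper bound is shared, which is the property a GPU implementation would exploit.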
Article 20
Title@2025-07-02 (3): Deep Recommender Models Inference: Automatic Asymmetric Data Flow Optimization
Title: Deep Recommender Models Inference: Automatic Asymmetric Data Flow Optimization | Deep Recommender Models Inferenz: Automatische Asymmetrische Datenflussoptimierung | 深度推荐模型推理:自动非对称数据流优化 2507.01676v1 |
Authors (4): Giuseppe Ruggeri, Renzo Andri, Daniele Jahier Pagliari, Lukas Cavigelli
Deep Recommender Models (DLRMs) inference is a fundamental AI workload accounting for more than 79% of the total AI workload in Meta’s data centers. DLRMs’ performance bottleneck is found in the embedding layers, which perform many random memory accesses to retrieve small embedding vectors from tables of various sizes. We propose the design of tailored data flows to speed up embedding look-ups. Namely, we propose four strategies to look up an embedding table effectively on one core, and a framework to automatically map the tables asymmetrically to the multiple cores of a SoC. We assess the effectiveness of our method using the Huawei Ascend AI accelerators, comparing it with the default Ascend compiler, and we perform high-level comparisons with Nvidia A100. Results show a speed-up varying from 1.5x up to 6.5x for real workload distributions, and more than 20x for extremely unbalanced distributions. Furthermore, the method proves to be much more independent of the query distribution than the baseline.
深建议模型(DLRMs)的推断是AI(AI)的一个基本工作量,占Meta数据中心AI工作量总量的79%以上。 DLRM的性能瓶颈存在于嵌入层中,这些层可以进行许多随机的内存访问,从不同大小的表格中检索小型嵌入矢量。我们建议设计量身定制的数据流,以加快嵌入查看。也就是说,我们建议了四个战略,以便有效地在一个核心上查看嵌入表,并提出了一个框架,以对 SoC的多个核心进行不对称地自动绘制表格。我们用Huawei Ascend AI加速器来评估我们的方法的有效性,将其与默认的Ascend 编译器进行比较,我们还与Nvidia A100进行高层次的比较。结果显示,实际工作量分布的速度从1.5x到6.5x不等,极端不平衡的分布则超过20x。此外,这一方法证明比基线更独立于查询分布的方法。
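To make the idea of an asymmetric table-to-core mapping concrete, here is a small sketch under stated assumptions. It is not the paper's framework or any of its four look-up strategies: it is a generic greedy heuristic (heaviest table first, onto the least-loaded core), and the per-table load numbers and the name `map_tables_to_cores` are hypothetical.

```python
def map_tables_to_cores(table_loads, num_cores):
    """Greedy load balancing: place the heaviest table on the least-loaded core.

    table_loads: dict mapping table name -> expected look-up load (assumed known).
    Returns (assignment, loads), where assignment maps core index -> table names.
    """
    loads = [0] * num_cores
    assignment = {core: [] for core in range(num_cores)}
    # sorted() is stable, so tables with equal load keep their input order.
    for name, load in sorted(table_loads.items(), key=lambda kv: -kv[1]):
        core = min(range(num_cores), key=loads.__getitem__)
        assignment[core].append(name)
        loads[core] += load
    return assignment, loads


# One hot table and three cooler ones on a two-core device (illustrative numbers).
assignment, loads = map_tables_to_cores({"hot": 90, "a": 30, "b": 30, "c": 30}, 2)
```

With these numbers one core serves only the hot table while the other serves the remaining three. The table counts per core differ, but the load is even.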
Article 21
Title@2025-07-02 (3): Melding the Serverless Control Plane with the Conventional Cluster Manager for Speed and Compatibility
Title: Melding the Serverless Control Plane with the Conventional Cluster Manager for Speed and Compatibility | Verschmelzen des serverlosen Steuerplans mit dem konventionellen Clustermanager für Geschwindigkeit und Kompatibilität | 与用于速度和兼容性的常规集管理器管理器熔化无服务器控制平面 2505.24551v2 |
Authors (6): Leonid Kondrashov, Lazar Cvetković, Hancheng Wang, Boxi Zhou, Dhairya Rungta, Dmitrii Ustiugov
Modern serverless applications, often interactive with highly volatile traffic, challenge system scalability, demanding control planes that deliver low latency and cost efficiency. Analysis of production traces and existing systems reveals that current control plane designs (synchronous and asynchronous), particularly when built on conventional cluster managers like Kubernetes, struggle with this balance, often wasting significant CPU and memory resources on creating underutilized or idle instances. While clean-slate approaches like Dirigent offer performance gains, they sacrifice compatibility with established cluster management ecosystems. We introduce PulseNet, a serverless system designed to achieve high performance and low cost while maintaining compatibility with conventional cluster managers. PulseNet employs a novel dual-track control plane. A standard asynchronous track manages long-lived, full-featured regular instances for handling predictable, sustainable traffic, preserving full compatibility and feature sets off the critical path. Concurrently, an expedited parallel track addresses excessive traffic bursts that trigger cold starts. This fast path utilizes node-local agents (Pulselets) to rapidly spawn short-lived Emergency Instances with a reduced feature set, critically bypassing the latency overhead of the main cluster manager. Our experiments demonstrate that PulseNet, while remaining compatible with conventional managers for >98% of invocation traffic, achieves 35% faster end-to-end performance at a comparable cost to the incompatible Dirigent system. PulseNet outperforms Kubernetes-compatible systems with synchronous control planes by 1.5-3.5x at 8-21% lower cost, and surpasses asynchronous counterparts by 1.7-3.5x at 3-33% lower cost.
没有服务器的现代应用程序,通常与高度波动的交通、挑战系统变异性、挑战性能、高要求的控制机体发生互动,提供低延迟和成本效率。对生产轨迹和现有系统的分析显示,当前的控制机设计(同步和不同步),特别是在Kubernetes等常规组群管理器上,与这种平衡作斗争时,常常浪费大量的CPU和记忆资源,造成利用不足或闲置的事例。虽然像Diriririgent这样的清洁的平流方法可以带来绩效增益,但它们牺牲了与既定的集管管理生态系统的兼容性。我们引入了无服务器网络,这个系统旨在实现高性能和低成本,同时保持与常规组管理器管理器的兼容性双轨控制器。 一条标准的不同步运行轨迹管理器管理器管理器管理器,在常规组别上,在常规组别上,在常规组别上,在常规组别上,在常规组别上,在常规组别上,在常规组别上,在常规级管理器尾比性操作中,在成本上,在不兼容性操作中,在等常规组管理器操作中,在等主端,在不兼容性管理器上,运行中,在不兼容性运行中,在运行中,运行中,在不兼容性地,运行运行中,运行中,运行中,运行中,在成本比比性地,运行比比比比性运行更低。
Article 22
Title@2025-07-02 (3): EDGChain-E: A Decentralized Git-Based Framework for Versioning Encrypted Energy Data
Title: EDGChain-E: A Decentralized Git-Based Framework for Versioning Encrypted Energy Data | EDGChain-E: Ein dezentralisiertes Git-basiertes Framework zur Versionierung verschlüsselter Energiedaten | EDGChain-E:用于加密能源数据版本管理的去中心化 Git 框架 2507.01615v1 |
Authors (4): Alper Alimoglu, Kamil Erdayandi, Mustafa A. Mustafa, Ümit Cali
This paper proposes a new decentralized framework, named EDGChain-E (Encrypted-Data-Git Chain for Energy), designed to manage version-controlled, encrypted energy data using blockchain and the InterPlanetary File System. The framework incorporates a Decentralized Autonomous Organization (DAO) to orchestrate collaborative data governance across the lifecycle of energy research and operations, such as smart grid monitoring, demand forecasting, and peer-to-peer energy trading. In EDGChain-E, initial commits capture the full encrypted datasets (such as smart meter readings or grid telemetry), while subsequent updates are tracked as encrypted Git patches, ensuring integrity, traceability, and privacy. This versioning mechanism supports secure collaboration across multiple stakeholders (e.g., utilities, researchers, regulators) without compromising sensitive or regulated information. We highlight the framework’s capability to maintain FAIR-compliant (Findable, Accessible, Interoperable, Reusable) provenance of encrypted data. By embedding hash-based content identifiers in Merkle trees, the system enables transparent, auditable, and immutable tracking of data changes, thereby supporting reproducibility and trust in decentralized energy applications.
本文提出了一个新的权力下放框架,名为EDGCHain-E(能源加密数据-Git-E),旨在管理版本控制、加密的能源数据,使用块链和InterPlanetyFile系统;该框架包括一个权力下放自治组织(DAO),以便在能源研究和作业的整个生命周期内,在智能电网监测、需求预测和同行能源交易等合作性数据治理方面,在EDGCHain-E中,初步承诺获取全部加密数据集,如智能仪读数或电网遥测仪等,随后更新作为加密的Git补丁进行跟踪,确保完整性、可追踪性和隐私性;这一版本机制支持多个利益攸关方(例如公用事业、研究人员、监管机构)在不损害敏感或规范信息的情况下开展安全的合作;我们强调该框架有能力保持加密数据的兼容性(易读、可读、可互操作性、可重复性)证明。通过在Merkle树上嵌入基于有源内容的标识,使系统能够透明、可审计和不可改变的数据变化跟踪,从而支持在能源信任方面支持应用方面实现。
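The tamper-evidence property of hash-based identifiers in a Merkle tree can be shown with a short sketch. It makes several assumptions not stated in the abstract: plain SHA-256 digests stand in for IPFS content identifiers, an odd node is paired with itself, and the byte strings are placeholder ciphertexts rather than real encrypted commits or patches.

```python
import hashlib


def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_root(leaves):
    """Fold a list of byte blobs into a single root hash.

    Assumption: when a level has an odd number of nodes, the last one is duplicated.
    """
    if not leaves:
        raise ValueError("at least one leaf is required")
    level = [sha256(leaf) for leaf in leaves]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])
        level = [sha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


# Placeholder ciphertexts: one full initial commit followed by two patches.
history = [b"enc(initial dataset)", b"enc(patch 1)", b"enc(patch 2)"]
root = merkle_root(history)

# Altering any entry changes the root, so a root recorded on-chain exposes the edit.
tampered_root = merkle_root([history[0], b"enc(patch 1, altered)", history[2]])
```

Only the root needs to be stored on the ledger for this check; the encrypted blobs themselves can live off-chain.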
Article 23
Title@2025-07-02 (3): Optimal Computation in Anonymous Dynamic Networks
Title: Optimal Computation in Anonymous Dynamic Networks | Optimale Berechnung in anonymen dynamischen Netzwerken | 匿名动态网络的最佳计算 2207.08061v8 |
Authors (2): Giuseppe A. Di Luna, Giovanni Viglietta
We give a simple characterization of the functions that can be computed deterministically by anonymous processes in dynamic networks, depending on the number of leaders in the network. In addition, we provide efficient distributed algorithms for computing all such functions assuming minimal or no knowledge about the network. Each of our algorithms comes in two versions: one that terminates with the correct output and a faster one that stabilizes on the correct output without explicit termination. Notably, these are the first deterministic algorithms whose running times scale linearly with both the number of processes and a parameter of the network which we call “dynamic disconnectivity” (meaning that our dynamic networks do not necessarily have to be connected at all times). We also provide matching lower bounds, showing that all our algorithms are asymptotically optimal for any fixed number of leaders. While most of the existing literature on anonymous dynamic networks relies on classic mass-distribution techniques, our work makes use of a novel combinatorial structure called “history tree”, which is of independent interest. Among other contributions, our results make conclusive progress on two popular fundamental problems for anonymous dynamic networks: leaderless Average Consensus (i.e., computing the mean value of input numbers distributed among the processes) and multi-leader Counting (i.e., determining the exact number of processes in the network). Our contribution not only opens a promising line of research on applications of history trees, but also demonstrates that computation in anonymous dynamic networks is practically feasible and far less demanding than previously conjectured.
我们根据网络领导者的数量,对动态网络中可以匿名程序决定的功能进行简单的定性。 此外,我们提供有效的分布式算法,在对网络知之甚少或完全不了解的情况下计算所有此类功能。我们的每种算法都是两个版本的:一个以正确输出结束,一个以稳定正确输出而无需明确终止的快速速度稳定在正确输出上。值得注意的是,这些是第一个确定性算法,其运行时间以“动态脱节”为线性规模,同时以“动态脱节性”为网络的参数(意味着我们的动态网络不必在任何时候都连接起来 ) 。我们还提供匹配的较低范围,表明我们所有的算法对于固定领导人来说都是以同样的方式最佳的。虽然关于匿名动态网络的现有文献大多依赖传统的批量分配技术,但我们的工作使用一种叫做“历史树”的新组合结构,这是有希望的。除其他贡献外,我们的成果在匿名动态网络的两个流行的基本问题上取得了决定性的进展:平均共识(i)网络的领导力,而不是远额计算历史(i. decreal development the nual prial prial prial prial pristration) nual nudestration produstration pact process)。
Article 24
Title@2025-07-02 (3): Rational Censorship Attack: Breaking Blockchain with a Blackboard
Title: Rational Censorship Attack: Breaking Blockchain with a Blackboard | Rationaler Zensurangriff: Blockchain mit einer Tafel durchbrechen | 理性审查攻击:用黑板打破锁链 2507.01453v1 |
Authors (2): Michelle Yeo, Haoqian Zhang
Censorship resilience is a fundamental assumption underlying the security of blockchain protocols. Additionally, the analysis of blockchain security from an economic and game theoretic perspective has been growing in popularity in recent years. In this work, we present a surprising rational censorship attack on blockchain censorship resilience when we adopt the analysis of blockchain security from a game theoretic lens and assume all users are rational. In our attack, a colluding group with sufficient voting power censors the remaining nodes such that the group alone can gain all the rewards from maintaining the blockchain. We show that if nodes are rational, coordinating this attack just requires a publicly readable and writable blackboard and we formally model the attack using a game theoretic framework. Furthermore, we note that to ensure the success of the attack, nodes need to know the total true voting power held by the colluding group. We prove that the strategy to join the rational censorship attack and also for nodes to honestly declare their power is a subgame perfect equilibrium in the corresponding extensive form game induced by our attack. Finally, we discuss the implications of the attack on blockchain users and protocol designers as well as some potential countermeasures.
此外,从经济和游戏理论角度对链锁安全的分析近年来越来越受欢迎。在这项工作中,当我们从游戏理论角度对链锁安全进行分析时,我们展示了对链锁审查能力的一种令人惊讶的合理审查,并假定所有用户都是理性的。在我们的攻击中,一个拥有足够投票权的串通集团对其余节点进行检查,以使该团体能够单独从维持链锁中获得所有好处。我们表明,如果节点是理性的,协调这一攻击只需要一个公开读写黑板,我们正式用游戏理论框架来模拟攻击。此外,我们注意到,为了确保袭击成功,我们不需了解串通集团所拥有的完全真实的投票权力。我们证明,加入理性审查攻击和诚实宣布其权力的战略是我们攻击所引发的大规模游戏中的一个亚相完美平衡。最后,我们讨论了袭击对链锁用户和协议设计者的影响,作为某些可能的反措施。
Article 25
Title@2025-07-02 (3): EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices
Title: EdgeLoRA: An Efficient Multi-Tenant LLM Serving System on Edge Devices | EdgeLoRA: Ein effizientes Multi-Tenant LLM Serving System auf Edge-Geräten | EdgeLoRA:面向边缘设备的高效多租户 LLM 服务系统 2507.01438v1 |
Authors (7): Zheyu Shen, Yexiao He, Ziyao Wang, Yuning Zhang, Guoheng Sun, Wanghao Ye, Ang Li
Large Language Models (LLMs) have gained significant attention due to their versatility across a wide array of applications. Fine-tuning LLMs with parameter-efficient adapters, such as Low-Rank Adaptation (LoRA), enables these models to efficiently adapt to downstream tasks without extensive retraining. Deploying fine-tuned LLMs on multi-tenant edge devices offers substantial benefits, such as reduced latency, enhanced privacy, and personalized responses. However, serving LLMs efficiently on resource-constrained edge devices presents critical challenges, including the complexity of adapter selection for different tasks and memory overhead from frequent adapter swapping. Moreover, given the multiple requests in multi-tenant settings, processing requests sequentially results in underutilization of computational resources and increased latency. This paper introduces EdgeLoRA, an efficient system for serving LLMs on edge devices in multi-tenant environments. EdgeLoRA incorporates three key innovations: (1) an adaptive adapter selection mechanism to streamline the adapter configuration process; (2) heterogeneous memory management, leveraging intelligent adapter caching and pooling to mitigate memory operation overhead; and (3) batch LoRA inference, enabling efficient batch processing to significantly reduce computational latency. Comprehensive evaluations using the Llama3.1-8B model demonstrate that EdgeLoRA significantly outperforms the status quo (i.e., llama.cpp) in terms of both latency and throughput. The results demonstrate that EdgeLoRA can achieve up to a 4 times boost in throughput. Even more impressively, it can serve several orders of magnitude more adapters simultaneously. These results highlight EdgeLoRA’s potential to transform edge deployment of LLMs in multi-tenant scenarios, offering a scalable and efficient solution for resource-constrained environments.
大型语言模型(LLMS)因其在多种应用中的多功能性而得到极大关注。微调LLMS与具有参数效率的适配器(如低兰克适应(LORA))的微调LLMS相比,使得这些模型能够在没有广泛再培训的情况下高效地适应下游任务。在多密度边缘设备上部署微调LMS(LLLMS)带来巨大的好处,如降低潜伏性、增强隐私和个人化反应。然而,在资源限制的边缘设备上高效地为LLMS服务。但是,在资源限制的边缘设备上为LMMS(LLMMS)提供适应性适应性适应性选择机制,以简化调适配器配置程序; 混合存储管理,利用智能调适配器调和存储器来减少经常调换换换的存储器管理。此外,鉴于多强度的设置多重要求,处理这些要求依次导致计算计算对计算资源利用率不足和增加延缓度。 本文引入高效的LLORD3.1 快速处理结果,可以大幅地进行模拟变压。
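The batch LoRA inference idea can be sketched for a single linear layer. This is an illustration of the general technique, not EdgeLoRA's implementation: it uses pure Python lists instead of tensors, omits the usual LoRA scaling factor, and the names `matmul` and `batched_lora_forward` are hypothetical. The base projection is computed once for the whole batch, and each adapter's low-rank correction is applied only to the requests that selected it.

```python
def matmul(X, W):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * w for x, w in zip(row, col)) for col in zip(*W)] for row in X]


def batched_lora_forward(X, adapter_ids, W, adapters):
    """Compute y_i = x_i W + (x_i A) B, where (A, B) is the adapter chosen by request i.

    X: one input row per request; adapter_ids: adapter name per request;
    adapters: dict name -> (A, B) low-rank factors. The scaling factor is omitted.
    """
    out = matmul(X, W)  # shared base-model pass over the whole batch
    groups = {}
    for i, aid in enumerate(adapter_ids):
        groups.setdefault(aid, []).append(i)
    for aid, rows in groups.items():
        A, B = adapters[aid]
        delta = matmul(matmul([X[i] for i in rows], A), B)
        for i, d in zip(rows, delta):
            out[i] = [o + v for o, v in zip(out[i], d)]
    return out


W = [[1, 0], [0, 1]]  # identity base weight, for readability
adapters = {
    "a": ([[1], [0]], [[0, 1]]),  # rank-1 factors A (2x1) and B (1x2)
    "b": ([[0], [1]], [[1, 0]]),
}
Y = batched_lora_forward([[1, 2], [3, 4], [5, 6]], ["a", "b", "a"], W, adapters)
```

Requests 0 and 2 share adapter "a" and are corrected together in one grouped multiplication, while the base weight is never duplicated per tenant.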
Article 26
Title@2025-07-02 (3): Optimal Dispersion Under Asynchrony
Title: Optimal Dispersion Under Asynchrony | Optimale Dispersion unter Asynchronie | 异步条件下的最优分散 2507.01298v1 |
Authors (5): Debasish Pattanayak, Ajay D. Kshemkalyani, Manish Kumar, Anisur Rahaman Molla, Gokarna Sharma
We study the dispersion problem in anonymous port-labeled graphs: $k \leq n$ mobile agents, each with a unique ID and initially located arbitrarily on the nodes of an $n$-node graph with maximum degree $\Delta$, must autonomously relocate so that no node hosts more than one agent. Dispersion serves as a fundamental task in distributed computing of mobile agents, and its complexity stems from key challenges in local coordination under anonymity and limited memory. The goal is to minimize both the time to achieve dispersion and the memory required per agent. It is known that any algorithm requires $\Omega(k)$ time in the worst case, and $\Omega(\log k)$ bits of memory per agent. A recent result [SPAA’25] gives an optimal $O(k)$-time algorithm in the synchronous setting and an $O(k \log k)$-time algorithm in the asynchronous setting, both using $O(\log(k+\Delta))$ bits. In this paper, we close the complexity gap in the asynchronous setting by presenting the first dispersion algorithm that runs in optimal $O(k)$ time using $O(\log(k+\Delta))$ bits of memory per agent. Our solution is based on a novel technique we develop in this paper that constructs a port-one tree in anonymous graphs, which may be of independent interest.
我们研究匿名端口标签图表中的分散问题: $k\leq n$移动代理器, 每一个都有独特的身份, 最初被任意放置在最大度为$\Delta$的美元节点上, 必须自动迁移, 以便无节点能容纳不止一个代理器。 分散是移动代理器分布计算中的一项基本任务, 其复杂性来自匿名和有限记忆下当地协调的关键挑战。 目标是在匿名和有限记忆下尽可能减少实现分散的时间和每个代理器所需的记忆。 已知任何算法都需要$\ Omega (k) 时间, 在最差的情况下需要$\ Omega (log k) 和$\ log k) 每个代理器的记忆比特 。 最近的结果 [ SPA’ 25] 给出了同步环境中最佳的美元( k) 和 $( k) logk k) 的时间算法。 目标是尽可能减少时间运算, 使用 $( k@ Delta) 美元 位。 在本文中, 我们将一个基于 美元 硬质 流流流流流中 的硬体 的硬体 的硬体 解 解 解 。
Article 27
Title@2025-07-02 (3): Far From Sight, Far From Mind: Inverse Distance Weighting for Graph Federated Recommendation
Title: Far From Sight, Far From Mind: Inverse Distance Weighting for Graph Federated Recommendation | Weit weg vom Sehen, weit weg vom Denken: Inverse Distanzgewichtung für Graph Federated Empfehlung | 眼不见,心不念:面向图联邦推荐的反距离加权 2507.01285v1 |
Authors (4): Aymen Rayane Khouas, Mohamed Reda Bouadjenek, Hakim Hacid, Sunil Aryal
Graph federated recommendation systems offer a privacy-preserving alternative to traditional centralized recommendation architectures, which often raise concerns about data security. While federated learning enables personalized recommendations without exposing raw user data, existing aggregation methods overlook the unique properties of user embeddings in this setting. Indeed, traditional aggregation methods fail to account for their complexity and the critical role of user similarity in recommendation effectiveness. Moreover, evolving user interactions require adaptive aggregation while preserving the influence of high-relevance anchor users (the primary users before expansion in graph-based frameworks). To address these limitations, we introduce Dist-FedAvg, a novel distance-based aggregation method designed to enhance personalization and aggregation efficiency in graph federated learning. Our method assigns higher aggregation weights to users with similar embeddings, while ensuring that anchor users retain significant influence in local updates. Empirical evaluations on multiple datasets demonstrate that Dist-FedAvg consistently outperforms baseline aggregation techniques, improving recommendation accuracy while maintaining seamless integration into existing federated learning frameworks.
图表结合建议系统提供了一种保护隐私的替代方法,而传统的集中建议结构往往引起对数据安全的担忧。尽管联合学习可以使个人化建议不暴露原始用户数据,但现有的汇总方法忽略了用户嵌入这一环境的独特性。事实上,传统的汇总方法没有考虑到其复杂性和用户相似性在建议有效性方面的关键作用。此外,不断演变的用户互动需要适应性汇总,同时保持高相关性的锁定用户(在扩大图表框架之前的主要用户)的影响。为了解决这些限制,我们引入了Dist-FedAvg,这是一种新的远程汇总方法,旨在提高图形粘入学习的个人化和汇总效率。我们的方法为具有类似嵌入的用户分配了更高的汇总权重,同时确保锁定用户在本地更新中保持重要影响力。对多个数据集的亲切评价表明, Dist-FedAvg 持续地超越基线汇总技术,提高建议准确性,同时保持与现有Federedered学习框架的无缝结合。
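Inverse distance weighting of user embeddings can be sketched as follows. The abstract does not give Dist-FedAvg's exact formula, so everything here is an assumption made for illustration: Euclidean distance, weights proportional to 1 / (distance + eps), and a fixed `anchor_weight` that preserves the anchor user's influence. The function name `idw_aggregate` is hypothetical.

```python
import math


def idw_aggregate(anchor, neighbors, anchor_weight=0.5, eps=1e-8):
    """Blend an anchor embedding with neighbors weighted by inverse distance.

    Closer (more similar) neighbor embeddings receive larger weights.
    anchor_weight is an assumed knob that keeps the anchor user influential.
    """
    if not neighbors:
        return list(anchor)
    inv = [1.0 / (math.dist(anchor, n) + eps) for n in neighbors]
    total = sum(inv)
    pooled = [
        sum(w * n[k] for w, n in zip(inv, neighbors)) / total
        for k in range(len(anchor))
    ]
    return [anchor_weight * a + (1.0 - anchor_weight) * p for a, p in zip(anchor, pooled)]


# The neighbor at distance 1 gets three times the weight of the one at distance 3.
blended = idw_aggregate([0.0, 0.0], [[1.0, 0.0], [0.0, 3.0]])
```

In this example the normalized neighbor weights are 0.75 and 0.25, so the pooled vector is about (0.75, 0.75) before it is blended with the anchor.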
Article 28
Title@2025-07-01 (2): Capacity Planning and Scheduling for Jobs with Uncertainty in Resource Usage and Duration
Title: Capacity Planning and Scheduling for Jobs with Uncertainty in Resource Usage and Duration | Kapazitätsplanung und Planung für Jobs mit Unsicherheit in Ressourcennutzung und -dauer | 资源使用和期限不确定的工作的能力规划和时间安排 2507.01225v1 |
Authors (7): Sunandita Patra, Mehtab Pathan, Mahmoud Mahfouz, Parisa Zehtabi, Wided Ouaja, Daniele Magazzeni, Manuela Veloso
Organizations around the world schedule jobs (programs) regularly to perform various tasks dictated by their end users. With the major movement towards using a cloud computing infrastructure, our organization follows a hybrid approach with both cloud and on-prem servers. The objective of this work is to perform capacity planning, i.e., estimate resource requirements, and job scheduling for on-prem grid computing environments. A key contribution of our approach is handling uncertainty in both resource usage and duration of the jobs, a critical aspect in the finance industry where stochastic market conditions significantly influence job characteristics. For capacity planning and scheduling, we simultaneously balance two conflicting objectives: (a) minimize resource usage, and (b) provide high quality-of-service to the end users by completing jobs by their requested deadlines. We propose approximate approaches using deterministic estimators and pair sampling-based constraint programming. Our best approach (pair sampling-based) achieves much lower peak resource usage compared to manual scheduling without compromising on the quality-of-service.
世界各地的组织定期安排工作(方案),以执行其终端用户规定的各项任务。随着使用云计算基础设施的重大转变,我们组织采用了云式和即时服务器的混合方法。这项工作的目标是进行能力规划,即估计所需资源和在预置电网计算环境的工作时间安排。我们的方法的一个关键贡献是处理资源使用和工作期限的不确定性,这是金融业中一个关键方面,因为其市场状况不均严重影响工作特点。在能力规划和时间安排方面,我们同时平衡两个相互矛盾的目标:(a) 尽量减少资源使用,和(b) 通过按要求的最后期限完成工作,向终端用户提供高质量的服务。我们建议采用确定性估计和对口抽样制约性规划的大致方法。我们的最佳方法(基于采样方法)在不损及服务质量的情况下,实现了人工安排的最高峰资源使用率。
Article 29
Title@2025-07-01 (2): FLARE: A Dataflow-Aware and Scalable Hardware Architecture for Neural-Hybrid Scientific Lossy Compression
Title: FLARE: A Dataflow-Aware and Scalable Hardware Architecture for Neural-Hybrid Scientific Lossy Compression | FLARE: Eine datenflussfähige und skalierbare Hardwarearchitektur für Neural-Hybrid Scientific Lossy Compression | FLARE:面向神经混合科学有损压缩的数据流感知可扩展硬件架构 2507.01224v1 |
Authors (8): Wenqi Jia, Ying Huang, Jian Xu, Zhewen Hu, Sian Jin, Jiannan Tian, Yuede Ji, Miao Yin
Scientific simulation leveraging high-performance computing (HPC) systems is crucial for modeling complex systems and phenomena in fields such as astrophysics, climate science, and fluid dynamics, generating massive datasets that often reach petabyte to exabyte scales. However, managing these vast data volumes introduces significant I/O and network bottlenecks, limiting practical performance and scalability. While cutting-edge lossy compression frameworks powered by deep neural networks (DNNs) have demonstrated superior compression ratios by capturing complex data correlations, their integration into HPC workflows poses substantial challenges due to the hybrid non-neural and neural computation patterns, causing excessive memory access overhead, large sequential stalls, and limited adaptability to varying data sizes and workloads in existing hardware platforms. To overcome these challenges and push the limit of high-performance scientific computing, we propose, for the first time, FLARE, a dataflow-aware and scalable hardware architecture for neural-hybrid scientific lossy compression. FLARE minimizes off-chip data access, reduces bubble overhead through efficient dataflow, and adopts a modular design that provides both scalability and flexibility, significantly enhancing throughput and energy efficiency on modern HPC systems. In particular, the proposed FLARE achieves runtime speedups ranging from $3.50 \times$ to $96.07 \times$, and energy efficiency improvements ranging from $24.51 \times$ to $520.68 \times$, across various datasets and hardware platforms.
利用高性能计算(HPC)系统的科学模拟对于天体物理学、气候科学和流体动力学等领域中复杂系统和现象的建模至关重要,所产生的海量数据集往往达到PB至EB级。然而,管理这些庞大的数据量会带来严重的I/O和网络瓶颈,限制了实际性能和可扩展性。由深度神经网络(DNN)驱动的前沿有损压缩框架通过捕捉复杂的数据相关性实现了更高的压缩比,但由于非神经与神经计算模式混合,将其集成到HPC工作流中面临巨大挑战:在现有硬件平台上会造成过多的内存访问开销、较长的顺序停顿,以及对不同数据规模和工作负载的适应性有限。为了克服这些挑战并推进高性能科学计算的极限,我们首次提出FLARE,一种面向神经混合科学有损压缩的数据流感知、可扩展的硬件架构。FLARE尽量减少片外数据访问,通过高效的数据流减少气泡开销,并采用兼具可扩展性和灵活性的模块化设计,显著提升了现代HPC系统上的吞吐量和能效。特别是,在各种数据集和硬件平台上,FLARE实现了 $3.50 \times$ 至 $96.07 \times$ 的运行时加速,以及 $24.51 \times$ 至 $520.68 \times$ 的能效提升。
Article 30
Title@2025-07-01 (2): HERCULES: Hardware accElerator foR stoChastic schedULing in hEterogeneous Systems
Title: HERCULES: Hardware accElerator foR stoChastic schedULing in hEterogeneous Systems | HERCULES: Hardware-Accelerator foR stoChastic SchedULing in hEterogeneous Systems | HERCULES:面向异构系统随机调度的硬件加速器 2507.01113v1 |
Authors (4): Vairavan Palaniappan, Adam H. Ross, Amit Ranjan Trivedi, Debjit Pal
Efficient workload scheduling is a critical challenge in modern heterogeneous computing environments, particularly in high-performance computing (HPC) systems. Traditional software-based schedulers struggle to efficiently balance workload distribution due to high scheduling overhead, lack of adaptability to dynamic workloads, and suboptimal resource utilization. These pitfalls are compounded in heterogeneous systems, where differing computational elements can have vastly different performance profiles. To resolve these hindrances, we present a novel FPGA-based accelerator for stochastic online scheduling (SOS). We modify a greedy cost selection assignment policy by adapting existing cost equations to engage with discretized time before implementing them into a hardware accelerator design. Our design leverages hardware parallelism, precalculation, and precision quantization to reduce job scheduling latency. By introducing a hardware-accelerated approach to real-time scheduling, this paper establishes a new paradigm for adaptive scheduling mechanisms in heterogeneous computing systems. The proposed design achieves high throughput, low latency, and energy-efficient operation, offering a scalable alternative to traditional software scheduling methods. Experimental results demonstrate consistent workload distribution, fair machine utilization, and up to 1060x speedup over single-threaded software scheduling policy implementations. This makes the SOS accelerator a strong candidate for deployment in high-performance computing systems, deep learning pipelines, and other performance-critical applications.
在现代多种不同的计算环境中,特别是在高性能计算系统中,高效的工作时间安排是一项关键的挑战。传统的基于软件的调度员努力通过高排时的间接费用、缺乏适应动态工作量的适应能力以及资源利用不足来有效平衡工作量分配。这些缺陷在不同的系统里更为复杂,不同的计算要素可能具有截然不同的业绩特征。为了解决这些障碍,我们为随机在线排期提供了一个新的基于FPGA的升级器。我们修改了贪婪的成本选择分配政策,调整了现有的成本方程式,以便在将现有成本方程式应用成硬件加速器设计之前,与分散的时间打交道。我们的设计利用了硬件平行性、预估算和精确的四分法来降低工作时间安排的延迟性。通过采用硬件加速方法实时排期,本文件为混合计算系统中的适应性排期机制建立了一个新的范式。我们提出的设计实现了高通量、低拉力和节能操作,为传统的软件排期方法提供了可调整的强有力替代方法。我们的设计利用了硬件的平行平行性平衡、预估和精确的分级等方法来减少职位排期。我们的设计利用了同步性工作,从而展示了10个高额的系统,从而实现了高速度地使用了高级部署政策。
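A greedy cost-selection policy over discretized time can be sketched in software as below. The cost expression is an illustrative assumption, not the paper's adapted cost equations: each job carries a weight and an integer processing time (in ticks) per machine, and it is placed on the machine where weight times expected completion time is smallest. The name `greedy_assign` is hypothetical.

```python
def greedy_assign(jobs, num_machines):
    """Assign each arriving job to the machine with the lowest estimated cost.

    jobs: list of (weight, proc_ticks), where proc_ticks[m] is the integer
    processing time on machine m. Cost model (assumed): weight * (backlog + proc).
    """
    backlog = [0] * num_machines  # ticks of work already queued per machine
    placement = []
    for weight, proc_ticks in jobs:
        # The per-machine costs are independent of one another.
        costs = [weight * (backlog[m] + proc_ticks[m]) for m in range(num_machines)]
        best = min(range(num_machines), key=costs.__getitem__)
        backlog[best] += proc_ticks[best]
        placement.append(best)
    return placement, backlog


jobs = [(1, [4, 2]), (2, [3, 6]), (1, [5, 1])]
placement, backlog = greedy_assign(jobs, 2)
```

Because each machine's cost depends only on that machine's backlog, all candidates could in principle be evaluated at once, which is the kind of parallelism a hardware design can exploit. Integer ticks also avoid floating-point arithmetic in the comparison.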
Article 31
Title@2025-07-01 (2): A Terminology for Scientific Workflow Systems
Title: A Terminology for Scientific Workflow Systems | Eine Terminologie für wissenschaftliche Workflow-Systeme | 科学工作流系统术语 2506.07838v5 |
Authors (26): Frédéric Suter, Tainã Coleman, İlkay Altintaş, Rosa M. Badia, Bartosz Balis, Kyle Chard, Iacopo Colonnelli, Ewa Deelman, Paolo Di Tommaso, Thomas Fahringer, Carole Goble, Shantenu Jha, Daniel S. Katz, Johannes Köster, Ulf Leser, Kshitij Mehta, Hilary Oliver, J.-Luc Peterson, Giovanni Pizzi, Loïc Pottier, Raül Sirvent, Eric Suchyta, Douglas Thain, Sean R. Wilkinson, Justin M. Wozniak, Rafael Ferreira da Silva
The term scientific workflow has evolved over the last two decades to encompass a broad range of compositions of interdependent compute tasks and data movements. It has also become an umbrella term for processing in modern scientific applications. Today, many scientific applications can be considered as workflows made of multiple dependent steps, and hundreds of workflow management systems (WMSs) have been developed to manage and run these workflows. However, no turnkey solution has emerged to address the diversity of scientific processes and the infrastructure on which they are implemented. Instead, new research problems requiring the execution of scientific workflows with some novel feature often lead to the development of an entirely new WMS. A direct consequence is that many existing WMSs share some salient features, offer similar functionalities, and can manage the same categories of workflows but also have some distinct capabilities. This situation makes researchers who develop workflows face the complex question of selecting a WMS. This selection can be driven by technical considerations, to find the system that is the most appropriate for their application and for the resources available to them, or other factors such as reputation, adoption, strong community support, or long-term sustainability. To address this problem, a group of WMS developers and practitioners joined their efforts to produce a community-based terminology of WMSs. This paper summarizes their findings and introduces this new terminology to characterize WMSs. This terminology is composed of five axes: workflow characteristics, composition, orchestration, data management, and metadata capture. Each axis comprises several concepts that capture the prominent features of WMSs. Based on this terminology, this paper also presents a classification of 23 existing WMSs according to the proposed axes and terms.
过去二十年来,科学工作流程这一术语演变为包括相互依存计算任务和数据流动的广泛构成。它也成为现代科学应用中处理的总括术语。今天,许多科学应用可被视为由多个依赖步骤组成的工作流程,数百个工作流程管理系统(WMSs)已经开发出来来管理和运行这些工作流程。然而,没有出现任何统包式解决办法来解决科学流程及其实施基础设施的多样性问题。相反,需要执行具有某些新特点的科学工作流程的新研究问题往往导致形成全新的WMS。一个直接后果是,许多现有的WMS具有某些显著特征,提供类似的功能,可以管理相同的工作流程类别,但也有一些不同的能力。这种情况使开发工作流程的研究人员面临选择WMS的复杂问题。这种选择可以由技术因素驱动,找到最适合其应用和可用资源的系统,或诸如声誉、采用、强有力的社区支持或长期可持续性等其他因素。为了解决这个问题,一组WMS开发者和实践者共同努力,制定了一套基于社区的WMS术语。本文总结了他们的发现,并介绍了这套用于刻画WMS的新术语。该术语由五个轴组成:工作流程特征、组合、编排、数据管理和元数据捕获。每个轴包含若干概念,用于刻画WMS的突出特征。基于这套术语,本文还按照所提出的轴和术语对23个现有WMS进行了分类。
Article 32
Title@2025-07-01 (2): Efficient Gate Reordering for Distributed Quantum Compiling in Data Centers
Title: Efficient Gate Reordering for Distributed Quantum Compiling in Data Centers | Effiziente Gate-Reorder für verteilte Quantenkompilierung in Rechenzentren | 数据中心分配量数汇编高效门(高效门)重新排序 2507.01090v1 |
Authors (8): Riccardo Mengoni, Walter Nadalin, Mathys Rennela, Jimmy Rotureau, Tom Darras, Julien Laurat, Eleni Diamanti, Ioannis Lavdas
Just as classical computing relies on distributed systems, the quantum computing era requires new kinds of infrastructure and software tools. Quantum networks will become the backbone of hybrid, quantum-augmented data centers, in which quantum algorithms are distributed over a local network of quantum processing units (QPUs) interconnected via shared entanglement. In this context, it is crucial to develop methods and software that minimize the number of inter-QPU communications. Here we describe key features of the quantum compiler araQne, which is designed to minimize distribution cost, measured by the number of entangled pairs required to distribute a monolithic quantum circuit using gate teleportation protocols. We establish the crucial role played by circuit reordering strategies, which strongly reduce the distribution cost compared to a baseline approach.
正如古典计算依赖于分布式系统一样,量子计算时代需要新型基础设施和软件工具。 量子计算网络将成为混合量子增强数据中心的支柱,其中量子算法分布在通过共同缠绕而相互联系的量子处理器(QPUs)本地网络中。 在这方面,开发尽量减少QPU之间通信数量的方法和软件至关重要。 我们在这里描述量子汇编器 ARAQne 的关键特征。 量子汇编器 ARAQne 旨在将分配成本降到最低,用使用门传导协议分配单体量子电路所需的缠绕对子数量来衡量。 我们确定了电路重新排序战略所起的关键作用,与基线方法相比,这大大降低了分配成本。
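The distribution cost described above can be illustrated with a naive baseline count. This is only a sketch under stated assumptions, not araQne's method: it assumes one entangled pair per two-qubit gate whose qubits sit on different QPUs, and the names `count_nonlocal_gates`, `gates`, and the two partitions are hypothetical.

```python
from typing import Dict, List, Tuple

def count_nonlocal_gates(gates: List[Tuple[int, int]], qpu_of: Dict[int, int]) -> int:
    """Count two-qubit gates whose operands live on different QPUs.

    Assumption: each such gate costs one entangled pair (naive baseline,
    no reordering or grouping of gates).
    """
    return sum(1 for a, b in gates if qpu_of[a] != qpu_of[b])

# A toy 4-qubit circuit given as (qubit, qubit) pairs for its two-qubit gates.
gates = [(0, 1), (1, 2), (2, 3), (0, 3)]

# Two hypothetical ways of placing the qubits on two QPUs.
good_partition = {0: 0, 1: 0, 2: 1, 3: 1}
bad_partition = {0: 0, 1: 1, 2: 0, 3: 1}

good_cost = count_nonlocal_gates(gates, good_partition)
bad_cost = count_nonlocal_gates(gates, bad_partition)
```

The same circuit costs fewer pairs under the first placement, which is the kind of difference a distribution-aware compiler tries to exploit.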
Article 33
Title@2025-07-01 (2): Not All Water Consumption Is Equal: A Water Stress Weighted Metric for Sustainable Computing
Title: Not All Water Consumption Is Equal: A Water Stress Weighted Metric for Sustainable Computing | Nicht jeder Wasserverbrauch ist gleich: Ein Wasserdruck-gewichtetes Metric für nachhaltiges Rechnen | 并非所有水消耗量都相等:可持续计算中的水应激反应加权计量 2506.22773v2 |
Authors (3): Yanran Wu, Inez Hua, Yi Ding
Water consumption is an increasingly critical dimension of computing sustainability, especially as AI workloads rapidly scale. However, current water impact assessment often overlooks where and when water stress is more severe. To fill this gap, we present SCARF, the first general framework that evaluates the water impact of computing by factoring in both spatial and temporal variations in water stress. SCARF calculates an Adjusted Water Impact (AWI) metric that considers both consumption volume and local water stress over time. Through three case studies on LLM serving, datacenters, and semiconductor fabrication plants, we show the hidden opportunities for reducing water impact by optimizing location and time choices, paving the way for water-sustainable computing. The code is available at https://github.com/jojacola/SCARF.
水消费是计算可持续性的一个越来越关键的方面,特别是由于AI工作负担迅速,然而,目前的水影响评估往往忽略了水压力更严重的地方和时间,为填补这一空白,我们提出了第一个总框架SCARF,这是通过将水压力的空间和时间变化因素考虑在内来评价计算对水的影响的第一个总框架。SCARF计算了一个经过调整的水影响(AWI)衡量标准,其中考虑到消费量和一段时间内当地水压力。通过对LLM服务、数据中心和半导体制造厂的三项个案研究,我们展示了通过优化地点和时间选择来减少水影响的隐蔽机会,为可持续水计算铺平了道路。该代码可在https://github.com/jojacola/SCARF上查阅。
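One way to read the Adjusted Water Impact idea is as a stress-weighted sum of consumption over time. The sketch below is an illustration under that assumption and is not SCARF's actual formula; the function name `adjusted_water_impact`, the stress factors, and the volumes are all hypothetical.

```python
from typing import Sequence

def adjusted_water_impact(volumes_l: Sequence[float], stress: Sequence[float]) -> float:
    """Weight each period's water volume by that period's local stress factor.

    Hypothetical weighting for illustration; the paper's AWI may be defined differently.
    """
    if len(volumes_l) != len(stress):
        raise ValueError("volumes and stress factors must cover the same periods")
    return sum(v * s for v, s in zip(volumes_l, stress))

# Same total volume (300 L) scheduled differently against a stress peak in period 2.
stress_by_period = [1.0, 3.0, 1.0]   # illustrative stress factors
peak_heavy = [50.0, 200.0, 50.0]     # most consumption during the stressed period
peak_light = [125.0, 50.0, 125.0]    # consumption shifted away from it

heavy_awi = adjusted_water_impact(peak_heavy, stress_by_period)
light_awi = adjusted_water_impact(peak_light, stress_by_period)
```

Both schedules consume the same volume, yet the weighted score differs, which is the effect that time and location choices are meant to capture.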
Article 34
Title@2025-07-01 (2): How Fast Can Graph Computations Go on Fine-grained Parallel Architectures
Title: How Fast Can Graph Computations Go on Fine-grained Parallel Architectures | Wie schnell können Graphberechnungen auf feinkörnigen parallelen Architekturen sein | 图计算在细粒度并行架构上能有多快 2507.00949v1 |
Authors (6): Yuqing Wang, Charles Colley, Brian Wheatman, Jiya Su, David F. Gleich, Andrew A. Chien
Large-scale graph problems are of critical and growing importance, and historically, parallel architectures have provided little support. In the spirit of co-design, we explore the question: how fast can graph computing go on a fine-grained architecture? We explore the possibilities of an architecture optimized for fine-grained parallelism, natural programming, and the irregularity and skew found in real-world graphs. Using two graph benchmarks, PageRank (PR) and Breadth-First Search (BFS), we evaluate a Fine-Grained Graph architecture, UpDown, to explore what performance co-design can achieve. To demonstrate programmability, we wrote five variants of these algorithms. Simulations of up to 256 nodes (524,288 lanes) and projections to 16,384 nodes (33M lanes) show the UpDown system can achieve 637K GTEPS PR and 989K GTEPS BFS on RMAT, exceeding the best prior results by 5x and 100x respectively.
大型图表问题至关重要,而且越来越重要,历史上平行的建筑几乎没有什么支持。本着共同设计的精神,我们探索了这样一个问题:如何快速地用精细雕刻的建筑进行计算?我们探索了在真实世界图中找到的精细平行、自然编程、不规则性和扭曲性的最佳建筑的可能性。我们用PageRank(PageRank)和Breadth-First Search(BFS)这两个图表基准来评估一个精美的图表结构,即UpDow,以探索性能标码能够实现什么。为了显示可编程性,我们编写了五个这些算法的变式。对256节点(524,288个航道)的模拟和对16,384节点(33M航道)的预测显示,“UpDown”系统可以实现637K GTEPS PR 和989KGTEPS BFS分别超过5x和100x的以往最佳结果。
Article 35
Title@2025-07-01 (2): Turning AI Data Centers into Grid-Interactive Assets: Results from a Field Demonstration in Phoenix, Arizona
Title: Turning AI Data Centers into Grid-Interactive Assets: Results from a Field Demonstration in Phoenix, Arizona | Umwandlung von KI-Datenzentren in Grid-Interaktive Vermögenswerte: Ergebnisse einer Felddemonstration in Phoenix, Arizona | 将AI数据中心变成网状互动资产:亚利桑那州凤凰城现场示范的成果 2507.00909v1 |
Authors (15): Philip Colangelo, Ayse K. Coskun, Jack Megrue, Ciaran Roberts, Shayan Sengupta, Varun Sivaram, Ethan Tiao, Aroon Vijaykar, Chris Williams, Daniel C. Wilson, Zack MacFarland, Daniel Dreiling, Nathan Morey, Anuja Ratnayake, Baskar Vairamohan
Artificial intelligence (AI) is fueling exponential electricity demand growth, threatening grid reliability, raising prices for communities paying for new energy infrastructure, and stunting AI innovation as data centers wait for interconnection to constrained grids. This paper presents the first field demonstration, in collaboration with major corporate partners, of a software-only approach–Emerald Conductor–that transforms AI data centers into flexible grid resources that can efficiently and immediately harness existing power systems without massive infrastructure buildout. Conducted at a 256-GPU cluster running representative AI workloads within a commercial, hyperscale cloud data center in Phoenix, Arizona, the trial achieved a 25% reduction in cluster power usage for three hours during peak grid events while maintaining AI quality of service (QoS) guarantees. By orchestrating AI workloads based on real-time grid signals without hardware modifications or energy storage, this platform reimagines data centers as grid-interactive assets that enhance grid reliability, advance affordability, and accelerate AI’s development.
人工智能(AI)正在刺激指数式电力需求增长,威胁电网可靠性,提高社区支付新能源基础设施的价格,并阻碍AI创新,因为数据中心等待连接到受限制的电网。本文件与主要公司伙伴合作,首次实地演示了将AI数据中心转化为灵活电网资源的软件专用方法-紧急导师,将AI数据中心转化为灵活的电网资源,从而能够在不大规模基础设施建设的情况下高效和立即利用现有的电网资源。 在256-GPU集群中进行,在亚利桑那州凤凰城一个商业、超大型云层数据中心内具有代表性的AI工作量,试验在电网顶端活动期间将集群用电量减少25%,同时保持AI服务质量保障。通过在没有硬件改造或能源储存的情况下根据实时电网信号安排AI工作量,这个平台将数据中心重新构成网络互动资产,提高电网的可靠性,提高负担能力,加快AI的开发。
Article 36
Title@2025-07-01 (2): A New Family of Thread to Core Allocation Policies for an SMT ARM Processor
Title: A New Family of Thread to Core Allocation Policies for an SMT ARM Processor | Eine neue Thread-Familie für Kernzuteilungsrichtlinien für einen SMT ARM-Prozessor | SMT ARM 处理器核心分配政策新一串线索 2507.00855v1 |
Authors (5): Marta Navarro, Josué Feliu, Salvador Petit, María E. Gómez, Julio Sahuquillo
Modern high-performance servers commonly integrate Simultaneous Multithreading (SMT) processors, which efficiently boost throughput over single-threaded cores. Optimizing performance in SMT processors is challenging because of the inter-application interference within each SMT core. To mitigate the interference, thread-to-core (T2C) allocation policies play a pivotal role. State-of-the-art T2C policies work in two steps: i) building a per-application performance stack using performance counters and ii) building performance prediction models to identify the best pairs of applications to run on each core. This paper explores distinct ways to build the performance stack in ARM processors and introduces the Instructions and Stalls Cycles (ISC) stack, a novel approach to overcome ARM PMU limitations. The ISC stacks are used as inputs for a performance prediction model to estimate the applications’ performance considering the inter-application interference. The accuracy of the prediction model (second step) depends on the accuracy of the performance stack (first step); thus, the higher the accuracy of the performance stack, the higher the potential performance gains obtained by the T2C allocation policy. This paper presents SYNPA as a family of T2C allocation policies. Experimental results show that SYNPA4, the best-performing SYNPA variant, improves turnaround time by 38% over Linux, which represents 3× the gains achieved by the state-of-the-art policies for ARM processors. Furthermore, the multiple discussions and refinements presented throughout this paper can be applied to other SMT processors from distinct vendors and are aimed at helping performance analysts build performance stacks for accurate performance estimates in real processors.
现代高性能服务器通常会整合超高性能多读化(SMT)处理器(SMT),这些处理器能有效地提高单面核心的通过量。 优化SMT处理器的性能因每个SMT核心的应用干扰而面临挑战。 为减轻干扰,线对核心(T2C)分配政策具有关键作用。 最新T2C政策分级模式分两步工作:(一) 使用性能计(SMT)建立每套应用性能堆,这能有效提升单面核心核心应用量的最佳配对。 本文探讨了在ARM处理器中建立性能堆的不同方法,并介绍了指令和Stalls周期(ISC),这是克服ARM PMU限制的新办法。 使用ISC堆数作为业绩预测模型的投入,根据应用干扰度来估计应用程序的性能。 预测模型(第二步)的准确度取决于性能堆的准确性能(第一步);因此,使用更准确的纸质价比数据组进行更精确的计算。
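The second step described above, choosing which applications share a core, can be pictured as a minimum-cost pairing problem. The sketch below assumes a table of predicted per-pair slowdowns is already available; the function `best_pairing`, the application names, and the cost values are hypothetical, and the exhaustive search is a stand-in rather than the SYNPA policy itself.

```python
import math
from typing import Dict, FrozenSet, List, Tuple

def best_pairing(apps: List[str], cost: Dict[FrozenSet[str], float]) -> Tuple[float, List[Tuple[str, str]]]:
    """Exhaustively pair applications two per core, minimising total predicted slowdown.

    Assumes an even number of applications and a cost entry for every pair.
    """
    if not apps:
        return 0.0, []
    first, rest = apps[0], apps[1:]
    best_total, best_pairs = math.inf, []
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        sub_total, sub_pairs = best_pairing(remaining, cost)
        total = cost[frozenset((first, partner))] + sub_total
        if total < best_total:
            best_total, best_pairs = total, [(first, partner)] + sub_pairs
    return best_total, best_pairs

# Hypothetical predicted slowdowns when two applications share an SMT core.
predicted = {
    frozenset(("a", "b")): 3.0, frozenset(("c", "d")): 3.0,
    frozenset(("a", "c")): 1.0, frozenset(("b", "d")): 1.5,
    frozenset(("a", "d")): 2.0, frozenset(("b", "c")): 2.0,
}
total_cost, pairs = best_pairing(["a", "b", "c", "d"], predicted)
```

Exhaustive search grows quickly with the number of applications, so a real policy would likely use a matching algorithm or heuristic instead.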
Article 37
Title@2025-07-01 (2): Enabling mixed-precision in spectral element codes
Title: Enabling mixed-precision in spectral element codes | Ermöglichung der Mischpräzision in Spektralelementcodes | 使光谱元代码具有混合精度 2503.02134v2 |
Authors (5): Yanxiang Chen, Pablo de Oliveira Castro, Paolo Bientinesi, Niclas Jansson, Roman Iakymchuk
Mixed-precision computing has the potential to significantly reduce the cost of exascale computations, but determining when and how to implement it in programs can be challenging. In this article, we propose a methodology for enabling mixed-precision with the help of computer arithmetic tools, the roofline model, and computer arithmetic techniques. As case studies, we consider Nekbone, a mini-application for the Computational Fluid Dynamics (CFD) solver Nek5000, and a modern Neko CFD application. With the help of the Verificarlo tool and computer arithmetic techniques, we introduce a strategy to address stagnation issues in the preconditioned Conjugate Gradient method in Nekbone and apply these insights to implement a mixed-precision version of Neko. We evaluate the derived mixed-precision versions of these codes by combining metrics in three dimensions: accuracy, time-to-solution, and energy-to-solution. Notably, mixed-precision in Nekbone reduces time-to-solution by roughly 1.62x and energy-to-solution by 2.43x on MareNostrum 5, while in the real-world Neko application, the gain is up to 1.3x in both time and energy, with accuracy that matches double-precision results.
混合精密计算有可能大幅降低伸缩计算的成本,但确定何时以及如何在程序中实施它可能具有挑战性。在本篇文章中,我们提出一种方法,在计算机算术工具、屋顶模型和计算机算术技术的帮助下,使混合精密化成为可能。作为案例研究,我们认为Nekbone是计算流体动力(CFD)溶液(Nek 5000)的微型应用和现代Neko CFD应用。在Verifarlo 工具和计算机算术的帮助下,我们引入了一种战略,在Nekbone的预设的 Conjugate渐进法中解决停滞问题,并将这些洞见应用于实施Neko的混合精密化版本。我们通过将精度、溶解时间和能量溶三个维度的计量组合来评估这些代码的混合精度。 值得注意的是,Nekbone的混合精度使时间到解度减少大约1.62x和2.43的能量溶化方法,在Nekbone的精度中,在Merestruex的精确度上,在1.3世界的双倍的精度应用中,同时将精度与实际结果与1.3进行。
Article 38
Title@2025-07-01 (2): yProv4ML: Effortless Provenance Tracking for Machine Learning Systems
Title: yProv4ML: Effortless Provenance Tracking for Machine Learning Systems | yProv4ML: Müheloses Provenienz-Tracking für maschinelle Lernsysteme | yProv4ML: 机器学习系统无穷无尽的证明跟踪 2507.01078v1 |
Authors (3): Gabriele Padovani, Valentine Anantharaj, Sandro Fiore
The rapid growth of interest in large language models (LLMs) reflects their potential for flexibility and generalization, and has attracted the attention of a diverse range of researchers. However, the advent of these techniques has also brought to light the lack of transparency and rigor with which development is pursued. In particular, the inability to determine the number of epochs and other hyperparameters in advance presents challenges in identifying the best model. To address this challenge, machine learning frameworks such as MLFlow can automate the collection of this type of information. However, these tools capture data using proprietary formats and pay little attention to lineage. This paper proposes yProv4ML, a framework to capture provenance information generated during machine learning processes in PROV-JSON format, with minimal code modifications.
人们对大型语言模型的兴趣迅速增长,反映了其灵活性和概括化的潜力,并吸引了各类研究人员的注意;然而,这些技术的出现也暴露出缺乏透明度和追求发展所需的严格性;特别是,无法预先确定时代和其他超光度计的数目,在确定最佳模型方面提出了挑战;为应付这一挑战,MLFlow等机器学习框架可以使这类信息的收集自动化;然而,这些工具利用专有格式收集数据,对线条没有多少注意;本文件提议采用 yProv4ML,这是一个框架,以PROV-JSON格式收集在机器学习过程中产生的出处信息,并尽可能地对代码进行修改。
Article 39
Title@2025-07-01 (2): PANDAS: Peer-to-peer, Adaptive Networking for Data Availability Sampling within Ethereum Consensus Timebounds
Title: PANDAS: Peer-to-peer, Adaptive Networking for Data Availability Sampling within Ethereum Consensus Timebounds | PANDAS: Peer-to-Peer, Adaptive Networking für Datenverfügbarkeit Probenahme innerhalb von Ethereum Consensus Timebounds | PANDAS:对等对等网络,为数据提供建立适应性网络,在Eetenum共识时限内抽样 2507.00824v1 |
Authors (9): Matthieu Pigaglio, Onur Ascigil, Michał Król, Sergi Rene, Felix Lange, Kaleem Peeroo, Ramin Sadre, Vladimir Stankovic, Etienne Rivière
Layer-2 protocols can assist Ethereum’s limited throughput, but globally broadcasting layer-2 data limits their scalability. The Danksharding evolution of Ethereum aims to support the selective distribution of layer-2 data, whose availability in the network is verified using randomized data availability sampling (DAS). Integrating DAS into Ethereum’s consensus process is challenging, as pieces of layer-2 data must be disseminated and sampled within four seconds of the beginning of each consensus slot. No existing solution can support dissemination and sampling under such strict time bounds. We propose PANDAS, a practical approach to integrate DAS with Ethereum under Danksharding’s requirements without modifying its protocols for consensus and node discovery. PANDAS disseminates layer-2 data and samples its availability using lightweight, direct exchanges. Its design accounts for message loss, node failures, and unresponsive participants while anticipating the need to scale out the Ethereum network. Our evaluation of PANDAS’s prototype in a 1,000-node cluster and simulations for up to 20,000 peers shows that it allows layer-2 data dissemination and sampling under planetary-scale latencies within the 4-second deadline.
2层协议可以帮助Eceenum的有限输送量,但全球广播层-2数据限制了其可缩放性。Eceenum的Dankharding演化过程旨在支持有选择地分配第2层数据,通过随机的数据提供抽样(DAS)来核实其网络中的可用性。将DAS纳入Etheum的共识过程具有挑战性,因为第2层数据必须在每个共识空档开始时的四秒钟内传播和取样。任何现有解决方案都无法支持在如此严格的时限内进行传播和取样。我们提出PANDAS,这是将DAS与Eehereum结合到Danksharding要求中的实用方法,而没有修改其共识和节点发现协议。PANDAS传播第2层数据,并利用光量、直接交换来抽样其可用性。DAS对信息丢失、节点故障和反应不灵敏的参与者进行设计,同时预测需要扩大Eteenum网络的规模。我们对PANDAS原型在1 000节的集群和模拟中最多为20 000对20,000对20,000个同侪进行评估显示它允许在4个截止日期内进行层-2数据传播和取样。
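The intuition behind randomized data availability sampling can be sketched with a simple probability bound. The sketch assumes independent, uniform samples and a withheld fraction `f` of the data; these modelling choices and the function names are assumptions for illustration and are not taken from the PANDAS design.

```python
import math

def miss_probability(withheld_fraction: float, samples: int) -> float:
    """Probability that every one of `samples` independent uniform samples lands
    on available data, so the withholding goes undetected."""
    return (1.0 - withheld_fraction) ** samples

def samples_needed(withheld_fraction: float, confidence: float) -> int:
    """Smallest sample count whose detection probability reaches `confidence`."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - withheld_fraction))

# If half of the data is withheld, how many samples give 99.9% detection confidence?
k = samples_needed(0.5, 0.999)
undetected = miss_probability(0.5, k)
```

Because the miss probability shrinks exponentially with the number of samples, a node can gain high confidence from a small number of lightweight exchanges, which matters under a four-second deadline.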
Article 40
Title@2025-07-01 (2): To Offload or Not To Offload: Model-driven Comparison of Edge-native and On-device Processing
Title: To Offload or Not To Offload: Model-driven Comparison of Edge-native and On-device Processing | Zum Offload oder nicht zum Offload: Modellgetriebener Vergleich von Edge-native und On-Device-Verarbeitung | 卸载还是不卸载:边缘原生处理与设备端处理的模型驱动比较 2504.15162v2 |
Authors (5): Nathan Ng, David Irwin, Ananthram Swami, Don Towsley, Prashant Shenoy
Computational offloading is a promising approach for overcoming resource constraints on client devices by moving some or all of an application’s computations to remote servers. With the advent of specialized hardware accelerators, client devices are now able to perform fast local processing of specific tasks, such as machine learning inference, reducing the need for offloading computations. However, edge servers with accelerators also offer faster processing for offloaded tasks than was previously possible. In this paper, we present an analytic and experimental comparison of on-device processing and edge offloading for a range of accelerator, network, and application workload scenarios, with the goal of understanding when to use local on-device processing and when to offload computations. We present models that leverage analytical queuing results to capture the effects of dynamic factors such as the performance gap between the device and edge server, network variability, server load, and multi-tenancy on the edge server. We experimentally demonstrate the accuracy of our models for a range of hardware and application scenarios and show that our models achieve a mean absolute percentage error of 2.2% compared to observed latencies. We use our models to develop an adaptive resource manager for intelligent offloading and show its efficacy in the presence of variable network conditions and dynamic multi-tenant edge settings.
计算卸载是克服客户设备资源限制的一个很有希望的方法,通过将某些或所有应用程序的计算方法转移到远程服务器来克服客户设备的资源限制。随着专门的硬件加速器的出现,客户设备现在能够对具体任务进行快速本地处理,例如机器学习推论,减少卸载计算的必要性。然而,带有加速器的边缘服务器也比以前可能做到的更快地处理卸载任务。在本文件中,我们对各种加速器、网络和应用工作量设想方案在设备上处理和卸载的边缘进行分析和实验性比较,以了解何时使用本地的加速器处理和何时卸载计算。我们提出模型,利用分析排出结果来捕捉设备与边缘服务器之间性能差距、网络变异性、服务器负荷和边缘服务器多强度等动态因素的影响。我们实验性地展示了各种硬件和应用程序设想方案模型的准确性能,并展示了我们模型在使用本地设备处理和卸载计算时的绝对百分比错误率,以便了解何时使用本地的裁量器处理和何时卸载计算。我们所观测到的网络的可变性效率,我们用来展示了一种可变式的智能模型。
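The offload-or-not decision above can be sketched with a textbook queueing result. The sketch assumes the edge server behaves like an M/M/1 queue, whose mean response time is 1/(μ − λ); the paper's own models may be richer, and the function names and numbers here are hypothetical.

```python
import math

def mm1_response_time(arrival_rate: float, service_rate: float) -> float:
    """Mean time in an M/M/1 system (waiting plus service); infinite if overloaded."""
    if arrival_rate >= service_rate:
        return math.inf
    return 1.0 / (service_rate - arrival_rate)

def should_offload(local_service_s: float, network_rtt_s: float,
                   edge_arrival_rate: float, edge_service_rate: float) -> bool:
    """Offload only when network delay plus edge response time beats local processing."""
    edge_latency = network_rtt_s + mm1_response_time(edge_arrival_rate, edge_service_rate)
    return edge_latency < local_service_s

# A device that needs 50 ms locally, a 10 ms round trip, and an edge server
# that serves 100 requests/s under light and heavy multi-tenant load.
lightly_loaded = should_offload(0.05, 0.01, edge_arrival_rate=50.0, edge_service_rate=100.0)
heavily_loaded = should_offload(0.05, 0.01, edge_arrival_rate=95.0, edge_service_rate=100.0)
```

The same device and network flip from "offload" to "stay local" purely because of server load, which is why a static offloading rule can be a poor fit for multi-tenant edge settings.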
Article 41
Title@2025-07-01 (2): Provenance Tracking in Large-Scale Machine Learning Systems
Title: Provenance Tracking in Large-Scale Machine Learning Systems | Provenienzverfolgung in großformatigen Maschinen-Lernsystemen | 大型机器学习系统中的证书追踪系统 2507.01075v1 |
Authors (3): Gabriele Padovani, Valentine Anantharaj, Sandro Fiore
As the demand for large-scale AI models continues to grow, the optimization of their training to balance computational efficiency, execution time, accuracy and energy consumption represents a critical multidimensional challenge. Achieving this balance requires not only innovative algorithmic techniques and hardware architectures but also comprehensive tools for monitoring, analyzing, and understanding the underlying processes involved in model training and deployment. Provenance data, that is, information about the origins, context, and transformations of data and processes, has become a key component in this pursuit. By leveraging provenance, researchers and engineers can gain insights into resource usage patterns, identify inefficiencies, and ensure reproducibility and accountability in AI development workflows. For this reason, the question of how distributed resources can be optimally utilized to scale large AI models in an energy-efficient manner is a fundamental one. To support this effort, we introduce the yProv4ML library, a tool designed to collect provenance data in JSON format, compliant with the W3C PROV and ProvML standards. yProv4ML focuses on flexibility and extensibility, and enables users to integrate additional data collection tools via plugins. The library is fully integrated with the yProv framework, allowing for higher-level pairing in tasks that are also run through workflow management systems.
随着对大规模AI模型的需求继续增长,优化其培训以平衡计算效率、执行时间、准确性和能源消耗,是一项关键的多层面挑战。实现这一平衡不仅需要创新的算法技术和硬件结构,还需要监测、分析和理解模型培训和部署所涉基本过程的综合工具。关于数据和程序的起源、背景和转变的验证数据信息已成为这一追求的一个关键组成部分。通过利用源头,研究人员和工程师可以了解资源使用模式,查明效率低和确保AI开发工作流程的再生和问责。为此原因,如何以最佳方式利用分配的资源以高能效的方式推广大型AI模型是一个根本性问题。为了支持这一努力,我们引入了 yProv4ML图书馆,这是一个工具,旨在收集Json格式的证明数据,符合W3C PROV和ProvML标准。 yProv4MLL, 侧重于灵活性和存在性,并使用户能够通过插件整合更多的数据收集工具。图书馆与YProvv框架充分整合,允许通过更高层次的系统配置工作流程。
Article 42
Title@2025-07-01 (2): Improving the scalability of a high-order atmospheric dynamics solver based on the deal.II library
Title: Improving the scalability of a high-order atmospheric dynamics solver based on the deal.II library | Verbesserung der Skalierbarkeit eines atmosphärischen Dynamiklösers hoher Ordnung auf Basis der deal.II-Bibliothek | 改善基于 deal.II 库的高阶大气动力学求解器的可扩展性 2505.00384v3 |
Authors (3): Giuseppe Orlando, Tommaso Benacchio, Luca Bonaventura
We present recent advances on the massively parallel performance of a numerical scheme for atmosphere dynamics applications based on the deal.II library. The implicit-explicit discontinuous finite element scheme is based on a matrix-free approach, meaning that no global sparse matrix is built and only the action of the linear operators on a vector is actually implemented. Following a profiling analysis, we focus on the performance optimization of the numerical method and describe the impact of different preconditioning and solving techniques in this framework. Moreover, we show how the use of the latest version of the deal.II library and of suitable execution flags can improve the parallel performance.
我们介绍了基于此协议的大气动态应用数字办法的大规模平行性能的最新进展。基于此协议的隐含和不透明不连续的有限要素办法以无矩阵方法为基础,这意味着没有建立全球稀疏矩阵,只有线性操作员在矢量上的行动才得到实际执行。在进行剖析分析后,我们侧重于数字方法的性能优化,并描述这一框架内不同先决条件和解决技术的影响。此外,我们展示了如何使用最新版本的协议。二 图书馆和合适的执行标记可以改善平行性能。
Article 43
Title@2025-07-01 (2): Safe Low Bandwidth SPV: A Formal Treatment of Simplified Payment Verification Protocols and Security Bounds
Title: Safe Low Bandwidth SPV: A Formal Treatment of Simplified Payment Verification Protocols and Security Bounds | Safe Low Bandwidth SPV: Eine formale Behandlung von vereinfachten Zahlungsverifikationsprotokollen und Sicherheitsbunden | 安全低频带宽度SPV:对简化付款核查议定书和安全圈的正式处理 2507.00740v1 |
Authors (1): Craig S Wright
This paper presents a complete formal specification, protocol description, and mathematical proof structure for Simplified Payment Verification (SPV) as originally defined in the Bitcoin whitepaper (Nakamoto, 2008). In stark contrast to the misrepresentations proliferated by popular implementations, we show that SPV is not only secure under bounded adversarial assumptions but strictly optimal for digital cash systems requiring scalable and verifiable transaction inclusion. We reconstruct the SPV protocol from first principles, grounding its verification model in symbolic automata, Merkle membership relations, and chain-of-proof dominance predicates. Through rigorous probabilistic and game-theoretic analysis, we derive the economic bounds within which the protocol operates securely and verify its liveness and safety properties under partial connectivity, hostile relay networks, and adversarial propagation delay. Our specification further introduces low-bandwidth optimisations such as adaptive polling and compressed header synchronisation while preserving correctness. This document serves both as a blueprint for secure SPV implementation and a rebuttal of common misconceptions surrounding non-validating clients.
本文介绍了《Bitcoin 白纸\ cite{nakamoto2008} 最初定义的简化支付核查(SPV)的完整正式规格、协议描述和数学证明结构。与大众执行过程中大量出现的不实陈述形成鲜明对比的是,我们显示,SPV不仅在有约束的对抗假设下安全,而且对于数字现金系统也严格地来说是最佳的,需要包含可缩放和可核查的交易。我们从最初的原则中重建SPV协议,将其核查模式建立在象征性的自动数据、默克尔成员关系和可防控的支配地位前导线上。我们通过严格的概率和游戏理论分析,得出了协议安全运行的经济界限,并在部分连接、敌对的中继网络和对抗性传播延迟下验证了其生活和安全性。我们的规格进一步引入了低带宽选择,如适应性投票和压缩头板同步,同时保持正确性。本文件既是安全实施SPV的蓝图,也是对非验证客户的常见误解的反驳。
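The Merkle membership relation that SPV relies on can be sketched as follows. This is an illustrative sketch, not the paper's formal model: it assumes Bitcoin-style double SHA-256 and duplication of the last node on odd-sized levels, and the helper names are hypothetical.

```python
import hashlib
from typing import List, Tuple

def dsha256(data: bytes) -> bytes:
    """Double SHA-256, as used for Bitcoin transaction and block hashing."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def merkle_root_and_proof(leaves: List[bytes], index: int) -> Tuple[bytes, List[Tuple[bytes, bool]]]:
    """Return the Merkle root and the sibling path for the leaf at `index`.

    Each proof entry is (sibling_hash, sibling_is_left).
    """
    level, proof = list(leaves), []
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])          # duplicate the last node on odd levels
        sibling = index ^ 1                  # neighbour within the same pair
        proof.append((level[sibling], sibling < index))
        level = [dsha256(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        index //= 2
    return level[0], proof

def verify_inclusion(leaf: bytes, proof: List[Tuple[bytes, bool]], root: bytes) -> bool:
    """Recompute the root from a leaf and its sibling path; needs no other leaves."""
    node = leaf
    for sibling, sibling_is_left in proof:
        node = dsha256(sibling + node) if sibling_is_left else dsha256(node + sibling)
    return node == root

# Five stand-in transaction hashes; prove that the last one is included.
tx_hashes = [dsha256(bytes([i])) for i in range(5)]
root, proof = merkle_root_and_proof(tx_hashes, 4)
```

A light client holding only the block header (which contains the root) can check inclusion with a proof whose length grows logarithmically in the number of transactions.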
Article 44
Title@2025-07-01 (2): Accelerating Loading WebGraphs in ParaGrapher
Title: Accelerating Loading WebGraphs in ParaGrapher | Beschleunigte Laden WebGraphs in ParaGrapher | 加速加载 ParaGrapher 中的网页格 2507.00716v1 |
Authors (1): Mohsen Koohi Esfahani
ParaGrapher is a graph loading API and library that enables graph processing frameworks to load large-scale compressed graphs with minimal overhead. This capability accelerates the design and implementation of new high-performance graph algorithms and their evaluation on a wide range of graphs and across different frameworks. However, our previous study identified two major limitations in ParaGrapher: inefficient utilization of high-bandwidth storage and reduced decompression bandwidth due to increased compression ratios. To address these limitations, we present two optimizations for ParaGrapher in this paper. To improve storage utilization, particularly for high-bandwidth storage, we introduce ParaGrapher-FUSE (PG-Fuse), a filesystem based on FUSE (Filesystem in Userspace). PG-Fuse optimizes storage access by increasing the size of requested blocks, reducing the number of calls to the underlying filesystem, and caching the received blocks in memory for future calls. To improve the decompression bandwidth, we introduce CompBin, a compact binary representation of the CSR format. CompBin facilitates direct access to neighbors while avoiding storage for unused bytes. Our evaluation on 12 real-world and synthetic graphs with up to 128 billion edges shows that PG-Fuse and CompBin achieve up to 7.6 and 21.8 times speedup, respectively.
ParaGrapher 是一个图加载 API 和库,使图处理框架能够以最小的开销加载大规模压缩图。这一能力加快了新的高性能图算法的设计与实现,以及它们在多种图和不同框架上的评估。然而,我们此前的研究发现了 ParaGrapher 的两个主要局限:对高带宽存储的利用效率低,以及压缩率提高导致解压带宽下降。为了解决这些局限,我们在本文中提出了针对 ParaGrapher 的两项优化。为了提高存储利用率,特别是高带宽存储的利用率,我们引入了 ParaGrapher-FUSE(PG-Fuse),这是一个基于 FUSE(用户空间文件系统)的文件系统。PG-Fuse 通过增大请求块的大小、减少对底层文件系统的调用次数,并将收到的块缓存在内存中以供后续调用,来优化存储访问。为了提高解压带宽,我们引入了 CompBin,这是 CSR 格式的一种紧凑二进制表示。CompBin 便于直接访问邻居,同时避免为未使用的字节占用存储。我们在 12 个真实和合成图(最多 1280 亿条边)上的评估表明,PG-Fuse 和 CompBin 分别实现了最高 7.6 倍和 21.8 倍的加速。
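The idea of a compact, directly addressable CSR neighbor array can be sketched as fixed-width IDs that use only as many bytes as the vertex count requires. The layout below is a guess for illustration, not the actual CompBin format; `bytes_per_id`, `pack_neighbors`, and `neighbor_at` are hypothetical names.

```python
from typing import List

def bytes_per_id(num_vertices: int) -> int:
    """Smallest whole number of bytes that can hold any vertex ID in [0, num_vertices)."""
    return max(1, ((num_vertices - 1).bit_length() + 7) // 8)

def pack_neighbors(neighbors: List[int], num_vertices: int) -> bytes:
    """Store each neighbor ID in a fixed, minimal byte width (little-endian)."""
    width = bytes_per_id(num_vertices)
    return b"".join(n.to_bytes(width, "little") for n in neighbors)

def neighbor_at(packed: bytes, position: int, num_vertices: int) -> int:
    """Read the neighbor at `position` directly, without decoding earlier entries."""
    width = bytes_per_id(num_vertices)
    start = position * width
    return int.from_bytes(packed[start:start + width], "little")

# A graph with 70,000 vertices needs 3 bytes per ID instead of the usual 4 or 8.
num_vertices = 70_000
packed = pack_neighbors([5, 69_999, 300], num_vertices)
second = neighbor_at(packed, 1, num_vertices)
```

Fixed-width entries keep random access trivial (a multiply and a slice), in contrast to variable-length compressed encodings that must be decoded sequentially.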
Article 45
Title@2025-07-01 (2): eACGM: Non-instrumented Performance Tracing and Anomaly Detection towards Machine Learning Systems
Title: eACGM: Non-instrumented Performance Tracing and Anomaly Detection towards Machine Learning Systems | eACGM: Non-instrumented Performance Tracing and Anomalie Detection towards Machine Learning Systems | eACGM:非仪器化业绩追踪和异常探测,转向机器学习系统 2506.02007v2 |
Authors (3): Ruilin Xu, Zongxuan Xie, Pengfei Chen
We present eACGM, a full-stack AI/ML system monitoring framework based on eBPF. eACGM collects real-time performance data from key hardware components, including the GPU and network communication layer, as well as from key software stacks such as CUDA, Python, and PyTorch, all without requiring any code instrumentation or modifications. Additionally, it leverages libnvml to gather process-level GPU resource usage information. By applying a Gaussian Mixture Model (GMM) to the collected multidimensional performance metrics for statistical modeling and clustering analysis, eACGM effectively identifies complex failure modes, such as latency anomalies, hardware failures, and communication inefficiencies, enabling rapid diagnosis of system bottlenecks and abnormal behaviors. To evaluate eACGM’s effectiveness and practicality, we conducted extensive empirical studies and case analyses in multi-node distributed training scenarios. The results demonstrate that eACGM, while maintaining a non-intrusive and low-overhead profile, successfully captures critical performance anomalies during model training and inference. Its stable anomaly detection performance and comprehensive monitoring capabilities validate its applicability and scalability in real-world production environments, providing strong support for performance optimization and fault diagnosis in large-scale AI/ML systems.
eACGM从关键硬件组成部分,包括GPU和网络通信层,以及CUDA、Python和PyTorrch等关键软件堆堆中收集实时性能数据,而无需任何编码仪器或修改。此外,它利用 libnvml 收集流程级GPU资源使用信息。通过对收集的统计建模和集群分析的多维性能指标采用高斯混合模型(GMMM),电子ACGM有效地查明复杂的故障模式,如悬浮异常、硬件故障和通信效率低下,从而能够快速诊断系统瓶颈和异常行为。为了评价电子ACGM的效能和实用性,我们进行了广泛的实证研究,并在多点分布的培训假设中进行了案例研究。结果显示,在维持非侵扰性和低头型模型外观的同时,成功地捕捉了模型培训和推断过程中的关键性能异常。它稳定地检测异常性能和全面监测能力,在稳定性异常性异常性能和大规模地优化全球生产能力方面,验证了其性能和性能。
Article 46
Title@2025-07-01 (2): Toward Edge General Intelligence with Multiple-Large Language Model (Multi-LLM): Architecture, Trust, and Orchestration
Title: Toward Edge General Intelligence with Multiple-Large Language Model (Multi-LLM): Architecture, Trust, and Orchestration | Hin zum Rand Allgemeine Intelligenz mit multi-large Sprachmodell (Multi-LLM): Architektur, Vertrauen und Orchestrierung | 以多大语言模式(Multi-LLM):建筑、信任和管弦化 2507.00672v1 |
Authors (10): Haoxiang Luo, Yinqiu Liu, Ruichen Zhang, Jiacheng Wang, Gang Sun, Dusit Niyato, Hongfang Yu, Zehui Xiong, Xianbin Wang, Xuemin Shen
Edge computing enables real-time data processing closer to its source, thus improving the latency and performance of edge-enabled AI applications. However, traditional AI models often fall short when dealing with complex, dynamic tasks that require advanced reasoning and multimodal data processing. This survey explores the integration of multi-LLMs (Large Language Models) to address this in edge computing, where multiple specialized LLMs collaborate to enhance task performance and adaptability in resource-constrained environments. We review the transition from conventional edge AI models to single LLM deployment and, ultimately, to multi-LLM systems. The survey discusses enabling technologies such as dynamic orchestration, resource scheduling, and cross-domain knowledge transfer that are key for multi-LLM implementation. A central focus is on trusted multi-LLM systems, ensuring robust decision-making in environments where reliability and privacy are crucial. We also present multimodal multi-LLM architectures, where multiple LLMs specialize in handling different data modalities, such as text, images, and audio, and integrate their outputs for comprehensive analysis. Finally, we highlight future directions, including improving resource efficiency and trustworthy governance of multi-LLM systems, while addressing privacy, trust, and robustness concerns. This survey provides a valuable reference for researchers and practitioners aiming to leverage multi-LLM systems in edge computing applications.
远程计算使实时数据处理更接近其源头,从而改进边际应用的隐蔽度和性能。然而,传统的AI模型在处理需要高级推理和多式数据处理的复杂、动态任务时往往落后于传统的AI模型。这项调查探索了将多LLM(通用语言模型)整合到边际计算中解决这一难题的办法,在边际计算中,多个专门的LLM公司合作提高任务性能和适应性。我们审查了从传统边缘AI模型向单一LLOM部署的过渡,并最终向多LLLM系统过渡的情况。调查讨论了扶持性技术,例如动态协调、资源时间安排和跨域知识转让,这些技术是多LLEM实施的关键。一个中心重点是可信赖的多LLM系统,确保在可靠性和隐私至关重要的环境中进行强有力的决策。我们还介绍了多LLM公司多式联运多LM结构,其中多个LLM公司专门处理不同的数据模式,如文本、图像和音频,将其产出综合用于全面分析。我们着重指出了未来的方向,包括提高资源效率、可信赖的多级LLEM系统,同时解决隐私、信任和稳定应用的多级调查问题。
Article 47
Title@2025-07-01 (2): DynoStore: A wide-area distribution system for the management of data over heterogeneous storage
Title: DynoStore: A wide-area distribution system for the management of data over heterogeneous storage | DynoStore: Ein weiträumiges Distributionssystem für die Verwaltung von Daten über heterogene Speicherung | DynoStore:管理不同储存数据广域分布系统 2507.00576v1 |
Authors (9): Dante D. Sanchez-Gallegos, J. L. Gonzalez-Compean, Maxime Gonthier, Valerie Hayot-Sasson, J. Gregory Pauloski, Haochen Pan, Kyle Chard, Jesus Carretero, Ian Foster
Data distribution across different facilities offers benefits such as enhanced resource utilization, increased resilience through replication, and improved performance by processing data near its source. However, managing such data is challenging due to heterogeneous access protocols, disparate authentication models, and the lack of a unified coordination framework. This paper presents DynoStore, a system that manages data across heterogeneous storage systems. At the core of DynoStore are data containers, an abstraction that provides standardized interfaces for seamless data management, irrespective of the underlying storage systems. Multiple data container connections create a cohesive wide-area storage network, ensuring resilience using erasure coding policies. Furthermore, a load-balancing algorithm ensures equitable and efficient utilization of storage resources. We evaluate DynoStore using benchmarks and real-world case studies, including the management of medical and satellite data across geographically distributed environments. Our results demonstrate a 10% performance improvement compared to centralized cloud-hosted systems while maintaining competitive performance with state-of-the-art solutions such as Redis and IPFS. DynoStore also exhibits superior fault tolerance, withstanding more failures than traditional systems.
在不同设施之间分配数据的好处包括:资源利用率的提高、通过复制提高复原力,以及通过在数据来源附近处理数据来改进性能;然而,由于各种异构的访问协议、不同的认证模式和缺乏统一协调框架,管理这些数据具有挑战性;本文介绍了DynoStore,这是一个跨异构存储系统管理数据的系统;DynoStore的核心是数据容器,这是一种抽象,为无缝数据管理提供了标准化的接口,而不论其底层存储系统如何;多个数据容器的连接建立了一个具有凝聚力的广域存储网络,利用纠删码策略确保了复原力;此外,负载均衡算法确保了存储资源的公平和高效利用;我们利用基准测试和现实世界案例研究对DynoStore进行了评估,包括在地理分布环境中管理医疗和卫星数据;我们的结果显示,与集中式云托管系统相比,性能提高了10%,同时与Redis和IPFS等最先进的解决方案保持有竞争力的性能;DynoStore还展示了更优的容错性,能够承受比传统系统更多的故障。
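The DynoStore abstract above mentions resilience through erasure coding policies but does not say which code is used. As a rough illustration only, the sketch below uses the simplest possible scheme, a single XOR parity chunk over `k` data chunks, which survives the loss of any one chunk. The function names `encode` and `decode` are hypothetical, and production systems typically rely on stronger codes such as Reed-Solomon.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data: bytes, k: int) -> List[bytes]:
    """Split data into k equal-size chunks (zero-padded) plus one XOR parity chunk."""
    chunk_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(chunk_len * k, b"\0")
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]
    parity = reduce(xor_bytes, chunks)
    return chunks + [parity]


def decode(chunks: List[Optional[bytes]], original_len: int) -> bytes:
    """Rebuild the data when at most one of the k+1 chunks is missing (None)."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) > 1:
        raise ValueError("a single-parity code tolerates only one lost chunk")
    chunks = list(chunks)
    if missing:
        # XOR of all surviving chunks reproduces the lost one.
        present = [c for c in chunks if c is not None]
        chunks[missing[0]] = reduce(xor_bytes, present)
    # Drop the parity chunk and the zero padding.
    return b"".join(chunks[:-1])[:original_len]


payload = b"medical-image-bytes"
stored = encode(payload, k=4)   # 4 data chunks + 1 parity chunk
stored[2] = None                # simulate one failed storage site
recovered = decode(stored, len(payload))
```

Each chunk could live in a different data container, so one failed site does not lose data. Tolerating more simultaneous failures requires more parity chunks and a different code.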
Article 48
Title@2025-07-01 (2): Collaborative Multi-Agent Reinforcement Learning Approach for Elastic Cloud Resource Scaling
Title: Collaborative Multi-Agent Reinforcement Learning Approach for Elastic Cloud Resource Scaling | Collaboratives Multi-Agent-Verstärkungs-Lernkonzept für elastische Cloud-Ressourcenskalierung | 弹性云层资源扩缩多机构加强学习方法合作 2507.00550v1 |
Authors (2): Bruce Fang, Danyi Gao
This paper addresses the challenges of rapid resource variation and highly uncertain task loads in cloud computing environments. It proposes an optimization method for elastic cloud resource scaling based on a multi-agent system. The method deploys multiple autonomous agents to perceive resource states in parallel and make local decisions. While maintaining the distributed nature of the system, it introduces a collaborative value function to achieve global coordination. This improves the responsiveness of resource scheduling and enhances overall system performance. To strengthen system foresight, a lightweight state prediction model is designed. It assists agents in identifying future workload trends and optimizes the selection of scaling actions. For policy training, the method adopts a centralized training and decentralized execution reinforcement learning framework. This enables agents to learn effectively and coordinate strategies under conditions of incomplete information. The paper also constructs typical cloud scenarios, including multi-tenancy and burst traffic, to evaluate the proposed method. The evaluation focuses on resource isolation, service quality assurance, and robustness. Experimental results show that the proposed multi-agent scaling strategy outperforms existing methods in resource utilization, SLA violation control, and scheduling latency. The results demonstrate strong adaptability and intelligent regulation. This provides an efficient and reliable new approach to solving the problem of elastic resource scaling in complex cloud platforms.
本文讨论了在云计算环境中快速资源变化和高度不确定任务负荷的挑战,建议了基于多试剂系统的弹性云源资源缩放优化方法。该方法部署多个自主代理器,以同时了解资源状态并作出当地决定;在保持该系统的分布性质的同时,还引入了协作价值功能,以实现全球协调。这提高了资源排期的响应能力并提高了整个系统的业绩。为加强系统展望,设计了一个轻量国家预测模型。它帮助代理商确定未来工作量趋势,优化了规模化行动的选择。对于政策培训,该方法采用了集中培训和分散执行强化学习框架。该方法使代理商能够在信息不完整的情况下有效学习和协调战略。该方法还构建了典型的云层情景,包括多重使用和爆破交通,以评价拟议方法。该评价侧重于资源孤立、服务质量保证和稳健性。实验结果表明,拟议的多试剂扩展战略超过了资源利用、苏丹解放军违规控制和排整等现有方法。该方法显示了强有力的适应性和智能度调控。该方法为解决问题提供了一种高效和可靠的新的云层模型。
Article 49
Title@2025-07-01 (2): Edge Computing and its Application in Robotics: A Survey
Title: Edge Computing and its Application in Robotics: A Survey | Edge Computing und seine Anwendung in der Robotik: Eine Umfrage | 边缘计算及其在机器人学中的应用:调查 2507.00523v1 |
Authors (2): Nazish Tahir, Ramviyas Parasuraman
The Edge computing paradigm has gained prominence in both academic and industry circles in recent years. By implementing edge computing facilities and services in robotics, it becomes a key enabler in the deployment of artificial intelligence applications to robots. Time-sensitive robotics applications benefit from the reduced latency, mobility, and location awareness provided by the edge computing paradigm, which enables real-time data processing and intelligence at the network’s edge. While the advantages of integrating edge computing into robotics are numerous, there has been no recent survey that comprehensively examines these benefits. This paper aims to bridge that gap by highlighting important work in the domain of edge robotics, examining recent advancements, and offering deeper insight into the challenges and motivations behind both current and emerging solutions. In particular, this article provides a comprehensive evaluation of recent developments in edge robotics, with an emphasis on fundamental applications, providing in-depth analysis of the key motivations, challenges, and future directions in this rapidly evolving domain. It also explores the importance of edge computing in real-world robotics scenarios where rapid response times are critical. Finally, the paper outlines various open research challenges in the field of edge robotics.
近些年来,边缘计算模式在学术界和产业界都日益受到重视。通过在机器人领域实施边缘计算设施和服务,它成为向机器人部署人工智能应用的关键促进因素。时间敏感的机器人应用受益于边缘计算模式提供的较低的潜伏性、流动性和定位意识,它使得网络边缘的实时数据处理和智能得以实现。虽然将边缘计算纳入机器人的优势很多,但最近没有进行全面审查这些效益的调查。本文旨在通过突出边缘机器人领域的重要工作、审查最近的进展以及更深入地了解当前和新出现解决方案背后的挑战和动机来弥合这一差距。特别是,本篇文章全面评价了边缘机器人领域最近的动态,强调基本应用,深入分析了这一迅速变化领域的关键动机、挑战和未来方向。它还探讨了边缘计算在现实世界机器人情景中的重要性,在这些情景中,迅速的反应时间至关重要。最后,论文概述了边缘机器人领域的各种公开研究挑战。
Article 50
Title@2025-07-01 (2): LLM-Mesh: Enabling Elastic Sharing for Serverless LLM Inference
Title: LLM-Mesh: Enabling Elastic Sharing for Serverless LLM Inference | LLM-Mesh: Elastische Freigabe für serverlose LLM-Inferenz aktivieren | LLM-Mesh:为无服务器的LLM推理提供弹性分享能力 2507.00507v1 |
Authors (5): Chuhao Xu, Zijun Li, Quan Chen, Han Zhao, Minyi Guo
The rise of LLMs has driven demand for private serverless deployments, characterized by moderate-scale models and infrequent requests. While existing solutions follow exclusive GPU deployment, we take a step back to explore modern platforms and find that emerging CPU architectures with built-in accelerators are capable of serving LLMs but remain underutilized, and that both CPUs and GPUs can accommodate multiple LLMs simultaneously. We propose LLM-Mesh, a serverless inference scheme for small-to-mid-sized LLMs that enables elastic sharing across heterogeneous hardware. LLM-Mesh tackles three fundamental challenges: (1) precise, fine-grained compute resource allocation at the token level to handle fluctuating computational demands; (2) a coordinated and forward-looking memory scaling mechanism to detect out-of-memory hazards and reduce operational overhead; and (3) a dual approach that reduces resource fragmentation through proactive preemption and reactive bin-packing. Experimental results on four 32-core CPUs and four A100 GPUs show that LLM-Mesh improves service capacity by 44% - 63% through sharing, while further leveraging CPUs boosts this to 91% - 159%.
LLM的兴起推动了对私有无服务器部署的需求,其特点是模型规模中等、请求不频繁。现有解决方案采用独占GPU的部署方式,而我们退后一步考察现代平台,发现:内置加速器的新兴CPU架构能够服务LLM,但仍未得到充分利用;CPU和GPU都能够同时容纳多个LLM。我们提出了LLM-Mesh,这是一种面向中小型LLM的无服务器推理方案,可在异构硬件上实现弹性共享。LLM-Mesh解决了三项基本挑战:(1) 在词元(token)级别进行精确、细粒度的计算资源分配,以应对波动的计算需求;(2) 协调且具有前瞻性的内存伸缩机制,以发现内存不足(out-of-memory)风险并减少运行开销;(3) 通过主动抢占和被动装箱(bin-packing)减少资源碎片的双重方法。在4个32核CPU和4个A100 GPU上的实验结果表明,LLM-Mesh通过共享将服务容量提高了44% - 63%,进一步利用CPU可将这一提升提高到91% - 159%。
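The LLM-Mesh abstract above names bin-packing as one way to reduce resource fragmentation, without giving the algorithm. The sketch below shows a generic first-fit-decreasing heuristic as one plausible reading. The model names, memory figures, and the function `first_fit_decreasing` are illustrative assumptions, not LLM-Mesh's actual policy.

```python
from typing import Dict, List


def first_fit_decreasing(demands: Dict[str, float], capacity: float) -> List[Dict[str, float]]:
    """Place named memory demands on equal-capacity devices, largest demand first."""
    devices: List[Dict[str, float]] = []
    for name, need in sorted(demands.items(), key=lambda kv: kv[1], reverse=True):
        if need > capacity:
            raise ValueError(f"{name} does not fit on any device")
        for device in devices:
            # Reuse the first device that still has room for this model.
            if sum(device.values()) + need <= capacity:
                device[name] = need
                break
        else:
            # No existing device fits: open a new one.
            devices.append({name: need})
    return devices


# Hypothetical per-model memory demands in GB on 80 GB devices.
models = {"a": 30.0, "b": 20.0, "c": 50.0, "d": 40.0, "e": 10.0}
placement = first_fit_decreasing(models, capacity=80.0)
```

Sorting by decreasing demand places the large models first and lets small ones fill the remaining gaps, which is the usual reason this heuristic leaves fewer stranded fragments than arrival-order placement.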
Article 51
Title@2025-07-01 (2): Real-Time In-Network Machine Learning on P4-Programmable FPGA SmartNICs with Fixed-Point Arithmetic and Taylor
Title: Real-Time In-Network Machine Learning on P4-Programmable FPGA SmartNICs with Fixed-Point Arithmetic and Taylor | Echtzeit-In-Network Machine Learning auf P4-Programmierbaren FPGA SmartNICs mit Fixed-Point Arithmetic und Taylor | P4-可编程的PFGA智能计算机计算机计算机与固定点测算机和泰勒的实时网络内机器学习 2507.00428v1 |
Authors (6): Mohammad Firas Sada, John J. Graham, Mahidhar Tatineni, Dmitry Mishin, Thomas A. DeFanti, Frank Würthwein
As machine learning (ML) applications become integral to modern network operations, there is an increasing demand for network programmability that enables low-latency ML inference for tasks such as Quality of Service (QoS) prediction and anomaly detection in cybersecurity. ML models provide adaptability through dynamic weight adjustments, making Programming Protocol-independent Packet Processors (P4)-programmable FPGA SmartNICs an ideal platform for investigating In-Network Machine Learning (INML). These devices offer high-throughput, low-latency packet processing and can be dynamically reconfigured via the control plane, allowing for flexible integration of ML models directly at the network edge. This paper explores the application of the P4 programming paradigm to neural networks and regression models, where weights and biases are stored in control plane table lookups. This approach enables flexible programmability and efficient deployment of retrainable ML models at the network edge, independent of core infrastructure at the switch level.
随着机器学习(ML)应用成为现代网络运作的组成部分,对网络可编程性的需求日益增加,从而能够对服务质量(QOS)预测和网络安全异常探测等任务进行低纬度 ML推断。 ML模型通过动态重量调整提供适应性,使编程协议独立包装处理器(P4)-可编程的FPGA智能处理器(FPGA Smartnics)成为调查网络机器学习(INML)的理想平台。这些装置提供高通量、低纬度包处理,可通过控制平面进行动态重组,允许将ML模型在网络边缘直接灵活整合。本文探讨了将P4编程范式应用于神经网络和回归模型,其中重量和偏差储存在控制平面表外观中。这一方法使得在网络边缘能够灵活地编程和高效地部署可改编程的ML模型,在开关上独立于核心基础设施。
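The title of this entry points to fixed-point arithmetic and Taylor expansions, and the abstract says weights and biases are held in control-plane tables. Neither gives the number format or the expansion order, so the sketch below is an assumption-laden illustration: it uses a Q16.16 format and a third-order Taylor expansion of the sigmoid, in integer-only Python rather than P4. All names are hypothetical.

```python
FRAC_BITS = 16            # assumed Q16.16 fixed-point format
ONE = 1 << FRAC_BITS      # fixed-point representation of 1.0


def to_fixed(x: float) -> int:
    """Convert a float to fixed point (done offline, e.g. when filling tables)."""
    return int(round(x * ONE))


def to_float(q: int) -> float:
    """Convert a fixed-point value back to a float for inspection."""
    return q / ONE


def fx_mul(a: int, b: int) -> int:
    """Multiply two fixed-point numbers and rescale the product."""
    return (a * b) >> FRAC_BITS


def sigmoid_taylor(x: int) -> int:
    """Third-order Taylor expansion around 0: 1/2 + x/4 - x^3/48, integer ops only."""
    x3 = fx_mul(fx_mul(x, x), x)
    return (ONE >> 1) + (x >> 2) - x3 // 48


def neuron(inputs, weights, bias):
    """Weighted sum followed by the approximate sigmoid, all in fixed point."""
    acc = bias
    for x, w in zip(inputs, weights):
        acc += fx_mul(x, w)
    return sigmoid_taylor(acc)


# Weights and bias play the role of entries fetched from a lookup table.
out = neuron([to_fixed(1.0)], [to_fixed(0.5)], to_fixed(0.0))
approx = to_float(out)  # close to sigmoid(0.5) for small inputs
```

The expansion is only accurate near zero, so a real data-plane design would need input clamping or a piecewise approximation for larger activations.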
Article 52
Title@2025-07-01 (2): Find a Scapegoat: Poisoning Membership Inference Attack and Defense to Federated Learning
Title: Find a Scapegoat: Poisoning Membership Inference Attack and Defense to Federated Learning | Find a Scapegoat: Vergiftung der Mitgliedschaft Inferenzangriff und Verteidigung zu Federated Learning | 寻找一条“Scamegoat”:毒瘾成员攻击和防御联邦学习组织 2507.00423v1 |
Authors (4): Wenjin Mo, Zhiyuan Li, Minghong Fang, Mingwei Fang
Federated learning (FL) allows multiple clients to collaboratively train a global machine learning model with coordination from a central server, without needing to share their raw data. This approach is particularly appealing in the era of privacy regulations like the GDPR, leading many prominent companies to adopt it. However, FL’s distributed nature makes it susceptible to poisoning attacks, where malicious clients, controlled by an attacker, send harmful data to compromise the model. Most existing poisoning attacks in FL aim to degrade the model’s integrity, such as reducing its accuracy, with limited attention to privacy concerns from these attacks. In this study, we introduce FedPoisonMIA, a novel poisoning membership inference attack targeting FL. FedPoisonMIA involves malicious clients crafting local model updates to infer membership information. Additionally, we propose a robust defense mechanism to mitigate the impact of FedPoisonMIA attacks. Extensive experiments across various datasets demonstrate the attack’s effectiveness, while our defense approach reduces its impact to a degree.
联邦学习组织(FL)允许多个客户在中央服务器的协调下合作培训全球机器学习模式,无需分享原始数据。在像GDPR这样的隐私监管时代,这种方法特别具有吸引力,导致许多著名公司采用。然而,FL的分布性使其容易中毒袭击,恶意客户在攻击者控制下发送有害数据以损害模式。FL的大多数现有中毒袭击旨在降低模式的完整性,如降低其准确性,同时对这些袭击的隐私关注有限。在本研究中,我们引入了Fed PoisonMIA,这是针对FL. Fed PoisisonMIA的新的中毒成员推断性袭击,涉及恶意客户设计当地模型更新以推断成员信息。此外,我们提出一个强有力的防御机制,以减轻Fed PoisonMIA袭击的影响。在各种数据集中进行的广泛实验显示了袭击的效果,而我们的防御方法则将攻击的影响降低到一定程度。
Article 53
Title@2025-07-01 (2): Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and High-Performance GPUs
Title: Serving LLMs in HPC Clusters: A Comparative Study of Qualcomm Cloud AI 100 Ultra and High-Performance GPUs | LLMs in HPC-Clustern bedienen: Eine vergleichende Studie von Qualcomm Cloud AI 100 Ultra- und Hochleistungs-GPUs | HPC群集中服务长效LLMs:对Qalcomm Cloud AI 100超效和高效GPU的比较研究 2507.00418v1 |
Authors (10): Mohammad Firas Sada, John J. Graham, Elham E Khoda, Mahidhar Tatineni, Dmitry Mishin, Rajesh K. Gupta, Rick Wagner, Larry Smarr, Thomas A. DeFanti, Frank Würthwein
This study presents a benchmarking analysis of the Qualcomm Cloud AI 100 Ultra (QAic) accelerator for large language model (LLM) inference, evaluating its energy efficiency (throughput per watt) and performance against leading NVIDIA (A100, H200) and AMD (MI300A) GPUs within the National Research Platform (NRP) ecosystem. A total of 15 open-source LLMs, ranging from 117 million to 90 billion parameters, are served using the vLLM framework. The QAic inference cards appear to be energy efficient and perform well on the energy efficiency metric in most cases. The findings offer insights into the potential of the Qualcomm Cloud AI 100 Ultra for high-performance computing (HPC) applications within the National Research Platform (NRP).
本研究对大型语言模型(LLM)推论的Qalcomm Cloud AI 100 Ultra(QAic)加速器进行了基准分析,对照国家研究平台生态系统内主要的NVIDIA(A100,H200)和AMD(MI300A)GPU,评价其能效(每瓦)和性能表现。共有15个开放源LMs,范围从1.17亿到900亿参数,利用VLLM框架提供服务。QAic推论卡看来是节能的,在大多数情况下在能源效率指标方面表现良好。研究结果揭示了Qalcomm Cloud AI 100 Ulttra在国家研究平台内应用高性能计算的潜力。
Article 54
Title@2025-07-01 (2): HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism
Title: HelixPipe: Efficient Distributed Training of Long Sequence Transformers with Attention Parallel Pipeline Parallelism | HelixPipe: Effizientes Training von langen Sequenztransformatoren mit Aufmerksamkeit Paralleler Pipeline-Parallelismus | HelixPipe:对长序列变异器进行有效分布式培训,注意平行管道平行平行平行 2507.00394v1 |
Authors (5): Geng Zhang, Shenggan Cheng, Xuanlei Zhao, Ziming Liu, Yang You
As transformer sequence lengths grow, existing pipeline parallelisms incur suboptimal performance due to the quadratic attention computation and the substantial memory overhead. To relieve these challenges, we propose HelixPipe, a novel pipeline parallelism for long sequence transformer training. First, HelixPipe introduces attention parallel partition, which schedules attention computations of different micro batches across different pipeline stages in parallel, reducing pipeline bubbles. Second, it employs a two-fold first-in-last-out micro batch schedule to balance memory usage and overlap communication with computation. Additionally, HelixPipe utilizes recomputation without attention and chunked MLP to mitigate fragmentation and enable longer sequences. Experiments demonstrate that HelixPipe gains increasing advantages with longer sequence lengths, and outperforms existing methods in throughput and scalability across varying pipeline sizes, model sizes, and cluster configurations. Notably, it achieves a 26% speedup over baseline methods when training a 7B model with 128k sequence length on 64 H20 GPUs. Code is available at https://github.com/code-tunnel/Megatron-LM/tree/dev.
随着Transformer序列长度的增长,现有的流水线并行方法由于二次复杂度的注意力计算和大量的内存开销而性能欠佳。为缓解这些挑战,我们提出了HelixPipe,这是一种用于长序列Transformer训练的新型流水线并行方法。首先,HelixPipe引入注意力并行划分,将不同微批次的注意力计算并行地调度到不同的流水线阶段,从而减少流水线气泡。其次,它采用双重先进后出的微批次调度,以平衡内存使用并使通信与计算重叠。此外,HelixPipe利用不含注意力的重计算和分块MLP来缓解碎片化并支持更长的序列。实验表明,序列越长,HelixPipe的优势越大,并且在不同的流水线规模、模型规模和集群配置下,其吞吐量和可扩展性均优于现有方法。值得注意的是,在64个H20 GPU上训练序列长度为128k的7B模型时,它比基线方法加速了26%。代码可在以下地址获取: https://github.com/code-tunnel/Megatron-LM/tree/dev
Article 55
Title@2025-06-30 (1): Evaluation of a Foundational Model and Stochastic Models for Forecasting Sporadic or Spiky Production Outages of High-Performance Machine Learning Services
Title: Evaluation of a Foundational Model and Stochastic Models for Forecasting Sporadic or Spiky Production Outages of High-Performance Machine Learning Services | Bewertung eines Basismodells und stochastische Modelle zur Vorhersage sporadischer oder würziger Produktionsausfälle hochleistungsfähiger Machine Learning Services | 评价预测高性能机器学习服务零星或斯皮生产流出的基础模型和存储模型 2507.01067v1 |
Authors (1): Keun Soo Yim
Time series forecasting models have diverse real-world applications (e.g., from electricity metrics to software workload). The latest foundational models trained for time series forecasting show strengths (e.g., for long sequences and in zero-shot settings). However, foundational models have not yet been used for forecasting rare, spiky events, a challenging target because such events are a corner case of extreme events. In this paper, we optimize a state-of-the-art foundational model to forecast sporadic or spiky production outages of high-performance machine learning services powering billions of client devices. We evaluate the forecasting errors of the foundational model compared with classical stochastic forecasting models (e.g., moving average and autoregressive). The analysis helps us understand how each of the evaluated models performs for the sporadic or spiky events. For example, it identifies the key patterns in the target data that are well tracked by the foundational model vs. each of the stochastic models. We use the models with optimal parameters to estimate year-long outage statistics for a particular root cause with less than 6% value errors.
时间序列预测模型具有不同的真实世界应用(例如,从电力量计到软件工作量等)。经过时间序列预测培训的最新基础模型显示了优势(例如,长序列和零射环境)。然而,基础模型尚未用于预测罕见的、突发事件,即具有挑战性的目标,因为这些事件是极端事件的一个角落。在本文中,我们优化了最先进的基础模型,以预测高性能机器学习服务零星或突然的产量断流,使数十亿客户设备具有动力。我们评估了基础模型的预测错误,与典型的随机预测模型(例如,移动平均和自动递增性)相比,显示了优势。分析有助于我们了解每个被评估的模型是如何为零星或突发事件进行演化的。例如,它确定了目标数据中的关键模式,这些模式由基础模型和每个随机模型都很好地跟踪。我们使用这些模型,用最理想的参数来估计特定根部的年长统计值不到6%的错误。
Article 56
Title@2025-06-30 (1): Rust vs. C for Python Libraries: Evaluating Rust-Compatible Bindings Toolchains
Title: Rust vs. C for Python Libraries: Evaluating Rust-Compatible Bindings Toolchains | Rust vs. C für Python Bibliotheken: Bewertung von Rust-kompatiblen Bindungen Toolchains | Python图书馆的Rust诉C案:评估Rust-Compable Contracable Contails 工具链 2507.00264v1 |
Authors (3): Isabella Basso do Amaral, Renato Cordeiro Ferreira, Alfredo Goldman
The Python programming language is best known for its syntax and scientific libraries, but it is also notorious for its slow interpreter. Optimizing critical sections in Python entails special knowledge of the binary interactions between programming languages, and can be cumbersome to interface manually, with implementers often resorting to convoluted third-party libraries. This comparative study evaluates the performance and ease of use of the PyO3 Python bindings toolchain for Rust against ctypes and cffi. By using Rust tooling developed for Python, we can achieve state-of-the-art performance with no concern for API compatibility.
Python 编程语言在其语法和科学图书馆中最出名,但它也因其翻译速度缓慢而臭名昭著。 优化 Python 的关键部分需要特别了解程序语言之间的二进制互动,并且可能难以用手动方式连接,因为执行者经常求助于复杂的第三方图书馆。 这份比较研究评估了 PyO3 Python 捆绑工具链的性能和容易使用性。 通过使用为 Python 开发的 Rust 工具链,我们可以在不考虑API 兼容性的情况下实现最先进的性能。
Article 57
Title@2025-06-30 (1): CrossPipe: Towards Optimal Pipeline Schedules for Cross-Datacenter Training
Title: CrossPipe: Towards Optimal Pipeline Schedules for Cross-Datacenter Training | CrossPipe: Auf dem Weg zu optimalen Pipeline-Fahrplänen für Cross-Datacenter-Schulungen | CrossPipe:争取为跨数据中心培训制定最佳管道时间表 2507.00217v1 |
Authors (4): Tiancheng Chen, Ales Kubicek, Langwen Huang, Torsten Hoefler
Training large language models (LLMs) now requires resources that exceed a single datacenter, making cross-datacenter strategies increasingly crucial. We present CrossPipe, a framework designed to optimize model training across geographically distributed datacenters by explicitly modeling and mitigating the impact of network latency and limited bandwidth. It enables unified analysis and optimization incorporating both pipeline parallelism (PP) and opportunities for overlapping data parallelism (DP) communication. CrossPipe generates optimized pipeline schedules using either solver-based optimal or fast near-optimal greedy algorithms, built upon a flexible execution engine that separates scheduling logic from communication details. Our evaluation shows that CrossPipe reduces training time by up to 33.6% compared to traditional pipeline schedules under identical memory constraints. When memory constraints are relaxed, CrossPipe maintains strong performance despite communication delays, approaching the efficiency of idealized schedules without delays. CrossPipe offers improved scalability and resource utilization, particularly in environments with high network latency or limited bandwidth.
培训大型语言模型(LLMS)现在需要超过单一数据中心的资源,使交叉数据中心战略越来越重要。我们介绍了CrossPipe,这是一个框架,旨在通过明确建模和减轻网络长期性和有限带宽的影响,优化地理分布数据中心的示范培训。它能够统一分析和优化,既包括管道平行(PP),也包括数据平行通信的重叠(DP)机会。CrossPipe利用基于解决问题者的最佳或快速接近最佳的贪婪算法,利用灵活的执行引擎,将逻辑与通信细节区分开来。我们的评估表明,CrossPipe将培训时间比传统管道时间表减少33.6,而在记忆相同的限制下,Crospipe尽管记忆受限制,但仍保持强劲的绩效,不拖延地接近理想化时间表的效率。Crospie提供更好的可扩展性和资源利用性和资源利用性,特别是在高网络延迟或带宽度有限的环境中。
Article 58
Title@2025-06-30 (1): Avoid Forgetting by Preserving Global Knowledge Gradients in Federated Learning with Non-IID Data
Title: Avoid Forgetting by Preserving Global Knowledge Gradients in Federated Learning with Non-IID Data | Vermeiden Sie das Vergessen, indem Sie globale Wissensgradienten im Föderierten Lernen mit nicht-ID-Daten bewahren | 避免在使用非二二二维数据进行联邦学习时因保留全球知识进步而被遗忘 2505.20485v3 |
Authors (5): Abhijit Chunduru, Majid Morafah, Mahdi Morafah, Vishnu Pandi Chellapandi, Ang Li
The inevitable presence of data heterogeneity has made federated learning very challenging. There are numerous methods to deal with this issue, such as local regularization, better model fusion techniques, and data sharing. Though effective, they lack a deep understanding of how data heterogeneity can affect the global decision boundary. In this paper, we bridge this gap by performing an experimental analysis of the learned decision boundary using a toy example. Our observations are surprising: (1) existing methods suffer from forgetting, in that clients forget the global decision boundary and learn only the perfect local one; and (2) this happens regardless of the initial weights, in that clients forget the global decision boundary even when starting from pre-trained optimal weights. We present FedProj, a federated learning framework that robustly learns the global decision boundary and avoids forgetting it during local training. To achieve better ensemble knowledge fusion, we design a novel server-side ensemble knowledge transfer loss to further calibrate the learned global decision boundary. To alleviate forgetting of the learned global decision boundary, we further propose leveraging an episodic memory of average ensemble logits on a public unlabeled dataset to regulate the gradient updates at each step of local training. Experimental results demonstrate that FedProj outperforms state-of-the-art methods by a large margin.
数据异质性不可避免的存在使联邦学习变得非常困难。处理这一问题的方法有很多,例如本地正则化、更好的模型融合技术和数据共享。这些方法虽然有效,但缺乏对数据异质性如何影响全局决策边界的深入理解。在本文中,我们通过使用一个玩具示例对所学决策边界进行实验分析来弥补这一差距。我们的观察令人惊讶:(1) 现有方法存在遗忘问题,客户端会忘记全局决策边界,只学到完美的本地决策边界;(2) 无论初始权重如何都会发生这种情况,即使从预训练的最优权重开始,客户端也会忘记全局决策边界。我们提出了FedProj,这是一个联邦学习框架,能够稳健地学习全局决策边界并避免在本地训练中将其遗忘。为了实现更好的集成知识融合,我们设计了一种新的服务器端集成知识迁移损失,以进一步校准所学的全局决策边界。为缓解对所学全局决策边界的遗忘,我们进一步提出利用公共无标签数据集上平均集成logits的情景记忆,来约束本地训练每一步的梯度更新。实验结果表明,FedProj大幅优于最先进的方法。
Article 59
Title@2025-06-30 (1): Identifying the Truth of Global Model: A Generic Solution to Defend Against Byzantine and Backdoor Attacks in Federated Learning (full version)
Title: Identifying the Truth of Global Model: A Generic Solution to Defend Against Byzantine and Backdoor Attacks in Federated Learning (full version) | Die Wahrheit des globalen Modells identifizieren: Eine generische Lösung gegen byzantinische und Hintertürangriffe im Federated Learning (Vollversion) | 查明全球模式真相:在联邦学习联盟中防范拜占庭和后门攻击的一般解决办法(全文) 2311.10248v3 |
Authors (3): Sheldon C. Ebron, Meiying Zhang, Kan Yang
Federated Learning (FL) enables multiple parties to train machine learning models collaboratively without sharing the raw training data. However, the federated nature of FL enables malicious clients to influence a trained model by injecting erroneous model updates via Byzantine or backdoor attacks. To detect malicious model updates, a typical approach is to measure the distance between each model update and a ground-truth model update. To find such ground-truth model updates, existing defenses either require a benign root dataset on the server (e.g., FLTrust) or simply use trimmed mean or median as the threshold for clipping (e.g., FLAME). However, such benign root datasets are impractical, and the trimmed mean or median may also eliminate contributions from underrepresented datasets. In this paper, we propose a generic solution, namely FedTruth, to defend against model poisoning attacks in FL, where the ground-truth model update (i.e., the global model update) will be estimated among all the model updates with dynamic aggregation weights. Specifically, FedTruth does not have specific assumptions on the benign or malicious data distribution or access to a benign root dataset. Moreover, FedTruth considers the potential contributions from all benign clients. Our empirical results show that FedTruth can reduce the impacts of poisoned model updates against both Byzantine and backdoor attacks, and is also efficient in large-scale FL systems.
联邦学习(FL)使多方能够在不共享原始训练数据的情况下协作训练机器学习模型。然而,FL的联邦性质使恶意客户端能够通过拜占庭攻击或后门攻击注入错误的模型更新,从而影响训练出的模型。为了检测恶意模型更新,典型的做法是测量每个模型更新与“真值模型更新”(ground-truth model update)之间的距离。为了找到这样的真值模型更新,现有防御方法要么需要服务器上有一个良性的根数据集(例如FLTrust),要么简单地使用截尾均值或中位数作为裁剪阈值(例如FLAME)。然而,这种良性根数据集并不现实,而截尾均值或中位数也可能消除来自代表性不足的数据集的贡献。在本文中,我们提出了一种通用解决方案FedTruth,用于防御FL中的模型投毒攻击,其中真值模型更新(即全局模型更新)将在所有模型更新中以动态聚合权重进行估计。具体而言,FedTruth不对良性或恶意数据的分布作特定假设,也不需要访问良性根数据集。此外,FedTruth考虑了所有良性客户端的潜在贡献。我们的实验结果表明,FedTruth能够减轻被投毒的模型更新在拜占庭攻击和后门攻击下的影响,并且在大规模FL系统中同样高效。
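The abstract above contrasts FedTruth with defenses built on the trimmed mean or median. To make those baselines concrete, here is a minimal sketch of coordinate-wise median and trimmed-mean aggregation over client updates. It illustrates the existing robust statistics only, not FedTruth's dynamic aggregation weights, and the function names and toy updates are assumptions.

```python
from statistics import median
from typing import List, Sequence


def coordinate_median(updates: Sequence[Sequence[float]]) -> List[float]:
    """Aggregate client updates by taking the median of every coordinate."""
    return [median(column) for column in zip(*updates)]


def trimmed_mean(updates: Sequence[Sequence[float]], trim: int) -> List[float]:
    """Per coordinate, drop the `trim` smallest and largest values, then average the rest."""
    n = len(updates)
    if 2 * trim >= n:
        raise ValueError("trim removes every client")
    aggregated = []
    for column in zip(*updates):
        kept = sorted(column)[trim:n - trim]
        aggregated.append(sum(kept) / len(kept))
    return aggregated


# Three benign clients agree; the fourth sends an extreme (poisoned) update.
client_updates = [
    [1.0, 2.0],
    [1.0, 2.0],
    [1.0, 2.0],
    [100.0, -100.0],
]
robust_median = coordinate_median(client_updates)
robust_trimmed = trimmed_mean(client_updates, trim=1)
```

Both statistics suppress the outlier, but they also discard legitimate extreme values. This is the concern the abstract raises about contributions from underrepresented datasets.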
Article 60
Title@2025-06-30 (1): Agent.xpu: Efficient Scheduling of Agentic LLM Workloads on Heterogeneous SoC
Title: Agent.xpu: Efficient Scheduling of Agentic LLM Workloads on Heterogeneous SoC | Agent.xpu: Effiziente Planung von Agentic LLM Workloads auf heterogenen SoC | Agent.xpu: 高效地安排对异基因 soC 的Agentic LLM 工作负荷 2506.24045v1 |
Authors (8): Xinming Wei, Jiahao Zhang, Haoran Li, Jiayu Chen, Rui Qu, Maoliang Li, Xiang Chen, Guojie Luo
The proliferation of agentic Large Language Models (LLMs) on personal devices introduces a new class of workloads characterized by a dichotomy of objectives. Reactive tasks, initiated by users, demand immediate, low-latency responses, while proactive tasks operate invisibly and prioritize throughput. Existing on-device LLM engines, designed for isolated inferences, fail to efficiently manage these concurrent and conflicting requests on consumer-grade heterogeneous SoCs with CPU, integrated GPU, and NPU. This paper introduces Agent.xpu, an efficient serving system for agentic LLM workloads on memory-unified heterogeneous SoCs. With dedicated offline profiling, Agent.xpu first constructs a heterogeneous execution graph, which fuses and chunks model kernels for affinity-guided, elastic accelerator mapping with predictive kernel annotation. At runtime, its online scheduler enables fine-grained, kernel-level preemption to guarantee the responsiveness of reactive tasks. To maximize SoC utilization, it adopts slack-aware kernel backfill to opportunistically append proactive tasks, and mitigates NPU-iGPU contention via bandwidth-aware dispatch. Evaluation on an Intel Core Ultra SoC shows that Agent.xpu achieves 4.6× lower latency for reactive tasks and sustains 1.6×-6.8× higher throughput for proactive tasks compared to state-of-the-art inference engines.
智能体式(agentic)大型语言模型(LLM)在个人设备上的普及带来了一类新的工作负载,其特点是目标的二分性。由用户发起的反应式任务要求即时、低延迟的响应,而主动式任务在后台不可见地运行并以吞吐量为先。现有的端侧LLM引擎是为孤立的推理而设计的,无法在配备CPU、集成GPU和NPU的消费级异构SoC上高效地管理这些并发且相互冲突的请求。本文介绍了Agent.xpu,这是一个面向统一内存异构SoC上智能体式LLM工作负载的高效服务系统。通过专门的离线性能剖析,Agent.xpu首先构建一个异构执行图,对模型内核进行融合与分块,以实现亲和性引导的弹性加速器映射,并带有预测性的内核标注。在运行时,其在线调度器支持细粒度的内核级抢占,以保证反应式任务的响应性。为了最大限度地提高SoC利用率,它采用感知空闲的内核回填来伺机追加主动式任务,并通过感知带宽的分发来缓解NPU与iGPU之间的争用。在Intel Core Ultra SoC上的评估表明,与最先进的推理引擎相比,Agent.xpu使反应式任务的延迟降低4.6×,并使主动式任务的吞吐量保持1.6×-6.8×的提升。
Article 61
Title@2025-06-30 (1): Intelligent Orchestration of Distributed Large Foundation Model Inference at the Edge
Title: Intelligent Orchestration of Distributed Large Foundation Model Inference at the Edge | Intelligente Orchestrierung der verteilten Large Foundation Model Inferenz am Rande | 分散在边缘的大基金会模型推断 2504.03668v2 |
Authors (3): Fernando Koch, Aladin Djuhera, Alecio Binotto
Large Foundation Models (LFMs), including multi-modal and generative models, promise to unlock new capabilities for next-generation Edge AI applications. However, performing inference with LFMs in resource-constrained and heterogeneous edge environments, such as Multi-access Edge Computing (MEC), presents significant challenges for workload orchestration due to time-varying network, compute, and storage conditions. In particular, current split inference strategies, which partition LFM layers across nodes, are not designed to adapt to fluctuating workloads, dynamic bandwidth conditions, or evolving privacy constraints in high-utilization MEC environments. In this work, we propose a novel adaptive split inference orchestration framework that elevates both the placement and partitioning of LFM layers to runtime-tunable variables. Specifically, our framework enables real-time, quality-of-service (QoS)-aware management of inference workloads by extending conventional orchestrators with three key services: (1) Capacity-aware workload distribution, which continuously profiles node resources and selects an optimal subset of MEC nodes; (2) Dynamic partition migration, which transparently relocates pre-cut LFM segments in response to changes in utilization or network conditions; (3) Real-time reconfiguration, which dynamically re-splits LFM layers to balance latency, throughput, and privacy. We formalize the joint placement-partitioning problem, outline a reference architecture and algorithmic workflow, and discuss applicability in representative smart city, V2X, and industrial edge scenarios.
大型基础模型(LMM),包括多模式和基因模型,有望为下一代的边缘应用释放新的能力,但通过在资源限制和多样化的边缘环境中(如多接入边缘计算(MEC))与LFMS进行推论,由于时间变化的网络、计算和储存条件,对工作量的调控提出了重大挑战。特别是,目前将LFM层分布于节点的不同推论战略,其设计并不是为了适应在高利用的MEC环境中不断变化的工作量、动态带宽条件或隐私限制。在这项工作中,我们提出了一个新的适应性差异性差异性差异性调整调控框架,将LFM结构的布局和分区提升为可运行的时间可调变变量。具体地说,我们的框架通过扩大传统的管弦风琴管,将LFM结构分为三个关键服务:(1) 能力认知性工作量分配,不断描述资源,并在MEC节点选择一个最佳的参照区段;(2) 智能分区迁移,将LFM结构的布局前的配置和结构的重新定位,从而透明地将稳定和结构的升级地改变到真正的结构。
Article 62
Title@2025-06-30 (1): QPART: Adaptive Model Quantization and Dynamic Workload Balancing for Accuracy-aware Edge Inference
Title: QPART: Adaptive Model Quantization and Dynamic Workload Balancing for Accuracy-aware Edge Inference | QPART: Adaptive Modell-Quantisierung und dynamische Workload-Balancing für akkurat-bewusste Edge-Inferenz | QPART: 适应性模型量化和动态工作量平衡,以利准确度认知边缘推断 2506.23934v1 |
Authors (6): Xiangchen Li, Saeid Ghafouri, Bo Ji, Hans Vandierendonck, Deepu John, Dimitrios S. Nikolopoulos
As machine learning inferences increasingly move to edge devices, adapting to diverse computational capabilities, hardware, and memory constraints becomes more critical. Instead of relying on a pre-trained model fixed for all future inference queries across diverse edge devices, we argue that planning an inference pattern with a request-specific model tailored to the device’s computational capacity, accuracy requirements, and time constraints is more cost-efficient and robust to diverse scenarios. To this end, we propose an accuracy-aware and workload-balanced inference system that integrates joint model quantization and inference partitioning. In this approach, the server dynamically responds to inference queries by sending a quantized model and adaptively sharing the inference workload with the device. Meanwhile, the device’s computational power, channel capacity, and accuracy requirements are considered when deciding. Furthermore, we introduce a new optimization framework for the inference system, incorporating joint model quantization and partitioning. Our approach optimizes layer-wise quantization bit width and partition points to minimize time consumption and cost while accounting for varying accuracy requirements of tasks through an accuracy degradation metric in our optimization model. To our knowledge, this work represents the first exploration of optimizing quantization layer-wise bit-width in the inference serving system, by introducing theoretical measurement of accuracy degradation. Simulation results demonstrate a substantial reduction in overall time and power consumption, with computation payloads decreasing by over 80% and accuracy degradation kept below 1%.
随着机器学习推论日益转向边缘装置,适应不同的计算能力、硬件和内存限制变得更加关键。我们争辩说,规划一种根据不同边缘装置的计算能力、准确要求和时间限制而专门设计的请求型号的推论模式时,成本效率更高,对不同的假设情况也更强。为此,我们提出一个准确度和工作量平衡的推论系统,将联合模型量化和推理划分结合起来。在这一方法中,服务器不依赖为今后所有不同边缘装置的推论查询而固定的预先培训模型,而是通过发送一个量化模型和适应性地与该装置分享推论工作量。与此同时,在决定时会考虑该装置的计算能力、频道能力和准确性要求。此外,我们为推断系统引入一个新的优化框架,包括联合模型量化和划分。我们的方法优化逐层量化位宽和划分点,以尽量减少时间消耗和成本,同时通过优化模型中的精度下降度量来考虑任务的不同精度要求。据我们所知,这项工作通过引入精度下降的理论度量,首次探索了在推理服务系统中优化逐层量化位宽。模拟结果表明,总体时间和功耗大幅降低,计算负载减少80%以上,精度下降保持在1%以下。
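Note: the layer-wise bit-width idea in the abstract above can be illustrated with a minimal sketch. It assumes plain symmetric uniform quantization, which the abstract does not specify; the function names, the layer names, and the `bit_widths` assignment are hypothetical and are not taken from QPART.

```python
def quantize(weights, bits):
    """Symmetric uniform quantization of one layer to `bits` bits (bits >= 2)."""
    levels = 2 ** (bits - 1) - 1              # largest integer code magnitude
    peak = max(abs(w) for w in weights)
    if peak == 0:
        return list(weights)                  # nothing to scale
    scale = peak / levels                     # step between representable values
    return [round(w / scale) * scale for w in weights]


def max_error(weights, bits):
    """Largest absolute rounding error introduced at the given bit width."""
    return max(abs(w - q) for w, q in zip(weights, quantize(weights, bits)))


# Hypothetical two-layer model with a different bit width per layer.
layers = {"conv1": [0.5, -1.0, 0.26, 0.9], "fc": [0.03, -0.02, 0.01, 0.04]}
bit_widths = {"conv1": 8, "fc": 4}
quantized = {name: quantize(w, bit_widths[name]) for name, w in layers.items()}
```

Lower bit widths shrink the payload sent to the device but raise the rounding error, which is the trade-off an accuracy degradation metric would have to capture.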
Article 63
Title@2025-06-30 (1): Cuckoo Heavy Keeper and the balancing act of maintaining heavy hitters in stream processing
Title: Cuckoo Heavy Keeper and the balancing act of maintaining heavy hitters in stream processing | Cuckoo Heavy Keeper und der Balanceakt der Aufrechterhaltung von schweren Hittern in der Stromverarbeitung | Cuckoo重物保管器和在溪流处理中保持重击器的平衡做法 2412.12873v3 |
Authors (2): Vinh Quang Ngo, Marina Papatriantafilou
Finding heavy hitters in databases and data streams is a fundamental problem with applications ranging from network monitoring to database query optimization, machine learning, and more. Approximation algorithms offer practical solutions, but they present trade-offs involving throughput, memory usage, and accuracy. Moreover, modern applications further complicate these trade-offs by demanding capabilities beyond sequential processing that require both parallel scaling and support for concurrent queries and updates. Analysis of these trade-offs led us to the key idea behind our proposed streaming algorithm, Cuckoo Heavy Keeper (CHK). The approach introduces an inverted process for distinguishing frequent from infrequent items, which unlocks new algorithmic synergies that were previously inaccessible with conventional approaches. By further analyzing the competing metrics with a focus on parallelism, we propose an algorithmic framework that balances scalability aspects and provides options to optimize query and insertion efficiency based on their relative frequencies. The framework is capable of parallelizing any heavy-hitter detection algorithm. Besides the algorithms’ analysis, we present an extensive evaluation on both real-world and synthetic data across diverse distributions and query selectivity, representing the broad spectrum of application needs. Compared to state-of-the-art methods, CHK improves throughput by 1.7-5.7$\times$ and accuracy by up to four orders of magnitude even under low-skew data and tight memory constraints. These properties allow its parallel instances to achieve near-linear scale-up and low latency for heavy-hitter queries, even under a high query rate. We expect the versatility of CHK and its parallel instances to impact a broad spectrum of tools and applications in large-scale data analytics and stream processing systems.
在数据库和数据流中找到重击器是一个根本性问题,从网络监测到数据库查询优化、机器学习等应用都是一个根本性问题。 匹配算法提供了实际的解决办法,但提供了权衡,涉及输送量、记忆使用和准确性。 此外,现代应用还使这些权衡更加复杂,要求能力超越顺序处理,需要平行扩大和支持同步查询和更新。对这些权衡的分析导致我们的拟议流算法库库重力(CHK)背后的关键想法。这个方法引入了一种自转过程,以区分经常和不经常的项目,这些不经常的项目释放了以前与常规方法无法获取的新的算法协同效应。通过进一步分析相互竞争的度量并侧重于并行性,我们提出了一种算法框架,平衡可扩展性的各个方面,并提供按相对频率优化查询和插入效率的选项。该框架能够并行化任何重击器检测算法。除了算法分析之外,我们还对不同分布和查询选择性下的真实数据和合成数据进行了广泛评估,代表了广泛的应用需求。与最先进的方法相比,即使在低偏斜数据和严格内存限制下,CHK 也将吞吐量提高 1.7-5.7 倍,并将准确度提高多达四个数量级。这些特性使其并行实例即使在高查询率下也能实现近线性扩展和低延迟的重击器查询。我们期望 CHK 及其并行实例的通用性能够影响大规模数据分析和流处理系统中的广泛工具和应用。
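Note: for readers unfamiliar with the heavy-hitter problem described above, the sketch below shows the classic Misra-Gries summary. It is a different, much simpler algorithm than CHK and is included only to make the memory-versus-accuracy trade-off concrete; the function name and the example stream are hypothetical.

```python
def misra_gries(stream, k):
    """Summarize a stream with at most k - 1 counters.

    Every item occurring more than len(stream) / k times is guaranteed
    to survive in the returned dictionary; stored counts are lower bounds.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement everything and drop exhausted keys.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stream = ["a"] * 6 + ["b"] * 3 + ["c", "d", "e"]
candidates = misra_gries(stream, k=3)
```

With 12 items and `k=3`, only items seen more than 4 times are guaranteed to be reported, so `"a"` survives while `"b"` is evicted. The stored count undercounts the true frequency, which is the kind of accuracy loss that approximation schemes trade against memory.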
Article 64
Title@2025-06-30 (1): Segmented Operations using Matrix Multiplications
Title: Segmented Operations using Matrix Multiplications | Segmentierte Operationen mit Matrix-Multiplikationen | 使用矩阵乘法进行分割操作 2506.23906v1 |
Authors (3): Aleksandros Sobczyk, Giuseppe Sorrentino, Anastasios Zouzias
Specialized computational units that perform small matrix multiplications as primitive operations are typically present in modern accelerators. However, these units are often underutilized for many fundamental operations besides dense matrix multiplications. The analysis of algorithms for such architectures is currently stagnated due to the lack of a rigorous theoretical model of computation that captures their characteristics. In this work, we propose MMV-RAM, a computational model tailored to matrix multiplication accelerators. MMV-RAM judiciously extends the Vector-RAM model with an additional processing unit that multiplies two matrices of sizes $n\times s$ and $s\times s$ in a single parallel step, where $s$ is a model parameter. We provide a detailed theoretical analysis of the model, and carefully balance the computational power between the matrix and vector units, guided by the circuit complexity lower bound that parity is not in AC[0]. In MMV-RAM, we study algorithms for segmented scan and sum, two fundamental parallel primitives. We propose a segmented scan algorithm that uses matrix multiplications to perform speculative block-scan computations, which runs in $O(\log_s(n))$ steps. In contrast, we show that any algorithm that uses only the vector unit of MMV-RAM requires $\Omega\left(\frac{\log_2(n)}{\log_2\log_2(n)}\right)$ steps. We further apply these techniques to obtain similar theoretical speedups for element-wise vector multiplication and matrix multiplication. Beyond the worst-case complexity analysis, we propose algorithms for segmented operations that could lead to highly efficient and pragmatic implementations. For example, we observe that segmented sum is a combination of three elementary parallel primitives: scan, compress, and vector differentiation. As a case study, we implement…
现代加速器通常配有以小矩阵乘法为基本操作的专用计算单元。然而,除稠密矩阵乘法之外,这些单元在许多基本操作中往往未被充分利用。由于缺乏能够刻画其特性的严格理论计算模型,针对此类架构的算法分析目前处于停滞状态。在这项工作中,我们提出 MMV-RAM,一种为矩阵乘法加速器定制的计算模型。MMV-RAM 审慎地扩展了 Vector-RAM 模型,增加了一个处理单元,可在单个并行步骤中将大小为 $n\times s$ 和 $s\times s$ 的两个矩阵相乘,其中 $s$ 是模型参数。我们研究了分段扫描和分段求和的算法,并提出一种利用矩阵乘法执行推测性块扫描计算的分段扫描算法,其运行步数为 $O(\log_s(n))$。相比之下,任何仅使用 MMV-RAM 向量单元的算法都需要 $\Omega\left(\frac{\log_2(n)}{\log_2\log_2(n)}\right)$ 步。
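Note: the abstract above observes that segmented sum combines three primitives: scan, compress, and vector differentiation. The sketch below spells out that decomposition sequentially. It is not the MMV-RAM algorithm and makes no use of matrix units; the function name and the flag convention (1 marks the first element of a segment) are assumptions.

```python
from itertools import accumulate


def segmented_sum(values, flags):
    """Sum each segment of `values`; flags[i] == 1 marks a segment start."""
    n = len(values)
    # 1. Scan: inclusive prefix sums over the whole array.
    prefix = list(accumulate(values))
    # 2. Compress: keep the prefix sum at the last position of every segment.
    ends = [prefix[i] for i in range(n) if i == n - 1 or flags[i + 1] == 1]
    # 3. Vector differentiation: subtract the previous kept value.
    return [e - p for e, p in zip(ends, [0] + ends[:-1])]


sums = segmented_sum([1, 2, 3, 4, 5, 6], [1, 0, 0, 1, 0, 1])
```

Here the segments are `[1, 2, 3]`, `[4, 5]`, and `[6]`. On a parallel machine each of the three steps would be a data-parallel primitive, which is what makes the decomposition attractive.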
Article 65
Title@2025-06-30 (1): Proving the Limited Scalability of Centralized Distributed Optimization via a New Lower Bound Construction
Title: Proving the Limited Scalability of Centralized Distributed Optimization via a New Lower Bound Construction | Nachweis der begrenzten Skalierbarkeit der zentralisierten verteilten Optimierung durch eine neue untere Bound-Konstruktion | 证明通过新建下下界建筑的集中分配最佳优化的有限可扩展性 2506.23836v1 |
Authors (1): Alexander Tyurin
We consider centralized distributed optimization in the classical federated learning setup, where $n$ workers jointly find an $\varepsilon$-stationary point of an $L$-smooth, $d$-dimensional nonconvex function $f$, having access only to unbiased stochastic gradients with variance $\sigma^2$. Each worker requires at most $h$ seconds to compute a stochastic gradient, and the communication times from the server to the workers and from the workers to the server are $\tau_{s}$ and $\tau_{w}$ seconds per coordinate, respectively. One of the main motivations for distributed optimization is to achieve scalability with respect to $n$. For instance, it is well known that the distributed version of SGD has a variance-dependent runtime term $\frac{h \sigma^2 L \Delta}{n \varepsilon^2},$ which improves with the number of workers $n,$ where $\Delta = f(x^0) - f^*,$ and $x^0 \in R^d$ is the starting point. Similarly, using unbiased sparsification compressors, it is possible to reduce both the variance-dependent runtime term and the communication runtime term. However, once we account for the communication from the server to the workers $\tau_{s}$, we prove that it becomes infeasible to design a method using unbiased random sparsification compressors that scales both the server-side communication runtime term $\tau_{s} d \frac{L \Delta}{\varepsilon}$ and the variance-dependent runtime term $\frac{h \sigma^2 L \Delta}{\varepsilon^2},$ better than poly-logarithmically in $n$, even in the homogeneous (i.i.d.) case, where all workers access the same distribution. To establish this result, we construct a new “worst-case” function and develop a new lower bound framework that reduces the analysis to the concentration of a random sum, for which we prove a concentration bound. These results reveal fundamental limitations in scaling distributed optimization, even under the homogeneous assumption.
我们考虑在古典Federal化学习设置中集中分配优化, 由2美元工人共同找到一个$\varepsilon$- 固定点, 即$L$- smooth, $d$d$- diversion 函数, 只能使用不偏颇的随机梯度梯度, 差价$=2美元。 每位工人需要最多小时时间来计算一个随机梯度梯度, 从服务器到工人和工人到服务器的通信时间段, 分别是$@tausil_ sality, 美元和美元美元。 分配优化的主要动机之一是在美元方面实现可调整。 例如, 众所周知, SGDD的分布版本有一个差异性运行时间段值 $\ gmax% 2 L\ delta_ lax 时间段来计算一个扭曲性梯度梯度的梯度。
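Note: the lower bound above concerns unbiased random sparsification compressors. A common instance, often called Rand-K, keeps k random coordinates and rescales them by d/k so that the compressed vector equals the input in expectation. The sketch below assumes this instance, which the abstract does not name; the function names are hypothetical.

```python
import random
from itertools import combinations


def rand_k(x, k, rng):
    """Keep k uniformly chosen coordinates of x, scaled by d / k."""
    d = len(x)
    kept = set(rng.sample(range(d), k))
    scale = d / k
    return [scale * v if i in kept else 0.0 for i, v in enumerate(x)]


def exact_mean_rand_k(x, k):
    """Average the compressor output over all equally likely index subsets."""
    d = len(x)
    subsets = list(combinations(range(d), k))
    mean = [0.0] * d
    for subset in subsets:
        for i in subset:
            mean[i] += (d / k) * x[i] / len(subsets)
    return mean


x = [1.0, -2.0, 4.0, 0.5]
compressed = rand_k(x, k=2, rng=random.Random(0))
mean = exact_mean_rand_k(x, k=2)
```

Each worker sends only k of d coordinates, which is why such compressors reduce worker-to-server communication. Per the abstract, the result is that the server-to-worker term $\tau_{s}$ nevertheless prevents better than poly-logarithmic scaling in $n$.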
Article 66
Title@2025-06-30 (1): Large-scale Neural Network Quantum States for ab initio Quantum Chemistry Simulations on Fugaku
Title: Large-scale Neural Network Quantum States for ab initio Quantum Chemistry Simulations on Fugaku | Großes neurales Netzwerk Quantenstaaten für ab initio Quantenchemie Simulationen auf Fugaku | 用于对富巴库进行初始量子化学模拟的大型神经网络量图州 2506.23809v1 |
Authors (4): Hongtao Xu, Zibo Wu, Mingzhen Li, Weile Jia
Solving quantum many-body problems is one of the fundamental challenges in quantum chemistry. While neural network quantum states (NQS) have emerged as a promising computational tool, their training process incurs exponentially growing computational demands, becoming prohibitively expensive for large-scale molecular systems and creating fundamental scalability barriers for real-world applications. To address these challenges, we present a high-performance NQS training framework for ab initio electronic structure calculations. First, we propose a scalable sampling parallelism strategy with multi-layer workload division and a hybrid sampling scheme, which breaks the scalability barriers for large-scale NQS training. Then, we introduce multi-level local energy parallelism, enabling more efficient local energy computation. Last, we employ cache-centric optimization for the transformer-based ansatz and combine it with the sampling parallelism strategy, which further speeds up NQS training and achieves a stable memory footprint at scale. Experiments demonstrate that the framework accelerates NQS training with up to 8.41x speedup and attains a parallel efficiency of up to 95.8% when scaling to 1,536 nodes.
解决量子体问题是量子化学的根本挑战之一。虽然神经网络量子状态(NQS)已经成为一个充满希望的计算工具,但其培训过程也产生了成倍增长的计算需求,对于大型分子系统来说,成本太高,而且为现实世界应用创造了基本的可缩缩放障碍。为了应对上述挑战,我们提出了高性能的NQS培训框架,用于计算电子结构。首先,我们提出了一个可缩放的抽样平行战略,包括多层工作量分工和混合采样计划,这打破了大规模NQS培训的可缩放障碍。然后,我们引入了多级平行的本地能源平行主义,从而使得当地能源计算效率更高。最后,我们采用了基于变压器的缓存中心优化 \ textit{anaztz} , 并将其纳入抽样平行战略, 从而进一步加快了NQS培训,并在规模上实现稳定的记忆足迹。 实验表明,\我们加速了NQS培训,将NQS培训加速到8.41x加速速度,并实现平行效率到95.8时达到15。
Article 67
Title@2025-06-30 (1): When Servers Meet Species: A Fab-to-Grave Lens on Computing’s Biodiversity Impact
Title: When Servers Meet Species: A Fab-to-Grave Lens on Computing’s Biodiversity Impact | Wenn Server Arten treffen: Eine Fab-to-Grave-Lens für die Biodiversitätswirkung von Computing | 当服务器与物种相遇时:关于计算机的生物多样性影响的一个从宽到宽的镜头 2506.20442v3 |
Authors (4): Tianyao Shi, Ritbik Kumar, Inez Hua, Yi Ding
Biodiversity loss is a critical planetary boundary, yet its connection to computing remains largely unexamined. Prior sustainability efforts in computing have focused on carbon and water, overlooking biodiversity due to the lack of appropriate metrics and modeling frameworks. This paper presents the first end-to-end analysis of biodiversity impact from computing systems. We introduce two new metrics–Embodied Biodiversity Index (EBI) and Operational Biodiversity Index (OBI)–to quantify biodiversity impact across the lifecycle, and present FABRIC, a modeling framework that links computing workloads to biodiversity impacts. Our evaluation highlights the need to consider biodiversity alongside carbon and water in sustainable computing design and optimization. The code is available at https://github.com/TianyaoShi/FABRIC.
生物多样性的丧失是一个重要的行星边界,但生物多样性与计算的联系基本上仍未得到审查。先前的计算可持续性努力侧重于碳和水,由于缺乏适当的计量标准和模型框架,忽视了生物多样性。本文件对计算系统对生物多样性的影响进行了第一次端至端分析。我们引入了两种新的计量- Embodied生物多样性指数(EBI)和业务生物多样性指数(OBI),以量化整个生命周期的生物多样性影响,并推出了FABRIC, 这是一种将计算工作量与生物多样性影响联系起来的模型框架。我们的评估强调,在可持续计算设计和优化时,需要将生物多样性与碳和水一起考虑。该代码可在https://github.com/tianaoShi/FABRIC上查阅。
Article 68
Title@2025-06-30 (1): Towards Building Private LLMs: Exploring Multi-Node Expert Parallelism on Apple Silicon for Mixture-of-Experts Large Language Model
Title: Towards Building Private LLMs: Exploring Multi-Node Expert Parallelism on Apple Silicon for Mixture-of-Experts Large Language Model | Auf dem Weg zum Aufbau privater LLMs: Erforschung von Multi-Node-Experten-Parallelismus auf Apple Silicon für Mixture-of-Experts Large Language Model | 走向建设私有私人LLMs:探索关于苹果硅的多节专家平行专家,用于混合专家大语言模型 2506.23635v1 |
Authors (6): Mu-Chi Chen, Po-Hsuan Huang, Xiangrui Ke, Chia-Heng Tu, Chun Jason Xue, Shih-Hao Hung
Large Language Models (LLMs) have revolutionized Artificial Intelligence (AI) with significant advancements such as OpenAI’s ChatGPT, Meta’s Llama, and Databricks’ DBRX. This paper addresses the cost and scalability challenges encountered when constructing private LLM systems for personal or small group services, as targeted by Apple Intelligence. A Mac Studio cluster with Apple’s M2 Ultra chips is established as a cost-efficient solution to host and accelerate the pretrained DBRX model with the Mixture-of-Experts (MoE) architecture. Our performance analysis reveals that parallel execution of the model’s experts across two to four machine nodes significantly reduces inference time. We find that computation time for the experts is comparable to the communication time for exchanging their outputs, emphasizing the importance of network latency over bandwidth. We also observe significant management overhead due to the memory management logic of Apple’s software stack. Based on these findings, we develop optimization schemes to eliminate the memory management overhead. As a result, the Mac Studio cluster is 1.15 times more cost-efficient than the state-of-the-art AI supercomputer with NVIDIA H100 GPUs. In addition, we construct a performance model to estimate system performance under varying configurations, and the model provides valuable insights for designing private LLM systems.
大型语言模型(LLMS)革命了人工智能(AI),取得了显著进步,如OpenAI’s ChatGPT、Meta’s Llama和Databricks的DBRX。本文阐述了苹果智能公司为个人或小群体服务建造私人LLM系统时遇到的成本和可扩缩性挑战。苹果智能公司(LLLMS)开发了一个配有苹果M2Ultra芯片的麦克工作室集群,作为容纳和加速与混合专家(MOE)结构一起经过预先训练的DBRX模型的成本效益解决方案。我们的业绩分析显示,在两到四个机器节点平行执行模型的专家大大缩短了推论时间。我们发现,专家计算时间与交换其产出的通信时间相当,强调网络在带宽上拉动的重要性。我们还观察到由于苹果软件堆的记忆管理逻辑,我们开发了最优化计划,以消除记忆管理管理费。结果显示,Mac Stud室集群比在两到四个机器节点上平行的模型成本效率高出1.15倍。 我们设计了高端的GMA的系统。
Article 69
Title@2025-06-30 (1): FedEx-LoRA: Exact Aggregation for Federated and Efficient Fine-Tuning of Foundation Models
Title: FedEx-LoRA: Exact Aggregation for Federated and Efficient Fine-Tuning of Foundation Models | FedEx-LoRA: Exakte Aggregation für Federated and Efficient Fine-Tuning of Foundation Models | FedEx-LORA:基金会模型的联邦和高效精度 2410.09432v4 |
Authors (3): Raghav Singhal, Kaustubh Ponkshe, Praneeth Vepakomma
Low-Rank Adaptation (LoRA) is a popular technique for efficient fine-tuning of foundation models. However, applying LoRA in federated learning environments, where data is distributed across multiple clients, presents unique challenges. Existing methods rely on traditional federated averaging of LoRA adapters, resulting in inexact updates. To address this, we propose Federated Exact LoRA, or FedEx-LoRA, which adds a residual error term to the pretrained frozen weight matrix. Our approach achieves exact updates with minimal computational and communication overhead, preserving LoRA’s efficiency. We evaluate the method on various models across arithmetic reasoning, commonsense reasoning, natural language understanding and natural language generation tasks, showing consistent performance gains over state-of-the-art methods across multiple settings. Through extensive analysis, we quantify that the deviations in updates from the ideal solution are significant, highlighting the need for exact aggregation. Our method’s simplicity, efficiency, and broad applicability position it as a promising solution for accurate and effective federated fine-tuning of foundation models. Our code is publicly available at https://github.com/RaghavSinghal10/fedex-lora.
低兰克适应(LORA)是高效微调基础模型的流行技术。然而,在联合学习环境中应用LORA(LORA),将数据分布于多个客户,这带来了独特的挑战。现有方法依赖于传统的LORA适应器平均使用率,导致不精确的更新。为此,我们提议采用Fedex-Ex-LORA(即Fed-Ex-LORA),这为经过事先训练的冷冻重量矩阵增加了一个剩余错误术语。我们的方法以最低的计算和通信间接费用实现精确更新,维护LORA的效率。我们评估了各种模型的方法,这些模型包括算术推理、常识推理、自然语言理解和自然语言生成任务,表明在多种环境中最先进的方法上取得一致的绩效收益。我们通过广泛分析,量化更新理想解决方案的偏差是显著的,突出了精确汇总的必要性。我们的方法的简单性、效率和广泛适用性位置是准确和有效精确调整基础模型的可行解决方案。我们的代码在 https://github.com/RAghavShalhalal10。
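Note: the inexactness mentioned above arises because averaging the LoRA factors separately is not the same as averaging their products. The sketch below computes the gap on two toy clients. Treating that gap as the residual added to the frozen weights is an interpretation of the abstract's "residual error term", and all names and values are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def mat_mean(mats):
    """Element-wise mean of equally shaped matrices."""
    n = len(mats)
    return [[sum(vals) / n for vals in zip(*rows)] for rows in zip(*mats)]


def mat_sub(a, b):
    """Element-wise difference a - b."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Two clients, rank-1 adapters: B_i is 2x1 and A_i is 1x2.
B = [[[1.0], [2.0]], [[3.0], [0.0]]]
A = [[[3.0, 4.0]], [[1.0, 2.0]]]

ideal = mat_mean([matmul(b, a) for b, a in zip(B, A)])  # mean of B_i A_i
naive = matmul(mat_mean(B), mat_mean(A))                # mean(B_i) mean(A_i)
residual = mat_sub(ideal, naive)                        # gap folded into frozen weights
```

Adding `residual` to the frozen weight matrix while distributing the averaged factors reproduces the ideal update, since `naive + residual == ideal` by construction.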
Article 70
Title@2025-06-30 (1): Detect & Score: Privacy-Preserving Misbehaviour Detection and Contribution Evaluation in Federated Learning
Title: Detect & Score: Privacy-Preserving Misbehaviour Detection and Contribution Evaluation in Federated Learning | Detect & Score: Privacy-Preserving Misbehaviour Detection and Contribution Evaluation in Federated Learning | * 评分:在联邦学习中保护隐私、错误行为检测和贡献评价 2506.23583v1 |
Authors (3): Marvin Xhemrishi, Alexandre Graell i Amat, Balázs Pejó
Federated learning with secure aggregation enables private and collaborative learning from decentralised data without leaking sensitive client information. However, secure aggregation also complicates the detection of malicious client behaviour and the evaluation of individual client contributions to the learning. To address these challenges, QI (Pejo et al.) and FedGT (Xhemrishi et al.) were proposed for contribution evaluation (CE) and misbehaviour detection (MD), respectively. QI, however, lacks adequate MD accuracy due to its reliance on the random selection of clients in each training round, while FedGT lacks the CE ability. In this work, we combine the strengths of QI and FedGT to achieve both robust MD and accurate CE. Our experiments demonstrate superior performance compared to using either method independently.
以安全汇总方式进行联邦学习,可以使私人和协作从分散的数据中学习,而不会泄露敏感的客户信息,然而,安全汇总也使发现恶意客户行为和评价个别客户对学习的贡献复杂化,为了应对这些挑战,建议分别进行缴款评价(CE)和不当行为检测(Xhemrishi等人)。然而,QI由于依赖在每轮培训中随机选择客户,因此缺乏足够的MD准确度,而FedGT缺乏CE能力。在这项工作中,我们结合QI和FedGT的优势,以实现稳健的MD和准确的CE。我们的实验显示优于独立使用两种方法。
Article 71
Title@2025-06-30 (1): PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization
Title: PipeOffload: Improving Scalability of Pipeline Parallelism with Memory Optimization | PipeOffload: Verbesserung der Skalierbarkeit von Pipeline Parallelismus mit Speicheroptimierung | 管道卸载: 提高管道平行式与内存优化的可缩放性 2503.01328v2 |
Authors (5): Xinyi Wan, Penghui Qi, Guangxing Huang, Min Lin, Jialin Li
Pipeline parallelism (PP) is widely used for training large language models (LLMs), yet its scalability is often constrained by high activation memory consumption as the number of in-flight microbatches grows with the degree of PP. In this paper, we focus on addressing this challenge by leveraging the under-explored memory offload strategy in PP. With empirical study, we discover that in the majority of standard configurations, at least half, and potentially all, of the activations can be offloaded with negligible overhead. In the cases where full offload is not possible, we introduce a novel selective offload strategy that decreases peak activation memory in a better-than-linear manner. Furthermore, we integrate memory offload with other techniques to jointly consider overall throughput and memory limitation. Our experiments prove that the per-device activation memory effectively reduces with the total number of stages, making PP a stronger alternative than TP, offering up to a 19% acceleration with even lower memory consumption. The implementation is open-sourced at https://github.com/sail-sg/zero-bubble-pipeline-parallelism.
管道平行(PP)被广泛用于培训大型语言模型(LLMS),但其可缩放性往往受到高活性内存消耗的制约,因为飞行中微插管的数量随着PP的增多而增加。在本文中,我们侧重于通过利用PP中探索不足的内存卸载战略来应对这一挑战。通过经验研究,我们发现在大多数标准配置中,至少一半,而且可能全部,激活可以用微不足道的间接费用来卸载。在不可能完全卸载的情况下,我们引入了一种新型选择性卸载战略,以比线性更强的方式减少高峰内存。此外,我们将内存卸载与其他技术结合起来,共同考虑总体吞吐量和内存限制。我们的实验证明,每部的内存会随着总阶段的增多而有效减少,使PP成为比TP更强大的替代方案,在内存消耗更低的情况下实现高达19%的加速。实现代码开源于 https://github.com/sail-sg/zero-bubble-pipeline-parallelism。
Article 72
Title@2025-06-30 (1): VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference
Title: VQ-LLM: High-performance Code Generation for Vector Quantization Augmented LLM Inference | VQ-LLM: Hochleistungs-Code-Generierung für Vector Quantization Augmented LLM Inferenz | VQ-LLLM: 矢量量化增强LLM 推理高性能代码生成 2503.02236v2 |
Authors (14): Zihan Liu, Xinhao Luo, Junxian Guo, Wentao Ni, Yangjie Zhou, Yue Guan, Cong Guo, Weihao Cui, Yu Feng, Minyi Guo, Yuhao Zhu, Minjia Zhang, Jingwen Leng, Chen Jin
In this work, we design and implement VQ-LLM, an efficient fused Vector Quantization (VQ) kernel generation framework. We first introduce a software abstraction called codebook cache to optimize codebook access efficiency and support the integration of VQ with various computations. The codebook cache adaptively stores different entries across the GPU’s memory hierarchy, including off-chip global memory, on-chip shared memory, and registers. Centered around the codebook cache, we design an efficient computation engine that optimizes memory traffic during computations involving codebooks. This compute engine adopts the codebook-centric dataflow and fusion optimizations. Additionally, we provide adaptive heuristics to tailor parameter selection in our optimizations to diverse VQ configurations. Our optimizations achieve an average latency reduction of 46.13% compared to unoptimized versions. Compared to existing open-source implementations, our methods decrease latency by 64.36% to 99.1%. A final comparison with state-of-the-art element-wise quantization methods like AWQ and KVQuant shows that our VQ-LLM is practically viable, achieving latencies close to, or even better than, those at equivalent bit-widths, potentially offering greater accuracy.
在此工作中, 我们设计和实施 VQ- LLLM , 一个高效的导体矢量内存( VQ) 生成框架 。 我们首先引入了名为代码簿缓存的软件抽象化工具, 以优化代码簿访问效率, 支持将 VQ 与各种计算整合。 代码簿缓存将存储整个 GPU 的记忆层的不同条目, 包括离芯片全球内存、 芯片共享内存和登记册。 围绕代码簿缓存, 我们设计了一个高效的计算引擎, 在使用代码簿的计算过程中优化存储流量。 这个计算引擎采用了代码簿中心的数据流和聚合优化。 此外, 我们提供适应性超重度来调整参数选择以适应不同的 VQ 配置 。 我们的优化实现了平均延缓度减少46. 13%, 包括离芯片全球内存、 芯片共享内存和登记册。 与现有的开放源执行相比, 我们的方法将延缩率降低64.36 % 至99.1 % 。 与州级元素四分解的缩方法的最后比较, 显示我们最晚或实际相等的精确度可能实现。
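Note: as background for the codebook accesses discussed above, the sketch below shows the basic vector quantization round trip: each vector is replaced by the index of its nearest codebook entry and later reconstructed by a lookup. It illustrates VQ in general, not the VQ-LLM kernels or the codebook cache; the function names and the tiny codebook are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def encode(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda j: squared_distance(v, codebook[j]))
        for v in vectors
    ]


def decode(indices, codebook):
    """Reconstruct vectors by looking indices up in the codebook."""
    return [codebook[j] for j in indices]


codebook = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]
vectors = [[0.1, -0.2], [3.5, 4.2], [0.9, 1.3]]
indices = encode(vectors, codebook)
restored = decode(indices, codebook)
```

At inference time only `decode` runs, so every dequantized element costs a codebook read. That access pattern is what makes the placement of codebook entries in the memory hierarchy matter.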
Article 73
Title@2025-06-30 (1): Oases: Efficient Large-Scale Model Training on Commodity Servers via Overlapped and Automated Tensor Model Parallelism
Title: Oases: Efficient Large-Scale Model Training on Commodity Servers via Overlapped and Automated Tensor Model Parallelism | Oasen: Effiziente großformatige Modellschulung auf Commodity-Servern durch überlappende und automatisierte Tensor-Modellparallelität | Oases:通过重叠和自动登光示范平行模式,对商品服务器进行有效的大型大型示范培训 2305.16121v2 |
Authors (8): Shengwei Li, Zhiquan Lai, Dongsheng Li, Yanqi Hao, Weijie Liu, Keshi Ge, Xiaoge Deng, Kai Lu
Deep learning is experiencing a rise in large-scale models. Training large-scale models is costly, prompting researchers to train large-scale models on commodity servers that more researchers can access. The massive number of parameters necessitates the use of model parallelism training methods. Existing studies focus on training with pipeline model parallelism. However, the tensor model parallelism (TMP) is inevitable when the model size keeps increasing, where frequent data-dependent communication and computation operations significantly reduce the training efficiency. In this paper, we present Oases, an automated TMP method with overlapped communication to accelerate large-scale model training on commodity servers. Oases proposes a fine-grained training operation schedule to maximize overlapping communication and computation that have data dependence. Additionally, we design the Oases planner that searches for the best model parameter partition strategy of TMP to achieve further accelerations. Unlike existing methods, Oases planner is tailored to model the cost of overlapped communication-computation operations. We evaluate Oases on various model settings and two commodity clusters, and compare Oases to four state-of-the-art implementations. Experimental results show that Oases achieves speedups of 1.01–1.48$\times$ over the fastest baseline, and speedups of up to 1.95$\times$ over Megatron.
深层学习正在经历大规模模型的提升。 培训大型模型成本昂贵, 促使研究人员在更多的研究人员可以访问的商品服务器上培训大型模型。 大量的参数要求使用模型平行培训方法。 现有的研究侧重于管道模型平行的培训。 但是, 当模型规模持续增加时, 高压模型平行( TMP)是不可避免的, 经常依靠数据的通信和计算操作会大大降低培训效率。 在本文中, 我们介绍Oases, 一种自动的TMP 方法, 与重叠的通信方法重叠, 以加速商品服务器上的大规模模型培训。 Oses 提出了一个精细化的培训操作时间表, 以尽量扩大重复的通信和计算, 并具有数据依赖性。 此外, 我们设计了Oases planner, 搜索TMP 的最佳模型参数分割战略以实现进一步的加速。 与现有的方法不同, Oases planner 专门设计了通信和计算操作重叠成本的模型。 我们评估了各种模型设置和两个商品集群的Oases, 并将Oases 和四个州级的模型执行速度比 1. 01 实验结果显示Oase 和超过 1. 速度。
Article 74
Title@2025-06-29 (7): FastSet: Parallel Claim Settlement
Title: FastSet: Parallel Claim Settlement | FastSet: Parallele Forderungsabrechnung | FastSet:平行索赔理赔 2506.23395v1 |
Authors (2): Xiaohong Chen, Grigore Rosu
FastSet is an actor-based distributed protocol for decentralized finance and settlement, which is inspired by blockchains. Account holders cooperate by making claims, which can include payments, holding and transferring assets, accessing and updating shared data, medical records, digital identity, and mathematical theorems, among many others. The claims are signed by their owners and are broadcast to a decentralized network of validators, which validate and settle them. Validators replicate the global state of the accounts and need not communicate with each other. In sharp contrast to blockchains, strong consistency is purposely given up as a requirement. Yet, many if not most of the blockchain benefits are preserved. The protocol is proved to be correct, despite its massively parallel nature.
FastSet是一个基于行为体的分散金融和结算分配协议,其灵感来自供应链; 账户持有人合作,提出债权,其中可包括付款、持有和转移资产、获取和更新共享数据、医疗记录、数字身份和数学理论等; 债权由所有者签字,并广播给一个分散的验证人网络,由他们验证和结算; 验证人复制账户的全球状况,不需要相互沟通; 与供应链形成鲜明对比,故意放弃强有力的一致性,将其作为一项要求; 然而,即使不是大多数,也有许多供应链的好处得到了维护; 协议被证明是正确的,尽管其性质极为平行。
Article 75
Title@2025-06-29 (7): FedRef: Communication-Efficient Bayesian Fine Tuning with Reference Model
Title: FedRef: Communication-Efficient Bayesian Fine Tuning with Reference Model | FedRef: Kommunikation-Effizient Bayesian Feinabstimmung mit Referenzmodell | FedRef: 通信-节能贝ysian精密票,参考模型 2506.23210v1 |
Authors (2): Taehwan Yoon, Bongjun Choi
Federated learning (FL) is used in distributed scenarios to train artificial intelligence (AI) models while ensuring users’ privacy. In a federated learning scenario, the server generally never sees users’ data. This concept makes the AI training process efficient in terms of data privacy. However, regarding model performance, federated AI models may not sufficiently satisfy AI users’ expectations. Furthermore, AI users have a wide range of different needs, and it is not easy to satisfy all of them. These issues can be addressed through AI model optimization, fine-tuning, or personalization to achieve optimal model performance. To address model optimization challenges, we propose reference model-based federated learning for optimal fine-tuning, which overcomes catastrophic forgetting in each round. This method is derived from Bayesian parameter-efficient transfer learning, which includes an optimal proximal term and enables overcoming the catastrophic forgetting issue in each round by utilizing a reference model that incorporates previous model parameters. As a result, this method achieves both high model performance and low computing cost.
联邦学习(FL) 用于分布式情景, 用于培训人工智能模型,同时确保用户隐私。 在联合学习情景中,服务器一般从来不知道用户的数据。这种概念使得AI培训过程在数据隐私方面效率高。然而,关于模型性能,联合AI模型可能不能充分满足AI用户的期望。此外,AI用户有着各种各样的不同需要。满足整个用户的需要并非易事。这类问题可以通过AI模型优化、微调或个性化来解决,以实现最佳模型性能。为了应对模型优化挑战,我们建议采用基于参考模型的联合会式学习,进行最佳微调,以克服每轮中灾难性的遗漏。这一方法源自于巴伊西亚参数高效传输学习,其中包括最佳的精度术语,并能够利用包含前模型参数的参考模型克服每轮灾难性的遗忘问题。因此,这种方法既能实现高模型性能,又能降低计算成本。
Article 76
Title@2025-06-29 (7): Efficient malicious information detection method based on set partitioning for large-scale Internet of Things
Title: Efficient malicious information detection method based on set partitioning for large-scale Internet of Things | Effiziente Methode zur Erkennung von bösartigen Informationen, basierend auf der eingestellten Partitionierung für das Internet der Dinge im großen Maßstab | 基于大规模物联网的固定分区的高效恶意信息检测方法 2502.11538v2 |
Authors (6): Yuhan Suo, Runqi Chai, Kaiyuan Chen, Senchun Chai, Wannian Liang, Yuanqing Xia
With the large-scale integration of the Internet of Things (IoT) into enterprise information management systems, organizations are pursuing digital transformation that hinges on real-time data insights, and yet they face escalating security and governance risks. Detecting and responding to threats at scale without impairing system efficiency has therefore become a critical information-management and decision-support challenge for today’s executives. This paper develops a distributed, gain-based anomaly-detection framework tailored to IoT-enabled enterprise systems, underpinned by an optimized sensor-subset partitioning strategy. Starting from the perspective of set partitioning strategies, this study analyzes the key factor that contributes to the performance differences between distributed and centralized algorithms. By examining the gain mutual influence of sensor subsets, an optimal set partitioning strategy is designed to minimize inter-subset mutual influence while enhancing intra-subset correlation. To further reduce the computational cost of gain updates, a suboptimal partitioning strategy based on Grassmann distance is proposed, improving the efficiency of selecting suspicious sensors. Theoretical analysis demonstrates that this approach effectively reduces the computational cost of gain updates while maintaining detection performance. Finally, simulation results validate the effectiveness of the proposed method in enhancing attack detection performance.
由于大规模地将Tings(IoT)互联网纳入企业信息管理系统,各组织正在寻求数字转型,这种转型取决于实时数据洞察力,而同时又面临不断升级的安全和治理风险。因此,在不损害系统效率的情况下检测和应对规模威胁已成为当今行政主管的一项重要信息管理和决策支持挑战。本文件为IoT型企业系统制定了一个分布式的、基于收益的异常检测框架,以优化的传感器子集散分战略为基础。从设定的分隔战略的角度出发,本研究分析了促成分布式算法和集中式算法之间绩效差异的关键因素。通过审查传感器子集的相互影响,最佳的集成分割战略旨在最大限度地减少子集成的相互影响,同时加强子集成的相互关系。为了进一步降低获取更新的计算成本,拟以格拉斯曼型距离为基础的次级优化分割战略将提高选择可疑传感器的效率。理论分析表明,这一方法在保持探测性能的同时,有效地降低了获取更新的计算成本。最后,模拟结果验证了拟议方法在加强攻击性探测中的有效性。
Article 77
Title@2025-06-29 (7): Verifying Properties of Index Arrays in a Purely-Functional Data-Parallel Language
Title: Verifying Properties of Index Arrays in a Purely-Functional Data-Parallel Language | Überprüfung der Eigenschaften von Index-Arrays in einer rein funktionalen Daten-Parallel-Sprache | 校验纯功能数据- Parallel 语言索引阵列属性 2506.23058v1 |
Authors (3): Nikolaj Hey Hinnerskov, Robert Schenck, Cosmin E. Oancea
This paper presents a novel approach to automatically verify properties of pure data-parallel programs with non-linear indexing – expressed as pre- and post-conditions on functions. Programs consist of nests of second-order array combinators (e.g., map, scan, and scatter) and loops. The key idea is to represent arrays as index functions: programs are index function transformations over which properties are propagated and inferred. Our framework proves properties on index functions by distilling them into algebraic (in)equalities and discharging them to a Fourier-Motzkin-based solver. The framework is practical and accessible: properties are not restricted to a decidable logic, but instead are carefully selected to express practically useful guarantees that can be automatically reasoned about and inferred. These guarantees extend beyond program correctness and can be exploited by the entire compiler pipeline for optimization. We implement our system in the pure data-parallel language Futhark and demonstrate its practicality on seven applications, reporting an average verification time of 1 second. Two case studies show how eliminating dynamic verification in GPU programs results in significant speedups.
本文介绍了一种新颖的方法,用于自动核查纯数据平行程序(以功能的预设和后设条件表示)的特性。程序由二阶阵列组合器(如地图、扫描和散射)和环形组成。关键的想法是将阵列作为索引功能:程序是指数函数转换,其属性是传播和推断的。我们的框架通过将这些功能转化为代数(不平等),并将其放入以Fourier-Motzkin为基础的求解器,来证明索引功能的特性。框架是实用和可访问的:特性不局限于一个可分解的逻辑,而是经过仔细选择,以表达可以自动解释和推断的实用的保证。这些保证超出了程序正确性,可以被整个编译管道用于优化。我们用纯数据单词 Futhark 来实施我们的系统,并在7个应用程序上展示其实用性,报告平均核查时间为1秒。两个案例研究显示如何消除GPU程序动态核查结果显著的速度。
Article 78
Title@2025-06-28 (6): ATTENTION2D: Communication Efficient Distributed Self-Attention Mechanism
Title: ATTENTION2D: Communication Efficient Distributed Self-Attention Mechanism | ACHTUNG2D: Kommunikation Effizient verteilter Selbstaufmerksamkeitsmechanismus | 注意2D: 沟通高效分配自发性传播机制 2503.15758v2 |
Authors (1): Venmugil Elango
Transformer-based models have emerged as a leading architecture for natural language processing, natural language generation, and image generation tasks. A fundamental element of the transformer architecture is self-attention, which allows the model to capture intricate dependencies within the data. However, the self-attention mechanism also incurs significant computational and memory costs, particularly for long sequences. In this paper, we introduce ATTENTION2D, a novel approach that exploits parallelism along two dimensions - query and key/value - of the self-attention operation. This method enables efficient distribution and parallelization of computations across multiple devices. Our approach facilitates asymptotically faster training and inference phases compared to previous methods, without relying on approximations or incurring additional computational or memory overheads. Furthermore, unlike existing techniques that struggle to scale with an increasing number of processing units, our approach effectively scales with additional processing units. Our experimental results confirm the effectiveness of our method in improving communication efficiency and scalability. Compared to Ring Attention, our approach demonstrated up to a 5x performance boost on a GPT-3-like model using 64 NVIDIA A100 GPUs across 16 nodes, and up to a 9.4x performance boost on 64 NVIDIA H100 GPUs across 64 nodes.
基于 Transformer 的模型已成为自然语言处理、自然语言生成和图像生成任务的主流架构。Transformer 架构的一个基本要素是自注意力,它使模型能够捕捉数据内部错综复杂的依赖关系。然而,自注意力机制也带来了巨大的计算和内存开销,对长序列尤其如此。本文提出 ATTENTION2D,这是一种新方法,利用自注意力运算在查询和键/值两个维度上的并行性,使计算能够在多个设备上高效分布和并行化。与以往方法相比,我们的方法使训练和推理在渐近意义上更快,且不依赖近似,也不引入额外的计算或内存开销。此外,现有技术难以随处理单元数量的增加而扩展,而我们的方法能够随处理单元的增加有效扩展。实验结果证实了该方法在提高通信效率和可扩展性方面的有效性。与 Ring Attention 相比,在类 GPT-3 模型上,我们的方法在 16 个节点的 64 块 NVIDIA A100 GPU 上最高实现 5 倍的性能提升,在 64 个节点的 64 块 NVIDIA H100 GPU 上最高实现 9.4 倍的性能提升。
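The two-dimensional split described in the ATTENTION2D abstract (query blocks along one axis, key/value blocks along the other) can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: the function names, the block layout, and the sequential loops standing in for devices are all assumptions made for illustration. Each hypothetical grid cell computes partial softmax statistics for one (query block, key/value block) pair, and the partial results are merged exactly with the standard log-sum-exp rescaling, so no approximation is introduced.

```python
import math

def attention_reference(Q, K, V):
    """Plain softmax(Q K^T / sqrt(d)) V on lists of lists, used as ground truth."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(wi * v[j] for wi, v in zip(w, V)) / z
                    for j in range(len(V[0]))])
    return out

def partial_block(Qb, Kb, Vb):
    """Work of one hypothetical grid cell: per query row, return
    (running max, softmax denominator, unnormalized output)."""
    d = len(Qb[0])
    parts = []
    for q in Qb:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in Kb]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        acc = [sum(wi * v[j] for wi, v in zip(w, Vb)) for j in range(len(Vb[0]))]
        parts.append((m, sum(w), acc))
    return parts

def merge(p1, p2):
    """Combine two partial results for the same query row (log-sum-exp rescaling)."""
    (m1, z1, a1), (m2, z2, a2) = p1, p2
    m = max(m1, m2)
    c1, c2 = math.exp(m1 - m), math.exp(m2 - m)
    return (m, c1 * z1 + c2 * z2, [c1 * x + c2 * y for x, y in zip(a1, a2)])

def split(rows, n):
    """Split a list of rows into at most n contiguous blocks."""
    size = math.ceil(len(rows) / n)
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def attention_2d(Q, K, V, q_blocks, kv_blocks):
    """Loop over a q_blocks x kv_blocks grid; on real hardware each cell
    would run on its own device and the merge would be a reduction."""
    out = []
    for Qb in split(Q, q_blocks):
        row_parts = None
        for Kb, Vb in zip(split(K, kv_blocks), split(V, kv_blocks)):
            parts = partial_block(Qb, Kb, Vb)
            row_parts = parts if row_parts is None else [
                merge(a, b) for a, b in zip(row_parts, parts)]
        out.extend([[x / z for x in acc] for _, z, acc in row_parts])
    return out
```

Because `merge` is associative, the reduction along the key/value axis can be done in any order, which is what makes it possible to parallelize that dimension in addition to the query dimension.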
Article 79
Title@2025-06-28 (6): Cicada: A Pipeline-Efficient Approach to Serverless Inference with Decoupled Management
Title: Cicada: A Pipeline-Efficient Approach to Serverless Inference with Decoupled Management | Cicada: Ein Pipeline-Effizienter Ansatz zur serverlosen Schlussfolgerung mit entkoppelter Verwaltung | Cicada:用管道有效处理无服务器推断与拆分管理的方法 2502.20959v2 |
Authors (7): Z. Wu, Y. Deng, J. Hu, L. Cui, Z. Zhang, L. Zeng, G. Min
Serverless computing has emerged as a pivotal paradigm for deploying Deep Learning (DL) models, offering automatic scaling and cost efficiency. However, the inherent cold start problem in serverless ML inference systems, particularly the time-consuming model loading process, remains a significant bottleneck. Utilizing pipelined model loading improves efficiency but still suffers from pipeline stalls due to sequential layer construction and monolithic weight loading. In this paper, we propose Cicada, a novel pipeline optimization framework that coordinates computational, storage, and scheduling resources through three key mechanisms: (1) MiniLoader, which reduces layer construction overhead by opportunistically optimizing parameter initialization; (2) WeightDecoupler, which decouples weight file processing from layer construction, enabling asynchronous weight retrieval and out-of-order weight application; (3) Priority-Aware Scheduler, which dynamically allocates resources to ensure high-priority inference tasks are executed promptly. Our experimental results demonstrate that Cicada achieves significant performance improvements over the state-of-the-art PISeL framework. Specifically, Cicada reduces end-to-end inference latency by an average of 61.59%, with the MiniLoader component contributing the majority of this optimization (53.41%), and the WeightDecoupler achieves up to 26.17% improvement. Additionally, Cicada achieves up to 2.52x speedup in the inference pipeline utilization compared to PISeL.
无服务器计算已成为部署深度学习(DL)模型的关键范式,可提供自动伸缩和成本效益。然而,无服务器 ML 推理系统固有的冷启动问题,尤其是耗时的模型加载过程,仍然是一个重大瓶颈。采用流水线式模型加载可以提高效率,但由于层的顺序构建和整体式权重加载,仍会出现流水线停顿。本文提出 Cicada,这是一个新的流水线优化框架,通过三个关键机制协调计算、存储和调度资源:(1)MiniLoader:通过适时优化参数初始化来降低层构建开销;(2)WeightDecoupler:将权重文件处理与层构建解耦,实现异步权重获取和乱序权重应用;(3)Priority-Aware Scheduler:动态分配资源,确保高优先级推理任务得到及时执行。实验结果表明,Cicada 相比最先进的 PISeL 框架取得了显著的性能提升。具体而言,Cicada 将端到端推理延迟平均降低 61.59%,其中 MiniLoader 组件贡献了大部分优化(53.41%),WeightDecoupler 最多带来 26.17% 的改进。此外,与 PISeL 相比,Cicada 在推理流水线利用率上最高实现 2.52 倍的加速。
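The decoupling idea in the Cicada abstract (weight retrieval running asynchronously while layers are constructed, with weights applied in whatever order they arrive) can be sketched with the standard library. This is only an illustration under stated assumptions: `load_model`, `fetch_weights`, and `build_layer` are hypothetical names, a layer is modelled as a plain dict, and nothing here reflects Cicada's actual code or the PISeL baseline.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def load_model(layer_names, fetch_weights, build_layer, workers=4):
    """Overlap weight retrieval with layer construction.

    fetch_weights(name) and build_layer(name) are caller-supplied,
    hypothetical callables; build_layer must return a dict.
    """
    layers = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Start all weight fetches first so they run in the background.
        futures = {pool.submit(fetch_weights, name): name for name in layer_names}
        # Construct layers while the fetches are in flight.
        for name in layer_names:
            layers[name] = build_layer(name)
        # Apply weights out of order, as each fetch completes.
        for fut in as_completed(futures):
            layers[futures[fut]]["weights"] = fut.result()
    return layers
```

A call such as `load_model(["embed", "block0", "head"], fetch, build)` returns once every layer has both been constructed and received its weights; no layer waits for an earlier layer's weights to arrive.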
Article 80
Title@2025-06-28 (6): Performance Measurements in the AI-Centric Computing Continuum Systems
Title: Performance Measurements in the AI-Centric Computing Continuum Systems | Leistungsmessungen in den KI-Centric Computing Continuum Systemen | AI-Centric 电子计算大陆系统的业绩计量 2506.22884v1 |
Authors (3): Praveen Kumar Donta, Qiyang Zhang, Schahram Dustdar
Over the past eight decades, computing paradigms have shifted from large, centralized systems to compact, distributed architectures, leading to the rise of the Distributed Computing Continuum (DCC). In this model, multiple layers such as cloud, edge, Internet of Things (IoT), and mobile platforms work together to support a wide range of applications. Recently, the emergence of Generative AI and large language models has further intensified the demand for computational resources across this continuum. Although traditional performance metrics have provided a solid foundation, they need to be revisited and expanded to keep pace with changing computational demands and application requirements. Accurate performance measurements benefit both system designers and users by supporting improvements in efficiency and promoting alignment with system goals. In this context, we review commonly used metrics in DCC and IoT environments. We also discuss emerging performance dimensions that address evolving computing needs, such as sustainability, energy efficiency, and system observability. Finally, we outline criteria and considerations for selecting appropriate metrics, aiming to inspire future research and development in this critical area.
八十年来,计算模式已从大型中央系统转向压缩分布式结构,导致分布式计算机连续系统(DCC)的兴起。在这个模型中,云层、边缘、物联网(IoT)和移动平台等多层共同支持广泛的应用。最近,创用AI和大语言模型的出现进一步加大了整个连续体对计算资源的需求。虽然传统的业绩计量提供了坚实的基础,但需要重新审视和扩大,以跟上不断变化的计算需求和应用要求。精确的绩效测量有利于系统设计者和用户,支持提高效率,促进与系统目标保持一致。在这方面,我们审查了DCC和IoT环境中常用的计量标准。我们还讨论了解决不断演变的计算需求(如可持续性、能效和系统可耐性)的新的业绩层面。我们还概述了选择适当计量的标准和考虑因素,目的是激励这一关键领域的未来研发。
Article 81
Title@2025-06-28 (6): Reliable Image Transmission in CPS-based Pub/Sub
Title: Reliable Image Transmission in CPS-based Pub/Sub | Zuverlässige Bildübertragung im CPS-basierten Pub/Sub | 以CPS为基础的PP/Pub/Sub的可靠图像传输 2506.22875v1 |
Authors (8): Everson Flores, Bruna Guterres, Thomaz Pereira Junior, Paula Barros, Alberto Cabral, Cristiana Lima Dora, Marcelo Malheiros, Marcelo Pias
Developments in communication and automation have driven the expansion of distributed networks, essential for IoT and CPS development in industrial applications requiring reliable image processing and real-time adaptability. Although MQTT is broadly adopted, there is a literature gap regarding the performance of the protocol for image sharing and transmission under high-traffic scenarios with intermittent connectivity, restricting its use in critical IoT and CPS applications. In this context, the present work examines the reliability of real-time image transmission in IoT and CPS industrial systems that utilize the MQTT-based publish/subscribe communication model. It focuses on scenarios with network interruptions and high data traffic, evaluating the performance of a distributed system through a series of controlled testbed validation experiments. Experimental validation demonstrated that while the MQTT-based system sustains reliable transmission under normal conditions, its recovery capability depends on the failure point, with complete restoration occurring when disruptions affect the Orchestrator Node and partial recovery when the Producer Node or Broker are affected. The study also confirmed that the system prevents duplicate errors and adapts well to increasing network demands, reinforcing its suitability for industrial applications that require efficient and resilient data handling.
通信和自动化方面的发展推动了分布式网络的扩展,这对需要可靠的图像处理和实时适应的工业应用的IOT和CPS开发至关重要。虽然广泛采用,但关于MQTT协议在具有间歇性连接的高流量情景下图像共享和传播的绩效,在通信和自动化的发展限制了其在关键的IOT和CPS应用中的使用。在这方面,目前的工作审查了使用MQTT的出版/订阅通信模型的IOT和CPS工业系统中实时图像传输的可靠性。它侧重于网络中断和高数据流量的情景,通过一系列受控试验证实验评估分布式系统的性能。实验验证表明,虽然基于MQTT的系统在正常条件下维持可靠的传输,但其恢复能力取决于故障点,在干扰影响Orchestrator Node和在生产商Node或Broker受到影响时部分恢复。研究还证实,该系统防止重复错误,并适应不断增长的网络需求,加强其在需要高效和有弹性数据处理的工业应用方面的适宜性。
Article 82
Title@2025-06-28 (6): Momentum-based Accelerated Algorithm for Distributed Optimization under Sector-Bound Nonlinearity
Title: Momentum-based Accelerated Algorithm for Distributed Optimization under Sector-Bound Nonlinearity | Momentumbasierte beschleunigte Algorithmen zur verteilten Optimierung unter sektorübergreifender Nichtlinearität | 部门-基于动力的在部门-健全非线性下分配的优化分配加速计算 2506.22855v1 |
Authors (2): Mohammadreza Doostmohammadian, Hamid R. Rabiee
Distributed optimization advances centralized machine learning methods by enabling parallel and decentralized learning processes over a network of computing nodes. This work provides an accelerated consensus-based distributed algorithm for locally non-convex optimization using the gradient-tracking technique. The proposed algorithm (i) improves the convergence rate by adding momentum towards the optimal state using the heavy-ball method, while (ii) addressing general sector-bound nonlinearities over the information-sharing network. The link nonlinearity includes any sign-preserving odd sector-bound mapping, for example, log-scale data quantization or clipping in practical applications. For admissible momentum and gradient-tracking parameters, using perturbation theory and eigen-spectrum analysis, we prove convergence even in the presence of sector-bound nonlinearity and for locally non-convex cost functions. Further, in contrast to most existing weight-stochastic algorithms, we adopt a weight-balanced (WB) network design. This WB design and perturbation-based analysis make it possible to handle dynamic directed networks of agents and thus address time-varying setups caused by link failures or packet drops.
通过在计算节点网络上建立平行和分散的学习过程,分散优化优化的中央机器学习方法,从而在计算节点网络上实现平行和分散的学习过程。这项工作提供了一种加速的基于共识的分布算法,用于使用梯度跟踪技术实现当地非曲线优化。提议的算法(一)通过利用重球法增加向最佳状态发展的势头,提高趋同率,同时(二)解决在信息共享网络上一般部门非线性的问题。链接非线性包括任何标值保留奇异的部门分布绘图,例如,对日志尺度数据进行定量或对实际应用进行剪切。对于可接受的势头和梯度跟踪参数,我们利用扰动理论和eigen光谱分析,证明即使在存在部门性非直线性和当地非碳度成本功能的情况下,我们也会达到趋同率。此外,与大多数现有的加权组合算法相比,我们采用了权重平衡(WB)网络设计。这种WB设计和基于扰动的分析可以处理动态引导的代理网络,以解决因链接失败或包装下降而可能造成的时间波动的设置。
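Two ingredients named in the abstract above, heavy-ball momentum and a sign-preserving, odd, sector-bound nonlinearity such as log-scale quantization, can be illustrated in a few lines. The sketch below is a deliberately simplified, single-variable illustration and not the paper's multi-agent gradient-tracking algorithm: the function names, the quantizer step `rho`, and the choice to apply the nonlinearity to the gradient (rather than to messages on network links) are all assumptions.

```python
import math

def log_quantize(x, rho=0.1):
    """Sign-preserving, odd logarithmic quantizer.

    Rounds log|x| to a grid of step rho, so the ratio q(x)/x always lies
    in the sector [exp(-rho/2), exp(rho/2)].
    """
    if x == 0:
        return 0.0
    level = rho * round(math.log(abs(x)) / rho)
    return math.copysign(math.exp(level), x)

def heavy_ball(grad, x0, alpha, beta, steps, nonlinearity=lambda g: g):
    """Heavy-ball iteration x_{k+1} = x_k - alpha * h(grad(x_k)) + beta * (x_k - x_{k-1}),
    where h is an optional sector-bound nonlinearity (identity by default)."""
    x_prev, x = x0, x0
    for _ in range(steps):
        x_next = x - alpha * nonlinearity(grad(x)) + beta * (x - x_prev)
        x_prev, x = x, x_next
    return x
```

For example, with the quadratic cost f(x) = (x - 3)^2, `heavy_ball(lambda x: 2.0 * (x - 3.0), 0.0, 0.1, 0.5, 200)` approaches the minimizer x = 3, and it still does so when `log_quantize` is passed as the nonlinearity, because the quantizer only rescales the gradient by a factor close to 1 and preserves its sign.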
Article 83
Title@2025-06-28 (6): Adaptive Rank Allocation for Federated Parameter-Efficient Fine-Tuning of Language Models
Title: Adaptive Rank Allocation for Federated Parameter-Efficient Fine-Tuning of Language Models | Adaptive Rangverteilung für Federated Parameter-Efficient Fine-Tuning of Language Models | 联邦准拉米有效精密语言模式调适级分配 2501.14406v3 |
Authors (4): Fei Wu, Jia Hu, Geyong Min, Shiqiang Wang
Pre-trained Language Models (PLMs) have demonstrated their superiority and versatility in modern Natural Language Processing (NLP), effectively adapting to various downstream tasks through further fine-tuning. Federated Parameter-Efficient Fine-Tuning (FedPEFT) has emerged as a promising solution to address privacy and efficiency challenges in distributed training for PLMs on resource-constrained local devices. However, our measurements reveal two key limitations of FedPEFT: heterogeneous data across devices exacerbates performance degradation of low-rank adaptation, and a fixed parameter configuration results in communication inefficiency. To overcome these limitations, we propose FedARA, a novel Adaptive Rank Allocation framework for federated parameter-efficient fine-tuning of language models. Specifically, FedARA employs truncated Singular Value Decomposition (SVD) adaptation to enhance similar feature representation across clients, significantly mitigating the adverse effects of data heterogeneity. Subsequently, it utilizes dynamic rank allocation to progressively identify critical ranks, effectively improving communication efficiency. Lastly, it leverages rank-based module pruning to automatically remove inactive modules, steadily reducing local computational cost and memory usage in each federated learning round. Extensive experiments show that FedARA consistently outperforms baselines by an average of 6.95% to 8.49% across various datasets and models under heterogeneous data while significantly improving communication efficiency by 2.40x. Moreover, experiments on various edge devices demonstrate substantial decreases in total training time and energy consumption by up to 48.90% and 46.95%, respectively.
预训练语言模型(PLM)在现代自然语言处理(NLP)中展现了其优越性和通用性,可通过进一步微调有效适应各种下游任务。联邦参数高效微调(FedPEFT)已成为一种有前景的方案,用于应对在资源受限的本地设备上对 PLM 进行分布式训练时的隐私和效率挑战。然而,我们的测量揭示了 FedPEFT 的两个关键局限:设备间的异构数据加剧了低秩适配的性能下降,固定的参数配置导致通信效率低下。为克服这些局限,我们提出 FedARA,一种用于语言模型联邦参数高效微调的新型自适应秩分配框架。具体而言,FedARA 采用截断奇异值分解(SVD)适配来增强各客户端之间相似的特征表示,显著缓解数据异构性的不利影响。随后,它利用动态秩分配逐步识别关键秩,有效提高通信效率。最后,它利用基于秩的模块剪枝自动移除不活跃的模块,在每轮联邦学习中稳步降低本地计算成本和内存占用。大量实验表明,在异构数据下,FedARA 在各种数据集和模型上平均比基线高出 6.95% 至 8.49%,同时将通信效率显著提高 2.40 倍。此外,在多种边缘设备上的实验表明,总训练时间和能耗分别最多减少 48.90% 和 46.95%。
Article 84
Title@2025-06-28 (6): TriADA: Massively Parallel Trilinear Matrix-by-Tensor Multiply-Add Algorithm and Device Architecture for the Acceleration of 3D Discrete Transformations
Title: TriADA: Massively Parallel Trilinear Matrix-by-Tensor Multiply-Add Algorithm and Device Architecture for the Acceleration of 3D Discrete Transformations | TriADA: Massiv parallel Trilineare Matrix-by-Tensor Multiplizieren von Algorithmen und Gerätearchitektur für die Beschleunigung von 3D-Diskreten Transformationen | TriADA: 加速 3D 分立变换的大规模平行平行三线矩阵矩阵逐个传感器乘数加算法和设备结构 2506.22818v1 |
Authors (4): Stanislav Sedukhin, Yoichi Tomioka, Kazuya Matsumoto, Yuichi Okuyama
Multilinear transformations are key in high-performance computing (HPC) and artificial intelligence (AI) workloads, where data is represented as tensors. However, their high computational and memory demands, which grow with dimensionality, often slow down critical tasks. Moreover, scaling computation by enlarging the number of parallel processing units substantially increases energy consumption, limiting widespread adoption, especially for sparse data, which is common in HPC and AI applications. This paper introduces the Trilinear Algorithm and isomorphic to algorithm Device Architecture (TriADA) to address these challenges with the following innovations: (1) a massively parallel, low-rank algorithm for computing a family of trilinear (3D) discrete orthogonal transformations (3D-DXTs), which is a special case of the more general 3-mode matrix-by-tensor multiplication (3D-GEMT); (2) a new outer-product-based GEMM kernel with decoupled streaming active memory, specially designed to accelerate 3D-GEMT operation; (3) an isomorphic to the proposed algorithm, fully distributed 3D network of mesh interconnected processing elements or cells with a coordinate-free, data-driven local processing activity, which is independent of problem size; (4) an elastic sparse outer-product (ESOP) method that avoids unnecessary computing and communication operations with zero-valued operands, thereby enhancing energy efficiency, computational accuracy, and stability. TriADA is capable of performing a variety of trilinear transformations with hypercubic arithmetic complexity in a linear number of time-steps. The massively parallel, scalable, and energy-efficient architecture of TriADA is ideal for accelerating multilinear tensor operations, which are the most demanding parts of AI and HPC workloads.
多线性变换是高性能计算(HPC)和人工智能(AI)工作负载的关键,其中数据以张量表示。然而,其计算和内存需求随维数增长而升高,常常拖慢关键任务。此外,通过增加并行处理单元数量来扩展计算会大幅增加能耗,限制了广泛采用,对于 HPC 和 AI 应用中常见的稀疏数据尤其如此。本文提出三线性算法及与该算法同构的设备架构(TriADA),以如下创新应对这些挑战:(1)一种大规模并行的低秩算法,用于计算一族三线性(3D)离散正交变换(3D-DXT),它是更一般的三模矩阵乘张量运算(3D-GEMT)的特例;(2)一种新的基于外积的 GEMM 内核,带有解耦的流式主动内存,专为加速 3D-GEMT 运算而设计;(3)一个与所提算法同构的、完全分布式的三维网状互连处理单元(cell)网络,其本地处理活动无坐标、由数据驱动,且与问题规模无关;(4)一种弹性稀疏外积(ESOP)方法,避免对零值操作数进行不必要的计算和通信,从而提高能效、计算精度和稳定性。TriADA 能够在线性数量的时间步内完成具有超立方算术复杂度的多种三线性变换。TriADA 大规模并行、可扩展且节能的架构非常适合加速多线性张量运算,而这类运算是 AI 和 HPC 工作负载中要求最高的部分。
Article 85
Title@2025-06-28 (6): Characterizing GPU Resilience and Impact on AI/HPC Systems
Title: Characterizing GPU Resilience and Impact on AI/HPC Systems | Charakterisierung der GPU-Resilienz und Auswirkungen auf AI/HPC-Systeme | 确定GPU的复原力和对AI/HPC系统的影响 2503.11901v3 |
Authors (14): Shengkun Cui, Archit Patke, Hung Nguyen, Aditya Ranjan, Ziheng Chen, Phuong Cao, Brett Bode, Gregory Bauer, Catello Di Martino, Saurabh Jha, Chandra Narayanaswami, Daby Sow, Zbigniew T. Kalbarczyk, Ravishankar K. Iyer
This study characterizes GPU resilience in Delta HPC, a large-scale AI system that consists of 1,056 A100 and H100 GPUs, with over 1,300 petaflops of peak throughput. Delta HPC is operated by the National Center for Supercomputing Applications (NCSA) at the University of Illinois Urbana-Champaign. We used 2.5 years of operational data (11.7 million GPU hours) on GPU errors. Our major findings include: (i) H100 GPU memory resilience is worse than A100 GPU memory, with 3.2x lower per-GPU MTBE for memory errors, (ii) The GPU memory error-recovery mechanisms on H100 GPUs are insufficient to handle the increased memory capacity, (iii) H100 GPUs demonstrate significantly improved GPU hardware resilience over A100 GPUs with respect to critical hardware components, (iv) GPU errors on both A100 and H100 GPUs frequently result in job failures due to the lack of robust recovery mechanisms at the application level, and (v) We project the impact of GPU node availability on larger-scales and find that significant overprovisioning of 5% is necessary to handle GPU failures.
这项研究对德尔塔高电联的GPU复原力进行了特征分析,Delta HPC是一个大型的AI系统,由1,056 A100 和H100 GPU组成,有1,300多个顶峰输送管道。Delta HPC由伊利诺伊大学Ubana-Champaign国家超载应用中心(NCSA)运行。我们在GPU错误上使用了2.5年的操作数据(1,170万 GPU小时)。我们的主要发现包括:(一) H100 GPU记忆复原力比A100 GPU记忆力差,每GPU MTBE的记忆错误为3.2x低,而每个GPUMTBE的记忆错误回收机制不足以处理增加的内存能力;(二) H100 GPU的GPU内存错误回收机制不足以处理更大的内存能力;(三) HPU显示,在关键硬件部件上GPU的GPU的硬件恢复能力大大提高;(四) A100 和H100 GPU的G错误经常导致工作失误,因为应用一级缺乏强大的恢复机制;(五) 我们预测,GPUPUPU的可用率必须处理5%的失败。
Article 86
Title@2025-06-28 (6): Efficiently Serving Large Multimodal Models Using EPD Disaggregation
Title: Efficiently Serving Large Multimodal Models Using EPD Disaggregation | Effizientes Servieren großer multimodaler Modelle mit EPD-Disaggregation | 利用EPD拆分有效服务大型多模式模式 2501.05460v4 |
Authors (12): Gursimran Singh, Xinglu Wang, Yifan Hu, Timothy Yu, Linzi Xing, Wei Jiang, Zhefeng Wang, Xiaolong Bai, Yi Li, Ying Xiong, Yong Zhang, Zhenan Fan
Large Multimodal Models (LMMs) extend Large Language Models (LLMs) by handling diverse inputs such as images, audio, and video, but at the cost of adding a multimodal encoding stage that increases both computational and memory overhead. This step negatively affects key Service Level Objectives (SLOs), such as time to first token (TTFT) and time per output token (TPOT). We introduce Encode-Prefill-Decode (EPD) Disaggregation, a novel framework that separates the encoding, prefill, and decode stages onto dedicated resources. Unlike current systems, which bundle encoding and prefill together, our approach decouples these steps, unlocking new opportunities and optimizations. These include a mechanism to cache multimedia tokens for efficient transfer, a novel way to parallelize the encoding load within a request, a module for optimal resource allocation for disaggregated serving, and a novel role-switching method to handle changing workload characteristics. Experimental evaluations with popular LMMs show substantial gains in memory efficiency (up to 15x lower peak memory utilization), batch sizes (up to 22x larger), 10x more images per request, and 2.2x larger KV caches. Furthermore, it leads to significant improvements in SLO attainment (up to 90-100% improvement) and TTFT (up to 71% reduction), compared to systems that do not disaggregate. The code is available at https://github.com/vbdi/epdserve.
大型多模态模型(LMM)通过处理图像、音频和视频等多样化输入扩展了大语言模型(LLM),但代价是增加了一个多模态编码阶段,使计算和内存开销同时上升。这一步骤对首个令牌时间(TTFT)和每输出令牌时间(TPOT)等关键服务级别目标(SLO)产生负面影响。我们提出编码-预填充-解码(EPD)分离,这是一个将编码、预填充和解码阶段分别部署到专用资源上的新框架。与将编码和预填充捆绑在一起的现有系统不同,我们的方法将这些步骤解耦,带来新的机会和优化空间,包括:用于高效传输的多媒体令牌缓存机制、在单个请求内并行化编码负载的新方法、用于分离式服务的最优资源分配模块,以及应对工作负载特征变化的新型角色切换方法。在流行 LMM 上的实验评估表明,内存效率大幅提升(峰值内存占用最多降低 15 倍),批大小最多增大 22 倍,每个请求可处理的图像数增加 10 倍,KV 缓存增大 2.2 倍。此外,与不进行分离的系统相比,SLO 达成率显著改善(最多提升 90-100%),TTFT 最多降低 71%。代码见 https://github.com/vbdi/epdserve。
Article 87
Title@2025-06-28 (6): Libra: Synergizing CUDA and Tensor Cores for High-Performance Sparse Matrix Multiplication
Title: Libra: Synergizing CUDA and Tensor Cores for High-Performance Sparse Matrix Multiplication | Waage: CUDA- und Tensorkerne für hochleistungsfähige Sparse-Matrix-Multiplikation synergisieren | 激光仪:将CUDA和Tensor核心同步用于高性能散射矩阵乘法 2506.22714v1 |
Authors (7): Jinliang Shi, Shigang Li, Youxuan Xu, Xueying Wang, Rongtian Fu, Zhi Ma, Tong Wu
Sparse matrix multiplication operators (i.e., SpMM and SDDMM) are widely used in deep learning and scientific computing. Modern accelerators are commonly equipped with Tensor cores and CUDA cores to accelerate sparse operators. The former brings superior computing power but only for structured matrix multiplication, while the latter has relatively lower performance but with higher programming flexibility. In this work, we discover that utilizing one resource alone leads to inferior performance for sparse matrix multiplication, due to their respective limitations. To this end, we propose Libra, a systematic approach that enables synergistic computation between CUDA and Tensor cores to achieve the best performance for sparse matrix multiplication. Specifically, we propose a 2D-aware workload distribution strategy to find the sweet spot of task mapping for different sparse operators, leveraging both the high performance of Tensor cores and the low computational redundancy on CUDA cores. In addition, Libra incorporates systematic optimizations for heterogeneous computing, including hybrid load-balancing, finely optimized kernel implementations, and GPU-accelerated preprocessing. Extensive experimental results on H100 and RTX 4090 GPUs show that Libra outperforms the state-of-the-art by 3.1x on average (up to 9.23x) over DTC-SpMM and 2.9x (up to 3.9x) for end-to-end GNN applications. Libra opens up a new perspective for sparse operator acceleration by fully exploiting the heterogeneous computing resources on GPUs.
稀疏矩阵乘法算子(即 SpMM 和 SDDMM)广泛应用于深度学习和科学计算。现代加速器通常同时配备 Tensor 核心和 CUDA 核心来加速稀疏算子。前者计算能力更强,但仅适用于结构化矩阵乘法;后者性能相对较低,但编程灵活性更高。在这项工作中,我们发现,由于各自的局限,单独使用其中一种资源会导致稀疏矩阵乘法性能欠佳。为此,我们提出 Libra,这是一种系统化方法,使 CUDA 核心与 Tensor 核心协同计算,以实现稀疏矩阵乘法的最佳性能。具体而言,我们提出一种 2D 感知的工作负载分配策略,为不同稀疏算子找到任务映射的最佳平衡点,同时利用 Tensor 核心的高性能和 CUDA 核心的低计算冗余。此外,Libra 还包含针对异构计算的系统性优化,包括混合负载均衡、精细优化的内核实现以及 GPU 加速的预处理。在 H100 和 RTX 4090 GPU 上的大量实验结果表明,Libra 相比 DTC-SpMM 平均快 3.1 倍(最高 9.23 倍),在端到端 GNN 应用上平均快 2.9 倍(最高 3.9 倍)。Libra 通过充分利用 GPU 上的异构计算资源,为稀疏算子加速开辟了新的视角。
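The workload-distribution idea in the Libra abstract (send the dense-enough parts of a sparse matrix to Tensor cores, where padding is cheap relative to throughput, and the remaining scattered nonzeros to CUDA cores) can be sketched as a simple tile classifier. The abstract does not give the actual rule, so the tile shape, the nonzero-count `threshold`, and the function name below are assumptions chosen purely for illustration.

```python
def split_tiles(coords, tile_rows, tile_cols, threshold):
    """Classify the non-empty tiles of a sparse matrix by nonzero count.

    coords: iterable of (row, col) positions of nonzeros.
    Returns (dense_tiles, sparse_tiles), each a sorted list of tile indices;
    in this sketch dense tiles would go to Tensor cores and sparse tiles
    to CUDA cores.
    """
    counts = {}
    for r, c in coords:
        key = (r // tile_rows, c // tile_cols)
        counts[key] = counts.get(key, 0) + 1
    dense = sorted(t for t, n in counts.items() if n >= threshold)
    sparse = sorted(t for t, n in counts.items() if n < threshold)
    return dense, sparse
```

With 2x2 tiles and a threshold of 3, a matrix whose nonzeros sit at (0,0), (0,1), (1,0), (1,1) and (5,7) yields one dense tile, `(0, 0)`, and one sparse tile, `(2, 3)`. Tuning `threshold` is the knob that trades Tensor-core throughput against wasted computation on zeros.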
Article 88
Title@2025-06-27 (5): SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving
Title: SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving | SLED: Ein spekulatives LLM-Decoding-Framework für effizientes Edge Serving | SLED: 有效边缘服务投机性LLM代谢框架 2506.09397v3 |
Authors (8): Xiangchen Li, Dimitrios Spatharakis, Saeid Ghafouri, Jiakun Fan, Hans Vandierendonck, Deepu John, Bo Ji, Dimitrios Nikolopoulos
The growing gap between the increasing complexity of large language models (LLMs) and the limited computational budgets of edge devices poses a key challenge for efficient on-device inference, despite gradual improvements in hardware capabilities. Existing strategies, such as aggressive quantization, pruning, or remote inference, trade accuracy for efficiency or lead to substantial cost burdens. This position paper introduces a new framework that leverages speculative decoding, previously viewed primarily as a decoding acceleration technique for autoregressive generation of LLMs, as a promising approach specifically adapted for edge computing by orchestrating computation across heterogeneous devices. We propose SLED, a framework that allows lightweight edge devices to draft multiple candidate tokens locally using diverse draft models, while a single, shared edge server verifies the tokens utilizing a more precise target model. To further increase the efficiency of verification, the edge server batches the diverse verification requests from devices. This approach supports device heterogeneity and reduces server-side memory footprint by sharing the same upstream target model across multiple devices. Our initial experiments with Jetson Orin Nano, Raspberry Pi 4B/5, and an edge server equipped with 4 Nvidia A100 GPUs indicate substantial benefits: 2.2 more system throughput, 2.8 more system capacity, and better cost efficiency, all without sacrificing model accuracy.
大型语言模型(LLMs)日益复杂,而边缘装置计算预算有限,这两者之间日益加大了差距,尽管硬件能力逐步改善,但对于高效的在设备上进行精密推断是一个关键的挑战。现有的战略,如进取量计、裁剪或远程推断、效率方面的贸易准确性或导致巨大的成本负担。本立场文件引入了一个新的框架,利用投机性解码技术,利用投机性解码,过去主要被视为自动递减生成LMs的一种解码加速技术,这是一种有希望的方法,通过在多种设备中串通计算来专门适应边缘计算。我们提议了\acronym,这个框架允许轻量级边设备使用不同的草稿模型在当地起草多个候选标牌,而一个单一的共享边端服务器则使用更精确的目标模型来验证这些标牌。为了进一步提高核查效率,边端服务器从设备中收集了各种核查请求。这个方法支持装置的异质性,并通过共享多个设备共享相同的上游目标模型来减少服务器的记忆足迹。我们与Jetson Orin Nano、Raspberry Pi 4B/5、更精度系统更精准性、更精准性地显示所有N4VLALVALServi系统。
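The draft-then-verify loop that the SLED abstract builds on can be sketched for the simplest case, greedy decoding with a single device. Everything in the sketch is an assumption for illustration: models are represented as plain callables that map a token list to the next token, the names are hypothetical, and the verification loop calls the target model once per position, whereas a real server would score all drafted positions in one batched forward pass and batch requests from several devices.

```python
def speculative_step(prefix, draft_next, target_next, k):
    """One round: the (edge) draft model proposes k tokens, the (server)
    target model keeps the longest agreeing prefix and adds one token of its own."""
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        token = draft_next(ctx)
        drafted.append(token)
        ctx.append(token)

    accepted, ctx = [], list(prefix)
    for token in drafted:
        expected = target_next(ctx)
        if expected != token:
            accepted.append(expected)      # first mismatch: take the target's token
            return accepted
        accepted.append(token)
        ctx.append(token)
    accepted.append(target_next(ctx))      # all drafts accepted: one bonus token
    return accepted

def generate(prefix, draft_next, target_next, k, max_len):
    """Repeat speculative rounds until max_len tokens have been produced."""
    tokens = list(prefix)
    while len(tokens) < max_len:
        tokens.extend(speculative_step(tokens, draft_next, target_next, k))
    return tokens[:max_len]
```

By construction every emitted token equals what the target model would have produced greedily at that position, so the output matches target-only decoding; the draft model affects only how many target calls are needed, which is why accuracy is not traded away.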
Article 89
Title@2025-06-27 (5): DistShap: Scalable GNN Explanations with Distributed Shapley Values
Title: DistShap: Scalable GNN Explanations with Distributed Shapley Values | DistShap: Skalierbare GNN-Erklärungen mit verteilten Shapley-Werten | 分布式shap:可缩放的 GNN 解释和分布式形状值 2506.22668v1 |
Authors (3): Selahattin Akkas, Aditya Devarakonda, Ariful Azad
With the growing adoption of graph neural networks (GNNs), explaining their predictions has become increasingly important. However, attributing predictions to specific edges or features remains computationally expensive. For example, classifying a node with 100 neighbors using a 3-layer GNN may involve identifying important edges from millions of candidates contributing to the prediction. To address this challenge, we propose DistShap, a parallel algorithm that distributes Shapley value-based explanations across multiple GPUs. DistShap operates by sampling subgraphs in a distributed setting, executing GNN inference in parallel across GPUs, and solving a distributed least squares problem to compute edge importance scores. DistShap outperforms most existing GNN explanation methods in accuracy and is the first to scale to GNN models with millions of features by using up to 128 GPUs on the NERSC Perlmutter supercomputer.
由于越来越多地采用图形神经网络(GNN),解释其预测已变得日益重要。然而,将预测归因于特定的边缘或特征,在计算上仍然非常昂贵。例如,使用三层GNN对100个邻居的节点进行分类可能涉及从成百上千万候选人中找出有助于预测的重要边缘。为了应对这一挑战,我们提议DistShap(DistShap),这是一个平行的算法,在多个GPUs之间传播基于Shapley价值的解释。DistShap(DistShap)通过在分布式设置中取样子集成操作,在GPUs之间平行执行GNN推断,解决分布最小的平方块问题,以计算边缘重要性分数。DustShap(DistShap)在准确性上超越了大多数现有的GNNN的解释方法,并且是第一个通过在NERSC Perlmuter超级计算机上使用多达128个GPUPS来向具有数百万个特性的GNNNNM模型扩展。
Article 90
Title@2025-06-27 (5): Reductions in local certification
Title: Reductions in local certification | Reduzierung der lokalen Zertifizierung | 地方认证减少 2502.01551v2 |
Authors (2): Louis Esperet, Sébastien Zeitoun
Local certification is a topic originating from distributed computing, where a prover tries to convince the vertices of a graph $G$ that $G$ satisfies some property $\mathcal{P}$. To convince the vertices, the prover gives a small piece of information, called a certificate, to each vertex, and the vertices then decide whether the property $\mathcal{P}$ is satisfied by just looking at their certificate and the certificates of their neighbors. When studying a property $\mathcal{P}$ from the perspective of local certification, the aim is to find the optimal size of the certificates needed to certify $\mathcal{P}$, which can be viewed as a measure of the local complexity of $\mathcal{P}$. A certification scheme is considered to be efficient if the size of the certificates is polylogarithmic in the number of vertices. While there have been a number of meta-theorems providing efficient certification schemes for general graph classes, the proofs of the lower bounds on the size of the certificates are usually very problem-dependent. In this work, we introduce a notion of hardness reduction in local certification, and show that we can transfer a lower bound on the certificates for a property $\mathcal{P}$ to a lower bound for another property $\mathcal{P}'$, via a (local) hardness reduction from $\mathcal{P}$ to $\mathcal{P}'$. We then give a number of applications in which we obtain polynomial lower bounds for many classical properties using such reductions.
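本地认证是源自分布式计算的一个课题:证明者试图让图 $G$ 的各个顶点相信 $G$ 满足某个性质 $\mathcal{P}$。为此,证明者给每个顶点一小段称为证书的信息,各顶点仅通过查看自己及其邻居的证书来判定性质 $\mathcal{P}$ 是否成立。从本地认证的角度研究性质 $\mathcal{P}$ 时,目标是找到认证 $\mathcal{P}$ 所需证书的最优大小,这可以看作 $\mathcal{P}$ 局部复杂度的一种度量。如果证书大小是顶点数的多对数级,则认为该认证方案是高效的。虽然已有若干元定理为一般图类给出了高效的认证方案,但证书大小下界的证明通常高度依赖具体问题。在这项工作中,我们引入本地认证中的难度归约概念,并证明可以通过从 $\mathcal{P}$ 到 $\mathcal{P}'$ 的(局部)难度归约,把性质 $\mathcal{P}$ 的证书下界转移为另一性质 $\mathcal{P}'$ 的下界。随后我们给出若干应用,利用此类归约为许多经典性质得到多项式下界。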
Article 91
Title@2025-06-27 (5): Towards Operational Data Analytics Chatbots – Virtual Knowledge Graph is All You Need
Title: Towards Operational Data Analytics Chatbots – Virtual Knowledge Graph is All You Need | Auf dem Weg zu operativen Datenanalytik Chatbots – Virtual Knowledge Graph ist alles, was Sie brauchen | 迈向实用数据分析分析聊天器 – – 虚拟知识图是你所需要的全部 2506.22267v1 |
Authors (4): Junaid Ahmed Khan, Hiari Pizzini Cavagna, Andrea Proia, Andrea Bartolini
With generative artificial intelligence challenging computational scientific computing, data centers are experiencing unprecedented growth in both scale and volume. As a result, computing efficiency has become more critical than ever. Operational Data Analytics (ODA) relies on the collection of data center telemetry to improve efficiency, but so far has been focusing on real-time telemetry data visualization and post-mortem analysis. However, with NoSQL databases now serving as the default storage backend to support scalability, querying this data is challenging due to its schema-less nature, which requires domain knowledge to traverse relationships between data sources. Ontologies and Knowledge Graphs (KGs) can capture these relationships, but traditional KGs are costly to scale and have not been widely applied to multivariate timeseries. Virtual Knowledge Graphs (VKGs) offer a lightweight alternative by generating query-specific graphs at runtime. In this work, we present a full end-to-end ODA chatbot system that uses a Large Language Model (LLM) to generate SPARQL queries, utilizing VKG for data retrieval. This approach achieves 92.5% accuracy compared to 25% with direct NoSQL queries. The proposed methodology optimizes VKG construction and LLM inference, cutting the average query latency of previous work by 85% (from 20.36s to 3.03s) and keeping VKG sizes under 179 MiB. This performance makes the tool suitable for deployment and real-time interaction with ODA end-users.
随着生成式人工智能对科学计算提出挑战,数据中心在规模和数量上都经历了前所未有的增长,计算效率因此比以往任何时候都更加关键。运行数据分析(ODA)依靠采集数据中心遥测数据来提高效率,但迄今为止一直侧重于实时遥测数据可视化和事后分析。然而,NoSQL数据库如今已成为支持可扩展性的默认存储后端,其无模式特性使数据查询变得困难,需要领域知识才能梳理数据源之间的关系。本体和知识图谱(KG)可以刻画这些关系,但传统KG扩展成本高,且尚未广泛应用于多变量时间序列。虚拟知识图谱(VKG)在运行时生成针对特定查询的图,提供了一种轻量级替代方案。在这项工作中,我们提出了一个完整的端到端ODA聊天机器人系统,它使用大语言模型(LLM)生成SPARQL查询,并利用VKG进行数据检索。该方法的准确率达到92.5%,而直接使用NoSQL查询仅为25%。所提方法优化了VKG构建和LLM推理,将先前工作的平均查询延迟降低了85%(从20.36秒降至3.03秒),并将VKG大小控制在179 MiB以下。这一性能使该工具适合部署,并能与ODA终端用户实时交互。
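The latency figures quoted in this abstract can be checked with a few lines of arithmetic. The sketch below only recomputes the relative reduction and the equivalent speedup from the two reported numbers; the variable and function names are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract; the names below are illustrative only.
previous_latency_s = 20.36   # average query latency of the previous work
optimized_latency_s = 3.03   # average query latency reported for this work


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which `after` is smaller than `before`."""
    return (before - after) / before


latency_cut = relative_reduction(previous_latency_s, optimized_latency_s)
speedup = previous_latency_s / optimized_latency_s
print(f"latency reduction: {latency_cut:.1%}, speedup: {speedup:.2f}x")
```

The reported 85% reduction therefore corresponds to a speedup of somewhat under 7x.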
Article 92
Title@2025-06-27 (5): Autonomic Microservice Management via Agentic AI and MAPE-K Integration
Title: Autonomic Microservice Management via Agentic AI and MAPE-K Integration | Autonomes Microservice Management über Agentic AI und MAPE-K Integration | 通过Agentic AI和MAPE-K整合进行自动微服务管理 2506.22185v1 |
Authors (7): Matteo Esposito, Alexander Bakhtin, Noman Ahmad, Mikel Robredo, Ruoyu Su, Valentina Lenarduzzi, Davide Taibi
While microservices are revolutionizing cloud computing by offering unparalleled scalability and independent deployment, their decentralized nature poses significant security and management challenges that can threaten system stability. We propose a framework based on MAPE-K, which leverages agentic AI, for autonomous anomaly detection and remediation to address the daunting task of highly distributed system management. Our framework offers practical, industry-ready solutions for maintaining robust and secure microservices. Practitioners and researchers can customize the framework to enhance system stability, reduce downtime, and monitor broader system quality attributes such as system performance level, resilience, security, and anomaly management, among others.
虽然微观服务正在通过提供空前的可扩缩性和独立部署,使云层计算革命化,但其分散性质带来了严重的安全和管理挑战,可能危及系统稳定。我们提议了一个以MAPE-K为基础的框架,利用ATP-K的代理AI进行自主异常探测和补救,以解决高度分布的系统管理的艰巨任务。我们的框架为维持强大和安全的微观服务提供了实用的、行业准备就绪的解决方案。从业者和研究人员可以定制框架,以加强系统稳定,减少故障,并监测更广泛的系统质量属性,例如系统性能水平、复原力、安全和异常管理等。
Article 93
Title@2025-06-27 (5): Reliability Analysis of Smart Contract Execution Architectures: A Comparative Simulation Study
Title: Reliability Analysis of Smart Contract Execution Architectures: A Comparative Simulation Study | Zuverlässigkeitsanalyse von Smart Contract Execution Architectures: Eine vergleichende Simulationsstudie | 智能合同执行结构可靠性分析:比较模拟研究 2506.22180v1 |
Authors (1): Önder Gürcan
The industrial market continuously needs reliable solutions to secure autonomous systems. Especially as these systems become more complex and interconnected, reliable security solutions are becoming increasingly important. One promising solution to tackle this challenge is using smart contracts designed to meet contractual conditions, avoid malicious errors, secure exchanges, and minimize the need for reliable intermediaries. However, smart contracts are immutable. Moreover, there are different smart contract execution architectures (namely Order-Execute and Execute-Order-Validate) that have different throughputs. In this study, we developed an evaluation model for assessing the security of reliable smart contract execution. We then developed a realistic smart contract enabled IoT energy case study. Finally, we simulate the developed case study to evaluate several smart contract security vulnerabilities reported in the literature. Our results show that the Execute-Order-Validate architecture is more promising regarding reliability and security.
工业市场不断需要可靠的解决方案来保障自主系统的安全。 特别是当这些系统变得更加复杂和相互联系时,可靠的安全解决方案变得越来越重要。 应对这一挑战的一个大有希望的解决方案是使用智能合同来满足合同条件、避免恶意错误、安全交换和尽量减少对可靠中介的需求。然而,智能合同是不可改变的。此外,有不同的智能合同执行结构(即命令-执行和执行-Order-Validate)有着不同的产出。在本研究中,我们开发了一个评估可靠智能合同执行安全的评估模型。然后我们开发了一个现实的智能合同,使IoT能源案例研究得以进行。最后,我们模拟了开发的案例研究,以评价文献中报道的若干智能合同安全弱点。我们的结果显示,执行-Order-Valiate架构在可靠性和安全方面更有希望。
Article 94
Title@2025-06-27 (5): MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism
Title: MPipeMoE: Memory Efficient MoE for Pre-trained Models with Adaptive Pipeline Parallelism | MPipeMoE: Memory Efficient MoE für vortrainierte Modelle mit adaptivem Pipeline Parallelismus | MPIPEMOE: 适应性管道平行主义的预培训模型记忆高效记忆部 2506.22175v1 |
Authors (7): Zheng Zhang, Donglin Yang, Yaqi Xia, Liang Ding, Dacheng Tao, Xiaobo Zhou, Dazhao Cheng
Recently, Mixture-of-Experts (MoE) has become one of the most popular techniques to scale pre-trained models to extraordinarily large sizes. Dynamic activation of experts allows for conditional computation, increasing the number of parameters of neural networks, which is critical for absorbing the vast amounts of knowledge available in many deep learning areas. However, despite the existing system and algorithm optimizations, significant challenges remain regarding the inefficiencies of communication and memory consumption. In this paper, we present the design and implementation of MPipeMoE, a high-performance library that accelerates MoE training with adaptive and memory-efficient pipeline parallelism. Motivated by the observation that the MoE training procedure can be divided into multiple independent sub-stages, we design adaptive pipeline parallelism with an online algorithm to configure the granularity of the pipelining. Further, we analyze the memory footprint breakdown of MoE training and identify that activations and temporary buffers are the primary contributors to the overall memory footprint. Toward memory efficiency, we propose memory reuse strategies to reduce memory requirements by eliminating memory redundancies, and develop an adaptive selection component to determine the optimal strategy that considers both hardware capacities and model characteristics at runtime. We implement MPipeMoE on top of PyTorch and evaluate it with common MoE models in a physical cluster consisting of 8 NVIDIA DGX A100 servers. Compared with the state-of-the-art approach, MPipeMoE achieves up to 2.8x speedup and reduces memory footprint by up to 47% in training large models.
最近,Mixture of Experters(MOE)已成为最受欢迎的技术之一,可以将培训前的模型推广到超大型的超大规模。活跃的专家激活使专家能够进行有条件的计算,增加神经网络的参数数量,这对吸收许多深层学习领域的大量知识至关重要。然而,尽管存在现有系统和算法优化,但在通信和记忆消耗效率低下方面,仍有重大挑战需要应对。本文介绍了MIPeMoE的设计和实施,这是一个高性能图书馆,可以加速对教育部的适应性和记忆高效的管道平行运行培训。由于可以将教育部的培训程序分为多个独立的子阶段,因此,我们设计了适应性管道平行,并采用了在线算法,以配置管道的颗粒性。此外,我们分析了教育部培训的记忆足迹分解,并确定启动和临时缓冲是整个记忆足迹的主要贡献者。为了提高记忆效率,我们建议通过消除记忆再现速度模型来减少记忆需求,降低存储速度的MIPX存储速度要求,并开发一个适应性OE的硬件部分。我们用适应性选择模型来确定最佳战略。
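The idea of configuring pipeline granularity can be illustrated with a toy cost model. This is a hypothetical sketch, not MPipeMoE's online algorithm: it assumes three sub-stages (dispatch, expert compute, combine) that run on separate resources, a fixed per-chunk launch overhead, and an offline search over candidate chunk counts. All names and numbers are assumptions made for illustration.

```python
def pipelined_time(stage_times, n_chunks, per_chunk_overhead):
    """Makespan of a simple pipeline when the batch is split into n_chunks.

    Assumption: each stage processes one chunk at a time, and stages overlap
    perfectly across chunks, so the slowest stage paces the pipeline.
    """
    per_chunk = [t / n_chunks + per_chunk_overhead for t in stage_times]
    return sum(per_chunk) + (n_chunks - 1) * max(per_chunk)


def best_granularity(stage_times, candidates, per_chunk_overhead):
    """Pick the chunk count with the smallest modelled makespan."""
    return min(
        candidates,
        key=lambda n: pipelined_time(stage_times, n, per_chunk_overhead),
    )


# Hypothetical stage costs in milliseconds: dispatch, expert compute, combine.
stage_times_ms = [4.0, 6.0, 4.0]
chosen = best_granularity(stage_times_ms, [1, 2, 4, 8, 16], per_chunk_overhead=0.2)
print("chosen number of chunks:", chosen)
```

Finer chunks increase overlap but also multiply the fixed overhead, so in this model the best granularity is an intermediate value rather than the largest one.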
Article 95
Title@2025-06-27 (5): Proof-of-Behavior: Behavior-Driven Consensus for Trustworthy Decentralized Finance
Title: Proof-of-Behavior: Behavior-Driven Consensus for Trustworthy Decentralized Finance | Proof-of-Behavior: Behavior-Driven Consensus für vertrauenswürdige dezentralisierte Finanzen | 行为证明:可信赖的权力下放金融行为共识 2506.22171v1 |
Authors (3): Ailiya Borjigin, Wei Zhou, Cong He
Current blockchain protocols (e.g., Proof-of-Work and Proof-of-Stake) secure the ledger yet cannot measure validator trustworthiness, allowing subtle misconduct that is especially damaging in decentralized-finance (DeFi) settings. We introduce Proof-of-Behavior (PoB), a consensus model that (i) gives each action a layered utility score – covering motivation and outcome, (ii) adapts validator weights using recent scores, and (iii) applies decentralized verification with proportional slashing. The reward design is incentive-compatible, yielding a Nash equilibrium in which honest behavior maximizes long-run pay-offs. Simulated DeFi experiments (loan-fraud detection, reputation-weighted validation) show that PoB cuts fraud acceptance by more than 90%, demotes malicious validators within two rounds, and improves proposer fairness versus standard PoS, all with no more than a 5% throughput overhead. By linking consensus influence to verifiably trustworthy conduct, PoB offers a scalable, regulation-friendly foundation for secure and fair blockchain governance in financial applications.
目前的链条协议(例如,工作证明和行为证明)确保了分类账的安全,但无法衡量验证的可信度,允许在分散金融(DeFi)环境中特别有害于他人的微妙的不当行为。我们引入了一个共识模式,即(一) 给每个行动一个分层的公用分数 – – 包括动机和结果,(二) 使用最近的分数来调整校准权重,(三) 应用分级的按比例裁量法进行分散核查。奖赏设计是符合奖励的,产生一种纳什平衡,使诚实的行为能够最大限度地实现长期的回报。模拟的 DeFi 实验( 欺诈检测、 信用加权验证) 显示, PoB 将欺诈的接受率削减90%以上, 并在两轮内降低恶意验证者, 提高提议方对标准 PoS 的公平性, 所有这些都不超过5% 的吞吐量管理费。 通过将共识影响与可核查的可信赖的行为联系起来, PoB 为金融应用中的安全和公正的阻断层治理提供了一个可扩展、有利于监管的基础。
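Steps (ii) and (iii) of the consensus model can be sketched as follows. This is a minimal illustration under stated assumptions, not the PoB specification: the exponential-smoothing update, the `learning_rate` and `max_fraction` parameters, and the linear slashing rule are all hypothetical choices.

```python
def update_weights(weights, recent_scores, learning_rate=0.5):
    """Move each validator's weight toward its recent behavior score, then normalize.

    Assumption: scores lie in [0, 1]; negative scores are clipped to zero.
    """
    raw = {
        v: (1 - learning_rate) * weights[v] + learning_rate * max(recent_scores[v], 0.0)
        for v in weights
    }
    total = sum(raw.values())
    return {v: w / total for v, w in raw.items()}


def slash(stake, severity, max_fraction=0.5):
    """Proportional slashing: the penalty grows linearly with severity in [0, 1]."""
    severity = min(max(severity, 0.0), 1.0)
    return stake * (1 - max_fraction * severity)


weights = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
scores = {"a": 0.9, "b": 0.8, "c": 0.0}   # validator "c" misbehaved this round
new_weights = update_weights(weights, scores)
remaining_stake = slash(100.0, severity=0.4)
print(new_weights, remaining_stake)
```

With these assumed parameters, a validator whose score drops to zero loses influence within a round or two while honest validators gain relative weight.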
Article 96
Title@2025-06-27 (5): MCFuser: High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators
Title: MCFuser: High-Performance and Rapid Fusion of Memory-Bound Compute-Intensive Operators | MCFuser: High-Performance und schnelle Fusion von Memory-Bound Compute-Intensive Operatoren | MCFuser: 内存 – – 弹道计算密集操作员的高度性能和迅速扩散 2506.22169v1 |
Authors (4): Zheng Zhang, Donglin Yang, Xiaobo Zhou, Dazhao Cheng
Operator fusion, a key technique to improve data locality and alleviate GPU memory bandwidth pressure, often fails to extend to the fusion of multiple compute-intensive operators due to saturated computation throughput. However, the dynamicity of tensor dimension sizes could potentially lead to these operators becoming memory-bound, necessitating the generation of fused kernels, a task hindered by limited search spaces for fusion strategies, redundant memory access, and prolonged tuning time, leading to sub-optimal performance and inefficient deployment. We introduce MCFuser, a pioneering framework designed to overcome these obstacles by generating high-performance fused kernels for what we define as memory-bound compute-intensive (MBCI) operator chains. Leveraging high-level tiling expressions to delineate a comprehensive search space, coupled with Directed Acyclic Graph (DAG) analysis to eliminate redundant memory accesses, MCFuser streamlines kernel optimization. By implementing guidelines to prune the search space and incorporating an analytical performance model with a heuristic search, MCFuser not only significantly accelerates the tuning process but also demonstrates superior performance. Benchmarked against leading compilers like Ansor on NVIDIA A100 and RTX3080 GPUs, MCFuser achieves up to a 5.9x speedup in kernel performance and outpaces other baselines while reducing tuning time by over 70-fold, showcasing its agility.
算子融合是改善数据局部性、缓解GPU内存带宽压力的关键技术,但由于计算吞吐量已趋饱和,它往往无法扩展到多个计算密集型算子的融合。然而,张量维度大小的动态变化可能使这些算子变为内存受限,从而需要生成融合内核;这一任务受到融合策略搜索空间有限、冗余内存访问和调优时间过长的阻碍,导致性能欠佳、部署低效。我们提出了MCFuser,这是一个开创性框架,旨在通过为我们定义的内存受限计算密集型(MBCI)算子链生成高性能融合内核来克服这些障碍。MCFuser利用高层分块(tiling)表达式刻画完整的搜索空间,并结合有向无环图(DAG)分析消除冗余内存访问,从而简化内核优化。通过采用剪枝准则缩小搜索空间,并将解析性能模型与启发式搜索相结合,MCFuser不仅显著加快了调优过程,还展现出更优的性能。在NVIDIA A100和RTX3080 GPU上与Ansor等领先编译器的对比测试中,MCFuser的内核性能最高提升5.9倍并超过其他基线,同时将调优时间缩短70倍以上,展现了其敏捷性。
Article 97
Title@2025-06-27 (5): SPTCStencil: Unleashing Sparse Tensor Cores for Stencil Computation via Strided Swap
Title: SPTCStencil: Unleashing Sparse Tensor Cores for Stencil Computation via Strided Swap | SPTCStencil: Entfesselung von Sparse Tensor Cores für Stencil-Berechnungen via Strided Swap | SPTCStencil:通过 Strided Swap 释放稀疏张量核心以进行模板计算 2506.22035v1 |
Authors (4): Qiqi GU, Chenpeng Wu, Heng Shi, Jianguo Yao
Stencil computation, a pivotal numerical method in science and engineering, iteratively updates grid points using weighted neighbor contributions and exhibits strong parallelism for multi-core processors. Current optimization techniques for stencil computation on tensor core accelerators incur substantial overheads due to redundant zero-padding during the transformation to matrix multiplication. To address this, we introduce a sparse computation paradigm that eliminates inefficiencies by exploiting specialized hardware units. This paper exploits the sparsity in these matrices as a feature and presents SPTCStencil, a high-performance stencil computation system accelerated by Sparse Tensor Cores (SpTCs). SPTCStencil is the first to harness SpTCs for acceleration beyond deep learning domains. First, our approach generalizes an efficient transformation of stencil computation into matrix multiplications and specializes this conversion for SpTC compatibility through a novel sparsification strategy. Furthermore, SPTCStencil incorporates a high-performance GPU kernel with systematic optimizations designed to maximize efficiency on SpTCs. Experimental evaluations demonstrate that SPTCStencil achieves a 5.46$\times$ speedup on average and outperforms Tensor Core-based approaches by 2.00$\times$ on average.
Stencils 计算是科学和工程中的关键数字方法,利用加权邻里贡献反复更新电网点,并展示了多核心处理器的强大平行。目前,针对对电压核心加速器进行静态计算的最优化技术由于向矩阵乘法转换过程中的多余零涂面而产生了大量的间接费用。为了解决这个问题,我们引入了一种稀疏的计算模式,通过利用专门硬件单位消除效率低下现象。本文利用这些矩阵中的孔隙作为特征,并介绍了由Spass Tensor Core(SpTCs)加速的高性能电线圈计算系统(SPCStencils)。POSCStencils是第一个利用SpTC实现超越深层学习域加速的先行技术。首先,我们的方法将电压计算高效转换为矩阵倍增法,并专门通过新式的蒸气化战略使SpTC兼容性转换。此外,SPCStencilstencils将高性GPU 内壳作为一种功能,系统优化旨在最大限度地提高SpTCs效率的系统优化。实验性评估显示,以2.CStencilstencils$00美元为标准。
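The transformation of a stencil into a matrix multiplication, and the zero-padding it introduces, can be seen in a small one-dimensional example. This is a generic illustration under simple assumptions (a 3-point stencil, a dense banded matrix, pure Python), not SPTCStencil's layout or its strided-swap scheme.

```python
def stencil_direct(x, w):
    """Apply a 1D stencil with weights w to the interior points of x."""
    r = len(w) // 2
    return [
        sum(w[k] * x[i - r + k] for k in range(len(w)))
        for i in range(r, len(x) - r)
    ]


def stencil_matrix(n, w):
    """Banded matrix whose product with x equals the stencil on interior points."""
    r = len(w) // 2
    rows = []
    for i in range(r, n - r):
        row = [0.0] * n              # zero-padding outside the stencil window
        for k, wk in enumerate(w):
            row[i - r + k] = wk
        rows.append(row)
    return rows


def matvec(a, x):
    """Dense matrix-vector product."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


x = [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]
w = [0.25, 0.5, 0.25]
a = stencil_matrix(len(x), w)
zero_fraction = sum(v == 0.0 for row in a for v in row) / (len(a) * len(x))
print(stencil_direct(x, w), matvec(a, x), zero_fraction)
```

Even in this tiny case half of the matrix entries are padding zeros, and the fraction grows with the grid size; that structured sparsity is the kind of redundancy the abstract proposes to treat as a feature.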
Article 98
Title@2025-06-27 (5): SiPipe: Bridging the CPU-GPU Utilization Gap for Efficient Pipeline-Parallel LLM Inference
Title: SiPipe: Bridging the CPU-GPU Utilization Gap for Efficient Pipeline-Parallel LLM Inference | SiPipe: Überbrückung der CPU-GPU-Utilisationslücke für effiziente Pipeline-Parallel-LLM-Inferenz | SiPipe:弥合CPU-GPU利用差距,提高管道-Parallel LLM 推理效率 2506.22033v1 |
Authors (3): Yongchao He, Bohan Zhao, Zheng Cao
As inference workloads for large language models (LLMs) scale to meet growing user demand, pipeline parallelism (PP) has become a widely adopted strategy for multi-GPU deployment, particularly in cross-node setups, to improve key-value (KV) cache capacity and inference throughput. However, PP suffers from inherent inefficiencies caused by three types of execution bubbles (load-imbalance, intra-stage, and inter-stage), which limit pipeline saturation. We present SiPipe, a heterogeneous pipeline design that improves throughput by leveraging underutilized CPU resources to offload auxiliary computation and communication. SiPipe incorporates three key techniques (CPU sampling, a token-safe execution model, and structure-aware transmission) to mitigate pipeline bubbles and improve execution efficiency. Across diverse LLMs, SiPipe achieves up to 2.1 times higher throughput, 43% lower per-token latency, and up to 23% higher average GPU utilization compared to the state-of-the-art vLLM under the same PP configuration, demonstrating its generality across LLMs and deployment scenarios.
作为大型语言模型(LLMS)规模的推论工作量,以满足用户不断增长的需求,管道平行(PP)已成为一项广泛采用的多GPU部署战略,特别是在交叉节点设置方面,目的是提高关键值(KV)缓存能力和推导量,然而,由于三种类型的执行泡沫-负载平衡、阶段内和限制管道饱和的阶段间,PPPIPe的固有效率低下。我们介绍了SiPipe,一种通过利用未充分利用的CPU资源卸载辅助计算和通信来改进吞吐的混合管道设计。SiPipe采用了三种关键技术-CPU取样、一种象征性安全执行模式和结构通识传输,以减轻管道泡沫并提高执行效率。在不同的LMs,SiPipe实现了高达2.1倍的高流量,43%的人均粘合度比同一PP配置下的最新工艺水平VLLM平均利用率高23%,展示了整个LMSLM和部署情景的一般性。
Article 99
Title@2025-06-27 (5): Programming Distributed Collective Processes in the eXchange Calculus
Title: Programming Distributed Collective Processes in the eXchange Calculus | Programmierung verteilter kollektiver Prozesse im eXchange Calculus | eXchange Calculus 中的程序编程分配集体进程 2401.11212v4 |
Authors (5): Giorgio Audrito, Roberto Casadei, Ferruccio Damiani, Gianluca Torta, Mirko Viroli
Recent trends like the Internet of Things (IoT) suggest a vision of dense and multi-scale deployments of computing devices in nearly all kinds of environments. A prominent engineering challenge revolves around programming the collective adaptive behaviour of such computational ecosystems. This requires abstractions able to capture concepts like ensembles (dynamic groups of cooperating devices) and collective tasks (joint activities carried out by ensembles). In this work, we consider collections of devices that interact with neighbours and execute in nearly-synchronised sense-compute-interact rounds, where the computation is given by a single program mapping sensing values and incoming messages to output and outgoing messages. To support programming whole computational collectives, we propose the abstraction of a distributed collective process, which can be used to define at once the ensemble formation logic and its collective task. We formalise the abstraction in the eXchange Calculus (XC), a core functional language based on neighbouring values (maps from neighbours to values) where state and interaction are handled through a single primitive, exchange, and provide a corresponding implementation in the FCPP language. Then, we exercise distributed collective processes using two case studies: multi-hop message propagation and distributed monitoring of spatial properties. Finally, we discuss the features of the abstraction and its suitability for different kinds of distributed computing applications.
在这项工作中,我们考虑与邻居发生互动的装置的集成,这些装置以近同步的感知和计算互动周期执行,计算方法是由一个单一程序绘制感测值和发送信息到输出和流出信息。为了支持整个计算集体的编程,我们提议一个分布式集体过程的抽象化,这个过程可以用来立即界定共性形成逻辑及其集体任务。我们把电子Xchange Calculus(XC)中的抽象化,这是一个基于相邻价值的核心功能语言(从邻居到价值观的图解),通过单一原始、交换处理国家和互动,并在FCPP语言中提供相应的执行。最后,我们利用两种案例研究,进行分布式集成的集体进程,并传播各种空间信息。最后,我们用两种案例研究的形式,进行集体分布式的数学特性。我们用两种案例研究来传播其空间信息。最后,我们用两种案例研究来传播空间信息。
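The round-based execution model described in this abstract, in which one program maps sensed values and incoming neighbour messages to outgoing messages, can be sketched with a hop-count computation, a common building block for multi-hop propagation. This is a plain Python simulation under the assumption of fully synchronous rounds on a fixed topology; it does not use XC or FCPP syntax, and all names are hypothetical.

```python
import math


def run_round(neighbours, is_source, received):
    """One synchronous round: every device runs the same program.

    `received[d]` is the value device d sent in the previous round.
    A source reports 0 hops; any other device reports one more than
    the smallest value received from its neighbours.
    """
    sent = {}
    for device in neighbours:
        if is_source[device]:
            sent[device] = 0
        else:
            incoming = [received[n] for n in neighbours[device]]
            sent[device] = min(incoming, default=math.inf) + 1
    return sent


# Hypothetical line topology a - b - c - d with "a" as the source.
neighbours = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
is_source = {"a": True, "b": False, "c": False, "d": False}

state = {device: math.inf for device in neighbours}
for _ in range(4):
    state = run_round(neighbours, is_source, state)
print(state)
```

Each device only ever reads its neighbours' last messages, yet after enough rounds the collective converges to the global hop distance from the source.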
Article 100
Title@2025-06-27 (5): A Survey on Federated Fine-tuning of Large Language Models
Title: A Survey on Federated Fine-tuning of Large Language Models | Eine Umfrage über Federated Fine-Tuning von großen Sprachmodellen | 大语言模式联邦微调调查 2503.12016v2 |
Authors (10): Yebo Wu, Chunlin Tian, Jingguang Li, He Sun, Kahou Tam, Zhanting Zhou, Haicheng Liao, Zhijiang Guo, Li Li, Chengzhong Xu
Large Language Models (LLMs) have demonstrated impressive success across various tasks. Integrating LLMs with Federated Learning (FL), a paradigm known as FedLLM, offers a promising avenue for collaborative model adaptation while preserving data privacy. This survey provides a systematic and comprehensive review of FedLLM. We begin by tracing the historical development of both LLMs and FL, summarizing relevant prior research to set the context. Subsequently, we delve into an in-depth analysis of the fundamental challenges inherent in deploying FedLLM. Addressing these challenges often requires efficient adaptation strategies; therefore, we conduct an extensive examination of existing Parameter-Efficient Fine-tuning (PEFT) methods and explore their applicability within the FL framework. To rigorously evaluate the performance of FedLLM, we undertake a thorough review of existing fine-tuning datasets and evaluation benchmarks. Furthermore, we discuss FedLLM’s diverse real-world applications across multiple domains. Finally, we identify critical open challenges and outline promising research directions to foster future advancements in FedLLM. This survey aims to serve as a foundational resource for researchers and practitioners, offering valuable insights into the rapidly evolving landscape of federated fine-tuning for LLMs. It also establishes a roadmap for future innovations in privacy-preserving AI. We actively maintain a GitHub repository at https://github.com/Clin0212/Awesome-Federated-LLM-Learning to track cutting-edge advancements in this field.
大型语言模型(LLMs)在各种任务中表现出了令人印象深刻的成功。将LLM与FedLLM(FFL)模式(FedLLM)相结合,为合作模式调整提供了一个充满希望的渠道,同时保护数据隐私。这项调查对FedLLM(FedLM)的绩效进行了系统和全面的审查。我们首先跟踪LM(FedLM)和FL(FL)的历史发展,总结了相关研究前的相关背景。随后,我们深入深入分析部署FedLLM12(FendLLM12)所固有的基本挑战。应对这些挑战往往需要有效的适应战略。因此,我们广泛审视现有的Pater-Effective Feal-Reformation(PEFFT)方法,并探索这些方法在FLLM框架内的适用性适用性。为了严格评估FedLLM(FLLM)的业绩,我们对现有的微调数据集和评价基准进行了彻底的审查。此外,我们讨论了FedLLM的多种现实应用。最后,我们查明了关键的公开挑战,并概述了研究方向,以促进FedLLLLLMM(F)M(OLO)公司未来进步)。
Article 101
Title@2025-06-27 (5): Generative AI for Software Architecture. Applications, Challenges, and Future Directions
Title: Generative AI for Software Architecture. Applications, Challenges, and Future Directions | Generative KI für Softwarearchitektur. Anwendungen, Herausforderungen und Zukunftsrichtungen | A. 软件结构的生成AI 应用、挑战和未来方向 2503.13310v2 |
Authors (8): Matteo Esposito, Xiaozhou Li, Sergio Moreschini, Noman Ahmad, Tomas Cerny, Karthik Vaidhyanathan, Valentina Lenarduzzi, Davide Taibi
Context: Generative Artificial Intelligence (GenAI) is transforming much of software development, yet its application in software architecture is still in its infancy, and no prior study has systematically addressed the topic. Aim: We aim to systematically synthesize the use, rationale, contexts, usability, and future challenges of GenAI in software architecture. Method: We performed a multivocal literature review (MLR), analyzing peer-reviewed and gray literature, identifying current practices, models, adoption contexts, and reported challenges, extracting themes via open coding. Results: Our review identified significant adoption of GenAI for architectural decision support and architectural reconstruction. OpenAI GPT models are predominantly applied, and there is consistent use of techniques such as few-shot prompting and retrieval-augmented generation (RAG). GenAI has been applied mostly to initial stages of the Software Development Life Cycle (SDLC), such as Requirements-to-Architecture and Architecture-to-Code. Monolithic and microservice architectures were the dominant targets. However, rigorous testing of GenAI outputs was typically missing from the studies. Among the most frequent challenges are model precision, hallucinations, ethical aspects, privacy issues, lack of architecture-specific datasets, and the absence of sound evaluation frameworks. Conclusions: GenAI shows significant potential in software design, but several challenges remain on its path to greater adoption. Research efforts should target designing general evaluation methodologies, handling ethics and precision, increasing transparency and explainability, and promoting architecture-specific datasets and benchmarks to bridge the gap between theoretical possibilities and practical use.
• 目标:我们的目标是系统地综合GENAI在软件结构中的使用、理由、背景、可用性和未来挑战。方法:我们进行了多语言文献审查,分析同行评审和灰色文献,查明当前做法、模式、采用背景和所报告的挑战,通过公开编码提取主题。结果:我们的审查发现GENAI在建筑决策支持和建筑重建方面大量采用GENAI产出的测试。OpenAI GPT模型主要应用,并且始终使用各种技术,例如少发的提示和检索的一代(RAGG)。 GENAI主要应用于软件发展生命周期的初始阶段,如设计要求和结构到规则等。
Article 102
Title@2025-06-27 (5): AeroDaaS: Towards an Application Programming Framework for Drones-as-a-Service
Title: AeroDaaS: Towards an Application Programming Framework for Drones-as-a-Service | AeroDaaS: Auf dem Weg zu einem Anwendungsprogrammierungsrahmen für Drohnen-as-a-Service | AeroDaaS:努力为作为服务对象的无人机制定应用方案框架 2504.03802v2 |
Authors (4): Suman Raj, Rajdeep Singh, Kautuk Astu, Yogesh Simmhan
The increasing adoption of UAVs with advanced sensors and GPU-accelerated edge computing has enabled real-time AI-driven applications in fields such as precision agriculture, wildfire monitoring, and environmental conservation. However, integrating deep learning on UAVs remains challenging due to platform heterogeneity, real-time constraints, and the need for seamless cloud-edge coordination. To address these challenges, we introduce AeroDaaS, a service-oriented framework that abstracts UAV-based sensing complexities and provides a Drone-as-a-Service (DaaS) model for intelligent decision-making. AeroDaaS offers modular service primitives for on-demand UAV sensing, navigation, and analytics as composable microservices, ensuring cross-platform compatibility and scalability across heterogeneous UAV and edge-cloud infrastructures. We implement and evaluate a preliminary version of AeroDaaS for two real-world DaaS applications. We require <=40 lines of code for the applications and see minimal platform overhead of <=20 ms per frame and <=0.5 GB memory usage on Orin Nano. These early results are promising for AeroDaaS as an efficient, flexible and scalable UAV programming framework for autonomous aerial analytics.
配备先进传感器和GPU加速边缘计算的无人机(UAV)日益普及,使精准农业、野火监测和环境保护等领域的实时AI驱动应用成为可能。然而,由于平台异构性、实时性约束以及对无缝云边协同的需求,在无人机上集成深度学习仍然具有挑战性。为应对这些挑战,我们提出了AeroDaaS,这是一个面向服务的框架,它抽象了基于无人机的感知复杂性,并为智能决策提供无人机即服务(DaaS)模型。AeroDaaS以可组合微服务的形式提供用于按需无人机感知、导航和分析的模块化服务原语,确保在异构无人机和边云基础设施上的跨平台兼容性和可扩展性。我们针对两个真实的DaaS应用实现并评估了AeroDaaS的初步版本。这些应用所需代码不超过40行,在Orin Nano上每帧平台开销不超过20毫秒,内存占用不超过0.5 GB。这些早期结果表明,AeroDaaS有望成为面向自主空中分析的高效、灵活且可扩展的无人机编程框架。
Article 103
Title@2025-06-27 (5): Enabling Bitcoin Smart Contracts on the Internet Computer
Title: Enabling Bitcoin Smart Contracts on the Internet Computer | Ermöglichung von Bitcoin Smart Contracts auf dem Internet-Computer | 使因特网计算机上比特币智能合同成为可能 2506.21327v2 |
Authors (4): Ryan Croote, Islam El-Ashi, Thomas Locher, Yvonne-Anne Pignolet
There is growing interest in providing programmatic access to the value locked in Bitcoin, which famously offers limited programmability itself. Various approaches have been put forth in recent years, with the vast majority of proposed mechanisms either building new functionality on top of Bitcoin or leveraging a bridging mechanism to enable smart contracts that make use of "wrapped" bitcoins on entirely different platforms. In this work, an architecture is presented that follows a different approach. The architecture enables the execution of Turing-complete Bitcoin smart contracts on the Internet Computer (IC), a blockchain platform for hosting and executing decentralized applications. Instead of using a bridge, IC and Bitcoin nodes interact directly, eliminating potential security risks that the use of a bridge entails. This integration requires novel concepts, in particular to reconcile the probabilistic nature of Bitcoin with the irreversibility of finalized state changes on the IC, which may be of independent interest. In addition to the presentation of the architecture, we provide evaluation results based on measurements of the Bitcoin integration running on mainnet. The evaluation results demonstrate that, with finalization in a few seconds and low execution costs, this integration enables complex Bitcoin-based decentralized applications that were not practically feasible or economically viable before.
Bitcoin公司有著越来越多的兴趣,提供方案访问,以获得Bitcoin公司所锁定的价值。Bitcoin公司有名有姓地提供了有限的程序性。近年来,提出了各种办法,绝大多数拟议机制要么在Bitcoin公司之上建立新的功能,要么利用一个桥梁机制,使智能合同能够在完全不同的平台上使用“已包装的”比特币。在这项工作中,提出了一种采用不同方法的架构。该架构使图灵-完成的Bitcoin智能合同得以在互联网计算机(IC)上执行,这是一个托管和实施分散应用的链链平台。除了使用桥梁、IC和Bitcoin节点进行直接互动外,消除使用桥梁可能带来的潜在安全风险。这种整合需要新的概念,特别是调和Bitcoin公司最终确定的国家变化的概率性,后者可能具有独立的兴趣。除了该架构外,我们还根据主网上Bitcoin公司一体化的测量结果提供评价结果。评价结果表明,由于在几秒钟和低执行成本的情况下最终确定,这种整合在实际上是无法实现的。
Article 104
Title@2025-06-26 (4): Benchmarking and Parallelization of Electrostatic Particle-In-Cell for low-temperature Plasma Simulation by particle-thread Binding
Title: Benchmarking and Parallelization of Electrostatic Particle-In-Cell for low-temperature Plasma Simulation by particle-thread Binding | Benchmarking und Parallelisierung elektrostatischer Partikel-In-Zellen für Niedertemperatur-Plasmasimulation durch Partikel-Thread-Bindung | 低温等温等同等量定基准和静电粒子细胞中电静电粒子细胞平行化 2506.21524v1 |
Authors (4): Libn Varghese, Bhaskar Chaudhury, Miral Shah, Mainak Bandyopadhyay
The Particle-In-Cell (PIC) method for plasma simulation tracks particle phase space information using particle and grid data structures. High computational costs in 2D and 3D device-scale PIC simulations necessitate parallelization, with the Charge Deposition (CD) subroutine often becoming a bottleneck due to frequent particle-grid interactions. Conventional methods mitigate dependencies by generating private grids for each core, but this approach faces scalability issues. We propose a novel approach based on a particle-thread binding strategy that requires only four private grids per node in distributed memory systems or four private grids in shared memory systems, enhancing CD scalability and performance while maintaining conventional data structures and requiring minimal changes to existing PIC codes. This method ensures complete accessibility of grid data structure for concurrent threads and avoids simultaneous access to particles within the same cell using additional functions and flags. Performance evaluations using a PIC benchmark for low-temperature partially magnetized E x B discharge simulation on a shared memory as well as a distributed memory system (1000 cores) demonstrate the method’s scalability, and additionally, we show the method has little hardware dependency.
利用粒子和网格数据结构进行等离子模拟粒子相位空间信息的粒子内Cell(PIC)方法。 2D 和 3D 设备级的PIC 模拟的高计算成本要求平行化,由于经常的粒子电网相互作用,充电沉降(CD)子路由往往成为瓶颈。常规方法通过为每个核心生成私人网格来减轻依赖性,但这种方法面临可缩放问题。我们提议基于粒子-粒子模拟利用粒子和网格数据结构的粒子模拟颗粒相位空间信息的新办法,它只需要分布式内存系统中的4个私人网格或共享内存系统中的4个私人网格,提高CD的可缩放性和性,同时维持传统的数据结构,要求对现有PIC代码作出最低限度的改动。这种方法确保同时使用同步线条,避免使用额外的功能和旗帜同时进入同一单元格内的颗粒子。我们建议采用对低温部分磁化的E x B 排放模拟进行业绩评估,该方法在共享内存和分布内存系统(1 000 核心) 显示该方法的可伸缩性,此外,我们显示该方法几乎没有硬性。
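The conventional private-grid strategy that this abstract contrasts itself with can be sketched in a few lines. The example below shows only that baseline: each worker deposits charge into its own grid and the grids are summed afterwards. It assumes a 1D grid with linear (cloud-in-cell) weighting and simulates workers sequentially; the paper's particle-thread binding scheme with four private grids is not reproduced here, and all names are hypothetical.

```python
import math


def deposit(positions, charge, n_nodes):
    """Linear-weighting charge deposition of particles onto a 1D grid."""
    grid = [0.0] * n_nodes
    for x in positions:
        i = math.floor(x)          # index of the grid node to the left
        frac = x - i               # distance from that node, in cell units
        grid[i] += charge * (1.0 - frac)
        grid[i + 1] += charge * frac
    return grid


def deposit_with_private_grids(positions, charge, n_nodes, n_workers):
    """Baseline scheme: one private grid per worker, followed by a reduction.

    Workers are simulated one after another; in a threaded code each slice
    would be handled concurrently without write conflicts.
    """
    private = [
        deposit(positions[w::n_workers], charge, n_nodes)
        for w in range(n_workers)
    ]
    return [sum(column) for column in zip(*private)]


positions = [0.5, 1.25, 1.75, 2.5, 3.0]   # particle positions in cell units
serial = deposit(positions, 1.0, n_nodes=5)
reduced = deposit_with_private_grids(positions, 1.0, n_nodes=5, n_workers=2)
print(serial, reduced)
```

The memory cost of this baseline grows with the number of workers, since every worker holds a full copy of the grid, which is the scalability issue the abstract refers to.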
Article 105
Title@2025-06-26 (4): Efficient and Reuseable Cloud Configuration Search Using Discovery Spaces
Title: Efficient and Reuseable Cloud Configuration Search Using Discovery Spaces | Effiziente und wiederverwendbare Cloud-Konfiguration Suche mit Discovery Spaces | 利用发现空间进行高效和可再利用的云层配置搜索 2506.21467v1 |
Authors (7): Michael Johnston, Burkhard Ringlein, Christoph Hagleitner, Alessandro Pomponio, Vassilis Vassiliadis, Christian Pinto, Srikumar Venugopal
Finding the optimal set of cloud resources to deploy a given workload at minimal cost while meeting a defined service level agreement is an active area of research. Combining tens of parameters applicable across a large selection of compute, storage, and services offered by cloud providers with similar numbers of application-specific parameters leads to configuration spaces with millions of deployment options. In this paper, we propose Discovery Space, an abstraction that formalizes the description of workload configuration problems and exhibits a set of characteristics required for structured, robust and distributed investigations of large search spaces. We describe a concrete implementation of the Discovery Space abstraction and show that it is generalizable across a diverse set of workloads such as Large Language Model inference and Big Data Analytics. We demonstrate that our approach enables safe, transparent sharing of data between executions of best-of-breed optimizers, increasing the efficiency of optimal configuration detection in large search spaces. We also demonstrate how Discovery Spaces enable transfer and reuse of knowledge across similar search spaces, enabling configuration search speed-ups of over 90%.
找到最佳的云层资源来以最低的成本部署特定的工作量,同时满足规定的服务水平协议是一个活跃的研究领域。 将大量选择的计算、储存和服务中所适用的数十项参数结合起来,由具有类似应用参数的云层提供者提供大量选择的计算、储存和服务,导致配置空间,使用数百万个部署选项。 在本文件中,我们提出探索空间,这是一个将工作量配置问题描述正式化的抽象概念,并展示了对大型搜索空间进行结构化、稳健和分布式调查所需的一系列特征。我们描述了探索空间抽象的切实执行情况,并表明它广泛适用于诸如大语言模型推断和大数据分析等一系列不同的工作量。我们证明,我们的方法能够安全、透明地共享最佳优化剂执行之间的数据,提高在大型搜索空间进行最佳配置探测的效率。我们还演示了探索空间如何在类似搜索空间进行知识的转移和再利用,使配置搜索速度超过90%。
Article 106
Title@2025-06-26 (4): exa-AMD: A Scalable Workflow for Accelerating AI-Assisted Materials Discovery and Design
Title: exa-AMD: A Scalable Workflow for Accelerating AI-Assisted Materials Discovery and Design | exa-AMD: Ein skalierbarer Workflow zur Beschleunigung der Entdeckung und des Designs von KI-Assistenten | Exa-AMD:加速使用AI辅助材料发现和设计的一个可缩放工作流程 2506.21449v1 |
Authors (7): Maxim Moraru, Weiyi Xia, Zhuo Ye, Feng Zhang, Yongxin Yao, Ying Wai Li, Cai-Zhuang Wang
exa-AMD is a Python-based application designed to accelerate the discovery and design of functional materials by integrating AI/ML tools, materials databases, and quantum mechanical calculations into scalable, high-performance workflows. The execution model of exa-AMD relies on Parsl, a task-parallel programming library that enables a flexible execution of tasks on any computing resource from laptops to supercomputers. By using Parsl, exa-AMD is able to decouple the workflow logic from execution configuration, thereby empowering researchers to scale their workflows without having to reimplement them for each system.
Exa-AMD是一个基于Python的应用程序,旨在通过将AI/ML工具、材料数据库和量子机械计算纳入可缩放的高性能工作流程,加速发现和设计功能性材料。 Exa-AMD的执行模式依赖于Parsl,这是一个任务平行编程库,可以灵活地执行从膝上型计算机到超级计算机的任何计算资源的任务。 通过使用Parsl,Exa-AMD能够将工作流程逻辑与执行配置脱钩,从而使研究人员不必为每个系统重新实施工作流程,从而能够扩大工作流程的规模。
Article 107
Title@2025-06-26 (4): Carbon-Aware Microservice Deployment for Optimal User Experience on a Budget
Title: Carbon-Aware Microservice Deployment for Optimal User Experience on a Budget | Carbon-Aware Microservice Bereitstellung für eine optimale Benutzererfahrung auf einem Budget | 为最佳预算用户提供最佳预算用户经验的碳软件微型服务部署 2506.21422v1 |
Authors (3): Kevin Kreutz, Philipp Wiesner, Monica Vitali
The carbon footprint of data centers has recently become a critical concern. So far, most carbon-aware strategies have focused on leveraging the flexibility of scheduling decisions for batch processing by shifting the time and location of workload executions. However, such approaches cannot be applied to service-oriented cloud applications, since they have to be reachable at every point in time and often at low latencies. We propose a carbon-aware approach for operating microservices under hourly carbon budgets. By choosing the most appropriate version and horizontal scaleout for each microservice, our strategy maximizes user experience and revenue while staying within budget constraints. Experiments across various application configurations and carbon budgets demonstrate that the approach adapts properly to changing workloads and carbon intensities.
数据中心的碳足迹近来已成为一个关键问题。迄今为止,大多数碳感知策略都侧重于通过调整工作负载执行的时间和地点,利用批处理调度决策的灵活性。然而,这类方法无法应用于面向服务的云应用,因为这些应用必须随时可达,而且通常要求低延迟。我们提出了一种在每小时碳预算下运行微服务的碳感知方法。通过为每个微服务选择最合适的版本和水平扩展规模,我们的策略在不超出预算约束的前提下最大化用户体验和收入。针对多种应用配置和碳预算的实验表明,该方法能够正确适应不断变化的工作负载和碳强度。
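A minimal sketch of the selection problem described above: pick one version per microservice so that total utility is maximal while one hour's emissions stay within a carbon budget. The cost model (replicas derived from demand, emissions from power draw times grid carbon intensity), the brute-force search, and all names and numbers are assumptions made for illustration; the abstract does not specify the authors' formulation.

```python
import math
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Version:
    name: str
    utility: float            # hypothetical user-experience score
    rps_per_replica: float    # requests/s one replica can serve
    watts_per_replica: float  # average power draw of one replica


def emissions_g(version, demand_rps, intensity_g_per_kwh):
    """Grams of CO2 for serving `demand_rps` for one hour with this version."""
    replicas = math.ceil(demand_rps / version.rps_per_replica)  # horizontal scale-out
    kwh = replicas * version.watts_per_replica / 1000.0         # energy over one hour
    return kwh * intensity_g_per_kwh


def plan_hour(services, demand_rps, intensity_g_per_kwh, budget_g):
    """Exhaustively try one version per service; return (utility, choice) or None."""
    names = list(services)
    best = None
    for combo in product(*(services[n] for n in names)):
        total = sum(emissions_g(v, demand_rps, intensity_g_per_kwh) for v in combo)
        if total > budget_g:
            continue  # violates the hourly carbon budget
        utility = sum(v.utility for v in combo)
        if best is None or utility > best[0]:
            best = (utility, dict(zip(names, (v.name for v in combo))))
    return best


SERVICES = {
    "search": [Version("full", 1.0, 50, 100), Version("lite", 0.6, 100, 60)],
    "recs": [Version("ml", 1.0, 20, 150), Version("static", 0.3, 200, 40)],
}
```

With a tight budget the planner falls back to lighter versions, and it returns `None` when no combination fits. Exhaustive search is only practical for a handful of services; a real system would need a scalable optimizer.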
Article 108
Title@2025-06-26 (4): Exploring Micro Frontends: A Case Study Application in E-Commerce
Title: Exploring Micro Frontends: A Case Study Application in E-Commerce | Erforschung von Micro Frontends: Eine Anwendungsfallstudie im E-Commerce | 探索微观前沿:电子商务案例研究应用 2506.21297v1 |
Authors (5): Ricardo Hideki Hangai Kojo, Luiz Fernando Corte Real, Renato Cordeiro Ferreira, Thatiane de Oliveira Rosa, Alfredo Goldman
In the micro frontends architectural style, the frontend is divided into smaller components, which can range from a simple button to an entire page. The goal is to improve scalability, resilience, and team independence, albeit at the cost of increased complexity and infrastructure demands. This paper seeks to understand when it is worth adopting micro frontends, particularly in the context of industry. To achieve this, we conducted an investigation into the state of the art of micro frontends, based on both academic and gray literature. We then implemented this architectural style in a marketplace for handcrafted products, which already used microservices. Finally, we evaluated the implementation through a semi-open questionnaire with the developers. At the studied marketplace company, the need for architectural change arose due to the tight coupling between their main system (a Java monolith) and a dedicated frontend system. Additionally, there were deprecated technologies and poor developer experience. To address these issues, the micro frontends architecture was adopted, along with the API Gateway and Backend for Frontend patterns, and technologies such as Svelte and Fastify. Although the adoption of Micro Frontends was successful, it was not strictly necessary to meet the company’s needs. According to the analysis of the mixed questionnaire responses, other alternatives, such as a monolithic frontend, could have achieved comparable results. What made adopting micro frontends the most convenient choice in the company’s context was the monolith strangulation and microservices adoption, which facilitated implementation through infrastructure reuse and knowledge sharing between teams.
在微前端(micro frontends)架构风格中,前端被划分为较小的组件,小到一个简单的按钮,大到整个页面。其目标是提高可扩展性、韧性和团队独立性,但代价是复杂性和基础设施需求的增加。本文试图了解何时值得采用微前端,特别是在工业界的背景下。为此,我们基于学术文献和灰色文献,对微前端的研究现状进行了调查。随后,我们在一个已经使用微服务的手工艺品交易平台中实现了这种架构风格。最后,我们通过面向开发人员的半开放式问卷对实现进行了评估。在所研究的交易平台公司中,架构变更的需求源于其主系统(一个 Java 单体)与专用前端系统之间的紧耦合。此外,还存在技术过时和开发者体验不佳的问题。为解决这些问题,公司采用了微前端架构,并结合 API Gateway 和 Backend for Frontend 模式,以及 Svelte 和 Fastify 等技术。尽管微前端的采用是成功的,但它并非满足公司需求所严格必需的。根据对问卷回答的分析,其他方案(例如单体前端)也可以取得相近的结果。在该公司的背景下,使微前端成为最便利选择的是单体绞杀(monolith strangulation)和微服务的采用,它们通过基础设施复用和团队间的知识共享促进了实施。
Article 109
Title@2025-06-26 (4): Balancing Privacy, Robustness, and Efficiency in Machine Learning
Title: Balancing Privacy, Robustness, and Efficiency in Machine Learning | Ausbalancierende Privatsphäre, Robustheit und Effizienz im maschinellen Lernen | 平衡隐私、强健和机器学习效率 2312.14712v3 |
Authors (3): Youssef Allouah, Rachid Guerraoui, John Stephan
This position paper argues that achieving robustness, privacy, and efficiency simultaneously in machine learning systems is infeasible under prevailing threat models. The tension between these goals arises not from algorithmic shortcomings but from structural limitations imposed by worst-case adversarial assumptions. We advocate for a systematic research agenda aimed at formalizing the robustness-privacy-efficiency trilemma, exploring how principled relaxations of threat models can unlock better trade-offs, and designing benchmarks that expose rather than obscure the compromises made. By shifting focus from aspirational universal guarantees to context-aware system design, the machine learning community can build models that are truly appropriate for real-world deployment.
本立场文件认为,在现行威胁模式下,在机器学习系统中同时实现稳健性、隐私和效率是行不通的。 这些目标之间的紧张关系并非源于算法缺陷,而是由于最坏的对抗性假设造成的结构性限制。 我们主张系统研究议程,旨在正式确定稳健性-隐私-效率三重力,探索威胁模式的原则性放松如何能够打开更好的权衡,并设计暴露而不是掩盖妥协的基准。 通过将重点从抱负性普遍保障转向环境意识系统设计,机器学习界可以构建真正适合现实世界部署的模式。
Article 110
Title@2025-06-26 (4): The Autonomy of the Lightning Network: A Mathematical and Economic Proof of Structural Decoupling from BTC
Title: The Autonomy of the Lightning Network: A Mathematical and Economic Proof of Structural Decoupling from BTC | Die Autonomie des Blitznetzes: Ein mathematischer und wirtschaftlicher Beweis der strukturellen Entkopplung von BTC | 闪电网络的自主性:结构脱钩与BTC的数学和经济证明 2506.19333v2 |
Authors (1): Craig Steven Wright
This paper presents a formal analysis of the Lightning Network as a monetary system structurally diverging from Bitcoin’s base-layer settlement model. We demonstrate that under increasing transaction demand, BTC transaction fees rise superlinearly due to throughput constraints, while Lightning Network routing costs approach a bounded asymptote. Using mathematical modeling, game-theoretic proofs, and complexity analysis, we show that Lightning enables indefinite off-chain operation via the emergence of liquidity hub oligopolies. These hubs exhibit properties of unregulated financial intermediaries, including rent extraction, opacity, and systemic fragility. Strategic agent models show that channel closure becomes economically infeasible, and routing problems approach hardness limits in P-Space complexity. We conclude that Lightning does not merely extend Bitcoin, but constitutes a synthetic financial system with shadowbank characteristics, lacking reserve discipline, transparency, or enforceable settlement guarantees.
本文对闪电网络作为一个与比特币基础结算模式结构不同的货币系统进行了正式分析。 我们证明,在交易需求不断增长的情况下,BTC交易费由于吞吐限制而出现超直线性上升,而Lightning网络路由成本则接近于一连串的无足轻重。 我们用数学模型、游戏理论证据和复杂分析来证明,Lightning通过流动资金中心寡头政治的出现,可以无限期地进行离链操作。 这些中心展示了不受管制的金融中介的特性,包括租金提取、不透明和系统脆弱性。 战略代理模型表明,关闭通道在经济上变得不可行,而路由问题则接近P-Space复杂程度的硬性限度。 我们的结论是,Lightning不仅仅是扩展Bitcoin,而是构成一个具有影子银行特征、缺乏储备纪律、透明度或可强制执行的结算担保的合成金融体系。
Article 111
Title@2025-06-26 (4): Bridding OT and PaaS in Edge-to-Cloud Continuum
Title: Bridding OT and PaaS in Edge-to-Cloud Continuum | Bridding OT und PaaS im Edge-to-Cloud Continuum | 在边缘到云连续体中桥接 OT 与 PaaS 2506.21072v1 |
Authors (2): Carlos J Barrios, Yves Denneulin
The Operational Technology Platform as a Service (OTPaaS) initiative provides a structured framework for the efficient management and storage of data. It ensures excellent response times while improving security, reliability, data and technology sovereignty, robustness, and energy efficiency, which are crucial for industrial transformation and data sovereignty. This paper illustrates successful deployment, adaptable application management, and various integration components catering to Edge and Cloud environments. It leverages the advantages of the Platform as a Service model and highlights key challenges that have been addressed for specific use cases.
运行技术平台是一个服务(OTPaaS)倡议,为有效管理和储存数据提供了一个结构化框架,它确保了极好的反应时间,同时改进了安全、可靠性、数据和技术主权、稳健性和能源效率,这些对于工业转型和数据主权至关重要,本文件介绍了成功的部署、适应性应用管理以及适合边缘和云层环境的各种整合组成部分,利用平台作为服务模式的优势,并着重指出了为具体使用案例处理的主要挑战。
Article 112
Title@2025-06-26 (4): An Information-Theoretic Analysis for Federated Learning under Concept Drift
Title: An Information-Theoretic Analysis for Federated Learning under Concept Drift | Eine informationstheoretische Analyse für Federated Learning unter Concept Drift | 概念漂移下联邦学习的信息论分析 2506.21036v1 |
Authors (3): Fu Peng, Meng Zhang, Ming Tang
Recent studies in federated learning (FL) commonly train models on static datasets. However, real-world data often arrives as streams with shifting distributions, causing performance degradation known as concept drift. This paper analyzes FL performance under concept drift using information theory and proposes an algorithm to mitigate the performance degradation. We model concept drift as a Markov chain and introduce the Stationary Generalization Error to assess a model's capability to capture characteristics of future unseen data. Its upper bound is derived using KL divergence and mutual information. We study three drift patterns (periodic, gradual, and random) and their impact on FL performance. Inspired by this, we propose an algorithm that regularizes the empirical risk minimization approach with KL divergence and mutual information, thereby enhancing long-term performance. We also explore the performance-cost tradeoff by identifying a Pareto front. To validate our approach, we build an FL testbed using Raspberry Pi 4 devices. Experimental results corroborate the theoretical findings, confirming that drift patterns significantly affect performance. Our method consistently outperforms existing approaches for these three patterns, demonstrating its effectiveness in adapting to concept drift in FL.
联邦学习(FL)的近期研究通常对静态数据集进行训练。然而,真实世界数据通常以流体流形式出现,分布变化,导致性能退化,称为概念流,本文利用信息理论分析概念流的FL性能,并提出减少性能退化的算法。我们将漂浮概念建为Markov链,并引入“标准通用错误”来评估模型捕捉未来不可见数据特性的能力。它的上界是利用KL差异和相互信息得出的。我们研究了三种漂流模式(周期、渐进和随机)及其对FL性能的影响。我们为此提出一种算法,将实验风险最小化方法与KL差异和相互信息规范起来,从而提高长期性能。我们还通过确定Pareto前台来探索性能成本权衡。为了验证我们的方法,我们用 Raspberry Pi4 设备建立了一个FL 测试台。实验结果与理论结论相证实,证实了这种漂移模式对性能的显著影响。我们的方法始终超越了这三种模式的现有方法,并展示了在漂移法上的概念的有效性。
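To make the "concept drift as a Markov chain" framing concrete, the sketch below treats each chain state as one data distribution and encodes the three drift patterns as transition matrices. The specific matrices, the two-class label distributions, and the function names are illustrative assumptions; this is not the paper's algorithm or its Stationary Generalization Error bound.

```python
import math
import random


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def simulate_drift(transition, start, steps, rng):
    """Sample a path of `steps` states from a Markov chain given as a row-stochastic matrix."""
    state, path = start, [start]
    for _ in range(steps - 1):
        state = rng.choices(range(len(transition)), weights=transition[state])[0]
        path.append(state)
    return path


# Hypothetical label distributions attached to three drift states.
DISTRIBUTIONS = [[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]]

# Illustrative transition matrices for the three drift patterns.
PERIODIC = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]               # deterministic cycle
GRADUAL = [[0.9, 0.1, 0], [0, 0.9, 0.1], [0.1, 0, 0.9]]    # mostly stay, sometimes advance
RANDOM = [[1 / 3] * 3 for _ in range(3)]                   # jump anywhere

if __name__ == "__main__":
    rng = random.Random(0)
    path = simulate_drift(GRADUAL, start=0, steps=20, rng=rng)
    # Distribution shift between the training-time state and each later state.
    shifts = [kl_divergence(DISTRIBUTIONS[s], DISTRIBUTIONS[path[0]]) for s in path]
    print(path)
    print([round(x, 3) for x in shifts])
```

The KL term here only measures how far the current data distribution has moved from the one seen at training time, which gives some intuition for why different drift patterns can affect performance differently.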
Article 113
Title@2025-06-26 (4): BLOCKS: Blockchain-supported Cross-Silo Knowledge Sharing for Efficient LLM Services
Title: BLOCKS: Blockchain-supported Cross-Silo Knowledge Sharing for Efficient LLM Services | BLOCKS: Blockchain-gestützter Cross-Silo-Wissensaustausch für effiziente LLM-Dienste | BLOCKS:面向高效 LLM 服务的区块链支持的跨孤岛知识共享 2506.21033v1 |
Authors (7): Zhaojiacheng Zhou, Hongze Liu, Shijing Yuan, Hanning Zhang, Jiong Lou, Chentao Wu, Jie Li
The hallucination problem of Large Language Models (LLMs) has increasingly drawn attention. Augmenting LLMs with external knowledge is a promising solution to address this issue. However, due to privacy and security concerns, a vast amount of downstream task-related knowledge remains dispersed and isolated across various “silos,” making it difficult to access. To bridge this knowledge gap, we propose a blockchain-based external knowledge framework that coordinates multiple knowledge silos to provide reliable foundational knowledge for large model retrieval while ensuring data security. Technically, we distill knowledge from local data into prompts and execute transactions and records on the blockchain. Additionally, we introduce a reputation mechanism and cross-validation to ensure knowledge quality and provide incentives for participation. Furthermore, we design a query generation framework that provides a direct API interface for large model retrieval. To evaluate the performance of our proposed framework, we conducted extensive experiments on various knowledge sources. The results demonstrate that the proposed framework achieves efficient LLM service knowledge sharing in blockchain environments.
大语言模型(LLMs)的幻觉问题日益引起人们的注意。增加具有外部知识的LLMs是解决这一问题的一个大有希望的解决办法。然而,由于隐私和安全方面的考虑,大量下游任务相关知识仍然分散在各种“筒仓”中,并被孤立在不同的“筒仓”中,因此难以获取。为了缩小这一知识差距,我们提议了一个基于链条的外部知识框架,以协调多种知识仓,为大型模型检索提供可靠的基础知识,同时确保数据安全。技术上,我们将当地数据的知识提炼为快速数据,并在链链中进行交易和记录。此外,我们引入了一种名声机制和交叉验证,以确保知识质量,并为参与提供激励。此外,我们设计了一个为大型模型检索提供直接的API界面的生成查询框架。为了评估我们拟议框架的绩效,我们在许多知识源上进行了广泛的实验。结果显示,拟议的框架实现了在链锁环境中高效的LM服务知识共享。
Article 114
Title@2025-06-26 (4): Portable High-Performance Kernel Generation for a Computational Fluid Dynamics Code with DaCe
Title: Portable High-Performance Kernel Generation for a Computational Fluid Dynamics Code with DaCe | Tragbare Hochleistungs-Kernel-Generation für einen numerischen Fluid-Dynamik-Code mit DaCe | DaCe 计算流流体动态代码的可携高性能核心生成器 2506.20994v1 |
Authors (4): Måns I. Andersson, Martin Karp, Niclas Jansson, Stefano Markidis
With the emergence of new high-performance computing (HPC) accelerators, such as Nvidia and AMD GPUs, efficiently targeting diverse hardware architectures has become a major challenge for HPC application developers. The increasing hardware diversity in HPC systems often necessitates the development of architecture-specific code, hindering the sustainability of large-scale scientific applications. In this work, we leverage DaCe, a data-centric parallel programming framework, to automate the generation of high-performance kernels. DaCe enables automatic code generation for multicore processors and various accelerators, reducing the burden on developers who would otherwise need to rewrite code for each new architecture. Our study demonstrates DaCe’s capabilities by applying its automatic code generation to a critical computational kernel used in Computational Fluid Dynamics (CFD). Specifically, we focus on Neko, a Fortran-based solver that employs the spectral-element method, which relies on small tensor operations. We detail the formulation of this computational kernel using DaCe’s Stateful Dataflow Multigraph (SDFG) representation and discuss how this approach facilitates high-performance code generation. Additionally, we outline the workflow for seamlessly integrating DaCe’s generated code into the Neko solver. Our results highlight the portability and performance of the generated code across multiple platforms, including Nvidia GH200, Nvidia A100, and AMD MI250X GPUs, with competitive performance results. By demonstrating the potential of automatic code generation, we emphasise the feasibility of using portable solutions to ensure the long-term sustainability of large-scale scientific applications.
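随着 Nvidia 和 AMD GPU 等新型高性能计算(HPC)加速器的出现,高效地面向多样化的硬件架构已成为 HPC 应用开发者面临的一项重大挑战。HPC 系统中日益增加的硬件多样性往往需要开发特定于架构的代码,从而妨碍大规模科学应用的可持续性。在这项工作中,我们利用以数据为中心的并行编程框架 DaCe 来自动生成高性能内核。DaCe 能够为多核处理器和各种加速器自动生成代码,减轻了开发者原本需要为每种新架构重写代码的负担。我们的研究将 DaCe 的自动代码生成应用于计算流体力学(CFD)中的一个关键计算内核,以展示其能力。具体而言,我们关注 Neko,这是一个基于 Fortran、采用谱元法的求解器,该方法依赖于小规模张量运算。我们详细介绍了如何使用 DaCe 的 Stateful Dataflow Multigraph(SDFG)表示来表述该计算内核,并讨论了这种方法如何促进高性能代码的生成。此外,我们概述了将 DaCe 生成的代码无缝集成到 Neko 求解器中的工作流程。我们的结果突出了所生成代码在多个平台上的可移植性和性能,包括 Nvidia GH200、Nvidia A100 和 AMD MI250X GPU,并取得了具有竞争力的性能结果。通过展示自动代码生成的潜力,我们强调了使用可移植方案来确保大规模科学应用长期可持续性的可行性。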
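For readers unfamiliar with the "small tensor operations" mentioned above, the following is a generic sketch of one such operation: applying a small 1D operator matrix along the first axis of an n x n x n block of element data. It is plain Python written for clarity, not Neko's kernel and not DaCe code; the function name and the example data are hypothetical.

```python
def apply_along_first_axis(D, u):
    """Compute out[i][j][k] = sum_l D[i][l] * u[l][j][k] for an n x n operator D
    and an n x n x n block u (nested lists)."""
    n = len(D)
    return [
        [
            [sum(D[i][l] * u[l][j][k] for l in range(n)) for k in range(n)]
            for j in range(n)
        ]
        for i in range(n)
    ]


if __name__ == "__main__":
    u = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
    swap = [[0, 1], [1, 0]]  # toy operator that exchanges the two slices
    print(apply_along_first_axis(swap, u))
```

In a spectral-element code this contraction would be repeated for every element and every direction with a small n, which is why generating efficient, architecture-specific loops for it matters.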
Article 115
Title@2025-06-26 (4): ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks
Title: ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks | ParEval-Repo: Eine Benchmark-Suite zur Bewertung von LLMs mit HPC-Übersetzungsaufgaben auf Repository-Ebene | ParEval-Repo:通过仓库级 HPC 翻译任务评估 LLM 的基准套件 2506.20938v1 |
Authors (4): Joshua H. Davis, Daniel Nichols, Ishan Khillan, Abhinav Bhatele
GPGPU architectures have become significantly more diverse in recent years, which has led to the emergence of a variety of specialized programming models and software stacks to support them. While portable execution models exist, they still require significant developer effort to port to and optimize for different hardware architectures. Recent advances in large language models (LLMs) can help reduce some of this programmer burden. In this paper, we present a novel benchmark and testing framework, ParEval-Repo, which can be used to evaluate the efficacy of LLM-based approaches in automatically translating entire codebases across GPGPU execution models. ParEval-Repo includes several scientific computing and AI mini-applications in a range of programming models and levels of repository complexity. We use ParEval-Repo to evaluate a range of state-of-the-art open-source and commercial LLMs, with both a non-agentic and a top-down agentic approach. We assess code generated by the LLMs and approaches in terms of compilability, functional correctness, categories of build errors, and the cost of translation in terms of the number of inference tokens. Our results demonstrate that LLM translation of scientific applications is feasible for small programs, but difficulties with generating functional build systems and handling cross-file dependencies pose challenges in scaling to larger codebases.
近年来,GPGPU 架构变得十分多样,由此出现了多种专门的编程模型和软件栈来支持它们。虽然存在可移植的执行模型,但要移植到不同的硬件架构并为其优化,仍需要开发者付出大量努力。大语言模型(LLM)的最新进展有助于减轻部分程序员负担。在本文中,我们提出了一个新的基准与测试框架 ParEval-Repo,可用于评估基于 LLM 的方法在 GPGPU 执行模型之间自动翻译整个代码库的效果。ParEval-Repo 包含若干科学计算和 AI 微型应用,涵盖多种编程模型和不同程度的代码仓库复杂度。我们使用 ParEval-Repo 评估了一系列最先进的开源和商业 LLM,同时采用非智能体方法和自上而下的智能体方法。我们从可编译性、功能正确性、构建错误类别以及以推理 token 数量衡量的翻译成本等方面,评估 LLM 及各方法生成的代码。结果表明,LLM 对科学应用的翻译对小型程序是可行的,但生成可用的构建系统以及处理跨文件依赖的困难,给扩展到更大的代码库带来了挑战。