• 00 06-26 (4) Benchmarking and Parallelization of Electrostatic Particle-In-Cell for low-temperature Plasma Simulation by particle-thread Binding Benchmarking und Parallelisierung elektrostatischer Partikel-In-Zellen für Niedertemperatur-Plasmasimulation durch Partikel-Thread-Bindung 低温等温等同等量定基准和静电粒子细胞中电静电粒子细胞平行化 2506.21524v1
  • 01 06-26 Efficient and Reuseable Cloud Configuration Search Using Discovery Spaces Effiziente und wiederverwendbare Cloud-Konfiguration Suche mit Discovery Spaces 利用发现空间进行高效和可再利用的云层配置搜索 2506.21467v1
  • 02 06-26 exa-AMD: A Scalable Workflow for Accelerating AI-Assisted Materials Discovery and Design exa-AMD: Ein skalierbarer Workflow zur Beschleunigung der Entdeckung und des Designs von KI-Assistenten Exa-AMD:加速使用AI辅助材料发现和设计的一个可缩放工作流程 2506.21449v1
  • 03 06-26 Carbon-Aware Microservice Deployment for Optimal User Experience on a Budget Carbon-Aware Microservice Bereitstellung für eine optimale Benutzererfahrung auf einem Budget 为最佳预算用户提供最佳预算用户经验的碳软件微型服务部署 2506.21422v1
  • 04 06-26 Enabling Bitcoin Smart Contracts on the Internet Computer Ermöglichung von Bitcoin Smart Contracts auf dem Internet-Computer 使因特网计算机上比特币智能合同成为可能 2506.21327v1
  • 05 06-26 Exploring Micro Frontends: A Case Study Application in E-Commerce Erforschung von Micro Frontends: Eine Anwendungsfallstudie im E-Commerce 探索微观前沿:电子商务案例研究应用 2506.21297v1
  • 06 06-26 Balancing Privacy, Robustness, and Efficiency in Machine Learning Ausbalancierende Privatsphäre, Robustheit und Effizienz im maschinellen Lernen 平衡隐私、强健和机器学习效率 2312.14712v3
  • 07 06-26 The Autonomy of the Lightning Network: A Mathematical and Economic Proof of Structural Decoupling from BTC Die Autonomie des Blitznetzes: Ein mathematischer und wirtschaftlicher Beweis der strukturellen Entkopplung von BTC 闪电网络的自主性:结构脱钩与BTC的数学和经济证明 2506.19333v2
  • 08 06-26 Bridding OT and PaaS in Edge-to-Cloud Continuum Bridding OT und PaaS im Edge-to-Cloud Continuum 边际至环际环礁岛的Briding OT和PaaS 2506.21072v1
  • 09 06-26 An Information-Theoretic Analysis for Federated Learning under Concept Drift Eine informationstheoretische Analyse für das Federated Learning unter Konzept Drift 根据 “ 漂流概念 “ 进行的联邦学习信息理论分析 2506.21036v1
  • 10 06-26 BLOCKS: Blockchain-supported Cross-Silo Knowledge Sharing for Efficient LLM Services BLOCKS: Blockchain-gestützter Cross-Silo-Wissensaustausch für effiziente LLM-Dienste BLOCKS:为高效率的LLM服务进行链链式支持的跨SIlo知识共享 2506.21033v1
  • 11 06-26 Portable High-Performance Kernel Generation for a Computational Fluid Dynamics Code with DaCe Tragbare Hochleistungs-Kernel-Generation für einen numerischen Fluid-Dynamik-Code mit DaCe DaCe 计算流流体动态代码的可携高性能核心生成器 2506.20994v1
  • 12 06-26 ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks ParEval-Repo: Eine Benchmark-Suite zur Bewertung von LLMs mit HPC-Übersetzungsaufgaben auf Repository-Ebene PaarEval-Repo:评价拥有仓库级高常委会翻译任务的LLMLM 基准套件 2506.20938v1
  • 13 06-25 (3) Survey: Graph Databases Erhebung: Graphische Datenbanken 调查:图表数据库 2505.24758v2
  • 14 06-25 SuperSONIC: Cloud-Native Infrastructure for ML Inferencing SuperSONIC: Cloud-Native Infrastruktur für die ML-Inferenzierung 用于ML推导的云源基础设施 2506.20657v1
  • 15 06-25 Hear No Evil: Detecting Gradient Leakage by Malicious Servers in Federated Learning Hear No Evil: Detecting Gradient Leakage by Malicious Servers in Federated Learning 听不见邪恶:在联邦学习中发现恶意服务器的渐变渗漏 2506.20651v1
  • 16 06-25 On the Encoding Process in Decentralized Systems Zum Kodierungsprozess in dezentralisierten Systemen 分权系统编码进程 2408.15203v2
  • 17 06-25 Vertex addition to a ball graph with application to reliability and area coverage in autonomous swarms Vertex Ergänzung zu einem Kugelgraphen mit Anwendung zur Zuverlässigkeit und Flächenabdeckung in autonomen Schwarms 使用自动群群群的可靠性和覆盖面积的球图的顶点添加 2506.19197v2
  • 18 06-25 WattsOnAI: Measuring, Analyzing, and Visualizing Energy and Carbon Footprint of AI Workloads WattsOnAI: Messen, Analysieren und Visualisieren von Energie und Carbon Footprint von KI-Workloads WattsOnAI:AI工作量的测量、分析、可视化能源和碳足迹 2506.20535v1
  • 19 06-25 Collaborative Batch Size Optimization for Federated Learning Kollaborative Batch-Größenoptimierung für Federated Learning 联邦学习联合会的合作批量数量优化 2506.20511v1
  • 20 06-25 MARCO: Multi-Agent Code Optimization with Real-Time Knowledge Integration for High-Performance Computing MARCO: Multi-Agent Code-Optimierung mit Echtzeit-Knowledge Integration für High-Performance Computing MARCO: 利用实时知识整合优化多机构代码,促进高绩效计算 2505.03906v3
  • 21 06-25 When Servers Meet Species: A Fab-to-Grave Lens on Computing’s Biodiversity Impact Wenn Server Arten treffen: Eine Fab-to-Grave-Lens für die Biodiversitätswirkung von Computing 当服务器与物种相遇时:关于计算机的生物多样性影响的一个从宽到宽的镜头 2506.20442v1
  • 22 06-25 PAT: a new algorithm for all-gather and reduce-scatter operations at scale PAT: ein neuer Algorithmus für All-Gather- und Reduce-Scatter-Operationen im Maßstab PAT: 大规模全采集和减少散散作业的新算法 2506.20252v1
  • 23 06-25 The Blind Men and the Elephant: Mapping Interdisciplinarity in Research on Decentralized Autonomous Organizations Blinde Männer und Elefant: Interdisziplinarität in der Forschung über dezentralisierte autonome Organisationen kartieren 盲人和大象:绘制分权自治组织研究的多元性图 2502.09949v2
  • 24 06-25 LiteGD: Lightweight and Dynamic GPU Dispatching for Large-scale Heterogeneous Clusters LiteGD: Leichte und dynamische GPU Dispatching für großflächige Heterogene Cluster LiteGD: 大型异源集束体轻量和动态GPU发射 2506.15595v2
  • 25 06-25 On the $h$-majority dynamics with many opinions Auf der $h$-Mehrheitsdynamik mit vielen Meinungen 关于以美元为多数的动态, 2506.20218v1
  • 26 06-24 (2) MegaFold: System-Level Optimizations for Accelerating Protein Structure Prediction Models MegaFold: System-Level-Optimierungen zur Beschleunigung von Proteinstruktur-Vorhersagemodellen MegaFold:加速蛋白质结构结构预测模型的全系统优化 2506.20686v1
  • 27 06-24 Power-Capping Metric Evaluation for Improving Energy Efficiency in HPC Applications Power-Capping Metric Evaluation zur Verbesserung der Energieeffizienz in HPC-Anwendungen 提高高常委会应用能效提高方法评估 2505.21758v2
  • 28 06-24 Can One Safety Loop Guard Them All? Agentic Guard Rails for Federated Computing Kann ein Sicherheitsschlaufe Guard sie alle? Agentic Guard Rails für Federated Computing 一个安全环圈能保护全部吗? 2506.20000v1
  • 29 06-24 AI-coupled HPC Workflow Applications, Middleware and Performance KI-gekoppelte HPC-Workflow-Anwendungen, Middleware und Performance 工作流量应用、中软件和性能 2406.14315v2
  • 30 06-24 MAIZX: A Carbon-Aware Framework for Optimizing Cloud Computing Emissions MAIZX: Ein Carbon-Aware-Framework zur Optimierung von Cloud-Computing-Emissionen MAIZX:优化云计算排放的碳软件框架 2506.19972v1
  • 31 06-24 Maintaining a Bounded Degree Expander in Dynamic Peer-to-Peer Networks Aufrechterhaltung eines begrenzten Grades Expander in dynamischen Peer-to-Peer-Netzwerken 维持动态同侪网络中的宽度扩展器 2506.17757v2
  • 32 06-24 FDA-Opt: Communication-Efficient Federated Fine-Tuning of Language Models FDA-Opt: Kommunikationseffizientes Federated Fine-Tuning von Sprachmodellen FFDA-Opt: 交流-高效联邦语言模型精密使用 2505.04535v2
  • 33 06-24 Formalization and security analysis of the Bridgeless protocol Formalisierung und Sicherheitsanalyse des Bridgeless-Protokolls 对 “ 无桥梁议定书 “ 的正规化和安全分析 2506.19730v1
  • 34 06-24 PS-WL: A Probability-Sensitive Wear Leveling scheme for SSD array scaling PS-WL: Ein Probability-Sensitive Wear Leveling-Schema für die Skalierung von SSD-Arrays PS-WL: SSD 阵列比例缩放的概率感敏性穿级方案 2506.19660v1
  • 35 06-24 Towards an Introspective Dynamic Model of Globally Distributed Computing Infrastructures Auf dem Weg zu einem introspektiven dynamischen Modell weltweit verteilter Computing-Infrastrukturen 争取建立全球分布全球电子计算基础设施的前瞻性动态模型 2506.19578v1
  • 36 06-24 MemAscend: System Memory Optimization for SSD-Offloaded LLM Fine-Tuning MemAscend: Systemspeicheroptimierung für SSD-Offloaded LLM Fine-Tuning MemAscend: SSD- 卸载 LLM 精密调试的系统内存优化 2505.23254v2
  • 37 06-24 Picsou: Enabling Replicated State Machines to Communicate Efficiently Picsou: Replizierte Staatsmaschinen effizient kommunizieren Picsou: 使可复制的国家机器能够有效通信 2312.11029v2
  • 38 06-24 TrainVerify: Equivalence-Based Verification for Distributed LLM Training TrainVerify: Gleichwertigkeitsbasierte Überprüfung für verteiltes LLM-Training 培训核查:分布式LLM培训的等效核查 2506.15961v2
  • 39 06-24 RepuNet: A Reputation System for Mitigating Malicious Clients in DFL RepuNet: Ein Reputationssystem zur Bekämpfung bösartiger Kunden in der DFL RepuNet:DFL中减少恶意客户的声望系统 2506.19892v1
  • 40 06-24 The Autonomous Data Language – Concepts, Design and Formal Verification Die autonome Datensprache – Konzepte, Design und formale Überprüfung 自主数据语言 – – 概念、设计和正式核查 2506.19457v1
  • 41 06-24 Agent-Based Triangle Counting: Unlocking Truss Decomposition, Triangle Centrality, and Local Clustering Coefficient Agent-Based Triangle Counting: Entsperren Truss Zersetzung, Dreieck Zentralität und lokale Clustering Koeffizient 基于代理的三角计数:解锁Truss分解、三角中心以及地方集束 2402.03653v2
  • 42 06-24 Computing Tree Structures in Anonymous Graphs via Mobile Agents Berechnung von Baumstrukturen in anonymen Graphen über Mobile Agents 通过移动代理器在匿名图纸中的电子树结构 2506.19365v1
  • 43 06-24 PBFT-Backed Semantic Voting for Multi-Agent Memory Pruning PBFT-unterstützte semantische Abstimmung für Multi-Agent Memory Pruning PBFT 多重机构内存缓冲后退的语义投票 2506.17338v2
  • 44 06-24 A Heuristic Algorithm for Shortest Path Search Ein Heuristischer Algorithmus für die kürzeste Pfadsuche 用于最短路径搜索的 Hyuristic 算法 2506.19349v1
  • 45 06-24 Network Structures as an Attack Surface: Topology-Based Privacy Leakage in Federated Learning Netzwerkstrukturen als Angriffsfläche: Topologiebasiertes Datenschutz-Leakage im Federated Learning 网络结构作为攻击表面:联邦学习中的基于地形的隐私渗漏 2506.19260v1
  • 46 06-24 Research on Model Parallelism and Data Parallelism Optimization Methods in Large Language Model-Based Recommendation Systems Forschung über Modellparallelität und Datenparallelität Optimierungsmethoden in großsprachlichen modellbasierten Empfehlungssystemen 研究示范平行主义和数据平行主义 2506.17551v2
  • 47 06-24 Shelby: Decentralized Storage Designed to Serve Shelby: Dezentraler Speicher für die Bedienung Shelby: 设计用于提供服务的分散储存 2506.19233v1
  • 48 06-24 Private Model Personalization Revisited Private Modell-Personalisierung überarbeitet 重新研究的私人个人模式 2506.19220v1
  • 49 06-23 (1) Binsparse: A Specification for Cross-Platform Storage of Sparse Matrices and Tensors Binsparse: Eine Spezifikation für die plattformübergreifende Lagerung von Sparse Matrizen und Tensoren Binsparse: 微粒母体和导体跨平台储存规格 2506.19175v1
  • 50 06-23 GradualDiff-Fed: A Federated Learning Specialized Framework for Large Language Model GradualDiff-Fed: Ein Federated Learning Specialized Framework für großes Sprachmodell 逐步发展伙伴关系:联邦学习大语言模式专门框架 2506.19164v1
  • 51 06-23 Survey of HPC in US Research Institutions Erhebung über HPC in US-Forschungseinrichtungen 美国研究机构的HPC调查 2506.19019v1
  • 52 06-23 Pod: An Optimal-Latency, Censorship-Free, and Accountable Generalized Consensus Layer Pod: Eine optimale Latenz, Zensur-frei und buchhalterisch generalisierte Konsensebene pod:最佳、无检查和可问责的共识层 2501.14931v3
  • 53 06-23 The Power of Strong Linearizability: the Difficulty of Consistent Refereeing Die Kraft der starken Linearität: die Schwierigkeit des konsequenten Referees 强强线性力量:一致裁判的困难 2506.18401v1
  • 54 06-23 Fully-Dynamic Parallel Algorithms for Single-Linkage Clustering Volldynamisch-Parallelalgorithmen für Single-Linkage-Clustering 单一链接集束的全动态平行平行数值 2506.18384v1
  • 55 06-23 A Contention-Free Model for Converged Kubernetes on HPC Ein konfliktfreies Modell für konvergierte Kubernete auf HPC 高常动中聚苯乙烯趋同Kubernets无内容模式 2406.06995v2
  • 56 06-23 Edge Association Strategies for Synthetic Data Empowered Hierarchical Federated Learning with Non-IID Data Edge Association Strategien für Synthetische Daten Empowered Hierarchical Federated Learning mit nicht-ID Daten 合成数据协会赋予非IID数据高级联邦学习权力、非IID数据的高级联邦学习战略 2506.18259v1
  • 57 06-22 (7) DeInfoReg: A Decoupled Learning Framework for Better Training Throughput DeInfoReg: Ein entkoppelter Lernrahmen für besseren Trainingsdurchsatz DInfoReg:一个分离的学习框架,以改善培训工作量 2506.18193v1
  • 58 06-22 Floating-Point Data Transformation for Lossless Compression Floating-Point-Datentransformation für verlustfreie Kompression 用于无损失压缩的浮动点数据转换 2506.18062v1
  • 59 06-22 Leveraging Cloud-Fog Automation for Autonomous Collision Detection and Classification in Intelligent Unmanned Surface Vehicles Nutzung von Cloud-Fog Automation zur autonomen Kollisionserkennung und Klassifizierung in intelligenten unbemannten Oberflächenfahrzeugen 利用云雾自动化对智能无载表面车辆进行自动碰撞探测和分类 2506.18024v1
  • 60 06-22 CFTel: A Practical Architecture for Robust and Scalable Telerobotics with Cloud-Fog Automation CFTel: Eine praktische Architektur für robuste und skalierbare Telerobotik mit Cloud-Fog Automation FLFel:一个有云雾自动化的强力和可缩放的Telerostotics实用建筑 2506.17991v1
  • 61 06-22 Leveraging Large Language Model for Intelligent Log Processing and Autonomous Debugging in Cloud AI Platforms Nutzung eines großen Sprachmodells für intelligente Protokollverarbeitung und autonomes Debugging in Cloud-KI-Plattformen 利用大语言模型,在云层独立平台中利用智能日志处理和自动调试大语言模型 2506.17900v1
  • 62 06-22 SPD-CFL: Stepwise Parameter Dropout for Efficient Continual Federated Learning SPD-CFL: Schrittweiser Parameter-Ausfall für effizientes kontinuierliches Federated Learning SPD-CFL: 高效持续联邦学习的分级参数辍学 2405.09394v2
  • 63 06-22 NestQuant: Post-Training Integer-Nesting Quantization for On-Device DNN NestQuant: Post-Training Integer-Nesting Quantization for On-Device DNN NestQuant: 培训后DNN的整数 2506.17870v1
  • 64 06-21 (6) FedBaF: Federated Learning Aggregation Biased by a Foundation Model FedBaF: Federated Learning Aggregation Durch ein Stiftungsmodell biased FedBAF: 联邦学习联合组织 2410.18352v3
  • 65 06-21 Implementation and Evaluation of Fast Raft for Hierarchical Consensus Umsetzung und Bewertung des Fast Raft für den Hierarchischen Konsens 落实和评价促进等级共识的快行道 2506.17793v1
  • 66 06-21 A Locally Differential Private Coding-Assisted Succinct Histogram Protocol Ein lokal differenziertes, privates Coding Assisted Succinct Histogramm Protokoll 本地差异私家编码辅助闪电直方图议定书 2506.17767v1
  • 67 06-21 Automated Selfish Mining Analysis for DAG-Based PoW Consensus Protocols Automatisierte Selfish Mining Analyse für DAG-basierte PoW-Konsensusprotokolle 为基于残疾非洲集团的《水、水、水、水、水、水的共识议定书》自动自采矿分析 2501.10888v3
  • 68 06-21 Choosing the Right Battery Model for Data Center Simulations Auswahl des richtigen Batteriemodells für Rechenzentrumssimulationen 选择数据中心模拟的右电池模型 2506.17739v1
  • 69 06-21 Distributed Butterfly Analysis using Mobile Agents Verteilte Schmetterlingsanalyse mit mobilen Agenten 使用移动剂进行分布式蝴蝶分析 2506.17721v1
  • 70 06-21 JAX-LaB: A High-Performance, Differentiable, Lattice Boltzmann Library for Modeling Multiphase Fluid Dynamics in Geosciences and Engineering JAX-LaB: Eine leistungsstarke, differenzierbare Lattice Boltzmann Bibliothek zur Modellierung von Mehrphasen-Flüssigkeitsdynamiken in Geowissenschaften und Ingenieurwissenschaften JAX-LAB:地球科学和工程多阶段流力动力建模高绩效、可区别的Lattice Boltzmann图书馆 2506.17713v1
  • 71 06-21 Residue Number System (RNS) based Distributed Quantum Multiplication Rückstandszahlsystem (RNS) basiert auf verteilter Quanten-Multiplikation 基于残余数字系统(RNS)的分布量乘法 2506.17588v1
  • 72 06-21 ConsumerBench: Benchmarking Generative AI Applications on End-User Devices ConsumerBench: Benchmarking Generative KI-Anwendungen auf Endgeräten 消费者:确定最终用户设备应用基准 2506.17538v1
  • 73 06-20 (5) A Grassroots Network and Community Roadmap for Interconnected Autonomous Science Laboratories for Accelerated Discovery Ein Grassroots-Netzwerk und ein gemeinschaftlicher Fahrplan für vernetzte autonome Wissenschaftslaboratorien für beschleunigte Entdeckung 加速发现相互连接的自治科学实验室基层网络和社区路线图 2506.17510v1
  • 74 06-20 Optimal Parallel Algorithms for Convex Hulls in 2D and 3D under Noisy Primitive Operations Optimale Parallelalgorithmen für Konvexhüllen in 2D und 3D unter Noisy Primitive Operations 2D 和 3D 的Convex Hulls 在噪音原始操作下的最佳平行比值 2506.17507v1
  • 75 06-20 Fed-pilot: Optimizing LoRA Allocation for Efficient Federated Fine-Tuning with Heterogeneous Clients Fed-Pilot: Optimierung der LoRA-Allokation für effizientes Federated Fine-Tuning mit heterogenen Kunden Fed-试点:优化LORA分配,与异质客户进行高效的联邦货币调整 2410.10200v2
  • 76 06-20 Code Generation for Near-Roofline Finite Element Actions on GPUs from Symbolic Variational Forms Code-Generierung für flächennahe Finite-Element-Aktionen auf GPUs aus Symbolischen Variationsformen 象征式变换形式GPU的近Roofline有限元素行动代码生成代码 2506.17471v1
  • 77 06-20 SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving SLED: Ein spekulatives LLM-Decoding-Framework für effizientes Edge Serving SLED: 有效边缘服务投机性LLM代谢框架 2506.09397v2
  • 78 06-20 A Comparative Analysis of Distributed Linear Solvers under Data Heterogeneity Eine vergleichende Analyse der verteilten linearen Solver unter Daten Heterogenität 数据差异下分布线性溶剂的比较分析 2304.10640v4
  • 79 06-20 LayerZero SchichtZero 层数为零 2312.09118v3
  • 80 06-20 $Δ$-Nets: Interaction-Based System for Optimal Parallel $λ$-Reduction $Δ$-Nets: Interaktionsbasiertes System für eine optimale parallele $λ$-Reduktion \(-净额:最佳平行互动系统\)$美元-削减 2505.20314v3
  • 81 06-20 Byzantine-Tolerant Consensus in GPU-Inspired Shared Memory Byzantinisch-Tolerant Konsens in GPU-inspiriert gemeinsamen Speicher 在GPU-受GPU启发的共同记忆中,拜占庭-容忍共识 2503.12788v3
  • 82 06-20 JANUS: Resilient and Adaptive Data Transmission for Enabling Timely and Efficient Cross-Facility Scientific Workflows JANUS: Resiliente und adaptive Datenübertragung zur rechtzeitigen und effizienten Cross-Facility wissenschaftlichen Workflows JANUS:为及时、高效的跨设施科学工作流程提供具有弹性和适应性的数据传输 2506.17084v1
  • 83 06-20 Comparison of substructured non-overlapping domain decomposition and overlapping additive Schwarz methods for large-scale Helmholtz problems with multiple sources Vergleich von substrukturierten nicht-überlappenden Domänenzersetzungen und überlappenden additiven Schwarz-Methoden für großflächige Helmholtz-Probleme mit mehreren Quellen 用于处理与多种来源有关的大规模Helmholtz问题的亚结构非重叠重叠域分解和重叠添加剂施瓦兹方法比较 2506.16875v1
  • 84 06-20 Speeding up Local Optimization in Vehicle Routing with Tensor-based GPU Acceleration Beschleunigung der lokalen Optimierung im Fahrzeugrouting mit Tensor-basierter GPU-Beschleunigung 加速使用基于 Tensor 的 GPU 加速车辆运行的本地优化 2506.17357v1
  • 85 06-20 Alto: Orchestrating Distributed Compound AI Systems with Nested Ancestry Alto: Orchestrating Distributed Compound AI Systems with Nested Ancestry Alto: 带有内嵌原体的管弦式分布式 AI系统 2403.04311v2
  • 86 06-20 Incentivizing High-quality Participation From Federated Learning Agents Anreize für eine qualitativ hochwertige Beteiligung von Federated Learning Agents 激励来自联邦学习代理机构的高质量参与 2506.16731v1
  • 87 06-20 Persistent HyTM via Fast Path Fine-Grained Locking Persistent HyTM über Schnellweg feinkörnige Verriegelung 长效HYTM通过快车道 精密的锁闭 2501.14783v2
  • 88 06-19 (4) Enabling Blockchain Interoperability Through Network Discovery Services Aktivierung der Blockchain-Interoperabilität durch Network Discovery Services 通过网络发现服务促进链链互连互操作性 2506.16611v1
  • 89 06-19 Parallel Point-to-Point Shortest Paths and Batch Queries Parallele Punkt-zu-Punkt-Kurze Pfade und Batch-Abfragen 平行点对点最短路径和批量查询 2506.16488v1
  • 90 06-19 A Study of Synchronization Methods for Concurrent Size Eine Studie über Synchronisationsmethoden für die gleichzeitige Größe 同步体积同步化方法研究 2506.16350v1
  • 91 06-19 LAECIPS: Large Vision Model Assisted Adaptive Edge-Cloud Collaboration for IoT-based Embodied Intelligence System LAECIPS: Large Vision Model Assisted Adaptive Edge-Cloud Collaboration für IoT-basiertes Embodyd Intelligence System LAECIPS: 以IoT为基础的内嵌式情报系统大型远景模型 辅助适应性边缘群落协作 2404.10498v2
  • 92 06-19 Serving Large Language Models on Huawei CloudMatrix384 Große Sprachmodelle auf Huawei CloudMatrix384 瓦威云马特列克384 2506.12708v3
  • 93 06-19 NetSenseML: Network-Adaptive Compression for Efficient Distributed Machine Learning NetSenseML: Netzwerk-adaptive Kompression für effizientes verteiltes maschinelles Lernen NetSensenseML:高效分配机器学习网络-ADT压缩 2506.16235v1
  • 94 06-19 Federated Learning for MRI-based BrainAGE: a multicenter study on post-stroke functional outcome prediction Föderated Learning for MRI-based BrainAGE: Eine multizentrische Studie zur post-stroke funktionellen Ergebnisvorhersage 为基于MRI的脑力智能学习联合会学习:关于打击后功能性结果预测的多中心研究 2506.15626v2
  • 95 06-19 Reconfigurable Intelligent Surface Assisted VEC Based on Multi-Agent Reinforcement Learning Rekonfigurierbare intelligente oberflächenunterstützte VEC auf Basis von Multi-Agenten-Verstärkungslernen 基于多机构强化学习的可重新配置智能表面辅助VEC 2406.11318v2
  • 96 06-19 Deep-Reinforcement-Learning-Based AoI-Aware Resource Allocation for RIS-Aided IoV Networks Deep-Reinforcement-Learning-based AoI-Aware Ressourcenzuweisung für RIS-Aided IoV-Netzwerke 为RIS援助的IOV网络分配的深入加强-基于学习的AoI-软件资源 2406.11245v2
  • 97 06-19 DRL-Based Federated Self-Supervised Learning for Task Offloading and Resource Allocation in ISAC-Enabled Vehicle Edge Computing DRL-basiertes, selbstüberwachtes Lernen für Aufgabe Offloading und Ressourcenallokation im ISAC-fähigen Fahrzeug Edge Computing DRL-基于DRL的基于联邦的自我监督学习,以在ISAC-可加入的车辆边缘电子计算中进行任务卸载和资源分配 2408.14831v2
  • 98 06-19 Ladder-residual: parallelism-aware architecture for accelerating large model inference with communication overlapping Leiter-Residual: Parallelismus-bewusste Architektur zur Beschleunigung großer Modellinferenz mit Kommunikationsüberlappung 云梯-残余:加速大型模型推断与通信重叠的平行意识结构 2501.06589v5
  • 99 06-19 HetGPU: The pursuit of making binary compatibility towards GPUs HetGPU: Das Streben nach binärer Kompatibilität gegenüber GPUs HETGPU: 努力使二进制兼容到 GPUs 2506.15993v1
  • 100 06-19 KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider KVCache Cache in der Wildnis: KVCache Cache bei einem großen Cloud-Anbieter charakterisieren und optimieren KVcache 野生缓存: 大云提供方的 KVcache 缓存的特性和优化 KVcache 缓存 2506.02634v3

Article 0

Title@2025-06-26 (4): Benchmarking and Parallelization of Electrostatic Particle-In-Cell for low-temperature Plasma Simulation by particle-thread Binding

Title: Benchmarking and Parallelization of Electrostatic Particle-In-Cell for low-temperature Plasma Simulation by particle-thread Binding Benchmarking und Parallelisierung elektrostatischer Partikel-In-Zellen für Niedertemperatur-Plasmasimulation durch Partikel-Thread-Bindung 低温等温等同等量定基准和静电粒子细胞中电静电粒子细胞平行化 2506.21524v1

Authors (4): Libn Varghese, Bhaskar Chaudhury, Miral Shah, Mainak Bandyopadhyay

The Particle-In-Cell (PIC) method for plasma simulation tracks particle phase space information using particle and grid data structures. High computational costs in 2D and 3D device-scale PIC simulations necessitate parallelization, with the Charge Deposition (CD) subroutine often becoming a bottleneck due to frequent particle-grid interactions. Conventional methods mitigate dependencies by generating private grids for each core, but this approach faces scalability issues. We propose a novel approach based on a particle-thread binding strategy that requires only four private grids per node in distributed memory systems or four private grids in shared memory systems, enhancing CD scalability and performance while maintaining conventional data structures and requiring minimal changes to existing PIC codes. This method ensures complete accessibility of grid data structure for concurrent threads and avoids simultaneous access to particles within the same cell using additional functions and flags. Performance evaluations using a PIC benchmark for low-temperature partially magnetized E x B discharge simulation on a shared memory as well as a distributed memory system (1000 cores) demonstrate the method’s scalability, and additionally, we show the method has little hardware dependency.

粒子网格(Particle-In-Cell,PIC)等离子体模拟方法利用粒子和网格数据结构跟踪粒子的相空间信息。二维和三维器件尺度 PIC 模拟的计算成本很高,因此必须并行化;由于粒子与网格之间频繁交互,电荷沉积(CD)子程序往往成为瓶颈。传统方法通过为每个核心生成私有网格来缓解数据依赖,但这种做法存在可扩展性问题。我们提出一种基于粒子-线程绑定策略的新方法:在分布式内存系统中每个节点只需四个私有网格,在共享内存系统中只需四个私有网格,从而提升 CD 的可扩展性和性能,同时保留传统数据结构,并且只需对现有 PIC 代码做极少改动。该方法保证并发线程可以完整访问网格数据结构,并借助额外的函数和标志避免同时访问同一单元内的粒子。我们使用面向低温部分磁化 E x B 放电模拟的 PIC 基准,在共享内存系统和分布式内存系统(1000 个核心)上进行了性能评估,结果展示了该方法的可扩展性;此外,我们还表明该方法对硬件的依赖性很小。
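To make the conventional baseline mentioned in the abstract concrete, the following is a minimal sketch of charge deposition with one private grid per worker, followed by a reduction. It is not the paper's particle-thread binding method. The 1D periodic grid, the cloud-in-cell weighting, and the names `deposit_cic` and `deposit_private_grids` are assumptions made for illustration, and the workers are emulated sequentially.

```python
import numpy as np

def deposit_cic(positions, charge, n_cells, dx):
    """Cloud-in-cell (linear) deposition onto a periodic 1D grid (illustrative)."""
    grid = np.zeros(n_cells)
    cell = np.floor(positions / dx).astype(int)
    frac = positions / dx - cell              # offset inside the cell, in [0, 1)
    left = cell % n_cells
    right = (cell + 1) % n_cells
    # np.add.at accumulates correctly when several particles hit the same node,
    # which is exactly the write conflict that concurrent threads would face.
    np.add.at(grid, left, charge * (1.0 - frac))
    np.add.at(grid, right, charge * frac)
    return grid

def deposit_private_grids(positions, charge, n_cells, dx, n_workers):
    """Conventional scheme: each worker fills its own grid, then grids are summed."""
    chunks = np.array_split(positions, n_workers)   # one particle chunk per worker
    private = [deposit_cic(c, charge, n_cells, dx) for c in chunks]
    return np.sum(private, axis=0)                  # reduction over private grids

rng = np.random.default_rng(0)
pos = rng.uniform(0.0, 8.0, size=1000)
single = deposit_cic(pos, 1.0, 8, 1.0)
reduced = deposit_private_grids(pos, 1.0, 8, 1.0, n_workers=4)
```

In this scheme the memory and reduction cost grow with the number of workers times the grid size, which is the scalability issue the abstract attributes to private grids per core.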


Article 1

Title@2025-06-26 (4): Efficient and Reuseable Cloud Configuration Search Using Discovery Spaces

Title: Efficient and Reuseable Cloud Configuration Search Using Discovery Spaces Effiziente und wiederverwendbare Cloud-Konfiguration Suche mit Discovery Spaces 利用发现空间进行高效和可再利用的云层配置搜索 2506.21467v1

Authors (7): Michael Johnston, Burkhard Ringlein, Christoph Hagleitner, Alessandro Pomponio, Vassilis Vassiliadis, Christian Pinto, Srikumar Venugopal

Finding the optimal set of cloud resources to deploy a given workload at minimal cost while meeting a defined service level agreement is an active area of research. Combining tens of parameters applicable across a large selection of compute, storage, and services offered by cloud providers with similar numbers of application-specific parameters leads to configuration spaces with millions of deployment options. In this paper, we propose Discovery Space, an abstraction that formalizes the description of workload configuration problems, and exhibits a set of characteristics required for structured, robust and distributed investigations of large search spaces. We describe a concrete implementation of the Discovery Space abstraction and show that it is generalizable across a diverse set of workloads such as Large Language Model inference and Big Data Analytics. We demonstrate that our approach enables safe, transparent sharing of data between executions of best-of-breed optimizers, increasing the efficiency of optimal configuration detection in large search spaces. We also demonstrate how Discovery Spaces enable transfer and reuse of knowledge across similar search spaces, enabling configuration search speed-ups of over 90%.

找到最佳的云层资源来以最低的成本部署特定的工作量,同时满足规定的服务水平协议是一个活跃的研究领域。 将大量选择的计算、储存和服务中所适用的数十项参数结合起来,由具有类似应用参数的云层提供者提供大量选择的计算、储存和服务,导致配置空间,使用数百万个部署选项。 在本文件中,我们提出探索空间,这是一个将工作量配置问题描述正式化的抽象概念,并展示了对大型搜索空间进行结构化、稳健和分布式调查所需的一系列特征。我们描述了探索空间抽象的切实执行情况,并表明它广泛适用于诸如大语言模型推断和大数据分析等一系列不同的工作量。我们证明,我们的方法能够安全、透明地共享最佳优化剂执行之间的数据,提高在大型搜索空间进行最佳配置探测的效率。我们还演示了探索空间如何在类似搜索空间进行知识的转移和再利用,使配置搜索速度超过90%。
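The idea of sharing measurements between optimizer runs can be sketched as a store that caches one result per configuration. This is only an illustration of the reuse principle under stated assumptions, not the Discovery Space implementation: `SharedSampleStore`, `grid_search`, and the cost model in `toy_cost` are hypothetical names and numbers invented here.

```python
from itertools import product

class SharedSampleStore:
    """Hypothetical store that keeps one measurement per configuration."""

    def __init__(self, measure):
        self._measure = measure
        self._results = {}
        self.measure_calls = 0   # counts real (expensive) measurements only

    def evaluate(self, config):
        key = tuple(sorted(config.items()))
        if key not in self._results:
            self._results[key] = self._measure(config)
            self.measure_calls += 1
        return self._results[key]

def grid_search(space, store):
    """Exhaustive optimizer that reads every sample through the shared store."""
    names = sorted(space)
    best = None
    for values in product(*(space[n] for n in names)):
        config = dict(zip(names, values))
        cost = store.evaluate(config)
        if best is None or cost < best[1]:
            best = (config, cost)
    return best

def toy_cost(config):
    # Invented cost model: resource price plus a runtime term that shrinks with CPUs.
    return config["cpus"] * 0.05 + config["memory_gb"] * 0.01 + 4 / config["cpus"]

space = {"cpus": [1, 2, 4, 8], "memory_gb": [4, 8, 16]}
store = SharedSampleStore(toy_cost)
first = grid_search(space, store)
calls_after_first = store.measure_calls
second = grid_search(space, store)   # second run triggers no new measurements
```

A second optimizer pointed at the same store pays only for configurations nobody has measured yet, which is the kind of reuse the abstract credits for its search speed-ups.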


Article 2

Title@2025-06-26 (4): exa-AMD: A Scalable Workflow for Accelerating AI-Assisted Materials Discovery and Design

Title: exa-AMD: A Scalable Workflow for Accelerating AI-Assisted Materials Discovery and Design exa-AMD: Ein skalierbarer Workflow zur Beschleunigung der Entdeckung und des Designs von KI-Assistenten Exa-AMD:加速使用AI辅助材料发现和设计的一个可缩放工作流程 2506.21449v1

Authors (7): Maxim Moraru, Weiyi Xia, Zhuo Ye, Feng Zhang, Yongxin Yao, Ying Wai Li, Cai-Zhuang Wang

exa-AMD is a Python-based application designed to accelerate the discovery and design of functional materials by integrating AI/ML tools, materials databases, and quantum mechanical calculations into scalable, high-performance workflows. The execution model of exa-AMD relies on Parsl, a task-parallel programming library that enables a flexible execution of tasks on any computing resource from laptops to supercomputers. By using Parsl, exa-AMD is able to decouple the workflow logic from execution configuration, thereby empowering researchers to scale their workflows without having to reimplement them for each system.

Exa-AMD是一个基于Python的应用程序,旨在通过将AI/ML工具、材料数据库和量子机械计算纳入可缩放的高性能工作流程,加速发现和设计功能性材料。 Exa-AMD的执行模式依赖于Parsl,这是一个任务平行编程库,可以灵活地执行从膝上型计算机到超级计算机的任何计算资源的任务。 通过使用Parsl,Exa-AMD能够将工作流程逻辑与执行配置脱钩,从而使研究人员不必为每个系统重新实施工作流程,从而能够扩大工作流程的规模。
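The separation of workflow logic from execution configuration can be illustrated without Parsl itself. The sketch below uses Python's standard `concurrent.futures` as a stand-in, so it shows the design principle rather than exa-AMD's code or Parsl's API. The task `relax_structure` and the function `run_workflow` are hypothetical placeholders.

```python
from concurrent.futures import ThreadPoolExecutor

def relax_structure(structure_id):
    """Placeholder task; a real workflow would launch a calculation here."""
    return structure_id, structure_id ** 2

def run_workflow(structure_ids, executor):
    """Workflow logic: submit tasks and gather results, unaware of the resources used."""
    futures = [executor.submit(relax_structure, s) for s in structure_ids]
    return dict(f.result() for f in futures)

# The execution configuration lives outside the workflow logic and can be swapped.
with ThreadPoolExecutor(max_workers=2) as laptop_like:
    small = run_workflow(range(4), laptop_like)
with ThreadPoolExecutor(max_workers=8) as cluster_like:
    large = run_workflow(range(4), cluster_like)
```

Because `run_workflow` only sees an executor object, changing the resource configuration does not require touching the workflow code, which mirrors the decoupling the abstract describes.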


Article 3

Title@2025-06-26 (4): Carbon-Aware Microservice Deployment for Optimal User Experience on a Budget

Title: Carbon-Aware Microservice Deployment for Optimal User Experience on a Budget Carbon-Aware Microservice Bereitstellung für eine optimale Benutzererfahrung auf einem Budget 为最佳预算用户提供最佳预算用户经验的碳软件微型服务部署 2506.21422v1

Authors (3): Kevin Kreutz, Philipp Wiesner, Monica Vitali

The carbon footprint of data centers has recently become a critical concern. So far, most carbon-aware strategies have focused on leveraging the flexibility of scheduling decisions for batch processing by shifting the time and location of workload executions. However, such approaches cannot be applied to service-oriented cloud applications, since they have to be reachable at every point in time and often at low latencies. We propose a carbon-aware approach for operating microservices under hourly carbon budgets. By choosing the most appropriate version and horizontal scaleout for each microservice, our strategy maximizes user experience and revenue while staying within budget constraints. Experiments across various application configurations and carbon budgets demonstrate that the approach adapts properly to changing workloads and carbon intensities.

数据中心的碳足迹最近已成为一个关键问题。 到目前为止,大多数碳意识战略都侧重于通过改变工作量处决的时间和地点来利用批量处理决策时间安排的灵活性,然而,这些方法不能应用于服务导向的云应用,因为它们必须在每一个时间点都能达到,而且往往在低晚时间。我们提出了在小时碳预算下运行微型服务的碳意识方法。通过为每个微观服务选择最合适的版本和横向扩展,我们的战略在保持预算限制的同时最大限度地利用用户的经验和收入。 各种应用配置和碳预算的实验表明,该方法适应了不断变化的工作量和碳强度。
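Choosing a version and scale-out per microservice under an hourly carbon budget can be viewed as a small constrained selection problem. The sketch below solves it by brute force. It is an illustration under assumptions, not the authors' optimizer: the services, versions, utility scores, and emission figures in `OPTIONS` are invented, and `best_deployment` is a hypothetical name.

```python
from itertools import product

# Hypothetical options per microservice:
# (version, replicas, utility score, grams of CO2 per hour). All numbers are invented.
OPTIONS = {
    "search": [("lite", 1, 3.0, 20.0), ("full", 1, 5.0, 45.0), ("full", 2, 6.0, 90.0)],
    "recommend": [("off", 0, 0.0, 0.0), ("lite", 1, 2.0, 15.0), ("full", 1, 4.0, 50.0)],
}

def best_deployment(options, carbon_budget):
    """Return the highest-utility combination whose emissions fit the budget."""
    services = sorted(options)
    best_choice, best_utility = None, float("-inf")
    for combo in product(*(options[s] for s in services)):
        emissions = sum(opt[3] for opt in combo)
        utility = sum(opt[2] for opt in combo)
        if emissions <= carbon_budget and utility > best_utility:
            best_choice = {s: opt[:2] for s, opt in zip(services, combo)}
            best_utility = utility
    return best_choice, best_utility

generous_choice, generous_utility = best_deployment(OPTIONS, 100.0)
tight_choice, tight_utility = best_deployment(OPTIONS, 60.0)
```

Re-running the selection each hour with the current budget and carbon intensity is one simple way to adapt to changing conditions. Exhaustive enumeration only works for small option sets, and larger deployments would need a proper solver.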


Article 4

Title@2025-06-26 (4): Enabling Bitcoin Smart Contracts on the Internet Computer

Title: Enabling Bitcoin Smart Contracts on the Internet Computer Ermöglichung von Bitcoin Smart Contracts auf dem Internet-Computer 使因特网计算机上比特币智能合同成为可能 2506.21327v1

Authors (4): Ryan Croote, Islam El-Ashi, Thomas Locher, Yvonne-Anne Pignolet

There is growing interest in providing programmatic access to the value locked in Bitcoin, which famously offers limited programmability itself. Various approaches have been put forth in recent years, with the vast majority of proposed mechanisms either building new functionality on top of Bitcoin or leveraging a bridging mechanism to enable smart contracts that make use of “wrapped” bitcoins on entirely different platforms. In this work, an architecture is presented that follows a different approach. The architecture enables the execution of Turing-complete Bitcoin smart contracts on the Internet Computer (IC), a blockchain platform for hosting and executing decentralized applications. Instead of using a bridge, IC and Bitcoin nodes interact directly, eliminating potential security risks that the use of a bridge entails. This integration requires novel concepts, in particular to reconcile the probabilistic nature of Bitcoin with the irreversibility of finalized state changes on the IC, which may be of independent interest. In addition to the presentation of the architecture, we provide evaluation results based on measurements of the Bitcoin integration running on mainnet. The evaluation results demonstrate that, with finalization in a few seconds and low execution costs, this integration enables complex Bitcoin-based decentralized applications that were not practically feasible or economically viable before.

人们越来越希望能够以编程方式使用锁定在比特币中的价值,而众所周知,比特币自身的可编程性十分有限。近年来出现了多种方案,其中绝大多数要么在比特币之上构建新功能,要么借助桥接机制,使智能合约能够在完全不同的平台上使用“封装的”(wrapped)比特币。本文提出了一种思路不同的架构。该架构使图灵完备的比特币智能合约能够在互联网计算机(Internet Computer,IC)上执行,IC 是一个用于托管和执行去中心化应用的区块链平台。IC 节点与比特币节点直接交互,而不借助桥,从而消除了使用桥所带来的潜在安全风险。这种集成需要一些新的概念,特别是要调和比特币的概率性与 IC 上已最终确定的状态变更的不可逆性,这些概念本身可能也具有独立的研究价值。除了介绍该架构之外,我们还基于对主网上运行的比特币集成的测量给出了评估结果。评估结果表明,凭借数秒内的最终确认和较低的执行成本,这种集成使得此前在实践上不可行或在经济上不划算的复杂比特币去中心化应用成为可能。
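The tension between probabilistic Bitcoin finality and irreversible state changes can be illustrated with a common general-purpose rule: act on a transaction only once it is buried under a minimum number of confirmations. The abstract does not say that the paper uses this rule, so the sketch below is an assumption-laden illustration. The function names and the default threshold of 6 are hypothetical choices.

```python
def confirmations(tip_height, tx_height):
    """Blocks on the current chain that contain or build on the transaction."""
    return max(0, tip_height - tx_height + 1)

def stable_transactions(tx_heights, tip_height, min_confirmations=6):
    """Select transactions deep enough to be treated as settled (assumed threshold)."""
    return [
        txid
        for txid, height in tx_heights.items()
        if confirmations(tip_height, height) >= min_confirmations
    ]

# Hypothetical observations: transaction id -> height of the block that includes it.
observed = {"tx_a": 100, "tx_b": 104, "tx_c": 105}
ready = stable_transactions(observed, tip_height=105, min_confirmations=6)
```

A higher threshold lowers the chance that a chain reorganization invalidates a transaction the contract has already acted on, at the cost of added latency.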


Article 5

Title@2025-06-26 (4): Exploring Micro Frontends: A Case Study Application in E-Commerce

Title: Exploring Micro Frontends: A Case Study Application in E-Commerce Erforschung von Micro Frontends: Eine Anwendungsfallstudie im E-Commerce 探索微观前沿:电子商务案例研究应用 2506.21297v1

Authors (5): Ricardo Hideki Hangai Kojo, Luiz Fernando Corte Real, Renato Cordeiro Ferreira, Thatiane de Oliveira Rosa, Alfredo Goldman

In the micro frontends architectural style, the frontend is divided into smaller components, which can range from a simple button to an entire page. The goal is to improve scalability, resilience, and team independence, albeit at the cost of increased complexity and infrastructure demands. This paper seeks to understand when it is worth adopting micro frontends, particularly in the context of industry. To achieve this, we conducted an investigation into the state of the art of micro frontends, based on both academic and gray literature. We then implemented this architectural style in a marketplace for handcrafted products, which already used microservices. Finally, we evaluated the implementation through a semi-open questionnaire with the developers. At the studied marketplace company, the need for architectural change arose due to the tight coupling between their main system (a Java monolith) and a dedicated frontend system. Additionally, there were deprecated technologies and poor developer experience. To address these issues, the micro frontends architecture was adopted, along with the API Gateway and Backend for Frontend patterns, and technologies such as Svelte and Fastify. Although the adoption of Micro Frontends was successful, it was not strictly necessary to meet the company’s needs. According to the analysis of the mixed questionnaire responses, other alternatives, such as a monolithic frontend, could have achieved comparable results. What made adopting micro frontends the most convenient choice in the company’s context was the monolith strangulation and microservices adoption, which facilitated implementation through infrastructure reuse and knowledge sharing between teams.

在微前端架构风格中,前端被划分为更小的组件,小到一个简单的按钮,大到整个页面。其目标是提高可扩展性、韧性和团队独立性,但代价是复杂性和基础设施需求的增加。本文旨在探讨何时值得采用微前端,特别是在工业界的背景下。为此,我们基于学术文献和灰色文献对微前端的研究现状进行了调研。随后,我们在一个已采用微服务的手工艺品交易平台中实现了这种架构风格。最后,我们通过面向开发人员的半开放式问卷对该实现进行了评估。在所研究的交易平台公司中,架构变更的需求源于其主系统(一个 Java 单体应用)与专用前端系统之间的紧耦合。此外,还存在技术过时和开发者体验不佳的问题。为解决这些问题,该公司采用了微前端架构,并结合 API Gateway 和 Backend for Frontend 模式,以及 Svelte 和 Fastify 等技术。尽管微前端的采用取得了成功,但就满足公司需求而言,它并非严格必要。根据对混合问卷答复的分析,其他方案(例如单体前端)也可以取得相近的结果。在该公司的背景下,使微前端成为最便利选择的因素是单体应用的逐步绞杀(strangulation)和微服务的采用,它们通过基础设施复用和团队间的知识共享促进了实施。


Article 6

Title@2025-06-26 (4): Balancing Privacy, Robustness, and Efficiency in Machine Learning

Title: Balancing Privacy, Robustness, and Efficiency in Machine Learning Ausbalancierende Privatsphäre, Robustheit und Effizienz im maschinellen Lernen 平衡隐私、强健和机器学习效率 2312.14712v3

Authors (3): Youssef Allouah, Rachid Guerraoui, John Stephan

This position paper argues that achieving robustness, privacy, and efficiency simultaneously in machine learning systems is infeasible under prevailing threat models. The tension between these goals arises not from algorithmic shortcomings but from structural limitations imposed by worst-case adversarial assumptions. We advocate for a systematic research agenda aimed at formalizing the robustness-privacy-efficiency trilemma, exploring how principled relaxations of threat models can unlock better trade-offs, and designing benchmarks that expose rather than obscure the compromises made. By shifting focus from aspirational universal guarantees to context-aware system design, the machine learning community can build models that are truly appropriate for real-world deployment.

本立场文件认为,在现行威胁模式下,在机器学习系统中同时实现稳健性、隐私和效率是行不通的。 这些目标之间的紧张关系并非源于算法缺陷,而是由于最坏的对抗性假设造成的结构性限制。 我们主张系统研究议程,旨在正式确定稳健性-隐私-效率三重力,探索威胁模式的原则性放松如何能够打开更好的权衡,并设计暴露而不是掩盖妥协的基准。 通过将重点从抱负性普遍保障转向环境意识系统设计,机器学习界可以构建真正适合现实世界部署的模式。


Article 7

Title@2025-06-26 (4): The Autonomy of the Lightning Network: A Mathematical and Economic Proof of Structural Decoupling from BTC

Title: The Autonomy of the Lightning Network: A Mathematical and Economic Proof of Structural Decoupling from BTC Die Autonomie des Blitznetzes: Ein mathematischer und wirtschaftlicher Beweis der strukturellen Entkopplung von BTC 闪电网络的自主性:结构脱钩与BTC的数学和经济证明 2506.19333v2

Authors (1): Craig Steven Wright

This paper presents a formal analysis of the Lightning Network as a monetary system structurally diverging from Bitcoin’s base-layer settlement model. We demonstrate that under increasing transaction demand, BTC transaction fees rise superlinearly due to throughput constraints, while Lightning Network routing costs approach a bounded asymptote. Using mathematical modeling, game-theoretic proofs, and complexity analysis, we show that Lightning enables indefinite off-chain operation via the emergence of liquidity hub oligopolies. These hubs exhibit properties of unregulated financial intermediaries, including rent extraction, opacity, and systemic fragility. Strategic agent models show that channel closure becomes economically infeasible, and routing problems approach hardness limits in P-Space complexity. We conclude that Lightning does not merely extend Bitcoin, but constitutes a synthetic financial system with shadowbank characteristics, lacking reserve discipline, transparency, or enforceable settlement guarantees.

本文对闪电网络作为一个与比特币基础结算模式结构不同的货币系统进行了正式分析。 我们证明,在交易需求不断增长的情况下,BTC交易费由于吞吐限制而出现超直线性上升,而Lightning网络路由成本则接近于一连串的无足轻重。 我们用数学模型、游戏理论证据和复杂分析来证明,Lightning通过流动资金中心寡头政治的出现,可以无限期地进行离链操作。 这些中心展示了不受管制的金融中介的特性,包括租金提取、不透明和系统脆弱性。 战略代理模型表明,关闭通道在经济上变得不可行,而路由问题则接近P-Space复杂程度的硬性限度。 我们的结论是,Lightning不仅仅是扩展Bitcoin,而是构成一个具有影子银行特征、缺乏储备纪律、透明度或可强制执行的结算担保的合成金融体系。


Article 8

Title@2025-06-26 (4): Bridding OT and PaaS in Edge-to-Cloud Continuum

Title: Bridding OT and PaaS in Edge-to-Cloud Continuum Bridding OT und PaaS im Edge-to-Cloud Continuum 边际至环际环礁岛的Briding OT和PaaS 2506.21072v1

Authors (2): Carlos J Barrios, Yves Denneulin

The Operational Technology Platform as a Service (OTPaaS) initiative provides a structured framework for the efficient management and storage of data. It ensures excellent response times while improving security, reliability, data and technology sovereignty, robustness, and energy efficiency, which are crucial for industrial transformation and data sovereignty. This paper illustrates successful deployment, adaptable application management, and various integration components catering to Edge and Cloud environments. It leverages the advantages of the Platform as a Service model and highlights key challenges that have been addressed for specific use cases.

运行技术平台是一个服务(OTPaaS)倡议,为有效管理和储存数据提供了一个结构化框架,它确保了极好的反应时间,同时改进了安全、可靠性、数据和技术主权、稳健性和能源效率,这些对于工业转型和数据主权至关重要,本文件介绍了成功的部署、适应性应用管理以及适合边缘和云层环境的各种整合组成部分,利用平台作为服务模式的优势,并着重指出了为具体使用案例处理的主要挑战。


Article 9

Title@2025-06-26 (4): An Information-Theoretic Analysis for Federated Learning under Concept Drift

Title: An Information-Theoretic Analysis for Federated Learning under Concept Drift Eine informationstheoretische Analyse für das Federated Learning unter Konzept Drift 根据 “ 漂流概念 “ 进行的联邦学习信息理论分析 2506.21036v1

Authors (3): Fu Peng, Meng Zhang, Ming Tang

Recent studies in federated learning (FL) commonly train models on static datasets. However, real-world data often arrives as streams with shifting distributions, a phenomenon known as concept drift that degrades performance. This paper analyzes FL performance under concept drift using information theory and proposes an algorithm to mitigate the performance degradation. We model concept drift as a Markov chain and introduce the Stationary Generalization Error to assess a model’s capability to capture characteristics of future unseen data. Its upper bound is derived using KL divergence and mutual information. We study three drift patterns (periodic, gradual, and random) and their impact on FL performance. Inspired by this, we propose an algorithm that regularizes the empirical risk minimization approach with KL divergence and mutual information, thereby enhancing long-term performance. We also explore the performance-cost tradeoff by identifying a Pareto front. To validate our approach, we build an FL testbed using Raspberry Pi4 devices. Experimental results corroborate the theoretical findings, confirming that drift patterns significantly affect performance. Our method consistently outperforms existing approaches for these three patterns, demonstrating its effectiveness in adapting to concept drift in FL.

联邦学习(FL)的近期研究通常对静态数据集进行训练。然而,真实世界数据通常以流体流形式出现,分布变化,导致性能退化,称为概念流,本文利用信息理论分析概念流的FL性能,并提出减少性能退化的算法。我们将漂浮概念建为Markov链,并引入“标准通用错误”来评估模型捕捉未来不可见数据特性的能力。它的上界是利用KL差异和相互信息得出的。我们研究了三种漂流模式(周期、渐进和随机)及其对FL性能的影响。我们为此提出一种算法,将实验风险最小化方法与KL差异和相互信息规范起来,从而提高长期性能。我们还通过确定Pareto前台来探索性能成本权衡。为了验证我们的方法,我们用 Raspberry Pi4 设备建立了一个FL 测试台。实验结果与理论结论相证实,证实了这种漂移模式对性能的显著影响。我们的方法始终超越了这三种模式的现有方法,并展示了在漂移法上的概念的有效性。
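The two ingredients named in the abstract, a Markov chain over drifting data distributions and KL divergence, can be illustrated with a small sketch. This is not the paper's Stationary Generalization Error or its regularized algorithm; the three "concepts", their label distributions, and the transition matrix below are hypothetical values chosen only for illustration.

```python
import math

def stationary_distribution(P, iters=1000):
    """Power-iterate a row-stochastic transition matrix to its stationary distribution."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions, assuming q > 0 wherever p > 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)

# Hypothetical concepts: each is a label distribution over two classes.
concepts = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]

# Hypothetical gradual drift: the data source mostly stays in its current
# concept and occasionally moves to an adjacent one.
P = [[0.8, 0.2, 0.0],
     [0.1, 0.8, 0.1],
     [0.0, 0.2, 0.8]]

pi = stationary_distribution(P)

# Long-run label distribution: mixture of the concepts weighted by pi.
long_run = [sum(pi[k] * concepts[k][c] for k in range(3)) for c in range(2)]

# How far each individual concept is from the long-run mixture.
gaps = [kl_divergence(concepts[k], long_run) for k in range(3)]
```

A model fitted only to the current concept is penalized by the corresponding gap when the chain moves on, which is the intuition behind regularizing toward long-run behaviour rather than the most recent data.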


Article 10

Title@2025-06-26 (4): BLOCKS: Blockchain-supported Cross-Silo Knowledge Sharing for Efficient LLM Services

Title: BLOCKS: Blockchain-supported Cross-Silo Knowledge Sharing for Efficient LLM Services BLOCKS: Blockchain-gestützter Cross-Silo-Wissensaustausch für effiziente LLM-Dienste BLOCKS:为高效率的LLM服务进行链链式支持的跨SIlo知识共享 2506.21033v1

Authors (7): Zhaojiacheng Zhou, Hongze Liu, Shijing Yuan, Hanning Zhang, Jiong Lou, Chentao Wu, Jie Li

The hallucination problem of Large Language Models (LLMs) has increasingly drawn attention. Augmenting LLMs with external knowledge is a promising solution to address this issue. However, due to privacy and security concerns, a vast amount of downstream task-related knowledge remains dispersed and isolated across various “silos,” making it difficult to access. To bridge this knowledge gap, we propose a blockchain-based external knowledge framework that coordinates multiple knowledge silos to provide reliable foundational knowledge for large model retrieval while ensuring data security. Technically, we distill knowledge from local data into prompts and execute transactions and records on the blockchain. Additionally, we introduce a reputation mechanism and cross-validation to ensure knowledge quality and provide incentives for participation. Furthermore, we design a query generation framework that provides a direct API interface for large model retrieval. To evaluate the performance of our proposed framework, we conducted extensive experiments on various knowledge sources. The results demonstrate that the proposed framework achieves efficient LLM service knowledge sharing in blockchain environments.

大语言模型(LLMs)的幻觉问题日益引起人们的注意。增加具有外部知识的LLMs是解决这一问题的一个大有希望的解决办法。然而,由于隐私和安全方面的考虑,大量下游任务相关知识仍然分散在各种“筒仓”中,并被孤立在不同的“筒仓”中,因此难以获取。为了缩小这一知识差距,我们提议了一个基于链条的外部知识框架,以协调多种知识仓,为大型模型检索提供可靠的基础知识,同时确保数据安全。技术上,我们将当地数据的知识提炼为快速数据,并在链链中进行交易和记录。此外,我们引入了一种名声机制和交叉验证,以确保知识质量,并为参与提供激励。此外,我们设计了一个为大型模型检索提供直接的API界面的生成查询框架。为了评估我们拟议框架的绩效,我们在许多知识源上进行了广泛的实验。结果显示,拟议的框架实现了在链锁环境中高效的LM服务知识共享。


Article 11

Title@2025-06-26 (4): Portable High-Performance Kernel Generation for a Computational Fluid Dynamics Code with DaCe

Title: Portable High-Performance Kernel Generation for a Computational Fluid Dynamics Code with DaCe Tragbare Hochleistungs-Kernel-Generation für einen numerischen Fluid-Dynamik-Code mit DaCe DaCe 计算流流体动态代码的可携高性能核心生成器 2506.20994v1

Authors (4): Måns I. Andersson, Martin Karp, Niclas Jansson, Stefano Markidis

With the emergence of new high-performance computing (HPC) accelerators, such as Nvidia and AMD GPUs, efficiently targeting diverse hardware architectures has become a major challenge for HPC application developers. The increasing hardware diversity in HPC systems often necessitates the development of architecture-specific code, hindering the sustainability of large-scale scientific applications. In this work, we leverage DaCe, a data-centric parallel programming framework, to automate the generation of high-performance kernels. DaCe enables automatic code generation for multicore processors and various accelerators, reducing the burden on developers who would otherwise need to rewrite code for each new architecture. Our study demonstrates DaCe’s capabilities by applying its automatic code generation to a critical computational kernel used in Computational Fluid Dynamics (CFD). Specifically, we focus on Neko, a Fortran-based solver that employs the spectral-element method, which relies on small tensor operations. We detail the formulation of this computational kernel using DaCe’s Stateful Dataflow Multigraph (SDFG) representation and discuss how this approach facilitates high-performance code generation. Additionally, we outline the workflow for seamlessly integrating DaCe’s generated code into the Neko solver. Our results highlight the portability and performance of the generated code across multiple platforms, including Nvidia GH200, Nvidia A100, and AMD MI250X GPUs, with competitive performance results. By demonstrating the potential of automatic code generation, we emphasise the feasibility of using portable solutions to ensure the long-term sustainability of large-scale scientific applications.

随着 Nvidia 和 AMD GPU 等新型高性能计算(HPC)加速器的出现,高效地面向多种硬件架构已成为 HPC 应用开发者面临的一项重大挑战。HPC 系统中硬件多样性的不断增加往往需要开发特定于架构的代码,从而妨碍了大规模科学应用的可持续性。在这项工作中,我们利用以数据为中心的并行编程框架 DaCe 来自动生成高性能内核。DaCe 能够为多核处理器和各种加速器自动生成代码,减轻了开发者原本需要为每种新架构重写代码的负担。我们的研究将 DaCe 的自动代码生成应用于计算流体力学(CFD)中的一个关键计算内核,以展示其能力。具体而言,我们关注 Neko,这是一个基于 Fortran、采用谱元法的求解器,而谱元法依赖于小规模张量运算。我们详细说明了如何使用 DaCe 的有状态数据流多重图(SDFG)表示来构建这一计算内核,并讨论了这种方法如何促进高性能代码的生成。此外,我们概述了将 DaCe 生成的代码无缝集成到 Neko 求解器中的工作流程。我们的结果突出了生成代码在多个平台上的可移植性和性能,这些平台包括 Nvidia GH200、Nvidia A100 和 AMD MI250X GPU,并取得了有竞争力的性能结果。通过展示自动代码生成的潜力,我们强调了使用可移植方案来确保大规模科学应用长期可持续性的可行性。


Article 12

Title@2025-06-26 (4): ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks

Title: ParEval-Repo: A Benchmark Suite for Evaluating LLMs with Repository-level HPC Translation Tasks ParEval-Repo: Eine Benchmark-Suite zur Bewertung von LLMs mit HPC-Übersetzungsaufgaben auf Repository-Ebene PaarEval-Repo:评价拥有仓库级高常委会翻译任务的LLMLM 基准套件 2506.20938v1

Authors (4): Joshua H. Davis, Daniel Nichols, Ishan Khillan, Abhinav Bhatele

GPGPU architectures have become significantly diverse in recent years, which has led to the emergence of a variety of specialized programming models and software stacks to support them. While portable execution models exist, they still require significant developer effort to port to and optimize for different hardware architectures. Recent advances in large language models (LLMs) can help us reduce some of this programmer burden. In this paper, we present a novel benchmark and testing framework, ParEval-Repo, which can be used to evaluate the efficacy of LLM-based approaches in automatically translating entire codebases across GPGPU execution models. ParEval-Repo includes several scientific computing and AI mini-applications across a range of programming models and levels of repository complexity. We use ParEval-Repo to evaluate a range of state-of-the-art open-source and commercial LLMs, with both a non-agentic and a top-down agentic approach. We assess code generated by the LLMs and approaches in terms of compilability, functional correctness, categories of build errors, and the cost of translation in terms of the number of inference tokens. Our results demonstrate that LLM translation of scientific applications is feasible for small programs, but difficulties with generating functional build systems and handling cross-file dependencies pose challenges in scaling to larger codebases.

近年来,GPGPU 架构变得十分多样,由此出现了多种专门的编程模型和软件栈来支持它们。虽然存在可移植的执行模型,但要移植到不同硬件架构并针对其进行优化,仍需要开发者付出大量努力。大型语言模型(LLM)的最新进展可以帮助我们减轻部分程序员负担。在本文中,我们提出了一个新的基准和测试框架 ParEval-Repo,可用于评估基于 LLM 的方法在 GPGPU 执行模型之间自动翻译整个代码库的效果。ParEval-Repo 包含若干科学计算和 AI 小型应用,涵盖多种编程模型和不同程度的仓库复杂度。我们使用 ParEval-Repo 评估了一系列最先进的开源和商业 LLM,同时采用非智能体方法和自上而下的智能体方法。我们从可编译性、功能正确性、构建错误的类别以及以推理 token 数量衡量的翻译成本等方面,评估 LLM 和各方法生成的代码。我们的结果表明,LLM 对科学应用的翻译对小型程序是可行的,但生成可用的构建系统和处理跨文件依赖方面的困难,给扩展到更大的代码库带来了挑战。


Article 13

Title@2025-06-25 (3): Survey: Graph Databases

Title: Survey: Graph Databases Erhebung: Graphische Datenbanken 调查:图表数据库 2505.24758v2

Authors (4): Miguel E. Coimbra, Lucie Svitáková, Alexandre P. Francisco, Luís Veiga

Graph databases have become essential tools for managing complex and interconnected data, which is common in areas like social networks, bioinformatics, and recommendation systems. Unlike traditional relational databases, graph databases offer a more natural way to model and query intricate relationships, making them particularly effective for applications that demand flexibility and efficiency in handling interconnected data. Despite their increasing use, graph databases face notable challenges. One significant issue is the irregular nature of graph data, often marked by structural sparsity, such as in its adjacency matrix representation, which can lead to inefficiencies in data read and write operations. Other obstacles include the high computational demands of traversal-based queries, especially within large-scale networks, and complexities in managing transactions in distributed graph environments. Additionally, the reliance on traditional centralized architectures limits the scalability of Online Transaction Processing (OLTP), creating bottlenecks due to contention, CPU overhead, and network bandwidth constraints. This paper presents a thorough survey of graph databases. It begins by examining property models, query languages, and storage architectures, outlining the foundational aspects that users and developers typically engage with. Following this, it provides a detailed analysis of recent advancements in graph database technologies, evaluating these in the context of key aspects such as architecture, deployment, usage, and development, which collectively define the capabilities of graph database solutions.

图数据库已成为管理复杂且相互关联的数据的重要工具,这类数据在社交网络、生物信息学和推荐系统等领域十分常见。与传统的关系数据库不同,图数据库提供了一种更自然的方式来建模和查询复杂关系,因此对于要求灵活、高效地处理关联数据的应用尤其有效。尽管使用日益广泛,图数据库仍面临显著挑战。一个重要问题是图数据的不规则性,其往往表现为结构稀疏(例如在邻接矩阵表示中),这可能导致数据读写操作效率低下。其他障碍包括基于遍历的查询的高计算需求(尤其是在大规模网络中),以及在分布式图环境中管理事务的复杂性。此外,对传统集中式架构的依赖限制了联机事务处理(OLTP)的可扩展性,并因争用、CPU 开销和网络带宽限制而形成瓶颈。本文对图数据库进行了全面综述。文章首先考察属性模型、查询语言和存储架构,概述用户和开发者通常接触的基础方面。随后,文章详细分析了图数据库技术的最新进展,并从架构、部署、使用和开发等关键方面对其进行评估,这些方面共同决定了图数据库解决方案的能力。


Article 14

Title@2025-06-25 (3): SuperSONIC: Cloud-Native Infrastructure for ML Inferencing

Title: SuperSONIC: Cloud-Native Infrastructure for ML Inferencing SuperSONIC: Cloud-Native Infrastruktur für die ML-Inferenzierung 用于ML推导的云源基础设施 2506.20657v1

Authors (10): Dmitry Kondratyev, Benedikt Riedel, Yuan-Tang Chou, Miles Cochran-Branson, Noah Paladino, David Schultz, Mia Liu, Javier Duarte, Philip Harris, Shih-Chieh Hsu

The increasing computational demand from growing data rates and complex machine learning (ML) algorithms in large-scale scientific experiments has driven the adoption of the Services for Optimized Network Inference on Coprocessors (SONIC) approach. SONIC accelerates ML inference by offloading it to local or remote coprocessors to optimize resource utilization. Leveraging its portability to different types of coprocessors, SONIC enhances data processing and model deployment efficiency for cutting-edge research in high energy physics (HEP) and multi-messenger astrophysics (MMA). We developed the SuperSONIC project, a scalable server infrastructure for SONIC, enabling the deployment of computationally intensive tasks to Kubernetes clusters equipped with graphics processing units (GPUs). Using NVIDIA Triton Inference Server, SuperSONIC decouples client workflows from server infrastructure, standardizing communication, optimizing throughput, load balancing, and monitoring. SuperSONIC has been successfully deployed for the CMS and ATLAS experiments at the CERN Large Hadron Collider (LHC), the IceCube Neutrino Observatory (IceCube), and the Laser Interferometer Gravitational-Wave Observatory (LIGO) and tested on Kubernetes clusters at Purdue University, the National Research Platform (NRP), and the University of Chicago. SuperSONIC addresses the challenges of the Cloud-native era by providing a reusable, configurable framework that enhances the efficiency of accelerator-based inference deployment across diverse scientific domains and industries.

在大规模科学实验中,不断增长的数据率和复杂的机器学习(ML)算法带来了日益增加的计算需求,这推动了“协处理器上的优化网络推理服务”(SONIC)方法的采用。SONIC 通过将 ML 推理卸载到本地或远程协处理器来加速推理,从而优化资源利用。借助其对不同类型协处理器的可移植性,SONIC 提高了高能物理(HEP)和多信使天体物理(MMA)前沿研究中的数据处理和模型部署效率。我们开发了 SuperSONIC 项目,这是一个面向 SONIC 的可扩展服务器基础设施,能够将计算密集型任务部署到配备图形处理器(GPU)的 Kubernetes 集群上。SuperSONIC 使用 NVIDIA Triton Inference Server,将客户端工作流与服务器基础设施解耦,实现通信标准化,并优化吞吐量、负载均衡和监控。SuperSONIC 已成功部署用于 CERN 大型强子对撞机(LHC)的 CMS 和 ATLAS 实验、冰立方中微子天文台(IceCube)以及激光干涉引力波天文台(LIGO),并在普渡大学、National Research Platform(NRP)和芝加哥大学的 Kubernetes 集群上进行了测试。SuperSONIC 提供了一个可复用、可配置的框架,提高了基于加速器的推理部署在不同科学领域和行业中的效率,从而应对云原生时代的挑战。


Article 15

Title@2025-06-25 (3): Hear No Evil: Detecting Gradient Leakage by Malicious Servers in Federated Learning

Title: Hear No Evil: Detecting Gradient Leakage by Malicious Servers in Federated Learning Hear No Evil: Detecting Gradient Leakage by Malicious Servers in Federated Learning 听不见邪恶:在联邦学习中发现恶意服务器的渐变渗漏 2506.20651v1

Authors (2): Fei Wang, Baochun Li

Recent work has shown that gradient updates in federated learning (FL) can unintentionally reveal sensitive information about a client’s local data. This risk becomes significantly greater when a malicious server manipulates the global model to provoke information-rich updates from clients. In this paper, we adopt a defender’s perspective to provide the first comprehensive analysis of malicious gradient leakage attacks and the model manipulation techniques that enable them. Our investigation reveals a core trade-off: these attacks cannot be both highly effective in reconstructing private data and sufficiently stealthy to evade detection – especially in realistic FL settings that incorporate common normalization techniques and federated averaging. Building on this insight, we argue that malicious gradient leakage attacks, while theoretically concerning, are inherently limited in practice and often detectable through basic monitoring. As a complementary contribution, we propose a simple, lightweight, and broadly applicable client-side detection mechanism that flags suspicious model updates before local training begins, despite the fact that such detection may not be strictly necessary in realistic FL settings. This mechanism further underscores the feasibility of defending against these attacks with minimal overhead, offering a deployable safeguard for privacy-conscious federated learning systems.

最近的工作表明,联邦学习(FL)中的梯度更新会无意中透露有关客户本地数据的敏感信息。当恶意服务器操纵全球模型以激起客户提供丰富信息的最新信息时,这种风险会大得多。在本文件中,我们从捍卫者的角度出发,首次全面分析恶意梯度渗漏袭击和使这些袭击得以使用的模型操纵技术。我们的调查揭示了一个核心的权衡:这些袭击在重建私人数据时不可能非常有效,也不可能有足够的隐性以躲避探测,特别是在现实的FL环境中,这种环境包含共同的正常化技术和平均联盟化。基于这一洞察,我们认为,恶意梯度渗漏袭击虽然理论上涉及,但在实践上必然有限,而且往往可以通过基本监测探测出来。作为补充,我们建议了一个简单、轻量和广泛适用的客户方检测机制,在地方培训开始前标出可疑的模型更新信号,尽管在现实的FL环境中可能并非绝对必要。这一机制进一步强调了以最低的间接费用来防范这些袭击的可行性,为有隐私意识的进化学习系统提供可部署的保障。


Article 16

Title@2025-06-25 (3): On the Encoding Process in Decentralized Systems

Title: On the Encoding Process in Decentralized Systems Zum Kodierungsprozess in dezentralisierten Systemen 分权系统编码进程 2408.15203v2

Authors (2): Canran Wang, Netanel Raviv

We consider the problem of encoding information in a system of N=K+R processors that operate in a decentralized manner, i.e., without a central processor which orchestrates the operation. The system involves K source processors, each holding some data modeled as a vector over a finite field. The remaining R processors are sinks, each of which requires a linear combination of all data vectors. These linear combinations are distinct from one sink processor to another, and are specified by a generator matrix of a systematic linear error correcting code. To capture the communication cost of decentralized encoding, we adopt a linear network model in which the process proceeds in consecutive communication rounds. In every round, every processor sends and receives one message through each one of its p ports. Moreover, inspired by linear network coding literature, we allow processors to transfer linear combinations of their own data and previously received data. We propose a framework that addresses the decentralized encoding problem on two levels. On the universal level, we provide a solution to the decentralized encoding problem for any possible linear code. On the specific level, we further optimize our solution towards systematic Reed-Solomon codes, as well as their variant, Lagrange codes, for their prevalent use in coded storage and computation systems. Our solutions are based on a newly-defined collective communication operation we call all-to-all encode.

我们考虑在一个由 N=K+R 个处理器组成、以去中心化方式运行(即没有中央处理器来协调操作)的系统中对信息进行编码的问题。该系统包含 K 个源处理器,每个处理器持有一些数据,这些数据被建模为有限域上的向量。其余 R 个处理器是汇(sink),每个汇都需要所有数据向量的一个线性组合。各个汇处理器所需的线性组合互不相同,并由一个系统线性纠错码的生成矩阵指定。为了刻画去中心化编码的通信成本,我们采用一种线性网络模型,其中过程按连续的通信轮次进行。在每一轮中,每个处理器通过其 p 个端口中的每一个发送并接收一条消息。此外,受线性网络编码文献的启发,我们允许处理器传输其自身数据与先前收到的数据的线性组合。我们提出了一个在两个层面上解决去中心化编码问题的框架。在通用层面上,我们为任意可能的线性码提供了去中心化编码问题的解决方案。在具体层面上,鉴于系统 Reed-Solomon 码及其变体 Lagrange 码在编码存储和计算系统中的广泛使用,我们针对这些码进一步优化了解决方案。我们的解决方案基于一种新定义的集合通信操作,我们称之为 all-to-all encode。
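To make the problem statement concrete, the sketch below computes what each sink must end up holding when the code is systematic, i.e. when the generator matrix has the form G = [I | A] and sink r needs the combination given by column r of A. It is a centralized reference computation, not the paper's all-to-all encode operation; the prime 257, the matrix A, and the data values are hypothetical.

```python
P_MOD = 257  # hypothetical prime; arithmetic is over the finite field GF(257)

def sink_targets(data, A, p=P_MOD):
    """Return, for each of the R sinks, the linear combination it requires.

    data: K source vectors of equal length over GF(p).
    A:    K x R coefficient matrix, the non-identity part of G = [I | A].
    """
    K, R, length = len(A), len(A[0]), len(data[0])
    return [
        [sum(A[k][r] * data[k][t] for k in range(K)) % p for t in range(length)]
        for r in range(R)
    ]

def scaled(vec, coeff, p=P_MOD):
    """Multiply a vector by a field element."""
    return [(coeff * x) % p for x in vec]

def added(u, v, p=P_MOD):
    """Add two vectors over GF(p)."""
    return [(a + b) % p for a, b in zip(u, v)]

# K = 3 sources, R = 2 sinks (hypothetical values).
data = [[1, 2], [3, 4], [5, 6]]
A = [[1, 1],
     [1, 2],
     [1, 4]]
targets = sink_targets(data, A)

# Linearity lets an intermediate processor forward a partial combination:
# sources 0 and 1 are merged first, then source 2's contribution is added.
partial = added(scaled(data[0], A[0][1]), scaled(data[1], A[1][1]))
relayed = added(partial, scaled(data[2], A[2][1]))
```

The last two lines show why the network model allows processors to send combinations of their own and previously received data: a sink's target can be assembled from partial sums in any order, so messages can be merged along the way instead of being delivered one source at a time.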


Article 17

Title@2025-06-25 (3): Vertex addition to a ball graph with application to reliability and area coverage in autonomous swarms

Title: Vertex addition to a ball graph with application to reliability and area coverage in autonomous swarms Vertex Ergänzung zu einem Kugelgraphen mit Anwendung zur Zuverlässigkeit und Flächenabdeckung in autonomen Schwarms 使用自动群群群的可靠性和覆盖面积的球图的顶点添加 2506.19197v2

Authors (4): Calum Buchanan, Puck Rombach, James Bagrow, Hamid R. Ossareh

A unit ball graph consists of a set of vertices, labeled by points in Euclidean space, and edges joining all pairs of points within distance $1$. These geometric graphs are used to model a variety of spatial networks, including communication networks between agents in an autonomous swarm. In such an application, vertices and/or edges of the graph may not be perfectly reliable; an agent may experience failure or a communication link may be rendered inoperable. With the goal of designing robust swarm formations, or unit ball graphs with high reliability (probability of connectedness), in a preliminary conference paper we provided an algorithm with cubic time complexity to determine all possible changes to a unit ball graph by repositioning a single vertex. Using this algorithm and Monte Carlo simulations, one obtains an efficient method to modify a unit ball graph by moving a single vertex to a location which maximizes the reliability. Another important consideration in many swarm missions is area coverage, yet highly reliable ball graphs often contain clusters of vertices. Here, we generalize our previous algorithm to improve area coverage as well as reliability. Our algorithm determines a location to add or move a vertex within a unit ball graph which maximizes the reliability, under the constraint that no other vertices of the graph be within some fixed distance. We compare this method of obtaining graphs with high reliability and evenly distributed area coverage to another method which uses a modified Fruchterman-Reingold algorithm for ball graphs.

单位球图由一组以欧几里得空间中的点标记的顶点,以及连接距离在 $1$ 以内的所有点对的边组成。这类几何图被用于为多种空间网络建模,包括自主集群中各智能体之间的通信网络。在这类应用中,图的顶点和/或边可能并不完全可靠:智能体可能发生故障,通信链路也可能失效。为了设计稳健的集群编队,即具有高可靠性(连通概率)的单位球图,我们在一篇初步的会议论文中给出了一个具有三次时间复杂度的算法,用于确定通过重新放置单个顶点可以对单位球图造成的所有可能变化。利用该算法和蒙特卡洛模拟,可以得到一种高效的方法,通过把单个顶点移动到使可靠性最大化的位置来修改单位球图。许多集群任务中的另一个重要考虑因素是区域覆盖,然而高可靠性的球图往往包含顶点聚集的簇。在本文中,我们推广了此前的算法,使其在提高可靠性的同时改善区域覆盖。我们的算法在图中其他顶点都不位于某一固定距离之内的约束下,确定在单位球图中添加或移动一个顶点的位置,以使可靠性最大化。我们将这种获得高可靠性且区域覆盖均匀的图的方法,与另一种使用针对球图修改的 Fruchterman-Reingold 算法的方法进行了比较。
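The definitions in this abstract (unit ball graph, reliability as probability of connectedness, Monte Carlo estimation) can be sketched as follows. The failure model is an assumption made for illustration: each edge survives independently with the same probability and vertices never fail. The vertex-placement algorithm of the paper is not reproduced here.

```python
import itertools
import math
import random

def unit_ball_graph(points, radius=1.0):
    """Edges between all pairs of points at Euclidean distance <= radius."""
    return [
        (i, j)
        for i, j in itertools.combinations(range(len(points)), 2)
        if math.dist(points[i], points[j]) <= radius
    ]

def is_connected(n, edges):
    """Depth-first search from vertex 0 over n >= 1 vertices."""
    adjacency = {v: [] for v in range(n)}
    for i, j in edges:
        adjacency[i].append(j)
        adjacency[j].append(i)
    seen, stack = {0}, [0]
    while stack:
        v = stack.pop()
        for w in adjacency[v]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n

def estimate_reliability(n, edges, edge_up_prob, trials, rng):
    """Monte Carlo estimate of the probability that the surviving graph is connected.

    Assumption: edges fail independently, each surviving with edge_up_prob.
    """
    hits = 0
    for _ in range(trials):
        alive = [e for e in edges if rng.random() < edge_up_prob]
        hits += is_connected(n, alive)
    return hits / trials

# Three agents on a line: only neighbouring agents are within distance 1.
points = [(0.0, 0.0), (0.9, 0.0), (1.8, 0.0)]
edges = unit_ball_graph(points)
estimate = estimate_reliability(len(points), edges, 0.5, 4000, random.Random(0))
```

In this path-shaped example both edges must survive, so the exact reliability is 0.5 × 0.5 = 0.25 and the estimate should land close to that value. Moving the middle agent cannot add a third edge here, whereas pulling the outer agents closer together can, which is the kind of change a vertex-repositioning search evaluates.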


Article 18

Title@2025-06-25 (3): WattsOnAI: Measuring, Analyzing, and Visualizing Energy and Carbon Footprint of AI Workloads

Title: WattsOnAI: Measuring, Analyzing, and Visualizing Energy and Carbon Footprint of AI Workloads WattsOnAI: Messen, Analysieren und Visualisieren von Energie und Carbon Footprint von KI-Workloads WattsOnAI:AI工作量的测量、分析、可视化能源和碳足迹 2506.20535v1

Authors (5): Hongzhen Huang, Kunming Zhang, Hanlong Liao, Kui Wu, Guoming Tang

The rapid advancement of AI, particularly large language models (LLMs), has raised significant concerns about the energy use and carbon emissions associated with model training and inference. However, existing tools for measuring and reporting such impacts are often fragmented, lacking systematic metric integration and offering limited support for correlation analysis among them. This paper presents WattsOnAI, a comprehensive software toolkit for the measurement, analysis, and visualization of energy use, power draw, hardware performance, and carbon emissions across AI workloads. By seamlessly integrating with existing AI frameworks, WattsOnAI offers standardized reports and exports fine-grained time-series data to support benchmarking and reproducibility in a lightweight manner. It further enables in-depth correlation analysis between hardware metrics and model performance and thus facilitates bottleneck identification and performance enhancement. By addressing critical limitations in existing tools, WattsOnAI encourages the research community to weigh environmental impact alongside raw performance of AI workloads and advances the shift toward more sustainable “Green AI” practices. The code is available at https://github.com/SusCom-Lab/WattsOnAI.

AI的快速发展,特别是大型语言模型(LLMs),引起了人们对与模型培训和推论有关的能源使用和碳排放的严重关切,然而,衡量和报告这些影响的现有工具往往支离破碎,缺乏系统的标准化整合,对相互关系分析的支持有限。本文介绍了WattsOnAI,能源使用、电力抽取、硬件性能和碳排放计量、分析和可视化的综合软件工具包。WattsOnAI与现有的AI框架紧密结合,提供标准化报告和出口精细的实时系列数据,以支持基准制定和以轻量度方式再生。它进一步使得硬件计量和模型性能之间的深入分析能够促进瓶颈识别和绩效提高。WattsOnAI通过解决现有工具中的关键局限性,鼓励研究界与AI工作量的原始表现一道权衡环境影响,并推动转向更可持续的“绿色AI”做法。该代码可在 https://github.com/SusCom-Lab/WattsOnAI 上查阅。


Article 19

Title@2025-06-25 (3): Collaborative Batch Size Optimization for Federated Learning

Title: Collaborative Batch Size Optimization for Federated Learning Kollaborative Batch-Größenoptimierung für Federated Learning 联邦学习联合会的合作批量数量优化 2506.20511v1

Authors (3): Arno Geimer, Karthick Panner Selvam, Beltran Fiz Pontiveros

Federated Learning (FL) is a decentralized collaborative Machine Learning framework for training models without collecting data in a centralized location. It has seen application across various disciplines, from helping medical diagnoses in hospitals to detecting fraud in financial transactions. In this paper, we focus on improving the local training process through hardware usage optimization. While participants in a federation might share the hardware they are training on, since there is no information exchange between them, their training process can be hindered by an improper training configuration. Taking advantage of the parallel processing inherent to Federated Learning, we use a greedy randomized search to optimize local batch sizes for the best training settings across all participants. Our results show that against default parameter settings, our method improves convergence speed while staying nearly on par with the case where local parameters are optimized.

联邦学习联合会(FL)是一个分散合作的机械学习框架,用于培训模式,而不在集中地点收集数据,它在不同学科应用,从帮助医院的医疗诊断到发现金融交易中的欺诈。在本文中,我们侧重于通过硬件使用优化改善当地培训过程。联邦参与者可能分享他们正在接受培训的硬件,因为他们之间没有信息交流,他们的培训过程可能受到不适当的培训配置的阻碍。我们利用联邦学习协会固有的平行处理,利用贪婪随机搜索,优化所有参与者的最佳培训环境的本地批量规模。我们的结果显示,在默认参数设置下,我们的方法可以提高趋同速度,同时与优化当地参数的情况保持接近一致。
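The abstract names a greedy randomized search over local batch sizes but gives no further detail, so the following is only a generic sketch of that family of methods. The candidate list, the neighbour rule (move one step up or down the list), and the synthetic cost function standing in for a measured training cost are all hypothetical.

```python
import random

def greedy_randomized_batch_search(cost, candidates, steps, rng):
    """Greedy randomized local search over an ordered list of batch sizes.

    Starts at a random candidate, repeatedly proposes a random neighbour,
    and moves only when the proposal has strictly lower cost.
    """
    idx = rng.randrange(len(candidates))
    best_cost = cost(candidates[idx])
    for _ in range(steps):
        neighbour = idx + rng.choice((-1, 1))
        if not 0 <= neighbour < len(candidates):
            continue  # proposal fell off the end of the candidate list
        proposal_cost = cost(candidates[neighbour])
        if proposal_cost < best_cost:
            idx, best_cost = neighbour, proposal_cost
    return candidates[idx], best_cost

def synthetic_cost(batch_size):
    """Hypothetical stand-in for measured time per sample on one participant.

    Small batches pay per-step overhead, large batches pay memory pressure.
    """
    return 64.0 / batch_size + batch_size / 64.0

candidates = [2 ** k for k in range(11)]  # 1, 2, 4, ..., 1024
best_size, best_cost = greedy_randomized_batch_search(
    synthetic_cost, candidates, steps=200, rng=random.Random(0)
)
```

In a federation, each participant could run such a search against its own hardware during regular training rounds, since no information about other participants' configurations is needed. With a real, noisy cost measurement the strict-improvement rule would need averaging or a tolerance.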


Article 20

Title@2025-06-25 (3): MARCO: Multi-Agent Code Optimization with Real-Time Knowledge Integration for High-Performance Computing

Title: MARCO: Multi-Agent Code Optimization with Real-Time Knowledge Integration for High-Performance Computing MARCO: Multi-Agent Code-Optimierung mit Echtzeit-Knowledge Integration für High-Performance Computing MARCO: 利用实时知识整合优化多机构代码,促进高绩效计算 2505.03906v3

Authors (10): Asif Rahman, Veljko Cvetkovic, Kathleen Reece, Aidan Walters, Yasir Hassan, Aneesh Tummeti, Bryan Torres, Denise Cooney, Margaret Ellis, Dimitrios S. Nikolopoulos

Large language models (LLMs) have transformed software development through code generation capabilities, yet their effectiveness for high-performance computing (HPC) remains limited. HPC code requires specialized optimizations for parallelism, memory efficiency, and architecture-specific considerations that general-purpose LLMs often overlook. We present MARCO (Multi-Agent Reactive Code Optimizer), a novel framework that enhances LLM-generated code for HPC through a specialized multi-agent architecture. MARCO employs separate agents for code generation and performance evaluation, connected by a feedback loop that progressively refines optimizations. A key innovation is MARCO’s web-search component that retrieves real-time optimization techniques from recent conference proceedings and research publications, bridging the knowledge gap in pre-trained LLMs. Our extensive evaluation on the LeetCode 75 problem set demonstrates that MARCO achieves a 14.6% average runtime reduction compared to Claude 3.5 Sonnet alone, while the integration of the web-search component yields a 30.9% performance improvement over the base MARCO system. These results highlight the potential of multi-agent systems to address the specialized requirements of high-performance code generation, offering a cost-effective alternative to domain-specific model fine-tuning.

大型语言模型(LLM)通过代码生成能力改变了软件开发,但其在高性能计算(HPC)方面的效果仍然有限。HPC 代码需要针对并行性、内存效率和特定架构进行专门优化,而通用 LLM 往往忽略这些方面。我们提出了 MARCO(Multi-Agent Reactive Code Optimizer),这是一个新颖的框架,通过专门的多智能体架构改进 LLM 为 HPC 生成的代码。MARCO 使用相互独立的智能体分别负责代码生成和性能评估,二者通过一个逐步完善优化的反馈回路相连。一项关键创新是 MARCO 的网络搜索组件,它从最近的会议论文集和研究出版物中检索实时优化技术,弥补预训练 LLM 的知识差距。我们在 LeetCode 75 问题集上的广泛评估表明,与单独使用 Claude 3.5 Sonnet 相比,MARCO 的平均运行时间减少了 14.6%,而集成网络搜索组件后,性能比基础 MARCO 系统提高了 30.9%。这些结果突出了多智能体系统满足高性能代码生成专门需求的潜力,为特定领域的模型微调提供了一种具有成本效益的替代方案。


Article 21

Title@2025-06-25 (3): When Servers Meet Species: A Fab-to-Grave Lens on Computing’s Biodiversity Impact

Title: When Servers Meet Species: A Fab-to-Grave Lens on Computing’s Biodiversity Impact Wenn Server Arten treffen: Eine Fab-to-Grave-Lens für die Biodiversitätswirkung von Computing 当服务器与物种相遇时:关于计算机的生物多样性影响的一个从宽到宽的镜头 2506.20442v1

Authors (4): Tianyao Shi, Ritbik Kumar, Inez Hua, Yi Ding

Biodiversity loss is a critical planetary boundary, yet its connection to computing remains largely unexamined. Prior sustainability efforts in computing have focused on carbon and water, overlooking biodiversity due to the lack of appropriate metrics and modeling frameworks. This paper presents the first end-to-end analysis of biodiversity impact from computing systems. We introduce two new metrics–Embodied Biodiversity Index (EBI) and Operational Biodiversity Index (OBI)–to quantify biodiversity impact across the lifecycle, and present FABRIC, a modeling framework that links computing workloads to biodiversity impacts. Our evaluation highlights the need to consider biodiversity alongside carbon and water in sustainable computing design and optimization. The code is available at https://github.com/TianyaoShi/FABRIC.

生物多样性丧失是一条关键的行星边界,但它与计算的联系在很大程度上仍未得到研究。以往计算领域的可持续性工作侧重于碳和水,由于缺乏合适的度量指标和建模框架而忽视了生物多样性。本文首次对计算系统的生物多样性影响进行了端到端分析。我们引入了两个新指标,即隐含生物多样性指数(Embodied Biodiversity Index, EBI)和运行生物多样性指数(Operational Biodiversity Index, OBI),用于量化整个生命周期中的生物多样性影响,并提出了 FABRIC,这是一个将计算工作负载与生物多样性影响联系起来的建模框架。我们的评估强调,在可持续计算的设计和优化中,需要将生物多样性与碳和水一并考虑。代码见 https://github.com/TianyaoShi/FABRIC。


Article 22

Title@2025-06-25 (3): PAT: a new algorithm for all-gather and reduce-scatter operations at scale

Title: PAT: a new algorithm for all-gather and reduce-scatter operations at scale PAT: ein neuer Algorithmus für All-Gather- und Reduce-Scatter-Operationen im Maßstab PAT: 大规模全采集和减少散散作业的新算法 2506.20252v1

Authors (1): Sylvain Jeaugey

This paper describes a new algorithm called PAT (Parallel Aggregated Trees), which can be used to implement all-gather and reduce-scatter operations. The algorithm works on any number of ranks, uses a logarithmic number of network transfers for small operations, minimizes long-distance communication, and requires a logarithmic amount of internal buffers, independent of the total operation size. It aims to improve the performance of the NCCL library in cases where the ring algorithm is inefficient, since the ring's linear latency performs poorly for small sizes and/or at scale.

本文描述了一种新的算法,称为PAT,用于平行集成树,可用于实施全采集和减少散射操作。这种算法在任何级别上都有作用,为小型操作提供网络传输的对数,最大限度地减少长距离通信,需要内部缓冲的对数,与总操作规模无关,目的是在环算法效率不高的情况下改进NCCL图书馆的性能,因为其线性延缓度将显示小尺寸和(或)规模的性能不佳。
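To make the "logarithmic versus linear" contrast in the abstract concrete, here is a minimal single-process sketch. It is not PAT: it simulates the classic Bruck-style dissemination all-gather, which also works for any number of ranks and needs about log2(n) transfer steps, and compares that with the n - 1 steps of a ring. The function names are hypothetical, and unlike PAT this scheme makes no attempt to limit long-distance communication or buffer usage.

```python
def ring_steps(n: int) -> int:
    """Number of communication steps a ring all-gather needs for n ranks."""
    return max(n - 1, 0)


def bruck_all_gather(n: int):
    """Simulate a Bruck-style all-gather over n ranks (illustration only, not PAT).

    buf[r] lists the block ids held by rank r, in the order r, r+1, ... (mod n).
    In the step with distance d, rank r receives the first min(d, n - d) blocks
    currently held by rank (r + d) % n and appends them to its own buffer.
    """
    buf = [[r] for r in range(n)]
    steps = 0
    d = 1
    while d < n:
        count = min(d, n - d)  # the last step may need fewer than d blocks
        buf = [buf[r] + buf[(r + d) % n][:count] for r in range(n)]
        d *= 2
        steps += 1
    return buf, steps


# Example: 5 ranks need 3 dissemination steps, versus 4 steps on a ring.
blocks, steps = bruck_all_gather(5)
print(steps, ring_steps(5))
```

The gap widens quickly with scale: for 1024 ranks the sketch takes 10 steps where a ring takes 1023, which is the latency regime the abstract says PAT targets.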


Article 23

Title@2025-06-25 (3): The Blind Men and the Elephant: Mapping Interdisciplinarity in Research on Decentralized Autonomous Organizations

Title: The Blind Men and the Elephant: Mapping Interdisciplinarity in Research on Decentralized Autonomous Organizations Blinde Männer und Elefant: Interdisziplinarität in der Forschung über dezentralisierte autonome Organisationen kartieren 盲人和大象:绘制分权自治组织研究的多元性图 2502.09949v2

Authors (3): Giorgia Sampò, Oliver Baumann, Marco Peressotti

Decentralized Autonomous Organizations (DAOs) are attracting interdisciplinary interest, particularly in business, economics, and computer science. However, much like the parable of the blind men and the elephant, where each observer perceives only a fragment of the whole, DAO research remains fragmented across disciplines, limiting a comprehensive understanding of their potential. This paper assesses the maturity of interdisciplinary research on DAOs by analyzing knowledge flows between Business & Economics and Computer Science through citation network analysis, topic modelling, and outlet analysis. Our findings reveal that while DAOs serve as a vibrant topic of interdisciplinary discourse, current research remains predominantly applied and case-driven, with limited theoretical integration. Strengthening the alignment between organizational and technical insights is crucial for advancing DAO research and fostering a more cohesive interdisciplinary framework.

分权自治组织(DAO)吸引了跨学科的兴趣,特别是在商业、经济学和计算机科学方面,然而,与盲人和大象的比喻(每个观察者只看到整体的碎片)一样,DAO的研究在各学科之间仍然支离破碎,限制了对其潜力的全面了解;本文件通过引言网络分析、专题建模和外联网分析,分析企业和经济学与计算机科学之间的知识流动,评估DAO的跨学科研究的成熟程度;我们的调查结果显示,虽然DAO是跨学科讨论的一个活跃的话题,但目前的研究仍然以应用和个案为主,理论整合有限;加强组织和技术见解之间的协调一致,对于推进DAO的研究和促进更具凝聚力的跨学科框架至关重要。


Article 24

Title@2025-06-25 (3): LiteGD: Lightweight and Dynamic GPU Dispatching for Large-scale Heterogeneous Clusters

Title: LiteGD: Lightweight and Dynamic GPU Dispatching for Large-scale Heterogeneous Clusters LiteGD: Leichte und dynamische GPU Dispatching für großflächige Heterogene Cluster LiteGD: 大型异源集束体轻量和动态GPU发射 2506.15595v2

Authors (3): Kunming Zhang, Hanlong Liao, Guoming Tang

Although multi-GPU execution has become the de-facto paradigm for training and serving large language models (LLMs), today’s schedulers still rely on a simple heuristic: pick GPUs that are physically close. This proximity rule was adequate for small, uniform clusters, but it breaks down in modern fabrics where link capacities differ by up to an order of magnitude across PCIe, NVLink, and CXL tiers. Consequently, jobs placed by locality alone often suffer from severe bandwidth imbalance and unpredictable performance. In this paper, we present LiteGD, a lightweight, globally-aware GPU dispatching system that delivers near-optimal bandwidth without incurring prohibitive state or search overheads. Instead of materializing the full O(N^2) connectivity matrix, LiteGD encodes the fabric with a sparsified Tiny-Transformer trained on a few thousand random bandwidth probes, enabling fast adaptation to incremental hardware changes. LiteGD also employs a bidirectional tree search over the data generated in the previous step to find a GPU dispatching plan, identifying near-optimal solutions while reducing search overhead. We implement and evaluate LiteGD in both real and simulated GPU clusters, with homogeneous and heterogeneous interconnects respectively. Experimental results demonstrate that LiteGD consistently achieves high GPU Bandwidth Efficacy, approximately 90% across various cluster configurations and 80% in a real-world H100 cluster, significantly outperforming conventional default and interconnect topology-aware dispatching methods, particularly in large-scale heterogeneous environments.

尽管多 GPU 执行已成为训练和部署大型语言模型(LLM)的事实标准,如今的调度器仍依赖一种简单的启发式规则:选择物理位置相近的 GPU。这一就近规则对小规模、同构的集群是足够的,但在现代互连结构中会失效,因为 PCIe、NVLink 和 CXL 各层级之间的链路容量相差可达一个数量级。因此,仅按局部性放置的作业往往遭受严重的带宽不均衡和不可预测的性能。本文提出 LiteGD,这是一个轻量级、具有全局感知能力的 GPU 调度系统,能够在不引入过高状态开销或搜索开销的情况下提供接近最优的带宽。LiteGD 不显式构建完整的 O(N^2) 连通性矩阵,而是用一个稀疏化的 Tiny-Transformer 对互连结构进行编码,该模型基于数千次随机带宽探测训练而成,可快速适应增量式的硬件变化。LiteGD 还采用双向树搜索方法,在上一步生成的数据中寻找最优的 GPU 调度方案,能够在降低搜索开销的同时找到接近最优的解。我们分别在具有同构互连的真实 GPU 集群和具有异构互连的模拟 GPU 集群上实现并评估了 LiteGD。实验结果表明,LiteGD 始终能取得较高的 GPU 带宽效能(GPU Bandwidth Efficacy),在各种集群配置下约为 90%,在真实的 H100 集群中为 80%,显著优于传统的默认调度方法和互连拓扑感知调度方法,在大规模异构环境中尤为明显。


Article 25

Title@2025-06-25 (3): On the $h$-majority dynamics with many opinions

Title: On the $h$-majority dynamics with many opinions Auf der $h$-Mehrheitsdynamik mit vielen Meinungen 关于以美元为多数的动态, 2506.20218v1

Authors (4): Francesco d’Amore, Niccolò D’Archivio, George Giakkoupis, Emanuele Natale

We present the first upper bound on the convergence time to consensus of the well-known $h$-majority dynamics with $k$ opinions, in the synchronous setting, for $h$ and $k$ that are both non-constant values. We suppose that, at the beginning of the process, there is some initial additive bias towards some plurality opinion, that is, there is an opinion that is supported by $x$ nodes while any other opinion is supported by strictly fewer nodes. We prove that, with high probability, if the bias is $\omega(\sqrt{x})$ and the initial plurality opinion is supported by at least $x = \omega(\log n)$ nodes, then the process converges to plurality consensus in $O(\log n)$ rounds whenever $h = \omega(n \log n / x)$. A main corollary is the following: if $k = o(n / \log n)$ and the process starts from an almost-balanced configuration with an initial bias of magnitude $\omega(\sqrt{n/k})$ towards the initial plurality opinion, then any function $h = \omega(k \log n)$ suffices to guarantee convergence to consensus in $O(\log n)$ rounds, with high probability. Our upper bound shows that the lower bound of $\Omega(k / h^2)$ rounds to reach consensus given by Becchetti et al. (2017) cannot be pushed further than $\widetilde{\Omega}(k / h)$. Moreover, the bias we require is asymptotically smaller than the $\Omega(\sqrt{n\log n})$ bias that guarantees plurality consensus in the $3$-majority dynamics: in our case, the required bias is at most any (arbitrarily small) function in $\omega(\sqrt{x})$ for any value of $k \ge 2$.

我们给出了同步模型下、具有 $k$ 种意见的著名 $h$-majority 动力学收敛到共识所需时间的第一个上界,其中 $h$ 与 $k$ 均可为非常数。假设在过程开始时,存在朝向某个相对多数意见的初始加性偏差,即有一种意见得到 $x$ 个节点支持,而其他任何意见的支持节点数都严格更少。我们证明:若偏差为 $\omega(\sqrt{x})$,且初始相对多数意见至少得到 $x = \omega(\log n)$ 个节点支持,则只要 $h = \omega(n \log n / x)$,该过程将以高概率在 $O(\log n)$ 轮内收敛到相对多数共识。一个主要推论是:若 $k = o(n / \log n)$,且过程从近似平衡的配置出发、对初始相对多数意见的初始偏差量级为 $\omega(\sqrt{n/k})$,则任意满足 $h = \omega(k \log n)$ 的函数 $h$ 都足以保证以高概率在 $O(\log n)$ 轮内收敛到共识。我们的上界表明,Becchetti 等人(2017)给出的达成共识所需 $\Omega(k / h^2)$ 轮的下界至多只能提升到 $\widetilde{\Omega}(k / h)$。此外,我们所需的偏差渐近地小于在 $3$-majority 动力学中保证相对多数共识的 $\Omega(\sqrt{n\log n})$ 偏差:在我们的情形下,对任意 $k \ge 2$,所需偏差至多是 $\omega(\sqrt{x})$ 中任意(任意小的)函数。
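The process being analysed is easy to simulate, which can help when reading the bounds. The sketch below assumes the usual formulation of the synchronous $h$-majority dynamics, which the abstract does not spell out: in every round each node samples $h$ nodes uniformly at random with replacement and adopts the most frequent opinion in its sample, breaking ties uniformly at random. Function names and the example configuration are hypothetical.

```python
import random
from collections import Counter


def h_majority_round(opinions, h, rng):
    """One synchronous round: every node samples h opinions and adopts the plurality."""
    n = len(opinions)
    updated = []
    for _ in range(n):
        sample = [opinions[rng.randrange(n)] for _ in range(h)]
        counts = Counter(sample)
        top = max(counts.values())
        # Ties among the most frequent sampled opinions are broken uniformly at random.
        winners = [op for op, c in counts.items() if c == top]
        updated.append(rng.choice(winners))
    return updated


def run_until_consensus(opinions, h, rng, max_rounds=10_000):
    """Repeat rounds until a single opinion remains or max_rounds is reached."""
    rounds = 0
    while len(set(opinions)) > 1 and rounds < max_rounds:
        opinions = h_majority_round(opinions, h, rng)
        rounds += 1
    return opinions, rounds


# Example: 100 nodes, k = 2 opinions, a clear initial bias towards opinion 0.
final, rounds = run_until_consensus([0] * 90 + [1] * 10, h=5, rng=random.Random(1))
print(len(set(final)), rounds)
```

Consensus is an absorbing state in this model: once every node holds the same opinion, every sample is unanimous and nothing changes, so the loop can stop as soon as one opinion remains.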


Article 26

Title@2025-06-24 (2): MegaFold: System-Level Optimizations for Accelerating Protein Structure Prediction Models

Title: MegaFold: System-Level Optimizations for Accelerating Protein Structure Prediction Models MegaFold: System-Level-Optimierungen zur Beschleunigung von Proteinstruktur-Vorhersagemodellen MegaFold:加速蛋白质结构结构预测模型的全系统优化 2506.20686v1

Authors (5): Hoa La, Ahan Gupta, Alex Morehead, Jianlin Cheng, Minjia Zhang

Protein structure prediction models such as AlphaFold3 (AF3) push the frontier of biomolecular modeling by incorporating science-informed architectural changes to the transformer architecture. However, these advances come at a steep system cost, introducing compute- and memory-intensive operators, 2D attention mechanisms, and retrieval-augmented data pipelines, which collectively hinder the scalability of AF3 training. In this work, we present MegaFold, a cross-platform system to accelerate AF3 training. MegaFold tackles key bottlenecks through ahead-of-time caching to eliminate GPU idle time from the retrieval-augmented data pipeline, Triton-based kernels for memory-efficient EvoAttention on heterogeneous devices, and deep fusion for common and critical small operators in AF3. Evaluation on both NVIDIA H200 and AMD MI250 GPUs shows that MegaFold reduces peak memory usage of AF3 training by up to 1.23$\times$ and improves per-iteration training time by up to 1.73$\times$ and 1.62$\times$, respectively. More importantly, MegaFold enables training on 1.35$\times$ longer sequence lengths compared to PyTorch baselines without running out of memory, significantly improving the scalability of modern protein folding models. We open-source our code at https://github.com/Supercomputing-System-AI-Lab/MegaFold/.

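AlphaFold3(AF3)等蛋白质结构预测模型通过在 Transformer 架构中引入基于科学知识的结构改动,推动了生物分子建模的前沿。然而,这些进展带来了高昂的系统代价,引入了计算和内存密集型算子、二维注意力机制以及检索增强的数据流水线,它们共同阻碍了 AF3 训练的可扩展性。本文提出 MegaFold,这是一个用于加速 AF3 训练的跨平台系统。MegaFold 通过以下手段解决关键瓶颈:采用预先(ahead-of-time)缓存以消除检索增强数据流水线造成的 GPU 空闲时间;采用基于 Triton 的内核,在异构设备上实现内存高效的 EvoAttention;并对 AF3 中常见且关键的小算子进行深度融合。在 NVIDIA H200 和 AMD MI250 GPU 上的评估表明,MegaFold 将 AF3 训练的峰值内存占用最多降低 1.23$\times$,并将每次迭代的训练时间分别最多改善 1.73$\times$ 和 1.62$\times$。更重要的是,与 PyTorch 基线相比,MegaFold 能够在不发生内存不足的情况下训练长 1.35$\times$ 的序列,显著提升了现代蛋白质折叠模型的可扩展性。代码开源于 https://github.com/Supercomputing-System-AI-Lab/MegaFold/。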


Article 27

Title@2025-06-24 (2): Power-Capping Metric Evaluation for Improving Energy Efficiency in HPC Applications

Title: Power-Capping Metric Evaluation for Improving Energy Efficiency in HPC Applications Power-Capping Metric Evaluation zur Verbesserung der Energieeffizienz in HPC-Anwendungen 提高高常委会应用能效提高方法评估 2505.21758v2

Authors (7): Maria Patrou, Thomas Wang, Wael Elwasif, Markus Eisenbach, Ross Miller, William Godoy, Oscar Hernandez

With high-performance computing systems now running at exascale, optimizing power-scaling management and resource utilization has become more critical than ever. This paper explores runtime power-capping optimizations that leverage integrated CPU-GPU power management on architectures like the NVIDIA GH200 superchip. We evaluate energy-performance metrics that account for simultaneous CPU and GPU power-capping effects by using two complementary approaches: speedup-energy-delay and a Euclidean distance-based multi-objective optimization method. By targeting a mostly compute-bound exascale science application, the Locally Self-Consistent Multiple Scattering (LSMS), we explore challenging scenarios to identify potential opportunities for energy savings in exascale applications, and we recognize that even modest reductions in energy consumption can have significant overall impacts. Our results highlight how GPU task-specific dynamic power-cap adjustments combined with integrated CPU-GPU power steering can improve the energy utilization of certain GPU tasks, thereby laying the groundwork for future adaptive optimization strategies.

由于高性能计算系统目前处于伸缩状态,优化电力扩缩管理和资源利用已变得比以往更加关键。本文件探索了运行时间的电力拉动优化,在像NVIDIA GH200超级芯片这样的建筑上利用CPU-GPU电力管理进行综合CPU-GPU优化。我们通过使用两种互补方法,评估了同时产生CPU和GPU动力拉动效应的能源性能衡量标准:加速能源拉动和以Euclidean远程为基础的多目标优化方法。通过针对一个大部分可计算到的扩展性科学应用,即本地自控多散射(LSMS),我们探索了具有挑战性的情景,以确定在大规模应用中节能的潜在机会,我们认识到即使能源消耗略有减少也会产生重大的总体影响。我们的结果突出表明,GPU的具体任务动态电动能上限调整与CPU-GPU电力指导相结合,可以如何改善某些GUPU任务的能源利用,从而为未来的适应性优化战略打下基础。
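As a rough illustration of what a Euclidean distance-based multi-objective choice between power caps can look like, the sketch below normalises runtime and energy across candidate configurations and picks the one closest to the ideal point (lowest time, lowest energy). This is one common formulation and an assumption on our part; the abstract does not give the paper's exact metric definitions. The energy-delay product helper is included only as a familiar point of comparison, and all configuration names and numbers are made up.

```python
import math


def energy_delay_product(energy_j: float, time_s: float) -> float:
    """Classic energy-delay product (lower is better)."""
    return energy_j * time_s


def closest_to_ideal(configs: dict) -> str:
    """Pick the configuration nearest to the ideal point in normalised (time, energy) space.

    configs maps a label to a (time_s, energy_j) pair. Both objectives are
    min-max normalised to [0, 1], so the ideal point is the origin.
    """
    times = [t for t, _ in configs.values()]
    energies = [e for _, e in configs.values()]
    t_lo, t_hi = min(times), max(times)
    e_lo, e_hi = min(energies), max(energies)

    def norm(x, lo, hi):
        return 0.0 if hi == lo else (x - lo) / (hi - lo)

    def distance(label):
        t, e = configs[label]
        return math.hypot(norm(t, t_lo, t_hi), norm(e, e_lo, e_hi))

    return min(configs, key=distance)


# Hypothetical measurements for three power caps: (runtime in s, energy in J).
measurements = {
    "cap_900W": (100.0, 90_000.0),
    "cap_700W": (105.0, 75_000.0),
    "cap_500W": (140.0, 72_000.0),
}
print(closest_to_ideal(measurements))
```

With these made-up numbers the middle cap wins: it gives up 5% of runtime for most of the available energy saving, whereas the two extremes each sit at the worst end of one objective.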


Article 28

Title@2025-06-24 (2): Can One Safety Loop Guard Them All? Agentic Guard Rails for Federated Computing

Title: Can One Safety Loop Guard Them All? Agentic Guard Rails for Federated Computing Kann ein Sicherheitsschlaufe Guard sie alle? Agentic Guard Rails für Federated Computing 一个安全环圈能保护全部吗? 2506.20000v1

Authors (2): Narasimha Raghavan Veeraragavan, Jan Franz Nygård

We propose Guardian-FC, a novel two-layer framework for privacy preserving federated computing that unifies safety enforcement across diverse privacy preserving mechanisms, including cryptographic back-ends like fully homomorphic encryption (FHE) and multiparty computation (MPC), as well as statistical techniques such as differential privacy (DP). Guardian-FC decouples guard-rails from privacy mechanisms by executing plug-ins (modular computation units), written in a backend-neutral, domain-specific language (DSL) designed specifically for federated computing workflows and interchangeable Execution Providers (EPs), which implement DSL operations for various privacy back-ends. An Agentic-AI control plane enforces a finite-state safety loop through signed telemetry and commands, ensuring consistent risk management and auditability. The manifest-centric design supports fail-fast job admission and seamless extensibility to new privacy back-ends. We present qualitative scenarios illustrating backend-agnostic safety and a formal model foundation for verification. Finally, we outline a research agenda inviting the community to advance adaptive guard-rail tuning, multi-backend composition, DSL specification development, implementation, and compiler extensibility alongside human-override usability.

我们提议一个保护隐私的新二层框架 – – 保护联邦计算框架,它统一了各种隐私保护机制的安全执法,包括完全同质加密(FHE)和多功能计算(MPC)等加密后端,以及不同隐私(DP)等统计技术。 监护-FC从隐私机制中分解的护栏,方法是用后端中性、特定域的语言(DSL)编写,专门为进化计算工作流程和可互换执行供应商(EPs)设计,用于执行DSL的多种隐私后端操作。一个Agenti-AI控制平面通过签名的遥测和指令执行固定状态安全循环,确保连贯一致的风险管理和可审计性。 显式设计支持不协调的工作接纳和无缝地延伸到新的隐私后端。我们提出了说明后端安全性的质量设想和正式的核查模式基础。 最后,我们概述了一个研究议程,邀请社区推进适应性后端调整、多后端配置构成、DSLDL规格的开发、汇编和扩展与人类的兼容性。


Article 29

Title@2025-06-24 (2): AI-coupled HPC Workflow Applications, Middleware and Performance

Title: AI-coupled HPC Workflow Applications, Middleware and Performance KI-gekoppelte HPC-Workflow-Anwendungen, Middleware und Performance 工作流量应用、中软件和性能 2406.14315v2

Authors (6): Wes Brewer, Ana Gainaru, Frédéric Suter, Feiyi Wang, Murali Emani, Shantenu Jha

AI integration is revolutionizing the landscape of HPC simulations, enhancing the importance, use, and performance of AI-driven HPC workflows. This paper surveys the diverse and rapidly evolving field of AI-driven HPC and provides a common conceptual basis for understanding AI-driven HPC workflows. Specifically, we use insights from different modes of coupling AI into HPC workflows to propose six execution motifs most commonly found in scientific applications. The proposed set of execution motifs is by definition incomplete and evolving. However, they allow us to analyze the primary performance challenges underpinning AI-driven HPC workflows. We close with a listing of open challenges, research issues, and suggested areas of investigation, including the need for specific benchmarks that will help evaluate and improve the execution of AI-driven HPC workflows.

AI正在使HPC模拟的景观发生革命性的变化,提高了AI驱动的HPC工作流程的重要性、使用和绩效。本文调查了AI驱动的HPC工作流程的多样性和迅速演变的领域,为理解AI驱动的HPC工作流程提供了一个共同的概念基础。具体地说,我们利用将AI与HPC工作流程相结合的不同模式的见解来提出在科学应用中最常见的六个执行模式。拟议的一套执行模式在定义上是不完整和不断演变的。然而,它们使我们能够分析AI驱动的HPC工作流程的主要绩效挑战。我们最后列举了公开的挑战、研究问题和建议的调查领域,包括需要制定具体基准,以帮助评价和改进AI驱动的HPC工作流程的执行。


Article 30

Title@2025-06-24 (2): MAIZX: A Carbon-Aware Framework for Optimizing Cloud Computing Emissions

Title: MAIZX: A Carbon-Aware Framework for Optimizing Cloud Computing Emissions MAIZX: Ein Carbon-Aware-Framework zur Optimierung von Cloud-Computing-Emissionen MAIZX:优化云计算排放的碳软件框架 2506.19972v1

Authors (3): Federico Ruilova, Ernst Gunnar Gran, Sven-Arne Reinemo

Cloud computing drives innovation but also poses significant environmental challenges due to its high-energy consumption and carbon emissions. Data centers account for 2-4% of global energy usage, and the ICT sector’s share of electricity consumption is projected to reach 40% by 2040. As the goal of achieving net-zero emissions by 2050 becomes increasingly urgent, there is a growing need for more efficient and transparent solutions, particularly for private cloud infrastructures, which are utilized by 87% of organizations, despite the dominance of public-cloud systems. This study evaluates the MAIZX framework, designed to optimize cloud operations and reduce carbon footprint by dynamically ranking resources, including data centers, edge computing nodes, and multi-cloud environments, based on real-time and forecasted carbon intensity, Power Usage Effectiveness (PUE), and energy consumption. Leveraging a flexible ranking algorithm, MAIZX achieved an 85.68% reduction in CO2 emissions compared to baseline hypervisor operations. Tested across geographically distributed data centers, the framework demonstrates scalability and effectiveness, directly interfacing with hypervisors to optimize workloads in private, hybrid, and multi-cloud environments. MAIZX integrates real-time data on carbon intensity, power consumption, and carbon footprint, as well as forecasted values, into cloud management, providing a robust tool for enhancing climate performance potential while maintaining operational efficiency.

云计算推动了创新,但其高能耗和碳排放也带来了重大的环境挑战。数据中心占全球能源使用量的 2-4%,信息通信技术(ICT)部门在电力消耗中的占比预计到 2040 年将达到 40%。随着到 2050 年实现净零排放的目标日益紧迫,人们越来越需要更高效、更透明的解决方案,对私有云基础设施尤其如此:尽管公有云系统占据主导地位,仍有 87% 的组织在使用私有云。本研究评估了 MAIZX 框架,该框架根据实时和预测的碳强度、电能使用效率(PUE)以及能耗,对数据中心、边缘计算节点和多云环境等资源进行动态排序,以优化云的运行并减少碳足迹。借助灵活的排序算法,与基线的虚拟机监控器(hypervisor)运行方式相比,MAIZX 将 CO2 排放减少了 85.68%。该框架在地理上分布的多个数据中心进行了测试,展现出可扩展性和有效性,并直接与虚拟机监控器对接,以优化私有云、混合云和多云环境中的工作负载。MAIZX 将碳强度、功耗和碳足迹的实时数据及其预测值整合到云管理中,为在保持运行效率的同时提升气候绩效潜力提供了一个可靠的工具。
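The three inputs named in the abstract (carbon intensity, PUE, and energy consumption) combine naturally into a first-order emissions estimate, which is enough to show what "ranking resources" can mean in practice. The sketch below is not MAIZX's ranking algorithm, whose details the abstract does not give; it simply orders candidate sites by estimated emissions = IT energy × PUE × grid carbon intensity. The `Site` type, field names, and sample values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    name: str
    carbon_intensity: float  # grid carbon intensity, gCO2e per kWh
    pue: float               # Power Usage Effectiveness (facility energy / IT energy)
    it_energy_kwh: float     # expected IT energy of the workload at this site


def estimated_emissions_g(site: Site) -> float:
    """First-order estimate: IT energy scaled by PUE, times grid carbon intensity."""
    return site.it_energy_kwh * site.pue * site.carbon_intensity


def rank_sites(sites):
    """Return sites ordered from lowest to highest estimated emissions."""
    return sorted(sites, key=estimated_emissions_g)


# Hypothetical candidates for placing one workload.
candidates = [
    Site("dc-coal-heavy", carbon_intensity=400.0, pue=1.5, it_energy_kwh=10.0),
    Site("dc-hydro", carbon_intensity=50.0, pue=1.2, it_energy_kwh=10.0),
    Site("edge-mixed", carbon_intensity=200.0, pue=1.1, it_energy_kwh=12.0),
]
for site in rank_sites(candidates):
    print(site.name, round(estimated_emissions_g(site)))
```

A scheduler along these lines would re-run the ranking whenever real-time or forecasted carbon intensity changes, which is the dynamic aspect the abstract emphasises.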


Article 31

Title@2025-06-24 (2): Maintaining a Bounded Degree Expander in Dynamic Peer-to-Peer Networks

Title: Maintaining a Bounded Degree Expander in Dynamic Peer-to-Peer Networks Aufrechterhaltung eines begrenzten Grades Expander in dynamischen Peer-to-Peer-Netzwerken 维持动态同侪网络中的宽度扩展器 2506.17757v2

Authors (1): Antonio Cruciani

We study the problem of maintaining robust and sparse overlay networks in fully distributed settings where nodes continuously join and leave the system. This scenario closely models real-world unstructured peer-to-peer networks, where maintaining a well-connected yet low-degree communication graph is crucial. We generalize a recent protocol by Becchetti et al. [SODA 2020], which relies on a simple randomized connection strategy to build an expander topology with high probability, to a dynamic-network setting with churn. In this work, the network dynamism is governed by an oblivious adversary that controls which nodes join and leave the system in each round. The adversary has full knowledge of the system and unbounded computational power, but cannot see the random choices made by the protocol. Our analysis builds on the framework of Augustine et al. [FOCS 2015], and shows that our distributed algorithm maintains a constant-degree expander graph with high probability, despite a continuous adversarial churn with a rate of up to $\mathcal{O}(n/\mathrm{polylog}(n))$ per round, where $n$ is the stable network size. The protocol and proof techniques are not new, but together they resolve a specific open problem raised in prior work. The result is a simple, fully distributed, and churn-resilient protocol with provable guarantees that align with observed empirical behavior.

我们研究在节点持续加入和离开系统的完全分布式环境中维护鲁棒且稀疏的覆盖网络的问题。这一场景贴近现实中的非结构化对等(peer-to-peer)网络,在这类网络中,维持一个连通性良好而度数较低的通信图至关重要。我们将 Becchetti 等人 [SODA 2020] 最近提出的协议推广到带有节点流失(churn)的动态网络场景,该协议依靠一种简单的随机连接策略,以高概率构建出扩展图拓扑。在本文中,网络的动态性由一个不知情的(oblivious)对手控制,它决定每一轮中哪些节点加入和离开系统。该对手完全了解系统并拥有无限的计算能力,但无法看到协议所做的随机选择。我们的分析建立在 Augustine 等人 [FOCS 2015] 的框架之上,并表明:即使面对每轮速率高达 $\mathcal{O}(n/\mathrm{polylog}(n))$ 的持续对抗性流失(其中 $n$ 为稳定的网络规模),我们的分布式算法仍能以高概率维持一个常数度的扩展图。协议和证明技术并不新,但二者结合解决了先前工作中提出的一个具体的公开问题。最终得到的是一个简单、完全分布式且能抵御节点流失的协议,其可证明的保证与观察到的经验行为相一致。


Article 32

Title@2025-06-24 (2): FDA-Opt: Communication-Efficient Federated Fine-Tuning of Language Models

Title: FDA-Opt: Communication-Efficient Federated Fine-Tuning of Language Models FDA-Opt: Kommunikationseffizientes Federated Fine-Tuning von Sprachmodellen FFDA-Opt: 交流-高效联邦语言模型精密使用 2505.04535v2

Authors (3): Michail Theologitis, Vasilis Samoladas, Antonios Deligiannakis

Federated Learning (FL) enables the utilization of vast, previously inaccessible data sources. At the same time, pre-trained Language Models (LMs) have taken the world by storm and for good reason. They exhibit remarkable emergent abilities and are readily adapted to downstream tasks. This opens one of the most exciting frontiers in FL: fine-tuning LMs. Yet, a persistent challenge in FL is the frequent, rigid communication of parameters – a problem magnified by the sheer size of these contemporary models. The FedOpt family of algorithms has become the go-to approach for FL, relying on fixed but arbitrary intervals for model exchanges. Recently, the FDA algorithm prescribed a dynamic approach by monitoring the training progress. However, it introduced a hard-to-calibrate parameter and imposed a rigid synchronization scheme. In this work, we address these limitations by proposing the FDA-Opt family of algorithms – a unified generalization of both FDA and FedOpt. Our experimental evaluation focuses on fine-tuning LMs on downstream NLP tasks and demonstrates that FDA-Opt outperforms FedOpt even when it is configured with hyper-parameters specifically optimized for the latter. In other words, we show that FDA-Opt is a practical, drop-in replacement for FedOpt in modern FL libraries and systems: it requires no additional configuration and delivers superior performance out of the box.

联邦学习(FL)使人们能够利用大量此前无法访问的数据源。与此同时,预训练语言模型(LM)风靡全球,而且不无道理:它们展现出非凡的涌现能力,并且很容易适配下游任务。这开辟了 FL 中最令人兴奋的前沿之一:微调 LM。然而,FL 中一个长期存在的挑战是参数的频繁且僵化的通信,而当代模型的庞大规模进一步放大了这一问题。FedOpt 算法族已成为 FL 的首选方法,它依赖固定但任意设定的间隔进行模型交换。最近,FDA 算法通过监控训练进度给出了一种动态方案,但它引入了一个难以校准的参数,并强加了僵化的同步机制。在本文中,我们提出 FDA-Opt 算法族来解决这些局限,它是 FDA 与 FedOpt 的统一推广。我们的实验评估聚焦于在下游 NLP 任务上微调 LM,结果表明,即使采用专门为 FedOpt 优化的超参数进行配置,FDA-Opt 仍然优于 FedOpt。换言之,FDA-Opt 是现代 FL 库和系统中 FedOpt 的一个实用的即插即用替代方案:它无需额外配置,开箱即可提供更优的性能。


Article 33

Title@2025-06-24 (2): Formalization and security analysis of the Bridgeless protocol

Title: Formalization and security analysis of the Bridgeless protocol Formalisierung und Sicherheitsanalyse des Bridgeless-Protokolls 对 “ 无桥梁议定书 “ 的正规化和安全分析 2506.19730v1

Authors (5): Orestis Alpos, Oleg Fomenko, Dimitris Karakostas, Oleksandr Kurbatov, Andrey Sabelnikov

This paper formalizes and proves the security of the Bridgeless protocol, a protocol able to bridge tokens between various chains. The Bridgeless protocol is run by a set of validators, responsible for verifying deposit transactions on the source chain and generating the corresponding withdrawals on the target chain. The protocol is designed to be chain-agnostic, and the validators interact with each supported chain via a chain client. It currently supports EVM-compatible chains as well as the Zano and Bitcoin chains. The paper formalizes all involved subprotocols and describes the conditions under which the protocol maintains safety and liveness.

本文将证明“无桥协议”的安全性正式化,该协议可以连接不同链条之间的标识。“无桥协议”由一组验证人管理,负责核查源链上的存款交易,并产生目标链上的相应提取量。协议旨在成为链式的链条,验证人通过链条客户与每个支持链条互动。它目前支持与EVM兼容的链条、Zano和Bitcoin链条。该文件将所有涉及的子程序都正式化,并描述了协议维护安全和生存的条件。


Article 34

Title@2025-06-24 (2): PS-WL: A Probability-Sensitive Wear Leveling scheme for SSD array scaling

Title: PS-WL: A Probability-Sensitive Wear Leveling scheme for SSD array scaling PS-WL: Ein Probability-Sensitive Wear Leveling-Schema für die Skalierung von SSD-Arrays PS-WL: SSD 阵列比例缩放的概率感敏性穿级方案 2506.19660v1

Authors (4): Shuhang Xu, Yunfei Gu, Linhui Liu, Chentao Wu

As flash-based Solid State Drive (SSD) arrays become essential to modern data centers, scaling these arrays to meet explosive data growth is a frequent and critical operation. However, the conventional wear-leveling (WL) paradigm applied during scaling suffers from a fundamental flaw: it ignores the non-linear relationship between wear and failure probability, potentially pushing the most vulnerable, aged disks towards premature failure. To address this critical issue at its root, we propose the Probability-Sensitive Wear Leveling (PS-WL) scheme, which shifts the optimization goal from balancing wear to directly balancing failure risk. At its core, PS-WL introduces an “effective lifetime” model derived from a realistic failure probability to more accurately assess disk lifetime. This model guides a PID controller for the wear-leveling operation, with a conservative zone that minimizes performance overhead by restricting warm data migration. Comprehensive simulations validate the superiority of PS-WL over state-of-the-art methods. The results demonstrate that our approach significantly reduces performance overhead while, most critically, consistently and effectively lowering the aggregated array failure risk across diverse system configurations and workloads. This proves that by directly optimizing for reliability, PS-WL builds a scalable storage system that is, by design, fundamentally safer, more efficient, and more stable.

由于基于闪光的固态驱动器(SSD)阵列对现代数据中心至关重要,扩大这些阵列以适应爆炸性数据增长是一项经常和关键的操作。然而,在缩放过程中应用的常规磨损等级(WL)模式存在一个根本性缺陷:它忽视了磨损和故障概率之间的非线性关系,有可能将最脆弱的老磁盘推向过早的失败。为了从根本上解决这一关键问题,我们提议了“概率感应性湿分级(PS-WL)计划 ” , 将优化目标从平衡磨损转向直接平衡故障风险。 在其核心方面, PS-WL 引入了一个“ 有效终身” 模式, 其依据是现实性失败概率来更准确地评估磁盘寿命。 这个模式指导了PID控制器的磨损操作, 保守区通过限制热数据迁移而最大限度地减少性能管理。 全面模拟验证了PS-WL优于最新技术方法的优势。 其结果表明,我们的方法极大地降低了绩效管理,同时,最关键地、持续和有效地降低不同系统配置和工作量的汇总阵列失败风险。这个模型通过直接地证明,更稳定的存储是更稳定的安全。
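For readers unfamiliar with the control element mentioned in the abstract, here is a minimal textbook discrete PID controller. How PS-WL wires it up is not described in the abstract, so the interpretation used in the example is purely an assumption: the error is taken to be the gap between one disk's estimated failure risk and the array-wide mean, and the output is read as a data-migration budget. Gains and error values are arbitrary.

```python
class PID:
    """Textbook discrete PID controller: u = kp*e + ki*sum(e*dt) + kd*de/dt."""

    def __init__(self, kp: float, ki: float, kd: float):
        self.kp = kp
        self.ki = ki
        self.kd = kd
        self._integral = 0.0
        self._prev_error = None

    def update(self, error: float, dt: float = 1.0) -> float:
        """Feed one error sample and return the control output."""
        self._integral += error * dt
        if self._prev_error is None:
            derivative = 0.0  # no history yet, so no derivative term on the first sample
        else:
            derivative = (error - self._prev_error) / dt
        self._prev_error = error
        return self.kp * error + self.ki * self._integral + self.kd * derivative


# Hypothetical use: the error is (disk failure risk - array mean risk), shrinking over time.
controller = PID(kp=2.0, ki=0.5, kd=1.0)
for risk_gap in (1.0, 0.5, 0.25):
    print(controller.update(risk_gap))
```

The proportional term reacts to the current imbalance, the integral term removes a persistent offset, and the derivative term damps the response as the gap closes, which is why a PID loop is a natural fit for gradually steering an array towards balanced risk.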


Article 35

Title@2025-06-24 (2): Towards an Introspective Dynamic Model of Globally Distributed Computing Infrastructures

Title: Towards an Introspective Dynamic Model of Globally Distributed Computing Infrastructures Auf dem Weg zu einem introspektiven dynamischen Modell weltweit verteilter Computing-Infrastrukturen 争取建立全球分布全球电子计算基础设施的前瞻性动态模型 2506.19578v1

Authors (21): Ozgur O. Kilic, David K. Park, Yihui Ren, Tatiana Korchuganova, Sairam Sri Vatsavai, Joseph Boudreau, Tasnuva Chowdhury, Shengyu Feng, Raees Khan, Jaehyung Kim, Scott Klasky, Tadashi Maeno, Paul Nilsson, Verena Ingrid Martinez Outschoorn, Norbert Podhorszki, Frédéric Suter, Wei Yang, Yiming Yang, Shinjae Yoo, Alexei Klimentov, Adolfy Hoisie

Large-scale scientific collaborations like ATLAS, Belle II, CMS, DUNE, and others involve hundreds of research institutes and thousands of researchers spread across the globe. These experiments generate petabytes of data, with volumes soon expected to reach exabytes. Consequently, there is a growing need for computation, including structured data processing from raw data to consumer-ready derived data, extensive Monte Carlo simulation campaigns, and a wide range of end-user analysis. To manage these computational and storage demands, centralized workflow and data management systems are implemented. However, decisions regarding data placement and payload allocation are often made disjointly and via heuristic means. A significant obstacle in adopting more effective heuristic or AI-driven solutions is the absence of a quick and reliable introspective dynamic model to evaluate and refine alternative approaches. In this study, we aim to develop such an interactive system using real-world data. By examining job execution records from the PanDA workflow management system, we have pinpointed key performance indicators such as queuing time, error rate, and the extent of remote data access. The dataset includes five months of activity. Additionally, we are creating a generative AI model to simulate time series of payloads, which incorporate visible features like category, event count, and submitting group, as well as hidden features like the total computational load, derived from existing PanDA records and computing site capabilities. These hidden features, which are not visible to job allocators, whether heuristic or AI-driven, influence factors such as queuing times and data movement.

ATLAS、Belle II、CMS、DUNE 等大规模科学合作项目涉及遍布全球的数百家研究机构和数千名研究人员。这些实验产生 PB 级的数据,数据量预计很快将达到 EB 级。因此,对计算的需求不断增长,包括从原始数据到可供用户使用的派生数据的结构化数据处理、大规模的蒙特卡洛模拟任务以及各种各样的终端用户分析。为了管理这些计算和存储需求,人们部署了集中式的工作流和数据管理系统。然而,有关数据放置和作业负载(payload)分配的决策往往是彼此割裂地、通过启发式手段做出的。采用更有效的启发式或 AI 驱动的方案的一大障碍,是缺少一个快速、可靠的内省式动态模型来评估和改进备选方法。在本研究中,我们旨在利用真实数据开发这样一个交互式系统。通过分析 PanDA 工作流管理系统的作业执行记录,我们确定了排队时间、错误率和远程数据访问程度等关键性能指标。该数据集涵盖五个月的活动。此外,我们正在构建一个生成式 AI 模型来模拟作业负载的时间序列,其中既包含类别、事件数和提交组等可见特征,也包含总计算负载等隐藏特征,后者由现有的 PanDA 记录和计算站点能力推导得出。这些隐藏特征对作业分配器(无论是启发式的还是 AI 驱动的)不可见,却会影响排队时间和数据移动等因素。


Article 36

Title@2025-06-24 (2): MemAscend: System Memory Optimization for SSD-Offloaded LLM Fine-Tuning

Title: MemAscend: System Memory Optimization for SSD-Offloaded LLM Fine-Tuning MemAscend: Systemspeicheroptimierung für SSD-Offloaded LLM Fine-Tuning MemAscend: SSD- 卸载 LLM 精密调试的系统内存优化 2505.23254v2

Authors (2): Yong-Cheng Liaw, Shuo-Han Chen

Owing to the huge success of generative artificial intelligence (AI), large language models (LLMs) have emerged as a core subclass, underpinning applications such as question answering, text generation, and code completion. While fine-tuning these models on domain-specific data can yield significant performance gains, it also poses daunting computational challenges, especially for researchers and small organizations with limited hardware resources. Although SSD offloading (i.e., ZeRO-Infinity) has emerged as a viable strategy to overcome the GPU memory barrier via leveraging both system memory (i.e., CPU DRAM) and storage space (i.e., solid-state devices, SSDs), its design primarily targets model-centric performance issues. As a result, key system-level issues, including system memory fragmentation, inefficient pinned buffer allocation, peak CPU usage spikes, and file system overhead, remain unaddressed, stifling scalability and inflating costs. Such an observation motivates this paper to introduce MemAscend, a framework that systematically tackles the underexplored system memory bottlenecks in SSD-offloaded LLM training, with a focus on resource-constrained environments. By streamlining pinned-memory allocation, eradicating fragmentation, and mitigating peak overhead, MemAscend reclaims a substantial system memory budget, enabling larger models, longer context windows, and higher batch sizes without exceeding modest hardware limits. Across diverse LLM benchmarks, MemAscend reduces peak system-memory consumption by an average of 55.7% compared with standard SSD offloading techniques, lowering the hardware barrier for fine-tuning and unlocking new possibilities for cost-effective large-scale training on limited-resource machines.

由于生成式人工智能(AI)的巨大成功,大型语言模型(LLM)已成为其核心子类,支撑着问答、文本生成和代码补全等应用。在特定领域数据上对这些模型进行微调可以带来显著的性能提升,但也带来了严峻的计算挑战,对硬件资源有限的研究人员和小型组织尤其如此。SSD 卸载(即 ZeRO-Infinity)通过同时利用系统内存(即 CPU DRAM)和存储空间(即固态设备 SSD),已成为突破 GPU 内存瓶颈的可行策略,但其设计主要针对以模型为中心的性能问题。因此,系统内存碎片、低效的锁页缓冲区分配、CPU 使用峰值以及文件系统开销等关键的系统级问题仍未得到解决,限制了可扩展性并推高了成本。基于这一观察,本文提出 MemAscend,这是一个系统性地解决 SSD 卸载 LLM 训练中尚未充分研究的系统内存瓶颈的框架,重点面向资源受限的环境。通过简化锁页内存分配、消除碎片并缓解峰值开销,MemAscend 回收了可观的系统内存预算,使得在不超出有限硬件条件的情况下能够使用更大的模型、更长的上下文窗口和更大的批大小。在多种 LLM 基准上,与标准 SSD 卸载技术相比,MemAscend 平均将系统内存峰值消耗降低 55.7%,从而降低了微调的硬件门槛,并为在资源有限的机器上进行高性价比的大规模训练带来了新的可能。


Article 37

Title@2025-06-24 (2): Picsou: Enabling Replicated State Machines to Communicate Efficiently

Title: Picsou: Enabling Replicated State Machines to Communicate Efficiently Picsou: Replizierte Staatsmaschinen effizient kommunizieren Picsou: 使可复制的国家机器能够有效通信 2312.11029v2

Authors (8): Reginald Frank, Micah Murray, Chawinphat Tankuranand, Junseo Yoo, Ethan Xu, Natacha Crooks, Suyash Gupta, Manos Kapritsos

Replicated state machines (RSMs) cannot communicate effectively today as there is no formal framework or efficient protocol to do so. To address this issue, we introduce a new primitive, Cross-Cluster Consistent Broadcast (C3B), and present PICSOU, a practical implementation of the C3B primitive. PICSOU draws inspiration from networking and TCP to allow two RSMs to communicate with constant metadata overhead in the failure-free case and a minimal number of message resends in the case of failures. PICSOU is flexible and allows both crash-fault-tolerant and Byzantine-fault-tolerant consensus protocols to communicate. At the heart of PICSOU’s good performance and generality is the concept of QUACKs (quorum acknowledgments). QUACKs allow nodes in each RSM to precisely determine when messages have definitely been received or have likely been lost. Our results are promising: we obtain up to 24x better performance than prior solutions on microbenchmarks and applications, ranging from disaster recovery to data reconciliation.

为了解决这一问题,我们引入了一种新的原始的跨集群一致广播(C3B),并推出PICSOU,这是C3B原始软件的实际实施。 PICSOU从网络和TCP中汲取灵感,使两个RSM在无故障情况下能够与恒定元数据管理器进行通信,并在出现故障的情况下与少量信息转发器进行通信。 PICSOU是灵活的,允许碰撞错误、容忍错误和拜占庭错误的容忍共识协议进行沟通。在PICSOU的良好表现和普遍性的核心是QUACKs概念(承认报价)。QUACKs允许每个RM的节点准确确定信息何时确实收到或可能丢失。我们的结果很有希望:从灾后恢复到数据协调,我们得到了比以前关于微苯标记和应用的解决方案多到24x的更好表现。
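
The idea of a quorum acknowledgment can be illustrated with a minimal sketch. It assumes TCP-style cumulative acknowledgments and a caller-supplied quorum size; the class `QuackTracker` and its methods are hypothetical names for illustration and are not PICSOU's actual protocol or API.

```python
class QuackTracker:
    """Tracks cumulative acknowledgments from replicas of a remote RSM (illustrative)."""

    def __init__(self, num_replicas: int, quorum: int) -> None:
        if not 1 <= quorum <= num_replicas:
            raise ValueError("quorum must be between 1 and num_replicas")
        self.quorum = quorum
        # Highest sequence number each replica has cumulatively acknowledged.
        self.acked = [0] * num_replicas

    def on_ack(self, replica: int, seq: int) -> None:
        # Cumulative acknowledgments only move forward, as in TCP.
        self.acked[replica] = max(self.acked[replica], seq)

    def quacked_up_to(self) -> int:
        # The quorum-th largest acknowledgment: at least `quorum` replicas
        # have acknowledged every message up to this sequence number.
        return sorted(self.acked, reverse=True)[self.quorum - 1]


tracker = QuackTracker(num_replicas=4, quorum=2)
tracker.on_ack(0, 5)
tracker.on_ack(1, 3)
tracker.on_ack(2, 1)
print(tracker.quacked_up_to())  # 3
```

In this sketch, messages above the returned sequence number are the candidates for a resend; how PICSOU chooses the quorum size and detects loss is not specified in the abstract.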


Article 38

Title@2025-06-24 (2): TrainVerify: Equivalence-Based Verification for Distributed LLM Training

Title: TrainVerify: Equivalence-Based Verification for Distributed LLM Training TrainVerify: Gleichwertigkeitsbasierte Überprüfung für verteiltes LLM-Training 培训核查:分布式LLM培训的等效核查 2506.15961v2

Authors (7): Yunchi Lu, Youshan Miao, Cheng Tan, Peng Huang, Yi Zhu, Xian Zhang, Fan Yang

Training large language models (LLMs) at scale requires parallel execution across thousands of devices, incurring enormous computational costs. Yet, these costly distributed trainings are rarely verified, leaving them prone to silent errors and potentially wasting millions of GPU hours. We introduce TrainVerify, a system for verifiable distributed training of LLMs. Given a deep learning model’s logical specification as the ground truth, TrainVerify formally verifies that a distributed parallel execution plan is mathematically equivalent to it. Direct verification is notoriously difficult due to the sheer scale of LLMs, which often involve billions of variables and highly intricate computation graphs. Therefore, TrainVerify introduces shape-reduction techniques and a stage-wise parallel verification algorithm that significantly reduce complexity while preserving formal correctness. TrainVerify scales to frontier LLMs, including the successful verification of the Llama3 (405B) and DeepSeek-V3 (671B) training plans.

大规模训练大型语言模型(LLM)需要在数千个设备上并行执行,计算成本极高。然而,这些昂贵的分布式训练很少得到验证,因此容易出现静默错误,并可能浪费数百万GPU小时。我们提出TrainVerify,一个用于LLM可验证分布式训练的系统。以深度学习模型的逻辑规范作为基准,TrainVerify形式化地验证分布式并行执行计划在数学上与其等价。由于LLM规模庞大,往往涉及数十亿个变量和高度复杂的计算图,直接验证极为困难。因此,TrainVerify引入了形状缩减技术和分阶段并行验证算法,在保持形式正确性的同时显著降低复杂度。TrainVerify可扩展到前沿LLM,包括成功验证Llama3(405B)和DeepSeek-V3(671B)的训练计划。
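
To make the notion of "a parallel plan equivalent to the logical specification" concrete, here is a small sketch that compares a single-device matrix product with a column-sharded version of it. This is only a numerical spot check on one input, which is far weaker than the formal verification the abstract describes; the function names and the column-sharding plan are assumptions chosen for illustration.

```python
def matmul(a, b):
    """Reference ("logical") matrix product on plain nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def column_shards(b, parts):
    """Split the columns of b into `parts` equal shards, one per device."""
    cols = len(b[0])
    if cols % parts != 0:
        raise ValueError("column count must be divisible by parts")
    step = cols // parts
    return [[row[i * step:(i + 1) * step] for row in b] for i in range(parts)]


def sharded_matmul(a, b, parts):
    """Each device multiplies by its shard; results are concatenated by column."""
    partials = [matmul(a, shard) for shard in column_shards(b, parts)]
    return [sum((p[r] for p in partials), []) for r in range(len(a))]


a = [[1, 2], [3, 4]]
b = [[1, 0, 2, 1], [0, 1, 3, 4]]
print(matmul(a, b) == sharded_matmul(a, b, parts=2))  # True
```

Integer inputs keep the comparison exact; with floating-point data a tolerance would be needed, which is one reason symbolic equivalence checking is attractive.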


Article 39

Title@2025-06-24 (2): RepuNet: A Reputation System for Mitigating Malicious Clients in DFL

Title: RepuNet: A Reputation System for Mitigating Malicious Clients in DFL RepuNet: Ein Reputationssystem zur Bekämpfung bösartiger Kunden in der DFL RepuNet:DFL中减少恶意客户的声望系统 2506.19892v1

Authors (4): Isaac Marroqui Penalva, Enrique Tomás Martínez Beltrán, Manuel Gil Pérez, Alberto Huertas Celdrán

Decentralized Federated Learning (DFL) enables nodes to collaboratively train models without a central server, introducing new vulnerabilities since each node independently selects peers for model aggregation. Malicious nodes may exploit this autonomy by sending corrupted models (model poisoning), delaying model submissions (delay attack), or flooding the network with excessive messages, negatively affecting system performance. Existing solutions often depend on rigid configurations or additional infrastructures such as blockchain, leading to computational overhead, scalability issues, or limited adaptability. To overcome these limitations, this paper proposes RepuNet, a decentralized reputation system that categorizes threats in DFL and dynamically evaluates node behavior using metrics like model similarity, parameter changes, message latency, and communication volume. Nodes’ influence in model aggregation is adjusted based on their reputation scores. RepuNet was integrated into the Nebula DFL platform and experimentally evaluated with MNIST and CIFAR-10 datasets under non-IID distributions, using federations of up to 25 nodes in both fully connected and random topologies. Different attack intensities, frequencies, and activation intervals were tested. Results demonstrated that RepuNet effectively detects and mitigates malicious behavior, achieving F1 scores above 95% for MNIST scenarios and approximately 76% for CIFAR-10 cases. These outcomes highlight RepuNet’s adaptability, robustness, and practical potential for mitigating threats in decentralized federated learning environments.

分散化的联邦学习(DFL)让节点能够在没有中央服务器的情况下合作培训模型,引入新的脆弱性,因为每个节点独立地选择了同行来进行模型聚合。 恶意节点可以通过发送腐败模型(模型中毒)、延迟提交模型(延迟袭击),或者用过多的信息充斥网络,对系统性能产生不利影响。 现有解决方案往往依赖于僵硬的配置或额外的基础设施,如块链,导致计算间接费用、可缩缩放问题或适应性有限。 为了克服这些限制,本文件提议了RepuNet,这是一个分散化的声誉系统,对DFLL的威胁进行分类,并动态地用模型相似性、参数变化、信息延缓度和通信量量等指标评估节点行为。 模式集合中的节点影响根据声分调整。 RepuNet被纳入了Nebula DFLLL平台,并在非IID分布下与MIT和CIFAR-10数据集进行实验性评估,在完全连通和随机的25个节点中使用联合会。 不同的攻击强度、频率、感应变和感应变间隔期评估了不同攻击的节点的节点。 测试了REputudeal IM 10案例。
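
The sentence about adjusting a node's influence by its reputation can be sketched as a reputation-weighted average. The exclusion threshold `min_reputation` and the proportional weighting rule are assumptions made for illustration; the abstract does not state RepuNet's actual aggregation formula.

```python
def aggregate(models, reputations, min_reputation=0.3):
    """Average peer models, weighting each by its reputation score.

    Peers below `min_reputation` are excluded (hypothetical rule).
    """
    kept = [(m, r) for m, r in zip(models, reputations) if r >= min_reputation]
    if not kept:
        raise ValueError("no peer meets the reputation threshold")
    total = sum(r for _, r in kept)
    size = len(kept[0][0])
    return [sum(m[i] * r for m, r in kept) / total for i in range(size)]


models = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]  # the last peer is an outlier
reputations = [0.5, 0.5, 0.1]
print(aggregate(models, reputations))  # [2.0, 3.0]
```

With these inputs the low-reputation outlier contributes nothing, so the result is the plain average of the two trusted peers.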


Article 40

Title@2025-06-24 (2): The Autonomous Data Language – Concepts, Design and Formal Verification

Title: The Autonomous Data Language – Concepts, Design and Formal Verification Die autonome Datensprache – Konzepte, Design und formale Überprüfung 自主数据语言 – – 概念、设计和正式核查 2506.19457v1

Authors (3): Tom T. P. Franken, Thomas Neele, Jan Friso Groote

Nowadays, the main advances in computational power are due to parallelism. However, most parallel languages have been designed with a focus on processors and threads. This makes dealing with data and memory in programs hard, which distances the implementation from its original algorithm. We propose a new paradigm for parallel programming, the data-autonomous paradigm, where computation is performed by autonomous data elements. Programs in this paradigm are focused on making the data collaborate in a highly parallel fashion. We furthermore present AuDaLa, the first data-autonomous programming language, and provide a full formalisation that includes a type system and operational semantics. Programming in AuDaLa is very natural, as illustrated by examples, albeit in a style very different from sequential and contemporary parallel programming. Additionally, it lends itself to the formal verification of parallel programs, which we demonstrate.

目前,计算能力的主要进步是平行的。然而,大多数平行语言的设计重点是处理器和线条。这使得处理程序的数据和记忆变得困难,使执行与原始算法相去甚远。我们提出了平行程序的新模式,即数据自主模式,即由自主数据元素进行计算。这个模式中的方案侧重于使数据以高度平行的方式协作。我们还介绍了第一种数据自主程序语言AuDaLa,这是第一个数据自主程序语言,并提供了一种包括类型系统和操作语义的全面正规化。AuDaLa的编程非常自然,例如实例所展示的,尽管其风格与顺序和当代平行程序非常不同。此外,它也有助于对平行程序进行正式核查,我们展示了这一点。


Article 41

Title@2025-06-24 (2): Agent-Based Triangle Counting: Unlocking Truss Decomposition, Triangle Centrality, and Local Clustering Coefficient

Title: Agent-Based Triangle Counting: Unlocking Truss Decomposition, Triangle Centrality, and Local Clustering Coefficient Agent-Based Triangle Counting: Entsperren Truss Zersetzung, Dreieck Zentralität und lokale Clustering Koeffizient 基于代理的三角计数:解锁Truss分解、三角中心以及地方集束 2402.03653v2

Authors (3): Prabhat Kumar Chand, Apurba Das, Anisur Rahaman Molla

Triangle counting in a graph is a fundamental problem with wide-ranging applications. It is crucial for understanding graph structure and serves as a basis for more advanced graph analytics. One key application is truss decomposition, a technique for identifying maximal, highly interconnected subgraphs, revealing structural cohesion and tight-knit communities in complex graphs. This facilitates analysis of relationships and information flow in fields such as social networks, biology, and recommendation systems. Using mobile agents or robots for tasks like truss decomposition and clustering coefficient computation is especially advantageous in decentralised environments with limited or unreliable communication. In such scenarios, agents can perform local computations without requiring an extensive communication infrastructure. This is valuable in contexts like disaster response, urban management, and military operations, where broadcast communication is impractical. In this paper, we address the triangle counting problem in an arbitrary anonymous graph using mobile agents. This method is extended as a subroutine to solve the truss decomposition problem and compute triangle centrality and the local clustering coefficient for each node. Our approach uses $n$ autonomous mobile agents, each starting at a different node of an $n$-node graph. These agents coordinate to collaboratively solve triangle enumeration, then truss decomposition, triangle centrality, and clustering coefficient. We assume a synchronous system where agents execute tasks concurrently, allowing time to be measured in rounds. The graph is anonymous (nodes have no IDs), but agents have distinct IDs and limited memory. Agents can perform local computations and communicate only when co-located. Our goal is to design algorithms that minimise both time and memory per agent, while enabling solutions to the above problems.

图中的三角形计数是一个应用广泛的基础问题。它对于理解图结构至关重要,也是更高级图分析的基础。一个关键应用是桁架分解(truss decomposition),这是一种识别极大的、高度互连的子图的技术,可揭示复杂图中的结构凝聚性和紧密社区。这有助于分析社交网络、生物学和推荐系统等领域中的关系和信息流。在通信受限或不可靠的去中心化环境中,使用移动代理或机器人执行桁架分解和聚类系数计算等任务尤其有利。在这类场景中,代理可以进行本地计算,而无需大规模的通信基础设施。这在灾害响应、城市管理和军事行动等广播通信不切实际的场合很有价值。本文研究使用移动代理在任意匿名图中解决三角形计数问题。该方法进一步作为子程序,用于求解桁架分解问题,并计算每个节点的三角形中心性和局部聚类系数。我们的方法使用$n$个自主移动代理,每个代理从$n$节点图的不同节点出发。这些代理相互协调,协作完成三角形枚举,进而完成桁架分解、三角形中心性和聚类系数的计算。我们假设系统是同步的,代理并发执行任务,因此时间可以按轮次计量。图是匿名的(节点没有ID),但代理具有不同的ID和有限的内存。代理可以进行本地计算,且仅在位于同一节点时才能通信。我们的目标是设计在解决上述问题的同时使每个代理的时间和内存开销最小的算法。
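
For reference, the quantities named in the abstract can be written down with a small centralized sketch. This is not the agent-based algorithm of the paper: it assumes the whole adjacency structure is available in one place, and it only shows what a per-node triangle count and a local clustering coefficient are.

```python
from itertools import combinations


def triangles_per_node(adj):
    """Count, for each node, the triangles it participates in."""
    counts = {}
    for v, nbrs in adj.items():
        # A triangle at v is a pair of neighbours that are themselves adjacent.
        counts[v] = sum(1 for a, b in combinations(sorted(nbrs), 2) if b in adj[a])
    return counts


def local_clustering(adj):
    """Fraction of neighbour pairs that are connected, per node."""
    tri = triangles_per_node(adj)
    result = {}
    for v, nbrs in adj.items():
        d = len(nbrs)
        result[v] = 0.0 if d < 2 else 2 * tri[v] / (d * (d - 1))
    return result


# A triangle 0-1-2 with a pendant node 3 attached to node 2.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
tri = triangles_per_node(adj)
print(sum(tri.values()) // 3)  # 1 triangle in total
print(local_clustering(adj)[2])
```

Each triangle is counted once at each of its three corners, which is why the total is the sum divided by three.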


Article 42

Title@2025-06-24 (2): Computing Tree Structures in Anonymous Graphs via Mobile Agents

Title: Computing Tree Structures in Anonymous Graphs via Mobile Agents Berechnung von Baumstrukturen in anonymen Graphen über Mobile Agents 通过移动代理器在匿名图纸中的电子树结构 2506.19365v1

Authors (3): Prabhat Kumar Chand, Manish Kumar, Anisur Rahaman Molla

Minimum Spanning Tree (MST) and Breadth-First Search (BFS) tree constructions are classical problems in distributed computing, traditionally studied in the message-passing model, where static nodes communicate via messages. This paper investigates MST and BFS tree construction in an agent-based network, where mobile agents explore a graph and compute. Each node hosts one agent, and communication occurs when agents meet at a node. We consider $n$ agents initially dispersed (one per node) in an anonymous, arbitrary $n$-node, $m$-edge graph $G$. The goal is to construct the BFS and MST trees from this configuration such that each tree edge is known to at least one of its endpoints, while minimizing time and memory per agent. We work in a synchronous model and assume agents have no prior knowledge of any graph parameters such as $n$, $m$, $D$, $\Delta$ (graph diameter and maximum degree). Prior work solves BFS in $O(D\Delta)$ rounds with $O(\log n)$ bits per agent, assuming the root is known. We give a deterministic algorithm that constructs the BFS tree in $O(\min(D\Delta, m\log n) + n\log n + \Delta \log^2 n)$ rounds using $O(\log n)$ bits per agent without root knowledge. To determine the root, we solve leader election and MST construction. We elect a leader and construct the MST in $O(n\log n + \Delta \log^2 n)$ rounds, with $O(\log n)$ bits per agent. Prior MST algorithms require $O(m + n\log n)$ rounds and $\max(\Delta, \log n) \log n$ bits. Our results significantly improve memory efficiency and time, achieving nearly linear-time leader election and MST. Agents are assumed to know $\lambda$, the maximum identifier, bounded by a polynomial in $n$.

最小生成树(MST)和广度优先搜索(BFS)树的构造是分布式计算中的经典问题,传统上在消息传递模型中研究,其中静态节点通过消息进行通信。本文研究基于代理的网络中的MST和BFS树构造,其中移动代理在图上探索并计算。每个节点驻留一个代理,代理在同一节点相遇时才能通信。我们考虑$n$个代理最初分散(每个节点一个)在一个匿名的、任意的、具有$n$个节点和$m$条边的图$G$上。目标是从该初始配置出发构造BFS树和MST,使每条树边至少被其一个端点所知,同时最小化时间和每个代理的内存。我们采用同步模型,并假设代理事先不知道任何图参数,如$n$、$m$、$D$、$\Delta$(图的直径和最大度)。已有工作在已知根的前提下,以$O(D\Delta)$轮、每个代理$O(\log n)$比特求解BFS。我们给出一个确定性算法,在不知道根的情况下,以$O(\min(D\Delta, m\log n) + n\log n + \Delta \log^2 n)$轮、每个代理$O(\log n)$比特构造BFS树。为了确定根,我们求解领导者选举和MST构造。我们以$O(n\log n + \Delta \log^2 n)$轮、每个代理$O(\log n)$比特选出领导者并构造MST。已有的MST算法需要$O(m + n\log n)$轮和$\max(\Delta, \log n) \log n$比特。我们的结果显著改进了内存效率和时间,实现了近线性时间的领导者选举和MST构造。我们假设代理知道$\lambda$,即最大标识符,其以$n$的多项式为界。


Article 43

Title@2025-06-24 (2): PBFT-Backed Semantic Voting for Multi-Agent Memory Pruning

Title: PBFT-Backed Semantic Voting for Multi-Agent Memory Pruning PBFT-unterstützte semantische Abstimmung für Multi-Agent Memory Pruning PBFT 多重机构内存缓冲后退的语义投票 2506.17338v2

Authors (1): Duong Bach

The proliferation of multi-agent systems (MAS) in complex, dynamic environments necessitates robust and efficient mechanisms for managing shared knowledge. A critical challenge is ensuring that distributed memories remain synchronized, relevant, and free from the accumulation of outdated or inconsequential data - a process analogous to biological forgetting. This paper introduces the Co-Forgetting Protocol, a novel, comprehensive framework designed to address this challenge by enabling synchronized memory pruning in MAS. The protocol integrates three key components: (1) context-aware semantic voting, where agents utilize a lightweight DistilBERT model to assess the relevance of memory items based on their content and the current operational context; (2) multi-scale temporal decay functions, which assign diminishing importance to memories based on their age and access frequency across different time horizons; and (3) a Practical Byzantine Fault Tolerance (PBFT)-based consensus mechanism, ensuring that decisions to retain or discard memory items are agreed upon by a qualified and fault-tolerant majority of agents, even in the presence of up to f Byzantine (malicious or faulty) agents in a system of N greater than or equal to 3f+1 agents. The protocol leverages gRPC for efficient inter-agent communication and Pinecone for scalable vector embedding storage and similarity search, with SQLite managing metadata. Experimental evaluations in a simulated MAS environment with four agents demonstrate the protocol’s efficacy, achieving a 52% reduction in memory footprint over 500 epochs, 88% voting accuracy in forgetting decisions against human-annotated benchmarks, a 92% PBFT consensus success rate under simulated Byzantine conditions, and an 82% cache hit rate for memory access.

多试剂系统(MAS)在复杂、动态环境中的扩散,使得管理共享知识的强大和高效机制成为了管理共享知识的强大和高效机制。一个关键的挑战是如何确保分布式记忆保持同步、相关且不受过时或无关紧要的数据积累的影响,这是一个类似于生物遗忘的过程。本文件介绍了《共同制定议定书》,这是一个新颖的、全面的框架,旨在通过在MAS中同步存储存储存储来应对这一挑战。协议包含三个关键组成部分:(1) 符合环境需要的语义表决,即代理使用一种轻巧的DistilBERT基准来评估根据其内容和当前操作环境的记忆项目的相关性;(2) 多尺度的时间衰减功能,根据时间跨不同时间跨度的存取频率,赋予记忆越来越不重要的重要性;(3) 实用的Byzantine Fault容忍(PBFFFFFT)基于共识的机制,确保保留或丢弃存储记忆物品的决定得到合格和不宽容的多数物剂的同意,即使存在比赞丁基准(错误或错误)基准,评估记忆项目的相关性;(2) 高级时间缩缩缩缩缩缩(PRPB) 和类似存储剂在4的存储器中,实现搜索速度(RPL) 速度的智能的流流流流流流中,实现。
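
The fault bound and the decay idea can be sketched in a few lines. The quorum of 2f+1 is the standard PBFT threshold for N ≥ 3f+1 agents; the exponential half-life decay and the function names are assumptions for illustration, since the abstract does not give the protocol's actual decay functions or voting rule.

```python
def max_faulty(n: int) -> int:
    """Largest f such that n >= 3f + 1."""
    return (n - 1) // 3


def quorum(n: int) -> int:
    """Standard PBFT quorum size, 2f + 1."""
    return 2 * max_faulty(n) + 1


def decayed_score(relevance: float, age: float, half_life: float) -> float:
    """Relevance halves every `half_life` time units (assumed decay shape)."""
    return relevance * 0.5 ** (age / half_life)


def should_forget(votes, n: int) -> bool:
    """Discard a memory item only if a quorum of agents voted to forget it."""
    return sum(1 for v in votes if v) >= quorum(n)


n_agents = 4  # the evaluation in the abstract uses four agents
print(max_faulty(n_agents), quorum(n_agents))  # 1 3
print(should_forget([True, True, True, False], n_agents))  # True
```

With four agents, one Byzantine agent can be tolerated and three matching votes are required, so a single faulty voter cannot force or block a decision on its own.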


Article 44

Title: A Heuristic Algorithm for Shortest Path Search Ein Heuristischer Algorithmus für die kürzeste Pfadsuche 用于最短路径搜索的 Hyuristic 算法 2506.19349v1

Authors (3): Huashan Yu, Xiaolin Wang, Yingwei Luo

The Single-Source Shortest Path (SSSP) problem is well-known for the challenges in developing fast, practical, and work-efficient parallel algorithms. This work introduces a novel shortest path search method. It allows paths with different lengths to be extended in parallel at the cost of almost negligible repeated relaxations. A dynamic-stepping heuristic is proposed for the method to efficiently reduce the extended paths and the synchronizations. A traversal-optimization heuristic is proposed to improve the method by efficiently reducing the created paths and alleviating the load imbalance. Based on the method, the two heuristics are used to develop a practical SSSP algorithm, which tactfully reduces workload and overhead. The heuristics and the algorithm were evaluated on 73 real-world and synthetic graphs. The algorithm was also compared with five state-of-the-art SSSP implementations. On every GAP benchmark suite graph except Road, its speedup over the best of these five implementations ranges from 2.5x to 5.83x.

单一源最短路径(SSSP)问题以发展快速、实用和高效工作平行算法方面的挑战而广为人知。 这项工作引入了一个新的最短路径搜索方法。 它允许以几乎微不足道的反复放松为代价,同时延长不同长度的路径。 为有效减少扩展路径和同步度的方法,提出了动态步进超速法。 提出了跨步优化超速法,以通过有效减少创建路径和减轻负载不平衡来改进方法。 根据这种方法,使用两种超速法来开发实用的SSSP算法,从而有效地减少工作量和间接费用。 超速和算法在73个真实世界和合成图上进行了评估。 算法也与5个最先进的SSSP执行方式进行了比较。 在除路外的每个GAP基准套图上,其速度达到这5个执行中最佳的速率是2.5x至5.83x。


Article 45

Title@2025-06-24 (2): Network Structures as an Attack Surface: Topology-Based Privacy Leakage in Federated Learning

Title: Network Structures as an Attack Surface: Topology-Based Privacy Leakage in Federated Learning Netzwerkstrukturen als Angriffsfläche: Topologiebasiertes Datenschutz-Leakage im Federated Learning 网络结构作为攻击表面:联邦学习中的基于地形的隐私渗漏 2506.19260v1

Authors (3): Murtaza Rangwala, Richard O. Sinnott, Rajkumar Buyya

Federated learning systems increasingly rely on diverse network topologies to address scalability and organizational constraints. While existing privacy research focuses on gradient-based attacks, the privacy implications of network topology knowledge remain critically understudied. We conduct the first comprehensive analysis of topology-based privacy leakage across realistic adversarial knowledge scenarios, demonstrating that adversaries with varying degrees of structural knowledge can infer sensitive data distribution patterns even under strong differential privacy guarantees. Through systematic evaluation of 4,720 attack instances, we analyze six distinct adversarial knowledge scenarios: complete topology knowledge and five partial knowledge configurations reflecting real-world deployment constraints. We propose three complementary attack vectors: communication pattern analysis, parameter magnitude profiling, and structural position correlation, achieving success rates of 84.1%, 65.0%, and 47.2% under complete knowledge conditions. Critically, we find that 80% of realistic partial knowledge scenarios maintain attack effectiveness above security thresholds, with certain partial knowledge configurations achieving performance superior to the baseline complete knowledge scenario. To address these vulnerabilities, we propose and empirically validate structural noise injection as a complementary defense mechanism across 808 configurations, demonstrating up to 51.4% additional attack reduction when properly layered with existing privacy techniques. These results establish that network topology represents a fundamental privacy vulnerability in federated learning systems while providing practical pathways for mitigation through topology-aware defense mechanisms.

联邦学习系统日益依赖不同的网络地形来应对可扩缩性和组织限制。虽然现有的隐私研究侧重于基于梯度的攻击,但网络地形知识的隐私影响仍然严重缺乏研究。我们对现实的对抗性知识情景中基于地形的隐私渗漏进行首次全面分析,表明具有不同程度结构知识的对手可以推断出敏感数据分布模式,即使有很强的差别隐私保障。我们通过系统评估4 720个攻击案例,分析6个截然不同的对立知识情景:完整的地形知识和反映现实世界部署制约的5个部分知识配置。我们提议三种互补攻击矢量:通信模式分析、参数规模特征分析和结构位置相关,在完全的知识条件下,实现84.1%、65.0%和47.2%的成功率。关键是,我们发现80%的现实部分知识情景将攻击效力维持在安全阈值之上,某些部分知识配置的性业绩优于基线完整知识情景。为了应对这些脆弱性,我们提议并用经验验证结构性噪音注入作为808个配置的补充防御机制。我们提议三种互补攻击矢量:通信模式分析、参数规模分析、参数规模大小特征特征特征特征分析、通过现有隐私学习系统建立基本防御系统,显示攻击减少51.4%。


Article 46

Title@2025-06-24 (2): Research on Model Parallelism and Data Parallelism Optimization Methods in Large Language Model-Based Recommendation Systems

Title: Research on Model Parallelism and Data Parallelism Optimization Methods in Large Language Model-Based Recommendation Systems Forschung über Modellparallelität und Datenparallelität Optimierungsmethoden in großsprachlichen modellbasierten Empfehlungssystemen 研究示范平行主义和数据平行主义 2506.17551v2

Authors (6): Haowei Yang, Yu Tian, Zhongheng Yang, Zhao Wang, Chengrui Zhou, Dannier Li

With the rapid adoption of large language models (LLMs) in recommendation systems, the computational and communication bottlenecks caused by their massive parameter sizes and large data volumes have become increasingly prominent. This paper systematically investigates two classes of optimization methods, model parallelism and data parallelism, for distributed training of LLMs in recommendation scenarios. For model parallelism, we implement both tensor parallelism and pipeline parallelism, and introduce an adaptive load-balancing mechanism to reduce cross-device communication overhead. For data parallelism, we compare synchronous and asynchronous modes, combining gradient compression and sparsification techniques with an efficient aggregation communication framework to significantly improve bandwidth utilization. Experiments conducted on a real-world recommendation dataset in a simulated service environment demonstrate that our proposed hybrid parallelism scheme increases training throughput by over 30% and improves resource utilization by approximately 20% compared to traditional single-mode parallelism, while maintaining strong scalability and robustness. Finally, we discuss trade-offs among different parallel strategies in online deployment and outline future directions involving heterogeneous hardware integration and automated scheduling technologies.

在建议系统中迅速采用大型语言模型(LLMs)后,由大量参数大小和大量数据量造成的计算和通信瓶颈变得日益突出。本文件系统地调查了两种最优化方法 – – 模型平行和数据平行 – – 用于在建议情景下对LMs进行分布式培训。关于模式平行,我们实施了多种平行和编审平行,并引入了适应性负载平衡机制,以减少跨系统通信间接费用。关于数据平行,我们比较了同步和不同步模式,将梯度压缩和擦拭技术与高效集成通信框架相结合,以显著改善带宽利用率。在模拟服务环境中对现实世界建议数据集进行的实验表明,我们拟议的混合平行计划使培训的吞吐量增加了30%以上,使资源利用率比传统的单一模式平行化增加了大约20%,同时保持强大的可缩放性和稳健健。最后,我们讨论了在线部署中不同平行战略之间的取舍,并概述了涉及多种硬件整合和自动排期技术的未来方向。
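
Gradient sparsification is mentioned without detail, so the following is a sketch of one common variant, top-k sparsification, in which a worker sends only the k largest-magnitude gradient entries as index/value pairs. Choosing top-k is an assumption; the abstract does not say which sparsification scheme the authors use.

```python
def top_k_sparsify(grad, k):
    """Keep the k entries with the largest absolute value, as {index: value}."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    return {i: grad[i] for i in sorted(order[:k])}


def densify(sparse, size):
    """Rebuild a dense gradient on the receiver, filling dropped entries with 0."""
    dense = [0.0] * size
    for i, value in sparse.items():
        dense[i] = value
    return dense


grad = [0.1, -2.0, 0.05, 1.5]
sparse = top_k_sparsify(grad, k=2)
print(sparse)                      # {1: -2.0, 3: 1.5}
print(densify(sparse, len(grad)))  # [0.0, -2.0, 0.0, 1.5]
```

Practical systems usually accumulate the dropped entries locally and add them to the next step's gradient; that residual bookkeeping is omitted here.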


Article 47

Title@2025-06-24 (2): Shelby: Decentralized Storage Designed to Serve

Title: Shelby: Decentralized Storage Designed to Serve Shelby: Dezentraler Speicher für die Bedienung Shelby: 设计用于提供服务的分散储存 2506.19233v1

Authors (6): Guy Goren, Andrew Hariri, Timothy D. R. Hartley, Ravi Kappiyoor, Alexander Spiegelman, David Zmick

Existing decentralized storage protocols fall short of the service required by real-world applications. Their throughput, latency, cost-effectiveness, and availability are insufficient for demanding workloads such as video streaming, large-scale data analytics, or AI training. As a result, Web3 data-intensive applications are predominantly dependent on centralized infrastructure. Shelby is a high-performance decentralized storage protocol designed to meet demanding needs. It achieves fast, reliable access to large volumes of data while preserving decentralization guarantees. The architecture reflects lessons from Web2 systems: it separates control and data planes, uses erasure coding with low replication overhead and minimal repair bandwidth, and operates over a dedicated backbone connecting RPC and storage nodes. Reads are paid, which incentivizes good performance. Shelby also introduces a novel auditing protocol that provides strong cryptoeconomic guarantees without compromising performance, a common limitation of other decentralized solutions. The result is a decentralized system that brings Web2-grade performance to production-scale, read-intensive Web3 applications.

现有的分散储存协议无法满足现实世界应用程序所要求的服务,它们的吞吐量、延迟度、成本效益和可用性都不足以应付视频流、大规模数据分析或AI培训等繁重的工作量,因此,Web3数据密集型应用程序主要依赖中央基础设施;Shelby是一个高性能分散储存协议,旨在满足高要求的需求;它能够快速可靠地获取大量数据,同时保留分散化保障;该架构反映了Web2系统的经验教训:它将控制和数据平面分开,用低复制的间接费用和最小的修理带宽来消除编码,并在连接RPC和存储节点的专用主干线上运作;读者付费,鼓励良好的业绩;Shelby还引入了新的审计协议,提供强有力的加密经济保障,同时不损害业绩,其他分散化解决方案的共同限制;其结果是将Web2级的性能提高到生产规模、阅读密集的Web3应用程序的分散化系统。


Article 48

Title@2025-06-24 (2): Private Model Personalization Revisited

Title: Private Model Personalization Revisited Private Modell-Personalisierung überarbeitet 重新研究的私人个人模式 2506.19220v1

Authors (3): Conor Snedeker, Xinyu Zhou, Raef Bassily

We study model personalization under user-level differential privacy (DP) in the shared representation framework. In this problem, there are $n$ users whose data is statistically heterogeneous, and their optimal parameters share an unknown embedding $U^* \in\mathbb{R}^{d\times k}$ that maps the user parameters in $\mathbb{R}^d$ to low-dimensional representations in $\mathbb{R}^k$, where $k\ll d$. Our goal is to privately recover the shared embedding and the local low-dimensional representations with small excess risk in the federated setting. We propose a private, efficient federated learning algorithm to learn the shared embedding based on the FedRep algorithm in [CHM+21]. Unlike [CHM+21], our algorithm satisfies differential privacy, and our results hold for the case of noisy labels. In contrast to prior work on private model personalization [JRS+21], our utility guarantees hold under a larger class of users’ distributions (sub-Gaussian instead of Gaussian distributions). Additionally, in natural parameter regimes, we improve the privacy error term in [JRS+21] by a factor of $\widetilde{O}(dk)$. Next, we consider the binary classification setting. We present an information-theoretic construction to privately learn the shared embedding and derive a margin-based accuracy guarantee that is independent of $d$. Our method utilizes the Johnson-Lindenstrauss transform to reduce the effective dimensions of the shared embedding and the users’ data. This result shows that dimension-independent risk bounds are possible in this setting under a margin loss.

我们在共享表示框架下研究用户级差分隐私(DP)下的模型个性化。在该问题中,有$n$个用户,其数据在统计上是异质的,他们的最优参数共享一个未知的嵌入$U^* \in\mathbb{R}^{d\times k}$,该嵌入将$\mathbb{R}^d$中的用户参数映射为$\mathbb{R}^k$中的低维表示,其中$k\ll d$。我们的目标是在联邦设置下以较小的超额风险私密地恢复共享嵌入和本地低维表示。我们基于[CHM+21]中的FedRep算法,提出一种私密、高效的联邦学习算法来学习共享嵌入。与[CHM+21]不同,我们的算法满足差分隐私,且结果在标签含噪的情形下成立。与先前关于私密模型个性化的工作[JRS+21]相比,我们的效用保证适用于更大一类用户分布(亚高斯分布而非高斯分布)。此外,在自然的参数范围内,我们将[JRS+21]中的隐私误差项改进了$\widetilde{O}(dk)$倍。接下来,我们考虑二分类设置。我们给出一种信息论构造来私密地学习共享嵌入,并推导出与$d$无关的基于间隔的精度保证。我们的方法利用Johnson-Lindenstrauss变换来降低共享嵌入和用户数据的有效维度。该结果表明,在间隔损失下,此设置中可以得到与维度无关的风险界。
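
The Johnson-Lindenstrauss transform used for dimension reduction can be sketched as a random Gaussian projection. This shows only the generic transform under an assumed Gaussian construction and fixed seed; it does not include the privacy mechanism or the margin-based analysis of the paper, and the function names are hypothetical.

```python
import math
import random


def jl_matrix(d: int, k: int, seed: int = 0):
    """A k x d matrix with i.i.d. N(0, 1/k) entries (one standard JL construction)."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def project(matrix, x):
    """Map a d-dimensional vector to k dimensions."""
    return [sum(m * v for m, v in zip(row, x)) for row in matrix]


d, k = 1000, 50
P = jl_matrix(d, k)
x = [1.0] * d
y = project(P, x)
print(len(y))  # 50
print(math.sqrt(sum(v * v for v in y)) / math.sqrt(d))  # ratio of norms, close to 1 with high probability
```

The scaling by 1/sqrt(k) makes squared norms correct in expectation, which is the property that lets distances and margins survive the projection approximately.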


Article 49

Title@2025-06-23 (1): Binsparse: A Specification for Cross-Platform Storage of Sparse Matrices and Tensors

Title: Binsparse: A Specification for Cross-Platform Storage of Sparse Matrices and Tensors Binsparse: Eine Spezifikation für die plattformübergreifende Lagerung von Sparse Matrizen und Tensoren Binsparse: 微粒母体和导体跨平台储存规格 2506.19175v1

Authors (9): Benjamin Brock, Willow Ahrens, Hameer Abbasi, Timothy A. Davis, Juni Kim, James Kitchen, Spencer Patty, Isaac Virshup, Erik Welch

Sparse matrices and tensors are ubiquitous throughout multiple subfields of computing. The widespread usage of sparse data has inspired many in-memory and on-disk storage formats, but the only widely adopted storage specifications are the Matrix Market and FROSTT file formats, which both use ASCII text. Due to the inefficiency of text storage, these files typically have larger file sizes and longer parsing times than binary storage formats, which directly store an in-memory representation to disk. This can be a major bottleneck; since sparse computation is often bandwidth-bound, the cost of loading or storing a matrix to disk often exceeds the cost of performing a sparse computation. While it is common practice for practitioners to develop their own, custom, non-portable binary formats for high-performance sparse matrix storage, there is currently no cross-platform binary sparse matrix storage format. We present Binsparse, a cross-platform binary sparse matrix and tensor format specification. Binsparse is a modular, embeddable format, consisting of a JSON descriptor, which describes the matrix or tensor dimensions, type, and format, and a series of binary arrays, which can be stored in all modern binary containers, such as HDF5, Zarr, or NPZ. We provide several reference implementations of Binsparse spanning 5 languages, 5 frameworks, and 4 binary containers. We evaluate our Binsparse format on every matrix in the SuiteSparse Matrix Collection and a selection of tensors from the FROSTT collection. The Binsparse HDF5 CSR format shows file size reductions of 2.4x on average without compression and 7.5x with compression. We evaluate our parser’s read/write performance against a state-of-the-art Matrix Market parser, demonstrating warm cache mean read speedups of 26.5x without compression and 2.6x with compression, and write speedups of 31x without compression and 1.4x with compression.

稀疏矩阵和张量在计算的多个子领域中无处不在。稀疏数据的广泛使用催生了许多内存和磁盘存储格式,但唯一被广泛采用的存储规范是Matrix Market和FROSTT文件格式,二者都使用ASCII文本。由于文本存储效率低下,与直接将内存表示存入磁盘的二进制存储格式相比,这些文件通常体积更大、解析时间更长。这可能成为主要瓶颈:由于稀疏计算往往受带宽限制,从磁盘加载或向磁盘存储矩阵的开销常常超过执行一次稀疏计算的开销。尽管从业者通常会为高性能稀疏矩阵存储开发各自的、定制的、不可移植的二进制格式,但目前还没有跨平台的二进制稀疏矩阵存储格式。我们提出Binsparse,一种跨平台的二进制稀疏矩阵和张量格式规范。Binsparse是一种模块化、可嵌入的格式,由一个描述矩阵或张量的维度、类型和格式的JSON描述符,以及一系列二进制数组组成,这些数组可以存储在所有现代二进制容器中,如HDF5、Zarr或NPZ。我们提供了多个Binsparse参考实现,涵盖5种语言、5个框架和4种二进制容器。我们在SuiteSparse Matrix Collection的每个矩阵以及从FROSTT集合中选取的若干张量上评估了Binsparse格式。Binsparse HDF5 CSR格式在不压缩时平均将文件大小缩小2.4倍,在压缩时缩小7.5倍。我们将解析器的读写性能与最先进的Matrix Market解析器进行了对比,在热缓存条件下,平均读取加速比在不压缩时为26.5倍、压缩时为2.6倍,写入加速比在不压缩时为31倍、压缩时为1.4倍。
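
The "JSON descriptor plus binary arrays" design can be illustrated with the standard library alone. The descriptor keys (`format`, `shape`, `nnz`, `dtypes`) and the in-memory `payload` dictionary standing in for a container are hypothetical and are not the field names of the Binsparse specification; the sketch only shows the separation between a small text descriptor and raw binary arrays for a CSR matrix.

```python
import json
from array import array


def to_csr(dense):
    """Convert a dense row-major matrix to CSR pointers, column indices, values."""
    pointers, indices, values = [0], [], []
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                indices.append(col)
                values.append(value)
        pointers.append(len(indices))
    return pointers, indices, values


dense = [[0.0, 2.0, 0.0], [1.0, 0.0, 3.0]]
pointers, indices, values = to_csr(dense)

# Small, human-readable descriptor (illustrative keys, not the real spec).
descriptor = json.dumps({
    "format": "CSR",
    "shape": [len(dense), len(dense[0])],
    "nnz": len(values),
    "dtypes": {"pointers": "uint64", "indices": "uint64", "values": "float64"},
})

# Raw binary arrays, as a container such as HDF5, Zarr, or NPZ would hold them.
payload = {
    "pointers": array("Q", pointers).tobytes(),
    "indices": array("Q", indices).tobytes(),
    "values": array("d", values).tobytes(),
}
print(descriptor)
print({name: len(blob) for name, blob in payload.items()})
```

Reading is the mirror image: parse the descriptor first, then interpret each byte string with the declared type, so no text parsing of numeric data is needed.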


Article 50

Title@2025-06-23 (1): GradualDiff-Fed: A Federated Learning Specialized Framework for Large Language Model

Title: GradualDiff-Fed: A Federated Learning Specialized Framework for Large Language Model GradualDiff-Fed: Ein Federated Learning Specialized Framework für großes Sprachmodell 逐步发展伙伴关系:联邦学习大语言模式专门框架 2506.19164v1

Authors (2): Amir Faiyaz, Tara Salman

The rapid proliferation of large language models (LLMs) has created an unprecedented demand for fine-tuning models for specialized domains, such as medical science. While federated learning (FL) offers a decentralized and privacy-preserving approach to collaboratively fine-tune LLMs without sharing raw data, it presents significant challenges, particularly in performance and in managing large model sizes efficiently. In this paper, we introduce GradualDiff-Fed, an FL framework designed explicitly for LLMs and the challenge of handling their large parameter counts. GradualDiff-Fed reduces communication costs by transmitting only the difference of model weights rather than the entire model during training rounds. Such an approach significantly improves scalability and communication efficiency, making it more feasible to fine-tune LLMs across distributed clients without compromising performance. Our evaluation demonstrates that GradualDiff-Fed achieves performance on par with centralized training while drastically reducing communication overhead. These results highlight the potential of GradualDiff-Fed as an efficient solution for fine-tuning large models from distributed data in privacy-preserving settings without compromising performance.

大型语言模型(LLMs)的迅速扩展创造了对医学等专门领域微调模型的前所未有的需求。尽管联合学习(FL)为合作微调LMs提供了一种分散和隐私保护方法,在不分享原始数据的情况下对协作微调LMs提供了一种分散和保密的方法,但它提出了重大挑战,特别是在绩效和管理大型模型规模方面。在本文件中,我们引入了专门为LLMs设计的“GradualDiff-Fed”框架及其处理高参数规模的挑战。“GradualDiff-Fed”通过在培训回合中只传播模型重量的差别而不是整个模型来降低通信成本。这种方法大大提高了分布在分布在分散客户中的微调LLMs的可扩展性和通信效率,同时又不影响其业绩。我们的评估表明,GradualDiff-Fed在集中培训的同时取得了同等的业绩,同时大大减少了通信管理费用。这些结果突出表明了GradualDiff-Fed作为一种高效的解决方案,用于微调在不包含性能的情况下从隐私保护环境中传播的大模型中传播的大模型。
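
Transmitting only the weight difference can be sketched as follows. The sketch assumes that the saving comes from skipping entries that did not change, and it uses hypothetical helper names and a toy two-layer model; the abstract does not describe how GradualDiff-Fed actually encodes or compresses the difference.

```python
def weight_delta(base, updated):
    """Per layer, record only the entries that changed, as {index: difference}."""
    return {
        name: {i: u - b for i, (b, u) in enumerate(zip(base[name], updated[name])) if u != b}
        for name in base
    }


def apply_delta(base, delta):
    """Reconstruct the updated weights on the server from base + difference."""
    result = {}
    for name, weights in base.items():
        layer = list(weights)
        for i, diff in delta[name].items():
            layer[i] += diff
        result[name] = layer
    return result


base = {"layer1": [1.0, 2.0, 3.0], "layer2": [0.5, 0.5]}
updated = {"layer1": [1.0, 2.5, 3.0], "layer2": [0.5, 0.5]}
delta = weight_delta(base, updated)
print(delta)  # {'layer1': {1: 0.5}, 'layer2': {}}
print(apply_delta(base, delta) == updated)  # True
```

Here one number is sent instead of five. With real floating-point updates nearly every entry changes slightly, so a practical scheme would need a tolerance or quantization step, which is left out of this sketch.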


Article 51

Title@2025-06-23 (1): Survey of HPC in US Research Institutions

Title: Survey of HPC in US Research Institutions Erhebung über HPC in US-Forschungseinrichtungen 美国研究机构的HPC调查 2506.19019v1

Authors (6): Peng Shu, Junhao Chen, Zhengliang Liu, Huaqin Zhao, Xinliang Li, Tianming Liu

The rapid growth of AI, data-intensive science, and digital twin technologies has driven an unprecedented demand for high-performance computing (HPC) across the research ecosystem. While national laboratories and industrial hyperscalers have invested heavily in exascale and GPU-centric architectures, university-operated HPC systems remain comparatively under-resourced. This survey presents a comprehensive assessment of the HPC landscape across U.S. universities, benchmarking their capabilities against Department of Energy (DOE) leadership-class systems and industrial AI infrastructures. We examine over 50 premier research institutions, analyzing compute capacity, architectural design, governance models, and energy efficiency. Our findings reveal that university clusters, though vital for academic research, exhibit significantly lower growth trajectories (CAGR $\approx$ 18%) than their national ($\approx$ 43%) and industrial ($\approx$ 78%) counterparts. The increasing skew toward GPU-dense AI workloads has widened the capability gap, highlighting the need for federated computing, idle-GPU harvesting, and cost-sharing models. We also identify emerging paradigms, such as decentralized reinforcement learning, as promising opportunities for democratizing AI training within campus environments. Ultimately, this work provides actionable insights for academic leaders, funding agencies, and technology partners to ensure more equitable and sustainable HPC access in support of national research priorities.

AI、数据密集型科学和数字双子技术的迅速增长促使整个研究生态系统对高性能计算(HPC)的需求达到了前所未有的程度。虽然国家实验室和工业超大型企业在超大型和以GPU为中心的建筑上投入了大量资金,但大学运营的HPC系统仍然相对资源不足。这项调查对美国各大学的HPC景观进行了全面评估,对照能源部领导级系统和工业AI基础设施对其能力进行基准评估。我们审查了50多个首要研究机构,分析了计算能力、建筑设计、治理模型和能源效率等前所未有的需求。我们的调查结果显示,尽管对学术研究至关重要,但大学集群在增长轨迹(CAGRA $\ approx$ 18 %)上的投资大大低于其国家($\ approx 43%)和工业($\ approx 78% 7 % ) 。这项调查显示,在能源部领导阶层系统和成本分担模式中日益增长的能力差距加大,凸显了对联合计算、闲置的回收和成本分担模式的需求。我们还发现,正在形成的模式,例如分散的强化的学术学习,这是最终的学习,这是对ANI机构进行有希望的学习的机遇。
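To make the reported growth rates concrete, the following sketch shows how a compound annual growth rate (CAGR) is computed and how it compounds over time. The function names and the five-year horizon are illustrative assumptions; only the three rates (18%, 43%, 78%) come from the abstract.

```python
def cagr(start: float, end: float, years: float) -> float:
    """Compound annual growth rate: (end / start) ** (1 / years) - 1."""
    return (end / start) ** (1.0 / years) - 1.0


def growth_factor(rate: float, years: float) -> float:
    """Multiplicative growth after `years` at a constant annual `rate`."""
    return (1.0 + rate) ** years


# Rates quoted in the abstract; the 5-year horizon is an arbitrary choice.
rates = {"university": 0.18, "national": 0.43, "industrial": 0.78}
relative_capacity = {name: growth_factor(r, 5) for name, r in rates.items()}
```

Because growth compounds, a gap in CAGR widens quickly. Comparing the entries of `relative_capacity` shows how far apart the three sectors drift over the chosen horizon when they start from equal capacity.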


Article 52

Title@2025-06-23 (1): Pod: An Optimal-Latency, Censorship-Free, and Accountable Generalized Consensus Layer

Title: Pod: An Optimal-Latency, Censorship-Free, and Accountable Generalized Consensus Layer Pod: Eine optimale Latenz, Zensur-frei und buchhalterisch generalisierte Konsensebene pod:最佳、无检查和可问责的共识层 2501.14931v3

Authors (5): Orestis Alpos, Bernardo David, Jakov Mitrovski, Odysseas Sofikitis, Dionysis Zindros

This work addresses the inherent issues of high latency in blockchains and low scalability in traditional consensus protocols. We present pod, a novel notion of consensus whose first priority is to achieve the physically-optimal latency of one round-trip, i.e., requiring only one network trip for writing a transaction and one for reading it. To accomplish this, we first eliminate inter-replica communication. Instead, clients send transactions directly to all replicas, which independently process transactions and append them to local logs. Replicas assign a timestamp and a sequence number to each transaction in their logs, allowing clients to extract valuable metadata about the transactions and the system state. Later on, clients retrieve these logs and extract transactions (and associated metadata) from them. Necessarily, this construction achieves weaker properties than a total-order broadcast protocol, due to existing lower bounds. Our work models the primitive of pod and defines its security properties. We then show pod-core, a protocol that satisfies properties such as transaction confirmation within $2\delta$, censorship resistance against Byzantine replicas, and accountability for safety violations. We show that single-shot auctions can be realized using the pod notion and observe that it is also sufficient for other popular applications.

这项工作涉及块链高度悬吊和传统共识协议的可伸缩性低等固有问题。 我们提出一个新颖的共识概念, 即: 缓冲是一个新颖的共识概念, 其第一重点是实现一个回合的极优时间, 也就是说, 只需要一次网络旅行来撰写交易, 并且阅读它。 为此, 我们首先消除复制通信。 相反, 客户直接将交易发送给所有复制品, 后者独立处理交易, 并将其附在本地日志上。 复制品为其日志中的每笔交易指定一个时间戳和序列号, 允许客户提取关于交易和系统状态的宝贵元数据。 稍后, 客户从这些日志中提取交易( 和相关元数据) 。 由于现有的下限, 此建筑的特性比全顺序广播协议要弱。 我们的工作模型是原始的缓冲器, 并定义其安全性能。 我们然后显示缓冲号, 协议满足交易确认在 $2\ delta$ 范围内的特性, 允许客户提取关于交易和系统状态的宝贵元数据。 之后, 客户检索这些日志( ) 提取这些日志( ) 并提取这些日志( ) 并提取其它的安全性标签, 。 我们展示一个实现其他的普通化软件。
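The write and read paths described above (clients write to every replica, replicas log independently, clients later read the logs) can be sketched as a toy in-memory model. This is not pod-core: the `Replica` class, the logical clocks, the `quorum` parameter, and the rule of reporting the median timestamp are illustrative assumptions, and the sketch has no signatures, networking, or Byzantine behaviour.

```python
from dataclasses import dataclass, field
from statistics import median
from typing import Dict, List, Tuple


@dataclass
class Replica:
    clock: int = 0  # hypothetical local logical clock
    log: List[Tuple[int, int, str]] = field(default_factory=list)

    def write(self, tx: str) -> Tuple[int, int, str]:
        """Append tx to the local log with a sequence number and a timestamp."""
        self.clock += 1
        entry = (len(self.log), self.clock, tx)  # (sequence number, timestamp, tx)
        self.log.append(entry)
        return entry

    def read(self) -> List[Tuple[int, int, str]]:
        return list(self.log)


def client_write(replicas: List[Replica], tx: str) -> None:
    """One network trip: send the transaction directly to every replica."""
    for r in replicas:
        r.write(tx)


def client_read(replicas: List[Replica], quorum: int) -> Dict[str, float]:
    """One network trip: fetch all logs, keep txs seen by at least `quorum` replicas."""
    seen: Dict[str, List[int]] = {}
    for r in replicas:
        for _seq, ts, tx in r.read():
            seen.setdefault(tx, []).append(ts)
    return {tx: median(ts) for tx, ts in seen.items() if len(ts) >= quorum}


replicas = [Replica(clock=c) for c in (10, 20, 30)]  # deliberately unsynchronised clocks
client_write(replicas, "a")
replicas[0].write("b")  # reached only one replica
view = client_read(replicas, quorum=2)
```

Replicas never talk to each other here, which is what keeps a write or a read to a single round-trip. The cost is that different clients may see different orderings, which is consistent with the abstract's remark that pod is weaker than total-order broadcast.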


Article 53

Title@2025-06-23 (1): The Power of Strong Linearizability: the Difficulty of Consistent Refereeing

Title: The Power of Strong Linearizability: the Difficulty of Consistent Refereeing Die Kraft der starken Linearität: die Schwierigkeit des konsequenten Referees 强强线性力量:一致裁判的困难 2506.18401v1

Authors (3): Hagit Attiya, Armando Castañeda, Constantin Enea

This paper studies the relation between agreement and strongly linearizable implementations of various objects. This leads to new results about implementations of concurrent objects from various primitives including window registers and interfering primitives. We consider implementations that provide both strong linearizability and decisive linearizability. We show that lock-free and, respectively, wait-free strongly linearizable implementations of several concurrent objects entail a form of agreement that is weaker than consensus but impossible to implement in a strongly linearizable manner with combinations of non-universal primitives. In both cases, lock-free and wait-free, this form of agreement requires a distinguished process to referee a competition that involves all other processes. Our results show that consistent refereeing of such competitions (i.e., the outcome of the competition does not change in extensions of the current execution) requires high coordination power. More specifically, two contest objects are defined and used to capture the power of strong linearizability in lock-free and wait-free implementations, respectively. Both objects are strictly weaker than consensus, in the sense that they have a wait-free linearizable (in fact, decisively linearizable) implementation from reads and writes. The contest objects capture strong linearizability since (1) they have strongly linearizable implementations from several "high-level" objects like stacks, queues, snapshots, and counters, and therefore impossibility results for them carry over to these objects, and (2) they admit powerful impossibility results for strong linearizability that involve window registers and interfering primitives, which are non-universal.

本文研究了协议与各种目标的可大幅线性执行之间的关系。 这导致执行各种原始物体(包括窗口登记册和干扰原始物体)的并行目标方面的新结果。 我们考虑了提供强烈线性强和决定性线性执行的可靠结果。 我们确认,对若干并行物体的无锁、无等待、可强烈线性执行分别是一种比共识弱但无法以非普遍原始物体的组合强有力线性执行的协议形式。 在这两种情况中,无锁和无等待性,这种形式的协定要求有一个独特的进程,以提出涉及所有其他进程的可竞争性竞争。 我们的结果表明,此类竞争(即竞争的结果不会改变当前执行的延长)的可持续参考性需要高度协调力量。 更具体地说,对两个有争议的物体作了定义,并用来捕捉在无锁和无等待性执行中具有强烈线性执行力的强大线性力量。 这两种物体都完全比共识弱,即它们有可等待的可线性(事实上可直线性),执行从读性和书面和书面的不可比较性来看,这些目标从直径性和直径直径直径性一级、直径直径直径直径直线性,因此具有直线性执行。


Article 54

Title@2025-06-23 (1): Fully-Dynamic Parallel Algorithms for Single-Linkage Clustering

Title: Fully-Dynamic Parallel Algorithms for Single-Linkage Clustering Volldynamisch-Parallelalgorithmen für Single-Linkage-Clustering 单一链接集束的全动态平行平行数值 2506.18384v1

Authors (3): Quinten De Man, Laxman Dhulipala, Kishen N Gowda

Single-linkage clustering is a popular form of hierarchical agglomerative clustering (HAC) where the distance between two clusters is defined as the minimum distance between any pair of points across the two clusters. In single-linkage HAC, the output is typically the single-linkage dendrogram (SLD), which is the binary tree representing the hierarchy of clusters formed by iteratively contracting the two closest clusters. In the dynamic setting, prior work has only studied maintaining a minimum spanning forest over the data since single-linkage HAC reduces to computing the SLD on the minimum spanning forest of the data. In this paper, we study the problem of maintaining the SLD in the fully-dynamic setting. We assume the input is a dynamic forest $F$ (representing the minimum spanning forest of the data) which receives a sequence of edge insertions and edge deletions. To our knowledge, no prior work has provided algorithms to update an SLD asymptotically faster than recomputing it from scratch. All of our update algorithms are asymptotically faster than the best known static SLD computation algorithm, which takes $O(n \log h)$ time where $h$ is the height of the dendrogram ($h \leq n-1$). Furthermore, our algorithms are much faster in many cases, such as when $h$ is low. Our first set of results is an insertion algorithm in $O(h)$ time and a deletion algorithm in $O(h \log (1+n/h))$ time. Next, we describe parallel and batch-parallel versions of these algorithms which are work-efficient or nearly work-efficient and have polylogarithmic depth. Finally, we show how to perform insertions near-optimally in $O(c \log(1+n/c))$ time, where $c$ is the number of structural changes in the dendrogram caused by the update, and give a work-efficient parallel version of this algorithm that has polylogarithmic depth.

单链接群集是一种流行的形式, 它将两个组群之间的距离定义为两个组群中任何一对点之间的最小距离。 在单链接的 HAC 中, 输出通常是单链接的 dendrookram (SLD) , 这是代表由迭接两个最接近的组群组成的组群等级的二进制树 。 在动态环境下, 先前的工作仅研究在数据上维持一个最小的跨森林范围, 因为单链接的 HAC (HAC) 将两个组群之间的距离定义为两个组群中任何一对点之间的最小距离 。 在本文件中, 我们研究在完全动态的设置中维持 SLD 的问题。 我们假设输入的是动态的 $ (代表数据最短的森林范围) dendrdrook groom (SLD) 。 之前的工作没有提供算法来更新 SLD ymptoc 的速度比从抓取的数据快。 我们所有更新的算法都比已知的 固定的 $- ral ral ral or 。 lial oral deal oral oral or or or or 工作是 。
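For orientation, the static problem that the dynamic algorithms improve on can be sketched as follows: given the edges of a spanning forest, sort them by weight and contract them in order with a union-find structure, creating one dendrogram node per contraction. This is a simple sort-based baseline, not the $O(n \log h)$ algorithm cited in the abstract and not the paper's dynamic update procedure; the function name and the tuple layout are assumptions.

```python
from typing import List, Tuple


def single_linkage_dendrogram(
    n: int, edges: List[Tuple[float, int, int]]
) -> List[Tuple[int, int, int, float]]:
    """Build the SLD of a forest on points 0..n-1.

    edges: (weight, u, v) tuples. Returns merges as
    (new_node_id, left_child_id, right_child_id, weight); leaves are 0..n-1.
    """
    parent = list(range(n))   # union-find parent pointers over points
    node_of = list(range(n))  # dendrogram node currently representing each root

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    merges = []
    next_id = n
    for w, u, v in sorted(edges):  # contract the closest pair of clusters first
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # edge would close a cycle, so it is not a forest edge
        merges.append((next_id, node_of[ru], node_of[rv], w))
        parent[ru] = rv
        node_of[rv] = next_id
        next_id += 1
    return merges


# Path 0 - 1 - 2 - 3 with weights 1.0, 3.0, 2.0
sld = single_linkage_dendrogram(4, [(1.0, 0, 1), (3.0, 1, 2), (2.0, 2, 3)])
```

Inserting or deleting a single forest edge changes only part of this merge sequence. Exploiting that locality, rather than re-running the whole loop, is the kind of saving the abstract's $O(h)$ and $O(c \log(1+n/c))$ bounds describe.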


Article 55

Title@2025-06-23 (1): A Contention-Free Model for Converged Kubernetes on HPC

Title: A Contention-Free Model for Converged Kubernetes on HPC Ein konfliktfreies Modell für konvergierte Kubernete auf HPC 高常动中聚苯乙烯趋同Kubernets无内容模式 2406.06995v2

Authors (3): Vanessa Sochat, David Fox, Daniel Milroy

High performance computing (HPC) and cloud have traditionally been separate, and presented in an adversarial light. The conflict arises from disparate beginnings that led to two drastically different cultures, incentive structures, and communities that are now in direct competition with one another for resources, talent, and speed of innovation. With the emergence of converged computing, a new paradigm of computing has entered the space that advocates for bringing together the best of both worlds from a technological and cultural standpoint. This movement has emerged due to economic and practical needs. Emerging heterogeneous, complex scientific workloads that require an orchestration of services, simulation, and reaction to state can no longer be served by traditional HPC paradigms. However, while cloud offers automation, portability, and orchestration, as it stands now it cannot deliver the network performance, fine-grained resource mapping, or scalability that these same simulations require. These novel requirements call for change not just in workflow software or design, but also in the underlying infrastructure to support them. This is one of the goals of converged computing. While the future of traditional HPC and commercial cloud cannot be entirely known, a reasonable approach to take is one that focuses on new models of convergence, and a collaborative mindset. In this paper, we introduce a new paradigm for compute – a traditional HPC workload manager, Flux Framework, running seamlessly with a user-space Kubernetes “Usernetes” to bring a service-oriented, modular, and portable architecture directly to on-premises HPC clusters. We present experiments that assess HPC application performance and networking between the environments, and provide a reproducible setup for the larger community to do exactly that.

高性能计算(HPC)和云层传统上是分开的,以对抗的光线显示。冲突源于不同的开端,导致两种截然不同的文化、激励结构和社区在资源、人才和创新速度方面相互直接竞争。随着趋同的计算的出现,一种新的计算范式进入了提倡从技术和文化角度将两个世界的最佳组合起来的空间。这种运动是由于经济和实际需要而出现的。新出现的各种复杂的科学工作量,需要协调服务、模拟和对国家的反应,而传统的HPC模式已不再能满足不同的科学工作量。然而,虽然云能提供自动化、可移动性和协调性,因为现在它无法提供网络业绩、微微微资源映射,或这些模拟所需要的伸缩性。这些新的要求不仅需要从工作流程软件或设计方面,还需要在支持它们的基本基础设施方面进行变革。这是趋同的计算目标之一。传统HPC和商业云层之间的未来不能完全为HPC的实验模式服务模式所服务模式所带来的好处。 运行中的一种合理的方法就是将新的HPC的模型引入新的HPC结构。


Article 56

Title@2025-06-23 (1): Edge Association Strategies for Synthetic Data Empowered Hierarchical Federated Learning with Non-IID Data

Title: Edge Association Strategies for Synthetic Data Empowered Hierarchical Federated Learning with Non-IID Data Edge Association Strategien für Synthetische Daten Empowered Hierarchical Federated Learning mit nicht-ID Daten 合成数据协会赋予非IID数据高级联邦学习权力、非IID数据的高级联邦学习战略 2506.18259v1

Authors (6): Jer Shyuan Ng, Aditya Pribadi Kalapaaking, Xiaoyu Xia, Dusit Niyato, Ibrahim Khalil, Iqbal Gondal

In recent years, Federated Learning (FL) has emerged as a widely adopted privacy-preserving distributed training approach, attracting significant interest from both academia and industry. Research efforts have been dedicated to improving different aspects of FL, such as algorithm improvement, resource allocation, and client selection, to enable its deployment in distributed edge networks for practical applications. One reason for poor FL model performance is worker dropout during training, as the FL server may be located far away from the FL workers. To address this issue, a Hierarchical Federated Learning (HFL) framework has been introduced, incorporating an additional layer of edge servers to relay communication between the FL server and workers. While the HFL framework improves the communication between the FL server and workers, a large number of communication rounds may still be required for model convergence, particularly when FL workers have non-independent and identically distributed (non-IID) data. Moreover, the FL workers are assumed to fully cooperate in the FL training process, which may not always be true in practical situations. To overcome these challenges, we propose a synthetic-data-empowered HFL framework that mitigates the statistical issues arising from non-IID local datasets while also incentivizing FL worker participation. In our proposed framework, the edge servers reward the FL workers in their clusters for facilitating the FL training process. To improve the performance of the FL model given the non-IID local datasets of the FL workers, the edge servers generate and distribute synthetic datasets to FL workers within their clusters. FL workers determine which edge server to associate with, considering the computational resources required to train on both their local datasets and the synthetic datasets.

近年来,联邦学习联合会(FL)已成为广泛采用的保密保密分布式培训方法,吸引了学术界和行业的极大兴趣,研究致力于改进FL的不同方面,如算法改进、资源分配和客户选择,以便能够在分布的边缘网络中部署,以实际应用。FL模式表现不佳的原因之一是培训期间工人辍学,因为FL服务器可能远离FL工人。为解决这一问题,引入了等级式边际学习框架,增加了一层边缘服务器,以转发FL服务器和工人之间的通信。虽然HLFL框架改善了FL服务器和工人之间的沟通,但模型趋同仍然需要大量通信回合,特别是FL工人不依赖和同样分发(非II)数据。此外,FL工人假定在FL培训过程中将充分合作,这在实际情况下可能并不总是如此。为了克服这些挑战,我们提议在FL服务器中为FL服务器工人提供综合数据,同时在FL服务器中为FL服务器的员工提供非数据化数据,同时在FL服务器中为FL服务器的员工提供数据升级数据,同时在FL服务器中为FL服务器的员工提供数据升级。


Article 57

Title@2025-06-22 (7): DeInfoReg: A Decoupled Learning Framework for Better Training Throughput

Title: DeInfoReg: A Decoupled Learning Framework for Better Training Throughput DeInfoReg: Ein entkoppelter Lernrahmen für besseren Trainingsdurchsatz DInfoReg:一个分离的学习框架,以改善培训工作量 2506.18193v1

Authors (3): Zih-Hao Huang, You-Teng Lin, Hung-Hsuan Chen

This paper introduces Decoupled Supervised Learning with Information Regularization (DeInfoReg), a novel approach that transforms a long gradient flow into multiple shorter ones, thereby mitigating the vanishing gradient problem. Integrating a pipeline strategy, DeInfoReg enables model parallelization across multiple GPUs, significantly improving training throughput. We compare our proposed method with standard backpropagation and other gradient flow decomposition techniques. Extensive experiments on diverse tasks and datasets demonstrate that DeInfoReg achieves superior performance and better noise resistance than traditional BP models and efficiently utilizes parallel computing resources. The code for reproducibility is available at: https://github.com/ianzih/Decoupled-Supervised-Learning-for-Information-Regularization/.

本文介绍分解监督学习与信息规范化(DeInfoReg),这是一种新颖的办法,将长梯度流转换成多短的梯度流,从而减轻消失的梯度问题。DInfoReg结合了管道战略,使多个GPU的模型平行化,大大改进了培训流程。我们比较了我们提出的方法与标准回路转换和其他梯度流分解技术。关于不同任务和数据集的广泛实验表明,DeInfoReg比传统的BP模型取得优异性、更强的噪音阻力,并有效利用平行计算资源。可复制代码见:https://github.com/ianzih/Decoupled-Supervised-Learch-Infric-Regulization/。


Article 58

Title@2025-06-22 (7): Floating-Point Data Transformation for Lossless Compression

Title: Floating-Point Data Transformation for Lossless Compression Floating-Point-Datentransformation für verlustfreie Kompression 用于无损失压缩的浮动点数据转换 2506.18062v1

Authors (2): Samirasadat Jamalidinan, Kazem Cheshmi

Floating-point data is widely used across various domains. Depending on the required precision, each floating-point value can occupy several bytes. Lossless storage of this information is crucial due to its critical accuracy, as seen in applications such as medical imaging and language model weights. In these cases, data size is often significant, making lossless compression essential. Previous approaches either treat this data as raw byte streams for compression or fail to leverage all patterns within the dataset. However, because multiple bytes represent a single value and due to inherent patterns in floating-point representations, some of these bytes are correlated. To leverage this property, we propose a novel data transformation method called Typed Data Transformation (DTT) that groups related bytes together to improve compression. We implemented and tested our approach on various datasets across both CPU and GPU. DTT achieves a geometric mean compression ratio improvement of 1.16$\times$ over state-of-the-art compression tools such as zstd, while also improving both compression and decompression throughput by 1.18–3.79$\times$.

浮点数据在不同领域广泛使用。 取决于所需的精确度, 每个浮点值可以占几个字节。 信息无损存储由于其关键准确性至关重要, 正如医学成像和语言模型重量等应用中所看到的那样。 在这些情况下,数据大小往往很大, 使得无损压缩变得必要。 以往的方法要么将这些数据作为原始字节流处理, 用于压缩, 要么无法利用数据集中的所有模式。 但是, 由于多个字节代表一个单值, 并且由于浮点表示的固有模式, 其中一些字节是相互关联的。 为了利用这一属性, 我们提议了一个叫作“ 键入数据转换” (\ DTT) 的新的数据转换方法, 即将之称为“ 键入数据转换” (\ DTT) , 以组合组合方式连接来改进压缩。 我们在CPU 和 GPU 的多个数据集上实施并测试了我们的方法。 \ DTT 实现了像 zstd 这样的测算平均压缩比率改进了1.16美元, 同时将压缩和解压缩为1. 18-3. 79美元。
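The general idea of grouping related bytes before compression can be illustrated with a plain byte-plane transpose: all first bytes of every value are stored together, then all second bytes, and so on. This is a simplified stand-in, not the DTT method itself, which chooses its groupings differently. The function names, the fixed per-byte grouping, and the use of `zlib` (instead of zstd) are assumptions.

```python
import struct
import zlib
from typing import List, Tuple


def group_bytes(values: List[float], fmt: str = "<d") -> Tuple[bytes, int]:
    """Pack values, then store byte 0 of every value, then byte 1, and so on."""
    width = struct.calcsize(fmt)
    raw = b"".join(struct.pack(fmt, v) for v in values)
    grouped = b"".join(raw[i::width] for i in range(width))
    return grouped, width


def ungroup_bytes(grouped: bytes, width: int, fmt: str = "<d") -> List[float]:
    """Invert group_bytes: re-interleave the byte planes and unpack the values."""
    n = len(grouped) // width
    planes = [grouped[i * n:(i + 1) * n] for i in range(width)]
    raw = bytes(planes[i][j] for j in range(n) for i in range(width))
    return [struct.unpack_from(fmt, raw, j * width)[0] for j in range(n)]


values = [1.0 + i * 1e-3 for i in range(1000)]
grouped, width = group_bytes(values)
compressed = zlib.compress(grouped)  # general-purpose compressor on the regrouped bytes
restored = ungroup_bytes(zlib.decompress(compressed), width)
```

The transform is a pure permutation of bytes, so it is lossless by construction. Whether it helps depends on the data: sign and exponent bytes of nearby values tend to repeat, while low mantissa bytes often look random.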


Article 59

Title@2025-06-22 (7): Leveraging Cloud-Fog Automation for Autonomous Collision Detection and Classification in Intelligent Unmanned Surface Vehicles

Title: Leveraging Cloud-Fog Automation for Autonomous Collision Detection and Classification in Intelligent Unmanned Surface Vehicles Nutzung von Cloud-Fog Automation zur autonomen Kollisionserkennung und Klassifizierung in intelligenten unbemannten Oberflächenfahrzeugen 利用云雾自动化对智能无载表面车辆进行自动碰撞探测和分类 2506.18024v1

Authors (7): Thien Tran, Quang Nguyen, Jonathan Kua, Minh Tran, Toan Luu, Thuong Hoang, Jiong Jin

Industrial Cyber-Physical Systems (ICPS) technologies are foundational in driving maritime autonomy, particularly for Unmanned Surface Vehicles (USVs). However, onboard computational constraints and communication latency significantly restrict real-time data processing, analysis, and predictive modeling, hence limiting the scalability and responsiveness of maritime ICPS. To overcome these challenges, we propose a distributed Cloud-Edge-IoT architecture tailored for maritime ICPS by leveraging design principles from the recently proposed Cloud-Fog Automation paradigm. Our proposed architecture comprises three hierarchical layers: a Cloud Layer for centralized and decentralized data aggregation, advanced analytics, and future model refinement; an Edge Layer that executes localized AI-driven processing and decision-making; and an IoT Layer responsible for low-latency sensor data acquisition. Our experimental results demonstrated improvements in computational efficiency, responsiveness, and scalability. When compared with our conventional approaches, we achieved a classification accuracy of 86%, with an improved latency performance. By adopting Cloud-Fog Automation, we address the low-latency processing constraints and scalability challenges in maritime ICPS applications. Our work offers a practical, modular, and scalable framework to advance robust autonomy and AI-driven decision-making for intelligent USVs in future maritime ICPS.

为克服这些挑战,我们提议利用最近提出的云雾自动化模式的设计原则,专门为海洋比较方案设计一个分布式的云层结构,我们的拟议结构由三个等级层组成:云层,用于集中和分散的数据汇总、先进的分析方法和未来模型的完善;云层,用于本地化的AI驱动的处理和决策;以及IoT层,负责低延迟感测数据的获取;我们的实验结果表明,在计算效率、反应能力和可缩放性方面有所改进;与我们的传统方法相比,我们实现了86的分类准确度,并改进了拉力性表现;通过采用云层自动化,我们解决了低延迟处理限制和可扩展性、未来模型改进;执行本地化AI-驱动的处理和决策;以及负责低延迟感应感测数据的采集的IoT层。


Article 60

Title@2025-06-22 (7): CFTel: A Practical Architecture for Robust and Scalable Telerobotics with Cloud-Fog Automation

Title: CFTel: A Practical Architecture for Robust and Scalable Telerobotics with Cloud-Fog Automation CFTel: Eine praktische Architektur für robuste und skalierbare Telerobotik mit Cloud-Fog Automation FLFel:一个有云雾自动化的强力和可缩放的Telerostotics实用建筑 2506.17991v1

Authors (6): Thien Tran, Jonathan Kua, Minh Tran, Honghao Lyu, Thuong Hoang, Jiong Jin

Telerobotics is a key foundation in autonomous Industrial Cyber-Physical Systems (ICPS), enabling remote operations across various domains. However, conventional cloud-based telerobotics suffers from latency, reliability, scalability, and resilience issues, hindering real-time performance in critical applications. Cloud-Fog Telerobotics (CFTel) builds on the Cloud-Fog Automation (CFA) paradigm to address these limitations by leveraging a distributed Cloud-Edge-Robotics computing architecture, enabling deterministic connectivity, deterministic connected intelligence, and deterministic networked computing. This paper synthesizes recent advancements in CFTel, aiming to highlight its role in facilitating scalable, low-latency, autonomous, and AI-driven telerobotics. We analyze architectural frameworks and technologies that enable them, including 5G Ultra-Reliable Low-Latency Communication, Edge Intelligence, Embodied AI, and Digital Twins. The study demonstrates that CFTel has the potential to enhance real-time control, scalability, and autonomy while supporting service-oriented solutions. We also discuss practical challenges, including latency constraints, cybersecurity risks, interoperability issues, and standardization efforts. This work serves as a foundational reference for researchers, stakeholders, and industry practitioners in future telerobotics research.

远程机器人是自主工业网络-物理系统(ICPS)的关键基础,它使各个领域的远程操作得以进行,然而,基于云的常规调频器存在隐性、可靠性、可伸缩性和复原力问题,妨碍关键应用的实时性能。 云-雾调频机(CFA)以云-雾自动化(CFA)范式为基础,通过利用分布式的云-叶-机器人计算架构,实现确定性连通、确定性连接、确定性连接情报和确定性网络计算,解决这些限制。本文综合了基于云的调频调频调频器的最新进展,目的是突出其在便利可变、低延迟、自主和AI驱动调频调频应用方面的作用。我们分析了使这些系统能够发挥作用的建筑框架和技术,包括5G超可燃低频低温通信、Edge Intellictell、Emblodiged-Robotictical AI和数字双峰。这项研究表明,FCTFCTel有可能加强实时控制、可缩性和自主性和自主性,同时支持服务导向性解决方案。我们还讨论了面向用户的标准化性研究。


Article 61

Title@2025-06-22 (7): Leveraging Large Language Model for Intelligent Log Processing and Autonomous Debugging in Cloud AI Platforms

Title: Leveraging Large Language Model for Intelligent Log Processing and Autonomous Debugging in Cloud AI Platforms Nutzung eines großen Sprachmodells für intelligente Protokollverarbeitung und autonomes Debugging in Cloud-KI-Plattformen 利用大语言模型,在云层独立平台中利用智能日志处理和自动调试大语言模型 2506.17900v1

Authors (2): Cheng Ji, Huaiying Luo

With the increasing complexity and rapid expansion of the scale of AI systems in cloud platforms, the log data generated during system operation is massive, unstructured, and semantically ambiguous, which brings great challenges to fault location and system self-repair. In order to solve this problem, this paper proposes an intelligent log processing and automatic debugging framework based on Large Language Model (LLM), named Intelligent Debugger (LLM-ID). This method extends the existing pre-trained Transformer model and integrates a multi-stage semantic inference mechanism to realize the context understanding of system logs and the automatic reconstruction of fault chains. First, the system log is dynamically structured, and an unsupervised clustering and embedding mechanism is used to extract the event template and semantic schema. Subsequently, the fine-tuned LLM is combined with a multi-round attention mechanism to perform contextual reasoning on the log sequence and generate potential fault assumptions and root cause paths. Furthermore, this paper introduces a reinforcement learning-based policy-guided recovery planner, which is driven by the remediation strategy generated by the LLM to support dynamic decision-making and adaptive debugging in the cloud environment. Compared with the existing rule engine or traditional log analysis system, the proposed model has stronger semantic understanding ability, continuous learning ability and heterogeneous environment adaptability. Experiments on the cloud platform log dataset show that LLM-ID improves the fault location accuracy by 16.2%, which is significantly better than the current mainstream methods.

随着云层平台内AI系统规模的日益复杂和迅速扩大,在系统运行期间产生的日志数据是大规模、无结构的和语义模糊的,给定位和系统自我修复带来巨大的挑战。为了解决这一问题,本文件提议了一个基于大语言模型(LLM)的智能化日志处理和自动调试框架,名为Intelligent Debugger(LLM-ID) 。这个方法是根据预先培训的变异器模型扩展的,并结合一个多阶段的语义推论机制,以实现对系统日志的背景理解和自动重建错误定位链。首先,系统日志是动态结构化结构化的,并且使用未经监督的组合和嵌入机制来提取事件模板和语义结构图。随后,经过精细调整的LLMM与多圈关注机制一起对日志序列进行背景推理推理,以产生潜在的错误假设和根本原因路径。此外,本文还介绍了一个基于强化学习的政策指导的恢复计划设计者机制,这个机制是由当前定位定位系统生成的补救战略驱动,它能大大地支持不断调整的云层规则环境。


Article 62

Title@2025-06-22 (7): SPD-CFL: Stepwise Parameter Dropout for Efficient Continual Federated Learning

Title: SPD-CFL: Stepwise Parameter Dropout for Efficient Continual Federated Learning SPD-CFL: Schrittweiser Parameter-Ausfall für effizientes kontinuierliches Federated Learning SPD-CFL: 高效持续联邦学习的分级参数辍学 2405.09394v2

Authors (8): Yuning Yang, Han Yu, Chuan Sun, Tianrun Gao, Xiaohong Liu, Xiaodong Xu, Ping Zhang, Guangyu Wang

Federated Learning (FL) is a collaborative machine learning paradigm for training models on local sensitive data with privacy protection. Pre-trained transformer-based models have emerged as useful foundation models (FMs) to be fine-tuned for a wide range of downstream tasks. However, large-scale pre-trained models make it challenging for traditional FL due to high communication overhead in the resource-constrained IoT. This has inspired the field of parameter-efficient fine-tuning (PEFT) research. Existing PEFT methods attempt to optimize model performance at the given dropout level. Such an approach places the burden on human users to find a dropout rate that provides a satisfactory level of performance through trial-and-error, which is time consuming and resource intensive. To address this limitation, we propose the Step-wise Parameter Dropout for Continual Federated Learning (SPD-CFL) approach. Instead of pre-defining a desired dropout rate, it allows users to specify the target level of performance and then attempts to find the most suitable dropout rate for the given FL model. Specifically, on the server side, SPD-CFL drops trainable parameters in a stepwise manner to improve communication efficiency by reducing the rank of low-rank adaptation (LoRA). The sensitivity-based gradient consistency (SGC) measure is designed to facilitate the adaptive adjustment of parameter dropout. In addition, SPD-CFL introduces continual learning (CL) on the client side to mitigate performance degradation due to the inconsistent optima with distinct parameter dropout rates under heterogeneous FL. Extensive experiments on the public benchmark dataset CIFAR-10 and a real-world medical Face dataset demonstrate significant superiority of SPD-CFL over state-of-the-art methods. Compared to the best-performing baseline, it achieves a 2.07% higher test AUC while reducing communication overhead by 29.53%.

联邦学习联合会(FL)是培训具有隐私保护的当地敏感数据模型的合作机器学习模式; 培训前变压器模型已成为有用的基础模型(FMs),可以对广泛的下游任务进行微调; 然而,由于资源限制的IoT中通信管理费用较高,大规模预培训模型使得传统FL具有挑战性。 这启发了参数效率微调(PEFT)研究领域; 现有的PEFT方法试图优化特定辍学率水平的模型性能; 这种方法使人类用户难于找到一种通过试验和机床提供令人满意的性能水平的基础模型(FMs),这种模型耗时耗和资源密集。 然而,为了应对这一限制,我们建议采用“持续联邦学习联盟(SPD-CFL)的分数退出方法。 与预先确定一个预期的辍学率相比,用户可以指定业绩目标水平,然后尝试通过特定的FL模式找到最合适的辍学率。 具体地,在服务器侧,SD-L(SD-L) 货币变离子的精度调性精度(S-L) 的里程精度精确度精确度精确度精确度比,同时调整SLDFL) 数据标准的精确比值的精确比值的精确度,通过一个进步的精确度调整方法,可以降低的精确度,使SLDFLDDDDFLDDDD值的精确度的精确度的精确度比值调整。


Article 63

Title@2025-06-22 (7): NestQuant: Post-Training Integer-Nesting Quantization for On-Device DNN

Title: NestQuant: Post-Training Integer-Nesting Quantization for On-Device DNN NestQuant: Post-Training Integer-Nesting Quantization for On-Device DNN NestQuant: 培训后DNN的整数 2506.17870v1

Authors (6): Jianhang Xie, Chuntao Ding, Xiaqing Li, Shenyuan Ren, Yidong Li, Zhichao Lu

Deploying quantized deep neural network (DNN) models with resource adaptation capabilities on ubiquitous Internet of Things (IoT) devices to provide high-quality AI services can leverage the benefits of compression and meet multi-scenario resource requirements. However, existing dynamic/mixed precision quantization requires retraining or special hardware, whereas post-training quantization (PTQ) has two limitations for resource adaptation: (i) The state-of-the-art PTQ methods only provide one fixed bitwidth model, which makes it challenging to adapt to the dynamic resources of IoT devices; (ii) Deploying multiple PTQ models with diverse bitwidths consumes large storage resources and switching overheads. To this end, this paper introduces a resource-friendly post-training integer-nesting quantization, i.e., NestQuant, for on-device quantized model switching on IoT devices. The proposed NestQuant incorporates the integer weight decomposition, which bit-wise splits quantized weights into higher-bit and lower-bit weights of integer data types. It also contains a decomposed weights nesting mechanism to optimize the higher-bit weights by adaptive rounding and nest them into the original quantized weights. In deployment, we can send and store only one NestQuant model and switch between the full-bit/part-bit model by paging in/out lower-bit weights to adapt to resource changes and reduce consumption. Experimental results on the ImageNet-1K pretrained DNNs demonstrated that the NestQuant model can achieve high top-1 accuracy while reducing data transmission, storage consumption, and switching overheads. In particular, the ResNet-101 with INT8 nesting INT6 can achieve 78.1% and 77.9% accuracy for full-bit and part-bit models, respectively, and reduce switching overheads by approximately 78.1% compared with diverse-bitwidth PTQ models.

在无处不在的Tings Internet (IoT) 设备上部署具有资源适应能力、可提供高质量AI服务的资源倾斜的深神经网络(DNN)模型,可以发挥压缩的好处,并满足多种预想的资源要求。然而,现有的动态/混合精密四分化需要再培训或特殊硬件,而培训后四分化(PTQ)在资源调整方面有两个局限性:(一) 最先进的PTQ方法仅提供一种固定的比特宽度模型,因此难以适应IOT 设备动态资源;(二) 使用多个具有不同位宽度的PTQQQQ的多位中位离位存储权模型消耗了大量的存储资源,并转换高端资源。为此,本文引入了一种资源友好的后级整分级整分量四分解后,Nest QQQQQtalt(Nestal Qet) 将最小四分化的重量转换成高位数和低位位位数的存储模型, QQQItal-ldal-deal-deal-dealalalal deal dealalal deal rotial deal deal deal deal deality deality deal mod mod mod modmental deal) deal deal deal deal deal deal deal deal deal deal deal deal deal deal deal destitmental deal dealment smmmmmmental deal deal deal deal deal deal deal dealment smmment.
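The bit-wise split of a quantized integer weight into higher-bit and lower-bit parts can be illustrated with a short sketch. It assumes that "INT8 nesting INT6" means keeping the top 6 bits as the part-bit weight and paging the bottom 2 bits in or out. The function names are hypothetical, and the sketch uses a plain arithmetic shift, whereas the paper optimizes the higher-bit weights with adaptive rounding.

```python
from typing import Tuple


def decompose(w: int, low_bits: int = 2) -> Tuple[int, int]:
    """Split a signed integer weight into (higher-bit part, lower-bit part)."""
    high = w >> low_bits              # arithmetic shift: signed higher bits
    low = w & ((1 << low_bits) - 1)   # non-negative remainder in the low bits
    return high, low


def recompose(high: int, low: int, low_bits: int = 2) -> int:
    """Full-bit weight: page the lower bits back in."""
    return (high << low_bits) | low


def part_bit(high: int, low_bits: int = 2) -> int:
    """Part-bit weight on the original scale, with the lower bits paged out."""
    return high << low_bits


high, low = decompose(-77)       # an INT8 weight
full = recompose(high, low)      # exact original value
coarse = part_bit(high)          # approximation using only the INT6 part
```

Only the higher-bit tensor has to stay resident. Switching to the full-bit model means loading the small lower-bit tensor and recombining, which is where the reduced switching overhead reported in the abstract would come from.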


Article 64

Title@2025-06-21 (6): FedBaF: Federated Learning Aggregation Biased by a Foundation Model

Title: FedBaF: Federated Learning Aggregation Biased by a Foundation Model FedBaF: Federated Learning Aggregation Durch ein Stiftungsmodell biased FedBAF: 联邦学习联合组织 2410.18352v3

Authors (4): Jong-Ik Park, Srinivasa Pranav, José M. F. Moura, Carlee Joe-Wong

Foundation models are now a major focus of leading technology organizations due to their ability to generalize across diverse tasks. Existing approaches for adapting foundation models to new applications often rely on Federated Learning (FL) and disclose the foundation model weights to clients when using it to initialize the global model. While these methods ensure client data privacy, they compromise model and information security. In this paper, we introduce Federated Learning Aggregation Biased by a Foundation Model (FedBaF), a novel method for dynamically integrating pre-trained foundation model weights during the FL aggregation phase. Unlike conventional methods, FedBaF preserves the confidentiality of the foundation model while still leveraging its power to train more accurate models, especially in non-IID and adversarial scenarios. Our comprehensive experiments use Pre-ResNet and foundation models like Vision Transformer to demonstrate that FedBaF not only matches, but often surpasses the test accuracy of traditional weight initialization methods by up to 11.4% in IID and up to 15.8% in non-IID settings. Additionally, FedBaF applied to a Transformer-based language model significantly reduced perplexity by up to 39.2%.

基金会模式目前是主要技术组织的一个主要重点,因为它们有能力推广各种任务。现有的基础模型适应新应用的方法往往依靠联邦学习联合会(FL),并在使用基础模型来启动全球模型时向客户披露基础模型的权重。这些方法可以确保客户数据隐私,但会损害模型和信息安全。在本文中,我们引入了一个基础模型(FedBaF)的联邦学习聚合,这是在FL汇总阶段动态地将培训前的基础模型权重整合起来的新方法。与常规方法不同,FedBaF保存基础模型的保密性,同时仍然利用其力量培训更准确的模型,特别是在非IID和对抗情景中。我们的全面实验使用前ResNet和愿景变异模型来证明FedBaF不仅匹配,而且常常超过传统重量初始化方法的测试精度,在ID综合阶段高达11.4%,在非IID环境下高达15.8%。此外,FedBAF适用于基于变异语言模型,将不易性显著降低到39.2%。。


Article 65

Title@2025-06-21 (6): Implementation and Evaluation of Fast Raft for Hierarchical Consensus

Title: Implementation and Evaluation of Fast Raft for Hierarchical Consensus Umsetzung und Bewertung des Fast Raft für den Hierarchischen Konsens 落实和评价促进等级共识的快行道 2506.17793v1

Authors (2): Anton Melnychuk, Bryan SebaRaj

We present the first open-source implementation and evaluation of Fast Raft, a hierarchical consensus protocol designed for dynamic, distributed environments. Fast Raft reduces the number of message rounds needed to commit log entries compared to standard Raft by introducing a fast-track mechanism and reducing leader dependence. Our implementation uses gRPC and Kubernetes-based deployment across AWS availability zones. Experimental results demonstrate a throughput improvement and reduced commit latency under low packet loss conditions, while maintaining Raft’s safety and liveness guarantees.

我们首次展示了Fast Raft(Fast Raft)的开放源码实施和评估,这是一个为动态分布式环境设计的等级共识协议。快速Raft(Fast Raft)通过引入快速机制并减少领导人依赖性,减少了用于进行日志输入所需的信息发回次数,而与标准Raft(Raft)相比,快速流转减少了信息发回次数。我们的实施使用GRPC和Kubernetes(Kubernetes)在AWS供应区进行部署。实验结果显示,在低包装损失条件下,吞吐量得到了改善,并减少了延缓,同时保持了Raft(Raft)的安全和生活保障。


Article 66

Title@2025-06-21 (6): A Locally Differential Private Coding-Assisted Succinct Histogram Protocol

Title: A Locally Differential Private Coding-Assisted Succinct Histogram Protocol Ein lokal differenziertes, privates Coding Assisted Succinct Histogramm Protokoll 本地差异私家编码辅助闪电直方图议定书 2506.17767v1

Authors (2): Hsuan-Po Liu, Hessam Mahdavifar

A succinct histogram captures frequent items and their frequencies across clients and has become increasingly important for large-scale, privacy-sensitive machine learning applications. To develop a rigorous framework to guarantee privacy for the succinct histogram problem, local differential privacy (LDP) has been utilized and has shown promising results. To preserve data utility under LDP, which essentially works by intentionally adding noise to data, error-correcting codes naturally emerge as a promising tool for reliable information collection. This work presents the first practical $(\epsilon,\delta)$-LDP protocol for constructing succinct histograms using error-correcting codes. To this end, polar codes and their successive-cancellation list (SCL) decoding algorithms are leveraged as the underlying coding scheme. More specifically, our protocol introduces Gaussian-based perturbations to enable efficient soft decoding. Experiments demonstrate that our approach outperforms prior methods, particularly for items with low true frequencies, while maintaining similar frequency estimation accuracy.

简洁的直方图捕捉了客户的频繁项目及其频率,对于大规模、对隐私敏感的机器学习应用程序已变得日益重要。为了制定一个严格的框架来保障简明直方图问题的隐私,已经利用并展示了有希望的结果。为了维护LDP下的数据效用,LDP基本上通过有意增加数据噪音来发挥作用,错误更正代码自然成为可靠信息收集的一个很有希望的工具。这项工作展示了第一个实用的$(epsilon,\delta)$-LDP协议,用于使用错误校正代码构建简明直方图。为此,极地代码及其相继取消的编码列表(SCL)解码算法被作为基本编码方案。更具体地说,我们的协议引入了基于Gausian的扰动干扰,以便能够高效的软解码。实验表明,我们的方法超越了先前的方法,特别是真实频率较低的项目,同时保持类似的频率估计准确性。
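The Gaussian perturbation and soft-decoding idea in the abstract can be illustrated with a minimal sketch. It assumes the classical Gaussian mechanism calibration (valid for 0 < ε < 1) and a ±1 mapping of codeword bits; the function names are hypothetical, and the paper's actual protocol, noise calibration, and polar-code decoder are not reproduced here.

```python
import math
import random


def gaussian_sigma(epsilon, delta, l2_sensitivity):
    """Noise scale of the classical Gaussian mechanism (assumes 0 < epsilon < 1)."""
    if not 0 < epsilon < 1 or not 0 < delta < 1:
        raise ValueError("expects 0 < epsilon < 1 and 0 < delta < 1")
    return l2_sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon


def perturb(bits, epsilon, delta, rng):
    """Map bits to +/-1 symbols and add Gaussian noise; returns (noisy symbols, sigma)."""
    symbols = [1.0 if b == 0 else -1.0 for b in bits]
    # In the local model any two inputs are neighbours; two +/-1 vectors of
    # length n differ by at most 2*sqrt(n) in L2 norm.
    sigma = gaussian_sigma(epsilon, delta, 2 * math.sqrt(len(bits)))
    return [s + rng.gauss(0.0, sigma) for s in symbols], sigma


def llr(y, sigma):
    """Log-likelihood ratio ln P(y | bit=0) / P(y | bit=1) for one noisy symbol."""
    return 2.0 * y / (sigma * sigma)
```

A soft decoder such as SCL consumes per-symbol values like `llr(y, sigma)` rather than hard bits, which is why Gaussian noise (as opposed to bit flipping) pairs naturally with soft decoding.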


Article 67

Title@2025-06-21 (6): Automated Selfish Mining Analysis for DAG-Based PoW Consensus Protocols

Title: Automated Selfish Mining Analysis for DAG-Based PoW Consensus Protocols Automatisierte Selfish Mining Analyse für DAG-basierte PoW-Konsensusprotokolle 为基于残疾非洲集团的《水、水、水、水、水、水的共识议定书》自动自采矿分析 2501.10888v3

Authors (1): Patrik Keller

Selfish mining is strategic rule-breaking to maximize rewards in proof-of-work protocols. Markov Decision Processes (MDPs) are the preferred tool for finding optimal strategies in Bitcoin and similar linear chain protocols. Protocols increasingly adopt DAG-based chain structures, for which MDP analysis is more involved. To date, researchers have tailored specific MDPs for each protocol. Protocol design suffers from long feedback loops, as each protocol change implies manual work on the MDP. To overcome this, we propose a generic attack model that covers a wide range of protocols, including Ethereum Proof-of-Work, GhostDAG, and Parallel Proof-of-Work. Our approach is modular: we specify each protocol as a concise program, and our tooling then derives and solves the selfish mining MDP automatically.

自我采矿是战略规则的突破,目的是在工作证明协议中最大限度地获得回报。Markov决定程序(MDPs)是寻找比特币和类似线性链协议的最佳战略的首选工具。协议越来越多地采用基于DAG的链条结构,而MDP的分析则更多地涉及这一点。迄今为止,研究人员为每项协议定制了特定的MDP。协议设计有很长的反馈循环,因为每项协议的修改都意味着对MDP的手工操作。为了克服这一点,我们提出了一个通用攻击模式,涵盖广泛的协议,包括Etheinum Working-Cooperty、GhostDAG和平行的工作验证。我们的方法是模块化的:我们把每项协议指定为一个简明的方案,然后我们的工具产生并自动解决自私的采矿 MDP。


Article 68

Title@2025-06-21 (6): Choosing the Right Battery Model for Data Center Simulations

Title: Choosing the Right Battery Model for Data Center Simulations Auswahl des richtigen Batteriemodells für Rechenzentrumssimulationen 选择数据中心模拟的右电池模型 2506.17739v1

Authors (3): Paul Kilian, Philipp Wiesner, Odej Kao

As demand for computing resources continues to rise, the increasing cost of electricity and anticipated regulations on carbon emissions are prompting changes in data center power systems. Many providers are now operating compute nodes in microgrids, close to renewable power generators and energy storage, to maintain full control over the cost and origin of consumed electricity. Recently, new co-simulation testbeds have emerged that integrate domain-specific simulators to support research, development, and testing of such systems in a controlled environment. Yet, choosing an appropriate battery model for data center simulations remains challenging, as it requires balancing simulation speed, realism, and ease of configuration. In this paper, we implement four different battery models for data center scenarios within the co-simulation framework Vessim and analyze their behavior. The results show that linear models, which consider inefficiencies and power limits, closely match the behavior of complex physics-based models in short-term experiments while offering faster execution and without requiring knowledge of electrochemical reactions and circuit-level dynamics. In contrast, simple, lossless models fail to accurately represent complex behavior and provide no further runtime advantage.

随着对计算资源的需求继续上升,电费和碳排放预期条例的不断增加正在促使数据中心电力系统发生变化。许多供应商目前正在运行微电网中的计算节点,靠近可再生能源发电机和能源储存,以保持对耗电成本和来源的全面控制。最近,出现了新的共同模拟试验台,将特定领域的模拟器整合在一起,以支持在受控制的环境中研究、开发和测试此类系统。然而,选择适当的数据中心模拟电池模型仍然具有挑战性,因为它需要平衡模拟速度、现实主义和配置的易易性。在本文件中,我们在共同模拟框架内对数据中心情景采用了四种不同的电池模型。结果显示,线性模型考虑到效率低和电力限度,与基于物理的复杂模型在短期实验中的行为密切匹配,同时提供更快的执行,而不需要电化学反应和电路级动态方面的知识。相比之下,简单、无损失模型无法准确地代表复杂的行为,也没有提供进一步的时间优势。
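To make the distinction between a lossless model and a linear model with inefficiencies and power limits concrete, here is a minimal sketch. The class, its parameters, and the default values are hypothetical illustrations; this is not Vessim's API and not one of the four models implemented in the paper.

```python
from dataclasses import dataclass


@dataclass
class LinearBattery:
    """Hypothetical linear battery model with efficiency losses and a power limit."""
    capacity_wh: float            # usable energy capacity
    soc: float = 0.5              # state of charge in [0, 1]
    charge_eff: float = 0.95      # fraction of charging energy that is stored
    discharge_eff: float = 0.95   # fraction of drawn energy that is delivered
    max_power_w: float = 1000.0   # limit on charge and discharge power

    def step(self, power_w: float, dt_s: float) -> float:
        """Request power (positive charges, negative discharges) for dt_s > 0 seconds.

        Returns the power actually accepted or delivered at the terminals.
        """
        p = max(-self.max_power_w, min(self.max_power_w, power_w))
        hours = dt_s / 3600.0
        energy = self.soc * self.capacity_wh
        if p >= 0:
            # Only charge_eff of the terminal energy is stored; clip to free capacity.
            stored = min(p * hours * self.charge_eff, self.capacity_wh - energy)
            energy += stored
            actual = stored / self.charge_eff / hours
        else:
            # Delivering |p| requires drawing |p| / discharge_eff from the cell.
            drawn = min(-p * hours / self.discharge_eff, energy)
            energy -= drawn
            actual = -drawn * self.discharge_eff / hours
        self.soc = energy / self.capacity_wh
        return actual
```

Setting `charge_eff=1.0`, `discharge_eff=1.0`, and a very large `max_power_w` reduces this sketch to the simple lossless case that the abstract contrasts it with.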


Article 69

Title@2025-06-21 (6): Distributed Butterfly Analysis using Mobile Agents

Title: Distributed Butterfly Analysis using Mobile Agents Verteilte Schmetterlingsanalyse mit mobilen Agenten 使用移动剂进行分布式蝴蝶分析 2506.17721v1

Authors (3): Prabhat Kumar Chand, Apurba Das, Anisur Rahaman Molla

Butterflies, or 4-cycles in bipartite graphs, are crucial for identifying cohesive structures and dense subgraphs. While agent-based data mining is gaining prominence, its application to bipartite networks remains relatively unexplored. We propose distributed, agent-based algorithms for \emph{Butterfly Counting} in a bipartite graph $G((A,B),E)$. Agents first determine their respective partitions and collaboratively construct a spanning tree, electing a leader within $O(n \log \lambda)$ rounds using only $O(\log \lambda)$ bits per agent. A novel meeting mechanism between adjacent agents improves efficiency and eliminates the need for prior knowledge of the graph, requiring only the highest agent ID $\lambda$ among the $n$ agents. Notably, our techniques naturally extend to general graphs, where leader election and spanning tree construction maintain the same round and memory complexities. Building on these foundations, agents count butterflies per node in $O(\Delta)$ rounds and compute the total butterfly count of $G$ in $O(\Delta+\min\{|A|,|B|\})$ rounds.

蝴蝶或双面图中的4个周期对于确定凝固结构和密集子集至关重要。 虽然基于代理的数据挖掘越来越突出, 但它对双面网络的应用仍然相对没有被探索。 我们用双面图$G( B) E 提出用于 emph{Butterfly 计数的分布式代理算法。 代理人首先决定各自的分区, 并合力建造横贯树, 在$( n) log\lambda) 内选举一位领先者, 仅使用$O( log\ lambda) 的每代理比特。 相邻的代理人之间的新颖会议机制提高了效率, 并消除了先前对图形知识的需求, 只需要最高代理商在$( 美元) 代理商中确定 $( lambda) 。 值得注意的是, 我们的技术自然延伸到一般图, 领导选举和横跨树构造保持同样的圆圈和记忆复杂性。 在这些基础上, 代理人以$( delta) $( Delta) 回合计算每节的蝴蝶。
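For readers unfamiliar with the quantity being computed, the following is a minimal centralized sketch of butterfly counting. It is not the distributed, agent-based algorithm from the paper; it only shows what a butterfly count is: every pair of vertices in $A$ with $c$ common neighbours in $B$ contributes $\binom{c}{2}$ butterflies. The function name and edge format are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def count_butterflies(edges):
    """Count 4-cycles in a bipartite graph given as (a, b) edges with a in A, b in B."""
    neighbours_in_a = defaultdict(set)  # for each b in B, its neighbours in A
    for a, b in edges:
        neighbours_in_a[b].add(a)

    # common[(a1, a2)] = number of vertices in B adjacent to both a1 and a2
    common = defaultdict(int)
    for nbrs in neighbours_in_a.values():
        for a1, a2 in combinations(sorted(nbrs), 2):
            common[(a1, a2)] += 1

    # Each pair of shared neighbours closes one butterfly.
    return sum(c * (c - 1) // 2 for c in common.values())
```

For example, the complete bipartite graph on 2 + 2 vertices contains exactly one butterfly, and the one on 3 + 3 vertices contains nine.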


Article 70

Title@2025-06-21 (6): JAX-LaB: A High-Performance, Differentiable, Lattice Boltzmann Library for Modeling Multiphase Fluid Dynamics in Geosciences and Engineering

Title: JAX-LaB: A High-Performance, Differentiable, Lattice Boltzmann Library for Modeling Multiphase Fluid Dynamics in Geosciences and Engineering JAX-LaB: Eine leistungsstarke, differenzierbare Lattice Boltzmann Bibliothek zur Modellierung von Mehrphasen-Flüssigkeitsdynamiken in Geowissenschaften und Ingenieurwissenschaften JAX-LAB:地球科学和工程多阶段流力动力建模高绩效、可区别的Lattice Boltzmann图书馆 2506.17713v1

Authors (3): Piyush Pradhan, Pierre Gentine, Shaina Kelly

We present JAX-LaB, a differentiable, Python-based Lattice Boltzmann library for simulating multiphase and multiphysics flows in hydrologic, geologic, and engineered porous media. Built as an extension of the XLB library, JAX-LaB utilizes JAX for computations and offers a performant, hardware-agnostic implementation that integrates seamlessly with machine learning workflows and scales efficiently across CPUs, GPUs, and distributed systems. Multiphase interactions are modeled using the Shan-Chen pseudopotential method, which is coupled with an equation of state and an improved forcing scheme to obtain liquid-vapor densities that are consistent with Maxwell’s construction, enabling simulations of systems with very large density ratios while maintaining minimal spurious currents. Wetting is handled using the “improved” virtual density scheme, which allows precise control of contact angles and eliminates non-physical films seen in other Shan-Chen wetting methods. We validate the library through several analytical benchmarks, such as Laplace’s law, capillary rise, and cocurrent multicomponent flow, and demonstrate some exemplary use cases for the library. We also report single- and multi-GPU performance scaling of the library. The library is open-source under the Apache license and available at https://github.com/piyush-ppradhan/JAX-LaB.

我们展示了JAX-LAB,这是一个不同、基于Python的Lattice Boltzmann图书馆,用于模拟水文、地质和工程多孔介质的多阶段和多物理流动。作为XLB图书馆的延伸,JAX-LAB利用JAX进行计算,并提供一种性能强的硬件智能化实施,在CPU、GPU和分布式系统中与机器学习工作流程和比例进行无缝的整合。多阶段互动模式使用掸琴假潜在方法进行模拟,同时采用州等方和改良的强制办法获取液体蒸汽密度,这与Maxwell的构造一致,使得对密度比率极高的系统进行模拟,同时保持微小的浮浮流。湿则使用“简化”虚拟密度计划,从而能够精确控制接触角度,并消除其他Shan-Chen湿法中看到的非物理电影。我们通过若干分析基准对图书馆进行验证,例如Label A/B多层图书馆的流程和多层GALAFLA-S-SLILS-S-S-SLU Applial Appy Applial 和Sy Applical-Sy Apprviol-Serals.


Article 71

Title@2025-06-21 (6): Residue Number System (RNS) based Distributed Quantum Multiplication

Title: Residue Number System (RNS) based Distributed Quantum Multiplication Rückstandszahlsystem (RNS) basiert auf verteilter Quanten-Multiplikation 基于残余数字系统(RNS)的分布量乘法 2506.17588v1

Authors (2): Bhaskar Gaur, Himanshu Thapliyal

Multiplication of quantum states is a frequently used function or subroutine in quantum algorithms and applications, making quantum multipliers an essential component of quantum arithmetic. However, quantum multiplier circuits suffer from high Toffoli depth and T gate usage, which ultimately affects their scalability and applicability on quantum computers. To address these issues, we propose utilizing the Residue Number System (RNS) based distributed quantum multiplication, which executes multiple quantum modulo multiplication circuits across quantum computers or jobs with lower Toffoli depth and T gate usage. Towards this end, we propose a design of Quantum Diminished-1 Modulo $(2^n+1)$ Multiplier, an essential component of RNS based distributed quantum multiplication. We provide estimates of quantum resource usage and compare them with those of an existing non-distributed quantum multiplier for 6 to 16 qubit sized output. Our comparative analysis estimates up to 46.018% lower Toffoli depth, and reduction in T gates of 34.483% to 86.25%.

量子状态的乘法是量子算法和应用程序中常用的函数或次常规,使量子乘数成为量子算法和应用程序的一个基本组成部分。然而,量子倍化电路受到高 Toffoli 深度和T门使用的影响,最终影响到其在量子计算机上的可缩放性和可应用性。为了解决这些问题,我们提议使用基于残余数字系统分布量子乘法,在量子计算机或具有低 Toffoli 深度和T门使用量子计算机或工作之间执行多种量子模数倍化电路。为此,我们提议设计量子倍化电路,这是基于 RNS 分布量子乘法的一个必要组成部分。我们提供了量子资源使用估计数,并将其与现有的非分配量子乘数乘数乘法的6至16 公尺输出量乘法进行比较。我们的比较分析估计,其深度可达46.018%,低 Toffoli 深度为46.483 %至86.25%。
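The arithmetic behind RNS-based multiplication can be shown classically. The sketch below is an assumption-laden illustration in ordinary integers, not a quantum circuit and not the authors' design: it splits a product into independent multiplications modulo pairwise coprime moduli of the form $2^n+1$, recombines them with the Chinese Remainder Theorem, and shows the diminished-1 identity for nonzero operands. All function names are hypothetical.

```python
from math import prod


def to_rns(x, moduli):
    """Residues of x with respect to pairwise coprime moduli."""
    return [x % m for m in moduli]


def from_rns(residues, moduli):
    """Chinese Remainder Theorem reconstruction (requires Python 3.8+ for pow)."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        m_i = big_m // m
        total += r * m_i * pow(m_i, -1, m)
    return total % big_m


def rns_multiply(a, b, moduli):
    """Multiply via independent modular products; exact when a*b < prod(moduli)."""
    ra, rb = to_rns(a, moduli), to_rns(b, moduli)
    return from_rns([(x * y) % m for x, y, m in zip(ra, rb, moduli)], moduli)


def diminished1_mul(a_d, b_d, n):
    """Diminished-1 product modulo 2**n + 1.

    a_d and b_d encode nonzero values a = a_d + 1 and b = b_d + 1. The result
    encodes a*b - 1, since (a_d + 1)(b_d + 1) - 1 = a_d*b_d + a_d + b_d.
    Zero operands and zero results need separate handling, not shown here.
    """
    m = (1 << n) + 1
    return (a_d * b_d + a_d + b_d) % m
```

With the moduli 3, 5, and 17 (that is, $2^1+1$, $2^2+1$, $2^4+1$), each modular product could run as a separate job, which is the property the distributed scheme relies on.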


Article 72

Title@2025-06-21 (6): ConsumerBench: Benchmarking Generative AI Applications on End-User Devices

Title: ConsumerBench: Benchmarking Generative AI Applications on End-User Devices ConsumerBench: Benchmarking Generative KI-Anwendungen auf Endgeräten 消费者:确定最终用户设备应用基准 2506.17538v1

Authors (6): Yile Gu, Rohan Kadekodi, Hoang Nguyen, Keisuke Kamahori, Yiyu Liu, Baris Kasikci

The recent shift in Generative AI (GenAI) applications from cloud-only environments to end-user devices introduces new challenges in resource management, system efficiency, and user experience. This paper presents ConsumerBench, a comprehensive benchmarking framework designed to evaluate the system efficiency and response time of GenAI models running on end-user devices. Unlike existing benchmarks that assume exclusive model access on dedicated GPUs, ConsumerBench simulates realistic multi-application scenarios executing concurrently on constrained hardware. Furthermore, ConsumerBench supports customizable workflows that simulate complex tasks requiring coordination among multiple applications. ConsumerBench captures both application-level metrics, including latency and Service Level Objective (SLO) attainment, and system-level metrics like CPU/GPU utilization and memory bandwidth. Through extensive experiments, ConsumerBench reveals inefficiencies in resource sharing, unfair scheduling under greedy allocation, and performance pitfalls of static model server configurations. The paper also provides practical insights for model developers and system designers, highlighting the benefits of custom kernels tailored to consumer-grade GPU architectures and the value of implementing SLO-aware scheduling strategies.

最近,GenAI(GenAI)的生成性AI(GenAI)应用从云型环境向终端用户装置的转换带来了资源管理、系统效率和用户经验方面的新挑战。本文件介绍了消费者基准,这是一个旨在评价GenAI在终端用户装置上运行的模型的系统效率和反应时间的全面基准框架。与在专用GPUs上独家使用模型的现有基准不同,消费者基准模拟了现实的多应用情景,同时对受限制的硬件实施。此外,消费者基准支持可定制的工作流程,这些工作流程模拟了多种应用程序之间需要协调的复杂任务。消费者基准捕捉了应用级指标,包括延迟度和服务级目标(SLO)的实现,以及系统级指标,如CPU/GPU的利用和记忆带宽。通过广泛的实验,消费者基准揭示了资源共享效率低、贪婪分配下的不公平时间安排以及静态模型服务器配置的性能陷井。文件还为模型开发者和系统设计者提供了实际的洞察力,强调了定制的定制内仓子对消费者级GPUPI结构的惠益,以及实施SLO-awarowing战略的价值。


Article 73

Title@2025-06-20 (5): A Grassroots Network and Community Roadmap for Interconnected Autonomous Science Laboratories for Accelerated Discovery

Title: A Grassroots Network and Community Roadmap for Interconnected Autonomous Science Laboratories for Accelerated Discovery Ein Grassroots-Netzwerk und ein gemeinschaftlicher Fahrplan für vernetzte autonome Wissenschaftslaboratorien für beschleunigte Entdeckung 加速发现相互连接的自治科学实验室基层网络和社区路线图 2506.17510v1

Authors (18): Rafael Ferreira da Silva, Milad Abolhasani, Dionysios A. Antonopoulos, Laura Biven, Ryan Coffee, Ian T. Foster, Leslie Hamilton, Shantenu Jha, Theresa Mayer, Benjamin Mintz, Robert G. Moore, Salahudin Nimer, Noah Paulson, Woong Shin, Frederic Suter, Mitra Taheri, Michela Taufer, Newell R. Washburn

Scientific discovery is being revolutionized by AI and autonomous systems, yet current autonomous laboratories remain isolated islands unable to collaborate across institutions. We present the Autonomous Interconnected Science Lab Ecosystem (AISLE), a grassroots network transforming fragmented capabilities into a unified system that shortens the path from ideation to innovation to impact and accelerates discovery from decades to months. AISLE addresses five critical dimensions: (1) cross-institutional equipment orchestration, (2) intelligent data management with FAIR compliance, (3) AI-agent driven orchestration grounded in scientific principles, (4) interoperable agent communication interfaces, and (5) AI/ML-integrated scientific education. By connecting autonomous agents across institutional boundaries, autonomous science can unlock research spaces inaccessible to traditional approaches while democratizing cutting-edge technologies. This paradigm shift toward collaborative autonomous science promises breakthroughs in sustainable energy, materials development, and public health.

科学发现正被AI和自主系统革命化,然而,目前的自治实验室仍然是孤立的岛屿,无法进行跨机构合作。我们展示了自主连接的科学实验室生态系统(AISLE),这是一个基层网络,将零散的能力转化为一个统一系统,缩短了从思想到创新的道路,从几十年到几个月的影响和加速发现。 AISLE涉及五个关键方面:(1)跨机构设备协调,(2)智能数据管理符合FAIR的合规性,(3)基于科学原则的AI剂驱动的管弦化,(4)可相互操作的代理通信界面,(5)AI/ML-综合科学教育。通过将自主代理人员跨越机构边界,自主科学可以打开传统方法无法进入的研究空间,同时使尖端技术民主化。这种向合作自主科学转变的范式转变有望在可持续能源、材料开发和公共卫生方面实现突破。


Article 74

Title@2025-06-20 (5): Optimal Parallel Algorithms for Convex Hulls in 2D and 3D under Noisy Primitive Operations

Title: Optimal Parallel Algorithms for Convex Hulls in 2D and 3D under Noisy Primitive Operations Optimale Parallelalgorithmen für Konvexhüllen in 2D und 3D unter Noisy Primitive Operations 2D 和 3D 的Convex Hulls 在噪音原始操作下的最佳平行比值 2506.17507v1

Authors (2): Michael T. Goodrich, Vinesh Sridhar

In the noisy primitives model, each primitive comparison performed by an algorithm, e.g., testing whether one value is greater than another, returns the incorrect answer with random, independent probability p < 1/2 and otherwise returns a correct answer. This model was first applied in the context of sorting and searching, and recent work by Eppstein, Goodrich, and Sridhar extends this model to sequential algorithms involving geometric primitives such as orientation and sidedness tests. However, their approaches appear to be inherently sequential; hence, in this paper, we study parallel computational geometry algorithms for 2D and 3D convex hulls in the noisy primitives model. We give the first optimal parallel algorithms in the noisy primitives model for 2D and 3D convex hulls in the CREW PRAM model. The main technical contribution of our work concerns our ability to detect and fix errors during intermediate steps of our algorithm using a generalization of the failure sweeping technique.

在吵闹的原始模型中,每个原始比较都是由算法进行的,例如,测试一个值是否大于另一个值,以随机、独立概率 p < 1/2 返回错误答案,以随机、独立概率 p < 1/2 返回正确答案。该模型首先在分类和搜索方面应用,Eppstein、Goodrich和Sridhar最近的工作将这一模型推广到涉及定向和侧面测试等几何原始物的顺序算法。然而,它们的方法似乎是内在的相继法;因此,在本文件中,我们研究对噪音原始模型中的 2D 和 3D convex 船体的平行计算几何算法。我们在CREW PRAM 模型中为2D 和 3D convex 船体的噪音原始模型提供了第一种最佳平行算法。我们工作的主要技术贡献涉及我们利用故障扫描技术的概括在算法的中间步骤中探测和纠正错误的能力。
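The noisy primitives model can be simulated in a few lines. The sketch below shows the model and the naive remedy of repeating a comparison and taking a majority vote; this baseline is an assumption for illustration and is not the failure-sweeping technique of the paper, which avoids paying the repetition cost on every primitive. Function names are hypothetical.

```python
import random


def noisy_less(a, b, p, rng):
    """Return a < b, but flip the answer independently with probability p < 1/2."""
    truth = a < b
    return (not truth) if rng.random() < p else truth


def majority_less(a, b, p, k, rng):
    """Repeat the noisy comparison k times (k odd) and return the majority answer."""
    votes = sum(noisy_less(a, b, p, rng) for _ in range(k))
    return 2 * votes > k
```

By a Hoeffding bound the majority is wrong with probability at most $\exp(-2k(1/2-p)^2)$, so $k = O(\log n)$ repetitions suffice for high-probability correctness of a single test, at the price of a logarithmic overhead per primitive.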


Article 75

Title@2025-06-20 (5): Fed-pilot: Optimizing LoRA Allocation for Efficient Federated Fine-Tuning with Heterogeneous Clients

Title: Fed-pilot: Optimizing LoRA Allocation for Efficient Federated Fine-Tuning with Heterogeneous Clients Fed-Pilot: Optimierung der LoRA-Allokation für effizientes Federated Fine-Tuning mit heterogenen Kunden Fed-试点:优化LORA分配,与异质客户进行高效的联邦货币调整 2410.10200v2

Authors (4): Zikai Zhang, Rui Hu, Ping Liu, Jiahao Xu

Federated Learning enables the fine-tuning of foundation models (FMs) across distributed clients for specific tasks; however, its scalability is limited by the heterogeneity of client memory capacities. In this work, we propose Fed-pilot, a memory-efficient federated fine-tuning framework. It enables memory-constrained clients to participate in Low-Rank Adaptation (LoRA)-based fine-tuning by training only a subset of LoRA modules locally. Fed-pilot identifies the optimal selection of trainable LoRA modules as a knapsack optimization problem, maximizing model performance under memory constraints for each client. To mitigate inconsistencies arising from heterogeneous module allocations and Non-IID data, Fed-pilot employs a novel aggregation rule that dynamically compensates for under-updated layers. Extensive experiments on five diverse datasets across various heterogeneous data settings demonstrate Fed-pilot’s effectiveness and efficiency compared to state-of-the-art methods. To the best of our knowledge, this is the first study on federated fine-tuning of FMs that integrates memory-constrained optimization. The code will be publicly available.

联邦学习联盟使分布客户对基础模型(FMs)进行微调,用于具体任务;然而,其可缩放性受到客户记忆能力差异的限制。在这项工作中,我们提议Fed-pilit(Fed-pilation),这是一个记忆效率高的联邦微调框架。它使受记忆限制的客户能够参加基于记忆的低兰克适应(LORA)微调,只对当地一组LORA模块进行培训。Fed-pilot(Fed-Lora)将可训练的LORA模块确定为最优化问题,最大限度地提高每个客户记忆力限制下的模型性能。为了减少不同模块分配和非IID数据产生的不一致,Fed-pilot(Fed-pilation)采用新的汇总规则,以动态方式弥补未更新的层。对五套不同数据设置的多种数据集进行广泛的实验,表明Fed-port(LORA)相对于最新方法的有效性和效率。据我们所知,这是关于将记忆受限制的调频调频调的联邦微调的首次研究,可以公开查阅。
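Selecting trainable modules under a memory budget, framed in the abstract as a knapsack problem, can be sketched with the textbook 0/1 knapsack dynamic program. The per-module values and integer memory costs below are hypothetical inputs; how Fed-pilot actually scores modules and solves the problem is not specified here.

```python
def select_modules(values, costs, budget):
    """Pick a subset of modules maximizing total value with total cost <= budget.

    values: hypothetical per-module benefit scores
    costs:  per-module memory costs as non-negative integers (e.g. MB)
    budget: client memory budget as a non-negative integer
    Returns (sorted list of chosen module indices, best total value).
    """
    n = len(values)
    # best[i][w] = best value using the first i modules within memory w
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        v, c = values[i - 1], costs[i - 1]
        for w in range(budget + 1):
            best[i][w] = best[i - 1][w]
            if c <= w and best[i - 1][w - c] + v > best[i][w]:
                best[i][w] = best[i - 1][w - c] + v

    # Walk back through the table to recover which modules were taken.
    chosen, w = [], budget
    for i in range(n, 0, -1):
        if best[i][w] != best[i - 1][w]:
            chosen.append(i - 1)
            w -= costs[i - 1]
    return sorted(chosen), best[n][budget]
```

Each client would run such a selection with its own budget, which is what produces the heterogeneous module allocations that the aggregation rule then has to compensate for.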


Article 76

Title@2025-06-20 (5): Code Generation for Near-Roofline Finite Element Actions on GPUs from Symbolic Variational Forms

Title: Code Generation for Near-Roofline Finite Element Actions on GPUs from Symbolic Variational Forms Code-Generierung für flächennahe Finite-Element-Aktionen auf GPUs aus Symbolischen Variationsformen 象征式变换形式GPU的近Roofline有限元素行动代码生成代码 2506.17471v1

Authors (2): Kaushik Kulkarni, Andreas Klöckner

We present a novel parallelization strategy for evaluating Finite Element Method (FEM) variational forms on GPUs, focusing on those that are expressible through the Unified Form Language (UFL) on simplex meshes. We base our approach on code transformations, wherein we construct a space of scheduling candidates and rank them via a heuristic cost model to effectively handle the large diversity of computational workloads that can be expressed in this way. We present a design of a search space to which the cost model is applied, along with an associated pruning strategy to limit the number of configurations that need to be empirically evaluated. The goal of our design is to strike a balance between the device’s latency-hiding capabilities and the amount of state space, a key factor in attaining near-roofline performance. To make our work widely available, we have prototyped our parallelization strategy within the Firedrake framework, a UFL-based FEM solver. We evaluate the performance of our parallelization scheme on two generations of Nvidia GPUs, specifically the Titan V (Volta architecture) and Tesla K40c (Kepler architecture), across a range of operators commonly used in applications, including fluid dynamics, wave propagation, and structural mechanics, in 2D and 3D geometries. Our results demonstrate that our proposed algorithm achieves more than $50\%$ roofline performance in $65\%$ of the test cases on both devices.

我们提出一个新的平行战略,用于评估GPU的精密元素方法(FEM)变异形式,重点是那些通过统一格式语言(UFL)以简单度模模模版显示的组合。我们把方法建立在代码转换上,在代码转换上建造一个候选人时间安排空间,并通过超额成本模型对候选人进行排序,以便有效地处理可以用这种方式表示的多种计算工作量。我们提出了一个应用成本模型的搜索空间的设计,以及相关的调整战略,以限制需要经经验评估的配置数量。我们的设计目标是在设备定位能力与国家空间数量之间取得平衡,这是实现接近屋顶绩效的一个关键因素。为了广泛传播我们的工作,我们把我们的平行战略建在基于UFL$FEM的FEMSUL框架之内,一个基于UFL$的FEMSMU。我们对Nvidia GPUPU的平行化计划绩效进行了评估,具体而言,是TITAN VV的定位能力与国家空间数量之间的平衡,这是实现近Roofline II 和TEL IMLA 的运行者, 以及我们所使用的SLA IMLA IMLA 和 IMLA IMLA 的常规中, 的运行中, 和TULA 。
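The phrase "roofline performance" refers to the standard roofline bound: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity. A minimal sketch follows; the function names are hypothetical and no real device figures are assumed.

```python
def roofline_bound(peak_flops, mem_bandwidth, arithmetic_intensity):
    """Attainable FLOP/s for a kernel with the given FLOP-per-byte intensity."""
    return min(peak_flops, mem_bandwidth * arithmetic_intensity)


def roofline_fraction(achieved_flops, peak_flops, mem_bandwidth, arithmetic_intensity):
    """Fraction of the roofline bound that a measured kernel reaches."""
    return achieved_flops / roofline_bound(peak_flops, mem_bandwidth, arithmetic_intensity)
```

Under this definition, a kernel that reaches 12 units of throughput where the bound is 20 sits at 60% of roofline, which would clear the 50% threshold quoted in the abstract.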


Article 77

Title@2025-06-20 (5): SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving

Title: SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving SLED: Ein spekulatives LLM-Decoding-Framework für effizientes Edge Serving SLED: 有效边缘服务投机性LLM代谢框架 2506.09397v2

Authors (8): Xiangchen Li, Dimitrios Spatharakis, Saeid Ghafouri, Jiakun Fan, Hans Vandierendonck, Deepu John, Bo Ji, Dimitrios Nikolopoulos

Despite advancements in device capabilities, efficient inference of advanced large language models (LLMs) at the edge remains challenging due to limited device memory and power constraints. Existing strategies, such as aggressive quantization, pruning, or remote inference, trade accuracy for efficiency or lead to substantial cost burdens. This position paper introduces a new approach that leverages speculative decoding, previously viewed primarily as a decoding acceleration technique for autoregressive generation of LLMs, as a promising approach specifically adapted for edge computing by orchestrating computation across heterogeneous devices. We propose SLED, a method that allows lightweight edge devices to draft multiple candidate tokens locally using diverse draft models, while a single, shared edge server efficiently batches and verifies the tokens utilizing a more precise target model. This approach supports device heterogeneity and reduces server-side memory footprint by avoiding the need to deploy multiple target models. Our initial experiments with Jetson Orin Nano, Raspberry Pi 4B/5, and an edge server equipped with 4 Nvidia A100 GPUs indicate substantial benefits: significantly increased system throughput, capacity, and better cost efficiency, all without sacrificing model accuracy.

无论装置能力有何进步,在边缘有效推断先进的大型语言模型(LLMs)仍然具有挑战性,因为设备内存和功率限制有限。现有的战略,如进取量计、剪裁或远程推断、效率交易精确度等,或导致巨大的成本负担。本立场文件引入了一种新的方法,利用投机性解码法,以前主要被视为自动递增生成LLMs的解码加速技术,这是一种有希望的方法,通过在各种装置之间调试计算,专门适用于边缘计算。我们提议采用acronym,这种方法允许轻量级边设备使用不同的草稿模型在当地起草多个候选标牌,而单一的共享边端服务器则高效地分批使用更精确的目标模型核实标牌。这种方法支持了装置的异质性,并通过避免部署多个目标模型来减少服务器上的记忆痕迹。我们最初对Jetson Orin Nano、Rasperry Pi 4B/5和配备4 Nvidia A100 GPUs的边缘服务器进行的实验表明有相当大的好处:显著的系统通过量、容量和效益,而没有牺牲。
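The draft-then-verify split can be sketched with toy models. This is a minimal illustration of greedy speculative decoding under stated assumptions: `draft_next` and `target_next` are hypothetical callables that map a token list to the next token, and the verification loop stands in for what would be one batched forward pass on the server. It is not SLED's implementation.

```python
def speculative_step(context, draft_next, target_next, k):
    """One round: draft k tokens locally, then keep the prefix the target agrees with."""
    # Draft phase (would run on the edge device with a small model).
    proposal, ctx = [], list(context)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)

    # Verify phase (would run on the shared server with the target model).
    accepted, ctx = [], list(context)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)  # correct the first mismatch and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)
    accepted.append(target_next(ctx))  # all drafts accepted: one extra token
    return accepted
```

Under greedy decoding the accepted tokens are exactly what the target model would have produced on its own, which is the sense in which accuracy is not sacrificed; every round yields at least one token and at most `k + 1`.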


Article 78

Title@2025-06-20 (5): A Comparative Analysis of Distributed Linear Solvers under Data Heterogeneity

Title: A Comparative Analysis of Distributed Linear Solvers under Data Heterogeneity Eine vergleichende Analyse der verteilten linearen Solver unter Daten Heterogenität 数据差异下分布线性溶剂的比较分析 2304.10640v4

Authors (4): Boris Velasevic, Rohit Parasnis, Christopher G. Brinton, Navid Azizan

We consider the problem of solving a large-scale system of linear equations in a distributed or federated manner by a taskmaster and a set of machines, each possessing a subset of the equations. We provide a comprehensive comparison of two well-known classes of algorithms used to solve this problem: projection-based methods and optimization-based methods. First, we introduce a novel geometric notion of data heterogeneity called angular heterogeneity and discuss its generality. Using this notion, we characterize the optimal convergence rates of the most prominent algorithms from each class, capturing the effects of the number of machines, the number of equations, and that of both cross-machine and local data heterogeneity on these rates. Our analysis establishes the superiority of Accelerated Projected Consensus in realistic scenarios with significant data heterogeneity and offers several insights into how angular heterogeneity affects the efficiency of the methods studied. Additionally, we develop distributed algorithms for the efficient computation of the proposed angular heterogeneity metrics. Our extensive numerical analyses validate and complement our theoretical results.

我们考虑用一个任务主管和一组机器以分布或联合的方式解决一个大型线性方程式系统的问题,每个任务主管和一组机器拥有一个子方程式。我们全面比较了用于解决这一问题的两种众所周知的算法:投影法和优化法。首先,我们引入了称为角异质的新的数据异质几何概念,并讨论了其普遍性。我们利用这个概念,确定每个类别最突出的算法的最佳趋同率,捕捉机器数量、方程式数量以及跨机器和地方数据异质对这些比率的影响。我们的分析确定了在现实情景下快速预测共识的优越性,并提出了大量数据异质性,并提供了对三角异性如何影响所研究方法效率的若干见解。此外,我们开发了分布式算法,以有效计算拟议的角异异质计量法。我们广泛的数字分析验证和补充了我们的理论结果。


Article 79

Title@2025-06-20 (5): LayerZero

Title: LayerZero SchichtZero 层数为零 2312.09118v3

Authors (5): Ryan Zarick, Bryan Pellegrino, Isaac Zhang, Thomas Kim, Caleb Banister

In this paper, we present the first intrinsically secure and semantically universal omnichain interoperability protocol: LayerZero. Utilizing an immutable endpoint, append-only verification modules, and fully-configurable verification infrastructure, LayerZero provides the security, configurability, and extensibility necessary to achieve omnichain interoperability. LayerZero enforces strict application-exclusive ownership of protocol security and cost through its novel trust-minimized modular security framework which is designed to universally support all blockchains and use cases. Omnichain applications (OApps) built on the LayerZero protocol achieve frictionless blockchain-agnostic interoperation through LayerZero’s universal network semantics.

在本文中,我们介绍了第一个内在安全且具有内在普遍性的全链互操作性协议:DlumeZero。利用一个不可改变的终点、只附加的核查模块和完全可配置的核查基础设施,TrumeZero提供了实现全链互操作性所必需的安全、可配置性和可扩展性。DolyZero通过其新的、信任最小的模块安全框架对协议安全及其成本实施严格的应用专属所有权,该框架旨在普遍支持所有块链和使用案例。在TreamZero协议基础上建立的全链应用(OApps)通过ThileZero通用网络的语义学实现了无摩性块链-不连接的相互操作。


Article 80

Title@2025-06-20 (5): $Δ$-Nets: Interaction-Based System for Optimal Parallel $λ$-Reduction

Title: $Δ$-Nets: Interaction-Based System for Optimal Parallel $λ$-Reduction $Δ$-Nets: Interaktionsbasiertes System für eine optimale parallele $λ$-Reduktion \(-净额:最佳平行互动系统\)$美元-削减 2505.20314v3

Authors (1): Daniel Augusto Rizzi Salvadori

I present a model of universal parallel computation called $\Delta$-Nets, and a method to translate $\lambda$-terms into $\Delta$-nets and back. Together, the model and the method constitute an algorithm for optimal parallel $\lambda$-reduction, solving the longstanding enigma with groundbreaking clarity. I show that the $\lambda$-calculus can be understood as a projection of $\Delta$-Nets, one that severely restricts the structure of sharing, among other drawbacks. Unhindered by these restrictions, the $\Delta$-Nets model opens the door to new parallel programming language implementations and computer architectures that are more efficient and performant than previously possible.

我提出了一个称为$Delta$-Nets的通用平行计算模型,以及将$lambda$-terms 转换成$Delta$-nets和回调的方法。模型和方法共同构成一个优化平行$\lambda$降值的算法,以突破性清晰度解决长期谜题。我表明,$Limbda$-calculus可以被理解为一个严重限制共享结构的$Delta$-Nets-one的预测,除其他缺陷外。由于这些限制,$Delta$-Nets模型为平行语言实施和计算机结构打开了大门,这些语言实施和计算机结构比以前可能更有效和更实用。


Article 81

Title@2025-06-20 (5): Byzantine-Tolerant Consensus in GPU-Inspired Shared Memory

Title: Byzantine-Tolerant Consensus in GPU-Inspired Shared Memory Byzantinisch-Tolerant Konsens in GPU-inspiriert gemeinsamen Speicher 在GPU-受GPU启发的共同记忆中,拜占庭-容忍共识 2503.12788v3

Authors (3): Chryssis Georgiou, Manaswini Piduguralla, Sathya Peri

In this work, we formalize a novel shared memory model inspired by the popular GPU architecture. Within this model, we develop algorithmic solutions to the Byzantine Consensus problem and analyze their fault-resilience.

在这项工作中,我们正式确定了受广受欢迎的GPU架构启发的新颖的共享记忆模式。 在这个模式中,我们开发了拜占庭共识问题的算法解决方案,并分析了其缺陷的抵抗力。


Article 82

Title@2025-06-20 (5): JANUS: Resilient and Adaptive Data Transmission for Enabling Timely and Efficient Cross-Facility Scientific Workflows

Title: JANUS: Resilient and Adaptive Data Transmission for Enabling Timely and Efficient Cross-Facility Scientific Workflows JANUS: Resiliente und adaptive Datenübertragung zur rechtzeitigen und effizienten Cross-Facility wissenschaftlichen Workflows JANUS:为及时、高效的跨设施科学工作流程提供具有弹性和适应性的数据传输 2506.17084v1

Authors (7): Vladislav Esaulov, Jieyang Chen, Norbert Podhorszki, Fred Suter, Scott Klasky, Anu G Bourgeois, Lipeng Wan

In modern science, the growing complexity of large-scale projects has increased reliance on cross-facility workflows, where institutions share resources and expertise to accelerate discovery. These workflows often involve transferring massive data over wide-area networks. While high-speed networks like ESnet and data transfer services like Globus have improved data mobility, challenges remain. Large data volumes can strain bandwidth, TCP suffers from retransmissions due to packet loss, and traditional fault-tolerance methods like erasure coding introduce significant overhead. This paper presents JANUS, a resilient and adaptive data transmission approach for cross-facility scientific workflows. JANUS uses UDP, integrates erasure coding for fault tolerance, and applies error-bounded lossy compression to reduce overhead. This design enables users to balance transmission time and accuracy based on specific needs. JANUS also adapts coding parameters to real-time network conditions and uses optimization models to determine ideal configurations. Experiments show that JANUS significantly improves data transfer efficiency while preserving fidelity.

在现代科学中,大型项目日益复杂,更加依赖跨设施工作流程,各机构共享资源和专门知识,以加速发现。这些工作流程往往涉及在广域网络上传输大量数据。ESnet和Globus等数据传输服务等高速网络改善了数据流动性,但挑战依然存在。大量数据量可以增加带宽,TCP由于包装丢失而面临再传输问题,而像消除编码等传统的过错容忍方法引入了巨大的间接费用。本文介绍了JANUS,这是跨设施科学工作流程的一种适应性和适应性数据传输方法。JANUS使用UDP,将加密编码用于疏漏,并应用因错误造成的损失压缩来减少间接费用。这一设计使用户能够根据具体需求平衡传输时间和准确性。JANUS还根据实时网络条件调整了编码参数,并使用优化模型来确定理想配置。实验显示,JANUS在维护忠诚的同时,显著提高了数据传输效率。
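How erasure coding lets a UDP-based transfer survive packet loss without retransmission can be shown with the simplest possible code: one XOR parity block that repairs any single lost block. This is an illustrative assumption; the abstract does not say which erasure code JANUS uses, and practical systems typically use codes that tolerate several losses. Function names are hypothetical.

```python
def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for blk in blocks:
        for i, byte in enumerate(blk):
            out[i] ^= byte
    return bytes(out)


def encode(data_blocks):
    """Append one parity block so that any single lost block can be rebuilt."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(received):
    """Rebuild the data blocks from a coded list in which at most one entry is None."""
    missing = [i for i, blk in enumerate(received) if blk is None]
    if len(missing) > 1:
        raise ValueError("single-parity code can repair only one lost block")
    repaired = list(received)
    if missing:
        present = [blk for blk in received if blk is not None]
        repaired[missing[0]] = xor_blocks(present)
    return repaired[:-1]  # drop the parity block
```

The parity block is pure overhead when nothing is lost, which is the trade-off the abstract points at: adapting the amount of redundancy to current loss rates, and shrinking the payload with error-bounded lossy compression, both reduce that cost.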


Article 83

Title@2025-06-20 (5): Comparison of substructured non-overlapping domain decomposition and overlapping additive Schwarz methods for large-scale Helmholtz problems with multiple sources

Title: Comparison of substructured non-overlapping domain decomposition and overlapping additive Schwarz methods for large-scale Helmholtz problems with multiple sources Vergleich von substrukturierten nicht-überlappenden Domänenzersetzungen und überlappenden additiven Schwarz-Methoden für großflächige Helmholtz-Probleme mit mehreren Quellen 用于处理与多种来源有关的大规模Helmholtz问题的亚结构非重叠重叠域分解和重叠添加剂施瓦兹方法比较 2506.16875v1

Authors (3): Boris Martin, Pierre Jolivet, Christophe Geuzaine

Solving large-scale Helmholtz problems discretized with high-order finite elements is notoriously difficult, especially in 3D where direct factorization of the system matrix is very expensive and memory demanding, and robust convergence of iterative methods is difficult to obtain. Domain decomposition methods (DDM) constitute one of the most promising strategies so far, by combining direct and iterative approaches: using direct solvers on overlapping or non-overlapping subdomains, as a preconditioner for a Krylov subspace method on the original Helmholtz system or as an iterative solver on a substructured problem involving field values or Lagrange multipliers on the interfaces between the subdomains. In this work we compare the computational performance of non-overlapping substructured DDM and Optimized Restricted Additive Schwarz (ORAS) preconditioners for solving large-scale Helmholtz problems with multiple sources, as is encountered, e.g., in frequency-domain Full Waveform Inversion. We show on a realistic geophysical test-case that, when appropriately tuned, the non-overlapping methods can reduce the convergence gap sufficiently to significantly outperform the overlapping methods.

与高阶有限元素分解的大规模Helmholtz问题很难解决,特别是在3D领域,在3D领域,系统矩阵的直接因子化非常昂贵,记忆要求很高,而且难以获得迭代方法的强有力趋同。 域分解方法是迄今为止最有希望的战略之一,办法是将直接和迭接方法结合起来:在重叠或非重叠的子域上使用直接解答器,作为原Helmholtz系统Krylov子空间方法的前提条件,或作为涉及外地值或子域界面界面拉格朗乘数的亚结构问题迭代解解答器。在这项工作中,我们比较了不重叠的次结构DDM和优化的Additivive Schwarz(ORAS)的计算性功能,以便用多种来源解决大规模Helmholtz问题,例如,在频率-多面保持全波变换中遇到的问题。我们在现实的地球物理试验箱上展示了差距,在充分统一时,将非重叠的方法缩小了差距。


Article 84

Title@2025-06-20 (5): Speeding up Local Optimization in Vehicle Routing with Tensor-based GPU Acceleration

Title: Speeding up Local Optimization in Vehicle Routing with Tensor-based GPU Acceleration Beschleunigung der lokalen Optimierung im Fahrzeugrouting mit Tensor-basierter GPU-Beschleunigung 加速使用基于 Tensor 的 GPU 加速车辆运行的本地优化 2506.17357v1

Authors (3): Zhenyu Lei, Jin-Kao Hao, Qinghua Wu

Local search plays a central role in many effective heuristic algorithms for the vehicle routing problem (VRP) and its variants. However, neighborhood exploration is known to be computationally expensive and time consuming, especially for large instances or problems with complex constraints. In this study, we explore a promising direction to address this challenge by introducing an original tensor-based GPU acceleration method designed to speed up the commonly used local search operators in vehicle routing. By using an attribute-based representation, the method offers broad extensibility, making it applicable to different VRP variants. Its low-coupling architecture, with intensive computations completely offloaded to the GPU, ensures seamless integration in various local search-based algorithms and frameworks, leading to significant improvements in computational efficiency and potentially improved solution quality. Through comparative experiments on benchmark instances of three routing problems, we demonstrate the substantial computational advantages of the proposed approach over traditional CPU-based implementations. We also provide a detailed analysis of the strengths and limitations of the method, providing valuable insights into its performance characteristics and identifying potential bottlenecks in practical applications. These findings contribute to a better understanding and suggest directions for future improvements.

本地搜索在许多有效的车辆路由问题及其变式的超光速算法中发挥着核心作用,然而,已知邻里探索在计算上成本昂贵、耗时费时,特别是对于大的情况或复杂制约的问题。在本研究中,我们探索了应对这一挑战的有希望的方向,采用了原始的基于高压的GPU加速法,旨在加快车辆路由方面常用的本地搜索操作员的速度。该方法使用基于属性的代表法,提供了广泛的可扩展性,使之适用于不同的车辆路由问题(VRP)变式。该方法的低相联结构,其密集的计算完全卸载到GPU,确保无缝地融入各种基于本地搜索的算法和框架,导致计算效率的显著提高,并有可能提高解决方案的质量。通过对三个路由问题的基准实例的比较实验,我们展示了拟议方法相对于传统的CPU实施的巨大计算优势和局限性。我们还详细分析了该方法的长处和局限性,提供了对其性能特点的宝贵了解,并查明实际应用中的潜在瓶颈。这些发现有助于更好地了解和提出今后改进的方向。


Article 85

Title@2025-06-20 (5): Alto: Orchestrating Distributed Compound AI Systems with Nested Ancestry

Title: Alto: Orchestrating Distributed Compound AI Systems with Nested Ancestry Alto: Orchestrating Distributed Compound AI Systems with Nested Ancestry Alto: 带有内嵌原体的管弦式分布式 AI系统 2403.04311v2

Authors (10): Deepti Raghavan, Keshav Santhanam, Muhammad Shahir Rahman, Nayani Modugula, Luis Gaspar Schroeder, Maximilien Cura, Houjun Liu, Pratiksha Thaker, Philip Levis, Matei Zaharia

Compound AI applications chain together subcomponents such as generative language models, document retrievers, and embedding models. Applying traditional systems optimizations such as parallelism and pipelining in compound AI systems is difficult because each component has different constraints in terms of the granularity and type of data that it ingests. New data is often generated during intermediate computations, and text streams may be split into smaller, independent fragments (such as documents into sentences), which may then be re-aggregated in later parts of the computation. Due to this complexity, existing systems that serve compound AI queries do not take full advantage of parallelism and pipelining opportunities. We present Alto, a framework that automatically optimizes execution of compound AI queries through streaming and parallelism. Alto introduces a new abstraction called nested ancestry, a metadata hierarchy that allows the system to correctly track partial outputs and aggregate data across the heterogeneous constraints of the components of compound AI applications. This metadata is automatically inferred from the programming model, allowing developers to express complex dataflow patterns without needing to reason manually about the details of routing and aggregation. Implementations of four applications in Alto outperform or match implementations in LangGraph, a popular existing AI programming framework. Alto implementations match or improve latency by 10-30%.

复合 AI 应用将生成式语言模型、文档检索器和嵌入模型等子组件串联在一起。在复合 AI 系统中应用并行和流水线等传统系统优化十分困难,因为每个组件对其所接收数据的粒度和类型都有不同的约束。新数据往往在中间计算过程中产生,文本流可能被拆分为更小的独立片段(例如从文档拆分为句子),随后又可能在计算的后续阶段重新聚合。由于这种复杂性,现有的复合 AI 查询服务系统未能充分利用并行和流水线的机会。我们提出 Alto,一个通过流式处理和并行自动优化复合 AI 查询执行的框架。Alto 引入了一种称为嵌套祖先(nested ancestry)的新抽象,这是一种元数据层级结构,使系统能够在复合 AI 应用各组件的异构约束下正确跟踪部分输出并聚合数据。该元数据由编程模型自动推断,开发者无需手动推理路由和聚合的细节即可表达复杂的数据流模式。四个应用在 Alto 中的实现优于或持平于 LangGraph(一种流行的现有 AI 编程框架)中的实现。Alto 实现的延迟持平或改善 10-30%。


Article 86

Title@2025-06-20 (5): Incentivizing High-quality Participation From Federated Learning Agents

Title: Incentivizing High-quality Participation From Federated Learning Agents Anreize für eine qualitativ hochwertige Beteiligung von Federated Learning Agents 激励来自联邦学习代理机构的高质量参与 2506.16731v1

Authors (5): Jinlong Pang, Jiaheng Wei, Yifan Hua, Chen Qian, Yang Liu

Federated learning (FL) provides a promising paradigm for facilitating collaboration between multiple clients that jointly learn a global model without directly sharing their local data. However, existing research suffers from two caveats: 1) From the perspective of agents, voluntary and unselfish participation is often assumed, but self-interested agents may opt out of the system or provide low-quality contributions without proper incentives; 2) From the mechanism designer's perspective, the aggregated models can be unsatisfactory, as the existing game-theoretical federated learning approach for data collection ignores the potential heterogeneous effort caused by contributed data. To alleviate the above challenges, we propose an incentive-aware framework for agent participation that considers data heterogeneity to accelerate the convergence process. Specifically, we first introduce the notion of Wasserstein distance to explicitly illustrate the heterogeneous effort and reformulate the existing upper bound of convergence. To induce truthful reporting from agents, we analyze and measure the generalization error gap of any two agents by leveraging the peer prediction mechanism to develop score functions. We further present a two-stage Stackelberg game model that formalizes the process and examines the existence of equilibrium. Extensive experiments on real-world datasets demonstrate the effectiveness of our proposed mechanism.

联邦学习(FL)为促进在不直接分享其当地数据的情况下共同学习全球模型的多个客户之间的合作提供了一个有希望的范例;然而,现有研究有两种告诫:(1) 从代理人的角度来看,自愿和无私参与往往是假设的;但自我感兴趣的代理人可能选择退出系统或提供低质量贡献,而没有适当的奖励办法;(2) 从机制设计者的观点看,综合模型可能不令人满意,因为现有的游戏-理论-联邦化学习数据收集方法忽视了所提供数据可能造成的多样化努力。为了缓解上述挑战,我们提议一个奖励意识的代理人参与框架,其中考虑到数据差异性,以加速趋同进程。具体地说,我们首先提出瓦塞尔斯坦距离的概念,以明确说明各种努力,重新界定现有的趋同的上层界限。为了让代理人作出真实的报告,我们分析并衡量任何两种代理人的普遍错误差距,利用同侪预测机制来发展得分功能。我们进一步提出一个两阶段的Stackelberg游戏模式,将这一过程正规化,并审视平衡的存在。
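The abstract above uses the Wasserstein distance to quantify how heterogeneous the agents' data are. As a rough illustration only (the abstract does not say how the paper computes it), the sketch below evaluates the empirical 1-Wasserstein distance between two equal-size one-dimensional samples, which reduces to the mean absolute difference of the sorted values. The function name and the toy agent samples are hypothetical.

```python
def wasserstein_1d(xs, ys):
    """Empirical 1-Wasserstein distance between two equal-size 1D samples.

    For equal-size samples on the real line, the optimal transport plan
    matches the i-th smallest value of xs with the i-th smallest of ys.
    """
    if len(xs) != len(ys) or not xs:
        raise ValueError("samples must be non-empty and of equal size")
    pairs = zip(sorted(xs), sorted(ys))
    return sum(abs(a - b) for a, b in pairs) / len(xs)


# Hypothetical per-agent feature samples (illustrative values only).
agent_a = [0.0, 1.0, 2.0, 3.0]
agent_b = [1.0, 2.0, 3.0, 4.0]   # same shape as agent_a, shifted by 1
agent_c = [3.0, 1.0, 0.0, 2.0]   # same values as agent_a, reordered

print(wasserstein_1d(agent_a, agent_b))  # 1.0
print(wasserstein_1d(agent_a, agent_c))  # 0.0
```

A larger distance between an agent's data and a reference distribution would indicate more heterogeneous data; how that feeds into the convergence bound and the score functions is specific to the paper and is not reproduced here.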


Article 87

Title@2025-06-20 (5): Persistent HyTM via Fast Path Fine-Grained Locking

Title: Persistent HyTM via Fast Path Fine-Grained Locking Persistent HyTM über Schnellweg feinkörnige Verriegelung 长效HYTM通过快车道 精密的锁闭 2501.14783v2

Authors (3): Gaetano Coccimiglio, Trevor Brown, Srivatsan Ravi

Utilizing hardware transactional memory (HTM) in conjunction with non-volatile memory (NVM) to achieve persistence is quite difficult and somewhat awkward because the primitives used to write data to NVM abort HTM transactions. We present several persistent hybrid transactional memory (HyTM) implementations that, perhaps counterintuitively, utilize an HTM fast path primarily to read or acquire fine-grained locks which protect data items. Our implementations guarantee durable linearizable transactions, and the STM path satisfies either weak progressiveness or strong progressiveness. We discuss the design choices related to the differing progress guarantees and examine how these design choices impact performance. We evaluate our persistent HyTM implementations using various microbenchmarks. Our implementations achieve improved performance, especially for read-dominant workloads, compared to state-of-the-art persistent STMs and persistent HyTMs, despite the challenges and apparent awkwardness of using current HTM implementations to achieve persistence.

利用硬件交易存储器(HTM)与非挥发性内存(NVM)一起实现持久性相当困难,也有些尴尬,因为用于向NVM写入数据的原始数据将中止HTM交易。我们展示了几种持续的混合交易存储器(HyTM),这些存储器或许是反直觉的,利用HTM快速路径主要读取或获取保护数据项目的细细的锁。我们的实施保证了持久的可线性交易,STM路径要么满足了不易实现的可线性交易,要么满足了不易实现的渐进性或强劲的渐进性。我们讨论了与不同进展保障有关的设计选择,我们研究了这些设计选择如何影响业绩。我们利用各种微基准评估我们持续执行 HYTM系统的情况。我们的实施取得了更好的业绩,特别是在阅读主要工作量方面,尽管使用当前实施HTM系统实现持久性的持久性STMs和持续性 HYTM系统所带来的挑战和明显困难。


Article 88

Title@2025-06-19 (4): Enabling Blockchain Interoperability Through Network Discovery Services

Title: Enabling Blockchain Interoperability Through Network Discovery Services Aktivierung der Blockchain-Interoperabilität durch Network Discovery Services 通过网络发现服务促进链链互连互操作性 2506.16611v1

Authors (3): Khalid Hassan, Amirreza Sokhankhosh, Sara Rouhani

Web3 technologies have experienced unprecedented growth in the last decade, achieving widespread adoption. As various blockchain networks continue to evolve, we are on the cusp of a paradigm shift in which they could provide services traditionally offered by the Internet, but in a decentralized manner, marking the emergence of the Internet of Blockchains. While significant progress has been achieved in enabling interoperability between blockchain networks, existing solutions often assume that networks are already mutually aware. This reveals a critical gap: the initial discovery of blockchain networks remains largely unaddressed. This paper proposes a decentralized architecture for blockchain network discovery that operates independently of any centralized authority. We also introduce a mechanism for discovering assets and services within a blockchain from external networks. Given the decentralized nature of the proposed discovery architecture, we design an incentive mechanism to encourage nodes to actively participate in maintaining the discovery network. The proposed architecture, implemented and evaluated using the Substrate framework, demonstrates resilience and scalability: it handles up to 130,000 concurrent requests under the tested network configurations with a median response time of 5.5 milliseconds, and it can scale its processing capacity further by increasing its network size.

Web3 技术在过去十年中经历了前所未有的增长,获得了广泛的采用。随着各种链链网络的继续演变,我们正处于范式转变的边缘,它们可以提供传统上由互联网提供的服务,但以分散的方式,标志着链链互联网的出现。虽然在使链链网络之间互操作性方面取得了显著进展,但现有解决方案往往假定网络已经相互了解。这揭示了一个重大差距:最初发现链链网络的问题基本上没有得到解决。本文件建议建立一个分散的链网络发现架构,该架构独立于任何中央机构运作。我们还引入了在外部网络的链条中发现资产和服务的机制。鉴于拟议的发现架构的分散性质,我们设计了一个激励机制,鼓励节点积极参与发现网络的维护。拟议架构利用子网框架实施并评估了网络的复原力和可扩展性,在测试的网络配置下有效处理了多达130 000项同时提出的请求,其中位响应时间为5.5毫秒,显示了通过扩大网络规模进一步扩展其处理能力的能力。


Article 89

Title@2025-06-19 (4): Parallel Point-to-Point Shortest Paths and Batch Queries

Title: Parallel Point-to-Point Shortest Paths and Batch Queries Parallele Punkt-zu-Punkt-Kurze Pfade und Batch-Abfragen 平行点对点最短路径和批量查询 2506.16488v1

Authors (4): Xiaojun Dong, Andy Li, Yan Gu, Yihan Sun

We propose Orionet, efficient parallel implementations of Point-to-Point Shortest Paths (PPSP) queries using bidirectional search (BiDS) and other heuristics, with an additional focus on batch PPSP queries. We present a framework for parallel PPSP built on existing single-source shortest paths (SSSP) frameworks by incorporating pruning conditions. As a result, we develop efficient parallel PPSP algorithms based on early termination, bidirectional search, A$^*$ search, and bidirectional A$^*$, all with simple and efficient implementations. We extend our idea to batch PPSP queries, which are widely used in real-world scenarios. We first design a simple and flexible abstraction to represent the batch so that PPSP can leverage the shared information of the batch. Orionet formalizes the batch as a query graph represented by edges between queried sources and targets. In this way, we directly extend our PPSP framework to batched queries in a simple and efficient way. We evaluate Orionet on both single and batch PPSP queries using various graph types and distance percentiles of queried pairs, and compare it against two baselines, GraphIt and MBQ. Both of them support parallel single PPSP and A$^*$ using unidirectional search. On the 14 graphs we tested, on average, our bidirectional search is 2.9$\times$ faster than GraphIt and 6.8$\times$ faster than MBQ. Our bidirectional A$^*$ is 4.4$\times$ and 6.2$\times$ faster than the A$^*$ in GraphIt and MBQ, respectively. For batched PPSP queries, we also provide an in-depth experimental evaluation and show that Orionet provides strong performance compared to the plain solutions.

我们建议使用双向搜索(BIDS)和其他超光速搜索(PPSP)查询,在现有的单一来源最短路径(SSSP)框架的基础上,以现有单一来源最短路径(SSSP)为基础,建立一个平行的PPSP框架。结果,我们根据早期终止、双向搜索、美元搜索和双向直线A$(美元)查询,提出高效的平行的PPSP算法。我们把想法扩大到分批的PPSP查询,这些查询在现实世界情景中广泛使用。我们首先设计一个简单和灵活的抽象来代表批次,以便PPPSP能够利用现有最短路径(SS)的共享信息。Orionet将批次正式化为由查询来源和目标之间的边缘代表的查询图。通过这种方式,我们直接扩展我们的PPPSP框架,以简单而高效的方式分批的查询。我们用双批的OIPSP查询,用不同图表类型和距离的Oralalalalalal 美元来比较我们的OISP查询。我们用不同图表的搜索类型和直径直径直径直径搜索。
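To make the bidirectional-search idea in the abstract above concrete, here is a minimal sequential sketch of bidirectional Dijkstra with an early-termination (pruning) test. It is not Orionet's parallel implementation: the function name, the undirected adjacency-list format, and the toy graph are assumptions made for illustration.

```python
import heapq


def bidirectional_dijkstra(adj, source, target):
    """Shortest source-target distance on an undirected graph.

    adj maps each node to a list of (neighbor, weight) pairs with
    non-negative weights. Returns float('inf') if target is unreachable.
    """
    if source == target:
        return 0.0
    inf = float("inf")
    dist = [{source: 0.0}, {target: 0.0}]       # forward / backward labels
    heaps = [[(0.0, source)], [(0.0, target)]]  # one priority queue per side
    settled = [set(), set()]
    best = inf                                  # best full path found so far
    while heaps[0] and heaps[1]:
        # Pruning: the two frontiers together can no longer beat `best`.
        if heaps[0][0][0] + heaps[1][0][0] >= best:
            break
        side = 0 if heaps[0][0][0] <= heaps[1][0][0] else 1
        d, u = heapq.heappop(heaps[side])
        if u in settled[side]:
            continue                            # stale queue entry
        settled[side].add(u)
        for v, w in adj.get(u, ()):
            nd = d + w
            if nd < dist[side].get(v, inf):
                dist[side][v] = nd
                heapq.heappush(heaps[side], (nd, v))
            if v in dist[1 - side]:
                # The two searches meet at v: record the joined path length.
                best = min(best, nd + dist[1 - side][v])
    return best


graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("a", 1), ("c", 2), ("d", 5)],
    "c": [("a", 4), ("b", 2), ("d", 1)],
    "d": [("b", 5), ("c", 1)],
    "e": [],
}
print(bidirectional_dijkstra(graph, "a", "d"))  # 4.0 (a-b-c-d)
```

The search stops as soon as the sum of the two smallest queue keys reaches the best known path length, which is the kind of pruning condition the abstract describes layering on top of an SSSP framework.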


Article 90

Title@2025-06-19 (4): A Study of Synchronization Methods for Concurrent Size

Title: A Study of Synchronization Methods for Concurrent Size Eine Studie über Synchronisationsmethoden für die gleichzeitige Größe 同步体积同步化方法研究 2506.16350v1

Authors (3): Hen Kas-Sharir, Gal Sela, Erez Petrank

The size of collections, maps, and data structures in general, constitutes a fundamental property. An implementation of the size method is required in most programming environments. Nevertheless, in a concurrent environment, integrating a linearizable concurrent size introduces a noticeable overhead on all operations of the data structure, even when the size method is not invoked during the execution. In this work we present a study of synchronization methods in an attempt to improve the performance of the data structure. In particular, we study a handshake technique that is commonly used with concurrent garbage collection, an optimistic technique, and a lock-based technique. Evaluation against the state-of-the-art size methodology demonstrates that the overhead can be significantly reduced by selecting the appropriate synchronization approach, but there is no one-size-fits-all method. Different scenarios call for different synchronization methods, as rigorously shown in this study. Nevertheless, our findings align with general trends in concurrent computing. In scenarios characterized by low contention, optimistic and lock-based approaches work best, whereas under high contention, the most effective solutions are the handshake approach and the wait-free approach.

总体而言,收集、地图和数据结构的大小构成一个基本属性。在大多数编程环境中,需要实施规模方法。然而,在同时的环境中,整合一个可线性可扩展的并行体积,为数据结构的所有操作带来明显的间接费用,即使在执行过程中没有使用规模方法。在这项工作中,我们提出同步方法研究,以努力改进数据结构的性能。特别是,我们研究一种握手技术,通常在同时收集垃圾、乐观技术和锁定技术的情况下使用。对照最先进的规模方法进行评估,表明通过选择适当的同步法可以大大减少间接费用,但没有一种一刀切的方法。本研究报告严格显示,不同的情景要求采用不同的同步方法。然而,我们的调查结果与同时计算的一般趋势是一致的。在以低争议、乐观和锁定方法为特征的情景中,工作最有效的解决办法是握手方法和无等待方法。


Article 91

Title@2025-06-19 (4): LAECIPS: Large Vision Model Assisted Adaptive Edge-Cloud Collaboration for IoT-based Embodied Intelligence System

Title: LAECIPS: Large Vision Model Assisted Adaptive Edge-Cloud Collaboration for IoT-based Embodied Intelligence System LAECIPS: Large Vision Model Assisted Adaptive Edge-Cloud Collaboration für IoT-basiertes Embodyd Intelligence System LAECIPS: 以IoT为基础的内嵌式情报系统大型远景模型 辅助适应性边缘群落协作 2404.10498v2

Authors (6): Shijing Hu, Zhihui Lu, Xin Xu, Ruijun Deng, Xin Du, Qiang Duan

Embodied intelligence (EI) enables manufacturing systems to flexibly perceive, reason, adapt, and operate within dynamic shop floor environments. In smart manufacturing, a representative EI scenario is robotic visual inspection, where industrial robots must accurately inspect components on rapidly changing, heterogeneous production lines. This task requires both high inference accuracy especially for uncommon defects and low latency to match production speeds, despite evolving lighting, part geometries, and surface conditions. To meet these needs, we propose LAECIPS, a large vision model-assisted adaptive edge-cloud collaboration framework for IoT-based embodied intelligence systems. LAECIPS decouples large vision models in the cloud from lightweight models on the edge, enabling plug-and-play model adaptation and continual learning. Through a hard input mining-based inference strategy, LAECIPS routes complex and uncertain inspection cases to the cloud while handling routine tasks at the edge, achieving both high accuracy and low latency. Experiments conducted on a real-world robotic semantic segmentation system for visual inspection demonstrate significant improvements in accuracy, processing latency, and communication overhead compared to state-of-the-art methods. LAECIPS provides a practical and scalable foundation for embodied intelligence in smart manufacturing, especially in adaptive robotic inspection and quality control scenarios.

具身智能(EI)使制造系统能够在动态的车间环境中灵活地感知、推理、适应和运行。在智能制造中,一个具有代表性的 EI 场景是机器人视觉检测,工业机器人必须在快速变化的异构生产线上准确检测部件。尽管光照、零件几何形状和表面条件不断变化,这项任务仍要求较高的推理精度(尤其是针对罕见缺陷)以及与生产速度相匹配的低延迟。为满足这些需求,我们提出 LAECIPS,一个面向基于 IoT 的具身智能系统的大型视觉模型辅助自适应边云协作框架。LAECIPS 将云端的大型视觉模型与边缘的轻量模型解耦,支持即插即用的模型适配和持续学习。通过基于难例输入挖掘的推理策略,LAECIPS 将复杂和不确定的检测案例路由到云端,同时在边缘处理常规任务,从而兼顾高精度和低延迟。在用于视觉检测的真实机器人语义分割系统上进行的实验表明,与最先进的方法相比,其在精度、处理延迟和通信开销方面均有显著改进。LAECIPS 为智能制造中的具身智能,尤其是自适应机器人检测和质量控制场景,提供了实用且可扩展的基础。


Article 92

Title@2025-06-19 (4): Serving Large Language Models on Huawei CloudMatrix384

Title: Serving Large Language Models on Huawei CloudMatrix384 Große Sprachmodelle auf Huawei CloudMatrix384 瓦威云马特列克384 2506.12708v3

Authors (46): Pengfei Zuo, Huimin Lin, Junbo Deng, Nan Zou, Xingkun Yang, Yingyu Diao, Weifeng Gao, Ke Xu, Zhangyu Chen, Shirui Lu, Zhao Qiu, Peiyang Li, Xianyu Chang, Zhengzhong Yu, Fangzheng Miao, Jia Zheng, Ying Li, Yuan Feng, Bei Wang, Zaijian Zong, Mosong Zhou, Wenli Zhou, Houjiang Chen, Xingyu Liao, Yipeng Li, Wenxiao Zhang, Ping Zhu, Yinggang Wang, Chuanjie Xiao, Depeng Liang, Dong Cao, Juncheng Liu, Yongqiang Yang, Xiaolong Bai, Yi Li, Huaguo Xie, Huatao Wu, Zhibin Yu, Lv Chen, Hu Liu, Yujun Ding, Haipei Zhu, Jing Xia, Yi Xiong, Zhou Yu, Heng Liao

The rapid evolution of large language models (LLMs), driven by growing parameter scales, adoption of mixture-of-experts (MoE) architectures, and expanding context lengths, imposes unprecedented demands on AI infrastructure. Traditional AI clusters face limitations in compute intensity, memory bandwidth, inter-chip communication, and latency, compounded by variable workloads and strict service-level objectives. Addressing these issues requires fundamentally redesigned hardware-software integration. This paper introduces Huawei CloudMatrix, a next-generation AI datacenter architecture, realized in the production-grade CloudMatrix384 supernode. It integrates 384 Ascend 910 NPUs and 192 Kunpeng CPUs interconnected via an ultra-high-bandwidth Unified Bus (UB) network, enabling direct all-to-all communication and dynamic pooling of resources. These features optimize performance for communication-intensive operations, such as large-scale MoE expert parallelism and distributed key-value cache access. To fully leverage CloudMatrix384, we propose CloudMatrix-Infer, an advanced LLM serving solution incorporating three core innovations: a peer-to-peer serving architecture that independently scales prefill, decode, and caching; a large-scale expert parallelism strategy supporting EP320 via efficient UB-based token dispatch; and hardware-aware optimizations including specialized operators, microbatch-based pipelining, and INT8 quantization. Evaluation with the DeepSeek-R1 model shows CloudMatrix-Infer achieves state-of-the-art efficiency: prefill throughput of 6,688 tokens/s per NPU and decode throughput of 1,943 tokens/s per NPU (<50 ms TPOT). It effectively balances throughput and latency, sustaining 538 tokens/s per NPU even under stringent 15 ms latency constraints, while INT8 quantization maintains model accuracy across benchmarks.

大型语言模型(LLM)的快速演进由不断增长的参数规模、混合专家(MoE)架构的采用以及不断扩展的上下文长度所驱动,对 AI 基础设施提出了前所未有的要求。传统 AI 集群在计算强度、内存带宽、芯片间通信和延迟方面存在局限,而多变的工作负载和严格的服务级目标又加剧了这些问题。解决这些问题需要从根本上重新设计软硬件集成。本文介绍华为 CloudMatrix,一种下一代 AI 数据中心架构,并已在生产级 CloudMatrix384 超节点中实现。它集成了 384 个 Ascend 910 NPU 和 192 个 Kunpeng CPU,通过超高带宽的统一总线(UB)网络互连,支持直接的全对全通信和资源的动态池化。这些特性优化了通信密集型操作的性能,例如大规模 MoE 专家并行和分布式键值缓存访问。为充分利用 CloudMatrix384,我们提出 CloudMatrix-Infer,一种先进的 LLM 服务方案,包含三项核心创新:可独立扩展预填充、解码和缓存的对等服务架构;通过基于 UB 的高效 token 分发支持 EP320 的大规模专家并行策略;以及硬件感知优化,包括专用算子、基于微批次的流水线和 INT8 量化。使用 DeepSeek-R1 模型的评估表明,CloudMatrix-Infer 达到了最先进的效率:预填充吞吐量为每 NPU 6,688 tokens/s,解码吞吐量为每 NPU 1,943 tokens/s(TPOT <50 ms)。它有效地平衡了吞吐量和延迟,即使在严格的 15 ms 延迟约束下仍能维持每 NPU 538 tokens/s,同时 INT8 量化在各基准测试中保持了模型精度。


Article 93

Title@2025-06-19 (4): NetSenseML: Network-Adaptive Compression for Efficient Distributed Machine Learning

Title: NetSenseML: Network-Adaptive Compression for Efficient Distributed Machine Learning NetSenseML: Netzwerk-adaptive Kompression für effizientes verteiltes maschinelles Lernen NetSensenseML:高效分配机器学习网络-ADT压缩 2506.16235v1

Authors (5): Yisu Wang, Xinjiao Li, Ruilong Wu, Huangxun Chen, Dirk Kutscher

Training large-scale distributed machine learning models imposes considerable demands on network infrastructure, often resulting in sudden traffic spikes that lead to congestion, increased latency, and reduced throughput, which would ultimately affect convergence times and overall training performance. While gradient compression techniques are commonly employed to alleviate network load, they frequently compromise model accuracy due to the loss of gradient information. This paper introduces NetSenseML, a novel network adaptive distributed deep learning framework that dynamically adjusts quantization, pruning, and compression strategies in response to real-time network conditions. By actively monitoring network conditions, NetSenseML applies gradient compression only when network congestion negatively impacts convergence speed, thus effectively balancing data payload reduction and model accuracy preservation. Our approach ensures efficient resource usage by adapting reduction techniques based on current network conditions, leading to shorter convergence times and improved training efficiency. We present the design of the NetSenseML adaptive data reduction function and experimental evaluations show that NetSenseML can improve training throughput by a factor of 1.55 to 9.84 times compared to state-of-the-art compression-enabled systems for representative DDL training jobs in bandwidth-constrained conditions.

培训大规模分布式机器学习模式对网络基础设施提出了相当大的要求,往往导致交通突然激增,导致拥堵、延迟增加和吞吐量减少,最终将影响趋同时间和总体培训业绩。虽然梯度压缩技术通常用于减轻网络负荷,但由于丧失梯度信息,它们往往会损害模型的准确性。本文介绍了NetSenseML,这是一个适应性分散的深层学习框架,根据实时网络条件动态调整量化、剪裁和压缩战略。通过积极监测网络条件,NetSenseML只在网络拥堵对趋同速度产生不利影响时使用梯度压缩,从而有效平衡数据载荷减少和模型准确性保存。我们的方法是通过根据当前网络条件调整减少技术,缩短趋同时间,提高培训效率,从而确保高效使用资源。我们介绍了NetSenseML适应性数据减少功能的设计以及实验性评价显示,与在带宽限制条件下有代表性的DL培训岗位的现代化压缩辅助系统相比,NetSenseML能够改善1.55至9.84倍的人力培训。
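The abstract above describes compressing gradients only when the network is the bottleneck. The sketch below illustrates that idea under simplifying assumptions that are not taken from the paper: congestion is detected by comparing the estimated transfer time against a fixed time budget, and compression is plain top-k sparsification. All function names, parameters, and numbers are hypothetical.

```python
def choose_keep_ratio(payload_bytes, bandwidth_bytes_per_s, budget_s,
                      min_ratio=0.01):
    """Fraction of gradient entries to send so the transfer fits budget_s.

    Assumption for this sketch: a fixed per-step time budget stands in
    for a real congestion signal.
    """
    transfer_s = payload_bytes / bandwidth_bytes_per_s
    if transfer_s <= budget_s:
        return 1.0  # network keeps up: send the full gradient
    return max(min_ratio, budget_s / transfer_s)


def top_k_sparsify(grad, keep_ratio):
    """Keep the largest-magnitude entries as sorted (index, value) pairs."""
    k = max(1, int(len(grad) * keep_ratio))
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    return [(i, grad[i]) for i in sorted(order[:k])]


grad = [0.1, -2.0, 0.05, 1.5, -0.3, 0.0, 0.7, -0.01]

# Fast link: 400 MB over 10 GB/s takes 0.04 s, within the 0.1 s budget.
print(choose_keep_ratio(400e6, 1e10, 0.1))   # 1.0

# Congested link: 400 MB over 1 GB/s takes 0.4 s, so keep about a quarter.
ratio = choose_keep_ratio(400e6, 1e9, 0.1)
print(top_k_sparsify(grad, ratio))           # [(1, -2.0), (3, 1.5)]
```

A real system would also have to handle quantization, pruning, and the accuracy impact of dropped gradient entries, which this sketch leaves out.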


Article 94

Title@2025-06-19 (4): Federated Learning for MRI-based BrainAGE: a multicenter study on post-stroke functional outcome prediction

Title: Federated Learning for MRI-based BrainAGE: a multicenter study on post-stroke functional outcome prediction Föderated Learning for MRI-based BrainAGE: Eine multizentrische Studie zur post-stroke funktionellen Ergebnisvorhersage 为基于MRI的脑力智能学习联合会学习:关于打击后功能性结果预测的多中心研究 2506.15626v2

Authors (11): Vincent Roca, Marc Tommasi, Paul Andrey, Aurélien Bellet, Markus D. Schirmer, Hilde Henon, Laurent Puy, Julien Ramon, Grégory Kuchcinski, Martin Bretzner, Renaud Lopes

$\textbf{Objective:}$ Brain-predicted age difference (BrainAGE) is a neuroimaging biomarker reflecting brain health. However, training robust BrainAGE models requires large datasets, often restricted by privacy concerns. This study evaluates the performance of federated learning (FL) for BrainAGE estimation in ischemic stroke patients treated with mechanical thrombectomy, and investigates its association with clinical phenotypes and functional outcomes. $\textbf{Methods:}$ We used FLAIR brain images from 1674 stroke patients across 16 hospital centers. We implemented standard machine learning and deep learning models for BrainAGE estimates under three data management strategies: centralized learning (pooled data), FL (local training at each site), and single-site learning. We reported prediction errors and examined associations between BrainAGE and vascular risk factors (e.g., diabetes mellitus, hypertension, smoking), as well as functional outcomes at three months post-stroke. Logistic regression evaluated BrainAGE’s predictive value for these outcomes, adjusting for age, sex, vascular risk factors, stroke severity, time between MRI and arterial puncture, prior intravenous thrombolysis, and recanalisation outcome. $\textbf{Results:}$ While centralized learning yielded the most accurate predictions, FL consistently outperformed single-site models. BrainAGE was significantly higher in patients with diabetes mellitus across all models. Comparisons between patients with good and poor functional outcomes, and multivariate predictions of these outcomes showed the significance of the association between BrainAGE and post-stroke recovery. $\textbf{Conclusion:}$ FL enables accurate age predictions without data centralization. The strong association between BrainAGE, vascular risk factors, and post-stroke recovery highlights its potential for prognostic modeling in stroke care.

$\ textbf{ 目标 :} 大脑预测年龄差异 (BrainAGE) 是一个反映大脑健康的神经成形生物标记 。 但是, 培训强大的脑分析模型需要大型数据集。 通常受到隐私问题的限制。 本研究评估了在接受机械性心肌梗塞治疗的缺血中病人中用于脑分析估算的联邦学习(FL)的性能, 并调查了它与临床性细胞类型和功能结果的关系 。 $\ textbf{ 方法:} 我们使用了16个医院中心16个中风病人的FLAIR脑图象。 我们根据三种数据管理战略实施了脑分析的标准机能学习和深层次学习模型: 集中学习(集合数据)、 FL(每个站点的地方培训) 和单点学习。 我们报告了脑分析与血管风险因素(如糖尿病、高血压、高血压、高血压和高血压后脑反应) 三个月的功能变化结果。 将大脑预测值与这些结果的预测值与高血压、 直径直径直径直径直径直径直径直径直径直径直径直径直径分析结果数据 显示数据显示数据在中央数据显示中, 。
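For readers unfamiliar with how federated learning combines locally trained models without pooling data, the sketch below shows sample-size-weighted parameter averaging (the FedAvg rule). The abstract does not state which aggregation rule the study used, so this is an assumption for illustration; the function name, site weights, and site sizes are hypothetical.

```python
def fed_avg(site_weights, site_sizes):
    """Average per-site parameter vectors, weighted by local sample count.

    site_weights: list of equal-length parameter lists, one per site.
    site_sizes:   number of training samples held at each site.
    """
    if len(site_weights) != len(site_sizes) or not site_weights:
        raise ValueError("need one sample count per site")
    total = sum(site_sizes)
    n_params = len(site_weights[0])
    return [
        sum(w[j] * n for w, n in zip(site_weights, site_sizes)) / total
        for j in range(n_params)
    ]


# Two hypothetical hospital sites with 100 and 300 patients.
local_models = [[1.0, 2.0], [3.0, 4.0]]
print(fed_avg(local_models, [100, 300]))  # [2.5, 3.5]
```

Only the parameter vectors leave each site in this scheme; the images themselves stay local, which is the property the abstract highlights.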


Article 95

Title@2025-06-19 (4): Reconfigurable Intelligent Surface Assisted VEC Based on Multi-Agent Reinforcement Learning

Title: Reconfigurable Intelligent Surface Assisted VEC Based on Multi-Agent Reinforcement Learning Rekonfigurierbare intelligente oberflächenunterstützte VEC auf Basis von Multi-Agenten-Verstärkungslernen 基于多机构强化学习的可重新配置智能表面辅助VEC 2406.11318v2

Authors (6): Kangwei Qi, Qiong Wu, Pingyi Fan, Nan Cheng, Qiang Fan, Jiangzhou Wang

Vehicular edge computing (VEC) is an emerging technology that enables vehicles to perform high-intensity tasks by executing them locally or offloading them to nearby edge devices. However, obstacles such as buildings may degrade communication and cause interruptions, so the vehicle may not meet the requirements for task offloading. Reconfigurable intelligent surfaces (RIS) are introduced to support vehicle communication and provide an alternative communication path. The system performance can be improved by flexibly adjusting the phase shift of the RIS. For a RIS-assisted VEC system where tasks arrive randomly, we design a control scheme that considers offloading power, local power allocation, and phase-shift optimization. To solve this non-convex problem, we propose a new deep reinforcement learning (DRL) framework that employs a modified multi-agent deep deterministic policy gradient (MADDPG) approach to optimize the power allocation for vehicle users (VUs) and a block coordinate descent (BCD) algorithm to optimize the phase shift of the RIS. Simulation results show that our proposed scheme outperforms the centralized deep deterministic policy gradient (DDPG) scheme and a random scheme.

车辆边缘计算(Vec)是一种新兴技术,使车辆能够通过在当地执行任务或将其卸载到附近的边缘装置完成高强度任务,但建筑物等障碍可能会降低通信质量,造成通信中断,因此车辆可能达不到卸载任务的要求。引入了可配置智能表面(RIS)以支持车辆通信并提供替代通信路径。系统性能可以通过灵活调整RIS的阶段性位来改进。对于任务随机到达的RIS辅助VEC系统,我们设计了一个考虑到卸载能力、当地电力分配和分期制优化的控制方案。为解决这一非凝固问题,我们提出了一个新的深度强化学习(DRL)框架,采用经修改的多剂深度确定性政策梯度(MADPG)方法优化车辆使用者(VUs)的动力分配和块协调后代算法,以优化RIS的阶段性位。模拟结果表明,我们拟议的计划超出了中央深度确定性政策梯度(DPG)计划和随机计划。


Article 96

Title@2025-06-19 (4): Deep-Reinforcement-Learning-Based AoI-Aware Resource Allocation for RIS-Aided IoV Networks

Title: Deep-Reinforcement-Learning-Based AoI-Aware Resource Allocation for RIS-Aided IoV Networks Deep-Reinforcement-Learning-based AoI-Aware Ressourcenzuweisung für RIS-Aided IoV-Netzwerke 为RIS援助的IOV网络分配的深入加强-基于学习的AoI-软件资源 2406.11245v2

Authors (7): Kangwei Qi, Qiong Wu, Pingyi Fan, Nan Cheng, Wen Chen, Jiangzhou Wang, Khaled B. Letaief

Reconfigurable Intelligent Surface (RIS) is a pivotal technology in communication, offering an alternative path that significantly enhances the link quality in wireless communication environments. In this paper, we propose a RIS-assisted internet of vehicles (IoV) network, considering the vehicle-to-everything (V2X) communication method. In addition, in order to improve the timeliness of vehicle-to-infrastructure (V2I) links and the stability of vehicle-to-vehicle (V2V) links, we introduce the age of information (AoI) model and the payload transmission probability model. With the objective of minimizing the AoI of V2I links and prioritizing transmission of V2V link payloads, we formulate this optimization problem as a Markov decision process (MDP) in which the BS serves as an agent that allocates resources and controls the phase shift for the vehicles using the soft actor-critic (SAC) algorithm, which gradually converges and maintains high stability. An AoI-aware joint vehicular resource allocation and RIS phase-shift control scheme based on the SAC algorithm is proposed, and simulation results show that its convergence speed, cumulative reward, AoI performance, and payload transmission probability outperform those of proximal policy optimization (PPO), deep deterministic policy gradient (DDPG), twin delayed deep deterministic policy gradient (TD3), and stochastic algorithms.

可重构智能表面(RIS)是通信领域的一项关键技术,它提供了一条替代路径,可显著提升无线通信环境中的链路质量。本文提出一种 RIS 辅助的车联网(IoV)网络,并考虑车联万物(V2X)通信方式。此外,为了提高车辆到基础设施(V2I)链路的时效性和车辆到车辆(V2V)链路的稳定性,我们引入了信息年龄(AoI)模型和有效载荷传输概率模型。因此,以最小化 V2I 链路的 AoI 并优先保证 V2V 链路有效载荷的传输为目标,我们将该优化问题构建为马尔可夫决策过程(MDP)问题,其中基站(BS)作为智能体,使用软演员-评论家(SAC)算法为车辆分配资源并控制相移,该算法逐步收敛并保持较高的稳定性。我们提出了一种基于 SAC 算法的 AoI 感知的车辆资源分配与 RIS 相移联合控制方案,仿真结果表明,其收敛速度、累积奖励、AoI 性能和有效载荷传输概率均优于近端策略优化(PPO)、深度确定性策略梯度(DDPG)、双延迟深度确定性策略梯度(TD3)和随机算法。


Article 97

Title@2025-06-19 (4): DRL-Based Federated Self-Supervised Learning for Task Offloading and Resource Allocation in ISAC-Enabled Vehicle Edge Computing

Title: DRL-Based Federated Self-Supervised Learning for Task Offloading and Resource Allocation in ISAC-Enabled Vehicle Edge Computing DRL-basiertes, selbstüberwachtes Lernen für Aufgabe Offloading und Ressourcenallokation im ISAC-fähigen Fahrzeug Edge Computing DRL-基于DRL的基于联邦的自我监督学习,以在ISAC-可加入的车辆边缘电子计算中进行任务卸载和资源分配 2408.14831v2

Authors (6): Xueying Gu, Qiong Wu, Pingyi Fan, Nan Cheng, Wen Chen, Khaled B. Letaief

Intelligent Transportation Systems (ITS) leverage Integrated Sensing and Communications (ISAC) to enhance data exchange between vehicles and infrastructure in the Internet of Vehicles (IoV). This integration inevitably increases computing demands, risking real-time system stability. Vehicle Edge Computing (VEC) addresses this by offloading tasks to a Road Side Unit (RSU), ensuring timely services. Our previous work, the FLSimCo algorithm, uses local resources for Federated Self-Supervised Learning (SSL), but vehicles often cannot complete all iteration tasks. Our improved algorithm offloads part of the task to the RSU and optimizes energy consumption by adjusting transmission power, CPU frequency, and task assignment ratios, balancing local and RSU-based training. Meanwhile, setting an offloading threshold further prevents inefficiencies. Simulation results show that the enhanced algorithm reduces energy consumption and improves offloading efficiency and the accuracy of Federated SSL.

智能交通系统(ITS)利用通信感知一体化(ISAC)加强车联网(IoV)中车辆与基础设施之间的数据交换。这种整合不可避免地增加计算需求,危及实时系统稳定。车辆边缘计算(VEC)通过向路侧单元(RSU)卸载任务来解决这个问题,确保及时提供服务。我们以前的工作提出了FLSimCo算法,它使用本地资源进行联邦自监督学习(SSL),但车辆往往无法完成所有迭代任务。我们改进的算法将部分任务卸载到RSU,并通过调整传输功率、CPU频率和任务分配比率来优化能耗,平衡本地训练和基于RSU的训练。与此同时,设定卸载阈值可进一步防止效率低下。模拟结果表明,改进后的算法降低了能耗,提高了卸载效率和联邦自监督学习的准确性。


Article 98

Title@2025-06-19 (4): Ladder-residual: parallelism-aware architecture for accelerating large model inference with communication overlapping

Title: Ladder-residual: parallelism-aware architecture for accelerating large model inference with communication overlapping Leiter-Residual: Parallelismus-bewusste Architektur zur Beschleunigung großer Modellinferenz mit Kommunikationsüberlappung 云梯-残余:加速大型模型推断与通信重叠的平行意识结构 2501.06589v5

Authors (10): Muru Zhang, Mayank Mishra, Zhongzhu Zhou, William Brandon, Jue Wang, Yoon Kim, Jonathan Ragan-Kelley, Shuaiwen Leon Song, Ben Athiwaratkun, Tri Dao

Large language model inference is both memory-intensive and time-consuming, often requiring distributed algorithms to scale efficiently. Various model parallelism strategies are used in multi-GPU training and inference to partition computation across multiple devices, reducing memory load and computation time. However, using model parallelism necessitates communication of information between GPUs, which has been a major bottleneck and limits the gains obtained by scaling up the number of devices. We introduce Ladder Residual, a simple architectural modification applicable to all residual-based models that enables straightforward overlapping that effectively hides the latency of communication. Our insight is that in addition to systems optimization, one can also redesign the model architecture to decouple communication from computation. While Ladder Residual can allow communication-computation decoupling in conventional parallelism patterns, we focus on Tensor Parallelism in this paper, which is particularly bottlenecked by its heavy communication. For a Transformer model with 70B parameters, applying Ladder Residual to all its layers can achieve a 29% end-to-end wall-clock speedup at inference time with TP sharding over 8 devices. We refer to the resulting Transformer model as the Ladder Transformer. We train 1B and 3B Ladder Transformers from scratch and observe comparable performance to a standard dense transformer baseline. We also show that it is possible to convert parts of the Llama-3.1 8B model to our Ladder Residual architecture with minimal accuracy degradation by only retraining for 3B tokens. We release our code for training and inference for easier replication of experiments.

大型语言模型的推理既占用大量内存又耗时,往往需要分布式算法才能高效扩展。多GPU训练和推理中使用各种模型并行策略,将计算划分到多个设备上,以减少内存负荷和计算时间。但是,使用模型并行需要在GPU之间进行信息通信,这一直是主要瓶颈,限制了通过增加设备数量所获得的收益。我们引入了Ladder Residual,这是一种适用于所有基于残差的模型的简单架构修改,它可以实现直接的重叠,从而有效隐藏通信延迟。我们的见解是,除了系统优化之外,还可以重新设计模型架构,将通信与计算解耦。虽然Ladder Residual可以在常规并行模式中实现通信与计算的解耦,但本文重点关注张量并行(Tensor Parallelism),它尤其受制于繁重的通信。对于具有70B参数的Transformer模型,在其所有层上应用Ladder Residual,可以在8个设备上进行TP分片的推理时实现29%的端到端墙钟加速。我们将由此得到的Transformer模型称为Ladder Transformer。我们从头训练了1B和3B的Ladder Transformer,并观察到其性能与标准稠密Transformer基线相当。我们还表明,只需用3B个token重新训练,就可以将Llama-3.1 8B模型的部分结构转换为我们的Ladder Residual架构,且精度下降极小。我们发布了训练和推理代码,以便更容易地复现实验。
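
The abstract does not spell out how the residual stream is rewired, so the following is only a toy sketch of one plausible form of the decoupling it describes. It assumes a one-block delay: each block reads a residual stream that does not yet include the previous block's output, so that output's all-reduce could in principle overlap with the current block's computation. The function names are hypothetical, the blocks are plain Python callables, and no real communication takes place.

```python
def standard_forward(x, blocks):
    """Standard residual: block i reads a stream that already includes block i-1."""
    for f in blocks:
        x = x + f(x)  # f(x) must be fully available before the next block starts
    return x


def ladder_forward(x, blocks):
    """Ladder-style residual (assumed form): block i reads a stream that lacks
    block i-1's output, so that output could still be 'in flight'."""
    pending = 0.0  # previous block's output, notionally still being communicated
    for f in blocks:
        out = f(x)       # computed without waiting for `pending`
        x = x + pending  # previous block's output joins the stream one step late
        pending = out
    return x + pending   # flush the last block's output


# Two toy "blocks" acting on a scalar stream.
blocks = [lambda v: 2 * v, lambda v: v + 1]
standard_result = standard_forward(1.0, blocks)
ladder_result = ladder_forward(1.0, blocks)
```

The two functions compute different values for the same blocks. This matches the abstract's framing of Ladder Residual as an architectural change, with models trained from scratch or retrained, and not a drop-in reordering of an existing network.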


Article 99

Title@2025-06-19 (4): HetGPU: The pursuit of making binary compatibility towards GPUs

Title: HetGPU: The pursuit of making binary compatibility towards GPUs HetGPU: Das Streben nach binärer Kompatibilität gegenüber GPUs HETGPU: 努力使二进制兼容到 GPUs 2506.15993v1

Authors (4): Yiwei Yang, Yusheng Zheng, Tong Yu, Andi Quinn

Heterogeneous GPU infrastructures present a binary compatibility challenge: code compiled for one vendor’s GPU will not run on another due to divergent instruction sets, execution models, and driver stacks. We propose hetGPU, a new system comprising a compiler, runtime, and abstraction layer that together enable a single GPU binary to execute on NVIDIA, AMD, Intel, and Tenstorrent hardware. The hetGPU compiler emits an architecture-agnostic GPU intermediate representation (IR) and inserts metadata for managing execution state. The hetGPU runtime then dynamically translates this IR to the target GPU’s native code and provides a uniform abstraction of threads, memory, and synchronization. Our design tackles key challenges: differing SIMT vs. MIMD execution (warps on NVIDIA/AMD vs. many-core RISC-V on Tenstorrent), varied instruction sets, scheduling and memory model discrepancies, and the need for state serialization for live migration. We detail the hetGPU architecture, including the IR transformation pipeline, a state capture/reload mechanism for live GPU migration, and an abstraction layer that bridges warp-centric and core-centric designs. Preliminary evaluation demonstrates that unmodified GPU binaries compiled with hetGPU can be migrated across disparate GPUs with minimal overhead, opening the door to vendor-agnostic GPU computing.

异构GPU基础设施存在二进制兼容性挑战:由于指令集、执行模型和驱动栈各不相同,为一个供应商的GPU编译的代码无法在另一个供应商的GPU上运行。我们提出hetGPU,这是一个由编译器、运行时和抽象层组成的新系统,它们共同使单个GPU二进制文件能够在NVIDIA、AMD、Intel和Tenstorrent硬件上执行。hetGPU编译器生成与架构无关的GPU中间表示(IR),并插入用于管理执行状态的元数据。随后,hetGPU运行时将该IR动态翻译为目标GPU的原生代码,并提供线程、内存和同步的统一抽象。我们的设计解决了若干关键挑战:SIMT与MIMD执行方式的差异(NVIDIA/AMD上的warp与Tenstorrent上的众核RISC-V)、不同的指令集、调度和内存模型的差异,以及实时迁移所需的状态序列化。我们详细介绍了hetGPU架构,包括IR转换流水线、用于GPU实时迁移的状态捕获/重载机制,以及连接以warp为中心和以核心为中心两类设计的抽象层。初步评估表明,使用hetGPU编译的未经修改的GPU二进制文件可以在不同GPU之间迁移,开销极小,为供应商无关的GPU计算打开了大门。


Article 100

Title@2025-06-19 (4): KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider

Title: KVCache Cache in the Wild: Characterizing and Optimizing KVCache Cache at a Large Cloud Provider KVCache Cache in der Wildnis: KVCache Cache bei einem großen Cloud-Anbieter charakterisieren und optimieren KVcache 野生缓存: 大云提供方的 KVcache 缓存的特性和优化 KVcache 缓存 2506.02634v3

Authors (9): Jiahao Wang, Jinbo Han, Xingda Wei, Sijie Shen, Dingyan Zhang, Chenguang Fang, Rong Chen, Wenyuan Yu, Haibo Chen

Serving large language models (LLMs) is important for cloud providers, and caching intermediate results (KV$) after processing each request substantially improves serving throughput and latency. However, there is limited understanding of how LLM serving benefits from KV$ caching, where system design decisions like cache eviction policies are highly workload-dependent. In this paper, we present the first systematic characterization of the KV$ workload patterns from one of the leading LLM service providers. We draw observations that were not covered by previous studies focusing on synthetic workloads, including: KV$ reuses are skewed across requests, where reuses between single-turn requests are as important as those between multi-turn requests; the reuse time and probability are diverse considering all requests, but for a specific request category, the pattern tends to be predictable; and the overall cache size required for an ideal cache hit ratio is moderate. Based on the characterization, we further propose a workload-aware cache eviction policy that improves the serving performance under real-world traces, especially with limited cache capacity.

为大型语言模型(LLM)提供服务对云提供商十分重要,在处理每个请求后缓存中间结果(KV$)可大幅提高服务吞吐量并降低延迟。然而,人们对LLM服务如何从KV$缓存中受益了解有限,而缓存逐出策略等系统设计决策高度依赖于工作负载。在本文中,我们首次对一家领先的LLM服务提供商的KV$工作负载模式进行了系统刻画。我们得出了以往侧重于合成工作负载的研究所未涵盖的观察结果,包括:KV$复用在各请求之间分布不均,其中单轮请求之间的复用与多轮请求同样重要;就所有请求而言,复用时间和概率各不相同,但对于特定的请求类别,其模式往往是可预测的;达到理想缓存命中率所需的总体缓存大小适中。基于上述刻画,我们进一步提出一种工作负载感知的缓存逐出策略,可在真实世界的trace下提升服务性能,尤其是在缓存容量有限的情况下。
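
The abstract observes that reuse is predictable within a request category and proposes a workload-aware eviction policy, but it does not describe the policy itself. The sketch below is therefore not the authors' algorithm. It assumes a hypothetical table of per-category reuse probabilities (for example, estimated offline from traces) and evicts the entry whose category is least likely to be reused, breaking ties by least recent access. The class name, category labels, and probabilities are illustrative only.

```python
from dataclasses import dataclass, field


@dataclass
class WorkloadAwareCache:
    """Toy KV$ cache that evicts the entry predicted least likely to be reused."""
    capacity: int
    reuse_prob: dict  # hypothetical: category -> estimated reuse probability
    entries: dict = field(default_factory=dict)  # key -> (category, last_access)
    clock: int = 0

    def access(self, key, category):
        """Return True on a hit; on a miss, insert the key (evicting if full)."""
        self.clock += 1
        hit = key in self.entries
        if not hit and len(self.entries) >= self.capacity:
            self._evict()
        self.entries[key] = (category, self.clock)
        return hit

    def _evict(self):
        # Victim: lowest predicted reuse probability, ties broken by oldest access.
        victim = min(
            self.entries,
            key=lambda k: (
                self.reuse_prob.get(self.entries[k][0], 0.0),
                self.entries[k][1],
            ),
        )
        del self.entries[victim]


cache = WorkloadAwareCache(capacity=2, reuse_prob={"chat": 0.8, "api": 0.1})
cache.access("a", "chat")
cache.access("b", "api")
cache.access("c", "chat")  # cache is full: the "api" entry is evicted first
```

A plain LRU cache would have evicted `"a"` here, since it is the oldest entry. The category-aware variant keeps it because its category is assumed to be reused more often.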