cs.DC @ 2025-07-11: 106
00 07-10 (4) KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling 2507.07932v1
01 07-10 Parallel CPU-GPU Execution for LLM Inference on Constrained GPUs 2506.03296v3
02 07-10 Nexus: Taming Throughput-Latency Tradeoff in LLM Serving via Efficient GPU Sharing 2507.06608v2
03 07-10 DiP: A Scalable, Energy-Efficient Systolic Array for Matrix Multiplication Acceleration 2412.09709v2
04 07-10 Accelerating Transposed Convolutions on FPGA-based Edge Devices 2507.07683v1
05 07-10 Multi-agent Reinforcement Learning-based In-place Scaling Engine for Edge-cloud Systems 2507.07671v1
06 07-10 Stress Monitoring in Healthcare: An Ensemble Machine Learning Framework Using Wearable Sensor Data 2507.07589v1
07 07-10 TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference 2505.11329v2
08 07-10 A Unified Ontology for Scalable Knowledge Graph-Driven Operational Data Analytics in High-Performance Computing Systems 2507.06107v2
09 07-10 Opt-GPTQ: An Optimized GPTQ Combining Sparse Attention and Quantization Techniques 2505.02351v2
10 07-10 KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows 2507.07400v1
11 07-10 Future Resource Bank for ISAC: Achieving Fast and Stable Win-Win Matching for Both Individuals and Coalitions 2502.08118v5
12 07-10 Constraint Programming Models For Serial Batch Scheduling With Minimum Batch Size 2504.08793v2
13 07-10 Machine Learning-driven Multiscale MD Workflows: The Mini-MuMMI Experience 2507.07352v1
14 07-09 (3) Compute Can’t Handle the Truth: Why Communication Tax Prioritizes Memory and Interconnects in Modern AI Infrastructure 2507.07223v1
15 07-09 A Terminology for Scientific Workflow Systems 2506.07838v6
16 07-09 Integrating Odeint Time Stepping into OpenFPM for Distributed and GPU Accelerated Numerical Solvers 2309.05331v2
17 07-09 Accelerated Spatio-Temporal Bayesian Modeling for Multivariate Gaussian Processes 2507.06938v1
18 07-09 DICE: Data Influence Cascade in Decentralized Learning 2507.06931v1
19 07-09 Towards Enterprise-Ready Computer Using Generalist Agent 2503.01861v3
20 07-09 New Distributed Interactive Proofs for Planarity: A Matter of Left and Right 2505.00338v3
21 07-09 Silent Failures in Stateless Systems: Rethinking Anomaly Detection for Serverless Computing 2507.04969v2
22 07-09 iDynamics: A Novel Framework for Evaluating Microservice Scheduling Policies under Controllable Dynamics in Cloud-Edge Continuum 2503.16029v3
23 07-09 Towards Efficient and Scalable Distributed Vector Search with RDMA 2507.06653v1
24 07-09 Multi-objective methods in Federated Learning: A survey and taxonomy 2502.03108v2
25 07-09 M$^2$-MFP: A Multi-Scale and Multi-Level Memory Failure Prediction Framework for Reliable Cloud Infrastructure 2507.07144v1
26 07-09 SlimCaching: Edge Caching of Mixture-of-Experts for Distributed Inference 2507.06567v1
27 07-09 A Single Merging Suffices: Recovering Server-based Learning Performance in Decentralized Learning 2507.06542v1
28 07-09 Designing Parallel Algorithms for Community Detection using Arachne 2507.06471v1
29 07-08 (2) FedPhD: Federated Pruning with Hierarchical Learning of Diffusion Models 2507.06449v1
30 07-08 Fine-tuning Multimodal Transformers on Edge: A Parallel Split Learning Approach 2502.06355v3
31 07-08 Ampere: Communication-Efficient and High-Accuracy Split Federated Learning 2507.07130v1
32 07-08 Few-Shot Learning by Explicit Physics Integration: An Application to Groundwater Heat Transport 2507.06062v1
33 07-08 Efficient Federated Learning with Timely Update Dissemination 2507.06031v1
34 07-08 ECORE: Energy-Conscious Optimized Routing for Deep Learning Models at the Edge 2507.06011v1
35 07-08 Towards Serverless Processing of Spatiotemporal Big Data Queries 2507.06005v1
36 07-08 Containerization in Multi-Cloud Environment: Roles, Strategies, Challenges, and Solutions for Effective Implementation 2403.12980v3
37 07-08 Conthereum: Concurrent Ethereum Optimized Transaction Scheduling for Multi-Core Execution 2504.07280v2
38 07-08 Fundamental Limits of Hierarchical Secure Aggregation with Cyclic User Association 2503.04564v5
39 07-08 A Formal Refutation of the Blockchain Trilemma 2507.05809v1
40 07-08 A Distributed Consensus Algorithm for Prioritizing Autonomous Vehicle Passing at Unsignalized Intersections under Mixed Traffic 2507.03486v2
41 07-08 Air-FedGA: A Grouping Asynchronous Federated Learning Mechanism Exploiting Over-the-air Computation 2507.05704v1
42 07-08 Skipper: Maximal Matching with a Single Pass over Edges 2507.04420v2
43 07-08 Archetype-Aware Predictive Autoscaling with Uncertainty Quantification for Serverless Workloads on Kubernetes 2507.05653v1
44 07-08 Curvature-Aligned Federated Learning (CAFe): Harmonizing Loss Landscapes for Fairness Without Demographics 2404.19725v5
45 07-08 On Optimizing Resource Utilization in Distributed Connected Components 2507.03695v2
46 07-08 MOD-X: A Modular Open Decentralized eXchange Framework proposal for Heterogeneous Interoperable Artificial Intelligence Agents 2507.04376v2
47 07-08 Torpor: GPU-Enabled Serverless Computing for Low-Latency, Resource-Efficient Inference 2306.03622v3
48 07-07 (1) When Federated Learning Meets Quantum Computing: Survey and Research Opportunities 2504.08814v2
49 07-07 Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding 2507.07120v1
50 07-07 Cooperative Gradient Coding 2507.05230v1
51 07-07 GPU-based complete search for nonlinear minimization subject to bounds 2507.01770v2
52 07-07 MoLink: Distributed and Efficient Serving Framework for Large Models 2507.05043v1
53 07-07 Distributed Approximation Algorithms for Minimum Dominating Set in Locally Nice Graphs 2507.04960v1
54 07-07 Bullshark on Narwhal: Implementation-level Workflow Analysis of Round-based DAG Consensus in Theory and Practice 2507.04956v1
55 07-07 BackFed: An Efficient & Standardized Benchmark Suite for Backdoor Attacks in Federated Learning 2507.04903v1
56 07-07 Phantom Subgroup Poisoning: Stealth Attacks on Federated Recommender Systems 2507.06258v1
57 07-07 High Order Collaboration-Oriented Federated Graph Neural Network for Accurate QoS Prediction 2507.05308v1
58 07-07 A fast MPI-based Distributed Hash-Table as Surrogate Model demonstrated in a coupled reactive transport HPC simulation 2504.14374v2
59 07-07 Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms 2507.04786v1
60 07-07 Communication Round and Computation Efficient Exclusive Prefix-Sums Algorithms (for MPI_Exscan) 2507.04785v1
61 07-07 Semitopology: distributed collaborative action via topology, algebra, and logic 2402.03253v3
62 07-07 Performance Evaluation of General Purpose Large Language Models for Basic Linear Algebra Subprograms Code Generation 2507.04697v1
63 07-07 Learning from the Past: Adaptive Parallelism Tuning for Stream Processing Systems 2504.12074v2
64 07-07 RAPTOR: Practical Numerical Profiling of Scientific Applications 2507.04647v1
65 07-07 SPTCStencil: Using Sparse Tensor Cores for Stencil Computation 2506.22035v2
66 07-07 CFP: Efficient Optimization of Intra-Operator Parallelism Plans for Large Model Training 2504.00598v2
67 07-07 Denoising Application Performance Models with Noise-Resilient Priors 2504.10996v2
68 07-06 (7) Exploring Micro Frontends: A Case Study Application in E-Commerce 2506.21297v2
69 07-06 Agentic Distributed Computing 2507.04459v1
70 07-06 Static Analysis for Detecting Transaction Conflicts in Ethereum Smart Contracts 2507.04357v1
71 07-06 Heterogeneous Federated Learning with Prototype Alignment and Upscaling 2507.04310v1
72 07-05 (6) Gathering Teams of Bounded Memory Agents on a Line 2507.04172v1
73 07-05 A3FR: Agile 3D Gaussian Splatting with Incremental Gaze Tracked Foveated Rendering in Virtual Reality 2507.04147v1
74 07-05 HiPerMotif: Novel Parallel Subgraph Isomorphism in Large-Scale Property Graphs 2507.04130v1
75 07-05 One-Bit Model Aggregation for Differentially Private and Byzantine-Robust Personalized Federated Learning 2507.03973v1
76 07-05 FedFog: Resource-Aware Federated Learning in Edge and Fog Networks 2507.03952v1
77 07-05 On Fault Tolerance of Data Storage Systems: A Holistic Perspective 2507.03849v1
78 07-04 (5) Distributed Equivariant Graph Neural Networks for Large-Scale Electronic Structure Prediction 2507.03840v1
79 07-04 RVISmith: Fuzzing Compilers for RVV Intrinsics 2507.03773v1
80 07-04 Benchmarking Vector, Graph and Hybrid Retrieval Augmented Generation (RAG) Pipelines for Open Radio Access Networks (ORAN) 2507.03608v1
81 07-04 FastSet: Parallel Claim Settlement 2506.23395v2
82 07-04 Hiku: Pull-Based Scheduling for Serverless Computing 2502.15534v2
83 07-04 Universal Checkpointing: A Flexible and Efficient Distributed Checkpointing System for Large-Scale DNN Training with Reconfigurable Parallelis 2406.18820v3
84 07-04 Analysis and Optimized CXL-Attached Memory Allocation for Long-Context LLM Fine-Tuning 2507.03305v1
85 07-04 Lion Cub: Minimizing Communication Overhead in Distributed Lion 2411.16462v2
86 07-04 Novel Blockchain-based Protocols for Electronic Voting and Auctions 2507.03258v1
87 07-04 HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration 2506.10401v2
88 07-03 (4) Symbiosis: Multi-Adapter Inference and Fine-Tuning 2507.03220v1
89 07-03 Collective Communication Profiling of Modern-day Machine Learning Workloads 2507.07117v1
90 07-03 BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers 2507.03117v1
91 07-03 Characterizing Compute-Communication Overlap in GPU-Accelerated Distributed Deep Learning: Performance and Power Implications 2507.03114v1
92 07-03 Cppless: Single-Source and High-Performance Serverless Programming in C++ 2401.10834v2
93 07-03 HybridTier: an Adaptive and Lightweight CXL-Memory Tiering System 2312.04789v2
94 07-03 PS-WL: A Probability-Sensitive Wear Leveling scheme for SSD array scaling 2506.19660v2
95 07-03 FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference 2507.02620v1
96 07-03 MULTI-SCOUT: Multistatic Integrated Sensing and Communications in 5G and Beyond for Moving Target Detection, Positioning, and Tracking 2507.02613v1
97 07-03 Analysing semantic data storage in Distributed Ledger Technologies for Data Spaces 2507.07116v1
98 07-03 AI Flow: Perspectives, Scenarios, and Approaches 2506.12479v2
99 07-03 Resolving CAP Through Automata-Theoretic Economic Design: A Unified Mathematical Framework for Real-Time Partition-Tolerant Systems 2507.02464v1
100 07-03 Red grape detection with accelerated artificial neural networks in the FPGA’s programmable logic 2507.02443v1
101 07-03 The Artificial Scientist – in-transit Machine Learning of Plasma Simulations 2501.03383v3
102 07-03 Alps, a versatile research infrastructure 2507.02404v1
103 07-03 VeFIA: An Efficient Inference Auditing Framework for Vertical Federated Collaborative Software 2507.02376v1
104 07-03 Flotilla: A scalable, modular and resilient federated learning framework for heterogeneous resources 2507.02295v1
105 07-03 Domain-Adversarial Transfer Learning for Fault Root Cause Identification in Cloud Computing Systems 2507.02233v1
Article 0
Title@2025-07-10 (4): KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling
Title: KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling | 2507.07932v1
Authors (5): Guilin Zhang, Wulan Guo, Ziqi Tan, Qiang Guan, Hailong Jiang
Autoscaling GPU inference workloads in Kubernetes remains challenging because default mechanisms such as the Horizontal Pod Autoscaler (HPA) are reactive and threshold-based: they struggle under dynamic, bursty traffic and lack integration with GPU-level metrics. We present KIS-S, a unified framework that combines KISim, a GPU-aware Kubernetes Inference Simulator, with KIScaler, a Proximal Policy Optimization (PPO)-based autoscaler. KIScaler learns latency-aware and resource-efficient scaling policies entirely in simulation and is deployed directly without retraining. Experiments across four traffic patterns show that KIScaler improves average reward by 75.2%, reduces P95 latency by up to 6.7x over CPU baselines, and generalizes without retraining. Our work bridges the gap between reactive autoscaling and intelligent orchestration for scalable GPU-accelerated environments.
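To make the "reactive and threshold-based" behaviour concrete, the sketch below reproduces the scaling rule documented for the Kubernetes HPA: desired replicas = ceil(current replicas x current metric / target metric), skipped when the ratio is within a tolerance (0.1 by default). It is not part of KIS-S, and the metric values in the usage lines are made up for illustration.

```python
import math


def hpa_desired_replicas(current_replicas, current_metric, target_metric,
                         tolerance=0.1):
    """Replica count suggested by the documented HPA rule.

    The rule only sees the metric value observed right now, which is why
    it reacts after a burst has already arrived.
    """
    ratio = current_metric / target_metric
    # Within the tolerance band the HPA leaves the deployment unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


# Illustrative values only: utilisation in percent against a 60% target.
print(hpa_desired_replicas(4, 90, 60))  # scale up
print(hpa_desired_replicas(4, 63, 60))  # inside tolerance, unchanged
print(hpa_desired_replicas(4, 30, 60))  # scale down
```

The rule has no notion of GPU-level state or of upcoming load, which is the gap the abstract says a learned policy is meant to close.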
Article 1
Title@2025-07-10 (4): Parallel CPU-GPU Execution for LLM Inference on Constrained GPUs
Title: Parallel CPU-GPU Execution for LLM Inference on Constrained GPUs | 2506.03296v3
Authors (4): Jiakun Fan, Yanglin Zhang, Xiangchen Li, Dimitrios S. Nikolopoulos
Deploying large language models (LLMs) for online inference is often constrained by limited GPU memory, particularly due to the growing KV cache during auto-regressive decoding. Hybrid GPU-CPU execution has emerged as a promising solution by offloading KV cache management and parts of attention computation to the CPU. However, a key bottleneck remains: existing schedulers fail to effectively overlap CPU-offloaded tasks with GPU execution during the latency-critical, bandwidth-bound decode phase. This particularly penalizes real-time, decode-heavy applications (e.g., chat, Chain-of-Thought reasoning), which are underserved by existing systems, especially under the memory pressure typical of edge or low-cost deployments. We present APEX, a novel, profiling-informed scheduling strategy that maximizes CPU-GPU parallelism during hybrid LLM inference. Unlike systems relying on static rules or purely heuristic approaches, APEX dynamically dispatches compute across heterogeneous resources by predicting execution times of CPU and GPU subtasks to maximize overlap while avoiding scheduling overheads. We evaluate APEX on diverse workloads and GPU architectures (NVIDIA T4, A10), using LLaMa-2-7B and LLaMa-3.1-8B models. Compared to GPU-only schedulers like vLLM, APEX improves throughput by 84-96% on T4 and 11-89% on A10 GPUs, while preserving latency. Against the best existing hybrid schedulers, it delivers up to 49% (T4) and 37% (A10) higher throughput in long-output settings. APEX significantly advances hybrid LLM inference efficiency on memory-constrained hardware and provides a blueprint for scheduling in heterogeneous AI systems, filling a critical gap for efficient real-time LLM applications.
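The idea of using predicted execution times to decide when CPU and GPU work should overlap can be illustrated with a toy decision rule. This is a minimal sketch under stated assumptions, not the APEX scheduler: the `Prediction` fields, the `max(...) + overhead` cost model, and all numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Hypothetical per-iteration time predictions, in milliseconds."""
    gpu_only_ms: float       # GPU runs every subtask serially
    gpu_part_ms: float       # GPU share when some attention is offloaded
    cpu_part_ms: float       # CPU share for the offloaded attention
    sync_overhead_ms: float  # transfer and synchronisation cost


def overlapped_ms(p):
    # With perfect overlap the iteration ends when the slower side ends,
    # plus whatever it costs to synchronise the two devices.
    return max(p.gpu_part_ms, p.cpu_part_ms) + p.sync_overhead_ms


def choose_plan(p):
    """Pick the plan with the lower predicted iteration time."""
    hybrid = overlapped_ms(p)
    if hybrid < p.gpu_only_ms:
        return ("hybrid", hybrid)
    return ("gpu_only", p.gpu_only_ms)


# Made-up predictions: offloading pays off only while the CPU keeps up.
print(choose_plan(Prediction(10.0, 6.0, 7.0, 1.0)))
print(choose_plan(Prediction(10.0, 6.0, 12.0, 1.0)))
```

In this toy model the hybrid plan wins only when the CPU-side time stays close to the GPU-side time, which is one way to read the abstract's emphasis on maximizing overlap rather than simply offloading as much as possible.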
Article 2
Title@2025-07-10 (4): Nexus: Taming Throughput-Latency Tradeoff in LLM Serving via Efficient GPU Sharing
Title: Nexus: Taming Throughput-Latency Tradeoff in LLM Serving via Efficient GPU Sharing | 2507.06608v2
Authors (5): Xiaoxiang Shi, Colin Cai, Junjia Du, Zhanda Zhu, Zhihao Jia
Current prefill-decode (PD) disaggregation is typically deployed at the level of entire serving engines, assigning separate GPUs to handle prefill and decode phases. While effective at reducing latency, this approach demands more hardware. To improve GPU utilization, Chunked Prefill mixes prefill and decode requests within the same batch, but introduces phase interference between prefill and decode. While existing PD disaggregation solutions separate the phases across GPUs, we ask: can the same decoupling be achieved within a single serving engine? The key challenge lies in managing the conflicting resource requirements of prefill and decode when they share the same hardware. In this paper, we first show that chunked prefill requests cause interference with decode requests due to their distinct requirements for GPU resources. Second, we find that GPU resources exhibit diminishing returns. Beyond a saturation point, increasing GPU allocation yields negligible latency improvements. This insight enables us to split a single GPU’s resources and dynamically allocate them to prefill and decode on the fly, effectively disaggregating the two phases within the same GPU. Across a range of models and workloads, our system Nexus achieves up to 2.2x higher throughput, 20x lower TTFT, and 2.5x lower TBT than vLLM. It also outperforms SGLang with up to 2x higher throughput, 2x lower TTFT, and 1.7x lower TBT, and achieves 1.4x higher throughput than vLLM-disaggregation using only half the number of GPUs.
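The diminishing-returns observation can be illustrated with a toy model in which a phase stops speeding up once its GPU share passes a saturation point. This is only a sketch under assumed numbers: the latency model, the saturation values, and the min-max objective are hypothetical and are not taken from Nexus.

```python
def phase_latency(work_ms, share, saturation):
    """Toy latency of a phase that receives `share` of the GPU (0 to 1).

    Assumption: latency scales inversely with the share until the share
    reaches `saturation`; extra resources beyond that point are wasted.
    """
    return work_ms / min(share, saturation)


def best_split(prefill_work, decode_work, prefill_sat, decode_sat, steps=10):
    """Search share splits and minimise the slower phase's latency."""
    best = None
    for i in range(1, steps):
        prefill_share = i / steps
        decode_share = (steps - i) / steps
        cost = max(phase_latency(prefill_work, prefill_share, prefill_sat),
                   phase_latency(decode_work, decode_share, decode_sat))
        if best is None or cost < best[0]:
            best = (cost, prefill_share, decode_share)
    return best


# Made-up workload: decode saturates at 40% of the GPU, prefill at 80%.
print(phase_latency(2.0, 0.4, 0.4), phase_latency(2.0, 1.0, 0.4))
print(best_split(8.0, 2.0, prefill_sat=0.8, decode_sat=0.4))
```

The first print shows the same decode latency at 40% and 100% of the GPU, so the share above the saturation point is better spent on prefill; the second print shows the split the search selects for these assumed values.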
Article 3
Title@2025-07-10 (4): DiP: A Scalable, Energy-Efficient Systolic Array for Matrix Multiplication Acceleration
Title: DiP: A Scalable, Energy-Efficient Systolic Array for Matrix Multiplication Acceleration | 2412.09709v2
Authors (3): Ahmed J. Abdelmaksoud, Shady Agwa, Themis Prodromakis
Transformers are gaining increasing attention across different application domains due to their outstanding accuracy. However, these data-intensive models place significant performance demands on existing computing architectures. Systolic arrays are spatial architectures that have been adopted by commercial AI computing platforms (like Google TPUs) because of their energy-efficient data reuse. However, these spatial architectures pay a penalty in throughput and energy efficiency because input and output synchronization requires First-In-First-Out (FIFO) buffers. This paper proposes a novel scalable systolic-array architecture featuring Diagonal-Input and Permutated weight-stationary (DiP) dataflow for the acceleration of matrix multiplication. The proposed architecture eliminates the synchronization FIFOs required by state-of-the-art weight-stationary systolic arrays. Beyond the area, power, and energy savings achieved by eliminating these FIFOs, the DiP architecture maximizes the utilization of the processing elements (PEs). As a result, it outperforms its weight-stationary counterparts in throughput by up to 50%. A comprehensive hardware design space exploration using commercial 22nm technology highlights the scalability advantages of DiP over the conventional approach across various dimensions, with DiP improving energy efficiency per area by up to 2.02x. Furthermore, DiP is evaluated on transformer workloads from widely used models and consistently outperforms TPU-like architectures, achieving energy improvements of up to 1.81x and latency improvements of up to 1.49x. At a 64x64 size with 4096 PEs, DiP achieves a peak performance of 8.2 TOPS with an energy efficiency of 9.55 TOPS/W.
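The closing figures can be cross-checked with a little arithmetic. The sketch below derives the clock frequency and power that the quoted numbers would imply; neither value is stated in the abstract, and both rest on the labeled assumptions that each PE completes one multiply-accumulate per cycle (counted as 2 operations) and that peak throughput and efficiency refer to the same operating point.

```python
# Figures quoted in the abstract for the 64x64 configuration.
ROWS, COLS = 64, 64
PEAK_TOPS = 8.2               # tera-operations per second
EFFICIENCY_TOPS_PER_W = 9.55  # tera-operations per second per watt

# Assumption (not in the abstract): one multiply-accumulate per PE per
# cycle, counted as 2 operations.
OPS_PER_PE_PER_CYCLE = 2

num_pes = ROWS * COLS
ops_per_cycle = num_pes * OPS_PER_PE_PER_CYCLE
implied_clock_ghz = PEAK_TOPS * 1e12 / ops_per_cycle / 1e9

# Assumption: peak TOPS and TOPS/W describe the same operating point.
implied_power_w = PEAK_TOPS / EFFICIENCY_TOPS_PER_W

print(f"PEs: {num_pes}")
print(f"implied clock: {implied_clock_ghz:.2f} GHz")
print(f"implied power: {implied_power_w:.2f} W")
```

Under these assumptions the numbers are mutually consistent with a clock of roughly 1 GHz and a power draw below 1 W; a different operation-counting convention would shift the implied clock accordingly.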
这些数据密集型模型对现有计算架构提出了显著的性能要求。脉动阵列是商业AI计算平台(如Google TPU)采用的空间架构,因为其数据复用方式具有较高的能效。然而,这些空间架构由于需要使用先进先出(FIFO)缓冲区进行输入和输出同步,在吞吐量和能效方面受到损失。本文提出了一种新的可扩展脉动阵列架构,其特点是采用对角输入和置换权重固定(DiP)数据流,以加速矩阵乘法。所提出的架构消除了现有权重固定脉动阵列所需的同步FIFO。除了通过消除这些FIFO节省面积、功耗和能耗之外,DiP架构还最大限度地提高了计算资源(PE)的利用率。因此,其吞吐量比权重固定的同类架构最多高出50%。
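As a rough sanity check on the peak figures quoted in the DiP abstract, the sketch below works backwards from 4096 PEs and 8.2 TOPS. It assumes that each PE completes one multiply-accumulate per cycle and that one MAC counts as two operations. The abstract states neither the clock frequency nor the power draw, so both printed values are inferred under those assumptions, not reported results.

```python
# Back-of-the-envelope check of the DiP peak figures quoted in the abstract.
# Assumptions (NOT stated in the abstract): one MAC per PE per cycle,
# and one MAC counted as 2 operations (multiply + add).

ROWS, COLS = 64, 64              # array size, from the abstract
PEAK_TOPS = 8.2                  # peak performance, from the abstract
EFFICIENCY_TOPS_PER_W = 9.55     # energy efficiency, from the abstract
OPS_PER_MAC = 2                  # assumption

num_pes = ROWS * COLS
# Clock needed for num_pes MACs per cycle to reach the quoted peak.
implied_clock_hz = PEAK_TOPS * 1e12 / (num_pes * OPS_PER_MAC)
# Power implied by dividing peak throughput by the quoted efficiency.
implied_power_w = PEAK_TOPS / EFFICIENCY_TOPS_PER_W

print(f"PEs: {num_pes}")
print(f"implied clock: {implied_clock_hz / 1e9:.2f} GHz")
print(f"implied power at peak: {implied_power_w:.2f} W")
```

Under these assumptions the quoted peak is consistent with a clock of roughly 1 GHz and a power draw somewhat below 1 W.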
Article 4
Title@2025-07-10 (4): Accelerating Transposed Convolutions on FPGA-based Edge Devices
Title: Accelerating Transposed Convolutions on FPGA-based Edge Devices | Beschleunigung transponierter Konvolutionen auf FPGA-basierten Edge-Geräten | 加速基于 FPGA 的边缘设备的转换变速 2507.07683v1 |
Authors (2): Jude Haris, José Cano
Transposed Convolutions (TCONV) enable the up-scaling mechanism within generative Artificial Intelligence (AI) models. However, the predominant Input-Oriented Mapping (IOM) method for implementing TCONV has complex output mapping, overlapping sums, and ineffectual computations. These inefficiencies further exacerbate the performance bottleneck of TCONV and generative models on resource-constrained edge devices. To address this problem, in this paper we propose MM2IM, a hardware-software co-designed accelerator that combines Matrix Multiplication (MatMul) with col2IM to process TCONV layers on resource-constrained edge devices efficiently. Using the SECDA-TFLite design toolkit, we implement MM2IM and evaluate its performance across 261 TCONV problem configurations, achieving an average speedup of 1.9x against a dual-thread ARM Neon optimized CPU baseline. We then evaluate the performance of MM2IM on a range of TCONV layers from well-known generative models achieving up to 4.2x speedup, and compare it against similar resource-constrained TCONV accelerators, outperforming them by at least 2x GOPs/DSP. Finally, we evaluate MM2IM on the DCGAN and pix2pix GAN models, achieving up to 3x speedup and 2.4x energy reduction against the CPU baseline.
为了解决这个问题,我们在本文件中提议了MM2IM, 一个硬件软件共同设计的加速器,将MM2IM与COL2IM组合在一起,以高效地处理控制资源边缘装置上的TCONV层。我们使用SECDA-TFLite设计工具包,执行MM2IM,并评估其在261 TCONV问题配置中的性能表现,实现1.9x的平均速度,与双轨的ARM Neon优化的CPU基准相对应。然后我们从众所周知的Com2SUI模型到达到4.2x速度的TRIM, 将其与类似的GMMSM2 基准模型相比较。
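The MatMul-plus-col2im formulation named in the MM2IM abstract can be sketched in software as follows. This is a minimal pure-Python illustration of the general technique, not the accelerator's actual design: the tensor layouts, the function names, and the no-padding assumption are all choices made here for illustration. A direct output-oriented reference is included only to cross-check the result.

```python
# Minimal pure-Python sketch: transposed convolution as MatMul + col2im.
# Layouts, names, and the no-padding assumption are illustrative only.

def matmul(a, b):
    """Plain (M x K) @ (K x N) matrix product on nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def tconv_mm2im(x, w, stride):
    """x: [H][W][Cin], w: [Cin][K][K][Cout] -> out: [Ho][Wo][Cout]."""
    H, W, Cin = len(x), len(x[0]), len(x[0][0])
    K, Cout = len(w[0]), len(w[0][0][0])
    Ho, Wo = (H - 1) * stride + K, (W - 1) * stride + K

    # Step 1 (MatMul): input as (H*W, Cin), weights as (Cin, K*K*Cout).
    x_mat = [x[iy][ix] for iy in range(H) for ix in range(W)]
    w_mat = [[w[ci][ky][kx][co] for ky in range(K) for kx in range(K)
              for co in range(Cout)] for ci in range(Cin)]
    cols = matmul(x_mat, w_mat)  # shape (H*W, K*K*Cout)

    # Step 2 (col2im): scatter-add each product to its output pixel.
    out = [[[0.0] * Cout for _ in range(Wo)] for _ in range(Ho)]
    for p, row in enumerate(cols):
        iy, ix = divmod(p, W)
        for q, value in enumerate(row):
            kpos, co = divmod(q, Cout)
            ky, kx = divmod(kpos, K)
            out[iy * stride + ky][ix * stride + kx][co] += value
    return out

def tconv_reference(x, w, stride):
    """Direct output-oriented definition, used only as a cross-check."""
    H, W, Cin = len(x), len(x[0]), len(x[0][0])
    K, Cout = len(w[0]), len(w[0][0][0])
    Ho, Wo = (H - 1) * stride + K, (W - 1) * stride + K
    out = [[[0.0] * Cout for _ in range(Wo)] for _ in range(Ho)]
    for oy in range(Ho):
        for ox in range(Wo):
            for ky in range(K):
                for kx in range(K):
                    iy, ry = divmod(oy - ky, stride)
                    ix, rx = divmod(ox - kx, stride)
                    # Skip kernel taps that do not land on an input pixel.
                    if ry or rx or not (0 <= iy < H and 0 <= ix < W):
                        continue
                    for ci in range(Cin):
                        for co in range(Cout):
                            out[oy][ox][co] += x[iy][ix][ci] * w[ci][ky][kx][co]
    return out

# Tiny example: 2x2 input with 2 channels, 3x3 kernel, 2 output channels.
x = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
w = [[[[float(ci + ky + kx + co) for co in range(2)]
       for kx in range(3)] for ky in range(3)] for ci in range(2)]
out = tconv_mm2im(x, w, stride=2)
print(len(out), len(out[0]), len(out[0][0]))  # output height, width, channels
```

The scatter-add in step 2 is where overlapping sums arise: with a kernel larger than the stride, several input pixels contribute to the same output pixel.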
Article 5
Title@2025-07-10 (4): Multi-agent Reinforcement Learning-based In-place Scaling Engine for Edge-cloud Systems
Title: Multi-agent Reinforcement Learning-based In-place Scaling Engine for Edge-cloud Systems | Multi-Agenten-Verstärkung Learning-based In-place Scaling Engine für Edge-Cloud-Systeme | 边缘球状系统内地增强引擎 2507.07671v1 |
Authors (7): Jovan Prodanov, Blaž Bertalanič, Carolina Fortuna, Shih-Kai Chou, Matjaž Branko Jurič, Ramon Sanchez-Iborra, Jernej Hribar
Modern edge-cloud systems face challenges in efficiently scaling resources to handle dynamic and unpredictable workloads. Traditional scaling approaches typically rely on static thresholds and predefined rules, which are often inadequate for optimizing resource utilization and maintaining performance in distributed and dynamic environments. This inefficiency hinders the adaptability and performance required in edge-cloud infrastructures, which can only be achieved through the newly proposed in-place scaling. To address this problem, we propose the Multi-Agent Reinforcement Learning-based In-place Scaling Engine (MARLISE) that enables seamless, dynamic, reactive control with in-place resource scaling. We develop our solution using two Deep Reinforcement Learning algorithms: Deep Q-Network (DQN) and Proximal Policy Optimization (PPO). We analyze each version of the proposed MARLISE solution using dynamic workloads, demonstrating their ability to keep microservice response times low and to scale. Our results show that MARLISE-based approaches outperform a heuristic method in managing resource elasticity while maintaining microservice response times and achieving higher resource efficiency.
现代边缘系统在高效率地增加资源以应对动态和不可预测的工作量方面面临挑战。传统的规模化方法通常依赖静态阈值和预先确定的规则,而静态阈值和预设规则往往不足以优化资源利用和维持分布式和动态环境中的业绩。这种效率低下妨碍了边缘型基础设施所需的适应性和绩效,而这种功能化基础设施只能通过新提议的地方规模化来实现。为解决这一问题,我们建议采用基于多代理强化学习的基于内部配置的基于多职位的辅助型引擎(MARLISE),该引擎能够实现无缝、动态和反应式的控制,同时利用内部资源规模化的资源规模化。我们利用两种深度强化学习算法:深Q网络(DQN)和优化政策优化(PPPO)来开发我们的解决方案。我们利用动态工作量来分析拟议中每个版本的MARLISE解决方案,展示其确保微观服务反应时间低和可扩展性的能力。我们的成果表明,基于多代理系统的方法在管理资源弹性的同时,在保持微观服务反应时间和实现更高资源效率方面超越了正规的超上的方法。
Article 6
Title@2025-07-10 (4): Stress Monitoring in Healthcare: An Ensemble Machine Learning Framework Using Wearable Sensor Data
Title: Stress Monitoring in Healthcare: An Ensemble Machine Learning Framework Using Wearable Sensor Data | Stressüberwachung im Gesundheitswesen: Ein Ensemble Machine Learning Framework mit tragbaren Sensordaten | 保健中压力监测:使用穿戴感感应数据的综合机械学习框架 2507.07589v1 |
Authors (3): Arpana Sinhal, Anay Sinhal, Amit Sinhal
Healthcare professionals, particularly nurses, face elevated occupational stress, a concern amplified during the COVID-19 pandemic. While wearable sensors offer promising avenues for real-time stress monitoring, existing studies often lack comprehensive datasets and robust analytical frameworks. This study addresses these gaps by introducing a multimodal dataset comprising physiological signals, electrodermal activity, heart rate and skin temperature. A systematic literature review identified limitations in prior stress-detection methodologies, particularly in handling class imbalance and optimizing model generalizability. To overcome these challenges, the dataset underwent preprocessing with the Synthetic Minority Oversampling Technique (SMOTE), ensuring balanced representation of stress states. Advanced machine learning models including Random Forest, XGBoost and a Multi-Layer Perceptron (MLP) were evaluated and combined into a Stacking Classifier to leverage their collective predictive strengths. By using a publicly accessible dataset and a reproducible analytical pipeline, this work advances the development of deployable stress-monitoring systems, offering practical implications for safeguarding healthcare workers’ mental health. Future research directions include expanding demographic diversity and exploring edge-computing implementations for low-latency stress alerts.
在COVID-19大流行期间,保健专业人员,特别是护士,面临着职业压力升高的问题,这是人们更加关注的一个问题。虽然穿戴传感器为实时压力监测提供了有希望的渠道,但现有的研究往往缺乏全面的数据集和强有力的分析框架。这项研究通过引入由生理信号、电极活动、心率和皮肤温度组成的多式联运数据集,弥补了这些差距。系统文献审查查明了先前的压力检测方法的局限性,特别是在处理阶级不平衡和优化模型一般性方面。为了克服这些挑战,数据集与合成少数群体抽样技术(SMOTE)一起进行了预处理,确保压力状态的均衡代表。包括随机森林、XGBoust和多激光 Perceptron(MLP)在内的先进机器学习模型得到了评估,并合并成一个标准分类,以利用其集体预测优势。通过使用公众可获取的数据集和可复制的分析管道,这项工作推动了可部署的压力监测系统的开发,为保护保健工作者的心理健康提供了实际影响。未来的研究方向包括扩大人口多样性和探索低潜压压力警报的边缘执行。
Article 7
Title@2025-07-10 (4): TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference
Title: TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference | TokenWeave: Effiziente Compute-Communication Overlap für verteilte LLM-Inferenz | TokenWeave: 有效计算分布式LLM 推理的通信重叠 2505.11329v2 |
Authors (3): Raja Gond, Nipun Kwatra, Ramachandran Ramjee
Distributed inference of large language models (LLMs) can introduce overheads of up to 20% even over GPUs connected via high-speed interconnects such as NVLink. Multiple techniques have been proposed to mitigate these overheads by decomposing computations into finer-grained tasks and overlapping communication with sub-tasks as they complete. However, fine-grained decomposition of a large computation into many smaller computations on GPUs results in overheads. Furthermore, the communication itself uses many streaming multiprocessors (SMs), adding to the overhead. We present TokenWeave to address these challenges. TokenWeave proposes a Token-Splitting technique that divides the tokens in the inference batch into two approximately equal subsets in a wave-aware manner. The communication of one subset is then overlapped with the computation of the other. In addition, TokenWeave optimizes the order of the layer normalization computation with respect to communication operations and implements a novel fused AllReduce–RMSNorm kernel that carefully leverages Multimem instruction support available on NVIDIA Hopper GPUs. These optimizations allow TokenWeave to perform communication and RMSNorm using only 2-8 SMs. Moreover, our kernel enables the memory-bound RMSNorm to be overlapped with the other batch’s computation, providing additional gains. Our evaluations demonstrate up to 1.29x speedup in latency and 1.26x higher throughput across multiple models and workloads. In several settings, TokenWeave results in better performance compared to an equivalent model with all communication removed.
大型语言模型(LLMS)的分布式推论可以引入高达20%的间接费用,甚至超过通过高速互连(如 NVLink ) 连接的 GPU 。 已经提出了多种技术, 通过将计算分解成细微重分解任务和在完成时与子任务重复通信来缓解这些间接费用。 但是, 微细分分解将大量计算分解成在 GPU 上的许多较小计算导致间接费用。 此外, 通信本身使用许多流式多处理器( SMs) , 增加管理费用。 我们展示了托肯韦韦( Tokenweave) 来应对这些挑战。 TokenWeave提议一种托肯(Token- Split) 技术, 以波浪分解计算分解成两个大约相等的子集。 一个子集的通讯与其它的计算方法相重叠。 此外, TokenWeave( ) 优化了所有REW- REM- NOLKNQNQN 模式, 以便仔细地优化地对 HIM 的 OVA- 和S- hold 进行自动分析。
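The TokenWeave abstract says the batch is split into two approximately equal subsets "in a wave-aware manner" but does not give the rule. The sketch below shows one plausible reading, purely as an assumption: pick the split point nearest the midpoint that is a whole multiple of a hypothetical `tokens_per_wave` value, so that the first subset fills complete GPU waves. The function name, the parameter, and the fallback rule are all invented here for illustration.

```python
# Hypothetical sketch of a "wave-aware" two-way token split.
# tokens_per_wave and the rounding rule are assumptions, not the paper's method.

def wave_aware_split(num_tokens, tokens_per_wave):
    """Return (first, second) subset sizes that sum to num_tokens."""
    if tokens_per_wave <= 0:
        raise ValueError("tokens_per_wave must be positive")
    if num_tokens < 2:
        # Nothing to overlap: keep the whole batch in one subset.
        return num_tokens, 0
    # Multiple of tokens_per_wave closest to the midpoint of the batch.
    first = round(num_tokens / 2 / tokens_per_wave) * tokens_per_wave
    # If that would leave a subset empty, fall back to a plain halving.
    if first <= 0 or first >= num_tokens:
        first = num_tokens // 2
    return first, num_tokens - first

for n in (1000, 100, 1):
    print(n, wave_aware_split(n, tokens_per_wave=128))
```

With a split like this, communication for one subset can be issued while the other subset is still computing, which is the overlap the abstract describes.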
Article 8
Title@2025-07-10 (4): A Unified Ontology for Scalable Knowledge Graph-Driven Operational Data Analytics in High-Performance Computing Systems
Title: A Unified Ontology for Scalable Knowledge Graph-Driven Operational Data Analytics in High-Performance Computing Systems | Eine einheitliche Ontologie für skalierbare, graphgestützte Betriebsdatenanalytik in Hochleistungs-Computing-Systemen | 高性能计算系统中可缩放知识、图表驱动操作数据分析的统一本体学 2507.06107v2 |
Authors (2): Junaid Ahmed Khan, Andrea Bartolini
Modern high-performance computing (HPC) systems generate massive volumes of heterogeneous telemetry data from millions of sensors monitoring compute, memory, power, cooling, and storage subsystems. As HPC infrastructures scale to support increasingly complex workloads, including generative AI, the need for efficient, reliable, and interoperable telemetry analysis becomes critical. Operational Data Analytics (ODA) has emerged to address these demands; however, the reliance on schema-less storage solutions limits data accessibility and semantic integration. Ontologies and knowledge graphs (KG) provide an effective way to enable efficient and expressive data querying by capturing domain semantics, but they face challenges such as significant storage overhead and the limited applicability of existing ontologies, which are often tailored to specific HPC systems only. In this paper, we present the first unified ontology for ODA in HPC systems, designed to enable semantic interoperability across heterogeneous data centers. Our ontology models telemetry data from the two largest publicly available ODA datasets, M100 (Cineca, Italy) and F-DATA (Fugaku, Japan), within a single data model. The ontology is validated through 36 competency questions reflecting real-world stakeholder requirements, and we introduce modeling optimizations that reduce knowledge graph (KG) storage overhead by up to 38.84% compared to a previous approach, with an additional 26.82% reduction depending on the desired deployment configuration. This work paves the way for scalable ODA KGs and supports not only analysis within individual systems, but also cross-system analysis across heterogeneous HPC systems.
现代高性能计算(HPC)系统从数以百万计的传感器监测计算、记忆、电力、冷却和储存子系统中产生大量异式遥测数据。随着HPC基础设施规模的扩大,支持日益复杂的工作量,包括基因化的AI – – 需要高效、可靠和互操作的遥测分析变得至关重要。运行数据分析(ODA)已经出现,以满足这些需求;然而,对无机存储解决方案的依赖限制了数据的可获取性和语义整合。核心和知识图表(KG)提供了一种有效的方法,通过捕获域名语义系统,使高效和直观的数据查询,但是,它们面临着巨大的存储管理以及仅针对特定HPC系统的有限适用性等挑战。在本文件中,我们提出了第一个用于HPC系统(ODA)的统一数据,目的是使混杂数据中心能够实现语义互操作性互操作性。我们从两个最大的可公开获得的官方发展援助数据集(M100(Cineca,意大利)和F-DATA(Fuga)系统支持了跨域语系的存储管理管理管理管理管理管理,日本)的现有在线数据分析(KG)系统,通过一个通过Slistal-listalimalalalalalalalalalalalal 定义的模型,这个模型,这个模型,这个模型,用来在一个模拟的存储模型上减少一个模型的存储能力分析方法,这个模型中,这个模型,这个模型,这个模型,这个模型里,用来减少一个模型,这个模型,这个模型,这个模型,用来分析,这个模型里程中,用来反映了一个模拟的存储能力要求。
Article 9
Title@2025-07-10 (4): Opt-GPTQ: An Optimized GPTQ Combining Sparse Attention and Quantization Techniques
Title: Opt-GPTQ: An Optimized GPTQ Combining Sparse Attention and Quantization Techniques | Opt-GPTQ: Optimierte GPTQ Kombination von Sparsen-Achtung und Quantisierungstechniken | GPTQ:最佳GPTQ,将分散关注和量化技术结合起来 2505.02351v2 |
Authors (10): Jie Kong, Junxiang Zhang, Jiheng Xu, Yalong Li, Shouhua Zhang, Jiehan Zhou, Yuhai Liu, Peng Liang, Quan Zhang, Luohan Jiang
In the field of deep learning, traditional attention mechanisms face significant challenges related to high computational complexity and large memory consumption when processing long sequence data. To address these limitations, we propose Opt-GPTQ, an optimized Gradient-based Post Training Quantization (GPTQ) combining the Grouped Query Attention (GQA) mechanism with paging memory management, optimizing the traditional Multi-Head Attention (MHA) mechanism by grouping query heads and sharing key-value vectors. Optimized GQA (Opt-GQA) effectively reduces computational complexity, minimizes memory fragmentation, and enhances memory utilization for large-scale models. Opt-GPTQ is optimized for Data Center Units (DCUs) and integrated into the vLLM model to maximize hardware efficiency. It customizes GPU kernels to further enhance attention computation by reducing memory access latency and boosting parallel computing capabilities. Opt-GQA integrates Attention with Linear Biases (ALiBi) to reduce overhead and enhance long-sequence processing. Experimental results show that Opt-GPTQ significantly reduces computation time and memory usage while improving model performance.
在深层学习领域,传统关注机制在处理长序列数据时面临着与高计算复杂性和大量记忆消耗有关的重大挑战。为解决这些局限性,我们提议Opt-GPTQ,即优化的GPTQ,即基于优化的渐进式培训后量化(GPTQQ),将GQA(GQA)机制与组合存储管理相结合,优化传统的多负责人关注(MHA)机制,方法是将查询头分组并共享关键值矢量。优化的GQA(Opt-GQA)有效地降低计算复杂性,尽量减少记忆破碎,并加强大规模模型的记忆利用。Opt-GPTQQ(Opt-GQA)优化了数据中心单位(DCUs),并将其整合到 vLLLM 模型中,以最大限度地提高硬件效率。它定制了GPUP 内核,以通过减少记忆存取时间和增强平行计算能力来进一步增加注意力的计算。Opt-GQA将关注与线-Bises(ALiBiBi)有效地减少管理并增强长期模型处理。实验性结果显示Opt-GPTQ的利用。
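Two of the mechanisms named in the Opt-GPTQ abstract are easy to illustrate in isolation: grouping query heads so that several of them share one key-value head (GQA), and the linear distance penalty used by ALiBi. The sketch below is a generic illustration of those two ideas, not the paper's kernels. The helper names are hypothetical, and the slope formula is the standard one for a power-of-two head count.

```python
# Generic sketch of GQA head grouping and ALiBi biases.
# Helper names are hypothetical; this is not the Opt-GPTQ implementation.

def kv_head_for_query_head(q_head, num_q_heads, num_kv_heads):
    """Map a query head to the key-value head shared by its group."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size

def alibi_slopes(num_heads):
    """Per-head slopes 2^(-8*i/n) for i = 1..n, with n a power of two."""
    return [2.0 ** (-8.0 * i / num_heads) for i in range(1, num_heads + 1)]

def alibi_bias(slope, query_pos, key_pos):
    """Penalty added to the attention logit; grows with token distance."""
    return -slope * (query_pos - key_pos)

# 32 query heads sharing 8 key-value heads: groups of 4.
print([kv_head_for_query_head(h, 32, 8) for h in range(8)])
# KV-cache entries shrink by the group size relative to full multi-head attention.
print("KV cache reduction factor:", 32 // 8)
print(alibi_slopes(8))
print(alibi_bias(alibi_slopes(8)[0], query_pos=10, key_pos=7))
```

Because the ALiBi bias depends only on the distance between positions, it needs no learned positional table, which is one reason it is commonly paired with long-sequence processing.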
Article 10
Title@2025-07-10 (4): KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows
Title: KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows | KVFlow: Effizientes Präfix-Caching zur Beschleunigung von LLM-basierten Multiagenten-Workflows | KVFlow: 为加速基于LLM的多重需要工作流程而高效预置缓存 2507.07400v1 |
Authors (9): Zaifeng Pan, Ajjkumar Patel, Zhengding Hu, Yipeng Shen, Yue Guan, Wan-Lu Li, Lianhui Qin, Yida Wang, Yufei Ding
Large language model (LLM) based agentic workflows have become a popular paradigm for coordinating multiple specialized agents to solve complex tasks. To improve serving efficiency, existing LLM systems employ prefix caching to reuse key-value (KV) tensors corresponding to agents’ fixed prompts, thereby avoiding redundant computation across repeated invocations. However, current systems typically evict KV caches using a Least Recently Used (LRU) policy, which fails to anticipate future agent usage and often discards KV caches shortly before their reuse. This leads to frequent cache misses and substantial recomputation or swapping overhead. We present KVFlow, a workflow-aware KV cache management framework tailored for agentic workloads. KVFlow abstracts the agent execution schedule as an Agent Step Graph and assigns each agent a steps-to-execution value that estimates its temporal proximity to future activation. These values guide a fine-grained eviction policy at the KV node level, allowing KVFlow to preserve entries likely to be reused and efficiently manage shared prefixes in tree-structured caches. Moreover, KVFlow introduces a fully overlapped KV prefetching mechanism, which proactively loads required tensors from CPU to GPU in background threads for agents scheduled in the next step, thereby avoiding cache miss stalls during generation. Compared to SGLang with hierarchical radix cache, KVFlow achieves up to 1.83x speedup for single workflows with large prompts, and up to 2.19x speedup for scenarios with many concurrent workflows.
大型语言模型( LLM) 以大型语言模式为基础的代理工作流程已成为协调多个专门代理商解决复杂任务的流行范例。 为了提高效率, 现有的 LLM 系统使用前缀缓存, 重新使用与代理商固定提示相对的键值( KV) , 从而避免重复计算。 然而, 当前系统通常使用最不常用的( LRU) 政策驱逐 KV 缓存, 这无法预测未来代理商的使用情况, 并经常在重新使用之前不久丢弃 KV 缓存 。 这导致频繁的缓存丢失和大量重置或转换管理管理管理管理。 我们展示了 KVFlow, 一个为代理工作量量定制的工作流程- World KVVV 缓存管理框架。 KVFlow 将代理商执行时间表作为代理Step 图表, 并给每个代理商分配一个步骤到执行值, 估计其与未来激活时间的距离。 这些值指导了 KVPO 节点的细化驱逐政策, 允许 KVFlow 保存可能被再利用的单流流流流和高效共享的预置速度, 。 KVlalal-lickraterateal 时间里, 时间里, 需要完全地在 SG 。
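The contrast the KVFlow abstract draws between LRU eviction and eviction guided by a steps-to-execution value can be sketched as follows. This is a simplified model under stated assumptions: the workflow is a plain dictionary of agent successors, steps-to-execution is taken to be the breadth-first distance from the currently running agents, and one cache entry per agent is evicted as a whole. The abstract does not give the exact definition, and all names here are hypothetical.

```python
# Simplified sketch of workflow-aware eviction versus LRU.
# The graph encoding and the BFS-distance definition are assumptions.
from collections import deque
import math

def steps_to_execution(graph, running):
    """BFS distance from the running agents along workflow edges."""
    dist = {agent: 0 for agent in running}
    queue = deque(running)
    while queue:
        agent = queue.popleft()
        for nxt in graph.get(agent, []):
            if nxt not in dist:
                dist[nxt] = dist[agent] + 1
                queue.append(nxt)
    return dist

def pick_victim_workflow_aware(cached, graph, running):
    """Evict the cached agent that will run furthest in the future."""
    dist = steps_to_execution(graph, running)
    # Agents unreachable from the running set are treated as never reused.
    return max(cached, key=lambda agent: dist.get(agent, math.inf))

def pick_victim_lru(cached, last_used):
    """Evict the cached agent with the oldest last-use timestamp."""
    return min(cached, key=lambda agent: last_used[agent])

# A three-agent loop: planner -> coder -> tester -> planner.
graph = {"planner": ["coder"], "coder": ["tester"], "tester": ["planner"]}
running = ["planner"]
cached = ["coder", "tester"]
last_used = {"coder": 1, "tester": 2}  # tester finished most recently

print("LRU evicts:", pick_victim_lru(cached, last_used))
print("workflow-aware evicts:",
      pick_victim_workflow_aware(cached, graph, running))
```

In this loop LRU discards the cache of the agent that runs next, while the workflow-aware rule discards the one that runs last, which is the failure mode of LRU that the abstract describes.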
Article 11
Title@2025-07-10 (4): Future Resource Bank for ISAC: Achieving Fast and Stable Win-Win Matching for Both Individuals and Coalitions
Title: Future Resource Bank for ISAC: Achieving Fast and Stable Win-Win Matching for Both Individuals and Coalitions | Future Resource Bank for ISAC: Schnelles und stabiles Win-Win-Matching für Einzelpersonen und Koalitionen | ISAC未来资源银行:实现个人和联盟的快速和稳定的双赢比对 2502.08118v5 |
Authors (6): Houyi Qi, Minghui Liwang, Seyyedali Hosseinalipour, Liqun Fu, Sai Zou, Wei Ni
Future wireless networks must support emerging applications where environmental awareness is as critical as data transmission. Integrated Sensing and Communication (ISAC) enables this vision by allowing base stations (BSs) to allocate bandwidth and power to mobile users (MUs) for communications and cooperative sensing. However, this resource allocation is highly challenging due to: (i) dynamic resource demands from MUs and resource supply from BSs, and (ii) the selfishness of MUs and BSs. To address these challenges, existing solutions rely on either real-time (online) resource trading, which incurs high overhead and failures, or static long-term (offline) resource contracts, which lack flexibility. To overcome these limitations, we propose the Future Resource Bank for ISAC, a hybrid trading framework that integrates offline and online resource allocation through a level-wise client model, where MUs and their coalitions negotiate with BSs. We introduce two mechanisms: (i) Role-Friendly Win-Win Matching (offRFW$^2$M), leveraging overbooking to establish risk-aware, stable contracts, and (ii) Effective Backup Win-Win Matching (onEBW$^2$M), which dynamically reallocates unmet demand and surplus supply. We theoretically prove stability, individual rationality, and weak Pareto optimality of these mechanisms. Through simulations, we show that our framework improves social welfare, latency, and energy efficiency compared to existing methods.
未来无线网络必须支持环境意识与数据传输一样至关重要的新兴应用; 综合遥感和通信(ISAC)允许基地站为通信和合作遥感向移动用户分配带宽和电力,从而使这一愿景得以实现; 然而,这一资源分配非常具有挑战性,因为:(一) 来自移动站的动态资源需求以及来自移动站的资源供应;(二) 移动站和移动站的自私自利。 为了应对这些挑战,现有解决方案依赖于实时(在线)资源交易,这种交易导致高管理费和失败,或静态(脱线)长期资源合同,缺乏灵活性。为克服这些限制,我们提议建立一个未来信息站资源银行,这是一个混合贸易框架,通过一个水平明智的客户模式,将离线和在线资源分配结合起来。 我们引入了两个机制:(一) 作用友好的Win-Win匹配(off RFW$%2M),利用过度的账面来建立风险意识、稳定合同、稳定的长期(offline)资源合同,以及(ii) 有效后期Sing Win-Win-Wimeal-Sildal-Sildalviolview Sility Supment Silvals-Wild),我们现有的能源供应和稳定性、不断提升机制。
Article 12
Title@2025-07-10 (4): Constraint Programming Models For Serial Batch Scheduling With Minimum Batch Size
Title: Constraint Programming Models For Serial Batch Scheduling With Minimum Batch Size | Einschränkungen Programmiermodelle für serielle Batch-Scheichung mit minimaler Batch-Größe | 具有最小批量大小的连续批次排程限制编程模型 2504.08793v2 |
Authors (2): Jorge A. Huertas, Pascal Van Hentenryck
In serial batch (s-batch) scheduling, jobs are grouped in batches and processed sequentially within their batch. This paper considers multiple parallel machines, nonidentical job weights and release times, and sequence-dependent setup times between batches of different families. Although s-batch has been widely studied in the literature, very few papers have taken into account a minimum batch size, typical in practical settings such as semiconductor manufacturing and the metal industry. The problem with this minimum batch size requirement has been mostly tackled with dynamic programming and meta-heuristics, and no article has ever used constraint programming (CP) to do so. This paper fills this gap by proposing three CP models for s-batching with minimum batch size: (i) an Interval Assignment model that computes and bounds the size of the batches using the presence literals of interval variables of the jobs; (ii) a Global model that exclusively uses global constraints that track the size of the batches over time; and (iii) a Hybrid model that combines the benefits of the extra global constraints with the efficiency of the sum-of-presences constraints to ensure the minimum batch sizes. The computational experiments on standard cases compare the three CP models with two existing mixed-integer programming (MIP) models from the literature. The results demonstrate the versatility of the proposed CP models to handle multiple variations of s-batching, and their ability to produce, in large instances, better solutions than the MIP models faster.
在序列批量(批量)列表中,工作按批次分组,并在批次内按批次处理。 本文考虑了多个平行机器、 不同工作重量和发布时间不完全相同, 以及不同家庭批次之间取决于顺序的设置时间。 虽然文献中已经对批量进行了广泛的研究, 但很少有论文考虑到最低批量规模, 在半导体制造和金属工业等实际环境下典型的批量规模。 这种最低批量规模要求的问题大多通过动态编程和元过量处理, 也没有文章使用过强制编程( CP) 。 本文通过提议, 三个批次间加载最小尺寸的批次模式填补了这一差距:(i) 一种批次的批次模式, 用半导体和金属工业的间隔变量的亮度来计算和约束批次的大小。 (ii) 一种纯度 {全球 提议模式, 专门使用跟踪批次规模的全球制约, 并且从未使用过强制编程程序程序( CP) 。 (iii) 以及一种纹/ Cen 级模型 来填补这一差距, 用最小的模型来填补这一差距 , , 将效率限制与两种模型的缩缩缩缩缩数 合并模型结合起来 结合, 。
Article 13
Title@2025-07-10 (4): Machine Learning-driven Multiscale MD Workflows: The Mini-MuMMI Experience
Title: Machine Learning-driven Multiscale MD Workflows: The Mini-MuMMI Experience | Mehrstufige MD-Workflows mit maschinellem Lernen: Die Mini-MuMMI-Erfahrung | 由学习驱动的机械式学习驱动的多规模MD工作流程:微型MIMI经验 2507.07352v1 |
Authors (11): Loïc Pottier, Konstantia Georgouli, Timothy S. Carpenter, Fikret Aydin, Jeremy O. B. Tempkin, Dwight V. Nissley, Frederick H. Streitz, Thomas R. W. Scogland, Peer-Timo Bremer, Felice C. Lightstone, Helgi I. Ingólfsson
Computational models have become one of the prevalent methods to model complex phenomena. To accurately model complex interactions, such as detailed biomolecular interactions, scientists often rely on multiscale models composed of several internal models operating at different scales, ranging from microscopic to macroscopic length and time scales. Bridging the gap between different time and length scales has historically been challenging, but the advent of newer machine learning (ML) approaches has shown promise for tackling that task. Multiscale models require massive amounts of computational power and a powerful workflow management system. Orchestrating ML-driven multiscale studies on parallel systems with thousands of nodes is challenging: the workflow must schedule, allocate and control thousands of simulations operating at different scales. Here, we discuss the massively parallel Multiscale Machine-Learned Modeling Infrastructure (MuMMI), a multiscale workflow management infrastructure that can orchestrate thousands of molecular dynamics (MD) simulations operating at different timescales, spanning from millisecond to nanosecond. More specifically, we introduce a novel version of MuMMI called “mini-MuMMI”. Mini-MuMMI is a curated version of MuMMI designed to run on modest HPC systems or even laptops, whereas MuMMI requires larger HPC systems. We demonstrate the utility of mini-MuMMI by exploring RAS-RAF membrane interactions and discuss the different challenges behind the generalization of multiscale workflows and how mini-MuMMI can be leveraged to target a broader range of applications outside of MD and RAS-RAF interactions.
精确地模拟复杂的相互作用,例如详细的生物分子相互作用,科学家往往依赖由从微观到宏观的长度和时间尺度等不同尺度运行的若干内部模型组成的多尺度模型。缩小不同时间和长度尺度之间的差距历来具有挑战性,但新机器学习(ML)方法的出现显示了应对这项任务的希望。多规模模型需要大量的计算力和强大的工作流程管理系统。用数千个节点对平行系统进行由ML驱动的多尺度研究具有挑战性,工作流程必须安排、分配和控制不同尺度运行的数千个模拟。在这里,我们讨论了大规模平行的多尺度机器模拟基础设施(MIMMI),一个多规模的工作流程管理基础设施,可以在不同的时间尺度上协调数千种分子动态模拟,从毫秒到毫秒不等。更具体地说,我们引入了名为“MIMMI-MI”的新型多尺度研究。MIMI的小型和MIMIMIMIMM系统需要更大规模地展示MIMIMIMA系统,而MIMI的小型和小型MIMIML系统则需要我们小规模的小型版本。
Article 14
Title@2025-07-09 (3): Compute Can’t Handle the Truth: Why Communication Tax Prioritizes Memory and Interconnects in Modern AI Infrastructure
Title: Compute Can’t Handle the Truth: Why Communication Tax Prioritizes Memory and Interconnects in Modern AI Infrastructure | Berechnen kann nicht mit der Wahrheit umgehen: Warum Kommunikationssteuer das Gedächtnis und die Verbindungen in der modernen KI-Infrastruktur priorisiert | 计算无法处理真相:为什么通讯税在现代AI基础设施中将记忆和相互联系放在优先地位? 2507.07223v1 |
Authors (1): Myoungsoo Jung
Modern AI workloads such as large language models (LLMs) and retrieval-augmented generation (RAG) impose severe demands on memory, communication bandwidth, and resource flexibility. Traditional GPU-centric architectures struggle to scale due to growing inter-GPU communication overheads. This report introduces key AI concepts and explains how Transformers revolutionized data representation in LLMs. We analyze large-scale AI hardware and data center designs, identifying scalability bottlenecks in hierarchical systems. To address these, we propose a modular data center architecture based on Compute Express Link (CXL) that enables disaggregated scaling of memory, compute, and accelerators. We further explore accelerator-optimized interconnects, collectively termed XLink (e.g., UALink, NVLink, NVLink Fusion), and introduce a hybrid CXL-over-XLink design to reduce long-distance data transfers while preserving memory coherence. We also propose a hierarchical memory model that combines local and pooled memory, and evaluate lightweight CXL implementations, HBM, and silicon photonics for efficient scaling. Our evaluations demonstrate improved scalability, throughput, and flexibility in AI infrastructure.
大型语言模型(LLMS)和检索增强的生成(RAG)等现代AI工作量,如大型语言模型(LLMS)和检索增强的生成(RAG),对记忆、通信带宽和资源灵活性提出了严重的要求。传统的GPU中心建筑由于GPU之间的通信管理费用不断增加而难以扩大规模。本报告介绍主要的AI概念,并解释变异器如何在LLMS中使数据代表发生革命。我们分析大型AI硬件和数据中心设计,找出等级系统中的可缩放瓶颈。为了解决这些问题,我们提议基于计算快递链接(CXL)的模块式数据中心结构,以便能够对记忆、计算和加速器进行分解的缩。我们进一步探索加速器-优化的互联互通-集体称为XLink(例如, ALink, NVVLink, NVLink Fulsion)- 并采用混合的 CXL-over-XLink设计,以减少长距离数据传输,同时保持记忆的一致性。我们还提议一个等级记忆模型,将本地和集合记忆结合起来,并评价轻型的CXLL(轻重 CXL)执行、HBMMM, 和硅灵活性,展示我们通过高效的升级和智能基础设施。
Article 15
Title@2025-07-09 (3): A Terminology for Scientific Workflow Systems
Title: A Terminology for Scientific Workflow Systems | Eine Terminologie für wissenschaftliche Workflow-Systeme | 科学工作流程系统术语术语 2506.07838v6 |
Authors (26): Frédéric Suter, Tainã Coleman, İlkay Altintaş, Rosa M. Badia, Bartosz Balis, Kyle Chard, Iacopo Colonnelli, Ewa Deelman, Paolo Di Tommaso, Thomas Fahringer, Carole Goble, Shantenu Jha, Daniel S. Katz, Johannes Köster, Ulf Leser, Kshitij Mehta, Hilary Oliver, J. -Luc Peterson, Giovanni Pizzi, Loïc Pottier, Raül Sirvent, Eric Suchyta, Douglas Thain, Sean R. Wilkinson, Justin M. Wozniak, Rafael Ferreira da Silva
The term scientific workflow has evolved over the last two decades to encompass a broad range of compositions of interdependent compute tasks and data movements. It has also become an umbrella term for processing in modern scientific applications. Today, many scientific applications can be considered as workflows made of multiple dependent steps, and hundreds of workflow management systems (WMSs) have been developed to manage and run these workflows. However, no turnkey solution has emerged to address the diversity of scientific processes and the infrastructure on which they are implemented. Instead, new research problems requiring the execution of scientific workflows with some novel feature often lead to the development of an entirely new WMS. A direct consequence is that many existing WMSs share some salient features, offer similar functionalities, and can manage the same categories of workflows but also have some distinct capabilities. This situation makes researchers who develop workflows face the complex question of selecting a WMS. This selection can be driven by technical considerations, to find the system that is the most appropriate for their application and for the resources available to them, or other factors such as reputation, adoption, strong community support, or long-term sustainability. To address this problem, a group of WMS developers and practitioners joined their efforts to produce a community-based terminology of WMSs. This paper summarizes their findings and introduces this new terminology to characterize WMSs. This terminology is composed of five axes: workflow characteristics, composition, orchestration, data management, and metadata capture. Each axis comprises several concepts that capture the prominent features of WMSs. Based on this terminology, this paper also presents a classification of 23 existing WMSs according to the proposed axes and terms.
过去二十年来,科学工作流程这一术语演变为包括相互依存计算任务和数据流动的广泛构成。它也成为现代科学应用中处理的总括术语。今天,许多科学应用可被视为由多个依赖步骤组成的工作流程,数百个工作流程管理系统(WMSs)已经开发出来来管理和运行这些工作流程。然而,没有出现任何统包式解决办法来解决科学流程及其实施基础设施的多样性问题。相反,需要执行具有某些新特点的科学工作流程的新研究问题往往导致形成全新的WMS。一个直接后果是,许多现有的WMS术语具有某些显著特征,提供类似的功能,可以管理相同的工作流程类别,但也有一些不同的能力。这种情况使开发工作流程的研究人员面临选择WMS的复杂问题。这种选择可以由技术因素驱动,找到最适合其应用和资源的系统,或诸如声誉、采用、强有力的社区支持或长期文件可持续性等其他因素。为了解决这个问题,WMSS的当前术语的特性是WMS的每个术语的每个核心, 将WMS的当前定义和每个核心的术语组成了WMS的系统。
Article 16
Title@2025-07-09 (3): Integrating Odeint Time Stepping into OpenFPM for Distributed and GPU Accelerated Numerical Solvers
Title: Integrating Odeint Time Stepping into OpenFPM for Distributed and GPU Accelerated Numerical Solvers | Integrieren von Odeint Time Schritt in OpenFPM für verteilte und GPU beschleunigte numerische Solver | 将Odeint 时间步骤整合到 Odeint 分布式和 GPU 加速数字解答器的 OpenFPM 中 2309.05331v2 |
Authors (5): Abhinav Singh, Landfried Kraatz, Serhii Yaskovets, Pietro Incardona, Ivo F. Sbalzarini
We present a software implementation integrating the time-integration library Odeint from Boost with the OpenFPM framework for scalable scientific computing. This enables compact and scalable codes for multi-stage, multi-step, and adaptive explicit time integration on distributed-memory parallel computers and on Graphics Processing Units (GPUs). The present implementation is based on extending OpenFPM’s metaprogramming system to Odeint data types. This makes the time-integration methods from Odeint available in a concise template-expression language for numerical simulations distributed and parallelized using OpenFPM. We benchmark the present software for exponential and sigmoidal dynamics and present application examples to the 3D Gray-Scott reaction-diffusion problem and the “dam break” problem from fluid mechanics. We find a strong-scaling efficiency of 80% on up to 512 CPU cores and a five-fold speedup on a single GPU.
我们推出一个将时间整合库 Odeint 和可缩放科学计算 Odeint 整合为可缩放的 Odest 框架的软件实施。 这为多阶段、多步骤和适应性明确的时间整合提供了压缩和可缩放代码,用于分布式模拟计算机和图形处理器( GPUs) 。 目前实施的基础是将 OpenFPM 的元程序系统扩展至 Odeint 数据类型。 这样Odeint 的时整合方法可以用简明的模板表达语言提供,用于使用 OpenFPM 进行数字模拟的分布和平行。 我们为当前软件设定指数和模拟动态基准,并将应用示例用于3D Gray-Scott 反射扩散问题和液力机械的“ 达姆分解” 问题。 我们发现,一个单一的 GPUP 将效率大幅提升到 80%, 高达 512 CPU 核心, 并有5 倍的加速 。
Article 17
Title@2025-07-09 (3): Accelerated Spatio-Temporal Bayesian Modeling for Multivariate Gaussian Processes
Title: Accelerated Spatio-Temporal Bayesian Modeling for Multivariate Gaussian Processes | Beschleunigte Spatio-Temporale Bayesische Modellierung für multivariate Gaußische Prozesse | 加速多变量高斯进程SPatio-Te时海湾模型模型 2507.06938v1 |
Authors (8): Lisa Gaedke-Merzhäuser, Vincent Maillou, Fernando Rodriguez Avellaneda, Olaf Schenk, Mathieu Luisier, Paula Moraga, Alexandros Nikolaos Ziogas, Håvard Rue
Multivariate Gaussian processes (GPs) offer a powerful probabilistic framework to represent complex interdependent phenomena. They pose, however, significant computational challenges in high-dimensional settings, which frequently arise in spatial-temporal applications. We present DALIA, a highly scalable framework for performing Bayesian inference tasks on spatio-temporal multivariate GPs, based on the methodology of integrated nested Laplace approximations. Our approach relies on a sparse inverse covariance matrix formulation of the GP, puts forward a GPU-accelerated block-dense approach, and introduces a hierarchical, triple-layer, distributed memory parallel scheme. We showcase weak scaling performance surpassing the state-of-the-art by two orders of magnitude on a model whose parameter space is 8x larger and measure strong scaling speedups of three orders of magnitude when running on 496 GH200 superchips on the Alps supercomputer. Applying DALIA to air pollution data from northern Italy over 48 days, we showcase refined spatial resolutions over the aggregated pollutant measurements.
多变量高斯进程(GPs)为代表复杂的相互依存现象提供了一个强大的概率框架。但是,在高维环境中,这些现象在空间时空应用中经常出现,在高维环境中构成了重大的计算挑战。我们展示了DALIA,这是一个在以综合嵌巢式拉普特近距离方法为基础的阵列时空多变量GP中执行巴伊西亚推论任务的高度可扩展的框架。我们的方法依赖于对GP的微弱反相异矩阵配制,提出了GPU加速的区块密度方法,并引入了等级、三层、分布式的记忆平行计划。我们展示了一个模型,其参数空间为8美元时值较大,其缩放性工作能力以两个数量级的速度超过最新水平。在Alps超级计算机运行496 GH200超级奇普时,我们用DALIA对意大利北部的空气污染数据应用了48天,我们展示了在综合污染物测量中改进的空间分辨率。
Article 18
Title@2025-07-09 (3): DICE: Data Influence Cascade in Decentralized Learning
Title: DICE: Data Influence Cascade in Decentralized Learning | DICE: Dateneinflusskaskade im dezentralisierten Lernen | DICE: 去中心化学习中的数据影响级联 2507.06931v1 |
Authors (4): Tongtian Zhu, Wenhao Li, Can Wang, Fengxiang He
Decentralized learning offers a promising approach to crowdsourcing data consumption and computational workloads across geographically distributed compute resources interconnected through peer-to-peer networks, accommodating exponentially increasing demands. However, proper incentives are still absent, which considerably discourages participation. Our vision is that a fair incentive mechanism relies on fair attribution of contributions to participating nodes, which faces non-trivial challenges because localized connections make influence “cascade” in a decentralized network. To overcome this, we design the first method to estimate Data Influence CascadE (DICE) in a decentralized environment. Theoretically, the framework derives tractable approximations of influence cascade over arbitrary neighbor hops, suggesting that the influence cascade is determined by an interplay of data, communication topology, and the curvature of the loss landscape. DICE also lays the foundations for applications including selecting suitable collaborators and identifying malicious behaviors. The project page is available at https://raiden-zhu.github.io/blog/2025/DICE/.
去中心化学习为在通过点对点网络互联的地理分布式计算资源上众包数据消耗和计算负载提供了一种有前景的方法,以满足呈指数增长的需求。然而,目前仍缺乏适当的激励机制,这大大抑制了参与积极性。我们的愿景是,公平的激励机制依赖于对参与节点贡献的公平归因,而这面临着非平凡的挑战,因为局部连接使影响在去中心化网络中产生“级联”。为克服这一问题,我们设计了首个在去中心化环境中估计数据影响级联(DICE)的方法。理论上,该框架推导出了任意邻居跳数上影响级联的可处理近似,表明影响级联由数据、通信拓扑和损失景观曲率的相互作用决定。DICE 还为选择合适的合作者和识别恶意行为等应用奠定了基础。项目页面见 https://raiden-zhu.github.io/blog/2025/DICE/。
Article 19
Title@2025-07-09 (3): Towards Enterprise-Ready Computer Using Generalist Agent
Title: Towards Enterprise-Ready Computer Using Generalist Agent | Auf dem Weg zu Enterprise-Ready Computer mit Generalist Agent | 争取利用通才代理实现企业-准备计算机 2503.01861v3 |
Authors (9): Sami Marreed, Alon Oved, Avi Yaeli, Segev Shlomov, Ido Levy, Offer Akrabi, Aviad Sela, Asaf Adi, Nir Mashkif
This paper presents our ongoing work toward developing an enterprise-ready Computer Using Generalist Agent (CUGA) system. Our research highlights the evolutionary nature of building agentic systems suitable for enterprise environments. By integrating state-of-the-art agentic AI techniques with a systematic approach to iterative evaluation, analysis, and refinement, we have achieved rapid and cost-effective performance gains, notably reaching a new state-of-the-art performance on the WebArena and AppWorld benchmarks. We detail our development roadmap, the methodology and tools that facilitated rapid learning from failures and continuous system refinement, and discuss key lessons learned and future challenges for enterprise adoption.
本文件介绍了我们目前为开发企业准备的计算机使用通用代理系统而开展的工作,我们的研究突出了适合企业环境的建筑代理系统的演变性质,通过将最先进的代理性AI技术与系统化的迭代评价、分析和完善方法相结合,我们取得了迅速和成本效益高的业绩收益,特别是在WebArena和AppWorld基准上取得了新的最新业绩,我们详细介绍了我们的发展路线图、促进从失败和持续系统完善中快速学习的方法和工具,并讨论了企业采用的关键经验教训和未来挑战。
Article 20
Title@2025-07-09 (3): New Distributed Interactive Proofs for Planarity: A Matter of Left and Right
Title: New Distributed Interactive Proofs for Planarity: A Matter of Left and Right | Neue verteilte interaktive Beweise für Planarität: Eine Angelegenheit von links und rechts | 平面性的新分布式交互证明:左与右的问题 2505.00338v3 |
Authors (2): Yuval Gil, Merav Parter
We provide new distributed interactive proofs (DIP) for planarity and related graph families. The notion of a distributed interactive proof (DIP) was introduced by Kol, Oshman, and Saxena (PODC 2018). In this setting, the verifier consists of $n$ nodes connected by a communication graph $G$. The prover is a single entity that communicates with all nodes by short messages. The goal is to verify that the graph $G$ satisfies a certain property (e.g., planarity) in a small number of rounds, and with a small communication bound, denoted as the proof size. Prior work by Naor, Parter and Yogev (SODA 2020) presented a DIP for planarity that uses three interaction rounds and a proof size of $O(\log n)$. Feuilloley et al. (PODC 2020) showed that the same can be achieved with a single interaction round and without randomization, by providing a proof labeling scheme with a proof size of $O(\log n)$. In a subsequent work, Bousquet, Feuilloley, and Pierron (OPODIS 2021) achieved the same bound for related graph families such as outerplanarity, series-parallel graphs, and graphs of treewidth at most $2$. In this work, we design new DIPs that use exponentially shorter proofs compared to the state-of-the-art bounds.
我们为平面性及相关图族提供了新的分布式交互证明(DIP)。分布式交互证明(DIP)的概念由 Kol、Oshman 和 Saxena(PODC 2018)提出。在该设定中,验证者由通过通信图 $G$ 连接的 $n$ 个节点组成。证明者是一个单一实体,通过短消息与所有节点通信。目标是在少量轮次内、以较小的通信上界(称为证明大小)验证图 $G$ 满足某种性质(例如平面性)。Naor、Parter 和 Yogev(SODA 2020)的先前工作给出了一个使用三轮交互、证明大小为 $O(\log n)$ 的平面性 DIP。Feuilloley 等人(PODC 2020)表明,通过提供证明大小为 $O(\log n)$ 的证明标记方案,只需一轮交互且无需随机化即可达到同样的结果。在后续工作中,Bousquet、Feuilloley 和 Pierron(OPODIS 2021)对外平面图、串并联图以及树宽至多为 $2$ 的图等相关图族取得了相同的界。在本工作中,我们设计了新的 DIP,其证明长度相比现有最优界呈指数级缩短。
Article 21
Title@2025-07-09 (3): Silent Failures in Stateless Systems: Rethinking Anomaly Detection for Serverless Computing
Title: Silent Failures in Stateless Systems: Rethinking Anomaly Detection for Serverless Computing | Silent Failures in Stateless Systems: Anomaly Detection für serverloses Rechnen neu denken | 无状态系统中的静默故障:重新思考无服务器计算的异常检测 2507.04969v2 |
Authors (3): Chanh Nguyen, Erik Elmroth, Monowar Bhuyan
Serverless computing has redefined cloud application deployment by abstracting infrastructure and enabling on-demand, event-driven execution, thereby enhancing developer agility and scalability. However, maintaining consistent application performance in serverless environments remains a significant challenge. The dynamic and transient nature of serverless functions makes it difficult to distinguish between benign and anomalous behavior, which in turn undermines the effectiveness of traditional anomaly detection methods. These conventional approaches, designed for stateful and long-running services, struggle in serverless settings where executions are short-lived, functions are isolated, and observability is limited. In this first comprehensive vision paper on anomaly detection for serverless systems, we systematically explore the unique challenges posed by this paradigm, including the absence of persistent state, inconsistent monitoring granularity, and the difficulty of correlating behaviors across distributed functions. We further examine a range of threats that manifest as anomalies, from classical Denial-of-Service (DoS) attacks to serverless-specific threats such as Denial-of-Wallet (DoW) and cold start amplification. Building on these observations, we articulate a research agenda for next-generation detection frameworks that address the need for context-aware, multi-source data fusion, real-time, lightweight, privacy-preserving, and edge-cloud adaptive capabilities. Through the identification of key research directions and design principles, we aim to lay the foundation for the next generation of anomaly detection in cloud-native, serverless ecosystems.
无服务器计算重新定义了云层应用的部署,抽取了基础设施,方便了需求、事件驱动的执行,从而提高了开发者的敏捷性和可缩放性。然而,在无服务器环境中保持连贯一致的应用性仍是一个重大挑战。无服务器功能的动态和短暂性使得很难区分良性和异常行为,这反过来又破坏了传统异常检测方法的效力。这些常规方法是为固定和长期服务设计的,在无服务器环境中挣扎,执行时间短、功能孤立和可观察性有限。在这份关于无服务器系统异常检测的首份全面愿景文件中,我们系统地探索了这一模式带来的独特挑战,包括缺乏持久性、不稳定性监测颗粒性以及分散功能之间相互关联的行为的难度。我们进一步审视了一系列表现为不正常性的威胁,从传统的拒绝服务器(DoS)攻击到无服务器的具体威胁,如不否认(DoW)和冷开始。在这些观察的基础上,我们为下一代无服务器的系统系统检测系统制定了一个研究议程,即下一轮的生态系统检测框架,即缺乏持久性、不连贯的监测颗粒性,以及分散的反复度的服务器的定位定位定位定位,通过数据定位的定位定位定位定位定位定位,从而满足了对数据进行数据进行翻版的定位的定位的定位的定位的定位的定位的定位定位定位的定位,从而满足了对数据进行数据进行定位的定位的定位的定位的定位。
Article 22
Title@2025-07-09 (3): iDynamics: A Novel Framework for Evaluating Microservice Scheduling Policies under Controllable Dynamics in Cloud-Edge Continuum
Title: iDynamics: A Novel Framework for Evaluating Microservice Scheduling Policies under Controllable Dynamics in Cloud-Edge Continuum | iDynamics: Ein neuartiges Framework zur Bewertung von Microservice Scheduling-Richtlinien unter kontrollierbarer Dynamik im Cloud-Edge Continuum | iDynamics:在云边连续体的可控动态下评估微服务调度策略的新框架 2503.16029v3 |
Authors (4): Ming Chen, Muhammed Tawfiqul Islam, Maria Rodriguez Read, Rajkumar Buyya
Designing and evaluating microservice scheduling policies is challenging, particularly under dynamic conditions such as complex call-graph dependencies and varying cross-node networking conditions. Moreover, deploying such systems in real-world cloud-edge environments to evaluate scheduling strategies is often impractical due to complexity, cost, and limited accessibility. This highlights the need for an emulation framework that can faithfully emulate the characteristics of the cloud-edge continuum. These characteristics include dynamic topology changes, latency-sensitive service chains, and varying networking conditions, all of which must be accurately modeled for meaningful evaluation. In this work, iDynamics addresses these challenges by providing a configurable and extensible framework that captures the essential dynamics of running microservice applications in cloud-edge environments, enabling systematic development and testing of microservice scheduling strategies. The framework comprises modular components, such as the Graph Dynamics Analyzer, Networking Dynamics Manager, and Scheduling Policy Extender. This enables fine-grained environmental control and facilitates systematic comparisons of different scheduling strategies. Extensive experiments on a real cloud-edge testbed demonstrate that iDynamics effectively captures diverse dynamic scenarios encountered in microservice deployments, offering a robust solution for designing and evaluating different policies under realistic and controllable conditions.
设计和评价微观服务时间安排政策具有挑战性,特别是在复杂的呼呼-呼-线依赖和各种交叉节点网络化条件等动态条件下,设计和评价微观服务时间安排政策具有挑战性,特别是在复杂、成本和有限无障碍性等动态条件下,特别是在复杂的呼-线依赖性和不同的交叉节点网络化条件等动态条件下;此外,在现实世界的云层环境中部署这种系统来评价时间安排战略往往不切实际,因为复杂、成本和有限的可获得性,这突出表明需要有一个能够忠实地仿效云层连续体特征特征的模范框架;这些特征包括动态地形变化、对长期敏感的服务链和不同的联网条件,所有这些条件都必须精确地模拟,以便进行有意义的评价;在这项工作中,iDynge测试台通过提供一个可配置和可扩展的框架来应对这些挑战,以捕捉在云层环境中运行微观服务应用程序的基本动态,使系统开发和测试微观服务时间安排战略。框架包括模块组成部分,如图表动态分析器、网络动态动态动态管理器和施展政策扩展器等。这可以使精细的环境控制和系统比较不同的规划战略。
Article 23
Title@2025-07-09 (3): Towards Efficient and Scalable Distributed Vector Search with RDMA
Title: Towards Efficient and Scalable Distributed Vector Search with RDMA | Auf dem Weg zu einer effizienten und skalierbaren verteilten Vektorsuche mit RDMA | 与 RDMA 一起努力实现高效和可缩放分布矢量搜索 2507.06653v1 |
Authors (8): Xiangyu Zhi, Meng Chen, Xiao Yan, Baotong Lu, Hui Li, Qianxi Zhang, Qi Chen, James Cheng
Similarity-based vector search facilitates many important applications such as search and recommendation but is limited by the memory capacity and bandwidth of a single machine due to large datasets and intensive data reads. In this paper, we present CoTra, a system that scales up vector search for distributed execution. We observe a tension between computation and communication efficiency, which is the main challenge for good scalability: handling the local vectors on each machine independently blows up computation because the pruning power of the vector index is not fully utilized, while running a global index over all machines introduces rich data dependencies and thus extensive communication. To resolve this tension, we leverage the fact that vector search is approximate in nature and robust to asynchronous execution. In particular, we run collaborative vector search over the machines with algorithm-system co-designs including clustering-based data partitioning to reduce communication, asynchronous execution to avoid communication stalls, and task push to reduce network traffic. To make collaborative search efficient, we introduce a suite of system optimizations including task scheduling, communication batching, and storage format. We evaluate CoTra on real datasets and compare it with four baselines. The results show that when using 16 machines, the query throughput of CoTra is 9.8-13.4x that of a single machine and 2.12-3.58x that of the best-performing baseline at 0.95 recall@10.
基于相似的矢量搜索有助于许多重要的应用,如搜索和建议等,但因大型数据集和密集数据阅读而使单个机器的记忆能力和带宽受到限制。本文介绍CoTra,这是一个扩大矢量搜索以进行分布式执行的系统。我们观察到计算和通信效率之间的紧张,这是在计算和通信效率方面的主要挑战,因为没有充分利用矢量指数的支流能力,而在所有机器上运行一个全球指数,带来丰富的数据依赖性,从而造成广泛的通信。为了解决这种紧张关系,我们利用一个事实,即矢量搜索在性质上近似,而且强于不同步执行。我们特别看到计算和通信效率之间的矛盾,这是对良好的可缩放能力的主要挑战,即:在每台机器上处理本地矢量矢量的矢量计算,以避免通信中断,而任务推力减少网络流量。为了提高协作搜索效率,我们对所有机器实行一套系统优化,包括任务时间安排、通信批量和存储格式。我们利用Cotra-3.4的矢量搜索,我们用算法系统对机器进行协作性搜索,然后用四级基线对比。
Article 24
Title@2025-07-09 (3): Multi-objective methods in Federated Learning: A survey and taxonomy
Title: Multi-objective methods in Federated Learning: A survey and taxonomy | Multi-objektive Methoden im Federated Learning: Eine Umfrage und Taxonomie | 联邦学习的多目标方法:调查和分类 2502.03108v2 |
Authors (3): Maria Hartmann, Grégoire Danoy, Pascal Bouvry
The Federated Learning paradigm facilitates effective distributed machine learning in settings where training data is decentralized across multiple clients. As the popularity of the strategy grows, increasingly complex real-world problems emerge, many of which require balancing conflicting demands such as fairness, utility, and resource consumption. Recent works have begun to recognise the use of a multi-objective perspective in answer to this challenge. However, this novel approach of combining federated methods with multi-objective optimisation has never been discussed in the broader context of both fields. In this work, we offer a first clear and systematic overview of the different ways the two fields can be integrated. We propose a first taxonomy on the use of multi-objective methods in connection with Federated Learning, providing a targeted survey of the state-of-the-art and proposing unambiguous labels to categorise contributions. Given the developing nature of this field, our taxonomy is designed to provide a solid basis for further research, capturing existing works while anticipating future additions. Finally, we outline open challenges and possible directions for further research.
联邦学习模式有助于在培训数据分散于多个客户的环境下有效分发机器学习。随着该战略的普及程度的提高,日益复杂的现实世界问题出现,其中许多问题需要平衡公平、效用和资源消耗等相互冲突的需求。最近的工作已开始认识到采用多目标观点来应对这一挑战。然而,在这两个领域更广泛的范围内,从未讨论过这种将联邦方法与多目标优化相结合的新颖方法。在这项工作中,我们首次明确和系统地概述了这两个领域可以融合的不同方式。我们提出了在联邦学习中采用多目标方法的第一次分类,对最新情况进行了有针对性的调查,提出了将贡献定位的明确标签。鉴于该领域的发展性质,我们的分类学旨在为进一步研究提供一个坚实的基础,在预测未来增加的内容的同时捕捉现有的作品。我们概述了在进一步研究方面公开的挑战和可能的方向。
Article 25
Title@2025-07-09 (3): M$^2$-MFP: A Multi-Scale and Multi-Level Memory Failure Prediction Framework for Reliable Cloud Infrastructure
Title: M$^2$-MFP: A Multi-Scale and Multi-Level Memory Failure Prediction Framework for Reliable Cloud Infrastructure | M$^2$-MFP: Ein Multi-Scale und Multi-Level-Memory Failure Prediction Framework für zuverlässige Cloud-Infrastruktur | M$^2$-MFP:面向可靠云基础设施的多尺度、多层级内存故障预测框架 2507.07144v1 |
Authors (7): Hongyi Xie, Min Zhou, Qiao Yu, Jialiang Yu, Zhenli Sheng, Hong Xie, Defu Lian
As cloud services become increasingly integral to modern IT infrastructure, ensuring hardware reliability is essential to sustain high-quality service. Memory failures pose a significant threat to overall system stability, making accurate failure prediction through the analysis of memory error logs (i.e., Correctable Errors) imperative. Existing memory failure prediction approaches have notable limitations: rule-based expert models suffer from limited generalizability and low recall rates, while automated feature extraction methods exhibit suboptimal performance. To address these limitations, we propose M$^2$-MFP: a Multi-scale and hierarchical memory failure prediction framework designed to enhance the reliability and availability of cloud infrastructure. M$^2$-MFP converts Correctable Errors (CEs) into multi-level binary matrix representations and introduces a Binary Spatial Feature Extractor (BSFE) to automatically extract high-order features at both DIMM-level and bit-level. Building upon the BSFE outputs, we develop a dual-path temporal modeling architecture: 1) a time-patch module that aggregates multi-level features within observation windows, and 2) a time-point module that employs interpretable rule-generation trees trained on bit-level patterns. Experiments on both benchmark datasets and real-world deployment show the superiority of M$^2$-MFP as it outperforms existing state-of-the-art methods by significant margins. Code and data are available at this repository: https://github.com/hwcloud-RAS/M2-MFP.
随着云服务日益成为现代 IT 基础设施不可或缺的组成部分,确保硬件可靠性对于维持高质量服务至关重要。内存故障对整个系统的稳定性构成重大威胁,因此必须通过分析内存错误日志(即可纠正错误)进行准确的故障预测。现有的内存故障预测方法存在明显局限:基于规则的专家模型泛化能力有限、召回率低,而自动特征提取方法的性能欠佳。为解决这些局限,我们提出 M$^2$-MFP:一个多尺度、分层的内存故障预测框架,旨在提高云基础设施的可靠性和可用性。M$^2$-MFP 将可纠正错误(CE)转换为多层级二值矩阵表示,并引入二值空间特征提取器(BSFE),在 DIMM 级和比特级自动提取高阶特征。基于 BSFE 的输出,我们开发了一种双路径时序建模架构:1)时间片(time-patch)模块,在观察窗口内聚合多层级特征;2)时间点(time-point)模块,采用在比特级模式上训练的可解释规则生成树。在基准数据集和实际部署上的实验表明,M$^2$-MFP 以显著优势超越现有最先进方法。代码和数据见:https://github.com/hwcloud-RAS/M2-MFP。
Article 26
Title@2025-07-09 (3): SlimCaching: Edge Caching of Mixture-of-Experts for Distributed Inference
Title: SlimCaching: Edge Caching of Mixture-of-Experts for Distributed Inference | SlimCaching: Kanten-Caching von Mixture-of-Experts für verteilte Inferenz | SlimCaching: 分布式推断的混合专家的边缘缓存 2507.06567v1 |
Authors (3): Qian Chen, Xianhao Chen, Kaibin Huang
Mixture-of-Experts (MoE) models improve the scalability of large language models (LLMs) by activating only a small subset of relevant experts per input. However, the sheer number of expert networks in an MoE model introduces a significant storage burden for an edge device. To address this challenge, we consider a scenario where experts are dispersed within an edge network for distributed inference. Based on the popular Top-$K$ expert selection strategy, we formulate a latency minimization problem by optimizing expert caching on edge servers under storage constraints. When $K=1$, the problem reduces to a monotone submodular maximization problem with knapsack constraints, for which we design a greedy-based algorithm with a $(1 - 1/e)$-approximation guarantee. For the general case where $K\geq1$, expert co-activation within the same MoE layer introduces non-submodularity, causing greedy methods to be ineffective. To tackle this issue, we propose a successive greedy decomposition method to decompose the original problem into a series of subproblems, with each being solved by a dynamic programming approach. Furthermore, we design an accelerated algorithm based on the max-convolution technique to obtain the approximate solution with a provable guarantee in polynomial time. Simulation results on various MoE models demonstrate that our method significantly reduces inference latency compared to existing baselines.
专家混合模型(MoE) 模型提高了大型语言模型(LLMS)的可缩缩性,它只激活了一小部分相关专家的每个投入。然而,在MOE模型中专家网络的庞大数量为边缘装置带来了巨大的存储负担。为了应对这一挑战,我们考虑一种假设,即专家分散在一个边缘网络中,以分布推导为目的的边缘网络中。根据流行的Top-K$专家选择战略,我们通过优化专家在储存限制下对边缘服务器的粘结,而最大限度地减少悬浮问题。当 $=1美元时,问题降为单一的单调子调式亚调模式问题,并带有 knappsack 限制,为此我们设计了一种基于贪婪的算法算法,用$(1- 1/ e) $- apag- adoproomm 来保证。对于同一MOE 层的专家共振动一般情况,我们引入了非亚调化方法,导致贪婪方法无效。为了解决这一问题,我们建议一种连续的贪婪解配置方法,将原始的变现模型转化为方法,在以动态方法中,通过升级的方法将我们以获取了一种基于动态的推算法的精化方法,从而获得了一种以加速的方法。
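The storage-constrained caching problem described above can be illustrated with a simple cost-benefit greedy heuristic. This is a sketch under stated assumptions, not the authors' algorithm, and it does not claim their $(1 - 1/e)$ guarantee: the objective (`cached_value`), the per-server latency savings, and all example numbers are hypothetical. Each expert is assumed to be served by the best server that caches it, which makes the objective monotone and submodular, and each server has its own storage capacity.

```python
def cached_value(placement, popularity, saving):
    """Expected latency saving when each expert is served by its best caching server."""
    total = 0.0
    for expert, prob in popularity.items():
        best = max((saving[s] for s, e in placement if e == expert), default=0.0)
        total += prob * best
    return total


def greedy_cache(popularity, size, saving, capacity):
    """Repeatedly add the (server, expert) pair with the best marginal gain per unit size."""
    placement = set()
    free = dict(capacity)  # remaining storage per server
    while True:
        base = cached_value(placement, popularity, saving)
        best_pair, best_ratio = None, 0.0
        for server in capacity:
            for expert in popularity:
                pair = (server, expert)
                if pair in placement or size[expert] > free[server]:
                    continue  # already cached there, or does not fit
                gain = cached_value(placement | {pair}, popularity, saving) - base
                ratio = gain / size[expert]
                if ratio > best_ratio:
                    best_pair, best_ratio = pair, ratio
        if best_pair is None:  # nothing left that fits and still helps
            return placement
        placement.add(best_pair)
        free[best_pair[0]] -= size[best_pair[1]]


# Hypothetical toy instance: three experts, two edge servers.
popularity = {"e0": 0.5, "e1": 0.3, "e2": 0.2}   # activation probabilities
size = {"e0": 2, "e1": 1, "e2": 1}               # storage units per expert
saving = {"near": 1.0, "far": 0.4}               # latency saved per hit, per server
capacity = {"near": 2, "far": 2}                 # storage units per server

placement = greedy_cache(popularity, size, saving, capacity)
print(sorted(placement), cached_value(placement, popularity, saving))
```

The heuristic stops as soon as no remaining pair both fits and yields a positive marginal gain, so redundant copies on a slower server are never cached.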
Article 27
Title@2025-07-09 (3): A Single Merging Suffices: Recovering Server-based Learning Performance in Decentralized Learning
Title: A Single Merging Suffices: Recovering Server-based Learning Performance in Decentralized Learning | Eine einzige Zusammenführung: Wiederherstellung serverbasierter Lernleistung im dezentralisierten Lernen | 单一合并条件:在分散学习中恢复基于服务器的学习绩效 2507.06542v1 |
Authors (5): Tongtian Zhu, Tianyu Zhang, Mingze Wang, Zhanpeng Zhou, Can Wang
Decentralized learning provides a scalable alternative to traditional parameter-server-based training, yet its performance is often hindered by limited peer-to-peer communication. In this paper, we study how communication should be scheduled over time, including determining when and how frequently devices synchronize. Our empirical results show that concentrating communication budgets in the later stages of decentralized training markedly improves global generalization. Surprisingly, we find that fully connected communication at the final step, implemented by a single global merging, is sufficient to match the performance of server-based training. We further show that low communication in decentralized learning preserves the mergeability of local models throughout training. Our theoretical contributions, which explain these phenomena, are the first to establish that the globally merged model of decentralized SGD can converge faster than centralized mini-batch SGD. Technically, we offer a novel reinterpretation of part of the discrepancy among local models, previously considered detrimental noise, as a constructive component that accelerates convergence. This work challenges the common belief that decentralized learning generalizes poorly under data heterogeneity and limited communication, while offering new insights into model merging and neural network loss landscapes.
分散化学习为传统的参数服务器培训提供了可扩缩的替代办法,但是其绩效往往受到有限的同侪通信的阻碍。在本文中,我们研究沟通应如何安排一段时间,包括确定时间和频率的同步装置。我们的实证结果表明,将通信预算集中到分散化培训的后期阶段可明显改善全球普遍化。令人惊讶的是,我们发现,在最后一步通过单一全球合并进行的完全相连的通信足以与服务器培训的绩效相匹配。我们进一步表明,分散化学习中的低通信能在整个培训中保留当地模式的textit{mergeable}。我们解释这些现象的理论贡献首先确定,分散化的 SGD的全球合并模式可以比集中化的小型组合SGD速度更快。技术上,我们新重新解读了地方模式之间差异的一部分,这些模式过去被认为是有害的噪音,是加速趋同的建设性组成部分。这项工作挑战了一种共同的信念,即分散化学习在数据繁杂性和有限通信中一般化程度很低,同时对模型合并和神经网络景观损失提出了新的见解。
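As a toy illustration of the "single global merging" idea, the sketch below trains scalar models locally with no communication at all and then averages them once at the end. This is a deliberately simplified assumption-laden example (deterministic gradient descent on hypothetical quadratic losses $f_i(w) = \frac{1}{2}(w - c_i)^2$), not the authors' experimental setup, and it does not reproduce their theoretical results.

```python
def local_training_then_merge(targets, steps=200, lr=0.1):
    """Run independent local gradient descent, then merge once by averaging."""
    models = [0.0 for _ in targets]
    for _ in range(steps):
        # Each node descends its own loss 0.5 * (w - c)^2; the gradient is (w - c).
        models = [w - lr * (w - c) for w, c in zip(models, targets)]
    merged = sum(models) / len(models)  # the single global merging step
    return models, merged


def global_loss(w, targets):
    """Average of all local losses, i.e. the objective a central server would see."""
    return sum(0.5 * (w - c) ** 2 for c in targets) / len(targets)


# Hypothetical heterogeneous local optima for three nodes.
targets = [-2.0, 0.5, 4.0]
models, merged = local_training_then_merge(targets)
print("local models:", [round(w, 4) for w in models])
print("merged model:", round(merged, 4), "global loss:", round(global_loss(merged, targets), 4))
```

In this toy setting each local model drifts to its own optimum, yet the merged model lands at the minimizer of the global loss, which is the intuition behind merging being able to recover what the local models individually miss.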
Article 28
Title@2025-07-09 (3): Designing Parallel Algorithms for Community Detection using Arachne
Title: Designing Parallel Algorithms for Community Detection using Arachne | Entwicklung von Parallelalgorithmen zur Gemeinschaftserkennung unter Verwendung von Arachne | 设计使用 Arachne 进行社区检测的并行算法 2507.06471v1 |
Authors (3): Fuhuan Li, Zhihui Du, David A. Bader
The rise of graph data in various fields calls for efficient and scalable community detection algorithms. In this paper, we present parallel implementations of two widely used algorithms: Label Propagation and Louvain, specifically designed to leverage the capabilities of Arachne which is a Python-accessible, open-source framework for large-scale graph analysis. Our implementations achieve substantial speedups over existing Python-based tools like NetworkX and igraph, which lack efficient parallelization, and are competitive with parallel frameworks such as NetworKit. Experimental results show that Arachne-based methods outperform these baselines, achieving speedups of up to 710x over NetworkX, 75x over igraph, and 12x over NetworKit. Additionally, we analyze the scalability of our implementation under varying thread counts, demonstrating how different phases contribute to overall performance gains of the parallel Louvain algorithm. Arachne, including our community detection implementation, is open-source and available at https://github.com/Bears-R-Us/arkouda-njit .
图表数据在各个领域的上升要求有效和可扩缩的社区检测算法。 在本文中,我们介绍了两种广泛使用的算法的平行实施:Label Propagation和Louvain, 专门设计这些算法是为了利用Arachne的能力,这是用于大规模图形分析的Python可访问的开放源框架。我们的实施大大加快了现有的Python工具,如网络X和igraph,这些工具缺乏有效的平行化,并且与NetworKit等平行框架具有竞争力。实验结果表明,Arachne使用的方法比这些基准要好,达到710x超过网络X,75x超过地图,12x超过NetworKit。此外,我们分析了我们在不同线索计下执行的可扩展性,表明不同的阶段如何有助于平行的Louvain算法的总体绩效收益。Arachne,包括我们的社区检测实施,是公开来源,可在https://github.com/Bears-R-arkouda-njit查阅。
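For readers unfamiliar with Label Propagation, the following is a minimal sequential sketch of the basic technique in plain Python. It is not Arachne's parallel implementation and uses none of its API; the helper names are hypothetical. To keep the run deterministic, nodes are visited in sorted order and ties are broken by taking the largest label, whereas practical implementations usually randomize both.

```python
from collections import Counter


def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) edges."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)
    return adjacency


def label_propagation(adjacency, max_iters=100):
    """Asynchronous label propagation: each node adopts its neighbors' most common label."""
    labels = {node: node for node in adjacency}  # every node starts in its own community
    for _ in range(max_iters):
        changed = False
        for node in sorted(adjacency):
            if not adjacency[node]:
                continue  # isolated nodes keep their own label
            counts = Counter(labels[nbr] for nbr in adjacency[node])
            top = max(counts.values())
            # Deterministic tie-break (an assumption of this sketch): largest label wins.
            new_label = max(lbl for lbl, cnt in counts.items() if cnt == top)
            if new_label != labels[node]:
                labels[node] = new_label
                changed = True
        if not changed:  # converged: no label changed in a full sweep
            break
    return labels


# Two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = label_propagation(build_adjacency(edges))
print(labels)
```

The per-node update depends only on neighbor labels, which is what makes the algorithm a natural candidate for parallelization across vertices.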
Article 29
Title@2025-07-08 (2): FedPhD: Federated Pruning with Hierarchical Learning of Diffusion Models
Title: FedPhD: Federated Pruning with Hierarchical Learning of Diffusion Models | FedPhD: Federated Pruning mit Hierarchical Learning of Diffusion Models | FedPhD: 基于扩散模型分层学习的联邦剪枝 2507.06449v1 |
Authors (4): Qianyu Long, Qiyuan Wang, Christos Anagnostopoulos, Daning Bi
Federated Learning (FL), as a distributed learning paradigm, trains models over distributed clients’ data. FL is particularly beneficial for distributed training of Diffusion Models (DMs), which are high-quality image generators that require diverse data. However, challenges such as high communication costs and data heterogeneity persist in training DMs, as they do in training Transformers and Convolutional Neural Networks. Limited research has addressed these issues in FL environments. To address this gap and these challenges, we introduce a novel approach, FedPhD, designed to efficiently train DMs in FL environments. FedPhD leverages Hierarchical FL with a homogeneity-aware model aggregation and selection policy to tackle data heterogeneity while reducing communication costs. The distributed structured pruning of FedPhD enhances computational efficiency and reduces model storage requirements on clients. Our experiments across multiple datasets demonstrate that FedPhD achieves high model performance in terms of Fréchet Inception Distance (FID) scores while reducing communication costs by up to 88%. FedPhD outperforms baseline methods, achieving at least a 34% improvement in FID while utilizing only 56% of the total computation and communication resources.
联邦学习联合会(FL),作为一个分布式学习模式,对分布式客户数据进行模型培训。FL特别有利于传播传播模型(DMs)的培训,这些模型是高质量的图像生成器,需要多种数据,但是,在培训DMs(类似于培训变异器和进化神经网络)时,仍然存在着通信成本高和数据差异性的数据差异等挑战。有限研究在FL环境中处理了这些问题。为了解决这一差距和挑战,我们采用了一种新颖的方法FedPhD,目的是在FL环境中有效培训DMs。 FedPhD利用具有同质认知度的模型集成和选择政策,解决数据异质性的问题,同时降低通信成本。FedPhD的分布式调整提高了计算效率,减少了客户对模式储存的要求。我们通过多种数据集进行的实验表明,FedPhD在Fr'echetch Inpepion距离(FID)分数方面实现了高示范性业绩,同时将通信成本降低到88美元。FedPhD超出基准,在使用最少的35的通信资源中达到356美元。
Article 30
Title@2025-07-08 (2): Fine-tuning Multimodal Transformers on Edge: A Parallel Split Learning Approach
Title: Fine-tuning Multimodal Transformers on Edge: A Parallel Split Learning Approach | Feinabstimmung multimodaler Transformer am Rand: Ein paralleler Split-Lernansatz | 边缘的微调多式变形器:平行分割学习方法 2502.06355v3 |
Authors (3): Timo Fudala, Vasileios Tsouvalas, Nirvana Meratnia
Multimodal transformers integrate diverse data types like images, audio, and text, advancing tasks such as audio-visual understanding and image-text retrieval; yet their high parameterization limits deployment on resource-constrained edge devices. Split Learning (SL), which partitions models at a designated cut-layer to offload compute-intensive operations to the server, offers a promising approach for distributed training of multimodal transformers, though its application remains underexplored. We present MPSL, a parallel SL approach for computationally efficient fine-tuning of multimodal transformers in a distributed manner, while eliminating label sharing, client synchronization, and per-client sub-model management. MPSL employs lightweight client-side tokenizers and a unified modality-agnostic encoder, allowing flexible adaptation to task-specific needs. Our evaluation across 7 multimodal datasets demonstrates that MPSL matches or outperforms Federated Learning, reduces client-side computations by 250x, and achieves superior scalability in communication cost with model growth. Through extensive analysis, we highlight task suitability, trade-offs, and scenarios where MPSL excels, inspiring further exploration.
多式变压器整合了图像、音频和文字等多种数据类型,推进了视听理解和图像文本检索等任务;但高参数化限制了资源限制边缘装置的部署。 Split Learning(SL)在指定的切割机上分割模型,将计算密集型操作卸载到服务器上,为分配多式变压器的培训提供了一个有希望的方法,尽管其应用仍然未得到充分探讨。我们介绍了MPSL(一种平行的SL)方法,即以分布方式计算高效微调多式变压器,同时消除标签共享、客户同步和每个客户的子模型管理。 MPSL(MPS)使用轻量化客户端代号以及统一模式-通识编码器,允许灵活适应特定任务的需求。 我们对7个多式数据集的评估表明,MPSL(M)匹配或优于联邦学习,将客户端计算减少250x,并在通信成本与模型增长之间实现更佳的可缩缩缩。我们通过广泛分析,强调了任务适合性、交易和情景,从而鼓励进一步探索。
Article 31
Title@2025-07-08 (2): Ampere: Communication-Efficient and High-Accuracy Split Federated Learning
Title: Ampere: Communication-Efficient and High-Accuracy Split Federated Learning | Ampere: Kommunikationseffizientes und hochgenaues Split-Federated-Learning | Ampere: 通信效率和高准确度分立联邦学习 2507.07130v1 |
Authors (3): Zihan Zhang, Leon Wong, Blesson Varghese
A Federated Learning (FL) system collaboratively trains neural networks across devices and a server but is limited by significant on-device computation costs. Split Federated Learning (SFL) systems mitigate this by offloading a block of layers of the network from the device to a server. However, in doing so, they introduce large communication overheads due to frequent exchanges of intermediate activations and gradients between devices and the server, and they reduce model accuracy for non-IID data. We propose Ampere, a novel collaborative training system that simultaneously minimizes on-device computation and device-server communication while improving model accuracy. Unlike SFL, which uses a global loss with iterative end-to-end training, Ampere develops unidirectional inter-block training to sequentially train the device and server block with a local loss, eliminating the transfer of gradients. A lightweight auxiliary network generation method decouples training between the device and server, reducing frequent intermediate exchanges to a single transfer, which significantly reduces the communication overhead. Ampere mitigates the impact of data heterogeneity by consolidating activations generated by the trained device block to train the server block, in contrast to SFL, which trains on device-specific, non-IID activations. Extensive experiments on multiple CNNs and transformers show that, compared to state-of-the-art SFL baseline systems, Ampere (i) improves model accuracy by up to 13.26% while reducing training time by up to 94.6%, (ii) reduces device-server communication overhead by up to 99.1% and on-device computation by up to 93.13%, and (iii) reduces the standard deviation of accuracy by 53.39% across various non-IID degrees, highlighting superior performance when faced with heterogeneous data.
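联邦学习(FL)系统在设备和服务器之间协同训练神经网络,但受限于高昂的设备端计算成本。拆分联邦学习(SFL)系统通过将网络的一部分层从设备卸载到服务器来缓解这一问题。然而,这样做会因设备与服务器之间频繁交换中间激活和梯度而带来巨大的通信开销,并降低非独立同分布(non-IID)数据上的模型精度。我们提出 Ampere,一种新型协同训练系统,在提高模型精度的同时最小化设备端计算和设备与服务器之间的通信。与通过迭代端到端训练使用全局损失的 SFL 不同,Ampere 采用单向块间训练,利用局部损失依次训练设备块和服务器块,从而无需传输梯度。一种轻量级辅助网络生成方法将设备与服务器之间的训练解耦,把频繁的中间交换减少为一次传输,显著降低了通信开销。SFL 在设备特定的 non-IID 激活上训练,而 Ampere 则汇总已训练设备块生成的激活来训练服务器块,从而减轻数据异构性的影响。在多种 CNN 和 Transformer 上的大量实验表明,与最先进的 SFL 基线系统相比,Ampere(i)将模型精度提高最多 13.26%,同时将训练时间减少最多 94.6%;(ii)将设备与服务器之间的通信开销减少最多 99.1%,将设备端计算减少最多 93.13%;(iii)在不同 non-IID 程度下将精度标准差降低 53.39%,凸显了其在异构数据下的优越性能。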
Article 32
Title@2025-07-08 (2): Few-Shot Learning by Explicit Physics Integration: An Application to Groundwater Heat Transport
Title: Few-Shot Learning by Explicit Physics Integration: An Application to Groundwater Heat Transport | Few-Shot-Lernen durch explizite Physik-Integration: Eine Anwendung auf den Grundwasser-Wärmetransport | 通过显式物理集成进行小样本学习:在地下水热传输中的应用 2507.06062v1 |
Authors (4): Julia Pelzer, Corné Verburg, Alexander Heinlein, Miriam Schulte
Machine learning methods often struggle with real-world applications in science and engineering due to limited or low-quality training data. In this work, the example of groundwater flow with heat transport is considered; this corresponds to an advection-diffusion process under heterogeneous flow conditions, that is, spatially distributed material parameters and heat sources. Classical numerical simulations are costly and challenging due to high spatio-temporal resolution requirements and large domains. While often computationally more efficient, purely data-driven surrogate models face difficulties, particularly in predicting the advection process, which is highly sensitive to input variations and involves long-range spatial interactions. Therefore, in this work, a Local-Global Convolutional Neural Network (LGCNN) approach is introduced. It combines a lightweight numerical surrogate for the transport process (global) with convolutional neural networks for the groundwater velocity and heat diffusion processes (local). With the LGCNN, a city-wide subsurface temperature field is modeled, involving a heterogeneous groundwater flow field and one hundred groundwater heat pump injection points forming interacting heat plumes over long distances. The model is first systematically analyzed based on random subsurface input fields. Then, the model is trained on a handful of cut-outs from a real-world subsurface map of the Munich region in Germany, and it scales to larger cut-outs without retraining. All datasets, our code, and trained models are published for reproducibility.
由于培训数据有限或质量低,机械学习方法往往与科学和工程的实际世界应用相冲突,这是因为培训数据有限或质量低。在这项工作中,地下水流动与热传输的例子得到考虑;这相当于在各种流动条件下,即空间分布材料参数和热源条件下的平流扩散过程; 经典数字模拟由于高空时空分辨率要求和大领域而成本高且具有挑战性。 虽然计算效率更高、纯数据驱动的代谢模型往往面临困难,特别是在预测对输入变异高度敏感且涉及远距离空间互动的反流过程方面。 因此,在这项工作中,采用了地方-全球进化神经网络(LGCNN)方法; 将运输过程(全球)的轻量数字替代模型与地下水速度和热传播过程的进化神经网络(当地)相结合。 与全市范围的低表层温度模型模型模型模型,涉及混集地下水流场和100个地下水热泵注入点,形成远距离互动的热流。因此,在这项工作中,采用了一种本地进化的、经过系统分析的地表层模型,然后从德国的地层再分析。
Article 33
Title@2025-07-08 (2): Efficient Federated Learning with Timely Update Dissemination
Title: Efficient Federated Learning with Timely Update Dissemination | Effizientes Federated Learning mit rechtzeitiger Verbreitung von Aktualisierungen | 及时更新传播的高效联邦学习 2507.06031v1 |
Authors (7): Juncheng Jia, Ji Liu, Chao Huo, Yihui Shen, Yang Zhou, Huaiyu Dai, Dejing Dou
Federated Learning (FL) has emerged as a compelling methodology for the management of distributed data, marked by significant advancements in recent years. In this paper, we propose an efficient FL approach that capitalizes on additional downlink bandwidth resources to ensure timely update dissemination. Initially, we implement this strategy within an asynchronous framework, introducing the Asynchronous Staleness-aware Model Update (FedASMU), which integrates both server-side and device-side methodologies. On the server side, we present an asynchronous FL system model that employs a dynamic model aggregation technique, which harmonizes local model updates with the global model to enhance both accuracy and efficiency. Concurrently, on the device side, we propose an adaptive model adjustment mechanism that integrates the latest global model with local models during training to further elevate accuracy. Subsequently, we extend this approach to a synchronous context, referred to as FedSSMU. Theoretical analyses substantiate the convergence of our proposed methodologies. Extensive experiments, encompassing six models and five public datasets, demonstrate that FedASMU and FedSSMU significantly surpass baseline methods in terms of both accuracy (up to 145.87%) and efficiency (up to 97.59%).
联邦学习联合会(FL)已成为管理分布式数据的一个令人信服的方法,近年来取得了显著进展。在本文件中,我们提出一个高效的FL方法,利用额外的下行链带带带宽资源,确保及时更新传播。最初,我们在一个不同步的框架内执行这项战略,采用“超同步的Stalesness-aware模型更新”,将服务器侧面和装置侧面方法结合起来。在服务器方面,我们提出了一个无节制的FL系统模型,采用动态模型集成技术,将本地模型更新与全球模型统一起来,以提高准确性和效率。与此同时,在设备方面,我们提议了一个适应性模型调整机制,在培训期间将最新的全球模型与当地模型结合起来,以进一步提高准确性。随后,我们将这一方法扩大到同步环境,称为FedSSMU。理论分析证实了我们拟议方法的趋同性。广泛的实验,包括六个模型和五个公共数据集,表明FedASMU和FedSSMUMU在准确性(达到145.87%)和效率方面大大超过基线方法。(达到97.59%)和效率方面。
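The server-side idea in the abstract above (merging a possibly stale local update into the global model) can be illustrated with a minimal sketch. This is not the FedASMU aggregation rule, which the abstract describes as dynamic; the polynomial decay, the `decay` and `base_lr` parameters, and the function names are all assumptions chosen for illustration.

```python
def staleness_weight(staleness: int, decay: float = 0.5) -> float:
    """Hypothetical polynomial decay: the staler the update, the smaller its weight."""
    return (1.0 + staleness) ** (-decay)


def aggregate(global_model, local_model, global_round, local_round, base_lr=0.5):
    """Blend one asynchronous local update into the global model.

    Models are plain lists of floats; staleness is the number of global
    rounds that passed since the device last fetched the global model.
    """
    staleness = global_round - local_round
    alpha = base_lr * staleness_weight(staleness)
    return [(1 - alpha) * g + alpha * l for g, l in zip(global_model, local_model)]
```

A fresh update (staleness 0) is mixed in with weight `base_lr`, while an update that is three rounds old is mixed in with half of that under the assumed decay.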
Article 34
Title@2025-07-08 (2): ECORE: Energy-Conscious Optimized Routing for Deep Learning Models at the Edge
Title: ECORE: Energy-Conscious Optimized Routing for Deep Learning Models at the Edge | ECORE: Energiebewusstes optimiertes Routing für Deep-Learning-Modelle am Rand | ECORE: 在边缘深层学习模型的能源普能优化运行 2507.06011v1 |
Authors (5): Daghash K. Alqahtani, Maria A. Rodriguez, Muhammad Aamir Cheema, Hamid Rezatofighi, Adel N. Toosi
Edge computing enables data processing closer to the source, significantly reducing latency, an essential requirement for real-time vision-based analytics such as object detection in surveillance and smart city environments. However, these tasks place substantial demands on resource-constrained edge devices, making the joint optimization of energy consumption and detection accuracy critical. To address this challenge, we propose ECORE, a framework that integrates multiple dynamic routing strategies, including estimation-based techniques and a greedy selection algorithm, to direct image processing requests to the most suitable edge device-model pair. ECORE dynamically balances energy efficiency and detection performance based on object characteristics. We evaluate our approach through extensive experiments on real-world datasets, comparing the proposed routers against widely used baseline techniques. The evaluation leverages established object detection models (YOLO, SSD, EfficientDet) and diverse edge platforms, including Jetson Orin Nano, Raspberry Pi 4 and 5, and TPU accelerators. Results demonstrate that our proposed context-aware routing strategies can reduce energy consumption and latency by 45% and 49%, respectively, while incurring only a 2% loss in detection accuracy compared to accuracy-centric methods.
电磁计算能够使数据处理更接近源头,大大降低基于实时视觉分析的基本要求,如监视和智能城市环境中的物体探测。然而,这些任务对资源有限的边缘装置提出了大量要求,使能源消耗和探测精确度的联合优化变得至关重要。为了应对这一挑战,我们提议ECORE,这是一个将多种动态路由战略(包括基于估算的技术)和贪婪的选择算法相结合的框架,将图像处理请求引导给最合适的边缘设备模型。ECORE动态地平衡了基于物体特性的能源效率和探测性能。我们通过对真实世界数据集的广泛实验来评估我们的方法,将拟议的路由器与广泛使用的基线技术进行比较。评价利用了既定的物体探测模型(YOLO、SSD、高效设计)和不同的边缘平台,包括Jetson Orin Nano、Raspberry Pi 4 和 5 以及 TPUcercercer 。结果显示,我们拟议的环境观测路由战略可以分别将能源消耗和耐用率分别减少45%和49 %,同时与精确度相比,在探测精确度方面仅损失2%。
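A greedy router of the kind the abstract mentions could look roughly like the sketch below. It is not ECORE's router: it ignores object characteristics, and the `Pair` record, the energy and accuracy figures a caller supplies, and the `tolerance` parameter are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    device: str
    model: str
    energy_j: float  # estimated energy per request, in joules (assumed input)
    accuracy: float  # estimated detection accuracy in [0, 1] (assumed input)


def route(pairs, tolerance=0.02):
    """Pick the lowest-energy pair whose accuracy is within `tolerance` of the best."""
    best_acc = max(p.accuracy for p in pairs)
    eligible = [p for p in pairs if p.accuracy >= best_acc - tolerance]
    return min(eligible, key=lambda p: p.energy_j)
```

The tolerance makes the trade-off explicit: a small accuracy loss is accepted whenever it buys a lower-energy device-model pair.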
Article 35
Title@2025-07-08 (2): Towards Serverless Processing of Spatiotemporal Big Data Queries
Title: Towards Serverless Processing of Spatiotemporal Big Data Queries | Auf dem Weg zur serverlosen Verarbeitung von raumzeitlichen Big Data-Abfragen | 迈向无服务器处理斯帕蒂奥多时大数据查询 2507.06005v1 |
Authors (3): Diana Baumann, Tim C. Rese, David Bermbach
Spatiotemporal data are being produced in continuously growing volumes by a variety of data sources, and a variety of application fields rely on rapid analysis of such data. Existing systems such as PostGIS or MobilityDB usually build on relational database systems, thus inheriting their scale-out characteristics. As a consequence, big spatiotemporal data scenarios still have limited support even though many query types can easily be parallelized. In this paper, we propose our vision of a native serverless data processing approach for spatiotemporal data: we break down queries into small subqueries, which then leverage the near-instant scaling of Function-as-a-Service platforms to execute in parallel. With this, we partially solve the scalability needs of big spatiotemporal data processing.
各种数据来源和各种应用领域正在以快速分析这些数据为基础,不断增长的量中生成随机数据,现有系统,如PostGIS或流动DB,通常以关系数据库系统为基础,从而继承其扩展特点。因此,大型时空数据假设方案仍然得到的支持有限,尽管许多查询类型可以很容易地平行。在本文中,我们提出了对空间时空数据采用本地服务器无数据处理方法的愿景:我们将查询分解成小小小小问题,然后利用功能-服务平台的近速缩放平行执行。因此,我们部分解决了大型时空数据处理的可扩展性需求。
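The subquery decomposition described above can be sketched as follows, assuming a simple count query over a time range. A local thread pool stands in for Function-as-a-Service invocations, and the `(timestamp, x, y)` point layout and all function names are assumptions, not the authors' design.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(start: int, end: int, parts: int):
    """Split [start, end) into `parts` contiguous half-open windows."""
    bounds = [start + (end - start) * i // parts for i in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def count_in_window(points, window):
    """Subquery: count points whose timestamp falls inside one window."""
    lo, hi = window
    return sum(1 for t, _x, _y in points if lo <= t < hi)


def parallel_count(points, start, end, parts=4):
    """Run one subquery per window in parallel and merge the partial counts."""
    windows = split_range(start, end, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        return sum(pool.map(lambda w: count_in_window(points, w), windows))
```

Because the windows are disjoint and cover the full range, the partial counts can be merged by plain addition.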
Article 36
Title@2025-07-08 (2): Containerization in Multi-Cloud Environment: Roles, Strategies, Challenges, and Solutions for Effective Implementation
Title: Containerization in Multi-Cloud Environment: Roles, Strategies, Challenges, and Solutions for Effective Implementation | Containerisierung in Multi-Cloud-Umgebungen: Rollen, Strategien, Herausforderungen und Lösungen für eine effektive Umsetzung | 多种城市环境中的集装箱化:作用、战略、挑战和有效执行办法 2403.12980v3 |
Authors (8): Muhammad Waseem, Aakash Ahmad, Peng Liang, Muhammad Azeem Akbar, Arif Ali Khan, Iftikhar Ahmad, Manu Setälä, Tommi Mikkonen
Containerization in multi-cloud environments has received significant attention in recent years from both academic research and industrial development perspectives. However, there has been no effort to systematically investigate the state of research on this topic. The aim of this research is to systematically identify and categorize the multiple aspects of containerization in multi-cloud environments. We conducted a Systematic Mapping Study (SMS) on the literature published between January 2013 and July 2024. One hundred twenty-one studies were selected, and the key results are: (1) Four leading themes on containerization in multi-cloud environments are identified: ‘Scalability and High Availability’, ‘Performance and Optimization’, ‘Security and Privacy’, and ‘Multi-Cloud Container Monitoring and Adaptation’. (2) Ninety-eight patterns and strategies for containerization in multi-cloud environments were classified across 10 subcategories and 4 categories. (3) Ten quality attributes were identified with 47 associated tactics. (4) Four catalogs consisting of challenges and solutions related to security, automation, deployment, and monitoring were introduced. The results of this SMS will assist researchers and practitioners in pursuing further studies on containerization in multi-cloud environments and developing specialized solutions for containerization applications in multi-cloud environments.
近年来,从学术研究和工业发展的角度来看,多层环境中的集装箱化问题都受到学术研究和工业发展两方面的极大关注,然而,没有努力系统地调查这一专题的研究状况,目的是系统地查明和分类多层环境中集装箱化的多个方面,我们对2013年1月至2024年7月出版的文献进行了系统绘图研究(SMS),选定了121项研究,并得出了关键结果:(1) 确定了多层环境中集装箱化的四个主要主题:“可扩展性和高可用性”、“履约和优化”、“安全和隐私”和“多层集装箱监测和适应”。 (2) 将多层环境中集装箱化的98种模式和战略分为10个亚类和4类。 (3) 认为有47种相关策略的10个质量属性。 (4) 提出了4个由安全、自动化、部署和监测方面的挑战和解决办法组成的目录。
Article 37
Title@2025-07-08 (2): Conthereum: Concurrent Ethereum Optimized Transaction Scheduling for Multi-Core Execution
Title: Conthereum: Concurrent Ethereum Optimized Transaction Scheduling for Multi-Core Execution | Conthereum: Concurrent Ethereum optimierte Transaktionsplanung für Multi-Core-Execution | Contheum: 与Etheum同时的多核心执行优化交易日程安排 2504.07280v2 |
Authors (3): Atefeh Zareh Chahoki, Maurice Herlihy, Marco Roveri
Conthereum is a concurrent Ethereum solution for intra-block parallel transaction execution, enabling validators to utilize multi-core infrastructure and transform the sequential execution model of Ethereum into a parallel one. This shift significantly increases throughput and transactions per second (TPS), while ensuring conflict-free execution in both proposer and attestor modes and preserving execution order consistency in the attestor. At the heart of Conthereum is a novel, lightweight, high-performance scheduler inspired by the Flexible Job Shop Scheduling Problem (FJSS). We propose a custom greedy heuristic algorithm, along with its efficient implementation, that solves this formulation effectively and decisively outperforms existing scheduling methods in finding suboptimal solutions that satisfy the constraints, achieve minimal makespan, and maximize speedup in parallel execution. Additionally, Conthereum includes an offline phase that equips its real-time scheduler with a conflict analysis repository obtained through static analysis of smart contracts, identifying potentially conflicting functions using a pessimistic approach. Building on this novel scheduler and extensive conflict data, Conthereum outperforms existing concurrent intra-block solutions. Empirical evaluations show near-linear throughput gains with increasing computational power on standard 8-core machines. Although scalability deviates from linear with higher core counts and increased transaction conflicts, Conthereum still significantly improves upon the current sequential execution model and outperforms existing concurrent solutions under a wide range of conditions.
这一转变极大地提高了产出量和交易量(TPS)每秒(TPS)的不冲突执行率和交易量,同时确保在提议方和证明方模式中确保无冲突执行,同时在证明人中保持执行命令的一致性。在Contheum的核心是一个由灵活的工作商店调度问题(FJSS)启发的新颖、轻巧、高性能的调度器。我们提议了一种习惯贪婪的超脂算法,以及其高效的实施,从而有效、果断地超越了现有的时间安排方法,以找到满足限制、实现最低成份和最大限度地加快平行执行的非最佳解决办法。此外,Contheum包含一个脱线阶段,通过对智能合同进行静态分析,使其实时调度器配备冲突分析库,利用悲观主义方法确定潜在的冲突模式功能。我们提议了一种新颖的排缩式和广泛的冲突数据,从而有效地解决了这一公式,决定性地超越了现有的安排方法,从而在寻找满足这些限制、实现最低成份的不理想的解决办法时,并最大限度地加快了平行执行速度。
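To make the scheduling problem in the abstract above concrete, here is a minimal greedy list-scheduling sketch. It is not Conthereum's heuristic: transactions are taken in index order, `durations` and the symmetric `conflicts` map are hypothetical inputs (the paper obtains conflict data from static analysis), and conflicting transactions are simply serialized.

```python
def greedy_schedule(durations, conflicts, num_cores):
    """Assign transactions to cores so that conflicting ones never overlap.

    durations: list of execution times, indexed by transaction id.
    conflicts: dict mapping a transaction id to the set of ids it conflicts
               with (assumed symmetric).
    Returns (placement, makespan), where placement[tx] = (core, start_time).
    """
    core_free = [0] * num_cores  # time at which each core becomes idle
    finish = {}                  # finish time of each scheduled transaction
    placement = {}
    for tx, dur in enumerate(durations):
        # A transaction may start only after all scheduled conflicting ones finish.
        ready = max((finish[o] for o in conflicts.get(tx, ()) if o in finish), default=0)
        # Greedy choice: the core that allows the earliest start.
        core = min(range(num_cores), key=lambda c: max(core_free[c], ready))
        start = max(core_free[core], ready)
        finish[tx] = start + dur
        core_free[core] = finish[tx]
        placement[tx] = (core, start)
    return placement, max(finish.values(), default=0)
```

With two conflicting transactions and one independent transaction on two cores, the conflicting pair runs back to back while the independent one runs alongside.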
Article 38
Title@2025-07-08 (2): Fundamental Limits of Hierarchical Secure Aggregation with Cyclic User Association
Title: Fundamental Limits of Hierarchical Secure Aggregation with Cyclic User Association | Grundlegende Grenzen der Hierarchischen Sicheren Aggregation mit Cyclic User Association | 与cycclic用户协会的等级安全分类基本限制 2503.04564v5 |
Authors (6): Xiang Zhang, Zhou Li, Kai Wan, Hua Sun, Mingyue Ji, Giuseppe Caire
Secure aggregation is motivated by federated learning (FL) where a cloud server aims to compute an averaged model (i.e., weights of deep neural networks) of the locally-trained models of numerous clients, while adhering to data security requirements. Hierarchical secure aggregation (HSA) extends this concept to a three-layer hierarchical network, where clustered users communicate with the server through an intermediate layer of relays. In HSA, beyond conventional server security, relay security is also enforced to ensure that the relays remain oblivious to the users’ inputs (an abstraction of the local models in FL). Existing work on HSA assumes that each user is associated with only one relay, limiting opportunities for coding across inter-cluster users to achieve efficient communication and key generation. In this paper, we consider HSA with a cyclic association pattern where each user is connected to $B$ consecutive relays in a wrap-around manner. We propose an efficient aggregation scheme which includes a message design for the inputs inspired by gradient coding, a well-known technique for efficient communication in distributed computing, along with a highly non-trivial security key design. We also derive novel converse bounds on the minimum achievable communication and key rates using information-theoretic arguments.
安全聚合的动机是联合学习(FL),云服务器的目的是计算当地培训的众多客户的模型的平均模型(即深神经网络的重量),同时遵守数据安全要求。 等级安全聚合(HSA)将这一概念推广到三级分级网络,集中用户通过中间继电器与服务器进行交流。在HSA中,除了常规服务器安全外,还实施中继安全,以确保中继器不为用户的投入所触动(即FL中当地模型的抽象化)。HSA的现有研究假设,每个用户只与一个中继器相关,从而限制了各组用户为高效通信和关键生成进行编码的机会。在本文件中,我们考虑HSA采用循环联系模式,即每个用户都通过中间继电器连接到$B$的连续继电器。我们提出了一个高效的汇总计划,其中包括由梯度编码驱动的投入信息设计信息设计,这是一种众所周知的技术,用于在分布式计算机时高效通信,同时使用高度非三端关键关键参数,同时使用最低关键参数。
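The cyclic association pattern from the abstract can be written down in a few lines. The zero-based indexing and the convention that user `u` starts at relay `u` are assumptions made for illustration; the abstract only states that each user connects to $B$ consecutive relays with wrap-around.

```python
def cyclic_relays(user: int, num_relays: int, B: int):
    """Relays reached by `user`: B consecutive indices, wrapping around."""
    return [(user + j) % num_relays for j in range(B)]


def users_of_relay(relay: int, num_users: int, num_relays: int, B: int):
    """Inverse view: which users are associated with a given relay."""
    return [u for u in range(num_users) if relay in cyclic_relays(u, num_relays, B)]
```

With `B = 1` this reduces to the single-relay association that the abstract attributes to prior work.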
Article 39
Title@2025-07-08 (2): A Formal Refutation of the Blockchain Trilemma
Title: A Formal Refutation of the Blockchain Trilemma | Eine formale Widerlegung des Blockchain Trilemma | Trilemma 链链的正式反驳 2507.05809v1 |
Authors (1): Craig Wright
The so-called blockchain trilemma asserts the impossibility of simultaneously achieving scalability, security, and decentralisation within a single blockchain protocol. In this paper, we formally refute that proposition. Employing predicate logic, formal automata theory, computational complexity analysis, and graph-theoretic measures of relay topology (specifically Baran’s model of network path redundancy), we demonstrate that the trilemma constitutes a category error, conflates distinct analytical domains, and relies upon unproven causal assumptions. We further expose its reliance on composition fallacies drawn from flawed system implementations. A constructive counterexample is presented: a blockchain protocol exhibiting unbounded transaction throughput, cryptographic security under adversarial load, and multipath decentralised propagation. This example is not hypothetical but grounded in protocol design enabled by compact block relay, SPV verification, and IPv6 multicast. The trilemma is revealed not as a law of protocol architecture, but as a heuristic fallacy sustained by imprecision and design defeatism.
所谓的“ 链链三角形” 声称不可能同时在一个单条链协议中实现可缩放性、 安全和分散化。 在本文中,我们正式驳斥了这一论点。 采用了上游逻辑、 正式的自动数据理论、 计算复杂度分析, 以及中继地形学模型的图形理论测量方法, 特别是Baran的网络路径冗余- 我们证明, 三角形构成一个分类错误, 组合不同的分析领域, 并依赖于未经证实的因果关系假设。 我们进一步暴露了它依赖从系统实施缺陷中得出的构成谬误。 一个建设性的反证 : 一个显示无约束交易通过量、 对抗性载荷下的加密安全以及多路径分散传播的串联协议。 这个例子不是假设性的,而是基于由紧凑区链中继、 SPV 验证 和 IPv6 多播送的协议设计。 三角形被披露为协议架构法, 而是由不精确和设计失败性所支撑的超度误判法。
Article 40
Title@2025-07-08 (2): A Distributed Consensus Algorithm for Prioritizing Autonomous Vehicle Passing at Unsignalized Intersections under Mixed Traffic
Title: A Distributed Consensus Algorithm for Prioritizing Autonomous Vehicle Passing at Unsignalized Intersections under Mixed Traffic | Ein verteilter Konsens-Algorithmus für die Priorisierung autonomer Fahrzeuge bei unsignalisierten Kreuzungen unter gemischtem Verkehr | 混合交通下未发信号的交叉路口通过自动车辆优先通行的分布式共识计算法 2507.03486v2 |
Authors (2): Younjeong Lee, Young Yoon
We propose a methodology for connected autonomous vehicles (CAVs) to determine their passing priority at unsignalized intersections where they coexist with human-driven vehicles (HVs). Assuming that CAVs can perceive the entry order of surrounding vehicles using computer vision technology and are capable of avoiding collisions, we introduce a voting-based distributed consensus algorithm inspired by Raft to resolve tie-breaking among simultaneously arriving CAVs. The algorithm is structured around the candidate and leader election processes and incorporates a minimal consensus quorum to ensure both safety and liveness among CAVs under typical asynchronous communication conditions. Assuming CAVs to be SAE (Society of Automotive Engineers) Level-4 or higher autonomous vehicles, we implemented the proposed distributed consensus algorithm using gRPC. By adjusting variables such as the CAV-to-HV ratio, intersection scale, and the processing time of computer vision modules, we demonstrated that stable consensus can be achieved even under mixed-traffic conditions involving HVs without adequate functionalities to interact with CAVs. Experimental results show that the proposed algorithm reached consensus at a typical unsignalized four-way, two-lane intersection in approximately 30-40 ms on average. A secondary vision-based system is employed to complete the crossing priorities based on the recognized lexicographical order of the license plate numbers in case the consensus procedure times out on an unreliable vehicle-to-vehicle communication network. The significance of this study lies in its ability to improve traffic flow at unsignalized intersections by enabling rapid determination of passing priority through distributed consensus even under mixed traffic with faulty vehicles.
我们提出一种方法,使网联自动驾驶车辆(CAVs)在与人驱动车辆(HVs)共存的无信号交叉口确定其通行优先权。假设CAVs能够利用计算机视觉技术感知周围车辆的进入顺序,并能够避免碰撞,我们采用由Raft所启发的基于投票的分布式共识算法,以解决同时抵达的CAV之间的平局问题。这种算法围绕候选人和领导者选举过程进行,并包含最低限度的共识法定人数,以确保CAVs在典型的异步通信条件下的安全性和活性。假设CAVs是SAE(汽车工程师协会)4级或更高级别的自动驾驶车辆,我们使用gRPC实现了所提出的分布式共识算法。通过调整CAV与HV的比例、交叉口规模和计算机视觉模块处理时间等变量,我们证明即使在包含无法与CAVs交互的HVs的混合交通条件下,也可以达成稳定的共识。实验结果表明,所提算法在典型的无信号四向双车道交叉口平均约30-40毫秒内达成共识。当共识过程在不可靠的车对车通信网络上超时时,采用基于视觉的辅助系统,按识别出的车牌号码的字典序来完成通行优先权的确定。本研究的意义在于,即使在含有故障车辆的混合交通下,也能通过分布式共识快速确定通行优先权,从而改善无信号交叉口的交通流。
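The fallback rule at the end of the abstract (ordering by recognized license plates when consensus times out) can be sketched as below. Representing a timeout as `None` and treating ascending lexicographic order as the crossing priority are assumptions; the abstract does not specify either detail.

```python
def passing_order(consensus_result, recognized_plates):
    """Return the crossing order for simultaneously arriving vehicles.

    consensus_result: ordered plates agreed on by the voting procedure,
                      or None if it timed out (hypothetical convention).
    recognized_plates: plates read by the vision module.
    """
    if consensus_result is not None:
        return list(consensus_result)
    # Fallback: every vehicle can compute the same order locally,
    # without any vehicle-to-vehicle messages.
    return sorted(recognized_plates)
```

The point of the fallback is determinism: as long as all vehicles recognize the same set of plates, they derive the same order independently.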
Article 41
Title@2025-07-08 (2): Air-FedGA: A Grouping Asynchronous Federated Learning Mechanism Exploiting Over-the-air Computation
Title: Air-FedGA: A Grouping Asynchronous Federated Learning Mechanism Exploiting Over-the-air Computation | Air-FedGA: Ein gruppierender asynchroner föderierter Lernmechanismus, der Over-the-Air-Berechnung ausnutzt | Air-FedGA:利用空中计算的分组异步联邦学习机制 2507.05704v1 |
Authors (7): Qianpiao Ma, Junlong Zhou, Xiangpeng Hou, Jianchun Liu, Hongli Xu, Jianeng Miao, Qingmin Jia
Federated learning (FL) is a new paradigm to train AI models over distributed edge devices (i.e., workers) using their local data, while confronting various challenges including communication resource constraints, edge heterogeneity, and non-IID data. Over-the-air computation (AirComp) is a promising technique to achieve efficient utilization of communication resources for model aggregation by leveraging the superposition property of a wireless multiple access channel (MAC). However, AirComp requires strict synchronization among edge devices, which is hard to achieve in heterogeneous scenarios. In this paper, we propose an AirComp-based grouping asynchronous federated learning mechanism (Air-FedGA), which combines the advantages of AirComp and asynchronous FL to address the communication and heterogeneity challenges. Specifically, Air-FedGA organizes workers into groups and performs over-the-air aggregation within each group, while groups asynchronously communicate with the parameter server to update the global model. In this way, Air-FedGA accelerates the FL model training by over-the-air aggregation, while relaxing the synchronization requirement of this aggregation technology. We theoretically prove the convergence of Air-FedGA. We formulate a training time minimization problem for Air-FedGA and propose a power control and worker grouping algorithm to solve it, which jointly optimizes the power scaling factors at edge devices, the denoising factors at the parameter server, and the worker grouping strategy. We conduct experiments on classical models and datasets, and the results demonstrate that our proposed mechanism and algorithm can speed up FL model training by 29.9%-71.6% compared with the state-of-the-art solutions.
联邦学习(FL)是使用当地数据在分布边缘设备(即工人)上培训AI模型的一个新范例,同时应对各种挑战,包括通信资源限制、边缘异质性和数据非IID。空中计算(AirComp)是实现高效利用通信资源进行模型聚合的一个大有希望的技术,办法是利用无线多重接入频道(MAC)的叠加属性。然而,AirComp要求在边缘设备之间进行严格的同步,这在异构场景中很难实现。在本文中,我们提出一种基于AirComp的分组异步联邦学习机制(Air-FedGA),它结合AirComp和异步FL的优势,以应对通信和异质性挑战。具体而言,Air-FedGA将工人分成若干组,在每组内进行空中聚合,各组则异步地与参数服务器通信以更新全局模型。这样,Air-FedGA通过空中聚合加速FL模型训练,同时放宽了该聚合技术的同步要求。我们从理论上证明了Air-FedGA的收敛性。我们为Air-FedGA提出训练时间最小化问题,并提出功率控制与工人分组算法来求解,该算法联合优化边缘设备的功率缩放因子、参数服务器的去噪因子以及工人分组策略。我们在经典模型和数据集上进行了实验,结果表明,与最先进的方案相比,所提机制和算法可将FL模型训练加速29.9%-71.6%。
Article 42
Title@2025-07-08 (2): Skipper: Maximal Matching with a Single Pass over Edges
Title: Skipper: Maximal Matching with a Single Pass over Edges | Skipper: Maximales Matching mit einem einzigen Durchlauf über die Kanten | Skipper: 单次遍历边的极大匹配 2507.04420v2 |
Authors (1): Mohsen Koohi Esfahani
Maximal Matching (MM) is a fundamental graph problem with diverse applications. However, state-of-the-art parallel MM algorithms are limited by their need to process graph edges repeatedly over multiple iterations. Furthermore, optimized algorithms often require additional memory for graph contraction or edge filtering. In this paper, we introduce Skipper, an incremental asynchronous MM algorithm that (i) processes each edge deterministically and only once, (ii) skips a large fraction of edges during processing, and (iii) minimizes memory space utilization. Notably, Skipper requires (a) a single pass over the edges, and (b) only a single byte of memory space per vertex. Our evaluation of Skipper, using both real-world and synthetic graphs with up to 161 billion edges, and across three different computer architectures, shows that Skipper processes only 1.2% of the edges and delivers a 47.1 times average speedup (geometric mean). Moreover, Skipper’s output quality is highly competitive, with an average size of 88.6% relative to the output of the Lim-Chung algorithm as a state-of-the-art MM algorithm with the largest output size.
最大匹配( MM) 是各种应用的基本图形问题 。 但是, 最先进的平行 MM 算法由于需要多次在多个迭代中反复处理图形边缘而受到限制 。 此外, 优化算法通常需要为图形缩缩缩或边缘过滤增加记忆量 。 在本文中, 我们引入了Skipper , 这是一种渐进式的非同步 MM 算法, (一) 处理每个边缘, 并且只处理一次, (二) 在处理过程中跳过大部分边缘, 并(三) 最小化内存空间的利用。 值得注意的是, Skipper 的输出质量要求 (a) 单次越过边缘, 并且 (b) 只需要每个顶层的记忆空间的单一字节。 我们对Skipper 的评价通常需要额外的内存空间。 我们使用真实世界和合成的、 高达 1610 亿 边缘的图形, 并跨越三个不同的计算机结构, 显示Skipper 只处理边缘的1. 2% , 平均速度为47.1 倍( 平方平均值 ) 。 此外, shipper 的输出质量质量具有很高的竞争力, , 平均比 Lim- Chal- hal- halmax 的输出大小为 。
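For readers unfamiliar with the problem, the sketch below shows the classic sequential greedy maximal matching, which also visits each edge once and keeps one byte of state per vertex. It is not Skipper itself: Skipper is an asynchronous parallel algorithm that additionally skips most edges, and none of that machinery is modelled here.

```python
def greedy_maximal_matching(num_vertices, edges):
    """Single pass over `edges`; an edge is taken if both endpoints are free."""
    matched = bytearray(num_vertices)  # one byte of state per vertex
    matching = []
    for u, v in edges:
        if u != v and not matched[u] and not matched[v]:
            matched[u] = matched[v] = 1
            matching.append((u, v))
    return matching
```

The result is maximal (no remaining edge has two free endpoints) but not necessarily maximum, which is why the abstract reports output size relative to another algorithm.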
Article 43
Title@2025-07-08 (2): Archetype-Aware Predictive Autoscaling with Uncertainty Quantification for Serverless Workloads on Kubernetes
Title: Archetype-Aware Predictive Autoscaling with Uncertainty Quantification for Serverless Workloads on Kubernetes | Archetype-Aware Predictive Autoscaling mit Unsicherheit Quantifizierung für serverlose Workloads auf Kubernetes | Kubernetes 上无服务器工作载量的不确定性量化 2507.05653v1 |
Authors (10): Guilin Zhang, Srinivas Vippagunta, Raghavendra Nandagopal, Suchitra Raman, Jeff Xu, Marcus Pfeiffer, Shree Chatterjee, Ziqi Tan, Wulan Guo, Hailong Jiang
High-performance extreme computing (HPEC) platforms increasingly adopt serverless paradigms, yet face challenges in efficiently managing highly dynamic workloads while maintaining service-level objectives (SLOs). We propose AAPA, an archetype-aware predictive autoscaling system that leverages weak supervision to automatically classify 300,000+ workload windows into four archetypes (PERIODIC, SPIKE, RAMP, STATIONARY_NOISY) with 99.8% accuracy. Evaluation on publicly available Azure Functions traces shows that AAPA reduces SLO violations by up to 50% and improves response time by 40%, albeit with a 2-8× increase in resource cost under spike-heavy loads.
高性能极端计算(HPEC)平台越来越多地采用无服务器模式,但在有效管理高度动态的工作量的同时维持服务级目标(SLO)方面却面临挑战。我们提出AAPA,这是一个感知原型的预测性自动扩缩系统,它利用弱监督将300,000多个工作负载窗口自动分类为四种原型(PERIODIC、SPIKE、RAMP、STATIONARY_NOISY),准确率达到99.8%。对公开提供的Azure Functions追踪数据的评估显示,AAPA将违反SLO的情况最多减少50%,将响应时间缩短40%,但在尖峰密集的负载下资源成本增加2-8倍。
Article 44
Title@2025-07-08 (2): Curvature-Aligned Federated Learning (CAFe): Harmonizing Loss Landscapes for Fairness Without Demographics
Title: Curvature-Aligned Federated Learning (CAFe): Harmonizing Loss Landscapes for Fairness Without Demographics | Krümmungsorientiertes Federated Learning (CAFe): Harmonisierung von Verlustlandschaften für Fairness ohne Demographie | CAFE: 协调丧失的景观,促进没有人口统计的公平 2404.19725v5 |
Authors (3): Shaily Roy, Harshit Sharma, Asif Salekin
Federated Learning (FL) enables privacy-preserving collaborative training, making it well-suited for decentralized human-sensing applications. Ensuring fairness in FL is challenging, as current methods rely on sensitive attribute knowledge, which conflicts with FL’s privacy principles. Additionally, sensitive attributes in human-sensing data may be unknown or latent. To address this, we introduce Curvature-Aligned Federated Learning (CAFe), a theoretically grounded approach that achieves fairness in FL without requiring sensitive attribute knowledge, a concept termed “Fairness without Demographics” (FWD). CAFe introduces loss-landscape curvature regularization during local training and clients’ loss-landscape sharpness-aware aggregation to align curvature both within and across clients, enabling a strong balance between higher fairness and performance. CAFe is especially suitable for real-world human-sensing FL scenarios involving single or multi-user edge devices with unknown or multiple bias factors. We validated CAFe through theoretical and empirical justifications, and comprehensive evaluations using three real-world datasets and a live real-world FL deployment with a heterogeneous testbed of resource-constrained devices. Additionally, we conduct sensitivity analyses on local training data volume, client sampling, communication overhead, resource costs, and runtime performance to demonstrate its feasibility for practical FL edge device deployment.
联邦学习联合会(FL)能够进行保护隐私的合作培训,使其适合于分散的人类遥感应用。确保FL的公平性具有挑战性,因为目前的方法依赖于敏感属性知识,而敏感属性知识与FL的隐私原则相冲突。此外,人类遥感数据中的敏感属性可能是未知的或潜在的。为了解决这个问题,我们引入了Curvatural-Axive Learning(CAFe),这是一种在不要求敏感属性知识的情况下实现FL公平、称为“没有人口统计的公平”的概念(FWD)。CAFe在当地培训和客户损失地貌敏锐度汇总期间引入了失地景观曲线正规化,以调整客户内部和跨客户的隐私原则。此外,CAFe特别适合现实世界人类遥感FL情景,其中涉及单一或多用户边缘装置,且有未知或多重偏差因素。我们通过理论和经验解释,以及利用三个真实世界数据集和实时FL部署实时FL配置,同时使用一个混成的测试台,对客户的清晰度敏锐度-敏锐度汇总度组合,对成本进行数据分析。此外,我们还进行了资源运行数据分析,以展示了对成本进行实地分析。
Article 45
Title@2025-07-08 (2): On Optimizing Resource Utilization in Distributed Connected Components
Title: On Optimizing Resource Utilization in Distributed Connected Components | Optimierung der Ressourcennutzung in verteilten vernetzten Komponenten | 关于最佳利用分配式连接构件的资源 2507.03695v2 |
Authors (1): Mohsen Koohi Esfahani
Connected Components (CC) is a core graph problem with numerous applications. This paper investigates accelerating distributed CC by optimizing memory and network bandwidth utilization. We present two novel distributed CC algorithms, SiskinCC and RobinCC, which are built upon the Jayanti-Tarjan disjoint set union algorithm. To optimize memory utilization, SiskinCC and RobinCC are designed to facilitate efficient access to a shared array for all cores running in a machine. This allows execution of faster algorithms with larger memory bounds. SiskinCC leverages the continuous inter-machine communication during the computation phase to reduce the final communication overhead and RobinCC leverages the structural properties of real-world graphs to optimize network bandwidth utilization. Our evaluation against state-of-the-art CC algorithms, using real-world and synthetic graphs with up to 500 billion edges and 11.7 billion vertices, and on up to 2048 CPU cores, demonstrates that SiskinCC and RobinCC achieve up to 58.5 times speedup.
本文通过优化内存和网络带宽利用率来调查加速传播的CC 。 我们展示了两种新颖的已分发CC 算法, 即SiskinCC 和 RobinCC, 它们是建立在Jayanti- Tarjan 脱节组合组合组合算法基础上的。 为了优化内存利用率, SiskinCC 和 RobinCC 的设计是为了便于所有在机器中运行的核心都能高效地获得共享的阵列。 这使得能够执行具有较大内存界限的更快的算法。 SiskinCC 在计算阶段利用连续的机器间通信来减少最后的通信间接费用, RobinCC 利用真实世界图的结构特性来优化网络带宽利用率。 我们用现实世界和合成图来评估最先进的CC 算法, 使用高达5000亿个边缘和117亿个脊椎的实时和合成图, 以及高达2048 CPU 的核心, 显示Siskin C和RobinCC 达到58.5倍的速度。
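The disjoint set union structure that the abstract builds on can be illustrated with a small sequential sketch. It is neither SiskinCC nor RobinCC, and it replaces the randomized linking of the Jayanti-Tarjan algorithm with a simple assumed rule (the root with the larger id is linked under the smaller one); path halving is used for compression.

```python
def connected_components(num_vertices, edges):
    """Label each vertex with the smallest vertex id in its component."""
    parent = list(range(num_vertices))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            if ru < rv:
                ru, rv = rv, ru
            parent[ru] = rv  # link the larger-id root under the smaller-id root
    return [find(v) for v in range(num_vertices)]
```

The single `parent` array is the kind of shared state the abstract refers to when it describes giving all cores of a machine access to one array.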
Article 46
Title@2025-07-08 (2): MOD-X: A Modular Open Decentralized eXchange Framework proposal for Heterogeneous Interoperable Artificial Intelligence Agents
Title: MOD-X: A Modular Open Decentralized eXchange Framework proposal for Heterogeneous Interoperable Artificial Intelligence Agents | MOD-X: Ein modularer, offener, dezentralisierter eXchange-Rahmenvorschlag für heterogene interoperable Künstliche Intelligenz-Agenten | MOD-X:关于不同基因、可相互操作的人工情报代理人的模块开放的分散式电子交流框架提案 2507.04376v2 |
Authors (5): Georgios Ioannides, Christos Constantinou, Vinija Jain, Aman Chadha, Aaron Elkins
As Artificial Intelligence systems evolve from monolithic models to ecosystems of specialized agents, the need for standardized communication protocols becomes increasingly critical. This paper introduces MOD-X (Modular Open Decentralized eXchange), a novel architectural framework proposal for agent interoperability that addresses key limitations of existing protocols. Unlike current approaches, MOD-X proposes a layered architecture with a Universal Message Bus, thorough state management, translation capabilities, and blockchain-based security mechanisms. We present MOD-X’s architecture, compare it with existing protocols, and demonstrate its application through a worked example of how it enables integration between heterogeneous specialist agents (agents with different architectures, vendors, capabilities, and knowledge representations, including rule-based systems, neural networks, symbolic reasoning engines, and legacy software with agent wrappers). MOD-X’s key innovations include a publish-subscribe communication model, semantic capability discovery, and dynamic workflow orchestration, providing a framework that bridges theoretical formalism with practical implementation. This architecture addresses the growing need for truly decentralized, interoperable agent ecosystems that can scale effectively without the need for central coordination.
随着人工智能系统从单一模式发展到专业代理的生态系统,标准化通信协议的需要变得越来越重要。本文件介绍了MOD-X(Modular Open Defliced eXchange),这是关于代理互操作性的新建筑框架提案,解决了现有协议的主要局限性。与目前的做法不同,MOD-X提出了具有通用信息管道、彻底的国家管理、翻译能力和基于链锁的安全机制的分层结构。我们提出了MOD-X的架构,将其与现有的协议进行比较,并通过一个成功的例子展示了其应用,它是如何使不同专家代理(具有不同结构、供应商、能力和知识代表的代理,包括基于规则的系统、神经网络、象征性推理引擎和与代理包装商的遗留软件)之间实现一体化的。MOD-X的主要创新包括一个出版物订阅通信模式、语系能力发现和动态工作流程管弦化提供一种框架,将理论形式主义与实际执行联系起来。这一架构解决了日益需要真正分散、可相互操作的代理生态系统,而无需中央协调。
Article 47
Title@2025-07-08 (2): Torpor: GPU-Enabled Serverless Computing for Low-Latency, Resource-Efficient Inference
Title: Torpor: GPU-Enabled Serverless Computing for Low-Latency, Resource-Efficient Inference | Torpor: GPU-fähiges serverloses Rechnen für geringe Latenz, ressourceneffiziente Schlussfolgerung | Torpor: 用于低寿命、资源高效推断的GPU-Enable 服务器无服务器计算 2306.03622v3 |
Authors (11): Minchen Yu, Ao Wang, Dong Chen, Haoxuan Yu, Xiaonan Luo, Zhuohao Li, Wei Wang, Ruichuan Chen, Dapeng Nie, Haoran Yang, Yu Ding
Serverless computing offers a compelling cloud model for online inference services. However, existing serverless platforms lack efficient support for GPUs, hindering their ability to deliver high-performance inference. In this paper, we present Torpor, a serverless platform for GPU-efficient, low-latency inference. To enable efficient sharing of a node’s GPUs among numerous inference functions, Torpor maintains models in main memory and dynamically swaps them onto GPUs upon request arrivals (i.e., late binding with model swapping). Torpor uses various techniques, including asynchronous API redirection, GPU runtime sharing, pipelined model execution, and efficient GPU memory management, to minimize latency overhead caused by model swapping. Additionally, we design an interference-aware request scheduling algorithm that utilizes high-speed GPU interconnects to meet latency service-level objectives (SLOs) for individual inference functions. We have implemented Torpor and evaluated its performance in a production environment. Utilizing late binding and model swapping, Torpor can concurrently serve hundreds of inference functions on a worker node with 4 GPUs, while achieving latency performance comparable to native execution, where each model is cached exclusively on a GPU. Pilot deployment in a leading commercial serverless cloud shows that Torpor reduces the GPU provisioning cost by 70% and 65% for users and the platform, respectively.
无服务器计算为在线推断服务提供了一个令人信服的云模式。 但是, 现有的无服务器平台缺乏对 GPU 的有效支持, 妨碍其提供高性能推断的能力。 在本文中, 我们介绍Torpor, 这是一个无服务器平台, 用于 GPU 高效、 低长推导。 为了在众多推算功能中高效共享节点 GPU 的 GPU 组合, Torpor 在主记忆中维护模型, 并在接到请求时动态地将其转换为 GPU 。 Torpor 使用各种技术, 包括无同步的 API 调整、 GPU 运行时间共享、 管道模式执行以及高效的 GPU记忆管理, 以最大限度地减少由模型转换导致的 LAPODR 。 此外, 我们设计了一种干扰- 感应感应功能, 使用高速 GPUPU 连接以满足个人推断功能的LOV 级服务级别目标( SLOs ) 。 我们实施了Topropor, 并评估了它在生产环境中的运行平台上的性能性能。 , 而TOPLPIPI 则可以同时运行 运行运行运行运行 。
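The late-binding idea in the abstract (models live in host memory and are swapped onto a GPU only when a request arrives) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the class name `GpuModelCache`, the slot-based capacity, and in particular the least-recently-used eviction policy, which the abstract does not specify.

```python
from collections import OrderedDict


class GpuModelCache:
    """Hypothetical model of one GPU that can hold `capacity` models at once."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()  # model name -> weights, oldest use first
        self.swaps = 0                 # number of host-to-GPU swap-ins

    def serve(self, name, host_models):
        """Bind the request to the GPU late: swap the model in only if needed."""
        if name in self.resident:
            # Model is already on the GPU: just mark it as recently used.
            self.resident.move_to_end(name)
        else:
            if len(self.resident) >= self.capacity:
                # Assumed policy: evict the least recently used model.
                self.resident.popitem(last=False)
            # Copy from host memory (stands in for the real transfer).
            self.resident[name] = host_models[name]
            self.swaps += 1
        return name


# Host memory keeps every model; the GPU keeps only two of them.
host = {"a": b"weights-a", "b": b"weights-b", "c": b"weights-c"}
gpu = GpuModelCache(capacity=2)
for request in ["a", "b", "a", "c", "b"]:
    gpu.serve(request, host)
```

In this sketch every cache miss pays the full swap cost; the techniques listed in the abstract (pipelined execution, asynchronous API redirection) aim to hide exactly that cost, and they are not modelled here.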
Article 48
Title@2025-07-07 (1): When Federated Learning Meets Quantum Computing: Survey and Research Opportunities
Title: When Federated Learning Meets Quantum Computing: Survey and Research Opportunities | Wenn Federated Learning auf Quanten Computing trifft: Umfrage- und Forschungsmöglichkeiten | 《当联邦学习与量子计算:调查和研究机会》 2504.08814v2 |
Authors (3): Aakar Mathur, Ashish Gupta, Sajal K. Das
Quantum Federated Learning (QFL) is an emerging field that harnesses advances in Quantum Computing (QC) to improve the scalability and efficiency of decentralized Federated Learning (FL) models. This paper provides a systematic and comprehensive survey of the emerging problems and solutions when FL meets QC, from research protocol to a novel taxonomy, particularly focusing on both quantum and federated limitations, such as their architectures, Noisy Intermediate Scale Quantum (NISQ) devices, and privacy preservation. This work explores key developments and integration strategies, along with the impact of quantum computing on FL, keeping a sharp focus on hybrid quantum-classical approaches. The paper offers an in-depth understanding of how the strengths of QC, such as gradient hiding, state entanglement, quantum key distribution, quantum security, and quantum-enhanced differential privacy, have been integrated into FL to ensure the privacy of participants in an enhanced, fast, and secure framework. Finally, this study proposes potential future directions to address the identified research gaps and challenges, aiming to inspire faster and more secure QFL models for practical use.
量子联邦学习(QFL)是一个新兴领域,它利用量子计算(QC)的进步来提高分权的联邦学习(FL)模式的可扩展性和效率,本文件对FL达到质子(QC)时出现的新问题和解决办法进行了系统和全面的调查,从研究协议到新颖分类,特别侧重于量子和联邦限制,例如其结构、新式中等量子量子计算(NISQ)装置和隐私保护等。这项工作探索了关键的发展和一体化战略,以及量子计算对FL的影响,并密切关注混合量子类方法。该文件深入了解了QC的长处,例如梯度隐藏、状态缠绕、量子键分布、量子安全、量子安全以及量子增强的差别隐私等,是如何融入到FL的,以确保参与者在强化、快速和安全框架内的隐私。最后,本研究报告提出了未来可能的方向,以解决已确定的研究差距和挑战,目的是激励更快和更安全地实际使用QLFL模型。
Article 49
Title@2025-07-07 (1): Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding
Title: Helix Parallelism: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decoding | Helix Parallelismus: Rethinking Sharding Strategies for Interactive Multi-Million-Token LLM Decodierung | Helix 平行主义:重新思考交互式多亿-千米调解解码的碎片战略 2507.07120v1 |
Authors (10): Nidhi Bhatia, Ankit More, Ritika Borkar, Tiyasa Mitra, Ramon Matas, Ritchie Zhao, Maximilian Golub, Dheevatsa Mudigere, Brian Pharris, Bita Darvish Rouhani
As LLMs scale to multi-million-token KV histories, real-time autoregressive decoding under tight Token-to-Token Latency (TTL) constraints faces growing pressure. Two core bottlenecks dominate: accessing Feed-Forward Network (FFN) weights and reading long KV caches. While Tensor Parallelism (TP) helps mitigate the cost of FFN weight reads, it does not scale well for attention. When TP width exceeds the number of KV heads, it leads to inefficient KV duplication, limits parallelism, and constrains batch size. Simultaneously, DRAM reads for long KV histories scale linearly with batch size, further capping efficiency. We introduce Helix Parallelism, a hybrid execution strategy that applies KV parallelism during attention to shard KV caches across GPUs, then reuses the same GPUs for TP in dense LLMs or TPxExpert Parallel (EP) in MoEs during FFN computation. To preserve exact attention behavior, Helix includes a lightweight communication step. To minimize the exposed communication cost, we introduce Helix HOP-B. Helix HOP-B effectively minimizes communication overhead through batchwise overlap, preserving low TTL while improving GPU efficiency. Compared to conventional parallelism approaches, Helix reduces TTL by up to 1.5x at fixed batch sizes and supports up to 32x larger batches under the same latency budget for DeepSeek-R1, pushing forward the throughput-latency Pareto frontier on Blackwell and making real-time inference over ultra-long sequences practical.
随着LLMS规模扩大至百万吨KV历史, 实时自动自动递增在紧凑的 Token- Token Latency (TTL) (TTL) (TTL) 限制下, 面临越来越大的压力。 两个核心瓶颈占主导地位: 获取 Feed- Forward 网络(FFN) 重量和读读长 KV 缓存。 虽然Tensor 平行主义(TP) 有助于降低FFFN 重量的成本, 但它不值得关注。 当TP的宽度超过 KV 头数时, 它会导致 KVV 重复效率低, 限制平行主义, 限制批量规模。 与此同时, DRAM 将长的KV 历史缩放比例以线性为直线性, 用批量尺寸线性表示更长的KV 平行主义。 我们引入了混合执行战略, 在GPUPS 关注的硬性 KV 缓冲式 LMS 或 TP 平行 支持 IMFFFFFFFFFR 的TP 期间, 。 要保持精确的注意, 保持精确的行为, Helix 包括一个轻度 最轻度的通讯步骤, 最轻度的通讯步骤, 和最轻级的通讯步骤。
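The abstract states that the KV cache is sharded across GPUs and that a lightweight communication step preserves exact attention. The abstract does not describe that step, so the sketch below uses the standard log-sum-exp merge of per-shard partial results as an assumption; the function names and the single-query, pure-Python setting are illustrative only.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(q, keys, values):
    """Reference softmax attention for one query over the full KV history."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    denom = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / denom
            for d in range(dim)]


def partial_attention(q, keys, values):
    """Work done on one shard: local max, local exp-sum, unnormalised output."""
    scores = [dot(q, k) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    dim = len(values[0])
    acc = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, sum(weights), acc


def combine(partials):
    """Merge step: rescale every shard to the global max, then normalise once."""
    g = max(m for m, _, _ in partials)
    denom = sum(s * math.exp(m - g) for m, s, _ in partials)
    dim = len(partials[0][2])
    return [sum(acc[d] * math.exp(m - g) for m, _, acc in partials) / denom
            for d in range(dim)]


def sharded_attention(q, keys, values, num_shards):
    """Split the KV history into contiguous shards and merge the partials."""
    step = math.ceil(len(keys) / num_shards)
    partials = [partial_attention(q, keys[i:i + step], values[i:i + step])
                for i in range(0, len(keys), step)]
    return combine(partials)
```

Each shard only has to exchange one scalar maximum, one scalar sum, and one output-sized vector, which is why a merge of this kind is cheap relative to moving the KV cache itself.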
Article 50
Title@2025-07-07 (1): Cooperative Gradient Coding
Title: Cooperative Gradient Coding | Kooperative Gradientencodierung | 合作渐进编码 2507.05230v1 |
Authors (4): Shudi Weng, Ming Xiao, Chao Ren, Mikael Skoglund
This work studies gradient coding (GC) in the context of distributed training problems with unreliable communication. We propose cooperative GC (CoGC), a novel gradient-sharing-based GC framework that leverages cooperative communication among clients. This approach ultimately eliminates the need for dataset replication, making it both communication- and computation-efficient and suitable for federated learning (FL). By employing the standard GC decoding mechanism, CoGC yields strictly binary outcomes: either the global model is exactly recovered, or the decoding fails entirely, with no intermediate results. This characteristic ensures the optimality of the training and demonstrates strong resilience to client-to-server communication failures when the communication channels among clients are in good condition. However, it may also result in communication inefficiency and hinder convergence due to its lack of flexibility, especially when communication channels among clients are in poor condition. To overcome this limitation and further harness the potential of GC matrices, we propose a complementary decoding mechanism, termed GC$^+$, which leverages information that would otherwise be discarded during GC decoding failures. This approach significantly improves system reliability under unreliable communication, as the full recovery of the global model typically dominates in GC$^+$. To conclude, this work establishes solid theoretical frameworks for both CoGC and GC$^+$. We provide complete outage analyses for each decoding mechanism, along with a rigorous investigation of how outages affect the structure and performance of GC matrices. Building on these analyses, we derive convergence bounds for both decoding mechanisms. Finally, the effectiveness of CoGC and GC$^+$ is validated through extensive simulations.
这项工作研究在分布式培训问题和不可靠的通信背景下的梯度编码(GC)问题。我们建议合作性GC(GC),这是一个新的基于梯度共享的GC(GC)框架,利用客户之间合作交流的渠道,利用基于梯度的GC(GC)框架,利用不可靠的通信,最终消除对数据集复制的需要,使其既具有通信效率和计算效率,又适合联邦化学习(FL)。通过使用标准的GC解码机制,CO(GC)产生严格的二元结果:要么全球模式完全恢复,要么解码完全失败,没有中间结果。这确保了培训的最佳性,并显示出在客户之间通信渠道良好时,对客户与服务器之间通信失败的强烈复原力。然而,由于缺乏灵活性,特别是客户之间的通信渠道条件差,这种办法还可能导致通信效率低下,妨碍汇合。为了克服这一限制,进一步利用GC($)框架的潜力,我们提议一个补充性解码机制,即GC(GC)美元,在GC(C)解码失败期间,利用本来会放弃的信息。这一方法在不可靠的通信下大大改进系统的系统可靠性,因为全面恢复G(C)机制,通常由GC(C)和CAS)分析成为GC(C)的完整。
Article 51
Title@2025-07-07 (1): GPU-based complete search for nonlinear minimization subject to bounds
Title: GPU-based complete search for nonlinear minimization subject to bounds | GPU-basierte komplette Suche nach nichtlinearer Minimierung unter Grenzen | 基于 GPU 的基于 GPU 的完整搜索, 以不受约束的方式对非线性最小化进行搜索 2507.01770v2 |
Authors (3): Guanglu Zhang, Qihang Shan, Jonathan Cagan
This paper introduces a GPU-based complete search method to enclose the global minimum of a nonlinear function subject to simple bounds on the variables. Using interval analysis, coupled with the computational power and architecture of GPU, the method iteratively rules out the regions in the search domain where the global minimum cannot exist and leaves a finite set of regions where the global minimum must exist. For effectiveness, because of the rigor of interval analysis, the method is guaranteed to enclose the global minimum of the nonlinear function even in the presence of rounding errors. For efficiency, the method employs a novel GPU-based single program, single data parallel programming style to circumvent major GPU performance bottlenecks, and a variable cycling technique is also integrated into the method to reduce computational cost when minimizing large-scale nonlinear functions. The method is validated by minimizing 10 multimodal benchmark test functions with scalable dimensions, including the well-known Ackley function, Griewank function, Levy function, and Rastrigin function. These benchmark test functions represent grand challenges of global optimization, and enclosing the guaranteed global minimum of these benchmark test functions with more than 80 dimensions has not been reported in the literature. Our method completely searches the feasible domain and successfully encloses the guaranteed global minimum of these 10 benchmark test functions with up to 10,000 dimensions using only one GPU in a reasonable computation time, far exceeding the reported results in the literature due to the unique method design and implementation based on GPU architecture.
本文引入了一个基于 GPU 的完整搜索方法, 以包含非线性函数的全球最小值, 并受变量的简单界限 。 使用间隔分析, 加上 GPU 的计算力和架构, 该方法迭代排除了搜索域中无法存在全球最低值的区域, 并留下了一组必须存在全球最低值的区域 。 关于有效性, 由于间隔分析的严格性能, 该方法保证包含非线性函数的全球最低值 , 即使存在四舍五入错误 。 关于效率, 该方法使用一个新的基于 GPU 的单一程序、 单一数据平行编程风格, 以绕过 GPU 的主要性能瓶颈, 以及可变的循环法技术也被纳入了在尽量减少大规模非线性功能时降低计算成本的方法 。 对于该方法的验证方法是将10个具有可缩放尺寸的多模式基准测试功能, 包括众所周知的 Ackley 函数、 Griewank 函数、 Levy 函数 和 Rastrigin 函数 。 这些基准测试功能是全球优化的巨大挑战, , 这些基准性测试功能代表着全球优化的巨大挑战, , 将这些基准性测试功能与超过 80个范围的保证最低值的最低限度的最小值连接值连接值连接值 , 在10 上, 我们的模型在10个基准值中报告了10个基准值的模型的搜索中只基底域域域域域域中, 。
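The rule-out-and-subdivide loop described in the abstract can be sketched on a single variable. This is a serial, CPU-only illustration under stated assumptions: the `Interval` class uses ordinary floating-point arithmetic without outward rounding (so it lacks the rigor the paper claims), the objective $f(x) = x^2 - 2x$ is a hypothetical test function, and none of the GPU parallelism or variable cycling is modelled.

```python
class Interval:
    """Closed interval [lo, hi] with naive (non-outward-rounded) arithmetic."""

    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __sub__(self, other):
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))


def objective(x):
    # Hypothetical test function f(x) = x^2 - 2x; true minimum -1 at x = 1.
    return x * x - Interval(2.0, 2.0) * x


def enclose_minimum(f, lo, hi, tol=1e-4):
    """Return (lower, upper, boxes): bounds on the minimum and surviving boxes."""
    boxes = [Interval(lo, hi)]
    upper = float("inf")
    while True:
        # Any function value at a sample point is an upper bound on the minimum.
        for box in boxes:
            mid = 0.5 * (box.lo + box.hi)
            upper = min(upper, f(Interval(mid, mid)).hi)
        # Rule out boxes whose interval lower bound already exceeds that bound.
        boxes = [box for box in boxes if f(box).lo <= upper]
        if max(box.hi - box.lo for box in boxes) < tol:
            break
        # Bisect every surviving box and repeat.
        halves = []
        for box in boxes:
            mid = 0.5 * (box.lo + box.hi)
            halves.append(Interval(box.lo, mid))
            halves.append(Interval(mid, box.hi))
        boxes = halves
    lower = min(f(box).lo for box in boxes)
    return lower, upper, boxes


lower, upper, boxes = enclose_minimum(objective, -3.0, 4.0)
```

The surviving boxes form the finite set of regions where the minimum must lie, and `[lower, upper]` encloses its value up to the floating-point caveat noted above.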
Article 52
Title@2025-07-07 (1): MoLink: Distributed and Efficient Serving Framework for Large Models
Title: MoLink: Distributed and Efficient Serving Framework for Large Models | MoLink: Verteilter und effizienter Servierrahmen für große Modelle | MoLink:大型模型分配和高效服务框架 2507.05043v1 |
Authors (8): Lewei Jin, Yongqi Chen, Kui Zhang, Yifan Zhuo, Yi Gao, Bowei Yang, Zhengong Cai, Wei Dong
Large language models represent a groundbreaking shift in generative AI. Yet, these advances come with a significant challenge: the high cost of model serving. To mitigate these costs, consumer-grade GPUs emerge as a more affordable alternative. This presents an opportunity for more cost-efficient LLM serving by leveraging these GPUs. However, it is non-trivial to achieve high-efficiency LLM serving on consumer-grade GPUs, mainly due to two challenges: 1) these GPUs are often deployed in limited network conditions; 2) these GPUs often exhibit heterogeneity in host systems. To address these challenges, we present MoLink, a distributed LLM serving system for large models. It incorporates several key techniques, enabling efficient LLM serving on heterogeneous and weakly connected consumer-grade GPUs. Our experiments demonstrate that it achieves throughput improvements of up to 458% and cost-profit margin improvements of up to 151%, compared to state-of-the-art systems. MoLink allows users on Windows, Linux, and containerized VMs to seamlessly integrate GPUs with just a few lines of code over Ethernet or public networks. Currently, it supports 18 mainstream architectures of open-source large language models.
大型语言模型代表了基因化AI的突破性转变。 然而,这些进步带来了巨大的挑战:模型服务的高昂成本。为了降低这些成本,消费者级的GPU成为了更廉价的替代方案。这为利用这些GPU提供更具成本效益的LLM服务提供了一个机会。然而,实现高效的LLM在消费者级GPU上服务是非三重性的,这主要是由于两个挑战:1)这些GPU常常在有限的网络条件下部署;2)这些GPU往往表现出东道系统中的异质性。为了应对这些挑战,我们提出了Molink,一个分布式LM服务系统,用于大型模型。它包含了若干关键技术,使LLM能够在多样化和连接薄弱的消费者级GPU上服务。我们的实验表明,它通过投入实现了高达458的改进,并且与最先进的系统相比,成本-利润幅度高达151。MOLink允许视窗、Linux和集装箱化VMS的用户将GPPPS与少数的开放型号组合,支持了Enet或18个主流公共网络的大型代码。
Article 53
Title@2025-07-07 (1): Distributed Approximation Algorithms for Minimum Dominating Set in Locally Nice Graphs
Title: Distributed Approximation Algorithms for Minimum Dominating Set in Locally Nice Graphs | Verteilte Annäherungsalgorithmen für das Minimum dominierendes Set in lokal schönen Grafiken | 本地尼斯图表中最小主导设置的分布式近似分布比例比值 2507.04960v1 |
Authors (4): Marthe Bonamy, Cyril Gavoille, Timothé Picavet, Alexandra Wesolek
We give a new, short proof that graphs embeddable in a given Euler genus-$g$ surface admit a simple $f(g)$-round $\alpha$-approximation distributed algorithm for Minimum Dominating Set (MDS), where the approximation ratio $\alpha \le 906$. Using tricks from Heydt et al. [European Journal of Combinatorics (2025)], we in fact derive that $\alpha \le 34 +\varepsilon$, therefore improving upon the current state of the art of $24g+O(1)$ due to Amiri et al. [ACM Transactions on Algorithms (2019)]. It also improves the approximation ratio of $91+\varepsilon$ due to Czygrinow et al. [Theoretical Computer Science (2019)] in the particular case of orientable surfaces. All our distributed algorithms work in the deterministic LOCAL model. They do not require any preliminary embedding of the graph and only rely on two things: a LOCAL algorithm for MDS on planar graphs with "uniform" approximation guarantees and the knowledge that graphs embeddable in bounded Euler genus surfaces have asymptotic dimension $2$. More generally, our algorithms work in any graph class of bounded asymptotic dimension where "most vertices" are locally in a graph class that admits a LOCAL algorithm for MDS with uniform approximation guarantees.
我们给出了一个新的简短证据, 显示在特定 Euler genus- g$ 表面上嵌入的图表中, 认可了一个简单的 $f(g) 圆 $ ALpha$- accolation 分配算法( MDS) , 即近似比率$ ALpha\le 906美元。 使用 Heydt et al. [欧洲组合学杂志( 2025 ) 的技巧, 我们事实上得出, $ alpha\le 34 varepsilon$, 因此, 由于 Amiri 等人, 24g+O(1)美元当前的艺术状态有了改善。 [ ACMAL- gal- pal- progalalalalal- deal- squalations (2019 ) , 也改善了由于Czygrigranow 等人的近似比率为91 rqual- volations , [The the LOCLAL libal ligalal dals ligalgals as ligals suplobal supals supals) , 在可确定als mals mustals musts musts 上, 任何可移动图图面面面面面面面的保证。
Article 54
Title@2025-07-07 (1): Bullshark on Narwhal: Implementation-level Workflow Analysis of Round-based DAG Consensus in Theory and Practice
Title: Bullshark on Narwhal: Implementation-level Workflow Analysis of Round-based DAG Consensus in Theory and Practice | Bullshark on Narwhal: Implementation-Level-Workflow-Analyse des runden DAG-Konsenses in Theorie und Praxis | Narwhal Bullshark on Narwhal:关于基于DAG理论和实践共识的圆桌工作流量分析执行层面的工作流量分析 2507.04956v1 |
Authors (1): Yusei Tanaka
Round-based DAGs enable high-performance Byzantine fault-tolerant consensus, yet their technical advantages remain underutilized due to their short history. While research on consensus protocols is active in both academia and industry, many studies overlook implementation-level algorithms, leaving actual performance unclear - particularly for theoretical protocols whose practical performance cannot often be evaluated. Bullshark, a Round-based DAG BFT protocol on Narwhal mempool, achieves optimal performance: 297,000 transactions per second with 2-second latency. We analyze the algorithm’s workflow, from transaction submission to blockchain commitment, breaking it down layer by layer at the functional level and delineating the key features and interactions of the Bullshark and Narwhal components. Future work aims to improve performance in Byzantine fault environments and optimize trade-offs in the CAP theorem.
以圆桌为基础的DAG能够达成高性能的拜占庭断裂容忍共识,但由于历史短促,它们的技术优势仍然没有得到充分利用。 虽然关于协商一致协议的研究在学术界和工业界都很活跃,但许多研究忽略了执行层面的算法,使得实际绩效不明确,特别是理论协议的实际绩效常常无法评估。 以圆桌为基础的DAG BFT协议的Bullshark在Narwhal Mempo 上取得了最佳绩效:每秒29.7万笔交易和2秒的延迟。 我们分析了算法的工作流程,从交易提交到连锁承诺,在功能层面将其分层分解,并分解布尔沙克和纳尔瓦勒两部分的关键特征和互动关系。 未来工作的目的是改善拜占庭断层环境的业绩,优化CAP理论的权衡。
Article 55
Title@2025-07-07 (1): BackFed: An Efficient & Standardized Benchmark Suite for Backdoor Attacks in Federated Learning
Title: BackFed: An Efficient & Standardized Benchmark Suite for Backdoor Attacks in Federated Learning | BackFed: Eine effiziente und standardisierte Benchmark-Suite für Backdoor-Angriffe im Federated Learning | BackFeded:针对联邦学习联合会的后门袭击的高效和标准化基准套件 2507.04903v1 |
Authors (4): Thinh Dao, Dung Thuy Nguyen, Khoa D Doan, Kok-Seng Wong
Federated Learning (FL) systems are vulnerable to backdoor attacks, where adversaries train their local models on poisoned data and submit poisoned model updates to compromise the global model. Despite numerous proposed attacks and defenses, divergent experimental settings, implementation errors, and unrealistic assumptions hinder fair comparisons and valid conclusions about their effectiveness in real-world scenarios. To address this, we introduce BackFed - a comprehensive benchmark suite designed to standardize, streamline, and reliably evaluate backdoor attacks and defenses in FL, with a focus on practical constraints. Our benchmark offers key advantages through its multi-processing implementation that significantly accelerates experimentation and the modular design that enables seamless integration of new methods via well-defined APIs. With a standardized evaluation pipeline, we envision BackFed as a plug-and-play environment for researchers to comprehensively and reliably evaluate new attacks and defenses. Using BackFed, we conduct large-scale studies of representative backdoor attacks and defenses across both Computer Vision and Natural Language Processing tasks with diverse model architectures and experimental settings. Our experiments critically assess the performance of proposed attacks and defenses, revealing unknown limitations and modes of failures under practical conditions. These empirical insights provide valuable guidance for the development of new methods and for enhancing the security of FL systems. Our framework is openly available at https://github.com/thinh-dao/BackFed.
联邦学习联合会(FL)系统很容易受到幕后攻击,因为对手们就有毒数据培训当地模型,并提交有毒的模型更新,以损害全球模型。尽管提出了许多攻击和防御建议,但不同的实验环境、执行错误和不切实际的假设都阻碍了在现实世界情景中对其有效性进行公平的比较和有效结论。为了解决这个问题,我们引入了FackFed(Facked)系统——一个综合基准套件,旨在对Flax(Flaxed)的幕后攻击和防御进行标准化、精简和可靠评估,重点是实际制约因素。我们的基准通过其多处理实施提供了关键优势,它大大加快了试验和模块设计,通过明确界定的API将新方法无缝地整合。我们设想了一个标准化的评价管道,作为研究人员全面可靠地评价新的攻击和防御的插座和游戏环境。我们利用BackFack(B)对具有代表性的后门攻击和防御任务进行了大规模研究,同时以各种模式架构和实验环境为重点。我们的实验评估了拟议攻击和防御的绩效,揭示了在实际条件下的未知的限制和失败模式。这些经验洞察为我们Fax/B(AG)的新系统的开发提供了宝贵的安全框架。
Article 56
Title@2025-07-07 (1): Phantom Subgroup Poisoning: Stealth Attacks on Federated Recommender Systems
Title: Phantom Subgroup Poisoning: Stealth Attacks on Federated Recommender Systems | Phantom Subgroup Gifting: Stealth Attacks auf Federated Recommender Systems | 幻影分组中毒:对联邦建议系统进行隐形袭击 2507.06258v1 |
Authors (8): Bo Yan, Yurong Hao, Dingqi Liu, Huabin Sun, Pengpeng Qiao, Wei Yang Bryan Lim, Yang Cao, Chuan Shi
Federated recommender systems (FedRec) have emerged as a promising solution for delivering personalized recommendations while safeguarding user privacy. However, recent studies have demonstrated their vulnerability to poisoning attacks. Existing attacks typically target the entire user group, which compromises stealth and increases the risk of detection. In contrast, real-world adversaries may prefer to promote target items to specific user subgroups, such as recommending health supplements to elderly users. Motivated by this gap, we introduce Spattack, the first targeted poisoning attack designed to manipulate recommendations for specific user subgroups in the federated setting. Specifically, Spattack adopts a two-stage approximation-and-promotion strategy, which first simulates user embeddings of target/non-target subgroups and then promotes target items to the target subgroups. To enhance the approximation stage, we push the inter-group embeddings away based on contrastive learning and augment the target group’s relevant item set based on clustering. To enhance the promotion stage, we further propose to adaptively tune the optimization weights between target and non-target subgroups. Besides, an embedding alignment strategy is proposed to align the embeddings between the target items and the relevant items. We conduct comprehensive experiments on three real-world datasets, comparing Spattack against seven state-of-the-art poisoning attacks and seven representative defense mechanisms. Experimental results demonstrate that Spattack consistently achieves strong manipulation performance on the specific user subgroup, while incurring minimal impact on non-target users, even when only 0.1% of users are malicious. Moreover, Spattack maintains competitive overall recommendation performance and exhibits strong resilience against existing mainstream defenses.
联邦建议系统(FedRec)已成为在保护用户隐私的同时提供个性化建议的一个很有希望的解决方案。然而,最近的研究显示,它们很容易受到毒害攻击。现有的攻击通常针对整个用户群体,这些袭击会破坏隐形,增加检测风险。相反,现实世界对手可能更愿意将目标物品及时送到特定的用户分组,例如向老年用户建议健康补充。受这一差距的驱使,我们引入了首次有针对性的中毒攻击,目的是在联合的环境下操纵特定用户分组的建议。具体而言,Sprepack采用了两阶段的近似和促动战略,首先模拟目标/非目标分组的用户嵌入目标/非目标分组,然后将目标项目推广到目标分组。为了加强近似阶段,我们推向不同的用户分组,根据对比对比健康补充的学习,扩大目标群体基于集群的相关项目。为了加强推广阶段,我们进一步建议只调整目标与非目标组合分组之间的优化权重度。此外,一个嵌入式调整战略,首先模拟目标/非目标分组的用户嵌入目标/非目标分组,然后将目标分组的用户对目标级战略进行不升级,同时对具体防御项目进行比较。
Article 57
Title@2025-07-07 (1): High Order Collaboration-Oriented Federated Graph Neural Network for Accurate QoS Prediction
Title: High Order Collaboration-Oriented Federated Graph Neural Network for Accurate QoS Prediction | High Order Collaboration-Oriented Federated Graph Neural Network für genaue QoS-Vorhersage | 高级秩序协作-以联邦州际同步预测神经网络 2507.05308v1 |
Authors (2): Zehuan Chen, Xiangwei Lai
Predicting Quality of Service (QoS) data is crucial for cloud service selection, where user privacy is a critical concern. Federated Graph Neural Networks (FGNNs) can perform QoS data prediction while maintaining user privacy. However, existing FGNN-based QoS predictors commonly implement on-device training on scattered explicit user-service graphs, thereby failing to utilize the implicit user-user interactions. To address this issue, this study proposes a high order collaboration-oriented federated graph neural network (HC-FGNN) to obtain accurate QoS prediction with privacy preservation. Concretely, it magnifies the explicit user-service graphs following the principle of the attention mechanism to obtain the high order collaboration, which reflects the implicit user-user interactions. Moreover, it utilizes a lightweight message aggregation scheme to improve the computational efficiency. Extensive experiments on two QoS datasets from real applications indicate that the proposed HC-FGNN offers both high prediction accuracy and privacy protection.
预测服务的质量(QOS)数据对于选择云服务至关重要,因为用户隐私是一个关键问题。联邦神经网络(FGNN)可以进行QOS数据预测并维护用户隐私。然而,现有的基于FGNN的QOS预测器通常对分散的明确的用户服务图表进行现场设计培训,从而无法利用隐含的用户-用户互动。为解决这一问题,本研究报告建议建立一个高顺序、面向协作的配制式图形神经网络(HC-FGNN),以获得准确的QOS预测并保护隐私。具体地说,它根据关注机制原则放大明确的用户服务图,以获得高度订单协作,这反映了用户与用户之间的暗中互动。此外,它利用基于轻量量信息汇总的方法提高计算效率。从实际应用中对两个QOS数据集进行的广泛实验表明,拟议的HC-FNNN具有高预测准确性和隐私保护的优势。
Article 58
Title@2025-07-07 (1): A fast MPI-based Distributed Hash-Table as Surrogate Model demonstrated in a coupled reactive transport HPC simulation
Title: A fast MPI-based Distributed Hash-Table as Surrogate Model demonstrated in a coupled reactive transport HPC simulation | Eine schnelle MPI-basierte verteilte Hash-Tabelle als Surrogate-Modell in einer gekoppelten reaktiven Transport HPC-Simulation demonstriert | 快速基于 MPI 的散散散散散口表,作为代用模型,在同时反应性运输的HPC模拟中演示 2504.14374v2 |
Authors (4): Max Lübke, Marco De Lucia, Stefan Petri, Bettina Schnor
Surrogate models can play a pivotal role in enhancing performance in contemporary High-Performance Computing applications. Cache-based surrogates use already calculated simulation results to interpolate or extrapolate further simulation output values. But this approach only pays off if retrieving the needed values is much faster than running the actual simulation. While most existing key-value stores use a Client-Server architecture with dedicated storage nodes, this is not the most suitable architecture for HPC applications. Instead, we propose a distributed architecture where the parallel processes offer a part of their available memory to build a shared distributed hash table based on MPI. This paper presents three DHT approaches with the special requirements of HPC applications in mind. The presented lock-free design outperforms both DHT versions that use explicit synchronization through coarse-grained or fine-grained locking, respectively. The lock-free DHT shows very good scaling regarding read and write performance. The runtime of a coupled reactive transport simulation was reduced by between 14% and 42% using the lock-free DHT as a surrogate model.
代理模型可以在提高当代高性能计算应用程序的性能方面发挥关键作用。 以缓冲为基础的代孕器使用已经计算好的模拟结果, 以内推或外推进一步的模拟输出值。 但这种方法只有在检索所需值的存取时间比实际模拟快得多的情况下才会有效果。 虽然大多数现有关键值仓库使用客户- 服务器结构, 并配有专门的存储节点, 但这不是 HPC 应用程序最合适的结构 。 相反, 我们提议一个分布式结构, 其平行进程提供其可用记忆的一部分, 以建立基于 MPI 的共享分布式散货桌 。 本文介绍了三种 DHT 方法, 并附有 HPC 应用程序的特殊要求。 推出的无锁设计比 DHT 版本都快。 这两种版本都使用明显的同步, 使用粗微重新涂料的精细加固锁。 DHT 显示无锁功能在读写性能上非常适合的缩放量。 相反, 我们提议一个分布式的架构, 。 匹配式反应式运输模拟的运行时间在14 % 和 42% 之间, 。
Article 59
Title@2025-07-07 (1): Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms
Title: Demystifying NCCL: An In-depth Analysis of GPU Communication Protocols and Algorithms | Entmystifizierende NCCL: Eine eingehende Analyse der GPU-Kommunikationsprotokolle und -algorithmen | 解开NCCL的神秘性:深入分析GPU通信协议和等级 2507.04786v1 |
Authors (8): Zhiyi Hu, Siyuan Shen, Tommaso Bonato, Sylvain Jeaugey, Cedell Alexander, Eric Spada, Jeff Hammond, Torsten Hoefler
The NVIDIA Collective Communication Library (NCCL) is a critical software layer enabling high-performance collectives on large-scale GPU clusters. Despite being open source with a documented API, its internal design remains largely opaque. The orchestration of communication channels, selection of protocols, and handling of memory movement across devices and nodes are not well understood, making it difficult to analyze performance or identify bottlenecks. This paper presents a comprehensive analysis of NCCL, focusing on its communication protocol variants (Simple, LL, and LL128), mechanisms governing intra-node and inter-node data movement, and ring- and tree-based collective communication algorithms. The insights obtained from this study serve as the foundation for ATLAHS, an application-trace-driven network simulation toolchain capable of accurately reproducing NCCL communication patterns in large-scale AI training workloads. By demystifying NCCL’s internal architecture, this work provides guidance for system researchers and performance engineers working to optimize or simulate collective communication at scale.
NVIDIA集体通信图书馆(NCIDIA集体通信图书馆)是一个关键的软件层,能够对大型GPU群群集进行高性能集体分析。尽管它是公开的来源,有文件证明的API,但其内部设计仍然基本不透明。对通信渠道的布局、协议的选择以及装置和节点之间记忆移动的处理没有很好地理解,因此难以分析业绩或查明瓶颈问题。本文件对NVIDIA集体通信图书馆(NVIDIA集体通信图书馆)进行了全面分析,重点是其通信协议变异(Spealy、LLL和LLL128)、规范节内和节内数据流动的机制以及环形和树基集体通信算法。从这项研究获得的见解是ATLAHS的基础,这是一个由应用程序驱动的网络模拟工具链,能够在大规模AI培训工作量中准确复制NCCL通信模式。通过解说NCLCLC的内部结构,为系统研究人员和绩效工程师在规模上优化或模拟集体通信提供了指导。
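The abstract mentions ring-based collective algorithms without spelling one out. As a point of reference, the sketch below simulates the textbook ring all-reduce (a reduce-scatter phase followed by an all-gather phase). It is not taken from NCCL's source: the function name, the one-number-per-chunk layout, and the synchronous step model are all simplifying assumptions.

```python
def ring_allreduce(buffers):
    """Simulate a sum all-reduce over n ranks, each holding n one-element chunks."""
    n = len(buffers)
    data = [list(b) for b in buffers]

    # Phase 1, reduce-scatter: in step s, rank r sends chunk (r - s) mod n to
    # its right neighbour, which adds it to its own copy of that chunk.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] += value

    # Now rank r holds the fully reduced chunk (r + 1) mod n.
    # Phase 2, all-gather: circulate the finished chunks around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, data[r][(r + 1 - step) % n])
                 for r in range(n)]
        for r, chunk, value in sends:
            data[(r + 1) % n][chunk] = value

    return data


result = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
```

Each rank sends and receives one chunk per step for 2(n - 1) steps, so the per-rank traffic stays close to twice the buffer size regardless of the number of ranks, which is the usual argument for ring algorithms on large messages.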
Article 60
Title@2025-07-07 (1): Communication Round and Computation Efficient Exclusive Prefix-Sums Algorithms (for MPI_Exscan)
Title: Communication Round and Computation Efficient Exclusive Prefix-Sums Algorithms (for MPI_Exscan) | Kommunikationsrunde und Computation Effiziente exklusive Präfix-Summe Algorithmen (für MPI_Exscan) | 通信回合和计算效率(MPI_Exscan) 2507.04785v1 |
Authors (1): Jesper Larsson Träff
Parallel scan primitives compute element-wise inclusive or exclusive prefix sums of input vectors contributed by $p$ consecutively ranked processors under an associative, binary operator $\oplus$. In message-passing systems with bounded, one-ported communication capabilities, at least $\lceil\log_2 p\rceil$ or $\lceil\log_2 (p-1)\rceil$ communication rounds are required to perform the scans. While there are well-known, simple algorithms for the inclusive scan that solve the problem in $\lceil\log_2 p\rceil$ communication rounds with $\lceil\log_2 p\rceil$ applications of $\oplus$ (which could be expensive), the exclusive scan appears more difficult. Conventionally, the problem is solved with either $\lceil\log_2 (p-1)\rceil+1$ communication rounds (e.g., by shifting the input vectors), or in $\lceil\log_2 p\rceil$ communication rounds with $2\lceil\log_2 p\rceil-1$ applications of $\oplus$ (by a modified inclusive scan algorithm). We give a new, simple algorithm that computes the exclusive prefix sums in $q=\lceil\log_2 (p-1)+\log_2\frac{4}{3}\rceil$ simultaneous send-receive communication rounds with $q-1$ applications of $\oplus$. We compare the three algorithms implemented in MPI against the MPI library native MPI_Exscan primitive on a small, $36$-node cluster with a state-of-the-art MPI library, indicating possible and worthwhile improvements to standard implementations. The algorithms assume input vectors to be small so that performance is dominated by the number of communication rounds. For large input vectors, other (pipelined, fixed-degree tree) algorithms must be used.
光线原始扫描 计算元素包含性值或纯正正方程式 , 由 $\ licil\ log\ 2 p\ pplus$。 在带有 $\ lceil\ log\ 2 p\ rceil_ 2 p\ rceil_ 2 (p-1\ rcele$) 或$\ lceil\ log_ 2 (p-1\ rcele$) 的端端矢量分析器中, 由 $\ lceil\ log_ 2 p\ rceil$ 组成的连续级处理器所贡献的内存矢量分析器 。 在有 $\ lceil\ log\ 2 prceil\ prceil2 prceil_ prceil2 应用器中, 唯一的扫描器似乎更难。 常规上, 要么是 $lceililil\ log_ log_ 2 (pil_ mail_ lical_ lical_ pral_ dal_ max) max a missional_ motional_ mocal_ motional_ motional_ motional_ motional_ motional_ motional_ mocal_ mocal_ mocal_ mocal_ motional_ motional_ motional_ motional_ motional_ mocal_ motional_ motional_ motional_ mocal_ mocal_ mocal_ mocal_ mocal_ mocal_ mocal_ mocal_ mocal_ mocal_ mocal_ mocal_ mocal_ mocal_ mocal_ mocal_ mocal_ mocal_ mocal_ mocal_ mocal_ modal_ modal_ mocal_ modal_ pral_ modal_ modal_ motional_ lical_ p
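The conventional baseline named in the abstract (shift the inputs by one rank, then run an inclusive scan) can be simulated in a few lines. This is only a sketch of that baseline, not of the paper's new algorithm; the function name and the list-based simulation of ranks are assumptions, and the inclusive scan used is plain recursive doubling.

```python
def exscan_by_shift(values, op):
    """Simulate an exclusive scan over p ranks; returns (results, rounds).

    Rank 0 has no defined result and is reported as None.
    """
    p = len(values)
    rounds = 0

    # Round 1: every rank i sends its input to rank i + 1.
    partial = [None] + list(values[:-1])
    if p > 1:
        rounds += 1

    # Inclusive scan by recursive doubling over ranks 1 .. p-1.
    d = 1
    while d < p - 1:
        prev = list(partial)  # values as they were at the start of the round
        for i in range(1 + d, p):
            # The lower-ranked operand goes on the left, so op need not commute.
            partial[i] = op(prev[i - d], prev[i])
        rounds += 1
        d *= 2

    return partial, rounds


# String concatenation is associative but not commutative, so it checks ordering.
results, rounds = exscan_by_shift(list("abcdefg"), lambda a, b: a + b)
```

For $p = 7$ the simulation uses one shift round plus $\lceil\log_2 6\rceil = 3$ scan rounds, matching the $\lceil\log_2 (p-1)\rceil+1$ count quoted in the abstract for this approach.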
Article 61
Title@2025-07-07 (1): Semitopology: distributed collaborative action via topology, algebra, and logic
Title: Semitopology: distributed collaborative action via topology, algebra, and logic | Semitopologie: verteilte kollaborative Aktion über Topologie, Algebra und Logik | 超土学:通过地形学、代数和逻辑进行分布式合作行动 2402.03253v3 |
Authors (1): Murdoch J. Gabbay
We introduce semitopologies, a generalisation of point-set topology that removes the restriction that intersections of open sets need necessarily be open. The intuition is that points are participants in some distributed system, and an open set is a collection of participants that can collaborate to update their local state by taking a distributed collaborative action; we call this an actionable coalition. What constitutes an actionable coalition depends on what actions we want to model. Intuitive examples include ‘a group of people that is collectively strong enough to lift a rock’, where the state update is very simply ‘holding rock low’ to ‘holding rock high’ and this update is common to all participants in the actionable coalition. Or, consider ‘two people wishing to barter a can of juice for a bar of chocolate’, in which case the coalition is any such pair and the state updates differ between participants to flip them between ‘has/has no juice’ and ‘has/has no chocolate’. A characteristic of these systems is that state updates are local to the coalition, voluntary, may vary between participants, and are not assumed subject to permission or synchronisation by a central authority. Peer-to-peer computer networks, including filesharing and blockchain systems, provide motivating examples from computing. This monograph presents a comprehensive view of semitopologies which includes point-set semitopology, algebra, and logic inspired by these considerations. This is interesting in and of itself and it provides a conceptual framework within which to understand a useful class of distributed systems.
我们引入了半模式, 将点定的地形学概括化, 从而消除开放组合交叉点必然需要开放的限制。 直觉是点点是某些分布式系统中的参与者, 而一个开放的一组是参与者的集合, 他们可以通过采取分布式合作行动来合作更新当地状态; 我们称之为一个可操作的联盟。 构成一个可操作的联盟取决于我们想要模拟什么行动。 直观的例子包括“ 一群集体强大到足以提升岩石的人群 ” , 国家更新非常简单地“ 保持岩石低位” , 以“ 保持岩石高位 ” , 而这种更新对于可操作的联盟的所有参与者来说都是常见的。 或者, 考虑“ 两个人想要用巧克力棒的果汁罐来换一罐; 这样, 联盟就是一对一对的, 州级更新取决于参与者的“ 有/ 没有果汁” 和“ 有巧克力 ” 。 这些系统的特征是, 国家更新是本地的, 参与者之间可能有所不同的, 并且不会被一个中央权力机关允许或同步的 。 数据库中, 提供了一个数据库的系统 的系统 , 提供了一个动态的系统 的系统, 。
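A tiny finite example may help make the definition concrete. The sketch below assumes, going slightly beyond the abstract's wording, that a semitopology is a family of open sets that contains the empty set and the whole space and is closed under unions, with no requirement on intersections; the function names are illustrative.

```python
from itertools import combinations


def is_semitopology(points, opens):
    """Check the assumed axioms: empty set, whole space, closure under unions."""
    opens = set(opens)
    if frozenset() not in opens or frozenset(points) not in opens:
        return False
    # For a finite family, pairwise closure implies closure under all unions.
    return all(a | b in opens for a, b in combinations(opens, 2))


def is_topology(points, opens):
    """A topology additionally requires pairwise intersections to be open."""
    opens = set(opens)
    return is_semitopology(points, opens) and all(
        a & b in opens for a, b in combinations(opens, 2))


# Three participants; the actionable coalitions are {0, 1} and {1, 2}.
points = {0, 1, 2}
opens = {frozenset(), frozenset({0, 1}), frozenset({1, 2}), frozenset(points)}
```

Here the two coalitions overlap in participant 1, but `{1}` alone is not open, so the family is a semitopology without being a topology: participant 1 cannot act alone even though it belongs to both coalitions.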
Article 62
Title@2025-07-07 (1): Performance Evaluation of General Purpose Large Language Models for Basic Linear Algebra Subprograms Code Generation
Title: Performance Evaluation of General Purpose Large Language Models for Basic Linear Algebra Subprograms Code Generation | Performance-Evaluierung allgemeiner Zwecke Große Sprachmodelle für grundlegende lineare Algebra-Unterprogramme Code-Generierung | 基本线性代代数子方案代码生成通用大语言模型绩效评价 2507.04697v1 |
Authors (4): Daichi Mukunoki, Shun-ichiro Hayashi, Tetsuya Hoshino, Takahiro Katagiri
Generative AI technology based on Large Language Models (LLMs) has been developed and applied to assist with or automatically generate program code. In this paper, we evaluate the capability of existing general-purpose LLMs to generate Basic Linear Algebra Subprograms (BLAS) code for CPUs. We use two LLMs provided by OpenAI: GPT-4.1, a Generative Pre-trained Transformer (GPT) model, and o4-mini, one of the o-series of reasoning models. Both were released in April 2025. For routines from level-1 to level-3 BLAS, we tried to generate (1) C code without optimization from the routine name only, (2) C code with basic performance optimizations (thread parallelization, SIMD vectorization, and cache blocking) from the routine name only, and (3) C code with basic performance optimizations based on Fortran reference code. As a result, we found that correct code can be generated in many cases even when only the routine name is given. We also confirmed that thread parallelization with OpenMP, SIMD vectorization, and cache blocking can be implemented to some extent, and that the code is faster than the reference code.
基于大语言模型(LLM)的人工智能生成技术已经开发并应用,以协助或自动生成程序代码。在本文件中,我们评估了用于基本线性代数子程序代码生成的现有普通LLMs(BLAS)的能力。我们使用由OpenAI提供的两个LMS:GPT-4.1,一种先导变异器模型(GPT)和O4-mini,一种解释模式的O系列。两者都已经在2025年4月发布。对于从1级到3级的常规程序,我们试图生成(1) C 代码,而无需仅从常规名称优化;(2) C代码,仅从常规名称产生基本性能优化(全线平行化、SIMD矢量化和缓存阻塞),以及(3) C代码,基于Fortran参考代码的基本性能优化。结果我们发现,即使只给出例行名称,许多情况下也能够生成正确的代码。我们还确认,与OpenMP、SIMD矢量化和缓存封的线性能在一定程度上实施,而且该代码比参考代码更快。
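As an illustration of the cache-blocking optimization mentioned in the abstract above, here is a minimal sketch of a level-3 BLAS-style matrix product computed tile by tile. It is written in Python for readability (the paper targets C), the function names and the `block` parameter are hypothetical, and it is not code generated in the study.

```python
def gemm_naive(a, b):
    """Reference C = A * B using the textbook triple loop."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for p in range(k):
                c[i][j] += a[i][p] * b[p][j]
    return c


def gemm_blocked(a, b, block=2):
    """Same product, traversed tile by tile (cache blocking).

    Each (i0, p0, j0) tile touches only block x block pieces of A, B and C,
    which is what lets a compiled implementation keep them in cache.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, block):
        for p0 in range(0, k, block):
            for j0 in range(0, m, block):
                for i in range(i0, min(i0 + block, n)):
                    for p in range(p0, min(p0 + block, k)):
                        aip = a[i][p]
                        for j in range(j0, min(j0 + block, m)):
                            c[i][j] += aip * b[p][j]
    return c


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
B = [[7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
print(gemm_blocked(A, B))
```

In pure Python the blocked version is not faster; the point is only the loop structure, which carries over to C where tile sizes would be tuned to the cache hierarchy.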
Article 63
Title@2025-07-07 (1): Learning from the Past: Adaptive Parallelism Tuning for Stream Processing Systems
Title: Learning from the Past: Adaptive Parallelism Tuning for Stream Processing Systems | Aus der Vergangenheit lernen: Adaptive Parallelitäts-Tuning für Stream Processing Systeme | 向过去学习:流流处理系统的适应性平行制图 2504.12074v2 |
Authors (8): Yuxing Han, Lixiang Chen, Haoyu Wang, Zhanghao Chen, Yifan Zhang, Chengcheng Yang, Kongzhang Hao, Zhengyi Yang
Distributed stream processing systems rely on the dataflow model to define and execute streaming jobs, organizing computations as Directed Acyclic Graphs (DAGs) of operators. Adjusting the parallelism of these operators is crucial to handling fluctuating workloads efficiently while balancing resource usage and processing performance. However, existing methods often fail to effectively utilize execution histories or fully exploit DAG structures, limiting their ability to identify bottlenecks and determine the optimal parallelism. In this paper, we propose StreamTune, a novel approach for adaptive parallelism tuning in stream processing systems. StreamTune incorporates a pre-training and fine-tuning framework that leverages global knowledge from historical execution data for job-specific parallelism tuning. In the pre-training phase, StreamTune clusters the historical data with Graph Edit Distance and pre-trains a Graph Neural Network-based encoder per cluster to capture the correlation between the operator parallelism, DAG structures, and the identified operator-level bottlenecks. In the online tuning phase, StreamTune iteratively refines operator parallelism recommendations using an operator-level bottleneck prediction model enforced with a monotonic constraint, which aligns with the observed system performance behavior. Evaluation results demonstrate that StreamTune reduces reconfigurations by up to 29.6% and parallelism degrees by up to 30.8% in Apache Flink under a synthetic workload. In Timely Dataflow, StreamTune achieves up to an 83.3% reduction in parallelism degrees while maintaining comparable processing performance under the Nexmark benchmark, when compared to the state-of-the-art methods.
分散流处理系统依靠数据流模型来定义和执行流程中的工作,将计算作为操作员的定向循环图(DAGs)来组织操作员的计算。调整这些操作员的平行功能对于高效处理波动工作量至关重要,同时平衡资源使用和处理业绩。然而,现有方法往往无法有效利用执行历史或充分利用DAG结构,限制了它们识别瓶颈和确定最佳平行结构的能力。在本文件中,我们提议StreamTune(Stramlelism调整流处理系统的适应性parlelism调整的新办法)。 StreamTune(StreamTune)包含一个培训前和微调框架,将历史执行数据从历史执行数据中的可比较性知识用于具体工作平行调整。在培训前阶段, StreamT(Stramune) 将历史数据组合成“图编辑距离”和前基于图形神经网络的编码每组集,以捕捉到操作员的平行关系、DAGAG结构结构和确定的操作员级瓶颈。在网上调整阶段, StreamT(Stream-leutyal-lex) 6) 操作者使用操作员级的平行操作者建议,在运行中,在测试中通过测试中进行软化的软化模型中,通过观察到的软化结果,在测试中进行业绩模型下进行业绩分析,在测试下,在测试下进行业绩测试下进行稳定的递归。
Article 64
Title@2025-07-07 (1): RAPTOR: Practical Numerical Profiling of Scientific Applications
Title: RAPTOR: Practical Numerical Profiling of Scientific Applications | RAPTOR: Praktische numerische Profilierung wissenschaftlicher Anwendungen | 科学应用实际数字分析 2507.04647v1 |
Authors (7): Faveo Hoerold, Ivan R. Ivanov, Akash Dhruv, William S. Moses, Anshu Dubey, Mohamed Wahib, Jens Domke
The proliferation of low-precision units in modern high-performance architectures increasingly burdens domain scientists. Historically, the choice in HPC was easy: can we get away with 32-bit floating-point operations and lower bandwidth requirements, or is FP64 necessary? Driven by Artificial Intelligence, vendors introduced novel low-precision units for vector and tensor operations, and FP64 capabilities stagnate or are reduced. This is forcing scientists to re-evaluate their codes, but a trivial search-and-replace approach to go from FP64 to FP16 will not suffice. We introduce RAPTOR: a numerical profiling tool to guide scientists in their search for code regions where precision lowering is feasible. Using LLVM, we transparently replace high-precision computations using low-precision units, or emulate a user-defined precision. RAPTOR is a novel, feature-rich approach – with focus on ease of use – to change, profile, and reason about numerical requirements and instabilities, which we demonstrate with four real-world multi-physics Flash-X applications.
现代高性能建筑中低精度单位的扩散日益加重领域科学家的负担。 从历史上看,HPC的选择是容易的:我们能否摆脱32点浮点作业和低带宽要求,或者需要FP64? 由人工智能驱动的供应商为矢量和慢速作业采用了新的低精度单位,以及FP64 能力停滞或减少。这迫使科学家重新评价其代码,但从FP64到FP16的微不足道的搜索和替换方法是不够的。我们采用了RAPTOR:一个数字特征分析工具来指导科学家寻找可以精确降低的代码区域。我们使用LLLVM,透明地取代使用低精度单位的高精度计算,或者模仿用户定义的精确度。RAPTOR是一种新颖的、内容丰富的方法,重点是容易使用 – – 改变、描述和解释数字要求和不稳定性的理由,我们用四种真实世界多物理闪光X应用来证明。
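To make concrete why a search-and-replace from FP64 to FP16 is not enough, the sketch below emulates lower precision by rounding an accumulator after every addition. This is a plain illustration and not RAPTOR's LLVM-based mechanism; the helper names are hypothetical, and the rounding relies on the `struct` format codes `"f"` (IEEE 754 binary32) and `"e"` (binary16).

```python
import struct


def round_to(value, fmt):
    """Round a Python float (FP64) to the precision named by a struct format code."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


def accumulate(values, fmt=None):
    """Sum values, optionally rounding the running total after each addition."""
    total = 0.0
    for v in values:
        total += v
        if fmt is not None:
            total = round_to(total, fmt)
    return total


values = [0.1] * 1000
fp64 = accumulate(values)        # native double precision
fp32 = accumulate(values, "f")   # emulated single precision
fp16 = accumulate(values, "e")   # emulated half precision

for name, result in (("FP64", fp64), ("FP32", fp32), ("FP16", fp16)):
    print(name, result, "abs error:", abs(result - 100.0))
```

The error grows as the accumulator precision shrinks, which is the kind of per-region sensitivity a numerical profiler is meant to expose.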
Article 65
Title@2025-07-07 (1): SPTCStencil: Using Sparse Tensor Cores for Stencil Computation
Title: SPTCStencil: Using Sparse Tensor Cores for Stencil Computation | SPTCStencil: Verwendung von Sparse Tensor Cores für Stencil Computation | SPTCStencil: 使用稀疏张量核心进行Stencil计算 2506.22035v2 |
Authors (4): Qiqi GU, Chenpeng Wu, Heng Shi, Jianguo Yao
Stencil computation, a pivotal numerical method in science and engineering, iteratively updates grid points using weighted neighbor contributions and exhibits strong parallelism for multi-core processors. Current optimization techniques for stencil computation on tensor core accelerators incur substantial overheads due to redundant zero-padding during the transformation to matrix multiplication. To address this, we introduce a sparse computation paradigm that eliminates inefficiencies by exploiting specialized hardware units. This paper exploits the sparsity in these matrices as a feature and presents SPTCStencil, a high-performance stencil computation system accelerated by Sparse Tensor Cores (SpTCs). SPTCStencil is the first to harness SpTCs for acceleration beyond deep learning domains. First, our approach generalizes an efficient transformation of stencil computation into matrix multiplications and specializes this conversion for SpTC compatibility through a novel sparsification strategy. Furthermore, SPTCStencil incorporates a high-performance GPU kernel with systematic optimizations designed to maximize efficiency on SpTCs. Experimental evaluations demonstrate that SPTCStencil 5.46$\times$ and Tensor Core-based approaches by 2.00$\times$ on average.
Stencils 计算是科学和工程中的关键数字方法,利用加权邻里贡献反复更新电网点,并展示了多核心处理器的强大平行。目前,针对对电压核心加速器进行静态计算的最优化技术由于向矩阵乘法转换过程中的多余零涂面而产生了大量的间接费用。为了解决这个问题,我们引入了一种稀疏的计算模式,通过利用专门硬件单位消除效率低下现象。本文利用这些矩阵中的孔隙作为特征,并介绍了由Spass Tensor Core(SpTCs)加速的高性能电线圈计算系统(SPCStencils)。POSCStencils是第一个利用SpTC实现超越深层学习域加速的先行技术。首先,我们的方法将电压计算高效转换为矩阵倍增法,并专门通过新式的蒸气化战略使SpTC兼容性转换。此外,SPCStencilstencils将高性GPU 内壳作为一种功能,系统优化旨在最大限度地提高SpTCs效率的系统优化。实验性评估显示,以2.CStencilstencils$00美元为标准。
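The idea of turning a stencil into a matrix multiplication, and the zero-padding this introduces, can be sketched with a textbook banded-matrix formulation of a 1D stencil. This is only an assumption-laden illustration: the abstract does not describe the paper's actual transformation or sparsification strategy, and all names below are hypothetical.

```python
def stencil_direct(x, weights):
    """Apply a 1D stencil to the interior points of x."""
    r = len(weights) // 2
    return [sum(w * x[i + k - r] for k, w in enumerate(weights))
            for i in range(r, len(x) - r)]


def stencil_matrix(n, weights):
    """Banded matrix whose product with x equals the stencil on interior points."""
    r = len(weights) // 2
    rows = []
    for i in range(r, n - r):
        row = [0.0] * n              # every entry outside the band is padding
        for k, w in enumerate(weights):
            row[i + k - r] = w
        rows.append(row)
    return rows


def matvec(a, x):
    """Dense matrix-vector product."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


x = [1, 2, 4, 8, 16, 32]
weights = [1, -2, 1]                 # second-difference stencil
a = stencil_matrix(len(x), weights)
zero_fraction = sum(v == 0 for row in a for v in row) / (len(a) * len(a[0]))
print(matvec(a, x), zero_fraction)
```

Half of the matrix entries are zeros even in this tiny case, and the fraction grows with the grid size, which is the redundancy that motivates treating the sparsity as a feature.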
Article 66
Title@2025-07-07 (1): CFP: Efficient Optimization of Intra-Operator Parallelism Plans for Large Model Training
Title: CFP: Efficient Optimization of Intra-Operator Parallelism Plans for Large Model Training | CFP: Effiziente Optimierung von Intra-Operator-Parallelisierungsplänen für große Modellschulungen | CFP: 高效优化大型示范培训操作人员内部平行计划 2504.00598v2 |
Authors (10): Weifang Hu, Xuanhua Shi, Yunkai Zhang, Chang Wu, Xuan Peng, Jiaqi Zhai, Hai Jin, Xuehai Qian, Jingling Xue, Yongluan Zhou
Optimizing the parallel training of large models requires exploring intra-operator parallelism plans for a computation graph that typically contains tens of thousands of primitive operators. While the optimization of parallel data processing graphs has been extensively researched in database systems, the vast search space makes it challenging to apply traditional database query optimization methods and algorithms. This paper introduces CFP, an optimization system for intra-operator parallelism that significantly reduces the complexity of searching for parallelism plans by leveraging two structural patterns found in large models. First, we identify parallel-preserving subgraphs, which ensure that the optimal global plan assigns the same parallel strategy to all operators within the subgraph. This approach allows us to avoid enumerating all possible combinations of parallel strategies for these operators. Second, we recognize repetitive subgraph patterns within the large computational graph, enabling us to profile a moderate number of representative subgraphs and accurately estimate the cost of parallelism plans with low overhead. With the significantly reduced search space, we can employ dynamic programming to search for the optimized parallelism plan. In our experiments, we demonstrate that CFP achieves significant speedups compared to the state-of-the-art framework for large models like GPT and LLAMA.
优化对大型模型的平行培训需要探索操作者内部平行的计算图计划,该图通常包含数以万计的原始操作者。虽然平行数据处理图的优化已经在数据库系统中进行了广泛研究,但巨大的搜索空间使得应用传统的数据库查询优化方法和算法具有挑战性。本文介绍了CFP,这是操作者内部平行的优化系统,它通过利用大型模型中发现的两个结构模式,大大降低了平行搜索计划的复杂性。首先,我们确定了平行保存子图,确保最佳全球计划为子图中的所有操作者指定了同样的平行战略。这个方法使我们能够避免为这些操作者列出所有可能的平行战略组合。第二,我们认识到大型计算图中重复的子图型模式,使我们能够描述适度的代表性子图,并准确估计与低间接费用平行计划的成本。随着搜索空间的显著缩小,我们可以使用动态的编程来搜索优化平行计划。在我们的实验中,我们证明CFP能够实现与GPMA和大型模型相比的重大速度。
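The final step described above, dynamic programming over a reduced search space, can be sketched for the simplest case: a linear chain of segments, each of which uses one parallel strategy. The cost tables, the chain shape, and the function name are assumptions made for illustration; the abstract does not specify CFP's cost model or graph structure.

```python
def best_plan(compute_cost, reshard_cost):
    """Pick one strategy per segment of a chain, minimizing total cost.

    compute_cost[i][s]: cost of running segment i under strategy s.
    reshard_cost[s][t]: cost of switching from strategy s to strategy t
                        between two consecutive segments.
    Returns (total cost, list of chosen strategy indices).
    """
    n_strat = len(compute_cost[0])
    best = list(compute_cost[0])     # best[s]: cheapest prefix ending in strategy s
    back = []                        # back-pointers for plan reconstruction
    for costs in compute_cost[1:]:
        prev, cur = [], []
        for t in range(n_strat):
            s = min(range(n_strat), key=lambda s: best[s] + reshard_cost[s][t])
            prev.append(s)
            cur.append(best[s] + reshard_cost[s][t] + costs[t])
        back.append(prev)
        best = cur
    last = min(range(n_strat), key=lambda s: best[s])
    plan = [last]
    for prev in reversed(back):
        plan.append(prev[plan[-1]])
    plan.reverse()
    return best[last], plan


compute = [[4, 1], [1, 5], [4, 1]]   # three segments, two strategies
reshard = [[0, 3], [3, 0]]           # switching strategies costs 3
print(best_plan(compute, reshard))
```

Choosing the locally cheapest strategy per segment gives the plan `[1, 0, 1]` with cost 9 once the two switches are paid for, whereas the dynamic program returns `[1, 1, 1]` with cost 7.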
Article 67
Title@2025-07-07 (1): Denoising Application Performance Models with Noise-Resilient Priors
Title: Denoising Application Performance Models with Noise-Resilient Priors | Denoisierende Anwendungs-Performance-Modelle mit geräuschbeständigen Prioren | 具有噪音-抗应前置物的低度应用性性能模型 2504.10996v2 |
Authors (8): Gustavo de Morais, Alexander Geiß, Alexandru Calotoiu, Gregor Corbin, Ahmad Tarraf, Torsten Hoefler, Bernd Mohr, Felix Wolf
When scaling parallel codes to larger machines, performance models help identify potential bottlenecks. Since analytically designing these mathematical representations is usually challenging, empirical models based on performance measurements offer a practical alternative. Yet, measurements on HPC systems are typically affected by noise, leading to potentially misleading model predictions. To reduce the influence of noise, we introduce application-specific dynamic priors into the modeling process, which we derive from noise-resilient measurements of computational effort and knowledge of typical algorithms used in communication routines. These priors then narrow the search space for our performance models, excluding complexity classes that reflect noise rather than performance. Our approach keeps the models much closer to theoretical expectations and significantly improves their predictive power. Finally, it cuts experimental costs in half by minimizing the number of repeated measurements.
当将平行代码缩放至大型机器时,性能模型有助于发现潜在的瓶颈。由于分析设计这些数学模型通常具有挑战性,基于性能测量的经验模型提供了一种实用的替代方法。然而,对高电聚苯乙烯系统的测量通常受到噪音的影响,从而可能导致模型预测产生误导。为了减少噪音的影响,我们在建模过程中引入了具体应用的动态前科,这是我们从对计算努力和通信常规所用典型算法知识的耐噪音测量中得出的。这些前科随后缩小了我们性能模型的搜索空间,排除了反映噪音而不是性能的复杂类别。我们的方法使这些模型更接近理论预期,并大大提高了它们的预测能力。最后,通过将重复测量的数量减少到一半,将实验成本减半。
Article 68
Title@2025-07-06 (7): Exploring Micro Frontends: A Case Study Application in E-Commerce
Title: Exploring Micro Frontends: A Case Study Application in E-Commerce | Erforschung von Micro Frontends: Eine Anwendungsfallstudie im E-Commerce | 探索微观前沿:电子商务案例研究应用 2506.21297v2 |
Authors (5): Ricardo Hideki Hangai Kojo, Luiz Fernando Corte Real, Renato Cordeiro Ferreira, Thatiane de Oliveira Rosa, Alfredo Goldman
In the micro frontends architectural style, the frontend is divided into smaller components, which can range from a simple button to an entire page. The goal is to improve scalability, resilience, and team independence, albeit at the cost of increased complexity and infrastructure demands. This paper seeks to understand when it is worth adopting micro frontends, particularly in the context of industry. To achieve this, we conducted an investigation into the state of the art of micro frontends, based on both academic and gray literature. We then implemented this architectural style in a marketplace for handcrafted products, which already used microservices. Finally, we evaluated the implementation through a semi-open questionnaire with the developers. At the studied marketplace company, the need for architectural change arose due to the tight coupling between their main system (a Java monolith) and a dedicated frontend system. Additionally, there were deprecated technologies and poor developer experience. To address these issues, the micro frontends architecture was adopted, along with the API Gateway and Backend for Frontend patterns, and technologies such as Svelte and Fastify. Although the adoption of Micro Frontends was successful, it was not strictly necessary to meet the company’s needs. According to the analysis of the mixed questionnaire responses, other alternatives, such as a monolithic frontend, could have achieved comparable results. What made adopting micro frontends the most convenient choice in the company’s context was the monolith strangulation and microservices adoption, which facilitated implementation through infrastructure reuse and knowledge sharing between teams.
在微观前端的建筑风格中,前端被分为小部分,从简单的按钮到整个页面。目标是提高可缩放性、复原力和团队独立性,尽管其成本增加了复杂性和基础设施需求。本文件试图了解何时值得采用微观前端,特别是在工业方面。为此,我们根据学术和灰色文献对微型前端的艺术状态进行了调查。我们随后在手工艺产品的市场中实施了这种建筑风格,这些产品已经使用了微观服务。最后,我们通过与开发商的半开放问卷评估了执行情况。在经过研究的市场公司中,由于主要系统(爪哇单项)和专门的前端系统之间的紧密连接,需要进行建筑变革。此外,为了解决这些问题,我们根据学术和灰色文献对微型前端结构进行了调查。我们随后在手工艺产品的市场中采用了微型前端结构,并且已经使用了已经使用过缩略图的后端模式。最后,我们通过半开放的问卷来评估执行情况。尽管通过最精密的前端基础设施(Java ) 和前端技术的运用方式得到了成功的应用,但是在采用最灵活的前端端分析中, 也实现了对正端分析。在采用最具有可比性的公司进行了成功的前端分析,但通过这种分析后端和最成功的前端技术的后端反应, 也取得了必要的应用。在采用了最成功的前端评估。在采用了最成功的前端技术。在采用后端技术,在采用后端的后端技术,在采用了最成功的前端技术,在采用后端技术,在采用。在采用后端技术,在采用后端技术,在采用后端端端端端端技术,在采用后端分析中,在采用后端技术,在采用后端,在采用后端技术,在采用了最成功的前端,在采用后端,在采用后端,在采用后端,在采用后端,在采用后端,在采用后端分析中实现了。在采用后端选择。在采用后端,在采用后端,在采用后端,在采用后端,在采用了最成功的前端选择,在采用了。在采用后端,在采用后端,在采用后端技术,在采用后端技术,在采用后端技术,在采用后端技术,在采用后端技术,在采用后端技术,在采用后端
Article 69
Title@2025-07-06 (7): Agentic Distributed Computing
Title: Agentic Distributed Computing | Agentisch verteiltes Computing | A. 分配的计算 2507.04459v1 |
Authors (4): Ajay D. Kshemkalyani, Manish Kumar, Anisur Rahaman Molla, Gokarna Sharma
The most celebrated and extensively studied model of distributed computing is the message-passing model, in which each vertex/node of the (distributed network) graph corresponds to a static computational device that communicates with other devices by passing messages. In this paper, we consider the agentic model of distributed computing, which extends the message-passing model in a new direction. In the agentic model, computational devices are modeled as relocatable or mobile computational devices (called agents in this paper), i.e., each vertex/node of the graph serves as a container for the devices, and hence communicating with another device requires relocating to the same node. We study two fundamental graph-level tasks, leader election and minimum spanning tree, in the agentic model, which will enhance our understanding of distributed computation across paradigms. The objective is to minimize both time and memory complexities. Following the literature, we consider the synchronous setting in which each agent performs its operations synchronously with others, and hence the time complexity can be measured in rounds. In this paper, we present two deterministic algorithms for leader election: one for the case of $k<n$ and another for the case of $k=n$, minimizing both time and memory complexities, where $k$ and $n$, respectively, are the number of agents and number of nodes of the graph. Using these leader election results, we develop deterministic algorithms for agents to construct a minimum spanning tree of the graph, minimizing both time and memory complexities. To the best of our knowledge, this is the first study of distributed graph-level tasks in the agentic model with $k\leq n$. Previous studies only considered the case of $k=n$.
最有名和广泛研究的分布式计算模型是 $ 的递增或移动计算工具(本文中称为代理人), 也就是说, 图表的每个顶点/ 节点作为装置的容器, 因此与另一个装置的沟通需要迁移到同一个节点。 我们研究了两个基本的图表级别任务, 领导选举, 和在代理模型中最小的横跨树, 这将提高我们对分布式模式计算的理解。 在代理模型中, 计算设备建模为可转式或移动计算设备( 本文中称为代理人) , 也就是说, 图表的每个顶点/ 节点作为( 分布式网络) 的容器, 因此, 与另一个装置进行通信。 在本文中, 我们使用两个基本图表级别任务, 领导选举的首席选举和最起码的树型算法 。 目标是将时间和记忆的复杂度降低到时间。 在文献中, 我们考虑每个代理人进行其操作的同步的基点, 并且可以用最复杂的时间来测量 。 在本文中, 我们使用两个模型, 最起码的缩的缩的算数 和最精度 。
Article 70
Title@2025-07-06 (7): Static Analysis for Detecting Transaction Conflicts in Ethereum Smart Contracts
Title: Static Analysis for Detecting Transaction Conflicts in Ethereum Smart Contracts | Statische Analyse zur Erkennung von Transaktionskonflikten in Ethereum Smart Contracts | Etheum智能合同中发现交易冲突的静态分析 2507.04357v1 |
Authors (2): Zareh Chahoki Atefeh, Roveri Marco
Ethereum smart contracts operate in a concurrent environment where multiple transactions can be submitted simultaneously. However, the Ethereum Virtual Machine (EVM) enforces sequential execution of transactions within each block to prevent conflicts arising from concurrent access to the same state variables. Although this approach guarantees correct behavior, it limits the ability of validators to leverage multi-core architectures for faster transaction processing, thus restricting throughput. Existing solutions introduce concurrency by allowing simultaneous transaction execution combined with runtime conflict detection and rollback mechanisms to maintain correctness. However, these methods incur significant overhead due to continuous conflict tracking and transaction reversion. Recently, alternative approaches have emerged that aim to predict conflicts statically, before execution, by analyzing smart contract code for potential transaction interactions. Despite their promise, there is a lack of comprehensive studies that examine static conflict detection and its broader implications in specific smart contracts. This paper fills this important gap by proposing a novel static analysis method to detect potential transaction conflicts in Ethereum smart contracts. Our method identifies read-write, write-write, and function call conflicts between transaction pairs by analyzing state variable access patterns in Solidity contracts. We implement a tool that parses contract code and performs conflict detection. Evaluation on a dataset of real-world Ethereum smart contracts demonstrates that our approach achieves high precision in identifying potential conflicts. By enabling proactive conflict detection, our tool supports further design of transaction scheduling strategies that reduce runtime failures, enhance validator throughput, and contribute to blockchain scalability.
然而,Etheenum虚拟机器(EVM)在每一个区块内实施连续执行交易,以防止同时获得同一国家变量所产生的冲突。虽然这种方法保证了正确的行为,但限制了验证者利用多核心结构来加快交易处理的能力,从而限制了吞吐量。现有解决方案通过允许同时执行交易,同时与运行时的冲突探测和回滚机制相结合,从而保持正确性,引入了同流合金。然而,这些方法由于持续的冲突跟踪和交易回流而产生了巨大的间接费用。最近,出现了其他办法,目的是通过分析潜在的交易互动的智能合同代码,来静态地预测冲突。尽管这些办法保证了正确的行为,但它限制了验证者利用多核心结构来利用多核心结构来加快交易处理速度,从而限制交易处理过程的通畅通度。我们的方法通过分析稳定性合同中的国家可变的准入模式,来固定地预测冲突。我们用一种工具来分析静态冲突探测,通过智能的准确性交易规则来帮助我们进行交易的升级,并展示我们潜在的冲突定义。
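The read-write and write-write conflict checks described above reduce to set intersections once each function's accessed state variables are known. The sketch below assumes those read and write sets have already been extracted (the hard part, which the paper's tool does by parsing Solidity); the contract, function names, and data layout are hypothetical, and function-call conflicts are not modeled.

```python
from itertools import combinations


def conflicts(reads_a, writes_a, reads_b, writes_b):
    """Return the kinds of conflict between two transactions' access sets."""
    kinds = set()
    if writes_a & writes_b:
        kinds.add("write-write")
    if (reads_a & writes_b) or (writes_a & reads_b):
        kinds.add("read-write")
    return kinds


def conflicting_pairs(access):
    """access maps a function name to (read set, write set) of state variables."""
    result = {}
    for f, g in combinations(sorted(access), 2):
        kinds = conflicts(*access[f], *access[g])
        if kinds:
            result[(f, g)] = kinds
    return result


# Hypothetical access sets for a token-like contract.
access = {
    "transfer": ({"balances"}, {"balances"}),
    "balanceOf": ({"balances"}, set()),
    "setOwner": ({"owner"}, {"owner"}),
    "totalSupply": ({"supply"}, set()),
}
print(conflicting_pairs(access))
```

Pairs that come back empty, such as `setOwner` and `transfer` here, are candidates for parallel execution; two calls to the same function are not compared in this sketch.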
Article 71
Title@2025-07-06 (7): Heterogeneous Federated Learning with Prototype Alignment and Upscaling
Title: Heterogeneous Federated Learning with Prototype Alignment and Upscaling | Heterogenes Föderiertes Lernen mit Prototypenausrichtung und Upscaling | 具有原型调整和升级的异异质联邦学习 2507.04310v1 |
Authors (3): Gyuejeong Lee, Jihwan Shin, Daeyoung Choi
Heterogeneity in data distributions and model architectures remains a significant challenge in federated learning (FL). Various heterogeneous FL (HtFL) approaches have recently been proposed to address this challenge. Among them, prototype-based FL (PBFL) has emerged as a practical framework that only shares per-class mean activations from the penultimate layer. However, PBFL approaches often suffer from suboptimal prototype separation, limiting their discriminative power. We propose Prototype Normalization (ProtoNorm), a novel PBFL framework that addresses this limitation through two key components: Prototype Alignment (PA) and Prototype Upscaling (PU). The PA method draws inspiration from the Thomson problem in classical physics, optimizing global prototype configurations on a unit sphere to maximize angular separation; subsequently, the PU method increases prototype magnitudes to enhance separation in Euclidean space. Extensive evaluations on benchmark datasets show that our approach better separates prototypes and thus consistently outperforms existing HtFL approaches. Notably, since ProtoNorm inherits the communication efficiency of PBFL and the PA is performed server-side, it is particularly suitable for resource-constrained environments.
数据分布和模型架构的异质性仍然是联邦学习(FL)中的一项重大挑战。近来已有多种异构联邦学习(HtFL)方法被提出以应对这一挑战。其中,基于原型的联邦学习(PBFL)已成为一种实用框架,它只共享倒数第二层的每类平均激活。然而,PBFL方法往往存在原型分离不理想的问题,限制了其判别能力。我们提出原型归一化(ProtoNorm),这是一种新的PBFL框架,通过两个关键组件解决这一局限:原型对齐(PA)和原型放大(PU)。PA方法受经典物理学中Thomson问题的启发,在单位球面上优化全局原型配置以最大化角度分离;随后,PU方法增大原型的模长,以增强欧氏空间中的分离。在基准数据集上的大量评估表明,我们的方法能更好地分离原型,因而始终优于现有的HtFL方法。值得注意的是,由于ProtoNorm继承了PBFL的通信效率,且PA在服务器端执行,它特别适用于资源受限的环境。
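The interplay between angular separation and prototype magnitude can be made explicit with a short identity. Assuming, for illustration only, that two prototypes $p_i$ and $p_j$ share the same norm $r$ after upscaling and are separated by angle $\theta_{ij}$ (the symbols $r$ and $\theta_{ij}$ are not from the abstract):

$$
\|p_i - p_j\|^2 = 2r^2 - 2r^2\cos\theta_{ij} = 4r^2\sin^2\!\left(\frac{\theta_{ij}}{2}\right)
\quad\Longrightarrow\quad
\|p_i - p_j\| = 2r\,\sin\!\left(\frac{\theta_{ij}}{2}\right).
$$

Under this assumption, alignment on the unit sphere fixes the $\sin(\theta_{ij}/2)$ factor and upscaling multiplies every pairwise Euclidean distance by $r$ without changing any angle.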
Article 72
Title@2025-07-05 (6): Gathering Teams of Bounded Memory Agents on a Line
Title: Gathering Teams of Bounded Memory Agents on a Line | Sammeln von Teams von Begrenzten Speicher-Agenten auf einer Linie | 在一条线上收集被损坏的内存人员小组 2507.04172v1 |
Authors (2): Younan Gao, Andrzej Pelc
Several mobile agents, modelled as deterministic automata, navigate in an infinite line in synchronous rounds. All agents start in the same round. In each round, an agent can move to one of the two neighboring nodes, or stay idle. Agents have distinct labels which are integers from the set $\{1,\dots, L\}$. They start in teams, and all agents in a team have the same starting node. The adversary decides the compositions of teams, and their starting nodes. Whenever an agent enters a node, it sees the entry port number and the states of all collocated agents; this information forms the input of the agent on the basis of which it transits to the next state and decides the current action. The aim is for all agents to gather at the same node and stop. Gathering is feasible, if this task can be accomplished for any decisions of the adversary, and its time is the worst-case number of rounds from the start till gathering. We consider the feasibility and time complexity of gathering teams of agents, and give a complete solution of this problem. It turns out that both feasibility and complexity of gathering depend on the sizes of teams. We first concentrate on the case when all teams have the same size $x$. For the oriented line, gathering is impossible if $x=1$, and it can be accomplished in time $O(D)$, for $x>1$, where $D$ is the distance between the starting nodes of the most distant teams. This complexity is of course optimal. For the unoriented line, the situation is different. For $x=1$, gathering is also impossible, but for $x=2$, the optimal time of gathering is $\Theta(D\log L)$, and for $x\geq 3$, the optimal time of gathering is $\Theta(D)$. In the case when there are teams of different sizes, we show that gathering is always possible in time $O(D)$, even for the unoriented line. This complexity is of course optimal.
若干移动智能体被建模为确定性自动机,在一条无限长的线上按同步轮次移动。所有智能体在同一轮开始。在每一轮中,智能体可以移动到两个相邻节点之一,或保持不动。智能体具有互不相同的标签,标签是取自集合 $\{1,\dots, L\}$ 的整数。它们以团队为单位出发,同一团队中的所有智能体具有相同的起始节点。对手决定团队的组成及其起始节点。每当智能体进入一个节点时,它会看到入口端口号以及所有同处该节点的智能体的状态;这些信息构成智能体的输入,智能体据此转移到下一状态并决定当前动作。目标是让所有智能体聚集到同一节点并停止。如果对于对手的任何决定都能完成该任务,则称聚集可行,其时间是从开始到聚集的最坏情况轮数。我们研究智能体团队聚集的可行性和时间复杂度,并给出该问题的完整解答。结果表明,聚集的可行性和复杂度都取决于团队的规模。我们首先关注所有团队规模均为 $x$ 的情形。对于有向线,当 $x=1$ 时聚集不可能,而当 $x>1$ 时可在 $O(D)$ 时间内完成,其中 $D$ 是相距最远的团队的起始节点之间的距离。这一复杂度显然是最优的。对于无向线,情况有所不同。当 $x=1$ 时聚集同样不可能,但当 $x=2$ 时最优聚集时间为 $\Theta(D\log L)$,当 $x\geq 3$ 时最优聚集时间为 $\Theta(D)$。在团队规模不同的情形下,我们证明即使对于无向线,聚集也总能在 $O(D)$ 时间内完成。这一复杂度显然是最优的。
Article 73
Title@2025-07-05 (6): A3FR: Agile 3D Gaussian Splatting with Incremental Gaze Tracked Foveated Rendering in Virtual Reality
Title: A3FR: Agile 3D Gaussian Splatting with Incremental Gaze Tracked Foveated Rendering in Virtual Reality | A3FR: Agile 3D Gaussian Splatting mit Inkremental Gaze verfolgt Foveated Rendering in Virtual Reality | A3FR: Agile 3D Gaussian Splating 配有虚拟现实中增量加热跟踪的变色成形成形成像 2507.04147v1 |
Authors (3): Shuo Xin, Haiyu Wang, Sai Qian Zhang
Virtual reality (VR) significantly transforms immersive digital interfaces, greatly enhancing education, professional practices, and entertainment by increasing user engagement and opening up new possibilities in various industries. Among its numerous applications, image rendering is crucial. Nevertheless, rendering methodologies like 3D Gaussian Splatting impose high computational demands, driven predominantly by user expectations for superior visual quality. This results in notable processing delays for real-time image rendering, which greatly affects the user experience. Additionally, VR devices such as head-mounted displays (HMDs) are intricately linked to human visual behavior, leveraging knowledge from perception and cognition to improve user experience. These insights have spurred the development of foveated rendering, a technique that dynamically adjusts rendering resolution based on the user’s gaze direction. The resultant solution, known as gaze-tracked foveated rendering, significantly reduces the computational burden of the rendering process. Although gaze-tracked foveated rendering can reduce rendering costs, the computational overhead of the gaze tracking process itself can sometimes outweigh the rendering savings, leading to increased processing latency. To address this issue, we propose an efficient rendering framework called A3FR, designed to minimize the latency of gaze-tracked foveated rendering via the parallelization of gaze tracking and foveated rendering processes. For the rendering algorithm, we utilize 3D Gaussian Splatting, a state-of-the-art neural rendering technique. Evaluation results demonstrate that A3FR can reduce end-to-end rendering latency by up to $2\times$ while maintaining visual quality.
虚拟现实( VR) 极大地改造了隐性的数字界面,大大加强了教育、专业做法和娱乐,提高了用户的参与程度,并打开了各种行业的新可能性。 在众多的应用中,图像制作至关重要。 然而,3D高斯斯普拉特等方法在用户对高视觉质量的期待驱动下,提出了很高的计算要求。这导致实时图像制作的显著处理延迟,这极大地影响了用户的经验。此外,像头抬式显示(HMDs)这样的VR设备与人类视觉行为紧密相连,利用感知和视觉知识来改善用户经验。这些洞察力刺激了变形图像的开发,这是一种根据用户的视觉方向对分辨率进行动态调整的技术。结果的解决方案,主要被用户对高视觉质量的预期所驱动。 虽然凝视跟踪的顶端变色变色变异能可以降低成本,但视觉跟踪过程本身的计算间接费用有时会超过实现的节省,导致提高调的调幅度,从而导致更低调度的调化。 为了在视觉上演化过程中,我们建议一个高效的图像跟踪框架。
Article 74
Title@2025-07-05 (6): HiPerMotif: Novel Parallel Subgraph Isomorphism in Large-Scale Property Graphs
Title: HiPerMotif: Novel Parallel Subgraph Isomorphism in Large-Scale Property Graphs | HiPerMotif: Neuer Parallel-Subgraph Isomorphismus in großformatigen Property Graphen | HiPerMotif: 大型财产图中的新平行平行子集 2507.04130v1 |
Authors (5): Mohammad Dindoost, Oliver Alvarado Rodriguez, Bartosz Bryg, Ioannis Koutis, David A. Bader
Subgraph isomorphism, essential for pattern detection in large-scale graphs, faces scalability challenges in attribute-rich property graphs used in neuroscience, systems biology, and social network analysis. Traditional algorithms explore search spaces vertex-by-vertex from empty mappings, leading to extensive early-stage exploration with limited pruning opportunities. We introduce HiPerMotif, a novel hybrid parallel algorithm that fundamentally shifts the search initialization strategy. After structurally reordering the pattern graph to prioritize high-degree vertices, HiPerMotif systematically identifies all possible mappings for the first edge (vertices 0,1) in the target graph, validates these edge candidates using efficient vertex and edge validators, and injects the validated partial mappings as states at depth 2. The algorithm then continues with traditional vertex-by-vertex exploration from these pre-validated starting points, effectively pruning the expensive early search tree branches while enabling natural parallelization over edge candidates. Our contributions include the edge-centric initialization paradigm with state injection, a structural reordering strategy achieving up to 5x speedup, rapid edge and vertex validators for attribute-rich graphs, and efficient parallel enumeration over target graph edges. Implemented in the open-source Arachne framework, HiPerMotif achieves up to 66x speedup over state-of-the-art baselines (VF2-PS, VF3P, Glasgow) on diverse datasets where baselines successfully complete execution. Additionally, HiPerMotif successfully processes massive datasets such as the H01 connectome with 147 million edges, which existing methods cannot handle due to memory constraints. Comprehensive evaluation across synthetic and real-world graphs demonstrates HiPerMotif’s scalability, enabling advanced analysis in computational neuroscience and beyond.
子形是地貌形态, 在大型图形中进行模式探测所必不可少的, 面临在神经科学、 系统生物学和社会网络分析中使用的属性丰富的属性性神经属性属性图( verices 0, 1) 的可缩放性挑战。 传统算法从空绘图中探索空间的顶端和顶端, 导致在有限的修剪机会下进行广泛的早期探索。 我们引入了HiPerMotif, 一种新型混合平行算法, 从根本上改变了搜索初始化战略。 在结构重新排序模式图后, 将高水平P2 优先排序, HiPerMotif 在目标图形中系统系统地识别所有可能的关于第一边缘( verices 0, 1) 的属性丰富性财产图 。 使用高效的顶端点和边缘验证器来验证这些边缘候选对象, 在深度2 中输入经验证的部分映射图。 然后, 我们继续使用传统的顶端的顶端搜索树枝, 使昂贵的早期搜索树枝得以运行, 在边缘候选人中进行自然平行平行平行平行平行同步。 我们的贡献包括州级初始初始化模型模型模型模型模型模型模型模型模型模型, , 在5x快速的直置的直径端的直径直径向上, 直径向上, 直端的直端数据直端的直端数据直径直径直径直径直径直径直径直径直径直径分析, 直径直径直径直路, 。
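The edge-centric initialization described above can be sketched as follows: every target edge that can host pattern edge (0, 1) becomes a pre-validated partial mapping at depth 2, and ordinary vertex-by-vertex backtracking continues from there. This is a simplified illustration under several assumptions: graphs are directed edge sets without attributes, matching is non-induced, there is no structural reordering or parallelism, and all function names are hypothetical rather than taken from Arachne.

```python
def is_consistent(pattern_edges, target_edges, mapping):
    """Every pattern edge between already-mapped vertices must exist in the target."""
    k = len(mapping)
    return all((mapping[a], mapping[b]) in target_edges
               for a, b in pattern_edges if a < k and b < k)


def extend(pattern_edges, n_pattern, target_edges, target_vertices, mapping):
    """Classic vertex-by-vertex backtracking from a partial mapping."""
    if len(mapping) == n_pattern:
        yield tuple(mapping)
        return
    for cand in target_vertices:
        if cand in mapping:
            continue
        trial = mapping + [cand]
        if is_consistent(pattern_edges, target_edges, trial):
            yield from extend(pattern_edges, n_pattern, target_edges,
                              target_vertices, trial)


def match_edge_seeded(pattern_edges, n_pattern, target_edges):
    """Seed the search with target edges mapped onto pattern edge (0, 1)."""
    pattern_edges = set(pattern_edges)
    target_edges = set(target_edges)
    if (0, 1) not in pattern_edges:
        raise ValueError("pattern must contain the edge (0, 1)")
    target_vertices = sorted({v for e in target_edges for v in e})
    matches = []
    for u, v in sorted(target_edges):      # each seed is independent work
        seed = [u, v]                      # partial mapping injected at depth 2
        if u != v and is_consistent(pattern_edges, target_edges, seed):
            matches.extend(extend(pattern_edges, n_pattern, target_edges,
                                  target_vertices, seed))
    return matches


triangle = [(0, 1), (1, 2), (2, 0)]
target = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
print(match_edge_seeded(triangle, 3, target))
```

Because the seeds do not share state, the loop over target edges is the natural place to distribute work across threads or processes.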
Article 75
Title@2025-07-05 (6): One-Bit Model Aggregation for Differentially Private and Byzantine-Robust Personalized Federated Learning
Title: One-Bit Model Aggregation for Differentially Private and Byzantine-Robust Personalized Federated Learning | Ein-Bit-Modell Aggregation für unterschiedlich privates und byzantinisches-Robust Personalisiertes Federated Learning | 区别对待的私立和拜占庭-罗邦个人化联邦学习一比一模式 2507.03973v1 |
Authors (3): Muhang Lan, Song Xiao, Wenyi Zhang
As the scale of federated learning (FL) systems expands, their inherent performance limitations like communication overhead, Byzantine vulnerability, and privacy leakage have become increasingly critical. This paper considers a personalized FL framework based on model regularization, and proposes a model aggregation algorithm named PRoBit+ to concurrently overcome these limitations. PRoBit+ employs one-bit stochastic quantization and maximum likelihood estimation for parameter aggregation, and dynamically adjusts the step size of parameter updates, improving training stability of deep neural networks under low communication overhead and heterogeneous data distributions. PRoBit+’s statistical analysis is then conducted and its Byzantine robustness is proved. The $(\epsilon,0)$-differential privacy and a convergence upper bound of the PRoBit+ based FL are also theoretically established in heterogeneous contexts. The analysis illustrates the trade-off among transmission accuracy, security guarantees, and convergence rates, and also indicates that the performance degradation caused by transmission errors and privacy protection can be progressively eliminated at a rate of $\mathcal{O}(1/M)$ as the number of uploading clients $M$ increases. Comprehensive numerical experiments are conducted to assess PRoBit+ in comparison to benchmark methods across different Byzantine attacks and varying proportions of malicious clients. The experimental results demonstrate that PRoBit+ exhibits improved Byzantine robustness over existing bit-based transmission schemes, minimal performance degradation related to privacy protection, and nearly identical performance to full-precision FedAvg in a secure environment.
随着联合学习(FL)系统规模的扩大,其固有的绩效限制,如通信管理费、Byzantine脆弱性和隐私渗漏等,变得日益重要。本文件认为基于模式正规化的个性化FL框架,并提出了名为PRoBit+的模型组合算法,以同时克服这些限制。PRoBit+使用一比特的随机定量和最大可能性估算参数汇总,并动态调整参数更新的步级大小,在低通信管理费和数据分布不一的情况下,改善深神经网络的培训稳定性。随后进行了PRoBit+的统计分析,并证明了其Byzantine稳健性。$(\epsilon,0)$-差分隐私和基于PRoBit+的FL的趋同上限,也是在多种情况下理论上确立的。分析表明传输准确性、安全保证和趋同率之间的权衡,还表明传输错误和隐私保护导致的性能退化可以随着上传客户端数量 $M$ 的增加,以 $\mathcal{O}(1/M)$ 的速度逐步消除。
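The one-bit idea in the abstract above can be illustrated with a toy sketch. This is not the PRoBit+ algorithm: it assumes each client holds a scalar in [-b, b], sends +1 with probability (b + x) / (2b) and -1 otherwise, and that the server rescales the average of the received bits. The names `quantize_one_bit` and `aggregate`, the bound `b`, and the client values are all hypothetical.

```python
import random

def quantize_one_bit(x, b, rng):
    """Stochastically map x in [-b, b] to +1 or -1 so that E[bit] = x / b."""
    if not -b <= x <= b:
        raise ValueError("x must lie in [-b, b]")
    p_plus = (b + x) / (2 * b)          # probability of sending +1
    return 1 if rng.random() < p_plus else -1

def aggregate(bits, b):
    """Estimate the average client value from the received one-bit messages."""
    return b * sum(bits) / len(bits)

rng = random.Random(0)                   # fixed seed so the run is repeatable
b = 1.0                                  # assumed clipping bound (hypothetical)
client_values = [0.3] * 20000            # hypothetical: every client holds the same value
bits = [quantize_one_bit(x, b, rng) for x in client_values]
estimate = aggregate(bits, b)
print(f"true mean = 0.3, one-bit estimate = {estimate:.3f}")
```

When all clients hold the same value, the rescaled bit average is the maximum likelihood estimate of that value, and its variance shrinks in proportion to 1/M for M clients. That is only loosely in the spirit of the rate quoted in the abstract; the paper's estimator and analysis may differ.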
Article 76
Title@2025-07-05 (6): FedFog: Resource-Aware Federated Learning in Edge and Fog Networks
Title: FedFog: Resource-Aware Federated Learning in Edge and Fog Networks | FedFog: Ressourcenschonendes Lernen in Edge- und Fog-Netzwerken | FedFog: 边缘和雾网的资源-软件联合学习 2507.03952v1 |
Authors (1): Somayeh Sobati-M
As edge and fog computing become central to modern distributed systems, there’s growing interest in combining serverless architectures with privacy-preserving machine learning techniques like federated learning (FL). However, current simulation tools fail to capture this integration effectively. In this paper, we introduce FedFog, a simulation framework that extends the FogFaaS environment to support FL-aware serverless execution across edge-fog infrastructures. FedFog incorporates an adaptive FL scheduler, privacy-respecting data flow, and resource-aware orchestration to emulate realistic, dynamic conditions in IoT-driven scenarios. Through extensive simulations on benchmark datasets, we demonstrate that FedFog accelerates model convergence, reduces latency, and improves energy efficiency compared to conventional FL or FaaS setups, making it a valuable tool for researchers exploring scalable, intelligent edge systems.
随着边缘和雾计算成为现代分布式系统的核心,人们越来越希望将无服务器架构与无隐私的机械学习技术(如联合学习(FL)相结合。 然而,目前的模拟工具未能有效地捕捉到这种整合。 在本文中,我们引入了FedFogFaaS(FogFaaS)模拟框架,这个模拟框架将FogFaaS(FogFaaaS)环境扩展到支持FL-aw(FL)服务器的跨边缘系统执行。 FedFog(Fog)将适应性FL(FL)定时器、尊重隐私的数据流和资源意识调控纳入IoT(IoT)驱动的情景中,以效仿现实、动态的条件。 通过对基准数据集的广泛模拟,我们证明FedFog(Fog)加快了模型的趋同,降低延时性,并提高了能源效率,与常规的FaaaS(Faa)设置相比,它成为研究人员探索可伸缩的智能边缘系统的宝贵工具。
Article 77
Title@2025-07-05 (6): On Fault Tolerance of Data Storage Systems: A Holistic Perspective
Title: On Fault Tolerance of Data Storage Systems: A Holistic Perspective | Zur Fehlertoleranz von Datenspeichersystemen: Eine ganzheitliche Perspektive | 关于数据存储系统不容错:整体观点 2507.03849v1 |
Authors (3): Mai Zheng, Duo Zhang, Ahmed Dajani
Data storage systems serve as the foundation of digital society. The enormous data generated by people on a daily basis make the fault tolerance of data storage systems increasingly important. Unfortunately, modern storage systems consist of complicated hardware and software layers interacting with each other, which may contain latent bugs that elude extensive testing and lead to data corruption, system downtime, or even unrecoverable data loss in practice. In this chapter, we take a holistic view to introduce the typical architecture and major components of modern data storage systems (e.g., solid state drives, persistent memories, local file systems, and distributed storage management at scale). Next, we discuss a few representative bug detection and fault tolerance techniques across layers with a focus on issues that affect system recovery and data integrity. Finally, we conclude with open challenges and future work.
数据储存系统是数字社会的基础。人们每天产生的大量数据使数据储存系统的过错容忍度越来越重要。不幸的是,现代储存系统由复杂的硬件和软件层组成,彼此相互作用,其中可能含有潜在的错误,无法进行广泛的测试,并导致数据腐败、系统故障,甚至在实践中造成无法收回的数据损失。在本章中,我们从整体上考虑采用现代数据储存系统的典型结构和主要组成部分(如固态驱动器、耐久的记忆、本地文件系统和大规模分布式储存管理)。接下来,我们讨论几个具有代表性的跨层的错误探测和过错容忍技术,重点是影响系统恢复和数据完整性的问题。最后,我们提出公开的挑战和今后的工作。
Article 78
Title@2025-07-04 (5): Distributed Equivariant Graph Neural Networks for Large-Scale Electronic Structure Prediction
Title: Distributed Equivariant Graph Neural Networks for Large-Scale Electronic Structure Prediction | Distributed Equivariant Graph Neural Networks for Large-Scale Electronic Structure Prediction | 用于大型电子结构预测的分布式等差图像神经网络 2507.03840v1 |
Authors (5): Manasa Kaniselvan, Alexander Maeder, Chen Hao Xia, Alexandros Nikolaos Ziogas, Mathieu Luisier
Equivariant Graph Neural Networks (eGNNs) trained on density-functional theory (DFT) data can potentially perform electronic structure prediction at unprecedented scales, enabling investigation of the electronic properties of materials with extended defects, interfaces, or exhibiting disordered phases. However, as interactions between atomic orbitals typically extend over 10+ angstroms, the graph representations required for this task tend to be densely connected, and the memory requirements to perform training and inference on these large structures can exceed the limits of modern GPUs. Here we present a distributed eGNN implementation which leverages direct GPU communication and introduce a partitioning strategy of the input graph to reduce the number of embedding exchanges between GPUs. Our implementation shows strong scaling up to 128 GPUs, and weak scaling up to 512 GPUs with 87% parallel efficiency for structures with 3,000 to 190,000 atoms on the Alps supercomputer.
在密度功能理论(DFT)数据方面受过培训的等同图形神经网络(eGNNs)有可能以前所未有的规模进行电子结构预测,从而能够对具有超长缺陷、接口或显示无序阶段的材料的电子特性进行调查,然而,由于原子轨道之间的相互作用通常超过10+Agstrom,这项任务所需的图形显示往往密不可分,对这些大型结构进行培训和推断的记忆要求可能超过现代GPU的限度。在这里,我们展示了一个分布式的eGNN(eGNN)实施工具,利用GPU的直接通信,并采用输入图分割战略,以减少GPUs之间的嵌入交换次数。我们的实施显示,可大力扩展至128GPUs,而微弱地扩大至512GPUs,在阿尔卑斯超级计算机上3 000至190 000个原子结构上平行效率达87%。
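For readers less familiar with the scaling terms used in the abstract above, the sketch below shows how strong- and weak-scaling parallel efficiency are commonly computed. The function names and the timings in the example are made up for illustration and are not measurements from the paper.

```python
def strong_scaling_efficiency(t_base, n_base, t_n, n):
    """Fixed total problem size: ideal runtime shrinks in proportion to GPU count."""
    return (t_base * n_base) / (t_n * n)

def weak_scaling_efficiency(t_base, t_n):
    """Fixed problem size per GPU: ideal runtime stays constant as GPUs are added."""
    return t_base / t_n

# Hypothetical timings in seconds, chosen only to exercise the formulas.
print(strong_scaling_efficiency(t_base=100.0, n_base=1, t_n=25.0, n=4))
print(weak_scaling_efficiency(t_base=10.0, t_n=12.5))
```

An efficiency of 1.0 means ideal scaling; values below 1.0 quantify the time lost to communication and load imbalance.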
Article 79
Title@2025-07-04 (5): RVISmith: Fuzzing Compilers for RVV Intrinsics
Title: RVISmith: Fuzzing Compilers for RVV Intrinsics | RVISmith: Fuzzing Compiler für RVV-Intrinsik | RVISmith: RVV Intrinsics 模糊的编译者 2507.03773v1 |
Authors (6): Yibo He, Cunjian Huang, Xianmiao Qu, Hongdeng Chen, Wei Yang, Tao Xie
Modern processors are equipped with single instruction multiple data (SIMD) instructions for fine-grained data parallelism. Compiler auto-vectorization techniques that target SIMD instructions face performance limitations due to insufficient information available at compile time, requiring programmers to manually manipulate SIMD instructions. SIMD intrinsics, a type of built-in function provided by modern compilers, enable programmers to manipulate SIMD instructions within high-level programming languages. Bugs in compilers for SIMD intrinsics can introduce potential threats to software security, producing unintended calculation results, data loss, program crashes, etc. To detect bugs in compilers for SIMD intrinsics, we propose RVISmith, a randomized fuzzer that generates well-defined C programs that include various invocation sequences of RVV (RISC-V Vector Extension) intrinsics. We design RVISmith to achieve the following objectives: (i) achieving high intrinsic coverage, (ii) improving sequence variety, and (iii) avoiding known undefined behaviors. We implement RVISmith based on the ratified RVV intrinsic specification and evaluate our approach with three modern compilers: GCC, LLVM, and XuanTie. Experimental results show that RVISmith achieves 11.5 times higher intrinsic coverage than the state-of-the-art fuzzer for RVV intrinsics. By differential testing that compares results across different compilers, optimizations, and equivalent programs, we detect and report 13 previously unknown bugs of the three compilers under test to date. Of these bugs, 10 are confirmed and another 3 are fixed by the compiler developers.
以 SIMD 指令为目标的编译器自动演算技术由于在编译时提供的信息不足而面临性能限制,要求程序员手动操作 SIMD 指令。 SIMD 内在功能是现代编译器提供的一种内在功能,使程序员能够在高层次编程语言中操作SIMD指令。 SIMD 内在内容编译器中的错误可能对软件安全造成潜在威胁,产生意外的计算结果、数据丢失、程序崩溃等。为了检测SIMD 内在内容的编译器中的错误,我们建议使用RVIS ,一个随机化的烟雾器,生成定义明确的C程序,其中包括各种 RVV(RISC-V Vctor 扩展) 的内置序列。我们设计了RVIS , 以实现以下目标:(一) 实现高内在覆盖, (二) 改进序列种类,以及(三) 没有已知的细微值行为, 我们根据已经批准的 RV 内在规格的编译器进行 RVIDM 3 , 用三个现代的内置的内置程序来评估结果。
Article 80
Title@2025-07-04 (5): Benchmarking Vector, Graph and Hybrid Retrieval Augmented Generation (RAG) Pipelines for Open Radio Access Networks (ORAN)
Title: Benchmarking Vector, Graph and Hybrid Retrieval Augmented Generation (RAG) Pipelines for Open Radio Access Networks (ORAN) | Benchmarking Vector, Graph and Hybrid Retrieval Augmented Generation (RAG) Pipelines für Open Radio Access Networks (ORAN) | 用于开放式无线电接入网络(ORAN)的矢量、图形和混合检索增强代(RAG)管道基准 2507.03608v1 |
Authors (4): Sarat Ahmad, Zeinab Nezami, Maryam Hafeez, Syed Ali Raza Zaidi
Generative AI (GenAI) is expected to play a pivotal role in enabling autonomous optimization in future wireless networks. Within the ORAN architecture, Large Language Models (LLMs) can be specialized to generate xApps and rApps by leveraging specifications and API definitions from the RAN Intelligent Controller (RIC) platform. However, fine-tuning base LLMs for telecom-specific tasks remains expensive and resource-intensive. Retrieval-Augmented Generation (RAG) offers a practical alternative through in-context learning, enabling domain adaptation without full retraining. While traditional RAG systems rely on vector-based retrieval, emerging variants such as GraphRAG and Hybrid GraphRAG incorporate knowledge graphs or dual retrieval strategies to support multi-hop reasoning and improve factual grounding. Despite their promise, these methods lack systematic, metric-driven evaluations, particularly in high-stakes domains such as ORAN. In this study, we conduct a comparative evaluation of Vector RAG, GraphRAG, and Hybrid GraphRAG using ORAN specifications. We assess performance across varying question complexities using established generation metrics: faithfulness, answer relevance, context relevance, and factual correctness. Results show that both GraphRAG and Hybrid GraphRAG outperform traditional RAG. Hybrid GraphRAG improves factual correctness by 8%, while GraphRAG improves context relevance by 7%.
在ORAN结构中,大语言模型(LLMS)可以通过利用RAN智能主计长(RIC)平台的规格和API定义,专门生成xApps和rApps。然而,用于电信特定任务的微调基础LLMs仍然昂贵,而且需要大量资源。回溯-启动一代(RAG)通过内文学习提供了一种实用的替代方法,使域能适应而无需经过全面再培训。传统RAG系统依靠基于矢量的检索、新兴变异(如GreagraG和混合图RAG),通过利用知识图表图或双轨检索战略来生成xAppspps和rAPPs。尽管这些方法有希望,但缺乏系统化的、计量驱动的评价,特别是在诸如ORAN等高层次领域。在本研究中,我们用ORAN规格对矢量的RAG、图RAG和混合图解进行比较评估。我们利用既定的生成指标评估不同问题的复杂性:忠实性、事实性GARCA的正确性、事实性G的正确性。
Article 81
Title@2025-07-04 (5): FastSet: Parallel Claim Settlement
Title: FastSet: Parallel Claim Settlement | FastSet: Parallele Forderungsabrechnung | FastSet:平行索赔理赔 2506.23395v2 |
Authors (2): Xiaohong Chen, Grigore Rosu
FastSet is an actor-based distributed protocol for decentralized finance and settlement, which is inspired by blockchains. Account holders cooperate by making claims, which can include payments, holding and transferring assets, accessing and updating shared data, medical records, digital identity, and mathematical theorems, among many others. The claims are signed by their owners and are broadcast to a decentralized network of validators, which validate and settle them. Validators replicate the global state of the accounts and need not communicate with each other. In sharp contrast to blockchains, strong consistency is purposely given up as a requirement. Yet, many if not most of the blockchain benefits are preserved. The protocol is proved to be correct, despite its massively parallel nature.
FastSet是一个基于行为体的分散金融和结算分配协议,其灵感来自供应链; 账户持有人合作,提出债权,其中可包括付款、持有和转移资产、获取和更新共享数据、医疗记录、数字身份和数学理论等; 债权由所有者签字,并广播给一个分散的验证人网络,由他们验证和结算; 验证人复制账户的全球状况,不需要相互沟通; 与供应链形成鲜明对比,故意放弃强有力的一致性,将其作为一项要求; 然而,即使不是大多数,也有许多供应链的好处得到了维护; 协议被证明是正确的,尽管其性质极为平行。
Article 82
Title@2025-07-04 (5): Hiku: Pull-Based Scheduling for Serverless Computing
Title: Hiku: Pull-Based Scheduling for Serverless Computing | Hiku: Pull-Based Scheduling für serverloses Rechnen | Hiku:无服务器计算系统 Pull- 以拉为基础的日程安排 2502.15534v2 |
Authors (2): Saman Akbari, Manfred Hauswirth
Serverless computing promises convenient abstractions for developing and deploying functions that execute in response to events. In such Function-as-a-Service (FaaS) platforms, scheduling is an integral task, but current scheduling algorithms often struggle with maintaining balanced loads, minimizing cold starts, and adapting to commonly occurring bursty workloads. In this work, we propose pull-based scheduling as a novel scheduling algorithm for serverless computing. Our key idea is to decouple worker selection from task assignment, with idle workers requesting new tasks proactively. Experimental evaluation on an open-source FaaS platform shows that pull-based scheduling, compared to other existing scheduling algorithms, significantly improves the performance and load balancing of serverless workloads, especially under high concurrency. The proposed algorithm improves response latencies by 14.9% compared to hash-based scheduling, reduces the frequency of cold starts from 43% to 30%, increases throughput by 8.3%, and achieves a 12.9% more even load distribution, as measured by the requests assigned per worker.
无服务器计算为开发和部署因事件而执行的功能提供了方便的抽象信息。 在这种功能-服务平台(Faas-Service)平台中,排期是一项不可或缺的任务,但目前的排期算法往往与保持平衡负荷、尽量减少冷开关和适应常见的突发工作量相挣扎。 在这项工作中,我们提议以拉动计时法作为无服务器计算的新安排算法。 我们的关键想法是将工人选择与任务分配脱钩,让闲置工人主动要求新的任务。 在开放源代码-FaaS平台上进行的实验性评估显示,与其他现有排期算法相比,拉动计时法大大改进了无服务器工作量的性能和负载平衡,特别是在高通货制情况下。 拟议的算法将反应迟缓率比基于散列的排期增加了14.9%,将冷开始频率从43%降低到30%,吞吐量增加8.3%,并根据每个工人的请求而实现更均衡的工作量分配12.9%。
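The key idea in the abstract above, idle workers requesting tasks instead of a scheduler pushing them, can be sketched as a small event-driven simulation. This is not Hiku's implementation: the function names, the task durations, and the static push baseline (task i goes to worker i mod N, standing in loosely for a load-oblivious scheduler) are all assumptions, and cold starts are not modeled.

```python
import heapq
from collections import deque

def simulate_pull(task_durations, num_workers):
    """Idle workers pull the next task from a shared queue."""
    queue = deque(task_durations)
    # Heap of (time the worker becomes idle, worker id); all workers start idle at t = 0.
    idle_at = [(0.0, w) for w in range(num_workers)]
    heapq.heapify(idle_at)
    busy_time = [0.0] * num_workers
    while queue:
        now, worker = heapq.heappop(idle_at)    # the earliest idle worker asks for work
        duration = queue.popleft()
        busy_time[worker] += duration
        heapq.heappush(idle_at, (now + duration, worker))
    makespan = max(t for t, _ in idle_at)
    return busy_time, makespan

def simulate_static_push(task_durations, num_workers):
    """Baseline: task i is pushed to worker i % num_workers regardless of load."""
    busy_time = [0.0] * num_workers
    for i, duration in enumerate(task_durations):
        busy_time[i % num_workers] += duration
    return busy_time, max(busy_time)

tasks = [8, 1, 1, 1, 8, 1, 1, 1]                # hypothetical bursty mix of long and short tasks
print("pull:", simulate_pull(tasks, 2))         # ([11.0, 11.0], 11.0)
print("push:", simulate_static_push(tasks, 2))  # ([18.0, 4.0], 18.0)
```

In this toy workload the pull model balances busy time across both workers, while the static assignment sends both long tasks to the same worker.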
Article 83
Title@2025-07-04 (5): Universal Checkpointing: A Flexible and Efficient Distributed Checkpointing System for Large-Scale DNN Training with Reconfigurable Parallelism
Title: Universal Checkpointing: A Flexible and Efficient Distributed Checkpointing System for Large-Scale DNN Training with Reconfigurable Parallelism | Universal Checkpointing: Ein flexibles und effizientes Distributed Checkpointing-System für großformatige DNN-Schulungen mit rekonfigurierbarer Parallelis | 通用检查:采用可重新配置平行系统进行大型DNN培训的灵活和高效分布式检查系统 2406.18820v3 |
Authors (7): Xinyu Lian, Sam Ade Jacobs, Lev Kurilenko, Masahiro Tanaka, Stas Bekman, Olatunji Ruwase, Minjia Zhang
Deep neural network (DNN) training continues to scale rapidly in terms of model size, data volume, and sequence length, to the point where multiple machines are required to fit large models for training. Different distributed and parallel training strategies have been developed to support large-scale DNN training by partitioning the training state across GPUs. However, existing DNN training systems provide very limited support for reconfiguring parallelism strategies in the middle of the training via checkpointing. This limitation arises because distributed checkpoints are tightly coupled to specific model parallelism and hardware configurations, preventing large-scale training jobs from efficiently adapting to hardware failures or resource elasticity. This paper presents Universal Checkpointing (UCP), a novel checkpointing system that enables flexible and efficient DNN training with reconfigurable parallelism. UCP overcomes challenges in existing systems by decoupling checkpoint structure from parallel training strategies and hardware configurations. In addition, we present a pattern-based reconfiguration pipeline that enables automatic, flexible, and efficient mapping of checkpoint state to various parallelism strategies. Evaluation on a range of DNN models, including state-of-the-art dense and sparse LLMs, shows that UCP enables reconfiguration for a broader set of widely used parallelism strategies than existing solutions while adding negligible reconfiguration cost. UCP has been successfully employed in real LLM training workloads, greatly enhancing their flexibility and resilience to dynamic hardware environments.
深度神经网络(DNN)培训在模型规模、数据量和序列长度方面继续迅速扩展,达到需要多机器以适应大型培训模式的高度,已经制定了不同的分布式和平行培训战略,通过将培训国分为所有GPU,支持大型DNN培训;然而,现有的DNN培训系统在培训中通过检查站检查为重新配置平行战略提供了非常有限的支持,因为分布式检查站与具体的模型平行和硬件配置紧密相连,使得大型培训岗位无法有效地适应硬件故障或资源弹性。本文介绍了通用检查(UCP),这是一个新型的检查站系统,能够灵活和高效地进行DNNNNN培训,同时进行可重新配置的平行培训,从而克服现有系统中的挑战,将检查站结构与平行培训战略和硬件配置分开。此外,我们提出了一个基于模式的重组管道,使检查站与各种平行战略同步、灵活和高效地配置,从而防止大型培训岗位工作高效地适应硬件故障或资源弹性。本文介绍通用检查点的检查(UCP)系统是一个崭新的检查系统,使得现有灵活、密集和分散的软磁体再配置战略得以实现。
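The decoupling described in the abstract above can be pictured as a two-step round trip: merge per-rank shards into a layout-independent form, then re-partition for the new parallel degree. The sketch below illustrates only that idea, with parameters as flat Python lists. It is not UCP's checkpoint format or its pattern-based pipeline, and `shard`, `consolidate`, and `reconfigure` are hypothetical names.

```python
def shard(params, world_size):
    """Split each flat parameter into world_size contiguous slices (one dict per rank)."""
    shards = [{} for _ in range(world_size)]
    for name, values in params.items():
        chunk = -(-len(values) // world_size)    # ceiling division
        for rank in range(world_size):
            shards[rank][name] = values[rank * chunk:(rank + 1) * chunk]
    return shards

def consolidate(shards):
    """Merge per-rank slices, in rank order, into a layout-independent checkpoint."""
    merged = {}
    for rank_state in shards:
        for name, piece in rank_state.items():
            merged.setdefault(name, []).extend(piece)
    return merged

def reconfigure(shards, new_world_size):
    """Map a checkpoint saved with one parallel degree onto another."""
    return shard(consolidate(shards), new_world_size)

params = {"w": list(range(10)), "b": [0.5, 1.5]}   # hypothetical model state
saved = shard(params, 4)                           # checkpoint written by 4 ranks
resumed = reconfigure(saved, 2)                    # training resumes on 2 ranks
print(resumed[0]["w"], resumed[1]["w"])
```

Because the intermediate form carries no trace of the original layout, the same saved state can be resumed on any rank count in this toy model.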
Article 84
Title@2025-07-04 (5): Analysis and Optimized CXL-Attached Memory Allocation for Long-Context LLM Fine-Tuning
Title: Analysis and Optimized CXL-Attached Memory Allocation for Long-Context LLM Fine-Tuning | Analyse und optimierte CXL-Attached-Speicherallokation für Long-Context LLM Fine-Tuning | 分析和优化长文本LLM微调的CXL-附加记忆分配 2507.03305v1 |
Authors (2): Yong-Cheng Liaw, Shuo-Han Chen
The growing prevalence of Large Language Models (LLMs) and their substantial memory requirements have prompted renewed interest in CPU offloading as a method to compensate for limited GPU memory. In particular, when CPU memory is leveraged to temporarily store intermediate states of LLMs, CPU memory becomes a new bottleneck and soon reaches the capacity limitation of commodity CPUs. In this work, we investigate the effectiveness of Compute Express Link (CXL) add-in card (AIC) memory as an extension to CPU memory, enabling larger model sizes and longer context lengths during fine-tuning. Through extensive benchmarking, this study quantifies the performance overhead introduced by transferring data between CXL memory, CPU, and GPUs, focusing on how concurrency and data volume influence bandwidth utilization and latency. This study also compares CPU-based optimizer steps when model parameters, gradients, and optimizer states reside in local memory versus CXL memory, revealing that naive adoption of CXL often degrades performance during the optimizer phase. To overcome these challenges, this study proposes a CXL-aware allocation to strategically partition CPU offloading workloads across both local and CXL memory. This study further demonstrates that employing multiple AICs significantly reduces bandwidth contention, thus improving scalability. Experimental results show that these optimizations enable efficient long-context LLM fine-tuning, underscoring CXL as a promising avenue for unlocking the full potential of CPU offloading in long-context LLM fine-tuning.
大型语言模型(LLMs)的日益普及及其大量记忆要求已促使人们重新关注将CPU卸载作为补偿有限GPU内存的方法。特别是,当CPU内存用于暂时存储LLM的中间状态时,CPU内存成为一个新的瓶颈,并很快达到商品CPU的能力限制。在这项工作中,我们调查了计算Express Link(CXL)附加卡卡(AIC)内存作为CPU内存延伸的效果,从而在微调期间能够使更大的模型尺寸和较长的上下文长度。通过广泛的基准,这项研究通过在CXL存储、CPU和GPU之间传输数据,将CPU内存用于临时储存LM的中间状态,使CPU内存成为新的瓶颈。在模型参数、梯度和优化状态状态内存于当地记忆和CXL记忆内存时,我们发现对CXLL的及时性会降低业绩。为了克服这些挑战,本研究报告提议在CX-LAULS-L对战略分区的全程分配中,使CPLULULL能够大幅改进这些内值。
Article 85
Title@2025-07-04 (5): Lion Cub: Minimizing Communication Overhead in Distributed Lion
Title: Lion Cub: Minimizing Communication Overhead in Distributed Lion | Lion Cub: Minimierung der Kommunikation über Kopf in verteilten Löwen | Lion Cub:尽量减少分配狮子的通讯问题 2411.16462v2 |
Authors (5): Satoki Ishikawa, Tal Ben-Nun, Brian Van Essen, Rio Yokota, Nikoli Dryden
Communication overhead is a key challenge in distributed deep learning, especially on slower Ethernet interconnects, and given current hardware trends, communication is likely to become a major bottleneck. While gradient compression techniques have been explored for SGD and Adam, the Lion optimizer has the distinct advantage that its update vectors are the output of a sign operation, enabling straightforward quantization. However, simply compressing updates for communication and using techniques like majority voting fails to lead to end-to-end speedups due to inefficient communication algorithms and reduced convergence. We analyze three factors critical to distributed learning with Lion: optimizing communication methods, identifying effective quantization methods, and assessing the necessity of momentum synchronization. Our findings show that quantization techniques adapted to Lion and selective momentum synchronization can significantly reduce communication costs while maintaining convergence. We combine these into Lion Cub, which enables up to 5x speedups in end-to-end training compared to Lion. This highlights Lion’s potential as a communication-efficient solution for distributed training.
通信管理是分布式深层次学习的关键挑战,特别是在以太网互连较慢的情况下,而且考虑到当前的硬件趋势,通信有可能成为一个主要瓶颈。虽然已经为SGD和Adam探索了梯度压缩技术,但狮子优化器的明显优势是其更新矢量是信号操作的输出,从而可以进行直截了当的量化。然而,仅仅压缩通信更新和使用多数表决等技术,由于通信算法效率低下和趋同程度降低,无法导致终端到终端的加速。我们分析了对狮子的传播学习至关重要的三个因素:优化通信方法,确定有效的量化方法,并评估动力同步的必要性。我们的调查结果显示,适应狮子和选择性动力同步的量化技术可以大大减少通信成本,同时保持趋同。我们将这些技术结合到狮子库布,这样就可以在终端到终端的训练中实现多达5x速度的加速,而与狮子相比,这凸显了狮子作为传播培训的一个通信效率解决方案的潜力。
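The abstract above relies on the fact that Lion's update direction is the output of a sign operation, so each coordinate fits in very few bits, and it mentions majority voting as one way to combine workers. The sketch below shows only that baseline combination (the one the abstract says does not by itself yield end-to-end speedups), not Lion Cub's communication or quantization schemes. The gradients and hyperparameters are hypothetical.

```python
def sign(x):
    """Return -1, 0, or +1."""
    return (x > 0) - (x < 0)

def lion_local_update(momentum, grad, beta1=0.9):
    """Sign of the interpolated momentum and gradient, one value per coordinate."""
    return [sign(beta1 * m + (1 - beta1) * g) for m, g in zip(momentum, grad)]

def majority_vote(updates):
    """Combine per-worker sign vectors coordinate-wise; ties resolve to 0."""
    return [sign(sum(column)) for column in zip(*updates)]

def apply_update(params, voted, lr=0.1):
    """Take a fixed-size step against the voted sign direction."""
    return [p - lr * u for p, u in zip(params, voted)]

# Three hypothetical workers, zero initial momentum, made-up gradients.
momentum = [0.0, 0.0, 0.0]
grads = [[0.5, -0.2, 0.1], [0.4, 0.3, -0.1], [-0.1, -0.6, -0.3]]
local = [lion_local_update(momentum, g) for g in grads]
voted = majority_vote(local)
print(voted)                                   # [1, -1, -1]
print(apply_update([1.0, 1.0, 1.0], voted))
```

Each worker only has to transmit its sign vector, which is why the abstract describes quantization for Lion as straightforward. Weight decay and the momentum update are omitted here for brevity.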
Article 86
Title@2025-07-04 (5): Novel Blockchain-based Protocols for Electronic Voting and Auctions
Title: Novel Blockchain-based Protocols for Electronic Voting and Auctions | Neue Blockchain-basierte Protokolle für elektronische Abstimmung und Auktionen | 关于电子表决和拍卖的基于新锁链的新议定书 2507.03258v1 |
Authors (1): Zhaorun Lin
Programmable blockchains have long been a hot research topic given their tremendous use in decentralized applications. Smart contracts, using blockchains as their underlying technology, inherit desirable properties such as verifiability, immutability, and transparency, which make them a good fit for trustless environments. In this thesis, we consider several decentralized protocols to be built on blockchains, specifically using smart contracts on Ethereum. We used algorithmic and cryptographic tools in our implementations to further improve the level of security and efficiency beyond the state-of-the-art works. We proposed a new approach called Blind Vote, which is an untraceable, secure, efficient, secrecy-preserving, and fully on-chain electronic voting protocol based on the well-known concept of Chaum’s blind signatures. We illustrate that our approach achieves the same security guarantees as previous methods such as Tornado Vote [1], while consuming significantly less gas. Thus, we provide a cheaper and considerably more gas-efficient alternative for anonymous blockchain-based voting. On the other hand, we propose a new family of algorithms for private, trustless auctions that protect bidder identities and bid values while remaining practical for smart contract execution. We ensure trustlessness by running the auction logic in a smart contract, thereby eliminating reliance on any single trusted party. This approach prevents bid tampering, front-running, and collusion by enforcing immutability and decentralized verification of bids. The resulting protocol uniquely combines efficiency, trustlessness, and enduring bid privacy, offering a scalable and secure solution for blockchain-based marketplaces and other decentralized applications.
长期以来,由于在分散应用中大量使用,可编程的铁链一直是一个热门的研究课题。智能合同,使用铁链作为基础技术,继承了可核查、不可移动、透明等理想的特性,这在无信任的环境中是一件很适合的事情。在这个论点中,我们认为,若干分散的议定书将建在铁链上,具体使用Etheum的智能合同。我们在执行过程中使用了算法和加密工具,以进一步提高安全水平和效率,超越最先进的工序。我们提出了一个新的方法,称为“盲票”,这是一种不可追踪、安全、高效、保守保密和完全在链电子投票协议,基于众所周知的Chaum盲方签名概念。我们认为,在这一点上,我们的方法与以前的方法一样,例如“Contale Votage [1] 的智能合同使用量要少得多。因此,我们在执行过程中为匿名的铁链投票提供了一种更便宜和高得多的天然气效率的替代方法。另一方面,我们提出了一套名为“盲票式”的新方法,它是一种无法追踪的拍卖,它能保护投标人的身份、效率、保守的保密性、保密性以及完全的电路规则性,同时,我们则会通过执行一个明智的仲裁,从而降低的投标。
Article 87
Title@2025-07-04 (5): HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration
Title: HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration | HPCTransCompile: Ein KI-Compiler-generierter Datensatz für Hochleistungs-CUDA-Transpilation und LLM-Voruntersuchung | HPC Transtranscompility: AI CUDA 高性能 CUDA 转换和 LLM 初步探索的人工智能汇编器生成数据集 2506.10401v2 |
Authors (10): Jiaqi Lv, Xufeng He, Yanchen Liu, Xu Dai, Aocheng Shen, Yinghao Li, Jiachen Hao, Jianrong Ding, Yang Hu, Shouyi Yin
The rapid growth of deep learning has driven exponential increases in model parameters and computational demands. NVIDIA GPUs and their CUDA-based software ecosystem provide robust support for parallel computing, significantly alleviating computational bottlenecks. Meanwhile, due to the cultivation of user programming habits and the high performance of GPUs, the CUDA ecosystem has established a dominant position in the field of parallel software. This dominance requires other hardware platforms to support CUDA-based software with performance portability. However, translating CUDA code to other platforms poses significant challenges due to differences in parallel programming paradigms and hardware architectures. Existing approaches rely on language extensions, domain-specific languages (DSLs), or compilers but face limitations in workload coverage and generalizability. Moreover, these methods often incur substantial development costs. Recently, LLMs have demonstrated extraordinary potential in various vertical domains, especially in code-related tasks. However, the performance of existing LLMs in CUDA transpilation, particularly for high-performance code, remains suboptimal. To address these challenges, we propose a novel framework for generating high-performance CUDA and corresponding platform code pairs, leveraging AI compiler and automatic optimization technology. We further enhance the framework with a graph-based data augmentation method and introduce HPCTransEval, a benchmark for evaluating LLM performance on CUDA transpilation. We conduct experiments using CUDA-to-CPU transpilation as a case study on leading LLMs. The speedup ratio of the CPU operators has an average improvement of 43.8%, highlighting the potential of LLMs to address compatibility challenges within the CUDA ecosystem. Our code is available at https://github.com/PJLAB-CHIP/HPCTransCompile.
NVIDIA GPU及其基于CUDA的软件生态系统为平行计算提供了强有力的支持,并大大缓解了计算瓶颈。与此同时,由于用户编程习惯的培养以及GPU的高性能,CUDA生态系统在平行软件领域建立了主导地位。这一主导地位要求其他硬件平台支持基于CUDA的可移植软件。然而,将CUDA代码转换到其他平台,由于平行编程模式和硬件结构的差异而构成重大挑战。现有方法依赖于语言扩展、特定域语言(DSLs)或汇编者,但面临着工作量覆盖和通用性方面的限制。此外,这些方法往往带来巨大的发展成本。最近,LLMMS在各种纵向领域,特别是在与代码有关的任务方面展现出超强的潜力。然而,CUDA中的现有LMMS的性能平台性能,特别是在高性能代码方面,仍然不那么,为了应对这些挑战,我们提出了一个新的框架,用于生成高性能CUDA和相应的平台代码,在CUDA的CA平均性能和ALMUDUDUD上,将ALUDA的自动数据化工具用于CUDA的升级。
Article 88
Title@2025-07-03 (4): Symbiosis: Multi-Adapter Inference and Fine-Tuning
Title: Symbiosis: Multi-Adapter Inference and Fine-Tuning | Symbiose: Multi-Adapter-Schlussfolgerung und Feinabstimmung | 共生关系:多位开发商的推断和精准调整 2507.03220v1 |
Authors (4): Saransh Gupta, Umesh Deshpande, Travis Janssen, Swami Sundararaman
Parameter-efficient fine-tuning (PEFT) allows model builders to capture the task-specific parameters into adapters, which are a fraction of the size of the original base model. The popularity of PEFT techniques for fine-tuning has led to the creation of a large number of adapters for popular Large Language Models (LLMs). However, existing frameworks fall short in supporting inference or fine-tuning with multiple adapters in the following ways. 1) For fine-tuning, each job needs to deploy its dedicated base model instance, which results in excessive GPU memory consumption and poor GPU utilization. 2) While popular inference platforms can serve multiple PEFT adapters, they do not allow independent resource management or mixing of different PEFT methods. 3) They cannot share resources (such as a base model instance) between inference and fine-tuning jobs. 4) They do not provide privacy to users who may not wish to expose their fine-tuned parameters to service providers. In Symbiosis, we address the above problems by enabling as-a-service deployment of the base model. The base model layers can be shared across multiple inference or fine-tuning processes. Our split-execution technique decouples the execution of client-specific adapters and layers from the frozen base model layers, offering clients the flexibility to manage their resources, select their fine-tuning method, and achieve their performance goals. Our approach is transparent to models and works out-of-the-box for most models in the transformers library. Our evaluation on Llama2-13B shows that, compared to the baseline, Symbiosis can fine-tune 4X more adapters on the same set of GPUs in the same amount of time.
参数效率微调(PEFT)使模型构建者能够将任务特定参数捕捉到适应器中,这些参数是原始基准模型规模的一小部分。PEFT的普及性使广受欢迎的大语言模型(LLMS)产生大量适应器。然而,现有框架在支持与多个适应器进行下列方式的推断或微调方面做得不够。 1)微调方面,每个工作都需要将其专用基准模型实例部署到其专用基准实例中,这导致GPU内存消耗过多和GPU利用率差。 2)尽管流行的推断平台可以为多个PEFT的适应器服务,但它们不允许独立资源管理或混合不同的PEFT方法。 3 它们无法在广受欢迎的大语言模型(LLMMS)和微调工作之间共享大量资源(例如基础模型实例)。 4) 现有框架没有为可能不希望将其微调参数暴露给服务供应商的用户提供隐私。 在Symbiosiosisis,我们解决上述问题的方法是作为基准模型的升级部署。基础模型层的基模层层可共享于多个精调时间或微调过程,我们的软调化模型,从我们的软化模型可比分化程序,我们的软化系统显示, 显示它们的软化方法可以管理它们的软化系统化模型到我们的软化数据,它们用于其基础的软化数据层的软化数据。
Article 89
Title@2025-07-03 (4): Collective Communication Profiling of Modern-day Machine Learning Workloads
Title: Collective Communication Profiling of Modern-day Machine Learning Workloads | Kollektive Kommunikation Profilierung von modernen maschinellen Lern-Workloads | 现代机器学习工作量集体交流 2507.07117v1 |
Authors (6): Jit Gupta, Andrew Li, Tarun Banka, Ariel Cohen, T. Sridhar, Raj Yavatkar
Machine Learning jobs, carried out on a large number of distributed high performance systems, involve periodic communication using operations like AllReduce, AllGather, and Broadcast. These operations may create high bandwidth and bursty traffic patterns, leading to network congestion and packet loss, thus impacting the performance of these jobs. Hence it is imperative to analyze these patterns, which can be helpful in provisioning network resources depending on the type of machine learning workloads. In this poster we carry out extensive analysis of the collective communication behavior seen in a wide variety of models (e.g., DeepSeek, GPT, Llama). To achieve this, we instrument the Nvidia Collective Communication Library logging functionality for richer context about the collectives and workloads. We adjust configuration parameters that influence collective communication behavior, such as parallelism, number of nodes, and model type. This overview presents and discusses some of the results on the collective communication behavior for the open source DeepSeek V3 inferencing model, which includes operation type and count, transfer sizes per operation, and request size distribution. Our analysis shows that it makes sense to rethink current collective communication frameworks and network topologies so as to accommodate the effect of network anomalies on the mentioned workloads.
在分布式高性能系统的大量分布式高性能系统上开展的机器学习工作,涉及使用AllReduce、AllGather和广播等操作的定期通信。这些操作可能造成高带宽和频频交通模式,导致网络拥堵和包装损失,从而影响这些工作的绩效。因此,必须分析这些模式,这些模式有助于根据机器学习工作量的类型提供网络资源。在这个海报中,我们广泛分析了在多种模式(例如DeepSeek、GPT、Llama等)中看到的集体通信行为。为了实现这一功能,我们用Nvidia集体通信图书馆记录功能来为集体和工作量的更丰富背景。我们调整了影响集体通信行为的配置参数,例如平行性、节点数目和模式类型。本概览介绍并讨论了开放源DeepSeek V3 推导模型的集体通信行为的一些结果,其中包括操作类型和计数、每次操作的转移大小以及请求的大小分布。我们的分析表明,重新思考当前集体通信框架和网络的表层,以便适应网络的异常工作量。
Article 90
Title@2025-07-03 (4): BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers
Title: BLaST: High Performance Inference and Pretraining using BLock Sparse Transformers | BLaST: High Performance Inferenz und Pretraining mit BLock Sparse Transformers | BLAST:使用BLock Sparse变形器进行高性能推断和预先训练 2507.03117v1 |
Authors (7): Patrik Okanovic, Sameer Deshmukh, Grzegorz Kwasniewski, Kentaro Katayama, Takumi Honda, Maciej Besta, Torsten Hoefler
The energy consumption of large-scale ML models is dominated by data movement - shuffling billions of parameters across memory hierarchies and data centers. Effective sparsification to prune redundant parameters is still challenging: existing methods incur significant accuracy degradation, performance overhead, or both. We introduce (Bl)ock (a)nd (S)parse (T)ransformers (BLaST), a general, robust, and reliable sparsification method applicable to linear layers in all settings. Our method iteratively sparsifies weight matrices into a block sparsity pattern suitable for efficient sparse matrix-matrix (SpMM) multiplication. BLaST achieves up to 95% sparsity in MLP weights with negligible accuracy loss. Our fused, highly optimized Sparse MLP kernel delivers up to 16.7x speedup over dense MLPs across 9 architectures and 8 datasets, resulting in up to 1.6x inference speedup, 1.11x pretraining speedup and up to 3.12x inference memory usage reduction. BLaST enables the next generation of large-scale AI systems by reducing energy use, memory footprint, and latency.
大型 ML 模型的能量消耗主要以数据移动为主,在记忆级和数据中心之间对数十亿参数进行冲洗。 有效的对纯冗余参数的宽度仍然具有挑战性: 现有方法具有显著的精度降解、 性能管理或两者兼而有之。 我们引入了 (Bl) ock (a)nd (S)parse (T)ranstrades (BLAST) , 一种适用于所有环境线性层的一般、 稳健和可靠的弥漫方法。 我们的方法是迭代地将重力矩阵矩阵转换成一个块宽度模式,适合于高效的稀薄矩阵矩阵(SpMMM)的多重应用。 BLST 在 MLP 重量中达到95% 的宽度, 且精度损失微小。 我们的精度、 高度优化的 SplP 内核内核内核内核在9个建筑和8个数据集的密集 MLP 上达到16.7x 加速度, 导致1.6x 推导速度、 1.11x 预训练前加速速度和3.12x 内存内存使用减少。 BLAST 使下一代能够产生大规模的AI系统。
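The block sparsity pattern mentioned in the abstract above can be illustrated with a minimal magnitude-based pruning sketch: whole tiles of a weight matrix are zeroed so that a sparse kernel can skip them. This is a generic stand-in, not BLaST's iterative sparsification method. The function name, the tile-magnitude criterion, and the example matrix are assumptions.

```python
def block_prune(matrix, block, sparsity):
    """Zero the lowest-magnitude (block x block) tiles until `sparsity` of the tiles are zero."""
    rows, cols = len(matrix), len(matrix[0])
    assert rows % block == 0 and cols % block == 0, "matrix must tile evenly"
    tiles = [(r, c) for r in range(0, rows, block) for c in range(0, cols, block)]

    def magnitude(tile):
        # Sum of absolute values inside one tile (assumed pruning criterion).
        r0, c0 = tile
        return sum(abs(matrix[r][c])
                   for r in range(r0, r0 + block)
                   for c in range(c0, c0 + block))

    num_pruned = int(sparsity * len(tiles))
    pruned = set(sorted(tiles, key=magnitude)[:num_pruned])
    out = [row[:] for row in matrix]             # leave the input untouched
    for r0, c0 in pruned:
        for r in range(r0, r0 + block):
            for c in range(c0, c0 + block):
                out[r][c] = 0.0
    return out, pruned

weights = [                                      # hypothetical 4 x 4 weight matrix
    [0.1, 0.2, 3.0, 2.0],
    [0.0, 0.1, 1.0, 4.0],
    [5.0, 1.0, 0.2, 0.1],
    [2.0, 3.0, 0.3, 0.0],
]
sparse, pruned = block_prune(weights, block=2, sparsity=0.5)
for row in sparse:
    print(row)
```

Pruning at tile granularity rather than per element is what lets a sparse matrix-matrix kernel skip entire blocks of work and memory traffic.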
Article 91
Title@2025-07-03 (4): Characterizing Compute-Communication Overlap in GPU-Accelerated Distributed Deep Learning: Performance and Power Implications
Title: Characterizing Compute-Communication Overlap in GPU-Accelerated Distributed Deep Learning: Performance and Power Implications | Charakterisieren von Compute-Communication Overlap in GPU-beschleunigt verteilt Deep Learning: Leistung und Leistung Implikationen | GPU-加速传播深层学习中计算通信重叠的特性:表现和动力影响 2507.03114v1 |
Authors (6): Seonho Lee, Jihwan Oh, Junkyum Kim, Seokjin Go, Jongse Park, Divya Mahajan
This paper provides an in-depth characterization of GPU-accelerated systems to understand the interplay between overlapping computation and communication, which is commonly employed in distributed training settings. Due to the large size of models, distributing them across multiple devices is required. Overlapping strategies, which enable concurrent computation and communication, are critical for mitigating communication bottlenecks and maximizing GPU utilization. However, the current consensus is that we should always and aggressively overlap compute and communication to mitigate the overhead of distribution. By systematically evaluating state-of-the-art GPUs, this study investigates the impact of hardware features such as numeric precision, specialized cores, and power capping on distributed training workloads. Comprehensive experiments and studies showcase the effects of overlapping strategies on performance and power consumption across varying scenarios. We observe that overlapping computation and communication can result in an average computational slowdown of 18.9%, with a maximum of 40.0% slowdown. This slowdown is relative to the scenario in which no communication happens alongside the compute. We consider this an ideal execution scenario, where the communication in parallel has no impact on the compute time. However, performing computation and communication sequentially is, on average, 10.2% slower than overlapped execution, with a maximum slowdown of 26.6%. We further observe that, while specialized datapaths and optimized numeric precision mitigate certain slowdowns, overlapping execution can lead to resource contention and also increase power consumption under specific configurations. The analysis also uncovers trade-offs introduced by power and frequency capping, emphasizing the importance of balanced strategies to optimize energy efficiency and training throughput.
本文对GPU加速系统进行了深入的特性分析,以了解分布式训练中常用的计算与通信重叠之间的相互作用。由于模型规模庞大,需要将其分布到多个设备上。重叠策略支持计算与通信并发执行,对于缓解通信瓶颈和最大限度地提高GPU利用率至关重要。然而,目前的共识是,应当始终积极地重叠计算与通信,以减轻分布带来的开销。通过系统评估最先进的GPU,本研究考察了数值精度、专用核心和功率封顶等硬件特性对分布式训练工作负载的影响。全面的实验和研究展示了重叠策略在不同场景下对性能和功耗的影响。我们观察到,重叠计算与通信会导致平均18.9%的计算减速,最高减速达40.0%。这一减速是相对于计算期间没有通信的情形而言的;我们将其视为理想的执行情形,即并行通信不影响计算时间。然而,顺序执行计算与通信平均比重叠执行慢10.2%,最大减速为26.6%。我们还观察到,虽然专用数据通路和优化的数值精度可以缓解某些减速,但重叠执行会引发资源争用,并在特定配置下增加功耗。分析还揭示了功率和频率封顶带来的权衡,强调了采用均衡策略来优化能效和训练吞吐量的重要性。
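The trade-off reported in the abstract above (compute slows down under overlap, yet sequential execution is still slower overall) can be made concrete with a toy timing model. The model and the per-step times below are assumptions for illustration only; just the 18.9% average slowdown figure comes from the abstract.

```python
# Toy model (an assumption, not the paper's methodology):
# sequential step = compute + communication;
# overlapped step = slowed-down compute, with communication hidden behind it.

def sequential_time(t_compute, t_comm):
    return t_compute + t_comm

def overlapped_time(t_compute, t_comm, compute_slowdown):
    # Contention stretches compute; communication is hidden if it is shorter.
    return max(t_compute * (1.0 + compute_slowdown), t_comm)

t_compute, t_comm = 100.0, 30.0      # hypothetical per-step times in ms
seq = sequential_time(t_compute, t_comm)
ovl = overlapped_time(t_compute, t_comm, 0.189)  # 18.9% average slowdown from the abstract
gain = seq / ovl - 1.0               # how much slower sequential is than overlapped
```

In this simplified model, overlap pays off as long as the extra compute time caused by contention stays below the communication time it hides.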
Article 92
Title@2025-07-03 (4): Cppless: Single-Source and High-Performance Serverless Programming in C++
Title: Cppless: Single-Source and High-Performance Serverless Programming in C++ | Cppless: Single-Source- und High-Performance-Serverless-Programmierung in C++ | Cppless: C++中的单一来源和高绩效服务器无服务器程序 2401.10834v2 |
Authors (4): Marcin Copik, Lukas Möller, Alexandru Calotoiu, Torsten Hoefler
The rise of serverless computing introduced a new class of scalable, elastic and widely available parallel workers in the cloud. Many systems and applications benefit from offloading computations and parallel tasks to dynamically allocated resources. However, the developers of C++ applications find it difficult to integrate functions due to complex deployment, lack of compatibility between client and cloud environments, and loosely typed input and output data. To enable single-source and efficient serverless acceleration in C++, we introduce Cppless, an end-to-end framework for implementing remote functions which handles the creation, deployment, and invocation of serverless functions. Cppless is built on top of LLVM and requires only two compiler extensions to automatically extract C++ function objects and deploy them to the cloud. We demonstrate that offloading parallel computations, such as from a C++ application to serverless workers, can provide up to 59x speedup with minimal cost increase while requiring only minor code modifications.
无服务器计算机的崛起在云层中引入了一个新的可缩放、弹性和广泛可用的平行工人类别。 许多系统和应用程序受益于卸载计算和平行任务到动态分配的资源。 但是, C++ 应用程序的开发者发现,由于部署复杂、客户和云环境之间缺乏兼容性以及输入和输出数据松散,很难整合功能。 为了在 C++ 中实现单一源和高效的无服务器加速,我们引入了一个端对端框架,用于实施远程功能,处理无服务器功能的创建、部署和援引。 Cpless建在LLLLVM 顶端,只需要两个编译器扩展来自动提取 C++ 功能对象并将其投放到云端。 我们证明,从 C++ 应用程序到无服务器工人的平行计算可以提供多达59x 个的加速度,同时只需少量的代码修改。
Article 93
Title@2025-07-03 (4): HybridTier: an Adaptive and Lightweight CXL-Memory Tiering System
Title: HybridTier: an Adaptive and Lightweight CXL-Memory Tiering System | HybridTier: ein adaptives und leichtes CXL-Memory-Tiersystem | 混合板:适应和轻量的CXL-模模铁环系 2312.04789v2 |
Authors (6): Kevin Song, Jiacheng Yang, Zixuan Wang, Jishen Zhao, Sihang Liu, Gennady Pekhimenko
Modern workloads are demanding increasingly larger memory capacity. Compute Express Link (CXL)-based memory tiering has emerged as a promising solution for addressing this problem by utilizing traditional DRAM alongside slow-tier CXL memory devices. We analyze prior tiering systems and observe two challenges for high-performance memory tiering: adapting to skewed but dynamically varying data hotness distributions while minimizing memory and cache overhead due to tiering. To address these challenges, we propose HybridTier, an adaptive and lightweight tiering system for CXL memory. HybridTier tracks both long-term data access frequency and short-term access momentum simultaneously to accurately capture and adapt to shifting hotness distributions. HybridTier reduces the metadata memory overhead by tracking data accesses probabilistically, obtaining higher memory efficiency by trading off a small amount of tracking inaccuracy that has a negligible impact on application performance. To reduce cache overhead, HybridTier uses lightweight data structures that optimize for data locality to track data hotness. Our evaluations show that HybridTier outperforms prior systems by up to 91% (19% geomean), incurring 2.0-7.8x less memory overhead and 1.7-3.5x fewer cache misses.
现代工作量要求增加记忆能力。 计算Express Link( CXL) 基于Express Link( CXL) 的记忆分层是解决这一问题的一个很有希望的解决方案,它利用传统的 DRAM 和缓慢的 CXL 记忆设备来解决这一问题。 我们分析先前的分层系统并观察到高性能内存分层的两个挑战: 适应扭曲但动态差异的数据热分布,同时因分层而尽量减少记忆力和缓冲间接费用。 为了应对这些挑战,我们提议混合Tier, 一种适应性和轻量级的CXL 内存分级系统。 混合Tier 跟踪长期数据存取频率和短期存取动力 \ emph{smultaneous} 以准确捕捉和适应移动热量分布。 混合TRADIER通过跟踪数据访问 \ emph{ { 概率} 来降低元数据存储存储率,通过交换少量的追踪对应用性能影响很小的不小的不精确性能。 为了减少缓度, 减少缓称, 混合Tier 使用轻度数据存数据访问频率的数据结构, 和短期存取动力动力 = $ 0.8 和前的存储系统在9__ 10美元 10美元 的存储中, 。
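A minimal sketch of the idea of tracking long-term frequency and short-term momentum together, with sampled (probabilistic) recording. The class name `HotnessTracker`, the decay scheme, and the equal weighting of the two signals are all hypothetical; the actual HybridTier data structures are not described in the abstract and are not reproduced here.

```python
import random

class HotnessTracker:
    """Toy tracker: long-term frequency plus short-term momentum per page."""

    def __init__(self, sample_prob=0.1, decay=0.5, seed=0):
        self.sample_prob = sample_prob   # fraction of accesses recorded
        self.decay = decay               # how fast old momentum fades
        self.freq = {}                   # long-term sampled access counts
        self.momentum = {}               # recent, decayed access counts
        self.rng = random.Random(seed)

    def access(self, page):
        # Probabilistic tracking: record only a sample of accesses.
        if self.rng.random() < self.sample_prob:
            self.freq[page] = self.freq.get(page, 0) + 1
            self.momentum[page] = self.momentum.get(page, 0.0) + 1.0

    def end_epoch(self):
        # Age the short-term signal so stale pages lose momentum.
        self.momentum = {p: m * self.decay for p, m in self.momentum.items()}

    def hot_pages(self, k):
        # Rank by the sum of both signals; the equal weighting is arbitrary.
        def score(p):
            return self.freq.get(p, 0) + self.momentum.get(p, 0.0)
        return sorted(self.freq, key=score, reverse=True)[:k]

tracker = HotnessTracker(sample_prob=1.0)   # record everything for a deterministic demo
for _ in range(10):
    tracker.access("a")                     # "a" was hot in the past
for _ in range(3):
    tracker.end_epoch()
for _ in range(8):
    tracker.access("b")                     # "b" became hot recently
hottest = tracker.hot_pages(1)
```

Here page "b" outranks "a" despite fewer total accesses, because its momentum has not yet decayed; a frequency-only tracker would still prefer "a".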
Article 94
Title@2025-07-03 (4): PS-WL: A Probability-Sensitive Wear Leveling scheme for SSD array scaling
Title: PS-WL: A Probability-Sensitive Wear Leveling scheme for SSD array scaling | PS-WL: Ein Probability-Sensitive Wear Leveling-Schema für die Skalierung von SSD-Arrays | PS-WL: SSD 阵列比例缩放的概率感敏性穿级方案 2506.19660v2 |
Authors (4): Shuhang Xu, Yunfei Gu, Linhui Liu, Chentao Wu
As flash-based Solid State Drive (SSD) arrays become essential to modern data centers, scaling these arrays to meet explosive data growth is a frequent and critical operation. However, the conventional wear-leveling (WL) paradigm applied during scaling suffers from a fundamental flaw: it ignores the non-linear relationship between wear and failure probability, potentially pushing the most vulnerable, aged disks towards premature failure. To address this critical issue at its root, we propose the Probability-Sensitive Wear Leveling (PS-WL) scheme, which shifts the optimization goal from balancing wear to directly balancing failure risk. At its core, PS-WL introduces an “effective lifetime” model derived from a realistic failure probability to more accurately assess disk lifetime. This model guides a PID controller for the wear-leveling operation, with a conservative zone that minimizes performance overhead by restricting warm data migration. Comprehensive simulations validate the superiority of PS-WL over state-of-the-art methods. The results demonstrate that our approach significantly reduces performance overhead while, most critically, consistently and effectively lowering the aggregated array failure risk across diverse system configurations and workloads. This proves that by directly optimizing for reliability, PS-WL builds a scalable storage system that is, by design, fundamentally safer, more efficient, and more stable.
由于基于闪光的固态驱动器(SSD)阵列对现代数据中心至关重要,扩大这些阵列以适应爆炸性数据增长是一项经常和关键的操作。然而,在缩放过程中应用的常规磨损等级(WL)模式存在一个根本性缺陷:它忽视了磨损和故障概率之间的非线性关系,有可能将最脆弱的老磁盘推向过早的失败。为了从根本上解决这一关键问题,我们提议了“概率感应性湿分级(PS-WL)计划 ” , 将优化目标从平衡磨损转向直接平衡故障风险。 在其核心方面, PS-WL 引入了一个“ 有效终身” 模式, 其依据是现实性失败概率来更准确地评估磁盘寿命。 这个模式指导了PID控制器的磨损操作, 保守区通过限制热数据迁移而最大限度地减少性能管理。 全面模拟验证了PS-WL优于最新技术方法的优势。 其结果表明,我们的方法极大地降低了绩效管理,同时,最关键地、持续和有效地降低不同系统配置和工作量的汇总阵列失败风险。这个模型通过直接地证明,更稳定的存储是更稳定的安全。
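The combination of a PID controller with a conservative zone, as mentioned in the PS-WL abstract, might look roughly like the sketch below. The gains, the dead-band width, the use of the max-min risk spread as the error signal, and the names `PID` and `migration_rate` are assumptions for illustration; the paper's effective-lifetime model is not reproduced.

```python
class PID:
    """Textbook discrete PID controller."""

    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = None

    def step(self, error, dt=1.0):
        self.integral += error * dt
        deriv = 0.0 if self.prev_error is None else (error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * deriv

def migration_rate(risks, pid, dead_band=0.01):
    """Map the spread in per-disk failure risk to a migration intensity."""
    imbalance = max(risks) - min(risks)
    if imbalance <= dead_band:
        return 0.0   # conservative zone (assumed to be a dead band): skip migration
    return max(0.0, pid.step(imbalance))

pid = PID(kp=1.0, ki=0.1, kd=0.0)                 # hypothetical gains
calm = migration_rate([0.100, 0.105], pid)        # spread inside the dead band
busy = migration_rate([0.10, 0.30], pid)          # spread well outside it
```

The dead band keeps the controller idle while risks are nearly balanced, which is one simple way to avoid paying migration overhead for negligible gains.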
Article 95
Title@2025-07-03 (4): FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference
Title: FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference | FlowSpec: Kontinuierliche pipelined Spekulative Dekodierung für effiziente verteilte LLM-Inferenz | 流谱:为有效分布分布的LLM 推断而持续喷射的投机性分解 2507.02620v1 |
Authors (4): Xing Liu, Lizhuo Luo, Ming Tang, Chao Huang
Distributed inference serves as a promising approach to enabling the inference of large language models (LLMs) at the network edge. It distributes the inference process to multiple devices to ensure that the LLMs can fit into the device memory. Recent pipeline-based approaches have the potential to parallelize communication and computation, which helps reduce inference latency. However, the benefit diminishes when inference requests at the network edge are sparse, where the pipeline is typically at low utilization. To enable efficient distributed LLM inference at the edge, we propose FlowSpec, a pipeline-parallel tree-based speculative decoding framework. FlowSpec incorporates three key mechanisms to improve decoding efficiency: 1) score-based step-wise verification that prioritizes more important draft tokens to bring earlier accepted tokens; 2) efficient draft management to prune invalid tokens while maintaining correct causal relationships during verification; 3) dynamic draft expansion strategies to supply high-quality speculative inputs. These techniques work in concert to enhance both pipeline utilization and speculative efficiency. We evaluate FlowSpec on a real-world testbed with other baselines. Experimental results demonstrate that our proposed framework significantly improves inference speed across diverse models and configurations, achieving speedup ratios of 1.36x-1.77x compared to baselines. Our code is publicly available at https://github.com/Leosang-lx/FlowSpec#
在网络边缘,分布式的推论是一种大语言模型(LLMs)的推论方法,在网络边缘是一种大语言模型(LLMs)的推论方法,很有希望,它将推论过程分散到多个设备,以确保LLMs能够与设备内存相容。最近基于管道的方法有可能平行通信和计算,从而有助于减少推论的延缓度。然而,如果网络边缘的推论要求稀少,而管道通常使用率较低,则好处就会减少。为了在边缘有效分布式LLLM推论,我们提议建立一个基于管道-parllel树的投机解码框架。TextSpec包含三个关键机制,以提高解码效率:(1) 基于分数的分数分分分的分步核查办法,将更重要的代号放在更早一点,以带来折叠的标物;(2)在核查期间,在保持正确的因果关系的同时,高效率地管理普兰松无效的代币;(3) 动态的扩展战略草案,以提供高质量的投机性投入。这些技术工作以音乐方式加强管道利用和投机效率。我们评估在现实-美元-数字的Slex_Slex标准上改进了我们SlexSlex/slex标准在1号/s
Article 96
Title@2025-07-03 (4): MULTI-SCOUT: Multistatic Integrated Sensing and Communications in 5G and Beyond for Moving Target Detection, Positioning, and Tracking
Title: MULTI-SCOUT: Multistatic Integrated Sensing and Communications in 5G and Beyond for Moving Target Detection, Positioning, and Tracking | MULTI-SCOUT: Multistatisches integriertes Sensing und Kommunikation in 5G und darüber hinaus für das Verschieben von Zielerkennung, Positionierung und Tracking | 目标探测、定位和跟踪:用于推进目标探测、定位和跟踪的 5G及5G 以外多空间综合遥感和通信 2507.02613v1 |
Authors (6): Yalin E. Sagduyu, Kemal Davaslioglu, Tugba Erpek, Sastry Kompella, Gustave Anderson, Jonathan Ashdown
This paper presents a complete signal-processing chain for multistatic integrated sensing and communications (ISAC) using the 5G Positioning Reference Signal (PRS). We consider a distributed architecture in which one gNB transmits a periodic OFDM-PRS waveform while multiple spatially separated receivers exploit the same signal for target detection, parameter estimation and tracking. A coherent cross-ambiguity function (CAF) is evaluated to form a range-Doppler map from which the bistatic delay and radial velocity are extracted for every target. For a single target, the resulting bistatic delays are fused through nonlinear least-squares trilateration, yielding a geometric position estimate, and a regularized linear inversion of the radial-speed equations yields a two-dimensional velocity vector, from which speed and heading are obtained. The approach is applied to 2D and 3D settings, extended to account for time synchronization bias, and generalized to multiple targets by resolving target association. The sequence of position-velocity estimates is then fed to standard and extended Kalman filters to obtain smoothed tracks. Our results show high-fidelity moving-target detection, positioning, and tracking using 5G PRS signals for multistatic ISAC.
本文用5G定位参考信号(PRS)为多静态综合遥感和通信提供了一个完整的信号处理链(ISAC),用于多静态综合遥感和通信。我们考虑一个分布式结构,即一个GNB传输一个定期的OFDM-PRS波形,而多个空间分离的接收器利用同一信号进行目标探测、参数估计和跟踪。一个连贯的交叉立体功能(CAF)被评价成一个射程-Doppler地图,每个目标都可以从中提取二等缓存延迟和辐射速度。对于一个单一目标,由此产生的二等延迟会通过非线性最小方位的三角连接而成,产生几何位置估计,以及一个定期化线性线性线性线性对射线式转换产生二维速度矢量,获得速度和航向。该方法适用于2D和3D环境环境,用于计算时间同步偏差,并通过解决目标关联而向多个目标普及。对于位置-速度估计的顺序随后被输入到标准和扩大的Kalman过滤器,以获得平稳轨道。我们的成果显示用于高分辨率定位和多分辨率的IS目标探测。
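The nonlinear least-squares trilateration step named in the abstract above can be sketched with a plain Gauss-Newton iteration. This is a simplified illustration under assumptions not taken from the paper: a 2D geometry, noise-free measurements, perfectly synchronized clocks, and bistatic ranges (delay times the speed of light) given directly. The function names and coordinates are hypothetical.

```python
import math

def bistatic_range(p, tx, rx):
    """Transmitter-to-target plus target-to-receiver path length."""
    return math.dist(p, tx) + math.dist(p, rx)

def solve_position(tx, receivers, ranges, guess, iters=20):
    """Gauss-Newton fit of a 2D target position to measured bistatic ranges."""
    x, y = guess
    for _ in range(iters):
        # Accumulate the normal equations (J^T J) d = -J^T r for the update d.
        a11 = a12 = a22 = b1 = b2 = 0.0
        for rx, rho in zip(receivers, ranges):
            dt = math.dist((x, y), tx)
            dr = math.dist((x, y), rx)
            res = dt + dr - rho
            # Gradient of the bistatic range with respect to the target position.
            gx = (x - tx[0]) / dt + (x - rx[0]) / dr
            gy = (y - tx[1]) / dt + (y - rx[1]) / dr
            a11 += gx * gx
            a12 += gx * gy
            a22 += gy * gy
            b1 -= gx * res
            b2 -= gy * res
        det = a11 * a22 - a12 * a12
        x += (a22 * b1 - a12 * b2) / det
        y += (a11 * b2 - a12 * b1) / det
    return x, y

tx = (0.0, 0.0)                                        # hypothetical gNB position
receivers = [(100.0, 0.0), (0.0, 100.0), (100.0, 100.0)]
target = (40.0, 70.0)                                  # ground truth for the demo
ranges = [bistatic_range(target, tx, rx) for rx in receivers]
est_x, est_y = solve_position(tx, receivers, ranges, guess=(50.0, 50.0))
```

Each measurement constrains the target to an ellipse with the transmitter and one receiver as foci; the solver finds the point where the ellipses meet. With noisy delays or a synchronization bias, extra unknowns or regularization would be needed.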
Article 97
Title@2025-07-03 (4): Analysing semantic data storage in Distributed Ledger Technologies for Data Spaces
Title: Analysing semantic data storage in Distributed Ledger Technologies for Data Spaces | Analyse der semantischen Datenspeicherung in verteilten Ledger-Technologien für Datenräume | 分析数据空间分布式分类账簿技术中的语义数据存储 2507.07116v1 |
Authors (5): Juan Cano-Benito, Andrea Cimmino, Sven Hertling, Heiko Paulheim, Raúl García-Castro
Data spaces are emerging as decentralised infrastructures that enable sovereign, secure, and trustworthy data exchange among multiple participants. To achieve semantic interoperability within these environments, the use of semantic web technologies and knowledge graphs has been proposed. Although distributed ledger technologies (DLT) fit as the underlying infrastructure for data spaces, there remains a significant gap in terms of the efficient storage of semantic data on these platforms. This paper presents a systematic evaluation of semantic data storage across different types of DLT (public, private, and hybrid), using a real-world knowledge graph as an experimental basis. The study compares performance, storage efficiency, resource consumption, and the capabilities to update and query semantic data. The results show that private DLTs are the most efficient for storing and managing semantic content, while hybrid DLTs offer a balanced trade-off between public auditability and operational efficiency. This research leads to a discussion on the selection of the most appropriate DLT infrastructure based on the data sovereignty requirements of decentralised data ecosystems.
数据空间正在作为分散化基础设施出现,使多个参与者能够进行主权、安全和可信赖的数据交流。为了实现这些环境中的语义互操作性,提出了使用语义网络技术和知识图表的建议。虽然分布式分类账技术适合作为数据空间的基本基础设施,但在有效储存这些平台上的语义数据方面仍然存在巨大差距。本文件以真实世界知识图作为实验基础,对不同类型(公共、私人和混合)的语义数据储存进行了系统评价。这项研究比较了性能、储存效率、资源消耗以及更新和查询语义数据的能力。研究结果显示,私营DLT对储存和管理语义内容最为高效,而混合DLT则提供了公共审计能力与操作效率之间的平衡权衡。这一研究导致讨论根据分散式数据生态系统的数据主权要求选择最适合的DLT基础设施。
Article 98
Title@2025-07-03 (4): AI Flow: Perspectives, Scenarios, and Approaches
Title: AI Flow: Perspectives, Scenarios, and Approaches | AI Flow: Perspektiven, Szenarien und Ansätze | AI 流动:观点、设想和方法 2506.12479v2 |
Authors (14): Hongjun An, Wenhan Hu, Sida Huang, Siqi Huang, Ruanjun Li, Yuanzhi Liang, Jiawei Shao, Yiliang Song, Zihan Wang, Cheng Yuan, Chi Zhang, Hongyuan Zhang, Wenhao Zhuang, Xuelong Li
Pioneered by the foundational information theory by Claude Shannon and the visionary framework of machine intelligence by Alan Turing, the convergent evolution of information and communication technologies (IT/CT) has created an unbroken wave of connectivity and computation. This synergy has sparked a technological revolution, now reaching its peak with large artificial intelligence (AI) models that are reshaping industries and redefining human-machine collaboration. However, the realization of ubiquitous intelligence faces considerable challenges due to substantial resource consumption in large models and high communication bandwidth demands. To address these challenges, AI Flow has been introduced as a multidisciplinary framework that integrates cutting-edge IT and CT advancements, with a particular emphasis on the following three key points. First, a device-edge-cloud framework serves as the foundation, integrating end devices, edge servers, and cloud clusters to optimize scalability and efficiency for low-latency model inference. Second, we introduce the concept of familial models, which refers to a series of different-sized models with aligned hidden features, enabling effective collaboration and the flexibility to adapt to varying resource constraints and dynamic scenarios. Third, connectivity- and interaction-based intelligence emergence is a novel paradigm of AI Flow. By leveraging communication networks to enhance connectivity, the collaboration among AI models across heterogeneous nodes achieves emergent intelligence that surpasses the capability of any single model. The innovations of AI Flow provide enhanced intelligence, timely responsiveness, and ubiquitous accessibility to AI services, paving the way for the tighter fusion of AI techniques and communication systems.
由于克劳德·香农的基本信息理论和艾伦·图灵的机智智能远见框架的开创性,信息和通信技术(IT/CT)的趋同性演进形成了一个不间断的连通和计算浪潮,这种协同效应引发了技术革命,现在随着大型人工智能(AI)模型的重新塑造工业和重新界定人体机械合作而达到顶峰。然而,由于大型模型中大量资源消耗和高通信带宽需求,实现无处不在的情报面临巨大挑战。为了应对这些挑战,AI流动被引入了多学科框架,将先进的信息技术和CT进步结合起来,特别强调以下三个关键点。首先,装置-顶尖的云形框架作为基础,将终端装置、边缘服务器和云层集群结合起来,优化低电流模型的伸缩性和效率。第二,我们引入了家庭模型的概念,即一系列规模不同的模型,与一致的隐蔽性特征相适应,使得有效的合作和灵活性能够适应不同的资源限制和动态情景。第三,连接性和互动性框架作为基础基础,将连接性和互动性框架作为基础,将最终的智能升级性模型,从而提升AI系统。
Article 99
Title@2025-07-03 (4): Resolving CAP Through Automata-Theoretic Economic Design: A Unified Mathematical Framework for Real-Time Partition-Tolerant Systems
Title: Resolving CAP Through Automata-Theoretic Economic Design: A Unified Mathematical Framework for Real-Time Partition-Tolerant Systems | Lösung von CAP durch Automata-Theoretisches Wirtschaftsdesign: Ein einheitlicher mathematischer Rahmen für Echtzeit-Partitions-Tolerante Systeme | 通过自动化数据理论经济设计解决CAP:实时分区-耐用系统统一数学框架 2507.02464v1 |
Authors (1): Craig S Wright
The CAP theorem asserts a trilemma between consistency, availability, and partition tolerance. This paper introduces a rigorous automata-theoretic and economically grounded framework that reframes the CAP trade-off as a constraint optimization problem. We model distributed systems as partition-aware state machines and embed economic incentive layers to stabilize consensus behavior across adversarially partitioned networks. By incorporating game-theoretic mechanisms into the global transition semantics, we define provable bounds on convergence, liveness, and correctness. Our results demonstrate that availability and consistency can be simultaneously preserved within bounded epsilon margins, effectively extending the classical CAP limits through formal economic control.
CAP 定理主张一致性、 可用性和 分区容忍性之间的三角关系。 本文引入了严格的自动数据理论和经济基础框架, 将CAP 的权衡重新设定为限制优化问题。 我们将分布系统建为有分区意识的国家机器, 并嵌入经济激励层以稳定敌对隔离网络的共识行为。 通过将游戏理论机制纳入全球过渡语义, 我们定义了可测量的趋同、 活性和正确性界限。 我们的结果表明, 可用性和一致性可以同时保存在受约束的Epsilon边际范围内, 通过正式的经济控制有效地扩展了典型CAP的界限。
Article 100
Title@2025-07-03 (4): Red grape detection with accelerated artificial neural networks in the FPGA’s programmable logic
Title: Red grape detection with accelerated artificial neural networks in the FPGA’s programmable logic | Rote Traubenerkennung mit beschleunigten künstlichen neuronalen Netzwerken in der programmierbaren Logik des FPGA | FPGA的可编程逻辑的红葡萄探测与加速人工神经网络 2507.02443v1 |
Authors (5): Sandro Costa Magalhães, Marco Almeida, Filipe Neves dos Santos, António Paulo Moreira, Jorge Dias
Robots usually slow down for scanning to detect objects while moving. Additionally, the robot’s camera is configured with a low framerate to track the velocity of the detection algorithms. This would be constrained while executing tasks and exploring, making robots increase the task execution time. AMD has developed the Vitis-AI framework to deploy detection algorithms into FPGAs. However, this tool does not fully use the FPGAs’ PL. In this work, we use the FINN architecture to deploy three ANNs, MobileNet v1 with 4-bit quantisation, CNV with 2-bit quantisation, and CNV with 1-bit quantisation (BNN), inside an FPGA’s PL. The models were trained on the RG2C dataset. This is a self-acquired dataset released in open access. MobileNet v1 performed better, reaching a success rate of 98% and an inference speed of 6611 FPS. In this work, we proved that we can use FPGAs to speed up ANNs and make them suitable for attention mechanisms.
机器人在移动时通常会慢下来, 以便让罐头在移动时检测物体 。 此外, 机器人的相机配置低框架速率, 以跟踪检测算法的速度 。 这将在任务执行和探索时受到限制, 使机器人增加任务执行时间 。 AMD 开发了 Vitis- AI 框架, 将检测算法部署到 FPGAs 中。 但是, 这个工具没有完全使用 FPGAs PL 。 在这项工作中, 我们使用 FINN 结构来部署 3 个 ANN、 MobilNet v1 和 4 位四位方位数的 ANN、 CPV 和 CNV 和 1 位四位量化 (BNNN) 。 这些模型在 FPGA 的 PL 中被训练为 RG2C 数据集 。 这是在开放访问中释放的自取数据集 。 MOPNet v1 效果更好, 达到 98 和 611 FPS 的推断速度 。 在这项工作中, 我们证明我们可以使用 FPGA 来加速加速 。
Article 101
Title@2025-07-03 (4): The Artificial Scientist – in-transit Machine Learning of Plasma Simulations
Title: The Artificial Scientist – in-transit Machine Learning of Plasma Simulations | Der Künstliche Wissenschaftler – in-transit maschinelles Lernen von Plasmasimulationen | 人造科学家 – – Plasma模拟模拟的中转机器学习 2501.03383v3 |
Authors (22): Jeffrey Kelling, Vicente Bolea, Michael Bussmann, Ankush Checkervarty, Alexander Debus, Jan Ebert, Greg Eisenhauer, Vineeth Gutta, Stefan Kesselheim, Scott Klasky, Vedhas Pandit, Richard Pausch, Norbert Podhorszki, Franz Poschel, David Rogers, Jeyhun Rustamov, Steve Schmerler, Ulrich Schramm, Klaus Steiniger, Rene Widera, Anna Willmann, Sunita Chandrasekaran
Increasing HPC cluster sizes and large-scale simulations that produce petabytes of data per run create massive IO and storage challenges for analysis. Deep learning-based techniques, in particular, make use of these amounts of domain data to extract patterns that help build scientific understanding. Here, we demonstrate a streaming workflow in which simulation data is streamed directly to a machine-learning (ML) framework, circumventing the file system bottleneck. Data is transformed in transit, asynchronously to the simulation and the training of the model. With the presented workflow, data operations can be performed in common and easy-to-use programming languages, freeing the application user from adapting the application output routines. As a proof-of-concept we consider a GPU accelerated particle-in-cell (PIConGPU) simulation of the Kelvin-Helmholtz instability (KHI). We employ experience replay to avoid catastrophic forgetting in learning from this non-steady process in a continual manner. We detail challenges addressed while porting and scaling to the Frontier exascale system.
深度学习技术,特别是利用这些数量的域内数据来提取有助于建立科学理解的模式。在这里,我们展示了一个流流工作流程,模拟数据直接流到一个机器学习(ML)框架,绕过文件系统瓶颈。数据在中转过程中不同步地转换为模拟和培训模型。随着所提供的工作流程,数据操作可以用通用和易用的编程语言进行,使应用程序用户不必适应应用程序输出常规。作为证据,我们考虑对Kelvin-Helmholtz 进行GPU加速细胞中的粒子模拟(PIConGPU),我们利用经验再玩,避免在从非稳定的进程中不断学习过程中灾难性地遗忘。我们详细介绍了在移植和扩展到前沿外观系统时遇到的挑战。
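Experience replay for a non-steady data stream, as mentioned in the abstract above, can be sketched as a bounded buffer whose contents are mixed into every training batch. The use of reservoir sampling, the class name `ReplayBuffer`, and the batch composition are assumptions for illustration; the abstract does not specify how the authors manage their buffer.

```python
import random

class ReplayBuffer:
    """Bounded buffer that keeps a uniform sample of everything streamed so far."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, sample):
        # Reservoir sampling: every sample seen so far is kept with equal probability.
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(sample)
        else:
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = sample

    def make_batch(self, fresh, n_replay):
        # Mix freshly streamed samples with replayed older ones.
        k = min(n_replay, len(self.items))
        return list(fresh) + self.rng.sample(self.items, k)

buf = ReplayBuffer(capacity=4)
for step in range(100):          # stand-in for samples streamed from the simulation
    buf.add(step)
batch = buf.make_batch(fresh=[100, 101], n_replay=2)
```

Because older samples keep reappearing in batches, the model continues to see earlier phases of the simulation while it trains on the newest data.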
Article 102
Title@2025-07-03 (4): Alps, a versatile research infrastructure
Title: Alps, a versatile research infrastructure | Alpen, eine vielseitige Forschungsinfrastruktur | 阿尔卑斯山,多用途研究基础设施 2507.02404v1 |
Authors (3): Maxime Martinasso, Mark Klein, Thomas C. Schulthess
The Swiss National Supercomputing Centre (CSCS) has a long-standing tradition of delivering top-tier high-performance computing systems, exemplified by the Piz Daint supercomputer. However, the increasing diversity of scientific needs has exposed limitations in traditional vertically integrated HPC architectures, which often lack flexibility and composability. To address these challenges, CSCS developed Alps, a next-generation HPC infrastructure designed with a transformative principle: resources operate as independent endpoints within a high-speed network. This architecture enables the creation of independent tenant-specific and platform-specific services, tailored to diverse scientific requirements. Alps incorporates heterogeneous hardware, including CPUs and GPUs, interconnected by a high-performance Slingshot network, and offers a modular storage system. A key innovation is the versatile software-defined cluster (vCluster) technology, which bridges cloud and HPC paradigms. By abstracting infrastructure, service management, and user environments into distinct layers, vClusters allow for customized platforms that support diverse workloads. Current platforms on Alps serve various scientific domains, including numerical weather prediction, and AI research.
瑞士国家超高速计算中心(CSCS)有着提供顶级高性能计算系统的长期传统,例如Piz Daint超级计算机;然而,科学需求日益多样化,暴露了传统的纵向一体化高电联结构中的局限性,这些结构往往缺乏灵活性和可复性;为了应对这些挑战,CSCS开发了高电联基础设施Alps,这是下一代高电联基础设施,其设计具有变革性原则:资源在高速网络中作为独立的端点运作;这一架构使得能够创建独立、针对租户和平台的服务,适合不同的科学要求;阿尔卑斯综合了多种硬件,包括CPU和GPUPUs,通过高性能闪光网络相互连接,提供了一个模块储存系统;一项关键的创新是多功能软件定义的集聚群技术,将云和高电联模式连接起来;通过将基础设施、服务管理和用户环境抽象地分化到不同的层, vClusters能够建立支持不同工作量的定制平台;目前阿尔卑斯平台为各种科学领域服务,包括数字天气预报和AI研究服务。
Article 103
Title@2025-07-03 (4): VeFIA: An Efficient Inference Auditing Framework for Vertical Federated Collaborative Software
Title: VeFIA: An Efficient Inference Auditing Framework for Vertical Federated Collaborative Software | VeFIA: Ein effizientes Inferenz-Audit-Framework für vertical Federated Collaborative Software | VEFIA: 垂直联邦合作软件有效推断审计框架 2507.02376v1 |
Authors (6): Chung-ju Huang, Ziqi Zhang, Yinggui Wang, Binghui Wang, Tao Wei, Leye Wang
Vertical Federated Learning (VFL) is a distributed AI software deployment mechanism for cross-silo collaboration without accessing participants’ data. However, existing VFL work lacks a mechanism to audit the execution correctness of the inference software of the data party. To address this problem, we design a Vertical Federated Inference Auditing (VeFIA) framework. VeFIA helps the task party to audit whether the data party’s inference software is executed as expected during large-scale inference without leaking the data privacy of the data party or introducing additional latency to the inference system. The core of VeFIA is that the task party can use the inference results from a framework with Trusted Execution Environments (TEE) and the coordinator to validate the correctness of the data party’s computation results. VeFIA guarantees that, as long as the abnormal inference exceeds 5.4%, the task party can detect execution anomalies in the inference software with a probability of 99.99%, without incurring any additional online inference latency. VeFIA’s random sampling validation achieves 100% positive predictive value, negative predictive value, and true positive rate in detecting abnormal inference. To the best of our knowledge, this is the first paper to discuss the correctness of inference software execution in VFL.
VFIA帮助任务方审计数据方的推论软件是否在大规模推论期间按预期执行,而不会泄露数据方的数据隐私,也不会给推断系统带来额外的延迟。VFIA的核心是任务方可以使用信任执行环境框架和数据方计算结果协调员框架的推论结果来验证数据方计算结果的正确性。 VEFIA保证,只要异常推论超过5.4%,任务方可以检测出推论软件中的执行异常情况,其概率为99.99%,而不会给网上推论系统带来任何额外的延迟。VEFIA的核心是任务方可以使用信任执行环境框架和协调员框架的推论结果来验证数据方计算结果的正确性。VEFIA保证,只要异常推论超过5.4%,任务方可以检测出推论软件中的异常性异常性,且有可能达到99.99%,而不会在网上推论任何额外的拉度。VEFIA的随机抽样验证可以使用100%的精确度来检测我们准确度。
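The link between the 5.4% abnormal-inference threshold and the 99.99% detection probability quoted in the VeFIA abstract can be illustrated with a back-of-the-envelope calculation. It assumes that audited inferences are sampled independently and uniformly, so that each sample hits an abnormal inference with probability p; this is an assumption made here for illustration, and the paper's actual sampling scheme may differ.

```python
import math

def samples_needed(abnormal_fraction, target_detect_prob):
    """Smallest n with 1 - (1 - p)^n >= target, for independent uniform samples."""
    return math.ceil(
        math.log(1.0 - target_detect_prob) / math.log(1.0 - abnormal_fraction)
    )

n = samples_needed(0.054, 0.9999)       # thresholds quoted in the abstract
detect_prob = 1.0 - (1.0 - 0.054) ** n  # probability that at least one sample is abnormal
```

Under this assumption the required number of samples depends only on the abnormal fraction and the target probability, not on the total number of inferences, which is why random sampling can keep auditing cost low at large scale.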
Article 104
Title@2025-07-03 (4): Flotilla: A scalable, modular and resilient federated learning framework for heterogeneous resources
Title: Flotilla: A scalable, modular and resilient federated learning framework for heterogeneous resources | Flotilla: Ein skalierbarer, modularer und widerstandsfähiger föderierter Lernrahmen für heterogene Ressourcen | 船队:多样化资源的可扩展、模块化和有弹性的联邦学习框架 2507.02295v1 |
Authors (8): Roopkatha Banerjee, Prince Modi, Jinal Vyas, Chunduru Sri Abhijit, Tejus Chandrashekar, Harsha Varun Marisetty, Manik Gupta, Yogesh Simmhan
With the recent improvements in mobile and edge computing and rising concerns of data privacy, Federated Learning (FL) has rapidly gained popularity as a privacy-preserving, distributed machine learning methodology. Several FL frameworks have been built for testing novel FL strategies. However, most focus on validating the learning aspects of FL through pseudo-distributed simulation but not on deploying on real edge hardware in a distributed manner to meaningfully evaluate the federated aspects from a systems perspective. Current frameworks are also inherently not designed to support asynchronous aggregation, which is gaining popularity, and have limited resilience to client and server failures. We introduce Flotilla, a scalable and lightweight FL framework. It adopts a “user-first” modular design to help rapidly compose various synchronous and asynchronous FL strategies while being agnostic to the DNN architecture. It uses stateless clients and a server design that separates out the session state, which is periodically or incrementally checkpointed. We demonstrate the modularity of Flotilla by evaluating five different FL strategies for training five DNN models. We also evaluate the client and server-side fault tolerance on 200+ clients, and showcase its ability to rapidly fail over within seconds. Finally, we show that Flotilla’s resource usage on Raspberry Pis and Nvidia Jetson edge accelerators is comparable to or better than three state-of-the-art FL frameworks, Flower, OpenFL and FedML. It also scales significantly better compared to Flower for 1000+ clients. This positions Flotilla as a competitive candidate to build novel FL strategies on, compare them uniformly, rapidly deploy them, and perform systems research and optimizations.
随着移动和边缘计算的最新改进,以及数据隐私的日益关切,Federal Learning(FL)作为一个隐私保护、分布式的机器学习方法迅速获得普及。一些FL框架已经建成,用于测试新的FL战略。然而,大多数框架的重点是通过假分布模拟来验证FL的学习方面,而不是以分布式方式在真实边缘硬件上进行部署,以便从系统的角度有意义地评估FL的加入方方面面。目前框架本身也并非旨在支持非同步整合,这种整合正在越来越受欢迎,而且对客户和服务器故障的适应力也很有限。我们引入了可缩缩缩放和轻轻的FL框架。我们采用了“用户第一”的模块设计来测试FL的学习方面,以帮助快速制定各种同步和无声调的FL战略,同时对DNNN的架构进行定量或递增检查。我们用5种FL战略来评估5种不同的FL战略的模块。我们还大大评价客户和FLFloral-lielder Veral的快速使用能力,我们用Flixal-lax 和Fleveral的功能展示了300的功能,我们用Flder-lix的功能上显示了比R的功能更棒的功能更能展示。
Article 105
Title@2025-07-03 (4): Domain-Adversarial Transfer Learning for Fault Root Cause Identification in Cloud Computing Systems
Title: Domain-Adversarial Transfer Learning for Fault Root Cause Identification in Cloud Computing Systems | Domain-Adversarial-Transfer-Lernen für fehlerhafte Root-Cause-Identifikation in Cloud Computing-Systemen | 为在云计算系统中查明原因原因而进行校内自动转移学习 2507.02233v1 |
Authors (2): Bruce Fang, Danyi Gao
This paper addresses the challenge of fault root cause identification in cloud computing environments. The difficulty arises from complex system structures, dense service coupling, and limited fault information. To solve this problem, an intelligent identification algorithm based on transfer learning is proposed. The method introduces a shared feature extraction module and a domain adversarial mechanism to enable effective knowledge transfer from the source domain to the target domain. This improves the model’s discriminative ability and generalization performance in the target domain. The model incorporates a pseudo-label selection strategy. When labeled samples are lacking in the target domain, high-confidence predictions are used in training. This enhances the model’s ability to recognize minority classes. To evaluate the stability and adaptability of the method in real-world scenarios, experiments are designed under three conditions: label scarcity, class imbalance, and heterogeneous node environments. Experimental results show that the proposed method outperforms existing mainstream approaches in several key metrics, including accuracy, F1-Score, and AUC. The model demonstrates stronger discriminative power and robustness. Notably, under extreme class imbalance and significant structural differences in the target domain, the model still maintains high performance. This validates the effectiveness and practical value of the proposed mechanisms in complex cloud computing systems.
本文探讨云计算环境中的缺陷根源识别挑战。 困难来自复杂的系统结构、密集的服务连接和有限的缺陷信息。 为了解决这一问题, 提出了基于转移学习的智能识别算法。 方法引入了一个共享特征提取模块和一个域对称机制, 以便从源域向目标域进行有效的知识转移。 这改善了模型在目标域的偏差能力和概括性表现。 模型包含一个假标签选择战略。 当标注样本在目标域缺乏时, 使用高可信度预测方法来进行培训。 这增强了模型识别少数群体类别的能力。 为了评估在现实世界情景中方法的稳定性和适应性, 实验是在三个条件下设计的: 标签稀缺性、 阶级不平衡和多变节点环境。 实验结果表明, 拟议的方法超越了在几个关键指标领域( 包括准确性、 F1- 核心和 AUC ) 的现有主流方法。 该模型显示了更强的歧视性力量和强健性。 很明显, 在极端的类别不平衡性和目标域的重大结构差异下, 模型仍然保持高性能。
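The pseudo-label selection strategy described in the abstract above (using high-confidence predictions on unlabeled target-domain samples as training labels) can be sketched as a simple threshold filter. The 0.9 threshold, the function name, and the example probabilities are hypothetical; the abstract does not state the selection rule the authors use.

```python
def select_pseudo_labels(probs, threshold=0.9):
    """Return (sample_index, predicted_class) for confident predictions only."""
    selected = []
    for i, p in enumerate(probs):
        best = max(range(len(p)), key=p.__getitem__)   # index of the top class
        if p[best] >= threshold:
            selected.append((i, best))
    return selected

# Hypothetical class probabilities for three unlabeled target-domain samples.
probs = [[0.95, 0.05],
         [0.60, 0.40],
         [0.08, 0.92]]
pseudo = select_pseudo_labels(probs)
```

Only the first and third samples pass the threshold; the ambiguous second sample is left out so that uncertain predictions do not feed errors back into training.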