cs.DC @ 2025-06-01: 103
-
00 05-29 (4) From Connectivity to Autonomy: The Dawn of Self-Evolving Communication Systems Von der Konnektivität zur Autonomie: Die Morgenröte der sich selbst entwickelnden Kommunikationssysteme 从连接到自主:自我发展的通信系统的黎明 2505.23710v1 -
01 05-29 Distributed Federated Learning for Vehicular Network Security: Anomaly Detection Benefits and Multi-Domain Attack Threats Verteiltes Federated Learning für die Sicherheit des Vehicular Network: Anomalieerkennungsvorteile und Multi-Domain-Angriffsbedrohungen 分布式联邦学习促进车辆网络安全:反常探测效益和多领域攻击威胁 2505.23706v1 -
02 05-29 Parallel GPU-Accelerated Randomized Construction of Approximate Cholesky Preconditioners Parallele GPU-beschleunigte Randomisierte Konstruktion von ungefähren Cholesky-Vorkonditionen 平行的GPU-加速加速旋转式建造近焦天空预设装置 2505.02977v2 -
03 05-29 Complementary Time-Space Tradeoff for Self-Stabilizing Leader Election: Polynomial States Meet Sublinear Time Komplementärer Zeit-Raum-Tradeoff für selbststabilisierende Leader-Wahl: Polynome Staaten treffen auf sublineare Zeit 自我稳定领导人选举的补充时间-空间权衡:多民族国家满足亚线性时间 2505.23649v1 -
04 05-29 Accelerated Training of Federated Learning via Second-Order Methods Beschleunigte Ausbildung des Föderierten Lernens über Methoden der zweiten Ordnung 通过二级方法加快联邦学习培训 2505.23588v1 -
05 05-29 Sustainable Carbon-Aware and Water-Efficient LLM Scheduling in Geo-Distributed Cloud Datacenters Nachhaltiges CO2-basiertes und wassereffizientes LLM-Scheduling in Geo-verteilten Cloud-Rechenzentren 地球分布云数据中心的可持续碳软件和水效率高的LLM 2505.23554v1 -
06 05-29 Accelerating AllReduce with a Persistent Straggler AllReduce mit einem persistenten Straggler beschleunigen 使用持久性斯特拉格驱动器加速全部拖动 2505.23523v1 -
07 05-29 SealOS+: A Sealos-based Approach for Adaptive Resource Optimization Under Dynamic Workloads for Securities Trading System SealOS+: Ein Sealos-basierter Ansatz für adaptive Ressourcenoptimierung unter dynamischen Workloads für Securities Trading System SealOS+:证券交易系统动态工作量下的适应性资源优化的以海路为基础的办法 2505.23258v1 -
08 05-29 Smaller, Smarter, Closer: The Edge of Collaborative Generative AI Kleiner, intelligenter, enger: Der Rand der kollaborativen Generativen KI 较小、更聪明、更近:合作创造的边缘 AI 2505.16499v2 -
09 05-29 MemAscend: System Memory Optimization for SSD-Offloaded LLM Fine-Tuning MemAscend: Systemspeicheroptimierung für SSD-Offloaded LLM Fine-Tuning MemAscend: SSD- 卸载 LLM 精密调试的系统内存优化 2505.23254v1 -
10 05-29 Edge-First Language Model Inference: Models, Metrics, and Tradeoffs Edge-First Language Model Inferenz: Modelle, Metrics und Tradeoffs 边缘第一语言模式示范推论:模型、计量和权衡取舍 2505.16508v2 -
11 05-29 Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism Ghidorah: Schnelle LLM-Inferenz am Rand mit spekulativer Dekodierung und Hetero-Core-Parallelität Ghidorah:快速LLM 2505.23219v1 -
12 05-30 (5) Improving Parallel Program Performance with LLM Optimizers via Agent-System Interfaces Verbesserung der parallelen Programmleistung mit LLM-Optimierern über Agent-System-Schnittstellen 通过代理-系统接口改进与LLM优化器的平行方案绩效 2410.15625v4 -
13 05-29 (4) The Panaceas for Improving Low-Rank Decomposition in Communication-Efficient Federated Learning Die Panaceas zur Verbesserung der Zersetzung mit geringem Rank im kommunikativ-effizienten Federated Learning 改善通信-高效联邦学习中低-兰克分解的全景 2505.23176v1 -
14 05-29 DOPPLER: Dual-Policy Learning for Device Assignment in Asynchronous Dataflow Graphs DOPPLER: Dual-Policy-Lernen für die Gerätezuordnung in asynchronen Datenflussgraphen DOPPLER: 同步数据流图表中设备分配的双政策学习 2505.23131v1 -
15 05-29 Toward Cost-Efficient Serving of Mixture-of-Experts with Asynchrony Auf dem Weg zu einem kosteneffizienten Servieren von Mixture-of-Experts mit Asynchrony 争取以成本低效益高的方式服务专家与非同步混合服务 2505.08944v2 -
16 05-29 Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts Shortcut-verbundene Experten-Parallelität für die Beschleunigung von Mixture-of-Experts 加速混合专家专家专家平行专家 2404.05019v3 -
17 05-29 Speeding up Model Loading with fastsafetensors Beschleunigen des Modells Beladung mit Schnellsicherern 加速装有快速保障装置的模型加载速度 2505.23072v1 -
18 05-28 (3) Profiling and optimization of multi-card GPU machine learning jobs Profilierung und Optimierung von Multi-Card-GPU-Maschinenlernjobs 多卡 GPPU 机器学习工作的分析和优化 2505.22905v1 -
19 05-28 Visualizing Cloud-native Applications with KubeDiagrams Cloud-native Anwendungen mit KubeDiagrammen visualisieren 带有KubeDiagrams 的可视化云源应用 2505.22879v1 -
20 05-28 The National Research Platform: Stretched, Multi-Tenant, Scientific Kubernetes Cluster Die Nationale Forschungsplattform: Streckiger, Multi-Tenant-Cluster, wissenschaftlicher Kubernetes-Cluster 国家研究平台:延伸、多层、多层、科学库伯涅茨集群 2505.22864v1 -
21 05-28 $Δ$-Nets: Interaction-Based System for Optimal Parallel $λ$-Reduction $Δ$-Nets: Interaktionsbasiertes System für eine optimale parallele $λ$-Reduktion \(-净额:最佳平行互动系统\)$美元-削减 2505.20314v2 -
22 05-28 Smart Contracts for SMEs and Large Companies Intelligente Verträge für KMU und Großunternehmen 中小企业和大公司的智能合同 2505.22619v1 -
23 05-28 Pilot-Quantum: A Quantum-HPC Middleware for Resource, Workload and Task Management Pilot-Quantum: Eine Quantum-HPC Middleware für Ressourcen-, Workload- und Task-Management 试点量子:资源、工作量和任务管理的量子-氢氯氟烃中软件 2412.18519v3 -
24 05-28 Morpheus Consensus: Excelling on trails and autobahns Morpheus Consensus: Excelling auf Trails und Autobahnen Morpheus共识:关于足迹和自动铢的Excelling 2502.08465v2 -
25 05-28 Grassroots Federation: Fair Governance of Large-Scale, Decentralized, Sovereign Digital Communities Grassroots Federation: Faire Governance der großen, dezentralisierten, Souveränen Digitalen Gemeinschaften 基层联合会:大、分散、主权数字共同体的公平治理 2505.02208v4 -
26 05-28 Broadcast in Almost Mixing Time In fast mischender Zeit übertragen 几乎混合时间的广播 2502.02165v2 -
27 05-28 Inclusive, Differentially Private Federated Learning for Clinical Data Inklusives, differenziert privates Federated Learning für klinische Daten 包容性、差异化私联校临床数据学习 2505.22108v1 -
28 05-28 A Stochastic Approximation Approach for Efficient Decentralized Optimization on Random Networks Ein stochastischer Annäherungsansatz für eine effiziente dezentralisierte Optimierung von Random Networks 随机网络高效分散优化优化的斯托卡接近方法 2410.18774v2 -
29 05-28 Towards Efficient Key-Value Cache Management for Prefix Prefilling in LLM Inference Effizientes Key-Value-Cache-Management für die Präfixvorfüllung in LLM-Inferenz 建立高效的键值缓存管理,用于在LLM 推理中预填前补全 2505.21919v1 -
30 05-28 Joint$λ$: Orchestrating Serverless Workflows on Jointcloud FaaS Systems Joint$λ$: Orchestrierung serverloser Workflows auf Jointcloud FaaS-Systemen 联合 $ $: 联合COLOUD FaaS系统无服务器工作流管 2505.21899v1 -
31 05-28 Hybrid Batch Normalisation: Resolving the Dilemma of Batch Normalisation in Federated Learning Hybride Batch-Normalisierung: Lösung des Dilemmas der Batch-Normalisierung im Federated Learning 混合批次正常化:解决联邦学习中批次正常化的难题 2505.21877v1 -
32 05-28 gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling gLLM: Global Balanced Pipeline Parallelism System für verteiltes LLM Serving mit Token Throttling gLLM:全球平衡管道平行系统 2504.14775v2 -
33 05-27 (2) Empowering Scientific Workflows with Federated Agents Stärkung wissenschaftlicher Workflows mit Federated Agents 赋予联邦药剂部门科学工作流程权能 2505.05428v2 -
34 05-27 LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models LV-XAttn: Verteilte Cross-Attention für lange visuelle Eingänge in multimodalen großen Sprachmodellen LV-XAttn:多式大语言模型中长视输入分布式交叉注意 2502.02406v3 -
35 05-27 Power-Capping Metric Evaluation for Improving Energy Efficiency Leistungskapitulation Metric-Evaluierung zur Verbesserung der Energieeffizienz 提高能源效率提高能源使用效率的节能计量评价 2505.21758v1 -
36 05-27 FedCostAware: Enabling Cost-Aware Federated Learning on the Cloud FedCostAware: Kostenbewusstes Lernen in der Cloud ermöglichen FestAware:在云上进行成本-软件联合学习 2505.21727v1 -
37 05-27 AMSFL: Adaptive Multi-Step Federated Learning via Gradient Difference-Based Error Modeling AMSFL: Adaptives Multi-Step-Federated Learning über gradient Difference-based Error Modeling ASFL:通过基于差异的渐进错误建模进行适应性多阶段联邦学习 2505.21695v1 -
38 05-27 Incentivizing Permissionless Distributed Learning of LLMs Anreize für das unbefugte Lernen von LLMs 激励对LLMM的无自由分配的学习 2505.21684v1 -
39 05-27 KPerfIR: Towards an Open and Compiler-centric Ecosystem for GPU Kernel Performance Tooling on Modern AI Workloads KPerfIR: Auf dem Weg zu einem offenen und kompilerzentrierten Ökosystem für GPU-Kernel Performance Tooling auf modernen KI-Workloads KPerfIR:努力建立一个开放的、以编纂者为中心的生态系统,用于在现代AI 工作负荷上使用 GPU 内核性能工具 2505.21661v1 -
40 05-27 Fast and Cost-effective Speculative Edge-Cloud Decoding with Early Exits Schnelle und kostengünstige spekulative Edge-Cloud-Dekodierung mit Early Exits 快速和成本效益高的投机性边缘-封闭式排污与早期出口 2505.21594v1 -
41 05-27 Distributed Discrete Morse Sandwich: Efficient Computation of Persistence Diagrams for Massive Scalar Data Distributed Diskrete Morse Sandwich: Effiziente Berechnung von Persistenzdiagrammen für massive Scalardaten 分布式分散的莫尔斯桑威奇:有效计算大规模卡路里数据持久性图图 2505.21266v1 -
42 05-27 DeepCEE: Efficient Cross-Region Model Distributed Training System under Heterogeneous GPUs and Networks DeepCEE: Effizientes regionsübergreifendes Schulungssystem unter heterogenen GPUs und Netzwerken DeepCEE:在异种性全球保护单位和网络下建立高效跨区域分布示范培训系统 2505.15536v2 -
43 05-27 Grassroots Consensus Graswurzeln-Konsens 基层共识 2505.19216v2 -
44 05-27 Multi-Event Triggers for Serverless Computing Multi-Event-Trigger für serverloses Rechnen 无服务器电子计算多天触发器 2505.21199v1 -
45 05-27 Vectorized Sequence-Based Chunking for Data Deduplication Vektorisierte Sequenz-basiertes Chunking für Datendeduplikation 数据解析矢量序列相键 2505.21194v1 -
46 05-27 Constructive community race: full-density spiking neural network model drives neuromorphic computing Konstruktives Community-Rennen: Volldichte-Spitzen neuronales Netzwerkmodell treibt neuromorphes Computing an 充满建设性的社区种族:完全密度刺激神经网络模型驱动神经形态计算 2505.21185v1 -
47 05-27 SHE-LoRA: Selective Homomorphic Encryption for Federated Tuning with Heterogeneous LoRA SHE-LoRA: Selektive homomorphe Verschlüsselung für Federated Tuning mit Heterogene LoRA SHE-LORA: 与异源罗拉结合的联邦调试的选择性单体单体加密 2505.21051v1 -
48 05-27 A Hitchhiker’s Guide to Privacy-Preserving Cryptocurrencies: A Survey on Anonymity, Confidentiality, and Auditability Ein Hitchhiker-Leitfaden zur Wahrung der Privatsphäre von Kryptowährungen: Eine Umfrage über Anonymität, Vertraulichkeit und Auditierbarkeit 《希希克人保护隐私加密指南:关于匿名、保密和可审计性的调查》 2505.21008v1 -
49 05-27 RACS-SADL: Robust and Understandable Randomized Consensus in the Cloud RACS-SADL: Robuster und verständlicher Randomisierter Konsens in der Cloud RACS-SADL:云层中的有力和可理解的随机共识 2404.04183v3 -
50 05-27 EPIC: Efficient Position-Independent Caching for Serving Large Language Models EPIC: Effizientes positionsunabhängiges Caching für das Servieren großer Sprachmodelle EPIC: 高效的、独立定位的为大语言模式服务的工作 2410.15332v3 -
51 05-27 Complexity landscape for local certification Komplexitätslandschaft für die lokale Zertifizierung 当地认证的复杂环境 2505.20915v1 -
52 05-27 Reduced and mixed precision turbulent flow simulations using explicit finite difference schemes Reduzierte und gemischte Präzision turbulente Strömungssimulationen mit expliziten Finite-Differenz-Systemen 使用明确的有限差别办法进行减少和混合精密混杂的波动流动模拟 2505.20911v1 -
53 05-27 Load Balancing in Strongly Inhomogeneous Simulations – a Vlasiator Case Study Lastausgleich in stark inhomogenen Simulationen – eine Vlasiator-Fallstudie 在极不相异模拟器中平衡载荷 – – 挥发器案例研究 2505.20908v1 -
54 05-27 An Efficient Implementation of Guard-Based Synchronization for an Object-Oriented Programming Language Effiziente Implementierung von Guard-Based Synchronization für eine objektorientierte Programmiersprache 高效率地实施以警卫为基础的同步,以用于以目标为导向的方案编制语言 2505.20850v1 -
55 05-27 Choreographies as Macros Choreographien als Makros 作为宏的舞蹈 2505.20845v1 -
56 05-27 ECC-SNN: Cost-Effective Edge-Cloud Collaboration for Spiking Neural Networks ECC-SNN: Kosteneffiziente Edge-Cloud-Kollaboration für Spiking Neuronal Networks ECC-SNN: 传播神经网络的成本-效益高的边缘-封闭式协作 2505.20835v1 -
57 05-27 Work-Efficient Parallel Counting via Sampling Arbeitseffiziente parallele Zählung über Probenahme 通过抽样计算实现工作效率的平行计数 2408.09719v2 -
58 05-27 Time-Series Learning for Proactive Fault Prediction in Distributed Systems with Deep Neural Structures Time-Series Learning für proaktive Fehlervorhersage in verteilten Systemen mit tiefen neuralen Strukturen 深心神经结构分布系统预发性故障预测时间序列学习 2505.20705v1 -
59 05-27 InstGenIE: Generative Image Editing Made Efficient with Mask-aware Caching and Scheduling InstGenIE: Generative Bildbearbeitung mit Mask-aware Caching und Scheduling effizient gemacht InstGenie: 生成图像编辑, 高效使用防面具图像缓冲和排程 2505.20600v1 -
60 05-26 (1) Asynchronous Fault-Tolerant Language Decidability for Runtime Verification of Distributed Systems Asynchrone Fehler-Tolerante Sprachentscheidung für die Laufzeitverifizierung von verteilten Systemen 分布式系统运行时核查的 Al- 同步错失容忍语言 2502.00191v2 -
61 05-29 (4) Avoid Forgetting by Preserving Global Knowledge Gradients in Federated Learning with Non-IID Data Vermeiden Sie das Vergessen, indem Sie globale Wissensgradienten im Föderierten Lernen mit nicht-ID-Daten bewahren 避免在使用非二二二维数据进行联邦学习时因保留全球知识进步而被遗忘 2505.20485v2 -
62 05-26 (1) Fixing non-blocking data structures for better compatibility with memory reclamation schemes Fixierung von nicht blockierenden Datenstrukturen für eine bessere Kompatibilität mit Speicher-Reklamationssystemen 固定非阻塞性数据结构,以更好地与内存回收计划兼容 2504.06254v2 -
63 05-26 Efficient Optimization Accelerator Framework for Multistate Ising Problems Effizientes Optimierungs-Beschleuniger-Framework für Multistate Ising-Probleme 高效高效优化多州化问题加速加速框架 2505.20250v1 -
64 05-26 FedECA: A Federated External Control Arm Method for Causal Inference with Time-To-Event Data in Distributed Settings FedECA: Eine Federated External Control Arm Methode für ursächliche Schlussfolgerungen mit Zeit-bis-Event-Daten in verteilten Einstellungen FedECA:在分布环境中利用时间到时间的数据进行因果关系推断的联邦外部控制武器法 2311.16984v9 -
65 05-26 BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems BurstGPT: Ein echter Workload-Datensatz zur Optimierung von LLM-Serviersystemen BurtGPT:优化LLM服务系统的现实世界工作量数据集 2401.17644v5 -
66 05-26 Parallelizing a modern GPU simulator Parallelisierung eines modernen GPU-Simulators 平行使用现代 GPU 模拟器 2502.14691v2 -
67 05-26 Snowman for partial synchrony Schneemann für partielle Synchronisation 部分同步的雪人 2501.15904v3 -
68 05-26 Beyond Optimal Fault Tolerance Jenseits der optimalen Fehlertoleranz 超越最佳错失容忍 2501.06044v6 -
69 05-26 Distortion Resilience for Goal-Oriented Semantic Communication Distortion Resilienz für zielorientierte semantische Kommunikation 目标导向语义交流的扭曲复原力 2309.14587v2 -
70 05-26 Optimizing edge AI models on HPC systems with the edge in the loop Optimierung der Kanten-KI-Modelle auf HPC-Systemen mit der Kante in der Schleife 优化循环边缘的HPC系统优化边缘 AI 模型 2505.19995v1 -
71 05-26 Federated Domain Generalization with Data-free On-server Matching Gradient Föderierte Domain-Verallgemeinerung mit datenfreiem On-Server-Zustimmungs-Gradient 具有无数据观测站上与渐变匹配的无数据观测器的联邦通用域 2501.14653v2 -
72 05-26 From Few to Many Faults: Adaptive Byzantine Agreement with Optimal Communication Von wenigen bis zu vielen Fehlern: Adaptive byzantinische Vereinbarung mit optimaler Kommunikation 从少到多的错失:适应性拜占庭协议与最佳沟通 2505.19989v1 -
73 05-26 Differential Privacy Analysis of Decentralized Gossip Averaging under Varying Threat Models Differential Privacy Analyse dezentralisierter Gossip Average unter unterschiedlichen Bedrohungsmodellen 对不同威胁模式下分散的流民的隐私差异分析 2505.19969v1 -
74 05-26 Universal Workers: A Vision for Eliminating Cold Starts in Serverless Computing Universal Workers: Eine Vision zur Beseitigung von Kaltstarts im serverlosen Computing 普遍工人:在无服务器计算机中消除冷源的愿景 2505.19880v1 -
75 05-26 DGRAG: Distributed Graph-based Retrieval-Augmented Generation in Edge-Cloud Systems DGRAG: Distributed Graph-based Retrieval-Augmented Generation in Edge-Cloud-Systemen DGGGAG: 在边缘封闭系统中分布的基于图图的回收回源-养代 2505.19847v1 -
76 05-26 Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices Wird LLMs Skalierung die Wand treffen? Über verteilte Ressourcen auf massiven Edge-Geräten Barrieren überwinden LLLMs SUlia扩大会撞上隔离墙吗?通过大规模边缘装置分配资源打破障碍 2503.08223v2 -
77 05-26 A Unified, Practical, and Understandable Model of Non-transactional Consistency Levels in Distributed Replication Ein einheitliches, praktisches und verständliches Modell nichttransaktionsfähiger Konsistenzstufen in verteilter Replikation 分布式重复中非交易一致性水平的统一、实用和可理解的模式 2409.01576v4 -
78 05-26 Justin: Hybrid CPU/Memory Elastic Scaling for Distributed Stream Processing Justin: Hybride CPU/Memory Elastic Scaling für verteilte Stream-Verarbeitung Justin: 用于分布流处理的混合 CPU/Memory Elastic 缩放比例 2505.19739v1 -
79 05-26 Towards Optimal Distributed Edge Coloring with Fewer Colors Auf dem Weg zu einer optimalen verteilten Randfärbung mit weniger Farben 向最优化分布式边缘配色,颜色更少 2504.13003v2 -
80 05-26 Byzantine Consensus in the Random Asynchronous Model Byzantinischer Konsens im zufälligen asynchronen Modell 随机非同步模型中的拜占庭共识 2502.09116v2 -
81 05-26 Mosaic: Data-Free Knowledge Distillation via Mixture-of-Experts for Heterogeneous Distributed Environments Mosaic: Datenfreies Wissen Destillieren über Mixture-of-Experts für Heterogene verteilte Umgebungen Mosaic:通过混合专家进行无数据知识蒸馏,促进异基因分布式环境 2505.19699v1 -
82 05-26 PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving PRESSERVE: Prefetching Modellgewichte und KV-Cache in verteilter LLM-Servierung PRESSERVE: 分布式LLM服务中的预伸缩模型重量和 KV-缓冲 2501.08192v2 -
83 05-26 Scaling Large-scale GNN Training to Thousands of Processors on CPU-based Supercomputers Skalierung von großformatigen GNN-Schulungen zu Tausenden von Prozessoren auf CPU-basierten Supercomputern 向数千台基于CPU的超级计算机处理器提供大规模GNN培训 2411.16025v2 -
84 05-26 Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLMs Gewinnen Sie schnell oder verlieren Sie langsam: Ausgleichende Geschwindigkeit und Genauigkeit in Latenz-Sensitive Entscheidungen von LLMs 慢赢或慢输:LLMs的延缓敏感决定中平衡速度和准确性 2505.19481v1 -
85 05-26 GPU acceleration of non-equilibrium Green’s function calculation using OpenACC and CUDA FORTRAN GPU-Beschleunigung der Nicht-Equilibrium Green-Funktionsberechnung mit OpenACC und CUDA FORTRAN 使用 OpenACC 和 CUDA FORTRAN 加速 GPU 绿色非平衡的功能计算 2505.19467v1 -
86 05-26 FedHERO: A Federated Learning Approach for Node Classification Task on Heterophilic Graphs FedHERO: Ein Federated Learning Approach für Knotenklassifikation Aufgaben auf heterophilen Graphen FEFHERO: 异生物图节点分类任务联邦学习方法 2504.21206v2 -
87 05-25 (7) QMIO: A tightly integrated hybrid HPCQC system QMIO: Ein eng integriertes Hybrid-HPCQC-System QMIO:一个严格一体化的混合高和分PCQC系统 2505.19267v1 -
88 05-25 NanoFlow: Towards Optimal Large Language Model Serving Throughput NanoFlow: Auf dem Weg zu einem optimalen Large Language Model NanoFlow:走向最佳大语言模式 2408.12757v2 -
89 05-25 Matrix Multiplication in the MPC Model Matrix-Multiplikation im MPC-Modell MPC 模型中的矩阵乘法 2505.19137v1 -
90 05-25 Birch SGD: A Tree Graph Framework for Local and Asynchronous SGD Methods Birke SGD: Ein Baumdiagramm-Framework für lokale und asynchrone SGD-Methoden Birch SGD: 当地和非同步 SGD 方法树图框架 2505.09218v2 -
91 05-24 (6) Toward Malicious Clients Detection in Federated Learning Auf dem Weg zu bösartigen Kunden Erkennung im Föderierten Lernen 争取在联邦学习中发现恶意客户 2505.09110v2 -
92 05-24 Distributed Incremental SAT Solving with Mallob: Report and Case Study with Hierarchical Planning Distributed Incremental SAT Solving with Mallob: Report and Case Study with Hierarchical Planning 与马洛布公司共同解决:与等级规划有关的报告和案例研究 2505.18836v1 -
93 05-24 DiSCo: Device-Server Collaborative LLM-Based Text Streaming Services DiSCo: Geräte-Server Kollaborative LLM-basierte Text-Streaming-Dienste DisCo: 设备-服务器协作协作LLM基于LLM的文本流服务 2502.11417v2 -
94 05-24 Distributed Set-membership Filtering Frameworks For Multi-agent Systems With Absolute and Relative Measurements Distributed Set-Membership Filtering Frameworks für Multi-Agent-Systeme mit absoluten und relativen Messungen 具有绝对和相对计量的多试剂系统分布式成员筛选框架 2305.15797v2 -
95 05-24 EvoSort: A Genetic-Algorithm-Based Adaptive Parallel Sorting Framework for Large-Scale High Performance Computing EvoSort: Ein genetisch-algorithmisch-adaptives Parallelsortierungs-Framework für großformatige Hochleistungsrechnen EvoSort: 大型高性能计算方法的基于遗传 – – 物理学的适应性平行排序框架 2505.18681v1 -
96 05-24 Towards Round-Optimal Approximate Agreement on Trees Auf dem Weg zu einem runden, optimalen Abkommen über Bäume 争取达成关于树木的圆顶和最接近于 2502.05591v2 -
97 05-24 Asynchronous Approximate Agreement with Quadratic Communication Asynchrone annähernde Vereinbarung mit quadratischer Kommunikation 与赤道通信的近似非同步协定 2408.05495v3 -
98 05-24 TEE is not a Healer: Rollback-Resistant Reliable Storage TEE ist kein Heiler: Rollback-Resistent Zuverlässige Lagerung TEE不是救治者:回击-恢复-可靠储存 2505.18648v1 -
99 05-24 CacheFL: Privacy-Preserving and Efficient Federated Cache Model Fine-Tuning for Vision-Language Models CacheFL: Datenschutzschonendes und effizientes Federated Cache Model Fine-Tuning für Vision-Language-Modelle CACHFL: 视力和语言模型微调模型 2505.05130v2 -
100 05-24 PacTrain: Pruning and Adaptive Sparse Gradient Compression for Efficient Collective Communication in Distributed Deep Learning PacTrain: Pruning and Adaptive Sparse Gradient Compression für effiziente kollektive Kommunikation im verteilten Deep Learning PacTrain:在分布式深层学习中促进高效集体交流的审慎和适应性零散梯级压缩 2505.18563v1 -
101 05-24 Consensus Under Adversary Majority Done Right Konsens unter gegnerischer Mehrheit Rechtsbeistand 在相反多数下达成的共识 2411.01689v3 -
102 05-24 Recursive Offloading for LLM Serving in Multi-tier Networks Rekursives Offloading für LLM-Serving in Multi-Tier-Netzwerken 多层网络LLM服务的递归性卸载 2505.16502v2
Article 0
Title@2025-05-29 (4): From Connectivity to Autonomy: The Dawn of Self-Evolving Communication Systems
Title: From Connectivity to Autonomy: The Dawn of Self-Evolving Communication Systems | Von der Konnektivität zur Autonomie: Die Morgenröte der sich selbst entwickelnden Kommunikationssysteme | 从连接到自主:自我发展的通信系统的黎明 2505.23710v1 |
Authors (5): Zeinab Nezami, Syed Danial Ali Shah, Maryam Hafeez, Karim Djemame, Syed Ali Raza Zaidi
This paper envisions 6G as a self-evolving telecom ecosystem, where AI-driven intelligence enables dynamic adaptation beyond static connectivity. We explore the key enablers of autonomous communication systems, spanning reconfigurable infrastructure, adaptive middleware, and intelligent network functions, alongside multi-agent collaboration for distributed decision-making. We then examine how these methodologies align with emerging industrial IoT frameworks, ensuring seamless integration within digital manufacturing processes. Our findings emphasize the potential for improved real-time decision-making, optimizing efficiency, and reducing latency in networked control systems. The discussion addresses ethical challenges, research directions, and standardization efforts, concluding with a technology stack roadmap to guide future developments. By leveraging state-of-the-art 6G network management techniques, this research contributes to the next generation of intelligent automation solutions, bridging the gap between theoretical advancements and real-world industrial applications.
本文设想6G是一个自我演化的电信生态系统,AI驱动的智能使动态适应超越静态连接。我们探索了自主通信系统的关键促进因素,包括可重新配置的基础设施、适应性中器和智能网络功能,以及用于分配决策的多机构协作。我们探讨了这些方法如何与新兴工业互联网框架相协调,确保数字制造流程的无缝整合。我们的调查结果强调改进实时决策、优化效率和减少网络控制系统中的延迟的可能性。讨论涉及道德挑战、研究方向和标准化努力,并用技术堆叠路线图结束指导未来发展。通过利用最新的6G网络管理技术,这一研究有助于下一代智能自动化解决方案,弥合理论进步与现实世界工业应用之间的差距。
Article 1
Title@2025-05-29 (4): Distributed Federated Learning for Vehicular Network Security: Anomaly Detection Benefits and Multi-Domain Attack Threats
Title: Distributed Federated Learning for Vehicular Network Security: Anomaly Detection Benefits and Multi-Domain Attack Threats | Verteiltes Federated Learning für die Sicherheit des Vehicular Network: Anomalieerkennungsvorteile und Multi-Domain-Angriffsbedrohungen | 分布式联邦学习促进车辆网络安全:反常探测效益和多领域攻击威胁 2505.23706v1 |
Authors (6): Utku Demir, Yalin E. Sagduyu, Tugba Erpek, Hossein Jafari, Sastry Kompella, Mengran Xue
In connected and autonomous vehicles, machine learning for safety message classification has become critical for detecting malicious or anomalous behavior. However, conventional approaches that rely on centralized data collection or purely local training face limitations due to the large scale, high mobility, and heterogeneous data distributions inherent in inter-vehicle networks. To overcome these challenges, this paper explores Distributed Federated Learning (DFL), whereby vehicles collaboratively train deep learning models by exchanging model updates among one-hop neighbors and propagating models over multiple hops. Using the Vehicular Reference Misbehavior (VeReMi) Extension Dataset, we show that DFL can significantly improve classification accuracy across all vehicles compared to learning strictly with local data. Notably, vehicles with low individual accuracy see substantial accuracy gains through DFL, illustrating the benefit of knowledge sharing across the network. We further show that local training data size and time-varying network connectivity correlate strongly with the model’s overall accuracy. We investigate DFL’s resilience and vulnerabilities under attacks in multiple domains, namely wireless jamming and training data poisoning attacks. Our results reveal important insights into the vulnerabilities of DFL when confronted with multi-domain attacks, underlining the need for more robust strategies to secure DFL in vehicular networks.
在连接和自主的车辆中,安全信息分类的机器学习对于发现恶意或异常行为至关重要。然而,依赖集中数据收集或纯本地培训的常规方法由于车辆间网络所固有的大规模、高度流动性和分散的数据分配而面临限制。为克服这些挑战,本文件探讨了分布式联邦学习(DFL),即车辆通过在单点邻居之间交换最新消息和在多个跳站上传播模型来合作培训深层次学习模式。我们利用通用参考Misbehavir(VeRemi)扩展数据集,表明DFL能够大大提高所有车辆的分类准确性,而严格使用当地数据来学习。值得注意的是,个人准确度低的车辆通过DFLL看到大量准确性收益,表明整个网络共享知识的好处。我们进一步表明,当地培训数据规模和时间变化式网络连接与模型的总体准确性密切相关。我们调查DFLL在多个领域(即无线干扰和培训数据中毒攻击)受到攻击时的复原力和脆弱性。我们的结果显示,当面对更强点攻击时,DFLL在更稳健的多点攻击战略时,需要对DFLL的弱点有重要了解。我们的结果。
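The neighbor-exchange idea in the abstract above can be illustrated with a minimal sketch. It assumes plain unweighted averaging over one-hop neighbors and a fixed topology; the function name `average_with_neighbors`, the toy two-parameter "models", and the three-vehicle line topology are hypothetical and are not taken from the paper, whose actual aggregation rule and local training steps are not described here.

```python
from typing import Dict, List

Weights = List[float]


def average_with_neighbors(
    models: Dict[str, Weights],
    neighbors: Dict[str, List[str]],
) -> Dict[str, Weights]:
    """One synchronous round: every vehicle replaces its model with the
    unweighted mean of its own model and its one-hop neighbors' models."""
    updated: Dict[str, Weights] = {}
    for vehicle, own in models.items():
        group = [own] + [models[n] for n in neighbors.get(vehicle, [])]
        # Average each parameter position across the group.
        updated[vehicle] = [sum(vals) / len(group) for vals in zip(*group)]
    return updated


# Hypothetical line topology A - B - C: A and C are two hops apart.
models = {"A": [0.0, 0.0], "B": [3.0, 3.0], "C": [6.0, 6.0]}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}

round1 = average_with_neighbors(models, neighbors)
round2 = average_with_neighbors(round1, neighbors)

print(round1["A"])  # A has only mixed with B so far
print(round2["A"])  # A now also reflects C's model, relayed through B
```

In this toy run, vehicle A never talks to C directly, yet C's parameters influence A after the second round because B relays them. That is the multi-hop propagation effect the abstract refers to, stripped of any actual training.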
Article 2
Title@2025-05-29 (4): Parallel GPU-Accelerated Randomized Construction of Approximate Cholesky Preconditioners
Title: Parallel GPU-Accelerated Randomized Construction of Approximate Cholesky Preconditioners | Parallele GPU-beschleunigte Randomisierte Konstruktion von ungefähren Cholesky-Vorkonditionen | 平行的GPU-加速加速旋转式建造近焦天空预设装置 2505.02977v2 |
Authors (8): Tianyu Liang, Chao Chen, Yotam Yaniv, Hengrui Luo, David Tench, Xiaoye S. Li, Aydin Buluc, James Demmel
We introduce a parallel algorithm to construct a preconditioner for solving a large, sparse linear system where the coefficient matrix is a Laplacian matrix (a.k.a., graph Laplacian). Such a linear system arises from applications such as discretization of a partial differential equation, spectral graph partitioning, and learning problems on graphs. The preconditioner belongs to the family of incomplete factorizations and is purely algebraic. Unlike traditional incomplete factorizations, the new method employs randomization to determine whether or not to keep fill-ins, i.e., newly generated nonzero elements during Gaussian elimination. Since the sparsity pattern of the randomized factorization is unknown, computing such a factorization in parallel is extremely challenging, especially on many-core architectures such as GPUs. Our parallel algorithm dynamically computes the dependency among row/column indices of the Laplacian matrix to be factorized and processes the independent indices in parallel. Furthermore, unlike previous approaches, our method requires little pre-processing time. We implemented the parallel algorithm for multi-core CPUs and GPUs, and we compare their performance to other state-of-the-art methods.
我们引入了平行算法来构建一个解决大型、稀疏线性系统的先决条件,即系数矩阵是拉普拉西亚矩阵(a.k.a.a.a.,图Laplacian)的系数矩阵(a.k.a.a.a.a.,图Laplacian),这种线性系统来自部分差异方程式的离散化、光谱图形分割和图形学习问题等应用。这个前提属于不完全系数化的大家庭,是纯代数的。与传统的不完全的系数化不同,新的方法使用随机化来确定是否保持填充,即在高斯消除期间新产生的非零元素。由于随机化系数化的松散模式是未知的,因此平行计算这种系数化极具挑战性,特别是在诸如GPUs等多个核心结构中。我们平行的算法动态地将拉巴拉帕卡矩阵的行/柱性指数之间的依赖性进行系数化并同时处理独立指数。此外,我们的方法需要很少的预处理时间。我们为多核心的CPU和GPU-GPOs采用了平行的平行算算法,我们将其与其他状态进行比较。
Article 3
Title@2025-05-29 (4): Complementary Time-Space Tradeoff for Self-Stabilizing Leader Election: Polynomial States Meet Sublinear Time
Title: Complementary Time-Space Tradeoff for Self-Stabilizing Leader Election: Polynomial States Meet Sublinear Time | Komplementärer Zeit-Raum-Tradeoff für selbststabilisierende Leader-Wahl: Polynome Staaten treffen auf sublineare Zeit | 自我稳定领导人选举的补充时间-空间权衡:多民族国家满足亚线性时间 2505.23649v1 |
Authors (1): Yuichi Sudo
We study the self-stabilizing leader election (SS-LE) problem in the population protocol model, assuming exact knowledge of the population size $n$. Burman, Chen, Chen, Doty, Nowak, Severson, and Xu (PODC 2021) showed that this problem can be solved in $O(n)$ expected time with $O(n)$ states. Recently, Gąsieniec, Grodzicki, and Stachowiak (PODC 2025) proved that $n+O(\log n)$ states suffice to achieve $O(n \log n)$ time both in expectation and with high probability (w.h.p.). If substantially more states are available, sublinear time can be achieved. Burman et al. (PODC 2021) presented a $2^{O(n^\rho\log n)}$-state SS-LE protocol with a parameter $\rho$: setting $\rho = \Theta(\log n)$ yields an optimal $O(\log n)$ time both in expectation and w.h.p., while $\rho = \Theta(1)$ results in $O(\rho\,n^{1/(\rho+1)})$ expected time. Very recently, Austin, Berenbrink, Friedetzky, Götte, and Hintze (PODC 2025) presented a novel SS-LE protocol parameterized by a positive integer $\rho$ with $1 \le \rho < n/2$ that solves SS-LE in $O(\frac{n}{\rho}\cdot\log n)$ time w.h.p. using $2^{O(\rho^2\log n)}$ states. This paper independently presents yet another time–space tradeoff for SS-LE: for any positive integer $\rho$ with $1 \le \rho \le \sqrt{n}$, SS-LE can be achieved within $O\left(\frac{n}{\rho}\cdot \log\rho\right)$ expected time using $2^{2\rho\lg\rho + O(\log n)}$ states. The proposed protocol uses significantly fewer states than the protocol of Austin et al. requires to achieve any expected stabilization time above $\Theta(\sqrt{n}\log n)$. When $\rho = \Theta\left(\frac{\log n}{\log \log n}\right)$, the proposed protocol is the first to achieve sublinear time while using only polynomially many states. A limitation of our protocol is that the constraint $\rho\le\sqrt{n}$ prevents achieving $o(\sqrt{n}\log n)$ time, whereas the protocol of Austin et al. can surpass this bound.
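The claim about polynomially many states and sublinear time can be checked by substituting the stated choice of $\rho$ into the two bounds from the abstract. The short derivation below is a reader's sketch, not text from the paper; the constant $c$ is introduced here only for illustration.

```latex
% Assume \rho = c \cdot \frac{\log n}{\log\log n} for some constant c > 0.
\begin{align*}
\lg\rho &= \log\log n - \log\log\log n + O(1) \;\le\; \log\log n + O(1), \\
2\rho\lg\rho &\le 2c\,\frac{\log n}{\log\log n}\,\bigl(\log\log n + O(1)\bigr) = O(\log n), \\
\text{states} &= 2^{2\rho\lg\rho + O(\log n)} = 2^{O(\log n)} = n^{O(1)}, \\
\text{time} &= O\!\left(\frac{n}{\rho}\,\log\rho\right)
            = O\!\left(\frac{n\,(\log\log n)^2}{\log n}\right) = o(n).
\end{align*}
```

So at this setting the state count is polynomial in $n$ while the expected stabilization time is sublinear, which is the combination the abstract highlights.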
Article 4
Title@2025-05-29 (4): Accelerated Training of Federated Learning via Second-Order Methods
Title: Accelerated Training of Federated Learning via Second-Order Methods | Beschleunigte Ausbildung des Föderierten Lernens über Methoden der zweiten Ordnung | 通过二级方法加快联邦学习培训 2505.23588v1 |
Authors (3): Mrinmay Sen, Sidhant R Nair, C Krishna Mohan
This paper explores second-order optimization methods in Federated Learning (FL), addressing the critical challenges of slow convergence and the excessive communication rounds required to achieve optimal performance from the global model. While existing surveys in FL primarily focus on challenges related to statistical and device label heterogeneity, as well as privacy and security concerns in first-order FL methods, less attention has been given to the issue of slow model training. This slow training often leads to the need for excessive communication rounds or increased communication costs, particularly when data across clients are highly heterogeneous. In this paper, we examine various FL methods that leverage second-order optimization to accelerate the training process. We provide a comprehensive categorization of state-of-the-art second-order FL methods and compare their performance based on convergence speed, computational cost, memory usage, transmission overhead, and generalization of the global model. Our findings show the potential of incorporating Hessian curvature through second-order optimization into FL and highlight key challenges, such as the efficient utilization of Hessian and its inverse in FL. This work lays the groundwork for future research aimed at developing scalable and efficient federated optimization methods for improving the training of the global model in FL.
本文探讨了联邦学习联合会(FL)的二级优化方法,探讨了缓慢趋同和为达到全球模式最佳业绩所需的过度通信周期等关键挑战。虽然FL的现有调查主要侧重于与统计和装置标签差异有关的挑战,以及一级FL方法的隐私和安全问题,但对模式培训缓慢问题的关注较少。这种缓慢的培训往往导致需要过多的通信回合或增加通信成本,特别是在客户数据高度差异的情况下。我们在本文件中审查了利用第二级优化来加快培训进程的多种FL方法。我们提供了第二级FL方法的全面分类,并根据趋同速度、计算成本、记忆使用、传承间接费用和全球模型的普及,比较其业绩。我们的调查结果显示,通过第二级优化将赫森曲线纳入FL的可能性,并突出了主要挑战,例如赫桑的有效利用及其在FL的反面。这项工作为今后旨在改进FC可升级和高效全球优化方法的示范研究奠定了基础。
Article 5
Title@2025-05-29 (4): Sustainable Carbon-Aware and Water-Efficient LLM Scheduling in Geo-Distributed Cloud Datacenters
Title: Sustainable Carbon-Aware and Water-Efficient LLM Scheduling in Geo-Distributed Cloud Datacenters | Nachhaltiges CO2-basiertes und wassereffizientes LLM-Scheeduling in Geo-verteilten Cloud-Rechenzentren | 地球分布云数据中心的可持续碳软件和水效率高的LLM 2505.23554v1 |
Authors (6): Hayden Moore, Sirui Qi, Ninad Hogade, Dejan Milojicic, Cullen Bash, Sudeep Pasricha
In recent years, Large Language Models (LLMs) such as ChatGPT, CoPilot, and Gemini have been widely adopted in different areas. As the use of LLMs continues to grow, many efforts have focused on reducing the massive training overheads of these models. But it is the environmental impact of handling user requests to LLMs that is increasingly becoming a concern. Recent studies estimate that the costs of operating LLMs in their inference phase can exceed training costs by 25x per year. As LLMs are queried incessantly, the cumulative carbon footprint for the operational phase has been shown to far exceed the footprint during the training phase. Further, estimates indicate that 500 ml of fresh water is expended for every 20-50 requests to LLMs during inference. To address these important sustainability issues with LLMs, we propose a novel framework called SLIT to co-optimize LLM quality of service (time-to-first-token), carbon emissions, water usage, and energy costs. The framework utilizes a machine learning (ML) based metaheuristic to enhance the sustainability of LLM hosting across geo-distributed cloud datacenters. Such a framework will become increasingly vital as LLMs proliferate.
近年来,大语言模型(LLM),如ChatGPT、CoPilot和Gemini等,在不同领域被广泛采用。随着LLMs的使用继续增加,许多努力集中于减少这些模型的大量培训间接费用。但是,处理用户对LLMs的要求对环境的影响日益引起关注。最近的研究估计,在推论阶段操作LMs的成本每年可超过培训成本25x。LMs不断被问及,运行阶段的累积碳足迹显示远远超过培训阶段的足迹。此外,估计表明,在推断过程中,每20-50个LMs提出的LMs申请中,就有500毫升的淡水花费。为了与LMS解决这些重要的可持续性问题,我们提出了一个名为SLIT的新框架,以共同优化LMs服务质量(时间到头等)、碳排放、水使用和能源成本。框架利用基于机器的MLAEuric来提高LM公司在地理分布式云中托管服务的可持续性。这一框架将日益成为至关重要的一个框架。
Article 6
Title@2025-05-29 (4): Accelerating AllReduce with a Persistent Straggler
Title: Accelerating AllReduce with a Persistent Straggler | AllReduce mit einem persistenten Straggler beschleunigen | 使用持久性斯特拉格驱动器加速全部拖动 2505.23523v1 |
Authors (5): Arjun Devraj, Eric Ding, Abhishek Vijaya Kumar, Robert Kleinberg, Rachee Singh
Distributed machine learning workloads use data and tensor parallelism for training and inference, both of which rely on the AllReduce collective to synchronize gradients or activations. However, bulk-synchronous AllReduce algorithms can be delayed by a persistent straggler that is slower to reach the synchronization barrier required to begin the collective. To address this challenge, we propose StragglAR: an AllReduce algorithm that accelerates distributed training and inference in the presence of persistent stragglers. StragglAR implements a ReduceScatter among the remaining GPUs during the straggler-induced delay, and then executes a novel collective algorithm to complete the AllReduce once the straggler reaches the synchronization barrier. StragglAR achieves a 2x theoretical speedup over popular bandwidth-efficient AllReduce algorithms (e.g., Ring) for large GPU clusters with persistent stragglers. On an 8-GPU server, our implementation of StragglAR yields a 22% speedup over state-of-the-art AllReduce algorithms.
分散的机器学习工作量在培训和推论方面使用数据和分解的平行法,两者都依靠 AllReduce 集体组合来同步梯度或激活。 但是, 散装同步的全Reduce 算法可能会被一个持久性的分解器延缓, 而这种分解速度要慢到启动集体所需的同步屏障。 为了应对这一挑战, 我们提议 StragglAR : 一种全Reduce 算法, 加速在持久性排挤者面前的分布式培训和推论。 StragglAR 在 strggler 引发的延缓期间, 在其余的 GPU 中实施一个减少分解器, 然后执行一种新的集体算法, 以在拖动器到达同步屏障后完成全Reduce 。 StragglAR 实现2x理论速度, 超过流行的带宽效率的全Reduce 算法( 如 Ring) 。 在 8- GGPU 服务器上, 我们的 StragglAR 将产生一个超过 22% 的全局-Art- allRuedudes 算法 。
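The collective building blocks named in this abstract can be illustrated with a toy, single-process model. The sketch below is not StragglAR; it only shows that a ReduceScatter followed by an AllGather produces the AllReduce result, and it gives the standard per-rank traffic of a ring AllReduce. All function names are hypothetical, and each "rank" is just a Python list.

```python
from typing import List


def reduce_scatter(buffers: List[List[float]]) -> List[List[float]]:
    """Rank r ends up owning the element-wise sum of chunk r over all ranks."""
    n = len(buffers)
    size = len(buffers[0])
    assert size % n == 0, "toy model: buffer length must divide evenly"
    chunk = size // n
    owned = []
    for r in range(n):
        lo, hi = r * chunk, (r + 1) * chunk
        owned.append([sum(b[i] for b in buffers) for i in range(lo, hi)])
    return owned


def all_gather(chunks: List[List[float]]) -> List[List[float]]:
    """Every rank receives the concatenation of all ranks' chunks."""
    full = [x for c in chunks for x in c]
    return [list(full) for _ in chunks]


def all_reduce(buffers: List[List[float]]) -> List[List[float]]:
    """AllReduce expressed as ReduceScatter followed by AllGather."""
    return all_gather(reduce_scatter(buffers))


def ring_allreduce_bytes_per_rank(n: int, size_bytes: float) -> float:
    """Standard ring AllReduce: each rank sends 2*(n-1) chunks of size/n."""
    return 2 * (n - 1) / n * size_bytes


if __name__ == "__main__":
    print(all_reduce([[1, 2, 3, 4], [10, 20, 30, 40]]))
    print(ring_allreduce_bytes_per_rank(8, 1024.0))
```

In this picture, the abstract's idea is that the non-straggler ranks can already perform the ReduceScatter step among themselves while waiting; how the straggler's data is then folded in is the paper's contribution and is not modeled here.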
Article 7
Title@2025-05-29 (4): SealOS+: A Sealos-based Approach for Adaptive Resource Optimization Under Dynamic Workloads for Securities Trading System
Title: SealOS+: A Sealos-based Approach for Adaptive Resource Optimization Under Dynamic Workloads for Securities Trading System | SealOS+: Ein Sealos-basierter Ansatz für adaptive Ressourcenoptimierung unter dynamischen Workloads für Securities Trading System | SealOS+:证券交易系统动态工作量下的适应性资源优化的以海路为基础的办法 2505.23258v1 |
Authors (5): Haojie Jia, Zhenhao Li, Gen Li, Minxian Xu, Kejiang Ye
As securities trading systems transition to a microservices architecture, optimizing system performance presents challenges such as inefficient resource scheduling and high service response delays. Existing container orchestration platforms lack tailored performance optimization mechanisms for trading scenarios, making it difficult to meet the stringent 50ms response time requirement imposed by exchanges. This paper introduces SealOS+, a Sealos-based performance optimization approach for securities trading, incorporating an adaptive resource scheduling algorithm leveraging deep reinforcement learning, a three-level caching mechanism for trading operations, and a Long Short-Term Memory (LSTM) based load prediction model. Real-world deployment at a securities exchange demonstrates that the optimized system achieves an average CPU utilization of 78%, reduces transaction response time to 105ms, and reaches a peak processing capacity of 15,000 transactions per second, effectively meeting the rigorous performance and reliability demands of securities trading.
随着证券交易系统向微观服务结构过渡,优化系统绩效带来了诸如资源排期效率低和服务反应迟缓等挑战;现有集装箱管线平台缺乏针对交易情景的定制性优化性性性能机制,难以满足交易所规定的严格的50米反应时间要求;本文介绍了SealOS+(SealOS+)(证券交易基于Sealos的绩效优化性能优化办法),其中包括利用深度强化学习的适应性资源排期算法、3级贸易业务缓冲机制以及基于长期短期内存的负载预测模型;在证券交易所的实时部署表明,优化的系统实现了平均CPU使用78,将交易反应时间减少到105米,达到最高处理能力,即每秒处理15 000次交易,有效满足证券交易的严格业绩和可靠性要求。
Article 8
Title@2025-05-29 (4): Smaller, Smarter, Closer: The Edge of Collaborative Generative AI
Title: Smaller, Smarter, Closer: The Edge of Collaborative Generative AI | Kleiner, intelligenter, enger: Der Rand der kollaborativen Generativen KI | 较小、更聪明、更近:合作创造的边缘 AI 2505.16499v2 |
Authors (2): Roberto Morabito, SiYoung Jang
The rapid adoption of generative AI (GenAI), particularly Large Language Models (LLMs), has exposed critical limitations of cloud-centric deployments, including latency, cost, and privacy concerns. Meanwhile, Small Language Models (SLMs) are emerging as viable alternatives for resource-constrained edge environments, though they often lack the capabilities of their larger counterparts. This article explores the potential of collaborative inference systems that leverage both edge and cloud resources to address these challenges. By presenting distinct cooperation strategies alongside practical design principles and experimental insights, we offer actionable guidance for deploying GenAI across the computing continuum.
迅速采用基因化的AI(GenAI),特别是大语言模型(LLM),暴露了云中心部署的关键局限性,包括隐秘性、成本和隐私问题;与此同时,小型语言模型(SLM)正在成为受资源限制的边缘环境的可行替代物,尽管它们往往缺乏较大对应方的能力;这一条探讨了利用边际和云层资源来应对这些挑战的协作推论系统的潜力;通过提出不同的合作战略以及实际设计原则和实验洞察力,我们为在整个计算过程中部署GENAI提供了可行的指导。
Article 9
Title@2025-05-29 (4): MemAscend: System Memory Optimization for SSD-Offloaded LLM Fine-Tuning
Title: MemAscend: System Memory Optimization for SSD-Offloaded LLM Fine-Tuning | MemAscend: Systemspeicheroptimierung für SSD-Offloaded LLM Fine-Tuning | MemAscend: SSD- 卸载 LLM 精密调试的系统内存优化 2505.23254v1 |
Authors (2): Yong-Cheng Liaw, Shuo-Han Chen
Owing to the huge success of generative artificial intelligence (AI), large language models (LLMs) have emerged as a core subclass, underpinning applications such as question answering, text generation, and code completion. While fine-tuning these models on domain-specific data can yield significant performance gains, it also poses daunting computational challenges, especially for researchers and small organizations with limited hardware resources. Although SSD offloading (i.e., ZeRO-Infinity) has emerged as a viable strategy to overcome the GPU memory barrier via leveraging both system memory (i.e., CPU DRAM) and storage space (i.e., solid-state devices, SSDs), its design primarily targets model-centric performance issues. As a result, key system-level issues, including system memory fragmentation, inefficient pinned buffer allocation, peak CPU usage spikes, and file system overhead, remain unaddressed, stifling scalability and inflating costs. Such an observation motivates this paper to introduce MemAscend, a framework that systematically tackles the underexplored system memory bottlenecks in SSD-offloaded LLM training, with a focus on resource-constrained environments. By streamlining pinned-memory allocation, eradicating fragmentation, and mitigating peak overhead, MemAscend reclaims a substantial system memory budget, enabling larger models, longer context windows, and higher batch sizes without exceeding modest hardware limits. Across diverse LLM benchmarks, MemAscend reduces peak system-memory consumption by an average of 55.7% compared with standard SSD offloading techniques, lowering the hardware barrier for fine-tuning and unlocking new possibilities for cost-effective large-scale training on limited-resource machines.
由于生成式人工智能(AI)的巨大成功,大语言模型(LLM)已成为其中的核心子类,支撑着问答、文本生成和代码补全等应用。虽然在特定领域数据上对这些模型进行微调可以带来显著的性能提升,但这也带来了艰巨的计算挑战,对硬件资源有限的研究人员和小型组织尤其如此。尽管SSD卸载(即ZeRO-Infinity)已成为一种可行的策略,通过同时利用系统内存(即CPU DRAM)和存储空间(即固态设备,SSD)来突破GPU内存瓶颈,但其设计主要针对以模型为中心的性能问题。因此,关键的系统层面问题,包括系统内存碎片化、低效的固定缓冲区分配、CPU使用峰值以及文件系统开销,仍未得到解决,限制了可扩展性并推高了成本。这一观察促使本文提出MemAscend,这是一个系统性地解决SSD卸载LLM训练中尚未充分研究的系统内存瓶颈的框架,重点关注资源受限的环境。通过简化固定内存分配、消除碎片化并减轻峰值开销,MemAscend回收了大量系统内存预算,从而在不超出有限硬件条件的情况下支持更大的模型、更长的上下文窗口和更大的批量。在多种LLM基准测试中,与标准SSD卸载技术相比,MemAscend将系统内存峰值消耗平均降低了55.7%,降低了微调的硬件门槛,并为在资源有限的机器上进行高性价比的大规模训练开辟了新的可能。
Article 10
Title@2025-05-29 (4): Edge-First Language Model Inference: Models, Metrics, and Tradeoffs
Title: Edge-First Language Model Inference: Models, Metrics, and Tradeoffs | Edge-First Language Model Inferenz: Modelle, Metrics und Tradeoffs | 边缘第一语言模式示范推论:模型、计量和权衡取舍 2505.16508v2 |
Authors (2): SiYoung Jang, Roberto Morabito
The widespread adoption of Language Models (LMs) across industries is driving interest in deploying these services across the computing continuum, from the cloud to the network edge. This shift aims to reduce costs, lower latency, and improve reliability and privacy. Small Language Models (SLMs), enabled by advances in model compression, are central to this shift, offering a path to on-device inference on resource-constrained edge platforms. This work examines the interplay between edge and cloud deployments, starting from detailed benchmarking of SLM capabilities on single edge devices, and extending to distributed edge clusters. We identify scenarios where edge inference offers comparable performance with lower costs, and others where cloud fallback becomes essential due to limits in scalability or model capacity. Rather than proposing a one-size-fits-all solution, we present platform-level comparisons and design insights for building efficient, adaptive LM inference systems across heterogeneous environments.
跨行业广泛采用语言模型(LMs)正在促使人们有兴趣在从云层到网络边缘的计算连续体中部署这些服务。这一转变旨在降低成本、降低潜伏度、提高可靠性和隐私性。由模型压缩进步促成的小型语言模型(SLMs)是这一转变的核心,为在资源紧缺的边缘平台上进行在线推论提供了一条路径。这项工作从对单一边缘装置的可持续土地管理能力进行详细基准设定开始,到分布边缘集群,审视了边缘和云层部署之间的相互作用。我们确定了边缘推论能够提供成本较低的可比性能,而由于可缩放性或模型能力的限制,云层回退变得至关重要的其他情况。我们提出的平台层面比较和设计洞察力,不是提出一刀切的解决办法,而是在各种环境中建立高效、适应性LM推力系统。
Article 11
Title@2025-05-29 (4): Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism
Title: Ghidorah: Fast LLM Inference on Edge with Speculative Decoding and Hetero-Core Parallelism | Ghidorah: Schnelle LLM-Inferenz am Rand mit spekulativer Dekodierung und Hetero-Core-Parallelität | Ghidorah:快速LLM 2505.23219v1 |
Authors (5): Jinhui Wei, Ye Huang, Yuhui Zhou, Jiazhi Jiang, Jiangsu Du
In-situ LLM inference on end-user devices has gained significant interest due to its privacy benefits and reduced dependency on external infrastructure. However, as the decoding process is memory-bandwidth-bound, the diverse processing units in modern end-user devices cannot be fully exploited, resulting in slow LLM inference. This paper presents Ghidorah, an LLM inference system for end-user devices with the unified memory architecture. The key idea of Ghidorah can be summarized in two steps: 1) leveraging speculative decoding approaches to enhance parallelism, and 2) ingeniously distributing workloads across multiple heterogeneous processing units to maximize computing power utilization. Ghidorah includes the hetero-core model parallelism (HCMP) architecture and the architecture-aware profiling (ARCA) approach. The HCMP architecture guides partitioning by leveraging the unified memory design of end-user devices and adapting to the hybrid computational demands of speculative decoding. The ARCA approach is used to determine the optimal speculative strategy and partitioning strategy, balancing acceptance rate with parallel capability to maximize the speedup. Additionally, we optimize sparse computation on ARM CPUs. Experimental results show that Ghidorah can achieve up to 7.6x speedup in the dominant LLM decoding phase compared to the sequential decoding approach on the NVIDIA Jetson NX.
Ghidorah的主要概念可以归纳为两步:1)利用投机性解码方法加强平行关系,2)在多个不同处理单位之间巧妙地分配工作量,以最大限度地利用计算能力。Ghidorah包括了高核心模型平行结构(HCMP)和结构质量分析(ARCA)方法。Hidorah结构通过利用终端用户设备统一记忆设计并适应投机解码的混合计算要求来指导分割。ARCA方法用于确定最佳投机性策略和分解战略,平衡接受率和平行能力以最大限度地实现加速使用。此外,我们在GMASER中可以优化对GMASIM的Slational-DRAS级计算结果,在GARSISBSA中可以对GSIM的SLSAS级进行最高级的升级。
Article 12
Title@2025-05-30 (5): Improving Parallel Program Performance with LLM Optimizers via Agent-System Interfaces
Title: Improving Parallel Program Performance with LLM Optimizers via Agent-System Interfaces | Verbesserung der parallelen Programmleistung mit LLM-Optimierern über Agent-System-Schnittstellen | 通过代理-系统接口改进与LLM优化器的平行方案绩效 2410.15625v4 |
Authors (7): Anjiang Wei, Allen Nie, Thiago S. F. X. Teixeira, Rohan Yadav, Wonchan Lee, Ke Wang, Alex Aiken
Modern scientific discovery increasingly relies on high-performance computing for complex modeling and simulation. A key challenge in improving parallel program performance is efficiently mapping tasks to processors and data to memory, a process dictated by intricate, low-level system code known as mappers. Developing high-performance mappers demands days of manual tuning, posing a significant barrier for domain scientists without systems expertise. We introduce a framework that automates mapper development with generative optimization, leveraging richer feedback beyond scalar performance metrics. Our approach features the Agent-System Interface, which includes a Domain-Specific Language (DSL) to abstract away the low-level complexity of system code and define a structured search space, as well as AutoGuide, a mechanism that interprets raw execution output into actionable feedback. Unlike traditional reinforcement learning methods such as OpenTuner, which rely solely on scalar feedback, our method finds superior mappers in far fewer iterations. With just 10 iterations, it outperforms OpenTuner even after 1000 iterations, achieving 3.8X faster performance. Our approach finds mappers that surpass expert-written mappers by up to 1.34X speedup across nine benchmarks while reducing tuning time from days to minutes.
现代科学发现日益依赖高性能计算来进行复杂的建模和模拟。改进并行程序性能的一个关键挑战是高效地将任务映射到处理器、将数据映射到内存,这一过程由复杂的低层系统代码(即映射器,mapper)决定。开发高性能映射器需要数天的人工调优,这对没有系统专长的领域科学家构成了巨大障碍。我们提出了一个利用生成式优化自动开发映射器的框架,它利用了比标量性能指标更丰富的反馈。我们的方法以代理-系统接口(Agent-System Interface)为核心,其中包括一种领域特定语言(DSL),用于屏蔽系统代码的低层复杂性并定义结构化的搜索空间,以及AutoGuide,一种将原始执行输出解释为可操作反馈的机制。与OpenTuner等仅依赖标量反馈的传统强化学习方法不同,我们的方法能以少得多的迭代次数找到更优的映射器。仅用10次迭代,它就超过了OpenTuner迭代1000次后的结果,性能快3.8倍。我们的方法找到的映射器在九个基准测试中比专家编写的映射器最多快1.34倍,同时将调优时间从数天缩短到数分钟。
Article 13
Title@2025-05-29 (4): The Panaceas for Improving Low-Rank Decomposition in Communication-Efficient Federated Learning
Title: The Panaceas for Improving Low-Rank Decomposition in Communication-Efficient Federated Learning | Die Panaceas zur Verbesserung der Zersetzung mit geringem Rank im kommunikativ-effizienten Federated Learning | 改善通信-高效联邦学习中低-兰克分解的全景 2505.23176v1 |
Authors (9): Shiwei Li, Xiandi Luo, Haozhao Wang, Xing Tang, Shijie Xu, Weihong Luo, Yuhua Li, Xiuqiang He, Ruixuan Li
To improve the training efficiency of federated learning (FL), previous research has employed low-rank decomposition techniques to reduce communication overhead. In this paper, we seek to enhance the performance of these low-rank decomposition methods. Specifically, we focus on three key issues related to decomposition in FL: what to decompose, how to decompose, and how to aggregate. Subsequently, we introduce three novel techniques: Model Update Decomposition (MUD), Block-wise Kronecker Decomposition (BKD), and Aggregation-Aware Decomposition (AAD), each targeting a specific issue. These techniques are complementary and can be applied simultaneously to achieve optimal performance. Additionally, we provide a rigorous theoretical analysis to ensure the convergence of the proposed MUD. Extensive experimental results show that our approach achieves faster convergence and superior accuracy compared to relevant baseline methods. The code is available at https://github.com/Leopold1423/fedmud-icml25.
为了提高联邦学习的培训效率,先前的研究采用了低级分解技术,以减少通信管理费用。在本文中,我们力求提高这些低级分解方法的绩效。具体地说,我们侧重于与FL分解有关的三个关键问题:分解什么,如何分解,如何分解,以及如何综合。随后,我们引入了三种新颖技术:模范更新分解技术(MUD),布洛克-中克罗内克分解技术(BKD),以及聚合-Aware分解技术(AAAD),这些技术都是针对一个具体问题的。这些技术是相辅相成的,可以同时应用,以实现最佳绩效。此外,我们提供了严格的理论分析,以确保拟议的MUD的趋同。广泛的实验结果表明,我们的方法与相关的基线方法相比,更快地趋同和更加精确。该代码可在https://github.com/Leopold1423Fedmud-icml25上查阅。
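The communication saving that motivates low-rank and Kronecker-style decompositions can be made concrete with simple parameter counting. The sketch below is a generic illustration, not the paper's MUD, BKD, or AAD algorithms: it compares the number of values sent for a full m x n matrix, a rank-r factorization, and a single Kronecker product of two small factors. Function names and example sizes are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def full_params(m: int, n: int) -> int:
    """Values sent when transmitting a dense m x n matrix."""
    return m * n


def low_rank_params(m: int, n: int, r: int) -> int:
    """Values sent for factors U (m x r) and V (r x n)."""
    return r * (m + n)


def kron_params(m1: int, n1: int, m2: int, n2: int) -> int:
    """Values sent for factors A (m1 x n1) and B (m2 x n2) of A kron B."""
    return m1 * n1 + m2 * n2


def kron(a: Matrix, b: Matrix) -> Matrix:
    """Kronecker product: block (i, j) of the result is a[i][j] * b."""
    p, q = len(b), len(b[0])
    rows, cols = len(a) * p, len(a[0]) * q
    return [[a[i // p][j // q] * b[i % p][j % q] for j in range(cols)]
            for i in range(rows)]


if __name__ == "__main__":
    m = n = 1024
    print(full_params(m, n))             # dense update
    print(low_rank_params(m, n, 8))      # rank-8 factors
    print(kron_params(32, 32, 32, 32))   # two 32 x 32 Kronecker factors
    print(kron([[1, 2]], [[1], [1]]))
```

A Kronecker product of two full-rank 32 x 32 factors is itself a full-rank 1024 x 1024 matrix, which is one common argument for Kronecker-based factorizations over plain low-rank ones at a similar communication budget.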
Article 14
Title@2025-05-29 (4): DOPPLER: Dual-Policy Learning for Device Assignment in Asynchronous Dataflow Graphs
Title: DOPPLER: Dual-Policy Learning for Device Assignment in Asynchronous Dataflow Graphs | DOPPLER: Dual-Policy-Lernen für die Gerätezuordnung in asynchronen Datenflussgraphen | DOPPLER: 同步数据流图表中设备分配的双政策学习 2505.23131v1 |
Authors (8): Xinyu Yao, Daniel Bourgeois, Abhinav Jain, Yuxin Tang, Jiawen Yao, Zhimin Ding, Arlei Silva, Chris Jermaine
We study the problem of assigning operations in a dataflow graph to devices to minimize execution time in a work-conserving system, with emphasis on complex machine learning workloads. Prior learning-based methods often struggle due to three key limitations: (1) reliance on bulk-synchronous systems like TensorFlow, which under-utilize devices due to barrier synchronization; (2) lack of awareness of the scheduling mechanism of underlying systems when designing learning-based methods; and (3) exclusive dependence on reinforcement learning, ignoring the structure of effective heuristics designed by experts. In this paper, we propose DOPPLER, a three-stage framework for training dual-policy networks consisting of 1) a $\mathsf{SEL}$ policy for selecting operations and 2) a $\mathsf{PLC}$ policy for placing chosen operations on devices. Our experiments show that DOPPLER outperforms all baseline methods across tasks by reducing system execution time and additionally demonstrates sampling efficiency by reducing per-episode training time.
我们研究在数据流图中将操作分配到在工作保护系统中最大限度地减少执行时间的设备上的问题,重点是复杂的机器学习工作量。先前的学习方法往往由于三个关键限制而困难重重:(1) 依赖诸如TensorFlow这样的散装同步系统,这些系统由于障碍同步而未充分利用设备;(2) 在设计学习方法时对基础系统的时间安排机制缺乏认识;(3) 完全依赖强化学习,忽视专家设计的有效超常结构。在本文中,我们提议为培训双政策网络建立一个三阶段框架,包括:1) $\mathsf{SEL} 业务选择政策;和(2) 将选定操作安装在设备上的政策。我们的实验表明, ktextsc{Doppler} 通过减少系统执行时间和通过减少人均培训时间来进一步展示取样效率,从而超越了所有任务的基线方法。
Article 15
Title@2025-05-29 (4): Toward Cost-Efficient Serving of Mixture-of-Experts with Asynchrony
Title: Toward Cost-Efficient Serving of Mixture-of-Experts with Asynchrony | Auf dem Weg zu einem kosteneffizienten Servieren von Mixture-of-Experts mit Asynchrony | 争取以成本低效益高的方式服务专家与非同步混合服务 2505.08944v2 |
Authors (5): Shaoyu Wang, Guangrong He, Geon-Woo Kim, Yanqi Zhou, Seo Jin Park
Mixture-of-Experts (MoE) architectures offer the promise of larger model capacity without the prohibitive costs of fully dense designs. However, in real-world inference serving, load skew across experts often leads to suboptimal device utilization and excessive synchronization overheads. This paper introduces Asynchronous Expert Parallelism (AEP), a new paradigm that decouples layer execution from barrier-style synchronization. By dynamically queuing tokens at each layer (referred to as $\mu$-queuing) and adaptively re-batching them on demand, GPUs avoid waiting for straggling experts and instead continuously process whichever layer is ready. This asynchronous approach mitigates two major inefficiencies in traditional expert-parallel systems: (1) idle GPU time while waiting for the hottest expert, and (2) small-batch executions on colder experts that waste memory bandwidth. We implement these ideas in a serving system called AMoE, which disaggregates attention from expert layers and uses a defragging scheduler to reduce batch fragmentation. Evaluations on prototype MoE models show that AMoE improves throughput by up to 2.7x compared to state-of-the-art baselines, incurring a manageable latency penalty and providing a cost-effective operating point. Furthermore, experiments demonstrate nearly linear scalability to multi-node settings, whereas the baseline system shows no throughput increase even when the number of GPUs is doubled.
模拟专家(MoE)架构提供了更大的模型能力,而没有完全稠密的设计的高昂成本。 然而,在现实世界的推论服务中,专家之间负重力往往导致设备使用不优化和过度同步管理。 本文介绍了Asyncronous 专家平行主义(AEP),这是一个将层执行与屏障式同步脱钩的新范例。 通过在每一层(称为双倍递增)以适应性方式对标志进行重新比对,GPUs避免等待悬浮专家,而是持续处理任何已经准备好的层。 这种不协调的做法减轻了传统专家平行系统中两大低效率:(1) 闲置的GPU值时间等待最热的专家,(2) 对浪费记忆带的冷藏专家进行小规模处决。 我们在一个名为AMoOuti的系统里实施这些想法,该系统从专家层中分解关注,并使用分批的表单来减少分批的分解分解。 在模范中,对MoEx原型的模型的精确性操作性几乎显示APO-x的可控性基线, 显示AWO- dal-xxxxxx的可控性运行成本。
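The per-layer queuing idea described in this abstract can be sketched as a toy scheduler. This is not the AMoE implementation: it models a single worker, treats tokens as opaque ids, and assumes a "largest queue first" policy purely for illustration. The function name and parameters are hypothetical.

```python
from collections import deque
from typing import List, Tuple


def run_layer_queues(num_layers: int, token_ids: List[int],
                     max_batch: int) -> Tuple[List[int], List[Tuple[int, int]]]:
    """Process tokens through layers without a global per-layer barrier.

    Each layer has its own queue. At every step the worker picks the layer
    with the most waiting tokens (assumed policy), re-batches up to
    max_batch of them, and forwards them to the next layer's queue.
    Returns the finished tokens and a trace of (layer, batch_size) steps.
    """
    queues = [deque() for _ in range(num_layers)]
    queues[0].extend(token_ids)
    finished: List[int] = []
    trace: List[Tuple[int, int]] = []
    while any(queues):  # an empty deque is falsy
        layer = max(range(num_layers), key=lambda l: len(queues[l]))
        take = min(max_batch, len(queues[layer]))
        batch = [queues[layer].popleft() for _ in range(take)]
        trace.append((layer, len(batch)))
        if layer + 1 < num_layers:
            queues[layer + 1].extend(batch)  # hand off to the next layer
        else:
            finished.extend(batch)           # last layer: token is done
    return finished, trace


if __name__ == "__main__":
    done, steps = run_layer_queues(num_layers=2, token_ids=[0, 1, 2], max_batch=2)
    print(done)
    print(steps)
```

Picking the fullest queue tends to favor larger batches, which is the effect the abstract attributes to adaptive re-batching; a real system would also have to weigh latency and per-expert routing, which this sketch ignores.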
Article 16
Title@2025-05-29 (4): Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts
Title: Shortcut-connected Expert Parallelism for Accelerating Mixture-of-Experts | Shortcut-verbundene Experten-Parallelität für die Beschleunigung von Mixture-of-Experts | 加速混合专家专家专家平行专家 2404.05019v3 |
Authors (6): Weilin Cai, Juyong Jiang, Le Qin, Junwei Cui, Sunghun Kim, Jiayi Huang
Expert parallelism has emerged as a key strategy for distributing the computational workload of sparsely-gated mixture-of-experts (MoE) models across multiple devices, enabling the processing of increasingly large-scale models. However, the All-to-All communication inherent to expert parallelism poses a significant bottleneck, limiting the efficiency of MoE models. Although existing optimization methods partially mitigate this issue, they remain constrained by the sequential dependency between communication and computation operations. To address this challenge, we propose ScMoE, a novel shortcut-connected MoE architecture integrated with an overlapping parallelization strategy. ScMoE decouples communication from its conventional sequential ordering, enabling up to 100% overlap with computation. Compared to the prevalent top-2 MoE baseline, ScMoE achieves speedups of 1.49 times in training and 1.82 times in inference. Moreover, our experiments and analyses indicate that ScMoE not only achieves comparable but in some instances surpasses the model quality of existing approaches.
专家的平行性已成为一种关键战略,用于通过多种装置分配分散的分散专家混合模型的计算工作量,从而能够处理越来越大规模的模型。然而,专家平行性所固有的 “ 人人交流 “ 构成了一个很大的瓶颈,限制了教育部模式的效率。虽然现有的优化方法在一定程度上缓解了这一问题,但它们仍然受到通信和计算操作之间依次依赖的制约。为了应对这一挑战,我们提议ScMoE,这是一个与重叠的平行战略相结合的新颖的、与捷径相连的教育部结构。ScMoE从常规顺序排序中解析通信,使计算重叠率达到100%。与普遍的上层-2教育部基线相比,ScMoE在培训中实现了1.49倍的加速率,在推断中实现了1.82倍的加速率。此外,我们的实验和分析表明,ScMoE不仅取得了可比较的结果,而且在某些情况下超过了现有方法的模型质量。
Article 17
Title@2025-05-29 (4): Speeding up Model Loading with fastsafetensors
Title: Speeding up Model Loading with fastsafetensors | Beschleunigen des Modells Beladung mit Schnellsicherern | 加速装有快速保障装置的模型加载速度 2505.23072v1 |
Authors (5): Takeshi Yoshimura, Tatsuhiro Chiba, Manish Sethi, Daniel Waddington, Swaminathan Sundararaman
The rapid increase in model parameter sizes introduces new challenges in pre-trained model loading. Currently, machine learning code often deserializes each parameter as a tensor object in host memory before copying it to device memory. We found that this approach underutilized storage throughput and significantly slowed down loading large models in a widely used model file format, safetensors. In this work, we present fastsafetensors, a Python library designed to optimize the deserialization of tensors in safetensors files. Our approach first copies groups of on-disk parameters to device memory, where they are directly instantiated as tensor objects. This design enables further optimization in low-level I/O and high-level tensor preprocessing, including parallelized copying, peer-to-peer DMA, and GPU offloading. Experimental results show performance improvements of 4.8x to 7.5x in loading models such as Llama (7, 13, and 70 billion parameters), Falcon (40 billion parameters), and Bloom (176 billion parameters).
模型参数大小的快速增加给经过训练的模型装入带来了新的挑战。 目前, 机器学习代码通常在复制到设备内存之前, 将每个参数作为主机内存的 发光对象进行消散。 我们发现, 这种方法未充分利用存储输送量, 并大大减慢了以广泛使用的模型文件格式、 安全加速器装载大型模型。 在这项工作中, 我们展示了快速安全器, 即一个旨在优化安全加速器文件中的发光器的发光的Python 图书馆。 我们的方法首先复制了设备内存的显示器参数组, 在那里, 它们被直接作为发光对象即时。 这个设计可以进一步优化低级 I/ O 和高水平的发光预处理, 包括平行复制、 同行对等DMA 和 GPUP 卸载。 实验结果表明, Llama ( 7、 13 和 700 参数)、 Falcon( 400 参数) 和 Bloom( 760亿 参数) 。
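To make the "groups of on-disk parameters" idea concrete, the sketch below parses the header of a safetensors file using only the Python standard library and merges tensors whose byte ranges are adjacent into larger spans that could be copied in one transfer. The file layout (an 8-byte little-endian header length, a JSON header with `dtype`, `shape`, and `data_offsets`, then the raw byte buffer) follows the published safetensors format. The grouping heuristic and function names are assumptions for illustration and do not reflect the fastsafetensors API.

```python
import json
import struct
from typing import Dict, List, Tuple


def read_safetensors_header(blob: bytes) -> Tuple[Dict[str, dict], int]:
    """Return (tensor metadata, byte offset where tensor data begins)."""
    (header_len,) = struct.unpack("<Q", blob[:8])      # little-endian u64
    header = json.loads(blob[8:8 + header_len].decode("utf-8"))
    header.pop("__metadata__", None)                   # optional free-form entry
    return header, 8 + header_len


def contiguous_groups(header: Dict[str, dict]) -> List[dict]:
    """Merge tensors with adjacent data_offsets into single copy spans."""
    spans = sorted((tuple(info["data_offsets"]), name)
                   for name, info in header.items())
    groups: List[dict] = []
    for (begin, end), name in spans:
        if groups and groups[-1]["end"] == begin:
            groups[-1]["end"] = end                     # extend previous span
            groups[-1]["names"].append(name)
        else:
            groups.append({"begin": begin, "end": end, "names": [name]})
    return groups


def make_demo_blob() -> bytes:
    """Build a tiny in-memory file with two adjacent 4-byte F32 tensors."""
    header = {
        "a": {"dtype": "F32", "shape": [1], "data_offsets": [0, 4]},
        "b": {"dtype": "F32", "shape": [1], "data_offsets": [4, 8]},
    }
    raw = json.dumps(header).encode("utf-8")
    return struct.pack("<Q", len(raw)) + raw + struct.pack("<ff", 1.0, 2.0)


if __name__ == "__main__":
    blob = make_demo_blob()
    meta, data_start = read_safetensors_header(blob)
    for group in contiguous_groups(meta):
        payload = blob[data_start + group["begin"]:data_start + group["end"]]
        print(group["names"], len(payload), "bytes in one span")
```

Copying one large span and slicing tensors out of it afterwards, rather than issuing one small read per tensor, is the general pattern the abstract describes; the actual library additionally handles device placement and DMA, which are out of scope here.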
Article 18
Title@2025-05-28 (3): Profiling and optimization of multi-card GPU machine learning jobs
Title: Profiling and optimization of multi-card GPU machine learning jobs | Profilierung und Optimierung von Multi-Card-GPU-Maschinenlernjobs | 多卡 GPPU 机器学习工作的分析和优化 2505.22905v1 |
Authors (4): Marcin Lawenda, Kyrylo Khloponin, Krzesimir Samborski, Łukasz Szustak
The effectiveness and efficiency of machine learning methodologies are crucial, especially with respect to the quality of results and computational cost. This paper discusses different model optimization techniques, providing a comprehensive analysis of key performance indicators. Several parallelization strategies for image recognition, adapted to different hardware and software configurations, including distributed data parallelism and distributed hardware processing, are analyzed. Selected optimization strategies are studied in detail, highlighting the related challenges and advantages of their implementation. Furthermore, the impact of different performance improvement techniques (DPO, LoRA, QLoRA, and QAT) on the tuning process of large language models is investigated. Experimental results illustrate how the nature of the task affects the iteration time in a multiprocessor environment, VRAM utilization, and overall memory transfers. Test scenarios are evaluated on the modern NVIDIA H100 GPU architecture.
机器学习方法的有效性和效率至关重要,特别是在结果质量和计算成本方面。本文件讨论了不同的模型优化技术,对关键业绩指标进行了全面分析。分析了若干针对不同硬件和软件配置的图像识别平行战略,包括分布式数据平行和分布式硬件处理。对选定的优化战略进行了详细研究,强调了实施这些战略的相关挑战和优势。此外,还调查了不同性能改进技术(DPO、LORA、QLORA和QAT)对大型语言模型调控过程的影响。实验结果说明了任务的性质如何影响多处理环境中的循环时间、VRAM的利用和总体记忆传输。对现代NVIDIA H100 GPU结构的测试情景进行了评估。
Article 19
Title@2025-05-28 (3): Visualizing Cloud-native Applications with KubeDiagrams
Title: Visualizing Cloud-native Applications with KubeDiagrams | Cloud-native Anwendungen mit KubeDiagrammen visualisieren | 带有KubeDiagrams 的可视化云源应用 2505.22879v1 |
Authors (2): Philippe Merle, Fabio Petrillo
Modern distributed applications increasingly rely on cloud-native platforms to abstract the complexity of deployment and scalability. As the de facto orchestration standard, Kubernetes enables this abstraction, but its declarative configuration model makes the architectural understanding difficult. Developers, operators, and architects struggle to form accurate mental models from raw manifests, Helm charts, or cluster state descriptions. We introduce KubeDiagrams, an open-source tool that transforms Kubernetes manifests into architecture diagrams. By grounding our design in a user-centered study of real-world visualization practices, we identify the specific challenges Kubernetes users face and map these to concrete design requirements. KubeDiagrams integrates seamlessly with standard Kubernetes artifacts, preserves semantic fidelity to core concepts, and supports extensibility and automation. We detail the tool’s architecture, visual encoding strategies, and extensibility mechanisms. Three case studies illustrate how KubeDiagrams enhances system comprehension and supports architectural reasoning in distributed cloud-native systems. KubeDiagrams addresses concrete pain points in Kubernetes-based DevOps practices and is valued for its automation, clarity, and low-friction integration into real-world tooling environments.
现代分布式应用日益依赖云原生平台来抽象部署和扩展的复杂性。作为事实上的编排标准,Kubernetes实现了这种抽象,但其声明式配置模型使架构理解变得困难。开发者、运维人员和架构师难以从原始清单、Helm图表或集群状态描述中形成准确的心智模型。我们介绍KubeDiagrams,这是一个将Kubernetes清单转换为架构图的开源工具。通过以用户为中心、针对真实世界可视化实践的研究作为设计基础,我们确定了Kubernetes用户面临的具体挑战,并将其映射为具体的设计需求。KubeDiagrams与标准Kubernetes制品无缝集成,保持对核心概念的语义忠实,并支持可扩展性和自动化。我们详细介绍了该工具的架构、视觉编码策略和扩展机制。三个案例研究说明了KubeDiagrams如何在分布式云原生系统中增强系统理解并支持架构推理。KubeDiagrams解决了基于Kubernetes的DevOps实践中的具体痛点,并因其自动化、清晰性以及与实际工具环境的低摩擦集成而受到重视。
Article 20
Title@2025-05-28 (3): The National Research Platform: Stretched, Multi-Tenant, Scientific Kubernetes Cluster
Title: The National Research Platform: Stretched, Multi-Tenant, Scientific Kubernetes Cluster | Die Nationale Forschungsplattform: Streckiger, Multi-Tenant-Cluster, wissenschaftlicher Kubernetes-Cluster | 国家研究平台:延伸、多层、多层、科学库伯涅茨集群 2505.22864v1 |
Authors (12): Derek Weitzel, Ashton Graves, Sam Albin, Huijun Zhu, Frank Würthwein, Mahidhar Tatineni, Dmitry Mishin, John Graham, Elham E Khoda, Mohammad Firas Sada, Larry Smarr, Thomas DeFanti
The National Research Platform (NRP) represents a distributed, multi-tenant Kubernetes-based cyberinfrastructure designed to facilitate collaborative scientific computing. Spanning over 75 locations in the U.S. and internationally, the NRP uniquely integrates varied computational resources, ranging from single nodes to extensive GPU and CPU clusters, to support diverse research workloads including advanced AI and machine learning tasks. It emphasizes flexibility through user-friendly interfaces such as JupyterHub and low-level control of resources through direct Kubernetes interaction. Critical operational insights are discussed, including security enhancements using Kubernetes-integrated threat detection, extensive monitoring, and comprehensive accounting systems. This paper highlights the NRP’s growing importance and scalability in addressing the increasing demands for distributed scientific computational resources.
国家研究平台(NRP)是一个分布式、多租租户库伯涅茨网络基础设施,旨在便利合作科学计算,在美国和国际上超过75个地点,NRP的独特整合了从单一节点到广泛的GPU和CPU集群等多种计算资源,以支持各种研究工作量,包括先进的人工智能和机器学习任务。它强调通过诸如JupyterHub等方便用户的界面和通过直接Kubernetes互动对资源进行低水平控制的灵活性。它讨论了重要的业务见解,包括利用Kubernetes综合威胁探测、广泛监测和综合会计系统加强安全。本文强调NRP在满足分布式科学计算资源日益增长的需求方面越来越重要和可扩缩性。
Article 21
Title@2025-05-28 (3): $Δ$-Nets: Interaction-Based System for Optimal Parallel $λ$-Reduction
Title: $Δ$-Nets: Interaction-Based System for Optimal Parallel $λ$-Reduction | $Δ$-Nets: Interaktionsbasiertes System für eine optimale parallele $λ$-Reduktion | \(-净额:最佳平行互动系统\)$美元-削减 2505.20314v2 |
Authors (1): Daniel Augusto Rizzi Salvadori
I present a model of universal parallel computation called $\Delta$-Nets, and a method to translate $\lambda$-terms into $\Delta$-nets and back. Together, the model and the method constitute an algorithm for optimal parallel $\lambda$-reduction, solving the longstanding enigma with groundbreaking clarity. I show that the $\lambda$-calculus can be understood as a projection of $\Delta$-Nets – one that severely restricts the structure of sharing, among other drawbacks. Unhindered by these restrictions, the $\Delta$-Nets model opens the door to new highly parallel programming language implementations and computer architectures that are more efficient and performant than previously possible.
我提出了一个称为 $\Delta$-Nets 的通用并行计算模型,以及一种在 $\lambda$-项与 $\Delta$-nets 之间相互转换的方法。该模型与方法共同构成了最优并行 $\lambda$-归约的算法,以突破性的清晰度解决了这一长期谜题。我证明 $\lambda$-演算可以被理解为 $\Delta$-Nets 的一种投影,这种投影严重限制了共享结构,并带来其他缺点。不受这些限制的束缚,$\Delta$-Nets 模型为新的高度并行的编程语言实现和计算机体系结构打开了大门,使其比以往更高效、性能更强。
Article 22
Title@2025-05-28 (3): Smart Contracts for SMEs and Large Companies
Title: Smart Contracts for SMEs and Large Companies | Intelligente Verträge für KMU und Großunternehmen | 中小企业和大公司的智能合同 2505.22619v1 |
Authors (3): C. G. Liu, P. Bodorik, D. Jutla
Research on blockchains addresses multiple issues, one of which is writing smart contracts. In our previous research we described a methodology and a tool to generate, in automated fashion, smart contracts from BPMN models. The generated smart contracts provide support for multi-step transactions that facilitate repair/upgrade of smart contracts. In this paper we show how the approach is used to support collaborations via smart contracts for companies ranging from SMEs with limited IT capabilities to companies with in-house IT that use blockchain smart contracts. Furthermore, we show how, for certain applications, the approach lets a BPMN modeler generate smart contracts without any knowledge of blockchain technology or smart contract development; we thereby hope to facilitate the democratization of smart contracts and blockchain technology.
在以往的研究中,我们描述了从BPMN模型中以自动方式生成智能合同的方法和工具。产生的智能合同为多步交易提供了支持,便利了智能合同的修理/升级。在本文中,我们展示了如何利用这一方法支持从信息技术能力很小的中小企业到使用链链智能合同的信息技术公司通过智能合同进行协作。此外,我们还展示了BPMN模型的某个应用如何利用这一方法产生智能合同,而该模型不需要任何有关链式技术或智能合同开发的知识,因此我们希望促进智能合同和链式技术的民主化。
Article 23
Title@2025-05-28 (3): Pilot-Quantum: A Quantum-HPC Middleware for Resource, Workload and Task Management
Title: Pilot-Quantum: A Quantum-HPC Middleware for Resource, Workload and Task Management | Pilot-Quantum: Eine Quantum-HPC Middleware für Ressourcen-, Workload- und Task-Management | Pilot-Quantum:面向资源、工作负载和任务管理的量子-HPC 中间件 2412.18519v3 |
Authors (5): Pradeep Mantha, Florian J. Kiwit, Nishant Saurabh, Shantenu Jha, Andre Luckow
As quantum hardware advances, integrating quantum processing units (QPUs) into HPC environments and managing diverse infrastructure and software stacks become increasingly essential. Pilot-Quantum addresses these challenges as a middleware designed to provide unified application-level management of resources and workloads across hybrid quantum-classical environments. It is built on a rigorous analysis of existing quantum middleware systems and application execution patterns. It implements the Pilot Abstraction conceptual model, originally developed for HPC, to manage resources, workloads, and tasks. It is designed for quantum applications that rely on task parallelism, including (i) hybrid algorithms, such as variational approaches, and (ii) circuit cutting systems, used to partition and execute large quantum circuits. Pilot-Quantum facilitates seamless integration of QPUs, classical CPUs, and GPUs, while supporting high-level programming frameworks like Qiskit and PennyLane. This enables users to efficiently design and execute hybrid workflows across diverse computing resources. We demonstrate the capabilities of Pilot-Quantum through multiple mini-apps, simplified yet representative kernels focusing on critical performance bottlenecks, including different circuit executions (e.g., using IBM's Eagle QPU and simulators), circuit cutting, and quantum machine learning scenarios.
随着量子硬件的进步,将量子处理单元(QPU)集成到HPC环境中并管理多样化的基础设施和软件栈变得日益重要。Pilot-Quantum 作为一种中间件应对这些挑战,旨在为混合量子-经典环境中的资源和工作负载提供统一的应用级管理。它建立在对现有量子中间件系统和应用执行模式的严格分析的基础上。它实现了最初为HPC开发的 Pilot Abstraction 概念模型,以管理资源、工作负载和任务。它面向依赖任务并行的量子应用,包括(一)混合算法,例如变分方法,和(二)用于划分和执行大型量子电路的电路切割系统。Pilot-Quantum 促进了QPU、经典CPU和GPU的无缝集成,同时支持 Qiskit 和 PennyLane 等高层编程框架,使用户能够高效地设计和执行跨多种计算资源的混合工作流。我们通过多个微型应用(mini-apps,即聚焦关键性能瓶颈的简化但有代表性的内核)展示了 Pilot-Quantum 的能力,包括不同的电路执行(例如使用 IBM 的 Eagle QPU 和模拟器)、电路切割和量子机器学习场景。
Article 24
Title@2025-05-28 (3): Morpheus Consensus: Excelling on trails and autobahns
Title: Morpheus Consensus: Excelling on trails and autobahns | Morpheus Consensus: Hervorragend auf Trails und Autobahnen | Morpheus 共识:在小径和高速公路上都表现出色 2502.08465v2 |
Authors (2): Andrew Lewis-Pye, Ehud Shapiro
Recent research in consensus has often focussed on protocols for State-Machine-Replication (SMR) that can handle high throughputs. Such state-of-the-art protocols (generally DAG-based) induce undue overhead when the needed throughput is low, or else exhibit unnecessarily poor latency and communication complexity during periods of low throughput. Here we present Morpheus Consensus, which naturally morphs from a quiescent low-throughput leaderless blockchain protocol to a high-throughput leader-based DAG protocol and back, excelling in latency and complexity in both settings. During high throughput, Morpheus is on par with state-of-the-art DAG-based protocols, including Autobahn. During low throughput, Morpheus exhibits competitive complexity and lower latency than standard protocols such as PBFT and Tendermint, which in turn do not perform well under high throughput. The key idea of Morpheus is that as long as blocks do not conflict (due to Byzantine behaviour, network delays, or high-throughput simultaneous production) it produces a forkless blockchain, promptly finalizing each block upon arrival. It assigns a leader only if one is needed to resolve conflicts, in a manner and with performance not unlike Autobahn.
最近在共识方面的研究往往侧重于能够处理高输送量的国家-地中海复制(SMR)协议(SMR ) 。 这种最先进的协议(一般以DAG为基础 ) 在需要的输送量低时会引发不适当的间接费用,或者在低输送量低时会表现出不必要、贫穷的悬浮和通信的复杂性。 我们在这里介绍Morpheus共识,它自然地从一个低排低排领导性无领导性的连锁协议(QMR ) 转变为一个高通量领导性DAG协议和背面,在两种情况下都能够突出拉长性和复杂性。 在高通量通货过程中,Morpheus会与以最先进的DAG为基础的协议,包括Autobahn。在低输送量过程中,Morpheus会表现出比PBFT和Tendermint等标准协议更复杂的竞争复杂性和低通量性。 Morpheus的主要想法是,只要路障不会发生冲突(由于Byzantine 行为、网络延迟或高通量同时生产),它不会像一个不前方冲突。
Article 25
Title@2025-05-28 (3): Grassroots Federation: Fair Governance of Large-Scale, Decentralized, Sovereign Digital Communities
Title: Grassroots Federation: Fair Governance of Large-Scale, Decentralized, Sovereign Digital Communities | Grassroots Federation: Faire Governance der großen, dezentralisierten, Souveränen Digitalen Gemeinschaften | 基层联合会:大、分散、主权数字共同体的公平治理 2505.02208v4 |
Authors (2): Ehud Shapiro, Nimrod Talmon
Grassroots Federation aims to address the egalitarian formation and the fair democratic governance of large-scale, decentralized, sovereign digital communities on the scale of the EU, the US, existing social networks, and even humanity at large. A grassroots federation evolves via the grassroots formation of digital communities and their consensual federation. Such digital communities may form according to geography, jurisdiction, affiliations, relations, interests, causes, and more. Small communities (say, up to 100 members) govern themselves; larger communities, no matter how large, are governed by a similarly small assembly elected by sortition among their members. Earlier work on Grassroots Democratic Federation explored the fair sortition of the assemblies of a federation in a static setting: given a federation, populate its assemblies with members satisfying ex ante and ex post fairness conditions on the participation of members of a community in its assembly, and on the representation of child communities in the assembly of their parent community. In practice, we expect a grassroots democratic federation to grow and evolve dynamically and in all directions: bottom-up, top-down, and middle-out. To address that, we formally specify this dynamic setting and adapt the static fairness conditions to it: the ex post condition on the fair representation of a child community becomes a condition that must always hold; the ex ante conditions in expectation on the fair participation of an individual and on the fair representation of a child community become conditions satisfied in actuality in the limit, provided the federation structure eventually stabilizes. We then present a protocol that satisfies these fairness conditions.
基层联合会的基层联合会通过数字社区及其自愿的联合会的基层形成,这种数字社区可以按照地理、管辖权、附属关系、关系、利益、事业等组成;小型社区(最多达100美元的成员)管理自己;较大的社区 – – 不论规模多大 – – 都由一个由其成员组成的类似小型议会管理 – – 由其成员组成的基层民主大会管理 – – 不论规模多大 – – 基层民主联合会早先的工作在固定的公平环境中探讨了联邦大会的公平性:鉴于一个联邦,其会议由成员在事先和事后都满足的公平条件组成的成员组成;这种数字社区可以按照地理、管辖权、附属关系、关系、利益、事业和更多的地域组成。实际上,我们期望基层民主联合会能够从各个方向 – – 自下而上、自上而下和中 – – 成长和动态地演变;为了解决这个问题,我们正式规定这种动态的公平性条件,并在固定的公平环境下调整联邦的联邦会议:在一个社区成员参加其会议之前和事后条件得到满足的情况下,儿童社区最终必须有一个公平的前代表条件。
Article 26
Title@2025-05-28 (3): Broadcast in Almost Mixing Time
Title: Broadcast in Almost Mixing Time | In fast mischender Zeit übertragen | 几乎混合时间的广播 2502.02165v2 |
Authors (2): Anton Paramonov, Roger Wattenhofer
We study the problem of broadcasting multiple messages in the CONGEST model. In this problem, a dedicated source node $s$ possesses a set $M$ of messages with every message of size $O(\log n)$ where $n$ is the total number of nodes. The objective is to ensure that every node in the network learns all messages in $M$. The execution of an algorithm progresses in rounds, and we focus on optimizing the round complexity of broadcasting multiple messages. Our primary contribution is a randomized algorithm for networks with expander topology, which are widely used in practice for building scalable and robust distributed systems. The algorithm succeeds with high probability and achieves a round complexity that is optimal up to a factor of the network’s mixing time and polylogarithmic terms. It leverages a multi-COBRA primitive, which uses multiple branching random walks running in parallel. To the best of our knowledge, this approach has not been applied in distributed algorithms before. A crucial aspect of our method is the use of these branching random walks to construct an optimal (up to a polylogarithmic factor) tree packing of a random graph, which is then used for efficient broadcasting. This result is of independent interest. We also prove the problem to be NP-hard in a centralized setting and provide insights into why straightforward lower bounds for general graphs, namely graph diameter and $\frac{|M|}{\textit{minCut}}$, cannot be tight.
我们研究在 CONGEST 模型中广播多条消息的问题。在该问题中,一个专用源节点 $s$ 拥有一个消息集合 $M$,每条消息的大小为 $O(\log n)$,其中 $n$ 是节点总数。目标是确保网络中的每个节点都获知 $M$ 中的所有消息。算法按轮次执行,我们着重优化广播多条消息的轮复杂度。我们的主要贡献是一种面向扩展图(expander)拓扑网络的随机算法,这类网络在实践中被广泛用于构建可扩展且稳健的分布式系统。该算法以高概率成功,其轮复杂度在相差网络混合时间和多对数项的因子范围内是最优的。它利用了 multi-COBRA 原语,该原语使用多个并行运行的分支随机游走。据我们所知,这种方法此前尚未应用于分布式算法。我们方法的一个关键方面是利用这些分支随机游走构造随机图的最优(相差一个多对数因子)树填充(tree packing),然后将其用于高效广播。这一结果本身具有独立的意义。我们还证明了该问题在集中式设置下是 NP 难的,并解释了为什么一般图的直接下界,即图直径和 $\frac{|M|}{\textit{minCut}}$,不可能是紧的。
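The branching random walks mentioned in the abstract can be illustrated with a single coalescing-branching (COBRA) walk. The following is a minimal sketch, not the paper's multi-COBRA primitive: the branching factor `k`, the helper `ring_with_chords`, and the test graph are all assumptions made for illustration. In each round every active node forwards a token to `k` uniformly random neighbours, and tokens that land on the same node merge.

```python
import random


def ring_with_chords(n, step):
    """Hypothetical test graph: a ring of n nodes plus chords of length `step`."""
    return {
        u: sorted({(u - 1) % n, (u + 1) % n, (u + step) % n, (u - step) % n})
        for u in range(n)
    }


def cobra_cover(adj, source, k=2, seed=0, max_rounds=10_000):
    """Run one COBRA walk from `source`.

    Returns (rounds, number_of_visited_nodes). The walk stops as soon as
    every node has been visited or `max_rounds` is reached.
    """
    rng = random.Random(seed)
    active = {source}
    visited = {source}
    rounds = 0
    while len(visited) < len(adj) and rounds < max_rounds:
        nxt = set()  # a set makes tokens on the same node coalesce
        for u in active:
            for _ in range(k):  # branching: k independent random neighbours
                nxt.add(rng.choice(adj[u]))
        active = nxt
        visited |= active
        rounds += 1
    return rounds, len(visited)


if __name__ == "__main__":
    graph = ring_with_chords(32, 5)
    print(cobra_cover(graph, source=0))
```

Because tokens coalesce, no node ever sends more than `k` tokens per round, which is what makes this kind of walk a natural fit for bandwidth-limited models such as CONGEST.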
Article 27
Title@2025-05-28 (3): Inclusive, Differentially Private Federated Learning for Clinical Data
Title: Inclusive, Differentially Private Federated Learning for Clinical Data | Inklusives, differenziert privates Federated Learning für klinische Daten | 包容性、差异化私联校临床数据学习 2505.22108v1 |
Authors (10): Santhosh Parampottupadam, Melih Coşğun, Sarthak Pati, Maximilian Zenk, Saikat Roy, Dimitrios Bounias, Benjamin Hamm, Sinem Sav, Ralf Floca, Klaus Maier-Hein
Federated Learning (FL) offers a promising approach for training clinical AI models without centralizing sensitive patient data. However, its real-world adoption is hindered by challenges related to privacy, resource constraints, and compliance. Existing Differential Privacy (DP) approaches often apply uniform noise, which disproportionately degrades model performance, even among well-compliant institutions. In this work, we propose a novel compliance-aware FL framework that enhances DP by adaptively adjusting noise based on quantifiable client compliance scores. Additionally, we introduce a compliance scoring tool based on key healthcare and security standards to promote secure, inclusive, and equitable participation across diverse clinical settings. Extensive experiments on public datasets demonstrate that integrating under-resourced, less compliant clinics with highly regulated institutions yields accuracy improvements of up to 15% over traditional FL. This work advances FL by balancing privacy, compliance, and performance, making it a viable solution for real-world clinical workflows in global healthcare.
联邦学习联合会(FL)为培训临床AI模式提供了一种很有希望的方法,而没有集中敏感的病人数据,然而,其实际采用受到与隐私、资源限制和合规有关的挑战的阻碍;现有的差异隐私(DP)方法往往采用统一噪音,这种噪音不成比例地降低了示范性业绩,即使在遵守标准的机构中也是如此;在这项工作中,我们提议了一个新的了解合规性的FL框架,根据可量化客户合规分数调整噪音,从而增强DP;此外,我们引入了一个基于关键保健和安全标准的合规评分工具,以促进不同临床环境的安全、包容和公平参与;关于公共数据集的广泛实验表明,将资源不足、不合规性强的诊所与监管程度高的机构相结合,可以比传统FL提高高达15%的准确性。 这项工作通过平衡隐私、合规性和绩效,使FL成为全球保健中真实世界临床工作流程的一个可行解决方案,从而推进FL。
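To make the idea of compliance-dependent noise concrete, here is a minimal sketch of a clipped Gaussian mechanism whose noise multiplier shrinks as a client's compliance score rises. The linear mapping in `noise_multiplier`, the score range [0, 1], and the bounds `sigma_min` and `sigma_max` are assumptions for illustration; the abstract does not state the actual rule used by the framework.

```python
import math
import random


def noise_multiplier(score, sigma_min=0.5, sigma_max=2.0):
    """Assumed rule: interpolate linearly from sigma_max (score 0) to sigma_min (score 1)."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("compliance score must lie in [0, 1]")
    return sigma_max - (sigma_max - sigma_min) * score


def privatize(update, score, clip_norm=1.0, rng=None):
    """Clip a client update to `clip_norm` and add Gaussian noise scaled by compliance."""
    rng = rng or random.Random(0)
    norm = math.sqrt(sum(x * x for x in update))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    sigma = noise_multiplier(score) * clip_norm
    return [x * scale + rng.gauss(0.0, sigma) for x in update]


if __name__ == "__main__":
    print(noise_multiplier(0.2), noise_multiplier(0.9))
    print(privatize([3.0, 4.0], score=0.9))
```

Under this assumed mapping a well-compliant institution receives less noise than a poorly compliant one, instead of every client paying the same uniform penalty.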
Article 28
Title@2025-05-28 (3): A Stochastic Approximation Approach for Efficient Decentralized Optimization on Random Networks
Title: A Stochastic Approximation Approach for Efficient Decentralized Optimization on Random Networks | Ein stochastischer Annäherungsansatz für eine effiziente dezentralisierte Optimierung von Random Networks | 随机网络高效分散优化优化的斯托卡接近方法 2410.18774v2 |
Authors (3): Chung-Yiu Yau, Haoming Liu, Hoi-To Wai
A challenging problem in decentralized optimization is to develop algorithms with fast convergence on random and time-varying topologies under unreliable and bandwidth-constrained communication networks. This paper studies a stochastic approximation approach with a Fully Stochastic Primal Dual Algorithm (FSPDA) framework. Our framework relies on a novel observation that randomness in time-varying topology can be incorporated in a stochastic augmented Lagrangian formulation, whose expected value admits saddle points that coincide with stationary solutions of the decentralized optimization problem. With the FSPDA framework, we develop two new algorithms supporting efficient sparsified communication on random time-varying topologies: FSPDA-SA allows agents to execute multiple local gradient steps depending on the time-varying topology to accelerate convergence, and FSPDA-STORM further incorporates a variance reduction step to improve sample complexity. For problems with a smooth (possibly non-convex) objective function, we show that within $T$ iterations FSPDA-SA (resp. FSPDA-STORM) finds an $\mathcal{O}(1/\sqrt{T})$-stationary (resp. $\mathcal{O}(1/T^{2/3})$-stationary) solution. Numerical experiments show the benefits of the FSPDA algorithms.
去中心化优化中的一个挑战性问题是,在不可靠且带宽受限的通信网络下,针对随机时变拓扑开发快速收敛的算法。本文研究一种随机逼近方法,提出全随机原始对偶算法(FSPDA)框架。该框架基于一个新的观察:时变拓扑中的随机性可以纳入随机增广拉格朗日形式,其期望值的鞍点与去中心化优化问题的平稳解一致。基于 FSPDA 框架,我们开发了两种支持在随机时变拓扑上进行高效稀疏化通信的新算法:FSPDA-SA 允许各智能体根据时变拓扑执行多次本地梯度步以加速收敛,FSPDA-STORM 进一步引入方差缩减步骤以改善样本复杂度。对于目标函数光滑(可能非凸)的问题,我们证明在 $T$ 次迭代内,FSPDA-SA(相应地,FSPDA-STORM)可找到 $\mathcal{O}(1/\sqrt{T})$-平稳(相应地,$\mathcal{O}(1/T^{2/3})$-平稳)解。数值实验显示了 FSPDA 算法的优势。
Article 29
Title@2025-05-28 (3): Towards Efficient Key-Value Cache Management for Prefix Prefilling in LLM Inference
Title: Towards Efficient Key-Value Cache Management for Prefix Prefilling in LLM Inference | Effizientes Key-Value-Cache-Management für die Präfixvorfüllung in LLM-Inferenz | 建立高效的键值缓存管理,用于在LLM 推理中预填前补全 2505.21919v1 |
Authors (5): Yue Zhu, Hao Yu, Chen Wang, Zhuoran Liu, Eun Kyung Lee
The increasing adoption of large language models (LLMs) with extended context windows necessitates efficient Key-Value Cache (KVC) management to optimize inference performance. Inference workloads like Retrieval-Augmented Generation (RAG) and agents exhibit high cache reusability, making efficient caching critical to reducing redundancy and improving speed. We analyze real-world KVC access patterns using publicly available traces and evaluate commercial key-value stores like Redis and state-of-the-art RDMA-based systems (CHIME [1] and Sherman [2]) for KVC metadata management. Our work demonstrates the lack of a tailored storage solution for KVC prefilling, underscores the need for an efficient distributed caching system with optimized metadata management for LLM workloads, and provides insights into designing improved KVC management systems for scalable, low-latency inference.
由于越来越多地采用大语言模型(LLMs),加上上下文窗口的扩展,必须进行高效的Ky-Value Cache(KVC)管理,以优化推算性能; 诸如Retrerelieval-Auged Generation(RAG)和代理商等推论工作量显示出较高的缓存可重现性,对减少冗余和提高速度至关重要; 我们利用公开提供的痕迹分析真实世界KVC的准入模式,并评估商业关键价值商店,如Redis和KVC元数据管理的最新RDMA系统(CHIME [1]和Sherman[2])。我们的工作表明,缺乏针对KVC预填的量制储存解决方案,强调需要高效的分布式缓存系统,为LLM工作量提供优化的元管理,并为设计更完善的KVC管理系统提供见解,以适应可伸缩的、低延度的推断。
Article 30
Title@2025-05-28 (3): Joint$λ$: Orchestrating Serverless Workflows on Jointcloud FaaS Systems
Title: Joint$λ$: Orchestrating Serverless Workflows on Jointcloud FaaS Systems | Joint$λ$: Orchestrierung serverloser Workflows auf Jointcloud FaaS-Systemen | Joint$λ$:在 Jointcloud FaaS 系统上编排无服务器工作流 2505.21899v1 |
Authors (6): Jianfei Liu, Rui Li, Zhilin Yang, Peichang Shi, Guodong Yi, Huaimin Wang
Existing serverless workflow orchestration systems are predominantly designed for a single-cloud FaaS system, leading to vendor lock-in. This restricts performance optimization, cost reduction, and availability of applications. However, orchestrating serverless workflows on Jointcloud FaaS systems faces two main challenges: 1) additional overhead caused by centralized cross-cloud orchestration; and 2) a lack of reliable failover and fault-tolerant mechanisms for cross-cloud serverless workflows. To address these challenges, we propose Joint$\lambda$, a distributed runtime system designed to orchestrate serverless workflows on multiple FaaS systems without relying on a centralized orchestrator. Joint$\lambda$ introduces a compatibility layer, Backend-Shim, leveraging inter-cloud heterogeneity to optimize makespan and reduce costs with on-demand billing. By using function-side orchestration instead of centralized nodes, it enables independent function invocations and data transfers, reducing cross-cloud communication overhead. For high availability, it ensures exactly-once execution via datastores and failover mechanisms for serverless workflows on Jointcloud FaaS systems. We validate Joint$\lambda$ on two heterogeneous FaaS systems, AWS and ALiYun, with four workflows. Compared to the most advanced commercial orchestration services for single-cloud serverless workflows, Joint$\lambda$ reduces latency by up to 3.3$\times$ and cost by up to 65\%. Joint$\lambda$ is also up to 4.0$\times$ faster than the state-of-the-art orchestrators for cross-cloud serverless workflows, reduces cost by up to 4.5$\times$, and provides strong execution guarantees.
无服务器的现有工作流程管弦系统主要是为单球FaaS系统设计的,导致供应商锁定。这限制了性能优化、降低成本和应用程序的可用性。然而,在Unitecloud FaaS系统上设置无服务器的工作流程面临两大挑战:(1) 中央交叉球管弦造成额外管理费用;(2) 交叉球型服务器工作流程缺乏可靠的故障和容错机制。为了应对这些挑战,我们提议使用联合美元(lambda$),这是一个分配运行时间系统,目的是在不依赖中央管弦师的情况下,在多个FaS系统上配置无服务器的工作流程。联合美元(lam) 使用跨球管弦管弦和防故障机制来优化空隙和降低按需计费的成本。 通过使用功能边调而不是中央节点,它使得独立运行和数据传输功能(lodoub) 降低成本(lodoud) 通信管理费。为了高可用性,它也确保通过数据储存和故障机制,在联合系统上,通过联合的服务器平流路基SLOLODLO(O) 将固定成本(MA) 4) 联合系统降低。
Article 31
Title@2025-05-28 (3): Hybrid Batch Normalisation: Resolving the Dilemma of Batch Normalisation in Federated Learning
Title: Hybrid Batch Normalisation: Resolving the Dilemma of Batch Normalisation in Federated Learning | Hybride Batch-Normalisierung: Lösung des Dilemmas der Batch-Normalisierung im Federated Learning | 混合批次正常化:解决联邦学习中批次正常化的难题 2505.21877v1 |
Authors (4): Hongyao Chen, Tianyang Xu, Xiaojun Wu, Josef Kittler
Batch Normalisation (BN) is widely used in conventional deep neural network training to harmonise the input-output distributions for each batch of data. However, federated learning, a distributed learning paradigm, faces the challenge of dealing with non-independent and identically distributed data among the client nodes. Due to the lack of a coherent methodology for updating BN statistical parameters, standard BN degrades the federated learning performance. To this end, it is urgent to explore an alternative normalisation solution for federated learning. In this work, we resolve the dilemma of the BN layer in federated learning by developing a customised normalisation approach, Hybrid Batch Normalisation (HBN). HBN separates the update of statistical parameters (i.e., means and variances used for evaluation) from that of learnable parameters (i.e., parameters that require gradient updates), obtaining unbiased estimates of global statistical parameters in distributed scenarios. In contrast with the existing solutions, we emphasise the supportive power of global statistics for federated learning. The HBN layer introduces a learnable hybrid distribution factor, allowing each computing node to adaptively mix the statistical parameters of the current batch with the global statistics. Our HBN can serve as a powerful plugin to advance federated learning performance. It reflects promising merits across a wide range of federated learning settings, especially for small batch sizes and heterogeneous data.
常规深层神经网络培训广泛使用批量正常化(BN),以统一每批数据的输入-输出分布;然而,联邦学习,即分散式学习模式,面临着处理客户节点之间非独立和相同分布的数据的挑战。由于缺乏一个一致的更新BN统计参数的方法,标准的BN降低了联邦学习绩效。为此,迫切需要探索一种联邦学习的替代正常化解决方案。在这项工作中,我们通过制定定制的标准化方法,即混合批量正常化(HBN)来解决BN层在联结学习中的两难困境。HBN将统计参数(即用于评价的手段和差异)的更新与可学习参数(即需要梯度更新的参数)的更新分开,获得对分布式假设中全球统计参数的公正估计。与现有解决方案相比,我们强调全球统计对联邦学习的支持力。HBNBN层引入了一个可学习的混合分配系数,允许每项计算方法的零位混合批量和差异性评估,从而能够反映我们不断调整的联邦阶段统计质量。
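The mixing of batch and global statistics described above can be sketched for a single feature. This is a minimal illustration under an assumed convex-combination rule with a scalar factor `alpha`; the abstract does not specify the exact form of the hybrid distribution factor, and in a real layer `alpha` would be a trained parameter rather than an argument.

```python
import math


def hybrid_batch_norm(batch, global_mean, global_var, alpha, eps=1e-5):
    """Normalise one feature using a mix of batch and global statistics.

    alpha = 1.0 reduces to plain batch statistics, alpha = 0.0 to purely
    global statistics. The convex mix is an assumption for illustration.
    """
    n = len(batch)
    batch_mean = sum(batch) / n
    batch_var = sum((x - batch_mean) ** 2 for x in batch) / n
    mean = alpha * batch_mean + (1.0 - alpha) * global_mean
    var = alpha * batch_var + (1.0 - alpha) * global_var
    return [(x - mean) / math.sqrt(var + eps) for x in batch]


if __name__ == "__main__":
    local_batch = [4.0, 6.0]  # small, skewed client batch (made-up values)
    print(hybrid_batch_norm(local_batch, global_mean=0.0, global_var=1.0, alpha=0.3))
```

With a small batch, a low `alpha` leans on the more reliable global statistics, which matches the abstract's remark that the benefit is largest for small batch sizes and heterogeneous data.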
Article 32
Title@2025-05-28 (3): gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling
Title: gLLM: Global Balanced Pipeline Parallelism System for Distributed LLM Serving with Token Throttling | gLLM: Global Balanced Pipeline Parallelism System für verteiltes LLM Serving mit Token Throttling | gLLM:面向分布式LLM服务的全局均衡流水线并行系统与令牌节流 2504.14775v2 |
Authors (6): Tianyu Guo, Xianwei Zhang, Jiangsu Du, Zhiguang Chen, Nong Xiao, Yutong Lu
Pipeline parallelism has emerged as a predominant approach for deploying large language models (LLMs) across distributed nodes, owing to its lower communication overhead compared to tensor parallelism. While demonstrating high throughput in request serving, pipeline parallelism often suffers from performance limitations caused by pipeline bubbles, which primarily result from imbalanced computation delays across batches. Existing methods like Sarathi-Serve attempt to address this through hybrid scheduling of chunked prefill and decode tokens using a fixed token budget. However, such methods may experience significant fluctuations due to either insufficient prefill tokens or uneven distribution of decode tokens, ultimately leading to computational imbalance. To overcome these inefficiencies, we present gLLM, a globally balanced pipeline parallelism system incorporating Token Throttling to effectively mitigate the pipeline bubbles. Our Token Throttling mechanism is a fine-grained scheduling policy that independently regulates the quantities of prefill and decode tokens, thus enabling balanced computation by leveraging global information from the inference system. Specifically, for decode tokens, gLLM maintains a near-consistent token count across processing batches. For prefill tokens, it dynamically adjusts batch sizes based on both total pending tokens and the memory utilization rates of the key-value cache (KV cache). Furthermore, the gLLM runtime adopts an asynchronous execution and message passing architecture specifically optimized for pipeline parallelism characteristics. Experimental evaluations with representative LLMs show that gLLM achieves significant performance improvements, delivering 11% to 398% higher maximum throughput compared to state-of-the-art pipeline or tensor parallelism systems, while simultaneously maintaining lower latency.
在分布式节点上部署大型语言模型(LLMs)的主要方法已经出现管道平行,原因是其通信管理管理费用比高,且与超强平行性能相比,通信管理费用较低。虽然管道平行性能在请求服务中显示出很高的输送量,但管道平行性能往往由于管道泡沫造成的业绩限制,这主要是由于各批量的计算延误不平衡造成的。Sarathi-Serve等现有方法试图通过混合列表,使用固定的象征性预算来解决这个问题。然而,这种方法可能由于通信管理费用低于预填标牌或解码物的分布不均而出现大幅波动,最终导致计算不平衡。为了克服这些低效率,我们展示了全球平衡的管道平行平行平行并行性系统,其中包括Token Troottling 气泡,这主要是因为各批量的计算不均匀性,我们Token Troott-Servey 机制是一个精细的时间安排政策,它独立调节预填和解码代号数量,从而通过调全球信息系统进行均衡的计算。具体来说,GLLLMSignsmartments,GM的尺寸在接近一致总总和整个处理中保持接近一致性标数的计算。对于总总值的计算中,对于高度的比值的比值,而显示。
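The two throttling rules in the abstract (an even decode token count per batch, and a prefill budget driven by pending tokens and KV cache utilization) can be sketched as two small functions. Every formula and constant below (`spread_iters`, `min_tokens`, `max_tokens`, `kv_threshold`) is a hypothetical choice for illustration and is not taken from gLLM.

```python
import math


def decode_tokens_per_batch(active_decode_seqs, pipeline_stages):
    """Spread active decode sequences evenly over the in-flight micro-batches."""
    return math.ceil(active_decode_seqs / pipeline_stages)


def prefill_token_budget(pending_tokens, kv_utilization, spread_iters=4,
                         min_tokens=32, max_tokens=2048, kv_threshold=0.9):
    """Assumed rule: admit fewer prefill tokens as the KV cache fills up.

    The backlog is spread over `spread_iters` iterations, capped by a
    memory-dependent limit, and prefill stops entirely once utilization
    reaches `kv_threshold`.
    """
    if pending_tokens == 0 or kv_utilization >= kv_threshold:
        return 0
    by_backlog = math.ceil(pending_tokens / spread_iters)
    headroom = 1.0 - kv_utilization / kv_threshold
    by_memory = int(max_tokens * headroom)
    budget = max(min_tokens, min(by_backlog, by_memory, max_tokens))
    return min(pending_tokens, budget)


if __name__ == "__main__":
    print(decode_tokens_per_batch(10, 4))
    print(prefill_token_budget(4000, 0.0), prefill_token_budget(4000, 0.8))
```

The point of regulating the two quantities independently is that a burst of prefill work no longer starves or inflates the decode share of any single micro-batch, which is what produces pipeline bubbles under a fixed combined budget.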
Article 33
Title@2025-05-27 (2): Empowering Scientific Workflows with Federated Agents
Title: Empowering Scientific Workflows with Federated Agents | Stärkung wissenschaftlicher Workflows mit Federated Agents | 利用联邦智能体赋能科学工作流 2505.05428v2 |
Authors (6): J. Gregory Pauloski, Yadu Babuji, Ryan Chard, Mansi Sakarvadia, Kyle Chard, Ian Foster
Agentic systems, in which diverse agents cooperate to tackle challenging problems, are exploding in popularity in the AI community. However, the agentic frameworks used to build these systems have not previously enabled use with research cyberinfrastructure. Here we introduce Academy, a modular and extensible middleware designed to deploy autonomous agents across the federated research ecosystem, including HPC systems, experimental facilities, and data repositories. To meet the demands of scientific computing, Academy supports asynchronous execution, heterogeneous resources, high-throughput data flows, and dynamic resource availability. It provides abstractions for expressing stateful agents, managing inter-agent coordination, and integrating computation with experimental control. We present microbenchmark results that demonstrate high performance and scalability in HPC environments. To demonstrate the breadth of applications that can be supported by agentic workflow designs, we also present case studies in materials discovery, decentralized learning, and information extraction in which agents are deployed across diverse HPC systems.
各种代理人合作解决具有挑战性的问题的代理系统在AI社区中正在兴起。然而,用于建立这些系统的代理框架以前还没有能够用于研究网络基础设施。在这里,我们介绍了《教程》,这是一个模块和可扩展的中间软件,旨在在整个联合研究生态系统中部署自主代理人,包括HPC系统、实验设施和数据储存库。为了满足科学计算的需求,学院支持不同步地执行、多种资源、高通量数据流和动态资源可用性。它为表达国家代理人、管理机构间协调以及将计算与实验控制相结合提供了抽象信息。我们提出了显出高性能和可伸缩性的微小标准结果。为了展示能够得到代理工作流程设计支持的应用的广度,我们还介绍了材料发现、分散学习和信息提取方面的案例研究,其中将各种代理人部署在不同的HPC系统。
Article 34
Title@2025-05-27 (2): LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models
Title: LV-XAttn: Distributed Cross-Attention for Long Visual Inputs in Multimodal Large Language Models | LV-XAttn: Verteilte Cross-Attention für lange visuelle Eingänge in multimodalen großen Sprachmodellen | LV-XAttn:多式大语言模型中长视输入分布式交叉注意 2502.02406v3 |
Authors (2): Tzu-Tao Chang, Shivaram Venkataraman
Cross-attention is commonly adopted in multimodal large language models (MLLMs) for integrating visual information into the language backbone. However, in applications with large visual inputs, such as video understanding, processing a large number of visual tokens in cross-attention layers leads to high memory demands and often necessitates distributed computation across multiple GPUs. Existing distributed attention mechanisms face significant communication overheads, making cross-attention layers a critical bottleneck for efficient training and inference of MLLMs. To address this, we propose LV-XAttn, a distributed, exact cross-attention mechanism with minimal communication overhead. We observe that in applications involving large visual inputs, the size of the query block is typically much smaller than that of the key-value blocks. Thus, in LV-XAttn we keep the large key-value blocks locally on each GPU and exchange smaller query blocks across GPUs. We also introduce an efficient activation recomputation technique to support longer visual context. We theoretically analyze the communication benefits of LV-XAttn and show that it can achieve speedups for a wide range of models. Our evaluations with Llama 3-V, mPLUG-Owl3 and OpenFlamingo models find that LV-XAttn achieves up to 10.62$\times$ end-to-end speedup compared to existing approaches.
将视觉信息纳入语言主干网的多式大型语言模型(MLLMM)通常采用交叉关注模式(MLLM),将视觉信息纳入语言主干网。然而,在具有大量视觉投入的应用程序中,如视频理解,处理大量跨注意层的视觉标志,导致大量记忆需求,而且往往需要在多个GPU中进行分配计算。现有分布式关注机制面临巨大的通信间接费用,使跨注意层成为高效培训和推断MLLMS的关键瓶颈。为了解决这一问题,我们建议使用分布式的、准确的交叉关注机制LV-XAttn,一个分布式的分散式、通信管理机制。我们观察到,在涉及大量视觉投入的应用程序中,查询区块的大小通常比关键值区块的大小要小得多。因此,在LV-XAtttt中,我们将大型关键值块块保留在每一个GPU上,并在GPLlam和G之间交换较小的查询区块。我们还引入一种有效的激活性建议技术,以支持更长的视觉背景环境。我们从理论上分析LV-XAt-Xttn的传播的好处,并显示它可以实现一个开式的快速速度到Fxxxx的模型。
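Why the attention stays exact when key-value blocks remain local and only the query block travels can be seen from the standard softmax-merging identity. The sketch below is a single-process NumPy illustration, not the LV-XAttn implementation: the loop over `kv_shards` stands in for a query block visiting each GPU, and all function names are hypothetical.

```python
import numpy as np


def partial_attention(q, k, v):
    """Unnormalised attention of q over one local KV shard, plus softmax statistics."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    row_max = scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores - row_max)
    return weights @ v, row_max, weights.sum(axis=-1, keepdims=True)


def merge(acc, part):
    """Combine two partial results by rescaling both to a shared row maximum."""
    out1, max1, sum1 = acc
    out2, max2, sum2 = part
    new_max = np.maximum(max1, max2)
    a1, a2 = np.exp(max1 - new_max), np.exp(max2 - new_max)
    return out1 * a1 + out2 * a2, new_max, sum1 * a1 + sum2 * a2


def distributed_cross_attention(q, kv_shards):
    """The query block visits every shard; K and V never leave their worker."""
    acc = None
    for k, v in kv_shards:
        part = partial_attention(q, k, v)
        acc = part if acc is None else merge(acc, part)
    out, _, denom = acc
    return out / denom


def reference_attention(q, k, v):
    """Plain single-device softmax attention, used only for comparison."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = rng.normal(size=(4, 8))  # small query block
    k, v = rng.normal(size=(30, 8)), rng.normal(size=(30, 8))  # large KV
    shards = [(k[i:i + 10], v[i:i + 10]) for i in range(0, 30, 10)]
    print(np.allclose(distributed_cross_attention(q, shards), reference_attention(q, k, v)))
```

Only the query block and its three running statistics cross worker boundaries here, so the traffic scales with the small query side rather than with the much larger visual key-value side.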
Article 35
Title@2025-05-27 (2): Power-Capping Metric Evaluation for Improving Energy Efficiency
Title: Power-Capping Metric Evaluation for Improving Energy Efficiency | Power-Capping-Metrikbewertung zur Verbesserung der Energieeffizienz | 面向能效提升的功率封顶指标评估 2505.21758v1 |
Authors (7): Maria Patrou, Thomas Wang, Wael Elwasif, Markus Eisenbach, Ross Miller, William Godoy, Oscar Hernandez
With high-performance computing systems now running at exascale, optimizing power-scaling management and resource utilization has become more critical than ever. This paper explores runtime power-capping optimizations that leverage integrated CPU-GPU power management on architectures like the NVIDIA GH200 superchip. We evaluate energy-performance metrics that account for simultaneous CPU and GPU power-capping effects by using two complementary approaches: speedup-energy-delay and a Euclidean distance-based multi-objective optimization method. By targeting a mostly compute-bound exascale science application, the Locally Self-Consistent Multiple Scattering (LSMS), we explore challenging scenarios to identify potential opportunities for energy savings in exascale applications, and we recognize that even modest reductions in energy consumption can have significant overall impacts. Our results highlight how GPU task-specific dynamic power-cap adjustments combined with integrated CPU-GPU power steering can improve the energy utilization of certain GPU tasks, thereby laying the groundwork for future adaptive optimization strategies.
由于高性能计算系统目前处于伸缩状态,优化电力扩缩管理和资源利用已变得比以往更加关键。本文件探索了运行时间的电力拉动优化,在像NVIDIA GH200超级芯片这样的建筑上利用CPU-GPU电力管理进行综合CPU-GPU优化。我们通过使用两种互补方法,评估了同时产生CPU和GPU动力拉动效应的能源性能衡量标准:加速能源拉动和以Euclidean远程为基础的多目标优化方法。通过针对一个大部分可计算到的扩展性科学应用,即本地自控多散射(LSMS),我们探索了具有挑战性的情景,以确定在大规模应用中节能的潜在机会,我们认识到即使能源消耗略有减少也会产生重大的总体影响。我们的结果突出表明,GPU的具体任务动态电动能上限调整与CPU-GPU电力指导相结合,可以如何改善某些GUPU任务的能源利用,从而为未来的适应性优化战略打下基础。
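A Euclidean distance-based selection of a power cap can be sketched as follows. This is a generic multi-objective illustration, not the paper's exact metric: runtime and energy are min-max normalised and the cap closest to the ideal point (0, 0) is chosen. The function name and the sample measurements are made up for illustration.

```python
import math


def best_power_cap(measurements):
    """Pick the power cap whose (runtime, energy) pair is closest to the ideal point.

    measurements maps a cap in watts to (runtime_seconds, energy_joules).
    Both objectives are min-max normalised so neither dominates by scale.
    """
    times = [t for t, _ in measurements.values()]
    energies = [e for _, e in measurements.values()]

    def normalise(x, xs):
        lo, hi = min(xs), max(xs)
        return 0.0 if hi == lo else (x - lo) / (hi - lo)

    def distance(item):
        runtime, energy = item[1]
        return math.hypot(normalise(runtime, times), normalise(energy, energies))

    return min(measurements.items(), key=distance)[0]


if __name__ == "__main__":
    # Illustrative numbers only, not measured data.
    runs = {900: (100.0, 50_000.0), 700: (105.0, 42_000.0), 500: (130.0, 40_000.0)}
    print(best_power_cap(runs))
```

In this made-up example the uncapped run is fastest and the lowest cap uses the least energy, but the middle cap gives up little on either axis, which is the kind of trade-off a distance-to-ideal metric is meant to surface.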
Article 36
Title@2025-05-27 (2): FedCostAware: Enabling Cost-Aware Federated Learning on the Cloud
Title: FedCostAware: Enabling Cost-Aware Federated Learning on the Cloud | FedCostAware: Kostenbewusstes Federated Learning in der Cloud ermöglichen | FedCostAware:在云上实现成本感知的联邦学习 2505.21727v1 |
Authors (6): Aditya Sinha, Zilinghan Li, Tingkai Liu, Volodymyr Kindratenko, Kibaek Kim, Ravi Madduri
Federated learning (FL) is a distributed machine learning (ML) approach that allows multiple clients to collaboratively train an ML model without exchanging their original training data, offering a solution that is particularly valuable in sensitive domains such as biomedicine. However, training robust FL models often requires substantial computing resources from participating clients, such as GPUs, which may not be readily available at institutions such as hospitals. While cloud platforms (e.g., AWS) offer on-demand access to such resources, their usage can incur significant costs, particularly in distributed training scenarios where poor coordination strategies can lead to substantial resource wastage. To address this, we introduce FedCostAware, a cost-aware scheduling algorithm designed to optimize synchronous FL on cloud spot instances. FedCostAware addresses the challenges of training on spot instances and different client budgets by employing intelligent management of the lifecycle of spot instances. This approach minimizes resource idle time and overall expenses. Comprehensive experiments across multiple datasets demonstrate that FedCostAware significantly reduces cloud computing costs compared to conventional spot and on-demand schemes, enhancing the accessibility and affordability of FL.
联邦学习(FL)是一种分布式机器学习(ML)方法,它使多个客户能够在不交换原始培训数据的情况下合作培训ML模型,提供在生物医学等敏感领域特别有价值的解决办法;然而,培训稳健的FL模型往往需要参与客户提供大量计算资源,如医院等机构可能无法随时获得的GPU;虽然云层平台(如AWS)提供按需获取此类资源的机会,但其使用可能会产生大量费用,特别是在协调战略不善可能导致大量资源浪费的分布式培训情景中。为了解决这个问题,我们引入了FedCostAware(FedCostAware),这是一种成本意识的排期算法,目的是在云端场中优化同步FL(FL)的频率。FFCostAware(FedCostAware)应对现场培训和不同客户预算的挑战,办法是对现场情况采用智能管理,最大限度地减少资源闲置时间和总体费用。在多个数据集中的全面实验表明,FedCostAware与常规点和按需计划相比,大大降低云计算费用,提高FL的可及可承受性。
Article 37
Title@2025-05-27 (2): AMSFL: Adaptive Multi-Step Federated Learning via Gradient Difference-Based Error Modeling
Title: AMSFL: Adaptive Multi-Step Federated Learning via Gradient Difference-Based Error Modeling | AMSFL: Adaptives Multi-Step-Federated Learning über gradient Difference-based Error Modeling | ASFL:通过基于差异的渐进错误建模进行适应性多阶段联邦学习 2505.21695v1 |
Authors (1): Ganglou Xu
Federated learning faces critical challenges in balancing communication efficiency and model accuracy. One key issue lies in the approximation of update errors without incurring high computational costs. In this paper, we propose a lightweight yet effective method called Gradient Difference Approximation (GDA), which leverages first-order information to estimate local error trends without computing the full Hessian matrix. The proposed method forms a key component of the Adaptive Multi-Step Federated Learning (AMSFL) framework and provides a unified error modeling strategy for large-scale multi-step adaptive training environments.
联邦学习在平衡通信效率和模型准确性方面面临着重大挑战。一个关键问题在于更新误差的近似而不会产生高昂的计算成本。在本文件中,我们提议了一种轻量、但有效的方法,称为“渐进差异匹配(GDA) ” (GDA),该方法利用第一阶信息来估计当地误差趋势,而不必计算完整的赫西安矩阵。该拟议方法构成了适应性多系统联邦学习(AMSFL)框架的一个关键组成部分,并为大规模多阶段适应性培训环境提供了统一的误差建模战略。
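As a rough illustration of the gradient-difference idea in the abstract above, the sketch below assumes that "first-order information" means differencing two gradient evaluations to approximate a Hessian-vector product, H d ≈ ∇f(w + d) − ∇f(w). This reading, the toy quadratic objective, and the names `grad` and `gradient_difference` are assumptions made for illustration; they are not taken from the paper.

```python
# Hypothetical sketch: estimate the Hessian-vector product H @ d from two
# gradient evaluations, so the full Hessian matrix is never formed.

A = [[3.0, 1.0], [1.0, 2.0]]  # assumed symmetric Hessian of a toy quadratic
b = [1.0, -1.0]


def grad(w):
    """Gradient of f(w) = 0.5 * w^T A w - b^T w, which is A w - b."""
    return [sum(A[i][j] * w[j] for j in range(len(w))) - b[i]
            for i in range(len(w))]


def gradient_difference(w, d):
    """First-order estimate of H @ d: grad(w + d) - grad(w)."""
    shifted = [wi + di for wi, di in zip(w, d)]
    return [g1 - g0 for g1, g0 in zip(grad(shifted), grad(w))]


w = [0.5, -0.25]
d = [0.1, 0.2]
estimate = gradient_difference(w, d)
# Reference value computed with the explicit Hessian, for comparison only.
exact = [sum(A[i][j] * d[j] for j in range(2)) for i in range(2)]
```

For a quadratic objective the estimate matches the exact product up to rounding; for a general loss it would only be an approximation whose error grows with the step `d`.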
Article 38
Title@2025-05-27 (2): Incentivizing Permissionless Distributed Learning of LLMs
Title: Incentivizing Permissionless Distributed Learning of LLMs | Anreize für das unbefugte Lernen von LLMs | 激励对LLMM的无自由分配的学习 2505.21684v1 |
Authors (6): Joel Lidin, Amir Sarfi, Evangelos Pappas, Samuel Dare, Eugene Belilovsky, Jacob Steeves
We describe an incentive system for distributed deep learning of foundational models where peers are rewarded for contributions. The incentive system, Gauntlet, has been deployed on the Bittensor blockchain and used to train a 1.2B LLM with completely permissionless contributions of pseudo-gradients: no control over the users that can register or their hardware. Gauntlet can be applied to any synchronous distributed training scheme that relies on aggregating updates or pseudo-gradients. We rely on a two-stage mechanism for fast filtering of peer uptime, reliability, and synchronization, combined with the core component that estimates the loss before and after individual pseudo-gradient contributions. We utilized an OpenSkill rating system to track competitiveness of pseudo-gradient scores across time. Finally, we introduce a novel mechanism to ensure peers on the network perform unique computations. Our live 1.2B run, which has paid out real-valued tokens to participants based on the value of their contributions, yielded a competitive (on a per-iteration basis) 1.2B model that demonstrates the utility of our incentive system.
我们描述一个激励制度,用于对同行获得捐款奖励的基础模型进行分布式深入学习。激励制度(\ textit{Gadoclet})已经部署在比特纳链条上,并用于培训1.2BLM,使用假梯子完全无许可证的贡献:对能够注册的用户或其硬件没有控制权。\ textit{Gauntlet}可以应用到任何同步分布式的培训计划,这种计划依赖于汇总更新或假梯度。我们依靠一个两阶段机制快速过滤同侪的更新时间、可靠性和同步性,加上估算个人假梯度捐款之前和之后损失的核心部分。我们利用一个开放技能评级制度跟踪假分数的竞争力,最后,我们引入了一个新机制,确保网络的同行进行独特的计算。我们的1.2B运行现场运行,根据参与者的捐款价值向参与者支付实值的标牌,产生了一个(按百分比计算的)1.2B模型,展示了我们激励系统的实用性。
Article 39
Title@2025-05-27 (2): KPerfIR: Towards an Open and Compiler-centric Ecosystem for GPU Kernel Performance Tooling on Modern AI Workloads
Title: KPerfIR: Towards an Open and Compiler-centric Ecosystem for GPU Kernel Performance Tooling on Modern AI Workloads | KPerfIR: Auf dem Weg zu einem offenen und kompilerzentrierten Ökosystem für GPU-Kernel Performance Tooling auf modernen KI-Workloads | KPerfIR:努力建立一个开放的、以编纂者为中心的生态系统,用于在现代AI 工作负荷上使用 GPU 内核性能工具 2505.21661v1 |
Authors (8): Yue Guan, Yuanwei Fang, Keren Zhou, Corbin Robeck, Manman Ren, Zhongkai Yu, Yufei Ding, Adnan Aziz
In this work, we propose KPerfIR, a novel multilevel compiler-centric infrastructure to enable the development of customizable, extendable, and portable profiling tools tailored for modern artificial intelligence (AI) workloads on modern GPUs. Our approach integrates profiling capabilities directly into the compiler workflow, allowing profiling functionalities to be implemented as compiler passes, offering a programmable and reusable framework for performance analysis. This design bridges the gap between compilers and profilers, enabling fine-grained insights into complex optimization challenges such as overlapping the execution of fine-grained function units on GPUs. KPerfIR is integrated into the Triton infrastructure to highlight the power of a compiler-centric approach to advance performance analysis and optimization in the ever-evolving landscape of AI compilers. Our evaluation shows that our tool incurs low overhead (8.2%), provides accurate measurements (2% relative error), and delivers actionable insights into complicated GPU intra-kernel optimizations.
在这项工作中,我们提出了KPerfIR,这是一个新的多级编译器中心基础设施,用于开发适合现代GPU的现代人工智能工作量的可定制、可扩展和便携式剖析工具。我们的方法将剖析能力直接纳入汇编者工作流程,允许将剖析功能作为编译者通行证加以实施,为绩效分析提供一个可编程和可重复使用的框架。这个设计可以弥合编译者和剖面设计器之间的差距,从而能够对复杂的优化挑战进行精细的洞察,例如,在GPUs上执行精细裁的功能单位重叠。 KPerfIR被整合到特里顿基础设施中,以突出以编纂者为中心的方法在不断演变的AI汇编者环境中推进绩效分析和优化的能力。我们的评估表明,我们的工具拥有低的间接费用(8.2%),提供准确的测量(2%相对错误),并提供了复杂的GPU内核优化的可操作的洞察力。
Article 40
Title@2025-05-27 (2): Fast and Cost-effective Speculative Edge-Cloud Decoding with Early Exits
Title: Fast and Cost-effective Speculative Edge-Cloud Decoding with Early Exits | Schnelle und kostengünstige spekulative Edge-Cloud-Dekodierung mit Early Exits | 快速和成本效益高的投机性边缘-封闭式排污与早期出口 2505.21594v1 |
Authors (3): Yeshwanth Venkatesha, Souvik Kundu, Priyadarshini Panda
Large Language Models (LLMs) enable various applications on edge devices such as smartphones, wearables, and embodied robots. However, their deployment often depends on expensive cloud-based APIs, creating high operational costs, which limit access for smaller organizations and raise sustainability concerns. Certain LLMs can be deployed on-device, offering a cost-effective solution with reduced latency and improved privacy. Yet, limited computing resources constrain the size and accuracy of models that can be deployed, necessitating a collaborative design between edge and cloud. We propose a fast and cost-effective speculative edge-cloud decoding framework with a large target model on the server and a small draft model on the device. By introducing early exits in the target model, tokens are generated mid-verification, allowing the client to preemptively draft subsequent tokens before final verification, thus utilizing idle time and enhancing parallelism between edge and cloud. Using an NVIDIA Jetson Nano (client) and an A100 GPU (server) with Vicuna-68M (draft) and Llama2-7B (target) models, our method achieves up to a 35% reduction in latency compared to cloud-based autoregressive decoding, with an additional 11% improvement from preemptive drafting. To demonstrate real-world applicability, we deploy our method on the Unitree Go2 quadruped robot using Vision-Language Model (VLM) based control, achieving a 21% speedup over traditional cloud-based autoregressive decoding. These results demonstrate the potential of our framework for real-time LLM and VLM applications on resource-constrained edge devices.
大型语言模型(LLMS) 使智能手机、可磨损机和装有机器人等边缘装置的各种应用得以应用。 但是,它们的部署往往取决于昂贵的云基API, 造成高操作成本,限制较小组织的准入,并引起可持续性问题。 某些LLMS可以安装在设计上, 提供成本有效的解决方案, 减少潜伏, 改善隐私。 然而, 有限的计算资源限制了可以部署的模型的规模和准确性, 需要在边缘和云之间进行协作设计。 我们提议在服务器上建立一个具有大型目标模型的快速且成本效益的投机性极地分层脱色框架, 并在设备上建立一个小型的基于云的云性应用模型。 通过在目标模型中引入早期退出, 令客户在最后核查之前先草率地起草后续标语, 从而利用闲置的时间,加强边缘和云层之间的平行关系。 使用一个基于 Vickuna- 68M (草稿) 和 Llam2-7B (目标) 的潜在应用模型, 我们的方法在模型中实现了基于真实定位M 的升级, 将OliverM- develop- development 的模型转化为 演示演示, 演示了我们将Oliver- drodu- drodegradu
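The draft-then-verify loop that the abstract above builds on can be sketched as follows. This is a minimal greedy version under stated assumptions: `target_next` and `draft_next` are hypothetical stand-ins for the large server model and the small on-device model, verification is written as sequential calls rather than one batched pass, and the early-exit and preemptive-drafting mechanisms of the paper are not modeled.

```python
def speculative_decode(target_next, draft_next, prompt, num_tokens, k=4):
    """Greedy speculative decoding with toy next-token functions.

    The draft model proposes k tokens; the target model accepts the longest
    matching prefix and supplies a correction at the first mismatch.
    """
    out = list(prompt)
    while len(out) - len(prompt) < num_tokens:
        # Draft phase: the small model proposes k tokens autoregressively.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # Verify phase: compare each proposal with the target's greedy choice.
        accepted = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # target's correction ends the round
                break
        out.extend(accepted)
    return out[:len(prompt) + num_tokens]


def toy_target(seq):
    """Hypothetical target model: always emits the next digit modulo 10."""
    return (seq[-1] + 1) % 10


def toy_draft(seq):
    """Hypothetical draft model: agrees with the target except at some lengths."""
    return (seq[-1] + 1) % 10 if len(seq) % 3 else 0


decoded = speculative_decode(toy_target, toy_draft, [7], num_tokens=12)
```

Because every accepted token is checked against the target model, the output equals plain greedy decoding with the target; any speedup comes only from how many draft tokens are accepted per round.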
Article 41
Title@2025-05-27 (2): Distributed Discrete Morse Sandwich: Efficient Computation of Persistence Diagrams for Massive Scalar Data
Title: Distributed Discrete Morse Sandwich: Efficient Computation of Persistence Diagrams for Massive Scalar Data | Distributed Diskrete Morse Sandwich: Effiziente Berechnung von Persistenzdiagrammen für massive Scalardaten | 分布式分散的莫尔斯桑威奇:有效计算大规模卡路里数据持久性图图 2505.21266v1 |
Authors (3): Eve Le Guillou, Pierre Fortin, Julien Tierny
The persistence diagram, which describes the topological features of a dataset, is a key descriptor in Topological Data Analysis. The “Discrete Morse Sandwich” (DMS) method has been reported to be the most efficient algorithm for computing persistence diagrams of 3D scalar fields on a single node, using shared-memory parallelism. In this work, we extend DMS to distributed-memory parallelism for the efficient and scalable computation of persistence diagrams for massive datasets across multiple compute nodes. On the one hand, we can leverage the embarrassingly parallel procedure of the first and most time-consuming step of DMS (namely the discrete gradient computation). On the other hand, the efficient distributed computation of the subsequent DMS steps is much more challenging. To address this, we have extensively revised the DMS routines by contributing a new self-correcting distributed pairing algorithm, redesigning key data structures and introducing computation tokens to coordinate distributed computations. We have also introduced a dedicated communication thread to overlap communication and computation. Detailed performance analyses show the scalability of our hybrid MPI+thread approach for strong and weak scaling using up to 16 nodes of 32 cores (512 cores total). Our algorithm outperforms DIPHA, a reference method for the distributed computation of persistence diagrams, with an average speedup of 8x on 512 cores. We show the practical capabilities of our approach by computing the persistence diagram of a public 3D scalar field of 6 billion vertices in 174 seconds on 512 cores. Finally, we provide a usage example of our open-source implementation at https://github.com/eve-le-guillou/DDMS-example.
描述数据集地形特征的持久性图是地形数据分析中的一个关键描述符。 据报道, “ DmMS ” 方法是使用共享模拟平行法,在单一节点上计算 3D 斯卡拉尔字段的持久性图的最有效算法。 在这项工作中,我们将DMS 扩展为分布式模拟平行法,以便在多个计算节点中高效和可缩放地计算大量数据集的持久性图。 一方面,我们可以利用DMS 第一次和最耗时的代数的令人尴尬平行程序( 即离散梯度计算 )。 另一方面, 以单一节点计算 3D 斯卡拉尔字段的持久性图最为高效的分布计算方法更具有挑战性。 为了解决这个问题,我们广泛修订了DMS 程序, 提供了一种新的自我校正分布式配对算法, 重新设计了关键数据结构, 并引入了用于协调分布式计算方法的计算符号。 我们还引入了专门的通信和计算线索。 详细的业绩分析显示, 以 IP IP 3 3 的数值 直径直径直径直径直径直径直径直径计算方法, 展示了我们 IP 3 的直径直径直径直径直径直径直径直径直径直 。
Article 42
Title@2025-05-27 (2): DeepCEE: Efficient Cross-Region Model Distributed Training System under Heterogeneous GPUs and Networks
Title: DeepCEE: Efficient Cross-Region Model Distributed Training System under Heterogeneous GPUs and Networks | DeepCEE: Effizientes regionsübergreifendes Schulungssystem unter heterogenen GPUs und Netzwerken | DeepCEE:在异种性全球保护单位和网络下建立高效跨区域分布示范培训系统 2505.15536v2 |
Authors (10): Jinquan Wang, Xiaojian Liao, Xuzhao Liu, Jiashun Suo, Zhisheng Huo, Chenhao Zhang, Xiangrong Xu, Runnan Shen, Xilong Xie, Limin Xiao
Most existing training systems focus on a single region. In contrast, we envision that cross-region training offers more flexible GPU resource allocation and yields significant potential. However, the hierarchical cluster topology and unstable networks in the cloud-edge-end (CEE) environment, a typical cross-region scenario, pose substantial challenges to building an efficient and autonomous model training system. We propose DeepCEE, a geo-distributed model training system tailored for heterogeneous GPUs and networks in CEE environments. DeepCEE adopts a communication-centric design philosophy to tackle challenges arising from slow and unstable inter-region networks. It begins with a heterogeneous device profiler that identifies and groups devices based on both network and compute characteristics. Leveraging device groups, DeepCEE implements compact, zero-bubble pipeline parallelism, automatically deriving optimal parallel strategies. To further adapt to runtime variability, DeepCEE integrates a dynamic environment adapter that reacts to network fluctuations. Extensive evaluations demonstrate that DeepCEE achieves 1.3-2.8x higher training throughput compared to widely used and SOTA training systems.
相比之下,我们设想,跨区域培训可提供更灵活的GPU资源分配,并产生巨大潜力;然而,云端环境中的分级集束地形和不稳定网络,这种典型的跨区域情景,对建立一个高效自主的示范培训系统构成重大挑战;我们提议,DeepCEE是一个地理分布式示范培训系统,专门为中欧和东欧环境中的多元GPU和网络定制;TeepCEE采用一种以通信为中心的设计理念,以应对缓慢和不稳定的区域间网络带来的挑战;首先使用一个不同设备剖面仪,根据网络和计算特点确定和组装装置;杠杆装置组、深层CEE执行紧凑、零缓冲管道平行、自动产生最佳平行战略;为了进一步适应运行时间的变化,深中电子将动态环境适应器纳入一个适应网络波动的动态环境;广泛的评价表明,与广泛使用和SOTA培训系统相比,深中电子设备实现了1.3-2.8x更高的培训。
Article 43
Title@2025-05-27 (2): Grassroots Consensus
Title: Grassroots Consensus | Graswurzeln-Konsens | 基层共识 2505.19216v2 |
Authors (3): Idit Keidar, Andrew Lewis-Pye, Ehud Shapiro
Grassroots platforms aim to offer an egalitarian alternative to global platforms – centralized/autocratic and decentralized/plutocratic alike. Within the grassroots architecture, consensus is needed to realize platforms that employ digital social contracts, which are like smart contracts except that they are among people, not accounts, and are executed by these people’s smartphones, not by high-performance servers controlled by parties outside the contract. Key envisioned grassroots platforms include sovereign democratic digital communities and federations, community banks and their grassroots cryptocurrencies, and digital cooperatives. The grassroots architecture can benefit from a consensus protocol that is (i) quiescent, (ii) efficient during low- and high-throughput, (iii) responsive, (iv) blocklace-based, (v) UDP-ready, and (vi) grassroots. The Grassroots Consensus protocol addresses all these requirements while having competitive performance in both low- and high-throughput scenarios and being one of the most concise and elegant consensus protocols for partial synchrony. It achieves that by building on two cutting-edge consensus protocols – the quiescent high-performance Morpheus and the blocklace-based Cordial Miners, improving the latter’s dissemination protocol and making it UDP-ready, and extending the protocol with a constitution and a constitutional amendment component, making it grassroots.
基层平台旨在为全球平台提供平等的替代方案 – – 中央/专制和权力下放/专制 – – 全球平台。在基层架构内,需要达成共识,以实现采用数字社会合同的平台,这些平台包括智能合同,但不属于账户,由这些人的智能手机而不是由合同外各方控制的高性能服务器执行。关键设想的基层平台包括主权民主数字社区和联合会、社区银行及其基层密码库和数字合作社。基层架构可受益于共识协议,即(一) Q-Q-级协议,(二) 低和高通量期间效率协议,(三) 反应灵敏度协议,(四) 块状合同,(五) UDP准备就绪,(六) 基层协议处理所有这些要求,同时在低和高通量情况下都有竞争力,而且是部分同步最简洁和最优雅的共识协议之一。它通过建立两个尖端共识协议,即Q-级高性高性Morphe协议和基块协议,(三) 反应灵敏锐性协议,(四) 基于块式合同,(五) 准备就绪,(五) UDP) 和草质协议修正,将《宪法议定书》和《宪法》扩充和《宪法》扩展。
Article 44
Title@2025-05-27 (2): Multi-Event Triggers for Serverless Computing
Title: Multi-Event Triggers for Serverless Computing | Multi-Event-Trigger für serverloses Rechnen | 无服务器电子计算多天触发器 2505.21199v1 |
Authors (6): Valentin Carl, Trever Schirmer, Joshua Adamek, Tobias Pfandzelter, Sergio Lucia, David Bermbach
Function-as-a-Service (FaaS) is an event-driven serverless cloud computing model in which small, stateless functions are invoked in response to events, such as HTTP requests, new database entries, or messages. Current FaaS platforms assume that each function invocation corresponds to a single event. However, from an application perspective, it is desirable to invoke functions in response to a collection of events of different types or only with every $n$-th event. To implement this today, a function would need additional state management, e.g., in a database, and custom logic to determine whether its trigger condition is fulfilled and the actual application code should run. In such an implementation, most function invocations would be rendered essentially useless, leading to unnecessarily high resource usage, latency, and cost for applications. In this paper, we introduce multi-event triggers, through which complex conditions for function invocations can be specified. Specifically, we introduce abstractions for invoking functions based on a set of $n$ events and joins of multiple events of different types. This enables application developers to define intricate conditions for function invocations, workflow steps, and complex event processing. Our evaluation with a proof-of-concept prototype shows that this reduces event–invocation latency by 62.5% in an incident detection use-case and that our system can handle more than 300,000 requests per second on limited hardware, which is sufficient load for implementation in large FaaS platforms.
函数- a- service (FaaS) 是一种由事件驱动的无服务器的云计算模型, 在发生HTTP请求、 新的数据库条目或信息等事件时, 援引小型、 无国籍的功能, 以响应小型、 无国籍的功能; 当前的 FaaS 平台假设, 每个函数的引用都对应一个单一事件。 然而, 从应用角度来说, 有必要援引功能来应对不同类型事件的集合, 或仅针对每个 n\ textsuperscript{th} 事件。 今天, 要实施此功能, 一个函数将需要额外的州管理, 例如, 在数据库和定制逻辑中, 以确定其触发条件是否已经满足, 实际应用代码是否运行。 在这样的执行中, 大多数功能的引用将基本上变得毫无用处, 导致不必要高的资源使用、 延时、 应用成本。 然而, 我们引入了多重事件触发触发点, 具体地, 我们引入了基于一组美元事件启动的功能, 和多种事件合并的事件。
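The two trigger kinds named in the abstract above (a set of $n$ events, and a join over events of different types) can be sketched as small in-memory classes. This is a minimal sketch under stated assumptions: the class names `EveryNthTrigger` and `JoinTrigger`, the `on_event` method, and the FIFO matching policy are hypothetical and do not reflect the prototype's actual API.

```python
from collections import defaultdict


class EveryNthTrigger:
    """Invoke `fn` once per n events, passing the buffered batch of n events."""

    def __init__(self, n, fn):
        self.n, self.fn, self.buffer = n, fn, []

    def on_event(self, event):
        self.buffer.append(event)
        if len(self.buffer) == self.n:
            batch, self.buffer = self.buffer, []
            return self.fn(batch)
        return None  # trigger condition not met, so no invocation happens


class JoinTrigger:
    """Invoke `fn` once one event of every required type is available."""

    def __init__(self, required_types, fn):
        self.required, self.fn = list(required_types), fn
        self.pending = defaultdict(list)

    def on_event(self, event_type, payload):
        if event_type not in self.required:
            return None  # event types outside the join are ignored
        self.pending[event_type].append(payload)
        if all(self.pending[t] for t in self.required):
            # Consume the oldest pending event of each type (assumed FIFO policy).
            joined = {t: self.pending[t].pop(0) for t in self.required}
            return self.fn(joined)
        return None


every_third = EveryNthTrigger(3, lambda batch: sum(batch))
counts = [every_third.on_event(x) for x in [1, 2, 3, 4, 5, 6]]

join = JoinTrigger(["temperature", "smoke"], lambda joined: joined)
first = join.on_event("temperature", 80)
second = join.on_event("smoke", True)
```

Keeping the buffering inside the trigger is the point of the design: the function body runs only when the condition holds, instead of being invoked for every event and exiting early.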
Article 45
Title@2025-05-27 (2): Vectorized Sequence-Based Chunking for Data Deduplication
Title: Vectorized Sequence-Based Chunking for Data Deduplication | Vektorisierte Sequenz-basiertes Chunking für Datendeduplikation | 数据解析矢量序列相键 2505.21194v1 |
Authors (2): Sreeharsha Udayashankar, Samer Al-Kiswany
Data deduplication has gained wide acclaim as a mechanism to improve storage efficiency and conserve network bandwidth. Its most critical phase, data chunking, is responsible for the overall space savings achieved via the deduplication process. However, modern data chunking algorithms are slow and compute-intensive because they scan large amounts of data while simultaneously making data-driven boundary decisions. We present SeqCDC, a novel chunking algorithm that leverages lightweight boundary detection, content-defined skipping, and SSE/AVX acceleration to improve chunking throughput for large chunk sizes. Our evaluation shows that SeqCDC achieves 15x higher throughput than unaccelerated and 1.2x-1.35x higher throughput than vector-accelerated data chunking algorithms while minimally affecting deduplication space savings.
数据重复作为提高存储效率和保护网络带宽的机制,已获得广泛的赞誉。 其最关键阶段,即数据块块块,是减少过程所节省的空间总量。 然而,现代数据块块算法缓慢且计算密集,因为它们扫描大量数据,同时作出数据驱动的边界决定。 我们提出了SeqCDC,这是利用轻量边界探测、内容定义跳跃和SSE/AVX加速来提高大块块块块块块的散装吞吐量的新型累加算法,我们的评价显示SeqCDC的吞吐量比未加速的要高15x倍,而1.2x-1.35x的吞吐量比矢量加速的数据块积算法高出1.2x-1.35x,同时对空间的减缩影响最小。
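To make "lightweight boundary detection" in the abstract above concrete, the sketch below shows one possible scalar form of sequence-based content-defined chunking. It rests on assumptions not stated in the abstract: a boundary is declared after `seq_len` consecutive strictly increasing bytes, bytes inside the minimum chunk size are skipped, and a cut is forced at `max_size`. The function name and parameters are hypothetical, and no SSE/AVX vectorization is shown.

```python
def chunk_boundaries(data, seq_len=3, min_size=8, max_size=64):
    """Return the end offsets of chunks found in `data` (a bytes object)."""
    boundaries, start, n = [], 0, len(data)
    while start < n:
        limit = min(start + max_size, n)  # forced cut if no boundary is found
        cut = limit
        run = 1  # length of the current strictly increasing byte run
        # Skip the first min_size bytes: no boundary may fall inside them.
        for i in range(start + min_size, limit):
            run = run + 1 if data[i] > data[i - 1] else 1
            if run >= seq_len:
                cut = i + 1  # boundary right after the increasing sequence
                break
        boundaries.append(cut)
        start = cut
    return boundaries


sample = bytes([5, 5, 5, 5, 1, 2, 3, 9, 9, 9])
cuts = chunk_boundaries(sample, seq_len=3, min_size=2, max_size=8)
```

Because boundaries depend only on local byte content, identical regions in two files tend to produce identical chunks even when earlier data has shifted, which is what deduplication relies on.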
Article 46
Title@2025-05-27 (2): Constructive community race: full-density spiking neural network model drives neuromorphic computing
Title: Constructive community race: full-density spiking neural network model drives neuromorphic computing | Konstruktives Community-Rennen: Volldichte-Spitzen neuronales Netzwerkmodell treibt neuromorphes Computing an | 充满建设性的社区种族:完全密度刺激神经网络模型驱动神经形态计算 2505.21185v1 |
Authors (21): Johanna Senk, Anno Kurth, Steve Furber, Tobias Gemmeke, Bruno Golosio, Arne Heittmann, James C. Knight, Eric Müller, Tobias Noll, Thomas Nowotny, Gorka Peraza Coppola, Luca Peres, Oliver Rhodes, Andrew Rowley, Johannes Schemmel, Tim Stadtmann, Tom Tetzlaff, Gianmarco Tiddia, Sacha J. van Albada, José Villamar, Markus Diesmann
The local circuitry of the mammalian brain is a focus of the search for generic computational principles because it is largely conserved across species and modalities. In 2014 a model was proposed representing all neurons and synapses of the stereotypical cortical microcircuit below $1\,\text{mm}^2$ of brain surface. The model reproduces fundamental features of brain activity but its impact remained limited because of its computational demands. For theory and simulation, however, the model was a breakthrough because it removes uncertainties of downscaling, and larger models are less densely connected. This sparked a race in the neuromorphic computing community and the model became a de facto standard benchmark. Within a few years real-time performance was reached and surpassed at significantly reduced energy consumption. We review how the computational challenge was tackled by different simulation technologies and derive guidelines for the next generation of benchmarks and other domains of science.
哺乳动物大脑的局部电路是寻找通用计算原则的一个焦点,因为它在物种和模式上都得到了很大程度的保护。 2014年,提出了一个模型,代表了大脑表面1美元以下的陈规定型皮质微电路的所有神经元和突触。模型复制了大脑活动的基本特征,但由于其计算需求,其影响仍然有限。然而,对于理论和模拟来说,模型是一个突破,因为它消除了缩小规模的不确定性,而更大的模型则不那么密集地连接。这在神经形态计算界引发了一场竞赛,模型成为了一个事实上的标准基准。几年内,当能源消耗大幅减少时,实现了实时性能并超过了实时性能。我们审查了不同的模拟技术如何应对计算挑战,并为下一代基准和其他科学领域制定了指导方针。
Article 47
Title@2025-05-27 (2): SHE-LoRA: Selective Homomorphic Encryption for Federated Tuning with Heterogeneous LoRA
Title: SHE-LoRA: Selective Homomorphic Encryption for Federated Tuning with Heterogeneous LoRA | SHE-LoRA: Selektive homomorphe Verschlüsselung für Federated Tuning mit Heterogene LoRA | SHE-LORA: 与异源罗拉结合的联邦调试的选择性单体单体加密 2505.21051v1 |
Authors (5): Jianmin Liu, Li Yan, Borui Li, Lei Yu, Chao Shen
Federated fine-tuning of large language models (LLMs) is critical for improving their performance in handling domain-specific tasks. However, prior work has shown that clients’ private data can actually be recovered via gradient inversion attacks. Existing privacy preservation techniques against such attacks typically entail performance degradation and high costs, making them ill-suited for clients with heterogeneous data distributions and device capabilities. In this paper, we propose SHE-LoRA, which integrates selective homomorphic encryption (HE) and low-rank adaptation (LoRA) to enable efficient and privacy-preserving federated tuning of LLMs in cross-device environments. Heterogeneous clients adaptively select partial model parameters for homomorphic encryption based on parameter sensitivity assessment, with the encryption subset obtained via negotiation. To ensure accurate model aggregation, we design a column-aware secure aggregation method and customized reparameterization techniques to align the aggregation results with the heterogeneous device capabilities of clients. Extensive experiments demonstrate that SHE-LoRA maintains performance comparable to non-private baselines, achieves strong resistance to state-of-the-art attacks, and significantly reduces communication overhead by 94.901% and encryption computation overhead by 99.829%, compared to baseline. Our code is accessible at https://anonymous.4open.science/r/SHE-LoRA-8D84.
在本文中,我们提议SHE-LORA将选择性同质加密(HE)和低级别适应(LORA)结合起来,以便能够在跨概念环境中对LLMS进行高效和保密的组合调试。 超基因客户根据参数敏感度评估,根据参数敏感度评估,根据加密子集,根据参数敏感度调整选择同质加密的部分模型参数,通过谈判获得加密子集。为了确保精确的模型集成,我们设计了一种有色安全集成法和定制的重新计数技术,使汇总结果与客户的混合装置能力相一致。广泛的实验表明,SHE-LORA保持了与非私人基线相当的可与非私人基准相当的性能,实现了很强的抗控性能,并在94.901/RA/RA号基准线上大大降低了通信管理费。
Article 48
Title@2025-05-27 (2): A Hitchhiker’s Guide to Privacy-Preserving Cryptocurrencies: A Survey on Anonymity, Confidentiality, and Auditability
Title: A Hitchhiker’s Guide to Privacy-Preserving Cryptocurrencies: A Survey on Anonymity, Confidentiality, and Auditability | Ein Hitchhiker-Leitfaden zur Wahrung der Privatsphäre von Kryptowährungen: Eine Umfrage über Anonymität, Vertraulichkeit und Auditierbarkeit | 《希希克人保护隐私加密指南:关于匿名、保密和可审计性的调查》 2505.21008v1 |
Authors (3): Matteo Nardelli, Francesco De Sclavis, Michela Iezzi
Cryptocurrencies and central bank digital currencies (CBDCs) are reshaping the monetary landscape, offering transparency and efficiency while raising critical concerns about user privacy and regulatory compliance. This survey provides a comprehensive and technically grounded overview of privacy-preserving digital currencies, covering both cryptocurrencies and CBDCs. We propose a taxonomy of privacy goals – including anonymity, confidentiality, unlinkability, and auditability – and map them to underlying cryptographic primitives, protocol mechanisms, and system architectures. Unlike previous surveys, our work adopts a design-oriented perspective, linking high-level privacy objectives to concrete implementations. We also trace the evolution of privacy-preserving currencies through three generations, highlighting shifts from basic anonymity guarantees toward more nuanced privacy-accountability trade-offs. Finally, we identify open challenges at the intersection of cryptography, distributed systems, and policy definition, which motivate further investigation into the primitives and design of digital currencies that balance real-world privacy and auditability needs.
加密和中央银行数字货币(CBCs)正在改变货币格局,提供透明度和效率,同时对用户隐私和监管合规提出重大关切。本调查对保护隐私的数字货币(包括密码和CBCs)提供了全面、技术上有根据的概览,涵盖密码和CBCs。我们建议对隐私目标进行分类,包括匿名、保密、不可连接和可审计性,并把它们映射为隐秘原始、协议机制和系统结构的基础。与以往的调查不同,我们的工作采用了面向设计的观点,将高级别隐私目标与具体实施联系起来。我们还追踪了隐私保护货币在三代人的演变,重点指出从基本的匿名保障向更细致的隐私和可问责性权衡的转变。最后,我们找出了密码学、分布式系统和政策定义的交叉点,这促使进一步调查原始和设计能平衡现实世界隐私和可审计需要的数字货币。
Article 49
Title@2025-05-27 (2): RACS-SADL: Robust and Understandable Randomized Consensus in the Cloud
Title: RACS-SADL: Robust and Understandable Randomized Consensus in the Cloud | RACS-SADL: Robuster und verständlicher Randomisierter Konsens in der Cloud | RACS-SADL:云层中的有力和可理解的随机共识 2404.04183v3 |
Authors (3): Pasindu Tennage, Antoine Desjardins, Lefteris Kokoris-Kogias
Widely deployed consensus protocols in the cloud are often leader-based and optimized for low latency under synchronous network conditions. However, cloud networks can experience disruptions such as network partitions, high-loss links, and configuration errors. These disruptions interfere with the operation of leader-based protocols, as their view change mechanisms interrupt the normal case replication and cause the system to stall. We propose RACS, a novel randomized consensus protocol that ensures robustness against adversarial network conditions. RACS achieves optimal one-round trip latency under synchronous network conditions while remaining resilient to adversarial network conditions. RACS follows a simple design inspired by Raft, the most widely used consensus protocol in the cloud, and therefore enables seamless integration with the existing cloud software stack. Experiments with a prototype running on Amazon EC2 show that RACS achieves 28k cmd/sec throughput, ninefold higher than Raft under adversarial cloud network conditions. Under synchronous network conditions, RACS matches the performance of Multi-Paxos and Raft, achieving a throughput of 200k cmd/sec with a median latency of 300ms, confirming that RACS introduces no unnecessary overhead. Finally, SADL-RACS, a throughput-optimized version of RACS, achieves a throughput of 500k cmd/sec, delivering 150% higher throughput than Raft.
在同步的网络条件下,云层中广泛部署的共识协议往往以领导为基础,在低潜值方面最优化,在同步的网络条件下,以低潜值为主;然而,云端网络可能会经历网络分割、高损失连接和配置错误等干扰,这些干扰干扰干扰了基于领导的协议的运作,因为其观点改变机制中断正常的复制案件复制,并导致系统停顿;我们建议RACS,这是一个新颖的随机随机化共识协议,确保对对抗性网络条件的稳健性;RACS在同步的网络条件下,在同步的网络条件下,实现最佳的一回合长距离连接,同时保持对对抗性网络条件的复原力;RACSCS遵循由云中最广泛使用的共识协议Raft所启发的简单设计,从而能够与现有云层软件堆进行无缝的整合;在亚马逊EC2上运行的原型模型实验显示,RACS达到28 k/secult ,在对抗性云网络条件下比Raft高9倍。在同步的网络条件下,RACS与多轴和Raft取得最佳的成绩,通过SDSDRCSDRDS的中位通过SDRCSDRPDRDRDRDRDRDRDRDRDRDRDRDRDRDRDRDRDRD的中,最终实现一个不必要版。
Article 50
Title@2025-05-27 (2): EPIC: Efficient Position-Independent Caching for Serving Large Language Models
Title: EPIC: Efficient Position-Independent Caching for Serving Large Language Models | EPIC: Effizientes positionsunabhängiges Caching für das Servieren großer Sprachmodelle | EPIC: 高效的、独立定位的为大语言模式服务的工作 2410.15332v3 |
Authors (10): Junhao Hu, Wenrui Huang, Weidong Wang, Haoyi Wang, Tiancheng Hu, Qin Zhang, Hao Feng, Xusheng Chen, Yizhou Shan, Tao Xie
Large Language Models (LLMs) show great capabilities in a wide range of applications, but serving them efficiently becomes increasingly challenging as requests (prompts) become more complex. Context caching improves serving performance by reusing Key-Value (KV) vectors, the intermediate representations of tokens that are repeated across requests. However, existing context caching requires exact prefix matches across requests, limiting reuse cases in settings such as few-shot learning and retrieval-augmented generation, where immutable content (e.g., documents) remains unchanged across requests but is preceded by varying prefixes. Position-Independent Caching (PIC) addresses this issue by enabling modular reuse of the KV vectors regardless of prefixes. We formalize PIC and advance prior work by introducing EPIC, a serving system incorporating our new LegoLink algorithm, which mitigates the inappropriate “attention sink” effect at every document beginning, to maintain accuracy with minimal computation. Experiments show that EPIC achieves up to 8x improvements in Time-To-First-Token (TTFT) and 7x throughput gains over existing systems, with negligible or no accuracy loss.
大型语言模型(LLMS)在范围广泛的各种应用中表现出巨大的能力,但随着请求(即发件人)变得更加复杂,高效地为它们服务变得日益具有挑战性。背景缓冲通过重复使用Ky-Value(KV)矢量(KV)矢量的中间表示形式,即不同请求之间反复重复的象征物,改善了业绩。但是,现有的环境缓冲要求要求需要精确的前缀匹配,限制在微小的学习和检索增强生成等环境中的再利用案例,在这种环境中,不可移动的内容(例如文件)在所有请求中保持不变,但之前有不同的前缀。依靠位置的Caching(PIC)解决这个问题的方法是,使KV矢量矢量的模块再利用,而不论前缀如何。我们正式确定PIC,并预先推进工作,采用EPIC,这是一个包含我们新的LegoLink算法的服务系统,在开始的每一份文件中减少不适当的“注意汇”效应,以便以最小的计算来保持准确性。实验表明,EPIC在时间到头(TFTFT)和7x通过现有系统取得8x总的收益方面达到8x。
Article 51
Title@2025-05-27 (2): Complexity landscape for local certification
Title: Complexity landscape for local certification | Komplexitätslandschaft für die lokale Zertifizierung | 当地认证的复杂环境 2505.20915v1 |
Authors (3): Nicolas Bousquet, Laurent Feuilloley, Sébastien Zeitoun
An impressive recent line of work has charted the complexity landscape of distributed graph algorithms. For many settings, it has been determined which time complexities exist, and which do not (in the sense that no local problem could have an optimal algorithm with that complexity). In this paper, we initiate the study of the landscape for space complexity of distributed graph algorithms. More precisely, we focus on the local certification setting, where a prover assigns certificates to nodes to certify a property, and where the space complexity is measured by the size of the certificates. Already for anonymous paths and cycles, we unveil a surprising landscape: (i) there is a gap between complexity $O(1)$ and $\Theta(\log \log n)$ in paths, which is the first gap established in local certification; (ii) there exists a property that has complexity $\Theta(\log \log n)$ in paths, a regime that was not known to exist for a natural property; (iii) there is a gap between complexity $O(1)$ and $\Theta(\log n)$ in cycles, hence a gap that is exponentially larger than for paths. We then generalize our result for paths to the class of trees. Namely, we show that there is a gap between complexity $O(1)$ and $\Theta(\log \log d)$ in trees, where $d$ is the diameter. We finally describe some settings where there are no gaps at all. To prove our results we develop a new toolkit, based on various results of automata theory and arithmetic, which is of independent interest.
令人印象深刻的近期工作线绘制了分布式图表算法的复杂面貌。 对于许多设置, 已经确定了时间复杂性存在的时间复杂性, 并且没有。 在本文中, 我们开始研究分布式图表算法的空间复杂性。 更准确地说, 我们侧重于本地认证设置, 证明者将证书指定给节点以认证属性, 而空间复杂性则以证书的大小来衡量。 早在匿名路径和周期中, 我们就公布了一个惊人的景观 : - 复杂值$(1) 和 $\ Theta (log\log\log n) 之间在路径中存在差距。 这是本地认证中的第一个缺口 。 - 存在一个在路径中具有复杂性 $(log\log\log n) 的属性。 一个未知的系统, 用于认证属性, 由验证器将证书指定为证书, 由节点(1美元) 和 $ (log n) 之间存在一个复杂度差距。 因此, 在路径上, 我们将结果概括为 $ 。
Article 52
Title@2025-05-27 (2): Reduced and mixed precision turbulent flow simulations using explicit finite difference schemes
Title: Reduced and mixed precision turbulent flow simulations using explicit finite difference schemes | Reduzierte und gemischte Präzision turbulente Strömungssimulationen mit expliziten Finite-Differenz-Systemen | 使用明确的有限差别办法进行减少和混合精密混杂的波动流动模拟 2505.20911v1 |
Authors (5): Bálint Siklósi, Pushpender K. Sharma, David J. Lusher, István Z. Reguly, Neil D. Sandham
The use of reduced and mixed precision computing has gained increasing attention in high-performance computing (HPC) as a means to improve computational efficiency, particularly on modern hardware architectures like GPUs. In this work, we explore the application of mixed precision arithmetic in compressible turbulent flow simulations using explicit finite difference schemes. We extend the OPS and OpenSBLI frameworks to support customizable precision levels, enabling fine-grained control over precision allocation for different computational tasks. Through a series of numerical experiments on the Taylor-Green vortex benchmark, we demonstrate that mixed precision strategies, such as half-single and single-double combinations, can offer significant performance gains without compromising numerical accuracy. However, pure half-precision computations result in unacceptable accuracy loss, underscoring the need for careful precision selection. Our results show that mixed precision configurations can reduce memory usage and communication overhead, leading to notable speedups, particularly on multi-CPU and multi-GPU systems.
在高性能计算(HPC)中,使用减少和混合精密计算作为提高计算效率的手段日益受到注意,特别是在诸如GPUs等现代硬件结构方面。在这项工作中,我们探索在使用明确的有限差异办法进行压缩动荡流模拟时采用混合精确算术。我们扩展了OPS和OpenSBLI框架,以支持可定制的精确水平,使不同计算任务的精确分配得到精密控制。通过对泰勒-绿色涡旋基准的一系列数字实验,我们表明,混合精准战略,例如半成和单倍组合,可以带来显著的性能增益,而不会降低数字精准性。然而,纯半精准计算导致不可接受的准确性损失,突出表明需要仔细精确选择。我们的结果显示,混合精准配置可以减少记忆用量和通信间接费用,导致显著的加速,特别是在多CPU和多GPU系统上。
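A minimal Python sketch of why pure half precision can lose accuracy while a mixed strategy need not. It is not taken from OPS or OpenSBLI. It assumes half precision can be emulated by round-tripping values through the standard `struct` format `e` (IEEE 754 binary16); the names `to_half`, `sum_pure_half` and `sum_mixed` are hypothetical.

```python
import struct

def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def sum_pure_half(values):
    """Store AND accumulate in half precision."""
    total = 0.0
    for v in values:
        total = to_half(total + to_half(v))
    return total

def sum_mixed(values):
    """Store in half precision, accumulate in a wider type (Python float = double)."""
    total = 0.0
    for v in values:
        total += to_half(v)
    return total

values = [0.01] * 10_000  # the exact sum is 100
half_error = abs(sum_pure_half(values) - 100.0)
mixed_error = abs(sum_mixed(values) - 100.0)
print(f"pure half error: {half_error:.3f}, mixed error: {mixed_error:.3f}")
```

In the pure half-precision loop the running sum eventually becomes so large that adding 0.01 no longer changes it, whereas the mixed variant only carries the small rounding error of the stored inputs.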
Article 53
Title@2025-05-27 (2): Load Balancing in Strongly Inhomogeneous Simulations – a Vlasiator Case Study
Title: Load Balancing in Strongly Inhomogeneous Simulations – a Vlasiator Case Study | Lastausgleich in stark inhomogenen Simulationen – eine Vlasiator-Fallstudie | 在极不相异模拟器中平衡载荷 – – 挥发器案例研究 2505.20908v1 |
Authors (5): Leo Kotipalo, Markus Battarbee, Yann Pfau-Kempf, Vertti Tarvus, Minna Palmroth
Parallelization is a necessity for large-scale simulations due to the amount of data processed. In this article we investigate different load balancing methods using Vlasiator, a global magnetospheric simulation, as our case study. The theoretical basis for load balancing is the (hyper)graph partitioning problem, modeling simulation units as vertices and their data dependencies as edges. As it is an NP-hard problem, heuristics are necessary for dynamic runtime balancing. We first consider hypergraph partitioning via an algorithm called parallel hypergraph partitioner (PHG); this is done by partitioning a simplified grid and then attempting to optimize the solution on the finer grid. The second and third methods are the geometric methods of recursive coordinate bisection (RCB) and recursive inertial bisection (RIB). Finally, we consider the method of Hilbert space filling curves (HSFC). The algorithm projects simulation cells along a Hilbert curve and makes cuts along the curve. This works well due to the excellent locality of Hilbert curves, and can be optimized further by choice of curve. We introduce and investigate six three-dimensional Hilbert curves in total. Our findings on runs of two different scales indicate the HSFC method provides optimal load balance, followed by RIB and PHG methods and finally by RCB. Of the Hilbert curves evaluated, the Beta curve outperformed the most commonly used curve by a few percent.
由于所处理的数据数量, 大型模拟必须实现平行。 在本文中, 我们用Vlasiator( 一个全球磁层模拟, 作为案例研究) 来调查不同的负平衡方法。 负平衡的理论基础是( 超) 分解问题, 将模拟单位作为脊椎建模, 及其数据依赖性作为边缘建模。 由于这是一个NP- 硬问题, 逻辑学对于动态运行时间平衡是必要的。 我们考虑通过一个称为平行的双曲线分割器( PHG) 的算法, 第一次超高速分解是用简化的网格进行分解, 然后试图优化细格网的解决方案。 第二和第三点是递归坐标坐标( RCB) 和递归惯性惯性惯性惯性分解( RIB) 的几何方法。 最后我们考虑的是希尔伯特空间填充曲线( HSFFF) 的计算方法。 算法将希尔伯特 曲线 模拟细胞 沿曲线 进行切。 这很好, 是因为Hilbert 曲线 的精密地区, , 可以通过选择曲线来进一步优化 。 我们介绍并调查六维基 的 轴 , 的 以不同的 轴 的 轴 , 通过不同的 标准 计算法 , 通过不同的 , 以不同的 以不同的 平比 的 平比 的 标准 标准 方法来提供 。
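The HSFC idea of projecting cells onto a curve and cutting the curve can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: it uses the classic two-dimensional Hilbert curve on an n x n grid (n a power of two), not the three-dimensional curves or the Beta curve studied in the article, and it is not Vlasiator code. The names `hilbert_index` and `partition_along_curve` are hypothetical.

```python
def hilbert_index(n: int, x: int, y: int) -> int:
    """Position of cell (x, y) along a Hilbert curve on an n x n grid."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the standard orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def partition_along_curve(n: int, weights: dict, parts: int) -> dict:
    """Assign each cell to a part by cutting the curve into equal-weight pieces."""
    cells = sorted(weights, key=lambda c: hilbert_index(n, c[0], c[1]))
    total = sum(weights.values())
    assignment = {}
    running = 0.0
    for cell in cells:
        # Use the midpoint of the cell's weight interval to pick its part.
        mid = running + weights[cell] / 2
        assignment[cell] = min(parts - 1, int(mid * parts / total))
        running += weights[cell]
    return assignment

uniform = {(x, y): 1.0 for x in range(4) for y in range(4)}
print(partition_along_curve(4, uniform, 4))
```

Because consecutive cells on the curve are spatial neighbours, each contiguous piece of the curve tends to form a compact region, which is the locality property the abstract refers to.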
Article 54
Title@2025-05-27 (2): An Efficient Implementation of Guard-Based Synchronization for an Object-Oriented Programming Language
Title: An Efficient Implementation of Guard-Based Synchronization for an Object-Oriented Programming Language | Effiziente Implementierung von Guard-Based Synchronization für eine objektorientierte Programmiersprache | 高效率地实施以警卫为基础的同步,以用于以目标为导向的方案编制语言 2505.20850v1 |
Authors (2): Shucai Yao, Emil Sekerinski
In the shared variable model of concurrency, guarded atomic actions restrict the possible interference between processes by regions of atomic execution. The guard specifies the condition for entering an atomic region. That is a convenient model for the specification and verification of concurrent programs, but has eschewed efficient execution so far. This article shows how guarded atomic actions, when attached to objects, can be implemented highly efficiently using a combination of coroutines, operating-system worker threads, and dedicated management of object queues and stacks. The efficiency of an experimental language, Lime, is shown to compare favourably with that of C/Pthreads, Go, Erlang, Java, and Haskell on synthetic benchmarks.
在共同的可变货币模型中,有戒备的原子行动限制了原子执行区域之间可能发生的干扰。 卫兵规定了进入原子区域的条件。 这是同时进行程序规格和核查的方便模式,但迄今避免了高效执行。 本条表明,在附在物体上的有戒备的原子行动如何能够高效地实施,同时使用共流、操作系统工人线和专门管理对象队列和堆叠。 实验语言Lime的效率与C/Pthreads、Go、Erlang、Java和Haskell在合成基准上的效率相当。
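To make the notion of a guarded atomic action attached to an object concrete, here is a minimal Python sketch of a bounded buffer whose `put` and `get` each wait on a guard before running atomically. It uses a lock and condition variable, which is the conventional approach and not the coroutine-based scheme the article describes for Lime; the class `GuardedBuffer` and the helper `demo` are hypothetical.

```python
import threading
from collections import deque

class GuardedBuffer:
    """Bounded buffer whose methods behave like guarded atomic actions."""

    def __init__(self, capacity: int):
        self._items = deque()
        self._capacity = capacity
        self._cond = threading.Condition()

    def put(self, item):
        with self._cond:  # atomic region
            # Guard: the buffer must have free space.
            self._cond.wait_for(lambda: len(self._items) < self._capacity)
            self._items.append(item)
            self._cond.notify_all()  # other guards may now hold

    def get(self):
        with self._cond:  # atomic region
            # Guard: the buffer must be non-empty.
            self._cond.wait_for(lambda: len(self._items) > 0)
            item = self._items.popleft()
            self._cond.notify_all()
            return item

def demo(n: int = 100):
    """Run one producer and one consumer thread and return what was received."""
    buf = GuardedBuffer(capacity=4)
    received = []
    producer = threading.Thread(target=lambda: [buf.put(i) for i in range(n)])
    consumer = threading.Thread(
        target=lambda: received.extend(buf.get() for _ in range(n))
    )
    producer.start()
    consumer.start()
    producer.join()
    consumer.join()
    return received

print(demo(10))
```

Every blocked caller here occupies an operating-system thread while its guard is false, which is the kind of overhead an implementation based on coroutines and object queues would aim to avoid.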
Article 55
Title@2025-05-27 (2): Choreographies as Macros
Title: Choreographies as Macros | Choreographien als Makros | 作为宏的舞蹈 2505.20845v1 |
Authors (2): Alexander Bohosian, Andrew K. Hirsch
Concurrent programming often entails meticulous pairing of sends and receives between participants to avoid deadlock. Choreographic programming alleviates this burden by specifying the system as a single program. However, there are more applications than implementations of choreographies, and developing new implementations takes a lot of time and effort. Our work uses Racket to expedite building a new choreographic language called Choret. Racket has a powerful macro system which allows Choret to reuse much of its infrastructure for greater functionality and correctness.
与此同时,编程往往需要参与者对发送和接收进行细致的配对以避免僵局。编程通过将系统指定为单一程序来减轻这一负担。然而,应用量比舞蹈编程多,开发新的实施需要大量的时间和精力。我们的工作利用拉克茨加快建立一个名为Choret的新编程语言。拉克茨拥有强大的宏观系统,使乔雷特能够重新利用大部分基础设施来提高功能和正确性。
Article 56
Title@2025-05-27 (2): ECC-SNN: Cost-Effective Edge-Cloud Collaboration for Spiking Neural Networks
Title: ECC-SNN: Cost-Effective Edge-Cloud Collaboration for Spiking Neural Networks | ECC-SNN: Kosteneffiziente Edge-Cloud-Kollaboration für Spiking Neuronal Networks | ECC-SNN: 传播神经网络的成本-效益高的边缘-封闭式协作 2505.20835v1 |
Authors (8): Di Yu, Changze Lv, Xin Du, Linshan Jiang, Wentao Tong, Zhenyu Liao, Xiaoqing Zheng, Shuiguang Deng
Most edge-cloud collaboration frameworks rely on the substantial computational and storage capabilities of cloud-based artificial neural networks (ANNs). However, this reliance results in significant communication overhead between edge devices and the cloud and high computational energy consumption, especially when applied to resource-constrained edge devices. To address these challenges, we propose ECC-SNN, a novel edge-cloud collaboration framework incorporating energy-efficient spiking neural networks (SNNs) to offload more computational workload from the cloud to the edge, thereby improving cost-effectiveness and reducing reliance on the cloud. ECC-SNN employs a joint training approach that integrates ANN and SNN models, enabling edge devices to leverage knowledge from cloud models for enhanced performance while reducing energy consumption and processing latency. Furthermore, ECC-SNN features an on-device incremental learning algorithm that enables edge models to continuously adapt to dynamic environments, reducing the communication overhead and resource consumption associated with frequent cloud update requests. Extensive experimental results on four datasets demonstrate that ECC-SNN improves accuracy by 4.15%, reduces average energy consumption by 79.4%, and lowers average processing latency by 39.1%.
大部分边缘线性协作框架依靠云基人工神经网络的大量计算和储存能力。然而,这种依赖导致边缘装置与云和高计算能消耗之间的大量通信,特别是在对资源受限制的边缘装置应用时。为了应对这些挑战,我们提议ECC-SNN,一个全新的边缘线性协作框架,其中包括节能喷发神经网络(SNN),以便从云端向边缘倾卸更多的计算工作量,从而提高成本效益和减少对云层的依赖。ECC-SNNN采用联合培训方法,将ANN和SNN模型结合起来,使边缘装置能够利用云模型的知识提高性能,同时减少能源消耗和处理耐久性。此外,ECC-SNN具有一种可操作性渐进式学习算法,使边缘模型能够持续适应动态环境,减少与频繁的云更新请求有关的通信间接费用和资源消耗。四个数据集的广泛实验结果表明,ECC-SNN的准确性提高了4.15%,使平均能源消耗减少79.4%,而平均处理率降低39.1%。
Article 57
Title@2025-05-27 (2): Work-Efficient Parallel Counting via Sampling
Title: Work-Efficient Parallel Counting via Sampling | Arbeitseffiziente parallele Zählung über Probenahme | 通过抽样计算实现工作效率的平行计数 2408.09719v2 |
Authors (3): Hongyang Liu, Yitong Yin, Yiyao Zhang
A canonical approach to approximating the partition function of a Gibbs distribution via sampling is simulated annealing. This method has led to efficient reductions from counting to sampling, including: $\bullet$ classic non-adaptive (parallel) algorithms with sub-optimal cost (Dyer-Frieze-Kannan ’89; Bezáková-Štefankovič-Vazirani-Vigoda ’08); $\bullet$ adaptive (sequential) algorithms with near-optimal cost (Štefankovič-Vempala-Vigoda ’09; Huber ’15; Kolmogorov ’18; Harris-Kolmogorov ’24). We present an algorithm that achieves both near-optimal total work and efficient parallelism, providing a reduction from counting to sampling with logarithmic depth and near-optimal work. As consequences, we obtain work-efficient parallel counting algorithms for several important models, including the hardcore and Ising models within the uniqueness regime.
通过取样对 Gibbs 分布的分区函数进行近似平衡的 canonic 方法模拟了 anneal 。 这种方法导致从计算到抽样的有效减少, 包括: $\ bull$ 经典非适应性( parallel) 算法, 其成本为亚最佳( Dyr- Frieze- Kannan ‘ 89; Bez' akov' a- v{S} tefankovi\ v{c}- Vazirani- Vigoda ‘ 08; $\ bullet$ 适应性( 排序) 算法, 其成本为近最佳(\ v{S} Stekovi{ v}- Vempala- Vigoda ‘ 09; Huber ‘ 15; Kolmogorov ‘ 18; Harris- Kolmogorov ‘ 24) 。 我们提出的算法, 既实现近最佳总工作和高效平行, 也从计算到对对极深度和近最佳工作进行抽样抽样抽样取样, 。 作为结果, 我们获得了工作效率的平行平行平行计算算算算算算算法, 。
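The simulated-annealing reduction from counting to sampling can be summarized by a telescoping product. The notation below is an assumption, since the abstract fixes none: the Gibbs distribution is taken to be $\mu_\beta(x) = e^{-\beta H(x)} / Z(\beta)$ over a finite state space, with a cooling schedule $\beta_0 < \beta_1 < \dots < \beta_\ell$.

$$
\frac{Z(\beta_\ell)}{Z(\beta_0)} = \prod_{i=0}^{\ell-1} \frac{Z(\beta_{i+1})}{Z(\beta_i)},
\qquad
\frac{Z(\beta_{i+1})}{Z(\beta_i)} = \mathbb{E}_{x \sim \mu_{\beta_i}}\!\left[ e^{-(\beta_{i+1} - \beta_i)\, H(x)} \right].
$$

Each factor is an expectation under $\mu_{\beta_i}$ and can therefore be estimated by averaging over samples. Under this reading, a non-adaptive schedule fixes all $\beta_i$ in advance, so the factors can be estimated in parallel, while an adaptive schedule chooses $\beta_{i+1}$ from samples drawn at $\beta_i$ and is inherently sequential.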
Article 58
Title@2025-05-27 (2): Time-Series Learning for Proactive Fault Prediction in Distributed Systems with Deep Neural Structures
Title: Time-Series Learning for Proactive Fault Prediction in Distributed Systems with Deep Neural Structures | Time-Series Learning für proaktive Fehlervorhersage in verteilten Systemen mit tiefen neuralen Strukturen | 深心神经结构分布系统预发性故障预测时间序列学习 2505.20705v1 |
Authors (6): Yang Wang, Wenxuan Zhu, Xuehui Quan, Heyi Wang, Chang Liu, Qiyuan Wu
This paper addresses the challenges of fault prediction and delayed response in distributed systems by proposing an intelligent prediction method based on temporal feature learning. The method takes multi-dimensional performance metric sequences as input. We use a Gated Recurrent Unit (GRU) to model the evolution of system states over time. An attention mechanism is then applied to enhance key temporal segments, improving the model’s ability to identify potential faults. On this basis, a feedforward neural network is designed to perform the final classification, enabling early warning of system failures. To validate the effectiveness of the proposed approach, comparative experiments and ablation analyses were conducted using data from a large-scale real-world cloud system. The experimental results show that the model outperforms various mainstream time-series models in terms of Accuracy, F1-Score, and AUC. This demonstrates strong prediction capability and stability. Furthermore, the loss function curve confirms the convergence and reliability of the training process. It indicates that the proposed method effectively learns system behavior patterns and achieves efficient fault detection.
本文通过基于时间特征学习的智能预测方法,提出基于时间特征学习的智能预测方法,处理错误预测和分布式系统中延迟反应的挑战。该方法采用多维性能量测序列作为投入。我们使用Gated 经常单元(GRU)来模拟系统状态的演变,然后运用注意机制来增强关键时间段,提高模型识别潜在错误的能力。在此基础上,设计了向前神经网络以进行最后分类,从而能够对系统故障发出预警。为了验证拟议方法的有效性,利用大规模真实世界云层系统的数据进行了比较试验和反差分析。实验结果显示,该模型在准确性、F1-Score和ACUC方面超越了各种主流时间序列模型。这显示了强大的预测能力和稳定性。此外,损失函数曲线证实了培训过程的趋同性和可靠性。它表明,拟议方法有效地学习了系统行为模式并实现了高效的误差检测。
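The attention step that re-weights time segments can be illustrated with a small, dependency-free Python sketch. The abstract does not specify the form of attention, so this assumes a plain softmax over one scalar score per time step; in the described pipeline the hidden states would come from the GRU and the scores from a learned scoring function, both of which are replaced here by hand-written numbers. The name `attention_pool` is hypothetical.

```python
import math

def attention_pool(hidden_states, scores):
    """Softmax the per-step scores and return (weights, weighted sum of states)."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(hidden_states[0])
    context = [
        sum(w * h[k] for w, h in zip(weights, hidden_states)) for k in range(dim)
    ]
    return weights, context

# Three time steps with two-dimensional hidden states; the last step scores highest.
states = [[0.0, 1.0], [1.0, 0.0], [4.0, 4.0]]
weights, context = attention_pool(states, scores=[0.1, 0.2, 3.0])
print(weights, context)
```

The resulting context vector would then be passed to the feedforward classifier; time steps with higher scores contribute more to it.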
Article 59
Title@2025-05-27 (2): InstGenIE: Generative Image Editing Made Efficient with Mask-aware Caching and Scheduling
Title: InstGenIE: Generative Image Editing Made Efficient with Mask-aware Caching and Scheduling | InstGenIE: Generative Bildbearbeitung mit Mask-aware Caching und Scheduling effizient gemacht | InstGenie: 生成图像编辑, 高效使用防面具图像缓冲和排程 2505.20600v1 |
Authors (15): Xiaoxiao Jiang, Suyi Li, Lingyun Yang, Tianyu Feng, Zhipeng Di, Weiyi Lu, Guoxuan Zhu, Xiu Lin, Kan Liu, Yinghao Yu, Tao Lan, Guodong Yang, Lin Qu, Liping Zhang, Wei Wang
Generative image editing using diffusion models has become a prevalent application in today’s AI cloud services. In production environments, image editing typically involves a mask that specifies the regions of an image template to be edited. The use of masks provides direct control over the editing process and introduces sparsity in the model inference. In this paper, we present InstGenIE, a system that efficiently serves image editing requests. The key insight behind InstGenIE is that image editing only modifies the masked regions of image templates while preserving the original content in the unmasked areas. Driven by this insight, InstGenIE judiciously skips redundant computations associated with the unmasked areas by reusing cached intermediate activations from previous inferences. To mitigate the high cache loading overhead, InstGenIE employs a bubble-free pipeline scheme that overlaps computation with cache loading. Additionally, to reduce queuing latency in online serving while improving the GPU utilization, InstGenIE proposes a novel continuous batching strategy for diffusion model serving, allowing newly arrived requests to join the running batch in just one step of denoising computation, without waiting for the entire batch to complete. As heterogeneous masks induce imbalanced loads, InstGenIE also develops a load balancing strategy that takes into account the loads of both computation and cache loading. Collectively, InstGenIE outperforms state-of-the-art diffusion serving systems for image editing, achieving up to 3x higher throughput and reducing average request latency by up to 14.7x while ensuring image quality.
使用扩散模型生成图像编辑已成为当今 AI 云层服务的流行应用。 在制作环境中, 图像编辑通常包含一个面罩, 指定要编辑的图像模板区域。 使用面罩可以直接控制编辑过程, 并在模型推断中引入宽度 。 在本文中, 我们介绍InstGenIE, 一个高效处理图像编辑请求的系统。 InstGenIE 背后的关键洞察力是, 图像编辑只能改变图像模板的蒙蔽区域, 同时保存未发牌区域的原始内容 。 在这种洞察的驱动下, InstGenIE 明智地跳过与未发色区域相关的重复计算 。 使用隐藏的中间激活器对编辑过程有直接控制, 并引入模型导出。 内地GenIE 使用一个无泡沫的管道计划, 有效处理图像编辑请求。 此外, InstGenIE 减少在线服务的连接度, 同时改进 lapest laberal coming 战略 , 允许新收到的请求加入与未发牌区域相关的重复的重复的重复的计算。 , 将图像转换到正在运行中, 将您的存储的存储系统进行递增的压缩的压缩的压缩到 。
Article 60
Title@2025-05-26 (1): Asynchronous Fault-Tolerant Language Decidability for Runtime Verification of Distributed Systems
Title: Asynchronous Fault-Tolerant Language Decidability for Runtime Verification of Distributed Systems | Asynchrone Fehler-Tolerante Sprachentscheidung für die Laufzeitverifizierung von verteilten Systemen | 分布式系统运行时核查的 Al- 同步错失容忍语言 2502.00191v2 |
Authors (2): Armando Castañeda, Gilde Valeria Rodríguez
Implementing correct distributed systems is an error-prone task. Runtime Verification (RV) offers a lightweight formal method to improve reliability by monitoring system executions against correctness properties. However, applying RV in distributed settings, where no process has global knowledge, poses fundamental challenges, particularly under full asynchrony and fault tolerance. This paper addresses the Distributed Runtime Verification (DRV) problem under such conditions. In our model, each process in a distributed monitor receives a fragment of the input word describing system behavior and must decide whether this word belongs to the language representing the correctness property being verified. Hence, the goal is to decide languages in a distributed fault-tolerant manner. We propose several decidability definitions, study the relations among them, and prove possibility and impossibility results. One of our main results is a characterization of the correctness properties that can be decided asynchronously. Remarkably, it applies to any language decidability definition. Intuitively, the characterization is that only properties with no real-time order constraints can be decided in asynchronous fault-tolerant settings. These results expose the expressive limits of DRV in realistic systems, as several properties of practical interest rely on reasoning about the real-time order of events in executions. To overcome these limitations, we introduce a weaker model where the system under inspection is verified indirectly. Under this weaker model we define predictive decidability, a decidability definition that makes some real-time-sensitive correctness properties verifiable. Our framework unifies and extends existing DRV theory and sharpens the boundary of runtime monitorability under different assumptions.
执行正确的分布式系统是一个容易出错的任务。 运行时核查( RV) 提供了一个轻量级的正式方法, 用来通过监测系统执行是否正确性能来提高可靠性。 但是, 在分布式环境中( 没有任何程序具备全球知识)应用 RV 带来了根本性的挑战, 特别是在完全不同步和差分容忍的情况下。 本文涉及在这种条件下分配的运行时核查( DRV) 问题。 在我们的模型中, 分布式监视器的每个程序都接收一个描述系统行为的输入词的碎片, 并且必须决定这个词是否属于代表正在核实的正确性属性的语言。 因此, 目标是以分布式的不正确性能方式决定语言。 我们提出了几种默认性定义, 研究它们之间的关系, 并证明存在可能性和不可能的结果。 我们的主要结果之一是描述在这样的条件下, 分布式的分布式核查( DRV) 正确性能特性的正确性能特性被描述为任何语言的变异性定义。 直观的模型是, 只有没有实时约束的特性, 才能决定不精确性框架的边界环境。 因此, 我们的不精确性环境的不精确性会暴露地决定。 这些结果会暴露的精确性判断系统在现实性规则下, 在现实的精确性评估中, 下, 我们的精确性能的精确性能上, 的精确性判断下, 我们的精确性能的精确性能的精确性能的精确性判断性 。
Article 61
Title@2025-05-29 (4): Avoid Forgetting by Preserving Global Knowledge Gradients in Federated Learning with Non-IID Data
Title: Avoid Forgetting by Preserving Global Knowledge Gradients in Federated Learning with Non-IID Data | Vermeiden Sie das Vergessen, indem Sie globale Wissensgradienten im Föderierten Lernen mit nicht-ID-Daten bewahren | 避免在使用非二二二维数据进行联邦学习时因保留全球知识进步而被遗忘 2505.20485v2 |
Authors (5): Abhijit Chunduru, Majid Morafah, Mahdi Morafah, Vishnu Pandi Chellapandi, Ang Li
The inevitable presence of data heterogeneity has made federated learning very challenging. There are numerous methods to deal with this issue, such as local regularization, better model fusion techniques, and data sharing. Though effective, they lack a deep understanding of how data heterogeneity can affect the global decision boundary. In this paper, we bridge this gap by performing an experimental analysis of the learned decision boundary using a toy example. Our observations are surprising: (1) existing methods suffer from forgetting: clients forget the global decision boundary and learn only the perfect local one; and (2) this happens regardless of the initial weights: clients forget the global decision boundary even when starting from pre-trained optimal weights. In this paper, we present FedProj, a federated learning framework that robustly learns the global decision boundary and avoids forgetting it during local training. To achieve better ensemble knowledge fusion, we design a novel server-side ensemble knowledge transfer loss to further calibrate the learned global decision boundary. To alleviate the forgetting of the learned global decision boundary, we further propose leveraging an episodic memory of average ensemble logits on a public unlabeled dataset to regulate the gradient updates at each step of local training. Experimental results demonstrate that FedProj outperforms state-of-the-art methods by a large margin.
不可避免的数据差异性的存在使得联盟间学习变得非常困难。 有很多方法可以解决这个问题, 比如本地规范化、更好的模型融合技术和数据共享。 虽然这些方法有效,但它们缺乏对数据差异性如何影响全球决策界限的深刻理解。 在本文中,我们通过使用一个玩具的例子对所学决定界限进行实验性分析来弥补这一差距。 我们的观察令人惊讶:(1) 我们发现,现有方法因忘记而受损,客户忘记了全球决策界限,只学会了完美的本地边界,(2) 不论初始重量如何, 客户都忘记了全球决策界限, 甚至从经过预先训练的最佳重量开始。 在本文中,我们介绍FedProj, 是一个能强有力地学习全球决策界限并避免在当地培训中遗忘的联邦化学习框架。 为了更好地实现共同的知识融合,我们设计了一个全新的服务器方知识转移损失,以进一步校准已学的全球决定界限。 为了缓解已学到的全球决策界限被遗忘的问题,我们进一步提出利用公共无标签数据集上平均集成 logits 的情景记忆,来调节本地训练每一步的梯度更新。 实验结果表明,FedProj 大幅优于最先进的方法。
Article 62
Title@2025-05-26 (1): Fixing non-blocking data structures for better compatibility with memory reclamation schemes
Title: Fixing non-blocking data structures for better compatibility with memory reclamation schemes | Fixierung von nicht blockierenden Datenstrukturen für eine bessere Kompatibilität mit Speicher-Reklamationssystemen | 固定非阻塞性数据结构,以更好地与内存回收计划兼容 2504.06254v2 |
Authors (2): Md Amit Hasan Arovi, Ruslan Nikolaev
We present a new technique, Safe Concurrent Optimistic Traversals (SCOT), to address a well-known problem related to optimistic traversals with both classical and more recent memory reclamation schemes, such as Hazard Pointers (HP), Hazard Eras (HE), Interval-Based Reclamation (IBR), and Hyaline. Unlike Epoch-Based Reclamation (EBR), these schemes guarantee protection against stalled threads (robustness) but lack support for well-known data structures with optimistic traversals, such as Harris’ original list and the Natarajan-Mittal tree, among others. For these reclamation schemes, existing data structure implementations are either buggy (e.g., Natarajan-Mittal tree) or come with performance trade-offs (e.g., Harris-Michael modified list). A recent work, HP++, supports optimistic traversals but uses a different API and is generally slower than even HP, not to mention more recent schemes such as IBR or Hyaline. Moreover, it has undesirable applicability trade-offs and a more complex implementation, among other issues. We propose a different method which keeps existing reclamation schemes intact but instead relies on data structure adaptations. Unlike the existing Harris-Michael approach or HP++, our method retains the performance benefits of the original data structure and also does not compromise the performance of the underlying reclamation scheme. In fact, for IBR and Hyaline, our results almost match those of EBR, which often serves as a practical upper bound due to its great performance. We implement and evaluate two fundamentally different data structures: Harris’ list and the Natarajan-Mittal tree. SCOT enables their first correct implementations with optimistic traversals for HP, HE, IBR, and Hyaline.
我们提出了一种新的技术,即“安全同时乐观的轨迹 ” , 以解决一个众所周知的问题,即以古老的和较近的记忆回收计划,如危险点(HP)、危险 Eras (HE)、基于跨谷的开源(IBR)和Hyaline(IBR ) , 来解决与乐观的轨迹(robustness)相伴的乐观旅行(Safe Plat-Poptimatic Trapal)有关的乐观旅行问题。 与Epoch-BRAD(EBR ) 不同的是,这些计划保证了防范线条( robt) , 但却缺乏对众所周知的数据结构的支持,比如Harris的最初清单, Natarajan-Mittal树(Natarajan-Mittar),等等。对于这些修复计划来说,现有的数据结构的实施要么是错误的(例如,Natarajan-Mital), 要么是错误的可适用性能结果,要么是更复杂的执行。 最近的数据结构, 也保留了我们目前的数据。
Article 63
Title@2025-05-26 (1): Efficient Optimization Accelerator Framework for Multistate Ising Problems
Title: Efficient Optimization Accelerator Framework for Multistate Ising Problems | Effizientes Optimierungs-Beschleuniger-Framework für Multistate Ising-Probleme | 高效高效优化多州化问题加速加速框架 2505.20250v1 |
Authors (2): Chirag Garg, Sayeef Salahuddin
Ising Machines are a prominent class of hardware architectures that aim to solve NP-hard combinatorial optimization problems. These machines consist of a network of interacting binary spins/neurons that evolve to represent the optimum ground state energy solution. Generally, combinatorial problems are transformed into quadratic unconstrained binary optimization (QUBO) form to harness the computational efficiency of these Ising machines. However, this transformation, especially for multi-state problems, often leads to a more complex exploration landscape than the original problem, thus severely impacting the solution quality. To address this challenge, we model the spin interactions as a generalized boolean logic function to significantly reduce the exploration space. We benchmark the graph coloring problem from the class of multi-state NP-hard optimization using probabilistic Ising solvers to illustrate the effectiveness of our framework. The proposed methodology achieves similar accuracy compared to state-of-the-art heuristics and machine learning algorithms, and demonstrates significant improvement over the existing Ising methods. Additionally, we demonstrate that combining parallel tempering with our existing framework further reduces the coloring error by up to 50% compared to the conventionally used Gibbs sampling algorithm. We also design a 1024-neuron all-to-all connected probabilistic Ising accelerator that shows up to 10000x performance acceleration compared to heuristics while reducing the number of required physical neurons by 1.5-4x compared to conventional Ising machines. Indeed, this accelerator solution demonstrates improvement across all metrics over the current methods, i.e., energy, performance, area, and solution quality. Thus, this work expands the potential of existing Ising hardware to solve a broad class of these multistate optimization problems.
机器是一组突出的硬件结构, 目的是解决NP硬组合优化问题。 这些机器包括一个互动的二进制螺旋/中中子网络, 以代表最佳地面状态能源解决方案。 一般来说, 组合问题被转化成四进式的不受限制的二进制优化( QUBO) 形式, 以利用这些ISing机器的计算效率。 然而, 这种转换, 特别是对于多状态问题, 往往导致比原有问题更复杂的勘探场景, 从而严重影响解决方案的质量。 为了应对这一挑战, 我们把旋转互动模拟成一个通用的布林逻辑函数, 以显著减少探索空间。 我们用多州NP- 硬化的优化来测量图形颜色问题, 使用概率性能解决方案来说明我们框架的效能。 拟议的方法与状态的超状态和机器学习算法相类似, 并展示了所有ISing 方法的显著改进。 此外, 我们展示了与现有框架平行的旋转互动关系, 将颜色错误比起来, 质平流的计算方法比重为50 % 。 我们使用了一种常规的硬化的计算法, 。 向常规的进度的进度的进度 显示了一种比比 的进度的进度的进度 。
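To see why the QUBO transformation enlarges the exploration space for a multistate problem, consider the textbook one-hot encoding of $k$-coloring a graph $G = (V, E)$. This formulation is an assumption for illustration, as the abstract does not state the one it compares against. Each vertex $v$ gets $k$ binary variables $x_{v,c} \in \{0, 1\}$, and the energy to minimize is

$$
H(x) = A \sum_{v \in V} \Bigl( 1 - \sum_{c=1}^{k} x_{v,c} \Bigr)^{2}
     + B \sum_{(u,v) \in E} \sum_{c=1}^{k} x_{u,c}\, x_{v,c},
\qquad A, B > 0 .
$$

The first term penalizes vertices that do not have exactly one color and the second penalizes edges whose endpoints share a color. The binary search space has $2^{|V| k}$ configurations, whereas only $k^{|V|}$ of them correspond to actual color assignments; the remainder are invalid states that a solver can still wander into.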
Article 64
Title@2025-05-26 (1): FedECA: A Federated External Control Arm Method for Causal Inference with Time-To-Event Data in Distributed Settings
Title: FedECA: A Federated External Control Arm Method for Causal Inference with Time-To-Event Data in Distributed Settings | FedECA: Eine Federated External Control Arm Methode für ursächliche Schlussfolgerungen mit Zeit-bis-Event-Daten in verteilten Einstellungen | FedECA:在分布环境中利用时间到时间的数据进行因果关系推断的联邦外部控制武器法 2311.16984v9 |
Authors (25): Jean Ogier du Terrail, Quentin Klopfenstein, Honghao Li, Imke Mayer, Nicolas Loiseau, Mohammad Hallal, Michael Debouver, Thibault Camalon, Thibault Fouqueray, Jorge Arellano Castro, Zahia Yanes, Laëtitia Dahan, Julien Taïeb, Pierre Laurent-Puig, Jean-Baptiste Bachet, Shulin Zhao, Remy Nicolle, Jérome Cros, Daniel Gonzalez, Robert Carreras-Torres, Adelaida Garcia Velasco, Kawther Abdilleh, Sudheer Doss, Félix Balazard, Mathieu Andreux
External control arms (ECA) can inform the early clinical development of experimental drugs and provide efficacy evidence for regulatory approval. However, the main challenge in implementing ECA lies in accessing real-world or historical clinical trials data. Indeed, regulations protecting patients’ rights by strictly controlling data processing often make it difficult to pool data from multiple sources in a central server. To address these limitations, we develop a new method, ‘FedECA’, that leverages federated learning (FL) to enable inverse probability of treatment weighting (IPTW) for time-to-event outcomes on separate cohorts without needing to pool data. To showcase the potential of FedECA, we apply it in different settings of increasing complexity, culminating with a real-world use case in which FedECA is used to compare the treatment effect of two approved chemotherapy regimens using data from three separate cohorts of patients with metastatic pancreatic cancer. By sharing our code, we hope FedECA will foster the creation of federated research networks and thus accelerate drug development.
外部管制武器(ECA)可以为试验药物早期临床开发提供信息,并提供有效证据供监管批准。然而,实施ECA的主要挑战在于获取真实世界或历史临床试验数据。事实上,通过严格控制数据处理来保护病人的权利的条例使得很难在中央服务器上从多种来源汇集数据。为了解决这些限制,我们开发了一种新的方法,即“FedECA”利用联邦学习(FL)来利用联邦学习(FL),使不同组群的时间到活动结果的治疗权重(IPTW)反比概率(IPTW),而无需汇集数据。为了展示FedECA的潜力,我们将其应用在日益复杂的不同环境中,最终形成了一种现实世界使用案例,即使用FedECA来比较两种经批准的化疗程的治疗效果。通过分享我们的代码,我们希望FedECA将促进建立联邦研究网络,从而加速药物开发。
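For readers unfamiliar with IPTW, the weighting step can be sketched in plain Python. This is the standard centralised definition, shown only to fix ideas; it does not reproduce FedECA's federated estimation of the propensity model or its weighted time-to-event analysis, and the function name `iptw_weights` is hypothetical.

```python
def iptw_weights(treated, propensity):
    """Inverse probability of treatment weights.

    treated:    iterable of 0/1 flags (1 = received the treatment)
    propensity: iterable of estimated P(treated = 1 | covariates)
    Returns 1/e for treated patients and 1/(1 - e) for controls.
    """
    weights = []
    for t, e in zip(treated, propensity):
        if not 0.0 < e < 1.0:
            raise ValueError("propensity scores must lie strictly between 0 and 1")
        weights.append(1.0 / e if t else 1.0 / (1.0 - e))
    return weights

# A treated patient with propensity 0.5 and a control with propensity 0.2.
print(iptw_weights([1, 0], [0.5, 0.2]))
```

Patients who were unlikely to receive the arm they ended up in get larger weights, which balances the covariate distributions of the two arms before the outcome model is fitted.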
Article 65
Title@2025-05-26 (1): BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems
Title: BurstGPT: A Real-world Workload Dataset to Optimize LLM Serving Systems | BurstGPT: Ein echter Workload-Datensatz zur Optimierung von LLM-Serviersystemen | BurtGPT:优化LLM服务系统的现实世界工作量数据集 2401.17644v5 |
Authors (14): Yuxin Wang, Yuhan Chen, Zeyu Li, Xueze Kang, Yuchu Fang, Yeju Zhou, Yang Zheng, Zhenheng Tang, Xin He, Rui Guo, Xin Wang, Qiang Wang, Amelie Chi Zhou, Xiaowen Chu
Serving systems for Large Language Models (LLMs) are often optimized to improve quality of service (QoS) and throughput. However, due to the lack of open-source LLM serving workloads, these systems are frequently evaluated under unrealistic workload assumptions. Consequently, performance may degrade when systems are deployed in real-world scenarios. This work presents BurstGPT, an LLM serving workload with 10.31 million traces from regional Azure OpenAI GPT services over 213 days. BurstGPT captures LLM serving characteristics from user, model and system perspectives: (1) User request concurrency: burstiness variations of requests in Azure OpenAI GPT services, revealing diversified concurrency patterns in different services and model types. (2) User conversation patterns: counts and intervals within conversations for service optimizations. (3) Model response lengths: auto-regressive serving processes of GPT models, showing statistical relations between requests and their responses. (4) System response failures: failures of conversation and API services, showing intensive resource needs and limited availability of LLM services in Azure. The details of the characteristics can serve multiple purposes in LLM serving optimizations, such as system evaluation and trace provisioning. In our demo evaluation with BurstGPT, frequent variations in BurstGPT reveal declines in efficiency, stability, or reliability in realistic LLM serving. We identify that the generalization of KV cache management, scheduling and disaggregation optimizations can be improved under realistic workload evaluations. BurstGPT is publicly available now at https://github.com/HPMLL/BurstGPT and is widely used to develop prototypes of LLM serving frameworks in the industry.
大型语言模型服务系统(LLMS)通常得到优化,以提高服务质量(QOS)和吞吐量,然而,由于缺乏开放源码LLM服务的工作量,这些系统经常在不切实际的工作量假设下评估,因此,当系统在现实世界情景下部署时,业绩可能会下降;这项工作显示BurstGPT,一个在213天时间里从区域Azure OpenAI GPT服务处获得1,031万个微秒的工作量。BurstGPT从用户、模型和系统的角度获取LMSM服务的特点:(1)用户要求货币:Azure OpenAI GPT服务处的要求出现暴涨变化,在不同服务和模式类型中显示多种货币的汇率波动。 (2)用户对话模式:服务最优化的谈话次数和间隔。 (3) 模型反应长度:GPT模式自动递增服务进程,显示请求及其答复之间的统计关系。 (4) 系统反应失败:对话与API服务失败,显示资源需求紧张,LPM服务在Azure提供现实的LM服务。
Article 66
Title@2025-05-26 (1): Parallelizing a modern GPU simulator
Title: Parallelizing a modern GPU simulator | Parallelisierung eines modernen GPU-Simulators | 平行使用现代 GPU 模拟器 2502.14691v2 |
Authors (2): Rodrigo Huerta, Antonio González
Simulators are a primary tool in computer architecture research but are extremely computationally intensive. Simulating modern architectures with increased core counts and recent workloads can be challenging, even on modern hardware. This paper demonstrates that simulating some GPGPU workloads in a single-threaded state-of-the-art simulator such as Accel-sim can take more than five days. In this paper we present a simple approach to parallelize this simulator with minimal code changes by using OpenMP. Moreover, our parallelization technique is deterministic, so the simulator provides the same results for single-threaded and multi-threaded simulations. Compared to previous works, we achieve a higher speed-up, and, more importantly, the parallel simulation does not incur any inaccuracies. When we run the simulator with 16 threads, we achieve an average speed-up of 5.8x and reach 14x in some workloads. This allows researchers to simulate applications that take five days in less than 12 hours. By speeding up simulations, researchers can model larger systems, simulate bigger workloads, add more detail to the model, increase the efficiency of the hardware platform where the simulator is run, and obtain results sooner.
模拟器是计算机架构研究的主要工具, 但它在计算上极为密集。 模拟现代建筑, 核心计数和最近工作量增加, 即使是现代硬件也具有挑战性 。 本文显示, 模拟一些 GPGPPPU 工作量, 在一个单行最新模拟器中模拟 GPGPPU 工作量, 如 Accel- sim 模拟器, 需要超过五天时间。 本文中我们提出了一个简单的方法, 通过使用 OpenMP 将这个模拟器与最小的代码变化同步。 此外, 我们的平行技术是决定性的, 因此模拟器为单读和多读模拟器提供同样的结果 。 与以往的工程相比, 我们实现更高的速度, 更重要的是, 平行模拟器不会产生任何不准确的情况 。 当我们用 16 串线运行模拟器时, 我们实现平均速度5. 8 5 和 达到某些工作量的14x 。 这样可以让研究人员模拟应用在少于 12 小时的时间里进行5 。 通过加速模拟, 研究人员可以模拟系统模型, 模拟, 模拟器化系统, 模拟器可以模拟, 模拟系统, 模拟器模拟器模拟更大、 模拟器 模拟更大的工作量, 模拟, 模拟器 复制更大的工作量, 复制器 、 、 、 和硬件 更新器 更新器 更新器 更新到 和变变变变到 的硬件, 更精确到 。
Article 67
Title@2025-05-26 (1): Snowman for partial synchrony
Title: Snowman for partial synchrony | Schneemann für partielle Synchronisation | 部分同步的雪人 2501.15904v3 |
Authors (4): Aaron Buchwald, Stephen Buttolph, Andrew Lewis-Pye, Kevin Sekniqi
Snowman is the consensus protocol run by blockchains on Avalanche. Recent work established a rigorous proof of probabilistic consistency for Snowman in the synchronous setting, under the simplifying assumption that correct processes execute sampling rounds in ‘lockstep’. In this paper, we describe a modification of the protocol that ensures consistency in the partially synchronous setting, and when correct processes carry out successive sampling rounds at their own speed, with the time between sampling rounds determined by local message delays.
Snowman 是 Avalanche 上的区块链所运行的共识协议。最近的工作在同步设定下, 基于正确进程以“锁步”方式执行采样轮次这一简化假设, 给出了 Snowman 概率一致性的严格证明。本文描述了对该协议的一种修改, 使其在部分同步设定下仍能保证一致性, 并且允许正确进程按各自的速度执行连续的采样轮次, 采样轮次之间的时间由本地消息延迟决定。
Article 68
Title@2025-05-26 (1): Beyond Optimal Fault Tolerance
Title: Beyond Optimal Fault Tolerance | Jenseits der optimalen Fehlertoleranz | 超越最佳错失容忍 2501.06044v6 |
Authors (2): Andrew Lewis-Pye, Tim Roughgarden
The optimal fault-tolerance achievable by any protocol has been characterized in a wide range of settings. For example, for state machine replication (SMR) protocols operating in the partially synchronous setting, it is possible to simultaneously guarantee consistency against $\alpha$-bounded adversaries (i.e., adversaries that control less than an $\alpha$ fraction of the participants) and liveness against $\beta$-bounded adversaries if and only if $\alpha + 2\beta \leq 1$. This paper characterizes to what extent “better-than-optimal” fault-tolerance guarantees are possible for SMR protocols when the standard consistency requirement is relaxed to allow a bounded number $r$ of consistency violations. We prove that bounding rollback is impossible without additional timing assumptions and investigate protocols that tolerate and recover from consistency violations whenever message delays around the time of an attack are bounded by a parameter $\Delta^*$ (which may be arbitrarily larger than the parameter $\Delta$ that bounds post-GST message delays in the partially synchronous model). Here, a protocol’s fault-tolerance can be a non-constant function of $r$, and we prove, for each $r$, matching upper and lower bounds on the optimal “recoverable fault-tolerance” achievable by any SMR protocol. For example, for protocols that guarantee liveness against 1/3-bounded adversaries in the partially synchronous setting, a 5/9-bounded adversary can always cause one consistency violation but not two, and a 2/3-bounded adversary can always cause two consistency violations but not three. Our positive results are achieved through a generic “recovery procedure” that can be grafted on to any accountable SMR protocol and restores consistency following a violation while rolling back only transactions that were finalized in the previous $2\Delta^*$ timesteps.
任何协议所能达到的最优容错能力已在多种设定下得到刻画。例如,对于在部分同步设定下运行的状态机复制(SMR)协议,当且仅当 $\alpha + 2\beta \leq 1$ 时,才能同时保证对 $\alpha$ 有界敌手(即控制的参与者比例小于 $\alpha$ 的敌手)的一致性和对 $\beta$ 有界敌手的活性。本文刻画了当标准一致性要求被放宽、允许出现至多 $r$ 次一致性违规时,SMR 协议在多大程度上可以获得“优于最优”的容错保证。我们证明,若没有额外的时序假设,就不可能对回滚加以限制;并研究了在攻击发生前后消息延迟以参数 $\Delta^*$ 为界时能够容忍一致性违规并从中恢复的协议($\Delta^*$ 可以任意大于部分同步模型中约束 GST 之后消息延迟的参数 $\Delta$)。此时,协议的容错能力可以是 $r$ 的非常数函数;对每个 $r$,我们证明了任何 SMR 协议所能达到的最优“可恢复容错能力”的匹配上下界。例如,对于在部分同步设定下保证对 1/3 有界敌手具有活性的协议,5/9 有界敌手总能造成一次一致性违规但无法造成两次,2/3 有界敌手总能造成两次一致性违规但无法造成三次。我们的正面结果通过一个通用的“恢复过程”实现,该过程可以嫁接到任何可问责的 SMR 协议之上,在违规之后恢复一致性,且只回滚在此前 $2\Delta^*$ 个时间步内最终确定的交易。
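The classical trade-off quoted in the abstract, $\alpha + 2\beta \leq 1$, can be written as a small check. This is a minimal sketch of that single inequality only; the function names are illustrative, and the recoverable bounds for $r \geq 1$ violations (such as 5/9 and 2/3) are results of the paper that this sketch does not derive.

```python
from fractions import Fraction


def smr_guarantees_compatible(alpha: Fraction, beta: Fraction) -> bool:
    """Classical partially synchronous condition: consistency against
    alpha-bounded adversaries and liveness against beta-bounded adversaries
    can be guaranteed together iff alpha + 2*beta <= 1."""
    return alpha + 2 * beta <= 1


def max_consistency_threshold(beta: Fraction) -> Fraction:
    """Largest alpha allowed by alpha + 2*beta <= 1 for a given beta."""
    return 1 - 2 * beta


third = Fraction(1, 3)
print(smr_guarantees_compatible(third, third))           # 1/3 + 2/3 = 1
print(smr_guarantees_compatible(Fraction(5, 9), third))  # 5/9 + 2/3 > 1
print(max_consistency_threshold(third))
```

With liveness against 1/3-bounded adversaries, the check confirms that standard consistency cannot extend past $\alpha = 1/3$, which is the regime in which the abstract's relaxed, recoverable guarantees become relevant.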
Article 69
Title@2025-05-26 (1): Distortion Resilience for Goal-Oriented Semantic Communication
Title: Distortion Resilience for Goal-Oriented Semantic Communication | Distortion Resilienz für zielorientierte semantische Kommunikation | 目标导向语义交流的扭曲复原力 2309.14587v2 |
Authors (5): Minh-Duong Nguyen, Quang-Vinh Do, Zhaohui Yang, Quoc-Viet Pham, Won-Joo Hwang
Recent research efforts on Semantic Communication (SemCom) have mostly considered accuracy as a main problem for optimizing goal-oriented communication systems. However, these approaches introduce a paradox: the accuracy of Artificial Intelligence (AI) tasks should naturally emerge through training rather than being dictated by network constraints. Acknowledging this dilemma, this work introduces an innovative approach that leverages rate-distortion theory to analyze distortions induced by communication and compression, thereby analyzing the learning process. Specifically, we examine the distribution shift between the original data and the distorted data, thus assessing its impact on the AI model’s performance. Building on this analysis, we can preemptively estimate the empirical accuracy of AI tasks, making the goal-oriented SemCom problem feasible. To achieve this objective, we present the theoretical foundation of our approach, accompanied by simulations and experiments that demonstrate its effectiveness. The experimental results indicate that our proposed method enables accurate AI task performance while adhering to network constraints, establishing it as a valuable contribution to the field of signal processing. Furthermore, this work advances research in goal-oriented SemCom and highlights the significance of data-driven approaches in optimizing the performance of intelligent systems.
最近关于语义通信(SemCom)的研究工作大多认为准确性是优化面向目标的通信系统的一个主要问题,但是,这些方法引出了一种自相矛盾的现象:人工智能(AI)任务的准确性应该自然地通过培训而不是由网络限制决定。认识到这一困境,这项工作引入了一种创新方法,利用率扭曲理论来分析通信和压缩引起的扭曲,从而分析学习过程。具体地说,我们审查了原始数据和扭曲数据之间的分配变化,从而评估了其对AI模型绩效的影响。根据这项分析,我们可以先发制人地估计AI任务的经验准确性,使面向目标的SEMCom问题成为可行的。为了实现这一目标,我们提出了我们方法的理论基础,同时进行模拟和实验,以证明其有效性。实验结果表明,我们所提议的方法既能准确的AI任务业绩,又坚持网络限制,将其确定为对信号处理领域的宝贵贡献。此外,这项工作推进了面向目标的SEMCom的研究,并强调了数据驱动方法在优化智能系统绩效方面的重要性。
Article 70
Title@2025-05-26 (1): Optimizing edge AI models on HPC systems with the edge in the loop
Title: Optimizing edge AI models on HPC systems with the edge in the loop | Optimierung der Kanten-KI-Modelle auf HPC-Systemen mit der Kante in der Schleife | 优化循环边缘的HPC系统优化边缘 AI 模型 2505.19995v1 |
Authors (4): Marcel Aach, Cyril Blanc, Andreas Lintermann, Kurt De Grave
Artificial intelligence and machine learning models deployed on edge devices, e.g., for quality control in Additive Manufacturing (AM), are frequently small in size. Such models usually have to deliver highly accurate results within a short time frame. Methods that are commonly employed in the literature start out with larger trained models and try to reduce their memory and latency footprint by structural pruning, knowledge distillation, or quantization. It is, however, also possible to leverage hardware-aware Neural Architecture Search (NAS), an approach that seeks to systematically explore the architecture space to find optimized configurations. In this study, a hardware-aware NAS workflow is introduced that couples an edge device located in Belgium with a powerful High-Performance Computing system in Germany, to train possible architecture candidates as fast as possible while performing real-time latency measurements on the target hardware. The approach is verified on a use case in the AM domain, based on the open RAISE-LPBF dataset, achieving ~8.8 times faster inference while simultaneously enhancing model quality by a factor of ~1.35, compared to a human-designed baseline.
在边缘装置上部署的人工智能和机器学习模型,例如用于艾迪蒂夫制造业质量控制的人工智能和机器学习模型,其规模往往较小,这类模型通常必须在很短的时间内提供非常准确的结果。文献中通常使用的方法先从较大的经过培训的模型开始,然后试图通过结构裁剪、知识蒸馏或量化来减少其记忆和延缓足足迹。然而,也可以利用有硬件的神经结构搜索(NAS),这种方法寻求系统探索建筑空间以找到优化的配置。在这项研究中,引入了一个有硬件觉悟的NAS工作流程,将位于比利时的、在德国拥有强大高性能计算机系统的边缘装置配对在一起,以尽可能快的方式培训可能的建筑候选人,同时对目标硬件进行实时的耐久度测量。该方法根据开放的 SARE-LPBF数据集(NAS-LPBF),在提高模型质量的同时,以~8.8倍的速度提高模型质量,同时提高一个系数为~1.35。
Article 71
Title@2025-05-26 (1): Federated Domain Generalization with Data-free On-server Matching Gradient
Title: Federated Domain Generalization with Data-free On-server Matching Gradient | Föderierte Domain-Verallgemeinerung mit datenfreiem On-Server-Zustimmungs-Gradient | 具有无数据观测站上与渐变匹配的无数据观测器的联邦通用域 2501.14653v2 |
Authors (5): Trong-Binh Nguyen, Minh-Duong Nguyen, Jinsun Park, Quoc-Viet Pham, Won Joo Hwang
Domain Generalization (DG) aims to learn from multiple known source domains a model that can generalize well to unknown target domains. One of the key approaches in DG is training an encoder which generates domain-invariant representations. However, this approach is not applicable in Federated Domain Generalization (FDG), where data from various domains are distributed across different clients. In this paper, we introduce a novel approach, dubbed Federated Learning via On-server Matching Gradient (FedOMG), which can efficiently leverage domain information from distributed domains. Specifically, we utilize the local gradients as information about the distributed models to find an invariant gradient direction across all domains through gradient inner product maximization. The advantages are two-fold: 1) FedOMG can aggregate the characteristics of distributed models on the centralized server without incurring any additional communication cost, and 2) FedOMG is orthogonal to many existing FL/FDG methods, allowing for additional performance improvements by being seamlessly integrated with them. Extensive experimental evaluations in various settings demonstrate the robustness of FedOMG compared to other FL/FDG baselines. Our method outperforms recent SOTA baselines on four FL benchmark datasets (MNIST, EMNIST, CIFAR-10, and CIFAR-100), and three FDG benchmark datasets (PACS, VLCS, and OfficeHome).
常规化(DG) 旨在从多个已知源域中学习一个能够向未知目标域广泛推广的模型。 DG的主要方法之一是培训一个能生成域内异性表示的编码器。但是,这一方法不适用于联邦域通用化(FDG),因为来自不同领域的数据分布在不同客户之间。在本文中,我们引入了一种新颖的方法,称为通过服务器匹配梯级(FedOMG)进行联邦学习,这可以有效地利用分布域域域域的域域信息。具体地说,我们利用地方梯度作为分布模型的信息,通过内部产品渐变最大化,在所有域找到一个不变化的梯度方向。其优点有两个方面:(1) FedOMG可以将分布模型的特性汇总到中央服务器上,而不会产生额外的通信费用;(2) FedOMG与许多现有的FL/FDG方法不相近似,从而能够与这些方法紧密结合,从而能够进一步改进业绩。 具体地对各种环境进行广泛的实验性评估,以显示FOMOMG G办公室与其他FA/FAR 基准、CSA 3 CRA 基准、CSAS CIA 和 CSAR 。
Article 72
Title@2025-05-26 (1): From Few to Many Faults: Adaptive Byzantine Agreement with Optimal Communication
Title: From Few to Many Faults: Adaptive Byzantine Agreement with Optimal Communication | Von wenigen bis zu vielen Fehlern: Adaptive byzantinische Vereinbarung mit optimaler Kommunikation | 从少到多的错失:适应性拜占庭协议与最佳沟通 2505.19989v1 |
Authors (4): Andrei Constantinescu, Marc Dufay, Anton Paramonov, Roger Wattenhofer
Achieving agreement among distributed parties is a fundamental task in modern systems, underpinning applications such as consensus in blockchains, coordination in cloud infrastructure, and fault tolerance in critical services. However, this task can be communication-intensive, often requiring a large number of messages to be exchanged, especially in the presence of Byzantine faults, making efficiency a central challenge in the design of practical agreement protocols. In this paper, we study the problem of Strong Byzantine Agreement and establish tight upper and lower bounds on communication complexity, parameterized by the actual number of Byzantine faults. Specifically, for a system of $n$ parties tolerating up to $t$ Byzantine faults, out of which only $f \leq t$ are actually faulty, we obtain the following results: In the partially synchronous setting, we present the first Byzantine Agreement protocol that achieves adaptive communication complexity of $\mathcal{O}(n + t \cdot f)$ words, which is asymptotically optimal. Our protocol has an optimal resilience of $t < n/3$. In the asynchronous setting, we prove a lower bound of $\Omega(n + t^2)$ on the expected number of messages, and design an almost matching protocol with an optimal resilience that solves agreement with $\mathcal{O}((n + t^2)\cdot \log n)$ words. Our main technical contribution in the asynchronous setting is the utilization of a bipartite expander graph that allows for low-cost information dissemination.
在现代系统中,实现分布式缔约方之间的协议是一项基本任务,它支持了各种应用,例如链链中的共识、云层基础设施的协调以及关键服务中的过错容忍度。然而,这项任务可以是通信密集型的,常常需要大量信息交换,特别是在拜占庭断层出现时,效率成为设计实际协议协议协议的主要挑战。在本文件中,我们研究强大的拜占庭协议问题,并在通信复杂性上方和下方设置紧凑的界限,以拜占庭断层的实际数量为参数。具体地说,对于一个控制最高至拜占庭断层的美元缔约方系统,其中只有美元=莱克特特元的错误,我们得到的结果如下:在部分同步的环境下,我们介绍第一个实现美元和亚占市的适应性通信复杂性的拜占庭协议协议(n+t\cdockt f) 的上方和下方字节,这是最优化的。我们的协议具有最优化的耐用性($ < 拜占市 3美元 ) 的耐用性,我们所预期的正价协议使得我们能够将美元和正值的内置成正价协议。
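To see what the adaptive bound $\mathcal{O}(n + t \cdot f)$ buys, the sketch below evaluates its leading-order term for different numbers of actual faults $f$. Constants hidden by the $\mathcal{O}$-notation are dropped and the function name is illustrative, so the numbers indicate scaling only, not real message counts.

```python
def adaptive_words(n: int, t: int, f: int) -> int:
    """Leading-order term n + t*f of the adaptive communication bound.

    n: number of parties, t: tolerated Byzantine faults (t < n/3),
    f: actual number of faults (f <= t). Constants are dropped.
    """
    if not (0 <= f <= t and 3 * t < n):
        raise ValueError("require 0 <= f <= t < n/3")
    return n + t * f


n, t = 100, 33
for f in (0, 1, 10, t):
    print(f"f={f:2d}: ~{adaptive_words(n, t, f)} words")
```

With no actual faults the cost is linear in $n$, and it grows to $n + t^2$ only when all $t$ tolerated faults actually occur.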
Article 73
Title@2025-05-26 (1): Differential Privacy Analysis of Decentralized Gossip Averaging under Varying Threat Models
Title: Differential Privacy Analysis of Decentralized Gossip Averaging under Varying Threat Models | Differential Privacy Analyse dezentralisierter Gossip Average unter unterschiedlichen Bedrohungsmodellen | 对不同威胁模式下分散的流民的隐私差异分析 2505.19969v1 |
Authors (2): Antti Koskela, Tejas Kulkarni
Fully decentralized training of machine learning models offers significant advantages in scalability, robustness, and fault tolerance. However, achieving differential privacy (DP) in such settings is challenging due to the absence of a central aggregator and varying trust assumptions among nodes. In this work, we present a novel privacy analysis of decentralized gossip-based averaging algorithms with additive node-level noise, both with and without secure summation over each node’s direct neighbors. Our main contribution is a new analytical framework based on a linear systems formulation that accurately characterizes privacy leakage across these scenarios. This framework significantly improves upon prior analyses, for example, reducing the Rényi DP parameter growth from $O(T^2)$ to $O(T)$, where $T$ is the number of training rounds. We validate our analysis with numerical results demonstrating superior DP bounds compared to existing approaches. We further illustrate our analysis with a logistic regression experiment on MNIST image classification in a fully decentralized setting, demonstrating utility comparable to central aggregation methods.
对机器学习模式的全面分散化培训在可扩缩性、稳健性和差错容忍度方面有很大的优势。然而,由于节点之间缺乏中央聚合器和不同的信任假设,在这种环境下实现不同的隐私(DP)具有挑战性。在这项工作中,我们提出了对分散的八卦平均算法进行新的隐私分析,这些算法带有添加节点级噪音,无论是否对每个节点的直接邻居进行安全搭配。我们的主要贡献是建立在线性系统配方基础上的新的分析框架,准确描述这些情景的隐私渗漏。这个框架在先前的分析基础上大有改进,例如,将R'enyi DP参数从$O(T)2)美元减到$O(T)美元,而美元是培训回合的数量。我们用数字结果验证我们的分析,表明与现有方法相比,DP的界限优。我们进一步用一个逻辑回归实验来说明我们的分析,在完全分散的环境中对MNIST图像分类进行后勤回归试验,显示与中央汇总方法的效用。
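Gossip averaging with additive node-level noise can be sketched as a linear iteration, which is the kind of system the abstract's analysis formulates. The example below is an illustration under stated assumptions rather than the authors' algorithm: the ring topology, the uniform 1/3 mixing weights, the choice to perturb every shared value (including a node's own), and all names are hypothetical, and no privacy accounting is performed.

```python
import random


def ring_gossip_matrix(n: int) -> list:
    """Doubly stochastic mixing matrix for a ring: every node averages its
    own value and its two neighbours' values with weight 1/3 each."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i][j % n] += 1.0 / 3.0
    return W


def noisy_gossip(values: list, rounds: int, sigma: float, seed: int = 0) -> list:
    """Run x <- W (x + noise) for a fixed number of rounds.

    Each node adds Gaussian noise with standard deviation sigma to its state
    before sharing it (no secure summation in this variant).
    """
    rng = random.Random(seed)
    n = len(values)
    W = ring_gossip_matrix(n)
    x = list(values)
    for _ in range(rounds):
        shared = [xi + rng.gauss(0.0, sigma) for xi in x]
        x = [sum(W[i][j] * shared[j] for j in range(n)) for i in range(n)]
    return x


start = [0.0, 3.0, 6.0, 9.0, 12.0]
print(noisy_gossip(start, rounds=200, sigma=0.0))  # noiseless: consensus on the mean
print(noisy_gossip(start, rounds=200, sigma=0.1))  # noisy: values hover near the mean
```

Because the update is linear in the state and in the injected noise, the noise that accumulates after $T$ rounds can be studied through powers of the mixing matrix.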
Article 74
Title@2025-05-26 (1): Universal Workers: A Vision for Eliminating Cold Starts in Serverless Computing
Title: Universal Workers: A Vision for Eliminating Cold Starts in Serverless Computing | Universal Workers: Eine Vision zur Beseitigung von Kaltstarts im serverlosen Computing | 普遍工人:在无服务器计算机中消除冷源的愿景 2505.19880v1 |
Authors (2): Saman Akbari, Manfred Hauswirth
Serverless computing enables developers to deploy code without managing infrastructure, but suffers from cold start overhead when initializing new function instances. Existing solutions such as “keep-alive” or “pre-warming” are costly and unreliable under bursty workloads. We propose universal workers, which are computational units capable of executing any function with minimal initialization overhead. Based on an analysis of production workload traces, our key insight is that requests in Function-as-a-Service (FaaS) platforms show a highly skewed distribution, with most requests invoking a small subset of functions. We exploit this observation to approximate universal workers through locality groups and three-tier caching (handler, install, import). With this work, we aim to enable more efficient and scalable FaaS platforms capable of handling diverse workloads with minimal initialization overhead.
无服务器计算使开发者能够在不管理基础设施的情况下部署代码,但在启动新功能实例时会遇到冷却的启动间接费用。 现有解决方案,如“ 维持生命” 或“ 升温前” ,在爆发性工作量下成本高昂且不可靠。 我们提议通用工人,它们是计算单位,能够以最低初始化间接费用执行任何功能。 根据对生产工作量跟踪的分析,我们的主要洞察力是,在“ 功能-as-a-Service(FaaaS) ” (FaaaS) 平台中,要求的分布高度偏斜,大多数请求都引用了一小部分功能。 我们利用这一观察,通过地点组和三层缓冲( handler、安装、进口)来接近通用工人。 我们通过这项工作,我们的目标是使能够更有效和可扩展的FaaaS平台能够以最小初始化间接费用处理多种工作量。
Article 75
Title@2025-05-26 (1): DGRAG: Distributed Graph-based Retrieval-Augmented Generation in Edge-Cloud Systems
Title: DGRAG: Distributed Graph-based Retrieval-Augmented Generation in Edge-Cloud Systems | DGRAG: Distributed Graph-based Retrieval-Augmented Generation in Edge-Cloud-Systemen | DGGGAG: 在边缘封闭系统中分布的基于图图的回收回源-养代 2505.19847v1 |
Authors (3): Wenqing Zhou, Yuxuan Yan, Qianqian Yang
Retrieval-Augmented Generation (RAG) has emerged as a promising approach to enhance the capabilities of language models by integrating external knowledge. Due to the diversity of data sources and the constraints of memory and computing resources, real-world data is often scattered in multiple devices. Conventional RAGs that store massive amounts of scattered data centrally face increasing privacy concerns and high computational costs. Additionally, RAG in a central node raises latency issues when searching over a large-scale knowledge base. To address these challenges, we propose a distributed Knowledge Graph-based RAG approach, referred to as DGRAG, in an edge-cloud system, where each edge device maintains a local knowledge base without the need to share it with the cloud, instead sharing only summaries of its knowledge. Specifically, DGRAG has two main phases. In the Distributed Knowledge Construction phase, DGRAG organizes local knowledge using knowledge graphs, generating subgraph summaries and storing them in a summary database in the cloud as information sharing. In the Collaborative Retrieval and Generation phase, DGRAG first performs knowledge retrieval and answer generation locally, and a gate mechanism determines whether the query is beyond the scope of local knowledge or processing capabilities. For queries that exceed the local knowledge scope, the cloud retrieves knowledge from the most relevant edges based on the summaries and generates a more precise answer. Experimental results demonstrate the effectiveness of the proposed DGRAG approach in significantly improving the quality of question-answering tasks over baseline approaches.
由于数据来源的多样性以及记忆和计算资源的限制,真实世界数据往往分散在多种设备中。存储大量分散数据的常规区域组面临越来越多的隐私关切和高昂的计算成本。此外,中央节点中的区域组在搜索大型知识库时会提高潜伏问题。为了应对这些挑战,我们提议在边缘悬崖系统中采用分布式知识图表方法,称为DGRAG, 在每个边缘装置都维持一个本地知识库而无需与云共享,而只分享其知识摘要。具体地说,DGRAG有两个主要阶段。在分散式知识建设阶段,DGRAG利用知识图组织本地知识,生成子图摘要,并将其储存在云层的汇总数据库中作为信息共享。在协作检索和生成阶段,DGRAG首先进行知识的本地级检索和回答,在对本地基线进行最深入的搜索时,对本地的准确的定位能力进行超越了本地范围,对本地的升级的定位,对本地的定位的定位,对本地范围进行更精确的检索,对本地的定位,对本地范围进行更精确的检索,对本地的定位的查询能力超过本地范围。
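The two-stage flow described above (answer locally, then let a gate decide whether to escalate to the cloud, which routes by summaries only) might be sketched as follows. Everything here is a hypothetical stand-in: keyword overlap replaces knowledge-graph retrieval and LLM generation, the `gate_threshold` value is arbitrary, and the cloud simply returns the best-matching edge's answer instead of generating a new one.

```python
from dataclasses import dataclass


@dataclass
class EdgeNode:
    name: str
    knowledge: dict  # hypothetical local knowledge base: topic -> fact
    summary: set     # topics shared with the cloud instead of raw data

    def local_answer(self, query: str):
        """Return (answer, score); score is the fraction of query words
        that match a topic in the local knowledge base."""
        words = query.lower().split()
        hits = [w for w in words if w in self.knowledge]
        score = len(hits) / len(words) if words else 0.0
        return "; ".join(self.knowledge[w] for w in hits), score


def answer_query(query: str, origin: EdgeNode, edges: list,
                 gate_threshold: float = 0.5):
    """Answer locally when the gate accepts; otherwise let the cloud pick
    the most relevant edge using only the shared summaries."""
    answer, score = origin.local_answer(query)
    if score >= gate_threshold:
        return origin.name, answer
    words = set(query.lower().split())
    best = max(edges, key=lambda e: len(words & e.summary))
    return best.name, best.local_answer(query)[0]


a = EdgeNode("A", {"paxos": "Paxos tolerates crash faults"}, {"paxos"})
b = EdgeNode("B", {"raft": "Raft elects a leader per term"}, {"raft"})
print(answer_query("paxos", a, [a, b]))  # served locally by A
print(answer_query("raft", a, [a, b]))   # escalated and routed to B
```

The point of the structure is that raw local knowledge never leaves an edge device; only the `summary` sets are visible to the routing step.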
Article 76
Title@2025-05-26 (1): Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices
Title: Will LLMs Scaling Hit the Wall? Breaking Barriers via Distributed Resources on Massive Edge Devices | Wird LLMs Skalierung die Wand treffen? Über verteilte Ressourcen auf massiven Edge-Geräten Barrieren überwinden | LLLMs SUlia扩大会撞上隔离墙吗?通过大规模边缘装置分配资源打破障碍 2503.08223v2 |
Authors (6): Tao Shen, Didi Zhu, Ziyu Zhao, Zexi Li, Chao Wu, Fei Wu
The remarkable success of foundation models has been driven by scaling laws, demonstrating that model performance improves predictably with increased training data and model size. However, this scaling trajectory faces two critical challenges: the depletion of high-quality public data, and the prohibitive computational power required for larger models, which have been monopolized by tech giants. These two bottlenecks pose significant obstacles to the further development of AI. In this position paper, we argue that leveraging massive distributed edge devices can break through these barriers. We reveal the vast untapped potential of data and computational resources on massive edge devices, and review recent technical advancements in distributed/federated learning that make this new paradigm viable. Our analysis suggests that by collaborating on edge devices, everyone can participate in training large language models with small edge devices. This paradigm shift towards distributed training on edge has the potential to democratize AI development and foster a more inclusive AI community.
基础模型的显著成功是由规模法驱动的,这表明模型业绩随着培训数据和模型规模的增加而可预见地得到改善。然而,这一规模轨迹面临两个重大挑战:高质量公共数据的耗竭,以及大型模型所需的令人望而却步的计算能力,而大型模型已经被技术巨头所垄断。这两个瓶颈对AI的进一步发展构成重大障碍。在本立场文件中,我们认为利用大规模分布式边缘装置可以突破这些障碍。我们揭示了大规模边缘装置的数据和计算资源的巨大未开发潜力,并审查了分布式/联邦化学习的最新技术进展,使这一新模式具有可行性。我们的分析表明,通过在边缘装置上的合作,每个人都可以参与用小型边缘装置培训大型语言模型。这种向边缘分布式培训的转变,有可能使人工智能开发民主化,并培养更具包容性的人工智能社区。
Article 77
Title@2025-05-26 (1): A Unified, Practical, and Understandable Model of Non-transactional Consistency Levels in Distributed Replication
Title: A Unified, Practical, and Understandable Model of Non-transactional Consistency Levels in Distributed Replication | Ein einheitliches, praktisches und verständliches Modell nichttransaktionsfähiger Konsistenzstufen in verteilter Replikation | 分布式重复中非交易一致性水平的统一、实用和可理解的模式 2409.01576v4 |
Authors (3): Guanzhou Hu, Andrea Arpaci-Dusseau, Remzi Arpaci-Dusseau
We present a practical model of non-transactional consistency levels in the context of distributed data replication. Unlike prior work, our simple Shared Object Pool (SOP) model defines common consistency levels in a unified framework centered around the single concept of ordering. This naturally reflects modern cloud object storage services and is thus easy to understand. We show that a consistency level can be intuitively defined by specifying two types of constraints on the validity of orderings allowed by the level: convergence, which bounds the lineage shape of the ordering, and relationship, which bounds the relative positions between operations. We give examples of representative protocols and systems, and discuss their availability upper bound. To further demonstrate the expressiveness and practical relevance of our model, we use it to implement a Jepsen-integrated consistency checker for the four most common levels (linearizable, sequential, causal+, and eventual); the checker analyzes consistency conformity for small-scale histories of real system runs (etcd, ZooKeeper, and RabbitMQ).
与先前的工作不同,我们简单的共同对象库(SOP)模型在以单一定购概念为中心的统一框架内确定了共同一致性水平。这自然反映了现代云端存储服务,因此容易理解。我们表明,一致性水平可以直截了当地界定,具体指明对允许的定购的有效性的两种限制:趋同,它约束定购的线条形状,以及连接操作相对位置的关系。我们举了具有代表性的协议和系统的例子,并讨论了它们的可用性。为了进一步展示我们模型的清晰度和实际相关性,我们用它来对四种最普通的等级(可线性、相继性、因果+和最终)实施杰普森综合一致性检查;检查器分析实际系统运行的小规模历史的一致性(特克、祖韦尔和拉比特MQ ) 。
Article 78
Title@2025-05-26 (1): Justin: Hybrid CPU/Memory Elastic Scaling for Distributed Stream Processing
Title: Justin: Hybrid CPU/Memory Elastic Scaling for Distributed Stream Processing | Justin: Hybride CPU/Memory Elastic Scaling für verteilte Stream-Verarbeitung | Justin: 用于分布流处理的混合 CPU/Memory Elastic 缩放比例 2505.19739v1 |
Authors (3): Donatien Schmitz, Guillaume Rosinosky, Etienne Rivière
Distributed Stream Processing (DSP) engines analyze continuous data via queries expressed as a graph of operators. Auto-scalers adjust the number of parallel instances of these operators to support a target rate. Current auto-scalers couple CPU and memory scaling, allocating resources as one-size-fits-all packages. This contrasts with operators’ high diversity of requirements. We present Justin, an auto-scaler that enables hybrid CPU and memory scaling of DSP operators. Justin monitors both CPU usage and the performance of operators’ storage operations. Its mechanisms enable fine-grained memory allocation for tasks upon a query reconfiguration. The Justin policy identifies individual operators’ memory pressure and decides between adjusting parallelism and/or memory assignment. We implement Justin in Apache Flink, extending the Flink Kubernetes Operator and the DS2 CPU-only auto-scaler. Using the Nexmark benchmark, our evaluation shows that Justin identifies suitable resource allocation in as many or fewer reconfiguration steps as DS2 and supports a target rate with significantly fewer CPU and memory resources.
分布式流处理引擎( DSP) 引擎通过以操作员图表表示的查询分析连续数据。 自动标尺者调整这些操作员的平行事件数量以支持目标率。 当前自动标尺者将CPU和记忆缩放组合, 将资源分配成一刀切的软件包。 这与操作员的要求差异很大形成鲜明对比。 我们介绍Justin, 一个允许混合CPU和存储操作员的记忆缩放的自动标尺者。 Justin 监测 CPU 的使用情况和操作员存储操作的绩效。 它的机制允许在查询重组时对任务进行细微的记忆分配。 Justin 政策确定了单个操作员的内存压力, 并在调整平行和/ 或记忆任务之间做出决定 。 我们在 Acapish Flink 执行 Justin, 扩展 Flink Kubernetes 操作员和 DS2 PUP 唯一的自动标尺。 我们的评价显示, Justin 使用Nexmark 基准, 确定像 DS2 那样的许多或更少的重组步骤的合适资源配置, 支持目标率, 并大大降低 CUPUPU和记忆资源。
Article 79
Title@2025-05-26 (1): Towards Optimal Distributed Edge Coloring with Fewer Colors
Title: Towards Optimal Distributed Edge Coloring with Fewer Colors | Auf dem Weg zu einer optimalen verteilten Randfärbung mit weniger Farben | 向最优化分布式边缘配色,颜色更少 2504.13003v2 |
Authors (3): Manuel Jakob, Yannic Maus, Florian Schager
There is a huge difference in techniques and runtimes of distributed algorithms for problems that can be solved by a sequential greedy algorithm and those that cannot. A prime example of this contrast appears in the edge coloring problem: while $(2\Delta-1)$-edge coloring can be solved in $\mathcal{O}(\log^{\ast}(n))$ rounds on constant-degree graphs, the seemingly minor reduction to $(2\Delta-2)$ colors leads to an $\Omega(\log n)$ lower bound [Chang, He, Li, Pettie & Uitto, SODA’18]. Understanding this sharp divide between very local problems and inherently more global ones remains a central open question in distributed computing and is a core focus of this paper. As our main contribution we design a deterministic distributed $\mathcal{O}(\log n)$-round reduction from the $(2\Delta-2)$-edge coloring problem to the much easier $(2\Delta-1)$-edge coloring problem. This reduction is optimal, as the $(2\Delta-2)$-edge coloring problem admits an $\Omega(\log n)$ lower bound, whereas the $(2\Delta-1)$-edge coloring problem can be solved in $\mathcal{O}(\log^{\ast} n)$ rounds. By plugging in the $(2\Delta-1)$-edge coloring algorithms from [Balliu, Brandt, Kuhn & Olivetti, PODC’22] running in $\mathcal{O}(\log^{12}\Delta + \log^{\ast} n)$ rounds, we obtain an optimal runtime of $\mathcal{O}(\log n)$ rounds as long as $\Delta = 2^{\mathcal{O}(\log^{1/12} n)}$. Furthermore, on general graphs our reduction improves the runtime from $\widetilde{\mathcal{O}}(\log^3 n)$ to $\widetilde{\mathcal{O}}(\log^{5/3} n)$. In addition, we also obtain an optimal $\mathcal{O}(\log \log n)$-round randomized reduction of $(2\Delta-2)$-edge coloring to $(2\Delta-1)$-edge coloring. Lastly, we obtain an $\mathcal{O}(\log_\Delta n)$-round reduction from the $(2\Delta-1)$-edge coloring problem, albeit to the somewhat harder maximal independent set (MIS) problem.
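The reason $(2\Delta-1)$-edge coloring counts as a "greedy" problem is that a sequential pass always succeeds: an edge shares an endpoint with at most $2(\Delta-1)$ other edges, so one of $2\Delta-1$ colors is always free. The sketch below shows that sequential argument on a simple graph (no self-loops or parallel edges); it is not the distributed algorithm or the reduction from the paper, and the function names are illustrative.

```python
def greedy_edge_coloring(edges: list) -> dict:
    """Color edges one by one with the smallest color not already used
    on an edge sharing an endpoint. Assumes a simple graph."""
    used = {}      # vertex -> set of colors on its incident edges
    coloring = {}  # edge -> color (colors are 0, 1, 2, ...)
    for u, v in edges:
        taken = used.setdefault(u, set()) | used.setdefault(v, set())
        color = 0
        while color in taken:
            color += 1
        coloring[(u, v)] = color
        used[u].add(color)
        used[v].add(color)
    return coloring


def max_degree(edges: list) -> int:
    """Maximum vertex degree (Delta) of the graph given by its edge list."""
    degree = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    return max(degree.values())


# Complete graph K4: Delta = 3, so at most 2*3 - 1 = 5 colors are needed.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
colors = greedy_edge_coloring(k4)
print(colors)
print("colors used:", len(set(colors.values())), "limit:", 2 * max_degree(k4) - 1)
```

With only $2\Delta-2$ colors this counting argument no longer guarantees a free color, which is where the $\Omega(\log n)$ lower bound cited in the abstract enters.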
Article 80
Title@2025-05-26 (1): Byzantine Consensus in the Random Asynchronous Model
Title: Byzantine Consensus in the Random Asynchronous Model | Byzantinischer Konsens im zufälligen asynchronen Modell | 随机非同步模型中的拜占庭共识 2502.09116v2 |
Authors (5): George Danezis, Jovan Komatovic, Lefteris Kokoris-Kogias, Alberto Sonnino, Igor Zablotchi
We propose a novel relaxation of the classic asynchronous network model, called the random asynchronous model, which removes adversarial message scheduling while preserving unbounded message delays and Byzantine faults. Instead of an adversary dictating message order, delivery follows a random schedule. We analyze Byzantine consensus at different resilience thresholds ($n=3f+1$, $n=2f+1$, and $n=f+2$) and show that our relaxation allows consensus with probabilistic guarantees which are impossible in the standard asynchronous model or even the partially synchronous model. We complement these protocols with corresponding impossibility results, establishing the limits of consensus in the random asynchronous model.
我们建议对经典的零星网络模式进行新颖的放松,称之为随机的零星模式,它消除了对抗性信息排期,同时保留了未受约束的信息延迟和拜占庭断层。我们不采用对手指令发送信息的命令,而是遵循随机时间表。我们按照不同的弹性阈值分析拜占庭共识($=3f+1$,$n=2f+1$,$n=2f+2$),并表明我们的放松允许以概率保证达成共识,而这种保证在标准的非同步模式甚至部分同步模式中是不可能的。我们用相应的不可能结果来补充这些协议,在随机非同步模式中确定共识的限度。
Article 81
Title@2025-05-26 (1): Mosaic: Data-Free Knowledge Distillation via Mixture-of-Experts for Heterogeneous Distributed Environments
Title: Mosaic: Data-Free Knowledge Distillation via Mixture-of-Experts for Heterogeneous Distributed Environments | Mosaic: Datenfreies Wissen Destillieren über Mixture-of-Experts für Heterogene verteilte Umgebungen | Mosaic:通过混合专家进行无数据知识蒸馏,促进异基因分布式环境 2505.19699v1 |
Authors (9): Junming Liu, Yanting Gao, Siyuan Meng, Yifei Sun, Aoqi Wu, Yufei Jin, Yirong Chen, Ding Wang, Guosun Zeng
Federated Learning (FL) is a decentralized machine learning paradigm that enables clients to collaboratively train models while preserving data privacy. However, the coexistence of model and data heterogeneity gives rise to inconsistent representations and divergent optimization dynamics across clients, ultimately hindering robust global performance. To transcend these challenges, we propose Mosaic, a novel data-free knowledge distillation framework tailored for heterogeneous distributed environments. Mosaic first trains local generative models to approximate each client’s personalized distribution, enabling synthetic data generation that safeguards privacy through strict separation from real data. Subsequently, Mosaic forms a Mixture-of-Experts (MoE) from client models based on their specialized knowledge, and distills it into a global model using the generated data. To further enhance the MoE architecture, Mosaic integrates expert predictions via a lightweight meta model trained on a few representative prototypes. Extensive experiments on standard image classification benchmarks demonstrate that Mosaic consistently outperforms state-of-the-art approaches under both model and data heterogeneity. The source code has been published at https://github.com/Wings-Of-Disaster/Mosaic.
联邦学习联合会(FL)是一个分散的机械学习模式,使客户能够合作培训模型,同时保护数据隐私;然而,模型和数据差异的共存导致不同客户的表达和优化动态不一致,最终阻碍全球业绩的稳健;为了克服这些挑战,我们提议采用针对不同分布环境的无数据新颖知识蒸馏框架Mosaic。Mosaic首先培训当地基因化模型,以近似每个客户的个人化分布,使合成数据生成能够通过严格与真实数据分离来保障隐私。随后,Mosaic以其专业知识为基础,从客户模型中形成了一个混合专家模型(MOE),并用生成的数据将其提炼成一个全球模型。为了进一步加强MOE的结构,Mosaic将专家预测通过在少数具有代表性的原型上培训的轻量的元模型整合为专家预测。关于标准图像分类基准的广泛实验表明,Mosaic一贯地超越模型和数据异质特性下的状态方法。源代码已在 https://github.com/Wasiming-Das-DIS-DIS-DIS。
Article 82
Title@2025-05-26 (1): PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving
Title: PRESERVE: Prefetching Model Weights and KV-Cache in Distributed LLM Serving | PRESSERVE: Prefetching Modellgewichte und KV-Cache in verteilter LLM-Servierung | PRESSERVE: 分布式LLM服务中的预伸缩模型重量和 KV-缓冲 2501.08192v2 |
Authors (3): Ahmet Caner Yüzügüler, Jiawei Zhuang, Lukas Cavigelli
Large language models (LLMs) are typically served from clusters of GPUs/NPUs that consist of a large number of devices. Unfortunately, communication between these devices incurs significant overhead, increasing inference latency and cost while limiting scalability. Prior work addressed this issue by overlapping communication with compute, but has severe limitations due to the data dependencies between these operations. In this paper, we propose PRESERVE, a novel framework that prefetches model weights and KV-cache from off-chip HBM memory to the on-chip cache of AI accelerators during the communication operations, which offers various advantages and performance improvements compared to prior methods. Through extensive experiments conducted on commercial AI accelerators, we demonstrate up to 1.6x end-to-end speedup on state-of-the-art, open-source LLMs. Additionally, we perform a design space exploration that identifies the optimal hardware configuration for the proposed method, showing a further 1.25x improvement in performance per cost by selecting the optimal L2 cache size. Our results show that PRESERVE has the potential to mitigate memory bottlenecks and communication overheads, offering a solution to improve the performance and scalability of LLM inference systems.
大型语言模型(LLMS)通常来自由大量装置组成的GPU/NPU群群集。 不幸的是,这些装置之间的通信产生了巨大的间接费用,增加了推导潜值和成本,同时限制了可缩放性。以前的工作是通过与计算器的通信重叠来解决这个问题的,但由于这些操作之间的数据依赖性,我们建议PRESSERVE,这是一个新的框架,它预设模型重量和从离芯HBM存储的KV缓漏到通信操作期间AI加速器的芯片缓存,与以前的方法相比,具有各种优势和性能改进。通过对商用AI加速器进行的广泛试验,我们展示了多达1.6x端到端的加速,但由于这些操作之间的数据依赖性能。此外,我们进行了设计空间探索,确定了拟议方法的最佳硬件配置,通过选择最佳的L2缓存大小,表明每个成本的性能有1.25x的改进。我们的结果表明,PRESERVE在降低存储力和高层通信系统的能力方面有潜力,从而降低存储力和高层通信系统。
Article 83
Title@2025-05-26 (1): Scaling Large-scale GNN Training to Thousands of Processors on CPU-based Supercomputers
Title: Scaling Large-scale GNN Training to Thousands of Processors on CPU-based Supercomputers | Skalierung von großformatigen GNN-Schulungen zu Tausenden von Prozessoren auf CPU-basierten Supercomputern | 向数千台基于CPU的超级计算机处理器提供大规模GNN培训 2411.16025v2 |
Authors (11): Chen Zhuang, Lingqi Zhang, Du Wu, Peng Chen, Jiajun Huang, Xin Liu, Rio Yokota, Nikoli Dryden, Toshio Endo, Satoshi Matsuoka, Mohamed Wahib
Graph Convolutional Networks (GCNs), particularly for large-scale graphs, are crucial across numerous domains. However, training distributed full-batch GCNs on large-scale graphs suffers from inefficient memory access patterns and high communication overhead. To address these challenges, we introduce an efficient and scalable distributed GCN training framework tailored for CPU-powered supercomputers. Our contributions are threefold: (1) we develop general and efficient aggregation operators designed for irregular memory access, (2) we propose a hierarchical aggregation scheme that reduces communication costs without altering the graph structure, and (3) we present a communication-aware quantization scheme to enhance performance. Experimental results demonstrate that the framework achieves a speedup of up to $6\times$ compared with the SoTA implementations, and scales to thousands of HPC-grade CPUs on the largest publicly available datasets, without sacrificing model convergence and accuracy. Moreover, due to the effective strong scaling of the framework, we outperform SoTA GPU-based and CPU-based distributed full-batch GCN training frameworks, in absolute performance, for large-scale graphs.
特别是对于大型图表而言,革命式的图形网络(GCNs)在许多领域都至关重要。然而,在大型图表上分布的全批GCNs的培训具有低效的内存访问模式和高通信管理费。为了应对这些挑战,我们引入了为CPU动力超级计算机定制的高效且可扩展的GCN培训框架。我们的贡献有三重:(1) 我们开发了用于非常规存储存取的一般性高效聚合操作员;(2) 我们建议了一个等级汇总计划,在不改变图形结构的情况下降低通信成本;(3) 我们提出了一个通信-觉醒量化计划,以提高性能。实验结果表明,与 SoTA 执行相比,\ method 实现了高达6美元/小时的加速; 在最大公开的数据集中,将HPC级的CPS级CPS比例提高到1000秒,同时不牺牲模型的趋同性和准确性。此外,由于\method的有效大规模缩缩放,我们超了SoATA GPU基础和CPU基础分布的全称GCN培训框架。
Article 84
Title@2025-05-26 (1): Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLMs
Title: Win Fast or Lose Slow: Balancing Speed and Accuracy in Latency-Sensitive Decisions of LLMs | Gewinnen Sie schnell oder verlieren Sie langsam: Ausgleichende Geschwindigkeit und Genauigkeit in Latenz-Sensitive Entscheidungen von LLMs | 慢赢或慢输:LLMs的延缓敏感决定中平衡速度和准确性 2505.19481v1 |
Authors (7): Hao Kang, Qingru Zhang, Han Cai, Weiyuan Xu, Tushar Krishna, Yilun Du, Tsachy Weissman
Large language models (LLMs) have shown remarkable performance across diverse reasoning and generation tasks, and are increasingly deployed as agents in dynamic environments such as code generation and recommendation systems. However, many real-world applications, such as high-frequency trading and real-time competitive gaming, require decisions under strict latency constraints, where faster responses directly translate into higher rewards. Despite the importance of this latency-quality trade-off, it remains underexplored in the context of LLM-based agents. In this work, we present the first systematic study of this trade-off in real-time decision-making tasks. To support our investigation, we introduce two new benchmarks: HFTBench, a high-frequency trading simulation, and StreetFighter, a competitive gaming platform. Our analysis reveals that the optimal latency-quality balance varies by task, and that sacrificing quality for lower latency can significantly enhance downstream performance. To address this, we propose FPX, an adaptive framework that dynamically selects model size and quantization level based on real-time demands. Our method achieves the best performance on both benchmarks, improving win rate by up to 80% in Street Fighter and boosting daily yield by up to 26.52% in trading. These results demonstrate the critical importance of latency-aware evaluation and deployment strategies for real-world LLM-based agents. Our benchmarks are available at Latency Sensitive Benchmarks.
大型语言模型(LLMS)在各种推理和代代任务中表现出了显著的绩效,并越来越多地作为代号生成和建议系统等动态环境中的代理商被部署。然而,许多现实世界应用软件,如高频交易和实时竞争性竞技游戏等,都需要在严格的长期限制下作出决定,而更快的反应则直接转化为更高的奖励。尽管这种长期质量交易十分重要,但在基于LLLM的代理商中,它仍然没有得到充分探讨。在这项工作中,我们首次以实时决策实时任务的形式对这种交易进行了系统研究。为了支持我们的调查,我们引入了两个新的基准:HFTBench,一个高频交易模拟,以及StreetFighter,一个竞争性的组合平台。我们的分析表明,最佳的延时质量平衡因任务而异,而降低低长期质量可以大大提高下游业绩。为了解决这个问题,我们建议FPX,一个根据实时需求动态选择模型大小和四分级水平的适应性框架。我们的方法在两个基准上都取得了最佳的业绩,在街头战斗中以80%的速度递增速率率率率,在战略中提升到80%,在战略中,通过展示我们稳定的部署基准,在战略的升级的部署中,这些基准需要。这些基准以显示我们的安全度的部署战略的升级。
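The abstract above says FPX selects model size and quantization level based on real-time demands but does not spell out the selection rule. The following is a minimal sketch of one plausible rule, not the FPX algorithm itself: pick the highest-quality configuration whose latency fits the current budget. The `ModelConfig` type, the candidate names, and all latency and quality numbers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    name: str          # hypothetical label for a (model size, quantization) pair
    latency_ms: float  # assumed measured latency per decision
    quality: float     # assumed offline quality score, higher is better


def select_config(configs, latency_budget_ms):
    """Return the highest-quality config that fits the latency budget.

    If no config fits, fall back to the fastest one.
    """
    if not configs:
        raise ValueError("configs must not be empty")
    feasible = [c for c in configs if c.latency_ms <= latency_budget_ms]
    if feasible:
        return max(feasible, key=lambda c: c.quality)
    return min(configs, key=lambda c: c.latency_ms)


# Illustrative candidates only; none of these numbers come from the paper.
CANDIDATES = [
    ModelConfig("small-int4", 40.0, 0.55),
    ModelConfig("small-fp16", 70.0, 0.60),
    ModelConfig("large-int8", 180.0, 0.78),
    ModelConfig("large-fp16", 320.0, 0.82),
]

if __name__ == "__main__":
    for budget in (10, 75, 200, 500):
        print(budget, select_config(CANDIDATES, budget).name)
```

A tighter budget pushes the choice toward smaller or more aggressively quantized configurations, which is the latency-quality trade-off the abstract studies.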
Article 85
Title@2025-05-26 (1): GPU acceleration of non-equilibrium Green’s function calculation using OpenACC and CUDA FORTRAN
Title: GPU acceleration of non-equilibrium Green’s function calculation using OpenACC and CUDA FORTRAN | GPU-Beschleunigung der Nicht-Equilibrium Green-Funktionsberechnung mit OpenACC und CUDA FORTRAN | 使用 OpenACC 和 CUDA FORTRAN 加速 GPU 绿色非平衡的功能计算 2505.19467v1 |
Authors (6): Jia Yin, Khaled Z. Ibrahim, Mauro Del Ben, Jack Deslippe, Yang-hao Chan, Chao Yang
The numerical solution of the Kadanoff-Baym nonlinear integro-differential equations, which yields the non-equilibrium Green’s functions (NEGFs) of quantum many-body systems, poses significant computational challenges due to its high computational complexity. In this work, we present efficient implementations of a numerical method for solving these equations on distributed-memory architectures, including many-core CPUs and multi-GPU systems. For CPU-based platforms, we adopt a hybrid MPI/OpenMP programming model to exploit both inter-node and intra-node parallelism. On GPU-accelerated systems, we implement the method using two distinct approaches: MPI/OpenACC and MPI/CUDA FORTRAN. Several optimization strategies are employed to enhance GPU performance, including techniques to maximize computational resource utilization and minimize the overhead associated with kernel launches and memory management. Although OpenACC is easy to use, CUDA FORTRAN provides more advanced features for configuring and managing multiple levels of concurrency, while also simplifying memory allocation and data movement between host and device. This flexibility translates into significant performance improvements. We compare the performance of the three implementations and demonstrate that the GPU-based approaches achieve substantial speedups over CPU-based implementations. Furthermore, both CPU and GPU versions exhibit excellent strong and weak scaling, confirming the scalability and efficiency of our approach for large-scale NEGF computations.
Kadanoff-Baym非线性内分化方程式的数值解决方案,使量子多体系统的非平衡绿色功能(NEGFs)产生量子多体系统的非平衡绿色功能,由于计算复杂性很高,因此在计算上提出了巨大的挑战。在这项工作中,我们提出了在分布式模拟结构中解决这些等式的数字方法的高效实施,包括许多核心CPU和多GPU系统。对于基于CPU的平台,我们采用了混合的MPI/OpenMP编程模式,以利用节点间和节点内平行的多种水平。在GPU加速的系统中,我们采用的方法有两种不同的方法:MPI/OpenACC和MPI/CUDA FORTRAN。我们采用了几种优化战略来提高GPU的性能,包括最大限度地计算资源利用率和尽量减少与内圈启动和记忆管理相关的间接费用。虽然基于CUDAC是容易使用的,但CUDA FORTRAN为配置和管理多级调调调制提供了更先进的特征,同时也简化记忆分配和数据流动的大规模升级化方法。
Article 86
Title@2025-05-26 (1): FedHERO: A Federated Learning Approach for Node Classification Task on Heterophilic Graphs
Title: FedHERO: A Federated Learning Approach for Node Classification Task on Heterophilic Graphs | FedHERO: Ein Federated Learning Approach für Knotenklassifikation Aufgaben auf heterophilen Graphen | FEFHERO: 异生物图节点分类任务联邦学习方法 2504.21206v2 |
Authors (5): Zihan Chen, Xingbo Fu, Yushun Dong, Jundong Li, Cong Shen
Federated Graph Learning (FGL) empowers clients to collaboratively train graph neural networks (GNNs) in a distributed manner while preserving data privacy. However, FGL methods usually require that the graph data owned by all clients is homophilic to ensure similar neighbor distribution patterns of nodes. Such an assumption ensures that the learned knowledge is consistent across the local models from all clients. Therefore, these local models can be properly aggregated as a global model without undermining the overall performance. Nevertheless, when the neighbor distribution patterns of nodes vary across different clients (e.g., when clients hold graphs with different levels of heterophily), their local models may gain different and even conflicting knowledge from their node-level predictive tasks. Consequently, aggregating these local models usually leads to catastrophic performance deterioration on the global model. To address this challenge, we propose FedHERO, an FGL framework designed to harness and share insights from heterophilic graphs effectively. At the heart of FedHERO is a dual-channel GNN equipped with a structure learner, engineered to discern the structural knowledge encoded in the local graphs. With this specialized component, FedHERO enables the local model for each client to identify and learn patterns that are universally applicable across graphs with different patterns of node neighbor distributions. FedHERO not only enhances the performance of individual client models by leveraging both local and shared structural insights but also sets a new precedent in this field to effectively handle graph data with various node neighbor distribution patterns. We conduct extensive experiments to validate the superior performance of FedHERO against existing alternatives.
联邦图表学习(FGL) 授权客户在保存数据隐私的同时,以分布方式合作培训图表神经网络(GNN) 。 但是, FGL 方法通常要求所有客户拥有的图表数据具有同性,以确保相邻节点分布模式的相似性。 这样的假设可以确保从所有客户的当地模型中学习到的知识的一致性。 因此,这些本地模型可以作为一个全球模型进行适当汇总,而不会破坏总体性能。 然而,当相邻节点的分布模式在不同客户之间(例如,当客户持有不同水平的复杂图时,其本地模型可能从他们的节点预测任务中获得不同甚至冲突性的知识。 因此, 合并这些本地模型通常会导致全球模型的灾难性性能恶化。 为了应对这一挑战,我们建议FEHERO, 一种FDGLL框架, 旨在有效地利用和共享性能图表的洞察和共享性能。 在FEHERO中心中心, 设计一个双轨GNN, 在一个结构学习器中, 用来从本地图表中识别结构知识, 并用本地图表的模型在本地图表中识别结构知识, 。
Article 87
Title@2025-05-25 (7): QMIO: A tightly integrated hybrid HPCQC system
Title: QMIO: A tightly integrated hybrid HPCQC system | QMIO: Ein eng integriertes Hybrid-HPCQC-System | QMIO:一个严格一体化的混合高和分PCQC系统 2505.19267v1 |
Authors (7): Javier Cacheiro, Álvaro C Sánchez, Russell Rundle, George B Long, Gavin Dold, Jamie Friel, Andrés Gómez
High-Performance Computing (HPC) systems are the most powerful tools that we currently have to solve complex scientific simulations. Quantum computing (QC) has the potential to enhance HPC systems by accelerating the execution of specific kernels that can be offloaded to a Quantum Processing Unit (QPU), granting them new capabilities, improving the speed of computation, or reducing energy consumption. In this paper, we present QMIO: a state-of-the-art hybrid HPCQC system, which tightly integrates HPC and QC. We describe its hardware and software components, the integration middleware, and the lessons learned during the design, implementation, and operation of the system.
高性能计算系统(HPC)是我们目前必须用来解决复杂的科学模拟的最有力工具。 量子计算(QC)有可能通过加速执行可卸到量子处理股(QPU)的具体内核,赋予它们新的能力,提高计算速度,或减少能源消耗,从而增强HPC系统。在本文件中,我们介绍了QMIO:最先进的混合HPCQC系统,它密切结合了HPC和QC。 我们描述了其硬件和软件组件、集成中间软件,以及在该系统的设计、实施和运行过程中吸取的经验教训。
Article 88
Title@2025-05-25 (7): NanoFlow: Towards Optimal Large Language Model Serving Throughput
Title: NanoFlow: Towards Optimal Large Language Model Serving Throughput | NanoFlow: Auf dem Weg zu einem optimalen Large Language Model | NanoFlow:走向最佳大语言模式 2408.12757v2 |
Authors (16): Kan Zhu, Yufei Gao, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Tian Tang, Qinyu Xu, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Ziren Wang, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci
Large Language Models (LLMs) have resulted in a surging demand for planet-scale serving systems, where tens of thousands of GPUs continuously serve hundreds of millions of users. Consequently, throughput has emerged as a key metric that determines serving systems’ performance. Due to large model sizes and memory-intensive self-attention, LLM serving has been commonly assumed to be memory-bound. Through a detailed analysis, we show that despite having memory-intensive components, end-to-end LLM serving is compute-bound for most common workloads and LLMs. Alas, most existing serving engines fall short of optimal compute utilization, because the heterogeneous operations that comprise LLM serving (compute, memory, networking) are executed sequentially within a device. We propose NanoFlow, a novel serving framework that exploits intra-device parallelism, which overlaps the usage of heterogeneous resources within a single device. NanoFlow splits inputs into smaller nano-batches and duplicates operations to operate on each portion independently, enabling overlapping. NanoFlow automatically identifies the number, size, ordering, and GPU resource allocation of nano-batches to minimize the execution time, while considering the interference of concurrent operations. We evaluate NanoFlow’s end-to-end serving throughput on several popular models such as LLaMA-2-70B, Mixtral 8x7B, LLaMA-3-8B, etc. With practical workloads, NanoFlow provides a 1.91x throughput boost compared to state-of-the-art serving systems, achieving 50% to 72% of optimal throughput across popular models.
大型语言模型(LLMS)导致对全球规模服务系统的需求激增, 成千上万的GPU不断为数亿用户服务。 因此, 输送量已成为决定服务系统性能的关键指标。 由于规模庞大的模型和记忆密集的自我关注, LLM服务通常被假定为记忆性服务。 通过详细分析, 我们显示, 尽管拥有记忆密集型组件, 端到端LLM 服务对大多数常见工作量和LLMS 的计算是固定的。 唉, 大多数现有的服务引擎离优化的计算利用率差, 原因是由LLM- 服务计算、 记忆、 网络化- 在一个设备中连续执行。 我们提议纳诺弗罗( NanoFlow) , 利用内部差异性平行的平行框架。 NanoFlow 将输入拆分为更小的纳米批次(nano-batch),并复制操作以独立处理每一部分,从而实现重叠。NanoFlow 自动确定纳米批次的数量、大小、顺序和 GPU 资源分配,以在考虑并发操作相互干扰的同时最小化执行时间。我们在 LLaMA-2-70B、Mixtral 8x7B、LLaMA-3-8B 等多个常用模型上评估了 NanoFlow 的端到端服务吞吐量。在实际工作负载下,NanoFlow 的吞吐量比最先进的服务系统提高 1.91 倍,在各常用模型上达到最优吞吐量的 50% 至 72%。
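The NanoFlow abstract describes splitting inputs into nano-batches so that operations using different resources can overlap. The sketch below illustrates that idea with a plain software-pipelining schedule. It is an assumption-laden illustration, not NanoFlow's scheduler: the stage names and the one-stage-per-step timing model are hypothetical, and the real system also searches over nano-batch sizes and GPU resource shares.

```python
def split_into_nano_batches(batch, num_nano_batches):
    """Split a batch into at most num_nano_batches nearly equal contiguous parts."""
    if num_nano_batches <= 0:
        raise ValueError("num_nano_batches must be positive")
    base, extra = divmod(len(batch), num_nano_batches)
    parts, start = [], 0
    for i in range(num_nano_batches):
        size = base + (1 if i < extra else 0)
        if size == 0:
            continue  # fewer items than requested nano-batches
        parts.append(batch[start:start + size])
        start += size
    return parts


def overlap_schedule(num_nano_batches, stages):
    """Software-pipelined schedule: nano-batch i runs stage s at step i + s.

    Each step lists (nano_batch_index, stage) pairs that run concurrently,
    so different nano-batches occupy different resources at the same time.
    """
    steps = []
    for t in range(num_nano_batches + len(stages) - 1):
        active = []
        for i in range(num_nano_batches):
            s = t - i
            if 0 <= s < len(stages):
                active.append((i, stages[s]))
        steps.append(active)
    return steps


if __name__ == "__main__":
    print(split_into_nano_batches(list(range(10)), 4))
    for step in overlap_schedule(2, ["compute", "memory", "network"]):
        print(step)
```

With a single batch, the three stages would run strictly one after another; with two nano-batches, the middle steps keep two different resources busy at once.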
Article 89
Title@2025-05-25 (7): Matrix Multiplication in the MPC Model
Title: Matrix Multiplication in the MPC Model | Matrix-Multiplikation im MPC-Modell | MPC 模型中的矩阵乘法 2505.19137v1 |
Authors (4): Atharv Chhabra, Arya Deshmukh, Chetan Gupta, Lakshya Joshi
In this paper, we study the matrix multiplication problem in the MPC model. We have two matrices, and the task is to compute their product. These matrices are evenly distributed over $P$ processors. Each processor has $M$ memory such that $P \cdot M \geq $ (size of the matrices). The computation proceeds in synchronous rounds. In a communication round, a processor can send messages to and receive messages from any other processor, with the total size of messages sent or received being $O(M)$. We give an almost complete characterisation of the problem in various settings. We prove tight upper bounds and lower bounds for the problem in three different settings: when the given input matrices are (i) general square matrices, (ii) rectangular matrices, and (iii) sparse square matrices (that is, each row and column contains a bounded number of nonzero elements). In particular, we prove the following results: 1. Multiplication of two $n \times n$ matrices in the MPC model with $n^\alpha$ processors, each with $O(n^{2-\alpha})$ memory, requires $\Theta(n^{\frac{\alpha}{2}})$ rounds in semirings. 2. Multiplication of two rectangular matrices of size $n \times d$ and $d \times n$ (where $d \leq n$) respectively, with $n$ processors of $O(n)$ memory, requires $\Theta(\frac{d}{\sqrt{n}})$ rounds in semirings. 3. Multiplication of two rectangular matrices of size $d \times n$ and $n \times d$ (where $d \leq n$) respectively requires (i) $\Theta(\sqrt{d} + \log_d n)$ rounds with $n$ processors and $O(d)$ memory per processor in semirings, and (ii) $\Theta(\frac{d}{\sqrt{n}})$ rounds with $d$ processors and $O(n)$ memory per processor in semirings. 4. Multiplication of two $d$-sparse matrices (each row and column of the matrices contains at most $d$ nonzero elements) with $n$ processors and $O(d)$ memory per processor can be done in $O(d^{0.9})$ rounds in semirings.
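As a quick sanity check on the bounds listed above, the square and rectangular results can be compared at the point where their settings coincide. Setting $\alpha = 1$ in Result 1 and $d = n$ in Results 2 and 3 gives $n$ processors with $O(n)$ memory each in every case, and all four bounds then reduce to $\Theta(\sqrt{n})$. This worked instance only substitutes values into the stated formulas; it is not taken from the paper.

```latex
\begin{align*}
\text{Total memory in Result 1:}\quad
  & n^{\alpha} \cdot O\!\left(n^{2-\alpha}\right) = O\!\left(n^{2}\right) \\
\text{Result 1 with } \alpha = 1:\quad
  & \Theta\!\left(n^{\alpha/2}\right) = \Theta\!\left(\sqrt{n}\right) \\
\text{Result 2 with } d = n:\quad
  & \Theta\!\left(\tfrac{d}{\sqrt{n}}\right) = \Theta\!\left(\sqrt{n}\right) \\
\text{Result 3(i) with } d = n:\quad
  & \Theta\!\left(\sqrt{d} + \log_d n\right) = \Theta\!\left(\sqrt{n} + 1\right) = \Theta\!\left(\sqrt{n}\right) \\
\text{Result 3(ii) with } d = n:\quad
  & \Theta\!\left(\tfrac{d}{\sqrt{n}}\right) = \Theta\!\left(\sqrt{n}\right)
\end{align*}
```

The first line also confirms that the memory budget in Result 1 matches the $\Theta(n^2)$ input size regardless of $\alpha$.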
Article 90
Title@2025-05-25 (7): Birch SGD: A Tree Graph Framework for Local and Asynchronous SGD Methods
Title: Birch SGD: A Tree Graph Framework for Local and Asynchronous SGD Methods | Birke SGD: Ein Baumdiagramm-Framework für lokale und asynchrone SGD-Methoden | Birch SGD: 当地和非同步 SGD 方法树图框架 2505.09218v2 |
Authors (2): Alexander Tyurin, Danil Sivtsov
We propose a new unifying framework, Birch SGD, for analyzing and designing distributed SGD methods. The central idea is to represent each method as a weighted directed tree, referred to as a computation tree. Leveraging this representation, we introduce a general theoretical result that reduces convergence analysis to studying the geometry of these trees. This perspective yields a purely graph-based interpretation of optimization dynamics, offering a new and intuitive foundation for method development. Using Birch SGD, we design eight new methods and analyze them alongside previously known ones, with at least six of the new methods shown to have optimal computational time complexity. Our research leads to two key insights: (i) all methods share the same “iteration rate” of $O\left(\frac{(R + 1) L \Delta}{\varepsilon} + \frac{\sigma^2 L \Delta}{\varepsilon^2}\right)$, where $R$ is the maximum “tree distance” along the main branch of a tree; and (ii) different methods exhibit different trade-offs: for example, some update iterates more frequently, improving practical performance, while others are more communication-efficient or focus on other aspects. Birch SGD serves as a unifying framework for navigating these trade-offs. We believe these results provide a unified foundation for understanding, analyzing, and designing efficient asynchronous and parallel optimization methods.
我们提出一个新的统一框架,即Birch SGD,用于分析和设计分布式的 SGD 方法。中心思想是代表每种方法,作为加权定向树,称为计算树。利用这个代表,我们引入一个总体理论结果,减少趋同分析,以研究这些树的几何学制。这个观点产生对优化动态的纯粹基于图表的解释,为方法开发提供一个新的直观的基础。我们利用Birch SGD,设计了八种新方法,并分析了这些新方法与以前已知的方法,至少有六种新的方法显示具有最佳计算时间复杂性。我们的研究引出了两个关键见解:(一) 所有方法都具有相同的“拉夫(R+1)L\Deltaúvarepsilon}+\\gma2L\deltaúvarepsilonright)的“通性率,为方法开发方法提供了一个新的直观基础。我们设计了8个新方法,在树主分支上展示了最大“树距离”的R$;以及至少6个新方法,显示有两种方法显示不同的交易偏差,例如:所有方法都使用相同的“lev(ial-flefer-fer-ford-fer-fer-formilling),一些方法,经常更新的“bilal-hilling folview resmilling fal ress fal ress) laview) ress) laviews found found found found found founds found found founds) laus.
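The iteration rate stated in the Birch SGD abstract has two terms, and a short comparison shows when the tree distance $R$ matters. The algebra below uses only the formula as quoted and assumes $L \Delta > 0$ and $\varepsilon > 0$; it is a reading aid, not a result from the paper.

```latex
\begin{align*}
\frac{(R + 1) L \Delta}{\varepsilon} \le \frac{\sigma^2 L \Delta}{\varepsilon^2}
  \quad &\Longleftrightarrow \quad R + 1 \le \frac{\sigma^2}{\varepsilon}, \\
\text{so for } R + 1 \le \frac{\sigma^2}{\varepsilon}: \quad
  O\!\left(\frac{(R + 1) L \Delta}{\varepsilon} + \frac{\sigma^2 L \Delta}{\varepsilon^2}\right)
  &= O\!\left(\frac{\sigma^2 L \Delta}{\varepsilon^2}\right).
\end{align*}
```

In other words, as long as the tree distance stays below roughly $\sigma^2 / \varepsilon$, the noise term dominates and the rate does not depend on $R$.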
Article 91
Title@2025-05-24 (6): Toward Malicious Clients Detection in Federated Learning
Title: Toward Malicious Clients Detection in Federated Learning | Auf dem Weg zu bösartigen Kunden Erkennung im Föderierten Lernen | 争取在联邦学习中发现恶意客户 2505.09110v2 |
Authors (5): Zhihao Dou, Jiaqi Wang, Wei Sun, Zhuqing Liu, Minghong Fang
Federated learning (FL) enables multiple clients to collaboratively train a global machine learning model without sharing their raw data. However, the decentralized nature of FL introduces vulnerabilities, particularly to poisoning attacks, where malicious clients manipulate their local models to disrupt the training process. While Byzantine-robust aggregation rules have been developed to mitigate such attacks, they remain inadequate against more advanced threats. In response, recent advancements have focused on FL detection techniques to identify potentially malicious participants. Unfortunately, these methods often misclassify numerous benign clients as threats or rely on unrealistic assumptions about the server’s capabilities. In this paper, we propose a novel algorithm, SafeFL, specifically designed to accurately identify malicious clients in FL. The SafeFL approach involves the server collecting a series of global models to generate a synthetic dataset, which is then used to distinguish between malicious and benign models based on their behavior. Extensive testing demonstrates that SafeFL outperforms existing methods, offering superior efficiency and accuracy in detecting malicious clients.
联邦学习(FL)使多个客户能够在不分享原始数据的情况下合作培训全球机器学习模式。然而,FL的分散性质带来了脆弱性,特别是中毒袭击,恶意客户利用当地模式破坏培训过程。虽然Byzantine-robust 集成规则已经制定来缓解此类袭击,但这些规则仍然不足以应对更先进的威胁。作为回应,最近的进展侧重于FL检测技术,以识别潜在的恶意参与者。不幸的是,这些方法往往错误地将许多良客户归类为威胁,或依赖对服务器能力的不切实际假设。在本文件中,我们提议了一种新颖的算法“SafeFLL”,专门设计来准确识别FL的恶意客户。SafeFL方法涉及服务器收集一系列全球模型,以生成合成数据集,然后用于根据行为区分恶意和良性模型。广泛测试表明,FLaf安全超越了现有方法,为发现恶意客户提供了更高的效率和准确度。
Article 92
Title@2025-05-24 (6): Distributed Incremental SAT Solving with Mallob: Report and Case Study with Hierarchical Planning
Title: Distributed Incremental SAT Solving with Mallob: Report and Case Study with Hierarchical Planning | Distributed Incremental SAT Solving with Mallob: Report and Case Study with Hierarchical Planning | 与马洛布公司共同解决:与等级规划有关的报告和案例研究 2505.18836v1 |
Authors (1): Dominik Schreiber
This report describes an extension of the distributed job scheduling and SAT solving platform Mallob by incremental SAT solving, embedded in a case study on SAT-based hierarchical planning. We introduce a low-latency interface for incremental jobs and specifically for IPASIR-style incremental SAT solving to Mallob. This also allows processing many independent planning instances in parallel via Mallob’s scheduling capabilities. In an experiment where 587 planning inputs are resolved in parallel on 2348 cores, we observe significant speedups for several planning domains where SAT solving constitutes a major part of the planner’s running time. These findings indicate that our approach to distributed incremental SAT solving may be useful for a wide range of SAT applications.
本报告介绍了分布式工作时间安排和沙特卫星解决平台Mallob的延伸,其方法是通过渐进式沙特卫星解决,纳入关于基于沙特卫星的等级规划的案例研究中。我们为递增性工作,特别是IPASIR式的递增性SAT解决向马洛布提供低纬度界面。这也使得能够通过马洛布的排期能力同时处理许多独立的规划案例。在一项实验中,587项规划投入在2348个核心同时得到解决,我们观察到若干规划领域的显著加速,在这些规划领域,沙特卫星解决构成规划员运行时间的主要部分。这些结论表明,我们分配递增性SAT解决方案的方法可能有益于广泛的沙特卫星应用。
Article 93
Title@2025-05-24 (6): DiSCo: Device-Server Collaborative LLM-Based Text Streaming Services
Title: DiSCo: Device-Server Collaborative LLM-Based Text Streaming Services | DiSCo: Geräte-Server Kollaborative LLM-basierte Text-Streaming-Dienste | DisCo: 设备-服务器协作协作LLM基于LLM的文本流服务 2502.11417v2 |
Authors (3): Ting Sun, Penghan Wang, Fan Lai
The rapid rise of large language models (LLMs) in text streaming services has introduced significant cost and Quality of Experience (QoE) challenges in serving millions of daily requests, especially in meeting Time-To-First-Token (TTFT) and Time-Between-Token (TBT) requirements for real-time interactions. Our real-world measurements show that both server-based and on-device deployments struggle to meet diverse QoE demands: server deployments face high costs and last-hop issues (e.g., Internet latency and dynamics), while on-device LLM inference is constrained by resources. We introduce DiSCo, a device-server cooperative scheduler designed to optimize users’ QoE by adaptively routing requests and migrating response generation between endpoints while maintaining cost constraints. DiSCo employs cost-aware scheduling, leveraging the predictable speed of on-device LLM inference with the flexible capacity of server-based inference to dispatch requests on the fly, while introducing a token-level migration mechanism to ensure consistent token delivery during migration. Evaluations on real-world workloads, including commercial services like OpenAI GPT and DeepSeek and open-source deployments such as LLaMA3, show that DiSCo can improve users’ QoE by reducing tail TTFT (11-52%) and mean TTFT (6-78%) across different model-device configurations. Its migration mechanism also reduces serving costs by up to 84% while maintaining comparable QoE levels.
文本流服务中大型语言模型(LLMs)的迅速上升,在满足数百万次日常请求方面,特别是在满足实时互动的时间到第一(TTFT)和时间-Between-Token(TTTT)的要求方面,带来了巨大的成本和质量挑战。我们的现实世界测量显示,基于服务器的部署和设置装置的部署都难以满足不同的QE要求:服务器的部署面临高昂的成本和最后的点问题(例如因特网的延缓和动态),而在线LM的推断则受到资源的限制。我们引入DisCo,即一个设备服务器-服务器合作调度调度系统,目的是通过适应性地在端点之间调整路线请求和迁移反应生成优化用户的QO(TT),同时利用基于服务器的开放性推力灵活地将请求发送到空中,同时引入一个象征性的迁移机制,以确保在实时的交付方面,包括实时的G-IMA3,同时通过深度的配置,在实时的部署中显示实时的交付成本。
Article 94
Title@2025-05-24 (6): Distributed Set-membership Filtering Frameworks For Multi-agent Systems With Absolute and Relative Measurements
Title: Distributed Set-membership Filtering Frameworks For Multi-agent Systems With Absolute and Relative Measurements | Distributed Set-Membership Filtering Frameworks für Multi-Agent-Systeme mit absoluten und relativen Messungen | 具有绝对和相对计量的多试剂系统分布式成员筛选框架 2305.15797v2 |
Authors (3): Yu Ding, Yirui Cong, Xiangke Wang
In this paper, we focus on the distributed set-membership filtering (SMFing) problem for a multi-agent system with absolute (taken from agents themselves) and relative (taken from neighbors) measurements. In the literature, the relative measurements are difficult to deal with, and the SMFs rely heavily on specific set descriptions. As a result, establishing a general distributed SMFing framework with relative measurements is still an open problem. To solve this problem, first, we provide the set description based on uncertain variables determined by the relative measurements between two agents as the foundation. Surprisingly, the accurate description requires only a single calculation step rather than multiple iterations, which can effectively reduce computational complexity. Based on the derived set description, called the uncertain range, we propose two distributed SMFing frameworks: one calculates the joint uncertain range of the agent itself and its neighbors, while the other only computes the marginal uncertain range of each local system. Furthermore, we compare the performance of our two proposed distributed SMFing frameworks with the benchmark, the centralized SMFing framework. A rigorous set analysis reveals that the distributed SMF can be essentially considered as the process of computing the marginal uncertain range to outer bound the projection of the uncertain range obtained by the centralized SMF in the corresponding subspace. Simulation results corroborate the effectiveness of our proposed distributed frameworks and verify our theoretical analysis.
在本文中,我们侧重于对一个具有绝对(从代理人本身获得)和相对(从邻居获得)测量的多试剂系统进行分布式成员过滤(SMF)的问题。在文献中,相对测量很难处理,而SMF高度依赖特定描述。因此,建立具有相对测量的分布式成员过滤(SMF)框架仍然是一个尚未解决的问题。为了解决这个问题,首先,我们根据两个代理人之间相对测量所决定的不确定变数提供一套描述。令人惊讶的是,准确描述只需要一个单一的计算步骤,而不是多个迭代,才能有效减少计算的复杂性。根据衍生的既定描述,称为不确定范围,我们建议两个分布式成员高度依赖特定描述框架:一个计算代理人本身及其邻居的共同不确定性范围,而另一个仅计算每个地方系统的边际不确定范围。此外,我们比较了我们拟议的两个分布式的SMFF框架和基准 – – 集中的SMF框架 – – 令人惊讶地分析显示,分布式SMF的分布式SMF,基本上可以考虑通过我们拟议的不确定的边际空间预测的中央范围,我们拟议的SMF的边际分析,而将SMF的边际预测的边际范围视为我们提议的SMF的边际预测。
Article 95
Title@2025-05-24 (6): EvoSort: A Genetic-Algorithm-Based Adaptive Parallel Sorting Framework for Large-Scale High Performance Computing
Title: EvoSort: A Genetic-Algorithm-Based Adaptive Parallel Sorting Framework for Large-Scale High Performance Computing | EvoSort: Ein genetisch-algorithmisch-adaptives Parallelsortierungs-Framework für großformatige Hochleistungsrechnen | EvoSort: 大型高性能计算方法的基于遗传 – – 物理学的适应性平行排序框架 2505.18681v1 |
Authors (2): Shashank Raj, Kalyanmoy Deb
In today’s era of big data, sorting enormous datasets is a major challenge. We present EvoSort, an adaptive parallel sorting framework that employs a Genetic Algorithm (GA) to automatically discover and refine critical parameters, including insertion sort and fallback thresholds, tile size, and mergesort vs Least Significant Digit (LSD) radix sort. EvoSort integrates parallel sorting primitives and adapts continuously to input data and system architecture, ensuring optimal performance. Experiments on up to 10 billion elements show that EvoSort consistently outperforms NumPy sorting by factors from three to over 90 times. EvoSort exemplifies a powerful auto-tuning solution for large-scale data processing.
在当今的大数据时代,对庞大的数据集进行分类是一项重大挑战。我们展示了EvoSort(EvoSort),这是一个适应性平行的分类框架,它使用遗传算法(GA)自动发现和完善关键参数,包括插入分类和后退阈值、瓷块大小、合并索尔特和最小值的Digit(LSD) 弧形等。EvoSort将平行的原始分类和不断适应输入数据和系统结构,确保最佳性能。对多达100亿个元素的实验显示,EvoSort(EvoSort)一直比NumPy(NumPy)的分解三至90倍。EvoSort为大规模数据处理的强大自动调整解决方案提供了示范。
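The EvoSort abstract names an insertion-sort threshold as one of the parameters a genetic algorithm tunes. The sketch below shows what such a setup could look like in miniature: a merge sort that switches to insertion sort below a threshold, plus a tiny elitist GA over that single integer. It is a hypothetical illustration, not EvoSort's implementation; the function names, population size, mutation range, and fitness interface are all assumptions.

```python
import random


def insertion_sort(a, lo, hi):
    """Sort a[lo:hi] in place."""
    for i in range(lo + 1, hi):
        key = a[i]
        j = i - 1
        while j >= lo and a[j] > key:
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = key


def hybrid_sort(data, insertion_threshold=16):
    """Merge sort that falls back to insertion sort on short runs."""
    a = list(data)
    buf = a[:]  # scratch space for merging

    def sort(lo, hi):
        if hi - lo <= max(insertion_threshold, 1):
            insertion_sort(a, lo, hi)
            return
        mid = (lo + hi) // 2
        sort(lo, mid)
        sort(mid, hi)
        i, j, k = lo, mid, lo
        while i < mid and j < hi:
            if a[j] < a[i]:
                buf[k] = a[j]
                j += 1
            else:
                buf[k] = a[i]
                i += 1
            k += 1
        while i < mid:
            buf[k] = a[i]
            i += 1
            k += 1
        while j < hi:
            buf[k] = a[j]
            j += 1
            k += 1
        a[lo:hi] = buf[lo:hi]

    sort(0, len(a))
    return a


def evolve_threshold(fitness, low=1, high=256, population=8,
                     generations=10, seed=0):
    """Tiny elitist GA over one integer parameter; higher fitness is better."""
    rng = random.Random(seed)
    pop = [rng.randint(low, high) for _ in range(population)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: population // 2]  # keep the better half unchanged
        children = []
        for _ in range(population - len(parents)):
            p, q = rng.choice(parents), rng.choice(parents)
            child = (p + q) // 2 + rng.randint(-8, 8)  # crossover plus mutation
            children.append(min(high, max(low, child)))
        pop = parents + children
    return max(pop, key=fitness)
```

In practice the fitness function would be something like the negative measured runtime of `hybrid_sort` on a sample of the target data, which makes the result depend on both the input distribution and the machine.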
Article 96
Title@2025-05-24 (6): Towards Round-Optimal Approximate Agreement on Trees
Title: Towards Round-Optimal Approximate Agreement on Trees | Auf dem Weg zu einem runden, optimalen Abkommen über Bäume | 争取达成关于树木的圆顶和最接近于 2502.05591v2 |
Authors (3): Marc Fuchs, Diana Ghinea, Zahra Parsaeian
Approximate Agreement (AA) is a key consensus primitive that, even in the presence of Byzantine faults, allows honest parties to obtain close (but not necessarily identical) outputs that lie within the range of their inputs. While the optimal round complexity of synchronous AA on real values is well understood, its extension to other input spaces remains an open problem. Our work is concerned with AA on trees, where the parties hold as inputs vertices from a publicly known labeled tree $T$ and must output $1$-close vertices in the honest inputs’ convex hull. We present an optimal-resilience protocol in the synchronous model, with round complexity $O\left(\frac{\log | V(T) | }{\log \log | V(T) | } \right)$, where $V(T)$ is the set of vertices in the input space tree $T$. Our protocol non-trivially reduces the problem of AA on trees to AA on real values. Additionally, we extend the impossibility results regarding the round complexity of synchronous AA protocols on real values to trees: we prove a lower bound of $\Omega\left(\frac{\log D(T)}{\log \log D(T) + \log \frac{n + t}{t}} \right)$ rounds, where $D(T)$ denotes the diameter of the tree, $n$ denotes the number of parties, and $t$ denotes the number of Byzantine parties. This establishes the asymptotic optimality of our protocol for trees $T$ of diameter $D(T) \in | V(T) | ^{\Theta(1)}$ given that $t \in \Theta(n)$.
近似协议( AA) 是一个关键的共识原始, 即便在有 Byzantine 断层的情况下, 诚实的政党也能在投入范围内获得接近( 但不一定相同) 的输出。 虽然对真实值同步的 AAA 的最佳回合复杂度得到了很好的理解, 但它扩展到其它输入空间仍然是一个尚未解决的问题 。 我们的工作与 AAA 在树上的工作有关, 在树上, 各方持有公开标注为$T$, 并且必须在诚实投入的 convex 船体上输出 $( 直径) $( 直径) 。 我们在同步模型中展示了最佳的( D)\ left\\ org@ v\\\ log\ log\ {V (T)\\ right) 协议, $( T) 是输入空间树上 $T$( 美元) 的峰值。 我们的协议在树上将 AAAA 的 问题不重 。 此外, 我们把关于同步协议的极复杂性( $- t) 美元协议的概率结果扩展结果扩展到我们的真实值。
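The correctness condition for AA on trees stated above (outputs are $1$-close and lie in the convex hull of the honest inputs) can be made concrete with a small checker. This is a sketch of the condition only, not of the protocol. It assumes the tree is given as an adjacency dictionary, and the helper names and example tree are hypothetical.

```python
from collections import deque


def bfs_parents(tree, root):
    """Parent pointers of a BFS from root over an adjacency-dict tree."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        v = queue.popleft()
        for w in tree[v]:
            if w not in parent:
                parent[w] = v
                queue.append(w)
    return parent


def convex_hull(tree, inputs):
    """Vertices of the smallest subtree containing all inputs."""
    inputs = list(inputs)
    root = inputs[0]
    parent = bfs_parents(tree, root)
    hull = {root}
    for v in inputs[1:]:
        # Walk towards the root until reaching a vertex already in the hull.
        while v not in hull:
            hull.add(v)
            v = parent[v]
    return hull


def distance(tree, u, v):
    """Number of edges on the unique path between u and v."""
    parent = bfs_parents(tree, u)
    d = 0
    while v != u:
        v = parent[v]
        d += 1
    return d


def is_valid_outcome(tree, honest_inputs, honest_outputs):
    """Check validity (outputs in hull) and 1-agreement (pairwise distance <= 1)."""
    hull = convex_hull(tree, honest_inputs)
    outputs = list(honest_outputs)
    in_hull = all(o in hull for o in outputs)
    close = all(distance(tree, a, b) <= 1 for a in outputs for b in outputs)
    return in_hull and close


# Example tree: path 0-1-2-3 with an extra leaf 4 attached to vertex 1.
TREE = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}
```

For honest inputs `[0, 3]`, the hull is the path `{0, 1, 2, 3}`, so outputs `[1, 2]` are acceptable while `[1, 4]` (leaf outside the hull) and `[0, 2]` (two edges apart) are not.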
Article 97
Title@2025-05-24 (6): Asynchronous Approximate Agreement with Quadratic Communication
Title: Asynchronous Approximate Agreement with Quadratic Communication | Asynchrone annähernde Vereinbarung mit quadratischer Kommunikation | 与赤道通信的近似非同步协定 2408.05495v3 |
Authors (2): Mose Mizrahi Erbes, Roger Wattenhofer
We consider an asynchronous network of $n$ message-sending parties, up to $t$ of which are Byzantine. We study approximate agreement, where the parties obtain approximately equal outputs in the convex hull of their inputs. In their seminal work, Abraham, Amit and Dolev [OPODIS '04] solve this problem in $\mathbb{R}$ with the optimal resilience $t < \frac{n}{3}$ with a protocol where each party reliably broadcasts a value in every iteration. This takes $\Theta(n^2)$ messages per reliable broadcast, or $\Theta(n^3)$ messages per iteration. In this work, we forgo reliable broadcast to achieve asynchronous approximate agreement against $t < \frac{n}{3}$ faults with quadratic communication. In trees of diameter $D$ and maximum degree $\Delta$, we achieve edge agreement in $\lceil{6\log_2 D}\rceil$ rounds with $\mathcal{O}(n^2)$ messages of size $\mathcal{O}(\log \Delta + \log\log D)$ per round. We do this by designing a 6-round multivalued 2-graded consensus protocol, and by repeatedly using it to reduce edge agreement in a tree of diameter $D$ to edge agreement in a tree of diameter $\frac{D}{2}$. Then, we achieve edge agreement in the infinite path $\mathbb{Z}$, again with the help of 2-graded consensus. Finally, by reducing $\varepsilon$-agreement in $\mathbb{R}$ to edge agreement in $\mathbb{Z}$, we show that our edge agreement protocol enables $\varepsilon$-agreement in $\mathbb{R}$ in $6\log_2(\frac{M}{\varepsilon} + 1) + \mathcal{O}(\log \log \frac{M}{\varepsilon})$ rounds with $\mathcal{O}(n^2 \log \frac{M}{\varepsilon})$ messages and $\mathcal{O}(n^2\log \frac{M}{\varepsilon}\log \log \frac{M}{\varepsilon})$ bits of communication, where $M$ is the maximum input magnitude.
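The round and message counts in this abstract follow from simple accounting, sketched below under the assumption that each diameter-halving step costs exactly one run of the 6-round 2-graded consensus protocol. This is a rough reading of the abstract rather than the paper's proof, and it ignores rounding details.

```latex
\begin{align*}
D \;\to\; \frac{D}{2} \;\to\; \frac{D}{4} \;\to\; \cdots \;\to\; \frac{D}{2^{k}} \le 1
  \quad &\Longleftrightarrow \quad k \ge \log_2 D, \\
\text{rounds} \;&\approx\; 6 \cdot \log_2 D, \\
\text{messages per iteration with reliable broadcast} \;&=\; n \cdot \Theta(n^2) = \Theta(n^3).
\end{align*}
```

The first two lines are consistent with the stated $\lceil 6 \log_2 D \rceil$ rounds up to rounding, and the last line shows where the cubic cost of the earlier approach comes from: $n$ parties each perform one $\Theta(n^2)$-message reliable broadcast per iteration, compared with $\mathcal{O}(n^2)$ messages per round here.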
Article 98
Title@2025-05-24 (6): TEE is not a Healer: Rollback-Resistant Reliable Storage
Title: TEE is not a Healer: Rollback-Resistant Reliable Storage | TEE ist kein Heiler: Rollback-Resistent Zuverlässige Lagerung | TEE不是救治者:回击-恢复-可靠储存 2505.18648v1 |
Authors (3): Sadegh Keshavarzi, Gregory Chockler, Alexey Gotsman
Recent advances in secure hardware technologies, such as Intel SGX or ARM TrustZone, offer an opportunity to substantially reduce the costs of Byzantine fault-tolerance by placing the program code and state within a secure enclave known as a Trusted Execution Environment (TEE). However, the protection offered by a TEE only applies during program execution. Once power is switched off, the non-volatile portion of the program state becomes vulnerable to rollback attacks wherein it is undetectably reverted to an older version. In this paper, we consider a problem of implementing reliable read/write registers out of failure-prone replicas subject to state rollbacks. To this end, we introduce a new unified model that captures the multiple failure types that can affect a TEE-based system. We then establish tight bounds on the fault-tolerance of register constructions in this model for both the static case, where failure thresholds hold throughout the entire execution, and the dynamic case, where they only hold eventually. Our dynamic register emulation algorithm resolves a long-standing question of how to correctly rebuild replica state upon restart without relying on additional hardware assumptions such as trusted monotonic counters.
安全硬件技术(如Intel SGX或ARM Trust Zone)的近期进展为大幅降低Byzantine错误容忍度的成本提供了机会,将程序代码和状态置于一个称为信任执行环境(TEE)的安全飞地内,从而大大降低了Byzantine错误容忍度的成本。然而,TEE提供的保护只在程序执行期间适用。一旦电力关闭,程序状态的非挥发性部分就很容易受到反弹攻击,而这种攻击无法被察觉地恢复到旧版本。在本文中,我们考虑了执行可靠的读/写性登记册的问题。为此,我们引入了新的统一模型,捕捉到多种类型的故障,从而可能影响基于TEE的系统。我们随后为这个模型中的注册结构设置了严格的阻断性界限,即在整个执行过程中,故障阈值将维持在最后状态。我们动态登记册的模拟算法解决了一个长期存在的问题,即如何在不依赖其他可靠的硬件假设作为单一反制的情况下,在重新开始时正确重建重复状态。
Article 99
Title@2025-05-24 (6): CacheFL: Privacy-Preserving and Efficient Federated Cache Model Fine-Tuning for Vision-Language Models
Title: CacheFL: Privacy-Preserving and Efficient Federated Cache Model Fine-Tuning for Vision-Language Models | CacheFL: Datenschutzschonendes und effizientes Federated Cache Model Fine-Tuning für Vision-Language-Modelle | CACHFL: 视力和语言模型微调模型 2505.05130v2 |
Authors (5): Mengjun Yi, Hanwen Zhang, Hui Dou, Jian Zhao, Furao Shen
Large pre-trained Vision-Language Models (VLMs), such as Contrastive Language-Image Pre-training (CLIP), have exhibited remarkable zero-shot performance across various image classification tasks. Fine-tuning these models on domain-specific datasets further enhances their effectiveness for downstream applications. However, fine-tuning in cloud environments raises significant concerns regarding data security and privacy. Federated Learning (FL) offers a decentralized solution by enabling model training across local clients without centralizing sensitive data, but the high communication and computation costs of transmitting full pre-trained models during training limit its scalability. Additionally, non-Independent and Identically Distributed (non-IID) data across local clients can negatively impact model convergence and performance. To address these challenges, we propose CacheFL, a novel federated learning method that replaces traditional full model fine-tuning with lightweight cache model fine-tuning. The cache model is initialized using a class-balanced dataset generated by a generative pre-trained model, effectively mitigating the impact of non-IID data. This cache model is then distributed to local clients for fine-tuning, and the updated parameters from each client are aggregated on the server and redistributed. With the updated cache model, the classification performance of CLIP is improved after just a few epochs. By limiting the training and communication to the cache model, CacheFL significantly reduces resource demands while ensuring data privacy and security. Extensive experiments conducted on ImageNet and 10 additional datasets demonstrate that CacheFL outperforms traditional approaches in terms of classification accuracy, resource efficiency, and privacy preservation.
大型预训练视觉-语言模型(VLM),例如对比语言-图像预训练(CLIP),在各种图像分类任务中表现出显著的零样本性能。在特定领域数据集上对这些模型进行微调,可进一步提升其在下游应用中的效果。然而,在云环境中进行微调会引发对数据安全和隐私的重大担忧。联邦学习(FL)提供了一种去中心化的解决方案,无需集中敏感数据即可在本地客户端之间进行模型训练,但训练期间传输完整预训练模型所带来的高昂通信和计算成本限制了其可扩展性。此外,本地客户端之间的非独立同分布(non-IID)数据可能对模型的收敛和性能产生负面影响。为应对这些挑战,我们提出了CacheFL,这是一种新的联邦学习方法,用轻量级缓存模型微调取代传统的全模型微调。缓存模型使用由生成式预训练模型生成的类别平衡数据集进行初始化,从而有效减轻非IID数据的影响。随后,该缓存模型被分发给本地客户端进行微调,各客户端更新后的参数在服务器上聚合并重新分发。借助更新后的缓存模型,CLIP的分类性能仅需几个轮次即可得到提升。通过将训练和通信限制在缓存模型上,CacheFL在确保数据隐私和安全的同时显著降低了资源需求。在ImageNet和另外10个数据集上进行的大量实验表明,CacheFL在分类准确率、资源效率和隐私保护方面均优于传统方法。
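The CacheFL abstract says that updated cache-model parameters from each client are aggregated on the server and redistributed, without naming the aggregation rule. As a minimal sketch, assuming a FedAvg-style average weighted by local dataset size (an assumption, not something the abstract confirms), the server step could look like this. The function name and the flat-list parameter layout are hypothetical.

```python
from typing import List, Sequence


def aggregate_cache_params(
    client_params: Sequence[Sequence[float]],
    client_sizes: Sequence[int],
) -> List[float]:
    """Average client cache-model parameters, weighted by local sample count.

    Assumption: a FedAvg-style weighted mean; the source does not specify
    which aggregation rule CacheFL actually uses.
    """
    if not client_params or len(client_params) != len(client_sizes):
        raise ValueError("need one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(client_params[0])
    if any(len(p) != dim for p in client_params):
        raise ValueError("all clients must send the same number of parameters")
    return [
        sum(p[j] * n for p, n in zip(client_params, client_sizes)) / total
        for j in range(dim)
    ]


if __name__ == "__main__":
    # Two clients; the second holds three times as much local data.
    print(aggregate_cache_params([[1.0, 2.0], [3.0, 4.0]], [1, 3]))
```

Only the small cache model travels in each round here, which is the source of the communication savings the abstract describes.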
Article 100
Title@2025-05-24 (6): PacTrain: Pruning and Adaptive Sparse Gradient Compression for Efficient Collective Communication in Distributed Deep Learning
Title: PacTrain: Pruning and Adaptive Sparse Gradient Compression for Efficient Collective Communication in Distributed Deep Learning | PacTrain: Pruning and Adaptive Sparse Gradient Compression für effiziente kollektive Kommunikation im verteilten Deep Learning | PacTrain:在分布式深层学习中促进高效集体交流的审慎和适应性零散梯级压缩 2505.18563v1 |
Authors (4): Yisu Wang, Ruilong Wu, Xinjiao Li, Dirk Kutscher
Large-scale deep neural networks (DNNs) exhibit excellent performance on various tasks. As DNNs and datasets grow, distributed training becomes extremely time-consuming and demands larger clusters. A main bottleneck is the resulting gradient aggregation overhead. While gradient compression and sparse collective communication techniques are commonly employed to alleviate network load, many gradient compression schemes fail to accelerate training while also preserving accuracy. This paper introduces PacTrain, a novel framework that accelerates distributed training by combining pruning with sparse gradient compression. Active pruning of the neural network makes the model weights and gradients sparse. By ensuring that all distributed training workers share global knowledge of the gradient sparsity, we can perform lightweight compression communication without harming accuracy. We show that the PacTrain compression scheme achieves a near-optimal compression strategy while remaining compatible with the all-reduce primitive. Experimental evaluations show that PacTrain improves training throughput by 1.25 to 8.72 times compared to state-of-the-art compression-enabled systems for representative vision and language model training tasks under bandwidth-constrained conditions.
大型大型深神经网络(DNN)在各种任务中表现出色。随着DNN和数据集的增长,分布式培训变得极其耗时,要求更多的集群。主要的瓶颈是由此产生的梯度汇总管理。虽然通常使用梯度压缩和分散的集体通信技术来减轻网络负荷,但许多梯度压缩计划和分散的集体通信技术并没有在保持准确性的同时实现培训进程的加速。本文介绍了PacTrain,这是一个通过将剪裁与稀薄的梯度压缩相结合来加快分布式培训的新框架。神经网络的运行使得模型的重量和梯度变得稀少。通过确保所有分布式培训工作者对梯度宽度的全球性了解,我们可以进行轻度压缩通信,而不会损害准确性。我们表明,PacTrain压缩计划在与全部生成的原始力兼容的同时,实现了接近最佳的压缩战略。实验性评估显示,PacTrain在带宽限制的条件下,比为具有代表性的视觉和语言模型培训任务提供最先进的压缩支持的系统改进了1.25至8.72次。
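The key idea in the PacTrain abstract is that every worker knows the same sparsity pattern, so each one can send a short dense vector that stays compatible with all-reduce. The sketch below illustrates that idea only and is not PacTrain's implementation. Magnitude-based pruning is an assumption, the all-reduce is simulated by an in-process element-wise sum, and all function names are hypothetical.

```python
from typing import List, Sequence


def magnitude_mask(weights: Sequence[float], keep_ratio: float) -> List[int]:
    """Indices of the largest-magnitude weights (assumed pruning criterion)."""
    k = max(1, int(len(weights) * keep_ratio))
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    return sorted(ranked[:k])


def compress(grad: Sequence[float], kept: Sequence[int]) -> List[float]:
    """Keep only the gradient entries at the shared mask positions."""
    return [grad[i] for i in kept]


def all_reduce_sum(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Stand-in for an all-reduce: element-wise sum of equal-length vectors."""
    return [sum(values) for values in zip(*vectors)]


def decompress(compact: Sequence[float], kept: Sequence[int], size: int) -> List[float]:
    """Scatter the reduced values back into a full-size gradient."""
    full = [0.0] * size
    for index, value in zip(kept, compact):
        full[index] = value
    return full


if __name__ == "__main__":
    weights = [0.9, -0.05, 0.4, 0.01]
    kept = magnitude_mask(weights, keep_ratio=0.5)
    worker_grads = [[1.0, 0.0, 2.0, 0.0], [3.0, 0.0, -1.0, 0.0]]
    reduced = all_reduce_sum([compress(g, kept) for g in worker_grads])
    print(decompress(reduced, kept, len(weights)))
```

Because the mask is identical on every worker, the compact vectors line up position by position and no index lists need to be exchanged during the reduction.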
Article 101
Title@2025-05-24 (6): Consensus Under Adversary Majority Done Right
Title: Consensus Under Adversary Majority Done Right | Konsens unter gegnerischer Mehrheit Rechtsbeistand | 在相反多数下达成的共识 2411.01689v3 |
Authors (5): Srivatsan Sridhar, Ertem Nusret Tas, Joachim Neu, Dionysis Zindros, David Tse
A specter is haunting consensus protocols: the specter of adversary majority. Dolev and Strong in 1983 showed an early possibility for up to 99% adversaries. Yet, other works show impossibility results for adversaries above 50% under synchrony, seemingly the same setting as Dolev and Strong’s. What gives? It is high time that we pinpoint a key culprit for this ostensible contradiction: the modeling details of clients. Are the clients sleepy or always-on? Are they silent or communicating? Can validators be sleepy too? We systematize models for consensus across four dimensions (sleepy/always-on clients, silent/communicating clients, sleepy/always-on validators, and synchrony/partial-synchrony), some of which are new, and tightly characterize the achievable safety and liveness resiliences with matching possibilities and impossibilities for each of the sixteen models. To this end, we unify folklore and earlier results, and fill gaps left in the literature with new protocols and impossibility theorems.
幽灵正在笼罩着共识协议 — — 多数对手的幽灵。 Dolev and Strang在1983年显示有99%对手的早期可能性。然而,其他作品显示,在同步状态下对手超过50%的不可能结果,似乎与Dolev and Stronch’s相同。 是什么条件? 现在是时候了。 我们应找出这一表面矛盾的关键罪魁祸首:客户的模型细节。 客户是困睡还是永远沉睡? 他们是否静默还是永远沉睡? 验证者能否也沉睡? 我们系统化了四个维度的共识模式( 睡眠/ 双向客户、 沉默/ 沟通客户、 困睡/ / 双向验证器、 同步/ 部分同步/ 平衡 ) , 其中一些是新的, 严格地描述可实现的安全和活性复原力, 与16个模型中的每一种模型的相匹配可能性和不成熟性。 为此,我们统一民俗和早期的结果, 并用新的协议和不可能的理论填补文学中留下的空白。
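The sixteen models mentioned in this abstract follow from four independent two-way choices. A short enumeration makes the model space concrete. The dictionary keys below are hypothetical labels chosen for readability, while the option names come from the abstract.

```python
from itertools import product
from typing import Dict, List

# Four binary modeling dimensions listed in the abstract.
DIMENSIONS = {
    "client_availability": ("sleepy", "always-on"),
    "client_communication": ("silent", "communicating"),
    "validator_availability": ("sleepy", "always-on"),
    "network": ("synchrony", "partial-synchrony"),
}


def enumerate_models() -> List[Dict[str, str]]:
    """Return every combination of the four dimensions (2**4 = 16 models)."""
    names = list(DIMENSIONS)
    return [dict(zip(names, combo)) for combo in product(*DIMENSIONS.values())]


if __name__ == "__main__":
    for number, model in enumerate(enumerate_models(), start=1):
        print(number, model)
```

The enumeration says nothing about which resiliences are achievable in each model; that characterization is the paper's contribution.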
Article 102
Title@2025-05-24 (6): Recursive Offloading for LLM Serving in Multi-tier Networks
Title: Recursive Offloading for LLM Serving in Multi-tier Networks | Rekursives Offloading für LLM-Serving in Multi-Tier-Netzwerken | 多层网络LLM服务的递归性卸载 2505.16502v2 |
Authors (8): Zhiyuan Wu, Sheng Sun, Yuwei Wang, Min Liu, Bo Gao, Jinda Lu, Zheming Yang, Tian Wen
Heterogeneous device-edge-cloud computing infrastructures have become widely adopted in telecommunication operators and Wide Area Networks (WANs), offering multi-tier computational support for emerging intelligent services. With the rapid proliferation of Large Language Model (LLM) services, efficiently coordinating inference tasks and reducing communication overhead within these multi-tier network architectures becomes a critical deployment challenge. Existing LLM serving paradigms exhibit significant limitations: on-device deployment supports only lightweight LLMs due to hardware constraints, while cloud-centric deployment suffers from resource congestion and considerable prompt communication overhead caused by frequent service requests during peak periods. Although the model-cascading-based inference strategy adapts better to multi-tier networks, its reliance on fine-grained, manually adjusted thresholds makes it less responsive to dynamic network conditions and varying task complexities. To address these challenges, we propose RecServe, a recursive offloading framework tailored for LLM serving in multi-tier networks. RecServe integrates a task-specific hierarchical confidence evaluation mechanism that guides offloading decisions based on inferred task complexity in progressively scaled LLMs across device, edge, and cloud tiers. To further enable intelligent task routing across tiers, RecServe employs a sliding-window-based dynamic offloading strategy with quantile interpolation, enabling real-time tracking of historical confidence distributions and adaptive offloading threshold adjustments. Experiments on eight datasets demonstrate that RecServe outperforms CasServe in both service quality and communication efficiency, and reduces the communication burden by over 50% compared to centralized cloud-based serving.
电讯运营商和广域网(广域网)广泛采用高频装置-隐蔽的计算机基础设施,为新兴智能服务提供多层次的计算支持。随着大型语言模型(LLM)服务的迅速扩散,高效协调推论任务和减少这些多层网络架构内的通信间接费用已成为一项关键的部署挑战。现有的LLM服务模式显示出了巨大的局限性:由于硬件限制,在设计上部署仅支持轻量的LLMS,而以云为中心的部署则由于高峰期服务请求频繁导致的资源拥堵和相当迅速的通信负担管理而受到影响。尽管基于模型的递增质量推断战略更适应多层次网络,但依赖精细的、手工调整的阈值使其对动态网络条件和不同的任务复杂性反应较少。为了应对这些挑战,我们建议RecService,一个为LMS服务在多层网络中服务而定制的循环式卸载框架。 REService整合了一个针对具体任务的等级信任评价机制,用以指导根据不断递增的LMSLM系统在设备、边缘、云层和云层轨道上进行递增的50级递增的递增的递越级递增的递越级递增的递增的LIMLM战略,从而进一步降低了对历史级的递增的递增的递增的递增的递增的递增的递升的递增的递升的递升的递增的LMLODFTLULUTFTFTFTFTLFTLF的递增性战略。
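The RecServe abstract describes a sliding-window offloading strategy with quantile interpolation. The sketch below shows one plausible reading of that mechanism and is not the authors' code. The class name, window size, and quantile value are hypothetical, and the decision direction (offload to the next tier when confidence falls below the windowed quantile) is an assumption.

```python
from collections import deque
from typing import Optional


class SlidingQuantileThreshold:
    """Track recent confidence scores and derive an adaptive threshold."""

    def __init__(self, window_size: int, quantile: float) -> None:
        if window_size < 1 or not 0.0 <= quantile <= 1.0:
            raise ValueError("window_size >= 1 and 0 <= quantile <= 1 required")
        # deque(maxlen=...) drops the oldest score once the window is full.
        self.window = deque(maxlen=window_size)
        self.quantile = quantile

    def observe(self, confidence: float) -> None:
        self.window.append(confidence)

    def threshold(self) -> Optional[float]:
        """Quantile of the window, linearly interpolated between neighbors."""
        if not self.window:
            return None
        ordered = sorted(self.window)
        position = self.quantile * (len(ordered) - 1)
        lower = int(position)
        upper = min(lower + 1, len(ordered) - 1)
        fraction = position - lower
        return ordered[lower] + fraction * (ordered[upper] - ordered[lower])

    def should_offload(self, confidence: float) -> bool:
        """Assumed rule: offload when confidence is below the current threshold."""
        current = self.threshold()
        decision = current is not None and confidence < current
        self.observe(confidence)
        return decision


if __name__ == "__main__":
    tracker = SlidingQuantileThreshold(window_size=4, quantile=0.5)
    for score in (0.25, 0.5, 0.75, 1.0):
        tracker.observe(score)
    print(tracker.threshold())
    print(tracker.should_offload(0.3))
```

Because the threshold is recomputed from recent scores, it shifts with the workload instead of relying on the manually tuned cutoffs that the abstract criticizes in model-cascading approaches.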