cs.DC @ 2025-07-18: 106
00 07-17 (4) Just Verification of Mutual Exclusion Algorithms Nur Überprüfung der gegenseitigen Ausschlussalgorithmen 仅仅核查相互排斥的核查 2507.13198v1 -
01 07-17 Distributed Algorithms for Potential Problems Verteilte Algorithmen für Potentialprobleme 势问题的分布式算法 2507.12038v2 -
02 07-17 FedGA: A Fair Federated Learning Framework Based on the Gini Coefficient FedGA: Ein faires, auf dem Gini-Koeffizienten basierendes Federated Learning Framework FedGA:基于基尼系数的公平联邦学习框架 2507.12983v1 -
03 07-17 Autonomous Resource Management in Microservice Systems via Reinforcement Learning Autonomes Ressourcenmanagement in Mikroservice-Systemen durch Verstärkungslernen 通过加强学习,对微小服务系统进行自主资源管理 2507.12879v1 -
04 07-17 Building State Machine Replication Using Practical Network Synchrony State Machine Replication mit praktischer Netzwerksynchronie aufbauen 利用实用的网络同步构建状态机复制 2507.12792v1 -
05 07-16 (3) BootSeer: Analyzing and Mitigating Initialization Bottlenecks in Large-Scale LLM Training BootSeer: Analysieren und Abmildern von Initialisierungsengpässen im großformatigen LLM-Training BootSeer:大规模LLM培训中分析和减缓初始化瓶颈 2507.12619v1 -
06 07-16 AAPA: An Archetype-Aware Predictive Autoscaler with Uncertainty Quantification for Serverless Workloads on Kubernetes AAPA: Archetype-Aware Predictive Autoscaler mit Unsicherheitsquantifizierung für serverlose Workloads auf Kubernetes AAPA: Kubernetes 上无服务器工作载荷的不确定量化的Achtetype-Aware-Aware 预测自动标定器 2507.05653v3 -
07 07-16 Rel-HNN: Split Parallel Hypergraph Neural Network for Learning on Relational Databases Rel-HNN: Paralleles Hypergraphen-Neurales Netzwerk zum Lernen auf relationalen Datenbanken Rel-HNN: 用于在关系数据库中学习的分平行超时图神经网络 2507.12562v1 -
08 07-16 CRAFT: Latency and Cost-Aware Genetic-Based Framework for Node Placement in Edge-Fog Environments CRAFT: Latency and Cost-Aware Genetic-Based Framework for Node Placement in Edge-Fog Environments CRAFT: 边缘雾环境中节点定位的延迟和成本-软件遗传框架 2507.12445v1 -
09 07-16 Programming Distributed Collective Processes in the eXchange Calculus Programmierung verteilter kollektiver Prozesse im eXchange Calculus eXchange Calculus 中的程序编程分配集体进程 2401.11212v5 -
10 07-16 Toward Efficient SpMV in Sparse LLMs via Block Extraction and Compressed Storage Effiziente SpMV in Sparse LLMs über Blockextraktion und komprimierte Lagerung 努力通过拆解和压缩储存块状采掘和压缩储存,在稀散液溶液中实现高效渗流 2507.12205v1 -
11 07-16 FedRef: Communication-Efficient Bayesian Fine Tuning with Reference Model FedRef: Kommunikation-Effizient Bayesian Feinabstimmung mit Referenzmodell FedRef: 通信-节能贝ysian精密票,参考模型 2506.23210v2 -
12 07-16 Urban Green Governance: IoT-Driven Management and Enhancement of Urban Green Spaces in Campobasso Urban Green Governance: IoT-getriebenes Management und Verbesserung städtischer Grünflächen in Campobasso 城市绿色治理:在坎波巴索管理和加强城市绿色空间 2507.12106v1 -
13 07-16 ARRC: Explainable, Workflow-Integrated Recommender for Sustainable Resource Optimization Across the Edge-Cloud Continuum ARRC: Erklärbarer, Workflow-integrierter Recommender für nachhaltige Ressourcenoptimierung über das Edge-Cloud-Kontinuum ARRC:可解释的、可持续资源优化工作流动综合建议者 2507.12032v1 -
14 07-16 MOFCO: Mobility- and Migration-Aware Task Offloading in Three-Layer Fog Computing Environments MOFCO: Mobilitäts- und Migrations-Bewusst-Aufgaben-Offloading in drei Ebenen Fog Computing-Umgebungen MOFCO: 在三层雾化计算机环境中卸载流动和移徙软件任务 2507.12028v1 -
15 07-16 NineToothed: A Triton-Based High-Level Domain-Specific Language for Machine Learning NineToothed: Eine auf Tritonen basierende Domain-spezifische Sprache für maschinelles Lernen 九段缩略语: 机器学习用一种以三顿为基础的高层次域语言 2507.11978v1 -
16 07-16 BlockBPE: Parallel BPE Tokenization BlockBPE: Parallele BPE-Tokenisierung BlockBPE: 并行 BPE 分词 2507.11941v1 -
17 07-16 Making Serverless Computing Extensible: A Case Study of Serverless Data Analytics serverloses Rechnen erweiterbar machen: Eine Fallstudie für serverlose Datenanalytik 使无服务器的计算可扩展:无服务器数据分析案例研究 2507.11929v1 -
18 07-16 A Parallel CPU-GPU Framework for Cost-Bounded DFS with Applications to IDA* and BTS Ein paralleles CPU-GPU-Framework für kostengebundene DFS mit Anwendungen für IDA* und BTS 适用于有成本的外勤部的并行CPU-GPU框架,并适用于开发协会* 和BTS 2507.11916v1 -
19 07-16 Performance Assessment of Load Balancing Methods in Cloud Computing: Analysis of Round Robin, Equally Spread, and Throttled Strategies Using Cloud Analyst Performance Assessment of Load Balancing Methods in Cloud Computing: Analyse von Round Robin, ebenso verbreitet und gedrosselte Strategien mit Cloud Analyst 云计算中负载平衡方法绩效评估:利用云分析师分析轮罗宾、同样扩散和推力战略 2507.11899v1 -
20 07-16 Arctic Inference with Shift Parallelism: Fast and Efficient Open Source Inference System for Enterprise AI Arctic Inferenz mit Shift Parallelismus: Schnelles und effizientes Open Source Inferenzsystem für Enterprise AI 北极与转移平行主义的推论:企业AI快速有效的开放源码推断系统 2507.11830v1 -
21 07-16 Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving Proaktive Intra-GPU-Disaggregation von Prefill und Decode in LLM Serving 预填和解除LLM服务中编码的预填和分解 2507.06608v4 -
22 07-16 Symbiosis: Multi-Adapter Inference and Fine-Tuning Symbiose: Multi-Adapter-Schlussfolgerung und Feinabstimmung 共生关系:多位开发商的推断和精准调整 2507.03220v2 -
23 07-15 (2) PGT-I: Scaling Spatiotemporal GNNs with Memory-Efficient Distributed Training PGT-I: Scaling Spatiotemporal GNNs mit speichereffizienter verteilter Ausbildung PGT-I: 具有记忆有效分配培训的Splap Spatotomotial GNNs 2507.11683v1 -
24 07-15 ZKP-FedEval: Verifiable and Privacy-Preserving Federated Evaluation using Zero-Knowledge Proofs ZKP-FedEval: Überprüfbare und datenschutzschonende Federated Evaluation mit Null-Wissensnachweisen ZKP-FedEval:使用零知识证明进行可核查和隐私保护的联邦评价 2507.11649v1 -
25 07-15 Scaling the memory wall using mixed-precision – HPG-MxP on an exascale machine Skalierung der Speicherwand mit gemischter Präzision – HPG-MxP auf einer Exascale-Maschine 使用混合精度 – – HPG-MxP 在缩放机上缩放内存墙壁 2507.11512v1 -
26 07-15 Elk: Exploring the Efficiency of Inter-core Connected AI Chips with Deep Learning Compiler Techniques Elk: Erforschung der Effizienz von Intercore-vernetzten KI-Chips mit Deep Learning Compiler-Techniken Elk:探索与深学习汇编者技术一起的机构间连接的AI芯片的效率 2507.11506v1 -
27 07-15 D3FL: Data Distribution and Detrending for Robust Federated Learning in Non-linear Time-series Data D3FL: Datenverteilung und Detrending für robustes Federated Learning in nichtlinearen Zeitreihendaten D3FL:非线性时间序列数据中硬性联邦学习的数据分配和分流 2507.11471v1 -
28 07-15 Uniting the World by Dividing it: Federated Maps to Enable Spatial Applications Die Welt zu vereinen, indem man sie teilt: Gefederte Karten, um räumliche Anwendungen zu aktivieren 将世界联合起来,实现世界分化:实现空间应用的联邦地图 2507.11437v1 -
29 07-15 FLsim: A Modular and Library-Agnostic Simulation Framework for Federated Learning FLsim: Ein modulares und bibliotheks-agnostisches Simulations-Framework für Federated Learning FLsim: 联邦学习模式和图书馆-不可知模拟框架 2507.11430v1 -
30 07-15 Quantifying the Energy Consumption and Carbon Emissions of LLM Inference via Simulations Quantifizierung des Energieverbrauches und der Kohlenstoffemissionen von LLM-Inferenz über Simulationen 通过模拟对LLM推理的能源消耗量和碳排放量进行量化 2507.11417v1 -
31 07-15 Bridging Paradigms: Designing for HPC-Quantum Convergence Bridging Paradigmen: Designing für HPC-Quantum Convergence 架桥建模:设计高常PC-量统合 2503.01787v2 -
32 07-15 A new Dune grid for scalable dynamic adaptivity based on the p4est software library Ein neues Dune-Grid für skalierbare dynamische Anpassungsfähigkeit basierend auf der p4est Software-Bibliothek 基于前四级软件库的可缩放动态适适适性新 Dune 网格 2507.11386v1 -
33 07-15 DeInfoReg: A Decoupled Learning Framework for Better Training Throughput DeInfoReg: Ein entkoppelter Lernrahmen für besseren Trainingsdurchsatz DeInfoReg:一个解耦的学习框架,以提高训练吞吐量 2506.18193v2 -
34 07-15 Rise and Shine Efficiently! Tight Bounds for Adversarial Wake-up Steigen Sie auf und glänzen Sie effizient! Enge Grenzen für adversarisches Aufwachen 高效唤醒!对抗性唤醒的紧界 2410.09980v3 -
35 07-15 Cyclic Data Streaming on GPUs for Short Range Stencils Applied to Molecular Dynamics Zyklische Daten-Streaming auf GPUs für Kurzstrecken-Schablonen auf molekulare Dynamik angewendet 用于分子动态的短距离短距离电线的 GPU 的循环数据流 2507.11289v1 -
36 07-15 Deterministic Lower Bounds for $k$-Edge Connectivity in the Distributed Sketching Model Deterministische Lower Bounds für $k$-Edge Connectivity im Distributed Sketching Model 分布式切入模型中用于 $k$-Edge 连接的确定性下下界 2507.11257v1 -
37 07-15 Boosting Scientific Error-Bounded Lossy Compression through Optimized Synergistic Lossy-Lossless Orchestration Erhöhte wissenschaftliche Fehler-gebundene Lossy-Kompression durch optimierte synergistische Lossy-Lossless-Orchestertion 通过优化的同步失失无损无损管束,促进科学错误造成的损失压缩 2507.11165v1 -
38 07-15 EASTER: Embedding Aggregation-based Heterogeneous Models Training in Vertical Federated Learning EASTER: Einbettung von Aggregationsbasierten Heterogenen Modellen Training in vertikales Federated Learning EASTER:在纵向联邦学习中嵌入基于聚合的异种模式培训 2310.13367v3 -
39 07-15 Generating Dynamic Graph Algorithms for Multiple Backends for a Graph DSL Dynamische Graphenalgorithmen für mehrere Backends für eine Graph DSL generieren 为图形 DSL 生成多后端的动态图形对多个后端生成动态图形算法 2507.11094v1 -
40 07-15 MMStencil: Optimizing High-order Stencils on Multicore CPU using Matrix Unit MMStencil: Optimierung von High-Order-Stencils auf Multicore-CPU mit Matrix Unit MMStencil: 使用矩阵股优化多核心CPU高订单定级器 2507.11067v1 -
41 07-15 Efficient Federated Learning with Heterogeneous Data and Adaptive Dropout Effizientes Federated Learning mit heterogenen Daten und adaptivem Dropout 采用异种数据和适应性辍学的高效联邦学习 2507.10430v2 -
42 07-15 Arcturus: A Cloud Overlay Network for Global Accelerator with Enhanced Performance and Stability Arcturus: Ein Cloud Overlay Netzwerk für Global Accelerator mit verbesserter Leistung und Stabilität Arcturus:增强性能和稳定的全球加速器云重叠网络 2507.10928v1 -
43 07-14 (1) Stream programs are monoid homomorphisms with state Stream-Programme sind monoide Homomorphismen mit Zustand 溪流方案是单一单一的同质状态方案 2507.10799v1 -
44 07-14 Dissecting the NVIDIA Blackwell Architecture with Microbenchmarks Die NVIDIA Blackwell Architektur mit Microbenchmarks 使用微基准解析 NVIDIA Blackwell 建筑 2507.10789v1 -
45 07-14 FAFO: Over 1 million TPS on a single node running EVM while still Merkleizing every block FAFO: Mehr als 1 Million TPS auf einem einzigen Knoten, der EVM läuft, während immer noch jeder Block zusammengefügt wird 在运行 EVM 的单一节点上超过100万个TPS, 同时仍然挤压每个街区 2507.10757v1 -
46 07-14 Access Control for Information-Theoretically Secure Key-Document Stores Zugriffskontrolle für informationstheoretisch gesicherte Key-Document-Stores 信息-理论安全密钥文件库存取控制 2507.10730v1 -
47 07-14 Environmentally-Conscious Cloud Orchestration Considering Geo-Distributed Data Centers Umweltbewusste Cloud-Orchester unter Berücksichtigung von Geo-Distributed Data Centers 考虑到地理分布数据中心的无害环境云层交织式 2507.11563v1 -
48 07-14 Consensus, Inconsistency, Emergence: what’s paraconsistency got to do with it? Konsens, Inkonsistenz, Emergenz: Was hat Parakonsistenz damit zu tun? 共识、不一致性、新出现: 不一致与它有什么关系? 2507.10413v1 -
49 07-14 Zorse: Optimizing LLM Training Efficiency on Heterogeneous GPU Clusters Zorse: Optimierung der LLM-Trainingseffizienz auf heterogenen GPU-Clustern Zorse: 优化关于异基因性GPU集群的LLM培训效率 2507.10392v1 -
50 07-14 FalconFS: Distributed File System for Large-Scale Deep Learning Pipeline FalconFS: Verteiltes Dateisystem für großformatige Deep-Learning-Pipeline FalconFS:大型深层学习管道分布式文件系统 2507.10367v1 -
51 07-14 FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference FlowSpec: Kontinuierliche pipelined Spekulative Dekodierung für effiziente verteilte LLM-Inferenz 流谱:为有效分布分布的LLM 推断而持续喷射的投机性分解 2507.02620v2 -
52 07-14 Convergence of Agnostic Federated Averaging Konvergenz der agnostischen Föderierten Durchschnittswerte Agnostic Federal 波动的趋同 2507.10325v1 -
53 07-14 Cross-Timeslot Optimization for Distributed GPU Inference Using Reinforcement Learning Cross-Timeslot-Optimierung für verteilte GPU-Inferenz mittels Verstärkungslernen 利用强化学习对分布式 GPU 推断进行跨时线绘图优化 2507.10259v1 -
54 07-14 Trinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of Large Language Models Trinity-RFT: Ein allgemein angelegtes und einheitliches Rahmenwerk zur Verstärkung der Feinsteuerung großer Sprachmodelle 三一-RFT:加强大语言模式精美应用的一般目的和统一框架 2505.17826v2 -
55 07-14 Domain Borders Are There to Be Crossed With Federated Few-Shot Adaptation Domain-Grenzen gibt es mit Föderated Few-Shot-Anpassung überschritten werden 与联邦几热量适应措施交界的域域边界 2507.10160v1 -
56 07-14 Past-Future Scheduler for LLM Serving under SLA Guarantees Zukünftiger Terminplaner für LLM-Wartung im Rahmen von SLA-Garantien 根据苏丹解放军担保的LLM服务以往未来时间表表 2507.10150v1 -
57 07-14 Large-Scale Graph Building in Dynamic Environments: Low Latency and High Quality Large-Scale Graph Building in dynamischen Umgebungen: geringe Latenz und hohe Qualität 动态环境中的大比例图建设:低长期和高质量 2507.10139v1 -
58 07-14 A Model Aware AIGC Task Offloading Algorithm in IIoT Edge Computing AIGC-Aufgabe, die Algorithmen im IIoT Edge Computing ausladen IIOT 边端计算中意识到的AIGC任务卸载算法模型 2507.11560v1 -
59 07-14 ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism ElasticMM: Effiziente multimodale LLMs mit elastischer multimodaler Parallelität Elastic MM: 高效的多式多式LLMs 与 Elastic 多式平行主义一起服务 2507.10069v1 -
60 07-14 ECORE: Energy-Conscious Optimized Routing for Deep Learning Models at the Edge ECORE: Energiebewusstes optimiertes Routing für Deep-Learning-Modelle am Rand ECORE: 在边缘深层学习模型的能源普能优化运行 2507.06011v2 -
61 07-14 EAT: QoS-Aware Edge-Collaborative AIGC Task Scheduling via Attention-Guided Diffusion Reinforcement Learning EAT: QoS-Aware Edge-Collaborative AIGC-Task Scheduling über aufmerksamkeitsgeführtes Diffusions-Verstärkungs-Lernen EAT: 通过关注辅助推广强化学习安排任务 2507.10026v1 -
62 07-14 The Hitchhiker’s Guide to Programming and Optimizing Cache Coherent Heterogeneous Systems: CXL, NVLink-C2C, and AMD Infinity Fabric Der Hitchhiker-Leitfaden zur Programmierung und Optimierung von Cache-Kohärenten Heterogenen Systemen: CXL, NVLink-C2C und AMD Infinity Fabric Hitchhiker编程和优化缓存系统指南:CXL、NVLink-C2C和AMD无穷无尽 2411.02814v2 -
63 07-14 Green-LLM: Optimal Workload Allocation for Environmentally-Aware Distributed Inference Green-LLM: Optimale Arbeitslastzuteilung für umweltbewusste Distributed Inferenz Green-LLM:环境软件分布式推断的最佳工作负荷分配 2507.09942v1 -
64 07-14 PhoenixOS: Concurrent OS-level GPU Checkpoint and Restore with Validated Speculation PhoenixOS: Gleichzeitiger GPU-Checkpoint auf OS-Ebene und Wiederherstellung mit validierter Spekulation 菲尼克斯:同步的OS级GPU检查站和经验证的投机恢复 2405.12079v2 -
65 07-14 Content-Oblivious Leader Election in 2-Edge-Connected Networks Content-Offizier Leader Wahl in 2-Edge-Connected Networks 以两E连结网络进行内容清晰的领袖选举 2507.08348v2 -
66 07-14 Intelligent Task Management via Dynamic Multi-region Division in LEO Satellite Networks Intelligentes Task Management über die Division Dynamic Multi-Region in LEO-Satellitennetzen 通过低地轨道卫星网络的动态多区域司进行智能任务管理 2507.09926v1 -
67 07-14 QPET: A Versatile and Portable Quantity-of-Interest-Preservation Framework for Error-Bounded Lossy Compression QPET: Ein vielseitiges und tragbares Quantitäts-of-Interest-Preservation-Framework für fehlerbegründete Verlustkompression QPET: 差错错错错错损损压缩易容和可移动的利差数量保护框架 2412.02799v4 -
68 07-14 Module-conditioned distribution of quantum circuits Modulkonditionierte Verteilung von Quantenkreisen 量子电路的模块化配送 2501.11816v2 -
69 07-14 InstCache: A Predictive Cache for LLM Serving InstCache: Ein vorausschauender Cache für LLM Serving Instcache:LLM服务预测缓存 2411.13820v2 -
70 07-13 (7) SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving SLED: Ein spekulatives LLM-Decoding-Framework für effizientes Edge Serving SLED: 有效边缘服务投机性LLM代谢框架 2506.09397v4 -
71 07-13 TimberStrike: Dataset Reconstruction Attack Revealing Privacy Leakage in Federated Tree-Based Systems TimberStrike: Datensatz-Rekonstruktion Angriff Enthüllen der Privatsphäre Leckage in Federated Tree-Based Systems 木材三角:联邦树基系统中数据集重建攻击清除隐私渗漏 2506.07605v3 -
72 07-13 Compute Can’t Handle the Truth: Why Communication Tax Prioritizes Memory and Interconnects in Modern AI Infrastructure Berechnen kann nicht mit der Wahrheit umgehen: Warum Kommunikationssteuer das Gedächtnis und die Verbindungen in der modernen KI-Infrastruktur priorisiert 计算无法处理真相:为什么通讯税在现代AI基础设施中将记忆和相互联系放在优先地位? 2507.07223v2 -
73 07-13 Two Pareto Optimum-based Heuristic Algorithms for Minimizing Tardiness and Late Jobs in the Single Machine Flowshop Problem Zwei Pareto Optimale Heuristische Algorithmen zur Minimierung von Tardiness und Spätjobs im Single Machine Flowshop-Problem 两种基于Pareto Opptimim 的以Pareto Opptimim 为基础的在单一机器流动问题中尽量减少迟滞和迟到就业机会的优优性乘数 2409.03778v2 -
74 07-13 PromptChain: A Decentralized Web3 Architecture for Managing AI Prompts as Digital Assets PromptChain: Eine dezentralisierte Web3-Architektur zur Verwaltung von AI-Prompts als digitale Assets PromptChain:一个去中心化的 Web3 架构,用以管理作为数字资产的 AI 提示 2507.09579v1 -
75 07-13 Lightweight Federated Learning over Wireless Edge Networks Leichtes Federated Learning über drahtlose Edge-Netzwerke 对无线边缘网络进行轻量量量联邦学习 2507.09546v1 -
76 07-13 FastSet: Parallel Claim Settlement FastSet: Parallele Forderungsabrechnung FastSet:平行索赔理赔 2506.23395v3 -
77 07-13 Aequa: Fair Model Rewards in Collaborative Learning via Slimmable Networks Aequa: Faire Modellprämien im kollaborativen Lernen über schlanke Netzwerke Aequa:通过可恢复网络合作学习的公平示范奖励 2502.04850v2 -
78 07-13 SmartphoneDemocracy: Privacy-Preserving E-Voting on Decentralized Infrastructure using Novel European Identity SmartphoneDemokratie: Datenschutz-Erhaltung von E-Voting auf dezentraler Infrastruktur mit neuartiger europäischer Identität 智能民主:利用新欧洲身份对权力下放基础设施进行保护隐私电子投票 2507.09453v1 -
79 07-12 (6) Intelligent Orchestration of Distributed Large Foundation Model Inference at the Edge Intelligente Orchestrierung der verteilten Large Foundation Model Inferenz am Rande 分散在边缘的大基金会模型推断 2504.03668v3 -
80 07-12 SLIM: A Heterogeneous Accelerator for Edge Inference of Sparse Large Language Model via Adaptive Thresholding SLIM: Ein heterogener Beschleuniger für Edge Inferenz von Sparse Large Language Model über Adaptive Thresholding SLIM: 通过适应性推进控股的分散大语言模型边缘推推异异异加速器 2507.09201v1 -
81 07-11 (5) On Evaluating Performance of LLM Inference Serving Systems Zur Bewertung der Leistung von LLM-Inferenz-Serviersystemen 评价LLLM LM 推断服务系统的性能 2507.09019v1 -
82 07-11 HotSwap: Enabling Live Dependency Sharing in Serverless Computing HotSwap: Live-Abhängigkeitsfreigabe im serverlosen Rechnen aktivieren HotSwap:在无服务器计算中促进生活依赖性共享 2409.09202v3 -
83 07-11 MQFQ-Sticky: Fair Queueing For Serverless GPU Functions MQFQ-Sticky: Faire Warteschlange für serverlose GPU-Funktionen MQFQ-Sticky: 为无服务器的 GPU 函数公平排队 2507.08954v1 -
84 07-11 Carbon-Aware Workflow Scheduling with Fixed Mapping and Deadline Constraint Carbon-Aware-Workflow-Planung mit Fixed Mapping und Deadline Constraint 固定绘图和最后期限限制的碳软件工作流程调度 2507.08725v1 -
85 07-11 Reciprocating Locks Umschaltschlösser 回收锁 2501.02380v9 -
86 07-11 Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference Mind the Memory Gap: Enthüllen von GPU-Flaschenhalsen in großflächiger LLM-Inferenz 牢记记忆差距:大型批量LLM 推理中的 GPU 堆积点 2503.08311v2 -
87 07-11 Naeural AI OS – Decentralized ubiquitous computing MLOps execution engine Naeural AI OS – Dezentrale allgegenwärtige Computer MLOps Ausführungs-Engine Naeur AI OS – – 分散分散的无处不在计算 MLOPs 执行引擎 2306.08708v6 -
88 07-11 CCSS: Hardware-Accelerated RTL Simulation with Fast Combinational Logic Computing and Sequential Logic Synchronization CCSS: Hardware-beschleunigte RTL-Simulation mit schnellem kombiniertem Logic Computing und sequentieller Logic Synchronisation CCSS: 与快速组合逻辑计算和序列逻辑同步同步模拟的硬件加速式RTL模拟 2507.08406v1 -
89 07-11 Towards AI-Native RAN: An Operator’s Perspective of 6G Day 1 Standardization Auf dem Weg zu KI-Native RAN: Die Perspektive des Betreibers von 6G Tag 1 Standardisierung 面向AI-Native RAN:运营商对6G日1标准化的看法 2507.08403v1 -
90 07-11 Efficient Long Context Fine-tuning with Chunk Flow Effizientes Long Context Feinabstimmung mit Chunk Flow 与整流相配合的微调 2503.02356v3 -
91 07-11 Fast and Interactive Byzantine Fault-tolerant Web Services via Session-Based Consensus Decoupling Schnelle und interaktive Byzantinische Fehler-tolerante Web Services über Session-Based Consensus Entkopplung 通过会议共识脱钩提供快速和互动拜占庭防违约网络服务 2507.08281v1 -
92 07-10 (4) Supporting Intel(r) SGX on Multi-Package Platforms Unterstützung von Intel(r) SGX auf Multi-Package-Plattformen 支持多包平台的 Intel(r) SGX 2507.08190v1 -
93 07-10 KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling KIS-S: Ein GPU-Aware Kubernetes Inferenzsimulator mit RL-basierter Auto-Skalierung KIS- S: 带有基于 RL 自动缩放的 GPU- Aware Kubernetes 推断模拟器 2507.07932v1 -
94 07-10 Parallel CPU-GPU Execution for LLM Inference on Constrained GPUs Parallele CPU-GPU-Execution für LLM-Inferenz auf eingeschränkten GPUs LLM LLM 受控 GPU 推论的平行 CPU-GPU 执行 2506.03296v3 -
95 07-10 DiP: A Scalable, Energy-Efficient Systolic Array for Matrix Multiplication Acceleration DiP: Ein skalierbarer, energieeffizienter Systolischer Array für Matrix-Multiplikationsbeschleunigung DiP:一个可缩放的、节能的、用于加速矩阵乘法加速的节能收缩阵列阵列 2412.09709v2 -
96 07-10 Accelerating Transposed Convolutions on FPGA-based Edge Devices Beschleunigung transponierter Konvolutionen auf FPGA-basierten Edge-Geräten 加速基于 FPGA 的边缘设备的转换变速 2507.07683v1 -
97 07-10 Multi-agent Reinforcement Learning-based In-place Scaling Engine for Edge-cloud Systems Multi-Agenten-Verstärkung Learning-based In-place Scaling Engine für Edge-Cloud-Systeme 边缘球状系统内地增强引擎 2507.07671v1 -
98 07-10 Stress Monitoring in Healthcare: An Ensemble Machine Learning Framework Using Wearable Sensor Data Stressüberwachung im Gesundheitswesen: Ein Ensemble Machine Learning Framework mit tragbaren Sensordaten 保健中压力监测:使用穿戴感感应数据的综合机械学习框架 2507.07589v1 -
99 07-10 TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference TokenWeave: Effiziente Compute-Communication Overlap für verteilte LLM-Inferenz TokenWeave: 有效计算分布式LLM 推理的通信重叠 2505.11329v2 -
100 07-10 A Unified Ontology for Scalable Knowledge Graph-Driven Operational Data Analytics in High-Performance Computing Systems Eine einheitliche Ontologie für skalierbare, graphgestützte Betriebsdatenanalytik in Hochleistungs-Computing-Systemen 高性能计算系统中可缩放知识、图表驱动操作数据分析的统一本体学 2507.06107v2 -
101 07-10 Opt-GPTQ: An Optimized GPTQ Combining Sparse Attention and Quantization Techniques Opt-GPTQ: Optimierte GPTQ Kombination von Sparsen-Achtung und Quantisierungstechniken GPTQ:最佳GPTQ,将分散关注和量化技术结合起来 2505.02351v2 -
102 07-10 KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows KVFlow: Effizientes Präfix-Caching zur Beschleunigung von LLM-basierten Multiagenten-Workflows KVFlow: 为加速基于LLM的多重需要工作流程而高效预置缓存 2507.07400v1 -
103 07-10 Future Resource Bank for ISAC: Achieving Fast and Stable Win-Win Matching for Both Individuals and Coalitions Future Resource Bank for ISAC: Schnelles und stabiles Win-Win-Matching für Einzelpersonen und Koalitionen ISAC未来资源银行:实现个人和联盟的快速和稳定的双赢比对 2502.08118v5 -
104 07-10 Constraint Programming Models For Serial Batch Scheduling With Minimum Batch Size Einschränkungen Programmiermodelle für serielle Batch-Scheichung mit minimaler Batch-Größe 具有最小批量大小的连续批次排程限制编程模型 2504.08793v2 -
105 07-10 Machine Learning-driven Multiscale MD Workflows: The Mini-MuMMI Experience Mehrstufige MD-Workflows mit maschinellem Lernen: Die Mini-MuMMI-Erfahrung 由学习驱动的机械式学习驱动的多规模MD工作流程:微型MIMI经验 2507.07352v1
Article 0
Title@2025-07-17 (4): Just Verification of Mutual Exclusion Algorithms
Title: Just Verification of Mutual Exclusion Algorithms | Nur Überprüfung der gegenseitigen Ausschlussalgorithmen | 仅仅核查相互排斥的核查 2507.13198v1 |
Authors (3): Rob van Glabbeek, Bas Luttik, Myrthe Spronck
We verify the correctness of a variety of mutual exclusion algorithms through model checking. We look at algorithms where communication is via shared read/write registers, where those registers can be atomic or non-atomic. For the verification of liveness properties, it is necessary to assume a completeness criterion to eliminate spurious counterexamples. We use justness as the completeness criterion. Justness depends on a concurrency relation; we consider several such relations, modelling different assumptions on the working of the shared registers. We present executions demonstrating the violation of correctness properties by several algorithms, and in some cases suggest improvements.
我们通过模型检查验证了多种互斥算法的正确性。我们考察通过共享读/写寄存器进行通信的算法,这些寄存器可以是原子的,也可以是非原子的。为了验证活性性质,必须假设一个完备性准则以排除虚假的反例。我们采用公正性(justness)作为完备性准则。公正性依赖于一种并发关系;我们考虑了若干这样的关系,它们对共享寄存器的工作方式作出了不同的假设。我们给出了表明若干算法违反正确性性质的执行序列,并在某些情况下提出了改进建议。
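To illustrate what model checking a mutual exclusion algorithm over atomic read/write registers involves, here is a minimal explicit-state sketch. It is not the authors' tool chain or model: Peterson's two-process algorithm is chosen here only as a familiar example, the state encoding and the names `successors` and `check_mutual_exclusion` are hypothetical, and the busy-wait test is simplified to a single atomic step. The sketch checks only the safety property (both processes are never in the critical section together); liveness under justness, which the abstract focuses on, is not covered.

```python
from collections import deque

# State layout (an assumption of this sketch): (pc0, pc1, flag0, flag1, turn)
# Program counters: 0 = non-critical, 1 = flag set, 2 = waiting, 3 = critical.

def successors(state):
    """Yield every state reachable in one step of Peterson's algorithm."""
    pcs, flags, turn = list(state[:2]), list(state[2:4]), state[4]
    for i in (0, 1):
        j = 1 - i
        new_pcs, new_flags, new_turn = pcs[:], flags[:], turn
        if pcs[i] == 0:            # flag[i] := true
            new_flags[i] = True
            new_pcs[i] = 1
        elif pcs[i] == 1:          # turn := j
            new_turn = j
            new_pcs[i] = 2
        elif pcs[i] == 2:          # await (not flag[j]) or turn == i
            if flags[j] and turn == j:
                continue           # process i is blocked and takes no step
            new_pcs[i] = 3
        else:                      # leave critical section: flag[i] := false
            new_flags[i] = False
            new_pcs[i] = 0
        yield (new_pcs[0], new_pcs[1], new_flags[0], new_flags[1], new_turn)

def check_mutual_exclusion(initial=(0, 0, False, False, 0)):
    """Breadth-first search over reachable states; returns (holds, states_seen)."""
    seen = {initial}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if state[0] == 3 and state[1] == 3:   # both processes critical
            return False, len(seen)
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return True, len(seen)

holds, explored = check_mutual_exclusion()
print("mutual exclusion holds:", holds, "| reachable states:", explored)
```

Modelling non-atomic registers would require splitting each read and write into several steps with overlapping intermediate values, which this sketch deliberately leaves out.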
Article 1
Title@2025-07-17 (4): Distributed Algorithms for Potential Problems
Title: Distributed Algorithms for Potential Problems | Verteilte Algorithmen für Potentialprobleme | 势问题的分布式算法 2507.12038v2 |
Authors (6): Alkida Balliu, Thomas Boudier, Francesco d’Amore, Dennis Olivetti, Gustav Schmid, Jukka Suomela
In this work we present a fast distributed algorithm for local potential problems: these are graph problems where the task is to find a locally optimal solution where no node can unilaterally improve the utility in its local neighborhood by changing its own label. A simple example of such a problem is the task of finding a locally optimal cut, i.e., a cut where for each node at least half of its incident edges are cut edges. The distributed round complexity of locally optimal cut has been wide open; the problem is known to require $\Omega(\log n)$ rounds in the deterministic LOCAL model and $\Omega(\log \log n)$ rounds in the randomized LOCAL model, but the only known upper bound is the trivial brute-force solution of $O(n)$ rounds. Locally optimal cut in bounded-degree graphs is perhaps the simplest example of a locally checkable labeling problem for which there is still such a large gap between current upper and lower bounds. We show that in bounded-degree graphs, all local potential problems, including locally optimal cut, can be solved in $\log^{O(1)} n$ rounds, both in the deterministic and randomized LOCAL models. In particular, the deterministic round complexity of the locally optimal cut problem is now settled to $\log^{\Theta(1)} n$.
在这项工作中,我们针对局部势问题(local potential problems)提出了一种快速的分布式算法:这类图问题的任务是找到一个局部最优解,使得任何节点都无法通过单方面改变自己的标签来提高其局部邻域内的效用。一个简单的例子是寻找局部最优割,即对每个节点而言,其至少一半的关联边都是割边的割。局部最优割的分布式轮复杂度长期悬而未决:已知该问题在确定性 LOCAL 模型中需要 $\Omega(\log n)$ 轮,在随机 LOCAL 模型中需要 $\Omega(\log \log n)$ 轮,但唯一已知的上界是平凡的暴力解法所需的 $O(n)$ 轮。有界度图中的局部最优割或许是局部可检验标号问题中最简单的例子,其现有上下界之间仍存在如此大的差距。我们证明,在有界度图中,包括局部最优割在内的所有局部势问题都可以在确定性和随机 LOCAL 模型中用 $\log^{O(1)} n$ 轮求解。特别地,局部最优割问题的确定性轮复杂度现已确定为 $\log^{\Theta(1)} n$。
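The definition of a locally optimal cut can be made concrete with a small centralized sketch. This is not the distributed algorithm of the paper; it is the textbook sequential local search, shown only to illustrate why the problem is a "potential" problem: every flip strictly increases the number of cut edges, so the procedure stops after at most |E| flips. The function names and the adjacency-dict input format are assumptions of this sketch.

```python
def locally_optimal_cut(adj):
    """Sequential local search for a locally optimal cut.

    adj maps each node to the set of its neighbours (undirected, simple graph).
    Returns a dict node -> side (0 or 1).
    """
    side = {v: 0 for v in adj}
    improved = True
    while improved:
        improved = False
        for v in adj:
            cut = sum(1 for u in adj[v] if side[u] != side[v])
            if 2 * cut < len(adj[v]):
                # Fewer than half of v's edges are cut: flipping v turns its
                # uncut edges into cut edges, so the total cut size grows.
                side[v] ^= 1
                improved = True
    return side

def is_locally_optimal(adj, side):
    """Check that every node has at least half of its incident edges cut."""
    return all(
        2 * sum(1 for u in adj[v] if side[u] != side[v]) >= len(adj[v])
        for v in adj
    )

# Small example graph: a 4-cycle 0-1-2-3 with the chord 0-2.
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
cut = locally_optimal_cut(graph)
print(cut, is_locally_optimal(graph, cut))
```

The sequential version may need a long chain of dependent flips, which is what makes fast distributed algorithms for the problem nontrivial.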
Article 2
Title@2025-07-17 (4): FedGA: A Fair Federated Learning Framework Based on the Gini Coefficient
Title: FedGA: A Fair Federated Learning Framework Based on the Gini Coefficient | FedGA: Ein faires, auf dem Gini-Koeffizienten basierendes Federated Learning Framework | FedGA:基于基尼系数的公平联邦学习框架 2507.12983v1 |
Authors (1): ShanBin Liu
Fairness has emerged as one of the key challenges in federated learning. In horizontal federated settings, data heterogeneity often leads to substantial performance disparities across clients, raising concerns about equitable model behavior. To address this issue, we propose FedGA, a fairness-aware federated learning algorithm. We first employ the Gini coefficient to measure the performance disparity among clients. Based on this, we establish a relationship between the Gini coefficient $G$ and the update scale of the global model ${U_s}$, and use this relationship to adaptively determine the timing of fairness intervention. Subsequently, we dynamically adjust the aggregation weights according to the system’s real-time fairness status, enabling the global model to better incorporate information from clients with relatively poor performance. We conduct extensive experiments on the Office-Caltech-10, CIFAR-10, and Synthetic datasets. The results show that FedGA effectively improves fairness metrics such as variance and the Gini coefficient, while maintaining strong overall performance, demonstrating the effectiveness of our approach.
公平已成为联邦学习的关键挑战之一。在横向联邦环境中,数据异构性往往导致客户端之间性能的巨大差异,引起对模型行为是否公平的关切。为了解决这一问题,我们提出 FedGA,一种具有公平意识的联邦学习算法。我们首先使用基尼系数来衡量客户端之间的性能差异。在此基础上,我们建立了基尼系数 $G$ 与全局模型更新规模 ${U_s}$ 之间的关系,并利用这种关系自适应地确定公平干预的时机。随后,我们根据系统的实时公平状况动态调整聚合权重,使全局模型能够更好地纳入性能相对较差的客户端提供的信息。我们在 Office-Caltech-10、CIFAR-10 和合成数据集上进行了广泛的实验。结果显示,FedGA 有效地改进了方差和基尼系数等公平性指标,同时保持了强有力的总体性能,显示了我们方法的有效性。
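As a rough illustration of the fairness measurement described above, the sketch below computes the Gini coefficient of per-client accuracies with the standard mean-absolute-difference formula, G = Σ_i Σ_j |x_i − x_j| / (2 n² x̄). The reweighting rule that follows is not FedGA's rule, which the abstract does not spell out: the threshold, the deficit-proportional weights, and the function names are all hypothetical and only show the general idea of shifting aggregation weight toward poorly served clients once disparity is high.

```python
def gini(values):
    """Gini coefficient via the mean absolute difference (0 = perfectly equal)."""
    n = len(values)
    mean = sum(values) / n
    if mean == 0:
        return 0.0
    total_diff = sum(abs(a - b) for a in values for b in values)
    return total_diff / (2 * n * n * mean)

def aggregation_weights(accuracies, threshold=0.1):
    """Hypothetical rule: uniform weights while disparity is low, otherwise
    weights proportional to each client's accuracy deficit (1 - accuracy)."""
    n = len(accuracies)
    uniform = [1.0 / n] * n
    if gini(accuracies) <= threshold:
        return uniform
    deficits = [1.0 - a for a in accuracies]
    total = sum(deficits)
    if total == 0:
        return uniform
    return [d / total for d in deficits]

client_accuracy = [0.9, 0.9, 0.3]   # made-up numbers for illustration only
print(round(gini(client_accuracy), 4))
print([round(w, 3) for w in aggregation_weights(client_accuracy)])
```

With these made-up accuracies the disparity exceeds the assumed threshold, so the third client receives the largest share of the aggregation weight.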
Article 3
Title@2025-07-17 (4): Autonomous Resource Management in Microservice Systems via Reinforcement Learning
Title: Autonomous Resource Management in Microservice Systems via Reinforcement Learning | Autonomes Ressourcenmanagement in Mikroservice-Systemen durch Verstärkungslernen | 通过加强学习,对微小服务系统进行自主资源管理 2507.12879v1 |
Authors (6): Yujun Zou, Nia Qi, Yingnan Deng, Zhihao Xue, Ming Gong, Wuyang Zhang
This paper proposes a reinforcement learning-based method for microservice resource scheduling and optimization, aiming to address issues such as uneven resource allocation, high latency, and insufficient throughput in traditional microservice architectures. In microservice systems, as the number of services and the load increase, efficiently scheduling and allocating resources such as computing power, memory, and storage becomes a critical research challenge. To address this, the paper employs an intelligent scheduling algorithm based on reinforcement learning. Through the interaction between the agent and the environment, the resource allocation strategy is continuously optimized. In the experiments, the paper considers different resource conditions and load scenarios, evaluating the proposed method across multiple dimensions, including response time, throughput, resource utilization, and cost efficiency. The experimental results show that the reinforcement learning-based scheduling method significantly improves system response speed and throughput under low load and high concurrency conditions, while also optimizing resource utilization and reducing energy consumption. Under multi-dimensional resource conditions, the proposed method can consider multiple objectives and achieve optimized resource scheduling. Compared to traditional static resource allocation methods, the reinforcement learning model demonstrates stronger adaptability and optimization capability. It can adjust resource allocation strategies in real time, thereby maintaining good system performance in dynamically changing load and resource environments.
本文提出了一种基于强化学习的微服务资源调度与优化方法,旨在解决传统微服务架构中资源分配不均、延迟高和吞吐量不足等问题。在微服务系统中,随着服务数量和负载的增加,如何高效地调度和分配计算能力、内存和存储等资源成为一项关键的研究挑战。为此,本文采用了一种基于强化学习的智能调度算法。通过智能体与环境之间的交互,资源分配策略得到持续优化。在实验中,本文考虑了不同的资源条件和负载场景,从响应时间、吞吐量、资源利用率和成本效率等多个维度评估了所提方法。实验结果表明,基于强化学习的调度方法在低负载和高并发条件下显著提高了系统响应速度和吞吐量,同时优化了资源利用率并降低了能耗。在多维资源条件下,所提方法能够兼顾多个目标,实现优化的资源调度。与传统的静态资源分配方法相比,强化学习模型表现出更强的适应性和优化能力。它能够实时调整资源分配策略,从而在动态变化的负载和资源环境中保持良好的系统性能。
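The agent-environment loop mentioned in the abstract can be sketched with tabular Q-learning on a toy scaling problem. The abstract does not say which reinforcement learning algorithm, state space, or reward the authors use, so everything below is an assumption made for illustration: three load levels that cycle deterministically, an action that picks the replica count, and a reward that charges one unit per replica plus a penalty when the service is under-provisioned.

```python
import random

LOADS = (0, 1, 2)       # hypothetical load levels: low, medium, high
REPLICAS = (1, 2, 3)    # hypothetical actions: number of service replicas

def reward(load, replicas):
    """Toy reward: cost of 1 per replica, plus 10 if capacity is below demand."""
    sla_penalty = 10.0 if replicas < load + 1 else 0.0
    return -(sla_penalty + replicas)

def train(steps=6000, alpha=0.1, gamma=0.5, epsilon=0.2, seed=0):
    """Tabular Q-learning with an epsilon-greedy behaviour policy."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in LOADS for a in REPLICAS}
    load = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.choice(REPLICAS)                       # explore
        else:
            action = max(REPLICAS, key=lambda a: q[(load, a)])  # exploit
        next_load = (load + 1) % len(LOADS)   # assumed cyclic load pattern
        best_next = max(q[(next_load, a)] for a in REPLICAS)
        target = reward(load, action) + gamma * best_next
        q[(load, action)] += alpha * (target - q[(load, action)])
        load = next_load
    return q

def greedy_policy(q):
    """Replica count with the highest learned value for each load level."""
    return {s: max(REPLICAS, key=lambda a: q[(s, a)]) for s in LOADS}

print(greedy_policy(train()))
```

In this toy setting the learned policy provisions exactly as many replicas as the load level requires; a realistic scheduler would need a far richer state (CPU, memory, queue lengths) and a function approximator instead of a table.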
Article 4
Title@2025-07-17 (4): Building State Machine Replication Using Practical Network Synchrony
Title: Building State Machine Replication Using Practical Network Synchrony | State Machine Replication mit praktischer Netzwerksynchronie aufbauen | 利用实用的网络同步构建状态机复制 2507.12792v1 |
Authors (6): Yiliang Wan, Nitin Shivaraman, Akshaye Shenoi, Xiang Liu, Tao Luo, Jialin Li
Distributed systems, such as state machine replication, are critical infrastructures for modern applications. Practical distributed protocols make minimum assumptions about the underlying network: They typically assume a partially synchronous or fully asynchronous network model. In this work, we argue that modern data center systems can be designed to provide strong synchrony properties in the common case, where servers move in synchronous lock-step rounds. We prove this hypothesis by engineering a practical design that uses a combination of kernel-bypass network, multithreaded architecture, and loosened round length, achieving a tight round bound under 2 µs. Leveraging our engineered networks with strong synchrony, we co-design a new replication protocol, Chora. Chora exploits the network synchrony property to efficiently pipeline multiple replication instances, while allowing all replicas to propose in parallel without extra coordination. Through experiments, we show that Chora achieves 255% and 109% improvement in throughput over state-of-the-art single-leader and multi-leader protocols, respectively.
国家机器复制等分布式系统是现代应用的关键基础设施。 实用分布式协议对基础网络设定了最低假设: 它们通常假设部分同步或完全同步的网络模式。 在这项工作中, 我们争辩说, 现代数据中心系统可以设计为在常见情况下提供强大的同步特性, 服务器以同步的锁步骤运行。 我们通过设计一种实用设计来证明这一假设, 它将内核绕行网络、 多轨结构、 宽度的宽度结合起来, 从而在 2 us 下实现紧凑的圆圈绑。 我们共同设计一个新的复制协议, Chora。 Chora 利用网络同步特性来高效传输多个复制实例, 同时允许所有复制者在没有额外协调的情况下同时提出。 我们通过实验, 显示Chora 分别实现了255% 和 109% 的州级单一领导者和多领导者协议的吞吐量改善 。
Article 5
Title@2025-07-16 (3): BootSeer: Analyzing and Mitigating Initialization Bottlenecks in Large-Scale LLM Training
Title: BootSeer: Analyzing and Mitigating Initialization Bottlenecks in Large-Scale LLM Training | BootSeer: Analysieren und Abmildern von Initialisierungsengpässen im großformatigen LLM-Training | BoutSeer:大规模LLM培训中分析和减缓初始化瓶颈 2507.12619v1 |
Authors (17): Rui Li, Xiaoyun Zhi, Jinxin Chi, Menghan Yu, Lixin Huang, Jia Zhu, Weilun Zhang, Xing Ma, Wenjia Liu, Zhicheng Zhu, Daowen Luo, Zuquan Song, Xin Yin, Chao Xiang, Shuguang Wang, Wencong Xiao, Gene Cooperman
Large Language Models (LLMs) have become a cornerstone of modern AI, driving breakthroughs in natural language processing and expanding into multimodal jobs involving images, audio, and video. As with most computational software, it is important to distinguish between ordinary runtime performance and startup overhead. Prior research has focused on runtime performance: improving training efficiency and stability. This work focuses instead on the increasingly critical issue of startup overhead in training: the delay before training jobs begin execution. Startup overhead is particularly important in large, industrial-scale LLM training, where failures occur more frequently and multiple teams operate in iterative update-debug cycles. In one of our training clusters, more than 3.5% of GPU time is wasted due to startup overhead alone. In this work, we present the first in-depth characterization of LLM training startup overhead based on real production data. We analyze the components of startup cost, quantify its direct impact, and examine how it scales with job size. These insights motivate the design of BootSeer, a system-level optimization framework that addresses three primary startup bottlenecks: (a) container image loading, (b) runtime dependency installation, and (c) model checkpoint resumption. To mitigate these bottlenecks, BootSeer introduces three techniques: (a) hot block record-and-prefetch, (b) dependency snapshotting, and (c) striped HDFS-FUSE. BootSeer has been deployed in a production environment and evaluated on real LLM training workloads, demonstrating a 50% reduction in startup overhead.
大型语言模型(LLMS)已成为现代AI的基石,推动了自然语言处理的突破,并发展成涉及图像、音频和视频的多式联运工作。与大多数计算软件一样,重要的是区分普通运行时间性能和启动间接费用。先前的研究侧重于运行时间性能:提高培训效率和稳定性。这项工作侧重于培训启动间接费用这一日益紧迫的问题:培训工作开始之前的延误。在大型工业规模LMS中,启动间接费用特别重要,因为失败发生频率更高,多个团队在迭接更新-调试周期运作。在我们的培训集群中,超过3.5%的GPU时间因启动间接费用而浪费。在这项工作中,我们根据实际生产数据对LLMM培训启动间接费用的首次深入描述。我们分析了启动费用的各个组成部分,量化了其直接影响,并考察了它与工作规模的大小。这些洞察力激励了Boutseer的设计,一个系统级优化框架,解决了三个初级启动瓶颈:(a) 集装箱图像装载,(b) 启动前期依赖性工序的SBRADR(c) 启动阶段的升级、升级和升级(c) 升级升级的SBOUDFS) 和升级记录。
Article 6
Title@2025-07-16 (3): AAPA: An Archetype-Aware Predictive Autoscaler with Uncertainty Quantification for Serverless Workloads on Kubernetes
Title: AAPA: An Archetype-Aware Predictive Autoscaler with Uncertainty Quantification for Serverless Workloads on Kubernetes | AAPA: Archetype-Aware Predictive Autoscaler mit Unsicherheitsquantifizierung für serverlose Workloads auf Kubernetes | AAPA: Kubernetes 上无服务器工作载荷的不确定量化的Achtetype-Aware-Aware 预测自动标定器 2507.05653v3 |
Authors (10): Guilin Zhang, Srinivas Vippagunta, Raghavendra Nandagopal, Suchitra Raman, Jeff Xu, Marcus Pfeiffer, Shreeshankar Chatterjee, Ziqi Tan, Wulan Guo, Hailong Jiang
Serverless platforms such as Kubernetes are increasingly adopted in high-performance computing, yet autoscaling remains challenging under highly dynamic and heterogeneous workloads. Existing approaches often rely on uniform reactive policies or unconditioned predictive models, ignoring both workload semantics and prediction uncertainty. We present AAPA, an archetype-aware predictive autoscaler that classifies workloads into four behavioral patterns – SPIKE, PERIODIC, RAMP, and STATIONARY – and applies tailored scaling strategies with confidence-based adjustments. To support reproducible evaluation, we release AAPAset, a weakly labeled dataset of 300,000 Azure Functions workload windows spanning diverse patterns. AAPA reduces SLO violations by up to 50% and lowers latency by 40% compared to Kubernetes HPA, albeit at 2-8x higher resource usage under spike-dominated conditions. To assess trade-offs, we propose the Resource Efficiency Index (REI), a unified metric balancing performance, cost, and scaling smoothness. Our results demonstrate the importance of modeling workload heterogeneity and uncertainty in autoscaling design.
高性能计算越来越多地采用Kubernetes等无服务器平台,然而,在高度动态和多样的工作量下,自动扩缩仍然具有挑战性。现有方法往往依赖统一的被动政策或无附加条件的预测模型,忽视工作量的语义和预测不确定性。我们提出了AAPA,这是一个古老的自觉预测自动标尺,将工作量分为四种行为模式 – – SPIKE、SlaimIC、RAMP和STatorial – – 并应用有基于信任的调整的量身定制的缩放战略。为了支持可复制的评价,我们发布了AAPASet,这是一个有30万个Azure函数窗口标签的微弱数据集,分布在多种模式上。AAPA将违反SLO的情况比Kubernetes HPA减少高达50%,将拉伸缩率降低40%,尽管在高度集中的条件下资源利用率为2-8x高,我们提出了资源效率指数(REI),统一衡量性能、成本和增缩性。我们的成果显示了在自动计算设计中模拟工作量的高度性和不确定性的重要性。
Article 7
Title@2025-07-16 (3): Rel-HNN: Split Parallel Hypergraph Neural Network for Learning on Relational Databases
Title: Rel-HNN: Split Parallel Hypergraph Neural Network for Learning on Relational Databases | Rel-HNN: Paralleles Hypergraphen-Neurales Netzwerk zum Lernen auf relationalen Datenbanken | Rel-HNN: 用于在关系数据库中学习的分平行超时图神经网络 2507.12562v1 |
Authors (4): Md. Tanvir Alam, Md. Ahasanul Alam, Md Mahmudur Rahman, Md. Mosaddek Khan
Relational databases (RDBs) are ubiquitous in enterprise and real-world applications. Flattening the database poses challenges for deep learning models that rely on fixed-size input representations to capture relational semantics from the structured nature of relational data. Graph neural networks (GNNs) have been proposed to address this, but they often oversimplify relational structures by modeling all the tuples as monolithic nodes and ignoring intra-tuple associations. In this work, we propose a novel hypergraph-based framework called rel-HNN, which models each unique attribute-value pair as a node and each tuple as a hyperedge, enabling the capture of fine-grained intra-tuple relationships. Our approach learns explicit multi-level representations across attribute-value, tuple, and table levels. To address the scalability challenges posed by large RDBs, we further introduce a split-parallel training algorithm that leverages multi-GPU execution for efficient hypergraph learning. Extensive experiments on real-world and benchmark datasets demonstrate that rel-HNN significantly outperforms existing methods in both classification and regression tasks. Moreover, our split-parallel training achieves substantial speedups – up to 3.18x for learning on relational data and up to 2.94x for hypergraph learning – compared to conventional single-GPU execution.
在企业和现实世界应用中, 关系数据库( RDBs) 普遍存在于企业和现实世界应用中。 数据库Flattleing 数据库对依靠固定规模投入表示来从关系数据的结构性质中获取关系语义的深层次学习模式提出了挑战。 已经提议了图形神经网络(GNNS)来解决这个问题, 但是它们往往过分简化关系结构, 将所有图例都建为单一节点, 忽视学生内部的关联。 在这项工作中, 我们提议了一个新型的超光速框架, 我们称之为rel- HNNN, 以每个独特的属性值配对为节点, 以及每个图例作为超端, 以捕捉关系。 我们的方法在属性价值、 图普尔和表级别中学习明确的多层次表示。 为了应对大型区域数据库带来的可缩放挑战, 我们进一步引入了一种双向培训算法, 利用多面GPU执行来高效的超音率学习。 在现实世界和基准数据结构中进行广泛的实验, 将每个属性对应数据关系建为超高端, 。 18 对比GNNNNBS 学习系统 , 的单个和基化为新的系统, , 以新的系统化为二进化为新的系统, , 在常规数据回归中, 和基准数据格式化, 将新的系统, 向新的系统, 在常规学习中, 进行大量学习中, 和基化。
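The node and hyperedge mapping described in the rel-HNN abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_hypergraph`, the choice to key nodes by `(attribute, value)`, and the sample rows are all assumptions made for illustration.

```python
def build_hypergraph(rows):
    """Map each unique (attribute, value) pair to a node and each tuple to a hyperedge.

    Hypothetical helper: the keying scheme is an assumption, not taken from the paper.
    """
    node_ids = {}    # (attribute, value) -> integer node id
    hyperedges = []  # one list of node ids per input tuple
    for row in rows:
        edge = []
        for attr, value in row.items():
            key = (attr, value)
            if key not in node_ids:
                node_ids[key] = len(node_ids)  # assign ids in first-seen order
            edge.append(node_ids[key])
        hyperedges.append(edge)
    return node_ids, hyperedges


# Toy table with three tuples; shared values become shared nodes.
rows = [
    {"city": "Dhaka", "plan": "basic"},
    {"city": "Dhaka", "plan": "premium"},
    {"city": "Khulna", "plan": "basic"},
]
node_ids, hyperedges = build_hypergraph(rows)
```

Tuples that share an attribute value share a node, so the first two hyperedges above overlap on `("city", "Dhaka")`. This overlap is the intra-tuple and cross-tuple structure that a flattened, one-node-per-tuple graph would hide.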
Article 8
Title@2025-07-16 (3): CRAFT: Latency and Cost-Aware Genetic-Based Framework for Node Placement in Edge-Fog Environments
Title: CRAFT: Latency and Cost-Aware Genetic-Based Framework for Node Placement in Edge-Fog Environments | CRAFT: Latency and Cost-Aware Genetic-Based Framework for Node Placement in Edge-Fog Environments | CRAFT: 边缘雾环境中节点定位的延迟和成本-软件遗传框架 2507.12445v1 |
Authors (5): Soheil Mahdizadeh, Amir Mahdi Rasouli, Mohammad Pourashory, Sadra Galavani, Mohsen Ansari
Reducing latency in the Internet of Things (IoT) is a critical concern. While cloud computing facilitates communication, it falls short of meeting real-time requirements reliably. Edge and fog computing have emerged as viable solutions by positioning computing nodes closer to end users, offering lower latency and increased processing power. An edge-fog framework comprises various components, including edge and fog nodes, whose strategic placement is crucial as it directly impacts latency and system cost. This paper presents an effective and tunable node placement strategy based on a genetic algorithm to address the optimization problem of deploying edge and fog nodes. The main objective is to minimize latency and cost through optimal node placement. Simulation results demonstrate that the proposed framework achieves reductions of up to 2.77% in latency and 31.15% in cost.
降低物联网(IoT)的潜伏是一个关键问题。虽然云计算可以促进通信,但不能可靠地满足实时需求。通过将计算节点定位于接近终端用户、提供较低的潜伏力和增加处理能力,边缘计算和雾计算已成为可行的解决办法。边缘数据框架包括各种组成部分,包括边缘和雾节点,其战略定位至关重要,因为它直接影响到潜伏和系统成本。本文件介绍了基于基因算法的有效和金枪鱼结节点定位战略,以解决部署边缘和雾节点的优化问题。主要目标是通过最佳节点定位最大限度地减少潜伏和成本。模拟结果显示,拟议框架实现了2.77%的延缓率和31.15%的成本削减。
Article 9
Title@2025-07-16 (3): Programming Distributed Collective Processes in the eXchange Calculus
Title: Programming Distributed Collective Processes in the eXchange Calculus | Programmierung verteilter kollektiver Prozesse im eXchange Calculus | eXchange Calculus 中的程序编程分配集体进程 2401.11212v5 |
Authors (5): Giorgio Audrito, Roberto Casadei, Ferruccio Damiani, Gianluca Torta, Mirko Viroli
Recent trends like the Internet of Things (IoT) suggest a vision of dense and multi-scale deployments of computing devices in nearly all kinds of environments. A prominent engineering challenge revolves around programming the collective adaptive behaviour of such computational ecosystems. This requires abstractions able to capture concepts like ensembles (dynamic groups of cooperating devices) and collective tasks (joint activities carried out by ensembles). In this work, we consider collections of devices that interact with neighbours and execute in nearly-synchronised sense-compute-interact rounds, where the computation is given by a single program mapping sensing values and incoming messages to output and outgoing messages. To support programming whole computational collectives, we propose the abstraction of a distributed collective process, which can be used to define at once the ensemble formation logic and its collective task. We formalise the abstraction in the eXchange Calculus (XC), a core functional language based on neighbouring values (maps from neighbours to values) where state and interaction are handled through a single primitive, exchange, and provide a corresponding implementation in the FCPP language. Then, we exercise distributed collective processes using two case studies: multi-hop message propagation and distributed monitoring of spatial properties. Finally, we discuss the features of the abstraction and its suitability for different kinds of distributed computing applications.
在这项工作中,我们考虑与邻居发生互动的装置的集成,这些装置以近同步的感知和计算互动周期执行,计算方法是由一个单一程序绘制感测值和发送信息到输出和流出信息。为了支持整个计算集体的编程,我们提议一个分布式集体过程的抽象化,这个过程可以用来立即界定共性形成逻辑及其集体任务。我们把电子Xchange Calculus(XC)中的抽象化,这是一个基于相邻价值的核心功能语言(从邻居到价值观的图解),通过单一原始、交换处理国家和互动,并在FCPP语言中提供相应的执行。最后,我们利用两种案例研究,进行分布式集成的集体进程,并传播各种空间信息。最后,我们用两种案例研究的形式,进行集体分布式的数学特性。我们用两种案例研究来传播其空间信息。最后,我们用两种案例研究来传播空间信息。
Article 10
Title@2025-07-16 (3): Toward Efficient SpMV in Sparse LLMs via Block Extraction and Compressed Storage
Title: Toward Efficient SpMV in Sparse LLMs via Block Extraction and Compressed Storage | Effiziente SpMV in Sparse LLMs über Blockextraktion und komprimierte Lagerung | 努力通过拆解和压缩储存块状采掘和压缩储存,在稀散液溶液中实现高效渗流 2507.12205v1 |
Authors (4): Junqing Lin, Jingwei Sun, Mingge Lu, Guangzhong Sun
Sparse Matrix-Vector Multiplication (SpMV) has become a critical performance bottleneck in the local deployment of sparse Large Language Models (LLMs), where inference predominantly operates on workloads during the decoder phase with a batch size of one. Existing SpMV kernels and sparse matrix formats, originally designed for scientific computing, fail to exploit the unique structure patterns inherent in sparse LLMs, resulting in suboptimal performance and excessive storage overhead. This paper presents EC-SpMV, a GPU-optimized SpMV approach for accelerating sparse LLM inference. EC-SpMV introduces (1) a hierarchical block extraction algorithm that captures multiple granularities of block structures within sparse LLMs, and (2) a novel compressed sparse format (EC-CSR) that employs delta indexing to reduce storage overhead and enhance memory access efficiency. Evaluated on real sparse weight matrices from LLaMA and OPT models, EC-SpMV achieves up to 6.44x speedup over state-of-the-art SpMV libraries and reduces storage overhead by up to 55.4% compared to CSR.
在本地部署稀有大语言模型(LLMS)时,推论主要是在分批尺寸为1的解码阶段的工作量上进行。 原先设计用于科学计算的现有SpMV内核和稀有矩阵格式未能利用稀有的LMS所固有的独特结构模式,导致不优化性能和过度储存管理。本文介绍了EC-SpMV,一种加速稀有LMM推断的GPU优化SpMV方法。EC-SpMV采用了(1) 一种分级区块提取算法,在稀有的LMS中捕捉到块结构的多种颗粒,以及(2)一种新型的压缩稀有格式(EC-CSR),采用三角指数来减少储存管理间接费用,提高记忆存取效率。对Lama和OFMT模型的实际稀薄重矩阵进行了评价,EC-SpMV比CSR达到6.44x速度,将储量管理量降低到55.4%。
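To make the idea of delta indexing in a compressed sparse format concrete, here is a small sketch of a CSR variant that stores each column index as the gap from the previous nonzero in the same row. It is an assumption-laden illustration, not the EC-CSR format from the paper: the function names `to_delta_csr` and `spmv_delta_csr` are hypothetical, and the code runs sequentially on the CPU rather than on a GPU.

```python
def to_delta_csr(dense):
    """Build CSR arrays whose column indices are gaps from the previous nonzero in the row.

    Small gaps can fit in narrower integer types, which is the storage saving
    that delta indexing aims for (illustrative only; not the paper's EC-CSR layout).
    """
    values, col_deltas, row_ptr = [], [], [0]
    for row in dense:
        prev_col = 0
        for col, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_deltas.append(col - prev_col)  # first delta in a row is the absolute column
                prev_col = col
        row_ptr.append(len(values))
    return values, col_deltas, row_ptr


def spmv_delta_csr(values, col_deltas, row_ptr, x):
    """Compute y = A @ x, rebuilding absolute columns with a running sum per row."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc, col = 0.0, 0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            col += col_deltas[k]
            acc += values[k] * x[col]
        y.append(acc)
    return y


dense = [
    [0, 2, 0, 0, 3],
    [0, 0, 0, 0, 0],
    [1, 0, 0, 4, 0],
]
x = [1, 2, 3, 4, 5]
values, col_deltas, row_ptr = to_delta_csr(dense)
y = spmv_delta_csr(values, col_deltas, row_ptr, x)
```

The trade-off in this sketch is that absolute column positions must be rebuilt by a prefix sum within each row, so random access to a single nonzero is lost in exchange for smaller index storage.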
Article 11
Title@2025-07-16 (3): FedRef: Communication-Efficient Bayesian Fine Tuning with Reference Model
Title: FedRef: Communication-Efficient Bayesian Fine Tuning with Reference Model | FedRef: Kommunikation-Effizient Bayesian Feinabstimmung mit Referenzmodell | FedRef: 通信-节能贝ysian精密票,参考模型 2506.23210v2 |
Authors (2): Taehwan Yoon, Bongjun Choi
Federated learning (FL) is used in distributed scenarios to train artificial intelligence (AI) models while preserving users’ privacy. In a federated learning scenario, the server generally never sees users’ data. This design makes the AI training process effective in terms of data privacy. However, regarding model performance, federated AI models may not sufficiently satisfy AI users’ expectations. Furthermore, AI users have a wide range of different needs, and it is not easy to satisfy all of them. These issues can be addressed through AI model optimization, fine-tuning, or personalization to achieve optimal model performance. To address model optimization challenges, we propose reference model-based federated learning for optimal fine-tuning, which overcomes catastrophic forgetting in each round. This method is derived from Bayesian parameter-efficient transfer learning, which includes an optimal proximal term and utilizes a reference model that incorporates previous model parameters. As a result, this method achieves both high model performance and low computing cost for clients.
联邦学习(FL) 用于分布式情景, 用于培训人工智能模型,同时确保用户的隐私。在联合学习情景中,服务器一般从来不知道用户的数据。这种概念使得AI培训过程在数据隐私方面效率高。然而,关于模型性能,联合AI模型可能不能充分满足AI用户的期望。此外,AI用户有着各种各样的不同需要。满足整个用户的需要并非易事。这类问题可以通过AI模型优化、微调或个性化来解决,以实现最佳模型性能。为了应对模型优化挑战,我们建议采用基于参考模型的联合会式学习,以优化微调,克服每轮中灾难性的遗漏。这一方法来自巴伊西亚参数高效转移学习,其中包括一个最佳的精度术语,并利用一个包含先前模型参数的参考模型。因此,这种方法既能达到高模型性能,又能满足客户的低计算成本。
Article 12
Title@2025-07-16 (3): Urban Green Governance: IoT-Driven Management and Enhancement of Urban Green Spaces in Campobasso
Title: Urban Green Governance: IoT-Driven Management and Enhancement of Urban Green Spaces in Campobasso | Urban Green Governance: IoT-getriebenes Management und Verbesserung städtischer Grünflächen in Campobasso | 城市绿色治理:在坎波巴索管理和加强城市绿色空间 2507.12106v1 |
Authors (6): Antonio Salis, Gabriele Troina, Gianluca Boanelli, Marco Ottaviano, Paola Fortini, Soraya Versace
The efficient design and management of public green spaces is a key factor in promoting the health and well-being of the urban population, as emphasized by the WHO, UNEP, and EEA. These areas serve as the “green lungs” of the urban ecosystem, playing a vital role in enhancing quality of life through the provision of ecosystem services. In this context, the Smart Green City use case in Campobasso municipality, funded by the Italian Ministry of Enterprises (MIMIT), emerges as an innovative model for the sustainable management of green urban areas through the adoption of an advanced system of integrated and interoperable emerging technologies. The project integrates IoT systems and data-driven governance platforms, enabling real-time monitoring of the health status of trees and green areas via a Decision Support System (DSS). It also facilitates the collection and analysis of data from diverse sources, including weather conditions, air quality, soil moisture, and pollution levels. The resulting cloud-based platform supports holistic real-time decision making for green urban managers, technical experts, and operational staff. It enables intelligent control and management of urban green spaces using Tree Talker sensors, integrated with soil moisture and water potential monitoring systems. Thanks to predictive models based on machine learning algorithms and real-time data provided by IoT sensors, irrigation of public parks can be optimized by providing suggestions on when and how much water to apply. Customized alert layers are also activated, warning users when monitored parameters, such as soil temperature, humidity, or water potential, exceed predefined thresholds. This use case demonstrates how digitalization, IoT sensor fusion, and technological innovation can support sustainable urban governance, fostering environmental resilience and improving citizens’ quality of life.
如世卫组织、环境署和欧洲环境署所强调,高效设计和管理公共绿色空间是促进城市人口健康和福祉的一个关键因素。这些领域是城市生态系统的“绿色肺”,由于生态系统服务的提供,在提高生活质量方面发挥着至关重要的作用。在这方面,由意大利企业部(MIMIT)资助的Smart Green City在Campobasso市的“智能绿色城市使用案例”成为了通过采用先进的新式温度参数系统综合和可相互操作的技术对绿色城市地区进行可持续管理的创新模式。该项目整合了IOT系统和数据驱动的治理平台,通过决策支持系统(DSS)对树木和绿色地区的健康状况进行实时监测。它还有助于收集和分析来自不同来源的数据,包括天气条件、空气质量、土壤湿度和污染水平。由此产生的云基平台支持绿色城市管理者、技术专家和业务工作人员的全面实时决策。该项目使城市绿色空间的智能控制和管理能够利用树木谈话器传感器、与土壤湿度和水潜力监测系统进行整合,从而能够实时监测树木和绿色地区的健康状况。这还有助于通过预测性数据采集模型,同时提供以机器系统进行最优化的土壤温度监测。
Article 13
Title@2025-07-16 (3): ARRC: Explainable, Workflow-Integrated Recommender for Sustainable Resource Optimization Across the Edge-Cloud Continuum
Title: ARRC: Explainable, Workflow-Integrated Recommender for Sustainable Resource Optimization Across the Edge-Cloud Continuum | ARRC: Erklärbarer, Workflow-integrierter Recommender für nachhaltige Ressourcenoptimierung über das Edge-Cloud-Kontinuum | ARRC:可解释的、可持续资源优化工作流动综合建议者 2507.12032v1 |
Authors (5): Brian-Frederik Jahnke, René Brinkhege, Jan Peter Meyer, Daniel Tebernum, Falk Howar
Achieving sustainable, explainable, and maintainable automation for resource optimization is a core challenge across the edge-cloud continuum. Persistent overprovisioning and operational complexity often stem from heterogeneous platforms and layered abstractions, while systems lacking explainability and maintainability become fragile, impede safe recovery, and accumulate technical debt. Existing solutions are frequently reactive, limited to single abstraction layers, or require intrusive platform changes, leaving efficiency and maintainability gains unrealized. This paper addresses safe, transparent, and low-effort resource optimization in dynamic, multi-tenant edge-cloud systems, without disrupting operator workflows or increasing technical debt. We introduce ARRC, a recommender system rooted in software engineering design principles, which delivers explainable, cross-layer resource recommendations directly into operator workflows (such as tickets and GitOps pull requests). ARRC encapsulates optimization logic in specialized, auditable agents coordinated via a shared interface, supporting maintainability and extensibility through transparency and the ability to inspect both recommendations and their rationale. Empirical evaluation in a multi-region industrial deployment shows that ARRC reduces operator workload by over 50%, improves compute utilization by up to 7.7x, and maintains error rates below 5%, with most benefits achieved through incremental, operator-approved changes. This demonstrates that explainable, recommendation-based architectures can achieve sustainable efficiency and maintainability improvements at production scale. ARRC provides an empirically evaluated framework for integrating explainable, workflow-driven automation into resource management, intended to advance best practices for robust, maintainable, and transparent edge-cloud continuum platforms.
实现资源优化的可持续、可解释和可维持的自动化,是整个边缘、多度、多度、边缘和低效资源优化的核心挑战。持续的过度供给和操作复杂性,往往来自各种平台和分层抽象,而系统缺乏解释和可维持性则变得脆弱,妨碍安全恢复,积累技术债务。现有解决方案往往是被动反应的,限于单一抽象层,或需要干扰性平台变化,从而无法实现效率和可维持性收益。本文件述及动态、多度、多度、边缘、低效资源优化系统的安全、透明、透明和低效的资源优化,同时不干扰运营商工作流程或增加技术债务。我们引入一个基于软件精度设计原则的建议系统,即基于软件精细度设计原则的推荐系统,直接为运营商工作流程提供可解释的跨层资源建议(如门票和GitOps拉动请求等 ) 现有解决方案往往是被动反应的,局限于单一抽象的、单一抽象的平台,通过透明性、多区域工业部署来进行真实的评估,说明操作者工作量会减少50%以上,改进预估的预算性平台,通过操作性、最可实现的透明性、可操作性效率,通过操作性、最精确的流程化的流程化的流程化的流程改进,并维持成本化的流程,并维持成本化的流程化的流程化的流程化的架构,以达到低于的流程化的流程化的流程化的流程化的流程。
Article 14
Title@2025-07-16 (3): MOFCO: Mobility- and Migration-Aware Task Offloading in Three-Layer Fog Computing Environments
Title: MOFCO: Mobility- and Migration-Aware Task Offloading in Three-Layer Fog Computing Environments | MOFCO: Mobilitäts- und Migrations-Bewusst-Aufgaben-Offloading in drei Ebenen Fog Computing-Umgebungen | MOFCO: 在三层雾化计算机环境中卸载流动和移徙软件任务 2507.12028v1 |
Authors (3): Soheil Mahdizadeh, Elyas Oustad, Mohsen Ansari
Task offloading in three-layer fog computing environments presents a critical challenge due to user equipment (UE) mobility, which frequently triggers costly service migrations and degrades overall system performance. This paper addresses this problem by proposing MOFCO, a novel Mobility- and Migration-aware Task Offloading algorithm for Fog Computing environments. The proposed method formulates task offloading and resource allocation as a Mixed-Integer Nonlinear Programming (MINLP) problem and employs a heuristic-aided evolutionary game theory approach to solve it efficiently. To evaluate MOFCO, we simulate mobile users using SUMO, providing realistic mobility patterns. Experimental results show that MOFCO reduces system cost, defined as a combination of latency and energy consumption, by an average of 19% and up to 43% in certain scenarios compared to state-of-the-art methods.
由于用户设备(UE)的流动性,任务在三层雾计算环境中的卸载是一个严峻的挑战,因为用户设备(UE)的流动性经常引发费用高昂的服务迁移,并降低整个系统的业绩。本文件通过提议一个全新的移动和迁移意识任务卸载算法(MOFCO)来解决这个问题。拟议方法将任务卸载和资源分配作为一种混合-内向非线性编程(MINLP)问题,并采用超速辅助进化游戏理论方法来有效解决这一问题。为了评估MOFCO,我们模拟移动用户使用SUMO,提供现实的移动模式。实验结果显示,MOFCO将系统成本降低19%,在某些情景下,与最先进的方法相比,将系统成本降低43%。
Article 15
Title@2025-07-16 (3): NineToothed: A Triton-Based High-Level Domain-Specific Language for Machine Learning
Title: NineToothed: A Triton-Based High-Level Domain-Specific Language for Machine Learning | NineToothed: Eine auf Tritonen basierende Domain-spezifische Sprache für maschinelles Lernen | 九段缩略语: 机器学习用一种以三顿为基础的高层次域语言 2507.11978v1 |
Authors (4): Jiacheng Huang, Zimin Li, Yinghui Li, Haojie Wang
The emergence of deep learning domain-specific languages (DSLs) has substantially reduced the obstacles in developing high-performance, cross-platform compute kernels. However, current DSLs, such as Triton, still demand that developers possess expertise in parallel programming and expose them to many low-level details. This requirement complicates the development process and adds to the difficulty of maintaining compute kernels. Consequently, developing a new programming model that supports serial programming for deep learning workloads is crucial. This paper introduces NineToothed, a domain-specific language that offers serial semantics for machine learning programming. Through the automatic transformation of serial code into parallel code, NineToothed significantly streamlines the development process while causing minimal performance degradation. NineToothed encompasses (1) a language with tensor-oriented metaprogramming (TOM) that adopts the arrange-and-apply paradigm, enabling the expression of tiled computations without the need to manage low-level details and (2) a code generator for generating high-performance parallel code. Our evaluation results indicate that NineToothed can greatly simplify compute kernel development while maintaining performance comparable to that of Triton.
深度学习特定领域语言(DSLs)的出现大大减少了开发高性能、跨平台计算内核的障碍,然而,目前的DSL(如Triton)仍然要求开发者拥有平行编程的专门知识,并暴露于许多低层细节中。这一要求使发展进程复杂化,增加了保持计算内核的难度。因此,开发一种新的编程模式,支持为深层学习工作量编制系列方案至关重要。本文件介绍了九点特用语言,即为机器学习编程提供序列语义的域名。通过将序列代码自动转换为平行代码,九点图大大简化了发展进程,同时尽量减少性能退化。九点图包括(1) 一种语言,采用以高压为导向的元方案(TOM),采用有安排和适用的范式,能够表达拼凑的计算,而无需管理低层细节,和(2) 生成高性能平行代码的代码生成器。我们的评价结果表明,九点图可以大大简化调内核的开发,同时保持与Triton的可比较性能。
Article 16
Title@2025-07-16 (3): BlockBPE: Parallel BPE Tokenization
Title: BlockBPE: Parallel BPE Tokenization | BlockBPE: Parallele BPE-Tokenisierung | BBPE: 平行 BPE 调制 2507.11941v1 |
Authors (1): Amos You
Tokenization is a critical preprocessing step in large language model pipelines, yet widely-used implementations remain CPU-bound and suboptimal for batch inference workflows on GPU. We present BlockBPE, a parallel GPU implementation of byte-pair encoding (BPE) that achieves near linear-time complexity under realistic assumptions and is optimized for high-throughput, batch inference. Unlike existing Rust-based tokenizers such as HuggingFace Tokenizers or OpenAI’s tiktoken, whose runtimes are dominated by Regex pre-tokenization and exhibit $O(n \log n)$ runtime, BlockBPE eliminates Regex pre-tokenization, which leads to a small loss in generation quality but enables highly parallelized token merges within thread blocks, reducing overall complexity to $O(nd)$ where $d \ll n$. On high-batch inference workloads, BlockBPE achieves up to 2x higher throughput than tiktoken and 2.5x over HuggingFace Tokenizers.
在大型语言模型管道中, Tokenization 是一个关键的预处理步骤,然而,广泛使用的实施仍然是在GPU上批量推导工作流程中CPU受约束和不最优化的。我们介绍了BlockBPE,这是在现实假设下实现近线性复杂度并优化高通量和批量推导的平行的字节调编码(BPE) GPU,在现实假设下实现了近线性复杂度,而对于高通量和批量推导来说则是最佳的。与现有的基于粗路的代号,如Hugging Face Togenizers 或 OpenAI 的tiktokkeen 运行时间由Regex 预引和展示 $O(n n) 运行时间- block- BlockBE 消除了Repex 预切化(regex) 导致小量的生成质量损失,但允许在线段内高度平行的代号合并, 将总复杂性降低到$(nx) $ 。在高通量推货推量重量工作量工作量中,BBBBBPE达到比tikE达到比tiktokeface Tozers高出2x 和2.5x 。
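The $O(nd)$ bound quoted in the BlockBPE abstract can be read as $d$ merge passes, each scanning a sequence of length at most $n$. The sketch below shows that pass structure for plain greedy BPE on the CPU. It is only an illustration under assumptions: the function `bpe_encode`, the toy `ranks` table, and the sequential scan are hypothetical stand-ins, whereas the abstract describes parallel merges within GPU thread blocks.

```python
def bpe_encode(tokens, ranks):
    """Greedy BPE: repeatedly merge the adjacent pair with the lowest merge rank.

    Each pass is one linear scan to find the best pair plus one linear scan to
    merge it, so d passes over n symbols cost O(n * d) in this sketch.
    """
    tokens = list(tokens)
    while True:
        # Pass 1: find the adjacent pair with the lowest rank, if any.
        best = None
        for pair in zip(tokens, tokens[1:]):
            rank = ranks.get(pair)
            if rank is not None and (best is None or rank < ranks[best]):
                best = pair
        if best is None:
            return tokens  # no mergeable pair is left
        # Pass 2: merge every non-overlapping occurrence, left to right.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged


# Toy merge table (hypothetical): lower rank means the merge was learned earlier.
ranks = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
result = bpe_encode("lower", ranks)
```

With this toy table the input needs three passes, one per applicable merge rule, which is the sense in which the number of passes $d$ is typically much smaller than the input length $n$.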
Article 17
Title@2025-07-16 (3): Making Serverless Computing Extensible: A Case Study of Serverless Data Analytics
Title: Making Serverless Computing Extensible: A Case Study of Serverless Data Analytics | serverloses Rechnen erweiterbar machen: Eine Fallstudie für serverlose Datenanalytik | 使无服务器的计算可扩展:无服务器数据分析案例研究 2507.11929v1 |
Authors (4): Minchen Yu, Yinghao Ren, Jiamu Zhao, Jiaqi Li
Serverless computing has attracted a broad range of applications due to its ease of use and resource elasticity. However, developing serverless applications often poses a dilemma – relying on general-purpose serverless platforms can fall short of delivering satisfactory performance for complex workloads, whereas building application-specific serverless systems undermines the simplicity and generality. In this paper, we propose an extensible design principle for serverless computing. We argue that a platform should enable developers to extend system behaviors for domain-specialized optimizations while retaining a shared, easy-to-use serverless environment. We take data analytics as a representative serverless use case and realize this design principle in Proteus. Proteus introduces a novel abstraction of decision workflows, allowing developers to customize control-plane behaviors for improved application performance. Preliminary results show that Proteus’s prototype effectively optimizes analytical query execution and supports fine-grained resource sharing across diverse applications.
无服务器计算因其使用方便和资源弹性而吸引了广泛的应用。然而,开发无服务器应用往往带来进退两难 – – 依赖通用服务器平台可能无法满足复杂工作量的满意性能,而建设具体应用程序的无服务器系统则会破坏简易性和普遍性。在本文中,我们提出了无服务器计算可扩展的设计原则。我们主张平台应使开发者能够扩展用于域专用优化的系统行为,同时保留一个共享的、容易使用的无服务器环境。我们将数据分析作为无服务器的代表性使用案例,并在普罗特乌斯实现这一设计原则。Proteus引入了决策工作流程的新抽象,允许开发者定制控制-平板行为来改进应用性能。初步结果显示,普罗特乌斯的原型有效优化分析查询执行,并支持在各种应用中进行精细的共享资源。
Article 18
Title@2025-07-16 (3): A Parallel CPU-GPU Framework for Cost-Bounded DFS with Applications to IDA* and BTS
Title: A Parallel CPU-GPU Framework for Cost-Bounded DFS with Applications to IDA* and BTS | Ein paralleles CPU-GPU-Framework für kostengebundene DFS mit Anwendungen für IDA* und BTS | 适用于有成本的外勤部的并行CPU-GPU框架,并适用于开发协会* 和BTS 2507.11916v1 |
Authors (2): Ehsan Futuhi, Nathan R. Sturtevant
The rapid advancement of GPU technology has unlocked powerful parallel processing capabilities, creating new opportunities to enhance classic search algorithms. A recent successful application of GPUs is in compressing large pattern database (PDB) heuristics using neural networks while preserving heuristic admissibility. However, very few algorithms have been designed to exploit GPUs during search. Several variants of A* exist that batch GPU computations. In this paper we introduce a method for batching GPU computations in depth-first search. In particular, we describe a new cost-bounded depth-first search (CB-DFS) method that leverages the combined parallelism of modern CPUs and GPUs. This is used to create algorithms like Batch IDA*, an extension of the Iterative Deepening A* (IDA*) algorithm, or Batch BTS, an extension of Budgeted Tree Search. Our approach builds on the general approach used by Asynchronous Parallel IDA* (AIDA*), while maintaining optimality guarantees. We evaluate the approach on the 3x3 Rubik’s Cube and 4x4 sliding tile puzzle (STP), showing that GPU operations can be efficiently batched in DFS. Additionally, we conduct extensive experiments to analyze the effects of hyperparameters, neural network heuristic size, and hardware resources on performance.
GPU技术的快速进步释放了强大的平行处理能力,创造了加强经典搜索算法的新机会。最近,GPU的成功应用是利用神经网络压缩大型模式数据库(PDB)的功能,同时保持超常可接受性。然而,在搜索过程中,很少设计出利用GPU的算法。批量 GPU计算方法有几种A的变种。在本文中,我们引入了一种在深度搜索中分批 GPU计算的方法。特别是,我们描述了一种新的受成本限制的深度第一搜索(CB-DFS)方法,它利用了现代CPU和GPUs的合并平行功能。我们用这种方法创建了算法,例如 emph{Batch IDA}, 用于在搜索过程中利用透视A (ID) 算法的扩展, 或 Batch BTSTS, 预算树搜索的扩展。我们的方法以Asyncronical 平行 IDA (AID*) 所使用的一般方法为基础,同时保持最佳性保证。我们评估了3xRUPU和4 IMFDRA AS 系统运行的3x 的快速操作, 。我们评估了它能和四号的运行,可以显示GDFDRUPLA 和四号磁号的运行。
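For readers unfamiliar with the baseline, the following is a minimal sequential sketch of cost-bounded depth-first search driven by IDA*. It does not show the paper's CPU-GPU batching; the toy graph, the heuristic table, and the function names are assumptions chosen only to illustrate how the cost bound grows between iterations.

```python
import math


def cost_bounded_dfs(state, g, bound, goal, neighbors, h, path):
    """Depth-first search that prunes any node whose f = g + h exceeds the bound.

    Returns (found, value): the solution cost if found, otherwise the smallest
    f-value that exceeded the bound (the candidate for the next bound).
    """
    f = g + h(state)
    if f > bound:
        return False, f
    if state == goal:
        return True, g
    next_bound = math.inf
    for nxt, cost in neighbors(state):
        if nxt in path:  # avoid cycles along the current path
            continue
        path.append(nxt)
        found, value = cost_bounded_dfs(nxt, g + cost, bound, goal, neighbors, h, path)
        if found:
            return True, value
        path.pop()
        next_bound = min(next_bound, value)
    return False, next_bound


def ida_star(start, goal, neighbors, h):
    """Run cost-bounded DFS repeatedly, raising the bound to the smallest pruned f-value."""
    bound = h(start)
    while True:
        path = [start]
        found, value = cost_bounded_dfs(start, 0, bound, goal, neighbors, h, path)
        if found:
            return value, path
        if value == math.inf:
            return None, []  # search space exhausted without reaching the goal
        bound = value


# Toy weighted graph and an admissible heuristic (both hypothetical).
graph = {"A": [("B", 1), ("C", 4)], "B": [("C", 1), ("D", 5)], "C": [("D", 1)], "D": []}
h_table = {"A": 2, "B": 2, "C": 1, "D": 0}
cost, path = ida_star("A", "D", lambda s: graph[s], lambda s: h_table[s])
```

In this sequential form the heuristic is evaluated one state at a time. The abstract's contribution concerns grouping such evaluations into batches so a GPU-hosted heuristic can be queried efficiently, which this sketch deliberately leaves out.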
Article 19
Title@2025-07-16 (3): Performance Assessment of Load Balancing Methods in Cloud Computing: Analysis of Round Robin, Equally Spread, and Throttled Strategies Using Cloud Analyst
Title: Performance Assessment of Load Balancing Methods in Cloud Computing: Analysis of Round Robin, Equally Spread, and Throttled Strategies Using Cloud Analyst | Performance Assessment of Load Balancing Methods in Cloud Computing: Analyse von Round Robin, ebenso verbreitet und gedrosselte Strategien mit Cloud Analyst | 云计算中负载平衡方法绩效评估:利用云分析师分析轮罗宾、同样扩散和推力战略 2507.11899v1 |
Authors (1): Saeid Aghasoleymani Najafabadi
Load balancing plays a pivotal role in cloud computing, ensuring that resources are optimally allocated to maintain high service quality and operational efficiency. As workloads in cloud environments become increasingly dynamic and unpredictable, load balancing strategies are evolving from traditional static methods to more adaptive and intelligent approaches. In this study, the Cloud Analyst simulation tool was used to evaluate the performance of different load balancing algorithms under various scenarios, including both centralized and distributed resource setups. The results highlight that while the Round Robin algorithm yields slightly better processing times within a single data center, Equally Spread and Throttled techniques perform competitively, especially when network latency is considered. More importantly, when resources are distributed across multiple data centers, response times are significantly reduced, emphasizing the value of proximity and efficient load distribution. In these distributed environments, Equally Spread and Throttled algorithms not only maintain quick response times but also contribute to lower operational costs. These findings demonstrate the necessity of strategic resource placement and proactive infrastructure planning to balance performance and cost. Adopting intelligent, dynamic load balancing and resource management practices can help organizations meet evolving cloud demands, optimize costs, and maintain a competitive advantage. Continuous evaluation and integration of emerging technologies are crucial for sustaining effective and scalable cloud operations.
在云计算中,负载平衡在云计算中发挥着关键作用,确保优化分配资源以保持高服务质量和业务效率。随着云中环境的工作量越来越具有活力和不可预测,负载平衡战略正在从传统的静态方法演变为更适应性和更聪明的方法。在本研究中,云层分析器模拟工具用于评价不同情景下不同负负平衡算法的性能,包括集中和分散的资源配置。结果突出表明,虽然轮式罗宾算法在一个数据中心内产生略为更好的处理时间,但同样分散和紧凑的技术在竞争中发挥作用,特别是在考虑网络长期性时。更重要的是,在多个数据中心分配资源时,反应时间大大缩短,强调接近和高效的负载分配的价值。在这些分布式环境中,同样分散和挤动的算法不仅保持快速反应时间,而且有助于降低业务费用。这些结论表明,战略资源配置和积极主动的基础设施规划对于平衡业绩和成本的必要性。采用智能、动态的负载平衡和资源管理做法可以帮助各组织满足不断变化的云层需求,优化成本,并保持竞争优势。不断评价和整合新兴技术对于维持有效和可扩展的操作至关重要。
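The three policies compared in this abstract can be sketched in a few lines. This is a simplified illustration and not the Cloud Analyst implementation: the function names, the `active` dictionary (VM id to number of in-flight requests), and the `limit` parameter are assumptions made for the example.

```python
from itertools import cycle

def round_robin(vms):
    """Return an iterator that hands out VM ids in fixed rotation, ignoring load."""
    return cycle(vms)

def equally_spread(active):
    """Pick the VM with the fewest in-flight requests (ties go to the lowest id)."""
    return min(active, key=lambda vm: (active[vm], vm))

def throttled(active, limit):
    """Return the first VM below its request limit, or None if the request must wait."""
    for vm in sorted(active):
        if active[vm] < limit:
            return vm
    return None
```

Round Robin needs no state about the VMs, while the other two consult current load, which is one plausible reason they behave differently once network latency and multiple data centers enter the picture.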
Article 20
Title@2025-07-16 (3): Arctic Inference with Shift Parallelism: Fast and Efficient Open Source Inference System for Enterprise AI
Title: Arctic Inference with Shift Parallelism: Fast and Efficient Open Source Inference System for Enterprise AI | Arctic Inferenz mit Shift Parallelismus: Schnelles und effizientes Open Source Inferenzsystem für Enterprise AI | 北极与转移平行主义的推论:企业AI快速有效的开放源码推断系统 2507.11830v1 |
Authors (8): Samyam Rajbhandari, Mert Hidayetoglu, Aurick Qiao, Ye Wang, Juncheng Yang, Jeff Rasley, Michael Wyatt, Yuxiong He
Inference is now the dominant AI workload, yet existing systems force trade-offs between latency, throughput, and cost. Arctic Inference, an open-source vLLM plugin from Snowflake AI Research, introduces Shift Parallelism, a dynamic parallelism strategy that adapts to real-world traffic while integrating speculative decoding, SwiftKV compute reduction, and optimized embedding inference. It achieves up to 3.4 times faster request completion, 1.75 times faster generation, and 1.6M tokens/sec per GPU for embeddings, outperforming both latency- and throughput-optimized deployments. Already powering Snowflake Cortex AI, Arctic Inference delivers state-of-the-art, cost-effective inference for enterprise AI and is now available to the community.
假设是目前最大的人工智能工作量,但现有系统迫使延绳、吞吐量和成本之间的权衡。北极推论是雪花AI研究的开放源码 vLLLM 插件,引入了转移平行主义,即动态平行主义战略,适应真实世界的交通,同时整合投机性解码、SwiftKV计算减少和优化嵌入推理。它达到请求完成速度达3.4倍,更快的生成率为1.75倍,以及每个GPU的1.6M质称/秒,用于嵌入,优于延绳和吞吐-优化的部署。 已经授权的Snowflake Cortex AI,北极推论为企业人工智能提供了最新、成本效益高的推论,现在可供社区使用。
Article 21
Title@2025-07-16 (3): Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving
Title: Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving | Proaktive Intra-GPU-Disaggregation von Prefill und Decode in LLM Serving | 预填和解除LLM服务中编码的预填和分解 2507.06608v4 |
Authors (3): Xiaoxiang Shi, Colin Cai, Junjia Du
Monolithic serving with chunked prefill improves GPU utilization by batching prefill and decode together, but suffers from fine-grained phase interference. Engine-level prefill-decode (PD) disaggregation avoids interference but incurs higher hardware and coordination overhead. Prior intra-GPU disaggregation approaches multiplex prefill and decode within a single GPU, using SLO-based tuning guided by heuristics from offline profiling or reactive feedback loops. However, these methods respond reactively to performance issues rather than anticipating them, limiting adaptability under dynamic workloads. We ask: can we achieve proactive intra-GPU disaggregation that adapts effectively to dynamic workloads? The key challenge lies in managing the conflicting resource demands of prefill and decode under varying conditions. We first show that GPU resources exhibit diminishing returns – beyond a saturation point, more allocation yields minimal latency benefit. Second, we observe that memory bandwidth contention becomes a critical bottleneck. These insights motivate a design that dynamically partitions GPU resources across prefill and decode phases, while jointly considering compute capacity, memory footprint, and bandwidth contention. Evaluated on diverse LLMs and workloads, our system Nexus achieves up to 2.2x higher throughput, 20x lower TTFT, and 2.5x lower TBT than vLLM; outperforms SGLang by up to 2x; and matches or exceeds disaggregated vLLM.
采用分块预填充的单体式服务通过将预填充与解码一起批处理来提高GPU利用率,但会受到细粒度阶段干扰的影响。引擎级预填充-解码(PD)分离可避免干扰,但会带来更高的硬件和协调开销。先前的GPU内分离方法在单个GPU内复用预填充和解码,采用基于SLO的调优,并以离线剖析得到的启发式规则或被动反馈回路为指导。然而,这些方法是在性能问题出现后被动响应,而非提前预判,限制了其在动态负载下的适应性。我们提出的问题是:能否实现有效适应动态负载的主动式GPU内分离?关键挑战在于在不同条件下管理预填充和解码相互冲突的资源需求。我们首先表明,GPU资源呈现收益递减:超过饱和点后,更多的分配只能带来极小的延迟收益。其次,我们观察到内存带宽争用成为关键瓶颈。这些洞见促成了一种设计:在预填充和解码阶段之间动态划分GPU资源,同时综合考虑计算能力、内存占用和带宽争用。在多种LLM和工作负载上的评估表明,我们的系统Nexus与vLLM相比吞吐量最高提升2.2倍,TTFT最多降低20倍,TBT最多降低2.5倍;性能比SGLang最高高出2倍;并达到或超过分离式vLLM的水平。
Article 22
Title@2025-07-16 (3): Symbiosis: Multi-Adapter Inference and Fine-Tuning
Title: Symbiosis: Multi-Adapter Inference and Fine-Tuning | Symbiose: Multi-Adapter-Schlussfolgerung und Feinabstimmung | 共生关系:多位开发商的推断和精准调整 2507.03220v2 |
Authors (4): Saransh Gupta, Umesh Deshpande, Travis Janssen, Swami Sundararaman
Parameter-efficient fine-tuning (PEFT) allows model builders to capture task-specific parameters in adapters, which are a fraction of the size of the original base model. The popularity of PEFT techniques for fine-tuning has led to the creation of a large number of adapters for popular Large Language Models (LLMs). However, existing frameworks fall short in supporting inference or fine-tuning with multiple adapters in the following ways. 1) For fine-tuning, each job needs to deploy its dedicated base model instance, which results in excessive GPU memory consumption and poor GPU utilization. 2) While popular inference platforms can serve multiple PEFT adapters, they do not allow independent resource management or mixing of different PEFT methods. 3) They cannot share resources (such as a base model instance) between inference and fine-tuning jobs. 4) They do not provide privacy to users who may not wish to expose their fine-tuned parameters to service providers. In Symbiosis, we address the above problems by enabling as-a-service deployment of the base model. The base model layers can be shared across multiple inference or fine-tuning processes. Our split-execution technique decouples the execution of client-specific adapters and layers from the frozen base model layers, giving clients the flexibility to manage their resources, select their fine-tuning method, and achieve their performance goals. Our approach is transparent to models and works out-of-the-box for most models in the transformers library. Our evaluation on Llama2-13B shows that, compared to the baseline, Symbiosis can fine-tune 4X more adapters on the same set of GPUs in the same amount of time.
参数效率微调(PEFT)使模型构建者能够将任务特定参数捕捉到适应器中,这些参数是原始基准模型规模的一小部分。PEFT的普及性使广受欢迎的大语言模型(LLMS)产生大量适应器。然而,现有框架在支持与多个适应器进行下列方式的推断或微调方面做得不够。 1)微调方面,每个工作都需要将其专用基准模型实例部署到其专用基准实例中,这导致GPU内存消耗过多和GPU利用率差。 2)尽管流行的推断平台可以为多个PEFT的适应器服务,但它们不允许独立资源管理或混合不同的PEFT方法。 3 它们无法在广受欢迎的大语言模型(LLMMS)和微调工作之间共享大量资源(例如基础模型实例)。 4) 现有框架没有为可能不希望将其微调参数暴露给服务供应商的用户提供隐私。 在Symbiosiosisis,我们解决上述问题的方法是作为基准模型的升级部署。基础模型层的基模层层可共享于多个精调时间或微调过程,我们的软调化模型,从我们的软化模型可比分化程序,我们的软化系统显示, 显示它们的软化方法可以管理它们的软化系统化模型到我们的软化数据,它们用于其基础的软化数据层的软化数据。
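The idea of one frozen base model shared by many client-specific adapters can be illustrated with a toy NumPy layer. This is only a sketch under stated assumptions: it uses a LoRA-style low-rank adapter as one example of a PEFT method, and `W_base`, `base_forward`, `LoRAAdapter`, and `client_forward` are hypothetical names, not part of Symbiosis.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 8, 2                       # hidden size and adapter rank (illustrative values)
W_base = rng.normal(size=(d, d))  # frozen base weight, shared by every client

def base_forward(x):
    """Shared side: apply the frozen base layer (never updated by clients)."""
    return x @ W_base

class LoRAAdapter:
    """Client side: a small low-rank update owned and managed by one client."""
    def __init__(self, seed):
        g = np.random.default_rng(seed)
        self.A = g.normal(size=(d, r)) * 0.01
        self.B = np.zeros((r, d))  # zero init, so a fresh adapter is a no-op

    def delta(self, x):
        return x @ self.A @ self.B

def client_forward(x, adapter):
    """Combine the shared base output with the client's private adapter output."""
    return base_forward(x) + adapter.delta(x)
```

Because only `A` and `B` differ between clients, many adapters can reuse a single copy of `W_base`, which is the memory saving the abstract attributes to sharing base model layers.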
Article 23
Title@2025-07-15 (2): PGT-I: Scaling Spatiotemporal GNNs with Memory-Efficient Distributed Training
Title: PGT-I: Scaling Spatiotemporal GNNs with Memory-Efficient Distributed Training | PGT-I: Scaling Spatiotemporal GNNs mit speichereffizienter verteilter Ausbildung | PGT-I: 通过内存高效的分布式训练扩展时空 GNN 2507.11683v1 |
Authors (7): Seth Ockerman, Amal Gueroudji, Tanwi Mallick, Yixuan He, Line Pouchard, Robert Ross, Shivaram Venkataraman
Spatiotemporal graph neural networks (ST-GNNs) are powerful tools for modeling spatial and temporal data dependencies. However, their applications have been limited primarily to small-scale datasets because of memory constraints. While distributed training offers a solution, current frameworks lack support for spatiotemporal models and overlook the properties of spatiotemporal data. Informed by a scaling study on a large-scale workload, we present PyTorch Geometric Temporal Index (PGT-I), an extension to PyTorch Geometric Temporal that integrates distributed data parallel training and two novel strategies: index-batching and distributed-index-batching. Our index techniques exploit spatiotemporal structure to construct snapshots dynamically at runtime, significantly reducing memory overhead, while distributed-index-batching extends this approach by enabling scalable processing across multiple GPUs. Our techniques enable the first-ever training of an ST-GNN on the entire PeMS dataset without graph partitioning, reducing peak memory usage by up to 89% and achieving up to a 13.1x speedup over standard DDP with 128 GPUs.
Spototeal 图形神经网络(ST-GNNS)是模拟空间和时间数据依赖的强大工具,但是,由于记忆限制,其应用主要限于小型数据集;虽然分布式培训提供了解决办法,但目前的框架缺乏对时空模型的支持,忽视了时空数据的特性。通过大规模工作量的大规模研究,我们提供了PyTorch 地球物理时空指数(PGT-I),将分布式数据平行培训与两个新战略(索引加结和分布式索引加结)结合起来的PyTorch 地球时空学扩展。我们的索引技术利用时空结构在运行时动态地制作图像,显著减少记忆中位,而分布式索引加结存扩展了这一方法,使多个GPUS能够进行可缩放的处理。我们的技术使ST-GNN首次在没有图形分隔的全PEMS数据集上进行培训,将峰值记忆用率降低到89,并在128GPUS的标准D上达到13.1x速度。
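The phrase "construct snapshots dynamically at runtime" can be illustrated with a small NumPy sketch. It is an assumption about the general idea, not PGT-I's actual code: instead of materializing every sliding window up front, only the window start indices are stored and each batch is sliced from the single underlying array on demand. The function name `index_batch` and the array layout `(T, N, F)` are hypothetical.

```python
import numpy as np

def index_batch(series, starts, horizon):
    """Build input/target windows on demand from window start indices.

    series: array of shape (T, N, F) = (time steps, nodes, features).
    starts: 1-D integer array of window start positions for this batch.
    Returns x and y, each of shape (B, horizon, N, F).
    """
    offsets = np.arange(horizon)
    x = series[starts[:, None] + offsets]            # input windows
    y = series[starts[:, None] + horizon + offsets]  # windows that follow them
    return x, y
```

Only `series` and the integer `starts` array need to stay resident; overlapping windows are never duplicated in memory, which is where the saving would come from.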
Article 24
Title@2025-07-15 (2): ZKP-FedEval: Verifiable and Privacy-Preserving Federated Evaluation using Zero-Knowledge Proofs
Title: ZKP-FedEval: Verifiable and Privacy-Preserving Federated Evaluation using Zero-Knowledge Proofs | ZKP-FedEval: Überprüfbare und datenschutzschonende Federated Evaluation mit Null-Wissensnachweisen | ZKP-FedEval:使用零知识证明进行可核查和隐私保护的联邦评价 2507.11649v1 |
Authors (4): Daniel Commey, Benjamin Appiah, Griffith S. Klogo, Garth V. Crosby
Federated Learning (FL) enables collaborative model training on decentralized data without exposing raw data. However, the evaluation phase in FL may leak sensitive information through shared performance metrics. In this paper, we propose a novel protocol that incorporates Zero-Knowledge Proofs (ZKPs) to enable privacy-preserving and verifiable evaluation for FL. Instead of revealing raw loss values, clients generate a succinct proof asserting that their local loss is below a predefined threshold. Our approach is implemented without reliance on external APIs, using self-contained modules for federated learning simulation, ZKP circuit design, and experimental evaluation on both the MNIST and Human Activity Recognition (HAR) datasets. We focus on a threshold-based proof for a simple Convolutional Neural Network (CNN) model (for MNIST) and a multi-layer perceptron (MLP) model (for HAR), and evaluate the approach in terms of computational overhead, communication cost, and verifiability.
联邦学习联合会(FL)在不披露原始数据的情况下,使分散数据的合作模式培训得以进行;然而,FL的评价阶段可能通过共同的绩效衡量标准泄漏敏感信息;在本文件中,我们提议了一项新协议,纳入零知识验证(ZKPs),以便能够对FL进行隐私保护与可核查的评价。客户没有披露原始损失价值,而是提出简明证据,声称其当地损失低于预先确定的阈值。我们的方法是在不依靠外部API的情况下实施的,使用自成一体的模块进行联合学习模拟、ZKP电路设计和对MNIST和人类活动识别数据集的实验性评价。我们侧重于简单的革命神经网络(CNN)模型(针对MNIST)和多层感应(MLP)模型(针对HAR)的基于门槛的证明,并评价计算间接费用、通信成本和可核查性的方法。
Article 25
Title@2025-07-15 (2): Scaling the memory wall using mixed-precision – HPG-MxP on an exascale machine
Title: Scaling the memory wall using mixed-precision – HPG-MxP on an exascale machine | Skalierung der Speicherwand mit gemischter Präzision – HPG-MxP auf einer Exascale-Maschine | 使用混合精度 – – HPG-MxP 在缩放机上缩放内存墙壁 2507.11512v1 |
Authors (6): Aditya Kashi, Nicholson Koukpaizan, Hao Lu, Michael Matheson, Sarp Oral, Feiyi Wang
Mixed-precision algorithms have been proposed as a way for scientific computing to benefit from some of the gains seen for artificial intelligence (AI) on recent high performance computing (HPC) platforms. A few applications dominated by dense matrix operations have seen substantial speedups by utilizing low precision formats such as FP16. However, a majority of scientific simulation applications are memory bandwidth limited. Beyond preliminary studies, the practical gain from using mixed-precision algorithms on a given HPC system is largely unclear. The High Performance GMRES Mixed Precision (HPG-MxP) benchmark has been proposed to measure the useful performance of a HPC system on sparse matrix-based mixed-precision applications. In this work, we present a highly optimized implementation of the HPG-MxP benchmark for an exascale system and describe our algorithm enhancements. We show for the first time a speedup of 1.6x using a combination of double- and single-precision on modern GPU-based supercomputers.
已提出混合精密算法,作为科学计算从最近高性能计算平台人工智能(AI)取得的一些收益中获益的一种方法。一些以密集矩阵操作为主的应用程序通过使用低精度格式(如FP16),实现了大幅加速。然而,大多数科学模拟应用都是记忆带宽有限。除了初步研究外,在特定高PC系统中使用混合精度算法的实际收益在很大程度上还不清楚。提出了高性能GMRES混合精密(HPG-MxP)基准,以衡量以稀薄矩阵为基础的混合精度应用的HPC系统的有用性能。在这项工作中,我们展示了高度优化地实施超标准系统HPG-MxP基准的情况,并描述我们的算法增强。我们首次在基于现代GPU的超级计算机上使用双精度和单精度组合,显示1.6x的加速率。
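The combination of double and single precision mentioned in the abstract can be illustrated with classic mixed-precision iterative refinement. Note the substitution: the benchmark is built around GMRES on sparse matrices, whereas this sketch uses a dense direct solve as the low-precision inner solver purely to keep the example short. The function name and the iteration count are assumptions.

```python
import numpy as np

def mixed_precision_solve(A, b, iters=5):
    """Solve A x = b with float32 inner solves and float64 residuals."""
    A32 = A.astype(np.float32)           # low-precision copy used for the cheap solves
    x = np.zeros_like(b, dtype=np.float64)
    for _ in range(iters):
        r = b - A @ x                    # residual computed in double precision
        dx = np.linalg.solve(A32, r.astype(np.float32))  # correction in single precision
        x += dx.astype(np.float64)       # accumulate the solution in double precision
    return x
```

For well-conditioned systems each pass shrinks the error by roughly the single-precision accuracy, so a handful of iterations recovers a double-precision answer while most of the arithmetic and memory traffic happens on 4-byte values.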
Article 26
Title@2025-07-15 (2): Elk: Exploring the Efficiency of Inter-core Connected AI Chips with Deep Learning Compiler Techniques
Title: Elk: Exploring the Efficiency of Inter-core Connected AI Chips with Deep Learning Compiler Techniques | Elk: Erforschung der Effizienz von Intercore-vernetzten KI-Chips mit Deep Learning Compiler-Techniken | Elk:探索与深学习汇编者技术一起的机构间连接的AI芯片的效率 2507.11506v1 |
Authors (5): Yiqi Liu, Yuqi Xue, Noelle Crawford, Jilong Xue, Jian Huang
To meet the increasing demand of deep learning (DL) models, AI chips are employing both off-chip memory (e.g., HBM) and high-bandwidth low-latency interconnect for direct inter-core data exchange. However, it is not easy to explore the efficiency of these inter-core connected AI (ICCA) chips, due to a fundamental tussle among compute (per-core execution), communication (inter-core data exchange), and I/O (off-chip data access). In this paper, we develop Elk, a DL compiler framework to maximize the efficiency of ICCA chips by jointly trading off all the three performance factors discussed above. Elk structures these performance factors into configurable parameters and forms a global trade-off space in the DL compiler. To systematically explore this space and maximize overall efficiency, Elk employs a new inductive operator scheduling policy and a cost-aware on-chip memory allocation algorithm. It generates globally optimized execution plans that best overlap off-chip data loading and on-chip execution. To examine the efficiency of Elk, we build a full-fledged emulator based on a real ICCA chip IPU-POD4, and an ICCA chip simulator for sensitivity analysis with different interconnect network topologies. Elk achieves 94% of the ideal roofline performance of ICCA chips on average, showing the benefits of supporting large DL models on ICCA chips. We also show Elk’s capability of enabling architecture design space exploration for new ICCA chip development.
为了满足深层次学习(DL)模式日益增长的需求,AI芯片正在使用离芯存储(例如,HBM)和高带宽低的低频互连以进行核心数据直接交换;然而,由于在计算(核心执行)、通信(核心数据交换)和I/O(离芯数据存取)之间有一个基本的轨迹,探索这些核心连接的AI(ICCA)芯片的效率并非易事,因为在计算(核心执行)、通信(核心数据交换)和I/O(离芯数据存取)之间有一个基本的轨迹;在本文件中,我们开发了Elk(DL)编译框架,通过将上文讨论的所有三种性能要素联合交换,最大限度地提高ICA芯片的效率;Elk将这些性能因素构建成可配置参数,并在DL汇编中形成一个全球交易空间。为了系统探索这一空间并最大限度地提高总体效率,Elk采用了一种新的诱导操作员时间排定时的内存存储算算算算算法;它产生全球最佳的执行计划,以便最佳地将离芯片数据装载和芯片执行重叠。为了支持El-CCC的深度数据装装数据装装的系统化结构的大型结构的升级的升级,我们还在展示的系统内部分析中,我们建立一个完整的结构的升级的升级的升级的升级的升级的系统内部分析,以展示的升级的升级的升级的升级的升级的升级的升级的系统。
Article 27
Title@2025-07-15 (2): D3FL: Data Distribution and Detrending for Robust Federated Learning in Non-linear Time-series Data
Title: D3FL: Data Distribution and Detrending for Robust Federated Learning in Non-linear Time-series Data | D3FL: Datenverteilung und Detrending für robustes Federated Learning in nichtlinearen Zeitreihendaten | D3FL:非线性时间序列数据中硬性联邦学习的数据分配和分流 2507.11471v1 |
Authors (3): Harsha Varun Marisetty, Manik Gupta, Yogesh Simmhan
With advancements in computing and communication technologies, the Internet of Things (IoT) has seen significant growth. IoT devices typically collect data from various sensors, such as temperature, humidity, and energy meters. Much of this data is temporal in nature. Traditionally, data from IoT devices is centralized for analysis, but this approach introduces delays and increased communication costs. Federated learning (FL) has emerged as an effective alternative, allowing for model training across distributed devices without the need to centralize data. In many applications, such as smart home energy and environmental monitoring, the data collected by IoT devices across different locations can exhibit significant variation in trends and seasonal patterns. Accurately forecasting such non-stationary, non-linear time-series data is crucial for applications like energy consumption estimation and weather forecasting. However, these data variations can severely impact prediction accuracy. The key contributions of this paper are: (1) Investigating how non-linear, non-stationary time-series data distributions, like generalized extreme value (gen-extreme) and log norm distributions, affect FL performance. (2) Analyzing how different detrending techniques for non-linear time-series data influence the forecasting model’s performance in a FL setup. We generated several synthetic time-series datasets using non-linear data distributions and trained an LSTM-based forecasting model using both centralized and FL approaches. Additionally, we evaluated the impact of detrending on real-world datasets with non-linear time-series data distributions. Our experimental results show that: (1) FL performs worse than centralized approaches when dealing with non-linear data distributions. (2) The use of appropriate detrending techniques improves FL performance, reducing loss across different data distributions.
随着计算和通信技术的进步,Things(IoT)互联网有了显著的发展。 IoT设备通常从温度、湿度和能源测量仪等各种传感器收集数据,这些数据大多具有时间性质。传统上,IoT设备的数据集中用于分析,但这种方法造成延迟和通信成本增加。Falde (FL)已经成为一种有效的替代方法,允许在分布式设备之间进行示范培训,而无需集中数据。在许多应用中,如智能家庭能源和环境监测,IoT设备在不同地点收集的数据可能显示趋势和季节模式的显著变化。准确预测这种非固定、非线性的时间序列数据对于能源消费估计和天气预报等应用至关重要。这些数据变化会严重影响预测准确性。 这份论文的主要贡献是:(1) 调查非线性、非静止时间序列数据分布,如通用的极端值(gen-extreme)和日志标准分布,会影响FL的绩效。(2) 分析在使用非线性数据预测的不连续性数据分布中,如何以不连续性数据流方式进行不同的分流分配。
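One simple detrending technique of the kind the abstract refers to is subtracting a fitted polynomial trend before training and adding it back after forecasting. The abstract does not say which techniques were evaluated, so this is only an illustrative sketch; `detrend_poly` and its `degree` parameter are assumptions.

```python
import numpy as np

def detrend_poly(y, degree=1):
    """Fit a polynomial trend over the time index and remove it.

    Returns (residual, trend) so the trend can be restored after forecasting.
    """
    t = np.arange(len(y), dtype=float)
    coeffs = np.polyfit(t, y, degree)  # least-squares polynomial fit
    trend = np.polyval(coeffs, t)
    return y - trend, trend
```

In a federated setup each client could apply such a step locally, so that the shared model sees residuals on a comparable scale even when clients' raw trends differ.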
Article 28
Title@2025-07-15 (2): Uniting the World by Dividing it: Federated Maps to Enable Spatial Applications
Title: Uniting the World by Dividing it: Federated Maps to Enable Spatial Applications | Die Welt zu vereinen, indem man sie teilt: Gefederte Karten, um räumliche Anwendungen zu aktivieren | 将世界联合起来,实现世界分化:实现空间应用的联邦地图 2507.11437v1 |
Authors (3): Sagar Bharadwaj, Srinivasan Seshan, Anthony Rowe
The emergence of the Spatial Web – the Web where content is tied to real-world locations – has the potential to improve and enable many applications such as augmented reality, navigation, robotics, and more. The Spatial Web is missing a key ingredient that is impeding its growth – a spatial naming system to resolve real-world locations to names. Today’s spatial naming systems are digital maps such as Google and Apple maps. These maps and the location-based services provided on top of them are primarily controlled by a few large corporations and mostly cover outdoor public spaces. Emerging classes of applications, such as persistent world-scale augmented reality, require detailed maps of both outdoor and indoor spaces. Existing centralized mapping infrastructures are proving insufficient for such applications because of the scale of cartography efforts required and the privacy of indoor map data. In this paper, we present a case for a federated spatial naming system, or in other words, a federated mapping infrastructure. This enables disparate parties to manage and serve their own maps of physical regions and unlocks scalability of map management, isolation and privacy of maps. Map-related services such as address-to-location mapping, location-based search, and routing need re-architecting to work on federated maps. We discuss some essential services and practicalities of enabling these services.
空间网的出现 – – 内容与现实世界位置相连的网络有潜力改进和促成许多应用,例如扩大现实、导航、机器人等等。空间网缺少一个阻碍其成长的关键成分 – – 解决真实世界位置命名名称的空间命名系统。今天的空间命名系统是谷歌和苹果地图等数字地图。这些地图和在这些地图上提供的基于位置的服务主要由几家大公司控制,而且大多覆盖户外公共空间。新的应用类别,如持续的世界规模扩大的现实,需要详细的户外空间和室内空间地图。现有的中央制图基础设施证明不足以满足这些应用,因为需要大量制图工作和室内地图数据的隐私。在本文中,我们举出了联邦空间命名系统,或换句话说,是联合绘图基础设施。这使差异方能够管理和服务自己的物理区域地图,并解开地图管理、孤立和隐私的缩放性。与地图有关的服务,例如地址到定位制图、基于位置的搜索、基于位置的搜索和基于互联网的数据数据数据的隐私。我们讨论这些基本的地图服务。
Article 29
Title@2025-07-15 (2): FLsim: A Modular and Library-Agnostic Simulation Framework for Federated Learning
Title: FLsim: A Modular and Library-Agnostic Simulation Framework for Federated Learning | FLsim: Ein modulares und bibliotheks-agnostisches Simulations-Framework für Federated Learning | FLsim: 联邦学习模式和图书馆-不可知模拟框架 2507.11430v1 |
Authors (3): Arnab Mukherjee, Raju Halder, Joydeep Chandra
Federated Learning (FL) has undergone significant development since its inception in 2016, advancing from basic algorithms to complex methodologies tailored to address diverse challenges and use cases. However, research and benchmarking of novel FL techniques against a plethora of established state-of-the-art solutions remain challenging. To streamline this process, we introduce FLsim, a comprehensive FL simulation framework designed to meet the diverse requirements of FL workflows in the literature. FLsim is characterized by its modularity, scalability, resource efficiency, and controlled reproducibility of experimental outcomes. Its easy-to-use interface allows users to specify customized FL requirements through job configuration, which supports: (a) customized data distributions, ranging from non-independent and identically distributed (non-iid) data to independent and identically distributed (iid) data, (b) selection of local learning algorithms according to user preferences, with complete agnosticism to ML libraries, (c) choice of network topology illustrating communication patterns among nodes, (d) definition of model aggregation and consensus algorithms, and (e) pluggable blockchain support for enhanced robustness. Through a series of experimental evaluations, we demonstrate the effectiveness and versatility of FLsim in simulating a diverse range of state-of-the-art FL experiments. We envisage that FLsim would mark a significant advancement in FL simulation frameworks, offering unprecedented flexibility and functionality for researchers and practitioners alike.
联邦学习联合会(FL)自2016年成立以来经历了重大发展,从基本算法向适应不同挑战和使用案例的复杂方法迈进,从基本算法向适应不同挑战和使用案例的复杂方法推进;然而,针对大量既定最新解决方案对新的FL技术进行研究和制定基准仍然具有挑战性;为简化这一进程,我们引入FLsim,这是一个全面的FL模拟框架,旨在满足文献中FL工作流程的不同要求;FLSim的特点是模块性、可缩放性、资源效率以及实验结果的可控再复制性;界面易于使用,使用户能够通过工作配置指定定制的FL要求,从而支持:(a) 定制数据分发,从非独立和同样分布的数据到独立和同样分布的数据;(d) FL 数据;(b) 根据用户偏好选择本地学习联合会的学习算法,对ML图书馆进行完全的标识性;(c) 选择网络地形学,说明各节之间的通信模式,(d) 定义模型组合和协商一致算法,以及(e) 支持可插入的链段,以强化的FL型灵活性。
Article 30
Title@2025-07-15 (2): Quantifying the Energy Consumption and Carbon Emissions of LLM Inference via Simulations
Title: Quantifying the Energy Consumption and Carbon Emissions of LLM Inference via Simulations | Quantifizierung des Energieverbrauches und der Kohlenstoffemissionen von LLM-Inferenz über Simulationen | 通过模拟对LLM推理的能源消耗量和碳排放量进行量化 2507.11417v1 |
Authors (4): Miray Özcan, Philipp Wiesner, Philipp Weiß, Odej Kao
The environmental impact of Large Language Models (LLMs) is rising significantly, with inference now accounting for more than half of their total lifecycle carbon emissions. However, existing simulation frameworks, which are increasingly used to determine efficient LLM deployments, lack any concept of power and, therefore, cannot accurately estimate inference-related emissions. We present a simulation framework to assess the energy and carbon implications of LLM inference under varying deployment setups. First, we extend a high-fidelity LLM inference simulator with a GPU power model that estimates power consumption based on utilization metrics, enabling analysis across configurations like batch size, sequence length, and model parallelism. Second, we integrate simulation outputs into an energy system co-simulation environment to quantify carbon emissions under specific grid conditions and explore the potential of carbon-aware scheduling. Through scenario-based analysis, our framework reveals how inference parameters affect energy demand and carbon footprint, demonstrates a renewable offset potential of up to 69.2% in an illustrative deployment case, and provides a foundation for future carbon-aware inference infrastructure design.
大型语言模型(LLMS)的环境影响正在大幅上升,目前推断值占其生命周期碳排放总量的一半以上,然而,现有的模拟框架越来越多地用于确定高效率的LLM部署,缺乏任何电力概念,因此无法准确估计与推论有关的排放量。我们提出了一个模拟框架,用以评估不同部署设置下LLM推理的能源和碳影响。首先,我们推广了一个高纤维性LLM推理模拟器,采用一种GPU动力模型,根据使用指标估算电力消耗量,进行分批量大小、序列长度和模型平行式等组合的赋能分析。第二,我们将模拟产出纳入能源系统共同模拟环境,以便在特定电网条件下量化碳排放,并探索碳对碳的排期潜力。通过基于情景的分析,我们的框架揭示了推理参数如何影响能源需求和碳足迹,在示范性部署案例中显示了高达69.2%的可再生抵消潜力,并为未来的碳抗衡基础设施设计提供了基础。
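The two stages described in the abstract, a utilization-based GPU power model followed by a conversion to emissions under given grid conditions, can be sketched as below. The linear power model and every number in it (`idle_w`, `max_w`) are assumptions for illustration; the abstract does not specify the model's form.

```python
def gpu_power_w(util, idle_w=100.0, max_w=400.0):
    """Assumed linear power model: idle draw plus a utilization-proportional part."""
    util = min(max(util, 0.0), 1.0)  # clamp utilization to [0, 1]
    return idle_w + (max_w - idle_w) * util

def energy_kwh(utils, step_s, n_gpus=1):
    """Integrate power over a utilization trace sampled every step_s seconds."""
    joules = sum(gpu_power_w(u) for u in utils) * step_s * n_gpus
    return joules / 3.6e6            # 1 kWh = 3.6e6 J

def emissions_g(energy_kwh_value, intensity_g_per_kwh):
    """Convert energy to grams of CO2 using a grid carbon intensity."""
    return energy_kwh_value * intensity_g_per_kwh
```

With a time-varying carbon intensity, the same functions could be applied per time step, which is the kind of calculation a carbon-aware scheduler would use to compare candidate execution windows.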
Article 31
Title@2025-07-15 (2): Bridging Paradigms: Designing for HPC-Quantum Convergence
Title: Bridging Paradigms: Designing for HPC-Quantum Convergence | Bridging Paradigmen: Designing für HPC-Quantum Convergence | 架桥建模:设计高常PC-量统合 2503.01787v2 |
Authors (8): Amir Shehata, Peter Groszkowski, Thomas Naughton, Murali Gopalakrishnan Meena, Elaine Wong, Daniel Claudino, Rafael Ferreira da Silva, Thomas Beck
This paper presents a comprehensive software stack architecture for integrating quantum computing (QC) capabilities with High-Performance Computing (HPC) environments. While quantum computers show promise as specialized accelerators for scientific computing, their effective integration with classical HPC systems presents significant technical challenges. We propose a hardware-agnostic software framework that supports both current noisy intermediate-scale quantum devices and future fault-tolerant quantum computers, while maintaining compatibility with existing HPC workflows. The architecture includes a quantum gateway interface, standardized APIs for resource management, and robust scheduling mechanisms to handle both simultaneous and interleaved quantum-classical workloads. Key innovations include: (1) a unified resource management system that efficiently coordinates quantum and classical resources, (2) a flexible quantum programming interface that abstracts hardware-specific details, (3) a Quantum Platform Manager API that simplifies the integration of various quantum hardware systems, and (4) a comprehensive tool chain for quantum circuit optimization and execution. We demonstrate our architecture through implementation of quantum-classical algorithms, including the variational quantum linear solver, showcasing the framework’s ability to handle complex hybrid workflows while maximizing resource utilization. This work provides a foundational blueprint for integrating QC capabilities into existing HPC infrastructures, addressing critical challenges in resource management, job scheduling, and efficient data movement between classical and quantum resources.
本文介绍了将量子计算(QC)能力与高性能计算(HPC)环境相结合的综合软件堆积结构。虽然量子计算机作为科学计算的专门加速器表现出希望,但与传统的高常委会系统的有效整合带来了巨大的技术挑战。我们提出了一个硬件不可知的软件框架,既支持当前吵闹的中等规模量子装置,又支持未来的耐故障量子计算机,同时保持与高常委会现有工作流程的兼容性。这一结构包括一个量子网关界面、标准化资源管理的API,以及处理同时和间断的量子古典工作量的强有力的排期机制。关键创新包括:(1) 一个能有效协调量子和古典资源的统一资源管理系统,(2) 一个能摘要介绍硬件具体细节的灵活量子编程界面,(3) 一个量子平台管理员 API, 简化各种量子硬件系统的集成,以及(4) 一个用于量子电路优化和执行的综合工具链。我们通过实施量子级算法,包括变量量线性求解器,展示框架处理复杂混合动态工作流程的能力,同时将资源最大化地纳入高精度动态数据库,同时将资源流流流流流流中的重要资源配置。
Article 32
Title@2025-07-15 (2): A new Dune grid for scalable dynamic adaptivity based on the p4est software library
Title: A new Dune grid for scalable dynamic adaptivity based on the p4est software library | Ein neues Dune-Grid für skalierbare dynamische Anpassungsfähigkeit basierend auf der p4est Software-Bibliothek | 基于前四级软件库的可缩放动态适适适性新 Dune 网格 2507.11386v1 |
Authors (3): Carsten Burstedde, Mikhail Kirilin, Robert Klöfkorn
In this work we extend the Dune solver library with another grid interface to the open-source p4est software. While Dune already supports about a dozen different mesh implementations through its mesh interface Dune-Grid, we undertake this new coupling effort in order to inherit p4est’s practically unlimited MPI scalability as well as its relatively thin data structures, and its native support for multi-block (forest) mesh topologies in both 2D and 3D. The presented implementation is compared to an existing implementation based on Dune-ALUGrid for a variety of challenging test examples in a parallel environment. The numerical experiments show that the implementation presented here is outperforming Dune-ALUGrid in terms of scalability. In addition, an alternative balancing strategy is presented to ensure 2:1 balancing across element faces showing improved performance compared to the existing p4est balance strategy in the numerical examples considered in this work.
在这项工作中,我们扩展了Dune Sollower 图书馆,将另一个网格界面扩展至开放源码 P4est 软件。虽然 Dune 已经通过其网格界面 Dune-Grid 支持了大约十几个不同的网格执行,但我们进行了这项新的组合努力,以继承P4est 几乎无限的MPI可缩放性及其相对薄的数据结构,以及其对2D 和 3D 中多块(森林)网格地形的本土支持。 提出的实施与基于Dune-ALUrid 的现有实施相比, 在一个平行环境中的各种具有挑战性的测试实例。 数字实验表明,这里介绍的实施在可缩放性方面优于Dune-ALUGrid 。 此外,还提出了另一种平衡战略,以确保2:1 平衡各个要素的面面面,与这项工作所考虑的数字示例中现有的P4平衡战略相比, 表现有所改善。
Article 33
Title@2025-07-15 (2): DeInfoReg: A Decoupled Learning Framework for Better Training Throughput
Title: DeInfoReg: A Decoupled Learning Framework for Better Training Throughput | DeInfoReg: Ein entkoppelter Lernrahmen für besseren Trainingsdurchsatz | DInfoReg:一个分离的学习框架,以改善培训工作量 2506.18193v2 |
Authors (3): Zih-Hao Huang, You-Teng Lin, Hung-Hsuan Chen
This paper introduces Decoupled Supervised Learning with Information Regularization (DeInfoReg), a novel approach that transforms a long gradient flow into multiple shorter ones, thereby mitigating the vanishing gradient problem. Integrating a pipeline strategy, DeInfoReg enables model parallelization across multiple GPUs, significantly improving training throughput. We compare our proposed method with standard backpropagation and other gradient flow decomposition techniques. Extensive experiments on diverse tasks and datasets demonstrate that DeInfoReg achieves superior performance and better noise resistance than traditional BP models and efficiently utilizes parallel computing resources. The code for reproducibility is available at: https://github.com/ianzih/Decoupled-Supervised-Learning-for-Information-Regularization/.
本文介绍分解监督学习与信息规范化(DeInfoReg),这是一种新颖的办法,将长梯度流转换成多短的梯度流,从而减轻消失的梯度问题。DInfoReg结合了管道战略,使多个GPU的模型平行化,大大改进了培训流程。我们比较了我们提出的方法与标准回路转换和其他梯度流分解技术。关于不同任务和数据集的广泛实验表明,DeInfoReg比传统的BP模型取得优异性、更强的噪音阻力,并有效利用平行计算资源。可复制代码见:https://github.com/ianzih/Decoupled-Supervised-Learch-Infric-Regulization/。
Article 34
Title@2025-07-15 (2): Rise and Shine Efficiently! Tight Bounds for Adversarial Wake-up
Title: Rise and Shine Efficiently! Tight Bounds for Adversarial Wake-up | Steigen Sie auf und glänzen Sie effizient! Enge Grenzen für adversarisches Aufwachen | 高效唤醒!对抗性唤醒的紧界 2410.09980v3 |
Authors (2): Peter Robinson, Ming Ming Tan
We study the wake-up problem in distributed networks, where an adversary awakens a subset of nodes at arbitrary times, and the goal is to wake up all other nodes as quickly as possible while sending only a few messages. We prove the following lower bounds:
* We first consider the setting where each node receives advice from an oracle who can observe the entire network, but does not know which nodes are awake initially. More specifically, we consider the $KT_0$ $LOCAL$ model with advice. We prove that any randomized algorithm must send $\Omega( \frac{n^{2}}{2^{\beta}\log n} )$ messages if nodes receive only $O(\beta)$ bits of advice on average.
* For the $KT_1$ assumption, we show that any $(k+1)$-time algorithm requires $\Omega( n^{1+1/k} )$ messages. Our result is the first super-linear (in $n$) lower bound for a problem that does not require individual nodes to learn a large amount of information about the network topology.
To complement our lower bound results, we present several new algorithms:
* We give an asynchronous $KT_1$ $LOCAL$ algorithm that solves the wake-up problem with a time and message complexity of $O( n\log n )$ with high probability.
* We introduce the notion of \emph{awake distance} $\rho_{\text{awk}}$, which is upper-bounded by the network diameter, and present a synchronous $KT_1$ $LOCAL$ algorithm that takes $O( \rho_{\text{awk}} )$ rounds and sends $O( n^{3/2}\sqrt{\log n} )$ messages with high probability. We also extend these ideas to obtain a near-optimal time and message complexity of $O( \rho_{\text{awk}} \log^3 n )$ rounds and $O( n \log^3 n )$ messages.
* We give deterministic advising schemes in the asynchronous $KT_0$ $CONGEST$ model (with advice). In particular, we obtain an $O( \rho_{\text{awk}}\log^2 n )$-time advising scheme that sends $O( n\log^2 n )$ messages, while requiring $O( \log^2 n )$ bits of advice per node.
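To get a feel for the two lower bounds stated in this abstract, it may help to plug in a few parameter values. The instances below are a worked illustration only: logarithms are taken base 2, and the $O(\beta)$ advice budget is treated as exactly $\beta$ bits, which is an assumption that ignores constant factors.

```latex
% Worked instances of the stated lower bounds (illustrative; constants ignored)
\begin{align*}
\text{$KT_1$, $(k+1)$-time algorithms:}\quad
  & k = 1:\ \Omega(n^{2}), \qquad k = 2:\ \Omega(n^{3/2}), \\
  & k = \log_2 n:\ n^{1 + 1/\log_2 n} = 2n, \text{ so the bound is only } \Omega(n). \\
\text{$KT_0$ with $\beta$ bits of advice:}\quad
  & \beta = \log_2 n:\ \frac{n^{2}}{2^{\log_2 n}\,\log n} = \frac{n}{\log n}, \\
  & \beta = 2\log_2 n:\ \frac{n^{2}}{n^{2}\,\log n} = \frac{1}{\log n}.
\end{align*}
```

Read this way, the $KT_1$ bound is super-linear only for time budgets below roughly $\log n$ rounds, and the advice bound becomes vacuous once the average advice reaches about $2\log_2 n$ bits per node.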
Article 35
Title@2025-07-15 (2): Cyclic Data Streaming on GPUs for Short Range Stencils Applied to Molecular Dynamics
Title: Cyclic Data Streaming on GPUs for Short Range Stencils Applied to Molecular Dynamics | Zyklische Daten-Streaming auf GPUs für Kurzstrecken-Schablonen auf molekulare Dynamik angewendet | 用于分子动态的短距离短距离电线的 GPU 的循环数据流 2507.11289v1 |
Authors (6): Martin Rose, Simon Homes, Lukas Ramsperger, Jose Gracia, Christoph Niethammer, Jadran Vrabec
In the quest for the highest performance in scientific computing, we present a novel framework that relies on high-bandwidth communication between GPUs in a compute cluster. The framework offers linear scaling of performance for explicit algorithms that is only limited by the size of the dataset and the number of GPUs. Slices of the dataset propagate in a ring of processes (GPUs) from one GPU, where they are processed, to the next, which results in a parallel-in-time parallelization. The user of the framework has to write GPU kernels that implement the algorithm and provide slices of the dataset. Knowledge about the underlying parallelization strategy is not required because the communication between processes is carried out by the framework. As a case study, molecular dynamics simulation based on the Lennard-Jones potential is implemented to measure the performance for a homogeneous fluid. Single-node performance and strong scaling behavior of this framework are compared to LAMMPS, which is outperformed in the strong scaling case.
在寻求科学计算的最高性能时,我们提出了一个新的框架,它依靠计算组中GPU之间的高带宽通信。框架为只受数据集大小和GPU数量限制的清晰算法提供了线性性性能缩放。从一个GPU(GPUs)到下一个进程环状(GPUs)的数据集的切片,从一个GPU(GPUs)传播到另一个进程环状(GPUs)(GPUs),这导致平行时间平行化。框架的用户必须写出执行算法的GPU内核,并提供数据集的切片。不需要了解基本平行化战略,因为两个进程之间的通信是由框架进行的。作为案例研究,基于Lennard-Jones潜力的分子动态模拟是为了测量同质液的性能。这个框架的单一节性能和强缩缩缩行为与LAMPS(LMPS)相比,后者在强大的缩放案例中表现优于后者。
Article 36
Title@2025-07-15 (2): Deterministic Lower Bounds for $k$-Edge Connectivity in the Distributed Sketching Model
Title: Deterministic Lower Bounds for $k$-Edge Connectivity in the Distributed Sketching Model | Deterministische Lower Bounds für $k$-Edge Connectivity im Distributed Sketching Model | 分布式切入模型中用于 $k$-Edge 连接的确定性下下界 2507.11257v1 |
Authors (2): Peter Robinson, Ming Ming Tan
We study the $k$-edge connectivity problem on undirected graphs in the distributed sketching model, where we have $n$ nodes and a referee. Each node sends a single message to the referee based on its 1-hop neighborhood in the graph, and the referee must decide whether the graph is $k$-edge connected by taking into account the received messages. We present the first lower bound for deciding a graph connectivity problem in this model with a deterministic algorithm. Concretely, we show that the worst case message length is $\Omega( k )$ bits for $k$-edge connectivity, for any super-constant $k = O(\sqrt{n})$. Previously, only a lower bound of $\Omega( \log^3 n )$ bits was known for ($1$-edge) connectivity, due to Yu (SODA 2021). In fact, our result is the first super-polylogarithmic lower bound for a connectivity decision problem in the distributed graph sketching model. To obtain our result, we introduce a new lower bound graph construction, as well as a new 3-party communication complexity problem that we call UniqueOverlap. As this problem does not appear to be amenable to reductions to existing hard problems such as set disjointness or indexing due to correlations between the inputs of the three players, we leverage results from cross-intersecting set families to prove the hardness of UniqueOverlap for deterministic algorithms. Finally, we obtain the sought lower bound for deciding $k$-edge connectivity via a novel simulation argument that, in contrast to previous works, does not introduce any probability of error and thus works for deterministic algorithms.
我们研究分布式草图模型中无向图上的 $k$-边连通性问题,该模型包含 $n$ 个节点和一个裁判。每个节点根据其在图中的 1 跳邻域向裁判发送一条消息,裁判必须根据收到的消息判定该图是否 $k$-边连通。我们给出了该模型中用确定性算法判定图连通性问题的第一个下界。具体而言,我们证明对于任意超常数的 $k = O(\sqrt{n})$,$k$-边连通性在最坏情况下的消息长度为 $\Omega( k )$ 比特。此前,对于($1$-边)连通性,仅知 Yu(SODA 2021)给出的 $\Omega( \log^3 n )$ 比特下界。事实上,我们的结果是分布式图草图模型中连通性判定问题的第一个超多对数下界。为得到该结果,我们引入了一种新的下界图构造,以及一个新的三方通信复杂度问题,我们称之为 UniqueOverlap。由于三个参与者的输入之间存在相关性,该问题似乎难以归约到集合不相交或索引等已有的困难问题,因此我们利用交叉相交集族的结果来证明 UniqueOverlap 对确定性算法的困难性。最后,我们通过一种新的模拟论证得到判定 $k$-边连通性的下界;与以往工作不同,该论证不引入任何错误概率,因此适用于确定性算法。
Article 37
Title@2025-07-15 (2): Boosting Scientific Error-Bounded Lossy Compression through Optimized Synergistic Lossy-Lossless Orchestration
Title: Boosting Scientific Error-Bounded Lossy Compression through Optimized Synergistic Lossy-Lossless Orchestration | Erhöhte wissenschaftliche Fehler-gebundene Lossy-Kompression durch optimierte synergistische Lossy-Lossless-Orchestertion | 通过优化的同步失失无损无损管束,促进科学错误造成的损失压缩 2507.11165v1 |
Authors (11): Shixun Wu, Jinwen Pan, Jinyang Liu, Jiannan Tian, Ziwei Qiu, Jiajun Huang, Kai Zhao, Xin Liang, Sheng Di, Zizhong Chen, Franck Cappello
As high-performance computing architectures evolve, more scientific computing workflows are being deployed on advanced computing platforms such as GPUs. These workflows can produce raw data at extremely high throughputs, requiring urgent high-ratio and low-latency error-bounded data compression solutions. In this paper, we propose cuSZ-Hi, an optimized high-ratio GPU-based scientific error-bounded lossy compressor with a flexible, domain-irrelevant, and fully open-source framework design. Our novel contributions are: 1) We maximally optimize the parallelized interpolation-based data prediction scheme on GPUs, enabling the full functionalities of interpolation-based scientific data prediction that are adaptive to diverse data characteristics; 2) We thoroughly explore and investigate lossless data encoding techniques, then craft and incorporate the best-fit lossless encoding pipelines for maximizing the compression ratio of cuSZ-Hi; 3) We systematically evaluate cuSZ-Hi on benchmarking datasets together with representative baselines. Compared to existing state-of-the-art scientific lossy compressors, and with throughput comparable to or better than that of existing high-ratio scientific error-bounded lossy compressors on GPUs, cuSZ-Hi can achieve up to 249% compression ratio improvement under the same error bound, and up to 215% compression ratio improvement under the same decompression data PSNR.
随着高性能计算架构的演变,在诸如GPUs等先进的计算平台上正在部署更科学的计算工作流程。这些工作流程可以以极高的传输量生成原始数据,这需要紧急的高纬度和低纬度误差数据压缩解决方案。在本文中,我们提议CuSZ-Hi,一个最佳的高纬度 GPU-基于高差的科学错误测损压缩机,配有灵活、与域相关和完全开放的源码框架设计。我们的新贡献包括:(1) 我们优化了GPUs的平行的基于内推的数据预测方案,使基于内推的科学数据预测能够充分满足不同数据特性;(2) 我们彻底探索和研究无损数据编码技术的全面功能,然后设计并纳入最佳的无损编码管道,以最大限度地最大程度的CUSZ-Hi压缩比率;(3) 我们系统地评估CUSZ-Hi与具有代表性的基线基准数据集。与现有的州际科学损失计价压缩压缩压缩压缩压缩机相比,通过比较或改进现有高度的科学比率,在高性压压下,可以实现高度的压压压压下,在SBRBRBR的改进。
Article 38
Title@2025-07-15 (2): EASTER: Embedding Aggregation-based Heterogeneous Models Training in Vertical Federated Learning
Title: EASTER: Embedding Aggregation-based Heterogeneous Models Training in Vertical Federated Learning | EASTER: Einbettung von Aggregationsbasierten Heterogenen Modellen Training in vertikales Federated Learning | EEASTER:在纵向联邦学习中嵌入基于聚合的异种模式培训 2310.13367v3 |
Authors (6): Shuo Wang, Keke Gai, Jing Yu, Liehuang Zhu, Kim-Kwang Raymond Choo, Bin Xiao
Vertical federated learning has garnered significant attention as it allows clients to train machine learning models collaboratively without sharing local data, which protects the client’s local private data. However, existing VFL methods face challenges when dealing with heterogeneous local models among participants, which affects optimization convergence and generalization. To address this challenge, this paper proposes a novel approach called Vertical federated learning for training multiple Heterogeneous models (VFedMH). VFedMH focuses on aggregating the local embeddings of each participant’s knowledge during forward propagation. To protect the participants’ local embedding values, we propose an embedding protection method based on lightweight blinding factors. In particular, participants obtain local embedding using local heterogeneous models. Then the passive party, who owns only features of the sample, injects the blinding factor into the local embedding and sends it to the active party. The active party aggregates local embeddings to obtain global knowledge embeddings and sends them to passive parties. The passive parties then utilize the global embeddings to propagate forward on their local heterogeneous networks. However, the passive party does not own the sample labels, so the local model gradient cannot be calculated locally. To overcome this limitation, the active party assists the passive party in computing its local heterogeneous model gradients. Then, each participant trains their local model using the heterogeneous model gradients. The objective is to minimize the loss value of their respective local heterogeneous models. Extensive experiments are conducted to demonstrate that VFedMH can simultaneously train multiple heterogeneous models with heterogeneous optimization and outperform some recent methods in model performance.
纵向联盟学习引起了人们的极大关注,因为它使客户能够在不分享当地数据的情况下合作培训机器学习模式,从而保护客户的当地私人数据。然而,现有的VFL方法在与参与者之间不同的当地模型打交道时面临挑战,这影响到优化趋同和概括化。为了应对这一挑战,本文件提议了一种新颖的方法,称为纵向联盟学习,用于培训多种异质模型(VFedMH)。 VFedMH 侧重于将每个参与者的知识的本地嵌入在远端传播过程中集中起来。为了保护参与者的本地嵌入值,我们建议了基于轻度盲点因素的嵌入保护方法。特别是,参与者利用本地混合模型获得本地嵌入的本地嵌入方法。随后,被动方没有使用本地混入模型的本地嵌入,而是用本地变异性模型进行本地变异性模型的本地变异性化。 被动方在本地变异性模型中可以使用本地变异性模型进行本地变异化。
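The embedding-protection step described in the abstract above can be sketched as follows. This is a minimal illustration and not the VFedMH protocol itself: it assumes that the blinding factors are additive masks that sum to zero across parties (so they cancel when the active party aggregates), that embeddings have been quantized to integers, and that a trusted setup distributes the masks. The modulus `PRIME` and all function names are hypothetical.

```python
import random

PRIME = 2**61 - 1  # toy field modulus (assumption, not from the paper)


def make_blinding_factors(num_parties, dim, rng):
    """Return one mask vector per party; the masks sum to zero mod PRIME."""
    if num_parties < 2:
        raise ValueError("need at least two parties for masks to cancel")
    factors = [[rng.randrange(PRIME) for _ in range(dim)]
               for _ in range(num_parties - 1)]
    # The last mask is chosen so that every coordinate sums to zero.
    last = [(-sum(column)) % PRIME for column in zip(*factors)]
    return factors + [last]


def blind(embedding, factor):
    """Passive party: hide a (quantized) local embedding behind its mask."""
    return [(e + f) % PRIME for e, f in zip(embedding, factor)]


def aggregate(blinded_embeddings):
    """Active party: sum blinded embeddings; the masks cancel in the sum."""
    return [sum(column) % PRIME for column in zip(*blinded_embeddings)]
```

With three parties holding `[1, 2, 3]`, `[10, 20, 30]`, and `[100, 200, 300]`, aggregating the blinded vectors yields the same sum as aggregating the raw ones, while each individual blinded vector looks uniformly random to the active party.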
Article 39
Title@2025-07-15 (2): Generating Dynamic Graph Algorithms for Multiple Backends for a Graph DSL
Title: Generating Dynamic Graph Algorithms for Multiple Backends for a Graph DSL | Dynamische Graphenalgorithmen für mehrere Backends für eine Graph DSL generieren | 为图形 DSL 生成多后端的动态图形对多个后端生成动态图形算法 2507.11094v1 |
Authors (6): Nibedita Behera, Ashwina Kumar, Atharva Chougule, Mohammed Shan P S, Rushabh Nirdosh Lalwani, Rupesh Nasre
With the rapid growth of unstructured and semistructured data, parallelizing graph algorithms has become essential for efficiency. However, due to the inherent irregularity in computation, memory access patterns, and communication, graph algorithms are notoriously difficult to parallelize. To address this challenge, several libraries, frameworks, and domain-specific languages (DSLs) have been proposed to ease the parallel programming burden for domain experts. Existing frameworks partially or fully abstract away parallelism intricacies, provide intuitive scheduling mnemonics, and employ program analysis to identify data races and generate synchronization code. Despite these advances, most frameworks are limited in their abstractions and runtime optimizations, especially when dealing with static graphs. In contrast, many real-world graphs are inherently dynamic, with evolving structures over time through insertions, deletions, and modifications of vertices, edges, and attributes. Generating efficient and correctly synchronized code for such dynamic graph algorithms remains a significant challenge. In this work, we introduce an abstraction scheme and runtime optimizations for the efficient processing of morph algorithms. Specifically, given an initial graph G and a set of updates $\Delta$G involving edge insertions and deletions, we express the dynamic processing logic through a DSL and automatically generate parallel code targeting multicore, distributed, and many-core environments. We demonstrate the effectiveness of our approach by applying the DSL-generated code to ten large graphs with diverse characteristics and three widely used algorithms: Shortest Paths, PageRank, and Triangle Counting.
随着非结构化和半结构化数据的快速增长,平行的图形算法对于效率至关重要。然而,由于计算、内存存访问模式和通信中固有的不规则性,图形算法很难平行。为了应对这一挑战,已经提议了一些图书馆、框架和特定域语言(DSLs)来减轻对域专家的平行编程负担。现有的框架部分或完全抽象地消除平行的复杂,提供直观的排程,并采用程序分析来识别数据竞赛和生成同步代码。尽管取得了这些进步,但大多数框架的抽象和运行时间优化都有限,特别是在处理静态图形时。相比之下,许多真实世界的图表具有内在的动态,随着时间的推移,通过插入、删除和修改脊椎、边缘和属性,不断演化高效和正确同步的平行代码。在这项工作中,我们引入了一种抽象化的模型和运行时间优化方法,以便高效处理变形的D值算法。具体地说,考虑到最初的图表G值路径,并用一个动态的动态的逻辑序列,我们用一个动态的G和动态的逻辑序列,通过一个动态的版本化的G和动态的版本,通过D的版本,我们用一个动态的版本的版本的版本的版本,来生成的版本化的版本化的版本。
Article 40
Title@2025-07-15 (2): MMStencil: Optimizing High-order Stencils on Multicore CPU using Matrix Unit
Title: MMStencil: Optimizing High-order Stencils on Multicore CPU using Matrix Unit | MMStencil: Optimierung von High-Order-Stencils auf Multicore-CPU mit Matrix Unit | MMStencil: 使用矩阵股优化多核心CPU高订单定级器 2507.11067v1 |
Authors (11): Yinuo Wang, Tianqi Mao, Lin Gan, Wubing Wan, Zeyu Song, Jiayu Fu, Lanke He, Wenqiang Wang, Zekun Yin, Wei Xue, Guangwen Yang
Matrix-accelerated stencil computation is a hot research topic, yet its application to three-dimensional (3D) high-order stencils and HPC remains underexplored. With the emergence of matrix units on multicore CPUs, we analyze matrix-based acceleration strategies and tailor an optimal approach for 3D high-order stencils. We introduce algorithmic optimizations based on SIMD and matrix units to address strided memory accesses, alignment conflicts, and redundant accesses. We propose memory optimizations to boost on-package memory efficiency, and a novel multi-thread parallelism paradigm to overcome data-sharing challenges caused by the absence of shared data caches. MMStencil sustains consistently high hardware utilization across diverse stencil shapes and dimensions. Our DMA-based inter-NUMA communication further mitigates NUMA effects and MPI limitations in hybrid parallelism. Combining all the innovations, MMStencil outperforms state-of-the-art libraries on Nvidia A100 GPGPU by up to 2.1x. Moreover, the performance improvements translate directly to real-world HPC applications and enable RTM applications to yield 1.8x speedup versus a highly optimized industrial Nvidia A100 GPGPU version.
矩阵加速线性计算是一个热点研究课题,然而,它应用于三维(3D)高阶急点和高常量计算法仍未得到充分探讨。随着多极CPU上出现矩阵单位,我们分析基于矩阵的加速战略,并为3D高阶急点设计最佳方法。我们采用基于SIMD和矩阵单位的算法优化方法,以解决分流记忆存取、校正冲突和冗余存存取。我们提议优化记忆以提升包装内存储效率,并建立一个新的多轨平行模式,以克服因缺少共享数据缓存而导致的数据共享挑战。MMStencil持续保持各种高水平的硬利用率,并针对三维高阶缓冲加速战略制定最佳方法。我们基于DMA的NUMA间通信进一步减轻NUMA效应和MPI在混合平行状态中的局限性。我们提议将Nvidia APGPPPPP PP 上的所有创新、 MMStencil 超越最新状态图书馆整合至2.1x。此外,业绩改进将A-PGPLA 直接转换为实际全球GPUPI 。
Article 41
Title@2025-07-15 (2): Efficient Federated Learning with Heterogeneous Data and Adaptive Dropout
Title: Efficient Federated Learning with Heterogeneous Data and Adaptive Dropout | Effizientes Federated Learning mit heterogenen Daten und adaptivem Dropout | 采用异种数据和适应性辍学的高效联邦学习 2507.10430v2 |
Authors (10): Ji Liu, Beichen Ma, Qiaolin Yu, Ruoming Jin, Jingbo Zhou, Yang Zhou, Huaiyu Dai, Haixun Wang, Dejing Dou, Patrick Valduriez
Federated Learning (FL) is a promising distributed machine learning approach that enables collaborative training of a global model using multiple edge devices. The data distributed among the edge devices is highly heterogeneous. Thus, FL faces the challenge of data distribution and heterogeneity, where non-Independent and Identically Distributed (non-IID) data across edge devices may result in a significant accuracy drop. Furthermore, the limited computation and communication capabilities of edge devices increase the likelihood of stragglers, thus leading to slow model convergence. In this paper, we propose the FedDHAD FL framework, which comes with two novel methods: Dynamic Heterogeneous model aggregation (FedDH) and Adaptive Dropout (FedAD). FedDH dynamically adjusts the weights of each local model within the model aggregation process based on the non-IID degree of heterogeneous data to deal with the statistical data heterogeneity. FedAD performs neuron-adaptive operations in response to heterogeneous devices to improve accuracy while achieving superb efficiency. The combination of these two methods makes FedDHAD significantly outperform state-of-the-art solutions in terms of accuracy (up to 6.7% higher), efficiency (up to 2.02 times faster), and computation cost (up to 15.0% smaller).
联邦学习联盟(FL)是一种很有希望的分布式机器学习方法,它使得使用多边缘装置对一个全球模型进行合作培训成为可能,在边缘装置之间分布的数据非常多样化。因此,FL面临着数据分布和异质性的挑战,在边缘装置之间,非独立和同分布(非IID)数据可能会产生显著的准确性下降。此外,边缘装置的计算和通信能力有限,增加了分解器的可能性,从而导致模型融合速度缓慢。在本文中,我们提议FedDHAD FL框架,采用两种新颖方法:动态异质模型汇总(FedDH)和适应性丢弃(FedAD)。FedDH根据非独立和相同分布(非IID)的混合数据程度,对模型集成过程中每个本地模型的权重进行动态调整,以应对统计数据的异质性。FedAD对混集装置进行神经适应操作,以提高精度,同时实现超级效率。这两种方法的结合使FedDHAD在准确率(最高提升6.7%)、效率(最高快2.02倍)和计算成本(最多降低15.0%)方面显著优于最先进的解决方案。
Article 42
Title@2025-07-15 (2): Arcturus: A Cloud Overlay Network for Global Accelerator with Enhanced Performance and Stability
Title: Arcturus: A Cloud Overlay Network for Global Accelerator with Enhanced Performance and Stability | Arcturus: Ein Cloud Overlay Netzwerk für Global Accelerator mit verbesserter Leistung und Stabilität | Arcturus:增强性能和稳定的全球加速器云重叠网络 2507.10928v1 |
Authors (9): Matthew Yang Liu, Chuang Chen, Pengcheng Lv, Hui Guo, Yanan Zhang, Cong Wang, Yusen Li, Zhenyu Li, Yu-Chu Tian
Global Accelerator (GA) services play a vital role in ensuring low-latency, high-reliability communication for real-time interactive applications. However, existing GA offerings are tightly bound to specific cloud providers, resulting in high costs, rigid deployment, and limited flexibility, especially for large-scale or budget-sensitive deployments. Arcturus is a cloud-native GA framework that revisits the design of GA systems by leveraging low-cost, heterogeneous cloud resources across multiple providers. Rather than relying on fixed, high-end infrastructure, Arcturus dynamically constructs its acceleration network and balances performance, stability, and resource efficiency. To achieve this, Arcturus introduces a two-plane design: a forwarding plane that builds a proxy network with adaptive control, and a scheduling plane that coordinates load and routing through lightweight, quantitative optimization. Evaluations under millions of RPS show that Arcturus outperforms commercial GA services by up to 1.7X in acceleration performance, reduces cost by 71%, and maintains over 80% resource efficiency, demonstrating efficient use of cloud resources at scale.
全球加速器(GA)服务在确保实时互动应用程序的低延迟、高可靠性通信方面发挥着关键作用。然而,现有的GA服务与特定的云源供应商紧密相连,导致成本高、部署僵硬和灵活性有限,特别是大规模部署或预算敏感的部署。Arcturus是一个云型的GA框架,它利用多种供应商的低成本、多样云资源来重新审视GA系统的设计。Arcturus不依靠固定的高端基础设施,而是动态地构建其加速网络和平衡性能、稳定性和资源效率。为了实现这一点,Arcturus引入了两种计划设计:建立具有适应性控制的代理网络的转发式飞机,以及一个通过轻量度和量性优化来协调负荷和路由的调度的定时平面。百万个RPS下的评价显示,Arcturus在加速性能方面比商用GA服务高出1.7X,将成本降低71%,并保持80%以上的资源效率演示。
Article 43
Title@2025-07-14 (1): Stream programs are monoid homomorphisms with state
Title: Stream programs are monoid homomorphisms with state | Stream-Programme sind monoide Homomorphismen mit Zustand | 溪流方案是单一单一的同质状态方案 2507.10799v1 |
Authors (3): Tyler Hou, Michael Arntzenius, Max Willsey
We define a broad class of deterministic stream functions and show they can be implemented as homomorphisms into a “state” monoid. The homomorphism laws are simpler than the conditions of previous semantic frameworks for stream program optimization, yet retain support for rich equational reasoning over expressive dataflow programs, including sequential composition, parallel composition, and feedback. We demonstrate this using examples of partitioned database joins, stratified negation, and a simplified model of TCP.
我们定义了广泛的确定性流函数类别,并表明这些函数可以作为“状态”单体执行。 单体法比以往流程序优化的语义框架更为简单,但依然支持关于表达性数据流程序的丰富的等式推理,包括相继构成、平行组成和反馈。 我们用分隔式数据库连接、分层否定和简化TCP模式的例子来证明这一点。
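The idea of implementing a stream function as a homomorphism into a "state" monoid can be illustrated with a running sum. This is a sketch under one plausible reading of the abstract, not the paper's formal definition: input chunks form a monoid under list concatenation, and each chunk is mapped to a state transformer `state -> (new_state, outputs)`, with sequential composition as the monoid operation. All names are hypothetical.

```python
def identity(state):
    """Identity of the state monoid: leave the state alone, emit nothing."""
    return state, []


def combine(f, g):
    """Monoid operation: run f, then g on f's final state, joining outputs."""
    def composed(state):
        mid_state, out_f = f(state)
        end_state, out_g = g(mid_state)
        return end_state, out_f + out_g
    return composed


def running_sum(chunk):
    """Homomorphism: map an input chunk to a state transformer.

    The state is the total seen so far; the output is the list of
    prefix sums produced while consuming the chunk.
    """
    def step(total):
        outputs = []
        for x in chunk:
            total += x
            outputs.append(total)
        return total, outputs
    return step


# Homomorphism law: processing xs + ys at once agrees with
# processing xs and then ys, starting from the same state.
whole = running_sum([1, 2, 3, 4])(0)
split = combine(running_sum([1, 2]), running_sum([3, 4]))(0)
```

Here `whole` and `split` are equal, and `running_sum([])` behaves like `identity`, which is what allows a stream to be cut into chunks at arbitrary points without changing the result.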
Article 44
Title@2025-07-14 (1): Dissecting the NVIDIA Blackwell Architecture with Microbenchmarks
Title: Dissecting the NVIDIA Blackwell Architecture with Microbenchmarks | Die NVIDIA Blackwell Architektur mit Microbenchmarks | 使用微基准解析 NVIDIA Blackwell 建筑 2507.10789v1 |
Authors (3): Aaron Jarmusch, Nathan Graddon, Sunita Chandrasekaran
The rapid development in scientific research creates a need for more compute power, which is partly being met by GPUs. This paper presents a microarchitectural analysis of the modern NVIDIA Blackwell architecture by studying GPU performance features with carefully designed microbenchmarks. We unveil key subsystems, including the memory hierarchy, SM execution pipelines, and the SM sub-core units, including the 5th generation tensor cores supporting FP4 and FP6 precisions. To understand the different key features of the NVIDIA GPU, we study latency, throughput, cache behavior, and scheduling details, revealing subtle tuning metrics in the design of Blackwell. To develop a comprehensive analysis, we compare the Blackwell architecture with the previous Hopper architecture by using the GeForce RTX 5080 and H100 PCIe, respectively. We evaluate and compare results, presenting both generational improvements and performance regressions. Additionally, we investigate the role of power efficiency and energy consumption under varied workloads. Our findings provide actionable insights for application developers, compiler writers, and performance engineers to optimize workloads on Blackwell-based platforms, and contribute new data to the growing research on GPU architectures.
科学研究的快速发展需要更精确的计算能力,这一点部分是由GPU解决的。本文件通过研究通过微分标记思考的GPU性能特征,对NVIDIA Blackwell现代结构进行微观结构分析。我们公布了关键子系统,包括记忆等级、SM执行管道和SM子核心单位,包括支持FP4和FP6精确度的第五代高温核心。为了了解NVIDIA GPU的不同关键特征,我们研究了Lantency、吞吐、缓存行为和排期细节,揭示了Blackwell设计中的细微调度。为了进行全面分析,我们分别使用GeForce RTX 5080和H100 PCIe,将Blackwell结构与以前的H100 PCIe 结构进行比较。我们评估和比较了结果,提出了代际改进和业绩回归。此外,我们研究了NVDIA GPU在各种工作量下的权力效率和能源消耗作用。我们的调查结果为应用开发者、编译员和绩效工程师提供了可操作的洞察力的新的洞察力,以优化G研究平台上的数据。
Article 45
Title@2025-07-14 (1): FAFO: Over 1 million TPS on a single node running EVM while still Merkleizing every block
Title: FAFO: Over 1 million TPS on a single node running EVM while still Merkleizing every block | FAFO: Mehr als 1 Million TPS auf einem einzigen Knoten, der EVM läuft, während immer noch jeder Block zusammengefügt wird | 在运行 EVM 的单一节点上超过100万个TPS, 同时仍然挤压每个街区 2507.10757v1 |
Authors (7): Ryan Zarick, Isaac Zhang, Daniel Wong, Thomas Kim, Bryan Pellegrino, Mignon Li, Kelvin Wong
Current blockchain execution throughput is limited by data contention, reducing execution layer parallelism. Fast Ahead-of-Formation Optimization (FAFO) is the first blockchain transaction scheduler to address this problem by reordering transactions before block formation for maximum concurrency. FAFO uses CPU-optimized cache-friendly Bloom filters to efficiently detect conflicts and schedule parallel transaction execution at high throughput and low overhead. We integrate the Rust EVM client (REVM) into FAFO and achieve over 1.1 million native ETH transfers per second and over half a million ERC20 transfers per second on a single node (Table 1), with 91% lower cost compared to state-of-the-art sharded execution. Unlike many other existing high throughput blockchain execution clients, FAFO uses QMDB to Merkleize world state after every block, enabling light clients and stateless validation for ZK-based vApps. FAFO scales with minimal synchronization overhead, scaling linearly with additional CPU resources until it fully exploits the maximum parallelism of the underlying transaction flow. FAFO proves that the high throughput necessary to support future decentralized applications can be achieved with a streamlined execution layer and innovations in blockchain transaction scheduler design. FAFO is open-sourced at https://github.com/LayerZero-Labs/fafo.
当前的链条执行量受数据争议的限制,减少了执行层的平行性。快速格式化优化(FAFO)是第一个通过在块形成前重新排序交易以达到最高通量来解决这一问题的链式交易调度系统(FAFO ) 。 FAFO 使用CPU 优化的缓冲友好型Bloom过滤器来高效检测冲突,并安排高吞吐量和低管理量的平行交易执行。 我们将Rust EVM客户(REVM)纳入FAFO, 实现110万次双倍和每秒50万次以上的本地世贸转移。 在单节点(表1)上每秒(每秒)超过50万次ERC20转移, 其成本比最先进的标准执行低91%。 与许多其他现有的高吞吐量块链执行客户不同, FAFOFO 使用QMB 到Merkleizizizize, 使基于ZK的客户和无国籍的VApps. FAFO 比例, 能够充分利用基础交易流动的最大平行的平行交易流动。 FAFA-FO/DFO 的高级设计系统实现。
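The conflict-detection idea in the abstract above (Bloom filters over the state touched by transactions, used to group transactions for parallel execution) can be sketched as below. This is an illustrative toy, not FAFO's scheduler: the greedy first-fit batching, the filter parameters, the SHA-256-based hashing, and the lack of a read/write distinction are all assumptions made for brevity.

```python
import hashlib


class BloomFilter:
    """Small Bloom filter backed by a Python int used as a bit array."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


def schedule(transactions):
    """Greedily group (tx_id, keys) pairs into conflict-free batches."""
    batches = []  # each entry: (BloomFilter of touched keys, list of tx ids)
    for tx_id, keys in transactions:
        for bloom, members in batches:
            if not any(bloom.might_contain(k) for k in keys):
                target = (bloom, members)
                break
        else:
            # Every existing batch may conflict, so open a new one.
            target = (BloomFilter(), [])
            batches.append(target)
        bloom, members = target
        for k in keys:
            bloom.add(k)
        members.append(tx_id)
    return [members for _, members in batches]
```

Transactions inside one batch touch disjoint keys and could run in parallel, while batches run one after another. Because a Bloom filter has no false negatives, two transactions that share a key never land in the same batch; a false positive only costs some parallelism by pushing a transaction into a later batch.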
Article 46
Title@2025-07-14 (1): Access Control for Information-Theoretically Secure Key-Document Stores
Title: Access Control for Information-Theoretically Secure Key-Document Stores | Zugriffskontrolle für informationstheoretisch gesicherte Key-Document-Stores | 信息-理论安全密钥文件库存取控制 2507.10730v1 |
Authors (4): Yin Li, Sharad Mehrota, Shantanu Sharma, Komal Kumari
This paper presents a novel key-based access control technique for securely outsourcing key-value stores where values correspond to documents that are indexed and accessed using keys. The proposed approach adopts Shamir’s secret-sharing that offers unconditional or information-theoretic security. It supports keyword-based document retrieval while preventing leakage of the data, access rights of users, or the size (i.e., volume of the output that satisfies a query). The proposed approach allows servers to detect (and abort) malicious clients attempting to gain unauthorized access to data, and prevents malicious servers from altering data undetected while ensuring efficient access; it takes 231.5 ms over 5,000 keywords across 500,000 files.
本文介绍了安全外包关键价值仓库的新型关键访问控制技术,其价值与使用钥匙编制索引和查阅的文件相对应。拟议方法采用了Shamir的保密共享方法,提供无条件或信息理论安全。它支持基于关键词的文件检索,同时防止数据、用户访问权限或规模(即满足查询的输出量)的泄漏。拟议方法使服务器能够检测(并中止)恶意客户端未经授权访问数据的企图,并防止恶意服务器在不被发现的情况下篡改数据,同时确保高效访问:在50万个文件、5 000个关键词的规模下耗时231.5毫秒。
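Since the approach above builds on Shamir's secret sharing, a minimal sketch of that primitive may help. It shows only the textbook (threshold, n) scheme over a prime field, not the paper's access-control protocol; the prime, the function names, and the use of a caller-supplied random generator are choices made for this example.

```python
import random

PRIME = 2**127 - 1  # Mersenne prime used as the field modulus


def split(secret, threshold, num_shares, rng):
    """Split `secret` into points on a random degree-(threshold-1) polynomial."""
    # coeffs[0] is the secret; the remaining coefficients are random.
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, num_shares + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation of the polynomial at x
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def reconstruct(shares):
    """Recover the secret via Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * (-xj)) % PRIME
                den = (den * (xi - xj)) % PRIME
        # Modular inverse via Fermat's little theorem (PRIME is prime).
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret


# Example: any 3 of the 5 shares recover the secret.
# A real deployment would draw coefficients from a cryptographic RNG
# (e.g. random.SystemRandom) instead of a seeded generator.
shares = split(123456789, threshold=3, num_shares=5, rng=random.Random(0))
recovered = reconstruct(shares[:3])
```

Fewer than `threshold` shares reveal nothing about the secret, because every candidate secret is consistent with some polynomial through the known points; this is the information-theoretic guarantee the abstract refers to.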
Article 47
Title@2025-07-14 (1): Environmentally-Conscious Cloud Orchestration Considering Geo-Distributed Data Centers
Title: Environmentally-Conscious Cloud Orchestration Considering Geo-Distributed Data Centers | Umweltbewusste Cloud-Orchester unter Berücksichtigung von Geo-Distributed Data Centers | 考虑到地理分布数据中心的无害环境云层交织式 2507.11563v1 |
Authors (2): Giulio Attenni, Novella Bartolini
This paper presents a theoretical discussion for environmentally-conscious job deployment and migration in cloud environments, aiming to minimize the environmental impact of resource provisioning while incorporating sustainability requirements. As the demand for sustainable cloud services grows, it is crucial for cloud customers to select data center operators based on sustainability metrics and to accurately report the ecological footprint of their services. To this end, we analyze sustainability reports and define comprehensive environmental impact profiles for data centers, incorporating key sustainability indicators. We formalize the problem as an optimization model, balancing multiple environmental factors while respecting user preferences. A simulative case study demonstrates the potential of our approach compared to baseline strategies that optimize for single sustainability factors.
本文介绍了在云层环境中进行有环境意识的工作部署和迁移的理论讨论,目的是尽量减少资源提供对环境的影响,同时纳入可持续性要求。随着对可持续云服务的需求不断增长,云客户必须根据可持续性指标选择数据中心运营商,并准确报告其服务的生态足迹。为此,我们分析可持续性报告,并界定数据中心的综合环境影响简介,纳入关键可持续性指标。我们将问题正式确定为优化模式,平衡多种环境因素,同时尊重用户的偏好。模拟案例研究表明,与优化单一可持续性因素的基线战略相比,我们的方法具有潜力。
Article 48
Title@2025-07-14 (1): Consensus, Inconsistency, Emergence: what’s paraconsistency got to do with it?
Title: Consensus, Inconsistency, Emergence: what’s paraconsistency got to do with it? | Konsens, Inkonsistenz, Emergenz: Was hat Parakonsistenz damit zu tun? | 共识、不一致性、新出现: 不一致与它有什么关系? 2507.10413v1 |
Authors (1): Gabriel Rocha
The consensus problem, briefly stated, consists of having processes in an asynchronous distributed system agree on a value. It is widely known that the consensus problem does not have a deterministic solution that ensures both termination and consistency if there is at least one faulty process in the system. This result, known as the FLP impossibility theorem, led to several generalizations and developments in theoretical distributed computing. This paper argues that the FLP impossibility theorem holds even under a generalized definition of computation through oracles. Furthermore, using theoretical machinery from complex systems, this paper also posits that inconsistency may be an emergent feature of consensus over distributed systems by examining how a system transitions between phases. Under the same complex systems framework, this paper examines paraconsistent logics, arguing that while inconsistency is not an emergent feature for these logics, triviality may be. Lastly, some attention is given to the possibility of developing consensus algorithms capable of paraconsistent reasoning.
简言之,共识问题是指让异步分布式系统中的各进程就某个值达成一致。众所周知,只要系统中存在至少一个故障进程,共识问题就不存在同时保证终止性和一致性的确定性解法。这一结果被称为 FLP 不可能性定理,它引出了理论分布式计算中的若干推广和发展。本文认为,即使在通过预言机(oracle)对计算作广义定义的情况下,FLP 不可能性定理依然成立。此外,本文借助复杂系统的理论工具,通过考察系统如何发生相变,提出不一致性可能是分布式系统上共识的一种涌现特征。在同一复杂系统框架下,本文考察了弗协调逻辑,并指出虽然不一致性不是这些逻辑的涌现特征,但平凡性可能是。最后,本文还关注了开发能够进行弗协调推理的共识算法的可能性。
Article 49
Title@2025-07-14 (1): Zorse: Optimizing LLM Training Efficiency on Heterogeneous GPU Clusters
Title: Zorse: Optimizing LLM Training Efficiency on Heterogeneous GPU Clusters | Zorse: Optimierung der LLM-Trainingseffizienz auf heterogenen GPU-Clustern | Zorse: 优化关于异基因性GPU集群的LLM培训效率 2507.10392v1 |
Authors (4): Runsheng Benson Guo, Utkarsh Anand, Khuzaima Daudjee, Rathijit Sen
Large language models (LLMs) require vast amounts of GPU compute to train, but limited availability and high costs of GPUs make homogeneous clusters impractical for many organizations. Instead, assembling heterogeneous clusters by pooling together GPUs of different generations allows them to achieve higher aggregate compute and make use of all available GPUs. However, training on heterogeneous clusters presents several challenges, including load balancing across GPUs, optimizing memory usage to accommodate varying memory capacities, and ensuring communication-efficient training over diverse network interconnects potentially spanning multiple datacenters. In this paper, we make the case that efficient training on heterogeneous clusters requires (1) the integration of pipeline parallelism and data parallelism in a manner that is both communication- and memory-efficient, and (2) a more adaptable configuration of pipeline and data parallelism, which includes the capability to flexibly partition GPUs into asymmetric pipeline parallel stages and to incorporate heterogeneous GPUs within the same data parallelism group. We propose Zorse, the first system to unify all these capabilities while incorporating a planner that automatically configures training strategies for a given workload. Our evaluation shows that Zorse significantly outperforms state-of-the-art systems in heterogeneous training scenarios.
大型语言模型(LLMS)要求大量GPU进行培训,但GPU的有限可用性和高成本使得许多组织无法使用同质集群。相反,通过将不同世代的GPU集合起来,将不同种类的集群组合起来,可以实现更高的合计计算,并使用所有可用的GPU。然而,关于不同集群的培训提出了几项挑战,包括:在GPU之间实现负载平衡,优化记忆使用,以适应不同的记忆能力,以及确保针对可能跨越多个数据中心的不同网络互联的通信高效培训。在本文件中,我们证明,关于不同集群的有效培训需要(1) 以既具有通信效率又具有记忆效率的方式整合管道平行和数据平行性,以及(2) 更加适应性地配置管道和数据平行性,这包括将灵活分配GPUP纳入不对称管道平行阶段的能力,以及将混合的GPUP纳入同一数据平行组。我们建议Zorse是统一所有这些能力的第一个系统,同时纳入一个为特定工作量自动配置培训战略的规划员。我们的评估显示,Zorsee 明显超越了州立式系统。
Article 50
Title@2025-07-14 (1): FalconFS: Distributed File System for Large-Scale Deep Learning Pipeline
Title: FalconFS: Distributed File System for Large-Scale Deep Learning Pipeline | FalconFS: Verteiltes Dateisystem für großformatige Deep-Learning-Pipeline | FalconFS:大型深层学习管道分布式文件系统 2507.10367v1 |
Authors (13): Jingwei Xu, Junbin Kang, Mingkai Dong, Mingyu Liu, Lu Zhang, Shaohong Guo, Ziyan Qiu, Mingzhen You, Ziyi Tian, Anqi Yu, Tianhong Ding, Xinwei Hu, Haibo Chen
Client-side metadata caching has long been considered an effective method for accelerating metadata operations in distributed file systems (DFSs). However, we have found that client-side state (e.g., caching) is not only ineffective but also consumes valuable memory resources in deep learning pipelines. We thus propose FalconFS, a DFS optimized for deep learning pipelines with a stateless-client architecture. Specifically, instead of performing client-side path resolution and caching, FalconFS efficiently resolves paths on the server side using hybrid metadata indexing and lazy namespace replication. FalconFS also boosts server concurrency with concurrent request merging and provides easy deployment with a VFS shortcut. Evaluations against CephFS and Lustre show that FalconFS achieves up to 5.72$\times$ throughput for small file read/write and up to 12.81$\times$ throughput for deep learning model training. FalconFS has been running in the production environment of Huawei's autonomous driving system with 10,000 NPUs for one year.
客户端元数据缓存长期以来一直被认为是在分布式文件系统(DFS)中加快元数据运行的有效方法。然而,我们发现客户端状态(如缓存)不仅无效,而且还消耗深层学习管道中宝贵的记忆资源。我们因此建议FalconFinalFS, 这是一种与无国籍客户架构一起优化的深学习管道的外勤支助部FalconFis。具体地说,FalconFisc公司不是使用客户端路径解析和缓存,而是使用混合元数据索引和懒惰名称空间复制,有效解决服务器端的路径。FalconFinFS公司还推动服务器通货与同时合并,并且以VFS公司快捷键提供方便的部署。对CephFS和Lustre公司的评价显示,FalconFinfFS公司在读/写小文档时达到5.72美元,在深层学习模型培训时达到12.81美元。 FalconfinFS公司在华威自主驱动系统生产环境中运行一年,拥有10,000 NPUPS。
Article 51
Title@2025-07-14 (1): FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference
Title: FlowSpec: Continuous Pipelined Speculative Decoding for Efficient Distributed LLM Inference | FlowSpec: Kontinuierliche pipelined Spekulative Dekodierung für effiziente verteilte LLM-Inferenz | 流谱:为有效分布分布的LLM 推断而持续喷射的投机性分解 2507.02620v2 |
Authors (4): Xing Liu, Lizhuo Luo, Ming Tang, Chao Huang
Distributed inference serves as a promising approach to enabling the inference of large language models (LLMs) at the network edge. It distributes the inference process to multiple devices to ensure that the LLMs can fit into the device memory. Recent pipeline-based approaches have the potential to parallelize communication and computation, which helps reduce inference latency. However, the benefit diminishes when inference requests at the network edge are sparse, where the pipeline is typically at low utilization. To enable efficient distributed LLM inference at the edge, we propose FlowSpec, a pipeline-parallel tree-based speculative decoding framework. FlowSpec incorporates three key mechanisms to improve decoding efficiency: 1) score-based step-wise verification that prioritizes more important draft tokens to bring earlier accepted tokens; 2) efficient draft management to prune invalid tokens while maintaining correct causal relationships during verification; 3) dynamic draft expansion strategies to supply high-quality speculative inputs. These techniques work in concert to enhance both pipeline utilization and speculative efficiency. We evaluate FlowSpec on a real-world testbed against other baselines. Experimental results demonstrate that our proposed framework significantly improves inference speed across diverse models and configurations, achieving speedup ratios of 1.28$\times$-1.79$\times$ compared to baselines. Our code is publicly available at https://github.com/Leosang-lx/FlowSpec#
分布式推理是在网络边缘实现大型语言模型(LLM)推理的一种有前景的方法。它将推理过程分布到多个设备上,以确保LLM能够装入设备内存。最近基于流水线的方法有可能使通信和计算并行,从而有助于降低推理延迟。然而,当网络边缘的推理请求稀疏时,流水线的利用率通常较低,这种好处就会减弱。为了在边缘实现高效的分布式LLM推理,我们提出了FlowSpec,一个流水线并行、基于树的推测解码框架。FlowSpec包含三个提高解码效率的关键机制:1) 基于分数的逐步验证,优先处理更重要的草稿词元,以便更早得到被接受的词元;2) 高效的草稿管理,在验证过程中剪除无效词元并保持正确的因果关系;3) 动态草稿扩展策略,以提供高质量的推测输入。这些技术协同工作,以提高流水线利用率和推测效率。我们在真实测试平台上将FlowSpec与其他基线进行了对比评估。实验结果表明,我们提出的框架在不同模型和配置下显著提高了推理速度,与基线相比加速比达到1.28$\times$-1.79$\times$。我们的代码公开于 https://github.com/Leosang-lx/FlowSpec#
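The score-based prioritization in mechanism 1) can be illustrated with a small sketch. This is not FlowSpec's actual algorithm: the function name `next_segment`, the flat `parent`/`score` arrays, and the fixed segment size `k` are all assumptions made for illustration. It only shows one way to pick the highest-scoring draft tokens of a tree while never sending a token before its parent, which keeps the causal relationship intact.

```python
import heapq

def next_segment(parent, score, k):
    """Pick up to k draft-tree nodes in descending score order,
    never choosing a node before its parent (hypothetical sketch).

    parent[i] is the parent index of draft node i (-1 for a root).
    score[i] is an assumed importance score for draft node i.
    """
    children = {}
    for node, par in enumerate(parent):
        children.setdefault(par, []).append(node)

    # The frontier holds nodes whose parent is already chosen (or roots).
    frontier = [(-score[n], n) for n in children.get(-1, [])]
    heapq.heapify(frontier)

    chosen = []
    while frontier and len(chosen) < k:
        _, node = heapq.heappop(frontier)  # highest-scoring frontier node
        chosen.append(node)
        for child in children.get(node, []):
            heapq.heappush(frontier, (-score[child], child))
    return chosen
```

For a tree with `parent = [-1, 0, 0, 1, 2]` and `score = [1.0, 0.6, 0.3, 0.5, 0.2]`, a segment of size 3 contains nodes 0, 1, and 3: node 3 outranks node 2 once its parent (node 1) has been selected.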
Article 52
Title@2025-07-14 (1): Convergence of Agnostic Federated Averaging
Title: Convergence of Agnostic Federated Averaging | Konvergenz der agnostischen Föderierten Durchschnittswerte | Agnostic Federal 波动的趋同 2507.10325v1 |
Authors (2): Herlock Rahimi, Dionysis Kalogerias
Federated learning (FL) enables decentralized model training without centralizing raw data. However, practical FL deployments often face a key realistic challenge: clients participate intermittently in server aggregation, with unknown and possibly biased participation probabilities. Most existing convergence results either assume full-device participation or rely on knowledge of (in fact uniform) client availability distributions – assumptions that rarely hold in practice. In this work, we characterize the optimization problem that consistently adheres to the stochastic dynamics of the well-known agnostic Federated Averaging (FedAvg) algorithm under random (and variably-sized) client availability, and rigorously establish its convergence for convex, possibly nonsmooth losses, achieving a standard rate of order $\mathcal{O}(1/\sqrt{T})$, where $T$ denotes the aggregation horizon. Our analysis provides the first convergence guarantees for agnostic FedAvg under general, non-uniform, stochastic client participation, without knowledge of the participation distribution. We also empirically demonstrate that agnostic FedAvg in fact outperforms common (and suboptimal) weighted aggregation FedAvg variants, even with server-side knowledge of participation weights.
联邦学习(FL)无需集中原始数据即可进行去中心化的模型训练。然而,实际的FL部署往往面临一个关键的现实挑战:客户端间歇性地参与服务器聚合,且参与概率未知并可能存在偏差。大多数现有的收敛结果要么假设所有设备完全参与,要么依赖对(实际上是均匀的)客户端可用性分布的了解,而这些假设在实践中很少成立。在这项工作中,我们刻画了在随机(且规模可变的)客户端可用性下,与众所周知的不可知联邦平均(agnostic FedAvg)算法的随机动态始终一致的优化问题,并严格证明了该算法对于凸的、可能非光滑的损失的收敛性,达到标准的 $\mathcal{O}(1/\sqrt{T})$ 阶收敛速率,其中 $T$ 表示聚合轮数。我们的分析在一般的、非均匀的随机客户端参与下,在不知道参与分布的情况下,首次为不可知FedAvg提供了收敛保证。我们还通过实验表明,即使服务器端知道参与权重,不可知FedAvg实际上也优于常见的(次优的)加权聚合FedAvg变体。
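As a rough illustration of the setting described above, the following toy sketch runs an agnostic averaging loop on a scalar parameter. Everything specific here is an assumption made for the example and is not taken from the paper: the losses $|w - c_i|$, one subgradient step per client per round, the $1/\sqrt{t}$ step size, and the name `agnostic_fedavg`. The point it shows is that the server averages whichever client models arrive, without reweighting by the (unknown) participation probabilities.

```python
import random

def agnostic_fedavg(targets, probs, rounds=2000, lr0=0.5, seed=0):
    """Toy agnostic FedAvg on convex, nonsmooth losses f_i(w) = |w - targets[i]|.

    probs[i] is client i's participation probability; the server never
    reads it when aggregating (that is the 'agnostic' part).
    Returns the running average of the server iterates.
    """
    rng = random.Random(seed)
    w = 0.0
    avg_w = 0.0
    for t in range(1, rounds + 1):
        lr = lr0 / t ** 0.5  # assumed diminishing step size
        # Each client shows up independently this round.
        present = [i for i, p in enumerate(probs) if rng.random() < p]
        if present:
            local = []
            for i in present:
                # Subgradient of |w - c|: -1, 0, or +1.
                g = (w > targets[i]) - (w < targets[i])
                local.append(w - lr * g)
            # Plain average over whoever participated.
            w = sum(local) / len(local)
        avg_w += (w - avg_w) / t  # running mean of iterates
    return avg_w
```

When every client shares the same minimizer, for example `agnostic_fedavg([3.0, 3.0, 3.0], [0.9, 0.5, 0.2])`, the averaged iterate settles close to 3 regardless of how uneven the participation probabilities are.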
Article 53
Title@2025-07-14 (1): Cross-Timeslot Optimization for Distributed GPU Inference Using Reinforcement Learning
Title: Cross-Timeslot Optimization for Distributed GPU Inference Using Reinforcement Learning | Cross-Timeslot-Optimierung für verteilte GPU-Inferenz mittels Verstärkungslernen | 利用强化学习对分布式 GPU 推断进行跨时线绘图优化 2507.10259v1 |
Authors (6): Chengze Du, Zhiwei Yu, Heng Xu, Haojie Wang, Bo Liu, Jialong Li
The rapid growth of large language model (LLM) services imposes increasing demands on distributed GPU inference infrastructure. Most existing scheduling systems rely on the current system state to make decisions, without considering how task demand and resource availability evolve over time. This lack of temporal awareness leads to inefficient GPU utilization, high task migration overhead, and poor system responsiveness under dynamic workloads. In this work, we identify the fundamental limitations of these instantaneous-state-only scheduling approaches and propose Temporal Optimal Resource scheduling via Two-layer Architecture (TORTA). TORTA introduces a spatiotemporal scheduling framework that captures both long-term workload patterns and short-term execution constraints. It adopts a two-layer design: a macro-level scheduler leverages reinforcement learning and optimal transport to coordinate inter-region task distribution, while a micro-level allocator refines task-to-server assignments within each region to reduce latency and switching costs. Experimental results across multiple network topologies show that TORTA reduces average inference response time by up to 15%, improves load balance by approximately 4-5%, and cuts total operational cost by 10-20% compared to state-of-the-art baseline methods.
大型语言模式(LLM)服务的迅速增长对分布式的GPU推导基础设施提出了越来越多的要求。大多数现有列表系统依靠目前的系统状态作出决定,而没有考虑任务需求和资源供应如何随时间推移而变化。这种缺乏时间意识导致GPU利用效率低、任务性迁移间接费用和动态工作量下系统反应能力差。在这项工作中,我们查明了这些只用瞬时状态的列表方法的根本局限性,并提议通过双层架构(TORTA)安排最佳的时空资源列表。TORTA引入了一个覆盖长期工作量模式和短期执行限制的全局性列表框架。它采用了两层设计:宏观级别调度器利用强化学习和最佳运输来协调跨区域任务分配,而微观级别分配器则改进了每个区域的任务-服务器任务分配,以减少消化和转换成本。多网络结构的实验结果显示,TORTA将平均发病反应时间降低到15,约4-5改善负荷平衡,并将总业务成本削减到10-20州基线方法。
Article 54
Title@2025-07-14 (1): Trinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of Large Language Models
Title: Trinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of Large Language Models | Trinity-RFT: Ein allgemein angelegtes und einheitliches Rahmenwerk zur Verstärkung der Feinsteuerung großer Sprachmodelle | 三一-RFT:加强大语言模式精美应用的一般目的和统一框架 2505.17826v2 |
Authors (14): Xuchen Pan, Yanxi Chen, Yushuo Chen, Yuchang Sun, Daoyuan Chen, Wenhao Zhang, Yuexiang Xie, Yilun Huang, Yilei Zhang, Dawei Gao, Weijie Shi, Yaliang Li, Bolin Ding, Jingren Zhou
Trinity-RFT is a general-purpose, unified and easy-to-use framework designed for reinforcement fine-tuning (RFT) of large language models. It is built with a modular and decoupled design, consisting of (1) an RFT-core that unifies and generalizes synchronous/asynchronous, on-policy/off-policy, and online/offline modes of RFT; (2) seamless integration for agent-environment interaction with high efficiency and robustness; and (3) systematic data pipelines optimized for RFT. Trinity-RFT can be easily adapted for diverse application scenarios, and serves as a unified platform for development and research of advanced reinforcement learning paradigms at both macroscopic and microscopic levels. This technical report outlines the vision, features, design and implementations of Trinity-RFT, accompanied by extensive examples, applications and experiments that demonstrate its functionalities and user-friendliness.
区域贸易论坛是一个通用、统一和易于使用的框架,旨在强化大型语言模型的微调(RFT),其设计是模块化和分离式的,包括:(1) RFT核心,统一和普及RFT的同步/非同步模式、政策/非政策模式以及在线/离线模式;(2) 代理-环境互动的无缝整合,高效和稳健;(3) 为RFT优化的系统数据管道。 三方-RFT可以很容易地适应不同的应用情景,并成为在宏观和微观两级开发和研究高级强化学习模式的统一平台。本技术报告概述了三一-RFT的愿景、特征、设计和实施,并附有展示其功能和用户友好度的大量实例、应用和实验。
Article 55
Title@2025-07-14 (1): Domain Borders Are There to Be Crossed With Federated Few-Shot Adaptation
Title: Domain Borders Are There to Be Crossed With Federated Few-Shot Adaptation | Domain-Grenzen gibt es mit Föderated Few-Shot-Anpassung überschritten werden | 与联邦几热量适应措施交界的域域边界 2507.10160v1 |
Authors (3): Manuel Röder, Christoph Raab, Frank-Michael Schleif
Federated Learning has emerged as a leading paradigm for decentralized, privacy-preserving learning, particularly relevant in the era of interconnected edge devices equipped with sensors. However, the practical implementation of Federated Learning faces three primary challenges: the need for human involvement in costly data labelling processes for target adaptation, covariate shift in client device data collection due to environmental factors affecting sensors, leading to discrepancies between source and target samples, and the impracticality of continuous or regular model updates in resource-constrained environments due to limited data transmission capabilities and technical constraints on channel availability and energy efficiency. To tackle these issues, we expand upon an efficient and scalable Federated Learning framework tailored for real-world client adaptation in industrial settings. This framework leverages a pre-trained source model comprising a deep backbone, an adaptation module, and a classifier running on a powerful server. By freezing the backbone and classifier during client adaptation on resource-constrained devices, we allow the domain adaptive linear layer to handle target domain adaptation, thus minimizing overall computational overhead. Furthermore, this setup, designated as FedAcross+, is extended to encompass the processing of streaming data, thereby rendering the solution suitable for non-stationary environments. Extensive experimental results demonstrate the effectiveness of FedAcross+ in achieving competitive adaptation on low-end client devices with limited target samples, successfully addressing the challenge of domain shift. Moreover, our framework accommodates sporadic model updates within resource-constrained environments, ensuring practical and seamless deployment.
联邦学习联合会已成为分散、隐私保护学习的主导范例,在配备传感器的相联边缘装置的时代尤为相关。然而,联邦学习联合会的实际实施面临三大挑战:在目标适应方面需要人参与费用高昂的数据标签程序,由于影响传感器的环境因素,客户设备数据收集的横向变化,导致源和目标样本之间的差异,以及由于数据传输能力有限,频道可用性和能源效率受到技术制约,资源紧张环境中连续或定期的模型更新不切实际,导致数据传输能力有限,使频道可用性和能源效率受到技术制约。为了解决这些问题,我们扩展了一个高效和可扩展的联邦学习框架,以适应工业环境中真实世界客户的适应性。这个框架利用了预先培训的源模式,其中包括一个深骨干、适应模块和在强大的服务器上运行的分类。通过冻结客户在资源限制装置上适应的骨干和分类器,我们允许域适应性线性线层处理目标领域适应,从而最大限度地减少总体计算间接费用。这个设置被指定为FedAcross +, 将数据流处理扩大到包括了流数据处理,从而将解决方案用于非固定的跨地点客户的升级,从而成功展示了在不易应用的客户环境上的风险。
Article 56
Title@2025-07-14 (1): Past-Future Scheduler for LLM Serving under SLA Guarantees
Title: Past-Future Scheduler for LLM Serving under SLA Guarantees | Zukünftiger Terminplaner für LLM-Wartung im Rahmen von SLA-Garantien | 根据苏丹解放军担保的LLM服务以往未来时间表表 2507.10150v1 |
Authors (8): Ruihao Gong, Shihao Bai, Siyu Wu, Yunqian Fan, Zaijun Wang, Xiuhong Li, Hailong Yang, Xianglong Liu
The exploration and application of Large Language Models (LLMs) is thriving. To reduce deployment costs, continuous batching has become an essential feature in current service frameworks. The effectiveness of continuous batching relies on an accurate estimate of the memory requirements of requests. However, due to the diversity in request output lengths, existing frameworks tend to adopt aggressive or conservative schedulers, which often result in significant overestimation or underestimation of memory consumption. Consequently, they suffer from harmful request evictions or prolonged queuing times, failing to achieve satisfactory throughput under strict Service Level Agreement (SLA) guarantees (a.k.a. goodput) across various LLM application scenarios with differing input-output length distributions. To address this issue, we propose a novel Past-Future scheduler that precisely estimates the peak memory resources required by the running batch by considering the historical distribution of request output lengths and calculating memory occupancy at each future time point. It adapts to applications with all types of input-output length distributions, balancing the trade-off between request queuing and harmful evictions, thereby consistently achieving better goodput. Furthermore, to validate the effectiveness of the proposed scheduler, we developed a high-performance LLM serving framework, LightLLM, that implements the Past-Future scheduler. Compared to existing aggressive or conservative schedulers, LightLLM demonstrates superior goodput, achieving up to 2-3$\times$ higher goodput than other schedulers under heavy loads. LightLLM is open-sourced to support further research in this direction (https://github.com/ModelTC/lightllm).
为降低部署成本,连续批量已成为当前服务框架的一个基本特征。连续批量的效力取决于对请求的内存要求的准确估计。然而,由于请求产出长度的多样性,现有框架倾向于采用侵略性或保守的排期表,这往往导致大量高估或低估记忆消耗量;因此,它们遭受有害的请求驱逐或长时间排队,未能在严格的服务级协议(a.k.a. goodput)下实现令人满意的输送量,无法在投入-产出长度分布不同、投入-产出长度分布不同的各种LLLM应用情景下实现令人满意的输送量。为了解决这一问题,我们建议采用新的过去-展望排期表,准确估计运行批量所需的高峰存储资源,为此要考虑到请求产出长度的历史分布,并计算记忆消耗量的每个时间点。因此,它们要适应各种投入-产出长度分布的应用,平衡请求中与有害驱逐之间的交易量,从而不断实现更高的投入-产出-产出长度分布。此外,为了更好地实现拟议的更高进度表,IMRLRRR的进度框架之下,更新了拟议的高预算进度表。
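The idea of computing memory occupancy at each future time point can be sketched as follows. This is a simplified illustration, not LightLLM's implementation: the helper names (`predict_remaining`, `peak_memory`, `can_admit`), the quantile-based length prediction, and the one-token-per-step memory model are assumptions made for the example.

```python
def predict_remaining(history, generated, quantile=0.9):
    """Guess how many more tokens a request will emit, using past output
    lengths that exceed what it has generated so far (assumed heuristic)."""
    longer = sorted(h for h in history if h > generated)
    if not longer:
        return 1  # assume at least one more token
    idx = min(len(longer) - 1, int(quantile * len(longer)))
    return longer[idx] - generated

def peak_memory(requests):
    """requests: list of (current_tokens, predicted_remaining_tokens).

    At future step t, every unfinished request holds current + t tokens;
    a request frees its memory after its predicted remaining tokens.
    """
    peak = 0
    # Occupancy only grows between completions, so it is enough to
    # evaluate the moments just before each request finishes.
    for t in sorted({rem for _, rem in requests}):
        occupied = sum(cur + t for cur, rem in requests if rem >= t)
        peak = max(peak, occupied)
    return peak

def can_admit(running, candidate, capacity):
    """Admit a new request only if the predicted peak still fits."""
    return peak_memory(running + [candidate]) <= capacity
```

With a running batch of `[(10, 5), (20, 2)]`, the predicted peak is 34 tokens (reached at step 2, just before the second request completes). Adding a candidate `(8, 10)` raises the peak to 44, so it is admitted only if the capacity is at least 44 tokens.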
Article 57
Title@2025-07-14 (1): Large-Scale Graph Building in Dynamic Environments: Low Latency and High Quality
Title: Large-Scale Graph Building in Dynamic Environments: Low Latency and High Quality | Large-Scale Graph Building in dynamischen Umgebungen: geringe Latenz und hohe Qualität | 动态环境中的大比例图建设:低长期和高质量 2507.10139v1 |
Authors (11): Filipe Miguel Gonçalves de Almeida, CJ Carey, Hendrik Fichtenberger, Jonathan Halcrow, Silvio Lattanzi, André Linhares, Tao Meng, Ashkan Norouzi-Fard, Nikos Parotsidis, Bryan Perozzi, David Simcha
Learning and constructing large-scale graphs has attracted attention in recent decades, resulting in a rich literature that introduced various systems, tools, and algorithms. Grale is one such tool, designed for offline environments and deployed in more than 50 different industrial settings at Google. Grale is widely applicable because of its ability to efficiently learn and construct a graph on datasets with multiple types of features. However, applications often require the underlying data to evolve continuously and rapidly, and the updated graph needs to be available with low latency. Such settings make the use of Grale prohibitive. While there are Approximate Nearest Neighbor (ANN) systems that handle dynamic updates with low latency, they are mostly limited to similarities over a single embedding. In this work, we introduce a system that inherits the advantages and the quality of Grale, and maintains a graph construction in a dynamic setting with tens of milliseconds of latency per request. We call the system Dynamic Grale Using ScaNN (Dynamic GUS). Our system has a wide range of applications with over 10 deployments at Google. One of the applications is in Android Security and Privacy, where Dynamic Grale Using ScaNN enables capturing harmful applications 4 times faster, before they can reach users.
近几十年来,学习和构建大型图表引起了人们的注意,导致丰富的文献文献引入了各种系统、工具和算法。Grale是针对离线环境设计的这类工具之一,在谷歌的50多个不同的工业环境中部署。Grale由于能够高效率地学习和构建具有多种特征的数据集图,因而广泛适用。然而,通常的情况是,应用程序需要基础数据不断和快速地演变,更新的图表需要低潜伏。这种设置使得Grale无法使用。虽然Grale是近邻(ANN)系统,用于处理动态更新的低潜伏,但这些工具大多限于与单个嵌入的相似之处。在这项工作中,我们引入了一个能够继承Grale的优势和质量的系统,并在一个动态环境中以数十毫秒的宽度来维持一个图形结构。我们称之为“动态Grale Grale”(DNNN)系统。我们的系统拥有广泛的应用范围,在Google10多个配置的近邻(ANN)系统,在低延度下处理动态更新动态更新的系统,但在单个嵌嵌嵌入中,但大多局限于一个单个嵌入系统,在Greal-Grairearea可以更快的S,在Grairearea 和Graial上快速的应用程序中,在Greal上,在Greareal可以进入Greamal上快速定位的S。
Article 58
Title@2025-07-14 (1): A Model Aware AIGC Task Offloading Algorithm in IIoT Edge Computing
Title: A Model Aware AIGC Task Offloading Algorithm in IIoT Edge Computing | AIGC-Aufgabe, die Algorithmen im IIoT Edge Computing ausladen | IIOT 边端计算中意识到的AIGC任务卸载算法模型 2507.11560v1 |
Authors (3): Xin Wang, Xiao Huan Li, Xun Wang
The integration of the Industrial Internet of Things (IIoT) with Artificial Intelligence-Generated Content (AIGC) offers new opportunities for smart manufacturing, but it also introduces challenges related to computation-intensive tasks and low-latency demands. Traditional generative models based on cloud computing struggle to meet the real-time requirements of AIGC tasks in IIoT environments, whereas edge computing can effectively reduce latency through task offloading. However, the dynamic nature of AIGC tasks, model switching delays, and resource constraints impose higher demands on edge computing environments. To address these challenges, this paper proposes an AIGC task offloading framework tailored for IIoT edge computing environments that, for the first time, considers the latency and energy consumption caused by AIGC model switching. IIoT devices act as multiple agents that collaboratively offload their dynamic AIGC tasks to the most appropriate edge servers deployed with different generative models. A model-aware AIGC task offloading algorithm based on Multi-Agent Deep Deterministic Policy Gradient (MADDPG-MATO) is devised to minimize latency and energy consumption. Experimental results show that MADDPG-MATO outperforms baseline algorithms, achieving an average reduction of 6.98% in latency and 7.12% in energy consumption, and a 3.72% increase in task completion rate, across four sets of experiments with model numbers ranging from 3 to 6. These results demonstrate that the proposed algorithm is robust and efficient in dynamic, high-load IIoT environments.
将物质工业互联网(IIOT)与人工智能化内容(AIGC)整合为人工智能化内容(AIGC)为智能制造提供了新的机会,但也带来了与计算密集型任务和低纬度需求相关的挑战。基于云计算的传统基因模型难以满足AIGC任务在IIOT环境中的实时要求,边际计算可以通过任务卸载有效减少潜伏。然而,AIGC任务的动态性质、模式转换延迟和资源限制对边际计算环境提出了更高的要求。为了应对这些挑战,本文件建议了AIGC任务卸载框架,专门为IIOT边缘计算环境定制,同时考虑到AIGC模型首次转换造成的延缩和能源消耗。基于II的惯性深度确定性政策强化(MADPG-MATOTO)的动态任务框架,即MADD-MAL 3的实验结果显示,在6-DGA-TOD 4级标准中,MADD-MA 标准任务将自动降压为6-% 标准。
Article 59
Title@2025-07-14 (1): ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism
Title: ElasticMM: Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism | ElasticMM: Effiziente multimodale LLMs mit elastischer multimodaler Parallelität | Elastic MM: 高效的多式多式LLMs 与 Elastic 多式平行主义一起服务 2507.10069v1 |
Authors (5): Zedong Liu, Shenggan Cheng, Guangming Tan, Yang You, Dingwen Tao
Multimodal large language models (MLLMs) extend LLMs to handle images, videos, and audio by incorporating feature extractors and projection modules. However, these additional components – combined with complex inference pipelines and heterogeneous workloads – introduce significant inference overhead. Therefore, efficiently serving MLLMs remains a major challenge. Current tightly coupled serving architectures struggle to distinguish between mixed request types or adapt parallelism strategies to different inference stages, leading to increased time-to-first-token (TTFT) latency and poor resource utilization. To address this, we propose Elastic Multimodal Parallelism (EMP), a new serving paradigm that elastically adapts to resource heterogeneity across request types and inference stages. Building upon EMP, we develop ElasticMM, an MLLM serving system that (1) separates requests into independent modality groups with dynamic resource allocation via a modality-aware load balancer; (2) decouples inference stages and enables parallelism adjustment and adaptive scaling via elastic partition scheduling; and (3) improves inference efficiency through unified multimodal prefix caching and non-blocking encoding. Experiments on diverse real-world datasets show that ElasticMM outperforms state-of-the-art (SOTA) serving systems, reducing TTFT by up to 4.2x and achieving 3.2-4.5x higher throughput while meeting service-level objectives (SLOs).
多模态大型语言模型(MLLM)通过加入特征提取器和投影模块,将LLM扩展到处理图像、视频和音频。然而,这些额外的组件加上复杂的推理流水线和异构的工作负载,带来了显著的推理开销。因此,高效地服务MLLM仍然是一项重大挑战。当前紧耦合的服务架构难以区分混合的请求类型,也难以针对不同的推理阶段调整并行策略,导致首词元时间(TTFT)延迟增加和资源利用率低下。为了解决这个问题,我们提出了弹性多模态并行(EMP),这是一种新的服务范式,能够弹性地适应不同请求类型和推理阶段之间的资源异构性。在EMP的基础上,我们开发了ElasticMM,这是一个MLLM服务系统,它(1) 通过模态感知的负载均衡器,将请求划分为可动态分配资源的独立模态组;(2) 解耦推理阶段,并通过弹性分区调度实现并行度调整和自适应伸缩;(3) 通过统一的多模态前缀缓存和非阻塞编码提高推理效率。在多种真实数据集上的实验表明,ElasticMM优于最先进(SOTA)的服务系统,在满足服务级别目标(SLO)的同时,将TTFT最多降低4.2倍,并实现3.2-4.5倍的吞吐量提升。
Article 60
Title@2025-07-14 (1): ECORE: Energy-Conscious Optimized Routing for Deep Learning Models at the Edge
Title: ECORE: Energy-Conscious Optimized Routing for Deep Learning Models at the Edge | ECORE: Energiebewusstes optimiertes Routing für Deep-Learning-Modelle am Rand | ECORE: 在边缘深层学习模型的能源普能优化运行 2507.06011v2 |
Authors (5): Daghash K. Alqahtani, Maria A. Rodriguez, Muhammad Aamir Cheema, Hamid Rezatofighi, Adel N. Toosi
Edge computing enables data processing closer to the source, significantly reducing latency, an essential requirement for real-time vision-based analytics such as object detection in surveillance and smart city environments. However, these tasks place substantial demands on resource-constrained edge devices, making the joint optimization of energy consumption and detection accuracy critical. To address this challenge, we propose ECORE, a framework that integrates multiple dynamic routing strategies, including estimation-based techniques and a greedy selection algorithm, to direct image processing requests to the most suitable edge device-model pair. ECORE dynamically balances energy efficiency and detection performance based on object characteristics. We evaluate our approach through extensive experiments on real-world datasets, comparing the proposed routers against widely used baseline techniques. The evaluation leverages established object detection models (YOLO, SSD, EfficientDet) and diverse edge platforms, including Jetson Orin Nano, Raspberry Pi 4 and 5, and TPU accelerators. Results demonstrate that our proposed context-aware routing strategies can reduce energy consumption and latency by 45% and 49%, respectively, while incurring only a 2% loss in detection accuracy compared to accuracy-centric methods.
电磁计算能够使数据处理更接近源头,大大降低基于实时视觉分析的基本要求,如监视和智能城市环境中的物体探测。然而,这些任务对资源有限的边缘装置提出了大量要求,使能源消耗和探测精确度的联合优化变得至关重要。为了应对这一挑战,我们提议ECORE,这是一个将多种动态路由战略(包括基于估算的技术)和贪婪的选择算法相结合的框架,将图像处理请求引导给最合适的边缘设备模型。ECORE动态地平衡了基于物体特性的能源效率和探测性能。我们通过对真实世界数据集的广泛实验来评估我们的方法,将拟议的路由器与广泛使用的基线技术进行比较。评价利用了既定的物体探测模型(YOLO、SSD、高效设计)和不同的边缘平台,包括Jetson Orin Nano、Raspberry Pi 4 和 5 以及 TPUcercercer 。结果显示,我们拟议的环境观测路由战略可以分别将能源消耗和耐用率分别减少45%和49 %,同时与精确度相比,在探测精确度方面仅损失2%。
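A greedy router of the kind mentioned above might look like the sketch below. It is an illustration under stated assumptions rather than ECORE's actual routing logic: the profile table, the "few"/"many" object-count buckets, the threshold of 3 objects, the accuracy tolerance, and the function name `route` are all hypothetical, and the numbers in the usage note are made up.

```python
def route(profiles, object_count, tolerance=0.02):
    """Pick the lowest-energy device-model pair whose profiled accuracy,
    for this image's object-count bucket, is within `tolerance` of the best.

    profiles: list of dicts with keys
      "pair"     -> (device, model)
      "energy_j" -> profiled energy per inference, in joules
      "accuracy" -> {"few": float, "many": float}
    """
    # Assumed bucketing of the (estimated) number of objects in the image.
    bucket = "few" if object_count <= 3 else "many"
    best_acc = max(p["accuracy"][bucket] for p in profiles)
    eligible = [p for p in profiles
                if p["accuracy"][bucket] >= best_acc - tolerance]
    # Greedy choice: cheapest pair among the near-best ones.
    return min(eligible, key=lambda p: p["energy_j"])["pair"]
```

With two hypothetical profiles, `("pi4", "ssd")` at 1.0 J with accuracies 0.80/0.50 and `("orin", "yolo")` at 3.0 J with accuracies 0.81/0.75, an image with one object is routed to the cheaper pair, while an image with ten objects goes to the more accurate one.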
Article 61
Title@2025-07-14 (1): EAT: QoS-Aware Edge-Collaborative AIGC Task Scheduling via Attention-Guided Diffusion Reinforcement Learning
Title: EAT: QoS-Aware Edge-Collaborative AIGC Task Scheduling via Attention-Guided Diffusion Reinforcement Learning | EAT: QoS-Aware Edge-Collaborative AIGC-Task Scheduling über aufmerksamkeitsgeführtes Diffusions-Verstärkungs-Lernen | EAT: 通过关注辅助推广强化学习安排任务 2507.10026v1 |
Authors (8): Zhifei Xu, Zhiqing Tang, Jiong Lou, Zhi Yao, Xuan Xie, Tian Wang, Yinglong Wang, Weijia Jia
The growth of Artificial Intelligence (AI) and large language models has enabled the use of Generative AI (GenAI) in cloud data centers for diverse AI-Generated Content (AIGC) tasks. Models like Stable Diffusion introduce unavoidable delays and substantial resource overhead, which are unsuitable for users at the network edge with high QoS demands. Deploying AIGC services on edge servers reduces transmission times but often leads to underutilized resources and fails to optimally balance inference latency and quality. To address these issues, this paper introduces a QoS-aware Edge-collaborative AIGC Task scheduling (EAT) algorithm. Specifically: 1) We segment AIGC tasks and schedule patches to various edge servers, formulating this as a gang scheduling problem that balances inference latency and quality while considering server heterogeneity, such as differing model distributions and cold start issues. 2) We propose a reinforcement learning-based EAT algorithm that uses an attention layer to extract load and task queue information from edge servers and employs a diffusion-based policy network for scheduling, efficiently enabling model reuse. 3) We develop an AIGC task scheduling system that uses our EAT algorithm to divide tasks and distribute them across multiple edge servers for processing. Experimental results based on our system and large-scale simulations show that our EAT algorithm can reduce inference latency by up to 56% compared to baselines. We release our open-source code at https://github.com/zzf1955/EAT.
人工智能(AI)和大型语言模型的增长使得在云中数据中心为各种人工智能内容(AIGC)任务使用GenAI(GenAI),稳定传播等模型带来了不可避免的延误和大量资源间接费用,不适合网络边缘用户,高QS需求。在边缘服务器上部署AIGC服务会减少传输时间,但往往导致资源利用不足,无法最佳平衡延迟和质量。为了解决这些问题,本文件引入了一种Qos-aware AI(GenAI),用于云中不同的 AI-Generaled Condition {E}Ge-colaure deline}Enderline{A}IGC\underline{A}IGC\underline{T}ask排程(EAT)算法。具体地:1)我们将AIGC的任务和排队任务排到各种边缘服务器的补丁。在考虑服务器的内置精度与质量平衡时,例如不同的模型分布和冷开始问题。2我们建议基于强化的基于EAT算法的EAT算法算法算法的计算方法,利用一个关注层来提取负荷和任务端端端端端端端端端系统,从而实现EAAT(EAT)的递制成一个基于EACT调度。
Article 62
Title@2025-07-14 (1): The Hitchhiker’s Guide to Programming and Optimizing Cache Coherent Heterogeneous Systems: CXL, NVLink-C2C, and AMD Infinity Fabric
Title: The Hitchhiker’s Guide to Programming and Optimizing Cache Coherent Heterogeneous Systems: CXL, NVLink-C2C, and AMD Infinity Fabric | Der Hitchhiker-Leitfaden zur Programmierung und Optimierung von Cache-Kohärenten Heterogenen Systemen: CXL, NVLink-C2C und AMD Infinity Fabric | Hitchhiker编程和优化缓存系统指南:CXL、NVLink-C2C和AMD无穷无尽 2411.02814v2 |
Authors (12): Zixuan Wang, Suyash Mahar, Luyi Li, Jangseon Park, Jinpyo Kim, Theodore Michailidis, Yue Pan, Mingyao Shen, Tajana Rosing, Dean Tullsen, Steven Swanson, Jishen Zhao
We present a thorough analysis of the use of modern heterogeneous systems interconnected by various cache-coherent links, including CXL, NVLink-C2C, and Infinity Fabric. We studied a wide range of server systems that combined CPUs from different vendors and various types of coherent memory devices, including CXL memory expander, CXL pool, CXL shared memory, GH200 GPU, and AMD MI300a HBM. For this study, we developed a heterogeneous memory benchmark suite, Heimdall, to profile the performance of such heterogeneous systems and present a detailed performance comparison across systems. By leveraging Heimdall, we unveiled the detailed architecture design in these systems, drew observations on optimizing performance for workloads, and pointed out directions for future development of cache-coherent heterogeneous systems.
我们透彻地分析了利用各种缓存连接,包括CXL、NVLink-C2C和Infinity Fabric等各种缓存连接连接的现代多元系统的使用情况。我们研究了范围广泛的服务器系统,这些系统将不同销售商的CPU和各种一致的内存装置,包括CXL记忆扩展器、CXL库、CXL共享记忆、GH200 GPU和AMD MI300a HBM结合起来。我们为这项研究开发了一套不同的记忆基准套件,即Heimdall,以描述这种缓存系统的性能,并提供了跨系统的详细性能比较。我们利用H E I M DA L L,公布了这些系统中的详细结构设计,就优化工作量绩效提出了意见,并指明了今后开发缓存一致的混合系统的方向。
Article 63
Title@2025-07-14 (1): Green-LLM: Optimal Workload Allocation for Environmentally-Aware Distributed Inference
Title: Green-LLM: Optimal Workload Allocation for Environmentally-Aware Distributed Inference | Green-LLM: Optimale Arbeitslastzuteilung für umweltbewusste Distributed Inferenz | Green-LLM:环境软件分布式推断的最佳工作负荷分配 2507.09942v1 |
Authors (2): Jiaming Cheng, Duong Tung Nguyen
This letter investigates the optimal allocation of large language model (LLM) inference workloads across heterogeneous edge data centers (DCs) over time. Each DC features on-site renewable generation and faces dynamic electricity prices and spatiotemporal variability in renewable availability. The central question is: how can inference workloads be optimally distributed to the DCs to minimize energy consumption, carbon emissions, and water usage while enhancing user experience? This letter proposes a novel optimization model for LLM service providers to reduce operational costs and environmental impacts. Numerical results validate the efficacy of the proposed approach.
本信调查大型语言模型(LLM)在不同时期在不同边缘数据中心之间最佳分配的推论工作量,每个发展中国家都具有现场可再生能源的特征,面临动态发电价格和可再生能源供应的时空变异性,中心问题是:如何最佳地将推论工作量分配给发展中国家,以最大限度地减少能源消耗、碳排放和用水,同时增强用户经验?本信为LLM服务供应商提出一个新的优化模式,以减少运营成本和环境影响。数字结果验证了拟议方法的有效性。
Article 64
Title@2025-07-14 (1): PhoenixOS: Concurrent OS-level GPU Checkpoint and Restore with Validated Speculation
Title: PhoenixOS: Concurrent OS-level GPU Checkpoint and Restore with Validated Speculation | PhoenixOS: Gleichzeitiger GPU-Checkpoint auf OS-Ebene und Wiederherstellung mit validierter Spekulation | 菲尼克斯:同步的OS级GPU检查站和经验证的投机恢复 2405.12079v2 |
Authors (8): Xingda Wei, Zhuobin Huang, Tianle Sun, Yingyi Hao, Rong Chen, Mingcong Han, Jinyu Gu, Haibo Chen
PhoenixOS (PhOS) is the first OS service that can concurrently checkpoint and restore (C/R) GPU processes – a fundamental capability for critical tasks such as fault tolerance, process migration, and fast startup. While concurrent C/R is well-established on CPUs, it poses unique challenges on GPUs due to their lack of essential features for efficiently tracing concurrent memory reads and writes, such as specific hardware capabilities (e.g., dirty bits) and OS-mediated data paths (e.g., copy-on-write). To ensure correct concurrent C/R, PhOS proactively detects GPU memory reads and writes through a two-step process: first, it speculates about GPU memory accesses based on the arguments used when launching GPU kernels; then, it validates these accesses efficiently at runtime using binary instrumentation. With this validated speculation, PhOS retrofits CPU-based concurrent C/R for GPUs through software-based approaches, including soft copy-on-write, soft recopy, and soft on-demand restore. PhOS further proposes several GPU-aware techniques for efficient GPU C/R, including coordinated checkpoint data transfer and an execution context pool. For downstream tasks that use C/R for tolerating failures, migrating processes live, and accelerating cold starts in serverless computing, PhOS achieves orders of magnitude higher performance than state-of-the-art OS-level GPU C/R systems like NVIDIA cuda-checkpoint.
菲尼克斯OS (PhOS) 是第一个可以同时检查和恢复(C/R) GPU进程(GPU进程)的OS服务, 这是一种基本能力, 用于完成错误容忍度、 进程迁移和快速启动等关键任务。 虽然在CPU上同时使用 C/ R 功能, 但它在GPU上提出了独特的挑战, 因为它们缺乏有效追踪同时记忆读和写的基本功能, 例如特定的硬件能力( 如脏比特) 和 OS 中介数据路径( 例如, 复制) 。 为确保正确同时使用 C/ R 程序, POS 主动检测 GPUPU 记忆水平的读写和写能力, 通过两步程序: 第一, 它根据启动 GPUPU 内核时所使用的参数, 假设 GPUA 快速操作系统 , 以及 CPUWI 运行系统 的快速操作系统 。
Article 65
Title@2025-07-14 (1): Content-Oblivious Leader Election in 2-Edge-Connected Networks
Title: Content-Oblivious Leader Election in 2-Edge-Connected Networks | Content-Offizier Leader Wahl in 2-Edge-Connected Networks | 以两E连结网络进行内容清晰的领袖选举 2507.08348v2 |
Authors (3): Yi-Jun Chang, Lyuting Chen, Haoran Zhou
Censor-Hillel, Cohen, Gelles, and Sela (PODC 2022 & Distributed Computing 2023) studied fully-defective asynchronous networks, where communication channels may suffer an extreme form of alteration errors, rendering messages completely corrupted. The model is equivalent to content-oblivious computation, where nodes communicate solely via pulses. They showed that if the network is 2-edge-connected, then any algorithm for a noiseless setting can be simulated in the fully-defective setting; otherwise, no non-trivial computation is possible in the fully-defective setting. However, their simulation requires a predesignated leader, which they conjectured to be necessary for any non-trivial content-oblivious task. Recently, Frei, Gelles, Ghazy, and Nolin (DISC 2024) refuted this conjecture for the special case of oriented ring topology. They designed two asynchronous content-oblivious leader election algorithms with message complexity $O(n \cdot \mathsf{ID}_{\max})$, where $n$ is the number of nodes and $\mathsf{ID}_{\max}$ is the maximum $\mathsf{ID}$. The first algorithm stabilizes in unoriented rings without termination detection. The second algorithm quiescently terminates in oriented rings, thus enabling the execution of the simulation algorithm after leader election. In this work, we present an asynchronous content-oblivious leader election algorithm that quiescently terminates in any 2-edge-connected network with message complexity $O(m \cdot N \cdot \mathsf{ID}_{\min})$, where $m$ is the number of edges, $N$ is a known upper bound on the number of nodes, and $\mathsf{ID}_{\min}$ is the smallest $\mathsf{ID}$. Combined with the previous simulation result, our finding implies that any algorithm from the noiseless setting can be simulated in the fully-defective setting without assuming a preselected leader, entirely refuting the original conjecture.
Censor-Hillel、 Cohen、 Gelles 和 Sela (PODC 2022 & Distributed Computing 2023) 研究了完全无效的匿名网络, 通信频道可能遭受极端的更改错误, 使信息完全腐败。 模型相当于内容可见的计算, 节点只能通过脉冲进行通信。 它们显示, 如果网络连接到两个边缘, 那么任何无噪音设置的算法都可以在完全失效的复杂环境中模拟; 否则, 在完全失效的设置中, 无法进行非三角的计算。 然而, 它们的模拟需要一个预指定的领导人。
Article 66
Title@2025-07-14 (1): Intelligent Task Management via Dynamic Multi-region Division in LEO Satellite Networks
Title: Intelligent Task Management via Dynamic Multi-region Division in LEO Satellite Networks | Intelligentes Task Management über die Division Dynamic Multi-Region in LEO-Satellitennetzen | 通过低地轨道卫星网络的动态多区域司进行智能任务管理 2507.09926v1 |
Authors (6): Zixuan Song, Zhishu Shen, Xiaoyu Zheng, Qiushi Zheng, Zheng Lei, Jiong Jin
As a key complement to terrestrial networks and a fundamental component of future 6G systems, Low Earth Orbit (LEO) satellite networks are expected to provide high-quality communication services when integrated with ground-based infrastructure, thereby attracting significant research interest. However, the limited satellite onboard resources and the uneven distribution of computational workloads often result in congestion along inter-satellite links (ISLs) that degrades task processing efficiency. Effectively managing the dynamic and large-scale topology of LEO networks to ensure balanced task distribution remains a critical challenge. To this end, we propose a dynamic multi-region division framework for intelligent task management in LEO satellite networks. This framework optimizes both intra- and inter-region routing to minimize task delay while balancing the utilization of computational and communication resources. Based on this framework, we propose a dynamic multi-region division algorithm based on the Genetic Algorithm (GA), which adaptively adjusts the size of each region based on the workload status of individual satellites. Additionally, we incorporate an adaptive routing algorithm and a task splitting and offloading scheme based on Multi-Agent Deep Deterministic Policy Gradient (MA-DDPG) to effectively accommodate the arriving tasks. Simulation results demonstrate that our proposed framework outperforms comparative methods in terms of the task delay, energy consumption per task, and task completion rate.
作为地面网络的关键补充和未来6G系统的基本组成部分,低地轨道卫星网络作为地面网络的关键补充和未来6G系统的基本组成部分,预期在与地面基础设施相结合时将提供高质量的通信服务,从而吸引重大研究兴趣;然而,卫星在船上的资源有限,计算工作量分布不均,往往导致卫星之间联系的拥挤,降低任务处理效率;有效管理低地轨道网络的动态和大规模地形学,确保任务分配平衡,仍然是一项重大挑战;为此,我们提议了一个动态多区域司框架,用于低地轨道卫星网络的智能任务管理;这一框架优化了区域内和区域间的路线,以尽量减少任务延误,同时平衡计算和通信资源的利用;根据这一框架,我们提议了一个动态的多区域分工算法,根据单个卫星的工作量状况调整每个区域的规模;此外,我们纳入了适应性路线算法和任务分工和卸载计划,其基础是多度低地轨道卫星网络的智能任务管理。
Article 67
Title@2025-07-14 (1): QPET: A Versatile and Portable Quantity-of-Interest-Preservation Framework for Error-Bounded Lossy Compression
Title: QPET: A Versatile and Portable Quantity-of-Interest-Preservation Framework for Error-Bounded Lossy Compression | QPET: Ein vielseitiges und tragbares Quantitäts-of-Interest-Preservation-Framework für fehlerbegründete Verlustkompression | QPET: 差错错错错错损损压缩易容和可移动的利差数量保护框架 2412.02799v4 |
Authors (6): Jinyang Liu, Pu Jiao, Kai Zhao, Xin Liang, Sheng Di, Franck Cappello
Error-bounded lossy compression has been widely adopted in many scientific domains because it can address the challenges in storing, transferring, and analyzing unprecedented amounts of scientific data. Although error-bounded lossy compression offers general data distortion control by enforcing strict error bounds on raw data, it may fail to meet the quality requirements on the results of downstream analysis, a.k.a. Quantities of Interest (QoIs), derived from raw data. This may lead to uncertainties and even misinterpretations in scientific discoveries, significantly limiting the use of lossy compression in practice. In this paper, we propose QPET, a novel, versatile, and portable framework for QoI-preserving error-bounded lossy compression, which overcomes the challenges of modeling diverse QoIs by leveraging numerical strategies. QPET features (1) high portability to multiple existing lossy compressors, (2) versatile preservation of most differentiable univariate and multivariate QoIs, and (3) significant compression improvements in QoI-preservation tasks. Experiments with six real-world datasets demonstrate that integrating QPET into state-of-the-art error-bounded lossy compressors can yield 2x to 10x compression speedups over existing QoI-preserving error-bounded lossy compression solutions, up to 1000% compression ratio improvements over general-purpose compressors, and up to 133% compression ratio improvements over existing QoI-integrated scientific compressors.
在许多科学领域,广泛采用了由错误造成的损失压缩,因为它可以应对储存、转移和分析前所未有的科学数据数量方面的挑战。虽然由错误造成的损失压缩通过对原始数据实施严格的误差限制提供了一般数据扭曲控制,但它可能无法满足下游分析结果的质量要求,即:a.k.a.从原始数据得出的利息数量(QoIs),这可能导致科学发现中的不确定性,甚至误解,大大限制了在实践中使用损失压缩比例。在本文中,我们提议为QoI保存错误造成的损失压缩提供一个新颖、多功能和便携式框架,通过运用数字战略克服了模型化不同QoI的难题。QPET特性(1) 对现有多种损失压缩压缩机的高度可感知性,(2) 对最不同的不可辨和多变的不可变性QoI-保存任务中的大幅压缩压缩比例改进。
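To make the notion of QoI preservation concrete, the sketch below converts a tolerance on a derived quantity into a pointwise error bound on the raw data. It is only an illustration under stated assumptions, not QPET's actual procedure: the QoI is assumed to be f(x) = x^2, and the function name `pointwise_bound_for_square` and the tolerance `tau` are hypothetical. If |x' - x| <= eps, the worst case is |x'| = |x| + eps, so eps * (2|x| + eps) <= tau, which gives eps = sqrt(x^2 + tau) - |x|.

```python
import math

def pointwise_bound_for_square(x: float, tau: float) -> float:
    """Largest eps such that |x'**2 - x**2| <= tau whenever |x' - x| <= eps.

    Assumes the (hypothetical) QoI f(x) = x**2 and a QoI tolerance tau >= 0.
    """
    # Worst case is |x'| = |x| + eps, i.e. eps * (2|x| + eps) = tau.
    return math.sqrt(x * x + tau) - abs(x)

def qoi_error(x: float, x_reconstructed: float) -> float:
    """Absolute error of the QoI f(x) = x**2 after lossy reconstruction."""
    return abs(x_reconstructed ** 2 - x ** 2)

eps = pointwise_bound_for_square(3.0, 7.0)
print(eps)                        # 1.0
print(qoi_error(3.0, 3.0 + eps))  # 7.0, exactly the tolerance
print(qoi_error(3.0, 3.0 - eps))  # 5.0, inside the tolerance
```

The bound shrinks as |x| grows, which is one reason a single global error bound on raw data can be either too loose or needlessly tight for a given QoI.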
Article 68
Title@2025-07-14 (1): Module-conditioned distribution of quantum circuits
Title: Module-conditioned distribution of quantum circuits | Modulkonditionierte Verteilung von Quantenkreisen | 量子电路的模块化配送 2501.11816v2 |
Authors (2): Hyunho Cha, Jungwoo Lee
As quantum computers require highly specialized and stable environments to operate, expanding their capabilities within a single system presents significant technical challenges. By interconnecting multiple quantum processors, distributed quantum computing can facilitate the execution of more complex and larger-scale quantum algorithms. End-to-end heuristics for the distribution of quantum circuits have been developed so far. In this work, we derive an exact integer programming approach for the Distributed Quantum Circuit (DQC) problem, assuming fixed module allocations. Since every DQC algorithm necessarily yields a module allocation function, our formulation can be integrated with it as a post-processing step. This improves on the hypergraph partitioning formulation, which finds a module allocation function and an efficient distribution at once. We also show that a suboptimal heuristic to find good allocations can outperform previous methods. In particular, for quantum Fourier transform circuits, we conjecture from experiments that the optimal module allocation is the trivial one found by this method.
由于量子计算机需要高度专业化和稳定的操作环境,因此在单一系统中扩大其能力带来了巨大的技术挑战。通过将多个量子处理器相互连接,分布式量子计算可以促进执行更复杂和更大的量子算法。迄今为止,已经开发出量子电路分布的端到端超常法。在这项工作中,我们为分布量子电路(DQC)问题得出了精确的整数编程方法,假设了固定的模块分配。由于每个 DQC 算法必然产生一个模块分配功能,因此我们的配方可以作为一个后处理步骤与它结合。这改善了高压分区配方,即找到一个模块分配功能,一次有效地分配。我们还表明,找到良好分配的次优劣性超标能能够超过以往的方法。特别是对于量 Fourier变电路来说,我们从实验中推断出,最佳模块分配是这种方法发现的微小的。
Article 69
Title@2025-07-14 (1): InstCache: A Predictive Cache for LLM Serving
Title: InstCache: A Predictive Cache for LLM Serving | InstCache: Ein vorausschauender Cache für LLM Serving | Instcache:LLM服务预测缓存 2411.13820v2 |
Authors (6): Longwei Zou, Yan Liu, Jiamu Kang, Tingfeng Liu, Jiangang Kong, Yangdong Deng
The revolutionary capabilities of Large Language Models (LLMs) are attracting rapidly growing popularity and leading to soaring user requests to inference serving systems. Caching techniques, which leverage data reuse to reduce computation, offer opportunities to optimize the performance of LLM inference engines. On the one hand, the low-level key-value (KV) cache working at the token level is widely adopted, although it incurs significant overhead as request volume grows. On the other hand, instruction-level caching, which stores full instruction-response pairs, is expected to play an increasingly crucial role. However, the high variability in the content and length of instructions makes it rare for identical instructions to recur within a short time window, presenting challenges for effectively caching instruction-response pairs. To address this challenge, we propose InstCache, a predictive caching mechanism for LLM serving systems. Leveraging the capability of LLMs, we can effectively reorder the representation space of instruction texts and develop a sufficient level of spatial locality. Such spatial locality enables us to predict potential instructions located in a compact region in the space, resulting in an effective caching system at runtime. Experimental results demonstrate that InstCache achieves a 2.3x higher hit rate compared to the upper bound of traditional caching mechanisms on the WildChat dataset and reduces the time per output token of vLLM by up to 42.0% and 50.0% on the LMSys and Moss datasets, respectively.
大语言模型(LLMS)的革命性能力正在吸引迅速增长的受欢迎度,并导致用户对推断服务系统的要求激增。 利用数据再利用数据再利用以降低计算,缓冲技术为优化LLM推断引擎的性能提供了机会。 一方面,在象征性层面工作的低级别关键值缓冲机制(KV)被广泛采用,尽管随着请求量的增加,它会产生巨大的间接成本。另一方面,储存全方位教学响应配对的指令级缓冲预计将发挥越来越关键的作用。然而,由于指令的内容和长度差异很大,在短时间窗口内重复相同指令的情况十分罕见,对有效缓冲指令-反应配对提出了挑战。为了应对这一挑战,我们提议InstCache,即低级别的关键值缓冲机制,尽管随着请求量的增加,我们可有效地重新排列教学文本的表述空间,并开发出足够的空间位置。这种空间位置将使我们能够预测位于紧凑区域的潜在指令,导致在运行时有效的缓冲系统重复,对有效的指令-反应对有效缓冲指令-反应双对双对等。 我们提议,IstcachealCsalalalalalalalalalmax 结果显示,每个创机机率将每达到2.3C的峰输出率。
Article 70
Title@2025-07-13 (7): SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving
Title: SLED: A Speculative LLM Decoding Framework for Efficient Edge Serving | SLED: Ein spekulatives LLM-Decoding-Framework für effizientes Edge Serving | SLED: 有效边缘服务投机性LLM代谢框架 2506.09397v4 |
Authors (8): Xiangchen Li, Dimitrios Spatharakis, Saeid Ghafouri, Jiakun Fan, Hans Vandierendonck, Deepu John, Bo Ji, Dimitrios Nikolopoulos
The growing gap between the increasing complexity of large language models (LLMs) and the limited computational budgets of edge devices poses a key challenge for efficient on-device inference, despite gradual improvements in hardware capabilities. Existing strategies, such as aggressive quantization, pruning, or remote inference, trade accuracy for efficiency or lead to substantial cost burdens. This position paper introduces a new framework that leverages speculative decoding, previously viewed primarily as a decoding acceleration technique for autoregressive generation of LLMs, as a promising approach specifically adapted for edge computing by orchestrating computation across heterogeneous devices. We propose SLED, a framework that allows lightweight edge devices to draft multiple candidate tokens locally using diverse draft models, while a single, shared edge server verifies the tokens utilizing a more precise target model. To further increase the efficiency of verification, the edge server batches the diverse verification requests from devices. This approach supports device heterogeneity and reduces server-side memory footprint by sharing the same upstream target model across multiple devices. Our initial experiments with Jetson Orin Nano, Raspberry Pi 4B/5, and an edge server equipped with 4 Nvidia A100 GPUs indicate substantial benefits: 2.2x more system throughput, 2.8x more system capacity, and better cost efficiency, all without sacrificing model accuracy.
大型语言模型(LLMs)日益复杂,而边缘装置计算预算有限,这两者之间日益加大了差距,尽管硬件能力逐步改善,但对于高效的在设备上进行精密推断是一个关键的挑战。现有的战略,如进取量计、裁剪或远程推断、效率方面的贸易准确性或导致巨大的成本负担。本立场文件引入了一个新的框架,利用投机性解码技术,利用投机性解码,过去主要被视为自动递减生成LMs的一种解码加速技术,这是一种有希望的方法,通过在多种设备中串通计算来专门适应边缘计算。我们提议了\acronym,这个框架允许轻量级边设备使用不同的草稿模型在当地起草多个候选标牌,而一个单一的共享边端服务器则使用更精确的目标模型来验证这些标牌。为了进一步提高核查效率,边端服务器从设备中收集了各种核查请求。这个方法支持装置的异质性,并通过共享多个设备共享相同的上游目标模型来减少服务器的记忆足迹。我们与Jetson Orin Nano、Raspberry Pi 4B/5、更精度系统更精准性、更精准性地显示所有N4VLALVALServi系统。
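The draft-then-verify loop described above can be sketched with toy stand-ins for the models. This is a minimal illustration of greedy speculative decoding in general, not SLED's implementation: `Model`, `draft_tokens`, `verify`, and the two toy models are hypothetical names, and a real server would check all drafted positions in one batched forward pass rather than one call per token.

```python
from typing import Callable, List, Sequence

# Hypothetical interface: a model maps a token context to its greedy next token.
Model = Callable[[Sequence[int]], int]

def draft_tokens(draft: Model, prefix: List[int], k: int) -> List[int]:
    """Edge device: propose k tokens autoregressively with the small draft model."""
    context = list(prefix)
    proposed = []
    for _ in range(k):
        token = draft(context)
        proposed.append(token)
        context.append(token)
    return proposed

def verify(target: Model, prefix: List[int], proposed: List[int]) -> List[int]:
    """Server: keep drafted tokens while they match the target model's greedy
    choice; at the first mismatch emit the target's token and stop. If every
    draft is accepted, one extra target token is appended."""
    context = list(prefix)
    accepted = []
    for token in proposed:
        expected = target(context)
        if expected != token:
            accepted.append(expected)
            return accepted
        accepted.append(token)
        context.append(token)
    accepted.append(target(context))
    return accepted

# Toy models: the target counts upward modulo 10; the draft errs after token 3.
def target_model(context: Sequence[int]) -> int:
    return (context[-1] + 1) % 10

def draft_model(context: Sequence[int]) -> int:
    return 0 if context[-1] == 3 else (context[-1] + 1) % 10

print(verify(target_model, [1], draft_tokens(draft_model, [1], 4)))  # [2, 3, 4]
print(verify(target_model, [5], draft_tokens(draft_model, [5], 2)))  # [6, 7, 8]
```

Because every emitted token equals the target model's own greedy choice, the output matches what the target alone would produce, which is consistent with the abstract's claim that accuracy is not sacrificed.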
Article 71
Title@2025-07-13 (7): TimberStrike: Dataset Reconstruction Attack Revealing Privacy Leakage in Federated Tree-Based Systems
Title: TimberStrike: Dataset Reconstruction Attack Revealing Privacy Leakage in Federated Tree-Based Systems | TimberStrike: Datensatz-Rekonstruktion Angriff Enthüllen der Privatsphäre Leckage in Federated Tree-Based Systems | 木材三角:联邦树基系统中数据集重建攻击清除隐私渗漏 2506.07605v3 |
Authors (5): Marco Di Gennaro, Giovanni De Lucia, Stefano Longari, Stefano Zanero, Michele Carminati
Federated Learning has emerged as a privacy-oriented alternative to centralized Machine Learning, enabling collaborative model training without direct data sharing. While extensively studied for neural networks, the security and privacy implications of tree-based models remain underexplored. This work introduces TimberStrike, an optimization-based dataset reconstruction attack targeting horizontally federated tree-based models. Our attack, carried out by a single client, exploits the discrete nature of decision trees by using split values and decision paths to infer sensitive training data from other clients. We evaluate TimberStrike on state-of-the-art federated gradient boosting implementations across multiple frameworks, including Flower, NVFlare, and FedTree, demonstrating their vulnerability to privacy breaches. On a publicly available stroke prediction dataset, TimberStrike consistently reconstructs between 73.05% and 95.63% of the target dataset across all implementations. We further analyze Differential Privacy, showing that while it partially mitigates the attack, it also significantly degrades model performance. Our findings highlight the need for privacy-preserving mechanisms specifically designed for tree-based Federated Learning systems, and we provide preliminary insights into their design.
联邦学习联合会已成为中央机构学习的以隐私为导向的替代方案,有利于合作模式培训,而没有直接分享数据。尽管对神经网络进行了广泛研究,但基于树的模型对安全和隐私的影响仍未得到充分探讨。这项工作引入了TaultStrike,这是以横向结合的树为基础的模型为对象的基于优化的数据元重建攻击。我们由一个客户进行的攻击,利用决策树的离散性质,利用不同的价值和决定路径从其他客户处推断敏感培训数据。我们评估了木材在包括Flower、NVFFlare和FedTre在内的多个框架的州级联盟梯度促进实施方面发生的碰撞,展示了它们易受隐私破坏的脆弱性。在公开提供的中风预测数据集中,木材Strike持续地重建了所有执行过程中目标数据集的73.05 %至95.63%。我们进一步分析差异隐私,表明它虽然部分减轻了攻击,但也显著地降低了模型性。我们的调查结果突出表明需要专门为基于树木的联邦学习系统设计的隐私保护机制,我们提供了初步的见解。
Article 72
Title@2025-07-13 (7): Compute Can’t Handle the Truth: Why Communication Tax Prioritizes Memory and Interconnects in Modern AI Infrastructure
Title: Compute Can’t Handle the Truth: Why Communication Tax Prioritizes Memory and Interconnects in Modern AI Infrastructure | Berechnen kann nicht mit der Wahrheit umgehen: Warum Kommunikationssteuer das Gedächtnis und die Verbindungen in der modernen KI-Infrastruktur priorisiert | 计算无法处理真相:为什么通讯税在现代AI基础设施中将记忆和相互联系放在优先地位? 2507.07223v2 |
Authors (1): Myoungsoo Jung
Modern AI workloads such as large language models (LLMs) and retrieval-augmented generation (RAG) impose severe demands on memory, communication bandwidth, and resource flexibility. Traditional GPU-centric architectures struggle to scale due to growing inter-GPU communication overheads. This report introduces key AI concepts and explains how Transformers revolutionized data representation in LLMs. We analyze large-scale AI hardware and data center designs, identifying scalability bottlenecks in hierarchical systems. To address these, we propose a modular data center architecture based on Compute Express Link (CXL) that enables disaggregated scaling of memory, compute, and accelerators. We further explore accelerator-optimized interconnects, collectively termed XLink (e.g., UALink, NVLink, NVLink Fusion), and introduce a hybrid CXL-over-XLink design to reduce long-distance data transfers while preserving memory coherence. We also propose a hierarchical memory model that combines local and pooled memory, and evaluate lightweight CXL implementations, HBM, and silicon photonics for efficient scaling. Our evaluations demonstrate improved scalability, throughput, and flexibility in AI infrastructure.
大型语言模型(LLMS)和检索增强的生成(RAG)等现代AI工作量,如大型语言模型(LLMS)和检索增强的生成(RAG),对记忆、通信带宽和资源灵活性提出了严重的要求。传统的GPU中心建筑由于GPU之间的通信管理费用不断增加而难以扩大规模。本报告介绍主要的AI概念,并解释变异器如何在LLMS中使数据代表发生革命。我们分析大型AI硬件和数据中心设计,找出等级系统中的可缩放瓶颈。为了解决这些问题,我们提议基于计算快递链接(CXL)的模块式数据中心结构,以便能够对记忆、计算和加速器进行分解的缩。我们进一步探索加速器-优化的互联互通-集体称为XLink(例如, ALink, NVVLink, NVLink Fulsion)- 并采用混合的 CXL-over-XLink设计,以减少长距离数据传输,同时保持记忆的一致性。我们还提议一个等级记忆模型,将本地和集合记忆结合起来,并评价轻型的CXLL(轻重 CXL)执行、HBMMM, 和硅灵活性,展示我们通过高效的升级和智能基础设施。
Article 73
Title@2025-07-13 (7): Two Pareto Optimum-based Heuristic Algorithms for Minimizing Tardiness and Late Jobs in the Single Machine Flowshop Problem
Title: Two Pareto Optimum-based Heuristic Algorithms for Minimizing Tardiness and Late Jobs in the Single Machine Flowshop Problem | Zwei Pareto Optimale Heuristische Algorithmen zur Minimierung von Tardiness und Spätjobs im Single Machine Flowshop-Problem | 两种基于Pareto Opptimim 的以Pareto Opptimim 为基础的在单一机器流动问题中尽量减少迟滞和迟到就业机会的优优性乘数 2409.03778v2 |
Authors (6): Matthew Gradwohl, Guidio Sewa, Oke Blessing Oghojafor, Richard Wilouwou, Muminu Adamu, Christopher Thron
Flowshop problems play a prominent role in operations research and have considerable practical significance. The single-machine flowshop problem is of particular theoretical interest. Until now, the problem of minimizing late jobs or job tardiness could be solved exactly only by computationally intensive methods such as dynamic programming or linear programming. In this paper we introduce, test, and optimize two new heuristic algorithms for mixed tardiness and late job minimization in single-machine flowshops. Both algorithms build partial schedules iteratively. Both also retain Pareto optimal solutions at intermediate stages, to take into account both tardiness and late jobs within the partial schedule, as well as the effect of partial completion time on not-yet-scheduled jobs. Both algorithms can be applied to scenarios with hundreds of jobs, with execution times running from less than a second to a few minutes. Although they are slower than dispatch rule-based heuristics, the solutions obtained are far better. We also compare a neural-network solution, which performs poorly.
流程问题在业务研究中起着突出作用,具有相当大的实际意义。 单机流程问题在理论上特别有意义。 直到现在,最大限度地减少延迟工作或延迟工作的问题只能通过诸如动态编程或线性编程等计算密集型方法才能完全解决。 在本文件中,我们引入、测试和优化两种新的超速算法,用于在单机流程工厂中混合延迟和延迟工作最小化。两种算法都迭接地构建了部分时间表。两种算法还保留了中间阶段的Pareto最佳解决方案,以考虑到部分计划内的延迟和延迟工作,以及部分完成时间对非实时工作的影响。两种算法都可以适用于数百个工作的情景,执行时间从不到一秒到几分钟。虽然比发送基于规则的超时化速度慢,但获得的解决方案要好得多。 我们还比较了神经网络解决方案,其表现很差。
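The two objectives and the Pareto filtering mentioned above can be illustrated on a tiny instance. The sketch below is not either of the paper's heuristics: it simply scores every ordering of three made-up jobs by brute force and keeps the non-dominated (total tardiness, late-job count) pairs. The `Job` layout and the function names `evaluate` and `pareto_front` are assumptions for illustration.

```python
from itertools import permutations
from typing import Iterable, List, Tuple

Job = Tuple[int, int]    # (processing_time, due_date), hypothetical layout
Score = Tuple[int, int]  # (total_tardiness, late_job_count)

def evaluate(sequence: Iterable[Job]) -> Score:
    """Score a single-machine job sequence on both objectives."""
    clock = tardiness = late = 0
    for processing_time, due_date in sequence:
        clock += processing_time           # completion time of this job
        if clock > due_date:
            tardiness += clock - due_date  # how far past its due date
            late += 1
    return tardiness, late

def pareto_front(scores: Iterable[Score]) -> List[Score]:
    """Keep scores that no other score matches or beats on both objectives."""
    unique = set(scores)
    return sorted(
        s for s in unique
        if not any(o != s and o[0] <= s[0] and o[1] <= s[1] for o in unique)
    )

jobs = [(10, 10), (3, 12), (3, 14)]  # made-up instance
front = pareto_front(evaluate(order) for order in permutations(jobs))
print(front)  # [(3, 2), (4, 1)]
```

Even with three jobs, no single ordering is best on both measures: one has less total tardiness with two late jobs, the other has one late job with more tardiness. Retaining several such candidates at intermediate stages, rather than committing to one, is the idea the abstract attributes to both heuristics.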
Article 74
Title@2025-07-13 (7): PromptChain: A Decentralized Web3 Architecture for Managing AI Prompts as Digital Assets
Title: PromptChain: A Decentralized Web3 Architecture for Managing AI Prompts as Digital Assets | PromptChain: Eine dezentralisierte Web3-Architektur zur Verwaltung von AI-Prompts als digitale Assets | Prentchain:一个分散式网络3架构,用以管理作为数字资产的AI 提示 2507.09579v1 |
Authors (1): Marc Bara
We present PromptChain, a decentralized Web3 architecture that establishes AI prompts as first-class digital assets with verifiable ownership, version control, and monetization capabilities. Current centralized platforms lack mechanisms for proper attribution, quality assurance, or fair compensation for prompt creators. PromptChain addresses these limitations through a novel integration of IPFS for immutable storage, smart contracts for governance, and token incentives for community curation. Our design includes: (1) a comprehensive metadata schema for cross-model compatibility, (2) a stake-weighted validation mechanism to align incentives, and (3) a token economy that rewards contributors proportionally to their impact. The proposed architecture demonstrates how decentralized systems could potentially match centralized alternatives in efficiency while providing superior ownership guarantees and censorship resistance through blockchain-anchored provenance tracking. By decoupling prompts from specific AI models or outputs, this work establishes the foundation for an open ecosystem of human-AI collaboration in the Web3 era, representing the first systematic treatment of prompts as standalone digital assets with dedicated decentralized infrastructure.
我们提出PerentChain,这是一个分权的Web3架构,它把AI作为头等数字资产建立,具有可核查的所有权、版本控制和货币化能力。目前中央集权平台缺乏适当归属、质量保证或对迅速创造者公平补偿的机制。Perent Chain通过将IPS与不可改变的存储、智能治理合同和社区整理象征性激励等新型整合,来解决这些限制。我们的设计包括:(1) 综合元数据组合,用于跨模范兼容性,(2) 组合式验证机制,(3) 按比例奖励捐助者的象征性经济。 拟议的架构表明,分散化系统如何在效率上与集中的替代方法相匹配,同时通过链式源跟踪提供高级所有权保障和审查阻力。 通过拆分具体的AI模式或产出,这项工作为在Web3时代人类-AI合作的开放生态系统奠定了基础,代表首次系统处理作为独立数字资产的快速点,并专门分散化基础设施。
Article 75
Title@2025-07-13 (7): Lightweight Federated Learning over Wireless Edge Networks
Title: Lightweight Federated Learning over Wireless Edge Networks | Leichtes Federated Learning über drahtlose Edge-Netzwerke | 对无线边缘网络进行轻量量量联邦学习 2507.09546v1 |
Authors (6): Xiangwang Hou, Jingjing Wang, Jun Du, Chunxiao Jiang, Yong Ren, Dusit Niyato
With the exponential growth of smart devices connected to wireless networks, data production is increasing rapidly, requiring machine learning (ML) techniques to unlock its value. However, the centralized ML paradigm raises concerns over communication overhead and privacy. Federated learning (FL) offers an alternative at the network edge, but practical deployment in wireless networks remains challenging. This paper proposes a lightweight FL (LTFL) framework integrating wireless transmission power control, model pruning, and gradient quantization. We derive a closed-form expression of the FL convergence gap, considering transmission error, model pruning error, and gradient quantization error. Based on these insights, we formulate an optimization problem to minimize the convergence gap while meeting delay and energy constraints. To solve the non-convex problem efficiently, we derive closed-form solutions for the optimal model pruning ratio and gradient quantization level, and employ Bayesian optimization for transmission power control. Extensive experiments on real-world datasets show that LTFL outperforms state-of-the-art schemes.
随着智能装置与无线网络连接的指数增长,数据生产正在迅速增长,需要机器学习技术来释放其价值。然而,中央ML模式引起了对通信间接费用和隐私的关切。联邦学习(FL)在网络边缘提供了一种替代办法,但无线网络的实际部署仍然具有挑战性。本文提议了一个轻度FL(LTFL)框架,将无线传输电源控制、模型旋转和梯度四分法结合起来。我们从FL聚合差距的封闭式表达方式中得出一个考虑到传输错误、模型倾斜错误和梯度四分法错误。根据这些洞察,我们形成了一个优化问题,以尽量减少汇合差距,同时应对延迟和能源限制。为了有效解决非碳化问题,我们为最佳模型运行率和梯度四分法水平找到封闭式解决方案,并利用贝耶斯优化来控制传输电源。关于真实世界数据集的广泛实验显示,LTFL优于最先进的计划。
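Two of the three ingredients named above, model pruning and gradient quantization, can be sketched in a few lines. This is a generic illustration under stated assumptions, not the LTFL scheme: magnitude-based pruning and unbiased stochastic uniform quantization are swapped in as common stand-ins, and `ratio` and `levels` are hypothetical parameters (the paper derives closed-form optimal values that are not reproduced here).

```python
import random
from typing import List

def prune_by_magnitude(weights: List[float], ratio: float) -> List[float]:
    """Zero out the fraction `ratio` of weights with the smallest magnitude."""
    k = int(len(weights) * ratio)
    smallest = sorted(range(len(weights)), key=lambda i: abs(weights[i]))[:k]
    dropped = set(smallest)
    return [0.0 if i in dropped else w for i, w in enumerate(weights)]

def quantize_stochastic(grad: List[float], levels: int,
                        rng: random.Random) -> List[float]:
    """Map each gradient entry onto a uniform grid of `levels` points between
    min(grad) and max(grad), rounding up with probability equal to the
    fractional position so the result is unbiased in expectation."""
    lo, hi = min(grad), max(grad)
    if levels < 2 or hi == lo:
        return list(grad)
    step = (hi - lo) / (levels - 1)
    quantized = []
    for g in grad:
        position = (g - lo) / step
        base = int(position)               # grid point at or below g
        fraction = position - base
        index = base + (1 if rng.random() < fraction else 0)
        index = min(index, levels - 1)     # guard against float overshoot
        quantized.append(lo + index * step)
    return quantized

rng = random.Random(0)
print(prune_by_magnitude([0.5, -0.1, 2.0, 0.05], ratio=0.5))  # [0.5, 0.0, 2.0, 0.0]
print(quantize_stochastic([-1.0, 0.0, 0.25, 1.0], levels=5, rng=rng))
```

Fewer levels and a higher pruning ratio cut the bits sent per round but add error, which is the trade-off the abstract's convergence-gap analysis is meant to balance against delay and energy constraints.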
Article 76
Title@2025-07-13 (7): FastSet: Parallel Claim Settlement
Title: FastSet: Parallel Claim Settlement | FastSet: Parallele Forderungsabrechnung | FastSet:平行索赔理赔 2506.23395v3 |
Authors (2): Xiaohong Chen, Grigore Rosu
FastSet is a distributed protocol for decentralized finance and settlement, inspired by both actors and blockchains. Account holders cooperate by making claims, which can include payments, holding and transferring assets, accessing and updating shared data, medical records, digital identity, and mathematical theorems, among others. The claims are signed by their owners and are broadcast to a decentralized network of validators, which validate and settle them. Validators replicate the global state of the accounts and need not communicate with each other. In sharp contrast to blockchains, strong consistency is purposely given up as a requirement. Yet many, if not most, of the blockchain benefits are preserved, while capitalizing on the massive parallelism of actors. The protocol is proved to be correct, despite its massively parallel nature.
FastSet是分散化融资和结算的分布式协议,其灵感来自行为者和连锁店; 账户持有人合作,提出债权,其中可包括付款、持有和转移资产、获取和更新共享数据、医疗记录、数字身份和数学理论等; 债权由所有者签字,并广播给一个分散化的验证人网络,由他们验证和结算; 验证人复制账户的全球状况,不需要相互沟通; 与连锁店形成鲜明对比, 故意放弃强有力的一致性,将其作为一项要求。 然而,即使不是大多数,许多连锁店的好处都得到了维护,同时利用了行为者的大规模平行主义。 协议被证明是正确的,尽管它具有巨大的平行性质。
Article 77
Title@2025-07-13 (7): Aequa: Fair Model Rewards in Collaborative Learning via Slimmable Networks
Title: Aequa: Fair Model Rewards in Collaborative Learning via Slimmable Networks | Aequa: Faire Modellprämien im kollaborativen Lernen über schlanke Netzwerke | Aequa:通过可恢复网络合作学习的公平示范奖励 2502.04850v2 |
Authors (3): Nurbek Tastan, Samuel Horvath, Karthik Nandakumar
Collaborative learning enables multiple participants to learn a single global model by exchanging focused updates instead of sharing data. One of the core challenges in collaborative learning is ensuring that participants are rewarded fairly for their contributions, which entails two key sub-problems: contribution assessment and reward allocation. This work focuses on fair reward allocation, where the participants are incentivized through model rewards, that is, differentiated final models whose performance is commensurate with the contribution. In this work, we leverage the concept of slimmable neural networks to collaboratively learn a shared global model whose performance degrades gracefully with a reduction in model width. We also propose a post-training fair allocation algorithm that determines the model width for each participant based on their contributions. We theoretically study the convergence of our proposed approach and empirically validate it using extensive experiments on different datasets and architectures. We also extend our approach to enable training-time model reward allocation.
合作学习使多个参与者能够通过交流重点更新而不是分享数据来学习单一的全球模式。合作学习的核心挑战之一是确保参与者因其贡献而获得公平奖励,这涉及两个重要的次级问题:会费评估和奖赏分配。这项工作侧重于公平奖励分配,通过示范奖励激励参与者,其业绩与贡献相称的有区别的最后模式。在这项工作中,我们利用微弱神经网络的概念,合作学习一个共同的全球模式,其性能随着模型宽度的缩小而优美地下降。我们还提议一个培训后公平分配算法,根据每个参与者的贡献确定模式宽度。我们理论上研究我们拟议方法的趋同,并用经验验证它,在不同的数据集和结构上进行广泛的实验。我们还扩展了我们的方法,以便能够进行培训时间模型奖励分配。
Article 78
Title@2025-07-13 (7): SmartphoneDemocracy: Privacy-Preserving E-Voting on Decentralized Infrastructure using Novel European Identity
Title: SmartphoneDemocracy: Privacy-Preserving E-Voting on Decentralized Infrastructure using Novel European Identity | SmartphoneDemokratie: Datenschutz-Erhaltung von E-Voting auf dezentraler Infrastruktur mit neuartiger europäischer Identität | 智能民主:利用新欧洲身份对权力下放基础设施进行保护隐私电子投票 2507.09453v1 |
Authors (2): Michał Jóźwik, Johan Pouwelse
The digitization of democratic processes promises greater accessibility but presents challenges in terms of security, privacy, and verifiability. Existing electronic voting systems often rely on centralized architectures, creating single points of failure and forcing too much trust in authorities, which contradicts democratic principles. This research addresses the challenge of creating a secure, private e-voting system with minimized trust dependencies designed for the most versatile personal device: the smartphone. We introduce SmartphoneDemocracy, a novel e-voting protocol that combines three key technologies: the emerging European Digital Identity (EUDI) Wallet for Sybil-resistant identity verification, Zero-Knowledge Proofs for privacy-preserving validation, and a peer-to-peer blockchain (TrustChain) for a resilient, serverless public bulletin board. Our protocol enables voters to register and cast ballots anonymously and verifiably directly from their smartphones. We provide a detailed protocol design, a security analysis against a defined threat model, and a performance evaluation demonstrating that the computational and network overhead is feasible for medium- to large-scale elections. By developing and prototyping this system, we demonstrate a viable path to empower citizens with a trustworthy, accessible, and user-controlled digital voting experience.
民主过程的数字化意味着更大的无障碍,但在安全、隐私和可核查方面提出了挑战。现有的电子投票系统往往依赖中央集权结构,造成单一的失败点,并迫使当局过于信任,这与民主原则相悖。这一研究解决了建立一个安全、私人电子投票系统的挑战,该系统将信任性最小化,其信任性最低,为最能干的个人装置(智能手机)设计的信任性依赖性最弱。我们引入了智能手机民主,这是一个新型的电子投票程序,它结合了三项关键技术:新兴的欧洲数字身份(EUDI)用于Sybil抗争身份核查的钱包、用于隐私保护验证的零知识证明和同行对等联锁(Trustchain),用于一个有弹性的无服务器的公共公告板。我们的协议使选民能够匿名和直接从智能手机登记和投选票。我们提供了详细的协议设计,针对一个确定的威胁模式进行安全分析,以及一个绩效评估,表明计算和网络的间接费用对于中度选举是可行的。通过开发和预设的系统,我们展示了一种可获取的、可获取的、可操作的公民能力。
Article 79
Title@2025-07-12 (6): Intelligent Orchestration of Distributed Large Foundation Model Inference at the Edge
Title: Intelligent Orchestration of Distributed Large Foundation Model Inference at the Edge | Intelligente Orchestrierung der verteilten Large Foundation Model Inferenz am Rande | 分散在边缘的大基金会模型推断 2504.03668v3 |
Authors (3): Fernando Koch, Aladin Djuhera, Alecio Binotto
Large Foundation Models (LFMs), including multi-modal and generative models, promise to unlock new capabilities for next-generation Edge AI applications. However, performing inference with LFMs in resource-constrained and heterogeneous edge environments, such as Multi-access Edge Computing (MEC), presents significant challenges for workload orchestration due to time-varying network, compute, and storage conditions. In particular, current split inference strategies, which partition LFM layers across nodes, are not designed to adapt to fluctuating workloads, dynamic bandwidth conditions, or evolving privacy constraints in high-utilization MEC environments. In this work, we propose a novel adaptive split inference orchestration framework that elevates both the placement and partitioning of LFM layers to runtime-tunable variables. Specifically, our framework enables real-time, quality-of-service (QoS)-aware management of inference workloads by extending conventional orchestrators with three key services: (1) Capacity-aware workload distribution, which continuously profiles node resources and selects an optimal subset of MEC nodes; (2) Dynamic partition migration, which transparently relocates pre-cut LFM segments in response to changes in utilization or network conditions; (3) Real-time reconfiguration, which dynamically re-splits LFM layers to balance latency, throughput, and privacy. We formalize the joint placement-partitioning problem, outline a reference architecture and algorithmic workflow, and discuss applicability in representative smart city, V2X, and industrial edge scenarios.
大型基础模型(LMM),包括多模式和基因模型,有望为下一代的边缘应用释放新的能力,但通过在资源限制和多样化的边缘环境中(如多接入边缘计算(MEC))与LFMS进行推论,由于时间变化的网络、计算和储存条件,对工作量的调控提出了重大挑战。特别是,目前将LFM层分布于节点的不同推论战略,其设计并不是为了适应在高利用的MEC环境中不断变化的工作量、动态带宽条件或隐私限制。在这项工作中,我们提出了一个新的适应性差异性差异性差异性调整调控框架,将LFM结构的布局和分区提升为可运行的时间可调变变量。具体地说,我们的框架通过扩大传统的管弦风琴管,将LFM结构分为三个关键服务:(1) 能力认知性工作量分配,不断描述资源,并在MEC节点选择一个最佳的参照区段;(2) 智能分区迁移,将LFM结构的布局前的配置和结构的重新定位,从而透明地将稳定和结构的升级地改变到真正的结构。
Article 80
Title@2025-07-12 (6): SLIM: A Heterogeneous Accelerator for Edge Inference of Sparse Large Language Model via Adaptive Thresholding
Title: SLIM: A Heterogeneous Accelerator for Edge Inference of Sparse Large Language Model via Adaptive Thresholding | SLIM: Ein heterogener Beschleuniger für Edge Inferenz von Sparse Large Language Model über Adaptive Thresholding | SLIM: 通过适应性推进控股的分散大语言模型边缘推推异异异加速器 2507.09201v1 |
Authors (5): Weihong Xu, Haein Choi, Po-kai Hsu, Shimeng Yu, Tajana Rosing
Large language models (LLMs) have demonstrated exceptional proficiency in understanding and generating human language, but efficient inference on resource-constrained embedded devices remains challenging due to large model sizes and memory-intensive operations in feedforward network (FFN) and multi-head attention (MHA) layers. While existing accelerators offload LLM inference to expensive heterogeneous computing systems, they fail to exploit the significant sparsity inherent in LLM operations, leaving hardware resources underutilized. We propose SLIM, an algorithm-hardware co-design optimized for sparse LLM serving on edge devices. SLIM exploits LLM sparsity through an adaptive thresholding algorithm that enables runtime-configurable sparsity with negligible accuracy loss, fetching only activated neurons to dramatically reduce data movement. Our heterogeneous hardware architecture strategically combines near-storage processing (NSP) and processing-in-memory (PIM): FFN weights are stored in high-density 3D NAND and computed using NSP units, while memory-intensive MHA operations are processed in PIM modules. This design significantly reduces memory footprint, data movement, and energy consumption. Our comprehensive evaluation demonstrates SLIM’s effectiveness, achieving 13-18x throughput improvements over SSD-GPU systems and 9-10x better energy efficiency over DRAM-GPU systems while maintaining low latency, making cost-effective LLM deployment viable for edge computing environments.
大型语言模型(LLMS)在理解和生成人文语言方面表现出了非凡的熟练程度,但是,由于在饲料前网络(FFN)和多头关注层(MHA)中有大量模型规模和记忆密集操作,对资源限制的嵌入装置的有效推论仍然具有挑战性。虽然现有的加速器将LLM推卸成昂贵的多式计算系统,但是它们未能利用LLM操作中固有的巨大宽度,使得硬件资源没有得到充分利用。我们提议SLIM,一种为边缘装置上服务的稀有LMLM优化的算法-硬软件共同设计。SLIM通过适应性门槛算法探索LLM的广度,使可运行的时间配置的宽度和精度损失微小,只能提取激活的神经来大幅减少数据移动。我们混合的硬件结构在战略上将近存储处理(NSP)和处理中(PIM):FFM重量储存在高密度的3D-硬度中,并使用NSP单位计算,而记忆密集的MHA操作则在PIM模块中进行处理。
Article 81
Title@2025-07-11 (5): On Evaluating Performance of LLM Inference Serving Systems
Title: On Evaluating Performance of LLM Inference Serving Systems | Zur Bewertung der Leistung von LLM-Inferenz-Serviersystemen | 评价LLM推断服务系统的性能 2507.09019v1 |
Authors (8): Amey Agrawal, Nitin Kedia, Anmol Agarwal, Jayashree Mohan, Nipun Kwatra, Souvik Kundu, Ramachandran Ramjee, Alexey Tumanov
The rapid evolution of Large Language Model (LLM) inference systems has yielded significant efficiency improvements. However, our systematic analysis reveals that current evaluation methodologies frequently exhibit fundamental flaws, often manifesting as common evaluation anti-patterns that obscure true performance characteristics and impede scientific progress. Through a comprehensive examination of recent systems, we identify recurring anti-patterns across three key dimensions: Baseline Fairness, Evaluation Setup, and Metric Design. These anti-patterns are uniquely problematic for LLM inference due to its dual-phase nature combining distinct prefill and decode operations, its handling of highly heterogeneous workloads, and its strict temporal requirements for interactive use. We demonstrate how common anti-patterns – such as inadequate baseline comparisons that conflate engineering effort with algorithmic novelty, workload selections that fail to represent production scenarios, and metric normalizations that hide substantial performance variability like generation stalls – lead to misleading conclusions. To address these challenges, we provide a comprehensive checklist derived from our analysis, establishing a framework for recognizing and avoiding these anti-patterns in favor of robust LLM inference evaluation. To demonstrate the practical application of our framework, we present a case study analyzing speculative decoding, a technique whose bursty, non-uniform token generation is easily misinterpreted when evaluated using approaches characteristic of these anti-patterns. Our work establishes a rigorous foundation for evaluation methodology, enabling meaningful comparisons, ensuring reproducible results, and ultimately accelerating genuine progress in LLM inference systems by moving beyond common anti-patterns to align evaluation with real-world requirements.
大语言模型(LLM)推导系统的迅速演变带来了显著的效率改进,然而,我们的系统分析表明,目前的评价方法经常显示出根本性的缺陷,通常表现为共同评价反模式,掩盖了真正的性能特征,阻碍了科学进步。我们通过对最近系统的全面审查,查明了三个关键方面反复出现的反模式:基线公平、评价设置和计量设计。这些反模式对于LLM推论具有独特的问题,因为其具有两阶段性,结合了不同的预充和解码操作、对高度差异性工作量的处理和对互动使用的严格时间要求。我们展示了常见的反模式 – – 例如基准比较不足,无法将工程工作与算法的新颖性、无法代表生产情景的工作量选择以及掩盖重大性业绩变化的标准化,如产生摊位导致误导的结论。为了应对这些挑战,我们从我们的分析中得出了全面的核对清单,建立了一种承认和避免这些反模式的框架,有利于对反周期的严格评估。我们展示了框架的实际应用程度,我们用一种不成熟的模型分析方法,最终用一种不精确的模型分析方法,从而确定我们模拟的复制了一种不精确地分析。
Article 82
Title@2025-07-11 (5): HotSwap: Enabling Live Dependency Sharing in Serverless Computing
Title: HotSwap: Enabling Live Dependency Sharing in Serverless Computing | HotSwap: Live-Abhängigkeitsfreigabe im serverlosen Rechnen aktivieren | HotSwap:在无服务器计算中促进生活依赖性共享 2409.09202v3 |
Authors (3): Rui Li, Devesh Tiwari, Gene Cooperman
This work presents HotSwap, a novel provider-side cold-start optimization for serverless computing. This optimization reduces cold-start time when booting and loading dependencies at runtime inside a function container. Previous research has extensively focused on reducing cold-start latency for specific functions. However, little attention has been given to skewed production workloads. In such cases, cross-function optimization becomes essential. Without cross-function optimization, a cloud provider is left with two equally poor options: (i) either the cloud provider gives up optimization for each function in the long tail (which is slow); or (ii) the cloud provider applies function-specific optimizations (e.g., cache function images) to every function in the long tail (which violates the vendor’s cache constraints). HotSwap demonstrates cross-function optimization using a novel pre-warming strategy. In this strategy, a pre-initialized live dependency image is migrated to the new function instance. At the same time, HotSwap respects the provider’s cache constraints, because a single pre-warmed dependency image in the cache can be shared among all serverless functions that require that image. HotSwap has been tested on seven representative functions from FunctionBench. In those tests, HotSwap accelerates dependency loading for those serverless functions with large dependency requirements by a factor ranging from 2.2 to 3.2. Simulation experiments using Azure traces indicate that HotSwap can save 88% of space, compared with a previous function-specific method, PreBaking, when sharing a dependency image among ten different functions.
这项工作提出了HotSwap,一种面向无服务器计算的新型提供方侧冷启动优化。该优化减少了在函数容器内启动并于运行时加载依赖项的冷启动时间。以往的研究大量集中于降低特定函数的冷启动延迟,但很少关注偏斜的生产工作负载。在这种情况下,跨函数优化变得至关重要。没有跨函数优化,云提供方只剩下两个同样糟糕的选项:(i) 放弃对长尾中每个函数的优化(速度慢);或 (ii) 对长尾中的每个函数都应用函数特定的优化(例如缓存函数镜像),这会违反提供方的缓存限制。HotSwap通过一种新的预热策略展示了跨函数优化:将预先初始化的活动依赖镜像迁移到新的函数实例。同时,HotSwap遵守提供方的缓存限制,因为缓存中的单个预热依赖镜像可以在所有需要该镜像的无服务器函数之间共享。HotSwap已在FunctionBench的七个代表性函数上进行了测试。在这些测试中,对于依赖需求较大的无服务器函数,HotSwap将依赖加载速度提高了2.2至3.2倍。使用Azure轨迹的模拟实验表明,在十个不同函数之间共享一个依赖镜像时,与先前的函数特定方法PreBaking相比,HotSwap可节省88%的空间。
Article 83
Title@2025-07-11 (5): MQFQ-Sticky: Fair Queueing For Serverless GPU Functions
Title: MQFQ-Sticky: Fair Queueing For Serverless GPU Functions | MQFQ-Sticky: Faire Warteschlange für serverlose GPU-Funktionen | MQFQ-Sticky: 为无服务器的 GPU 函数公平排队 2507.08954v1 |
Authors (5): Alexander Fuerst, Siddharth Anil, Vishakha Dixit, Purushottam Kulkarni, Prateek Sharma
Hardware accelerators like GPUs are now ubiquitous in data centers, but are not fully supported by common cloud abstractions such as Functions as a Service (FaaS). Many popular and emerging FaaS applications such as machine learning and scientific computing can benefit from GPU acceleration. However, FaaS frameworks (such as OpenWhisk) are not capable of providing this acceleration because of the impedance mismatch between GPUs and the FaaS programming model, which requires virtualization and sandboxing of each function. The challenges are amplified due to the highly dynamic and heterogeneous FaaS workloads. This paper presents the design and implementation of a FaaS system for providing GPU acceleration in a black-box manner (without modifying function code). Running small functions in containerized sandboxes is challenging due to limited GPU concurrency and high cold-start overheads, resulting in heavy queueing of function invocations. We show how principles from I/O scheduling, such as fair queuing and anticipatory scheduling, can be translated to function scheduling on GPUs. We develop MQFQ-Sticky, an integrated fair queueing and GPU memory management approach, which balances the tradeoffs between locality, fairness, and latency. Empirical evaluation on a range of workloads shows that it reduces function latency by 2x to 20x compared to existing GPU and CPU queueing policies.
GPU等硬件加速器如今在数据中心中无处不在,但尚未得到函数即服务(FaaS)等常见云抽象的充分支持。许多流行和新兴的FaaS应用(如机器学习和科学计算)都可以受益于GPU加速。然而,FaaS框架(如OpenWhisk)无法提供这种加速,因为GPU与FaaS编程模型之间存在阻抗失配,后者要求对每个函数进行虚拟化和沙箱隔离。FaaS工作负载高度动态且异构,进一步加剧了这些挑战。本文介绍了一个以黑盒方式(不修改函数代码)提供GPU加速的FaaS系统的设计与实现。由于GPU并发度有限且冷启动开销高,在容器化沙箱中运行小型函数十分困难,导致函数调用大量排队。我们展示了如何将I/O调度中的原则(如公平排队和预期调度)转用于GPU上的函数调度。我们开发了MQFQ-Sticky,一种集成的公平排队与GPU内存管理方法,在局部性、公平性和延迟之间取得平衡。对一系列工作负载的实证评估表明,与现有的GPU和CPU排队策略相比,它将函数延迟降低了2倍至20倍。
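As background for the fair-queueing principle the abstract borrows from I/O scheduling, here is a minimal sketch of classic start-time fair queueing over per-function flows. It is not MQFQ-Sticky itself: the locality ("sticky") and GPU memory management parts are omitted, and the class and parameter names are hypothetical.

```python
import heapq
from collections import defaultdict


class FairQueue:
    """Start-time fair queueing: dispatch the invocation with the smallest start tag."""

    def __init__(self):
        self.virtual_time = 0.0
        self.last_finish = defaultdict(float)  # per-flow finish tag of its latest item
        self.heap = []
        self.counter = 0  # tie-breaker so equal start tags keep arrival order

    def enqueue(self, flow, cost, weight=1.0):
        # A flow's new item starts where its previous one finished, but never
        # earlier than the current virtual time (idle flows earn no credit).
        start = max(self.virtual_time, self.last_finish[flow])
        self.last_finish[flow] = start + cost / weight
        heapq.heappush(self.heap, (start, self.counter, flow, cost))
        self.counter += 1

    def dequeue(self):
        start, _, flow, cost = heapq.heappop(self.heap)
        self.virtual_time = start  # virtual time follows the item in service
        return flow, cost
```

With this rule, a function that floods the queue only delays its own later invocations: a single invocation from another function is served ahead of the backlog.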
Article 84
Title@2025-07-11 (5): Carbon-Aware Workflow Scheduling with Fixed Mapping and Deadline Constraint
Title: Carbon-Aware Workflow Scheduling with Fixed Mapping and Deadline Constraint | Carbon-Aware-Workflow-Planung mit Fixed Mapping und Deadline Constraint | 固定绘图和最后期限限制的碳软件工作流程调度 2507.08725v1 |
Authors (4): Dominik Schweisgut, Anne Benoit, Yves Robert, Henning Meyerhenke
Large data and computing centers consume a significant share of the world’s energy consumption. A prominent subset of the workloads in such centers are workflows with interdependent tasks, usually represented as directed acyclic graphs (DAGs). To reduce the carbon emissions resulting from executing such workflows in centers with a mixed (renewable and non-renewable) energy supply, it is advisable to move task executions to time intervals with sufficient green energy when possible. To this end, we formalize the above problem as a scheduling problem with a given mapping and ordering of the tasks. We show that this problem can be solved in polynomial time in the uniprocessor case. For at least two processors, however, the problem becomes NP-hard. Hence, we propose a heuristic framework called CaWoSched that combines several greedy approaches with local search. To assess the 16 heuristics resulting from different combinations, we also devise a simple baseline algorithm and an exact ILP-based solution. Our experimental results show that our heuristics provide significant savings in carbon emissions compared to the baseline.
大型数据和计算中心消耗了世界能源消耗的很大一部分。这类中心的工作量中,一个突出的子集是具有相互依存任务的工作流程,通常以定向单极图(DAGs)为代表。为了减少在混合(可再生和不可再生)能源供应中心执行这种工作流程所产生的碳排放,可取的做法是将任务执行时间间隔移到与可能情况下足够的绿色能源相结合的状态。为此,我们将上述问题正式确定为在给定任务绘图和排序方面的时间安排问题。我们表明,这一问题可以在单处理器案例中的多元时间解决。但是,至少对于两个处理器来说,问题就变成了硬化的。因此,我们建议了一个称为CaWosched的超自然框架,将几种贪婪的方法与当地搜索结合起来。为了评估不同组合产生的16种超自然现象,我们还设计了一个简单的基线算法和精确的 ILP 解决方案。我们的实验结果表明,我们的超自然能在碳排放方面比基线能节省大量碳排放。
Article 85
Title@2025-07-11 (5): Reciprocating Locks
Title: Reciprocating Locks | Umschaltschlösser | 回收锁 2501.02380v9 |
Authors (2): Dave Dice, Alex Kogan
We present “Reciprocating Locks”, a novel mutual exclusion locking algorithm, targeting cache-coherent shared memory (CC), that enjoys a number of desirable properties. The doorway arrival phase and the release operation both run in constant time. Waiting threads use local spinning and only a single waiting element is required per thread, regardless of the number of locks a thread might hold at a given time. While our lock does not provide strict FIFO admission, it bounds bypass and has strong anti-starvation properties. The lock is compact, space efficient, and has been intentionally designed to be readily usable in real-world general purpose computing environments such as the Linux kernel, pthreads, or C++. We show that the lock exhibits high throughput under contention and low latency in the uncontended case. The performance of Reciprocating Locks is competitive with and often better than the best state-of-the-art scalable spin locks.
我们展示了“反转锁”这一新颖的相互排斥锁定算法,针对缓存一致的共享记忆(CC),该算法具有一些可取的特性。 门到门阶段和释放操作都是在固定时间运行的。 等待线使用本地旋转, 只需要每条线有一个单独的等待元素, 不论线索在特定时间可能持有的锁数多少。 虽然我们的锁不提供严格的FIFFO 进入, 但它会绕开, 并具有很强的反星系特性。 锁定是紧凑的, 空间效率高, 并且被有意设计成可以随时用于现实世界通用的计算环境, 如 Linux 内核、 prthread 或 C++ 。 我们展示了在未连接的案例中的锁定显示高度过量和低耐久性。 重新定位锁的性能与最佳的状态可缩放的旋转锁有竞争力, 并且通常比最好的效果更好。
Article 86
Title@2025-07-11 (5): Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference
Title: Mind the Memory Gap: Unveiling GPU Bottlenecks in Large-Batch LLM Inference | Mind the Memory Gap: Enthüllen von GPU-Flaschenhalsen in großflächiger LLM-Inferenz | 牢记记忆差距:大型批量LLM 推理中的 GPU 堆积点 2503.08311v2 |
Authors (8): Pol G. Recasens, Ferran Agullo, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu, Jordi Torres, Josep Ll. Berral
Large language models have been widely adopted across different tasks, but their auto-regressive generation nature often leads to inefficient resource utilization during inference. While batching is commonly used to increase throughput, performance gains plateau beyond a certain batch size, especially with smaller models, a phenomenon that existing literature typically explains as a shift to the compute-bound regime. In this paper, through an in-depth GPU-level analysis, we reveal that large-batch inference remains memory-bound, with most GPU compute capabilities underutilized due to DRAM bandwidth saturation as the primary bottleneck. To address this, we propose a Batching Configuration Advisor (BCA) that optimizes memory allocation, reducing GPU memory requirements with minimal impact on throughput. The freed memory and underutilized GPU compute capabilities can then be leveraged by concurrent workloads. Specifically, we use model replication to improve serving throughput and GPU utilization. Our findings challenge conventional assumptions about LLM inference, offering new insights and practical strategies for improving resource utilization, particularly for smaller language models. The code is publicly available at https://github.com/FerranAgulloLopez/vLLMBatchingMemoryGap.
大型语言模型在不同任务中被广泛采用,但其自动递减生成的性质往往导致在推断过程中资源利用效率低下。虽然批量通常用于增加吞吐量,但绩效增益超过一定批量规模,特别是较小的模型,现有文献通常将这种现象解释为向计算约束制度转变。在本文中,通过深入的GPU级别分析,我们发现,大批量推论仍然有记忆限制,大多数GPU计算能力由于DRAM带宽饱和度作为主要瓶颈而未得到充分利用。为此,我们提议设立一个Batching配置顾问(BCA),以优化记忆分配,减少GPU的内存要求,对吞吐量影响最小。随后,通过同时工作量可以利用自由的记忆和未充分利用GPU的计算能力。具体地说,我们利用模型复制来改进吞吐量和GPU的利用。我们关于LM推论的常规假设对LM推算提出了挑战,为改进资源利用提供了新的洞察力和实用战略,特别是对于较小的语言模型。该代码可在https://gioryroybub.com/Ferranamav/LOS/LOS/LOCPUALVLAllo。
Article 87
Title@2025-07-11 (5): Naeural AI OS – Decentralized ubiquitous computing MLOps execution engine
Title: Naeural AI OS – Decentralized ubiquitous computing MLOps execution engine | Naeural AI OS – Dezentrale allgegenwärtige Computer MLOps Ausführungs-Engine | Naeural AI OS – 去中心化的无处不在计算 MLOps 执行引擎 2306.08708v6 |
Authors (4): Cristian Bleotiu, Stefan Saraev, Bogdan Hobeanu, Andrei Ionut Damian
Over the past few years, ubiquitous, or pervasive computing has gained popularity as the primary approach for a wide range of applications, including enterprise-grade systems, consumer applications, and gaming systems. Ubiquitous computing refers to the integration of computing technologies into everyday objects and environments, creating a network of interconnected devices that can communicate with each other and with humans. By using ubiquitous computing technologies, communities can become more connected and efficient, with members able to communicate and collaborate more easily. This enabled interconnectedness and collaboration can lead to a more successful and sustainable community. The spread of ubiquitous computing, however, has emphasized the importance of automated learning and smart applications in general. Even though there have been significant strides in Artificial Intelligence and Deep Learning, large scale adoption has been hesitant due to mounting pressure on expensive and highly complex cloud numerical-compute infrastructures. Adopting, and even developing, practical machine learning systems can come with prohibitive costs, not only in terms of complex infrastructures but also of solid expertise in Data Science and Machine Learning. In this paper we present an innovative approach for low-code development and deployment of end-to-end AI cooperative application pipelines. We address infrastructure allocation, costs, and secure job distribution in a fully decentralized global cooperative community based on tokenized economics.
过去几年来,无处不在的或普遍的计算作为包括企业级系统、消费者应用程序和赌博系统在内的各种应用的主要方法,已日益受到欢迎。 通俗计算是指将计算机技术纳入日常物品和环境,建立一个能够相互沟通和与人沟通的相互关联的装置网络。通过使用无处不在的计算技术,社区可以更加相互联系和高效,成员可以更容易地进行沟通与合作。这可以使相互联系和协作导致一个更成功和可持续的社区。然而,无处不在的计算的扩散强调了自动化学习和一般智能应用的重要性。尽管在人工智能和深层学习方面已取得重大进展,但大规模采用却由于对昂贵和高度复杂的云层数字组合基础设施的压力越来越大而犹豫不决。采用、甚至开发实用的机器学习系统,不仅在复杂的基础设施方面,而且在数据科学和机器学习方面的扎实的专门知识方面,都会产生令人望而望而步的代价。在本文件中,我们提出了一种创新的方法,用于低码开发和部署基于最终到最终分配的全球合作性基础设施,在完全以安全方式分配的AI合作管道应用方面,我们提出了一种创新的方法。
Article 88
Title@2025-07-11 (5): CCSS: Hardware-Accelerated RTL Simulation with Fast Combinational Logic Computing and Sequential Logic Synchronization
Title: CCSS: Hardware-Accelerated RTL Simulation with Fast Combinational Logic Computing and Sequential Logic Synchronization | CCSS: Hardware-beschleunigte RTL-Simulation mit schnellem kombiniertem Logic Computing und sequentieller Logic Synchronisation | CCSS: 与快速组合逻辑计算和序列逻辑同步同步模拟的硬件加速式RTL模拟 2507.08406v1 |
Authors (7): Weigang Feng, Yijia Zhang, Zekun Wang, Zhengyang Wang, Yi Wang, Peijun Ma, Ningyi Xu
As transistor counts in a single chip exceed tens of billions, the complexity of RTL-level simulation and verification has grown exponentially, often extending simulation campaigns to several months. In industry practice, RTL simulation is divided into two phases: functional debug and system validation. While system validation demands high simulation speed and is typically accelerated using FPGAs, functional debug relies on rapid compilation, rendering multi-core CPUs the primary choice. However, the limited simulation speed of CPUs has become a major bottleneck. To address this challenge, we propose CCSS, a scalable multi-core RTL simulation platform that achieves both fast compilation and high simulation throughput. CCSS accelerates combinational logic computation and sequential logic synchronization through specialized architecture and compilation strategies. It employs a balanced DAG partitioning method and efficient boolean computation cores for combinational logic, and adopts a low-latency network-on-chip (NoC) design to synchronize sequential states across cores efficiently. Experimental results show that CCSS delivers up to 12.9x speedup over state-of-the-art multi-core simulators.
由于晶体体体在一个晶片中计数超过数百亿,RTL级模拟和核查的复杂性已经成倍增长,常常将模拟活动扩大到几个月。在行业实践中,RTL模拟分为两个阶段:功能调试和系统验证。系统验证要求高模拟速度,通常使用FPGAs加速进行。功能调试依赖于快速编译的多核心CPU。然而,CPU的有限模拟速度已成为一个主要的瓶颈。为了应对这一挑战,我们建议CCS,这是一个可缩放的多核心RTL模拟平台,既能快速编集,又能高模拟吞吐量。CCS通过专门的架构和编译战略加速组合逻辑计算和顺序逻辑同步。它使用平衡的DAG分配方法和高效的布尔计算核心组合逻辑,并采用低纬度网络对柱式设计,以高效率地使各个核心的连续状态同步。实验结果表明,CCS交付到12.9x速度超过州级多核心模拟器。
Article 89
Title@2025-07-11 (5): Towards AI-Native RAN: An Operator’s Perspective of 6G Day 1 Standardization
Title: Towards AI-Native RAN: An Operator’s Perspective of 6G Day 1 Standardization | Auf dem Weg zu KI-Native RAN: Die Perspektive des Betreibers von 6G Tag 1 Standardisierung | 面向AI-Native RAN:运营商对6G日1标准化的看法 2507.08403v1 |
Authors (9): Nan Li, Qi Sun, Lehan Wang, Xiaofei Xu, Jinri Huang, Chunhui Liu, Jing Gao, Yuhong Huang, Chih-Lin I
Artificial Intelligence/Machine Learning (AI/ML) has become the most certain and prominent feature of 6G mobile networks. Unlike 5G, where AI/ML was not natively integrated but rather an add-on feature over existing architecture, 6G shall incorporate AI from the outset to address its complexity and support ubiquitous AI applications. Based on our extensive mobile network operation and standardization experience from 2G to 5G, this paper explores the design and standardization principles of AI-Native radio access networks (RAN) for 6G, with a particular focus on its critical Day 1 architecture, functionalities and capabilities. We investigate the framework of AI-Native RAN and present its three essential capabilities to shed some light on the standardization direction; namely, AI-driven RAN processing/optimization/automation, reliable AI lifecycle management (LCM), and AI-as-a-Service (AIaaS) provisioning. The standardization of AI-Native RAN, in particular the Day 1 features, including an AI-Native 6G RAN architecture, is proposed. For validation, a large-scale field trial with over 5000 5G-A base stations has been built, and the proposed architecture and the supporting AI functions delivered significant improvements in average air interface latency, root cause identification, and network energy consumption. This paper aims to provide a Day 1 framework for 6G AI-Native RAN standardization design, balancing technical innovation with practical deployment.
人工智能/机器学习(AI/ML)已成为6G移动网络最确定、最突出的特征。在5G中,AI/ML并非原生集成,而是现有架构之上的附加功能;与之不同,6G应从一开始就纳入AI,以应对其复杂性并支持无处不在的AI应用。基于我们从2G到5G的丰富移动网络运营和标准化经验,本文探讨了面向6G的AI原生无线接入网(RAN)的设计和标准化原则,尤其关注其关键的Day 1架构、功能和能力。我们研究了AI原生RAN的框架,并提出其三项基本能力,以阐明标准化方向,即:AI驱动的RAN处理/优化/自动化、可靠的AI生命周期管理(LCM),以及AI即服务(AIaaS)供给。本文提出了AI原生RAN的标准化方案,特别是Day 1特性,包括AI原生6G RAN架构。为进行验证,我们建设了包含5000多个5G-A基站的大规模外场试验,所提出的架构及其配套AI功能在平均空口时延、根因识别和网络能耗方面带来了显著改善。本文旨在为6G AI原生RAN的标准化设计提供一个Day 1框架,在技术创新与实际部署之间取得平衡。
Article 90
Title@2025-07-11 (5): Efficient Long Context Fine-tuning with Chunk Flow
Title: Efficient Long Context Fine-tuning with Chunk Flow | Effizientes Long Context Feinabstimmung mit Chunk Flow | 与整流相配合的微调 2503.02356v3 |
Authors (13): Xiulong Yuan, Hongtao Xu, Wenting Shen, Ang Wang, Xiafei Qiu, Jie Zhang, Yuqiong Liu, Bowen Yu, Junyang Lin, Mingzhen Li, Weile Jia, Yong Li, Wei Lin
Long context fine-tuning of large language models (LLMs) involves training on datasets that are predominantly composed of short sequences and a small proportion of longer sequences. However, existing approaches overlook this long-tail distribution and employ training strategies designed specifically for long sequences. Moreover, these approaches also fail to address the challenges posed by variable sequence lengths during distributed training, such as load imbalance in data parallelism and severe pipeline bubbles in pipeline parallelism. These issues lead to suboptimal training performance and poor GPU resource utilization. To tackle these problems, we propose a chunk-centric training method named ChunkFlow. ChunkFlow reorganizes input sequences into uniformly sized chunks by consolidating short sequences and splitting longer ones. This approach achieves optimal computational efficiency and balance among training inputs. Additionally, ChunkFlow incorporates a state-aware chunk scheduling mechanism to ensure that the peak memory usage during training is primarily determined by the chunk size rather than the maximum sequence length in the dataset. Integrating this scheduling mechanism with existing pipeline scheduling algorithms further enhances the performance of distributed training. Experimental results demonstrate that, compared with Megatron-LM, ChunkFlow can be up to 4.53x faster in the long context fine-tuning of LLMs. Furthermore, we believe that ChunkFlow serves as an effective solution for a broader range of scenarios, such as long context continual pre-training, where datasets contain variable-length sequences.
大型语言模型(LLM)的长上下文微调所用的训练数据集主要由短序列组成,只有一小部分较长序列。然而,现有方法忽略了这种长尾分布,采用了专门为长序列设计的训练策略。此外,这些方法也未能解决分布式训练中可变序列长度带来的挑战,例如数据并行中的负载不均衡和流水线并行中严重的流水线气泡。这些问题导致训练性能欠佳、GPU资源利用率低。为解决这些问题,我们提出一种以块为中心的训练方法ChunkFlow。ChunkFlow通过合并短序列、拆分长序列,将输入序列重组为大小统一的块。这种方法在训练输入之间实现了最佳的计算效率和均衡。此外,ChunkFlow引入了状态感知的块调度机制,确保训练期间的峰值内存占用主要取决于块大小,而不是数据集中的最大序列长度。将该调度机制与现有的流水线调度算法相结合,可进一步提升分布式训练的性能。实验结果表明,与Megatron-LM相比,ChunkFlow在LLM长上下文微调中最高可快4.53倍。此外,我们认为ChunkFlow也适用于更广泛的场景,例如数据集包含可变长度序列的长上下文持续预训练。
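The chunk construction step (consolidating short sequences, splitting longer ones) can be sketched as follows. This is an illustrative approximation only: the first-fit-decreasing packing is an assumed stand-in for whatever heuristic ChunkFlow actually uses, the state-aware scheduling is not modelled, and `build_chunks` is a hypothetical name.

```python
from typing import List, Tuple

# A piece is (sequence_id, start, end): a contiguous token range of one sequence.
Piece = Tuple[int, int, int]


def build_chunks(seq_lens: List[int], chunk_size: int) -> List[List[Piece]]:
    """Split long sequences and pack short ones into chunks of at most chunk_size tokens."""
    chunks: List[List[Piece]] = []
    short: List[Tuple[int, int]] = []  # (length, sequence_id) of sequences that fit in one chunk
    for sid, n in enumerate(seq_lens):
        if n > chunk_size:
            # Long sequence: cut into consecutive pieces, one piece per chunk.
            for start in range(0, n, chunk_size):
                chunks.append([(sid, start, min(start + chunk_size, n))])
        else:
            short.append((n, sid))

    # Short sequences: first-fit decreasing bin packing (assumed heuristic).
    packed: List[list] = []  # each entry is [tokens_used, pieces]
    for n, sid in sorted(short, reverse=True):
        for entry in packed:
            if entry[0] + n <= chunk_size:
                entry[0] += n
                entry[1].append((sid, 0, n))
                break
        else:
            packed.append([n, [(sid, 0, n)]])
    chunks.extend(pieces for _, pieces in packed)
    return chunks
```

Because no chunk exceeds `chunk_size` tokens, per-step activation memory in this sketch is bounded by the chunk size rather than by the longest sequence, which is the property the abstract attributes to chunk-centric training.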
Article 91
Title@2025-07-11 (5): Fast and Interactive Byzantine Fault-tolerant Web Services via Session-Based Consensus Decoupling
Title: Fast and Interactive Byzantine Fault-tolerant Web Services via Session-Based Consensus Decoupling | Schnelle und interaktive Byzantinische Fehler-tolerante Web Services über Session-Based Consensus Entkopplung | 通过会议共识脱钩提供快速和互动拜占庭防违约网络服务 2507.08281v1 |
Authors (3): Ahmad Zaki Akmal, Azkario Rizky Pratama, Guntur Dharma Putra
Byzantine fault-tolerant (BFT) web services provide critical integrity guarantees for distributed applications but face significant latency challenges that hinder interactive user experiences. We propose a novel two-layer architecture that addresses this fundamental tension between security and responsiveness in BFT systems. Our approach introduces a session-aware transaction buffer layer (Layer 2) that delivers immediate feedback to users through consensus simulation, while periodically committing batched operations to a fully Byzantine fault-tolerant consensus layer (Layer 1). By separating interactive operations from consensus finalization, our system achieves responsive user experiences of under 200ms, while maintaining strong BFT security guarantees. We demonstrate the efficacy of our architecture through a supply chain management implementation, where operators require both immediate feedback during multi-step workflows and tamper-proof record keeping. Our evaluation shows that our Layer 2 operations perform four times faster than the Layer 1 counterpart, while substantially preserving the end-to-end transaction integrity. Our approach enables BFT applications in domains previously considered impractical due to latency constraints, such as metaverse environments, where users require both responsive interaction and guaranteed state consistency.
Byzantine 错误容忍(BFT)网络服务为分布式应用程序提供了关键的完整保障,但面临阻碍互动用户经验的重大延迟性挑战。我们提出了一个新的两层结构,以解决BFT系统中安全和反应能力之间的这种根本紧张。我们的方法是引入一个会话交易缓冲层(Layer 2),通过协商一致模拟向用户提供即时反馈,同时定期进行分批作业,建立全Byzantine 错误容忍共识层(Layer 1)。通过将互动操作与协商一致的定稿分开,我们的系统实现了200米以下用户的响应性经验,同时保持了BFT强有力的安全保障。我们通过供应链管理实施,展示了我们的架构的功效,在多步工作流程和防篡改记录保存过程中,操作者需要即时立即反馈。我们的评估显示,我们的2层操作比图层1的对应速度快四倍,同时大大维护端对端交易的完整性。我们的方法使得BFT在用户既需要响应性互动又保证国家一致性的领域,例如元反环境,以前认为不现实性的限制使得BFT应用程序的应用变得不切实际。
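The split between an interactive buffer layer and a batched consensus layer can be sketched roughly as below. This is a toy model under stated assumptions: `commit_batch` is a hypothetical callback standing in for the Layer 1 BFT consensus, operations are simple integer deltas, and flushing is triggered by batch size rather than by the periodic timer the abstract describes.

```python
from typing import Callable, Dict, List


class SessionBuffer:
    """Layer 2 sketch: apply operations locally for immediate feedback, commit in batches."""

    def __init__(self, commit_batch: Callable[[List[dict]], None], batch_size: int = 3):
        self.commit_batch = commit_batch  # hypothetical hook into the consensus layer
        self.batch_size = batch_size
        self.pending: List[dict] = []
        self.state: Dict[str, int] = {}  # provisional (not yet finalized) state

    def submit(self, session: str, key: str, delta: int) -> int:
        # Simulate the outcome right away so the user gets an immediate response.
        self.state[key] = self.state.get(key, 0) + delta
        self.pending.append({"session": session, "key": key, "delta": delta})
        if len(self.pending) >= self.batch_size:
            self.flush()
        return self.state[key]

    def flush(self) -> None:
        # Hand the buffered operations to Layer 1 as a single batch.
        if self.pending:
            batch, self.pending = self.pending, []
            self.commit_batch(batch)
```

The value returned by `submit` is provisional; in a real system it would only become final once the consensus layer has accepted the batch.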
Article 92
Title@2025-07-10 (4): Supporting Intel(r) SGX on Multi-Package Platforms
Title: Supporting Intel(r) SGX on Multi-Package Platforms | Unterstützung von Intel(r) SGX auf Multi-Package-Plattformen | 支持多包平台的 Intel(r) SGX 2507.08190v1 |
Authors (4): Simon Johnson, Raghunandan Makaram, Amy Santoni, Vinnie Scarlata
Intel(r) Software Guard Extensions (SGX) was originally released on client platforms and later extended to single socket server platforms. As developers have become familiar with the capabilities of the technology, the applicability of this capability in the cloud has been tested. Various Cloud Service Providers (CSPs) are demonstrating the value of using SGX based Trusted Execution Environments (TEE) to create a new paradigm of Confidential Cloud Computing. This paper describes the additional platform enhancements we believe are necessary to deliver a user programmable Trusted Execution Environment that scales to cloud usages, performs and is secure on multi-package platforms.
Intel(r) 软件保护扩展(SGX) 最初在客户平台上发布,后来扩大到单个套接服务器平台。随着开发者已经熟悉该技术的能力,这种能力在云层中的可应用性已经测试。 不同的云服务供应商(CSPs)正在展示使用基于 SGX 的可信任执行环境(TEE) 来创建机密云计算新模式的价值。 本文描述了我们认为为提供用户可编程可信赖的执行环境所必要的额外平台增强, 该环境可以与云的使用、 运行和多包平台的安全相适应。
Article 93
Title@2025-07-10 (4): KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling
Title: KIS-S: A GPU-Aware Kubernetes Inference Simulator with RL-Based Auto-Scaling | KIS-S: Ein GPU-Aware Kubernetes Inferenzsimulator mit RL-basierter Auto-Skalierung | KIS- S: 带有基于 RL 自动缩放的 GPU- Aware Kubernetes 推断模拟器 2507.07932v1 |
Authors (5): Guilin Zhang, Wulan Guo, Ziqi Tan, Qiang Guan, Hailong Jiang
Autoscaling GPU inference workloads in Kubernetes remains challenging due to the reactive and threshold-based nature of default mechanisms such as the Horizontal Pod Autoscaler (HPA), which struggle under dynamic and bursty traffic patterns and lack integration with GPU-level metrics. We present KIS-S, a unified framework that combines KISim, a GPU-aware Kubernetes Inference Simulator, with KIScaler, a Proximal Policy Optimization (PPO)-based autoscaler. KIScaler learns latency-aware and resource-efficient scaling policies entirely in simulation, and is directly deployed without retraining. Experiments across four traffic patterns show that KIScaler improves average reward by 75.2%, reduces P95 latency up to 6.7x over CPU baselines, and generalizes without retraining. Our work bridges the gap between reactive autoscaling and intelligent orchestration for scalable GPU-accelerated environments.
Kubernetes 的自动计算 GPU 参数工作量仍然具有挑战性,因为默认机制,如水平 Pod Autassaler (HPA) 的被动和门槛性质,在动态和爆裂性交通模式下挣扎,没有与 GPU 级别指标整合。 我们展示了 KISS- S , 这是一个将 KISim 、 GPU-aware Kubernetes 参数模拟器与 KIScaler 、 Proximal 政策优化(PPPPO) 基于自动标尺的自动标尺结合起来的统一框架。 KIScaler 完全在模拟中学习 Latency- 觉悟性和资源效率提升政策, 并且不经再培训直接部署。 四个交通模式的实验显示, KIScaler 将平均报酬提高75.2%, 将P95 的宽度降低到6.7x CPU 基线, 并在没有再培训的情况下普遍化。 我们的工作缩小了可缩缩放环境的反动自动缩放和智能调制之间的差距。
Article 94
Title@2025-07-10 (4): Parallel CPU-GPU Execution for LLM Inference on Constrained GPUs
Title: Parallel CPU-GPU Execution for LLM Inference on Constrained GPUs | Parallele CPU-GPU-Execution für LLM-Inferenz auf eingeschränkten GPUs | LLM 受限 GPU 推理的并行 CPU-GPU 执行 2506.03296v3 |
Authors (4): Jiakun Fan, Yanglin Zhang, Xiangchen Li, Dimitrios S. Nikolopoulos
Deploying large language models (LLMs) for online inference is often constrained by limited GPU memory, particularly due to the growing KV cache during auto-regressive decoding. Hybrid GPU-CPU execution has emerged as a promising solution by offloading KV cache management and parts of attention computation to the CPU. However, a key bottleneck remains: existing schedulers fail to effectively overlap CPU-offloaded tasks with GPU execution during the latency-critical, bandwidth-bound decode phase. This particularly penalizes real-time, decode-heavy applications (e.g., chat, Chain-of-Thought reasoning) which are currently underserved by existing systems, especially under memory pressure typical of edge or low-cost deployments. We present APEX, a novel, profiling-informed scheduling strategy that maximizes CPU-GPU parallelism during hybrid LLM inference. Unlike systems relying on static rules or purely heuristic approaches, APEX dynamically dispatches compute across heterogeneous resources by predicting execution times of CPU and GPU subtasks to maximize overlap while avoiding scheduling overheads. We evaluate APEX on diverse workloads and GPU architectures (NVIDIA T4, A10), using LLaMa-2-7B and LLaMa-3.1-8B models. Compared to GPU-only schedulers like vLLM, APEX improves throughput by 84%-96% on T4 and 11%-89% on A10 GPUs, while preserving latency. Against the best existing hybrid schedulers, it delivers up to 49% (T4) and 37% (A10) higher throughput in long-output settings. APEX significantly advances hybrid LLM inference efficiency on such memory-constrained hardware and provides a blueprint for scheduling in heterogeneous AI systems, filling a critical gap for efficient real-time LLM applications.
在线推理中部署大型语言模型(LLM)常常受限于有限的GPU内存,尤其是自回归解码过程中KV缓存不断增长。GPU-CPU混合执行将KV缓存管理和部分注意力计算卸载到CPU,已成为一种有前景的解决方案。然而,一个关键瓶颈依然存在:在延迟关键、受带宽限制的解码阶段,现有调度器无法有效地让CPU卸载任务与GPU执行重叠。这尤其不利于实时、解码密集型应用(如聊天、思维链推理),现有系统对这类应用支持不足,在边缘或低成本部署常见的内存压力下更是如此。我们提出APEX,一种新颖的、基于性能剖析的调度策略,可在混合LLM推理中最大化CPU-GPU并行度。与依赖静态规则或纯启发式方法的系统不同,APEX通过预测CPU和GPU子任务的执行时间,在异构资源间动态分派计算,在最大化重叠的同时避免调度开销。我们使用LLaMa-2-7B和LLaMa-3.1-8B模型,在多种工作负载和GPU架构(NVIDIA T4、A10)上评估了APEX。与vLLM等仅使用GPU的调度器相比,APEX在T4上将吞吐量提高84%-96%,在A10 GPU上提高11%-89%,同时保持延迟不变。与现有最好的混合调度器相比,它在长输出场景下的吞吐量最高提升49%(T4)和37%(A10)。APEX显著提升了此类内存受限硬件上的混合LLM推理效率,为异构AI系统中的调度提供了蓝图,填补了高效实时LLM应用的关键空白。
Article 95
Title@2025-07-10 (4): DiP: A Scalable, Energy-Efficient Systolic Array for Matrix Multiplication Acceleration
Title: DiP: A Scalable, Energy-Efficient Systolic Array for Matrix Multiplication Acceleration | DiP: Ein skalierbarer, energieeffizienter Systolischer Array für Matrix-Multiplikationsbeschleunigung | DiP:一个可缩放的、节能的、用于加速矩阵乘法加速的节能收缩阵列阵列 2412.09709v2 |
Authors (3): Ahmed J. Abdelmaksoud, Shady Agwa, Themis Prodromakis
Transformers are gaining increasing attention across different application domains due to their outstanding accuracy. However, these data-intensive models add significant performance demands to the existing computing architectures. Systolic arrays are spatial architectures that have been adopted by commercial AI computing platforms (like Google TPUs), due to their energy-efficient approach of data-reusability. However, these spatial architectures face a penalty in throughput and energy efficiency due to the need for input and output synchronization using First-In-First-Out (FIFO) buffers. This paper proposes a novel scalable systolic-array architecture featuring Diagonal-Input and Permutated weight-stationary (DiP) dataflow for the acceleration of matrix multiplication. The proposed architecture eliminates the synchronization FIFOs required by state-of-the-art weight stationary systolic arrays. Aside from the area, power, and energy savings achieved by eliminating these FIFOs, DiP architecture maximizes the computational resources (PEs) utilization. Thus, it outperforms the weight-stationary counterparts in terms of throughput by up to 50%. A comprehensive hardware design space exploration is demonstrated using commercial 22nm technology, highlighting the scalability advantages of DiP over the conventional approach across various dimensions where DiP offers improvement of energy efficiency per area up to 2.02x. Furthermore, DiP is evaluated using various transformer workloads from widely-used models, consistently outperforming TPU-like architectures, achieving energy improvements of up to 1.81x and latency improvements of up to 1.49x across a range of transformer workloads. At a 64x64 size with 4096 PEs, DiP achieves a peak performance of 8.2 TOPS with energy efficiency 9.55 TOPS/W.
这些数据密集型模型增加了现有计算结构的显著性能要求。 系统阵列是商业AI计算平台(如Google TPUs)采用的空间结构,因为其数据的可恢复性具有节能性。 然而,这些空间结构由于需要使用FIFO(FIFO)缓冲进行投入和产出同步,在吞吐和能源效率方面面临着一个障碍。本文建议了一个新的可缩放的系统阵列结构,其特点是对角-内流和变换的加权-静态(DiP)数据流,以加速矩阵倍增。拟议的结构消除了最新重量固定式数据阵列所需的同步FIFO。除了消除这些FIFO所节省的面积、功耗和能量之外,DiP结构还将计算资源(PE)的利用率最大化。因此,其吞吐量比权重固定式阵列最多高出50%。
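The peak-performance figure in the abstract above can be sanity-checked with simple arithmetic. The sketch below assumes, and the abstract does not state this, that each PE completes one multiply-accumulate (counted as 2 operations) per cycle and that the array runs at 1 GHz; the function names are illustrative only.

```python
# Back-of-the-envelope check of the 64x64 peak-throughput figure.
# ASSUMPTIONS (not stated in the abstract): one multiply-accumulate
# (2 operations) per PE per cycle, and a 1 GHz clock.

def peak_tops(rows, cols, clock_hz, ops_per_pe_per_cycle=2):
    """Peak tera-operations per second for a rows x cols PE array."""
    return rows * cols * ops_per_pe_per_cycle * clock_hz / 1e12

def implied_power_watts(tops, tops_per_watt):
    """Power implied by a throughput and an energy-efficiency figure."""
    return tops / tops_per_watt

if __name__ == "__main__":
    # 4096 PEs under the assumptions above
    print(peak_tops(64, 64, 1e9))
    # Power implied by the reported 8.2 TOPS and 9.55 TOPS/W
    print(implied_power_watts(8.2, 9.55))
```

Under these assumptions the array would peak at about 8.19 TOPS, which is consistent with the reported 8.2 TOPS, and the reported efficiency would correspond to roughly 0.86 W.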
Article 96
Title@2025-07-10 (4): Accelerating Transposed Convolutions on FPGA-based Edge Devices
Title: Accelerating Transposed Convolutions on FPGA-based Edge Devices | Beschleunigung transponierter Konvolutionen auf FPGA-basierten Edge-Geräten | 加速基于 FPGA 的边缘设备的转换变速 2507.07683v1 |
Authors (2): Jude Haris, José Cano
Transposed Convolutions (TCONV) enable the up-scaling mechanism within generative Artificial Intelligence (AI) models. However, the predominant Input-Oriented Mapping (IOM) method for implementing TCONV has complex output mapping, overlapping sums, and ineffectual computations. These inefficiencies further exacerbate the performance bottleneck of TCONV and generative models on resource-constrained edge devices. To address this problem, in this paper we propose MM2IM, a hardware-software co-designed accelerator that combines Matrix Multiplication (MatMul) with col2IM to process TCONV layers on resource-constrained edge devices efficiently. Using the SECDA-TFLite design toolkit, we implement MM2IM and evaluate its performance across 261 TCONV problem configurations, achieving an average speedup of 1.9x against a dual-thread ARM Neon optimized CPU baseline. We then evaluate the performance of MM2IM on a range of TCONV layers from well-known generative models achieving up to 4.2x speedup, and compare it against similar resource-constrained TCONV accelerators, outperforming them by at least 2x GOPs/DSP. Finally, we evaluate MM2IM on the DCGAN and pix2pix GAN models, achieving up to 3x speedup and 2.4x energy reduction against the CPU baseline.
为了解决这个问题,我们在本文件中提议了MM2IM, 一个硬件软件共同设计的加速器,将MM2IM与COL2IM组合在一起,以高效地处理控制资源边缘装置上的TCONV层。我们使用SECDA-TFLite设计工具包,执行MM2IM,并评估其在261 TCONV问题配置中的性能表现,实现1.9x的平均速度,与双轨的ARM Neon优化的CPU基准相对应。然后我们从众所周知的Com2SUI模型到达到4.2x速度的TRIM, 将其与类似的GMMSM2 基准模型相比较。
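The idea of computing a transposed convolution as a matrix multiplication followed by col2im can be illustrated in software. The following is a minimal pure-Python sketch of that general formulation, not the MM2IM accelerator itself; the tensor layouts, function names, and the no-padding restriction are assumptions made for illustration.

```python
def matmul(a, b):
    """Plain nested-list matrix product: (m x k) @ (k x n)."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def tconv_matmul_col2im(x, w, stride):
    """Transposed convolution, no padding.

    x[ci][i][j] is the input, w[ci][co][kh][kw] the (square) kernel;
    returns out[co][oy][ox].
    """
    cin, h, wd = len(x), len(x[0]), len(x[0][0])
    cout, k = len(w[0]), len(w[0][0])
    # Step 1 (MatMul): input as (H*W, Cin), weights as (Cin, Cout*K*K).
    a = [[x[ci][i][j] for ci in range(cin)]
         for i in range(h) for j in range(wd)]
    b = [[w[ci][co][kh][kw]
          for co in range(cout) for kh in range(k) for kw in range(k)]
         for ci in range(cin)]
    cols = matmul(a, b)  # shape (H*W, Cout*K*K)
    # Step 2 (col2im): scatter-add every column entry into the output;
    # overlapping windows accumulate when stride < K.
    oh, ow = (h - 1) * stride + k, (wd - 1) * stride + k
    out = [[[0] * ow for _ in range(oh)] for _ in range(cout)]
    for i in range(h):
        for j in range(wd):
            row = i * wd + j
            for co in range(cout):
                for kh in range(k):
                    for kw in range(k):
                        col = (co * k + kh) * k + kw
                        out[co][i * stride + kh][j * stride + kw] += cols[row][col]
    return out

if __name__ == "__main__":
    x = [[[1, 2], [3, 4]]]          # 1 channel, 2x2 input
    w = [[[[1, 1], [1, 1]]]]        # 1 -> 1 channel, 2x2 kernel of ones
    print(tconv_matmul_col2im(x, w, stride=2))
    print(tconv_matmul_col2im(x, w, stride=1))
```

With stride 2 the windows do not overlap and each input value is simply replicated into a 2x2 block; with stride 1 the windows overlap and col2im has to sum them, which is the overlapping-sum behaviour the abstract mentions.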
Article 97
Title@2025-07-10 (4): Multi-agent Reinforcement Learning-based In-place Scaling Engine for Edge-cloud Systems
Title: Multi-agent Reinforcement Learning-based In-place Scaling Engine for Edge-cloud Systems | Multi-Agenten-Verstärkung Learning-based In-place Scaling Engine für Edge-Cloud-Systeme | 边缘球状系统内地增强引擎 2507.07671v1 |
Authors (7): Jovan Prodanov, Blaž Bertalanič, Carolina Fortuna, Shih-Kai Chou, Matjaž Branko Jurič, Ramon Sanchez-Iborra, Jernej Hribar
Modern edge-cloud systems face challenges in efficiently scaling resources to handle dynamic and unpredictable workloads. Traditional scaling approaches typically rely on static thresholds and predefined rules, which are often inadequate for optimizing resource utilization and maintaining performance in distributed and dynamic environments. This inefficiency hinders the adaptability and performance required in edge-cloud infrastructures, which can only be achieved through the newly proposed in-place scaling. To address this problem, we propose the Multi-Agent Reinforcement Learning-based In-place Scaling Engine (MARLISE), which enables seamless, dynamic, reactive control with in-place resource scaling. We develop our solution using two Deep Reinforcement Learning algorithms: Deep Q-Network (DQN) and Proximal Policy Optimization (PPO). We analyze each version of the proposed MARLISE solution using dynamic workloads, demonstrating its ability to ensure low microservice response times and scalability. Our results show that MARLISE-based approaches outperform a heuristic method in managing resource elasticity while maintaining microservice response times and achieving higher resource efficiency.
现代边缘系统在高效率地增加资源以应对动态和不可预测的工作量方面面临挑战。传统的规模化方法通常依赖静态阈值和预先确定的规则,而静态阈值和预设规则往往不足以优化资源利用和维持分布式和动态环境中的业绩。这种效率低下妨碍了边缘型基础设施所需的适应性和绩效,而这种功能化基础设施只能通过新提议的地方规模化来实现。为解决这一问题,我们建议采用基于多代理强化学习的基于内部配置的基于多职位的辅助型引擎(MARLISE),该引擎能够实现无缝、动态和反应式的控制,同时利用内部资源规模化的资源规模化。我们利用两种深度强化学习算法:深Q网络(DQN)和优化政策优化(PPPO)来开发我们的解决方案。我们利用动态工作量来分析拟议中每个版本的MARLISE解决方案,展示其确保微观服务反应时间低和可扩展性的能力。我们的成果表明,基于多代理系统的方法在管理资源弹性的同时,在保持微观服务反应时间和实现更高资源效率方面超越了正规的超上的方法。
Article 98
Title@2025-07-10 (4): Stress Monitoring in Healthcare: An Ensemble Machine Learning Framework Using Wearable Sensor Data
Title: Stress Monitoring in Healthcare: An Ensemble Machine Learning Framework Using Wearable Sensor Data | Stressüberwachung im Gesundheitswesen: Ein Ensemble Machine Learning Framework mit tragbaren Sensordaten | 保健中压力监测:使用穿戴感感应数据的综合机械学习框架 2507.07589v1 |
Authors (3): Arpana Sinhal, Anay Sinhal, Amit Sinhal
Healthcare professionals, particularly nurses, face elevated occupational stress, a concern amplified during the COVID-19 pandemic. While wearable sensors offer promising avenues for real-time stress monitoring, existing studies often lack comprehensive datasets and robust analytical frameworks. This study addresses these gaps by introducing a multimodal dataset comprising physiological signals: electrodermal activity, heart rate, and skin temperature. A systematic literature review identified limitations in prior stress-detection methodologies, particularly in handling class imbalance and optimizing model generalizability. To overcome these challenges, the dataset was preprocessed with the Synthetic Minority Over-sampling Technique (SMOTE), ensuring balanced representation of stress states. Advanced machine learning models, including Random Forest, XGBoost, and a Multi-Layer Perceptron (MLP), were evaluated and combined into a Stacking Classifier to leverage their collective predictive strengths. By using a publicly accessible dataset and a reproducible analytical pipeline, this work advances the development of deployable stress-monitoring systems, offering practical implications for safeguarding healthcare workers’ mental health. Future research directions include expanding demographic diversity and exploring edge-computing implementations for low-latency stress alerts.
在COVID-19大流行期间,保健专业人员,特别是护士,面临着职业压力升高的问题,这是人们更加关注的一个问题。虽然穿戴传感器为实时压力监测提供了有希望的渠道,但现有的研究往往缺乏全面的数据集和强有力的分析框架。这项研究通过引入由生理信号、电极活动、心率和皮肤温度组成的多式联运数据集,弥补了这些差距。系统文献审查查明了先前的压力检测方法的局限性,特别是在处理阶级不平衡和优化模型一般性方面。为了克服这些挑战,数据集与合成少数群体抽样技术(SMOTE)一起进行了预处理,确保压力状态的均衡代表。包括随机森林、XGBoust和多激光 Perceptron(MLP)在内的先进机器学习模型得到了评估,并合并成一个标准分类,以利用其集体预测优势。通过使用公众可获取的数据集和可复制的分析管道,这项工作推动了可部署的压力监测系统的开发,为保护保健工作者的心理健康提供了实际影响。未来的研究方向包括扩大人口多样性和探索低潜压压力警报的边缘执行。
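The SMOTE step mentioned in the abstract above balances classes by interpolating between a minority-class sample and one of its nearest minority-class neighbours. The following is a minimal pure-Python sketch of that interpolation, assuming tuple-valued feature vectors; the function name, parameters, and toy data are hypothetical, and the paper's actual pipeline is not reproduced here.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority samples by interpolation.

    minority: list of feature tuples for the minority class.
    k: number of nearest minority neighbours to choose from.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself),
        # ranked by squared Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        u = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nb)))
    return synthetic

if __name__ == "__main__":
    # Toy 2-D "stressed" samples (made-up values, not from the dataset).
    stressed = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    for point in smote(stressed, n_new=5):
        print(point)
```

Every synthetic point lies on a line segment between two real minority samples, so the new data stay within the region the minority class already occupies.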
Article 99
Title@2025-07-10 (4): TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference
Title: TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference | TokenWeave: Effiziente Compute-Communication Overlap für verteilte LLM-Inferenz | TokenWeave: 有效计算分布式LLM 推理的通信重叠 2505.11329v2 |
Authors (3): Raja Gond, Nipun Kwatra, Ramachandran Ramjee
Distributed inference of large language models (LLMs) can introduce overheads of up to 20% even over GPUs connected via high-speed interconnects such as NVLink. Multiple techniques have been proposed to mitigate these overheads by decomposing computations into finer-grained tasks and overlapping communication with sub-tasks as they complete. However, fine-grained decomposition of a large computation into many smaller computations on GPUs results in overheads. Furthermore, the communication itself uses many streaming multiprocessors (SMs), adding to the overhead. We present TokenWeave to address these challenges. TokenWeave proposes a Token-Splitting technique that divides the tokens in the inference batch into two approximately equal subsets in a wave-aware manner. The communication of one subset is then overlapped with the computation of the other. In addition, TokenWeave optimizes the order of the layer normalization computation with respect to communication operations and implements a novel fused AllReduce–RMSNorm kernel that carefully leverages Multimem instruction support available on NVIDIA Hopper GPUs. These optimizations allow TokenWeave to perform communication and RMSNorm using only 2-8 SMs. Moreover, our kernel enables the memory-bound RMSNorm to be overlapped with the other batch’s computation, providing additional gains. Our evaluations demonstrate up to 1.29x speedup in latency and 1.26x higher throughput across multiple models and workloads. In several settings, TokenWeave results in better performance compared to an equivalent model with all communication removed.
大型语言模型(LLMS)的分布式推论可以引入高达20%的间接费用,甚至超过通过高速互连(如 NVLink ) 连接的 GPU 。 已经提出了多种技术, 通过将计算分解成细微重分解任务和在完成时与子任务重复通信来缓解这些间接费用。 但是, 微细分分解将大量计算分解成在 GPU 上的许多较小计算导致间接费用。 此外, 通信本身使用许多流式多处理器( SMs) , 增加管理费用。 我们展示了托肯韦韦( Tokenweave) 来应对这些挑战。 TokenWeave提议一种托肯(Token- Split) 技术, 以波浪分解计算分解成两个大约相等的子集。 一个子集的通讯与其它的计算方法相重叠。 此外, TokenWeave( ) 优化了所有REW- REM- NOLKNQNQN 模式, 以便仔细地优化地对 HIM 的 OVA- 和S- hold 进行自动分析。
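The token-splitting step described in the abstract above divides a batch into two approximately equal subsets so that one subset's communication can overlap the other's computation. The sketch below only illustrates the splitting arithmetic. It assumes that "wave-aware" means aligning the split point to a multiple of a tile size, which is an interpretation not confirmed by the abstract; the function name and the default tile size of 128 are hypothetical.

```python
def split_tokens(num_tokens, tile=128):
    """Split a batch of tokens into two roughly equal parts.

    ASSUMPTION: the first part is rounded to the nearest multiple of
    `tile` so that it fills whole GPU work tiles; the abstract does not
    specify how the wave-aware split is actually computed.
    Returns (first, second); first == 0 means "do not split".
    """
    half = num_tokens // 2
    first = ((half + tile // 2) // tile) * tile  # nearest multiple of tile
    first = min(first, num_tokens)
    return first, num_tokens - first

if __name__ == "__main__":
    print(split_tokens(1000))  # large batch: two near-equal halves
    print(split_tokens(100))   # small batch: left unsplit
```

For a batch too small to fill even one tile per half, the sketch leaves the batch unsplit rather than creating two underfilled subsets.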
Article 100
Title@2025-07-10 (4): A Unified Ontology for Scalable Knowledge Graph-Driven Operational Data Analytics in High-Performance Computing Systems
Title: A Unified Ontology for Scalable Knowledge Graph-Driven Operational Data Analytics in High-Performance Computing Systems | Eine einheitliche Ontologie für skalierbare, graphgestützte Betriebsdatenanalytik in Hochleistungs-Computing-Systemen | 高性能计算系统中可缩放知识、图表驱动操作数据分析的统一本体学 2507.06107v2 |
Authors (2): Junaid Ahmed Khan, Andrea Bartolini
Modern high-performance computing (HPC) systems generate massive volumes of heterogeneous telemetry data from millions of sensors monitoring compute, memory, power, cooling, and storage subsystems. As HPC infrastructures scale to support increasingly complex workloads, including generative AI, the need for efficient, reliable, and interoperable telemetry analysis becomes critical. Operational Data Analytics (ODA) has emerged to address these demands; however, the reliance on schema-less storage solutions limits data accessibility and semantic integration. Ontologies and knowledge graphs (KGs) provide an effective way to enable efficient and expressive data querying by capturing domain semantics, but they face challenges such as significant storage overhead and the limited applicability of existing ontologies, which are often tailored to specific HPC systems only. In this paper, we present the first unified ontology for ODA in HPC systems, designed to enable semantic interoperability across heterogeneous data centers. Our ontology models telemetry data from the two largest publicly available ODA datasets, M100 (Cineca, Italy) and F-DATA (Fugaku, Japan), within a single data model. The ontology is validated through 36 competency questions reflecting real-world stakeholder requirements, and we introduce modeling optimizations that reduce KG storage overhead by up to 38.84% compared to a previous approach, with an additional 26.82% reduction depending on the desired deployment configuration. This work paves the way for scalable ODA KGs and supports not only analysis within individual systems, but also cross-system analysis across heterogeneous HPC systems.
现代高性能计算(HPC)系统从数以百万计的传感器监测计算、记忆、电力、冷却和储存子系统中产生大量异式遥测数据。随着HPC基础设施规模的扩大,支持日益复杂的工作量,包括基因化的AI – – 需要高效、可靠和互操作的遥测分析变得至关重要。运行数据分析(ODA)已经出现,以满足这些需求;然而,对无机存储解决方案的依赖限制了数据的可获取性和语义整合。核心和知识图表(KG)提供了一种有效的方法,通过捕获域名语义系统,使高效和直观的数据查询,但是,它们面临着巨大的存储管理以及仅针对特定HPC系统的有限适用性等挑战。在本文件中,我们提出了第一个用于HPC系统(ODA)的统一数据,目的是使混杂数据中心能够实现语义互操作性互操作性。我们从两个最大的可公开获得的官方发展援助数据集(M100(Cineca,意大利)和F-DATA(Fuga)系统支持了跨域语系的存储管理管理管理管理管理管理,日本)的现有在线数据分析(KG)系统,通过一个通过Slistal-listalimalalalalalalalalalalalal 定义的模型,这个模型,这个模型,这个模型,用来在一个模拟的存储模型上减少一个模型的存储能力分析方法,这个模型中,这个模型,这个模型,这个模型,这个模型里,用来减少一个模型,这个模型,这个模型,这个模型,用来分析,这个模型里程中,用来反映了一个模拟的存储能力要求。
Article 101
Title@2025-07-10 (4): Opt-GPTQ: An Optimized GPTQ Combining Sparse Attention and Quantization Techniques
Title: Opt-GPTQ: An Optimized GPTQ Combining Sparse Attention and Quantization Techniques | Opt-GPTQ: Optimierte GPTQ Kombination von Sparsen-Achtung und Quantisierungstechniken | GPTQ:最佳GPTQ,将分散关注和量化技术结合起来 2505.02351v2 |
Authors (10): Jie Kong, Junxiang Zhang, Jiheng Xu, Yalong Li, Shouhua Zhang, Jiehan Zhou, Yuhai Liu, Peng Liang, Quan Zhang, Luohan Jiang
In the field of deep learning, traditional attention mechanisms face significant challenges related to high computational complexity and large memory consumption when processing long sequence data. To address these limitations, we propose Opt-GPTQ, an optimized Gradient-based Post Training Quantization (GPTQ) combining the Grouped Query Attention (GQA) mechanism with paging memory management, optimizing the traditional Multi-Head Attention (MHA) mechanism by grouping query heads and sharing key-value vectors. Optimized GQA (Opt-GQA) effectively reduces computational complexity, minimizes memory fragmentation, and enhances memory utilization for large-scale models. Opt-GPTQ is optimized for Data Center Units (DCUs) and integrated into the vLLM model to maximize hardware efficiency. It customizes GPU kernels to further enhance attention computation by reducing memory access latency and boosting parallel computing capabilities. Opt-GQA integrates Attention with Linear Biases (ALiBi) to reduce overhead and enhance long-sequence processing. Experimental results show that Opt-GPTQ significantly reduces computation time and memory usage while improving model performance.
在深层学习领域,传统关注机制在处理长序列数据时面临着与高计算复杂性和大量记忆消耗有关的重大挑战。为解决这些局限性,我们提议Opt-GPTQ,即优化的GPTQ,即基于优化的渐进式培训后量化(GPTQQ),将GQA(GQA)机制与组合存储管理相结合,优化传统的多负责人关注(MHA)机制,方法是将查询头分组并共享关键值矢量。优化的GQA(Opt-GQA)有效地降低计算复杂性,尽量减少记忆破碎,并加强大规模模型的记忆利用。Opt-GPTQQ(Opt-GQA)优化了数据中心单位(DCUs),并将其整合到 vLLLM 模型中,以最大限度地提高硬件效率。它定制了GPUP 内核,以通过减少记忆存取时间和增强平行计算能力来进一步增加注意力的计算。Opt-GQA将关注与线-Bises(ALiBiBi)有效地减少管理并增强长期模型处理。实验性结果显示Opt-GPTQ的利用。
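Grouped Query Attention, as described in the abstract above, groups query heads so that several of them share one set of key-value vectors. The following is a minimal pure-Python sketch of that head-sharing for a single query position; the tensor layouts, function names, and toy values are assumptions, and none of the paper's paging, quantization, or ALiBi components are modelled.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def gqa(q, k, v):
    """Grouped-query attention for one query position.

    q: [n_q_heads][d]; k, v: [n_kv_heads][seq][d].
    Query head h reads the KV head h // (n_q_heads // n_kv_heads).
    """
    n_q, n_kv = len(q), len(k)
    assert n_q % n_kv == 0, "query heads must divide evenly into KV groups"
    group = n_q // n_kv
    out = []
    for h, qh in enumerate(q):
        kv = h // group  # shared KV head for this query head
        d = len(qh)
        scores = [sum(a * b for a, b in zip(qh, kt)) / math.sqrt(d)
                  for kt in k[kv]]
        wts = softmax(scores)
        out.append([sum(wt * vt[c] for wt, vt in zip(wts, v[kv]))
                    for c in range(d)])
    return out

if __name__ == "__main__":
    q = [[0.0, 0.0]] * 4                                   # 4 query heads
    k = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [2.0, 2.0]]]  # 2 KV heads
    v = [[[1.0, 3.0], [3.0, 5.0]], [[10.0, 0.0], [20.0, 0.0]]]
    print(gqa(q, k, v))
```

In this toy setup four query heads share two KV heads, so only half as many key and value vectors need to be stored as with one KV head per query head.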
Article 102
Title@2025-07-10 (4): KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows
Title: KVFlow: Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows | KVFlow: Effizientes Präfix-Caching zur Beschleunigung von LLM-basierten Multiagenten-Workflows | KVFlow: 为加速基于LLM的多重需要工作流程而高效预置缓存 2507.07400v1 |
Authors (9): Zaifeng Pan, Ajjkumar Patel, Zhengding Hu, Yipeng Shen, Yue Guan, Wan-Lu Li, Lianhui Qin, Yida Wang, Yufei Ding
Large language model (LLM) based agentic workflows have become a popular paradigm for coordinating multiple specialized agents to solve complex tasks. To improve serving efficiency, existing LLM systems employ prefix caching to reuse key-value (KV) tensors corresponding to agents’ fixed prompts, thereby avoiding redundant computation across repeated invocations. However, current systems typically evict KV caches using a Least Recently Used (LRU) policy, which fails to anticipate future agent usage and often discards KV caches shortly before their reuse. This leads to frequent cache misses and substantial recomputation or swapping overhead. We present KVFlow, a workflow-aware KV cache management framework tailored for agentic workloads. KVFlow abstracts the agent execution schedule as an Agent Step Graph and assigns each agent a steps-to-execution value that estimates its temporal proximity to future activation. These values guide a fine-grained eviction policy at the KV node level, allowing KVFlow to preserve entries likely to be reused and efficiently manage shared prefixes in tree-structured caches. Moreover, KVFlow introduces a fully overlapped KV prefetching mechanism, which proactively loads required tensors from CPU to GPU in background threads for agents scheduled in the next step, thereby avoiding cache miss stalls during generation. Compared to SGLang with hierarchical radix cache, KVFlow achieves up to 1.83x speedup for single workflows with large prompts, and up to 2.19x speedup for scenarios with many concurrent workflows.
大型语言模型( LLM) 以大型语言模式为基础的代理工作流程已成为协调多个专门代理商解决复杂任务的流行范例。 为了提高效率, 现有的 LLM 系统使用前缀缓存, 重新使用与代理商固定提示相对的键值( KV) , 从而避免重复计算。 然而, 当前系统通常使用最不常用的( LRU) 政策驱逐 KV 缓存, 这无法预测未来代理商的使用情况, 并经常在重新使用之前不久丢弃 KV 缓存 。 这导致频繁的缓存丢失和大量重置或转换管理管理管理管理。 我们展示了 KVFlow, 一个为代理工作量量定制的工作流程- World KVVV 缓存管理框架。 KVFlow 将代理商执行时间表作为代理Step 图表, 并给每个代理商分配一个步骤到执行值, 估计其与未来激活时间的距离。 这些值指导了 KVPO 节点的细化驱逐政策, 允许 KVFlow 保存可能被再利用的单流流流流和高效共享的预置速度, 。 KVlalal-lickraterateal 时间里, 时间里, 需要完全地在 SG 。
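The contrast between LRU eviction and a steps-to-execution policy, as described in the abstract above, can be shown on a toy workflow. The sketch below assumes a simple directed graph in which an edge A -> B means agent B runs after agent A, and takes steps-to-execution to be the breadth-first distance from the currently running agent. The agent names, graph, and function names are hypothetical, and KVFlow's tree-structured cache and prefetching are not modelled.

```python
from collections import deque

def steps_to_execution(graph, running):
    """Breadth-first distance from the running agent to every reachable agent."""
    dist = {running: 0}
    queue = deque([running])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

def pick_victim_lru(last_used):
    """LRU baseline: evict the agent whose cache was used longest ago."""
    return min(last_used, key=last_used.get)

def pick_victim_workflow_aware(cached, steps):
    """Evict the cached agent that is furthest from its next execution.

    Agents unreachable in the graph are treated as infinitely far away.
    """
    return max(cached, key=lambda agent: steps.get(agent, float("inf")))

if __name__ == "__main__":
    # Hypothetical linear workflow: coder -> planner -> tester -> reviewer
    graph = {"coder": ["planner"], "planner": ["tester"],
             "tester": ["reviewer"], "reviewer": []}
    last_used = {"planner": 1, "tester": 3, "reviewer": 5}  # timestamps
    steps = steps_to_execution(graph, running="coder")
    print(pick_victim_lru(last_used))                      # LRU choice
    print(pick_victim_workflow_aware(list(last_used), steps))
```

In this toy case LRU would evict the planner's cache even though the planner runs next, whereas the workflow-aware policy evicts the reviewer's cache, which is three steps away.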
Article 103
Title@2025-07-10 (4): Future Resource Bank for ISAC: Achieving Fast and Stable Win-Win Matching for Both Individuals and Coalitions
Title: Future Resource Bank for ISAC: Achieving Fast and Stable Win-Win Matching for Both Individuals and Coalitions | Future Resource Bank for ISAC: Schnelles und stabiles Win-Win-Matching für Einzelpersonen und Koalitionen | ISAC未来资源银行:实现个人和联盟的快速和稳定的双赢比对 2502.08118v5 |
Authors (6): Houyi Qi, Minghui Liwang, Seyyedali Hosseinalipour, Liqun Fu, Sai Zou, Wei Ni
Future wireless networks must support emerging applications where environmental awareness is as critical as data transmission. Integrated Sensing and Communication (ISAC) enables this vision by allowing base stations (BSs) to allocate bandwidth and power to mobile users (MUs) for communications and cooperative sensing. However, this resource allocation is highly challenging due to: (i) dynamic resource demands from MUs and resource supply from BSs, and (ii) the selfishness of MUs and BSs. To address these challenges, existing solutions rely on either real-time (online) resource trading, which incurs high overhead and failures, or static long-term (offline) resource contracts, which lack flexibility. To overcome these limitations, we propose the Future Resource Bank for ISAC, a hybrid trading framework that integrates offline and online resource allocation through a level-wise client model, where MUs and their coalitions negotiate with BSs. We introduce two mechanisms: (i) Role-Friendly Win-Win Matching (offRFW$^2$M), leveraging overbooking to establish risk-aware, stable contracts, and (ii) Effective Backup Win-Win Matching (onEBW$^2$M), which dynamically reallocates unmet demand and surplus supply. We theoretically prove stability, individual rationality, and weak Pareto optimality of these mechanisms. Through simulations, we show that our framework improves social welfare, latency, and energy efficiency compared to existing methods.
未来无线网络必须支持环境意识与数据传输一样至关重要的新兴应用; 综合遥感和通信(ISAC)允许基地站为通信和合作遥感向移动用户分配带宽和电力,从而使这一愿景得以实现; 然而,这一资源分配非常具有挑战性,因为:(一) 来自移动站的动态资源需求以及来自移动站的资源供应;(二) 移动站和移动站的自私自利。 为了应对这些挑战,现有解决方案依赖于实时(在线)资源交易,这种交易导致高管理费和失败,或静态(脱线)长期资源合同,缺乏灵活性。为克服这些限制,我们提议建立一个未来信息站资源银行,这是一个混合贸易框架,通过一个水平明智的客户模式,将离线和在线资源分配结合起来。 我们引入了两个机制:(一) 作用友好的Win-Win匹配(off RFW$%2M),利用过度的账面来建立风险意识、稳定合同、稳定的长期(offline)资源合同,以及(ii) 有效后期Sing Win-Win-Wimeal-Sildal-Sildalviolview Sility Supment Silvals-Wild),我们现有的能源供应和稳定性、不断提升机制。
Article 104
Title@2025-07-10 (4): Constraint Programming Models For Serial Batch Scheduling With Minimum Batch Size
Title: Constraint Programming Models For Serial Batch Scheduling With Minimum Batch Size | Einschränkungen Programmiermodelle für serielle Batch-Scheichung mit minimaler Batch-Größe | 具有最小批量大小的连续批次排程限制编程模型 2504.08793v2 |
Authors (2): Jorge A. Huertas, Pascal Van Hentenryck
In serial batch (s-batch) scheduling, jobs are grouped in batches and processed sequentially within their batch. This paper considers multiple parallel machines, nonidentical job weights and release times, and sequence-dependent setup times between batches of different families. Although s-batch has been widely studied in the literature, very few papers have taken into account a minimum batch size, typical in practical settings such as semiconductor manufacturing and the metal industry. The problem with this minimum batch size requirement has been mostly tackled with dynamic programming and meta-heuristics, and no article has ever used constraint programming (CP) to do so. This paper fills this gap by proposing three CP models for s-batching with minimum batch size: (i) an Interval Assignment model that computes and bounds the size of the batches using the presence literals of interval variables of the jobs; (ii) a Global model that exclusively uses global constraints that track the size of the batches over time; and (iii) a Hybrid model that combines the benefits of the extra global constraints with the efficiency of the sum-of-presences constraints to ensure the minimum batch sizes. The computational experiments on standard cases compare the three CP models with two existing mixed-integer programming (MIP) models from the literature. The results demonstrate the versatility of the proposed CP models in handling multiple variations of s-batching, and their ability to produce, in large instances, better solutions than the MIP models in less time.
在序列批量(批量)列表中,工作按批次分组,并在批次内按批次处理。 本文考虑了多个平行机器、 不同工作重量和发布时间不完全相同, 以及不同家庭批次之间取决于顺序的设置时间。 虽然文献中已经对批量进行了广泛的研究, 但很少有论文考虑到最低批量规模, 在半导体制造和金属工业等实际环境下典型的批量规模。 这种最低批量规模要求的问题大多通过动态编程和元过量处理, 也没有文章使用过强制编程( CP) 。 本文通过提议, 三个批次间加载最小尺寸的批次模式填补了这一差距:(i) 一种批次的批次模式, 用半导体和金属工业的间隔变量的亮度来计算和约束批次的大小。 (ii) 一种纯度 {全球 提议模式, 专门使用跟踪批次规模的全球制约, 并且从未使用过强制编程程序程序( CP) 。 (iii) 以及一种纹/ Cen 级模型 来填补这一差距, 用最小的模型来填补这一差距 , , 将效率限制与两种模型的缩缩缩缩缩数 合并模型结合起来 结合, 。
Article 105
Title@2025-07-10 (4): Machine Learning-driven Multiscale MD Workflows: The Mini-MuMMI Experience
Title: Machine Learning-driven Multiscale MD Workflows: The Mini-MuMMI Experience | Mehrstufige MD-Workflows mit maschinellem Lernen: Die Mini-MuMMI-Erfahrung | 由学习驱动的机械式学习驱动的多规模MD工作流程:微型MIMI经验 2507.07352v1 |
Authors (11): Loïc Pottier, Konstantia Georgouli, Timothy S. Carpenter, Fikret Aydin, Jeremy O. B. Tempkin, Dwight V. Nissley, Frederick H. Streitz, Thomas R. W. Scogland, Peer-Timo Bremer, Felice C. Lightstone, Helgi I. Ingólfsson
Computational models have become one of the prevalent methods to model complex phenomena. To accurately model complex interactions, such as detailed biomolecular interactions, scientists often rely on multiscale models comprised of several internal models operating at different scales, ranging from microscopic to macroscopic length and time scales. Bridging the gap between different time and length scales has historically been challenging, but the advent of newer machine learning (ML) approaches has shown promise for tackling that task. Multiscale models require massive amounts of computational power and a powerful workflow management system. Orchestrating ML-driven multiscale studies on parallel systems with thousands of nodes is challenging: the workflow must schedule, allocate, and control thousands of simulations operating at different scales. Here, we discuss the massively parallel Multiscale Machine-Learned Modeling Infrastructure (MuMMI), a multiscale workflow management infrastructure that can orchestrate thousands of molecular dynamics (MD) simulations operating at different timescales, from milliseconds to nanoseconds. More specifically, we introduce a novel version of MuMMI called “mini-MuMMI”. Mini-MuMMI is a curated version of MuMMI designed to run on modest HPC systems or even laptops, whereas MuMMI requires larger HPC systems. We demonstrate the utility of mini-MuMMI by exploring RAS-RAF membrane interactions, and we discuss the challenges behind the generalization of multiscale workflows and how mini-MuMMI can be leveraged to target a broader range of applications outside of MD and RAS-RAF interactions.
精确地模拟复杂的相互作用,例如详细的生物分子相互作用,科学家往往依赖由从微观到宏观的长度和时间尺度等不同尺度运行的若干内部模型组成的多尺度模型。缩小不同时间和长度尺度之间的差距历来具有挑战性,但新机器学习(ML)方法的出现显示了应对这项任务的希望。多规模模型需要大量的计算力和强大的工作流程管理系统。用数千个节点对平行系统进行由ML驱动的多尺度研究具有挑战性,工作流程必须安排、分配和控制不同尺度运行的数千个模拟。在这里,我们讨论了大规模平行的多尺度机器模拟基础设施(MIMMI),一个多规模的工作流程管理基础设施,可以在不同的时间尺度上协调数千种分子动态模拟,从毫秒到毫秒不等。更具体地说,我们引入了名为“MIMMI-MI”的新型多尺度研究。MIMI的小型和MIMIMIMIMM系统需要更大规模地展示MIMIMIMA系统,而MIMI的小型和小型MIMIML系统则需要我们小规模的小型版本。