cs.DC @ 2025-08-01: 097

07-31 (4)

The ArborX library: version 2.0

Die ArborX-Bibliothek: Version 2.0

ArborX 库:2.0 版

2507.23700v1

07-31

Satellite Federated Fine-Tuning for Foundation Models in Space Computing Power Networks

Satelliten-Federated Fine-Tuning für Basismodelle in Weltraum Computing Power Networks

空间算力网络中面向基础模型的卫星联邦微调

2504.10403v3

07-31

Parallel Split Learning with Global Sampling

Paralleles Split-Lernen mit globaler Probenahme

与全球抽样平行拆分学习

2407.15738v4

07-31

Beyond Optimal Fault Tolerance

Jenseits der optimalen Fehlertoleranz

超越最优容错

2501.06044v7

07-31

Consistent Point Matching

Konsistente Punktgleichung

统一点匹配

2507.23609v1

07-31

Threshold-Driven Streaming Graph: Expansion and Rumor Spreading

Threshold-Driven Streaming Graph: Expansion und Gerüchte Verbreitung

阈值驱动流图:扩展和谣言扩散

2507.23533v1

07-31

Towards Serverless Processing of Spatiotemporal Big Data Queries

Auf dem Weg zur serverlosen Verarbeitung von raumzeitlichen Big Data-Abfragen

迈向时空大数据查询的无服务器处理

2507.06005v2

07-31

Scalable contribution bounding to achieve privacy

Skalierbare Beitragsbegrenzung zur Wahrung der Privatsphäre

实现隐私保护的可扩展贡献限界

2507.23432v1

07-31

Towards a Testbed for Scalable FaaS Platforms

Auf dem Weg zu einem Testbett für skalierbare FaaS-Plattformen

迈向可缩放的 FaaS 平台测试台

2507.23431v1

07-31

Minos: Exploiting Cloud Performance Variation with Function-as-a-Service Instance Selection

Minos: Ausnutzung von Cloud-Performance-Schwankungen durch Function-as-a-Service-Instanzauswahl

Minos:通过函数即服务实例选择利用云性能波动

2505.12928v2

07-31

H2SGEMM: Emulating FP32 GEMM on Ascend NPUs using FP16 Units with Precision Recovery and Cache-Aware Optimization

H2SGEMM: Emulieren von FP32 GEMM auf Ascend-NPUs mit FP16-Einheiten mit Präzisionsrückgewinnung und Cache-Aware-Optimierung

H2SGEMM:在昇腾(Ascend)NPU 上利用 FP16 单元模拟 FP32 GEMM,并结合精度恢复与缓存感知优化

2507.23387v1

07-31

A Simple $(1-ε)$-Approximation Semi-Streaming Algorithm for Maximum (Weighted) Matching

Ein einfacher $(1-ε)$-Approximations-Semi-Streaming-Algorithmus für maximales (gewichtetes) Matching

用于最大(加权)匹配的简单 $(1-ε)$ 近似半流式算法

2307.02968v4

07-30 (3)

GALE: Leveraging Heterogeneous Systems for Efficient Unstructured Mesh Data Analysis

GALE: Nutzung heterogener Systeme für effiziente unstrukturierte Mesh-Datenanalyse

GALE:利用异构系统进行高效的非结构化网格数据分析

2507.15230v3

07-30

Data Readiness for Scientific AI at Scale

Datenbereitstellung für wissenschaftliche KI im Maßstab

规模化科学AI 数据准备程度

2507.23018v1

07-30

Towards Federated Learning with On-device Training and Communication in 8-bit Floating Point

Auf dem Weg zu föderiertem Lernen mit On-Device-Training und Kommunikation in 8-Bit-Gleitkomma

迈向采用 8 位浮点进行设备端训练和通信的联邦学习

2407.02610v2

07-30

DSPE: Profit Maximization in Edge-Cloud Storage System using Dynamic Space Partitioning with Erasure Code

DSPE: Profitmaximierung im Edge-Cloud-Speichersystem mit Dynamic Space Partitioning mit Erasure Code

DSPE:利用带纠删码的动态空间划分在边缘-云存储系统中实现利润最大化

2507.22801v1

07-30

A Survey on Large Language Model Acceleration based on KV Cache Management

Ein Überblick über die Beschleunigung großer Sprachmodelle auf Basis von KV-Cache-Management

基于 KV 缓存管理的大语言模型加速综述

2412.19442v3

07-30

Leveraging Caliper and Benchpark to Analyze MPI Communication Patterns: Insights from AMG2023, Kripke, and Laghos

Caliper und Benchpark nutzen, um MPI-Kommunikationsmuster zu analysieren: Einblicke von AMG2023, Kripke und Laghos

利用 Caliper 和 Benchpark 分析 MPI 通信模式:来自 AMG2023、Kripke 和 Laghos 的洞察

2507.22372v1

07-30

PS-WL: A Probability-Sensitive Wear Leveling scheme for SSD array scaling

PS-WL: Ein Probability-Sensitive Wear Leveling-Schema für die Skalierung von SSD-Arrays

PS-WL:一种面向 SSD 阵列扩展的概率敏感磨损均衡方案

2506.19660v3

07-30

Understanding Power and Energy Utilization in Large Scale Production Physics Simulation Codes

Leistungs- und Energienutzung in großmaßstäblichen Produktionsphysik-Simulationscodes verstehen

理解大规模生产级物理模拟代码中的功率与能源利用

2201.01278v2

07-30

A Semi-Supervised Federated Learning Framework with Hierarchical Clustering Aggregation for Heterogeneous Satellite Networks

Ein semi-überwachtes Federated Learning Framework mit Hierarchical Clustering Aggregation für heterogene Satellitennetzwerke

面向异构卫星网络的层次聚类聚合半监督联邦学习框架

2507.22339v1

07-30

Hypernetworks for Model-Heterogeneous Personalized Federated Learning

Hypernetzwerke für modell-heterogenes personalisiertes Federated Learning

面向模型异构个性化联邦学习的超网络

2507.22330v1

07-30

SP-Chain: Boosting Intra-Shard and Cross-Shard Security and Performance in Blockchain Sharding

SP-Chain: Stärkung von Intra-Shard und Cross-Shard Sicherheit und Performance in Blockchain Sharding

SP-Chain:提升区块链分片中片内与跨片的安全性和性能

2407.06953v2

07-30

Towards Experiment Execution in Support of Community Benchmark Workflows for HPC

Zur Durchführung von Experimenten zur Unterstützung von gemeinschaftlichen Benchmark-Workflows für HPC

面向支持 HPC 社区基准测试工作流的实验执行

2507.22294v1

07-29 (2)

Minimizing CGYRO HPC Communication Costs in Ensembles with XGYRO by Sharing the Collisional Constant Tensor Structure

Minimierung der CGYRO HPC-Kommunikationskosten in Ensembles mit XGYRO durch gemeinsame Nutzung der Kollisionskonstanten-Tensor-Struktur

通过共享碰撞常数张量结构,利用 XGYRO 在集合模拟中最小化 CGYRO 的 HPC 通信开销

2507.22245v1

07-29

AgileDART: An Agile and Scalable Edge Stream Processing Engine

AgileDART: Eine agile und skalierbare Edge Stream Processing Engine

AgileDART:一个敏捷且可扩展的边缘流处理引擎

2407.14953v3

07-29

OpenRASE: Service Function Chain Emulation

OpenRASE: Service-Funktionskette Emulation

OpenRASE: 服务功能链模拟

2507.22131v1

07-29

Large-Scale Linear Energy System Optimization: A Systematic Review on Parallelization Strategies via Decomposition

Large-Scale Linear Energy System Optimization: Eine systematische Überprüfung von Parallelisierungsstrategien durch Zersetzung

大型线性能源系统优化:通过分解对平行战略进行系统审查

2507.21932v1

07-29

The Performance of Low-Synchronization Variants of Reorthogonalized Block Classical Gram–Schmidt

Die Performance von Low-Synchronization Varianten von Reorthogonalized Block Classical Gram–Schmidt

重正交化块经典 Gram–Schmidt 低同步变体的性能

2507.21791v1

07-29

Evaluating the Impact Of Spatial Features Of Mobility Data and Index Choice On Database Performance

Bewertung der Auswirkungen räumlicher Merkmale von Mobilitätsdaten und Indexwahl auf die Datenbankleistung

评价移动数据空间特征和指数选择对数据库绩效的影响

2505.14466v2

07-29

Quantize Once, Train Fast: Allreduce-Compatible Compression with Provable Guarantees

Einmal quantisieren, schnell trainieren: Allreduce-kompatible Kompression mit beweisbaren Garantien

一次量化,快速训练:兼容 Allreduce 且具有可证明保证的压缩

2305.18627v2

07-29

Ethereum Conflicts Graphed

Ethereum-Konflikte

Ethereum 冲突图

2507.20196v2

07-29

A Massively Parallel Performance Portable Free-space Spectral Poisson Solver

Ein massiv paralleler, performance-portabler spektraler Freiraum-Poisson-Löser

大规模并行、性能可移植的自由空间谱 Poisson 求解器

2405.02603v2

07-29

Collaborative State Machines: A Better Programming Model for the Cloud-Edge-IoT Continuum

Kollaborative Zustandsautomaten: Ein besseres Programmiermodell für das Cloud-Edge-IoT-Kontinuum

协作状态机:面向云-边缘-物联网连续体的更优编程模型

2507.21685v1

07-29

Accelerating Stable Matching between Workers and Spatial-Temporal Tasks for Dynamic MCS: A Stagewise Service Trading Approach

Beschleunigte stabile Abstimmung zwischen Arbeitern und räumlich-zeitlichen Aufgaben für dynamische MCS: Ein schrittweiser Service-Trading-Ansatz

加速动态 MCS 中工人与时空任务之间的稳定匹配:一种分阶段服务交易方法

2502.08386v3

07-29

Bridging Cache-Friendliness and Concurrency: A Locality-Optimized In-Memory B-Skiplist

Überbrückung von Cache-Freundlichkeit und Concurrency: Eine lokalitätsoptimierte In-Memory-B-Skiplist

兼顾缓存友好性与并发性:一种局部性优化的内存 B-Skiplist

2507.21492v1

07-29

GlideinBenchmark: collecting resource information to optimize provisioning

GlideinBenchmark: Sammeln von Ressourceninformationen zur Optimierung der Bereitstellung

GlideinBenchmark:收集资源信息以优化资源供给

2507.21472v1

07-29

Using Containers to Speed Up Development, to Run Integration Tests and to Teach About Distributed Systems

Container verwenden, um die Entwicklung zu beschleunigen, Integrationstests durchzuführen und über verteilte Systeme zu unterrichten

利用容器加速开发、运行集成测试并讲授分布式系统

2507.21464v1

07-29

InfiniteHBD: Building Datacenter-Scale High-Bandwidth Domain for LLM with Optical Circuit Switching Transceivers

InfiniteHBD: Building Datacenter-Scale High-Bandwidth Domain für LLM mit optischen Schaltungsschalter Transceivern

InfiniteHBD:利用光电路交换收发器为 LLM 构建数据中心规模的高带宽域

2502.03885v5

07-28 (1)

FedStrategist: A Meta-Learning Framework for Adaptive and Robust Aggregation in Federated Learning

FedStrategist: Ein Meta-Learning-Framework für adaptive und robuste Aggregation im Federated Learning

FedStrategist:联邦学习中自适应鲁棒聚合的元学习框架

2507.14322v2

07-28

LeMix: Unified Scheduling for LLM Training and Inference on Multi-GPU Systems

LeMix: Unified Scheduling für LLM-Training und Schlussfolgerung auf Multi-GPU-Systemen

LeMix:多 GPU 系统上 LLM 训练与推理的统一调度

2507.21276v1

07-28

Improving SpGEMM Performance Through Matrix Reordering and Cluster-wise Computation

Verbesserung der SpGEMM-Performance durch Matrix-Neuordnung und clusterweise Berechnung

通过矩阵重排序和按簇计算提升 SpGEMM 性能

2507.21253v1

07-28

Parallel Point-to-Point Shortest Paths and Batch Queries

Parallele Punkt-zu-Punkt-Berechnung kürzester Pfade und Batch-Abfragen

平行点对点最短路径和批量查询

2506.16488v2

07-28

Metric Criticality Identification for Cloud Microservices

Metrische Criticality Identification für Cloud Microservices

云微服务的指标关键性识别

2501.03547v2

07-28

COoL-TEE: Client-TEE Collaboration for Resilient Distributed Search

COoL-TEE: Client-TEE-Kollaboration für resiliente verteilte Suche

COoL-TEE:面向弹性分布式搜索的客户端-TEE 协作

2503.19063v2

07-28

The Case for Time-Shared Computing Resources

Der Fall für zeitverteilte Computing-Ressourcen

分时共享计算资源的理由

2507.19287v2

07-28

Accelerating Deterministic Global Optimization via GPU-parallel Interval Arithmetic

Beschleunigung der Deterministischen globalen Optimierung über GPU-Parallel Interval Arithmetik

通过 GPU 并行区间算术加速确定性全局优化

2507.20769v1

07-28

Advancing Compositional LLM Reasoning with Structured Task Relations in Interactive Multimodal Communications

Verbesserung der kompositorischen LLM-Reasoning mit strukturierten Arbeitsbeziehungen in der interaktiven multimodalen Kommunikation

在交互式多模态通信中利用结构化任务关系推进组合式 LLM 推理

2507.21199v1

07-28

CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning

CUDA-L1: Verbesserung der CUDA-Optimierung durch kontrastives Verstärkungslernen

CUDA-L1:通过对比强化学习改进 CUDA 优化

2507.14111v4

07-28

RIMMS: Runtime Integrated Memory Management System for Heterogeneous Computing

RIMMS: Laufzeit-Integriertes Speicher-Management-System für Heterogenes Rechnen

RIMMS: 运行时异质计算综合记忆管理系统

2507.20514v1

07-27 (7)

Communication-Efficient Distributed Training for Collaborative Flat Optima Recovery in Deep Learning

Kommunikation-Effizient verteiltes Training für kollaborative Flat Optima Erholung im Deep Learning

促进深学习合作、平板最佳最佳恢复的传播-高效分配培训

2507.20424v1

07-27

A Comparative Study of OpenMP Scheduling Algorithm Selection Strategies

Eine vergleichende Studie der Auswahlstrategien für OpenMP-Scheduling-Algorithmen

OpenMP 调度算法选择策略的比较研究

2507.20312v1

07-27

Silent Self-Stabilising Leader Election in Programmable Matter Systems with Holes

Stille selbststabilisierende Leader-Wahl in programmierbaren Materiesystemen mit Löchern

在有洞洞的可规划物质系统中进行无声自稳定领导人选举

2507.20201v1

07-27

High-Performance Parallel Optimization of the Fish School Behaviour on the Setonix Platform Using OpenMP

Leistungsstarke Paralleloptimierung des Fish School Verhaltens auf der Setonix-Plattform mit OpenMP

使用 OpenMP 在 Setonix 平台上对鱼群行为进行高性能并行优化

2507.20173v1

07-27

Syno: Structured Synthesis for Neural Operators

Syno: Strukturierte Synthese für neurale Operatoren

Syno:神经算子的结构化合成

2410.23745v2

07-27

Accelerating Containerized Service Delivery at the Network Edge

Beschleunigen der containerisierten Service-Lieferung am Netzwerkrand

加速网络边缘的容器化服务交付

2507.20116v1

07-27

Data-Locality-Aware Task Assignment and Scheduling for Distributed Job Executions

Daten-Lokalität-Bewusste Aufgabe Zuordnung und Planung für verteilte Job-Executionen

面向分布式作业执行的数据局部性感知任务分配与调度

2407.08584v4

07-26 (6)

Racing to Idle: Energy Efficiency of Matrix Multiplication on Heterogeneous CPU and GPU Architectures

Racing to Idle: Energieeffizienz der Matrix-Multiplikation auf heterogenen CPU- und GPU-Architekturen

Racing to Idle:异构 CPU 和 GPU 架构上矩阵乘法的能效

2507.20063v1

07-26

$K^4$: Online Log Anomaly Detection Via Unsupervised Typicality Learning

$K^4$: Online Log Anomalienerkennung durch unüberwachtes Lernen

$K^4$:通过无监督典型性学习进行在线日志异常检测

2507.20051v1

07-26

Parallel Hierarchical Agglomerative Clustering in Low Dimensions

Paralleles hierarchisches agglomeratives Clustering in niedrigen Dimensionen

低维空间中的并行层次凝聚聚类

2507.20047v1

07-26

MTASet: A Tree-based Set for Efficient Range Queries in Update-heavy Workloads

MTASet: Ein Baum-basiertes Set für effiziente Reichweitenfragen in Update-schweren Workloads

MTASet:面向更新密集型工作负载中高效范围查询的树形集合

2507.20041v1

07-26

MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism

MegaScale-Infer: Servieren von Mixture-of-Experts auf Scale mit disaggregierten Experten-Parallelismus

MegaScale-Infer:利用解耦的专家并行大规模服务混合专家模型

2504.02263v4

07-26

Offloading tracing for real-time systems using a scalable cloud infrastructure

Offloading-Nachverfolgung für Echtzeit-Systeme mit einer skalierbaren Cloud-Infrastruktur

使用可缩放云层基础设施卸载实时系统的实时跟踪跟踪

2507.19953v1

07-26

A Fast Parallel Median Filtering Algorithm Using Hierarchical Tiling

Ein schneller paralleler Medianfilter-Algorithmus mit hierarchischem Tiling

一种使用分层分块的快速并行中值滤波算法

2507.19926v1

07-26

MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation

MultiKernelBench: Ein Multi-Platform Benchmark für die Kernel-Generation

MultiKernelBench:面向内核生成的多平台基准

2507.17773v2

07-26

MegatronApp: Efficient and Comprehensive Management on Distributed LLM Training

MegatronApp: Effizientes und umfassendes Management auf verteilten LLM-Schulungen

MegatronApp:分布式 LLM 训练的高效全面管理

2507.19845v1

07-26

CleANN: Efficient Full Dynamism in Graph-based Approximate Nearest Neighbor Search

CleANN: Effiziente volle Dynamik in der graphbasierten approximativen Nächste-Nachbarn-Suche

CleANN:基于图的近似最近邻搜索中的高效完全动态性

2507.19802v1

07-26

Accelerating Matrix Multiplication: A Performance Comparison Between Multi-Core CPU and GPU

Beschleunigung der Matrix-Multiplikation: Ein Leistungsvergleich zwischen Multi-Core-CPU und GPU

加速矩阵乘法:多核 CPU 与 GPU 的性能比较

2507.19723v1

07-25 (5)

Oranits: Mission Assignment and Task Offloading in Open RAN-based ITS using Metaheuristic and Deep Reinforcement Learning

Oranits: Missionszuweisung und Aufgabe-Offloading in Open RAN-basierten ITS mit Hilfe von Metaheuristic und Deep Reinforcement Learning

Oranits:利用元启发式和深度强化学习在基于 Open RAN 的 ITS 中进行任务指派与任务卸载

2507.19712v1

07-25

Improved Distributed Algorithms for Random Colorings

Verbesserte verteilte Algorithmen für Random Colorings

改进的随机着色分布式算法

2309.07859v3

07-25

Quantifying the Performance Gap for Simple Versus Optimal Dynamic Server Allocation Policies

Quantifizierung der Performance Gap für einfache Versus Optimal Dynamic Server Allocation Richtlinien

量化简单与最优动态服务器分配策略之间的性能差距

2507.19667v1

07-25

Efficient and Scalable Agentic AI with Heterogeneous Systems

Effiziente und skalierbare Agentische KI mit Heterogenen Systemen

基于异构系统的高效可扩展智能体 AI

2507.19635v1

07-25

An OpenSource CI/CD Pipeline for Variant-Rich Software-Defined Vehicles

Eine OpenSource CI/CD Pipeline für Variant-Rich Software-definierte Fahrzeuge

面向多变体软件定义车辆的开源 CI/CD 流水线

2507.19446v1

07-25

SDVDiag: A Modular Platform for the Diagnosis of Connected Vehicle Functions

SDVDiag: Modulare Plattform für die Diagnose von vernetzten Fahrzeugfunktionen

SDVDiag: 连接车辆功能诊断模块平台

2507.19403v1

07-25

Big Data Energy Systems: A Survey of Practices and Associated Challenges

Big Data Energy Systems: Eine Übersicht über Praktiken und damit verbundene Herausforderungen

大数据能源系统:做法和相关挑战概览

2507.19154v1

07-25

Urban Green Governance: IoT-Driven Management and Enhancement of Urban Green Spaces in Campobasso

Urban Green Governance: IoT-getriebenes Management und Verbesserung städtischer Grünflächen in Campobasso

城市绿色治理:在坎波巴索管理和加强城市绿色空间

2507.12106v4

07-25

A New One-Shot Federated Learning Framework for Medical Imaging Classification with Feature-Guided Rectified Flow and Knowledge Distillation

Ein neues eins-Shot-Federated-Learning-Framework für die Klassifizierung medizinischer Bildgebung mit funktionsgeführter rektifizierter Strömung und Wissensdestillation

一种用于医学影像分类的新型单轮联邦学习框架:特征引导的整流流与知识蒸馏

2507.19045v1

07-25

GPUnion: Autonomous GPU Sharing on Campus

GPUnion: Autonomer GPU-Sharing auf dem Campus

GPUnion:校园内的自主 GPU 共享

2507.18928v1

07-25

RailX: A Flexible, Scalable, and Low-Cost Network Architecture for Hyper-Scale LLM Training Systems

RailX: Eine flexible, skalierbare und kostenarme Netzwerkarchitektur für Hyper-Scale LLM Trainingssysteme

RailX:面向超大规模 LLM 训练系统的灵活、可扩展、低成本网络架构

2507.18889v1

07-25

Fully Energy-Efficient Randomized Backoff: Slow Feedback Loops Yield Fast Contention Resolution

Vollenergieeffizienter Randomized Backoff: Langsame Rückkopplungsschleifen liefern schnelle Streitbeilegung

完全节能的随机退避:慢反馈回路带来快速的争用解决

2302.07751v5

07-25

Deadline-Aware Joint Task Scheduling and Offloading in Mobile Edge Computing Systems

Deadline-Aware Joint Task Planung und Offloading in Mobile Edge Computing Systemen

移动边缘计算系统中截止期限感知的联合任务调度与卸载

2507.18864v1

07-24 (4)

FedVSR: Towards Model-Agnostic Federated Learning in Video Super-Resolution

FedVSR: Auf dem Weg zu einem modell-agnostischen Federated Learning in Video Super-Resolution

FedVSR:迈向视频超分辨率中模型无关的联邦学习

2503.13745v2

07-24

Performance in solving the Hermitian and pseudo-Hermitian Bethe-Salpeter equation with the Yambo code

Leistung bei der Lösung der Hermitian und Pseudo-Hermitian Bethe-Salpeter-Gleichung mit dem Yambo-Code

使用 Yambo 代码求解厄米和伪厄米 Bethe-Salpeter 方程的性能

2504.10096v2

07-24

PPipe: Efficient Video Analytics Serving on Heterogeneous GPU Clusters via Pool-Based Pipeline Parallelism

PPipe: Effiziente Videoanalyse mit heterogenen GPU-Clustern über Pool-Based Pipeline Parallelismus

PPipe:通过基于池的流水线并行在异构 GPU 集群上提供高效视频分析服务

2507.18748v1

07-24

CUTHERMO: Understanding GPU Memory Inefficiencies with Heat Map Profiling

CUTHERMO: GPU-Speicher-Ineffizienzen mit Wärmekartenprofilierung verstehen

CUTHERMO:通过热图分析理解 GPU 内存低效问题

2507.18729v1

07-24

AI Flow: Perspectives, Scenarios, and Approaches

AI Flow: Perspektiven, Szenarien und Ansätze

AI 流动:观点、设想和方法

2506.12479v3

07-24

Distributed Load Balancing with Workload-Dependent Service Rates

Distributed Load Balancing mit Workload-Dependent-Service-Raten

服务速率依赖于工作负载的分布式负载均衡

2411.17103v2

07-24

Towards Designing an Energy Aware Data Replication Strategy for Cloud Systems Using Reinforcement Learning

Auf dem Weg zu einer Strategie für eine energiebewusste Datenreplikation für Cloud-Systeme mittels Verstärkungslernen

为利用强化学习的云层系统设计一个有能源意识的数据复制战略

2507.18459v1

07-24

DiP: A Scalable, Energy-Efficient Systolic Array for Matrix Multiplication Acceleration

DiP: Ein skalierbarer, energieeffizienter Systolischer Array für Matrix-Multiplikationsbeschleunigung

DiP:一种用于加速矩阵乘法的可扩展、高能效脉动阵列

2412.09709v3

07-24

FMI Meets SystemC: A Framework for Cross-Tool Virtual Prototyping

FMI trifft SystemC: Ein Rahmen für das Cross-Tool Virtual Prototyping

FMI 遇见 SystemC:跨工具虚拟原型框架

2507.18339v1

07-24

Staleness-Centric Optimizations for Parallel Diffusion MoE Inference

Staleness-Centric Optimierungen für parallele Diffusion MoE-Inferenz

面向并行扩散 MoE 推理的以陈旧度为中心的优化

2411.16786v3

07-24

A large-scale distributed parallel discrete event simulation engines based on Warped2 for Wargaming simulation

Eine großflächige verteilte parallele diskrete Event-Simulations-Engines basierend auf Warped2 für Wargaming-Simulation

基于 Warped2 的大规模分布式并行离散事件仿真引擎,用于兵棋推演仿真

2507.18050v1

07-24

FCPO: Federated Continual Policy Optimization for Real-Time High-Throughput Edge Video Analytics

FCPO:面向实时高吞吐边缘视频分析的联邦持续策略优化

2507.18047v1

07-24

PPFPL: Cross-silo Privacy-preserving Federated Prototype Learning Against Data Poisoning Attacks on Non-IID Data

PPFPL: Cross-Silo datenschutzwahrendes Federated Prototype Learning gegen Data-Poisoning-Angriffe auf Nicht-IID-Daten

PPFPL:针对非 IID 数据上数据投毒攻击的跨孤岛隐私保护联邦原型学习

2504.03173v4

07-24

Cloud Native System for LLM Inference Serving

Cloud Native System für LLM Inferenz Serving

LLM 推断服务云原系统

2507.18007v1

07-24

Unlock the Potential of Fine-grained LLM Serving via Dynamic Module Scaling

Entsperren Sie das Potenzial des feinkörnigen LLM Servierens über Dynamic Module Scaling

通过动态模块缩放来释放精制 LLM 服务的潜力

2507.18006v1

07-24

C-Koordinator: Interference-aware Management for Large-scale and Co-located Microservice Clusters

C-Koordinator: Interference-aware Management für großräumige und Co-Location-Mikroservice-Cluster

C-Koordinator:面向大规模混部微服务集群的干扰感知管理

2507.18005v1

Article 0

Title@2025-07-31 (4): The ArborX library: version 2.0

Title: The ArborX library: version 2.0

Die ArborX-Bibliothek: Version 2.0

ArborX 库:2.0 版 2507.23700v1

Authors (4): Andrey Prokopenko, Daniel Arndt, Damien Lebrun-Grandié, Bruno Turcksin

This paper provides an overview of the 2.0 release of the ArborX library, a performance portable geometric search library based on Kokkos. We describe the major changes in ArborX 2.0 including a new interface for the library to support a wider range of user problems, new search data structures (brute force, distributed), support for user functions to be executed on the results (callbacks), and an expanded set of the supported algorithms (ray tracing, clustering).

本文概述了 ArborX 库的 2.0 版本。ArborX 是一个基于 Kokkos 的性能可移植几何搜索库。我们介绍了 ArborX 2.0 的主要变化,包括:为支持更广泛的用户问题而设计的新接口、新的搜索数据结构(暴力搜索、分布式)、对在搜索结果上执行用户函数(回调)的支持,以及扩充后的受支持算法集(光线追踪、聚类)。
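The idea of a brute-force search structure that runs a user function on each result can be illustrated with a minimal sketch. This is plain Python, not the ArborX or Kokkos API; the function name `brute_force_within` and the callback signature are hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float, float]


def brute_force_within(
    points: Sequence[Point],
    query: Point,
    radius: float,
    callback: Callable[[int, Point], None],
) -> None:
    """Call callback(index, point) for every point within `radius` of `query`.

    Hypothetical helper: it checks every point (brute force) instead of
    traversing a tree, and hands each match to the user-supplied callback
    rather than materializing a result list.
    """
    for index, point in enumerate(points):
        if math.dist(point, query) <= radius:
            callback(index, point)


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 4.0, 0.0)]
hits: List[int] = []
brute_force_within(points, (0.0, 0.0, 0.0), 1.5, lambda i, p: hits.append(i))
```

Passing a callback lets the caller decide what to do with each match (count, accumulate, filter) without the search routine allocating storage for all results.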

Article 1

Title@2025-07-31 (4): Satellite Federated Fine-Tuning for Foundation Models in Space Computing Power Networks

Title: Satellite Federated Fine-Tuning for Foundation Models in Space Computing Power Networks

Satelliten-Federated Fine-Tuning für Basismodelle in Weltraum Computing Power Networks

空间算力网络中面向基础模型的卫星联邦微调 2504.10403v3

Authors (6): Yan Zhu, Jingyang Zhu, Ting Wang, Yuanming Shi, Chunxiao Jiang, Khaled Ben Letaief

Advancements in artificial intelligence (AI) and low-earth orbit (LEO) satellites have promoted the application of large remote sensing foundation models for various downstream tasks. However, direct downloading of these models for fine-tuning on the ground is impeded by privacy concerns and limited bandwidth. Satellite federated learning (FL) offers a solution by enabling model fine-tuning directly on-board satellites and aggregating model updates without data downloading. Nevertheless, for large foundation models, the computational capacity of satellites is insufficient to support effective on-board fine-tuning in traditional satellite FL frameworks. To address these challenges, we propose a satellite-ground collaborative federated fine-tuning framework. The key of the framework lies in how to reasonably decompose and allocate model components to alleviate insufficient on-board computation capabilities. During fine-tuning, satellites exchange intermediate results with ground stations or other satellites for forward propagation and back propagation, which brings communication challenges due to the special communication topology of space transmission networks, such as intermittent satellite-ground communication, short duration of satellite-ground communication windows, and unstable inter-orbit inter-satellite links (ISLs). To reduce transmission delays, we further introduce tailored communication strategies that integrate both communication and computing resources. Specifically, we propose a parallel intra-orbit communication strategy, a topology-aware satellite-ground communication strategy, and a latency-minimalization inter-orbit communication strategy to reduce space communication costs. Simulation results demonstrate significant reductions in training time with improvements of approximately 33%.

人工智能(AI)和低地球轨道(LEO)卫星的进步推动了大型遥感基础模型在各类下游任务中的应用。然而,出于隐私顾虑和带宽限制,难以将这些模型直接下载到地面进行微调。卫星联邦学习(FL)提供了一种解决方案:直接在星上对模型进行微调,并在不下载数据的情况下聚合模型更新。但对于大型基础模型,卫星的计算能力不足以在传统卫星 FL 框架下支持有效的星上微调。为应对这些挑战,我们提出了一个星地协同的联邦微调框架,其关键在于如何合理地拆分和分配模型组件,以缓解星上计算能力的不足。在微调过程中,卫星需要与地面站或其他卫星交换中间结果以完成前向传播和反向传播;由于空间传输网络特殊的通信拓扑,例如星地通信的间歇性、星地通信窗口持续时间短以及轨道间星间链路(ISL)不稳定,这带来了通信方面的挑战。为降低传输时延,我们进一步提出了整合通信与计算资源的定制通信策略。具体而言,我们提出了并行的轨道内通信策略、拓扑感知的星地通信策略以及时延最小化的轨道间通信策略,以降低空间通信开销。仿真结果表明训练时间显著缩短,改善幅度约为 33%。

Article 2

Title@2025-07-31 (4): Parallel Split Learning with Global Sampling

Title: Parallel Split Learning with Global Sampling

Paralleles Split-Lernen mit globaler Probenahme

与全球抽样平行拆分学习 2407.15738v4

Authors (4): Mohammad Kohankhaki, Ahmad Ayad, Mahdi Barhoush, Anke Schmeink

Distributed deep learning in resource-constrained environments faces scalability and generalization challenges due to large effective batch sizes and non-identically distributed client data. We introduce a server-driven sampling strategy that maintains a fixed global batch size by dynamically adjusting client-side batch sizes. This decouples the effective batch size from the number of participating devices and ensures that global batches better reflect the overall data distribution. Using standard concentration bounds, we establish tighter deviation guarantees compared to existing approaches. Empirical results on a benchmark dataset confirm that the proposed method improves model accuracy, training efficiency, and convergence stability, offering a scalable solution for learning at the network edge.

资源受限环境中的分布式深度学习面临可扩展性和泛化方面的挑战,其原因在于有效批量过大以及客户端数据非同分布。我们提出了一种由服务器驱动的采样策略,通过动态调整客户端的批量大小来保持全局批量大小固定。这使有效批量大小与参与设备的数量解耦,并确保全局批次更好地反映总体数据分布。利用标准的集中不等式界,我们建立了比现有方法更紧的偏差保证。在基准数据集上的实验结果证实,所提方法提高了模型精度、训练效率和收敛稳定性,为网络边缘的学习提供了可扩展的解决方案。
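One way to picture a server that keeps the global batch size fixed while varying client batch sizes is sketched below. The abstract does not specify the sampling distribution, so the uniform draw without replacement from the pooled client data is an assumption, and the function name `sample_client_batch_sizes` is hypothetical.

```python
import random
from collections import Counter
from typing import List


def sample_client_batch_sizes(
    client_sizes: List[int], global_batch: int, rng: random.Random
) -> List[int]:
    """Return one batch size per client; the sizes always sum to `global_batch`.

    Assumption: the server draws `global_batch` samples uniformly without
    replacement from the pooled data and only counts how many fall on each
    client, so larger clients contribute more on average.
    """
    if global_batch > sum(client_sizes):
        raise ValueError("global batch exceeds the total number of samples")
    # owners[i] is the client holding pooled sample i.
    owners = [k for k, n in enumerate(client_sizes) for _ in range(n)]
    counts = Counter(rng.sample(owners, global_batch))
    return [counts.get(k, 0) for k in range(len(client_sizes))]


sizes = sample_client_batch_sizes([500, 120, 30], global_batch=64,
                                  rng=random.Random(0))
```

Because the total is fixed at `global_batch`, adding or removing clients changes only how the batch is divided, not the effective batch size seen by the model.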

Article 3

Title@2025-07-31 (4): Beyond Optimal Fault Tolerance

Title: Beyond Optimal Fault Tolerance

Jenseits der optimalen Fehlertoleranz

超越最优容错 2501.06044v7

Authors (2): Andrew Lewis-Pye, Tim Roughgarden

The optimal fault-tolerance achievable by any protocol has been characterized in a wide range of settings. For example, for state machine replication (SMR) protocols operating in the partially synchronous setting, it is possible to simultaneously guarantee consistency against $\alpha$-bounded adversaries (i.e., adversaries that control less than an $\alpha$ fraction of the participants) and liveness against $\beta$-bounded adversaries if and only if $\alpha + 2\beta \leq 1$. This paper characterizes to what extent “better-than-optimal” fault-tolerance guarantees are possible for SMR protocols when the standard consistency requirement is relaxed to allow a bounded number $r$ of consistency violations. We prove that bounding rollback is impossible without additional timing assumptions and investigate protocols that tolerate and recover from consistency violations whenever message delays around the time of an attack are bounded by a parameter $\Delta^*$ (which may be arbitrarily larger than the parameter $\Delta$ that bounds post-GST message delays in the partially synchronous model). Here, a protocol’s fault-tolerance can be a non-constant function of $r$, and we prove, for each $r$, matching upper and lower bounds on the optimal “recoverable fault-tolerance” achievable by any SMR protocol. For example, for protocols that guarantee liveness against 1/3-bounded adversaries in the partially synchronous setting, a 5/9-bounded adversary can always cause one consistency violation but not two, and a 2/3-bounded adversary can always cause two consistency violations but not three. Our positive results are achieved through a generic “recovery procedure” that can be grafted on to any accountable SMR protocol and restores consistency following a violation while rolling back only transactions that were finalized in the previous $2\Delta^*$ timesteps.

任何协议所能达到的最优容错能力已在多种设定下得到刻画。例如,对于在部分同步设定下运行的状态机复制(SMR)协议,当且仅当 $\alpha + 2\beta \leq 1$ 时,才能同时保证对 $\alpha$ 有界敌手(即控制的参与者比例小于 $\alpha$ 的敌手)的一致性以及对 $\beta$ 有界敌手的活性。本文刻画了当标准一致性要求被放宽、允许出现至多 $r$ 次一致性违例时,SMR 协议能在多大程度上获得“优于最优”的容错保证。我们证明,若不引入额外的时序假设,就无法对回滚加以限制;并研究了这样一类协议:只要攻击发生前后的消息延迟以参数 $\Delta^*$ 为界(该参数可以任意大于部分同步模型中约束 GST 之后消息延迟的参数 $\Delta$),协议就能容忍一致性违例并从中恢复。此时,协议的容错能力可以是 $r$ 的非常数函数;对于每个 $r$,我们证明了任何 SMR 协议所能达到的最优“可恢复容错能力”的匹配上下界。例如,对于在部分同步设定下保证对 1/3 有界敌手具有活性的协议,5/9 有界敌手总能造成一次一致性违例,但无法造成两次;2/3 有界敌手总能造成两次一致性违例,但无法造成三次。我们的正面结果通过一个通用的“恢复过程”实现:它可以嫁接到任何可问责的 SMR 协议之上,在违例发生后恢复一致性,并且只回滚在此前 $2\Delta^*$ 个时间步内最终确定的交易。
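The classical trade-off $\alpha + 2\beta \leq 1$ quoted in the abstract can be checked with exact fractions. The sketch below covers only that classical condition; it does not encode the paper's bounds as a function of $r$, which the abstract does not state in closed form. Both function names are hypothetical.

```python
from fractions import Fraction


def classical_smr_feasible(alpha: Fraction, beta: Fraction) -> bool:
    """True iff alpha + 2*beta <= 1, the classical partially synchronous bound."""
    return alpha + 2 * beta <= 1


def max_consistency_bound(beta: Fraction) -> Fraction:
    """Largest alpha compatible with liveness against beta-bounded adversaries."""
    return 1 - 2 * beta


third = Fraction(1, 3)
classical_limit = max_consistency_bound(third)
# The 5/9 and 2/3 adversaries from the abstract lie beyond the classical limit
# for beta = 1/3, which is why they can cause consistency violations.
beyond = [a for a in (Fraction(5, 9), Fraction(2, 3))
          if not classical_smr_feasible(a, third)]
```

With $\beta = 1/3$ the classical limit on $\alpha$ is $1 - 2/3 = 1/3$, so the 5/9-bounded and 2/3-bounded adversaries in the abstract are exactly the regime where zero violations cannot be guaranteed.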

Article 4

Title@2025-07-31 (4): Consistent Point Matching

Title: Consistent Point Matching

Konsistente Punktgleichung

统一点匹配 2507.23609v1

Authors (2): Halid Ziya Yerebakan, Gerardo Hermosillo Valadez

This study demonstrates that incorporating a consistency heuristic into the point-matching algorithm \cite{yerebakan2023hierarchical} improves robustness in matching anatomical locations across pairs of medical images. We validated our approach on diverse longitudinal internal and public datasets spanning CT and MRI modalities. Notably, it surpasses state-of-the-art results on the Deep Lesion Tracking dataset. Additionally, we show that the method effectively addresses landmark localization. The algorithm operates efficiently on standard CPU hardware and allows configurable trade-offs between speed and robustness. The method enables high-precision navigation between medical images without requiring a machine learning model or training data.

本研究表明,在点匹配算法 \cite{yerebakan2023hierarchical} 中引入一致性启发式,可以提高在成对医学图像之间匹配解剖位置的鲁棒性。我们在涵盖 CT 和 MRI 模态的多种纵向内部数据集和公开数据集上验证了该方法。值得注意的是,它在 Deep Lesion Tracking 数据集上超过了现有最佳结果。此外,我们还表明该方法能够有效地解决解剖标志点定位问题。该算法可在标准 CPU 硬件上高效运行,并允许在速度与鲁棒性之间进行可配置的权衡。该方法无需机器学习模型或训练数据,即可在医学图像之间实现高精度导航。

Article 5

Title@2025-07-31 (4): Threshold-Driven Streaming Graph: Expansion and Rumor Spreading

Title: Threshold-Driven Streaming Graph: Expansion and Rumor Spreading

Threshold-Driven Streaming Graph: Expansion und Gerüchte Verbreitung

阈值驱动流图:扩展和谣言扩散 2507.23533v1

Authors (5): Flora Angileri, Andrea Clementi, Emanuele Natale, Michele Salvi, Isabella Ziccardi

A randomized distributed algorithm called RAES was introduced in [Becchetti et al., SODA 2020] to extract a bounded-degree expander from a dense $n$-vertex expander graph $G = (V, E)$. The algorithm relies on a simple threshold-based procedure. A key assumption in [Becchetti et al., SODA 2020] is that the input graph $G$ is static (i.e., both its vertex set $V$ and edge set $E$ remain unchanged throughout the process), while the analysis of RAES in dynamic models is left as a major open question. In this work, we investigate the behavior of RAES under a dynamic graph model induced by a streaming node-churn process (also known as the sliding window model), where, at each discrete round, a new node joins the graph and the oldest node departs. This process yields a bounded-degree dynamic graph $\mathcal{G} = \{ G_t = (V_t, E_t) : t \in \mathbb{N} \}$ that captures essential characteristics of peer-to-peer networks, specifically node churn and a threshold on the number of connections each node can manage. We prove that every snapshot $G_t$ in the dynamic graph sequence has good expansion properties with high probability. Furthermore, we leverage this property to establish a logarithmic upper bound on the completion time of the well-known PUSH and PULL rumor spreading protocols over the dynamic graph $\mathcal{G}$.

[Becchetti et al., SODA 2020] 提出了一种名为 RAES 的随机分布式算法,用于从稠密的 $n$ 顶点扩展图 $G = (V, E)$ 中提取一个有界度扩展图。该算法基于一个简单的阈值过程。[Becchetti et al., SODA 2020] 的一个关键假设是输入图 $G$ 是静态的,即其顶点集 $V$ 和边集 $E$ 在整个过程中保持不变,而 RAES 在动态模型中的分析则被留作一个重要的开放问题。在本文中,我们研究 RAES 在由流式节点更替过程(也称为滑动窗口模型)所导出的动态图模型下的行为:在每个离散轮次,一个新节点加入图,同时最老的节点离开。该过程产生一个有界度动态图 $\mathcal{G} = \{ G_t = (V_t, E_t) : t \in \mathbb{N} \}$,它刻画了对等网络的基本特征,具体而言即节点更替以及每个节点可管理的连接数上限。我们证明,动态图序列中的每个快照 $G_t$ 都以高概率具有良好的扩展性质。此外,我们利用这一性质,为著名的 PUSH 和 PULL 谣言传播协议在动态图 $\mathcal{G}$ 上的完成时间建立了对数上界。
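The sliding-window churn combined with a threshold on accepted connections can be simulated in a few lines. This is a simplified sketch, not the RAES procedure as analysed in the paper: the parameters `d` (target out-links per node) and `c` (so that a node accepts at most `c * d` incoming links), the complete underlying graph, and the per-request acceptance rule are all assumptions made for illustration.

```python
import random
from collections import deque
from typing import Dict, Set, Tuple


def simulate_streaming_links(
    n: int, d: int, c: int, rounds: int, seed: int = 0
) -> Tuple[Dict[int, Set[int]], Dict[int, int]]:
    """Simulate node churn over a window of n nodes with degree thresholds.

    Each round the oldest node departs and a new node joins. Every node then
    tries to refill its out-links up to d by asking uniformly random nodes,
    which accept only while their in-degree is below the threshold c * d.
    """
    rng = random.Random(seed)
    window = deque(range(n))
    out_links: Dict[int, Set[int]] = {v: set() for v in window}
    in_deg: Dict[int, int] = {v: 0 for v in window}
    next_id = n
    for _ in range(rounds):
        # Churn step: remove the oldest node and every link touching it.
        oldest = window.popleft()
        for target in out_links.pop(oldest):
            in_deg[target] -= 1
        del in_deg[oldest]
        for links in out_links.values():
            links.discard(oldest)
        # A fresh node joins with no links.
        window.append(next_id)
        out_links[next_id] = set()
        in_deg[next_id] = 0
        next_id += 1
        # Request step: one attempt per missing out-link.
        alive = list(window)
        for v in alive:
            for _ in range(d - len(out_links[v])):
                u = rng.choice(alive)
                if u != v and u not in out_links[v] and in_deg[u] < c * d:
                    out_links[v].add(u)
                    in_deg[u] += 1
    return out_links, in_deg


out_links, in_deg = simulate_streaming_links(n=50, d=3, c=2, rounds=200)
```

By construction every node keeps at most `d` out-links and at most `c * d` in-links, so each snapshot has bounded degree; whether the snapshots are expanders is the question the paper addresses and is not checked here.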

Article 6

Title@2025-07-31 (4): Towards Serverless Processing of Spatiotemporal Big Data Queries

Title: Towards Serverless Processing of Spatiotemporal Big Data Queries

Auf dem Weg zur serverlosen Verarbeitung von raumzeitlichen Big Data-Abfragen

迈向无服务器处理斯帕蒂奥多时大数据查询 2507.06005v2

Authors (3): Diana Baumann, Tim C. Rese, David Bermbach

Spatiotemporal data are being produced in continuously growing volumes by a variety of data sources, and a variety of application fields rely on rapid analysis of such data. Existing systems such as PostGIS or MobilityDB usually build on relational database systems, thus inheriting their scale-out characteristics. As a consequence, big spatiotemporal data scenarios still have limited support even though many query types can easily be parallelized. In this paper, we propose our vision of a native serverless data processing approach for spatiotemporal data: We break down queries into small subqueries, which then leverage the near-instant scaling of Function-as-a-Service platforms to execute in parallel. With this, we partially solve the scalability needs of big spatiotemporal data processing.

各种数据来源和各种应用领域正在以快速分析这些数据为基础,不断增长的量中生成随机数据,现有系统,如PostGIS或流动DB,通常以关系数据库系统为基础,从而继承其扩展特点。因此,大型时空数据假设方案仍然得到的支持有限,尽管许多查询类型可以很容易地平行。在本文中,我们提出了对空间时空数据采用本地服务器无数据处理方法的愿景:我们将查询分解成小小小小问题,然后利用功能-服务平台的近速缩放平行执行。因此,我们部分解决了大型时空数据处理的可扩展性需求。
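The idea of breaking a query into subqueries that run in parallel can be sketched as follows. This is only an illustration under stated assumptions: the query is a hypothetical "count points inside a bounding box and time range", the split is along the time axis, and a local thread pool stands in for Function-as-a-Service invocations. None of the names below come from the paper.

```python
from concurrent.futures import ThreadPoolExecutor


def split_time_range(start, end, parts):
    """Split [start, end) into `parts` contiguous half-open integer intervals."""
    bounds = [start + (end - start) * i // parts for i in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def count_in_box(points, box, t_range):
    """Subquery: count (x, y, t) points inside `box` with t in [t0, t1)."""
    (x0, y0, x1, y1), (t0, t1) = box, t_range
    return sum(
        1 for x, y, t in points
        if x0 <= x <= x1 and y0 <= y <= y1 and t0 <= t < t1
    )


def parallel_count(points, box, start, end, parts=4):
    """Run one subquery per time slice in parallel and merge the partial counts."""
    subqueries = split_time_range(start, end, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        partials = pool.map(lambda tr: count_in_box(points, box, tr), subqueries)
        return sum(partials)
```

Counting merges by simple addition; queries whose partial results are harder to combine would need a dedicated merge step.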

Article 7

Title@2025-07-31 (4): Scalable contribution bounding to achieve privacy

Title: Scalable contribution bounding to achieve privacy

Skalierbarer Beitrag zur Wahrung der Privatsphäre

实现隐私的可缩放贡献 2507.23432v1

Authors (4): Vincent Cohen-Addad, Alessandro Epasto, Jason Lee, Morteza Zadimoghaddam

In modern datasets, where single records can have multiple owners, enforcing user-level differential privacy requires capping each user’s total contribution. This “contribution bounding” becomes a significant combinatorial challenge. Existing sequential algorithms for this task are computationally intensive and do not scale to the massive datasets prevalent today. To address this scalability bottleneck, we propose a novel and efficient distributed algorithm. Our approach models the complex ownership structure as a hypergraph, where users are vertices and records are hyperedges. The algorithm proceeds in rounds, allowing users to propose records in parallel. A record is added to the final dataset only if all its owners unanimously agree, thereby ensuring that no user’s predefined contribution limit is violated. This method aims to maximize the size of the resulting dataset for high utility while providing a practical, scalable solution for implementing user-level privacy in large, real-world systems.

在现代数据集中,单项记录可以拥有多个所有者,实施用户一级的差异隐私要求限制每个用户的总贡献。这种“贡献约束”是一个重大的组合性挑战。目前用于这一任务的现有顺序算法在计算上是密集的,与当今普遍存在的大量数据集不相上下。为了解决这一可缩放的瓶颈问题,我们提出了一个新颖而有效的分布算法。我们的方法将复杂的所有制结构建为高压结构,用户是脊椎和记录是高级的。算法以回合方式进行,允许用户以平行方式提出记录。只有当所有拥有者一致同意时,才在最后数据集中添加记录,从而确保不违反任何用户预先定义的贡献限制。这种方法旨在尽量扩大由此产生的数据集的大小,以便高用途,同时为在大型、现实世界系统中实施用户的隐私提供实用、可缩放的解决方案。
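The round structure described in the abstract (users propose records in parallel, a record is kept only if all of its owners agree) can be illustrated with a small single-machine simulation. It is a sketch under assumptions, not the authors' algorithm: users propose their pending records in ascending record-id order, up to their remaining budget, and records with an exhausted owner are dropped. The names `bound_contributions`, `records` and `cap` are hypothetical.

```python
def bound_contributions(records, cap):
    """Select records so that no user owns more than `cap` selected records.

    records: dict mapping a sortable record id to the set of its owners
             (the hyperedge of that record).
    Returns the list of accepted record ids.
    """
    remaining = {u: cap for owners in records.values() for u in owners}
    pending = set(records)
    accepted = []
    while pending:
        # Records with an exhausted owner can never be accepted any more.
        pending = {r for r in pending
                   if all(remaining[u] > 0 for u in records[r])}
        if not pending:
            break
        # Each user proposes its lowest-id pending records, within budget.
        proposals = {}
        for u in remaining:
            mine = sorted(r for r in pending if u in records[r])
            proposals[u] = set(mine[:remaining[u]])
        # A record is accepted only if every owner proposed it this round.
        agreed = sorted(r for r in pending
                        if all(r in proposals[u] for u in records[r]))
        for r in agreed:
            accepted.append(r)
            for u in records[r]:
                remaining[u] -= 1
        pending -= set(agreed)
    return accepted
```

Because a user never proposes more records than its remaining budget, and only proposed records can be accepted, the cap holds after every round. The shared id order guarantees that the lowest-id pending record is accepted in each round, so the loop terminates.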

Article 8

Title@2025-07-31 (4): Towards a Testbed for Scalable FaaS Platforms

Title: Towards a Testbed for Scalable FaaS Platforms

Auf dem Weg zu einem Testbett für skalierbare FaaS-Plattformen

迈向可缩放的 FaaS 平台测试台 2507.23431v1

Authors (2): Trever Schirmer, David Bermbach

Most cloud platforms have a Function-as-a-Service (FaaS) offering that enables users to easily write highly scalable applications. To better understand how the platform’s architecture impacts its performance, we present a research-focused testbed that can be adapted to quickly evaluate the impact of different architectures and technologies on the characteristics of scalability-focused FaaS platforms.

多数云层平台都有一个功能化服务平台(Faas-as-service ) , 使用户能够方便地写出高度可缩放的应用程序。为了更好地了解平台的架构如何影响其性能,我们提出了一个以研究为重点的测试台,可用于快速评估不同架构和技术对以缩放为重点的Faas-service平台特性的影响。

Article 9

Title@2025-07-31 (4): Minos: Exploiting Cloud Performance Variation with Function-as-a-Service Instance Selection

Title: Minos: Exploiting Cloud Performance Variation with Function-as-a-Service Instance Selection

Minos: Nutzung der Cloud-Performance-Variante mit der Funktion-as-a-Service-Instanz-Auswahl

Minos: 利用云性工作表现变化与选择服务性功能项目 2505.12928v2

Authors (5): Trever Schirmer, Valentin Carl, Nils Höller, Tobias Pfandzelter, David Bermbach

Serverless Function-as-a-Service (FaaS) is a popular cloud paradigm to quickly and cheaply implement complex applications. Because the function instances that cloud providers start to execute user code run on shared infrastructure, their performance can vary. From a user perspective, slower instances not only take longer to complete, but also increase cost due to the pay-per-use model of FaaS services where execution duration is billed with microsecond accuracy. In this paper, we present Minos, a system to take advantage of this performance variation by intentionally terminating instances that are slow. Fast instances are not terminated, so that they can be re-used for subsequent invocations. One use case is data processing and machine learning workflows, which often download files as a first step, during which Minos can run a short benchmark. The main part of the function is executed only if the benchmark passes. Otherwise, the request is re-queued and the instance crashes itself, so that the platform has to assign the request to another (potentially faster) instance. In our experiments, this leads to a speedup of up to 13% in the resource-intensive part of a data processing workflow, resulting in up to 4% faster overall performance (and consequently 4% lower cost). Longer and more complex workflows lead to increased savings, as the pool of fast instances is re-used more often. For platforms exhibiting this behavior, users get better performance and save money by wasting more of the platform's resources.

无服务器函数- as- service (FaaS) 是快速且廉价地实施复杂应用程序的流行云型模式。因为功能性云供应商开始在共享基础设施上执行用户代码, 其性能可能各不相同。从用户的角度来看, 较慢的事例不仅需要更长的时间来完成, 而且由于FaaS服务的付费- 使用模式而增加成本, 执行期以微秒的精确度计费。在本文中, 我们向米诺斯展示一个系统, 利用这种性能变异, 故意终止缓慢的平台。快速事例没有终止, 以便它们可以被重新用于以后的行业。一个用于此功能的选项是数据处理和机器学习工作流程, 通常以第一步的方式下载文件文件, 其中米诺斯可以运行一个短的基准。只有当基准过后, 该功能的主要部分才会实际执行。否则, 请求会被重新排队和实例崩溃, 以便平台不得不将请求分配给另一个( 可能更快的) 。在我们的实验中, 快速的事例是, 导致更快地将 13 % 的运行到更高的行为, 在资源快速的流程中, 快速的流程中, 更快的流程中, 导致更快速的运行更快的流程中, 更快的流程更快地处理。
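The control flow described in the abstract (benchmark first, then either run the main work or re-queue and terminate) might look roughly like the sketch below. All names (`handle`, `run_benchmark`, `requeue`, `terminate`, `threshold_s`) are hypothetical, the benchmark is a placeholder CPU loop, and the benchmark runs before the work here rather than alongside a download as in the abstract.

```python
import time


def run_benchmark(iterations=200_000):
    """Placeholder micro-benchmark: time a short CPU-bound loop, in seconds."""
    start = time.perf_counter()
    acc = 0
    for i in range(iterations):
        acc += i * i
    return time.perf_counter() - start


def handle(request, threshold_s, work, requeue, terminate,
           benchmark=run_benchmark):
    """Process `request` only if this instance passes the benchmark.

    work, requeue and terminate are callables supplied by the caller;
    on a real platform, terminate would end the instance process.
    """
    if benchmark() > threshold_s:
        requeue(request)   # let another (potentially faster) instance take it
        terminate()        # give up this slow instance
        return None
    return work(request)   # fast instance: run the main part and stay warm
```

Choosing `threshold_s` is the sensitive part: too strict and most instances are discarded, too lenient and slow instances survive.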

Article 10

Title@2025-07-31 (4): H2SGEMM: Emulating FP32 GEMM on Ascend NPUs using FP16 Units with Precision Recovery and Cache-Aware Optimization

Title: H2SGEMM: Emulating FP32 GEMM on Ascend NPUs using FP16 Units with Precision Recovery and Cache-Aware Optimization

H2SGEMM: Emulieren von FP32 GEMM auf Ascend-NPUs mit FP16-Einheiten mit Präzisionsrückgewinnung und Cache-Aware-Optimierung

H2SGEMM:利用具有精密恢复和缓存优化功能的FP16单位模拟FP32关于升降国家核动力源的GEMMF32 GEMM 2507.23387v1

Authors (7): Weicheng Xue, Baisong Xu, Kai Yang, Yongxiang Liu, Dengdeng Fan, Pengxiang Xu, Yonghong Tian

Low-precision matrix engines, such as FP16 cube, offer high throughput but lack support for full-precision computation. In this work, we propose H2SGEMM, a high-performance algorithm for emulating FP32 general matrix-matrix multiplication (GEMM) using only FP16 computation units on a representative AI accelerator. The method decomposes each FP32 operand into two FP16 values and compensates for numerical errors through a tunable scaling strategy. A detailed analysis of numerical errors, including underflow conditions and precision loss, guides the selection of scaling parameters to preserve up to 22 bits of mantissa accuracy. We further investigate the effect of computation order on accuracy and demonstrate that a term-wise accumulation scheme improves numerical stability over conventional FP32 GEMM in low-exponent regimes. Finally, a cache-aware blocking strategy and double-buffered pipeline are introduced to overlap memory transfers with computation, enabling H2SGEMM to achieve up to 77% of the theoretical FP32-equivalent peak performance on the Ascend 910A NPU, which lacks native FP32 support. Extensive numerical experiments confirm that our method not only recovers the accuracy of native FP32 GEMM but also exhibits superior numerical stability under certain conditions, due to its structured and error-aware computation order.

低精度矩阵引擎,如 FP16 立方体,提供高吞吐量,但缺乏对全精度计算的支持。在这项工作中,我们提议H2SGEMM,这是一个高性能算法,用于在具有代表性的AI 加速器上仅使用 FP16 计算单位来模拟 FP32 通用矩阵矩阵矩阵倍增(GEMM) 。该方法将每个FP32 操作分解成两个FP16 值,并通过一个可捕捉量缩放战略来弥补数字错误。对数字错误,包括流量不足条件和精确损失进行详细分析,指导如何选择缩放参数,以保存最多22位曼特萨准确度。我们进一步调查计算顺序对准确性的影响,并表明在低耗量制度下,定期积累计划比常规的FP32 通用的FP32 通用组合矩阵倍增量计算器(GEMMM) 的稳定性。最后,采用了缓存阻截战略和双压管道,以便将记忆传输与计算重叠,使H2SGEMMMM达到77%的理论等量峰值最高性性工作,以至A910A NPPPP32 的精确度测试中,但又无法根据本地的精确度进行。
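The decomposition of an FP32 operand into two FP16 values with a scaling factor can be tried out in plain Python, since the standard `struct` module can round to half precision. This is a minimal sketch under assumptions, not the H2SGEMM kernel: the scaling factor `SCALE = 2**11` is chosen for illustration, the low-order cross term is dropped, and accumulation happens in Python floats rather than in the accelerator's accumulator.

```python
import struct

SCALE = 2.0 ** 11  # assumed scaling factor, not taken from the paper


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def split_fp32(x):
    """Decompose x into a high FP16 part and a scaled FP16 residual."""
    hi = to_fp16(x)
    lo = to_fp16((x - hi) * SCALE)  # scaling keeps the residual out of underflow
    return hi, lo


def emulated_dot(a, b):
    """Dot product built from FP16 pieces, accumulating each term type separately."""
    main = cross = 0.0
    for x, y in zip(a, b):
        (x_hi, x_lo), (y_hi, y_lo) = split_fp32(x), split_fp32(y)
        main += x_hi * y_hi
        cross += x_hi * y_lo + x_lo * y_hi  # the lo*lo term is dropped
    return main + cross / SCALE
```

Keeping the `hi*hi` sum separate from the cross terms mirrors the idea of term-wise accumulation: the small corrections are added to each other first and only then to the large main sum.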

Article 11

Title@2025-07-31 (4): A Simple $(1-ε)$-Approximation Semi-Streaming Algorithm for Maximum (Weighted) Matching

Title: A Simple $(1-ε)$-Approximation Semi-Streaming Algorithm for Maximum (Weighted) Matching

Ein einfacher $(1-ε)$-Annäherungshalbstrahl-Algorithmus für maximale (gewichtete) Übereinstimmung

用于最大(加权)匹配的简单 $(1-ε)$ 近似半流算法 2307.02968v4

Authors (1): Sepehr Assadi

We present a simple semi-streaming algorithm for $(1-\epsilon)$-approximation of bipartite matching in $O(\log(n)/\epsilon)$ passes. This matches the performance of state-of-the-art “$\epsilon$-efficient” algorithms – the ones with much better dependence on $\epsilon$ albeit with some mild dependence on $n$ – while being considerably simpler. The algorithm relies on a direct application of the multiplicative weight update method with a self-contained primal-dual analysis that can be of independent interest. To showcase this, we use the same ideas, alongside standard tools from matching theory, to present an equally simple semi-streaming algorithm for $(1-\epsilon)$-approximation of weighted matchings in general (not necessarily bipartite) graphs, again in $O(\log(n)/\epsilon)$ passes.

我们提出了一个简单的半流算法,用于二分图匹配的 $(1-\epsilon)$ 近似。

Article 12

Title@2025-07-30 (3): GALE: Leveraging Heterogeneous Systems for Efficient Unstructured Mesh Data Analysis

Title: GALE: Leveraging Heterogeneous Systems for Efficient Unstructured Mesh Data Analysis

GALE: Nutzung heterogener Systeme für effiziente unstrukturierte Mesh-Datenanalyse

GALE:利用异异基因系统进行高效无结构的网目数据分析 2507.15230v3

Authors (4): Guoxi Liu, Thomas Randall, Rong Ge, Federico Iuricich

Unstructured meshes present challenges in scientific data analysis due to irregular distribution and complex connectivity. Computing and storing connectivity information is a major bottleneck for visualization algorithms, affecting both time and memory performance. Recent task-parallel data structures address this by precomputing connectivity information at runtime while the analysis algorithm executes, effectively hiding computation costs and improving performance. However, existing approaches are CPU-bound, forcing the data structure and analysis algorithm to compete for the same computational resources, limiting potential speedups. To overcome this limitation, we introduce a novel task-parallel approach optimized for heterogeneous CPU-GPU systems. Specifically, we offload the computation of mesh connectivity information to GPU threads, enabling CPU threads to focus on executing the visualization algorithm. Following this paradigm, we propose GALE (GPU-Aided Localized data structurE), the first open-source CUDA-based data structure designed for heterogeneous task parallelism. Experiments on two 20-core CPUs and an NVIDIA V100 GPU show that GALE achieves up to 2.7x speedup over state-of-the-art localized data structures while maintaining memory efficiency.

由于分布不规律和复杂的连接性,无结构的缩略图在科学数据分析方面提出了挑战。计算和储存连接信息是视觉化算法的一个主要瓶颈,影响到时间和记忆性性能。最近的任务平行数据结构通过在分析算法执行、有效隐藏计算成本和改善性能的同时在运行时预先计算连接信息来解决这个问题。但是,现有的方法是CPU约束的,迫使数据结构和分析算法为相同的计算资源竞争,限制了潜在的加速。为了克服这一限制,我们引入了一种新的任务平行法,为各种CPU-GPU系统优化了任务平行法。具体地说,我们把网格连接信息的计算卸载到GPU线,使CPU的线索能够专注于执行视觉化算法。遵循这一模式,我们提出了GALE(GPU辅助的本地化数据结构),这是为不同任务平行而设计的首个开放源CUDA数据结构。实验了两个核心的CPU和NVIDIA V100 GPU显示GLE在保持州一级数据存储效率的同时达到2.7x的速度结构。

Article 13

Title@2025-07-30 (3): Data Readiness for Scientific AI at Scale

Title: Data Readiness for Scientific AI at Scale

Datenbereitstellung für wissenschaftliche KI im Maßstab

规模化科学AI 数据准备程度 2507.23018v1

Authors (7): Wesley Brewer, Patrick Widener, Valentine Anantharaj, Feiyi Wang, Tom Beck, Arjun Shankar, Sarp Oral

This paper examines how Data Readiness for AI (DRAI) principles apply to leadership-scale scientific datasets used to train foundation models. We analyze archetypal workflows across four representative domains - climate, nuclear fusion, bio/health, and materials - to identify common preprocessing patterns and domain-specific constraints. We introduce a two-dimensional readiness framework composed of Data Readiness Levels (raw to AI-ready) and Data Processing Stages (ingest to shard), both tailored to high performance computing (HPC) environments. This framework outlines key challenges in transforming scientific data for scalable AI training, emphasizing transformer-based generative models. Together, these dimensions form a conceptual maturity matrix that characterizes scientific data readiness and guides infrastructure development toward standardized, cross-domain support for scalable and reproducible AI for science.

本文探讨了AI(DRAI)原则的数据准备程度如何适用于用于培训基础模型的领导层规模科学数据集。我们分析了四个有代表性的领域――气候、核聚变、生物/健康和材料――的老式工作流程,以确定共同的预处理模式和特定领域的制约因素。我们引入了由数据准备程度水平(拖动至AI-准备就绪)和数据处理阶段(取之于硬)组成的两维准备状态框架,这两个阶段都是针对高性能计算环境的。本框架概述了在将科学数据转换为可升级的AI(HPC)培训、强调以变压器为基础的基因化模型方面所面临的主要挑战。这些层面共同形成了一个概念成熟性矩阵,以科学数据准备状态为特征,并指导基础设施的发展走向标准化、跨位支持可缩放和可复制的AI用于科学。

Article 14

Title@2025-07-30 (3): Towards Federated Learning with On-device Training and Communication in 8-bit Floating Point

Title: Towards Federated Learning with On-device Training and Communication in 8-bit Floating Point

Föderiertes Lernen mit On-Device-Training und Kommunikation im 8-Bit-Schwebepunkt

在8位浮动点进行联邦在职培训和交流 2407.02610v2

Authors (4): Bokun Wang, Axel Berg, Durmus Alp Emre Acar, Chuteng Zhou

Recent work has shown that 8-bit floating point (FP8) can be used for efficiently training neural networks with reduced computational cost compared to training in FP32/FP16. In this work, we investigate the use of FP8 training in a federated learning context. This approach brings not only the usual benefits of FP8 which are desirable for on-device training at the edge, but also reduces client-server communication costs due to significant weight compression. We present a novel method for combining FP8 client training while maintaining a global FP32 server model and provide convergence analysis. Experiments with various machine learning models and datasets show that our method consistently yields communication reductions of at least 2.9x across a variety of tasks and models compared to an FP32 baseline to achieve the same trained model accuracy.

最近的工作表明,8比特浮动点(FP8)可用于有效培训神经网络,与FP32/FP16培训相比,计算成本降低。在这项工作中,我们调查了在联合学习背景下使用FP8培训的情况。这种办法不仅带来FP8的通常好处,这些好处对于边缘的在设备上的培训是可取的,而且由于体重压缩很大,客户-服务器通信费用也有所减少。我们提出了一个新颖的方法,将FP8客户培训结合起来,同时保持全球FP32服务器模型并提供趋同分析。与各种机器学习模型和数据集的实验表明,与FP32基准相比,我们的方法使各种任务和模型的通信量减少至少2.9倍,以实现同样的经过培训的模型准确性。

Article 15

Title@2025-07-30 (3): DSPE: Profit Maximization in Edge-Cloud Storage System using Dynamic Space Partitioning with Erasure Code

Title: DSPE: Profit Maximization in Edge-Cloud Storage System using Dynamic Space Partitioning with Erasure Code

DSPE: Profitmaximierung im Edge-Cloud-Speichersystem mit Dynamic Space Partitioning mit Erasure Code

DSPE: 利用具有时代代码的动态空间分割法在边缘封闭储存系统中实现利润最大化 2507.22801v1

Authors (4): Shubhradeep Roy, Suvarthi Sarkar, Vivek Verma, Aryabartta Sahu

Edge storage systems have emerged as a critical enabler of low-latency data access in modern cloud networks by bringing storage and computation closer to end users. However, the limited storage capacity of edge servers poses significant challenges in handling high-volume and latency-sensitive data access requests, particularly under dynamic workloads. In this work, we propose a profit-driven framework that integrates three key mechanisms: collaborative caching, erasure coding, and elastic storage partitioning. Unlike traditional replication, erasure coding enables space-efficient redundancy, allowing data to be reconstructed from any K of the K + M coded blocks. We dynamically partition each edge server's storage into private and public regions. The private region is further subdivided among access points (APs) based on their incoming request rates, enabling adaptive control over data locality and ownership. We design a data placement and replacement policy that determines how and where to store or evict coded data blocks to maximize data access within deadlines. While the private region serves requests from local APs, the public region handles cooperative storage requests from neighboring servers. Our proposed Dynamic Space Partitioning and Elastic caching strategy is evaluated on both synthetic and real-world traces from Netflix and Spotify. Experimental results show that our method improves overall system profitability by approximately 5 to 8% compared to state-of-the-art approaches under varied workload conditions.

在这项工作中,我们提出了一个利润驱动框架,其中整合了三个关键机制,即合作缓存、消化编码和弹性储存分区。与传统复制不同,消化编码使空间冗余成为现代云端网络低长期数据访问的关键促进因素,使储存和计算更接近终端用户。我们将每个边缘服务器储存能力有限,这在处理大量和延迟敏感数据访问请求方面构成重大挑战,特别是在动态工作量的情况下。我们提议了一个利润驱动框架,将三个关键机制结合起来,即合作封存、消化编码和弹性储存分区。与传统复制不同,消化编码编码使空间节能冗余,从K+M编码区中的K组中的任何组中重建数据。我们把每个边缘服务器的储存动态隔开到私营和公共区域。我们提出的每个边缘服务器的储存能力有限,根据它们收到的请求率进一步细分为接入点,以便能够对数据地点和所有权进行适应性控制。我们设计了一个数据放置和替换政策,确定如何储存或驱逐编码数据区块,以便在最后期限内实现数据访问。虽然私人区域服务当地AP区的要求,但公共区则处理邻接服务器的合作存储请求。我们提议的动态空间分割和缩缩缩存储战略将每个边缘服务器。根据收到的合成和真实数据定位方法,将改进了我们的全球实验系统,以更新了5号检索。根据实验方法,对实验系统进行。根据全球实验性实验性研究。
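The property that data can be rebuilt from any K of the K + M coded blocks is easiest to see in the smallest case, M = 1, where the single redundant block is the XOR of the K data blocks. The sketch below is that special case only and is not the code used in the paper; larger M needs a real erasure code such as Reed-Solomon. The function names are illustrative, and all blocks are assumed to have equal length.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def encode(data_blocks):
    """Return the K data blocks followed by one XOR parity block (M = 1)."""
    return list(data_blocks) + [xor_blocks(data_blocks)]


def reconstruct(coded_blocks):
    """Recover the K data blocks from K + 1 coded blocks, at most one of them None."""
    missing = [i for i, block in enumerate(coded_blocks) if block is None]
    if len(missing) > 1:
        raise ValueError("a single parity block tolerates only one lost block")
    blocks = list(coded_blocks)
    if missing:
        # XOR of all surviving blocks equals the lost block.
        survivors = [block for block in blocks if block is not None]
        blocks[missing[0]] = xor_blocks(survivors)
    return blocks[:-1]
```

With K = 3 this stores 4 blocks instead of the 6 that two-way replication would need, while still surviving the loss of any one block; that storage saving is what the abstract means by space-efficient redundancy.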

Article 16

Title@2025-07-30 (3): A Survey on Large Language Model Acceleration based on KV Cache Management

Title: A Survey on Large Language Model Acceleration based on KV Cache Management

Eine Umfrage über die Beschleunigung von großen Sprachmodellen auf Basis von KV Cache Management

基于 KV 缓存管理大语言模式加速调查 2412.19442v3

Authors (10): Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, Lei Chen

Large Language Models (LLMs) have revolutionized a wide range of domains such as natural language processing, computer vision, and multi-modal tasks due to their ability to comprehend context and perform logical reasoning. However, the computational and memory demands of LLMs, particularly during inference, pose significant challenges when scaling them to real-world, long-context, and real-time applications. Key-Value (KV) cache management has emerged as a critical optimization technique for accelerating LLM inference by reducing redundant computations and improving memory utilization. This survey provides a comprehensive overview of KV cache management strategies for LLM acceleration, categorizing them into token-level, model-level, and system-level optimizations. Token-level strategies include KV cache selection, budget allocation, merging, quantization, and low-rank decomposition, while model-level optimizations focus on architectural innovations and attention mechanisms to enhance KV reuse. System-level approaches address memory management, scheduling, and hardware-aware designs to improve efficiency across diverse computing environments. Additionally, the survey provides an overview of both text and multimodal datasets and benchmarks used to evaluate these strategies. By presenting detailed taxonomies and comparative analyses, this work aims to offer useful insights for researchers and practitioners to support the development of efficient and scalable KV cache management techniques, contributing to the practical deployment of LLMs in real-world applications. The curated paper list for KV cache management is at: https://github.com/TreeAI-Lab/Awesome-KV-Cache-Management.

大型语言模型(LLMS)由于能够理解背景和进行逻辑推理,使自然语言处理、计算机视觉和多模式任务等广泛领域发生革命性变化,然而,LMS的计算和记忆需求,特别是在推断过程中,在将其推广到现实世界、长文本和实时应用程序方面,构成重大挑战。Ky-Value(KV)缓存管理已成为通过减少重复计算和改进记忆利用来加快LLM推断的关键优化技术。这项调查全面概述了KV缓冲管理战略加速LLM的速度,将其分类为象征性的、示范的和系统一级的优化。Token一级战略包括KV缓冲选择、预算分配、合并、四分化和低级变迁,同时在模型一级优化侧重于建筑创新和关注机制,以加强KV再利用。系统级别方法处理记忆管理、时间安排和硬件认知设计,以提高不同计算环境的效率。此外,调查还概述了KVVO-M-MLS-MS-ODRLLLLLLA 应用的文本和MLA-SLSeral-Seral-Servial-Seral-Servial-Servial-Servial-Serviews-Serviews-Serview 和基准,用来评估这些战略的文本的文本和基准。

Article 17

Title@2025-07-30 (3): Leveraging Caliper and Benchpark to Analyze MPI Communication Patterns: Insights from AMG2023, Kripke, and Laghos

Title: Leveraging Caliper and Benchpark to Analyze MPI Communication Patterns: Insights from AMG2023, Kripke, and Laghos

Caliper und Benchpark nutzen, um MPI-Kommunikationsmuster zu analysieren: Einblicke von AMG2023, Kripke und Laghos

利用卡利珀和法官场分析MPI通信模式:来自AMG2023、Kripke和Laghos的透视 2507.22372v1

Authors (9): Grace Nansamba, Evelyn Namugwanya, David Boehme, Dewi Yokelson, Riley Shipley, Derek Schafer, Michael McKinsey, Olga Pearce, Anthony Skjellum

We introduce “communication regions” into the widely used Caliper HPC profiling tool. A communication region is an annotation enabling capture of metrics about the data being communicated (including statistics of these metrics), and metrics about the MPI processes involved in the communications, something not previously possible in Caliper. We explore the utility of communication regions with three representative modeling and simulation applications, AMG2023, Kripke, and Laghos, all part of the comprehensive Benchpark suite that includes Caliper annotations. Enhanced Caliper reveals detailed communication behaviors. Using Caliper and Thicket in tandem, we create new visualizations of MPI communication patterns, including halo exchanges. Our findings reveal communication bottlenecks and detailed behaviors, indicating significant utility of the special-regions addition to Caliper. The comparative scaling behavior of both CPU- and GPU-oriented systems is shown; we are able to look at different regions within a given application, and see how scalability and message-traffic metrics differ.

我们把“通信区域”引入广泛使用的 Caliper HPC 性能分析工具中。通信区域是一种标注,可用于采集所传输数据的度量(包括这些度量的统计信息)以及参与通信的 MPI 进程的度量,这在 Caliper 中以前无法做到。

Article 18

Title@2025-07-30 (3): PS-WL: A Probability-Sensitive Wear Leveling scheme for SSD array scaling

Title: PS-WL: A Probability-Sensitive Wear Leveling scheme for SSD array scaling

PS-WL: Ein Probability-Sensitive Wear Leveling-Schema für die Skalierung von SSD-Arrays

PS-WL: SSD 阵列比例缩放的概率感敏性穿级方案 2506.19660v3

Authors (4): Shuhang Xu, Yunfei Gu, Linhui Liu, Chentao Wu

As flash-based Solid State Drive (SSD) arrays become essential to modern data centers, scaling these arrays to meet explosive data growth is a frequent and critical operation. However, the conventional wear-leveling (WL) paradigm applied during scaling suffers from a fundamental flaw: it ignores the non-linear relationship between wear and failure probability, potentially pushing the most vulnerable, aged disks towards premature failure. To address this critical issue at its root, we propose the Probability-Sensitive Wear Leveling (PS-WL) scheme, which shifts the optimization goal from balancing wear to directly balancing failure risk. At its core, PS-WL introduces an “effective lifetime” model derived from a realistic failure probability to more accurately assess disk lifetime. This model guides a PID controller for the wear-leveling operation, while a conservative zone minimizes performance overhead by restricting warm data migration. Comprehensive simulations validate the superiority of PS-WL over state-of-the-art methods. The results demonstrate that our approach significantly reduces performance overhead while, most critically, consistently and effectively lowering the aggregated array failure risk across diverse system configurations and workloads. This proves that by directly optimizing for reliability, PS-WL builds a scalable storage system that is, by design, fundamentally safer, more efficient, and more stable.

由于基于闪光的固态驱动器(SSD)阵列对现代数据中心至关重要,扩大这些阵列以适应爆炸性数据增长是一项经常和关键的操作。然而,在缩放过程中应用的常规磨损等级(WL)模式存在一个根本性缺陷:它忽视了磨损和故障概率之间的非线性关系,有可能将最脆弱的老磁盘推向过早的失败。为了从根本上解决这一关键问题,我们提议了“概率感应性湿分级(PS-WL)计划 ” , 将优化目标从平衡磨损转向直接平衡故障风险。在其核心方面, PS-WL 引入了一个“ 有效终身” 模式, 其依据是现实性失败概率来更准确地评估磁盘寿命。这个模式指导了PID控制器的磨损操作, 保守区通过限制热数据迁移而最大限度地减少性能管理。全面模拟验证了PS-WL优于最新技术方法的优势。其结果表明,我们的方法极大地降低了绩效管理,同时,最关键地、持续和有效地降低不同系统配置和工作量的汇总阵列失败风险。这个模型通过直接地证明,更稳定的存储是更稳定的安全。

Article 19

Title@2025-07-30 (3): Understanding Power and Energy Utilization in Large Scale Production Physics Simulation Codes

Title: Understanding Power and Energy Utilization in Large Scale Production Physics Simulation Codes

Leistungs- und Energienutzung in großmaßstäblichen Produktionsphysik-Simulationscodes verstehen

大规模生产中了解电力和能源利用情况 2201.01278v2

Authors (11): Adam Bertsch, Michael R. Collette, Shawn A. Dawson, Si D. Hammond, Ian Karlin, M. Scott McKinley, Kevin Pedretti, Robert N. Rieben, Brian S. Ryujin, Arturo Vargas, Kenneth Weiss

Power is an often-cited reason for the move to advanced architectures on the path to Exascale computing. This is due to practical considerations related to delivering enough power to successfully site and operate these machines, as well as concerns about energy usage while running large simulations. Since obtaining accurate power measurements can be challenging, it may be tempting to use the processor thermal design power (TDP) as a surrogate due to its simplicity and availability. However, TDP is not indicative of typical power usage while running simulations. Using commodity and advanced technology systems at Lawrence Livermore and Sandia National Labs, we performed a series of experiments to measure power and energy usage in running simulation codes. These experiments indicate that large scale Lawrence Livermore simulation codes are significantly more efficient than a simple processor TDP model might suggest.

电能是转向快速计算路径上先进结构的一个经常被引用的原因。这是因为与提供足够电力成功定位和操作这些机器有关的实际考虑,以及在进行大型模拟时对能源使用情况的关切。由于获得准确的电量测量可能具有挑战性,因此可能诱人使用处理器热设计动力(TDP)作为替代,因为其简单易得。然而,TDP并不表明模拟过程中的典型电能使用情况。我们利用劳伦斯·利弗莫尔尔和桑迪亚国家实验室的商品和先进技术系统,进行了一系列实验,以测量模拟代码运行中的电能和能源使用情况。这些实验表明,大型Lawrence Livemotorre模拟代码比简单的处理器TDP模型可能显示的效率要高得多。

Article 20

Title@2025-07-30 (3): A Semi-Supervised Federated Learning Framework with Hierarchical Clustering Aggregation for Heterogeneous Satellite Networks

Title: A Semi-Supervised Federated Learning Framework with Hierarchical Clustering Aggregation for Heterogeneous Satellite Networks

Ein semi-überwachtes Federated Learning Framework mit Hierarchical Clustering Aggregation für heterogene Satellitennetzwerke

半上层联邦学习框架,包括异源卫星网络的等级集群聚合 2507.22339v1

Authors (6): Zhuocheng Liu, Zhishu Shen, Qiushi Zheng, Tiehua Zhang, Zheng Lei, Jiong Jin

Low Earth Orbit (LEO) satellites are emerging as key components of 6G networks, with many already deployed to support large-scale Earth observation and sensing-related tasks. Federated Learning (FL) presents a promising paradigm for enabling distributed intelligence in these resource-constrained and dynamic environments. However, achieving reliable convergence, while minimizing both processing time and energy consumption, remains a substantial challenge, particularly in heterogeneous and partially unlabeled satellite networks. To address this challenge, we propose a novel semi-supervised federated learning framework tailored for LEO satellite networks with hierarchical clustering aggregation. To further reduce communication overhead, we integrate sparsification and adaptive weight quantization techniques. In addition, we divide the FL clustering into two stages: a satellite cluster aggregation stage and a Ground Station (GS) aggregation stage. The supervised learning at GSs guides selected Parameter Server (PS) satellites, which in turn support fully unlabeled satellites during the federated training process. Extensive experiments conducted on a satellite network testbed demonstrate that our proposal can significantly reduce processing time (up to 3x) and energy consumption (up to 4x) compared to baseline methods while maintaining model accuracy.

低地球轨道卫星(LEO)正在成为6G网络的关键组成部分,许多卫星已经部署用于支持大规模地球观测和遥感相关任务。Federal Learning(FL)为在资源限制和动态环境中进行分布情报提供了一个很有希望的模式,然而,实现可靠的趋同,同时尽量减少处理时间和能源消耗,仍然是一项重大挑战,特别是在多式和部分无标签的卫星网络中。为了应对这一挑战,我们提议为低地球轨道卫星网络专门设计一个新的半监督的半监督联合学习框架,配有等级集群组合。为了进一步减少通信间接费用,我们整合了超载和适应性重量量化技术。此外,我们将FL集群分为两个阶段:卫星集成阶段和地面站汇总阶段。在GS上监督的学习指导选定的参数服务器(PS)卫星,这反过来又在进化培训过程中支持完全无标签的卫星。在卫星网络试验台进行的广泛实验表明,我们的提案可以大大缩短处理时间(最多为3x)和能源消耗(最多为4x),而同时保持模型准确性。

Article 21

Title@2025-07-30 (3): Hypernetworks for Model-Heterogeneous Personalized Federated Learning

Title: Hypernetworks for Model-Heterogeneous Personalized Federated Learning

Hypernetzwerke für modell-heterogenes personalisiertes Federated Learning

模拟异异异性个性化联邦学习超级网络 2507.22330v1

Authors (5): Chen Zhang, Husheng Li, Xiang Liu, Linshan Jiang, Danxin Wang

Recent advances in personalized federated learning have focused on addressing client model heterogeneity. However, most existing methods still require external data, rely on model decoupling, or adopt partial learning strategies, which can limit their practicality and scalability. In this paper, we revisit hypernetwork-based methods and leverage their strong generalization capabilities to design a simple yet effective framework for heterogeneous personalized federated learning. Specifically, we propose MH-pFedHN, which leverages a server-side hypernetwork that takes client-specific embedding vectors as input and outputs personalized parameters tailored to each client’s heterogeneous model. To promote knowledge sharing and reduce computation, we introduce a multi-head structure within the hypernetwork, allowing clients with similar model sizes to share heads. We further propose MH-pFedHNGD, which integrates an optional lightweight global model to improve generalization. Our framework does not rely on external datasets and does not require disclosure of client model architectures, thereby offering enhanced privacy and flexibility. Extensive experiments on multiple benchmarks and model settings demonstrate that our approach achieves competitive accuracy and strong generalization, and serves as a robust baseline for future research in model-heterogeneous personalized federated learning.

个人化联合会学习的最新进展侧重于解决客户模式差异性,然而,大多数现有方法仍需要外部数据,依赖模式脱钩,或采用部分学习战略,从而限制其实用性和可缩放性。在本文件中,我们重新审视超网络方法,并利用其强大的概括性能力,设计一个简单而有效的框架,供多种个人化联合会化学习使用。具体地说,我们提议MH-pFedHN,利用服务器端超网络,将客户特定的嵌入矢量作为针对每个客户的多元模型的投入和产出个化参数。为了促进知识共享和减少计算,我们在超网络中引入多头结构,使类似型号客户能够分享头目。此外,我们进一步提议MH-PFedHNGD,其中整合一个选择的轻量全球模型,以改进通用。我们的框架不依赖外部数据集,也不要求披露客户模式架构,从而提供强化的隐私和灵活性。在多个基准和模型设置上进行广泛的实验,表明我们的方法在个人基质化的先进性基准中实现了个人基数的可靠性学习。

Article 22

Title@2025-07-30 (3): SP-Chain: Boosting Intra-Shard and Cross-Shard Security and Performance in Blockchain Sharding

Title: SP-Chain: Boosting Intra-Shard and Cross-Shard Security and Performance in Blockchain Sharding

SP-Chain: Stärkung von Intra-Shard und Cross-Shard Sicherheit und Performance in Blockchain Sharding

SP-Chain: 推动碎裂和交叉碎片内部的安全和工作表现 2407.06953v2

Authors (4): Mingzhe Li, You Lin, Wei Wang, Jin Zhang

A promising way to overcome the scalability limitations of current blockchains is sharding, which splits transaction processing among multiple smaller groups of nodes. A well-performing blockchain sharding system requires both high performance and high security from both intra- and cross-shard perspectives. However, existing protocols either struggle to protect security or sacrifice considerable performance for security. In this paper, we propose SP-Chain, a blockchain sharding system with enhanced Security and Performance from both intra- and cross-shard perspectives. For the intra-shard aspect, we design a two-phase concurrent voting scheme to provide high system throughput and low transaction confirmation latency. Moreover, we propose an efficient unbiased leader rotation scheme to ensure high performance under malicious behavior. For the cross-shard aspect, we propose a proof-assisted efficient cross-shard transaction processing mechanism that guards cross-shard transactions with low overhead. We implement SP-Chain based on Harmony and evaluate its performance via large-scale deployment. Extensive evaluations suggest that SP-Chain can process more than 10,000 tx/sec under malicious behaviors with a confirmation latency of 7.6s in a network of 4,000 nodes.

克服当前链条可伸缩性限制的一个有希望的方法是使用碎片,即将交易处理分成多个、较小的节点组。完善的块块分割系统既需要高性能,也需要高安全性能,既需要高性能,也需要高安全性能,既要从内部和跨碎片角度考虑。但是,现有的协议在安全保护方面或为了安全而交换高性能方面都有问题。在本文件中,我们建议SP-Chain(一个为内部和跨碎片视角加强安全和性能的块块分割系统)。关于内部困难方面,我们设计了两阶段同时进行的投票计划,以提供高系统吞吐量和低交易确认耐久性。此外,我们提出了高效的不带偏见的领导人轮换计划,以确保在恶意行为下高性能。对于交叉困难方面,我们提议了一个由证据辅助的高效交叉硬性交易处理机制,以在低管理下监管交叉性交易。我们根据和谐执行SP-Chain(SP-Chain),并通过大规模部署来评估其绩效。广泛的评价表明SP-Chain(SP-Chain)可以处理超过10,000个无恶意行为7.6的网络,需要确认。

Article 23

Title@2025-07-30 (3): Towards Experiment Execution in Support of Community Benchmark Workflows for HPC

Zur Durchführung von Experimenten zur Unterstützung von gemeinschaftlichen Benchmark-Workflows für HPC

争取实验执行以支持高常委会的社区基准工作流程 2507.22294v1

Authors (8): Gregor von Laszewski, Wesley Brewer, Sean R. Wilkinson, Andrew Shao, J. P. Fleischer, Harshad Pitkar, Christine R. Kirkpatrick, Geoffrey C. Fox

A key hurdle is demonstrating compute resource capability with limited benchmarks. We propose workflow templates as a solution, offering adaptable designs for specific scientific applications. Our paper identifies common usage patterns for these templates, drawn from decades of HPC experience, including recent work with the MLCommons Science working group. We found that focusing on simple experiment management tools within the broader computational workflow improves adaptability, especially in education. This concept, which we term benchmark carpentry, is validated by two independent tools: Cloudmesh’s Experiment Executor and Hewlett Packard Enterprise’s SmartSim. Both frameworks, with significant functional overlap, have been tested across various scientific applications, including conduction cloudmask, earthquake prediction, simulation-AI/ML interactions, and the development of computational fluid dynamics surrogates.

一个关键障碍是用有限的基准计算资源能力。我们提出工作流程模板,作为解决方案,为特定科学应用提供适应性设计。我们的文件确定了这些模板的通用使用模式,这些模式来自数十年高管经验,包括最近与康德蒙科学工作组的合作。我们发现,在更广泛的计算工作流程中侧重于简单的实验管理工具,可以提高适应性,特别是在教育领域。这个我们称为基准木工的概念,通过两个独立工具(克劳德梅什的实验执行器和惠普公司SmartSim)得到验证。这两个框架在功能上有很大重叠,在各种科学应用中都经过了测试,包括导电云、地震预测、模拟-AI/ML互动以及计算液动力代孕的开发。

Article 24

Title: Minimizing CGYRO HPC Communication Costs in Ensembles with XGYRO by Sharing the Collisional Constant Tensor Structure

Minimierung der CGYRO HPC-Kommunikationskosten in Ensembles mit XGYRO durch gemeinsame Nutzung der Kollusionskonstanten-Tensor-Struktur

通过共享对齐常数感应结构,最大限度地减少与XGYRO结合的CGYRO HPC HPC 通信费用 2507.22245v1

Authors (3): Igor Sfiligoi, Emily A. Belli, Jeff Candy

First-principles fusion plasma simulations are both compute and memory intensive, and CGYRO is no exception. The use of many HPC nodes to fit the problem in the available memory thus results in significant communication overhead, which is hard to avoid for any single simulation. That said, most fusion studies are composed of ensembles of simulations, so we developed a new tool, named XGYRO, that executes a whole ensemble of CGYRO simulations as a single HPC job. By treating the ensemble as a unit, XGYRO can alter the global buffer distribution logic and apply optimizations that are not feasible on any single simulation, but only on the ensemble as a whole. The main saving comes from the sharing of the collisional constant tensor structure, since its values are typically identical between parameter-sweep simulations. This data structure dominates the memory consumption of CGYRO simulations, so distributing it among the whole ensemble results in drastic memory savings for each simulation, which in turn results in overall lower communication overhead.

首先原理的聚变等离子模拟是计算和记忆密集的,CGYRO也不例外。使用许多HPC节点来适应现有记忆中的问题,从而导致大量的通信管理,这对于任何单一的模拟都很难避免。他说,大多数聚变研究由一系列的模拟组成,因此我们开发了一个新的工具,名为XGYRO,将CGYRO模拟的整组合用作为单一的HPC工作。通过将合用体作为单位处理,XGYRO可以改变全球缓冲分布逻辑,并应用在任何单一模拟中都行不通的优化,但仅限于整个集成体。主要节约来自于共享碰撞常数阵列结构,因为其数值通常在参数扫描模拟之间是相同的。这个数据结构控制着CGYRO模拟的记忆消耗量,因此在全部集成中分配,每个模拟都会产生剧烈的内存节余,而这反过来导致总体的通信顶部。
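
The memory argument in the abstract above can be made concrete with a small back-of-the-envelope calculation. All sizes below are hypothetical and chosen only for illustration, and the even split of the shared tensor across the ensemble is an assumption, not a description of XGYRO's actual distribution logic.

```python
import math

def per_simulation_gb(tensor_gb, other_gb, ensemble_size):
    """Memory one simulation must hold when the shared tensor is split evenly
    across `ensemble_size` simulations (ensemble_size=1 means no sharing)."""
    return tensor_gb / ensemble_size + other_gb

def nodes_per_simulation(memory_gb, node_gb):
    """Smallest number of nodes whose combined memory fits one simulation."""
    return math.ceil(memory_gb / node_gb)

# Hypothetical sizes: the tensor dominates the footprint, as the abstract states.
TENSOR_GB, OTHER_GB, NODE_GB = 900, 100, 250

standalone = per_simulation_gb(TENSOR_GB, OTHER_GB, ensemble_size=1)
shared = per_simulation_gb(TENSOR_GB, OTHER_GB, ensemble_size=8)
print(standalone, nodes_per_simulation(standalone, NODE_GB))  # 1000.0 4
print(shared, nodes_per_simulation(shared, NODE_GB))          # 212.5 1
```

With these made-up numbers, a standalone simulation has to span four nodes while each ensemble member fits on one, which is the mechanism by which a smaller per-simulation footprint can reduce inter-node communication.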

Article 25

Title@2025-07-29 (2): AgileDART: An Agile and Scalable Edge Stream Processing Engine

AgileDART: Eine agile und skalierbare Edge Stream Processing Engine

AGILDART: 一个能动和可扩缩的边缘流处理引擎 2407.14953v3

Authors (7): Cheng-Wei Ching, Xin Chen, Chaeeun Kim, Tongze Wang, Dong Chen, Dilma Da Silva, Liting Hu

Edge applications generate a large influx of sensor data on massive scales, and these massive data streams must be processed promptly to derive actionable intelligence. However, traditional data processing systems are not well-suited for these edge applications as they often do not scale well with a large number of concurrent stream queries, do not support low-latency processing under limited edge computing resources, and do not adapt to the level of heterogeneity and dynamicity commonly present in edge computing environments. As such, we present AgileDart, an agile and scalable edge stream processing engine that enables fast stream processing of many concurrently running low-latency edge applications’ queries at scale in dynamic, heterogeneous edge environments. The novelty of our work lies in two components: a dynamic dataflow abstraction that leverages distributed hash table-based peer-to-peer overlay networks to autonomously place, chain, and scale stream operators to reduce query latencies, adapt to workload variations, and recover from failures; and a bandit-based path planning model that re-plans the data shuffling paths to adapt to unreliable and heterogeneous edge networks. We show that AgileDart outperforms Storm and EdgeWise on query latency and significantly improves scalability and adaptability when processing many real-world edge stream applications’ queries.

边缘应用会大规模地产生海量传感器数据，这些数据流必须被及时处理，才能获得可付诸行动的情报。然而，传统的数据处理系统并不适合这类边缘应用：它们往往难以随大量并发流查询而扩展，无法在有限的边缘计算资源下支持低延迟处理，也不能适应边缘计算环境中常见的异构性和动态性。为此，我们提出 AgileDart，一个敏捷且可扩展的边缘流处理引擎，能够在动态、异构的边缘环境中大规模地快速处理众多并发运行的低延迟边缘应用的查询。本工作的新颖之处在于两点：一是动态数据流抽象，利用基于分布式哈希表的对等覆盖网络自主地放置、串联和扩缩流算子，以降低查询延迟、适应负载变化并从故障中恢复；二是基于 bandit 的路径规划模型，重新规划数据混洗路径，以适应不可靠且异构的边缘网络。我们表明，在处理众多真实边缘流应用的查询时，AgileDart 在查询延迟上优于 Storm 和 EdgeWise，并显著提升了可扩展性和适应性。

Article 26

Title@2025-07-29 (2): OpenRASE: Service Function Chain Emulation

OpenRASE: Service-Funktionskette Emulation

OpenRASE: 服务功能链模拟 2507.22131v1

Authors (2): Theviyanthan Krishnamohan, Paul Harvey

Service Function Chains (SFCs) are one of the key enablers in providing programmable computer networks, paving the way for network autonomy. However, this also introduces new challenges, such as resource allocation and optimisation related to their operation, requiring new algorithms to address these challenges. Various tools have been used in the literature to evaluate these algorithms. However, these tools suffer from inaccuracy, low fidelity, unscalability, inflexibility, or additional code requirements. This paper introduces an emulator based on Mininet and Docker for SFCs called OpenRASE. The goal of OpenRASE is to enable the exploration of resource allocation algorithms for SFCs in a dynamic setting, allowing real CPU usage and latency to be measured. We describe the design and implementation of OpenRASE and discuss its characteristics. We also experimentally evaluate two different algorithms to address the SFC resource allocation challenge, including an online Genetic Algorithm, using OpenRASE to show its effectiveness and practicality for dynamic network conditions.

服务功能链(SFCs)是提供可编程计算机网络的关键促进因素之一,为网络自主铺平了道路,但也带来了新的挑战,例如资源分配和优化与网络自主相关的操作,需要新的算法来应对这些挑战。文献中使用了各种工具来评估这些算法。然而,这些工具存在不准确、不忠诚、不可缩放、不灵活、不灵活或额外的代码要求。本文介绍了一个基于微型网和多克的模拟器,称为OpenRASE。OpenRASE的目标是在动态环境中为可持续财务公司探索资源分配算法,以便能够对实际的CUP使用和延绳进行测量。我们描述了OpenRASE的设计和实施,并讨论其特点。我们还实验性地评估了两种不同的算法,以解决可持续财务资源分配的挑战,包括在线遗传Algorithm,使用OpreRASE来展示其在动态网络条件下的有效性和实用性。

Article 27

Title@2025-07-29 (2): Large-Scale Linear Energy System Optimization: A Systematic Review on Parallelization Strategies via Decomposition

Large-Scale Linear Energy System Optimization: Eine systematische Überprüfung von Parallelisierungsstrategien durch Zersetzung

大型线性能源系统优化:通过分解对平行战略进行系统审查 2507.21932v1

Authors (10): Lars Hadidi, Leonard Göke, Maximilian Hoffmann, Mario Klostermeier, Shima Sasanpour, Tim Varelmann, Vassilios Yfantis, Jochen Linßen, Detlef Stolten, Jann M. Weinand

As renewable energy integration, sector coupling, and spatiotemporal detail increase, energy system optimization models grow in size and complexity, often pushing solvers to their performance limits. This systematic review explores parallelization strategies that can address these challenges. We first propose a classification scheme for linear energy system optimization models, covering their analytical focus, mathematical structure, and scope. We then review parallel decomposition methods, finding that while many offer performance benefits, no single approach is universally superior. The lack of standardized benchmark suites further complicates comparison. To address this, we recommend essential criteria for future benchmarks and minimum reporting standards. We also survey available software tools for parallel decomposition, including modular frameworks and algorithmic abstractions. Though centered on energy system models, our insights extend to the broader operations research field.

随着可再生能源整合、部门组合和时空细节的增加,能源系统优化模式的规模和复杂性不断增长,常常将解决者推向业绩极限。这一系统审查探索了能够应对这些挑战的平行战略。我们首先提出了线性能源系统优化模式的分类计划,包括分析重点、数学结构和范围。我们随后审查了平行分解方法,发现虽然许多方法都带来绩效效益,但没有单一方法普遍优异。缺乏标准化的基准套件使比较更加复杂。为了解决这个问题,我们建议了未来基准和最低报告标准的基本标准。我们还调查了平行分解的现有软件工具,包括模块框架和算法抽象。尽管我们的观点以能源系统模型为中心,但我们的洞察力扩大到更广泛的业务研究领域。

Article 28

Title@2025-07-29 (2): The Performance of Low-Synchronization Variants of Reorthogonalized Block Classical Gram–Schmidt

Die Performance von Low-Synchronization Varianten von Reorthogonalized Block Classical Gram–Schmidt

古经典古典古典古典石密的重新解析区块的低同步变异功能的性能 2507.21791v1

Authors (2): Erin Carson, Yuxin Ma

Numerous applications, such as Krylov subspace solvers, make extensive use of the block classical Gram-Schmidt (BCGS) algorithm and its reorthogonalized variants for orthogonalizing a set of vectors. For large-scale problems in distributed memory settings, the communication cost, particularly the global synchronization cost, is a major performance bottleneck. In recent years, many low-synchronization BCGS variants have been proposed in an effort to reduce the number of synchronization points. The work [E. Carson, Y. Ma, arXiv preprint 2411.07077] recently proposed stable one-synchronization and two-synchronization variants of BCGS, i.e., BCGSI+P-1S and BCGSI+P-2S. In this work, we evaluate the performance of BCGSI+P-1S and BCGSI+P-2S on a distributed memory system compared to other well-known low-synchronization BCGS variants. In comparison to the classical reorthogonalized BCGS algorithm (BCGSI+), numerical experiments demonstrate that BCGSI+P-1S and BCGSI+P-2S can achieve up to 4 times and 2 times speedups, respectively, and perform similarly to other (less stable) one-synchronization and two-synchronization variants. BCGSI+P-1S and BCGSI+P-2S are therefore recommended as the best choice in practice for computing an economic QR factorization on distributed memory systems due to their superior stability when compared to other variants with the same synchronization cost.

Krylov 子空间求解器等众多应用广泛使用块经典 Gram-Schmidt（BCGS）算法及其重正交化变体来对一组向量进行正交化。对于分布式内存环境中的大规模问题，通信开销，尤其是全局同步开销，是主要的性能瓶颈。近年来，为减少同步点的数量，人们提出了许多低同步 BCGS 变体。文献 [E. Carson, Y. Ma, arXiv preprint 2411.07077] 最近提出了稳定的单同步和双同步 BCGS 变体，即 BCGSI+P-1S 和 BCGSI+P-2S。在本文中，我们在分布式内存系统上评估 BCGSI+P-1S 和 BCGSI+P-2S 的性能，并与其他知名的低同步 BCGS 变体进行比较。数值实验表明，与经典的重正交化 BCGS 算法（BCGSI+）相比，BCGSI+P-1S 和 BCGSI+P-2S 分别可实现最高 4 倍和 2 倍的加速，且性能与其他（稳定性较差的）单同步和双同步变体相近。由于在同步开销相同的变体中稳定性更优，我们推荐将 BCGSI+P-1S 和 BCGSI+P-2S 作为在分布式内存系统上计算经济型 QR 分解的最佳实用选择。
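
For readers unfamiliar with the baseline, the classical reorthogonalized scheme (the BCGSI+ reference point in the abstract above) can be sketched in a few lines. This is a minimal serial illustration under stated assumptions: columns are stored as plain Python lists, the intra-block step uses modified Gram-Schmidt, and the function names are invented for the example. It does not reproduce the low-synchronization variants BCGSI+P-1S or BCGSI+P-2S.

```python
import math
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def project_out(basis, block):
    """One classical block pass: remove from every column of `block` its
    components along all columns of `basis`. The inner products all use the
    unmodified input columns, so they could be batched into one reduction."""
    coeffs = [[dot(q, x) for q in basis] for x in block]
    return [
        [xi - sum(c * q[i] for c, q in zip(cs, basis)) for i, xi in enumerate(x)]
        for x, cs in zip(block, coeffs)
    ]

def intra_ortho(block):
    """Orthonormalize the columns within one block (modified Gram-Schmidt)."""
    out = []
    for x in block:
        v = list(x)
        for q in out:
            c = dot(q, v)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(dot(v, v))
        out.append([vi / norm for vi in v])
    return out

def bcgs_reorth(columns, block_size):
    """Reorthogonalized block classical Gram-Schmidt: each new block is
    projected against the current basis and orthonormalized, twice."""
    basis = []
    for start in range(0, len(columns), block_size):
        block = columns[start:start + block_size]
        for _ in range(2):  # the second pass is the reorthogonalization
            block = intra_ortho(project_out(basis, block))
        basis.extend(block)
    return basis

random.seed(1)
n, m = 12, 6  # m random columns of length n
X = [[random.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]
Q = bcgs_reorth(X, block_size=2)
# Largest deviation of Q^T Q from the identity matrix.
worst = max(abs(dot(Q[i], Q[j]) - (1.0 if i == j else 0.0))
            for i in range(m) for j in range(m))
```

In a distributed setting each projection and each intra-block step would require at least one global reduction per block, which is the synchronization cost the low-synchronization variants aim to cut.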

Article 29

Title@2025-07-29 (2): Evaluating the Impact Of Spatial Features Of Mobility Data and Index Choice On Database Performance

Bewertung der Auswirkungen räumlicher Merkmale von Mobilitätsdaten und Indexwahl auf die Datenbankleistung

评价移动数据空间特征和指数选择对数据库绩效的影响 2505.14466v2

Authors (3): Tim C. Rese, Alexandra Kapp, David Bermbach

The growing number of moving Internet-of-Things (IoT) devices has led to a surge in moving object data, powering applications such as traffic routing, hotspot detection, or weather forecasting. When managing such data, spatial database systems offer various index options and data formats, e.g., point-based or trajectory-based. Likewise, dataset characteristics such as geographic overlap and skew can vary significantly. All three significantly affect database performance. While this has been studied in existing papers, none of them explore the effects and trade-offs resulting from a combination of all three aspects. In this paper, we evaluate the performance impact of index choice, data format, and dataset characteristics on a popular spatial database system, PostGIS. We focus on two aspects of dataset characteristics, the degree of overlap and the degree of skew, and propose novel approximation methods to determine these features. We design a benchmark that compares a variety of spatial indexing strategies and data formats, while also considering the impact of dataset characteristics on database performance. We include a variety of real-world and synthetic datasets, write operations, and read queries to cover a broad range of scenarios that might occur during application runtime. Our results offer practical guidance for developers looking to optimize spatial storage and querying, while also providing insights into dataset characteristics and their impact on database performance.

移动式互联网电话(IoT)设备的数量不断增加,这导致移动对象数据、交通路线选择、热点探测或天气预报等应用程序的动力激增。空间数据库系统在管理这些数据时提供各种指数选项和数据格式,例如点基或轨迹基。同样,诸如地理重叠和扭曲等数据集特征也会有很大差异。这三个特征都对数据库的性能产生重大影响。虽然已在现有文件中研究过,但它们都没有探讨所有三个方面组合的结果和取舍。在本文件中,我们评估了指数选择、数据格式和数据集特性对流行空间数据库系统PostGIS的性能影响。我们侧重于数据集特性的两个方面,即重叠程度和斜线度,并提出确定这些特征的新近似方法。我们设计了一个基准,比较了各种空间索引战略和数据格式,同时也考虑到数据集对数据库性能的影响。我们包括各种现实世界和合成数据集、写作操作和阅读查询,以涵盖范围很广的空间数据设置特点,同时提供我们数据库运行期间的精确性判读结果。

Article 30

Title@2025-07-29 (2): Quantize Once, Train Fast: Allreduce-Compatible Compression with Provable Guarantees

Einmal quantifizieren, schnell trainieren: Allreduce-kompatible Kompression mit wahrnehmbaren Garantien

量化一次,快速列车:用可变担保进行减压-可比较压缩 2305.18627v2

Authors (4): Jihao Xin, Marco Canini, Peter Richtárik, Samuel Horváth

Distributed training enables large-scale deep learning, but suffers from high communication overhead, especially as models and datasets grow. Gradient compression, particularly quantization, is a promising approach to mitigate this bottleneck. However, existing quantization schemes are often incompatible with Allreduce, the dominant communication primitive in distributed deep learning, and many prior solutions rely on heuristics without theoretical guarantees. We introduce Global-QSGD, an Allreduce-compatible gradient quantization method that leverages global norm scaling to reduce communication overhead while preserving accuracy. Global-QSGD is backed by rigorous theoretical analysis, extending standard unbiased compressor frameworks to establish formal convergence guarantees. Additionally, we develop a performance model to evaluate its impact across different hardware configurations. Extensive experiments on NVLink, PCIe, and large-scale cloud environments show that Global-QSGD accelerates distributed training by up to 3.51% over baseline quantization methods, making it a practical and efficient solution for large-scale deep learning workloads.

分散培训有助于大规模深层次学习,但受高水平通信管理费的困扰,特别是随着模型和数据集的增长。渐进压缩,特别是量化,是缓解这一瓶颈的一个很有希望的方法。然而,现有的量化计划往往与Alledue不相容,Alledue是分布式深层学习中占主导地位的通信原始,许多先前的解决办法在没有理论保障的情况下依赖于重力学。我们引入了“全球-QSGD”,一种可减排兼容的梯度量化方法,利用全球规范缩放来减少通信管理费,同时保持准确性。“全球-QSGD”得到了严格的理论分析的支持,扩大了标准的不带偏见压缩机框架以建立正式的趋同保证。此外,我们开发了一种绩效模型来评估其在不同硬件配置中的影响。关于NVLink、PCIe和大型云层环境的广泛实验表明,“全球-QSGD”在基线量化方法上加快了高达3.51%的培训,从而成为大规模深层次学习工作量的实用有效解决方案。
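
Why a shared scale makes quantization compatible with Allreduce can be illustrated with a small sketch. It is not the Global-QSGD algorithm itself: the choice of the largest per-worker L2 norm as the shared scale, the number of levels, and all function names are assumptions for the example. The point is only that when every worker uses the same scale, the integer codes can be summed directly and rescaled once.

```python
import math
import random

def stochastic_round(x, rng):
    """Round x up or down at random so that the expected value equals x."""
    low = math.floor(x)
    return low + (1 if rng.random() < x - low else 0)

def quantize(grad, scale, levels, rng):
    """Map each entry to a signed integer level in [-levels, levels]."""
    return [stochastic_round(g / scale * levels, rng) for g in grad]

def allreduce_sum(vectors):
    """Stand-in for a sum-Allreduce: elementwise sum over all workers."""
    return [sum(vals) for vals in zip(*vectors)]

rng = random.Random(0)
# Hypothetical gradients from three workers.
grads = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.75], [-0.5, 2.0, 0.5]]
levels = 16

# Assumed shared scale: the largest per-worker L2 norm, which could itself be
# agreed on with a max-Allreduce before quantizing.
scale = max(math.sqrt(sum(g * g for g in grad)) for grad in grads)

quantized = [quantize(grad, scale, levels, rng) for grad in grads]
summed = allreduce_sum(quantized)  # integers add without any decompression
approx_mean = [q * scale / (levels * len(grads)) for q in summed]
exact_mean = [sum(vals) / len(grads) for vals in zip(*grads)]
```

Because each rounding step is off by less than one level, the averaged result here stays within `scale / levels` of the exact mean in every coordinate; schemes that use a different scale per worker cannot be summed this way without first decompressing.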

Article 31

Title@2025-07-29 (2): Ethereum Conflicts Graphed

Ethereum-Konflikte

EEeenum 冲突图图 2507.20196v2

Authors (3): Dvir David Biton, Roy Friedman, Yaron Hay

Ethereum, a leading blockchain platform, has revolutionized the digital economy by enabling decentralized transactions and the execution of smart contracts. Ethereum transactions form the backbone of its network, facilitating peer-to-peer exchanges and interactions with complex decentralized applications. Smart contracts extend Ethereum’s capabilities by automating processes and enabling trustless execution of agreements. Hence, understanding how these smart contracts interact is important in order to facilitate various performance optimizations, such as warming objects before they are accessed and enabling concurrent execution. Of particular interest to us are the development of the calling graph, the read sets and write sets of invocations within the same block, and the properties of the associated conflict graph that is derived from them. The latter is important for understanding the parallelization potential of smart contracts on Ethereum. We traced upwards of 2 million recent Ethereum blocks using call tracer and prestate tracer, out of a total of 21.4 million blocks at the time of writing. We report on the transactions-per-block distribution, the structure of call trees in smart contract invocations, and the ratio of value-transfer transactions to smart contract invocations, and we provide a comprehensive study of the structure of blocks’ conflict graphs. We find that conflict graphs predominantly show a star-like configuration, as well as other noteworthy structural properties.

Etheum是一个领先的连锁平台,它通过分散交易和执行智能合同,使数字经济发生了革命性的变化。Etheum交易是其网络的主干,促进了同行之间的交流和与复杂的分散应用的相互作用。智能合同扩大了Etheum的能力,使过程自动化,并使协议的执行变得无信任。因此,了解这些智能合同如何相互作用是重要的,以促进各种性能优化,例如,在进入之前使物体变暖,并能够同时执行。我们特别感兴趣的是,调用图的开发,以及在同一块内部的读数和写数,以及从中衍生出的相关冲突图的特性。后者对于理解Etheem智能合同的平行潜力十分重要。我们利用呼叫追踪器和先期追踪器追踪了200万个最近的Etheum区块,在撰写本报告时总共2 140万个区块之外。我们报告每块的交易情况,调用智能合同的结构,调用智能合同,以及从这些区块中得出的价值转移交易与智能合同的属性之比。后者对于理解Ethereum合同的平行潜力很重要。我们在撰写时,像图表一样,提供了一个稳定的图表。
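
How a conflict graph follows from read sets and write sets can be sketched as below. The conflict rule used here (two transactions conflict when one writes a key the other reads or writes) is the usual definition and is an assumption about the paper's exact rule; the transaction names and storage keys are made up for the example.

```python
from itertools import combinations

def conflicts(a, b):
    """Two transactions conflict if one writes a key the other reads or writes."""
    return bool(a["writes"] & (b["reads"] | b["writes"]) or b["writes"] & a["reads"])

def conflict_graph(txs):
    """Return an undirected graph as {transaction: set of conflicting transactions}."""
    graph = {name: set() for name in txs}
    for u, v in combinations(txs, 2):
        if conflicts(txs[u], txs[v]):
            graph[u].add(v)
            graph[v].add(u)
    return graph

# Hypothetical block: one transaction updates a popular contract slot ("pool")
# that several otherwise independent transactions read.
block = {
    "hub": {"reads": {"pool"}, "writes": {"pool"}},
    "tx1": {"reads": {"pool"}, "writes": {"alice"}},
    "tx2": {"reads": {"pool"}, "writes": {"bob"}},
    "tx3": {"reads": {"pool"}, "writes": {"carol"}},
    "tx4": {"reads": {"dave"}, "writes": {"erin"}},
}
graph = conflict_graph(block)
```

In this toy block the result is a star: `hub` conflicts with `tx1`, `tx2`, and `tx3`, which do not conflict with each other, while `tx4` is isolated. Transactions with no edge between them are candidates for concurrent execution.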

Article 32

Title@2025-07-29 (2): A Massively Parallel Performance Portable Free-space Spectral Poisson Solver

Ein massiv parallele Leistung Portable Freiraum Spectral Poisson Solver

大规模平行平行性能便携式自由空间光谱 Poisson 解答器 2405.02603v2

Authors (6): Sonali Mayani, Veronica Montanaro, Antoine Cerfon, Matthias Frey, Sriramkrishnan Muralikrishnan, Andreas Adelmann

Vico et al. (2016) suggest a fast algorithm for computing volume potentials, beneficial to fields with problems requiring the solution of the free-space Poisson’s equation, such as beam and plasma physics. Currently, the standard is the algorithm of Hockney and Eastwood (1988), with second-order convergence at best. The algorithm proposed by Vico et al. converges spectrally for sufficiently smooth functions, i.e., faster than any fixed order in the number of grid points. We implement performance-portable versions of the traditional Hockney-Eastwood and the novel Vico-Greengard Poisson solvers as part of the IPPL (Independent Parallel Particle Layer) library. For sufficiently smooth source functions, the Vico-Greengard algorithm achieves higher accuracy than the Hockney-Eastwood method with the same grid size, reducing the computational demands of high-resolution simulations, since one could use coarser grids to achieve the same accuracy. Additionally, we propose an improvement to the Vico-Greengard method which further reduces its memory footprint. This is important for GPUs, which have limited memory, and should be taken into account when selecting numerical algorithms for performance-portable codes. Finally, we showcase performance through GPU and CPU scaling studies on the Perlmutter (NERSC) supercomputer, with efficiencies staying above 50% in the strong scaling case. To showcase portability, we also run the scaling studies on the Alps supercomputer at CSCS, Switzerland and the GPU partition of the Lumi supercomputer at CSC, Finland.

Vico 等人（2016）提出了一种计算体积势的快速算法，对束流物理和等离子体物理等需要求解自由空间泊松方程的领域很有帮助。目前的标准方法是 Hockney 和 Eastwood（1988）的算法，其收敛阶最高为二阶。对于足够光滑的函数，Vico 等人提出的算法具有谱收敛性，即收敛速度快于关于网格点数的任何固定阶。我们在 IPPL（Independent Parallel Particle Layer）库中实现了传统 Hockney-Eastwood 求解器和新的 Vico-Greengard 泊松求解器的性能可移植版本。对于足够光滑的源函数，Vico-Greengard 算法在相同网格规模下比 Hockney-Eastwood 方法精度更高，因而可以使用更粗的网格来达到高分辨率模拟所需的精度，从而降低其计算需求。此外，我们提出了对 Vico-Greengard 方法的一项改进，进一步减少其内存占用。这对内存有限的 GPU 十分重要，在为性能可移植代码选择数值算法时应予以考虑。最后，我们通过在 Perlmutter（NERSC）超级计算机上的 GPU 和 CPU 扩展性研究展示了性能，强扩展情形下的效率保持在 50% 以上。为展示可移植性，我们还在瑞士 CSCS 的 Alps 超级计算机以及芬兰 CSC 的 Lumi 超级计算机的 GPU 分区上进行了扩展性研究。

Article 33

Title@2025-07-29 (2): Collaborative State Machines: A Better Programming Model for the Cloud-Edge-IoT Continuum

Kollaborative Staatsmaschinen: Ein besseres Programmiermodell für das Cloud-Edge-IoT Continuum

协作型国家机器:云-云-日-环-环-环-环-环-环-环-环-环-环- 2507.21685v1

Authors (8): Marlon Etheredge, Thomas Fahringer, Felix Erlacher, Elias Kohler, Stefan Pedratscher, Juan Aznar-Poveda, Nishant Saurabh, Adrien Lebre

The development of Cloud-Edge-IoT applications requires robust programming models. Existing models often struggle to manage the dynamic and stateful nature of these applications effectively. This paper introduces the Collaborative State Machines (CSM) programming model to address these complexities. CSM facilitates the development of reactive, event-driven, and stateful applications targeting the Cloud-Edge-IoT continuum. Applications built with CSM are composed of state machines that collaborate autonomously and can be distributed across different layers of the continuum. Key features of CSM include (i) a sophisticated collaboration mechanism among state machines utilizing events and persistent data; (ii) encapsulation of state through the inherent state of state machines and persistent data; (iii) integration of actions and service invocations within states and state transitions, thereby decoupling complex application logic from compute and data processing services; and (iv) an advanced data model that supports the processing of local, static, and persistent data with defined scope and lifetime. In addition to introducing the CSM programming model, we present a runtime system and a comprehensive evaluation of our approach. This evaluation is based on three use cases: a stress test on a large-scale infrastructure, a surveillance system application, and a complex smart factory scenario, all deployed on the Grid’5000 testbed. Our results demonstrate a 12x increase in throughput through novel language features in the stress test. Compared to Serverless Workflow, a state-of-the-art baseline system, we show a 2.3x improvement in processing time per processed image in a surveillance system use case, a 55x reduction in total processing time for a smart factory use case, and an overall improvement in productivity across these use cases.

Cloud-Edge-IoT 应用的开发需要稳健的编程模型。现有模型往往难以有效管理这类应用的动态性和有状态特性。本文提出协作状态机（CSM）编程模型来应对这些复杂性。CSM 便于开发面向 Cloud-Edge-IoT 连续体的响应式、事件驱动和有状态应用。基于 CSM 构建的应用由自主协作的状态机组成，这些状态机可以分布在连续体的不同层上。CSM 的主要特性包括：(i) 状态机之间利用事件和持久数据的精细协作机制；(ii) 通过状态机的固有状态和持久数据对状态进行封装；(iii) 在状态和状态转换中集成动作和服务调用，从而将复杂的应用逻辑与计算和数据处理服务解耦；(iv) 支持处理具有明确作用域和生命周期的本地、静态和持久数据的先进数据模型。除介绍 CSM 编程模型外，我们还给出了一个运行时系统以及对该方法的全面评估。评估基于三个用例：大规模基础设施上的压力测试、监控系统应用和复杂的智能工厂场景，均部署在 Grid’5000 测试平台上。结果表明，在压力测试中，新的语言特性使吞吐量提高了 12 倍。与最先进的基线系统 Serverless Workflow 相比，在监控系统用例中每幅已处理图像的处理时间改善了 2.3 倍，在智能工厂用例中总处理时间减少了 55 倍，并且这些用例的开发效率总体得到提升。

Article 34

Title@2025-07-29 (2): Accelerating Stable Matching between Workers and Spatial-Temporal Tasks for Dynamic MCS: A Stagewise Service Trading Approach

Beschleunigte stabile Abstimmung zwischen Arbeitern und räumlich-zeitlichen Aufgaben für dynamische MCS: Ein schrittweiser Service-Trading-Ansatz

加快工人与动态监控监的时空任务之间的稳定匹配:分阶段服务贸易办法 2502.08386v3

Authors (7): Houyi Qi, Minghui Liwang, Xianbin Wang, Liqun Fu, Yiguang Hong, Li Li, Zhipeng Cheng

Designing effective incentive mechanisms in mobile crowdsensing (MCS) networks is crucial for engaging distributed mobile users (workers) to contribute heterogeneous data for various applications (tasks). In this paper, we propose a novel stagewise trading framework to achieve efficient and stable task-worker matching, explicitly accounting for task diversity (e.g., spatio-temporal limitations) and network dynamics inherent in MCS environments. This framework integrates both futures and spot trading stages. In the former, we introduce the futures trading-driven stable matching and pre-path-planning mechanism (FT-SMP$^3$), which enables long-term task-worker assignment and pre-planning of workers’ trajectories based on historical statistics and risk-aware analysis. In the latter, we develop the spot trading-driven DQN-based path planning and onsite worker recruitment mechanism (ST-DP$^2$WR), which dynamically improves the practical utilities of tasks and workers by supporting real-time recruitment and path adjustment. We rigorously prove that the proposed mechanisms satisfy key economic and algorithmic properties, including stability, individual rationality, competitive equilibrium, and weak Pareto optimality. Extensive experiments further validate the effectiveness of our framework in realistic network settings, demonstrating superior performance in terms of service quality, computational efficiency, and decision-making overhead.

在移动群智感知（MCS）网络中设计有效的激励机制，对于吸引分布式移动用户（工人）为各类应用（任务）贡献异构数据至关重要。本文提出一种新的分阶段交易框架，以实现高效且稳定的任务-工人匹配，并明确考虑 MCS 环境中固有的任务多样性（例如时空限制）和网络动态性。该框架整合了期货交易和现货交易两个阶段。在前一阶段，我们提出期货交易驱动的稳定匹配与预路径规划机制（FT-SMP$^3$），基于历史统计数据和风险感知分析，实现长期的任务-工人分配以及工人轨迹的预规划。在后一阶段，我们设计了现货交易驱动的基于 DQN 的路径规划与现场工人招募机制（ST-DP$^2$WR），通过支持实时招募和路径调整，动态提升任务和工人的实际效用。我们严格证明了所提机制满足关键的经济和算法性质，包括稳定性、个体理性、竞争均衡和弱帕累托最优性。大量实验进一步验证了该框架在现实网络环境中的有效性，在服务质量、计算效率和决策开销方面表现优异。
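
The stability property mentioned in the abstract above can be illustrated with the textbook deferred-acceptance procedure. This is a swapped-in classic technique, not the FT-SMP$^3$ or ST-DP$^2$WR mechanisms: the one-to-one setting, the preference lists, and the function name `stable_match` are assumptions made for the example.

```python
def stable_match(task_prefs, worker_prefs):
    """Task-proposing deferred acceptance. Returns {worker: task}.

    task_prefs[t] lists workers from most to least preferred;
    worker_prefs[w] lists tasks from most to least preferred.
    """
    rank = {w: {t: i for i, t in enumerate(prefs)} for w, prefs in worker_prefs.items()}
    next_choice = {t: 0 for t in task_prefs}  # index of the next worker to try
    engaged = {}
    free = list(task_prefs)
    while free:
        t = free.pop()
        if next_choice[t] >= len(task_prefs[t]):
            continue  # the task has exhausted its list and stays unmatched
        w = task_prefs[t][next_choice[t]]
        next_choice[t] += 1
        current = engaged.get(w)
        if current is None:
            engaged[w] = t
        elif rank[w][t] < rank[w][current]:
            engaged[w] = t        # the worker switches to the task it prefers
            free.append(current)
        else:
            free.append(t)        # rejected, the task will try its next choice
    return engaged

# Hypothetical preferences for two tasks and two workers.
task_prefs = {"t1": ["w1", "w2"], "t2": ["w1", "w2"]}
worker_prefs = {"w1": ["t2", "t1"], "w2": ["t1", "t2"]}
matching = stable_match(task_prefs, worker_prefs)
```

Here both tasks prefer `w1`, but `w1` prefers `t2`, so `t1` ends up with `w2`. No task and worker would both rather be paired with each other than with their assigned partners, which is what stability means in this setting.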

Article 35

Title@2025-07-29 (2): Bridging Cache-Friendliness and Concurrency: A Locality-Optimized In-Memory B-Skiplist

Überbrückung von Cache-Freundlichkeit und Concurrency: Eine lokalitätsoptimierte In-Memory-B-Skiplist

搭桥便利取快和货币通融:一个地方性优化的记忆B-空间列表 2507.21492v1

Authors (5): Yicong Luo, Senhe Hao, Brian Wheatman, Prashant Pandey, Helen Xu

Skiplists are widely used for in-memory indexing in many key-value stores, such as RocksDB and LevelDB, due to their ease of implementation and simple concurrency control mechanisms. However, traditional skiplists suffer from poor cache locality, as they store only a single element per node, leaving performance on the table. Minimizing last-level cache misses is key to maximizing in-memory index performance, making high cache locality essential. In this paper, we present a practical concurrent B-skiplist that enhances cache locality and performance while preserving the simplicity of traditional skiplist structures and concurrency control schemes. Our key contributions include a top-down, single-pass insertion algorithm for B-skiplists and a corresponding simple and efficient top-down concurrency control scheme. On 128 threads, the proposed concurrent B-skiplist achieves 2x to 9x higher throughput compared to state-of-the-art concurrent skiplist implementations, including Facebook’s concurrent skiplist from Folly and the Java ConcurrentSkipListMap. Furthermore, we find that the B-skiplist achieves competitive (0.9x to 1.7x) throughput on point workloads compared to state-of-the-art cache-optimized tree-based indices (e.g., Masstree). For a more complete picture of the performance, we also measure the latency of skiplist and tree-based indices and find that the B-skiplist achieves 3.5x to 103x lower 99% latency compared to other concurrent skiplists and 0.85x to 64x lower 99% latency compared to tree-based indices on point workloads with inserts.

跳表（skiplist）因其易于实现且并发控制机制简单，被广泛用于 RocksDB 和 LevelDB 等许多键值存储的内存索引。然而，传统跳表每个节点只存储一个元素，缓存局部性较差，未能充分发挥性能。尽量减少末级缓存未命中是最大化内存索引性能的关键，因此高缓存局部性至关重要。本文提出一种实用的并发 B-skiplist，在保持传统跳表结构和并发控制方案简单性的同时，提升缓存局部性和性能。我们的主要贡献包括一种自顶向下、单遍的 B-skiplist 插入算法，以及相应的简单高效的自顶向下并发控制方案。在 128 个线程上，所提出的并发 B-skiplist 的吞吐量比最先进的并发跳表实现（包括 Facebook 的 Folly 并发跳表和 Java 的 ConcurrentSkipListMap）高 2 至 9 倍。此外，在点操作负载上，B-skiplist 的吞吐量与最先进的缓存优化树形索引（例如 Masstree）相比具有竞争力（0.9 至 1.7 倍）。为更完整地刻画性能，我们还测量了跳表和树形索引的延迟，发现在包含插入的点操作负载上，B-skiplist 的 99% 延迟比其他并发跳表低 3.5 至 103 倍，比树形索引低 0.85 至 64 倍。

Article 36

Title@2025-07-29 (2): GlideinBenchmark: collecting resource information to optimize provisioning

GlideinBenchmark: Sammeln von Ressourceninformationen zur Optimierung der Bereitstellung

Gliidein基准:收集资源信息,优化供应 2507.21472v1

Authors (2): Marco Mambelli, Shrijan Swaminathan

Choosing the right resource can speed up job completion, better utilize the available hardware, and visibly reduce costs, especially when renting computers in the cloud. This was demonstrated in earlier studies on HEPCloud. However, the benchmarking of the resources proved to be a laborious and time-consuming process. This paper presents GlideinBenchmark, a new Web application leveraging the pilot infrastructure of GlideinWMS to benchmark resources, and it shows how to use the data collected and published by GlideinBenchmark to automate the optimal selection of resources. An experiment can select the benchmark or the set of benchmarks that most closely evaluate the performance of its workflows. GlideinBenchmark, with the help of the GlideinWMS Factory, controls the benchmark execution. Finally, a scheduler like HEPCloud’s Decision Engine can use the results to optimize resource provisioning.

选择正确的资源可以加速完成工作,更好地利用可用的硬件,并明显降低成本,特别是在租用云中的计算机时。这一点在早先关于HEPCloud的研究中得到了证明。然而,资源的基准测试证明是一个费时费力的过程。本文展示了GlideinBenchmark,这是一个利用GlideinWMS的pilot基础设施对资源进行基准测试的新的网络应用程序,它展示了如何利用GlideinBenchmark收集和公布的数据使资源的最佳选择自动化。一个实验可以选择最贴近其工作流程性能的基准或基准集。GlideinBenchmark在GlideinWMS Factory的帮助下控制基准执行。最后,像HEPCloud的Decision Engine这样的调度器可以利用结果优化资源配置。

Article 37

Title@2025-07-29 (2): Using Containers to Speed Up Development, to Run Integration Tests and to Teach About Distributed Systems

Title: Using Containers to Speed Up Development, to Run Integration Tests and to Teach About Distributed Systems

Container verwenden, um die Entwicklung zu beschleunigen, Integrationstests durchzuführen und über verteilte Systeme zu unterrichten

利用容器加速开发、运行集成测试并讲授分布式系统 2507.21464v1

Authors (4): Marco Mambelli, Bruno Moreira Coimbra, Namratha Urs, Ilya Baburashvili

GlideinWMS is a workload manager provisioning resources for many experiments, including CMS and DUNE. The software is distributed both as native packages and specialized production containers. Following an approach used in other communities like web development, we built our workspaces, system-like containers, to ease development and testing. Developers can change the source tree or check out a different branch and quickly reconfigure the services to see the effect of their changes. In this paper, we will talk about what differentiates workspaces from other containers. We will describe our base system, composed of three containers: a one-node cluster including a compute element and a batch system, a GlideinWMS Factory controlling pilot jobs, and a scheduler and Frontend to submit jobs and provision resources. Additional containers can be used for optional components. This system can easily run on a laptop, and we will share our evaluation of different container runtimes, with an eye for ease of use and performance. Finally, we will talk about our experience as developers and with students. The GlideinWMS workspaces are easily integrated with IDEs like VS Code, simplifying debugging and allowing development and testing of the system even when offline. They simplified the training and onboarding of new team members and summer interns. They were also useful in workshops where students could have first-hand experience with the mechanisms and components that, in production, run millions of jobs.

GliideinWMS 是一个工作量管理者,为包括CMS 和 DUNE 在内的许多实验提供资源。软件作为本地包件和专门生产容器分发。按照网络开发等其他社区使用的方法,我们建造了我们的工作空间和系统式容器,以方便开发和测试。开发者可以改变源树或检查不同的分支,并快速重新配置服务,以了解其变化的影响。在本文件中,我们将谈论哪些工作空间不同于其他容器。我们将讲述我们的基地系统,由三个集装箱组成:一节集群,包括一个计算要素和批量系统,一个GliideinWMS工厂控制试点工作,一个调度器和前端,以提交工作和供应资源。其他容器可用于可选部件。该系统可以很容易地使用一台手提电脑运行,我们将分享我们对不同集装箱运行时间的评价,以方便使用和工作表现。最后,我们将谈论我们作为开发者和学生的经历。GliideinWMS 工作空间很容易与IDES 代码等系统整合,一个GlidelinWMS 系统控制试点工作,一个调度员和前期的调度员可以进行简化和测试。

Article 38

Title@2025-07-29 (2): InfiniteHBD: Building Datacenter-Scale High-Bandwidth Domain for LLM with Optical Circuit Switching Transceivers

Title: InfiniteHBD: Building Datacenter-Scale High-Bandwidth Domain for LLM with Optical Circuit Switching Transceivers

InfiniteHBD: Building Datacenter-Scale High-Bandwidth Domain für LLM mit optischen Schaltungsschalter Transceivern

InfiniteHBD:利用光电路交换收发器为LLM构建数据中心规模的高带宽域 2502.03885v5

Authors (14): Chenchen Shou, Guyue Liu, Hao Nie, Huaiyu Meng, Yu Zhou, Yimin Jiang, Wenqing Lv, Yelong Xu, Yuanwei Lu, Zhang Chen, Yanbo Yu, Yichen Shen, Yibo Zhu, Daxin Jiang

Scaling Large Language Model (LLM) training relies on multi-dimensional parallelism, where High-Bandwidth Domains (HBDs) are critical for communication-intensive parallelism like Tensor Parallelism (TP) and Expert Parallelism (EP). However, existing HBD architectures face fundamental limitations in scalability, cost, and fault resiliency: switch-centric HBDs (e.g., NVL-72) incur prohibitive scaling costs, while GPU-centric HBDs (e.g., TPUv3/Dojo) suffer from severe fault propagation. Switch-GPU hybrid HBDs such as TPUv4 take a middle-ground approach, but the fault explosion radius remains large at the cube level (e.g., 64 TPUs). We propose InfiniteHBD, a novel transceiver-centric HBD architecture that unifies connectivity and dynamic switching at the transceiver level using Optical Circuit Switching (OCS). By embedding OCS within each transceiver, InfiniteHBD achieves reconfigurable point-to-multipoint connectivity, allowing the topology to adapt to variable-size rings. This design provides: i) datacenter-wide scalability without cost explosion; ii) fault resilience by isolating failures to a single node, and iii) full bandwidth utilization for fault-free GPUs. Key innovations include a Silicon Photonic (SiPh)-based low-cost OCS transceiver (OCSTrx), a reconfigurable k-hop ring topology co-designed with intra-/inter-node communication, and an HBD-DCN orchestration algorithm maximizing GPU utilization while minimizing cross-ToR datacenter network traffic. The evaluation demonstrates that InfiniteHBD achieves 31% of the cost of NVL-72, near-zero GPU waste ratio (over one order of magnitude lower than NVL-72 and TPUv4), near-zero cross-ToR traffic when node fault ratios are under 7%, and improves Model FLOPs Utilization by 3.37x compared to NVIDIA DGX (8 GPUs per Node).

大语言模型(LLM)训练的扩展依赖多维并行,其中高带宽域(HBD)对张量并行(TP)和专家并行(EP)等通信密集型并行至关重要。然而,现有 HBD 架构在可扩展性、成本和容错能力方面存在根本性限制:以交换机为中心的 HBD(如 NVL-72)扩展成本过高,而以 GPU 为中心的 HBD(如 TPUv3/Dojo)存在严重的故障传播问题。TPUv4 等交换机-GPU 混合 HBD 采取折中方案,但其故障爆炸半径在立方体级别(如 64 个 TPU)仍然很大。我们提出 InfiniteHBD,一种以收发器为中心的新型 HBD 架构,利用光电路交换(OCS)在收发器层面统一连接与动态交换。通过在每个收发器中嵌入 OCS,InfiniteHBD 实现了可重构的点对多点连接,使拓扑能够适应可变大小的环。该设计提供:i)数据中心范围的可扩展性且不会导致成本激增;ii)将故障隔离到单个节点的容错能力;iii)无故障 GPU 的全带宽利用。关键创新包括基于硅光子(SiPh)的低成本 OCS 收发器(OCSTrx)、与节点内/节点间通信协同设计的可重构 k 跳环形拓扑,以及在最大化 GPU 利用率的同时最小化跨 ToR 数据中心网络流量的 HBD-DCN 编排算法。评估表明,InfiniteHBD 的成本为 NVL-72 的 31%,GPU 浪费率接近零(比 NVL-72 和 TPUv4 低一个数量级以上),在节点故障率低于 7% 时跨 ToR 流量接近零,并且与 NVIDIA DGX(每节点 8 个 GPU)相比将模型 FLOPs 利用率提高 3.37x。

Article 39

Title@2025-07-28 (1): FedStrategist: A Meta-Learning Framework for Adaptive and Robust Aggregation in Federated Learning

Title: FedStrategist: A Meta-Learning Framework for Adaptive and Robust Aggregation in Federated Learning

FedStrategist: Ein Meta-Learning-Framework für adaptive und robuste Aggregation im Federated Learning

FedStrategist:联邦学习中自适应鲁棒聚合的元学习框架 2507.14322v2

Authors (3): Md Rafid Haque, Abu Raihan Mostofa Kamal, Md. Azam Hossain

Federated Learning (FL) offers a paradigm for privacy-preserving collaborative AI, but its decentralized nature creates significant vulnerabilities to model poisoning attacks. While numerous static defenses exist, their effectiveness is highly context-dependent, often failing against adaptive adversaries or in heterogeneous data environments. This paper introduces FedStrategist, a novel meta-learning framework that reframes robust aggregation as a real-time, cost-aware control problem. We design a lightweight contextual bandit agent that dynamically selects the optimal aggregation rule from an arsenal of defenses based on real-time diagnostic metrics. Through comprehensive experiments, we demonstrate that no single static rule is universally optimal. We show that our adaptive agent successfully learns superior policies across diverse scenarios, including a “Krum-favorable” environment and against a sophisticated “stealth” adversary designed to neutralize specific diagnostic signals. Critically, we analyze the paradoxical scenario where a non-robust baseline achieves high but compromised accuracy, and demonstrate that our agent learns a conservative policy to prioritize model integrity. Furthermore, we prove the agent’s policy is controllable via a single “risk tolerance” parameter, allowing practitioners to explicitly manage the trade-off between performance and security. Our work provides a new, practical, and analyzable approach to creating resilient and intelligent decentralized AI systems.

联邦学习联盟(FL)提供了一个保护隐私的合作性AI(FL)范例,但其分散性在模式中毒袭击方面造成了巨大的脆弱性。尽管存在许多静态防御,但其有效性高度依赖环境,往往无法应对适应性对手或不同数据环境。本文介绍FedStrechnicist(FedStechnical),这是一个新型的元学习框架,将强力聚合重新定义为实时的、有成本意识的控制问题。我们设计了一个轻量级的背景强盗剂,能动态地从基于实时诊断性指标的防御库中选择最佳集合规则。我们通过全面实验,证明没有任何单一静态规则是普遍最佳的。我们表明,我们的适应性代理人成功地学习了不同情景的优异政策,包括“Krum-forestable”环境,以及旨在抵消特定诊断信号的精密“stealth”对手。关键地说,我们分析了一种自相矛盾的假想,即非暴动基线达到高但损害准确性,并表明我们的代理人学会了将模型完整性置于优先位置的保守政策。此外,我们证明该代理人的政策可以通过单一的“liforent-lifornicentalable”方法来控制好。我们的贸易-liforty-

Article 40

Title@2025-07-28 (1): LeMix: Unified Scheduling for LLM Training and Inference on Multi-GPU Systems

Title: LeMix: Unified Scheduling for LLM Training and Inference on Multi-GPU Systems

LeMix: Unified Scheduling für LLM-Training und Schlussfolgerung auf Multi-GPU-Systemen

LeMix:多GPU系统上LLM训练与推理的统一调度 2507.21276v1

Authors (4): Yufei Li, Zexin Li, Yinglun Zhu, Cong Liu

Modern deployment of large language models (LLMs) frequently involves both inference serving and continuous retraining to stay aligned with evolving data and user feedback. Common practices separate these workloads onto distinct servers in isolated phases, causing substantial inefficiencies (e.g., GPU idleness) and delayed adaptation to new data in distributed settings. Our empirical analysis reveals that these inefficiencies stem from dynamic request arrivals during serving and workload heterogeneity in pipeline-parallel training. To address these challenges, we propose LeMix, a system for co-locating and managing concurrent LLM serving and training workloads. LeMix integrates offline profiling, execution prediction mechanisms, and runtime scheduling to dynamically adapt resource allocation based on workload characteristics and system conditions. By understanding task-specific behaviors and co-execution interference across shared nodes, LeMix improves utilization and serving quality without compromising serving responsiveness. Our evaluation shows that LeMix improves throughput by up to 3.53x, reduces inference loss by up to 0.61x, and delivers up to 2.12x higher response time SLO attainment over traditional separate setups. To our knowledge, this is the first work to uncover and exploit the opportunities of joint LLM inference and training, paving the way for more resource-efficient deployment of LLMs in production environments.

现代使用大型语言模型(LLMS)经常涉及为适应不断演变的数据和用户反馈而对大型语言模型(LLMS)的现代部署的推论和持续再培训,以适应不断演变的数据和用户反馈。常见做法是将这些工作量分解到孤立的服务器上,造成效率极低(例如,GPU闲置)和对分布式环境中新数据的延迟调整。我们的经验分析表明,这些效率低下是由于在服务期间需求激增和工作量在管道平行培训中出现差异造成的。为了应对这些挑战,我们建议LeMix(一个同时分配和管理LLMMX服务和培训工作量的系统)共同定位和管理。LeMix(LeMix)整合了离线剖析、执行预测机制以及运行时间安排,以便根据工作量特点和系统条件动态调整资源分配。通过理解特定任务的行为和共同执行干扰,在共享节点之间提高利用率和工作质量,同时不损害响应能力。我们的评估表明,LMix(LeMix)的吞吐量增加至3.53x,将误损失降低到0.61x,并将预测损失降低2.12x的响应时间超过传统的SLOE实现传统单独设置和再利用我们的知识,从而探索资源环境。

Article 41

Title@2025-07-28 (1): Improving SpGEMM Performance Through Matrix Reordering and Cluster-wise Computation

Title: Improving SpGEMM Performance Through Matrix Reordering and Cluster-wise Computation

Verbesserung der SpGEMM-Performance durch Matrix-Neuordnung und clusterweise Berechnung

通过矩阵重排序和按簇计算提升 SpGEMM 性能 2507.21253v1

Authors (4): Abdullah Al Raqibul Islam, Helen Xu, Dong Dai, Aydın Buluç

Sparse matrix-sparse matrix multiplication (SpGEMM) is a key kernel in many scientific applications and graph workloads. Unfortunately, SpGEMM is bottlenecked by data movement due to its irregular memory access patterns. Significant work has been devoted to developing row reordering schemes towards improving locality in sparse operations, but prior studies mostly focus on the case of sparse-matrix vector multiplication (SpMV). In this paper, we address these issues with hierarchical clustering for SpGEMM that leverages both row reordering and cluster-wise computation to improve reuse in the second input (B) matrix with a novel row-clustered matrix format and access pattern in the first input (A) matrix. We find that hierarchical clustering can speed up SpGEMM by 1.39x on average with low preprocessing cost (less than 20x the cost of a single SpGEMM on about 90% of inputs). Furthermore, we decouple the reordering algorithm from the clustered matrix format so they can be applied as independent optimizations. Additionally, this paper sheds light on the role of both row reordering and clustering independently and together for SpGEMM with a comprehensive empirical study of the effect of 10 different reordering algorithms and 3 clustering schemes on SpGEMM performance on a suite of 110 matrices. We find that reordering based on graph partitioning provides better SpGEMM performance than existing alternatives at the cost of high preprocessing time. The evaluation demonstrates that the proposed hierarchical clustering method achieves greater average speedup compared to other reordering schemes with similar preprocessing times.

不幸的是,SpGEMM因其不规则的内存访问模式而受数据流动的制约。我们发现,等级组合可以平均地将SpGEMM的速度加快1.39x,而处理前成本则较低(低于20倍,仅次于单一SpGIMM在90%左右投入方面的成本)。此外,我们用组合式矩阵格式对算法进行分解,以便作为独立优化加以应用。此外,本文说明了行重整和分组计算的作用,即第二输入(B)矩阵的再利用和第一个输入(A)矩阵的新的行分组矩阵格式和访问模式。我们发现,等级组合可以平均将SpGEMM的速度加快1.39x的速度,而处理前成本较低(比单一SpGEMMM的90%左右成本要低20倍)。此外,我们用组合式矩阵格式对算法进行分解,这样可以作为独立调整。本文还说明了行重整和分组在第一个输入(A)输入矩阵的新的内层矩阵格式格式格式和组合的更大规模组合的作用。我们从SpGEMIMA前的平面的平面结构中可以对10号组合进行更好的业绩研究,我们根据SBMIMGIMGMA的平面的平面的平面结构进行更好的分析。
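To make the B-matrix reuse argument in the abstract above concrete, here is a minimal Python sketch. It is not the paper's clustered format or its hierarchical clustering: matrices are plain dicts of dicts, the clusters are supplied by hand, and the names `spgemm_rowwise` and `spgemm_clusterwise` are hypothetical. It compares a row-wise (Gustavson-style) product with a variant that processes a cluster of A rows together, counting how often a row of B is fetched.

```python
def spgemm_rowwise(A, B):
    """Row-wise (Gustavson-style) SpGEMM on dict-of-dict sparse matrices.

    A, B: {row: {col: value}}. Returns (C, number of B-row fetches).
    """
    C, fetches = {}, 0
    for i, a_row in A.items():
        acc = {}
        for k in sorted(a_row):
            b_row = B.get(k, {})   # one fetch of row k of B per nonzero of A
            fetches += 1
            for j, b_kj in b_row.items():
                acc[j] = acc.get(j, 0) + a_row[k] * b_kj
        C[i] = acc
    return C, fetches


def spgemm_clusterwise(A, B, clusters):
    """Same product, but rows of A in one cluster share each B-row fetch.

    clusters: list of lists of A row ids; assumed to cover every row once.
    """
    C, fetches = {i: {} for i in A}, 0
    for cluster in clusters:
        # Union of column indices used by any row in the cluster.
        cols = sorted({k for i in cluster for k in A[i]})
        for k in cols:
            b_row = B.get(k, {})   # fetched once for the whole cluster
            fetches += 1
            for i in cluster:
                a_ik = A[i].get(k)
                if a_ik is None:
                    continue
                for j, b_kj in b_row.items():
                    C[i][j] = C[i].get(j, 0) + a_ik * b_kj
    return C, fetches


# Toy input: rows 0 and 1 of A have the same sparsity pattern.
A = {0: {0: 1, 1: 2}, 1: {0: 3, 1: 4}, 2: {2: 5}}
B = {0: {0: 1}, 1: {0: 1, 1: 1}, 2: {1: 2}}
C_row, fetches_row = spgemm_rowwise(A, B)
C_clu, fetches_clu = spgemm_clusterwise(A, B, [[0, 1], [2]])
```

On this toy input both variants produce the same C, while grouping the two similar rows reduces the B-row fetch count. Fetch counts are only a crude proxy for the data movement the abstract refers to.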

Article 42

Title@2025-07-28 (1): Parallel Point-to-Point Shortest Paths and Batch Queries

Title: Parallel Point-to-Point Shortest Paths and Batch Queries

Parallele Punkt-zu-Punkt-Kurze Pfade und Batch-Abfragen

平行点对点最短路径和批量查询 2506.16488v2

Authors (4): Xiaojun Dong, Andy Li, Yan Gu, Yihan Sun

We propose Orionet, efficient parallel implementations of Point-to-Point Shortest Paths (PPSP) queries using bidirectional search (BiDS) and other heuristics, with an additional focus on batch PPSP queries. We present a framework for parallel PPSP built on existing single-source shortest paths (SSSP) frameworks by incorporating pruning conditions. As a result, we develop efficient parallel PPSP algorithms based on early termination, bidirectional search, A$^*$ search, and bidirectional A$^*$ all with simple and efficient implementations. We extend our idea to batch PPSP queries, which are widely used in real-world scenarios. We first design a simple and flexible abstraction to represent the batch so PPSP can leverage the shared information of the batch. Orionet formalizes the batch as a query graph represented by edges between queried sources and targets. In this way, we directly extended our PPSP framework to batched queries in a simple and efficient way. We evaluate Orionet on both single and batch PPSP queries using various graph types and distance percentiles of queried pairs, and compare it against two baselines, GraphIt and MBQ. Both of them support parallel single PPSP and A$^*$ using unidirectional search. On 14 graphs we tested, on average, our bidirectional search is 2.9$\times$ faster than GraphIt, and 6.8$\times$ faster than MBQ. Our bidirectional A$^*$ is 4.4$\times$ and 6.2$\times$ faster than the A$^*$ in GraphIt and MBQ, respectively. For batched PPSP queries, we also provide in-depth experimental evaluation, and show that Orionet provides strong performance compared to the plain solutions.

我们建议使用双向搜索(BIDS)和其他超光速搜索(PPSP)查询,在现有的单一来源最短路径(SSSP)框架的基础上,以现有单一来源最短路径(SSSP)为基础,建立一个平行的PPSP框架。结果,我们根据早期终止、双向搜索、美元搜索和双向直线A$(美元)查询,提出高效的平行的PPSP算法。我们把想法扩大到分批的PPSP查询,这些查询在现实世界情景中广泛使用。我们首先设计一个简单和灵活的抽象来代表批次,以便PPPSP能够利用现有最短路径(SS)的共享信息。Orionet将批次正式化为由查询来源和目标之间的边缘代表的查询图。通过这种方式,我们直接扩展我们的PPPSP框架,以简单而高效的方式分批的查询。我们用双批的OIPSP查询,用不同图表类型和距离的Oralalalalalal 美元来比较我们的OISP查询。我们用不同图表的搜索类型和直径直径直径直径搜索。
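As a point of reference for the bidirectional search with early termination mentioned in the abstract above, the following is a minimal sequential Python sketch of textbook bidirectional Dijkstra. It is not Orionet's parallel implementation. It assumes an undirected graph with non-negative weights stored as an adjacency dict, and the function name `bidirectional_dijkstra` is hypothetical.

```python
import heapq


def bidirectional_dijkstra(graph, source, target):
    """Shortest source-target distance via two simultaneous Dijkstra searches.

    graph: {node: [(neighbor, weight), ...]}, undirected, weights >= 0.
    Returns float('inf') if target is unreachable.
    """
    if source == target:
        return 0.0
    inf = float("inf")
    dist = [{source: 0.0}, {target: 0.0}]        # forward / backward distances
    heaps = [[(0.0, source)], [(0.0, target)]]   # forward / backward frontiers
    settled = [set(), set()]
    best = inf                                   # best s-t path length seen

    while heaps[0] and heaps[1]:
        # Pruning condition: no undiscovered path can beat `best` once the
        # two frontier minima together reach it.
        if heaps[0][0][0] + heaps[1][0][0] >= best:
            break
        # Advance the side whose frontier is currently closer to its origin.
        side = 0 if heaps[0][0][0] <= heaps[1][0][0] else 1
        d, u = heapq.heappop(heaps[side])
        if u in settled[side]:
            continue                             # stale heap entry
        settled[side].add(u)
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist[side].get(v, inf):
                dist[side][v] = nd
                heapq.heappush(heaps[side], (nd, v))
            # If the opposite search already reached v, the two halves
            # form a complete candidate path.
            other = dist[1 - side].get(v)
            if other is not None:
                best = min(best, nd + other)
    return best


example_graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("a", 1), ("c", 2), ("d", 5)],
    "c": [("a", 4), ("b", 2), ("d", 1)],
    "d": [("b", 5), ("c", 1)],
    "e": [],
}
shortest_a_d = bidirectional_dijkstra(example_graph, "a", "d")
```

For a directed graph the backward search would need the reversed edge lists, which this sketch omits.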

Article 43

Title@2025-07-28 (1): Metric Criticality Identification for Cloud Microservices

Title: Metric Criticality Identification for Cloud Microservices

Metrische Criticality Identification für Cloud Microservices

云微服务的指标关键性识别 2501.03547v2

Authors (6): Akanksha Singal, Divya Pathak, Kaustabha Ray, Felix George, Mudit Verma, Pratibha Moogi

Modern cloud-native applications built on microservice architectures present unprecedented challenges for system monitoring and alerting. Site Reliability Engineers (SREs) face the daunting challenge of defining effective monitoring strategies across a multitude of metrics to ensure system reliability, a task that traditionally requires extensive manual expertise. The distributed nature of microservices, characterized by stochastic execution patterns and intricate inter-service dependencies, renders the traditional manual approach of navigating the vast metrics landscape computationally and operationally prohibitive. To address this critical challenge, we propose KIMetrix, a data-driven system that automatically identifies minimal yet comprehensive metric subsets to aid SREs in monitoring microservice applications. KIMetrix leverages information-theoretic measures, specifically entropy and mutual information, to quantify metric criticality while considering the stochastic execution patterns inherent in microservice topologies. Our approach operates solely on lightweight metrics and traces, eliminating the need for expensive processing of unstructured logs, and requires no expert-defined training data. Experimental evaluation on state-of-the-art real-world microservice benchmark datasets demonstrates KIMetrix’s effectiveness in identifying critical metric subsets that provide comprehensive system coverage while significantly reducing the burden on SREs. By automating the identification of essential metrics for alerting, KIMetrix enables more reliable system monitoring without overwhelming operators with false positives or missing critical system events.

以微观服务结构为基础的现代云层应用对系统监测和警报提出了前所未有的挑战; 站点可靠性工程师(SRE)面临一项艰巨的挑战,即确定各种衡量标准的有效监测战略,以确保系统的可靠性,这是传统上需要大量人工专门知识的任务; 微观服务的分散性质,其特点是随机执行模式和复杂的服务间依赖性,因此传统的手工方法在计算上和操作上对巨大的指标地貌进行浏览,这在操作上令人望而却步; 为了应对这一严峻挑战,我们提议KIMetrix,这是一个数据驱动系统,它自动确定最低限度的和全面的指标子集,以协助战略资源秘书处监测微观服务应用; KIMetrix利用信息-理论措施,特别是昆虫和相互信息,在考虑微观服务结构中固有的随机性执行模式的同时,量化指标性指标性。我们的方法仅以轻量度的衡量基准和痕迹为运作,不再需要昂贵地处理非结构化的木材,不需要专家定义的培训数据。对当前最先进的实时微观服务基准系统进行实验性评估,同时在大幅降低临界性指标性基准范围上,同时确定关键性指标性指标性指标性指标性指标性指标性确定,同时在大幅度降低临界性基准系统上减少关键指标性基准系统的负担。
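The abstract above names entropy and mutual information as the measures behind metric selection but does not spell out the procedure. The Python sketch below shows one simple way such measures could be combined. It is an illustrative assumption, not KIMetrix's algorithm: the greedy rule, the `redundancy` threshold, the metric names, and the pre-discretized samples are all hypothetical.

```python
import math
from collections import Counter


def entropy(samples):
    """Shannon entropy (in bits) of a sequence of discrete samples."""
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in Counter(samples).values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two aligned discrete sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def select_metrics(series, redundancy=0.9):
    """Greedy selection: highest-entropy metrics first, skipping any metric
    whose information is mostly shared with an already chosen one.

    series: {metric name: list of discretized samples, all the same length}.
    redundancy: hypothetical cut-off on I(m; chosen) / H(m).
    """
    ranked = sorted(series, key=lambda m: entropy(series[m]), reverse=True)
    chosen = []
    for m in ranked:
        h = entropy(series[m])
        if h == 0:
            continue  # a constant metric carries no information
        if all(mutual_information(series[m], series[c]) / h < redundancy
               for c in chosen):
            chosen.append(m)
    return chosen


# Toy, already-bucketed samples (made up for illustration).
toy_series = {
    "cpu_util": [0, 1, 0, 1, 2, 2, 0, 1],
    "cpu_util_copy": [0, 10, 0, 10, 20, 20, 0, 10],  # relabelled duplicate
    "error_count": [0, 0, 0, 0, 0, 0, 0, 0],         # constant
    "latency_bucket": [0, 0, 1, 1, 0, 1, 1, 0],
}
selected = select_metrics(toy_series)
```

Here the relabelled duplicate and the constant metric are dropped, leaving `cpu_util` and `latency_bucket`. Real metrics would first need discretization, and the abstract's handling of stochastic execution patterns via traces is not modelled.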

Article 44

Title@2025-07-28 (1): COoL-TEE: Client-TEE Collaboration for Resilient Distributed Search

Title: COoL-TEE: Client-TEE Collaboration for Resilient Distributed Search

COoL-TEE: Client-TEE-Kollaboration für resiliente verteilte Suche

COoL-TEE:面向弹性分布式搜索的客户端-TEE协作 2503.19063v2

Authors (4): Matthieu Bettinger, Etienne Rivière, Sonia Ben Mokhtar, Anthony Simonet-Boulogne

Current marketplaces rely on search mechanisms with distributed systems but centralized governance, making them vulnerable to attacks, failures, censorship and biases. While search mechanisms with more decentralized governance (e.g., DeSearch) have been recently proposed, these are still exposed to information head-start attacks (IHS) despite the use of Trusted Execution Environments (TEEs). These attacks allow malicious users to gain a head-start over other users for the discovery of new assets in the market, which gives them an unfair advantage in asset acquisition. We propose COoL-TEE, a TEE-based provider selection mechanism for distributed search, running in single- or multi-datacenter environments, that is resilient to information head-start attacks. COoL-TEE relies on a Client-TEE collaboration, which enables clients to distinguish between slow providers and malicious ones. Performance evaluations in single- and multi-datacenter environments show that, using COoL-TEE, malicious users gain only up to 2% and 7% more assets, respectively, than without IHS, while they can claim 20% or more on top of their fair share in the same conditions with DeSearch.

目前的市场依赖分布式系统搜索机制,但有集中式治理,使其易受攻击、失败、检查和偏见的伤害。虽然最近提出了更分散式治理(例如DeSearch)搜索机制,但尽管使用信任执行环境,这些搜索机制仍然面临信息启动攻击(IHS ) 。这些袭击允许恶意用户在发现市场新资产方面获得领先于其他用户的头部启动,这使他们在获取资产方面获得了不公平的优势。我们提议采用基于 COoL-TEE的基于TEE的供应商选择机制,即分散式搜索机制,在单一或多数据中心环境中运行,能够抵御信息启动式袭击。 CooL-TEE依靠客户与TEE的合作,从而能够区分缓慢的提供者和恶意的提供者。在单一和多数据中心环境中的绩效评估表明,恶意用户在使用COol-TEE后,获得的资产分别只有2%和7%的收益超过没有ISS的2%和7%,同时,他们可以在DeSearch的相同条件下要求占其公平份额的20%或20%以上。

Article 45

Title@2025-07-28 (1): The Case for Time-Shared Computing Resources

Title: The Case for Time-Shared Computing Resources

Der Fall für zeitverteilte Computing-Ressourcen

时间共享电子计算资源案 2507.19287v2

Authors (2): Pierre Jacquet, Adrien Luxey-Bitri

The environmental impact of Information and Communication Technologies (ICT) continues to grow, driven notably by increasing usage, rebound effects, and emerging demands. However, despite the virtual nature of its services, the sector remains inherently constrained by its materiality and cannot rely on an infinite pool of resources. As a result, the wide variety of supported services may need to be managed under stricter limits within hosting facilities in the future. Contrary to common assumptions, we show that tenants typically do not share computing resources, even in environments commonly perceived as mutualized, such as cloud platforms. Time-sharing has been progressively phased out for reasons of performance, security, predictability, and, perhaps more importantly, due to the decreasing cost of computing resources. This paper advocates for managing fewer physical resources by improving resource sharing between tenants. It represents a paradigm shift, moving beyond traditional time-sharing at the hardware level to a higher abstraction. This approach entails “doing with fewer resources” under conditions of “reduced performance”. Nonetheless, enhancing the mutualization of infrastructure can reduce cluster sizes (through consolidation) and improve energy efficiency, with gains related to the accepted performance trade-off, a situation potentially more socially acceptable than eliminating services. We review the current state of the art, identify challenges and opportunities, propose interpretations of Time-Shared Computing, and outline key research directions.

信息和通信技术(信通技术)的环境影响继续扩大,这主要是因为使用量增加、反弹效应和新出现的需求。然而,尽管该部门的服务具有虚拟性质,但该部门仍然受到其重要性的内在限制,不能依赖无限的资源总量。因此,可能需要在未来托管设施内以更严格的限度管理各种支助服务。与通常的假设相反,我们表明,租户通常不共享计算资源,即使在云层平台等通常被视为相互共存的环境中也是如此。由于业绩、安全、可预测性以及也许更重要的是计算资源成本的下降,时间共享已经逐步淘汰。本文主张通过改善租户之间的资源共享来减少对实物资源的管理。这意味着范式转变,从传统的硬件时间共享转向更高的抽象化。这种方法意味着在“降低绩效”的条件下“用更少的资源”。然而,加强基础设施的相互化可以降低集群规模规模(通过整合),提高能源效率,同时取得与公认的绩效交易有关的成果,一种可能比取消服务在社会上更可以接受的状况,以及提出关键方向。我们审查当前的研究大纲。

Article 46

Title@2025-07-28 (1): Accelerating Deterministic Global Optimization via GPU-parallel Interval Arithmetic

Title: Accelerating Deterministic Global Optimization via GPU-parallel Interval Arithmetic

Beschleunigung der Deterministischen globalen Optimierung über GPU-Parallel Interval Arithmetik

通过GPU并行区间算术加速确定性全局优化 2507.20769v1

Authors (7): Hongzhen Zhang, Tim Kerkenhoff, Neil Kichler, Manuel Dahmen, Alexander Mitsos, Uwe Naumann, Dominik Bongartz

Spatial Branch and Bound (B&B) algorithms are widely used for solving nonconvex problems to global optimality, yet they remain computationally expensive. Although some work has been done to speed up B&B via CPU parallelization, GPU parallelization is much less explored. In this work, we investigate the design of a spatial B&B algorithm that involves an interval-based GPU-parallel lower bounding solver: The domain of each B&B node is temporarily partitioned into numerous subdomains, then massive GPU parallelism is leveraged to compute interval bounds of the objective function and constraints on each subdomain, using the Mean Value Form. The resulting bounds are tighter than those achieved via regular interval arithmetic without partitioning, but they remain fast to compute. We implement the method in our open-source solver MAiNGO via CUDA in two manners: wrapping all GPU tasks within one kernel function, or distributing the GPU tasks onto a CUDA graph. Numerical experiments show that using more subdomains leads to significantly tighter lower bounds and thus fewer B&B iterations. Regarding wall clock time, the proposed spatial B&B framework achieves a speedup of three orders of magnitude compared to applying interval arithmetic on the CPU without domain partitioning. Among the two implementations, the one developed with CUDA graph enables higher efficiency. Moreover, in some case studies, the proposed method delivers competitive or better performance compared to MAiNGO’s default solver, which is based on McCormick relaxations. These results highlight the potential of GPU-accelerated bounding techniques to accelerate B&B algorithms.

B&B 空间分支和 Bound (B&B) 算法被广泛用于解决全球最佳化的非convex问题, 但其计算成本仍然很高。尽管已经开展了一些工作, 通过 CPU 平行化加快 B&B 问题, 但 GPU 平行化的探索要少得多。在这项工作中, 我们调查空间 B&B 算法的设计, 其中包括一个间隔为基础的 GPU- 平行的下下拉线求解器: 每个 B&B 节点的域被暂时分割成多个子域, 然后利用大规模 GPU 平行化来计算每个子域的目标函数和限制的间隔区间距, 使用平均值表来计算。由此产生的界限比通过不分割的定期间距算实现的拉快得多, 但是仍然可以快速化。我们用两种方式将这个方法应用于我们的开源解解解码解码解算码的 MINGO : 将所有 GPU 任务都集中在一个内核函数中, 或者将 GUD 任务分配到 CUDA 图表中。快速化实验显示, 较高级的解解解调法使得 B 速度化的进度化 B 的校正平比 B 的校略法的校略的校略法的校正的校略法。
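The lower-bounding idea in the abstract above (partition the node's domain, bound each subdomain with the Mean Value Form, take the minimum) can be illustrated with a small sequential Python sketch. It is not MAiNGO's CUDA implementation: the subdomains are processed in a plain loop rather than in parallel, there is no outward rounding (so the bounds are not rigorous in floating point), and the test function and all names are hypothetical.

```python
def interval_mul(a, b):
    """Product of two intervals given as (lo, hi) tuples."""
    products = (a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1])
    return (min(products), max(products))


def mean_value_lower_bound(f, df_interval, lo, hi):
    """Lower bound of f on [lo, hi] via the Mean Value Form:
    f(X) is contained in f(c) + f'(X) * (X - c), with c the midpoint."""
    c = 0.5 * (lo + hi)
    slope = df_interval(lo, hi)                  # enclosure of f' on [lo, hi]
    spread = interval_mul(slope, (lo - c, hi - c))
    return f(c) + spread[0]


def partitioned_lower_bound(f, df_interval, lo, hi, n):
    """Split [lo, hi] into n equal subdomains and take the smallest bound.
    On a GPU each subdomain would be handled by its own thread."""
    width = (hi - lo) / n
    return min(
        mean_value_lower_bound(f, df_interval, lo + i * width, lo + (i + 1) * width)
        for i in range(n)
    )


def f(x):
    """Toy objective x^2 - 2x; its true minimum on [0, 2] is -1 at x = 1."""
    return x * x - 2.0 * x


def df_interval(lo, hi):
    """Enclosure of f'(x) = 2x - 2 on [lo, hi] (f' is increasing)."""
    return (2.0 * lo - 2.0, 2.0 * hi - 2.0)


bound_1 = partitioned_lower_bound(f, df_interval, 0.0, 2.0, 1)
bound_4 = partitioned_lower_bound(f, df_interval, 0.0, 2.0, 4)
```

For this toy function the single-interval bound is -3, while four subdomains give -1.1875, much closer to the true minimum of -1. This is the tightening effect the abstract attributes to using more subdomains.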

Article 47

Title@2025-07-28 (1): Advancing Compositional LLM Reasoning with Structured Task Relations in Interactive Multimodal Communications

Title: Advancing Compositional LLM Reasoning with Structured Task Relations in Interactive Multimodal Communications

Verbesserung der kompositorischen LLM-Reasoning mit strukturierten Arbeitsbeziehungen in der interaktiven multimodalen Kommunikation

在交互式多模态通信中利用结构化任务关系推进组合式LLM推理 2507.21199v1

Authors (12): Xinye Cao, Hongcan Guo, Guoshun Nan, Jiaoyang Cui, Haoting Qian, Yihan Lin, Yilin Peng, Diyang Zhang, Yanzhao Hou, Huici Wu, Xiaofeng Tao, Tony Q. S. Quek

Interactive multimodal applications (IMAs), such as route planning in the Internet of Vehicles, enrich users’ personalized experiences by integrating various forms of data over wireless networks. Recent advances in large language models (LLMs) utilize mixture-of-experts (MoE) mechanisms to empower multiple IMAs, with each LLM trained individually for a specific task that presents different business workflows. In contrast to existing approaches that rely on multiple LLMs for IMAs, this paper presents a novel paradigm that accomplishes various IMAs using a single compositional LLM over wireless networks. The two primary challenges include 1) guiding a single LLM to adapt to diverse IMA objectives and 2) ensuring the flexibility and efficiency of the LLM in resource-constrained mobile environments. To tackle the first challenge, we propose ContextLoRA, a novel method that guides an LLM to learn the rich structured context among IMAs by constructing a task dependency graph. We partition the learnable parameter matrix of neural layers for each IMA to facilitate LLM composition. Then, we develop a step-by-step fine-tuning procedure guided by task relations, including training, freezing, and masking phases. This allows the LLM to learn to reason among tasks for better adaptation, capturing the latent dependencies between tasks. For the second challenge, we introduce ContextGear, a scheduling strategy to optimize the training procedure of ContextLoRA, aiming to minimize computational and communication costs through a strategic grouping mechanism. Experiments on three benchmarks show the superiority of the proposed ContextLoRA and ContextGear. Furthermore, we prototype our proposed paradigm on a real-world wireless testbed, demonstrating its practical applicability for various IMAs. We will release our code to the community.

汽车互联网线路规划等互动多式联运应用程序(IMAs)通过将各种形式的数据纳入无线网络,丰富了用户的个人经验。大型语言模型(LLMs)最近的进展利用了混合专家机制增强多种IMA的能力,每个LLM都经过了个人培训,以完成不同业务流程的具体任务。与现有方法相比,依赖多个LLMs来完成各种IMA,在无线网络上使用单一组成LLM。两个主要挑战包括:1)指导一个单一LLM以适应多种IMA目标;2)确保LLM在资源限制的流动环境中的灵活性和效率。为了应对第一个挑战,我们提议Cloin LoRA,这是指导LMM通过构建任务依赖性图表来学习IMA之间丰富的结构背景。我们将每个IMA的神经层可学习参数矩阵分解,以方便LLMM的构成。然后,我们制定了一个由任务关系引导的逐步调整程序,包括培训、冻结、掩藏其实际成本;我们建议LLOMLMsalorizalalalalal 来学习一个更精确的流程,以便我们组织了解一个升级的流程。

Article 48

Title@2025-07-28 (1): CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning

Title: CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning

CUDA-L1: Verbesserung der CUDA-Optimierung durch kontrastives Verstärkungslernen

CUDA-L1:通过对比强化学习改进CUDA优化 2507.14111v4

Authors (5): Xiaoya Li, Xiaofei Sun, Albert Wang, Jiwei Li, Chris Shum

The exponential growth in demand for GPU computing resources has created an urgent need for automated CUDA optimization strategies. While recent advances in LLMs show promise for code generation, current SOTA models achieve low success rates in improving CUDA speed. In this paper, we introduce CUDA-L1, an automated reinforcement learning framework for CUDA optimization that employs a novel contrastive RL algorithm. CUDA-L1 achieves significant performance improvements on the CUDA optimization task: trained on NVIDIA A100, it delivers an average speedup of x3.12 with a median speedup of x1.42 across all 250 CUDA kernels of KernelBench, with peak speedups reaching x120. Furthermore, the model also demonstrates portability across GPU architectures, achieving average speedups of x3.12 on L40, x2.50 on RTX 3090, x2.39 on H100, and x2.37 on H20 despite being optimized specifically for A100. The capabilities of CUDA-L1 demonstrate that RL can transform an initially poorly performing LLM into an effective CUDA optimizer through speedup-based reward signals alone, without human expertise or domain knowledge. This paradigm opens possibilities for automated optimization of CUDA operations, and holds promise to substantially promote GPU efficiency and alleviate the rising pressure on GPU computing resources. We also identify important challenges posed by training RL models for tasks like CUDA development, where RL often learns to exploit loopholes in reward functions rather than solve the intended optimization problems. By identifying these failure modes and analyzing their root causes, we develop practical methods for creating more robust training procedures that prevent reward hacking.

对GPU计算资源的需求急剧增长,导致迫切需要CUDA优化战略。虽然LLM公司最近的进展显示,对代码生成有希望,但目前的SOTA模型在提高CUDA速度方面成功率较低。在本文中,我们引入CUDA-L1,一个CUDA优化的自动强化学习框架,采用新的对比RL算法。CUDA-L1在CUDA优化任务上取得了显著的业绩改进:对NVDIA A100进行了培训,它提供了平均速度x3.12,在250个CENNelBench公司的所有CUDA内仓中,平均速度为x1.42,而目前的SOTA模型在提高CUPA速度速度速度方面则取得了低成功率。此外,该模型还展示了GPUPA优化结构的平均速度,在LX 3090, x2.50 和xPUFA优化任务上取得了显著的成绩改进。CUDA-L的精度评估能力表明,将最初表现较差的LMLM(CLM)转变为一个有效的CUDA 最佳学习模式,在快速操作上也使得GUDUDUDLOLOLOrals 成为了这种不断升级的升级的升级的升级的轨道上,从而使得GUDLLLLLLLL的升级的升级的升级的操作成为了这些速度。

Article 49

Title@2025-07-28 (1): RIMMS: Runtime Integrated Memory Management System for Heterogeneous Computing

Title: RIMMS: Runtime Integrated Memory Management System for Heterogeneous Computing

RIMMS: Laufzeit-Integriertes Speicher-Management-System für Heterogenes Rechnen

RIMMS: 运行时异质计算综合记忆管理系统 2507.20514v1

Authors (8): Serhan Gener, Aditya Ukarande, Shilpa Mysore Srinivasa Murthy, Sahil Hassan, Joshua Mack, Chaitali Chakrabarti, Umit Ogras, Ali Akoglu

Efficient memory management in heterogeneous systems is increasingly challenging due to diverse compute architectures (e.g., CPU, GPU, FPGA) and dynamic task mappings not known at compile time. Existing approaches often require programmers to manage data placement and transfers explicitly, or assume static mappings that limit portability and scalability. This paper introduces RIMMS (Runtime Integrated Memory Management System), a lightweight, runtime-managed, hardware-agnostic memory abstraction layer that decouples application development from low-level memory operations. RIMMS transparently tracks data locations, manages consistency, and supports efficient memory allocation across heterogeneous compute elements without requiring platform-specific tuning or code modifications. We integrate RIMMS into a baseline runtime and evaluate with complete radar signal processing applications across CPU+GPU and CPU+FPGA platforms. RIMMS delivers up to 2.43X speedup on GPU-based and 1.82X on FPGA-based systems over the baseline. Compared to IRIS, a recent heterogeneous runtime system, RIMMS achieves up to 3.08X speedup and matches the performance of native CUDA implementations while significantly reducing programming complexity. Despite operating at a higher abstraction level, RIMMS incurs only 1-2 cycles of overhead per memory management call, making it a low-cost solution. These results demonstrate RIMMS’s ability to deliver high performance and enhanced programmer productivity in dynamic, real-world heterogeneous environments.

由于不同的计算结构(如:CPU、GPU、GPU、FPGA)和在汇编时不为人知的动态任务映射,不同系统中的内存管理越来越具有挑战性,因为不同的计算结构(如:CPU、GPU、FPGA)和动态任务映射,现有办法往往要求程序设计者明确管理数据定位和传输,或进行静态映射,限制可移动性和可缩放性。本文介绍了RIMMS(RIMMS)(RIMMS)(RIMMS)(RPU+GPU和CPU+FPGA)平台的完整雷达信号处理应用程序,这是一个轻巧的、运行时间管理的运行时间管理、运行时间管理的硬件存储存储系统(GPU)和基于FPGA(FGA)的系统(1.82X),与低级存储存储存储系统(IRIS)相比,最近一个混合运行运行环境,RIMMS(RIM)在3.08X(S-II)标准调整或代码IMA(RIM)的快速执行周期内,同时大幅展示高水平的绩效管理。

Article 50

Title@2025-07-27 (7): Communication-Efficient Distributed Training for Collaborative Flat Optima Recovery in Deep Learning

Title: Communication-Efficient Distributed Training for Collaborative Flat Optima Recovery in Deep Learning

Kommunikationseffizientes verteiltes Training zur kollaborativen Wiederherstellung flacher Optima im Deep Learning

面向深度学习中协作式平坦最优点恢复的通信高效分布式训练 2507.20424v1

Authors (2): Tolga Dimlioglu, Anna Choromanska

We study centralized distributed data parallel training of deep neural networks (DNNs), aiming to improve the trade-off between communication efficiency and model performance of the local gradient methods. To this end, we revisit the flat-minima hypothesis, which suggests that models with better generalization tend to lie in flatter regions of the loss landscape. We introduce a simple, yet effective, sharpness measure, Inverse Mean Valley, and demonstrate its strong correlation with the generalization gap of DNNs. We incorporate an efficient relaxation of this measure into the distributed training objective as a lightweight regularizer that encourages workers to collaboratively seek wide minima. The regularizer exerts a pushing force that counteracts the consensus step pulling the workers together, giving rise to the Distributed Pull-Push Force (DPPF) algorithm. Empirically, we show that DPPF outperforms other communication-efficient approaches and achieves better generalization performance than local gradient methods and synchronous gradient averaging, while significantly reducing communication overhead. In addition, our loss landscape visualizations confirm the ability of DPPF to locate flatter minima. On the theoretical side, we show that DPPF guides workers to span flat valleys, with the final valley width governed by the interplay between push and pull strengths, and that its pull-push dynamics is self-stabilizing. We further provide generalization guarantees linked to the valley width and prove convergence in the non-convex setting.
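The interplay between the pull and push forces can be illustrated with a toy update. The sketch below is a minimal illustration, not the authors' implementation: the function name `pull_push_step`, the parameters `alpha` (pull strength) and `lam` (push strength), and the exact form of the update are assumptions, and the local gradient steps are omitted entirely. Under this assumed rule with two workers, each worker's distance r to the worker average evolves as r ← (1 - alpha)·r + lam, so it settles at lam / alpha whether the workers start too close together or too far apart.

```python
import math


def pull_push_step(workers, alpha, lam):
    """One assumed pull-push update around the worker average.

    workers: list of parameter vectors (lists of floats).
    alpha:   pull strength toward the average (hypothetical name).
    lam:     push strength away from the average (hypothetical name).
    """
    dim = len(workers[0])
    avg = [sum(w[k] for w in workers) / len(workers) for k in range(dim)]
    updated = []
    for w in workers:
        diff = [w[k] - avg[k] for k in range(dim)]
        dist = math.sqrt(sum(d * d for d in diff))
        if dist == 0.0:
            # No direction to push along; leave the worker where it is.
            updated.append(list(w))
            continue
        # Pull shrinks the offset by alpha; push adds lam along the unit offset.
        scale = -alpha + lam / dist
        updated.append([w[k] + scale * diff[k] for k in range(dim)])
    return updated


def distance_to_average(workers):
    """Distance of each worker to the current worker average."""
    dim = len(workers[0])
    avg = [sum(w[k] for w in workers) / len(workers) for k in range(dim)]
    return [math.dist(w, avg) for w in workers]
```

Running the step repeatedly from `[[0.0, 0.0], [10.0, 0.0]]` and from `[[0.0, 0.0], [0.2, 0.0]]` with `alpha=0.5, lam=1.0` drives both configurations to the same spread, which is one way to picture the self-stabilizing behaviour the abstract describes.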

我们研究的是中央分布式的深神经网络数据平行培训,目的是改善通信效率和当地梯度方法模型性能之间的平衡。我们为此重新审视平米假设,认为更加概括化的模式往往存在于损失地貌的平坦区域。我们采用了简单而有效、敏锐的测量方法,并展示了它与DNNT普遍化差距的强烈关联性。我们将这一措施有效地放松到分布式培训目标中,以此作为一种轻量级常规,鼓励工人合作寻找宽度迷你。正规化者施加了一种推力,抵消了协商一致将工人拉在一起的步伐,从而形成了分散式拉动-普什部队(DPPF)的算法。生动性地说,我们表明DPPF比其他通信效率高的方法要优于当地梯度方法和同步加速度平均度差,同时显著降低通信费。此外,我们的损失景观直观化证实DPPF有能力找到美度迷你。在理论的一面,我们展示的是,在理论的一面,我们展示的是,使DPPF的平衡性动力与稳定在平谷层之间进一步拉动。

Article 51

Title@2025-07-27 (7): A Comparative Study of OpenMP Scheduling Algorithm Selection Strategies

Title: A Comparative Study of OpenMP Scheduling Algorithm Selection Strategies

Eine vergleichende Studie von Strategien zur Auswahl von OpenMP-Scheduling-Algorithmen

OpenMP 调度算法选择策略的比较研究 2507.20312v1

Authors (6): Jonas H. Müller Korndörfer, Ali Mohammed, Ahmed Eleliemy, Quentin Guilloteau, Reto Krummenacher, Florina M. Ciorba

Scientific and data science applications are becoming increasingly complex, with growing computational and memory demands. Modern high performance computing (HPC) systems provide high parallelism and heterogeneity across nodes, devices, and cores. To achieve good performance, effective scheduling and load balancing techniques are essential. Parallel programming frameworks such as OpenMP now offer a variety of advanced scheduling algorithms to support diverse applications and platforms. This creates an instance of the scheduling algorithm selection problem, which involves identifying the most suitable algorithm for a given combination of workload and system characteristics. In this work, we explore learning-based approaches for selecting scheduling algorithms in OpenMP. We propose and evaluate expert-based and reinforcement learning (RL)-based methods, and conduct a detailed performance analysis across six applications and three systems. Our results show that RL methods are capable of learning high-performing scheduling decisions, although they require significant exploration, with the choice of reward function playing a key role. Expert-based methods, in contrast, rely on prior knowledge and involve less exploration, though they may not always identify the optimal algorithm for a specific application-system pair. By combining expert knowledge with RL-based learning, we achieve improved performance and greater adaptability. Overall, this work demonstrates that dynamic selection of scheduling algorithms during execution is both viable and beneficial for OpenMP applications. The approach can also be extended to MPI-based programs, enabling optimization of scheduling decisions across multiple levels of parallelism.
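To make the selection problem concrete, the following sketch treats it as a simple epsilon-greedy bandit, which is a much simpler stand-in for the RL methods the abstract refers to and is not taken from the paper. The function `select_schedule`, the callback `measure_time`, and the use of measured loop time as the (negative) reward are all assumptions for illustration; the candidate names are the standard OpenMP schedule kinds.

```python
import random


def select_schedule(candidates, measure_time, steps, epsilon=0.1, seed=0):
    """Pick the candidate with the lowest mean measured time.

    candidates:   list of schedule names, e.g. ["static", "dynamic", "guided"].
    measure_time: hypothetical callback that runs the loop once with the given
                  schedule and returns its execution time in seconds.
    """
    rng = random.Random(seed)
    total = {c: 0.0 for c in candidates}
    count = {c: 0 for c in candidates}

    def mean_time(c):
        return total[c] / count[c]

    for _ in range(steps):
        untried = [c for c in candidates if count[c] == 0]
        if untried:
            choice = untried[0]               # try every candidate once
        elif rng.random() < epsilon:
            choice = rng.choice(candidates)   # explore
        else:
            choice = min(candidates, key=mean_time)  # exploit
        total[choice] += measure_time(choice)
        count[choice] += 1

    return min(candidates, key=mean_time), count
```

The exploration rate `epsilon` controls the trade-off the abstract mentions: more exploration costs time on poor schedules but reduces the risk of locking onto a suboptimal one.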

科学与数据科学应用日益复杂,对计算和内存的需求不断增长。现代高性能计算(HPC)系统在节点、设备和核心层面提供高度并行性和异构性。要获得良好性能,有效的调度和负载均衡技术必不可少。OpenMP 等并行编程框架如今提供多种先进的调度算法,以支持多样的应用和平台。这就构成了调度算法选择问题的一个实例,即针对给定的工作负载与系统特征组合找出最合适的算法。在这项工作中,我们探索在 OpenMP 中选择调度算法的基于学习的方法。我们提出并评估了基于专家知识的方法和基于强化学习(RL)的方法,并在六个应用和三个系统上进行了详细的性能分析。结果表明,RL 方法能够学到高性能的调度决策,但需要大量探索,且奖励函数的选择起着关键作用。相比之下,基于专家知识的方法依赖先验知识、所需探索较少,但未必总能为特定的应用-系统组合找到最优算法。通过将专家知识与基于 RL 的学习相结合,我们获得了更好的性能和更强的适应性。总体而言,这项工作表明,在执行过程中动态选择调度算法对 OpenMP 应用既可行又有益。该方法还可扩展到基于 MPI 的程序,从而在多个并行层次上优化调度决策。

Article 52

Title@2025-07-27 (7): Silent Self-Stabilising Leader Election in Programmable Matter Systems with Holes

Title: Silent Self-Stabilising Leader Election in Programmable Matter Systems with Holes

Stille selbststabilisierende Leader-Wahl in programmierbaren Materiesystemen mit Löchern

在有洞洞的可规划物质系统中进行无声自稳定领导人选举 2507.20201v1

Authors (3): Jérémie Chalopin, Shantanu Das, Maria Kokkou

Leader election is a fundamental problem in distributed computing, particularly within programmable matter systems, where coordination among simple computational entities is crucial for solving complex tasks. In these systems, particles (i.e., constant memory computational entities) operate in a regular triangular grid as described in the geometric Amoebot model. While leader election has been extensively studied in non self-stabilising settings, self-stabilising solutions remain more limited. In this work, we study the problem of self-stabilising leader election in connected (but not necessarily simply connected) configurations. We present the first self-stabilising algorithm for programmable matter that guarantees the election of a unique leader under an unfair scheduler, assuming particles share a common sense of direction. Our approach leverages particle movement, a capability not previously exploited in the self-stabilising context. We show that movement in conjunction with particles operating in a grid can overcome classical impossibility results for constant-memory systems established by Dolev et al.

领导人选举是分配计算中的一个基本问题,特别是在可编程物质系统中,在这种系统中,简单的计算实体之间的协调对于解决复杂的任务至关重要。在这些系统中,粒子(即恒记忆计算实体)在固定的三角网格中运行,正如几何阿莫博特模型所描述的那样。虽然在非自我稳定的环境中对领导人选举进行了广泛研究,但自我稳定解决方案仍然较为有限。在这项工作中,我们研究了在连接(但不一定只是连接)的配置中自我稳定领导人选举的问题。我们提出了第一个可编程事项的自我稳定算法,该算法保证了在不公平的排程中选举一个独特的领导人,假设粒子具有共同的方向感。我们的方法利用了粒子运动,而这种能力以前在自我稳定的背景下没有开发。我们表明,与在电网中运行的粒子一起移动可以克服Dolev等人建立的恒定系统的传统不可能的结果。

Article 53

Title@2025-07-27 (7): High-Performance Parallel Optimization of the Fish School Behaviour on the Setonix Platform Using OpenMP

Title: High-Performance Parallel Optimization of the Fish School Behaviour on the Setonix Platform Using OpenMP

Leistungsstarke Paralleloptimierung des Fish School Verhaltens auf der Setonix-Plattform mit OpenMP

基于 OpenMP 的 Setonix 平台上鱼群行为算法的高性能并行优化 2507.20173v1

Authors (2): Haitian Wang, Long Qin

This paper presents an in-depth investigation into the high-performance parallel optimization of the Fish School Behaviour (FSB) algorithm on the Setonix supercomputing platform using the OpenMP framework. Given the increasing demand for enhanced computational capabilities for complex, large-scale calculations across diverse domains, there’s an imperative need for optimized parallel algorithms and computing structures. The FSB algorithm, inspired by nature’s social behavior patterns, provides an ideal platform for parallelization due to its iterative and computationally intensive nature. This study leverages the capabilities of the Setonix platform and the OpenMP framework to analyze various aspects of multi-threading, such as thread counts, scheduling strategies, and OpenMP constructs, aiming to discern patterns and strategies that can elevate program performance. Experiments were designed to rigorously test different configurations, and our results not only offer insights for parallel optimization of FSB on Setonix but also provide valuable references for other parallel computational research using OpenMP. Looking forward, other factors, such as cache behavior and thread scheduling strategies at micro and macro levels, hold potential for further exploration and optimization.

本文对使用 OpenMP 框架的Setonix 超级计算平台上的鱼类学校行为(FSB) 算法的高度性平行优化进行了深入的调查。鉴于对复杂、大规模计算在不同领域的需求日益增加,迫切需要优化平行算法和计算结构。 FSB 算法受自然社会行为模式的启发,为由于迭接和计算密集性而平行化提供了理想的平台。本研究利用Setonix 平台和 OpenMP 框架的能力,分析多重阅读的各个方面,例如线条计数、排期战略和OpenMP 构造,以辨别能够提高程序性能的模式和战略。实验旨在严格测试不同配置,我们的结果不仅为Setonix 的FSB 平行优化提供了深刻的见解,而且还为使用 OpenMP 进行其他平行计算研究提供了宝贵的参考。展望,其他因素,如微型和宏观两级的缓存行为和线线列战略等,具有进一步探索和优化的潜力。

Article 54

Title@2025-07-27 (7): Syno: Structured Synthesis for Neural Operators

Title: Syno: Structured Synthesis for Neural Operators

Syno: Strukturierte Synthese für neurale Operatoren

Syno:神经算子的结构化合成 2410.23745v2

Authors (4): Yongqi Zhuo, Zhengyuan Su, Chenggang Zhao, Mingyu Gao

The desire for better prediction accuracy and higher execution performance in neural networks never ends. Neural architecture search (NAS) and tensor compilers are two popular techniques for pursuing these two goals, but both are limited to composing or optimizing existing manually designed operators rather than coming up with completely new designs. In this work, we explore the less studied direction of neural operator synthesis, which aims to automatically and efficiently discover novel neural operators with better accuracy and/or speed. We develop an end-to-end framework, Syno, to realize practical neural operator synthesis. Syno makes use of a novel set of fine-grained primitives defined on tensor dimensions, which ensure various desired properties to ease model training, and also enable expression canonicalization techniques to avoid redundant candidates during search. Syno further adopts a novel guided synthesis flow to obtain valid operators matched with the specified input/output dimension sizes, and leverages efficient stochastic tree search algorithms to quickly explore the design space. We demonstrate that Syno discovers better operators with average speedups of $1.37\times$ to $2.06\times$ on various hardware and compiler choices, while keeping less than 1% accuracy loss even on NAS-optimized models.

在神经网络中,改善预测准确性和更高的执行性性能的愿望永无止尽。神经结构搜索(NAS)和高压编译器是优化这两个目标的两种流行技术,但两者都局限于形成或优化现有的手工设计操作员,而不是提出全新的设计。在这项工作中,我们探索神经操作员合成的学习较少的方向,其目的是以更准确和/或速度自动和高效地发现新的神经操作员,目的是以更准确和/或更快的方式自动和高效地发现新的神经操作员。我们开发了一个端到端框架Syno,以实现实际神经操作员合成。协同利用一套在发声维度上定义的精细微原始体,确保各种理想的特性,以方便模型培训,并且还能够使表达可化技术在搜索中避免多余的候选人。协同还进一步采用了一种新的引导合成流程,以获得与规定的投入/输出尺寸相匹配的有效操作员,并利用高效的随机树搜索算法快速探索设计空间。我们证明Sylo在各种硬件和编译模型上发现更好的操作员,平均速度为1.37美元至2.06美元。同时减少损失1美元。

Article 55

Title@2025-07-27 (7): Accelerating Containerized Service Delivery at the Network Edge

Title: Accelerating Containerized Service Delivery at the Network Edge

Beschleunigen der containerisierten Service-Lieferung am Netzwerkrand

加速在网络边缘提供集装箱化服务 2507.20116v1

Authors (8): Yinuo Deng, Hailiang Zhao, Dongjing Wang, Peng Chen, Wenzhuo Qian, Jianwei Yin, Schahram Dustdar, Shuiguang Deng

Efficient container image distribution is crucial for enabling machine learning inference at the network edge, where resource limitations and dynamic network conditions create significant challenges. In this paper, we present PeerSync, a decentralized P2P-based system designed to optimize image distribution in edge environments. PeerSync employs a popularity- and network-aware download engine that dynamically adapts to content popularity and real-time network conditions using a sliding window mechanism. PeerSync further integrates automated tracker election for rapid peer discovery and dynamic cache management for efficient storage utilization. We implement PeerSync with 8000+ lines of Rust code and test its performance extensively on both physical edge devices and Docker-based emulations. Experimental results show that PeerSync delivers a remarkable speed increase of 2.72$\times$, 1.79$\times$, and 1.28$\times$ compared to the Baseline, Dragonfly, and Kraken, respectively, while significantly reducing peak cross-network traffic by 90.72% under congested and varying network conditions.

高效的集装箱图像发布对于在网络边缘进行机器学习的推论至关重要,因为在网络边缘,资源限制和动态网络条件造成了重大挑战。在本文件中,我们介绍PeerSync,这是一个分散的P2P系统,旨在优化边缘环境中的图像发布。PeerSync使用一种广受欢迎和网络认知的下载引擎,使用滑动窗口机制,动态地适应内容受欢迎和实时网络条件。PeerSync进一步整合了自动追踪器选举,用于快速同行发现和动态缓存管理,以便高效储存。我们用8000+线的Rust代码执行PeerSync,并在物理边缘装置和基于Docker的模拟器中广泛测试其性能。实验结果显示,PeerSync与基线、龙蝇和克拉肯相比,其速度分别显著增加2.72美元、1.79美元和1.28美元,同时在凝聚和不同网络条件下,将顶端的跨网络交通量大幅减少90.72。

Article 56

Title@2025-07-27 (7): Data-Locality-Aware Task Assignment and Scheduling for Distributed Job Executions

Title: Data-Locality-Aware Task Assignment and Scheduling for Distributed Job Executions

Datenlokalitätsbewusste Aufgabenzuweisung und -planung für verteilte Jobausführungen

面向分布式作业执行的数据局部性感知任务分配与调度 2407.08584v4

Authors (5): Hailiang Zhao, Xueyan Tang, Peng Chen, Jianwei Yin, Shuiguang Deng

This paper addresses the data-locality-aware task assignment and scheduling problem for distributed job executions. Our goal is to minimize job completion times without prior knowledge of future job arrivals. We propose an Optimal Balanced Task Assignment algorithm (OBTA), which achieves minimal job completion times while significantly reducing computational overhead through efficient narrowing of the solution search space. To balance performance and efficiency, we extend the approximate Water-Filling (WF) algorithm, providing a rigorous proof that its approximation factor equals the number of task groups in a job. We also introduce a novel heuristic, Replica-Deletion (RD), which outperforms WF by leveraging global optimization techniques. To further enhance scheduling efficiency, we incorporate job ordering strategies based on a shortest-estimated-time-first policy, reducing average job completion times across workloads. Extensive trace-driven evaluations validate the effectiveness and scalability of the proposed algorithms.

本文论述数据-地点意识任务分配和分布式工作处决时间安排问题。我们的目标是在不事先了解未来到来工作的情况下尽量减少完成工作的时间。我们建议采用最佳平衡任务分配算法(OBTA),该算法通过有效缩小解决方案搜索空间,实现最低完成工作时间,同时大幅减少计算管理费用。为了平衡业绩和效率,我们推广了大约的打水算法(WF),提供严格的证据证明其近似系数等于工作任务组的数量。我们还引入了一种新的超常、复制-Deletion(RD),它通过利用全球优化技术比WFD更完善。为了进一步提高排期效率,我们根据最短的预计时间第一政策,纳入工作定级战略,减少工作量的平均完成时间。广泛的追踪评估证实了拟议算法的有效性和可扩展性。

Article 57

Title@2025-07-26 (6): Racing to Idle: Energy Efficiency of Matrix Multiplication on Heterogeneous CPU and GPU Architectures

Title: Racing to Idle: Energy Efficiency of Matrix Multiplication on Heterogeneous CPU and GPU Architectures

Racing to Idle: Energieeffizienz der Matrix-Multiplikation auf heterogenen CPU- und GPU-Architekturen

Racing to Idle:异构 CPU 与 GPU 架构上矩阵乘法的能效 2507.20063v1

Authors (2): Mufakir Qamar Ansari, Mudabir Qamar Ansari

The paradigm shift towards multi-core and heterogeneous computing, driven by the fundamental power and thermal limits of single-core processors, has established energy efficiency as a first-class design constraint in high-performance computing (HPC). Heterogeneous systems, integrating traditional multi-core CPUs with specialized accelerators like discrete (dGPU) and integrated (iGPU) graphics processing units, offer a compelling path to navigating the trade-offs between performance and power. However, quantifying these trade-offs on widely accessible hardware remains a critical area of study. This paper presents a direct, empirical measurement of the performance and energy-to-solution of a canonical HPC workload – a 4096x4096 matrix-matrix multiplication – on three distinct compute architectures within a single consumer-grade laptop: a multi-core AMD Ryzen 7 5800H CPU, a discrete NVIDIA GeForce GTX 1650 GPU, and an integrated AMD Radeon Vega GPU. Using standard, validated, and minimally intrusive tools such as Linux perf and nvidia-smi, we find that the discrete GPU is not only the performance leader, achieving a 93.5x speedup over the CPU, but is also the most energy-efficient, consuming only 2% of the energy used by the CPU, resulting in a 50-fold improvement in energy efficiency. These findings provide a practical demonstration of the “race to idle” principle and offer clear, quantitative guidance on architectural choices for energy-aware software development.
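The reported figures can be related through the identity energy = average power × time. The short sketch below works through that arithmetic using only the two numbers quoted in the abstract (93.5x speedup, 2% of the CPU's energy); the function name `energy_ratios` is hypothetical, and the implied power ratio is a derived, approximate value based on rounded figures rather than something the abstract reports.

```python
def energy_ratios(speedup, energy_fraction):
    """Relate speedup and relative energy-to-solution.

    speedup:         t_cpu / t_gpu (93.5 in the abstract).
    energy_fraction: E_gpu / E_cpu (0.02 in the abstract).
    Returns (energy-efficiency gain, implied average power ratio P_gpu / P_cpu).
    """
    # Same work done with a fraction of the energy -> 1 / fraction more work per joule.
    efficiency_gain = 1.0 / energy_fraction
    # E = P * t, so E_gpu / E_cpu = (P_gpu / P_cpu) * (t_gpu / t_cpu).
    avg_power_ratio = energy_fraction * speedup
    return efficiency_gain, avg_power_ratio


gain, power_ratio = energy_ratios(speedup=93.5, energy_fraction=0.02)
```

With these inputs the gain is the 50-fold improvement stated in the abstract, and the implied average power ratio is roughly 1.9: under this reading the GPU draws more power while running but finishes so much sooner that total energy drops, which is the "race to idle" effect.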

受单核处理器基本功耗和散热极限的驱动,计算范式转向多核与异构计算,能效由此成为高性能计算(HPC)中的一等设计约束。异构系统将传统多核 CPU 与独立显卡(dGPU)、集成显卡(iGPU)等专用加速器相结合,为权衡性能与功耗提供了一条有吸引力的途径。然而,在普及型硬件上量化这些权衡仍是一个关键的研究方向。本文在一台消费级笔记本电脑内的三种不同计算架构上,对一个典型 HPC 工作负载(4096x4096 矩阵乘法)的性能和求解能耗(energy-to-solution)进行了直接的实证测量:多核 AMD Ryzen 7 5800H CPU、独立的 NVIDIA GeForce GTX 1650 GPU,以及集成的 AMD Radeon Vega GPU。借助 Linux perf 和 nvidia-smi 等标准、经过验证且侵入性极小的工具,我们发现独立 GPU 不仅性能领先,相对 CPU 实现 93.5 倍加速,而且能效最高,能耗仅为 CPU 的 2%,能效提升达 50 倍。这些发现对“race to idle”原则给出了实际验证,并为面向能耗的软件开发中的架构选择提供了清晰的定量指导。

Article 58

Title@2025-07-26 (6): $K^4$: Online Log Anomaly Detection Via Unsupervised Typicality Learning

Title: $K^4$: Online Log Anomaly Detection Via Unsupervised Typicality Learning

$K^4$: Online-Erkennung von Log-Anomalien durch unüberwachtes Typikalitätslernen

$K^4$:基于无监督典型性学习的在线日志异常检测 2507.20051v1

Authors (6): Weicong Chen, Vikash Singh, Zahra Rahmani, Debargha Ganguly, Mohsen Hariri, Vipin Chaudhary

Existing Log Anomaly Detection (LogAD) methods are often slow, dependent on error-prone parsing, and use unrealistic evaluation protocols. We introduce $K^4$, an unsupervised and parser-independent framework for high-performance online detection. $K^4$ transforms arbitrary log embeddings into compact four-dimensional descriptors (Precision, Recall, Density, Coverage) using efficient k-nearest neighbor (k-NN) statistics. These descriptors enable lightweight detectors to accurately score anomalies without retraining. Using a more realistic online evaluation protocol, $K^4$ sets a new state-of-the-art (AUROC: 0.995-0.999), outperforming baselines by large margins while being orders of magnitude faster, with training under 4 seconds and inference as low as 4 $\mu$s.

现有的日志异常检测(LogAD)方法往往速度慢,依赖容易出错的日志解析,并采用不切实际的评估协议。我们提出 $K^4$,一个无监督且不依赖解析器的高性能在线检测框架。$K^4$ 利用高效的 k 近邻(k-NN)统计量,将任意日志嵌入转换为紧凑的四维描述符(Precision、Recall、Density、Coverage)。这些描述符使轻量级检测器无需重新训练即可准确地为异常打分。在更贴近实际的在线评估协议下,$K^4$ 取得了新的最先进结果(AUROC:0.995-0.999),大幅超越基线方法,同时速度快数个数量级,训练时间不到 4 秒,推理延迟低至 4 $\mu$s。

Article 59

Title@2025-07-26 (6): Parallel Hierarchical Agglomerative Clustering in Low Dimensions

Title: Parallel Hierarchical Agglomerative Clustering in Low Dimensions

Paralleles hierarchisches agglomeratives Clustering in niedrigen Dimensionen

低维空间中的并行层次凝聚聚类 2507.20047v1

Authors (7): MohammadHossein Bateni, Laxman Dhulipala, Willem Fletcher, Kishen N Gowda, D Ellis Hershkowitz, Rajesh Jayaram, Jakub Łącki

Hierarchical Agglomerative Clustering (HAC) is an extensively studied and widely used method for hierarchical clustering in $\mathbb{R}^k$ based on repeatedly merging the closest pair of clusters according to an input linkage function $d$. Highly parallel (i.e., NC) algorithms are known for $(1+\epsilon)$-approximate HAC (where near-minimum rather than minimum pairs are merged) for certain linkage functions that monotonically increase as merges are performed. However, no such algorithms are known for many important but non-monotone linkage functions such as centroid and Ward’s linkage. In this work, we show that a general class of non-monotone linkage functions – which include centroid and Ward’s distance – admit efficient NC algorithms for $(1+\epsilon)$-approximate HAC in low dimensions. Our algorithms are based on a structural result which may be of independent interest: the height of the hierarchy resulting from any constant-approximate HAC on $n$ points for this class of linkage functions is at most $\operatorname{poly}(\log n)$ as long as $k = O(\log \log n / \log \log \log n)$. Complementing our upper bounds, we show that NC algorithms for HAC with these linkage functions in arbitrary dimensions are unlikely to exist by showing that HAC is CC-hard when $d$ is centroid distance and $k = n$.
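For readers unfamiliar with why centroid linkage is called non-monotone, the sketch below implements the naive, exact, sequential version of HAC with centroid distance. It is a baseline illustration only, not the parallel approximate algorithm of the paper; the function name `centroid_hac` and the merge-record format are assumptions made for this example.

```python
import math


def centroid_hac(points):
    """Naive exact HAC with centroid linkage (cubic time, sequential).

    Returns a list of merges (id_a, id_b, distance); input points get ids
    0..n-1 and each merged cluster gets the next unused id.
    """
    # Map cluster id -> (centroid, number of points in the cluster).
    clusters = {i: (tuple(float(x) for x in p), 1) for i, p in enumerate(points)}
    next_id = len(points)
    merges = []
    while len(clusters) > 1:
        ids = list(clusters)
        best = None
        # Find the pair of clusters whose centroids are closest.
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                d = math.dist(clusters[ids[a]][0], clusters[ids[b]][0])
                if best is None or d < best[0]:
                    best = (d, ids[a], ids[b])
        d, i, j = best
        (ci, ni), (cj, nj) = clusters.pop(i), clusters.pop(j)
        # The new centroid is the size-weighted mean of the two old centroids.
        merged = tuple((ni * x + nj * y) / (ni + nj) for x, y in zip(ci, cj))
        clusters[next_id] = (merged, ni + nj)
        merges.append((i, j, d))
        next_id += 1
    return merges


merges = centroid_hac([(0.0, 0.0), (1.0, 0.0), (0.5, 0.9)])
```

On these three points the first merge happens at distance 1.0, but the merged centroid (0.5, 0) is only 0.9 away from the third point, so the second merge distance is smaller than the first. Merge distances that can decrease in this way are what "non-monotone" refers to.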

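层次凝聚聚类(HAC)是一种被广泛研究和使用的 $\mathbb{R}^k$ 上的层次聚类方法,它根据输入的连接函数 $d$ 反复合并距离最近的一对簇。对于某些随合并进行而单调递增的连接函数,已知存在高度并行(即 NC)的 $(1+\epsilon)$-近似 HAC 算法(合并的是接近最小而非最小的簇对)。然而,对于质心连接和 Ward 连接等许多重要但非单调的连接函数,尚无此类算法。在这项工作中,我们证明一大类非单调连接函数(包括质心距离和 Ward 距离)在低维情形下存在高效的 $(1+\epsilon)$-近似 HAC 的 NC 算法。我们的算法基于一个可能具有独立意义的结构性结果:对于这类连接函数,只要 $k = O(\log \log n / \log \log \log n)$,在 $n$ 个点上由任意常数近似 HAC 得到的层次结构的高度至多为 $\operatorname{poly}(\log n)$。作为对上界的补充,我们证明当 $d$ 为质心距离且 $k = n$ 时 HAC 是 CC-hard 的,从而表明在任意维度下针对这些连接函数的 HAC 不太可能存在 NC 算法。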

Article 60

Title@2025-07-26 (6): MTASet: A Tree-based Set for Efficient Range Queries in Update-heavy Workloads

Title: MTASet: A Tree-based Set for Efficient Range Queries in Update-heavy Workloads

MTASet: Ein baumbasiertes Set für effiziente Bereichsabfragen bei update-lastigen Workloads

MTASet:面向更新密集型工作负载中高效范围查询的基于树的集合 2507.20041v1

Authors (3): Daniel Manor, Mor Perry, Moshe Sulamy

In concurrent data structures, the efficiency of set operations can vary significantly depending on the workload characteristics. Numerous concurrent set implementations are optimized and fine-tuned to excel in scenarios characterized by predominant read operations. However, they often perform poorly when confronted with workloads that heavily prioritize updates. Additionally, current leading-edge concurrent sets optimized for update-heavy tasks typically lack efficiency in handling atomic range queries. This study introduces the MTASet, which leverages a concurrent (a,b)-tree implementation. Engineered to accommodate update-heavy workloads and facilitate atomic range queries, MTASet outperforms existing counterparts optimized for such workloads by up to 2x in range query operations. Notably, MTASet ensures linearizability.

在同时的数据结构中,成套业务的效率视工作量特点而大不相同,许多同时实施的成套业务得到优化和微调,以在以主要阅读操作为特点的情景中取得优异成绩;然而,在面临大量优先更新的工作量时,这些运行往往表现不佳;此外,目前为更新和重度任务优化的前沿同时运行的前沿数据集通常在处理原子范围查询方面缺乏效率;本研究介绍了MTASet,它利用并行(a,b)-树木执行。MTASet为适应更新工作量和便利原子范围查询而设计,比现有对口单位在范围查询业务中优化的对口单位多出2x。值得注意的是,MTASet确保了线性。

Article 61

Title@2025-07-26 (6): MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism

Title: MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism

MegaScale-Infer: Bereitstellung von Mixture-of-Experts im großen Maßstab mit disaggregierter Expertenparallelität

MegaScale-Infer:以解耦的专家并行大规模服务混合专家模型 2504.02263v4

Authors (20): Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A. Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, Jianzhe Xiao, Xinyi Zhang, Lingjun Liu, Haibin Lin, Li-Wen Chang, Jianxi Ye, Xiao Yu, Xuanzhe Liu, Xin Jin, Xin Liu

Mixture-of-Experts (MoE) showcases tremendous potential to scale large language models (LLMs) with enhanced performance and reduced computational complexity. However, its sparsely activated architecture shifts feed-forward networks (FFNs) from being compute-intensive to memory-intensive during inference, leading to substantially lower GPU utilization and increased operational costs. We present MegaScale-Infer, an efficient and cost-effective system for serving large-scale MoE models. MegaScale-Infer disaggregates attention and FFN modules within each model layer, enabling independent scaling, tailored parallelism strategies, and heterogeneous deployment for both modules. To fully exploit disaggregation in the presence of MoE’s sparsity, MegaScale-Infer introduces ping-pong pipeline parallelism, which partitions a request batch into micro-batches and shuttles them between attention and FFNs for inference. Combined with distinct model parallelism for each module, MegaScale-Infer effectively hides communication overhead and maximizes GPU utilization. To adapt to disaggregated attention and FFN modules and minimize data transmission overhead (e.g., token dispatch), MegaScale-Infer provides a high-performance M2N communication library that eliminates unnecessary GPU-to-CPU data copies, group initialization overhead, and GPU synchronization. Experimental results indicate that MegaScale-Infer achieves up to 1.90x higher per-GPU throughput than state-of-the-art solutions.
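The benefit of shuttling micro-batches between disaggregated attention and FFN modules can be seen in a toy timeline model. The sketch below is an assumption-laden illustration, not MegaScale-Infer's scheduler: it models one attention resource and one FFN resource, fixed per-layer times `t_attn` and `t_ffn`, an optional transfer delay `t_comm`, and a simple round-robin order; the function name `simulate_ping_pong` is hypothetical.

```python
def simulate_ping_pong(num_micro_batches, num_layers, t_attn, t_ffn, t_comm=0.0):
    """Toy timeline for micro-batches alternating between attention and FFN.

    Returns (makespan, utilization of the attention resource).
    """
    attn_free = 0.0  # time at which the attention resource becomes idle
    ffn_free = 0.0   # time at which the FFN resource becomes idle
    ready = [0.0] * num_micro_batches  # when each micro-batch can start its next stage

    for _ in range(num_layers):
        # Attention stage: micro-batches are served in round-robin order.
        for mb in range(num_micro_batches):
            start = max(ready[mb], attn_free)
            attn_free = start + t_attn
            ready[mb] = attn_free + t_comm
        # FFN stage: each micro-batch proceeds once its activations have arrived.
        for mb in range(num_micro_batches):
            start = max(ready[mb], ffn_free)
            ffn_free = start + t_ffn
            ready[mb] = ffn_free + t_comm

    makespan = max(ready)
    attn_busy = num_micro_batches * num_layers * t_attn
    return makespan, attn_busy / makespan
```

In this model a single micro-batch leaves each resource idle while the other works, whereas two or more micro-batches let attention process one micro-batch while the FFN side processes another. Increasing `t_comm` shows why more micro-batches may be needed to keep both sides busy when transfers are not free.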

混合专家(MoE)模型在扩展大语言模型(LLM)方面展现出巨大潜力,可在提升性能的同时降低计算复杂度。然而,其稀疏激活的架构使前馈网络(FFN)在推理时由计算密集型转变为访存密集型,导致 GPU 利用率大幅下降、运营成本上升。我们提出 MegaScale-Infer,一个高效且具成本效益的大规模 MoE 模型服务系统。MegaScale-Infer 将每个模型层内的注意力模块与 FFN 模块解耦部署,使两类模块可以独立扩缩、采用各自定制的并行策略并进行异构部署。为了在 MoE 稀疏性下充分利用这种解耦,MegaScale-Infer 引入乒乓流水线并行(ping-pong pipeline parallelism),将一个请求批次划分为多个微批次,并让它们在注意力模块与 FFN 之间往返执行推理。结合各模块各自的模型并行方式,MegaScale-Infer 能有效隐藏通信开销并最大化 GPU 利用率。为适配解耦后的注意力与 FFN 模块并尽量降低数据传输开销(如 token 分发),MegaScale-Infer 提供了高性能的 M2N 通信库,消除了不必要的 GPU 到 CPU 数据拷贝、组初始化开销和 GPU 同步。实验结果表明,MegaScale-Infer 的单 GPU 吞吐量最高比现有最先进方案高 1.90 倍。

Article 62

Title@2025-07-26 (6): Offloading tracing for real-time systems using a scalable cloud infrastructure

Title: Offloading tracing for real-time systems using a scalable cloud infrastructure

Offloading-Nachverfolgung für Echtzeit-Systeme mit einer skalierbaren Cloud-Infrastruktur

利用可扩展云基础设施卸载实时系统的跟踪 2507.19953v1

Authors (3): David Jannis Schmidt, Grigory Fridman, Florian von Zabiensky

Real-time embedded systems require precise timing and fault detection to ensure correct behavior. Traditional tracing tools often rely on local desktops with limited processing and storage capabilities, which hampers large-scale analysis. This paper presents a scalable, cloud-based architecture for software tracing in real-time systems based on microservices and edge computing. Our approach shifts the trace processing workload from the developer’s machine to the cloud, using a dedicated tracing component that captures trace data and forwards it to a scalable backend via WebSockets and Apache Kafka. This enables long-term monitoring and collaborative analysis of target executions, e.g., to detect and investigate sporadic errors. We demonstrate how this architecture supports scalable analysis of parallel tracing sessions and lays the foundation for future integration of rule-based testing and runtime verification. The evaluation results show that the architecture can handle many parallel tracing sessions efficiently, although the per-session throughput decreases slightly as the system load increases, while the overall throughput increases. Although the design includes a dedicated tracer for analysis during development, this approach is not limited to such setups. Target systems with network connectivity can stream reduced trace data directly, enabling runtime monitoring in the field.

传统追踪工具往往依靠处理和储存能力有限的本地桌面进行长期监测和协作分析,从而发现和调查零星错误。我们展示了一种基于微观服务和边缘计算,在实时系统进行软件追踪的可缩放的云型结构。我们的方法是将追踪处理工作量从开发者的机器转移到云层,使用专门的追踪部分来记录追踪数据,然后通过WebSockets和Apache Kafka将其传送到可缩放的后端,这样就可以对目标处决进行长期监测和协作分析,例如,发现和调查零星错误。我们展示了这一结构如何支持对平行追踪会议进行可缩放的分析,并为今后集成基于规则的测试和运行时间核查打下基础。评价结果显示,该结构可以有效地处理许多平行的追踪会议,尽管随着系统负荷的增加,每次的吞吐量略有减少,而总体吞吐量增加。虽然设计中包括一个专门的追踪器用于分析,但这一方法并不局限于这种设置。我们展示的是,具有网络连接性的目标系统能够直接减少实地的追踪数据。

Article 63

Title@2025-07-26 (6): A Fast Parallel Median Filtering Algorithm Using Hierarchical Tiling

Title: A Fast Parallel Median Filtering Algorithm Using Hierarchical Tiling

Ein schneller paralleler Medianfilter-Algorithmus mit hierarchischem Tiling

一种基于层次分块的快速并行中值滤波算法 2507.19926v1

Authors (1): Louis Sugy

Median filtering is a non-linear smoothing technique widely used in digital image processing to remove noise while retaining sharp edges. It is particularly well suited to removing outliers (impulse noise) or granular artifacts (speckle noise). However, the high computational cost of median filtering can be prohibitive. Sorting-based algorithms excel with small kernels but scale poorly with increasing kernel diameter, in contrast to constant-time methods characterized by higher constant factors but better scalability, such as histogram-based approaches or the 2D wavelet matrix. This paper introduces a novel algorithm, leveraging the separability of the sorting problem through hierarchical tiling to minimize redundant computations. We propose two variants: a data-oblivious selection network that can operate entirely within registers, and a data-aware version utilizing random-access memory. These achieve per-pixel complexities of $O(k \log(k))$ and $O(k)$, respectively, for a $k \times k$ kernel - unprecedented for sorting-based methods. Our CUDA implementation is up to 5 times faster than the current state of the art on a modern GPU and is the fastest median filter in most cases for 8-, 16-, and 32-bit data types and kernels from $3 \times 3$ to $75 \times 75$.
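The notion of a data-oblivious selection network can be illustrated with a plain sorting-based median filter. The sketch below is a simple baseline, not the paper's hierarchical-tiling algorithm: it sorts every window independently with an odd-even transposition network, so it performs exactly the redundant work that the paper aims to avoid. The function names and the edge-replication border handling are assumptions made for this example.

```python
def median_by_network(values):
    """Median via an odd-even transposition network.

    The sequence of compare-exchange positions depends only on len(values),
    never on the data, which is what makes the network data-oblivious.
    """
    v = list(values)
    n = len(v)
    for round_index in range(n):  # n rounds are enough to sort n elements
        for i in range(round_index % 2, n - 1, 2):
            lo, hi = min(v[i], v[i + 1]), max(v[i], v[i + 1])
            v[i], v[i + 1] = lo, hi
    return v[n // 2]


def median_filter(image, radius=1):
    """Filter a 2D list of numbers with a (2*radius+1)^2 kernel.

    Borders are handled by replicating edge pixels (an assumption).
    """
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            window = [
                image[min(max(y + dy, 0), height - 1)][min(max(x + dx, 0), width - 1)]
                for dy in range(-radius, radius + 1)
                for dx in range(-radius, radius + 1)
            ]
            out[y][x] = median_by_network(window)
    return out
```

Because the compare-exchange pattern is fixed, the same network maps naturally onto branch-free min/max instructions held in registers. Neighbouring windows share most of their pixels, however, which is the redundancy that a tiled approach can exploit.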

中层过滤是一种非线性平滑技术,在数字图像处理过程中广泛使用,以去除噪音,同时保留锐利边缘。它特别适合于清除离子(impulses 噪声)或颗粒工艺(peckle 噪声),然而,中位过滤的高计算成本可能令人望而却望而却。基于分类的算法优于小内核,但规模差于不断增长的内核直径,这与恒定方法不同,其特点是恒定系数较高,但缩缩度更高,例如基于直方图的方法或2D波盘矩阵。本文引入了一种新型算法,通过等级平移来利用分解问题的分解性,以尽量减少多余的计算。我们提出了两种变式:完全在登记册内运行的、以数据感知度为基础的筛选网络,以及使用随机存留存的对数据版本。这分别实现了美元(k\log(k)美元)和美元(k)和美元)的每平方位复杂性。对于以美元为基内内(k)的Knellnel-kelex-在排序方法上是前所未有的,我们CUDUDAAA在目前3和GU3级中,最快速的运行中,其中,其执行速度为速度为5次。

Article 64

Title@2025-07-26 (6): MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation

Title: MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation

MultiKernelBench: Ein Multi-Platform Benchmark für die Kernel-Generation

MultiKernelBench:面向内核生成的多平台基准 2507.17773v2

Authors (6): Zhongzhen Wen, Yinghui Zhang, Zhong Li, Zhongxin Liu, Linna Xie, Tian Zhang

The automatic generation of deep learning (DL) kernels using large language models (LLMs) has emerged as a promising approach to reduce the manual effort and hardware-specific expertise required for writing high-performance operator implementations. However, existing benchmarks for evaluating LLMs in this domain suffer from limited hardware support, coarse-grained kernel categorization, and imbalanced task coverage. To address these limitations, we introduce MultiKernelBench, the first comprehensive, multi-platform benchmark for LLM-based DL kernel generation. MultiKernelBench spans 285 tasks across 14 well-defined kernel categories and supports three major hardware platforms: Nvidia GPUs, Huawei NPUs, and Google TPUs. To enable future extensibility, we design a modular backend abstraction layer that decouples platform-specific logic from the core benchmarking infrastructure, allowing easy integration of new hardware platforms. We further propose a simple yet effective category-aware one-shot prompting method that improves generation quality by providing in-category exemplars. Through systematic evaluations of seven state-of-the-art LLMs, we reveal significant variation in task difficulty, poor generalization to platforms with less training exposure, and the effectiveness of targeted prompting strategies. MultiKernelBench is publicly available at https://github.com/wzzll123/MultiKernelBench.

利用大型语言模型(LLMs)自动生成深层次学习(DL)核心的自动生成(DL)核心,这已成为一种很有希望的方法,可以减少手工努力和硬件专长,以编写高性能操作者执行工作所需的高性能操作软件。然而,目前对这一领域中LLM的评估基准缺乏硬件支持、粗微的内核分类和任务覆盖不平衡。为克服这些限制,我们引入了多环邦奇,这是以LLLM为主的DLL内核生成的第一个全面、多平台基准。多环贝尼奇跨越了14个明确界定的内核类别的285项任务,支持了三大硬件平台:Nvidia GPUs、Huawei NPUs和Google TPUs。为了能够在未来的扩展性,我们设计了一个模块后端抽象层,将平台特定平台的逻辑与核心基准基础设施脱钩,便于新硬件平台的整合。我们进一步提出一个简单而有效的类别认知/125点快速提示方法,通过提供分类外壳类的外壳,提高代质量质量。通过对7个州GLLM号的公开平台进行系统的系统评估,在公共风险上进行显著的变换。

Article 65

Title@2025-07-26 (6): MegatronApp: Efficient and Comprehensive Management on Distributed LLM Training

Title: MegatronApp: Efficient and Comprehensive Management on Distributed LLM Training

MegatronApp: Effizientes und umfassendes Management auf verteilten LLM-Schulungen

MegatronApp:分布式 LLM 训练的高效全面管理 2507.19845v1

Authors (7): Bohan Zhao, Guang Yang, Shuo Chen, Ruitao Liu, Tingrui Zhang, Yongchao He, Wei Xu

The rapid escalation in the parameter count of large language models (LLMs) has transformed model training from a single-node endeavor into a highly intricate, cross-node activity. While frameworks such as Megatron-LM successfully integrate tensor (TP), pipeline (PP), and data (DP) parallelism to enable trillion-parameter training, they simultaneously expose practitioners to unprecedented systems-level challenges in performance optimization, diagnosis, and interpretability. MegatronApp is an open-source toolchain expressly designed to meet these challenges. It introduces four orthogonal, yet seamlessly composable modules (MegaScan, MegaFBD, MegaDPP, and MegaScope) that collectively elevate the reliability, efficiency, and transparency of production-scale training. This paper presents the motivation, architecture, and distinctive contributions of each module, and elucidates how their synergistic integration augments the Megatron-LM ecosystem.

大型语言模型(LLMs)参数计数的迅速升级使模型培训从单节努力转变为高度复杂、交叉节点活动。虽然威震天-LM等框架成功地整合了高尔夫(TP)、管道(PP)和数据(DP)平行性,从而得以进行万万亿参数培训,但同时也使从业人员面临在业绩优化、诊断和可解释性方面前所未有的系统层面的挑战。威震天-亚普是一个开放源工具链,专门设计来应对这些挑战。它引入了四个正方位、但无缝可折叠的模块-MegaScan、MegaFBD、MegaFBD、MegaDPP和MegaScope-共同提高生产规模培训的可靠性、效率和透明度。本文介绍了每个模块的动力、架构和独特贡献,并阐明了它们的协同整合如何增强Megatron-LM生态系统。

Article 66

Title@2025-07-26 (6): CleANN: Efficient Full Dynamism in Graph-based Approximate Nearest Neighbor Search

Title: CleANN: Efficient Full Dynamism in Graph-based Approximate Nearest Neighbor Search

CleANN: Effizienter Volldynamismus auf Graph-Basis Ungefähre nächste Nachbarsuche

CleANN: 以图形为基础的近邻近邻搜索中的高效全面动态 2507.19802v1

Authors (4): Ziyu Zhang, Yuanhao Wei, Joshua Engels, Julian Shun

Approximate nearest neighbor search (ANNS) has become a quintessential algorithmic problem for various other foundational data tasks for AI workloads. Graph-based ANNS indexes have superb empirical trade-offs in indexing cost, query efficiency, and query approximation quality. Most existing graph-based indexes are designed for the static scenario, where there are no updates to the data after the index is constructed. However, full dynamism (insertions, deletions, and searches) is crucial to providing up-to-date responses in applications using vector databases. It is desirable that the index efficiently supports updates and search queries concurrently. Existing dynamic graph-based indexes suffer from at least one of the following problems: (1) the query quality degrades as updates happen; and (2) the graph structure updates used to maintain the index quality upon updates are global and thus expensive. To solve these problems, we propose the CleANN system, which consists of three main components: (1) workload-aware linking of diverse search tree descendants to combat distribution shift; (2) query-adaptive on-the-fly neighborhood consolidation to efficiently handle deleted nodes; and (3) semi-lazy memory cleaning to clean up stale information in the data structure and reduce the work spent by the first two components. We evaluate CleANN on 7 diverse datasets on fully dynamic workloads and find that CleANN has query quality at least as good as if the index had been built statically using the corresponding data. In the in-memory setting using 56 hyper-threads, with all types of queries running concurrently, at the same recall level, CleANN achieves 7-1200x throughput improvement on million-scale real-world datasets. To the best of our knowledge, CleANN is the first concurrent ANNS index to achieve such efficiency while maintaining quality under full dynamism.

最近的近邻搜索( ANNS ) 已成为各种其他基本数据任务中AI 工作量的典型算法问题。基于图形的 ANNS 指数在指数化成本、查询效率和查询近似质量方面有着超强的经验权衡。大多数基于图形的指数是为静态假设设计的, 在指数构建后没有更新数据。但是, 完全的活力( 插入、删除和搜索) 对于在使用矢量数据库的应用中提供最新的响应至关重要。指数最好能有效支持更新和搜索。现有的动态图形型指数至少存在以下一个问题:(1) 更新后的查询质量下降;(2) 用于更新时保持索引质量的图表结构更新是全球性的,因此昂贵。为了解决这些问题, 我们建议使用 CleANN 系统, 它由三大主要组成部分组成:(1) 将不同的搜索树树后裔连接到战斗性分布变化中;(2) 快速的目录式社区整合到高效地处理节点;(3) 半惰性内存清理, 以清除数据结构中的过时信息, 并减少前两个组成部分的工作量。

Article 67

Title@2025-07-26 (6): Accelerating Matrix Multiplication: A Performance Comparison Between Multi-Core CPU and GPU

Title: Accelerating Matrix Multiplication: A Performance Comparison Between Multi-Core CPU and GPU

Beschleunigung der Matrix-Multiplikation: Ein Leistungsvergleich zwischen Multi-Core-CPU und GPU

加速矩阵乘法:多焦CPU和GPU之间的性能比较 2507.19723v1

Authors (2): Mufakir Qamar Ansari, Mudabir Qamar Ansari

Matrix multiplication is a foundational operation in scientific computing and machine learning, yet its computational complexity makes it a significant bottleneck for large-scale applications. The shift to parallel architectures, primarily multi-core CPUs and many-core GPUs, is the established solution, and these systems are now ubiquitous from datacenters to consumer laptops. This paper presents a direct, empirical performance analysis of matrix multiplication on a modern, consumer-grade heterogeneous platform. We implemented and benchmarked three versions of the algorithm: a baseline sequential C++ implementation, a parallel version for its multi-core CPU using OpenMP, and a massively parallel version for its discrete GPU using CUDA with shared memory optimizations. The implementations were evaluated with square matrices of varying dimensions, from 128x128 to 4096x4096. Our results show that while the parallel CPU provides a consistent speedup of 12-14x over the sequential version, the GPU’s performance scales dramatically with problem size. For a 4096x4096 matrix, the GPU implementation achieved a speedup of approximately 593x over the sequential baseline and 45x over the optimized parallel CPU version. These findings quantitatively demonstrate the profound impact of many-core GPU architectures on accelerating data-parallel workloads, underscoring that significant performance gains are readily accessible even on consumer-level hardware.

矩阵乘积是科学计算和机器学习的基础操作, 但它的计算复杂性使它成为大规模应用的重大瓶颈。向平行结构的转变, 主要是多核心CPU和多核心GPU, 是既定的解决方案, 这些系统现在从数据中心到消费膝上型计算机无处不在。本文对现代消费者级多元平台的矩阵乘积进行了直接的实验性绩效分析。我们实施并基准了三种算法: 基线顺序C++ 实施, 使用 OpenMP 的多核心CPU平行版本, 以及使用 CUDA 共享记忆优化的离散 GPU 大规模平行版本。已经用不同维度的平方矩阵对实施进行了评估, 从128x128到4096x4096x4096。我们的结果表明, 虽然平行的CPU为现代的12-14x乘积提供了一致的加速速度, 但是GPU的性能与问题大小相当大。对于4096 矩阵, GPU的实施在连续基线基线上实现了大约59x的加速速度, 并且45x 最接近于最接近的CPI 的硬的硬的成绩。
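
The speedups quoted in the abstract above can be cross-checked with one line of arithmetic. The sketch below uses only the two figures reported for the 4096x4096 case and derives the OpenMP speedup they imply; the variable names are illustrative and do not come from the paper.

```python
# Figures quoted in the abstract for the 4096x4096 case.
gpu_vs_sequential = 593.0    # GPU speedup over the sequential C++ baseline
gpu_vs_parallel_cpu = 45.0   # GPU speedup over the OpenMP CPU version

# If T_seq / T_gpu = 593 and T_cpu / T_gpu = 45, then
# T_seq / T_cpu = 593 / 45.
implied_cpu_speedup = gpu_vs_sequential / gpu_vs_parallel_cpu

print(f"Implied OpenMP speedup over sequential: {implied_cpu_speedup:.1f}x")
```

The implied value of roughly 13x falls inside the 12-14x range the abstract reports for the parallel CPU version, so the three figures are mutually consistent.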

Article 68

Title@2025-07-25 (5): Oranits: Mission Assignment and Task Offloading in Open RAN-based ITS using Metaheuristic and Deep Reinforcement Learning

Title: Oranits: Mission Assignment and Task Offloading in Open RAN-based ITS using Metaheuristic and Deep Reinforcement Learning

Oranits: Missionszuweisung und Aufgabe-Offloading in Open RAN-basierten ITS mit Hilfe von Metaheuristic und Deep Reinforcement Learning

Oranits:利用超常和深强化学习在以开放RAN为基础的ITS中执行特派任务和卸载任务 2507.19712v1

Authors (8): Ngoc Hung Nguyen, Nguyen Van Thieu, Quang-Trung Luu, Anh Tuan Nguyen, Senura Wanasekara, Nguyen Cong Luong, Fatemeh Kavehmadavani, Van-Dinh Nguyen

In this paper, we explore mission assignment and task offloading in an Open Radio Access Network (Open RAN)-based intelligent transportation system (ITS), where autonomous vehicles leverage mobile edge computing for efficient processing. Existing studies often overlook the intricate interdependencies between missions and the costs associated with offloading tasks to edge servers, leading to suboptimal decision-making. To bridge this gap, we introduce Oranits, a novel system model that explicitly accounts for mission dependencies and offloading costs while optimizing performance through vehicle cooperation. To achieve this, we propose a twofold optimization approach. First, we develop a metaheuristic-based evolutionary computing algorithm, namely the Chaotic Gaussian-based Global ARO (CGG-ARO), serving as a baseline for one-slot optimization. Second, we design an enhanced reward-based deep reinforcement learning (DRL) framework, referred to as the Multi-agent Double Deep Q-Network (MA-DDQN), that integrates both multi-agent coordination and multi-action selection mechanisms, significantly reducing mission assignment time and improving adaptability over baseline methods. Extensive simulations reveal that CGG-ARO improves the number of completed missions and overall benefit by approximately 7.1% and 7.7%, respectively. Meanwhile, MA-DDQN achieves even greater improvements of 11.0% in terms of mission completions and 12.5% in terms of the overall benefit. These results highlight the effectiveness of Oranits in enabling faster, more adaptive, and more efficient task processing in dynamic ITS environments.

在本文中,我们探索在基于开放无线电接入网络(开放RAN)的智能运输系统中进行任务分配和任务卸载,在该系统中,自主车辆利用移动边缘计算进行高效处理;现有研究往往忽视特派团之间错综复杂的相互依存关系和将任务卸载到边缘服务器的相关费用,导致决策不优化;为了缩小这一差距,我们引入了奥拉尼特,这是一个新颖的系统模型,明确说明特派团依赖性和卸载费用,同时通过车辆合作优化业绩。为了实现这一点,我们建议采用双优化办法。首先,我们开发了基于计量经济学的进化计算算法,即基于查托高斯的Gaussian全球ARO(CGG-ARO),作为一线优化的基准。第二,我们设计了一个基于奖励的强化学习(DRL)框架,称为多试机构双深网络(MA-DQN),将多机构协调和多行动选择机制结合起来,显著缩短了任务分配时间,改进了基线方法的适应性调整能力。
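
For readers unfamiliar with the "Double Deep Q-Network" part of MA-DDQN, the sketch below shows the standard single-agent Double DQN bootstrap target. It is a minimal illustration of that well-known rule only; the multi-agent coordination, multi-action selection, and reward design of the paper are not reproduced, and the function name and toy Q-values are assumptions.

```python
def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Return the Double DQN bootstrap target for one transition.

    next_q_online / next_q_target: lists of Q-values for the next state,
    one per action, from the online and target networks respectively.
    """
    if done:
        return reward
    # Action selection uses the online network...
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    # ...while action evaluation uses the target network.
    return reward + gamma * next_q_target[best_action]


# Toy values (hypothetical): the online net prefers action 1,
# and the target net values that action at 2.0.
target = double_dqn_target(
    reward=1.0,
    next_q_online=[0.5, 3.0, 1.0],
    next_q_target=[4.0, 2.0, 0.0],
    gamma=0.9,
)
print(target)
```

Decoupling selection from evaluation is what distinguishes Double DQN from plain DQN, which would have used the target network's own maximum (4.0 in this toy case).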

Article 69

Title@2025-07-25 (5): Improved Distributed Algorithms for Random Colorings

Title: Improved Distributed Algorithms for Random Colorings

Verbesserte verteilte Algorithmen für Random Colorings

改进随机配色配色的分布比值 2309.07859v3

Authors (3): Charlie Carlson, Daniel Frishberg, Eric Vigoda

We study distributed versions of Markov Chain Monte Carlo (MCMC) algorithms for generating random $k$-colorings of an input graph with maximum degree $\Delta$. In the sequential setting, the Glauber dynamics is the simple MCMC algorithm which updates the color at a randomly chosen vertex in each step. Fischer and Ghaffari (2018), and independently Feng, Hayes, and Yin (2018), presented a parallel and distributed version of the Glauber dynamics which converges in $O(\log{n})$ rounds for $k>(2+\varepsilon)\Delta$ for any $\varepsilon>0$. We present the distributed flip dynamics and prove $O(n\log{n})$ mixing for $k>(11/6-\delta)\Delta$ for a fixed $\delta>0$. Our new Markov chain is a generalization of the distributed Glauber dynamics previously analyzed, and is a parallel and distributed version of the more general flip dynamics considered in the sequential setting which recolors local maximal two-colored components in each step. While the distributed Glauber dynamics and the sequential flip dynamics are symmetric Markov chains, and hence their stationary distribution is uniformly distributed over colorings, our distributed flip dynamics is not symmetric and hence the stationary distribution is unclear.

我们研究马可夫链条蒙特卡洛(Markov Chain Monte Carlo)(MCMCMC)的分布版算法,以生成以最大度为$(Delta$) 的随机输入图的彩色。在顺序设置中, Glauber 动态是简单的 MMC 算法, 以随机选择的顶点更新颜色。 Fischer和Ghaffari(2018年) , 以及独立的Feng、 Hayes 和 Yin (2018年) 。我们的新 Markov 链是一个以美元( log{n} ) 的分布式Glauber 动态, 以美元( 2 varepsilon)\ Delta$ 来生成。在任何 $( varevalepslon>0$ ) 的顺序设置中, 我们展示的分布式翻转动动态, 并证明$( n\ log{n=log{n} 混合的 $k> 。我们的新移动链是连续设置中分布的。
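
The sequential Glauber dynamics mentioned in the abstract above can be sketched in a few lines. This is the heat-bath variant (pick a uniformly random vertex, then a uniformly random color not used by its neighbors); it is not the distributed flip dynamics of the paper, and the graph, seed, and function names are illustrative assumptions.

```python
import random


def glauber_step(adj, coloring, k, rng):
    """One sequential Glauber update for proper k-colorings (in place).

    adj: dict mapping each vertex to a list of neighbors.
    coloring: dict mapping each vertex to a color in range(k).
    """
    v = rng.choice(list(adj))
    blocked = {coloring[u] for u in adj[v]}
    available = [c for c in range(k) if c not in blocked]
    # With k > max degree, at least one color is always available.
    coloring[v] = rng.choice(available)


def is_proper(adj, coloring):
    """True if no edge has the same color on both endpoints."""
    return all(coloring[u] != coloring[v] for u in adj for v in adj[u])


# 4-cycle (maximum degree 2) with k = 5 colors, so k > 2 * Delta.
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
coloring = {0: 0, 1: 1, 2: 0, 3: 1}
rng = random.Random(0)
for _ in range(1000):
    glauber_step(adj, coloring, k=5, rng=rng)
print(is_proper(adj, coloring))  # True: every step preserves properness
```

Each step touches a single vertex, which is why the distributed variants discussed in the abstract aim to update many vertices per round.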

Article 70

Title@2025-07-25 (5): Quantifying the Performance Gap for Simple Versus Optimal Dynamic Server Allocation Policies

Title: Quantifying the Performance Gap for Simple Versus Optimal Dynamic Server Allocation Policies

Quantifizierung der Performance Gap für einfache Versus Optimal Dynamic Server Allocation Richtlinien

量化简单 Versus 最佳最佳动态服务器配置政策的业绩差距 2507.19667v1

Authors (2): Niklas Carlsson, Derek Eager

Cloud computing enables the dynamic provisioning of server resources. To exploit this opportunity, a policy is needed for dynamically allocating (and deallocating) servers in response to the current load conditions. In this paper we describe several simple policies for dynamic server allocation and develop analytic models for their analysis. We also design semi-Markov decision models that enable determination of the performance achieved with optimal policies, allowing us to quantify the performance gap between simple, easily implemented policies, and optimal policies. Finally, we apply our models to study the potential performance benefits of state-dependent routing in multi-site systems when using dynamic server allocation at each site. Insights from our results are valuable to service providers wanting to balance cloud service costs and delays.

云计算能够动态地提供服务器资源。为了利用这个机会, 需要针对当前负荷条件制定动态分配( 和分配) 服务器的政策。在本文中, 我们描述一些动态服务器分配的简单政策, 并开发分析模型进行分析。我们还设计了半马尔科夫决定模型, 以便确定以最佳政策实现的绩效, 从而让我们量化简单、容易执行的政策与最佳政策之间的性能差距。最后, 我们运用模型来研究多站点系统中使用动态服务器分配时依赖状态的路线在多站点系统中的潜在性能效益。我们的观察结果对于想要平衡云服务成本和延误的服务提供者来说很有价值。

Article 71

Title@2025-07-25 (5): Efficient and Scalable Agentic AI with Heterogeneous Systems

Title: Efficient and Scalable Agentic AI with Heterogeneous Systems

Effiziente und skalierbare Agentische KI mit Heterogenen Systemen

具有异质系统的高效和可缩放剂AIA 2507.19635v1

Authors (3): Zain Asgar, Michelle Nguyen, Sachin Katti

AI agents are emerging as a dominant workload in a wide range of applications, promising to be the vehicle that delivers the promised benefits of AI to enterprises and consumers. Unlike conventional software or static inference, agentic workloads are dynamic and structurally complex. Often these agents are directed graphs of compute and IO operations that span multi-modal data input and conversion, data processing and context gathering (e.g., vector DB lookups), multiple LLM inferences, tool calls, etc. To scale AI agent usage, we need efficient and scalable deployment and agent-serving infrastructure. To tackle this challenge, in this paper, we present a system design for dynamic orchestration of AI agent workloads on heterogeneous compute infrastructure spanning CPUs and accelerators, both from different vendors and across different performance tiers within a single vendor. The system delivers several building blocks: a framework for planning and optimizing agentic AI execution graphs using cost models that account for compute, memory, and bandwidth constraints of different HW; an MLIR-based representation and compilation system that can decompose AI agent execution graphs into granular operators and generate code for different HW options; and a dynamic orchestration system that can place the granular components across a heterogeneous compute infrastructure and stitch them together while meeting an end-to-end SLA. Our design performs a systems-level TCO optimization, and preliminary results show that leveraging a heterogeneous infrastructure can deliver significant TCO benefits. A surprising preliminary finding is that for some workloads a heterogeneous combination of older-generation GPUs with newer accelerators can deliver similar TCO to the latest-generation homogeneous GPU infrastructure design, potentially extending the life of deployed infrastructure.

AI代理商正在成为范围广泛的应用中的主要工作量,有望成为向企业和消费者提供AI所承诺的好处的载体。与常规软件或静态推断不同,代理人工作量是动态的,结构复杂。这些代理商往往是计算和IO操作的定向图表,涉及多模式数据输入和转换、数据处理和背景收集(如矢量 DB查看、多种LLM推算、工具电话等)。为了扩大AI代理商的使用规模,我们需要高效和可缩放的部署和代理服务基础设施。为了应对这一挑战,在本文件中,我们提出了一套系统设计,用于在包含CPU和加速器的多式计算机化基础设施上,对AAA代理商和IO加速器操作进行动态协调。该系统提供了几个构件:一个用于规划和优化代理商AI执行图的框架,使用成本模型来计算不同HW的计算、记忆和带宽度制约;一个基于MLIR的新的表述和编集系统,可以将AI代理商执行系统的初步结构图解分解,将CICO的GOL结构结构结构结构图绘制成一个不同的GSLMLA系统,同时将一个动态设计工具展示了我们G-SLILA系统的最新版本的版本的版本设计工具,可以展示一个可交付的版本。

Article 72

Title@2025-07-25 (5): An OpenSource CI/CD Pipeline for Variant-Rich Software-Defined Vehicles

Title: An OpenSource CI/CD Pipeline for Variant-Rich Software-Defined Vehicles

Eine OpenSource CI/CD Pipeline für Variant-Rich Software-definierte Fahrzeuge

变式Rich软件定型车辆的开源CI/CD管道 2507.19446v1

Authors (5): Matthias Weiß, Anish Navalgund, Johannes Stümpfle, Falk Dettinger, Michael Weyrich

Software-defined vehicles (SDVs) offer a wide range of connected functionalities, including enhanced driving behavior and fleet management. These features are continuously updated via over-the-air (OTA) mechanisms, resulting in a growing number of software versions and variants due to the diversity of vehicles, cloud/edge environments, and stakeholders involved. The lack of a unified integration environment further complicates development, as connected mobility solutions are often built in isolation. To ensure reliable operations across heterogeneous systems, a dynamic orchestration of functions that considers hardware and software variability is essential. This paper presents an open-source CI/CD pipeline tailored for SDVs. It automates the build, test, and deployment phases using a combination of containerized open-source tools, creating a standardized, portable, and scalable ecosystem accessible to all stakeholders. Additionally, a custom OTA middleware distributes software updates and supports rollbacks across vehicles and backend services. Update variants are derived based on deployment target dependencies and hardware configurations. The pipeline also supports continuous development and deployment of AI models for autonomous driving features. Its effectiveness is evaluated using an automated valet parking (AVP) scenario involving TurtleBots and a coordinating backend server. Two object detection variants are developed and deployed to match hardware-specific requirements. Results demonstrate seamless OTA updates, correct variant selection, and successful orchestration across all targets. Overall, the proposed pipeline provides a scalable and efficient solution for managing software variants and OTA updates in SDVs, contributing to the advancement of future mobility technologies.

软件定义的车辆(SDV)具有广泛的连通功能,包括加强驾驶行为和车队管理。这些功能通过空外机制不断更新,导致软件版本和变体越来越多,因为车辆的多样性、云层/尖端环境以及利益攸关方都可使用。缺乏统一的整合环境使发展更加复杂,因为连接的流动解决方案往往是孤立地建立的。为确保不同系统之间的可靠操作,动态调控考虑硬件和软件变异性的职能至关重要。本文介绍了为SDV定制的开放源的CI/CD管道。它利用集装箱化的开放源码工具组合来自动更新构建、测试和部署阶段。它创造了标准化、可移植和可扩展的生态系统。此外,定制的OTA中软件传播软件更新软件,支持车辆和后端服务的滚回。根据部署目标依赖和硬件配置,更新变量也支持自动驱动驱动驱动功能的AI模式的继续开发和部署。它的有效性是使用自动化的VAVDLO解决方案的升级、测试和升级后端版本的OVA、测试和升级的OTA的升级版本。它提供了所有运行的自动变式的升级和升级的自动变式。

Article 73

Title@2025-07-25 (5): SDVDiag: A Modular Platform for the Diagnosis of Connected Vehicle Functions

Title: SDVDiag: A Modular Platform for the Diagnosis of Connected Vehicle Functions

SDVDiag: Modulare Plattform für die Diagnose von vernetzten Fahrzeugfunktionen

SDVDiag: 连接车辆功能诊断模块平台 2507.19403v1

Authors (3): Matthias Weiß, Falk Dettinger, Michael Weyrich

Connected and software-defined vehicles promise to offer a broad range of services and advanced functions to customers, aiming to increase passenger comfort and support autonomous driving capabilities. Due to the high reliability and availability requirements of connected vehicles, it is crucial to resolve any occurring failures quickly. To achieve this, however, a complex cloud/edge architecture with a mesh of dependencies must be navigated to diagnose the responsible root cause. As such, manual analyses become unfeasible since they would significantly delay the troubleshooting. To address this challenge, this paper presents SDVDiag, an extensible platform for the automated diagnosis of connected vehicle functions. The platform enables the creation of pipelines that cover all steps from initial data collection to the tracing of potential root causes. In addition, SDVDiag supports self-adaptive behavior through the ability to exchange modules at runtime. Dependencies between functions are detected and continuously updated, resulting in a dynamic graph view of the system. In addition, vital system metrics are monitored for anomalies. Whenever an incident is investigated, a snapshot of the graph is taken and augmented by relevant anomalies. Finally, the analysis is performed by traversing the graph and creating a ranking of the most likely causes. To evaluate the platform, it is deployed inside a 5G test fleet environment for connected vehicle functions. The results show that injected faults can be detected reliably. As such, the platform offers the potential to gain new insights and reduce downtime by identifying problems and their causes at an early stage.

连接和软件定义的车辆有望向客户提供广泛的服务和高级功能,目的是增加乘客舒适度,支持自主驾驶能力。由于连接车辆的高度可靠性和可用性要求,因此迅速解决任何出现故障至关重要。然而,要做到这一点,必须引导一个复杂的云层/尖端结构,其中含有依赖性网格,以诊断负责任的根本原因。因此,人工分析变得不可行,因为它们会大大拖延故障排除。为了应对这一挑战,本文件提供了SDVDiag,这是一个可自动诊断相关车辆功能的可扩展平台。该平台使得能够创建管道,涵盖从初步数据收集到潜在根源追踪的所有步骤。此外,SDVDiag支持通过在运行时交换模块的能力进行自我适应的行为。检测和不断更新功能之间的依赖性,从而形成一个动态的系统图表视图。此外,对于异常现象进行监测的关键系统测量。每当对事件进行调查,即对图表进行简要分析,并辅之以相关的异常点。最后,分析是通过跨时间定位平台进行早期定位,从而显示车辆内部的定位结果。
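
The final analysis step described above (traverse a snapshot of the dependency graph, augmented with anomalies, and rank likely causes) can be illustrated with a small sketch. Everything in it is hypothetical: the node names, the anomaly scores, and the scoring rule (severity divided by one plus hop distance) are assumptions made for illustration and are not taken from SDVDiag.

```python
from collections import deque


def rank_root_causes(depends_on, anomaly_score, incident_node):
    """Rank candidate root causes reachable from the failing function.

    depends_on: dict node -> list of nodes it depends on (graph snapshot).
    anomaly_score: dict node -> anomaly severity in [0, 1] (hypothetical).
    Scoring rule (an assumption): severity / (1 + hop distance).
    """
    # Breadth-first traversal gives the hop distance to each dependency.
    distance = {incident_node: 0}
    queue = deque([incident_node])
    while queue:
        node = queue.popleft()
        for dep in depends_on.get(node, []):
            if dep not in distance:
                distance[dep] = distance[node] + 1
                queue.append(dep)
    scored = [(anomaly_score.get(n, 0.0) / (1 + d), n) for n, d in distance.items()]
    # Keep only nodes with an observed anomaly, highest score first.
    return [n for score, n in sorted(scored, reverse=True) if score > 0]


# Hypothetical snapshot of a small cloud/edge dependency graph.
snapshot = {
    "vehicle-function": ["backend-api", "message-broker"],
    "backend-api": ["edge-db"],
    "message-broker": ["network-gateway"],
}
anomalies = {"network-gateway": 0.9, "backend-api": 0.2}
ranking = rank_root_causes(snapshot, anomalies, "vehicle-function")
print(ranking)  # ['network-gateway', 'backend-api']
```

A real platform would presumably weigh anomaly type, timing, and edge semantics as well; the point here is only the shape of a traverse-then-rank step.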

Article 74

Title@2025-07-25 (5): Big Data Energy Systems: A Survey of Practices and Associated Challenges

Title: Big Data Energy Systems: A Survey of Practices and Associated Challenges

Big Data Energy Systems: Eine Übersicht über Praktiken und damit verbundene Herausforderungen

大数据能源系统:做法和相关挑战概览 2507.19154v1

Authors (4): Lunodzo J. Mwinuka, Massimo Cafaro, Lucas Pereira, Hugo Morais

Energy systems generate vast amounts of data in extremely short time intervals, creating challenges for efficient data management. Traditional data management methods often struggle with scalability and accessibility, limiting their usefulness. More advanced solutions, such as NoSQL databases and cloud-based platforms, have been adopted to address these issues. Still, even these advanced solutions can encounter bottlenecks, which can impact the efficiency of data storage, retrieval, and analysis. This review paper explores the research trends in big data management for energy systems, highlighting the practices, opportunities and challenges. Also, the data regulatory demands are highlighted using chosen reference architectures. The review, in particular, explores the limitations of current storage and data integration solutions and examines how new technologies are applied to the energy sector. Novel insights into emerging technologies, including data spaces, various data management architectures, peer-to-peer data management, and blockchains, are provided, along with practical recommendations for achieving enhanced data sharing and regulatory compliance.

传统的数据管理方法往往与可缩放性和可获取性相争,限制了其效用; 已经通过了更先进的解决方案,如诺萨基L数据库和云基平台,以解决这些问题; 然而,即使这些先进解决方案也可能遇到瓶颈,这可能影响数据储存、检索和分析的效率; 本审查文件探讨了能源系统大数据管理的研究趋势,强调了做法、机遇和挑战; 此外,还利用选定的参考结构强调了数据监管需求; 审查特别探讨了目前储存和数据整合解决方案的局限性,并审查了新技术如何应用于能源部门; 提供了对新兴技术,包括数据空间、各种数据管理架构、同行数据管理和链块的洞察,并提出了加强数据共享和监管合规的实用建议。

Article 75

Title@2025-07-25 (5): Urban Green Governance: IoT-Driven Management and Enhancement of Urban Green Spaces in Campobasso

Title: Urban Green Governance: IoT-Driven Management and Enhancement of Urban Green Spaces in Campobasso

Urban Green Governance: IoT-getriebenes Management und Verbesserung städtischer Grünflächen in Campobasso

城市绿色治理:在坎波巴索管理和加强城市绿色空间 2507.12106v4

Authors (6): Antonio Salis, Gabriele Troina, Gianluca Boanelli, Marco Ottaviano, Paola Fortini, Soraya Versace

The efficient design and management of public green spaces is a key factor in promoting the health and well-being of urban populations, as emphasized by the WHO, UNEP, and EEA. These areas serve as the “green lungs” of the urban ecosystem, playing a vital role in enhancing quality of life through the provision of ecosystem services. In this context, the Smart Green City use case in the municipality of Campobasso, funded by the Italian Ministry of Enterprises (MIMIT), emerges as an innovative model for the sustainable management of green urban areas through the adoption of an advanced system of integrated, interoperable emerging technologies. The project integrates IoT systems and data-driven governance platforms, enabling real-time monitoring of the health status of trees and green areas via a Decision Support System (DSS). It also facilitates the collection and analysis of data from diverse sources, including weather conditions, air quality, soil moisture, and pollution levels. The resulting cloud-based platform supports holistic real-time decision-making for green urban managers, technical experts, and operational staff. It enables intelligent control and management of urban green spaces using Tree Talker sensors, integrated with soil moisture and water potential monitoring systems. Thanks to predictive models based on machine learning algorithms and real-time data provided by IoT sensors, irrigation of public parks can be optimized by providing suggestions on when and how much water to apply. Customized alert layers are also activated, warning users when monitored parameters, such as soil temperature, humidity, or water potential, exceed predefined thresholds. This use case demonstrates how digitalization, IoT sensor fusion, and technological innovation can support sustainable urban governance, fostering environmental resilience and improving citizens' quality of life.

如世卫组织、环境署和欧洲环境署所强调,高效设计和管理公共绿色空间是促进城市人口健康和福祉的一个关键因素。这些领域是城市生态系统的“绿色肺”,由于生态系统服务的提供,在提高生活质量方面发挥着至关重要的作用。在这方面,由意大利企业部(MIMIT)资助的Smart Green City在Campobasso市的“智能绿色城市使用案例”成为了通过采用先进的新式温度参数系统综合和可相互操作的技术对绿色城市地区进行可持续管理的创新模式。该项目整合了IOT系统和数据驱动的治理平台,通过决策支持系统(DSS)对树木和绿色地区的健康状况进行实时监测。它还有助于收集和分析来自不同来源的数据,包括天气条件、空气质量、土壤湿度和污染水平。由此产生的云基平台支持绿色城市管理者、技术专家和业务工作人员的全面实时决策。该项目使城市绿色空间的智能控制和管理能够利用树木谈话器传感器、与土壤湿度和水潜力监测系统进行整合,从而能够实时监测树木和绿色地区的健康状况。这还有助于通过预测性数据采集模型,同时提供以机器系统进行最优化的土壤温度监测。

Article 76

Title@2025-07-25 (5): A New One-Shot Federated Learning Framework for Medical Imaging Classification with Feature-Guided Rectified Flow and Knowledge Distillation

Title: A New One-Shot Federated Learning Framework for Medical Imaging Classification with Feature-Guided Rectified Flow and Knowledge Distillation

Ein neues eins-Shot-Federated-Learning-Framework für die Klassifizierung medizinischer Bildgebung mit funktionsgeführter rektifizierter Strömung und Wissensdestillation

新的以地制引校正流动和知识蒸馏法的医学成像分类单一式联邦学习框架 2507.19045v1

Authors (5): Yufei Ma, Hanwen Zhang, Qiya Yang, Guibo Luo, Yuesheng Zhu

In multi-center scenarios, One-Shot Federated Learning (OSFL) has attracted increasing attention due to its low communication overhead, requiring only a single round of transmission. However, existing generative model-based OSFL methods suffer from low training efficiency and potential privacy leakage in the healthcare domain. Additionally, achieving convergence within a single round of model aggregation is challenging under non-Independent and Identically Distributed (non-IID) data. To address these challenges, this paper proposes a modified OSFL framework, in which a new Feature-Guided Rectified Flow Model (FG-RF) and a Dual-Layer Knowledge Distillation (DLKD) aggregation method are developed. FG-RF on the client side accelerates generative modeling in medical imaging scenarios while preserving privacy by synthesizing feature-level images rather than pixel-level images. To handle non-IID distributions, DLKD enables the global student model to simultaneously mimic the output logits and align the intermediate-layer features of client-side teacher models during aggregation. Experimental results on three non-IID medical imaging datasets show that our new framework and method outperform multi-round federated learning approaches, achieving up to 21.73% improvement, and exceed the baseline FedISCA by an average of 21.75%. Furthermore, our experiments demonstrate that feature-level synthetic images significantly reduce privacy leakage risks compared to pixel-level synthetic images.

在多个中心情景中,一流联邦学习(OSFL)因其通信管理管理费用低而吸引了越来越多的关注,只需要单轮传输。然而,现有的基于基因模型的OSFL方法在医疗保健领域培训效率低和潜在隐私泄漏方面受到影响。此外,在非独立和同种分布(非IID)的模型汇总数据下,在单一一轮模型汇总中实现趋同是一项挑战。为了应对这些挑战,本文件提议了一个修改的OSFL框架,在这个框架中,开发了一个新的基于特性的、有指导的校正流程模型(FG-RF)和双拉叶知识蒸馏(DLKD)汇总方法。客户方的FG-RF加快了医学成像假设情景中的基因模型,同时通过将地平级图像而不是像素级图像合成(Pix-IID)数据同步来保护隐私。为了处理非IID的分布,DLKD使全球学生模型能够同时模拟输出对输出逻辑并调整客户端教师模型的中间层特征。在21级分组期间,将客户端图像蒸馏(DFI-II)的实验结果显示三种非基础模型模型的模型,将Mex-IMFisalimal 学习方法比了我们的平均数据框架,将MLisimma-xxxxxxxxxxxxxxxxxx
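
The idea of distilling at two levels at once (matching output logits and aligning intermediate features) can be sketched as a combined loss. The version below is a generic formulation, assuming a temperature-softened KL term on logits plus a mean-squared-error term on features; the exact DLKD loss, its weights, and the names used here are assumptions and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with temperature scaling."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dual_level_distillation_loss(student_logits, teacher_logits,
                                 student_feat, teacher_feat,
                                 temperature=2.0, feat_weight=1.0):
    """KL(teacher || student) on softened logits plus MSE on features."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    mse = sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat)) / len(student_feat)
    return kl + feat_weight * mse


# A student identical to its teacher incurs zero loss; a mismatch does not.
same = dual_level_distillation_loss([1.0, 2.0], [1.0, 2.0], [0.5, 0.5], [0.5, 0.5])
diff = dual_level_distillation_loss([2.0, 1.0], [1.0, 2.0], [0.0, 1.0], [0.5, 0.5])
print(same, diff)
```

With several client-side teachers, one plausible extension is to average this loss over teachers, though how the paper aggregates them is not stated in the abstract.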

Article 77

Title: GPUnion: Autonomous GPU Sharing on Campus

GPUnion: Autonomer GPU-Sharing auf dem Campus

GPUU:在校园中自主分享GPU 2507.18928v1

Authors (5): Yufang Li, Yuanbo Zhang, Hanlong Liao, Guoming Tang, Deke Guo

A pronounced imbalance in GPU resources exists on campus, where some laboratories own underutilized servers while others lack the compute needed for AI research. GPU sharing can alleviate this disparity, but existing platforms typically rely on centralized oversight and persistent allocation models, conflicting with the voluntary and autonomous nature of academic resource ownership. We present GPUnion, a campus-scale GPU sharing platform enabling voluntary participation while preserving full provider autonomy. GPUnion incorporates three core mechanisms: i) container-based task dispatching and execution, ii) resource provider-first architecture, and iii) resilient execution featuring automatic checkpointing and migration. GPUnion also supports custom data storage and integrates non-root execution and image attestation to improve container isolation and security. Case studies across multiple campus scenarios demonstrate a 30% improvement in GPU utilization, a 40% increase in interactive sessions, and a 94% success rate for workload migration during provider departures. GPUnion demonstrates that provider autonomy and platform reliability can coexist, challenging conventional centralized paradigms and democratizing access to computational resources within campus networks.

在园区,一些实验室拥有利用不足的服务器,而另一些实验室则缺乏AI研究所需的计算方法。GPU共享可以缓解这种差异,而现有平台通常依赖集中监督和持续分配模式,这与学术资源拥有的自愿和自主性质相冲突。我们介绍了园区规模的GPUIion,一个校园规模的GPU共享平台,既能自愿参与,又能维护供应商的完全自主权。 GPUUI包含三个核心机制:(一) 集装箱任务发送和执行,(二) 资源提供者第一架构,以及(三) 以自动检查和迁移为主的具有弹性的执行。GPUnion还支持定制数据储存,并整合非基层执行和图像证明,以改善集装箱化的隔离和安全性。多个校园的案例研究显示,GPU利用率提高30%以上,互动会议增加40%,供应商离开期间成功转移工作量达到94%。GPUUnion表明,供应商的自主权和平台可靠性可以共存,挑战常规集中模式,并使校园网络内的计算资源民主化。

Article 78

Title@2025-07-25 (5): RailX: A Flexible, Scalable, and Low-Cost Network Architecture for Hyper-Scale LLM Training Systems

Title: RailX: A Flexible, Scalable, and Low-Cost Network Architecture for Hyper-Scale LLM Training Systems

RailX: Eine flexible, skalierbare und kostenarme Netzwerkarchitektur für Hyper-Scale LLM Trainingssysteme

RailX:超大型有限LM培训系统灵活、可缩放和低成本网络架构 2507.18889v1

Authors (8): Yinxiao Feng, Tiancheng Chen, Yuchen Wei, Siyuan Shen, Shiju Wang, Wei Li, Kaisheng Ma, Torsten Hoefler

Increasingly large AI workloads are calling for hyper-scale infrastructure; however, traditional interconnection network architecture is neither scalable nor cost-effective enough. Tree-based topologies such as the \textit{Rail-optimized} network are extremely expensive, while direct topologies such as \textit{Torus} have insufficient bisection bandwidth and flexibility. In this paper, we propose \textit{RailX}, a reconfigurable network architecture based on intra-node direct connectivity and inter-node circuit switching. Nodes and optical switches are physically 2D-organized, achieving better scalability than existing centralized circuit switching networks. We propose a novel interconnection method based on \textit{Hamiltonian Decomposition} theory to organize separate rail-based rings into an \textit{all-to-all} topology, simultaneously optimizing ring-collective and all-to-all communication. More than $100$K chips with hyper bandwidth can be interconnected with a flat switching layer, and the diameter is only $2\sim4$ inter-node hops. The network cost per injection/All-Reduce bandwidth of \textit{RailX} is less than $10\%$ of the Fat-Tree, and the cost per bisection/All-to-All bandwidth is less than $50\%$ of the Fat-Tree. Specifically, only $\sim$\$1.3B is required to interconnect 200K chips with 1.8TB bandwidth. \textit{RailX} can also be used in the ML-as-a-service (MLaaS) scenario, where single or multiple training workloads with various shapes, scales, and parallelism strategies can be flexibly mapped, and failures can be worked around.

AI 工作量越来越大, 需要超大型基础设施; 但是, 传统的互连网络架构既不可缩放, 也不具备足够的成本效益。基于树的地形结构, 如 rextit{ Rail- optimified} 网络非常昂贵, 而像\ textit{ Torus} 这样的直接地形结构没有足够的双节带宽和灵活性。在本文中, 我们提议\ textit{ RailX} , 一个基于节点内直接连通和跨节电路转换的可重新配置的网络架构。节点和光学开关是有形的 2D 组织起来的, 并且比现有的中央电路转换网络网络更容易缩放。我们提议一种基于\ textit{ Hamiltonian Decomposit} 的新型互连通方法, 将不同的铁路环组织成\ textitle{all{ all- all- to- commission。我们提议的网络成本成本是每100美元, 而不是每平面的平流。
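
The abstract above leans on Hamiltonian decomposition: the edges of a complete graph on an odd number of nodes can be split into edge-disjoint rings that each visit every node, so a set of rings jointly provides all-to-all connectivity. The sketch below uses the classical Walecki construction to demonstrate this fact; it is not claimed to be the construction RailX uses, and the function names are illustrative.

```python
from itertools import combinations


def walecki_rings(m):
    """Decompose the complete graph on 2m+1 nodes into m Hamiltonian cycles.

    Nodes 0..2m-1 sit on a circle; node 2m is the hub. Each ring is a
    rotation of one zigzag path through the circle, closed via the hub.
    """
    n = 2 * m
    zigzag = [0]
    for j in range(1, m):
        zigzag += [j, n - j]
    zigzag.append(m)
    hub = n
    return [[hub] + [(v + shift) % n for v in zigzag] for shift in range(m)]


def ring_edges(ring):
    """Undirected edges of a ring given as a cyclic node sequence."""
    return {frozenset((ring[i], ring[(i + 1) % len(ring)])) for i in range(len(ring))}


rings = walecki_rings(3)  # 7 nodes -> 3 rings
covered = [e for ring in rings for e in ring_edges(ring)]
all_pairs = {frozenset(p) for p in combinations(range(7), 2)}
print(len(covered) == len(set(covered)))  # True: no link is used twice
print(set(covered) == all_pairs)          # True: every node pair is linked
```

For 7 nodes this yields three rings of 7 links each, exactly the 21 links of the complete graph, which is the sense in which separate rings can add up to an all-to-all topology.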

Article 79

Title@2025-07-25 (5): Fully Energy-Efficient Randomized Backoff: Slow Feedback Loops Yield Fast Contention Resolution

Title: Fully Energy-Efficient Randomized Backoff: Slow Feedback Loops Yield Fast Contention Resolution

Vollenergieeffizienter Randomized Backoff: Langsame Rückkopplungsschleifen liefern schnelle Streitbeilegung

完全节能随机后退:慢速反馈循环 2302.07751v5

Authors (5): Michael A. Bender, Jeremy T. Fineman, Seth Gilbert, John Kuszmaul, Maxwell Young

Contention resolution addresses the problem of coordinating access to a shared channel. Time proceeds in slots, and a packet transmission can be made in any slot. A packet is successfully sent if no other packet is also transmitted during that slot. If two or more packets are sent in the same slot, then none of these transmissions succeed. Listening during a slot gives ternary feedback, indicating if that slot had (0) silence, (1) a successful transmission, or (2+) noise. No other feedback is available. Packets are (adversarially) injected into the system over time. A packet departs the system once it is successful. The goal is to send all packets while optimizing throughput, which is roughly the fraction of successful slots. Most prior algorithms with constant throughput require a short feedback loop, in the sense that a packet’s sending probability in slot t+1 is fully determined by its internal state at slot t and the channel feedback at slot t. An open question is whether these short feedback loops are necessary; that is, how often must listening and updating occur in order to achieve constant throughput? This question addresses energy efficiency, since both listening and sending consume significant energy. The channel can also suffer adversarial noise (“jamming”), which causes any listener to hear noise, even when no packets are sent. How does jamming affect our goal of long feedback loops/energy efficiency? Connecting these questions, we ask: what does a contention-resolution algorithm have to sacrifice to reduce channel accesses? Must we give up on constant throughput or robustness to noise? Here, we show that we need not concede anything. Suppose there are N packets and J jammed slots, where the input is determined by an adaptive adversary. We give an algorithm that, with high probability in N+J, has constant throughput and polylog(N+J) channel accesses per packet.

内容解析解决了对共享频道访问的协调问题。时间在空格中持续, 并且可以在任何空格中进行包传输。如果在空格中, 没有其它的包也会成功发送。如果两个或两个以上的包被在同一空格中发送, 那么这些传输就不会成功。在空格中监听会提供永恒的反馈, 表明空格是否有( 0) 沉默, (1) 成功传输, 或 ( 2+) 噪音。没有其他反馈。封会( 对抗性) 随着时间的推移被注入系统。包一旦成功, 就会退出系统。封会发送所有的包, 并且优化通量, 这大约是成功空格的一小部分。大多数具有恒定输量的算法都需要一个简短的回路圈回路, 也就是说, 一个在空档中发送的概率完全取决于它的内部状态, (1) 沉默, (1) 成功传输或频道反馈。一个开放的问题是, 我们从这些短的回馈回路是需要什么; 在那里, 需要多少监听和更新来实现不断的流流流流流流 ? 这个问题会降低节流节流, 。
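The channel model in this abstract (slots, ternary feedback, jamming) can be illustrated with a small simulation. This is only a sketch of the model, not the paper's algorithm: the `simulate` baseline below assumes every packet knows how many packets are still pending, which real contention-resolution algorithms cannot assume, and all function names are illustrative.

```python
import random

SILENCE, SUCCESS, NOISE = 0, 1, 2  # ternary feedback values


def slot_feedback(num_senders, jammed=False):
    """Feedback heard by a listener in one slot.

    A jammed slot always sounds like noise, even with zero or one sender.
    """
    if jammed or num_senders >= 2:
        return NOISE
    return SUCCESS if num_senders == 1 else SILENCE


def simulate(num_packets, seed=0, max_slots=100_000):
    """Toy genie-aided baseline (an assumption, not the paper's method).

    Each pending packet sends with probability 1/pending in every slot.
    Returns (delivered_packets, slots_used).
    """
    rng = random.Random(seed)
    pending, slots = num_packets, 0
    while pending > 0 and slots < max_slots:
        senders = sum(rng.random() < 1.0 / pending for _ in range(pending))
        if slot_feedback(senders) == SUCCESS:
            pending -= 1  # the successful packet departs the system
        slots += 1
    return num_packets - pending, slots
```

Throughput in this toy setting is `delivered / slots`. Note that here every pending packet accesses the channel in every slot, which is exactly the short feedback loop the paper asks whether one can avoid.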

Article 80

Title@2025-07-25 (5): Deadline-Aware Joint Task Scheduling and Offloading in Mobile Edge Computing Systems

Title: Deadline-Aware Joint Task Scheduling and Offloading in Mobile Edge Computing Systems

Deadline-Aware Joint Task Planung und Offloading in Mobile Edge Computing Systemen

移动边缘电子计算系统联合任务安排和卸载 2507.18864v1

Authors (6): Ngoc Hung Nguyen, Van-Dinh Nguyen, Anh Tuan Nguyen, Nguyen Van Thieu, Hoang Nam Nguyen, Symeon Chatzinotas

The demand for stringent interactive quality-of-service has intensified in both mobile edge computing (MEC) and cloud systems, driven by the imperative to improve user experiences. As a result, the processing of computation-intensive tasks in these systems necessitates adherence to specific deadlines or achieving extremely low latency. To optimize task scheduling performance, existing research has mainly focused on reducing the number of late jobs whose deadlines are not met. However, the primary challenge with these methods lies in the total search time and scheduling efficiency. In this paper, we present an optimal job scheduling algorithm designed to determine the optimal task order for a given set of tasks. In addition, users are enabled to make informed decisions for offloading tasks based on the information provided by servers. The details of performance analysis are provided to show its optimality and low complexity, with linearithmic time $O(n \log n)$, where $n$ is the number of tasks. To tackle the uncertainty of the randomly arriving tasks, we further develop an online approach with fast outage detection that achieves rapid acceptance times with time complexity of $O(n)$. Extensive numerical results are provided to demonstrate the effectiveness of the proposed algorithm in terms of the service ratio and scheduling cost.

由于需要改进用户经验,对严格互动性服务质量的需求在移动边缘计算(MEC)和云层系统中都有所增加。因此,处理这些系统中的计算密集型任务需要遵守具体的最后期限,或达到极低的延迟。为优化任务时间安排业绩,现有研究主要侧重于减少未达到最后期限的延迟工作数量。然而,这些方法的主要挑战在于总搜索时间和时间安排效率。我们在本文件中介绍了最佳的工作时间安排算法,目的是确定特定任务组的最佳任务顺序。此外,用户能够根据服务器提供的信息就卸载任务做出知情决定。提供了业绩分析的细节,以显示其最佳性和低复杂性,与任务数量为美元线性O(nron)的时间相比。为解决随机抵达任务的不确定性,我们进一步开发了一种在线方法,快速外出检测方法,在时间和O(n)时间的复杂度上获得快速接受。提供了广泛的数字结果,以显示拟议算法在服务安排比例和成本方面的有效性。
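The abstract does not spell out the scheduling algorithm. As a point of reference, the classical Moore-Hodgson procedure minimizes the number of late jobs on a single machine in $O(n \log n)$ time. The sketch below shows that textbook procedure only; it should not be read as the paper's method, and the function name and the `(processing_time, deadline)` job format are assumptions made for illustration.

```python
import heapq


def max_on_time_jobs(jobs):
    """Classical Moore-Hodgson sketch for one machine.

    jobs: list of (processing_time, deadline) tuples.
    Returns the jobs that can all finish on time, in deadline order.
    """
    kept = []      # max-heap on processing time, stored as (-p, d)
    elapsed = 0    # total processing time of the jobs currently kept
    for p, d in sorted(jobs, key=lambda job: job[1]):  # earliest deadline first
        heapq.heappush(kept, (-p, d))
        elapsed += p
        if elapsed > d:
            # The deadline is missed: drop the longest job kept so far.
            neg_p, _ = heapq.heappop(kept)
            elapsed += neg_p  # neg_p is negative, so this subtracts
    return sorted(((-neg_p, d) for neg_p, d in kept), key=lambda job: job[1])
```

Sorting dominates the cost, and each job is pushed and popped at most once, which gives the linearithmic bound.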

Article 81

Title@2025-07-24 (4): FedVSR: Towards Model-Agnostic Federated Learning in Video Super-Resolution

Title: FedVSR: Towards Model-Agnostic Federated Learning in Video Super-Resolution

FedVSR: Auf dem Weg zu einem modell-agnostischen Federated Learning in Video Super-Resolution

FedVSR: 争取在视频超分辨率中开展示范性、不可计量的联邦学习 2503.13745v2

Authors (5): Ali Mollaahmadi Dehaghi, Hossein KhademSohi, Reza Razavi, Steve Drew, Mohammad Moshirpour

Video super-resolution aims to enhance low-resolution videos by leveraging both spatial and temporal information. While deep learning has led to impressive progress, it typically requires centralized data, which raises privacy concerns. Federated learning offers a privacy-friendly solution, but general FL frameworks often struggle with low-level vision tasks, resulting in blurry, low-quality outputs. To address this, we introduce FedVSR, the first FL framework specifically designed for VSR. It is model-agnostic and stateless, and introduces a lightweight loss function based on the DWT to better preserve high-frequency details during local training. Additionally, a loss-aware aggregation strategy combines both DWT-based and task-specific losses to guide global updates effectively. Extensive experiments across multiple VSR models and datasets demonstrate that FedVSR consistently outperforms existing FL methods, achieving up to 0.82 dB higher PSNR, 0.0327 higher SSIM, and 0.0251 lower LPIPS. These results underscore FedVSR’s ability to bridge the gap between privacy and performance, setting a new benchmark for federated learning in low-level vision tasks. The code is available at: https://github.com/alimd94/FedVSR

通过利用空间和时间信息加强低分辨率视频超分辨率的超分辨率视频,目的是通过利用空间和时间信息加强低分辨率视频。虽然深层次的学习导致了令人印象深刻的进展,但通常需要集中的数据,这引起了隐私问题。联邦学习提供了一种方便隐私的解决办法,但一般FL框架往往与低层次的愿景任务纠缠不休,造成模糊和低质量的产出。为此,我们引入了第一个为VSR专门设计的FedVSR(FL框架),即FedVSR(FLS),这是第一个为VSR专门设计的FLFL框架。它是一个模式性能和无国籍,并引入基于DWT的轻度损失功能,以更好地保存当地培训中的高频细节。此外,一个基于DWT和特定任务的损失汇总战略将DWT的损失结合起来,以有效指导全球更新。多个VSR模型和数据集的广泛实验表明,FVSR(F)始终超越现有的FL方法,达到0.82 dB更高PSIM、0.0327高的SSIM和0.0251低的LPPS。这些结果强调了FSR(FVSR)有能力弥补隐私和业绩之间的差距,为低度学习提供新的基准,为低度任务提供新的基准/SR.94。
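To make the idea of a DWT-based loss concrete, here is a minimal sketch using a one-level 2D Haar transform in plain Python. The abstract only says the loss is based on the DWT and aims to preserve high-frequency detail. The choice of the Haar wavelet, the single decomposition level, the L1 distance, and the function names are all assumptions for illustration, not FedVSR's actual loss.

```python
def haar_details(img):
    """One-level 2D Haar detail coefficients.

    img: list of rows (even height and width).
    Returns one (d1, d2, d3) tuple of detail coefficients per 2x2 block;
    the low-frequency (average) coefficient is deliberately left out.
    """
    details = []
    for r in range(0, len(img), 2):
        for col in range(0, len(img[0]), 2):
            a, b = img[r][col], img[r][col + 1]
            c, d = img[r + 1][col], img[r + 1][col + 1]
            details.append(((a - b + c - d) / 2.0,   # left/right difference
                            (a + b - c - d) / 2.0,   # top/bottom difference
                            (a - b - c + d) / 2.0))  # diagonal difference
    return details


def high_freq_loss(pred, target):
    """Mean absolute difference between the detail coefficients of two images."""
    p, t = haar_details(pred), haar_details(target)
    total = sum(abs(x - y) for pb, tb in zip(p, t) for x, y in zip(pb, tb))
    return total / (3 * len(p))
```

Because the detail coefficients cancel constant offsets, such a term penalizes lost edges and texture rather than a brightness shift, which is why it would typically be combined with a task-specific pixel loss.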

Article 82

Title@2025-07-24 (4): Performance in solving the Hermitian and pseudo-Hermitian Bethe-Salpeter equation with the Yambo code

Title: Performance in solving the Hermitian and pseudo-Hermitian Bethe-Salpeter equation with the Yambo code

Leistung bei der Lösung der Hermitian und Pseudo-Hermitian Bethe-Salpeter-Gleichung mit dem Yambo-Code

用Yambo 代码解决埃米迪和伪希腊比斯- 圣彼得方程式的性能 2504.10096v2

Authors (10): Petru Milev, Blanca Mellado-Pinto, Muralidhar Nalabothula, Ali Esquembre Kucukalic, Fernando Alvarruiz, Enrique Ramos, Alejandro Molina-Sanchez, Ludger Wirtz, Jose E. Roman, Davide Sangalli

We analyze the performance of two strategies in solving the structured eigenvalue problem deriving from the Bethe-Salpeter equation (BSE) in condensed matter physics. The BSE matrix is constructed with the Yambo code, and the two strategies are implemented by interfacing Yambo with the ScaLAPACK and ELPA libraries for direct diagonalization, and with the SLEPc library for the iterative approach. We consider both the Hermitian (Tamm-Dancoff approximation) and pseudo-Hermitian forms, addressing dense matrices of three different sizes. A description of the implementation is also provided, with details for the pseudo-Hermitian case. Timing and memory utilization are analyzed on both CPU and GPU clusters. Our results demonstrate that it is now feasible to handle dense BSE matrices of the order of $10^5$.

我们分析了两种战略在解决浓缩物质物理学中来自Bethe-Salpeter等式(BSE)的结构化二元值问题方面的表现。 BSE矩阵是用\ textt{Yambo} 代码构建的,而这两种战略的实施方式是:与ScalAPACK和ELPA图书馆进行接口,以直接对分化,并与SLEPc图书馆进行迭接方法。我们考虑了Hermitian(Tamm-Dancoff近似)和伪Hermitian表格,处理三种不同大小的密集基质。还介绍了执行情况,并提供了伪赫米提案的细节。对时间和记忆利用情况进行了分析,对CPU和GPU两个组进行了分析。我们的结果表明,现在可以处理10-5美元的密集BSE矩阵。

Article 83

Title@2025-07-24 (4): PPipe: Efficient Video Analytics Serving on Heterogeneous GPU Clusters via Pool-Based Pipeline Parallelism

Title: PPipe: Efficient Video Analytics Serving on Heterogeneous GPU Clusters via Pool-Based Pipeline Parallelism

PPipe: Effiziente Videoanalyse mit heterogenen GPU-Clustern über Pool-Based Pipeline Parallelismus

PPipe:通过基于联营的管道平行主义,在异基因性GPU集群上提供高效视频分析工具 2507.18748v1

Authors (3): Z. Jonny Kong, Qiang Xu, Y. Charlie Hu

With the rapid innovation of GPUs, heterogeneous GPU clusters in both public clouds and on-premise data centers have become increasingly commonplace. In this paper, we demonstrate how pipeline parallelism, a technique well-studied for throughput-oriented deep learning model training, can be used effectively for serving latency-bound model inference, e.g., in video analytics systems, on heterogeneous GPU clusters. Our work exploits the synergy between diversity in model layers and diversity in GPU architectures, which results in comparable inference latency for many layers when running on low-class and high-class GPUs. We explore how such overlooked capability of low-class GPUs can be exploited using pipeline parallelism and present a novel inference serving system, PPipe, that employs pool-based pipeline parallelism via an MILP-based control plane and a data plane that performs resource reservation-based adaptive batching. Evaluation results on diverse workloads (18 CNN models) show that PPipe achieves 41.1% - 65.5% higher utilization of low-class GPUs while maintaining high utilization of high-class GPUs, leading to 32.2% - 75.1% higher serving throughput compared to various baselines.

随着GPU的快速创新,公共云层和准备状态数据中心中的各种GPU集群日益变得司空见惯。在本文件中,我们展示了如何有效地利用管道平行(一种对吞吐量的深层次学习模式培训有很好研究的技术,即管道平行(一种对吞吐量的深入学习模式培训有很好研究的技术)来为延缓模式推断服务,例如在视频分析系统、对多元GPU集群的模型分析系统等。我们的工作利用了模型层多样性和GPU结构多样性之间的协同作用,这导致在使用低级和高级GPU时,许多层次的比重具有可比的推论。我们探索了如何利用管道平行(这种对低级GPUP)的这种被忽视的能力,并提出了一种新的推论(PPipe),即通过基于视频的MILP控制平面和基于资源保留的适应性批量数据平面,利用基于库的管道的平行模式。关于不同工作量的评价结果(18 NCNN模型)显示,低级GPIPPI在使用低级GPUPUPUPPPPPPPPS时达到41.1%至更高的65%,同时保持高利用率至高位至高位,同时维持高端的75。

Article 84

Title@2025-07-24 (4): CUTHERMO: Understanding GPU Memory Inefficiencies with Heat Map Profiling

Title: CUTHERMO: Understanding GPU Memory Inefficiencies with Heat Map Profiling

CUTHERMO: GPU-Speicher-Ineffizienzen mit Wärmekartenprofilierung verstehen

CUTHERMO: 了解 GPU 内存效率不及热地图分析 2507.18729v1

Authors (6): Yanbo Zhao, Jinku Cui, Zecheng Li, Shuyin Jiao, Xu Liu, Jiajia Li

GPUs have become indispensable in high-performance computing, machine learning, and many other domains. Efficiently utilizing the memory subsystem on GPUs is critical for maximizing computing power through massive parallelism. Analyzing memory access patterns has proven to be an effective method for understanding memory bottlenecks in applications. However, comprehensive runtime and fine-grained memory profiling support is lacking on GPU architectures. In this work, we introduce cuThermo, a lightweight and practical profiling tool for GPU memory analysis. It operates on GPU binaries without requiring any modifications to hardware, operating system, or application source code. Given a CUDA application, cuThermo identifies memory inefficiencies at runtime via a heat map based on distinct visited warp counts to represent word-sector-level data sharing and provides optimization guidance in performance tuning iterations. Through our experiments on six applications, we identified five memory access patterns that are portable across different GPU architectures. By evaluating optimization on two GPUs, cuThermo achieves up to 721.79% performance improvement.

在高性能计算、机器学习和其他许多领域, GPU变得不可或缺。有效地利用 GPU 上的记忆子系统对于通过大规模平行方式实现计算能力最大化至关重要。分析记忆存取模式已证明是理解应用中记忆瓶颈的有效方法。但是, GPU 结构缺乏全面的运行时间和微微微的记忆特征分析支持。在这项工作中, 我们为 GPU 内存分析引入了一个轻量和实用的剖析工具 cutermo 。它在 GPU 中运行, 不需要修改硬件、操作系统或应用源代码。根据 CUDA 应用程序, CuThermo 发现运行时的记忆效率低下, 以不同访问的 Wordp 计热图为基础, 代表单部门一级数据共享, 并提供性能调试调的优化指导。我们通过在六个应用程序上进行的实验, 确定了五个可移动到不同 GPUP 结构的记忆存取模式。通过对两个 GPUP 的优化评估, Cuter 达到 721.79 $ 改进性能。
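The "distinct visited warp counts" heat map can be pictured as follows: for each memory sector, count how many different warps touched it. The sketch below is a host-side illustration of that counting idea only. The trace format of `(warp_id, byte_address)` pairs, the 32-byte sector size, and the function name are assumptions, and nothing here reflects how cuThermo actually instruments GPU binaries.

```python
from collections import defaultdict

SECTOR_BYTES = 32  # assumed sector granularity for this illustration


def warp_heat_map(accesses):
    """Count distinct warps per memory sector.

    accesses: iterable of (warp_id, byte_address) pairs from a hypothetical trace.
    Returns {sector_index: number_of_distinct_warps}.
    """
    warps_per_sector = defaultdict(set)
    for warp_id, address in accesses:
        warps_per_sector[address // SECTOR_BYTES].add(warp_id)
    return {sector: len(warps) for sector, warps in warps_per_sector.items()}
```

A sector visited by many distinct warps indicates data sharing across warps, while many sectors each visited by a single warp point to scattered accesses.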

Article 85

Title@2025-07-24 (4): AI Flow: Perspectives, Scenarios, and Approaches

Title: AI Flow: Perspectives, Scenarios, and Approaches

AI Flow: Perspektiven, Szenarien und Ansätze

AI 流动:观点、设想和方法 2506.12479v3

Authors (14): Hongjun An, Wenhan Hu, Sida Huang, Siqi Huang, Ruanjun Li, Yuanzhi Liang, Jiawei Shao, Yiliang Song, Zihan Wang, Cheng Yuan, Chi Zhang, Hongyuan Zhang, Wenhao Zhuang, Xuelong Li

Pioneered by the foundational information theory by Claude Shannon and the visionary framework of machine intelligence by Alan Turing, the convergent evolution of information and communication technologies (IT/CT) has created an unbroken wave of connectivity and computation. This synergy has sparked a technological revolution, now reaching its peak with large artificial intelligence (AI) models that are reshaping industries and redefining human-machine collaboration. However, the realization of ubiquitous intelligence faces considerable challenges due to substantial resource consumption in large models and high communication bandwidth demands. To address these challenges, AI Flow has been introduced as a multidisciplinary framework that integrates cutting-edge IT and CT advancements, with a particular emphasis on the following three key points. First, a device-edge-cloud framework serves as the foundation, integrating end devices, edge servers, and cloud clusters to optimize scalability and efficiency for low-latency model inference. Second, we introduce the concept of familial models, which refers to a series of different-sized models with aligned hidden features, enabling effective collaboration and the flexibility to adapt to varying resource constraints and dynamic scenarios. Third, connectivity- and interaction-based intelligence emergence is a novel paradigm of AI Flow. By leveraging communication networks to enhance connectivity, the collaboration among AI models across heterogeneous nodes achieves emergent intelligence that surpasses the capability of any single model. The innovations of AI Flow provide enhanced intelligence, timely responsiveness, and ubiquitous accessibility to AI services, paving the way for the tighter fusion of AI techniques and communication systems.

由于克劳德·香农的基本信息理论和艾伦·图灵的机智智能远见框架的开创性,信息和通信技术(IT/CT)的趋同性演进形成了一个不间断的连通和计算浪潮,这种协同效应引发了技术革命,现在随着大型人工智能(AI)模型的重新塑造工业和重新界定人体机械合作而达到顶峰。然而,由于大型模型中大量资源消耗和高通信带宽需求,实现无处不在的情报面临巨大挑战。为了应对这些挑战,AI流动被引入了多学科框架,将先进的信息技术和CT进步结合起来,特别强调以下三个关键点。首先,装置-顶尖的云形框架作为基础,将终端装置、边缘服务器和云层集群结合起来,优化低电流模型的伸缩性和效率。第二,我们引入了家庭模型的概念,即一系列规模不同的模型,与一致的隐蔽性特征相适应,使得有效的合作和灵活性能够适应不同的资源限制和动态情景。第三,连接性和互动性框架作为基础基础,将连接性和互动性框架作为基础,将最终的智能升级性模型,从而提升AI系统。

Article 86

Title@2025-07-24 (4): Distributed Load Balancing with Workload-Dependent Service Rates

Title: Distributed Load Balancing with Workload-Dependent Service Rates

Distributed Load Balancing mit Workload-Dependent-Service-Raten

与工作量-依赖性服务费率平衡 2411.17103v2

Authors (6): Wenxin Zhang, Santiago R. Balseiro, Robert Kleinberg, Vahab Mirrokni, Balasubramanian Sivan, Bartek Wydrowski

We study distributed load balancing in bipartite queueing systems where frontends route jobs to heterogeneous backends with workload-dependent service rates. The system’s connectivity – governed by compatibility constraints such as data residency or resource requirements – is represented by an arbitrary bipartite graph. Each frontend operates independently without communication with other frontends, and the goal is to minimize the expected average latency of all jobs. We propose a closed-loop policy called the Greatest Marginal Service Rate (GMSR) policy that achieves effective coordination without requiring knowledge of arrival rates. In a discrete-time stochastic model, we show that the behavior of our routing policy converges (almost surely) to the behavior of a fluid model, in the limit as job sizes tend to zero and job arrival rates are scaled so that the expected total volume of jobs arriving per unit time remains fixed. Then, in the fluid regime, we demonstrate that the policy attains an $\epsilon$-suboptimal solution in $O(\delta + \log(1/\epsilon))$ time from $\delta$-suboptimal initial workloads, which implies global convergence to the centrally coordinated optimal routing. Finally, we analyze the fluid model when the system is overloaded. We show that GMSR lexicographically maximizes throughput, maximizes the number of stable backends, and minimizes their collective workload.

我们研究的是双面排队系统中的平衡分配负荷。在双面列队系统中, 前端将工作转向不同的后端,且服务费率取决于工作量。该系统的连通性 – – 受数据居住或资源需求等兼容性制约的制约 – – 由任意的双面图代表。每个前端独立运作,不与其他前端沟通,目标是最大限度地减少所有工作的预期平均长度。我们提出一个封闭式环流政策,称为最大边际服务率(GMSR)政策,在不需要了解抵达率的情况下实现有效协调。在离散时间的随机模型中,我们显示我们行进政策的行为会(几乎肯定)与流体模式的行为趋同(随着工作规模趋向为零,而工作到达率则被缩小,因此每个单位时间的预期工作量总量将保持不变。然后,在流动制度中,我们表明,政策在不需要了解抵达率的情况下,在不要求知道抵达率的情况下,就能实现有效的协调。我们的分流式政策的行为将(delta+ welog@1/\epsocial) 与流动模式的动作相交汇, 时间将(从美元开始) 显示, 最优化的系统显示, 最优化的顺流- 最优化的GMILAxxxI 显示, 显示, 最优化的系统意味着, 最优化的周期的递化的周期的递合的系统意味着它们的最大递合。
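One plausible reading of the policy's name is that each frontend sends the next job to the compatible backend whose service rate would increase the most from a little extra workload. The sketch below implements that reading with a finite-difference derivative. It is an interpretation of the abstract rather than the paper's definition, and the service-rate curves, function names, and dictionary layout are illustrative assumptions.

```python
def marginal_rate(service_rate, workload, h=1e-6):
    """Finite-difference estimate of d(service rate)/d(workload)."""
    return (service_rate(workload + h) - service_rate(workload)) / h


def route(compatible, workloads, service_rates):
    """Pick the compatible backend with the greatest marginal service rate.

    compatible: backend names this frontend may use (its edges in the bipartite graph).
    workloads: {backend: current workload}.
    service_rates: {backend: function mapping workload to service rate}.
    """
    return max(compatible,
               key=lambda b: marginal_rate(service_rates[b], workloads[b]))
```

The decision uses only the current backend workloads, so no arrival rates are needed, which matches the closed-loop character described in the abstract.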

Article 87

Title@2025-07-24 (4): Towards Designing an Energy Aware Data Replication Strategy for Cloud Systems Using Reinforcement Learning

Title: Towards Designing an Energy Aware Data Replication Strategy for Cloud Systems Using Reinforcement Learning

Auf dem Weg zu einer Strategie für eine energiebewusste Datenreplikation für Cloud-Systeme mittels Verstärkungslernen

为利用强化学习的云层系统设计一个有能源意识的数据复制战略 2507.18459v1

Authors (3): Amir Najjar, Riad Mokadem, Jean-Marc Pierson

The rapid growth of global data volumes has created a demand for scalable distributed systems that can maintain a high quality of service. Data replication is a widely used technique that provides fault tolerance, improved performance and higher availability. Traditional implementations often rely on threshold-based activation mechanisms, which can vary depending on workload changes and system architecture. System administrators typically bear the responsibility of adjusting these thresholds. To address this challenge, reinforcement learning can be used to dynamically adapt to workload changes and different architectures. In this paper, we propose a novel data replication strategy for cloud systems that employs reinforcement learning to automatically learn system characteristics and adapt to workload changes. The strategy’s aim is to provide satisfactory Quality of Service while optimizing a trade-off between provider profit and environmental impact. We present the architecture behind our solution and describe the reinforcement learning model by defining the states, actions and rewards.

全球数据量的迅速增长产生了对可扩展分布系统的需求,这种系统能够保持高质量的服务。数据复制是一种广泛使用的技术,可以提供错误容忍度、改进性能和更高的可用性。传统的实施往往依靠基于门槛的启动机制,这种机制可能因工作量的变化和系统结构而不同。系统管理员通常有责任调整这些阈值。为了应对这一挑战,可以利用强化学习来动态地适应工作量的变化和不同的结构。在本文件中,我们提出了云层系统的新数据复制战略,利用强化学习来自动学习系统特点和适应工作量变化。该战略的目的是提供令人满意的服务质量,同时优化供应商利润和环境影响之间的平衡。我们提出解决方案背后的结构,并通过界定州、行动和奖励来描述强化学习模式。

Article 88

Title@2025-07-24 (4): DiP: A Scalable, Energy-Efficient Systolic Array for Matrix Multiplication Acceleration

Title: DiP: A Scalable, Energy-Efficient Systolic Array for Matrix Multiplication Acceleration

DiP: Ein skalierbarer, energieeffizienter Systolischer Array für Matrix-Multiplikationsbeschleunigung

DiP:一个可缩放的、节能的、用于加速矩阵乘法加速的节能收缩阵列阵列 2412.09709v3

Authors (3): Ahmed J. Abdelmaksoud, Shady Agwa, Themis Prodromakis

Transformers are gaining increasing attention across Natural Language Processing (NLP) application domains due to their outstanding accuracy. However, these data-intensive models add significant performance demands to the existing computing architectures. Systolic array architectures, adopted by commercial AI computing platforms like Google TPUs, offer energy-efficient data reuse but face throughput and energy penalties due to input-output synchronization via First-In-First-Out (FIFO) buffers. This paper proposes a novel scalable systolic array architecture featuring Diagonal-Input and Permutated weight stationary (DiP) dataflow for matrix multiplication acceleration. The proposed architecture eliminates the synchronization FIFOs required by state-of-the-art weight stationary systolic arrays. Beyond the area, power, and energy savings achieved by eliminating these FIFOs, DiP architecture maximizes the computational resource utilization, achieving up to 50% throughput improvement over conventional weight stationary architectures. Analytical models are developed for both weight stationary and DiP architectures, including latency, throughput, time to full PEs utilization (TFPU), and FIFOs overhead. A comprehensive hardware design space exploration using 22nm commercial technology demonstrates DiP’s scalability advantages, achieving up to a 2.02x improvement in energy efficiency per area. Furthermore, DiP outperforms TPU-like architectures on transformer workloads from widely-used models, delivering energy improvement up to 1.81x and latency improvement up to 1.49x. At a 64x64 size with 4096 PEs, DiP achieves a peak throughput of 8.192 TOPS with energy efficiency 9.548 TOPS/W.

64 自然语言处理(NLP)应用领域中变压者由于它们的精确性而越来越受到越来越多的关注。然而,这些数据密集型模型增加了现有计算结构的显著性能需求。谷歌 TPUs 等商业AI 计算平台采用的Systolic 阵列结构提供了节能数据再利用,但由于通过FIFFOs(FIPO)缓冲实现输入-输出同步,而面临过量和能量惩罚。本文件提议了一个新的可缩放的系统阵列结构,其特点是对数-投入和变换重量固定(DiP)40数据流动,以加速矩阵的倍增倍增速度。拟议的结构消除了州级的重力固定阵列阵列阵列所需的同步FIFFFFFFFOs。在消除这些FIFFOs后实现了节能性数据再利用, DiPPFP结构最大限度地实现了计算资源的利用率,在常规重定额结构上实现了50 吞吐量改进。分析模型针对重量固定/变压结构, 包括平流、时间到完整PElix Slievilent distryal distrex distreal divex distrealus laus lavel lax divation lax divational
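The headline numbers for the 64x64 array can be sanity-checked with a little arithmetic. The sketch below assumes the common convention that one multiply-accumulate counts as two operations per PE per cycle. Under that assumption the figures imply a clock of about 1 GHz and a power draw of roughly 0.86 W. Neither value is stated in the abstract, so both are inferences.

```python
ROWS, COLS = 64, 64
PEAK_TOPS = 8.192          # from the abstract
TOPS_PER_WATT = 9.548      # from the abstract
OPS_PER_PE_PER_CYCLE = 2   # assumption: one MAC = multiply + add

num_pes = ROWS * COLS                                        # 4096 processing elements
implied_clock_hz = PEAK_TOPS * 1e12 / (num_pes * OPS_PER_PE_PER_CYCLE)
implied_power_w = PEAK_TOPS / TOPS_PER_WATT                  # TOPS / (TOPS/W) = W

print(f"PEs: {num_pes}")
print(f"Implied clock: {implied_clock_hz / 1e9:.3f} GHz")
print(f"Implied power: {implied_power_w:.3f} W")
```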

Article 89

Title@2025-07-24 (4): FMI Meets SystemC: A Framework for Cross-Tool Virtual Prototyping

Title: FMI Meets SystemC: A Framework for Cross-Tool Virtual Prototyping

FMI trifft SystemC: Ein Rahmen für das Cross-Tool Virtual Prototyping

FMI 满足系统C:跨工具虚拟原型框架 2507.18339v1

Authors (5): Nils Bosbach, Meik Schmidt, Lukas Jünger, Matthias Berthold, Rainer Leupers

As systems become more complex, the demand for thorough testing and virtual prototyping grows. To simulate whole systems, multiple tools are usually needed to cover different parts. These parts include the hardware of a system and the environment with which the system interacts. The Functional Mock-up Interface (FMI) standard for co-simulation can be used to connect these tools. The control part of modern systems is usually a computing unit, such as a System-on-a-Chip (SoC) or Microcontroller Unit (MCU), which executes software from a connected memory and interacts with peripherals. To develop software without requiring access to physical hardware, full-system simulators, the so-called Virtual Platforms (VPs), are commonly used. The IEEE-standardized framework for VP development is SystemC TLM. SystemC provides interfaces and concepts that enable modular design and model exchange. However, SystemC lacks native FMI support, which limits the integration into broader co-simulation environments. This paper presents a novel framework to control and interact with SystemC-based VPs using the FMI. We present a case study showing how a simulated temperature sensor in a SystemC simulation can obtain temperature values from an external tool via FMI. This approach allows the unmodified target software to run on the VP and receive realistic environmental input data such as temperature, velocity, or acceleration values from other tools. Thus, extensive software testing and verification is enabled. By having tests ready and the software pre-tested using a VP once the physical hardware is available, certifications like ISO 26262 can be done earlier.

随着系统变得更为复杂,对彻底测试和虚拟原型的需求日益增长。要模拟整个系统,通常需要多种工具来覆盖不同部分。这些部分包括系统硬件和系统互动环境。可以使用功能模拟界面(FMI)标准,用于共同模拟这些工具。现代系统的控制部分通常是一个计算单位,例如系统对立系统(SoC)或微控制器(MCU),它从连接的记忆中执行软件,并与外围环境互动。开发软件不需要使用物理硬件,系统对软件进行全面系统模拟,并使用所谓的虚拟平台(VPs),这些部分通常使用功能模拟界面接口(FMI)标准化框架,用于共同模拟这些工具。系统C提供界面和概念,便于模块设计和模型交换。然而,系统C缺乏本地的FMI支持,它提供了一个与基于系统(VP)的同步存储和互动的新框架,使用FMI的系统(OFP)的全系统快速化软件模拟系统(VP),我们用这种系统对系统进行快速的测试,我们用系统(FMI)的系统进行测试,可以让外部的系统对服务器进行测试系统进行测试。我们用一个测试,可以让外部的系统进行这样的服务器进行这样的测试。

Article 90

Title@2025-07-24 (4): Staleness-Centric Optimizations for Parallel Diffusion MoE Inference

Title: Staleness-Centric Optimizations for Parallel Diffusion MoE Inference

Staleness-Centric Optimierungen für parallele Diffusion MoE-Inferenz

平行扩散MOE推推推的堆积-堆积中心优化 2411.16786v3

Authors (7): Jiajun Luo, Lizhuo Luo, Jianru Xu, Jiajun Song, Rongwei Lu, Chen Tang, Zhi Wang

Mixture-of-Experts-based (MoE-based) diffusion models demonstrate remarkable scalability in high-fidelity image generation, yet their reliance on expert parallelism introduces critical communication bottlenecks. State-of-the-art methods alleviate such overhead in parallel diffusion inference through computation-communication overlapping, termed displaced parallelism. However, we identify that these techniques induce severe staleness, i.e., the usage of outdated activations from previous timesteps, which significantly degrades quality, especially in expert-parallel scenarios. We tackle this fundamental tension and propose DICE, a staleness-centric optimization framework with a three-fold approach: (1) Interweaved Parallelism introduces staggered pipelines, effectively halving step-level staleness for free; (2) Selective Synchronization operates at layer-level and protects layers vulnerable to stale activations; and (3) Conditional Communication, a token-level, training-free method that dynamically adjusts communication frequency based on token importance. Together, these strategies effectively reduce staleness, achieving 1.26x speedup with minimal quality degradation. Empirical results establish DICE as an effective and scalable solution. Our code is publicly available at https://github.com/Cobalt-27/DICE

然而,我们发现,这些技术导致严重螺旋*-使用以往时间步骤的过时激活,大大降低质量,特别是在专家-平行情景中。我们解决了这一基本紧张问题,并提出了DICE,这是一个具有三重方法的惯性中心优化框架:(1) 互织平行主义引入了错开的管道,将逐步递升的免费有效减半;(2) 选择性同步化在层层一级运作,保护易受压动的层层;(3) 传统通信,是一种象征性的、无培训的、无风险的方法,根据象征性的重要性对通信频率进行动态调整。这些战略共同有效地减少了粘合性,实现了1.26x速度,在最低质量方面实现了最低质量的降解。

Article 91

Title@2025-07-24 (4): A large-scale distributed parallel discrete event simulation engines based on Warped2 for Wargaming simulation

Title: A large-scale distributed parallel discrete event simulation engines based on Warped2 for Wargaming simulation

Eine großflächige verteilte parallele diskrete Event-Simulations-Engines basierend auf Warped2 für Wargaming-Simulation

以Wordped2为基础的大规模分布式平行离散事件模拟引擎,用于Wargaming模拟 2507.18050v1

Authors (5): Xiaoning Jia, Ruilin Kong, Guangya Si, Bilong Shen, Zhe Ji

Rising demand for complex simulations highlights conventional engines' scalability limits, spurring Parallel Discrete Event Simulation (PDES) adoption. Warped2, a PDES engine leveraging Time Warp synchronization with Pending Event Set optimization, delivers strong performance, but it struggles with inherent wargaming limitations: inefficient LP resource allocation during synchronization and unaddressed complex entity interaction patterns. To address these challenges, we present an optimized framework featuring four synergistic improvements: (1) asynchronous listener threads are introduced, instead of synchronous polling mechanisms, to address event monitoring latency in large-scale scenarios; (2) a METIS-based load rebalancing strategy is incorporated to address the issue of dynamic event allocation during real-world simulation; (3) an entity interaction solver with constraint satisfaction mechanisms is designed to mitigate state conflicts; and (4) a spatial hashing algorithm is used to overcome $O(n^2)$ complexity bottlenecks in large-scale nearest-neighbor searches. Experimental validation through a GridWorld demo demonstrates significant enhancements in temporal fidelity and computational efficiency. Benchmark results show our framework achieves 16x acceleration over baseline implementations and maintains 8x speedup over 1-thread configuration across MPI and Pthreads implementations. The combined load balancing and LP migration strategy reduces synchronization overhead by 58.18%, with load balancing accounting for 57% of the total improvement as the dominant optimization factor. These improvements provide an enhanced solution for PDES implementation in large-scale simulation scenarios.

对复杂模拟的需求不断上升,突显了常规引擎的伸缩限制,刺激了平行分解事件模拟(PDES)的采用。Warded2,一个利用时间变速同步与待决事件Set优化的PDES引擎,提供了强大的性能,它遇到了固有的扭曲性限制:在同步和未解决的复杂实体互动模式中,LP资源分配效率低下;为了应对这些挑战,我们提出了一个优化框架,其中包括四个协同改进:(1) 采用“同步式倾听器线”,以解决大规模情景中事件监测拉长的问题,而不是同步投票机制;(2) 采用基于MEDIS的负载重新平衡战略,以解决现实世界模拟中动态事件分配的问题;(3) 实体互动求解与约束性满意机制的内在扭曲性限制性限制性限制性限制性限制性限制性限制性限制性限制性限制性限制性制约性限制性限制性限制性限制性限制性限制性限制性限制性限制性限制性:在同步和未解决的复杂因素时,我们采用“同步世界”式听力器的实验性验证,显示时间准确性和计算效率方面的显著提高。基准结果显示我们的框架在基线执行方面加速了16x加速,在现实模拟模拟中保持8x速度上速度分配,在1至18的升级的升级的进度上,使PMLSimalimalimalimalimalimalimalimalimalimalimalimalimalimalimalimalimalimalis AS使整个执行中,在518的升级的升级的升级的升级了5的升级了整个的升级。
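Spatial hashing avoids comparing every entity with every other entity by bucketing positions into grid cells and searching only nearby cells. The sketch below shows the generic 2D technique. It is not taken from the paper's engine, and the cell size, the function names, and the restriction that the search radius must not exceed the cell size are assumptions of this illustration.

```python
from collections import defaultdict
from math import floor, hypot


def build_grid(points, cell):
    """Hash each 2D point index into the grid cell that contains it."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(floor(x / cell), floor(y / cell))].append(idx)
    return grid


def neighbors_within(points, grid, cell, query, radius):
    """Indices of points within `radius` of `query`.

    Requires radius <= cell so that the 3x3 block of cells around the
    query is guaranteed to contain every candidate.
    """
    qx, qy = query
    cx, cy = floor(qx / cell), floor(qy / cell)
    found = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for idx in grid.get((cx + dx, cy + dy), ()):
                px, py = points[idx]
                if hypot(px - qx, py - qy) <= radius:
                    found.append(idx)
    return sorted(found)
```

With entities spread roughly evenly, each query inspects only a handful of cells, so a full neighbor pass costs close to linear time instead of quadratic.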

Article 92

Title@2025-07-24 (4): FCPO: Federated Continual Policy Optimization for Real-Time High-Throughput Edge Video Analytics

Title: FCPO: Federated Continual Policy Optimization for Real-Time High-Throughput Edge Video Analytics

FCPO: Federated Continual Policy Optimization for Real-Time High-Throughput Edge Video Analytics

FCPO:实时高水压高压边缘实时视频分析分析的联邦持续政策优化 2507.18047v1

Authors (3): Lucas Liebe, Thanh-Tung Nguyen, Dongman Lee

The growing complexity of Edge Video Analytics (EVA) facilitates new kinds of intelligent applications, but creates challenges in real-time inference serving systems. State-of-the-art (SOTA) scheduling systems optimize global workload distributions for heterogeneous devices but often suffer from extended scheduling cycles, leading to sub-optimal processing in rapidly changing Edge environments. Local Reinforcement Learning (RL) enables quick adjustments between cycles but faces scalability, knowledge integration, and adaptability issues. Thus, we propose FCPO, which combines Continual RL (CRL) with Federated RL (FRL) to address these challenges. This integration dynamically adjusts inference batch sizes, input resolutions, and multi-threading during pre- and post-processing. CRL allows agents to learn from changing Markov Decision Processes, capturing dynamic environmental variations, while FRL improves generalization and convergence speed by integrating experiences across inference models. FCPO combines these via an agent-specific aggregation scheme and a diversity-aware experience buffer. Experiments on a real-world EVA testbed showed over 5 times improvement in effective throughput, 60% reduced latency, and 20% faster convergence with up to 10 times less memory consumption compared to SOTA RL-based approaches.

电磁视频分析(EVA)日益复杂,有利于新型智能应用,但在实时推断服务系统方面造成挑战。最新技术(SOTA)列表系统优化了不同设备的全球工作量分布,但往往受到延长的排期周期的影响,导致在快速变化的边缘环境中进行亚最佳处理。地方强化学习(RL)使周期之间能够快速调整,但面临可缩放性、知识整合和适应性问题。因此,我们提议FCPO将连续RL(CRL)与Freed RL(FRL)相结合,以应对这些挑战。这种整合动态调整动态地调整了不同设备的全球工作量分布、输入分辨率以及处理前和处理后多读的批次。CRL允许代理商学习改变Markov决定过程,捕捉动态环境变化,同时通过综合各种推论模型改进一般化和趋同速度。FCPOI将这些组合起来,通过一个基于代理人的特定组合计划和一个多样性的缓冲经验。在现实世界的EVA测试台上进行的实验显示在5倍以上的推算方法上取得了更快的一致,在10个时期将20 %的记忆水平与10比水平上作了缩短。

Article 93

Title@2025-07-24 (4): PPFPL: Cross-silo Privacy-preserving Federated Prototype Learning Against Data Poisoning Attacks on Non-IID Data

Title: PPFPL: Cross-silo Privacy-preserving Federated Prototype Learning Against Data Poisoning Attacks on Non-IID Data

PPFPL: Cross-silo Datenschutz-erhaltendes Federated Prototype Learning gegen Datenvergiftung Angriffe auf nicht-ID-Daten

PPFPL: 跨硅隐私保护联邦原型学习,反对对非IID数据进行数据中毒攻击 2504.03173v4

Authors (8): Hongliang Zhang, Jiguo Yu, Fenghua Xu, Chunqiang Hu, Yongzhao Zhang, Xiaofen Wang, Zhongyuan Yu, Xiaosong Zhang

Privacy-Preserving Federated Learning (PPFL) allows multiple clients to collaboratively train a deep learning model by submitting hidden model updates. Nonetheless, PPFL is vulnerable to data poisoning attacks due to the distributed training nature of clients. Existing solutions have struggled to improve the performance of cross-silo PPFL in poisoned Non-IID data. To address the issues, this paper proposes a privacy-preserving federated prototype learning framework, named PPFPL, which enhances the cross-silo FL performance in poisoned Non-IID data while effectively resisting data poisoning attacks. Specifically, we adopt prototypes as client-submitted model updates to eliminate the impact of tampered data distribution on federated learning. Moreover, we utilize two servers to achieve Byzantine-robust aggregation by secure aggregation protocol, which greatly reduces the impact of malicious clients. Theoretical analyses confirm the convergence of PPFPL, and experimental results on publicly available datasets show that PPFPL is effective for resisting data poisoning attacks with Non-IID conditions.

保护隐私-联邦学习(PPFL)允许多个客户通过提交隐藏的模型更新来合作培训深层次学习模式。然而,由于客户的培训性质分布,PPFL很容易受到数据中毒袭击。现有的解决方案在有毒的非二维数据中努力改进跨硅 PPFL的性能。为了解决问题,本文件提议了一个名为PPPFPL的隐私保护联邦原型学习框架(PPPFPL),这个框架在有效抵制数据中毒袭击的同时,提高了有毒非二维数据数据的跨硅FL性能。具体地说,我们采用原型作为客户提交的模型更新,以消除被篡改的数据传播对联邦化学习的影响。此外,我们利用两个服务器通过安全集成协议实现Byzantine-robust的聚合,这大大降低了恶意客户的影响。理论分析证实了PPFPLPL的趋同,以及公开可得数据集的实验结果显示,PPFPPFPPL对抵制数据中毒袭击与非二维D条件有效。

Article 94

Title@2025-07-24 (4): Cloud Native System for LLM Inference Serving

Title: Cloud Native System for LLM Inference Serving

Cloud Native System für LLM Inferenz Serving

LLM 推断服务云原系统 2507.18007v1

Authors (6): Minxian Xu, Junhan Liao, Jingfeng Wu, Yiyuan He, Kejiang Ye, Chengzhong Xu

Large Language Models (LLMs) are revolutionizing numerous industries, but their substantial computational demands create challenges for efficient deployment, particularly in cloud environments. Traditional approaches to inference serving often struggle with resource inefficiencies, leading to high operational costs, latency issues, and limited scalability. This article explores how Cloud Native technologies, such as containerization, microservices, and dynamic scheduling, can fundamentally improve LLM inference serving. By leveraging these technologies, we demonstrate how a Cloud Native system enables more efficient resource allocation, reduces latency, and enhances throughput in high-demand scenarios. Through real-world evaluations using Kubernetes-based autoscaling, we show that Cloud Native architectures can dynamically adapt to workload fluctuations, mitigating performance bottlenecks while optimizing LLM inference serving performance. This discussion provides a broader perspective on how Cloud Native frameworks could reshape the future of scalable LLM inference serving, offering key insights for researchers, practitioners, and industry leaders in cloud computing and artificial intelligence.

大型语言模型(LLM)正在变革众多行业,但其巨大的计算需求给高效部署带来了挑战,在云环境中尤其如此。传统的推理服务方法常常受困于资源利用效率低下,导致运营成本高、延迟问题突出、可扩展性有限。本文探讨容器化、微服务和动态调度等云原生技术如何从根本上改进LLM推理服务。通过利用这些技术,我们展示了云原生系统如何在高需求场景下实现更高效的资源分配、降低延迟并提高吞吐量。通过使用基于Kubernetes的自动伸缩进行真实环境评估,我们表明云原生架构能够动态适应工作负载波动,在缓解性能瓶颈的同时优化LLM推理服务性能。本讨论从更广阔的视角阐述了云原生框架将如何重塑可扩展LLM推理服务的未来,为云计算和人工智能领域的研究人员、从业者和行业领袖提供了关键见解。
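
The abstract mentions Kubernetes-based autoscaling without saying which policy is used. As an illustration only, the sketch below applies the replica formula documented for the Kubernetes Horizontal Pod Autoscaler, `desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue)`. Treating the paper's system as HPA-like is an assumption. The function name, the bounds, and the example metric values are hypothetical, and the sketch ignores the tolerance band and stabilization window of the real controller.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Replica count from the documented HPA ratio formula.

    Assumption: one metric (for example, average GPU utilization per pod)
    drives scaling. min_replicas and max_replicas mirror the HPA
    minReplicas and maxReplicas bounds.
    """
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, raw))


# Load spike: 4 replicas at 90% utilization against a 60% target.
scale_up = desired_replicas(4, 90, 60)
# Quiet period: 4 replicas at 30% utilization against a 60% target.
scale_down = desired_replicas(4, 30, 60)
```

With these inputs the first call scales out to 6 replicas and the second scales in to 2, which is the kind of adaptation to workload fluctuation the abstract describes.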

Article 95

Title@2025-07-24 (4): Unlock the Potential of Fine-grained LLM Serving via Dynamic Module Scaling

Title: Unlock the Potential of Fine-grained LLM Serving via Dynamic Module Scaling

Entsperren Sie das Potenzial des feinkörnigen LLM Servierens über Dynamic Module Scaling

通过动态模块伸缩释放细粒度LLM服务的潜力 2507.18006v1

Authors (6): Jingfeng Wu, Yiyuan He, Minxian Xu, Xitong Gao, Kejiang Ye, Chengzhong Xu

The rise of large language models (LLMs) has created new opportunities across various fields but has also introduced significant challenges in resource management. Current LLM serving systems face a fundamental tension: balancing serving demands with limited resources while adapting to unpredictable traffic patterns. Static deployments lead to suboptimal resource utilization and performance degradation under dynamic workloads. Furthermore, the high cost of adjusting instances hinders dynamic scaling, limiting the true potential of efficient LLM serving. To address this, we propose CoCoServe, an elastic system that facilitates dynamic and fine-grained scaling. Its key innovation lies in the module-level operations for the replication and migration of LLM modules, such as decoder layers and projections. Through a comprehensive analysis of the trade-offs associated with these operations, we develop an auto-scaling mechanism that dynamically regulates module-level resource allocation and performance optimization, enabling a more cost-effective deployment of LLMs. Our evaluation demonstrates that the scaling operations employed by CoCoServe exhibit excellent scalability and can reduce costs by 46% while maintaining availability. Compared to state-of-the-art LLM serving systems (e.g., Hugging Face Transformers and vLLM), our approach reduces latency by 14%-75% and achieves 1.16x-4x throughput on average across different model sizes and workloads.

大型语言模型(LLM)的兴起为各个领域创造了新的机遇,但也给资源管理带来了重大挑战。当前的LLM服务系统面临一个根本性矛盾:既要在有限资源下满足服务需求,又要适应不可预测的流量模式。静态部署在动态工作负载下会导致资源利用率欠佳和性能下降。此外,调整实例的高昂成本阻碍了动态伸缩,限制了高效LLM服务的真正潜力。为此,我们提出CoCoServe,一个支持动态、细粒度伸缩的弹性系统。其关键创新在于针对LLM模块(如解码器层和投影)的复制与迁移的模块级操作。通过全面分析这些操作所涉及的权衡,我们开发了一种自动伸缩机制,动态调节模块级的资源分配和性能优化,使LLM的部署更具成本效益。我们的评估表明,CoCoServe采用的伸缩操作具有出色的可扩展性,可在保持可用性的同时将成本降低46%。与最先进的LLM服务系统(例如Hugging Face Transformers和vLLM)相比,我们的方法将延迟降低14%-75%,并在不同模型规模和工作负载下平均实现1.16倍至4倍的吞吐量。

Article 96

Title@2025-07-24 (4): C-Koordinator: Interference-aware Management for Large-scale and Co-located Microservice Clusters

Title: C-Koordinator: Interference-aware Management for Large-scale and Co-located Microservice Clusters

C-Koordinator: Interference-aware Management für großräumige und Co-Location-Mikroservice-Cluster

C-Koordinator:面向大规模混合部署微服务集群的干扰感知管理 2507.18005v1

Authors (8): Shengye Song, Minxian Xu, Zuowei Zhang, Chengxi Gao, Fansong Zeng, Yu Ding, Kejiang Ye, Chengzhong Xu

Microservices transform traditional monolithic applications into lightweight, loosely coupled application components and have been widely adopted in many enterprises. Cloud platform infrastructure providers enhance the resource utilization efficiency of microservice systems by co-locating different microservices. However, this approach also introduces resource competition and interference among microservices. Designing interference-aware strategies for large-scale, co-located microservice clusters is crucial for enhancing resource utilization and mitigating competition-induced interference. These challenges are further exacerbated by unreliable metrics, application diversity, and node heterogeneity. In this paper, we first analyze the characteristics of large-scale, co-located microservice clusters at Alibaba and then discuss why cycles per instruction (CPI) is adopted as a metric for interference measurement in large-scale production clusters, as well as how to predict CPI accurately from multi-dimensional metrics. Based on CPI interference prediction and analysis, we also present the design of the C-Koordinator platform, an open-source solution used in Alibaba clusters, which incorporates co-location and interference mitigation strategies. The interference prediction models consistently achieve over 90.3% accuracy, enabling precise prediction and rapid mitigation of interference in operational environments. As a result, application response time (RT) is reduced and stabilized across all percentiles (P50, P90, P99), with improvements ranging from 16.7% to 36.1% under various system loads compared with a state-of-the-art system. These results demonstrate the system’s ability to maintain smooth application performance in co-located environments.

微服务将传统的单体应用转变为轻量级、松耦合的应用组件,已被许多企业广泛采用。云平台基础设施提供商通过混合部署不同的微服务来提高微服务系统的资源利用效率。然而,这种做法也带来了微服务之间的资源竞争和干扰。为大规模混合部署的微服务集群设计干扰感知策略,对于提高资源利用率、缓解竞争引起的干扰至关重要。不可靠的指标、应用多样性和节点异构性使这些挑战进一步加剧。本文首先分析了阿里巴巴大规模混合部署微服务集群的特征,进而讨论了为何采用每指令周期数(CPI)作为大规模生产集群中干扰度量的指标,以及如何通过多维指标实现对CPI的准确预测。基于CPI干扰预测与分析,我们还介绍了C-Koordinator平台的设计,这是一个在阿里巴巴集群中使用的开源解决方案,集成了混合部署和干扰缓解策略。干扰预测模型的准确率始终保持在90.3%以上,能够在运行环境中精确预测并快速缓解干扰。因此,应用响应时间(RT)在所有百分位(P50、P90、P99)上均有所降低并趋于稳定,与最先进的系统相比,在不同系统负载下取得了16.7%至36.1%的改进。这些结果表明该系统能够在混合部署环境中保持应用性能平稳。
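
For readers unfamiliar with the metric, cycles per instruction is the ratio of elapsed CPU cycles to retired instructions over a sampling window. A rising CPI for the same workload suggests the CPU is stalling more often, for example on shared caches or memory bandwidth contended by co-located services. The sketch below is not the paper's prediction model, which the abstract says uses multi-dimensional metrics. It only shows the ratio and a hypothetical fixed-threshold check; the function names, the `tolerance` parameter, and the counter values are illustrative assumptions.

```python
def cycles_per_instruction(cycles, instructions):
    """CPI over one sampling window, from two hardware counter readings."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return cycles / instructions


def is_interfered(cpi, baseline_cpi, tolerance=0.2):
    """Hypothetical rule: flag interference when CPI exceeds the
    baseline by more than `tolerance` (20% by default)."""
    return cpi > baseline_cpi * (1 + tolerance)


# Made-up counter readings for the same service, alone and co-located.
baseline = cycles_per_instruction(1_000_000, 800_000)
colocated = cycles_per_instruction(1_500_000, 800_000)
flagged = is_interfered(colocated, baseline)
```

Here the baseline CPI is 1.25 and the co-located CPI is 1.875, so the check flags the co-located window. A production system would replace the fixed threshold with a learned predictor, as the abstract indicates.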

Article 0

Title@2025-07-31 (4): The ArborX library: version 2.0

Article 1

Title@2025-07-31 (4): Satellite Federated Fine-Tuning for Foundation Models in Space Computing Power Networks

Article 2

Title@2025-07-31 (4): Parallel Split Learning with Global Sampling

Article 3

Title@2025-07-31 (4): Beyond Optimal Fault Tolerance

Article 4

Title@2025-07-31 (4): Consistent Point Matching

Article 5

Title@2025-07-31 (4): Threshold-Driven Streaming Graph: Expansion and Rumor Spreading

Article 6

Title@2025-07-31 (4): Towards Serverless Processing of Spatiotemporal Big Data Queries

Article 7

Title@2025-07-31 (4): Scalable contribution bounding to achieve privacy

Article 8

Title@2025-07-31 (4): Towards a Testbed for Scalable FaaS Platforms

Article 9

Title@2025-07-31 (4): Minos: Exploiting Cloud Performance Variation with Function-as-a-Service Instance Selection

Article 10

Title@2025-07-31 (4): H2SGEMM: Emulating FP32 GEMM on Ascend NPUs using FP16 Units with Precision Recovery and Cache-Aware Optimization

Article 11

Title@2025-07-31 (4): A Simple $(1-ε)$-Approximation Semi-Streaming Algorithm for Maximum (Weighted) Matching

Article 12

Title@2025-07-30 (3): GALE: Leveraging Heterogeneous Systems for Efficient Unstructured Mesh Data Analysis

Article 13

Title@2025-07-30 (3): Data Readiness for Scientific AI at Scale

Article 14

Title@2025-07-30 (3): Towards Federated Learning with On-device Training and Communication in 8-bit Floating Point

Article 15

Title@2025-07-30 (3): DSPE: Profit Maximization in Edge-Cloud Storage System using Dynamic Space Partitioning with Erasure Code

Article 16

Title@2025-07-30 (3): A Survey on Large Language Model Acceleration based on KV Cache Management

Article 17

Title@2025-07-30 (3): Leveraging Caliper and Benchpark to Analyze MPI Communication Patterns: Insights from AMG2023, Kripke, and Laghos

Article 18

Title@2025-07-30 (3): PS-WL: A Probability-Sensitive Wear Leveling scheme for SSD array scaling

Article 19

Title@2025-07-30 (3): Understanding Power and Energy Utilization in Large Scale Production Physics Simulation Codes

Article 20

Title@2025-07-30 (3): A Semi-Supervised Federated Learning Framework with Hierarchical Clustering Aggregation for Heterogeneous Satellite Networks

Article 21

Title@2025-07-30 (3): Hypernetworks for Model-Heterogeneous Personalized Federated Learning

Article 22

Title@2025-07-30 (3): SP-Chain: Boosting Intra-Shard and Cross-Shard Security and Performance in Blockchain Sharding

Article 23

Title@2025-07-30 (3): Towards Experiment Execution in Support of Community Benchmark Workflows for HPC

Article 24

Title@2025-07-29 (2): Minimizing CGYRO HPC Communication Costs in Ensembles with XGYRO by Sharing the Collisional Constant Tensor Structure

Article 25

Title@2025-07-29 (2): AgileDART: An Agile and Scalable Edge Stream Processing Engine

Article 26

Title@2025-07-29 (2): OpenRASE: Service Function Chain Emulation

Article 27

Title@2025-07-29 (2): Large-Scale Linear Energy System Optimization: A Systematic Review on Parallelization Strategies via Decomposition

Article 28

Title@2025-07-29 (2): The Performance of Low-Synchronization Variants of Reorthogonalized Block Classical Gram–Schmidt

Article 29

Title@2025-07-29 (2): Evaluating the Impact Of Spatial Features Of Mobility Data and Index Choice On Database Performance

Article 30

Title@2025-07-29 (2): Quantize Once, Train Fast: Allreduce-Compatible Compression with Provable Guarantees

Article 31

Title@2025-07-29 (2): Ethereum Conflicts Graphed

Article 32

Title@2025-07-29 (2): A Massively Parallel Performance Portable Free-space Spectral Poisson Solver

Article 33

Title@2025-07-29 (2): Collaborative State Machines: A Better Programming Model for the Cloud-Edge-IoT Continuum

Article 34

Title@2025-07-29 (2): Accelerating Stable Matching between Workers and Spatial-Temporal Tasks for Dynamic MCS: A Stagewise Service Trading Approach

Article 35

Title@2025-07-29 (2): Bridging Cache-Friendliness and Concurrency: A Locality-Optimized In-Memory B-Skiplist

Article 36

Title@2025-07-29 (2): GlideinBenchmark: collecting resource information to optimize provisioning

Article 37

Title@2025-07-29 (2): Using Containers to Speed Up Development, to Run Integration Tests and to Teach About Distributed Systems

Article 38

Title@2025-07-29 (2): InfiniteHBD: Building Datacenter-Scale High-Bandwidth Domain for LLM with Optical Circuit Switching Transceivers

Article 39

Title@2025-07-28 (1): FedStrategist: A Meta-Learning Framework for Adaptive and Robust Aggregation in Federated Learning