cs.SE @ 2025-06-27: 140
-
00 06-26 (4) Large Language Model-Powered Agent for C to Rust Code Translation Large Language Model-Powered Agent für C to Rust Code Übersetzung C至Rust 代码翻译的大型语言示范授权代理 2505.15858v2 -
01 06-26 Anonymized Network Sensing Graph Challenge Anonymisierte Network Sensing Graph Challenge 匿名网络遥感图图挑战 2409.08115v2 -
02 06-26 IXAII: An Interactive Explainable Artificial Intelligence Interface for Decision Support Systems IXAII: Interactive Explainable Artificial Intelligence Interface für Entscheidungsunterstützungssysteme IXAII:决策支持系统互动解释人工智能接口 2506.21310v1 -
03 06-26 An object-centric core metamodel for IoT-enhanced event logs Ein objektzentriertes Kernmetamodell für IoT-verstärkte Ereignisprotokolle IoT 强化事件日志的以物体为中心的核心元元模型 2506.21300v1 -
04 06-26 Exploring Micro Frontends: A Case Study Application in E-Commerce Erforschung von Micro Frontends: Eine Anwendungsfallstudie im E-Commerce 探索微观前沿:电子商务案例研究应用 2506.21297v1 -
05 06-26 KOALA: a Configurable Tool for Collecting IDE Data When Solving Programming Tasks KOALA: Ein konfigurierbares Tool zum Sammeln von IDE-Daten beim Lösen von Programmieraufgaben KOALA: 在解决方案拟订任务时收集 IDE 数据的配置工具 2506.21266v1 -
06 06-26 $T^3$: Multi-level Tree-based Automatic Program Repair with Large Language Models $T^3$: Mehrstufige Baum-basierte automatische Programm-Reparatur mit großen Sprachmodellen $T$3美元:使用大语言模型进行多层次基于树的自动方案维修 2506.21211v1 -
07 06-26 Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks Beibehaltung des MTEB: Langfristige Nutzungsfähigkeit und Reproduzierbarkeit von Einbettungs-Benchmarks 维持MDEB:实现长期使用和可复制嵌入基准 2506.21182v1 -
08 06-26 How Good Are Synthetic Requirements ? Evaluating LLM-Generated Datasets for AI4RE Wie gut sind synthetische Anforderungen ? Bewertung von LLM-generierten Datensätzen für AI4RE 合成要求如何好? 评价AI4RE的LLM-发光数据集 2506.21138v1 -
09 06-26 SceneGenAgent: Precise Industrial Scene Generation with Coding Agent SceneGenAgent: Präzise industrielle Szenegenerierung mit Coding Agent SceneGenerAgenti: 精密工业场景与编码剂生成 2410.21909v3 -
10 06-26 Boosting Vulnerability Detection with Inter-function Multilateral Association Insights Förderung der Erkennung von Schwachstellen durch multilaterale Integrations-Insights zwischen den Funktionen 与职能间多边协会透视促进脆弱性探测 2506.21014v1 -
11 06-26 ToolScan: A Benchmark for Characterizing Errors in Tool-Use LLMs ToolScan: Ein Benchmark für die Charakterisierung von Fehlern in Tool-Use LLMs 工具扫描:工具使用 LLM 错误识别基准 2411.13547v2 -
12 06-25 (3) Complex Model Transformations by Reinforcement Learning with Uncertain Human Guidance Komplexe Modelltransformationen durch verstärktes Lernen mit unsicherer menschlicher Führung 以不确定的人类指导加强学习的复杂模式转变 2506.20883v1 -
13 06-25 Engineering RAG Systems for Real-World Applications: Design, Development, and Evaluation Engineering RAG-Systeme für Real-World-Anwendungen: Design, Entwicklung und Evaluation RAG 现实世界应用工程系统:设计、开发和评价 2506.20869v1 -
14 06-25 Generating Reliable Adverse event Profiles for Health through Automated Integrated Data (GRAPH-AID): A Semi-Automated Ontology Building Approach Erzeugen von zuverlässigen unerwünschten Ereignisprofilen für die Gesundheit durch automatisierte integrierte Daten (GRAPH-AID): Ein semi-automatisierter Ontologie-Bauansatz 通过自动综合数据生成可靠的有害健康事件简介:半自动本体学构建方法 2506.20851v1 -
15 06-25 GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization GPU-Kernel-Wissenschaftler: Ein LLM-getriebenes Framework für iterative Kernel-Optimierung GPU 核心科学家:循环核心优化LLM-驱动框架 2506.20807v1 -
16 06-25 Agile Management for Machine Learning: A Systematic Mapping Study Agiles Management für maschinelles Lernen: Eine systematische Mapping-Studie 机器学习管理:系统绘图研究 2506.20759v1 -
17 06-25 Domain Knowledge in Requirements Engineering: A Systematic Mapping Study Domain Knowledge in Requirements Engineering: Eine systematische Mapping-Studie 要求工程领域知识:系统绘图研究 2506.20754v1 -
18 06-25 Define-ML: An Approach to Ideate Machine Learning-Enabled Systems Define-ML: Ein Ansatz zur Idee von maschinellen Lernsystemen 定义-ML:设计机器学习-可操作系统的方法 2506.20621v1 -
19 06-25 Integrating Various Software Artifacts for Better LLM-based Bug Localization and Program Repair Integration verschiedener Software-Artefakte für bessere LLM-basierte Fehlerlokalisierung und Programmreparatur 整合各种软件人工操作,以更好地使用LLM为主的错误定位和方案维修 2412.03905v3 -
20 06-25 Adaptive Request Scheduling for CodeLLM Serving with SLA Guarantees Adaptive Anforderungsplanung für CodeLLM Serving mit SLA-Garantien 在苏丹解放军保障下服务CCLLM服务的适应性请求日程安排 2506.19677v2 -
21 06-25 CCISolver: End-to-End Detection and Repair of Method-Level Code-Comment Inconsistency CCISolver: End-to-End-Erkennung und Reparatur von Methoden-Level Code-Comment-Unstimmigkeit CISClolver: 最终到最后检测和修理方法编码水平-不一致情况Comment 2506.20558v1 -
22 06-25 Large Language Model-Driven Code Compliance Checking in Building Information Modeling Large Language Model-Driven Code Compliance Checking in Building Information Modeling 在建筑信息建模中检查大型语文示范版本编码合规情况 2506.20551v1 -
23 06-25 ReCode: Updating Code API Knowledge with Reinforcement Learning ReCode: Aktualisierung von Code-API-Kenntnissen mit Verstärkungslernen ReCode:更新法规API知识与强化学习 2506.20495v1 -
24 06-25 MARCO: Multi-Agent Code Optimization with Real-Time Knowledge Integration for High-Performance Computing MARCO: Multi-Agent Code-Optimierung mit Echtzeit-Knowledge Integration für High-Performance Computing MARCO: 利用实时知识整合优化多机构代码,促进高绩效计算 2505.03906v3 -
25 06-25 Smart Cuts: Enhance Active Learning for Vulnerability Detection by Pruning Bad Seeds Smart Cuts: Erweitern Sie aktives Lernen für die Erkennung von Gefährlichkeit durch Beschneiden von schlechten Samen 智能剪裁:加强积极学习,通过粗鲁坏种子发现脆弱性 2506.20444v1 -
26 06-25 The Composition of Digital Twins for Systems-of-Systems: a Systematic Literature Review Die Zusammensetzung von digitalen Zwillingen für Systemsysteme: ein Systematischer Literaturbericht 系统系统数字双对的构成:系统文献审查 2506.20435v1 -
27 06-25 VulStamp: Vulnerability Assessment using Large Language Model VulStamp: Sicherheitsbewertung mit großem Sprachmodell VulStamp:使用大语言模式进行脆弱性评估 2506.11484v2 -
28 06-25 Lifting the Veil on Composition, Risks, and Mitigations of the Large Language Model Supply Chain Heben des Veils über Zusammensetzung, Risiken und Minderungen der Large Language Model Supply Chain 提高关于大语言示范供应链的组成、风险和缓解的《标准》 2410.21218v3 -
29 06-25 Ten simple rules for PIs to integrate Research Software Engineering into their research group Zehn einfache Regeln für PIs zur Integration von Research Software Engineering in ihre Forschungsgruppe 十条简单规则,供各研究所将研究软件工程纳入其研究组 2506.20217v1 -
30 06-25 Research Artifacts in Secondary Studies: A Systematic Mapping in Software Engineering Forschungs-Artefakte in Sekundärstudien: Ein systematisches Mapping in der Software-Engineering 中等研究中的研究异异物研究:软件工程系统绘图 2504.12646v2 -
31 06-25 Zero-Shot Attribution for Large Language Models: A Distribution Testing Approach Zero-Shot Attribution für große Sprachmodelle: Ein Distributionstestverfahren 大语言模式零点位数:分销测试方法 2506.20197v1 -
32 06-25 AI and Agile Software Development: From Frustration to Success – XP2025 Workshop Summary KI und Agile Software-Entwicklung: Von der Frustration zum Erfolg – XP2025 Workshop Zusammenfassung AI和Alile软件开发:从挫折到成功 – – XP2025讲习班摘要 2506.20159v1 -
33 06-24 (2) When Domains Collide: An Activity Theory Exploration of Cross-Disciplinary Collaboration When Domains Collide: Eine Aktivitätstheorie zur Erforschung der disziplinübergreifenden Zusammenarbeit 当域碰撞:跨纪律协作活动理论探索时 2506.20063v1 -
34 06-24 QHackBench: Benchmarking Large Language Models for Quantum Code Generation Using PennyLane Hackathon Challenges QHackBench: Benchmarking großer Sprachmodelle für die Quantencode-Generation mit PennyLane Hackathon-Herausforderungen QHackBench:利用PennyLane Hackathon挑战为量制代码生成量设定大语言模式基准 2506.20008v1 -
35 06-24 Can Language Models Replace Programmers for Coding? REPOCOD Says ‘Not Yet’ Können Sprachmodelle Programmierer für Coding ersetzen? REPOCOD sagt ‘Noch nicht’ 语言模式能替换编码程序程序员吗? REPOCOD 说“ 还没有” 。 2410.21647v4 -
36 06-24 WAFFLE: Finetuning Multi-Modal Model for Automated Front-End Development WAFFLE: Feinsteuerungs-Multi-Modal-Modell für automatisierte Front-End-Entwicklung WAFFLE: 自动前端开发的微调多模式模型 2410.18362v2 -
37 06-24 An Empirical Investigation on the Challenges in Scientific Workflow Systems Development Eine empirische Untersuchung der Herausforderungen in der Entwicklung wissenschaftlicher Workflowsysteme 关于科学工作流程系统开发挑战的经验调查 2411.10890v2 -
38 06-24 Exploring Developer Experience Factors in Software Ecosystems Erforschen von Entwickler-Erfahrungsfaktoren in Software-Ökosystemen 探索软件生态系统中开发者经验因素 2506.19757v1 -
39 06-24 Simulating the Waterfall Model: A Systematic Review Simulation des Wasserfallmodells: Eine systematische Überprüfung 模拟瀑瀑瀑模型:系统审查 2506.19653v1 -
40 06-24 A Verification Methodology for Safety Assurance of Robotic Autonomous Systems Eine Verifikationsmethodik für die Sicherheit von Roboter autonomen Systemen 机器人自主系统安全保证核查方法 2506.19622v1 -
41 06-24 Probabilistic modelling and safety assurance of an agriculture robot providing light-treatment Probabilistische Modellierung und Sicherheitsgarantie eines landwirtschaftlichen Roboters zur Lichtbehandlung 提供轻处理的农业机器人的概率建模和安全保障 2506.19620v1 -
42 06-24 Can LLMs Replace Humans During Code Chunking? Können LLMs Menschen beim Code-Chunking ersetzen? LLMs能否在代码启动时替换人类? 2506.19897v1 -
43 06-24 Lost in Translation? Converting RegExes for Log Parsing into Dynatrace Pattern Language Verloren in Übersetzung? Umwandlung von RegExes für Log Parsing in Dynatrace Pattern Language 丢失于翻译中 ? 将日志解析的 RegExs 转换为同步模式语言 2506.19539v1 -
44 06-24 Integrating Pair Programming as a Work Practice Integration der Pair-Programmierung als Arbeitspraxis 将 “ 平等规划 “ 纳入工作实践 2506.19511v1 -
45 06-24 LLM-based Multi-Agent System for Intelligent Refactoring of Haskell Code LLM-basiertes Multi-Agent-System zur intelligenten Refactoring von Haskell-Code 以LLM为基础的哈斯凯尔码智能再构要素多代理商系统 2506.19481v1 -
46 06-24 What Makes the Best Decomposition? Investigating Binary Decomposition Under FCG Variance Was macht die beste Zersetzung? Untersuchung der binären Zersetzung unter FCG Variance 根据FCG差异调查二进分解 2506.19425v1 -
47 06-24 Online Discovery of Simulation Models for Evolving Business Processes (Extended Version) Online Discovery of Simulation Models for Evolving Business Processes (Erweiterte Version) 不断演变的业务流程模拟模型在线发现(扩展版) 2506.10049v2 -
48 06-24 High-Performance ARM-on-ARM Virtualization for Multicore SystemC-TLM-Based Virtual Platforms Leistungsstarke ARM-on-ARM-Virtualisierung für Multicore-SystemC-TLM-basierte virtuelle Plattformen 以多核心系统C-TLM为基础的虚拟平台的ARM在亚美尼亚国内的虚拟化 2505.12987v2 -
49 06-24 VFArchē: A Dual-Mode Framework for Locating Vulnerable Functions in Open-Source Software VFArchē: Ein Dual-Mode-Framework für die Suche nach gefährdeten Funktionen in Open-Source-Software VFFARCHZ:在开放源码软件中确定脆弱功能的双模式框架 2506.18050v2 -
50 06-24 MCP-Zero: Active Tool Discovery for Autonomous LLM Agents MCP-Zero: Active Tool Discovery für autonome LLM-Agenten MCP-零:为自动LLM代理商提供主动工具发现工具 2506.01056v4 -
51 06-24 MNN-AECS: Energy Optimization for LLM Decoding on Mobile Devices via Adaptive Core Selection MNN-AECS: Energieoptimierung für die LLM-Dekodierung auf mobilen Geräten über adaptive Core Selection MNN-AN-ANECS:通过适应核心选择在移动设备上添加LLM的能量优化 2506.19884v1 -
52 06-24 Generating and Understanding Tests via Path-Aware Symbolic Execution with LLMs Erzeugen und Verstehen von Tests über path-aware Symbolische Ausführung mit LLMs 通过使用LLMM 进行路径-意识符号执行生成和理解测试 2506.19287v1 -
53 06-24 DynNPC: Finding More Violations Induced by ADS in Simulation Testing through Dynamic NPC Behavior Generation DynNPC: Weitere Verletzungen durch ADS bei Simulationstests durch dynamische NPC-Behavior-Generierung DynNPC:通过动态NPC行为一代在模拟测试中发现ADS诱导的更多违规行为 2411.19567v2 -
54 06-24 GroupTuner: Efficient Group-Aware Compiler Auto-Tuning GroupTuner: Efficient Group-Aware Compiler Auto-Tuning GroupTuner: 高效的 Group- Awar 软件编辑器自动调试 2505.08598v2 -
55 06-24 Breaking Single-Tester Limits: Multi-Agent LLMs for Multi-User Feature Testing Breaking Single-Tester Limits: Multi-Agent LLMs für Multi-User Feature Testing 打破单一试验者限制:多用户功能测试的多代理机构LLMs 2506.17539v2 -
56 06-23 (1) Dataset of Yul Contracts to Support Solidity Compiler Research Datensatz von Yul-Verträgen zur Unterstützung der Solidity Compiler-Forschung 支持固体汇编者研究的Yul合同数据集 2506.19153v1 -
57 06-23 Framework for On the Fly Input Refinement for Deep Learning Models Framework for On the Fly Input Raffinement for Deep Learning Models 深学习模式 Fly 投入改进框架框架 2502.05456v2 -
58 06-23 cuVSLAM: CUDA accelerated visual odometry and mapping cuVSLAM: CUDA beschleunigte visuelle Odometrie und Mapping CUDA 加速视觉测量和绘图 2506.04359v2 -
59 06-23 Code Graph Model (CGM): A Graph-Integrated Large Language Model for Repository-Level Software Engineering Tasks Code Graph Model (CGM): Ein Graph-integriertes Large Language Model für Repository-Level Software Engineering Aufgaben 代码图表模型(CGM):存储层软件工程任务 2505.16901v4 -
60 06-23 Black-Box Test Code Fault Localization Driven by Large Language Models and Execution Estimation Black-Box Test Code Fehler Lokalisierung angetrieben durch große Sprachmodelle und Ausführung Schätzung 由大语言模型和执行估计驱动的黑牛测试代码 2506.19045v1 -
61 06-23 A Comprehensive Study of Machine Learning Techniques for Log-Based Anomaly Detection Eine umfassende Untersuchung von Techniken des maschinellen Lernens zur logbasierten Anomalieerkennung 全面研究用于基于日志异常探测的机器学习技术 2307.16714v5 -
62 06-23 Understanding Software Engineering Agents: A Study of Thought-Action-Result Trajectories Software Engineering Agents verstehen: Eine Studie über Gedanken-Action-Result-Trajektorien 了解软件工程剂:关于思想-行动-结果轨迹的研究 2506.18824v1 -
63 06-23 Context-Aware CodeLLM Eviction for AI-assisted Coding Context-Aware CodeLLM Eviction für KI-unterstützte Coding 使用 AI 辅助编码的内装软件 coolLLM 驱逐 2506.18796v1 -
64 06-23 FORGE: An LLM-driven Framework for Large-Scale Smart Contract Vulnerability Dataset Construction FORGE: Ein LLM-gesteuertes Framework für großflächige Smart Contract Vulnerability Dataset Construction FORGE:由LLM驱动的大型智能合同脆弱性数据集构建框架 2506.18795v1 -
65 06-23 ModeliHub: A Web-based, Federated Analytics Platform for Modelica-centric, Model-based Systems Engineering ModeliHub: Eine Web-basierte, Federated Analytics Plattform für modellisch-zentrierte, modellbasierte Systemtechnik 模型Hub:一个基于网络的、以模型为中心的、以模型为基础的系统工程联合会分析平台 2506.18790v1 -
66 06-23 Working Document – Formalising Software Requirements with Large Language Models Arbeitsdokument – Formalisierung von Softwareanforderungen mit großen Sprachmodellen 工作文件 – – 用大语言模式正式确定软件要求 2506.14627v2 -
67 06-23 The Impact of Input Order Bias on Large Language Models for Software Fault Localization Die Auswirkungen der Eingabereihenfolge Bias auf große Sprachmodelle für Softwarefehlerlokalisierung 输入顺序对软件失错本地化大语言模式的影响 2412.18750v3 -
68 06-23 Piloting Copilot, Codex, and StarCoder2: Hot Temperature, Cold Prompts, or Black Magic? Pilotieren von Copilot, Codex und StarCoder2: Heiße Temperatur, kalte Prompts oder schwarze Magie? 联合飞行员 代码代码和星际代码2: 热温、冷感或黑魔法? 2210.14699v3 -
69 06-23 MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems MORTAR: Multiturn Metamorphic Testing für LLM-basierte Dialogsysteme MORTAR:以LLM为基础的对话系统的多轨变形测试 2412.15557v3 -
70 06-23 Automatic Selection of Protections to Mitigate Risks Against Software Applications Automatische Auswahl von Schutzsystemen, um Risiken gegen Software-Anwendungen abzumildern 自动选择防范软件应用风险的防范措施 2506.18470v1 -
71 06-23 Bloch Vector Assertions for Debugging Quantum Programs Bloch Vector Assertions für Debugging Quantenprogramme 调试量子程序Bloch 矢量批量 2506.18458v1 -
72 06-23 The Debugging Decay Index: Rethinking Debugging Strategies for Code LLMs Der Debugging Decay Index: Debugging Strategien für Code LLMs neu denken 调试衰减指数:重新思考守则LMS的调试战略 2506.18403v1 -
73 06-23 Your Token Becomes Worthless: Unveiling Rug Pull Schemes in Crypto Token via Code-and-Transaction Fusion Analysis Ihr Token wird wertlos: Enthüllen von Rug Pull Schemes in Crypto Token über Code-and-Transaction Fusion Analysis 您的名声变得毫无价值:通过代码和交易整合分析,在加密调控中采用不懈的Rug拉力计划 2506.18398v1 -
74 06-23 Tracing Errors, Constructing Fixes: Repository-Level Memory Error Repair via Typestate-Guided Context Retrieval Nachvollziehen von Fehlern, Konstruieren von Fehlern: Repository-Level-Speicherfehler Reparieren von Fehlern über typstate-guided Context Retrieval 追踪错误, 构建修补: 通过 Tystate- Guide- Guided Intern Enter Review 修复存储器级存储器级存储器级内存错误 2506.18394v1 -
75 06-23 Recipe for Discovery: A Framework for Systematic Open Source Project Identification Rezept für Entdeckung: Ein Rahmen für die systematische Identifizierung von Open Source-Projekten 发现秘诀:系统开放源码项目确认框架 2506.18359v1 -
76 06-23 Predictive Analytics for Collaborators Answers, Code Quality, and Dropout on Stack Overflow Predictive Analytics für Kollaboratoren Antworten, Codequalität und Dropout auf Stack Overflow 合作者答复的预测分析、守则质量和Stack 溢流的辍学情况 2506.18329v1 -
77 06-23 Use Property-Based Testing to Bridge LLM Code Generation and Validation Verwenden Sie property-based testing to Bridge LLM Code-Generierung und Validierung 使用基于财产的测试进行桥桥LLM编码的生成和验证 2506.18315v1 -
78 06-23 Tu(r)ning AI Green: Exploring Energy Efficiency Cascading with Orthogonal Optimizations Tu(r)ning AI Green: Erforschung der Energieeffizienz Kaskadierung mit orthogonalen Optimierungen Tu(r)ning AI Green:探索利用矫形优化的能源效率链条 2506.18289v1 -
79 06-23 Smart-LLaMA-DPO: Reinforced Large Language Model for Explainable Smart Contract Vulnerability Detection Smart-LlaMA-DPO: Verstärktes Large Language Model für erklärbare Smart Contract Vulnerability Detection Smart-LLamaMA-DPO:可解释的智能合同脆弱性探测强化大语言模型 2506.18245v1 -
80 06-23 Managing Technical Debt in a Multidisciplinary Data Intensive Software Team: an Observational Case Study Verwaltung technischer Schulden in einem multidisziplinären Data Intensive Software Team: eine Beobachtungsfallstudie 多学科数据密集软件小组管理技术债务:观察案例研究 2506.18219v1 -
81 06-22 (7) BLAZE: Cross-Language and Cross-Project Bug Localization via Dynamic Chunking and Hard Example Learning BLAZE: Cross-Language und Cross-Project Bug Lokalisierung über Dynamic Chunking und Hard Example Learning BLAZE:通过动态打字和硬实例学习实现跨语言和跨项目错误定位 2407.17631v3 -
82 06-22 Call Me Maybe: Enhancing JavaScript Call Graph Construction using Graph Neural Networks Rufen Sie mich vielleicht an: Verbesserung der JavaScript Call Graph Construction mit Graph Neural Networks 使用图形神经网络加强 JavaScript 呼叫图图建设 2506.18191v1 -
83 06-22 Generating Energy-efficient code with LLMs Energieeffizienter Code mit LLMs generieren 与LLMM 生成节能代码 2411.10599v2 -
84 06-22 Build It Clean: Large-Scale Detection of Code Smells in Build Scripts Build It Clean: Großräumige Erkennung von Code-Gemälden in Build-Scripts 构建干净的代码: 在构建脚本中大规模检测代码的气味 2506.17948v1 -
85 06-22 Software Reuse in the Generative AI Era: From Cargo Cult Towards AI Native Software Engineering Software-Wiederverwertung in der Generativen KI-Ära: Vom Cargo-Cult hin zum KI-Indianischen Software-Engineering 产生AI时代的软件再利用:从货物邪道到AI 本土软件工程 2506.17937v1 -
86 06-22 Rubric Is All You Need: Enhancing LLM-based Code Evaluation With Question-Specific Rubrics Rubric ist alles, was Sie brauchen: Verbesserung der LLM-basierten Code-Bewertung mit Frage-spezifischen Rubrics 需要的是所有你需要的卢布:加强基于LLM的法规评价,用特定问题规范 2503.23989v2 -
87 06-21 (6) The Impact of AI-Generated Solutions on Software Architecture and Productivity: Results from a Survey Study Die Auswirkungen von KI-generierten Lösungen auf Softwarearchitektur und Produktivität: Ergebnisse einer Umfragestudie AI创创的解决方案对软件结构和生产力的影响:一项调查研究的结果 2506.17833v1 -
88 06-21 Is Your Automated Software Engineer Trustworthy? Ist Ihr automatisierter Software-Ingenieur vertrauenswürdig? 你的自动软件工程师可信吗? 2506.17812v1 -
89 06-21 SAVANT: Vulnerability Detection in Application Dependencies through Semantic-Guided Reachability Analysis SAVANT: Sicherheitserkennung in Anwendungsabhängigkeiten durch Semantik-geführte Reichweitenanalyse SAVANT: 通过语义辅助控制可达性分析,在应用依赖性中发现脆弱性 2506.17798v1 -
90 06-21 Efficient Strategy Synthesis for MDPs via Hierarchical Block Decomposition Effiziente Strategiesynthese für MDPs über Hierarchische Blockzersetzung 通过分层块分解实现 MDP 高效战略合成 2506.17792v1 -
91 06-21 PAGENT: Learning to Patch Software Engineering Agents PAGENT: Lernen, Software Engineering Agents zu Patchen PAGENT: 学习修补软件工程代理 2506.17772v1 -
92 06-21 Beyond Functional Correctness: Investigating Coding Style Inconsistencies in Large Language Models Beyond Functional Correctness: Untersuchung von Coding Style Inkonsistenzen in großen Sprachmodellen 超越功能正确性:调查大语言模式的编码样式不一致问题 2407.00456v2 -
93 06-21 Improving Compiler Bug Isolation by Leveraging Large Language Models Verbesserung der Compiler-Fehlerisolierung durch die Nutzung großer Sprachmodelle 通过利用大语言模型改进编译者虫虫隔离 2506.17647v1 -
94 06-21 May the Feedback Be with You! Unlocking the Power of Feedback-Driven Deep Learning Framework Fuzzing via LLMs Möge das Feedback mit dir sein! Entsperren der Kraft des Feedback-getriebenen Deep Learning Framework Fuzzing über LLMs 愿回馈与你同在! 2506.17642v1 -
95 06-21 Deep Learning Framework Testing via Model Mutation: How Far Are We? Deep Learning Framework Testing über Modellmutation: Wie weit sind wir? 通过模型变异进行深层次学习框架测试:我们有多远? 2506.17638v1 -
96 06-21 CodeMorph: Mitigating Data Leakage in Large Language Model Assessment CodeMorph: Eindämmung der Datenleckage in der Bewertung von Großsprachenmodellen 代码Morph:减少大语言模式评估中的数据泄漏 2506.17627v1 -
97 06-21 Fuzzing-based Mutation Testing of C/C++ CPS Fuzzing-basierte Mutationsprüfung von C/C++ CPS C/C++CPS的模糊基变异测试 2503.24100v2 -
98 06-21 Large Language Model Guided Self-Debugging Code Generation Große Sprache Modell geführte Selbst-Debugging-Code-Generierung 大语言制导自调自调码生成 2502.02928v2 -
99 06-21 EditLord: Learning Code Transformation Rules for Code Editing EditLord: Regeln zur Code-Transformation für die Code-Editing 编辑主: 学习代码编辑的代码转换规则 2504.15284v3 -
100 06-20 (5) Dissecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems Desecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems SWE-区领导板拆解:LLM-和代理修理系统的分析提交者和结构 2506.17208v1 -
101 06-20 LLMs and Stack Overflow Discussions: Reliability, Impact, and Challenges LLMs und Stack-Überflussdiskussionen: Zuverlässigkeit, Wirkung und Herausforderungen LLM和Stack 溢流讨论:可靠性、影响和挑战 2402.08801v2 -
102 06-20 Large Language Model Unlearning for Source Code Großes Sprachmodell Unlearning für Quellcode 源代码的大语言模式重新学习 2506.17125v1 -
103 06-20 Reassessing Code Authorship Attribution in the Era of Language Models Neubewertung von Code Authorship Attribution im Zeitalter der Sprachmodelle 重新评估在语言模式时代重新确定《语言模式时代》中归属的法规授权人 2506.17120v1 -
104 06-20 Software Fairness Testing in Practice Software Fairness-Tests in der Praxis 实践中软件公平测试 2506.17095v1 -
105 06-20 Re-Evaluating Code LLM Benchmarks Under Semantic Mutation Neubewertung von Code-LLM-Benchmarks unter semantischer Mutation 在语义变异下重新估价代码法LLM基准 2506.17369v1 -
106 06-20 Behavior Driven Development for 3D Games Behavior Driven Entwicklung für 3D-Spiele 3D运动会行为驱动器开发 2506.17057v1 -
107 06-20 Identifying Explanation Needs: Towards a Catalog of User-based Indicators Erklärungsbedarf identifizieren: Auf dem Weg zu einem Katalog von benutzerbasierten Indikatoren 查明解释需要:建立用户指标目录 2506.16997v1 -
108 06-20 Accelerating Quantum Eigensolver Algorithms With Machine Learning Beschleunigung von Quanten Eigensolver-Algorithmen mit maschinellem Lernen 用机器学习加速量子 Eigensolver 算法 2409.13587v2 -
109 06-20 Adversarial Reasoning for Repair Based on Inferred Program Intent Adversariale Begründung für die Reparatur auf der Grundlage von abgeleiteten Programm Intent 根据被推断的方案意图进行修复的反向理由 2505.13008v2 -
110 06-20 PinChecker: Identifying Unsound Safe Abstractions of Rust Pinning APIs PinChecker: Identifizieren von unschallsicheren Abstraktionen von Rust Pinning APIs Pin checker: 识别混乱平铺API的不健全安全事件 2504.14500v2 -
111 06-20 Quantum Optimization for Software Engineering: A Survey Quantenoptimierung für die Software-Engineering: Eine Umfrage 软件工程量量的优化:调查 2506.16878v1 -
112 06-20 Revolutionizing Validation and Verification: Explainable Testing Methodologies for Intelligent Automotive Decision-Making Systems Revolutionierung der Validierung und Verifizierung: Erklärbare Prüfmethoden für intelligente Automotive-Entscheidungs-Making-Systeme 验证与核查:智能汽车决策系统可解释的测试方法 2506.16876v1 -
113 06-20 Accountability of Robust and Reliable AI-Enabled Systems: A Preliminary Study and Roadmap Rechenschaftspflicht von robusten und zuverlässigen KI-fähigen Systemen: Eine Vorstudie und Roadmap 健全和可靠的独立独立使用系统问责制:初步研究和路线图 2506.16831v1 -
114 06-20 Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers Model Context Protocol (MCP) auf den ersten Blick: Die Sicherheit und Nachhaltigkeit von MCP-Servern untersuchen 《第一一一一一一一时示范背景议定书》:研究MCP服务器的安全性和可维持性 2506.13538v4 -
115 06-19 (4) LLMs in Coding and their Impact on the Commercial Software Engineering Landscape LLMs in Coding und ihre Auswirkungen auf die kommerzielle Software-Engineering-Landschaft 编码及其对商业软件工程景观的影响 2506.16653v1 -
116 06-19 CodeDiffuser: Attention-Enhanced Diffusion Policy via VLM-Generated Code for Instruction Ambiguity CodeDiffuser: Aufmerksamkeitsverstärkte Diffusionspolitik über VLM-generierten Code für Instruction Ambiguity 代码用户:通过VLM - 教育结构设计守则加强关注 - 强化传播政策 2506.16652v1 -
117 06-19 SemAgent: A Semantics Aware Program Repair Agent SemAgent: Ein Semantik-Bewusst-Programm-Reparatur-Agent SemAgenger: 语义学意识方案维修代理 2506.16650v1 -
118 06-19 LLM-based Satisfiability Checking of String Requirements by Consistent Data and Checker Generation LLM-basierte Zufriedenheitsprüfung von String-Anforderungen durch konsistente Daten- und Checker-Generierung 以LLM为基础的LLM按统一数据和生成核对器对字符串要求的兼容性核对 2506.16639v1 -
119 06-19 Safety Interventions against Adversarial Patches in an Open-Source Driver Assistance System Sicherheitsinterventionen gegen störende Patches in einem Open-Source Fahrerassistenzsystem 在开放源码的司机协助系统中针对对面补丁采取安全干预措施 2504.18990v2 -
120 06-19 AI-Driven Tools in Modern Software Quality Assurance: An Assessment of Benefits, Challenges, and Future Directions KI-getriebene Werkzeuge in der modernen Software-Qualitätssicherung: Eine Bewertung von Vorteilen, Herausforderungen und Zukunftsrichtungen 《现代软件质量保证方面的AI-Driver 工具:效益、挑战和今后方向评估》 2506.16586v1 -
121 06-19 Scaling GR(1) Synthesis via a Compositional Framework for LTL Discrete Event Control Scaling GR(1) Synthese über ein kompositorisches Framework für LTL Discrete Event Control GR(1) 通过立特分解事件控制的组成框架合成 2506.16557v1 -
122 06-19 ChatDBG: Augmenting Debugging with Large Language Models ChatDBG: Augmenting Debugging mit großen Sprachmodellen 聊天DBG: 使用大语言模式加强调试 2403.16354v5 -
123 06-19 SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development SWE-Dev: Bewertung und Schulung autonomer Feature-getriebener Software-Entwicklung SWE-Dev: 评估和培训自主开发地物-驱动软件开发 2505.16975v2 -
124 06-19 Teaching Complex Systems based on Microservices Teaching Complex Systems auf Basis von Microservices 以微观服务为基础的教学复杂系统 2506.16492v1 -
125 06-19 AlphaTrans: A Neuro-Symbolic Compositional Approach for Repository-Level Code Translation and Validation AlphaTrans: Ein neuro-symbolischer Kompositionsansatz für Repository-Level-Code-Übersetzung und Validierung AlphaTrans: 存储层代码翻译和校验的神经-交元组合法 2410.24117v5 -
126 06-19 Understanding the Challenges and Promises of Developing Generative AI Apps: An Empirical Study Die Herausforderungen und Versprechen der Entwicklung generativer KI-Apps verstehen: Eine empirische Studie 了解 “ 开发创新的AI Apps:经验研究 “ 的挑战和前景 2506.16453v1 -
127 06-19 Thermal Modeling and Optimal Allocation of Avionics Safety-critical Tasks on Heterogeneous MPSoCs Thermische Modellierung und optimale Allokation von Avionik Sicherheitskritische Aufgaben auf heterogenen MPSoCs 热建模和最佳分配航空气象安全关键任务 2505.22214v2 -
128 06-19 Evaluating the Use of LLMs for Documentation to Code Traceability Bewertung der Verwendung von LLMs für Dokumentation zur Code-Rückverfolgbarkeit 评价利用LLML 进行文件记录以便遵守可追踪性法规的情况 2506.16440v1 -
129 06-19 SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks SWE-Factory: Ihre automatisierte Fabrik für Ausgabeauflösungstraining Daten- und Bewertungs-Benchmarks SWE-Foctory: 您的解决问题自动工厂 培训数据和评价基准 2506.10954v2 -
130 06-19 Chaos Engineering: A Multi-Vocal Literature Review Chaos Engineering: Ein mehrstimmiger Literaturbericht 混乱工程:多语言文学评论 2412.01416v2 -
131 06-19 Evaluating Time-Dependent Methods and Seasonal Effects in Code Technical Debt Prediction Bewertung von zeitabhängigen Methoden und saisonalen Auswirkungen in Code Technical Debt Prediction 评估法典技术债务预测中依赖时间的方法和季节效应 2408.08095v2 -
132 06-19 The Technical Debt Gamble: A Case Study on Technical Debt in a Large-Scale Industrial Microservice Architecture The Technical Debt Gamble: Eine Fallstudie über technische Schulden in einer großräumigen Industrie-Mikroservice-Architektur 技术债务赌博:关于大型工业微观服务结构中技术债务的案例研究 2506.16214v1 -
133 06-19 Seeing is Fixing: Cross-Modal Reasoning with Multimodal LLMs for Visual Software Issue Fixing Sehen ist Fixing: Cross-Modal Reasoning mit multimodalen LLMs für Visual Software Problem Fixing 确定:与用于确定视觉软件问题的多模式LLMs进行交叉模式解释 2506.16136v1 -
134 06-19 Regression Testing Optimization for ROS-based Autonomous Systems: A Comprehensive Review of Techniques Regressionsprüfung Optimierung für ROS-basierte autonome Systeme: Eine umfassende Überprüfung von Techniken 以ROS为基础的自动系统优化后退试验:技术的全面审查 2506.16101v1 -
135 06-19 LMR-BENCH: Evaluating LLM Agent’s Ability on Reproducing Language Modeling Research LMR-BENCH: Bewertung der Fähigkeit des LLM-Agenten zur Reproduktion von Sprachmodellierungsforschung LMR-BENCH:评价LLM代理复制语言建模研究的能力 2506.17335v1 -
136 06-19 ExploraCoder: Advancing code generation for multiple unseen APIs via planning and chained exploration ExploraCoder: Advancing Codegenerierung für mehrere unsichtbare APIs durch Planung und verkettete Exploration 探索Coder:通过规划和链式探索,推进多个看不见的API代码生成 2412.05366v2 -
137 06-19 Large Language Models for Spreadsheets: Benchmarking Progress and Evaluating Performance with FLARE Große Sprachmodelle für Tabellen: Benchmarking-Fortschritt und Leistungsbewertung mit FLARE 电子表格大语言模式:与FLARE制定进度基准和评估业绩 2506.17330v1 -
138 06-19 FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation FEA-Bench: Ein Benchmark für die Bewertung der Code-Generierung auf Repository-Ebene für die Feature-Implementierung FEA-Bench:评估存储器一级实施地物代码生成的基准 2503.06680v2 -
139 06-19 From Generation to Adaptation: Comparing AI-Assisted Strategies in High School Programming Education Von der Generation zur Anpassung: Vergleich von KI-Assistenten Strategien in der High School Programming Education 从一代到适应:在高中方案规划教育中比较AI协助战略 2506.15955v1
Article 0
Title@2025-06-26 (4): Large Language Model-Powered Agent for C to Rust Code Translation
Title: Large Language Model-Powered Agent for C to Rust Code Translation | Large Language Model-Powered Agent für C to Rust Code Übersetzung | C至Rust 代码翻译的大型语言示范授权代理 2505.15858v2 |
Authors (6): HoHyun Sim, Hyeonjoong Cho, Yeonghyeon Go, Zhoulai Fu, Ali Shokri, Binoy Ravindran
The C programming language has been foundational in building system-level software. However, its manual memory management model frequently leads to memory safety issues. In response, Rust, a modern systems programming language, has emerged as a memory-safe alternative. Moreover, automating C-to-Rust translation, empowered by rapid advances in the generative capabilities of LLMs, is attracting growing interest for large volumes of legacy C code. Despite some success, existing LLM-based approaches have constrained the role of LLMs to static prompt-response behavior and have not explored their agentic problem-solving capability. Applying LLM agentic capability to C-to-Rust translation introduces distinct challenges, as this task differs from traditional LLM agent applications such as math or commonsense QA. First, the scarcity of parallel C-to-Rust datasets hinders the retrieval of suitable code translation exemplars for in-context learning. Second, unlike in math or commonsense QA, the intermediate steps required for C-to-Rust translation are not well-defined. Third, it remains unclear how to organize and cascade these intermediate steps to construct a correct translation trajectory. To address these challenges, we propose a novel intermediate step, the Virtual Fuzzing-based equivalence Test (VFT), and an agentic planning framework, the LLM-powered Agent for C-to-Rust code translation (LAC2R). The VFT guides LLMs to identify input arguments that induce divergent behaviors between an original C function and its Rust counterpart and to generate informative diagnoses for refining the unsafe Rust code. LAC2R uses Monte Carlo Tree Search (MCTS) to systematically organize the LLM-induced intermediate steps toward a correct translation. We experimentally demonstrate that LAC2R effectively conducts C-to-Rust translation on large-scale, real-world benchmarks.
C编程语言是建立系统级软件的基础语言。然而,其人工内存管理模式经常导致记忆安全问题。作为回应,现代系统编程语言Rust(Rust)已成为一种耐记忆的替代方案。此外,由于LLM的基因化能力迅速提高,使得C-Rust翻译自动化起来。尽管取得了一些成功,但基于LLMst的现有方法限制了LLM(LLM)的作用,使其成了静态的快速反应行为,而没有探索其中间解决问题的能力。在C-Rst翻译中应用LLM(LM)代理能力,带来了不同的挑战,因为这一任务不同于传统的LLM代理应用程序,例如数学或普通QA域。首先,平行C-Rust数据集的缺乏,阻碍了适当的C(LM)代码翻译的检索。第二,与数学或普通QA(Commerical-R)的QA(C-RM),为C-RM(LM)的原始-RM(LM-R-RM)的解算法解算的中间步骤没有很好地界定。第三,它仍然不清楚如何组织和不断组织和升级的翻译。
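The idea behind an equivalence test, finding inputs on which an original function and its translation disagree, can be illustrated with ordinary random differential testing. This is a minimal sketch and not the paper's VFT, in which an LLM proposes the inputs. The names `c_style_abs`, `translated_abs`, `find_divergences`, and `sample_inputs` are hypothetical, and the reference function only models a 32-bit build that wraps on overflow (in standard C, `abs(INT_MIN)` is undefined behavior).

```python
import random

INT_MIN, INT_MAX = -2**31, 2**31 - 1

def wrap_i32(x):
    """Simulate 32-bit two's-complement wraparound."""
    return (x + 2**31) % 2**32 - 2**31

def c_style_abs(x):
    # Assumed reference behavior: negation wraps, so INT_MIN maps to INT_MIN.
    return wrap_i32(-x) if x < 0 else x

def translated_abs(x):
    # Hypothetical translation that saturates instead of wrapping.
    return INT_MAX if x == INT_MIN else abs(x)

def find_divergences(reference, candidate, inputs):
    """Return (input, reference_output, candidate_output) for every mismatch."""
    report = []
    for x in inputs:
        expected, actual = reference(x), candidate(x)
        if expected != actual:
            report.append((x, expected, actual))
    return report

def sample_inputs(n, seed=0):
    """Boundary values first, then seeded random 32-bit integers."""
    rng = random.Random(seed)
    boundary = [INT_MIN, -1, 0, 1, INT_MAX]
    return boundary + [rng.randint(INT_MIN, INT_MAX) for _ in range(n)]

divergences = find_divergences(c_style_abs, translated_abs, sample_inputs(1000))
```

Each entry in `divergences` is the kind of concrete counterexample that a diagnosis step could feed back into a refinement loop; here the only disagreement is at `INT_MIN`.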
Article 1
Title@2025-06-26 (4): Anonymized Network Sensing Graph Challenge
Title: Anonymized Network Sensing Graph Challenge | Anonymisierte Network Sensing Graph Challenge | 匿名网络遥感图图挑战 2409.08115v2 |
Authors (29): Hayden Jananthan, Michael Jones, William Arcand, David Bestor, William Bergeron, Daniel Burrill, Aydin Buluc, Chansup Byun, Timothy Davis, Vijay Gadepally, Daniel Grant, Michael Houle, Matthew Hubbell, Piotr Luszczek, Peter Michaleas, Lauren Milechin, Chasen Milner, Guillermo Morales, Andrew Morris, Julie Mullen, Ritesh Patel, Alex Pentland, Sandeep Pisharody, Andrew Prout, Albert Reuther, Antonio Rosa, Gabriel Wachman, Charles Yee, Jeremy Kepner
The MIT/IEEE/Amazon GraphChallenge encourages community approaches to developing new solutions for analyzing graphs and sparse data derived from social media, sensor feeds, and scientific data to discover relationships between events as they unfold in the field. The anonymized network sensing Graph Challenge seeks to enable large, open, community-based approaches to protecting networks. Many large-scale networking problems can only be solved with community access to very broad data sets with the highest regard for privacy and strong community buy-in. Such approaches often require community-based data sharing. In the broader networking community (commercial, federal, and academia) anonymized source-to-destination traffic matrices with standard data sharing agreements have emerged as a data product that can meet many of these requirements. This challenge provides an opportunity to highlight novel approaches for optimizing the construction and analysis of anonymized traffic matrices using over 100 billion network packets derived from the largest Internet telescope in the world (CAIDA). This challenge specifies the anonymization, construction, and analysis of these traffic matrices. A GraphBLAS reference implementation is provided, but the use of GraphBLAS is not required in this Graph Challenge. As with prior Graph Challenges the goal is to provide a well-defined context for demonstrating innovation. Graph Challenge participants are free to select (with accompanying explanation) the Graph Challenge elements that are appropriate for highlighting their innovations.
MIT/IEEE/Amazon Graph Challenge鼓励采取社区办法,为分析图表和从社交媒体、传感器和科学数据获得的稀少数据开发新的解决方案,以发现实地所发生事件之间的关系。匿名网络感知图挑战力求促成大型、开放、基于社区的网络保护网络办法。许多大规模联网问题只能通过社区利用最重视隐私和强有力的社区买入的非常广泛的数据集来解决。这种办法往往需要社区分享数据。在更广泛的网络社区(商业、联邦和学术界)中,有标准数据共享协议的匿名源到目的地交通信息总库已成为能够满足许多这些要求的数据产品。这个挑战提供了一个机会,以突出新办法优化匿名交通总库的构建和分析,利用世界上最大的互联网望远镜(CAIDA)提供的1000多亿个网络包。
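To make the notion of an anonymized source-to-destination traffic matrix concrete, the sketch below builds one from a handful of packet records and computes a few aggregate statistics. It rests on several assumptions: salted SHA-256 truncation stands in for the anonymization scheme the challenge actually specifies, a plain `Counter` stands in for a GraphBLAS sparse matrix, and the function names, addresses, and salt are illustrative only.

```python
import hashlib
from collections import Counter

def anonymize(ip, salt):
    """Map an address to a stable pseudonym (a stand-in, not the challenge's scheme)."""
    return hashlib.sha256((salt + ip).encode()).hexdigest()[:12]

def traffic_matrix(packets, salt):
    """Sparse matrix as {(src, dst): packet_count} over anonymized endpoints."""
    matrix = Counter()
    for src, dst in packets:
        matrix[(anonymize(src, salt), anonymize(dst, salt))] += 1
    return matrix

def summarize(matrix):
    """A few simple aggregate statistics of the matrix."""
    sources = {s for s, _ in matrix}
    destinations = {d for _, d in matrix}
    fan_out = Counter(s for s, _ in matrix)  # distinct destinations per source
    return {
        "packets": sum(matrix.values()),
        "unique_links": len(matrix),
        "unique_sources": len(sources),
        "unique_destinations": len(destinations),
        "max_link_packets": max(matrix.values()),
        "max_source_fan_out": max(fan_out.values()),
    }

# Made-up packet records: (source address, destination address).
packets = [
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.1", "10.0.0.3"),
    ("10.0.0.4", "10.0.0.2"),
]
stats = summarize(traffic_matrix(packets, salt="demo-salt"))
```

Because the statistics depend only on the structure of the matrix, they are unchanged by the choice of salt, which is why anonymized matrices can still be analyzed and shared.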
Article 2
Title@2025-06-26 (4): IXAII: An Interactive Explainable Artificial Intelligence Interface for Decision Support Systems
Title: IXAII: An Interactive Explainable Artificial Intelligence Interface for Decision Support Systems | IXAII: Interactive Explainable Artificial Intelligence Interface für Entscheidungsunterstützungssysteme | IXAII:决策支持系统互动解释人工智能接口 2506.21310v1 |
Authors (3): Pauline Speckmann, Mario Nadj, Christian Janiesch
Although several post-hoc methods for explainable AI have been developed, most are static and neglect the user perspective, limiting their effectiveness for the target audience. In response, we developed the interactive explainable intelligent system called IXAII that offers explanations from four explainable AI methods: LIME, SHAP, Anchors, and DiCE. Our prototype provides tailored views for five user groups and gives users agency over the explanations’ content and their format. We evaluated IXAII through interviews with experts and lay users. Our results indicate that IXAII, which provides different explanations with multiple visualization options, is perceived as helpful to increase transparency. By bridging the gaps between explainable AI methods, interactivity, and practical implementation, we provide a novel perspective on AI explanation practices and human-AI interaction.
虽然已经开发了几种可解释的后热方法,但大多数是静态的,忽视了用户视角,限制了用户视角,限制了用户对目标受众的实效。作为回应,我们开发了名为IXAII的互动式可解释智能系统,从四种可解释的AI方法(LIME、SHAP、Anchors和DICE)中提供了解释。我们的原型为五个用户群体提供了量身定制的观点,并为用户机构提供了解释内容和格式。我们通过与专家和普通用户的访谈对IXAII进行了评估。我们的结果表明,IXAII提供了多种可视化选项的不同解释,被认为有助于提高透明度。通过弥合可解释的AI方法、互动性和实际实施之间的差距,我们从新的视角审视了AI解释做法和人类-AI互动。
Article 3
Title@2025-06-26 (4): An object-centric core metamodel for IoT-enhanced event logs
Title: An object-centric core metamodel for IoT-enhanced event logs | Ein objektzentriertes Kernmetamodell für IoT-verstärkte Ereignisprotokolle | IoT 强化事件日志的以物体为中心的核心元元模型 2506.21300v1 |
Authors (13): Yannis Bertrand, Christian Imenkamp, Lukas Malburg, Matthias Ehrendorfer, Marco Franceschetti, Joscha Grüger, Francesco Leotta, Jürgen Mangler, Ronny Seiger, Agnes Koschmider, Stefanie Rinderle-Ma, Barbara Weber, Estefania Serral
Advances in Internet-of-Things (IoT) technologies have prompted the integration of IoT devices with business processes (BPs) in many organizations across various sectors, such as manufacturing, healthcare and smart spaces. The proliferation of IoT devices leads to the generation of large amounts of IoT data providing a window on the physical context of BPs, which facilitates the discovery of new insights about BPs using process mining (PM) techniques. However, to achieve these benefits, IoT data need to be combined with traditional process (event) data, which is challenging due to the very different characteristics of IoT and process data, for instance in terms of granularity levels. Recently, several data models were proposed to integrate IoT data with process data, each focusing on different aspects of data integration based on different assumptions and requirements. This fragmentation hampers data exchange and collaboration in the field of PM, e.g., making it tedious for researchers to share data. In this paper, we present a core model synthesizing the most important features of existing data models. As the core model is based on common requirements, it greatly facilitates data sharing and collaboration in the field. A prototypical Python implementation is used to evaluate the model against various use cases and demonstrate that it satisfies these common requirements.
在诸如制造、保健和智能空间等各个部门的许多组织中,互联网技术的进步促使将互联网设备与业务流程(BPs)整合在一起,这在制造、保健和智能空间等许多部门中都促使将互联网设备与业务流程(BPs)整合在一起。互联网设备的扩散导致产生了大量互联网数据,为基于业务流程物理背景的数据整合提供了一个窗口,为利用工艺采矿(PM)技术发现关于业务流程的新认识提供了便利。然而,为了实现这些效益,互联网数据需要与传统流程(活动)数据相结合,而传统流程(活动)数据则具有挑战性,因为互联网和流程数据的特点非常不同,例如在颗粒度方面。最近,提出了若干数据模型,将互联网数据与流程数据整合在一起,每个数据侧重于基于不同假设和要求的数据整合的不同方面。这种破碎妨碍了数据交换和在PMM(PM)领域的合作,例如,使研究人员难以分享数据。在本文件中,我们介绍了一个核心模型,将现有数据模型的最重要特征(例如颗粒度)加以综合,因为核心模型以共同要求为基础,有利于使用这些共同的数据共享。
Article 4
Title@2025-06-26 (4): Exploring Micro Frontends: A Case Study Application in E-Commerce
Title: Exploring Micro Frontends: A Case Study Application in E-Commerce | Erforschung von Micro Frontends: Eine Anwendungsfallstudie im E-Commerce | 探索微观前沿:电子商务案例研究应用 2506.21297v1 |
Authors (5): Ricardo Hideki Hangai Kojo, Luiz Fernando Corte Real, Renato Cordeiro Ferreira, Thatiane de Oliveira Rosa, Alfredo Goldman
In the micro frontends architectural style, the frontend is divided into smaller components, which can range from a simple button to an entire page. The goal is to improve scalability, resilience, and team independence, albeit at the cost of increased complexity and infrastructure demands. This paper seeks to understand when it is worth adopting micro frontends, particularly in the context of industry. To achieve this, we conducted an investigation into the state of the art of micro frontends, based on both academic and gray literature. We then implemented this architectural style in a marketplace for handcrafted products, which already used microservices. Finally, we evaluated the implementation through a semi-open questionnaire with the developers. At the studied marketplace company, the need for architectural change arose due to the tight coupling between their main system (a Java monolith) and a dedicated frontend system. Additionally, there were deprecated technologies and poor developer experience. To address these issues, the micro frontends architecture was adopted, along with the API Gateway and Backend for Frontend patterns, and technologies such as Svelte and Fastify. Although the adoption of Micro Frontends was successful, it was not strictly necessary to meet the company’s needs. According to the analysis of the mixed questionnaire responses, other alternatives, such as a monolithic frontend, could have achieved comparable results. What made adopting micro frontends the most convenient choice in the company’s context was the monolith strangulation and microservices adoption, which facilitated implementation through infrastructure reuse and knowledge sharing between teams.
在微观前端的建筑风格中,前端被分为小部分,从简单的按钮到整个页面。目标是提高可缩放性、复原力和团队独立性,尽管其成本增加了复杂性和基础设施需求。本文件试图了解何时值得采用微观前端,特别是在工业方面。为此,我们根据学术和灰色文献对微型前端的艺术状态进行了调查。我们随后在手工艺产品的市场中实施了这种建筑风格,这些产品已经使用了微观服务。最后,我们通过与开发商的半开放问卷评估了执行情况。在经过研究的市场公司中,由于主要系统(爪哇单项)和专门的前端系统之间的紧密连接,需要进行建筑变革。
Article 5
Title@2025-06-26 (4): KOALA: a Configurable Tool for Collecting IDE Data When Solving Programming Tasks
Title: KOALA: a Configurable Tool for Collecting IDE Data When Solving Programming Tasks | KOALA: Ein konfigurierbares Tool zum Sammeln von IDE-Daten beim Lösen von Programmieraufgaben | KOALA: 在解决方案拟订任务时收集 IDE 数据的配置工具 2506.21266v1 |
Authors (6): Daniil Karol, Elizaveta Artser, Ilya Vlasov, Yaroslav Golubev, Hieke Keuning, Anastasiia Birillo
Collecting data of students solving programming tasks is incredibly valuable for researchers and educators. It allows verifying that the students correctly apply the features and concepts they are taught, or finding students’ misconceptions. However, existing data collection tools have limitations, e.g., no control over the granularity of the collected code, not collecting the specific events of the programming environment used, and overall being hard to configure. To overcome these limitations, we propose KOALA, a convenient and highly configurable tool for collecting code snapshots and feature usage from students solving programming tasks in JetBrains IDEs. The plugin can be installed in IDEs and configured to provide the students with the necessary tasks, enable or disable certain IDE features like code completion, and run surveys. During problem solving, the plugin collects code snapshots at the configured granularity, all IDE actions like running and debugging, as well as some data not collected in prior works, like employed hotkeys and switching focus between files. The collected data is sent to the server that comes with the tool, where it is stored and can be converted to the standardized ProgSnap2 format. To showcase the tool, we collected data from 28 students solving tasks in two courses within the IDE, highlighting some insights from this data.
收集完成编程任务的学生的数据对于研究人员和教育工作者来说是极其宝贵的。 它能够核实学生正确应用他们所教授的特征和概念,或者发现学生的错误概念。 但是, 现有的数据收集工具有局限性, 例如对所收集的代码的颗粒没有控制, 不收集所使用的编程环境的具体事件, 并且总体来说很难配置。 为了克服这些局限性, 我们提议 KOALA, 这是一种方便和高度可配置的工具, 用来收集在 JeetBrains IDEs 中完成编程任务的学生的代码快照和特征使用。 所收集的数据可以安装在 IDE 中, 配置该插件可以向学生提供必要的任务, 启用或禁用某些 IDE 功能, 如代码完成, 并运行调查 。 在解决问题过程中, 插件会收集到已配置的颗粒特性, 所有的 IDE 动作, 如运行和调试, 以及一些未在先前工作中收集的数据, 比如用过的热键和文件之间的焦点转换。 所收集的数据被发送到该工具的服务器, 在那里存储, 并且可以转换成标准化的 ProgSNA 2 格式中的学生 。
Article 6
Title@2025-06-26 (4): $T^3$: Multi-level Tree-based Automatic Program Repair with Large Language Models
Title: $T^3$: Multi-level Tree-based Automatic Program Repair with Large Language Models | $T^3$: Mehrstufige Baum-basierte automatische Programm-Reparatur mit großen Sprachmodellen | $T^3$:使用大语言模型进行多层次基于树的自动方案维修 2506.21211v1 |
Authors (4): Quanming Liu, Xupeng Bu, Zhichao Yan, Ru Li
Automatic Program Repair (APR) is a core technology in software development and maintenance that aims to enable automated defect repair with minimal human intervention. In recent years, the substantial advancements in Large Language Models (LLMs) and Chain-of-Thought (CoT) techniques have significantly enhanced the reasoning capabilities of these models. However, due to the complex logic and multi-step reasoning ability needed, the application of CoT techniques in the APR domain remains insufficient. This study systematically evaluates the performance of several common CoT techniques in APR tasks and proposes an innovative framework, $T^3$, which integrates the powerful reasoning capabilities of LLMs with tree search, effectively improving the precision of generating candidate repair solutions. Furthermore, $T^3$ provides valuable guidance for optimizing sample selection and repair strategies in APR tasks, establishing a robust framework for achieving efficient automated debugging.
自动程序修理(APR)是软件开发和维护方面的一项核心技术,目的是在最低限度的人力干预下实现自动缺陷修复,近年来,大语言模型(LLMs)和 “ 研究链 “ (Cot)技术的重大进步大大提高了这些模型的推理能力,然而,由于需要复杂的逻辑和多步推理能力,在PRA领域应用COT技术仍然不够充分,这项研究系统地评估了在PRA任务中若干通用COT技术的绩效,并提出了一个创新框架,即3美元,将LLMS的强大推理能力与树木搜索结合起来,有效地提高产生候选修复解决方案的精确性,此外,3美元为优化RA任务中的样本选择和修复战略提供了宝贵的指导,为实现高效自动调试建立一个强有力的框架。
Article 7
Title@2025-06-26 (4): Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks
Title: Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks | Beibehaltung des MTEB: Langfristige Nutzungsfähigkeit und Reproduzierbarkeit von Einbettungs-Benchmarks | 维持MTEB:实现长期使用和可复制嵌入基准 2506.21182v1 |
Authors (5): Isaac Chung, Imene Kerboua, Marton Kardos, Roman Solomatin, Kenneth Enevoldsen
The Massive Text Embedding Benchmark (MTEB) has become a standard evaluation platform for text embedding models. While previous work has established the core benchmark methodology, this paper focuses on the engineering aspects that ensure MTEB’s continued reproducibility and extensibility. We present our approach to maintaining robust continuous integration pipelines that validate dataset integrity, automate test execution, and assess benchmark results’ generalizability. We detail the design choices that collectively enhance reproducibility and usability. Furthermore, we discuss our strategies for handling community contributions and extending the benchmark with new tasks and datasets. These engineering practices have been instrumental in scaling MTEB to become more comprehensive while maintaining quality and, ultimately, relevance to the field. Our experiences offer valuable insights for benchmark maintainers facing similar challenges in ensuring reproducibility and usability in machine learning evaluation frameworks. The MTEB repository is available at: https://github.com/embeddings-benchmark/mteb
大量文本嵌入基准(MTEB)已成为文本嵌入模型的标准评价平台。虽然以前的工作已经确立了核心基准方法,但本文件侧重于确保MTEB继续可复制和可推广的工程方面,我们介绍了我们维持强有力的连续整合管道的方法,这些管道验证数据集的完整性、自动测试执行以及评估基准结果的一般性。我们详细介绍了共同加强可复制性和可用性的设计选择。此外,我们讨论了处理社区贡献和以新的任务和数据集扩展基准的战略。这些工程做法有助于扩大MTEB的规模,使之更加全面,同时保持质量,并最终与外地相关。我们的经验为基准维护者提供了宝贵的洞见,他们在确保机器学习评估框架的可复制性和可用性方面面临类似挑战。MTEB存放于https://github.com/embeddings-benchmark/mteb。
Article 8
Title@2025-06-26 (4): How Good Are Synthetic Requirements ? Evaluating LLM-Generated Datasets for AI4RE
Title: How Good Are Synthetic Requirements ? Evaluating LLM-Generated Datasets for AI4RE | Wie gut sind synthetische Anforderungen ? Bewertung von LLM-generierten Datensätzen für AI4RE | 合成要求如何好? 评价AI4RE的LLM-发光数据集 2506.21138v1 |
Authors (2): Abdelkarim El-Hajjami, Camille Salinesi
The shortage of publicly available, labeled requirements datasets remains a major barrier to advancing Artificial Intelligence for Requirements Engineering (AI4RE). While Large Language Models offer promising capabilities for synthetic data generation, systematic approaches to control and optimize the quality of generated requirements remain underexplored. This paper presents Synthline v1, an enhanced Product Line approach for generating synthetic requirements data that extends our earlier v0 version with advanced generation strategies and curation techniques. We investigate four research questions assessing how prompting strategies, automated prompt optimization, and post-generation curation affect data quality across four classification tasks: defect detection, functional vs. non-functional, quality vs. non-quality, and security vs. non-security. Our evaluation shows that multi-sample prompting significantly boosts both utility and diversity over single-sample generation, with F1-score gains from 6 to 44 points. The use of PACE (Prompt Actor-Critic Editing) for automated prompt optimization yields task-dependent results, greatly improving functional classification (+32.5 points) but reducing performance on others. Interestingly, similarity-based curation improves diversity but often harms classification performance, indicating that some redundancy may help ML models. Most importantly, our results show that synthetic requirements can match or outperform human-authored ones for specific tasks, with synthetic data surpassing human data for security (+7.8 points) and defect classification (+15.4 points). These findings offer practical insights for AI4RE and chart a viable path to mitigating dataset scarcity through systematic synthetic generation.
虽然大语言模型为合成数据生成提供了大有希望的能力,但控制和优化生成要求质量的系统方法仍未得到充分探讨。本文件展示了Synthline v1, 一个强化产品系列方法,用于生成合成要求数据,将我们先前的V0版本扩展至先进的生成战略和校正技术。我们调查了四个研究问题,评估快速战略、自动快速优化和后生成曲线如何影响四个分类任务的数据质量:缺陷检测、功能性与功能性对非功能性、质量对质量和安全性对非安全性。 我们的评估表明,多样本极大地促进在单一样本生成过程中的效用和多样性,F1核心收益从6点增加到44点。 使用PACE(Prompt Acor-Critical 编辑)自动快速优化产出任务依据的结果,大大改进功能性分类(+32.5点),但降低其他任务的绩效。 有趣的是,基于相似的缩略图路径, 显示具体数据生成值(MRegile) 能够显示具体的安全性分类。
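Similarity-based curation, as mentioned in the abstract above, can be pictured as a greedy near-duplicate filter. The sketch below is only an illustration: the abstract does not say which similarity measure Synthline uses, so token-level Jaccard overlap, the 0.8 threshold, the function names, and the sample requirements are all assumptions.

```python
def tokens(text):
    """Lowercased whitespace tokens as a set."""
    return set(text.lower().split())

def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def curate(samples, threshold=0.8):
    """Greedily keep a sample only if it is not too similar to any kept one."""
    kept = []
    for sample in samples:
        candidate = tokens(sample)
        if all(jaccard(candidate, tokens(k)) < threshold for k in kept):
            kept.append(sample)
    return kept

# Made-up requirements; the second is a near-duplicate of the first.
requirements = [
    "The system shall encrypt all stored passwords",
    "The system shall encrypt all stored user passwords",
    "The system shall respond within two seconds",
]
curated = curate(requirements, threshold=0.8)
```

A filter like this raises diversity by construction; the abstract's observation is that the discarded redundancy can nonetheless be useful to a downstream classifier, so the threshold is a trade-off rather than a pure quality knob.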
Article 9
Title@2025-06-26 (4): SceneGenAgent: Precise Industrial Scene Generation with Coding Agent
Title: SceneGenAgent: Precise Industrial Scene Generation with Coding Agent | SceneGenAgent: Präzise industrielle Szenegenerierung mit Coding Agent | SceneGenAgent: 精密工业场景与编码剂生成 2410.21909v3 |
Authors (8): Xiao Xia, Dan Zhang, Zibo Liao, Zhenyu Hou, Tianrui Sun, Jing Li, Ling Fu, Yuxiao Dong
The modeling of industrial scenes is essential for simulations in industrial manufacturing. While large language models (LLMs) have shown significant progress in generating general 3D scenes from textual descriptions, generating industrial scenes with LLMs poses a unique challenge due to their demand for precise measurements and positioning, requiring complex planning over spatial arrangement. To address this challenge, we introduce SceneGenAgent, an LLM-based agent for generating industrial scenes through C# code. SceneGenAgent ensures precise layout planning through a structured and calculable format, layout verification, and iterative refinement to meet the quantitative requirements of industrial scenarios. Experiment results demonstrate that LLMs powered by SceneGenAgent exceed their original performance, reaching up to 81.0% success rate in real-world industrial scene generation tasks and effectively meeting most scene generation requirements. To further enhance accessibility, we construct SceneInstruct, a dataset designed for fine-tuning open-source LLMs to integrate into SceneGenAgent. Experiments show that fine-tuning open-source LLMs on SceneInstruct yields significant performance improvements, with Llama3.1-70B approaching the capabilities of GPT-4o. Our code and data are available at https://github.com/THUDM/SceneGenAgent .
工业场景模型对于工业制造业的模拟至关重要。大型语言模型(LLMS)在用文字描述生成一般的3D场景方面取得了显著进展,而利用LLMS生成工业场景则因其对精确测量和定位的需求而带来了独特的挑战,需要对空间安排进行复杂的规划。为了应对这一挑战,我们引入了C#代码生成工业场景的基于LLM的CeneGenAgenti代理商SceenGenAgenti(CeneGenAgenti)确保了精确的布局规划,通过结构化和可计算的格式、布局核查以及迭接的完善以满足工业场景的数量要求。实验结果表明,SceneGenAgenent所驱动的LMS超过其最初的性能,在真实世界工业场景生成任务中达到高达81.0%的成功率,并有效地满足了大多数场景生成要求。为了进一步提高无障碍性,我们建造了SeenInstruct(SenInGPTHMM),一个旨在将开源LMMS-GPTSentrentrental 3./MUD)和MSUDSUDMS。
Article 10
Title@2025-06-26 (4): Boosting Vulnerability Detection with Inter-function Multilateral Association Insights
Title: Boosting Vulnerability Detection with Inter-function Multilateral Association Insights | Förderung der Erkennung von Schwachstellen durch multilaterale Integrations-Insights zwischen den Funktionen | 与职能间多边协会透视促进脆弱性探测 2506.21014v1 |
Authors (3): Shaojian Qiu, Mengyang Huang, Jiahao Cheng
Vulnerability detection is a crucial yet challenging technique for ensuring the security of software systems. Currently, most deep learning-based vulnerability detection methods focus on stand-alone functions, neglecting the complex inter-function interrelations, particularly the multilateral associations. As a result, vulnerabilities that arise from these interrelations can go undetected. To address this gap, we present an Inter-Function Multilateral Association analysis framework for Vulnerability Detection (IFMA-VD). The cornerstone of IFMA-VD lies in constructing a code behavior hypergraph and utilizing hyperedge convolution to extract multilateral association features. Specifically, we first parse functions into a code property graph to generate intra-function features. Following this, we construct a code behavior hypergraph by segmenting the program dependency graph to isolate and encode behavioral features into hyperedges. Finally, we utilize a hypergraph network to capture the multilateral association knowledge for augmenting vulnerability detection. We evaluate IFMA-VD on three widely used vulnerability datasets and demonstrate improvements in F-measure and Recall compared to baseline methods. Additionally, we illustrate that multilateral association features can boost code feature representation and validate the effectiveness of IFMA-VD on real-world datasets.
目前,大多数深层次的基于学习的脆弱性检测方法都侧重于独立功能,忽视复杂的功能间相互关系,特别是多边协会。这种监督可能无法发现这些相互关系中的弱点。为了解决这一差距,我们提出了一个跨功能多边协会脆弱性检测分析框架(IFMA-VD)。IMA-VD的基石在于建立一套守则行为高光谱,并利用高级变相来提取多边联系特征。具体地说,我们首先将功能分析成一个代码属性图,以产生内部功能特征。之后,我们通过将程序依赖性图进行分解,将行为特征编码高光谱,将程序依赖性图分离和编码到高端。最后,我们利用一个高光谱网络来捕捉多边联系知识,以加强脆弱性检测。我们评估了三个广泛使用的弱点数据集,并展示了FMA-VD与基线方法相比在F计量和召回方面的改进。此外,我们说明,多边联系特征可以提高代码特征,并验证IMA-VD在现实世界数据集上的有效性。
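A hyperedge can connect more than two nodes at once, which is what lets a hypergraph express associations among several functions together. The sketch below shows one round of unweighted node-to-hyperedge-to-node mean aggregation. It is a simplified stand-in with no learnable parameters, not the hyperedge convolution used in IFMA-VD, and the node names and feature vectors are made up.

```python
def hyperedge_aggregate(features, hyperedges):
    """One round of node -> hyperedge -> node mean aggregation."""
    dim = len(next(iter(features.values())))
    # Step 1: each hyperedge feature is the mean of its member nodes.
    edge_feats = [
        [sum(features[n][d] for n in members) / len(members) for d in range(dim)]
        for members in hyperedges
    ]
    # Step 2: each node averages the features of the hyperedges it belongs to.
    updated = {}
    for node, feat in features.items():
        incident = [edge_feats[i] for i, members in enumerate(hyperedges) if node in members]
        if not incident:
            updated[node] = list(feat)  # isolated node keeps its own feature
        else:
            updated[node] = [sum(e[d] for e in incident) / len(incident) for d in range(dim)]
    return updated

# Three hypothetical functions with 2-dimensional features and two hyperedges.
features = {"f1": [1.0, 0.0], "f2": [0.0, 1.0], "f3": [1.0, 1.0]}
hyperedges = [{"f1", "f2", "f3"}, {"f2", "f3"}]
updated = hyperedge_aggregate(features, hyperedges)
```

After one round, `f2` and `f3` end up with identical features because they share exactly the same hyperedges, while `f1` reflects only the three-way association.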
Article 11
Title@2025-06-26 (4): ToolScan: A Benchmark for Characterizing Errors in Tool-Use LLMs
Title: ToolScan: A Benchmark for Characterizing Errors in Tool-Use LLMs | ToolScan: Ein Benchmark für die Charakterisierung von Fehlern in Tool-Use LLMs | 工具扫描:工具使用 LLM 错误识别基准 2411.13547v2 |
Authors (18): Shirley Kokane, Ming Zhu, Tulika Awalgaonkar, Jianguo Zhang, Thai Hoang, Akshara Prabhakar, Zuxin Liu, Tian Lan, Liangwei Yang, Juntao Tan, Rithesh Murthy, Weiran Yao, Zhiwei Liu, Juan Carlos Niebles, Huan Wang, Shelby Heinecke, Caiming Xiong, Silivo Savarese
Evaluating Large Language Models (LLMs) is one of the most critical aspects of building a performant compound AI system. Since the output from LLMs propagates to downstream steps, identifying LLM errors is crucial to system performance. A common task for LLMs in AI systems is tool use. While there are several benchmark environments for evaluating LLMs on this task, they typically only give a success rate without any explanation of the failure cases. To solve this problem, we introduce TOOLSCAN, a new benchmark to identify error patterns in LLM output on tool-use tasks. Our benchmark data set comprises queries from diverse environments that can be used to test for the presence of seven newly characterized error patterns. Using TOOLSCAN, we show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use these insights from TOOLSCAN to guide their error mitigation strategies.
评价大语言模型(LLMS)是建立性能复合AI系统的最关键方面之一。由于LLMS的输出向下游步骤传播,确定LLM错误对于系统性能至关重要。AI系统中的LLMs的一项共同任务是工具使用。虽然评价LLMs在这项工作上有一些基准环境,但它们通常只给出成功率而不解释失败案例。为了解决这个问题,我们引入了TOOLSCAN,这是一个新的基准,用以确定LLLM产出在工具使用任务上的错误模式。我们的基准数据集包括来自不同环境的查询,可用于测试七个新发现的错误模式的存在。我们使用TOOLSCAN,显示即使是最著名的LMs也在其产出中展示了这些错误模式。研究人员可以利用TOOLSCAN的这些洞见来指导其减少错误的战略。
Article 12
Title@2025-06-25 (3): Complex Model Transformations by Reinforcement Learning with Uncertain Human Guidance
Title: Complex Model Transformations by Reinforcement Learning with Uncertain Human Guidance | Komplexe Modelltransformationen durch verstärktes Lernen mit unsicherer menschlicher Führung | 以不确定的人类指导加强学习的复杂模式转变 2506.20883v1 |
Authors (2): Kyanna Dagenais, Istvan David
Model-driven engineering problems often require complex model transformations (MTs), i.e., MTs that are chained in extensive sequences. Pertinent examples of such problems include model synchronization, automated model repair, and design space exploration. Manually developing complex MTs is an error-prone and often infeasible process. Reinforcement learning (RL) is an apt way to alleviate these issues. In RL, an autonomous agent explores the state space through trial and error to identify beneficial sequences of actions, such as MTs. However, RL methods exhibit performance issues in complex problems. In these situations, human guidance can be of high utility. In this paper, we present an approach and technical framework for developing complex MT sequences through RL, guided by potentially uncertain human advice. Our framework allows user-defined MTs to be mapped onto RL primitives, and executes them as RL programs to find optimal MT sequences. Our evaluation shows that human guidance, even if uncertain, substantially improves RL performance, and results in more efficient development of complex MTs. Through a trade-off between the certainty and timeliness of human advice, our method takes a step towards RL-driven human-in-the-loop engineering methods.
由模型驱动的工程问题往往要求复杂的模型转换,即以广泛顺序链条的模型转换。这些问题的相关例子包括模型同步、自动模型修理和空间探索设计。手工开发复杂的模型是一个容易出错且往往不可行的过程。强化学习(RL)是缓解这些问题的恰当方法。在RL,自主代理商通过试验和错误探索国家空间,以确定有益的行动序列,如MTs。然而,RL方法显示出复杂的问题中的性能问题。在这种情况下,人类指导可能非常有用。在本文中,我们提出了一个通过RL开发复杂的MT序列的方法和技术框架,以潜在的不确定人类建议为指导。我们的框架允许用户定义的MTs被绘图到RL原始,并将它们作为RL方案加以执行,以找到最佳的MT序列。我们的评估表明,人类指导,即使不确定,也会大大改进RL的性能,并导致更高效地发展复杂的RMTs。我们通过贸易驱动的确定性和及时性方法,在人类工程方法之间采取了一种贸易驱动式的一步。
Article 13
Title@2025-06-25 (3): Engineering RAG Systems for Real-World Applications: Design, Development, and Evaluation
Title: Engineering RAG Systems for Real-World Applications: Design, Development, and Evaluation | Engineering RAG-Systeme für Real-World-Anwendungen: Design, Entwicklung und Evaluation | RAG 现实世界应用工程系统:设计、开发和评价 2506.20869v1 |
Authors (6): Md Toufique Hasan, Muhammad Waseem, Kai-Kristian Kemell, Ayman Asad Khan, Mika Saari, Pekka Abrahamsson
Retrieval-Augmented Generation (RAG) systems are emerging as a key approach for grounding Large Language Models (LLMs) in external knowledge, addressing limitations in factual accuracy and contextual relevance. However, there is a lack of empirical studies that report on the development of RAG-based implementations grounded in real-world use cases, evaluated through general user involvement, and accompanied by systematic documentation of lessons learned. This paper presents five domain-specific RAG applications developed for real-world scenarios across governance, cybersecurity, agriculture, industrial research, and medical diagnostics. Each system incorporates multilingual OCR, semantic retrieval via vector embeddings, and domain-adapted LLMs, deployed through local servers or cloud APIs to meet distinct user needs. A web-based evaluation involving a total of 100 participants assessed the systems across six dimensions: (i) Ease of Use, (ii) Relevance, (iii) Transparency, (iv) Responsiveness, (v) Accuracy, and (vi) Likelihood of Recommendation. Based on user feedback and our development experience, we documented twelve key lessons learned, highlighting technical, operational, and ethical challenges affecting the reliability and usability of RAG systems in practice.
由于缺乏经验性研究,无法报告在现实世界使用案例基础上、通过用户普遍参与和系统记录经验教训,开发基于区域小组的执行,通过用户普遍参与进行评估,并辅以系统记录经验教训。本文件介绍了五种针对特定领域的区域小组应用软件,用于在治理、网络安全、农业、工业研究和医学诊断等各种现实世界情景。每个系统都包含多种语言的OCR、通过矢量嵌入进行语义检索和经域调整的LLMS,通过本地服务器或云端应用的LMS,用于满足不同用户的需要。基于网络的评价共100名参与者评估了六个层面的系统:(一) 使用方便度,(二) 相关性,(三) 透明度,(四) 应对性,(五) 准确性,(六) 建议的相似性。根据用户反馈和我们的发展经验,我们记录了在RAG系统方面12项关键的经验教训,突出技术、操作、道德和可操作性,并影响我们的可靠性。
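Semantic retrieval via vector embeddings, one of the components named in the abstract above, reduces to ranking stored vectors by similarity to a query vector. The sketch below uses cosine similarity over toy 3-dimensional vectors. The document ids, the vectors, and the `retrieve` helper are hypothetical; a real system would obtain embeddings from a model and typically use a vector index rather than a full sort.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, index, k=2):
    """Return the k document ids whose embeddings are most similar to the query."""
    ranked = sorted(index, key=lambda doc_id: cosine(query_vec, index[doc_id]), reverse=True)
    return ranked[:k]

# Toy "embeddings"; the values are made up for illustration.
index = {
    "crop-rotation": [0.9, 0.1, 0.0],
    "firewall-rules": [0.0, 0.2, 0.9],
    "soil-moisture": [0.8, 0.3, 0.1],
}
top = retrieve([1.0, 0.2, 0.0], index, k=2)
```

The retrieved passages would then be placed in the LLM prompt as grounding context, which is the step that ties retrieval quality to the Relevance and Accuracy dimensions assessed in the evaluation.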
Article 14
Title@2025-06-25 (3): Generating Reliable Adverse event Profiles for Health through Automated Integrated Data (GRAPH-AID): A Semi-Automated Ontology Building Approach
Title: Generating Reliable Adverse event Profiles for Health through Automated Integrated Data (GRAPH-AID): A Semi-Automated Ontology Building Approach | Erzeugen von zuverlässigen unerwünschten Ereignisprofilen für die Gesundheit durch automatisierte integrierte Daten (GRAPH-AID): Ein semi-automatisierter Ontologie-Bauansatz | 通过自动综合数据生成可靠的有害健康事件简介:半自动本体学构建方法 2506.20851v1 |
Authors (6): Srikar Reddy Gadusu, Larry Callahan, Samir Lababidi, Arunasri Nishtala, Sophia Healey, Hande McGinty
As data and knowledge expand rapidly, adopting systematic methodologies for ontology generation has become crucial. With the daily increases in data volumes and frequent content changes, the demand for databases to store and retrieve information for the creation of knowledge graphs has become increasingly urgent. The previously established Knowledge Acquisition and Representation Methodology (KNARM) outlines a systematic approach to address these challenges and create knowledge graphs. However, following this methodology highlights the existing challenge of seamlessly integrating Neo4j databases with the Web Ontology Language (OWL). Previous attempts to integrate data from Neo4j into an ontology have been discussed, but these approaches often require an understanding of description logics (DL) syntax, which may not be familiar to many users. Thus, a more accessible method is necessary to bridge this gap. This paper presents a user-friendly approach that utilizes Python and its rdflib library to support ontology development. We showcase our novel approach through a Neo4j database we created by integrating data from the Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS) database. Using this dataset, we developed a Python script that automatically generates the required classes and their axioms, facilitating a smoother integration process. This approach offers a practical solution to the challenges of ontology generation in the context of rapidly growing adverse drug event datasets, supporting improved drug safety monitoring and public health decision-making.
随着数据和知识的迅速扩展,采用系统方法生成本体学数据变得至关重要。随着数据量的日常增加和内容的频繁变化,对数据库储存和检索信息以创建知识图表的需求变得日益迫切。以前制定的知识获取和代表方法(KNARM)概述了应对这些挑战的系统方法,并创建了知识图表。然而,根据这种方法,我们强调将Neo4j数据库与Web本体学语言(OWL)无缝地结合到Neo4j数据库中的现有挑战。以前试图将Neo4j数据纳入本体学的尝试已经讨论过,但这些方法往往需要理解描述逻辑(DL)的语法,而许多用户可能并不熟悉这些逻辑。因此,需要一种更方便使用的方法来弥补这一差距。本文介绍了一种方便用户的方法,利用Python及其Ardflip图书馆来支持本学的发展。我们通过整合食品和药品管理局的反活动报告系统(FAERS)的数据而创建的Neoo4j数据库展示了我们的新方法。我们利用这一数据,利用这一数据设置一个支持公众决策过程的平稳化过程,从而自动地形成了一个需要生成一个生成一个健康解决方案的系统,在生产中自动地展示一个完整的过程上生成一个需要一个数据整合一个完整的过程。在生产一个健康学上的一个过程。
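The step of generating ontology classes and their axioms from tabular records can be pictured with the sketch below. It deliberately uses only the Python standard library and emits Turtle text directly, so it is not the authors' rdflib-based script; the `ex:` namespace, the helper names, and the two sample rows are assumptions and not FAERS data.

```python
def to_local_name(label):
    """Turn a free-text label into a CamelCase local name."""
    return "".join(part.capitalize() for part in label.replace("-", " ").split())

def classes_to_turtle(rows, prefix="ex", base="http://example.org/onto#"):
    """Emit OWL class declarations and subclass axioms as Turtle text.

    Labels are assumed to contain no double quotes; no escaping is done.
    """
    lines = [
        f"@prefix {prefix}: <{base}> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
    ]
    for label, parent in rows:
        subject = f"{prefix}:{to_local_name(label)}"
        statements = ["a owl:Class", f'rdfs:label "{label}"']
        if parent:
            statements.append(f"rdfs:subClassOf {prefix}:{to_local_name(parent)}")
        lines.append(f"{subject} " + " ;\n    ".join(statements) + " .")
    return "\n".join(lines)

# Hypothetical (label, parent label) rows, e.g. exported from a graph database query.
rows = [("adverse event", None), ("skin rash", "adverse event")]
turtle = classes_to_turtle(rows)
```

Writing plain Turtle keeps the example dependency-free; a library such as rdflib would add parsing, validation, and serialization to other RDF formats on top of the same class-and-axiom structure.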
Article 15
Title@2025-06-25 (3): GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization
Title: GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization | GPU-Kernel-Wissenschaftler: Ein LLM-getriebenes Framework für iterative Kernel-Optimierung | GPU 核心科学家:循环核心优化LLM-驱动框架 2506.20807v1 |
Authors (2): Martin Andrews, Sam Witteveen
Optimizing GPU kernels for high performance is a complex task, often demanding deep architectural knowledge, extensive profiling, and iterative experimentation. This challenge is amplified when targeting newer or less-documented GPU architectures where traditional development aids are scarce. This paper introduces an LLM-powered “GPU Kernel Scientist,” an automated methodology for iteratively refining accelerator kernels. Our methodology employs LLMs in a multi-stage, evolutionary process: (a) strategically selecting promising prior code versions as a basis for new iterations; (b) generating hypotheses for optimization experiments, based on existing code and assimilated knowledge from general GPU literature; and (c) autonomously implementing these experiments through code modification and subsequent submission to an external evaluation system, using only observed timing data as performance feedback. We detail how this approach navigates the challenges of the AMD MI300 target architecture and leverages LLMs to compensate for limited domain-specific human expertise. Since quantitative results from an ongoing performance competition were embargoed on paper submission date, we present the architectural design, operational workflow, and qualitative insights, highlighting the potential of LLM-driven agents to democratise and accelerate GPU kernel optimization, especially in resource-constrained or rapidly evolving hardware environments.
优化 GPU 内核以获得高性能是一项复杂的任务,往往需要深厚的体系结构知识、大量的性能剖析和反复的实验。当目标是较新或文档较少的 GPU 架构、传统开发辅助手段稀缺时,这一挑战更为突出。本文介绍了一种由 LLM 驱动的“GPU Kernel Scientist”,这是一种用于迭代改进加速器内核的自动化方法。我们的方法在一个多阶段的演化过程中使用 LLM:(a) 有策略地选择有前景的先前代码版本,作为新一轮迭代的基础;(b) 基于现有代码和从通用 GPU 文献中吸收的知识,为优化实验生成假设;(c) 通过修改代码并随后提交给外部评测系统,自主实施这些实验,仅使用观测到的计时数据作为性能反馈。我们详细说明了这一方法如何应对 AMD MI300 目标架构的挑战,并利用 LLM 弥补有限的领域专业人力。由于一项正在进行的性能竞赛的定量结果在论文提交之日尚处于保密期,我们介绍了架构设计、操作流程和定性见解,突出了 LLM 驱动的智能体在普及和加速 GPU 内核优化方面的潜力,尤其是在资源受限或快速演进的硬件环境中。
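The three-stage loop described in this abstract (select a parent version, hypothesise an optimisation, implement and time it) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `Candidate`, `select_parent`, `evolve`, the tournament selection rule, and the `propose`, `apply_edit`, and `evaluate` callables are all hypothetical stand-ins for the LLM calls and the external evaluation system.

```python
import random
from dataclasses import dataclass


@dataclass
class Candidate:
    code: str
    runtime_ms: float  # observed timing, the only feedback signal


def select_parent(population, k=3, rng=random):
    """Stage (a): pick a promising prior version.

    Tournament selection is an assumption made for this sketch.
    """
    contenders = rng.sample(population, min(k, len(population)))
    return min(contenders, key=lambda c: c.runtime_ms)


def evolve(population, propose, apply_edit, evaluate, steps=10, rng=random):
    """Run `steps` rounds of select -> hypothesise -> implement -> time."""
    population = list(population)  # work on a copy of the caller's list
    for _ in range(steps):
        parent = select_parent(population, rng=rng)
        hypothesis = propose(parent.code)                 # stage (b): would be an LLM call
        child_code = apply_edit(parent.code, hypothesis)  # stage (c): would be an LLM edit
        runtime = evaluate(child_code)                    # external evaluator; None on failure
        if runtime is not None:
            population.append(Candidate(child_code, runtime))
    return min(population, key=lambda c: c.runtime_ms)
```

Failed experiments (an evaluator returning `None`) are simply dropped here; a fuller system would presumably feed the failure back into later hypotheses.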
Article 16
Title@2025-06-25 (3): Agile Management for Machine Learning: A Systematic Mapping Study
Title: Agile Management for Machine Learning: A Systematic Mapping Study | Agiles Management für maschinelles Lernen: Eine systematische Mapping-Studie | 机器学习管理:系统绘图研究 2506.20759v1 |
Authors (5): Lucas Romao, Hugo Villamizar, Romeu Oliveira, Silvio Alonso, Marcos Kalinowski
[Context] Machine learning (ML)-enabled systems are present in our society, driving significant digital transformations. The dynamic nature of ML development, characterized by experimental cycles and rapid changes in data, poses challenges to traditional project management. Agile methods, with their flexibility and incremental delivery, seem well-suited to address this dynamism. However, it is unclear how to effectively apply these methods in the context of ML-enabled systems, where challenges require tailored approaches. [Goal] Our goal is to outline the state of the art in agile management for ML-enabled systems. [Method] We conducted a systematic mapping study using a hybrid search strategy that combines database searches with backward and forward snowballing iterations. [Results] Our study identified 27 papers published between 2008 and 2024. From these, we identified eight frameworks and categorized recommendations and practices into eight key themes, such as Iteration Flexibility, Innovative ML-specific Artifacts, and the Minimal Viable Model. The main challenge identified across studies was accurate effort estimation for ML-related tasks. [Conclusion] This study contributes by mapping the state of the art and identifying open gaps in the field. While relevant work exists, more robust empirical evaluation is still needed to validate these contributions.
[背景] 支持机器学习(ML)的系统已存在于我们的社会中,推动着重大的数字化转型。以实验周期和数据快速变化为特征的 ML 开发的动态性质,给传统项目管理带来了挑战。敏捷方法具有灵活性和增量交付的特点,似乎非常适合应对这种动态性。然而,尚不清楚如何在支持 ML 的系统中有效应用这些方法,因为其中的挑战需要量身定制的做法。[目标] 我们的目标是概述面向支持 ML 的系统的敏捷管理的最新水平。[方法] 我们采用混合搜索策略开展了系统映射研究,将数据库检索与后向和前向滚雪球迭代相结合。[结果] 我们的研究识别出 2008 年至 2024 年间发表的 27 篇论文。从中,我们识别出八个框架,并将建议和实践归为八个关键主题,例如迭代灵活性(Iteration Flexibility)、创新的 ML 特定制品(Innovative ML-specific Artifacts)和最小可行模型(Minimal Viable Model)。各项研究中发现的主要挑战是对 ML 相关任务进行准确的工作量估算。[结论] 本研究的贡献在于梳理了最新水平并指出了该领域的开放性空白。虽然已有相关工作,但仍需要更有力的实证评估来验证这些贡献。
Article 17
Title@2025-06-25 (3): Domain Knowledge in Requirements Engineering: A Systematic Mapping Study
Title: Domain Knowledge in Requirements Engineering: A Systematic Mapping Study | Domain Knowledge in Requirements Engineering: Eine systematische Mapping-Studie | 要求工程领域知识:系统绘图研究 2506.20754v1 |
Authors (5): Marina Araújo, Júlia Araújo, Romeu Oliveira, Lucas Romao, Marcos Kalinowski
[Context] Domain knowledge is recognized as a key component for the success of Requirements Engineering (RE), as it provides the conceptual support needed to understand the system context, ensure alignment with stakeholder needs, and reduce ambiguity in requirements specification. Despite its relevance, the scientific literature still lacks a systematic consolidation of how domain knowledge can be effectively used and operationalized in RE. [Goal] This paper addresses this gap by offering a comprehensive overview of existing contributions, including methods, techniques, and tools to incorporate domain knowledge into RE practices. [Method] We conducted a systematic mapping study using a hybrid search strategy that combines database searches with iterative backward and forward snowballing. [Results] In total, we found 75 papers that met our inclusion criteria. The analysis highlights the main types of requirements addressed, the most frequently considered quality attributes, and recurring challenges in the formalization, acquisition, and long-term maintenance of domain knowledge. The results provide support for researchers and practitioners in identifying established approaches and unresolved issues. The study also outlines promising directions for future research, emphasizing the development of scalable, automated, and sustainable solutions to integrate domain knowledge into RE processes. [Conclusion] The study contributes by providing a comprehensive overview that helps to build a conceptual and methodological foundation for knowledge-driven requirements engineering.
[背景] 领域知识被公认为需求工程(RE)成功的关键组成部分,因为它为理解系统上下文、确保与利益相关者需求保持一致以及减少需求规格说明中的歧义提供了所需的概念支持。尽管领域知识十分重要,科学文献仍然缺乏对如何在 RE 中有效使用和落实领域知识的系统性整合。[目标] 本文通过全面概述现有贡献(包括将领域知识纳入 RE 实践的方法、技术和工具)来弥补这一空白。[方法] 我们采用混合搜索策略开展了系统映射研究,将数据库检索与迭代的后向和前向滚雪球相结合。[结果] 我们共找到 75 篇符合纳入标准的论文。分析突出了所涉及的主要需求类型、最常被考虑的质量属性,以及在领域知识的形式化、获取和长期维护方面反复出现的挑战。研究结果可帮助研究人员和从业人员识别成熟的方法和尚未解决的问题。本研究还勾勒了未来研究的有前景的方向,强调开发可扩展、自动化且可持续的解决方案,以将领域知识整合到 RE 过程中。[结论] 本研究的贡献在于提供了全面的概述,有助于为知识驱动的需求工程建立概念和方法论基础。
Article 18
Title@2025-06-25 (3): Define-ML: An Approach to Ideate Machine Learning-Enabled Systems
Title: Define-ML: An Approach to Ideate Machine Learning-Enabled Systems | Define-ML: Ein Ansatz zur Idee von maschinellen Lernsystemen | 定义-ML:设计机器学习-可操作系统的方法 2506.20621v1 |
Authors (5): Silvio Alonso, Antonio Pedro Santos Alves, Lucas Romao, Hélio Lopes, Marcos Kalinowski
[Context] The increasing adoption of machine learning (ML) in software systems demands specialized ideation approaches that address ML-specific challenges, including data dependencies, technical feasibility, and alignment between business objectives and probabilistic system behavior. Traditional ideation methods like Lean Inception lack structured support for these ML considerations, which can result in misaligned product visions and unrealistic expectations. [Goal] This paper presents Define-ML, a framework that extends Lean Inception with tailored activities - Data Source Mapping, Feature-to-Data Source Mapping, and ML Mapping - to systematically integrate data and technical constraints into early-stage ML product ideation. [Method] We developed and validated Define-ML following the Technology Transfer Model, conducting both static validation (with a toy problem) and dynamic validation (in a real-world industrial case study). The analysis combined quantitative surveys with qualitative feedback, assessing utility, ease of use, and intent of adoption. [Results] Participants found Define-ML effective for clarifying data concerns, aligning ML capabilities with business goals, and fostering cross-functional collaboration. The approach’s structured activities reduced ideation ambiguity, though some noted a learning curve for ML-specific components, which can be mitigated by expert facilitation. All participants expressed the intention to adopt Define-ML. [Conclusion] Define-ML provides an openly available, validated approach for ML product ideation, building on Lean Inception’s agility while aligning features with available data and increasing awareness of technical feasibility.
[背景] 软件系统越来越多地采用机器学习(ML),这需要专门的构思方法来应对 ML 特有的挑战,包括数据依赖、技术可行性,以及业务目标与概率性系统行为之间的一致性。Lean Inception 等传统构思方法缺乏对这些 ML 考量的结构化支持,可能导致产品愿景错位和不切实际的期望。[目标] 本文提出 Define-ML,这一框架以量身定制的活动(数据源映射、特征到数据源映射和 ML 映射)扩展了 Lean Inception,以便将数据和技术约束系统地纳入早期 ML 产品构思。[方法] 我们遵循技术转移模型开发并验证了 Define-ML,既进行了静态验证(使用一个玩具问题),也进行了动态验证(在真实的工业案例研究中)。分析结合了定量调查和定性反馈,评估了效用、易用性和采用意愿。[结果] 参与者认为 Define-ML 在澄清数据问题、使 ML 能力与业务目标保持一致以及促进跨职能协作方面是有效的。该方法的结构化活动减少了构思中的模糊性,但有人指出 ML 特定组件存在学习曲线,这可以通过专家引导来缓解。所有参与者都表示有意采用 Define-ML。[结论] Define-ML 为 ML 产品构思提供了一种公开可用、经过验证的方法,它建立在 Lean Inception 的敏捷性之上,同时使特性与可用数据保持一致,并提高对技术可行性的认识。
Article 19
Title@2025-06-25 (3): Integrating Various Software Artifacts for Better LLM-based Bug Localization and Program Repair
Title: Integrating Various Software Artifacts for Better LLM-based Bug Localization and Program Repair | Integration verschiedener Software-Artefakte für bessere LLM-basierte Fehlerlokalisierung und Programmreparatur | 整合各种软件人工操作,以更好地使用LLM为主的错误定位和方案维修 2412.03905v3 |
Authors (6): Qiong Feng, Xiaotian Ma, Jiayi Sheng, Ziyuan Feng, Wei Song, Peng Liang
LLMs have garnered considerable attention for their potential to streamline Automated Program Repair (APR). LLM-based approaches can either insert the correct code or directly generate patches when provided with buggy methods. However, most LLM-based APR methods rely on a single type of software information, without fully leveraging different software artifacts. In addition, many LLM-based approaches do not explore which specific types of information best assist in APR. Addressing this gap is crucial for advancing LLM-based APR techniques. We propose DEVLoRe to use issue content (description and message) and stack error traces to localize buggy methods, then rely on debug information in buggy methods and issue content and stack error to localize buggy lines and generate plausible patches which can pass all unit tests. The results show that while issue content is particularly effective in assisting LLMs with fault localization and program repair, different types of software artifacts complement each other. By incorporating different artifacts, DEVLoRe successfully locates 49.3% and 47.6% of single and non-single buggy methods and generates 56.0% and 14.5% plausible patches for the Defects4J v2.0 dataset, respectively. This outperforms current state-of-the-art APR methods. Furthermore, we re-implemented and evaluated our framework, demonstrating its effectiveness in resolving 9 unique issues compared to other state-of-the-art frameworks using the same or more advanced models on SWE-bench Lite. We also discussed whether a leading framework for Python code can be directly applied to Java code, or vice versa. The source code and experimental results of this work for replication are available at https://github.com/XYZboom/DEVLoRe.
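LLM 因其在简化自动程序修复(APR)方面的潜力而备受关注。基于 LLM 的方法在给定有缺陷的方法时,既可以插入正确的代码,也可以直接生成补丁。然而,大多数基于 LLM 的 APR 方法只依赖单一类型的软件信息,没有充分利用不同的软件制品。此外,许多基于 LLM 的方法并未探究哪些具体类型的信息对 APR 最有帮助。弥补这一空白对于推进基于 LLM 的 APR 技术至关重要。我们提出 DEVLoRe,它使用 issue 内容(描述和消息)和错误堆栈跟踪来定位有缺陷的方法,然后依靠有缺陷方法中的调试信息以及 issue 内容和错误堆栈来定位有缺陷的代码行,并生成能够通过所有单元测试的可信补丁。结果表明,虽然 issue 内容在帮助 LLM 进行故障定位和程序修复方面特别有效,但不同类型的软件制品可以相互补充。通过结合不同的制品,DEVLoRe 在 Defects4J v2.0 数据集上分别成功定位了 49.3% 的单一有缺陷方法和 47.6% 的非单一有缺陷方法,并分别生成了 56.0% 和 14.5% 的可信补丁,优于当前最先进的 APR 方法。此外,我们重新实现并评估了我们的框架,在 SWE-bench Lite 上,与使用相同或更先进模型的其他最先进框架相比,它解决了 9 个独有的问题,证明了其有效性。我们还讨论了面向 Python 代码的领先框架能否直接应用于 Java 代码,反之亦然。本工作用于复现的源代码和实验结果可在 https://github.com/XYZboom/DEVLoRe 获取。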
Article 20
Title@2025-06-25 (3): Adaptive Request Scheduling for CodeLLM Serving with SLA Guarantees
Title: Adaptive Request Scheduling for CodeLLM Serving with SLA Guarantees | Adaptive Anforderungsplanung für CodeLLM Serving mit SLA-Garantien | 具有 SLA 保证的 CodeLLM 服务自适应请求调度 2506.19677v2 |
Authors (5): Shi Chang, Boyuan Chen, Kishanthan Thangarajah, Hanan Lutfiyya, Ahmed E. Hassan
Code Large Language Models (CodeLLMs) are increasingly integrated into modern software development workflows, yet efficiently serving them in resource-constrained, self-hosted environments remains a significant challenge. Existing LLM serving systems employ Continuous Batching for throughput improvement. However, they rely on static batch size configurations that cannot adapt to fluctuating request rates or heterogeneous workloads, leading to frequent SLA (Service Level Agreement) violations and unstable performance. In this study, we propose SABER, a dynamic batching strategy that predicts per-request SLA feasibility and adjusts decisions in real time. SABER improves goodput by up to 26% over the best static configurations and reduces latency variability by up to 45%, all without manual tuning or service restarts. Our results demonstrate that SLA-aware, adaptive scheduling is key to robust, high-performance CodeLLM serving.
大语言代码模型(CodeLLLMS)日益被纳入现代软件开发工作流程,但在资源受限制、自我托管的环境中有效地为它们服务,这仍然是一个重大挑战。现有的LLM服务系统采用不断的捆绑来改进吞吐量。然而,它们依赖静态的批量配置,无法适应变化不定的要求率或不同的工作量,导致频繁违反SLA(服务级协议)和工作不稳定。在本研究中,我们建议SABER(SABER),这是一个动态的批量战略,预测每个索要的SLA可行性,并实时调整决定。SABER将最佳静态配置的精干率提高26%,将长期性变异性降低45 %,所有这些都没有人工调整或服务重新启动。我们的结果表明,SLA-觉、适应性时间安排是强大、高性能的编码LLLM服务的关键。
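The idea of predicting per-request SLA feasibility before enlarging a batch can be illustrated with a toy admission check. This is only a sketch under loud assumptions, not SABER itself: the linear latency model in `predicted_latency`, its `base_tpot` and `slowdown` parameters, and the `Request` and `admit` names are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    req_id: str
    tokens: int         # tokens still to generate
    sla_seconds: float  # latency budget remaining for this request


def predicted_latency(tokens, batch_size, base_tpot=0.02, slowdown=0.1):
    """Toy model (an assumption): time per output token grows linearly with batch size."""
    return tokens * base_tpot * (1.0 + slowdown * (batch_size - 1))


def admit(queue, running, max_batch=32):
    """Greedily admit queued requests while every request in the
    enlarged batch is still predicted to meet its own SLA."""
    batch = list(running)
    deferred = []
    for req in queue:
        trial = batch + [req]
        feasible = len(trial) <= max_batch and all(
            predicted_latency(r.tokens, len(trial)) <= r.sla_seconds for r in trial
        )
        if feasible:
            batch = trial
        else:
            deferred.append(req)  # retried at the next scheduling step
    return batch, deferred
```

The effective batch size here follows from the predicted latencies rather than from a fixed configuration value, which is the contrast with static batching that the abstract draws.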
Article 21
Title@2025-06-25 (3): CCISolver: End-to-End Detection and Repair of Method-Level Code-Comment Inconsistency
Title: CCISolver: End-to-End Detection and Repair of Method-Level Code-Comment Inconsistency | CCISolver: End-to-End-Erkennung und Reparatur von Methoden-Level Code-Comment-Unstimmigkeit | CCISolver:方法级代码-注释不一致的端到端检测与修复 2506.20558v1 |
Authors (9): Renyi Zhong, Yintong Huo, Wenwei Gu, Jinxi Kuang, Zhihan Jiang, Guangba Yu, Yichen Li, David Lo, Michael R. Lyu
Comments within code serve as a crucial foundation for software documentation, helping developers communicate and understand the code effectively. However, code-comment inconsistency (CCI) can negatively affect software development, testing, and maintenance. Recent efforts to mitigate this issue have emerged, but existing studies often suffer from inaccurate datasets and inadequate solutions, weakening their practical effectiveness. In this study, we first conduct a quantitative analysis of existing datasets, revealing that a substantial portion of sampled data is mislabeled. To address these data limitations, we introduce CCIBench, a refined dataset comprising high-quality data, to support the training and evaluation of method-level CCI methods. Furthermore, we present an innovative end-to-end LLM-based framework, CCISolver, designed to improve code quality by identifying and rectifying CCIs. Comprehensive evaluations demonstrate CCISolver’s superior performance. For detection, it establishes a new state-of-the-art with an F1-score of 89.54%. In the fixing task, it achieves a remarkable 18.84% relative improvement in GLEU score over the strongest baseline. This superiority is confirmed by human evaluation, where CCISolver’s fixing success rate of 0.6533 significantly surpasses existing methods. Critically, in a practical end-to-end setting, CCISolver’s innovative architecture is approximately 36% faster for inference than the baseline model, underscoring its scalability and real-world applicability.
代码中的注释是软件文档的重要基础,有助于开发者有效地交流和理解代码。然而,代码与注释不一致(CCI)会对软件开发、测试和维护产生负面影响。近来已出现缓解这一问题的工作,但现有研究往往受困于不准确的数据集和不充分的解决方案,削弱了其实际效果。在本研究中,我们首先对现有数据集进行了定量分析,发现相当一部分抽样数据被错误标注。为解决这些数据局限,我们引入了 CCIBench,这是一个由高质量数据组成的精炼数据集,用于支持方法级 CCI 方法的训练和评估。此外,我们提出了一个创新的基于 LLM 的端到端框架 CCISolver,旨在通过识别和修正 CCI 来提高代码质量。全面的评估表明了 CCISolver 的优越性能。在检测方面,它以 89.54% 的 F1 分数确立了新的最先进水平。在修复任务中,它的 GLEU 分数相对最强基线取得了 18.84% 的显著提升。这一优势得到了人工评估的证实,其中 CCISolver 的修复成功率为 0.6533,明显超过现有方法。关键的是,在实际的端到端场景中,CCISolver 的创新架构在推理速度上比基线模型快约 36%,凸显了其可扩展性和现实适用性。
Article 22
Title@2025-06-25 (3): Large Language Model-Driven Code Compliance Checking in Building Information Modeling
Title: Large Language Model-Driven Code Compliance Checking in Building Information Modeling | Large Language Model-Driven Code Compliance Checking in Building Information Modeling | 在建筑信息建模中检查大型语文示范版本编码合规情况 2506.20551v1 |
Authors (7): Soumya Madireddy, Lu Gao, Zia Din, Kinam Kim, Ahmed Senouci, Zhe Han, Yunpeng Zhang
This research addresses the time-consuming and error-prone nature of manual code compliance checking in Building Information Modeling (BIM) by introducing a Large Language Model (LLM)-driven approach to semi-automate this critical process. The developed system integrates LLMs such as GPT, Claude, Gemini, and Llama, with Revit software to interpret building codes, generate Python scripts, and perform semi-automated compliance checks within the BIM environment. Case studies on a single-family residential project and an office building project demonstrated the system’s ability to reduce the time and effort required for compliance checks while improving accuracy. It streamlined the identification of violations, such as non-compliant room dimensions, material usage, and object placements, by automatically assessing relationships and generating actionable reports. Compared to manual methods, the system eliminated repetitive tasks, simplified complex regulations, and ensured reliable adherence to standards. By offering a comprehensive, adaptable, and cost-effective solution, this proposed approach offers a promising advancement in BIM-based compliance checking, with potential applications across diverse regulatory documents in construction projects.
这项研究通过采用大型语文模型(LLM)驱动的半自动化这一关键过程,解决了建筑信息建模中手工编码合规检查的耗时和易出错误性质,开发的系统将GPT、Claude、Gemini和Llama等LLMS与Revit软件相结合,以解释建筑法规、生成Python脚本和在BIM环境中进行半自动合规检查;关于单家庭住宅项目和办公建筑项目的案例研究表明,该系统有能力减少合规检查所需的时间和努力,同时提高准确性;通过自动评估关系和生成可采取行动的报告,简化了对违规行为的识别,例如不符合要求的房间尺寸、材料使用和对象放置;与人工方法相比,该系统取消了重复性任务,简化了复杂条例,并确保了对标准的可靠遵守;通过提供全面、适应性和成本效益高的解决办法,这一拟议方法在基于BIM的合规检查方面大有希望的进展,并有可能适用于建筑项目中的各种监管文件。
Article 23
Title@2025-06-25 (3): ReCode: Updating Code API Knowledge with Reinforcement Learning
Title: ReCode: Updating Code API Knowledge with Reinforcement Learning | ReCode: Aktualisierung von Code-API-Kenntnissen mit Verstärkungslernen | ReCode:更新法规API知识与强化学习 2506.20495v1 |
Authors (5): Haoze Wu, Yunzhi Yao, Wenhao Yu, Huajun Chen, Ningyu Zhang
Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their training data, even with access to current documentation, impedes reliable code generation in dynamic environments. To tackle this issue, we propose ReCode (rule-based Reinforcement learning for Code Update), a novel framework that mimics human programmer adaptation to API changes. Specifically, we construct a dataset of approximately 2,000 data entries to train the LLMs to perform version migration based on updated information. Then, we introduce a modified string similarity metric for code evaluation as the reward for reinforcement learning. Our experiments demonstrate that ReCode substantially boosts LLMs’ code generation performance in dynamic API scenarios, especially on the unseen CodeUpdateArena task. Crucially, compared to supervised fine-tuning, ReCode has less impact on LLMs’ general code generation abilities. We apply ReCode on various LLMs and reinforcement learning algorithms (GRPO and DAPO), all achieving consistent improvements. Notably, after training, Qwen2.5-Coder-7B outperforms the 32B-parameter code instruction-tuned model and the reasoning model with the same architecture. Code is available at https://github.com/zjunlp/ReCode.
大型语言模型(LLM)展现出卓越的代码生成能力,但在适应外部库 API 的频繁更新时表现不佳。这一关键局限源于对训练数据中过时 API 知识的依赖,即使可以访问最新文档也是如此,它阻碍了动态环境中可靠的代码生成。为解决这一问题,我们提出 ReCode(面向代码更新的基于规则的强化学习),这是一个模仿人类程序员适应 API 变化的新框架。具体而言,我们构建了一个约 2,000 条数据的数据集,用于训练 LLM 根据更新后的信息执行版本迁移。随后,我们引入一种经过修改的字符串相似度指标用于代码评估,作为强化学习的奖励。我们的实验表明,ReCode 大幅提升了 LLM 在动态 API 场景中的代码生成性能,尤其是在未见过的 CodeUpdateArena 任务上。重要的是,与监督微调相比,ReCode 对 LLM 通用代码生成能力的影响更小。我们将 ReCode 应用于多种 LLM 和强化学习算法(GRPO 和 DAPO),均取得了一致的改进。值得注意的是,经过训练后,Qwen2.5-Coder-7B 的表现优于 32B 参数的代码指令微调模型以及相同架构的推理模型。代码可在 https://github.com/zjunlp/ReCode 获取。
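A string-similarity reward of the general kind this abstract mentions might look like the sketch below. The abstract does not specify the modified metric, so this is an assumption-laden stand-in: it uses the standard library's `difflib.SequenceMatcher` ratio and adds a hypothetical penalty for code that fails to parse. The function name `similarity_reward` is invented for illustration.

```python
import ast
import difflib


def similarity_reward(generated: str, reference: str) -> float:
    """Return -1.0 for code that does not parse as Python, otherwise
    a similarity ratio in [0, 1] against the reference solution.

    Both the parse penalty and the use of SequenceMatcher are assumptions
    of this sketch, not the metric used in the paper.
    """
    try:
        ast.parse(generated)
    except SyntaxError:
        return -1.0
    matcher = difflib.SequenceMatcher(None, generated.strip(), reference.strip())
    return matcher.ratio()
```

A rule-based reward like this needs no test execution, which keeps reward computation cheap during reinforcement learning, at the cost of rewarding surface similarity rather than behaviour.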
Article 24
Title@2025-06-25 (3): MARCO: Multi-Agent Code Optimization with Real-Time Knowledge Integration for High-Performance Computing
Title: MARCO: Multi-Agent Code Optimization with Real-Time Knowledge Integration for High-Performance Computing | MARCO: Multi-Agent Code-Optimierung mit Echtzeit-Knowledge Integration für High-Performance Computing | MARCO: 利用实时知识整合优化多机构代码,促进高绩效计算 2505.03906v3 |
Authors (10): Asif Rahman, Veljko Cvetkovic, Kathleen Reece, Aidan Walters, Yasir Hassan, Aneesh Tummeti, Bryan Torres, Denise Cooney, Margaret Ellis, Dimitrios S. Nikolopoulos
Large language models (LLMs) have transformed software development through code generation capabilities, yet their effectiveness for high-performance computing (HPC) remains limited. HPC code requires specialized optimizations for parallelism, memory efficiency, and architecture-specific considerations that general-purpose LLMs often overlook. We present MARCO (Multi-Agent Reactive Code Optimizer), a novel framework that enhances LLM-generated code for HPC through a specialized multi-agent architecture. MARCO employs separate agents for code generation and performance evaluation, connected by a feedback loop that progressively refines optimizations. A key innovation is MARCO’s web-search component that retrieves real-time optimization techniques from recent conference proceedings and research publications, bridging the knowledge gap in pre-trained LLMs. Our extensive evaluation on the LeetCode 75 problem set demonstrates that MARCO achieves a 14.6% average runtime reduction compared to Claude 3.5 Sonnet alone, while the integration of the web-search component yields a 30.9% performance improvement over the base MARCO system. These results highlight the potential of multi-agent systems to address the specialized requirements of high-performance code generation, offering a cost-effective alternative to domain-specific model fine-tuning.
大型语言模型(LLM)凭借代码生成能力改变了软件开发,但它们在高性能计算(HPC)方面的效果仍然有限。HPC 代码需要针对并行性、内存效率和特定体系结构进行专门优化,而通用 LLM 往往忽视这些方面。我们提出 MARCO(Multi-Agent Reactive Code Optimizer),这是一个新颖的框架,通过专门的多智能体架构改进 LLM 为 HPC 生成的代码。MARCO 使用相互独立的智能体分别负责代码生成和性能评估,二者由一个逐步完善优化的反馈回路连接。一项关键创新是 MARCO 的网络搜索组件,它从近期的会议论文集和研究出版物中检索实时优化技术,弥补预训练 LLM 的知识缺口。我们在 LeetCode 75 问题集上的广泛评估表明,与单独使用 Claude 3.5 Sonnet 相比,MARCO 的平均运行时间减少了 14.6%,而集成网络搜索组件后,性能比基础 MARCO 系统提高了 30.9%。这些结果凸显了多智能体系统在满足高性能代码生成的专门需求方面的潜力,为领域特定的模型微调提供了一种具有成本效益的替代方案。
Article 25
Title@2025-06-25 (3): Smart Cuts: Enhance Active Learning for Vulnerability Detection by Pruning Bad Seeds
Title: Smart Cuts: Enhance Active Learning for Vulnerability Detection by Pruning Bad Seeds | Smart Cuts: Erweitern Sie aktives Lernen für die Erkennung von Gefährlichkeit durch Beschneiden von schlechten Samen | Smart Cuts:通过剪除坏种子加强面向漏洞检测的主动学习 2506.20444v1 |
Authors (3): Xiang Lan, Tim Menzies, Bowen Xu
Vulnerability detection is crucial for identifying security weaknesses in software systems. However, the effectiveness of machine learning models in this domain is often hindered by low-quality training datasets, which contain noisy, mislabeled, or imbalanced samples. This paper proposes a novel dataset maps-empowered approach that systematically identifies and mitigates hard-to-learn outliers, referred to as “bad seeds”, to improve model training efficiency. Our approach can categorize training examples based on learning difficulty and integrate this information into an active learning framework. Unlike traditional methods that focus on uncertainty-based sampling, our strategy prioritizes dataset quality by filtering out performance-harmful samples while emphasizing informative ones. Our experimental results show that our approach can improve F1 score over random selection by 45.36% (DeepGini) and 45.91% (K-Means) and outperforms standard active learning by 61.46% (DeepGini) and 32.65% (K-Means) for CodeBERT on the Big-Vul dataset, demonstrating the effectiveness of integrating dataset maps for optimizing sample selection in vulnerability detection. Furthermore, our approach also enhances model robustness, improves sample selection by filtering bad seeds, and stabilizes active learning performance across iterations. By analyzing the characteristics of these outliers, we provide insights for future improvements in dataset construction, making vulnerability detection more reliable and cost-effective.
漏洞检测对于识别软件系统中的安全弱点至关重要。然而,该领域机器学习模型的有效性往往受到低质量训练数据集的制约,这些数据集包含有噪声、标注错误或不平衡的样本。本文提出一种新颖的基于数据集地图(dataset maps)的方法,系统地识别并缓解难以学习的离群样本(称为“坏种子”),以提高模型训练效率。我们的方法可以根据学习难度对训练样本进行分类,并将这一信息整合到主动学习框架中。与侧重于基于不确定性采样的传统方法不同,我们的策略通过过滤损害性能的样本并突出信息量大的样本,优先保证数据集质量。实验结果表明,对于 Big-Vul 数据集上的 CodeBERT,我们的方法相对随机选择可将 F1 分数提高 45.36%(DeepGini)和 45.91%(K-Means),相对标准主动学习高出 61.46%(DeepGini)和 32.65%(K-Means),证明了整合数据集地图以优化漏洞检测中样本选择的有效性。此外,我们的方法还增强了模型的鲁棒性,通过过滤坏种子改进了样本选择,并稳定了各轮迭代中的主动学习性能。通过分析这些离群样本的特征,我们为未来改进数据集构建提供了见解,使漏洞检测更加可靠且更具成本效益。
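Categorising training examples by learning difficulty, as this abstract describes, is commonly done with per-example training dynamics. The sketch below assumes the usual dataset-map statistics (mean and standard deviation of the gold-label probability across epochs) and treats low-confidence, low-variability examples as "bad seeds". The thresholds and the function names `dataset_map` and `prune_bad_seeds` are illustrative assumptions, not values or code from the paper.

```python
from statistics import mean, pstdev


def dataset_map(gold_probs_per_epoch):
    """Map example id -> (confidence, variability), computed from the
    model's probability of the gold label at each training epoch."""
    return {ex: (mean(p), pstdev(p)) for ex, p in gold_probs_per_epoch.items()}


def prune_bad_seeds(gold_probs_per_epoch, conf_threshold=0.3, var_threshold=0.15):
    """Split examples into (keep, bad). An example counts as a bad seed when
    it stays low-confidence with low variability, i.e. the model never learns
    it. Both thresholds are assumptions made for this sketch."""
    keep, bad = [], []
    for ex, (conf, var) in dataset_map(gold_probs_per_epoch).items():
        if conf < conf_threshold and var < var_threshold:
            bad.append(ex)
        else:
            keep.append(ex)
    return keep, bad
```

For example, an item whose gold-label probability hovers around 0.05 to 0.10 in every epoch would be pruned, while one that swings between 0.1 and 0.8 would be kept as ambiguous but informative.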
Article 26
Title@2025-06-25 (3): The Composition of Digital Twins for Systems-of-Systems: a Systematic Literature Review
Title: The Composition of Digital Twins for Systems-of-Systems: a Systematic Literature Review | Die Zusammensetzung von digitalen Zwillingen für Systemsysteme: ein Systematischer Literaturbericht | 系统系统数字双对的构成:系统文献审查 2506.20435v1 |
Authors (2): Mennatullah T. Khedr, John S. Fitzgerald
Digital Twins (DTs) are increasingly used to model complex systems, especially in Cyber-Physical Systems (CPS) and System-of-Systems (SoS), where effective integration is key. This systematic literature review investigates DT composition and verification and validation (V&V) methodologies. Analyzing 21 studies from 2022-2024, we examined composition mechanisms, SoS characteristics, and V&V formality, scope, and challenges. While composition is discussed, formalization is limited. V&V approaches vary, with semi-formal methods and simulations dominating; formal verification is underutilized. Key technical challenges include model uncertainty and integration complexity. Methodological challenges highlight the lack of standardized DT-specific V&V frameworks. There is a need to move beyond model validation to address integration and cyber-physical consistency. This review contributes a structured classification of V&V approaches and emphasizes the need for standardized, scalable V&V and rigorous composition methodologies for complex DT implementations.
数字孪生(DT)越来越多地用于对复杂系统建模,尤其是在信息物理系统(CPS)和系统之系统(SoS)中,有效的集成在其中至关重要。本系统性文献综述考察了 DT 的组合以及验证与确认(V&V)方法。通过分析 2022 至 2024 年的 21 项研究,我们考察了组合机制、SoS 特征,以及 V&V 的形式化程度、范围和挑战。虽然文献讨论了组合,但形式化程度有限。V&V 方法各不相同,以半形式化方法和仿真为主;形式化验证未得到充分利用。关键的技术挑战包括模型不确定性和集成复杂性。方法论方面的挑战凸显了缺乏标准化的、针对 DT 的 V&V 框架。有必要超越模型确认,进而解决集成和信息物理一致性问题。本综述贡献了对 V&V 方法的结构化分类,并强调复杂的 DT 实现需要标准化、可扩展的 V&V 以及严谨的组合方法。
Article 27
Title@2025-06-25 (3): VulStamp: Vulnerability Assessment using Large Language Model
Title: VulStamp: Vulnerability Assessment using Large Language Model | VulStamp: Sicherheitsbewertung mit großem Sprachmodell | VulStamp:使用大语言模式进行脆弱性评估 2506.11484v2 |
Authors (5): Hao Shen, Ming Hu, Xiaofei Xie, Jiaye Li, Mingsong Chen
Although modern vulnerability detection tools enable developers to efficiently identify numerous security flaws, indiscriminate remediation efforts often lead to superfluous development expenses. This is particularly true given that a substantial portion of detected vulnerabilities either possess low exploitability or would incur negligible impact in practical operational environments. Consequently, vulnerability severity assessment has emerged as a critical component in optimizing software development efficiency. Existing vulnerability assessment methods typically rely on manually crafted descriptions associated with source code artifacts. However, due to variability in description quality and subjectivity in intention interpretation, the performance of these methods is seriously limited. To address this issue, this paper introduces VulStamp, a novel intention-guided framework, to facilitate description-free vulnerability assessment. Specifically, VulStamp adopts static analysis together with Large Language Model (LLM) to extract the intention information of vulnerable code. Based on the intention information, VulStamp uses a prompt-tuned model for vulnerability assessment. Furthermore, to mitigate the problem of imbalanced data associated with vulnerability types, VulStamp integrates a Reinforcement Learning (RL)-based prompt-tuning method to train the assessment model.
尽管现代脆弱性检测工具使开发者能够有效地查明许多安全缺陷,但不加区分的补救努力往往导致多余的发展费用,特别是考虑到发现的大量脆弱性要么具有低利用性,要么在实际操作环境中产生微不足道的影响,因此脆弱性严重程度评估已成为优化软件开发效率的一个关键组成部分;现有的脆弱性评估方法通常依赖与源代码人工制品有关的手工制作的描述;然而,由于描述质量和意图解释的主观性差异,这些方法的绩效受到严重限制;为解决这一问题,本文件介绍了VulStamp,这是一个新的意向指导框架,目的是促进无描述脆弱性评估;具体而言,VulStamp采用静态分析,与大语言模型一起,以获取脆弱代码的意向信息;根据意图信息,VulStamp使用一个迅速调整的模型进行脆弱性评估;此外,为了减轻与脆弱性类型相关的数据不平衡问题,VulStamp采用了基于强化学习的快速调整方法,以培训评估模型。
Article 28
Title@2025-06-25 (3): Lifting the Veil on Composition, Risks, and Mitigations of the Large Language Model Supply Chain
Title: Lifting the Veil on Composition, Risks, and Mitigations of the Large Language Model Supply Chain | Heben des Veils über Zusammensetzung, Risiken und Minderungen der Large Language Model Supply Chain | 揭开大型语言模型供应链的组成、风险与缓解措施的面纱 2410.21218v3 |
Authors (10): Kaifeng Huang, Bihuan Chen, You Lu, Susheng Wu, Dingji Wang, Yiheng Huang, Haowen Jiang, Zhuotong Zhou, Junming Cao, Xin Peng
Large language models (LLMs) have sparked significant impact with regard to both intelligence and productivity. Numerous enterprises have integrated LLMs into their applications to solve their own domain-specific tasks. However, integrating LLMs into specific scenarios is a systematic process that involves substantial components, which are collectively referred to as the LLM supply chain. A comprehensive understanding of LLM supply chain composition, as well as the relationships among its components, is crucial for enabling effective mitigation measures for different related risks. While existing literature has explored various risks associated with LLMs, there remains a notable gap in systematically characterizing the LLM supply chain from the dual perspectives of contributors and consumers. In this work, we develop a structured taxonomy encompassing risk types, risky actions, and corresponding mitigations across different stakeholders and components of the supply chain. We believe that a thorough review of the LLM supply chain composition, along with its inherent risks and mitigation measures, would be valuable for industry practitioners to avoid potential damages and losses, and enlightening for academic researchers to rethink existing approaches and explore new avenues of research.
大型语言模型(LLMs)在情报和生产力两方面都产生了重大影响。许多企业已经将LLMs纳入其应用中,以解决自己的具体领域任务。然而,将LLMs纳入具体情景是一个系统的过程,涉及大量组成部分,统称为LLM供应链。全面了解LLM供应链构成及其各组成部分之间的关系,对于针对不同相关风险采取有效的缓解措施至关重要。虽然现有文献探讨了与LLMs相关的各种风险,但在从捐助方和消费者的双重角度系统地确定LLM供应链的特征方面仍然存在显著差距。在这项工作中,我们制定了一个结构化的分类方法,包括风险类型、风险行动以及供应链不同利益攸关方和组成部分的相应缓解措施。我们认为,彻底审查LLM供应链构成及其固有的风险和缓解措施,对于业界从业人员避免潜在损害和损失,以及启发学术研究人员重新思考现有方法并探索新的研究途径,将很有价值。
Article 29
Title@2025-06-25 (3): Ten simple rules for PIs to integrate Research Software Engineering into their research group
Title: Ten simple rules for PIs to integrate Research Software Engineering into their research group | Zehn einfache Regeln für PIs zur Integration von Research Software Engineering in ihre Forschungsgruppe | 十条简单规则,供各研究所将研究软件工程纳入其研究组 2506.20217v1 |
Authors (11): Stuart M. Allen, Neil Chue Hong, Stephan Druskat, Toby Hodges, Daniel S. Katz, Jan Linxweiler, Frank Löffler, Lars Grunske, Heidi Seibold, Jan Philipp Thiele, Samantha Wittke
Research Software Engineering (RSEng) is a key success factor in producing high-quality research software, which in turn enables and improves research outcomes. However, as a principal investigator or leader of a research group you may not know what RSEng is, where to get started with it, or how to use it to maximize its benefit for your research. RSEng also often comes with technical complexity, and therefore reduced accessibility to some researchers. The ten simple rules presented in this paper aim to improve the accessibility of RSEng, and provide practical and actionable advice to PIs and leaders for integrating RSEng into their research group. By following these rules, readers can improve the quality, reproducibility, and trustworthiness of their research software, ultimately leading to better, more reproducible and more trustworthy research outcomes.
研究软件工程(RSEng)是生产高质量研究软件(RSEng)的关键成功因素,而高质量研究软件(RSEng)反过来又能够并改进研究成果。然而,作为研究小组的主要调查员或领导者,你可能不知道RSENG是什么,从哪里开始,或如何利用RSENG为研究带来最大利益。 RSEng也常常带来技术复杂性,从而降低某些研究人员的可及性。 本文提出的十项简单规则旨在改善RSENG的可及性,并为PIS和领导人提供切实可行的建议,以便将RSEng纳入其研究小组。 遵循这些规则,读者可以提高研究软件的质量、可复制性和可信度,最终导致更好、更可复制和更可信赖的研究成果。
Article 30
Title@2025-06-25 (3): Research Artifacts in Secondary Studies: A Systematic Mapping in Software Engineering
Title: Research Artifacts in Secondary Studies: A Systematic Mapping in Software Engineering | Forschungs-Artefakte in Sekundärstudien: Ein systematisches Mapping in der Software-Engineering | 中等研究中的研究异异物研究:软件工程系统绘图 2504.12646v2 |
Authors (3): Aleksi Huotala, Miikka Kuutila, Mika Mäntylä
Context: Systematic reviews (SRs) summarize state-of-the-art evidence in science, including software engineering (SE). Objective: Our objective is to evaluate how SRs report research artifacts and to provide a comprehensive list of these artifacts. Method: We examined 537 secondary studies published between 2013 and 2023 to analyze the availability and reporting of research artifacts. Results: Our findings indicate that only 31.5% of the reviewed studies include research artifacts. Encouragingly, the situation is gradually improving, as our regression analysis shows a significant increase in the availability of research artifacts over time. However, in 2023, just 62.0% of secondary studies provide a research artifact, while an even lower percentage, 30.4%, use a permanent repository with a digital object identifier (DOI) for storage. Conclusion: To enhance transparency and reproducibility in SE research, we advocate for the mandatory publication of research artifacts in secondary studies.
系统审查:系统审查(SRs)总结科学,包括软件工程(SE)的最新科学证据。目标:我们的目标是评估SRs如何报告研究文物,并提供这些文物的全面清单。方法:我们审查了2013年至2023年出版的537份次级研究,以分析研究文物的可用性和报告情况。结果:我们的调查结果表明,所审查的研究中只有31.5%包括研究文物。令人鼓舞的是,情况正在逐步改善,因为我们的回归分析表明,研究文物的可用性随着时间推移显著增加。然而,在2023年,只有62.0%的次级研究提供了研究文物,而这一比例则更低,30.4%使用带有数字物体识别器(DOI)的永久储存库储存。结论:为了提高SE研究的透明度和再生能力,我们主张在二级研究中强制出版研究文物。
Article 31
Title@2025-06-25 (3): Zero-Shot Attribution for Large Language Models: A Distribution Testing Approach
Title: Zero-Shot Attribution for Large Language Models: A Distribution Testing Approach | Zero-Shot Attribution für große Sprachmodelle: Ein Distributionstestverfahren | 大语言模式零点位数:分销测试方法 2506.20197v1 |
Authors (3): Clément L. Canonne, Yash Pote, Uddalok Sarkar
A growing fraction of all code is sampled from Large Language Models (LLMs). We investigate the problem of attributing code generated by language models using hypothesis testing to leverage established techniques and guarantees. Given a set of samples $S$ and a suspect model $\mathcal{L}^*$, our goal is to assess the likelihood of $S$ originating from $\mathcal{L}^*$. Due to the curse of dimensionality, this is intractable when only samples from the LLM are given: to circumvent this, we use both samples and density estimates from the LLM, a form of access commonly available. We introduce $\mathsf{Anubis}$, a zero-shot attribution tool that frames attribution as a distribution testing problem. Our experiments on a benchmark of code samples show that $\mathsf{Anubis}$ achieves high AUROC scores ($\ge 0.9$) when distinguishing between LLMs like DeepSeek-Coder, CodeGemma, and Stable-Code using only $\approx 2000$ samples.
越来越多的代码采样自大语言模型(LLMs)。我们研究语言模型生成代码的归属问题,使用假设检验来利用既有的技术和保证。给定一组样本 $S$ 和一个可疑模型 $\mathcal{L}^*$,我们的目标是评估 $S$ 来自 $\mathcal{L}^*$ 的可能性。由于维度灾难,当只提供来自LLM的样本时,这一问题难以处理:为了绕过这一点,我们同时使用来自LLM的样本和密度估计,这是一种常见的访问形式。我们引入了 $\mathsf{Anubis}$,一种将归属表述为分布检验问题的零样本归属工具。我们在代码样本基准上的实验表明,在区分DeepSeek-Coder、CodeGemma和Stable-Code等LLM时,$\mathsf{Anubis}$ 仅使用 $\approx 2000$ 个样本即可获得较高的AUROC分数($\ge 0.9$)。
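The idea of combining samples with density estimates from a suspect model can be illustrated with a toy likelihood comparison. The sketch below is not the $\mathsf{Anubis}$ algorithm; it is a generic illustration under stated assumptions: the two "models" are hypothetical four-outcome distributions, and `attribution_score` is a made-up statistic that compares how the suspect scores its own samples with how it scores the samples under test.

```python
import math
import random

def avg_log_prob(samples, prob):
    """Mean log-probability of the samples under a model's density estimate."""
    return sum(math.log(prob[s]) for s in samples) / len(samples)

def attribution_score(samples, suspect_prob, suspect_samples):
    """Gap between how the suspect scores its own samples and the given ones.

    Near zero: the samples look like the suspect's output.
    Large positive: the samples likely come from another source.
    """
    return avg_log_prob(suspect_samples, suspect_prob) - avg_log_prob(samples, suspect_prob)

def draw(prob, n, rng):
    """Draw n outcomes from a discrete distribution given as {outcome: probability}."""
    return rng.choices(list(prob), weights=list(prob.values()), k=n)

# Hypothetical toy "models": distributions over four code snippets.
suspect = {"a": 0.7, "b": 0.1, "c": 0.1, "d": 0.1}
other = {"a": 0.1, "b": 0.1, "c": 0.1, "d": 0.7}

rng = random.Random(0)
own = draw(suspect, 2000, rng)                # reference samples from the suspect
same_source = draw(suspect, 2000, rng)        # samples that do come from the suspect
different_source = draw(other, 2000, rng)     # samples from a different model

score_same = attribution_score(same_source, suspect, own)
score_diff = attribution_score(different_source, suspect, own)
```

In this toy setting `score_same` stays close to zero while `score_diff` is clearly positive; a real attribution test would need a calibrated threshold and guarantees of the kind the abstract refers to.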
Article 32
Title@2025-06-25 (3): AI and Agile Software Development: From Frustration to Success – XP2025 Workshop Summary
Title: AI and Agile Software Development: From Frustration to Success – XP2025 Workshop Summary | KI und Agile Software-Entwicklung: Von der Frustration zum Erfolg – XP2025 Workshop Zusammenfassung | AI和Alile软件开发:从挫折到成功 – – XP2025讲习班摘要 2506.20159v1 |
Authors (5): Tomas Herda, Victoria Pichler, Zheying Zhang, Pekka Abrahamsson, Geir K. Hanssen
The full-day workshop on AI and Agile at XP 2025 convened a diverse group of researchers and industry practitioners to address the practical challenges and opportunities of integrating Artificial Intelligence into Agile software development. Through interactive sessions, participants identified shared frustrations related to integrating AI into Agile Software Development practices, including challenges with tooling, governance, data quality, and critical skill gaps. These challenges were systematically prioritized and analyzed to uncover root causes. The workshop culminated in the collaborative development of a research roadmap that pinpoints actionable directions for future work, including both immediate solutions and ambitious long-term goals. The key outcome is a structured agenda designed to foster joint industry-academic efforts to move from identified frustrations to successful implementation.
2025年XP关于AI和Agile的全天讲习班召集了一组不同的研究人员和业界从业人员,讨论将人工智能纳入Agile软件开发的实际挑战和机遇,通过互动会议,与会者查明了将AI纳入Agile软件开发做法的共同挫折感,包括工具、治理、数据质量和关键技能差距方面的挑战,对这些挑战进行了系统的优先排序和分析,以找出根源;讲习班最终通过合作制定了一份研究路线图,确定未来工作的可操作方向,包括即时解决方案和雄心勃勃的长期目标;主要成果是一项结构化的议程,目的是促进工业-学术联合努力,从查明的挫折感转向成功的执行。
Article 33
Title@2025-06-24 (2): When Domains Collide: An Activity Theory Exploration of Cross-Disciplinary Collaboration
Title: When Domains Collide: An Activity Theory Exploration of Cross-Disciplinary Collaboration | When Domains Collide: Eine Aktivitätstheorie zur Erforschung der disziplinübergreifenden Zusammenarbeit | 当域碰撞:跨纪律协作活动理论探索时 2506.20063v1 |
Authors (6): Zixuan Feng, Thomas Zimmermann, Lorenzo Pisani, Christopher Gooley, Jeremiah Wander, Anita Sarma
Background: Software development teams are increasingly diverse, embedded, and cross-disciplinary. Domain experts (DEs) from different disciplines collaborate with professional software developers (SDEs), bringing complementary expertise in creating and maintaining complex production software. However, contested expectations, divergent problem-solving perspectives, and conflicting priorities lead to friction. Aims: This study aims to investigate the dynamics of emerging collaboration of cross-disciplinary software development (CDSD) by exploring the expectations held by DEs and SDEs and understanding how these frictions manifest in practice. Method: We utilize Activity Theory (AT), a well-established socio-technical framework, as an analytical lens in a grounded, empirical investigation, conducted through a mixed-method study involving 24 interviews (12 DEs and 12 SDEs) and a large-scale validation survey with 293 participants (161 DEs and 132 SDEs). Results: We conceptualize and empirically ground the CDSD dynamics. We identified eight expectations held by SDEs and six by DEs. By mapping these expectations to AT components, we revealed 21 frictions in CDSD and illustrated where and how they arise. Conclusions: This study offers a theoretical lens for understanding the dynamics and frictions in CDSD and provides actionable insights for future research, practitioners, and infrastructure design.
软件开发团队日益多样化、嵌入和跨学科。来自不同学科的专家与专业软件开发者(SDEs)合作,在创建和维护复杂的生产软件方面提供互补的专门知识。然而,有争议的期望、不同的解决问题观点和相互冲突的优先事项导致摩擦。目的:本研究的目的是通过探索DEs和SDEs持有的期望并了解这些摩擦在实践中如何表现来调查跨学科软件开发(CDSD)新兴协作的动态,并了解这些摩擦的实际表现。方法:我们利用活动理论(AT)这个成熟的社会技术框架,作为基础、经验性调查的分析透镜,通过由24次访谈(12个DEs和12个SDEs)进行的混合方法研究以及293名参与者(161个DEs和132个SDEs)进行的大规模验证调查,进行。结果:我们从概念上和从经验上确定了CDSD动态的8项期望和DEs所持有的6项期望。我们通过向AT组成部分绘制这些期望图,揭示了CDSD的21项摩擦,并说明了它们在何处和如何产生的。结论:本研究为CDSD的未来研究、可理解的理论视角,为CDSD设计中的动态和设计提供了可理解。
Article 34
Title@2025-06-24 (2): QHackBench: Benchmarking Large Language Models for Quantum Code Generation Using PennyLane Hackathon Challenges
Title: QHackBench: Benchmarking Large Language Models for Quantum Code Generation Using PennyLane Hackathon Challenges | QHackBench: Benchmarking großer Sprachmodelle für die Quantencode-Generation mit PennyLane Hackathon-Herausforderungen | QHackBench:利用PennyLane Hackathon挑战为量制代码生成量设定大语言模式基准 2506.20008v1 |
Authors (7): Abdul Basit, Minghao Shao, Haider Asif, Nouhaila Innan, Muhammad Kashif, Alberto Marchisio, Muhammad Shafique
Recent advances in Large Language Models (LLMs) have demonstrated strong potential in code generation, yet their effectiveness in quantum computing remains underexplored. This paper benchmarks LLMs for PennyLane-based quantum code generation using real-world challenges from the Quantum Hackathon (QHack). We introduce QHackBench, a novel benchmark dataset derived from QHack competitions, and evaluate model performance under vanilla prompting and Retrieval-Augmented Generation (RAG). Our structured evaluation framework assesses functional correctness, syntactic validity, and execution success across varying challenge difficulties. Results indicate that RAG-enhanced models, supplemented with an augmented PennyLane dataset, produce results roughly similar to those of standard prompting, particularly in complex quantum algorithms. Additionally, we introduce a multi-agent evaluation pipeline that iteratively refines incorrect solutions, further enhancing execution success rates. To foster further research, we commit to publicly releasing QHackBench, along with our evaluation framework and experimental results, enabling continued advancements in AI-assisted quantum programming.
大语言模型(LLMs)最近的进展表明,其在代码生成方面有巨大的潜力,但其在量子计算方面的效力仍未得到充分探讨。本文利用量子黑客马拉松(QHack)的真实挑战,对基于PennyLane的量子代码生成的LLM进行基准测试。
Article 35
Title@2025-06-24 (2): Can Language Models Replace Programmers for Coding? REPOCOD Says ‘Not Yet’
Title: Can Language Models Replace Programmers for Coding? REPOCOD Says ‘Not Yet’ | Können Sprachmodelle Programmierer für Coding ersetzen? REPOCOD sagt ‘Noch nicht’ | 语言模式能替换编码程序程序员吗? REPOCOD 说“ 还没有” 。 2410.21647v4 |
Authors (4): Shanchao Liang, Yiran Hu, Nan Jiang, Lin Tan
Recently, a number of repository-level code generation benchmarks, such as CoderEval, DevEval, RepoEval, RepoBench, and LongCodeArena, have emerged to evaluate the capabilities of large language models (LLMs) beyond standalone benchmarks like HumanEval and MBPP. Thus, a natural question is: would LLMs perform as well in real-world coding tasks as they do on these benchmarks? Unfortunately, one cannot answer this question, since these benchmarks consist of short completions or synthetic examples, or focus on limited-scale repositories, failing to represent real-world coding tasks. To address these challenges, we create REPOCOD, a Python code-generation benchmark containing complex tasks with realistic dependencies in real-world large projects and appropriate metrics for evaluating source code. It includes 980 whole-function generation tasks from 11 popular projects, 50.8% of which require repository-level context. REPOCOD includes 314 developer-written test cases per instance for better evaluation. We evaluate ten LLMs on REPOCOD and find that none achieves more than 30% pass@1, indicating the necessity of building stronger LLMs that can help developers in real-world software development. In addition, we found that retrieval-augmented generation achieves better results than using target function dependencies as context.
最近,一些仓库级代码生成基准(如CoderEval、DevEval、RepoEval、RepoEval、Repobench、RepoBench和LongCodeArena)已经出现,以评价大型语言模型(LLMs)的能力,而这种模型超出了人类经济学和MBP等独立基准的范围。因此,一个自然的问题是,LLOMs在现实世界的编码任务中是否具有与这些基准的业绩相类似的性能?不幸的是,无法回答这一问题,因为这些基准包括短期完成、合成实例或侧重于有限规模的储存库,无法代表现实世界的编码任务。为了应对这些挑战,我们创建了REPOCOD,这是一个Python 代码生成基准,其中包含在现实世界大型项目中具有实际依赖性依赖性的复杂任务以及评估源码的适当指标。它包括11个流行项目中的980项全功能生成任务,其中50.8%需要具备储存级别的背景。REPOCD包括314个开发者编写的测试案例,以更好地评估。我们评估了REOCDD,发现在REPD上没有取得超过30%的成绩,我们在Sreal-OD上能够更有力地建立更强有力的软件开发公司。
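For readers unfamiliar with the pass@1 figure quoted above, the sketch below shows the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n generated samples of which c pass all tests. Whether REPOCOD computes its numbers exactly this way is an assumption; the function names and the example counts are illustrative only.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k for one task: n samples generated, c of them pass all tests."""
    if n - c < k:
        # Every size-k subset of the samples contains at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(results, k):
    """Average pass@k over tasks; results is a list of (n, c) pairs, one per task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Hypothetical counts: task 1 passes 3 of 10 samples, task 2 passes none.
example_score = benchmark_pass_at_k([(10, 3), (10, 0)], k=1)
```

With k = 1 the estimator reduces to the fraction of passing samples per task, averaged over tasks.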
Article 36
Title@2025-06-24 (2): WAFFLE: Finetuning Multi-Modal Model for Automated Front-End Development
Title: WAFFLE: Finetuning Multi-Modal Model for Automated Front-End Development | WAFFLE: Feinsteuerungs-Multi-Modal-Modell für automatisierte Front-End-Entwicklung | WAFFLE: 自动前端开发的微调多模式模型 2410.18362v2 |
Authors (4): Shanchao Liang, Nan Jiang, Shangshu Qian, Lin Tan
Web development involves turning UI designs into functional webpages, which can be difficult for both beginners and experienced developers due to the complexity of HTML’s hierarchical structures and styles. While Large Language Models (LLMs) have shown promise in generating source code, two major challenges persist in UI-to-HTML code generation: (1) effectively representing HTML’s hierarchical structure for LLMs, and (2) bridging the gap between the visual nature of UI designs and the text-based format of HTML code. To tackle these challenges, we introduce Waffle, a new fine-tuning strategy that uses a structure-aware attention mechanism to improve LLMs’ understanding of HTML’s structure and a contrastive fine-tuning approach to align LLMs’ understanding of UI images and HTML code. Models fine-tuned with Waffle show up to 9.00 pp (percentage point) higher HTML match, 0.0982 higher CW-SSIM, 32.99 higher CLIP, and 27.12 pp higher LLEM on our new benchmark WebSight-Test and an existing benchmark Design2Code, outperforming current fine-tuning methods.
由于HTML的等级结构和风格的复杂性,对于初学者和有经验的开发者来说,将UI设计转换成功能性网页可能是困难的。虽然大语言模型在生成源代码方面显示出希望,但在UI-HTML代码生成方面仍然存在两大挑战:(1) 有效代表HTML的LLM的等级结构,(2) 缩小UI设计视觉性质与HTML代码基于文本的格式之间的差距。为了应对这些挑战,我们引入了Waffle(Waffle)这一新的微调战略,它使用一种结构觉悟的注意机制来提高LAM对HTML结构的理解,以及一种对比性微调方法来调整LLMMS对UI图像和 HTML代码的理解。用Waffle(百分比)微调的模型显示高达9.00 pp(百分比) 更高HTML匹配值,0.0982 更高CW-SSIM, 32.99 更高CLIP和27.12 pp(LEM)在我们的新基准WebSight-ight Test和现有基准设计2Code, 优于当前的微调方法。
Article 37
Title@2025-06-24 (2): An Empirical Investigation on the Challenges in Scientific Workflow Systems Development
Title: An Empirical Investigation on the Challenges in Scientific Workflow Systems Development | Eine empirische Untersuchung der Herausforderungen in der Entwicklung wissenschaftlicher Workflowsysteme | 关于科学工作流程系统开发挑战的经验调查 2411.10890v2 |
Authors (4): Khairul Alam, Banani Roy, Chanchal K. Roy, Kartik Mittal
Scientific Workflow Systems (SWSs) are advanced software frameworks that drive modern research by orchestrating complex computational tasks and managing extensive data pipelines. These systems offer a range of essential features, including modularity, abstraction, interoperability, workflow composition tools, resource management, error handling, and comprehensive documentation. Utilizing these frameworks accelerates the development of scientific computing, resulting in more efficient and reproducible research outcomes. However, developing a user-friendly, efficient, and adaptable SWS poses several challenges. This study explores these challenges through an in-depth analysis of interactions on Stack Overflow (SO) and GitHub, key platforms where developers and researchers discuss and resolve issues. In particular, we leverage topic modeling (BERTopic) to understand the topics SWS developers discuss on these platforms. We identified 10 topics developers discuss on SO (e.g., Workflow Creation and Scheduling, Data Structures and Operations, Workflow Execution) and found that workflow execution is the most challenging. By analyzing GitHub issues, we identified 13 topics (e.g., Errors and Bug Fixing, Documentation, Dependencies) and discovered that data structures and operations is the most difficult. We also found common topics between SO and GitHub, such as data structures and operations, task management, and workflow scheduling. Additionally, we categorized each topic by type (How, Why, What, and Others). We observed that the How type consistently dominates across all topics, indicating a need for procedural guidance among developers. The dominance of the How type is also evident in domains like chatbots and mobile development. Our study will guide future research in proposing tools and techniques to help the community overcome the challenges developers face when developing SWSs.
科学工作流程系统(SWS)是先进的软件框架,通过组织复杂的计算任务和管理广泛的数据管道来推动现代研究,这些系统提供一系列基本特征,包括模块性、抽象性、互操作性、工作流程构成工具、资源管理、错误处理和综合文件。利用这些框架加快了科学计算的发展,从而产生更有效和可复制的研究成果。然而,开发一个方便用户、高效和适应性强的SWS(SWS)带来了若干挑战。本研究通过深入分析Stack Ooverflow(SO)和GitHub(开发者和研究人员讨论和解决问题的主要平台)的互动来探讨这些挑战。这些系统提供了一系列基本特征,包括模块化、抽象性、互操作性、工作流程。 利用这些模型(BERTopic)来理解SWSWS的开发者在这些平台上讨论的专题(例如:工作流程的创建和调整、数据结构和操作、工作流程执行)发现工作流程执行最具有挑战性的问题。通过分析GitHub(SUB),我们还确定了13个主题(例如,错误和错误修正者、文档、文档、文件、依赖数据结构)以及SOIal Studlear Stal Stal Stal Stal Stal Studutes) 将发现的数据结构和流程结构和Shedustr 将如何每个。
Article 38
Title@2025-06-24 (2): Exploring Developer Experience Factors in Software Ecosystems
Title: Exploring Developer Experience Factors in Software Ecosystems | Erforschen von Entwickler-Erfahrungsfaktoren in Software-Ökosystemen | 探索软件生态系统中开发者经验因素 2506.19757v1 |
Authors (5): Rodrigo Oliveira Zacarias, Léo Carvalho Ramos Antunes, Márcio de Oliveira Barros, Rodrigo Pereira dos Santos, Patricia Lago
Context: Developer experience (DX) plays a key role in developers’ performance and their continued involvement in a software ecosystem (SECO) platform. While researchers and practitioners have recognized several factors affecting DX in SECO platforms, a clear roadmap of the most influential factors is still missing. This is particularly important given the direct impact on developers’ interest in SECO and their ongoing engagement with the common technological platform. Goal: This work aims to identify key DX factors and understand how they influence third-party developers’ decisions to adopt and keep contributing to a SECO. Methods: We conducted a systematic mapping study (SMS), analyzing 29 studies to assess the state-of-the-art of DX in SECO. Additionally, we conducted a Delphi study to evaluate the influence of 27 DX factors (identified in our SMS) from the perspective of 21 third-party developers to adopt and keep contributing to a SECO. Results: The factors that most strongly influence developers’ adoption and ongoing contributions to a SECO are: financial costs for using the platform, desired technical resources for development, low barriers to entry into the applications market, and more financial gains. Conclusion: DX is essential for the success and sustainability of SECO. Our set of DX factors provides valuable insights and recommendations for researchers and practitioners to address key DX concerns from the perspective of third-party developers.
背景:开发者经验(DX)在开发者业绩及其继续参与软件生态系统平台方面发挥着关键作用;研究人员和从业人员认识到影响SECO平台DX的若干因素,但仍缺乏关于最有影响力因素的明确路线图;鉴于对开发者对SECO的兴趣及其与共同技术平台的持续接触的直接影响,这一点特别重要。 目标:这项工作旨在确定关键DX因素,并了解它们如何影响第三方开发者通过和不断促进SECO的决定。方法:我们进行了系统的绘图研究,分析了29项研究,以评估SECODX的最新水平。此外,我们进行了德尔菲研究,从21个第三方开发者的角度评价27个DX因素(我们在SMS中确定)的影响,以通过和保持对SECO的贡献。结果:最强烈影响开发者通过和不断促进SECO的决定的因素是:使用平台的金融成本、发展所需的技术资源、进入应用市场的低壁垒以及更多的财政收益。结论:DX是,DFX对研究人员的成功和可持续性至关重要,而DX是我们研究者提出的重要见解。
Article 39
Title@2025-06-24 (2): Simulating the Waterfall Model: A Systematic Review
Title: Simulating the Waterfall Model: A Systematic Review | Simulation des Wasserfallmodells: Eine systematische Überprüfung | 模拟瀑瀑瀑模型:系统审查 2506.19653v1 |
Authors (1): Antonios Saravanos
This systematic mapping study examines how the Waterfall Model has been represented in computational simulations within peer-reviewed literature. While Agile methodologies dominate contemporary software design practices, the Waterfall Model persists, particularly within hybrid approaches that fuse structured, sequential workflows with the adaptability of agile practices. Despite its continued presence, little attention has been given to how the Waterfall Model is simulated in research contexts. A structured search of major academic databases identified 68 peer-reviewed studies published between 2000 and 2024. After applying inclusion criteria, selected studies were analyzed across four dimensions: (1) simulation methodologies (e.g., discrete-event simulation, system dynamics), (2) platforms and tools (e.g., Simphony.NET, SimPy), (3) geographic and temporal trends, and (4) fidelity to Royce’s original seven-phase model. Discrete-event simulation was most commonly used, reflecting the model’s sequential nature. Early work relied on proprietary platforms, while recent studies increasingly use open-source, Python-based tools. No studies fully implemented Royce’s original formulation; most employed adaptations. These findings suggest that, although niche, simulation of the Waterfall Model is present in academic discourse. This work highlights the need for accessible modeling tools and calls for future research that integrates the waterfall software process model with modern hybrid practices.
这一系统绘图研究审查了《瀑布模型》如何在同行审评文献中的计算模拟中体现。尽管《瀑布模型》在当代软件设计做法中占主导地位,但《瀑布模型》在混合方法(例如Simphony.NET、SimPy)中继续存在,特别是在结构化、顺序工作流程与灵活做法的适应性相结合的混合方法中。尽管它继续存在,但对如何在研究背景下模拟《瀑布模型》却很少注意。对主要学术数据库进行的结构性搜索确定了2000年至2024年期间出版的68项同行审评研究。在应用了包容标准之后,对选定的研究进行了四个方面分析:(1)模拟方法(例如独立事件模拟、系统动态)、(2)平台和工具(例如Simphony.NET、SimPy)、(3)地理和时间趋势以及(4)对Royce最初的七阶段模型的忠实性。在使用时最为普遍,这种模糊的活动模拟反映了模型的顺序性质。早期工作依赖于专利平台,而最近的研究则越来越多地使用开放源、基于Python的工具。没有充分实施Royce的原始设计研究,使用最多的应用了适应方法。这些结论表明,现在的模型的模型的模型的模型的模型是用来模拟。
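To make "discrete-event simulation of a sequential process" concrete, here is a minimal sketch using only the Python standard library (not SimPy or Simphony.NET). It is not taken from any of the reviewed studies. The phase names follow the commonly cited seven phases of Royce's model; the durations, the one-team-per-phase assumption, and the function name `simulate` are hypothetical.

```python
import heapq

# Commonly cited phases of Royce's model; used here only as labels.
PHASES = ["system requirements", "software requirements", "analysis",
          "program design", "coding", "testing", "operations"]

def simulate(num_projects, durations):
    """Run projects through the phases in strict order.

    Assumption: each phase has one team that serves one project at a time.
    Returns a dict mapping project id to its completion time.
    """
    # Each event is (time, project, index of the phase the project arrives at).
    events = [(0.0, project, 0) for project in range(num_projects)]
    heapq.heapify(events)
    team_free_at = [0.0] * len(PHASES)
    done = {}
    while events:
        time, project, phase = heapq.heappop(events)  # next event in time order
        if phase == len(PHASES):
            done[project] = time                      # project left the last phase
            continue
        start = max(time, team_free_at[phase])        # wait if the team is busy
        finish = start + durations[phase]
        team_free_at[phase] = finish
        heapq.heappush(events, (finish, project, phase + 1))
    return done

# Hypothetical durations in weeks; coding is the bottleneck.
completion = simulate(2, [1, 1, 1, 1, 3, 1, 1])
```

Because events are processed in time order, each phase serves projects first come, first served, and the second project queues behind the first at the slow coding phase.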
Article 40
Title@2025-06-24 (2): A Verification Methodology for Safety Assurance of Robotic Autonomous Systems
Title: A Verification Methodology for Safety Assurance of Robotic Autonomous Systems | Eine Verifikationsmethodik für die Sicherheit von Roboter autonomen Systemen | 机器人自主系统安全保证核查方法 2506.19622v1 |
Authors (3): Mustafa Adam, David A. Anisi, Pedro Ribeiro
Autonomous robots deployed in shared human environments, such as agricultural settings, require rigorous safety assurance to meet both functional reliability and regulatory compliance. These systems must operate in dynamic, unstructured environments, interact safely with humans, and respond effectively to a wide range of potential hazards. This paper presents a verification workflow for the safety assurance of an autonomous agricultural robot, covering the entire development life-cycle, from concept study and design to runtime verification. The outlined methodology begins with a systematic hazard analysis and risk assessment to identify potential risks and derive corresponding safety requirements. A formal model of the safety controller is then developed to capture its behaviour and verify that the controller satisfies the specified safety properties with respect to these requirements. The proposed approach is demonstrated on a field robot operating in an agricultural setting. The results show that the methodology can be effectively used to verify safety-critical properties and facilitate the early identification of design issues, contributing to the development of safer robots and autonomous systems.
在诸如农业环境等人类共同环境中部署的自主机器人需要严格的安全保障,以满足功能可靠性和监管合规性。这些系统必须在动态、无结构的环境中运作,与人类安全互动,并有效应对各种潜在危害。本文件为自主农业机器人的安全保障提供了一个核查工作流程,涵盖从概念研究和设计到运行时间核查的整个开发生命周期,涵盖从概念研究和设计到运行时间核查的整个开发生命周期。概述的方法首先进行系统的危险分析和风险评估,以查明潜在风险并得出相应的安全要求。随后开发了安全控制器的正式模型,以捕捉其行为,并核实控制器符合这些要求的特定安全特性。拟议方法在农业环境中运行的实地机器人上展示。结果显示,该方法可以有效地用于核查安全临界特性,便利尽早发现设计问题,有助于开发更安全的机器人和自主系统。
Article 41
Title@2025-06-24 (2): Probabilistic modelling and safety assurance of an agriculture robot providing light-treatment
Title: Probabilistic modelling and safety assurance of an agriculture robot providing light-treatment | Probabilistische Modellierung und Sicherheitsgarantie eines landwirtschaftlichen Roboters zur Lichtbehandlung | 提供轻处理的农业机器人的概率建模和安全保障 2506.19620v1 |
Authors (6): Mustafa Adam, Kangfeng Ye, David A. Anisi, Ana Cavalcanti, Jim Woodcock, Robert Morris
Continued adoption of agricultural robots presupposes the farmer’s trust in the reliability, robustness and safety of the new technology. This motivates our work on safety assurance of agricultural robots, particularly their ability to detect, track and avoid obstacles and humans. This paper considers a probabilistic modelling and risk analysis framework for use in the early development phases. Starting with hazard identification and a risk assessment matrix, we capture the behaviour of the mobile robot platform, the sensor and perception system, and any humans present using three state machines. An auto-generated probabilistic model is then solved and analysed using the probabilistic model checker PRISM. The result provides unique insight into fundamental development and engineering aspects by quantifying the effect of the risk mitigation actions and risk reduction associated with distinct design concepts. These include implications of adopting a higher performance and more expensive Object Detection System or opting for a more elaborate warning system to increase human awareness. Although this paper mainly focuses on the initial concept-development phase, the proposed safety assurance framework can also be used during implementation, and subsequent deployment and operation phases.
继续采用农业机器人假定农民信任新技术的可靠性、稳健性和安全性。这促使我们开展农业机器人的安全保障工作,特别是他们探测、跟踪和避免障碍和人类的能力。本文件审议了早期开发阶段使用的概率建模和风险分析框架。从危险识别和风险评估矩阵开始,移动机器人平台、传感器和感知系统的行为以及任何在场的人使用三种国家机器捕获。然后,利用概率模型检查器PRIS来解决和分析自动生成的概率模型。结果通过量化风险缓解行动和降低风险的影响以及与不同设计概念相关的风险,对基本发展和工程方面提供了独特的洞察力。其中包括采用更高性能和更昂贵的物体探测系统或选择更精细的预警系统以提高人类认识的影响。尽管本文件主要侧重于初始概念开发阶段,但拟议的安全保障框架也可以用于实施阶段以及随后的部署和运作阶段。
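The kind of question a probabilistic model checker answers, such as "what is the probability of eventually reaching a hazardous state?", can be sketched by hand for a very small Markov chain. The code below is not PRISM input and not the paper's model: the four states, the detection and warning probabilities, and both function names are hypothetical, chosen only to show how two design options might be compared.

```python
def reach_probability(transitions, target, start, iterations=1000):
    """Probability of eventually reaching `target` from `start` in a Markov chain.

    transitions maps each state to {successor: probability}; states with no
    successors are treated as absorbing. Solved by simple fixed-point iteration.
    """
    prob = {state: 0.0 for state in transitions}
    prob[target] = 1.0
    for _ in range(iterations):
        for state, successors in transitions.items():
            if state != target and successors:
                prob[state] = sum(p * prob[nxt] for nxt, p in successors.items())
    return prob[start]

def robot_chain(detect, heed):
    """Hypothetical chain: a nearby human is detected with probability `detect`;
    a warned human moves away with probability `heed`, otherwise stays nearby."""
    return {
        "human_near": {"warned": detect, "collision": 1.0 - detect},
        "warned": {"safe": heed, "human_near": 1.0 - heed},
        "safe": {},
        "collision": {},
    }

# Compare a cheaper and a more capable (hypothetical) object detection system.
risk_basic = reach_probability(robot_chain(0.90, 0.8), "collision", "human_near")
risk_better = reach_probability(robot_chain(0.99, 0.8), "collision", "human_near")
```

For this chain the result can be checked in closed form: the collision probability is (1 - detect) / (1 - detect * (1 - heed)), so the better detector lowers the risk by roughly a factor of ten.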
Article 42
Title@2025-06-24 (2): Can LLMs Replace Humans During Code Chunking?
Title: Can LLMs Replace Humans During Code Chunking? | Können LLMs Menschen beim Code-Chunking ersetzen? | LLMs能否在代码启动时替换人类? 2506.19897v1 |
Authors (17): Christopher Glasz, Emily Escamilla, Eric O. Scott, Anand Patel, Jacob Zimmer, Colin Diggs, Michael Doyle, Scott Rosen, Nitin Naik, Justin F. Brunelle, Samruddhi Thaker, Parthav Poudel, Arun Sridharan, Amit Madan, Doug Wendt, William Macke, Thomas Schill
Large language models (LLMs) have become essential tools in computer science, especially for tasks involving code understanding and generation. However, existing work does not address many of the unique challenges presented by code written for government applications. In particular, government enterprise software is often written in legacy languages like MUMPS or assembly language code (ALC) and the overall token lengths of these systems exceed the context window size for current commercially available LLMs. Additionally, LLMs are primarily trained on modern software languages and have undergone limited testing with legacy languages, making their ability to understand legacy languages unknown and, hence, an area for empirical study. This paper examines the application of LLMs in the modernization of legacy government code written in ALC and MUMPS, addressing the challenges of input limitations. We investigate various code-chunking methods to optimize the generation of summary module comments for legacy code files, evaluating the impact of code-chunking methods on the quality of documentation produced by different LLMs, including GPT-4o, Claude 3 Sonnet, Mixtral, and Llama 3. Our results indicate that LLMs can select partition points closely aligned with human expert partitioning. We also find that chunking approaches have significant impact on downstream tasks such as documentation generation. LLM-created partitions produce comments that are up to 20% more factual and up to 10% more useful than when humans create partitions. Therefore, we conclude that LLMs can be used as suitable replacements for human partitioning of large codebases during LLM-aided modernization.
大型语言模型(LLMS)已成为计算机科学的基本工具,特别是涉及代码理解和生成的任务;然而,现有工作并没有解决为政府应用程序编写的代码带来的许多独特挑战;特别是,政府企业软件往往用MUMPS或组装语言代码(ALC)等遗留语言写成,这些系统的总体象征性长度超过了目前商业上可用的LMS的上下文窗口大小。此外,LMS主要接受现代软件语言的培训,并经过了有限的遗留语言测试,使其理解遗留语言的能力不为人所熟知,因此也成为了经验研究的领域。本文审视了LMS在ALC和MMPS编写的政府遗留代码现代化过程中应用LMS软件的情况,解决了投入限制的挑战。我们调查了各种代码拆解方法,以优化遗留代码文档中摘要模块评论的生成,评估了代码沉积方法对包括GPT-4o、Claude 3 Sonnet、Mixtral和Llama等不同LM公司所制作的文件质量的影响。我们的结果表明,LMS可以选择与人类专家大规模配置的配置文件点密切吻合点,在20 %的版版中,我们发现,在产生更接近的LM中,成为了更有用的数据分区中,在产生了10个实际的路径上,我们用来的路径中,在产生了比分流中,我们成为了更接近。
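The input-limit problem described above is usually handled by splitting a source file into chunks that each fit the model's context window. The sketch below shows one naive baseline, not any of the chunking methods evaluated in the paper: it packs blank-line-separated blocks greedily under a token budget. The whitespace-based `count_tokens` is a crude stand-in for a real tokenizer, and all names are hypothetical.

```python
def count_tokens(text):
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())

def chunk_source(source, max_tokens):
    """Greedily pack blank-line-separated blocks into chunks within the budget.

    A single block larger than the budget is kept whole as its own chunk,
    so chunks can exceed max_tokens in that case.
    """
    blocks = [block for block in source.split("\n\n") if block.strip()]
    chunks, current = [], []
    for block in blocks:
        candidate = "\n\n".join(current + [block])
        if current and count_tokens(candidate) > max_tokens:
            chunks.append("\n\n".join(current))  # close the chunk before it overflows
            current = [block]
        else:
            current.append(block)
    if current:
        chunks.append("\n\n".join(current))
    return chunks

# Hypothetical four-block "file" with a budget of five tokens per chunk.
example_chunks = chunk_source("a b c\n\nd e\n\nf g h i\n\nj", max_tokens=5)
```

Splitting at blank lines is only a proxy for meaningful partition points; the paper's question is precisely whether LLM-chosen boundaries can match those chosen by human experts.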
Article 43
Title@2025-06-24 (2): Lost in Translation? Converting RegExes for Log Parsing into Dynatrace Pattern Language
Title: Lost in Translation? Converting RegExes for Log Parsing into Dynatrace Pattern Language | Verloren in Übersetzung? Umwandlung von RegExes für Log Parsing in Dynatrace Pattern Language | 丢失于翻译中 ? 将日志解析的 RegExs 转换为同步模式语言 2506.19539v1 |
Authors (4): Julian Fragner, Christian Macho, Bernhard Dieber, Martin Pinzger
Log files provide valuable information for detecting and diagnosing problems in enterprise software applications and data centers. Several log analytics tools and platforms were developed to help filter and extract information from logs, typically using regular expressions (RegExes). Recent commercial log analytics platforms provide domain-specific languages specifically designed for log parsing, such as Grok or the Dynatrace Pattern Language (DPL). However, users who want to migrate to these platforms must manually convert their RegExes into the new pattern language, which is costly and error-prone. In this work, we present Reptile, which combines a rule-based approach for converting RegExes into DPL patterns with a best-effort approach for cases where a full conversion is impossible. Furthermore, it integrates GPT-4 to optimize the obtained DPL patterns. The evaluation with 946 RegExes collected from a large company shows that Reptile safely converted 73.7% of them. The evaluation of Reptile’s pattern optimization with 23 real-world RegExes showed an F1-score and MCC above 0.91. These results are promising and have ample practical implications for companies that migrate to a modern log analytics platform, such as Dynatrace.
日志文件为发现和诊断企业软件应用程序和数据中心的问题提供了宝贵的信息。 开发了一些日志分析工具和平台,以帮助过滤和从日志中提取信息, 通常使用常规表达式( RegExes ) 。 最近的商业日志分析平台提供具体用于日志分析的域名语言, 如 Grok 或 Dynatrace 样式语言( DPL 语言 ) 。 但是, 想要迁移到这些平台的用户必须手工将 RegExes 转换成新的模式语言, 语言成本昂贵且容易出错。 在这项工作中, 我们介绍 Reptile, 将基于规则的方法结合在一起, 将 RegExes 转换成DPL 模式, 并采用最优化的方法处理不可能完全转换的案件 。 此外, 它将 GPT-4 整合为优化所获得的 DPL 模式 。 从大公司收集的 946 RegEx 评估显示, Reptile 安全转换了其中的73. 7% 。 对 RegExes 23 真实世界 RegExes 模式优化的评价显示F1 和 MC CM 0.91 以上 F1 的F1 和 CLCMCF1- slateal 。 这些结果很有影响是现代的现代的, 。
Article 44
Title@2025-06-24 (2): Integrating Pair Programming as a Work Practice
Title: Integrating Pair Programming as a Work Practice | Integration der Pair-Programmierung als Arbeitspraxis | 将 “ 平等规划 “ 纳入工作实践 2506.19511v1 |
Authors (7): Nina Haugland Andersen, Anastasiia Tkalich, Nils Brede Moe, Darja Smite, Asgaut Mjølne Söderbom, Ola Hast, Viktoria Stray
Context: Pair programming (PP) is more relevant than ever. As modern systems grow in complexity, knowledge sharing and collaboration across teams have become essential. However, despite well-documented benefits of PP, its adoption remains inconsistent across software teams. Objective: This study aims to understand the factors that facilitate or hinder team members’ adoption of, as well as lasting engagement in, PP. Method: We conducted an exploratory single-case study in a mature agile company in Norway. We collected data through two rounds of interviews with team members in different roles and performed a thematic analysis of the interviews. Results: Our key finding is that multiple factors affect PP engagement: perceptions of how PP contributes to daily work, the effort associated with engaging in PP sessions, company and team attitudes, resources, infrastructure, and task characteristics. Conclusion: Long-term engagement in PP requires that the expected benefits of the practice be confirmed through firsthand experience. Adapting the practice to each unique team, with insights drawn from collective learning, is also beneficial. Our findings will be beneficial for software practitioners seeking to make PP an integrated part of their team’s workflow.
目标:本项研究的目的是了解促进或阻碍小组成员收养以及持久参与PP的因素。方法:我们在挪威成熟的灵活公司进行了一项探索性的单一案例研究。我们通过与不同角色的小组成员的两轮访谈收集数据,并对访谈进行了专题分析。结果:我们的主要发现是,与PP如何促进日常工作、参与PP会议的相关努力、公司和团队态度、资源、基础设施和任务特点有关的多种因素影响到PP的参与。结论:长期参与PPP需要预期的好处,其做法在第一手经验中得到确认。将做法适应于每个独特的团队,从集体学习中得到深刻的见解,也是有益的。我们的调查结果将有利于软件从业人员争取将PPP纳入团队的工作流程。
Article 45
Title@2025-06-24 (2): LLM-based Multi-Agent System for Intelligent Refactoring of Haskell Code
Title: LLM-based Multi-Agent System for Intelligent Refactoring of Haskell Code | LLM-basiertes Multi-Agent-System zur intelligenten Refactoring von Haskell-Code | 以LLM为基础的哈斯凯尔码智能再构要素多代理商系统 2506.19481v1 |
Authors (10): Shahbaz Siddeeq, Muhammad Waseem, Zeeshan Rasheed, Md Mahade Hasan, Jussi Rasku, Mika Saari, Henri Terho, Kalle Makela, Kai-Kristian Kemell, Pekka Abrahamsson
Refactoring is a constant activity in software development and maintenance. Scaling and maintaining software systems depend on code refactoring. However, this process is still labor intensive, as it requires programmers to analyze the codebases in detail to avoid introducing new defects. In this research, we put forward a large language model (LLM)-based multi-agent system to automate the refactoring process on Haskell code. The objective of this research is to evaluate the effect of LLM-based agents in performing structured and semantically accurate refactoring on Haskell code. Our proposed multi-agent system is based on specialized agents with distinct roles, including code analysis, refactoring execution, verification, and debugging. To test the effectiveness and practical applicability of the multi-agent system, we conducted evaluations using different open-source Haskell codebases. The experimental results showed that the proposed LLM-based multi-agent system could achieve an average 11.03% reduction in code complexity, a 22.46% improvement in overall code quality, and an average 13.27% increase in performance efficiency. Furthermore, memory allocation was optimized by up to 14.57%. These results highlight the ability of LLM-based multi-agent systems to manage refactoring tasks targeted toward functional programming paradigms. Our findings suggest that integrating LLM-based multi-agent systems into the refactoring of functional programming languages can enhance maintainability and support automated development workflows.
重构是软件开发和维护中一项持续性的活动。软件系统的扩展和维护都建立在代码重构的基础上。然而,这一过程仍然耗费大量人力,因为它要求程序员详细分析代码库,以避免引入新的缺陷。在这项研究中,我们提出了一个基于大型语言模型(LLM)的多代理系统,用于自动化Haskell代码的重构过程。本研究的目的是评估基于LLM的代理在对Haskell代码进行结构化且语义准确的重构方面的效果。我们提出的多代理系统由具有不同角色的专门代理组成,包括代码分析、重构执行、验证和调试。为了测试该多代理系统的有效性和实际适用性,我们使用不同的开源Haskell代码库进行了评估。实验结果显示,所提出的基于LLM的多代理系统可使代码复杂度平均降低11.03%,整体代码质量提高22.46%,性能效率平均提高13.27%。此外,内存分配最多优化了14.57%。这些结果凸显了基于LLM的多代理系统处理面向函数式编程范式的重构任务的能力。我们的研究结果提示,将基于LLM的多代理系统集成到函数式编程语言的重构中,可以增强可维护性并支持自动化开发工作流。
Article 46
Title@2025-06-24 (2): What Makes the Best Decomposition? Investigating Binary Decomposition Under FCG Variance
Title: What Makes the Best Decomposition? Investigating Binary Decomposition Under FCG Variance | Was macht die beste Zersetzung? Untersuchung der binären Zersetzung unter FCG Variance | 根据FCG差异调查二进分解 2506.19425v1 |
Authors (6): Ang Jia, He Jiang, Zhilei Ren, Xiaochen Li, Ming Fan, Ting Liu
Binary decomposition, which decomposes binary files into modules, plays a critical role in binary reuse detection. Existing binary decomposition works either apply anchor-based methods by extending anchor functions to generate modules, or apply clustering-based methods by using clustering algorithms to group binary functions; both rely on the assumption that reused code shares similar function call relationships. However, we find that function call graphs (FCGs) vary a lot when using different compilation settings, especially with diverse function inlining decisions. In this work, we conduct the first systematic empirical study on the variance of FCGs compiled by various compilation settings and explore its effect on binary decomposition methods. We first construct a dataset compiled with 17 compilers and 6 optimization settings for 4 architectures, and analyze the changes and mappings of the FCGs. We find that the size of FCGs changes dramatically, while the FCGs are still linked by three different kinds of mappings. Then we evaluate the existing works under the FCG variance, and the results show that existing works face great challenges when conducting cross-compiler evaluation with diverse optimization settings. Finally, we propose a method to identify the optimal decomposition and compare the existing decomposition works with the optimal decomposition. Existing works either suffer from low coverage or cannot generate stable community similarities.
二进制分解将二进制文件分解为模块,在二进制复用检测中发挥着关键作用。现有的二进制分解工作要么采用基于锚点的方法,通过扩展锚点函数来生成模块,要么采用基于聚类的方法,使用聚类算法对二进制函数分组;两者都依赖于被复用代码具有相似函数调用关系这一假设。然而,我们发现函数调用图(FCG)在使用不同编译设置时差异很大,尤其是在函数内联决策不同的情况下。在这项工作中,我们首次对不同编译设置下生成的FCG的差异进行了系统的实证研究,并探讨其对二进制分解方法的影响。我们首先构建了一个数据集,使用17种编译器和6种优化设置面向4种架构进行编译,并分析FCG的变化和映射。我们发现FCG的规模变化很大,但各FCG之间仍通过三种不同的映射相关联。随后,我们在FCG差异下评估现有工作,结果表明,在不同优化设置下进行跨编译器评估时,现有工作面临巨大挑战。最后,我们提出了一种识别最优分解的方法,并将现有分解工作与最优分解进行比较。现有工作要么覆盖率低,要么无法生成稳定的社区相似度。
Article 47
Title@2025-06-24 (2): Online Discovery of Simulation Models for Evolving Business Processes (Extended Version)
Title: Online Discovery of Simulation Models for Evolving Business Processes (Extended Version) | Online Discovery of Simulation Models for Evolving Business Processes (Erweiterte Version) | 不断演变的业务流程模拟模型在线发现(扩展版) 2506.10049v2 |
Authors (4): Francesco Vinci, Gyunam Park, Wil van der Aalst, Massimiliano de Leoni
Business Process Simulation (BPS) refers to techniques designed to replicate the dynamic behavior of a business process. Many approaches have been proposed to automatically discover simulation models from historical event logs, reducing the cost and time of designing them manually. However, in dynamic business environments, organizations continuously refine their processes to enhance efficiency, reduce costs, and improve customer satisfaction. Existing techniques for process simulation discovery lack adaptability to real-time operational changes. In this paper, we propose a streaming process simulation discovery technique that integrates Incremental Process Discovery with Online Machine Learning methods. This technique prioritizes recent data while preserving historical information, ensuring adaptation to evolving process dynamics. Experiments conducted on four different event logs demonstrate the importance in simulation of giving more weight to recent data while retaining historical knowledge. Our technique not only produces more stable simulations but also exhibits robustness in handling concept drift, as highlighted in one of the use cases.
商业过程模拟(BPS)是指旨在复制商业过程动态行为的技术; 提出了许多办法,以便从历史事件日志中自动发现模拟模型,减少成本和人工设计这些模型的时间; 然而,在动态商业环境中,各组织不断改进其程序,以提高效率、降低成本和提高客户满意度; 现有的模拟发现处理技术缺乏适应实时业务变化的适应性; 在本文件中,我们提议了一种将递增过程发现与在线机器学习方法相结合的流动过程模拟发现技术; 这一技术在保存历史信息的同时优先考虑最新数据,确保适应不断演变的过程动态; 在四个不同的事件日志上进行的实验表明,在模拟中,必须更多地重视最新数据,同时保留历史知识; 我们的技术不仅产生更稳定的模拟,而且在处理概念漂移方面表现出稳健,正如其中一个使用案例所强调的那样。
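One simple way to give more weight to recent data while still retaining history, as the abstract above describes, is an exponentially decayed running estimate. The sketch below only illustrates that weighting idea; the class name `DecayedMean`, the `decay` parameter, and the choice of tracking a single numeric value (for example an activity duration) are assumptions and not the authors' technique.

```python
class DecayedMean:
    """Running mean in which older observations lose weight geometrically.

    Hypothetical helper for illustration: decay=1.0 reproduces the plain
    mean, smaller values make the estimate follow recent data more closely.
    """

    def __init__(self, decay: float = 0.9) -> None:
        if not 0.0 < decay <= 1.0:
            raise ValueError("decay must be in (0, 1]")
        self.decay = decay
        self.weighted_sum = 0.0
        self.total_weight = 0.0

    def update(self, value: float) -> float:
        # Shrink the influence of everything seen so far, then add the new value
        # with weight 1.
        self.weighted_sum = self.decay * self.weighted_sum + value
        self.total_weight = self.decay * self.total_weight + 1.0
        return self.mean

    @property
    def mean(self) -> float:
        if self.total_weight == 0.0:
            raise ValueError("no observations yet")
        return self.weighted_sum / self.total_weight
```

With `decay` below 1, a shift in the incoming values (a simple form of concept drift) moves the estimate faster than a plain average would, while earlier observations still contribute with reduced weight.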
Article 48
Title@2025-06-24 (2): High-Performance ARM-on-ARM Virtualization for Multicore SystemC-TLM-Based Virtual Platforms
Title: High-Performance ARM-on-ARM Virtualization for Multicore SystemC-TLM-Based Virtual Platforms | Leistungsstarke ARM-on-ARM-Virtualisierung für Multicore-SystemC-TLM-basierte virtuelle Plattformen | 以多核心系统C-TLM为基础的虚拟平台的ARM在亚美尼亚国内的虚拟化 2505.12987v2 |
Authors (6): Nils Bosbach, Rebecca Pelke, Niko Zurstraßen, Jan Henrik Weinstock, Lukas Jünger, Rainer Leupers
The increasing complexity of hardware and software requires advanced development and test methodologies for modern systems on chips. This paper presents a novel approach to ARM-on-ARM virtualization within SystemC-based simulators using Linux’s KVM to achieve high-performance simulation. By running target software natively on ARM-based hosts with hardware-based virtualization extensions, our method eliminates the need for instruction-set simulators, which significantly improves performance. We present a multicore SystemC-TLM-based CPU model that can be used as a drop-in replacement for an instruction-set simulator. It places no special requirements on the host system, making it compatible with various environments. Benchmark results show that our ARM-on-ARM-based virtual platform achieves up to 10x speedup over traditional instruction-set-simulator-based models on compute-intensive workloads. Depending on the benchmark, speedups increase to more than 100x.
硬件和软件日益复杂,要求为现代芯片系统采用先进的开发和测试方法。本文件展示了一种新颖的方法,利用Linux的KVM在基于系统C的模拟器中进行ARM-on-ARM模拟虚拟化,以实现高性能模拟。我们的方法通过在基于硬件虚拟化扩展的基于软件主机上运行目标软件,消除了对基于硬件的模拟器的需求,从而大大改进了性能。我们展示了一个以多核心系统C-TLM为基础的CPU模型,该模型可以用作指示设置模拟器的投放替换器。它没有为主机系统设置任何特殊要求,使之与各种环境兼容。基准结果显示,我们的基于ARM-ARM的虚拟平台在传统的基于指示-设置模拟器的模型上比基于计算密集工作量的模型实现多达10x的加速度。根据基准,加速度增加到100x以上。
Article 49
Title@2025-06-24 (2): VFArchē: A Dual-Mode Framework for Locating Vulnerable Functions in Open-Source Software
Title: VFArchē: A Dual-Mode Framework for Locating Vulnerable Functions in Open-Source Software | VFArchē: Ein Dual-Mode-Framework für die Suche nach gefährdeten Funktionen in Open-Source-Software | VFFARCHZ:在开放源码软件中确定脆弱功能的双模式框架 2506.18050v2 |
Authors (9): Lyuye Zhang, Jian Zhang, Kaixuan Li, Chong Wang, Chengwei Liu, Jiahui Wu, Sen Chen, Yaowen Zheng, Yang Liu
Software Composition Analysis (SCA) has become pivotal in addressing vulnerabilities inherent in software project dependencies. In particular, reachability analysis is increasingly used in Open-Source Software (OSS) projects to identify reachable vulnerabilities (e.g., CVEs) through call graphs, enabling a focus on exploitable risks. Performing reachability analysis typically requires the vulnerable function (VF) to track the call chains from downstream applications. However, such crucial information is usually unavailable in modern vulnerability databases like NVD. While directly extracting VF from modified functions in vulnerability patches is intuitive, patches are not always available. Moreover, our preliminary study shows that over 26% of VF do not exist in the modified functions. Meanwhile, simply ignoring patches to search for vulnerable functions suffers from overwhelming noise and lexical gaps between descriptions and source code. Given that almost half of the vulnerabilities are equipped with patches, a holistic solution that handles both scenarios with and without patches is required. To meet real-world needs and automatically localize VF, we present VFArchē, a dual-mode approach designed for disclosed vulnerabilities, applicable in scenarios with or without available patch links. The experimental results of VFArchē on our constructed benchmark dataset demonstrate significant efficacy regarding three metrics, achieving 1.3x and 1.9x Mean Reciprocal Rank over the best baselines for Patch-present and Patch-absent modes, respectively. Moreover, VFArchē has proven its applicability in real-world scenarios by successfully locating VF for 43 out of 50 latest vulnerabilities with reasonable efforts and significantly reducing false positives of SCA tools by 78-89%.
软件构成分析(SCA)在解决软件项目依赖项所固有的漏洞方面已变得至关重要。特别是,开源软件(OSS)项目越来越多地使用可达性分析,通过调用图识别可达的漏洞(如CVE),从而能够聚焦于可被利用的风险。进行可达性分析通常需要已知漏洞函数(VF),以便从下游应用程序追踪调用链。然而,在NVD等现代漏洞数据库中通常没有这一关键信息。虽然直接从漏洞补丁所修改的函数中提取VF是直观的做法,但补丁并非总是可用。此外,我们的初步研究表明,超过26%的VF并不存在于被修改的函数中。同时,单纯忽略补丁来搜索漏洞函数,则会受到大量噪声以及描述与源代码之间词汇差距的影响。鉴于近一半的漏洞配有补丁,需要一种同时处理有补丁和无补丁两种情形的整体解决方案。为满足现实需求并自动定位VF,我们提出VFArchē,这是一种面向已披露漏洞的双模式方法,适用于有或没有可用补丁链接的情形。VFArchē在我们构建的基准数据集上的实验结果在三项指标上显示出显著效果,在有补丁和无补丁模式下的平均倒数排名(Mean Reciprocal Rank)分别达到最佳基线的1.3倍和1.9倍。此外,VFArchē成功地以合理的工作量为50个最新漏洞中的43个定位了VF,并将SCA工具的误报显著减少78-89%,证明了其在现实场景中的适用性。
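The abstract above reports results as Mean Reciprocal Rank (MRR). For readers unfamiliar with the metric, the following minimal sketch computes it from the 1-based rank at which the first correct item (here, the true vulnerable function) appears for each query. The function name and the convention of scoring an unlocated item as 0 are assumptions for illustration; the paper's exact evaluation protocol is not given in the abstract.

```python
from typing import Optional, Sequence


def mean_reciprocal_rank(ranks: Sequence[Optional[int]]) -> float:
    """Average of 1/rank over all queries.

    ranks[i] is the 1-based position of the first correct result for
    query i, or None when no correct result was returned (scored as 0,
    an assumed convention).
    """
    if not ranks:
        raise ValueError("at least one query is required")
    total = 0.0
    for rank in ranks:
        if rank is None:
            continue  # contributes 0 to the sum
        if rank < 1:
            raise ValueError("ranks are 1-based")
        total += 1.0 / rank
    return total / len(ranks)
```

For example, ranks of 1, 2, not found, and 4 give (1 + 0.5 + 0 + 0.25) / 4 = 0.4375.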
Article 50
Title@2025-06-24 (2): MCP-Zero: Active Tool Discovery for Autonomous LLM Agents
Title: MCP-Zero: Active Tool Discovery for Autonomous LLM Agents | MCP-Zero: Active Tool Discovery für autonome LLM-Agenten | MCP-零:为自动LLM代理商提供主动工具发现工具 2506.01056v4 |
Authors (3): Xiang Fei, Xiawu Zheng, Hao Feng
True intelligence requires active capability acquisition, yet current LLM agents inject pre-defined tool schemas into prompts, reducing models to passive selectors and falling short of robust general-purpose agency. We introduce MCP-Zero, an active agent framework that restores tool discovery autonomy to LLMs themselves. Instead of overwhelming models with all available tools, MCP-Zero enables agents to actively identify capability gaps, and request specific tools on-demand, transforming them from large-scale retrievers into genuine autonomous agents. The framework operates through three core mechanisms: (1) Active Tool Request, where models autonomously generate structured requests specifying their exact tool requirements; (2) Hierarchical Semantic Routing, a two-stage algorithm that matches requests to relevant servers and tools through improved semantic alignment; (3) Iterative Capability Extension, enabling agents to progressively build cross-domain toolchains while maintaining minimal context footprint. We construct MCP-tools, a comprehensive dataset of 308 MCP servers and 2,797 tools from the official Model-Context-Protocol repository. Experiments demonstrate that MCP-Zero preserves agent autonomy while achieving substantial efficiency gains: (i) accurate tool selection from nearly 3k candidates across 248.1k tokens; (ii) 98% reduction in token consumption on APIBank while maintaining high accuracy; and (iii) consistent multi-turn performance that scales with tool ecosystem growth. This work establishes active tool discovery as a fundamental design pattern for scalable autonomous agent systems.
真正的智能需要主动获取能力,而目前的LLM代理将预定义的工具模式注入提示词,使模型沦为被动的选择器,达不到稳健的通用代理能力。我们提出MCP-Zero,这是一个主动代理框架,将工具发现的自主权交还给LLM自身。MCP-Zero不是用所有可用工具淹没模型,而是使代理能够主动识别能力差距并按需请求特定工具,将其从大规模检索器转变为真正的自主代理。该框架通过三个核心机制运作:(1) 主动工具请求,模型自主生成结构化请求,说明其确切的工具需求;(2) 层次化语义路由,一种两阶段算法,通过改进的语义对齐将请求与相关服务器和工具相匹配;(3) 迭代能力扩展,使代理能够逐步构建跨领域工具链,同时保持最小的上下文占用。我们构建了MCP-tools,这是一个综合数据集,包含来自官方Model-Context-Protocol仓库的308个MCP服务器和2,797个工具。实验表明,MCP-Zero在保持代理自主性的同时取得了显著的效率提升:(i) 在248.1k个token范围内从近3k个候选工具中准确选择工具;(ii) 在APIBank上将token消耗减少98%,同时保持高准确率;(iii) 多轮性能稳定,并随工具生态系统的增长而扩展。这项工作将主动工具发现确立为可扩展自主代理系统的基本设计模式。
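The two-stage routing idea in the abstract above (first narrow down to relevant servers, then pick a tool within them) can be sketched as follows. This is a toy illustration under loud assumptions: a bag-of-words cosine similarity stands in for whatever semantic matching the paper uses, and the server names, tool names, and descriptions in the usage note are invented.

```python
import math
import re
from collections import Counter
from typing import Dict, Tuple


def _vec(text: str) -> Counter:
    # Toy "embedding": lowercase word counts (an assumption, not the paper's model).
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: str, b: str) -> float:
    va, vb = _vec(a), _vec(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def route(request: str, servers: Dict[str, dict], top_servers: int = 1) -> Tuple[str, str]:
    """Return (server, tool) for a tool request using two-stage matching."""
    # Stage 1: keep only the servers whose description best matches the request.
    ranked = sorted(
        servers, key=lambda s: cosine(request, servers[s]["description"]), reverse=True
    )[:top_servers]
    # Stage 2: score tools only inside the shortlisted servers.
    candidates = [
        (cosine(request, desc), server, tool)
        for server in ranked
        for tool, desc in servers[server]["tools"].items()
    ]
    if not candidates:
        raise ValueError("no tools available to route to")
    _, server, tool = max(candidates)
    return server, tool
```

Given a hypothetical registry with a `filesystem` server (tools `read_file`, `write_file`) and a `weather` server, a request such as "read the contents of a file from disk" is first matched to `filesystem` and then to `read_file`, so only the shortlisted server's tool descriptions ever need to be compared against the request.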
Article 51
Title@2025-06-24 (2): MNN-AECS: Energy Optimization for LLM Decoding on Mobile Devices via Adaptive Core Selection
Title: MNN-AECS: Energy Optimization for LLM Decoding on Mobile Devices via Adaptive Core Selection | MNN-AECS: Energieoptimierung für die LLM-Dekodierung auf mobilen Geräten über adaptive Core Selection | MNN-AN-ANECS:通过适应核心选择在移动设备上添加LLM的能量优化 2506.19884v1 |
Authors (11): Zhengxiang Huang, Chaoyue Niu, Zhaode Wang, Jiarui Xue, Hanming Zhang, Yugang Wang, Zewei Xin, Xiaotang Jiang, Chengfei Lv, Fan Wu, Guihai Chen
As the demand for on-device Large Language Model (LLM) inference grows, energy efficiency has become a major concern, especially for battery-limited mobile devices. Our analysis shows that the memory-bound LLM decode phase dominates energy use, and yet most existing works focus on accelerating the prefill phase, neglecting energy concerns. We introduce Adaptive Energy-Centric Core Selection (AECS) and integrate it into MNN to create the energy-efficient version, MNN-AECS, the first engine-level system solution without requiring root access or OS modifications for energy-efficient LLM decoding. MNN-AECS is designed to reduce LLM decoding energy while keeping decode speed within an acceptable slowdown threshold by dynamically selecting low-power CPU cores. MNN-AECS is evaluated across 5 Android and 2 iOS devices on 5 popular LLMs of various sizes. Compared to original MNN, MNN-AECS cuts down energy use by 23% without slowdown averaged over all 7 devices and 4 datasets. Against other engines, including llama.cpp, executorch, mllm, and MediaPipe, MNN-AECS delivers 39% to 78% energy saving and 12% to 363% speedup on average.
随着设备端大型语言模型(LLM)推理需求的增长,能效已成为一个主要关切,尤其是对电池容量有限的移动设备而言。我们的分析表明,受内存限制的LLM解码阶段主导了能耗,但现有工作大多侧重于加速预填充阶段,忽视了能耗问题。我们提出自适应能量中心核心选择(AECS),并将其集成到MNN中,形成节能版本MNN-AECS,这是首个无需root权限或操作系统修改即可实现节能LLM解码的引擎级系统解决方案。MNN-AECS通过动态选择低功耗CPU核心,在将解码速度保持在可接受的减速阈值内的同时降低LLM解码能耗。我们在5台Android设备和2台iOS设备上,使用5个不同规模的流行LLM对MNN-AECS进行了评估。与原始MNN相比,在全部7台设备和4个数据集上取平均,MNN-AECS在不减速的情况下将能耗降低23%。与llama.cpp、executorch、mllm和MediaPipe等其他引擎相比,MNN-AECS平均节能39%至78%,速度提升12%至363%。
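The selection rule described in the abstract above (lower decoding energy while keeping decode speed within an acceptable slowdown) can be illustrated with a small constrained-choice sketch. Everything concrete here is an assumption: the `CoreConfig` type, the energy model of watts divided by tokens per second, the fallback to the baseline, and the example numbers are hypothetical and are not taken from MNN-AECS.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CoreConfig:
    """A candidate set of CPU cores with (hypothetical) measured behaviour."""

    name: str
    tokens_per_s: float
    watts: float

    @property
    def joules_per_token(self) -> float:
        # Energy per decoded token = power / throughput.
        return self.watts / self.tokens_per_s


def select_config(
    configs: Iterable[CoreConfig], baseline: CoreConfig, max_slowdown: float = 0.1
) -> CoreConfig:
    """Pick the lowest-energy config whose speed stays within the slowdown budget."""
    min_speed = baseline.tokens_per_s * (1.0 - max_slowdown)
    feasible = [c for c in configs if c.tokens_per_s >= min_speed]
    if not feasible:
        return baseline  # assumed fallback when nothing meets the speed constraint
    return min(feasible, key=lambda c: c.joules_per_token)
```

With a baseline of 10 tokens/s at 5 W, a candidate at 9.5 tokens/s and 3.5 W is chosen under a 10% slowdown budget, whereas a slower but even more frugal candidate only becomes eligible once the budget is relaxed.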
Article 52
Title@2025-06-24 (2): Generating and Understanding Tests via Path-Aware Symbolic Execution with LLMs
Title: Generating and Understanding Tests via Path-Aware Symbolic Execution with LLMs | Erzeugen und Verstehen von Tests über path-aware Symbolische Ausführung mit LLMs | 通过使用LLMM 进行路径-意识符号执行生成和理解测试 2506.19287v1 |
Authors (5): Yaoxuan Wu, Xiaojie Zhou, Ahmad Humayun, Muhammad Ali Gulzar, Miryung Kim
Symbolic execution is a widely used technique for test generation, offering systematic exploration of program paths through constraint solving. However, it is fundamentally constrained by the capability to model the target code, including library functions, in terms of symbolic constraints and by the capability of the underlying constraint solvers. As a result, many paths involving complex features remain unanalyzed or insufficiently modeled. Recent advances in large language models (LLMs) have shown promise in generating diverse and valid test inputs. Yet, LLMs lack mechanisms for systematically enumerating program paths and often fail to cover subtle corner cases. We observe that directly prompting an LLM with the full program leads to missed coverage of interesting paths. In this paper, we present PALM, a test generation system that combines symbolic path enumeration with LLM-assisted test generation. PALM statically enumerates possible paths through AST-level analysis and transforms each into an executable variant with embedded assertions that specify the target path. This avoids the need to translate path constraints into SMT formulae, by instead constructing program variants that an LLM can interpret. Importantly, PALM is the first to provide an interactive frontend that visualizes path coverage alongside generated tests, assembling tests based on the specific paths they exercise. A user study with 12 participants demonstrates that PALM’s frontend helps users better understand path coverage and identify which paths are actually exercised by PALM-generated tests, through verification and visualization of their path profiles.
光学执行是一种广泛使用的测试生成技术,它提供了系统探索程序路径的方法,通过克服限制而解决限制,然而,它受到以下能力的根本制约:模拟目标代码的能力,包括图书馆功能,包括象征性限制和基本制约解决器的能力。因此,许多复杂特征的路径仍未分析,或建模不足。大型语言模型(LLMS)的最近进展显示了产生多样化和有效测试投入的希望。然而,LLMS缺乏系统罗列程序路径的机制,往往无法覆盖微妙的角落案例。我们发现,直接促使一个带有完整程序的LLMLM导致错过有趣的路径的覆盖。在本文件中,我们介绍PALM,这是一个将象征性路径查点与LM协助的测试生成者相结合的测试系统。PALM静态地罗列了可能的路径,并将每个路径转换成一个可执行的变量,其中含有指定目标路径。这避免了将路径限制转化为SMT公式,而不是构建LM能够解释的程序变量。DI,PALM是第一个提供互动前端路径的测试系统系统,将用户的图像测试结果用于测试。
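To make the idea of "one executable variant per path, with embedded assertions" from the abstract above more concrete, here is a deliberately simplified sketch. It does not perform AST-level analysis as PALM does; it assumes the branch conditions are already available as strings, treats them as a flat sequence of independent branches, and emits one small checker function per combination of outcomes. The function name `path_variants` and the generated `variant` function are hypothetical.

```python
import itertools
from typing import Iterator, Sequence, Tuple


def path_variants(
    params: Sequence[str], conditions: Sequence[str]
) -> Iterator[Tuple[Tuple[bool, ...], str]]:
    """Yield (outcomes, source) for every combination of branch outcomes.

    Each generated function asserts the branch outcomes that define one
    path, so an input passes exactly when it would follow that path.
    """
    for outcomes in itertools.product([True, False], repeat=len(conditions)):
        lines = [f"def variant({', '.join(params)}):"]
        for cond, taken in zip(conditions, outcomes):
            expected = cond if taken else f"not ({cond})"
            lines.append(f"    assert {expected}")
        lines.append("    return True")
        yield outcomes, "\n".join(lines)
```

For two conditions such as `x > 0` and `x % 2 == 0`, this yields four variants; a candidate test input can then be checked against the variant for its intended path, which is the role the embedded assertions play in the description above.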
Article 53
Title@2025-06-24 (2): DynNPC: Finding More Violations Induced by ADS in Simulation Testing through Dynamic NPC Behavior Generation
Title: DynNPC: Finding More Violations Induced by ADS in Simulation Testing through Dynamic NPC Behavior Generation | DynNPC: Weitere Verletzungen durch ADS bei Simulationstests durch dynamische NPC-Behavior-Generierung | DynNPC:通过动态NPC行为一代在模拟测试中发现ADS诱导的更多违规行为 2411.19567v2 |
Authors (5): You Lu, Yifan Tian, Dingji Wang, Bihuan Chen, Xin Peng
Recently, a number of simulation testing approaches have been proposed to generate diverse driving scenarios for autonomous driving systems (ADSs) testing. However, the behaviors of NPC vehicles in these scenarios generated by previous approaches are predefined and mutated before simulation execution, ignoring traffic signals and the behaviors of the Ego vehicle. Thus, a large number of the violations they found are induced by unrealistic behaviors of NPC vehicles, revealing no bugs of ADSs. Besides, the vast scenario search space of NPC behaviors during the iterative mutations limits the efficiency of previous approaches. To address these limitations, we propose a novel scenario-based testing framework, DynNPC, to generate more violation scenarios induced by the ADS. Specifically, DynNPC allows NPC vehicles to dynamically generate behaviors using different driving strategies during simulation execution based on traffic signals and the real-time behavior of the Ego vehicle. We compare DynNPC with five state-of-the-art scenario-based testing approaches. Our evaluation has demonstrated the effectiveness and efficiency of DynNPC in finding more violation scenarios induced by the ADS.
最近,提出了若干模拟测试方法,为自动驾驶系统(ADS)的测试提供不同的驾驶场景;然而,在模拟执行之前,预先界定和变换了以前方法产生的NPC车辆在这些场景中的行为,忽略了交通信号和Ego车辆的行为;因此,发现的大量违规行为是由NPC车辆不切实际的行为引起的,没有显示ADS的虫子;此外,迭代突变期间NPC行为的巨大场景搜索空间限制了以前方法的效率;为解决这些限制,我们提出了一个新的基于情景的测试框架,即DynNPC,以产生更多的由ADS引发的违规场景。具体地说,DynNPC车辆在模拟执行期间,根据交通信号和Ego车辆的实时行为,采用不同的驾驶策略,动态地产生行为;我们将DynNPC与5种基于最新情景的测试方法进行比较,我们的评价表明DynNPC在发现ADS的更多违规场景时,是有效的。
Article 54
Title@2025-06-24 (2): GroupTuner: Efficient Group-Aware Compiler Auto-Tuning
Title: GroupTuner: Efficient Group-Aware Compiler Auto-Tuning | GroupTuner: Efficient Group-Aware Compiler Auto-Tuning | GroupTuner: 高效的 Group- Awar 软件编辑器自动调试 2505.08598v2 |
Authors (7): Bingyu Gao, Mengyu Yao, Ziming Wang, Dong Liu, Ding Li, Xiangqun Chen, Yao Guo
Modern compilers typically provide hundreds of options to optimize program performance, but users often cannot fully leverage them due to the huge number of options. While standard optimization combinations (e.g., -O3) provide reasonable defaults, they often fail to deliver near-peak performance across diverse programs and architectures. To address this challenge, compiler auto-tuning techniques have emerged to automate the discovery of improved option combinations. Existing techniques typically focus on identifying critical options and prioritizing them during the search to improve efficiency. However, due to limited tuning iterations, the resulting data is often sparse and noisy, making it highly challenging to accurately identify critical options. As a result, these algorithms are prone to being trapped in local optima. To address this limitation, we propose GroupTuner, a group-aware auto-tuning technique that directly applies localized mutation to coherent option groups based on historically best-performing combinations, thus avoiding explicitly identifying critical options. By forgoing the need to know precisely which options are most important, GroupTuner maximizes the use of existing performance data, ensuring more targeted exploration. Extensive experiments demonstrate that GroupTuner can efficiently discover competitive option combinations, achieving an average performance improvement of 12.39% over -O3 while requiring only 77.21% of the time compared to the random search algorithm, significantly outperforming state-of-the-art methods.
现代编译者通常提供数百种优化程序绩效的选项,但用户往往无法充分利用这些选项,原因是选项数量众多。标准优化组合(例如-O3)提供了合理的默认,但他们往往无法在不同的程序和架构中提供近乎高峰的绩效。为了应对这一挑战,编译者自动调试技术已经出现,以自动发现改进的选项组合。现有技术通常侧重于确定关键选项,并在寻找提高效率的过程中对其进行优先排序。然而,由于调试有限,由此产生的数据往往稀少和吵闹,因此很难准确确定关键选项。因此,这些算法很容易被困在本地的奥地马。为了应对这一限制,我们提议GroupTuner,一个群体自觉的自动调制调整技术,直接将本地化突变适用于基于历史最佳组合的连贯选项组,从而避免明确确定关键选项。由于需要准确了解哪些选项最为重要,GroupTuner将最大限度地利用现有绩效数据,确保更有针对性的探索。广泛的实验表明,GroupTuner能够有效地发现竞争性选项,同时要求将12.39的组合转化为12.39的概率组合。
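The "localized mutation on coherent option groups" described in the abstract above can be sketched as: start from the best configuration seen so far, pick one group of related flags, and re-sample only the flags in that group. The grouping below is an illustrative assumption (the flags are GCC optimization options, but how GroupTuner actually forms its groups is not stated in the abstract), as are the function name and the 50% toggle probability.

```python
import random
from typing import Dict, List

# Hypothetical grouping of related GCC optimization flags, for illustration only.
OPTION_GROUPS: Dict[str, List[str]] = {
    "loops": ["-funroll-loops", "-fpeel-loops"],
    "inlining": ["-finline-functions", "-findirect-inlining"],
    "vectorize": ["-ftree-loop-vectorize", "-ftree-slp-vectorize"],
}


def mutate_group(
    best: Dict[str, bool], groups: Dict[str, List[str]], rng: random.Random
) -> Dict[str, bool]:
    """Return a copy of the best-known configuration with one group re-sampled."""
    child = dict(best)
    group = rng.choice(sorted(groups))  # sorted for reproducibility with a seeded rng
    for flag in groups[group]:
        # Only flags inside the chosen group may change; all others are inherited.
        child[flag] = rng.random() < 0.5
    return child
```

Because every candidate differs from a historically good configuration in at most one group, the search stays close to known-good regions without first having to decide which individual options are critical.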
Article 55
Title@2025-06-24 (2): Breaking Single-Tester Limits: Multi-Agent LLMs for Multi-User Feature Testing
Title: Breaking Single-Tester Limits: Multi-Agent LLMs for Multi-User Feature Testing | Breaking Single-Tester Limits: Multi-Agent LLMs für Multi-User Feature Testing | 打破单一试验者限制:多用户功能测试的多代理机构LLMs 2506.17539v2 |
Authors (7): Sidong Feng, Changhao Du, Huaxiao Liu, Qingnan Wang, Zhengwei Lv, Mengfei Wang, Chunyang Chen
The growing dependence on mobile phones and their apps has made multi-user interactive features, like chat calls, live streaming, and video conferencing, indispensable for bridging the gaps in social connectivity caused by physical and situational barriers. However, automating these interactive features for testing is fraught with challenges, owing to their inherent need for timely, dynamic, and collaborative user interactions, which current automated testing methods inadequately address. Inspired by the concept of agents designed to autonomously and collaboratively tackle problems, we propose MAdroid, a novel multi-agent approach powered by Large Language Models (LLMs) to automate the multi-user interactive task for app feature testing. Specifically, MAdroid employs two functional types of agents: user agents (Operator) and supervisor agents (Coordinator and Observer). Each agent takes a specific role: the Coordinator directs the interactive task; the Operator mimics user interactions on the device; and the Observer monitors and reviews the task automation process. Our evaluation, which included 41 multi-user interactive tasks, demonstrates the effectiveness of our approach, achieving 82.9% of the tasks with 96.8% action similarity, outperforming the ablation studies and state-of-the-art baselines. Additionally, a preliminary investigation underscores MAdroid’s practicality by helping identify 11 multi-user interactive bugs during regression app testing, confirming its potential value in real-world software development contexts.
由于对移动电话及其应用软件的依赖日益增强,使得多用户互动功能,如聊天电话、现场直播和视频会议等,成为弥合因物理和情况障碍造成的社会连通差距所不可或缺的多用户互动功能。然而,这些用于测试的互动式功能自动化充满了挑战,原因是它们本身需要及时、动态和协作的用户互动,而目前自动测试方法没有很好解决这些互动互动。受旨在自主和协作解决问题的代理商概念的启发,我们提议采用MAdroid,这是由大语言模型(LLLMs)推动的新型多用户多试剂方法,将多用户互动功能任务自动化,用于软件功能测试。具体地说,MAdroid使用两种功能性多代理器:用户代理商(Operator)和主管代理商(协调员和观察员),每个代理商都发挥具体作用:协调员指导互动任务;操作员模拟用户在设备上的互动;观察员监测并审查任务自动化进程。我们的评价包括41项多用户互动任务,展示了我们的方法的有效性,实现了82.9%的任务,实现了96.8%的多用户互动式互动任务,实现了互动性任务,在实际操作背景中以模拟分析基础测试中,展示了基础和模拟的模型测试中确定了基础,从而确认了其潜在的系统。
Article 56
Title@2025-06-23 (1): Dataset of Yul Contracts to Support Solidity Compiler Research
Title: Dataset of Yul Contracts to Support Solidity Compiler Research | Datensatz von Yul-Verträgen zur Unterstützung der Solidity Compiler-Forschung | 支持固体汇编者研究的Yul合同数据集 2506.19153v1 |
Authors (1): Krzysztof Fonal
The YulCode dataset presents a comprehensive collection of 348,840 Yul-based smart contract instances, comprising approximately 135,013 unique contracts. These contracts were generated through the compilation of Solidity source files that have been deployed on the Ethereum mainnet, making the dataset directly representative of real-world decentralized applications. YulCode provides a rich foundation for a variety of research and development tasks, including but not limited to machine learning applications, formal verification, optimization analysis, and software engineering tool evaluation in the context of low-level smart contract code. To the best of our knowledge at the time of writing, YulCode is the first and only publicly available dataset that focuses specifically on Yul, an intermediate language designed for the Ethereum Virtual Machine (EVM). As such, it fills a critical gap in the current ecosystem of smart contract datasets and opens new avenues for research and tooling aimed at low-level contract analysis and generation.
YulCode数据集综合收集了348,840个基于Yol的智能合同实例,其中包括约135,013份独特的合同,这些合同是通过汇编在Etheum主网上部署的固体源文件产生的,使数据集直接代表了现实世界分散应用软件。YulCode为各种研究和发展任务提供了丰富的基础,包括但不限于在低级智能合同代码范围内的机器学习应用、正式核查、优化分析和软件工程工具评价。根据我们所了解的,在撰写时,YolCode是第一个而且是唯一公开提供的数据集,专门侧重于Yul,这是Etheem虚拟机器(EVM)设计的中间语言。因此,它填补了目前智能合同数据集生态系统中的一个关键空白,并为旨在进行低级合同分析和生成的研究和工具开辟了新的渠道。
Article 57
Title@2025-06-23 (1): Framework for On the Fly Input Refinement for Deep Learning Models
Title: Framework for On the Fly Input Refinement for Deep Learning Models | Framework for On the Fly Input Raffinement for Deep Learning Models | 深学习模式 Fly 投入改进框架框架 2502.05456v2 |
Authors (1): Ravishka Rathnasuriya
Advancements in deep learning have significantly improved model performance across tasks involving code, text, and image processing. However, these models still exhibit notable mispredictions in real-world applications, even when trained on up-to-date data. Such failures often arise from slight variations in inputs such as minor syntax changes in code, rephrasing in text, or subtle lighting shifts in images that reveal inherent limitations in these models’ capability to generalize effectively. Traditional approaches to address these challenges involve retraining, a resource-intensive process that demands significant investments in data labeling, model updates, and redeployment. This research introduces an adaptive, on-the-fly input refinement framework aimed at improving model performance through input validation and transformation. The input validation component detects inputs likely to cause errors, while input transformation applies domain-specific adjustments to better align these inputs with the model’s handling capabilities. This dual strategy reduces mispredictions across various domains, boosting model performance without necessitating retraining. As a scalable and resource-efficient solution, this framework holds significant promise for high-stakes applications in software engineering, natural language processing, and computer vision.
深层次学习的进步大大改善了涉及代码、文本和图像处理等各项任务的示范性业绩,然而,这些模型在现实世界应用中仍然表现出明显的错误,即使经过最新数据培训,这些错误往往产生于投入的微小差异,例如代码的微小语法变化、文本的改写或图像的细微光变化,这些变化揭示出这些模型有效推广能力的内在局限性。处理这些挑战的传统方法包括再培训,这是一个资源密集型过程,要求在数据标签、模型更新和重新部署方面进行大量投资。这一研究引入了一个适应性的、即时输入改进框架,目的是通过输入验证和转换来改进模型的性能。投入验证部分检测出可能导致错误的投入,而投入转换则应用特定领域的调整,使这些投入更好地与模型的处理能力接轨。这一双重战略减少了不同领域的错误,提高了模型的性能,而无需再培训。作为一个可扩展和资源效率高的解决方案,这一框架为软件工程、自然语言处理和计算机愿景的高级应用带来了巨大的希望。
Article 58
Title@2025-06-23 (1): cuVSLAM: CUDA accelerated visual odometry and mapping
Title: cuVSLAM: CUDA accelerated visual odometry and mapping | cuVSLAM: CUDA beschleunigte visuelle Odometrie und Mapping | CUDA 加速视觉测量和绘图 2506.04359v2 |
Authors (8): Alexander Korovko, Dmitry Slepichev, Alexander Efitorov, Aigul Dzhumamuratova, Viktor Kuznetsov, Hesam Rabeti, Joydeep Biswas, Soha Pouya
Accurate and robust pose estimation is a key requirement for any autonomous robot. We present cuVSLAM, a state-of-the-art solution for visual simultaneous localization and mapping, which can operate with a variety of visual-inertial sensor suites, including multiple RGB and depth cameras, and inertial measurement units. cuVSLAM supports operation with as few as one RGB camera to as many as 32 cameras, in arbitrary geometric configurations, thus supporting a wide range of robotic setups. cuVSLAM is specifically optimized using CUDA to deploy in real-time applications with minimal computational overhead on edge-computing devices such as the NVIDIA Jetson. We present the design and implementation of cuVSLAM, example use cases, and empirical results on several state-of-the-art benchmarks demonstrating the best-in-class performance of cuVSLAM.
精确和稳健的姿势估计是任何自主机器人的关键要求。 我们展示了 CuVSLAM, 这是视觉同步定位和映像的最先进的解决方案, 可用各种视觉- 内脏传感器套件操作, 包括多个 RGB 和深度摄像头, 以及惯性测量器。 CuVSLAM 支持使用一个只有32个 RGB 摄像头的任意几何配置, 从而支持一系列广泛的机器人设置。 CuVSLAM 被特别优化, 利用 CUDA 实时应用, 在像 NVIDIA Jetson 这样的边缘计算设备上部署最低计算间接费用的实时应用。 我们展示了 CuVSLAM 的设计和实施, 实例使用案例, 以及几个最先进的基准的经验性结果, 展示了 CuVSLAM 的最佳水平性能 。
Article 59
Title@2025-06-23 (1): Code Graph Model (CGM): A Graph-Integrated Large Language Model for Repository-Level Software Engineering Tasks
Title: Code Graph Model (CGM): A Graph-Integrated Large Language Model for Repository-Level Software Engineering Tasks | Code Graph Model (CGM): Ein Graph-integriertes Large Language Model für Repository-Level Software Engineering Aufgaben | 代码图表模型(CGM):存储层软件工程任务 2505.16901v4 |
Authors (15): Hongyuan Tao, Ying Zhang, Zhenhao Tang, Hongen Peng, Xukun Zhu, Bingchang Liu, Yingguang Yang, Ziyin Zhang, Zhaogui Xu, Haipeng Zhang, Linchao Zhu, Rui Wang, Hang Yu, Jianguo Li, Peng Di
Recent advances in Large Language Models (LLMs) have shown promise in function-level code generation, yet repository-level software engineering tasks remain challenging. Current solutions predominantly rely on proprietary LLM agents, which introduce unpredictability and limit accessibility, raising concerns about data privacy and model customization. This paper investigates whether open-source LLMs can effectively address repository-level tasks without requiring agent-based approaches. We demonstrate this is possible by enabling LLMs to comprehend functions and files within codebases through their semantic information and structural dependencies. To this end, we introduce Code Graph Models (CGMs), which integrate repository code graph structures into the LLM’s attention mechanism and map node attributes to the LLM’s input space using a specialized adapter. When combined with an agentless graph RAG framework, our approach achieves a 43.00% resolution rate on the SWE-bench Lite benchmark using the open-source Qwen2.5-72B model. This performance ranks first among open weight models, second among methods with open-source systems, and eighth overall, surpassing the previous best open-source model-based method by 12.33%.
大语言模型(LLMS)最近的进展显示,在功能层面代码生成方面有希望,但存储库级软件工程任务仍具有挑战性。目前的解决办法主要依赖专有的LLM代理商,这些代理商引入了不可预测性和限制可访问性,引起了对数据隐私和模式定制的关切。本文调查了开放源LLMs能否在不需要代理商方法的情况下有效处理存储层任务。我们通过使LLMs能够通过其语义信息和结构依赖性来理解代码库内的功能和文件而证明这是可能的。为此,我们引入了代码图表模型,将存储库代码图形结构纳入LLM的注意机制,并用专门调整器绘制LLMM输入空间的节点。当与无代理图RAG框架相结合时,我们的方法在使用开放源 Quen2.5-72B模型的SWE-Ben Lite基准上实现了43.00%的分辨率。这一性能在开放量模型中排第一,在开放源系统方法中排第二,第八位,在12.33%的开放源模式上超过了先前的最佳开放源模型方法。
Article 60
Title@2025-06-23 (1): Black-Box Test Code Fault Localization Driven by Large Language Models and Execution Estimation
Title: Black-Box Test Code Fault Localization Driven by Large Language Models and Execution Estimation | Black-Box Test Code Fehler Lokalisierung angetrieben durch große Sprachmodelle und Ausführung Schätzung | 由大语言模型和执行估计驱动的黑牛测试代码 2506.19045v1 |
Authors (6): Ahmadreza Saboor Yaraghi, Golnaz Gharachorlu, Sakina Fatima, Lionel C. Briand, Ruiyuan Wan, Ruifeng Gao
Fault localization (FL) is a critical step in debugging, which typically relies on repeated executions to pinpoint faulty code regions. However, repeated executions can be impractical in the presence of non-deterministic failures or high execution costs. While recent efforts have leveraged Large Language Models (LLMs) to aid execution-free FL, these have primarily focused on identifying faults in the system under test (SUT) rather than in the often complex system test code. However, the latter is also important as, in practice, many failures are triggered by faulty test code. To overcome these challenges, we introduce a fully static, LLM-driven approach for system test code fault localization (TCFL) that does not require executing the test case. Our method uses a single failure execution log to estimate the test’s execution trace through three novel algorithms that identify only code statements likely involved in the failure. This pruned trace, combined with the error message, is used to prompt the LLM to rank potential faulty locations. Our black-box, system-level approach requires no access to the SUT source code and is applicable to large test scripts that assess full system behavior. We evaluate our technique at function, block, and line levels using an industrial dataset of faulty test cases not previously used in pre-training LLMs. Results show that our best estimated traces closely match actual traces, with an F1 score of around 90%. Additionally, pruning the complex system test code reduces the LLM’s inference time by up to 34% without any loss in FL performance. Our results further suggest that block-level TCFL offers a practical balance, narrowing the search space while preserving useful context, achieving an 81% hit rate at top-3 (Hit@3).
故障定位(FL)是调试中的关键步骤,通常依赖重复执行来定位有缺陷的代码区域。然而,在存在非确定性故障或执行成本较高的情况下,重复执行可能并不现实。尽管近期的研究已利用大语言模型(LLM)来辅助无需执行的故障定位,但这些工作主要关注被测系统(SUT)中的故障,而非通常较为复杂的系统测试代码中的故障。然而后者同样重要,因为在实践中,许多失败是由有缺陷的测试代码引发的。为应对这些挑战,我们提出了一种完全静态的、由 LLM 驱动的系统测试代码故障定位(TCFL)方法,无需执行测试用例。我们的方法利用单个失败执行日志,通过三种新算法估计测试的执行轨迹,仅识别可能与该失败相关的代码语句。经过裁剪的轨迹与错误信息一起用于提示 LLM 对潜在的故障位置进行排序。我们的黑盒、系统级方法无需访问 SUT 源代码,并且适用于评估完整系统行为的大型测试脚本。
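The two metrics quoted in the abstract above (trace F1 and Hit@3) can be illustrated with a minimal sketch. The function names, the block identifiers, and the toy data are assumptions made for illustration; they do not come from the paper.

```python
def trace_f1(estimated, actual):
    """F1 between an estimated and an actual execution trace, both given as sets of statement ids."""
    true_positives = len(estimated & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(estimated)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


def hit_at_k(rankings, ground_truth, k=3):
    """Fraction of failing tests whose top-k ranked locations contain a truly faulty one."""
    hits = sum(1 for case, ranked in rankings.items()
               if set(ranked[:k]) & ground_truth[case])
    return hits / len(rankings)


# Hypothetical data: ranked code blocks per failing test, and the faulty blocks.
rankings = {
    "t1": ["block_4", "block_9", "block_2", "block_7"],
    "t2": ["block_1", "block_3", "block_5"],
    "t3": ["block_8", "block_6", "block_2", "block_0"],
}
truth = {"t1": {"block_2"}, "t2": {"block_7"}, "t3": {"block_8"}}

f1 = trace_f1({1, 2, 3, 5}, {1, 2, 3, 4})
hit3 = hit_at_k(rankings, truth, k=3)
hit1 = hit_at_k(rankings, truth, k=1)
```

Here two of the three toy tests have a faulty block among the top three candidates, so `hit3` is 2/3, while only one has it ranked first.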
Article 61
Title@2025-06-23 (1): A Comprehensive Study of Machine Learning Techniques for Log-Based Anomaly Detection
Title: A Comprehensive Study of Machine Learning Techniques for Log-Based Anomaly Detection | Eine umfassende Untersuchung von Techniken des maschinellen Lernens zur logbasierten Anomalieerkennung | 全面研究用于基于日志异常探测的机器学习技术 2307.16714v5 |
Authors (5): Shan Ali, Chaima Boufaied, Domenico Bianculli, Paula Branco, Lionel Briand
Growth in system complexity increases the need for automated log analysis techniques, such as Log-based Anomaly Detection (LAD). While deep learning (DL) methods have been widely used for LAD, traditional machine learning (ML) techniques can also perform well depending on the context and dataset. Semi-supervised techniques deserve the same attention as they offer practical advantages over fully supervised methods. Current evaluations mainly focus on detection accuracy, but this alone is insufficient to determine the suitability of a technique for a given LAD task. Other aspects to consider include training and prediction times as well as the sensitivity to hyperparameter tuning, which in practice matters to engineers. This paper presents a comprehensive empirical study evaluating a wide range of supervised and semi-supervised, traditional and deep ML techniques across four criteria: detection accuracy, time performance, and sensitivity to hyperparameter tuning in both detection accuracy and time performance. The experimental results show that supervised traditional and deep ML techniques fare similarly in terms of their detection accuracy and prediction time on most of the benchmark datasets considered in our study. Moreover, overall, sensitivity analysis to hyperparameter tuning with respect to detection accuracy shows that supervised traditional ML techniques are less sensitive than deep learning techniques. Further, semi-supervised techniques yield significantly worse detection accuracy than supervised techniques.
系统复杂程度的增长增加了对自动日志分析技术的需求,例如基于日志的异常探测(LAD)等,增加了对自动日志分析技术的需求。虽然在LAD中广泛使用了深度学习(DL)方法,但传统的机器学习(ML)技术根据背景和数据集也可以很好地发挥作用。半监督技术值得同等关注,因为它们在完全监督的方法方面提供了实际优势。目前的评价主要侧重于检测准确性,但仅此一项还不足以确定某项LAD任务中某项技术的适宜性。其他方面包括培训和预测时间以及对超光谱调的敏感性,这些在实际操作中与工程师有关。本文提供了一项全面的实证研究,评估了广泛的有监督和半监督的、传统和深度的ML技术,这四大标准是:检测准确性、时间性能和对超参数调的敏感性,这四大标准是:检测准确性能、时间性能和对检测准确性能的敏感性。实验结果表明,在检测大多数基准数据集的准确性和预测时间方面,受监督的传统和深度的ML技术与受监督的精度相当。
Article 62
Title@2025-06-23 (1): Understanding Software Engineering Agents: A Study of Thought-Action-Result Trajectories
Title: Understanding Software Engineering Agents: A Study of Thought-Action-Result Trajectories | Software Engineering Agents verstehen: Eine Studie über Gedanken-Action-Result-Trajektorien | 了解软件工程剂:关于思想-行动-结果轨迹的研究 2506.18824v1 |
Authors (2): Islem Bouzenia, Michael Pradel
Large Language Model (LLM)-based agents are increasingly employed to automate complex software engineering tasks such as program repair and issue resolution. These agents operate by autonomously generating natural language thoughts, invoking external tools, and iteratively refining their solutions. Despite their widespread adoption, the internal decision-making processes of these agents remain largely unexplored, limiting our understanding of their operational dynamics and failure modes. In this paper, we present a large-scale empirical study of the thought-action-result trajectories of three state-of-the-art LLM-based agents: RepairAgent, AutoCodeRover, and OpenHands. We unify their interaction logs into a common format, capturing 120 trajectories and 2822 LLM interactions focused on program repair and issue resolution. Our study combines quantitative analyses of structural properties, action patterns, and token usage with qualitative assessments of reasoning coherence and feedback integration. We identify key trajectory characteristics such as iteration counts and token consumption, recurring action sequences, and the semantic coherence linking thoughts, actions, and their results. Our findings reveal behavioral motifs and anti-patterns that distinguish successful from failed executions, providing actionable insights for improving agent design, including prompting strategies, failure diagnosis, and anti-pattern detection. We release our dataset and annotation framework to support further research on transparent and robust autonomous software engineering agents.
大型语言模型(LLM)代理商越来越多地被用于将程序修理和问题解答等三个最先进的LLM代理商的思维-行动轨迹自动化,这些代理商通过自主生成自然语言思维,援引外部工具,并反复完善其解决方案。尽管这些代理商的内部决策过程被广泛采用,但基本上仍未探索,限制了我们对操作动态和故障模式的理解。在本文件中,我们对三个最先进的LLM代理商的思维-行动-结果轨迹进行了大规模的经验性研究,如:Textsc{RepairA}、\textsc{AutoCodeRover}和\textsc{OpenHands}。我们将其互动日志统一成一个共同格式,捕捉120个轨迹和2822LLM互动,侧重于方案修理和问题解决。我们的研究将结构属性、行动模式和象征性使用与对逻辑一致性和反馈整合的定性评估结合起来。我们确定了关键轨迹特征,如消费和符号消费、经常性行动序列、重复的动作序列、以及从我们的成功的诊断结果、改进行动、不成功的解析的动作和动作分析。
Article 63
Title@2025-06-23 (1): Context-Aware CodeLLM Eviction for AI-assisted Coding
Title: Context-Aware CodeLLM Eviction for AI-assisted Coding | Context-Aware CodeLLM Eviction für KI-unterstützte Coding | 使用 AI 辅助编码的内装软件 coolLLM 驱逐 2506.18796v1 |
Authors (4): Kishanthan Thangarajah, Boyuan Chen, Shi Chang, Ahmed E. Hassan
AI-assisted coding tools powered by Code Large Language Models (CodeLLMs) are increasingly integrated into modern software development workflows. To address concerns around privacy, latency, and model customization, many enterprises opt to self-host these models. However, the diversity and growing number of CodeLLMs, coupled with limited accelerator memory, introduce practical challenges in model management and serving efficiency. This paper presents CACE, a novel context-aware model eviction strategy designed specifically to optimize self-hosted CodeLLM serving under resource constraints. Unlike traditional eviction strategies based solely on recency (e.g., Least Recently Used), CACE leverages multiple context-aware factors, including model load time, task-specific latency sensitivity, expected output length, and recent usage and future demand tracked through a sliding window. We evaluate CACE using realistic workloads that include both latency-sensitive code completion and throughput-intensive code reasoning tasks. Our experiments show that CACE reduces Time-to-First-Token (TTFT) and end-to-end (E2E) latency, while significantly lowering the number of model evictions compared to state-of-the-art systems. Ablation studies further demonstrate the importance of multi-factor eviction in balancing responsiveness and resource efficiency. This work contributes practical strategies for deploying scalable, low-latency AI coding assistants in real-world software engineering environments.
由《守则》大语言模型(CodeLLLM)驱动的由AI协助的编码工具日益被纳入现代软件开发工作流程。为了解决对隐私、延迟和模式定制的关切,许多企业选择自己主持这些模型,但是,由于代码LLM的多样性和数量不断增加,加上有限的加速记忆,在模式管理和服务效率方面提出了实际挑战。本文介绍了CACE, 这是一种符合背景的新模式驱逐战略,专门为在资源制约下优化自办代码LLLM服务而设计。与仅基于耐用性(例如,最不新近使用)的传统驱逐战略不同,CACE利用多种环境意识因素,包括模型负荷时间、特定延迟敏感度、预期产出长度、近期使用量和通过滑动窗口跟踪的未来需求。我们利用现实的工作量对CACE进行了评估,这些工作量既包括耐用敏感代码完成,也包括吞吐费密集的代码推理任务。我们的实验表明CACE降低了时间对一流(TTFTFT)和终端至终端(E2E)的驱逐战略。与此同时,CACE利用了多种环境意识,大大降低了模型对驱逐战略的实际反应能力,并且进一步展示了在州内调整。
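To make the contrast with recency-only eviction concrete, here is a minimal sketch of a multi-factor eviction score. It is not CACE's actual formula: the weights, the field names, and the linear scoring rule are assumptions, and the sketch covers only load time, latency sensitivity, and sliding-window usage (expected output length and future demand are left out).

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class ModelStats:
    name: str
    load_time_s: float        # cost of reloading the model after an eviction
    latency_sensitive: bool   # e.g. code completion vs. batch code reasoning
    recent_requests: deque = field(default_factory=deque)  # request timestamps

    def trim(self, now, window_s):
        # Drop requests that have fallen out of the sliding window.
        while self.recent_requests and self.recent_requests[0] < now - window_s:
            self.recent_requests.popleft()


def keep_score(m, now, window_s=60.0, w_load=1.0, w_use=2.0, w_lat=5.0):
    """Higher score = more valuable to keep resident (hypothetical weights)."""
    m.trim(now, window_s)
    return (w_load * m.load_time_s
            + w_use * len(m.recent_requests)
            + (w_lat if m.latency_sensitive else 0.0))


def choose_victim(models, now, window_s=60.0):
    """Evict the model with the lowest keep score."""
    return min(models, key=lambda m: keep_score(m, now, window_s)).name


models = [
    ModelStats("completion", 5.0, True, deque([100.0, 110.0, 115.0])),
    ModelStats("reasoning", 30.0, False, deque([10.0])),
    ModelStats("chat", 8.0, False, deque([118.0])),
]
lru_victim = min(models, key=lambda m: m.recent_requests[-1]).name
victim = choose_victim(models, now=120.0)
```

In this toy setup a pure LRU policy would evict `reasoning`, the model that is most expensive to reload, whereas the multi-factor score evicts `chat`, which is cheap to reload and only lightly used.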
Article 64
Title@2025-06-23 (1): FORGE: An LLM-driven Framework for Large-Scale Smart Contract Vulnerability Dataset Construction
Title: FORGE: An LLM-driven Framework for Large-Scale Smart Contract Vulnerability Dataset Construction | FORGE: Ein LLM-gesteuertes Framework für großflächige Smart Contract Vulnerability Dataset Construction | FORGE:由LLM驱动的大型智能合同脆弱性数据集构建框架 2506.18795v1 |
Authors (10): Jiachi Chen, Yiming Shen, Jiashuo Zhang, Zihao Li, John Grundy, Zhenzhe Shao, Yanlin Wang, Jiashui Wang, Ting Chen, Zibin Zheng
High-quality smart contract vulnerability datasets are critical for evaluating security tools and advancing smart contract security research. Two major limitations of current manual dataset construction are (1) labor-intensive and error-prone annotation processes, which limit the scale, quality, and evolution of the dataset, and (2) the absence of standardized classification rules, which results in inconsistent vulnerability categories and labeling results across different datasets. To address these limitations, we present FORGE, the first automated approach for constructing smart contract vulnerability datasets. FORGE leverages an LLM-driven pipeline to extract high-quality vulnerabilities from real-world audit reports and classify them according to the CWE, the most widely recognized classification in software security. FORGE employs a divide-and-conquer strategy to extract structured and self-contained vulnerability information from these reports. Additionally, it uses a tree-of-thoughts technique to classify the vulnerability information into the hierarchical CWE classification. To evaluate FORGE’s effectiveness, we run FORGE on 6,454 real-world audit reports and generate a dataset comprising 81,390 Solidity files and 27,497 vulnerability findings across 296 CWE categories. Manual assessment of the dataset demonstrates high extraction precision and classification consistency with human experts (precision of 95.6% and inter-rater agreement k-$\alpha$ of 0.87). We further validate the practicality of our dataset by benchmarking 13 existing security tools on it. The results reveal significant limitations in current detection capabilities. Furthermore, by analyzing the severity-frequency distribution patterns through a unified CWE perspective in our dataset, we highlight the inconsistency between the current smart contract research focus and the priorities identified from real-world vulnerabilities…
高品质的智能合同脆弱性数据集对于评估安全工具和推进智能合同安全研究至关重要。当前手工数据集结构的两大限制是:(1) 劳动力密集和容易出错的批注流程限制数据集的规模、质量和演变,(2) 标准化分类规则的缺乏导致不同数据集的脆弱性类别和标签结果不一致。为克服这些限制,我们介绍了建立智能合同脆弱性数据集的第一种自动化方法FORGE,这是建立智能合同脆弱性数据集的第一种自动化方法。FORGE利用由LLOM驱动的管道从真实世界审计报告中提取高质量的脆弱性,并按照软件安全方面最公认的分类CWE进行分类。FORGE采用差异和易出错的批注程序,以从这些报告中提取结构化和自成一体的脆弱性信息。此外,它使用“一棵一树”技术将脆弱性信息分类为CWE分类等级。为了评估FORGE的效能,我们运行了6 454份真实世界审计报告,根据296 CWE类别中最广为人所知的准确性和27,497个脆弱性分析结果。我们用SLI-LILILA标准数据定义的精确度数据定位和精确度数据对比,我们现有13SLILILLLLLA的数据定比值的精确数据,我们现有安全工具的精确和精确度数据比值。
Article 65
Title@2025-06-23 (1): ModeliHub: A Web-based, Federated Analytics Platform for Modelica-centric, Model-based Systems Engineering
Title: ModeliHub: A Web-based, Federated Analytics Platform for Modelica-centric, Model-based Systems Engineering | ModeliHub: Eine Web-basierte, Federated Analytics Plattform für modellisch-zentrierte, modellbasierte Systemtechnik | 模型Hub:一个基于网络的、以模型为中心的、以模型为基础的系统工程联合会分析平台 2506.18790v1 |
Authors (1): Mohamad Omar Nachawati
This paper introduces ModeliHub, a Web-based, federated analytics platform designed specifically for model-based systems engineering with Modelica. ModeliHub’s key innovation lies in its Modelica-centric, hub-and-spoke federation architecture that provides systems engineers with a Modelica-based, unified system model of repositories containing heterogeneous engineering artifacts. From this unified system model, ModeliHub’s Virtual Twin engine provides a real-time, interactive simulation environment for deploying Modelica simulation models that represent digital twins of the virtual prototype of the system under development at a particular iteration of the iterative systems engineering life cycle. The implementation of ModeliHub is centered around its extensible, Modelica compiler frontend developed in Isomorphic TypeScript that can run seamlessly across browser, desktop and server environments. This architecture aims to strike a balance between rigor and agility, enabling seamless integration and analysis across various engineering domains.
本文介绍MmodeliHub, 这是一个基于网络的、联合分析平台,专门为与Mmodelica合作的基于模型的系统工程设计设计而设计的模型化分析平台。MmodiHub的关键创新在于其以中枢、中标和单调为主的模型化联邦架构,它为系统工程师提供了一个包含多种工程文物的模型化、统一的储存库系统模型。在这个统一的系统模型模型中,MmodiHub的虚拟双胞胎引擎提供了一个实时、互动模拟环境,用于在迭代系统工程生命周期的特定迭代时,部署代表正在开发的系统虚拟原型数字双胞胎的模型。MmodeliHub的实施围绕其可扩展性,即在Isoftic TystemScript开发的模型编译前端,可以在浏览器、桌面和服务器环境之间无缝不缝地运行。这一架构的目的是在固定和灵活性之间实现平衡,使各种工程领域的无缝整合和分析得以实现。
Article 66
Title@2025-06-23 (1): Working Document – Formalising Software Requirements with Large Language Models
Title: Working Document – Formalising Software Requirements with Large Language Models | Arbeitsdokument – Formalisierung von Softwareanforderungen mit großen Sprachmodellen | 工作文件 – – 用大语言模式正式确定软件要求 2506.14627v2 |
Authors (3): Arshad Beg, Diarmuid O’Donoghue, Rosemary Monahan
This draft is a working document that summarizes ninety-four (94) papers, with additional sections on Traceability of Software Requirements (Section 4), Formal Methods and Its Tools (Section 5), and Unifying Theories of Programming (UTP) and Theory of Institutions (Section 6). Please refer to the abstracts of [7,8]. The key differences between this draft and our other recent ones with similar titles, i.e. AACS 2025 [7] and SAIV 2025 [8], are as follows. [7] is a two-page submission to the ADAPT Annual Conference, Ireland. Submitted on 18 March 2025, it went through a light-weight blind review and was accepted for poster presentation; the conference was held on 15 May 2025. [8] is a nine-page paper with an additional nine pages of references and summary tables, submitted to the Symposium on AI Verification (SAIV 2025) on 24 April 2025. It went through a rigorous review process. The version uploaded to arXiv.org [8] is an improved version of the submission, revised to address the specific suggestions for improving the paper.
这份草案是一份工作文件,其中附有关于软件要求的可追踪性(第4节)、正式方法及其工具(第5节)、统一方案拟订理论(UTP)和机构理论(第6节)的附加章节(第7、7、8节),这份草案与我们最近预期的类似标题(即AACS 2025[7]和SAIV 2025[8])的主要区别是:[7]份是提交给爱尔兰ADAPT年度会议的两页文件,2025年3月18日提交,经过轻度盲视审查,并被接受为海报展示。会议于2025年5月15日举行;[8]是一份九页文件,增加了九页参考资料和摘要表,于2025年4月24日提交AI核查专题讨论会(SAIV 2025),经过严格的审查过程。
Article 67
Title@2025-06-23 (1): The Impact of Input Order Bias on Large Language Models for Software Fault Localization
Title: The Impact of Input Order Bias on Large Language Models for Software Fault Localization | Die Auswirkungen der Eingabereihenfolge Bias auf große Sprachmodelle für Softwarefehlerlokalisierung | 输入顺序对软件失错本地化大语言模式的影响 2412.18750v3 |
Authors (4): Md Nakhla Rafi, Dong Jae Kim, Tse-Hsun Chen, Shaowei Wang
Large Language Models (LLMs) have shown significant potential in software engineering tasks such as Fault Localization (FL) and Automatic Program Repair (APR). This study investigates how input order and context size influence LLM performance in FL, a crucial step for many downstream software engineering tasks. We evaluate different method orderings using Kendall Tau distances, including “perfect” (where ground truths appear first) and “worst” (where ground truths appear last), across two benchmarks containing Java and Python projects. Our results reveal a strong order bias: in Java projects, Top-1 FL accuracy drops from 57% to 20% when reversing the order, while in Python projects, it decreases from 38% to approximately 3%. However, segmenting inputs into smaller contexts mitigates this bias, reducing the performance gap in FL from 22% and 6% to just 1% across both benchmarks. We replaced method names with semantically meaningful alternatives to determine whether this bias is due to data leakage. The observed trends remained consistent, suggesting that the bias is not caused by memorization from training data but rather by the inherent effect of input order. Additionally, we explored ordering methods based on traditional FL techniques and metrics, finding that DepGraph’s ranking achieves 48% Top-1 accuracy, outperforming simpler approaches such as CallGraph(DFS). These findings highlight the importance of structuring inputs, managing context effectively, and selecting appropriate ordering strategies to enhance LLM performance in FL and other software engineering applications.
大型语言模型(LLMS) 展示了在软件工程任务中的巨大潜力, 如错误本地化(FL) 和自动程序修补(APR) 。 本研究调查了投入顺序和背景大小如何影响FLL的LLM绩效,这是许多下游软件工程任务的一个关键步骤。 我们评估了使用Kendall Tau 距离的不同方法,包括两个基准的“完美”(地面真相首先出现的地方)和“扭曲”(地面真相最后出现的地方) , 包括Java 和 Python 项目的两个基准。 我们的结果揭示了强烈的顺序偏差:在爪哇项目中,顶层-1 FL 精确度从57%下降到20 %,而在平通项目中,它从38%下降到大约3%。然而,将投入分割成较小的环境来减轻这种偏差,将两种基准中的FL的绩效差距从22%和6%缩小到仅仅1%。 我们用具有内在意义的替代方法命名方法,以确定这种偏差是否归因于数据泄漏。 观察到的趋势仍然一致, 表明我们偏差的原因不是因为培训中的模缩缩图,而是要从培训数据,而是通过S压的精准的F的精准性方法, 。
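The Kendall Tau distance used above to compare method orderings counts the pairs of items that appear in opposite relative order in two rankings. A minimal sketch follows; the method names are placeholders, not taken from the benchmarks.

```python
from itertools import combinations


def kendall_tau_distance(order_a, order_b):
    """Number of item pairs ranked in opposite relative order by the two orderings."""
    pos_b = {item: i for i, item in enumerate(order_b)}
    # combinations() yields pairs (x, y) with x before y in order_a.
    return sum(1 for x, y in combinations(order_a, 2) if pos_b[x] > pos_b[y])


def normalized_distance(order_a, order_b):
    """Distance scaled to [0, 1]; 1.0 means one ordering is the exact reverse of the other."""
    n = len(order_a)
    return kendall_tau_distance(order_a, order_b) / (n * (n - 1) / 2)


# "Perfect" ordering puts the faulty method first; "worst" puts it last.
perfect = ["m_fault", "m_a", "m_b", "m_c"]
worst = list(reversed(perfect))
near_perfect = ["m_a", "m_fault", "m_b", "m_c"]

d_worst = kendall_tau_distance(perfect, worst)
d_near = kendall_tau_distance(perfect, near_perfect)
d_norm = normalized_distance(perfect, worst)
```

With four methods there are six pairs, so the fully reversed ordering has distance 6 (normalized 1.0), while swapping the first two methods gives distance 1.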
Article 68
Title@2025-06-23 (1): Piloting Copilot, Codex, and StarCoder2: Hot Temperature, Cold Prompts, or Black Magic?
Title: Piloting Copilot, Codex, and StarCoder2: Hot Temperature, Cold Prompts, or Black Magic? | Pilotieren von Copilot, Codex und StarCoder2: Heiße Temperatur, kalte Prompts oder schwarze Magie? | 联合飞行员 代码代码和星际代码2: 热温、冷感或黑魔法? 2210.14699v3 |
Authors (5): Jean-Baptiste Döderlein, Nguessan Hermann Kouadio, Mathieu Acher, Djamel Eddine Khelladi, Benoit Combemale
Language models are promising solutions for tackling increasingly complex problems. In software engineering, they recently gained attention in code assistants, which generate programs from a natural language task description (prompt). They have the potential to save time and effort but remain poorly understood, limiting their optimal use. In this article, we investigate the impact of input variations on two configurations of a language model, focusing on parameters such as task description, surrounding context, model creativity, and the number of generated solutions. We design specific operators to modify these inputs and apply them to three LLM-based code assistants (Copilot, Codex, StarCoder2) and two benchmarks representing algorithmic problems (HumanEval, LeetCode). Our study examines whether these variations significantly affect program quality and how these effects generalize across models. Our results show that varying input parameters can greatly improve performance, achieving up to 79.27% success in one-shot generation compared to 22.44% for Codex and 31.1% for Copilot in default settings. Actioning this potential in practice is challenging due to the complex interplay in our study: the optimal settings for temperature, prompt, and number of generated solutions vary by problem. Reproducing our study with StarCoder2 confirms these findings, indicating they are not model-specific. We also uncover surprising behaviors (e.g., fully removing the prompt can be effective), revealing model brittleness and areas for improvement.
语言模型是解决日益复杂的问题的有希望的解决方案。在软件工程中,它们最近得到了代码助理的注意,代码助理从自然语言任务描述(即速)中生成了程序。它们具有节省时间和精力的潜力,但是仍然无法很好地理解,限制了它们的最佳使用。在本条中,我们调查了投入差异对语言模型两种配置的影响,侧重于任务描述、周围环境、模型创造力和生成的解决方案的数量等参数。我们设计了具体操作员来修改这些投入,并将其应用到三个基于LLLM的代码助理(Coitol、codex、StarCoder2)和两个代表算法问题的基准(HumanEval、LeetCode)。我们的研究审查了这些差异是否显著地影响方案质量和这些影响如何贯穿各种模型。我们的研究结果表明,不同的投入参数可以大大改善业绩,在一集中达到79.27%的成功率,而在代码x为22.44 %,在默认环境下为Codor 模型设计为31.1%。由于我们的研究中复杂的相互作用,这种潜力在实际中的行动具有挑战性――温度、迅速和生成的解决方案的最佳环境和数量因问题而不同而变化。我们的研究,我们用的是:用StarderStarder2 正在充分地展示这些研究。我们的研究,我们的研究可以确认这些研究。我们的研究,要揭示这些结果。我们的研究,我们的行为可以迅速地表明这些结果。
Article 69
Title@2025-06-23 (1): MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems
Title: MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems | MORTAR: Multiturn Metamorphic Testing für LLM-basierte Dialogsysteme | MORTAR:以LLM为基础的对话系统的多轨变形测试 2412.15557v3 |
Authors (6): Guoxiang Guo, Aldeida Aleti, Neelofar Neelofar, Chakkrit Tantithamthavorn, Yuanyuan Qi, Tsong Yueh Chen
With the widespread application of LLM-based dialogue systems in daily life, quality assurance has become more important than ever. Recent research has successfully introduced methods to identify unexpected behaviour in single-turn testing scenarios. However, multi-turn interaction is the common real-world usage of dialogue systems, yet testing methods for such interactions remain underexplored. This is largely due to the oracle problem in multi-turn testing, which continues to pose a significant challenge for dialogue system developers and researchers. In this paper, we propose MORTAR, a metamorphic multi-turn dialogue testing approach, which mitigates the test oracle problem in testing LLM-based dialogue systems. MORTAR formalises the multi-turn testing for dialogue systems, and automates the generation of question-answer dialogue test cases with multiple dialogue-level perturbations and metamorphic relations (MRs). The automated MR matching mechanism allows MORTAR more flexibility and efficiency in metamorphic testing. The proposed approach is fully automated without reliance on LLM judges. In testing six popular LLM-based dialogue systems, MORTAR reaches significantly better effectiveness with over 150% more bugs revealed per test case when compared to the single-turn metamorphic testing baseline. Regarding the quality of bugs, MORTAR reveals higher-quality bugs in terms of diversity, precision and uniqueness. MORTAR is expected to inspire more multi-turn testing approaches, and assist developers in evaluating the dialogue system performance more comprehensively with constrained test resources and budget.
由于在日常生活中广泛应用基于LLM的LLM对话系统,质量保证比以往任何时候都更加重要。最近的研究成功地引进了在单点测试情景中查明出意想不到行为的方法。然而,多方向互动是对话系统通常的现实世界使用情况,但这种互动的测试方法仍然未得到充分探讨。这主要是由于多点测试中的孔雀问题,这继续对对话系统开发者和研究人员构成重大挑战。在本文件中,我们提议采用一个变形多点对话开发者测试方法,即MORTAR,它减轻了测试基于LLM的对话系统的测试问题。MOR将对话系统的多点测试正式化,并将生成问答对话测试的个案与多个对话层面的扰动和变异关系自动连接起来。自动化的MR匹配机制使MARTAR在对话系统测试中具有更大的灵活性和效率,不依赖LLMM法官。在测试六个广点的LMM对话系统时,MOTRAR取得了显著的效益,在测试单一基准测试时,对标准质量进行更精确的测试,对标准进行更精确性测试。
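The idea of a dialogue-level perturbation checked by a metamorphic relation, with no LLM judge, can be sketched as follows. Everything here is hypothetical: the toy dialogue system with a seeded bug, the perturbation (prepending an unrelated turn), and the relation (answers to the original questions must not change) are illustrative stand-ins, not MORTAR's actual operators.

```python
def toy_dialogue_system(history, question):
    """Stand-in for the system under test, with a seeded bug in long dialogues."""
    answers = {"capital of France?": "Paris", "2 + 2?": "4"}
    if len(history) >= 2 and question == "2 + 2?":
        return "5"  # seeded bug: wrong answer once two earlier turns exist
    return answers.get(question, "unknown")


def run_dialogue(system, questions):
    """Ask the questions in order, feeding the growing history back to the system."""
    history, answers = [], {}
    for q in questions:
        a = system(history, q)
        history.append((q, a))
        answers[q] = a
    return answers


def insert_irrelevant_turn(questions, filler="weather today?"):
    """Dialogue-level perturbation: prepend a turn unrelated to the test questions."""
    return [filler] + list(questions)


def check_relation(system, questions, perturb):
    """Metamorphic relation: the perturbation must not change any original answer."""
    source = run_dialogue(system, questions)
    follow_up = run_dialogue(system, perturb(questions))
    return [q for q in questions if source[q] != follow_up[q]]


violations = check_relation(
    toy_dialogue_system,
    ["capital of France?", "2 + 2?"],
    insert_irrelevant_turn,
)
```

The check compares the source and follow-up dialogues with each other, so no ground-truth answer is needed; in this toy run the relation is violated for `"2 + 2?"`.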
Article 70
Title@2025-06-23 (1): Automatic Selection of Protections to Mitigate Risks Against Software Applications
Title: Automatic Selection of Protections to Mitigate Risks Against Software Applications | Automatische Auswahl von Schutzsystemen, um Risiken gegen Software-Anwendungen abzumildern | 自动选择防范软件应用风险的防范措施 2506.18470v1 |
Authors (4): Daniele Canavese, Leonardo Regano, Bjorn De Sutter, Cataldo Basile
This paper introduces a novel approach for the automated selection of software protections to mitigate MATE risks against critical assets within software applications. We formalize the key elements involved in protection decision-making - including code artifacts, assets, security requirements, attacks, and software protections - and frame the protection process through a game-theoretic model. In this model, a defender strategically applies protections to various code artifacts of a target application, anticipating repeated attack attempts by adversaries against the confidentiality and integrity of the application’s assets. The selection of the optimal defense maximizes resistance to attacks while ensuring the application remains usable by constraining the overhead introduced by protections. The game is solved through a heuristic based on a mini-max depth-first exploration strategy, augmented with dynamic programming optimizations for improved efficiency. Central to our formulation is the introduction of the Software Protection Index, an original contribution that extends existing notions of potency and resilience by evaluating protection effectiveness against attack paths using software metrics and expert assessments. We validate our approach through a proof-of-concept implementation and expert evaluations, demonstrating that automated software protection is a practical and effective solution for risk mitigation in software.
本文介绍了自动选择软件保护以降低对软件应用中关键资产的风险的软件保护的新做法。我们正式确定了保护决策中涉及的关键要素,包括代码文物、资产、安全要求、攻击和软件保护,并通过游戏理论模型来规范保护过程。在这个模型中,捍卫者战略性地对目标应用中的各种代码文物实施保护,预测对手一再企图攻击应用资产的保密性和完整性。选择最佳防御措施最大限度地防止攻击,同时通过限制保护带来的间接费用来保证应用。游戏的解决方式是以小型深度第一探索战略为基础,辅之以动态程序优化以提高效率。我们制定软件保护指数的核心是引入软件保护指数,这是通过使用软件测量和专家评估评估对攻击路径进行保护的有效性的原始贡献。我们通过验证概念实施和专家评估来验证我们的做法,表明自动软件保护是软件减少风险的实用和有效解决方案。
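As a rough illustration of a mini-max, depth-first search over protection choices under an overhead budget, consider the sketch below. The attack and protection tables, the multiplicative effect model, and the budget values are invented for the example; the paper's actual game model, Software Protection Index, and dynamic programming optimizations are not reproduced here.

```python
# Hypothetical base success probability of each attack.
ATTACKS = {"static_analysis": 0.9, "debugging": 0.8, "tampering": 0.7}

# Hypothetical protections: name -> (overhead, {attack: success multiplier}).
PROTECTIONS = {
    "obfuscation": (3, {"static_analysis": 0.5}),
    "anti_debug": (2, {"debugging": 0.25}),
    "checksum": (2, {"tampering": 0.5}),
}


def attacker_value(applied):
    """Max step: the attacker picks the attack with the highest residual success."""
    best = 0.0
    for attack, p in ATTACKS.items():
        for name in applied:
            p *= PROTECTIONS[name][1].get(attack, 1.0)
        best = max(best, p)
    return best


def defend(names, budget, applied=()):
    """Min step: depth-first search over skip/apply decisions within the budget."""
    if not names:
        return attacker_value(applied), applied
    head, rest = names[0], names[1:]
    best = defend(rest, budget, applied)           # branch 1: skip this protection
    cost = PROTECTIONS[head][0]
    if cost <= budget:                             # branch 2: apply it if affordable
        candidate = defend(rest, budget - cost, applied + (head,))
        if candidate[0] < best[0]:
            best = candidate
    return best


all_names = tuple(PROTECTIONS)
tight = defend(all_names, budget=4)
loose = defend(all_names, budget=7)
```

With a budget of 4 the search settles on `obfuscation` alone, leaving debugging as the attacker's best option; with a budget of 7 all three protections fit and static analysis becomes the strongest remaining attack.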
Article 71
Title@2025-06-23 (1): Bloch Vector Assertions for Debugging Quantum Programs
Title: Bloch Vector Assertions for Debugging Quantum Programs | Bloch Vector Assertions für Debugging Quantenprogramme | 调试量子程序Bloch 矢量批量 2506.18458v1 |
Authors (3): Noah H. Oldfield, Christoph Laaber, Shaukat Ali
Quantum programs must be reliable to ensure trustworthy results, yet debugging them is notoriously challenging due to quantum-specific faults like gate misimplementations and hardware noise, as well as their inherently probabilistic nature. Assertion-based debugging provides a promising solution by enabling localized correctness checks during execution. However, current approaches face challenges including manual assertion generation, reliance on mid-circuit measurements, and poor scalability. In this paper, we present Bloq, a scalable, automated fault localization approach introducing Bloch-vector-based assertions utilizing expectation value measurements of Pauli operators, enabling low-overhead fault localization without mid-circuit measurements. In addition, we introduce AutoBloq, a component of Bloq for automatically generating assertion schemes from quantum algorithms. An experimental evaluation over 684432 programs using two algorithms (Quantum Fourier Transform (QFT) and Grover) shows that Bloq consistently outperforms the state-of-the-art approach Proq, notably as circuit depth and noise increase. For Grover, Bloq achieves a mean F1 score across all experimental instances of 0.74 versus 0.38 for Proq under ideal conditions, and maintains performance under noise (0.43 versus 0.06). Bloq also reduces Proq’s runtime by a factor of 5 and circuit depth overhead by a factor of 23. These results underline Bloq’s potential to make assertion-based debugging scalable and effective for near-term quantum devices.
量子调试程序必须可靠,以确保可信赖的结果,然而调试它们却由于诸如门错误执行和硬件噪音等量子特有缺陷,以及其内在的概率性,而臭名昭著地具有挑战性。 以发声为基础的调试通过在实施过程中进行局部性校准检查提供了一个很有希望的解决方案。 然而,目前的方法面临挑战,包括人工主张生成、依赖中路测量和可缩放性差等。 在本文中,我们提出了可缩放的、自动的、可缩放性的布洛克方法,采用布洛克方法,利用对保利操作者的预期值测量,使低超头错位本地化,而没有中路测量。 此外,我们引入了AutoBloq,这是Bloq的一个组件,通过量子算法自动生成主张计划。
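For a single qubit, the Bloch vector is the triple of Pauli expectation values (⟨X⟩, ⟨Y⟩, ⟨Z⟩), so an assertion can compare an observed vector against the expected one. The sketch below computes the vector exactly from state amplitudes rather than from measurement statistics, and the tolerance and function names are assumptions; it illustrates the general idea, not Bloq's implementation.

```python
import math


def bloch_vector(a, b):
    """Bloch vector of the single-qubit state a|0> + b|1> (assumed normalized)."""
    a, b = complex(a), complex(b)
    c = a.conjugate() * b
    # <X> = 2 Re(a* b), <Y> = 2 Im(a* b), <Z> = |a|^2 - |b|^2
    return (2 * c.real, 2 * c.imag, abs(a) ** 2 - abs(b) ** 2)


def bloch_assertion_holds(observed, expected, tol=0.1):
    """Assertion passes if the observed vector lies within tol of the expected one."""
    return math.dist(observed, expected) <= tol


s = 1 / math.sqrt(2)
expected = (1.0, 0.0, 0.0)            # |+>, the state after a correct Hadamard on |0>
correct = bloch_vector(s, s)          # the Hadamard was applied as intended
faulty = bloch_vector(0, 1)           # an X gate was applied by mistake, giving |1>

ok_correct = bloch_assertion_holds(correct, expected)
ok_faulty = bloch_assertion_holds(faulty, expected)
```

The faulty state |1> has Bloch vector (0, 0, -1), far from the expected (1, 0, 0), so the assertion flags the fault without any mid-circuit measurement appearing in the check itself.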
Article 72
Title@2025-06-23 (1): The Debugging Decay Index: Rethinking Debugging Strategies for Code LLMs
Title: The Debugging Decay Index: Rethinking Debugging Strategies for Code LLMs | Der Debugging Decay Index: Debugging Strategien für Code LLMs neu denken | 调试衰减指数:重新思考守则LMS的调试战略 2506.18403v1 |
Authors (2): Muntasir Adnan, Carlos C. N. Kuhn
The effectiveness of AI debugging follows a predictable exponential decay pattern; most models lose 60-80% of their debugging capability within just 2-3 attempts, despite iterative debugging being a critical capability for practical code generation systems. We introduce the Debugging Decay Index (DDI), a mathematical framework that quantifies when debugging becomes ineffective and predicts intervention points. Our strategic fresh start approach shifts from exploitation to exploration at strategic points in the debugging process, demonstrating that well-timed interventions can rescue the effectiveness of debugging. DDI reveals a fundamental limitation in current AI debugging and provides the first quantitative framework for optimising iterative code generation strategies.
AI 调试的有效性遵循一种可预测的指数衰变模式;尽管迭代调试是实用代码生成系统的关键能力,但大多数模型仅在2-3次尝试中丧失了60-80%的调试能力。 我们引入了调试衰减指数(DDI),这是一个数学框架,当调试失效时可以量化,并预测干预点。 我们的新战略启动方法从开发转向在调试过程中的战略点进行探索,表明及时的干预措施可以挽救调试的有效性。 DDI揭示了当前AI调试中的基本限制,并为优化迭代代码生成战略提供了第一个量化框架。
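An exponential decay of debugging effectiveness can be written as E(t) = E0 · exp(-λt), where t counts debugging attempts. The sketch below fits λ by least squares on log-effectiveness and then asks after how many attempts a given fraction of the initial capability is lost. The per-attempt numbers are made up, and this is not the paper's DDI definition, only the underlying curve-fitting idea.

```python
import math


def fit_decay(effectiveness):
    """Least-squares fit of ln E(t) = ln E0 - lam * t for attempts t = 0, 1, 2, ..."""
    ts = list(range(len(effectiveness)))
    ys = [math.log(e) for e in effectiveness]
    n = len(ys)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    slope = (sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
             / sum((t - t_mean) ** 2 for t in ts))
    e0 = math.exp(y_mean - slope * t_mean)
    return e0, -slope


def attempts_until_loss(loss_fraction, lam):
    """Attempts after which the given fraction of the initial effectiveness is lost."""
    return math.log(1.0 / (1.0 - loss_fraction)) / lam


# Hypothetical effectiveness per attempt (halves on every attempt).
observed = [0.40, 0.20, 0.10, 0.05]
e0, lam = fit_decay(observed)
t_80 = attempts_until_loss(0.80, lam)
```

A point like `t_80` is one way to pick an intervention point: past it, continuing to iterate on the same context yields little, and a fresh start may be the better use of the budget.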
Article 73
Title@2025-06-23 (1): Your Token Becomes Worthless: Unveiling Rug Pull Schemes in Crypto Token via Code-and-Transaction Fusion Analysis
Title: Your Token Becomes Worthless: Unveiling Rug Pull Schemes in Crypto Token via Code-and-Transaction Fusion Analysis | Ihr Token wird wertlos: Enthüllen von Rug Pull Schemes in Crypto Token über Code-and-Transaction Fusion Analysis | 您的名声变得毫无价值:通过代码和交易整合分析,在加密调控中采用不懈的Rug拉力计划 2506.18398v1 |
Authors (8): Hao Wu, Haijun Wang, Shangwang Li, Yin Wu, Ming Fan, Wuxia Jin, Yitao Zhao, Ting Liu
Rug pull scams have emerged as a persistent threat to cryptocurrency, causing significant financial losses. A typical scenario involves scammers deploying honeypot contracts to attract investments, restricting token sales, and draining the funds, which leaves investors with worthless tokens. Current methods either rely on predefined patterns to detect code risks or utilize statistical transaction data to train detection models. However, real-world Rug Pull schemes often involve a complex interplay between malicious code and suspicious transaction behaviors. These methods, which solely focus on one aspect, fall short in detecting such schemes effectively. In this paper, we propose RPhunter, a novel technique that integrates code and transaction for Rug Pull detection. First, RPhunter establishes declarative rules and performs flow analysis to extract code risk information, further constructing a semantic risk code graph (SRCG). Meanwhile, to leverage transaction information, RPhunter formulates dynamic token transaction activities as a token flow behavior graph (TFBG) in which nodes and edges are characterized from network structure and market manipulation perspectives. Finally, RPhunter employs graph neural networks to extract complementary features from SRCG and TFBG, integrating them through an attention fusion model to enhance the detection of Rug Pull. We manually analyzed 645 Rug Pull incidents from code and transaction aspects and constructed a ground-truth dataset. We evaluated RPhunter on our dataset, achieving a precision of 95.3%, a recall of 93.8% and an F1 score of 94.5%, which highlights superior performance compared to existing state-of-the-art methods. Furthermore, when applied to the real-world scenarios, RPhunter has identified 4801 Rug Pull tokens, achieving a precision of 91%.
冲绳骗术已成为对加密货币的持续威胁,造成了巨大的金融损失。典型的情景是,诈骗者利用蜂蜜合同来吸引投资,限制象征性销售,耗尽资金,使投资者留下没有价值的代币。当前的方法要么依靠预先定义的模式来检测代码风险,要么利用统计交易数据来培训检测模型。然而,真实世界的冲网计划往往涉及恶意代码和可疑交易行为之间的复杂互动。这些方法仅侧重于一个方面,在有效检测此类计划方面有缺陷。在本文中,我们提议RPhunter,这是一种将代码和交易整合在一起的新技术,用于吸引投资,限制象征性销售销售,使资金流失。首先,RPhunter制定宣示规则并进行流程分析,以提取代码风险信息,进一步构建一个语义风险代码图。与此同时,为了利用交易信息,RPhunter制定动态象征性的代币交易活动,从网络结构和市场操纵角度对节点和边缘进行定性。最后,RPhunter使用图形网络,从 SRCG 和 RBG 中提取95 的代币模型的补本功能, 。 将 RBG 。
Article 74
Title@2025-06-23 (1): Tracing Errors, Constructing Fixes: Repository-Level Memory Error Repair via Typestate-Guided Context Retrieval
Title: Tracing Errors, Constructing Fixes: Repository-Level Memory Error Repair via Typestate-Guided Context Retrieval | Nachvollziehen von Fehlern, Konstruieren von Fehlern: Repository-Level-Speicherfehler Reparieren von Fehlern über typstate-guided Context Retrieval | 追踪错误, 构建修补: 通过 Tystate- Guide- Guided Intern Enter Review 修复存储器级存储器级存储器级内存错误 2506.18394v1 |
Authors (4): Xiao Cheng, Zhihao Guo, Huan Huo, Yulei Sui
Memory-related errors in C programming continue to pose significant challenges in software development, primarily due to the complexities of manual memory management inherent in the language. These errors frequently serve as vectors for severe vulnerabilities, while their repair requires extensive knowledge of program logic and C’s memory model. Automated Program Repair (APR) has emerged as a critical research area to address these challenges. Traditional APR approaches rely on expert-designed strategies and predefined templates, which are labor-intensive and constrained by the effectiveness of manual specifications. Deep learning techniques offer a promising alternative by automatically extracting repair patterns, but they require substantial training datasets and often lack interpretability. This paper introduces LTFix, a novel approach that harnesses the potential of Large Language Models (LLMs) for automated memory error repair, especially for complex repository-level errors that span multiple functions and files. We address two fundamental challenges in LLM-based memory error repair: a limited understanding of interprocedural memory management patterns and context window limitations for repository-wide analysis. Our approach utilizes a finite typestate automaton to guide the tracking of error-propagation paths and context trace, capturing both spatial (memory states) and temporal (execution history) dimensions of error behavior. This typestate-guided context retrieval strategy provides the LLM with concise yet semantically rich information relevant to erroneous memory management, effectively addressing the token limitation of LLMs.
C编程中与内存有关的错误在软件开发方面继续构成重大挑战,这主要是因为语言固有的人工记忆管理的复杂性,这些错误往往成为严重脆弱性的载体,而修复这些错误需要大量了解程序逻辑和C的内存模型。自动化程序维修(APR)已成为应对这些挑战的关键研究领域。传统的非洲同侪审议机制方法依靠专家设计的战略和预先界定的模板,这些战略和模板是劳力密集型的,受到手工规格效力的限制。深深学习技术通过自动提取修复模式提供了一个有希望的替代办法,但它们需要大量的培训数据集,而且往往缺乏可解释性。本文介绍了LTFix,这是一种新颖的方法,利用大语言模型的潜力进行自动记忆错误修复,特别是用于跨越多种功能和档案的复杂的仓库级错误。我们处理基于LLM的内存错误修补的两个基本挑战:对间记忆管理模式和受全库范围分析的环境窗口限制理解有限。我们的方法使用固定的型号自动地图来指导对错误分析路径和背景进行跟踪,但往往缺乏可解释性。本文介绍了大语言模型模型的新方法,它利用大语言模型的潜力,利用大语言模型进行自动存储错误错误追踪,特别是历史级的内径,从而有效地记录。
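To make the idea of a finite typestate automaton concrete, here is a minimal sketch of one for a single heap pointer. The states, event names, and transition table are illustrative assumptions, not the automaton used by LTFix; the point is only that each memory operation moves the object between states and that certain transitions (use after free, double free) land in an error state.

```python
from enum import Enum


class MemState(Enum):
    UNALLOCATED = "unallocated"
    ALLOCATED = "allocated"
    FREED = "freed"
    ERROR = "error"


# Hypothetical transition table: (current state, event) -> next state.
# Any pair that is not listed is treated as an error transition.
TRANSITIONS = {
    (MemState.UNALLOCATED, "malloc"): MemState.ALLOCATED,
    (MemState.ALLOCATED, "use"): MemState.ALLOCATED,
    (MemState.ALLOCATED, "free"): MemState.FREED,
    (MemState.FREED, "use"): MemState.ERROR,   # use-after-free
    (MemState.FREED, "free"): MemState.ERROR,  # double free
}


def trace_states(events):
    """Return the sequence of states visited while replaying the events."""
    state = MemState.UNALLOCATED
    history = [state]
    for event in events:
        state = TRANSITIONS.get((state, event), MemState.ERROR)
        history.append(state)
        if state is MemState.ERROR:
            break  # stop at the first violation
    return history


def leaks(events):
    """A trace that ends while still ALLOCATED never released its memory."""
    return trace_states(events)[-1] is MemState.ALLOCATED
```

Replaying `["malloc", "use", "free", "use"]` ends in `MemState.ERROR`, and the returned history is a crude stand-in for the "temporal" dimension the abstract mentions: it records which states the object passed through before the violation.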
Article 75
Title@2025-06-23 (1): Recipe for Discovery: A Framework for Systematic Open Source Project Identification
Title: Recipe for Discovery: A Framework for Systematic Open Source Project Identification | Rezept für Entdeckung: Ein Rahmen für die systematische Identifizierung von Open Source-Projekten | 发现秘诀:系统开放源码项目确认框架 2506.18359v1 |
Authors (5): Juanita Gomez, Emily Lovell, Stephanie Lieggi, Alvaro A. Cardenas, James Davis
Open source software development, particularly within institutions such as universities and research laboratories, is often decentralized and difficult to track. Despite producing highly impactful tools in science, these efforts often go unrecognized due to a lack of visibility and institutional awareness. This paper addresses the challenge of discovering, classifying, and analyzing open source software projects developed across distributed institutional systems. We present a framework for systematically identifying institutionally affiliated repositories, using the University of California (UC) system as a case study. Using GitHub’s REST API, we build a pipeline to discover relevant repositories and extract meaningful metadata. We then propose and evaluate multiple classification strategies, including both traditional machine learning models and large language models (LLMs), to distinguish affiliated projects from unrelated repositories and generate accurate insights into the academic open source landscape. Our results show that the framework is effective at scale, discovering over 52,000 repositories and predicting institutional affiliation with high accuracy.
开放源码软件开发,特别是大学和研究实验室等机构内部的开放源码软件开发,往往分散进行,难以追踪。尽管在科学领域产生了影响极大的工具,但由于缺乏知名度和机构意识,这些努力往往得不到承认。本文件讨论了发现、分类和分析分布式机构系统开发的开放源码软件项目的挑战。我们提出了一个框架,以便利用加利福尼亚大学(UC)系统进行案例研究,系统确定附属机构储存库。我们利用GitHub的REST API, 建立了一个管道,以发现相关的储存库并提取有意义的元数据。我们随后提出并评价多种分类战略,包括传统机器学习模型和大型语言模型(LLMS),以区分关联项目与无关储存库,并准确了解学术开放源地貌。我们的成果显示,该框架规模有效,发现超过52,000个储存库并预测高度精确的机构关联。
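As a rough illustration of the discovery step, the sketch below builds a request URL for GitHub's repository search endpoint and pulls a few metadata fields out of a response. The choice of fields and the helper names are assumptions made for this example; the paper's actual pipeline, queries, and features are not described in the abstract. No network call is made here, so the parsing function can be exercised on any already-decoded JSON response.

```python
from urllib.parse import urlencode

# Public GitHub REST endpoint for repository search.
SEARCH_ENDPOINT = "https://api.github.com/search/repositories"


def build_search_url(keyword, page=1, per_page=100):
    """Build a paginated repository-search URL for a keyword."""
    params = {"q": keyword, "per_page": per_page, "page": page}
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


def extract_metadata(search_response):
    """Keep a small, hypothetical subset of fields from a search response."""
    rows = []
    for item in search_response.get("items", []):
        rows.append({
            "full_name": item.get("full_name"),
            "description": item.get("description") or "",
            "url": item.get("html_url"),
            "stars": item.get("stargazers_count", 0),
        })
    return rows
```

The extracted rows could then be passed to whichever classifier decides institutional affiliation; that step is deliberately left out because the abstract does not specify it.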
Article 76
Title@2025-06-23 (1): Predictive Analytics for Collaborators Answers, Code Quality, and Dropout on Stack Overflow
Title: Predictive Analytics for Collaborators Answers, Code Quality, and Dropout on Stack Overflow | Predictive Analytics für Kollaboratoren Antworten, Codequalität und Dropout auf Stack Overflow | Stack Overflow 上协作者回答、代码质量和流失的预测分析 2506.18329v1 |
Authors (3): Elijah Zolduoarrati, Sherlock A. Licorish, Nigel Stanger
Previous studies that used data from Stack Overflow to develop predictive models often employed limited benchmarks of 3-5 models or adopted arbitrary selection methods. Despite being insightful, their limited scope suggests the need to benchmark more models to avoid overlooking untested algorithms. Our study evaluates 21 algorithms across three tasks: predicting the number of questions a user is likely to answer, their code quality violations, and their dropout status. We employed normalisation, standardisation, as well as logarithmic and power transformations paired with Bayesian hyperparameter optimisation and genetic algorithms. CodeBERT, a pre-trained language model for both natural and programming languages, was fine-tuned to classify user dropout given their posts (questions and answers) and code snippets. We found that Bagging ensemble models combined with standardisation achieved the highest R2 value (0.821) in predicting user answers. The Stochastic Gradient Descent regressor, followed by Bagging and Epsilon Support Vector Machine models, consistently demonstrated superior performance to other benchmarked algorithms in predicting user code quality across multiple quality dimensions and languages. Extreme Gradient Boosting paired with log-transformation exhibited the highest F1-score (0.825) in predicting user dropout. CodeBERT was able to classify user dropout with a final F1-score of 0.809, validating the performance of Extreme Gradient Boosting that was solely based on numerical data. Overall, our benchmarking of 21 algorithms provides multiple insights. Researchers can leverage findings regarding the most suitable models for specific target variables, and practitioners can utilise the identified optimal hyperparameters to reduce the initial search space during their own hyperparameter tuning processes.
以往使用 Stack Overflow 数据开发预测模型的研究往往只采用3-5个模型的有限基准,或采用任意的选择方法。这些研究虽有见地,但其范围有限,表明需要对更多模型进行基准测试,以免遗漏未经检验的算法。我们的研究在三项任务上评估了21种算法:预测用户可能回答的问题数量、其代码质量违规情况及其流失状态。我们采用了归一化、标准化以及对数和幂变换,并结合贝叶斯超参数优化和遗传算法。CodeBERT 是一种面向自然语言和编程语言的预训练语言模型,我们对其进行了微调,以根据用户的帖子(问题和回答)和代码片段对用户流失进行分类。我们发现,Bagging 集成模型与标准化相结合,在预测用户回答数量时取得了最高的 R2 值(0.821)。在跨多个质量维度和语言预测用户代码质量方面,随机梯度下降回归器表现始终优于其他参与基准测试的算法,其次是 Bagging 和 Epsilon 支持向量机模型。极端梯度提升与对数变换相结合,在预测用户流失时取得了最高的 F1 分数(0.825)。CodeBERT 对用户流失分类的最终 F1 分数为0.809,验证了仅基于数值数据的极端梯度提升的性能。总体而言,我们对21种算法的基准测试提供了多方面的见解。研究人员可以利用关于特定目标变量最适合模型的发现,实践者可以利用已确定的最优超参数,在自己的超参数调优过程中缩小初始搜索空间。
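For readers unfamiliar with the preprocessing and the regression metric named in the abstract, the following sketch spells out standardisation, a log transformation, and the R2 score in plain Python. It is a textbook rendering under the usual definitions, not the study's code; the function names are chosen for this example, and `log1p` is one common choice of logarithmic transform rather than necessarily the one the authors used.

```python
import math


def standardise(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(variance)
    return [(v - mean) / std for v in values]


def log_transform(values):
    """Compress right-skewed counts; log1p keeps zero counts at zero."""
    return [math.log1p(v) for v in values]


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot
```

An R2 of 1.0 means the predictions match the targets exactly, while a model that always predicts the mean of the targets scores 0.0, which gives some intuition for the reported 0.821.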
Article 77
Title@2025-06-23 (1): Use Property-Based Testing to Bridge LLM Code Generation and Validation
Title: Use Property-Based Testing to Bridge LLM Code Generation and Validation | Property-Based Testing als Brücke zwischen LLM-Codegenerierung und Validierung | 使用基于属性的测试连接LLM代码生成与验证 2506.18315v1 |
Authors (6): Lehan He, Zeren Chen, Zhe Zhang, Jing Shao, Xiang Gao, Lu Sheng
Large Language Models (LLMs) excel at code generation, but ensuring that their outputs are functionally correct, especially in complex programming tasks, is a persistent challenge. While traditional Test-Driven Development (TDD) offers a path for code refinement, its efficacy with LLMs is often undermined by the scarcity of high-quality test cases or the pitfalls of automated test generation, including biased tests or inaccurate output predictions that can misdirect the correction process. This paper introduces Property-Generated Solver, a novel framework that leverages Property-Based Testing (PBT) to validate high-level program properties or invariants, instead of relying on specific input-output examples. These properties are often simpler to define and verify than directly predicting exhaustive test oracles, breaking the “cycle of self-deception” where tests might share flaws with the code they are meant to validate. Property-Generated Solver employs two collaborative LLM-based agents: a Generator dedicated to code generation and iterative refinement, and a Tester that manages the PBT life-cycle and formulates semantically rich feedback from property violations. The resulting comprehensive and actionable feedback then guides the Generator in its refinement efforts. By establishing PBT as the core validation engine within this iterative, closed-loop paradigm, Property-Generated Solver provides a robust mechanism for steering LLMs towards more correct and generalizable code. Extensive experimental results on multiple code generation benchmarks demonstrate that Property-Generated Solver achieves substantial pass@1 improvements, ranging from 23.1% to 37.3% relative gains over established TDD methods.
大型语言模型(LLMS)在代码生成方面非常出色,但确保产出在功能上正确,特别是在复杂的程序化任务中,这是一个长期的挑战。传统测试驱动开发(TDD)为代码完善提供了一条路径,而传统测试驱动开发(TDD)为代码完善提供了一条路径,但与LLMS的功效往往由于缺少高质量测试案例或自动测试生成的陷阱而受到损害,包括有偏差的测试或不准确的产出预测,从而可能误导校正进程。本文介绍了地产定位解析器,这是一个利用地产测试(PBT)验证高级程序属性或变异性的新框架,而不是依赖具体的投入输出实例。这些属性往往比直接预测详尽的测试或触碰(TDD)更简单,打破了“自我失常的循环”,因为测试可能与用于校正的代码共享缺陷。
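The contrast between checking properties and checking fixed input-output pairs can be shown with a small hand-rolled property check. This is a sketch under stated assumptions: it uses only the standard library rather than a PBT framework, the two properties (output is ordered, output is a permutation of the input) and the sorting task are chosen for illustration, and none of it reflects the actual Generator or Tester agents.

```python
import random
from collections import Counter


def candidate_sort(xs):
    """A correct candidate implementation."""
    return sorted(xs)


def buggy_sort(xs):
    """A plausible-looking candidate that silently drops duplicates."""
    return sorted(set(xs))


def check_properties(fn, trials=200, seed=0):
    """Return a description of the first violated property, or None."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-5, 5) for _ in range(rng.randint(0, 8))]
        out = fn(list(xs))
        # Property 1: the output is in non-decreasing order.
        if any(a > b for a, b in zip(out, out[1:])):
            return f"output not ordered for input {xs}"
        # Property 2: the output contains exactly the input elements.
        if Counter(out) != Counter(xs):
            return f"output is not a permutation of input {xs}"
    return None
```

The message returned for `buggy_sort` names the violated property and a concrete failing input, which is the kind of feedback a tester could hand back to a code generator without ever predicting an exact expected output.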
Article 78
Title@2025-06-23 (1): Tu(r)ning AI Green: Exploring Energy Efficiency Cascading with Orthogonal Optimizations
Title: Tu(r)ning AI Green: Exploring Energy Efficiency Cascading with Orthogonal Optimizations | Tu(r)ning AI Green: Erforschung der Energieeffizienz Kaskadierung mit orthogonalen Optimierungen | Tu(r)ning AI Green:探索正交优化带来的能效级联效应 2506.18289v1 |
Authors (3): Saurabhsingh Rajput, Mootez Saad, Tushar Sharma
AI’s exponential growth intensifies computational demands and energy challenges. While practitioners employ various optimization techniques, which we refer to as “knobs” in this paper, to tune model efficiency, these are typically afterthoughts: reactive, ad-hoc changes applied in isolation without understanding their combinatorial effects on energy efficiency. This paper emphasizes treating energy efficiency as a first-class citizen and as a fundamental design consideration for a compute-intensive pipeline. We show that strategic selection across five AI pipeline phases (data, model, training, system, inference) creates cascading efficiency. Experimental validation shows that orthogonal combinations reduce energy consumption by up to 94.6% while preserving 95.95% of the original F1 score of non-optimized pipelines. This curated approach provides actionable frameworks for informed sustainable AI that balance efficiency, performance, and environmental responsibility.
AI的指数增长加剧了计算需求和能源挑战。实践者运用各种优化技术(本文称之为“knobs”)来调整模型效率,但这些通常是事后补救的、被动的临时性改动,彼此孤立地应用,而没有理解其对能源效率的组合效应。本文强调将能源效率作为一等公民,并作为计算密集型流水线的基本设计考虑。我们表明,在AI流水线的五个阶段(数据、模型、训练、系统、推理)进行战略性选择可产生级联效率。实验验证显示,正交组合可将能源消耗最多减少94.6%,同时保留未优化流水线原始F1分数的95.95%。这一经过筛选的方法为知情的可持续AI提供了可操作的框架,兼顾效率、性能和环境责任。
Article 79
Title@2025-06-23 (1): Smart-LLaMA-DPO: Reinforced Large Language Model for Explainable Smart Contract Vulnerability Detection
Title: Smart-LLaMA-DPO: Reinforced Large Language Model for Explainable Smart Contract Vulnerability Detection | Smart-LLaMA-DPO: Verstärktes Large Language Model für erklärbare Smart Contract Vulnerability Detection | Smart-LLaMA-DPO:可解释的智能合同脆弱性探测强化大语言模型 2506.18245v1 |
Authors (11): Lei Yu, Zhirong Huang, Hang Yuan, Shiqi Cheng, Li Yang, Fengjun Zhang, Chenjie Shen, Jiajia Ma, Jingyuan Zhang, Junyi Lu, Chun Zuo
Smart contract vulnerability detection remains a major challenge in blockchain security. Existing vulnerability detection methods face two main issues: (1) Existing datasets lack comprehensive coverage and high-quality explanations for preference learning. (2) Large language models (LLMs) often struggle with accurately interpreting specific concepts in smart contract security. Empirical analysis shows that even after continual pre-training (CPT) and supervised fine-tuning (SFT), LLMs may misinterpret the execution order of state changes, resulting in incorrect explanations despite making correct detection decisions. To address these challenges, we propose Smart-LLaMA-DPO based on LLaMA-3.1-8B. First, we construct a comprehensive dataset covering four major vulnerability types and machine-unauditable vulnerabilities, including precise labels, explanations, and locations for SFT, as well as high-quality and low-quality output pairs for Direct Preference Optimization (DPO). Second, we perform CPT using large-scale smart contracts to enhance the LLM’s understanding of specific security practices in smart contracts. Furthermore, we conduct SFT with our comprehensive dataset. Finally, we apply DPO, leveraging human feedback and a specially designed loss function that increases the probability of preferred explanations while reducing the likelihood of non-preferred outputs. We evaluate Smart-LLaMA-DPO on four major vulnerability types: reentrancy, timestamp dependence, integer overflow/underflow, and delegatecall, as well as machine-unauditable vulnerabilities. Our method significantly outperforms state-of-the-art baselines, with average improvements of 10.43% in F1 score and 7.87% in accuracy. Moreover, both LLM evaluation and human evaluation confirm that our method generates more correct, thorough, and clear explanations.
现有脆弱性检测方法面临两个主要问题:(1) 现有数据集缺乏全面覆盖面,而且缺乏用于优惠学习的高质量解释。 (2) 大语言模型(LLMS)常常在精确解释智能合同安全的具体概念方面挣扎。 经验分析显示,即使经过持续的培训前(CPT)和监督下的微调(SFT),LMS也可能曲解国家变化的执行顺序,导致错误解释,尽管作出正确的检测决定,但导致错误解释。为了应对这些挑战,我们提议Smart-LalaMA-DPO以LalaMA-3.1-18B为基础。 我们建立一个综合数据集,涵盖四种主要脆弱性类型和机能可读性弱点,包括精确的标签、解释和SFTF1, 以及高品质和低品质产出配对(DP)。 其次,我们使用大型智能合同执行CPT,提高LM对智能合同中具体安全做法的理解。 Fthermoremore,我们用综合数据集进行SFT。 最后,我们应用DPO, 利用人类的反馈和特别的准确性估算方法,同时用智能的准确性解释, 提高我们的平均概率分析。
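To show what "increasing the probability of preferred explanations while reducing the likelihood of non-preferred outputs" looks like numerically, the sketch below implements the standard DPO loss for a single preference pair. This is the commonly published DPO objective, not the specially designed loss the abstract refers to, whose form is not given there. The log-probabilities are assumed to be sequence-level sums supplied by the caller, and `beta=0.1` is an arbitrary illustrative value.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) output pair."""
    # Implicit rewards: how much more the policy likes each output
    # than the frozen reference model does.
    chosen_reward = policy_logp_chosen - ref_logp_chosen
    rejected_reward = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_reward - rejected_reward)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))
```

When the policy is identical to the reference model the margin is zero and the loss equals log 2; the loss falls as the policy shifts probability toward the chosen output and away from the rejected one.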
Article 80
Title@2025-06-23 (1): Managing Technical Debt in a Multidisciplinary Data Intensive Software Team: an Observational Case Study
Title: Managing Technical Debt in a Multidisciplinary Data Intensive Software Team: an Observational Case Study | Verwaltung technischer Schulden in einem multidisziplinären Data Intensive Software Team: eine Beobachtungsfallstudie | 多学科数据密集软件小组管理技术债务:观察案例研究 2506.18219v1 |
Authors (5): Ulrike M. Graetsch, Rashina Hoda, Hourieh Khalazjadeh, Mojtaba Shahin, John Grundy
Context: There is an increase in the investment and development of data-intensive (DI) solutions, systems that manage large amounts of data. Without careful management, this growing investment will also grow associated technical debt (TD). Delivery of DI solutions requires a multidisciplinary skill set, but there is limited knowledge about how multidisciplinary teams develop DI systems and manage TD. Objective: This research contributes empirical, practice-based insights about multidisciplinary DI team TD management practices. Method: This research was conducted as an exploratory observational case study. We used socio-technical grounded theory (STGT) for data analysis to develop concepts and categories that articulate TD and TD management practices. Results: We identify the TD that the DI team deals with, in particular technical data components debt and pipeline debt. We explain how the team manages and assesses TD, which TD treatments they consider, and how they implement TD treatments to fit sprint capacity constraints. Conclusion: We align our findings to existing TD and TDM taxonomies, discuss their implications and highlight the need for new implementation patterns and tool support for multidisciplinary DI teams.
目标:这项研究有助于对多学科的DI小组TD管理做法进行经验和实践上的深入了解。方法:这项研究是作为一项探索性观察案例研究进行的。我们利用基于社会技术的理论进行数据分析,以制定阐述TD和TD债务管理做法的概念和类别。结果:我们确定DI小组处理的TD是如何管理的,特别是技术数据债务和管道债务。我们解释该小组如何管理TD、评估TD、他们认为的TD待遇以及他们如何执行TD处理方法以适应打印能力限制。结论:我们把我们的调查结果与现有的TD和TDM分类研究结合起来,讨论其影响,并强调需要新的执行模式和对多学科DI小组的工具支持。
Article 81
Title@2025-06-22 (7): BLAZE: Cross-Language and Cross-Project Bug Localization via Dynamic Chunking and Hard Example Learning
Title: BLAZE: Cross-Language and Cross-Project Bug Localization via Dynamic Chunking and Hard Example Learning | BLAZE: Cross-Language und Cross-Project Bug Lokalisierung über Dynamic Chunking und Hard Example Learning | BLAZE:通过动态分块和困难样本学习实现跨语言和跨项目错误定位 2407.17631v3 |
Authors (3): Partha Chakraborty, Mahmoud Alfadel, Meiyappan Nagappan
Software bugs require developers to exert significant effort to identify and resolve them, often consuming about one-third of their time. Bug localization, the process of pinpointing the exact source code files that need modification, is crucial in reducing this effort. Existing bug localization tools, typically reliant on deep learning techniques, face limitations in cross-project applicability and effectiveness in multi-language environments. Recent advancements with Large Language Models (LLMs) offer detailed representations for bug localization. However, they encounter challenges with limited context windows and mapping accuracy. To address these issues, we propose BLAZE, an approach that employs dynamic chunking and hard example learning. First, BLAZE dynamically segments source code to minimize continuity loss. Then, BLAZE fine-tunes a GPT-based model using challenging bug cases, in order to enhance cross-project and cross-language bug localization. To support the capability of BLAZE, we create the BEETLEBOX dataset, which comprises 26,321 bugs from 29 large and thriving open-source projects across five different programming languages (Java, C++, Python, Go, and JavaScript). Our evaluations of BLAZE on three benchmark datasets (BEETLEBOX, SWE-Bench, and Ye et al.) demonstrate substantial improvements compared to six state-of-the-art baselines. Specifically, BLAZE achieves increases of up to 120% in Top 1 accuracy, 144% in Mean Average Precision (MAP), and 100% in Mean Reciprocal Rank (MRR). An extensive ablation study confirms the contributions of our pipeline components to the overall performance enhancement.
开发者需要付出大量精力来识别和解决软件错误,这通常会耗费他们约三分之一的时间。错误定位是指精确找出需要修改的源代码文件的过程,对减少这类工作至关重要。现有的错误定位工具通常依赖深度学习技术,在跨项目适用性和多语言环境下的有效性方面存在局限。大语言模型(LLM)的最新进展为错误定位提供了细致的表示,但它们面临上下文窗口有限和映射准确性方面的挑战。为解决这些问题,我们提出 BLAZE,一种采用动态分块和困难样本学习的方法。首先,BLAZE 对源代码进行动态分段,以尽量减少连续性损失。然后,BLAZE 使用具有挑战性的错误案例微调基于 GPT 的模型,以增强跨项目和跨语言的错误定位能力。为支持 BLAZE 的能力,我们创建了 BEETLEBOX 数据集,其中包含来自29个大型且活跃的开源项目的26,321个错误,涵盖五种编程语言(Java、C++、Python、Go 和 JavaScript)。我们在 BEETLEBOX、SWE-Bench 和 Ye et al. 三个基准数据集上对 BLAZE 的评估表明,与六个最先进的基线相比有显著改进。具体而言,BLAZE 在 Top 1 准确率上最多提高120%,在平均精度均值(MAP)上最多提高144%,在平均倒数排名(MRR)上最多提高100%。一项广泛的消融研究证实了各流水线组件对整体性能提升的贡献。
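The three ranking metrics quoted in the abstract have standard definitions, sketched below for a file-level bug localization setting. The data layout is an assumption made for this example: each bug contributes one ranked list of candidate files plus the set of files that were actually fixed. Nothing here reproduces BLAZE itself.

```python
def top1_accuracy(rankings, relevant):
    """Fraction of bugs whose top-ranked file is a truly buggy file."""
    hits = sum(1 for ranked, rel in zip(rankings, relevant)
               if ranked and ranked[0] in rel)
    return hits / len(rankings)


def mean_reciprocal_rank(rankings, relevant):
    """Average of 1 / (rank of the first buggy file); 0 if none is ranked."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        for position, name in enumerate(ranked, start=1):
            if name in rel:
                total += 1.0 / position
                break
    return total / len(rankings)


def mean_average_precision(rankings, relevant):
    """Mean over bugs of the average precision at each buggy file's rank."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        found = 0
        precision_sum = 0.0
        for position, name in enumerate(ranked, start=1):
            if name in rel:
                found += 1
                precision_sum += found / position
        total += precision_sum / len(rel) if rel else 0.0
    return total / len(rankings)
```

Note that the abstract reports relative increases in these metrics over baselines, not absolute scores, so a 100% increase in MRR means the value doubled.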
Article 82
Title@2025-06-22 (7): Call Me Maybe: Enhancing JavaScript Call Graph Construction using Graph Neural Networks
Title: Call Me Maybe: Enhancing JavaScript Call Graph Construction using Graph Neural Networks | Rufen Sie mich vielleicht an: Verbesserung der JavaScript Call Graph Construction mit Graph Neural Networks | 使用图形神经网络加强 JavaScript 呼叫图图建设 2506.18191v1 |
Authors (4): Masudul Hasan Masud Bhuiyan, Gianluca De Stefano, Giancarlo Pellegrino, Cristian-Alexandru Staicu
Static analysis plays a key role in finding bugs, including security issues. A critical step in static analysis is building accurate call graphs that model function calls in a program. However, due to hard-to-analyze language features, existing call graph construction algorithms for JavaScript are neither sound nor complete. Prior work shows that even advanced solutions produce false edges and miss valid ones. In this work, we assist these tools by identifying missed call edges. Our main idea is to frame the problem as link prediction on full program graphs, using a rich representation with multiple edge types. Our approach, GRAPHIA, leverages recent advances in graph neural networks to model non-local relationships between code elements. Concretely, we propose representing JavaScript programs using a combination of syntactic- and semantic-based edges. GRAPHIA can learn from imperfect labels, including static call edges from existing tools and dynamic edges from tests, either from the same or different projects. Because call graphs are sparse, standard machine learning metrics like ROC are not suitable. Instead, we evaluate GRAPHIA by ranking function definitions for each unresolved call site. We conduct a large-scale evaluation on 50 popular JavaScript libraries with 163K call edges (150K static and 13K dynamic). GRAPHIA builds program graphs with 6.6M structural and 386K semantic edges. It ranks the correct target as the top candidate in over 42% of unresolved cases and within the top 5 in 72% of cases, reducing the manual effort needed for analysis. Our results show that learning-based methods can improve the recall of JavaScript call graph construction. To our knowledge, this is the first work to apply GNN-based link prediction to full multi-file program graphs for interprocedural analysis.
静态分析在寻找错误( 包括安全问题) 方面起着关键作用。 静态分析中的一个关键步骤是建立精确的调时图, 模型函数在程序中需要。 但是, 由于难以分析的语言特征, JavaScript 现有的调用图形构建算法既不健全也不完善。 先前的工作显示, 即使是先进的解决方案也会产生假边缘, 并且错过有效边缘 。 在这项工作中, 我们协助这些工具的方法是, 使用多种边缘类型的丰富表达方式, 将问题设置为全程序图形的链接预测。 我们的方法, GAPHIA, 利用图表神经网络中最近的进展来模拟代码元素之间的非本地关系。 具体地说, 我们提议, 将 JavaScript 构建的组合和基于语系的边际。 GAPHIA可以从现有工具的静态调边和测试的边际边际边框中学习。 由于调时, 调的图表是稀疏, 像 ROC一样的标准机器学习指标不合适。 相反, 我们用SAPPIA 的平级前端分析方法来进行我们Sil VIA 。
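The evaluation protocol described above, ranking function definitions for each unresolved call site and checking whether the true target lands in the top k, can be sketched as follows. The dot-product scoring over fixed vectors is purely a placeholder assumption; GRAPHIA's scores come from a graph neural network, which is not reproduced here, and all names are hypothetical.

```python
def rank_candidates(call_site_vec, candidates):
    """Order candidate function names by descending similarity score.

    `candidates` maps a function name to its (hypothetical) embedding.
    """
    def dot(u, v):
        return sum(a * b for a, b in zip(u, v))

    return sorted(candidates,
                  key=lambda name: dot(call_site_vec, candidates[name]),
                  reverse=True)


def hit_at_k(ranked, target, k):
    """True when the correct callee appears among the top-k candidates."""
    return target in ranked[:k]
```

Averaging `hit_at_k` over all unresolved call sites with k=1 and k=5 yields figures of the same kind as the 42% and 72% reported in the abstract.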
Article 83
Title@2025-06-22 (7): Generating Energy-efficient code with LLMs
Title: Generating Energy-efficient code with LLMs | Energieeffizienter Code mit LLMs generieren | 使用LLM生成节能代码 2411.10599v2 |
Authors (3): Tom Cappendijk, Pepijn de Reus, Ana Oprescu
The increasing electricity demands of personal computers, communication networks, and data centers contribute to higher atmospheric greenhouse gas emissions, which in turn lead to global warming and climate change. Therefore, the energy consumption of code must be minimized. Code can be generated by large language models. We look at the influence of prompt modification on the energy consumption of the code generated. We use three different Python code problems of varying difficulty levels. Prompt modification is done by adding the sentence “Give me an energy-optimized solution for this problem” or by using two Python coding best practices. The large language models used are CodeLlama-70b, CodeLlama-70b-Instruct, CodeLlama-70b-Python, DeepSeek-Coder-33b-base, and DeepSeek-Coder-33b-instruct. We find a decrease in energy consumption for a specific combination of prompt optimization, LLM, and Python code problem. However, no single optimization prompt consistently decreases energy consumption for the same LLM across the different Python code problems.
个人计算机、通信网络和数据中心的电力需求不断增加,导致大气温室气体排放增加,这反过来又导致全球变暖和气候变化。 因此,必须最大限度地减少代码的能源消耗。 代码可以由大型语言模型生成。 我们查看迅速修改生成代码的能源消耗的影响。 我们使用三个不同的Python代码问题, 难度程度不一。 快速修改的方法是添加一句“ 为我提供这一问题的节能解决方案 ” , 或者使用两个 Python 编码最佳做法。 所使用的大语言模型是 代码Llama- 70b、 代码Llama- 70b- Instruct、 代码Llama- 70b- Python、 DeepSeek- Coder-33b- b b/ intruct。 我们发现, 快速优化、 LLM 和 Python 代码问题的具体组合导致能源消耗减少。 但是, 没有单一优化能够持续减少同一 LLM的能源消耗量, 跨越不同的 Python 代码问题 。
Article 84
Title@2025-06-22 (7): Build It Clean: Large-Scale Detection of Code Smells in Build Scripts
Title: Build It Clean: Large-Scale Detection of Code Smells in Build Scripts | Build It Clean: Großflächige Erkennung von Code-Smells in Build-Skripten | 构建干净:大规模检测构建脚本中的代码异味 2506.17948v1 |
Authors (6): Mahzabin Tamanna, Yash Chandrani, Matthew Burrows, Brandon Wroblewski, Laurie Williams, Dominik Wermke
Build scripts are files that automate the process of compiling source code, managing dependencies, running tests, and packaging software into deployable artifacts. These scripts are ubiquitous in modern software development pipelines for streamlining testing and delivery. While developing build scripts, practitioners may inadvertently introduce code smells. Code smells are recurring patterns of poor coding practices that may lead to build failures or increase risk and technical debt. The goal of this study is to aid practitioners in avoiding code smells in build scripts through an empirical study of build scripts and issues on GitHub. We employed a mixed-methods approach, combining qualitative and quantitative analysis. We conducted a qualitative analysis of 2000 build-script-related GitHub issues. Next, we developed a static analysis tool, Sniffer, to identify code smells in 5882 build scripts (Maven, Gradle, CMake, and Makefiles) collected from 4877 open-source GitHub repositories. We identified 13 code smell categories, with a total of 10,895 smell occurrences, of which 3184 were in Maven, 1214 in Gradle, 337 in CMake, and 6160 in Makefiles. Our analysis revealed that Insecure URLs were the most prevalent code smell in Maven build scripts, while Hardcoded Paths/URLs were commonly observed in both Gradle and CMake scripts. Wildcard Usage emerged as the most frequent smell in Makefiles. The co-occurrence analysis revealed strong associations between specific smell pairs: Hardcoded Paths/URLs with Duplicates, and Inconsistent Dependency Management with Empty or Incomplete Tags, indicating potential underlying issues in the build script structure and maintenance practices. Based on our findings, we recommend strategies to mitigate the existence of code smells in build scripts to improve the efficiency, reliability, and maintainability of software projects.
构建脚本是将编译源代码、管理依赖项、运行测试以及将软件打包成可部署制品的过程自动化的文件。这些脚本在现代软件开发流水线中无处不在,用于简化测试和交付。在开发构建脚本时,从业者可能会无意中引入代码异味。代码异味是反复出现的不良编码实践模式,可能导致构建失败或增加风险和技术债务。本研究的目标是通过对 GitHub 上的构建脚本和问题进行实证研究,帮助从业者避免构建脚本中的代码异味。我们采用了定性与定量分析相结合的混合方法。我们对2000个与构建脚本相关的 GitHub 问题进行了定性分析。随后,我们开发了静态分析工具 Sniffer,用于识别从4877个开源 GitHub 仓库收集的5882个构建脚本(Maven、Gradle、CMake 和 Makefile)中的代码异味。我们确定了13类代码异味,共出现10,895次,其中 Maven 3184次,Gradle 1214次,CMake 337次,Makefile 6160次。分析显示,不安全的 URL 是 Maven 构建脚本中最普遍的代码异味,硬编码路径/URL 在 Gradle 和 CMake 脚本中都很常见,通配符使用则是 Makefile 中最常见的异味。共现分析揭示了特定异味对之间的强关联:硬编码路径/URL 与重复项,以及不一致的依赖管理与空或不完整的标签,这表明构建脚本的结构和维护实践中可能存在潜在问题。基于我们的发现,我们提出了减少构建脚本中代码异味的策略,以提高软件项目的效率、可靠性和可维护性。
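A line-based detector for two of the smell categories named above gives a feel for what static detection in build scripts involves. This is not Sniffer: the regular expressions are simplified assumptions about what "Insecure URLs" and "Wildcard Usage" might look like, and the paper's precise detection rules may differ.

```python
import re

# Assumed patterns, chosen for illustration only.
INSECURE_URL = re.compile(r"http://[^\s\"'<>)]+")         # plain-HTTP download or repository URL
WILDCARD = re.compile(r"\$\(wildcard\s+[^)]*\)")          # GNU Make wildcard function


def find_smells(text):
    """Return (line number, smell label) pairs found in a build script."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        if INSECURE_URL.search(line):
            findings.append((lineno, "insecure-url"))
        if WILDCARD.search(line):
            findings.append((lineno, "wildcard-usage"))
    return findings
```

For a Makefile whose first line is `SRCS := $(wildcard src/*.c)` and whose third line fetches `http://example.com/dep.tar.gz`, the function reports one finding of each kind. A production tool would parse each build language properly rather than match lines, since regex matching will miss multi-line constructs and flag URLs inside comments.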
Article 85
Title@2025-06-22 (7): Software Reuse in the Generative AI Era: From Cargo Cult Towards AI Native Software Engineering
Title: Software Reuse in the Generative AI Era: From Cargo Cult Towards AI Native Software Engineering | Software-Wiederverwendung in der Ära der generativen KI: Vom Cargo-Kult hin zum KI-nativen Software-Engineering | 生成式AI时代的软件复用:从货物崇拜走向AI原生软件工程 2506.17937v1 |
Authors (2): Tommi Mikkonen, Antero Taivalsaari
Software development is currently undergoing a paradigm shift in which artificial intelligence and generative software reuse are taking center stage in software creation. Consequently, earlier software reuse practices and methods are rapidly being replaced by AI-assisted approaches in which developers place their trust in code that has been generated by artificial intelligence. This is leading to a new form of software reuse that is conceptually not all that different from cargo cult development. In this paper we discuss the implications of AI-assisted generative software reuse in the context of emerging “AI native” software engineering, bring forth relevant questions, and define a tentative research agenda and call to action for tackling some of the central issues associated with this approach.
软件开发目前处于范式转变之中,人工智能和基因再利用软件正在成为软件创建的中心阶段,因此,早期的软件再利用做法和方法正在迅速被人工智能辅助方法所取代,而开发者则相信人工智能生成的代码。这导致了一种新的软件再利用形式,在概念上与货物邪教开发并不完全不同。在本文件中,我们讨论了人工智能辅助基因软件再利用在新兴的“AI本地”软件工程中的影响,提出了相关问题,确定了一个暂定研究议程,并呼吁采取行动解决与这一方法相关的一些核心问题。
Article 86
Title@2025-06-22 (7): Rubric Is All You Need: Enhancing LLM-based Code Evaluation With Question-Specific Rubrics
Title: Rubric Is All You Need: Enhancing LLM-based Code Evaluation With Question-Specific Rubrics | Rubric ist alles, was Sie brauchen: Verbesserung der LLM-basierten Code-Bewertung mit Frage-spezifischen Rubrics | 评分细则就是你所需要的:用针对具体问题的评分细则增强基于LLM的代码评估 2503.23989v2 |
Authors (14): Aditya Pathak, Rachit Gandhi, Vaibhav Uttam, Devansh, Yashwanth Nakka, Aaryan Raj Jindal, Pratyush Ghosh, Arnav Ramamoorthy, Shreyash Verma, Aditya Mittal, Aashna Ased, Chirag Khatri, Jagat Sesh Challa, Dhruv Kumar
Since the emergence of Large Language Models (LLMs) popularized by the release of GPT-3 and ChatGPT, LLMs have shown remarkable promise in programming-related tasks. While code generation using LLMs has become a popular field of research, code evaluation using LLMs remains under-explored. In this paper, we focus on LLM-based code evaluation and attempt to fill in the existing gaps. We propose novel multi-agentic approaches using question-specific rubrics tailored to the problem statement, arguing that these perform better for logical assessment than the existing approaches that use question-agnostic rubrics. To address the lack of suitable evaluation datasets, we introduce two datasets: a Data Structures and Algorithms dataset containing 150 student submissions from a popular Data Structures and Algorithms practice website, and an Object Oriented Programming dataset comprising 80 student submissions from undergraduate computer science courses. In addition to using standard metrics (Spearman Correlation, Cohen’s Kappa), we propose a new metric called Leniency, which quantifies evaluation strictness relative to expert assessment. Our comprehensive analysis demonstrates that question-specific rubrics significantly enhance logical assessment of code in educational settings, providing better feedback aligned with instructional goals beyond mere syntactic correctness.
自通过发行GPT-3和ChatGPT而普及的大型语言模型(LLM)出现以来,LLMs在与方案编制有关的任务中表现出了令人瞩目的希望。虽然使用LLMs的代码生成已成为一个受欢迎的研究领域,但使用LLMs的代码评价仍然未得到充分探索。在本文中,我们侧重于基于LLM的代码评价,并试图填补现有的空白。我们建议采用针对问题说明的多试性新办法,使用针对问题说明的mph{特定问题的标本。我们主张,这些方法比使用\emph{问题-敏感标本的现行方法更好地进行逻辑评估。为了解决缺乏合适的评价数据集的问题,我们引入了两个数据集:数据结构和Algorithms数据集,其中载有来自广受欢迎的数据结构和Algorithms实践网站的150名学生提交材料,以及由80名学生从本科计算机科学课程提交材料组成的目标规划数据集。除了使用标准指标(Spearman Correlation, Coh’s Kappa)之外,我们只是提出一种更精确性的新指标,我们提出了更精确的相对性评估。我们要求的精确性评估。
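Of the standard agreement metrics mentioned, Cohen's Kappa is the least self-explanatory, so here is a short sketch of its usual definition for two raters assigning categorical grades. It follows the textbook formula and is not taken from the paper; the Leniency metric is not sketched because the abstract does not define it.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters, corrected for chance agreement.

    Raises ZeroDivisionError when expected agreement is exactly 1,
    i.e. when both raters always use the same single label.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    labels = set(count_a) | set(count_b)
    # Probability that both raters pick the same label by chance.
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    return (observed - expected) / (1 - expected)
```

In a grading context, `rater_a` could hold the expert's grades and `rater_b` the LLM evaluator's grades for the same submissions; a value of 1.0 is perfect agreement and 0.0 is agreement no better than chance.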
Article 87
Title@2025-06-21 (6): The Impact of AI-Generated Solutions on Software Architecture and Productivity: Results from a Survey Study
Title: The Impact of AI-Generated Solutions on Software Architecture and Productivity: Results from a Survey Study | Die Auswirkungen von KI-generierten Lösungen auf Softwarearchitektur und Produktivität: Ergebnisse einer Umfragestudie | AI生成的解决方案对软件架构和生产力的影响:一项调查研究的结果 2506.17833v1 |
Authors (2): Giorgio Amasanti, Jasmin Jahic
AI-powered software tools are widely used to assist software engineers. However, there is still a need to understand the productivity benefits of such tools for software engineers. In addition to short-term benefits, there is a question of how adopting AI-generated solutions affects the quality of software over time (e.g., maintainability and extendability). To provide some insight on these questions, we conducted a survey among software practitioners who use AI tools. Based on the data collected from our survey, we conclude that AI tools significantly increase the productivity of software engineers. However, the productivity benefits of using AI tools diminish as projects become more complex. The results also show that there are no significant negative influences of adopting AI-generated solutions on software quality, as long as those solutions are limited to smaller code snippets. However, when solving larger and more complex problems, AI tools generate lower-quality solutions, indicating the need for architects to perform problem decomposition and solution integration.
以AI为动力的软件工具被广泛用于协助软件工程师,然而,仍然需要了解这类工具对软件工程师的生产率效益,除了短期效益外,还有采用AI产生的解决方案如何随着时间的推移影响软件质量的问题(例如,可维持性和可扩展性)。为了对这些问题提供一些深入了解,我们对使用AI工具的软件从业人员进行了一次调查。根据我们调查收集的数据,我们得出结论,AI工具大大提高了软件工程师的生产率。然而,随着项目变得更加复杂,使用AI工具的生产率效益下降。结果还表明,采用AI产生的解决方案对软件质量没有重大的负面影响,只要这些解决方案仅限于较小的代码片。然而,在解决更大和更加复杂的问题时,AI工具产生质量较低的解决方案,表明建筑师需要进行问题解构和解决方案整合。
Article 88
Title@2025-06-21 (6): Is Your Automated Software Engineer Trustworthy?
Title: Is Your Automated Software Engineer Trustworthy? | Ist Ihr automatisierter Software-Ingenieur vertrauenswürdig? | 你的自动软件工程师可信吗? 2506.17812v1 |
Authors (2): Noble Saji Mathews, Meiyappan Nagappan
Large Language Models (LLMs) are being increasingly used in software engineering tasks, with an increased focus on bug report resolution over the past year. However, most proposed systems fail to properly handle uncertain or incorrect inputs and outputs. Existing LLM-based tools and coding agents respond to every issue and generate a patch for every case, even when the input is vague or their own output is incorrect. There are no mechanisms in place to abstain when confidence is low. This leads to unreliable behaviour, such as hallucinated code changes or responses based on vague issue reports. We introduce BouncerBench, a benchmark that evaluates whether LLM-based software agents can refuse to act when inputs are ill-defined or refuse to respond when their own outputs are likely to be incorrect. Unlike prior benchmarks that implicitly incentivize models to generate responses even when uncertain, BouncerBench aims to improve precision by targeting two overlooked failure points: (1) vague or underspecified issue descriptions in tickets and (2) logically or functionally incorrect code patches created by the system. It measures whether proposed systems can distinguish actionable issues from vague tickets and valid patches from untrustworthy ones. We also implement a basic input and output bouncer, evaluating how well current LLMs can abstain when needed. Our results show that most models fail to abstain from underspecified inputs or incorrect outputs. Hence, we conclude that there is significant room for improvement before LLMs can be trusted to make correct decisions and recommendations in real-world software engineering workflows. BouncerBench provides a first step toward evaluating and building more cautious, trustworthy code agents. The replication package, dataset, and leaderboard can be found at bouncerbench.com
大型语言模型(LLMs)在软件工程任务中越来越多地被使用,在过去一年中,对错误报告的解析越来越重视。然而,大多数拟议的系统都未能正确处理不确定或不正确的投入和产出。现有的LLM工具和编码代理商对每个问题都有反应,并给每个案例造成补丁,即使输入模糊或其本身产出不正确,也没有机制在信任度低时放弃。这导致不可靠的行为,如在问题报告含混的情况下对代码进行盲目的修改或反应。我们引入了BouncerBench,这个基准评估了LLM软件代理商在投入可能错误或错误时是否拒绝采取行动。与先前的基准不同,即即使输入不明确,或输入本身产出不正确,BouncerBench的目的是通过锁定两个被忽视的失败点来提高准确性:(1) 门票的模糊性或未详细定义,以及(2) 系统创建的逻辑或功能错误的代码补丁。我们衡量拟议的系统是否能区分可操作的问题与模糊的票票和基于LMSB的阶级的阶梯和错误的阶梯,我们也可以首先评估一个基本的输入结果。
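The input and output bouncer idea from the abstract can be illustrated with a minimal sketch. It assumes that some scorer (not shown) already provides a clarity score for the issue and a confidence score for the candidate patch, both in [0, 1]. The names `bounce`, `issue_clarity`, `patch_confidence`, and both thresholds are hypothetical and are not taken from BouncerBench itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    action: str            # "abstain_input", "abstain_output", or "submit"
    patch: Optional[str]   # patch text, set only when action == "submit"


def bounce(issue_clarity: float, patch: str, patch_confidence: float,
           input_threshold: float = 0.5,
           output_threshold: float = 0.7) -> Decision:
    """Abstain on vague issues first, then on low-confidence patches.

    Both scores and both thresholds are illustrative assumptions.
    """
    if issue_clarity < input_threshold:
        # Input bouncer: the ticket is too underspecified to act on.
        return Decision("abstain_input", None)
    if patch_confidence < output_threshold:
        # Output bouncer: a patch exists but is likely to be wrong.
        return Decision("abstain_output", None)
    return Decision("submit", patch)


vague = bounce(issue_clarity=0.2, patch="diff A", patch_confidence=0.9)
shaky = bounce(issue_clarity=0.8, patch="diff B", patch_confidence=0.3)
solid = bounce(issue_clarity=0.8, patch="diff C", patch_confidence=0.9)
```

Checking the input before the output mirrors the two failure points named in the abstract: a vague ticket is rejected before any patch is considered.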
Article 89
Title@2025-06-21 (6): SAVANT: Vulnerability Detection in Application Dependencies through Semantic-Guided Reachability Analysis
Title: SAVANT: Vulnerability Detection in Application Dependencies through Semantic-Guided Reachability Analysis | SAVANT: Sicherheitserkennung in Anwendungsabhängigkeiten durch Semantik-geführte Reichweitenanalyse | SAVANT: 通过语义辅助控制可达性分析,在应用依赖性中发现脆弱性 2506.17798v1 |
Authors (7): Wang Lingxiang, Quanzhi Fu, Wenjia Song, Gelei Deng, Yi Liu, Dan Williams, Ying Zhang
The integration of open-source third-party library dependencies in Java development introduces significant security risks when these libraries contain known vulnerabilities. Existing Software Composition Analysis (SCA) tools struggle to effectively detect vulnerable API usage from these libraries due to limitations in understanding API usage semantics and computational challenges in analyzing complex codebases, leading to inaccurate vulnerability alerts that burden development teams and delay critical security fixes. To address these challenges, we propose SAVANT, which leverages two insights: proof-of-vulnerability test cases demonstrate how vulnerabilities can be triggered in specific contexts, and Large Language Models (LLMs) can understand code semantics. SAVANT combines semantic preprocessing with LLM-powered context analysis for accurate vulnerability detection. SAVANT first segments source code into meaningful blocks while preserving semantic relationships, then leverages LLM-based reflection to analyze API usage context and determine actual vulnerability impacts. Our evaluation on 55 real-world applications shows that SAVANT achieves 83.8% precision, 73.8% recall, 69.0% accuracy, and 78.5% F1-score, outperforming state-of-the-art SCA tools.
现有软件构成分析(SCA)工具在有效检测这些图书馆的脆弱API使用情况方面挣扎着。 SAVANT将精密的语义预处理与LLM驱动的背景分析相结合。 SAVANT的首部分源代码在保留语义关系的同时,将有意义的区块纳入到有意义的区块中,然后利用基于LLM的思考来分析API的使用背景并确定实际的脆弱性影响。我们对55个实际应用软件的评估表明,SAVANT实现了83.8%的精确度,73.8%的回顾,69.0%的精确度和78.5%的F1核心,高于艺术的状态工具。
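As a quick consistency check on the reported figures, the F1-score is the harmonic mean of precision and recall. The short sketch below recomputes it from the 83.8% precision and 73.8% recall stated in the abstract. The function name `f1_score` is just an illustrative helper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reported_precision = 0.838  # from the abstract
reported_recall = 0.738     # from the abstract

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.785"
```

The result matches the 78.5% F1-score in the abstract, so the three numbers are mutually consistent.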
Article 90
Title@2025-06-21 (6): Efficient Strategy Synthesis for MDPs via Hierarchical Block Decomposition
Title: Efficient Strategy Synthesis for MDPs via Hierarchical Block Decomposition | Effiziente Strategiesynthese für MDPs über Hierarchische Blockzersetzung | 通过分层块分解实现 MDP 高效战略合成 2506.17792v1 |
Authors (3): Alexandros Evangelidis, Gricel Vázquez, Simos Gerasimou
Software-intensive systems, such as software product lines and robotics, utilise Markov decision processes (MDPs) to capture uncertainty and analyse sequential decision-making problems. Conventional policy synthesis methods are useful, but they fail to scale to large state spaces. Our approach addresses this issue and accelerates policy synthesis in large MDPs by dynamically refining the MDP and iteratively selecting the most fragile MDP regions for refinement. This iterative procedure offers a balance between accuracy and efficiency, as refinement occurs only when necessary. Through a comprehensive empirical evaluation comprising diverse case studies and MDPs with up to 1M states, we demonstrate significant performance improvements (up to 2x) yielded by our approach compared to the leading probabilistic model checker PRISM, thus offering a very competitive solution for real-world policy synthesis tasks in larger MDPs.
软件密集型系统,如软件产品线和机器人系统,利用Markov决策程序(MDPs)来捕捉不确定性并分析顺序决策问题。尽管常规政策综合方法有用,但它们未能推广到大型国家空间。我们的方法是动态地完善MDP,并反复选择最脆弱的MDP区域加以完善,从而解决这一问题并加速大型MDP部门的政策综合。这一迭接程序在准确性和效率之间提供了平衡,因为只有在必要时才进行完善。通过由不同案例研究和1MM州以下的MDPs组成的全面经验评估,我们展示了与领先的概率模型PRISM(高达2x)相比,我们的方法取得了显著的绩效改进,从而为大型MDP中现实世界政策综合任务提供了非常有竞争力的解决方案。
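For readers unfamiliar with policy synthesis, the sketch below shows plain value iteration on a toy MDP. This is the conventional baseline technique, not the hierarchical block decomposition proposed in the paper. The state names, rewards, and discount factor are made up for illustration.

```python
from typing import Dict, List, Tuple

# transitions[state][action] = list of (probability, next_state, reward)
Transitions = Dict[str, Dict[str, List[Tuple[float, str, float]]]]


def q_value(outcomes, values, gamma):
    """Expected discounted return of one action."""
    return sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)


def value_iteration(transitions: Transitions, gamma: float = 0.9,
                    tol: float = 1e-8):
    """Standard value iteration; states with no actions are terminal."""
    values = {s: 0.0 for s in transitions}
    while True:
        delta = 0.0
        for s, actions in transitions.items():
            if not actions:
                continue
            best = max(q_value(o, values, gamma) for o in actions.values())
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            break
    # Extract the greedy policy from the converged values.
    policy = {
        s: max(actions, key=lambda a: q_value(actions[a], values, gamma))
        for s, actions in transitions.items() if actions
    }
    return values, policy


# Toy MDP (illustrative): a safe action with a small reward, and a risky
# action that either pays more or loops back to the start.
toy = {
    "start": {
        "safe": [(1.0, "goal", 1.0)],
        "risky": [(0.5, "goal", 3.0), (0.5, "start", 0.0)],
    },
    "goal": {},
}
values, policy = value_iteration(toy)
```

Every sweep touches every state, which is why this kind of method becomes expensive on MDPs with very large state spaces.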
Article 91
Title@2025-06-21 (6): PAGENT: Learning to Patch Software Engineering Agents
Title: PAGENT: Learning to Patch Software Engineering Agents | PAGENT: Lernen, Software Engineering Agents zu Patchen | PAGENT: 学习修补软件工程代理 2506.17772v1 |
Authors (3): Haoran Xue, Gias Uddin, Song Wang
LLM agents produce patches automatically to resolve an issue. However, they can generate inaccurate patches. Little is known about the root causes behind those failed patches or how those could be fixed. This paper reports an empirical study of the failed patches generated by seven top LLM code agents. We collected 114 issues from the SWE-bench Lite dataset that remained unresolved across the agents. The seven agents produced a total of 769 failed patches for those issues, which we checked with a combination of GPT-4o and manual analysis. We present a taxonomy of the failure reasons across the patches. The taxonomy contains six categories, with several sub-categories under each category. For example, a frequently observed category is the inability of an LLM to correctly infer/produce the appropriate variable type in the produced patch. As a first step towards addressing such type-related errors, we designed PAGENT (Patch Agent). PAGENT utilizes program analysis techniques like CFG creation and exploration to infer the type information of a patch. PAGENT does this by applying repository-level static code analysis techniques. Then, PAGENT refines the inferred type by further utilizing an LLM-based inference technique. We tested PAGENT on all 127 type-related failed patches from the top three agents in our study. PAGENT could fix 29 of the 127 failed patches (about 23%).
LLM 代理器自动生成补丁以解决某个问题。 但是, 它们可以生成不准确的补丁 。 对于这些失败补丁背后的根源或如何修补这些补丁知之甚少 。 本文报告了七个顶级 LLM 代码代理器生成的失败补丁的经验性研究 。 我们从SWE- bench Lite数据集中收集了114个问题, 这些代理器之间尚未解决的这类问题。 7个代理器共生成了769个故障补丁, 我们结合GPT-4o和人工分析对这些问题进行了检查 。 我们对补丁之间的故障原因进行了分类分析。 分类包括六个类别, 每个类别下有几个子类。 例如, 经常观察到的一个类别是 LLM 无法正确推断/ / production Listem 数据 。 作为解决这类类型错误的第一步, 我们设计了 PAGENT (Patch Aget) 。 PAGENT 使用程序分析技术, 如CFG 创建和探索一个补丁的信息类型。 PAGENT , 通过应用基于仓库的固定代码分析技术对此做了进一步的改进。
Article 92
Title@2025-06-21 (6): Beyond Functional Correctness: Investigating Coding Style Inconsistencies in Large Language Models
Title: Beyond Functional Correctness: Investigating Coding Style Inconsistencies in Large Language Models | Beyond Functional Correctness: Untersuchung von Coding Style Inkonsistenzen in großen Sprachmodellen | 超越功能正确性:调查大语言模式的编码样式不一致问题 2407.00456v2 |
Authors (8): Yanlin Wang, Tianyue Jiang, Mingwei Liu, Jiachi Chen, Mingzhi Mao, Xilin Liu, Yuchi Ma, Zibin Zheng
Large language models (LLMs) have brought a paradigm shift to the field of code generation, offering the potential to enhance the software development process. However, previous research has mainly focused on the accuracy of code generation, while coding style differences between LLMs and human developers remain under-explored. In this paper, we empirically analyze the differences in coding style between the code generated by mainstream Code LLMs and the code written by human developers, and summarize a taxonomy of coding style inconsistencies. Specifically, we first summarize the types of coding style inconsistencies by manually analyzing a large number of generation results. We then compare the code generated by Code LLMs with the code written by human programmers in terms of readability, conciseness, and robustness. The results reveal that LLMs and developers have different coding styles. Additionally, we study the possible causes of these inconsistencies and provide some solutions to alleviate the problem.
大型语言模型(LLMS)给代码生成领域带来了范式的转变,提供了增强软件开发过程的潜力。然而,先前的研究主要侧重于代码生成的准确性,而LLMS与人类开发商的编码风格差异仍然未得到充分探讨。在本文中,我们从经验上分析了主流代码LLMS与人类开发商的编码风格之间的差异,并总结了编码风格不一致分类法。具体地说,我们首先通过手动分析大量生成结果来总结编码风格不一致的种类。然后,我们从可读性、简洁性和稳健性方面将代码LMS生成的代码与人类程序员编写的代码进行比较。结果显示,LLMMS和开发商的编码风格不同。此外,我们研究这些不一致的可能原因,并提供一些解决办法来缓解问题。
Article 93
Title@2025-06-21 (6): Improving Compiler Bug Isolation by Leveraging Large Language Models
Title: Improving Compiler Bug Isolation by Leveraging Large Language Models | Verbesserung der Compiler-Fehlerisolierung durch die Nutzung großer Sprachmodelle | 通过利用大语言模型改进编译者虫虫隔离 2506.17647v1 |
Authors (6): Yixian Qi, Jiajun Jiang, Fengjie Li, Bowen Chen, Hongyu Zhang, Junjie Chen
Compilers play a foundational role in building reliable software systems, and bugs within them can lead to catastrophic consequences. The compilation process typically involves hundreds of files, making traditional automated bug isolation techniques inapplicable due to scalability or effectiveness issues. Current mainstream compiler bug localization techniques have limitations in test program mutation and resource consumption. Inspired by the recent advances of pre-trained Large Language Models (LLMs), we propose an innovative approach named AutoCBI, which (1) uses LLMs to summarize compiler file functions and (2) employs specialized prompts to guide the LLM in reordering suspicious file rankings. This approach leverages four types of information: the failing test program, source file function summaries, lists of suspicious files identified through analyzing test coverage, and compilation configurations with related output messages, resulting in a refined ranking of suspicious files. Our evaluation of AutoCBI against state-of-the-art approaches (DiWi, RecBi and FuseFL) on 120 real-world bugs from the widely used GCC and LLVM compilers demonstrates its effectiveness. Specifically, AutoCBI isolates 66.67%/69.23%, 300%/340%, and 100%/57.14% more bugs than RecBi, DiWi, and FuseFL, respectively, in the Top-1 ranked results for GCC/LLVM. Additionally, the ablation study underscores the significance of each component in our approach.
编译者在建立可靠的软件系统方面发挥着基础作用, 其内部的错误可能导致灾难性后果。 编译过程通常涉及数百个文件, 使得传统的自动虫隔离技术由于可缩放性或有效性问题而不适用。 当前主流编译者错误本地化技术在测试程序突变和资源消耗方面有局限性。 在经过预先培训的大语言模型(LLMS)的最新进展的启发下, 我们提议了一个名为AutoCBI的创新方法, 它 (1) 使用LLMS来总结汇编者文件功能, (2) 使用专门提示来指导LLLM重新排序可疑文件。 这个方法利用了四种类型的信息: 失败的测试程序、 源文件功能摘要、 通过分析测试范围而查明的可疑文件清单, 以及相关输出信息的汇编配置, 导致对可疑文件的精细排序。 我们用120个真实世界的错误( DiwiiW、ReBI和LLVM 汇编者) 来评估其有效性。 具体而言, AutoCBIBI将66.67% 的69. 23%, 300/340 和RBIBI/ dib 的每RBIL. 的结果。
Article 94
Title@2025-06-21 (6): May the Feedback Be with You! Unlocking the Power of Feedback-Driven Deep Learning Framework Fuzzing via LLMs
Title: May the Feedback Be with You! Unlocking the Power of Feedback-Driven Deep Learning Framework Fuzzing via LLMs | Möge das Feedback mit dir sein! Entsperren der Kraft des Feedback-getriebenen Deep Learning Framework Fuzzing über LLMs | 愿回馈与你同在! 2506.17642v1 |
Authors (5): Shaoyu Yang, Chunrong Fang, Haifeng Lin, Xiang Chen, Zhenyu Chen
Artificial Intelligence (AI) infrastructures, represented by Deep Learning (DL) frameworks, have served as fundamental DL systems over the last decade. However, bugs in DL frameworks could lead to catastrophic consequences in some critical scenarios (e.g., healthcare and autonomous driving). A simple yet effective way to find bugs in DL frameworks is fuzz testing (fuzzing). Unfortunately, existing fuzzing techniques have not comprehensively considered multiple types of feedback. Additionally, they analyze feedback in a coarse-grained manner, such as mutating the test cases only according to whether the coverage increases. Recently, researchers introduced Large Language Models (LLMs) into fuzzing. However, current LLM-based fuzzing techniques only focus on using LLMs to generate test cases while overlooking their potential to analyze feedback information, failing to create more valid and diverse test cases. To fill this gap, we propose FUEL to break the seal of feedback-driven fuzzing for DL frameworks. The backbone of FUEL comprises two LLM-based agents, namely the analysis LLM and the generation LLM. The analysis LLM agent infers analysis summaries from feedback information, while the generation LLM agent creates tests guided by these analysis summaries. So far, FUEL has detected 104 bugs in PyTorch and TensorFlow, with 93 confirmed as new bugs, 47 already fixed, and 5 assigned CVE IDs. Our work indicates that considering multiple types of feedback is beneficial to fuzzing performance, and leveraging LLMs to analyze feedback information is a promising direction. Our artifact is available at https://github.com/NJU-iSE/FUEL
以深学习(DL)框架为代表的人工智能(AI)基础设施在过去10年中一直作为基本的DL系统。然而,DL框架中的错误在某些关键情景(如保健和自主驱动)中可能导致灾难性后果。在DL框架中发现错误的一个简单而有效的方法就是模糊测试(Fuzzing ) 。遗憾的是,现有的模糊技术没有全面考虑多种反馈类型。此外,它们以粗糙的方式分析反馈,例如测试案例的反馈只根据覆盖面是否增加而变异。最近,研究人员将大语言模型(LLLM)引入了模糊。然而,目前基于LLM的模糊技术可能在某些关键情景(如医疗保健和自主驱动)中带来灾难性后果。在忽略其分析反馈信息的潜力的同时,忽略了DLF框架中的错误分析(FUEL),我们建议FULF打破以反馈驱动的模糊信息封印。 FUELL的骨架由两个基于LM的信息代理组成,即分析LM和生成LLM的LM。 分析LM的LM代理人在分析过程中,在分析过程中,通过DLM 104 RalF 分析模型,通过这些原始分析,通过测试,通过这些磁盘分析, 分析, 分析,通过这些磁盘分析,通过测试,通过这些分析,通过这些磁盘的磁盘的磁盘分析,通过这些磁盘分析,通过这些磁盘分析,通过这些磁盘分析,通过。
Article 95
Title@2025-06-21 (6): Deep Learning Framework Testing via Model Mutation: How Far Are We?
Title: Deep Learning Framework Testing via Model Mutation: How Far Are We? | Deep Learning Framework Testing über Modellmutation: Wie weit sind wir? | 通过模型变异进行深层次学习框架测试:我们有多远? 2506.17638v1 |
Authors (10): Yanzhou Mu, Rong Wang, Juan Zhai, Chunrong Fang, Xiang Chen, Zhiyuan Peng, Peiran Yang, Ruixiang Qian, Shaoyu Yang, Zhenyu Chen
Deep Learning (DL) frameworks are a fundamental component of DL development. Therefore, the detection of DL framework defects is important and challenging. As one of the most widely adopted DL testing techniques, model mutation has recently gained significant attention. In this study, we revisit the defect detection ability of existing mutation-based testing methods and investigate the factors that influence their effectiveness. To begin with, we reviewed existing methods and observed that many of them mutate DL models (e.g., changing their parameters) without any customization, ignoring the unique challenges in framework testing. Another issue with these methods is their limited effectiveness, characterized by a high rate of false positives caused by illegal mutations arising from the use of generic, non-customized mutation operators. Moreover, we tracked the defects identified by these methods and discovered that most of them were ignored by developers. Motivated by these observations, we investigate the effectiveness of existing mutation-based testing methods in detecting important defects that have been authenticated by framework developers. We begin by collecting defect reports from three popular frameworks and classifying them based on framework developers’ ratings to build a comprehensive dataset. We then perform an in-depth analysis to uncover valuable insights. Based on our findings, we propose optimization strategies to address the shortcomings of existing approaches. Following these optimizations, we identified seven new defects, four of which were confirmed by developers as high-priority issues, with three resolved. In summary, we identified 39 unique defects across just 23 models, of which 31 were confirmed by developers, and eight have been fixed.
深学习(DL)框架是DL开发的基本组成部分。因此,发现DL框架缺陷很重要而且具有挑战性。作为最广泛采用的DL测试技术之一,模型突变最近引起了极大关注。在本研究中,我们重新审视了现有突变测试方法的缺陷检测能力,并调查了影响其有效性的因素。首先,我们审查了现有方法,发现许多现有突变测试方法(如改变参数)在不作任何定制的情况下变换DL模型(如改变其参数),无视框架测试中的独特挑战。这些方法的另一个问题是效力有限,其特点是使用通用的、非定制的变异操作者造成的非法突变率过高。此外,我们跟踪了这些方法所发现的缺陷,发现其中多数为开发者所忽视。我们根据这些观察,调查了现有突变测试方法在发现框架开发者所验证的重要缺陷方面的有效性。我们首先从三个流行框架收集缺陷报告,并根据框架开发者的评级对其进行分类,以构建一个全面的数据集为特征。我们随后提出了一个有8个高精度的精确度分析,然后我们用8个高精度的精确度分析。我们用3个新的精度来发现。
Article 96
Title@2025-06-21 (6): CodeMorph: Mitigating Data Leakage in Large Language Model Assessment
Title: CodeMorph: Mitigating Data Leakage in Large Language Model Assessment | CodeMorph: Eindämmung der Datenleckage in der Bewertung von Großsprachenmodellen | 代码Morph:减少大语言模式评估中的数据泄漏 2506.17627v1 |
Authors (6): Hongzhou Rao, Yanjie Zhao, Wenjie Zhu, Ling Xiao, Meizhen Wang, Haoyu Wang
Concerns about benchmark leakage in large language models for code (Code LLMs) have raised issues of data contamination and inflated evaluation metrics. The diversity and inaccessibility of many training datasets make it difficult to prevent data leakage entirely, even with time lag strategies. Consequently, generating new datasets through code perturbation has become essential. However, existing methods often fail to produce complex and diverse variations, struggle with complex cross-file dependencies, and lack support for multiple programming languages, which limits their effectiveness in enhancing LLM evaluations for coding tasks. To fill this gap, we propose CodeMorph, an approach designed to support multiple programming languages while preserving cross-file dependencies to mitigate data leakage. CodeMorph consists of two main components that work together to enhance the perturbation process. The first component employs 26 semantic-preserving transformation methods to iteratively perturb code, generating diverse variations while ensuring that the modified code remains compilable. The second component introduces a genetic algorithm-based selection algorithm, PESO, to identify the more effective perturbation method for each iteration by targeting lower similarity scores between the perturbed and original code, thereby enhancing overall perturbation effectiveness. Experimental results demonstrate that after applying CodeMorph, the accuracy of the LLM on code completion tasks across five programming languages decreased by an average of 24.67%, with Python showing the most significant reduction at 45%. The similarity score of code optimized by PESO is, on average, 7.01% lower than that of randomly perturbed code, peaking at a reduction of 42.86%.
对大语言代码模型(Code LLMS)基准渗漏的关切引起了数据污染和夸大评价指标的问题。许多培训数据集的多样性和难以获取性使得很难完全防止数据渗漏,即使时间滞后战略也是如此。因此,通过代码扰动产生新的数据集变得至关重要。然而,现有方法往往不能产生复杂和多样的变异,与复杂的交叉依赖性相争,以及缺乏对多种编程语言的支持,这限制了它们加强LLM对编码任务的编码评估的有效性。为了填补这一空白,我们提议了代码Morph,这是一种旨在支持多种编程语言,同时保持跨访问路径依赖以减缓数据渗漏的方法。 CodeMorph由两个主要组成部分组成,它们一起工作,通过代码的更低相似性,在完成前五年的代码中,在读完后,使用26个语系保留转换方法,同时确保修改后的代码仍然可以比较。第二个组成部分采用了基于遗传算法的筛选算法,即PESOOO,以便确定每次编程的更有效度方法,通过将最相似的精度排序的精度标值排序,在完成后,在完成后的精度平均代码中,在完成后的精度中,通过测试中显示平均的精度的精度的精度排序中,在降低的精度的精度排序中,在总的精度排序中,在降低的精度的精度的精度的精度的精度的精度排序中,在降低的精度排序中,在总体的精度排序中,在总精度的精度上。
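The core idea of perturbing code with semantic-preserving transformations and then preferring the variant least similar to the original can be sketched as follows. This is a greedy, single-step stand-in and not the PESO genetic algorithm from the paper. The two transformers, the `difflib` similarity measure, and all names are illustrative assumptions. `ast.unparse` requires Python 3.9 or later.

```python
import ast
import difflib


class RenameVars(ast.NodeTransformer):
    """Rename variables and parameters according to a fixed mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


class FlipComparisons(ast.NodeTransformer):
    """Rewrite `a < b` as `b > a` (safe only for side-effect-free operands)."""

    FLIP = {ast.Lt: ast.Gt, ast.Gt: ast.Lt, ast.LtE: ast.GtE, ast.GtE: ast.LtE}

    def visit_Compare(self, node):
        self.generic_visit(node)
        op_type = type(node.ops[0])
        if len(node.ops) == 1 and op_type in self.FLIP:
            return ast.Compare(left=node.comparators[0],
                               ops=[self.FLIP[op_type]()],
                               comparators=[node.left])
        return node


def apply(transformer, source):
    tree = transformer.visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


def similarity(a, b):
    # Stand-in metric; the paper's similarity score may be defined differently.
    return difflib.SequenceMatcher(None, a, b).ratio()


def pick_least_similar(source, transformers):
    base = ast.unparse(ast.parse(source))  # normalize formatting first
    candidates = [apply(t, source) for t in transformers]
    return min(candidates, key=lambda c: similarity(base, c))


ORIGINAL = (
    "def clamp(x, lo, hi):\n"
    "    if x < lo:\n"
    "        return lo\n"
    "    if x > hi:\n"
    "        return hi\n"
    "    return x\n"
)
perturbed = pick_least_similar(
    ORIGINAL,
    [RenameVars({"x": "value", "lo": "lower", "hi": "upper"}), FlipComparisons()],
)
```

Whichever candidate is chosen, `clamp` behaves the same as before when called positionally. Only its surface form changes, which is the property a leakage-mitigating perturbation needs.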
Article 97
Title@2025-06-21 (6): Fuzzing-based Mutation Testing of C/C++ CPS
Title: Fuzzing-based Mutation Testing of C/C++ CPS | Fuzzing-basierte Mutationsprüfung von C/C++ CPS | C/C++CPS的模糊基变异测试 2503.24100v2 |
Authors (3): Jaekwon Lee, Fabrizio Pastore, Lionel Briand
Mutation testing can help minimize the delivery of faulty software. Therefore, it is a recommended practice for developing embedded software in safety-critical cyber-physical systems (CPS). However, state-of-the-art mutation testing techniques for C and C++ software, which are common languages for CPS, depend on symbolic execution. Unfortunately, symbolic execution’s limitations hinder its applicability (e.g., systems with black-box components). We propose relying on fuzz testing, which has demonstrated its effectiveness for C and C++ software. Fuzz testing tools automatically create test inputs that explore program branches in various ways, exercising statements in different program states, and thus enabling the detection of mutants, which is our objective. We empirically evaluated our approach using software components from operational satellite systems. Our assessment shows that our approach can detect between 40% and 90% of the mutants not detected by developers’ test suites. Further, we empirically determined that the best results are obtained by integrating the Clang compiler, a memory address sanitizer, and relying on laf-intel instrumentation to collect coverage and guide fuzzing. Our approach detects a significantly higher percentage of live mutants compared to symbolic execution, with an increase of up to 50 percentage points; further, we observed that although the combination of fuzzing and symbolic execution leads to additional mutants being killed, the benefits are minimal (a gain of less than one percentage point).
因此,这是开发安全临界网络物理系统(CPS)内嵌软件的建议做法。然而,C和C++软件(CPS通用语言的C和C++软件)的最先进的突变测试技术取决于象征性的执行。不幸的是,象征性执行的限制妨碍了它的适用性(例如黑盒组件的系统)。我们提议依靠模糊测试,这已经证明了C和C++软件的有效性。Fuzz测试工具自动创造测试投入,以各种方式探索程序分支,在不同的程序状态中进行陈述,从而能够探测我们的目标所在的变异体。我们用操作卫星系统的软件组件对我们的方法进行了实证性评估。我们的评估表明,我们的方法可以探测出40%至90%的变异体,而开发者测试室没有检测到这些变异体。此外,我们从经验上确定,最佳结果是通过整合Clang编译器、记忆地址S+++软件、依靠laf-int仪器来收集覆盖范围,从而能够探测出不同程序状态,从而能够探测变异体,而这正是我们的目标。我们的方法通过使用操作软件组件评估了50%的象征性化执行率,但比观察到的变异体杀伤率要高得多。
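The notion of a fuzzer "killing" a mutant can be shown with a minimal sketch: a mutant is killed as soon as some generated input makes it behave differently from the original. The paper targets C/C++ with coverage-guided fuzzing, whereas this sketch uses plain random inputs in Python purely for illustration. The functions and the input range are invented.

```python
import random


def original(a, b):
    """Return the larger of two integers."""
    return a if a >= b else b


def mutant_min(a, b):
    """Mutant: relational operator `>=` replaced by `<=`."""
    return a if a <= b else b


def mutant_equivalent(a, b):
    """Mutant: `>=` replaced by `>`; behaviour is unchanged (equivalent)."""
    return a if a > b else b


def fuzz_kill(reference, mutant, trials=1000, seed=0):
    """Return the first random input on which the outputs differ, else None."""
    rng = random.Random(seed)
    for _ in range(trials):
        args = (rng.randint(-100, 100), rng.randint(-100, 100))
        if reference(*args) != mutant(*args):
            return args  # the mutant is killed by this input
    return None  # the mutant stays live within the budget


killing_input = fuzz_kill(original, mutant_min)
live_result = fuzz_kill(original, mutant_equivalent)
```

The second mutant can never be killed because it is semantically equivalent to the original, which is one reason reported kill rates stay below 100%.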
Article 98
Title@2025-06-21 (6): Large Language Model Guided Self-Debugging Code Generation
Title: Large Language Model Guided Self-Debugging Code Generation | Große Sprache Modell geführte Selbst-Debugging-Code-Generierung | 大语言制导自调自调码生成 2502.02928v2 |
Authors (3): Muntasir Adnan, Zhiwei Xu, Carlos C. N. Kuhn
Automated code generation is gaining significant importance in intelligent computer programming and system deployment. However, current approaches often face challenges in computational efficiency and lack robust mechanisms for code parsing and error correction. In this work, we propose a novel framework, PyCapsule, with a simple yet effective two-agent pipeline and efficient self-debugging modules for Python code generation. PyCapsule features sophisticated prompt inference, iterative error handling, and case testing, ensuring high generation stability, safety, and correctness. Empirically, PyCapsule achieves up to 5.7% improvement in success rate on HumanEval, 10.3% on HumanEval-ET, and 24.4% on BigCodeBench compared to state-of-the-art methods. We also observe a decrease in normalized success rate given more self-debugging attempts, potentially affected by limited and noisy error feedback in retention. PyCapsule demonstrates broader impacts on advancing lightweight and efficient code generation for artificial intelligence systems.
在智能计算机编程和系统部署方面,自动代码生成正在变得日益重要。然而,目前的方法在计算效率方面往往面临挑战,缺乏对代码进行分解和纠正错误的强大机制。在这项工作中,我们提议建立一个新的框架,即PyCapsule,为Python代码生成提供一个简单而有效的双试管管道和高效自调模块。PyCapsule具有精密的快速快速推断、迭接错误处理和案件测试等特征,确保高生成稳定性、安全性和正确性。PyCapsule在人类经济学成功率上取得了5.7%的提高,在人类经济学中提高了10.3%,在大计算机伯恩奇(BigCode Bench)上提高了24.4%。我们还注意到,由于更多的自我调试尝试,可能受到保留中有限和吵闹的错误反馈的影响,标准化的成功率有所下降。PyCapsule对人造智能系统的轻度和高效代码生成产生了更广泛的影响。
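A generic self-debugging loop of the kind the abstract describes can be sketched as: generate code, run it against a test, and feed any error back into the next attempt. This is not PyCapsule's actual pipeline. The `generate` callable stands in for an LLM call, and `fake_generator` is a hard-coded stub used only to make the example runnable.

```python
import traceback
from typing import Callable, Optional, Tuple


def run_candidate(code: str, test: str) -> Optional[str]:
    """Execute code and its test; return None on success, else error text.

    Note: exec on untrusted model output is unsafe; a real system would
    run candidates in a sandbox.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)
        exec(test, namespace)
    except Exception:
        return traceback.format_exc(limit=1)
    return None


def self_debug(generate: Callable[[str, Optional[str]], str],
               task: str, test: str,
               max_attempts: int = 3) -> Tuple[Optional[str], int]:
    """Retry generation, passing the previous error back as feedback."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        code = generate(task, feedback)
        feedback = run_candidate(code, test)
        if feedback is None:
            return code, attempt
    return None, max_attempts


def fake_generator(task: str, feedback: Optional[str]) -> str:
    """Stub for an LLM: buggy on the first try, fixed once it sees feedback."""
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


code, attempts = self_debug(fake_generator, "write add(a, b)",
                            "assert add(2, 3) == 5")
```

Capping `max_attempts` matters in practice: the abstract notes that the normalized success rate decreases as more self-debugging attempts are allowed.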
Article 99
Title@2025-06-21 (6): EditLord: Learning Code Transformation Rules for Code Editing
Title: EditLord: Learning Code Transformation Rules for Code Editing | EditLord: Regeln zur Code-Transformation für die Code-Editing | 编辑主: 学习代码编辑的代码转换规则 2504.15284v3 |
Authors (6): Weichen Li, Albert Jan, Baishakhi Ray, Junfeng Yang, Chengzhi Mao, Kexin Pei
Code editing is a foundational task in software development, where its effectiveness depends on whether it introduces desired code property changes without changing the original code’s intended functionality. Existing approaches often formulate code editing as an implicit end-to-end task, omitting the fact that code-editing procedures inherently consist of discrete and explicit steps. Thus, they suffer from suboptimal performance and lack of robustness and generalization. We introduce EditLord, a code editing framework that makes the code transformation steps explicit. Our key insight is to employ a language model (LM) as an inductive learner to extract code editing rules from the training code pairs as concise meta-rule sets. Such rule sets will be manifested for each training sample to augment them for finetuning or assist in prompting- and iterative-based code editing. EditLord outperforms the state-of-the-art by an average of 22.7% in editing performance and 58.1% in robustness while achieving 20.2% higher functional correctness across critical software engineering and security applications, LM models, and editing modes.
代码编辑是软件开发中的一项基本任务,其有效性取决于它是否在不改变原代码的预期功能的情况下引入了理想的代码属性变化。现有方法往往将代码编辑作为一种隐含的端对端任务来制定代码编辑,忽略了编码编辑程序本身就包含离散和清晰的步骤这一事实。因此,它们表现欠佳,缺乏稳健性和概括性。我们引入了编辑框架,即使代码转换步骤明确化的代码编辑框架。我们的关键洞察力是使用一种语言模式(LM)作为感化学习者,从培训代码对口中提取代码编辑规则,作为简洁的元规则。对于每个培训样本,这些套规则将表现为增强它们,以进行微调,或协助快速和反复的代码编辑。在编辑性能和功能方面,用平均22.7%的编辑性能和58.1%的强性能来修改现状,同时在关键软件工程和安全应用程序、LM模式和编辑模式中实现20.2%的更高功能正确性。
Article 100
Title@2025-06-20 (5): Dissecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems
Title: Dissecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems | Dissecting the SWE-Bench Leaderboards: Profiling Submitters and Architectures of LLM- and Agent-Based Repair Systems | SWE-区领导板拆解:LLM-和代理修理系统的分析提交者和结构 2506.17208v1 |
Authors (2): Matias Martinez, Xavier Franch
The rapid progress in Automated Program Repair (APR) has been driven by advances in AI, particularly large language models (LLMs) and agent-based systems. SWE-Bench is a recent benchmark designed to evaluate LLM-based repair systems using real issues and pull requests mined from 12 popular open-source Python repositories. Its public leaderboards, SWE-Bench Lite and SWE-Bench Verified, have become central platforms for tracking progress and comparing solutions. However, because the submission process does not require detailed documentation, the architectural design and origin of many solutions remain unclear. In this paper, we present the first comprehensive study of all submissions to the SWE-Bench Lite (68 entries) and Verified (79 entries) leaderboards, analyzing 67 unique approaches across dimensions such as submitter type, product availability, LLM usage, and system architecture. Our findings reveal the dominance of proprietary LLMs (especially Claude 3.5/3.7), the presence of both agentic and non-agentic designs, and a contributor base spanning from individual developers to large tech companies.
自动化方案维修(APR)的快速进展是由AI(特别是大型语言模型和代理系统)的进展所推动的。SWE-Bench是最近的一项基准,旨在利用实际问题评价基于LLM的修理系统,并调用12个广受欢迎的开放源码Python仓库中提取的请求。其公共领导板SWE-Bench Lite和SWE-Bench Verification,已经成为跟踪进展和比较解决办法的中心平台。然而,由于提交过程不需要详细文件,许多解决办法的建筑设计和来源仍然不清楚。在本文件中,我们首次全面研究了向SWE-Bench Lite(68个条目)和Verized(79个条目)领先板提交的所有呈件,分析了67个独特的方法,如提交器类型、产品供应、LLMM的使用和系统结构。我们的调查结果显示,专有LMS(特别是Claude 3.5/3.7)、代理和非试剂设计的存在以及个人开发商向大型技术公司提供基础。
Article 101
Title@2025-06-20 (5): LLMs and Stack Overflow Discussions: Reliability, Impact, and Challenges
Title: LLMs and Stack Overflow Discussions: Reliability, Impact, and Challenges | LLMs und Stack-Überflussdiskussionen: Zuverlässigkeit, Wirkung und Herausforderungen | LLM和Stack 溢流讨论:可靠性、影响和挑战 2402.08801v2 |
Authors (3): Leuson Da Silva, Jordan Samhi, Foutse Khomh
Since its release in November 2022, ChatGPT has shaken up Stack Overflow, the premier platform for developers’ queries on programming and software development. Demonstrating an ability to generate instant, human-like responses to technical questions, ChatGPT has ignited debates within the developer community about the evolving role of human-driven platforms in the age of generative AI. Two months after ChatGPT’s release, Meta released its answer with its own Large Language Model (LLM) called LLaMA: the race was on. We conducted an empirical study analyzing questions from Stack Overflow and using these LLMs to address them. This way, we aim to (i) quantify the reliability of LLMs’ answers and their potential to replace Stack Overflow in the long term; (ii) identify and understand why LLMs fail; (iii) measure the evolution of user activity on Stack Overflow over time; and (iv) compare the LLMs with each other. Our empirical results are unequivocal: ChatGPT and LLaMA challenge human expertise, yet do not outperform it for some domains, while a significant decline in user posting activity has been observed. Furthermore, we also discuss the impact of our findings regarding the usage and development of new LLMs and provide guidelines for future challenges faced by users and researchers.
自2022年11月公布以来,ChatGPT已经动摇了Stack Overflow,这是开发者在编程和软件开发方面查询的首要平台。展示了对技术问题作出即时、人性化答复的能力。ChatGPT已经引发了开发者社区内部关于人类驱动平台在基因化AI时代不断变化的作用的辩论。在ChatgPT发布两个月后,Meta以自己的大语言模型(LLLAMA)(LLMM)(LLLAMA):竞赛开始了。我们进行了一项经验性研究,分析来自Stack Overpt 的问题,并利用这些LLMMS来解决这些问题。这样,我们的目的是(一)量化LLMS答复的可靠性及其取代Stack Overtrop 长期的潜能;(二) 查明和理解LMMS失败的原因;(三) 衡量用户活动与Stack Overtrap 的演变过程;以及(四) 比较LMMS。我们的经验结论是明确的:ChatGPT和LAMA挑战人类专门知识,但在某些领域没有超越它,而用户张贴活动则明显减少。此外,我们还讨论LMS的用户和今后使用和研究人员面临的挑战。
Article 102
Title@2025-06-20 (5): Large Language Model Unlearning for Source Code
Title: Large Language Model Unlearning for Source Code | Großes Sprachmodell Unlearning für Quellcode | 源代码的大语言模式重新学习 2506.17125v1 |
Authors (11): Xue Jiang, Yihong Dong, Zheng Fang, Yingwei Ma, Tangxinyu Wang, Rongyu Cao, Binhua Li, Zhi Jin, Wenpin Jiao, Yongbin Li, Ge Li
LLM4SE has demonstrated significant success, but LLMs’ potential memorization of sensitive or outdated training data introduces critical risks to legal compliance, software security, and code quality. LLM unlearning techniques, which can eliminate the influence of undesired data from LLMs in a post-training way, present a promising solution to address these concerns. While recent efforts in LLM unlearning show effectiveness in natural language, their applicability to source code remains underexplored. Our empirical study reveals that existing LLM unlearning approaches, when applied to source code, cause severe model utility degradation, rendering models practically unusable for code generation. In this paper, we propose PROD, a novel unlearning approach that enables LLMs to forget undesired code content while effectively preserving their code generation capabilities. PROD suppresses the probability of forget data in LLMs’ output distribution while promoting candidate distributional components, enabling the model to jointly learn to forget specific content and retain its general capabilities. To facilitate this study, we establish a benchmark for code unlearning evaluation, which includes three critical downstream tasks: copyrighted code unlearning, insecure code unlearning, and deprecated API unlearning. Our evaluation demonstrates that PROD achieves superior balance between forget quality and model utility compared to existing unlearning approaches across three downstream tasks, while consistently exhibiting improvements when applied to LLMs of varying series. PROD also exhibits superior robustness against adversarial attacks without generating or exposing the data to be forgotten. The results underscore that our approach not only extends the application boundary of unlearning techniques to source code, but also holds significant implications for advancing reliable code generation.
LLM4SE已经表现出显著的成功,但LLMS公司对敏感或过时的培训数据的潜在记忆化给法律合规、软件安全和代码质量带来重大风险。LLM公司不学习技术,它能够以培训后的方式消除LLM公司不受欢迎的数据的影响,为解决这些问题提供了一个有希望的解决办法。LLM公司不学习的最近努力显示了自然语言的有效性,但对源代码的适用性仍未得到充分探讨。我们的经验研究表明,LLM公司现有的不学习方法在应用源代码时,造成了严重的示范工具退化,使模型实际上无法用于代码生成。在本文件中,我们建议PROD公司采用新的不学习方法,使LLMS公司能够忘记不受欢迎的代码内容,同时有效地维护其代码生成能力。PROD公司抑制了在LMS产出分配中忘记数据的可能性,同时促进候选人分配部分,使模型能够共同学习忘记具体内容并保留其一般能力。为了便利这一研究,我们只为代码不学习差异评价制定基准,其中包括三项关键的下游任务:版权代码不学习,不加密代码不学习,不可靠代码不学习,不可靠代码不学习,不使用不准备,不学模式不学模式不学模式不学模式,而贬低的LOPLPIPIS公司不学方法,同时将现有高级数据质量评估方法用于不断推进进行质量质量评估。
Article 103
Title@2025-06-20 (5): Reassessing Code Authorship Attribution in the Era of Language Models
Title: Reassessing Code Authorship Attribution in the Era of Language Models | Neubewertung von Code Authorship Attribution im Zeitalter der Sprachmodelle | 重新评估在语言模式时代重新确定《语言模式时代》中归属的法规授权人 2506.17120v1 |
Authors (3): Atish Kumar Dipongkor, Ziyu Yao, Kevin Moran
The study of Code Stylometry, and in particular Code Authorship Attribution (CAA), aims to analyze coding styles to identify the authors of code samples. CAA is crucial in cybersecurity and software forensics for addressing, detecting plagiarism, and supporting criminal prosecutions. However, CAA is a complex and error-prone task, due to the need for recognizing nuanced relationships between coding patterns. This challenge is compounded in large software systems with numerous authors due to the subtle variability of patterns that signify the coding style of one author among many. Given the challenges related to this task, researchers have proposed and studied automated approaches that rely upon classical Machine Learning and Deep Learning techniques. However, such techniques have historically relied upon hand-crafted features, and due to the often intricate interaction of different features (e.g., formatting), have key limitations in properly characterizing authorship, and are sensitive to adversarial code perturbations. Recently, transformer-based Language Models (LMs) have shown remarkable efficacy across a range of software engineering tasks, and in authorship attribution on natural language in the NLP domain. However, their effectiveness in CAA is not well understood. As such, we conduct the first extensive empirical study applying two larger state-of-the-art code LMs and five smaller code LMs to the task of CAA on 6 diverse datasets that encompass 12k code snippets written by 463 developers. Furthermore, we perform an in-depth analysis of our studied models’ performance on CAA using established machine learning interpretability techniques. The results of our analysis illustrate important findings that illuminate the behavior of LMs in understanding stylometric code patterns during the task of CAA, and point towards important directions for future work.
由于需要识别编码模式之间的细微关系,因此,CAA是一项复杂和容易出错的任务。在大型软件系统中,挑战更为复杂,因为许多作者的形态变化微妙,表明许多作者的编码风格。鉴于与这项任务有关的挑战,研究人员提出并研究了依靠经典机器学习和深层学习技术的自动化方法。然而,CAA在网络安全和软件取证方面至关重要,对于处理、发现污蔑以及支持刑事诉讼至关重要。然而,CAA是一项复杂和易出错的任务,因为需要识别编码模式之间的细微关系。在大型软件系统系统中,这一挑战更为复杂,因为许多作者都具有微妙的变异性,表明一个作者的编码风格。鉴于此任务的挑战,研究人员提出并研究了依赖古典机器学习和深层学习技术的自动化方法的自动化方法。然而,这类技术历来依赖于手工艺特征,而且由于不同特征(如格式等)之间往往错综复杂的相互作用,因此,CAA(变异语言模型)在一系列软件工程任务中表现出了惊人的功效,而在NLPA领域对自然定位进行书面语言的归属,因此,在LMSA系统内部进行了更深入的研究中,而我们理解了对LA系统系统系统系统进行更深入的分析。
Article 104
Title@2025-06-20 (5): Software Fairness Testing in Practice
Title: Software Fairness Testing in Practice | Software Fairness-Tests in der Praxis | 实践中软件公平测试 2506.17095v1 |
Authors (4): Ronnie de Souza Santos, Matheus de Morais Leca, Reydne Santos, Cleyton Magalhaes
Software testing ensures that a system functions correctly, meets specified requirements, and maintains high quality. As artificial intelligence and machine learning (ML) technologies become integral to software systems, testing has evolved to address their unique complexities. A critical advancement in this space is fairness testing, which identifies and mitigates biases in AI applications to promote ethical and equitable outcomes. Despite extensive academic research on fairness testing, including test input generation, test oracle identification, and component testing, practical adoption remains limited. Industry practitioners often lack clear guidelines and effective tools to integrate fairness testing into real-world AI development. This study investigates how software professionals test AI-powered systems for fairness through interviews with 22 practitioners working on AI and ML projects. Our findings highlight a significant gap between theoretical fairness concepts and industry practice. While fairness definitions continue to evolve, they remain difficult for practitioners to interpret and apply. The absence of industry-aligned fairness testing tools further complicates adoption, necessitating research into practical, accessible solutions. Key challenges include data quality and diversity, time constraints, defining effective metrics, and ensuring model interoperability. These insights emphasize the need to bridge academic advancements with actionable strategies and tools, enabling practitioners to systematically address fairness in AI systems.
由于人工智能和机器学习技术已成为软件系统的组成部分,因此测试已经发展到能够解决其独特复杂性的地步。这一空间的一个关键进步是公平测试,它确定并减少AI应用中的偏见,以促进道德和公平结果。尽管对公平测试进行了广泛的学术研究,包括测试投入生成、测试或触角识别和组成部分测试,但实际采用仍然有限。工业从业人员往往缺乏明确的指南和有效工具,无法将公平测试纳入现实世界的AI开发。这项研究调查了软件专业人员如何通过与从事AI和ML项目的22名从业人员的访谈,测试AI驱动的系统,以实现公平。我们的调查结果突出表明理论公平概念与行业实践之间的巨大差距。虽然公平定义在继续演变,但实践者仍然难以解释和应用。缺乏与行业一致的公平测试工具使采用更加复杂,需要研究实际的、可获取的解决办法。关键的挑战包括数据质量和多样性、时间限制、确定有效的衡量标准以及确保模式的互操作性。这些见解强调需要将学术进步与可操作的战略和工具联系起来,使从业人员能够系统地处理AI系统中的公平问题。
Article 105
Title@2025-06-20 (5): Re-Evaluating Code LLM Benchmarks Under Semantic Mutation
Title: Re-Evaluating Code LLM Benchmarks Under Semantic Mutation | Neubewertung von Code-LLM-Benchmarks unter semantischer Mutation | 在语义变异下重新估价代码法LLM基准 2506.17369v1 |
Authors (4): Zhiyuan Pan, Xing Hu, Xin Xia, Xiaohu Yang
In the era of large language models (LLMs), code benchmarks have become an important research area in software engineering and are widely used by practitioners. These benchmarks evaluate the performance of LLMs on specific code-related tasks, such as code understanding and generation. A critical step in constructing code benchmarks is the design of prompts. However, as existing code benchmarks typically rely on a single prompt template per task, they are prone to the issue of prompt sensitivity, where minor prompt variations could result in substantial performance variations, leading to unreliable evaluations of model capabilities. While previous studies have explored prompt sensitivity, their experimental designs and findings are limited to traditional natural language processing (NLP) tasks. In this paper, we present an empirical study to investigate prompt sensitivity in code benchmarks. We first propose a general framework that modifies prompt templates in a manner that preserves both their semantics and their structure as much as possible. Based on the framework, we conduct extensive experiments across eight code benchmark tasks on 10 representative open-source LLMs, with each task featuring 100 semantically similar prompt templates. We then analyze the evaluation results using various statistical metrics, focusing on both absolute and relative model performance. Our findings suggest that even slight prompt variations can lead to significant shifts in performance. Additionally, we observe that such variations can introduce inconsistencies in the performance rankings across different models. These insights highlight the need for considering prompt sensitivity when designing future code benchmarks, to ensure more reliable and accurate evaluation of LLM capabilities.
在大型语言模型(LLMS)时代,守则基准已成为软件工程中的一个重要研究领域,并被从业人员广泛使用。这些基准评估LLMS在具体代码相关任务(如代码理解和生成)方面的业绩。建立守则基准的关键一步是设计速效。然而,由于现有的守则基准通常依赖单一的及时模板,因此很容易出现迅速敏感问题,在这种问题上,轻微的迅速变化可能导致显著的业绩差异,导致对模型能力的评价不可靠。虽然以往的研究探索了及时的敏感性,但其实验设计和结果仅限于传统的自然语言处理(NLP)任务。在本文件中,我们提出实证研究,以调查守则基准迅速敏感度调查。我们首先提出一个总框架,以尽可能保存其语义和结构的方式修改及时模板。根据该框架,我们对10个具有代表性的公开源LMS的8项基准任务进行了广泛的试验,每项任务包含100个具有性质类似的快速模板。我们随后利用各种统计指标分析评价结果,重点是绝对和相对的准确度。我们提出一个总框架,我们提出一个总框架,以尽可能保持其准确性的方式修改模板。我们的结论结论建议,在不同的业绩中可以确保未来的准确性变化。
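The two views of prompt sensitivity named in the abstract above (absolute score shifts and changes in model rankings) can be illustrated with a small sketch. The scores, model names, and the choice of Kendall's tau as the rank-agreement statistic are assumptions made for illustration; the abstract does not say which statistical metrics the authors use.

```python
from itertools import combinations
from statistics import pstdev

# Hypothetical pass rates: scores[model][i] is the score under prompt template i.
scores = {
    "model_a": [0.62, 0.58, 0.65],
    "model_b": [0.60, 0.61, 0.59],
    "model_c": [0.40, 0.45, 0.38],
}


def spread(values):
    """Absolute sensitivity: (range, population std dev) across templates."""
    return max(values) - min(values), pstdev(values)


def ranking(template_index):
    """Models ordered best-first under a single prompt template."""
    return sorted(scores, key=lambda m: scores[m][template_index], reverse=True)


def kendall_tau(rank_x, rank_y):
    """Kendall's tau between two rankings of the same items (no ties assumed)."""
    pos_x = {m: i for i, m in enumerate(rank_x)}
    pos_y = {m: i for i, m in enumerate(rank_y)}
    concordant = discordant = 0
    for a, b in combinations(rank_x, 2):
        sign = (pos_x[a] - pos_x[b]) * (pos_y[a] - pos_y[b])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


if __name__ == "__main__":
    for model, values in scores.items():
        value_range, std = spread(values)
        print(f"{model}: range={value_range:.3f} std={std:.3f}")
    print("tau(template 0, template 1) =", kendall_tau(ranking(0), ranking(1)))
```

A tau of 1.0 means two templates rank the models identically; lower values indicate that the choice of template alone reorders the leaderboard.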
Article 106
Title@2025-06-20 (5): Behavior Driven Development for 3D Games
Title: Behavior Driven Development for 3D Games | Behavior Driven Entwicklung für 3D-Spiele | 3D运动会行为驱动器开发 2506.17057v1 |
Authors (6): Fernando Pastor Ricós, Beatriz Marín, I. S. W. B. Prasetya, Tanja E. J. Vos, Joseph Davidson, Karel Hovorka
Computer 3D games are complex software environments that require novel testing processes to ensure high-quality standards. The Intelligent Verification/Validation for Extended Reality Based Systems (iv4XR) framework addresses this need by enabling the implementation of autonomous agents to automate game testing scenarios. This framework facilitates the automation of regression test cases for complex 3D games like Space Engineers. Nevertheless, the technical expertise required to define test scripts using iv4XR can constrain seamless collaboration between developers and testers. This paper reports how integrating a Behavior-driven Development (BDD) approach with the iv4XR framework allows the industrial company behind Space Engineers to automate regression testing. The success of this industrial collaboration has inspired the iv4XR team to integrate the BDD approach to improve the automation of play-testing for the experimental 3D game LabRecruits. Furthermore, the iv4XR framework has been extended with tactical programming to enable the automation of long-play test scenarios in Space Engineers. These results underscore the versatility of the iv4XR framework in supporting diverse testing approaches while showcasing how BDD empowers users to create, manage, and execute automated game tests using comprehensive and human-readable statements.
计算机 3D 游戏是复杂的软件环境,需要创新的测试程序来确保高质量的标准。扩展现实系统(iv4XR)的智能验证/认证框架通过使自动代理器能够自动设定游戏测试情景来解决这一需要。这个框架为像空间工程师这样的复杂3D游戏的回归测试案例的自动化提供了便利。然而,使用 iv4XR 定义测试脚本所需的技术专长可以限制开发者和测试者之间的无缝合作。本文报告了行为驱动开发(BDD) 方法如何与iv4XR 框架相结合,使空间工程师后面的工业公司能够自动进行回归测试。这一工业协作的成功激励了iv4XR 团队整合 BDD 方法,以提高实验3D游戏实验室的游戏测试自动化。此外,IV4XR 框架已经与战术编程相扩展,使空间工程师的长剧性测试情景得以自动化。这些结果突出表明了IV4XR 框架在支持多种测试方法时的多功能性,同时展示了游戏自动测试方法,并演示了BDDD用户如何进行自动测试。
Article 107
Title@2025-06-20 (5): Identifying Explanation Needs: Towards a Catalog of User-based Indicators
Title: Identifying Explanation Needs: Towards a Catalog of User-based Indicators | Erklärungsbedarf identifizieren: Auf dem Weg zu einem Katalog von benutzerbasierten Indikatoren | 查明解释需要:建立用户指标目录 2506.16997v1 |
Authors (5): Hannah Deters, Laura Reinhardt, Jakob Droste, Martin Obaidi, Kurt Schneider
In today’s digitalized world, where software systems are becoming increasingly ubiquitous and complex, the quality aspect of explainability is gaining relevance. A major challenge in achieving adequate explanations is the elicitation of individual explanation needs, as it may be subject to severe hypothetical or confirmation biases. To address these challenges, we aim to establish user-based indicators concerning user behavior or system events that can be captured at runtime to determine when a need for explanations arises. In this work, we conducted exploratory research in the form of an online study to collect self-reported indicators that could indicate a need for explanation. We compiled a catalog containing 17 relevant indicators concerning user behavior, 8 indicators concerning system events and 14 indicators concerning emotional states or physical reactions. We also analyze the relationships between these indicators and different types of need for explanation. The established indicators can be used in the elicitation process through prototypes, as well as after publication to gather requirements from already deployed applications using telemetry and usage data. Moreover, these indicators can be used to trigger explanations at appropriate moments during the runtime.
在当今的数字化世界中,软件系统越来越普遍和复杂,解释的质量问题越来越具有相关性。在充分解释方面,一个重大挑战是引起个人解释需要,因为可能存在严重的假设或确认偏差。为了应对这些挑战,我们的目标是建立用户行为或系统事件方面的用户指标,以便确定何时需要解释。在这项工作中,我们以在线研究的形式进行了探索性研究,以收集自报指标,表明需要解释。我们汇编了一份目录,其中载有17个与用户行为有关的指标,8个系统事件指标,14个有关情感状态或身体反应的指标。我们还分析了这些指标与不同类型解释需要之间的关系。既定指标可以通过原型在引出过程中使用,并在公布后,利用遥测和使用数据收集已经部署的应用软件的要求。此外,这些指标还可以用来在运行期间的适当时刻触发解释。
Article 108
Title@2025-06-20 (5): Accelerating Quantum Eigensolver Algorithms With Machine Learning
Title: Accelerating Quantum Eigensolver Algorithms With Machine Learning | Beschleunigung von Quanten Eigensolver-Algorithmen mit maschinellem Lernen | 用机器学习加速量子 Eigensolver 算法 2409.13587v2 |
Authors (5): Avner Bensoussan, Elena Chachkarova, Karine Even-Mendoza, Sophie Fortz, Connor Lenihan
In this paper, we explore accelerating Hamiltonian ground state energy calculation on NISQ devices. We suggest using search-based methods together with machine learning to accelerate quantum algorithms, exemplified in the Quantum Eigensolver use case. We trained two small models on classically mined data from systems with up to 16 qubits, using XGBoost’s Python regressor. We evaluated our preliminary approach on 20-, 24- and 28-qubit systems by optimising the Eigensolver’s hyperparameters. These models predict hyperparameter values, leading to a 0.12% reduction in error when tested on 28-qubit systems. However, due to inconclusive results with 20- and 24-qubit systems, we suggest further examination of the training data based on Hamiltonian characteristics. In future work, we plan to train machine learning models to optimise other aspects or subroutines of quantum algorithm execution beyond its hyperparameters.
在本文中,我们探索加速汉密尔顿地基NISQ装置的地面状态能源计算。我们建议使用基于搜索的方法和机器学习加速量子算法,如Quantum Eigensolver使用案例所示。我们用XGBoost的Python回溯器,对来自高达16 的系统的传统采掘数据的两个小模型进行了培训。我们通过优化Eigensolver的超参数,评估了我们对20 、24 和28 qits 系统的初步方法。这些模型预测了超参数值,在28 qubit 系统测试时导致误差减少0.12%。然而,由于20 和 24 qubit 系统没有取得结果,我们建议进一步审查基于汉密尔顿的特性的培训数据。在未来的工作中,我们计划对机器学习模型进行培训,以优化超参数以外的其他方面或量子算法执行亚路径。
Article 109
Title@2025-06-20 (5): Adversarial Reasoning for Repair Based on Inferred Program Intent
Title: Adversarial Reasoning for Repair Based on Inferred Program Intent | Adversariale Begründung für die Reparatur auf der Grundlage von abgeleiteten Programm Intent | 根据被推断的方案意图进行修复的反向理由 2505.13008v2 |
Authors (6): He Ye, Aidan Z. H. Yang, Chang Hu, Yanlin Wang, Tao Zhang, Claire Le Goues
Automated program repair (APR) has shown promising results, particularly with the use of neural networks. Currently, most APR tools focus on code transformations specified by test suites, rather than reasoning about the program intent and the high-level bug specification. Without a proper understanding of program intent, these tools tend to generate patches that overfit incomplete test suites and fail to reflect the developers’ intentions. However, reasoning about program intent is challenging. In our work, we propose an approach called AdverIntent-Agent, based on critique and adversarial reasoning. Our approach is novel in shifting the focus from generating multiple APR patches to inferring multiple potential program intents. Ideally, we aim to infer intents that are, to some extent, adversarial to each other, maximizing the probability that at least one aligns closely with the developers’ original intent. AdverIntent-Agent is a multi-agent approach consisting of three agents: a reasoning agent, a test agent, and a repair agent. First, the reasoning agent generates adversarial program intents along with the corresponding faulty statements. Next, the test agent produces adversarial test cases that align with each inferred intent, constructing oracles that use the same inputs but have different expected outputs. Finally, the repair agent uses dynamic and precise LLM prompts to generate patches that satisfy both the inferred program intent and the generated tests. AdverIntent-Agent was evaluated on two benchmarks: Defects4J 2.0 and HumanEval-Java. AdverIntent-Agent correctly repaired 77 and 105 bugs on the two benchmarks, respectively.
自动程序修理(APR) 显示了令人乐观的结果, 特别是在使用神经网络的情况下。 目前, 大部分 APR 工具都侧重于测试套件指定的代码转换, 而不是对程序意图和高级错误规格的推理。 如果对程序意图没有正确理解, 这些工具往往会产生补丁, 过分配齐不完整的测试套件, 没有反映开发者的意图。 但是, 关于程序意图的推理是具有挑战性的。 在我们的工作中, 我们提出一个叫做 Aver- Intal- Agency 的方法, 以批评和对立推理为依据。 我们的方法是新颖的, 将焦点从生成多个 APR 补丁 转向推断出多个潜在程序意图。 理想的是, 我们的目标是推断出某种程度上对立的意向, 相互对立的意向, 使至少一个与开发者原始意图一致的可能性最大化。 Aver- Intent- A 是一个多试算方法, 由三个代理商组成: 推理代理、 测试代理商、 和满足性代理商。 首先, 推理, 和推理程序产生对抗性程序与相应的对调意图意图以及相应的错误声明。 接下来, 测试过程将产生两种推算结果都使用精确测试。 最后的推算。 。 最后, 和推算过程使用两种推算出两种推算结果。
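The idea of oracles that "use the same inputs but have different expected outputs" can be made concrete with a toy sketch. Everything below is hypothetical: the buggy function, the two inferred intents, and the candidate patches are invented for illustration, and in the approach described above the intents, tests, and patches come from LLM agents rather than hand-written lambdas.

```python
# Hypothetical buggy function under repair.
def diff_buggy(a, b):
    return b + a


# Two inferred intents that disagree with each other (both are assumptions).
INTENTS = {
    "absolute difference": lambda a, b: abs(a - b),
    "signed difference": lambda a, b: a - b,
}

# The same inputs are reused for every intent.
SHARED_INPUTS = [(2, 5), (7, 3)]


def build_oracle(intent):
    """Pair every shared input with the output this intent expects."""
    return [(args, intent(*args)) for args in SHARED_INPUTS]


ORACLES = {name: build_oracle(fn) for name, fn in INTENTS.items()}


def passes(candidate, oracle):
    """True if the candidate returns the expected output on every input."""
    return all(candidate(*args) == expected for args, expected in oracle)


def matching_intents(candidate):
    """Names of the intents whose oracle the candidate satisfies."""
    return [name for name, oracle in ORACLES.items() if passes(candidate, oracle)]


CANDIDATE_PATCHES = {
    "patch_abs": lambda a, b: abs(a - b),
    "patch_signed": lambda a, b: a - b,
}

if __name__ == "__main__":
    print("buggy:", matching_intents(diff_buggy))
    for name, patch in CANDIDATE_PATCHES.items():
        print(name, "->", matching_intents(patch))
```

Because the two oracles disagree on the input `(2, 5)`, no single patch can satisfy both, so each candidate is tied to exactly one inferred intent.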
Article 110
Title@2025-06-20 (5): PinChecker: Identifying Unsound Safe Abstractions of Rust Pinning APIs
Title: PinChecker: Identifying Unsound Safe Abstractions of Rust Pinning APIs | PinChecker: Identifizieren von unschallsicheren Abstraktionen von Rust Pinning APIs | Pin checker: 识别混乱平铺API的不健全安全事件 2504.14500v2 |
Authors (2): Yuxuan Dai, Yang Feng
The pinning APIs of the Rust language guarantee memory location stability for self-referential and asynchronous constructs, as long as they are used according to the pinning API contract. Rust ensures that violations of this contract are impossible in regular safe code, but not in unsafe code where unsafe pinning APIs can be used. Library authors can encapsulate arbitrary unsafe code within regular library functions. These can be freely called in higher-level code without explicit warnings. Therefore, it is crucial to analyze library functions to rule out pinning API contract violations. Unfortunately, such testing relies on manual analysis by library authors, which is ineffective. Our goal is to develop a methodology that, given a library, attempts to construct programs that intentionally breach the pinning API contract by chaining library function calls, thereby verifying their soundness. We introduce RPIL, a novel intermediate representation that models functions’ critical behaviors pertaining to pinning APIs. We implement PinChecker, a synthesis-driven violation detection tool guided by RPIL, which automatically synthesizes bug-revealing programs. Our experiments on 13 popular Rust libraries from crates.io found 2 confirmed bugs.
Rust 语言的钉钉式 API 可以将任意的不安全代码包含在常规的图书馆功能中,这些代码可以在没有明确警告的情况下被自由调用到更高层次的代码中。 因此, 分析图书馆功能以排除钉定API合同违反情况至关重要。 不幸的是, 这种测试依靠的是图书馆作者的手工分析, 而这没有效果。 我们的目标是开发一种方法, 在一个图书馆里, 试图通过连锁图书馆功能来制造故意违反钉定API合同的程序, 从而核实其是否正确性。 我们引入了RPIL, 这是一种新型的中间代表, 模型可以运行与钉定API相关的关键行为。 我们实施了 PinCrecker, 这是一种由 RPIL 指导的合成驱动的违规检测工具, 它自动合成了窃听程序。 我们在13个流行的 Rust 图书馆进行实验, 从木箱中发现了2号错误 。
Article 111
Title@2025-06-20 (5): Quantum Optimization for Software Engineering: A Survey
Title: Quantum Optimization for Software Engineering: A Survey | Quantenoptimierung für die Software-Engineering: Eine Umfrage | 软件工程量量的优化:调查 2506.16878v1 |
Authors (4): Man Zhang, Yuechen Li, Tao Yue, Kai-Yuan Cai
Quantum computing, particularly in the area of quantum optimization, is steadily progressing toward practical applications, supported by an expanding range of hardware platforms and simulators. While Software Engineering (SE) optimization has a strong foundation, which is exemplified by the active Search-Based Software Engineering (SBSE) community and numerous classical optimization methods, the growing complexity of modern software systems and their engineering processes demands innovative solutions. This Systematic Literature Review (SLR) focuses specifically on studying the literature that applies quantum or quantum-inspired algorithms to solve classical SE optimization problems. We examine 77 primary studies selected from an initial pool of 2083 publications obtained through systematic searches of six digital databases using carefully crafted search strings. Our findings reveal concentrated research efforts in areas such as SE operations and software testing, while exposing significant gaps across other SE activities. Additionally, the SLR uncovers relevant works published outside traditional SE venues, underscoring the necessity of this comprehensive review. Overall, our study provides a broad overview of the research landscape, empowering the SBSE community to leverage quantum advancements in addressing next-generation SE challenges.
量子计算,特别是在量子优化领域,正在稳步向实际应用迈进,得到范围不断扩大的硬件平台和模拟器的支持。软件工程(SE)优化具有牢固的基础,例如活跃的基于搜索的软件工程(SBSE)群和许多传统的优化方法,现代软件系统及其工程过程日益复杂,需要创新的解决办法。系统文学评论(SLR)特别侧重于研究应用量子或量子激励的算法解决传统的SE优化问题的文献。我们研究了通过利用精心设计的搜索字符串对六个数字数据库进行系统搜索而从最初的2083年出版物库中挑选出来的77项初级研究。我们的调查结果显示,在SE业务和软件测试等领域的研究工作十分集中,同时暴露了其他SE活动之间的巨大差距。此外,SLR还发现了在传统SE场地以外出版的相关作品,强调了这一全面审查的必要性。总体而言,我们的研究对研究的全貌提供了广泛的概览,使SBSEE社区能够利用量子进步来应对下一代SE挑战。
Article 112
Title@2025-06-20 (5): Revolutionizing Validation and Verification: Explainable Testing Methodologies for Intelligent Automotive Decision-Making Systems
Title: Revolutionizing Validation and Verification: Explainable Testing Methodologies for Intelligent Automotive Decision-Making Systems | Revolutionierung der Validierung und Verifizierung: Erklärbare Prüfmethoden für intelligente Automotive-Entscheidungs-Making-Systeme | 验证与核查:智能汽车决策系统可解释的测试方法 2506.16876v1 |
Authors (2): Halit Eris, Stefan Wagner
Autonomous Driving Systems (ADS) use complex decision-making (DM) models with multimodal sensory inputs, making rigorous validation and verification (V&V) essential for safety and reliability. These models pose challenges in diagnosing failures, tracing anomalies, and maintaining transparency, with current manual testing methods being inefficient and labor-intensive. This vision paper presents a methodology that integrates explainability, transparency, and interpretability into V&V processes. We propose refining V&V requirements through literature reviews and stakeholder input, generating explainable test scenarios via large language models (LLMs), and enabling real-time validation in simulation environments. Our framework includes test oracle, explanation generation, and a test chatbot, with empirical studies planned to evaluate improvements in diagnostic efficiency and transparency. Our goal is to streamline V&V, reduce resources, and build user trust in autonomous technologies.
自主驾驶系统(ADS)使用具有多式感官投入的复杂决策模式(DM),使严格的验证和核查(V&V)对安全和可靠性至关重要,这些模式在诊断失败、追踪异常现象和保持透明度方面构成挑战,因为目前的人工测试方法效率低,劳动密集型。本愿景文件提出了一种将解释性、透明度和可解释性纳入V&V过程的方法。我们提议通过文献审查和利益攸关方投入来完善V&V要求,通过大型语言模型(LLMS)产生可解释的测试情景,并在模拟环境中促成实时验证。我们的框架包括测试或奇迹、解释生成和测试聊天器,并计划进行经验性研究来评估诊断效率和透明度的提高。我们的目标是精简V&V,减少资源,建立用户对自主技术的信任。
Article 113
Title@2025-06-20 (5): Accountability of Robust and Reliable AI-Enabled Systems: A Preliminary Study and Roadmap
Title: Accountability of Robust and Reliable AI-Enabled Systems: A Preliminary Study and Roadmap | Rechenschaftspflicht von robusten und zuverlässigen KI-fähigen Systemen: Eine Vorstudie und Roadmap | 健全和可靠的独立独立使用系统问责制:初步研究和路线图 2506.16831v1 |
Authors (3): Filippo Scaramuzza, Damian A. Tamburri, Willem-Jan van den Heuvel
This vision paper presents initial research on assessing the robustness and reliability of AI-enabled systems, and key factors in ensuring their safety and effectiveness in practical applications, including a focus on accountability. By exploring evolving definitions of these concepts and reviewing current literature, the study highlights major challenges and approaches in the field. A case study is used to illustrate real-world applications, emphasizing the need for innovative testing solutions. The incorporation of accountability is crucial for building trust and ensuring responsible AI development. The paper outlines potential future research directions and identifies existing gaps, positioning robustness, reliability, and accountability as vital areas for the development of trustworthy AI systems of the future.
本愿景文件介绍了关于评估由AI支持的系统的可靠性和可靠性的初步研究,以及确保其在实际应用中的安全和有效性的关键因素,包括注重问责制。研究通过探讨这些概念的演变定义和审查当前文献,强调了该领域的主要挑战和办法。案例研究用于说明现实世界的应用,强调需要创新的测试解决办法。纳入问责制对于建立信任和确保负责任的AI开发至关重要。文件概述了未来可能的研究方向,并确定了现有差距,定位了可靠性、可靠性和问责制,作为未来可信赖的AI系统发展的重要领域。
Article 114
Title@2025-06-20 (5): Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers
Title: Model Context Protocol (MCP) at First Glance: Studying the Security and Maintainability of MCP Servers | Model Context Protocol (MCP) auf den ersten Blick: Die Sicherheit und Nachhaltigkeit von MCP-Servern untersuchen | 《第一一一一一一一时示范背景议定书》:研究MCP服务器的安全性和可维持性 2506.13538v4 |
Authors (6): Mohammed Mehedi Hasan, Hao Li, Emad Fallahzadeh, Gopi Krishnan Rajbahadur, Bram Adams, Ahmed E. Hassan
Although Foundation Models (FMs), such as GPT-4, are increasingly used in domains like finance and software engineering, reliance on textual interfaces limits these models’ real-world interaction. To address this, FM providers introduced tool calling, triggering a proliferation of frameworks with distinct tool interfaces. In late 2024, Anthropic introduced the Model Context Protocol (MCP) to standardize this tool ecosystem, which has become the de facto standard with over eight million weekly SDK downloads. Despite its adoption, MCP’s AI-driven, non-deterministic control flow introduces new risks to sustainability, security, and maintainability, warranting closer examination. Towards this end, we present the first large-scale empirical study of MCP servers. Using state-of-the-art health metrics and a hybrid analysis pipeline, combining a general-purpose static analysis tool with an MCP-specific scanner, we evaluate 1,899 open-source MCP servers to assess their health, security, and maintainability. Despite MCP servers demonstrating strong health metrics, we identify eight distinct vulnerabilities - only three overlapping with traditional software vulnerabilities. Additionally, 7.2% of servers contain general vulnerabilities and 5.5% exhibit MCP-specific tool poisoning. Regarding maintainability, while 66% exhibit code smells, 14.4% contain nine bug patterns overlapping with traditional open-source software projects. These findings highlight the need for MCP-specific vulnerability detection techniques while reaffirming the value of traditional analysis and refactoring practices.
尽管诸如GPT-4等基础模型(FMs)越来越多地用于金融和软件工程等领域,但依赖文本界面限制了这些模型的实际世界互动。为了解决这个问题,调频供应商采用了调频供应商采用工具,催生了不同工具界面框架的激增。2024年后期,人类学采用了模型背景协议(MCP),使这一工具生态系统标准化,该模型已经成为事实上的标准,每周下载800多万SDK。尽管采用了该模型,但MCP的AI驱动、非确定性控制流程给可持续性、安全性和可维持性带来了新的风险,需要更仔细地检查。为此,我们提出了首次大规模的经验性研究MCP服务器,使用最先进的健康指标和混合分析管道,将通用静态分析工具与MCP特定扫描器相结合,我们评估了1,899台开放源的MCP服务器,以评估其具体健康、安全性和可维持性。尽管MCP服务器显示强健的健康度,但我们发现了8个明显的脆弱性,只有3个与传统软件脆弱性的重叠之处。为此,我们提出了对MCP服务器的7.2%的大规模实验性分析,同时,同时,同时也展示了常规系统检测力分析了9项(MC%)的模型的弹性分析。
Article 115
Title@2025-06-19 (4): LLMs in Coding and their Impact on the Commercial Software Engineering Landscape
Title: LLMs in Coding and their Impact on the Commercial Software Engineering Landscape | LLMs in Coding und ihre Auswirkungen auf die kommerzielle Software-Engineering-Landschaft | 编码及其对商业软件工程景观的影响 2506.16653v1 |
Authors (3): Vladislav Belozerov, Peter J Barclay, Askhan Sami
Large-language-model coding tools are now mainstream in software engineering. But as these same tools move human effort up the development stack, they present fresh dangers: 10% of real prompts leak private data, 42% of generated snippets hide security flaws, and the models can even “agree” with wrong ideas, a trait called sycophancy. We argue that firms must tag and review every AI-generated line of code, keep prompts and outputs inside private or on-premises deployments, obey emerging safety regulations, and add tests that catch sycophantic answers, so they can gain speed without losing security and accuracy.
大型语言模型编码工具现已成为软件工程的主流。但当这些工具将人类的努力推向开发堆积时,它们带来了新的危险:10%的真快泄漏私人数据,42%生成的片段掩盖了安全缺陷,模型甚至可以用错误的想法“同意’ ” , 一种称为“偏执”的特征。我们争辩说,公司必须标记和审查每一个AI生成的代码线,将提示和产出保留在私人内部或者在安全部署中,遵守新出现的安全条例,并添加能够捕捉类同答案的测试,这样它们就可以在不失去安全和准确性的情况下获得速度。
Article 116
Title@2025-06-19 (4): CodeDiffuser: Attention-Enhanced Diffusion Policy via VLM-Generated Code for Instruction Ambiguity
Title: CodeDiffuser: Attention-Enhanced Diffusion Policy via VLM-Generated Code for Instruction Ambiguity | CodeDiffuser: Aufmerksamkeitsverstärkte Diffusionspolitik über VLM-generierten Code für Instruction Ambiguity | 代码用户:通过VLM - 教育结构设计守则加强关注 - 强化传播政策 2506.16652v1 |
Authors (9): Guang Yin, Yitong Li, Yixuan Wang, Dale McConachie, Paarth Shah, Kunimatsu Hashimoto, Huan Zhang, Katherine Liu, Yunzhu Li
Natural language instructions for robotic manipulation tasks often exhibit ambiguity and vagueness. For instance, the instruction “Hang a mug on the mug tree” may involve multiple valid actions if there are several mugs and branches to choose from. Existing language-conditioned policies typically rely on end-to-end models that jointly handle high-level semantic understanding and low-level action generation, which can result in suboptimal performance due to their lack of modularity and interpretability. To address these challenges, we introduce a novel robotic manipulation framework that can accomplish tasks specified by potentially ambiguous natural language. This framework employs a Vision-Language Model (VLM) to interpret abstract concepts in natural language instructions and generates task-specific code - an interpretable and executable intermediate representation. The generated code interfaces with the perception module to produce 3D attention maps that highlight task-relevant regions by integrating spatial and semantic information, effectively resolving ambiguities in instructions. Through extensive experiments, we identify key limitations of current imitation learning methods, such as poor adaptation to language and environmental variations. We show that our approach excels across challenging manipulation tasks involving language ambiguity, contact-rich manipulation, and multi-object interactions.
用于机器人操作任务的自然语言指令往往含混不清和模糊不清。 例如,“在树杯树上挂一个杯子”指令可能涉及多种有效行动,如果需要从其中选择几个杯子和分支的话。 现有的有语言条件的政策通常依赖端对端模型,这些模型共同处理高层次语义理解和低层次行动生成,这可能导致由于缺乏模块性和可解释性而表现欠佳。 为了应对这些挑战,我们引入了一个新颖的机器人操作框架,能够完成可能模糊的自然语言规定的任务。 这个框架使用愿景语言模型(VLM)来解释自然语言指令中的抽象概念,并生成特定任务的代码 — 一个可解释和可执行的中间代表。 生成的代码界面与感知模块生成了3D关注地图,通过整合空间和语义信息,有效解决指令中的模糊问题,突出任务相关区域。 我们通过广泛的实验,确定了当前模仿学习方法的关键局限性,例如语言适应性不强和环境差异。 我们展示了我们的方法在语言模糊性、接触性操纵性、多面互动性之间挑战操纵任务。
Article 117
Title@2025-06-19 (4): SemAgent: A Semantics Aware Program Repair Agent
Title: SemAgent: A Semantics Aware Program Repair Agent | SemAgent: Ein Semantik-Bewusst-Programm-Reparatur-Agent | SemAgenger: 语义学意识方案维修代理 2506.16650v1 |
Authors (4): Anvith Pabba, Alex Mathai, Anindya Chakraborty, Baishakhi Ray
Large Language Models (LLMs) have shown impressive capabilities in downstream software engineering tasks such as Automated Program Repair (APR). In particular, there has been a lot of research on repository-level issue-resolution benchmarks such as SWE-Bench. Although there has been significant progress on this topic, we notice that in the process of solving such issues, existing agentic systems tend to hyper-localize on immediately suspicious lines of code and fix them in isolation, without a deeper understanding of the issue semantics, code semantics, or execution semantics. Consequently, many existing systems generate patches that overfit to the user issue, even when a more general fix is preferable. To address this limitation, we introduce SemAgent, a novel workflow-based procedure that leverages issue, code, and execution semantics to generate patches that are complete - identifying and fixing all lines relevant to the issue. We achieve this through a novel pipeline that (a) leverages execution semantics to retrieve relevant context, (b) comprehends issue-semantics via generalized abstraction, (c) isolates code-semantics within the context of this abstraction, and (d) leverages this understanding in a two-stage architecture: a repair stage that proposes fine-grained fixes, followed by a reviewer stage that filters relevant fixes based on the inferred issue-semantics. Our evaluations show that our methodology achieves a solve rate of 44.66% on the SWEBench-Lite benchmark, beating all other workflow-based approaches, and an absolute improvement of 7.66% compared to our baseline, which lacks such deep semantic understanding. We note that our approach performs particularly well on issues requiring multi-line reasoning (and editing) and edge-case handling, suggesting that incorporating issue and code semantics into APR pipelines can lead to robust and semantically consistent repairs.
大型语言模型(LLMS) 显示下游软件工程任务(如自动程序修理(APR) ) 的能力令人印象深刻。 特别是,对SWE- Bench等存储器级问题解答基准(如SWE- Bench)进行了大量研究。 虽然在这一专题上取得了显著进展,但我们注意到,在解决这些问题的过程中,现有代理系统倾向于在直接可疑的代码线上超本地化,并孤立地修正它们,而没有更深入地理解问题语义学、代码语义学或执行语义学等。 因此,许多现有系统产生了与用户问题格格不入的补丁,甚至更一般的修补。 为解决这一限制,我们引入了SemAgent这个基于工作流程的新程序,即利用基于工作流程的新的流程来生成完整补丁。 我们通过一个新的管道来做到这一点:(a) 利用执行语义语义学来检索相关背景, (b) 通过通用的抽象的抽象的抽象的抽象的抽象的抽象化,理解, (c) 将问题- 改进处理- 改进的处理方法, (c) 将精化的精化的精化的精化的精化的精化的精化的精化, 显示, 显示我们这个阶段的精化的精化的精化的精化的精化的精化的精化的精化的精化的精化的精化的精制, 显示的精制的精制, 显示的精制, 。
Article 118
Title@2025-06-19 (4): LLM-based Satisfiability Checking of String Requirements by Consistent Data and Checker Generation
Title: LLM-based Satisfiability Checking of String Requirements by Consistent Data and Checker Generation | LLM-basierte Zufriedenheitsprüfung von String-Anforderungen durch konsistente Daten- und Checker-Generierung | 以LLM为基础的LLM按统一数据和生成核对器对字符串要求的兼容性核对 2506.16639v1 |
Authors (5): Boqi Chen, Aren A. Babikian, Shuzhao Feng, Dániel Varró, Gunter Mussbacher
Requirements over strings, commonly represented using natural language (NL), are particularly relevant for software systems due to their heavy reliance on string data manipulation. While individual requirements can usually be analyzed manually, verifying properties (e.g., satisfiability) over sets of NL requirements is particularly challenging. Formal approaches (e.g., SMT solvers) may efficiently verify such properties, but are known to have theoretical limitations. Additionally, the translation of NL requirements into formal constraints typically requires significant manual effort. Recently, large language models (LLMs) have emerged as an alternative approach for formal reasoning tasks, but their effectiveness in verifying requirements over strings is less studied. In this paper, we introduce a hybrid approach that verifies the satisfiability of NL requirements over strings by using LLMs (1) to derive a satisfiability outcome (and a consistent string, if possible), and (2) to generate declarative (i.e., SMT) and imperative (i.e., Python) checkers, used to validate the correctness of (1). In our experiments, we assess the performance of four LLMs. Results show that LLMs effectively translate natural language into checkers, even achieving perfect testing accuracy for Python-based checkers. These checkers substantially help LLMs in generating a consistent string and accurately identifying unsatisfiable requirements, leading to a more than doubled generation success rate and F1-score in certain cases compared to baselines without generated checkers.
通常使用自然语言(NL)代表的对字符串的要求,对于软件系统特别具有相关性,因为它们高度依赖字符串数据操纵。虽然个人要求通常可以人工分析,但核实非字符串要求的属性(如可对称性)尤其具有挑战性。正式方法(如SMT解答器)可以有效核实这些属性,但已知存在理论限制。此外,将非语言要求转换成正式限制通常需要大量手工努力。最近,大型语言模型(LLLMs)作为正式推理任务的替代方法出现,但它们在核实对字符串要求方面的有效性研究较少。在本文件中,我们采用混合方法,通过使用LMS(1)来核实NL要求相对于字符串要求的可对称性(如SMT解答器)进行核实,从而产生一种可兼容性结果(如有可能,则具有一致的字符串),以及产生宣讲性(如SM)和必要性(即Python)核对器,用来验证正式推理(1)。我们在实验中,我们评估四个LMSMS的性业绩显示,这些LMS-在不精确性核对中,甚至能地将精确地核对要求转换为不精确地进行。
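To make the notion of an imperative (Python) checker from the abstract above concrete, here is a minimal sketch. The three requirements and the helper names `violated` and `is_consistent` are hypothetical; in the described approach the checkers are generated by an LLM from the NL requirements, whereas these predicates are written by hand for illustration.

```python
import re

# Hypothetical NL requirements, each paired with a hand-written predicate.
REQUIREMENTS = [
    ("starts with the prefix 'ID-'", lambda s: s.startswith("ID-")),
    ("is between 6 and 10 characters long", lambda s: 6 <= len(s) <= 10),
    ("has only digits after the prefix",
     lambda s: re.fullmatch(r"ID-\d+", s) is not None),
]


def violated(candidate):
    """Return the text of every requirement the candidate string breaks."""
    return [text for text, predicate in REQUIREMENTS if not predicate(candidate)]


def is_consistent(candidate):
    """True if the candidate satisfies all requirements at once."""
    return not violated(candidate)


if __name__ == "__main__":
    for candidate in ["ID-12345", "ID-12", "XX-12345"]:
        print(candidate, "->", violated(candidate) or "consistent")
```

A checker of this kind can confirm that a proposed string is consistent, but it cannot prove on its own that a requirement set is unsatisfiable; the abstract assigns that role to the declarative (SMT) checker and the LLM's satisfiability outcome.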
Article 119
Title@2025-06-19 (4): Safety Interventions against Adversarial Patches in an Open-Source Driver Assistance System
Title: Safety Interventions against Adversarial Patches in an Open-Source Driver Assistance System | Sicherheitsinterventionen gegen störende Patches in einem Open-Source Fahrerassistenzsystem | 在开放源码的司机协助系统中针对对面补丁采取安全干预措施 2504.18990v2 |
Authors (7): Cheng Chen, Grant Xiao, Daehyun Lee, Lishan Yang, Evgenia Smirni, Homa Alemzadeh, Xugui Zhou
Drivers are becoming increasingly reliant on advanced driver assistance systems (ADAS) as autonomous driving technology becomes more popular and developed with advanced safety features to enhance road safety. However, the increasing complexity of the ADAS makes autonomous vehicles (AVs) more exposed to attacks and accidental faults. In this paper, we evaluate the resilience of a widely used ADAS against safety-critical attacks that target perception inputs. Various safety mechanisms are simulated to assess their impact on mitigating attacks and enhancing ADAS resilience. Experimental results highlight the importance of timely intervention by human drivers and automated safety mechanisms in preventing accidents in both driving and lateral directions and the need to resolve conflicts among safety interventions to enhance system resilience and reliability.
随着自主驾驶技术越来越受欢迎,并开发具有先进安全特征的自动驾驶技术,以加强道路安全,驾驶员越来越依赖先进的驾驶协助系统(ADAS);然而,由于ADAS的日益复杂,自动驾驶器更易受到攻击和意外故障的影响;在本文件中,我们评估了广泛使用的ADAS对以感知投入为目标的安全临界攻击的抗御能力;模拟了各种安全机制,以评估其对减轻攻击和加强ADAS的抗御能力的影响;实验结果突出表明,人驾驶器和自动安全机制在防止驾驶和横向方向的事故方面及时干预的重要性,以及解决安全干预措施之间的冲突以提高系统复原力和可靠性的必要性。
Article 120
Title@2025-06-19 (4): AI-Driven Tools in Modern Software Quality Assurance: An Assessment of Benefits, Challenges, and Future Directions
Title: AI-Driven Tools in Modern Software Quality Assurance: An Assessment of Benefits, Challenges, and Future Directions | KI-getriebene Werkzeuge in der modernen Software-Qualitätssicherung: Eine Bewertung von Vorteilen, Herausforderungen und Zukunftsrichtungen | 《现代软件质量保证方面的AI驱动工具:效益、挑战和今后方向评估》 2506.16586v1 |
Authors (3): Ihor Pysmennyi, Roman Kyslyi, Kyrylo Kleshch
Traditional quality assurance (QA) methods face significant challenges in addressing the complexity, scale, and rapid iteration cycles of modern software systems and are strained by limited resources, leading to substantial costs associated with poor quality. The object of this research is the quality assurance processes for modern distributed software applications. The subject of the research is the assessment of the benefits, challenges, and prospects of integrating modern AI-oriented tools into quality assurance processes. We performed a comprehensive analysis of the implications for both verification and validation processes, covering exploratory test analyses, equivalence partitioning and boundary analyses, metamorphic testing, finding inconsistencies in acceptance criteria (AC), static analyses, test case generation, unit test generation, test suite optimization and assessment, and end-to-end scenario execution. An end-to-end regression of a sample enterprise application, using AI agents over generated test scenarios, was implemented as a proof of concept highlighting the practical use of the study. The results, with only 8.3% flaky executions of generated test cases, indicate significant potential for the proposed approaches. However, the study also identified substantial challenges for practical adoption: the generation of semantically identical coverage, the “black box” nature and lack of explainability of state-of-the-art Large Language Models (LLMs), and the tendency to correct mutated test cases to match expected results, underscoring the necessity for thorough verification of both generated artifacts and test execution results. The research demonstrates AI’s transformative potential for QA but highlights the importance of a strategic approach to implementing these technologies, considering the identified limitations and the need for developing appropriate verification methodologies.
传统质量保证(QA)方法在应对现代软件系统的复杂性、规模和快速迭代周期方面面临重大挑战,并受限于有限的资源,导致因质量低下而产生大量成本。本研究的对象是现代分布式软件应用的质量保证过程,研究的主题是评估将现代面向AI的工具集成到质量保证过程中的益处、挑战和前景。我们全面分析了其对验证与确认过程的影响,涵盖探索性测试分析、等价类划分与边界分析、蜕变测试、发现验收标准(AC)中的不一致、静态分析、测试用例生成、单元测试生成、测试套件优化与评估,以及端到端场景执行。作为概念验证,我们在生成的测试场景上利用AI代理对一个示例企业应用实现了端到端回归测试,以展示本研究的实际用途。结果显示,生成的测试用例中仅有8.3%的执行不稳定(flaky),表明所提方法具有显著潜力。然而,研究也发现了实际采用中的重大挑战:生成语义相同的覆盖、最先进的大型语言模型(LLM)的“黑盒”性质和缺乏可解释性,以及倾向于修改变异后的测试用例以匹配预期结果,这凸显了对生成的工件和测试执行结果进行彻底验证的必要性。研究展示了AI对QA的变革潜力,同时强调在实施这些技术时需要采取战略性方法,考虑已识别的局限以及制定适当验证方法的需要。
Article 121
Title@2025-06-19 (4): Scaling GR(1) Synthesis via a Compositional Framework for LTL Discrete Event Control
Title: Scaling GR(1) Synthesis via a Compositional Framework for LTL Discrete Event Control | Scaling GR(1) Synthese über ein kompositorisches Framework für LTL Discrete Event Control | GR(1) 通过立特分解事件控制的组成框架合成 2506.16557v1 |
Authors (3): Hernán Gagliardi, Victor Braberman, Sebastian Uchitel
We present a compositional approach to the synthesis of discrete event system controllers with linear temporal logic (LTL) goals. We exploit the modular structure of the plant to be controlled, given as a set of labelled transition systems (LTS), to mitigate the state explosion that monolithic approaches to synthesis are prone to. Maximally permissive safe controllers are iteratively built for subsets of the plant LTSs by solving weaker control problems. Observational synthesis equivalence is used to reduce the size of the controlled subset of the plant by abstracting away local events. The result of synthesis is also compositional: a set of controllers that, when run in parallel, ensure the LTL goal. We implement synthesis in the MTSA tool for an expressive subset of LTL, GR(1), and show that it computes solutions to problems that can be up to 1000 times larger than those that the monolithic approach can solve.
我们提出了一个组合法,用于控制离散事件系统控制器与线性时间逻辑(LTL)目标的组合合成。我们利用作为一组标签过渡系统(LTS)而加以控制的工厂模块结构来控制该工厂的模块结构,以缓解合成单一体方法容易发生的状况爆炸。最大允许的安全控制器通过解决较弱的控制问题,为工厂LTS子子组迭接地建造。通过抽取局部事件,观测合成等值用来缩小该工厂受控子组的大小。合成的结果也是组合式的,一组控制器同时运行,确保LTL目标。我们在MTA工具中对LTL、GR(1)的直观子组实施合成,并显示该组合的计算方法比单一体方法所能解决的多1000倍。
Article 122
Title@2025-06-19 (4): ChatDBG: Augmenting Debugging with Large Language Models
Title: ChatDBG: Augmenting Debugging with Large Language Models | ChatDBG: Augmenting Debugging mit großen Sprachmodellen | 聊天DBG: 使用大语言模式加强调试 2403.16354v5 |
Authors (4): Kyla H. Levin, Nicolas van Kempen, Emery D. Berger, Stephen N. Freund
Debugging is a critical but challenging task for programmers. This paper proposes ChatDBG, an AI-powered debugging assistant. ChatDBG integrates large language models (LLMs) to significantly enhance the capabilities and user-friendliness of conventional debuggers. ChatDBG lets programmers engage in a collaborative dialogue with the debugger, allowing them to pose complex questions about program state, perform root cause analysis for crashes or assertion failures, and explore open-ended queries like “why is x null?”. To handle these queries, ChatDBG grants the LLM autonomy to “take the wheel”: it can act as an independent agent capable of querying and controlling the debugger to navigate through stacks and inspect program state. It then reports its findings and yields back control to the programmer. By leveraging the real-world knowledge embedded in LLMs, ChatDBG can diagnose issues identifiable only through the use of domain-specific reasoning. Our ChatDBG prototype integrates with standard debuggers including LLDB and GDB for native code and Pdb for Python. Our evaluation across a diverse set of code, including C/C++ code with known bugs and a suite of Python code including standalone scripts and Jupyter notebooks, demonstrates that ChatDBG can successfully analyze root causes, explain bugs, and generate accurate fixes for a wide range of real-world errors. For the Python programs, a single query led to an actionable bug fix 67% of the time; one additional follow-up query increased the success rate to 85%. ChatDBG has seen rapid uptake; it has already been downloaded more than 75,000 times.
调试对程序员来说是一项关键但富有挑战性的任务。本文提出ChatDBG,一个由AI驱动的调试助手。ChatDBG集成大型语言模型(LLM),显著增强传统调试器的能力和易用性。ChatDBG让程序员与调试器进行协作式对话,可以就程序状态提出复杂问题,对崩溃或断言失败进行根因分析,并探索诸如“为什么x为空?”之类的开放式问题。为处理这些查询,ChatDBG赋予LLM“接管方向盘”的自主权:它可以作为独立代理查询并控制调试器,在调用栈中导航并检查程序状态,随后报告其发现并将控制权交还给程序员。借助LLM中蕴含的真实世界知识,ChatDBG能够诊断只有通过领域特定推理才能识别的问题。我们的ChatDBG原型与标准调试器集成,包括用于原生代码的LLDB和GDB以及用于Python的Pdb。我们在多样化的代码集上进行了评估,包括带有已知缺陷的C/C++代码以及一组Python代码(含独立脚本和Jupyter笔记本),结果表明ChatDBG能够成功分析根因、解释缺陷,并为各种真实世界错误生成准确的修复。对于Python程序,单次查询在67%的情况下得到可操作的缺陷修复;再追加一次后续查询可将成功率提高到85%。ChatDBG已被迅速采用,下载量已超过75,000次。
Article 123
Title@2025-06-19 (4): SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development
Title: SWE-Dev: Evaluating and Training Autonomous Feature-Driven Software Development | SWE-Dev: Bewertung und Schulung autonomer Feature-getriebener Software-Entwicklung | SWE-Dev: 评估和培训自主开发地物-驱动软件开发 2505.16975v2 |
Authors (9): Yaxin Du, Yuzhu Cai, Yifan Zhou, Cheng Wang, Yu Qian, Xianghe Pang, Qian Liu, Yue Hu, Siheng Chen
Large Language Models (LLMs) have shown strong capability in diverse software engineering tasks, e.g., code completion, bug fixing, and document generation. However, feature-driven development (FDD), a highly prevalent real-world task that involves developing new functionalities for large, existing codebases, remains underexplored. We therefore introduce SWE-Dev, the first large-scale dataset (with 14,000 training and 500 test samples) designed to evaluate and train autonomous coding systems on real-world feature development tasks. To ensure verifiable and diverse training, SWE-Dev uniquely provides all instances with a runnable environment and its developer-authored executable unit tests. This collection not only provides high-quality data for Supervised Fine-Tuning (SFT), but also enables Reinforcement Learning (RL) by delivering accurate reward signals from executable unit tests. Our extensive evaluations on SWE-Dev, covering 17 chatbot LLMs, 10 reasoning models, and 10 Multi-Agent Systems (MAS), reveal that FDD is a profoundly challenging frontier for current AI (e.g., Claude-3.7-Sonnet achieves only 22.45% Pass@3 on the hard test split). Crucially, we demonstrate that SWE-Dev serves as an effective platform for model improvement: fine-tuning on the training set enabled a 7B model to perform comparably to GPT-4o on the hard split, underscoring the value of its high-quality training data. Code is available at https://github.com/DorothyDUUU/SWE-Dev.
大型语言模型(LLM)在多种软件工程任务中表现出强大的能力,例如代码补全、缺陷修复和文档生成。然而,特性驱动开发(FDD)这一非常普遍的现实任务,即为大型现有代码库开发新功能,仍未得到充分研究。为此我们提出SWE-Dev,这是第一个大规模数据集(含14,000个训练样本和500个测试样本),用于在真实的特性开发任务上评估和训练自主编码系统。为确保训练可验证且多样化,SWE-Dev为所有实例提供可运行的环境及由开发者编写的可执行单元测试。该数据集不仅为监督微调(SFT)提供高质量数据,还能通过可执行单元测试提供准确的奖励信号,从而支持强化学习(RL)。我们在SWE-Dev上进行了广泛评估,涵盖17个聊天LLM、10个推理模型和10个多智能体系统(MAS),结果表明FDD对当前AI而言是极具挑战性的前沿问题(例如,Claude-3.7-Sonnet在困难测试集上的Pass@3仅为22.45%)。关键的是,我们证明SWE-Dev是模型改进的有效平台:在训练集上微调后,一个7B模型在困难测试集上可与GPT-4o相当,凸显了其高质量训练数据的价值。代码见 https://github.com/DorothyDUUU/SWE-Dev。
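The abstract reports Pass@3 but does not say how it is computed. As a minimal sketch, assuming the widely used unbiased pass@k estimator (1 - C(n-c, k) / C(n, k) for n samples of which c pass), the metric for one task could be computed as follows. The function name `pass_at_k` and the sample counts are illustrative, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of generated samples, c: samples that pass all tests,
    k: number of samples drawn. Assumed estimator; SWE-Dev may differ.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples fail,
    # so the estimate is then exactly 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 4 samples, 1 passing: chance that a random pair contains the pass.
    print(pass_at_k(n=4, c=1, k=2))
```

A benchmark-level score would then be the mean of this value over all tasks.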
Article 124
Title@2025-06-19 (4): Teaching Complex Systems based on Microservices
Title: Teaching Complex Systems based on Microservices | Teaching Complex Systems auf Basis von Microservices | 以微观服务为基础的教学复杂系统 2506.16492v1 |
Authors (4): Renato Cordeiro Ferreira, Thatiane de Oliveira Rosa, Alfredo Goldman, Eduardo Guerra
Developing complex systems using microservices is a current challenge. In this paper, we present our experience with teaching this subject to more than 80 students at the University of São Paulo (USP), fostering teamwork and simulating the industry’s environment. We show it is possible to teach such advanced concepts to senior undergraduate students of Computer Science and related fields.
利用微观服务开发复杂系统是当前的挑战,在本文中,我们向圣保罗大学80多名学生介绍了我们教授这一科目的经验,促进了团队工作,模拟了该行业的环境,我们证明有可能为计算机科学和相关领域的高级本科生教授这种先进概念。
Article 125
Title@2025-06-19 (4): AlphaTrans: A Neuro-Symbolic Compositional Approach for Repository-Level Code Translation and Validation
Title: AlphaTrans: A Neuro-Symbolic Compositional Approach for Repository-Level Code Translation and Validation | AlphaTrans: Ein neuro-symbolischer Kompositionsansatz für Repository-Level-Code-Übersetzung und Validierung | AlphaTrans: 存储层代码翻译和校验的神经-交元组合法 2410.24117v5 |
Authors (7): Ali Reza Ibrahimzada, Kaiyao Ke, Mrigank Pawagi, Muhammad Salman Abid, Rangeet Pan, Saurabh Sinha, Reyhaneh Jabbarvand
Code translation transforms programs from one programming language (PL) to another. Several rule-based transpilers have been designed to automate code translation between different pairs of PLs. However, the rules can become obsolete as the PLs evolve and cannot generalize to other PLs. Recent studies have explored the automation of code translation using Large Language Models (LLMs). One key observation is that such techniques may work well for crafted benchmarks but fail to generalize to the scale and complexity of real-world projects with dependencies, custom types, PL-specific features, etc. We propose AlphaTrans, a neuro-symbolic approach to automate repository-level code translation. AlphaTrans translates both source and test code, and employs multiple levels of validation to ensure the translation preserves the functionality of the source program. To break down the problem for LLMs, AlphaTrans leverages program analysis to decompose the program into fragments and translates them in the reverse call order. We leveraged AlphaTrans to translate ten real-world open-source projects consisting of <836, 8575, 2719> classes, methods, and tests. AlphaTrans breaks down these projects into 17874 fragments and translates the entire repository. 96.40% of the translated fragments are syntactically correct, and AlphaTrans validates the translations’ runtime behavior and functional correctness for 27.03% and 25.14% of fragments. On average, the integrated translation and validation take 34 hours to translate a project, showing its scalability in practice. For the incorrect translations, AlphaTrans generates a report including existing translation, stack trace, test errors, or assertion failures. We provided these artifacts to two developers to fix the translation bugs in four projects. They were able to fix the issues in 20.1 hours on average and achieve all passing tests.
代码翻译将程序从一种编程语言(PL)转换为另一种编程语言。已有若干基于规则的转译器用于在不同语言对之间自动进行代码翻译,但随着语言的演进,这些规则可能过时,且无法推广到其他语言。近期研究探索了利用大型语言模型(LLM)实现代码翻译自动化。一个关键观察是,此类技术在精心构造的基准上可能效果良好,却无法推广到具有依赖项、自定义类型、语言特有特性等的真实项目的规模和复杂度。我们提出AlphaTrans,一种用于自动化仓库级代码翻译的神经符号方法。AlphaTrans同时翻译源代码和测试代码,并采用多级验证以确保翻译保留源程序的功能。为了替LLM分解问题,AlphaTrans利用程序分析将程序分解为片段,并按逆调用顺序翻译。我们用AlphaTrans翻译了十个真实的开源项目,共包含<836, 8575, 2719>个类、方法和测试。AlphaTrans将这些项目分解为17874个片段并翻译了整个仓库。翻译后的片段中有96.40%语法正确,AlphaTrans分别验证了27.03%和25.14%的片段的运行时行为和功能正确性。翻译和验证合计平均每个项目耗时34小时,显示了其在实践中的可扩展性。对于不正确的翻译,AlphaTrans会生成报告,包含现有翻译、堆栈跟踪、测试错误或断言失败。我们将这些工件提供给两位开发者,用于修复四个项目中的翻译缺陷;他们平均用20.1小时修复了问题,并使所有测试通过。
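The "reverse call order" mentioned in the abstract can be read as translating callees before their callers. The sketch below shows one way to obtain such an order with a topological sort; the call graph and the function name `reverse_call_order` are hypothetical, and this is not AlphaTrans's actual implementation.

```python
from graphlib import TopologicalSorter


def reverse_call_order(call_graph: dict[str, set[str]]) -> list[str]:
    """Order fragments so that every callee precedes its callers.

    `call_graph` maps each function to the set of functions it calls.
    TopologicalSorter treats the mapped values as predecessors, so
    callees are emitted first.
    """
    return list(TopologicalSorter(call_graph).static_order())


if __name__ == "__main__":
    # Hypothetical call graph, assumed for illustration only.
    graph = {
        "main": {"parse", "run"},
        "parse": {"helper"},
        "run": {"helper"},
        "helper": set(),
    }
    print(reverse_call_order(graph))
```

Recursive or mutually recursive functions form cycles, for which `static_order()` raises `graphlib.CycleError`; a real tool would need to group such functions before ordering them.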
Article 126
Title@2025-06-19 (4): Understanding the Challenges and Promises of Developing Generative AI Apps: An Empirical Study
Title: Understanding the Challenges and Promises of Developing Generative AI Apps: An Empirical Study | Die Herausforderungen und Versprechen der Entwicklung generativer KI-Apps verstehen: Eine empirische Studie | 了解 “ 开发创新的AI Apps:经验研究 “ 的挑战和前景 2506.16453v1 |
Authors (3): Buthayna AlMulla, Maram Assi, Safwat Hassan
The release of ChatGPT in 2022 triggered a rapid surge in generative artificial intelligence mobile apps (i.e., Gen-AI apps). Despite widespread adoption, little is known about how end users perceive and evaluate these Gen-AI functionalities in practice. In this work, we conduct a user-centered analysis of 676,066 reviews from 173 Gen-AI apps on the Google Play Store. We introduce a four-phase methodology, SARA (Selection, Acquisition, Refinement, and Analysis), that enables the systematic extraction of user insights using prompt-based LLM techniques. First, we demonstrate the reliability of LLMs in topic extraction, achieving 91% accuracy through five-shot prompting and non-informative review filtering. Then, we apply this method to the informative reviews, identify the top 10 user-discussed topics (e.g., AI Performance, Content Quality, and Content Policy & Censorship) and analyze the key challenges and emerging opportunities. Finally, we examine how these topics evolve over time, offering insight into shifting user expectations and engagement patterns with Gen-AI apps. Based on our findings and observations, we present actionable implications for developers and researchers.
2022年公布ChattGPT后,基因化人工智能移动应用软件(即Gen-AI Apps)迅速激增。尽管广泛采用,但对于终端用户如何看待和评价Gen-AI的功能却知之甚少。在这项工作中,我们对Google Play Store上的173 Gen-AI应用软件进行了676 066次以用户为中心的分析,对Google Play Store上的173 Gen-AI应用软件进行了676 066次审查。我们采用了四阶段方法SARA(选择、获取、精炼和分析),以便利用基于迅速的LLM技术系统提取用户的洞见。首先,我们展示了专题提取中的LLMs的可靠性,通过五发即时即时和非信息化的审查过滤实现91%的准确性。然后,我们将这一方法应用于信息化的审查,确定十大用户讨论的议题(例如AI性能、内容质量和内容政策与检查),并分析主要挑战和新出现的机会。我们研究了这些专题如何随着时间而演变,我们深入了解用户对Gen-AI应用软件的预期和参与模式。我们根据调查结果和观察了各种影响和研究。
Article 127
Title@2025-06-19 (4): Thermal Modeling and Optimal Allocation of Avionics Safety-critical Tasks on Heterogeneous MPSoCs
Title: Thermal Modeling and Optimal Allocation of Avionics Safety-critical Tasks on Heterogeneous MPSoCs | Thermische Modellierung und optimale Allokation von Avionik Sicherheitskritische Aufgaben auf heterogenen MPSoCs | 热建模和最佳分配航空气象安全关键任务 2505.22214v2 |
Authors (5): Ondřej Benedikt, Michal Sojka, Přemysl Šůcha, Pavel Zaykov, Zdeněk Hanzálek
Multi-Processor Systems-on-Chip (MPSoC) can deliver high performance needed in many industrial domains, including aerospace. However, their high power consumption, combined with avionics safety standards, brings new thermal management challenges. This paper investigates techniques for offline thermal-aware allocation of periodic tasks on heterogeneous MPSoCs running at a fixed clock frequency, as required in avionics. The goal is to find the assignment of tasks to (i) cores and (ii) temporal isolation windows while minimizing the MPSoC temperature. To achieve that, we propose and analyze three power models, and integrate them within several novel optimization approaches based on heuristics, a black-box optimizer, and Integer Linear Programming (ILP). We perform the experimental evaluation on three popular MPSoC platforms (NXP i.MX8QM MEK, NXP i.MX8QM Ixora, NVIDIA TX2) and observe a difference of up to 5.5°C among the tested methods (corresponding to a 22% reduction w.r.t. the ambient temperature). We also show that our method, integrating the empirical power model with the ILP, outperforms the other methods on all tested platforms.
多处理器片上系统(MPSoC)能够提供包括航空航天在内的许多工业领域所需的高性能。然而,其高功耗加上航空电子安全标准,带来了新的热管理挑战。本文研究在以固定时钟频率运行(航空电子领域的要求)的异构MPSoC上,对周期性任务进行离线热感知分配的技术。目标是在最小化MPSoC温度的同时,找到任务到(i)核心和(ii)时间隔离窗口的分配。为此,我们提出并分析了三种功耗模型,并将其集成到基于启发式方法、黑盒优化器和整数线性规划(ILP)的若干新优化方法中。我们在三个常用MPSoC平台(NXP i.MX8QM MEK、NXP i.MX8QM Ixora、NVIDIA TX2)上进行了实验评估,观察到被测方法之间的温差最高达5.5°C(相对于环境温度降低22%)。我们还表明,将经验功耗模型与ILP相结合的方法在所有被测平台上均优于其他方法。
Article 128
Title@2025-06-19 (4): Evaluating the Use of LLMs for Documentation to Code Traceability
Title: Evaluating the Use of LLMs for Documentation to Code Traceability | Bewertung der Verwendung von LLMs für Dokumentation zur Code-Rückverfolgbarkeit | 评价利用LLML 进行文件记录以便遵守可追踪性法规的情况 2506.16440v1 |
Authors (3): Ebube Alor, SayedHassan Khatoonabadi, Emad Shihab
Large Language Models (LLMs) offer new potential for automating documentation-to-code traceability, yet their capabilities remain underexplored. We present a comprehensive evaluation of LLMs (Claude 3.5 Sonnet, GPT-4o, and o3-mini) in establishing trace links between various software documentation (including API references and user guides) and source code. We create two novel datasets from two open-source projects (Unity Catalog and Crawl4AI). Through systematic experiments, we assess three key capabilities: (1) trace link identification accuracy, (2) relationship explanation quality, and (3) multi-step chain reconstruction. Results show that the best-performing LLM achieves F1-scores of 79.4% and 80.4% across the two datasets, substantially outperforming our baselines (TF-IDF, BM25, and CodeBERT). While fully correct relationship explanations range from 42.9% to 71.1%, partial accuracy exceeds 97%, indicating that fundamental connections are rarely missed. For multi-step chains, LLMs maintain high endpoint accuracy but vary in capturing precise intermediate links. Error analysis reveals that many false positives stem from naming-based assumptions, phantom links, or overgeneralization of architectural patterns. We demonstrate that task-framing, such as a one-to-many matching strategy, is critical for performance. These findings position LLMs as powerful assistants for trace discovery, but their limitations could necessitate human-in-the-loop tool design and highlight specific error patterns for future research.
大型语言模型(LLM)为文档到代码可追溯性的自动化提供了新的潜力,但其能力仍未得到充分探索。我们对LLM(Claude 3.5 Sonnet、GPT-4o和o3-mini)在各类软件文档(包括API参考和用户指南)与源代码之间建立追溯链接的能力进行了全面评估。我们基于两个开源项目(Unity Catalog和Crawl4AI)创建了两个新数据集。通过系统实验,我们评估了三项关键能力:(1)追溯链接识别的准确性,(2)关系解释的质量,(3)多步链的重建。结果显示,表现最好的LLM在两个数据集上的F1分数分别达到79.4%和80.4%,大幅优于我们的基线(TF-IDF、BM25和CodeBERT)。完全正确的关系解释占比为42.9%至71.1%,而部分正确的比例超过97%,说明基本关联很少被遗漏。对于多步链,LLM保持了较高的端点准确率,但在捕捉精确的中间链接方面表现不一。错误分析表明,许多误报源于基于命名的假设、幻影链接或对架构模式的过度泛化。我们证明任务的表述方式(例如一对多匹配策略)对性能至关重要。这些发现表明LLM可以成为追溯发现的有力助手,但其局限性可能需要人在回路的工具设计,并为未来研究指出了具体的错误模式。
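For readers unfamiliar with the F1-scores quoted above, the following sketch computes precision, recall, and F1 over a set of predicted documentation-to-code trace links. The link pairs and the function name `link_metrics` are made up for illustration; the paper's own evaluation protocol may differ.

```python
Link = tuple[str, str]  # (documentation section, source file)


def link_metrics(predicted: set[Link], gold: set[Link]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) of predicted links against gold links."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical links, assumed for this sketch only.
    gold = {("guide.md", "a.py"), ("guide.md", "b.py"),
            ("api.md", "c.py"), ("api.md", "d.py")}
    predicted = {("guide.md", "a.py"), ("guide.md", "b.py"),
                 ("api.md", "c.py"), ("api.md", "x.py")}
    print(link_metrics(predicted, gold))
```

In this toy example three of four predicted links are correct and three of four gold links are found, so precision, recall, and F1 all equal 0.75.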
Article 129
Title@2025-06-19 (4): SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks
Title: SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks | SWE-Factory: Ihre automatisierte Fabrik für Ausgabeauflösungstraining Daten- und Bewertungs-Benchmarks | SWE-Factory: 您的解决问题自动工厂 培训数据和评价基准 2506.10954v2 |
Authors (9): Lianghong Guo, Yanlin Wang, Caihua Li, Pengyu Yang, Jiachi Chen, Wei Tao, Yingtian Zou, Duyu Tang, Zibin Zheng
Constructing large-scale datasets for the GitHub issue resolution task is crucial for both training and evaluating the software engineering capabilities of Large Language Models (LLMs). However, the traditional process for creating such benchmarks is notoriously challenging and labor-intensive, particularly in the stages of setting up evaluation environments, grading test outcomes, and validating task instances. In this paper, we propose SWE-Factory, an automated pipeline designed to address these challenges by integrating three core automated components. First, we introduce SWE-Builder, a multi-agent system that automates evaluation environment construction; it employs four specialized agents that work in a collaborative, iterative loop and leverages an environment memory pool to enhance efficiency. Second, we introduce a standardized, exit-code-based grading method that eliminates the need for manually writing custom parsers. Finally, we automate the fail2pass validation process using these reliable exit code signals. Experiments on 671 issues across four programming languages show that our pipeline can effectively construct valid task instances; for example, with GPT-4.1-mini, our SWE-Builder constructs 269 valid instances at $0.045 per instance, while with Gemini-2.5-flash, it achieves comparable performance at the lowest cost of $0.024 per instance. We also demonstrate that our exit-code-based grading achieves 100% accuracy compared to manual inspection, and our automated fail2pass validation reaches a precision of 0.92 and a recall of 1.00. We hope our automated pipeline will accelerate the collection of large-scale, high-quality GitHub issue resolution datasets for both training and evaluation. Our code and datasets are released at https://github.com/DeepSoftwareAnalytics/swe-factory.
为GitHub问题解决任务构建大规模数据集,对于训练和评估大型语言模型(LLM)的软件工程能力至关重要。然而,创建此类基准的传统流程非常困难且耗费人力,尤其是在搭建评估环境、评定测试结果和验证任务实例等阶段。本文提出SWE-Factory,一条旨在应对这些挑战的自动化流水线,它集成了三个核心自动化组件。第一,我们引入SWE-Builder,一个自动构建评估环境的多智能体系统,它使用四个专门的代理在协作式迭代循环中工作,并利用环境记忆池提高效率。第二,我们引入一种标准化的、基于退出码的评分方法,无需手工编写定制解析器。最后,我们利用这些可靠的退出码信号实现fail2pass验证过程的自动化。在四种编程语言的671个问题上的实验表明,我们的流水线能够有效构建有效的任务实例;例如,使用GPT-4.1-mini时,SWE-Builder以每个实例0.045美元的成本构建了269个有效实例,而使用Gemini-2.5-flash时,它以每个实例0.024美元的最低成本达到了相当的性能。我们还证明,与人工检查相比,基于退出码的评分达到100%的准确率,自动化fail2pass验证的精确率为0.92,召回率为1.00。我们希望这条自动化流水线能够加速收集用于训练和评估的大规模、高质量GitHub问题解决数据集。代码和数据集发布于 https://github.com/DeepSoftwareAnalytics/swe-factory。
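A minimal sketch of exit-code-based grading and fail2pass validation follows, assuming that "fail2pass" means the test command fails (non-zero exit code) before the patch and succeeds (exit code 0) after it. The helper names `run_tests` and `is_fail2pass` and the demo commands are hypothetical; SWE-Factory's actual pipeline is not reproduced here.

```python
import subprocess
import sys


def run_tests(command: list[str]) -> int:
    """Run a test command and return its exit code (0 means success)."""
    return subprocess.run(command, capture_output=True).returncode


def is_fail2pass(exit_before: int, exit_after: int) -> bool:
    """A task instance is valid if tests fail before the patch and pass after."""
    return exit_before != 0 and exit_after == 0


if __name__ == "__main__":
    # Stand-ins for "tests before the patch" and "tests after the patch".
    before = run_tests([sys.executable, "-c", "import sys; sys.exit(1)"])
    after = run_tests([sys.executable, "-c", "import sys; sys.exit(0)"])
    print(before, after, is_fail2pass(before, after))
```

Grading on the exit code alone avoids parsing framework-specific test logs, which is the motivation the abstract gives for dropping custom parsers.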
Article 130
Title@2025-06-19 (4): Chaos Engineering: A Multi-Vocal Literature Review
Title: Chaos Engineering: A Multi-Vocal Literature Review | Chaos Engineering: Ein mehrstimmiger Literaturbericht | 混乱工程:多语言文学评论 2412.01416v2 |
Authors (4): Joshua Owotogbe, Indika Kumara, Willem-Jan Van Den Heuvel, Damian Andrew Tamburri
Organizations, particularly medium and large enterprises, typically rely heavily on complex, distributed systems to deliver critical services and products. However, the growing complexity of these systems poses challenges in ensuring service availability, performance, and reliability. Traditional resilience testing methods often fail to capture the intricate interactions and failure modes of modern systems. Chaos Engineering addresses these challenges by proactively testing how systems in production behave under turbulent conditions, allowing developers to uncover and resolve potential issues before they escalate into outages. Though chaos engineering has received growing attention from researchers and practitioners alike, we observed a lack of reviews that synthesize insights from both academic and grey literature. Hence, we conducted a Multivocal Literature Review (MLR) on chaos engineering to address this research gap by systematically analyzing 96 academic and grey literature sources published between January 2016 and April 2024. We first used the chosen sources to derive a unified definition of chaos engineering and to identify key functionalities, components, and adoption drivers. We also developed a taxonomy for chaos engineering platforms and compared the relevant tools using it. Finally, we analyzed the current state of chaos engineering research and identified several open research issues.
各组织,特别是中型和大型企业,通常都严重依赖分布式的复杂系统来提供关键服务和产品。然而,这些系统日益复杂,在确保服务提供、性能和可靠性方面构成挑战。传统复原力测试方法往往无法捕捉现代系统复杂的互动和失败模式。Chaos Engineering应对这些挑战,主动测试生产系统在动荡条件下如何运作,让开发商能够在潜在问题升级到停产之前发现和解决这些潜在问题。虽然混乱工程受到研究人员和从业人员越来越多的关注,但我们注意到缺乏综合学术和灰色文献的见解的审查。因此,我们进行了关于混乱工程的多语言文献审查,以通过系统分析2016年1月至2024年4月公布的96个学术和灰色文献来源来解决这一研究差距。我们首先利用所选来源来得出混乱工程的统一定义,并查明关键的功能、组成部分和采用驱动因素。我们还开发了混乱工程平台的分类,并比较了使用的相关工具。最后,我们分析了混乱工程研究的现状,并查明了若干公开研究问题。
Article 131
Title@2025-06-19 (4): Evaluating Time-Dependent Methods and Seasonal Effects in Code Technical Debt Prediction
Title: Evaluating Time-Dependent Methods and Seasonal Effects in Code Technical Debt Prediction | Bewertung von zeitabhängigen Methoden und saisonalen Auswirkungen in Code Technical Debt Prediction | 评估法典技术债务预测中依赖时间的方法和季节效应 2408.08095v2 |
Authors (6): Mikel Robredo, Nyyti Saarimaki, Matteo Esposito, Davide Taibi, Rafael Penaloza, Valentina Lenarduzzi
Background. Code Technical Debt (Code TD) prediction has gained significant attention in recent software engineering research. However, no standardized approach to Code TD prediction fully captures the factors influencing its evolution. Objective. Our study aims to assess the impact of time-dependent models and seasonal effects on Code TD prediction. It evaluates such models against widely used Machine Learning models, also considering the influence of seasonality on prediction performance. Methods. We trained 11 prediction models with 31 Java open-source projects. To assess their performance, we predicted future observations of the SQALE index. To evaluate the practical usability of our TD forecasting model and its impact on practitioners, we surveyed 23 software engineering professionals. Results. Our study confirms the benefits of time-dependent techniques, with the ARIMAX model outperforming the others. Seasonal effects improved predictive performance, though the impact remained modest. ARIMAX/SARIMAX models were shown to provide well-balanced long-term forecasts. The survey highlighted strong industry interest in short- to medium-term TD forecasts. Conclusions. Our findings support using techniques that capture time dependence in historical software metric data, particularly for Code TD. Effectively addressing this evidence requires adopting methods that account for temporal patterns.
技术债务(Code TD)预测在最近的软件工程研究中受到高度重视。然而,没有标准化的《准则》预测方法能够充分捕捉到影响其演变的因素。目标:我们的研究旨在评估时间依赖模型和季节效应对《准则》预测的影响。我们的研究旨在评估这些模型对广泛使用的机器学习模型的影响,同时也考虑到季节性对预测绩效的影响。方法:我们用31个爪哇开放源项目培训了11个预测模型。为了评估其绩效,我们预测了SQALE指数的未来观测。为了评估我们的TD预测模型的实际可用性及其对实践者的影响,我们调查了23名软件工程专业人员。结果。我们的研究证实,依靠时间依赖技术的好处,ARIMAX模型优于其他技术。季节性效果提高了预测性绩效,尽管影响仍然不大。 \RevererA{ARIMAX/SARIMAX 模型展示了提供十分平衡的长期预测。为了评估其绩效,我们估计了工业对短期到中期TD预测的强烈兴趣。我们的结论。我们的调查结果证实了利用在历史软件计量数据中反映时间依赖性的技术,特别是代码TD。
Article 132
Title@2025-06-19 (4): The Technical Debt Gamble: A Case Study on Technical Debt in a Large-Scale Industrial Microservice Architecture
Title: The Technical Debt Gamble: A Case Study on Technical Debt in a Large-Scale Industrial Microservice Architecture | The Technical Debt Gamble: Eine Fallstudie über technische Schulden in einer großräumigen Industrie-Mikroservice-Architektur | 技术债务赌博:关于大型工业微观服务结构中技术债务的案例研究 2506.16214v1 |
Authors (3): Klara Borowa, Andrzej Ratkowski, Roberto Verdecchia
Microservice architectures provide an intuitive promise of high maintainability and evolvability due to loose coupling. However, these quality attributes are notably vulnerable to technical debt (TD). Few studies address TD in microservice systems, particularly on a large scale. This research explores how TD manifests in a large-scale microservice-based industrial system. The research is based on a mixed-method case study of a project including over 100 microservices and serving over 15k locations. Results are collected via a quantitative method based on static code analyzers, combined with qualitative insights derived from a focus group discussion with the development team and a follow-up interview with the lead architect of the case study system. Results show that (1) simple static source code analysis can be an efficient and effective entry point for holistic TD discovery, (2) inadequate communication significantly contributes to TD, (3) misalignment between architectural and organizational structures can exacerbate TD accumulation, and (4) microservices can rapidly cycle through TD accumulation and resolution, a phenomenon referred to as the “microservice architecture technical debt gamble”. Finally, we identify a set of fitting strategies for TD management in microservice architectures.
微观服务结构由于松散的结合,提供了高可维持性和可发展性的直觉前景。然而,这些质量特征特别容易发生技术债务(TD)。很少有研究涉及微观服务系统中的TD(TD),特别是大规模研究探索大型微观服务工业系统中的TD表征如何在大型微观服务工业系统中出现。研究基于对一个项目(包括100多个微观服务和15公里以上地点)的混合方法案例研究。通过基于定量方法的静态代码分析器收集了结果,同时从与开发团队的焦点小组讨论和与案例研究系统主要设计师的后续访谈中获得了定性的洞察力。结果显示:(1) 简单的静态源代码分析可以成为整体TD发现的一个高效和有效的切入点,(2) 通信不足大大促进了TD,(3) 建筑和组织结构之间的不协调会加剧TD的积累,(4) 微观服务可以通过TD的积累和解决迅速循环,这是一种被称为“微观服务结构技术债务赌博”的现象。最后,我们为微观服务结构中的TD管理确定了一套适当的战略。
Article 133
Title@2025-06-19 (4): Seeing is Fixing: Cross-Modal Reasoning with Multimodal LLMs for Visual Software Issue Fixing
Title: Seeing is Fixing: Cross-Modal Reasoning with Multimodal LLMs for Visual Software Issue Fixing | Sehen ist Fixing: Cross-Modal Reasoning mit multimodalen LLMs für Visual Software Problem Fixing | 确定:与用于确定视觉软件问题的多模式LLMs进行交叉模式解释 2506.16136v1 |
Authors (4): Kai Huang, Jian Zhang, Xiaofei Xie, Chunyang Chen
Large language model (LLM)-based automated program repair (APR) techniques have shown promising results in resolving real-world GitHub issue tasks. Existing APR systems are primarily evaluated in unimodal settings (e.g., SWE-bench). However, these autonomous systems struggle to resolve multimodal problem scenarios (e.g., SWE-bench M) due to limitations in interpreting and leveraging visual information. In multimodal scenarios, LLMs need to rely on visual information in the graphical user interface (GUI) to understand bugs and generate fixes. To bridge this gap, we propose GUIRepair, a cross-modal reasoning approach for resolving multimodal issue scenarios by understanding and capturing visual information. Specifically, GUIRepair integrates two key components, Image2Code and Code2Image, to enhance fault comprehension and patch validation. Image2Code extracts relevant project documents based on the issue report, then applies this domain knowledge to generate the reproduced code responsible for the visual symptoms, effectively translating GUI images into executable context for better fault comprehension. Code2Image replays the visual issue scenario using the reproduced code and captures GUI renderings of the patched program to assess whether the fix visually resolves the issue, providing feedback for patch validation. We evaluate GUIRepair on SWE-bench M, and the approach demonstrates significant effectiveness. When utilizing GPT-4o as the base model, GUIRepair solves 157 instances, outperforming the best open-source baseline by 26 instances. Furthermore, when using o4-mini as the base model, GUIRepair can achieve even better results and solve 175 instances, outperforming the top commercial system by 22 instances. This emphasizes the success of our new perspective on incorporating cross-modal reasoning by understanding and capturing visual information to resolve multimodal issues.
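基于大型语言模型(LLM)的自动程序修复(APR)技术在解决真实世界的 GitHub 问题任务方面已显示出良好的效果。现有的 APR 系统主要在单模态环境(例如 SWE-bench)中进行评估。然而,由于在解读和利用视觉信息方面存在局限,这些自主系统难以解决多模态问题场景(例如 SWE-bench M)。在多模态场景中,LLM 需要依靠图形用户界面(GUI)中的视觉信息来理解缺陷并生成修复。为弥合这一差距,我们提出 GUIRepair,一种通过理解和捕捉视觉信息来解决多模态问题场景的跨模态推理方法。具体而言,GUIRepair 集成了 Image2Code 和 Code2Image 两个关键组件,以增强故障理解和补丁验证。Image2Code 根据问题报告提取相关项目文档,再利用这些领域知识生成导致视觉症状的复现代码,从而将 GUI 图像转换为可执行的上下文,以便更好地理解故障。Code2Image 使用复现代码重放视觉问题场景,并捕获打补丁后程序的 GUI 渲染结果,以评估修复是否在视觉上解决了问题,为补丁验证提供反馈。我们在 SWE-bench M 上评估 GUIRepair,结果表明该方法效果显著。以 GPT-4o 为基础模型时,GUIRepair 解决了 157 个实例,比最佳开源基线多 26 个。此外,以 o4-mini 为基础模型时,GUIRepair 取得了更好的结果,解决了 175 个实例,比最强的商业系统多 22 个。这凸显了我们通过理解和捕捉视觉信息、引入跨模态推理来解决多模态问题这一新视角的成功。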
Article 134
Title@2025-06-19 (4): Regression Testing Optimization for ROS-based Autonomous Systems: A Comprehensive Review of Techniques
Title: Regression Testing Optimization for ROS-based Autonomous Systems: A Comprehensive Review of Techniques | Regressionsprüfung Optimierung für ROS-basierte autonome Systeme: Eine umfassende Überprüfung von Techniken | 以ROS为基础的自动系统优化后退试验:技术的全面审查 2506.16101v1 |
Authors (3): Yupeng Jiang, Shuaiyi Sun, Xi Zheng
Regression testing plays a critical role in maintaining software reliability, particularly for ROS-based autonomous systems (ROSAS), which frequently undergo continuous integration and iterative development. However, conventional regression testing techniques face significant challenges when applied to autonomous systems due to their dynamic and non-deterministic behaviors, complex multi-modal sensor data, asynchronous distributed architectures, and stringent safety and real-time constraints. Although numerous studies have explored test optimization in traditional software contexts, regression testing optimization specifically for ROSAS remains largely unexplored. To address this gap, we present the first comprehensive survey systematically reviewing regression testing optimization techniques tailored for ROSAS. We analyze and categorize 122 representative studies into regression test case prioritization, minimization, and selection methods. A structured taxonomy is introduced to clearly illustrate their applicability and limitations within ROSAS contexts. Furthermore, we highlight major challenges specific to regression testing for ROSAS, including effectively prioritizing tests in response to frequent system modifications, efficiently minimizing redundant tests, and accurately selecting impacted test cases. Finally, we propose research insights and identify promising future directions, such as leveraging frame-to-vector coverage metrics, multi-source foundation models, and neurosymbolic reasoning to enhance regression testing efficiency and effectiveness. This survey provides a foundational reference and practical roadmap for advancing the state of the art in regression testing optimization for ROSAS.
回归测试在保持软件可靠性方面发挥着关键作用,对于经常进行持续集成和迭代开发的基于 ROS 的自主系统(ROSAS)尤其如此。然而,由于自主系统具有动态和非确定性行为、复杂的多模态传感器数据、异步分布式架构以及严格的安全和实时约束,传统回归测试技术在应用于这类系统时面临重大挑战。尽管已有大量研究探讨了传统软件环境下的测试优化,但专门针对 ROSAS 的回归测试优化在很大程度上仍未得到探索。为弥补这一空白,我们提出了首个全面综述,系统地回顾了面向 ROSAS 的回归测试优化技术。我们对 122 项代表性研究进行了分析,并将其归类为回归测试用例优先级排序、最小化和选择方法。我们引入了结构化分类法,以清楚说明这些方法在 ROSAS 环境中的适用性和局限性。此外,我们强调了 ROSAS 回归测试特有的主要挑战,包括针对频繁的系统修改有效地确定测试优先级、高效地减少冗余测试,以及准确选择受影响的测试用例。最后,我们提出了研究见解并指出了有前景的未来方向,例如利用帧到向量的覆盖率指标、多源基础模型和神经符号推理来提高回归测试的效率和有效性。本综述为推进 ROSAS 回归测试优化的最新水平提供了基础性参考和实用路线图。
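As a concrete illustration of the test case prioritization category named in the abstract above, here is a minimal sketch of greedy "additional coverage" ordering, a widely used generic technique. It is not taken from the survey: the function name, the coverage map, and the test and node names are all hypothetical, and the sketch does not reset coverage once every item has been covered, as fuller variants do.

```python
from typing import Dict, List, Set


def prioritize_by_additional_coverage(coverage: Dict[str, Set[str]]) -> List[str]:
    """Order tests so that each next test adds the most not-yet-covered items."""
    remaining = dict(coverage)
    covered: Set[str] = set()
    order: List[str] = []
    while remaining:
        # Pick the test with the largest coverage gain; iterating over the
        # sorted names makes ties resolve alphabetically.
        best = max(sorted(remaining), key=lambda t: len(remaining[t] - covered))
        order.append(best)
        covered |= remaining.pop(best)
    return order


# Hypothetical coverage map: test name -> components (e.g. ROS nodes) it exercises.
example = {
    "test_planner": {"planner", "costmap"},
    "test_localization": {"amcl", "costmap", "tf"},
    "test_tf_only": {"tf"},
}
print(prioritize_by_additional_coverage(example))
# ['test_localization', 'test_planner', 'test_tf_only']
```

Running the highest-gain tests first means a regression in a widely exercised component tends to surface early, which matters when a full suite cannot be run after every change.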
Article 135
Title@2025-06-19 (4): LMR-BENCH: Evaluating LLM Agent’s Ability on Reproducing Language Modeling Research
Title: LMR-BENCH: Evaluating LLM Agent’s Ability on Reproducing Language Modeling Research | LMR-BENCH: Bewertung der Fähigkeit des LLM-Agenten zur Reproduktion von Sprachmodellierungsforschung | LMR-BENCH:评价LLM代理复制语言建模研究的能力 2506.17335v1 |
Authors (14): Shuo Yan, Ruochen Li, Ziming Luo, Zimu Wang, Daoyang Li, Liqiang Jing, Kaiyu He, Peilin Wu, George Michalopoulos, Yue Zhang, Ziyang Zhang, Mian Zhang, Zhiyu Chen, Xinya Du
Large language model (LLM) agents have demonstrated remarkable potential in advancing scientific discovery. However, their capability in the fundamental yet crucial task of reproducing code from research papers, especially in the NLP domain, remains underexplored. This task includes unique complex reasoning challenges in the intellectual synthesis of abstract concepts and the comprehension of code repositories with interdependent files. Motivated by this gap, we present LMR-BENCH, a benchmark designed to systematically evaluate the capability of LLM agents on code reproduction from Language Modeling Research. It consists of 28 code reproduction tasks derived from 23 research papers published in top-tier NLP venues over the past five years, spanning nine fundamental categories. Models are provided with a research paper, a code repository containing one or more masked functions, and instructions for implementing these functions. We conduct extensive experiments in standard prompting and LLM agent settings with state-of-the-art LLMs, evaluating the accuracy of unit tests and performing LLM-based evaluation of code correctness. Experimental results reveal that even the most advanced models still exhibit persistent limitations in scientific reasoning and code synthesis, highlighting critical gaps in LLM agents’ ability to autonomously reproduce scientific research.
大型语言模型(LLM)代理物在推动科学发现方面表现出了显著的潜力,然而,他们在复制研究论文的代码这一基本而关键的任务,特别是在NLP领域,其能力仍未得到充分探讨。这项任务包括在抽象概念的智力综合和理解代码库方面的独特复杂的推理挑战,以及以相互依存的文档理解代码库。受这一差距的驱动,我们介绍了LMR-BENCH,这是旨在系统评价LM代理物从语言建模研究中复制代码的能力的基准。它包含28个代码复制任务,这些任务来自过去5年来在最高一级NLP地点出版的23份研究论文,涵盖9个基本类别。模型提供了一份研究文件、一个含有一个或多个隐蔽功能的代码库以及执行这些功能的指示。我们在标准快速和LM代理物环境与最新技术的LMM进行广泛的实验,评价单位测试的准确性,并对代码正确性进行基于LM的LM评估。实验结果表明,即使是最先进的模型也仍然在科学推理和代码合成方面表现出持续的局限性,突出了LM代理物证代理人自主复制科学研究的能力。
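The "masked function" setup described in the abstract above can be pictured with a small sketch. This is an assumption about what masking might look like, not LMR-BENCH's actual tooling: `mask_function` is a hypothetical helper that replaces a function body with a stub (keeping the docstring) using Python's standard `ast` module, which provides `ast.unparse` from Python 3.9.

```python
import ast


def mask_function(source: str, name: str) -> str:
    """Return `source` with the body of function `name` replaced by a stub."""
    tree = ast.parse(source)
    found = False
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == name:
            stub = ast.parse("raise NotImplementedError('masked')").body
            # Keep the docstring, if any, so the description of the task survives.
            doc = node.body[:1] if ast.get_docstring(node) is not None else []
            node.body = doc + stub
            found = True
    if not found:
        raise ValueError(f"no function named {name!r}")
    return ast.unparse(ast.fix_missing_locations(tree))


# Hypothetical repository file with one function to mask.
original = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b

def keep():
    return 1
'''
masked = mask_function(original, "add")
print(masked)
```

A unit test that calls `add` fails against the masked file and passes again only once a model supplies a working implementation, which is the kind of check the abstract's unit-test accuracy refers to.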
Article 136
Title@2025-06-19 (4): ExploraCoder: Advancing code generation for multiple unseen APIs via planning and chained exploration
Title: ExploraCoder: Advancing code generation for multiple unseen APIs via planning and chained exploration | ExploraCoder: Advancing Codegenerierung für mehrere unsichtbare APIs durch Planung und verkettete Exploration | 探索Coder:通过规划和链式探索,推进多个看不见的API代码生成 2412.05366v2 |
Authors (8): Yunkun Wang, Yue Zhang, Zhen Qin, Chen Zhi, Binhua Li, Fei Huang, Yongbin Li, Shuiguang Deng
Large language models face intrinsic limitations in coding with APIs that are unseen in their training corpora. As libraries continuously evolve, it becomes impractical to exhaustively retrain LLMs with new API knowledge. This limitation hampers LLMs in solving programming problems that require newly introduced or privately maintained libraries. Inspired by the exploratory programming paradigm in human behavior, we propose ExploraCoder, a training-free framework that empowers LLMs to invoke multiple unseen APIs in a code solution by (1) planning a complex problem into several API invocation subtasks, and (2) experimenting with correct API usage at intermediate steps through a novel chain-of-API-exploration. We evaluate ExploraCoder on program synthesis tasks involving complex API interactions. Experimental results demonstrate that ExploraCoder significantly improves performance for models lacking prior API knowledge, achieving absolute increases of up to 11.99% over retrieval-based approaches and 17.28% over pretraining-based methods in pass@10.
大型语言模型在使用训练语料中未出现的 API 进行编码时面临固有局限。随着库的不断演进,用新的 API 知识对 LLM 进行穷尽式再训练变得不切实际。这一局限妨碍了 LLM 解决需要新引入的或私有维护的库的编程问题。受人类行为中探索式编程范式的启发,我们提出 ExploraCoder,这是一个无需训练的框架,使 LLM 能够在代码解决方案中调用多个未见过的 API,其方法是:(1) 将复杂问题规划为若干 API 调用子任务;(2) 通过一种新颖的 API 探索链(chain-of-API-exploration)在中间步骤中试验正确的 API 用法。我们在涉及复杂 API 交互的程序合成任务上进行了评估。实验结果表明,ExploraCoder 显著提升了缺乏先验 API 知识的模型的性能,在 pass@10 上相对于基于检索的方法和基于预训练的方法分别实现了最高 11.99% 和 17.28% 的绝对提升。
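The pass@10 metric cited in the abstract above is commonly computed with the unbiased estimator pass@k = 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass. The abstract does not state which estimator the authors used, so the sketch below is an assumption, and the sample counts in the example are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which passed the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset of the n samples contains no
    # passing sample is C(n - c, k) / C(n, k); pass@k is its complement.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical run: 20 samples for one problem, 3 of them correct.
print(round(pass_at_k(n=20, c=3, k=10), 4))  # 0.8947
```

Per-problem estimates are averaged over the benchmark. An "absolute increase" such as the 11.99% above is a difference between two such averages in percentage points, not a relative ratio.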
Article 137
Title@2025-06-19 (4): Large Language Models for Spreadsheets: Benchmarking Progress and Evaluating Performance with FLARE
Title: Large Language Models for Spreadsheets: Benchmarking Progress and Evaluating Performance with FLARE | Große Sprachmodelle für Tabellen: Benchmarking-Fortschritt und Leistungsbewertung mit FLARE | 电子表格大语言模式:与FLARE制定进度基准和评估业绩 2506.17330v1 |
Authors (1): Simon Thorne
Large Language Models (LLMs) have demonstrated significant capabilities across various domains; however, their effectiveness in spreadsheet-related tasks remains underexplored. This study introduces a foundation for a comprehensive benchmark framework to evaluate the performance of leading LLMs in executing spreadsheet functions, formula generation, and data manipulation tasks. The benchmark encompasses tasks ranging from basic formula creation to complex, real-world spreadsheet scenarios. Our findings reveal that while LLMs exhibit proficiency in straightforward tasks, they often falter in complex, multi-step operations, frequently producing plausible yet incorrect outputs. These results underscore the limitations of current LLMs in handling spreadsheet tasks that require precise logical reasoning and highlight the need for integrating symbolic reasoning capabilities into LLM architectures. To support this, we introduce FLARE (Formula Logic, Auditing, Reasoning and Evaluation), a new benchmark for evaluating LLM performance on real-world spreadsheet logic, auditing, and reasoning tasks.
大型语言模型(LLM)已在多个领域展现出显著能力;然而,它们在电子表格相关任务中的有效性仍未得到充分探索。本研究为一个综合基准框架奠定了基础,用于评估主流 LLM 在执行电子表格函数、公式生成和数据操作任务方面的表现。该基准涵盖从基本公式创建到复杂的真实世界电子表格场景等各类任务。我们的研究结果表明,虽然 LLM 在简单任务中表现熟练,但在复杂的多步骤操作中往往表现不佳,经常产生看似合理但不正确的输出。这些结果凸显了当前 LLM 在处理需要精确逻辑推理的电子表格任务时的局限性,并强调了将符号推理能力整合到 LLM 架构中的必要性。为此,我们提出 FLARE(Formula Logic, Auditing, Reasoning and Evaluation),这是一个用于评估 LLM 在真实世界电子表格逻辑、审计和推理任务上表现的新基准。
Article 138
Title@2025-06-19 (4): FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation
Title: FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation | FEA-Bench: Ein Benchmark für die Bewertung der Code-Generierung auf Repository-Ebene für die Feature-Implementierung | FEA-Bench:评估存储器一级实施地物代码生成的基准 2503.06680v2 |
Authors (9): Wei Li, Xin Zhang, Zhongxin Guo, Shaoguang Mao, Wen Luo, Guangyue Peng, Yangyu Huang, Houfeng Wang, Scarlett Li
Implementing new features in repository-level codebases is a crucial application of code generation models. However, current benchmarks lack a dedicated evaluation framework for this capability. To fill this gap, we introduce FEA-Bench, a benchmark designed to assess the ability of large language models (LLMs) to perform incremental development within code repositories. We collect pull requests from 83 GitHub repositories and use rule-based and intent-based filtering to construct task instances focused on new feature development. Each task instance containing code changes is paired with relevant unit test files to ensure that the solution can be verified. Feature implementation requires LLMs to possess both code completion capabilities for new components and code editing abilities for other relevant parts of the code repository, providing a more comprehensive evaluation of LLMs’ automated software engineering capabilities. Experimental results show that LLMs perform significantly worse on FEA-Bench, highlighting considerable challenges in such repository-level incremental code development.
在库库级代码库中实施新功能是守则生成模型的关键应用,然而,目前的基准缺乏这一能力的专门评价框架;为填补这一空白,我们引入了FEA-Bench,这是评估大型语言模型(LLMs)在代码存储库内进行增量开发能力的基准;我们从83个GitHub存储库收集请求,并使用基于规则和意向的过滤方法来构建侧重于新特征开发的任务实例;每个包含代码变化的任务实例都与相关单位测试文件相配,以确保能够核实解决方案;功能实施要求LOMs同时拥有代码存储库中其他相关部分的代码完成能力,为LLMs自动化软件工程能力提供更全面的评价方法;实验结果表明,LLMs在FEA-Bench的功能上表现严重恶化,突出了此类库级增量代码开发过程中的巨大挑战。
Article 139
Title@2025-06-19 (4): From Generation to Adaptation: Comparing AI-Assisted Strategies in High School Programming Education
Title: From Generation to Adaptation: Comparing AI-Assisted Strategies in High School Programming Education | Von der Generation zur Anpassung: Vergleich von KI-Assistenten Strategien in der High School Programming Education | 从一代到适应:在高中方案规划教育中比较AI协助战略 2506.15955v1 |
Authors (2): Tong Hu, Songzan Wang
This exploratory case study investigated two contrasting pedagogical approaches for LCA-assisted programming with five novice high school students preparing for a WeChat Mini Program competition. In Phase 1, students used LCAs to generate code from abstract specifications (From-Scratch approach), achieving only 20% MVP completion. In Phase 2, students used LCAs to adapt existing Minimal Functional Units (MFUs), which are small, functional code examples, achieving 100% MVP completion. Analysis revealed that the MFU-based approach succeeded by aligning with LCA strengths in pattern modification rather than de novo generation, while providing cognitive scaffolds that enabled students to navigate complex development tasks. The study introduces a dual-scaffolding model combining technical support (MFUs) with pedagogical guidance (structured prompting strategies), demonstrating that effective LCA integration depends less on AI capabilities than on instructional design. These findings offer practical guidance for educators seeking to transform AI tools from sources of frustration into productive learning partners in programming education.
这项探索性案例研究考察了 LCA 辅助编程的两种截然不同的教学方法,研究对象是五名准备参加微信小程序(WeChat Mini Program)竞赛的高中编程新手。在第一阶段,学生使用 LCA 根据抽象规格说明生成代码(From-Scratch 方法),MVP 完成率仅为 20%。在第二阶段,学生使用 LCA 改编现有的最小功能单元(MFU,即小型、可运行的代码示例),MVP 完成率达到 100%。分析表明,基于 MFU 的方法之所以成功,是因为它契合了 LCA 在模式修改而非从头生成方面的优势,同时提供了认知支架,使学生能够应对复杂的开发任务。该研究提出了一个双重支架模型,将技术支持(MFU)与教学指导(结构化提示策略)相结合,表明有效的 LCA 整合更多取决于教学设计,而非 AI 能力。这些发现为教育工作者提供了实用指导,帮助他们将 AI 工具从挫败感的来源转变为编程教育中富有成效的学习伙伴。