• 00 07-31 (4) An Empirical Study on the Amount of Changes Required for Merge Request Acceptance Eine empirische Studie über die Menge der Änderungen, die für die Annahme von Merge-Anfragen erforderlich sind 关于合并申请接受所需变动数额的经验研究 2507.23640v1
  • 01 07-31 Using weakest application conditions to rank graph transformations for graph repair Verwendung schwächster Anwendungsbedingungen, um Graphentransformationen für Graphenreparatur zu ordnen 使用最弱应用条件来排行图形变形以进行图形修理 2405.08788v3
  • 02 07-31 Testing Compositionality Prüfung der Zusammensetzung 测试的构成性 2407.05028v3
  • 03 07-31 Automated Code Review Using Large Language Models at Ericsson: An Experience Report Automatisierte Code-Überprüfung mit großen Sprachmodellen bei Ericsson: Ein Erfahrungsbericht Ericsson公司使用大语言模型的自动码审查:经验报告 2507.19115v2
  • 04 07-31 Blended PC Peer Review Model: Process and Reflection Blended PC Peer Review Modell: Prozess und Reflexion PC 混合同行审查模式:进程和反思 2504.19105v2
  • 05 07-31 PurpCode: Reasoning for Safer Code Generation PurpCode: Begründung für eine sicherere Code-Generierung PurpCode:更安全代码生成的理由 2507.19060v2
  • 06 07-31 Dynamic and Static Analysis of Python Software with Kieker Including Reconstructed Architectures Dynamische und statische Analyse von Python-Software mit Kieker inklusive rekonstruierter Architekturen 使用Kieker 包括重建建筑的 Python 软件动态和静态分析 2507.23425v1
  • 07 07-31 Mokav: Execution-driven Differential Testing with LLMs Mokav: Execution-getriebene Differentialprüfung mit LLMs Mokav:由执行驱动的用LLMs进行差别测试 2406.10375v2
  • 08 07-31 Trae Agent: An LLM-based Agent for Software Engineering with Test-time Scaling Trae Agent: Ein LLM-basierter Agent für Software Engineering mit Testzeitskalierung Trae Agent: 设在LLM的软件工程应用测试时间缩放软件工程代理 2507.23370v1
  • 09 07-31 REST API Testing in DevOps: A Study on an Evolving Healthcare IoT Application REST API Testing in DevOps: Eine Studie über eine sich entwickelnde IoT-Anwendung im Gesundheitswesen 在DevOps进行的REST API测试:关于不断演变的卫生保健IOT应用的研究 2410.12547v2
  • 10 07-31 SWE-Exp: Experience-Driven Software Issue Resolution SWE-Exp: Erfahrungsgetriebene Lösung von Software-Problemen SWE-Exp:经验驱动的软件问题解决 2507.23361v1
  • 11 07-31 Quality Evaluation of COBOL to Java Code Transformation Qualitätsbewertung von COBOL zu Java Code Transformation COBOL到Java代码转换的质量评估 2507.23356v1
  • 12 07-31 SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution SWE-Debatte: Wettbewerbsfähige Multi-Agenten-Debatte für die Lösung von Software-Problemen SWE-Debate:解决软件问题竞争性多机构辩论 2507.23348v1
  • 13 07-31 Scalable and Precise Patch Robustness Certification for Deep Learning Models with Top-k Predictions Skalierbare und präzise Patch Robustness Zertifizierung für Deep Learning Modelle mit Top-K Vorhersagen 具有顶级预测力的深学习模型可缩放和精确的补丁强度认证 2507.23335v1
  • 14 07-31 SequenceLayers: Sequence Processing and Streaming Neural Networks Made Easy SequenzLayer: Sequenzverarbeitung und Streaming von Neuronalen Netzwerken leicht gemacht 序列激光器:序列处理和串联神经网络变得容易 2507.23292v1
  • 15 07-31 A Privacy-Preserving DAO Model Using NFT Authentication for the Punishment not Reward Blockchain Architecture A Privacy-Preserving DAO-Modell mit NFT-Authentifizierung für die Strafe nicht Belohnung Blockchain Architektur 使用NFT认证用于惩罚而不是回报链架构的隐私保护 DAO 模式 2405.13156v2
  • 16 07-31 XABPs: Towards eXplainable Autonomous Business Processes XABPs: Auf dem Weg zu eXplainable Autonomous Business Processes XABPs:迈向可塑性自治商业进程 2507.23269v1
  • 17 07-31 CodeIF-Bench: Evaluating Instruction-Following Capabilities of Large Language Models in Interactive Code Generation CodeIF-Bench: Bewertung von Instruction-Following-Fähigkeiten von großen Sprachmodellen in der interaktiven Codegenerierung 守则-框架框架框架:评估互动代码生成中大语言模式的指令-遵守能力 2503.22688v3
  • 18 07-31 Kernel-FFI: Transparent Foreign Function Interfaces for Interactive Notebooks Kernel-FFI: Transparente Fremdfunktionsschnittstellen für interaktive Notebooks 核心-FFI:交互式笔记本的透明外国函数界面 2507.23205v1
  • 19 07-31 CodePod: A Language-Agnostic Hierarchical Scoping System for Interactive Development CodePod: Ein sprach-agnostisches Hierarchisches Scoping-System für interaktive Entwicklung 代码pod:一个促进互动发展的语文、不可知的等级分级范围界定系统 2301.02410v2
  • 20 07-31 VRISE: A Virtual Reality Platform for Immersive and Interactive Surveying Education VRISE: Eine Virtual-Reality-Plattform für immersive und interaktive Vermessungsausbildung VRISE: 沉浸式和互动式测量教育的虚拟现实平台 2507.22810v2
  • 21 07-31 AutoBridge: Automating Smart Device Integration with Centralized Platform AutoBridge: Automatisierung der Smart Device Integration mit zentralisierter Plattform AutoBridge: 与集中化平台自动整合智能设备 2507.23178v1
  • 22 07-31 Extension Decisions in Open Source Software Ecosystem Erweiterungsentscheidungen in Open Source Software Ecosystem 开放源软件生态系统的推广决定 2507.23168v1
  • 23 07-30 (3) Vibe Modeling: Challenges and Opportunities Vibe Modeling: Herausforderungen und Chancen 虚拟建模:挑战和机遇 2507.23120v1
  • 24 07-30 FlowETL: An Autonomous Example-Driven Pipeline for Data Engineering FlowETL: Eine autonome Beispiel-gesteuerte Pipeline für die Datentechnik FlowETL:数据工程的自主示例驱动管道 2507.23118v1
  • 25 07-30 Insights into resource utilization of code small language models serving with runtime engines and execution providers Einblicke in die Ressourcennutzung von Code-Small Language-Modellen, die mit Laufzeit-Engines und Ausführungsanbietern dienen 深入了解为运行时引擎和执行提供方服务的编码小型语文模式的资源利用情况 2412.15441v2
  • 26 07-30 On LLM-Assisted Generation of Smart Contracts from Business Processes Zur LLM-Assistenten Generierung von Smart Contracts aus Geschäftsprozessen 利用LLM协助从业务流程中生成智能合同 2507.23087v1
  • 27 07-30 The Design Space of Lockfiles Across Package Managers Der Design-Raum von Lockfiles Across Package Managers 全包管理员的锁文件设计空间 2505.04834v2
  • 28 07-30 Tracking research software outputs in the UK Verfolgung von Forschungssoftware-Outputs im Vereinigten Königreich 联合王国跟踪研究软件产出 2507.22871v1
  • 29 07-30 Repair-R1: Better Test Before Repair Reparatur-R1: Besserer Test vor Reparatur 修理-R1:在修理前进行更好的测试 2507.22853v1
  • 30 07-30 The Multi-Agent Fault Localization System Based on Monte Carlo Tree Search Approach Das Multi-Agent Fault Localization System basiert auf Monte Carlo Tree Search Approach 以蒙特卡洛树搜索方法为基础的多机构错失地方化系统 2507.22800v1
  • 31 07-30 Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning Bewertungsprüfer: Bewertung der synthetischen Überprüfung für Code und Begründung 标定验证符:评估编码和理由的合成核查 2502.13820v3
  • 32 07-30 Designing for Self-Regulation in Informal Programming Learning: Insights from a Storytelling-Centric Approach Designing for Self-Regulation in Informelles Programmieren Lernen: Einblicke aus einem Storytelling-Centric-Ansatz 设计非正规方案拟订学习的自我管理:从讲故事-核心方法的洞察 2507.22671v1
  • 33 07-30 RobEthiChor: Automated Context-aware Ethics-based Negotiation for Autonomous Robots RobEthiChor: Automatisierte kontextorientierte Ethik-basierte Verhandlung für autonome Roboter RobEthiChor:关于自主机器人的基于道德操守的自动内部意识谈判 2507.22664v1
  • 34 07-30 A Systematic Literature Review on Detecting Software Vulnerabilities with Large Language Models Ein systematischer Literaturbericht über die Erkennung von Softwarelücken mit großen Sprachmodellen 关于检测与大语言模型的软件脆弱性的系统文献评论 2507.22659v1
  • 35 07-30 Metamorphic Testing of Deep Code Models: A Systematic Literature Review Metamorphische Prüfung von Deep Code Modellen: Ein systematischer Literaturbericht 深代码模型的变形测试:系统文献审查 2507.22610v1
  • 36 07-30 RePaCA: Leveraging Reasoning Large Language Models for Static Automated Patch Correctness Assessment RePaCA: Ausschüttung von mit Gründen versehenen großen Sprachmodellen für statische automatisierte Patch-Korrekturbewertung REPACA:利用大型语言模型进行静态自动补丁正确性评估 2507.22580v1
  • 37 07-30 Inside madupite: Technical Design and Performance Inside madupite: Technisches Design und Performance 内部疯狂:技术设计和性能 2507.22538v1
  • 38 07-30 Ensemble Fuzzing with Dynamic Resource Scheduling and Multidimensional Seed Evaluation Ensemble Fuzzing mit dynamischer Ressourcenplanung und mehrdimensionaler Saatgutbewertung 结合动态资源安排和多层面种子评估 2507.22442v1
  • 39 07-30 ETrace: Event-Driven Vulnerability Detection in Smart Contracts via LLM-Based Trace Analysis ETrace: Event-getriebene Schwachstellenerkennung in Smart Contracts über LLM-basierte Trace-Analyse ETrace:通过基于LLM的追踪分析,在智能合约中进行事件驱动的漏洞检测 2506.15790v3
  • 40 07-30 AutoCodeSherpa: Symbolic Explanations in AI Coding Agents AutoCodeSherpa: Symbolische Erklärungen in KI-Coding-Agenten AutoCodeSherpa: AI 编码剂中的符号解释 2507.22414v1
  • 41 07-30 Scalability, Availability, Reproducibility and Extensibility in Islamic Database Systems Skalierbarkeit, Verfügbarkeit, Reproduzierbarkeit und Erweiterbarkeit in islamischen Datenbanksystemen 伊斯兰数据库系统中的可扩展性、可获取性、可复制性和可推广性 2507.22384v1
  • 42 07-30 SAEL: Leveraging Large Language Models with Adaptive Mixture-of-Experts for Smart Contract Vulnerability Detection SAEL: Nutzung großer Sprachmodelle mit adaptiven Mixture-of-Experts für Smart Contract Vulnerability Detection SAEL:利用适应性混合专家的大型语言模型利用智能合同脆弱性检测 2507.22371v1
  • 43 07-30 From Articles to Code: On-Demand Generation of Core Algorithms from Scientific Publications Von Artikeln zum Code: On-Demand-Erzeugung von Kernalgorithmen aus wissenschaftlichen Publikationen 《从条款到守则:科学出版物核心数值的按需生成》 2507.22324v1
  • 44 07-29 (2) Automated Prompt Engineering for Cost-Effective Code Generation Using Evolutionary Algorithm Automatisierte Prompt-Engineering für kosteneffiziente Code-Generierung mit evolutionärem Algorithmus 利用进化算法为成本效率高的代码生成自动快速工程 2408.11198v2
  • 45 07-29 Toward Automated Validation of Language Model Synthesized Test Cases using Semantic Entropy Zur automatisierten Validierung von Sprachmodellen synthesierten Testfällen mit semantischer Entropie 争取使用语义通则对语言模拟合成试验案例进行自动验证 2411.08254v2
  • 46 07-29 Secure coding for web applications: Frameworks, challenges, and the role of LLMs Sichere Codierung für Web-Anwendungen: Frameworks, Herausforderungen und die Rolle von LLMs 网络应用的安全编码:框架、挑战和LLM的作用 2507.22223v1
  • 47 07-29 Runtime Failure Hunting for Physics Engine Based Software Systems: How Far Can We Go? Runtime Failure Hunting for Physics Engine Based Software Systems: Wie weit können wir gehen? 物理引擎软件系统运行时失灵追寻:我们能走多远? 2507.22099v1
  • 48 07-29 Fine-Tuning Code Language Models to Detect Cross-Language Bugs Fine-Tuning-Code-Sprachenmodelle zur Erkennung von Cross-Language-Fehlern 用于检测跨语言错误的精细调整代码语言模型 2507.21954v1
  • 49 07-29 DeepGo: Predictive Directed Greybox Fuzzing DeepGo: Predictive Directed Greybox Fuzzing 深度Go:预测方向灰盒模糊 2507.21952v1
  • 50 07-29 Vibe Coding as a Reconfiguration of Intent Mediation in Software Development: Definition, Implications, and Research Agenda Vibe Coding als Rekonfiguration von Intent Mediation in der Software-Entwicklung: Definition, Implikationen und Forschungsagenda 作为软件开发中内在调解的重组:定义、影响和研究议程 2507.21928v1
  • 51 07-29 LLM-based Content Classification Approach for GitHub Repositories by the README Files LLM-basierter Content-Klassifikationsansatz für GitHub-Repositories durch die README-Dateien README 文件中基于LLM的GitHub储存库内容分类方法 2507.21899v1
  • 52 07-29 The Impact of Foundational Models on Patient-Centric e-Health Systems Die Auswirkungen von Basismodellen auf die e-Health-Systeme von Patienten-Centric 基础模型对病人中心电子保健系统的影响 2507.21882v1
  • 53 07-29 Out of Distribution, Out of Luck: How Well Can LLMs Trained on Vulnerability Datasets Detect Top 25 CWE Weaknesses? Out of Distribution, Out of Luck: Wie gut können LLMs auf Sicherheitsdatensätze trainieren Top 25 CWE-Schwächen erkennen? 发售之外,运气不好:如何训练LLMs研究检测25大CWE弱点的脆弱性数据集? 2507.21817v1
  • 54 07-29 ChartMark: A Structured Grammar for Chart Annotation ChartMark: Eine strukturierte Grammatik für Chart-Annotation 图表 Mark: 用于图表注释的结构性语法 2507.21810v1
  • 55 07-29 Identification of Design Recommendations for Augmented Reality Authors in Corporate Training Identifizierung von Design-Empfehlungen für Augmented Reality-Autoren in der Unternehmensschulung 确定公司培训中增加现实作者的设计建议 2507.21722v1
  • 56 07-29 iPanda: An LLM-based Agent for Automated Conformance Testing of Communication Protocols iPanda: Ein LLM-basierter Agent für automatisierte Konformitätsprüfung von Kommunikationsprotokollen iPanda:以LLM为基地的通信协议自动合规测试自动合规测试代理 2507.00378v2
  • 57 07-29 MultiAIGCD: A Comprehensive dataset for AI Generated Code Detection Covering Multiple Languages, Models, Prompts, and Scenarios MultiAIGCD: Ein umfassender Datensatz für KI Generated Code Detection für mehrere Sprachen, Modelle, Prompts und Szenarien 多AIGCD:AI生成的包含多种语言、模型、提示和情景的代码探测综合数据集 2507.21693v1
  • 58 07-29 Predicting Maintenance Cessation of Open Source Software Repositories with An Integrated Feature Framework Vorhersage der Wartungseinstellung von Open Source Software Repositories mit integriertem Feature Framework 具有综合地物框架的开放源码软件储存库预测维持维持状态的停止 2507.21678v1
  • 59 07-29 Harnessing Large Language Model for Virtual Reality Exploration Testing: A Case Study Großes Sprachmodell für Virtual Reality Exploration Testing nutzen: Eine Fallstudie 利用大语言模型进行虚拟现实探索试验:案例研究 2501.05625v2
  • 60 07-29 Ethical Classification of Non-Coding Contributions in Open-Source Projects via Large Language Models Ethische Klassifizierung von Nichtkodierungsbeiträgen in Open-Source-Projekten über große Sprachmodelle 通过大语言模式对开放源码项目非编码捐款进行道德分类 2507.21583v1
  • 61 07-29 Kodezi Chronos: A Debugging-First Language Model for Repository-Scale Code Understanding Kodezi Chronos: Ein Debugging-First Language Model für Repository-Scale Code Understanding Kodezi Chronos:调试第一语言模式,用于存储库-范围守则理解 2507.12482v2
  • 62 07-29 HLSDebugger: Identification and Correction of Logic Bugs in HLS Code with LLM Solutions HLSDebugger: Identifizierung und Korrektur von Logic Bugs im HLS-Code mit LLM-Lösungen HLSDebugger: 使用LLM 解决方案识别和校正 HLS 代码中的逻辑错误 2507.21485v1
  • 63 07-29 LLM4VV: Evaluating Cutting-Edge LLMs for Generation and Evaluation of Directive-Based Parallel Programming Model Compiler Tests LLM4VV: Bewertung von Cutting-Edge-LLMs für die Erstellung und Bewertung von richtliniesbasierten parallelen Programmiermodellkompilertests LLM4VV:评估用于编制和评价基于指令的平行方案拟订模式模型汇编者示范测试的切削-切削LMs 2507.21447v1
  • 64 07-28 (1) MAAD: Automate Software Architecture Design through Knowledge-Driven Multi-Agent Collaboration MAAD: Software-Architektur-Design automatisieren durch wissensgetriebene Multi-Agent-Kollaboration MAAD: 通过知识开发多机构协作开发自动化软件结构设计 2507.21382v1
  • 65 07-28 Does Editing Improve Answer Quality on Stack Overflow? A Data-Driven Investigation Verbessert das Bearbeiten die Antwortqualität bei Stack-Überfluss? Eine datengetriebene Untersuchung 编辑是否改进堆叠溢出时的回答质量? 数据驱动调查 2507.21329v1
  • 66 07-28 Black-Box Bug-Amplification for Multithreaded Software Black-Box Bug-Verstärkung für Multithreaded Software 多行软件的黑箱臭虫修正 2507.21318v1
  • 67 07-28 “Maybe We Need Some More Examples:” Individual and Team Drivers of Developer GenAI Tool Use “Vielleicht brauchen wir noch einige Beispiele:” Individuelle und Team-Treiber von Entwickler-GenAI-Tool-Nutzung “也许我们需要更多例子:”开发者GenAI工具使用的个人和团队驱动者 2507.21280v1
  • 68 07-28 Generating Highly Structured Test Inputs Leveraging Constraint-Guided Graph Refinement Generierung von hochstrukturierten Testeingaben, die eine eingeschränkte Graphenverfeinerung ermöglichen 正在生成高结构化测试输入, 以杠杆方式调节受约束的辅助图表精度 2507.21271v1
  • 69 07-28 Smart Expansion Techniques for ASP-based Interactive Configuration Smart Expansion Techniques für ASP-basierte interaktive Konfiguration 基于ASP的交互式互动配置的智能扩展技术 2507.21027v1
  • 70 07-28 Automated Identification of Sexual Orientation and Gender Identity Discriminatory Texts from Issue Comments Automatisierte Identifizierung von sexueller Orientierung und Geschlechteridentität diskriminatorische Texte aus der Ausgabe Kommentare 从问题评论中自动识别性取向和性别认同歧视文本 2311.08485v2
  • 71 07-28 Repairing vulnerabilities without invisible hands. A differentiated replication study on LLMs Reparieren von Schwachstellen ohne unsichtbare Hände. Eine differenzierte Replikationsstudie auf LLMs 在没有无形手的情况下修复弱点,对LLMs进行差别化的推广研究。 2507.20977v1
  • 72 07-28 Adopting Large Language Models to Automated System Integration Annahme großer Sprachmodelle zur Automatisierten Systemintegration 采用大语言模型实现自动化系统整合 2504.08490v2
  • 73 07-28 Advanced System Integration: Analyzing OpenAPI Chunking for Retrieval-Augmented Generation Advanced System Integration: Analysieren von OpenAPI Chunking für retrieval-Augmented Generation 高级系统集成:分析用于回溯源源代的 OpenAPI 弹进器 2411.19804v2
  • 74 07-28 A first look at ROS 2 applications written in asynchronous Rust Ein erster Blick auf ROS 2 Anwendungen geschrieben in asynchronen Rust 第一次查看ROS 2 申请,以非同步鲁斯特书写 2505.21323v3
  • 75 07-28 TypyBench: Evaluating LLM Type Inference for Untyped Python Repositories TypyBench: Bewertung der LLM-Typ-Schlussfolgerung für nicht typisierte Python-Repositories TypyBench: 评估非型式 Python 仓库的 LLM 类型推理 2507.22086v1
  • 76 07-28 SPICE: An Automated SWE-Bench Labeling Pipeline for Issue Clarity, Test Coverage, and Effort Estimation SPICE: Eine automatisierte SWE-Bench-Etikettierungspipeline für Issue-Klarheit, Testabdeckung und Aufwandsabschätzung SPICE: 用于议题清晰度、测试覆盖率和努力估算的SWE-Bench自动标签管道 2507.09108v2
  • 77 07-28 Secret Breach Detection in Source Code with Large Language Models Geheime Breach-Erkennung im Quellcode mit großen Sprachmodellen 具有大语言模式的源代码秘密侦测 2504.18784v2
  • 78 07-28 Enhancing Project-Specific Code Completion by Inferring Internal API Information Verbesserung der projektspezifischen Code-Vervollständigung durch Schlussfolgerung interner API-Informationen 通过推断内部API信息加强具体项目法规的完成 2507.20888v1
  • 79 07-28 Search-Based Fuzzing For RESTful APIs That Use MongoDB Suchbasiertes Fuzzing für RESTful APIs, die MongoDB verwenden 针对使用 MongoDB 的 RESTful API 的基于搜索的模糊测试 2507.20848v1
  • 80 07-28 Client–Library Compatibility Testing with API Interaction Snapshots Client-Bibliothek Kompatibilitätstest mit API-Interaktions-Snapshots 客户- Library 兼容性测试与 API 互动抓图 2507.20814v1
  • 81 07-28 LLM-Based Repair of Static Nullability Errors LLM-basierte Reparatur von statischen Nullierbarkeitsfehlern 基于LLM的静态可空性错误修复 2507.20674v1
  • 82 07-28 Testora: Using Natural Language Intent to Detect Behavioral Regressions Testora: Mit Hilfe der natürlichen Sprache intent, um Verhaltensregressionen zu erkennen 测试:使用自然语言意图检测行为倒退 2503.18597v2
  • 83 07-28 Intention-Driven Generation of Project-Specific Test Cases Intentionsgetriebene Generierung projektspezifischer Testfälle 项目具体试验个案的有意和有意生成 2507.20619v1
  • 84 07-28 Refactoring Deep Learning Code: A Study of Practices and Unsatisfied Tool Needs Refactoring Deep Learning Code: Ein Studium von Praktiken und unzufriedenen Werkzeugbedürfnissen 重构深层学习守则:实践和不满意工具需要研究 2405.04861v2
  • 85 07-28 GeoJSEval: An Automated Evaluation Framework for Large Language Models on JavaScript-Based Geospatial Computation and Visualization Code Generation GeoJSEval: Ein automatisiertes Evaluations-Framework für große Sprachmodelle auf JavaScript-Basis Geospatial Computation and Visualization Code Generierung GeoJSEval: JavaScript基于地理空间计算和可视化代码生成大语言模型自动评价框架 2507.20553v1
  • 86 07-28 Adaptive and Accessible User Interfaces for Seniors Through Model-Driven Engineering Adaptive und zugängliche Benutzeroberflächen für Senioren durch modellgetriebene Technik 通过模型驱动工程为老年人提供适应性和无障碍用户界面 2502.18828v2
  • 87 07-28 VDGraph: A Graph-Theoretic Approach to Unlock Insights from SBOM and SCA Data VDGraph: Ein graphisch-theoretischer Ansatz, um Einsichten aus SBOM- und SCA-Daten zu entsperren VDGraph: SBOM 和 SCA 数据解锁透视的图形理论方法 2507.20502v1
  • 88 07-28 Distinguishing Quantum Software Bugs from Hardware Noise: A Statistical Approach Distinguishing Quantum Software Bugs from Hardware Noise: Ein statistischer Ansatz 将量子软件错误与硬件噪音区分开:统计方法 2507.20475v1
  • 89 07-27 (7) When Prompts Go Wrong: Evaluating Code Model Robustness to Ambiguous, Contradictory, and Incomplete Task Descriptions Wenn Prompts falsch gehen: Bewertung von Code-Modell Robustheit zu Ambigued, Contradictory und Unvollständige Aufgabenbeschreibungen 当提示出错时: 评估代码模型的强度, 使之与模糊、 矛盾和不完成的任务描述 2507.20439v1
  • 90 07-27 Testing Is Not Boring: Characterizing Challenge in Software Testing Tasks Testen ist nicht langweilig: Charakterisierende Herausforderung bei Software-Testaufgaben 测试并非无足轻重:在软件测试任务中突出挑战 2507.20407v1
  • 91 07-27 How to Save My Gas Fees: Understanding and Detecting Real-world Gas Issues in Solidity Programs Wie ich meine Gasgebühren erspare: Verstehen und Erkennen realer Gasprobleme in Soliditätsprogrammen 如何节省我的煤气费:了解和检测实世天然气在实实在在方案中的问题 2403.02661v2
  • 92 07-27 CIgrate: Automating CI Service Migration with Large Language Models CIgrate: Automatisierung der CI-Service-Migration mit großen Sprachmodellen CIgrate: 以大语言模式实现CI服务迁移自动化 2507.20402v1
  • 93 07-27 Software Fairness Testing in Practice Software Fairness-Tests in der Praxis 实践中软件公平测试 2506.17095v2
  • 94 07-27 BOOP: Write Right Code BOOP: Schreiben Sie den richtigen Code BOOP:编写正确的代码 2507.22085v1
  • 95 07-27 Beyond Binary Moderation: Identifying Fine-Grained Sexist and Misogynistic Behavior on GitHub with Large Language Models Beyond Binary Moderation: Fine-Grained Sexist und Misogynistic Behavior auf GitHub mit großen Sprachmodellen identifizieren 超越二进制温度: 识别高语言模式的吉特胡布人中精美的、有性别色彩的和有偏见的行为 2507.20358v1
  • 96 07-27 Strategic Motivators for Ethical AI System Development: An Empirical and Holistic Model Strategische Motivatoren für ethische KI-Systementwicklung: Ein empirisches und ganzheitliches Modell 道德与伦理合作系统发展战略动力器:经验和整体模式 2507.20218v1
  • 97 07-27 Testing Autonomous Driving Systems – What Really Matters and What Doesn’t Autonome Fahrsysteme testen – Was wirklich zählt und was nicht 测试自动驾驶系统 – 真正重要和不重要的东西 2507.13661v2
  • 98 07-27 Relating System Safety and Machine Learnt Model Performance Über den Zusammenhang von Systemsicherheit und der Leistung maschinell gelernter Modelle 关联系统安全与机器学习模型性能 2507.20135v1
  • 99 07-27 From Prompt to Pipeline: Large Language Models for Scientific Workflow Development in Bioinformatics Von Prompt zur Pipeline: Große Sprachmodelle für die wissenschaftliche Workflow-Entwicklung in der Bioinformatik 从提示到管道:生物信息学科学工作流程发展大语言模式 2507.20122v1
  • 100 07-27 Learning to Align Human Code Preferences Human Code-Vorlieben ausrichten lernen 学习调整人类法典的首选 2507.20109v1
  • 101 07-27 From First Use to Final Commit: Studying the Evolution of Multi-CI Service Adoption Vom ersten Einsatz bis zum endgültigen Commit: Studieren der Evolution der Multi-CI-Service-Adoption 从首次使用到最后提交:研究采用多种CI服务的发展演变 2507.20095v1
  • 102 07-26 (6) The Effect of Pointer Analysis on Semantic Conflict Detection Die Wirkung der Pointer-Analyse auf die semantische Konflikterkennung 指针分析对语义冲突探测的影响 2507.20081v1
  • 103 07-26 Selective Prompt Anchoring for Code Generation Selektive Prompt-Ankerung für die Code-Generierung 代码生成的选择性提示锚定 2408.09121v6
  • 104 07-26 PDLogger: Automated Logging Framework for Practical Software Development PDLogger: Automatisiertes Logging-Framework für die praktische Softwareentwicklung PDLogger:实用软件开发自动记录框架 2507.19951v1
  • 105 07-26 Prometheus: Unified Knowledge Graphs for Issue Resolution in Multilingual Codebases Prometheus: Unified Knowledge Graphs for Issue Resolution in Multilingual Codebases Prometheus:多语言代码库解决问题的统一知识图 2507.19942v1
  • 106 07-26 The Impact of Fine-tuning Large Language Models on Automated Program Repair Die Auswirkungen von Feinabstimmungen großer Sprachmodelle auf die automatisierte Programmreparatur 微调大语言模型对自动方案维修的影响 2507.19909v1
  • 107 07-26 CrossPL: Evaluating Large Language Models on Cross Programming Language Code Generation CrossPL: Bewertung großer Sprachmodelle bei der Erzeugung von Cross Programming Language Code 交叉语言:评估关于编制跨方案拟订语言代码的大型语言模式 2507.19904v1
  • 108 07-26 AgentMesh: A Cooperative Multi-Agent Generative AI Framework for Software Development Automation AgentMesh: Ein kooperativer Multi-Agent Generativer KI-Rahmen für Software-Entwicklungsautomatisierung AgentMesh:软件开发自动化多生化合作框架 2507.19902v1
  • 109 07-26 MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation MultiKernelBench: Ein Multi-Platform Benchmark für die Kernel-Generierung MultiKernelBench: 核心生成的多平台基准 2507.17773v2
  • 110 07-26 A Cooperative Approach for Knowledge-based Business Process Design in a Public Authority Ein kooperativer Ansatz für wissensbasiertes Business Process Design in einer öffentlichen Behörde 公共当局知识商业程序设计合作办法 2507.19842v1
  • 111 07-26 From Few-Label to Zero-Label: An Approach for Cross-System Log-Based Anomaly Detection with Meta-Learning Von wenigem zum Null-Label: Ein Ansatz für systemübergreifende Log-basierte Anomalienerkennung mit Meta-Learning 从少标点到零标点:跨系统、基于日志的反常检测和元学习的方法 2507.19806v1
  • 112 07-26 Automated Synthesis of Formally Verified Multi-Abstraction Function Summaries Automatisierte Synthese von formal verifizierten Multi-Abstraktions-Funktionszusammenfassungen 正式核证的多种吸管功能摘要自动综合 2506.09550v3
  • 113 07-26 NeuSemSlice: Towards Effective DNN Model Maintenance via Neuron-level Semantic Slicing NeuSemSlice: Auf dem Weg zu einer effektiven DNN-Modellpflege über Semantisches Schneiden auf Neuron-Ebene NeuSemSlice:通过神经元级语义切片实现有效的 DNN 模型维护 2407.20281v2
  • 114 07-26 Defining ethically sourced code generation Definition der ethisch bedingten Codegenerierung 界定道德源代码生成 2507.19743v1
  • 115 07-26 Clean Code In Practice: Challenges and Opportunities Sauberer Code in der Praxis: Herausforderungen und Chancen 《清洁守则》实践:挑战与机遇 2507.19721v1
  • 116 07-25 (5) Refactoring $\neq$ Bug-Inducing: Improving Defect Prediction with Code Change Tactics Analysis Refactoring $\neq$ Bug-Inducing: Verbesserung der Fehlervorhersage mit Code Change Tactics Analyse 重构 $\neq$ 臭虫诱导:用代码改变策略分析改进缺陷预测 2507.19714v1
  • 117 07-25 LastMerge: A language-agnostic structured tool for code integration LastMerge: Ein sprach-agnostisches strukturiertes Tool für die Codeintegration LastMerge:一个语言不可知的代码集成结构化工具 2507.19687v1
  • 118 07-25 GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning GEPA: Reflektierende Prompt-Evolution kann Verstärkungs-Lernen übertreffen GEPA: 反思即时进化能够超过成绩的强化学习 2507.19457v1
  • 119 07-25 An OpenSource CI/CD Pipeline for Variant-Rich Software-Defined Vehicles Eine OpenSource CI/CD Pipeline für Variant-Rich Software-definierte Fahrzeuge 变式Rich软件定型车辆的开源CI/CD管道 2507.19446v1
  • 120 07-25 Resolving Build Conflicts via Example-Based and Rule-Based Program Transformations Lösen von Build-Konflikten über examplebasierte und regelbasierte Programmtransformationen 通过基于实例和基于规则的方案转型解决冲突 2507.19432v1
  • 121 07-25 CodeEvo: Interaction-Driven Synthesis of Code-centric Data through Hybrid and Iterative Feedback CodeEvo: Interaktionsgetriebene Synthese codezentrierter Daten durch hybrides und iteratives Feedback 代码化:通过混合和循环反馈对以代码为中心的数据进行互动驱动合成 2507.22080v1
  • 122 07-25 SDVDiag: A Modular Platform for the Diagnosis of Connected Vehicle Functions SDVDiag: Modulare Plattform für die Diagnose von vernetzten Fahrzeugfunktionen SDVDiag: 连接车辆功能诊断模块平台 2507.19403v1
  • 123 07-25 ReCatcher: Towards LLMs Regression Testing for Code Generation ReCatcher: Auf dem Weg zu LLMs Regressionstests für die Codegenerierung Recatcher: 转向为代码生成而进行回归测试的LLMs 2507.19390v1
  • 124 07-25 Mut4All: Fuzzing Compilers via LLM-Synthesized Mutators Learned from Bug Reports Mut4All: Fuzzing von Compilern über LLM-synthetisierte, aus Fehlerberichten gelernte Mutatoren Mut4All: 通过从缺陷报告中学习的 LLM 合成变异器对编译器进行模糊测试 2507.19275v1
  • 125 07-25 Fine-Tuning Multilingual Language Models for Code Review: An Empirical Study on Industrial C# Projects Fine-Tuning Mehrsprachige Sprachmodelle für Code Review: Eine empirische Studie zu industriellen C#-Projekten 用于代码审查的精美多语言语言模式:工业C#项目经验研究 2507.19271v1
  • 126 07-25 Exploring the Use of LLMs for Requirements Specification in an IT Consulting Company Erforschung der Verwendung von LLMs für Anforderungen Spezifikation in einer IT-Beratungsgesellschaft 探索在IT咨询公司中如何使用按要求规格要求的LLMS 2507.19113v1
  • 127 07-25 Emerging Trends in Software Architecture from the Practitioners Perspective: A Five Year Review Aufkommende Trends in der Softwarearchitektur aus der Perspektive der Praktizierenden: Ein Fünf-Jahres-Bericht 从从从业人员角度看软件架构的新趋势:五年审查 2507.14554v2
  • 128 07-25 SESR-Eval: Dataset for Evaluating LLMs in the Title-Abstract Screening of Systematic Reviews SESR-Eval: Datensatz zur Bewertung von LLMs im Titel-Abstract Screening von Systematischen Bewertungen SESR-Eval:系统审查标题摘要筛选中评价LLMs的数据集 2507.19027v1
  • 129 07-25 An Enumerative Embedding of the Python Type System in ACL2s Eine Enumerative Einbettung des Python-Typsystems in ACL2s Python型系统在ACL2中的插图嵌入 2507.19015v1
  • 130 07-25 Towards Bug-Free Distributed Go Programs Auf dem Weg zu fehlerfreien verteilten Go-Programmen 迈向无臭虫分配方案 2506.15135v2
  • 131 07-25 CoCoEvo: Co-Evolution of Programs and Test Cases to Enhance Code Generation CoCoEvo: Co-Evolution von Programmen und Testfällen zur Verbesserung der Codegenerierung CoCoEvo: 程序和测试用例的协同进化,以增强代码生成 2502.10802v2
  • 132 07-25 Classifying Issues in Open-source GitHub Repositories Einordnung von Problemen in Open-Source GitHub-Repositories 开放源码 GitHub 存储库中问题的分类 2507.18982v1
  • 133 07-25 Do Existing Testing Tools Really Uncover Gender Bias in Text-to-Image Models? Enthüllen bestehende Testtools Gender-Bias wirklich in Text-to-Image-Modellen? 现有测试工具是否真的在文本到图像模型中排除性别偏见? 2501.15775v2
  • 134 07-25 SLICEMATE: Accurate and Scalable Static Program Slicing via LLM-Powered Agents SLICEMATE: Genaues und skalierbares statisches Programm-Slicing über LLM-Powered Agents SLICEMATE:通过LLM驱动的代理进行准确和可扩展的静态程序切片 2507.18957v1
  • 135 07-25 Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity Messung des Einflusses der frühen-2025 KI auf erfahrene Open-Source-Entwicklerproduktivität 衡量2025年初AI(AI)对经验丰富的开放源码开发者生产力的影响 2507.09089v2
  • 136 07-24 (4) Observing Fine-Grained Changes in Jupyter Notebooks During Development Time Beobachten feinkörniger Änderungen in Jupyter-Notebooks während der Entwicklungszeit 发展时期黄极笔记本中观察到的微小变化 2507.15831v2
  • 137 07-24 Exploring the Jupyter Ecosystem: An Empirical Study of Bugs and Vulnerabilities Erforschung des Jupyter-Ökosystems: Eine empirische Studie über Bugs und Schwachstellen 探索黄极生态系统:关于虫子和脆弱性的经验研究 2507.18833v1
  • 138 07-24 MemoCoder: Automated Function Synthesis using LLM-Supported Agents MemoCoder: Automatisierte Funktionssynthese mit LLM-unterstützten Agenten MemoCoder:使用LLM支持的代理器自动功能合成 2507.18812v1
  • 139 07-24 MetaSel: A Test Selection Approach for Fine-tuned DNN Models MetaSel: Ein Testauswahlverfahren für fein abgestimmte DNN-Modelle MetaSel: 微调 DNN 模型的测试选择方法 2503.17534v3
  • 140 07-24 Decompiling Rust: An Empirical Study of Compiler Optimizations and Reverse Engineering Challenges Decompiling Rust: Eine empirische Studie über Compiler-Optimierungen und Reverse Engineering-Herausforderungen 反编译 Rust:关于编译器优化和逆向工程挑战的经验性研究 2507.18792v1
  • 141 07-24 Initial Steps in Integrating Large Reasoning and Action Models for Service Composition Erste Schritte bei der Integration großer Vernunft- und Handlungsmodelle für die Servicezusammensetzung 整合服务构成大理由和行动模式的初步步骤 2507.18775v1
  • 142 07-24 Agentic Program Repair from Test Failures at Scale: A Neuro-symbolic approach with static analysis and test execution feedback Agentische Programm-Reparatur von Testfehlern im Maßstab: Ein neuro-symbolischer Ansatz mit statischer Analyse und Test-Ausführungs-Feedback 大规模测试失败的代理式程序修复:采用静态分析和测试执行反馈的神经符号方法 2507.18755v1
  • 143 07-24 HLSTester: Efficient Testing of Behavioral Discrepancies with LLMs for High-Level Synthesis HLSTester: Effiziente Prüfung von Verhaltensdiskrepanzen mit LLMs für High-Level-Synthese HLSTester: 利用LLM高效测试高层次综合中的行为差异 2504.14641v3
  • 144 07-24 Exploring the Landscape of Fairness Interventions in Software Engineering Erforschung der Landschaft der Fairness-Interventionen in der Software-Engineering 探索软件工程中公平干预的景观 2507.18726v1
  • 145 07-24 AccessGuru: Leveraging LLMs to Detect and Correct Web Accessibility Violations in HTML Code AccessGuru: LLMs zur Erkennung und korrekten Web-Zugänglichkeit von Verstößen im HTML-Code nutzen AccessGuru:利用LLMS检测和纠正HTML代码中违反网络无障碍的情况 2507.19549v1
  • 146 07-24 3D Software Synthesis Guided by Constraint-Expressive Intermediate Representation 3D-Software-Synthese geführt durch eingeschränkt-expressive Zwischendarstellung 3D 由限制性中等代表制指导的软件合成 2507.18625v1
  • 147 07-24 OpenCAMS: An Open-Source Connected and Automated Mobility Co-Simulation Platform for Advancing Next-Generation Intelligent Transportation Systems Research OpenCAMS: Eine Open-Source vernetzte und automatisierte Mobilitäts-Co-Simulationsplattform für die Weiterentwicklung der Forschung für intelligente Transportsysteme der nächsten Generation OpenCAMS: 推进下一轮智能运输系统研究的开放源码连接和自动化流动联合模拟平台 2507.09186v3
  • 148 07-24 Are AI-Generated Fixes Secure? Analyzing LLM and Agent Patches on SWE-bench Sind KI-erzeugte Fixes sicher? LLM und Agent Patches auf der SWE-Bench analysieren AI - 具有安全性吗? 分析SWE-bench 上的LLM 和代理补丁 2507.02976v2
  • 149 07-24 On the Structure and Semantics of Identifier Names Containing Closed Syntactic Category Words Über die Struktur und Semantik von Identifier-Namen, die geschlossene syntaktische Kategorie Wörter enthalten 关于含有闭合同步词类的标识名称的结构和语义 2505.18444v4
  • 150 07-24 A Deep Dive into Retrieval-Augmented Generation for Code Completion: Experience on WeChat Ein tiefer Tauchgang in die retrieval-angereicherte Generation zur Code-Vervollständigung: Erfahrung auf WeChat 为完成代码的完成而深入挖掘回收的一代人:关于 WeChat 的经验 2507.18515v1
  • 151 07-24 Automated Code Review Using Large Language Models with Symbolic Reasoning Automatisierte Code-Überprüfung mit großen Sprachmodellen mit symbolischer Begründung 使用有符号理由的大语言模型的自动码审查 2507.18476v1
  • 152 07-24 Exploring and Evaluating Interplays of BPpy with Deep Reinforcement Learning and Formal Methods Erforschung und Auswertung von Interplays von BPpy mit Deep Reinforcement Learning und Formal Methods 探索和评价与深强化学习和正规方法的BPpy的相互作用 2501.15480v2
  • 153 07-24 It is Giving Major Satisfaction: Why Fairness Matters for Software Practitioners Es gibt große Zufriedenheit: Warum Fairness für Software-Praktiker wichtig ist 它给予重大满意:为什么软件从业人员的公平问题? 2410.02482v5
  • 154 07-24 FMI Meets SystemC: A Framework for Cross-Tool Virtual Prototyping FMI trifft SystemC: Ein Rahmen für das Cross-Tool Virtual Prototyping FMI 满足系统C:跨工具虚拟原型框架 2507.18339v1
  • 155 07-24 LLMShot: Reducing snapshot testing maintenance via LLMs LLMShot: Reduzierung der Snapshot-Test-Wartung über LLMs LLMShot:通过LLM减少快照测试维护 2507.10062v2
  • 156 07-24 Gotta catch ‘em all! Towards File Localisation from Issues at Large Ich muss sie alle fangen! Auf dem Weg zur Dateilokalisierung von Themen im Großen und Ganzen 必须抓住他们所有人! 2507.18319v1
  • 157 07-24 YATE: The Role of Test Repair in LLM-Based Unit Test Generation YATE: Die Rolle der Testreparatur bei der LLM-basierten Einheiten-Testgenerierung YATE:在以LLM为基础的单位试验生成中测试修理的作用 2507.18316v1
  • 158 07-24 Scheduzz: Constraint-based Fuzz Driver Generation with Dual Scheduling Scheduzz: Fuzz Driver Generation mit Dual Scheduling Scheduzz:基于约束的双重调度模糊测试驱动生成 2507.18289v1
  • 159 07-24 An Empirical Study on Embodied Artificial Intelligence Robot (EAIR) Software Bugs Eine empirische Studie über körpereigene Software-Fehler im Bereich Künstliche Intelligenz von Robotern (EAIR) 关于人造人工智能机器人(EAIR)软件虫的经验研究 2507.18267v1
  • 160 07-24 GenAI for Automotive Software Development: From Requirements to Wheels GenAI für die Entwicklung von Automotive-Software: Von Anforderungen bis zu Rädern GENAI 汽车软件开发GENAI:从要求到轮子 2507.18223v1
  • 161 07-24 SMECS: A Software Metadata Extraction and Curation Software SMECS: Eine Software Metadata Extraktions- und Kurationssoftware SMECS:软件元数据抽取和整理软件 2507.18159v1
  • 162 07-24 When Retriever Meets Generator: A Joint Model for Code Comment Generation Wenn Retriever trifft Generator: Ein gemeinsames Modell für Code Comment Generation 当再利用与生成器相遇时: 代码Comment生成联合模式 2507.12558v2
  • 163 07-24 NoCode-bench: A Benchmark for Evaluating Natural Language-Driven Feature Addition NoCode-Bench: Ein Benchmark für die Bewertung der Erweiterung natürlicher sprachgetriebener Funktionen NoCode-bench:评估自然语言驱动的功能添加的基准 2507.18130v1
  • 164 07-24 OrQstrator: An AI-Powered Framework for Advanced Quantum Circuit Optimization OrQstrator: Ein KI-Powered-Framework für erweiterte Quantenschaltungsoptimierung OrQstrator:AI驱动的高级量子电路优化框架 2507.09682v2
  • 165 07-24 Understanding the Supply Chain and Risks of Large Language Model Applications Verständnis der Supply Chain und Risiken von Großsprachenmodellanwendungen 了解供应链和大语言模式应用的风险 2507.18105v1
  • 166 07-24 Identifier Name Similarities: An Exploratory Study Identifier Name Ähnlichkeiten: Eine Sondierungsstudie 说明性名称 相似点:探索性研究 2507.18081v1
  • 167 07-24 An Empirical Study of Complexity, Heterogeneity, and Compliance of GitHub Actions Workflows Eine empirische Studie über Komplexität, Heterogenität und Compliance von GitHub-Maßnahmen 关于 “ 吉特胡布行动 “ 的复杂性、异质性和合规性的经验研究 2507.18062v1
  • 168 07-24 SAVANT: Vulnerability Detection in Application Dependencies through Semantic-Guided Reachability Analysis SAVANT: Sicherheitserkennung in Anwendungsabhängigkeiten durch Semantik-geführte Reichweitenanalyse SAVANT: 通过语义辅助控制可达性分析,在应用依赖性中发现脆弱性 2506.17798v2
  • 169 07-24 Factors Impacting Faculty Adoption of Project-Based Learning in Computing Education: a Survey Faktoren, die die Fakultät beeinflussen Adoption des projektbasierten Lernens in der Computerausbildung: eine Umfrage 影响学院在计算机教育中采用基于项目学习:调查 2507.18039v1
  • 170 07-24 Your ATs to Ts: MITRE ATT&CK Attack Technique to P-SSCRM Task Mapping Ihre ATs zu Ts: MITRE ATT&CK Angriffstechnik zu P-SSCRM Task Mapping 您的ATs to Ts: MITRE ATT和CK 攻击技术到 P-SSCRM任务绘图 2507.18037v1
  • 171 07-24 An Empirical Study of GenAI Adoption in Open-Source Game Development: Tools, Tasks, and Developer Challenges Eine empirische Studie zur GenAI-Adoption in der Open-Source-Spielentwicklung: Werkzeuge, Aufgaben und Entwickler-Herausforderungen GENAI采用开放源码游戏开发的经验研究:工具、任务和开发者的挑战 2507.18029v1

Article 0

Title@2025-07-31 (4): An Empirical Study on the Amount of Changes Required for Merge Request Acceptance

Title: An Empirical Study on the Amount of Changes Required for Merge Request Acceptance Eine empirische Studie über die Menge der Änderungen, die für die Annahme von Merge-Anfragen erforderlich sind 关于合并申请接受所需变动数额的经验研究 2507.23640v1

Authors (4): Samah Kansab, Mohammed Sayagh, Francis Bordeleau, Ali Tizghadam

Code review (CR) is essential to software development, helping ensure that new code is properly integrated. However, the CR process often involves significant effort, including code adjustments, responses to reviewers, and continued implementation. While past studies have examined CR delays and iteration counts, few have investigated the effort based on the volume of code changes required, especially in the context of GitLab Merge Requests (MRs), which remains underexplored. In this paper, we define and measure CR effort as the amount of code modified after submission, using a dataset of over 23,600 MRs from four GitLab projects. We find that up to 71% of MRs require adjustments after submission, and 28% of these involve changes to more than 200 lines of code. Surprisingly, this effort is not correlated with review time or the number of participants. To better understand and predict CR effort, we train an interpretable machine learning model using metrics across multiple dimensions: text features, code complexity, developer experience, review history, and branching. Our model achieves strong performance (AUC 0.84-0.88) and reveals that complexity, experience, and text features are key predictors. Historical project characteristics also influence current review effort. Our findings highlight the feasibility of using machine learning to explain and anticipate the effort needed to integrate code changes during review.

代码审查(CR)对软件开发至关重要,有助于确保新代码得到妥善集成。然而,CR过程往往需要大量工作,包括调整代码、回复审查者以及继续实现。以往研究考察了CR的延迟和迭代次数,但很少有研究从所需代码变更量的角度考察这种工作量,尤其是在GitLab合并请求(MR)的背景下,这一方向仍未得到充分探索。在本文中,我们将CR工作量定义为提交后修改的代码量并加以度量,使用的数据集包含来自四个GitLab项目的23,600多个MR。我们发现,多达71%的MR在提交后需要调整,其中28%涉及对200多行代码的修改。令人惊讶的是,这种工作量与审查时间或参与者人数并不相关。为了更好地理解和预测CR工作量,我们利用多个维度的指标训练了一个可解释的机器学习模型,这些维度包括文本特征、代码复杂度、开发者经验、审查历史和分支。我们的模型取得了良好的性能(AUC 0.84-0.88),并表明复杂度、经验和文本特征是关键预测因素。项目的历史特征也会影响当前的审查工作量。我们的研究结果表明,利用机器学习来解释和预估审查期间集成代码变更所需的工作量是可行的。
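
The abstract defines CR effort as the amount of code modified after submission. A minimal sketch of that measurement is shown below, assuming each commit of an MR carries a timestamp and added/deleted line counts. The `Commit` type, the `review_effort` function, and the sample numbers are hypothetical illustrations, not the authors' tooling, and the paper may count modified lines differently.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable


@dataclass
class Commit:
    """Hypothetical record of one commit belonging to a merge request."""
    authored_at: datetime
    lines_added: int
    lines_deleted: int


def review_effort(submitted_at: datetime, commits: Iterable[Commit]) -> int:
    """Sum the lines added and deleted by commits made after the MR was submitted."""
    return sum(
        c.lines_added + c.lines_deleted
        for c in commits
        if c.authored_at > submitted_at
    )


# Illustrative data: one commit before submission, two during review.
submitted = datetime(2025, 1, 10, 12, 0)
commits = [
    Commit(datetime(2025, 1, 9, 9, 0), 50, 5),     # before submission, ignored
    Commit(datetime(2025, 1, 11, 15, 0), 120, 30),  # review-time adjustment
    Commit(datetime(2025, 1, 12, 10, 0), 60, 15),   # review-time adjustment
]

effort = review_effort(submitted, commits)
exceeds_200 = effort > 200  # the abstract's "more than 200 lines" threshold
print(effort, exceeds_200)
```

A label such as `exceeds_200` is one way the effort could be turned into a prediction target; whether the study uses this exact binarisation is not stated in the abstract.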


Article 1

Title@2025-07-31 (4): Using weakest application conditions to rank graph transformations for graph repair

Title: Using weakest application conditions to rank graph transformations for graph repair Verwendung schwächster Anwendungsbedingungen, um Graphentransformationen für Graphenreparatur zu ordnen 使用最弱应用条件来排行图形变形以进行图形修理 2405.08788v3

Authors (5): Lars Fritsche, Alexander Lauer, Maximilian Kratz, Andy Schürr, Gabriele Taentzer

When using graphs and graph transformations to model systems, consistency is an important concern. While consistency has primarily been viewed as a binary property, i.e., a graph is consistent or inconsistent with respect to a set of constraints, recent work has presented an approach to consistency as a graduated property. This allows living with inconsistencies for a while and repairing them when necessary. For repairing inconsistencies in a graph, we use graph transformation rules with so-called impairment-indicating and repair-indicating application conditions to understand how much repair gain certain rule applications would bring. Both types of conditions can be derived from given graph constraints. Our main theorem shows that the difference between the number of actual constraint violations before and after a graph transformation step can be characterized by the difference between the numbers of violated impairment-indicating and repair-indicating application conditions. This theory forms the basis for algorithms with look-ahead that rank graph transformations according to their potential for graph repair. An evaluation shows that graph repair can be well supported by rules with these new types of application conditions in terms of effectiveness and scalability.

在使用图和图变换对系统建模时,一致性是一个重要问题。一致性以往主要被视为一种二元性质,即一个图相对于一组约束要么一致、要么不一致,而近期的工作提出了将一致性视为渐进性质的方法。这使得我们可以暂时容忍不一致,并在必要时加以修复。为了修复图中的不一致,我们使用带有所谓“损害指示型和修复指示型应用条件”的图变换规则,以了解某些规则应用能带来多少修复收益。这两类条件都可以从给定的图约束中导出。我们的主要定理表明,图变换步骤前后实际约束违反数量之差,可以由被违反的损害指示型应用条件数量与修复指示型应用条件数量之差来刻画。该理论构成了具有前瞻能力的算法的基础,这类算法根据图变换的修复潜力对其进行排序。评估表明,带有这些新型应用条件的规则能够在有效性和可扩展性方面很好地支持图修复。


Article 2

Title@2025-07-31 (4): Testing Compositionality

Title: Testing Compositionality Prüfung der Zusammensetzung 测试的构成性 2407.05028v3

Authors (3): Gijs van Cuyck, Lars van Arragon, Jan Tretmans

Compositionality supports the manipulation of large systems by working on their components. For model-based testing, this means that large systems can be tested by modelling and testing their components: passing tests for all components implies passing tests for the whole system. In previous work, we defined mutual acceptance for specification models and proved that this property is a sufficient condition for compositionality in model-based testing. In this paper, we present three main algorithms for using mutual acceptance in practice. First, we can verify mutual acceptance on specifications, proving compositionality for all valid implementations. Second, we give a sound and exhaustive model-based testing procedure which checks mutual acceptance on a specific black-box implementation. The result is that testing the correctness of large systems can be decomposed into testing the component implementations for uioco conformance to their specifications, and testing for environmental conformance to the specifications of their environment. Finally, we optimise this procedure further by utilizing the constraints imposed by multiple specifications at the same time. These three algorithms together allow picking the most suitable approach for a given situation, trading in more generalizable results for faster runtime by optimising for a specific context as desired.

组合性支持通过处理组件来操纵大型系统。对于基于模型的测试,这意味着大型系统可以通过对其组件建模和测试来进行测试:所有组件通过测试即意味着整个系统通过测试。在以往的工作中,我们为规格模型定义了相互接受(mutual acceptance),并证明这一性质是基于模型的测试中组合性的充分条件。在本文中,我们提出了在实践中使用相互接受的三种主要算法。首先,我们可以在规格上验证相互接受,从而为所有有效实现证明组合性。其次,我们给出了一个可靠且穷尽的基于模型的测试过程,用于在特定的黑盒实现上检查相互接受。其结果是,对大型系统正确性的测试可以分解为:测试各组件实现是否与其规格满足uioco一致性,以及测试环境是否与其环境的规格一致。最后,我们通过同时利用多个规格所施加的约束来进一步优化该过程。这三种算法结合起来,使我们能够为给定情形选择最合适的方法,并根据需要针对特定上下文进行优化,以牺牲结果的通用性换取更快的运行时间。


Article 3

Title@2025-07-31 (4): Automated Code Review Using Large Language Models at Ericsson: An Experience Report

Title: Automated Code Review Using Large Language Models at Ericsson: An Experience Report Automatisierte Code-Überprüfung mit großen Sprachmodellen bei Ericsson: Ein Erfahrungsbericht Ericsson公司使用大语言模型的自动码审查:经验报告 2507.19115v2

Authors (8): Shweta Ramesh, Joy Bose, Hamender Singh, A K Raghavan, Sujoy Roychowdhury, Giriprasad Sridhara, Nishrith Saini, Ricardo Britto

Code review is one of the primary means of assuring the quality of released software along with testing and static analysis. However, code review requires experienced developers who may not always have the time to perform an in-depth review of code. Thus, automating code review can help alleviate the cognitive burden on experienced software developers allowing them to focus on their primary activities of writing code to add new features and fix bugs. In this paper, we describe our experience in using Large Language Models towards automating the code review process in Ericsson. We describe the development of a lightweight tool using LLMs and static program analysis. We then describe our preliminary experiments with experienced developers in evaluating our code review tool and the encouraging results.

代码审查是与测试和静态分析一起确保发布软件质量的主要手段之一,但代码审查需要有经验的开发商,他们不一定总有时间对代码进行深入审查。因此,自动代码审查有助于减轻有经验的软件开发商的认知负担,使他们能够专注于主要写代码活动,添加新的特征和修补错误。本文介绍我们在爱立信使用大语言模型实现代码审查过程自动化方面的经验。我们描述了利用LLMS和静态程序分析开发轻量级工具的情况。然后我们描述了我们在评估代码审查工具方面与有经验的开发商进行的初步实验以及令人鼓舞的结果。


Article 4

Title@2025-07-31 (4): Blended PC Peer Review Model: Process and Reflection

Title: Blended PC Peer Review Model: Process and Reflection Blended PC Peer Review Modell: Prozess und Reflexion PC 混合同行审查模式:进程和反思 2504.19105v2

Authors (5): Chakkrit Tantithamthavorn, Nicole Novielli, Ayushi Rastogi, Olga Baysal, Bram Adams

The academic peer review system is under increasing pressure due to a growing volume of submissions and a limited pool of available reviewers, resulting in delayed decisions and an uneven distribution of reviewing responsibilities. Building upon the International Conference on Mining Software Repositories (MSR) community’s earlier experience with a Shadow PC (2021 and 2022) and Junior PC (2023 and 2024), MSR 2025 experimented with a Blended Program Committee (PC) peer review model for its Technical Track. This new model pairs up one Junior PC member with two regular PC members as part of the core review team of a given paper, instead of adding them as an extra reviewer. This paper presents the rationale, implementation, and reflections on the model, including empirical insights from a post-review author survey evaluating the quality and usefulness of reviews. Our findings highlight the potential of a Blended PC to alleviate reviewer shortages, foster inclusivity, and sustain a high-quality peer review process. We offer lessons learned and recommendations to guide future adoption and refinement of the model.

学术同侪审查制度由于提交材料数量不断增加和现有审查人员数量有限而面临越来越大的压力,造成决定延迟和审查责任分配不均。根据采矿软件储存国际会议(MSR)社区以前在影子PC(2021和2022年)和初级PC(2023和2024年)方面的经验,2025年学术同侪审查制度试验了一个混合方案委员会(PC)技术轨道同侪审查模式,这一新模式将一名初级PC成员与两名普通PC成员作为特定文件核心审查小组的一部分,而不是作为额外审查人员增加这两个成员。本文介绍了该模式的理由、执行情况和反思,包括评估审查质量和效用的后审查作者调查的经验性见解。我们的调查结果强调,混合同侪审查方案有可能减轻审查人员的短缺,促进包容性,并维持高质量的同侪审查程序。我们提出了经验教训和建议,以指导今后采用和完善该模式。


Article 5

Title@2025-07-31 (4): PurpCode: Reasoning for Safer Code Generation

Title: PurpCode: Reasoning for Safer Code Generation PurpCode: Begründung für eine sicherere Code-Generierung PurpCode:更安全代码生成的理由 2507.19060v2

Authors (14): Jiawei Liu, Nirav Diwan, Zhe Wang, Haoyu Zhai, Xiaona Zhou, Kiet A. Nguyen, Tianjiao Yu, Muntasir Wahed, Yinlin Deng, Hadjer Benkraouda, Yuxiang Wei, Lingming Zhang, Ismini Lourentzou, Gang Wang

We introduce PurpCode, the first post-training recipe for training safe code reasoning models towards generating secure code and defending against malicious cyberactivities. PurpCode trains a reasoning model in two stages: (i) Rule Learning, which explicitly teaches the model to reference cybersafety rules to generate vulnerability-free code and to avoid facilitating malicious cyberactivities; and (ii) Reinforcement Learning, which optimizes model safety and preserves model utility through diverse, multi-objective reward mechanisms. To empower the training pipelines with comprehensive cybersafety data, we conduct internal red-teaming to synthesize comprehensive and high-coverage prompts based on real-world tasks for inducing unsafe cyberactivities in the model. Based on PurpCode, we develop a reasoning-based coding model, namely PurpCode-32B, which demonstrates state-of-the-art cybersafety, outperforming various frontier models. Meanwhile, our alignment method decreases the model overrefusal rates in both general and cybersafety-specific scenarios, while preserving model utility in both code generation and common security knowledge.

我们提出PurpCode,这是首个用于训练安全代码推理模型的后训练方案,旨在生成安全代码并防范恶意网络活动。PurpCode分两个阶段训练推理模型:(i)规则学习,明确教导模型参考网络安全规则,以生成无漏洞的代码并避免为恶意网络活动提供便利;(ii)强化学习,通过多样化的多目标奖励机制优化模型安全性并保持模型实用性。为了给训练流程提供全面的网络安全数据,我们开展内部红队测试,基于真实世界任务合成全面且高覆盖率的提示,用于诱导模型产生不安全的网络活动。基于PurpCode,我们开发了一个基于推理的代码模型PurpCode-32B,它展现了最先进的网络安全性,优于多种前沿模型。同时,我们的对齐方法降低了模型在通用场景和网络安全特定场景中的过度拒绝率,并在代码生成和通用安全知识两方面保持了模型实用性。


Article 6

Title@2025-07-31 (4): Dynamic and Static Analysis of Python Software with Kieker Including Reconstructed Architectures

Title: Dynamic and Static Analysis of Python Software with Kieker Including Reconstructed Architectures Dynamische und statische Analyse von Python-Software mit Kieker inklusive rekonstruierter Architekturen 使用Kieker 包括重建建筑的 Python 软件动态和静态分析 2507.23425v1

Authors (3): Daphné Larrivain, Shinhyung Yang, Wilhelm Hasselbring

The Kieker observability framework is a tool that provides users with the means to design a custom observability pipeline for their application. Kieker was originally tailored for Java, but extending its support to Python is worthwhile: Python’s popularity has exploded over the years, making structural insights into Python applications highly valuable. Our Python analysis pipeline combines static and dynamic analysis in order to build a complete picture of a given system.

Kieker可观测性框架是一个工具,为用户提供为其应用设计定制可观测性管道的手段。Kieker最初是为Java量身定制的,但让它支持Python是值得的。Python的受欢迎程度多年来急剧上升,从而使对Python应用的结构性洞察变得非常宝贵。我们的Python分析管道将静态分析和动态分析结合起来,以构建给定系统的完整图景。


Article 7

Title@2025-07-31 (4): Mokav: Execution-driven Differential Testing with LLMs

Title: Mokav: Execution-driven Differential Testing with LLMs Mokav: Execution-getriebene Differentialprüfung mit LLMs Mokav:由执行驱动的用LLMs进行差别测试 2406.10375v2

Authors (4): Khashayar Etemadi, Bardia Mohammadi, Zhendong Su, Martin Monperrus

It is essential to detect functional differences between programs in various software engineering tasks, such as automated program repair, mutation testing, and code refactoring. The problem of detecting functional differences between two programs can be reduced to searching for a difference-exposing test (DET): a test input that results in different outputs on the subject programs. In this paper, we propose Mokav, a novel execution-driven tool that leverages LLMs to generate DETs. Mokav takes two versions of a program (P and Q) and an example test input. When successful, Mokav generates a valid DET, a test input that leads to provably different outputs on P and Q. Mokav iteratively prompts an LLM with a specialized prompt to generate new test inputs. At each iteration, Mokav provides execution-based feedback from previously generated tests until the LLM produces a DET. We evaluate Mokav on 1,535 pairs of Python programs collected from the Codeforces competition platform and 32 pairs of programs from the QuixBugs dataset. Our experiments show that Mokav outperforms the state-of-the-art, Pynguin and Differential Prompting, by a large margin. Mokav can generate DETs for 81.7% (1,255/1,535) of the program pairs in our benchmark (versus 4.9% for Pynguin and 37.3% for Differential Prompting). We demonstrate that the iterative and execution-driven feedback components of the system contribute to its high effectiveness.

在自动程序修复、变异测试和代码重构等多种软件工程任务中,检测程序之间的功能差异至关重要。检测两个程序之间功能差异的问题可以归约为寻找差异暴露测试(DET):即一个使被测程序产生不同输出的测试输入。在本文中,我们提出Mokav,这是一种新颖的执行驱动工具,利用LLM生成DET。Mokav接收一个程序的两个版本(P和Q)以及一个示例测试输入。成功时,Mokav会生成一个有效的DET,即一个可证明会在P和Q上产生不同输出的测试输入。Mokav使用专门设计的提示迭代地提示LLM生成新的测试输入。在每次迭代中,Mokav提供来自先前生成测试的基于执行的反馈,直到LLM生成一个DET。我们在从Codeforces竞赛平台收集的1535对Python程序和来自QuixBugs数据集的32对程序上评估了Mokav。实验表明,Mokav大幅优于当前最先进的方法Pynguin和Differential Prompting。Mokav能够为基准中81.7%(1,255/1535)的程序对生成DET(相比之下,Pynguin为4.9%,Differential Prompting为37.3%)。我们证明,系统中的迭代组件和执行驱动反馈组件共同促成了其高有效性。
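
To make the notion of a difference-exposing test concrete, the sketch below checks candidate inputs against two program versions and returns the first input on which their observable behaviour differs. It is only an illustration of the DET definition: Mokav obtains its candidates from an LLM guided by execution feedback, whereas this sketch assumes a fixed candidate list. The names `run_safely`, `find_det`, `p_abs`, and `q_abs` are hypothetical.

```python
from typing import Any, Callable, Iterable, Optional, Tuple


def run_safely(program: Callable[[Any], Any], test_input: Any) -> Tuple[str, Any]:
    """Execute a program and record either its result or the raised exception type."""
    try:
        return ("ok", program(test_input))
    except Exception as exc:  # an exception is also observable behaviour
        return ("error", type(exc).__name__)


def find_det(
    p: Callable[[Any], Any],
    q: Callable[[Any], Any],
    candidates: Iterable[Any],
) -> Optional[Any]:
    """Return the first candidate on which p and q behave differently, else None."""
    for test_input in candidates:
        if run_safely(p, test_input) != run_safely(q, test_input):
            return test_input
    return None


# Hypothetical program pair: q_abs mishandles negative numbers.
def p_abs(x: int) -> int:
    return x if x >= 0 else -x


def q_abs(x: int) -> int:
    return x


det = find_det(p_abs, q_abs, [0, 3, -2, 5])
print(det)
```

In an execution-driven loop, the recorded outcomes of non-differing inputs would be the kind of feedback passed back to the generator before it proposes the next candidates.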


Article 8

Title@2025-07-31 (4): Trae Agent: An LLM-based Agent for Software Engineering with Test-time Scaling

Title: Trae Agent: An LLM-based Agent for Software Engineering with Test-time Scaling Trae Agent: Ein LLM-basierter Agent für Software Engineering mit Testzeitskalierung Trae Agent: 设在LLM的软件工程应用测试时间缩放软件工程代理 2507.23370v1

Authors (15): Trae Research Team, Pengfei Gao, Zhao Tian, Xiangxin Meng, Xinchen Wang, Ruida Hu, Yuanan Xiao, Yizhou Liu, Zhao Zhang, Junjie Chen, Cuiyun Gao, Yun Lin, Yingfei Xiong, Chao Peng, Xia Liu

Software issue resolution is a critical challenge in software engineering and has garnered increasing attention in recent years. With the rapid advancement of large language models (LLMs), substantial progress has been made in addressing real-world software engineering tasks. Recent studies have introduced ensemble reasoning techniques to enhance the performance of LLM-based issue resolution. However, existing prompting-based methods still face limitations in effectively exploring large ensemble spaces and lack the capacity for repository-level understanding, both of which constrain their overall effectiveness. In this paper, we propose Trae Agent, the first agent-based ensemble reasoning approach for repository-level issue resolution. Trae Agent formulates our goal as an optimal solution search problem and addresses two key challenges, i.e., large ensemble spaces and repository-level understanding, through modular agents for generation, pruning, and selection. We conduct extensive experiments using three leading LLMs on the widely-adopted SWE-bench benchmark, comparing Trae Agent against four state-of-the-art ensemble reasoning techniques. Experimental results demonstrate that Trae Agent consistently achieves superior performance, with an average improvement of 10.22% over all baselines in terms of Pass@1. Trae Agent has achieved first place on the SWE-bench Verified leaderboard, with a notable Pass@1 score of 75.20%. We are pleased to release Trae Agent as an open-source project to support the research community, with all resources available at https://github.com/bytedance/trae-agent.

软件问题解决是软件工程中的一项关键挑战,近年来受到越来越多的关注。随着大型语言模型(LLM)的快速发展,在处理真实世界软件工程任务方面已取得重大进展。近期研究引入了集成推理技术,以提升基于LLM的问题解决性能。然而,现有基于提示的方法在有效探索大型集成空间方面仍存在局限,并且缺乏仓库级理解能力,这两点都制约了其整体有效性。在本文中,我们提出Trae Agent,这是首个面向仓库级问题解决的基于智能体的集成推理方法。Trae Agent将我们的目标形式化为最优解搜索问题,并通过用于生成、剪枝和选择的模块化智能体来应对两个关键挑战,即大型集成空间和仓库级理解。我们在被广泛采用的SWE-bench基准上使用三个领先的LLM进行了大量实验,将Trae Agent与四种最先进的集成推理技术进行比较。实验结果表明,Trae Agent始终取得更优的性能,在Pass@1指标上相对于所有基线平均提升10.22%。Trae Agent在SWE-bench Verified排行榜上取得了第一名,Pass@1得分达到75.20%。我们很高兴将Trae Agent作为开源项目发布,以支持研究社区,所有资源可在 https://github.com/bytedance/trae-agent 获取。


Article 9

Title@2025-07-31 (4): REST API Testing in DevOps: A Study on an Evolving Healthcare IoT Application

Title: REST API Testing in DevOps: A Study on an Evolving Healthcare IoT Application REST API Testing in DevOps: Eine Studie über eine sich entwickelnde IoT-Anwendung im Gesundheitswesen 在DevOps进行的REST API测试:关于不断演变的卫生保健IOT应用的研究 2410.12547v2

Authors (3): Hassan Sartaj, Shaukat Ali, Julie Marie Gjøby

Healthcare Internet of Things (IoT) applications often integrate various third-party healthcare applications and medical devices through REST APIs, resulting in complex and interdependent networks of REST APIs. Oslo City’s healthcare department collaborates with various industry partners to develop such healthcare IoT applications enriched with a diverse set of REST APIs. Following the DevOps process, these REST APIs continuously evolve to accommodate evolving needs such as new features, services, and devices. Oslo City’s primary goal is to utilize automated solutions for continuous testing of these REST APIs at each evolution stage, thereby ensuring their dependability. Although the literature offers various automated REST API testing tools, their effectiveness in regression testing of the evolving REST APIs of healthcare IoT applications within a DevOps context remains undetermined. This paper evaluates state-of-the-art and well-established REST API testing tools, specifically, RESTest, EvoMaster, Schemathesis, RESTler, and RestTestGen, for the regression testing of a real-world healthcare IoT application, considering failures, faults, coverage, regressions, and cost. We conducted experiments using all accessible REST APIs (17 APIs with 120 endpoints), and 14 releases evolved during DevOps. Overall, all tools generated tests leading to several failures, 18 potential faults, up to 84% coverage, and 23 regressions. Over 70% of tests generated by all tools fail to detect failures, resulting in significant overhead.

医疗物联网(IoT)应用通常通过REST API集成各种第三方医疗应用和医疗设备,从而形成复杂且相互依赖的REST API网络。奥斯陆市的医疗部门与多个行业伙伴合作开发此类医疗物联网应用,这些应用包含多种多样的REST API。遵循DevOps流程,这些REST API不断演进,以适应新功能、新服务和新设备等不断变化的需求。奥斯陆市的首要目标是利用自动化解决方案,在每个演进阶段对这些REST API进行持续测试,从而确保其可靠性。尽管文献中提供了各种自动化REST API测试工具,但它们在DevOps背景下对医疗物联网应用中不断演进的REST API进行回归测试的有效性仍不明确。本文评估了最先进且成熟的REST API测试工具,具体包括RESTest、EvoMaster、Schemathesis、RESTler和RestTestGen,用于对一个真实的医疗物联网应用进行回归测试,并考察失败、故障、覆盖率、回归和成本。我们使用所有可访问的REST API(17个API,共120个端点)以及DevOps过程中演进出的14个发行版本进行了实验。总体而言,所有工具生成的测试共导致若干失败、18个潜在故障、最高84%的覆盖率以及23个回归。所有工具生成的测试中有超过70%未能检测到失败,造成了显著的开销。


Article 10

Title@2025-07-31 (4): SWE-Exp: Experience-Driven Software Issue Resolution

Title: SWE-Exp: Experience-Driven Software Issue Resolution SWE-Exp: Erfahrungsgetriebene Software-Ausgabeauflösung SWE-Exp:经验驱动的软件问题解决 2507.23361v1

Authors (10): Silin Chen, Shaoxin Lin, Xiaodong Gu, Yuling Shi, Heng Lian, Longfei Yun, Dong Chen, Weiguo Sun, Lin Cao, Qianxiang Wang

Recent advances in large language model (LLM) agents have shown remarkable progress in software issue resolution, leveraging advanced techniques such as multi-agent collaboration and Monte Carlo Tree Search (MCTS). However, current agents act as memoryless explorers, treating each problem separately without retaining or reusing knowledge from previous repair experiences. This leads to redundant exploration of failed trajectories and missed chances to adapt successful issue resolution methods to similar problems. To address this problem, we introduce SWE-Exp, an experience-enhanced approach that distills concise and actionable experience from prior agent trajectories, enabling continuous learning across issues. Our method introduces a multi-faceted experience bank that captures both successful and failed repair attempts. Specifically, it extracts reusable issue resolution knowledge at different levels, from high-level problem comprehension to specific code changes. Experiments show that SWE-Exp achieves state-of-the-art resolution rate (41.6% Pass@1) on SWE-bench-Verified under open-source agent frameworks. Our approach establishes a new paradigm in which automated software engineering agents systematically accumulate and leverage repair expertise, fundamentally shifting from trial-and-error exploration to strategic, experience-driven issue resolution.

大型语言模型(LLM)智能体的最新进展在软件问题解决方面取得了显著进步,它们利用了多智能体协作和蒙特卡洛树搜索(MCTS)等先进技术。然而,当前的智能体如同没有记忆的探索者,孤立地处理每个问题,既不保留也不复用以往修复经验中的知识。这导致对失败轨迹的重复探索,并错失了将成功的问题解决方法迁移到类似问题上的机会。为解决这一问题,我们提出SWE-Exp,这是一种经验增强的方法,能够从先前的智能体轨迹中提炼出简洁且可操作的经验,从而实现跨问题的持续学习。我们的方法引入了一个多层面的经验库,同时记录成功和失败的修复尝试。具体而言,它在不同层次上提取可复用的问题解决知识,从高层次的问题理解到具体的代码变更。实验表明,在开源智能体框架下,SWE-Exp在SWE-bench-Verified上取得了最先进的解决率(41.6% Pass@1)。我们的方法建立了一种新范式,使自动化软件工程智能体能够系统地积累和利用修复经验,从根本上由试错式探索转向战略性的、经验驱动的问题解决。


Article 11

Title@2025-07-31 (4): Quality Evaluation of COBOL to Java Code Transformation

Title: Quality Evaluation of COBOL to Java Code Transformation Qualitätsbewertung von COBOL zu Java Code Transformation COBOL到Java代码转换的质量评估 2507.23356v1

Authors (4): Shmulik Froimovich, Raviv Gal, Wesam Ibraheem, Avi Ziv

We present an automated evaluation system for assessing COBOL-to-Java code translation within IBM’s watsonx Code Assistant for Z (WCA4Z). The system addresses key challenges in evaluating LLM-based translators, including model opacity and the complexity of translation quality assessment. Our approach combines analytic checkers with LLM-as-a-judge (LaaJ) techniques to deliver scalable, multi-faceted evaluations. The system supports continuous integration workflows, enables large-scale benchmarking, and reduces reliance on manual review. We describe the system architecture, evaluation strategies, and reporting mechanisms that provide actionable insights for developers and project managers, facilitating the evolution of high-quality, modernized codebases.

我们在IBM Z(WCA4Z)的watsonx代码助理(WCA4Z)内部推出一个自动评价系统,用于评估COBOL-Java代码翻译。这个系统处理在评价基于LLM的笔译员方面的主要挑战,包括示范不透明性和翻译质量评估的复杂性。我们的方法是将分析式核对器与LLM-as-a-judge(LaaJ)技术结合起来,以提供可扩缩的、多层面的评价。这个系统支持连续的整合工作流程,能够进行大规模基准设定,并减少对人工审查的依赖。我们描述了为开发者和项目管理员提供可操作的洞见的系统结构、评价战略和报告机制,为高质量、现代化的代码库的发展提供了便利。


Article 12

Title@2025-07-31 (4): SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution

Title: SWE-Debate: Competitive Multi-Agent Debate for Software Issue Resolution SWE-Debatte: Wettbewerbsfähige Multi-Agenten-Debatte für die Lösung von Software-Problemen SWE-Debate:解决软件问题竞争性多机构辩论 2507.23348v1

Authors (9): Han Li, Yuling Shi, Shaoxin Lin, Xiaodong Gu, Heng Lian, Xin Wang, Yantao Jia, Tao Huang, Qianxiang Wang

Issue resolution has made remarkable progress thanks to the advanced reasoning capabilities of large language models (LLMs). Recently, agent-based frameworks such as SWE-agent have further advanced this progress by enabling autonomous, tool-using agents to tackle complex software engineering tasks. While existing agent-based issue resolution approaches are primarily based on agents’ independent explorations, they often get stuck in local solutions and fail to identify issue patterns that span across different parts of the codebase. To address this limitation, we propose SWE-Debate, a competitive multi-agent debate framework that encourages diverse reasoning paths and achieves more consolidated issue localization. SWE-Debate first creates multiple fault propagation traces as localization proposals by traversing a code dependency graph. Then, it organizes a three-round debate among specialized agents, each embodying distinct reasoning perspectives along the fault propagation trace. This structured competition enables agents to collaboratively converge on a consolidated fix plan. Finally, this consolidated fix plan is integrated into an MCTS-based code modification agent for patch generation. Experiments on the SWE-bench benchmark show that SWE-Debate achieves new state-of-the-art results in open-source agent frameworks and outperforms baselines by a large margin.

由于大型语言模型(LLMs)的先进推理能力,问题解决取得了显著进展。最近,SWE代理商等基于代理商的框架进一步推进了这一进展,使自动使用工具的代理商能够应对复杂的软件工程任务。虽然现有基于代理商的问题解决方法主要基于代理商的独立探索,但它们往往被困在本地解决方案中,无法查明跨越代码库不同部分的问题模式。为了应对这一限制,我们提议SWE-Debate,这是一个竞争性多代理商辩论框架,鼓励多种推理路径,实现更综合的问题本地化。SWE-Debate首先通过绘制代码依赖性图表,生成多重错误传播痕迹,作为本地化建议。然后,它组织专门代理商之间的三轮辩论,每个都体现了与错误传播跟踪相关的不同推理观点。这种结构竞争使代理商能够就综合固定计划开展合作。最后,这一综合固定计划被纳入基于MCTS的代码修改工具,用于补丁生成。SWE-Debate在SWE-Bench基准上进行的实验表明,SWE-Deate在开放代理商基准框架中实现了新的州差幅。


Article 13

Title@2025-07-31 (4): Scalable and Precise Patch Robustness Certification for Deep Learning Models with Top-k Predictions

Title: Scalable and Precise Patch Robustness Certification for Deep Learning Models with Top-k Predictions Skalierbare und präzise Patch Robustness Zertifizierung für Deep Learning Modelle mit Top-K Vorhersagen 具有顶级预测力的深学习模型可缩放和精确的补丁强度认证 2507.23335v1

Authors (4): Qilin Zhou, Haipeng Wang, Zhengyuan Wei, W. K. Chan

Patch robustness certification is an emerging verification approach for defending against adversarial patch attacks with provable guarantees for deep learning systems. Certified recovery techniques guarantee the prediction of the sole true label of a certified sample. However, existing techniques, if applicable to top-k predictions, commonly conduct pairwise comparisons on the votes between labels. They fail to certify precisely that the sole true label lies within the top k prediction labels, because the number of votes controlled by the attacker (i.e., the attack budget) is inflated; yet enumerating all combinations of vote allocation suffers from the combinatorial explosion problem. We propose CostCert, a novel, scalable, and precise voting-based certified recovery defender. CostCert verifies that the true label of a sample lies within the top k predictions without pairwise comparisons or combinatorial explosion through a novel design: it checks whether the attack budget on the sample is insufficient to cover the smallest total of additional votes, on top of the votes uncontrollable by the attacker, needed to exclude the true label from the top k prediction labels. Experiments show that CostCert significantly outperforms the current state-of-the-art defender PatchGuard; for example, it retains up to 57.3% certified accuracy when the patch size is 96, whereas PatchGuard has already dropped to zero.

补丁鲁棒性认证是一种新兴的验证方法,可为深度学习系统抵御对抗性补丁攻击提供可证明的保证。认证恢复技术保证能预测出被认证样本的唯一真实标签。然而,现有技术即便适用于top-k预测,通常也是对标签之间的票数进行两两比较,由于攻击者控制的票数(即攻击预算)被高估,无法精确认证唯一真实标签位于前k个预测标签之内;而枚举所有票数分配组合又会遭遇组合爆炸问题。我们提出CostCert,一种新颖、可扩展且精确的基于投票的认证恢复防御器。CostCert通过一种新颖的设计,在不进行两两比较、也不产生组合爆炸的情况下验证样本的真实标签是否位于前k个预测之内:判断该样本上的攻击预算是否不足以覆盖在攻击者无法控制的票数之上、将真实标签排除出前k个预测标签所需的最小额外票数总和。实验表明,CostCert显著优于当前最先进的防御器PatchGuard,例如在补丁大小为96时仍保持高达57.3%的认证准确率,而PatchGuard此时已降至零。
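
The budget-versus-cost check described in the abstract above can be pictured with a small sketch. It is an illustration under stated assumptions, not CostCert's actual algorithm: the function names, the per-label vote dictionary, the assumption that the attacker can only add votes to rival labels, and the tie-break in favour of the true label are all hypothetical.

```python
def min_exclusion_cost(votes, true_label, k):
    """Smallest total number of extra votes needed so that k rival labels
    strictly outrank the true label (assumption: ties favour the true label).

    `votes` maps each label to the votes the attacker cannot control.
    Returns None when there are fewer than k rival labels, in which case
    the true label can never be pushed out of the top k.
    """
    target = votes[true_label]
    # Extra votes each rival needs to strictly exceed the true label.
    costs = sorted(
        max(0, target + 1 - count)
        for label, count in votes.items()
        if label != true_label
    )
    if len(costs) < k:
        return None
    # Promoting the k cheapest rivals is the cheapest way to exclude the true label.
    return sum(costs[:k])


def is_certified(votes, true_label, k, budget):
    """Certified if the attack budget cannot cover the minimal exclusion cost."""
    cost = min_exclusion_cost(votes, true_label, k)
    return cost is None or budget < cost


# Hypothetical vote counts for illustration only.
votes = {"cat": 10, "dog": 7, "fox": 4, "owl": 1}
print(is_certified(votes, "cat", k=2, budget=10))
print(is_certified(votes, "cat", k=2, budget=11))
```

Sorting the per-rival costs and summing the k smallest avoids both pairwise comparisons and enumeration of vote allocations, which is the property the abstract emphasises; how the real defender derives the uncontrollable votes and the budget is not specified here.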


Article 14

Title@2025-07-31 (4): SequenceLayers: Sequence Processing and Streaming Neural Networks Made Easy

Title: SequenceLayers: Sequence Processing and Streaming Neural Networks Made Easy SequenzLayer: Sequenzverarbeitung und Streaming von Neuronalen Netzwerken leicht gemacht 序列激光器:序列处理和串联神经网络变得容易 2507.23292v1

Authors (11): RJ Skerry-Ryan, Julian Salazar, Soroosh Mariooryad, David Kao, Daisy Stanton, Eric Battenberg, Matt Shannon, Ron J. Weiss, Robin Scheibler, Jonas Rothfuss, Tom Bagby

We introduce a neural network layer API and library for sequence modeling, designed for easy creation of sequence models that can be executed both layer-by-layer (e.g., teacher-forced training) and step-by-step (e.g., autoregressive sampling). To achieve this, layers define an explicit representation of their state over time (e.g., a Transformer KV cache, a convolution buffer, an RNN hidden state), and a step method that evolves that state, tested to give identical results to a stateless layer-wise invocation. This and other aspects of the SequenceLayers contract enable complex models to be immediately streamable, mitigate a wide range of common bugs arising in both streaming and parallel sequence processing, and can be implemented in any deep learning library. A composable and declarative API, along with a comprehensive suite of layers and combinators, streamlines the construction of production-scale models from simple streamable components while preserving strong correctness guarantees. Our current implementations of SequenceLayers (JAX, TensorFlow 2) are available at https://github.com/google/sequence-layers.

我们引入了一个用于序列建模的神经网络层API和库,旨在便于创建既可逐层执行(例如教师强制训练)又可逐步执行(例如自回归采样)的序列模型。为此,各层定义了其状态随时间变化的显式表示(例如Transformer KV缓存、卷积缓冲区、RNN隐藏状态),以及一个推进该状态的step方法,并经测试保证其结果与无状态的逐层调用完全一致。SequenceLayers契约的这一点及其他方面使复杂模型能够立即支持流式处理,减少流式和并行序列处理中出现的大量常见错误,并且可以在任何深度学习库中实现。可组合的声明式API,加上一套全面的层和组合子,简化了由简单可流式组件构建生产规模模型的过程,同时保持强有力的正确性保证。我们目前的SequenceLayers实现(JAX、TensorFlow 2)可在 https://github.com/google/sequence-layers 获取。
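
The layer-wise versus step-wise contract from the abstract above can be illustrated with a toy causal convolution in plain Python. This is a minimal sketch and does not use the SequenceLayers API: the class `CausalConv`, its methods `initial_state`, `layer` and `step`, and the helper `run_stepwise` are hypothetical names chosen for this example.

```python
class CausalConv:
    """Toy 1-D causal convolution with an explicit streaming state."""

    def __init__(self, kernel):
        # kernel[-1] multiplies the current input, kernel[0] the oldest one.
        self.kernel = list(kernel)

    def initial_state(self):
        # The state is a buffer of the previous len(kernel) - 1 inputs.
        return [0] * (len(self.kernel) - 1)

    def layer(self, xs):
        """Stateless layer-wise call over the whole sequence."""
        padded = self.initial_state() + list(xs)
        k = len(self.kernel)
        return [
            sum(w * v for w, v in zip(self.kernel, padded[t:t + k]))
            for t in range(len(xs))
        ]

    def step(self, x, state):
        """Process one timestep and return (output, new_state)."""
        window = state + [x]
        y = sum(w * v for w, v in zip(self.kernel, window))
        return y, window[1:]


def run_stepwise(layer, xs):
    """Feed the sequence one element at a time, threading the state through."""
    state = layer.initial_state()
    outputs = []
    for x in xs:
        y, state = layer.step(x, state)
        outputs.append(y)
    return outputs


conv = CausalConv([1, 2, 3])
xs = [1, 0, 2, 5]
# The contract: both execution modes must agree on every timestep.
assert run_stepwise(conv, xs) == conv.layer(xs)
```

Integer inputs are used so that the equality check is exact; with floating-point layers a tolerance-based comparison would be the natural choice.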


Article 15

Title@2025-07-31 (4): A Privacy-Preserving DAO Model Using NFT Authentication for the Punishment not Reward Blockchain Architecture

Title: A Privacy-Preserving DAO Model Using NFT Authentication for the Punishment not Reward Blockchain Architecture A Privacy-Preserving DAO-Modell mit NFT-Authentifizierung für die Strafe nicht Belohnung Blockchain Architektur 使用NFT认证用于惩罚而不是回报链架构的隐私保护 DAO 模式 2405.13156v2

Authors (2): Talgar Bayan, Richard Banach

This paper presents a decentralised autonomous organisation (DAO) model that uses non-fungible tokens (NFTs) for identity management and privacy-preserving interactions within a Punishment not Reward (PnR) blockchain mechanism. The proposed model introduces a dual NFT architecture deployed on Layer 2 networks: Membership NFTs ($NFT_{auth}$) for authentication and access control and Interaction NFTs ($NFT_{priv}$) for private interactions among participants. Our Layer 2 implementation achieves a 97% gas cost reduction while maintaining security through cross-chain mechanisms. The identity management system incorporates decentralised KYC processes and Sybil attack resistance using soulbound token characteristics. Governance operates through smart contracts that manage reputation and administer punitive measures, including conditional identity disclosure for forensic purposes when misconduct is detected.

本文提出一种去中心化自治组织(DAO)模型,该模型在“惩罚而非奖励”(PnR)区块链机制中使用非同质化代币(NFT)进行身份管理和隐私保护交互。所提模型引入了部署在Layer 2网络上的双NFT架构:用于认证和访问控制的成员NFT($NFT_{auth}$),以及用于参与者之间私密交互的交互NFT($NFT_{priv}$)。我们的Layer 2实现在通过跨链机制保持安全性的同时,将gas成本降低了97%。身份管理系统结合了去中心化的KYC流程,并利用灵魂绑定代币的特性抵御女巫(Sybil)攻击。治理通过智能合约运行,这些合约管理声誉并执行惩罚措施,包括在发现不当行为时出于取证目的有条件地披露身份。


Article 16

Title@2025-07-31 (4): XABPs: Towards eXplainable Autonomous Business Processes

Title: XABPs: Towards eXplainable Autonomous Business Processes XABPs: Auf dem Weg zu eXplainable Autonomous Business Processes XABPs:迈向可塑性自治商业进程 2507.23269v1

Authors (6): Peter Fettke, Fabiana Fournier, Lior Limonad, Andreas Metzger, Stefanie Rinderle-Ma, Barbara Weber

Autonomous business processes (ABPs), i.e., self-executing workflows leveraging AI/ML, have the potential to improve operational efficiency, reduce errors, lower costs, improve response times, and free human workers for more strategic and creative work. However, ABPs may raise specific concerns including decreased stakeholder trust, difficulties in debugging, hindered accountability, risk of bias, and issues with regulatory compliance. We argue for eXplainable ABPs (XABPs) to address these concerns by enabling systems to articulate their rationale. The paper outlines a systematic approach to XABPs, characterizing their forms, structuring explainability, and identifying key BPM research challenges towards XABPs.

自主业务流程(ABP),即利用AI/ML的自执行工作流,有潜力提高运营效率、减少错误、降低成本、缩短响应时间,并让人类员工腾出精力从事更具战略性和创造性的工作。然而,ABP可能引发一些特定的担忧,包括利益相关者信任度下降、调试困难、问责受阻、偏见风险以及监管合规问题。我们主张采用可解释的ABP(XABP),使系统能够阐明其决策依据,从而应对这些担忧。本文概述了实现XABP的系统化方法,刻画其形式、构建可解释性的结构,并指出迈向XABP的关键BPM研究挑战。


Article 17

Title@2025-07-31 (4): CodeIF-Bench: Evaluating Instruction-Following Capabilities of Large Language Models in Interactive Code Generation

Title: CodeIF-Bench: Evaluating Instruction-Following Capabilities of Large Language Models in Interactive Code Generation CodeIF-Bench: Bewertung von Instruction-Following-Fähigkeiten von großen Sprachmodellen in der interaktiven Codegenerierung 守则-框架框架框架:评估互动代码生成中大语言模式的指令-遵守能力 2503.22688v3

Authors (7): Peiding Wang, Li Zhang, Fang Liu, Lin Shi, Minxiao Li, Bo Shen, An Fu

Large Language Models (LLMs) have demonstrated exceptional performance in code generation tasks and have become indispensable programming assistants for developers. However, existing code generation benchmarks primarily assess the functional correctness of code generated by LLMs in single-turn interactions. They offer limited insight into LLMs’ abilities to generate code that strictly follows users’ instructions in multi-turn interaction scenarios. In this paper, we introduce CodeIF-Bench, a benchmark for evaluating the instruction-following capabilities of LLMs in interactive code generation. Specifically, CodeIF-Bench incorporates nine types of verifiable instructions aligned with real-world software development requirements, which can be independently and objectively validated through specified test cases, facilitating the evaluation of instruction-following capability in multi-turn interactions. In both Static Conversation and Dynamic Conversation settings, we evaluate the performance of 7 state-of-the-art LLMs and summarize the important factors influencing the instruction-following ability of LLMs in multi-turn interactions, as well as potential directions for improvement.

大型语言模型(LLMs)在代码生成任务中表现出色,已成为开发者不可或缺的编程助手。然而,现有的代码生成基准主要评估LLM在单轮交互中生成代码的功能正确性,对于LLM在多轮交互场景中生成严格遵循用户指令的代码的能力,所能提供的洞察有限。在本文中,我们提出CodeIF-Bench,一个用于评估LLM在交互式代码生成中指令遵循能力的基准。具体而言,CodeIF-Bench包含九类与真实软件开发需求相一致的可验证指令,这些指令可以通过指定的测试用例独立、客观地验证,便于评估多轮交互中的指令遵循能力。在Static Conversation和Dynamic Conversation两种设置下,我们评估了7个最先进LLM的表现,并总结了影响LLM在多轮交互中指令遵循能力的重要因素以及潜在的改进方向。


Article 18

Title@2025-07-31 (4): Kernel-FFI: Transparent Foreign Function Interfaces for Interactive Notebooks

Title: Kernel-FFI: Transparent Foreign Function Interfaces for Interactive Notebooks Kernel-FFI: Transparente Fremdfunktionsschnittstellen für interaktive Notebooks 核心-FFI:交互式笔记本的透明外国函数界面 2507.23205v1

Authors (4): Hebi Li, Forrest Sheng Bao, Qi Xiao, Jin Tian

Foreign Function Interfaces (FFIs) are essential for enabling interoperability between programming languages, yet existing FFI solutions are ill-suited for the dynamic, interactive workflows prevalent in modern notebook environments such as Jupyter. Current approaches require extensive manual configuration, introduce significant boilerplate, and often lack support for recursive calls and object-oriented programming (OOP) constructs, features that are critical for productive, multi-language development. We present Kernel-FFI, a transparent, language-agnostic framework that enables seamless cross-language function calls and object manipulation within interactive notebooks. Kernel-FFI employs source-level transformation to automatically rewrite cross-language invocations, eliminating the need for manual bindings or boilerplate. Kernel-FFI provides robust support for OOP by enabling foreign object referencing and automatic resource management across language boundaries. Furthermore, to address the blocking nature of Jupyter kernels and support recursive and asynchronous foreign calls, we introduce a novel side-channel communication mechanism. Our tool will be open-sourced and available at https://codepod.io/docs/kernel-ffi

外部函数接口(FFI)对于实现编程语言之间的互操作性至关重要,但现有的FFI方案并不适合Jupyter等现代笔记本环境中普遍存在的动态、交互式工作流。当前的方法需要大量手动配置,引入大量样板代码,并且往往缺乏对递归调用和面向对象编程(OOP)结构的支持,而这些特性对于高效的多语言开发至关重要。我们提出Kernel-FFI,一个透明、与语言无关的框架,可在交互式笔记本中实现无缝的跨语言函数调用和对象操作。Kernel-FFI采用源码级转换自动重写跨语言调用,无需手动绑定或样板代码。Kernel-FFI通过支持外部对象引用和跨语言边界的自动资源管理,为OOP提供了稳健的支持。此外,为了解决Jupyter内核的阻塞特性并支持递归和异步的外部调用,我们引入了一种新颖的旁路通信机制。我们的工具将开源,并可在 https://codepod.io/docs/kernel-ffi 获取。


Article 19

Title@2025-07-31 (4): CodePod: A Language-Agnostic Hierarchical Scoping System for Interactive Development

Title: CodePod: A Language-Agnostic Hierarchical Scoping System for Interactive Development CodePod: Ein sprach-agnostisches Hierarchisches Scoping-System für interaktive Entwicklung 代码pod:一个促进互动发展的语文、不可知的等级分级范围界定系统 2301.02410v2

Authors (4): Hebi Li, Forrest Sheng Bao, Qi Xiao, Jin Tian

Interactive development environments like Jupyter Notebooks enable incremental coding through cells with immediate feedback, but their linear structure and global namespace limit scalability for large software projects. We present CodePod, a hierarchical extension of Jupyter that introduces a novel scoped execution model with formal semantics. Our key contribution is a language-agnostic runtime system that performs source-level transformations to implement hierarchical scoping rules, enabling true incremental evaluation across nested modules without requiring language-specific kernel modifications. We formalize the scoping semantics as a mathematical framework with precise visibility relations and prove key properties including uniqueness of symbol resolution and correctness of the resolution algorithm. A qualitative user study with seven senior developers demonstrates that CodePod enables significant improvements in project scalability compared to Jupyter, with notable reductions in navigation effort. We validate the system’s effectiveness on large-scale projects with thousands of lines of code, demonstrating its applicability beyond traditional notebook boundaries. Our tool is open-source and available at https://codepod.io

Jupyter Notebook等交互式开发环境通过单元格和即时反馈支持增量式编码,但其线性结构和全局命名空间限制了大型软件项目的可扩展性。我们提出CodePod,这是对Jupyter的层次化扩展,引入了一种具有形式语义的新型作用域执行模型。我们的主要贡献是一个与语言无关的运行时系统,它通过源码级转换实现层次化作用域规则,从而在嵌套模块之间实现真正的增量求值,而无需针对特定语言修改内核。我们将作用域语义形式化为一个具有精确可见性关系的数学框架,并证明了若干关键性质,包括符号解析的唯一性和解析算法的正确性。一项有七位资深开发者参与的定性用户研究表明,与Jupyter相比,CodePod显著提升了项目的可扩展性,并明显减少了导航工作量。我们在包含数千行代码的大型项目上验证了系统的有效性,证明其适用范围超出了传统笔记本的边界。我们的工具是开源的,可在 https://codepod.io 获取。


Article 20

Title@2025-07-31 (4): VRISE: A Virtual Reality Platform for Immersive and Interactive Surveying Education

Title: VRISE: A Virtual Reality Platform for Immersive and Interactive Surveying Education VRISE: Eine Virtual-Reality-Plattform für immersive und interaktive Vermessungsausbildung VRISE:用于沉浸式和交互式测量教育的虚拟现实平台 2507.22810v2

Authors (5): Daniel Udekwe, Dimitrios Bolkas, Eren Erman Ozguven, Ren Moses, Qianwen Guo

Surveying is a core component of civil engineering education, requiring students to engage in hands-on spatial measurement, instrumentation handling, and field-based decision-making. However, traditional instruction often poses logistical and cognitive challenges that can hinder accessibility and student engagement. While virtual laboratories have gained traction in engineering education, few are purposefully designed to support flexible, adaptive learning in surveying. To address this gap, we developed Virtual Reality for Immersive and Interactive Surveying Education (VRISE), an immersive virtual reality laboratory that replicates ground-based and aerial surveying tasks through customizable, accessible, and user-friendly modules. VRISE features interactive experiences such as differential leveling with digital level equipment and waypoint-based drone navigation, enhanced by input smoothing, adaptive interfaces, and real-time feedback to accommodate diverse learning styles. Evaluation across multiple user sessions demonstrated consistent gains in measurement accuracy, task efficiency, and interaction quality, with a clear progression in skill development across the ground-based and aerial surveying modalities. By reducing cognitive load and physical demands, even in tasks requiring fine motor control and spatial reasoning, VRISE demonstrates the potential of immersive, repeatable digital environments to enhance surveying education, broaden participation, and strengthen core competencies in a safe and engaging setting.

调查是土木工程教育的核心组成部分,要求学生参与亲手进行空间测量、仪表处理和实地决策。然而,传统教学往往带来后勤和认知方面的挑战,阻碍无障碍和学生参与。尽管虚拟实验室在工程教育中获得了牵引力,但很少有目的地设计支持灵活和适应性调查学习。为解决这一差距,我们开发了模拟和互动调查教育虚拟现实(VRISE),这是一个隐性虚拟现实实验室,通过可定制、无障碍和方便用户的模块复制地面和空中调查任务。VRISE具有互动经验,例如与数字级设备和路标无人驾驶导航进行不同的平级,通过投入平滑、适应性接口和实时反馈加以加强,以适应不同学习风格。多场用户会议的评价显示测量准确性、任务效率和互动质量方面不断提高,在地面和空中调查模式的技能发展方面明显取得进展。通过减少认知负担和物理需求,甚至在需要精细的机动控制和空间推理的任务方面,VRISE展示了互动经验,包括在需要优化的发动机控制和空间推理的情况下,扩大数字调查能力,加强核心调查环境,加强安全性调查的潜力。


Article 21

Title@2025-07-31 (4): AutoBridge: Automating Smart Device Integration with Centralized Platform

Title: AutoBridge: Automating Smart Device Integration with Centralized Platform AutoBridge: Automatisierung der Smart Device Integration mit zentralisierter Plattform AutoBridge: 与集中化平台自动整合智能设备 2507.23178v1

Authors (3): Siyuan Liu, Zhice Yang, Huangxun Chen

Multimodal IoT systems coordinate diverse IoT devices to deliver human-centered services. The ability to incorporate new IoT devices under the management of a centralized platform is an essential requirement. However, it requires significant human expertise and effort to program the complex IoT integration code that enables the platform to understand and control the device functions. Therefore, we propose AutoBridge to automate IoT integration code generation. Specifically, AutoBridge adopts a divide-and-conquer strategy: it first generates device control logic by progressively retrieving device-specific knowledge, then synthesizes platform-compliant integration code using platform-specific knowledge. To ensure correctness, AutoBridge features a multi-stage debugging pipeline, including an automated debugger for virtual IoT device testing and an interactive hardware-in-the-loop debugger that requires only binary user feedback (yes or no) for real-device verification. We evaluate AutoBridge on a benchmark of 34 IoT devices across two open-source IoT platforms. The results demonstrate that AutoBridge can achieve an average success rate of 93.87% and an average function coverage of 94.87% without any human involvement. With minimal binary (yes or no) feedback from users, the code is then revised to reach 100% function coverage. A user study with 15 participants further shows that AutoBridge outperforms expert programmers by 50% to 80% in code accuracy, even when the programmers are allowed to use commercial code LLMs.

多模态物联网(IoT)系统协调各种IoT设备,以提供以人为中心的服务。能够将新的IoT设备纳入集中式平台的管理之下是一项基本要求。然而,编写使平台能够理解并控制设备功能的复杂IoT集成代码,需要大量的专业知识和人力。因此,我们提出AutoBridge,以自动生成IoT集成代码。具体而言,AutoBridge采用分而治之的策略:首先通过逐步检索设备特定知识生成设备控制逻辑,然后利用平台特定知识合成符合平台规范的集成代码。为确保正确性,AutoBridge配备了多阶段调试流水线,包括用于虚拟IoT设备测试的自动调试器,以及用于真实设备验证的交互式硬件在环调试器,后者只需要用户给出二元反馈(是或否)。我们在两个开源IoT平台上、由34个IoT设备组成的基准上评估AutoBridge。结果表明,在没有任何人工参与的情况下,AutoBridge可达到93.87%的平均成功率和94.87%的平均功能覆盖率。借助用户极少量的是/否二元反馈,代码随后经修订可达到100%的功能覆盖率。一项有15名参与者的用户研究进一步表明,即使允许程序员使用商业代码LLM,AutoBridge的代码准确率仍比专家程序员高出50%至80%。


Article 22

Title@2025-07-31 (4): Extension Decisions in Open Source Software Ecosystem

Title: Extension Decisions in Open Source Software Ecosystem Erweiterungsentscheidungen in Open Source Software Ecosystem 开放源软件生态系统的推广决定 2507.23168v1

Authors (2): Elmira Onagh, Maleknaz Nayebi

GitHub Marketplace is growing by approximately 41% annually as new tools are added; however, many additions replicate existing functionality. We study this phenomenon in the platform’s largest segment, Continuous Integration (CI), by linking 6,983 CI Actions to 3,869 providers and mining their version histories. Our graph model timestamps every functionality’s debut, tracks its adoption, and clusters redundant tools. We find that approximately 65% of new CI Actions replicate existing capabilities, typically within six months, and that a small set of first-mover Actions accounts for most subsequent forks and extensions. These insights enable developers to choose the optimal moment to launch and to target unmet functionality, and they help maintainers eliminate redundant tools. We publish the complete graph and dataset to encourage longitudinal research on innovation and competition in software ecosystems, and to provide practitioners with a data-driven roadmap for identifying emerging trends and guiding product strategy.

GitHub 市场平台每年以新的工具扩展约41%;然而,许多新增功能复制了现有功能。我们在平台最大的部分“连续整合”中研究这一现象,将6,983 CI Action与3,869个供应商连接起来,并挖掘其版本历史。我们的图表模型时间标记了每个功能的首页,跟踪其采用情况,并收集了冗余工具。我们发现,大约65%的新CI Action复制了现有能力,通常在六个月内复制,而一小套首创行动代表了随后的首创和扩展。这些洞见使开发者能够选择最佳时机启动、锁定未实现功能并帮助维护者消除冗余工具。我们出版了完整的图表和数据集,以鼓励对软件生态系统的创新和竞争进行纵向研究,并为实践者提供数据驱动的路线图,以确定新出现的趋势和指导产品战略。
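
The idea of timestamping each functionality's debut and flagging later arrivals as replicas, as described in the abstract above, can be sketched as follows. This is an illustration only: the record layout, the function `label_debuts`, and the example Actions are hypothetical, and the study's actual graph model is not reproduced here.

```python
from datetime import date


def label_debuts(records):
    """Label each action as the debut of its functionality or as a replica.

    `records` is a list of (action_name, functionality, release_date) tuples.
    The earliest release per functionality counts as its debut; on equal
    dates the record listed first wins (an assumption of this sketch).
    """
    debut = {}
    for name, functionality, released in sorted(records, key=lambda r: r[2]):
        debut.setdefault(functionality, name)

    labels = {}
    for name, functionality, _ in records:
        first = debut[functionality]
        labels[name] = "debut" if name == first else f"replicates {first}"
    return labels


# Hypothetical records for illustration only.
records = [
    ("cache-b", "caching", date(2021, 3, 1)),
    ("cache-a", "caching", date(2020, 1, 15)),
    ("lint-x", "linting", date(2022, 6, 1)),
]
labels = label_debuts(records)
print(labels)
```

A fuller model would also need a way to decide when two Actions offer the same functionality; this sketch takes that grouping as given.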


Article 23

Title@2025-07-30 (3): Vibe Modeling: Challenges and Opportunities

Title: Vibe Modeling: Challenges and Opportunities Vibe Modeling: Herausforderungen und Chancen 虚拟建模:挑战和机遇 2507.23120v1

Authors (1): Jordi Cabot

There is a pressing need for better development methods and tools to keep up with the growing demand for, and increasing complexity of, new software systems. New types of user interfaces, the need for intelligent components, sustainability concerns, and similar factors bring new challenges that we need to handle. In recent years, model-driven engineering (MDE) has been key to improving the quality and productivity of software development, but models themselves are becoming increasingly complex to specify and manage. At the same time, we are witnessing the growing popularity of vibe coding approaches that rely on Large Language Models (LLMs) to transform natural language descriptions into running code at the expense of code vulnerabilities, scalability issues and maintainability concerns. In this paper, we introduce the concept of vibe modeling as a novel approach to integrate the best of both worlds (AI and MDE) to speed up the development of reliable complex systems. We outline the key concepts of vibe modeling and highlight the opportunities and open challenges it presents for the future of modeling.

我们迫切需要更好的开发方法和工具,以跟上新软件系统不断增长的需求和日益增加的复杂性。新型用户界面、对智能组件的需求、可持续性方面的关切等,都带来了我们必须应对的新挑战。近年来,模型驱动工程(MDE)一直是提高软件开发质量和生产率的关键,但模型本身的定义和管理正变得越来越复杂。与此同时,我们看到氛围编码(vibe coding)方法日益流行,这类方法依靠大型语言模型(LLMs)将自然语言描述转换为可运行的代码,其代价是代码漏洞、可扩展性问题和可维护性隐患。在本文中,我们提出氛围建模(vibe modeling)的概念,作为一种融合两者(AI与MDE)之长的新方法,以加速可靠复杂系统的开发。我们概述了氛围建模的关键概念,并强调了它为建模的未来带来的机遇和开放挑战。


Article 24

Title@2025-07-30 (3): FlowETL: An Autonomous Example-Driven Pipeline for Data Engineering

Title: FlowETL: An Autonomous Example-Driven Pipeline for Data Engineering FlowETL: Eine autonome Beispiel-gesteuerte Pipeline für die Datentechnik FLFETL:数据工程的自主如流管道 2507.23118v1

Authors (4): Mattia Di Profio, Mingjun Zhong, Yaji Sripada, Marcel Jaspars

The Extract, Transform, Load (ETL) workflow is fundamental for populating and maintaining data warehouses and other data stores accessed by analysts for downstream tasks. A major shortcoming of modern ETL solutions is the extensive need for a human-in-the-loop, required to design and implement context-specific and often non-generalisable transformations. While related work in the field of ETL automation shows promising progress, there is a lack of solutions capable of automatically designing and applying these transformations. We present FlowETL, a novel example-based autonomous ETL pipeline architecture designed to automatically standardise and prepare input datasets according to a concise, user-defined target dataset. FlowETL is an ecosystem of components that interact to achieve the desired outcome. A Planning Engine uses a paired sample of input and output datasets to construct a transformation plan, which is then applied by an ETL worker to the source dataset. Monitoring and logging provide observability throughout the entire pipeline. The results show promising generalisation capabilities across 14 datasets of various domains, file structures, and file sizes.

提取、变换、加载(ETL)工作流程是分析人员为下游任务获取的数据仓库和其他数据储存的源头基础。现代ETL解决方案的一个主要缺点是,广泛需要一种用于设计和实施特定背景和往往不普遍适用的变换的 “ 环流 “ 所需的 “ 人 “ 。虽然ETL自动化领域的相关工作显示有可喜的进展,但缺乏能够自动设计和应用这些变换的解决方案。我们介绍了FlowETL,这是一个新的以实例为基础的自动自动ETL管道结构,旨在根据一个简明、用户定义的目标数据集自动标准化和编制输入数据集。FlowETL是一个组合的生态系统,它们相互作用,以实现预期的结果。规划引擎使用配对的输入-输出数据集样本来构建一个变换计划,然后由ETL工人对源数据集加以应用。监测和记录使整个输电管道具有可观测性。结果显示,在14个不同领域、文件结构和文件大小的数据集中,具有很有希望的普及能力。


Article 25

Title@2025-07-30 (3): Insights into resource utilization of code small language models serving with runtime engines and execution providers

Title: Insights into resource utilization of code small language models serving with runtime engines and execution providers Einblicke in die Ressourcennutzung von Code-Small Language-Modellen, die mit Laufzeit-Engines und Ausführungsanbietern dienen 深入了解为运行时引擎和执行提供方服务的编码小型语文模式的资源利用情况 2412.15441v2

Authors (4): Francisco Durán, Matias Martinez, Patricia Lago, Silverio Martínez-Fernández

The rapid growth of language models, particularly in code generation, requires substantial computational resources, raising concerns about energy consumption and environmental impact. Optimizing the resource utilization of language model inference is crucial, and Small Language Models (SLMs) offer a promising solution to reduce resource demands. Our goal is to analyze the impact of deep learning serving configurations, defined as combinations of runtime engines and execution providers, on resource utilization, in terms of energy consumption, execution time, and computing-resource utilization, from the point of view of software engineers conducting inference in the context of code generation SLMs. We conducted a technology-oriented, multi-stage experimental pipeline using twelve code generation SLMs to investigate energy consumption, execution time, and computing-resource utilization across the configurations. Significant differences emerged across configurations. CUDA execution provider configurations outperformed CPU execution provider configurations in both energy consumption and execution time. Among the configurations, TORCH paired with CUDA demonstrated the greatest energy efficiency, achieving energy savings from 37.99% up to 89.16% compared to other serving configurations. Similarly, optimized runtime engines like ONNX with the CPU execution provider achieved from 8.98% up to 72.04% energy savings within CPU-based configurations. Also, TORCH paired with CUDA exhibited efficient computing-resource utilization. Serving configuration choice significantly impacts resource utilization. While further research is needed, we recommend selecting, from the configurations above, the one best suited to software engineers’ requirements for improving the resource efficiency of serving.

语言模型的快速发展,尤其是在代码生成领域,需要大量计算资源,引发了对能源消耗和环境影响的担忧。优化语言模型推理的资源利用至关重要,而小型语言模型(SLM)为降低资源需求提供了一种有前景的解决方案。我们的目标是从在代码生成SLM场景下执行推理的软件工程师的视角,分析深度学习服务配置(定义为运行时引擎与执行提供程序的组合)对资源利用的影响,具体包括能源消耗、执行时间和计算资源利用率。我们使用十二个代码生成SLM,开展了面向技术的多阶段实验流程,以考察各配置下的能源消耗、执行时间和计算资源利用率。不同配置之间出现了显著差异。CUDA执行提供程序配置在能源消耗和执行时间两方面均优于CPU执行提供程序配置。在各配置中,TORCH与CUDA的组合能效最高,与其他服务配置相比可节省37.99%至89.16%的能源。同样,在基于CPU的配置中,ONNX等经过优化的运行时引擎搭配CPU执行提供程序可节省8.98%至72.04%的能源。此外,TORCH与CUDA的组合还表现出高效的计算资源利用率。服务配置的选择会显著影响资源利用。虽然仍需进一步研究,我们建议从上述配置中选择最符合软件工程师需求的配置,以提高服务的资源利用效率。


Article 26

Title@2025-07-30 (3): On LLM-Assisted Generation of Smart Contracts from Business Processes

Title: On LLM-Assisted Generation of Smart Contracts from Business Processes Zur LLM-Assistenten Generierung von Smart Contracts aus Geschäftsprozessen 利用LLM协助从业务流程中生成智能合同 2507.23087v1

Authors (3): Fabian Stiehle, Hans Weytjens, Ingo Weber

Large language models (LLMs) have changed the reality of how software is produced. Within the wider software engineering community, among many other purposes, they are explored for code generation use cases from different types of input. In this work, we present an exploratory study to investigate the use of LLMs for generating smart contract code from business process descriptions, an idea that has emerged in recent literature to overcome the limitations of traditional rule-based code generation approaches. However, current LLM-based work evaluates generated code on small samples, relying on manual inspection, or testing whether code compiles but ignoring correct execution. With this work, we introduce an automated evaluation framework and provide empirical data from larger data sets of process models. We test LLMs of different types and sizes in their capabilities of achieving important properties of process execution, including enforcing process flow, resource allocation, and data-based conditions. Our results show that LLM performance falls short of the perfect reliability required for smart contract development. We suggest future work to explore responsible LLM integrations in existing tools for code generation to ensure more reliable output. Our benchmarking framework can serve as a foundation for developing and evaluating such integrations.

大型语言模型(LLMS)改变了软件如何生产的现实。在更广泛的软件工程界中,除其他目的外,探索这些模型是为了从不同种类的投入中获得的代码生成使用案例。在这项工作中,我们提出一项探索性研究,以调查使用LLMS从业务流程描述中生成智能合同代码的想法,这是最近文献中出现的一种想法,目的是克服传统基于规则的代码生成方法的局限性。然而,LLMM目前的工作评价了小型样本的代码,依靠人工检查,或测试代码是否汇编而忽视正确执行。我们采用了自动化评价框架,并从较大的流程模型数据集中提供经验性数据。我们测试了不同类型和规模的LLMS实现重要流程实施特性的能力,包括执行流程流程、资源分配和基于数据的条件。我们的成果显示LMM的性能没有达到智能合同开发所需的完美可靠性。我们建议今后开展工作,探索现有代码生成工具中负责任的LM集成,以确保更可靠的产出。我们的基准框架可以作为开发和评价这种整合的基础。


Article 27

Title@2025-07-30 (3): The Design Space of Lockfiles Across Package Managers

Title: The Design Space of Lockfiles Across Package Managers Der Design-Raum von Lockfiles Across Package Managers 全包管理员的锁文件设计空间 2505.04834v2

Authors (4): Yogya Gamage, Deepika Tiwari, Martin Monperrus, Benoit Baudry

Software developers reuse third-party packages that are hosted in package registries. At build time, a package manager resolves and fetches the direct and indirect dependencies of a project. Most package managers also generate a lockfile, which records the exact set of resolved dependency versions. Lockfiles are used to reduce build times; to verify the integrity of resolved packages; and to support build reproducibility across environments and time. Despite these beneficial features, developers often struggle with their maintenance, usage, and interpretation. In this study, we unveil the major challenges related to lockfiles, so that future researchers and engineers can address them. We perform the first comprehensive study of lockfiles across 7 popular package managers: npm, pnpm, Cargo, Poetry, Pipenv, Gradle, and Go. First, we highlight the wide variety of design decisions that package managers make regarding the generation process as well as the content of lockfiles. Next, we conduct a qualitative analysis based on semi-structured interviews with 15 developers. We capture first-hand insights about the benefits that developers perceive in lockfiles, as well as the challenges they face in managing these files. Following these observations, we make 5 recommendations to further improve lockfiles, for a better developer experience.

软件开发者会复用托管在软件包注册中心的第三方软件包。在构建时,包管理器会解析并获取项目的直接和间接依赖。大多数包管理器还会生成一个锁文件(lockfile),记录已解析依赖版本的确切集合。锁文件用于缩短构建时间、验证已解析软件包的完整性,并支持跨环境、跨时间的构建可复现性。尽管有这些益处,开发者在锁文件的维护、使用和理解上仍常常遇到困难。在本研究中,我们揭示与锁文件相关的主要挑战,以便未来的研究人员和工程师能够加以解决。我们首次对7个流行包管理器(npm、pnpm、Cargo、Poetry、Pipenv、Gradle和Go)的锁文件进行了全面研究。首先,我们强调了包管理器在锁文件的生成过程和内容方面所做的各种设计决策。接着,我们基于对15位开发者的半结构化访谈进行了定性分析,获得了关于开发者眼中锁文件的益处以及他们在管理这些文件时所面临挑战的第一手见解。基于这些观察,我们提出5条建议,以进一步改进锁文件,带来更好的开发者体验。
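
The integrity-verification role of lockfiles mentioned in the abstract above can be shown with a minimal sketch. The entry layout (`name`, `version`, `sha256`) and the function `verify_artifact` are hypothetical; real lockfile formats and hash encodings differ between package managers.

```python
import hashlib


def verify_artifact(lock_entry, artifact_bytes):
    """Return True if the fetched artifact matches the hash pinned in the lock entry."""
    digest = hashlib.sha256(artifact_bytes).hexdigest()
    return digest == lock_entry["sha256"]


# Hypothetical package contents and lock entry for illustration only.
artifact = b"example package contents"
lock_entry = {
    "name": "example-pkg",
    "version": "1.2.3",
    "sha256": hashlib.sha256(artifact).hexdigest(),
}

print(verify_artifact(lock_entry, artifact))                # untouched artifact
print(verify_artifact(lock_entry, artifact + b" tampered"))  # modified artifact
```

Pinning both the exact version and a content hash is what lets a later build detect that a registry served different bytes for the same version.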


Article 28

Title@2025-07-30 (3): Tracking research software outputs in the UK

Title: Tracking research software outputs in the UK Verfolgung von Forschungssoftware-Outputs im Vereinigten Königreich 联合王国跟踪研究软件产出 2507.22871v1

Authors (2): Domhnall Carlin, Austen Rainer

Research software is crucial in the research process, and the growth of Open Science underscores the importance of accessing research artifacts, like data and code, raising traceability challenges among outputs. While it is a clear principle that research code, along with other essential outputs, should be recognised as artifacts of the research process, the how of this principle remains variable. This study examines where UK academic institutions store and register software as a unique research output, searching the UKRI’s Gateway to Research (GtR) metadata for publicly funded research software in the UK. The quantity of software reported as research outcomes remains low in proportion to other categories. Artifact sharing appears low, with one-quarter of the reported software having no links and 45% having either a missing or erroneous URL. Of the valid URLs, we find the single largest category is Public Commercial Code Repository, with GitHub being the host of 18% of all publicly funded research software listed. These observations are contrasted with past findings from 2023. Finally, we discuss the lack of artifact sharing in UK research, with resulting implications for the maintenance and evolution of research software. Without dissemination, research software risks demotion to a transient artifact, useful only to meet short term research demands but ultimately lost to the broader enterprise of science.

研究软件在研究过程中至关重要,开放科学的增长强调了获取研究文物的重要性,如数据和代码,从而在产出中提高可追溯性的挑战。虽然研究守则和其他基本产出应被确认为研究过程的文物是一项明确的原则,但这一原则如何仍然是可变的。本研究考察了英国学术机构在哪些地方储存和登记软件作为独特的研究产出,为英国公共资助的研究软件搜索英国研究所的研究网关(GtR)元数据。作为研究成果报告的软件数量与其他类别的比例仍然较低。人工分享似乎较低,报告的软件有四分之一没有链接,45%没有缺失或错误的URL。在有效的URL中,我们发现最大的一个类别是《公共商业守则》Repositor, GitHub是所有公共资助的研究软件的18%的宿主。这些观察与2023年的以往研究结果形成对照,最后,我们讨论了英国研究中缺少文物共享的问题,从而对研究软件的维护和发展产生了影响。在不传播的情况下,研究软件的风险最终会降低企业对更广义的研究需求的影响。


Article 29

Title@2025-07-30 (3): Repair-R1: Better Test Before Repair

Title: Repair-R1: Better Test Before Repair Reparatur-R1: Besserer Test vor Reparatur 修理-R1:在修理前进行更好的测试 2507.22853v1

Authors (3): Haichuan Hu, Xiaochen Xie, Quanjun Zhang

APR (Automated Program Repair) aims to automatically locate program defects, generate patches, and validate the repairs. Existing APR techniques are often combined with LLMs (Large Language Models) to leverage their code-related knowledge and improve repair effectiveness. Current LLM-based APR methods typically utilize test cases only during the inference stage, adopting an iterative approach that performs repair first and validates it through test execution afterward. This conventional paradigm neglects two important aspects: the potential contribution of test cases in the training phase, and the possibility of leveraging testing prior to repair. To address this, we propose Repair-R1, which introduces test cases into the model’s training phase and shifts test generation to precede repair. The model is required to first generate discriminative test cases that can distinguish defective behaviors, and then perform repair based on these tests. This enables the model to better locate defects and understand their underlying causes, thereby improving repair effectiveness. We implement Repair-R1 with three different backbone models, using RL (reinforcement learning) to co-optimize test generation and bug repair. Experimental results on four widely adopted benchmarks demonstrate the superiority of Repair-R1. Specifically, compared to vanilla models, Repair-R1 improves repair success rate by 2.68% to 48.29%, test generation success rate by 16.38% to 53.28%, and test coverage by 0.78% to 53.96%. We publish the code and weights at https://github.com/Tomsawyerhu/APR-RL and https://huggingface.co/tomhu/Qwen3-4B-RL-5000-step.

APR(自动程序修复)旨在自动定位程序缺陷、生成补丁并验证修复。现有的APR技术常与LLM(大语言模型)相结合,利用LLM的代码相关知识提高修复效果。目前基于LLM的APR方法通常只在推理阶段使用测试用例,采用先修复、后通过执行测试加以验证的迭代方式。这一常规范式忽略了两个重要方面:测试用例在训练阶段的潜在贡献,以及在修复之前利用测试的可能性。为此,我们提出Repair-R1,将测试用例引入模型的训练阶段,并将测试生成移到修复之前。模型需要先生成能够区分缺陷行为的判别性测试用例,再基于这些测试进行修复。这使模型能够更好地定位缺陷并理解缺陷的根本原因,从而提高修复效果。我们用三种不同的主干模型实现Repair-R1,使用RL(强化学习)联合优化测试生成和缺陷修复。在四个广泛采用的基准上的实验结果证明了Repair-R1的优越性。具体而言,与原始模型相比,Repair-R1将修复成功率提高了2.68%至48.29%,测试生成成功率提高了16.38%至53.28%,测试覆盖率提高了0.78%至53.96%。我们在 https://github.com/Tomsawyerhu/APR-RL 和 https://huggingface.co/tomhu/Qwen3-4B-RL-5000-step 发布了代码和权重。
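The notion of a discriminative test case in the Repair-R1 abstract can be illustrated with a minimal sketch. It assumes that a test counts as discriminative when it fails on the defective program and passes on a correct one; this criterion, the `is_discriminative` helper, and the toy `buggy_abs`/`fixed_abs` functions are illustrative assumptions, not taken from the paper or its repository.

```python
from typing import Callable, Tuple

# A test case is modeled as (input arguments, expected output).
TestCase = Tuple[tuple, object]


def passes(program: Callable, test: TestCase) -> bool:
    """Return True if `program` returns the expected output without raising."""
    args, expected = test
    try:
        return program(*args) == expected
    except Exception:
        return False


def is_discriminative(buggy: Callable, fixed: Callable, test: TestCase) -> bool:
    """Assumed criterion: the test fails on the buggy program and passes on the fix."""
    return passes(fixed, test) and not passes(buggy, test)


# Hypothetical programs: the buggy version mishandles negative numbers.
def buggy_abs(x: int) -> int:
    return x  # bug: the sign is never flipped


def fixed_abs(x: int) -> int:
    return -x if x < 0 else x


candidate_tests = [((3,), 3), ((-4,), 4), ((0,), 0)]
discriminative = [
    t for t in candidate_tests if is_discriminative(buggy_abs, fixed_abs, t)
]
print(discriminative)
```

Only the test with a negative input separates the two versions here; the other two pass on both programs and therefore carry no information about the defect.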


Article 30

Title@2025-07-30 (3): The Multi-Agent Fault Localization System Based on Monte Carlo Tree Search Approach

Title: The Multi-Agent Fault Localization System Based on Monte Carlo Tree Search Approach Das Multi-Agent Fault Localization System basiert auf Monte Carlo Tree Search Approach 以蒙特卡洛树搜索方法为基础的多机构错失地方化系统 2507.22800v1

Authors (1): Rui Ren

In real-world scenarios, the highly decoupled and flexible nature of microservices poses greater challenges to system reliability. The more frequent occurrence of incidents has created a demand for Root Cause Analysis (RCA) methods that enable rapid identification of and recovery from incidents. Large language models (LLMs) provide a new path for quickly locating and recovering from incidents by leveraging their powerful generalization ability combined with expert experience. Current LLM-for-RCA frameworks are based on ideas like ReAct and Chain-of-Thought, but LLM hallucination and the propagating nature of anomalies often lead to incorrect localization results. Moreover, the massive amount of anomalous information generated in large, complex systems presents a huge challenge for the context window length of LLMs. To address these challenges, we propose KnowledgeMind, an innovative LLM multi-agent system based on Monte Carlo Tree Search and a knowledge base reward mechanism for standardized service-by-service reasoning. Compared to state-of-the-art (SOTA) LLM-for-RCA methods, our service-by-service exploration approach significantly reduces the burden on the maximum context window length, requiring only one-tenth of its size. Additionally, by incorporating a rule-based real-time reward mechanism, our method effectively mitigates hallucinations during the inference process. Compared to the SOTA LLM-for-RCA framework, our method achieves a 49.29% to 128.35% improvement in root cause localization accuracy.

在真实场景中,微服务高度解耦且灵活的特性给系统可靠性带来了更大的挑战。事故的频繁发生催生了对根因分析(RCA)方法的需求,以便快速识别事故并从中恢复。大语言模型(LLM)凭借强大的泛化能力并结合专家经验,为快速定位事故并从中恢复提供了新的途径。目前用于RCA的LLM框架基于ReAct和思维链(Chain-of-Thought)等思想,但LLM的幻觉和异常的传播特性常常导致错误的定位结果。此外,大型复杂系统中产生的海量异常信息对LLM的上下文窗口长度构成了巨大挑战。为应对这些挑战,我们提出KnowledgeMind,一个基于蒙特卡洛树搜索和知识库奖励机制、进行标准化逐服务推理的创新LLM多智能体系统。与最先进(SOTA)的用于RCA的LLM方法相比,我们的逐服务探索方法显著减轻了对最大上下文窗口长度的负担,只需其十分之一的大小。此外,通过引入基于规则的实时奖励机制,我们的方法有效缓解了推理过程中的幻觉。与SOTA的用于RCA的LLM框架相比,我们的方法在根因定位准确率上取得了49.29%至128.35%的提升。


Article 31

Title@2025-07-30 (3): Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning

Title: Scoring Verifiers: Evaluating Synthetic Verification for Code and Reasoning Bewertungsprüfer: Bewertung der synthetischen Überprüfung für Code und Begründung 标定验证符:评估编码和理由的合成核查 2502.13820v3

Authors (4): Aleksander Ficek, Somshubra Majumdar, Vahid Noroozi, Boris Ginsburg

Synthetic verification techniques such as generating test cases and reward modelling are common ways to enhance the coding capabilities of large language models (LLMs) beyond predefined tests. Additionally, code verification has recently found great success as a critical component in improving the reasoning capability of LLMs via reinforcement learning. In this paper, we propose an approach which can transform existing coding benchmarks into scoring and ranking datasets to evaluate the effectiveness of synthetic verifiers. We also propose multiple metrics to measure different aspects of the synthetic verifiers with the proposed benchmarks. By employing the proposed approach, we release four new benchmarks (HE-R, HE-R+, MBPP-R, and MBPP-R+) and analyze synthetic verification methods with standard, reasoning-based, and reward-based LLMs. Our experiments show that reasoning can significantly improve test case generation and that scaling the number of test cases enhances the verification accuracy.

合成核查技术,如产生测试案例和奖励建模,是提高大型语言模型(LLM)的编码能力,超越预先界定的测试的常见方法。此外,守则核查最近发现,作为通过强化学习提高LLMS推理能力的一个关键组成部分,在通过强化学习提高LMS推理能力方面,取得了巨大成功。在本文件中,我们提出一种方法,可将现有的编码基准转换成评分和排名数据集,以评价合成核查员的效力。我们还提出多种指标,用拟议基准衡量合成核查员的不同方面。通过采用拟议方法,我们发布了四个新的基准(HE-R、HE-R+、MBPP-R和MBPP-R+),并以标准、推理和奖励为基础的LMS分析合成核查方法。我们的实验表明,推理可以大大改进测试案例的生成,扩大测试案例的数量可以提高核查的准确性。


Article 32

Title@2025-07-30 (3): Designing for Self-Regulation in Informal Programming Learning: Insights from a Storytelling-Centric Approach

Title: Designing for Self-Regulation in Informal Programming Learning: Insights from a Storytelling-Centric Approach Designing for Self-Regulation in Informelles Programmieren Lernen: Einblicke aus einem Storytelling-Centric-Ansatz 设计非正规方案拟订学习的自我管理:从讲故事-核心方法的洞察 2507.22671v1

Authors (3): Sami Saeed Alghamdi, Christopher Bull, Ahmed Kharrufa

Many people learn programming independently from online resources and often report struggles in achieving their personal learning goals. Learners frequently describe their experiences as isolating and frustrating, challenged by abundant uncertainties, information overload, and distraction, compounded by limited guidance. At the same time, social media serves as a personal space where many engage in diverse self-regulation practices, including help-seeking, using external memory aids (e.g., self-notes), self-reflection, emotion regulation, and self-motivation. For instance, learners often mark achievements and set milestones through their posts. In response, we developed a system consisting of a web platform and browser extensions to support self-regulation online. The design aims to add learner-defined structure to otherwise unstructured experiences and bring meaning to curation and reflection activities by translating them into learning stories with AI-generated feedback. We position storytelling as an integrative approach to design that connects resource curation, reflective and sensemaking practice, and narrative practices learners already use across social platforms. We recruited 15 informal programming learners who are regular social media users to engage with the system in a self-paced manner; participation concluded upon submitting a learning story and survey. We used three quantitative scales and a qualitative survey to examine users’ characteristics and perceptions of the system’s support for their self-regulation. User feedback suggests the system’s viability as a self-regulation aid. Learners particularly valued in-situ reflection, automated story feedback, and video annotation, while other features received mixed views. We highlight perceived benefits, friction points, and design opportunities for future AI-augmented self-regulation tools.

许多人从在线资源中独立地学习编程,并经常报告他们在实现个人学习目标方面挣扎。 学习者经常把他们的经历描述为孤立和沮丧,受到大量不确定因素、信息过量和分散注意力的挑战,再加上有限的指导。 同时,社交媒体还充当个人空间,让许多人利用外部记忆辅助工具(如自注)、自我反省、情绪调控和自我激励等各种自我调控做法,参与各种自律做法,包括寻求帮助、利用外部记忆辅助工具(如自我调控)、自我反省、自我调控、情绪调控和自我激励。例如,学习者往往通过他们的职位来标志成就和设定里程碑。我们为此,我们开发了一个由网络平台和浏览器扩展组成的系统,以支持在线自我调控。设计的目的是将非结构性经验界定的结构添加到非结构性的经验中,并带来整理和反思活动的意义。 我们用一个综合的方法来设计将资源调控、反映和感知知觉的实践者在社交平台上已经使用的方法。 我们聘用了15个非正式的编程学习者,他们以自制方式参与系统,以自定的自定的反馈; 我们用一个定量的自评的自评的自评和自评的系统,用自评的自评的自评的自评的自评。


Article 33

Title@2025-07-30 (3): RobEthiChor: Automated Context-aware Ethics-based Negotiation for Autonomous Robots

Title: RobEthiChor: Automated Context-aware Ethics-based Negotiation for Autonomous Robots RobEthiChor: Automatisierte kontextorientierte Ethik-basierte Verhandlung für autonome Roboter RobEthiChor:关于自主机器人的基于道德操守的自动内部意识谈判 2507.22664v1

Authors (5): Mashal Afzal Memon, Gianluca Filippone, Gian Luca Scoccia, Marco Autili, Paola Inverardi

The presence of autonomous systems is growing at a fast pace and is impacting many aspects of our lives. Designed to learn and act independently, these systems operate and make decisions without human intervention. However, they lack the ability to incorporate users’ ethical preferences, which are unique to each individual in society and are required to personalize the decision-making processes. This reduces user trust and prevents autonomous systems from behaving according to the moral beliefs of their end-users. When multiple systems with differing ethical preferences interact, they must negotiate to reach an agreement that satisfies the ethical beliefs of all the parties involved and adjust their behavior accordingly. To address this challenge, this paper proposes RobEthiChor, an approach that enables autonomous systems to incorporate user ethical preferences and contextual factors into their decision-making through ethics-based negotiation. RobEthiChor features a domain-agnostic reference architecture for designing autonomous systems capable of ethics-based negotiation. The paper also presents RobEthiChor-Ros, an implementation of RobEthiChor within the Robot Operating System (ROS), which can be deployed on robots to provide them with ethics-based negotiation capabilities. To evaluate our approach, we deployed RobEthiChor-Ros on real robots and ran scenarios where a pair of robots negotiate upon resource contention. Experimental results demonstrate the feasibility and effectiveness of the system in realizing ethics-based negotiation. RobEthiChor allowed robots to reach an agreement in more than 73% of the scenarios with an acceptable negotiation time (0.67s on average). Experiments also demonstrate that the negotiation approach implemented in RobEthiChor is scalable.

自主系统正在快速普及,并影响着我们生活的许多方面。这些系统被设计为能够独立学习和行动,在没有人类干预的情况下运行并做出决策。然而,它们缺乏纳入用户伦理偏好的能力,而这种偏好因人而异,是实现决策过程个性化所必需的。这降低了用户的信任,也使自主系统无法按照最终用户的道德信念行事。当具有不同伦理偏好的多个系统相互交互时,它们必须通过协商达成满足所有相关方伦理信念的协议,并相应调整自己的行为。为应对这一挑战,本文提出RobEthiChor,这种方法使自主系统能够通过基于伦理的协商,将用户的伦理偏好和情境因素纳入决策。RobEthiChor提供了一个与领域无关的参考架构,用于设计具备基于伦理的协商能力的自主系统。本文还介绍了RobEthiChor-Ros,它是RobEthiChor在机器人操作系统(ROS)中的实现,可以部署在机器人上,为其提供基于伦理的协商能力。为了评估我们的方法,我们在真实机器人上部署了RobEthiChor-Ros,并运行了两台机器人在资源争用时进行协商的场景。实验结果证明了该系统在实现基于伦理的协商方面的可行性和有效性。RobEthiChor使机器人在超过73%的场景中达成了协议,协商时间可以接受(平均0.67秒)。实验还表明,RobEthiChor中实现的协商方法具有可扩展性。


Article 34

Title@2025-07-30 (3): A Systematic Literature Review on Detecting Software Vulnerabilities with Large Language Models

Title: A Systematic Literature Review on Detecting Software Vulnerabilities with Large Language Models Ein systematischer Literaturbericht über die Erkennung von Softwarelücken mit großen Sprachmodellen 关于检测与大语言模型的软件脆弱性的系统文献评论 2507.22659v1

Authors (5): Sabrina Kaniewski, Fabian Schmidt, Markus Enzweiler, Michael Menth, Tobias Heer

The increasing adoption of Large Language Models (LLMs) in software engineering has sparked interest in their use for software vulnerability detection. However, the rapid development of this field has resulted in a fragmented research landscape, with diverse studies that are difficult to compare due to differences in, e.g., system designs and dataset usage. This fragmentation makes it difficult to obtain a clear overview of the state-of-the-art or compare and categorize studies meaningfully. In this work, we present a comprehensive systematic literature review (SLR) of LLM-based software vulnerability detection. We analyze 227 studies published between January 2020 and June 2025, categorizing them by task formulation, input representation, system architecture, and adaptation techniques. Further, we analyze the datasets used, including their characteristics, vulnerability coverage, and diversity. We present a fine-grained taxonomy of vulnerability detection approaches, identify key limitations, and outline actionable future research opportunities. By providing a structured overview of the field, this review improves transparency and serves as a practical guide for researchers and practitioners aiming to conduct more comparable and reproducible research. We publicly release all artifacts and maintain a living repository of LLM-based software vulnerability detection studies.

在软件工程中越来越多地采用大语言模型(LLM)已引起人们对软件脆弱性检测的兴趣,然而,该领域的迅速发展已导致研究场面支离破碎,由于系统设计和数据集使用的不同而难以比较,由于系统设计和数据集的使用不同,不同语言模型的破碎使得难以对软件工程中的最新语言模型(LLM)得到清晰的概览,或有意义地比较和分类研究。在这项工作中,我们提出了基于LLM软件脆弱性检测的全面系统文献审查(SLR)。我们分析了2020年1月至2025年6月出版的227项研究报告,按任务设计、投入代表、系统架构和适应技术对其进行分类。此外,我们分析了所使用的数据集,包括其特征、脆弱性覆盖和多样性。我们提出了脆弱性检测方法的精细分类,确定了关键限制,并概述了今后可操作的研究机会。通过对实地进行结构化的概览,这一审查提高了透明度,并成为研究人员和从业人员进行更具有可比性和再生力的研究的实用指南。我们公开发布了所有用于LLM软件脆弱性检测的活存库。


Article 35

Title@2025-07-30 (3): Metamorphic Testing of Deep Code Models: A Systematic Literature Review

Title: Metamorphic Testing of Deep Code Models: A Systematic Literature Review Metamorphische Prüfung von Deep Code Modellen: Ein systematischer Literaturbericht 深代码模型的变形测试:系统文献审查 2507.22610v1

Authors (4): Ali Asgari, Milan de Koning, Pouria Derakhshanfar, Annibale Panichella

Large language models and deep learning models designed for code intelligence have revolutionized the software engineering field due to their ability to perform various code-related tasks. These models can process source code and software artifacts with high accuracy in tasks such as code completion, defect detection, and code summarization; therefore, they can potentially become an integral part of modern software engineering practices. Despite these capabilities, robustness remains a critical quality attribute for deep-code models as they may produce different results under varied and adversarial conditions (e.g., variable renaming). Metamorphic testing has become a widely used approach to evaluate models’ robustness by applying semantic-preserving transformations to input programs and analyzing the stability of model outputs. While prior research has explored testing deep learning models, this systematic literature review focuses specifically on metamorphic testing for deep code models. By studying 45 primary papers, we analyze the transformations, techniques, and evaluation methods used to assess robustness. Our review summarizes the current landscape, identifying frequently evaluated models, programming tasks, datasets, target languages, and evaluation metrics, and highlights key challenges and future directions for advancing the field.

用于代码情报的大型语言模型和深层次学习模型使软件工程领域发生了革命性的变化,因为它们有能力执行各种与代码有关的任务,这些模型可以处理源代码和软件文物,在完成代码、发现缺陷和代码总和等任务中高度精确地处理源代码和软件文物;因此,这些模型有可能成为现代软件工程做法的一个组成部分。尽管具备这些能力,但稳健性仍然是深代码模型的关键质量属性,因为它们可能在不同的敌对条件下(如变名重命名)产生不同的结果。变异性测试已成为一种广泛使用的方法,通过对输入程序应用语义保存转换和分析模型产出稳定性来评价模型的稳健性。虽然以前的研究探索了深层学习模型,但系统文献审查特别侧重于深层代码模型的变形测试。通过研究45份基本文件,我们分析了用于评估强性的评估的变异性、技术和评价方法。我们的审查总结了当前情况,确定了经常评估的模式、方案编制任务、数据集、目标语言和评价指标,并突出了推进实地工作的关键挑战和未来方向。
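The core check in metamorphic testing of code models, as described in the abstract above, can be sketched as follows: apply a semantic-preserving transformation (here, variable renaming) and compare the model's output before and after. The `toy_model` function is a hypothetical stand-in for a real deep code model, and the sketch assumes the new identifier does not collide with an existing name.

```python
import ast


class RenameVariable(ast.NodeTransformer):
    """Rename every occurrence of one identifier (names and function arguments)."""

    def __init__(self, old: str, new: str) -> None:
        self.old, self.new = old, new

    def visit_Name(self, node: ast.Name) -> ast.AST:
        if node.id == self.old:
            node.id = self.new
        return node

    def visit_arg(self, node: ast.arg) -> ast.AST:
        if node.arg == self.old:
            node.arg = self.new
        return node


def rename_variable(source: str, old: str, new: str) -> str:
    """Return `source` with `old` renamed to `new` (requires Python 3.9+)."""
    tree = RenameVariable(old, new).visit(ast.parse(source))
    return ast.unparse(tree)


def toy_model(source: str) -> int:
    """Hypothetical stand-in for a code model: counts return statements."""
    return sum(isinstance(n, ast.Return) for n in ast.walk(ast.parse(source)))


def is_robust(model, source: str, transformed: str) -> bool:
    """Metamorphic relation: the output must not change under the transformation."""
    return model(source) == model(transformed)


original = "def area(w, h):\n    total = w * h\n    return total\n"
variant = rename_variable(original, "total", "result")
print(variant)
print(is_robust(toy_model, original, variant))
```

A real study would replace `toy_model` with a call to the model under test and use a task-specific comparison (for example, equal predicted labels) instead of strict equality.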


Article 36

Title@2025-07-30 (3): RePaCA: Leveraging Reasoning Large Language Models for Static Automated Patch Correctness Assessment

Title: RePaCA: Leveraging Reasoning Large Language Models for Static Automated Patch Correctness Assessment RePaCA: Ausschüttung von mit Gründen versehenen großen Sprachmodellen für statische automatisierte Patch-Korrekturbewertung REPACA:利用大型语言模型进行静态自动补丁正确性评估 2507.22580v1

Authors (4): Marcos Fuster-Pena, David de-Fitero-Dominguez, Antonio Garcia-Cabot, Eva Garcia-Lopez

Automated Program Repair (APR) seeks to automatically correct software bugs without requiring human intervention. However, existing tools tend to generate patches that satisfy test cases without fixing the underlying bug; these are known as overfitting patches. To address this issue, Automated Patch Correctness Assessment (APCA) attempts to identify overfitting patches generated by APR tools. APCA can be approached statically, meaning that no additional information is needed beyond the original and fixed code snippets. Current static techniques often struggle with reliability, flexibility and transparency. To address these issues, we introduce RePaCA, a novel static APCA technique that leverages Large Language Models (LLMs) specialized in thinking tasks. Our model is prompted with both buggy and fixed code snippets and guided to generate a Chain of Thought that analyses code differences, reasons about how the patch addresses the root cause, and ultimately provides a binary classification: correct or overfitting. To enhance these reasoning capabilities for the APCA task specifically, the LLM is finetuned using Reinforcement Learning with the Group Relative Policy Optimization algorithm. When evaluated on a standard Defects4J-derived test, our approach achieves state-of-the-art performance, with 83.1% accuracy and an 84.8% F1-score. Furthermore, our model demonstrates superior generalization capabilities when trained on different datasets, outperforming the leading technique. This reasoning capability also provides enhanced explainability for the patch assessment. These findings underscore the considerable promise of finetuned, reasoning LLMs to advance static APCA by enhancing accuracy, generalization, and explainability.

自动程序修复(APR)旨在无需人工干预的情况下自动修正软件缺陷。然而,现有工具往往生成能通过测试用例却没有修复根本缺陷的补丁,这类补丁被称为过拟合补丁。为解决这一问题,自动补丁正确性评估(APCA)试图识别APR工具生成的过拟合补丁。该任务可以用静态方法求解,即除原始代码片段和修复后的代码片段外不需要任何额外信息。现有的静态技术在可靠性、灵活性和透明度方面往往存在不足。为此,我们提出RePaCA,一种新的静态APCA技术,它利用专门用于推理任务的大语言模型(LLM)。我们向模型同时提供有缺陷的代码片段和修复后的代码片段,并引导它生成思维链,分析代码差异,推理补丁如何处理根本原因,最终给出二元分类:正确或过拟合。为了专门针对APCA任务增强这些推理能力,我们使用强化学习和组相对策略优化(Group Relative Policy Optimization)算法对LLM进行微调。在一个源自Defects4J的标准测试集上评估时,我们的方法达到了最先进的性能,准确率为83.1%,F1分数为84.8%。此外,我们的模型在不同数据集上训练时表现出更强的泛化能力,优于领先的技术。这种推理能力也为补丁评估提供了更好的可解释性。这些结果表明,经过微调的推理型LLM在提高准确性、泛化能力和可解释性方面,对推进静态APCA有很大的潜力。


Article 37

Title@2025-07-30 (3): Inside madupite: Technical Design and Performance

Title: Inside madupite: Technical Design and Performance Inside madupite: Technisches Design und Performance 内部疯狂:技术设计和性能 2507.22538v1

Authors (4): Matilde Gargiani, Robin Sieber, Philip Pawlowsky, John Lygeros

In this work, we introduce and benchmark madupite, a newly proposed high-performance solver designed for large-scale discounted infinite-horizon Markov decision processes with finite state and action spaces. After a brief overview of the class of mathematical optimization methods on which madupite relies, we provide details on implementation choices, technical design and deployment. We then demonstrate its scalability and efficiency by showcasing its performance on the solution of Markov decision processes arising from different application areas, including epidemiology and classical control. Madupite sets a new standard as, to the best of our knowledge, it is the only solver capable of efficiently computing exact solutions for large-scale Markov decision processes, even when these exceed the memory capacity of modern laptops and operate in near-undiscounted settings. This is possible as madupite can work in a fully distributed manner and therefore leverage the memory storage and computation capabilities of modern high-performance computing clusters. This key feature enables the solver to efficiently handle problems of medium to large size in an exact manner instead of necessarily resorting to function approximations. Moreover, madupite is unique in allowing users to customize the solution algorithm to better exploit the specific structure of their problem, significantly accelerating convergence especially in large-discount factor settings. Overall, madupite represents a significant advancement, offering unmatched scalability and flexibility in solving large-scale Markov decision processes.

在这项工作中,我们引入和基准疯狂,这是一个新提出的高性能求解器,它设计为大规模折扣的无限和有限的状态和行动空间的马可夫决定程序。在简要概述疯狂依赖的数学优化方法类别之后,我们提供有关执行选择、技术设计和部署的细节。然后通过展示其在不同应用领域(包括流行病学和传统控制)产生的马尔科夫决定程序的解决方案方面的表现,展示其可缩放性和效率。麦可特建立了一个新标准,根据我们的知识,它是唯一能够高效计算大规模马尔科夫决定程序的确切解决方案的解决方案的解决方案,即使这些解决方案超过了现代膝上型计算机的记忆能力,并在近乎无孔的环境下运作。这有可能是疯狂的,能够以完全分散的方式发挥作用,从而利用现代高性高性计算机群集的存储和计算能力。这一关键特征使解决方案解决者能够以精确的方式高效地处理中等规模至大规模的问题,而不必诉诸功能近似值。此外,疯狂的求解剂是独一无二的,让用户在大规模解决方案的定制性化过程中,在大规模趋同性的总体演算中,能够更迅速地加速地加速地利用其总体解决方案结构。
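For readers unfamiliar with the problem class in the madupite abstract, the sketch below solves a tiny discounted infinite-horizon MDP with finite state and action spaces using plain value iteration. This is a textbook baseline chosen for illustration only; the abstract does not state which optimization methods madupite uses, and the `value_iteration` function, the dictionary layout, and the two-state example are all assumptions.

```python
from typing import Dict, List, Tuple

# transitions[state][action] = list of (probability, next_state, reward)
Transitions = Dict[int, Dict[int, List[Tuple[float, int, float]]]]


def value_iteration(
    transitions: Transitions,
    gamma: float,
    tol: float = 1e-10,
    max_iters: int = 100_000,
) -> Dict[int, float]:
    """Iterate the Bellman optimality operator until the update is below `tol`."""
    values = {s: 0.0 for s in transitions}
    for _ in range(max_iters):
        new_values = {
            s: max(
                sum(p * (r + gamma * values[s2]) for p, s2, r in outcomes)
                for outcomes in actions.values()
            )
            for s, actions in transitions.items()
        }
        delta = max(abs(new_values[s] - values[s]) for s in transitions)
        values = new_values
        if delta < tol:
            break
    return values


# Hypothetical two-state MDP: state 0 can stay (reward 1) or move to state 1
# (reward 0); state 1 can only stay (reward 3).
mdp: Transitions = {
    0: {0: [(1.0, 0, 1.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 1, 3.0)]},
}
values = value_iteration(mdp, gamma=0.5)
print(values)
```

Because the Bellman operator contracts with factor `gamma`, the number of sweeps needed grows as `gamma` approaches 1, which is one reason near-undiscounted settings are computationally demanding.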


Article 38

Title@2025-07-30 (3): Ensemble Fuzzing with Dynamic Resource Scheduling and Multidimensional Seed Evaluation

Title: Ensemble Fuzzing with Dynamic Resource Scheduling and Multidimensional Seed Evaluation Ensemble Fuzzing mit dynamischer Ressourcenplanung und mehrdimensionaler Saatgutbewertung 结合动态资源安排和多层面种子评估 2507.22442v1

Authors (5): Yukai Zhao, Shaohua Wang, Jue Wang, Xing Hu, Xin Xia

Fuzzing is widely used for detecting bugs and vulnerabilities, with various techniques proposed to enhance its effectiveness. To combine the advantages of multiple technologies, researchers proposed ensemble fuzzing, which integrates multiple base fuzzers. Despite promising results, state-of-the-art ensemble fuzzing techniques face limitations in resource scheduling and performance evaluation, leading to unnecessary resource waste. In this paper, we propose Legion, a novel ensemble fuzzing framework that dynamically schedules resources during the ensemble fuzzing campaign. We designed a novel resource scheduling algorithm based on the upper confidence bound algorithm to reduce the resource consumption of ineffective base fuzzers. Additionally, we introduce a multidimensional seed evaluation strategy, which considers multiple metrics to achieve more comprehensive fine-grained performance evaluation. We implemented Legion as a prototype tool and evaluated its effectiveness on Google’s fuzzer-test-suite as well as real-world open-source projects. Results show that Legion outperforms existing state-of-the-art base fuzzers and ensemble fuzzing techniques, detecting 20 vulnerabilities in real-world open-source projects, five previously unknown and three classified as CVEs.

模糊测试被广泛用于检测缺陷和漏洞,已有多种技术被提出以提高其有效性。为了结合多种技术的优势,研究人员提出了集成模糊测试,将多个基础模糊器整合在一起。尽管取得了有希望的结果,最先进的集成模糊测试技术在资源调度和性能评估方面仍存在局限,导致不必要的资源浪费。在本文中,我们提出Legion,一个在集成模糊测试过程中动态调度资源的新型集成模糊测试框架。我们设计了一种基于置信上界算法的新型资源调度算法,以减少无效基础模糊器的资源消耗。此外,我们引入了多维种子评估策略,该策略考虑多个指标,以实现更全面的细粒度性能评估。我们将Legion实现为原型工具,并在Google的fuzzer-test-suite以及真实世界的开源项目上评估了其有效性。结果表明,Legion优于现有最先进的基础模糊器和集成模糊测试技术,在真实世界的开源项目中检测出20个漏洞,其中5个此前未知,3个被归类为CVE。
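The abstract states that Legion's scheduler builds on the upper confidence bound algorithm. As a rough illustration of that idea, the sketch below uses the standard UCB1 rule to decide which base fuzzer receives the next time slice. The reward definition (for instance, a normalized count of new coverage per slice), the `ucb1_select` name, and the exploration constant are assumptions; Legion's actual algorithm is not specified in the abstract.

```python
import math
from typing import List


def ucb1_select(
    pulls: List[int],
    total_reward: List[float],
    c: float = math.sqrt(2),
) -> int:
    """Return the index of the base fuzzer that should get the next time slice.

    pulls[i] is the number of slices fuzzer i has received so far and
    total_reward[i] is its accumulated (hypothetical) reward.
    """
    # Give every fuzzer one slice before comparing scores.
    for i, n in enumerate(pulls):
        if n == 0:
            return i
    total = sum(pulls)
    scores = [
        total_reward[i] / pulls[i] + c * math.sqrt(math.log(total) / pulls[i])
        for i in range(len(pulls))
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Fuzzer 0 has been productive; fuzzer 1 has barely been tried.
print(ucb1_select([100, 1], [50.0, 0.0]))
# With equal exposure, the fuzzer with the higher mean reward wins.
print(ucb1_select([10, 10], [9.0, 1.0]))
```

The exploration bonus shrinks as a fuzzer accumulates slices, so a persistently unproductive fuzzer is selected less and less often, which matches the stated goal of reducing resources spent on ineffective base fuzzers.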


Article 39

Title@2025-07-30 (3): ETrace:Event-Driven Vulnerability Detection in Smart Contracts via LLM-Based Trace Analysis

Title: ETrace:Event-Driven Vulnerability Detection in Smart Contracts via LLM-Based Trace Analysis ETrace: Event-getriebene Sicherheitserkennung in Smart Contracts über LLM-basierte Trace-Analyse ETRAR:通过基于LLM的追踪分析,在智能合同中实现对脆弱性的彻底发现 2506.15790v3

Authors (7): Chenyang Peng, Haijun Wang, Yin Wu, Hao Wu, Ming Fan, Yitao Zhao, Ting Liu

With the advancing application of blockchain technology in various fields, ensuring the security and stability of smart contracts has emerged as a critical challenge. Current security analysis methodologies for vulnerability detection can be categorized into static and dynamic analysis methods. However, these traditional vulnerability detection methods predominantly rely on analyzing the original contract code, and not all smart contracts provide accessible code. We present ETrace, a novel event-driven vulnerability detection framework for smart contracts, which uniquely identifies potential vulnerabilities through LLM-powered trace analysis without requiring source code access. By extracting fine-grained event sequences from transaction logs, the framework leverages Large Language Models (LLMs) as adaptive semantic interpreters to reconstruct event analysis through chain-of-thought reasoning. ETrace implements pattern-matching to establish causal links between transaction behavior patterns and known attack behaviors. Furthermore, we validate the effectiveness of ETrace through preliminary experimental results.

随着在各个领域预先应用链锁技术,确保智能合同的安全和稳定已成为一项关键挑战。目前的脆弱性检测安全分析方法可以分为静态分析和动态分析方法。然而,这些现有的传统脆弱性检测方法主要依赖分析原始合同代码,并非所有智能合同都提供无障碍代码。我们为智能合同介绍了由事件驱动的新颖的智能合同脆弱性检测框架,它通过LLM驱动的追踪分析,在不需要源代码访问的情况下,独特地识别了潜在的脆弱性。通过从交易日志中提取细微事件序列,该框架利用了大语言模型作为适应性语义翻译,通过思潮推理重新进行事件分析。“ETRace”采用模式匹配,以建立交易行为模式与已知攻击行为之间的因果关系。此外,我们还通过初步实验结果验证了ETrace的有效性。


Article 40

Title@2025-07-30 (3): AutoCodeSherpa: Symbolic Explanations in AI Coding Agents

Title: AutoCodeSherpa: Symbolic Explanations in AI Coding Agents AutoCodeSherpa: Symbolische Erklärungen in KI-Coding-Agenten AutoCodeSherpa: AI 编码剂中的符号解释 2507.22414v1

Authors (3): Sungmin Kang, Haifeng Ruan, Abhik Roychoudhury

Large Language Model (LLM) agents autonomously use external tools on top of one or more LLMs to accomplish specific tasks. Lately, LLM agents for software engineering tasks have become popular. These agents can benefit from the use of program analysis tools working on program representations. This is demonstrated by existing agentic AI solutions such as AutoCodeRover or SpecRover, which perform automated program repair. Specifically, the goal of these works is to use program analysis to improve the patch quality. These agents are currently being used to automatically fix static analysis issues from the widely used SonarQube static analyzer. Nevertheless, for the agents to be deployed in a production environment, they need to suggest software artifacts, such as patches, with evidence and with high confidence. In this work, we provide a workflow where an agent provides explanations of the bug in the form of symbolic formulae. The explanations are in the form of input conditions, infection conditions and output conditions, implemented as property-based tests (PBT) and program-internal symbolic expressions. These can help human developers understand the agent outputs and help achieve completely automated agentic workflows for software. The human developer can benefit from the input condition, represented as a PBT, to generate various concrete inputs showing a given issue. Furthermore, since the PBTs are executable, our explanations are executable as well. We can thus also use the explanations in a completely automated issue resolution environment for accepting or rejecting the patches that are suggested by patching agents such as AutoCodeRover. Finally, as agentic AI approaches continue to develop, the program-analysis-driven explanations can be provided to other LLM-based repair techniques such as Agentless to improve their output.

大语言模型(LLM)代理在一个或多个LLM之上自主使用外部工具来完成特定任务。近来,面向软件工程任务的LLM代理已经流行起来。这些代理可以受益于作用在程序表示上的程序分析工具。AutoCodeRover、SpecRover等执行自动程序修复的现有代理式AI方案证明了这一点。具体来说,这些工作的目标是利用程序分析提高补丁质量。这些代理目前正被用于自动修复广泛使用的SonarQube静态分析器报告的静态分析问题。然而,要在生产环境中部署,代理需要在给出补丁等软件制品的同时提供证据,并具有较高的可信度。在这项工作中,我们提供了一种工作流,其中代理以符号公式的形式给出对缺陷的解释。这些解释包括输入条件、感染条件和输出条件,以基于属性的测试(PBT)和程序内部符号表达式的形式实现。它们既有助于开发人员理解代理的输出,也有助于实现完全自动化的软件代理工作流。开发人员可以利用以PBT表示的输入条件,生成展示给定问题的各种具体输入。此外,由于PBT是可执行的,我们的解释也是可执行的。因此,我们还可以在完全自动化的问题解决环境中使用这些解释,来接受或拒绝AutoCodeRover等补丁代理给出的补丁。最后,随着代理式AI方法的不断发展,这种由程序分析驱动的解释也可以提供给Agentless等其他基于LLM的修复技术,以改进其输出。


Article 41

Title@2025-07-30 (3): Scalability, Availability, Reproducibility and Extensibility in Islamic Database Systems

Title: Scalability, Availability, Reproducibility and Extensibility in Islamic Database Systems Skalierbarkeit, Verfügbarkeit, Reproduzierbarkeit und Erweiterbarkeit in islamischen Datenbanksystemen 伊斯兰数据库系统中的可扩展性、可获取性、可复制性和可推广性 2507.22384v1

Authors (4): Umar Siddiqui, Habiba Youssef, Adel Sabour, Mohamed Ali

With the widespread use of software systems and applications that serve the Islamic knowledge domain, several concerns arise. The authenticity and accuracy of the databases that back these systems are questionable. With the excitement that some software developers and amateur researchers may have, false statements and incorrect claims may be made around numerical signs or miracles in the Quran. Reproducibility of these claims may not be addressed by the people making such claims. Moreover, with the increase in the number of users, scalability and availability of these systems become a concern. In addition to all these concerns, extensibility is another major issue. Properly designed systems can be extensible, reusable and built on top of one another, instead of each system being built from scratch every time a new framework is developed. In this paper, we introduce the QuranResearch.Org system and its vision for scalability, availability, reproducibility and extensibility to serve Islamic database systems.

由于为伊斯兰知识领域服务的软件系统和应用的广泛性,出现了若干令人关切的问题:支持这些系统的数据库的权威性和准确性令人怀疑;由于某些软件开发者和业余研究人员可能感到兴奋,可能会围绕《古兰经》中的数字标志或奇迹作出虚假陈述和不正确的主张;提出这种主张的人可能无法处理这些主张的可复制性;此外,随着用户数量的增加,这些系统的可扩展性和可获取性也成为一个令人关切的问题;除了所有这些关切外,可扩展性也是另一个主要问题;适当设计的系统可以推广、可再使用和在另一个系统之上建造,而不是每个系统在开发新框架时从零开始就从零开始建立;在本文件中,我们引入了可扩展性、可获取性、可复制性和可推广性,并可以为伊斯兰数据库系统服务。


Article 42

Title@2025-07-30 (3): SAEL: Leveraging Large Language Models with Adaptive Mixture-of-Experts for Smart Contract Vulnerability Detection

Title: SAEL: Leveraging Large Language Models with Adaptive Mixture-of-Experts for Smart Contract Vulnerability Detection SAEL: Nutzung großer Sprachmodelle mit adaptiven Mixture-of-Experts für Smart Contract Vulnerability Detection SAEL:利用适应性混合专家的大型语言模型利用智能合同脆弱性检测 2507.22371v1

Authors (9): Lei Yu, Shiqi Cheng, Zhirong Huang, Jingyuan Zhang, Chenjie Shen, Junyi Lu, Li Yang, Fengjun Zhang, Jiajia Ma

With the increasing security issues in blockchain, smart contract vulnerability detection has become a research focus. Existing vulnerability detection methods have their limitations: 1) Static analysis methods struggle with complex scenarios. 2) Methods based on specialized pre-trained models perform well on specific datasets but have limited generalization capabilities. In contrast, general-purpose Large Language Models (LLMs) demonstrate impressive ability in adapting to new vulnerability patterns. However, they often underperform on specific vulnerability types compared to methods based on specialized pre-trained models. We also observe that explanations generated by general-purpose LLMs can provide fine-grained code understanding information, contributing to improved detection performance. Inspired by these observations, we propose SAEL, an LLM-based framework for smart contract vulnerability detection. We first design targeted prompts to guide LLMs in identifying vulnerabilities and generating explanations, which serve as prediction features. Next, we apply prompt-tuning on CodeT5 and T5 to process contract code and explanations, enhancing task-specific performance. To combine the strengths of each approach, we introduce an Adaptive Mixture-of-Experts architecture. This dynamically adjusts feature weights via a Gating Network, which selects relevant features using TopK filtering and Softmax normalization, and incorporates a Multi-Head Self-Attention mechanism to enhance cross-feature relationships. This design enables effective integration of LLM predictions, explanation features, and code features through gradient optimization. The loss function jointly considers both independent feature performance and overall weighted predictions. Experiments show that SAEL outperforms existing methods across various vulnerabilities.

现有脆弱性检测方法有其局限性:(1) 静态分析方法难以应对复杂场景。 (2) 以专门预训练模型为基础的方法在具体数据集方面表现良好,但泛化能力有限。相比之下,通用大语言模型(LLMs)在适应新的脆弱性模式方面表现出令人印象深刻的能力。然而,与以专门预训练模型为基础的方法相比,这些模型往往在具体脆弱性类型上表现不佳。我们还注意到,通用LLMs产生的解释可以提供精细的代码理解信息,有助于改进检测性能。在这些观察的启发下,我们提出了基于LLM的智能合约脆弱性检测框架SAEL。我们首先设计有针对性的提示来指导LLMs识别脆弱性并生成解释,作为预测特征。我们随后对CodeT5和T5进行提示调优,以处理合约代码和解释,提升特定任务上的性能。为了结合每种方法的优势,我们引入了自适应混合专家(Adaptive Mixture-of-Experts)架构。该架构通过门控网络(Gating Network)动态调整特征权重。
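
The gating step described in the SAEL abstract (TopK filtering followed by Softmax normalization) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the three "experts", and the gating logits below are all hypothetical, and the Multi-Head Self-Attention component and the training loss are left out.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def topk_gate(gate_scores, k):
    """Keep the k largest gate scores, softmax them, and zero out the rest."""
    ranked = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)
    keep = sorted(ranked[:k])
    weights = softmax([gate_scores[i] for i in keep])
    gated = [0.0] * len(gate_scores)
    for i, w in zip(keep, weights):
        gated[i] = w
    return gated

def combine(expert_probs, gate_scores, k=2):
    """Weighted sum of per-expert vulnerability probabilities."""
    weights = topk_gate(gate_scores, k)
    return sum(w * p for w, p in zip(weights, expert_probs))

# Hypothetical experts: LLM prediction, explanation features, code features.
expert_probs = [0.9, 0.6, 0.2]
gate_scores = [2.0, 2.0, -1.0]  # made-up gating logits, not learned values

print(topk_gate(gate_scores, k=2))
print(combine(expert_probs, gate_scores, k=2))
```

In a trained model the gating logits would come from a learned network rather than being fixed by hand; the sketch only shows how filtering and normalization turn them into mixing weights.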


Article 43

Title@2025-07-30 (3): From Articles to Code: On-Demand Generation of Core Algorithms from Scientific Publications

Title: From Articles to Code: On-Demand Generation of Core Algorithms from Scientific Publications Von Artikeln zum Code: On-Demand-Erzeugung von Kernalgorithmen aus wissenschaftlichen Publikationen 《从条款到守则:科学出版物核心数值的按需生成》 2507.22324v1

Authors (3): Cameron S. Movassaghi, Amanda Momenzadeh, Jesse G. Meyer

Maintaining software packages imposes significant costs due to dependency management, bug fixes, and versioning. We show that rich method descriptions in scientific publications can serve as standalone specifications for modern large language models (LLMs), enabling on-demand code generation that could supplant human-maintained libraries. We benchmark state-of-the-art models (GPT-o4-mini-high, Gemini Pro 2.5, Claude Sonnet 4) by tasking them with implementing a diverse set of core algorithms drawn from original publications. Our results demonstrate that current LLMs can reliably reproduce package functionality with performance indistinguishable from conventional libraries. These findings foreshadow a paradigm shift toward flexible, on-demand code generation and away from static, human-maintained packages, which will result in reduced maintenance overhead by leveraging published articles as sufficient context for the automated implementation of analytical workflows.

维护软件包由于依赖性管理、错误修正和版本化而需要大量费用。我们显示科学出版物中丰富的方法描述可以作为现代大型语言模型(LLMs)的独立规格,使按需生成代码能够取代维护人类的图书馆。我们以最先进的模型(GPT-o4-mini-high、Gemini Pro 2.5、Claude Sonnet 4)为基准,责成这些模型执行从原始出版物中提取的一套不同的核心算法。我们的结果表明,目前的LLMs可以可靠地复制软件包功能,其性能与传统图书馆无法区分。这些发现预示着范式的转变将转向灵活、按需生成代码并远离静态、维护人类的软件包,这将通过利用已出版的文章作为自动实施分析工作流程的充分背景而减少维护间接费用。


Article 44

Title@2025-07-29 (2): Automated Prompt Engineering for Cost-Effective Code Generation Using Evolutionary Algorithm

Title: Automated Prompt Engineering for Cost-Effective Code Generation Using Evolutionary Algorithm Automatisierte Prompt-Engineering für kosteneffiziente Code-Generierung mit evolutionärem Algorithmus 利用进化算法为成本效率高的代码生成自动快速工程 2408.11198v2

Authors (5): Hamed Taherkhani, Melika Sepindband, Hung Viet Pham, Song Wang, Hadi Hemmati

Large Language Models have seen increasing use in various software development tasks, especially in code generation. The most advanced recent methods attempt to incorporate feedback from code execution into prompts to help guide LLMs in generating correct code in an iterative process. While effective, these methods can be costly due to numerous interactions with the LLM and extensive token usage. To address this issue, we propose an alternative approach named Evolutionary Prompt Engineering for Code (EPiC), which leverages a lightweight evolutionary algorithm to refine the original prompts into improved versions that generate high-quality code, with minimal interactions with the LLM. Our evaluation against state-of-the-art (SOTA) LLM-based code generation agents shows that EPiC not only achieves up to 6% improvement in pass@k but is also 2-10 times more cost-effective than the baselines.

大型语言模型在各种软件开发任务中,特别是在代码生成方面,已越来越多地使用大型语言模型。最新最先进的方法试图将代码执行过程中的反馈纳入提示,以帮助引导LLMs在迭接过程中生成正确的代码。这些方法虽然有效,但由于与LLM的多次互动和大量象征性使用,这些方法可能代价高昂。为了解决这一问题,我们建议了另一种方法,即代号“进化快速工程”(EPic),它利用一种轻量级的进化算法,将原始提示改进为更完善的版本,产生高质量的代码,与LLM的互动极少。我们对基于最新工艺(SOTA)LM的代码生成代理器的评估表明,EPic不仅在通行证@k上实现了高达6%的改进,而且比基线还提高了2-10倍的成本效益。
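
The idea of refining a prompt with a lightweight evolutionary algorithm can be sketched as a small mutate-and-select loop. Everything here is an assumption for illustration: the hint list, the `mutate` operator, and `toy_fitness` (a stand-in for "fraction of tests passed by code generated from this prompt") are hypothetical and do not reproduce EPiC's actual operators or evaluation.

```python
import random

# Hypothetical hint phrases that a mutation may append to a prompt.
HINTS = ["handle empty input", "return a list", "use integer division", "validate arguments"]

def mutate(prompt, rng):
    """Produce a prompt variant by appending one hint that is not yet present."""
    unused = [h for h in HINTS if h not in prompt]
    if not unused:
        return prompt
    return prompt + "; " + rng.choice(unused)

def toy_fitness(prompt):
    """Stand-in score; a real system would generate code and run tests here."""
    useful = ("handle empty input", "return a list")
    return sum(h in prompt for h in useful) / len(useful)

def evolve(seed_prompt, fitness, generations=5, population_size=4, seed=0):
    """Elitist evolutionary loop: mutate, merge with parents, keep the best."""
    rng = random.Random(seed)
    population = [seed_prompt]
    for _ in range(generations):
        children = [mutate(rng.choice(population), rng) for _ in range(population_size)]
        # Deduplicate while preserving order so runs are reproducible.
        candidates = list(dict.fromkeys(population + children))
        population = sorted(candidates, key=fitness, reverse=True)[:population_size]
        if fitness(population[0]) == 1.0:
            break
    return population[0]

seed_prompt = "Write a function that splits a list into chunks"
best = evolve(seed_prompt, toy_fitness)
print(best, toy_fitness(best))
```

Because parents stay in the candidate pool, the best score never decreases between generations. In a real setting the fitness call is the expensive part, so keeping the population and generation counts small is one way to limit LLM interactions.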


Article 45

Title@2025-07-29 (2): Toward Automated Validation of Language Model Synthesized Test Cases using Semantic Entropy

Title: Toward Automated Validation of Language Model Synthesized Test Cases using Semantic Entropy Zur automatisierten Validierung von Sprachmodellen synthesierten Testfällen mit semantischer Entropie 争取使用语义通则对语言模拟合成试验案例进行自动验证 2411.08254v2

Authors (6): Hamed Taherkhani, Jiho Shin, Muhammad Ammar Tahir, Md Rakib Hossain Misu, Vineet Sunil Gattani, Hadi Hemmati

Modern Large Language Model (LLM)-based programming agents often rely on test execution feedback to refine their generated code. These tests are synthetically generated by LLMs. However, LLMs may produce invalid or hallucinated test cases, which can mislead feedback loops and degrade the performance of agents in refining and improving code. This paper introduces VALTEST, a novel framework that leverages semantic entropy to automatically validate test cases generated by LLMs. Analyzing the semantic structure of test cases and computing entropy-based uncertainty measures, VALTEST trains a machine learning model to classify test cases as valid or invalid and filters out invalid test cases. Experiments on multiple benchmark datasets and various LLMs show that VALTEST not only boosts test validity by up to 29% but also improves code generation performance, as evidenced by significant increases in pass@1 scores. Our extensive experiments also reveal that semantic entropy is a reliable indicator to distinguish between valid and invalid test cases, which provides a robust solution for improving the correctness of LLM-generated test cases used in software testing and code generation.

现代大语言模型(LLM)基于现代语言模型(LLM)的编程代理商往往依靠测试执行反馈来完善其生成的代码。这些测试是由LLMS合成产生的。然而,LLMS可能会产生无效或幻觉测试案例,从而误导反馈环,降低代理商在精炼和改进代码方面的性能。本文件介绍了VALTEST,这是一个创新框架,它利用语义酶酶来自动验证LLMS生成的测试案例。分析测试案例的语义结构以及计算酶基不确定性措施,VALTEST培训一个机器学习模型,将测试案例分类为有效或无效,过滤无效测试案例。关于多个基准数据集和各种LLMS的实验显示,VALTEST不仅提高了测试效力高达29%,而且还提高了代码生成性,这从通过过分数@1的大幅增长可以看出。我们进行的广泛实验还表明,语义酶是一种可靠的指标,可以区分有效和无效的测试案例,为改进软件测试和代码生成中使用的LM所生成的测试案例的正确性提供了有力的解决办法。
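
As a rough illustration of an entropy-based uncertainty measure for generated test cases, the sketch below samples several candidate expected outputs for one assertion, groups identical answers, and computes Shannon entropy over the groups. It is a simplification under stated assumptions: exact string equality stands in for semantic equivalence, and a fixed threshold replaces the trained classifier that the abstract describes. The test names and samples are made up.

```python
import math
from collections import Counter

def semantic_entropy(sampled_outputs):
    """Shannon entropy in bits over groups of identical sampled answers."""
    counts = Counter(sampled_outputs)
    total = len(sampled_outputs)
    return sum((c / total) * math.log2(total / c) for c in counts.values())

def filter_tests(candidates, threshold=1.0):
    """Keep test cases whose sampled expected outputs mostly agree."""
    return [name for name, samples in candidates.items()
            if semantic_entropy(samples) < threshold]

# Hypothetical samples of the expected value for two generated assertions.
candidates = {
    "test_add_small": ["3", "3", "3", "3"],
    "test_add_overflow": ["0", "-1", "255", "256"],
}

for name, samples in candidates.items():
    print(name, semantic_entropy(samples))
print(filter_tests(candidates))
```

Low entropy means the model keeps producing the same expected value, which this sketch treats as a sign that the test is more likely valid; high entropy flags a test for removal.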


Article 46

Title@2025-07-29 (2): Secure coding for web applications: Frameworks, challenges, and the role of LLMs

Title: Secure coding for web applications: Frameworks, challenges, and the role of LLMs Sichere Codierung für Web-Anwendungen: Frameworks, Herausforderungen und die Rolle von LLMs 网络应用的安全编码:框架、挑战和LLMM的作用 2507.22223v1

Authors (3): Kiana Kiashemshaki, Mohammad Jalili Torkamani, Negin Mahmoudi

Secure coding is a critical yet often overlooked practice in software development. Despite extensive awareness efforts, real-world adoption remains inconsistent due to organizational, educational, and technical barriers. This paper provides a comprehensive review of secure coding practices across major frameworks and domains, including web development, DevSecOps, and cloud security. It introduces a structured framework comparison and categorizes threats aligned with the OWASP Top 10. Additionally, we explore the rising role of Large Language Models (LLMs) in evaluating and recommending secure code, presenting a reproducible case study across four major vulnerability types. This paper offers practical insights for researchers, developers, and educators on integrating secure coding into real-world development processes.

安全编码是软件开发中一个重要的但往往被忽视的做法。尽管作出了广泛的提高认识努力,但由于组织、教育和技术方面的障碍,现实世界的采用仍然不一致。本文件全面审查了包括网络开发、DevsecOps和云安全在内的主要框架和领域的安全编码做法。本文件介绍了结构化的框架比较,并按《世界安全伙伴关系》的前身10对威胁进行了分类。此外,我们探讨了大语言模型在评价和建议安全编码方面日益发挥作用,提出了四种主要脆弱性类型的可复制案例研究。本文件为研究人员、开发人员和教育工作者提供了关于将安全编码纳入现实世界发展进程的实际见解。


Article 47

Title@2025-07-29 (2): Runtime Failure Hunting for Physics Engine Based Software Systems: How Far Can We Go?

Title: Runtime Failure Hunting for Physics Engine Based Software Systems: How Far Can We Go? Runtime Failure Hunting for Physics Engine Based Software Systems: Wie weit können wir gehen? 物理引擎软件系统运行时失灵追寻:我们能走多远? 2507.22099v1

Authors (4): Shuqing Li, Qiang Chen, Xiaoxue Ren, Michael R. Lyu

Physics Engines (PEs) are fundamental software frameworks that simulate physical interactions in applications ranging from entertainment to safety-critical systems. Despite their importance, PEs suffer from physics failures, deviations from expected physical behaviors that can compromise software reliability, degrade user experience, and potentially cause critical failures in autonomous vehicles or medical robotics. Current testing approaches for PE-based software are inadequate, typically requiring white-box access and focusing on crash detection rather than semantically complex physics failures. This paper presents the first large-scale empirical study characterizing physics failures in PE-based software. We investigate three research questions addressing the manifestations of physics failures, the effectiveness of detection techniques, and developer perceptions of current detection practices. Our contributions include: (1) a taxonomy of physics failure manifestations; (2) a comprehensive evaluation of detection methods including deep learning, prompt-based techniques, and large multimodal models; and (3) actionable insights from developer experiences for improving detection approaches. To support future research, we release PhysiXFails, code, and other materials at https://sites.google.com/view/physics-failure-detection.

物理引擎(PE)是模拟娱乐到安全临界系统等应用中物理互动的基本软件框架。尽管它很重要,但PE受到物理故障、偏离预期物理行为的影响,从而可能损害软件可靠性、降低用户经验,并有可能造成自主车辆或医疗机器人的重大故障。目前对PE软件的测试方法不足,通常需要白箱访问,并侧重于崩溃检测,而不是语义复杂的物理故障。本文介绍了刻画PE软件中物理故障特征的首次大规模实证研究。我们调查了三个研究问题,分别涉及物理故障的表现、探测技术的有效性以及开发者对当前探测做法的看法。我们的贡献包括:(1)物理故障表现的分类;(2)对探测方法的全面评价,包括深度学习、基于提示的技术以及大型多模态模型;(3)从开发者的经验中获得的关于改进探测方法的可操作性见解。为了支持未来的研究,我们在 https://sites.google.com/view/physics-failure-detection 上公布了PhysiXFails、代码和其他材料。


Article 48

Title@2025-07-29 (2): Fine-Tuning Code Language Models to Detect Cross-Language Bugs

Title: Fine-Tuning Code Language Models to Detect Cross-Language Bugs Fine-Tuning-Code-Sprachenmodelle zur Erkennung von Cross-Language-Fehlern 用于检测跨语言错误的精细调整代码语言模型 2507.21954v1

Authors (7): Zengyang Li, Yimeng Li, Binbin Huang, Peng Liang, Ran Mo, Hui Liu, Yutao Ma

Multilingual programming, which involves using multiple programming languages (PLs) in a single project, is increasingly common due to its benefits. However, it introduces cross-language bugs (CLBs), which arise from interactions between different PLs and are difficult to detect by single-language bug detection tools. This paper investigates the potential of pre-trained code language models (CodeLMs) in CLB detection. We developed CLCFinder, a cross-language code identification tool, and constructed a CLB dataset involving three PL combinations (Python-C/C++, Java-C/C++, and Python-Java) with nine interaction types. We fine-tuned 13 CodeLMs on this dataset and evaluated their performance, analyzing the effects of dataset size, token sequence length, and code comments. Results show that all CodeLMs performed poorly before fine-tuning, but exhibited varying degrees of performance improvement after fine-tuning, with UniXcoder-base achieving the best F1 score (0.7407). Notably, small fine-tuned CodeLMs tended to perform better than large ones. CodeLMs fine-tuned on single-language bug datasets performed poorly on CLB detection, demonstrating the distinction between CLBs and single-language bugs. Additionally, increasing the fine-tuning dataset size significantly improved performance, while longer token sequences did not necessarily improve the model performance. The impact of code comments varied across models. Some fine-tuned CodeLMs’ performance was improved, while others showed degraded performance.

多语言编程涉及在一个单一项目中使用多种编程语言(PLs),由于它的好处而越来越普遍。然而,它引入了跨语言的错误(CLBs),这些错误来自不同PLs之间的互动,很难用单一语言的错误检测工具来检测。本文调查了CLB检测中经过预先训练的代码语言模型(CodeLMs)的潜力。我们开发了一个跨语言代码识别工具CLCFInder,并建造了一个CLB数据集,涉及三种PL组合(Python-C/C++、Java-C/C++和Python-Java),有9个互动类型。我们精细调了13个代码LMs,并评估了它们的业绩,分析了数据集大小、符号序列长度和代码评论。结果显示,所有代码LMs在微调之前表现不佳,但经过微调后表现有不同程度的改进,UnXcoder(0.407 ) 明显,微调一些精调的coLMs往往比大型的绩效更好,同时演示了CLMs的成绩。


Article 49

Title@2025-07-29 (2): DeepGo: Predictive Directed Greybox Fuzzing

Title: DeepGo: Predictive Directed Greybox Fuzzing DeepGo: Predictive Directed Greybox Fuzzing 深度Go:预测方向灰盒模糊 2507.21952v1

Authors (6): Peihong Lin, Pengfei Wang, Xu Zhou, Wei Xie, Gen Zhang, Kai Lu

The state-of-the-art DGF techniques redefine and optimize the fitness metric to reach the target sites precisely and quickly. However, optimizations for fitness metrics are mainly based on heuristic algorithms, which usually rely on historical execution information and lack foresight on paths that have not been exercised yet. Thus, those hard-to-execute paths with complex constraints would hinder DGF from reaching the targets, making DGF less efficient. In this paper, we propose DeepGo, a predictive directed grey-box fuzzer that can combine historical and predicted information to steer DGF to reach the target site via an optimal path. We first propose the path transition model, which models DGF as a process of reaching the target site through specific path transition sequences. The new seed generated by mutation would cause the path transition, and the path corresponding to the high-reward path transition sequence indicates a high likelihood of reaching the target site through it. Then, to predict the path transitions and the corresponding rewards, we use deep neural networks to construct a Virtual Ensemble Environment (VEE), which gradually imitates the path transition model and predicts the rewards of path transitions that have not been taken yet. To determine the optimal path, we develop a Reinforcement Learning for Fuzzing (RLF) model to generate the transition sequences with the highest sequence rewards. The RLF model can combine historical and predicted path transitions to generate the optimal path transition sequences, along with the policy to guide the mutation strategy of fuzzing. Finally, to exercise the high-reward path transition sequence, we propose the concept of an action group, which comprehensively optimizes the critical steps of fuzzing to realize the optimal path to reach the target efficiently.

最先进的DGF技术重新定义和优化最佳健身标准,以便准确和迅速地到达目标地点。然而,健身度量量的优化主要基于超自然算法,通常依赖历史执行信息,在尚未运行的路径上缺乏远见。因此,那些具有复杂限制的难以执行路径会阻碍DGF达到目标,使DGF效率降低。在本文件中,我们提议“DeepGo”,一个预测性的、定向的灰箱毛发器,可以将历史和预测的信息结合起来,引导DGF通过最佳的道路到达目标地点。我们首先提出路径过渡模式,将DGF模型作为通过具体路径过渡步骤到达目标地点的过程。突变产生的新种子将导致路径过渡,而与高回归路径过渡序列相对应的路径则表明到达目标地点的高度可能性。然后,为了预测路径过渡模式的过渡和相应的回报,我们可以利用深度的神经网络来构建一个虚拟Engemble Enble 环境(VEE),它正在逐渐模仿路径的过渡模式和预测路径的过渡路径,最后的路径将走向。


Article 50

Title@2025-07-29 (2): Vibe Coding as a Reconfiguration of Intent Mediation in Software Development: Definition, Implications, and Research Agenda

Title: Vibe Coding as a Reconfiguration of Intent Mediation in Software Development: Definition, Implications, and Research Agenda Vibe Coding als Rekonfiguration von Intent Mediation in der Software-Entwicklung: Definition, Implikationen und Forschungsagenda 作为软件开发中内在调解的重组:定义、影响和研究议程 2507.21928v1

Authors (5): Christian Meske, Tobias Hermanns, Esther von der Weiden, Kai-Uwe Loser, Thorsten Berger

Software development is undergoing a fundamental transformation as vibe coding becomes widespread, with large portions of contemporary codebases now being AI-generated. The disconnect between rapid adoption and limited conceptual understanding highlights the need for an inquiry into this emerging paradigm. Drawing on an intent perspective and historical analysis, we define vibe coding as a software development paradigm where humans and generative AI engage in collaborative flow to co-create software artifacts through natural language dialogue, shifting the mediation of developer intent from deterministic instruction to probabilistic inference. By intent mediation, we refer to the fundamental process through which developers translate their conceptual goals into representations that computational systems can execute. Our results show that vibe coding reconfigures cognitive work by redistributing epistemic labor between humans and machines, shifting the expertise in the software development process away from traditional areas such as design or technical implementation toward collaborative orchestration. We identify key opportunities, including democratization, acceleration, and systemic leverage, alongside risks, such as black box codebases, responsibility gaps, and ecosystem bias. We conclude with a research agenda spanning human-, technology-, and organization-centered directions to guide future investigations of this paradigm.

随着氛围编码的普及,大量当代代码库正在发生根本性的转变。快速采用和有限概念理解之间的脱节突出表明有必要对这一新兴模式进行调查。我们根据意向观点和历史分析,将氛围编码定义为一种软件开发范例,人类和基因化的AI通过自然语言对话参与合作流,共同创建软件文物,将开发者意图的调解从确定性指令转变为概率推断。通过意向调解,我们指的是开发者将其概念目标转化为计算系统可以实施的表述的基本过程。我们的结果显示,通过将人类和机器之间的传导劳动力重新分配,将软件开发过程中的专门知识从传统领域,如设计或技术实施转向协作管弦化。我们确定了关键的机会,包括民主化、加速和系统性杠杆,以及风险,如黑盒代码库、责任差距和生态系统偏差。我们以覆盖人类、技术和组织中心方向的研究议程结尾,以指导这一模式的未来调查。


Article 51

Title@2025-07-29 (2): LLM-based Content Classification Approach for GitHub Repositories by the README Files

Title: LLM-based Content Classification Approach for GitHub Repositories by the README Files LLM-basierter Content-Klassifikationsansatz für GitHub-Repositories durch die README-Dateien README 文件中基于LLM的GitHub储存库内容分类方法 2507.21899v1

Authors (4): Malik Uzair Mehmood, Shahid Hussain, Wen Li Wang, Muhammad Usama Malik

GitHub is the world’s most popular platform for storing, sharing, and managing code. Every GitHub repository has a README file associated with it. The README files should contain project-related information as per the recommendations of GitHub to support the usage and improvement of repositories. However, GitHub repository owners sometimes neglect these recommendations. This prevents a GitHub repository from reaching its full potential. This research posits that the comprehensiveness of a GitHub repository’s README file significantly influences its adoption and utilization, with a lack of detail potentially hindering widespread engagement and impact within the research community. Large Language Models (LLMs) have shown great performance in many text-based tasks including text classification, text generation, text summarization and text translation. In this study, an approach is developed to fine-tune LLMs for automatically classifying different sections of GitHub README files. Three encoder-only LLMs are utilized, including BERT, DistilBERT and RoBERTa. These pre-trained models are then fine-tuned based on a gold-standard dataset consisting of 4226 README file sections. This approach outperforms current state-of-the-art methods and has achieved an overall F1 score of 0.98. Moreover, we have also investigated the use of Parameter-Efficient Fine-Tuning (PEFT) techniques like Low-Rank Adaptation (LoRA) and shown them to be an economical alternative to full fine-tuning without compromising much performance. The results demonstrate the potential of using LLMs in designing an automatic classifier for categorizing the content of GitHub README files. Consequently, this study contributes to the development of automated tools for GitHub repositories to improve their identification and potential usage.

GitHub 是世界上最受欢迎的存储、共享和管理代码平台。 每个 GitHub 仓库都有一个与此相关的 README 文件。 README 文件应该包含与项目相关的信息, 正如 GitHub 的建议那样, 支持存储器的使用和改进。 然而, GitHub 仓库所有者有时忽略了这些建议。 这使 GitHub 仓库无法实现其全部潜力。 这项研究认为, GitHub 仓库的 README 文件的全面性会大大影响其采用和使用, 缺乏细节有可能阻碍其在研究界内的广泛接触和影响。 大语言模型(LLMs) 在许多基于文本的任务中表现出色, 包括文本分类、 文本生成、 文本摘要和文本翻译。


Article 52

Title@2025-07-29 (2): The Impact of Foundational Models on Patient-Centric e-Health Systems

Title: The Impact of Foundational Models on Patient-Centric e-Health Systems Die Auswirkungen von Basismodellen auf die e-Health-Systeme von Patienten-Centric 基础模型对病人中心电子保健系统的影响 2507.21882v1

Authors (3): Elmira Onagh, Alireza Davoodi, Maleknaz Nayebi

As Artificial Intelligence (AI) becomes increasingly embedded in healthcare technologies, understanding the maturity of AI in patient-centric applications is critical for evaluating its trustworthiness, transparency, and real-world impact. In this study, we investigate the integration and maturity of AI features in 116 patient-centric healthcare applications. Using Large Language Models (LLMs), we extracted key functional features, which we then categorized into different stages of the Gartner AI maturity model. Our results show that over 86.21% of applications remain at the early stages of AI integration, while only 13.79% demonstrate advanced AI integration.

随着人工智能(AI)日益深入到卫生保健技术中,了解以病人为中心的应用中的人工智能成熟度对于评价其可信度、透明度和真实世界影响至关重要。在本研究中,我们调查了人工智能在116个以病人为中心的医疗应用中的整合和成熟性特征。我们利用大语言模型(LLMs)提取了关键功能特征,然后将其分为Gartner AI成熟模型的不同阶段。我们的结果显示,超过86.21%的应用仍然处于人工智能整合的早期阶段,而只有13.79%显示高级人工智能整合。


Article 53

Title@2025-07-29 (2): Out of Distribution, Out of Luck: How Well Can LLMs Trained on Vulnerability Datasets Detect Top 25 CWE Weaknesses?

Title: Out of Distribution, Out of Luck: How Well Can LLMs Trained on Vulnerability Datasets Detect Top 25 CWE Weaknesses? Out of Distribution, Out of Luck: Wie gut können LLMs auf Sicherheitsdatensätze trainieren Top 25 CWE-Schwächen erkennen? 发售之外,运气不好:如何训练LLMs研究检测25大CWE弱点的脆弱性数据集? 2507.21817v1

Authors (19): Yikun Li, Ngoc Tan Bui, Ting Zhang, Martin Weyssow, Chengran Yang, Xin Zhou, Jinfeng Jiang, Junkai Chen, Huihui Huang, Huu Hung Nguyen, Chiok Yew Ho, Jie Tan, Ruiyin Li, Yide Yin, Han Wei Ang, Frank Liauw, Eng Lieh Ouh, Lwin Khin Shar, David Lo

Automated vulnerability detection research has made substantial progress, yet its real-world impact remains limited. Current vulnerability datasets suffer from issues including label inaccuracy rates of 20-71%, extensive duplication, and poor coverage of critical CWE types. These issues create a significant “generalization gap” where models achieve misleading self-testing performance (measured on held-out data from the same dataset used for training) by exploiting spurious correlations rather than learning true vulnerability patterns. Our analysis reveals that many models experience substantial performance drops of up to 40.6% when evaluated on independent data, sometimes underperforming random guessing. To address these limitations, we present a three-part solution. First, we introduce a manually curated test dataset, BenchVul, covering the MITRE Top 25 Most Dangerous CWEs. Second, we construct a high-quality training dataset, TitanVul, comprising 35,045 functions by aggregating seven public sources and applying deduplication and validation using a novel multi-agent LLM framework. Third, we propose a Realistic Vulnerability Generation (RVG) framework, which synthesizes context-aware vulnerability examples for underrepresented but critical CWE types through simulated development workflows. Our evaluation shows the strengths of each component in closing the generalization gap. First, BenchVul shows the limitations of self-testing: models trained on existing datasets, such as BigVul and PrimeVul, experience performance drops on BenchVul (from 0.776 to 0.519 and from 0.567 to 0.337). Second, training models on TitanVul demonstrates improved generalization, with model performance increasing from 0.584 when evaluated on the same dataset to 0.767 when tested on BenchVul. Third, supplementing TitanVul with RVG-generated data yields further gains, increasing model performance by 14.0% to 0.874.

自动脆弱性检测研究取得了显著进展,然而其真实世界影响仍然有限。当前的脆弱性数据集存在问题,包括标签不准确率20-71%、广泛重复和关键CWE类型覆盖面差等问题。这些问题造成了巨大的“普及差距 ” , 模型通过利用虚假的关联而不是学习真实的脆弱性模式,实现了误导自我测试性能(根据来自相同培训数据集的搁置数据衡量 ) 。 我们的分析表明,许多模型在独立数据评估时,性能下降高达40.6%,有时表现不佳随机测算。 为了应对这些限制,我们提出了一个三部分解决方案。 首先,我们引入了人工调整的测试数据集,涵盖MITARE 25 最危险的CWE。 其次,我们构建了一个高质量的培训数据集,即TetamVul,由35,045功能组成,利用新的多剂LM框架进行分解和验证。 第三,我们建议采用RecialVVVD(RVG) 改进后, 将环境-awa 脆弱性示例综合起来,用于代表但至关重要的CWE(CWE) 版本中的每一类模拟业绩测试阶段的自我评估。
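
The percentages in this abstract can be checked against the raw scores it reports. The short sketch below computes the relative change for each pair of numbers; the helper name `relative_change` is ours, and the abstract does not name the metric behind the scores, so the check is purely arithmetic.

```python
def relative_change(before, after):
    """Relative change from `before` to `after`, as a fraction of `before`."""
    return (after - before) / before

# Self-testing score vs. score on BenchVul, as reported in the abstract.
drops = {"BigVul": (0.776, 0.519), "PrimeVul": (0.567, 0.337)}
for name, (self_test, benchvul) in drops.items():
    print(f"{name}: {relative_change(self_test, benchvul):+.1%}")

# TitanVul alone on BenchVul vs. TitanVul supplemented with RVG data.
print(f"TitanVul + RVG: {relative_change(0.767, 0.874):+.1%}")
```

Under this reading, the PrimeVul drop comes to about 40.6%, which is consistent with the "up to 40.6%" figure, and the gain from 0.767 to 0.874 comes to about 14.0% in relative terms.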


Article 54

Title@2025-07-29 (2): ChartMark: A Structured Grammar for Chart Annotation

Title: ChartMark: A Structured Grammar for Chart Annotation ChartMark: Eine strukturierte Grammatik für Chart-Annotation 图表 Mark: 用于图表注释的结构性语法 2507.21810v1

Authors (7): Yiyu Chen, Yifan Wu, Shuyu Shen, Yupeng Xie, Leixian Shen, Hui Xiong, Yuyu Luo

Chart annotations enhance visualization accessibility but suffer from fragmented, non-standardized representations that limit cross-platform reuse. We propose ChartMark, a structured grammar that separates annotation semantics from visualization implementations. ChartMark features a hierarchical framework mapping onto annotation dimensions (e.g., task, chart context), supporting both abstract intents and precise visual details. Our toolkit demonstrates converting ChartMark specifications into Vega-Lite visualizations, highlighting its flexibility, expressiveness, and practical applicability.

图表说明提高了可视化的可视性,但存在限制跨平台再利用的支离破碎、非标准化的表征。我们提出了ChartMark,这是一个结构化的语法,将注解语义与可视化实施区分开来。ChartMark具有向注解维度(例如任务、图表背景)绘制的等级框架,支持抽象意图和准确的直观细节。我们的工具包显示将图 Mark规格转换成Vega-Lite可视化,突出显示其灵活性、表现性和实用性。


Article 55

Title@2025-07-29 (2): Identification of Design Recommendations for Augmented Reality Authors in Corporate Training

Title: Identification of Design Recommendations for Augmented Reality Authors in Corporate Training Identifizierung von Design-Empfehlungen für Augmented Reality-Autoren in der Unternehmensschulung 确定公司培训中增加现实作者的设计建议 2507.21722v1

Authors (3): Stefan Graser, Martin Schrepp, Stephan Böhm

Innovative technologies, such as Augmented Reality (AR), introduce new interaction paradigms, demanding the identification of software requirements during the software development process. In general, design recommendations are related to this, supporting the design of applications positively and meeting stakeholder needs. However, current research lacks context-specific AR design recommendations. This study addresses this gap by identifying and analyzing practical AR design recommendations relevant to the evaluation phase of the User-Centered Design (UCD) process. We rely on an existing dataset of Mixed Reality (MR) design recommendations. We applied a multi-method approach by (1) extending the dataset with AR-specific recommendations published since 2020, (2) classifying the identified recommendations using a NLP classification approach based on a pre-trained Sentence Transformer model, (3) summarizing the content of all topics, and (4) evaluating their relevance concerning AR in Corporate Training (CT) both based on a qualitative Round Robin approach with five experts. As a result, an updated dataset of 597 practitioner design recommendations, classified into 84 topics, is provided with new insights into their applicability in the context of AR in CT. Based on this, 32 topics with a total of 284 statements were evaluated as relevant for AR in CT. This research directly contributes to the authors’ work for extending their AR-specific User Experience (UX) measurement approach, supporting AR authors in targeting the improvement of AR applications for CT scenarios.

创新技术,如增强现实(AR),引入新的互动模式,要求在软件开发过程中确定软件要求。总体而言,设计建议与此相关,支持应用设计积极,满足利益攸关方的需求。然而,目前的研究缺乏针对具体情况的AR设计建议。本研究通过确定和分析与用户-中心设计(UCD)进程评价阶段相关的实用AR设计建议,解决了这一差距。我们依靠混合现实(MR)设计建议的现有数据集。我们采用了多种方法方法,:(1) 扩大自2020年以来公布的有ARCT具体建议的数据集,(2) 根据预先培训的判刑变换模型,使用NLP分类法对已确定的建议进行分类,(3) 概述所有专题的内容,(4) 评估其在公司培训中与AR(CT)的相关性,两者都是基于与五名专家的定性轮回式Robin设计(UD)进程相关的。结果是,由597名执业人员设计建议组成,分为84个专题。我们采用了多种方法,通过(1) 扩展了自2020年以来公布的ARCT具体建议中的数据集,(2) 利用NLP分类方法对已查明的建议进行分类分类分类,根据预先培训的NLP分类方法,对已确定的建议进行分类分类分类分类分类分类,3个专题加以分类分类,利用NP分类分类,并评估所有专题,并评估了所有专题,评估了所有专题,为AR284研究的作者对AR测量方法的应用研究结果的作者的论文研究结果的应用结果,为具体应用。


Article 56

Title@2025-07-29 (2): iPanda: An LLM-based Agent for Automated Conformance Testing of Communication Protocols

Title: iPanda: An LLM-based Agent for Automated Conformance Testing of Communication Protocols iPanda: Ein LLM-basierter Agent für automatisierte Konformitätsprüfung von Kommunikationsprotokollen iPanda:以LLM为基地的通信协议自动合规测试自动合规测试代理 2507.00378v2

Authors (11): Xikai Sun, Fan Dang, Shiqi Jiang, Jingao Xu, Kebin Liu, Xin Miao, Zihao Yang, Weichen Zhang, Haimo Lu, Yawen Zheng, Yunhao Liu

Conformance testing is essential for ensuring that protocol implementations comply with their specifications. However, traditional testing approaches involve manually creating numerous test cases and scripts, making the process labor-intensive and inefficient. Recently, Large Language Models (LLMs) have demonstrated impressive text comprehension and code generation abilities, providing promising opportunities for automation. In this paper, we propose iPanda, the first framework that leverages LLMs to automate protocol conformance testing. Given a protocol specification document and its implementation, iPanda first employs a keyword-based method to automatically generate comprehensive test cases. Then, it utilizes retrieval-augmented generation and customized CoT strategy to effectively interpret the implementation and produce executable test programs. To further enhance programs’ quality, iPanda incorporates an iterative optimization mechanism to refine generated test scripts interactively. Finally, by executing and analyzing the generated tests, iPanda systematically verifies compliance between implementations and protocol specifications. Comprehensive experiments on various protocols show that iPanda significantly outperforms pure LLM-based approaches, improving the success rate (Pass@1) of test-program generation by factors ranging from 4.675 times to 10.751 times.

合规测试是确保协议执行符合其规格的关键所在。 但是,传统测试方法涉及人工创建大量测试案例和脚本,使过程耗费大量人力且效率低下。 最近,大语言模型(LLMs)展示了令人印象深刻的文本理解和代码生成能力,为自动化提供了充满希望的机会。在本文中,我们提议iPanda,这是利用LMs进行协议合规自动测试的第一个框架。根据协议规格文件及其实施,iPanda首先使用基于关键词的方法自动生成全面测试案例。然后,它利用检索启动的生成和定制的 CoT 战略来有效解释实施和生成可执行的测试程序。为了进一步提高程序的质量,iPanda采用了一个迭代优化机制,以交互完善生成的测试脚本。最后,通过实施和分析生成的测试,iPanda系统地核查执行和协议规格之间的合规情况。 各种协议的全面实验表明iPanda明显地超越基于纯LMM的方法,提高测试方案生成的成功率(Pass@1)。


Article 57

Title@2025-07-29 (2): MultiAIGCD: A Comprehensive dataset for AI Generated Code Detection Covering Multiple Languages, Models, Prompts, and Scenarios

Title: MultiAIGCD: A Comprehensive dataset for AI Generated Code Detection Covering Multiple Languages, Models, Prompts, and Scenarios MultiAIGCD: Ein umfassender Datensatz für KI Generated Code Detection für mehrere Sprachen, Modelle, Prompts und Szenarien 多AIGCD:AI生成的包含多种语言、模型、提示和情景的代码探测综合数据集 2507.21693v1

Authors (3): Basak Demirok, Mucahid Kutlu, Selin Mergen

As large language models (LLMs) rapidly advance, their role in code generation has expanded significantly. While this offers streamlined development, it also creates concerns in areas like education and job interviews. Consequently, developing robust systems to detect AI-generated code is imperative to maintain academic integrity and ensure fairness in hiring processes. In this study, we introduce MultiAIGCD, a dataset for AI-generated code detection for Python, Java, and Go. From the CodeNet dataset’s problem definitions and human-authored codes, we generate several code samples in Java, Python, and Go with six different LLMs and three different prompts. This generation process covered three key usage scenarios: (i) generating code from problem descriptions, (ii) fixing runtime errors in human-written code, and (iii) correcting incorrect outputs. Overall, MultiAIGCD consists of 121,271 AI-generated and 32,148 human-written code snippets. We also benchmark three state-of-the-art AI-generated code detection models and assess their performance in various test scenarios such as cross-model and cross-language. We share our dataset and codes to support research in this field.

随着大型语言模型(LLMS)的快速推进,其在代码生成中的作用已大大扩展,这提供了简化的开发,同时也在教育和工作面试等领域引起了关注。因此,开发健全的系统以检测AI生成的代码对于保持学术完整性和确保招聘过程的公平性至关重要。在本研究中,我们引入了多AIGCD,这是AI生成的Python、Java和Go的代码检测数据集。从代码网络的难题定义和人为代码中,我们还在Java、Python和Java、Python生成了若干代码样本,用6个不同的LMS和3个不同的提示进行采集。这一生成过程覆盖了三种关键使用情景:(一) 从问题描述生成代码,(二) 确定人为代码中的运行时间错误,以及(三) 纠正不正确的产出。总体而言,多AIGCD由121,271 AI生成的和32,148个人类编写的代码片组成。我们还根据3个最先进的AI生成的代码检测模型和在跨模型和跨语言等各种测试情景中评估其性表现。我们分享了我们的数据集和代码以支持该领域的研究。


Article 58

Title@2025-07-29 (2): Predicting Maintenance Cessation of Open Source Software Repositories with An Integrated Feature Framework

Title: Predicting Maintenance Cessation of Open Source Software Repositories with An Integrated Feature Framework Vorhersage der Wartungseinstellung von Open Source Software Repositories mit integriertem Feature Framework 具有综合地物框架的开放源码软件储存库预测维持维持状态的停止 2507.21678v1

Authors (5): Yiming Xu, Runzhi He, Hengzhi Ye, Minghui Zhou, Huaimin Wang

The maintenance risks of open source software (OSS) projects pose significant threats to the quality, security, and resilience of modern software supply chains. While prior research has proposed diverse approaches for predicting OSS maintenance risk – leveraging signals ranging from surface features (e.g., stars, commits) to social network analyses and behavioral patterns – existing methods often suffer from ambiguous operational definitions, limited interpretability, and datasets of insufficient scale or generalizability. In this work, we introduce the notion of “maintenance cessation”, grounded in both explicit archival status and rigorous semantic analysis of project documentation. Building on this foundation, we curate a large-scale, longitudinal dataset of 115,466 GitHub repositories – encompassing 57,733 confirmed cessation events – complemented by comprehensive, timeline-based behavioral features. We propose an integrated, multi-perspective feature framework for predicting maintenance cessation, systematically combining user-centric features, maintainer-centric features, and project evolution features. AFT survival analysis achieves a high C-index (0.846), substantially outperforming models that rely only on surface features. Feature ablation and SHAP analysis further confirm the effectiveness and interpretability of our approach. Finally, we demonstrate real-world applicability by deploying a GBSA classifier in the openEuler ecosystem for proactive package risk screening. Our work establishes a scalable, interpretable foundation for maintenance-risk prediction, enabling reproducible risk management across large-scale open source ecosystems.

开放源码软件(OSS)项目的维护风险对现代软件供应链的质量、安全和复原力构成重大威胁。虽然先前的研究提出了预测OSS维护风险的不同方法 – – 利用从地表特征(例如星星,承诺)到社会网络分析和行为模式的信号 – – 现有方法往往具有模糊的操作定义、有限的解释性以及规模不足或笼统的数据集。在这项工作中,我们采用了“维持停止”办法,其依据是明确的档案状态和对项目文件的严格语义分析。在此基础上,我们制定了一个大型、纵向的115 466 GitHub 储存库 – – 包括57 733个已确认的停止事件 – – 的纵向数据集,辅之以全面的、基于时限的行为特征。我们提出了一个综合、多视角的功能框架,用于预测维持状态,系统地结合以用户为中心的特征、保持中心特征和项目演变特点。一个高C-指数(0.846),大量业绩模型仅依赖地表特征。
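
The C-index quoted in this abstract measures how often a survival model ranks pairs of repositories in the correct order. The following is a minimal sketch of Harrell's concordance index, assuming that a higher risk score means earlier cessation (for an AFT model one could, for instance, negate the predicted survival time). The function name and the toy data are illustrative and not taken from the paper.

```python
from itertools import combinations


def concordance_index(times, events, risk_scores):
    """Simplified Harrell's C-index for right-censored data.

    times:       observed time for each subject
    events:      1 if the event (e.g. cessation) was observed, 0 if censored
    risk_scores: model output; higher means the event is expected sooner
    Pairs with tied observed times are skipped in this simplified version.
    """
    concordant = 0.0
    comparable = 0
    for i, j in combinations(range(len(times)), 2):
        # Order the pair so that subject a has the shorter observed time.
        a, b = (i, j) if times[i] < times[j] else (j, i)
        if times[a] == times[b] or not events[a]:
            continue  # not comparable: tied times, or the earlier subject is censored
        comparable += 1
        if risk_scores[a] > risk_scores[b]:
            concordant += 1.0  # the earlier event received the higher risk
        elif risk_scores[a] == risk_scores[b]:
            concordant += 0.5  # ties in the score count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Toy data: four repositories, the third one is censored (still maintained).
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.7, 0.8, 0.1]
print(concordance_index(times, events, risk))
```

In the toy data, four of the five comparable pairs are ordered correctly, so the index is 0.8; a value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.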


Article 59

Title@2025-07-29 (2): Harnessing Large Language Model for Virtual Reality Exploration Testing: A Case Study

Title: Harnessing Large Language Model for Virtual Reality Exploration Testing: A Case Study Großes Sprachmodell für Virtual Reality Exploration Testing nutzen: Eine Fallstudie 利用大语言模型进行虚拟现实探索试验:案例研究 2501.05625v2

Authors (6): Zhenyu Qi, Haotang Li, Hao Qin, Kebin Peng, Sen He, Xue Qin

As the Virtual Reality (VR) industry expands, the need for automated GUI testing is growing rapidly. Large Language Models (LLMs), capable of retaining information long-term and analyzing both visual and textual data, are emerging as a potential key to deciphering the complexities of VR’s evolving user interfaces. In this paper, we conduct a case study to investigate the capability of LLMs, particularly GPT-4o, for field of view (FOV) analysis in VR exploration testing. Specifically, we validate that LLMs can identify test entities in FOVs and that prompt engineering can raise the accuracy of test entity identification from 41.67% to 71.30%. Our study also shows that LLMs can describe identified entities’ features with an accuracy of at least 90%. We further find that the core features that effectively represent an entity are color, placement, and shape. Moreover, combining these three features improves the accuracy of determining identical entities across multiple FOVs, reaching a highest F1-score of 0.70. Additionally, our study demonstrates that LLMs are capable of scene recognition and spatial understanding in VR given precisely designed structured prompts. Finally, we find that LLMs fail to label the identified test entities, and we discuss potential solutions as future research directions.

随着虚拟现实(VR)产业的扩大,自动图形界面测试的需要正在迅速增长。大语言模型(LLMS)能够长期保存信息并分析视觉和文字数据,正在成为破解VR不断演变的用户界面复杂性的潜在关键。在本文中,我们进行案例研究,调查在VR勘探测试中利用LLMS,特别是GPT-4o(FOV)进行视域分析的能力。具体地说,我们确认LLMS能够识别FOVs中的测试实体,迅速工程能够有效地提高测试实体识别的准确性,从41.67%提高到71.30%。我们的研究还表明LLMS能够准确描述所查明的实体特征,至少90%的准确率。我们进一步发现有效代表实体的核心特征是颜色、位置和形状。此外,这三种特征的结合特别可以用来提高确定多个FOVVS(F1-核心为0.70的最高的F1)相同实体的相同实体的准确性。此外,我们的研究显示LMS能够准确描述和空间理解最终设计的LMMS(我们所设计的LMs)测试方向。


Article 60

Title@2025-07-29 (2): Ethical Classification of Non-Coding Contributions in Open-Source Projects via Large Language Models

Title: Ethical Classification of Non-Coding Contributions in Open-Source Projects via Large Language Models Ethische Klassifizierung von Nichtkodierungsbeiträgen in Open-Source-Projekten über große Sprachmodelle 通过大语言模式对开放源码项目非编码捐款进行道德分类 2507.21583v1

Authors (2): Sergio Cobos, Javier Luis Cánovas Izquierdo

The development of Open-Source Software (OSS) is not only a technical challenge but also a social one, due to the diverse mixture of contributors. To this end, social-coding platforms such as GitHub provide the infrastructure needed to host and develop the code, as well as support for the community’s collaboration, which is driven by non-coding contributions such as issues (i.e., change proposals or bug reports) or comments on existing contributions. As with any other social endeavor, this development process faces ethical challenges, which may put the project’s sustainability at risk. To foster a productive and positive environment, OSS projects are increasingly deploying codes of conduct, which define rules to ensure a respectful and inclusive participatory environment, with the Contributor Covenant being the main model to follow. However, monitoring and enforcing these codes of conduct is a challenging task, due to the limitations of current approaches. In this paper, we propose an approach to classify the ethical quality of non-coding contributions in OSS projects by relying on Large Language Models (LLMs), a promising technology for text classification tasks. We defined a set of ethical metrics based on the Contributor Covenant and developed a classification approach to assess ethical behavior in OSS non-coding contributions, using prompt engineering to guide the model’s output.

开发开放源码软件(OSS)不仅是一项技术挑战,而且也是一项社会挑战,其原因有各种贡献者,为此,GitHub等社会编码平台提供了主办和开发守则所需的基础设施,而且还为社区合作提供了支持,这是由非编码贡献推动的,例如问题(如变更建议或错误报告)或对现有贡献的评论。与任何其他社会努力一样,这一发展进程面临道德挑战,这可能危及项目的可持续性。为了促进生产和积极的环境,OSS项目正在越来越多地部署行为守则,这些守则界定了确保尊重和包容性参与环境的规则,而《贡献者公约》是遵循的主要模式。然而,监测和执行这些行为守则是一项具有挑战性的任务,因为目前的方法有局限性。在本文件中,我们提出了一种方法,通过依赖大语言模型(LLM),将开放源码软件项目的非编码贡献的道德质量分类归为一种有前途的技术。我们根据《承诺者公约》的道德准则,用一套不道德衡量标准来评估《交付者公约》的道德行为,并用一种及时的分类方法评估了《交付者准则》的道德标准。


Article 61

Title@2025-07-29 (2): Kodezi Chronos: A Debugging-First Language Model for Repository-Scale Code Understanding

Title: Kodezi Chronos: A Debugging-First Language Model for Repository-Scale Code Understanding Kodezi Chronos: Ein Debugging-First Language Model für Repository-Scale Code Understanding Kodezi Chronos:调试第一语言模式,用于存储库-范围守则理解 2507.12482v2

Authors (5): Ishraq Khan, Assad Chowdary, Sharoz Haseeb, Urvish Patel, Yousuf Zaii

Large Language Models (LLMs) have improved code generation and software automation, but remain limited by inference-time context and lack structured reasoning over code. Debugging remains unsolved despite these advances. While Claude Opus 4 and GPT-4.1 achieve >70% on code synthesis benchmarks, they perform <15% on real debugging tasks. We introduce Kodezi Chronos, a language model built specifically for debugging. Chronos combines Adaptive Graph-Guided Retrieval to navigate codebases up to 10 million lines using multi-hop traversal (92% precision, 85% recall), Persistent Debug Memory trained on 15M+ sessions, and a 7-layer architecture for iterative fix-test-refine loops. On 5,000 real-world scenarios, Chronos achieves 67.3% fix accuracy, compared to 14.2% and 13.8% for Claude and GPT-4.1 respectively. Chronos reduces debugging time by 40% and iteration count by 65%. It resolves complex multi-file bugs involving cross-repository context and temporal reasoning. Key limitations include 23.4% success on hardware-dependent issues and 41.2% on dynamic language errors. Theoretical analysis shows O(k log d) retrieval complexity with convergence guarantees. In a human evaluation (N=50), 89% of participants preferred Chronos over baseline models. Chronos will be available in Kodezi OS in Q4 2025 and via API in Q1 2026.

大型语言模型(LLM)改进了代码生成和软件自动化,但仍受推理时上下文的限制,且缺乏对代码的结构化推理。尽管取得了这些进步,调试问题仍未解决。Claude Opus 4 和 GPT-4.1 在代码合成基准上达到 >70%,但在真实调试任务上的表现 <15%。我们提出 Kodezi Chronos,这是一个专门为调试而构建的语言模型。Chronos 结合了自适应图引导检索(Adaptive Graph-Guided Retrieval,通过多跳遍历在多达 1000 万行的代码库中导航,精确率 92%,召回率 85%)、在 1500 万以上会话上训练的持久调试记忆(Persistent Debug Memory),以及用于迭代“修复-测试-改进”循环的 7 层架构。在 5,000 个真实场景中,Chronos 的修复准确率达到 67.3%,而 Claude 和 GPT-4.1 分别为 14.2% 和 13.8%。Chronos 将调试时间减少 40%,将迭代次数减少 65%。它能够解决涉及跨仓库上下文和时间推理的复杂多文件错误。主要局限包括在依赖硬件的问题上成功率为 23.4%,在动态语言错误上为 41.2%。理论分析表明检索复杂度为 O(k log d),并具有收敛保证。在一项人工评估(N=50)中,89% 的参与者更喜欢 Chronos 而非基线模型。Chronos 将于 2025 年第四季度在 Kodezi OS 中提供,并于 2026 年第一季度通过 API 提供。


Article 62

Title@2025-07-29 (2): HLSDebugger: Identification and Correction of Logic Bugs in HLS Code with LLM Solutions

Title: HLSDebugger: Identification and Correction of Logic Bugs in HLS Code with LLM Solutions HLSDebugger: Identifizierung und Korrektur von Logic Bugs im HLS-Code mit LLM-Lösungen HLSDebugger: 使用LLM 解决方案识别和校正 HLS 代码中的逻辑错误 2507.21485v1

Authors (4): Jing Wang, Shang Liu, Yao Lu, Zhiyao Xie

High-level synthesis (HLS) accelerates hardware design by enabling the automatic translation of high-level descriptions into efficient hardware implementations. However, debugging HLS code is a challenging and labor-intensive task, especially for novice circuit designers or software engineers without sufficient hardware domain knowledge. The recent emergence of Large Language Models (LLMs) shows promise for automating the HLS debugging process. Despite the great potential, three key challenges persist when applying LLMs to HLS logic debugging: 1) High-quality circuit data for training LLMs is scarce, posing a significant challenge. 2) Debugging logic bugs in hardware is inherently more complex than identifying software bugs with existing golden test cases. 3) The absence of reliable test cases requires multi-tasking solutions that perform both bug identification and correction. In this work, we propose a customized solution named HLSDebugger to address these challenges. HLSDebugger first generates and releases a large labeled dataset with 300K data samples, targeting HLS logic bugs. The HLSDebugger model adopts an encoder-decoder structure, performing bug location identification, bug type prediction, and bug correction with the same model. HLSDebugger significantly outperforms advanced LLMs like GPT-4 in bug identification and by more than 3x in bug correction. It makes a substantial advancement in the exploration of automated debugging of HLS code.

高层次合成(HLS)通过将高层次描述自动转换为高效的硬件实施,加速硬件设计。然而,调试 HLS 代码是一项具有挑战性和劳动密集型的任务,特别是对于没有足够硬件域域知识的Novice 电路设计师或软件工程师来说,它是一个具有挑战性和劳动密集型的任务。最近出现的大语言模型(LLLM)在将HLS调试程序自动化方面很有希望。尽管存在巨大的潜力,但在应用LLMS逻辑调试时,有三个关键挑战依然存在:(1) 用于培训LMS的高质量自动校正系统数据很少,因此构成重大挑战。(2) 调试硬件中的逻辑错误比识别现有黄金测试案例的软件错误复杂得多。(3) 缺乏可靠的测试案例需要多重任务解决方案,同时进行错误识别和校正。在这项工作中,我们提出了一个名为HLSDSDebugger的定制解决方案,用来应对挑战。HLSDBDSDebugbger首先生成并释放一个带有300KPER数据样本的大型编码数据集。针对HLS逻辑错误的HDebubbu 逻辑错误识别错误和升级模型,在HBBDMBers的升级模型中,在运行中采用一个错误和升级的模型中采用一个更高级的模型。


Article 63

Title@2025-07-29 (2): LLM4VV: Evaluating Cutting-Edge LLMs for Generation and Evaluation of Directive-Based Parallel Programming Model Compiler Tests

Title: LLM4VV: Evaluating Cutting-Edge LLMs for Generation and Evaluation of Directive-Based Parallel Programming Model Compiler Tests LLM4VV: Bewertung von Cutting-Edge-LLMs für die Erstellung und Bewertung von richtliniesbasierten parallelen Programmiermodellkompilertests LLM4VV:评估用于编制和评价基于指令的平行方案拟订模式模型汇编者示范测试的切削-切削LMs 2507.21447v1

Authors (4): Zachariah Sollenberger, Rahul Patel, Saieda Ali Zada, Sunita Chandrasekaran

The usage of Large Language Models (LLMs) for software and test development has continued to increase since LLMs were first introduced, but only recently have the expectations of LLMs become more realistic. Verifying the correctness of code generated by LLMs is key to improving their usefulness, but no comprehensive and fully autonomous solution has been developed yet. Hallucinations are a major concern when LLMs are applied blindly to problems without taking the time and effort to verify their outputs, and the inability to explain the logical reasoning of LLMs makes it hard to trust their results. To address these challenges while also aiming to apply LLMs effectively, this paper proposes a dual-LLM system (i.e., a generative LLM and a discriminative LLM) and experiments with the usage of LLMs for the generation of a large volume of compiler tests. We experimented with a number of LLMs possessing varying parameter counts and presented results using ten carefully chosen metrics that we describe in detail in our narrative. Our findings indicate that LLMs show promising potential to generate quality compiler tests and verify them automatically.

大语言模型(LLMs)在软件和测试开发方面的使用自LLMs首次引入以来继续增加,但直到最近,LLMs的期望才变得更加现实。验证LLMs生成的代码是否正确是提高其效用的关键,但还没有制定全面、完全自主的解决办法。当LLMs在不花时间和精力核实其产出的情况下盲目地应用于问题时,幻觉是一大问题,无法解释LLMs的逻辑推理导致问题,信任其结果。为了应对这些挑战,同时也力求有效地应用LLMs,本文提议建立一个双LM系统(即基因化LM和歧视性LMM),并试验LMS用于生成大量编译测试。我们试验了许多具有不同参数的LLMs,并用我们在叙述中详细描述的10个细微分辨度指标提出了结果。通过我们的研究结果,LMss显然具有产生质量编译测试并自动核实这些测试的希望的潜力。


Article 64

Title@2025-07-28 (1): MAAD: Automate Software Architecture Design through Knowledge-Driven Multi-Agent Collaboration

Title: MAAD: Automate Software Architecture Design through Knowledge-Driven Multi-Agent Collaboration MAAD: Software-Architektur-Design automatisieren durch wissensgetriebene Multi-Agent-Kollaboration MAAD: 通过知识开发多机构协作开发自动化软件结构设计 2507.21382v1

Authors (8): Ruiyin Li, Yiran Zhang, Xiyu Zhou, Peng Liang, Weisong Sun, Jifeng Xuan, Zhi Jin, Yang Liu

Software architecture design is a critical, yet inherently complex and knowledge-intensive phase of software development. It requires deep domain expertise, development experience, architectural knowledge, careful trade-offs among competing quality attributes, and the ability to adapt to evolving requirements. Traditionally, this process is time-consuming and labor-intensive, and relies heavily on architects, often resulting in limited design alternatives, especially under the pressures of agile development. While Large Language Model (LLM)-based agents have shown promising performance across various SE tasks, their application to architecture design remains relatively scarce and requires more exploration, particularly in light of diverse domain knowledge and complex decision-making. To address the challenges, we proposed MAAD (Multi-Agent Architecture Design), an automated framework that employs a knowledge-driven Multi-Agent System (MAS) for architecture design. MAAD orchestrates four specialized agents (i.e., Analyst, Modeler, Designer and Evaluator) to collaboratively interpret requirements specifications and produce architectural blueprints enriched with quality attributes-based evaluation reports. We then evaluated MAAD through a case study and comparative experiments against MetaGPT, a state-of-the-art MAS baseline. Our results show that MAAD’s superiority lies in generating comprehensive architectural components and delivering insightful and structured architecture evaluation reports. Feedback from industrial architects across 11 requirements specifications further reinforces MAAD’s practical usability. We finally explored the performance of the MAAD framework with three LLMs (GPT-4o, DeepSeek-R1, and Llama 3.3) and found that GPT-4o exhibits better performance in producing architecture design, emphasizing the importance of LLM selection in MAS-driven architecture design.

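软件架构设计是软件开发中至关重要但本质上复杂且知识密集的阶段。它需要深厚的领域专长、开发经验、架构知识、在相互竞争的质量属性之间谨慎权衡,以及适应不断变化的需求的能力。传统上,这一过程耗时费力,并严重依赖架构师,往往导致设计备选方案有限,在敏捷开发的压力下尤其如此。虽然基于大型语言模型(LLM)的智能体在各种软件工程任务中表现出了良好的性能,但它们在架构设计中的应用仍然相对较少,需要更多探索,特别是考虑到多样的领域知识和复杂的决策。为应对这些挑战,我们提出了 MAAD(Multi-Agent Architecture Design),这是一个采用知识驱动的多智能体系统(MAS)进行架构设计的自动化框架。MAAD 协调四个专门的智能体(即分析师、建模师、设计师和评估师),协同解读需求规格说明,并产出附有基于质量属性的评估报告的架构蓝图。随后,我们通过案例研究以及与最先进的 MAS 基线 MetaGPT 的对比实验对 MAAD 进行了评估。结果表明,MAAD 的优势在于能够生成全面的架构组件,并提供有见地且结构化的架构评估报告。来自工业界架构师针对 11 份需求规格说明的反馈进一步印证了 MAAD 的实用性。最后,我们考察了 MAAD 框架在三种 LLM(GPT-4o、DeepSeek-R1 和 Llama 3.3)上的表现,发现 GPT-4o 在生成架构设计方面表现更好,这凸显了在 MAS 驱动的架构设计中选择 LLM 的重要性。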


Article 65

Title@2025-07-28 (1): Does Editing Improve Answer Quality on Stack Overflow? A Data-Driven Investigation

Title: Does Editing Improve Answer Quality on Stack Overflow? A Data-Driven Investigation Verbessert das Bearbeiten die Antwortqualität bei Stack-Überfluss? Eine datengetriebene Untersuchung 编辑是否改进堆叠溢出时的回答质量? 数据驱动调查 2507.21329v1

Authors (2): Saikat Mondal, Chanchal K. Roy

High-quality answers in technical Q&A platforms like Stack Overflow (SO) are crucial as they directly influence software development practices. Poor-quality answers can introduce inefficiencies, bugs, and security vulnerabilities, and thus increase maintenance costs and technical debt in production software. To improve content quality, SO allows collaborative editing, where users revise answers to enhance clarity, correctness, and formatting. Several studies have examined rejected edits and identified the causes of rejection. However, prior research has not systematically assessed whether accepted edits enhance key quality dimensions. While one study investigated the impact of edits on C/C++ vulnerabilities, broader quality aspects remain unexplored. In this study, we analyze 94,994 Python-related answers that have at least one accepted edit to determine whether edits improve (1) semantic relevance, (2) code usability, (3) code complexity, (4) security vulnerabilities, (5) code optimization, and (6) readability. Our findings show both positive and negative effects of edits. While 53.3% of edits improve how well answers match questions, 38.1% make them less relevant. Some previously broken code (9%) becomes executable, yet working code (14.7%) turns non-parsable after edits. Many edits increase complexity (32.3%), making code harder to maintain. Instead of fixing security issues, 20.5% of edits introduce additional issues. Even though 51.0% of edits optimize performance, execution time still increases overall. Readability also suffers, as 49.7% of edits make code harder to read. This study highlights the inconsistencies in editing outcomes and provides insights into how edits impact software maintainability, security, and efficiency that might caution users and moderators and help future improvements in collaborative editing systems.

Stack Overflow (SO) 等技术 {A 平台的高质量答案至关重要, 因为它们直接影响到软件开发实践。 低质量答案可能导致低效率、错误和安全脆弱性, 从而增加生产软件的维护成本和技术债务。 为改善内容质量, SO 允许合作编辑, 用户修改答案以提高清晰度、 正确度和格式化。 一些研究审查了拒绝编辑, 并找出了拒绝的原因。 但是, 先前的研究没有系统地评估被接受的编辑是否提升关键质量层面。 虽然一项研究调查了编辑C/ C++ (SO) 对C/ C++ (SO) 脆弱性的影响, 更广泛的质量方面仍未探索。 在本研究中, 我们分析94, 994 与平坦有关的答案可能带来低效率, 从而增加生产成本 94, 994 相关答案被接受编辑, (2) 编码, (3) 代码复杂度, (4) 安全脆弱性, (5) 代码优化, 和 读取 。 我们的调查结果显示编辑的正面效果和负面效果。 虽然有53. 答案更准确的答案与问题相符, , 38.1 使得问题更精确的编辑的编辑结果更接近于问题。 一些破的代码 (9%) 阅读的代码 读起来更难。


Article 66

Title@2025-07-28 (1): Black-Box Bug-Amplification for Multithreaded Software

Title: Black-Box Bug-Amplification for Multithreaded Software Black-Box Bug-Verstärkung für Multithreaded Software 多行软件的黑箱臭虫修正 2507.21318v1

Authors (6): Yeshayahu Weiss, Gal Amram, Achiya Elyasaf, Eitan Farchi, Oded Margalit, Gera Weiss

Bugs, especially those in concurrent systems, are often hard to reproduce because they manifest only under rare conditions. Testers frequently encounter failures that occur only under specific inputs, and even then only with low probability. We propose an approach to systematically amplify the occurrence of such elusive bugs. We treat the system under test as a black box and use repeated trial executions to train a predictive model that estimates the probability of a given input configuration triggering a bug. We evaluate this approach on a dataset of 17 representative concurrency bugs spanning diverse categories. Several model-based search techniques are compared against a brute-force random sampling baseline. Our results show that an ensemble of regression models can significantly increase bug occurrence rates across nearly all scenarios, often achieving an order-of-magnitude improvement over random sampling. The contributions of this work include: (i) a novel formulation of bug-amplification as a rare-event regression problem; (ii) an empirical evaluation of multiple techniques for amplifying bug occurrence, demonstrating the effectiveness of model-guided search; and (iii) a practical, non-invasive testing framework that helps practitioners expose hidden concurrency faults without altering the internal system architecture.

错误,特别是同时系统中的错误,往往很难复制,因为只有在罕见的条件下才会出现。测试者经常遇到只有在特定投入下才会发生的故障,即使发生概率低。我们提出系统扩大这种难以捉摸的错误发生的方法。我们把测试中的系统作为黑箱处理,并用反复的试验处决来训练一种预测模型,用以估计某一输入配置触发错误的可能性。我们在17个具有代表性的孔虫的数据集中评估了这个方法,该数据集涵盖不同类别。一些基于模型的搜索技术被比作一种粗略的随机抽样基线。我们的结果显示,回归模型的共合体可以大大增加几乎所有情况下的错误发生率,常常在随机抽样的基础上实现一个高压性改进。这项工作的贡献包括:(一) 一种新颖的错误强化配方,作为一种罕见的事件回归问题;(二) 对扩大错误发生频率的多种技术进行实证评估,表明模型制导搜索的有效性;以及(三) 一个实用的、非侵入性测试框架,帮助从业者在不改变内部结构的情况下暴露隐藏的货币错误。
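
The idea of treating bug triggering as a regression problem over input configurations can be illustrated with a small sketch. It is not the authors' implementation: the black-box system, its two parameters (`threads`, `delay`), and the failure-probability formula are hypothetical, and a plain k-nearest-neighbour regressor stands in for the ensemble of regression models mentioned in the abstract.

```python
import random


def system_under_test(config, rng):
    """Hypothetical black box: fails rarely, most often near threads=8, delay=3."""
    threads, delay = config
    p_fail = 0.5 / (1 + (threads - 8) ** 2 + (delay - 3) ** 2)
    return rng.random() < p_fail


def predict_failure_probability(config, trials, k=25):
    """k-NN regression: mean outcome of the k past trials closest to config."""
    def sq_dist(trial):
        (threads, delay), _ = trial
        return (threads - config[0]) ** 2 + (delay - config[1]) ** 2

    nearest = sorted(trials, key=sq_dist)[:k]
    return sum(failed for _, failed in nearest) / len(nearest)


def amplify(candidates, n_training_trials, rng):
    """Run random trials, fit the model, return the most failure-prone config."""
    trials = []
    for _ in range(n_training_trials):
        config = rng.choice(candidates)
        trials.append((config, system_under_test(config, rng)))
    return max(candidates, key=lambda c: predict_failure_probability(c, trials))


def failure_rate(pick_config, n_runs, rng):
    """Fraction of failing runs when configs come from pick_config()."""
    return sum(system_under_test(pick_config(), rng) for _ in range(n_runs)) / n_runs


CANDIDATES = [(t, d) for t in range(1, 17) for d in range(10)]
rng = random.Random(0)

best = amplify(CANDIDATES, 2000, rng)
random_rate = failure_rate(lambda: rng.choice(CANDIDATES), 500, rng)
guided_rate = failure_rate(lambda: best, 500, rng)
print("most failure-prone config:", best)
print("random sampling failure rate:", random_rate)
print("model-guided failure rate:", guided_rate)
```

The system is only ever executed, never inspected, which matches the black-box setting. The model is used solely to decide which configuration to run repeatedly.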


Article 67

Title@2025-07-28 (1): “Maybe We Need Some More Examples:” Individual and Team Drivers of Developer GenAI Tool Use

Title: “Maybe We Need Some More Examples:” Individual and Team Drivers of Developer GenAI Tool Use “Vielleicht brauchen wir noch einige Beispiele:” Individuelle und Team-Treiber von Entwickler-GenAI-Tool-Nutzung “也许我们需要更多例子:”开发者GenAI工具使用的个人和团队驱动者 2507.21280v1

Authors (9): Courtney Miller, Rudrajit Choudhuri, Mara Ulloa, Sankeerti Haniyur, Robert DeLine, Margaret-Anne Storey, Emerson Murphy-Hill, Christian Bird, Jenna L. Butler

Despite the widespread availability of generative AI tools in software engineering, developer adoption remains uneven. This unevenness is problematic because it hampers productivity efforts, frustrates management’s expectations, and creates uncertainty around the future roles of developers. Through paired interviews with 54 developers across 27 teams – one frequent and one infrequent user per team – we demonstrate that differences in usage result primarily from how developers perceive the tool (as a collaborator vs. a feature), their engagement approach (experimental vs. conservative), and how they respond when encountering challenges (with adaptive persistence vs. quick abandonment). Our findings imply that widespread organizational expectations for rapid productivity gains, without sufficient investment in learning support, create a “Productivity Pressure Paradox,” undermining the very productivity benefits that motivate adoption.

尽管软件工程中广泛存在基因化的人工智能工具,但开发商的采用仍然不均衡。这种不平衡存在问题,因为它妨碍生产力的努力,挫败了管理层的期望,并给开发商的未来角色制造了不确定性。 通过在27个团队中与54名开发商 – – 每个团队一个频繁和一个不频繁的用户 – – 进行对齐访谈,我们证明,使用上的差异主要源于开发商如何看待这一工具(作为合作者对一个特征)、他们的参与方法(实验性对保守性),以及他们在遇到挑战时(适应性坚持性对快速放弃)如何应对。 我们的调查结果表明,在没有足够学习支持投资的情况下,本组织对快速生产力增长的普遍期望产生了“生产力压力拉多”,从而破坏了激励采用该工具的生产力效益。


Article 68

Title@2025-07-28 (1): Generating Highly Structured Test Inputs Leveraging Constraint-Guided Graph Refinement

Title: Generating Highly Structured Test Inputs Leveraging Constraint-Guided Graph Refinement Generierung von hochstrukturierten Testeingaben, die eine eingeschränkte Graphenverfeinerung ermöglichen 正在生成高结构化测试输入, 以杠杆方式调节受约束的辅助图表精度 2507.21271v1

Authors (4): Zhaorui Yang, Yuxin Qiu, Haichao Zhu, Qian Zhang

[Context] Modern AI applications increasingly process highly structured data, such as 3D meshes and point clouds, where test input generation must preserve both structural and semantic validity. However, existing fuzzing tools and input generators are typically handcrafted for specific input types and often generate invalid inputs that are subsequently discarded, leading to inefficiency and poor generalizability. [Objective] This study investigates whether test inputs for structured domains can be unified through a graph-based representation, enabling general, reusable mutation strategies while enforcing structural constraints. We will evaluate the effectiveness of this approach in enhancing input validity and semantic preservation across eight AI systems. [Method] We develop and evaluate GRAphRef, a graph-based test input generation framework that supports constraint-based mutation and refinement. GRAphRef maps structured inputs to graphs, applies neighbor-similarity-guided mutations, and uses a constraint-refinement phase to repair invalid inputs. We will conduct a confirmatory study across eight real-world mesh-processing AI systems, comparing GRAphRef with AFL, MeshAttack, Saffron, and two ablated variants. Evaluation metrics include structural validity, semantic preservation (via prediction consistency), and performance overhead. Experimental data is derived from ShapeNetCore mesh seeds and model outputs from systems like MeshCNN and HodgeNet. Statistical analysis and component latency breakdowns will be used to assess each hypothesis.

[背景] 现代AI应用日益处理高度结构化的数据,如3D网格和点云,其测试输入生成必须同时保持结构有效性和语义有效性。然而,现有的模糊测试工具和输入生成器通常是为特定输入类型手工制作的,往往产生无效的输入,随后被丢弃,导致效率低下和通用性差。[目标] 本研究调查结构化领域的测试输入能否通过基于图的表示加以统一,从而在施加结构约束的同时实现通用的、可重用的变异策略。我们将评估这一方法在提高八个AI系统的输入有效性和语义保持方面的效果。[方法] 我们开发并评估 GRAphRef,这是一个支持基于约束的变异和精化的基于图的测试输入生成框架。GRAphRef 将结构化输入映射为图,应用邻居相似性引导的变异,并使用约束精化阶段来修复无效输入。我们将在八个真实的网格处理AI系统上进行验证性研究,将 GRAphRef 与 AFL、MeshAttack、Saffron 以及两个消融变体进行比较。评估指标包括结构有效性、语义保持(通过预测一致性衡量)和性能开销。实验数据来自 ShapeNetCore 网格种子以及 MeshCNN 和 HodgeNet 等系统的模型输出。我们将使用统计分析和各组件的延迟分解来评估每个假设。


Article 69

Title@2025-07-28 (1): Smart Expansion Techniques for ASP-based Interactive Configuration

Title: Smart Expansion Techniques for ASP-based Interactive Configuration Smart Expansion Techniques für ASP-basierte interaktive Konfiguration 基于ASP的交互式互动配置的智能扩展技术 2507.21027v1

Authors (5): Lucia Balážová, Richard Comploi-Taupe, Susana Hahn, Nicolas Rühling, Gottfried Schenner

Product configuration is a successful application of Answer Set Programming (ASP). However, challenges are still open for interactive systems to effectively guide users through the configuration process. The aim of our work is to provide an ASP-based solver for interactive configuration that can deal with large-scale industrial configuration problems and that supports intuitive user interfaces via an API. In this paper, we focus on improving the performance of automatically completing a partial configuration. Our main contribution enhances the classical incremental approach for multi-shot solving by four different smart expansion functions. The core idea is to determine and add specific objects or associations to the partial configuration by exploiting cautious and brave consequences before checking for the existence of a complete configuration with the current objects in each iteration. This approach limits the number of costly unsatisfiability checks and reduces the search space, thereby improving solving performance. In addition, we present a user interface that uses our API and is implemented in ASP.

产品配置是成功应用解答设置程序(ASP) 的产品配置。 但是,互动系统仍面临挑战,通过配置程序有效指导用户。我们的工作目标是为互动式配置提供一个基于 ASP 的解决方案,用于处理大规模工业配置问题,并通过API支持直观用户界面。在本文中,我们侧重于改进自动完成部分配置的性能。我们的主要贡献增强了通过四种不同的智能扩展功能进行多镜头解答的典型渐进方法。核心理念是,在检查每个迭代中是否存在与当前对象的完整配置之前,先利用谨慎和勇敢的后果,确定和增加部分配置中的具体对象或关联。这种方法限制了费用高昂的不满意性检查数量,并减少了搜索空间,从而改进了解决性能。此外,我们展示了一个用户界面,该界面使用我们的API,并在APP中实施。
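
The cautious and brave consequences that the expansion functions exploit have a simple set-based definition: cautious consequences are the atoms true in every answer set, and brave consequences are the atoms true in at least one. The sketch below illustrates only these definitions on explicitly listed answer sets. The atom names are hypothetical, and an ASP solver would compute both sets without enumerating all answer sets.

```python
def cautious_consequences(answer_sets):
    """Atoms contained in every answer set (intersection)."""
    if not answer_sets:
        raise ValueError("no answer sets: the program is unsatisfiable")
    return set(answer_sets[0]).intersection(*answer_sets[1:])


def brave_consequences(answer_sets):
    """Atoms contained in at least one answer set (union)."""
    return set().union(*answer_sets)


# Hypothetical answer sets for completing a partial rack configuration.
answer_sets = [
    {"rack(1)", "frame(1)", "module(a)"},
    {"rack(1)", "frame(1)", "frame(2)", "module(b)"},
]

cautious = cautious_consequences(answer_sets)
brave = brave_consequences(answer_sets)

# Atoms in `cautious` are forced in every completion and can be added safely;
# atoms outside `brave` can never occur and need not be considered.
print(sorted(cautious))
print(sorted(brave - cautious))
```

In this toy case `rack(1)` and `frame(1)` appear in every completion, while `frame(2)` and the two modules remain open choices.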


Article 70

Title@2025-07-28 (1): Automated Identification of Sexual Orientation and Gender Identity Discriminatory Texts from Issue Comments

Title: Automated Identification of Sexual Orientation and Gender Identity Discriminatory Texts from Issue Comments Automatisierte Identifizierung von sexueller Orientierung und Geschlechteridentität diskriminatorische Texte aus der Ausgabe Kommentare 从问题评论中自动识别性取向和性别认同歧视文本 2311.08485v2

Authors (5): Sayma Sultana, Jaydeb Sarker, Farzana Israt, Rajshakhar Paul, Amiangshu Bosu

In an industry dominated by straight men, many developers representing other gender identities and sexual orientations often encounter hateful or discriminatory messages. Such communications pose barriers to participation for women and LGBTQ+ persons. Due to sheer volume, manual inspection of all communications for discriminatory content is infeasible for a large-scale Free Open-Source Software (FLOSS) community. To address this challenge, this study aims to develop an automated mechanism to identify Sexual orientation and Gender identity Discriminatory (SGID) texts from software developers’ communications. To this end, we trained and evaluated SGID4SE (Sexual orientation and Gender Identity Discriminatory text identification for (4) Software Engineering texts) as a supervised learning-based SGID detection tool. SGID4SE incorporates six preprocessing steps and ten state-of-the-art algorithms. SGID4SE implements six different strategies to improve the performance of the minority class. We empirically evaluated each strategy and identified an optimum configuration for each algorithm. In our ten-fold cross-validation-based evaluations, a BERT-based model achieves the best performance with 85.9% precision, 80.0% recall, and 82.9% F1-Score for the SGID class. This model achieves 95.7% accuracy and 80.4% Matthews Correlation Coefficient. Our dataset and tool establish a foundation for further research in this direction.

在以直男为主的行业中,许多代表其他性别认同和性取向的开发商经常遇到仇恨或歧视性的信息。这种通信对妇女和男女同性恋、双性恋和变性者构成障碍。由于数量庞大,对所有歧视性通信的通信进行人工检查对于大型免费开放源码软件(FLOSS)社区来说是行不通的。为了应对这一挑战,本研究旨在开发一个自动机制,以识别软件开发商通信中的性取向和性别认同歧视文本(SGID),为此,我们培训和评价了SGID4SE((4)软件工程文本的性取向和性别认同歧视文本识别),将其作为一个以学习为基础的监督检测工具。SGID4SE包含六个预处理步骤和十种最先进的算法。SG4SE实施了六种不同的战略,以改善少数群体阶级的表现。我们从经验上评估了每项战略,并确定了每种算法的最佳配置。在我们十倍的交叉校验评估中,基于BERT的模型提升了最佳业绩,85.9%的精确度,8.0 % 回顾,以及8.9 %的FIFIF-C数据库进一步确定了我们这一模型的精确度。


Article 71

Title@2025-07-28 (1): Repairing vulnerabilities without invisible hands. A differentiated replication study on LLMs

Title: Repairing vulnerabilities without invisible hands. A differentiated replication study on LLMs Reparieren von Schwachstellen ohne unsichtbare Hände. Eine differenzierte Replikationsstudie auf LLMs 在没有无形手的情况下修复弱点,对LLMs进行差别化的推广研究。 2507.20977v1

Authors (2): Maria Camporese, Fabio Massacci

Background: Automated Vulnerability Repair (AVR) is a fast-growing branch of program repair. Recent studies show that large language models (LLMs) outperform traditional techniques, extending their success beyond code generation and fault detection. Hypothesis: These gains may be driven by hidden factors – “invisible hands” such as training-data leakage or perfect fault localization – that let an LLM reproduce human-authored fixes for the same code. Objective: We replicate prior AVR studies under controlled conditions by deliberately adding errors to the reported vulnerability location in the prompt. If LLMs merely regurgitate memorized fixes, both small and large localization errors should yield the same number of correct patches, because any offset should divert the model from the original fix. Method: Our pipeline repairs vulnerabilities from the Vul4J and VJTrans benchmarks after shifting the fault location by n lines from the ground truth. A first LLM generates a patch, a second LLM reviews it, and we validate the result with regression and proof-of-vulnerability tests. Finally, we manually audit a sample of patches and estimate the error rate with the Agresti-Coull-Wilson method.

背景:自动漏洞修复(AVR)是程序修复中一个快速发展的分支。最近的研究显示,大型语言模型(LLM)优于传统技术,将其成功从代码生成和故障检测进一步扩展。假设:这些收益可能由隐藏因素驱动,即“看不见的手”,例如训练数据泄漏或完美的故障定位,这些因素使LLM能够复现同一代码的人工修复。目标:我们在受控条件下复制先前的AVR研究,在提示中对所报告的漏洞位置刻意加入误差。如果LLM只是复述记忆中的修复,那么较小和较大的定位误差应当产生相同数量的正确补丁,因为任何偏移都应使模型偏离原始修复。方法:我们的流水线在将故障位置相对真实位置偏移 n 行之后,修复 Vul4J 和 VJTrans 基准中的漏洞。第一个LLM生成补丁,第二个LLM对其进行审查,我们再用回归测试和漏洞验证(proof-of-vulnerability)测试来验证结果。最后,我们人工审核一部分补丁样本,并使用 Agresti-Coull-Wilson 方法估计错误率。
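
Estimating an error rate from a manually audited sample amounts to putting a confidence interval on a binomial proportion. The abstract names the "Agresti-Coull-Wilson method" without further detail, so the sketch below shows the Agresti-Coull adjusted interval as one plausible reading. Which variant the authors used is an assumption here, and the sample numbers are invented for illustration.

```python
import math


def agresti_coull_interval(errors, n, z=1.96):
    """Agresti-Coull confidence interval for a binomial proportion.

    errors: number of audited patches judged incorrect
    n:      number of audited patches
    z:      normal quantile (1.96 for roughly 95% confidence)
    """
    n_adj = n + z ** 2                      # adjusted sample size
    p_adj = (errors + z ** 2 / 2) / n_adj   # adjusted point estimate
    half_width = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - half_width), min(1.0, p_adj + half_width)


# Hypothetical audit: 3 of 50 sampled patches were labeled incorrectly.
low, high = agresti_coull_interval(3, 50)
print(f"estimated error rate between {low:.3f} and {high:.3f}")
```

The adjustment adds z²/2 pseudo-successes and z²/2 pseudo-failures, which keeps the interval well-behaved when the observed error count is close to zero.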


Article 72

Title@2025-07-28 (1): Adopting Large Language Models to Automated System Integration

Title: Adopting Large Language Models to Automated System Integration Annahme großer Sprachmodelle zur Automatisierten Systemintegration 采用大语言模型实现自动化系统整合 2504.08490v2

Authors (1): Robin D. Pesl

Modern enterprise computing systems integrate numerous subsystems to resolve a common task by yielding emergent behavior. A widespread approach is using services implemented with Web technologies like REST or OpenAPI, which offer an interaction mechanism and service documentation standard, respectively. Each service represents a specific business functionality, allowing encapsulation and easier maintenance. Despite the reduced maintenance costs on an individual service level, increased integration complexity arises. Consequently, automated service composition approaches have arisen to mitigate this issue. Nevertheless, these approaches have not achieved high acceptance in practice due to their reliance on complex formal modeling. Within this Ph.D. thesis, we analyze the application of Large Language Models (LLMs) to automatically integrate the services based on a natural language input. The result is a reusable service composition, e.g., as program code. While not always generating entirely correct results, the result can still be helpful by providing integration engineers with a close approximation of a suitable solution, which requires little effort to become operational. Our research involves (i) introducing a software architecture for automated service composition using LLMs, (ii) analyzing Retrieval Augmented Generation (RAG) for service discovery, (iii) proposing a novel natural language query-based benchmark for service discovery, and (iv) extending the benchmark to complete service composition scenarios. We have presented our software architecture as Compositio Prompto, the analysis of RAG for service discovery, and submitted a proposal for the service discovery benchmark. Open topics are primarily the extension of the service discovery benchmark to service composition scenarios and the improvements of the service composition generation, e.g., using fine-tuning or LLM agents.

现代企业计算系统整合众多子系统,通过产生涌现行为来完成共同任务。一种广泛采用的方法是使用以REST或OpenAPI等Web技术实现的服务,二者分别提供交互机制和服务文档标准。每个服务代表一项特定的业务功能,便于封装和维护。尽管单个服务层面的维护成本降低了,集成的复杂性却随之增加。因此,出现了自动化服务组合方法来缓解这一问题。然而,这些方法依赖复杂的形式化建模,在实践中并未获得广泛接受。在本博士论文中,我们分析如何应用大语言模型(LLMs),根据自然语言输入自动集成服务。其结果是可复用的服务组合,例如程序代码。虽然生成的结果并不总是完全正确,但它能为集成工程师提供与合适方案相近的近似解,只需少量工作即可投入运行,因此仍有帮助。我们的研究包括:(一)提出使用LLMs进行自动化服务组合的软件架构;(二)分析用于服务发现的检索增强生成(RAG);(三)提出一种新的基于自然语言查询的服务发现基准;(四)将该基准扩展到完整的服务组合场景。我们已发表了名为Compositio Prompto的软件架构以及RAG用于服务发现的分析,并提交了服务发现基准的提案。尚待研究的主题主要是将服务发现基准扩展到服务组合场景,以及改进服务组合的生成,例如使用微调或LLM代理。


Article 73

Title@2025-07-28 (1): Advanced System Integration: Analyzing OpenAPI Chunking for Retrieval-Augmented Generation

Title: Advanced System Integration: Analyzing OpenAPI Chunking for Retrieval-Augmented Generation Advanced System Integration: Analysieren von OpenAPI Chunking für retrieval-Augmented Generation 高级系统集成:分析用于回溯源源代的 OpenAPI 弹进器 2411.19804v2

Authors (4): Robin D. Pesl, Jerin G. Mathew, Massimo Mecella, Marco Aiello

Integrating multiple (sub-)systems is essential to create advanced Information Systems (ISs). Difficulties mainly arise when integrating dynamic environments across the IS lifecycle. A traditional approach is a registry that provides the API documentation of the systems’ endpoints. Large Language Models (LLMs) have been shown to be capable of automatically creating system integrations (e.g., as service composition) based on this documentation but require concise input due to input token limitations, especially regarding comprehensive API descriptions. Currently, it is unknown how best to preprocess these API descriptions. Within this work, we (i) analyze the usage of Retrieval Augmented Generation (RAG) for endpoint discovery and the chunking, i.e., preprocessing, of OpenAPIs to reduce the input token length while preserving the most relevant information. To further reduce the input token length for the composition prompt and improve endpoint retrieval, we propose (ii) a Discovery Agent that only receives a summary of the most relevant endpoints and retrieves details on demand. We first evaluate RAG for endpoint discovery on the RestBench benchmark across the different chunking possibilities and parameters, measuring endpoint retrieval recall, precision, and F1 score. Then, we assess the Discovery Agent using the same test set. With our prototype, we demonstrate how to successfully employ RAG for endpoint discovery to reduce the token count. Although the results show high values for recall, precision, and F1, further research is necessary to retrieve all requisite endpoints. Our experiments show that for preprocessing, LLM-based and format-specific approaches outperform naive chunking methods. Relying on an agent further enhances these results, as the agent splits the tasks into multiple fine-grained subtasks, improving the overall RAG performance in the token count, precision, and F1 score.

整合多个( 子) 系统对于创建高级信息系统( IS) 至关重要 。 在整合IS 生命周期中的动态环境时, 困难主要出现 。 传统方法是一个登记册, 提供系统端点的 API 文档。 大语言模型( LLM ) 显示能够自动创建基于此文档的系统集成( 服务构成) , 但是由于输入象征限制, 特别是全面的 API 描述, 需要简明化输入。 目前, 如何最妥善地处理这些 API 描述 。 在这项工作中, 我们( i) 分析 Retrievval 高级加速生成( RAG) 的使用情况, 用于端点发现和块块, 即预处理, OpenAPIs 用于在保存最相关信息的同时减少输入代号长度。 为了进一步减少组成前的输入代号长度, 改进前端点的缩略图, 重新获取需求细节。 我们用 RAG 使用 Rest Bech 精确度基准, 首先, 来分析最终发现最终的精确度, 精确度 , 测试工具 显示我们不断 的缩数 。
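As a rough illustration of the format-specific chunking and the retrieval metrics named in the abstract above, the sketch below splits an already-parsed OpenAPI document into one chunk per operation and scores a retrieved endpoint set against the endpoints a task actually needs. The chunk layout and both function names are assumptions made for this example; the paper's own chunking strategies and parameters are not reproduced here.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}


def chunk_by_endpoint(spec: dict) -> list[dict]:
    """Split a parsed OpenAPI document into one chunk per (method, path) operation."""
    chunks = []
    for path, item in spec.get("paths", {}).items():
        for method, operation in item.items():
            if method.lower() not in HTTP_METHODS:
                continue  # skip path-level keys such as "parameters" or "summary"
            text = " ".join(
                part
                for part in (operation.get("summary"), operation.get("description"))
                if part
            )
            chunks.append({"endpoint": f"{method.upper()} {path}", "text": text})
    return chunks


def retrieval_scores(retrieved: set[str], relevant: set[str]) -> tuple[float, float, float]:
    """Return (recall, precision, F1) for a set of retrieved endpoints."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return recall, precision, f1
```

Each chunk's `text` would then be embedded and indexed; keeping the `endpoint` key alongside it lets the retrieved chunks be compared directly with the ground-truth endpoints of a benchmark task.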


Article 74

Title@2025-07-28 (1): A first look at ROS 2 applications written in asynchronous Rust

Title: A first look at ROS 2 applications written in asynchronous Rust Ein erster Blick auf ROS 2 Anwendungen geschrieben in asynchronen Rust 第一次查看ROS 2 申请,以非同步鲁斯特书写 2505.21323v3

Authors (3): Martin Škoudlil, Michal Sojka, Zdeněk Hanzálek

The increasing popularity of the Rust programming language in building robotic applications using the Robot Operating System (ROS 2) raises questions about its real-time execution capabilities, particularly when employing asynchronous programming. Existing real-time scheduling and response-time analysis techniques for ROS 2 focus on applications written in C++ and do not address the unique execution models and challenges presented by Rust’s asynchronous programming paradigm. In this paper, we analyze the execution model of R2R, a set of asynchronous Rust bindings for ROS 2, and of various asynchronous Rust runtimes, comparing them with the execution model of C++ ROS 2 applications. We propose a structured approach for R2R applications aimed at deterministic real-time operation involving thread prioritization and callback-to-thread mapping schemes. Our experimental evaluation based on measuring end-to-end latencies of a synthetic application shows that the proposed approach is effective and outperforms other evaluated configurations. A more complex autonomous driving case study demonstrates its practical applicability. Overall, the experimental results indicate that our proposed structure achieves bounded response times for time-critical tasks. This paves the way for future work to adapt existing or develop new response-time analysis techniques for R2R applications using our structure.

在使用机器人操作系统(ROS 2)建立机器人应用程序的过程中,拉斯特编程语言越来越受欢迎,这使人们对其实时执行能力产生疑问,特别是在使用无同步程序时。ROS 2的现有实时时间安排和响应时间分析技术侧重于C++中写成的应用程序,而没有解决拉斯特的无同步程序拟定范式提出的独特的执行模式和挑战。在本文件中,我们分析了R2R的执行模式 – – 一个无同步的拉斯特 ROS 2捆绑定和各种不同步的拉斯特运行时间,将其与C+ROS 2应用程序的执行模式进行比较。我们为R2应用程序提出了一个结构结构结构结构结构结构,旨在确定实时操作的确定性操作,包括线性优先排序和回溯至全程绘图计划。我们基于测量一个合成应用程序的端到端的迟误的实验性评估表明,拟议的方法是有效的,而且比其他经过评估的配置更完善。更复杂的自动驱动案例研究表明其实际适用性。总体而言,实验结果表明我们拟议的结构在时间紧迫的任务方面实现了约束的反应时间。我们现有的工作结构,或者用新的方法来为进行新的适应。


Article 75

Title@2025-07-28 (1): TypyBench: Evaluating LLM Type Inference for Untyped Python Repositories

Title: TypyBench: Evaluating LLM Type Inference for Untyped Python Repositories TypyBench: Bewertung der LLM-Typ-Schlussfolgerung für nicht typisierte Python-Repositories TypyBench: 评估非型式 Python 仓库的 LLM 类型推理 2507.22086v1

Authors (7): Honghua Dong, Jiacheng Yang, Xun Deng, Yuhe Jiang, Gennady Pekhimenko, Fan Long, Xujie Si

Type inference for dynamic languages like Python is a persistent challenge in software engineering. While large language models (LLMs) have shown promise in code understanding, their type inference capabilities remain underexplored. We introduce TypyBench, a benchmark designed to evaluate LLMs’ type inference across entire Python repositories. TypyBench features two novel metrics: TypeSim, which captures nuanced semantic relationships between predicted and ground truth types, and TypeCheck, which assesses type consistency across codebases. Our evaluation of various LLMs on a curated dataset of 50 high-quality Python repositories reveals that, although LLMs achieve decent TypeSim scores, they struggle with complex nested types and exhibit significant type consistency errors. These findings suggest that future research should shift focus from improving type similarity to addressing repository-level consistency. TypyBench provides a foundation for this new direction, offering insights into model performance across different type complexities and usage contexts. Our code and data are available at https://github.com/typybench/typybench.

Python 等动态语言的类型推断是软件工程的一个长期挑战。大型语言模型(LLMs)在代码理解方面表现出希望,但其类型推断能力仍然未得到充分探索。我们引入了TypyBench,这是用来评估整个Python 库中LLMs类型推断的基准。TypyBench 有两个新的指标:TypeSim,它捕捉了预测和地面真实类型之间的细微语义关系,TypeBench,它评估了各代码库之间的类型一致性。我们对50个高品质Python 库集成数据集的各种LLMs的评估显示,虽然LLMs取得了体面的Sim 类型评分,但它们与复杂的嵌套类型争斗,并显示出明显的类型一致性错误。这些研究结果表明,未来的研究的重点应该从改进类型相似性转向处理存储库层面的一致性。TypyBench为这一新方向提供了一个基础,为不同类型复杂和使用背景的模型性表现提供了深入的见解。我们的代码和数据可在https://github.com/typench/typybench/tybench。


Article 76

Title@2025-07-28 (1): SPICE: An Automated SWE-Bench Labeling Pipeline for Issue Clarity, Test Coverage, and Effort Estimation

Title: SPICE: An Automated SWE-Bench Labeling Pipeline for Issue Clarity, Test Coverage, and Effort Estimation SPICE: Eine automatisierte SWE-Bench-Etikettierungspipeline für Ausgabeklarheit, Testabdeckung und Aufwandsabschätzung SPICE: 用于议题清晰度、测试覆盖率和努力估算的SWE-Bennch自动标签管道 2507.09108v2

Authors (10): Aaditya Bhatia, Gustavo A. Oliva, Gopi Krishnan Rajbahadur, Haoxiang Zhang, Yihao Chen, Zhilong Chen, Arthur Leung, Dayi Lin, Boyuan Chen, Ahmed E. Hassan

High-quality labeled datasets are crucial for training and evaluating foundation models in software engineering, but creating them is often prohibitively expensive and labor-intensive. We introduce SPICE, a scalable, automated pipeline for labeling SWE-bench-style datasets with annotations for issue clarity, test coverage, and effort estimation. SPICE combines context-aware code navigation, rationale-driven prompting, and multi-pass consensus to produce labels that closely approximate expert annotations. SPICE’s design was informed by our own experience and frustration in labeling more than 800 instances from SWE-Gym. SPICE achieves strong agreement with human-labeled SWE-bench Verified data while reducing the cost of labeling 1,000 instances from around $100,000 (manual annotation) to just $5.10. These results demonstrate SPICE’s potential to enable cost-effective, large-scale dataset creation for SE-focused FMs. To support the community, we release both the SPICE tool and SPICE Bench, a new dataset of 6,802 SPICE-labeled instances curated from 291 open-source projects in SWE-Gym (over 13x larger than SWE-bench Verified).

高品质的标签数据集对于软件工程基础模型的培训和评价至关重要,但创建这些数据集往往费用高得令人望而却步,而且耗费大量人力。我们引入了SPICE,这是一个可扩缩的自动化管道,用于标为SWE-Bench型数据集,并配有说明,以澄清问题、测试覆盖范围和工作估计。SPICE结合了符合背景的代码导航、根据理由推动的提示和多种通用的共识,以制作与专家说明相近的标签。SPICE的设计参考了我们自己在标出SWE-Gym800多个实例方面的经验和挫折感。SPICE与人类标为SWE-Bench型的SWE-Bench 验证数据达成了强烈的一致,同时将标出1,000个实例的费用从大约100 000美元(人工注)降低到仅仅5.10美元。这些结果表明SPICE具有为SE重点调频调频提供具有成本效益的大规模数据集的潜力。为了支持社区,我们发布了SPICEICE工具和SPICE Tenge,这是一个新的数据集,由6 802个超过SWE-GYME的开放源项目中291 13级的6个重)。


Article 77

Title@2025-07-28 (1): Secret Breach Detection in Source Code with Large Language Models

Title: Secret Breach Detection in Source Code with Large Language Models Geheime Breach-Erkennung im Quellcode mit großen Sprachmodellen 具有大语言模式的源代码秘密侦测 2504.18784v2

Authors (5): Md Nafiu Rahman, Sadif Ahmed, Zahin Wahab, S M Sohan, Rifat Shahriyar

Background: Leaking sensitive information (such as API keys, tokens, and credentials) in source code remains a persistent security threat. Traditional regex and entropy-based tools often generate many false positives due to limited contextual understanding. Aims: This work aims to enhance secret detection in source code using large language models (LLMs), reducing false positives while maintaining high recall. We also evaluate the feasibility of using fine-tuned, smaller models for local deployment. Method: We propose a hybrid approach combining regex-based candidate extraction with LLM-based classification. We evaluate pre-trained and fine-tuned variants of various Large Language Models on a benchmark dataset from 818 GitHub repositories. Various prompting strategies and efficient fine-tuning methods are employed for both binary and multiclass classification. Results: The fine-tuned LLaMA-3.1 8B model achieved an F1-score of 0.9852 in binary classification, outperforming regex-only baselines. For multiclass classification, Mistral-7B reached 0.982 accuracy. Fine-tuning significantly improved performance across all models. Conclusions: Fine-tuned LLMs offer an effective and scalable solution for secret detection, greatly reducing false positives. Open-source models provide a practical alternative to commercial APIs, enabling secure and cost-efficient deployment in development workflows.

背景:源代码中泄露敏感信息(如API密钥、令牌和凭证)仍然是持续存在的安全威胁。传统的基于正则表达式和熵的工具由于上下文理解有限,往往产生大量误报。目标:这项工作旨在利用大语言模型(LLMs)加强源代码中的秘密检测,在保持高召回率的同时减少误报。我们还评估了使用经过微调的较小模型进行本地部署的可行性。方法:我们提出一种混合方法,将基于正则表达式的候选提取与基于LLM的分类相结合。我们在来自818个GitHub仓库的基准数据集上评估了多种大语言模型的预训练和微调变体。二分类和多分类均采用了多种提示策略和高效微调方法。结果:经过微调的LLaMA-3.1 8B模型在二分类中取得了0.9852的F1分数,优于仅使用正则表达式的基线。在多分类中,Mistral-7B的准确率达到0.982。微调显著提升了所有模型的性能。结论:经过微调的LLMs为秘密检测提供了有效且可扩展的解决方案,大幅减少了误报。开源模型为商业API提供了切实可行的替代方案,使开发工作流中的部署既安全又具成本效益。
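The hybrid approach in the abstract above (regex-based candidate extraction followed by a classifier) has roughly the following shape. Everything concrete in this sketch is an assumption: the candidate pattern is a hypothetical example rather than the paper's rule set, and `entropy_stub` is only a stand-in for the LLM classifier so that the example runs offline.

```python
import math
import re
from collections import Counter
from typing import Callable

# Hypothetical pattern: an identifier mentioning key/token/secret/password
# that is assigned a quoted literal of at least 8 characters.
CANDIDATE = re.compile(
    r"""(?i)\b([A-Za-z_]*(?:key|token|secret|password)[A-Za-z_]*)\s*[:=]\s*["']([^"']{8,})["']"""
)


def shannon_entropy(value: str) -> float:
    """Shannon entropy in bits per character."""
    counts = Counter(value)
    return -sum((n / len(value)) * math.log2(n / len(value)) for n in counts.values())


def extract_candidates(source: str) -> list[dict]:
    """Stage 1: cheap regex pass that over-approximates the set of secrets."""
    found = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in CANDIDATE.finditer(line):
            found.append(
                {"line": lineno, "name": match.group(1), "value": match.group(2), "context": line.strip()}
            )
    return found


def detect_secrets(source: str, classify: Callable[[dict], bool]) -> list[dict]:
    """Stage 2: keep only the candidates the classifier accepts."""
    return [candidate for candidate in extract_candidates(source) if classify(candidate)]


def entropy_stub(candidate: dict, threshold: float = 3.0) -> bool:
    """Placeholder classifier; an LLM call on candidate["context"] would go here."""
    return shannon_entropy(candidate["value"]) >= threshold
```

The split matters for cost: the regex stage is cheap and deliberately permissive, so the expensive classifier only sees the few lines that could plausibly contain a secret, together with their surrounding context.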


Article 78

Title@2025-07-28 (1): Enhancing Project-Specific Code Completion by Inferring Internal API Information

Title: Enhancing Project-Specific Code Completion by Inferring Internal API Information Verbesserung der projektspezifischen Code-Vervollständigung durch Schlussfolgerung interner API-Informationen 通过推断内部API信息加强具体项目法规的完成 2507.20888v1

Authors (6): Le Deng, Xiaoxue Ren, Chao Ni, Ming Liang, David Lo, Zhongxin Liu

Project-specific code completion is a critical task that leverages context from a project to generate accurate code. State-of-the-art methods use retrieval-augmented generation (RAG) with large language models (LLMs) and project information for code completion. However, they often struggle to incorporate internal API information, which is crucial for accuracy, especially when APIs are not explicitly imported in the file. To address this, we propose a method to infer internal API information without relying on imports. Our method extends the representation of APIs by constructing usage examples and semantic descriptions, building a knowledge base for LLMs to generate relevant completions. We also introduce ProjBench, a benchmark that avoids leaked imports and consists of large-scale real-world projects. Experiments on ProjBench and CrossCodeEval show that our approach significantly outperforms existing methods, improving code exact match by 22.72% and identifier exact match by 18.31%. Additionally, integrating our method with existing baselines boosts code match by 47.80% and identifier match by 35.55%.

具体项目的代码完成是一项至关重要的任务,它利用项目的背景来生成准确的代码。 最先进的方法使用大语言模型(LLMs)的检索生成和项目信息来完成代码完成。 但是,它们往往难以纳入内部API信息,这对于准确性至关重要, 特别是在文件没有明确输入API的情况下。 为了解决这个问题, 我们提出了一个不依靠进口来推断内部API信息的方法。 我们的方法通过构建使用示例和语义描述来扩展API的表示面, 为LLMs建立知识库以生成相关完成。 我们还引入了ProjBench, 这是一种避免泄漏进口和由大型真实世界项目组成的基准。 ProjBench和CrossCodeEval实验显示, 我们的方法大大超出现有方法, 使代码精确匹配率提高22.72%,使识别符号精确匹配率提高18.31%。 此外,我们的方法与现有基线推进代码整合了47.80%, 标识匹配率达到35.55%。


Article 79

Title@2025-07-28 (1): Search-Based Fuzzing For RESTful APIs That Use MongoDB

Title: Search-Based Fuzzing For RESTful APIs That Use MongoDB Suchbasiertes Fuzzing für RESTful APIs, die MongoDB verwenden 使用 MOngoDB 的基于搜索的模糊信息 2507.20848v1

Authors (4): Hernan Ghianni, Man Zhang, Juan P. Galeotti, Andrea Arcuri

In RESTful APIs, interactions with a database are a common and crucial aspect. When generating whitebox tests, it is essential to consider the database’s state (i.e., the data contained in the database) to achieve higher code coverage and uncover more hidden faults. This article presents novel techniques to enhance search-based software test generation for RESTful APIs interacting with NoSQL databases. Specifically, we target the popular MongoDB database, by dynamically analyzing (via automated code instrumentation) the state of the database during the test generation process. Additionally, to achieve better results, our novel approach allows inserting NoSQL data directly from test cases. This is particularly beneficial when generating the correct sequence of events to set the NoSQL database in an appropriate state is challenging or time-consuming. This method is also advantageous for testing read-only microservices. Our novel techniques are implemented as an extension of EvoMaster, the only open-source tool for white-box fuzzing RESTful APIs. Experiments conducted on six RESTful APIs demonstrated significant improvements in code coverage, with increases of up to 18% compared to existing white-box approaches. To better highlight the improvements of our novel techniques, comparisons are also carried out with four state-of-the-art black-box fuzzers.

在RESTful API中,与数据库的交互是常见且关键的环节。生成白盒测试时,必须考虑数据库的状态(即数据库中包含的数据),以实现更高的代码覆盖率并发现更多隐藏的故障。本文提出了新技术,用于增强针对与NoSQL数据库交互的RESTful API的基于搜索的软件测试生成。具体而言,我们以流行的MongoDB数据库为目标,在测试生成过程中(通过自动代码插桩)动态分析数据库的状态。此外,为了取得更好的结果,我们的新方法允许直接从测试用例中插入NoSQL数据。当生成正确的事件序列以使NoSQL数据库处于适当状态具有挑战性或十分耗时时,这一点尤其有益。该方法对测试只读微服务也有优势。我们的新技术作为EvoMaster的扩展实现,EvoMaster是唯一用于RESTful API白盒模糊测试的开源工具。在六个RESTful API上进行的实验表明代码覆盖率显著提高,与现有白盒方法相比最多提高18%。为了更好地突出新技术带来的改进,我们还与四个最先进的黑盒模糊测试工具进行了比较。


Article 80

Title@2025-07-28 (1): Client–Library Compatibility Testing with API Interaction Snapshots

Title: Client–Library Compatibility Testing with API Interaction Snapshots Client-Bibliothek Kompatibilitätstest mit API-Interaktions-Snapshots 客户- Library 兼容性测试与 API 互动抓图 2507.20814v1

Authors (4): Gustave Monce, Thomas Degueule, Jean-Rémy Falleri, Romain Robbes

Modern software development heavily relies on third-party libraries to speed up development and enhance quality. As libraries evolve, they may break the tacit contract established with their clients by introducing behavioral breaking changes (BBCs) that alter run-time behavior and silently break client applications without being detected at compile time. Traditional regression tests on the client side often fail to detect such BBCs, either due to limited library coverage or weak assertions that do not sufficiently exercise the library’s expected behavior. To address this issue, we propose a novel approach to client–library compatibility testing that leverages existing client tests in a novel way. Instead of relying on developer-written assertions, we propose recording the actual interactions at the API boundary during the execution of client tests (protocol, input and output values, exceptions, etc.). These sequences of API interactions are stored as snapshots which capture the exact contract expected by a client at a specific point in time. As the library evolves, we compare the original and new snapshots to identify perturbations in the contract, flag potential BBCs, and notify clients. We implement this technique in our prototype tool Gilesi, a Java framework that automatically instruments library APIs, records snapshots, and compares them. Through a preliminary case study on several client–library pairs with artificially seeded BBCs, we show that Gilesi reliably detects BBCs missed by client test suites.

现代软件开发在很大程度上依赖第三方图书馆来加快开发和提高质量。随着图书馆的发展,它们可能会通过引入行为突破性改变(BBCs)来改变运行时间行为,在不及时编译的情况下悄悄中断客户应用程序,从而打破与客户的默认合同。客户方的传统回归测试往往无法检测到这种BBC,原因有二:图书馆覆盖面有限,或断言不足,无法充分运用图书馆预期行为。为了解决这一问题,我们提议对客户-图书馆兼容性测试采取新颖的方法,以新颖方式利用现有客户测试。我们建议,在执行客户测试(程序、投入和产出值、例外等)时,不依赖开发者撰写的断言,而是在API边界记录实际互动情况。API的这些互动顺序通常被存储为描述客户在特定时间期望的准确合同的快照。随着图书馆的发展,我们将原始和新的快照加以比较,以便以新方式利用现有客户的测试。我们采用这种技术,而不是依赖开发者撰写的断言。我们用原型工具来记录API、输入一个自动的客户测试工具。
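The snapshot idea in the abstract above can be illustrated with a small sketch: wrap a library function, record every call that crosses the API boundary while the client's tests run, and diff the recordings taken against two library versions. Gilesi itself is a Java framework that instruments library APIs automatically; the Python wrapper and the names `record_calls` and `diff_snapshots` below are hypothetical and only show the principle.

```python
from typing import Any, Callable


def record_calls(api_function: Callable, snapshot: list) -> Callable:
    """Wrap a library function so each call at the API boundary is appended to `snapshot`."""

    def wrapper(*args: Any, **kwargs: Any) -> Any:
        entry = {
            "api": api_function.__name__,
            "args": repr(args),
            "kwargs": repr(sorted(kwargs.items())),
        }
        try:
            result = api_function(*args, **kwargs)
        except Exception as exc:
            entry["raised"] = type(exc).__name__  # exceptions are part of the contract too
            snapshot.append(entry)
            raise
        entry["returned"] = repr(result)
        snapshot.append(entry)
        return result

    return wrapper


def diff_snapshots(old: list, new: list) -> list:
    """Return (index, before, after) for every recorded interaction that differs."""
    diffs = []
    for index in range(max(len(old), len(new))):
        before = old[index] if index < len(old) else None
        after = new[index] if index < len(new) else None
        if before != after:
            diffs.append((index, before, after))
    return diffs
```

A non-empty diff flags a candidate behavioral breaking change even when every client assertion still passes, because the comparison covers the observed inputs, outputs, and exceptions rather than only what the developer chose to assert.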


Article 81

Title@2025-07-28 (1): LLM-Based Repair of Static Nullability Errors

Title: LLM-Based Repair of Static Nullability Errors LLM-basierte Reparatur von statischen Nullierbarkeitsfehlern LLM – – 基于LLM的静态误差修复 2507.20674v1

Authors (4): Nima Karimipour, Michael Pradel, Martin Kellogg, Manu Sridharan

Modern Java projects increasingly adopt static analysis tools that prevent null-pointer exceptions by treating nullness as a type property. However, integrating such tools into large, existing codebases remains a significant challenge. While annotation inference can eliminate many errors automatically, a subset of residual errors – typically a mix of real bugs and false positives – often persist and can only be resolved via code changes. Manually addressing these errors is tedious and error-prone. Large language models (LLMs) offer a promising path toward automating these repairs, but naively-prompted LLMs often generate incorrect, contextually-inappropriate edits. Resolving a nullability error demands a deep understanding of how a symbol is used across the codebase, often spanning methods, classes, and packages. We present NullRepair, a system that integrates LLMs into a structured workflow for resolving the errors from a nullability checker. NullRepair’s decision process follows a flowchart derived from manual analysis of 200 real-world errors. It leverages static analysis to identify safe and unsafe usage regions of symbols, using error-free usage examples to contextualize model prompts. Patches are generated through an iterative interaction with the LLM that incorporates project-wide context and decision logic. Our evaluation on 12 real-world Java projects shows that NullRepair resolves an average of 72% of the errors that remain after applying a state-of-the-art annotation inference technique. Unlike a naively-prompted LLM, NullRepair also largely preserves program semantics, with all unit tests passing in 10/12 projects after applying every edit proposed by NullRepair, and 98% or more tests passing in the remaining two projects.

现代爪哇项目越来越多地采用静态分析工具,通过将无效性视为一种类型属性来防止无效性例外。然而,将此类工具整合到大型的属性中,现有的代码库仍是一个重大挑战。虽然批注推算可以自动消除许多错误,但一系列残余错误 – – 通常是由真正错误和虚假正反相混合而成 – – 往往会持续,并且只能通过代码修改来解决。手工处理这些错误是乏味和易出错的。大型语言模型(LLLMs)为这些修复自动化提供了一条充满希望的道路,但天真的LMs经常产生不正确、背景不适当的编辑。解决一个无效性错误错误错误错误错误错误的错误错误错误错误错误,现有的代码库仍然是一个巨大的挑战。尽管要解决一个符号如何在代码库中使用,但一个通常跨越方法、类别和包的子串联。我们介绍一个系统,将LMSLM纳入一个结构性的流程。Nell RellRepair的流程进程遵循一个流程图案图案,通过一个不误算法的模型,在正常的版本中将一个不误判。


Article 82

Title@2025-07-28 (1): Testora: Using Natural Language Intent to Detect Behavioral Regressions

Title: Testora: Using Natural Language Intent to Detect Behavioral Regressions Testora: Mit Hilfe der natürlichen Sprache intent, um Verhaltensregressionen zu erkennen 测试:使用自然语言意图检测行为倒退 2503.18597v2

Authors (1): Michael Pradel

As software evolves, code changes can introduce regression bugs or affect the behavior in other unintended ways. Traditional regression test generation is impractical for detecting unintended behavioral changes, because it reports all behavioral differences as potential regressions. However, most code changes are intended to change the behavior in some way, e.g., to fix a bug or to add a new feature. This paper presents Testora, the first automated approach that detects regressions by comparing the intentions of a code change against behavioral differences caused by the code change. Given a pull request (PR), Testora queries an LLM to generate tests that exercise the modified code, compares the behavior of the original and modified code, and classifies any behavioral differences as intended or unintended. For the classification, we present an LLM-based technique that leverages the natural language information associated with the PR, such as the title, description, and commit messages, effectively using the natural language intent to detect behavioral regressions. Applying Testora to PRs of complex and popular Python projects, we find 19 regression bugs and 11 PRs that, despite having another intention, coincidentally fix a bug. Out of 13 regressions reported to the developers, 11 have been confirmed and 9 have already been fixed. The costs of using Testora are acceptable for real-world deployment, with 12.3 minutes to check a PR and LLM costs of only $0.003 per PR. We envision our approach being used before or shortly after a code change gets merged into a code base, providing a way to detect, early on, regressions that are not caught by traditional approaches.

随着软件的演进,代码变化可以引入回归错误,或者以其他意想不到的方式影响行为。传统的回归测试生成对于检测无意的行为变化是不切实际的,因为传统回归测试生成将所有行为差异都报告为潜在的回归。然而,大多数代码变化的目的是以某种方式改变行为,例如,修复错误或者添加一个新特性。本文展示了Teora,这是通过比较代码变化导致的行为差异来检测代码变化意图的第一种自动方法。根据一项拉请求(PR),Tesora询问一个LLM,以生成使用修改后的代码进行测试,比较原始代码和修改后的代码的行为变化,并将任何行为差异归为预想或无意的。但是,对于分类,我们提出了一种基于LLMM的技术,以某种方式来改变行为行为,例如修正错误或添加新的特征。我们提出了一个自然语言意图来检测行为回归。根据传统和流行的Python项目,我们发现了19个回归错误和11个PRR,尽管有另一种意图,但早期修正的代码并没有被提前修正, 也用了一个固定的代码来修正。
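The compare-then-classify loop in the abstract above can be sketched as a small differential harness. This is an assumption-laden outline, not Testora's implementation: in the paper the inputs come from LLM-generated tests and the intended/unintended decision is made by an LLM reading the PR title, description, and commit messages, whereas here both are plain Python placeholders (`inputs` and `is_intended`).

```python
from typing import Any, Callable, Iterable


def observe(function: Callable, argument: Any) -> tuple:
    """Run one call and summarise it as ("ok", value) or ("raised", exception name)."""
    try:
        return ("ok", function(argument))
    except Exception as exc:
        return ("raised", type(exc).__name__)


def behavioral_differences(old: Callable, new: Callable, inputs: Iterable) -> list[dict]:
    """Collect every input on which the original and modified code behave differently."""
    differences = []
    for argument in inputs:
        before, after = observe(old, argument), observe(new, argument)
        if before != after:
            differences.append({"input": argument, "old": before, "new": after})
    return differences


def classify(differences: list[dict], is_intended: Callable[[dict], bool]) -> list[dict]:
    """Label each difference; `is_intended` stands in for the LLM-based judgment."""
    return [
        {**difference, "label": "intended" if is_intended(difference) else "unintended"}
        for difference in differences
    ]
```

Only the differences labeled `unintended` would be reported as regression candidates, which is what separates this from classic regression test generation that flags every behavioral difference.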


Article 83

Title@2025-07-28 (1): Intention-Driven Generation of Project-Specific Test Cases

Title: Intention-Driven Generation of Project-Specific Test Cases Intentionsgetriebene Generierung projektspezifischer Testfälle 项目具体试验个案的有意和有意生成 2507.20619v1

Authors (7): Binhang Qi, Yun Lin, Xinyi Weng, Yuhuan Huang, Chenyan Liu, Hailong Sun, Jin Song Dong

Test cases are valuable assets for maintaining software quality. While numerous automated techniques have been proposed for generating tests (either by maximizing code coverage or by translating focal code into test code), practical tests are seldom driven by coverage alone. In real projects, each test reflects a developer’s validation intention for a specific behaviour and embodies rich, project-specific knowledge: which specific APIs to call and what assertions truly matter. Without considering such knowledge, tests can hardly pass code review and be integrated into the software product. In this work, we propose IntentionTest, which generates project-specific tests with validation intention as a structured description. Our design is motivated by two insights: (1) a description of validation intention, compared to coverage and focal code, carries more crucial information about what to test; and (2) practical tests exhibit high code duplication, indicating that domain knowledge is highly reusable for writing new tests. Given a focal code and a description of validation intention (in the form of either an informal comment or a formal test plan), IntentionTest retrieves a referable test in the project to guide test generation. Moreover, IntentionTest reduces the test generation problem into an editing problem on the test code regarding the validation intention. It generates a test including both test prefix and oracle, which aims to be executable and semantically correct. We evaluate IntentionTest against state-of-the-art baselines on 4,146 test cases from 13 open-source projects. Specifically, compared to ChatTester, IntentionTest can (1) generate significantly more semantically correct tests, improving common mutation scores by 39.03% and coverage overlap with ground-truth tests by 40.14%; (2) generate 21.30% more successful passing tests.

测试案例是维护软件质量的宝贵资产。 虽然为生成测试建议了许多自动化技术(通过最大限度地扩大代码覆盖范围或将焦码转换成测试代码),但实际测试很少由覆盖单独驱动。在实际项目中,每次测试反映开发者对特定行为的验证意图,并体现丰富的、具体项目知识:哪些是特定API,什么是真实的。不考虑这种知识,测试很难通过代码审查,也很难纳入软件产品。在这项工作中,我们提议了“有意测试”,该测试产生特定项目的重叠测试,目的是进行结构化的描述。我们的设计有两种洞察力:(1) 与覆盖和焦码相比,验证意图的描述更为关键;(2) 实际测试显示高代码重复性,表明域知识在撰写新测试时非常可重复。鉴于一个联络点代码和验证意向说明(以非正式评论或正式测试计划的形式),“有意测试测试”通过指导测试生成的公开测试结果。此外,“有意测试”可以将测试生成的生成的问题降低测试生成到测试范围的常规测试范围,包括测试前的测试结果。
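The retrieval step mentioned in the abstract above (finding a referable test in the project to guide generation) could look like the sketch below. The abstract does not say how IntentionTest ranks candidates, so the token-overlap (Jaccard) similarity, the function names, and the shape of `existing_tests` are all assumptions chosen to keep the example self-contained.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a description."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(left: set[str], right: set[str]) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = left | right
    return len(left & right) / len(union) if union else 0.0


def retrieve_reference_test(intention: str, existing_tests: dict[str, str]) -> str:
    """Return the name of the existing test whose description best matches the intention.

    `existing_tests` maps a test name to a short description of what it validates.
    """
    wanted = tokens(intention)
    return max(existing_tests, key=lambda name: jaccard(wanted, tokens(existing_tests[name])))
```

The retrieved test then serves as the starting point that is edited toward the new validation intention, which is how the abstract frames generation as an editing problem.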


Article 84

Title@2025-07-28 (1): Refactoring Deep Learning Code: A Study of Practices and Unsatisfied Tool Needs

Title: Refactoring Deep Learning Code: A Study of Practices and Unsatisfied Tool Needs Refactoring Deep Learning Code: Ein Studium von Praktiken und unzufriedenen Werkzeugbedürfnissen 重构深层学习守则:实践和不满意工具需要研究 2405.04861v2

Authors (6): SiQi Wang, Xing Hu, Bei Wang, WenXin Yao, Xin Xia, XingYu Wang

With the rapid development of deep learning, the implementation of intricate algorithms and substantial data processing have become standard elements of deep learning projects. As a result, the code grows progressively more complex as the software evolves, making it difficult to maintain and understand. Existing studies have investigated the impact of refactoring on software quality within traditional software. However, how code refactoring is practiced in the context of deep learning is still unclear. This study endeavors to fill this knowledge gap by empirically examining the current state of code refactoring in the deep learning realm and practitioners’ views on refactoring. We first manually analyzed the commit history of five popular and well-maintained deep learning projects (e.g., PyTorch). We mined 4,921 refactoring practices in historical commits, measured how different types and elements of refactoring operations are distributed, and found that the distribution of refactoring operation types in deep learning projects differs from that in traditional Java software. We then surveyed 159 practitioners about their views of code refactoring in deep learning projects and their expectations of current refactoring tools. The survey results showed that refactoring research and the development of related tools in the field of deep learning are crucial for improving project maintainability and code quality, and that current refactoring tools do not adequately meet the needs of practitioners. Lastly, we provided our perspective on the future advancement of refactoring tools and offered suggestions for developers’ development practices.

随着深层学习的迅速发展,复杂算法的实施和大量数据处理已成为深层学习项目的标准要素。因此,随着软件的演变,守则逐渐变得复杂,难以维持和理解。现有研究调查了传统软件内软件质量再设置因素的影响;然而,深层学习背景下代码再设置因素的洞察力仍不清楚。这一研究努力填补这一知识差距,通过经验检查深层学习领域的代码再设置现状和从业者对再设置因素的看法。我们首先人工分析了五个广受欢迎的和保持良好的深层学习项目(例如PyTorch)的实践历史历史。我们挖掘了4 921种历史再设置做法,并测量了传统软件内对深层学习项目的不同类型和要素的分布。我们随后调查了159个从业者对深层学习项目中代码重新设置参数的看法,以及他们对当前再设置工具的期望。我们通过调查的结果显示,在历史承诺和测量工具的正确性方面,我们不断改进了当前质量的再定位和再设定工具,因此,我们为实地发展提供了至关重要的再定位和再定位工具。


Article 85

Title@2025-07-28 (1): GeoJSEval: An Automated Evaluation Framework for Large Language Models on JavaScript-Based Geospatial Computation and Visualization Code Generation

Title: GeoJSEval: An Automated Evaluation Framework for Large Language Models on JavaScript-Based Geospatial Computation and Visualization Code Generation GeoJSEval: Ein automatisiertes Evaluations-Framework für große Sprachmodelle auf JavaScript-Basis Geospatial Computation and Visualization Code Generierung GeoJSEval: JavaScript基于地理空间计算和可视化代码生成大语言模型自动评价框架 2507.20553v1

Authors (9): Guanyu Chen, Haoyue Jiao, Shuyang Hou, Ziqi Liu, Lutong Xie, Shaowen Wu, Huayi Wu, Xuefeng Guan, Zhipeng Gui

With the widespread adoption of large language models (LLMs) in code generation tasks, geospatial code generation has emerged as a critical frontier in the integration of artificial intelligence and geoscientific analysis. This trend underscores the urgent need for systematic evaluation methodologies to assess LLMs’ generation capabilities in geospatial contexts. In particular, geospatial computation and visualization tasks in JavaScript environments rely heavily on orchestrating diverse frontend libraries and ecosystems, placing elevated demands on a model’s semantic understanding and code synthesis abilities. To address this challenge, we propose GeoJSEval, the first multimodal, function-level automatic evaluation framework for LLMs in JavaScript-based geospatial code generation. GeoJSEval comprises three core components: a standardized test suite (GeoJSEval-Bench), a code submission engine, and an evaluation module. It includes 432 function-level tasks and 2,071 structured test cases spanning five widely used JavaScript geospatial libraries and 25 mainstream geospatial data types. GeoJSEval enables multidimensional quantitative evaluation across metrics such as accuracy, output stability, execution efficiency, resource consumption, and error type distribution, and integrates boundary testing mechanisms to enhance robustness and coverage. We conduct a comprehensive evaluation of 18 state-of-the-art LLMs using GeoJSEval, revealing significant performance disparities and bottlenecks in spatial semantic understanding, code reliability, and function invocation accuracy. GeoJSEval provides a foundational methodology, evaluation resource, and practical toolkit for the standardized assessment and optimization of geospatial code generation models, with strong extensibility and applicability in real-world scenarios.

随着大语言模型(LLMs)在代码生成任务中的广泛应用,地理空间代码生成已成为人工智能与地球科学分析融合的关键前沿。这一趋势凸显了对系统化评估方法的迫切需求,以评估LLMs在地理空间场景中的生成能力。特别是,JavaScript环境中的地理空间计算与可视化任务高度依赖对多种前端库和生态系统的协同调用,对模型的语义理解和代码合成能力提出了更高要求。为应对这一挑战,我们提出GeoJSEval,这是首个面向基于JavaScript的地理空间代码生成的多模态、函数级LLM自动评估框架。GeoJSEval由三个核心组件组成:标准化测试套件(GeoJSEval-Bench)、代码提交引擎和评估模块。它包含432个函数级任务和2,071个结构化测试用例,涵盖五个广泛使用的JavaScript地理空间库和25种主流地理空间数据类型。GeoJSEval支持在准确性、输出稳定性、执行效率、资源消耗和错误类型分布等指标上进行多维度定量评估,并集成了边界测试机制以增强鲁棒性和覆盖率。我们使用GeoJSEval对18个最先进的LLMs进行了全面评估,揭示了它们在空间语义理解、代码可靠性和函数调用准确性方面的显著性能差异与瓶颈。GeoJSEval为地理空间代码生成模型的标准化评估与优化提供了基础方法、评估资源和实用工具,具有较强的可扩展性和实际应用价值。


Article 86

Title@2025-07-28 (1): Adaptive and Accessible User Interfaces for Seniors Through Model-Driven Engineering

Title: Adaptive and Accessible User Interfaces for Seniors Through Model-Driven Engineering Adaptive und zugängliche Benutzeroberflächen für Senioren durch modellgetriebene Technik 通过模型驱动工程为老年人提供适应性和无障碍用户界面 2502.18828v2

Authors (4): Shavindra Wickramathilaka, John Grundy, Kashumi Madampe, Omar Haggag

The use of diverse mobile applications among senior users is becoming increasingly widespread. However, many of these apps contain accessibility problems that result in negative user experiences for seniors. A key reason is that software practitioners often lack the time or resources to address the broad spectrum of age-related accessibility and personalisation needs. As current developer tools and practices encourage one-size-fits-all interfaces with limited potential to address the diversity of senior needs, there is a growing demand for approaches that support the systematic creation of adaptive, accessible app experiences. To this end, we present AdaptForge, a novel model-driven engineering (MDE) approach that enables advanced design-time adaptations of mobile application interfaces and behaviours tailored to the accessibility needs of senior users. AdaptForge uses two domain-specific languages (DSLs) to address age-related accessibility needs. The first model defines users’ context-of-use parameters, while the second defines conditional accessibility scenarios and corresponding UI adaptation rules. These rules are interpreted by an MDE workflow to transform an app’s original source code into personalised instances. We also report evaluations with professional software developers and senior end-users, demonstrating the feasibility and practical utility of AdaptForge.

高级用户使用各种移动应用程序的情况日益普遍,但是,许多这些应用程序都含有无障碍问题,导致老年人遇到负面用户经历。一个关键的原因是,软件从业人员往往缺乏时间或资源,无法满足与年龄有关的无障碍和个人化需求的广泛范围。由于目前的开发者工具和做法鼓励了满足老年人需求多样性潜力有限的一刀切的界面,因此对支持系统创建适应性、无障碍应用经验的方法的需求日益增长。为此,我们介绍了适应Forge,这是一种新型的模型驱动工程(MDE)方法,能够根据高级用户的无障碍需要,对移动应用界面和行为进行先进的设计时间调整。适应Forge使用两种特定领域的语言(DSLs)来满足与年龄有关的无障碍需求。第一个模型界定了用户的背景使用参数,而第二个模型则界定了有条件的无障碍情景和相应的UI调整规则。这些规则由MDE工作流程解释,将应用程序的原始源代码转化为个性化实例。我们还报告与专业软件开发者和高级终端用户进行的评价,展示了适应的可行性和实用性。


Article 87

Title@2025-07-28 (1): VDGraph: A Graph-Theoretic Approach to Unlock Insights from SBOM and SCA Data

Title: VDGraph: A Graph-Theoretic Approach to Unlock Insights from SBOM and SCA Data VDGraph: Ein graphisch-theoretischer Ansatz, um Einsichten aus SBOM- und SCA-Daten zu entsperren VDGraph: SBOM 和 SCA 数据解锁透视的图形理论方法 2507.20502v1

Authors (5): Howell Xia, Jonah Gluck, Sevval Simsek, David Sastre Medina, David Starobinski

The high complexity of modern software supply chains necessitates tools such as Software Bills of Materials (SBOMs) to manage component dependencies, and Software Composition Analysis (SCA) tools to identify vulnerabilities. While there exists limited integration between SBOMs and SCA tools, a unified view of complex dependency-vulnerability relationships remains elusive. In this paper, we introduce VDGraph, a novel knowledge graph-based methodology for integrating vulnerability and dependency data into a holistic view. VDGraph consolidates SBOM and SCA outputs into a graph representation of software projects’ dependencies and vulnerabilities. We provide a formal description and analysis of the theoretical properties of VDGraph and present solutions to manage possible conflicts between the SBOM and SCA data. We further introduce and evaluate a practical, proof-of-concept implementation of VDGraph using two popular SBOM and SCA tools, namely the CycloneDX Maven plugin and Google’s OSV-Scanner. We apply VDGraph to 21 popular Java projects. Through the formulation of appropriate queries on the graphs, we uncover the existence of concentrated risk points (i.e., vulnerable components of high severity reachable through numerous dependency paths). We further show that vulnerabilities predominantly emerge at a depth of three dependency levels or higher, indicating that direct or secondary dependencies exhibit lower vulnerability density and tend to be more secure. Thus, VDGraph contributes a graph-theoretic methodology that improves visibility into how vulnerabilities propagate through complex, transitive dependencies. Moreover, our implementation, which combines open SBOM and SCA standards with Neo4j, lays a foundation for scalable and automated analysis across real-world projects.

现代软件供应链高度复杂,需要软件物料清单(SBOM)等工具来管理组件依赖关系,并需要软件成分分析(SCA)工具来识别漏洞。虽然SBOM与SCA工具之间已有有限的集成,但对复杂的依赖关系与漏洞关系的统一视图仍难以实现。本文提出VDGraph,这是一种基于知识图谱的新方法,用于将漏洞数据和依赖数据整合为整体视图。VDGraph将SBOM和SCA的输出合并为软件项目依赖关系与漏洞的图表示。我们给出了VDGraph理论性质的形式化描述与分析,并提出了处理SBOM与SCA数据之间可能冲突的方案。我们还使用两个流行的SBOM和SCA工具(CycloneDX Maven插件和Google的OSV-Scanner)介绍并评估了VDGraph的一个实用的概念验证实现。我们将VDGraph应用于21个流行的Java项目。通过在图上构造适当的查询,我们发现了集中风险点的存在(即可通过大量依赖路径到达的高严重性漏洞组件)。我们进一步表明,漏洞主要出现在三层或更深的依赖层级,这表明直接依赖或二级依赖的漏洞密度较低,往往更安全。因此,VDGraph提供了一种图论方法,提升了对漏洞如何通过复杂的传递依赖进行传播的可见性。此外,我们的实现将开放的SBOM和SCA标准与Neo4j相结合,为跨真实项目的可扩展自动化分析奠定了基础。
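
The kind of graph query the VDGraph abstract describes (dependency depth of vulnerable components, and how many dependency paths reach them) can be sketched in plain Python. This is only an illustrative stand-in: the paper's implementation uses Neo4j, and the project graph, the component names, the severity labels, and the helper functions `min_depths` and `count_paths` below are all hypothetical.

```python
from collections import deque


def min_depths(deps, root):
    """Breadth-first search from the root project; depth 1 is a direct dependency."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in deps.get(node, []):
            if child not in depth:
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth


def count_paths(deps, node, target, memo=None):
    """Count distinct dependency paths from node to target (graph assumed acyclic)."""
    if memo is None:
        memo = {}
    if node == target:
        return 1
    if node not in memo:
        memo[node] = sum(count_paths(deps, c, target, memo) for c in deps.get(node, []))
    return memo[node]


# Hypothetical dependency graph (what an SBOM might yield) and
# hypothetical vulnerability findings (what an SCA tool might yield).
deps = {
    "app": ["web", "json", "log"],
    "web": ["http", "json"],
    "http": ["codec"],
    "json": ["codec"],
    "log": [],
    "codec": [],
}
vulnerable = {"codec": "HIGH", "log": "LOW"}

depths = min_depths(deps, "app")
report = {
    name: {
        "severity": severity,
        "depth": depths[name],
        "paths": count_paths(deps, "app", name),
    }
    for name, severity in vulnerable.items()
}

for name, info in report.items():
    print(name, info)
```

A component with high severity and a large `paths` count is the sort of concentrated risk point the abstract mentions; in a graph database the same question would be phrased as a path query rather than hand-written traversal.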


Article 88

Title@2025-07-28 (1): Distinguishing Quantum Software Bugs from Hardware Noise: A Statistical Approach

Title: Distinguishing Quantum Software Bugs from Hardware Noise: A Statistical Approach Distinguishing Quantum Software Bugs from Hardware Noise: Ein statistischer Ansatz 将量子软件错误与硬件噪音区分开:统计方法 2507.20475v1

Authors (5): Ahmik Virani, Devraj, Anirudh Suresh, Lei Zhang, M V Panduranga Rao

Quantum computing in the Noisy Intermediate-Scale Quantum (NISQ) era presents significant challenges in differentiating quantum software bugs from hardware noise. Traditional debugging techniques from classical software engineering cannot directly resolve this issue due to the inherently stochastic nature of quantum computation mixed with noises from NISQ computers. To address this gap, we propose a statistical approach leveraging probabilistic metrics to differentiate between quantum software bugs and hardware noise. We evaluate our methodology empirically using well-known quantum algorithms, including Grover’s algorithm, Deutsch-Jozsa algorithm, and Simon’s algorithm. Experimental results demonstrate the efficacy and practical applicability of our approach, providing quantum software developers with a reliable analytical tool to identify and classify unexpected behavior in quantum programs.

在Noisy中度量子(NISQ)时代的量子计算在区分量子软件错误和硬件噪音方面提出了重大挑战。古典软件工程的传统调试技术无法直接解决这个问题,因为量子计算与NISQ计算机噪音混杂在一起,具有内在的随机性。为了解决这一差距,我们建议采用一种统计方法,利用概率性指标来区分量子软件错误和硬件噪音。我们用经验来评估我们的方法,使用众所周知的量子算法,包括Grover的算法、Deutsch-Jozsa算法和Simon的算法。实验结果表明我们的方法的有效性和实际适用性,为量子软件开发者提供可靠的分析工具,用以识别和分类量子方案中的意外行为。
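
The abstract does not name the probabilistic metrics it uses, so the following is only a sketch of one plausible ingredient: the total variation distance between the measured outcome distribution and the ideal one, compared against a noise budget. The counts, the ideal distribution, the threshold `NOISE_BUDGET`, and the function names are all assumptions made for illustration, not the authors' method.

```python
def total_variation_distance(counts, ideal):
    """Half the L1 distance between measured frequencies and ideal probabilities."""
    shots = sum(counts.values())
    outcomes = set(counts) | set(ideal)
    return 0.5 * sum(
        abs(counts.get(o, 0) / shots - ideal.get(o, 0.0)) for o in outcomes
    )


# Hypothetical threshold: how much deviation we are willing to attribute to hardware noise.
NOISE_BUDGET = 0.15


def classify(counts, ideal, noise_budget=NOISE_BUDGET):
    """Label a run by whether its deviation fits inside the assumed noise budget."""
    distance = total_variation_distance(counts, ideal)
    return "likely noise" if distance <= noise_budget else "possible bug"


# Ideal output of a hypothetical 2-qubit search that should always return "11".
ideal = {"11": 1.0}

# Made-up measurement counts over 1000 shots.
noisy_counts = {"11": 930, "10": 30, "01": 25, "00": 15}
buggy_counts = {"10": 940, "11": 20, "01": 25, "00": 15}

print(classify(noisy_counts, ideal))
print(classify(buggy_counts, ideal))
```

A fixed threshold is the crudest possible decision rule; a statistical treatment would calibrate the budget from the device's measured error rates and the number of shots.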


Article 89

Title@2025-07-27 (7): When Prompts Go Wrong: Evaluating Code Model Robustness to Ambiguous, Contradictory, and Incomplete Task Descriptions

Title: When Prompts Go Wrong: Evaluating Code Model Robustness to Ambiguous, Contradictory, and Incomplete Task Descriptions Wenn Prompts falsch gehen: Bewertung von Code-Modell Robustheit zu Ambigued, Contradictory und Unvollständige Aufgabenbeschreibungen 当提示出错时: 评估代码模型的强度, 使之与模糊、 矛盾和不完成的任务描述 2507.20439v1

Authors (7): Maya Larbi, Amal Akli, Mike Papadakis, Rihab Bouyousfi, Maxime Cordy, Federica Sarro, Yves Le Traon

Large Language Models (LLMs) have demonstrated impressive performance in code generation tasks under idealized conditions, where task descriptions are clear and precise. However, in practice, task descriptions frequently exhibit ambiguity, incompleteness, or internal contradictions. In this paper, we present the first empirical study examining the robustness of state-of-the-art code generation models when faced with such unclear task descriptions. We extend the HumanEval and MBPP benchmarks by systematically introducing realistic task description flaws through guided mutation strategies, producing a dataset that mirrors the messiness of informal developer instructions. We evaluate multiple LLMs of varying sizes and architectures, analyzing their functional correctness and failure modes across task description categories. Our findings reveal that even minor imperfections in task description phrasing can cause significant performance degradation, with contradictory task descriptions resulting in numerous logical errors. Moreover, while larger models tend to be more resilient than smaller variants, they are not immune to the challenges posed by unclear requirements. We further analyze semantic error patterns and identify correlations between description clarity, model behavior, and error types. Our results underscore the critical need for developing LLMs that are not only powerful but also robust to the imperfections inherent in natural user tasks, highlighting important considerations for improving model training strategies, designing more realistic evaluation benchmarks, and ensuring reliable deployment in practical software development environments.

大型语言模型(LLMS)在理想化条件下,任务描述明确而精确,在理想化条件下,在代码生成任务方面表现出令人印象深刻的业绩,但在实践中,任务描述往往表现出模糊、不完全或内部矛盾;在本文件中,我们提出了第一份经验性研究,在面对如此不明确的任务描述时,审查最先进的代码生成模型的稳健性;我们通过有指导的突变战略,系统地引入符合现实的任务描述基准和MBP基准,从而扩大人文Val和MBP基准,通过指导的突变战略,系统引入现实的任务描述缺陷,产生一套反映非正式开发者指示混乱的数据集;我们评估了不同大小和结构的多个LMS,分析了任务描述的功能正确性和失败模式;我们的调查结果显示,任务描述的细微不全,即使是任务描述的不全,任务描述也可能导致显著的绩效退化,任务描述导致许多逻辑错误;此外,虽然较大的模型往往比较小的变式要更具有弹性,但不能避免因不明确的要求而产生的挑战;我们进一步分析语义错误模式模式,并确定描述的清晰性、模式行为和错误类型之间的相互关系;我们的结果突出表明,开发模型的关键需要改进模型,这种模型不仅需要更有力,而且更可靠地强调内在的软件环境的内在的精确的制定重要的发展基准。


Article 90

Title@2025-07-27 (7): Testing Is Not Boring: Characterizing Challenge in Software Testing Tasks

Title: Testing Is Not Boring: Characterizing Challenge in Software Testing Tasks Testen ist nicht langweilig: Charakterisierende Herausforderung bei Software-Testaufgaben 测试并非无足轻重:在软件测试任务中突出挑战 2507.20407v1

Authors (4): Davi Gama Hardman, Cesar França, Brody Stuart-Verner, Ronnie de Souza Santos

As software systems continue to grow in complexity, testing has become a fundamental part of ensuring the quality and reliability of software products. Yet, software testing is still often perceived, both in industry and academia, as a repetitive, low-skill activity. This perception fails to recognize the creativity, problem-solving, and adaptability required in testing work. Tasks such as designing complex test cases, automating testing processes, and handling shifting requirements illustrate the challenges testing professionals regularly face. To better understand these experiences, we conducted a study with software testing professionals to explore the nature of challenging tasks in software testing and how they affect these professionals. Our findings show that tasks involving creativity, ongoing learning, and time pressure are often seen as motivating and rewarding. On the other hand, a lack of challenge or overwhelming demands can lead to frustration and disengagement. These findings demonstrate the importance of balancing task complexity to sustain motivation and present software testing as a dynamic and intellectually engaging field.

随着软件系统的复杂性不断增强,测试已成为确保软件产品质量和可靠性的一个基本组成部分,然而,在工业和学术界,软件测试仍经常被视为一种重复的低技能活动,这种感觉没有认识到测试工作所需的创造性、解决问题和适应性。设计复杂的测试案例、测试过程自动化和处理变化要求等任务说明了测试专业人员经常面临的挑战。为了更好地了解这些经验,我们与软件测试专业人员进行了一项研究,以探讨软件测试中具有挑战性的任务的性质及其如何影响这些专业人员。我们的调查结果显示,涉及创造性、持续学习和时间压力的任务往往被视为激励和奖励。另一方面,缺乏挑战或压倒性需求可能导致挫折和脱离。这些调查结果表明,平衡任务复杂性对于保持动力和将软件测试作为动态和有知识参与的领域的重要性。


Article 91

Title@2025-07-27 (7): How to Save My Gas Fees: Understanding and Detecting Real-world Gas Issues in Solidity Programs

Title: How to Save My Gas Fees: Understanding and Detecting Real-world Gas Issues in Solidity Programs Wie ich meine Gasgebühren erspare: Verstehen und Erkennen realer Gasprobleme in Soliditätsprogrammen 如何节省我的煤气费:了解和检测实世天然气在实实在在方案中的问题 2403.02661v2

Authors (7): Mengting He, Shihao Xia, Boqin Qin, Nobuko Yoshida, Tingting Yu, Yiying Zhang, Linhai Song

The execution of smart contracts on Ethereum, a public blockchain system, incurs a fee, called the gas fee, for its computation and data storage. When programmers develop smart contracts (e.g., in the Solidity programming language), they may unknowingly write code snippets that incur unnecessarily high gas fees. These issues, which we call gas wastes, can lead to significant monetary losses for users. This paper takes the initiative in helping Ethereum users reduce their gas fees in two key steps. First, we conduct an empirical study on gas wastes in open-source Solidity programs and Ethereum transaction traces. Second, to validate our study findings, we develop a static tool called PeCatch to effectively detect gas wastes in Solidity programs, and manually examine the Solidity compiler’s code to pinpoint implementation errors causing gas wastes. Overall, we offer 11 insights and four suggestions, which can foster future tool development and programmer awareness, and fixing our detected bugs can save $0.76 million in gas fees daily.

在公共区块链系统以太坊(Ethereum)上执行智能合约,需要为其计算和数据存储支付称为gas费的费用。程序员在开发智能合约(例如使用Solidity编程语言)时,可能会在不知情的情况下写出不必要地增加gas费的代码片段。这些问题(我们称之为gas浪费)可能给用户造成重大的金钱损失。本文率先通过两个关键步骤帮助以太坊用户降低gas费。首先,我们对开源Solidity程序和以太坊交易记录中的gas浪费进行了实证研究。其次,为验证研究结果,我们开发了名为PeCatch的静态工具,以有效检测Solidity程序中的gas浪费,并人工检查了Solidity编译器的代码,以定位导致gas浪费的实现错误。总体而言,我们得出11项见解和4条建议,可促进未来的工具开发并提高程序员的认识;修复我们检测到的缺陷每天可节省76万美元的gas费。


Article 92

Title@2025-07-27 (7): CIgrate: Automating CI Service Migration with Large Language Models

Title: CIgrate: Automating CI Service Migration with Large Language Models CIgrate: Automatisierung der CI-Service-Migration mit großen Sprachmodellen cigrate: 以大语言模式实现CI服务迁移自动化 2507.20402v1

Authors (2): Md Nazmul Hossain, Taher A. Ghaleb

Continuous Integration (CI) configurations often need to be migrated between services (e.g., Travis CI to GitHub Actions) as projects evolve, due to changes in service capabilities, usage limits, or service deprecation. Previous studies reported that migration across CI services is a recurring need in open-source development. However, manual migration can be time-consuming and error-prone. The state-of-the-art approach, CIMig, addresses this challenge by analyzing past migration examples to create service-specific rules and produce equivalent configurations across CI services. However, its relatively low accuracy raises concerns about the overall feasibility of automated CI migration using rule-based techniques alone. Meanwhile, Large Language Models (LLMs) have demonstrated strong capabilities in code generation and transformation tasks, suggesting potential to improve the automation, usability, and generalizability of CI configuration migration. This registered report presents a study in which we aim to assess whether CI migration can be improved using LLMs. To this end, we propose CIgrate, an LLM-based framework for automatically migrating CI configurations. We plan to evaluate the performance of CIgrate compared to CIMig as a baseline, in different setups (a) zero-shot/few-shot prompting of LLMs for configuration migration and (b) fine-tuning an LLM on a dataset of already established CI service migrations. We will also seek developer feedback on the quality and usability of the generated configurations. We formulate research questions focusing on the accuracy of LLM-generated migrations versus ground truth and the output of CIMig. The expected contributions include the first LLM-powered approach for CI service migration, a comparative evaluation of its effectiveness compared to rule-based approaches, and insight into leveraging LLMs to support software configuration evolution.

随着项目的演进,由于服务能力、使用限制的变化或服务被弃用,持续集成(CI)配置常常需要在不同服务之间迁移(例如从Travis CI迁移到GitHub Actions)。已有研究表明,跨CI服务的迁移是开源开发中反复出现的需求。然而,手动迁移既耗时又容易出错。目前最先进的方法CIMig通过分析以往的迁移实例来创建特定于服务的规则,并在不同CI服务之间生成等价的配置,从而应对这一挑战。但其准确率相对较低,令人担忧仅依靠基于规则的技术实现自动化CI迁移的整体可行性。与此同时,大语言模型(LLMs)在代码生成和转换任务中表现出强大的能力,有望提升CI配置迁移的自动化程度、可用性和通用性。本注册报告介绍了一项研究,旨在评估能否利用LLMs改进CI迁移。为此,我们提出CIgrate,这是一个基于LLM的CI配置自动迁移框架。我们计划以CIMig为基线,在不同设置下评估CIgrate的性能:(a) 以零样本/少样本提示LLMs进行配置迁移;(b) 在已有CI服务迁移的数据集上微调LLM。我们还将征求开发者对所生成配置的质量和可用性的反馈。我们提出的研究问题关注LLM生成的迁移相对于真实结果(ground truth)和CIMig输出的准确性。预期贡献包括首个由LLM驱动的CI服务迁移方法、其与基于规则方法的有效性对比评估,以及关于利用LLMs支持软件配置演化的见解。


Article 93

Title@2025-07-27 (7): Software Fairness Testing in Practice

Title: Software Fairness Testing in Practice Software Fairness-Tests in der Praxis 实践中软件公平测试 2506.17095v2

Authors (4): Ronnie de Souza Santos, Matheus de Morais Leca, Reydne Santos, Cleyton Magalhaes

Software testing ensures that a system functions correctly, meets specified requirements, and maintains high quality. As artificial intelligence and machine learning (ML) technologies become integral to software systems, testing has evolved to address their unique complexities. A critical advancement in this space is fairness testing, which identifies and mitigates biases in AI applications to promote ethical and equitable outcomes. Despite extensive academic research on fairness testing, including test input generation, test oracle identification, and component testing, practical adoption remains limited. Industry practitioners often lack clear guidelines and effective tools to integrate fairness testing into real-world AI development. This study investigates how software professionals test AI-powered systems for fairness through interviews with 22 practitioners working on AI and ML projects. Our findings highlight a significant gap between theoretical fairness concepts and industry practice. While fairness definitions continue to evolve, they remain difficult for practitioners to interpret and apply. The absence of industry-aligned fairness testing tools further complicates adoption, necessitating research into practical, accessible solutions. Key challenges include data quality and diversity, time constraints, defining effective metrics, and ensuring model interoperability. These insights emphasize the need to bridge academic advancements with actionable strategies and tools, enabling practitioners to systematically address fairness in AI systems.

由于人工智能和机器学习技术已成为软件系统的组成部分,因此测试已经发展到能够解决其独特复杂性的地步。这一空间的一个关键进步是公平测试,它确定并减少AI应用中的偏见,以促进道德和公平结果。尽管对公平测试进行了广泛的学术研究,包括测试投入生成、测试或触角识别和组成部分测试,但实际采用仍然有限。工业从业人员往往缺乏明确的指南和有效工具,无法将公平测试纳入现实世界的AI开发。这项研究调查了软件专业人员如何通过与从事AI和ML项目的22名从业人员的访谈,测试AI驱动的系统,以实现公平。我们的调查结果突出表明理论公平概念与行业实践之间的巨大差距。虽然公平定义在继续演变,但实践者仍然难以解释和应用。缺乏与行业一致的公平测试工具使采用更加复杂,需要研究实际的、可获取的解决办法。关键的挑战包括数据质量和多样性、时间限制、确定有效的衡量标准以及确保模式的互操作性。这些见解强调需要将学术进步与可操作的战略和工具联系起来,使从业人员能够系统地处理AI系统中的公平问题。


Article 94

Title@2025-07-27 (7): BOOP: Write Right Code

Title: BOOP: Write Right Code BOOP: Schreiben Sie den richtigen Code 写权利法典 2507.22085v1

Authors (2): Vaani Goenka, Aalok D. Thakkar

Novice programmers frequently adopt a syntax-specific and test-case-driven approach, writing code first and adjusting until programs compile and test cases pass, rather than developing correct solutions through systematic reasoning. AI coding tools exacerbate this challenge by providing syntactically correct but conceptually flawed solutions. In this paper, we introduce BOOP (Blueprint, Operations, OCaml, Proof), a structured framework requiring four mandatory phases: formal specification, language-agnostic algorithm development, implementation, and correctness proof. This shifts focus from “making code work” to understanding why code is correct. BOOP was implemented at our institution using a VS Code extension and preprocessor that enforces constraints and identifies counterproductive patterns. Initial evaluation shows improved algorithmic reasoning and reduced trial-and-error debugging. Students reported better edge case understanding and problem decomposition, though some initially found the format verbose. Instructors observed stronger foundational skills compared to traditional approaches.

新编程员经常采用特定语法和试验个案驱动的方法,首先写代码,然后在程序汇编和测试案件通过之前进行调整,而不是通过系统推理制定正确的解决办法。AI编码工具通过提供综合正确但概念上有缺陷的解决办法,加剧了这一挑战。在本文件中,我们引入了BOP(蓝图、操作、OCaml、证据),这是一个结构化框架,要求四个强制性阶段:正式规格、语言-不可知算法开发、实施和正确性证明。这把重点从“制定代码工作”转向理解为什么代码是正确的。在我们的机构,使用VS编码扩展和预处理器实施BOP,以强制执行限制和识别反生产模式。初步评估显示,逻辑推理有所改善,减少了试验和机能解调。学生们报告说,边际案例理解和问题分解,虽然有些最初发现了格式动。教官们发现,与传统方法相比,基础技能更强。


Article 95

Title@2025-07-27 (7): Beyond Binary Moderation: Identifying Fine-Grained Sexist and Misogynistic Behavior on GitHub with Large Language Models

Title: Beyond Binary Moderation: Identifying Fine-Grained Sexist and Misogynistic Behavior on GitHub with Large Language Models Beyond Binary Moderation: Fine-Grained Sexist und Misogynistic Behavior auf GitHub mit großen Sprachmodellen identifizieren 超越二进制温度: 识别高语言模式的吉特胡布人中精美的、有性别色彩的和有偏见的行为 2507.20358v1

Authors (3): Tanni Dev, Sayma Sultana, Amiangshu Bosu

Background: Sexist and misogynistic behavior significantly hinders inclusion in technical communities like GitHub, causing developers, especially minorities, to leave due to subtle biases and microaggressions. Current moderation tools primarily rely on keyword filtering or binary classifiers, limiting their ability to detect nuanced harm effectively. Aims: This study introduces a fine-grained, multi-class classification framework that leverages instruction-tuned Large Language Models (LLMs) to identify twelve distinct categories of sexist and misogynistic comments on GitHub. Method: We utilized an instruction-tuned LLM-based framework with systematic prompt refinement across 20 iterations, evaluated on 1,440 labeled GitHub comments across twelve sexism/misogyny categories. Model performances were rigorously compared using precision, recall, F1-score, and the Matthews Correlation Coefficient (MCC). Results: Our optimized approach (GPT-4o with Prompt 19) achieved an MCC of 0.501, significantly outperforming baseline approaches. While this model had low false positives, it struggled to interpret nuanced, context-dependent sexism and misogyny reliably. Conclusion: Well-designed prompts with clear definitions and structured outputs significantly improve the accuracy and interpretability of sexism detection, enabling precise and practical moderation on developer platforms like GitHub.

背景:性别歧视和厌女行为严重阻碍了GitHub等技术社区的包容性,导致开发者(尤其是少数群体)因隐性偏见和微歧视而离开。现有的审核工具主要依赖关键词过滤或二分类器,限制了其有效检测细微伤害的能力。目标:本研究提出一个细粒度的多类别分类框架,利用指令微调的大语言模型(LLMs)识别GitHub上十二种不同类别的性别歧视和厌女评论。方法:我们采用基于指令微调LLM的框架,经过20轮迭代系统地改进提示词,并在涵盖十二个性别歧视/厌女类别的1,440条已标注GitHub评论上进行评估。我们使用精确率、召回率、F1分数和马修斯相关系数(MCC)对模型性能进行了严格比较。结果:我们优化后的方法(GPT-4o配合Prompt 19)取得了0.501的MCC,显著优于基线方法。虽然该模型的误报率较低,但仍难以可靠地理解细微的、依赖上下文的性别歧视和厌女内容。结论:设计良好、定义清晰且输出结构化的提示词能显著提升性别歧视检测的准确性和可解释性,从而在GitHub等开发者平台上实现精确而实用的内容审核。
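
For readers unfamiliar with the Matthews Correlation Coefficient reported above, the following sketch computes its standard multi-class generalization directly from label lists. The function name `multiclass_mcc` and the sample labels are made up for illustration; the study's own evaluation code is not shown in the abstract.

```python
import math
from collections import Counter


def multiclass_mcc(y_true, y_pred):
    """Multi-class MCC from true and predicted labels.

    Uses the usual generalization:
    (c*s - sum_k p_k*t_k) / sqrt((s^2 - sum_k p_k^2) * (s^2 - sum_k t_k^2)),
    where s is the sample count, c the number of correct predictions,
    t_k the true count and p_k the predicted count of class k.
    """
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)

    numerator = c * s - sum(true_counts[k] * pred_counts[k] for k in labels)
    spread_true = s * s - sum(v * v for v in true_counts.values())
    spread_pred = s * s - sum(v * v for v in pred_counts.values())
    denominator = math.sqrt(spread_true * spread_pred)
    # Convention: return 0.0 when the score is undefined (e.g. a single class).
    return numerator / denominator if denominator else 0.0


# Made-up labels for three hypothetical comment categories.
y_true = ["none", "stereotype", "insult", "none", "insult", "stereotype"]
y_pred = ["none", "stereotype", "none", "none", "insult", "insult"]
print(round(multiclass_mcc(y_true, y_pred), 3))
```

Unlike accuracy, the score stays near zero for a classifier that ignores the input, even when class frequencies are heavily skewed, which is why it suits a twelve-category task with rare classes.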


Article 96

Title@2025-07-27 (7): Strategic Motivators for Ethical AI System Development: An Empirical and Holistic Model

Title: Strategic Motivators for Ethical AI System Development: An Empirical and Holistic Model Strategische Motivatoren für ethische KI-Systementwicklung: Ein empirisches und ganzheitliches Modell 道德与伦理合作系统发展战略动力器:经验和整体模式 2507.20218v1

Authors (5): Muhammad Azeem Akbar, Arif Ali Khan, Saima Rafi, Damian Kedziora, Sami Hyrynsalmi

Artificial Intelligence (AI) presents transformative opportunities for industries and society, but its responsible development is essential to prevent unintended consequences. Ethically sound AI systems demand strategic planning, strong governance, and an understanding of the key drivers that promote responsible practices. This study aims to identify and prioritize the motivators that drive the ethical development of AI systems. A Multivocal Literature Review (MLR) and a questionnaire-based survey were conducted to capture current practices in ethical AI. We applied Interpretive Structure Modeling (ISM) to explore the relationships between motivator categories, followed by MICMAC analysis to classify them by their driving and dependence power. Fuzzy TOPSIS was used to rank these motivators by importance. Twenty key motivators were identified and grouped into eight categories: Human Resource, Knowledge Integration, Coordination, Project Administration, Standards, Technology Factor, Stakeholders, and Strategy & Matrices. ISM results showed that ‘Human Resource’ and ‘Coordination’ heavily influence other factors. MICMAC analysis placed categories like Human Resource (CA1), Coordination (CA3), Stakeholders (CA7), and Strategy & Matrices (CA8) in the independent cluster, indicating high driving but low dependence power. Fuzzy TOPSIS ranked motivators such as promoting team diversity, establishing AI governance bodies, appointing oversight leaders, and ensuring data privacy as most critical. To support ethical AI adoption, organizations should align their strategies with these motivators and integrate them into their policies, governance models, and development frameworks.

人工智能(AI)为产业和社会提供了变革机会,但其负责任的发展对于防止意外后果至关重要。健全的人工智能系统要求战略规划、强有力的治理和理解促进负责任做法的关键驱动因素。这项研究旨在确定驱动个体智能系统道德发展的驱动因素并对其进行优先排序。开展了多语言文学审查(MLR)和基于问卷的调查,以了解道德自主的当前做法。我们应用了解释结构模型(ISM)来探索运动者类别之间的关系,随后是MICMAC的分析,以其驱动力和依赖力对其进行分类。Fuzzy TOPSIS被用于将这些动力动力按重要性排列。确定了20个主要驱动因素并将其分为8个类别:人力资源、知识整合、协调、项目管理、标准、技术因素、利益攸关方和战略与矩阵。IMIMO结果显示,“人力资源”和“协调”对其他因素影响极大。MICMAC的分析将人力资源(CI)、协调(CA3)、利益攸关方(CA7)、战略与FOTIS(F & Marizze)等类别组织整合其道德领导地位,并表明其领导地位(CAAA),这些领导机构是其道德领导。
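
The ranking step can be illustrated with the classical (crisp) TOPSIS procedure. Note the substitution: the study uses Fuzzy TOPSIS, which works on fuzzy numbers derived from linguistic ratings, whereas this sketch uses plain numbers to keep the mechanics visible. The scores, weights, criteria, and motivator labels are invented for the example, and the sketch assumes no criterion column is all zeros and that the alternatives are not all identical.

```python
import math


def topsis(matrix, weights, benefit):
    """Return a closeness score in [0, 1] for each alternative (row) of the matrix.

    benefit[j] is True when larger values of criterion j are better,
    False when criterion j is a cost.
    """
    n_criteria = len(weights)
    # Vector-normalize each criterion column, then apply the criterion weight.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) for j in range(n_criteria)]
    weighted = [
        [weights[j] * row[j] / norms[j] for j in range(n_criteria)] for row in matrix
    ]
    columns = list(zip(*weighted))
    best = [max(col) if benefit[j] else min(col) for j, col in enumerate(columns)]
    worst = [min(col) if benefit[j] else max(col) for j, col in enumerate(columns)]

    scores = []
    for row in weighted:
        d_best = math.dist(row, best)    # distance to the ideal solution
        d_worst = math.dist(row, worst)  # distance to the anti-ideal solution
        scores.append(d_worst / (d_best + d_worst))
    return scores


# Hypothetical ratings: columns are impact, feasibility (benefits) and effort (cost).
alternatives = ["motivator A", "motivator B", "motivator C"]
matrix = [
    [9.0, 8.0, 2.0],
    [6.0, 5.0, 5.0],
    [3.0, 2.0, 9.0],
]
weights = [0.5, 0.3, 0.2]
benefit = [True, True, False]

scores = topsis(matrix, weights, benefit)
ranking = sorted(zip(alternatives, scores), key=lambda pair: pair[1], reverse=True)
for name, score in ranking:
    print(name, round(score, 3))
```

In the fuzzy variant each cell would be a fuzzy number (for example a triangular one) and the distances would be computed between fuzzy numbers, but the closeness-to-ideal ranking idea is the same.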


Article 97

Title@2025-07-27 (7): Testing Autonomous Driving Systems – What Really Matters and What Doesn’t

Title: Testing Autonomous Driving Systems – What Really Matters and What Doesn’t Autonome Fahrsysteme testen – Was wirklich zählt und was nicht 自动自动驾驶测试系统 – – 真正重要和不重要的东西 2507.13661v2

Authors (4): Changwen Li, Joseph Sifakis, Rongjie Yan, Jian Zhang

Despite extensive research, the landscape of autonomous driving system (ADS) testing remains fragmented, and there is currently no basis for an informed technical assessment of the importance and contribution of the current state of the art. This paper attempts to address this problem by exploring two complementary aspects. First, it proposes a framework for comparing existing test methods in terms of their intrinsic effectiveness and validity. It shows that many methods do not meet both of these requirements, either because they are based on criteria that do not allow for rapid, inexpensive, and comprehensive detection of failures, or because the degree of validity of the properties tested cannot be accurately estimated. In particular, it is shown that most critical test methods do not take into account the nominal operational capabilities of autopilots and generate scenarios that are impossible for the tested vehicles to handle, resulting in unjustified rejections. Second, the paper shows that test effectiveness and validity are highly dependent on how autopilots are designed: how they choose between different control policies to perform maneuvers, as well as on the reproducibility of the results. In fact, most test methods take for granted two principles, rationality and determinacy, that underlie traditional methods but do not generally apply to ADS. We maintain that the absence of rationality and determinacy significantly impairs the effectiveness and validity of test methods, and provide test results on eight open autopilots, most of which do not satisfy these properties, thereby illustrating this fact. We conclude that under the current state of the art, it is impossible to obtain strong enough guarantees for essential autopilot properties and recommend that autopilots be developed with a view to both rationality and determinacy.

尽管进行了广泛的研究,对自主驾驶系统(ADS)的测试仍然支离破碎,目前没有依据对目前先进状态的重要性和贡献进行知情的技术评估。本文件试图通过探讨两个互补方面来解决这一问题。首先,它提议了一个框架,以比较现有测试方法的内在有效性和有效性;它表明许多方法不符合这两项要求。要么因为它们所依据的标准不允许快速、廉价和全面地发现故障,要么因为它们所依据的标准不允许快速、廉价和全面检测失败,或者因为测试的特性的合理性程度无法准确估计。特别是,事实证明,大多数关键测试方法没有考虑到自动驾驶仪的表面操作能力,并产生了测试工具无法处理的情景,导致不合理的拒绝。第二,该文件表明测试的有效性和有效性在很大程度上取决于自动驾驶技术的设计:它们如何选择不同的控制政策来进行操作,以及结果的可贵度。事实上,大多数测试方法都以两种原则为基础,但通常不适用于ADS。我们坚持认为,对自动驾驶仪的表面操作能力而言,其可靠性和可靠性都无法被充分检验。我们坚持认为,在进行这种高度的检验时,在进行这种检验时,最能检验和最能性检验的方法是充分地证明,在进行这种检验的可靠性和最不具有决定性。我们能够证明这种检验。


Article 98

Title@2025-07-27 (7): Relating System Safety and Machine Learnt Model Performance

Title: Relating System Safety and Machine Learnt Model Performance Über Systemsicherheit und maschinenerfahrene Modellleistung 与系统安全和机器学习模型性能有关的系统安全和机器学习模型性能 2507.20135v1

Authors (1): Ganesh Pai

The prediction quality of machine learnt models, and the functionality they ultimately enable (e.g., object detection), are typically evaluated using a variety of quantitative metrics that are specified in the associated model performance requirements. When integrating such models into aeronautical applications, a top-down safety assessment process must influence both the model performance metrics selected and their acceptable range of values. Often, however, the relationship of system safety objectives to model performance requirements and the associated metrics is unclear. Using an example of an aircraft emergency braking system containing a machine learnt component (MLC) responsible for object detection and alerting, this paper first describes a simple abstraction of the required MLC behavior. Then, based on that abstraction, an initial method is given to derive the minimum safety-related performance requirements, the associated metrics, and their targets for both the MLC and its underlying deep neural network, such that they meet the quantitative safety objectives obtained from the safety assessment process. We give rationale as to why the proposed method should be considered valid, also clarifying the assumptions made, the constraints on applicability, and the implications for verification.

机械学习模型的预测质量及其最终促成的功能(例如物体探测)通常使用相关模型性能要求中具体规定的各种定量指标进行评估。当将这些模型纳入航空应用时,自上而下的安全评估过程必须既影响选定的模型性能指标,又影响其可接受的价值范围。然而,系统安全目标与模型性能要求和相关指标之间的关系往往不明确。以飞机紧急制动系统为例,其中含有一个机器性能部件,负责物体探测和警报,本文首先描述了所要求的刚解运行为的简单抽象性。随后,根据这一抽象性,给出了一种初步方法,以得出最低安全性能要求、相关指标及其对刚果解放运动及其深层神经网络的目标,从而达到从安全评估过程中获得的数量性安全目标。我们说明了为什么认为拟议方法是有效的,并澄清了对适用性的限制以及核查的影响。我们解释了为什么认为拟议方法是有效的,同时也澄清了所作的假设。


Article 99

Title@2025-07-27 (7): From Prompt to Pipeline: Large Language Models for Scientific Workflow Development in Bioinformatics

Title: From Prompt to Pipeline: Large Language Models for Scientific Workflow Development in Bioinformatics Von Prompt zur Pipeline: Große Sprachmodelle für die wissenschaftliche Workflow-Entwicklung in der Bioinformatik 从提示到管道:生物信息学科学工作流程发展大语言模式 2507.20122v1

Authors (2): Khairul Alam, Banani Roy

The increasing complexity of bioinformatics data analysis has made Scientific Workflow Systems (SWSs) like Galaxy and Nextflow essential for enabling scalable, reproducible, and automated workflows. However, creating and understanding these workflows remains challenging, particularly for domain experts without programming expertise. This study investigates whether modern Large Language Models (LLMs), GPT-4o, Gemini 2.5 Flash, and DeepSeek-V3, can support the generation of accurate, complete, and usable bioinformatics workflows, and examines which prompting strategies most effectively guide this process. We evaluate these models using diverse tasks such as SNP analysis, RNA-seq, DNA methylation, and data retrieval, spanning both graphical (Galaxy) and script-based (Nextflow) platforms. Expert reviewers assess the generated workflows against community-curated baselines from the Galaxy Training Network and nf-core repositories. The results show that Gemini 2.5 Flash excels in generating Galaxy workflows, while DeepSeek-V3 performs strongly in Nextflow. Prompting strategies significantly impact quality, with role-based and chain-of-thought prompts improving completeness and correctness. While GPT-4o benefits from structured inputs, DeepSeek-V3 offers rich technical detail, albeit with some verbosity. Overall, the findings highlight the potential of LLMs to lower the barrier for workflow development, improve reproducibility, and democratize access to computational tools in bioinformatics, especially when combined with thoughtful prompt engineering.

生物信息学数据分析日益复杂,使得银河和下游等科学工作流程系统(SWS)成为使可缩放、可复制和自动化工作流程得以扩展、可复制和自动化的流程必不可少的条件。然而,这些工作流程的创建和理解仍然具有挑战性,特别是对于没有编程专长的域专家而言。本研究调查现代大语言模型(LLM)、GPT-4o、Gemini 2.5 Flash和DeepSeepSeek-V3)能否支持生成准确、完整和可用的生物信息工作流程,并审查哪些最能推动这一进程的战略。我们利用SNPP分析、RNAeq、DNA甲基化和数据检索等不同任务对这些模型进行评估,这些任务涵盖图形(Gaxly)和基于脚本的(extfrlow)平台。专家审评员根据银河培训网和nf-Creat-V3 社区精密的基线评估所产生的工作流程。结果显示,Gemini 2.5闪光在生成银河工作流程方面优异,而DeSeek-V3在下一个流程中表现得力。我们战略的显著影响显著影响质量质量,特别是基于角色和链的深度和链结构结构的流程,同时改进整个结构的流程的流程的流程,同时改进了某些的流程的流程,改进了流程的流程的流程,改进了某些的流程,改进了基础和结构的准确性,改进。


Article 100

Title@2025-07-27 (7): Learning to Align Human Code Preferences

Title: Learning to Align Human Code Preferences Human Code-Vorlieben ausrichten lernen 学习调整人类法典的首选 2507.20109v1

Authors (4): Xin Yin, Chao Ni, Liushan Chen, Xiaohu Yang

Large Language Models (LLMs) have demonstrated remarkable potential in automating software development tasks. While recent advances leverage Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) to align models with human preferences, the optimal training strategy remains unclear across diverse code preference scenarios. This paper systematically investigates the roles of SFT and DPO in aligning LLMs with different code preferences. Through both theoretical analysis and empirical observation, we hypothesize that SFT excels in scenarios with objectively verifiable optimal solutions, while applying SFT followed by DPO (S&D) enables models to explore superior solutions in scenarios without objectively verifiable optimal solutions. Based on the analysis and experimental evidence, we propose Adaptive Preference Optimization (APO), a dynamic integration approach that adaptively amplifies preferred responses, suppresses dispreferred ones, and encourages exploration of potentially superior solutions during training. Extensive experiments across six representative code preference tasks validate our theoretical hypotheses and demonstrate that APO consistently matches or surpasses the performance of existing SFT and S&D strategies. Our work provides both theoretical foundations and practical guidance for selecting appropriate training strategies in different code preference alignment scenarios.

大型语言模型(LLMS)在软件开发任务自动化方面表现出了显著的潜力。虽然最近的进步利用了监督的Fin-Tinning(SFT)和直接优惠优化(DPO)使模型与人类偏好相一致,但最佳培训战略在不同的代码偏好情景中仍然不明确。本文系统地调查SFT和DPO在使LMS与不同的代码偏爱相一致方面的作用。通过理论分析和经验观察,我们假设SFT在有客观可核查的最佳解决方案的情景中具有优势,同时适用SFT(S&D)之后的DPO(S&D)使模型能够在没有客观可核查的最佳解决方案的情况下在情景中探索优异的解决方案。根据分析和实验证据,我们提出了适应性适应性适应性地强化了首选反应、抑制了偏好反应并鼓励在培训期间探索潜在优异解决方案的动态整合方法(APO)。在六个具有代表性的代码偏好任务中进行广泛的实验,证实了我们的理论假设,并表明APO一贯地匹配或超过现有的SFT和S&D战略的绩效。我们的工作为选择不同的代码调整情景的适当培训战略提供了理论基础和实际指导。
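
For readers unfamiliar with the DPO stage mentioned in the abstract, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is the generic objective only, not the paper's APO method, which the abstract does not specify. The function name `dpo_loss`, the default `beta=0.1`, and the log-probability values in the usage note are illustrative assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the trained policy and
    under a frozen reference model. beta is an assumed temperature.
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)), written in a numerically stable form.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))
```

With identical policy and reference log-probabilities the loss equals `log(2)`; it falls as the policy raises the preferred response relative to the dispreferred one, for example `dpo_loss(-1.0, -5.0, -2.0, -2.0)`.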


Article 101

Title@2025-07-27 (7): From First Use to Final Commit: Studying the Evolution of Multi-CI Service Adoption

Title: From First Use to Final Commit: Studying the Evolution of Multi-CI Service Adoption Vom ersten Einsatz bis zum endgültigen Commit: Studieren der Evolution der Multi-CI-Service-Adoption 从首次使用到最后提交:研究采用多种CI服务的发展演变 2507.20095v1

Authors (2): Nitika Chopra, Taher A. Ghaleb

Continuous Integration (CI) services, such as GitHub Actions and Travis CI, are widely adopted in open-source development to automate testing and deployment. Though existing research often examines individual services in isolation, it remains unclear how projects adopt and transition between multiple services over time. To understand how CI adoption is evolving across services, we present a preliminary study analyzing the historical CI adoption of 18,924 Java projects hosted on GitHub between January 2008 and December 2024, adopting at least one of eight CI services, namely Travis CI, AppVeyor, CircleCI, Azure Pipelines, GitHub Actions, Bitbucket, GitLab CI, and Cirrus CI. Specifically, we investigate: (1) how frequently CI services are co-adopted or replaced, and (2) how maintenance activity varies across different services. Our analysis shows that the use of multiple CI services within the same project is a recurring pattern observed in nearly one in five projects, often reflecting migration across CI services. Our study is among the first to examine multi-CI adoption in practice, offering new insights for future research and highlighting the need for strategies and tools to support service selection, coordination, and migration in evolving CI environments.

GitHub Actions和Travis CI等连续整合(CI)服务在开放源码开发中被广泛采用,以进行自动测试和部署。虽然现有的研究经常孤立地审查个别服务,但目前仍不清楚项目在一段时间内如何采用和过渡多种服务。为了了解CI的采用在服务之间如何演变,我们提出初步研究,分析2008年1月至2024年12月在GitHub托管的18 924 Java项目的历史采用情况,至少采用8个CI服务中的一个,即Travis CI、AppVeyor、CircleCI、Azure Pipelines、GitHub Action、Bitbucket、GitLab CI和Cirrus CI。具体地说,我们调查:(1) CI服务如何经常被共同采用或取代,以及(2) 维护活动在不同服务之间如何不同。我们的分析表明,在同一项目内使用多个CI服务是一种反复出现的模式,常常反映CI服务跨CI服务之间的迁移。我们的研究是首先研究实践中多CI的采用情况,为未来研究提供了新的见解,并强调需要支持服务选择、协调和不断演变中的CI环境的战略和工具。


Article 102

Title@2025-07-26 (6): The Effect of Pointer Analysis on Semantic Conflict Detection

Title: The Effect of Pointer Analysis on Semantic Conflict Detection Die Wirkung der Pointer-Analyse auf die semantische Konflikterkennung 指针分析对语义冲突探测的影响 2507.20081v1

Authors (5): Matheus Barbosa, Paulo Borba, Rodrigo Bonifácio, Victor Lira, Galileu Santos

Current merge tools don’t detect semantic conflicts, which occur when changes from different developers are textually integrated but semantically interfere with each other. Although researchers have proposed static analyses for detecting semantic conflicts, these analyses suffer from significant false positive rates. To understand whether such false positives could be reduced by using pointer analysis in the implementation of semantic conflict static analyses, we conduct an empirical study. We implement the same analysis with and without pointer analysis, run them on two datasets, observe how often they differ, and compare their accuracy and computational performance. Although pointer analysis is known to improve precision in static analysis, we find that its effect on semantic conflict detection can be drastic: we observe a significant reduction in timeouts and false positives, but also a significant increase in false negatives, with prohibitive drops in recall and F1-score. These results suggest that, in the context of semantic conflict detection, we should explore hybrid analysis techniques, combining aspects of both implementations we compare in our study.

目前的合并工具无法检测语义冲突, 不同开发者的变化在文字上是一体化的, 但却相互干扰。 虽然研究人员已经提出静态分析以探测语义冲突, 但这些分析存在巨大的假正率。 要了解在实施语义冲突静态分析时使用指针分析是否能减少这些假正数, 我们进行一项经验性研究。 我们进行同样的分析, 进行没有指针的分析, 在两个数据集上进行同样的分析, 观察它们的差异频率, 比较它们的精确度和计算性能。 尽管人们知道指示式分析可以提高静态分析的精确度, 但我们发现其对语义冲突探测的影响可能是巨大的: 我们观察到超时和假正数显著减少, 但也发现假正数也明显增加, 令人望而步的回溯和F1- 点下降。 这些结果表明, 在语义冲突探测方面, 我们应该探索混合分析技术, 将我们在研究中比较的两种执行方面结合起来。
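
The trade-off described in the abstract (fewer false positives but more false negatives, hence lower recall and F1) can be illustrated with a small calculation. This is a sketch only: the confusion counts below are hypothetical and are not taken from the paper.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from raw confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical counts, chosen only to show the direction of the effect.
without_pointer = precision_recall_f1(tp=40, fp=60, fn=10)
with_pointer = precision_recall_f1(tp=15, fp=5, fn=35)
```

In this made-up scenario precision rises from 0.40 to 0.75, while recall drops from 0.80 to 0.30 and F1 drops as well, which is the pattern the study reports qualitatively.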


Article 103

Title@2025-07-26 (6): Selective Prompt Anchoring for Code Generation

Title: Selective Prompt Anchoring for Code Generation Selektive Prompt-Ankerung für die Code-Generierung 面向代码生成的选择性提示锚定 2408.09121v6

Authors (2): Yuan Tian, Tianyi Zhang

Recent advances in large language models (LLMs) have transformed software development by automatically generating code from natural language. Yet challenges remain in generating fully correct code that aligns with user intent. Our study reveals that LLMs tend to pay less attention to user prompts as more code tokens are generated. We hypothesize that this attention dilution issue is an important reason for code generation errors. To mitigate this issue, we propose Selective Prompt Anchoring (SPA) to guide code LLMs to pay more attention to user intent when generating code. We evaluate SPA using six base LLMs across six benchmarks. Our results demonstrate that SPA enhances Pass@1 by up to 12.9%, consistently outperforming SOTA code generation methods in all settings. Our code is available at https://github.com/magic-YuanTian/Selective-Prompt-Anchoring.

大型语言模型(LLMs)最近的进展通过自动生成自然语言代码改变了软件开发。但在生成完全符合用户意图的完全正确的代码方面仍然存在挑战。 我们的研究显示,LLMs往往较少关注用户提示,因为生成了更多的代码符号。 我们假设这种关注稀释问题是代码生成错误的重要原因。 为了缓解这一问题,我们提议有选择的快速操作(SPA)来指导代码LLMs, 以指导代码生成时的用户意向。 我们用六个基准基底LMs来评估SPA。 我们的结果表明,SPA将Pass@1提升到12.9%,在所有环境中持续超过SOTA代码生成方法。 我们的代码可以在 https://github.com/magic-YuanTian/Elective-Prompt-Anchoring查阅 。
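
The Pass@1 metric cited in the abstract is commonly computed with the unbiased pass@k estimator. The sketch below shows that estimator. It assumes `n` samples are generated per problem, of which `c` pass the tests; the abstract does not state whether the paper uses this exact procedure.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: samples generated, c: samples that pass all tests, k: budget.
    """
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    # 1 - P(all k drawn samples fail)
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For `k=1` the estimator reduces to the fraction of passing samples, `c / n`; benchmark-level scores are the mean over problems.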


Article 104

Title@2025-07-26 (6): PDLogger: Automated Logging Framework for Practical Software Development

Title: PDLogger: Automated Logging Framework for Practical Software Development PDLogger: Automatisiertes Logging-Framework für die praktische Softwareentwicklung PD Logger:实用软件开发自动记录框架 2507.19951v1

Authors (5): Shengcheng Duan, Yihua Xu, Sheng Zhang, Shen Wang, Yue Duan

Logging is indispensable for maintaining the reliability and diagnosability of modern software, yet developers still struggle to decide where and how to log effectively. Existing automated logging techniques focus on isolated sub-tasks - predicting a single log position, level, or message - and therefore cannot produce complete, high-quality log statements that reflect real-world practice in which multiple logs often appear inside one method. They also neglect deeper semantic dependencies among methods and consider only a narrow set of candidate variables, leading to superficial or incomplete logs. In this paper, we present PDLogger, the first end-to-end log generation technique expressly designed for practical, multi-log scenarios. PDLogger operates in three phases. (1) Log position prediction: block-type-aware structured prompts guide a large language model (LLM) to suggest candidate positions across all control-flow blocks of a method. (2) Log generation: backward program slicing supplies precise inter-procedural control and data-dependency context, while an expanded variable extractor captures both member and external function expressions; the enriched prompt enables the LLM to emit a full log statement (position, level, message, variables). (3) Log refinement: level correction and context-sensitive deduplication prune false positives and redundant logs. We evaluate PDLogger on 3,113 log statements drawn from two widely used Java projects. Compared with the strongest prior systems, PDLogger improves log-position precision by 139.0 percent, F1 by 69.2 percent, level accuracy by 82.3 percent, variable precision by 131.8 percent, and message quality (BERTScore) by 65.7 percent. The framework consistently performs well with different mainstream LLMs, demonstrating robustness and generality. PDLogger’s implementation is available as open source to foster future research and adoption.

登录对于维护现代软件的可靠性和可测性是不可或缺的, 但是开发者仍然在努力决定如何有效登录。 现有的自动记录技术侧重于孤立的子任务 - 预测单一日志位置、 级别或信息 - 因此无法生成完整、 高质量的日志报表, 反映多种日志通常出现在一种方法中的现实世界做法。 它们也忽略了方法之间更深的语义依赖性, 并且只考虑一系列狭窄的候选变量, 导致浅色或不完整日志。 在本文中, 我们展示了PD Logger, 这是为实用、 多log 设想而专门设计的第一个端到端日志生成技术。 PDLoggers 运行三个阶段。 (1) 日志预测: 区块类型- 认知结构化的快速化日志导出一个大型语言模型( LLLM) , 用于在方法的所有控制流区块中显示候选人的位置。 (2) 日志生成: 向后程序分类提供精确的跨线控和数据依赖性环境, 通过成员或外部%函数表达方式扩大变量; 让LLM- 快速化的精度能够让LLM- 向高级、 高级变更精确化前的版本, 和高级演算进程, 进行系统, 以显示前演算进程 , 进行双向前演算。


Article 105

Title@2025-07-26 (6): Prometheus: Unified Knowledge Graphs for Issue Resolution in Multilingual Codebases

Title: Prometheus: Unified Knowledge Graphs for Issue Resolution in Multilingual Codebases Prometheus: Unified Knowledge Graphs for Issue Resolution in Multilingual Codebases Prometheus:多语言代码库解决问题的统一知识图 2507.19942v1

Authors (7): Zimin Chen, Yue Pan, Siyu Lu, Jiayi Xu, Claire Le Goues, Martin Monperrus, He Ye

Language model (LM) agents, such as SWE-agent and OpenHands, have made progress toward automated issue resolution. However, existing approaches are often limited to Python-only issues and rely on pre-constructed containers in SWE-bench with reproduced issues, which restricts their applicability to real-world, multi-language repositories. We present Prometheus, designed to resolve real-world issues beyond benchmark settings. Prometheus is a multi-agent system that transforms an entire code repository into a unified knowledge graph to guide context retrieval for issue resolution. Prometheus encodes files, abstract syntax trees, and natural language text into a graph of typed nodes and five general edge types to support multiple programming languages. Prometheus uses Neo4j for graph persistence, enabling scalable and structured reasoning over large codebases. Integrated with the DeepSeek-V3 model, Prometheus resolves 28.67% and 13.7% of issues on SWE-bench Lite and SWE-bench Multilingual, respectively, with an average API cost of $0.23 and $0.38 per issue. Prometheus resolves 10 unique issues not addressed by prior work and is the first to demonstrate effectiveness across seven programming languages. Moreover, it shows the ability to resolve real-world GitHub issues in the LangChain and OpenHands repositories. We have open-sourced Prometheus at: https://github.com/Pantheon-temple/Prometheus

语言模型(LM)代理商,如SWE-代理商和OpenHands等语言模型(LM)代理商,在自动解决问题方面取得了进展。然而,现有方法往往局限于只处理Python问题,并依赖SWE-Bench中含有复制问题的预构集装箱,将其适用范围限制在现实世界和多语言储存库的工作上。我们介绍了Prometheus,目的是解决超出基准设置范围以外的现实世界问题。Prometheus是一个多试办系统,将整个代码存储库转换成一个统一的知识图表,以指导问题解决方案的背景检索。Prometheus 将文件、抽象的同步树和自然语言文本编译成一个打印式节点图和五种一般边缘类型以支持多种程序语言。Prometheus使用Neo4j作为图的持久性,使大代码库的可缩放和结构推理推理。Prometheus解决了SWE-nch Lite和SWE-Bench多语的28.67%的开放式问题,我们第一次在预版节节节节节节点中展示了预选/GLOLE38的平均读能力。


Article 106

Title@2025-07-26 (6): The Impact of Fine-tuning Large Language Models on Automated Program Repair

Title: The Impact of Fine-tuning Large Language Models on Automated Program Repair Die Auswirkungen von Feinabstimmungen großer Sprachmodelle auf die automatisierte Programmreparatur 微调大语言模型对自动方案维修的影响 2507.19909v1

Authors (4): Roman Macháček, Anastasiia Grishina, Max Hort, Leon Moonen

Automated Program Repair (APR) uses various tools and techniques to help developers achieve functional and error-free code faster. In recent years, Large Language Models (LLMs) have gained popularity as components in APR tool chains because of their performance and flexibility. However, training such models requires a significant amount of resources. Fine-tuning techniques have been developed to adapt pre-trained LLMs to specific tasks, such as APR, and enhance their performance at far lower computational costs than training from scratch. In this study, we empirically investigate the impact of various fine-tuning techniques on the performance of LLMs used for APR. Our experiments provide insights into the performance of a selection of state-of-the-art LLMs pre-trained on code. The evaluation is done on three popular APR benchmarks (i.e., QuixBugs, Defects4J and HumanEval-Java) and considers six different LLMs with varying parameter sizes (resp. CodeGen, CodeT5, StarCoder, DeepSeekCoder, Bloom, and CodeLlama-2). We consider three training regimens: no fine-tuning, full fine-tuning, and parameter-efficient fine-tuning (PEFT) using LoRA and IA3. We observe that full fine-tuning techniques decrease the benchmarking performance of various models due to different data distributions and overfitting. By using parameter-efficient fine-tuning methods, we restrict models in the amount of trainable parameters and achieve better results. Keywords: large language models, automated program repair, parameter-efficient fine-tuning, AI4Code, AI4SE, ML4SE.

自动程序修复(APR)使用各种工具和技术,帮助开发者更快地获得功能正确且无错误的代码。近年来,大语言模型(LLM)因其性能和灵活性,作为 APR 工具链的组成部分而日益流行。然而,训练此类模型需要大量资源。微调技术的出现使预训练的 LLM 能够适应 APR 等特定任务,并以远低于从零训练的计算成本提升其性能。在本研究中,我们实证考察了各种微调技术对用于 APR 的 LLM 性能的影响。我们的实验揭示了一组在代码上预训练的最先进 LLM 的性能。评估在三个常用的 APR 基准(即 QuixBugs、Defects4J 和 HumanEval-Java)上进行,并考虑了六个参数规模不同的 LLM(分别为 CodeGen、CodeT5、StarCoder、DeepSeekCoder、Bloom 和 CodeLlama-2)。我们考虑三种训练方案:不微调、全量微调,以及使用 LoRA 和 IA3 的参数高效微调(PEFT)。我们观察到,由于数据分布不同和过拟合,全量微调会降低多种模型的基准性能。通过使用参数高效微调方法,我们限制了模型的可训练参数量,并取得了更好的结果。关键词:大语言模型、自动程序修复、参数高效微调、AI4Code、AI4SE、ML4SE。
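
To make concrete how parameter-efficient fine-tuning restricts the number of trainable parameters, here is a minimal sketch of the arithmetic for LoRA on a single weight matrix. LoRA keeps the original `d_out x d_in` matrix frozen and trains two low-rank factors of shapes `d_out x r` and `r x d_in`. The layer size 4096 and rank 8 are illustrative assumptions, not settings reported in the paper.

```python
def full_finetune_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is updated."""
    return d_out * d_in

def lora_trainable_params(d_out, d_in, rank):
    """Trainable parameters of the two LoRA factors B (d_out x r) and A (r x d_in)."""
    return rank * (d_out + d_in)

# Assumed example layer: a 4096 x 4096 projection with LoRA rank 8.
full = full_finetune_params(4096, 4096)
lora = lora_trainable_params(4096, 4096, rank=8)
fraction = lora / full  # share of the layer's parameters that LoRA trains
```

For this assumed layer LoRA trains 65,536 parameters instead of 16,777,216, i.e. under 0.4% of the layer.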


Article 107

Title@2025-07-26 (6): CrossPL: Evaluating Large Language Models on Cross Programming Language Code Generation

Title: CrossPL: Evaluating Large Language Models on Cross Programming Language Code Generation CrossPL: Bewertung großer Sprachmodelle bei der Erzeugung von Cross Programming Language Code 交叉语言:评估关于编制跨方案拟订语言代码的大型语言模式 2507.19904v1

Authors (5): Zhanhang Xiong, Dongxia Wang, Yuekang Li, Xinyuan An, Wenhai Wang

As large language models (LLMs) become increasingly embedded in software engineering workflows, a critical capability remains underexplored: generating correct code that enables cross-programming-language (CPL) interoperability. This skill is essential for building complex systems that integrate components written in multiple languages via mechanisms like inter-process communication (IPC). To bridge this gap, we present CrossPL, the first benchmark designed to systematically evaluate LLMs’ ability to generate CPL-interoperating code. CrossPL comprises 1,982 tasks centered around IPC, covering six widely-used programming languages and seven representative CPL techniques. We construct this benchmark by (i) analyzing 19,169 multi-language GitHub repositories using 156 hand-crafted finite state machines (FSMs), and (ii) developing an LLM-based pipeline that automatically extracts CPL code snippets, generates task instructions, and validates functional correctness. We evaluate 14 state-of-the-art general-purpose LLMs and 6 code-oriented LLMs released in the past three years on CrossPL via FSM-based validation. Results reveal that even the best-performing models struggle with CPL scenarios, underscoring the need for more targeted research in this space. Our benchmark and code are available at: https://anonymous.4open.science/r/crosspl-2814.

随着大型语言模型(LLMS)越来越多地嵌入软件工程工作流程中,一种关键能力仍未得到充分探索:生成正确的代码,使跨方案语言(CPL)互操作性能够互操作性;这一技能对于通过流程间通信等机制建立综合以多种语言编写的组件的复杂系统至关重要;为了缩小这一差距,我们提出了CrossPL,这是旨在系统评价LLMs生成CPL-操作代码的能力的第一个基准;CrossPL,由1,982项任务组成,以IPC为中心,涵盖六种广泛使用的编程语言和七种有代表性的CPL技术;我们通过(一) 分析19,169种多语言的GitHub库,使用156台手工制作的限定国家机器(FSMS)来构建这一基准;以及(二) 开发一个基于LMM的管道,自动提取CPL代码片断,生成任务指令,并验证功能正确性。我们评价了14个州级通用通用LMs和6个以代码为主的LMSMs,在过去三年中通过FSM校准。


Article 108

Title@2025-07-26 (6): AgentMesh: A Cooperative Multi-Agent Generative AI Framework for Software Development Automation

Title: AgentMesh: A Cooperative Multi-Agent Generative AI Framework for Software Development Automation AgentMesh: Ein kooperativer Multi-Agent Generativer KI-Rahmen für Software-Entwicklungsautomatisierung AgentMesh:软件开发自动化多生化合作框架 2507.19902v1

Authors (1): Sourena Khanzadeh

Software development is a complex, multi-phase process traditionally requiring collaboration among individuals with diverse expertise. We propose AgentMesh, a Python-based framework that uses multiple cooperating LLM-powered agents to automate software development tasks. In AgentMesh, specialized agents - a Planner, Coder, Debugger, and Reviewer - work in concert to transform a high-level requirement into fully realized code. The Planner agent first decomposes user requests into concrete subtasks; the Coder agent implements each subtask in code; the Debugger agent tests and fixes the code; and the Reviewer agent validates the final output for correctness and quality. We describe the architecture and design of these agents and their communication, and provide implementation details including prompt strategies and workflow orchestration. A case study illustrates AgentMesh handling a non-trivial development request via sequential task planning, code generation, iterative debugging, and final code review. We discuss how dividing responsibilities among cooperative agents leverages the strengths of large language models while mitigating single-agent limitations. Finally, we examine current limitations - such as error propagation and context scaling - and outline future work toward more robust, scalable multi-agent AI systems for software engineering automation.

软件开发是一个复杂、多阶段的过程,传统上需要具有不同专长的个人相互协作。我们建议一个基于Python的框架,即AgentMesh,这个基于Python的框架,使用多种合作的LLM动力代理商将软件开发任务自动化。在AgentMesh, 专门代理商―― Planner、Comer、Debugger和Reviewer - 协同工作,将高层次要求转化为完全实现的代码。Planner代理商首先将用户的要求分解为具体的子任务;代码代理商执行每个子任务代码;调试调器测试和修正代码;审查代理商验证最终产出的正确性和质量。我们描述这些代理商的结构和设计及其通信,并提供实施细节,包括快速战略和工作流程协调。一项案例研究表明,AgentMesh代理商通过连续任务规划、代码生成、迭接调调和最终代码审查处理非边际开发请求。我们讨论了合作代理商之间如何在减少单一代理的局限性的同时利用大型语言模型的优势。我们审视当前的局限性,例如错误传播和背景自动缩放的多功能系统。
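
The Planner, Coder, Debugger, Reviewer hand-off described in the abstract can be sketched as a simple sequential pipeline. This is not AgentMesh's actual API: `run_pipeline` and all four stub agents below are hypothetical stand-ins (plain functions instead of LLM calls), used only to show the control flow.

```python
from typing import Callable, Dict, List

def run_pipeline(request: str,
                 planner: Callable[[str], List[str]],
                 coder: Callable[[str], str],
                 debugger: Callable[[str], str],
                 reviewer: Callable[[str], bool]) -> Dict[str, object]:
    """Hypothetical orchestration: plan, then code and debug each subtask, then review."""
    subtasks = planner(request)            # Planner: request -> concrete subtasks
    pieces = []
    for subtask in subtasks:
        draft = coder(subtask)             # Coder: subtask -> code
        pieces.append(debugger(draft))     # Debugger: test and fix the code
    final_code = "\n".join(pieces)
    approved = reviewer(final_code)        # Reviewer: validate the final output
    return {"subtasks": subtasks, "code": final_code, "approved": approved}

# Stub agents (assumptions for illustration; real agents would call an LLM).
stub_planner = lambda req: [part.strip() for part in req.split(" and ")]
stub_coder = lambda subtask: f"# TODO: {subtask}"
stub_debugger = lambda code: code.replace("TODO", "DONE")
stub_reviewer = lambda code: "TODO" not in code

result = run_pipeline("parse input and write report",
                      stub_planner, stub_coder, stub_debugger, stub_reviewer)
```

Passing the agents in as callables keeps the orchestration independent of any particular model or prompt strategy, which are details the abstract leaves to the paper.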


Article 109

Title@2025-07-26 (6): MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation

Title: MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation MultiKernelBench: Ein Multi-Platform Benchmark für die Kernel-Generation 多KenneelBench: 核心生成的多平台基准 2507.17773v2

Authors (6): Zhongzhen Wen, Yinghui Zhang, Zhong Li, Zhongxin Liu, Linna Xie, Tian Zhang

The automatic generation of deep learning (DL) kernels using large language models (LLMs) has emerged as a promising approach to reduce the manual effort and hardware-specific expertise required for writing high-performance operator implementations. However, existing benchmarks for evaluating LLMs in this domain suffer from limited hardware support, coarse-grained kernel categorization, and imbalanced task coverage. To address these limitations, we introduce MultiKernelBench, the first comprehensive, multi-platform benchmark for LLM-based DL kernel generation. MultiKernelBench spans 285 tasks across 14 well-defined kernel categories and supports three major hardware platforms: Nvidia GPUs, Huawei NPUs, and Google TPUs. To enable future extensibility, we design a modular backend abstraction layer that decouples platform-specific logic from the core benchmarking infrastructure, allowing easy integration of new hardware platforms. We further propose a simple yet effective category-aware one-shot prompting method that improves generation quality by providing in-category exemplars. Through systematic evaluations of seven state-of-the-art LLMs, we reveal significant variation in task difficulty, poor generalization to platforms with less training exposure, and the effectiveness of targeted prompting strategies. MultiKernelBench is publicly available at https://github.com/wzzll123/MultiKernelBench.

利用大型语言模型(LLMs)自动生成深层次学习(DL)核心的自动生成(DL)核心,这已成为一种很有希望的方法,可以减少手工努力和硬件专长,以编写高性能操作者执行工作所需的高性能操作软件。然而,目前对这一领域中LLM的评估基准缺乏硬件支持、粗微的内核分类和任务覆盖不平衡。为克服这些限制,我们引入了多环邦奇,这是以LLLM为主的DLL内核生成的第一个全面、多平台基准。多环贝尼奇跨越了14个明确界定的内核类别的285项任务,支持了三大硬件平台:Nvidia GPUs、Huawei NPUs和Google TPUs。为了能够在未来的扩展性,我们设计了一个模块后端抽象层,将平台特定平台的逻辑与核心基准基础设施脱钩,便于新硬件平台的整合。我们进一步提出一个简单而有效的类别认知/125点快速提示方法,通过提供分类外壳类的外壳,提高代质量质量。通过对7个州GLLM号的公开平台进行系统的系统评估,在公共风险上进行显著的变换。


Article 110

Title@2025-07-26 (6): A Cooperative Approach for Knowledge-based Business Process Design in a Public Authority

Title: A Cooperative Approach for Knowledge-based Business Process Design in a Public Authority Ein kooperativer Ansatz für wissensbasiertes Business Process Design in einer öffentlichen Behörde 公共当局知识商业程序设计合作办法 2507.19842v1

Authors (4): Mohammad Azarijafari, Luisa Mich, Michele Missikoff, Oleg Missikoff

Enterprises are undergoing profound changes driven by a digital transformation that can no longer be postponed. To remain competitive, they must adapt their organisational structures and operations. This organisational shift is also important for small and medium-sized enterprises. A key innovation frontier is the adoption of process-oriented production models. This paper presents a knowledge-based method to support business experts in designing business processes. The method requires no prior expertise in Knowledge Engineering and guides designers through a structured sequence of steps to produce a diagrammatic workflow of the target process. The construction of the knowledge base starts from simple, text-based knowledge artefacts and then progresses towards more structured, formal representations. The method is conceived to give all stakeholders and actors who participate in the business process (BP) design a shared approach.

企业目前由于不可推卸的数字转换而正在经历深刻的变革。然后,为了保持竞争力,企业必须调整其组织结构和业务。这种组织转变对中小企业也很重要。一个关键的创新领域是采用以程序为导向的生产模式。本文件介绍了一种知识型方法,用以支持商业专家设计业务流程。这种方法不需要事先在知识工程方面有专门知识,并且通过一系列步骤来引导设计者为目标进程绘制图表工作流程。知识库的建设始于简单的、基于文字的、知识工艺,然后逐步走向结构化的、正式的表述。这一方法的构想是允许所有参与业务流程设计的所有利害关系方和行为者采用共同的方法。


Article 111

Title@2025-07-26 (6): From Few-Label to Zero-Label: An Approach for Cross-System Log-Based Anomaly Detection with Meta-Learning

Title: From Few-Label to Zero-Label: An Approach for Cross-System Log-Based Anomaly Detection with Meta-Learning Von wenigem zum Null-Label: Ein Ansatz für systemübergreifende Log-basierte Anomalienerkennung mit Meta-Learning 从少标点到零标点:跨系统、基于日志的反常检测和元学习的方法 2507.19806v1

Authors (6): Xinlong Zhao, Tong Jia, Minghua He, Yihan Wu, Ying Li, Gang Huang

Log anomaly detection plays a critical role in ensuring the stability and reliability of software systems. However, existing approaches rely on large amounts of labeled log data, which poses significant challenges in real-world applications. To address this issue, cross-system transfer has been identified as a key research direction. State-of-the-art cross-system approaches achieve promising performance with only a few labels from the target system. However, their reliance on labeled target logs makes them susceptible to the cold-start problem when labeled logs are insufficient. To overcome this limitation, we explore a novel yet underexplored setting: zero-label cross-system log anomaly detection, where the target system logs are entirely unlabeled. To this end, we propose FreeLog, a system-agnostic representation meta-learning method that eliminates the need for labeled target system logs, enabling cross-system log anomaly detection under zero-label conditions. Experimental results on three public log datasets demonstrate that FreeLog achieves performance comparable to state-of-the-art methods that rely on a small amount of labeled data from the target system.

日志异常现象的检测在确保软件系统的稳定性和可靠性方面发挥着关键作用。然而,现有方法依赖大量贴标签的日志数据,这给真实世界的应用带来了重大挑战。为了解决这一问题,将跨系统的传输确定为关键的研究方向。最新技术的跨系统方法仅能从目标系统上贴上几个标签,就能取得有希望的绩效。然而,它们依赖标签的目标日志,在标签的日志不足时,就很容易发生冷启动问题。为了克服这一限制,我们探索了一个新的、但探索不足的设置:在目标系统日志完全没有标签的情况下,零标签跨系统的日志异常现象检测。为此,我们提议采用系统无标签的代谢元学习方法FreeLog,即系统无标签系统日志的必要性,使在零标签条件下能够检测跨系统日志异常现象。三个公共日志数据集的实验结果显示,FreeLog的性能与依靠目标系统少量标签数据的最先进方法相比。


Article 112

Title@2025-07-26 (6): Automated Synthesis of Formally Verified Multi-Abstraction Function Summaries

Title: Automated Synthesis of Formally Verified Multi-Abstraction Function Summaries Automatisierte Synthese von formal verifizierten Multi-Abstraktions-Funktionszusammenfassungen 正式核证的多种吸管功能摘要自动综合 2506.09550v3

Authors (8): Fanpeng Yang, Xu Ma, Shuling Wang, Xiong Xu, Qinxiang Cao, Naijun Zhan, Xiaofeng Li, Bin Gu

Function summaries, which characterize the behavior of code segments (typically functions) through preconditions and postconditions, are essential for understanding, reusing, and verifying software, particularly in safety-critical domains like aerospace embedded systems. However, mission-critical legacy code, although a valuable reusable asset, often lacks formal specifications. Automatically generating function summaries for C programs is challenging because of complex features such as loops, nested function calls, and pointer aliasing. Moreover, function summaries should support multiple abstraction levels to meet diverse requirements, e.g., precise summaries capturing full functionality for formal verification and intuitive summaries for human understanding. To address these challenges, we first propose a novel framework that combines symbolic execution, large language models (LLMs), and formal verification to generate Relatively Strongest Postconditions (RSPs) and build function summaries that fully capture program behavior. Our approach leverages VST-A’s symbolic execution to precisely track program execution paths and state transitions, employs LLMs to infer loop invariants based on predefined templates, and uses Frama-C to guarantee soundness of generated summaries in an iterative refinement loop. Furthermore, from the generated RSPs, we automatically synthesize the strongest non-redundant postconditions expressed in a given domain-specific language. We compare our approach with existing work through extensive experiments.

功能摘要是代码部分(典型功能)通过先决条件和先决条件的行为的特点,对于理解、重新使用和核查软件至关重要,特别是在航空航天嵌入系统等安全关键领域;然而,这些关键任务遗留代码作为宝贵的再利用资产往往缺乏正式规格;由于存在环形、嵌套功能电话、指针别名等复杂特征,难以自动生成C程序功能摘要;此外,功能摘要应支持多种抽象级别,以满足各种要求,例如精确的摘要,为正式核查和直观摘要收集全面功能,以促进人类理解。为了应对这些挑战,我们首先提议一个新框架,将象征性执行、大型语言模型(LLLMS)和正式核查结合起来,以产生相对强大的附加条件,并建立能够充分反映方案行为的功能摘要。我们的方法利用VST-A象征性执行来精确跟踪程序执行路径和状态过渡,利用LLMMS来根据预先定义的模板推断变异性,并利用Frama-C来为人类理解的直观摘要。为了应对这些挑战,我们首先提出一个新的框架框架框架框架框架框架框架,以便保证我们通过不至最强烈的实地分析,通过现有版本进行我们制作的实地分析。


Article 113

Title@2025-07-26 (6): NeuSemSlice: Towards Effective DNN Model Maintenance via Neuron-level Semantic Slicing

Title: NeuSemSlice: Towards Effective DNN Model Maintenance via Neuron-level Semantic Slicing NeuSemSlice: Auf dem Weg zu einer effektiven DNN-Modellpflege über Semantisches Schneiden auf Neuron-Ebene NeusSemelice:通过中程语义剪切实现有效的 DNN 模型维护 2407.20281v2

Authors (7): Shide Zhou, Tianlin Li, Yihao Huang, Ling Shi, Kailong Wang, Yang Liu, Haoyu Wang

Deep Neural networks (DNNs), extensively applied across diverse disciplines, are characterized by their integrated and monolithic architectures, setting them apart from conventional software systems. This architectural difference introduces particular challenges to maintenance tasks, such as model restructure (e.g., model compression), re-adaptation (e.g., fitting new samples), and incremental development (e.g., continual knowledge accumulation). Prior research addresses these challenges by identifying task-critical neuron layers, and dividing neural networks into semantically-similar sequential modules. However, such layer-level approaches fail to precisely identify and manipulate neuron-level semantic components, restricting their applicability to finer-grained model maintenance tasks. In this work, we implement NeuSemSlice, a novel framework that introduces the semantic slicing technique to effectively identify critical neuron-level semantic components in DNN models for semantic-aware model maintenance tasks. Specifically, semantic slicing identifies, categorizes and merges critical neurons across different categories and layers according to their semantic similarity, enabling their flexibility and effectiveness in the subsequent tasks. For semantic-aware model maintenance tasks, we provide a series of novel strategies based on semantic slicing to enhance NeuSemSlice. They include semantic components (i.e., critical neurons) preservation for model restructure, critical neuron tuning for model re-adaptation, and non-critical neuron training for model incremental development. A thorough evaluation has demonstrated that NeuSemSlice significantly outperforms baselines in all three tasks.

广泛应用于不同学科的深神经网络(DNNS)广泛应用于不同学科,其特点是其集成和单一结构,将其与常规软件系统分开。这种建筑差异给维护任务带来了特殊的挑战,如模型重组(例如模型压缩)、再适应(例如安装新样本)和渐进发展(例如持续知识积累)等。先前的研究通过确定任务-关键神经层,将神经网络分为相似的顺序模块来应对这些挑战。然而,这种层级方法未能准确识别和操作神经级的神经级语义组成部分,将这些组成部分局限于精细的神经级模型维护任务。在这项工作中,我们实施了NeusemSImlical,这是一个新的框架,引入了语义化技术,以有效确定DNNNE模型中关键的神经级语义组成部分,用于语义-认知模型维护任务。具体地,语义分类和将各种类别和层的关键神经显示与其语系相似,使其具有灵活性和有效性,用于精细的神经级模型,用于随后的神经级发展任务。Semlical Slimemliction 3 任务中, 提供大幅的缩缩化任务。


Article 114

Title@2025-07-26 (6): Defining ethically sourced code generation

Title: Defining ethically sourced code generation Definition der ethisch bedingten Codegenerierung 界定道德源代码生成 2507.19743v1

Authors (4): Zhuolin Xu, Chenglin Li, Qiushi Li, Shin Hwei Tan

Several code generation models have been proposed to help reduce time and effort in solving software-related tasks. To ensure responsible AI, there is growing interest in various ethical issues (e.g., unclear licensing, privacy, fairness, and environmental impact). These studies share the overarching goal of ensuring ethically sourced generation, which has gained growing attention in speech synthesis and image generation. In this paper, we introduce the novel notion of Ethically Sourced Code Generation (ES-CodeGen) to refer to managing all processes involved in code generation model development, from data collection to post-deployment, via ethical and sustainable practices. To build a taxonomy of ES-CodeGen, we perform a two-phase literature review in which we read 803 papers across various domains and specific to AI-based code generation. We identified 71 relevant papers with 10 initial dimensions of ES-CodeGen. To refine our dimensions and gain insights into the consequences of ES-CodeGen, we surveyed 32 practitioners, including six developers who submitted GitHub issues to opt out of the Stack dataset (these impacted users have real-world experience of ethical sourcing issues in code generation models). The results lead to 11 dimensions of ES-CodeGen, with a new dimension on code quality added because practitioners noted its importance. We also identified consequences, artifacts, and stages relevant to ES-CodeGen. Our post-survey reflection showed that most practitioners tend to ignore social-related dimensions despite their importance. Most practitioners either agreed or strongly agreed that our survey helped improve their understanding of ES-CodeGen. Our study calls for attention to the various ethical issues involved in moving towards ES-CodeGen.

为确保负责任的大赦国际,人们对各种伦理问题(例如许可证、隐私、公平和环境影响不明确)的兴趣日益浓厚。这些研究的首要目标是确保道德来源的一代,这在语音合成和图像生成方面引起了越来越多的关注。我们在本文件中提出了《道德源代码生成》的新概念(ES-CodeGen),以提及管理从数据收集到部署后代码模型开发的所有进程。为了建立ES-CodeGen的分类学,我们进行了两阶段的文献审查,我们阅读了不同领域的803篇论文,具体针对基于AI的代码生成者。我们确定了71份相关文件,其中含有ES-CodeGen的10个初始层面。为了改进我们对ES-CodeGen的后果的深度和深入了解,我们调查了32名从业人员,其中包括6名提交GitHub问题供选择的Stack数据集(这些受影响的用户对ES-C的道德层面有真实世界经验,对ESC的伦理层面有一定的重要性,而我们已明确了他们认为其道德层面的道德层面,我们在代码生成模型中也显示出了最有意义的结果。


Article 115

Title@2025-07-26 (6): Clean Code In Practice: Challenges and Opportunities

Title: Clean Code In Practice: Challenges and Opportunities Sauberer Code in der Praxis: Herausforderungen und Chancen 《清洁守则》实践:挑战与机遇 2507.19721v1

Authors (5): Dapeng Yan, Wenjie Yang, Kui Liu, Zhiming Liu, Zhikuang Cai

Reliability prediction is crucial for ensuring the safety and security of software systems, especially in the context of industry practices. While various metrics and measurements are employed to assess software reliability, the complexity of modern systems necessitates a deeper understanding of how these metrics interact with security and safety concerns. This paper explores the interplay between software reliability, safety, and security, offering a comprehensive analysis of key metrics and measurement techniques used in the industry for reliability prediction. We identify critical threats to software reliability and provide a threat estimation framework that incorporates both safety and security aspects. Our findings suggest that integrating reliability metrics with safety and security considerations can enhance the robustness of software systems. Furthermore, we propose a set of actionable guidelines for practitioners to improve their reliability prediction models while simultaneously addressing the security and safety challenges of contemporary software applications.

可靠性预测对于确保软件系统的安全和安保至关重要,特别是在行业做法方面。虽然采用各种计量和计量方法评估软件可靠性,但现代系统的复杂性要求更深入地了解这些计量方法如何与安保和安全关切相互作用。本文件探讨了软件可靠性、安全和安保之间的相互作用,全面分析了该行业用于可靠性预测的关键计量标准和测量技术。我们查明了软件可靠性面临的重大威胁,并提供了一个既包括安全和安保方面的威胁估计框架。我们的调查结果表明,将可靠性计量方法与安全和安保考虑结合起来,可以提高软件系统的可靠性。此外,我们提出了一套可供操作人员使用的可操作准则,以改进其可靠性预测模型,同时应对当代软件应用的安保和安全挑战。


Article 116

Title@2025-07-25 (5): Refactoring $\neq$ Bug-Inducing: Improving Defect Prediction with Code Change Tactics Analysis

Title: Refactoring $\neq$ Bug-Inducing: Improving Defect Prediction with Code Change Tactics Analysis Refactoring $\neq$ Bug-Inducing: Verbesserung der Fehlervorhersage mit Code Change Tactics Analyse 重构 $\neq$ 臭虫诱导:用代码改变策略分析改进缺陷预测 2507.19714v1

Authors (8): Feifei Niu, Junqian Shao, Christoph Mayr-Dorn, Liguo Huang, Wesley K. G. Assunção, Chuanyi Li, Jidong Ge, Alexander Egyed

Just-in-time defect prediction (JIT-DP) aims to predict the likelihood of code changes resulting in software defects at an early stage. Although code change metrics and semantic features have enhanced prediction accuracy, prior research has largely ignored code refactoring during both the evaluation and methodology phases, despite its prevalence. Refactoring and its propagation often tangle with bug-fixing and bug-inducing changes within the same commit and statement. Neglecting refactoring can introduce bias into the learning and evaluation of JIT-DP models. To address this gap, we investigate the impact of refactoring and its propagation on six state-of-the-art JIT-DP approaches. We propose Code chAnge Tactics (CAT) analysis to categorize code refactoring and its propagation, which improves labeling accuracy in the JIT-Defects4J dataset by 13.7%. Our experiments reveal that failing to consider refactoring information in the dataset can diminish the performance of models, particularly semantic-based models, by 18.6% and 37.3% in F1-score. Additionally, we propose integrating refactoring information to enhance six baseline approaches, resulting in overall improvements in recall and F1-score, with increases of up to 43.2% and 32.5%, respectively. Our research underscores the importance of incorporating refactoring information in the methodology and evaluation of JIT-DP. Furthermore, our CAT has broad applicability in analyzing refactoring and its propagation for software maintenance.

刚到时的缺陷预测(JIT-DP)旨在预测在早期出现软件缺陷的代码变化的可能性。虽然代码变化指标和语义特征提高了预测准确性,但先前的研究尽管普遍,却基本上忽视了评估和方法阶段的代码再构件,尽管其流行程度在评估和方法阶段和阶段中基本上忽略了代码再构件。重新构件和传播往往与同一承诺和语句中的错误修正和诱导变化相交织在一起。忽略重新构件可能会在JIT-DP模型的学习和评价中引入偏差。为了弥补这一差距,我们调查了重新构件及其传播对JIT-DP六种先进方法的影响。我们建议CAnge 战术(CAT)分析对代码再构件及其传播的代码再构件进行分类,从而将JIT-Defects4J数据集的准确性标注13.7 %。我们的实验表明,不考虑重新构件软件中的重新构件信息,特别是以语义为基础的模型的性能,将18.6%和37.3%的传播方式分别纳入我们的F1的基数中的基本方法。我们提议将JIT的升级和再将它的精确方法纳入六种和重新纳入我们的精确方法。


Article 117

Title@2025-07-25 (5): LastMerge: A language-agnostic structured tool for code integration

Title: LastMerge: A language-agnostic structured tool for code integration LastMerge: Ein sprach-agnostisches strukturiertes Tool für die Codeintegration LastMerge:一个语言不可知的代码集成结构化工具 2507.19687v1

Authors (3): Joao Pedro Duarte, Paulo Borba, Guilherme Cavalcanti

Unstructured line-based merge tools are widely used in practice. Structured AST-based merge tools show significantly improved merge accuracy, but are rarely used in practice because they are language-specific and costly, and are consequently unavailable for many programming languages. To improve merge accuracy for a wide range of languages, we propose LastMerge, a generic structured merge tool that can be configured through a thin interface that significantly reduces the effort of supporting structured merge. To understand the impact that generic structured merge might have on merge accuracy and performance, we run an experiment with four structured merge tools: two Java-specific tools, jDime and Spork, and their generic counterparts, LastMerge and Mergiraf, respectively. Using each tool, we replay merge scenarios from a significant dataset, and collect data on runtime, behavioral divergences, and merge accuracy. Our results show no evidence that generic structured merge significantly impacts merge accuracy. Although we observe a difference rate of approximately 10% between the Java-specific tools and their generic counterparts, most of the differences stem from implementation details and could be avoided. We find that LastMerge reports 15% fewer false positives than jDime, while Mergiraf yields 42% fewer false negatives than Spork. Both generic tools exhibit runtime performance comparable to the state-of-the-art language-specific implementations. These results suggest that generic structured merge tools can effectively replace language-specific ones, paving the way for broader adoption of structured merge in industry.

结构化的基于线的合并工具在实践中被广泛使用。结构化的AST合并工具显示,合并精度显著提高,但很少在实践中使用,因为它们是语言具体和昂贵的,因此许多编程语言无法使用。为了提高多种语言的合并精度,我们提议LastMerge,这是一个通用结构化合并工具,可以通过一个薄界面配置,大大降低支持结构合并的努力。为了理解通用结构化合并可能对合并精度和性能产生的影响,我们试验了四个结构化合并工具:两个爪哇特定工具,jDime和Spork,以及它们的通用对应工具,分别是LastMege和Mergiraf。我们利用每一个工具,从一个重要的数据集中重新播放合并假设情景,收集运行时的数据、行为差异和合并精度。我们的结果没有显示,通用结构化的合并对结构合并对结构合并作用会大大降低精度。虽然我们观察到了爪哇具体工具与其通用对应工具之间大约10%的差别来自执行细节,并且可以避免。我们发现,最后Mege报告15 %的假肯定的肯定性报告比JDime和Meal Greal Spal missimeal 演示工具的缩缩缩缩缩演示工具有效。


Article 118

Title@2025-07-25 (5): GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

Title: GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning GEPA: Reflektierende Prompt-Evolution kann Verstärkungs-Lernen übertreffen GEPA: 反思即时进化能够超过成绩的强化学习 2507.19457v1

Authors (17): Lakshya A Agrawal, Shangyin Tan, Dilara Soylu, Noah Ziems, Rishi Khare, Krista Opsahl-Ong, Arnav Singhvi, Herumb Shandilya, Michael J Ryan, Meng Jiang, Christopher Potts, Koushik Sen, Alexandros G. Dimakis, Ion Stoica, Dan Klein, Matei Zaharia, Omar Khattab

Large language models (LLMs) are increasingly adapted to downstream tasks via reinforcement learning (RL) methods like Group Relative Policy Optimization (GRPO), which often require thousands of rollouts to learn new tasks. We argue that the interpretable nature of language can often provide a much richer learning medium for LLMs, compared with policy gradients derived from sparse, scalar rewards. To test this, we introduce GEPA (Genetic-Pareto), a prompt optimizer that thoroughly incorporates natural language reflection to learn high-level rules from trial and error. Given any AI system containing one or more LLM prompts, GEPA samples system-level trajectories (e.g., reasoning, tool calls, and tool outputs) and reflects on them in natural language to diagnose problems, propose and test prompt updates, and combine complementary lessons from the Pareto frontier of its own attempts. As a result of GEPA’s design, it can often turn even just a few rollouts into a large quality gain. Across four tasks, GEPA outperforms GRPO by 10% on average and by up to 20%, while using up to 35x fewer rollouts. GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10% across two LLMs, and demonstrates promising results as an inference-time search strategy for code optimization.

大型语言模型(LLMS)日益通过强化学习(RL)方法(LLM)适应下游任务,如集体相对政策优化(GROP)方法(GRPO),这些方法往往需要数千个推出来学习新任务。我们争辩说,语言的可解释性往往能为LLM提供更丰富的学习媒介,而与分散的、彩礼奖励所产生的政策梯度相比,语言的可解释性往往能为LMS提供更丰富的学习媒介。为了测试这一点,我们引入了GEPA(Genetic-Pareto),这是一个迅速优化的系统,它能彻底结合自然语言的反思,从试验和错误中学习高级规则。鉴于任何包含一个或多个LLOM提示的AI系统,GEPA样本系统级的轨迹(例如推理、工具电话和工具输出)往往需要数千个系统级的样本,并且用自然语言来诊断问题,建议和测试快速更新,并将PARPRMS的边界经验结合起来。由于GEPA的设计,它往往只是将少量推出一个高质量收益。GMSBRMS,在10MSRMSRRRFSRRRRRFS值上迅速展示一个最短的结果。
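
The Pareto-frontier step mentioned in the GEPA abstract can be illustrated with a minimal sketch. This is not GEPA's implementation: the candidate names, the per-task scores, and the helper functions `dominates` and `pareto_front` are all hypothetical. The sketch only shows the general idea of keeping every candidate prompt that no other candidate beats on all tasks at once.

```python
from typing import Dict, List


def dominates(a: List[float], b: List[float]) -> bool:
    """True if `a` is at least as good as `b` on every task and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: Dict[str, List[float]]) -> List[str]:
    """Return the names of all candidates that are not dominated by any other candidate."""
    return sorted(
        name
        for name, own in scores.items()
        if not any(dominates(other, own) for o, other in scores.items() if o != name)
    )


# Hypothetical per-task scores for three candidate prompts (higher is better).
candidates = {
    "prompt_a": [0.9, 0.2, 0.5],
    "prompt_b": [0.4, 0.8, 0.5],
    "prompt_c": [0.3, 0.1, 0.4],  # worse than prompt_a on every task
}

front = pareto_front(candidates)
print(front)
```

Keeping the whole frontier, rather than only the candidate with the best average, preserves prompts that are strong on different tasks, so that their complementary lessons remain available for later combination.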


Article 119

Title@2025-07-25 (5): An OpenSource CI/CD Pipeline for Variant-Rich Software-Defined Vehicles

Title: An OpenSource CI/CD Pipeline for Variant-Rich Software-Defined Vehicles Eine OpenSource CI/CD Pipeline für Variant-Rich Software-definierte Fahrzeuge 变式Rich软件定型车辆的开源CI/CD管道 2507.19446v1

Authors (5): Matthias Weiß, Anish Navalgund, Johannes Stümpfle, Falk Dettinger, Michael Weyrich

Software-defined vehicles (SDVs) offer a wide range of connected functionalities, including enhanced driving behavior and fleet management. These features are continuously updated via over-the-air (OTA) mechanisms, resulting in a growing number of software versions and variants due to the diversity of vehicles, cloud/edge environments, and stakeholders involved. The lack of a unified integration environment further complicates development, as connected mobility solutions are often built in isolation. To ensure reliable operations across heterogeneous systems, a dynamic orchestration of functions that considers hardware and software variability is essential. This paper presents an open-source CI/CD pipeline tailored for SDVs. It automates the build, test, and deployment phases using a combination of containerized open-source tools, creating a standardized, portable, and scalable ecosystem accessible to all stakeholders. Additionally, a custom OTA middleware distributes software updates and supports rollbacks across vehicles and backend services. Update variants are derived based on deployment target dependencies and hardware configurations. The pipeline also supports continuous development and deployment of AI models for autonomous driving features. Its effectiveness is evaluated using an automated valet parking (AVP) scenario involving TurtleBots and a coordinating backend server. Two object detection variants are developed and deployed to match hardware-specific requirements. Results demonstrate seamless OTA updates, correct variant selection, and successful orchestration across all targets. Overall, the proposed pipeline provides a scalable and efficient solution for managing software variants and OTA updates in SDVs, contributing to the advancement of future mobility technologies.

软件定义的车辆(SDV)具有广泛的连通功能,包括加强驾驶行为和车队管理。这些功能通过空外机制不断更新,导致软件版本和变体越来越多,因为车辆的多样性、云层/尖端环境以及利益攸关方都可使用。缺乏统一的整合环境使发展更加复杂,因为连接的流动解决方案往往是孤立地建立的。为确保不同系统之间的可靠操作,动态调控考虑硬件和软件变异性的职能至关重要。本文介绍了为SDV定制的开放源的CI/CD管道。它利用集装箱化的开放源码工具组合来自动更新构建、测试和部署阶段。它创造了标准化、可移植和可扩展的生态系统。此外,定制的OTA中软件传播软件更新软件,支持车辆和后端服务的滚回。根据部署目标依赖和硬件配置,更新变量也支持自动驱动驱动驱动功能的AI模式的继续开发和部署。它的有效性是使用自动化的VAVDLO解决方案的升级、测试和升级后端版本的OVA、测试和升级的OTA的升级版本。它提供了所有运行的自动变式的升级和升级的自动变式。


Article 120

Title@2025-07-25 (5): Resolving Build Conflicts via Example-Based and Rule-Based Program Transformations

Title: Resolving Build Conflicts via Example-Based and Rule-Based Program Transformations Lösen von Build-Konflikten über examplebasierte und regelbasierte Programmtransformationen 通过基于实例和基于规则的方案转型解决冲突 2507.19432v1

Authors (4): Sheikh Shadab Towqir, Fei He, Todd Mytkowicz, Na Meng

Merge conflicts often arise when developers integrate changes from different software branches. The conflicts can result from overlapping edits in programs (i.e., textual conflicts) or cause build and test errors (i.e., build and test conflicts). They degrade software quality and hinder programmer productivity. While several tools detect build conflicts, few offer meaningful support for resolving cases like those caused by method removal. To overcome limitations of existing tools, we introduce BUCOR (Build Conflict Resolver), a new conflict resolver. BUCOR first detects conflicts by comparing three versions related to a merging scenario: base b, left l, and right r. To resolve conflicts, it employs two complementary strategies: example-based transformation (BUCOR-E) and rule-based transformation (BUCOR-R). BUCOR-R applies predefined rules to handle common, well-understood conflicts. BUCOR-E mines branch versions (l and r) for exemplar edits applied to fix related build errors. From these examples, it infers and generalizes program transformation patterns to resolve more complex conflicts. We evaluated BUCOR on 88 real-world build conflicts spanning 21 distinct conflict types. BUCOR generated at least one solution for 65 cases and correctly resolved 43 conflicts. We observed that this hybrid approach, which combines context-aware, example-based learning with structured, rule-based resolution, can effectively help resolve conflicts. Our research sheds light on future directions for more intelligent and automated merge tools.

当开发者整合不同软件分支的变化时,往往会出现合并冲突。冲突可能来自程序(即文字冲突)中的重叠编辑,或导致构建和测试错误(即,构建和测试冲突),这些冲突会降低软件质量,妨碍程序生产率。虽然一些工具可以发现冲突,但很少能提供有意义的支持来解决诸如方法清除造成的冲突等案件。为了克服现有工具的局限性,我们引入了BUCOR(Build Conflict Resolver),一个新的冲突解决者。BUCOR首先通过比较与合并方案有关的三种版本(基础b、左边l和右边r)来检测冲突。为了解决冲突,它采用两种互补战略:基于示例的转换(BUCOR-E)和基于规则的转换(BUCOR-R)。BUCOR-R采用预先确定的规则处理常见的冲突。BUCOR-E从分支版本(l和r)中挖掘用于修正相关构建错误的示例编辑,并据此推断和概括程序转换模式,用来解决更复杂的冲突。我们在88个真实世界的构建冲突上评估了BUCOR。
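
The three-way comparison of base, left, and right described in the BUCOR abstract can be sketched for the method-removal case. This is a simplified illustration under stated assumptions, not BUCOR's algorithm: the `Version` record and the `removed_method_conflicts` function are hypothetical, and each version is reduced to a set of defined method names and a set of called method names.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Version:
    """Hypothetical summary of one program version."""
    defined: FrozenSet[str]  # names of methods the version defines
    called: FrozenSet[str]   # names of methods the version calls


def removed_method_conflicts(
    base: Version, left: Version, right: Version
) -> List[Tuple[str, str]]:
    """Report (method, removing_branch) pairs where one branch deletes a
    method from base while the other branch adds a new call to it."""
    conflicts = []
    for remover, caller, label in ((left, right, "left"), (right, left, "right")):
        removed = base.defined - remover.defined
        new_calls = caller.called - base.called
        for method in sorted(removed & new_calls):
            conflicts.append((method, label))
    return conflicts


base = Version(frozenset({"foo", "bar"}), frozenset({"bar"}))
left = Version(frozenset({"bar"}), frozenset({"bar"}))                 # removes foo
right = Version(frozenset({"foo", "bar"}), frozenset({"bar", "foo"}))  # starts calling foo

conflicts = removed_method_conflicts(base, left, right)
print(conflicts)
```

Neither branch fails to build on its own here; the error only appears after merging, which is why all three versions have to be compared.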


Article 121

Title@2025-07-25 (5): CodeEvo: Interaction-Driven Synthesis of Code-centric Data through Hybrid and Iterative Feedback

Title: CodeEvo: Interaction-Driven Synthesis of Code-centric Data through Hybrid and Iterative Feedback CodeEvo: Interaktionsgetriebene Synthese codezentrierter Daten durch hybrides und iteratives Feedback 代码化:通过混合和循环反馈对以代码为中心的数据进行互动驱动合成 2507.22080v1

Authors (5): Qiushi Sun, Jinyang Gong, Lei Li, Qipeng Guo, Fei Yuan

Acquiring high-quality instruction-code pairs is essential for training Large Language Models (LLMs) for code generation. Manually curated data is expensive and inherently limited in scale, motivating the development of code-centric synthesis methods. Yet, current approaches either focus on augmenting existing code or rely on predefined heuristics, both lacking rigorous data validation, which results in synthetic data that is ungrounded, repetitive, or overly simplistic. Inspired by collaborative programming practices, we propose CodeEvo, a framework that synthesizes code data through iterative interactions between two LLM agents: a Coder, which generates candidate code and test cases based on given instructions, and a Reviewer, which guides the synthesis process by producing new instructions and feedback. We further introduce a hybrid feedback mechanism that combines compiler determinism with the generative flexibility of agents, enabling automatic quality control throughout synthesis. Extensive experiments demonstrate that models fine-tuned on CodeEvo data significantly outperform established baselines across code generation benchmarks with various difficulties. In-depth analyses further provide insights from multiple perspectives into effective code-centric data synthesis.

获得高质量的教学编码对培训用于代码生成的大型语言模型(LLMs)至关重要。手工整理的数据费用昂贵,且在规模上具有内在局限性,促进了以代码为中心的合成方法的发展。然而,目前的方法要么侧重于增加现有代码,要么依赖预先定义的超常性,两者缺乏严格的数据验证,从而导致合成数据缺乏依据、重复或过于简单。在协作编程做法的启发下,我们提议CodeEvo,这是一个通过两个LLM代理商之间的迭接互动综合代码数据的框架:一个代码师,根据给定的指示生成候选代码和测试案例,另一个审查员,通过制作新的指令和反馈指导合成过程。我们进一步引入一种混合反馈机制,将编译器与代理商的基因化灵活性结合起来,使整个合成过程中的自动质量控制得以实现。广泛的实验表明,在代码Evo数据上进行微调的模型大大超越了各种困难的代码生成基准的既定基线。深入分析从多种角度为有效的代码中心数据合成提供洞察力。


Article 122

Title@2025-07-25 (5): SDVDiag: A Modular Platform for the Diagnosis of Connected Vehicle Functions

Title: SDVDiag: A Modular Platform for the Diagnosis of Connected Vehicle Functions SDVDiag: Modulare Plattform für die Diagnose von vernetzten Fahrzeugfunktionen SDVDiag: 连接车辆功能诊断模块平台 2507.19403v1

Authors (3): Matthias Weiß, Falk Dettinger, Michael Weyrich

Connected and software-defined vehicles promise to offer a broad range of services and advanced functions to customers, aiming to increase passenger comfort and support autonomous driving capabilities. Due to the high reliability and availability requirements of connected vehicles, it is crucial to resolve any occurring failures quickly. To achieve this, however, a complex cloud/edge architecture with a mesh of dependencies must be navigated to diagnose the responsible root cause. As such, manual analyses become infeasible since they would significantly delay the troubleshooting. To address this challenge, this paper presents SDVDiag, an extensible platform for the automated diagnosis of connected vehicle functions. The platform enables the creation of pipelines that cover all steps from initial data collection to the tracing of potential root causes. In addition, SDVDiag supports self-adaptive behavior through the ability to exchange modules at runtime. Dependencies between functions are detected and continuously updated, resulting in a dynamic graph view of the system. In addition, vital system metrics are monitored for anomalies. Whenever an incident is investigated, a snapshot of the graph is taken and augmented by relevant anomalies. Finally, the analysis is performed by traversing the graph and creating a ranking of the most likely causes. To evaluate the platform, it is deployed inside a 5G test fleet environment for connected vehicle functions. The results show that injected faults can be detected reliably. As such, the platform offers the potential to gain new insights and reduce downtime by identifying problems and their causes at an early stage.

连接和软件定义的车辆有望向客户提供广泛的服务和高级功能,目的是增加乘客舒适度,支持自主驾驶能力。由于连接车辆的高度可靠性和可用性要求,因此迅速解决任何出现故障至关重要。然而,要做到这一点,必须引导一个复杂的云层/尖端结构,其中含有依赖性网格,以诊断负责任的根本原因。因此,人工分析变得不可行,因为它们会大大拖延故障排除。为了应对这一挑战,本文件提供了SDVDiag,这是一个可自动诊断相关车辆功能的可扩展平台。该平台使得能够创建管道,涵盖从初步数据收集到潜在根源追踪的所有步骤。此外,SDVDiag支持通过在运行时交换模块的能力进行自我适应的行为。检测和不断更新功能之间的依赖性,从而形成一个动态的系统图表视图。此外,对于异常现象进行监测的关键系统测量。每当对事件进行调查,即对图表进行简要分析,并辅之以相关的异常点。最后,分析是通过跨时间定位平台进行早期定位,从而显示车辆内部的定位结果。
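
The final analysis step in the SDVDiag abstract (traverse the dependency graph, then rank likely causes) can be sketched as follows. The scoring rule is an assumption made for illustration, not the platform's actual ranking: anomaly scores are divided by one plus the hop distance from the failing function. The function name `rank_root_causes`, the node names, and the anomaly values are all hypothetical.

```python
from collections import deque
from typing import Dict, List


def rank_root_causes(
    deps: Dict[str, List[str]], anomalies: Dict[str, float], start: str
) -> List[str]:
    """Breadth-first traversal from the failing function over its dependencies,
    then rank reachable nodes by anomaly score discounted by hop distance."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for dep in deps.get(node, []):
            if dep not in dist:
                dist[dep] = dist[node] + 1
                queue.append(dep)
    # Assumed scoring: anomaly / (1 + distance); nodes without anomalies are dropped.
    scored = [(anomalies.get(n, 0.0) / (1 + d), n) for n, d in dist.items()]
    return [n for s, n in sorted(scored, key=lambda t: (-t[0], t[1])) if s > 0]


# Hypothetical snapshot of a dependency graph augmented with anomaly scores.
deps = {
    "parking_fn": ["backend", "broker"],
    "backend": ["database"],
    "broker": [],
}
anomalies = {"database": 0.9, "broker": 0.4}

ranking = rank_root_causes(deps, anomalies, "parking_fn")
print(ranking)
```

Nodes that are unreachable from the failing function never enter the ranking, which keeps the candidate list focused on the part of the graph that could have affected it.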


Article 123

Title@2025-07-25 (5): ReCatcher: Towards LLMs Regression Testing for Code Generation

Title: ReCatcher: Towards LLMs Regression Testing for Code Generation ReCatcher: Auf dem Weg zu LLMs Regressionstests für die Codegenerierung ReCatcher: 转向为代码生成而进行回归测试的LLMs 2507.19390v1

Authors (4): Altaf Allah Abbassi, Leuson Da Silva, Amin Nikanjam, Foutse Khomh

Large Language Models (LLMs) for code generation evolve rapidly through fine-tuning, merging, or new model releases. However, such updates can introduce regressions, not only in correctness but also in code quality and performance. To address this, we present ReCatcher, a regression testing framework for Python code generation. ReCatcher systematically compares two LLMs, typically a current model and a candidate update, across three dimensions: logical correctness, static code quality, and execution performance. We apply ReCatcher to assess regressions across three update scenarios (fine-tuning, merging, and model release), using CodeLlama, DeepSeek-Coder, and GPT-4o. Our evaluation shows that fine-tuning with cross-language datasets increases syntax errors by up to 12%. Merging with general-purpose models like Llama2 leads to regressions in correctness by up to 18%. GPT-4o introduces regressions of up to 50% in handling missing imports compared to GPT-3.5-turbo, while GPT-4o-mini suffers up to 80% performance degradation in execution time versus GPT-4o. Overall, logical correctness, performance, and error handling (e.g., syntax errors and missing imports) are the most regression-prone areas. Compared with baseline solutions, ReCatcher shows better and more consistent accuracy across logical and performance aspects. ReCatcher highlights the importance of systematic regression evaluation before adopting new models, while assisting researchers and practitioners in making more informed update decisions.

用于代码生成的大型语言模型(LLMS)通过微调、合并或新模式发布迅速演变。 但是, 这样的更新可以引入回归, 不仅在正确性方面, 而且在代码质量和性能方面。 为了解决这个问题, 我们为 Python 代码生成提供了一个回归测试框架 。 ReCatcher 系统地比较了两个LMS, 典型的当前模式和候选更新, 包括三个层面: 逻辑正确性、 静态代码质量 和执行性能。 我们应用ReCatcher 评估三个更新情景、 微调、 合并和发布模型的回归性能, 使用 Codellama 、 Deep Seek- Coder 和 GPT-4o 和 GPT 4o 来评估回归。 我们的评估显示, 跨语言数据集的微调调会增加音效误差12 % 。 与Llama2 等通用模型合并后, 将准确性回归率提高到18 % 。 GPT-4o , 在采用新的逻辑性精确度决定之前, GPT-turbo 之前将回归到50 % 。 GPT- main- mini 。 , 在运行中, 处理中, 的精确性分析中, 和最准确性化的精确性能的精确性处理中, 。
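
The comparison of a current model against a candidate update can be sketched as a per-metric relative-change check. This is only an illustration of the idea, not ReCatcher's implementation: the function `find_regressions`, the metric names, the values, and the 5% tolerance are all hypothetical.

```python
from typing import Dict, Set


def find_regressions(
    current: Dict[str, float],
    candidate: Dict[str, float],
    lower_is_better: Set[str],
    tolerance: float = 0.05,
) -> Dict[str, float]:
    """Return metrics whose relative change worsens by more than `tolerance`."""
    regressions = {}
    for metric, old in current.items():
        if old == 0:
            continue  # relative change is undefined for a zero baseline
        change = (candidate[metric] - old) / old
        if metric in lower_is_better:
            worsened = change > tolerance
        else:
            worsened = change < -tolerance
        if worsened:
            regressions[metric] = round(change, 3)
    return regressions


# Hypothetical measurements, one per dimension: correctness, static quality, performance.
current = {"pass_rate": 0.80, "lint_clean_rate": 0.90, "median_exec_seconds": 1.0}
candidate = {"pass_rate": 0.82, "lint_clean_rate": 0.81, "median_exec_seconds": 1.5}

report = find_regressions(current, candidate, lower_is_better={"median_exec_seconds"})
print(report)
```

In this made-up example the candidate improves correctness slightly but regresses on static quality and execution time, the kind of trade-off that a correctness-only benchmark would hide.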


Article 124

Title@2025-07-25 (5): Mut4All: Fuzzing Compilers via LLM-Synthesized Mutators Learned from Bug Reports

Title: Mut4All: Fuzzing Compilers via LLM-Synthesized Mutators Learned from Bug Reports Mut4All: Fuzzing Compiler über LLM-Synthesized Mutators aus Fehlerberichten gelernt Mut4All: 通过 LLM 合成的从臭虫报告中学习的 Mut4All: 模糊的编译者 2507.19275v1

Authors (10): Bo Wang, Pengyang Wang, Chong Chen, Qi Sun, Jieke Shi, Chengran Yang, Ming Deng, Youfang Lin, Zhou Yang, David Lo

Mutation-based fuzzing is effective for uncovering compiler bugs, but designing high-quality mutators for modern languages with complex constructs (e.g., templates, macros) remains challenging. Existing methods rely heavily on manual design or human-in-the-loop correction, limiting scalability and cross-language generalizability. We present Mut4All, a fully automated, language-agnostic framework that synthesizes mutators using Large Language Models (LLMs) and compiler-specific knowledge from bug reports. It consists of three agents: (1) a mutator invention agent that identifies mutation targets and generates mutator metadata using compiler-related insights; (2) a mutator implementation synthesis agent, fine-tuned to produce initial implementations; and (3) a mutator refinement agent that verifies and corrects the mutators via unit-test feedback. Mut4All processes 1000 bug reports (500 Rust, 500 C++), yielding 319 Rust and 403 C++ mutators at ~$0.08 each via GPT-4o. Our customized fuzzer, using these mutators, finds 62 bugs in Rust compilers (38 new, 7 fixed) and 34 bugs in C++ compilers (16 new, 1 fixed). Mut4All outperforms existing methods in both unique crash detection and coverage, ranking first on Rust and second on C++.

基于突变的模糊性对于发现编译器错误是有效的,但为具有复杂构造(例如模板、宏)的现代语言设计高质量的突变器仍然具有挑战性。现有方法主要依靠手工设计或人到行校正,限制可缩放性和跨语言的通用性。我们提供了Mut4All,这是一个完全自动化的、语言认知性的框架,它综合了使用大语言模型(LLLMS)的突变器和来自错误报告的具体编译器知识。它由三个代理商组成:(1)一个用于识别突变目标并利用与编译器相关的洞察来生成变异元的变异发明剂;(2)一个用于制作初始执行的变动器合成剂,经过微调调整;(3)一个通过单位测试反馈来校正变异器的变异器改进剂。Mut4All处理1000个错误报告(500 Rust,500 C++),产生319 Rust和403 C+C突变器,每件使用 GPT-4o。我们定制的 fuzzer 和RBard 1,在新的 Rust Bard 1 中,在新的编译器中发现62 和RutBard 1,在新的 Rettrodrodrod Adrodrodrods) 中,在新的编译制中,在新的编译制中发现62。


Article 125

Title@2025-07-25 (5): Fine-Tuning Multilingual Language Models for Code Review: An Empirical Study on Industrial C# Projects

Title: Fine-Tuning Multilingual Language Models for Code Review: An Empirical Study on Industrial C# Projects Fine-Tuning Mehrsprachige Sprachmodelle für Code Review: Eine empirische Studie zu industriellen C#-Projekten 用于代码审查的精美多语言语言模式:工业C#项目经验研究 2507.19271v1

Authors (3): Igli Begolli, Meltem Aksoy, Daniel Neider

Code review is essential for maintaining software quality but is often time-consuming and cognitively demanding, especially in industrial environments. Recent advancements in language models (LMs) have opened new avenues for automating core review tasks. This study presents an empirical evaluation of the effect of monolingual fine-tuning on the performance of open-source LMs across three key automated code review tasks: Code Change Quality Estimation, Review Comment Generation, and Code Refinement. We fine-tuned three distinct models, CodeReviewer, CodeLlama-7B, and DeepSeek-R1-Distill, on a C#-specific dataset combining public benchmarks with industrial repositories. Our study investigates how different configurations of programming languages and natural languages in the training data affect LM performance, particularly in comment generation. Additionally, we benchmark the fine-tuned models against an automated software analysis tool (ASAT) and human reviewers to evaluate their practical utility in real-world settings. Our results show that monolingual fine-tuning improves model accuracy and relevance compared to multilingual baselines. While LMs can effectively support code review workflows, especially for routine or repetitive tasks, human reviewers remain superior in handling semantically complex or context-sensitive changes. Our findings highlight the importance of language alignment and task-specific adaptation in optimizing LMs for automated code review.

特别是在工业环境中,最近语文模式的进步为核心审查任务自动化开辟了新的途径。本研究报告介绍了对开放源码LM业绩进行单一语言微调的经验评价,这涉及三种关键的自动代码审查任务:代码改变质量估计、评论生成和代码改进。我们微调了三种不同的模型,即CodeReviewer、CodeLlama-7B和DeepSeek-R1-Distill,将公共基准与工业储存库相结合的C#具体数据集。我们的研究调查了培训数据中方案语言和自然语言的不同配置如何影响LM业绩,特别是在评论生成方面。此外,我们用自动化软件分析工具(ASAT)和人类审查员来衡量微调模型,以评价其在现实环境中的实际效用。我们的结果显示,单语言微调提高了模型的准确性和相关性,与多语种基线相比。虽然LM可以有效地支持代码审查工作流程,特别是例行或重复的任务,但人类审查员在处理语义复杂或上下文敏感的变更方面仍然更胜一筹。


Article 126

Title@2025-07-25 (5): Exploring the Use of LLMs for Requirements Specification in an IT Consulting Company

Title: Exploring the Use of LLMs for Requirements Specification in an IT Consulting Company Erforschung der Verwendung von LLMs für Anforderungen Spezifikation in einer IT-Beratungsgesellschaft 探索在IT咨询公司中如何使用按要求规格要求的LLMS 2507.19113v1

Authors (4): Liliana Pasquale, Azzurra Ragone, Emanuele Piemontese, Armin Amiri Darban

In practice, requirements specification remains a critical challenge. The knowledge necessary to generate a specification can often be fragmented across diverse sources (e.g., meeting minutes, emails, and high-level product descriptions), making the process cumbersome and time-consuming. In this paper, we report our experience using large language models (LLMs) in an IT consulting company to automate the requirements specification process. In this company, requirements are specified using a Functional Design Specification (FDS), a document that outlines the functional requirements and features of a system, application, or process. We provide LLMs with a summary of the requirements elicitation documents and FDS templates, prompting them to generate Epic FDS (including high-level product descriptions) and user stories, which are subsequently compiled into a complete FDS document. We compared the correctness and quality of the FDS generated by three state-of-the-art LLMs against those produced by human analysts. Our results show that LLMs can help automate and standardize the requirements specification, reducing time and human effort. However, the quality of LLM-generated FDS depends heavily on the inputs and often requires human revision. Thus, we advocate for a synergistic approach in which an LLM serves as an effective drafting tool while human analysts provide the critical contextual and technical oversight necessary for high-quality requirements engineering (RE) documentation.

在实践中,要求规格仍然是一项关键的挑战:制定规格所需的知识往往分散于各种来源(如会议记录、电子邮件和高级产品说明),使过程繁琐而费时;在本文件中,我们报告在信息技术咨询公司使用大型语言模型的经验,以便使要求规格过程自动化;在该公司,要求使用功能设计规格(FDS),该文件概述了系统、应用或过程的功能要求和特点;我们向LOMs提供需求引出文件和FDS模板的概要,促使它们生成Epic FDS(包括高级产品说明)和用户故事,随后汇编成完整的FDS文件;我们比较了由3个最先进的LMS产生的FDS的正确性和质量与由人类分析师制作的功能设计规格过程。我们的结果表明,LMS能够帮助自动化和标准化要求规格、缩短时间和人力工作。然而,LMM制成的FDS的质量在很大程度上取决于投入,并常常需要人进行修改。因此,我们主张以一个至关重要的逻辑分析工具来起草一个至关重要的LREM。


Article 127

Title: Emerging Trends in Software Architecture from the Practitioners Perspective: A Five Year Review Aufkommende Trends in der Softwarearchitektur aus der Perspektive der Praktizierenden: Ein Fünf-Jahres-Bericht 从从从业人员角度看软件架构的新趋势:五年审查 2507.14554v2

Authors (6): Ruoyu Su, Noman Ahmad, Matteo Esposito, Andrea Janes, Davide Taibi, Valentina Lenarduzzi

Software architecture plays a central role in the design, development, and maintenance of software systems. With the rise of cloud computing, microservices, and containers, architectural practices have diversified. Understanding these shifts is vital. This study analyzes software architecture trends across eight leading industry conferences over five years. We investigate the evolution of software architecture by analyzing talks from top practitioner conferences, focusing on the motivations and contexts driving technology adoption. We analyzed 5,677 talks from eight major industry conferences, using large language models and expert validation to extract technologies, their purposes, and usage contexts. We also explored how technologies interrelate and fit within DevOps and deployment pipelines. Among 450 technologies, Kubernetes, Cloud Native, Serverless, and Containers dominate by frequency and centrality. Practitioners present technologies mainly related to deployment, communication, AI, and observability. We identify five technology communities covering automation, coordination, cloud AI, monitoring, and cloud-edge. Most technologies span multiple DevOps stages and support hybrid deployment. Our study reveals that a few core technologies, like Kubernetes and Serverless, dominate contemporary software architecture practice. These are mainly applied in later DevOps stages, with limited focus on early phases like planning and coding. We also show how practitioners frame technologies by purpose and context, reflecting evolving industry priorities. Finally, we observe that only research can provide a more holistic lens on architectural design, quality, and evolution.

在设计、开发和维护软件系统方面,软件架构具有中心作用。随着云计算、微服务和集装箱的崛起,建筑实践也具有多样性。理解这些变化至关重要。本研究分析了五年来八次主要行业会议的软件架构趋势。我们通过分析顶级从业人员会议的谈判,调查软件架构的演变,重点是推动技术应用的动机和背景。我们分析了八次主要行业会议的5,677次谈判,使用大型语言模型和专家验证来提取技术、其用途和使用背景。我们还探讨了技术在DevOps和部署管道中的相互作用和适合性。在450种技术中,Kubernetes、云地、无服务器和集装箱以频率和中心为主。从业者介绍了主要与部署、通信、AI和可观察性有关的技术。我们确定了五个技术群体,涉及自动化、协调、云地AI、监测和云层。大多数技术跨越了多种DevOps阶段,支持混合部署。我们的研究显示,少数核心技术,如Kubernetes和服务器,主导当代软件架构实践。这些核心技术,主要应用在以后的DOps系统质量阶段和集装箱中,我们也以有限的设计重点展示了设计过程。我们如何在设计中,然后对设计结构框架进行更深入的研究。


Article 128

Title@2025-07-25 (5): SESR-Eval: Dataset for Evaluating LLMs in the Title-Abstract Screening of Systematic Reviews

Title: SESR-Eval: Dataset for Evaluating LLMs in the Title-Abstract Screening of Systematic Reviews SESR-Eval: Datensatz zur Bewertung von LLMs im Titel-Abstract Screening von Systematischen Bewertungen SESR-Eval:系统审查标题摘要筛选中评价LLMs的数据集 2507.19027v1

Authors (3): Aleksi Huotala, Miikka Kuutila, Mika Mäntylä

Background: The use of large language models (LLMs) in the title-abstract screening process of systematic reviews (SRs) has shown promising results, but suffers from limited performance evaluation. Aims: Create a benchmark dataset to evaluate the performance of LLMs in the title-abstract screening process of SRs. Provide evidence on whether using LLMs in title-abstract screening in software engineering is advisable. Method: We start with 169 SR research artifacts and find 24 of those to be suitable for inclusion in the dataset. Using the dataset, we benchmark title-abstract screening using 9 LLMs. Results: We present the SESR-Eval (Software Engineering Systematic Review Evaluation) dataset containing 34,528 labeled primary studies, sourced from 24 secondary studies published in software engineering (SE) journals. Most LLMs performed similarly, and the differences in screening accuracy between secondary studies are greater than the differences between LLMs. The cost of using an LLM is relatively low: less than $40 per secondary study even for the most expensive model. Conclusions: Our benchmark enables monitoring AI performance in the screening task of SRs in software engineering. At present, LLMs are not yet recommended for automating the title-abstract screening process, since accuracy varies widely across secondary studies, and no LLM managed a high recall with reasonable precision. In the future, we plan to investigate factors that influence LLM screening performance between studies.

背景:在系统综述(SR)的标题-摘要筛选过程中使用大语言模型(LLM)已显示出有希望的结果,但缺乏充分的性能评估。目标:建立一个基准数据集,用于评估LLM在系统综述标题-摘要筛选过程中的表现,并为软件工程领域是否适合在标题-摘要筛选中使用LLM提供证据。方法:我们从169个系统综述研究制品出发,发现其中24个适合纳入数据集;基于该数据集,我们用9个LLM对标题-摘要筛选进行基准测试。结果:我们提出SESR-Eval(Software Engineering Systematic Review Evaluation)数据集,其中包含34,528项带标签的原始研究,来源于发表在软件工程(SE)期刊上的24项二次研究。大多数LLM表现相近,不同二次研究之间的筛选准确率差异大于不同LLM之间的差异。使用LLM的成本相对较低,即使是最昂贵的模型,每项二次研究的成本也低于40美元。结论:我们的基准使人们能够持续监测AI在软件工程系统综述筛选任务中的表现。目前尚不建议用LLM实现标题-摘要筛选过程的自动化,因为不同二次研究之间的准确率差异很大,而且没有任何LLM能在保持合理精确率的同时取得高召回率。今后,我们计划研究影响LLM在不同研究间筛选表现的因素。


Article 129

Title@2025-07-25 (5): An Enumerative Embedding of the Python Type System in ACL2s

Title: An Enumerative Embedding of the Python Type System in ACL2s Eine enumerative Einbettung des Python-Typsystems in ACL2s Python 类型系统在 ACL2s 中的枚举式嵌入 2507.19015v1

Authors (4): Samuel Xifaras, Panagiotis Manolios, Andrew T. Walter, William Robertson

Python is a high-level interpreted language that has become an industry standard in a wide variety of applications. In this paper, we take a first step towards using ACL2s to reason about Python code by developing an embedding of a subset of the Python type system in ACL2s. The subset of Python types we support includes many of the most commonly used type annotations as well as user-defined types comprised of supported types. We provide ACL2s definitions of these types, as well as defdata enumerators that are customized to provide code coverage and identify errors in Python programs. Using the ACL2s embedding, we can generate instances of types that can then be used as inputs to fuzz Python programs, which allows us to identify bugs in Python code that are not detected by state-of-the-art Python type checkers. We evaluate our work against four open-source repositories, extracting their type information and generating inputs for fuzzing functions with type signatures that are in the supported subset of Python types. Note that we only use the type signatures of functions to generate inputs and treat the bodies of functions as black boxes. We measure code coverage, which ranges from about 68% to more than 80%, and identify code patterns that hinder coverage such as complex branch conditions and external file system dependencies. We conclude with a discussion of the results and recommendations for future work.

Python 是一种高级解释型语言,已在各种各样的应用中成为行业标准。在本文中,我们通过在 ACL2s 中嵌入 Python 类型系统的一个子集,迈出了使用 ACL2s 对 Python 代码进行推理的第一步。我们支持的 Python 类型子集包括许多最常用的类型注解,以及由受支持类型组成的用户自定义类型。我们提供这些类型的 ACL2s 定义,以及经过定制的 defdata 枚举器,用于提高代码覆盖率并识别 Python 程序中的错误。借助 ACL2s 嵌入,我们可以生成各类型的实例,并将其作为对 Python 程序进行模糊测试的输入,从而发现最先进的 Python 类型检查器检测不到的错误。我们在四个开源代码库上评估了这项工作:提取其类型信息,并为类型签名落在受支持子集内的函数生成模糊测试输入。需要注意的是,我们只使用函数的类型签名来生成输入,并把函数体视为黑盒。我们测得的代码覆盖率从约 68% 到 80% 以上不等,并识别出妨碍覆盖率的代码模式,例如复杂的分支条件和对外部文件系统的依赖。最后,我们讨论了结果并对未来工作提出建议。
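
The idea of generating fuzzing inputs purely from type signatures, with function bodies treated as black boxes, can be illustrated with a minimal sketch. This is not the ACL2s embedding or its defdata enumerators: it swaps in Python's own `random` module, and the helpers `generate` and `fuzz` as well as the sample function `mean` are hypothetical names introduced only for illustration. Only a tiny subset of annotations is handled.

```python
import random
from typing import Union, get_args, get_origin, get_type_hints


def generate(tp, rng, depth=0):
    """Return a random value for annotation `tp` (hypothetical helper, tiny subset)."""
    origin = get_origin(tp)
    if tp is type(None):
        return None
    if tp is bool:
        return rng.random() < 0.5
    if tp is int:
        return rng.randint(-1000, 1000)
    if tp is float:
        return rng.uniform(-1000.0, 1000.0)
    if tp is str:
        return "".join(rng.choice("ab 0\n") for _ in range(rng.randint(0, 5)))
    if origin is list:
        (elem,) = get_args(tp)
        # Cap nesting depth so nested list types cannot grow without bound.
        size = 0 if depth > 3 else rng.randint(0, 3)
        return [generate(elem, rng, depth + 1) for _ in range(size)]
    if origin is Union:  # also covers Optional[T]
        return generate(rng.choice(get_args(tp)), rng, depth + 1)
    raise NotImplementedError(f"unsupported annotation: {tp!r}")


def fuzz(func, trials=100, seed=0):
    """Call `func` with inputs derived only from its signature; collect crashes."""
    hints = get_type_hints(func)
    hints.pop("return", None)  # the return annotation is not an input
    rng = random.Random(seed)
    failures = []
    for _ in range(trials):
        kwargs = {name: generate(tp, rng) for name, tp in hints.items()}
        try:
            func(**kwargs)
        except Exception as exc:  # the body is a black box: any exception counts
            failures.append((kwargs, exc))
    return failures


def mean(xs: list[int]) -> float:
    # Well-typed, yet it divides by zero on the empty list.
    return sum(xs) / len(xs)


if __name__ == "__main__":
    for kwargs, exc in fuzz(mean)[:3]:
        print(kwargs, type(exc).__name__)
```

The sample function passes a static type check but fails on an empty list, which is the kind of bug that signature-driven input generation can surface.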


Article 130

Title@2025-07-25 (5): Towards Bug-Free Distributed Go Programs

Title: Towards Bug-Free Distributed Go Programs Auf dem Weg zu fehlerfreien verteilten Go-Programmen 迈向无缺陷的分布式 Go 程序 2506.15135v2

Authors (1): Zhengqun Koo

Programmers of distributed systems need to reason about concurrency to avoid races. However, reasoning about concurrency is difficult, and unexpected races show up as bugs. Data race detection in shared memory systems is well-studied (dynamic data race detection [13], behavioral types [15], dynamic race detection [31]). Similar to how a data race consists of reads and writes not related by happens-before at a shared memory location, a communication race consists of receives and sends not related by happens-before on a shared channel. Communication races are problematic: a receiver expects a specific message from a specific sender, but with a communication race, the receiver can receive a message meant for another receiver, or not receive anything at all. In this work, we describe a verification framework that can prove the absence of communication races for distributed programs that use a subset of the Go programming language, where synchronization is mainly achieved via message passing. We statically reason about how a distributed program executes, using a happens-before order, extended to buffered and unbuffered channels.

分布式系统的程序员需要对并发进行推理以避免竞争。然而,对并发进行推理很困难,意料之外的竞争会表现为程序缺陷。共享内存系统中的数据竞争检测已得到充分研究(动态数据竞争检测[13]、行为类型[15]、动态竞争检测[31])。数据竞争是指在同一共享内存位置上不存在 happens-before 关系的读和写;与此类似,通信竞争是指在同一共享通道上不存在 happens-before 关系的接收和发送。通信竞争是有问题的:接收方期望从特定发送方收到特定消息,但在存在通信竞争时,接收方可能收到本应发给另一个接收方的消息,或者根本收不到任何消息。在这项工作中,我们描述了一个验证框架,它可以为使用 Go 编程语言某个子集的分布式程序证明不存在通信竞争,这类程序主要通过消息传递实现同步。我们利用扩展到有缓冲和无缓冲通道的 happens-before 顺序,对分布式程序的执行方式进行静态推理。
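
To make the notion of a communication race concrete, here is a small sketch that checks a recorded trace rather than proving anything statically, so it is not the paper's verification framework. It assumes each channel operation carries a vector clock and, as one reading of the definition in the abstract, flags two operations of the same kind on the same channel when neither happens-before the other. The `Event` type and both function names are hypothetical.

```python
from itertools import combinations
from typing import NamedTuple


class Event(NamedTuple):
    label: str
    channel: str
    kind: str      # "send" or "recv"
    clock: tuple   # vector clock, one entry per process


def happens_before(a, b):
    """Vector-clock order: a precedes b if a <= b componentwise and a != b."""
    return a != b and all(x <= y for x, y in zip(a, b))


def communication_races(events):
    """Return label pairs of same-kind operations on one channel that are unordered."""
    races = []
    for e1, e2 in combinations(events, 2):
        same_endpoint = e1.channel == e2.channel and e1.kind == e2.kind
        ordered = (happens_before(e1.clock, e2.clock)
                   or happens_before(e2.clock, e1.clock))
        if same_endpoint and not ordered:
            races.append((e1.label, e2.label))
    return races


if __name__ == "__main__":
    trace = [
        Event("s1", "c", "send", (1, 0, 0)),
        Event("s2", "c", "send", (0, 1, 0)),  # concurrent with s1
        Event("s3", "c", "send", (1, 2, 0)),  # ordered after s1 and s2
        Event("s4", "d", "send", (0, 0, 1)),  # different channel
    ]
    print(communication_races(trace))
```

In this trace only `s1` and `s2` are reported, because `s3` is ordered after both and `s4` uses another channel.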


Article 131

Title@2025-07-25 (5): CoCoEvo: Co-Evolution of Programs and Test Cases to Enhance Code Generation

Title: CoCoEvo: Co-Evolution of Programs and Test Cases to Enhance Code Generation CoCoEvo: Co-Evolution von Programmen und Testfällen zur Verbesserung der Codegenerierung CoCoEvo:程序与测试用例的协同演化,用于增强代码生成 2502.10802v2

Authors (5): Kefan Li, Yuan Yuan, Hongyue Yu, Tingyu Guo, Shijie Cao

Large Language Models (LLMs) have shown remarkable performance in automated code generation. However, existing approaches often rely heavily on pre-defined test cases, which become impractical in scenarios where such cases are unavailable. While prior works explore filtering techniques between programs and test cases, they overlook the refinement of test cases. To address this limitation, we introduce CoCoEvo, a novel LLM-based co-evolution framework that simultaneously evolves programs and test cases. CoCoEvo eliminates the dependency on pre-defined test cases by generating both programs and test cases directly from natural language problem descriptions and function headers. The framework employs specialized evolutionary operators, including LLM-based crossover and mutation operators for program evolution, along with an additional test case generation operator for test case evolution. Additionally, we propose optimization strategies such as a crossover rate scheduler to balance exploration and convergence, and a multi-objective optimization method for test case selection. Experimental results on multiple state-of-the-art LLMs demonstrate that CoCoEvo surpasses existing methods, achieving state-of-the-art performance in automated code generation and testing. These results underscore the potential of co-evolutionary techniques in advancing the field of automated programming.

大语言模型(LLMS)在自动代码生成方面表现显著,但是,现有方法往往严重依赖预先确定的测试案例,在无法提供此类案例的情况下,这种测试案例变得不切实际。虽然先前的工作探索了程序与测试案例之间的过滤技术,但忽视了测试案例的完善。为了解决这一限制,我们引入了CoCoEvo, 这是一种基于LLM的新型共同演进框架,它同时发展了程序和测试案例。CoEvo通过直接生成自然语言问题描述和功能头目的程序和测试案例,消除对预先确定的测试案例的依赖。框架雇用了专门的进化操作员,包括基于LLMM的交叉和突变操作员,用于程序演进,以及额外的测试案例生成操作员。此外,我们提出了优化战略,例如交叉费率表,以平衡探索和趋同,以及测试案例选择的多目标优化方法。多州LMS的实验结果表明CoEvo超越了现有方法,在自动代码生成和测试中实现最新性性功能。这些结果突出表明了在推进外地的自动化技术开发方面的潜力。


Article 132

Title@2025-07-25 (5): Classifying Issues in Open-source GitHub Repositories

Title: Classifying Issues in Open-source GitHub Repositories Einordnung von Problemen in Open-Source GitHub-Repositories 开放源码 GitHub 存储库中问题的分类 2507.18982v1

Authors (3): Amir Hossain Raaj, Fairuz Nawer Meem, Sadia Afrin Mim

GitHub is the most widely used platform for software maintenance in the open-source community. Developers report issues on GitHub from time to time while facing difficulties. Having labels on those issues can help developers easily address those issues with prior knowledge of labels. However, most of the GitHub repositories do not maintain regular labeling for the issues. The goal of this work is to classify issues in the open-source community using ML & DNN models. There are thousands of open-source repositories on GitHub. Some of the repositories label their issues properly whereas some of them do not. When issues are pre-labeled, the problem-solving process and the immediate assignment of corresponding personnel are facilitated for the team, thereby expediting the development process. In this work, we conducted an analysis of prominent GitHub open-source repositories. We classified the issues in some common labels which are: API, Documentation, Enhancement, Question, Easy, Help-wanted, Dependency, CI, Waiting for OP’s response, Test, Bug, etc. Our study shows that DNN models outperf

GitHub是开放源码社区最广泛使用的软件维护平台。 开发者在面临困难时不时报告 GitHub 上的问题。 在这些问题上贴上标签可以帮助开发者以先前的标签知识轻而易举地解决这些问题。 但是, 大部分 GitHub 库并不定期为这些问题贴上标签。 这项工作的目标是使用 ML DNN 模型对开放源码社区的问题进行分类。 在 GitHub 上存在数千个开放源码库。 有些存储库恰当地标出了他们的问题, 而有些则没有。 当问题被预先标出时, 问题解决过程和立即指派相应人员为团队提供便利, 从而加快了开发进程。 在这项工作中, 我们对著名的 GitHub 开源库进行了分析。 我们将这些问题归类在一些共同的标签中, 它们是: API、 文件、 增强、 问题、 容易、 求助、 依赖性、 CI、 等待 等待 OP 反应、 测试、 错误等。 我们的研究显示 DNN 模型超越了 。


Article 133

Title@2025-07-25 (5): Do Existing Testing Tools Really Uncover Gender Bias in Text-to-Image Models?

Title: Do Existing Testing Tools Really Uncover Gender Bias in Text-to-Image Models? Enthüllen bestehende Testtools Gender-Bias wirklich in Text-to-Image-Modellen? 现有测试工具是否真的在文本到图像模型中排除性别偏见? 2501.15775v2

Authors (5): Yunbo Lyu, Zhou Yang, Yuqing Niu, Jing Jiang, David Lo

Text-to-Image (T2I) models have recently gained significant attention due to their ability to generate high-quality images and are consequently used in a wide range of applications. However, there are concerns about the gender bias of these models. Previous studies have shown that T2I models can perpetuate or even amplify gender stereotypes when provided with neutral text prompts. Researchers have proposed automated gender bias uncovering detectors for T2I models, but a crucial gap exists: no existing work comprehensively compares the various detectors and understands how the gender bias detected by them deviates from the actual situation. This study addresses this gap by validating previous gender bias detectors using a manually labeled dataset and comparing how the bias identified by various detectors deviates from the actual bias in T2I models, as verified by manual confirmation. We create a dataset consisting of 6,000 images generated from three cutting-edge T2I models: Stable Diffusion XL, Stable Diffusion 3, and Dreamlike Photoreal 2.0. During the human-labeling process, we find that all three T2I models generate a portion (12.48% on average) of low-quality images (e.g., generate images with no face present), where human annotators cannot determine the gender of the person. Our analysis reveals that all three T2I models show a preference for generating male images, with SDXL being the most biased. Additionally, images generated using prompts containing professional descriptions (e.g., lawyer or doctor) show the most bias. We evaluate seven gender bias detectors and find that none fully capture the actual level of bias in T2I models, with some detectors overestimating bias by up to 26.95%. We further investigate the causes of inaccurate estimations, highlighting the limitations of detectors in dealing with low-quality images. Based on our findings, we propose an enhanced detector…

文本到图像( T2I) 模型最近因其生成高质量图像的能力而引起人们的极大关注。 但是,人们对这些模型的性别偏差存在关注。 先前的研究显示, T2I 模型在提供中性文本提示时,可以延续甚至扩大性别陈规定型观念。 研究人员建议自动发现T2I 模型的检测器,但存在一个关键差距: 没有对现有工作进行全面比较, 并理解它们所检测到的性别偏差如何偏离实际状况。 本研究利用手动标签显示的26个数据集校准以前的性别偏差探测器, 并比较各种探测器所发现的性别偏差如何与T2I 模型中的实际偏差相悖, 并经手动确认。 我们创建了一个数据集, 由三台切T2I 模型生成的6 000个图像: 与Statt Difculp XL 检测器、 Stabl Difl 3 和Dreamlialal 2.0 。 在人类标记过程中, 我们发现所有三个T2I 模型都通过进一步生成一个部分( 12. 48% ) 以平均) 以显示最不精确的性别偏差的图像, 以显示最精确的图像。


Article 134

Title@2025-07-25 (5): SLICEMATE: Accurate and Scalable Static Program Slicing via LLM-Powered Agents

Title: SLICEMATE: Accurate and Scalable Static Program Slicing via LLM-Powered Agents SLICEMATE: Genaues und skalierbares statisches Programm-Slicing mit LLM-gestützten Agenten SLICEMATE:通过 LLM 驱动的智能体实现准确且可扩展的静态程序切片 2507.18957v1

Authors (8): Jianming Chang, Jieke Shi, Yunbo Lyu, Xin Zhou, Lulu Wang, Zhou Yang, Bixin Li, David Lo

Static program slicing, which extracts the executable portions of a program that affect the values at a specific location, supports many software analysis tasks such as debugging and security auditing. However, traditional slicing tools rely on computationally expensive reachability analysis over dependency graphs, which struggle to scale to large programs and often fail to handle code with incomplete syntax. Recently emerged learning-based methods, while more robust to such cases, still fall short of achieving comparable performance to traditional methods on well-formed code. In this work, we propose SliceMate, a novel static program slicing solution powered by Large Language Model (LLM) agents. It bypasses the need for explicit dependency graph construction and achieves superior slicing accuracy. Concretely, SliceMate integrates three specialized agents: (1) a synthesis agent that produces candidate slices by incrementally expanding the scan scope across functions and files guided by LLM-inferred dependencies; (2) a verification agent that performs conciseness and completeness checks of the candidate slices, detecting missing or irrelevant statements; and (3) a refinement agent that repairs the slices with minimal edits in accordance with the verification results. These agents are orchestrated by a control module that ensures timely convergence and outputs high-quality slices without manual intervention. For rigorous evaluation, we construct a new and high-quality benchmark, SliceBench, comprising 2,200 manually annotated Java and Python programs, with program lengths ranging from 5 to 8,577 lines, significantly larger than those in existing slicing benchmarks. Experimental results show that SliceMate greatly outperforms both traditional and learning-based slicing tools.

静态程序切片提取程序中影响特定位置取值的可执行部分,支持调试和安全审计等许多软件分析任务。然而,传统的切片工具依赖于在依赖图上进行计算开销高昂的可达性分析,难以扩展到大型程序,而且往往无法处理语法不完整的代码。近来出现的基于学习的方法虽然对这类情况更为稳健,但在格式良好的代码上仍达不到与传统方法相当的性能。在这项工作中,我们提出 SliceMate,一种由大语言模型(LLM)智能体驱动的新型静态程序切片方案。它无需显式构建依赖图,并取得了更高的切片准确率。具体而言,SliceMate 集成了三个专门的智能体:(1) 合成智能体,在 LLM 推断出的依赖关系引导下,跨函数和文件逐步扩大扫描范围,生成候选切片;(2) 验证智能体,对候选切片进行简洁性和完整性检查,检测缺失或无关的语句;(3) 精化智能体,根据验证结果以最小的改动修复切片。这些智能体由一个控制模块统一编排,该模块确保及时收敛,并在无需人工干预的情况下输出高质量切片。为了进行严格评估,我们构建了一个新的高质量基准 SliceBench,包含 2,200 个人工标注的 Java 和 Python 程序,程序长度从 5 行到 8,577 行不等,明显大于现有切片基准中的程序。实验结果表明,SliceMate 大幅优于传统切片工具和基于学习的切片工具。
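
For context, the traditional approach that the abstract contrasts SliceMate with, reachability over a dependency graph, can be sketched in a few lines. This illustrates the baseline technique only, not SliceMate's agents. The graph is assumed to be given as a dictionary from a statement to the statements it depends on, and `backward_slice` is a hypothetical name.

```python
from collections import deque


def backward_slice(deps, criterion):
    """Return every statement the criterion transitively depends on, plus itself.

    `deps` maps a statement id to the ids it depends on (data or control).
    """
    seen = {criterion}
    queue = deque([criterion])
    while queue:
        node = queue.popleft()
        for pred in deps.get(node, ()):
            if pred not in seen:
                seen.add(pred)
                queue.append(pred)
    return sorted(seen)


if __name__ == "__main__":
    # 1: a = 1
    # 2: b = 2
    # 3: c = a + 1
    # 4: d = b * 2
    # 5: print(c)
    deps = {3: {1}, 4: {2}, 5: {3}}
    print(backward_slice(deps, 5))
```

Slicing on statement 5 keeps statements 1, 3 and 5 and drops the unrelated computation of `b` and `d`. Building `deps` accurately for real programs is the expensive step that this sketch takes as given.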


Article 135

Title@2025-07-25 (5): Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity

Title: Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity Messung des Einflusses der frühen-2025 KI auf erfahrene Open-Source-Entwicklerproduktivität 衡量2025年初AI(AI)对经验丰富的开放源码开发者生产力的影响 2507.09089v2

Authors (4): Joel Becker, Nate Rush, Elizabeth Barnes, David Rein

Despite widespread adoption, the impact of AI tools on software development in the wild remains understudied. We conduct a randomized controlled trial (RCT) to understand how AI tools at the February-June 2025 frontier affect the productivity of experienced open-source developers. 16 developers with moderate AI experience complete 246 tasks in mature projects on which they have an average of 5 years of prior experience. Each task is randomly assigned to allow or disallow usage of early 2025 AI tools. When AI tools are allowed, developers primarily use Cursor Pro, a popular code editor, and Claude 3.5/3.7 Sonnet. Before starting tasks, developers forecast that allowing AI will reduce completion time by 24%. After completing the study, developers estimate that allowing AI reduced completion time by 20%. Surprisingly, we find that allowing AI actually increases completion time by 19%: AI tooling slowed developers down. This slowdown also contradicts predictions from experts in economics (39% shorter) and ML (38% shorter). To understand this result, we collect and evaluate evidence for 20 properties of our setting that a priori could contribute to the observed slowdown effect, for example the size and quality standards of projects or prior developer experience with AI tooling. Although the influence of experimental artifacts cannot be entirely ruled out, the robustness of the slowdown effect across our analyses suggests it is unlikely to primarily be a function of our experimental design.

尽管被广泛采用,但AI工具对野生软件开发的影响仍然没有得到充分研究。我们进行了随机控制测试(RCT),以了解2025年2月至6月边境的AI工具如何影响有经验的开放源开发商的生产率。16个具有温和AI经验的开发商完成了成熟项目的246项任务,平均5年经验的成熟开发商完成了246项任务。每项任务被随机指定允许或不允许使用2025年初AI工具。当允许使用AI工具时,开发商主要使用流行代码编辑Cursor Pro和Claude 3.5.3.7 Sonnet。在开始任务之前,开发商预测允许AI完成时间将减少24 % 。在完成研究之后,开发商估计允许AI完成时间减少20 % 。令人惊讶的是,我们发现允许AI实际将完成时间增加19 % - AI 工具, 减缓开发者。这种减速也与经济专家(39% 短) 和 ML (38%短) 的预测相矛盾。为了了解这一结果,我们设定的20种特性,我们收集并评估其前一种有助于观察到的减速效应的证据。例如,允许AI 。在完成研究后,尽管其规模和质量标准不能完全超越了我们之前的实验性能影响,但我们的弹性分析。


Article 136

Title@2025-07-24 (4): Observing Fine-Grained Changes in Jupyter Notebooks During Development Time

Title: Observing Fine-Grained Changes in Jupyter Notebooks During Development Time Beobachten feinkörniger Änderungen in Jupyter-Notebooks während der Entwicklungszeit 观察开发过程中 Jupyter 笔记本的细粒度变化 2507.15831v2

Authors (8): Sergey Titov, Konstantin Grotov, Cristina Sarasua, Yaroslav Golubev, Dhivyabharathi Ramasamy, Alberto Bacchelli, Abraham Bernstein, Timofey Bryksin

In software engineering, numerous studies have focused on the analysis of fine-grained logs, leading to significant innovations in areas such as refactoring, security, and code completion. However, no similar studies have been conducted for computational notebooks in the context of data science. To help bridge this research gap, we make three scientific contributions: we (1) introduce a toolset for collecting code changes in Jupyter notebooks during development time; (2) use it to collect more than 100 hours of work related to a data analysis task and a machine learning task (carried out by 20 developers with different levels of expertise), resulting in a dataset containing 2,655 cells and 9,207 cell executions; and (3) use this dataset to investigate the dynamic nature of the notebook development process and the changes that take place in the notebooks. In our analysis of the collected data, we classified the changes made to the cells between executions and found that a significant number of these changes were code iteration modifications. We report a number of other insights and propose potential future research directions on the novel data.

在软件工程方面,许多研究侧重于细细细的原木分析,导致在重构、安全和代码完成等领域进行重大创新,然而,在数据科学方面,没有对计算笔记本进行类似的研究。为了缩小这一研究差距,我们作出了三项科学贡献:(1) 采用工具来收集Jupyter笔记本在开发期间的代码变化;(2) 利用它收集与数据分析任务和机器学习任务有关的100多小时工作(由20个具有不同专门知识的开发者承担),从而形成包含2 655个细胞和9 207个细胞处决的数据集;(3) 利用这一数据集来调查笔记本开发过程的动态性质和笔记本中发生的变化。我们在分析所收集的数据时,对处决之间细胞的变化进行了分类,发现这些变化中有相当多的是代码 Iteration修改。我们报告了一些其他见解,并提出了关于新数据的潜在研究方向。


Article 137

Title@2025-07-24 (4): Exploring the Jupyter Ecosystem: An Empirical Study of Bugs and Vulnerabilities

Title: Exploring the Jupyter Ecosystem: An Empirical Study of Bugs and Vulnerabilities Erforschung des Jupyter-Ökosystems: Eine empirische Studie über Bugs und Schwachstellen 探索 Jupyter 生态系统:关于缺陷和漏洞的实证研究 2507.18833v1

Authors (4): Wenyuan Jiang, Diany Pressato, Harsh Darji, Thibaud Lutellier

Background. Jupyter notebooks are one of the main tools used by data scientists. Notebooks include features (configuration scripts, markdown, images, etc.) that make them challenging to analyze compared to traditional software. As a result, existing software engineering models, tools, and studies do not capture the uniqueness of Notebook’s behavior. Aims. This paper aims to provide a large-scale empirical study of bugs and vulnerabilities in the Notebook ecosystem. Method. We collected and analyzed a large dataset of Notebooks from two major platforms. Our methodology involved quantitative analyses of notebook characteristics (such as complexity metrics, contributor activity, and documentation) to identify factors correlated with bugs. Additionally, we conducted a qualitative study using grounded theory to categorize notebook bugs, resulting in a comprehensive bug taxonomy. Finally, we analyzed security-related commits and vulnerability reports to assess risks associated with Notebook deployment frameworks. Results. Our findings highlight that configuration issues are among the most common bugs in notebook documents, followed by incorrect API usage. Finally, we explore common vulnerabilities associated with popular deployment frameworks to better understand risks associated with Notebook development. Conclusions. This work highlights that notebooks are less well-supported than traditional software, resulting in more complex code, misconfiguration, and poor maintenance.

Juppyter 笔记本是数据科学家使用的主要工具之一。笔记本包括一些特征(配置脚本、标记、图像等),使得它们难以与传统软件相比进行分析。因此,现有的软件工程模型、工具和研究无法捕捉笔记本行为的独特性。目的。本文件旨在对笔记本生态系统中的错误和脆弱性进行大规模的经验性研究。方法。我们从两个主要平台收集和分析了大量的笔记本数据集。我们的方法包括对笔记本特性(如复杂计量、贡献者活动和文件)进行定量分析,以确定与错误相关的因素。此外,我们利用基础理论对笔记本错误进行定性研究,从而形成全面的分类。最后,我们分析了与安全有关的承诺和脆弱性报告,以评估与笔记本书部署框架相关的风险。结果。我们的调查结果强调,配置问题是笔记本文件中最常见的错误,随后不正确的使用。最后,我们探索了与大众部署框架相关的共同的脆弱性,以便更好地了解与笔记本发展相关的风险。此外,我们利用基础理论来对笔记本错误进行分类,这个错误的维护比常规代码要少得多。


Article 138

Title@2025-07-24 (4): MemoCoder: Automated Function Synthesis using LLM-Supported Agents

Title: MemoCoder: Automated Function Synthesis using LLM-Supported Agents MemoCoder: Automatisierte Funktionssynthese mit LLM-unterstützten Agenten MemoCoder:使用LLM支持的代理器自动功能合成 2507.18812v1

Authors (4): Yiping Jia, Zhen Ming Jiang, Shayan Noei, Ying Zou

With the widespread adoption of Large Language Models (LLMs) such as GitHub Copilot and ChatGPT, developers increasingly rely on AI-assisted tools to support code generation. While LLMs can generate syntactically correct solutions for well-structured programming tasks, they often struggle with challenges that require iterative debugging, error handling, or adaptation to diverse problem structures. Existing approaches such as fine-tuning or self-repair strategies either require costly retraining or lack mechanisms to accumulate and reuse knowledge from previous attempts. To address these limitations, we propose MemoCoder, a multi-agent framework that enables collaborative problem solving and persistent learning from past fixes. At the core of MemoCoder is a Fixing Knowledge Set, which stores successful repairs and supports retrieval for future tasks. A central Mentor Agent supervises the repair process by identifying recurring error patterns and refining high-level fixing strategies, providing a novel supervisory role that guides the self-repair loop. We evaluate MemoCoder across three public benchmarks – MBPP, HumanEval, and LiveCodeBench – spanning a range of problem complexities. Experimental results show that MemoCoder consistently outperforms both zero-shot prompting and a Self-Repair strategy, with improvements ranging from 3.1% to 12.1% in Pass@10 and from 1.4% to 14.5% in Pass@50, demonstrating its effectiveness in iterative refinement and knowledge-guided code generation.

随着大语言模型(LLMS)的广泛采用,例如GitHub Copilation和ChattGPT,开发者越来越依赖AI协助的工具来支持代码生成。LLMS可以为结构完善的编程任务产生综合正确的解决方案,但他们往往要面对挑战,需要反复调试、错误处理或适应不同的问题结构。微调或自我修复战略等现有方法要么需要花费昂贵的再培训或缺乏机制来积累和再利用以往尝试的知识。为了解决这些限制,我们提议MemoCoder,这是一个多试办框架,能够协作解决和持续学习过去修正的问题。在MemoCoder的核心是一个修补知识组,它储存成功的修理和支持对未来任务的检索。中央导师监管者通过查明反复出现的错误模式和完善高层次的修补战略来监督修复进程,提供指导自我修复循环的新的监督作用。我们从三个公共基准(MBPP、HumanEval和LiveCode Bench)评估Memo Coder – 跨越一系列问题的复杂性。在MemoCoCoCodeBSermaxRegres regil5 中不断显示1.10% 和方向的升级战略在14的升级中展示中不断显示自我定位和升级战略的改进。
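
The abstract reports Pass@10 and Pass@50 without saying how they were computed. A common choice, assumed here and not confirmed by the source, is the unbiased pass@k estimator: given n generated samples of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). A minimal sketch, with `pass_at_k` as an illustrative name:

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate the probability that at least one of k samples passes.

    n: samples generated for the problem, c: samples that passed, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples exist, so any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    print(pass_at_k(n=100, c=5, k=10))
```

A benchmark-level score would then be the mean of this value over all problems.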


Article 139

Title@2025-07-24 (4): MetaSel: A Test Selection Approach for Fine-tuned DNN Models

Title: MetaSel: A Test Selection Approach for Fine-tuned DNN Models MetaSel: Ein Testauswahlverfahren für fein abgestimmte DNN-Modelle MetaSel: 微调 DNN 模型的测试选择方法 2503.17534v3

Authors (4): Amin Abbasishahkoo, Mahboubeh Dadkhah, Lionel Briand, Dayi Lin

Deep Neural Networks (DNNs) face challenges during deployment due to data distribution shifts. Fine-tuning adapts pre-trained models to new contexts requiring smaller labeled sets. However, testing fine-tuned models under constrained labeling budgets remains a critical challenge. This paper introduces MetaSel, a new approach, tailored for fine-tuned DNN models, to select tests from unlabeled inputs. MetaSel assumes that fine-tuned and pre-trained models share related data distributions and exhibit similar behaviors for many inputs. However, their behaviors diverge within the input subspace where fine-tuning alters decision boundaries, making those inputs more prone to misclassification. Unlike general approaches that rely solely on the DNN model and its input set, MetaSel leverages information from both the fine-tuned and pre-trained models and their behavioral differences to estimate misclassification probability for unlabeled test inputs, enabling more effective test selection. Our extensive empirical evaluation, comparing MetaSel against 11 state-of-the-art approaches and involving 68 fine-tuned models across weak, medium, and strong distribution shifts, demonstrates that MetaSel consistently delivers significant improvements in Test Relative Coverage (TRC) over existing baselines, particularly under highly constrained labeling budgets. MetaSel shows average TRC improvements of 28.46% to 56.18% over the most frequent second-best baselines while maintaining a high TRC median and low variability. Our results confirm MetaSel’s practicality, robustness, and cost-effectiveness for test selection in the context of fine-tuned models.

深心神经网络(DNNs) 在部署期间,由于数据分布的变化而面临挑战。 微调使经过预先训练的模型适应需要较少标签的数据集的新环境。 但是,在限制标签预算下测试微调模型仍然是一个关键的挑战。 本文介绍了Metasel, 一种新办法,为经过微调的DNNN模型设计,目的是从没有标签的投入中选择测试。 MetaSel认为,经过微调和预先训练的模型可以共享相关数据分布,并展示许多投入的类似行为。然而,在投入子空间内,微调改变决策界限,使这些投入更容易被错误分类。与完全依赖DNNN模式及其投入集的一般办法不同,MetaSel利用了从微调和预先训练的DNNNNW模型及其行为差异中的信息,以估计没有标签的测试投入的错误分类概率,从而能够更有效地进行测试选择。 我们的广泛经验评估,将MetASel与11个最先进的先进方法进行比较,并涉及68个微调模型在弱、中等和强劲的分布变化中,使得这些输入更容易分类。 MetSeralS 最经常地显示,在高的RBS 28 的汇率基准下,同时持续地进行着高的不断的升级的升级的比标定调 。


Article 140

Title@2025-07-24 (4): Decompiling Rust: An Empirical Study of Compiler Optimizations and Reverse Engineering Challenges

Title: Decompiling Rust: An Empirical Study of Compiler Optimizations and Reverse Engineering Challenges Decompiling Rust: Eine empirische Studie über Compiler-Optimierungen und Reverse Engineering-Herausforderungen Drecomping Rust:关于编纂者优化和逆向工程挑战的经验性研究 2507.18792v1

Authors (1): Zixu Zhou

Decompiling Rust binaries is challenging due to the language’s rich type system, aggressive compiler optimizations, and widespread use of high-level abstractions. In this work, we conduct a benchmark-driven evaluation of decompilation quality across core Rust features and compiler build modes. Our automated scoring framework shows that generic types, trait methods, and error handling constructs significantly reduce decompilation quality, especially in release builds. Through representative case studies, we analyze how specific language constructs affect control flow, variable naming, and type information recovery. Our findings provide actionable insights for tool developers and highlight the need for Rust-aware decompilation strategies.

由于语言的丰富类型系统、积极的编译优化和高层次抽象的普及使用,分解规则的二进制具有挑战性。 在这项工作中,我们根据基准对核心的 Rust 特征和编译者构建模式的分解质量进行评估。 我们的自动评分框架显示,通用类型、特性方法和错误处理结构大大降低了分解质量,特别是在发布过程中。 通过有代表性的案例研究,我们分析了特定语言结构如何影响控制流程、变量命名和类型信息恢复。 我们的发现为工具开发者提供了可操作的洞察,并凸显了 Rust-aware 解析战略的必要性。


Article 141

Title@2025-07-24 (4): Initial Steps in Integrating Large Reasoning and Action Models for Service Composition

Title: Initial Steps in Integrating Large Reasoning and Action Models for Service Composition Erste Schritte bei der Integration großer Vernunft- und Handlungsmodelle für die Servicezusammensetzung 整合服务构成大理由和行动模式的初步步骤 2507.18775v1

Authors (2): Ilche Georgievski, Marco Aiello

Service composition remains a central challenge in building adaptive and intelligent software systems, often constrained by limited reasoning capabilities or brittle execution mechanisms. This paper explores the integration of two emerging paradigms enabled by large language models: Large Reasoning Models (LRMs) and Large Action Models (LAMs). We argue that LRMs address the challenges of semantic reasoning and ecosystem complexity while LAMs excel in dynamic action execution and system interoperability. However, each paradigm has complementary limitations - LRMs lack grounded action capabilities, and LAMs often struggle with deep reasoning. We propose an integrated LRM-LAM architectural framework as a promising direction for advancing automated service composition. Such a system can reason about service requirements and constraints while dynamically executing workflows, thus bridging the gap between intention and execution. This integration has the potential to transform service composition into a fully automated, user-friendly process driven by high-level natural language intent.

建立适应性和智能软件系统时,服务构成仍是一个中心挑战,往往受到有限推理能力或执行机制的制约。本文件探讨将大型语文模式促成的两个新兴模式(大理由模型和大型行动模型)结合起来:大理由模型和大型行动模型(LAMs),我们争辩说,LRM处理语义推理和生态系统复杂化的挑战,LAMs则擅长动态行动执行和系统互操作性。然而,每个模式都有互补性的局限性—-LRM缺乏有根有据的行动能力,LAMs常常以深入的推理方式挣扎。我们建议综合LRM-LAM建筑框架作为推进自动化服务构成的有希望的方向。这样的系统可以说明服务要求和限制,同时动态地执行工作流程,从而缩小意图与执行之间的差距。这种整合有可能将服务构成转变为由高层次自然语言意图驱动的完全自动化、方便用户的进程。


Article 142

Title@2025-07-24 (4): Agentic Program Repair from Test Failures at Scale: A Neuro-symbolic approach with static analysis and test execution feedback

Title: Agentic Program Repair from Test Failures at Scale: A Neuro-symbolic approach with static analysis and test execution feedback Agentische Programm-Reparatur von Testfehlern im Maßstab: Ein neuro-symbolischer Ansatz mit statischer Analyse und Test-Ausführungs-Feedback 大规模试验失败时的试验失败时的代理方案修复:采用静态分析和测试执行反馈的神经-正反方法 2507.18755v1

Authors (24): Chandra Maddila, Adam Tait, Claire Chang, Daniel Cheng, Nauman Ahmad, Vijayaraghavan Murali, Marshall Roch, Arnaud Avondet, Aaron Meltzer, Victor Montalvao, Michael Hopko, Chris Waterson, Parth Thakkar, Renuka Fernandez, Kristian Kristensen, Sivan Barzily, Sherry Chen, Rui Abreu, Nachiappan Nagappan, Payam Shodjai, Killian Murphy, James Everingham, Aparna Ramani, Peter C. Rigby

Aim: With the advent of LLMs, sophisticated agentic program repair has become viable at large organizations with large codebases. In this work, we develop an Engineering Agent that fixes the source code based on test failures at scale across diverse software offerings internally. Method: Using Llama as the base, we employ the ReAct harness to develop an agent. We start with a test failure that was triaged by a rule-based test failure bot. We then set up an agentic harness and allow the agent to reason and run a set of 15 actions from reading a file to generating a patch. We provide feedback to the agent through static analysis and test failures so it can refine its solution. We leverage an LLM-as-a-Judge to ensure that the patch conforms to the standards followed by a human review to land fixes. Benchmark Findings: We curated offline benchmarks for our patch generator, the Engineering Agent loop, and the LLM-as-a-Judge. In offline evaluations we found that a specialized 70B model is highly competitive with the much larger but vanilla Llama-405B. In an ablation study, we found that the ReAct harness (neural model) benefited from the symbolic information from static analysis tools and test execution traces. A model that strikes a balance between the solve rate and error rate vs the cost and latency has a benchmark solve rate of 42.3% using an average 11.8 feedback iterations. Production Findings: In a three month period, 80% of the generated fixes were reviewed, of which 31.5% were landed (25.5% of the total number of generated fixes). Feedback from Engineers: We used open coding to extract qualitative themes from engineers’ feedback. We saw positive feedback in the form of quick approvals, gratitude, and surprise. We also found mixed feedback when the Engineering Agent’s solution was partially correct and it served as a good starting point.

目标 : 随着LLMS的到来, 精密的代理程序修理在拥有大代码库的大型组织中变得可行。 在这项工作中, 我们开发了一个工程剂, 根据各种软件内部提供的规模测试失败来修正源代码。 方法 : 使用Llama作为基础, 我们使用 ReAct 来开发一个代理物。 我们从测试失败的测试失败开始, 我们先用基于规则的测试失败机来修正测试失败。 我们随后设置了一种代理物力, 让代理商理性地, 并运行了一套15个动作。 我们通过静态分析和测试失败来向代理商提供反馈, 以便改进它的解决方案。 我们利用LLMAA- A- A-J 5 模型来确保源源代码符合对土地进行的人审查所遵循的标准 。 基准结论: 我们用基于规则的测试发电机、 快速循环和 LLM- A- Judi 模型的离线基准点, 我们发现一个专门的70B模型在更大程度上具有竞争力, 但是用Vanilla Llama- 405B 的反馈方法来改进其解决方案 。 在模拟研究中, 我们发现, IMA- bral 模型中, 我们利用了80 的进度分析中, 我们从一个测试了一个运行率 的进度模型中, 我们找到了一个测试了一种对一个从一个运行率的进度的进度。


Article 143

Title@2025-07-24 (4): HLSTester: Efficient Testing of Behavioral Discrepancies with LLMs for High-Level Synthesis

Title: HLSTester: Efficient Testing of Behavioral Discrepancies with LLMs for High-Level Synthesis HLSTester: Effiziente Prüfung von Verhaltensdiskrepanzen mit LLMs für High-Level-Synthese HLS Tester: 与高级别合成项目LLM有效测试行为差异 2504.14641v3

Authors (4): Kangwei Xu, Bing Li, Grace Li Zhang, Ulf Schlichtmann

In high-level synthesis (HLS), C/C++ programs with synthesis directives are used to generate circuits for FPGA implementations. However, hardware-specific and platform-dependent characteristics in these implementations can introduce behavioral discrepancies between the original C/C++ programs and the circuits after high-level synthesis. Existing methods for testing behavioral discrepancies in HLS are still immature, and the testing workflow requires significant human efforts. To address this challenge, we propose HLSTester, a large language model (LLM) aided testing framework that efficiently detects behavioral discrepancies in HLS. To mitigate hallucinations in LLMs and enhance prompt quality, the testbenches for original C/C++ programs are leveraged to guide LLMs in generating HLS-compatible testbenches, effectively eliminating certain traditional C/C++ constructs that are incompatible with HLS tools. Key variables are pinpointed through a backward slicing technique in both C/C++ and HLS programs to monitor their runtime spectra, enabling an in-depth analysis of the discrepancy symptoms. To reduce test time, a testing input generation mechanism is introduced to integrate dynamic mutation with insights from an LLM-based progressive reasoning chain. In addition, repetitive hardware testing is skipped by a redundancy-aware filtering technique for the generated test inputs. Experimental results demonstrate that the proposed LLM-aided testing framework significantly accelerates the testing workflow while achieving higher testbench simulation pass rates compared with the traditional method and the direct use of LLMs on the same HLS programs.

在高层次综合(HLS)中,带有综合指令的C/C++程序被用于生成面向FPGA实现的电路。然而,这些实现中与硬件相关、依赖平台的特性,可能导致原始C/C++程序与高层次综合后的电路之间出现行为差异。现有的HLS行为差异测试方法仍不成熟,测试流程需要大量人工投入。为应对这一挑战,我们提出HLSTester,一个由大语言模型(LLM)辅助的测试框架,可高效检测HLS中的行为差异。为减轻LLM的幻觉并提高提示质量,我们利用原始C/C++程序的测试平台来引导LLM生成与HLS兼容的测试平台,从而有效消除与HLS工具不兼容的某些传统C/C++结构。通过后向切片技术在C/C++和HLS程序中定位关键变量并监测其运行时谱,从而对差异症状进行深入分析。为缩短测试时间,我们引入一种测试输入生成机制,将动态变异与基于LLM的渐进式推理链所得的洞见相结合。此外,针对生成的测试输入,冗余感知过滤技术可跳过重复的硬件测试。实验结果表明,与传统方法以及在相同HLS程序上直接使用LLM相比,所提出的LLM辅助测试框架显著加快了测试流程,同时取得了更高的测试平台仿真通过率。
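To make the notion of a behavioral discrepancy concrete, here is a minimal Python sketch. It is not HLSTester itself: it assumes, purely for illustration, that the discrepancy comes from a narrower accumulator on the hardware side, and it compares the per-step trace of one key variable between a 32-bit "software" model and an 8-bit "hardware" model. The helper names (`wrap_signed`, `accumulate_trace`, `first_discrepancy`, `filter_redundant`) are hypothetical, and the de-duplication filter is only a simple stand-in for the redundancy-aware filtering the abstract mentions.

```python
def wrap_signed(value, bits):
    """Wrap an integer into a signed two's-complement range of the given width."""
    mask = (1 << bits) - 1
    value &= mask
    if value >= 1 << (bits - 1):
        return value - (1 << bits)
    return value


def accumulate_trace(samples, bits):
    """Return the value of the accumulator (the monitored key variable) after each step."""
    acc, trace = 0, []
    for sample in samples:
        acc = wrap_signed(acc + sample, bits)
        trace.append(acc)
    return trace


def first_discrepancy(samples, sw_bits=32, hw_bits=8):
    """Compare software and hardware traces; return (step, sw_value, hw_value) or None."""
    sw_trace = accumulate_trace(samples, sw_bits)
    hw_trace = accumulate_trace(samples, hw_bits)
    for step, (sw_value, hw_value) in enumerate(zip(sw_trace, hw_trace)):
        if sw_value != hw_value:
            return step, sw_value, hw_value
    return None


def filter_redundant(test_inputs):
    """Drop inputs that were already scheduled (assumed stand-in for redundancy filtering)."""
    seen, kept = set(), []
    for samples in test_inputs:
        key = tuple(samples)
        if key not in seen:
            seen.add(key)
            kept.append(samples)
    return kept
```

With the input `[100, 20, 10]` the two traces agree for the first two steps and diverge at the third, where the 8-bit accumulator wraps around; small inputs such as `[1, 2, 3]` never expose the difference, which is why input generation matters.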


Article 144

Title@2025-07-24 (4): Exploring the Landscape of Fairness Interventions in Software Engineering

Title: Exploring the Landscape of Fairness Interventions in Software Engineering Erforschung der Landschaft der Fairness-Interventionen in der Software-Engineering 探索软件工程中公平干预的景观 2507.18726v1

Authors (1): Sadia Afrin Mim

Recent developments in AI have made it broadly significant for reducing human labor and expenses across several essential domains, including healthcare and finance. However, applying AI in the real world poses multiple risks and disadvantages due to potential risk factors in data (e.g., biased datasets). Practitioners have developed a number of fairness interventions to address these kinds of problems. The paper acts as a survey, summarizing the various studies and approaches that have been developed to address fairness issues.

人工智能(AI)的最新发展使其在医疗和金融等多个重要领域对减少人力和开支具有广泛意义。然而,由于数据中的潜在风险因素(如有偏的数据集),AI在现实世界中的应用带来了多重风险和弊端。从业者为解决这类问题开发了多种公平性干预措施。本文是一篇综述,总结了为解决公平性问题而提出的各种研究和方法。


Article 145

Title@2025-07-24 (4): AccessGuru: Leveraging LLMs to Detect and Correct Web Accessibility Violations in HTML Code

Title: AccessGuru: Leveraging LLMs to Detect and Correct Web Accessibility Violations in HTML Code AccessGuru: LLMs zur Erkennung und korrekten Web-Zugänglichkeit von Verstößen im HTML-Code nutzen AccessGuru:利用LLMS检测和纠正HTML代码中违反网络无障碍的情况 2507.19549v1

Authors (3): Nadeen Fathallah, Daniel Hernández, Steffen Staab

The vast majority of Web pages fail to comply with established Web accessibility guidelines, excluding a range of users with diverse abilities from interacting with their content. Making Web pages accessible to all users requires dedicated expertise and additional manual efforts from Web page providers. To lower their efforts and promote inclusiveness, we aim to automatically detect and correct Web accessibility violations in HTML code. While previous work has made progress in detecting certain types of accessibility violations, the problem of automatically detecting and correcting accessibility violations remains an open challenge that we address. We introduce a novel taxonomy classifying Web accessibility violations into three key categories - Syntactic, Semantic, and Layout. This taxonomy provides a structured foundation for developing our detection and correction method and redefining evaluation metrics. We propose a novel method, AccessGuru, which combines existing accessibility testing tools and Large Language Models (LLMs) to detect violations and applies taxonomy-driven prompting strategies to correct all three categories. To evaluate these capabilities, we develop a benchmark of real-world Web accessibility violations. Our benchmark quantifies syntactic and layout compliance and judges semantic accuracy through comparative analysis with human expert corrections. Evaluation against our benchmark shows that AccessGuru achieves up to 84% average violation score decrease, significantly outperforming prior methods that achieve at most 50%.

绝大多数网页未能遵守既定的网页无障碍准则,致使能力各异的许多用户无法与其内容互动。要让所有用户都能访问网页,网页提供者需要具备专门知识并付出额外的人工努力。为减轻其负担并促进包容性,我们的目标是自动检测并纠正HTML代码中的网页无障碍违规。虽然先前的工作在检测某些类型的无障碍违规方面取得了进展,但自动检测并纠正无障碍违规仍是一个有待解决的挑战,也正是我们要处理的问题。我们提出一种新的分类法,将网页无障碍违规分为三大类:语法、语义和布局。该分类法为开发我们的检测与纠正方法以及重新定义评估指标提供了结构化基础。我们提出一种新方法AccessGuru,它将现有的无障碍测试工具与大语言模型(LLM)相结合来检测违规,并采用由分类法驱动的提示策略来纠正全部三类违规。为评估这些能力,我们构建了一个由真实世界网页无障碍违规组成的基准。该基准量化语法和布局方面的合规性,并通过与人类专家修正结果的比较分析来评判语义准确性。在该基准上的评估表明,AccessGuru最多可使平均违规分数降低84%,显著优于至多降低50%的已有方法。


Article 146

Title@2025-07-24 (4): 3D Software Synthesis Guided by Constraint-Expressive Intermediate Representation

Title: 3D Software Synthesis Guided by Constraint-Expressive Intermediate Representation 3D-Software-Synthese geführt durch eingeschränkt-expressive Zwischendarstellung 3D 由限制性中等代表制指导的软件合成 2507.18625v1

Authors (5): Shuqing Li, Anson Y. Lam, Yun Peng, Wenxuan Wang, Michael R. Lyu

Graphical user interface (UI) software has undergone a fundamental transformation from traditional two-dimensional (2D) desktop/web/mobile interfaces to spatial three-dimensional (3D) environments. While existing work has achieved remarkable success in automated 2D software generation, such as HTML/CSS and mobile app interface code synthesis, the generation of 3D software remains under-explored. Current methods for 3D software generation usually generate the 3D environment as a whole and cannot modify or control specific elements in the software. Furthermore, these methods struggle to handle the complex spatial and semantic constraints inherent in the real world. To address these challenges, we present Scenethesis, a novel requirement-sensitive 3D software synthesis approach that maintains formal traceability between user specifications and generated 3D software. Scenethesis is built upon ScenethesisLang, a domain-specific language that serves as a granular constraint-aware intermediate representation (IR) to bridge natural language requirements and executable 3D software. It serves both as a comprehensive scene description language enabling fine-grained modification of 3D software elements and as a formal constraint-expressive specification language capable of expressing complex spatial constraints. By decomposing 3D software synthesis into stages operating on ScenethesisLang, Scenethesis enables independent verification, targeted modification, and systematic constraint satisfaction. Our evaluation demonstrates that Scenethesis accurately captures over 80% of user requirements and satisfies more than 90% of hard constraints while handling over 100 constraints simultaneously. Furthermore, Scenethesis achieves a 42.8% improvement in BLIP-2 visual evaluation scores compared to the state-of-the-art method.

图形用户界面(UI)软件已经经历了从传统的二维(2D)桌面/网络/移动界面向空间三维(3D)环境的根本性转变。虽然现有工作在自动生成2D软件方面取得了显著的成功,例如HTML/CSS和移动应用程序界面代码合成,但3D软件的生成仍然未得到充分探索。目前3D软件生成方法通常产生整个3D环境,无法修改或控制软件中的具体内容。此外,这些方法努力处理现实世界中固有的复杂的空间和语义限制。为了应对挑战,我们提出了Senesisis,一种新的对要求敏感的3D软件合成方法的3D软件合成方法,在用户规格和生成的3D软件之间保持正式的可追溯性可追溯性可追溯性可变性。 Senemexexis,在表达复杂空间约束的系统化系统化语言上,通过直观的Scentrical-L 系统化语言,在显示我们系统化的系统化系统化系统化语言上,在显示对硬性语言的精确性缩缩度上,在显示我们系统化的缩缩缩缩缩化的缩缩缩。


Article 147

Title@2025-07-24 (4): OpenCAMS: An Open-Source Connected and Automated Mobility Co-Simulation Platform for Advancing Next-Generation Intelligent Transportation Systems Research

Title: OpenCAMS: An Open-Source Connected and Automated Mobility Co-Simulation Platform for Advancing Next-Generation Intelligent Transportation Systems Research OpenCAMS: Eine Open-Source vernetzte und automatisierte Mobilitäts-Co-Simulationsplattform für die Weiterentwicklung der Forschung für intelligente Transportsysteme der nächsten Generation OpenCAMS: 推进下一轮智能运输系统研究的开放源码连接和自动化流动联合模拟平台 2507.09186v3

Authors (4): Minhaj Uddin Ahmad, Akid Abrar, Sagar Dasgupta, Mizanur Rahman

We introduce OpenCAMS (Open-Source Connected and Automated Mobility Co-Simulation Platform), an open-source, synchronized, and extensible co-simulation framework that tightly couples three best-in-class simulation tools: (i) SUMO, (ii) CARLA, and (iii) OMNeT++. OpenCAMS is designed to support advanced research in transportation safety, mobility, and cybersecurity by combining the strengths of each simulation domain. Specifically, SUMO provides large-scale, microscopic traffic modeling; CARLA offers high-fidelity 3D perception, vehicle dynamics, and control simulation; and OMNeT++ enables modular, event-driven network communication, such as cellular vehicle-to-everything (C-V2X). OpenCAMS employs a time-synchronized, bidirectional coupling architecture that ensures coherent simulation progression across traffic, perception, and communication domains while preserving modularity and reproducibility. For example, CARLA can simulate and render a subset of vehicles that require detailed sensor emulation and control logic; SUMO orchestrates network-wide traffic flow, vehicle routing, and traffic signal management; and OMNeT++ dynamically maps communication nodes to both mobile entities (e.g., vehicles) and static entities (e.g., roadside units) to enable C-V2X communication. While these three simulators form the foundational core of OpenCAMS, the platform is designed to be expandable and future-proof, allowing additional simulators to be integrated on top of this core without requiring fundamental changes to the system architecture. The OpenCAMS platform is fully open-source and publicly available through its GitHub repository https://github.com/minhaj6/carla-sumo-omnetpp-cosim, providing the research community with an accessible, flexible, and collaborative environment for advancing next-generation intelligent transportation systems.

我们介绍OpenCAMS(开源网联与自动化出行联合仿真平台),这是一个开源、同步且可扩展的联合仿真框架,紧密耦合了三种一流的仿真工具:(i) SUMO、(ii) CARLA 和 (iii) OMNeT++。OpenCAMS旨在结合各仿真领域的优势,支持交通安全、出行和网络安全方面的前沿研究。具体而言,SUMO提供大规模微观交通建模;CARLA提供高保真3D感知、车辆动力学和控制仿真;OMNeT++支持模块化、事件驱动的网络通信,例如蜂窝车联网(C-V2X)。OpenCAMS采用时间同步的双向耦合架构,在保持模块化和可复现性的同时,确保交通、感知和通信各领域的仿真进程协调一致。例如,CARLA可以仿真并渲染需要详细传感器模拟和控制逻辑的部分车辆;SUMO负责统筹全路网的交通流、车辆路径规划和交通信号管理;OMNeT++则将通信节点动态映射到移动实体(如车辆)和静态实体(如路侧单元),以实现C-V2X通信。这三个仿真器构成了OpenCAMS的基础核心,而平台在设计上具有可扩展性并面向未来,允许在该核心之上集成更多仿真器,而无需对系统架构作根本性改动。OpenCAMS平台完全开源,可通过其GitHub仓库 https://github.com/minhaj6/carla-sumo-omnetpp-cosim 公开获取,为研究社区提供了一个易于使用、灵活且便于协作的环境,以推进下一代智能交通系统的发展。
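The time-synchronized, bidirectional coupling described above can be illustrated with a generic lock-step loop. The following Python sketch is an assumption-laden toy, not OpenCAMS code: `StubSimulator` and `run_lockstep` are hypothetical names, no SUMO, CARLA, or OMNeT++ API is called, and time is kept in integer milliseconds to avoid floating-point drift. Each simulator advances with its own internal step size up to a shared barrier, after which every simulator receives the updates produced by the others.

```python
class StubSimulator:
    """Hypothetical stand-in for one simulator (traffic, rendering, or network)."""

    def __init__(self, name, step_ms):
        self.name = name
        self.step_ms = step_ms  # internal step size in milliseconds
        self.time_ms = 0        # local simulation clock
        self.inbox = []         # updates received from the other simulators

    def advance_to(self, target_ms):
        # A real adapter would apply the updates in self.inbox before stepping.
        self.inbox.clear()
        steps = 0
        while self.time_ms + self.step_ms <= target_ms:
            self.time_ms += self.step_ms
            steps += 1
        return {"source": self.name, "time_ms": self.time_ms, "steps": steps}


def run_lockstep(simulators, sync_ms, end_ms):
    """Advance all simulators barrier by barrier and exchange updates at each barrier."""
    log = []
    for barrier in range(sync_ms, end_ms + 1, sync_ms):
        updates = [sim.advance_to(barrier) for sim in simulators]
        # Bidirectional coupling: each simulator sees what the others produced.
        for sim in simulators:
            sim.inbox.extend(u for u in updates if u["source"] != sim.name)
        log.append((barrier, [u["time_ms"] for u in updates]))
    return log
```

For example, three stubs with 100 ms, 50 ms, and 10 ms internal steps and a 100 ms barrier all report the same clock value at every barrier, which is the coherence property the abstract describes; the choice of barrier interval here is arbitrary.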


Article 148

Title@2025-07-24 (4): Are AI-Generated Fixes Secure? Analyzing LLM and Agent Patches on SWE-bench

Title: Are AI-Generated Fixes Secure? Analyzing LLM and Agent Patches on SWE-bench Sind KI-erzeugte Fixes sicher? LLM und Agent Patches auf der SWE-Bench analysieren AI - 具有安全性吗? 分析SWE-bench 上的LLM 和代理补丁 2507.02976v2

Authors (3): Amirali Sajadi, Kostadin Damevski, Preetha Chatterjee

Large Language Models (LLMs) and their agentic frameworks are increasingly adopted to automate software development tasks such as issue resolution and program repair. While prior work has identified security risks in LLM-generated code, most evaluations have focused on synthetic or isolated settings, leaving open questions about the security of these systems in real-world development contexts. In this study, we present the first large-scale security analysis of LLM-generated patches using 20,000+ issues from the SWE-bench dataset. We evaluate patches produced by a standalone LLM (Llama 3.3) and compare them to developer-written patches. We also assess the security of patches generated by three top-performing agentic frameworks (OpenHands, AutoCodeRover, HoneyComb) on a subset of our data. Finally, we analyze a wide range of code, issue, and project-level factors to understand the conditions under which LLMs and agents are most likely to generate insecure code. Our findings reveal that the standalone LLM introduces nearly 9x more new vulnerabilities than developers, with many of these exhibiting unique patterns not found in developers’ code. Agentic workflows also generate a significant number of vulnerabilities, particularly when granting LLMs more autonomy, potentially increasing the likelihood of misinterpreting project context or task requirements. We find that vulnerabilities are more likely to occur in LLM patches associated with a higher number of files, more lines of generated code, and GitHub issues that lack specific code snippets or information about the expected code behavior and steps to reproduce. These results suggest that contextual factors play a critical role in the security of the generated code and point toward the need for proactive risk assessment methods that account for both code and issue-level information to complement existing vulnerability detection tools.

大型语言模型(LLMS)及其代理框架日益被采用,以自动化软件开发任务,如问题解析和程序修补等。虽然先前的工作已经查明LLM生成代码的安全风险,但大多数评价侧重于合成或孤立的设置,留下了关于这些系统在现实世界开发背景下的安全的开放问题。在这项研究中,我们利用SWE-bench数据集的20,000个+问题对LLMS生成的补丁进行了首次大规模安全分析。我们评估了独立LM(Llama3.3)生成的补丁,并将其与开发者制作的补丁进行比较。我们还评估了三个顶级代理框架(OpenHands、AutoCodeRover、HoneComb)生成的安全风险风险风险。最后,我们分析了一系列广泛的代码、问题和项目级因素,以了解LWMS和代理商最有可能生成不安全代码的条件。我们发现,独立LMRM比开发者引入了近9x新的脆弱性,其中很多这些显示在开发者代码中无法找到的独特模式, 也显示在高级代码中可能增加磁带风险的路径。


Article 149

Title@2025-07-24 (4): On the Structure and Semantics of Identifier Names Containing Closed Syntactic Category Words

Title: On the Structure and Semantics of Identifier Names Containing Closed Syntactic Category Words Über die Struktur und Semantik von Identifier-Namen, die geschlossene syntaktische Kategorie Wörter enthalten 关于含有闭合同步词类的标识名称的结构和语义 2505.18444v4

Authors (11): Christian D. Newman, Anthony Peruma, Eman Abdullah AlOmar, Mahie Crabbe, Syreen Banabilah, Reem S. AlSuhaibani, Michael J. Decker, Farhad Akhbardeh, Marcos Zampieri, Mohamed Wiem Mkaouer, Jonathan I. Maletic

Identifier names are crucial components of code, serving as primary clues for developers to understand program behavior. This paper investigates the linguistic structure of identifier names by extending the concept of grammar patterns, which represent the part-of-speech (PoS) sequences underlying identifier phrases. The specific focus is on closed syntactic categories (e.g., prepositions, conjunctions, determiners), which are rarely studied in software engineering despite their central role in general natural language. To study these categories, the Closed Category Identifier Dataset (CCID), a new manually annotated dataset of 1,275 identifiers drawn from 30 open-source systems, is constructed and presented. The relationship between closed-category grammar patterns and program behavior is then analyzed using grounded-theory-inspired coding, statistical, and pattern analysis. The results reveal recurring structures that developers use to express concepts such as control flow, data transformation, temporal reasoning, and other behavioral roles through naming. This work contributes an empirical foundation for understanding how linguistic resources encode behavior in identifier names and supports new directions for research in naming, program comprehension, and education.

标识名称是代码的关键组成部分, 是开发者理解程序行为的主要线索 。 本文通过扩展语法模式的概念来调查标识名称的语言结构 。 语法模式代表了语法序列部分( POS) 基本识别短语。 具体重点是封闭的合成类别( 如预设、 连线、 确定者 ) , 尽管这些类别在一般自然语言中具有核心作用, 但这些类别很少在软件工程中研究 。 要研究这些类别, 封闭类识别数据集( CICID) 是一个新的人工手动数据集, 由来自30个开放源系统的1,275个标识组成。 封闭类语法模式与程序行为之间的关系随后通过基于理论的编码、 统计和模式分析加以分析 。 结果揭示了开发者用来表达控制流、 数据转换、 时间推理和其他行为作用等概念的经常性结构 。 这项工作为理解识别名称的语言资源如何编码行为和支持命名、 方案理解和教育研究的新方向提供了经验基础 。
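As a rough illustration of what a closed-category grammar pattern looks like, the Python sketch below splits an identifier into words and marks prepositions, conjunctions, and determiners. It is a simplification under stated assumptions: the word lists and the tag names (`P`, `CJ`, `DT`, and `X` for everything else) are illustrative choices made here, not the CCID annotation scheme, and no real part-of-speech tagger is involved.

```python
import re

# Small illustrative word lists (assumed for this sketch, not taken from the dataset).
CLOSED_CATEGORY = {
    "P": {"to", "from", "in", "on", "by", "with", "of", "for", "at", "as"},  # prepositions
    "CJ": {"and", "or", "if", "but", "while"},                               # conjunctions
    "DT": {"the", "a", "an", "this", "that", "all", "each", "any", "no"},    # determiners
}


def split_identifier(name):
    """Split snake_case and camelCase identifiers into lowercase words."""
    words = []
    for part in re.split(r"[_\W]+", name):
        # Acronym runs, capitalized or lowercase words, and digit runs.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [word.lower() for word in words]


def closed_category_pattern(name):
    """Return a tag sequence where closed-category words are labeled and others are X."""
    tags = []
    for word in split_identifier(name):
        tag = next((t for t, vocab in CLOSED_CATEGORY.items() if word in vocab), "X")
        tags.append(tag)
    return " ".join(tags)
```

Under these assumptions, `convertToString` yields `X P X` (a transformation expressed through the preposition "to"), while `nameOrId` yields `X CJ X`.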


Article 150

Title@2025-07-24 (4): A Deep Dive into Retrieval-Augmented Generation for Code Completion: Experience on WeChat

Title: A Deep Dive into Retrieval-Augmented Generation for Code Completion: Experience on WeChat Ein tiefer Tauchgang in die retrieval-angereicherte Generation zur Code-Vervollständigung: Erfahrung auf WeChat 为完成代码的完成而深入挖掘回收的一代人:关于 WeChat 的经验 2507.18515v1

Authors (6): Zezhou Yang, Ting Peng, Cuiyun Gao, Chaozheng Wang, Hailiang Huang, Yuetang Deng

Code completion, a crucial task in software engineering that enhances developer productivity, has seen substantial improvements with the rapid advancement of large language models (LLMs). In recent years, retrieval-augmented generation (RAG) has emerged as a promising method to enhance the code completion capabilities of LLMs, which leverages relevant context from codebases without requiring model retraining. While existing studies have demonstrated the effectiveness of RAG on public repositories and benchmarks, the potential distribution shift between open-source and closed-source codebases presents unique challenges that remain unexplored. To mitigate the gap, we conduct an empirical study to investigate the performance of widely-used RAG methods for code completion in the industrial-scale codebase of WeChat, one of the largest proprietary software systems. Specifically, we extensively explore two main types of RAG methods, namely identifier-based RAG and similarity-based RAG, across 26 open-source LLMs ranging from 0.5B to 671B parameters. For a more comprehensive analysis, we employ different retrieval techniques for similarity-based RAG, including lexical and semantic retrieval. Based on 1,669 internal repositories, we achieve several key findings: (1) both RAG methods demonstrate effectiveness in closed-source repositories, with similarity-based RAG showing superior performance, (2) the effectiveness of similarity-based RAG improves with more advanced retrieval techniques, where BM25 (lexical retrieval) and GTE-Qwen (semantic retrieval) achieve superior performance, and (3) the combination of lexical and semantic retrieval techniques yields optimal results, demonstrating complementary strengths. Furthermore, we conduct a developer survey to validate the practical utility of RAG methods in real-world development environments.

守则的完成是软件工程的关键任务,提高了开发者生产率,随着大型语言模型(LLMs)的快速进步,守则的完成有了实质性的改进。近年来,检索增强的生成(RAG)已成为提高LLM公司代码完成能力的一个很有希望的方法,LM公司从代码库中利用相关背景,而无需进行示范再培训。虽然现有的研究表明RAG公司对公共储存库和基准的有效性,开放源码库和封闭源码库之间的潜在分配变化带来了尚未探讨的独特挑战。为了缩小差距,我们开展了一项实证研究,以调查在最大自有软件系统之一即WeChat工业规模的代码库中广泛使用的RAG完成代码的方法(RAG)的性能。 具体而言,我们广泛探索了两种主要类型的RAG方法,即基于标识的RAG和类似源码的源码,范围从0.5B到671B参数的开放源码,对基于类似RAG的检索和语义检索系统进行不同的检索技术。基于1,在1,669内部存储库中,我们展示了类似于RAG的高级读取方法的高级版本。
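The lexical side of similarity-based RAG can be sketched with a plain Okapi BM25 scorer over code snippets. The Python below is a minimal illustration, not the study's pipeline: the tokenizer and the parameters `k1=1.5` and `b=0.75` are common defaults assumed here, and `reciprocal_rank_fusion` is one generic way to merge a lexical ranking with a semantic one. The abstract does not say how the two retrievers were combined, so the fusion step is purely an assumption.

```python
import math
import re
from collections import Counter


def tokenize(code):
    """Lowercase the text and extract identifier-like tokens."""
    return re.findall(r"[A-Za-z_][A-Za-z0-9_]*", code.lower())


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with Okapi BM25."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))
    query_terms = set(tokenize(query))
    scores = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if term not in term_freq:
                continue
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = term_freq[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * term_freq[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several rankings of document ids; ties are broken by document id."""
    fused = Counter()
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return [doc_id for doc_id, _ in sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))]
```

In use, the top-scoring snippets would be prepended to the completion prompt; a semantic ranking from an embedding model could be passed to `reciprocal_rank_fusion` alongside the BM25 ranking.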


Article 151

Title@2025-07-24 (4): Automated Code Review Using Large Language Models with Symbolic Reasoning

Title: Automated Code Review Using Large Language Models with Symbolic Reasoning Automatisierte Code-Überprüfung mit großen Sprachmodellen mit symbolischer Begründung 使用有符号理由的大语言模型的自动码审查 2507.18476v1

Authors (2): Busra Icoz, Goksel Biricik

Code review is one of the key processes in the software development lifecycle and is essential to maintain code quality. However, manual code review is subjective and time consuming. Given its rule-based nature, code review is well suited for automation. In recent years, significant efforts have been made to automate this process with the help of artificial intelligence. Large Language Models (LLMs) have also emerged as a promising tool in this area, but these models often lack the logical reasoning capabilities needed to fully understand and evaluate code. To overcome this limitation, this study proposes a hybrid approach that integrates symbolic reasoning techniques with LLMs to automate the code review process. We tested our approach using the CodeXGLUE dataset, comparing several models, including CodeT5, CodeBERT, and GraphCodeBERT, to assess the effectiveness of combining symbolic reasoning and prompting techniques with LLMs. Our results show that this approach improves the accuracy and efficiency of automated code review.

代码审查是软件开发生命周期的关键过程之一,对维护代码质量至关重要。但是,人工代码审查是主观的,耗费时间。鉴于其基于规则的性质,代码审查非常适合自动化。近年来,在人工智能的帮助下,为将这一过程自动化做出了重大努力。大语言模型(LLMS)的近期发展也成为这一领域的一个很有希望的工具,但这些模型往往缺乏充分理解和评估代码所需的逻辑推理能力。为克服这一限制,本研究报告提出了一种混合方法,将象征性推理技术与LLOMS结合起来,使代码审查过程自动化。我们用代码Glue数据集测试了我们的方法,比较了包括代码T5、DCBERT和GreagCodeBERT在内的若干模型,以评估将符号推理和提示技术与LMSMs相结合的有效性。我们的成果表明,这一方法提高了自动代码审查的准确性和效率。


Article 152

Title@2025-07-24 (4): Exploring and Evaluating Interplays of BPpy with Deep Reinforcement Learning and Formal Methods

Title: Exploring and Evaluating Interplays of BPpy with Deep Reinforcement Learning and Formal Methods Erforschung und Auswertung von Interplays von BPpy mit Deep Reinforcement Learning und Formal Methods 探索和评价与深强化学习和正规方法的BPpy的相互作用 2501.15480v2

Authors (5): Tom Yaacov, Gera Weiss, Adiel Ashrov, Guy Katz, Jules Zisser

We explore and evaluate the interactions between Behavioral Programming (BP) and a range of Artificial Intelligence (AI) and Formal Methods (FM) techniques. Our goal is to demonstrate that BP can serve as an abstraction that integrates various techniques, enabling a multifaceted analysis and a rich development process. Specifically, the paper examines how the BPpy framework, a Python-based implementation of BP, is enhanced by and enhances various FM and AI tools. We assess how integrating BP with tools such as Satisfiability Modulo Theories (SMT) solvers, symbolic and probabilistic model checking, and Deep Reinforcement Learning (DRL) allows us to scale the abilities of BP to model complex systems. Additionally, we illustrate how developers can leverage multiple tools within a single modeling and development task. The paper provides quantitative and qualitative evidence supporting the feasibility of our vision to create a comprehensive toolbox for harnessing AI and FM methods in a unified development framework.

我们探索和评价行为规划技术与一系列人工智能和正规方法(FM)技术之间的互动关系。我们的目标是证明BP可以作为一种抽象的抽象,将各种技术结合起来,进行多方面的分析,开展丰富的发展进程。具体地说,该文件审查了BPpy框架,一个基于Python的BPy执行BP,是如何通过各种调频和AI工具得到加强和加强的。我们评估如何将BP与满足性Modulo Theory(SMT)解答器、象征性和概率模型检查以及深强化学习(DRL)等工具相结合,使我们能够扩大BP的能力,以模拟复杂的系统。此外,我们说明了开发者如何在单一的模型和开发任务中利用多种工具。文件提供了定量和定性证据,支持我们建立一个综合工具箱,在统一的发展框架中利用AI和调频方法的愿景的可行性。


Article 153

Title@2025-07-24 (4): It is Giving Major Satisfaction: Why Fairness Matters for Software Practitioners

Title: It is Giving Major Satisfaction: Why Fairness Matters for Software Practitioners Es gibt große Zufriedenheit: Warum Fairness für Software-Praktiker wichtig ist 它给予重大满意:为什么软件从业人员的公平问题? 2410.02482v5

Authors (3): Emeralda Sesari, Federica Sarro, Ayushi Rastogi

Software practitioners often encounter workplace unfairness, such as unequal recognition and gender bias. While the link between fairness and job satisfaction has been established in other fields, its relevance to software professionals remains underexplored. This study examines how fairness perceptions relate to job satisfaction among software practitioners, focusing on both general trends and demographic-specific differences. We conducted an online survey of 108 software practitioners, followed by ordinal logistic regression to analyze the relationship between fairness perceptions and job satisfaction in software engineering contexts, with moderation analysis examining how this relationship varies across demographic groups. Our findings indicate that all four fairness dimensions (namely distributive, procedural, interpersonal, and informational fairness) significantly affect overall job satisfaction and satisfaction with job security. Among these, interpersonal fairness has the biggest impact. The relationship between fairness and job satisfaction is stronger for female, ethnically underrepresented, less experienced practitioners, and those with work limitations. Fairness in authorship emerged as an important factor for job satisfaction collectively, while fairness in policy implementation, high-demand situations, and working hours impacted specific demographic groups. This study highlights the role of fairness among software practitioners, offering strategies for organizations to promote fair practices and targeted approaches for certain demographic groups.

软件从业者往往遇到工作场所的不公平,例如不平等的承认和性别偏见。虽然公平与工作满意度之间的联系在其他领域已经确立,但与软件专业人员的相关性仍未得到充分探讨。本研究报告审查了公平观念如何与软件从业者的工作满意度相关,侧重于一般趋势和具体人口差异。我们对108名软件从业者进行了在线调查,随后是标准后勤回归,以分析软件工程方面的公平观念与工作满意度之间的关系,同时进行适度分析,审查这种关系在人口群体之间如何不同。我们的调查结果表明,所有四个公平层面(即分配、程序、人际和信息公平)都严重影响了工作满意度和对工作安全的总体满意度。其中,人际公平具有最大的影响。公平与工作满意度之间的关系对女性、族裔代表性不足、经验较少的从业者和有工作限制的人来说更为密切。作者的公平是集体满意度的一个重要因素,而政策执行的公平性、高需求情况和工作时间则影响到特定人口群体。本研究报告强调了软件从业者之间的公平性作用,为各组织提供了促进公平做法和针对某些人口群体的定向办法的战略。


Article 154

Title@2025-07-24 (4): FMI Meets SystemC: A Framework for Cross-Tool Virtual Prototyping

Title: FMI Meets SystemC: A Framework for Cross-Tool Virtual Prototyping FMI trifft SystemC: Ein Rahmen für das Cross-Tool Virtual Prototyping FMI 满足系统C:跨工具虚拟原型框架 2507.18339v1

Authors (5): Nils Bosbach, Meik Schmidt, Lukas Jünger, Matthias Berthold, Rainer Leupers

As systems become more complex, the demand for thorough testing and virtual prototyping grows. To simulate whole systems, multiple tools are usually needed to cover different parts. These parts include the hardware of a system and the environment with which the system interacts. The Functional Mock-up Interface (FMI) standard for co-simulation can be used to connect these tools. The control part of modern systems is usually a computing unit, such as a System-on-a-Chip (SoC) or Microcontroller Unit (MCU), which executes software from a connected memory and interacts with peripherals. To develop software without requiring access to physical hardware, full-system simulators, the so-called Virtual Platforms (VPs), are commonly used. The IEEE-standardized framework for VP development is SystemC TLM. SystemC provides interfaces and concepts that enable modular design and model exchange. However, SystemC lacks native FMI support, which limits the integration into broader co-simulation environments. This paper presents a novel framework to control and interact with SystemC-based VPs using the FMI. We present a case study showing how a simulated temperature sensor in a SystemC simulation can obtain temperature values from an external tool via FMI. This approach allows the unmodified target software to run on the VP and receive realistic environmental input data such as temperature, velocity, or acceleration values from other tools. Thus, extensive software testing and verification is enabled. By having tests ready and the software pre-tested using a VP once the physical hardware is available, certifications like ISO 26262 can be done earlier.

随着系统变得更为复杂,对彻底测试和虚拟原型的需求日益增长。要模拟整个系统,通常需要多种工具来覆盖不同部分。这些部分包括系统硬件和系统互动环境。可以使用功能模拟界面(FMI)标准,用于共同模拟这些工具。现代系统的控制部分通常是一个计算单位,例如系统对立系统(SoC)或微控制器(MCU),它从连接的记忆中执行软件,并与外围环境互动。开发软件不需要使用物理硬件,系统对软件进行全面系统模拟,并使用所谓的虚拟平台(VPs),这些部分通常使用功能模拟界面接口(FMI)标准化框架,用于共同模拟这些工具。系统C提供界面和概念,便于模块设计和模型交换。然而,系统C缺乏本地的FMI支持,它提供了一个与基于系统(VP)的同步存储和互动的新框架,使用FMI的系统(OFP)的全系统快速化软件模拟系统(VP),我们用这种系统对系统进行快速的测试,我们用系统(FMI)的系统进行测试,可以让外部的系统对服务器进行测试系统进行测试。我们用一个测试,可以让外部的系统进行这样的服务器进行这样的测试。


Article 155

Title@2025-07-24 (4): LLMShot: Reducing snapshot testing maintenance via LLMs

Title: LLMShot: Reducing snapshot testing maintenance via LLMs LLMShot: Reduzierung der Snapshot-Test-Wartung über LLMs LLMShot:减少通过LLMM减少快速测试维护 2507.10062v2

Authors (4): Ergün Batuhan Kaynak, Mayasah Lami, Sahand Moslemi, Anil Koyuncu

Snapshot testing has emerged as a critical technique for UI validation in modern software development, yet it suffers from substantial maintenance overhead due to frequent UI changes causing test failures that require manual inspection to distinguish between genuine regressions and intentional design changes. This manual triage process becomes increasingly burdensome as applications evolve, creating a need for automated analysis solutions. This paper introduces LLMShot, a novel framework that leverages Vision-Language Models (VLMs) to automatically analyze snapshot test failures through semantic classification of UI changes. To evaluate LLMShot’s effectiveness, we developed a comprehensive dataset using a feature-rich iOS application with configurable feature flags, creating realistic scenarios that produce authentic snapshot differences representative of real development workflows. Our evaluation using Gemma3 models demonstrates strong classification performance, with the 12B variant achieving over 84% recall in identifying failure root causes while the 4B model offers practical deployment advantages with acceptable performance for continuous integration environments. However, our exploration of selective ignore mechanisms revealed significant limitations in current prompting-based approaches for controllable visual reasoning. LLMShot represents the first automated approach to semantic snapshot test analysis, offering developers structured insights that can substantially reduce manual triage effort and advance toward more intelligent UI testing paradigms.

快速抓图测试已成为现代软件开发中UI验证的关键技术,但它却由于频繁的UI变化导致测试失败,需要进行人工检查以区分真正的回归和有意的设计变化,从而导致测试失败,从而导致测试失败,从而导致测试失败。随着应用程序的演变,这种人工裁剪过程变得日益繁琐,从而产生了自动分析解决方案的需要。本文介绍了LloMShot,这是一个利用Vision-Language Models(VLMS)的新型框架,通过对UI变化进行语义分类,自动分析短视测试失败。但是,为了评估LLLMShot的效能,我们开发了一个全面的数据集,使用了具有可配置特征标志的功能丰富的iOS应用程序,从而产生了现实的情景,从而产生真实的快照差异,代表了真实的发展工作流程。我们使用Gemma3模型进行的评估显示了很强的分类性表现,12B变量在查明失败根源方面达到84%以上,而4B模型则为持续整合环境的可接受性能提供实际部署优势。然而,我们对选择性的无视机制的探索揭示了当前快速直视推论方法的巨大局限性。 LLMShot代表了当前对可控性智能模拟模拟模拟模拟模拟模拟模拟模拟模拟测试分析的第一个自动化自动测试。


Article 156

Title@2025-07-24 (4): Gotta catch ‘em all! Towards File Localisation from Issues at Large

Title: Gotta catch ‘em all! Towards File Localisation from Issues at Large Ich muss sie alle fangen! Auf dem Weg zur Dateilokalisierung von Themen im Großen und Ganzen 必须抓住他们所有人! 2507.18319v1

Authors (3): Jesse Maarleveld, Jiapan Guo, Daniel Feitosa

Bug localisation, the study of methods for localising the files that must change to resolve a bug, has long been researched as a way to save developers’ time. Recently, researchers have started to consider issues other than bugs. Nevertheless, most existing research into file localisation from issues focusses on bugs or uses other selection methods to ensure that only certain types of issues are considered. Our goal is to work on all issues at large, without any specific selection. In this work, we provide a data pipeline for the creation of issue file localisation datasets, capable of dealing with arbitrary branching and merging practices. We provide a baseline performance evaluation for the file localisation problem using traditional information retrieval approaches. Finally, we use statistical analysis to investigate the influence of biases known in the bug localisation community on our dataset. Our results show that methods designed using bug-specific heuristics perform poorly on general issue types, indicating a need for research into general purpose models. Furthermore, we find that there are small but statistically significant differences in performance between different issue types. Finally, we find that the presence of identifiers has a small effect on performance for most issue types. Many results are project-dependent, encouraging the development of methods which can be tuned to project-specific characteristics.

错误本地化, 研究如何开发本地化需要修改的文档以解决错误, 长期以来一直在研究开发能够保存开发者时间的方法。 最近, 研究人员开始考虑错误以外的问题。 然而, 大部分现有研究, 从关注错误的问题中存档本地化, 或者使用其他选择方法确保只有某些类型的问题被视为工作焦点的一部分。 我们的目标是在总体上就所有问题开展工作, 而不作任何具体选择。 在这项工作中, 我们为创建问题文件本地化数据集提供了一个数据管道, 能够处理任意的分支化和合并做法。 我们利用传统的信息检索方法为文件本地化问题提供基线绩效评估。 最后, 我们使用统计分析来调查本地化社区中已知的偏差对数据集的影响。 我们的结果显示, 使用特定错误的超常化方法在一般问题类型上表现不佳, 表明需要研究一般目的模型。 此外, 我们发现不同问题类型在性能上存在小但统计上显著的差别。 最后, 我们发现, 我们发现, 不同问题类型中存在最依赖的识别特征类型, 多数项目特性的特性类型都具有迷性。
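To make the "traditional information retrieval" baseline mentioned in the abstract above more concrete, here is a minimal sketch of TF-IDF ranking of candidate files against an issue text. It is not the authors' pipeline: the function names, the tokenizer, the smoothed IDF formula, and the example file paths are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def rank_files(issue_text, files):
    """Rank files by TF-IDF cosine similarity to the issue text.

    `files` maps a (hypothetical) file path to its textual content.
    Returns the paths ordered from most to least similar.
    """
    docs = {path: Counter(tokenize(body)) for path, body in files.items()}
    n_docs = len(docs)
    # Document frequency: in how many candidate files each term occurs.
    df = Counter(term for counts in docs.values() for term in counts)

    def weights(counts):
        # Smoothed IDF (an illustrative choice) so common terms keep a small weight.
        return {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0)
                for t, c in counts.items()}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm = (math.sqrt(sum(w * w for w in a.values()))
                * math.sqrt(sum(w * w for w in b.values())))
        return dot / norm if norm else 0.0

    query = weights(Counter(tokenize(issue_text)))
    scored = [(cosine(query, weights(counts)), path)
              for path, counts in docs.items()]
    # Highest score first; ties are broken alphabetically by path.
    return [path for _, path in sorted(scored, key=lambda s: (-s[0], s[1]))]
```

Calling `rank_files("Login fails when password contains unicode", {...})` on a small dictionary of file contents returns the paths sorted by lexical overlap with the issue. Baselines of this kind rely on shared vocabulary between issue and file, which is the setting in which the abstract's question about the effect of identifiers arises.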


Article 157

Title@2025-07-24 (4): YATE: The Role of Test Repair in LLM-Based Unit Test Generation

Title: YATE: The Role of Test Repair in LLM-Based Unit Test Generation YATE: Die Rolle der Testreparatur bei der LLM-basierten Einheiten-Testgenerierung YATE:在以LLM为基础的单位试验生成中测试修理的作用 2507.18316v1

Authors (5): Michael Konstantinou, Renzo Degiovanni, Jie M. Zhang, Mark Harman, Mike Papadakis

Recent advances in automated test generation utilise language models to produce unit tests. While effective, language models tend to generate many tests that are incorrect with respect to syntax, semantics, or both. Although such incorrect tests can be easily detected and discarded, they constitute a “missed opportunity” – if fixed, they are often valuable as they directly add testing value (they effectively target the underlying program logic to be tested) and indirectly form good seeds for generating additional tests. To this end, we propose a simple technique for repairing some of these incorrect tests through a combination of rule-based static analysis and re-prompting. We evaluate this simple approach, named YATE, on a set of 6 open-source projects and show that it can effectively produce tests that cover on average 32.06% more lines and kill 21.77% more mutants than a plain LLM-based method. We also compare YATE with four other LLM-based methods, namely HITS, SYMPROMPT, TESTSPARK and COVERUP, and show that it produces tests that cover substantially more code. YATE achieves 22% higher line coverage and 20% higher branch coverage, and kills 20% more mutants, at a comparable cost (number of calls to LLMs).

自动测试生成的最近进步利用语言模型来制作单位测试。 虽然语言模型的效果是有效的, 但通常会在语法和语义学方面产生许多不正确的测试。 虽然这些不正确的测试可以很容易地检测和丢弃, 但是它们构成了一个“错失的机会 ” — — 如果固定,它们往往很宝贵,因为它们直接增加了测试价值(它们有效地针对基本的程序逻辑进行测试 ) , 间接地形成良好的种子来生成额外的测试。 为此,我们提出一种简单技术,通过基于规则的静态分析和重新激活相结合来修复其中一些不正确的测试。 我们用一套6个开放源项目来评估这个叫做YATE的简单方法, 并表明它能够有效地生产平均覆盖32.06%以上线的测试,杀死21.77%的变种人,而不是基于普通LM方法。 我们还将YATE与其他四种基于LM方法(即HITS、SYMPROMPPT、TESPARKK和CEURUP)进行比较, 并表明它产生的测试覆盖了相当多得多的代码。 YATEATE达到22%的线段, 20 %的分支覆盖面和杀死20%以上的变种LM(成本)。


Article 158

Title@2025-07-24 (4): Scheduzz: Constraint-based Fuzz Driver Generation with Dual Scheduling

Title: Scheduzz: Constraint-based Fuzz Driver Generation with Dual Scheduling Scheduzz: Fuzz Driver Generation mit Dual Scheduling Scheduzz:基于节制的有双重日程安排的 Fiszz 驱动力生成 2507.18289v1

Authors (7): Yan Li, Wenzhang Yang, Yuekun Wang, Jian Gao, Shaohua Wang, Yinxing Xue, Lijun Zhang

Fuzzing a library requires experts to understand the library usage well and craft high-quality fuzz drivers, which is tricky and tedious. Therefore, many techniques have been proposed to automatically generate fuzz drivers. However, they fail to generate rational fuzz drivers because they do not adhere to proper library usage conventions, such as ensuring a resource is closed after being opened. To make things worse, existing library fuzzing techniques unconditionally execute each driver, resulting in numerous irrational drivers that waste computational resources while contributing little coverage and generating false positive bug reports. To tackle these challenges, we propose Scheduzz, a novel LLM-based automatic library fuzzing technique. It leverages LLMs to understand rational usage of libraries and extract API combination constraints. To optimize computational resource utilization, a dual scheduling framework is implemented to efficiently manage API combinations and fuzz drivers. The framework models driver generation and the corresponding fuzzing campaign as an online optimization problem. Within the scheduling loop, multiple API combinations are selected to generate fuzz drivers, while simultaneously, various optimized fuzz drivers are scheduled for execution or suspension. We implemented Scheduzz and evaluated it on 33 real-world libraries. Compared to baseline approaches, Scheduzz significantly reduces computational overhead and outperforms UTopia on 16 out of 21 libraries. It achieves 1.62x, 1.50x, and 1.89x higher overall coverage than the state-of-the-art techniques CKGFuzzer, Promptfuzz, and the handcrafted project OSS-Fuzz, respectively. In addition, Scheduzz discovered 33 previously unknown bugs in these well-tested libraries, 3 of which have been assigned CVEs.

图书馆的模糊性要求专家了解图书馆的使用情况,并设计出质量高的模糊性驱动器,这是棘手而乏味的。因此,许多技术都建议自动生成模糊性驱动器。然而,由于缺乏对适当的图书馆使用惯例的遵守,这些技术未能产生理性的模糊性驱动器,例如,在开放后确保资源关闭。要让情况更糟,现有图书馆的模糊性技术无条件地执行每个驱动器,导致许多不合理的驱动器浪费计算资源,同时提供很少的覆盖面,并生成虚假的正面错误报告。为了应对这些挑战,我们提出了一个新的自动图书馆模糊技术,Scheduzz,一个基于LLAM的图书馆模糊性技术。它利用LLMS来理解图书馆的合理使用情况,并提取API的组合限制。为了优化计算资源的利用,一个双重的时间安排框架模型驱动器生成和相应的模糊性运动作为在线优化问题。在时间安排周期内,选择了多种 CIPI 组合来生成模糊性驱动器,同时,各种优化的模糊性驱动器,Schedel-fury技术, Scheduzz-LOral 都预定执行或暂停使用。 我们实施了Scial-deal-dal-dal-dal-dal-dal-dal-droudal-dal-dal-dal-daldal-dal-daldaldaldaldaldal 和21 开始,在21 开始,在21个数据库中,我们,在21个数据库里, 开始,在21个数据库里,并评估它。


Article 159

Title@2025-07-24 (4): An Empirical Study on Embodied Artificial Intelligence Robot (EAIR) Software Bugs

Title: An Empirical Study on Embodied Artificial Intelligence Robot (EAIR) Software Bugs Eine empirische Studie über körpereigene Software-Fehler im Bereich Künstliche Intelligenz von Robotern (EAIR) 关于人造人工智能机器人(EAIR)软件虫的经验研究 2507.18267v1

Authors (8): Zeqin Liao, Zibin Zheng, Peifan Reng, Henglong Liang, Zixu Gao, Zhixiang Chen, Wei Li, Yuhong Nan

Embodied Artificial Intelligence Robots (EAIR) constitute an emerging and rapidly evolving technological domain. Ensuring their program correctness is fundamental to their successful deployment. However, a general and in-depth understanding of EAIR system bugs remains lacking, which hinders the development of practices and techniques to tackle them. To bridge this gap, we conducted the first systematic study of 885 EAIR system bugs collected from 80 EAIR system projects to investigate their symptoms, underlying causes, and module distribution. Our analysis, which took considerable effort, classifies these bugs into 18 underlying causes and 15 distinct symptoms, and identifies 13 affected modules. It reveals several new and interesting findings and implications which help shed light on future research on tackling or repairing EAIR system bugs. First, among the 15 identified symptoms, our findings highlight 8 symptoms specific to EAIR systems, which are characterized by severe functional failures and potential physical hazards. Second, within the 18 underlying causes, we define 8 EAIR-specific causes, the majority of which stem from the intricate issues of AI-agent reasoning and decision making. Finally, to facilitate precise and efficient bug prediction, detection, and repair, we constructed a mapping between underlying causes and the modules in which they most frequently occur, which enables researchers to focus diagnostic efforts on the modules most susceptible to specific bug types.

人工智能机器人(EAIR)是一个新兴的、迅速演变的技术领域。确保其程序正确性是成功部署的基础。然而,仍然缺乏对EAIR系统错误的普遍和深入了解,这妨碍了开发处理EAIR系统错误的做法和技术。为了缩小这一差距,我们对从80个EAIR系统项目中收集的885个EAIR系统错误进行了首次系统研究,以调查其症状、根本原因和模块分布。我们的分析需要大量努力,将这些错误分为18个根本原因、15个不同症状和13个受影响的模块。它揭示了一些新的有趣的发现和意义,有助于了解今后关于处理或修复EAIR系统错误的研究。首先,在15个查明的症状中,我们的调查结果突出了EAIR系统特有的8个症状,其特征是功能严重失灵和潜在的物理危害。第二,在18个根本原因中,我们确定了8个EAIR系统具体原因,其中多数源于AI代理理论和决定的复杂问题。最后,为准确和高效的错误预测、检测和修复提供了一些新的结果和影响,有助于了解未来关于处理或修复EAIR系统错误的研究。在最易变本的模型中,我们建立了最能的模型。


Article 160

Title@2025-07-24 (4): GenAI for Automotive Software Development: From Requirements to Wheels

Title: GenAI for Automotive Software Development: From Requirements to Wheels GenAI für die Entwicklung von Automotive-Software: Von Anforderungen bis zu Rädern GENAI 汽车软件开发GENAI:从要求到轮子 2507.18223v1

Authors (6): Nenad Petrovic, Fengjunjie Pan, Vahid Zolfaghari, Krzysztof Lebioda, Andre Schamschurko, Alois Knoll

This paper introduces a GenAI-empowered approach to the automated development of automotive software, with emphasis on autonomous and Advanced Driver Assistance Systems (ADAS) capabilities. The process starts with requirements as input, while the main generated outputs are test scenario code for the simulation environment, together with an implementation of the desired ADAS capabilities targeting the hardware platform of a vehicle connected to a testbench. Moreover, we introduce additional steps for requirements consistency checking that leverage Model-Driven Engineering (MDE). In the proposed workflow, Large Language Models (LLMs) are used for model-based summarization of requirements (Ecore metamodel, XMI model instance and OCL constraint creation), test scenario generation, simulation code generation (Python) and target platform code generation (C++). Additionally, Retrieval Augmented Generation (RAG) is adopted to enhance test scenario generation from documents related to autonomous driving regulations. Our approach aims at shorter compliance and re-engineering cycles, as well as reduced development and testing time for ADAS-related capabilities.

本文介绍了通用自动自动开发汽车软件的GenAI动力方法,重点是自动和高级驱动协助系统(ADAS)能力;这一过程从需求作为投入开始,而产生的主要产出是模拟环境的测试情景代码,以及针对与测试台连接的车辆硬件平台实施理想的ADAS能力;此外,我们引入了额外步骤,要求一致性检查,利用模型驱动工程(MDE);在拟议的工作流程中,大型语言模型(LLMS)用于基于模型的需求汇总(核心元模型、XMI模型实例和OCL制约设定)、测试情景生成、模拟代码(Python)和目标平台代码生成(C++);此外,还采用了检索聚合生成(RAG)能力,以加强自动驱动规则相关文件的测试情景生成;我们的方法旨在缩短合规和再设计周期,并缩短与ADAS相关能力有关的开发和测试时间。


Article 161

Title@2025-07-24 (4): SMECS: A Software Metadata Extraction and Curation Software

Title: SMECS: A Software Metadata Extraction and Curation Software SMECS: Eine Software Metadata Extraktions- und Kurationssoftware SMECS:软件元数据抽取和计算软件 2507.18159v1

Authors (4): Stephan Ferenz, Aida Jafarbigloo, Oliver Werth, Astrid Nieße

Metadata play a crucial role in adopting the FAIR principles for research software and enable findability and reusability. However, creating high-quality metadata can be resource-intensive for researchers and research software engineers. To address this challenge, we developed the Software Metadata Extraction and Curation Software (SMECS), which integrates the extraction of metadata from existing sources with a user-friendly interface for metadata curation. SMECS extracts metadata from online repositories such as GitHub and presents it to researchers through an interactive interface for further curation and export as a CodeMeta file. The usability of SMECS was evaluated through usability experiments, which confirmed that SMECS provides a satisfactory user experience. SMECS supports the FAIRification of research software by simplifying metadata creation.

元数据在采用FAIR研究软件原则方面发挥着关键作用,能够找到和重新使用。然而,建立高质量的元数据对于研究人员和研究软件工程师来说可能是资源密集型的。为了应对这一挑战,我们开发了软件元数据提取和计算软件(SMECS),将从现有来源提取元数据与方便用户的元数据整理接口结合起来。中小企业中央数据库从GitHub等在线储存库中提取元数据,并通过互动接口将其提供给研究人员,以便进一步整理和作为代码Meta文件输出。通过可用性试验对中小企业中央信息系统的可用性进行了评估,证实中小企业中央数据库提供了令人满意的用户经验。中小企业中央数据库通过简化元数据创建,支持研究软件的公平化。


Article 162

Title@2025-07-24 (4): When Retriever Meets Generator: A Joint Model for Code Comment Generation

Title: When Retriever Meets Generator: A Joint Model for Code Comment Generation Wenn Retriever trifft Generator: Ein gemeinsames Modell für Code Comment Generation 当再利用与生成器相遇时: 代码Comment生成联合模式 2507.12558v2

Authors (5): Tien P. T. Le, Anh M. T. Bui, Huy N. D. Pham, Alessio Bucaioni, Phuong T. Nguyen

Automatically generating concise, informative comments for source code can lighten documentation effort and accelerate program comprehension. Retrieval-augmented approaches first fetch code snippets with existing comments and then synthesize a new comment, yet retrieval and generation are typically optimized in isolation, allowing irrelevant neighbors to propagate noise downstream. To tackle the issue, we propose a novel approach named RAGSum that aims at both effectiveness and efficiency in recommendations. RAGSum fuses retrieval and generation using a single CodeT5 backbone. We report preliminary results on a unified retrieval-generation framework built on CodeT5. A contrastive pre-training phase shapes code embeddings for nearest-neighbor search; these weights then seed end-to-end training with a composite loss that (i) rewards accurate top-k retrieval; and (ii) minimizes comment-generation error. More importantly, a lightweight self-refinement loop is deployed to polish the final output. We evaluated the framework on three cross-language benchmarks (Java, Python, C), and compared it with three well-established baselines. The results show that our approach substantially outperforms the baselines with respect to BLEU, METEOR, and ROUGE-L. These findings indicate that tightly coupling retrieval and generation can raise the ceiling for comment automation and motivate forthcoming replications and qualitative developer studies.

为源代码自动生成简明、信息化的评论可以减轻文件工作,并加速程序理解。 检索强化方法首先用现有评论获取代码片断,然后合成新的评论,然而,检索和生成通常在孤立的情况下优化,允许不相关的邻居在下游对噪音进行排解。 为了解决这个问题,我们提议了一个名为RAGSum的新颖方法,其目的在于提高建议的效力和效率。RAGSum建在顶部的离线检索和生成上方,使用单一的代码T5主干线。我们报告了在代码T5基础上建立的统一检索-生成框架的初步结果。一个对比式的训练前阶段将代码嵌入最近的邻居搜索中;这些重量然后是种子端到端的培训,其复合损失(一) 奖励准确的顶级检索;以及 (二) 尽量减少评论生成错误。 更重要的是,将一个轻量的自我修整环安装在最上层上方,以光滑动的最后输出。 我们用三个跨语言基准(Java、Python、C)对框架进行了评估,并将它与三个完善的基线进行对比; 这些重量制质量到质量到质量级的代码, 显示我们不断的循环的复制和不断的循环的复制结果。
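The composite loss described in the abstract above combines a retrieval term and a generation term. One common way to write such an objective is a weighted sum; the weighting factor $\lambda$ and the concrete form of each term are assumptions made here for illustration and are not taken from the paper.

```latex
\mathcal{L}_{\text{total}}
  \;=\; \mathcal{L}_{\text{retrieval}}
  \;+\; \lambda \, \mathcal{L}_{\text{generation}},
  \qquad \lambda > 0
```

Under this reading, $\mathcal{L}_{\text{retrieval}}$ is small when the relevant neighbors are ranked within the top-$k$, and $\mathcal{L}_{\text{generation}}$ could be, for example, a token-level cross-entropy over the reference comment. Training both terms through a shared backbone is what lets retrieval errors influence the generator, and vice versa.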


Article 163

Title@2025-07-24 (4): NoCode-bench: A Benchmark for Evaluating Natural Language-Driven Feature Addition

Title: NoCode-bench: A Benchmark for Evaluating Natural Language-Driven Feature Addition NoCode-Bench: Ein Benchmark für die Bewertung der Erweiterung natürlicher sprachgetriebener Funktionen NoCode-Bonch:评价自然语言-驱动地物的基准 2507.18130v1

Authors (5): Le Deng, Zhonghao Jiang, Jialun Cao, Michael Pradel, Zhongxin Liu

Natural language-driven no-code development allows users to specify software functionality using natural language (NL) instead of editing source code, promising increased productivity and democratized development. Large language models (LLMs) show potential in enabling this paradigm. In this context, software documentation acts as an NL specification for functionality. This work introduces NoCode-bench, a benchmark designed to evaluate LLMs on real-world NL-driven feature addition tasks, consisting of 634 tasks across 10 projects and 114k code changes. Each task pairs documentation updates with corresponding code implementations, validated by developer-written test cases. A subset of 114 high-quality, human-verified instances, NoCode-bench Verified, ensures reliable evaluation. Our experiments reveal that, despite high token usage, the best LLMs achieve a task success rate of only 15.79%, highlighting challenges in cross-file editing, codebase understanding, and tool calling. These findings indicate that LLMs are not yet ready for fully NL-driven no-code development. NoCode-bench lays the foundation for future advances in this area.

自然语言驱动的无代码开发使用户能够使用自然语言(NL)而不是编辑源代码来指定软件功能,从而有望提高生产率和民主化发展。大型语言模型(LLMs)显示了促成这一模式的潜力。在这方面,软件文件作为NL规格的功能规格。这项工作引入了NoCode-bench,这是一个基准,旨在评价现实世界NL驱动的特性添加任务中的LLMs,由10个项目中的634项任务和114k代码变化组成。每个任务对口文件更新了相应的代码执行,并得到了开发者编写的测试案例的验证。114个高品质、人文验证的实例之一,NoCode-bench Verized,确保了可靠的评价。我们的实验表明,尽管有很高的象征性使用,但最佳LLMs只取得了15.79%的任务成功率,突出了跨文件编辑、代码库理解和号召工具方面的挑战。这些研究结果表明LLMs尚未准备好完全NL驱动的无代码开发。NoCode-Bench为这一领域未来进展打下的基础。


Article 164

Title@2025-07-24 (4): OrQstrator: An AI-Powered Framework for Advanced Quantum Circuit Optimization

Title: OrQstrator: An AI-Powered Framework for Advanced Quantum Circuit Optimization OrQstrator: Ein KI-Powered-Framework für erweiterte Quantenschaltungsoptimierung Orstrator: AI授权的高级量子电路优化框架 2507.09682v2

Authors (2): Laura Baird, Armin Moin

We propose a novel approach, OrQstrator, which is a modular framework for conducting quantum circuit optimization in the Noisy Intermediate-Scale Quantum (NISQ) era. Our framework is powered by Deep Reinforcement Learning (DRL). Our orchestration engine intelligently selects among three complementary circuit optimizers: a DRL-based circuit rewriter trained to reduce depth and gate count via learned rewrite sequences; a domain-specific optimizer that performs efficient local gate resynthesis and numeric optimization; and a parameterized circuit instantiator that improves compilation by optimizing template circuits during gate set translation. These modules are coordinated by a central orchestration engine that learns coordination policies based on circuit structure, hardware constraints, and backend-aware performance features such as gate count, depth, and expected fidelity. The system outputs an optimized circuit for hardware-aware transpilation and execution, leveraging techniques from an existing state-of-the-art approach, called the NISQ Analyzer, to adapt to backend constraints.

我们提出了一个新颖的方法,即OrQstrator,这是在Noisy中级量子(NISQ)时代进行量子电路优化的模块化框架。我们的框架由深强化学习(DRL)提供动力。我们的管弦引擎明智地在三个互补的电路优化器中选择:一个基于DRL的电路再编,通过学习的重写序列来降低深度和门数;一个特定域的优化器,运行高效的本地门再合成和数字优化;一个参数化电路即时器,通过优化门置翻译过程中的模板电路来改进编译。这些模块由中央管弦机协调,该机学习基于电路结构、硬件限制和后端识性能(如门数、深度和预期的忠诚)的协调政策。这个系统输出一种优化的硬件觉变换和执行的电路,利用现有状态方法(称为NISQAnalyzer)的技术,以适应后端限制。


Article 165

Title@2025-07-24 (4): Understanding the Supply Chain and Risks of Large Language Model Applications

Title: Understanding the Supply Chain and Risks of Large Language Model Applications Verständnis der Supply Chain und Risiken von Großsprachenmodellanwendungen 了解供应链和大语言模式应用的风险 2507.18105v1

Authors (7): Yujie Ma, Lili Quan, Xiaofei Xie, Qiang Hu, Jiongchi Yu, Yao Zhang, Sen Chen

The rise of Large Language Models (LLMs) has led to the widespread deployment of LLM-based systems across diverse domains. As these systems proliferate, understanding the risks associated with their complex supply chains is increasingly important. LLM-based systems are not standalone, as they rely on interconnected supply chains involving pretrained models, third-party libraries, datasets, and infrastructure. Yet, most risk assessments focus narrowly on the model or data level, overlooking broader supply chain vulnerabilities. While recent studies have begun to address LLM supply chain risks, there remains a lack of benchmarks for systematic research. To address this gap, we introduce the first comprehensive dataset for analyzing and benchmarking LLM supply chain security. We collect 3,859 real-world LLM applications and perform interdependency analysis, identifying 109,211 models, 2,474 datasets, and 9,862 libraries. We extract model fine-tuning paths, dataset reuse, and library reliance, mapping the ecosystem’s structure. To evaluate security, we gather 1,555 risk-related issues (50 for applications, 325 for models, 18 for datasets, and 1,229 for libraries) from public vulnerability databases. Using this dataset, we empirically analyze component dependencies and risks. Our findings reveal deeply nested dependencies in LLM applications and significant vulnerabilities across the supply chain, underscoring the need for comprehensive security analysis. We conclude with practical recommendations to guide researchers and developers toward safer, more trustworthy LLM-enabled systems.

大型语言模型(LLMS)的兴起导致以LLM为基础的系统在不同领域广泛部署,随着这些系统的扩散,了解与复杂的供应链安全相关的风险变得日益重要。LLM系统并不独立,因为它们依赖由事先培训的模式、第三方图书馆、数据集和基础设施等组成的相互关联的供应链;然而,大多数风险评估都狭隘地侧重于模型或数据层面,忽视了更广泛的供应链脆弱性。虽然最近的研究已经开始解决LLM供应链风险,但仍缺乏系统研究的基准。为了弥补这一差距,我们推出了第一个综合数据集,用于分析和衡量LLM供应链安全基准。我们收集了3 859个真实的LLM应用程序,并进行了相互依存性分析,确定了109 211个模型、2 474个数据集和9 862个图书馆。我们从模型中提取了微调路径、数据集再利用和图书馆依赖性,对生态系统结构进行了测绘。为了评估安全,我们收集了1 555个与风险相关的问题-50个应用系统,325个模型,18个数据集,以及1 229个图书馆从公共脆弱性数据库中收集了1 859个应用程序,并进行了相互依存性分析。我们通过这一数据链中的重要数据分析和分析。


Article 166

Title@2025-07-24 (4): Identifier Name Similarities: An Exploratory Study

Title: Identifier Name Similarities: An Exploratory Study Identifier Name Ähnlichkeiten: Eine Sondierungsstudie 说明性名称 相似点:探索性研究 2507.18081v1

Authors (5): Carol Wong, Mai Abe, Silvia De Benedictis, Marissa Halim, Anthony Peruma

Identifier names, which comprise a significant portion of the codebase, are the cornerstone of effective program comprehension. However, research has shown that poorly chosen names can significantly increase cognitive load and hinder collaboration. Even names that appear readable in isolation may lead to misunderstandings in contexts when they closely resemble other names in either structure or functionality. In this exploratory study, we present our preliminary findings on the occurrence of identifier name similarity in software projects through the development of a taxonomy that categorizes different forms of identifier name similarity. We envision our initial taxonomy providing researchers with a platform to analyze and evaluate the impact of identifier name similarity on code comprehension, maintainability, and collaboration among developers, while also allowing for further refinement and expansion of the taxonomy.

由代码库相当一部分组成的识别名称是有效程序理解的基石。然而,研究表明,选择不当的名称会大大增加认知负荷,妨碍协作。即使孤立地看似可读的名称也可能在与结构或功能中的其他名称非常相似的情况下导致误解。在这项探索性研究中,我们介绍了关于软件项目存在识别名称相似性的初步调查结果,其方法是开发一种分类法,对不同形式的识别名称相似性进行分类。我们设想我们的初始分类法为研究人员提供一个平台,用以分析和评估识别名称相似性对代码理解、可维护性以及开发商之间合作的影响,同时允许进一步细化和扩大分类法。


Article 167

Title@2025-07-24 (4): An Empirical Study of Complexity, Heterogeneity, and Compliance of GitHub Actions Workflows

Title: An Empirical Study of Complexity, Heterogeneity, and Compliance of GitHub Actions Workflows Eine empirische Studie über Komplexität, Heterogenität und Compliance von GitHub-Maßnahmen 关于 “ 吉特胡布行动 “ 的复杂性、异质性和合规性的经验研究 2507.18062v1

Authors (2): Edward Abrokwah, Taher A. Ghaleb

Continuous Integration (CI) has evolved from a tooling strategy to a fundamental mindset in modern CI engineering. It enables teams to develop, test, and deliver software rapidly and collaboratively. Among CI services, GitHub Actions (GHA) has emerged as a dominant service due to its deep integration with GitHub and a vast ecosystem of reusable workflow actions. Although GHA provides official documentation and community-supported best practices, there appears to be limited empirical understanding of how open-source real-world CI workflows align with such practices. Many workflows might be unnecessarily complex and not aligned with the simplicity goals of CI practices. This study will investigate the structure, complexity, heterogeneity, and compliance of GHA workflows in open-source software repositories. Using a large dataset of GHA workflows from Java, Python, and C++ repositories, our goal is to (a) identify workflow complexities, (b) analyze recurring and heterogeneous structuring patterns, (c) assess compliance with GHA best practices, and (d) uncover differences in CI pipeline design across programming languages. Our findings are expected to reveal both areas of strong adherence to best practices and areas for improvement where needed. These insights will also have implications for CI services, as they will highlight the need for clearer guidelines and comprehensive examples in CI documentation.

持续整合(CI)已经从工具战略发展到现代CI工程的基本思维,使团队能够迅速合作开发、测试和提供软件。在CI服务中,GitHub Action(GHA)由于与GitHub的深度整合和大量可再利用工作流程行动的生态系统而成为一个主导服务。虽然GHA提供了正式文件和社区支持的最佳做法,但对于开放源码真实世界CI工作流程如何与这些做法保持一致,经验上的理解似乎有限。许多工作流程可能不必要地复杂,不符合CI做法的简单目标。这项研究将调查GHA工作流程的结构、复杂性、异质性和在公开源软件库中的合规性。利用来自Java、Python和C+++的GHA工作流程的大量数据集,我们的目标是:(a) 查明工作流程的复杂性,(b) 分析经常性和混杂的结构模式;(c) 评估对GHA最佳做法的遵守情况,以及(d) 发现CI编程中各语文设计的差异。我们的调查结果将揭示在哪些领域严格遵守CIA工作流程方面的最佳做法,以及哪些领域也需要改进CIA的最佳做法。


Article 168

Title@2025-07-24 (4): SAVANT: Vulnerability Detection in Application Dependencies through Semantic-Guided Reachability Analysis

Title: SAVANT: Vulnerability Detection in Application Dependencies through Semantic-Guided Reachability Analysis SAVANT: Sicherheitserkennung in Anwendungsabhängigkeiten durch Semantik-geführte Reichweitenanalyse SAVANT: 通过语义辅助控制可达性分析,在应用依赖性中发现脆弱性 2506.17798v2

Authors (7): Wang Lingxiang, Quanzhi Fu, Wenjia Song, Gelei Deng, Yi Liu, Dan Williams, Ying Zhang

The integration of open-source third-party library dependencies in Java development introduces significant security risks when these libraries contain known vulnerabilities. Existing Software Composition Analysis (SCA) tools struggle to effectively detect vulnerable API usage from these libraries due to limitations in understanding API usage semantics and computational challenges in analyzing complex codebases, leading to inaccurate vulnerability alerts that burden development teams and delay critical security fixes. To address these challenges, we propose SAVANT, which leverages two insights: proof-of-vulnerability test cases demonstrate how vulnerabilities can be triggered in specific contexts, and Large Language Models (LLMs) can understand code semantics. SAVANT combines semantic preprocessing with LLM-powered context analysis for accurate vulnerability detection. SAVANT first segments source code into meaningful blocks while preserving semantic relationships, then leverages LLM-based reflection to analyze API usage context and determine actual vulnerability impacts. Our evaluation on 55 real-world applications shows that SAVANT achieves 83.8% precision, 73.8% recall, 69.0% accuracy, and 78.5% F1-score, outperforming state-of-the-art SCA tools.

现有软件构成分析(SCA)工具在有效检测这些图书馆的脆弱API使用情况方面挣扎着。 SAVANT将精密的脆弱性检测与LLM驱动的背景分析相结合。 SAVANT将精密的语义预处理与LLOM驱动的背景分析相结合。 SAVANT的首部分源代码在保留语义关系的同时,将有意义的区块纳入到有意义的区块中,然后利用基于LLAM的思考来分析API的使用背景并确定实际的脆弱性影响。我们对55个实际应用软件的评估表明,SAVANT实现了83.8%的精确度,73.8%的回顾,69.0%的精确度和78.5%的F1核心,高于艺术的状态工具。
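The F1-score quoted in the abstract above is the harmonic mean of precision and recall, so the reported figures can be checked against each other. The short sketch below does this; the helper name `f1_score` is ours, and only the precision and recall values come from the abstract.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
reported_precision = 0.838
reported_recall = 0.738

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1:.1%}")  # prints "F1 = 78.5%"
```

The result matches the reported 78.5% F1-score, so the three figures are mutually consistent. Accuracy (69.0%) cannot be derived from precision and recall alone, because it also depends on the number of true negatives.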


Article 169

Title@2025-07-24 (4): Factors Impacting Faculty Adoption of Project-Based Learning in Computing Education: a Survey

Title: Factors Impacting Faculty Adoption of Project-Based Learning in Computing Education: a Survey Faktoren, die die Fakultät beeinflussen Adoption des projektbasierten Lernens in der Computerausbildung: eine Umfrage 影响学院在计算机教育中采用基于项目学习:调查 2507.18039v1

Authors (3): Ahmad D. Suleiman, Yiming Tang, Daqing Hou

This research full paper investigates the factors influencing computing educators’ adoption of project-based learning (PjBL) in software engineering and computing curricula. Recognized as a student-centered pedagogical approach, PjBL has the potential to enhance student motivation, engagement, critical thinking, collaboration, and problem-solving skills. Despite these benefits, faculty adoption remains inconsistent due to challenges such as insufficient institutional support, time constraints, limited training opportunities, designing or sourcing projects, and aligning them with course objectives. This research explores these barriers and investigates the strategies and resources that facilitate a successful adoption. Using a mixed-methods approach, data from 80 computing faculty were collected through an online survey comprising closed-ended questions to quantify barriers, enablers, and resource needs, along with an open-ended question to gather qualitative insights. Quantitative data were analyzed using statistical methods, while qualitative responses underwent thematic analysis. Results reveal that while PjBL is widely valued, its adoption is often selective and impacted by challenges in planning and managing the learning process, designing suitable projects, and a lack of institutional support, such as time, funding, and teaching assistants. Faculty are more likely to adopt or sustain PjBL when they have access to peer collaboration, professional development, and institutional incentives. In addition, sourcing projects from research, industry partnerships, and borrowing from peers emerged as key facilitators for new projects. These findings underscore the need for systemic support structures to empower faculty to experiment with and scale PjBL practices.

这份完整的研究论文调查了影响计算教育者在软件工程和计算课程中采用基于项目学习(PjBL)的因素。作为以学生为中心的教学方法,PjBL具有提高学生动力、参与、批判性思维、协作和解决问题技能的潜力。尽管有这些好处,但是由于体制支持不足、时间限制、培训机会有限、设计或外包项目以及使其与课程目标保持一致等挑战,教师的采用仍然不一致。这项研究探索了这些障碍,并调查了促进成功采用的各种战略和资源。利用混合方法方法,通过在线调查收集了80个计算师的数据,其中包括一些封闭的问题,以量化障碍、扶持人员和资源需求,以及收集定性见解的开放问题。尽管有这些好处,但是由于机构支持不足、时间限制、培训机会有限、设计或采购项目设计与课程目标相协调,因此其采用往往具有选择性,并受到在规划和管理过程、设计适当的项目以及缺乏机构支持的影响,例如时间、供资和教学助理等。学院更可能采用统计方法分析定量数据,同时使用定性数据,同时进行定性分析,同时对质量分析。结果分析。结果显示,尽管PB项目获得或持续进行机构化项目,但是,它们需要从获得或学习周期性项目获得新的研究。


Article 170

Title@2025-07-24 (4): Your ATs to Ts: MITRE ATT&CK Attack Technique to P-SSCRM Task Mapping

Title: Your ATs to Ts: MITRE ATT&CK Attack Technique to P-SSCRM Task Mapping Ihre ATs zu Ts: MITRE ATT&CK Angriffstechnik zu P-SSCRM Task Mapping 您的ATs to Ts: MITRE ATT和CK 攻击技术到 P-SSCRM任务绘图 2507.18037v1

Authors (5): Sivana Hamer, Jacob Bowen, Md Nazmul Haque, Chris Madden, Laurie Williams

The MITRE Adversarial Tactics, Techniques and Common Knowledge (MITRE ATT&CK) Attack Technique to Proactive Software Supply Chain Risk Management Framework (P-SSCRM) Task mapping described in this document helps software organizations to determine how different tasks mitigate the attack techniques of software supply chain attacks. The mapping was created through four independent strategies to find agreed-upon mappings. Because each P-SSCRM task is mapped to one or more tasks from the 10 frameworks, the mapping we provide is also a mapping between MITRE ATT&CK and other prominent government and industry frameworks.

本文件描述的MITRE Adversarial 战术、技术和共同知识(MITRE ATT和CK)对主动软件供应链风险管理框架(P-SSCRM)的进攻技术任务绘图,帮助软件组织确定不同任务如何减轻软件供应链攻击的攻击技术。该绘图是通过四个独立战略建立的,以寻找商定的地图绘制。由于P-SSCRM的每项任务都按照10个框架的一项或多项任务绘制,我们提供的地图也是MITRE ATT和CK与其他知名政府和工业框架之间的地图绘制。


Article 171

Title@2025-07-24 (4): An Empirical Study of GenAI Adoption in Open-Source Game Development: Tools, Tasks, and Developer Challenges

Title: An Empirical Study of GenAI Adoption in Open-Source Game Development: Tools, Tasks, and Developer Challenges Eine empirische Studie zur GenAI-Adoption in der Open-Source-Spielentwicklung: Werkzeuge, Aufgaben und Entwickler-Herausforderungen GENAI采用开放源码游戏开发的经验研究:工具、任务和开发者的挑战 2507.18029v1

Authors (4): Xiang Echo Chen, Wenhan Zhu, Guoshuai Albert Shi, Michael W. Godfrey

The growing capabilities of generative AI (GenAI) have begun to reshape how games are designed and developed, offering new tools for content creation, gameplay simulation, and design ideation. While prior research has explored traditional uses of AI in games, such as controlling agents or generating procedural content, there is limited empirical understanding of how GenAI is adopted by developers in real-world contexts, especially within the open-source community. This study aims to explore how GenAI technologies are discussed, adopted, and integrated into open-source game development by analyzing issue discussions on GitHub. We investigate the tools, tasks, and challenges associated with GenAI by comparing GenAI-related issues to those involving traditional AI (TradAI) and NonAI topics. Our goal is to uncover how GenAI differs from other approaches in terms of usage patterns, developer concerns, and integration practices. To address this objective, we construct a dataset of open-source game repositories that discuss AI-related topics. We apply open card sorting and thematic analysis to a stratified sample of GitHub issues, labelling each by type and content. These annotations enable comparative analysis across GenAI, TradAI, and NonAI groups, and provide insight into how GenAI is shaping the workflows and pain points of open-source game developers.

虽然先前的研究探索了在游戏中传统使用AI的方法,例如控制剂或产生程序内容。对于GenAI如何被开发者,特别是在开放源码界中如何在现实世界环境中采用GenAI, 经验上了解有限。本研究的目的是通过分析GitHub问题的讨论,探讨如何讨论、采用GenAI技术并将其纳入开放源码游戏开发。我们调查与GenAI有关的工具、任务和挑战,将GenAI相关问题与传统AI(TradAI)和NonAI专题相比较。我们的目标是发现GenAI如何在使用模式、开发者关切和整合做法方面与其他方法不同。为了实现这一目标,我们建立一个公开源码游戏储存库数据集,讨论与AI有关的专题。我们用公开的卡分解和专题分析方法,对GitHub问题进行分类,按类型和内容进行标注。这些说明有助于在GenAI、TradAI和NonAI的工作流程中进行对比分析。