cs.SE @ 2025-07-04: 127

07-03 (4)

Requirements Elicitation Follow-Up Question Generation

Voraussetzungen Elicitation Follow-Up Question Generation

问询后查询

2507.02858v1

07-03

Legal Requirements Translation from Law

Rechtliche Voraussetzungen Übersetzung aus dem Recht

法律要求译自法律

2507.02846v1

07-03

Agentic Business Process Management: Practitioner Perspectives on Agent Governance in Business Processes

Agentic Business Process Management: Praxisperspektiven zur Agenten-Governance in Unternehmensprozessen

代理业务流程管理:从业者对业务流程代理治理的看法

2504.03693v2

07-03

Gradient-Based Model Fingerprinting for LLM Similarity Detection and Family Classification

Gradient-Based Model Fingerprinting für LLM Ähnlichkeitserkennung und Familienklassifizierung

LLM相似性探测和家庭分类的渐进式样指纹

2506.01631v2

07-03

Fuzzing-based Mutation Testing of C/C++ Software in Cyber-Physical Systems

Fuzzing-basierte Mutationsprüfung von C/C++-Software in Cyber-Physical Systems

C/C++网络物理系统中软件的模糊式变异测试

2503.24100v3

07-03

Sustainability Flags for the Identification of Sustainability Posts in Q&A Platforms

Nachhaltigkeitsflaggen für die Identifizierung von Nachhaltigkeitsbeiträgen in Q&A-Plattformen

在Q&A平台中识别可持续性帖子的可持续性标记

2507.02695v1

07-03

RLHGNN: Reinforcement Learning-driven Heterogeneous Graph Neural Network for Next Activity Prediction in Business Processes

RLHGNN: Verstärkung Lernorientiertes Heterogenes Graph Neuronales Netzwerk für die nächste Aktivitätsvorhersage in Geschäftsprozessen

RLHGNN: 业务流程下一个活动预测的强化学习驱动的异质图形神经网络

2507.02690v1

07-03

Do Research Software Engineers and Software Engineering Researchers Speak the Same Language?

Sprechen Forschungssoftware-Ingenieure und Software-Engineering-Forscher die gleiche Sprache?

研究软件工程师和软件工程研究者会说同一种语言吗?

2507.02665v1

07-03

Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure

Erklärbare Compliance-Erkennung mit Multi-Hop-Natural Language-Schlussfolgerung zur Assurance-Fallstruktur

以多种自然语言对保证案例结构的多重语言推断进行可解释的合规检测

2506.08713v2

07-03

Alleviating Attack Data Scarcity: SCANIA’s Experience Towards Enhancing In-Vehicle Cyber Security Measures

Linderung der Knappheit von Angriffsdaten: SCANIAs Erfahrung zur Verbesserung von Cybersicherheitsmaßnahmen im Fahrzeug

减轻攻击数据稀缺性:SCANIA在加强车辆内部网络安全措施方面的经验

2507.02607v1

07-03

Human-Machine Collaboration and Ethical Considerations in Adaptive Cyber-Physical Systems

Mensch-Maschine-Kollaboration und ethische Überlegungen in adaptiven Cyber-Physischen Systemen

适应性网络-物理系统中的人机协作和伦理考虑

2507.02578v1

07-03

LLMREI: Automating Requirements Elicitation Interviews with LLMs

LLMREI: Automatisierung von Interviews zur Anforderungserhebung mit LLMs

LLMREI: 利用LLM自动化需求获取访谈

2507.02564v1

07-03

Meta-Fair: AI-Assisted Fairness Testing of Large Language Models

Meta-Fair: AI-Assisted Fairness Testing von großen Sprachmodellen

Meta-Fair:AI协助大语言模型公平性测试

2507.02533v1

07-03

Code Digital Twin: Empowering LLMs with Tacit Knowledge for Complex Software Maintenance

Code Digital Twin: LLMs mit Tacit-Kenntnissen für komplexe Software-Wartung stärken

数字代码双对:赋予LLMs以用于复杂软件维护的隐秘知识

2503.07967v2

07-03

VeFIA: An Efficient Inference Auditing Framework for Vertical Federated Collaborative Software

VeFIA: Ein effizientes Inferenz-Audit-Framework für vertical Federated Collaborative Software

VeFIA: 垂直联邦合作软件有效推断审计框架

2507.02376v1

07-03

Exploring the Integration of Large Language Models in Industrial Test Maintenance Processes

Erforschung der Integration von großen Sprachmodellen in industrielle Testwartungsprozesse

探索将大语言模型纳入工业试验维护工艺

2409.06416v2

07-03

Precisely Detecting Python Type Errors via LLM-based Unit Test Generation

Präzise Erkennung von Python-Typfehlern über LLM-basierte Einheitentestgenerierung

通过基于LLM的单位测试生成精确检测 Python 类型错误

2507.02318v1

07-02 (3)

Enhancing COBOL Code Explanations: A Multi-Agents Approach Using Large Language Models

Verbesserung der COBOL-Code-Erklärungen: Ein Multi-Agenten-Ansatz mit großen Sprachmodellen

加强COBOL守则解释:采用大语言模式的多机构办法

2507.02182v1

07-02

Towards Trustworthy Sentiment Analysis in Software Engineering: Dataset Characteristics and Tool Selection

Auf dem Weg zu einer vertrauensvollen Sentiment-Analyse in der Software-Engineering: Dataset-Eigenschaften und Werkzeugauswahl

在软件工程中进行可信赖的感知分析:数据集特点和工具选择

2507.02137v1

07-02

DSCodeBench: A Realistic Benchmark for Data Science Code Generation

DSCodeBench: Ein realistischer Benchmark für die Erstellung von Data Science Code

DSCodeBench:数据科学代码生成的现实基准

2505.15621v2

07-02

A Multimodal Approach Combining Biometrics and Self-Report Instruments for Monitoring Stress in Programming: Methodological Insights

Ein multimodaler Ansatz zur Kombination von Biometrie und Selbstberichtsinstrumenten zur Überwachung von Stress in der Programmierung: Methodologische Erkenntnisse

将生物计量与在方案编制中监测压力的自报工具相结合的多式联运办法:方法见解

2507.02118v1

07-02

Can Internal Software Metrics Predict App Popularity at Launch? Yeas! and Nays!

Kann interne Software-Metriken App-Popularität beim Start voraussagen? Yeas! und Nays!

内部软件计量器能否预测发射时的流行程度?

2507.02110v1

07-02

Structural Code Search using Natural Language Queries

Structural Code Suche mit Hilfe von Natural Language Queries

使用自然语言查询的结构性法规搜索

2507.02107v1

07-02

How do Software Engineering Candidates Prepare for Technical Interviews?

Wie bereiten sich Software-Engineering-Kandidaten auf technische Interviews vor?

软件工程候选人如何准备技术面试?

2507.02068v1

07-02

Automated Synthesis of Formally Verified Multi-Abstraction Function Summaries

Automatisierte Synthese von formal verifizierten Multi-Abstraktions-Funktionszusammenfassungen

形式化验证的多抽象层函数摘要的自动合成

2506.09550v2

07-02

APRMCTS: Improving LLM-based Automated Program Repair with Iterative Tree Search

APRMCTS: Verbesserte automatische Programmreparatur auf LLM-Basis mit iterativer Baumsuche

APRMCTS: 改进以LLM为基础的自动方案修理,同时进行迭代树搜索

2507.01827v1

07-02

Foundation Models for the Digital Twin Creation of Cyber-Physical Systems

Gründungsmodelle für die digitale Twin-Erstellung von Cyber-Physical Systems

数字双造网络物理系统数字双创造模式基金会模式

2407.18779v2

07-02

On the Structure and Semantics of Identifier Names Containing Closed Syntactic Category Words

Über die Struktur und Semantik von Identifier-Namen, die geschlossene syntaktische Kategorie Wörter enthalten

关于含有闭合同步词类的标识名称的结构和语义

2505.18444v2

07-02

Exploring Privacy and Security as Drivers for Environmental Sustainability in Cloud-Based Office Solutions

Erforschen von Datenschutz und Sicherheit als Treiber für ökologische Nachhaltigkeit in Cloud-basierten Office-Lösungen

探索隐私和安全作为云载办公室解决方案中环境可持续性的驱动力

2506.23866v2

07-02

DaiFu: In-Situ Crash Recovery for Deep Learning Systems

DaiFu: In-Situ Crash Recovery für Deep Learning Systeme

DaiFu:深入学习系统现场事故恢复

2507.01628v1

07-02

Combining Type Inference and Automated Unit Test Generation for Python

Kombination von Typ-Inferenz und Automated Unit Test Generation für Python

Python 组合类型推断和自动单位测试生成

2507.01477v1

07-02

Tensor Program Optimization for the RISC-V Vector Extension Using Probabilistic Programs

Tensor-Programmoptimierung für die RISC-V-Vektorerweiterung mittels probabilistischer Programme

利用概率方案优化RISC-V矢量扩展

2507.01457v1

07-02

Context-Aware Code Wiring Recommendation with LLM-based Agent

Context-Aware Code Verkabelungsempfehlung mit LLM-basiertem Agent

基于LLM代理的上下文感知代码连接推荐

2507.01315v1

07-01 (2)

Regulating Algorithmic Management: A Multi-Stakeholder Study of Challenges in Aligning Software and the Law for Workplace Scheduling

Regulierung des algorithmischen Managements: Eine Multi-Stakeholder-Studie über Herausforderungen bei der Ausrichtung von Software und dem Gesetz für die Arbeitsplanung

规范工资管理:多方利益攸关方研究软件和工作场所时间安排法在调整软件和工作场所时间安排法方面面临的挑战

2505.02329v3

07-01

Bugs in the Shadows: Static Detection of Faulty Python Refactorings

Bugs in the Shadows: Statische Erkennung von fehlerhaften Python-Refactorings

暗影中的缺陷: 静态检测有缺陷的 Python 重构

2507.01103v1

07-01

What About Emotions? Guiding Fine-Grained Emotion Extraction from Mobile App Reviews

Was ist mit Emotionen? Guiding Fine-Grained Emotion Extraction aus Mobile App Bewertungen

情感呢?指导从移动应用程序评论中抽取精美情感的导师

2505.23452v2

07-01

Toward a Brazilian Research Agenda in Quantum Software Engineering: A Systematic Mapping Study

Auf dem Weg zu einer brasilianischen Forschungsagenda in der Quantensoftware-Engineering: Eine systematische Mapping-Studie

实现巴西量子软件工程研究议程:系统绘图研究

2506.11013v2

07-01

A Study of In-Context-Learning-Based Text-to-SQL Errors

Eine Studie über In-Context-Learning-basierte Text-zu-SQL-Fehler

文中学习基于文本到SQL错误的研究

2501.09310v2

07-01

Out of the Day Job: Perspectives of Industry Practitioners in Co-Design and Delivery of Software Engineering Courses

Out of the Day Job: Perspektiven von Industriepraktizierenden in Co-Design und Bereitstellung von Software-Engineering-Kursen

日常工作:行业从业人员在共同设计和提供软件工程课程方面的观点

2507.00803v1

07-01

Echoes of AI: Investigating the Downstream Effects of AI Assistants on Software Maintainability

Echos von KI: Untersuchung der Auswirkungen von KI-Assistenten auf die Wartung von Software

AI的回声:调查AI助理对软件维护能力下游影响

2507.00788v1

07-01

Snaps: Bloated and Outdated?

Snaps: Aufgebläht und überholt?

Snaps:臃肿且过时?

2507.00786v1

07-01

Assessing Correctness in LLM-Based Code Generation via Uncertainty Estimation

Beurteilung der Korrektheit in der LLM-basierten Codegenerierung über Unsicherheitsabschätzung

通过不确定性估算评估基于LLM的代码生成的正确性

2502.11620v3

07-01

A Hierarchical and Evolvable Benchmark for Fine-Grained Code Instruction Following with Multi-Turn Feedback

Ein hierarchischer und evolvierbarer Benchmark für eine feinkörnige Code-Anleitung mit Multi-Turn-Feedback

附有多发反馈的精细分类守则指示的等级和可变基准

2507.00699v1

07-01

A Domain-specific Language and Architecture for Detecting Process Activities from Sensor Streams in IoT

Domänenspezifische Sprache und Architektur zur Erkennung von Prozessaktivitäten von Sensor Streams im IoT

用于检测从IoT传感器流中检测进程活动的特定域语言和架构

2507.00686v1

07-01

The Secrets Must Not Flow: Scaling Security Verification to Large Codebases (extended version)

Die Geheimnisse dürfen nicht fließen: Skalierung der Sicherheitsüberprüfung auf große Codebases (erweiterte Version)

秘密不得流动:将安全核查扩大到大型代码库(扩展版)

2507.00595v1

07-01

Coverage-Guided Testing for Deep Learning Models: A Comprehensive Survey

Coverage-Guided Testing für Deep Learning Modelle: Eine umfassende Umfrage

深学习模型的覆盖面-指导测试:全面调查

2507.00496v1

07-01

The Influence of HEXACO Personality Traits on the Teamwork Quality in Software Teams – A Preliminary Research Approach

Der Einfluss von HEXACO Persönlichkeitseigenschaften auf die Teamwork-Qualität in Software-Teams - ein vorläufiger Forschungsansatz

HEXACO个人经历对软件团队团队工作质量的影响 – – 初步研究方法

2507.00481v1

07-01

Embedded DevOps: A Survey on the Application of DevOps Practices in Embedded Software and Firmware Development

Embedded DevOps: Eine Umfrage zur Anwendung von DevOps-Praxis in Embedded Software und Firmware-Entwicklung

嵌入式DevOps:关于在嵌入式软件和公司软件开发中应用DevOps做法的调查

2507.00421v1

07-01

Recommending Variable Names for Extract Local Variable Refactorings

Empfehlung von Variablennamen für lokale Variablen-Refaktorings extrahieren

为抽取本地变量重构建议变量名称

2507.00413v1

07-01

Revealing Floating-Point Accumulation Orders in Software/Hardware Implementations

Aufdecken der Gleitkomma-Akkumulationsreihenfolgen in Software-/Hardware-Implementierungen

在软件/硬件实施中发送浮动点累计令

2411.00442v3

07-01

Agentic AI in Product Management: A Co-Evolutionary Model

Agentische KI im Produktmanagement: Ein ko-evolutionäres Modell

产品管理中的代理式AI:共同进化模型

2507.01069v1

07-01

iPanda: An Intelligent Protocol Testing and Debugging Agent for Conformance Testing

iPanda: Ein intelligenter Protokolltest- und Debugging-Agent für Konformitätsprüfungen

iPanda:一个智能协议测试和调试代理人,用于合规测试

2507.00378v1

07-01

An AST-guided LLM Approach for SVRF Code Synthesis

Ein AST-geführter LLM-Ansatz für die SVRF-Codesynthese

SVRF 准则合成的AST制导LLM法

2507.00352v1

07-01

VTS-Guided AI Interaction Workflow for Business Insights

VTS-geführter KI-Interaktions-Workflow für Business Insights

VTS 指导的AI 商业观察互动工作流程

2507.00347v1

06-30 (1)

Failure by Interference: Language Models Make Balanced Parentheses Errors When Faulty Mechanisms Overshadow Sound Ones

Fehler durch Interferenz: Sprachmodelle machen Fehler bei ausgeglichenen Klammern, wenn fehlerhafte Mechanismen die korrekten überschatten

干扰导致的失败:当有缺陷的机制压过健全的机制时,语言模型会出现括号匹配错误

2507.00322v1

06-30

Making a Pipeline Production-Ready: Challenges and Lessons Learned in the Healthcare Domain

Herstellung einer Pipeline-Produktion: Herausforderungen und Lektionen im Bereich Healthcare

《管道生产-准备:保健领域的挑战和经验教训》

2506.06946v2

06-30

Rust vs. C for Python Libraries: Evaluating Rust-Compatible Bindings Toolchains

Rust vs. C für Python Bibliotheken: Bewertung von Rust-kompatiblen Bindungen Toolchains

面向Python库的Rust与C对比:评估兼容Rust的绑定工具链

2507.00264v1

06-30

A Metascience Study of the Low-Code Scientific Field

Eine Metawissenschaftsstudie des wissenschaftlichen Bereichs mit niedrigem Code

低代码科学领域的元科学研究

2408.05975v2

06-30

Is It Safe To Learn And Share? On Psychological Safety and Social Learning in (Agile) Communities of Practice

Ist es sicher, zu lernen und zu teilen? Über Psychologische Sicherheit und soziales Lernen in (Agile) Gemeinschaften der Praxis

它是否安全学习和分享? 在(敏感)实践社区的心理安全和社会学习方面,它是否安全学习和分享?

2507.01065v1

06-30

Mitigating Hallucinations in YOLO-based Object Detection Models: A Revisit to Out-of-Distribution Detection

Halluzinationen in YOLO-basierten Objekterkennungsmodellen abmildern: Ein Besuch bei Out-of-Distribution Detection

以YOLO为基地的物体探测模型中的减轻幻觉:重新研究扩散外探测

2503.07330v2

06-30

Exploring Challenges in Test Mocking: Developer Questions and Insights from StackOverflow

Herausforderungen im Test-Mocking erkunden: Entwickler-Fragen und Einblicke aus StackOverflow

探索试验模拟的挑战:来自斯塔克流的开发者问题和洞察

2505.08300v2

06-30

Bug Fixing with Broader Context: Enhancing LLM-Based Program Repair via Layered Knowledge Injection

Fehlerbehebung mit breiterem Kontext: Verbesserung der LLM-basierten Programm-Reparatur durch geschichtete Wissensinjektion

以更广泛的背景解决错误:通过多层知识注射加强基于LLM的方案修复

2506.24015v1

06-30

STCLocker: Deadlock Avoidance Testing for Autonomous Driving Systems

STCLocker: Deadlock-Vermeidungstests für autonome Fahrsysteme

STCLocker:自动驾驶系统死锁避免测试

2506.23995v1

06-30

Green Metrics Tool: Measuring for fun and profit

Green Metrics Tool: Messen für Spaß und Profit

绿色计量工具:衡量乐趣和利润

2506.23967v1

06-30

ADReFT: Adaptive Decision Repair for Safe Autonomous Driving via Reinforcement Fine-Tuning

ADReFT: Adaptive Entscheidungsreparatur für sicheres autonomes Fahren durch Verstärkung Feintuning

ADReFT: 安全自主驾驶的适应性决定修补

2506.23960v1

06-30

Automated Statistical Testing and Certification of a Reliable Model-Coupling Server for Scientific Computing

Automatisierte statistische Prüfung und Zertifizierung eines zuverlässigen Model-Coupling-Servers für Scientific Computing

科学计算系统可靠模型组合服务器的自动统计测试和认证

2505.09769v2

06-30

Measuring Software Innovation with Open Source Software Development Data

Software-Innovation mit Open Source Software-Entwicklungsdaten messen

利用开放源码软件开发数据衡量软件创新

2411.05087v2

06-30

Requirements for Active Assistance of Natural Questions in Software Architecture

Anforderungen an die aktive Unterstützung natürlicher Fragen in der Softwarearchitektur

积极协助软件建筑中自然问题的要求

2506.23898v1

06-30

Green AI in Action: Strategic Model Selection for Ensembles in Production

Grüne KI in Aktion: Strategische Modellauswahl für Ensembles in der Produktion

绿色AI “ 行动 “ :生产集合战略示范选择

2405.17451v2

06-30

An ontological lens on attack trees: Toward adequacy and interoperability

Eine ontologische Linse auf Angriffsbäumen: Auf dem Weg zur Angemessenheit und Interoperabilität

攻击树的本体论视角:迈向充分性和互操作性

2506.23841v1

06-30

Software Engineering for Large Language Models: Research Status, Challenges and the Road Ahead

Software Engineering für große Sprachmodelle: Forschungsstatus, Herausforderungen und die Zukunft

大语言模型软件工程:研究现状、挑战和道路

2506.23762v1

06-30

A Survey of LLM-based Automated Program Repair: Taxonomies, Design Paradigms, and Applications

Eine Umfrage der LLM-basierten automatisierten Programm-Reparatur: Taxonomien, Design Paradigmen und Anwendungen

以LLM为基础的自动方案维修调查:分类、设计模型和应用

2506.23749v1

06-30

Towards a Science of Developer eXperience (DevX)

Auf dem Weg zu einer Wissenschaft des Entwicklers eXperience (DevX)

迈向开发者体验(DevX)的科学

2506.23715v1

06-30

What Challenges Do Developers Face When Using Verification-Aware Programming Languages?

Vor welchen Herausforderungen stehen Entwickler bei der Verwendung verifikationsbewusster Programmiersprachen?

开发者在使用支持验证的编程语言时面临哪些挑战?

2506.23696v1

06-30

Can Large Language Models Help Students Prove Software Correctness? An Experimental Study with Dafny

Können große Sprachmodelle den Studierenden helfen, Software-Korrektur zu beweisen? Eine experimentelle Studie mit Dafny

大语言模型能帮助学生证明软件正确性吗? 与Dafny的实验研究

2506.22370v2

06-30

Threadbox: Sandboxing for Modular Security

Threadbox: Sandboxing für modulare Sicherheit

Threadbox: 模块安全沙箱

2506.23683v1

06-30

QLPro: Automated Code Vulnerability Discovery via LLM and Static Code Analysis Integration

QLPro: Automatisierte Code Vulnerability Discovery über LLM und Static Code Analysis Integration

QLPro:通过LLM和静态代码分析整合发现自动编码易脆弱性

2506.23644v1

06-30

From Tea Leaves to System Maps: Context-awareness in Monitoring Operational Machine Learning Models

Von Tea Leaves zu System Maps: Kontext-Bewusstsein bei der Überwachung operativer Machine Learning Modelle

从茶叶休假到系统地图:监测操作机器学习模式的背景意识

2506.10770v2

06-30

Comparative Analysis of the Code Generated by Popular Large Language Models (LLMs) for MISRA C++ Compliance

Vergleichende Analyse des Code Generated by Popular Large Language Models (LLMs) für MISRA C++ Compliance

MISRA C++遵约情况按大众大语言模式编制的守则比较分析

2506.23535v1

06-30

Improving vulnerability type prediction and line-level detection via adversarial training-based data augmentation and multi-task learning

Verbesserung der Sicherheitsvorhersage und der Linienerkennung durch trainingsbasierte Datenvergrößerung und Multi-Task-Lernen

通过对抗性培训数据扩增和多任务学习,改进脆弱性类型预测和线一级检测

2506.23534v1

06-29 (7)

On the Feasibility of Deduplicating Compiler Bugs with Bisection

Über die Machbarkeit von Compiler Bugs mit Bisection zu deduplizieren

应用编译器比分错误的可行性

2506.23281v1

06-29

From Release to Adoption: Challenges in Reusing Pre-trained AI Models for Downstream Developers

Von der Veröffentlichung bis zur Annahme: Herausforderungen bei der Wiederverwendung vortrainierter KI-Modelle für Downstream-Entwickler

从释放到采用:为下游开发者重新使用经过预先培训的AI模型的挑战

2506.23234v1

06-29

RPHunter: Unveiling Rug Pull Schemes in Crypto Token via Code-and-Transaction Fusion Analysis

RPHunter: Enthüllen von Rug Pull Schemes in Crypto Token über Code-and-Transaction Fusion Analysis

RPHunter:通过代码和交易整合分析,在加密中采用 “ 拼接 “ 的 “ 拼接 “ 的 “ Rug “ 拖网计划

2506.18398v2

06-29

A Comprehensive Study on Large Language Models for Mutation Testing

Eine umfassende Studie zu großen Sprachmodellen für Mutationsprüfungen

关于变异测试大语言模式的综合研究

2406.09843v3

06-29

Repair Ingredients Are All You Need: Improving Large Language Model-Based Program Repair via Repair Ingredients Search

Reparatur Zutaten sind alles, was Sie brauchen: Verbesserung der großen Sprache Modellbasierte Programm Reparatur über Reparatur Zutaten Suche

维修成份:通过修理成份搜索改进大语言示范方案

2506.23100v1

06-29

HF-DGF: Hybrid Feedback Guided Directed Grey-box Fuzzing

HF-DGF: Hybrid-Feedback-geführtes Grey-Box-Fuzzing

HF-DGF:混合反馈引导的定向灰盒模糊测试

2506.23063v1

06-28 (6)

Guiding AI to Fix Its Own Flaws: An Empirical Study on LLM-Driven Secure Code Generation

Leitende KI, um eigene Fehler zu beheben: Eine empirische Studie zur LLM-getriebenen sicheren Codegenerierung

指导大赦国际修补其自成自有的法条:关于LLM-Driven安全法规生成的经验研究

2506.23034v1

06-28

Generating Privacy Stories From Software Documentation

Generieren von Privacy Stories aus der Software-Dokumentation

从软件文件生成隐私故事

2506.23014v1

06-28

Evaluating and Improving Large Language Models for Competitive Program Generation

Bewertung und Verbesserung großer Sprachmodelle für die wettbewerbsfähige Programmgenerierung

评价和改进竞争性方案制定大语言模式

2506.22954v1

06-28

A Quantum Annealing Approach for Solving Optimal Feature Selection and Next Release Problems

Ein Quantum-Annealing-Ansatz zur Lösung optimaler Feature-Auswahl und nächster Release-Probleme

解决最佳地物选择和下一期释放问题的一个量化的安妮化方法

2506.14129v2

06-28

Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation

Kleiner = schwach? Benchmarking Robustheit quantifizierter LLMs bei der Codegenerierung

更小 = 更弱?代码生成中量化LLM的鲁棒性基准测试

2506.22776v1

06-28

Privacy-Preserving Methods for Bug Severity Prediction

Datenschutz-Erhaltung Methoden für Bug Severity Prediction

维护隐私的错误严重性预测方法

2506.22752v1

06-28

Understanding the Challenges and Promises of Developing Generative AI Apps: An Empirical Study

Die Herausforderungen und Versprechen der Entwicklung generativer KI-Apps verstehen: Eine empirische Studie

了解 “ 开发创新的AI Apps:经验研究 “ 的挑战和前景

2506.16453v2

06-28

RAILS: Retrieval-Augmented Intelligence for Learning Software Development

RAILS: Retrieval-Augmented Intelligence für die Entwicklung von Lernsoftware

RAILS:学习软件开发检索增强情报

2506.22742v1

06-28

P4OMP: Retrieval-Augmented Prompting for OpenMP Parallelism in Serial Code

P4OMP: Retrieval-Augmented Prompting für OpenMP Parallelismus im seriellen Code

P4OMP: 序列法中公开MP平行主义的检索-启动提示

2506.22703v1

06-27 (5)

An LLM-assisted approach to designing software architectures using ADD

Ein LLM-unterstützter Ansatz zur Entwicklung von Softwarearchitekturen mit ADD

利用ADD设计软件结构的LLM辅助方法

2506.22688v1

06-27

Knowledge-Guided Multi-Agent Framework for Automated Requirements Development: A Vision

Knowledge-Guided Multi-Agent Framework für automatisierte Anforderungsentwicklung: Eine Vision

知识指导的自动化要求发展多方支持框架:愿景

2506.22656v1

06-27

L2MAC: Large Language Model Automatic Computer for Extensive Code Generation

L2MAC: Automatischer Computer mit großem Sprachmodell für umfangreiche Code-Generierung

L2MAC:用于广泛代码生成的大型语言模拟自动计算机

2310.02003v6

06-27

What Makes ChatGPT Effective for Software Issue Resolution? An Empirical Study of Developer-ChatGPT Conversations in GitHub

Was macht ChatGPT effektiv für Software Problemlösung? Eine empirische Studie von Entwickler-ChatGPT Gespräche in GitHub

是什么使ChatGPT有效解决软件问题? 开发者-ChatGPT在GitHub的对话经验研究。

2506.22390v1

06-27

Automated detection of atomicity violations in large-scale systems

Automatisierte Erkennung von atomaren Verstößen in Großsystemen

在大规模系统中自动发现违反原子现象

2504.00521v2

06-27

Domain-Driven Design in Software Development: A Systematic Literature Review on Implementation, Challenges, and Effectiveness

Domain-Driven Design in der Software-Entwicklung: Ein systematischer Literaturbericht über Implementierung, Herausforderungen und Effektivität

软件开发中的域驱动设计:关于实施、挑战和有效性的系统文献审查

2310.01905v4

06-27

Autonomic Microservice Management via Agentic AI and MAPE-K Integration

Autonomes Microservice Management über Agentic AI und MAPE-K Integration

通过Agentic AI和MAPE-K整合进行自动微服务管理

2506.22185v1

06-27

Towards Modeling Human-Agentic Collaborative Workflows: A BPMN Extension

Auf dem Weg zur Modellierung von Mensch-Agentik-kollaborativen Workflows: Eine BPMN-Erweiterung

迈向人类与代理协作工作流的建模:一种BPMN扩展

2412.05958v3

06-27

KARMA Approach supporting Development Process Reconstruction in Model-based Systems Engineering

KARMA-Ansatz zur Unterstützung der Prozessrekonstruktion in der modellbasierten Systemtechnik

KARMA 支持基于示范系统工程的发展进程重建的方法

2506.22037v1

06-27

The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason

Die SWE-Bench-Illusion: Wenn State-of-the-Art-LLMs sich erinnern, statt zu schlussfolgern

SWE-Bench 幻象:当最先进的LLM依靠记忆而非推理时

2506.12286v2

06-27

Generative AI for Software Architecture. Applications, Challenges, and Future Directions

Generative KI für Softwarearchitektur. Anwendungen, Herausforderungen und Zukunftsrichtungen

软件架构的生成式AI:应用、挑战和未来方向

2503.13310v2

06-27

Enhancing Cloud Security through Topic Modelling

Verbesserung der Cloud-Sicherheit durch Themenmodellierung

通过专题建模模式加强云层安全

2505.01463v2

06-26 (4)

ELFuzz: Efficient Input Generation via LLM-driven Synthesis Over Fuzzer Space

ELFuzz: Effiziente Input-Generierung über LLM-gesteuerte Synthese über Fuzzer-Raum

ELFuzz:通过LLM驱动的模糊空间综合合成有效投入生成

2506.10323v2

06-26

Estimating Correctness Without Oracles in LLM-Based Code Generation

Schätzung der Korrektheit ohne Oracles in der LLM-basierten Code-Generierung

在基于LLM的代码生成中无需测试预言估计正确性

2507.00057v1

06-26

FuzzAug: Data Augmentation by Coverage-guided Fuzzing for Neural Test Generation

FuzzAug: Datenvergrößerung durch Coverage-geführtes Fuzzing für die neurale Testgeneration

FuzzAug: 神经测试生成时通过覆盖制导模糊增加数据

2406.08665v3

06-26

Performance Prediction for Large Systems via Text-to-Text Regression

Leistungsvorhersage für große Systeme über Text-zu-Text-Regression

通过文字到文字倒退对大型系统的性能预测

2506.21718v1

06-26

Using Generative AI in Software Design Education: An Experience Report

Generative KI in der Software Design Education: Ein Erfahrungsbericht

在软件设计教育中采用 “ 创新 “ 软件设计教育:经验报告

2506.21703v1

06-26

The DevSafeOps Dilemma: A Systematic Literature Review on Rapidity in Safe Autonomous Driving Development and Operation

Das DevSafeOps Dilemma: Ein systematischer Literaturbericht zur Schnelligkeit in der sicheren autonomen Fahrentwicklung und -operation

DevSafeOps 困境:关于安全自动驾驶开发与运行中快速性的系统文献综述

2506.21693v1

06-26

Experience converting a large mathematical software package written in C++ to C++20 modules

Erfahrung beim Konvertieren eines großen mathematischen Softwarepakets in C++ in C++20 Module

将用 C++ 编写的大型数学软件包转换为 C++20 模块的经验

2506.21654v1

06-26

Large Language Model-Powered Agent for C to Rust Code Translation

Large Language Model-Powered Agent für C to Rust Code Übersetzung

C至Rust 代码翻译的大型语言示范授权代理

2505.15858v2

06-26

Anonymized Network Sensing Graph Challenge

Anonymisierte Network Sensing Graph Challenge

匿名网络遥感图图挑战

2409.08115v2

06-26

IXAII: An Interactive Explainable Artificial Intelligence Interface for Decision Support Systems

IXAII: Interactive Explainable Artificial Intelligence Interface für Entscheidungsunterstützungssysteme

IXAII:决策支持系统互动解释人工智能接口

2506.21310v1

06-26

An object-centric core metamodel for IoT-enhanced event logs

Ein objektzentriertes Kernmetamodell für IoT-verstärkte Ereignisprotokolle

IoT 强化事件日志的以物体为中心的核心元元模型

2506.21300v1

06-26

Exploring Micro Frontends: A Case Study Application in E-Commerce

Erforschung von Micro Frontends: Eine Anwendungsfallstudie im E-Commerce

探索微观前沿:电子商务案例研究应用

2506.21297v1

06-26

KOALA: a Configurable Tool for Collecting IDE Data When Solving Programming Tasks

KOALA: Ein konfigurierbares Tool zum Sammeln von IDE-Daten beim Lösen von Programmieraufgaben

KOALA: 在解决方案拟订任务时收集 IDE 数据的配置工具

2506.21266v1

06-26

Describing Console I/O Behavior for Testing Student Submissions in Haskell

Beschreibung von Console I/O-Behavior für die Prüfung von Studentenanträgen in Haskell

哈斯凯尔测试学生提交材料的I/O行为

2008.09253v2

06-26

$T^3$: Multi-level Tree-based Automatic Program Repair with Large Language Models

$T^3$: Mehrstufige Baum-basierte automatische Programm-Reparatur mit großen Sprachmodellen

$T^3$:使用大语言模型进行多层次基于树的自动程序修复

2506.21211v1

06-26

Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks

Beibehaltung des MTEB: Langfristige Nutzungsfähigkeit und Reproduzierbarkeit von Einbettungs-Benchmarks

维护MTEB:实现嵌入基准的长期可用性和可复现性

2506.21182v1

06-26

How Good Are Synthetic Requirements ? Evaluating LLM-Generated Datasets for AI4RE

Wie gut sind synthetische Anforderungen ? Bewertung von LLM-generierten Datensätzen für AI4RE

合成要求如何好? 评价AI4RE的LLM-发光数据集

2506.21138v1

06-26

SceneGenAgent: Precise Industrial Scene Generation with Coding Agent

SceneGenAgent: Präzise industrielle Szenegenerierung mit Coding Agent

SceneGenAgent: 利用编码代理精确生成工业场景

2410.21909v3

06-26

Boosting Vulnerability Detection with Inter-function Multilateral Association Insights

Förderung der Erkennung von Schwachstellen durch multilaterale Integrations-Insights zwischen den Funktionen

与职能间多边协会透视促进脆弱性探测

2506.21014v1

06-26

ToolScan: A Benchmark for Characterizing Errors in Tool-Use LLMs

ToolScan: Ein Benchmark für die Charakterisierung von Fehlern in Tool-Use LLMs

工具扫描:工具使用 LLM 错误识别基准

2411.13547v2

Article 0

Title@2025-07-03 (4): Requirements Elicitation Follow-Up Question Generation

Title: Requirements Elicitation Follow-Up Question Generation

Voraussetzungen Elicitation Follow-Up Question Generation

问询后查询 2507.02858v1

Authors (3): Yuchen Shen, Anmol Singhal, Travis Breaux

Interviews are a widely used technique in eliciting requirements to gather stakeholder needs, preferences, and expectations for a software system. Effective interviewing requires skilled interviewers to formulate appropriate interview questions in real time while facing multiple challenges, including lack of familiarity with the domain, excessive cognitive load, and information overload that hinders how humans process stakeholders’ speech. Recently, large language models (LLMs) have exhibited state-of-the-art performance in multiple natural language processing tasks, including text summarization and entailment. To support interviewers, we investigate the application of GPT-4o to generate follow-up interview questions during requirements elicitation by building on a framework of common interviewer mistake types. In addition, we describe methods to generate questions based on interviewee speech. We report a controlled experiment to evaluate LLM-generated and human-authored questions with minimal guidance, and a second controlled experiment to evaluate the LLM-generated questions when generation is guided by interviewer mistake types. Our findings demonstrate that, for both experiments, the LLM-generated questions are no worse than the human-authored questions with respect to clarity, relevancy, and informativeness. In addition, LLM-generated questions outperform human-authored questions when guided by common mistake types. This highlights the potential of using LLMs to help interviewers improve the quality and ease of requirements elicitation interviews in real time.

访谈是需求获取中广泛使用的技术,用于收集利益相关者对软件系统的需求、偏好和期望。有效的访谈要求熟练的访谈者在面临多重挑战的同时实时提出恰当的访谈问题,这些挑战包括对领域不熟悉、认知负荷过重,以及妨碍人们处理利益相关者发言的信息过载。近来,大型语言模型(LLM)在多项自然语言处理任务中表现出最先进的性能,包括文本摘要和文本蕴含。为了支持访谈者,我们以常见访谈者错误类型的框架为基础,研究如何应用GPT-4o在需求获取过程中生成后续访谈问题。此外,我们描述了根据受访者发言生成问题的方法。我们报告了一项对照实验,用于评估在最少指导下由LLM生成的问题和人工撰写的问题;并报告了第二项对照实验,用于评估在访谈者错误类型指导下由LLM生成的问题。我们的研究结果表明,在两项实验中,LLM生成的问题在清晰度、相关性和信息量方面均不逊于人工撰写的问题。此外,在常见错误类型的指导下,LLM生成的问题优于人工撰写的问题。这凸显了利用LLM帮助访谈者实时提高需求获取访谈的质量和便利性的潜力。

Article 1

Title@2025-07-03 (4): Legal Requirements Translation from Law

Title: Legal Requirements Translation from Law

Rechtliche Voraussetzungen Übersetzung aus dem Recht

法律要求译自法律 2507.02846v1

Authors (2): Anmol Singhal, Travis Breaux

Software systems must comply with legal regulations, which is a resource-intensive task, particularly for small organizations and startups lacking dedicated legal expertise. Extracting metadata from regulations to elicit legal requirements for software is a critical step to ensure compliance. However, it is a cumbersome task due to the length and complex nature of legal text. Although prior work has pursued automated methods for extracting structural and semantic metadata from legal text, key limitations remain: they do not consider the interplay and interrelationships among attributes associated with these metadata types, and they rely on manual labeling or heuristic-driven machine learning, which does not generalize well to new documents. In this paper, we introduce an approach based on textual entailment and in-context learning for automatically generating a canonical representation of legal text, encodable and executable as Python code. Our representation is instantiated from a manually designed Python class structure that serves as a domain-specific metamodel, capturing both structural and semantic legal metadata and their interrelationships. This design choice reduces the need for large, manually labeled datasets and enhances applicability to unseen legislation. We evaluate our approach on 13 U.S. state data breach notification laws, demonstrating that our generated representations pass approximately 89.4% of test cases and achieve a precision and recall of 82.2 and 88.7, respectively.

软件系统必须遵守法律法规,这是一项资源密集型任务,对于缺乏专门法律知识的小型组织和初创企业尤其如此。从法规中提取元数据以获取软件的法律需求,是确保合规的关键步骤。然而,由于法律文本篇幅长、性质复杂,这是一项繁琐的任务。尽管先前的工作探索了从法律文本中提取结构性和语义性元数据的自动化方法,但仍存在关键局限:它们没有考虑与这些元数据类型相关的属性之间的相互作用和相互关系,并且依赖人工标注或启发式驱动的机器学习,难以很好地泛化到新文档。在本文中,我们提出一种基于文本蕴含和上下文学习的方法,用于自动生成法律文本的规范表示,该表示可编码为Python代码并可执行。我们的表示由一个手工设计的Python类结构实例化而来,该结构充当特定领域的元模型,同时捕捉结构性和语义性法律元数据及其相互关系。这一设计选择减少了对大型人工标注数据集的需求,并增强了对未见过的立法的适用性。我们在美国13个州的数据泄露通知法上评估了该方法,结果表明生成的表示通过了约89.4%的测试用例,精确率和召回率分别为82.2和88.7。
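As a rough illustration of what "a canonical representation of legal text, encodable and executable as Python code" could look like, the sketch below defines a very small class structure and instantiates it for an invented provision. The class names, fields, fact keys, and the 45-day deadline are all hypothetical; they are not taken from the paper's metamodel or from any actual statute.

```python
from dataclasses import dataclass, field


@dataclass
class Condition:
    """Semantic metadata: a named predicate over the facts of an incident."""
    description: str
    fact_key: str  # hypothetical key into a dict of incident facts

    def holds(self, facts: dict) -> bool:
        return bool(facts.get(self.fact_key, False))


@dataclass
class Obligation:
    """A requirement: who must do what, by when, and under which conditions."""
    section: str        # structural metadata (placeholder citation)
    actor: str
    action: str
    deadline_days: int
    conditions: list = field(default_factory=list)

    def applies(self, facts: dict) -> bool:
        return all(c.holds(facts) for c in self.conditions)

    def is_satisfied(self, facts: dict) -> bool:
        # An obligation that does not apply is treated as trivially satisfied.
        if not self.applies(facts):
            return True
        days = facts.get("days_until_notification")
        return days is not None and days <= self.deadline_days


# Invented provision for illustration only, not quoted from any law.
notify_residents = Obligation(
    section="Sec. X(a)",
    actor="data owner",
    action="notify affected residents",
    deadline_days=45,
    conditions=[Condition("personal information was acquired", "pi_acquired")],
)

late_case = {"pi_acquired": True, "days_until_notification": 60}
print(notify_residents.is_satisfied(late_case))
```

Because the representation is executable, a test case reduces to a dictionary of facts plus an expected verdict, which is presumably what makes a pass rate over test cases measurable.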

Article 2

Title@2025-07-03 (4): Agentic Business Process Management: Practitioner Perspectives on Agent Governance in Business Processes

Title: Agentic Business Process Management: Practitioner Perspectives on Agent Governance in Business Processes

Agentic Business Process Management: Praxisperspektiven zur Agenten-Governance in Unternehmensprozessen

代理业务流程管理:从业者对业务流程代理治理的看法 2504.03693v2

Authors (5): Hoang Vu, Nataliia Klievtsova, Henrik Leopold, Stefanie Rinderle-Ma, Timotheus Kampik

With the rise of generative AI, industry interest in software agents is growing. Given the stochastic nature of generative AI-based agents, their effective and safe deployment in organizations requires robust governance, which can be facilitated by agentic business process management. However, given the nascence of this new-generation agent notion, it is not clear what BPM practitioners consider to be an agent, and what benefits, risks and governance challenges they associate with agent deployments. To investigate how organizations can effectively govern AI agents, we conducted a qualitative study involving semi-structured interviews with 22 BPM practitioners from diverse industries. They anticipate that agents will enhance efficiency, improve data quality, ensure better compliance, and boost scalability through automation, while also cautioning against risks such as bias, over-reliance, cybersecurity threats, job displacement, and ambiguous decision-making. To address these challenges, the study presents six key recommendations for the responsible adoption of AI agents: define clear business goals, set legal and ethical guardrails, establish human-agent collaboration, customize agent behavior, manage risks, and ensure safe integration with fallback options. Additionally, the paper outlines actions to align traditional BPM with agentic AI, including balancing human and agent roles, redefining human involvement, adapting process structures, and introducing performance metrics. These insights provide a practical foundation for integrating AI agents into business processes while preserving oversight, flexibility, and trust.

随着基因化的AI的兴起,工业界对软件代理的兴趣正在增加。鉴于基因化的AI型代理的随机性质,在组织中有效和安全地部署它们需要强有力的治理,而这种治理可以通过代理业务流程管理加以促进。然而,鉴于这种新一代代理概念的诞生,目前尚不清楚BPM从业人员认为什么是代理,以及他们与代理部署有关哪些好处、风险和治理挑战。为了调查各组织如何能够有效地管理AI代理,我们开展了一项定性研究,涉及与不同行业22名BPM从业人员的半结构性访谈。他们预计,代理将提高效率,提高数据质量,确保更好的遵守,并通过自动化提高可扩展性,同时告诫人们避免偏见、过度依赖、网络安全威胁、工作流离失所和模棱两可的决策等风险。为了应对这些挑战,BPM从业人员研究提出了六项重要建议,以便负责地采用AI代理:确定明确的商业目标、设置法律和道德护栏、建立人类代理协作、定制代理行为、管理风险和确保安全地与倒行选项相结合。此外,文件概述了为使传统的BPM结构结构与代理人参与和重新确定机构的工作基础。

Article 3

Title@2025-07-03 (4): Gradient-Based Model Fingerprinting for LLM Similarity Detection and Family Classification

Title: Gradient-Based Model Fingerprinting for LLM Similarity Detection and Family Classification

Gradient-Based Model Fingerprinting für LLM Ähnlichkeitserkennung und Familienklassifizierung

基于梯度的模型指纹:用于LLM相似性检测和家族分类 2506.01631v2

Authors (3): Zehao Wu, Yanjie Zhao, Haoyu Wang

As Large Language Models (LLMs) become integral software components in modern applications, unauthorized model derivations through fine-tuning, merging, and redistribution have emerged as critical software engineering challenges. Unlike traditional software where clone detection and license compliance are well-established, the LLM ecosystem lacks effective mechanisms to detect model lineage and enforce licensing agreements. This gap is particularly problematic when open-source model creators, such as Meta’s LLaMA, require derivative works to maintain naming conventions for attribution, yet no technical means exist to verify compliance. To fill this gap, treating LLMs as software artifacts requiring provenance tracking, we present TensorGuard, a gradient-based fingerprinting framework for LLM similarity detection and family classification. Our approach extracts model-intrinsic behavioral signatures by analyzing gradient responses to random input perturbations across tensor layers, operating independently of training data, watermarks, or specific model formats. TensorGuard supports the widely-adopted safetensors format and constructs high-dimensional fingerprints through statistical analysis of gradient features. These fingerprints enable two complementary capabilities: direct pairwise similarity assessment between arbitrary models through distance computation, and systematic family classification of unknown models via the K-Means clustering algorithm with domain-informed centroid initialization using known base models. Experimental evaluation on 58 models comprising 8 base models and 50 derivatives across five model families (Llama, Qwen, Gemma, Phi, Mistral) demonstrates 94% classification accuracy under our centroid-initialized K-Means clustering.

随着大型语言模型(LLMS)成为现代应用中综合软件组成部分,未经授权的模型通过微调、合并和再分配得出,已成为关键的软件工程挑战。与克隆检测和许可证合规性已经确立的传统软件不同,LLM生态系统缺乏检测模型线和执行许可证协议的有效机制。当开源模型创建者(如Meta’s LalaMA)需要派生工程来维持归属的命名公约,而没有技术手段来核查其遵守情况时,这一差距尤其成问题。为了填补这一空白,将LLMs作为需要源头跟踪的软件工艺品处理,我们提出了TensorGuard,这是一个基于梯度的指纹框架,用于LLM的类似性检测和家庭分类。我们的方法提取了模型的模型,即:通过分析对多层随机输入的梯度反应以及实施许可证协议协议协议协议协议协议。独立于培训数据、水标记或具体模型,支持广泛采用的安全传感器格式,并通过对梯度特征进行统计分析来建立高维度指纹。这些指纹使两种互补能力成为:直接对准的类似性QQ,50个任意性模型之间,通过KML模型,通过基础的原始模型进行任意性模型,通过已知的模型进行基础计算。
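The family-classification step described in the abstract (K-Means with centroids initialized from known base models) can be sketched in a few lines. This is a minimal pure-Python sketch under stated assumptions: the two-dimensional "fingerprints" and the model names are made up, whereas the paper builds high-dimensional fingerprints from gradient statistics, which is not reproduced here.

```python
import math


def distance(a, b):
    """Euclidean distance between two fingerprint vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def classify_families(fingerprints, base_fingerprints, iterations=10):
    """K-Means whose initial centroids are the known base-model fingerprints.

    fingerprints: dict of model name -> vector (models to classify)
    base_fingerprints: dict of family name -> vector (one base model per family)
    Returns a dict of model name -> family name.
    """
    families = list(base_fingerprints)
    centroids = {f: list(base_fingerprints[f]) for f in families}
    labels = {}
    for _ in range(iterations):
        # Assignment step: each model joins the nearest family centroid.
        labels = {
            name: min(families, key=lambda f: distance(vec, centroids[f]))
            for name, vec in fingerprints.items()
        }
        # Update step: move each centroid to the mean of its members.
        for f in families:
            members = [fingerprints[n] for n, lab in labels.items() if lab == f]
            if members:
                centroids[f] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy vectors, invented for illustration.
bases = {"Llama": [0.0, 0.0], "Qwen": [10.0, 10.0]}
unknown = {"llama-ft": [1.0, 0.5], "qwen-merge": [9.0, 11.0], "mystery": [2.0, 1.0]}
print(classify_families(unknown, bases))
```

Seeding the centroids with base models means each cluster already carries a family name, so no separate cluster-to-family matching step is needed after convergence.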

Article 4

Title@2025-07-03 (4): Fuzzing-based Mutation Testing of C/C++ Software in Cyber-Physical Systems

Title: Fuzzing-based Mutation Testing of C/C++ Software in Cyber-Physical Systems

Fuzzing-basierte Mutationsprüfung von C/C++-Software in Cyber-Physical Systems

网络物理系统中C/C++软件的基于模糊测试的变异测试 2503.24100v3

Authors (3): Jaekwon Lee, Fabrizio Pastore, Lionel Briand

Mutation testing can help minimize the delivery of faulty software. Therefore, it is a recommended practice for developing embedded software in safety-critical cyber-physical systems (CPS). However, state-of-the-art mutation testing techniques for C and C++ software, which are common languages for CPS, depend on symbolic execution. Unfortunately, symbolic execution’s limitations hinder its applicability (e.g., systems with black-box components). We propose relying on fuzz testing, which has demonstrated its effectiveness for C and C++ software. Fuzz testing tools automatically create test inputs that explore program branches in various ways, exercising statements in different program states, and thus enabling the detection of mutants, which is our objective. We empirically evaluated our approach using software components from operational satellite systems. Our assessment shows that our approach can detect between 40% and 90% of the mutants not detected by developers’ test suites. Further, we empirically determined that the best results are obtained by integrating the Clang compiler, a memory address sanitizer, and relying on laf-intel instrumentation to collect coverage and guide fuzzing. Our approach detects a significantly higher percentage of live mutants compared to symbolic execution, with an increase of up to 50 percentage points; further, we observed that although the combination of fuzzing and symbolic execution leads to additional mutants being killed, the benefits are minimal (a gain of less than one percentage point).

变异测试有助于尽量减少有缺陷软件的交付。因此,它是开发安全关键型信息物理系统(CPS)嵌入式软件的推荐做法。然而,针对C和C++软件(CPS的常用语言)的最先进变异测试技术依赖符号执行。遗憾的是,符号执行的局限性妨碍了它的适用性(例如,带有黑盒组件的系统)。我们提议依靠模糊测试,它已在C和C++软件上证明了其有效性。模糊测试工具自动生成测试输入,以各种方式探索程序分支,在不同的程序状态下执行语句,从而能够检测变异体,这正是我们的目标。我们使用来自在运行卫星系统的软件组件对该方法进行了实证评估。评估表明,我们的方法可以检测出开发者测试套件未检测到的变异体中的40%至90%。此外,我们通过实证确定,集成Clang编译器和内存地址检查器(sanitizer),并依靠laf-intel插桩来收集覆盖率和引导模糊测试,可以获得最佳结果。与符号执行相比,我们的方法检测到的存活变异体比例显著更高,最多提高50个百分点;此外,我们观察到,尽管将模糊测试与符号执行相结合可以杀死更多变异体,但收益很小(增益不到一个百分点)。
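The core idea, that a mutant is detected ("killed") when some generated input makes it behave differently from the original, can be shown with a toy differential check. This is only a sketch: it is written in Python rather than C/C++, it uses plain random inputs where the paper relies on a coverage-guided fuzzer, and the function `scale_reading` and both mutants are invented for illustration.

```python
import random


def scale_reading(raw, gain):
    """Original code under test: negative sensor readings are clipped to 0."""
    if raw < 0:
        return 0
    return raw * gain


def mutant_flipped(raw, gain):
    """Mutant: relational operator '<' replaced by '>'."""
    if raw > 0:
        return 0
    return raw * gain


def mutant_equivalent(raw, gain):
    """Mutant: '<' replaced by '<='. Equivalent, since 0 * gain is also 0."""
    if raw <= 0:
        return 0
    return raw * gain


def fuzz_for_kill(original, mutant, trials=1000, seed=0):
    """Return an input on which the outputs differ, or None if none is found."""
    rng = random.Random(seed)
    for _ in range(trials):
        args = (rng.randint(-100, 100), rng.randint(1, 10))
        if original(*args) != mutant(*args):
            return args
    return None


print(fuzz_for_kill(scale_reading, mutant_flipped) is not None)
print(fuzz_for_kill(scale_reading, mutant_equivalent))
```

The second mutant survives every input because it is semantically equivalent to the original, which is one reason reported kill rates in mutation testing rarely reach 100%.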

Article 5

Title@2025-07-03 (4): Sustainability Flags for the Identification of Sustainability Posts in Q&A Platforms

Title: Sustainability Flags for the Identification of Sustainability Posts in Q&A Platforms

Nachhaltigkeitsflaggen für die Identifizierung von Nachhaltigkeitsbeiträgen in Q&A-Plattformen

用于识别Q&A平台中可持续性帖子的可持续性标志 2507.02695v1

Authors (4): Sahar Ahmadisakha, Lech Bialek, Mohamed Soliman, Vasilios Andrikopoulos

In recent years, sustainability in software systems has gained significant attention, especially with the rise of cloud computing and the shift towards cloud-based architectures. This shift has intensified the need to identify sustainability in architectural discussions to take informed architectural decisions. One source to see these decisions is in online Q&A forums among practitioners’ discussions. However, recognizing sustainability concepts within software practitioners’ discussions remains challenging due to the lack of clear and distinct guidelines for this task. To address this issue, we introduce the notion of sustainability flags as pointers in relevant discussions, developed through thematic analysis of multiple sustainability best practices from cloud providers. This study further evaluates the effectiveness of these flags in identifying sustainability within cloud architecture posts, using a controlled experiment. Preliminary results suggest that the use of flags results in classifying fewer posts as sustainability-related compared to a control group, with moderately higher certainty and significantly improved performance. Moreover, sustainability flags are perceived as more useful and understandable than relying solely on definitions for identifying sustainability.

近年来,软件系统的可持续性受到极大关注,特别是随着云计算上升和云基结构向云型结构的转变,这一转变使得更有必要确定建筑讨论的可持续性,以便作出知情的建筑决定。看到这些决定的一个来源是从业人员讨论的在线论坛。然而,由于软件从业人员讨论缺乏对这项任务的明确和明确的指导方针,在软件系统的可持续性概念方面仍然具有挑战性。为解决这一问题,我们通过对云供应商的多重可持续性最佳做法进行专题分析,在相关讨论中引入了可持续性标志的概念。这项研究进一步评估了这些标志在利用受控试验确定云层结构员额内的可持续性方面的有效性。初步结果表明,使用旗帜的结果是,与控制组相比,与可持续性有关的职位分类较少,具有适度的确定性和显著的改进性。此外,可持续性标志被认为比仅仅依靠确定可持续性的定义更为有用和易懂。

Article 6

Title@2025-07-03 (4): RLHGNN: Reinforcement Learning-driven Heterogeneous Graph Neural Network for Next Activity Prediction in Business Processes

Title: RLHGNN: Reinforcement Learning-driven Heterogeneous Graph Neural Network for Next Activity Prediction in Business Processes

RLHGNN: Verstärkung Lernorientiertes Heterogenes Graph Neuronales Netzwerk für die nächste Aktivitätsvorhersage in Geschäftsprozessen

RLHGNN: 业务流程下一个活动预测的强化学习驱动的异质图形神经网络 2507.02690v1

Authors (6): Jiaxing Wang, Yifeng Yu, Jiahan Song, Bin Cao, Jing Fan, Ji Zhang

Next activity prediction represents a fundamental challenge for optimizing business processes in service-oriented architectures such as microservices environments, distributed enterprise systems, and cloud-native platforms, which enables proactive resource allocation and dynamic service composition. Despite the prevalence of sequence-based methods, these approaches fail to capture non-sequential relationships that arise from parallel executions and conditional dependencies. Even though graph-based approaches address structural preservation, they suffer from homogeneous representations and static structures that apply uniform modeling strategies regardless of individual process complexity characteristics. To address these limitations, we introduce RLHGNN, a novel framework that transforms event logs into heterogeneous process graphs with three distinct edge types grounded in established process mining theory. Our approach creates four flexible graph structures by selectively combining these edges to accommodate different process complexities, and employs reinforcement learning formulated as a Markov Decision Process to automatically determine the optimal graph structure for each specific process instance. RLHGNN then applies heterogeneous graph convolution with relation-specific aggregation strategies to effectively predict the next activity. This adaptive methodology enables precise modeling of both sequential and non-sequential relationships in service interactions. Comprehensive evaluation on six real-world datasets demonstrates that RLHGNN consistently outperforms state-of-the-art approaches. Furthermore, it maintains an inference latency of approximately 1 ms per prediction, representing a highly practical solution suitable for real-time business process monitoring applications. The source code is available at https://github.com/Joker3993/RLHGNN.

下一个活动预测是优化服务导向架构(如微观服务环境、分布式企业系统、云型平台等)业务流程的一个根本挑战,这些架构有助于积极主动的资源分配和动态服务构成。尽管以顺序为基础的方法十分普遍,但这些方法未能捕捉平行处决和有条件依赖所产生的非序列关系。即使以图表为基础的方法处理结构保护问题,它们也存在同质表述和静态结构,这些结构适用统一的模型战略,而不论个别程序复杂特性如何。为了应对这些限制,我们引入了RLHGNN,这是一个新颖的框架,将事件日志转换成基于三个不同边缘类型的不同流程流程图解。我们的方法创建了四个灵活的图表结构,有选择地结合这些边框以适应不同流程的复杂性,并利用作为Markov 决策程序开发的强化学习,以自动确定每个特定流程的最佳图形结构。RLHGNNN,然后将混杂的图表组合组合和具体关联的汇总战略用于有效预测下一个活动。这一适应方法可以精确地模拟服务互动中的序列和非序列进程。我们的方法在六个实体/系统内部数据库中持续地显示一个最新的实时数据库。

Article 7

Title@2025-07-03 (4): Do Research Software Engineers and Software Engineering Researchers Speak the Same Language?

Title: Do Research Software Engineers and Software Engineering Researchers Speak the Same Language?

Sprechen Forschungssoftware-Ingenieure und Software-Engineering-Forscher die gleiche Sprache?

研究软件工程师和软件工程研究者会说同一种语言吗? 2507.02665v1

Authors (5): Timo Kehrer, Robert Haines, Guido Juckeland, Shurui Zhou, David E. Bernholdt

Anecdotal evidence suggests that Research Software Engineers (RSEs) and Software Engineering Researchers (SERs) often use different terminologies for similar concepts, creating communication challenges. To better understand these divergences, we have started investigating how SE fundamentals from the SER community are interpreted within the RSE community, identifying aligned concepts, knowledge gaps, and areas for potential adaptation. Our preliminary findings reveal opportunities for mutual learning and collaboration, and our systematic methodology for terminology mapping provides a foundation for a crowd-sourced extension and validation in the future.

传闻证据表明,研究软件工程师(RSE)和软件工程研究人员(SERs)经常对类似概念使用不同的术语,从而产生沟通挑战。为更好地了解这些差异,我们已开始调查SER社区的SE基本原理是如何在RSE社区内部解释的,确定一致的概念、知识差距和潜在的适应领域。我们的初步发现揭示了相互学习与合作的机会,以及我们系统的术语绘图方法为今后扩大和验证人群来源提供了基础。

Article 8

Title@2025-07-03 (4): Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure

Title: Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure

Erklärbare Compliance-Erkennung mit Multi-Hop-Natural Language-Schlussfolgerung zur Assurance-Fallstruktur

以多种自然语言对保证案例结构的多重语言推断进行可解释的合规检测 2506.08713v2

Authors (2): Fariz Ikhwantri, Dusica Marijan

Ensuring complex systems meet regulations typically requires checking the validity of assurance cases through a claim-argument-evidence framework. Some challenges in this process include the complicated nature of legal and technical texts, the need for model explanations, and limited access to assurance case data. We propose a compliance detection approach based on Natural Language Inference (NLI): EXplainable CompLiance detection with Argumentative Inference of Multi-hop reasoning (EXCLAIM). We formulate the claim-argument-evidence structure of an assurance case as a multi-hop inference for explainable and traceable compliance detection. We address the limited number of assurance cases by generating them using large language models (LLMs). We introduce metrics that measure the coverage and structural consistency. We demonstrate the effectiveness of the generated assurance case from GDPR requirements in a multi-hop inference task as a case study. Our results highlight the potential of NLI-based approaches in automating the regulatory compliance process.

确保复杂的系统符合规章条例,通常需要通过索赔-论证-证据框架检查保证案件的有效性。这一过程的一些挑战包括法律和技术文本性质复杂,需要示范解释,以及获得保证案件数据的机会有限。我们建议根据自然语言推断(NLI):利用多点推理推理(EXCLAIM)的推理推理(EXCLAIM)进行可推广共测。我们将保证案件的索赔-论证-证据结构作为可解释和可追踪的遵守检测的多重推理。我们用大型语言模型(LLLMs)生成的保证案件数量有限。我们提出了衡量覆盖面和结构一致性的衡量标准。我们作为案例研究,展示了GDPR要求产生的保证案件的有效性。我们的结果突出了以NLI为基础的方法在监管遵守过程自动化方面的潜力。

Article 9

Title@2025-07-03 (4): Alleviating Attack Data Scarcity: SCANIA’s Experience Towards Enhancing In-Vehicle Cyber Security Measures

Title: Alleviating Attack Data Scarcity: SCANIA’s Experience Towards Enhancing In-Vehicle Cyber Security Measures

Benachteiligung von Angriffsdaten: SCANIAs Erfahrung zur Verbesserung von Cybersicherheitsmaßnahmen im Fahrzeug

减轻攻击数据稀缺性:SCANIA在加强车辆内部网络安全措施方面的经验 2507.02607v1

Authors (5): Frida Sundfeldt, Bianca Widstam, Mahshid Helali Moghadam, Kuo-Yun Liang, Anders Vesterberg

The digital evolution of connected vehicles and the subsequent security risks emphasize the critical need for implementing in-vehicle cyber security measures such as intrusion detection and response systems. The continuous advancement of attack scenarios further highlights the need for adaptive detection mechanisms that can detect evolving, unknown, and complex threats. The effective use of ML-driven techniques can help address this challenge. However, constraints on implementing diverse attack scenarios on test vehicles due to safety, cost, and ethical considerations result in a scarcity of data representing attack scenarios. This limitation necessitates alternative efficient and effective methods for generating high-quality attack-representing data. This paper presents a context-aware attack data generator that generates attack inputs and corresponding in-vehicle network log, i.e., controller area network (CAN) log, representing various types of attack including denial of service (DoS), fuzzy, spoofing, suspension, and replay attacks. It utilizes parameterized attack models augmented with CAN message decoding and attack intensity adjustments to configure the attack scenarios with high similarity to real-world scenarios and promote variability. We evaluate the practicality of the generated attack-representing data within an intrusion detection system (IDS) case study, in which we develop and perform an empirical evaluation of two deep neural network IDS models using the generated data. In addition to the efficiency and scalability of the approach, the performance results of IDS models, high detection and classification capabilities, validate the consistency and effectiveness of the generated data as well. In this experience study, we also elaborate on the aspects influencing the fidelity of the data to real-world scenarios and provide insights into its application.

接通车辆的数字演进和随后的安全风险强调,迫切需要实施车辆内网络安全措施,如入侵探测和反应系统。攻击情景的持续推进进一步突出表明,需要建立适应性检测机制,以探测不断变化的、未知的和复杂的威胁。有效利用ML驱动的技术可以帮助应对这一挑战。但是,由于安全、成本和道德因素,对测试车辆实施不同攻击情景的制约导致缺少代表攻击情景的数据。这种限制要求以高效和有效的替代方法生成高质量的攻击代表数据。本文介绍了一种环境觉察攻击数据生成器,生成攻击投入和与车辆网络日志相对应,即控制区网络日志,代表各种类型的攻击,包括拒绝服务(Do),模糊、潜伏、暂停和重弹攻击。由于安全、成本和道德因素等因素,对测试车辆实施不同攻击情景的制约,导致袭击情景的解析和强度调整,从而使得袭击情景与现实世界情景高度相似,并促进变异性。我们评估了在入侵探测系统内生成的攻击数据的实际数据的实际数据,并评估了这一系统测试和数据测算的准确性,我们利用了这一系统测算结果的测试结果,并更新了数据,从而评估了对数据进行了数据评估。
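A parameterized attack model of the kind the abstract mentions can be pictured as a function that takes a benign CAN log plus a few knobs (start time, duration, intensity) and returns a labeled, mixed log. The sketch below does this for a flooding-style DoS attack only. The `CanFrame` type, the parameter names, and the sample frames are hypothetical and do not reflect SCANIA's generator or log format; the one domain fact it relies on is that lower CAN identifiers win bus arbitration, so ID 0x000 has the highest priority.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanFrame:
    timestamp: float  # seconds
    can_id: int       # arbitration ID, lower value means higher priority
    data: bytes
    label: str        # "normal" or the attack type


def inject_dos(log, start_s, duration_s, rate_hz):
    """Flood the bus with highest-priority frames for a time window.

    rate_hz acts as the attack-intensity parameter.
    """
    count = round(duration_s * rate_hz)
    attack = [
        CanFrame(start_s + i / rate_hz, 0x000, bytes(8), "dos")
        for i in range(count)
    ]
    # Merge benign and attack frames back into timestamp order.
    return sorted(list(log) + attack, key=lambda frame: frame.timestamp)


# Invented benign traffic for illustration.
normal_log = [
    CanFrame(0.00, 0x1A0, bytes([0x10, 0x27]), "normal"),
    CanFrame(0.10, 0x2B4, bytes([0x00]), "normal"),
    CanFrame(0.20, 0x1A0, bytes([0x11, 0x27]), "normal"),
]
mixed_log = inject_dos(normal_log, start_s=0.05, duration_s=0.1, rate_hz=100)
print(len(mixed_log), sum(f.label == "dos" for f in mixed_log))
```

Keeping the label on every frame is what makes the output directly usable as supervised training data for an intrusion detection model.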

Article 10

Title@2025-07-03 (4): Human-Machine Collaboration and Ethical Considerations in Adaptive Cyber-Physical Systems

Title: Human-Machine Collaboration and Ethical Considerations in Adaptive Cyber-Physical Systems

Mensch-Maschine-Kollaboration und ethische Überlegungen in adaptiven Cyber-Physischen Systemen

适应性网络-物理系统中的人机协作和伦理考虑 2507.02578v1

Authors (1): Zoe Pfister

Adaptive Cyber-Physical Systems (CPS) are systems that integrate both physical and computational capabilities, which can adjust in response to changing parameters. Furthermore, they increasingly incorporate human-machine collaboration, allowing them to benefit from the individual strengths of humans and machines. Human-Machine Teaming (HMT) represents the most advanced paradigm of human-machine collaboration, envisioning seamless teamwork between humans and machines. However, achieving effective and seamless HMT in adaptive CPS is challenging. While adaptive CPS already benefit from feedback loops such as MAPE-K, there is still a gap in integrating humans into these feedback loops due to different operational cadences of humans and machines. Further, HMT requires constant monitoring of human operators, collecting potentially sensitive information about their actions and behavior. Respecting the privacy and human values of the actors of the CPS is crucial for the success of human-machine teams. This research addresses these challenges by: (1) developing novel methods and processes for integrating HMT into adaptive CPS, focusing on human-machine interaction principles and their incorporation into adaptive feedback loops found in CPS, and (2) creating frameworks for integrating, verifying, and validating ethics and human values throughout the system lifecycle, starting from requirements engineering.

适应性网络-物理系统是综合物理和计算能力的系统,可以根据不断变化的参数进行调整。此外,这些系统还日益纳入人体-机器合作,使人体-机器合作能够受益于人类和机器的个人优势。人类-机器团队(HMT)代表了人类-机器合作的最先进范例,设想了人与机器之间的无缝协作。然而,在适应性CPS中实现有效和无缝的HMT具有挑战性。适应性计算机-计算机系统已经受益于诸如MAPE-K等反馈循环,但在将人纳入这些反馈循环方面仍然存在差距,因为人类和机器的操作速度不同。此外,HMT要求不断监测人类操作者,收集关于其行动和行为的潜在敏感信息。尊重计算机-机器行为者的隐私和人类价值观对于人类-机器团队的成功至关重要。这一研究通过以下方式应对这些挑战:(1) 开发新的方法和进程,将HMT纳入适应性计算机-K,重点是人-机器互动原则,并将其纳入CPS所发现的适应性反馈循环。此外,HMM要求从整个人类-生命周期中开始建立框架,核查和验证。

Article 11

Title@2025-07-03 (4): LLMREI: Automating Requirements Elicitation Interviews with LLMs

Title: LLMREI: Automating Requirements Elicitation Interviews with LLMs

LLMREI: Automatisieren der Anforderungen Bereitstellung von Interviews mit LLMs

LLMREI: 利用LLM自动化需求获取访谈 2507.02564v1

Authors (3): Alexander Korn, Samuel Gorsch, Andreas Vogelsang

Requirements elicitation interviews are crucial for gathering system requirements but heavily depend on skilled analysts, making them resource-intensive, susceptible to human biases, and prone to miscommunication. Recent advancements in Large Language Models present new opportunities for automating parts of this process. This study introduces LLMREI, a chat bot designed to conduct requirements elicitation interviews with minimal human intervention, aiming to reduce common interviewer errors and improve the scalability of requirements elicitation. We explored two main approaches, zero-shot prompting and least-to-most prompting, to optimize LLMREI for requirements elicitation and evaluated its performance in 33 simulated stakeholder interviews. A third approach, fine-tuning, was initially considered but abandoned due to poor performance in preliminary trials. Our study assesses the chat bot’s effectiveness in three key areas: minimizing common interview errors, extracting relevant requirements, and adapting its questioning based on interview context and user responses. Our findings indicate that LLMREI makes a similar number of errors compared to human interviewers, is capable of extracting a large portion of requirements, and demonstrates a notable ability to generate highly context-dependent questions. We envision the greatest benefit of LLMREI in automating interviews with a large number of stakeholders.

在收集系统要求方面,要求引导面试至关重要,但在很大程度上依赖熟练的分析人员,使他们资源密集,易受人类偏见的影响,容易出现沟通错误。大语言模型最近的进展为这一进程的部分自动化提供了新的机会。本研究报告介绍了LLMREI,这是一个聊天机,旨在进行要求引导访谈,尽量减少人力干预,目的是减少常见的面谈错误,提高需求的可调适性。我们探索了两种主要方法,即零点点反应和最不直接的激励,优化LLMMREI在33次模拟利益攸关方访谈中的需求征求和评价其业绩。第三种方法,即微调,最初被考虑过,但因初步审判工作表现不佳而放弃。我们的研究评估了聊天机在三个关键领域的有效性:尽量减少常见的面谈错误,提取相关要求,并根据访谈背景和用户回应调整询问。我们的调查结果表明,LMMREI与人访谈者相比,其错误数量相似,能够提取大量需求,并表明具有产生高度背景依赖性问题的明显能力。我们设想LMREI访谈的最大好处是,与大量利益攸关方进行自动访谈。

Article 12

Title@2025-07-03 (4): Meta-Fair: AI-Assisted Fairness Testing of Large Language Models

Title: Meta-Fair: AI-Assisted Fairness Testing of Large Language Models

Meta-Fair: AI-Assisted Fairness Testing von großen Sprachmodellen

Meta-Fair:AI协助大语言模型公平性测试 2507.02533v1

Authors (6): Miguel Romero-Arjona, José A. Parejo, Juan C. Alonso, Ana B. Sánchez, Aitor Arrieta, Sergio Segura

Fairness, the absence of unjustified bias, is a core principle in the development of Artificial Intelligence (AI) systems, yet it remains difficult to assess and enforce. Current approaches to fairness testing in large language models (LLMs) often rely on manual evaluation, fixed templates, deterministic heuristics, and curated datasets, making them resource-intensive and difficult to scale. This work aims to lay the groundwork for a novel, automated method for testing fairness in LLMs, reducing the dependence on domain-specific resources and broadening the applicability of current approaches. Our approach, Meta-Fair, is based on two key ideas. First, we adopt metamorphic testing to uncover bias by examining how model outputs vary in response to controlled modifications of input prompts, defined by metamorphic relations (MRs). Second, we propose exploiting the potential of LLMs for both test case generation and output evaluation, leveraging their capability to generate diverse inputs and classify outputs effectively. The proposal is complemented by three open-source tools supporting LLM-driven generation, execution, and evaluation of test cases. We report the findings of several experiments involving 12 pre-trained LLMs, 14 MRs, 5 bias dimensions, and 7.9K automatically generated test cases. The results show that Meta-Fair is effective in uncovering bias in LLMs, achieving an average precision of 92% and revealing biased behaviour in 29% of executions. Additionally, LLMs prove to be reliable and consistent evaluators, with the best-performing models achieving F1-scores of up to 0.79. Although non-determinism affects consistency, these effects can be mitigated through careful MR design. While challenges remain to ensure broader applicability, the results indicate a promising path towards an unprecedented level of automation in LLM testing.

在开发人工智能(AI)系统时,没有不合理的偏差,这是一条核心原则;公平-公平-公平-公平-没有不合理的偏向-在开发人工智能(AI)系统时,仍然难以评估和执行。目前对大语言模型(LLMS)进行公平测试的实用性,往往依靠人工评估、固定模板、确定性超强和整理数据集,使其资源密集,难以规模化。这项工作的目的是为在LLMS中测试公平性、减少对特定领域资源的依赖和扩大当前方法的适用性等创新、自动化方法奠定基础。我们的Meta-Fair方法基于两个关键理念。首先,我们采用变形测试来发现偏向性,通过检查模型产出在对投入速度的受控修改(由MRRs)中如何不同。第二,我们提议利用LMMS的潜力来进行测试案例的生成和产出评估,利用它们的能力来产生各种投入和对产出进行有效的分类。这项提议得到三个公开源工具的补充,支持LMD-F驱动的生成、执行和测试案例的评估。我们报告说,在FMLMS-ralMS(LMS)前的正确-ral-ral-r)中,这些测试中,这些测试结果是自动、14的正正正正正正正正正正正正正正对结果显示。
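A metamorphic relation for fairness can be read as: change only a demographic attribute in the prompt, and the answer should not change. The sketch below checks one such relation against two stub "models". Everything in it is an assumption made for illustration: the stubs stand in for real LLM calls, the prompt template is invented, and exact string equality replaces the LLM-based output evaluation that the abstract describes.

```python
def biased_toy_model(prompt: str) -> str:
    """Deliberately biased stand-in for an LLM under test."""
    return "80000" if "elderly" in prompt else "100000"


def fair_toy_model(prompt: str) -> str:
    """Stand-in for a model that ignores the demographic attribute."""
    return "100000"


def attribute_swap_holds(model, template: str, source: str, follow_up: str) -> bool:
    """Check the relation: swapping the attribute must not change the output."""
    source_output = model(template.format(person=source))
    follow_up_output = model(template.format(person=follow_up))
    return source_output == follow_up_output


TEMPLATE = "Suggest a yearly salary for a {person} engineer with 5 years of experience."

print(attribute_swap_holds(biased_toy_model, TEMPLATE, "young", "elderly"))
print(attribute_swap_holds(fair_toy_model, TEMPLATE, "young", "elderly"))
```

No ground-truth salary is needed: the test only compares the model with itself, which is what lets metamorphic testing sidestep curated, labeled datasets.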

Article 13

Title@2025-07-03 (4): Code Digital Twin: Empowering LLMs with Tacit Knowledge for Complex Software Maintenance

Title: Code Digital Twin: Empowering LLMs with Tacit Knowledge for Complex Software Maintenance

Code Digital Twin: LLMs mit Tacit-Kenntnissen für komplexe Software-Wartung stärken

数字代码双对:赋予LLMs以用于复杂软件维护的隐秘知识 2503.07967v2

Authors (5): Xin Peng, Chong Wang, Mingwei Liu, Yiling Lou, Yijian Wu

While large language models (LLMs) have demonstrated promise in software engineering tasks like code completion and generation, their support for the maintenance of complex software systems remains limited. These models often struggle with understanding the tacit knowledge embedded in systems, such as responsibility allocation and collaboration across different modules. To address this gap, we introduce the concept and framework of Code Digital Twin, a conceptual representation of tacit knowledge that captures the concepts, functionalities, and design rationales behind code elements, co-evolving with the software. A code digital twin is constructed using a methodology that combines knowledge extraction from both structured and unstructured sources (such as source code, documentation, and change histories), leveraging LLMs, static analysis tools, and human expertise. This framework can empower LLMs for software maintenance tasks such as issue localization and repository-level code generation by providing tacit knowledge as contexts. Based on the proposed methodology, we explore the key challenges and opportunities involved in the continuous construction and refinement of the code digital twin.

虽然大型语言模型(LLMS)在代码完成和生成等软件工程任务方面表现出了希望,但它们对复杂软件系统维护的支持仍然有限,这些模型往往难以理解各模块的责任分配和协作等系统内隐性知识。为了弥补这一差距,我们引入了以下概念和框架:ctextbf{Code Digital Twin},这是一个隐性知识的概念和框架,它反映了代码要素背后的概念、功能和设计原理,与软件共同演化。一个代码数字双对是用一种方法构建的,它将来自结构化和非结构化来源的知识提取(例如源代码、文件、改变历史-杠杆LMS、静态分析工具和人类专门知识)结合起来。这个框架可以通过提供隐性知识作为背景,来增强LLMS对软件维护任务的能力,例如发布本地化和存储库级代码生成。我们根据拟议的方法,探讨了持续构建和完善代码数字双对代码的构建和完善所涉及的主要挑战和机遇。

Article 14

Title@2025-07-03 (4): VeFIA: An Efficient Inference Auditing Framework for Vertical Federated Collaborative Software

Title: VeFIA: An Efficient Inference Auditing Framework for Vertical Federated Collaborative Software

VeFIA: Ein effizientes Inferenz-Audit-Framework für vertical Federated Collaborative Software

VeFIA: 面向纵向联邦协作软件的高效推理审计框架 2507.02376v1

Authors (6): Chung-ju Huang, Ziqi Zhang, Yinggui Wang, Binghui Wang, Tao Wei, Leye Wang

Vertical Federated Learning (VFL) is a distributed AI software deployment mechanism for cross-silo collaboration without accessing participants’ data. However, existing VFL work lacks a mechanism to audit the execution correctness of the inference software of the data party. To address this problem, we design a Vertical Federated Inference Auditing (VeFIA) framework. VeFIA helps the task party to audit whether the data party’s inference software is executed as expected during large-scale inference without leaking the data privacy of the data party or introducing additional latency to the inference system. The core of VeFIA is that the task party can use the inference results from a framework with Trusted Execution Environments (TEE) and the coordinator to validate the correctness of the data party’s computation results. VeFIA guarantees that, as long as the abnormal inference exceeds 5.4%, the task party can detect execution anomalies in the inference software with a probability of 99.99%, without incurring any additional online inference latency. VeFIA’s random sampling validation achieves 100% positive predictive value, negative predictive value, and true positive rate in detecting abnormal inference. To the best of our knowledge, this is the first paper to discuss the correctness of inference software execution in VFL.

纵向联邦学习(VFL)是一种分布式AI软件部署机制,用于在不访问参与方数据的情况下进行跨孤岛协作。然而,现有的VFL工作缺乏审计数据方推理软件执行正确性的机制。为解决这一问题,我们设计了纵向联邦推理审计(VeFIA)框架。VeFIA帮助任务方审计数据方的推理软件在大规模推理期间是否按预期执行,既不泄露数据方的数据隐私,也不给推理系统带来额外延迟。VeFIA的核心在于,任务方可以利用来自带有可信执行环境(TEE)的框架和协调方的推理结果,来验证数据方计算结果的正确性。VeFIA保证,只要异常推理超过5.4%,任务方就能以99.99%的概率检测到推理软件中的执行异常,且不产生任何额外的在线推理延迟。VeFIA的随机抽样验证在检测异常推理时达到了100%的阳性预测值、阴性预测值和真阳性率。据我们所知,这是第一篇讨论VFL中推理软件执行正确性的论文。
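One way to make sense of the "5.4% abnormal inferences, 99.99% detection probability" pairing is a simple sampling argument. The sketch below assumes that validated inferences are drawn independently and uniformly, and that any sampled abnormal inference is always caught; the abstract does not state this model or the sample size, so both are assumptions for illustration.

```python
import math


def detection_probability(abnormal_rate: float, samples: int) -> float:
    """P(at least one abnormal inference lands in the validated sample)."""
    return 1.0 - (1.0 - abnormal_rate) ** samples


def samples_needed(abnormal_rate: float, target: float) -> int:
    """Smallest sample size whose detection probability reaches the target."""
    return math.ceil(math.log(1.0 - target) / math.log(1.0 - abnormal_rate))


n = samples_needed(0.054, 0.9999)
print(n, detection_probability(0.054, n))
```

Under these assumptions the required sample size depends only on the abnormal rate and the target probability, not on the total number of inferences, which is consistent with the claim that auditing scales to large-scale inference.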

Article 15

Title@2025-07-03 (4): Exploring the Integration of Large Language Models in Industrial Test Maintenance Processes

Title: Exploring the Integration of Large Language Models in Industrial Test Maintenance Processes

Erforschung der Integration von großen Sprachmodellen in industrielle Testwartungsprozesse

探索将大语言模型纳入工业试验维护工艺 2409.06416v2

Authors (6): Jingxiong Liu, Ludvig Lemner, Linnea Wahlgren, Gregory Gay, Nasser Mohammadiha, Joakim Wennerberg

Much of the cost and effort required during the software testing process is invested in performing test maintenance - the addition, removal, or modification of test cases to keep the test suite in sync with the system-under-test or to otherwise improve its quality. Tool support could reduce the cost - and improve the quality - of test maintenance by automating aspects of the process or by providing guidance and support to developers. In this study, we explore the capabilities and applications of large language models (LLMs) - complex machine learning models adapted to textual analysis - to support test maintenance. We conducted a case study at Ericsson AB where we explore the triggers that indicate the need for test maintenance, the actions that LLMs can take, and the considerations that must be made when deploying LLMs in an industrial setting. We also propose and demonstrate a multi-agent architecture that can predict which tests require maintenance following a change to the source code. Collectively, these contributions advance our theoretical and practical understanding of how LLMs can be deployed to benefit industrial test maintenance processes.

软件测试过程中所需的大部分成本和精力都用于进行测试维护 – – 添加、删除或修改测试案例,使测试套件与测试中的系统保持同步,或以其他方式提高测试套件的质量; 工具支持可以降低测试维护的成本,并提高质量,办法是使过程的各方面自动化,或向开发者提供指导和支持; 在这项研究中,我们探讨了大型语言模型(LLMs) – – 适合于文字分析的复杂机器学习模型 – – 的能力和应用,以支持测试维护; 我们在Ericsson AB进行了一项案例研究,我们在该研究中探讨了显示需要测试维护的触发因素、LLMs可以采取的行动,以及在工业环境中部署LMs时必须作出的考虑; 我们还提出并展示了一种多工具结构,可以预测在源码改变后哪些测试需要维护; 共同而言,这些贡献提高了我们对如何部署LMs以有利于工业测试维护过程的理论和实践理解。

Article 16

Title@2025-07-03 (4): Precisely Detecting Python Type Errors via LLM-based Unit Test Generation

Title: Precisely Detecting Python Type Errors via LLM-based Unit Test Generation

Präzise Erkennung von Python-Typfehlern über LLM-basierte Einheitentestgenerierung

通过基于LLM的单位测试生成精确检测 Python 类型错误 2507.02318v1

Authors (7): Chen Yang, Ziqi Wang, Yanjie Jiang, Lin Yang, Yuteng Zheng, Jianyi Zhou, Junjie Chen

Type errors in Python often lead to runtime failures, posing significant challenges to software reliability and developer productivity. Existing static analysis tools aim to detect such errors without execution but frequently suffer from high false positive rates. Recently, unit test generation techniques offer great promise in achieving high test coverage, but they often struggle to produce bug-revealing tests without tailored guidance. To address these limitations, we present RTED, a novel type-aware test generation technique for automatically detecting Python type errors. Specifically, RTED combines step-by-step type constraint analysis with reflective validation to guide the test generation process and effectively suppress false positives. We evaluated RTED on two widely-used benchmarks, BugsInPy and TypeBugs. Experimental results show that RTED can detect 22-29 more benchmarked type errors than four state-of-the-art techniques. RTED is also capable of producing fewer false positives, achieving an improvement of 173.9%-245.9% in precision. Furthermore, RTED successfully discovered 12 previously unknown type errors from six real-world open-source Python projects.

Python 中的类型错误常常导致运行时故障,给软件可靠性和开发者生产率带来重大挑战。现有的静态分析工具旨在不执行代码的情况下检测此类错误,但误报率往往很高。近来,单元测试生成技术在实现高测试覆盖率方面展现出很大潜力,但若缺乏有针对性的引导,往往难以生成能够揭示缺陷的测试。为了解决这些局限,我们提出 RTED,一种用于自动检测 Python 类型错误的新型类型感知测试生成技术。具体来说,RTED 将逐步的类型约束分析与反思式验证相结合,以引导测试生成过程并有效抑制误报。我们在两个广泛使用的基准 BugsInPy 和 TypeBugs 上评估了 RTED。实验结果表明,与四种最先进的技术相比,RTED 能多检测出 22-29 个基准中的类型错误。RTED 产生的误报也更少,精确率提升了 173.9%-245.9%。此外,RTED 还在六个真实的开源 Python 项目中成功发现了 12 个此前未知的类型错误。
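
To make the notion of a bug-revealing test for a type error concrete, here is a minimal sketch. It is not RTED's implementation: `total_length` and `reveals_type_error` are hypothetical names invented for illustration, and the "type constraint" is simply the implicit requirement that every element supports `len()`.

```python
def total_length(items):
    """Sum the lengths of all items; implicitly assumes every item supports len()."""
    return sum(len(item) for item in items)


def reveals_type_error(func, *args):
    """Run func(*args) and report whether the call raised a TypeError."""
    try:
        func(*args)
    except TypeError:
        return True
    return False


# An input that satisfies the implicit constraint runs cleanly ...
ok_case = reveals_type_error(total_length, ["ab", "cde"])
# ... while an input that violates it (an int has no len()) exposes the error.
bad_case = reveals_type_error(total_length, ["ab", 3])
```

A test generator that knows which constraint a parameter must satisfy can aim directly at inputs like the second one, rather than hoping that random inputs happen to trigger the failure.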

Article 17

Title@2025-07-02 (3): Enhancing COBOL Code Explanations: A Multi-Agents Approach Using Large Language Models

Title: Enhancing COBOL Code Explanations: A Multi-Agents Approach Using Large Language Models

Verbesserung der COBOL-Code-Erklärungen: Ein Multi-Agenten-Ansatz mit großen Sprachmodellen

加强COBOL守则解释:采用大语言模式的多机构办法 2507.02182v1

Authors (6): Fangjian Lei, Jiawen Liu, Shayan Noei, Ying Zou, Derek Truong, William Alexander

Common Business Oriented Language (COBOL) is a programming language used to develop business applications that are widely adopted by financial, business, and government agencies. Due to its age, complexity, and the declining number of COBOL developers, maintaining COBOL codebases is becoming increasingly challenging. In particular, the lack of documentation makes it difficult for new developers to effectively understand and maintain COBOL systems. Existing research utilizes large language models (LLMs) to explain the functionality of code snippets. However, COBOL presents unique challenges due to its architectural and syntactical differences, which often cause its code to exceed the token window size of LLMs. In this work, we propose a multi-agent approach that leverages two LLM-based agents working collaboratively to generate explanations for functions, files, and the overall project. These agents cooperate by incorporating contextual information from the codebase into the code explanation prompts. We evaluate the effectiveness of our approach using 14 open-source, real-world COBOL projects. Our results indicate that our approach performs significantly better than the baseline in function code explanation, with improvements of 12.67%, 18.59%, and 0.62% in terms of METEOR, chrF, and SentenceBERT scores, respectively. At the file level, our approach effectively explains both short COBOL files and long ones that exceed the token window size of LLMs, and surpasses the baseline by 4.21%, 10.72%, and 14.68% in explaining the purpose, functionality, and clarity of the generated explanation, respectively. At the project level, our approach generates explanations that convey the functionality and purpose of 82% of the selected projects.

面向商业的通用语言(COBOL)是一种用于开发商业应用程序的编程语言,这些应用程序被金融、商业和政府机构广泛采用。由于COBOL年代久远、结构复杂,且COBOL开发者数量不断减少,维护COBOL代码库正变得越来越具有挑战性。特别是,文档的缺失使新开发者难以有效地理解和维护COBOL系统。现有研究利用大型语言模型(LLM)来解释代码片段的功能。然而,COBOL在架构和语法上的差异带来了独特的挑战,这往往导致其代码超出LLM的令牌窗口大小。在这项工作中,我们提出一种多智能体方法,利用两个基于LLM的智能体协同工作,为函数、文件和整个项目生成解释。这些智能体通过将代码库中的上下文信息纳入代码解释提示来相互协作。我们使用14个开源的真实COBOL项目评估了该方法的有效性。结果表明,在函数代码解释方面,我们的方法明显优于基线,在METEOR、chrF和SentenceBERT得分上分别提高了12.67%、18.59%和0.62%。在文件层面,我们的方法既能有效解释较短的COBOL文件,也能有效解释超出LLM令牌窗口大小的长文件,在解释目的、功能以及所生成解释的清晰度方面分别超出基线4.21%、10.72%和14.68%。在项目层面,我们的方法生成的解释传达了82%所选项目的功能和目的。

Article 18

Title@2025-07-02 (3): Towards Trustworthy Sentiment Analysis in Software Engineering: Dataset Characteristics and Tool Selection

Title: Towards Trustworthy Sentiment Analysis in Software Engineering: Dataset Characteristics and Tool Selection

Auf dem Weg zu einer vertrauensvollen Sentiment-Analyse in der Software-Engineering: Dataset-Eigenschaften und Werkzeugauswahl

在软件工程中进行可信赖的感知分析:数据集特点和工具选择 2507.02137v1

Authors (4): Martin Obaidi, Marc Herrmann, Jil Klünder, Kurt Schneider

Software development relies heavily on text-based communication, making sentiment analysis a valuable tool for understanding team dynamics and supporting trustworthy AI-driven analytics in requirements engineering. However, existing sentiment analysis tools often perform inconsistently across datasets from different platforms, due to variations in communication style and content. In this study, we analyze linguistic and statistical features of 10 developer communication datasets from five platforms and evaluate the performance of 14 sentiment analysis tools. Based on these results, we propose a mapping approach and questionnaire that recommends suitable sentiment analysis tools for new datasets, using their characteristic features as input. Our results show that dataset characteristics can be leveraged to improve tool selection, as platforms differ substantially in both linguistic and statistical properties. While transformer-based models such as SetFit and RoBERTa consistently achieve strong results, tool effectiveness remains context-dependent. Our approach supports researchers and practitioners in selecting trustworthy tools for sentiment analysis in software engineering, while highlighting the need for ongoing evaluation as communication contexts evolve.

软件开发在很大程度上依赖基于文本的通信,使情绪分析成为了解团队动态和支持在需求工程中可靠AI驱动分析的宝贵工具,但是,由于通信风格和内容的差异,现有情绪分析工具往往在不同平台的数据集之间运作不一致。在本研究中,我们分析5个平台10个开发者通信数据集的语言和统计特征,评价14个情绪分析工具的性能。根据这些结果,我们提出了一个绘图方法和问卷,建议为新数据集提供适当的情绪分析工具,同时使用其特性作为投入。我们的结果显示,数据集的特性可以用来改进工具选择,因为平台在语言和统计特性上差异很大。SetFit和ROBERTA等基于变压器的模型不断取得强有力的结果,但工具的有效性仍然取决于具体情况。我们的方法支持研究人员和从业人员选择软件工程中值得信赖的情绪分析工具,同时强调随着通信环境的演变需要持续评估。

Article 19

Title@2025-07-02 (3): DSCodeBench: A Realistic Benchmark for Data Science Code Generation

Title: DSCodeBench: A Realistic Benchmark for Data Science Code Generation

DSCodeBench: Ein realistischer Benchmark für die Erstellung von Data Science Code

DSCodeBench:数据科学代码生成的现实基准 2505.15621v2

Authors (6): Shuyin Ouyang, Dong Huang, Jingwen Guo, Zeyu Sun, Qihao Zhu, Jie M. Zhang

We introduce DSCodeBench, a new benchmark designed to evaluate large language models (LLMs) on complicated and realistic data science code generation tasks. DSCodeBench consists of 1,000 carefully constructed problems sourced from realistic GitHub problems across ten widely used Python data science libraries. Compared to the current state-of-the-art benchmark DS-1000, DSCodeBench offers a more challenging and representative testbed, longer code solutions, more comprehensive data science libraries, clearer and better structured problem descriptions, and stronger test suites. To construct DSCodeBench, we develop a robust pipeline that combines task scope selection, code construction, test case generation, and problem description synthesis. The process is paired with rigorous manual editing to ensure alignment and enhance evaluation reliability. Experimental results show that DSCodeBench exhibits robust scaling behavior, where larger models systematically outperform smaller ones, validating its ability to distinguish model capabilities. The best LLM we test, GPT-4o, has a pass@1 of 0.202, indicating that LLMs still have substantial room for improvement on realistic data science code generation tasks. We believe DSCodeBench will serve as a rigorous and trustworthy foundation for advancing LLM-based data science programming.

我们引入了DSCodeBench,这是一个新的基准,旨在评估大型语言模型(LLM)在复杂而现实的数据科学代码生成任务上的表现。DSCodeBench由1,000个精心构建的问题组成,这些问题取自GitHub上的真实问题,涵盖十个广泛使用的Python数据科学库。与目前最先进的基准DS-1000相比,DSCodeBench提供了更具挑战性和代表性的测试平台、更长的代码解决方案、更全面的数据科学库、更清晰且结构更完善的问题描述,以及更强的测试套件。为了构建DSCodeBench,我们开发了一条稳健的流水线,将任务范围选择、代码构建、测试用例生成和问题描述合成结合起来。这一过程辅以严格的人工编辑,以确保一致性并提高评估的可靠性。实验结果表明,DSCodeBench展现出稳健的规模效应:较大的模型系统性地优于较小的模型,从而验证了其区分模型能力的能力。我们测试的最佳LLM GPT-4o的pass@1为0.202,表明LLM在现实的数据科学代码生成任务上仍有很大的改进空间。我们相信,DSCodeBench将为推进基于LLM的数据科学编程提供严谨而可信的基础。
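
The abstract above reports a pass@1 score but does not say how it was computed. As a hedged illustration, the sketch below uses the unbiased pass@k estimator that is commonly applied to code generation benchmarks; whether DSCodeBench uses exactly this estimator is an assumption, and `pass_at_k` and `benchmark_score` are names chosen here for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: samples passing all tests, k: budget.
    Computes 1 - C(n - c, k) / C(n, k), the probability that at least one
    of k samples drawn without replacement is correct.
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(per_problem, k):
    """Average pass@k over a list of (n, c) pairs, one pair per problem."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem) / len(per_problem)
```

For k = 1 the estimator reduces to c / n, so a benchmark-level pass@1 is simply the mean fraction of passing samples per problem.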

Article 20

Title@2025-07-02 (3): A Multimodal Approach Combining Biometrics and Self-Report Instruments for Monitoring Stress in Programming: Methodological Insights

Title: A Multimodal Approach Combining Biometrics and Self-Report Instruments for Monitoring Stress in Programming: Methodological Insights

Ein multimodaler Ansatz zur Kombination von Biometrie und Selbstberichtsinstrumenten zur Überwachung von Stress in der Programmierung: Methodologische Erkenntnisse

将生物计量与在方案编制中监测压力的自报工具相结合的多式联运办法:方法见解 2507.02118v1

Authors (4): Cristina Martinez Montes, Daniela Grassi, Nicole Novielli, Birgit Penzenstadler

The study of well-being, stress and other human factors has traditionally relied on self-report instruments to assess key variables. However, concerns about potential biases in these instruments, even when thoroughly validated and standardised, have driven growing interest in combining these measures with more objective methods, such as physiological measures. We aimed to (i) compare psychometric stress measures and biometric indicators and (ii) identify stress-related patterns in biometric data during software engineering tasks. We conducted an experiment in which participants completed a pre-survey, programmed two tasks while wearing biometric sensors, answered a brief post-survey for each task, and finally went through a short exit interview. Our results showed diverse outcomes: the psychometric instruments indicated no stress, participants in the interviews reported a mix of feeling no stress and experiencing time pressure, and the biometrics showed a significant difference only in EDA phasic peaks. We conclude that our chosen way of inducing stress by imposing a stricter time limit was insufficient. We offer methodological insights for future studies working with stress, biometrics, and psychometric instruments.

对福祉、压力和其他人类因素的研究历来依靠自我报告工具来评估关键变量,然而,对这些文书中潜在偏见的关切,即使经过彻底验证和标准化,也促使人们日益关注将这些措施与更客观的方法相结合的替代办法,例如生理措施。我们的目标是(一) 比较心理压力措施和生物鉴别指标,以及(二) 在软件工程任务期间确定生物鉴别数据中与压力有关的模式。我们进行了一项试验,参加者在试验中完成了调查前的工作,然后用生物鉴别传感器对两项任务进行了编程,对每项任务进行了简短的事后调查,最后进行了短期的离职面谈。我们的结果显示结果各异;我们在心理计量工具中没有发现压力;参加面谈的与会者报告说,没有压力感,也经历了时间压力。最后,生物鉴别学显示只有EDA的高峰期存在重大差异。我们的结论是,我们选择的通过规定更严格的时限来刺激压力的方法是不够的。我们为未来有关压力、生物鉴别和心理测量工具的研究提供了方法方面的见解。

Article 21

Title@2025-07-02 (3): Can Internal Software Metrics Predict App Popularity at Launch? Yeas! and Nays!

Title: Can Internal Software Metrics Predict App Popularity at Launch? Yeas! and Nays!

Können interne Software-Metriken die App-Popularität beim Start vorhersagen? Yeas! und Nays!

内部软件计量器能否预测发射时的流行程度? 2507.02110v1

Authors (5): Md Nahidul Islam Opu, Fatima Islam Mouri, Rick Kazman, Yuanfang Cai, Shaiful Chowdhury

Predicting mobile app popularity before release can provide developers with a strategic advantage in a competitive marketplace, yet it remains a challenging problem. This study explores whether internal software metrics, measurable from source code before deployment, can predict an app’s popularity, defined by user ratings (calculated from user reviews) and DownloadsPerYear (yearly downloads). Using a dataset of 446 open-source Android apps from F-Droid, we extract a wide array of features, including system-, class-, and method-level code metrics, code smells, and app metadata. Additional information, such as user reviews, download counts, and uses-permission, was collected from the Google Play Store. We evaluate regression and classification models across three feature sets: a minimal Size-only baseline, a domain-informed Handpicked set, and a Voting set derived via feature selection algorithms. Regression models perform poorly due to skewed data, with low $R^2$ scores. However, when reframed as binary classification (Popular vs. Unpopular), results improve significantly. The best model, a Multilayer Perceptron using the Voting set, achieves F1-scores of 0.72. These results suggest that internal code metrics, although limited in their explanatory power, can serve as useful indicators of app popularity. This challenges earlier findings that dismissed internal metrics as predictors of software quality.

在发布前预测移动应用的受欢迎程度,可以让开发者在竞争激烈的市场中获得战略优势,但这仍是一个具有挑战性的问题。本研究探讨在部署之前即可从源代码中度量的内部软件指标,能否预测应用的受欢迎程度;受欢迎程度由用户评分(根据用户评论计算)和 DownloadsPerYear(年下载量)来定义。我们使用来自 F-Droid 的 446 个开源 Android 应用组成的数据集,提取了多种特征,包括系统级、类级和方法级代码指标、代码异味以及应用元数据。用户评论、下载量和 uses-permission 等附加信息则从 Google Play Store 收集。我们在三个特征集上评估了回归和分类模型:仅含规模的最小基线(Size-only)、基于领域知识的人工挑选集(Handpicked),以及通过特征选择算法得出的投票集(Voting)。由于数据分布偏斜,回归模型表现不佳,$R^2$ 得分较低。然而,当问题被重新表述为二元分类(Popular vs. Unpopular)时,结果显著改善。最佳模型是使用 Voting 特征集的多层感知机,其 F1 得分为 0.72。这些结果表明,内部代码指标虽然解释力有限,但可以作为应用受欢迎程度的有用指标。这对此前否定内部指标可用于预测软件质量的研究结论提出了挑战。
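
For readers less familiar with the F1-score quoted above, the following sketch shows how it is derived from binary predictions. The labels are made-up illustration data, not data from the study, and `f1_score` is written out by hand here rather than taken from any library.

```python
def f1_score(y_true, y_pred, positive=1):
    """Harmonic mean of precision and recall for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are both zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 1 = Popular, 0 = Unpopular.
example_true = [1, 1, 1, 0, 0]
example_pred = [1, 1, 0, 1, 0]
example_f1 = f1_score(example_true, example_pred)
```

Because F1 ignores true negatives, it is less flattering than accuracy on skewed data, which is one reason it is a common choice for popularity classification tasks like this one.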

Article 22

Title@2025-07-02 (3): Structural Code Search using Natural Language Queries

Title: Structural Code Search using Natural Language Queries

Structural Code Suche mit Hilfe von Natural Language Queries

使用自然语言查询的结构性法规搜索 2507.02107v1

Authors (8): Ben Limpanukorn, Yanjun Wang, Zach Patterson, Pranav Garg, Murali Krishna Ramanathan, Xiaofei Ma, Anoop Deoras, Miryung Kim

Searching code is a common task that developers perform to understand APIs, learn common code patterns, and navigate code. Currently, developers most commonly search using keywords and regular expressions that are easy to use and widely available. Beyond keywords and regular expressions, structural code search tools allow developers to search for code based on its syntactic structure. This has numerous applications ranging from bug finding to systematically refactoring code. However, these structural code search tools operate on queries expressed in domain-specific languages (DSLs) that can be difficult to learn and write. We propose to allow developers to use natural language to search for code structurally. Expressing queries in natural language provides an intuitive way to search for code and lowers the barrier to entry. In this work, we develop a novel general approach that combines the reasoning capabilities of an LLM to interpret natural language search queries with the power of structural search tools to efficiently and accurately retrieve relevant code. We then instantiate this approach for two structural code search DSLs: Semgrep and GQL. In our evaluation, we construct a new benchmark for structural code search consisting of 400 queries over 10 Java projects. We show that our approach for structural code search based on translating NL queries to DSL queries using an LLM is effective and robust, achieving high precision and recall ranging from 55% to 70%. Further, our approach significantly outperforms baselines based on semantic code search and LLM retrievals by up to 57% and 14% on F1 scores.

搜索代码是开发者为理解 API、学习常见代码模式和浏览代码而执行的一项常见任务。目前,开发者最常使用易于使用且广泛可用的关键字和正则表达式进行搜索。除了关键字和正则表达式之外,结构化代码搜索工具允许开发者根据代码的语法结构进行搜索。这有许多应用,从查找缺陷到系统性地重构代码。然而,这些结构化代码搜索工具使用以领域特定语言(DSL)表达的查询,而这类语言可能难以学习和编写。我们提议让开发者使用自然语言对代码进行结构化搜索。用自然语言表达查询提供了一种直观的代码搜索方式,并降低了入门门槛。在这项工作中,我们提出了一种新的通用方法,将 LLM 解释自然语言搜索查询的推理能力与结构化搜索工具的能力结合起来,以高效、准确地检索相关代码。随后,我们针对两种结构化代码搜索 DSL(Semgrep 和 GQL)实例化了这一方法。在评估中,我们构建了一个新的结构化代码搜索基准,包含针对 10 个 Java 项目的 400 个查询。结果表明,我们这种利用 LLM 将自然语言查询翻译为 DSL 查询的结构化代码搜索方法有效且稳健,精确率和召回率在 55% 至 70% 之间。此外,我们的方法明显优于基于语义代码搜索和 LLM 检索的基线,F1 得分最多分别高出 57% 和 14%。

Article 23

Title@2025-07-02 (3): How do Software Engineering Candidates Prepare for Technical Interviews?

Title: How do Software Engineering Candidates Prepare for Technical Interviews?

Wie bereiten sich Software-Engineering-Kandidaten auf technische Interviews vor?

软件工程候选人如何准备技术面试? 2507.02068v1

Authors (4): Brian Bell, Teresa Thomas, Sang Won Lee, Chris Brown

To obtain employment, aspiring software engineers must complete technical interviews – a hiring process which involves candidates writing code while communicating to an audience. However, the complexities of tech interviews are difficult to prepare for and seldom faced in computing curricula. To this end, we seek to understand how candidates prepare for technical interviews, investigating the effects of preparation methods and the role of education. We distributed a survey to candidates (n = 131) actively preparing for technical interviews. Our results suggest candidates rarely train in authentic settings and courses fail to support preparation efforts – leading to stress and unpreparedness. Based on our findings, we provide implications for stakeholders to enhance tech interview preparation for candidates pursuing software engineering roles.

为了获得就业,有志向的软件工程师必须完成技术面试 – – 招聘过程涉及候选人在与受众沟通时撰写代码,然而,技术面试的复杂性难以准备,而且很少在计算课程中遇到。为此目的,我们力求了解候选人如何准备技术面试,调查准备方法的影响和教育的作用。我们向积极准备技术面试的候选人(n=131)分发了一份调查。我们的结果表明,候选人很少在真实环境和课程中接受培训,无法支持准备工作 – – 导致压力和不准备。根据我们的调查结果,我们向利益攸关方提出如何为追求软件工程角色的候选人加强技术面试准备的问题。

Article 24

Title@2025-07-02 (3): Automated Synthesis of Formally Verified Multi-Abstraction Function Summaries

Title: Automated Synthesis of Formally Verified Multi-Abstraction Function Summaries

Automatisierte Synthese von formal verifizierten Multi-Abstraktions-Funktionszusammenfassungen

正式核证的多种吸管功能摘要自动综合 2506.09550v2

Authors (8): Fanpeng Yang, Xu Ma, Shuling Wang, Xiong Xu, Qinxiang Cao, Naijun Zhan, Xiaofeng Li, Bin Gu

Function summaries, which characterize the behavior of code segments (typically functions) through preconditions and postconditions, are essential for understanding, reusing, and verifying software, particularly in safety-critical domains like aerospace embedded systems. However, the mission-critical legacy code that serves as a valuable reused asset often lacks formal specifications. It is challenging to automatically generate function summaries for C programs due to complex features such as loops, nested function calls, and pointer aliasing. Moreover, function summaries should support multiple abstraction levels to meet diverse requirements, e.g., precise summaries capturing full functionality for formal verification and intuitive summaries for human understanding. To address these challenges, we first propose a novel framework that combines symbolic execution, large language models (LLMs), and formal verification to generate Relatively Strongest Postconditions (RSPs) and build function summaries that fully capture program behavior. Our approach leverages VST-A's symbolic execution to precisely track program execution paths and state transitions, employs LLMs to infer loop invariants based on predefined templates, and uses Frama-C to guarantee the soundness of generated summaries in an iterative refinement loop. Furthermore, from the generated RSPs, we automatically synthesize the strongest non-redundant postconditions expressible within a given domain-specific language. We compare our approach with existing work through extensive experiments.

函数摘要通过前置条件和后置条件刻画代码段(通常是函数)的行为,对于理解、复用和验证软件至关重要,在航空航天嵌入式系统等安全关键领域尤其如此。然而,这些作为宝贵复用资产的关键任务遗留代码往往缺乏形式化规约。由于存在循环、嵌套函数调用、指针别名等复杂特性,为C程序自动生成函数摘要颇具挑战。此外,函数摘要应支持多个抽象层次以满足不同需求,例如用于形式化验证的、刻画完整功能的精确摘要,以及便于人类理解的直观摘要。为应对这些挑战,我们首先提出一个新框架,将符号执行、大型语言模型(LLM)和形式化验证结合起来,以生成相对最强后置条件(RSP),并构建能够完整刻画程序行为的函数摘要。我们的方法利用VST-A的符号执行来精确跟踪程序执行路径和状态转换,使用LLM基于预定义模板推断循环不变量,并在迭代精化循环中使用Frama-C保证所生成摘要的可靠性。此外,我们还从生成的RSP出发,自动合成在给定领域特定语言内表达的最强无冗余后置条件。我们通过大量实验将我们的方法与现有工作进行了比较。

Article 25

Title@2025-07-02 (3): APRMCTS: Improving LLM-based Automated Program Repair with Iterative Tree Search

Title: APRMCTS: Improving LLM-based Automated Program Repair with Iterative Tree Search

APRMCTS: Verbesserte automatische Programmreparatur auf LLM-Basis mit iterativer Baumsuche

APRMCTS: 改进以LLM为基础的自动方案修理,同时进行迭代树搜索 2507.01827v1

Authors (5): Haichuan Hu, Congqing He, Hao Zhang, Xiaochen Xie, Quanjun Zhang

Automated Program Repair (APR) attempts to fix software bugs without human intervention and plays a crucial role in software development and maintenance. Recently, with the advances in Large Language Models (LLMs), a rapidly increasing number of APR techniques have been proposed with remarkable performance. However, existing LLM-based APR techniques typically adopt trial-and-error strategies, which suffer from two major drawbacks: (1) inherently limited patch effectiveness due to local exploration, and (2) low search efficiency due to redundant exploration. In this paper, we propose APRMCTS, which uses iterative tree search to improve LLM-based APR. APRMCTS incorporates Monte Carlo Tree Search (MCTS) into patch searching by performing a global evaluation of the explored patches and selecting the most promising one for subsequent refinement and generation. APRMCTS effectively resolves the problem of falling into local optima and thus helps improve the efficiency of patch searching. Our experiments on 835 bugs from Defects4J demonstrate that, when integrated with GPT-3.5, APRMCTS can fix a total of 201 bugs, which outperforms all state-of-the-art baselines. In addition, APRMCTS helps GPT-4o-mini, GPT-3.5, Yi-Coder-9B, and Qwen2.5-Coder-7B to fix 30, 27, 37, and 28 more bugs, respectively. More importantly, APRMCTS retains a significant performance advantage while using a small patch budget (16 and 32 patches), notably fewer than the 500 and 10,000 patches adopted in previous studies. In terms of cost, compared to existing state-of-the-art LLM-based APR methods, APRMCTS incurs less than 20% of the time cost and less than 50% of the monetary cost. Our extensive study demonstrates that APRMCTS exhibits good effectiveness and efficiency, with particular advantages in addressing complex bugs.

自动程序修复(APR)试图在没有人工干预的情况下修复软件缺陷,在软件开发和维护中起着关键作用。近来,随着大型语言模型(LLM)的进步,越来越多的APR技术被提出,并取得了显著的效果。然而,现有基于LLM的APR技术通常采用试错策略,存在两个主要缺陷:(1)局部探索导致补丁有效性天然受限;(2)冗余探索导致搜索效率低下。在本文中,我们提出APRMCTS,利用迭代树搜索来改进基于LLM的APR。APRMCTS将蒙特卡洛树搜索(MCTS)引入补丁搜索,对已探索的补丁进行全局评估,并选择最有希望的补丁进行后续的改进和生成。APRMCTS有效解决了陷入局部最优的问题,从而有助于提高补丁搜索的效率。我们在Defects4J的835个缺陷上的实验表明,与GPT-3.5结合时,APRMCTS总共可以修复201个缺陷,优于所有最先进的基线。此外,APRMCTS帮助GPT-4o-mini、GPT-3.5、Yi-Coder-9B和Qwen2.5-Coder-7B分别多修复了30、27、37和28个缺陷。更重要的是,APRMCTS在采用较小补丁规模(16和32)的情况下仍具有显著的性能优势,明显少于以往研究采用的500和10,000个补丁。在成本方面,与现有最先进的基于LLM的APR方法相比,APRMCTS的时间成本和金钱成本分别不到其20%和50%。我们的广泛研究表明,APRMCTS具有良好的有效性和效率,在处理复杂缺陷方面优势尤为明显。
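
The abstract above describes selecting the most promising explored patch for further refinement. A common way to do this in MCTS is the UCT rule, sketched below. That APRMCTS uses this exact rule is an assumption; the dictionary layout (`reward`, `visits`) and the function names are hypothetical and chosen only for illustration.

```python
import math


def uct_score(total_reward, visits, parent_visits, c=1.4):
    """UCT value: average reward plus an exploration bonus for rarely visited nodes."""
    if visits == 0:
        # Unvisited candidates are always tried first.
        return math.inf
    exploit = total_reward / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_patch(children, parent_visits, c=1.4):
    """Pick the candidate patch node with the highest UCT score."""
    return max(
        children,
        key=lambda ch: uct_score(ch["reward"], ch["visits"], parent_visits, c),
    )


# Hypothetical search state: patch "a" looks good and is well explored,
# patch "b" has been evaluated only once with a mediocre result.
candidates = [
    {"id": "a", "reward": 9.0, "visits": 10},
    {"id": "b", "reward": 0.5, "visits": 1},
]
greedy_choice = select_patch(candidates, parent_visits=11, c=0.0)["id"]
uct_choice = select_patch(candidates, parent_visits=11, c=1.4)["id"]
```

With `c = 0` the rule degenerates to greedy local search; a positive `c` gives under-explored patches another chance, which is the mechanism by which tree search can escape local optima.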

Article 26

Title@2025-07-02 (3): Foundation Models for the Digital Twin Creation of Cyber-Physical Systems

Title: Foundation Models for the Digital Twin Creation of Cyber-Physical Systems

Gründungsmodelle für die digitale Twin-Erstellung von Cyber-Physical Systems

数字双造网络物理系统数字双创造模式基金会模式 2407.18779v2

Authors (3): Shaukat Ali, Paolo Arcaini, Aitor Arrieta

Foundation models are trained on a large amount of data to learn generic patterns. Consequently, these models can be used and fine-tuned for various purposes. Naturally, studying such models' use in the context of digital twins for cyber-physical systems (CPSs) is a relevant area of investigation. To this end, we provide perspectives on various aspects of developing digital twins for CPSs, where foundation models can be used to increase the efficiency of creating digital twins, to improve the effectiveness of the capabilities they provide, and, as specialized fine-tuned models, to act as digital twins themselves. We also discuss challenges in using foundation models in a more generic context. We use the case of an autonomous driving system as a representative CPS to give examples. Finally, we provide discussions and open research directions that we believe are valuable for the digital twin community.

基础模型在大量数据上进行训练,以学习通用模式。因此,这些模型可以用于各种目的并加以微调。自然地,研究此类模型在网络物理系统(CPS)数字孪生背景下的使用是一个值得探讨的领域。为此,我们就为CPS开发数字孪生所涉及的各个方面提出看法:基础模型可用于提高创建数字孪生的效率,提升数字孪生所提供能力的有效性,还可以经过专门微调后自身充当数字孪生。我们还讨论了在更一般的背景下使用基础模型所面临的挑战。我们以自动驾驶系统作为有代表性的CPS来举例说明。最后,我们给出了我们认为对数字孪生社区有价值的讨论和开放研究方向。

Article 27

Title@2025-07-02 (3): On the Structure and Semantics of Identifier Names Containing Closed Syntactic Category Words

Title: On the Structure and Semantics of Identifier Names Containing Closed Syntactic Category Words

Über die Struktur und Semantik von Identifier-Namen, die geschlossene syntaktische Kategorie Wörter enthalten

关于含有闭合同步词类的标识名称的结构和语义 2505.18444v2

Authors (11): Christian D. Newman, Anthony Peruma, Eman Abdullah AlOmar, Mahie Crabbe, Syreen Banabilah, Reem S. AlSuhaibani, Michael J. Decker, Farhad Akhbardeh, Marcos Zampieri, Mohamed Wiem Mkaouer, Jonathan I. Maletic

Identifier names are crucial components of code, serving as primary clues for developers to understand program behavior. This paper investigates the linguistic structure of identifier names by extending the concept of grammar patterns, which represent the part-of-speech (PoS) sequences underlying identifier phrases. The specific focus is on closed syntactic categories (e.g., prepositions, conjunctions, determiners), which are rarely studied in software engineering despite their central role in general natural language. To study these categories, the Closed Category Identifier Dataset (CCID), a new manually annotated dataset of 1,275 identifiers drawn from 30 open-source systems, is constructed and presented. The relationship between closed-category grammar patterns and program behavior is then analyzed using grounded-theory-inspired coding, statistical, and pattern analysis. The results reveal recurring structures that developers use to express concepts such as control flow, data transformation, temporal reasoning, and other behavioral roles through naming. This work contributes an empirical foundation for understanding how linguistic resources encode behavior in identifier names and supports new directions for research in naming, program comprehension, and education.

标识名称是代码的关键组成部分, 是开发者理解程序行为的主要线索。本文通过扩展语法模式的概念来调查标识名称的语言结构。语法模式代表了语法序列部分( POS) 基本识别短语。具体重点是封闭的合成类别( 如预设、连线、确定者 ) , 尽管这些类别在一般自然语言中具有核心作用, 但这些类别很少在软件工程中研究。要研究这些类别, 封闭类识别数据集( CICID) 是一个新的人工手动数据集, 由来自30个开放源系统的1,275个标识组成。封闭类语法模式与程序行为之间的关系随后通过基于理论的编码、统计和模式分析加以分析。结果揭示了开发者用来表达控制流、数据转换、时间推理和其他行为作用等概念的经常性结构。这项工作为理解识别名称的语言资源如何编码行为和支持命名、方案理解和教育研究的新方向提供了经验基础。

Article 28

Title@2025-07-02 (3): Exploring Privacy and Security as Drivers for Environmental Sustainability in Cloud-Based Office Solutions

Title: Exploring Privacy and Security as Drivers for Environmental Sustainability in Cloud-Based Office Solutions

Erforschen von Datenschutz und Sicherheit als Treiber für ökologische Nachhaltigkeit in Cloud-basierten Office-Lösungen

探索隐私和安全作为云载办公室解决方案中环境可持续性的驱动力 2506.23866v2

Authors (3): Jason Kayembe, Iness Ben Guirat, Jan Tobias Mühlberg

In this paper, we explore the intersection of privacy, security, and environmental sustainability in cloud-based office solutions, focusing on quantifying user- and network-side energy use and associated carbon emissions. We hypothesise that privacy-focused services are typically more energy-efficient than those funded through data collection and advertising. To evaluate this, we propose a framework that systematically measures environmental costs based on energy usage and network data traffic during well-defined, automated usage scenarios. To test our hypothesis, we first analyse how underlying architectures and business models, such as monetisation through personalised advertising, contribute to the environmental footprint of these services. We then explore existing methodologies and tools for software environmental impact assessment. We apply our framework to three mainstream email services selected to reflect different privacy policies, from ad-supported tracking-intensive models to privacy-focused designs: Microsoft Outlook, Google Mail (Gmail), and Proton Mail. We extend this comparison to a self-hosted email solution, evaluated with and without end-to-end encryption. We show that the self-hosted solution, even with 14% of device energy and 15% of emissions overheads from PGP encryption, remains the most energy-efficient, saving up to 33% of emissions per session compared to Gmail. Among commercial providers, Proton Mail is the most efficient, saving up to 0.1 gCO2e per session compared to Outlook, whose emissions can be further reduced by 2% through ad-blocking.

在本文中,我们探讨了基于云的办公解决方案中隐私、安全与环境可持续性的交叉点,重点是量化用户侧和网络侧的能源使用及相关碳排放。我们假设,注重隐私的服务通常比通过数据收集和广告获得资金的服务更节能。为了评估这一点,我们提出了一个框架,在定义明确的自动化使用场景中,根据能源使用和网络数据流量系统地衡量环境成本。为了检验我们的假设,我们首先分析底层架构和商业模式(例如通过个性化广告实现盈利)如何影响这些服务的环境足迹。然后,我们考察了软件环境影响评估的现有方法和工具。我们将该框架应用于三种主流电子邮件服务,这些服务反映了不同的隐私政策,从依靠广告、大量跟踪的模式到注重隐私的设计:Microsoft Outlook、Google Mail(Gmail)和Proton Mail。我们还将比较扩展到一个自托管的电子邮件解决方案,并分别在启用和不启用端到端加密的情况下进行评估。结果表明,自托管解决方案即使因PGP加密而产生14%的设备能耗开销和15%的排放开销,仍然是最节能的,与Gmail相比每次会话最多可减少33%的排放。在商业提供商中,Proton Mail效率最高,与Outlook相比每次会话最多可减少0.1 gCO2e的排放,而Outlook的排放可通过屏蔽广告进一步减少2%。

Article 29

Title@2025-07-02 (3): DaiFu: In-Situ Crash Recovery for Deep Learning Systems

Title: DaiFu: In-Situ Crash Recovery for Deep Learning Systems

DaiFu: In-Situ Crash Recovery für Deep Learning Systeme

DaiFu:深入学习系统现场事故恢复 2507.01628v1

Authors (7): Zilong He, Pengfei Chen, Hongyu Zhang, Xiaoyun Li, Guangba Yu, Hongyang Chen, Zibin Zheng

Deep learning (DL) systems have been widely adopted in many areas, and are becoming even more popular with the emergence of large language models. However, due to the complex software stacks involved in their development and execution, crashes are unavoidable and common. Crashes severely waste computing resources and hinder development productivity, so efficient crash recovery is crucial. Existing solutions, such as checkpoint-retry, are too heavyweight for fast recovery from crashes caused by minor programming errors or transient runtime errors. Therefore, we present DaiFu, an in-situ recovery framework for DL systems. Through a lightweight code transformation to a given DL system, DaiFu augments it to intercept crashes in situ and enables dynamic and instant updates to its program running context (e.g., code, configurations, and other data) for agile crash recovery. Our evaluation shows that DaiFu helps reduce the restore time for crash recovery, achieving a 1372x speedup compared with state-of-the-art solutions. Meanwhile, the overhead of DaiFu is negligible (under 0.40%). We also construct a benchmark spanning 7 distinct crash scenarios in DL systems, and show the effectiveness of DaiFu in diverse situations.

深度学习(DL)系统在许多领域被广泛采用,随着大型语言模型的出现,这种系统越来越受欢迎。然而,由于软件堆叠的复杂,碰撞是不可避免和常见的。碰撞严重浪费计算资源,阻碍发展生产率,因此有效的碰撞恢复至关重要。现有的解决方案,例如检查站回收系统,对于因小型编程错误或瞬时运行错误而导致的坠毁而迅速恢复来说过于沉重。因此,我们介绍了DAiFu,DaiFu是一个DL系统现场恢复框架。通过对给定的DL系统进行轻量级代码转换,DaiFu将它增扩到现场拦截坠毁,并使其能够对程序运行环境(例如代码、配置和其他数据)进行动态和即时更新,以利快速恢复。我们的评估表明,DaiFu帮助缩短了坠毁恢复的恢复时间,实现了1372x速度,与最新解决方案相比,而DaiFu的顶部的顶部是微不足道的(0.40 % ) 。我们还在DL系统上建造了一个跨越7个不同情景的基准,并展示了DAYFu的有效性。
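
The contrast between checkpoint-retry and in-situ recovery can be sketched in a few lines. This is only a conceptual illustration, not DaiFu's code transformation: `run_with_in_situ_recovery`, `step_fn`, and `repair_fn` are hypothetical names, and the "program running context" is reduced to a plain dictionary.

```python
def run_with_in_situ_recovery(step_fn, num_steps, context, repair_fn, max_retries=3):
    """Run steps 0..num_steps-1; on a crash, repair the context and retry that step.

    Progress made by earlier steps is kept in memory, so nothing is reloaded
    from a checkpoint and no completed step is executed again.
    """
    state = 0
    for step in range(num_steps):
        for attempt in range(max_retries + 1):
            try:
                state = step_fn(state, step, context)
                break  # step succeeded, move on to the next one
            except Exception as exc:
                if attempt == max_retries:
                    raise  # give up after the retry budget is exhausted
                repair_fn(context, exc)  # update the running context in place
    return state


completed_steps = []
repaired_keys = []


def step_fn(state, step, context):
    """Toy training step: crashes with KeyError if the schedule lacks this step."""
    increment = context["schedule"][step]
    completed_steps.append(step)
    return state + increment


def repair_fn(context, exc):
    """Toy repair: fill in the missing schedule entry named by the KeyError."""
    missing = exc.args[0]
    repaired_keys.append(missing)
    context["schedule"][missing] = 1


# The configuration only covers steps 0 and 1, so steps 2 and 3 crash once each.
final_state = run_with_in_situ_recovery(
    step_fn, num_steps=4, context={"schedule": {0: 1, 1: 1}}, repair_fn=repair_fn
)
```

Each step runs to completion exactly once here, which is the property that makes in-place recovery cheaper than restarting from the last checkpoint for minor configuration or programming errors.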

Article 30

Title@2025-07-02 (3): Combining Type Inference and Automated Unit Test Generation for Python

Title: Combining Type Inference and Automated Unit Test Generation for Python

Kombination von Typ-Inferenz und Automated Unit Test Generation für Python

Python 组合类型推断和自动单位测试生成 2507.01477v1

Authors (3): Lukas Krodinger, Stephan Lukasczyk, Gordon Fraser

Automated unit test generation is an established research field that has so far focused on statically-typed programming languages. The lack of type information in dynamically-typed programming languages, such as Python, inhibits test generators, which heavily rely on information about parameter and return types of functions to select suitable arguments when constructing test cases. Since automated test generators inherently rely on frequent execution of candidate tests, we make use of these frequent executions to address this problem by introducing type tracing, which extracts type-related information during execution and gradually refines the available type information. We implement type tracing as an extension of the Pynguin test-generation framework for Python, allowing it (i) to infer parameter types by observing how parameters are used during runtime, (ii) to record the types of values that function calls return, and (iii) to use this type information to increase code coverage. The approach leads to up to 90.0% more branch coverage, improved mutation scores, and to type information of similar quality to that produced by other state-of-the-art type-inference tools.

自动化单元测试生成是一个成熟的研究领域,迄今主要关注静态类型编程语言。Python等动态类型编程语言缺乏类型信息,这制约了测试生成器,因为测试生成器在构造测试用例时严重依赖函数参数和返回类型的信息来选择合适的实参。由于自动化测试生成器本身依赖于频繁执行候选测试,我们利用这些频繁的执行来解决这一问题:引入类型追踪,在执行过程中提取与类型相关的信息,并逐步细化可用的类型信息。我们将类型追踪实现为Python测试生成框架Pynguin的扩展,使其能够(i)通过观察参数在运行时的使用方式来推断参数类型,(ii)记录函数调用返回值的类型,以及(iii)利用这些类型信息提高代码覆盖率。该方法使分支覆盖率最多提高90.0%,提升了变异得分,并得到与其他最先进类型推断工具质量相当的类型信息。
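A minimal sketch of the general idea of collecting type information during execution is shown below. It is not Pynguin's type tracing: the decorator `trace_types` and the `observed` table are hypothetical, and the sketch simply records the concrete types of arguments and return values, which is a simpler technique than the usage-based inference the abstract describes.

```python
import inspect
from collections import defaultdict
from functools import wraps

# Maps "function.parameter" or "function.return" to the type names seen so far.
observed = defaultdict(set)


def trace_types(func):
    """Record the runtime types of arguments and return values of `func`."""
    signature = inspect.signature(func)

    @wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            observed[f"{func.__name__}.{name}"].add(type(value).__name__)
        result = func(*args, **kwargs)
        observed[f"{func.__name__}.return"].add(type(result).__name__)
        return result

    return wrapper


@trace_types
def scale(values, factor):
    return [v * factor for v in values]


# Each execution refines the recorded type information a little further.
scale([1, 2], 3)
scale((1.5,), factor=2)
print(dict(observed))
```

A test generator could consult such a table when choosing arguments for the next candidate test, so that every execution feeds back into argument selection.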

Article 31

Title@2025-07-02 (3): Tensor Program Optimization for the RISC-V Vector Extension Using Probabilistic Programs

Title: Tensor Program Optimization for the RISC-V Vector Extension Using Probabilistic Programs

Tensor-Programmoptimierung für die RISC-V-Vektorerweiterung mittels probabilistischer Programme

利用概率程序对RISC-V向量扩展进行张量程序优化 2507.01457v1

Authors (3): Federico Nicolas Peccia, Frederik Haxel, Oliver Bringmann

RISC-V provides a flexible and scalable platform for applications ranging from embedded devices to high-performance computing clusters. Particularly, its RISC-V Vector Extension (RVV) becomes of interest for the acceleration of AI workloads. But writing software that efficiently utilizes the vector units of RISC-V CPUs without expert knowledge requires the programmer to rely on the autovectorization features of compilers or hand-crafted libraries like muRISCV-NN. Smarter approaches, like autotuning frameworks, have been missing the integration with the RISC-V RVV extension, thus heavily limiting the efficient deployment of complex AI workloads. In this paper, we present a workflow based on the TVM compiler to efficiently map AI workloads onto RISC-V vector units. Instead of relying on hand-crafted libraries, we integrated the RVV extension into TVM’s MetaSchedule framework, a probabilistic program framework for tensor operation tuning. We implemented different RISC-V SoCs on an FPGA and tuned a wide range of AI workloads on them. We found that our proposal shows a mean improvement of 46% in execution latency when compared against the autovectorization feature of GCC, and 29% against muRISCV-NN. Moreover, the binary resulting from our proposal has a smaller code memory footprint, making it more suitable for embedded devices. Finally, we also evaluated our solution on a commercially available RISC-V SoC implementing the RVV 1.0 Vector Extension and found our solution is able to find mappings that are 35% faster on average than the ones proposed by LLVM. We open-sourced our proposal for the community to expand it to target other RISC-V extensions.

RISC-V为从嵌入式设备到高性能计算集群的各类应用提供了灵活且可扩展的平台。特别是,其RISC-V向量扩展(RVV)对于加速AI工作负载具有重要意义。但是,在缺乏专家知识的情况下编写能够高效利用RISC-V CPU向量单元的软件,需要程序员依赖编译器的自动向量化功能或muRISCV-NN等手工编写的库。自动调优框架等更智能的方法一直缺少与RISC-V RVV扩展的集成,从而严重限制了复杂AI工作负载的高效部署。在本文中,我们提出了一种基于TVM编译器的工作流程,用于将AI工作负载高效地映射到RISC-V向量单元上。我们没有依赖手工编写的库,而是将RVV扩展集成到TVM的MetaSchedule框架中,这是一个用于张量运算调优的概率程序框架。我们在FPGA上实现了不同的RISC-V SoC,并在其上调优了多种AI工作负载。我们发现,与GCC的自动向量化功能相比,我们的方案在执行延迟上平均改进46%,与muRISCV-NN相比平均改进29%。此外,我们的方案生成的二进制文件具有更小的代码内存占用,因而更适合嵌入式设备。最后,我们还在一款实现了RVV 1.0向量扩展的商用RISC-V SoC上评估了我们的方案,发现它找到的映射平均比LLVM给出的映射快35%。我们已将该方案开源,以便社区将其扩展到其他RISC-V扩展。

Article 32

Title@2025-07-02 (3): Context-Aware Code Wiring Recommendation with LLM-based Agent

Title: Context-Aware Code Wiring Recommendation with LLM-based Agent

Context-Aware Code Verkabelungsempfehlung mit LLM-basiertem Agent

基于LLM代理的上下文感知代码接线推荐 2507.01315v1

Authors (5): Taiming Wang, Yanjie Jiang, Chunhao Dong, Yuxia Zhang, Hui Liu

Copy-paste-modify is a widespread and pragmatic practice in software development, where developers adapt reused code snippets, sourced from platforms such as Stack Overflow, GitHub, or LLM outputs, into their local codebase. A critical yet underexplored aspect of this adaptation is code wiring, which involves substituting unresolved variables in the pasted code with suitable ones from the surrounding context. Existing solutions either rely on heuristic rules or historical templates, often failing to effectively utilize contextual information, despite studies showing that over half of adaptation cases are context-dependent. In this paper, we introduce WIRL, an LLM-based agent for code wiring framed as a Retrieval-Augmented Generation (RAG) infilling task. WIRL combines an LLM, a customized toolkit, and an orchestration module to identify unresolved variables, retrieve context, and perform context-aware substitutions. To balance efficiency and autonomy, the agent adopts a mixed strategy: deterministic rule-based steps for common patterns, and a state-machine-guided decision process for intelligent exploration. We evaluate WIRL on a carefully curated, high-quality dataset consisting of real-world code adaptation scenarios. Our approach achieves an exact match precision of 91.7% and a recall of 90.0%, outperforming advanced LLMs by 22.6 and 13.7 percentage points in precision and recall, respectively, and surpassing IntelliJ IDEA by 54.3 and 49.9 percentage points. These results underscore its practical utility, particularly in contexts with complex variable dependencies or multiple unresolved variables. We believe WIRL paves the way for more intelligent and context-aware developer assistance in modern IDEs.

复制-粘贴-修改是软件开发中一种广泛而务实的做法:开发者将来自Stack Overflow、GitHub或LLM输出等来源的复用代码片段改编后纳入本地代码库。这种改编中一个关键但研究不足的方面是代码接线(code wiring),即用周围上下文中合适的变量替换粘贴代码中未解析的变量。现有解决方案要么依赖启发式规则,要么依赖历史模板,往往无法有效利用上下文信息,尽管研究表明超过一半的改编案例依赖于上下文。在本文中,我们提出了WIRL,一个基于LLM的代码接线代理,并将该任务建模为检索增强生成(RAG)填充任务。WIRL结合了LLM、定制工具包和编排模块,用于识别未解析变量、检索上下文并执行上下文感知的替换。为了平衡效率与自主性,该代理采用混合策略:对常见模式采用确定性的基于规则的步骤,对智能探索采用由状态机引导的决策过程。我们在一个精心整理的、由真实代码改编场景组成的高质量数据集上评估了WIRL。我们的方法取得了91.7%的精确匹配精确率和90.0%的召回率,在精确率和召回率上分别比先进的LLM高出22.6和13.7个百分点,并比IntelliJ IDEA高出54.3和49.9个百分点。这些结果凸显了其实用价值,尤其是在变量依赖复杂或存在多个未解析变量的情形中。我们相信WIRL为现代IDE中更智能、更具上下文感知能力的开发者辅助铺平了道路。
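The first step of code wiring, identifying unresolved variables in a pasted snippet, can be sketched for Python snippets with the standard `ast` module. This is an illustrative, flow-insensitive approximation and not WIRL's toolkit; the function name `unresolved_names` is hypothetical, and choosing the right substitute from the surrounding context is not covered.

```python
import ast
import builtins


def unresolved_names(snippet: str) -> list[str]:
    """Return names that the snippet reads but never binds itself.

    Flow-insensitive approximation: a name counts as resolved if it is
    assigned anywhere in the snippet, is a function parameter, or is a builtin.
    """
    tree = ast.parse(snippet)
    bound, read = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            if isinstance(node.ctx, ast.Store):
                bound.add(node.id)
            else:
                read.add(node.id)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
    return sorted(n for n in read if n not in bound and not hasattr(builtins, n))


pasted = (
    "total = 0\n"
    "for item in items:\n"
    "    total += item * rate\n"
    "print(total)\n"
)
print(unresolved_names(pasted))
```

Here `items` and `rate` are the names that would have to be wired to variables from the surrounding code.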

Article 33

Title@2025-07-01 (2): Regulating Algorithmic Management: A Multi-Stakeholder Study of Challenges in Aligning Software and the Law for Workplace Scheduling

Title: Regulating Algorithmic Management: A Multi-Stakeholder Study of Challenges in Aligning Software and the Law for Workplace Scheduling

Regulierung des algorithmischen Managements: Eine Multi-Stakeholder-Studie über Herausforderungen bei der Ausrichtung von Software und dem Gesetz für die Arbeitsplanung

规范算法管理:一项关于工作场所排班中软件与法律对齐挑战的多方利益相关者研究 2505.02329v3

Authors (6): Jonathan Lynn, Rachel Y. Kim, Sicun Gao, Daniel Schneider, Sachin S. Pandya, Min Kyung Lee

Algorithmic management (AM)’s impact on worker well-being has led to calls for regulation. However, little is known about the effectiveness and challenges in real-world AM regulation across the regulatory process – rule operationalization, software use, and enforcement. Our multi-stakeholder study addresses this gap within workplace scheduling, one of the few AM domains with implemented regulations. We interviewed 38 stakeholders across the regulatory process: regulators, defense attorneys, worker advocates, managers, and workers. Our findings suggest that the efficacy of AM regulation is influenced by: (i) institutional constraints that challenge efforts to encode law into AM software, (ii) on-the-ground use of AM software that shapes its ability to facilitate compliance, (iii) mismatches between software and regulatory contexts that hinder enforcement, and (iv) unique concerns that software introduces when used to regulate AM. These findings underscore the importance of a sociotechnical approach to AM regulation, which considers organizational and collaborative contexts alongside the inherent attributes of software. We offer future research directions and implications for technology policy and design.

算法管理(AM)对工人福祉的影响引发了对其进行监管的呼声。然而,人们对现实中AM监管在整个监管过程(规则的可操作化、软件使用和执法)中的有效性和挑战知之甚少。我们的多方利益相关者研究针对工作场所排班这一领域填补了这一空白,它是少数已实施监管的AM领域之一。我们采访了监管过程中的38位利益相关者:监管者、辩护律师、工人权益倡导者、管理人员和工人。我们的研究结果表明,AM监管的成效受到以下因素的影响:(i)对将法律编码到AM软件中的努力构成挑战的制度约束,(ii)AM软件的实际使用方式,它决定了软件促进合规的能力,(iii)软件与监管情境之间的不匹配,这会妨碍执法,以及(iv)软件被用于监管AM时带来的独特问题。这些发现凸显了对AM监管采取社会技术方法的重要性,即在考虑软件固有属性的同时考虑组织和协作情境。我们提出了未来的研究方向以及对技术政策和设计的启示。

Article 34

Title@2025-07-01 (2): Bugs in the Shadows: Static Detection of Faulty Python Refactorings

Title: Bugs in the Shadows: Static Detection of Faulty Python Refactorings

Bugs in the Shadows: Statische Erkennung von fehlerhaften Python-Refactorings

阴影中的缺陷:Python错误重构的静态检测 2507.01103v1

Authors (4): Jonhnanthan Oliveira, Rohit Gheyi, Márcio Ribeiro, Alessandro Garcia

Python is a widely adopted programming language, valued for its simplicity and flexibility. However, its dynamic type system poses significant challenges for automated refactoring - an essential practice in software evolution aimed at improving internal code structure without changing external behavior. Understanding how type errors are introduced during refactoring is crucial, as such errors can compromise software reliability and reduce developer productivity. In this work, we propose a static analysis technique to detect type errors introduced by refactoring implementations for Python. We evaluated our technique on Rope refactoring implementations, applying them to open-source Python projects. Our analysis uncovered 29 bugs across four refactoring types from a total of 1,152 refactoring attempts. Several of these issues were also found in widely used IDEs such as PyCharm and PyDev. All reported bugs were submitted to the respective developers, and some of them were acknowledged and accepted. These results highlight the need to improve the robustness of current Python refactoring tools to ensure the correctness of automated code transformations and support reliable software maintenance.

Python是一种被广泛采用的编程语言,因其简洁和灵活而受到重视。然而,其动态类型系统给自动化重构带来了重大挑战;重构是软件演化中的一项基本实践,旨在不改变外部行为的前提下改进内部代码结构。理解类型错误如何在重构过程中被引入至关重要,因为这类错误会损害软件可靠性并降低开发者的生产率。在这项工作中,我们提出了一种静态分析技术,用于检测Python重构实现所引入的类型错误。我们在Rope的重构实现上评估了该技术,并将其应用于开源Python项目。我们的分析在总共1,152次重构尝试中发现了涉及四种重构类型的29个缺陷。其中若干问题也出现在PyCharm和PyDev等广泛使用的IDE中。所有缺陷均已报告给相应的开发者,其中一些已得到确认和接受。这些结果凸显了提高当前Python重构工具健壮性的必要性,以确保自动化代码转换的正确性并支持可靠的软件维护。

Article 35

Title@2025-07-01 (2): What About Emotions? Guiding Fine-Grained Emotion Extraction from Mobile App Reviews

Title: What About Emotions? Guiding Fine-Grained Emotion Extraction from Mobile App Reviews

Was ist mit Emotionen? Guiding Fine-Grained Emotion Extraction aus Mobile App Bewertungen

情感呢?指导从移动应用评论中进行细粒度情感提取 2505.23452v2

Authors (5): Quim Motger, Marc Oriol, Max Tiessler, Xavier Franch, Jordi Marco

Opinion mining plays a vital role in analysing user feedback and extracting insights from textual data. While most research focuses on sentiment polarity (e.g., positive, negative, neutral), fine-grained emotion classification in app reviews remains underexplored. Fine-grained emotion classification is thus needed to better understand users’ affective responses and support downstream tasks such as feature-emotion analysis, user-oriented release planning, and issue triaging. This paper addresses this gap by identifying and addressing the challenges and limitations in fine-grained emotion analysis in the context of app reviews. Our study adapts Plutchik’s emotion taxonomy to app reviews by developing a structured annotation framework and dataset. Through an iterative human annotation process, we define clear annotation guidelines and document key challenges in emotion classification. Additionally, we evaluate the feasibility of automating emotion annotation using large language models, assessing their cost-effectiveness and agreement with human-labelled data. Our findings reveal that while large language models significantly reduce manual effort and maintain substantial agreement with human annotators, full automation remains challenging due to the complexity of emotional interpretation. This work contributes to opinion mining in requirements engineering by providing structured guidelines, an annotated dataset, and insights for developing automated pipelines to capture the complexity of emotions in app reviews.

意见挖掘在分析用户反馈和从文本数据中提取见解方面发挥着至关重要的作用。虽然大多数研究关注情感极性(如正面、负面、中性),但应用评论中的细粒度情感分类仍未得到充分研究。因此,需要细粒度情感分类来更好地理解用户的情感反应,并支持特征-情感分析、面向用户的发布规划和问题分诊等下游任务。本文通过识别并应对应用评论场景下细粒度情感分析的挑战和局限来填补这一空白。我们的研究通过构建结构化的标注框架和数据集,将Plutchik的情感分类体系适配到应用评论。通过迭代的人工标注过程,我们制定了清晰的标注指南,并记录了情感分类中的关键挑战。此外,我们评估了使用大型语言模型自动进行情感标注的可行性,考察了其成本效益以及与人工标注数据的一致性。我们的研究结果表明,虽然大型语言模型显著减少了人工工作量,并与人工标注者保持了较高的一致性,但由于情感解读的复杂性,完全自动化仍然具有挑战性。这项工作通过提供结构化指南、带标注的数据集以及开发自动化流水线以捕捉应用评论中情感复杂性的见解,为需求工程中的意见挖掘做出了贡献。

Article 36

Title@2025-07-01 (2): Toward a Brazilian Research Agenda in Quantum Software Engineering: A Systematic Mapping Study

Title: Toward a Brazilian Research Agenda in Quantum Software Engineering: A Systematic Mapping Study

Auf dem Weg zu einer brasilianischen Forschungsagenda in der Quantensoftware-Engineering: Eine systematische Mapping-Studie

实现巴西量子软件工程研究议程:系统绘图研究 2506.11013v2

Authors (2): Filipe Fernandes, Cláudia Werner

Context: Quantum Software Engineering (QSE) has emerged as a promising discipline to support the development of quantum applications by integrating quantum computing principles with established software engineering practices. Problem: Despite recent growth, QSE still lacks standardized methodologies, tools, and guidelines. Moreover, countries like Brazil have had minimal representation in the development of this emerging field. Objective: This study aims to map the current state of QSE by identifying research trends, contributions, and gaps that can inform future investigations and strategic initiatives. Methodology: A systematic mapping study was conducted across major scientific databases, retrieving 3,219 studies. After applying inclusion and exclusion criteria, 3,052 studies were excluded, resulting in 167 selected for analysis. The publications were classified by study type, research type, and alignment with SWEBOK knowledge areas. Results: Most studies focused on Software Engineering Models and Methods, Software Architecture, and Software Testing. Conceptual and technical proposals were predominant, while empirical validations remained limited. Conclusions: QSE is still a maturing field. Advancing it requires standardization, more empirical research, and greater participation from developing countries. As its main contribution, this study proposes a Brazilian Research Agenda in QSE to guide national efforts and foster the development of a strong local scientific community.

背景:量子软件工程(QSE)已成为一门有前景的学科,它将量子计算原理与成熟的软件工程实践相结合,以支持量子应用的开发。问题:尽管近年来有所发展,QSE仍然缺乏标准化的方法、工具和指南。此外,巴西等国家在这一新兴领域的发展中代表性极低。目标:本研究旨在通过识别研究趋势、贡献和空白来描绘QSE的现状,为未来的研究和战略举措提供参考。方法:我们在主要科学数据库中开展了系统映射研究,检索到3,219项研究。应用纳入和排除标准后,排除了3,052项研究,最终选出167项进行分析。这些文献按研究类别(study type)、研究类型(research type)以及与SWEBOK知识领域的对应关系进行了分类。结果:大多数研究集中于软件工程模型与方法、软件架构和软件测试。概念性和技术性提案占主导地位,而实证验证仍然有限。结论:QSE仍是一个正在走向成熟的领域。推进该领域需要标准化、更多的实证研究以及发展中国家更多的参与。作为主要贡献,本研究提出了一项巴西QSE研究议程,以指导国家层面的工作并促进强大的本地科学共同体的发展。

Article 37

Title@2025-07-01 (2): A Study of In-Context-Learning-Based Text-to-SQL Errors

Title: A Study of In-Context-Learning-Based Text-to-SQL Errors

Eine Studie über In-Context-Learning-basierte Text-zu-SQL-Fehler

基于上下文学习的文本到SQL错误研究 2501.09310v2

Authors (9): Jiawei Shen, Chengcheng Wan, Ruoyi Qiao, Jiazhen Zou, Hang Xu, Yuchen Shao, Yueling Zhang, Weikai Miao, Geguang Pu

Large language models (LLMs) have been adopted to perform text-to-SQL tasks, utilizing their in-context learning (ICL) capability to translate natural language questions into structured query language (SQL). However, such a technique faces correctness problems and requires efficient repairing solutions. In this paper, we conduct the first comprehensive study of text-to-SQL errors. Our study covers four representative ICL-based techniques, five basic repairing methods, two benchmarks, and two LLM settings. We find that text-to-SQL errors are widespread and summarize 29 error types across 7 categories. We also find that existing repairing attempts achieve limited correctness improvement at the cost of high computational overhead and many mis-repairs. Based on the findings, we propose MapleRepair, a novel text-to-SQL error detection and repairing framework. The evaluation demonstrates that MapleRepair outperforms existing solutions by repairing 13.8% more queries with negligible mis-repairs and 67.4% less overhead.

采用大型语言模型(LLMS)来完成文本到SQL的任务,利用它们将自然语言问题转换成结构化查询语言(SQL)的内置学习能力(ICL)来将自然语言问题转换成结构化查询语言(SQL),然而,这种技术面临着正确性问题,需要高效的修复解决方案。在本文件中,我们首次对文本到SQL错误进行了全面研究。我们的研究涵盖了四种有代表性的ICL技术、五种基本修复方法、两个基准和两个LLM设置。我们发现文本到SQL的错误非常普遍,并总结了29种7类错误类型。我们还发现,现有的修复尝试在以许多错误修复的高计算管理成本下,其正确性改进有限。根据研究结果,我们提议了MapleRepair, 一个新的文本到SQL错误探测和修复框架。评估表明,MapleRepair通过修复13.8%的更多查询,用可忽略的错误修复,少67.4%的管理费,比现有的解决办法超出现有的解决办法。

Article 38

Title@2025-07-01 (2): Out of the Day Job: Perspectives of Industry Practitioners in Co-Design and Delivery of Software Engineering Courses

Title: Out of the Day Job: Perspectives of Industry Practitioners in Co-Design and Delivery of Software Engineering Courses

Out of the Day Job: Perspektiven von Industriepraktizierenden in Co-Design und Bereitstellung von Software-Engineering-Kursen

日常工作:行业从业人员在共同设计和提供软件工程课程方面的观点 2507.00803v1

Authors (9): Gillian Daniel, Chris Hall, Per Hammer, Alec-Angus Macdonald, Hollie Marwick-Best, Emma McKenzie, George Popa, Derek Somerville, Tim Storer

Over more than two decades, The University of Glasgow has co-designed and delivered numerous software engineering focused courses with industry partners, covering both technical and discipline specific professional skills. Such collaborations are not unique and many of the benefits are well recognised in the literature. These include enhancing the real-world relevance of curricula, developing student professional networks ahead of graduation and easing recruitment opportunities for employers. However, there is relatively little scholarship on the perspectives of industry practitioners who participate in course design and delivery. This gap is significant, since the effort invested by practitioners is often substantial and may require ongoing support from both the industry partner and academic institution. Understanding the motivations, expectations and experiences of practitioners who engage in course delivery can guide the formation of future partnerships and ensure their long-term sustainability. We begin to address this gap by reporting on the outcomes of a retrospective conducted amongst the practitioner coauthors of this paper, with the academic coauthors acting as facilitators. All coauthors have participated in the recent co-design and delivery of software engineering courses, but we choose to focus explicitly on the perspectives of the practitioners. We report on the themes that emerged from the discussions and our resulting recommendations for future collaborations.

二十多年来,格拉斯哥大学与行业伙伴共同设计并讲授了众多以软件工程为重点的课程,涵盖技术技能和学科特定的专业技能。这类合作并不罕见,其许多益处在文献中已得到充分认可,包括增强课程与现实世界的相关性、在毕业前帮助学生建立职业人脉,以及为雇主提供招聘便利。然而,关于参与课程设计和讲授的行业从业者视角的学术研究相对较少。这一空白意义重大,因为从业者投入的精力往往相当可观,并且可能需要行业伙伴和学术机构双方的持续支持。了解参与课程讲授的从业者的动机、期望和经验,可以指导未来合作关系的建立并确保其长期可持续性。我们通过报告一次回顾会的结果来着手填补这一空白;该回顾会在本文的从业者合著者之间进行,学术合著者担任主持人。所有合著者都参与了近期软件工程课程的共同设计和讲授,但我们选择明确聚焦于从业者的视角。我们报告了讨论中浮现的主题以及由此得出的对未来合作的建议。

Article 39

Title@2025-07-01 (2): Echoes of AI: Investigating the Downstream Effects of AI Assistants on Software Maintainability

Title: Echoes of AI: Investigating the Downstream Effects of AI Assistants on Software Maintainability

Echos von KI: Untersuchung der Auswirkungen von KI-Assistenten auf die Wartung von Software

AI的回声:调查AI助理对软件维护能力下游影响 2507.00788v1

Authors (8): Markus Borg, Dave Hewett, Nadim Hagatulah, Noric Couderc, Emma Söderberg, Donald Graham, Uttam Kini, Dave Farley

[Context] AI assistants, like GitHub Copilot and Cursor, are transforming software engineering. While several studies highlight productivity improvements, their impact on maintainability requires further investigation. [Objective] This study investigates whether co-development with AI assistants affects software maintainability, specifically how easily other developers can evolve the resulting source code. [Method] We conducted a two-phase controlled experiment involving 151 participants, 95% of whom were professional developers. In Phase 1, participants added a new feature to a Java web application, with or without AI assistance. In Phase 2, a randomized controlled trial, new participants evolved these solutions without AI assistance. [Results] AI-assisted development in Phase 1 led to a modest speedup in subsequent evolution and slightly higher average CodeHealth. Although neither difference was significant overall, the increase in CodeHealth was statistically significant when habitual AI users completed Phase 1. For Phase 1, we also observed a significant effect that corroborates previous productivity findings: using an AI assistant yielded a 30.7% median decrease in task completion time. Moreover, for habitual AI users, the mean speedup was 55.9%. [Conclusions] Our study adds to the growing evidence that AI assistants can effectively accelerate development. Moreover, we did not observe warning signs of degraded code-level maintainability. We recommend that future research focus on risks such as code bloat from excessive code generation and the build-up of cognitive debt as developers invest less mental effort during implementation.

[背景] GitHub Copilot和Cursor等AI助手正在改变软件工程。虽然多项研究强调了生产率的提升,但它们对可维护性的影响仍需进一步研究。[目标] 本研究考察与AI助手协同开发是否会影响软件的可维护性,具体而言,即其他开发者演进所产生的源代码的难易程度。[方法] 我们开展了一项两阶段对照实验,共有151名参与者,其中95%为专业开发者。在第一阶段,参与者在有或没有AI辅助的情况下为一个Java Web应用添加新功能。在第二阶段(一项随机对照试验)中,新的参与者在没有AI辅助的情况下演进这些解决方案。[结果] 第一阶段的AI辅助开发使后续演进略有加速,平均CodeHealth也略高。虽然两项差异总体上均不显著,但当第一阶段由惯常使用AI的用户完成时,CodeHealth的提升具有统计显著性。对于第一阶段,我们还观察到一个印证以往生产率研究结果的显著效应:使用AI助手使任务完成时间的中位数减少了30.7%。此外,对于惯常使用AI的用户,平均加速为55.9%。[结论] 我们的研究为AI助手能够有效加速开发这一日益增多的证据提供了补充。此外,我们没有观察到代码层面可维护性下降的警示信号。我们建议未来的研究关注以下风险:过度生成代码导致的代码膨胀,以及开发者在实现过程中投入较少脑力而累积的认知债务。

Article 40

Title@2025-07-01 (2): Snaps: Bloated and Outdated?

Title: Snaps: Bloated and Outdated?

Snaps: Aufgebläht und überholt?

Snap软件包:臃肿且过时? 2507.00786v1

Authors (2): Jukka Ruohonen, Qusai Ramadan

Snap is an alternative software packaging system developed by Canonical and provided by default in the Ubuntu Linux distribution. Given the heterogeneity of various Linux distributions and their various releases, Snap allows an interoperable delivery of software directly to users. However, concerns and criticism have also been frequently expressed. Regarding this criticism, the paper shows that currently distributed snap packages are indeed on average bloated in terms of their sizes and outdated in terms of their update frequencies. With these empirical observations, this short paper contributes to the research domain of software packaging, software packages, and package managers.

Snap是由Canonical开发的另一种软件打包系统,在Ubuntu Linux发行版中默认提供。鉴于各种Linux发行版及其不同版本之间的异构性,Snap可以以可互操作的方式将软件直接交付给用户。然而,人们也经常对其表示担忧和批评。针对这些批评,本文表明,目前分发的snap软件包平均而言在体积上确实臃肿,在更新频率上也确实过时。凭借这些实证观察,这篇短文为软件打包、软件包和包管理器这一研究领域做出了贡献。

Article 41

Title@2025-07-01 (2): Assessing Correctness in LLM-Based Code Generation via Uncertainty Estimation

Title: Assessing Correctness in LLM-Based Code Generation via Uncertainty Estimation

Beurteilung der Korrektheit in der LLM-basierten Codegenerierung über Unsicherheitsabschätzung

通过不确定性估算评估基于LLM的代码生成的正确性 2502.11620v3

Authors (2): Arindam Sharma, Cristina David

In this work, we explore uncertainty estimation as a proxy for correctness in LLM-generated code. To this end, we adapt two state-of-the-art techniques from natural language generation – one based on entropy and another on mutual information – to the domain of code generation. Given the distinct semantic properties of code, we introduce modifications, including a semantic equivalence check based on symbolic execution. Our findings indicate a strong correlation between the uncertainty computed through these techniques and correctness, highlighting the potential of uncertainty estimation for quality assessment. Additionally, we propose a simplified version of the entropy-based method that assumes a uniform distribution over the LLM’s responses, demonstrating comparable effectiveness. Using these techniques, we develop an abstention policy that prevents the model from making predictions when uncertainty is high, reducing incorrect outputs to near zero. Our evaluation on the LiveCodeBench shows that our approach significantly outperforms a baseline relying solely on LLM-reported log-probabilities.

在这项工作中,我们探索将不确定性估计作为LLM生成代码正确性的代理指标。为此,我们将自然语言生成领域的两种最先进技术(一种基于熵,另一种基于互信息)适配到代码生成领域。鉴于代码具有独特的语义性质,我们引入了若干修改,包括基于符号执行的语义等价性检查。我们的研究结果表明,通过这些技术计算出的不确定性与正确性之间存在很强的相关性,凸显了不确定性估计在质量评估方面的潜力。此外,我们提出了基于熵的方法的一个简化版本,它假设LLM的响应服从均匀分布,并表现出相当的效果。利用这些技术,我们制定了一种弃权策略,在不确定性较高时阻止模型给出预测,从而将错误输出降低到接近零。我们在LiveCodeBench上的评估表明,我们的方法显著优于仅依赖LLM报告的对数概率的基线。
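The entropy-based idea with a uniform distribution over responses, combined with an abstention policy, can be sketched roughly as follows. This is an illustration under loud assumptions, not the authors' method: sampled programs are grouped by their outputs on a few probe inputs, which stands in for the symbolic-execution equivalence check named in the abstract, and the helper names and thresholds are hypothetical.

```python
import math
from collections import Counter


def behaviour_signature(func, probes):
    """Stand-in for a semantic equivalence check: outputs on fixed probe inputs."""
    return tuple(func(x) for x in probes)


def cluster_entropy(candidates, probes):
    """Entropy over equivalence clusters, weighting every sample equally."""
    counts = Counter(behaviour_signature(f, probes) for f in candidates)
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def answer_or_abstain(candidates, probes, threshold):
    """Return a sample from the largest cluster, or None when uncertainty is high."""
    if cluster_entropy(candidates, probes) > threshold:
        return None  # abstain instead of risking an incorrect output
    signatures = [behaviour_signature(f, probes) for f in candidates]
    majority = Counter(signatures).most_common(1)[0][0]
    return candidates[signatures.index(majority)]


# Four hypothetical samples for "double the input"; the last one is wrong.
samples = [lambda x: x * 2, lambda x: x + x, lambda x: 2 * x, lambda x: x ** 2]
probes = [1, 2, 3]

entropy = cluster_entropy(samples, probes)
lenient = answer_or_abstain(samples, probes, threshold=0.6)
strict = answer_or_abstain(samples, probes, threshold=0.5)
print(round(entropy, 4), lenient is not None, strict is None)
```

With three of the four samples agreeing, the entropy is about 0.56 nats, so the lenient threshold returns an answer while the strict one abstains.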

Article 42

Title@2025-07-01 (2): A Hierarchical and Evolvable Benchmark for Fine-Grained Code Instruction Following with Multi-Turn Feedback

Title: A Hierarchical and Evolvable Benchmark for Fine-Grained Code Instruction Following with Multi-Turn Feedback

Ein hierarchischer und evolvierbarer Benchmark für eine feinkörnige Code-Anleitung mit Multi-Turn-Feedback

面向细粒度代码指令遵循的层次化可演化基准(含多轮反馈) 2507.00699v1

Authors (6): Guoliang Duan, Mingwei Liu, Yanlin Wang, Chong Wang, Xin Peng, Zibin Zheng

Large language models (LLMs) have advanced significantly in code generation, yet their ability to follow complex programming instructions with layered and diverse constraints remains underexplored. Existing benchmarks often prioritize functional correctness, overlooking the nuanced requirements found in real-world development. We introduce MultiCodeIF, a comprehensive benchmark designed to evaluate instruction-following in code generation across multiple dimensions: constraint type, hierarchical levels, and iterative refinement. Built upon a structured taxonomy of 9 categories and 27 constraint types, MultiCodeIF enables granular assessment of both functional and non-functional instruction adherence. Using an automated pipeline, ConstraGen, we synthesize and evolve 2,021 code tasks sourced from 14 programming languages, supporting multi-turn evaluation through feedback-driven task variants. Empirical evaluation of six state-of-the-art LLMs uncovers substantial performance disparities. The top-performing model, Claude-3-7-Sonnet, achieves 63.0% average constraint satisfaction, while smaller models like Qwen3-1.7B fall to 44.8%. Models perform well on explicit constraints, but struggle with implicit or abstract constraints. Tasks with multiple hierarchical constraints significantly reduce model success rates, from 54.5% in single-level to just 18.8% in multi-level scenarios. However, structured feedback enables progressive improvement: average constraint satisfaction rises from 63.0% to 83.4% over four iterative refinement rounds. MultiCodeIF provides a scalable, constraint-aware, and feedback-sensitive framework to benchmark LLMs under realistic code generation scenarios, bridging the gap between synthetic evaluations and real-world instruction complexity. The full benchmark dataset, evaluation pipeline, and source code are available at https://github.com/SYSUSELab/MultiCodeIF.

大型语言模型(LLM)在代码生成方面取得了显著进展,但它们遵循带有分层且多样约束的复杂编程指令的能力仍未得到充分研究。现有基准往往优先考虑功能正确性,而忽视了真实开发中细致入微的要求。我们提出了MultiCodeIF,这是一个综合基准,旨在从约束类型、层次级别和迭代改进等多个维度评估代码生成中的指令遵循能力。MultiCodeIF建立在包含9个类别和27种约束类型的结构化分类体系之上,能够对功能性和非功能性指令的遵循情况进行细粒度评估。借助自动化流水线ConstraGen,我们合成并演化了2,021个来自14种编程语言的代码任务,并通过反馈驱动的任务变体支持多轮评估。对六个最先进LLM的实证评估揭示了显著的性能差异。表现最好的模型Claude-3-7-Sonnet的平均约束满足率为63.0%,而Qwen3-1.7B等较小模型则降至44.8%。模型在显式约束上表现良好,但在隐式或抽象约束上表现不佳。带有多层次约束的任务会显著降低模型的成功率,从单层场景的54.5%降至多层场景的仅18.8%。不过,结构化反馈能够带来逐步改进:经过四轮迭代改进,平均约束满足率从63.0%上升到83.4%。MultiCodeIF提供了一个可扩展、约束感知且对反馈敏感的框架,用于在真实的代码生成场景下对LLM进行基准测试,弥合了合成评估与真实指令复杂性之间的差距。完整的基准数据集、评估流水线和源代码可在 https://github.com/SYSUSELab/MultiCodeIF 获取。

Article 43

Title@2025-07-01 (2): A Domain-specific Language and Architecture for Detecting Process Activities from Sensor Streams in IoT

Title: A Domain-specific Language and Architecture for Detecting Process Activities from Sensor Streams in IoT

Domänenspezifische Sprache und Architektur zur Erkennung von Prozessaktivitäten von Sensor Streams im IoT

用于检测从IoT传感器流中检测进程活动的特定域语言和架构 2507.00686v1

Authors (4): Ronny Seiger, Daniel Locher, Marco Kaufmann, Aaron F. Kurz

Modern Internet of Things (IoT) systems are equipped with a plethora of sensors providing real-time data about the current operations of their components, which is crucial for the systems’ internal control systems and processes. However, these data are often too fine-grained to derive useful insights into the execution of the larger processes an IoT system might be part of. Process mining has developed advanced approaches for the analysis of business processes that may also be used in the context of IoT. Bringing process mining to IoT requires an event abstraction step to lift the low-level sensor data to the business process level. In this work, we aim to empower domain experts to perform this step using a newly developed domain-specific language (DSL) called Radiant. Radiant supports the specification of patterns within the sensor data that indicate the execution of higher level process activities. These patterns are translated to complex event processing (CEP) applications to be used for detecting activity executions at runtime. We propose a corresponding software architecture for online event abstraction from IoT sensor streams using the CEP applications. We evaluate these applications to monitor activity executions using IoT sensors in smart manufacturing and smart healthcare. The evaluation method and results inform the domain expert about the quality of activity detections and potential for improvement.

现代物联网(IoT)系统配备了大量传感器,提供关于其组件当前运行情况的实时数据,这对系统的内部控制系统和流程至关重要。然而,这些数据往往粒度过细,难以从中获得关于IoT系统可能参与的更大流程执行情况的有用洞察。过程挖掘已经发展出用于分析业务流程的先进方法,这些方法也可用于IoT场景。将过程挖掘引入IoT需要一个事件抽象步骤,把低层传感器数据提升到业务流程层面。在这项工作中,我们旨在让领域专家能够使用一种新开发的领域特定语言(DSL)Radiant来完成这一步骤。Radiant支持在传感器数据中指定模式,这些模式表明更高层流程活动的执行。这些模式被翻译为复杂事件处理(CEP)应用,用于在运行时检测活动的执行。我们提出了一种相应的软件架构,利用这些CEP应用对IoT传感器流进行在线事件抽象。我们在智能制造和智能医疗场景中评估了这些应用使用IoT传感器监测活动执行的效果。评估方法和结果可以让领域专家了解活动检测的质量和改进potential(改进潜力)。
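The event abstraction step, lifting low-level sensor readings to process activities, can be illustrated with a small Python sketch. It does not use Radiant syntax or a CEP engine: the `detect_activity` generator, the `ActivityEvent` record, and the sample readings and thresholds are all hypothetical and only show how a pattern over a sensor stream can yield an activity with a start and end time.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Tuple


@dataclass
class ActivityEvent:
    activity: str
    start: float
    end: float


def detect_activity(
    readings: Iterable[Tuple[float, float]],
    name: str,
    condition: Callable[[float], bool],
    min_duration: float,
) -> Iterator[ActivityEvent]:
    """Emit an activity whenever `condition` holds for at least `min_duration`."""
    start = last = None
    for timestamp, value in readings:
        if condition(value):
            if start is None:
                start = timestamp
            last = timestamp
        else:
            if start is not None and last - start >= min_duration:
                yield ActivityEvent(name, start, last)
            start = None
    # Close an activity that is still running when the stream ends.
    if start is not None and last - start >= min_duration:
        yield ActivityEvent(name, start, last)


# (timestamp in seconds, motor speed in rpm); values are made up for illustration.
stream = [(0, 0), (1, 900), (2, 950), (3, 920), (4, 10), (5, 880), (6, 5)]
events = list(detect_activity(stream, "drilling", lambda rpm: rpm > 800, min_duration=2))
print(events)
```

The short spike at second 5 is filtered out by the minimum duration, so only one `drilling` activity (seconds 1 to 3) reaches the process level.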

Article 44

Title@2025-07-01 (2): The Secrets Must Not Flow: Scaling Security Verification to Large Codebases (extended version)

Title: The Secrets Must Not Flow: Scaling Security Verification to Large Codebases (extended version)

Die Geheimnisse dürfen nicht fließen: Skalierung der Sicherheitsüberprüfung auf große Codebases (erweiterte Version)

秘密不得流动:将安全核查扩大到大型代码库(扩展版) 2507.00595v1

Authors (6): Linard Arquint, Samarth Kishor, Jason R. Koenig, Joey Dodds, Daniel Kroening, Peter Müller

Existing program verifiers can prove advanced properties about security protocol implementations, but are difficult to scale to large codebases because of the manual effort required. We develop a novel methodology called Diodon that addresses this challenge by splitting the codebase into the protocol implementation (the Core) and the remainder (the Application). This split allows us to apply powerful semi-automated verification techniques to the security-critical Core, while fully-automatic static analyses scale the verification to the entire codebase by ensuring that the Application cannot invalidate the security properties proved for the Core. The static analyses achieve that by proving I/O independence, i.e., that the I/O operations within the Application are independent of the Core’s security-relevant data (such as keys), and that the Application meets the Core’s requirements. We have proved Diodon sound by first showing that we can safely allow the Application to perform I/O independent of the security protocol, and second that manual verification and static analyses soundly compose. We evaluate Diodon on two case studies: an implementation of the signed Diffie-Hellman key exchange and a large (100k+ LoC) production Go codebase implementing a key exchange protocol for which we obtained secrecy and injective agreement guarantees by verifying a Core of about 1% of the code with the auto-active program verifier Gobra in less than three person months.

现有的程序验证器能够证明安全协议实现的高级性质,但由于需要大量人工工作,难以扩展到大型代码库。我们提出了一种名为Diodon的新方法来应对这一挑战:将代码库划分为协议实现(Core)和其余部分(Application)。这种划分使我们能够对安全关键的Core应用强大的半自动验证技术,而全自动静态分析则通过确保Application不会破坏已为Core证明的安全性质,将验证扩展到整个代码库。静态分析通过证明I/O独立性来实现这一点,即Application中的I/O操作独立于Core的安全相关数据(例如密钥),并且Application满足Core的要求。我们证明了Diodon的可靠性:首先表明可以安全地允许Application执行独立于安全协议的I/O,其次表明人工验证与静态分析可以可靠地组合。我们在两个案例研究上评估了Diodon:一个签名Diffie-Hellman密钥交换的实现,以及一个实现密钥交换协议的大型(超过10万行代码)生产级Go代码库;对于后者,我们使用自动主动(auto-active)程序验证器Gobra验证了约占代码1%的Core,在不到三个人月的时间内获得了保密性和单射一致性保证。

Article 45

Title@2025-07-01 (2): Coverage-Guided Testing for Deep Learning Models: A Comprehensive Survey

Title: Coverage-Guided Testing for Deep Learning Models: A Comprehensive Survey

Coverage-Guided Testing für Deep Learning Modelle: Eine umfassende Umfrage

深学习模型的覆盖面-指导测试:全面调查 2507.00496v1

Authors (4): Hongjing Guo, Chuanqi Tao, Zhiqiu Huang, Weiqin Zou

As Deep Learning (DL) models are increasingly applied in safety-critical domains, ensuring their quality has emerged as a pressing challenge in modern software engineering. Among emerging validation paradigms, coverage-guided testing (CGT) has gained prominence as a systematic framework for identifying erroneous or unexpected model behaviors. Despite growing research attention, existing CGT studies remain methodologically fragmented, limiting the understanding of current advances and emerging trends. This work addresses that gap through a comprehensive review of state-of-the-art CGT methods for DL models, including test coverage analysis, coverage-guided test input generation, and coverage-guided test input optimization. This work provides detailed taxonomies to organize these methods based on methodological characteristics and application scenarios. We also investigate evaluation practices adopted in existing studies, including the use of benchmark datasets, model architectures, and evaluation aspects. Finally, open challenges and future directions are highlighted in terms of the correlation between structural coverage and testing objectives, method generalizability across tasks and models, practical deployment concerns, and the need for standardized evaluation and tool support. This work aims to provide a roadmap for future academic research and engineering practice in DL model quality assurance.

由于深入学习(DL)模式越来越多地应用于安全关键领域,确保质量已成为现代软件工程的一个紧迫挑战,在新兴的验证模式中,覆盖指导测试(CGT)作为查明错误或意外模式行为的系统框架越来越受到重视。尽管研究日益重视,现有的覆盖指导测试(CGT)研究在方法上仍然支离破碎,限制了对当前进展和新趋势的理解。这项工作通过全面审查DL模型的最新水平的CGT方法,包括测试范围分析、覆盖指导测试投入生成和覆盖指导测试投入优化,解决了这一差距。这项工作提供了详细的分类,以根据方法特点和应用设想来组织这些方法。我们还调查了现有研究中采用的评价做法,包括使用基准数据集、模型架构和评价方面。最后,在结构覆盖面和测试目标、跨任务和模式的方法通用性、实际部署问题以及标准化评价和工具支持的必要性等方面,突出了公开的挑战和未来的方向。这项工作旨在为DL模型质量保证的未来学术研究和工程实践提供路线图。
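As a concrete illustration of the "structural coverage" that the survey discusses, the following is a minimal sketch of neuron coverage, one widely used criterion: the fraction of neurons activated above a threshold by at least one test input. The tiny one-layer network, its weights, and the threshold are illustrative assumptions and are not taken from the survey.

```python
from typing import List, Sequence


def relu_layer(x: Sequence[float],
               weights: Sequence[Sequence[float]],
               biases: Sequence[float]) -> List[float]:
    """Compute ReLU(Wx + b) for one dense layer using plain Python lists."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, biases)
    ]


def neuron_coverage(inputs: Sequence[Sequence[float]],
                    weights: Sequence[Sequence[float]],
                    biases: Sequence[float],
                    threshold: float = 0.0) -> float:
    """Fraction of neurons whose activation exceeds `threshold` on some input."""
    covered = [False] * len(weights)
    for x in inputs:
        for k, activation in enumerate(relu_layer(x, weights, biases)):
            if activation > threshold:
                covered[k] = True
    return sum(covered) / len(covered)


# Hypothetical 2-input, 3-neuron layer used only for this illustration.
WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
BIASES = [0.0, 0.0, 0.0]

small_suite = [[1.0, 0.0]]                              # activates neuron 0 only
larger_suite = [[1.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]   # activates all three

coverage_small = neuron_coverage(small_suite, WEIGHTS, BIASES)
coverage_large = neuron_coverage(larger_suite, WEIGHTS, BIASES)
```

A coverage-guided generator would keep a mutated input only if it raises this number, which is the feedback loop the "coverage-guided test input generation" category refers to.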

Article 46

Title@2025-07-01 (2): The Influence of HEXACO Personality Traits on the Teamwork Quality in Software Teams – A Preliminary Research Approach

Title: The Influence of HEXACO Personality Traits on the Teamwork Quality in Software Teams – A Preliminary Research Approach

Der Einfluss von HEXACO Persönlichkeitseigenschaften auf die Teamwork-Qualität in Software-Teams - ein vorläufiger Forschungsansatz

HEXACO个人经历对软件团队团队工作质量的影响 – – 初步研究方法 2507.00481v1

Authors (3): Philipp M. Zähl, Sabine Theis, Martin R. Wolf

Although software engineering research has focused on optimizing processes and technology, there is a growing recognition that human factors, particularly teamwork, also significantly impact optimization. Recent research suggests that developer personality has a strong influence on teamwork. In fact, personality considerations may have a greater impact on software development than processes and tools. This paper aims to design a study that measures the impact of HEXACO personality traits on the Teamwork Quality (TWQ) of software teams. A preliminary data collection (n=54) was conducted for this purpose. The analysis showed that several personality traits, as well as their composition, had a significant impact on TWQ. Additionally, other variables, such as the proportion of women and age distribution, also affected TWQ. The study’s initial results demonstrate the usefulness and validity of the study design. The results also suggest several opportunities to improve teamwork in IT organizations and avenues for further research.

虽然软件工程研究的重点是优化流程和技术,但人们日益认识到,人的因素,特别是团队合作,也具有显著的影响优化;最近的研究表明,开发商的个性对团队合作有重大影响;事实上,个性考虑对软件开发的影响可能大于程序和工具;本文件旨在设计一项研究,衡量HEXACO个性特征对软件团队团队团队工作质量的影响;为此目的进行了初步数据收集(n=54);分析显示,若干个性特征及其构成对团队合作产生了重大影响。此外,其他变量,如妇女比例和年龄分布,也影响到TWQ。研究的初步结果显示,研究设计有用和有效。研究结果还表明,有好几次机会改进信息技术组织的合作,并有途径进行进一步研究。

Article 47

Title@2025-07-01 (2): Embedded DevOps: A Survey on the Application of DevOps Practices in Embedded Software and Firmware Development

Title: Embedded DevOps: A Survey on the Application of DevOps Practices in Embedded Software and Firmware Development

Embedded DevOps: Eine Umfrage zur Anwendung von DevOps-Praxis in Embedded Software und Firmware-Entwicklung

嵌入式DevOps:关于在嵌入式软件和公司软件开发中应用DevOps做法的调查 2507.00421v1

Authors (2): Parthiv Katapara, Anand Sharma

The adoption of DevOps practices in embedded systems and firmware development is emerging as a response to the growing complexity of modern hardware–software co-designed products. Unlike cloud-native applications, embedded systems introduce challenges such as hardware dependency, real-time constraints, and safety-critical requirements. This literature review synthesizes findings from 20 academic and industrial sources to examine how DevOps principles, particularly continuous integration, continuous delivery, and automated testing, are adapted to embedded contexts. We categorize efforts across tooling, testing strategies, pipeline automation, and security practices. The review highlights current limitations in deployment workflows and observability and proposes a roadmap for future research. This work offers researchers and practitioners a consolidated understanding of Embedded DevOps, bridging fragmented literature with a structured perspective.

在嵌入系统和实件开发中采用DevOps做法,正逐渐成为现代硬件软件共同设计产品日益复杂的一种对策。与云型应用程序不同,嵌入系统带来了硬件依赖、实时限制和安全关键要求等挑战。本文献审查综合了20个学术和工业来源的研究结果,以研究DevOps原则如何适应嵌入环境,特别是持续整合、连续交付和自动测试。我们将各种工具、测试战略、管道自动化和安全做法的努力分类。审查突出了目前部署工作流程和可观测性方面的局限性,提出了未来研究路线图。这项工作为研究人员和从业人员提供了对嵌入式DevOps的综合理解,以结构化的观点将零碎的文献连接起来。

Article 48

Title@2025-07-01 (2): Recommending Variable Names for Extract Local Variable Refactorings

Title: Recommending Variable Names for Extract Local Variable Refactorings

Empfehlung von Variablennamen für lokale Variablen-Refaktorings extrahieren

为抽取本地变量重构建议变量名称 2507.00413v1

Authors (4): Taiming Wang, Hui Liu, Yuxia Zhang, Yanjie Jiang

Extract local variable is one of the most popular refactorings, and most IDEs and refactoring tools provide automated support for it. However, we find that approximately 70% of the names recommended by these IDEs differ from what developers manually constructed, which adds renaming burden for developers and provides limited assistance. In this paper, we introduce VarNamer, an automated approach designed to recommend variable names for extract local variable refactorings. Through a large-scale empirical study, we identify key contexts that are useful for composing variable names. Leveraging these insights, we developed a set of heuristic rules through static program analysis techniques and employed data mining techniques to recommend variable names effectively. Notably, some of our heuristic rules have been successfully integrated into Eclipse, where they are now distributed with the latest releases of the IDE. Evaluation demonstrates its superiority over state-of-the-art IDEs. Specifically, VarNamer increases the chance of an exact match by 52.6% compared to Eclipse and by 40.7% compared to IntelliJ IDEA. We also evaluated the proposed approach with real-world extract local variable refactorings conducted in C++ projects, and the results suggest that the approach can achieve comparable performance on programming languages besides Java, which may indicate that VarNamer generalizes. Finally, we designed and conducted a user study, whose results suggest that our approach can speed up the refactoring by 27.8% and reduce edits to the recommended variable names by 49.3%.

抽取局部变量是最常用的重构之一,大多数 IDE 和重构工具都为其提供自动化支持。然而,我们发现这些 IDE 推荐的名称中约有 70% 与开发者手工构造的名称不同,这给开发者带来了额外的重命名负担,帮助有限。本文介绍 VarNamer,一种为抽取局部变量重构推荐变量名的自动化方法。通过大规模实证研究,我们确定了对构造变量名有用的关键上下文。基于这些发现,我们利用程序静态分析技术制定了一组启发式规则,并采用数据挖掘技术来有效地推荐变量名。值得注意的是,我们的部分启发式规则已成功集成到 Eclipse 中,并随该 IDE 的最新版本发布。评估表明该方法优于最先进的 IDE。具体而言,与 Eclipse 相比,VarNamer 将精确匹配的几率提高了 52.6%,与 IntelliJ IDEA 相比提高了 40.7%。我们还在 C++ 项目中真实的抽取局部变量重构上评估了该方法,结果表明它在 Java 以外的编程语言上也能取得相当的性能,这可能说明 VarNamer 具有可推广性。最后,我们设计并开展了一项用户研究,结果表明我们的方法可以将重构速度提高 27.8%,并将对推荐变量名的编辑减少 49.3%。

Article 49

Title@2025-07-01 (2): Revealing Floating-Point Accumulation Orders in Software/Hardware Implementations

Title: Revealing Floating-Point Accumulation Orders in Software/Hardware Implementations

Enthüllen von Anhäufungsaufträgen in Software/Hardware-Implementierungen

揭示软件/硬件实现中的浮点累加顺序 2411.00442v3

Authors (4): Peichen Xie, Yanjie Gao, Yang Wang, Jilong Xue

Accumulation-based operations, such as summation and matrix multiplication, are fundamental to numerous computational domains. However, their accumulation orders are often undocumented in existing software and hardware implementations, making it difficult for developers to ensure consistent results across systems. To address this issue, we introduce FPRev, a diagnostic tool designed to reveal the accumulation order in the software and hardware implementations through numerical testing. With FPRev, developers can identify and compare accumulation orders, enabling developers to create reproducible software and verify implementation equivalence. FPRev is a testing-based tool that non-intrusively reveals the accumulation order by analyzing the outputs of the tested implementation for distinct specially designed inputs. Employing FPRev, we showcase the accumulation orders of popular libraries (such as NumPy and PyTorch) on CPUs and GPUs (including GPUs with specialized matrix accelerators such as Tensor Cores). We also validate the efficiency of FPRev through extensive experiments. FPRev exhibits a lower time complexity compared to the basic solution. FPRev is open-sourced at https://github.com/peichenxie/FPRev.

累加类运算(如求和与矩阵乘法)是众多计算领域的基础。然而,现有软件和硬件实现通常没有记录其累加顺序,使开发者难以确保跨系统的结果一致。为了解决这一问题,我们提出了 FPRev,一种通过数值测试揭示软件和硬件实现中累加顺序的诊断工具。借助 FPRev,开发者可以识别和比较累加顺序,从而创建可复现的软件并验证实现的等价性。FPRev 是一种基于测试的工具,它通过分析被测实现在若干特殊设计的输入上的输出,以非侵入的方式揭示累加顺序。我们利用 FPRev 展示了流行库(如 NumPy 和 PyTorch)在 CPU 和 GPU(包括配有 Tensor Core 等专用矩阵加速器的 GPU)上的累加顺序。我们还通过大量实验验证了 FPRev 的效率。与基本解法相比,FPRev 具有更低的时间复杂度。FPRev 已在 https://github.com/peichenxie/FPRev 开源。
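To make "revealing the accumulation order through specially designed inputs" concrete, here is a simplified sketch of one possible probing idea. It is an assumption for illustration and not FPRev's actual algorithm: place a very large value and its negation at two positions and fill the rest with ones. In float64, any 1.0 added while the accumulator holds the large value is absorbed, so the output counts only the ones added after the two large values have cancelled. The helper names (`probe`, `sum_left_to_right`, `sum_right_to_left`) are hypothetical.

```python
from typing import Callable, List

# 2**60 is so large that adding 1.0 to it leaves it unchanged in float64
# (the spacing between neighbouring doubles near 2**60 is 256).
BIG = 2.0 ** 60


def sum_left_to_right(xs: List[float]) -> float:
    """Reference implementation: accumulate from the first element."""
    acc = 0.0
    for x in xs:
        acc += x
    return acc


def sum_right_to_left(xs: List[float]) -> float:
    """Reference implementation: accumulate from the last element."""
    acc = 0.0
    for x in reversed(xs):
        acc += x
    return acc


def probe(sum_fn: Callable[[List[float]], float], n: int, i: int, j: int) -> float:
    """Run `sum_fn` on n ones with BIG at index i and -BIG at index j."""
    xs = [1.0] * n
    xs[i] = BIG
    xs[j] = -BIG
    return sum_fn(xs)


# The same input yields different outputs under different accumulation orders.
left_result = probe(sum_left_to_right, n=8, i=0, j=3)
right_result = probe(sum_right_to_left, n=8, i=0, j=3)
```

Under left-to-right accumulation the four ones after index 3 survive, while under right-to-left accumulation every one is absorbed before the cancellation. Repeating the probe for different index pairs is one way such outputs could be combined to distinguish orders without inspecting the implementation.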

Article 50

Title@2025-07-01 (2): Agentic AI in Product Management: A Co-Evolutionary Model

Title: Agentic AI in Product Management: A Co-Evolutionary Model

Agentische KI im Produktmanagement: Ein ko-evolutionäres Modell

产品管理中的代理式 AI:共同演化模型 2507.01069v1

Authors (1): Nishant A. Parikh

This study explores agentic AI’s transformative role in product management, proposing a conceptual co-evolutionary framework to guide its integration across the product lifecycle. Agentic AI, characterized by autonomy, goal-driven behavior, and multi-agent collaboration, redefines product managers (PMs) as orchestrators of socio-technical ecosystems. Using systems theory, co-evolutionary theory, and human-AI interaction theory, the framework maps agentic AI capabilities in discovery, scoping, business case development, development, testing, and launch. An integrative review of 70+ sources, including case studies from leading tech firms, highlights PMs’ evolving roles in AI orchestration, supervision, and strategic alignment. Findings emphasize mutual adaptation between PMs and AI, requiring skills in AI literacy, governance, and systems thinking. Addressing gaps in traditional frameworks, this study provides a foundation for future research and practical implementation to ensure responsible, effective agentic AI integration in software organizations.

本研究探讨了代理式 AI(agentic AI)在产品管理中的变革作用,提出了一个概念性的共同演化框架,以指导其融入整个产品生命周期。代理式 AI 以自主性、目标驱动行为和多代理协作为特征,将产品经理(PM)重新定义为社会技术生态系统的协调者。该框架借助系统理论、共同演化理论和人类与 AI 交互理论,梳理了代理式 AI 在发现、范围界定、商业论证制定、开发、测试和发布各阶段的能力。对 70 多个来源(包括领先科技公司的案例研究)的综合性综述,突出了产品经理在 AI 协调、监督和战略对齐方面不断演变的角色。研究结果强调产品经理与 AI 之间的相互适应,这需要 AI 素养、治理和系统思维方面的技能。本研究弥补了传统框架的不足,为未来的研究和实际实施奠定了基础,以确保在软件组织中负责任且有效地整合代理式 AI。

Article 51

Title@2025-07-01 (2): iPanda: An Intelligent Protocol Testing and Debugging Agent for Conformance Testing

Title: iPanda: An Intelligent Protocol Testing and Debugging Agent for Conformance Testing

iPanda: Ein intelligenter Protokolltest- und Debugging-Agent für Konformitätsprüfungen

iPanda:一个智能协议测试和调试代理人,用于合规测试 2507.00378v1

Authors (8): Xikai Sun, Fan Dang, Kebin Liu, Xin Miao, Zihao Yang, Haimo Lu, Yawen Zheng, Yunhao Liu

Conformance testing is essential for ensuring that protocol implementations comply with their specifications. However, traditional testing approaches involve manually creating numerous test cases and scripts, making the process labor-intensive and inefficient. Recently, Large Language Models (LLMs) have demonstrated impressive text comprehension and code generation abilities, providing promising opportunities for automation. In this paper, we propose iPanda, the first end-to-end framework that leverages LLMs to automate protocol conformance testing. Given a protocol specification document and its implementation, iPanda first employs a keyword-based method to automatically generate comprehensive test cases. Then, it utilizes a code-based retrieval-augmented generation approach to effectively interpret the implementation and produce executable test code. To further enhance code quality, iPanda incorporates an iterative self-correction mechanism to refine generated test scripts interactively. Finally, by executing and analyzing the generated tests, iPanda systematically verifies compliance between implementations and protocol specifications. Comprehensive experiments on various protocols show that iPanda significantly outperforms pure LLM-based approaches, improving the success rate (Pass@1) of test-code generation by factors ranging from 4.675 times to 10.751 times.

一致性测试对于确保协议实现符合其规范至关重要。然而,传统测试方法需要人工编写大量测试用例和脚本,过程费力且低效。近来,大语言模型(LLM)展现出出色的文本理解和代码生成能力,为自动化带来了可观的机会。本文提出 iPanda,这是首个利用 LLM 实现协议一致性测试自动化的端到端框架。给定协议规范文档及其实现,iPanda 首先采用基于关键字的方法自动生成全面的测试用例。然后,它利用基于代码的检索增强生成方法来有效理解实现并生成可执行的测试代码。为进一步提高代码质量,iPanda 引入了迭代式自我纠正机制,以交互方式完善生成的测试脚本。最后,通过执行并分析生成的测试,iPanda 系统地验证实现与协议规范之间的一致性。在多种协议上的全面实验表明,iPanda 明显优于纯粹基于 LLM 的方法,将测试代码生成的成功率(Pass@1)提高至原来的 4.675 至 10.751 倍。
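The iterative self-correction step described above can be sketched as a generate, execute, feed-back loop. This is a minimal illustration under stated assumptions, not iPanda's implementation: `generate` stands in for an LLM call and `execute` for a test runner, and both are replaced here by hypothetical stubs so the example runs on its own.

```python
from typing import Callable, Optional, Tuple


def refine_test_script(
    generate: Callable[[Optional[str]], str],
    execute: Callable[[str], Tuple[bool, str]],
    max_rounds: int = 3,
) -> Tuple[Optional[str], int]:
    """Regenerate a script until it executes cleanly or rounds run out.

    Returns the passing script (or None) and the number of rounds used.
    """
    feedback: Optional[str] = None
    for round_no in range(1, max_rounds + 1):
        script = generate(feedback)       # feedback is None on the first round
        ok, message = execute(script)
        if ok:
            return script, round_no
        feedback = message                # the error text drives the next attempt
    return None, max_rounds


def fake_generate(feedback: Optional[str]) -> str:
    """Hypothetical stand-in for an LLM: wrong first, corrected after feedback."""
    return "assert 1 + 1 == 3" if feedback is None else "assert 1 + 1 == 2"


def fake_execute(script: str) -> Tuple[bool, str]:
    """Hypothetical stand-in for a test runner: executes the script text."""
    try:
        exec(script, {})
        return True, ""
    except Exception as exc:  # any failure becomes feedback for the generator
        return False, repr(exc)


final_script, rounds_used = refine_test_script(fake_generate, fake_execute)
```

Pass@1 in this setting would correspond to the share of tasks for which the first generated script already passes, that is, where the loop returns after round 1.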

Article 52

Title@2025-07-01 (2): An AST-guided LLM Approach for SVRF Code Synthesis

Title: An AST-guided LLM Approach for SVRF Code Synthesis

Ein AST-geführter LLM-Ansatz für die SVRF-Codesynthese

SVRF 准则合成的AST制导LLM法 2507.00352v1

Authors (4): Abanoub E. Abdelmalak, Mohamed A. Elsayed, David Abercrombie, Ilhami Torunoglu

Standard Verification Rule Format (SVRF) is essential for semiconductor applications like Design Rule Check (DRC), Layout Versus Schematic (LVS), and Optical Proximity Correction (OPC). It faces challenges as advancing nodes create complex design rules that render traditional SVRF development ineffective and highlight an expertise gap. This paper introduces a novel methodology integrating Abstract Syntax Tree (AST) embedding and Retrieval-Augmented Generation (RAG) for enhanced SVRF code synthesis, ensuring semantic accuracy and error minimization through structural validation with domain-specific insights for precise code generation. We evaluate different T5-based models and propose an innovative SVRF-specific scoring framework that complements standard metrics like BLEU and ROUGE-L. In our approach, AST provides rigorous structural validation, while RAG infuses relevant domain knowledge, effectively enhancing the code generation workflow. Tested on a comprehensive benchmark of 740 DRC rule implementations, our methodology demonstrates up to a 40% improvement in code generation accuracy compared to a basic text-based fine-tuning process. This fusion of industry expertise with advanced coding strategies not only optimizes SVRF development under limited dataset constraints but also creates a more intuitive and efficient coding environment. Consequently, users can rapidly iterate through design cycles, reduce manual error correction, and significantly improve overall productivity.

标准验证规则格式(SVRF)对于设计规则检查(DRC)、版图与原理图对照(LVS)和光学邻近校正(OPC)等半导体应用至关重要,但随着工艺节点的推进,设计规则日益复杂,传统的 SVRF 开发方式变得低效,并凸显出专业知识缺口。本文介绍了一种新方法,将抽象语法树(AST)嵌入与检索增强生成(RAG)相结合,以改进 SVRF 代码合成,通过结构验证和领域知识确保语义准确性并尽量减少错误,从而实现精确的代码生成。我们评估了不同的基于 T5 的模型,并提出了一个创新的 SVRF 专用评分框架,作为 BLEU 和 ROUGE-L 等标准指标的补充。在我们的方法中,AST 提供严格的结构验证,RAG 注入相关的领域知识,从而有效地改进代码生成流程。在包含 740 个 DRC 规则实现的综合基准上的测试表明,与基本的基于文本的微调过程相比,我们的方法将代码生成准确率最多提高了 40%。这种行业专业知识与先进编码策略的融合,不仅在数据集有限的条件下优化了 SVRF 开发,还创造了更直观、更高效的编码环境。因此,用户可以快速迭代设计周期,减少人工纠错,并显著提高整体生产率。

Article 53

Title@2025-07-01 (2): VTS-Guided AI Interaction Workflow for Business Insights

Title: VTS-Guided AI Interaction Workflow for Business Insights

VTS-geführter KI-Interaktions-Workflow für Business Insights

VTS 指导的AI 商业观察互动工作流程 2507.00347v1

Authors (6): Sun Ding, Ude Enebeli, Atilhan, Manay, Ryan Pua, Kamal Kotak

Modern firms face a flood of dense, unstructured reports. Turning these documents into usable insights takes heavy effort and is far from agile when quick answers are needed. VTS-AI tackles this gap. It integrates Visual Thinking Strategies, which emphasize evidence-based observation, linking, and thinking, into AI agents, so the agents can extract business insights from unstructured text, tables, and images at scale. The system works in three tiers (micro, meso, macro). It tags issues, links them to source pages, and rolls them into clear action levers stored in a searchable YAML file. In tests on an 18-page business report, VTS-AI matched the speed of a one-shot ChatGPT prompt yet produced richer findings: page locations, verbatim excerpts, severity scores, and causal links. Analysts can accept or adjust these outputs in the same IDE, keeping human judgment in the loop. Early results show VTS-AI spots the direction of key metrics and flags where deeper number-crunching is needed. Next steps include mapping narrative tags to financial ratios, adding finance-tuned language models through a Model-Context Protocol, and building a Risk & Safety Layer to stress-test models and secure data. These upgrades aim to make VTS-AI a production-ready, audit-friendly tool for rapid business analysis.

现代企业面临大量密集的非结构化报告。将这些文件转化为可用的洞察需要大量精力,在需要快速回答时远谈不上敏捷。VTS-AI 旨在弥补这一差距。它将强调基于证据的观察、关联和思考的视觉思维策略(Visual Thinking Strategies)整合到 AI 代理中,使代理能够大规模地从非结构化文本、表格和图像中提取商业洞察。该系统分三个层次(微观、中观、宏观)运作。它标记问题,将其链接到来源页面,并汇总为清晰的行动杠杆,存储在可搜索的 YAML 文件中。在一份 18 页商业报告上的测试中,VTS-AI 的速度与一次性 ChatGPT 提示相当,但产生了更丰富的结果:页面位置、逐字摘录、严重程度评分和因果关联。分析师可以在同一 IDE 中接受或调整这些输出,使人的判断始终处于回路之中。早期结果表明,VTS-AI 能够识别关键指标的走向,并标记需要更深入数值分析之处。后续工作包括将叙述性标签映射到财务比率、通过模型上下文协议(Model-Context Protocol)添加针对金融调优的语言模型,以及构建风险与安全层以对模型进行压力测试并保护数据。这些升级旨在使 VTS-AI 成为可用于生产、便于审计的快速商业分析工具。

Article 54

Title@2025-06-30 (1): Failure by Interference: Language Models Make Balanced Parentheses Errors When Faulty Mechanisms Overshadow Sound Ones

Title: Failure by Interference: Language Models Make Balanced Parentheses Errors When Faulty Mechanisms Overshadow Sound Ones

Fehler durch Interferenz: Sprachmodelle machen ausgeglichene Klammern Fehler, wenn fehlerhafte Mechanismen Klangeindrücke überschatten

干扰导致的失败:当错误机制压倒可靠机制时,语言模型会产生平衡括号错误 2507.00322v1

Authors (4): Daking Rai, Samuel Miller, Kevin Moran, Ziyu Yao

Despite remarkable advances in coding capabilities, language models (LMs) still struggle with simple syntactic tasks such as generating balanced parentheses. In this study, we investigate the underlying mechanisms behind the persistence of these errors across LMs of varying sizes (124M-7B) to both understand and mitigate the errors. Our study reveals that LMs rely on a number of components (attention heads and FF neurons) that independently make their own predictions. While some components reliably promote correct answers across a generalized range of inputs (i.e., implementing “sound mechanisms”), others are less reliable and introduce noise by promoting incorrect tokens (i.e., implementing “faulty mechanisms”). Errors occur when the faulty mechanisms overshadow the sound ones and dominantly affect the predictions. Motivated by this insight, we introduce RASteer, a steering method to systematically identify and increase the contribution of reliable components for improving model performance. RASteer substantially improves performance on balanced parentheses tasks, boosting accuracy of some models from 0% to around 100% without impairing the models’ general coding ability. We further demonstrate its broader applicability in arithmetic reasoning tasks, achieving performance gains of up to around 20%.

尽管语言模型(LM)的编码能力取得了显著进步,它们在生成平衡括号等简单语法任务上仍然表现不佳。在本研究中,我们考察了这些错误在不同规模(124M-7B)的语言模型中持续存在的内在机制,以理解并减轻这些错误。我们的研究表明,语言模型依赖若干各自独立作出预测的组件(注意力头和前馈神经元)。一些组件能够在广泛的输入范围内可靠地促进正确答案(即实现“可靠机制”),另一些组件则不够可靠,会通过促进错误的词元引入噪声(即实现“错误机制”)。当错误机制压倒可靠机制并主导预测时,就会出现错误。受此启发,我们提出了 RASteer,一种系统地识别并增强可靠组件贡献的引导方法,以提升模型性能。RASteer 大幅提升了平衡括号任务上的表现,将部分模型的准确率从 0% 提高到约 100%,且不损害模型的通用编码能力。我们进一步展示了它在算术推理任务中更广泛的适用性,取得了最高约 20% 的性能提升。

Article 55

Title@2025-06-30 (1): Making a Pipeline Production-Ready: Challenges and Lessons Learned in the Healthcare Domain

Title: Making a Pipeline Production-Ready: Challenges and Lessons Learned in the Healthcare Domain

Herstellung einer Pipeline-Produktion: Herausforderungen und Lektionen im Bereich Healthcare

《管道生产-准备:保健领域的挑战和经验教训》 2506.06946v2

Authors (6): Daniel Angelo Esteves Lawand, Lucas Quaresma Medina Lam, Roberto Oliveira Bolgheroni, Renato Cordeiro Ferreira, Alfredo Goldman, Marcelo Finger

Deploying a Machine Learning (ML) training pipeline into production requires good software engineering practices. Unfortunately, the typical data science workflow often leads to code that lacks critical software quality attributes. This experience report investigates this problem in SPIRA, a project whose goal is to create an ML-Enabled System (MLES) to pre-diagnose respiratory insufficiency via speech analysis. This paper presents an overview of the architecture of the MLES, then compares three versions of its Continuous Training subsystem: from a proof-of-concept Big Ball of Mud (v1), to a design pattern-based Modular Monolith (v2), to a test-driven set of Microservices (v3). Each version improved its overall extensibility, maintainability, robustness, and resiliency. The paper shares challenges and lessons learned in this process, offering insights for researchers and practitioners seeking to productionize their pipelines.

将机器学习(ML)训练流水线部署到生产环境需要良好的软件工程实践。不幸的是,典型的数据科学工作流程往往产生缺乏关键软件质量属性的代码。本经验报告在 SPIRA 项目中研究了这一问题,该项目的目标是创建一个基于 ML 的系统(ML-Enabled System,MLES),通过语音分析对呼吸功能不全进行预诊断。本文概述了 MLES 的架构,然后比较了其持续训练(Continuous Training)子系统的三个版本:从概念验证性质的“大泥球”(Big Ball of Mud,v1),到基于设计模式的模块化单体(Modular Monolith,v2),再到测试驱动的一组微服务(Microservices,v3)。每个版本都提高了整体的可扩展性、可维护性、健壮性和弹性。本文分享了这一过程中遇到的挑战和经验教训,为希望将流水线投入生产的研究人员和从业人员提供参考。

Article 56

Title@2025-06-30 (1): Rust vs. C for Python Libraries: Evaluating Rust-Compatible Bindings Toolchains

Title: Rust vs. C for Python Libraries: Evaluating Rust-Compatible Bindings Toolchains

Rust vs. C für Python Bibliotheken: Bewertung von Rust-kompatiblen Bindungen Toolchains

面向 Python 库的 Rust 与 C 之比较:评估兼容 Rust 的绑定工具链 2507.00264v1

Authors (3): Isabella Basso do Amaral, Renato Cordeiro Ferreira, Alfredo Goldman

The Python programming language is best known for its syntax and scientific libraries, but it is also notorious for its slow interpreter. Optimizing critical sections in Python entails special knowledge of the binary interactions between programming languages, and can be cumbersome to interface manually, with implementers often resorting to convoluted third-party libraries. This comparative study evaluates the performance and ease of use of the PyO3 Python bindings toolchain for Rust against ctypes and cffi. By using Rust tooling developed for Python, we can achieve state-of-the-art performance with no concern for API compatibility.

Python 编程语言以其语法和科学计算库著称,但也因解释器速度慢而饱受诟病。优化 Python 中的关键部分需要专门了解编程语言之间的二进制交互,手工对接可能十分繁琐,实现者常常不得不求助于复杂的第三方库。这项比较研究评估了面向 Rust 的 PyO3 Python 绑定工具链相对于 ctypes 和 cffi 的性能和易用性。通过使用为 Python 开发的 Rust 工具,我们可以在无需担心 API 兼容性的情况下实现最先进的性能。

Article 57

Title@2025-06-30 (1): A Metascience Study of the Low-Code Scientific Field

Title: A Metascience Study of the Low-Code Scientific Field

Eine Metawissenschaftsstudie des wissenschaftlichen Bereichs mit niedrigem Code

低代码科学领域的元科学研究 2408.05975v2

Authors (3): Mauro Dalle Lucca Tosi, Javier Luis Cánovas Izquierdo, Jordi Cabot

In recent years, model-related publications have been exploring the application of modeling techniques across various domains. Initially focused on UML and the Model-Driven Architecture approach, the literature has been evolving towards the usage of more general concepts such as Model-Driven Development or Model-Driven Engineering. More recently, however, the term “low-code” has taken the modeling field by storm, largely due to its association with several highly popular development platforms. The research community is still discussing the differences and commonalities between this emerging term and previous modeling-related concepts, as well as the broader implications of low-code for the modeling field. In this paper, we present a metascience study of Low-Code. Our study follows a two-fold approach: (1) to analyze the composition and growth (e.g., size, diversity, venues, and topics) of the emerging Low-Code community; and (2) to explore how these aspects differ from those of the “classical” model-driven community. Ultimately, we hope to trigger a discussion on the current state and potential future trajectory of the low-code community, as well as the opportunities for collaboration and synergies between the low-code and modeling communities.

在过去几年里,与模型有关的出版物一直在探索在各个领域应用模型技术,最初侧重于UML和模型驱动结构方法,文献正在逐步转向使用更一般的概念,如模型驱动开发或模型驱动工程,但最近,“低代码”一词主要由于与几个高度流行的发展平台有关,以风暴的形式进入了模型领域;研究界仍在讨论这个新兴术语与以前模型相关概念之间的差异和共性,以及低代码对模型领域的更广泛影响。在本文件中,我们介绍了对低代码的元科学研究。我们的研究遵循了两种方法:(1) 分析新兴低代码社区的构成和增长(例如规模、多样性、地点和议题);(2) 探讨这些方面与“古典”模式驱动的社区的不同之处。最后,我们希望就低代码社区的现状和未来轨迹展开讨论,同时探讨低代码社区之间合作和协同作用的机会。

Article 58

Title: Is It Safe To Learn And Share? On Psychological Safety and Social Learning in (Agile) Communities of Practice

Ist es sicher, zu lernen und zu teilen? Über Psychologische Sicherheit und soziales Lernen in (Agile) Gemeinschaften der Praxis

学习和分享是否安全?论(敏捷)实践社区中的心理安全与社会学习 2507.01065v1

Authors (3): Christiaan Verwijs, Evelien Acun-Roos, Daniel Russo

As hybrid, distributed, and asynchronous work models become more prevalent, continuous learning in Agile Software Development (ASD) gains renewed importance. Communities of Practice (CoPs) are increasingly adopted to support social learning beyond formal education, often relying on virtual communication. Psychological safety, a prerequisite for effective learning, remains insufficiently understood in these settings. This mixed-methods study investigates psychological safety within Agile CoPs through survey data from 143 participants. Results indicate that psychological safety is significantly lower in online interactions compared to face-to-face settings. Moreover, low psychological safety reduces participants’ intent to continue contributing and avoidance of interpersonal risk. No significant differences emerged based on gender, community seniority, or content creation activity. However, differences by role and age group suggest potential generational or role-related effects. Thematic analysis revealed exclusionary behavior, negative interaction patterns, and hostility as primary threats to psychological safety, often reinforced by tribalism and specific community dynamics. Suggested interventions include establishing explicit norms, structured facilitation, and active moderation. The findings were validated through member checking with 30 participants. This study provides a comparative perspective on interaction modalities and offers practical guidance for organizers seeking to cultivate inclusive, high-impact CoPs and similarly structured virtual or hybrid work environments.

随着混合、分布式和异步工作模式日益普遍,敏捷软件开发(ASD)中的持续学习重新变得重要。实践社区(CoP)越来越多地被用来支持正规教育之外的社会学习,并且往往依赖虚拟交流。心理安全是有效学习的先决条件,但在这些环境中仍未得到充分理解。这项混合方法研究通过 143 名参与者的调查数据,考察了敏捷实践社区中的心理安全。结果表明,与面对面环境相比,在线互动中的心理安全显著更低。此外,较低的心理安全会降低参与者继续贡献的意愿以及对人际风险的回避。基于性别、社区资历或内容创作活动未发现显著差异。然而,角色和年龄组之间的差异表明可能存在代际或与角色相关的影响。主题分析显示,排斥行为、消极互动模式和敌意是心理安全的主要威胁,并常因部落主义和特定的社区动态而加剧。建议的干预措施包括建立明确的规范、结构化的引导和积极的管理。研究结果通过与 30 名参与者的成员核查得到了验证。本研究提供了关于互动方式的比较视角,并为希望培育包容、高影响力的实践社区以及类似结构的虚拟或混合工作环境的组织者提供了实用指导。

Article 59

Title@2025-06-30 (1): Mitigating Hallucinations in YOLO-based Object Detection Models: A Revisit to Out-of-Distribution Detection

Title: Mitigating Hallucinations in YOLO-based Object Detection Models: A Revisit to Out-of-Distribution Detection

Halluzinationen in YOLO-basierten Objekterkennungsmodellen abmildern: Ein Besuch bei Out-of-Distribution Detection

以YOLO为基地的物体探测模型中的减轻幻觉:重新研究扩散外探测 2503.07330v2

Authors (5): Weicheng He, Changshun Wu, Chih-Hong Cheng, Xiaowei Huang, Saddek Bensalem

Object detection systems must reliably perceive objects of interest without being overly confident to ensure safe decision-making in dynamic environments. Filtering techniques based on out-of-distribution (OoD) detection are commonly added as an extra safeguard to filter hallucinations caused by overconfidence in novel objects. Nevertheless, evaluating YOLO-family detectors and their filters under existing OoD benchmarks often leads to unsatisfactory performance. This paper studies the underlying reasons for performance bottlenecks and proposes a methodology to improve performance fundamentally. Our first contribution is a calibration of all existing evaluation results: Although images in existing OoD benchmark datasets are claimed not to have objects within in-distribution (ID) classes (i.e., categories defined in the training dataset), around 13% of objects detected by the object detector are actually ID objects. Dually, the ID dataset containing OoD objects can also negatively impact the decision boundary of filters. These ultimately lead to a significantly imprecise performance estimation. Our second contribution is to consider the task of hallucination reduction as a joint pipeline of detectors and filters. By developing a methodology to carefully synthesize an OoD dataset that semantically resembles the objects to be detected, and using the crafted OoD dataset in the fine-tuning of YOLO detectors to suppress the objectness score, we achieve an 88% reduction in overall hallucination error with a combined fine-tuned detection and filtering system on the self-driving benchmark BDD-100K. Our code and dataset are available at: https://gricad-gitlab.univ-grenoble-alpes.fr/dnn-safety/m-hood.

目标检测系统必须可靠地感知目标对象,同时不能过度自信,以确保在动态环境中安全决策。基于分布外(OoD)检测的过滤技术通常被用作额外的保障,以过滤因对新物体过度自信而产生的幻觉。然而,在现有 OoD 基准上评估 YOLO 系列检测器及其过滤器,往往得到不理想的性能。本文研究了性能瓶颈的根本原因,并提出了一种从根本上提升性能的方法。我们的第一个贡献是对所有现有评估结果的校准:尽管现有 OoD 基准数据集中的图像据称不包含分布内(ID)类别(即训练数据集中定义的类别)的物体,但目标检测器检测到的物体中约有 13% 实际上是 ID 物体。与之对应,包含 OoD 物体的 ID 数据集也会对过滤器的决策边界产生负面影响。这些因素最终导致性能估计严重失准。我们的第二个贡献是将减少幻觉的任务视为检测器与过滤器的联合流水线。我们提出了一种方法,精心合成在语义上与待检测物体相似的 OoD 数据集,并将其用于 YOLO 检测器的微调以抑制物体性(objectness)得分;在自动驾驶基准 BDD-100K 上,微调后的检测与过滤相结合的系统使整体幻觉错误减少了 88%。我们的代码和数据集可在 https://gricad-gitlab.univ-grenoble-alpes.fr/dnn-safety/m-hood 获取。

Article 60

Title@2025-06-30 (1): Exploring Challenges in Test Mocking: Developer Questions and Insights from StackOverflow

Title: Exploring Challenges in Test Mocking: Developer Questions and Insights from StackOverflow

Herausforderungen im Test-Mocking erkunden: Entwickler-Fragen und Einblicke aus StackOverflow

探索试验模拟的挑战:来自斯塔克流的开发者问题和洞察 2505.08300v2

Authors (5): Mumtahina Ahmed, Md Nahidul Islam Opu, Chanchal Roy, Sujana Islam Suhi, Shaiful Chowdhury

Mocking is a common unit testing technique that is used to simplify tests, reduce flakiness, and improve coverage by replacing real dependencies with simplified implementations. Despite its widespread use in Open Source Software projects, there is limited understanding of how and why developers use mocks and the challenges they face. In this collaborative study, we have analyzed 25,302 questions related to mocking on StackOverflow to identify the challenges faced by developers. We have used Latent Dirichlet Allocation for topic modeling, identified 30 key topics, and grouped the topics into five key categories. We then analyzed the annual and relative probabilities of each category to understand the evolution of mocking-related discussions. Trend analysis reveals that categories like Advanced Programming peaked between 2009 and 2012 but have since declined, while categories such as Mocking Techniques and External Services have remained consistently dominant, highlighting evolving developer priorities and ongoing technical challenges. Our findings also show an inverse relationship between a topic’s popularity and its difficulty. Popular topics like Framework Selection tend to have lower difficulty and faster resolution times, while complex topics like HTTP Requests and Responses are more likely to remain unanswered and take longer to resolve. A classification of questions into How, Why, What, and Other revealed that over 70% are How questions, particularly in practical domains like file access and APIs, indicating a strong need for implementation guidance. Why questions are more prevalent in error-handling contexts, reflecting conceptual challenges in debugging, while What questions are rare and mostly tied to theoretical discussions. These insights offer valuable guidance for improving developer support, tooling, and educational content in the context of mocking and unit testing.

模拟是一种常见的单位测试技术,用于简化测试、减少不毛和通过简化执行来取代实际依赖性来改进覆盖范围,从而以简化执行方式取代实际依赖性。尽管在开放源码软件项目中广泛使用,但对于开发商如何和为什么使用模拟以及他们所面临的挑战了解有限。在这项合作研究中,我们分析了与在STACKVOVLOWOW上嘲笑有关的25,302个问题,以找出开发商面临的挑战。我们用《利特·迪里特分配》来进行主题建模,确定了30个关键主题,并将专题分为5个关键类别。因此,我们分析了每一类别的年度和相对概率,以了解与模拟有关的讨论的演变。趋势分析显示,诸如高级方案编制在2009年至2012年期间达到高峰,但此后却有所下降。在“技术与外部服务”等类别中,我们分析了25,302个与“技术与技术开发者”有关的问题,以辨别开发者所面临的挑战。我们的调查结果还表明,某个主题的受欢迎程度与困难之间存在反向反关系。框架选择等大众议题的难度往往会减少和更快的解答时间,而解决时间,而像HTTP请求与反应的复杂背景中的问题又如何在70个领域反映和应对问题,为什么持续和反应更长期的问题,这些问题,如何解解析问题如何在研究领域。

Article 61

Title@2025-06-30 (1): Bug Fixing with Broader Context: Enhancing LLM-Based Program Repair via Layered Knowledge Injection

Title: Bug Fixing with Broader Context: Enhancing LLM-Based Program Repair via Layered Knowledge Injection

Fehlerbehebung mit breiterem Kontext: Verbesserung der LLM-basierten Programm-Reparatur durch geschichtete Wissensinjektion

以更广泛的背景解决错误:通过多层知识注射加强基于LLM的方案修复 2506.24015v1

Authors (4): Ramtin Ehsani, Esteban Parra, Sonia Haiduc, Preetha Chatterjee

Prompting LLMs with bug-related context (e.g., error messages, stack traces) improves automated program repair, but many bugs still remain unresolved. In real-world projects, developers often rely on broader repository and project-level context beyond the local code to resolve such bugs. In this paper, we investigate how automatically extracting and providing such knowledge can improve LLM-based program repair. We propose a layered knowledge injection framework that incrementally augments LLMs with structured context. It starts with the Bug Knowledge Layer, which includes information such as the buggy function and failing tests; expands to the Repository Knowledge Layer, which adds structural dependencies, related files, and commit history; and finally injects the Project Knowledge Layer, which incorporates relevant details from documentation and previously fixed bugs. We evaluate this framework on a dataset of 314 bugs from BugsInPy using two LLMs (Llama 3.3 and GPT-4o-mini), and analyze fix rates across six bug types. By progressively injecting knowledge across layers, our approach achieves a fix rate of 79% (250/314) using Llama 3.3, a significant improvement of 23% over previous work. All bug types show improvement with the addition of repository-level context, while only a subset benefit further from project-level knowledge, highlighting that different bug types require different levels of contextual information for effective repair. We also analyze the remaining unresolved bugs and find that more complex and structurally isolated bugs, such as Program Anomaly and GUI bugs, remain difficult even after injecting all available information. Our results show that layered context injection improves program repair and suggest the need for interactive and adaptive APR systems.

使用与错误相关的上下文( 如错误信息、堆叠痕迹) 提示LLM 的LLMS 提示与错误相关的LMS 改进自动程序修复, 但许多错误仍未解决。在现实世界的项目中, 开发者经常依靠超出本地代码的更广泛的仓库和项目级环境来解决错误。在本文中, 我们调查如何自动提取和提供这种知识可以改进基于 LLM 的程序修复。我们建议一个分层的知识注入框架, 以结构化的上下文递增LLMS 。它从“ 错误知识层” 开始, 包括诸如错误功能和失败测试等信息; 扩展到存储库知识层, 从而增加结构依赖、相关文档和历史; 最后, 输入“ 项目知识层” , 包含来自文档和先前固定错误的相关细节。我们用两个 LLLlama 3. 3.3 和 GPT-4- Miny 来分析六种错误的修补率率。通过不断注入知识, 我们的方法达到了79% 的修补率 , Llama 3.3 和 A train 程序显示不同版本的改进了我们之前的精细程序, 的精细程序需要进一步的精细的精细。
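The incremental layering described in this abstract can be sketched as a prompt builder that adds one knowledge layer at a time. This is a minimal illustration only: the class `KnowledgeLayers`, the function `build_prompt`, and the section format are hypothetical names chosen here, not the authors' implementation, and how each layer is extracted from a repository is left out.

```python
from dataclasses import dataclass, field

# Layer names follow the abstract: Bug -> Repository -> Project.
LAYER_ORDER = ("bug", "repository", "project")


@dataclass
class KnowledgeLayers:
    """Hypothetical container: each layer maps an item name to its text."""
    bug: dict = field(default_factory=dict)         # e.g. buggy function, failing tests
    repository: dict = field(default_factory=dict)  # e.g. dependencies, commit history
    project: dict = field(default_factory=dict)     # e.g. documentation, fixed bugs


def build_prompt(layers: KnowledgeLayers, up_to: str) -> str:
    """Concatenate all layers from the bug layer up to and including `up_to`."""
    if up_to not in LAYER_ORDER:
        raise ValueError(f"unknown layer: {up_to!r}")
    sections = []
    for name in LAYER_ORDER:
        for key, text in getattr(layers, name).items():
            sections.append(f"## {name}: {key}\n{text}")
        if name == up_to:
            break
    sections.append("Propose a fix for the buggy function.")
    return "\n\n".join(sections)
```

Calling `build_prompt(layers, "bug")` first and widening to `"repository"` or `"project"` only when the fix fails mirrors the progressive injection the abstract describes.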

Article 62

Title@2025-06-30 (1): STCLocker: Deadlock Avoidance Testing for Autonomous Driving Systems

Title: STCLocker: Deadlock Avoidance Testing for Autonomous Driving Systems

STCLocker: Deadlock-Vermeidungstests für autonome Fahrsysteme

STCLocker:自动驾驶系统死锁避免测试 2506.23995v1

Authors (5): Mingfei Cheng, Renzhi Wang, Xiaofei Xie, Yuan Zhou, Lei Ma

Autonomous Driving System (ADS) testing is essential to ensure the safety and reliability of autonomous vehicles (AVs) before deployment. However, existing techniques primarily focus on evaluating ADS functionalities in single-AV settings. As ADSs are increasingly deployed in multi-AV traffic, it becomes crucial to assess their cooperative performance, particularly regarding deadlocks, a fundamental coordination failure in which multiple AVs enter a circular waiting state indefinitely, resulting in motion planning failures. Despite its importance, the cooperative capability of ADSs to prevent deadlocks remains underexplored. To address this gap, we propose the first dedicated Spatio-Temporal Conflict-Guided Deadlock Avoidance Testing technique, STCLocker, for generating DeadLock Scenarios (DLSs), where a group of AVs controlled by the ADS under test are in a circular wait state. STCLocker consists of three key components: Deadlock Oracle, Conflict Feedback, and Conflict-aware Scenario Generation. Deadlock Oracle provides a reliable black-box mechanism for detecting deadlock cycles among multiple AVs within a given scenario. Conflict Feedback and Conflict-aware Scenario Generation collaborate to actively guide AVs into simultaneous competition over spatial conflict resources (i.e., shared passing regions) and temporal competitive behaviors (i.e., reaching the conflict region at the same time), thereby increasing the effectiveness of generating conflict-prone deadlocks. We evaluate STCLocker on two types of ADSs: Roach, an end-to-end ADS, and OpenCDA, a module-based ADS supporting cooperative communication. Experimental results show that, on average, STCLocker generates more DLSs than the best-performing baseline.

自动驾驶系统(ADS)测试对于确保自动驾驶车辆(AV)在部署之前的安全性和可靠性至关重要。然而,现有技术主要侧重于在单AV环境下评估ADS功能。随着ADS越来越多地部署在多AV交通中,评估其协作性能变得至关重要,特别是在死锁方面:死锁是一种基本的协调失败,多辆AV无限期地进入循环等待状态,导致运动规划失败。尽管这很重要,ADS防止死锁的协作能力仍未得到充分探索。为弥补这一差距,我们提出了首个专用的时空冲突引导的死锁避免测试技术STCLocker,用于生成死锁场景(DLS),即由被测ADS控制的一组AV处于循环等待状态。STCLocker由三个关键部分组成:Deadlock Oracle(死锁判定器)、Conflict Feedback(冲突反馈)和Conflict-aware Scenario Generation(冲突感知场景生成)。Deadlock Oracle提供可靠的黑盒机制,用于检测给定场景中多辆AV之间的死锁环。冲突反馈与冲突感知场景生成相互配合,主动引导AV同时竞争空间冲突资源(即共享通行区域)并产生时间上的竞争行为(即同时到达冲突区域),从而提高生成易发冲突死锁的有效性。我们在两类ADS上评估STCLocker:端到端ADS Roach,以及支持协同通信的模块化ADS OpenCDA。实验结果表明,平均而言,STCLocker生成的DLS多于表现最佳的基线。
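The circular wait state that defines a deadlock can be illustrated with a cycle search over a wait-for graph. The sketch below is a generic depth-first search, not the paper's Deadlock Oracle: the function name `find_wait_cycle` and the assumption that a `waits_for` mapping between AVs is already available are illustrative choices made here, and the abstract does not say how such relations are derived in a black-box setting.

```python
from typing import Dict, List, Optional, Set


def find_wait_cycle(waits_for: Dict[str, Set[str]]) -> Optional[List[str]]:
    """Return one cycle of vehicles that wait on each other, or None.

    waits_for[a] is the set of vehicles that vehicle `a` is waiting for
    (an assumed input; deriving it from a simulation is out of scope).
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    color: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GREY
        path.append(node)
        for nxt in sorted(waits_for.get(node, ())):
            state = color.get(nxt, WHITE)
            if state == GREY:
                # Back edge: nxt is on the current path, so a cycle closes here.
                return path[path.index(nxt):]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in sorted(waits_for):
        if color.get(start, WHITE) == WHITE:
            cycle = visit(start)
            if cycle:
                return cycle
    return None
```

For example, three vehicles that each wait for the next one around an intersection form a cycle, while a simple queue of vehicles does not.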

Article 63

Title@2025-06-30 (1): Green Metrics Tool: Measuring for fun and profit

Title: Green Metrics Tool: Measuring for fun and profit

Green Metrics Tool: Messen für Spaß und Profit

绿色计量工具:衡量乐趣和利润 2506.23967v1

Authors (2): Geerd-Dietger Hoffmann, Verena Majuntke

The environmental impact of software is gaining increasing attention as the demand for computational resources continues to rise. In order to optimize software resource consumption and reduce carbon emissions, measuring and evaluating software is a first essential step. In this paper, we discuss what metrics are important for fact-based decision making. We introduce the Green Metrics Tool (GMT), a novel framework for accurately measuring the resource consumption of software. The tool provides a containerized, controlled, and reproducible life cycle-based approach, assessing the resource use of software during key phases. Finally, we discuss GMT features like visualization, comparability, and rule- and LLM-based optimisations, highlighting its potential to guide developers and researchers in reducing the environmental impact of their software.

随着对计算资源的需求继续增加,软件的环境影响日益受到关注。为了优化软件资源消耗和减少碳排放,衡量和评价软件是第一步。在本文件中,我们讨论了哪些衡量标准对事实基础决策很重要。我们引入了绿色计量工具(GMT),这是一个精确计量软件资源消耗的新框架。该工具提供了一种基于生命周期的集装箱化、控制性和可复制的方法,评估了软件在关键阶段的资源使用情况。最后,我们讨论了可视化、可比性和基于规则和LLM的优化等全球监测技术特征,强调它有可能指导开发者和研究人员减少其软件对环境的影响。

Article 64

Title@2025-06-30 (1): ADReFT: Adaptive Decision Repair for Safe Autonomous Driving via Reinforcement Fine-Tuning

Title: ADReFT: Adaptive Decision Repair for Safe Autonomous Driving via Reinforcement Fine-Tuning

ADReFT: Adaptive Entscheidungsreparatur für sicheres autonomes Fahren durch Verstärkung Feintuning

ADReFT: 通过强化微调实现安全自动驾驶的自适应决策修复 2506.23960v1

Authors (5): Mingfei Cheng, Xiaofei Xie, Renzhi Wang, Yuan Zhou, Ming Hu

Autonomous Driving Systems (ADSs) continue to face safety-critical risks due to the inherent limitations in their design and performance capabilities. Online repair plays a crucial role in mitigating such limitations, ensuring the runtime safety and reliability of ADSs. Existing online repair solutions enforce ADS compliance by transforming unacceptable trajectories into acceptable ones based on predefined specifications, such as rule-based constraints or training datasets. However, these approaches often lack generalizability, adaptability and tend to be overly conservative, resulting in ineffective repairs that not only fail to mitigate safety risks sufficiently but also degrade the overall driving experience. To address this issue, we propose Adaptive Decision Repair (ADReFT), a novel and effective repair method that identifies safety-critical states through offline learning from failed tests and generates appropriate mitigation actions to improve ADS safety. Specifically, ADReFT incorporates a transformer-based model with two joint heads, State Monitor and Decision Adapter, designed to capture complex driving environment interactions to evaluate state safety severity and generate adaptive repair actions. Given the absence of oracles for state safety identification, we first pretrain ADReFT using supervised learning with coarse annotations, i.e., labeling states preceding violations as positive samples and others as negative samples. It establishes ADReFT’s foundational capability to mitigate safety-critical violations, though it may result in somewhat conservative mitigation strategies. Therefore, we subsequently finetune ADReFT using reinforcement learning to improve its initial capability and generate more precise and contextually appropriate repair decisions. Our evaluation results illustrate that ADReFT achieves better repair performance.

自动驾驶系统(ADS)由于其设计和性能的内在限制,继续面临安全方面的重大风险; 在线修理在减少此类限制、确保自动驾驶系统运行的安全和可靠性方面发挥着关键作用; 现有的在线修理解决方案通过将不可接受的轨迹转换成基于事先界定的规格的可接受的轨迹,例如基于规则的限制或培训数据集,强制遵守ADS; 然而,这些办法往往缺乏一般性、适应性强,而且往往过于保守,导致修理效率低下,不仅不能充分减轻安全风险,而且会降低总体驾驶经验; 为解决这一问题,我们提议调整决定修复(ADReFT),这是一个创新和有效的修理方法,通过从失败的测试中脱机学习确定对安全至关重要的国家,并创造适当的缓解行动; 具体而言,ADReFT采用基于变压的模型,旨在捕捉复杂的驱动环境相互作用,以评价州安全严重程度和产生适应性修复行动; 鉴于缺乏国家安全识别的标志,我们先行,我们先行前调整ADREFT(AReRef),通过监督的升级能力来识别安全违约情况,然后进行反向反向分析。

Article 65

Title@2025-06-30 (1): Automated Statistical Testing and Certification of a Reliable Model-Coupling Server for Scientific Computing

Title: Automated Statistical Testing and Certification of a Reliable Model-Coupling Server for Scientific Computing

Automatisierte statistische Prüfung und Zertifizierung eines zuverlässigen Model-Coupling-Servers für Scientific Computing

科学计算系统可靠模型组合服务器的自动统计测试和认证 2505.09769v2

Authors (3): Seth Wolfgang, Lan Lin, Fengguang Song

Sequence-based specification and usage-driven statistical testing are designed for rigorous and cost-effective software development, offering a semi-formal approach to assessing the behavior of complex systems and interactions between various components. This approach is particularly valuable for scientific computing applications in which comprehensive tests are needed to prevent flawed results or conclusions. As scientific discovery becomes increasingly more complex, domain scientists couple multiple scientific computing models or simulations to solve intricate multiphysics and multiscale problems. These model-coupling applications use a hardwired coupling program or a flexible web service to link and combine different models. In this paper, we focus on the quality assurance of the more elastic web service via a combination of rigorous specification and testing methods. The application of statistical testing exposes problems ignored by pre-written unit tests and highlights areas in the code where failures might occur. We certify the model-coupling server controller with a derived reliability statistic, offering a quantitative measure to support a claim of its robustness.

以序列为基础的规格和使用驱动的统计测试是为严格和具有成本效益的软件开发设计的,为评估复杂系统的行为和各组成部分之间的互动提供了一种半正规的办法来评估复杂系统的行为和各种组成部分之间的相互作用。这种方法对于科学计算应用特别有用,因为需要在这些应用中进行全面测试,以防止有缺陷的结果或结论。随着科学发现日益复杂,域科学家将多种科学计算模型或模拟组合在一起,以解决复杂的多物理和多尺度问题。这些模型组合应用使用硬接合程序或灵活的网络服务来连接和结合不同的模型。在本文件中,我们通过严格的规格和测试方法,侧重于更弹性网络服务的质量保证。统计测试的应用揭示了预先编制单位测试所忽略的问题,并突出代码中可能发生故障的领域。我们用一个衍生的可靠性统计来验证模型组合服务器控制器,并提供定量措施来支持其稳健性。

Article 66

Title@2025-06-30 (1): Measuring Software Innovation with Open Source Software Development Data

Title: Measuring Software Innovation with Open Source Software Development Data

Software-Innovation mit Open Source Software-Entwicklungsdaten messen

利用开放源码软件开发数据衡量软件创新 2411.05087v2

Authors (7): Eva Maxfield Brown, Cailean Osborne, Peter Cihon, Moritz Böhmecke-Schwafert, Kevin Xu, Mirko Boehm, Knut Blind

Existing innovation metrics inadequately capture software innovation, creating blind spots for researchers and policymakers seeking to understand and foster technological innovation in an increasingly software-defined economy. This paper introduces a novel measure of software innovation based on open source software (OSS) development activity on GitHub. We examine the dependency growth and release complexity among 350,000 unique releases from 33,000 unique packages across the JavaScript, Python, and Ruby ecosystems over two years post-release. We find that the semantic versioning types of OSS releases exhibit ecosystem-specific and maturity-dependent patterns in predicting one-year dependency growth, with minor releases showing relatively consistent adoption across contexts while major and patch releases vary significantly by ecosystem and package size. In addition, while semantic versioning correlates with the technical complexity of the change-set, complexity itself shows minimal correlation with downstream adoption, suggesting that versioning signals rather than technical change drive dependency growth. Overall, while semantic versioning release information can be used as a unit of innovation in OSS development complementary to common sources for innovation metrics (e.g. scientific publications, patents, and standards), this measure should be weighted by ecosystem culture, package maturity, and release type to accurately capture innovation dynamics. We conclude with a discussion of the theoretical and practical implications of this novel measure of software innovation as well as future research directions.

本文介绍了基于GitHub的开放源码软件开发活动的软件创新创新新颖措施。我们审视了来自JavaScript、Python和Ruby生态系统的33 000个独特软件包的350 000个不同版本的依赖性增长和释放复杂性。我们发现,在释放后两年里,SOS释放的语义版本类型在预测一年依赖性增长方面表现出生态系统特有和成熟度的特征,在预测一年依赖性增长时,略有释放表明在不同背景中采用相对一致的排放量,而主要和补丁释放则因生态系统和包体规模的不同而差异很大。此外,虽然语义化版本与变异的技术复杂性相关,但复杂性本身与下游采纳的330 000个不同版本的版本中,表明版本信号而不是技术变化会推动依赖性增长。总体而言,语义化版本发布信息可用作开放源码软件开发的创新单位,补充创新指标的共同来源(例如科学出版物、专利和标准)的生态系统特定和成熟性发展模式,但这一措施的准确性版式版面性版本应作为生态系统创新方向的模型和新版本,从而对生态系统创新影响进行精确分析。
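The study groups releases by semantic versioning type (major, minor, patch). A minimal sketch of that classification is shown below, assuming plain `MAJOR.MINOR.PATCH` version strings; the helper names `parse_version` and `release_type` are hypothetical, pre-release and build suffixes are simply dropped, and the paper's actual data pipeline may differ.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Parse 'v1.2.3', '1.2.3-rc.1' or '1.2.3+build' into (1, 2, 3)."""
    core = version.lstrip("v").split("-")[0].split("+")[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return major, minor, patch


def release_type(previous: str, current: str) -> str:
    """Classify `current` relative to `previous` as major, minor, or patch."""
    old, new = parse_version(previous), parse_version(current)
    if new <= old:
        raise ValueError("current version must be newer than previous")
    if new[0] != old[0]:
        return "major"
    if new[1] != old[1]:
        return "minor"
    return "patch"
```

A release labelled this way could then be paired with its downstream dependency growth, which is the kind of comparison the abstract reports per ecosystem.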

Article 67

Title@2025-06-30 (1): Requirements for Active Assistance of Natural Questions in Software Architecture

Title: Requirements for Active Assistance of Natural Questions in Software Architecture

Anforderungen an die aktive Unterstützung natürlicher Fragen in der Softwarearchitektur

积极协助软件建筑中自然问题的要求 2506.23898v1

Authors (3): Diogo Lemos, Ademar Aguiar, Neil B. Harrison

Natural questions are crucial to shaping key architectural decisions and preserving architectural knowledge. They arise organically during the architectural design process, often resulting from the existing architectural experience of the designer and the distinctive characteristics of the system being designed. However, natural questions are often mismanaged or ignored, which can lead to architectural drift, knowledge loss, inefficient resource use, or poor understandability of the system’s architecture. We aim to better understand the lifecycle of natural questions, its key requirements, challenges and difficulties, and then to envision an assisted environment to properly support it. The environment should be adaptable and responsive to real-world constraints and uncertainties by seamlessly integrating knowledge management tools and artificial intelligence techniques into software development workflows. Based on existing literature, a requirements workshop, and three design iterations, we proposed a lifecycle for natural questions and elicited essential functional and non-functional requirements for such an environment. Finally, the results of a survey conducted with experts helped to analyze and validate the elicited requirements and proposed features for the environment to enhance collaboration, decision-making, and the preservation of architectural knowledge more effectively than conventional methods.

自然问题是形成关键的建筑决定和保存建筑知识的关键,在建筑设计过程中,自然问题是有机地产生的,往往是由于设计者现有的建筑经验以及所设计的系统的独特性造成的;然而,自然问题往往管理不当或被忽视,可能导致建筑漂移、知识丧失、资源使用效率低下或系统结构难以理解;我们的目标是更好地了解自然问题的生命周期、其关键要求、挑战和困难,然后设想一个适当支持自然问题的辅助环境;环境应适应现实世界的限制因素和不确定因素并作出反应,将知识管理工具和人工智能技术无缝地纳入软件开发工作流程;根据现有文献、需求讲习班和三个设计版本,我们建议自然问题的生命周期,并为这种环境提出基本的功能性和非功能性要求;最后,与专家一道进行的调查的结果有助于分析和验证所引出的要求和拟议的环境特征,以加强协作、决策,并比常规方法更有效地保存建筑知识。

Article 68

Title@2025-06-30 (1): Green AI in Action: Strategic Model Selection for Ensembles in Production

Title: Green AI in Action: Strategic Model Selection for Ensembles in Production

Grüne KI in Aktion: Strategische Modellauswahl für Ensembles in der Produktion

绿色AI在行动:生产环境中集成模型的战略性模型选择 2405.17451v2

Authors (4): Nienke Nijkamp, June Sallou, Niels van der Heijden, Luís Cruz

Integrating Artificial Intelligence (AI) into software systems has significantly enhanced their capabilities while escalating energy demands. Ensemble learning, combining predictions from multiple models to form a single prediction, intensifies this problem due to cumulative energy consumption. This paper presents a novel approach to model selection that addresses the challenge of balancing the accuracy of AI models with their energy consumption in a live AI ensemble system. We explore how reducing the number of models or improving the efficiency of model usage within an ensemble during inference can reduce energy demands without substantially sacrificing accuracy. This study introduces and evaluates two model selection strategies, Static and Dynamic, for optimizing ensemble learning systems performance while minimizing energy usage. Our results demonstrate that the Static strategy improves the F1 score beyond the baseline, reducing average energy usage from 100% for the full ensemble to 62%. The Dynamic strategy further enhances F1 scores, using on average 76% of the energy, compared to 100% for the full ensemble. Moreover, we propose an approach that balances accuracy with resource consumption, significantly reducing energy usage without substantially impacting accuracy. This method decreased the average energy usage of the Static strategy from approximately 62% to 14%, and for the Dynamic strategy, from around 76% to 57%. Our field study of Green AI using an operational AI system developed by a large professional services provider shows the practical applicability of adopting energy-conscious model selection strategies in live production environments.

将人工智能(AI)纳入软件系统极大地提高了它们的能力,同时提高了能源需求。整合学习,将多种模型的预测结合起来,形成单一的预测,强化了这一问题,因为能源消耗累积而使这一问题更加严重。本文件介绍了一种新颖的模型选择方法,以应对在实时AI联合体系统内平衡AI模型的准确性和能源消耗的挑战。我们探讨了如何减少模型数量或提高模型在推论期间组合使用效率,从而降低能源需求,同时又不大幅降低准确性。本研究报告提出并评价了两种模型选择战略,即静态和动态战略,以优化共同学习系统的业绩,同时尽量减少能源使用。我们的成果表明,静态战略将F1分的得分提高到基线以上,将平均能源使用率从全部总合体的100%降至62%。动态战略进一步提升了F1的得分,使用平均76%,而使用完全专业模型的100%。此外,我们建议采用一种平衡资源消耗的准确性方法,显著地降低能源使用率,而不会对准确性影响。这个方法表明,Stical Statical 战略的能源使用率从我们应用了57的能源选择系统,从使用率系统,从使用率第16 % 到了采用一个绿色业务战略,从一个基本能源战略,从使用率系统,从使用率系统,到采用第62采用一个基本能源战略。

Article 69

Title@2025-06-30 (1): An ontological lens on attack trees: Toward adequacy and interoperability

Title: An ontological lens on attack trees: Toward adequacy and interoperability

Eine ontologische Linse auf Angriffsbäumen: Auf dem Weg zur Angemessenheit und Interoperabilität

攻击树的本体论视角:迈向充分性和互操作性 2506.23841v1

Authors (6): Ítalo Oliveira, Stefano M. Nicoletti, Gal Engelberg, Mattia Fumagalli, Dan Klein, Giancarlo Guizzardi

Attack Trees (AT) are a popular formalism for security analysis. They are meant to display an attacker’s goal decomposed into attack steps needed to achieve it and compute certain security metrics (e.g., attack cost, probability, and damage). ATs offer three important services: (a) conceptual modeling capabilities for representing security risk management scenarios, (b) a qualitative assessment to find root causes and minimal conditions of successful attacks, and (c) quantitative analyses via security metrics computation under formal semantics, such as minimal time and cost among all attacks. Still, the AT language presents limitations due to its lack of ontological foundations, thus compromising associated services. Via an ontological analysis grounded in the Common Ontology of Value and Risk (COVER) – a reference core ontology based on the Unified Foundational Ontology (UFO) – we investigate the ontological adequacy of AT and reveal four significant shortcomings: (1) ambiguous syntactical terms that can be interpreted in various ways; (2) ontological deficit concerning crucial domain-specific concepts; (3) lacking modeling guidance to construct ATs decomposing a goal; (4) lack of semantic interoperability, resulting in ad hoc stand-alone tools. We also discuss existing incremental solutions and how our analysis paves the way for overcoming those issues through a broader approach to risk management modeling.

攻击树(AT)是安全分析的流行形式主义。它们意在展示攻击者的目标,将其分解成达到安全分析所需的攻击步骤,并计算某些安全指标(例如攻击成本、概率和损坏)。AT提供三项重要服务:(a) 概念模型能力,以代表安全风险管理设想方案;(b) 定性评估,以找出袭击成功的根源和最起码条件;(c) 通过安全指标计算,通过安全指标进行定量分析,如所有袭击的时间和费用最低。不过,AT语言由于缺乏本体基础,从而损害相关服务,因而存在局限性。 AT语言基于价值观和风险共同本体(COverseb)的肿瘤分析 – – 基于统一本体学(UFO)的参考核心理论 – – 我们调查AT的肿瘤是否充分,并揭示四个重大缺陷:(1) 可以用不同方式解释的模糊的合成术语;(2) 关键领域特定概念的本科学缺陷;(3) 在构建ATS(C) 建立更广义的互操作性工具方面缺乏建模指导;以及我们如何通过渐进式的系统分析,从而形成一个目标;(4) 我们如何超越现有风险解决办法;我们如何进行渐进式分析。
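The quantitative service mentioned in this abstract, computing a metric such as minimal attack cost, is commonly done bottom-up over AND/OR gates. The sketch below shows that standard evaluation for a tree-shaped AT with no shared subtrees; the `Node` class, the gate labels, and the example costs are illustrative assumptions made here and are not taken from the paper, whose contribution is the ontological analysis rather than this computation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Attack tree node: a leaf attack step or an AND/OR refinement."""
    name: str
    gate: str = "LEAF"          # "LEAF", "AND", or "OR"
    cost: float = 0.0           # only meaningful for leaves
    children: List["Node"] = field(default_factory=list)


def min_attack_cost(node: Node) -> float:
    """Cheapest way to achieve `node`, assuming no shared subtrees."""
    if node.gate == "LEAF":
        return node.cost
    child_costs = [min_attack_cost(child) for child in node.children]
    if node.gate == "AND":
        return sum(child_costs)   # every sub-step must succeed
    if node.gate == "OR":
        return min(child_costs)   # any single alternative suffices
    raise ValueError(f"unknown gate: {node.gate!r}")


# Hypothetical example: enter a room by picking the lock, or by
# stealing a key and bypassing the alarm.
tree = Node("enter room", "OR", children=[
    Node("pick lock", cost=50.0),
    Node("use stolen key", "AND", children=[
        Node("steal key", cost=20.0),
        Node("bypass alarm", cost=40.0),
    ]),
])
```

Here the OR gate compares 50 for picking the lock against 20 + 40 for the AND branch, so `min_attack_cost(tree)` evaluates to 50.0.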

Article 70

Title@2025-06-30 (1): Software Engineering for Large Language Models: Research Status, Challenges and the Road Ahead

Title: Software Engineering for Large Language Models: Research Status, Challenges and the Road Ahead

Software Engineering für große Sprachmodelle: Forschungsstatus, Herausforderungen und die Zukunft

大语言模型软件工程:研究现状、挑战和道路 2506.23762v1

Authors (5): Hongzhou Rao, Yanjie Zhao, Xinyi Hou, Shenao Wang, Haoyu Wang

The rapid advancement of large language models (LLMs) has redefined artificial intelligence (AI), pushing the boundaries of AI research and enabling unbounded possibilities for both academia and the industry. However, LLM development faces increasingly complex challenges throughout its lifecycle, yet no existing research systematically explores these challenges and solutions from the perspective of software engineering (SE) approaches. To fill the gap, we systematically analyze research status throughout the LLM development lifecycle, divided into six phases: requirements engineering, dataset construction, model development and enhancement, testing and evaluation, deployment and operations, and maintenance and evolution. We then conclude by identifying the key challenges for each phase and presenting potential research directions to address these challenges. In general, we provide valuable insights from an SE perspective to facilitate future advances in LLM development.

大型语言模型(LLMs)的迅速发展重新定义了人工智能,推动了AI研究的界限,并为学术界和业界提供了不受限制的可能性;然而,LLM的发展在整个生命周期面临日益复杂的挑战,但现有的研究没有从软件工程方法的角度系统地探讨这些挑战和解决办法;为填补这一空白,我们系统地分析整个LLM发展生命周期的研究状况,分为六个阶段:要求工程、数据集构建、模型开发与增强、测试与评估、部署与操作以及维护和演变;最后,我们确定每个阶段的关键挑战,提出应对这些挑战的潜在研究方向;一般而言,我们从SE的角度提供宝贵的见解,以促进LLM发展的未来进展。

Article 71

Title@2025-06-30 (1): A Survey of LLM-based Automated Program Repair: Taxonomies, Design Paradigms, and Applications

Title: A Survey of LLM-based Automated Program Repair: Taxonomies, Design Paradigms, and Applications

Eine Umfrage der LLM-basierten automatisierten Programm-Reparatur: Taxonomien, Design Paradigmen und Anwendungen

基于LLM的自动程序修复综述:分类法、设计范式与应用 2506.23749v1

Authors (8): Boyang Yang, Zijian Cai, Fengling Liu, Bach Le, Lingming Zhang, Tegawendé F. Bissyandé, Yang Liu, Haoye Tian

Large language models (LLMs) are reshaping automated program repair (APR). We categorize the recent 63 LLM-based APR systems published from January 2022 to June 2025 into four paradigms, and show how retrieval- or analysis-augmented contexts strengthen any of them. This taxonomy clarifies key trade-offs: fine-tuning delivers strong task alignment at high training cost; prompting enables rapid deployment but is limited by prompt design and context windows; procedural pipelines offer reproducible control with moderate overhead; agentic frameworks tackle multi-hunk or cross-file bugs at the price of increased latency and complexity. Persistent challenges include verifying semantic correctness beyond test suites, repairing repository-scale defects, and lowering the costs of LLMs. We outline research directions that combine lightweight human feedback, repository-aware retrieval, code analysis, and cost-aware planning to advance reliable and efficient LLM-based APR.

大型语言模型(LLMS)正在重塑自动化程序修理(APR) 我们将最近于2022年1月至2025年6月公布的63个以LLM为基础的PRA系统分为四个模式,并表明检索或分析增强的环境如何加强其中任何一种模式。这一分类澄清了关键的权衡:微调能以高培训成本带来强有力的任务协调;促进快速部署,但受到迅速设计和背景窗口的限制;程序管道提供中度间接费用的可复制控制;代理框架以更高的耐久性和复杂性的价格处理多层或跨档的错误。持续存在的挑战包括核实测试套件以外的语义正确性、修复存储库规模的缺陷和降低LMMs的费用。我们概述了将轻量人类反馈、存储感应检索、代码分析以及成本意识规划相结合的研究方向,以推进可靠和高效的LMM-PR。

Article 72

Title@2025-06-30 (1): Towards a Science of Developer eXperience (DevX)

Title: Towards a Science of Developer eXperience (DevX)

Auf dem Weg zu einer Wissenschaft des Entwicklers eXperience (DevX)

迈向开发者体验(DevX)科学 2506.23715v1

Authors (1): Benoit Combemale

As software continues to permeate nearly every facet of modern life, the complexity and ubiquity of digital services underscore the need for sustainable, effective, and inclusive software development practices. Although software engineering has made significant progress in technical challenges since its inception, the human experience of those involved in software creation, broadly defined as developers, remains underexplored. This column advocates for the formal recognition of Developer eXperience (DevX) as a distinct research field. We argue that DevX profoundly influences critical development activities and overall productivity, especially as development becomes increasingly collaborative and diverse in terms of application domains. Building on existing efforts to measure and enhance DevX, we identify key rationales, scientific enablers, and interdisciplinary intersections that support this emerging discipline. We also outline the core scientific challenges ahead, aiming to call for actions from the research community and to promote more human-centered approaches to software engineering.

由于软件继续渗透现代生活的几乎每个方面,数字服务的复杂性和普遍性突出表明了可持续、有效和包容性软件开发做法的必要性。虽然软件工程自建立以来在技术挑战方面取得了显著进展,但参与软件创建的人(广义上称为开发者)的人类经验仍未得到充分探讨。本栏主张正式承认开发者电子Xperience(DevX)是一个独特的研究领域。我们认为,DevX深刻地影响着关键的发展活动和总体生产力,特别是当发展在应用领域日益合作和多样化时。我们在现有努力衡量和加强DevX的基础上,确定了支持这一新兴学科的关键原理、科学推进器和跨学科交叉点。我们还概述了未来的核心科学挑战,目的是呼吁研究界采取行动,促进更多以人为本的软件工程方法。

Article 73

Title@2025-06-30 (1): What Challenges Do Developers Face When Using Verification-Aware Programming Languages?

Title: What Challenges Do Developers Face When Using Verification-Aware Programming Languages?

Welche Herausforderungen stellen sich Entwickler bei der Verwendung von Verifikations-Software-Programmiersprachen?

开发者在使用核查-软件编程语言时面临哪些挑战? 2506.23696v1

Authors (3): Francisco Oliveira, Alexandra Mendes, Carolina Carreira

Software reliability is critical in ensuring that the digital systems we depend on function correctly. In software development, increasing software reliability often involves testing. However, for complex and critical systems, developers can use Design by Contract (DbC) methods to define precise specifications that software components must satisfy. Verification-Aware (VA) programming languages support DbC and formal verification at compile-time or run-time, offering stronger correctness guarantees than traditional testing. However, despite the strong guarantees provided by VA languages, their adoption remains limited. In this study, we investigate the barriers to adopting VA languages by analyzing developer discussions on public forums using topic modeling techniques. We complement this analysis with a developer survey to better understand the practical challenges associated with VA languages. Our findings reveal key obstacles to adoption, including steep learning curves and usability issues. Based on these insights, we identify actionable recommendations to improve the usability and accessibility of VA languages. Our findings suggest that simplifying tool interfaces, providing better educational materials, and improving integration with everyday development environments could improve the usability and adoption of these languages. Our work provides actionable insights for improving the usability of VA languages and making verification tools more accessible.

软件的可靠性对于确保我们正确依赖数字系统至关重要。在软件开发中,提高软件的可靠性往往涉及测试。然而,对于复杂和关键的系统,开发者可以使用合同设计(DbC)方法来界定软件组件必须满足的精确规格。核查-软件(VA)程序语言支持DbC,在编译或运行时进行正式核查,提供比传统测试更强的正确性保障。然而,尽管VA语言提供了强有力的保证,但采用VA语言仍然有限。在本研究中,我们通过分析关于公共论坛的开发者讨论,利用专题模型技术,来调查采用VA语言的障碍。我们用开发者调查来补充这一分析,以更好地了解与VA语言相关的实际挑战。我们的调查结果揭示了在采用方面的主要障碍,包括粗糙的学习曲线和可用性问题。我们根据这些洞察,确定了提高VA语言的可使用性和可获取性、可操作性的建议。我们的研究结果表明,简化工具界面、提供更好的教育材料以及改进与日常发展环境的融合可以提高这些语言的可用性和采用性。我们的工作提供了可操作性的洞察力。

Article 74

Title@2025-06-30 (1): Can Large Language Models Help Students Prove Software Correctness? An Experimental Study with Dafny

Title: Can Large Language Models Help Students Prove Software Correctness? An Experimental Study with Dafny

Können große Sprachmodelle den Studierenden helfen, Software-Korrektur zu beweisen? Eine experimentelle Studie mit Dafny

大语言模型能帮助学生证明软件正确性吗? 与Dafny的实验研究 2506.22370v2

Authors (4): Carolina Carreira, Álvaro Silva, Alexandre Abreu, Alexandra Mendes

Students in computing education increasingly use large language models (LLMs) such as ChatGPT. Yet, the role of LLMs in supporting cognitively demanding tasks, like deductive program verification, remains poorly understood. This paper investigates how students interact with an LLM when solving formal verification exercises in Dafny, a language that supports functional correctness, by allowing programmers to write formal specifications and automatically verifying that the implementation satisfies the specification. We conducted a mixed-methods study with master’s students enrolled in a formal methods course. Each participant completed two verification problems, one with access to a custom ChatGPT interface that logged all interactions, and the other without. We identified strategies used by successful students and assessed the level of trust students place in LLMs. Our findings show that students perform significantly better when using ChatGPT; however, performance gains are tied to prompt quality. We conclude with practical recommendations for integrating LLMs into formal methods courses more effectively, including designing LLM-aware challenges that promote learning rather than substitution.

然而,LLMS在支持认知要求高的任务(如计算程序核查)方面的作用仍然不为人所知。本文调查了学生在解决Dafny的正式核查练习时如何与LLM互动。 Dafny是支持功能正确性的一种语言,它使程序设计员能够编写正式的规格,并自动核实执行符合规格。我们与注册参加正规方法课程的硕士学生进行了混合方法研究。每个参与者都完成了两个核查问题,一个是能够使用记录所有互动的CatGPT用户界面,另一个是没有。我们确定了成功学生使用的战略,并评估了LLMS学生的信任程度。我们的调查结果显示,学生在使用CatGPT时表现得更好;然而,成绩与及时的质量挂钩。我们最后提出了将LMS更有效地纳入正规方法课程的实用建议,包括设计LMM-aware挑战,促进学习而不是替代。

Article 75

Title@2025-06-30 (1): Threadbox: Sandboxing for Modular Security

Title: Threadbox: Sandboxing for Modular Security

Threadbox: Sandboxing für modulare Sicherheit

Threadbox: 面向模块化安全的沙箱机制 2506.23683v1

Authors (2): Maysara Alhindi, Joseph Hallett

Operating systems provide many sandboxing mechanisms to limit which resources applications can access; however, using these mechanisms sometimes requires developers to refactor their code to fit the sandboxing model. In this work, we investigate what makes existing sandboxing mechanisms challenging to apply to certain types of applications, and propose Threadbox, a sandboxing mechanism that enables modular and independent sandboxes and can be applied to threads and to specific functions. We present case studies to illustrate the applicability of the idea and discuss its limitations.

操作系统提供了许多沙箱机制,以限制各种应用软件能够获取的资源,然而,有时,这些机制的使用要求开发者重构其代码以适应沙箱模式。在这项工作中,我们调查是什么使得现有的沙箱机制难以适用于某些类型的应用,并提议Threadbox,这是一个沙箱机制,能够建立模块化和独立的沙箱,并可用于线程以及对特定函数进行沙箱隔离。我们提出案例研究,以说明该理念的适用性并讨论其局限性。

Article 76

Title@2025-06-30 (1): QLPro: Automated Code Vulnerability Discovery via LLM and Static Code Analysis Integration

Title: QLPro: Automated Code Vulnerability Discovery via LLM and Static Code Analysis Integration

QLPro: Automatisierte Code Vulnerability Discovery über LLM und Static Code Analysis Integration

QLPro:通过LLM和静态代码分析整合发现自动编码易脆弱性 2506.23644v1

Authors (8): Junze Hu, Xiangyu Jin, Yizhe Zeng, Yuling Liu, Yunpeng Li, Dan Du, Kaiyu Xie, Hongsong Zhu

We introduce QLPro, a vulnerability detection framework that systematically integrates LLMs and static analysis tools to enable comprehensive vulnerability detection across entire open-source projects. We constructed a new dataset, JavaTest, comprising 10 open-source projects from GitHub with 62 confirmed vulnerabilities. CodeQL, a state-of-the-art static analysis tool, detected only 24 of these vulnerabilities while QLPro detected 41. Furthermore, QLPro discovered 6 previously unknown vulnerabilities, 2 of which have been confirmed as 0-days.

我们引入了QLPro,这是一个漏洞检测框架,它系统地整合了LLMs和静态分析工具,以便能够在整个开放源码项目中全面检测漏洞。我们建立了一个新的数据集JavaTest,由GitHub的10个开放源码项目组成,其中包含62个已确认的漏洞。CodeQL是一个最先进的静态分析工具,仅检测到其中24个,而QLPro检测到41个。此外,QLPro发现了6个以前未知的漏洞,其中2个已被确认为0-day漏洞。

Article 77

Title@2025-06-30 (1): From Tea Leaves to System Maps: Context-awareness in Monitoring Operational Machine Learning Models

Title: From Tea Leaves to System Maps: Context-awareness in Monitoring Operational Machine Learning Models

Von Tea Leaves zu System Maps: Kontext-Bewusstsein bei der Überwachung operativer Machine Learning Modelle

从茶叶休假到系统地图:监测操作机器学习模式的背景意识 2506.10770v2

Authors (4): Joran Leest, Claudia Raibulet, Patricia Lago, Ilias Gerostathopoulos

Machine learning (ML) models in production do not fail due to statistical anomalies in their input data; they fail due to contextual misalignment – when their environment deviates from training assumptions, leading to unreliable predictions. Effective ML monitoring requires rich contextual information to move beyond detecting statistical shifts toward meaningful alerts and systematic root-cause analysis. Surprisingly, despite extensive research in ML monitoring and related areas (drift detection, data validation, out-of-distribution detection), there is no shared understanding of how to use contextual information – a striking gap, given that monitoring fundamentally involves interpreting information in context. In response, this paper presents a systematic review to characterize and structure the various types of contextual information in this domain. Our analysis examines 94 primary studies across data mining, databases, software engineering, and ML. We introduce the Contextual System–Aspect–Representation (C-SAR) framework, a conceptual model that synthesizes our findings. We also identify 20 recurring and potentially reusable patterns of specific system, aspect, and representation triplets, and map them to the monitoring activities they support. This study provides a new perspective on ML monitoring: from interpreting "tea leaves" (i.e., isolated data and performance statistics) to constructing and managing "system maps" (i.e., end-to-end views that connect data, models, and operating context). This way, we aim to enable systematic ML monitoring practices.

生产过程中的机器学习模式并非因为其投入数据中的统计异常而失败;由于环境与培训假设不同,导致不可靠的预测,因此环境不匹配,结果不可靠。有效的ML监测需要丰富的背景信息,以超越对统计变化的检测,转向有意义的警报和系统根源分析。奇怪的是,尽管在ML监测及相关领域进行了广泛的研究(遥控检测、数据验证、分配以外的检测),但对于如何使用背景信息没有共同的理解 – – 一种巨大的差距,因为监测从根本上涉及对信息进行背景解释。对此,本文件提出系统审查,以确定和构建该领域各类背景信息的特点和结构。我们的分析审查了数据挖掘、数据库、软件工程和ML的94项初级研究。我们引入了Contextual Syste-Aspect-Reformation(C-SAR)框架,一个综合我们调查结果的概念模型。我们还确定了具体系统系统、方面和代表的20种经常性和可能可重复的模式,并绘制了它们所支持的监测活动的地图。本研究报告从ML的新的视角,从ML到M-L的运行中,从分析到数据管理。

Article 78

Title@2025-06-30 (1): Comparative Analysis of the Code Generated by Popular Large Language Models (LLMs) for MISRA C++ Compliance

Title: Comparative Analysis of the Code Generated by Popular Large Language Models (LLMs) for MISRA C++ Compliance

Vergleichende Analyse des Code Generated by Popular Large Language Models (LLMs) für MISRA C++ Compliance

MISRA C++遵约情况按大众大语言模式编制的守则比较分析 2506.23535v1

Authors (1): Malik Muhammad Umer

Safety-critical systems are engineered systems whose failure or malfunction could result in catastrophic consequences. The software development for safety-critical systems necessitates rigorous engineering practices and adherence to certification standards like DO-178C for avionics. DO-178C is a guidance document which requires compliance to well-defined software coding standards like MISRA C++ to enforce coding guidelines that prevent the use of ambiguous, unsafe, or undefined constructs. Large Language Models (LLMs) have demonstrated significant capabilities in automatic code generation across a wide range of programming languages, including C++. Despite their impressive performance, code generated by LLMs in safety-critical domains must be carefully analyzed for conformance to MISRA C++ coding standards. In this paper, I have conducted a comparative analysis of the C++ code generated by popular LLMs including: OpenAI ChatGPT, Google Gemini, DeepSeek, Meta AI, and Microsoft Copilot for compliance with MISRA C++.

安全关键系统是工程化系统,其失效或故障可能导致灾难性后果。安全关键系统的软件开发需要严格的工程做法和遵守DO-178C等航空电子认证标准。DO-178C是一份指导文件,要求遵守明确界定的软件编码标准,如MISRA C++,以强制执行防止使用模糊、不安全或未定义结构的编码准则。大型语言模型(LLMs)在包括C++在内的多种编程语言的自动代码生成方面表现出巨大的能力。尽管LLMs表现令人印象深刻,但其在安全关键领域生成的代码必须仔细分析是否符合MISRA C++编码标准。在这份文件中,我对流行LLMs生成的C++代码进行了MISRA C++合规性比较分析,这些LLMs包括:OpenAI ChatGPT、Google Gemini、DeepSeek、Meta AI和Microsoft Copilot。

Article 79

Title@2025-06-30 (1): Improving vulnerability type prediction and line-level detection via adversarial training-based data augmentation and multi-task learning

Title: Improving vulnerability type prediction and line-level detection via adversarial training-based data augmentation and multi-task learning

Verbesserung der Sicherheitsvorhersage und der Linienerkennung durch trainingsbasierte Datenvergrößerung und Multi-Task-Lernen

通过对抗性培训数据扩增和多任务学习,改进脆弱性类型预测和线一级检测 2506.23534v1

Authors (6): Siyu Chen, Jiongyi Yang, Xiang Chen, Menglin Zheng, Minnan Wei, Xiaolin Ju

Context: Software vulnerabilities pose a significant threat to modern software systems, as evidenced by the growing number of reported vulnerabilities and cyberattacks. These escalating trends underscore the urgent need for effective approaches that can automatically detect and understand software vulnerabilities. Objective: However, the scarcity of labeled samples and the class imbalance issue in vulnerability datasets present significant challenges for both Vulnerability Type Prediction (VTP) and Line-level Vulnerability Detection (LVD), especially for rare yet critical vulnerability types. Moreover, most existing studies treat VTP and LVD as independent tasks, overlooking their inherent correlation, which limits the potential to leverage shared semantic patterns across tasks. Methods: To address these limitations, we propose a unified approach that integrates Embedding-Layer Driven Adversarial Training (EDAT) with Multi-task Learning (MTL). Specifically, EDAT enhances model robustness by introducing adversarial perturbations to identifier embeddings, guided by semantic importance. Meanwhile, MTL improves overall performance by leveraging shared representations and inter-task correlations between VTP and LVD. Results: Extensive experiments demonstrate that our proposed approach outperforms state-of-the-art baselines on both VTP and LVD tasks. For VTP, it yields notable improvements in accuracy, precision, recall, and F1-score, particularly in identifying rare vulnerability types. Similarly, for LVD, our approach enhances line-level detection accuracy while significantly reducing false positives. Conclusion: Our study demonstrates that combining EDAT with MTL provides a unified solution that improves performance on both tasks and warrants further investigation.

软件脆弱性对现代软件系统构成重大威胁,这表现在所报告的脆弱性和网络攻击数量不断增加。这些不断上升的趋势突出表明迫切需要采取能够自动检测和理解软件脆弱性的有效办法。目标:然而,标签样本的稀缺和脆弱性数据集中的阶级不平衡问题对脆弱性类型预测和线级脆弱性检测都提出了重大挑战,特别是稀有但关键的脆弱性类型。此外,大多数现有研究将VTP和LVD视为独立任务,忽视了它们固有的关联性,从而限制了利用共同的语义模式在各项任务之间发挥杠杆作用的潜力。方法:解决这些局限性,我们建议采用统一的方法,将Embed-Layerin Adversarial培训(EDAT)与多任务学习(MTL)相结合。具体地说,EDAT采用对标识嵌入的对抗性渗透性渗透性测试,在语义重要性的指导下,MTLL通过利用共同的表述和任务之间的关联性关系,提高总体绩效。结果:为了应对这些局限性,我们拟议的VTP的准确性调查,1 LT 和LT级的精确性研究显示我们的拟议方法比重的准确性,我们关于LT的精确性研究。

Article 80

Title@2025-06-29 (7): On the Feasibility of Deduplicating Compiler Bugs with Bisection

Title: On the Feasibility of Deduplicating Compiler Bugs with Bisection

Über die Machbarkeit von Compiler Bugs mit Bisection zu deduplizieren

应用编译器比分错误的可行性 2506.23281v1

Authors (3): Xintong Zhou, Zhenyang Xu, Chengnian Sun

Random testing has proven to be an effective technique for compiler validation. However, the debugging of bugs identified through random testing presents a significant challenge due to the frequent occurrence of duplicate test programs that expose identical compiler bugs. The process to identify duplicates is a practical research problem known as bug deduplication. Prior methodologies for compiler bug deduplication primarily rely on program analysis to extract bug-related features for duplicate identification, which can result in substantial computational overhead and limited generalizability. This paper investigates the feasibility of employing bisection, a standard debugging procedure largely overlooked in prior research on compiler bug deduplication, for this purpose. Our study demonstrates that the utilization of bisection to locate failure-inducing commits provides a valuable criterion for deduplication, albeit one that requires supplementary techniques for more accurate identification. Building on these results, we introduce BugLens, a novel deduplication method that primarily uses bisection, enhanced by the identification of bug-triggering optimizations to minimize false negatives. Empirical evaluations conducted on four real-world datasets demonstrate that BugLens significantly outperforms the state-of-the-art analysis-based methodologies Tamer and D3 by saving an average of 26.98% and 9.64% human effort to identify the same number of distinct bugs. Given the inherent simplicity and generalizability of bisection, it presents a highly practical solution for compiler bug deduplication in real-world applications.

事实证明,随机测试是编译器验证的有效技术。然而,通过随机测试查明的错误的调试是一个重大挑战,因为经常出现重复的测试程序,暴露了相同的编译器错误。识别重复的程序是一个实际的研究问题,称为“错误去重”。编译器错误去重的先前方法主要依靠程序分析,以提取与错误有关的特性进行重复识别,这可能导致大量的计算开销和有限的通用性。本文调查了使用二分法的可行性,这是一种标准调试程序,在以前关于编译器错误去重的研究中大都被忽略了。我们的研究显示,使用二分法来定位引发失败的提交,为去重提供了宝贵的标准,尽管它需要补充技术才能更准确地识别。基于这些结果,我们引入了BugLens,这是一种新颖的去重方法,主要使用二分法,并通过识别触发错误的优化来尽量减少假阴性。在四个真实世界数据集上进行的实证评估表明,BugLens显著优于最先进的基于分析的方法Tamer和D3,在识别相同数量的不同错误时平均分别节省26.98%和9.64%的人工工作量。鉴于二分法固有的简单性和通用性,它为实际应用中的编译器错误去重提供了一种非常实用的解决方案。
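As a rough illustration of the bisection criterion described in the abstract above, the following sketch groups failing test programs by the first compiler commit at which each one starts to fail. It is not BugLens itself: the function names, the toy commit list, and the `fails_at` oracle are hypothetical, and the sketch assumes that each failure persists in every commit after the one that introduced it.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence


def first_bad_commit(commits: Sequence[str], fails: Callable[[str], bool]) -> str:
    """Binary-search for the first commit at which `fails` returns True.

    Assumes commits are ordered oldest to newest, the oldest commit passes,
    the newest commit fails, and a failure persists once introduced.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[lo] passes, commits[hi] fails
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if fails(commits[mid]):
            hi = mid
        else:
            lo = mid
    return commits[hi]


def group_by_inducing_commit(
    tests: Sequence[str],
    commits: Sequence[str],
    fails_at: Callable[[str, str], bool],
) -> Dict[str, List[str]]:
    """Treat tests that bisect to the same commit as candidate duplicates."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for test in tests:
        culprit = first_bad_commit(commits, lambda c, t=test: fails_at(t, c))
        groups[culprit].append(test)
    return dict(groups)


# Toy data (hypothetical): three failing tests over eight compiler commits.
commits = [f"c{i}" for i in range(8)]
introduced_at = {"t1": 3, "t2": 6, "t3": 3}


def fails_at(test: str, commit: str) -> bool:
    # Stand-in for "build the compiler at `commit` and run `test`".
    return commits.index(commit) >= introduced_at[test]


groups = group_by_inducing_commit(["t1", "t2", "t3"], commits, fails_at)
print(groups)
```

Two tests landing in the same group are only candidate duplicates: as the abstract notes, one commit can introduce several distinct bugs, so a bisection-only criterion needs a supplementary signal for accurate identification.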

Article 81

Title@2025-06-29 (7): From Release to Adoption: Challenges in Reusing Pre-trained AI Models for Downstream Developers

Title: From Release to Adoption: Challenges in Reusing Pre-trained AI Models for Downstream Developers

Von der Veröffentlichung bis zur Annahme: Herausforderungen bei der Wiederverwendung vortrainierter KI-Modelle für Downstream-Entwickler

从释放到采用:为下游开发者重新使用经过预先培训的AI模型的挑战 2506.23234v1

Authors (5): Peerachai Banyongrakkul, Mansooreh Zahedi, Patanamon Thongtanunam, Christoph Treude, Haoyu Gao

Pre-trained models (PTMs) have gained widespread popularity and achieved remarkable success across various fields, driven by their groundbreaking performance and easy accessibility through hosting providers. However, the challenges faced by downstream developers in reusing PTMs in software systems are less explored. To bridge this knowledge gap, we qualitatively created and analyzed a dataset of 840 PTM-related issue reports from 31 OSS GitHub projects. We systematically developed a comprehensive taxonomy of PTM-related challenges that developers face in downstream projects. Our study identifies seven key categories of challenges that downstream developers face in reusing PTMs, such as model usage, model performance, and output quality. We also compared our findings with existing taxonomies. Additionally, we conducted a resolution time analysis and, based on statistical tests, found that PTM-related issues take significantly longer to be resolved than issues unrelated to PTMs, with significant variation across challenge categories. We discuss the implications of our findings for practitioners and possibilities for future research.

培训前模型(PTMs)已获得广泛欢迎,并在各个领域取得了显著成功,其驱动力是其开创性业绩和通过托管提供者容易获得。然而,下游开发商在软件系统中重新使用PTM系统时所面临的挑战没有那么深入探讨。为缩小这一知识差距,我们从质量上创建和分析了31个OSS GitHub项目中840份PTM相关问题报告的数据集。我们系统地开发了开发商在下游项目中面临的与PTM相关的挑战的综合分类。我们的研究确定了下游开发商在重新使用PTM系统时所面临的七大类挑战,例如模型使用、模型性能和产出质量。我们还将我们的调查结果与现有的分类进行了比较。此外,我们进行了分辨率时间分析,并根据统计测试发现,与PTM系统有关的问题比与PTM系统无关的问题需要更长的时间才能解决,而不同的挑战类别差异很大。我们讨论了我们的调查结果对从业人员的影响以及未来研究的可能性。

Article 82

Title@2025-06-29 (7): RPHunter: Unveiling Rug Pull Schemes in Crypto Token via Code-and-Transaction Fusion Analysis

Title: RPHunter: Unveiling Rug Pull Schemes in Crypto Token via Code-and-Transaction Fusion Analysis

RPHunter: Enthüllen von Rug Pull Schemes in Crypto Token über Code-and-Transaction Fusion Analysis

RPHunter:通过代码与交易融合分析揭示加密代币中的Rug Pull骗局 2506.18398v2

Authors (8): Hao Wu, Haijun Wang, Shangwang Li, Yin Wu, Ming Fan, Wuxia Jin, Yitao Zhao, Ting Liu

Rug pull scams have emerged as a persistent threat to cryptocurrency, causing significant financial losses. A typical scenario involves scammers deploying honeypot contracts to attract investments, restricting token sales, and draining the funds, which leaves investors with worthless tokens. Current methods either rely on predefined patterns to detect code risks or utilize statistical transaction data to train detection models. However, real-world Rug Pull schemes often involve a complex interplay between malicious code and suspicious transaction behaviors. These methods, which solely focus on one aspect, fall short in detecting such schemes effectively. In this paper, we propose RPHunter, a novel technique that integrates code and transaction for Rug Pull detection. First, RPHunter establishes declarative rules and performs flow analysis to extract code risk information, further constructing a semantic risk code graph (SRCG). Meanwhile, to leverage transaction information, RPHunter formulates dynamic token transaction activities as a token flow behavior graph (TFBG) in which nodes and edges are characterized from network structure and market manipulation perspectives. Finally, RPHunter employs graph neural networks to extract complementary features from SRCG and TFBG, integrating them through an attention fusion model to enhance the detection of Rug Pull. We manually analyzed 645 Rug Pull incidents from code and transaction aspects and constructed a ground-truth dataset. We evaluated RPHunter on our dataset, achieving a precision of 95.3%, a recall of 93.8% and an F1 score of 94.5%, which highlights superior performance compared to existing methods. Furthermore, when applied to the real-world scenarios, RPHunter has identified 4801 Rug Pull tokens, achieving a precision of 90.7%.

鲁棒骗局已成为对隐蔽货币的持续威胁,造成了巨大的金融损失。典型的情景是,诈骗者利用蜜罐合同来吸引投资,限制象征性销售,耗尽资金,使投资者留下没有价值的象征。当前的方法要么依靠预先定义的模式来检测代码风险,要么利用统计交易数据来培训检测模型。然而,真实世界的鲁棒骗骗局往往涉及恶意代码和可疑交易行为之间的复杂互动。这些方法仅侧重于一个方面,在有效发现此类计划方面落后于一个方面。在本文中,我们提议了RPHunter(RPHunter),这是将代码与交易结合起来以吸引投资,限制象征性销售,限制象征性销售,同时,RPHunter(RPH)建立宣言规则并进行流程分析,进一步构建一个在线交易的准确性功能,同时将 RBGG(RG) 的准确性数据整合起来,同时将我们目前运行的节点和节点的节点与节点的节点定位,在SRCG(RG) 的精确度上应用图形网络, 将当前运行的精确度提高到一个标准。
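The three headline metrics in the RPHunter abstract above can be cross-checked against each other, since the F1 score is the harmonic mean of precision and recall. The short sketch below uses only the figures quoted in the abstract; the function name is illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures as reported in the abstract, in percent.
reported_precision = 95.3
reported_recall = 93.8

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 from reported precision and recall: {f1:.1f}%")
```

Rounded to one decimal place, the result agrees with the 94.5% F1 score stated in the abstract.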

Article 83

Title@2025-06-29 (7): A Comprehensive Study on Large Language Models for Mutation Testing

Title: A Comprehensive Study on Large Language Models for Mutation Testing

Eine umfassende Studie zu großen Sprachmodellen für Mutationsprüfungen

关于变异测试大语言模式的综合研究 2406.09843v3

Authors (6): Bo Wang, Mingda Chen, Youfang Lin, Mark Harman, Mike Papadakis, Jie M. Zhang

Large Language Models (LLMs) have recently been used to generate mutants in both research work and in industrial practice. However, there has been no comprehensive empirical study of their performance for this increasingly important LLM-based Software Engineering application. To address this, we report the results of a comprehensive empirical study over six different LLMs, including both state-of-the-art open- and closed-source models, on 851 real bugs drawn from two different Java real-world bug benchmarks. Our results reveal that, compared to existing rule-based approaches, LLMs generate more diverse mutants, that are behaviorally closer to real bugs and, most importantly, with 90.1% higher fault detection. That is, 79.1% (for LLMs) vs. 41.6% (for rule-based); an increase of 37.5 percentage points. Nevertheless, our results also reveal that these impressive results for improved effectiveness come at a cost: the LLM-generated mutants have worse non-compilability, duplication, and equivalent mutant rates by 36.1, 13.1, and 4.2 percentage points, respectively. These findings are immediately actionable for both research and practice. They allow practitioners to have greater confidence in deploying LLM-based mutation, while researchers now have a baseline for the state-of-the-art, with which they can research techniques to further improve effectiveness and reduce cost.

最近,大型语言模型(LLMS)被用于在研究工作和工业实践中产生变异体。然而,对于这一越来越重要的以LLM为基础的软件工程应用,还没有对其性能进行全面的经验性研究。为此,我们报告了六种不同LMS的全面经验研究的结果,包括最先进的开放和封闭源模型,从两个不同的爪哇现实世界错误基准中得出的851个真正的虫子。我们的结果表明,与现有的基于规则的方法相比,LMS产生更多样化的变异体,这些变异体在行为上更接近真正的虫子,最重要的是,有90.1%的错漏检测率更高。也就是说,79.1%(LMS)对41.6%(基于规则);增加了37.5个百分点。然而,我们的结果还表明,提高效力的这些令人印象深刻的结果成本很高:LMM生成的变异种变体的不兼容性、重复性和等同的变异体比率分别比36.1、13.1和4.2个百分点。这些结果可以立即采取行动,既可以用于研究,也可以立即采取行动,也可以发现90.1%(LMS)对41.6%(LM)的错觉检查结果进行检查,使研究人员更有信心,同时能够提高研究的基线,同时降低研究技术。他们对LM.
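The abstract above quotes both a relative improvement (90.1%) and an absolute one (37.5 percentage points) for fault detection. The sketch below shows how the two figures relate, using only the rates given in the abstract; the variable names are illustrative.

```python
# Fault detection rates as reported in the abstract, in percent.
llm_rate = 79.1
rule_based_rate = 41.6

# Absolute difference, expressed in percentage points.
point_gain = llm_rate - rule_based_rate

# Relative improvement over the rule-based baseline, in percent.
relative_gain = point_gain / rule_based_rate * 100

print(f"absolute gain: {point_gain:.1f} percentage points")
print(f"relative gain: {relative_gain:.1f}%")
```

The same distinction applies to the cost figures: the 36.1, 13.1, and 4.2 values for non-compilability, duplication, and equivalent mutants are stated in percentage points, not as relative changes.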

Article 84

Title@2025-06-29 (7): Repair Ingredients Are All You Need: Improving Large Language Model-Based Program Repair via Repair Ingredients Search

Title: Repair Ingredients Are All You Need: Improving Large Language Model-Based Program Repair via Repair Ingredients Search

Reparatur Zutaten sind alles, was Sie brauchen: Verbesserung der großen Sprache Modellbasierte Programm Reparatur über Reparatur Zutaten Suche

维修成份:通过修理成份搜索改进大语言示范方案 2506.23100v1

Authors (5): Jiayi Zhang, Kai Huang, Jian Zhang, Yang Liu, Chunyang Chen

Automated Program Repair (APR) techniques aim to automatically fix buggy programs. Among these, Large Language Model-based (LLM-based) approaches have shown great promise. Recent advances demonstrate that directly leveraging LLMs can achieve leading results. However, these techniques remain suboptimal in generating contextually relevant and accurate patches, as they often overlook repair ingredients crucial for practical program repair. In this paper, we propose ReinFix, a novel framework that enables LLMs to autonomously search for repair ingredients throughout both the reasoning and solution phases of bug fixing. In the reasoning phase, ReinFix integrates static analysis tools to retrieve internal ingredients, such as variable definitions, to assist the LLM in root cause analysis when it encounters difficulty understanding the context. During the solution phase, when the LLM lacks experience in fixing specific bugs, ReinFix searches for external ingredients from historical bug fixes with similar bug patterns, leveraging both the buggy code and its root cause to guide the LLM in identifying appropriate repair actions, thereby increasing the likelihood of generating correct patches. Evaluations on two popular benchmarks (Defects4J V1.2 and V2.0) demonstrate the effectiveness of our approach over SOTA baselines. Notably, ReinFix fixes 146 bugs, which is 32 more than the baselines on Defects4J V1.2. On Defects4J V2.0, ReinFix fixes 38 more bugs than the SOTA. Importantly, when evaluating on the recent benchmarks that are free of data leakage risk, ReinFix also maintains the best performance.

自动程序修理技术(APR)旨在自动修正错误程序。在这些方法中,基于大语言模型(LLM)的大型语言模型(LLM)方法显示了巨大的希望。最近的进展表明直接利用LLMs能够取得领先结果。然而,这些技术在产生符合背景和准确的补丁方面仍然不够理想,因为它们往往忽视了对实际程序修理至关重要的修理成分。在本文件中,我们提议了ReinFix,这是一个新的框架,使LMs能够在纠正错误的推理和解决方案阶段自主地搜索修理成分。在推理阶段,ReinFix综合了静态分析工具,以检索内部成份,例如变式定义,以协助LLM在根本原因分析时能够取得领先结果。在解决方案阶段,当LMM缺乏解决特定错误的经验时,这些技术仍然不够完美。 ReinFix搜索了具有类似错误模式的历史错误修正的外部成份,利用错误代码及其根源来指导LM确定适当的修理行动,从而增加产生正确补补丁的可能性。在两个流行基准上的评价(Difus V1.2和V0.F),在Sinalal Relix 4 上也比Sinalta bref biltal) 更具有效力。

Article 85

Title@2025-06-29 (7): HF-DGF: Hybrid Feedback Guided Directed Grey-box Fuzzing

Title: HF-DGF: Hybrid Feedback Guided Directed Grey-box Fuzzing

HF-DGF: Hybrid-Feedback-geführtes Grey-Box-Fuzzing

HF-DGF:混合反馈引导的定向灰盒模糊测试 2506.23063v1

Authors (4): Guangfa Lyu, Zhenzhong Cao, Xiaofei Ren, Fengyu Wang

Directed Grey-box Fuzzing (DGF) has emerged as a widely adopted technique for crash reproduction and patch testing, leveraging its capability to precisely navigate toward target locations and exploit vulnerabilities. However, current DGF tools are constrained by insufficient runtime feedback, limiting their efficiency in reaching targets and exploring state spaces. This study presents HF-DGF, a novel directed grey-box fuzzing framework. Its seed scheduling is guided by a hybrid feedback mechanism integrating control-flow distance, value-flow influence score, and slice coverage. To enable precise control-flow distance feedback, we propose a backward-stepping algorithm to calculate basic block-level seed distances on a virtual inter-procedural control-flow graph (ICFG). For effective state space exploration, we introduce value-flow influence and a corresponding metric, the value-flow influence score. Additionally, to mitigate runtime overhead from hybrid feedback, we adopt a novel selective instrumentation strategy. Evaluations on 41 real-world vulnerabilities show HF-DGF outperforms existing tools: it achieves crash reproduction 5.05 times faster than AFL, 5.79 times faster than AFLGo, 73.75 times faster than WindRanger, 2.56 times faster than DAFL, and 8.45 times faster than Beacon on average. Notably, when all fuzzers triggered crashes, HF-DGF exhibited the lowest code coverage, demonstrating superior directionality and efficiency. It also surpasses AFLGo, WindRanger, DAFL, and Beacon in static analysis efficiency.

直接的灰色框 Fuzzing (DGF) 已成为一种广泛采用的碰撞复制和补丁测试技术,利用它的能力精确地向目标地点导航和利用脆弱性。然而,目前的DGF工具受到运行时间反馈不足的限制,限制了它们达到目标和探索州空间的效率。本研究报告介绍了高频-DGF,这是一个全新的、直接的灰色框模糊框架。它的种子时间安排以混合反馈机制为指导,其中包括控制流距离、价值流影响评分和切片覆盖范围。为了能够准确获得控制流距离反馈,我们建议采用后步算法,以虚拟的跨进程控制流图(ICFG)计算基本区级种子距离。为了进行有效的州际空间探索,我们引入了价值流影响和相应的衡量标准。此外,为了减少运行时间的运行时间,我们采用了新的选择性仪表战略。对41个真实世界脆弱性的评估显示,高频-DGFF超过现有工具:在AFLF、5.79倍的速度比ALGO、73-75倍的基本级种子种子迁移距离计算速度更快, 也比A-FLFLFA-GO 更快, 更快速地展示更快。
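To make the notion of control-flow distance feedback in the abstract above more concrete, here is a minimal sketch of one common way to compute block-level distances: a breadth-first search that walks backward from the target blocks over a control-flow graph. This is not the HF-DGF algorithm, whose details the abstract does not give; the graph encoding, the function names, and the averaging rule for a seed's distance are all assumptions made for illustration.

```python
from collections import deque
from typing import Dict, Iterable, List


def backward_distances(cfg: Dict[str, List[str]], targets: Iterable[str]) -> Dict[str, int]:
    """Shortest number of edges from each block to any target block.

    `cfg` maps a basic block to its successors. Blocks that cannot reach
    a target are simply absent from the result.
    """
    # Build the predecessor map so the search can step backward.
    preds: Dict[str, List[str]] = {}
    for src, succs in cfg.items():
        preds.setdefault(src, [])
        for dst in succs:
            preds.setdefault(dst, []).append(src)

    dist = {t: 0 for t in targets}
    queue = deque(dist)
    while queue:
        block = queue.popleft()
        for pred in preds.get(block, []):
            if pred not in dist:
                dist[pred] = dist[block] + 1
                queue.append(pred)
    return dist


def seed_distance(trace: Iterable[str], dist: Dict[str, int]) -> float:
    """Average distance over the executed blocks that can reach a target.

    The plain average is an illustrative choice, not the paper's metric.
    """
    reachable = [dist[b] for b in trace if b in dist]
    return sum(reachable) / len(reachable) if reachable else float("inf")


# Toy control-flow graph (hypothetical).
cfg = {
    "entry": ["a", "b"],
    "a": ["target"],
    "b": ["c", "exit"],
    "c": ["target"],
    "target": [],
    "exit": [],
}
dist = backward_distances(cfg, ["target"])
near = seed_distance(["entry", "a", "target"], dist)
far = seed_distance(["entry", "b", "exit"], dist)
print(dist, near, far)
```

A scheduler could then prioritise seeds with a smaller distance; combining this with the value-flow influence score and slice coverage mentioned in the abstract is beyond what this sketch covers.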

Article 86

Title@2025-06-28 (6): Guiding AI to Fix Its Own Flaws: An Empirical Study on LLM-Driven Secure Code Generation

Title: Guiding AI to Fix Its Own Flaws: An Empirical Study on LLM-Driven Secure Code Generation

Leitende KI, um eigene Fehler zu beheben: Eine empirische Studie zur LLM-getriebenen sicheren Codegenerierung

指导大赦国际修补其自成自有的法条:关于LLM-Driven安全法规生成的经验研究 2506.23034v1

Authors (4): Hao Yan, Swapneel Suhas Vaidya, Xiaokuan Zhang, Ziyu Yao

Large Language Models (LLMs) have become powerful tools for automated code generation. However, these models often overlook critical security practices, which can result in the generation of insecure code that contains vulnerabilities, that is, weaknesses or flaws in the code that attackers can exploit to compromise a system. Despite this, there has been limited exploration of strategies to guide LLMs in generating secure code and a lack of in-depth analysis of the effectiveness of LLMs in repairing code containing vulnerabilities. In this paper, we present a comprehensive evaluation of state-of-the-art LLMs by examining their inherent tendencies to produce insecure code, their capability to generate secure code when guided by self-generated vulnerability hints, and their effectiveness in repairing vulnerabilities when provided with different levels of feedback. Our study covers both proprietary and open-weight models across various scales and leverages established benchmarks to assess a wide range of vulnerability types. Through quantitative and qualitative analyses, we reveal that although LLMs are prone to generating insecure code, advanced models can benefit from vulnerability hints and fine-grained feedback to avoid or fix vulnerabilities. We also provide actionable suggestions to developers to reduce vulnerabilities when using LLMs for code generation.

大型语言模型(LLMS)已成为自动生成代码的有力工具,然而,这些模型往往忽略了重要的安全做法,这可能导致产生不安全的代码,其中含有攻击者可以用来损害一个系统的编码中的薄弱环节或缺陷;然而,对指导LLMS生成安全代码的战略的探索有限,而且缺乏对LLMS在修复含有脆弱性的代码方面效力的深入分析;在本文件中,我们通过检查其生成不安全代码的内在倾向、在自我生成的脆弱性提示指导下生成安全代码的能力,以及在提供不同程度的反馈时修复脆弱性的效力,对LLMS进行全面评估。我们的研究涵盖各种规模的专有和开放加权模式,并确定了评估多种类型脆弱性的基准。通过定量和定性分析,我们发现虽然LLMS容易生成不安全代码,但先进的模型可以受益于脆弱性提示和精细的反馈,以避免或纠正脆弱性。我们还向开发商提供了行动建议,以便在使用LMS生成代码时减少脆弱性。

Article 87

Title@2025-06-28 (6): Generating Privacy Stories From Software Documentation

Title: Generating Privacy Stories From Software Documentation

Generieren von Daten aus der Software-Dokumentation

从软件文件生成隐私故事 2506.23014v1

Authors (6): Wilder Baldwin, Shashank Chintakuntla, Shreyah Parajuli, Ali Pourghasemi, Ryan Shanz, Sepideh Ghanavati

Research shows that analysts and developers consider privacy as a security concept or as an afterthought, which may lead to non-compliance and violation of users’ privacy. Most current approaches, however, focus on extracting legal requirements from the regulations and evaluating the compliance of software and processes with them. In this paper, we develop a novel approach based on chain-of-thought prompting (CoT), in-context-learning (ICL), and Large Language Models (LLMs) to extract privacy behaviors from various software documents prior to and during software development, and then generate privacy requirements in the format of user stories. Our results show that most commonly used LLMs, such as GPT-4o and Llama 3, can identify privacy behaviors and generate privacy user stories with F1 scores exceeding 0.8. We also show that the performance of these models could be improved through parameter-tuning. Our findings provide insight into using and optimizing LLMs for generating privacy requirements given software documents created prior to or throughout the software development lifecycle.

研究显示,分析家和开发商将隐私视为安全概念或事后考虑,可能导致不遵守规定和侵犯用户隐私,但大多数目前的做法侧重于从法规中提取法律要求,并评估软件和流程的合规性。在本文件中,我们根据思维链(CoT)、文本学习(ICL)和大语言模型(LLMS),开发软件之前和开发过程中的各种软件文件中的隐私行为,并随后生成用户故事格式的隐私要求。我们的结果显示,最常用的LLMS,如GPT-4o和Llama 3,可以识别隐私行为,生成超过0.8分的F1分的隐私用户故事。我们还表明,这些模型的性能可以通过参数调整得到改进。我们的调查结果使人们深入了解如何使用和优化LLMS,以生成在软件开发生命周期之前或整个周期生成的软件文件的隐私要求。

Article 88

Title@2025-06-28 (6): Evaluating and Improving Large Language Models for Competitive Program Generation

Title: Evaluating and Improving Large Language Models for Competitive Program Generation

Bewertung und Verbesserung großer Sprachmodelle für die wettbewerbsfähige Programmgenerierung

评价和改进竞争性方案制定大语言模式 2506.22954v1

Authors (8): Minnan Wei, Ziming Li, Xiang Chen, Menglin Zheng, Ziyan Qu, Cheng Yu, Siyu Chen, Xiaolin Ju

Context: Due to the demand for strong algorithmic reasoning, complex logic implementation, and strict adherence to input/output formats and resource constraints, competitive programming generation by large language models (LLMs) is considered the most challenging problem in current LLM-based code generation. However, previous studies often evaluate LLMs using simple prompts and benchmark datasets prone to data leakage. Moreover, prior work has limited consideration of the diversity in algorithm types and difficulty levels. Objective: In this study, we aim to evaluate and improve LLMs in solving real-world competitive programming problems. Methods: We initially collect 117 problems from nine regional ICPC/CCPC contests held in 2024 and design four filtering criteria to construct a curated benchmark consisting of 80 problems. Leveraging DeepSeek-R1 as the LLM, we evaluate its competitive program generation capabilities through the online judge (OJ) platforms, guided by a carefully designed basic prompt. For incorrect submissions, we construct a fine-grained error taxonomy and then propose a targeted improvement framework by combining a multi-turn dialogue-based repair phase and an information-augmented regeneration phase. Results: Experimental results show that only 5 out of 80 problems are fully accepted when using basic prompts. For the unsolved problems, we construct the error taxonomy, including general errors (such as design, boundary, condition, data type, syntax, and input/output errors) and specialized errors (such as those in mathematical problems, greedy algorithms, and graph theories). After applying our proposed improvement strategies, we substantially increased the number of correct solutions, with 46 out of 80 problems successfully accepted.

由于需要强有力的算法推理、复杂的逻辑实施以及严格遵守投入/产出格式和资源限制,大型语言模型(LLMs)的竞争性编程被认为是当前以LLM为基础的代码生成中最具挑战性的问题;然而,以往的研究往往使用简单的提示和容易数据泄漏的基准数据集来评价LLMs;此外,先前的工作限制了对算法类型和困难程度多样性的考虑。目标:在本研究中,我们的目标是在解决真实世界竞争性的数学编程问题方面评价和改进LLMS。方法:我们最初从2024年举行的9次区域ICPC/CPPC竞赛中收集了117个问题,并设计了4个成功筛选标准,以构建由80个问题组成的标定基准。将DeepSeek-R1作为LM,我们通过在线法官平台评估其竞争性编程能力,并以精心设计的基本提示的提示为基础。关于提交资料的不正确度,我们建立一个精细的错误分类,然后提出有目标的改进框架,方法是将基于多种对话的改进阶段和信息更新的改进方法进行大量收集,在80个阶段中,结果显示我们提出的精确的精确性设计,在80个总的逻辑上的问题。

Article 89

Title@2025-06-28 (6): A Quantum Annealing Approach for Solving Optimal Feature Selection and Next Release Problems

Title: A Quantum Annealing Approach for Solving Optimal Feature Selection and Next Release Problems

Ein Quantum-Analing-Ansatz zur Lösung optimaler Feature-Auswahl und nächster Release-Probleme

解决最佳地物选择和下一期释放问题的一个量化的安妮化方法 2506.14129v2

Authors (5): Shuchang Wang, Xiaopeng Qiu, Yingxing Xue, Yanfu Li, Wei Yang

Search-based software engineering (SBSE) addresses critical optimization challenges in software engineering, including the next release problem (NRP) and feature selection problem (FSP). While traditional heuristic approaches and integer linear programming (ILP) methods have demonstrated efficacy for small to medium-scale problems, their scalability to large-scale instances remains unknown. Here, we introduce quantum annealing (QA) as a subroutine for tackling multi-objective SBSE problems, leveraging the computational potential of quantum systems. We propose two QA-based algorithms tailored to different problem scales. For small-scale problems, we reformulate multi-objective optimization (MOO) as single-objective optimization (SOO) using penalty-based mappings for quantum processing. For large-scale problems, we employ a decomposition strategy guided by maximum energy impact (MEI), integrating QA with a steepest descent method to enhance local search efficiency. Applied to NRP and FSP, our approaches are benchmarked against the heuristic NSGA-II and the ILP-based $\epsilon$-constraint method. Experimental results reveal that while our methods produce fewer non-dominated solutions than $\epsilon$-constraint, they achieve significant reductions in execution time. Moreover, compared to NSGA-II, our methods deliver more non-dominated solutions with superior computational efficiency. These findings underscore the potential of QA in advancing scalable and efficient solutions for SBSE challenges.

以搜索为基础的软件工程(SBSE)应对软件工程中的关键优化挑战,包括下一个发布问题(NRP)和特征选择问题(FSP)。虽然传统的启发式方法和整数线性规划(ILP)方法显示了对中小型问题的效果,但是它们对大规模实例的可扩展性仍然不得而知。在这里,我们采用量子退火(QA)作为解决多目标SBSE问题的子程序,利用量子系统的计算潜力。我们提出了针对不同问题规模的两种基于QA的算法。对于小规模问题,我们使用基于惩罚的映射将多目标优化(MOO)重新表述为单目标优化(SOO),以便进行量子处理。对于大规模问题,我们采用了一种以最大能量影响(MEI)为指导的分解策略,将QA与最速下降法结合起来,以提高局部搜索效率。应用于NRP和FSP时,我们的方法以启发式的NSGA-II和基于ILP的$\epsilon$-约束方法为基准进行比较。实验结果表明,虽然我们的方法产生的非支配解少于$\epsilon$-约束方法,但执行时间显著缩短。此外,与NSGA-II相比,我们的方法能以更高的计算效率给出更多非支配解。这些发现凸显了QA在为SBSE挑战提供可扩展且高效的解决方案方面的潜力。
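The penalty-based reformulation mentioned in the abstract above can be sketched on a toy next release problem. The example below is an assumption-laden illustration, not the authors' formulation: it scalarizes two objectives (maximize value, minimize cost) with a hypothetical weight, encodes a "requirement i needs requirement j" dependency as a quadratic penalty term of the kind a QUBO accepts, and uses exhaustive search as a stand-in for the annealer.

```python
from itertools import product
from typing import Callable, List, Sequence, Tuple

Bits = Tuple[int, ...]


def energy(
    x: Bits,
    values: Sequence[float],
    costs: Sequence[float],
    deps: Sequence[Tuple[int, int]],
    weight: float,
    penalty: float,
) -> float:
    """Single-objective energy for a binary selection vector x.

    Lower is better. `weight` trades value against cost; each dependency
    (i, j) adds `penalty` whenever i is selected without j.
    """
    value = sum(v * xi for v, xi in zip(values, x))
    cost = sum(c * xi for c, xi in zip(costs, x))
    violations = sum(x[i] * (1 - x[j]) for i, j in deps)
    return -weight * value + (1 - weight) * cost + penalty * violations


def brute_force_minimum(n: int, energy_fn: Callable[[Bits], float]) -> Bits:
    """Exhaustive search over all 2**n selections (annealer stand-in)."""
    return min(product((0, 1), repeat=n), key=energy_fn)


# Toy instance (hypothetical): requirement 0 depends on requirement 1.
values: List[float] = [10, 2, 4]
costs: List[float] = [5, 4, 1]
deps = [(0, 1)]

unconstrained = brute_force_minimum(
    3, lambda x: energy(x, values, costs, deps, weight=0.5, penalty=0.0)
)
constrained = brute_force_minimum(
    3, lambda x: energy(x, values, costs, deps, weight=0.5, penalty=20.0)
)
print(unconstrained, constrained)
```

With the penalty switched off, the minimizer drops the unprofitable requirement 1 and so violates the dependency; with a sufficiently large penalty, the minimum-energy selection includes it. Sweeping `weight` would yield different trade-off points, which is one simple way to approximate a set of non-dominated solutions.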

Article 90

Title@2025-06-28 (6): Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation

Title: Smaller = Weaker? Benchmarking Robustness of Quantized LLMs in Code Generation

Kleiner = schwächer? Benchmarking der Robustheit quantisierter LLMs bei der Codegenerierung

更小 = 更弱?代码生成中量化LLM的稳健性基准测试 2506.22776v1

Authors (4): Sen Fang, Weiyuan Ding, Antonio Mastropaolo, Bowen Xu

Quantization has emerged as a mainstream method for compressing Large Language Models (LLMs), reducing memory requirements and accelerating inference without architectural modifications. While existing research primarily focuses on evaluating the effectiveness of quantized LLMs compared to their original counterparts, the impact on robustness remains largely unexplored. In this paper, we present the first systematic investigation of how quantization affects the robustness of LLMs in code generation tasks. Through extensive experiments across four prominent LLM families (LLaMA, DeepSeek, CodeGen, and StarCoder) with parameter scales ranging from 350M to 33B, we evaluate robustness from dual perspectives: adversarial attacks on input prompts and noise perturbations on model architecture. Our findings challenge conventional wisdom by demonstrating that quantized LLMs often exhibit superior robustness compared to their full-precision counterparts, with 51.59% versus 42.86% of our adversarial experiments showing better resilience in quantized LLMs. Similarly, our noise perturbation experiments also confirm that LLMs after quantization generally withstand higher levels of weight disturbances. These results suggest that quantization not only reduces computational requirements but can actually enhance LLMs’ reliability in code generation tasks, providing valuable insights for developing more robust and efficient LLM deployment strategies.

量化已成为压缩大语言模型(LLMS)的主流方法,减少了记忆要求,加快了不进行建筑修改的推论。虽然现有研究主要侧重于评价量化的LMS相对于原始对等公司的效力,但对稳健性的影响基本上尚未探索。在本文件中,我们首次系统地调查了量化如何影响高语言模型在代码生成任务中的稳健性。通过在四个著名的LLM家庭(LLAMA、DeepSeek、CocGen和StarCoder)进行从350M到33B等参数尺度的广泛实验,我们从两个角度评估了稳健性:对投入提示的对抗性攻击和对模型结构的噪音。我们的调查结果挑战了常规智慧,表明量化的LMS往往比其完全精准的对等企业表现出超强的稳健性。在量化的LMS中,51.59%比42.86%的对抗性实验显示出更强的弹性。同样,我们的噪音渗透性实验还证实,在量化后,LMSMS通常能够承受更高程度的重量扰动。这些结果表明,为更可靠的LMS的计算任务提供更可靠的配置。

Article 91

Title@2025-06-28 (6): Privacy-Preserving Methods for Bug Severity Prediction

Title: Privacy-Preserving Methods for Bug Severity Prediction

Datenschutz-Erhaltung Methoden für Bug Severity Prediction

维护隐私的错误严重性预测方法 2506.22752v1

Authors (3): Havvanur Dervişoğlu, Ruşen Halepmollası, Elif Eyvaz

Bug severity prediction is a critical task in software engineering as it enables more efficient resource allocation and prioritization in software maintenance. While AI-based analyses and models require access to extensive datasets, industrial applications face challenges due to data-sharing constraints and the limited availability of labeled data. In this study, we investigate method-level bug severity prediction using source code metrics and Large Language Models (LLMs) with two widely used datasets. We compare the performance of models trained using centralized learning, federated learning, and synthetic data generation. Our experimental results, obtained using two widely recognized software defect datasets, indicate that models trained with federated learning and synthetic data achieve comparable results to centrally trained models without data sharing. Our findings highlight the potential of privacy-preserving approaches such as federated learning and synthetic data generation to enable effective bug severity prediction in industrial contexts where data sharing is a major challenge. The source code and dataset are available at our GitHub repository: https://github.com/drvshavva/EASE2025-Privacy-Preserving-Methods-for-Bug-Severity-Prediction.

在软件工程中,缺陷严重程度预测是一项关键任务,因为它有助于更有效地分配资源和确定软件维护的优先次序。尽管基于AI的分析和模型需要获取大量数据集,但工业应用由于数据共享的限制和标注数据有限而面临挑战。在本研究中,我们使用源代码指标和大语言模型(LLMs),在两个广泛使用的数据集上研究方法层面的缺陷严重程度预测。我们比较了利用集中学习、联邦学习和合成数据生成等方法训练的模型的性能。我们利用两个得到广泛承认的软件缺陷数据集获得的实验结果表明,经过联邦学习和合成数据训练的模型在不分享数据的情况下取得了与集中训练的模型相当的结果。我们的发现突出表明,在数据共享是一个重大挑战的工业环境中,联邦学习和合成数据生成等隐私保护方法具有实现有效缺陷严重程度预测的潜力。源代码和数据集可在我们的GitHub储存库获取:https://github.com/drvshavva/EASE2025-Privacy-Preserving-Methods-for-Bug-Severity-Prediction

Article 92

Title@2025-06-28 (6): Understanding the Challenges and Promises of Developing Generative AI Apps: An Empirical Study

Title: Understanding the Challenges and Promises of Developing Generative AI Apps: An Empirical Study

Die Herausforderungen und Versprechen der Entwicklung generativer KI-Apps verstehen: Eine empirische Studie

了解 “ 开发创新的AI Apps:经验研究 “ 的挑战和前景 2506.16453v2

Authors (3): Buthayna AlMulla, Maram Assi, Safwat Hassan

The release of ChatGPT in 2022 triggered a rapid surge in generative artificial intelligence mobile apps (i.e., Gen-AI apps). Despite widespread adoption, little is known about how end users perceive and evaluate these Gen-AI functionalities in practice. In this work, we conduct a user-centered analysis of 676,066 reviews from 173 Gen-AI apps on the Google Play Store. We introduce a four-phase methodology, SARA (Selection, Acquisition, Refinement, and Analysis), that enables the systematic extraction of user insights using prompt-based LLM techniques. First, we demonstrate the reliability of LLMs in topic extraction, achieving 91% accuracy through five-shot prompting and non-informative review filtering. Then, we apply this method to the informative reviews, identify the top 10 user-discussed topics (e.g., AI Performance, Content Quality, and Content Policy & Censorship) and analyze the key challenges and emerging opportunities. Finally, we examine how these topics evolve over time, offering insight into shifting user expectations and engagement patterns with Gen-AI apps. Based on our findings and observations, we present actionable implications for developers and researchers.

2022年公布ChattGPT后,基因化人工智能移动应用软件(即Gen-AI Apps)迅速激增。尽管广泛采用,但对于终端用户如何看待和评价Gen-AI的功能却知之甚少。在这项工作中,我们对Google Play Store上的173 Gen-AI应用软件进行了676 066次以用户为中心的分析,对Google Play Store上的173 Gen-AI应用软件进行了676 066次审查。我们采用了四阶段方法SARA(选择、获取、精炼和分析),以便利用基于迅速的LLM技术系统提取用户的洞见。首先,我们展示了专题提取中的LLMs的可靠性,通过五发即时即时和非信息化的审查过滤实现91%的准确性。然后,我们将这一方法应用于信息化的审查,确定十大用户讨论的议题(例如AI性能、内容质量和内容政策与检查),并分析主要挑战和新出现的机会。我们研究了这些专题如何随着时间而演变,我们深入了解用户对Gen-AI应用软件的预期和参与模式。我们根据调查结果和观察了各种影响和研究。

Article 93

Title@2025-06-28 (6): RAILS: Retrieval-Augmented Intelligence for Learning Software Development

Title: RAILS: Retrieval-Augmented Intelligence for Learning Software Development

RAILS: Retrieval-Augmented Intelligence für die Entwicklung von Lernsoftware

RAILS:学习软件开发检索增强情报 2506.22742v1

Authors (6): Wali Mohammad Abdullah, Md. Morshedul Islam, Devraj Parmar, Happy Hasmukhbhai Patel, Sindhuja Prabhakaran, Baidya Saha

Large Language Models (LLMs) like GPT-3.5-Turbo are increasingly used to assist software development, yet they often produce incomplete code or incorrect imports, especially when lacking access to external or project-specific documentation. We introduce RAILS (Retrieval-Augmented Intelligence for Learning Software Development), a framework that augments LLM prompts with semantically retrieved context from curated Java resources using FAISS and OpenAI embeddings. RAILS incorporates an iterative validation loop guided by compiler feedback to refine suggestions. We evaluated RAILS on 78 real-world Java import error cases spanning standard libraries, GUI APIs, external tools, and custom utilities. Despite using the same LLM, RAILS outperforms baseline prompting by preserving intent, avoiding hallucinations, and surfacing correct imports even when libraries are unavailable locally. Future work will integrate symbolic filtering via PostgreSQL and extend support to other languages and IDEs.

GPT-3.5-Turbo等大型语言模型(LLMs)越来越多地用于协助软件开发,但它们往往产生不完整的代码或不正确的进口,特别是在缺乏外部或具体项目文件的情况下。我们引入了RAILS(为学习软件开发检索启动情报),这是一个利用FAISIS和OpenAI嵌入器从保藏的爪哇资源中用语义检索精度来增强LLM提示的框架。RAILS包含一个由汇编者反馈指导的迭代验证循环,以完善建议。我们评估RAILS的78个真实世界爪哇进口错误案例,这些案例涉及标准图书馆、GUI APIs、外部工具和海关公用设施。尽管使用了相同的LLM,但RAILS仍然超越了基线,通过维护意图、避免幻觉和冲浪正确进口,即使图书馆在当地无法使用。未来的工作将整合通过PostgreSQL的象征性过滤,并将支持扩大到其他语言和IDS。
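
The RAILS abstract above describes two mechanisms: retrieving context for the prompt, and an iterative validation loop driven by compiler feedback. The following is a rough sketch of how those two pieces could fit together, under several assumptions: a bag-of-words cosine similarity replaces the FAISS index and OpenAI embeddings, the `DOCS` snippets are invented, and `suggest` and `compiles` are hypothetical callables standing in for the LLM call and the Java compiler.

```python
import math
from collections import Counter

# Hypothetical curated snippets; the paper's actual Java resources are not shown here.
DOCS = {
    "java.util.List": "List ArrayList collection ordered elements import java.util.List",
    "java.nio.file.Files": "Files readAllLines path file io import java.nio.file.Files",
    "javax.swing.JFrame": "JFrame window GUI swing import javax.swing.JFrame",
}


def embed(text):
    """Toy embedding: lowercase token counts (stand-in for a learned embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, k=1):
    """Return the k document keys most similar to the query."""
    q = embed(query)
    ranked = sorted(DOCS, key=lambda name: cosine(q, embed(DOCS[name])), reverse=True)
    return ranked[:k]


def repair_imports(error_message, suggest, compiles, max_rounds=3):
    """Retrieve context, ask for a fix, and retry with compiler feedback on failure."""
    for _ in range(max_rounds):
        context = retrieve(error_message)
        candidate = suggest(error_message, context)
        ok, feedback = compiles(candidate)
        if ok:
            return candidate
        error_message = feedback  # the next round is guided by the compiler output
    return None
```

The loop terminates either with a candidate the checker accepts or with `None` after `max_rounds` attempts, so a caller can distinguish a validated suggestion from an exhausted budget.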

Article 94

Title@2025-06-28 (6): P4OMP: Retrieval-Augmented Prompting for OpenMP Parallelism in Serial Code

Title: P4OMP: Retrieval-Augmented Prompting for OpenMP Parallelism in Serial Code

P4OMP: Retrieval-Augmented Prompting für OpenMP Parallelismus im seriellen Code

P4OMP:面向串行代码OpenMP并行化的检索增强提示 2506.22703v1

Authors (2): Wali Mohammad Abdullah, Azmain Kabir

We present P4OMP, a retrieval-augmented framework for transforming serial C/C++ code into OpenMP-annotated parallel code using large language models (LLMs). To our knowledge, this is the first system to apply retrieval-based prompting for OpenMP pragma correctness without model fine-tuning or compiler instrumentation. P4OMP leverages Retrieval-Augmented Generation (RAG) with structured instructional knowledge from OpenMP tutorials to improve the reliability of prompt-driven code generation. By grounding generation in the retrieved context, P4OMP improves syntactic correctness compared to baseline prompting with GPT-3.5-Turbo. We evaluate P4OMP against a baseline, GPT-3.5-Turbo without retrieval, on a comprehensive benchmark of 108 real-world C++ programs drawn from Stack Overflow, PolyBench, and NAS benchmark suites. P4OMP achieves 100% compilation success on all parallelizable cases, while the baseline fails to compile in 20 out of 108 cases. Six cases that rely on non-random-access iterators or thread-unsafe constructs are excluded due to fundamental OpenMP limitations. A detailed analysis demonstrates how P4OMP consistently avoids scoping errors, syntactic misuse, and invalid directive combinations that commonly affect baseline-generated code. We further demonstrate strong runtime scaling across seven compute-intensive benchmarks on an HPC cluster. P4OMP offers a robust, modular pipeline that significantly improves the reliability and applicability of LLM-generated OpenMP code.

我们提出P4OMP,这是一个检索增强框架,利用大型语言模型(LLMs)将串行C/C++代码转换为带OpenMP标注的并行代码。据我们所知,这是第一个在不进行模型微调或编译器插桩的情况下,将基于检索的提示用于保证OpenMP pragma正确性的系统。P4OMP利用检索增强生成(RAG)以及来自OpenMP教程的结构化教学知识,提高提示驱动代码生成的可靠性。通过让生成过程以检索到的上下文为依据,P4OMP相比使用GPT-3.5-Turbo的基线提示提高了语法正确性。我们在一个由108个真实C++程序组成的综合基准上,将P4OMP与不带检索的GPT-3.5-Turbo基线进行比较,这些程序取自Stack Overflow、PolyBench和NAS基准套件。P4OMP在所有可并行化的案例上实现了100%的编译成功率,而基线在108个案例中有20个无法编译。有六个案例依赖非随机访问迭代器或线程不安全的构造,因OpenMP的根本限制而被排除。详细分析表明,P4OMP能够始终避免通常影响基线生成代码的作用域错误、语法误用和无效指令组合。我们还在HPC集群上的七个计算密集型基准上展示了良好的运行时扩展性。P4OMP提供了一个稳健的模块化流水线,显著提高了LLM生成的OpenMP代码的可靠性和适用性。

Article 95

Title@2025-06-27 (5): An LLM-assisted approach to designing software architectures using ADD

Title: An LLM-assisted approach to designing software architectures using ADD

Ein LLM-unterstützter Ansatz zur Entwicklung von Softwarearchitekturen mit ADD

利用ADD设计软件结构的LLM辅助方法 2506.22688v1

Authors (3): Humberto Cervantes, Rick Kazman, Yuanfang Cai

Designing effective software architectures is a complex, iterative process that traditionally relies on expert judgment. This paper proposes an approach for Large Language Model (LLM)-assisted software architecture design using the Attribute-Driven Design (ADD) method. By providing an LLM with an explicit description of ADD, an architect persona, and a structured iteration plan, our method guides the LLM to collaboratively produce architecture artifacts with a human architect. We validate the approach through case studies, comparing generated designs against proven solutions and evaluating them with professional architects. Results show that our LLM-assisted ADD process can generate architectures closely aligned with established solutions and partially satisfying architectural drivers, highlighting both the promise and current limitations of using LLMs in architecture design. Our findings emphasize the importance of human oversight and iterative refinement when leveraging LLMs in this domain.

设计有效的软件结构是一个复杂、迭接的过程,传统上依赖专家的判断。本文件提出使用属性驱动设计(ADD)方法的大型语言模型辅助软件结构设计方法。通过向一个LLM提供对ADD、建筑师的清晰描述和结构化迭代计划,我们的方法指导LLM与一名人类建筑师合作制作建筑文物。我们通过案例研究验证这一方法,对照经验证的解决方案比较所产生的设计,并与专业建筑师评估这些设计。结果显示,我们的LLMAM协助的ADD进程能够产生与既定解决方案密切吻合的建筑结构,并部分满足建筑驱动因素,同时强调在建筑设计中使用LMMS的希望和当前局限性。我们的调查结果强调,在利用该领域的LMS时,必须进行人类监督和迭接改进。

Article 96

Title@2025-06-27 (5): Knowledge-Guided Multi-Agent Framework for Automated Requirements Development: A Vision

Title: Knowledge-Guided Multi-Agent Framework for Automated Requirements Development: A Vision

Knowledge-Guided Multi-Agent Framework für automatisierte Anforderungsentwicklung: Eine Vision

知识指导的自动化要求发展多方支持框架:愿景 2506.22656v1

Authors (5): Jiangping Huang, Dongming Jin, Weisong Sun, Yang Liu, Zhi Jin

This paper envisions a knowledge-guided multi-agent framework named KGMAF for automated requirements development. KGMAF aims to address gaps in current automation systems for SE, which prioritize code development and overlook the complexities of requirements tasks. KGMAF is composed of six specialized agents and an artifact pool to improve efficiency and accuracy. Specifically, KGMAF outlines the functionality, actions, and knowledge of each agent and provides the conceptual design of the artifact pool. Our case study highlights the potential of KGMAF in real-world scenarios. Finally, we outline several research opportunities for implementing and enhancing automated requirements development using multi-agent systems. We believe that KGMAF will play a pivotal role in shaping the future of automated requirements development in the era of LLMs.

该文件设想了一个名为KGMAF的知识引导多剂框架,用于自动化要求的开发;KGMAF旨在填补SE现有自动化系统的差距,该系统将代码开发列为优先事项,并忽视要求任务的复杂性;KGMAF由六个专门代理机构和一个工艺品库组成,以提高效率和准确性;具体地说,KGMAF概述了每个代理机构的功能、行动和知识,并提供了文物库的概念设计;我们的案例研究强调了KGMAF在现实世界中的潜力;最后,我们概述了利用多剂系统实施和加强自动化要求开发的若干研究机会;我们相信,KGMAF将在塑造LMS时代自动需求开发的未来方面发挥关键作用。

Article 97

Title@2025-06-27 (5): L2MAC: Large Language Model Automatic Computer for Extensive Code Generation

Title: L2MAC: Large Language Model Automatic Computer for Extensive Code Generation

L2MAC: Automatischer Computer mit großem Sprachmodell für umfangreiche Code-Generierung

L2MAC:用于广泛代码生成的大型语言模拟自动计算机 2310.02003v6

Authors (3): Samuel Holt, Max Ruiz Luyten, Mihaela van der Schaar

Transformer-based large language models (LLMs) are constrained by the fixed context window of the underlying transformer architecture, hindering their ability to produce long and coherent outputs. Memory-augmented LLMs are a promising solution, but current approaches cannot handle long output generation tasks since they (1) only focus on reading memory and reduce its evolution to the concatenation of new memories or (2) use very specialized memories that cannot adapt to other domains. This paper presents L2MAC, the first practical LLM-based general-purpose stored-program automatic computer (von Neumann architecture) framework, an LLM-based multi-agent system, for long and consistent output generation. Its memory has two components: the instruction registry, which is populated with a prompt program to solve the user-given task, and a file store, which will contain the final and intermediate outputs. Each instruction in turn is executed by a separate LLM agent, whose context is managed by a control unit capable of precise memory reading and writing to ensure effective interaction with the file store. These components enable L2MAC to generate extensive outputs, bypassing the constraints of the finite context window while producing outputs that fulfill a complex user-specified task. We empirically demonstrate that L2MAC achieves state-of-the-art performance in generating large codebases for system design tasks, significantly outperforming other coding methods in implementing the detailed user-specified task; we show that L2MAC works for general-purpose extensive text-based tasks, such as writing an entire book; and we provide valuable insights into L2MAC’s performance improvement over existing methods.

基于变压器的大型变换语言模型(LLMS)受到基础变压器结构固定背景窗口的限制,妨碍其产生长期和连贯产出的能力。内存增强的LLMS是一个很有希望的解决方案,但目前的方法无法处理长期产出生成任务,因为它们(1) 只侧重于读取记忆,将其演化到新记忆的融合,或(2) 使用无法适应其他领域的非常专门的记忆。本文展示了L2MAC, 这是第一个基于LMM的实用通用存储自动程序(Von Neumann架构)框架,一个基于LMM的多试机系统,用于长期和一致产出的生成。它的记忆有两个组成部分:指示登记册,它包含一个用于解析用户授权的任务的快速程序,以及一个包含最终和中间产出的文件储存。每项指示都由一个独立的LMMA代理机构执行,其背景由一个能够精确记忆读写和书写的控制单位管理,以确保与文件库的有效互动。这些组成部分使得L2MAC公司能够产生广泛的产出,绕过有限背景窗口的限制,同时将产出制成一个执行复杂版本的LMAC2号的系统,从而完成复杂的用户任务。
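
The L2MAC abstract above names three parts: an instruction registry, a file store, and a control unit that manages each agent's context. A minimal sketch of that control flow follows. It is an assumption-laden illustration, not the authors' implementation: `run_l2mac`, `toy_agent`, the character-based context limit, and the context layout are all hypothetical, and `toy_agent` replaces a real LLM call.

```python
def run_l2mac(instructions, agent, max_context_chars=200):
    """Execute each instruction with a fresh agent call and a bounded context."""
    file_store = {}  # path -> content; holds intermediate and final outputs
    for instruction in instructions:  # the list plays the role of the instruction registry
        # Control unit: expose only the instruction plus a compact file listing,
        # so the context stays bounded however much has been written so far.
        listing = ", ".join(sorted(file_store)) or "(empty)"
        context = f"Instruction: {instruction}\nFiles: {listing}"[:max_context_chars]
        # The agent returns the files it wants to create or overwrite.
        for path, content in agent(context, dict(file_store)).items():
            file_store[path] = content
    return file_store


def toy_agent(context, files):
    """Stand-in for an LLM call: writes one file named after the instruction."""
    instruction = context.splitlines()[0].split(": ", 1)[1]
    name = instruction.lower().replace(" ", "_") + ".py"
    body = f"# generated for: {instruction}\n# saw {len(files)} existing file(s)\n"
    return {name: body}


if __name__ == "__main__":
    store = run_l2mac(["Create models", "Create routes"], toy_agent)
    for path in sorted(store):
        print(path)
```

The point of the design is that total output size is limited by the file store rather than by any single context window, since each step sees only what the control unit chooses to load.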

Article 98

Title@2025-06-27 (5): What Makes ChatGPT Effective for Software Issue Resolution? An Empirical Study of Developer-ChatGPT Conversations in GitHub

Title: What Makes ChatGPT Effective for Software Issue Resolution? An Empirical Study of Developer-ChatGPT Conversations in GitHub

Was macht ChatGPT effektiv für Software Problemlösung? Eine empirische Studie von Entwickler-ChatGPT Gespräche in GitHub

是什么使ChatGPT有效解决软件问题? 开发者-ChatGPT在GitHub的对话经验研究。 2506.22390v1

Authors (5): Ramtin Ehsani, Sakshi Pathak, Esteban Parra, Sonia Haiduc, Preetha Chatterjee

Conversational large-language models are extensively used for issue resolution tasks. However, not all developer-LLM conversations are useful for effective issue resolution. In this paper, we analyze 686 developer-ChatGPT conversations shared within GitHub issue threads to identify characteristics that make these conversations effective for issue resolution. First, we analyze the conversations and their corresponding issues to distinguish helpful from unhelpful conversations. We begin by categorizing the types of tasks developers seek help with to better understand the scenarios in which ChatGPT is most effective. Next, we examine a wide range of conversational, project, and issue-related metrics to uncover factors associated with helpful conversations. Finally, we identify common deficiencies in unhelpful ChatGPT responses to highlight areas that could inform the design of more effective developer-facing tools. We found that only 62% of the ChatGPT conversations were helpful for successful issue resolution. ChatGPT is most effective for code generation and tools/libraries/APIs recommendations, but struggles with code explanations. Helpful conversations tend to be shorter, more readable, and exhibit stronger semantic and linguistic alignment. Larger, more popular projects and more experienced developers benefit more from ChatGPT. At the issue level, ChatGPT performs best on simpler problems with limited developer activity and faster resolution, typically well-scoped tasks like compilation errors. The most common deficiencies in unhelpful ChatGPT responses include incorrect information and lack of comprehensiveness. Our findings have wide implications including guiding developers on effective interaction strategies for issue resolution, informing the development of tools or frameworks to support optimal prompt design, and providing insights on fine-tuning LLMs for issue resolution tasks.

大型连通语言模型被广泛用于解决问题的任务。但是,并非所有开发者- LLM 对话都有助于有效解决问题。在本文中,我们分析在 GitHub 发行线索中共享的686个开发者- ChatGPT 对话,以确定这些对话能够有效解决问题的特征。首先,我们分析这些对话及其相应的问题,以区分有助于解决问题的和无帮助的谈话。我们首先对任务开发者的类型进行分类,以更好地了解ChatGPT最有效的情景。其次,我们审视广泛的对话、项目和与问题有关的指标,以发现与有益的对话相关的因素。最后,我们发现在无帮助的 Chater-ChatGPT 响应中,共有686个开发者- ChatgPT 对话的特性,从而帮助成功解决问题。 ChateGPT 热G 热解/类似工具/APIs 的建议最为有效,但与代码解释框架相挣扎。帮助的对话往往更短、更清晰、更清晰、更清晰、更清晰的路径和最准确的版本的版本的版本的版本的解决方案, 包括更精确的版本的解决方案的解决方案的解决方案的解决方案的解决方案的解决方案的解决方案的解决方案的解决方案和G。

Article 99

Title@2025-06-27 (5): Automated detection of atomicity violations in large-scale systems

Title: Automated detection of atomicity violations in large-scale systems

Automatisierte Erkennung von atomaren Verstößen in Großsystemen

在大规模系统中自动发现违反原子现象 2504.00521v2

Authors (6): Hang He, Yixing Luo, Chengcheng Wan, Ting Su, Haiying Sun, Geguang Pu

Atomicity violations in interrupt-driven programs pose a significant threat to software safety in critical systems. These violations occur when the execution sequence of operations on shared resources is disrupted by asynchronous interrupts. Detecting atomicity violations is challenging due to the vast program state space, application-level code dependencies, and complex domain-specific knowledge. We propose Clover, a hybrid framework that integrates static analysis with large language model (LLM) agents to detect atomicity violations in real-world programs. Clover first performs static analysis to extract critical code snippets and operation information. It then initiates a multi-agent process, where the expert agent leverages domain-specific knowledge to detect atomicity violations, which are subsequently validated by the judge agent. Evaluations on RaceBench 2.1, SV-COMP, and RWIP demonstrate that Clover achieves a precision/recall of 92.3%/86.6%, outperforming existing approaches by 27.4-118.2% on F1-score.

在中断驱动的方案中,违反原子现象对关键系统软件的安全构成重大威胁。当共享资源操作的顺序因非同步中断中断而中断时,就会发生这些违反行为。发现违反原子现象具有挑战性,因为国家方案空间辽阔、应用层面的代码依赖性和复杂的具体领域知识。我们提议Clover是一个混合框架,将静态分析与大型语言模型(LLM)剂相结合,以发现现实世界方案中的违反原子现象。Crover首先进行静态分析,以提取关键代码片断和操作信息。然后启动一个多试剂程序,专家代理利用特定领域知识发现违反原子现象,随后由法官代理人验证。对RWIP和SV-COMP的评析表明,Clover实现了92.3%86.6%的精确度/召回率,超过F1核心现有方法的27.4%至118.2%。
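
The Clover abstract above reports precision and recall and then compares approaches by F1-score. As a quick check of how those quantities relate, the standard harmonic-mean formula can be applied to the two reported figures. This assumes the figures are aggregate values; the paper's own F1 may be computed per benchmark and so could differ slightly.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


clover_f1 = f1_score(0.923, 0.866)
print(f"{clover_f1:.3f}")  # prints 0.894
```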

Article 100

Title@2025-06-27 (5): Domain-Driven Design in Software Development: A Systematic Literature Review on Implementation, Challenges, and Effectiveness

Title: Domain-Driven Design in Software Development: A Systematic Literature Review on Implementation, Challenges, and Effectiveness

Domain-Driven Design in der Software-Entwicklung: Ein systematischer Literaturbericht über Implementierung, Herausforderungen und Effektivität

软件开发中的域驱动设计:关于实施、挑战和有效性的系统文献审查 2310.01905v4

Authors (3): Ozan Özkan, Önder Babur, Mark van den Brand

Context: Domain-Driven Design (DDD) has gained significant attention in software development for its potential to address complex software challenges, particularly in the areas of system refactoring, reimplementation, and adoption. Using domain knowledge, DDD aims to solve complex business problems effectively. Objective: This SLR aims to provide an analysis of existing research on DDD in software development, paint a picture of DDD in solving software problems, identify the challenges encountered during its application and explore the results of these studies. Method: We systematically selected 36 peer-reviewed studies and conducted quantitative and qualitative analyses to synthesize the findings. Results: DDD and its key concepts have effectively improved software systems. The application of DDD in microservices has gained prominence for its ability to facilitate system decomposition. Some studies lacked empirical evaluations, and the studies highlight challenges in onboarding and the need for expertise. Conclusion: Adopting DDD benefits software development, involving stakeholders such as engineers, architects, managers, and domain experts. More empirical evaluations and open discussions on challenges are needed. Collaboration between academia and industry advances the adoption and transfer of knowledge of DDD in projects.

目标:DDD旨在分析软件开发中现有的DDD研究,描绘DDD在解决软件问题方面的情况,查明应用过程中遇到的挑战,并探讨这些研究的结果。方法:我们有系统地挑选了36项同行审查研究,并进行了定量和定性分析,以综合研究结果。结果:DDD有效地改进了软件系统,并提出了关键概念。DDD在微观服务中的应用因其促进系统分解的能力而得到了突出。有些研究缺乏经验性评价,突出登船方面的挑战和专业知识的需要。结论:采用DDDDD软件开发,让工程师、建筑师、管理人员和领域专家等利益攸关方参与其中。需要更多经验性评价和公开讨论挑战。学术界和工业界合作,推动项目中DDDD知识的采纳和转让。

Article 101

Title@2025-06-27 (5): Autonomic Microservice Management via Agentic AI and MAPE-K Integration

Title: Autonomic Microservice Management via Agentic AI and MAPE-K Integration

Autonomes Microservice Management über Agentic AI und MAPE-K Integration

通过Agentic AI和MAPE-K整合进行自动微服务管理 2506.22185v1

Authors (7): Matteo Esposito, Alexander Bakhtin, Noman Ahmad, Mikel Robredo, Ruoyu Su, Valentina Lenarduzzi, Davide Taibi

While microservices are revolutionizing cloud computing by offering unparalleled scalability and independent deployment, their decentralized nature poses significant security and management challenges that can threaten system stability. We propose a framework based on MAPE-K, which leverages agentic AI, for autonomous anomaly detection and remediation to address the daunting task of highly distributed system management. Our framework offers practical, industry-ready solutions for maintaining robust and secure microservices. Practitioners and researchers can customize the framework to enhance system stability, reduce downtime, and monitor broader system quality attributes such as system performance level, resilience, security, and anomaly management, among others.

虽然微观服务正在通过提供空前的可扩缩性和独立部署,使云层计算革命化,但其分散性质带来了严重的安全和管理挑战,可能危及系统稳定。我们提议了一个以MAPE-K为基础的框架,利用ATP-K的代理AI进行自主异常探测和补救,以解决高度分布的系统管理的艰巨任务。我们的框架为维持强大和安全的微观服务提供了实用的、行业准备就绪的解决方案。从业者和研究人员可以定制框架,以加强系统稳定,减少故障,并监测更广泛的系统质量属性,例如系统性能水平、复原力、安全和异常管理等。
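
MAPE-K, the basis of the framework in the abstract above, is the autonomic-computing control loop of Monitor, Analyze, Plan, and Execute phases over shared Knowledge. The sketch below shows one iteration of such a loop for latency anomalies. It is a simplified illustration: the paper uses agentic AI inside the loop, whereas this version uses a fixed threshold rule, and the service names, the threshold, and the "restart" action are invented.

```python
from dataclasses import dataclass, field


@dataclass
class Knowledge:
    """Shared state consulted and updated by every phase."""
    latency_threshold_ms: float = 500.0
    history: list = field(default_factory=list)


def monitor(cluster, knowledge):
    metrics = dict(cluster)  # snapshot of the current per-service latency
    knowledge.history.append(metrics)
    return metrics


def analyze(metrics, knowledge):
    return [svc for svc, ms in metrics.items() if ms > knowledge.latency_threshold_ms]


def plan(anomalies, knowledge):
    return [("restart", svc) for svc in anomalies]


def execute(actions, cluster):
    for action, svc in actions:
        if action == "restart":
            cluster[svc] = 100.0  # pretend a restart restores nominal latency


def mape_k_iteration(cluster, knowledge):
    """Run one Monitor -> Analyze -> Plan -> Execute pass and return the actions taken."""
    metrics = monitor(cluster, knowledge)
    actions = plan(analyze(metrics, knowledge), knowledge)
    execute(actions, cluster)
    return actions


if __name__ == "__main__":
    services = {"cart": 120.0, "payment": 900.0}
    print(mape_k_iteration(services, Knowledge()))
```

Keeping the four phases as separate functions makes it straightforward to swap the rule-based `analyze` or `plan` for a model-driven component without touching the rest of the loop.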

Article 102

Title@2025-06-27 (5): Towards Modeling Human-Agentic Collaborative Workflows: A BPMN Extension

Title: Towards Modeling Human-Agentic Collaborative Workflows: A BPMN Extension

Auf dem Weg zur Modellierung von Mensch-Agentik-kollaborativen Workflows: Eine BPMN-Erweiterung

建立模拟人类-机构合作工作流程的模式:生物和植物保护网扩展项目 2412.05958v3

Authors (3): Adem Ait, Javier Luis Cánovas Izquierdo, Jordi Cabot

Large Language Models (LLMs) have facilitated the definition of autonomous intelligent agents. Such agents have already demonstrated their potential in solving complex tasks in different domains, and they can further increase their performance when collaborating with other agents in a multi-agent system. However, the orchestration and coordination of these agents is still challenging, especially when they need to interact with humans as part of human-agentic collaborative workflows. These kinds of workflows need to be precisely specified so that it is clear who is responsible for each task, what strategies agents can follow to complete individual tasks, or how decisions will be taken when different alternatives are proposed, among others. Current business process modeling languages fall short when it comes to specifying these new mixed collaborative scenarios. In this exploratory paper, we extend a well-known process modeling language (i.e., BPMN) to enable the definition of this new type of workflow. Our extension covers both the formalization of the new metamodeling concepts required and the proposal of a BPMN-like graphical notation to facilitate the definition of these workflows. Our extension has been implemented and is available as an open-source human-agentic workflow modeling editor on GitHub.

大型语言模型(LLMS)为自主智能剂的定义提供了便利。这些代理商已经展示了在解决不同领域复杂任务方面的潜力。当在一个多试剂系统中与其他代理商合作时,它们可以进一步提高其绩效。然而,这些代理商的协同和协调仍然具有挑战性,特别是当它们需要与人类互动,作为人类代理协作工作流程的一部分时。这些工作流程需要精确地具体化,以便明确谁负责每项任务,什么战略代理商可以完成个别任务,或者在提出不同备选方案时如何作出决定。当提出这些新的混合协作设想时,目前的业务流程模拟语言就不够了。在本探索文件中,我们扩展了一种众所周知的进程模型语言(即BPMN),以便能够界定这种新型工作流程。我们的扩展范围既包括所需新的元模型概念的正规化,也包括一个类似于BPMN的图形标记的建议,以便利这些工作流程的定义。我们的扩展已经实施,并作为GiHub的公开源的人类代理工作流程模型编辑器提供。

Article 103

Title@2025-06-27 (5): KARMA Approach supporting Development Process Reconstruction in Model-based Systems Engineering

Title: KARMA Approach supporting Development Process Reconstruction in Model-based Systems Engineering

KARMA-Ansatz zur Unterstützung der Prozessrekonstruktion in der modellbasierten Systemtechnik

KARMA 支持基于示范系统工程的发展进程重建的方法 2506.22037v1

Authors (7): Jiawei Li, Zan Liang, Guoxin Wang, Jinzhi Lu, Yan Yan, Shouxuan Wu, Hao Wang

Model reconstruction is a method used to drive complex system development processes in model-based systems engineering. Currently, during the iterative design process of a system, there is no effective method to manage changes in development requirements, such as development cycle requirements and cost requirements, and to reconstruct the system development process model accordingly. To address these issues, this paper proposes a model reconstruction method to support the development process model. Firstly, the KARMA language, based on the GOPPRR-E metamodeling method, is used to uniformly formalize process models built in different modeling languages. Secondly, a model reconstruction framework is introduced. This framework takes structured, natural-language development requirements as input, employs natural language processing techniques to analyze the requirements text, and extracts structural and optimization constraint information. Then, after structural reorganization and algorithm optimization, a development process model that meets the development requirements is obtained. Finally, as a case study, the development process of an aircraft onboard maintenance system is reconstructed. The results demonstrate that this method can significantly enhance the design efficiency of the development process.

模型重建是用来推动在基于模型的系统工程中开发复杂的系统开发过程的一种方法,目前,在系统的迭代设计过程中,缺乏一种有效的方法来管理发展要求的变化,例如发展周期要求和费用要求,并实现系统开发过程模型的重建,为了解决这些问题,本文件提议了一种支持开发过程模型的模型重建方法,首先,利用基于GOPPRR-E元模型模型的KARMA语言,统一正式确定以不同模型语言建造的进程模型;其次,采用了一个模型重建框架,采用以自然语言为基础的结构化开发要求作为投入,使用自然语言处理技术分析发展要求文本,并提取结构和优化限制信息。随后,在结构重组和算法优化后,获得了一个满足开发过程模型。最后,作为案例研究,对机上维修系统的开发过程进行了重新确定。结果表明,这一方法可以大大提高开发过程的设计效率。

Article 104

Title@2025-06-27 (5): The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason

Title: The SWE-Bench Illusion: When State-of-the-Art LLMs Remember Instead of Reason

Die SWE-Bench-Illusion: Wenn State-of-the-Art-LLMs sich erinnern, statt zu schlussfolgern

SWE-Bench 幻象:当最先进的LLM依靠记忆而非推理时 2506.12286v2

Authors (3): Shanchao Liang, Spandan Garg, Roshanak Zilouchian Moghaddam

As large language models (LLMs) become increasingly capable and widely adopted, benchmarks play a central role in assessing their practical utility. For example, SWE-Bench Verified has emerged as a critical benchmark for evaluating LLMs’ software engineering abilities, particularly their aptitude for resolving real-world GitHub issues. Recent LLMs show impressive performance on SWE-Bench, leading to optimism about their capacity for complex coding tasks. However, current evaluation protocols may overstate these models’ true capabilities. It is crucial to distinguish LLMs’ generalizable problem-solving ability from other learned artifacts. In this work, we introduce two diagnostic tasks to probe models’ underlying knowledge: file path identification from issue descriptions alone, and ground truth function reproduction with only the current file context and issue description. We present empirical evidence that performance gains on SWE-Bench-Verified may be partially driven by memorization rather than genuine problem-solving. We show that state-of-the-art models achieve up to 76% accuracy in identifying buggy file paths using only issue descriptions, without access to repository structure. This performance drops to at most 53% on tasks from repositories not included in SWE-Bench, pointing to possible data contamination or memorization. A similar pattern is also observed for the function reproduction task, where the verbatim similarity is much higher on SWE-Bench-Verified than on other similar coding benchmarks. These findings raise concerns about the validity of existing results and underscore the need for more robust, contamination-resistant benchmarks to reliably evaluate LLMs’ coding abilities.

随着大型语言模型(LLMS)越来越有能力并被广泛采用,基准在评估其实用实用性方面发挥着核心作用。例如,SWE-Bench Verification(SWE-Bench Ferfied)已经成为评价LLMS软件工程能力,特别是解决现实世界 GitHub 问题的能力的关键基准。最近LMS在SWE-Bench(LLMS)上表现出令人印象深刻的绩效,这导致人们对其复杂编码任务的能力感到乐观。然而,目前的评估协议可能夸大了这些模型的真实能力。区分LLMS的可普遍适用的解决问题解决能力和其他学习的文物至关重要。在这项工作中,我们引入了两个诊断任务:单从问题描述中进行路径识别,用当前文件背景进行地面真相函数复制,并发布描述来探究模型的基本知识。我们从SWE-Bench-Verificality上的业绩可能部分由记忆化而不是真正的解决问题的能力来驱动。我们显示,在确定稳性的文件路径路径方面达到76%的准确性路径,不需要访问存储结构结构。这种业绩仅需要高达53%,在SWEWE-BES 功能上显示类似的任务模式上的类似的结果,在SWEWEWE-commlDRDRDRDRDRDRDRBY 功能上,而没有被观察到。

Article 105

Title@2025-06-27 (5): Generative AI for Software Architecture. Applications, Challenges, and Future Directions

Title: Generative AI for Software Architecture. Applications, Challenges, and Future Directions

Generative KI für Softwarearchitektur. Anwendungen, Herausforderungen und Zukunftsrichtungen

软件结构的生成AI:应用、挑战和未来方向 2503.13310v2

Authors (8): Matteo Esposito, Xiaozhou Li, Sergio Moreschini, Noman Ahmad, Tomas Cerny, Karthik Vaidhyanathan, Valentina Lenarduzzi, Davide Taibi

Context: Generative Artificial Intelligence (GenAI) is transforming much of software development, yet its application in software architecture is still in its infancy, and no prior study has systematically addressed the topic. Aim: We aim to systematically synthesize the use, rationale, contexts, usability, and future challenges of GenAI in software architecture. Method: We performed a multivocal literature review (MLR), analyzing peer-reviewed and gray literature, identifying current practices, models, adoption contexts, and reported challenges, extracting themes via open coding. Results: Our review identified significant adoption of GenAI for architectural decision support and architectural reconstruction. OpenAI GPT models are predominantly applied, and there is consistent use of techniques such as few-shot prompting and retrieval-augmented generation (RAG). GenAI has been applied mostly to initial stages of the Software Development Life Cycle (SDLC), such as Requirements-to-Architecture and Architecture-to-Code. Monolithic and microservice architectures were the dominant targets. However, rigorous testing of GenAI outputs was typically missing from the studies. Among the most frequent challenges are model precision, hallucinations, ethical aspects, privacy issues, lack of architecture-specific datasets, and the absence of sound evaluation frameworks. Conclusions: GenAI shows significant potential in software design, but several challenges remain on its path to greater adoption. Research efforts should target designing general evaluation methodologies, handling ethics and precision, increasing transparency and explainability, and promoting architecture-specific datasets and benchmarks to bridge the gap between theoretical possibilities and practical use.

目标：我们的目标是系统地综合GenAI在软件架构中的使用、理由、背景、可用性和未来挑战。方法：我们进行了多声部文献综述(MLR)，分析同行评审文献和灰色文献，查明当前做法、模型、采用背景和所报告的挑战，并通过开放式编码提取主题。结果：我们的综述发现，GenAI在架构决策支持和架构重建方面得到大量采用。OpenAI GPT模型应用最多，并且普遍使用少样本提示和检索增强生成(RAG)等技术。GenAI主要应用于软件开发生命周期(SDLC)的初始阶段，如需求到架构和架构到代码。

Article 106

Title@2025-06-27 (5): Enhancing Cloud Security through Topic Modelling

Title: Enhancing Cloud Security through Topic Modelling

Verbesserung der Cloud-Sicherheit durch Themenmodellierung

通过专题建模模式加强云层安全 2505.01463v2

Authors (3): Sabbir M. Saleh, Nazim Madhavji, John Steinbacher

Protecting cloud applications is critical in an era where security threats are increasingly sophisticated and persistent. Continuous Integration and Continuous Deployment (CI/CD) pipelines are particularly vulnerable, making innovative security approaches essential. This research explores the application of Natural Language Processing (NLP) techniques, specifically Topic Modelling, to analyse security-related text data and anticipate potential threats. We focus on Latent Dirichlet Allocation (LDA) and Probabilistic Latent Semantic Analysis (PLSA) to extract meaningful patterns from data sources, including logs, reports, and deployment traces. Using the Gensim framework in Python, these methods categorise log entries into security-relevant topics (e.g., phishing, encryption failures). The identified topics are leveraged to highlight patterns indicative of security issues across CI/CD’s continuous stages (build, test, deploy). This approach introduces a semantic layer that supports early vulnerability recognition and contextual understanding of runtime behaviours.

在安全威胁日益复杂和持续的时代，保护云应用至关重要。持续集成和持续部署(CI/CD)管道特别脆弱，因此创新的安全方法至关重要。这项研究探索了自然语言处理(NLP)技术的应用，特别是主题建模，以分析与安全有关的文本数据并预测潜在威胁。我们侧重于Latent Dirichlet分配(LDA)和概率潜在语义分析(PLSA)，以便从日志、报告和部署痕迹等数据源中提取有意义的模式。利用Python的Gensim框架，这些方法将日志条目归入与安全有关的主题(例如钓鱼攻击、加密故障)。已识别的主题被用来突出显示CI/CD各连续阶段(构建、测试、部署)中预示安全问题的模式。这种方法引入了一个语义层，有助于早期识别漏洞并结合上下文理解运行时行为。
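The abstract above describes categorising log entries into security-relevant topics with LDA. The following is a minimal, standard-library sketch of LDA fitted by collapsed Gibbs sampling, intended only to show the mechanics. It is not the authors' Gensim pipeline; the function name `lda_gibbs`, the toy log tokens in `LOG_DOCS`, and all hyperparameter values are assumptions made for illustration.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0, top_n=3):
    """Fit LDA by collapsed Gibbs sampling on tokenised documents.

    Returns (theta, top_words): per-document topic proportions and the
    top_n most frequent words per topic. Hypothetical helper, not a library API.
    """
    rng = random.Random(seed)
    vocab = sorted({word for doc in docs for word in doc})
    word_id = {word: i for i, word in enumerate(vocab)}
    vocab_size = len(vocab)

    doc_topic = [[0] * num_topics for _ in docs]               # topic counts per document
    topic_word = [[0] * vocab_size for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics                              # token counts per topic
    assignments = []

    # Random initial topic for every token.
    for d, doc in enumerate(docs):
        row = []
        for word in doc:
            k = rng.randrange(num_topics)
            row.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[word]] += 1
            topic_total[k] += 1
        assignments.append(row)

    # Resample each token's topic conditioned on all other assignments.
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                v = word_id[word]
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1

    theta = [
        [(count + alpha) / (len(doc) + num_topics * alpha) for count in counts]
        for counts, doc in zip(doc_topic, docs)
    ]
    top_words = []
    for t in range(num_topics):
        ranked = sorted(range(vocab_size), key=lambda v: -topic_word[t][v])
        top_words.append([vocab[v] for v in ranked[:top_n]])
    return theta, top_words


# Made-up, pre-tokenised log entries (illustrative only).
LOG_DOCS = [
    ["phishing", "email", "link", "credential"],
    ["tls", "handshake", "encryption", "failure"],
    ["credential", "phishing", "login", "email"],
    ["encryption", "key", "failure", "tls"],
]

if __name__ == "__main__":
    theta, top_words = lda_gibbs(LOG_DOCS, num_topics=2)
    for t, words in enumerate(top_words):
        print("topic", t, words)
```

In practice one would use a maintained implementation such as the Gensim framework named in the abstract; the sketch only makes the sampling step visible.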

Article 107

Title@2025-06-26 (4): ELFuzz: Efficient Input Generation via LLM-driven Synthesis Over Fuzzer Space

Title: ELFuzz: Efficient Input Generation via LLM-driven Synthesis Over Fuzzer Space

ELFuzz: Effiziente Input-Generierung über LLM-gesteuerte Synthese über Fuzzer-Raum

ELFuzz:通过LLM驱动的模糊空间综合合成有效投入生成 2506.10323v2

Authors (3): Chuyang Chen, Brendan Dolan-Gavitt, Zhiqiang Lin

Generation-based fuzzing produces appropriate test cases according to specifications of input grammars and semantic constraints in order to test systems and software. However, these specifications require significant manual effort to construct. This paper proposes a new approach, ELFuzz (Evolution Through Large Language Models for Fuzzing), that automatically synthesizes generation-based fuzzers tailored to a system under test (SUT) via LLM-driven synthesis over fuzzer space. At a high level, it starts with minimal seed fuzzers and propels the synthesis by fully automated LLM-driven evolution with coverage guidance. Compared to previous approaches, ELFuzz can 1) seamlessly scale to SUTs of real-world sizes (up to 1,791,104 lines of code in our evaluation) and 2) synthesize efficient fuzzers that capture interesting grammatical structures and semantic constraints in a human-understandable way. Our evaluation compared ELFuzz with specifications manually written by domain experts and with specifications synthesized by state-of-the-art approaches. It shows that ELFuzz achieves up to 434.8% more coverage and triggers up to 174.0% more artificially injected bugs. We also used ELFuzz to conduct a real-world fuzzing campaign on the newest version of cvc5 for 14 days, and encouragingly, it found five 0-day bugs (three are exploitable). Moreover, we conducted an ablation study, which shows that the fuzzer space model, the key component of ELFuzz, contributes the most (up to 62.5%) to the effectiveness of ELFuzz. Further analysis of the fuzzers synthesized by ELFuzz confirms that they capture interesting grammatical structures and semantic constraints in a human-understandable way. The results show the promising potential of ELFuzz for more automated, efficient, and extensible input generation for fuzzing.

基于生成的模糊测试根据输入语法规范和语义约束生成合适的测试用例，用于测试系统和软件。然而，构建这些规范需要大量人工。本文提出一种新方法 ELFuzz(Evolution Through Large Language Models for Fuzzing)，通过 LLM 驱动的模糊器空间合成，自动合成针对被测系统(SUT)定制的基于生成的模糊器。概括而言，它从最小的种子模糊器出发，在覆盖率引导下通过完全自动化的 LLM 驱动演化推进合成。与以往方法相比，ELFuzz 能够 1) 无缝扩展到真实规模的 SUT(在我们的评估中最多达 1,791,104 行代码)，以及 2) 合成高效的模糊器，以人类可理解的方式捕捉有意义的语法结构和语义约束。评估将 ELFuzz 与领域专家手工编写的规范以及最先进方法合成的规范进行了比较，结果显示 ELFuzz 的覆盖率最多提高 434.8%，触发的人工注入缺陷最多增加 174.0%。我们还使用 ELFuzz 对最新版本的 cvc5 进行了为期 14 天的真实模糊测试，发现了五个 0-day 缺陷(其中三个可被利用)。消融研究表明，ELFuzz 的关键组件模糊器空间模型对其有效性贡献最大(最多 62.5%)。
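For readers unfamiliar with the term used in the abstract above, a generation-based fuzzer derives inputs from a grammar rather than mutating existing inputs. The sketch below shows that idea for a toy arithmetic grammar. It does not reproduce ELFuzz's LLM-driven synthesis or its coverage guidance; the `GRAMMAR` table and the `generate` helper are assumptions made for illustration.

```python
import random

# Toy grammar (hypothetical). The first production of every rule is the
# shortest one, so the generator can always terminate by choosing it.
GRAMMAR = {
    "expr": [["term"], ["term", "+", "expr"], ["term", "-", "expr"]],
    "term": [["number"], ["(", "expr", ")"]],
    "number": [["0"], ["1"], ["7"], ["42"]],
}


def generate(symbol, rng, depth=0, max_depth=6):
    """Expand `symbol` into a random string derived from GRAMMAR."""
    if symbol not in GRAMMAR:
        return symbol  # terminal token
    options = GRAMMAR[symbol]
    if depth >= max_depth:
        options = options[:1]  # force the shortest production near the depth limit
    production = rng.choice(options)
    return "".join(generate(part, rng, depth + 1, max_depth) for part in production)


if __name__ == "__main__":
    rng = random.Random(1)
    for _ in range(5):
        print(generate("expr", rng))
```

Every generated string is syntactically valid by construction, which is what lets this style of fuzzer reach code behind the parser of the system under test.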

Article 108

Title@2025-06-26 (4): Estimating Correctness Without Oracles in LLM-Based Code Generation

Title: Estimating Correctness Without Oracles in LLM-Based Code Generation

Schätzung der Korrektheit ohne Oracles in der LLM-basierten Code-Generierung

在基于LLM的代码生成中估算无甲骨文的正确性 2507.00057v1

Authors (4): Thomas Valentin, Ardi Madadi, Gaetano Sapia, Marcel Böhme

Generating code from natural language specifications is one of the most successful applications of Large Language Models (LLMs). Yet LLMs hallucinate: they produce outputs that may be grammatically correct but are factually incorrect. Without an existing, correct implementation (i.e., an oracle), can we quantify how likely the generated program is to be correct? In this paper, we propose a measure of incorrectness, called incoherence, that can be estimated efficiently in the absence of an oracle and provides a lower bound on the error, i.e., the probability that the LLM-generated program for that specification is incorrect. Our experiments demonstrate extraordinary effectiveness. For the average code generation task, our incoherence-based methodology can automatically identify about two-thirds of incorrect programs without reports of false positives. In fact, an oracle-based evaluation of LLMs can be reliably replaced by an incoherence-based evaluation. In particular, we find a very strong agreement between the ranking of LLMs by the number of programs deemed correct via an oracle (pass@1) and the ranking of LLMs by the number of programs deemed correct via our incoherence.

从自然语言规格中生成代码是大语言模型(LLMs)最成功的应用之一。然而,它们却产生幻觉:LLMs产出的文法可能正确,但实际上不正确。没有现有的、正确的执行(即甲骨文),我们能否量化生成的程序是否正确?在本文中,我们提出了一个称为不一致性的不正确度,在没有甲骨文的情况下可以有效估计,并且对错误的等级限制较低,即LLMs生成的规格程序不正确的可能性。我们的实验显示了一种非同寻常的效果。对于平均代码生成任务,我们基于不协调的方法可以自动识别出大约三分之二不正确的程序,而没有虚假的正数报告。事实上,基于甲骨肉的LMs评价可以可靠地被以不连贯为基础的评价所取代。特别是,我们发现在LMs的等级与通过一个甲骨文(pass@1)被认为正确的程序数目之间的非常强烈的一致。
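The abstract above does not give the formal definition of incoherence, so the following is only one plausible reading, stated as an assumption: sample several candidate programs for the same task and measure how often two of them behave differently on shared inputs, with no reference implementation involved. The names `estimate_incoherence`, `observe`, and the toy candidates are hypothetical.

```python
import itertools


def observe(program, value):
    """Run a candidate and capture either its result or the exception type."""
    try:
        return ("ok", program(value))
    except Exception as exc:  # a crash is also observable behaviour
        return ("error", type(exc).__name__)


def behave_differently(first, second, inputs):
    """True if the two candidates disagree on at least one input."""
    return any(observe(first, x) != observe(second, x) for x in inputs)


def estimate_incoherence(candidates, inputs):
    """Fraction of candidate pairs that disagree on the given inputs."""
    pairs = list(itertools.combinations(candidates, 2))
    if not pairs:
        return 0.0
    disagreeing = sum(behave_differently(f, g, inputs) for f, g in pairs)
    return disagreeing / len(pairs)


# Three stand-ins for programs sampled for the task "absolute value".
CANDIDATES = [
    abs,
    lambda x: x if x > 0 else -x,
    lambda x: x,  # buggy: ignores the sign
]

if __name__ == "__main__":
    print(estimate_incoherence(CANDIDATES, range(-5, 6)))
```

The intuition behind the sketch: if two candidates disagree, at least one of them must be wrong, so disagreement carries information about error even without an oracle.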

Article 109

Title@2025-06-26 (4): FuzzAug: Data Augmentation by Coverage-guided Fuzzing for Neural Test Generation

Title: FuzzAug: Data Augmentation by Coverage-guided Fuzzing for Neural Test Generation

FuzzAug: Datenvergrößerung durch Coverage-geführtes Fuzzing für die neurale Testgeneration

FuzzAug: 神经测试生成时通过覆盖制导模糊增加数据 2406.08665v3

Authors (4): Yifeng He, Jicheng Wang, Yuyang Rong, Hao Chen

Testing is essential to modern software engineering for building reliable software. Given the high costs of manually creating test cases, automated test case generation, particularly methods utilizing large language models, has become increasingly popular. These neural approaches generate semantically meaningful tests that are more maintainable than those produced by traditional automatic testing methods like fuzzing. However, the diversity and volume of unit tests in current datasets are limited, especially for newer but important languages. In this paper, we present a novel data augmentation technique, FuzzAug, that brings the benefits of fuzzing to large language models by introducing valid testing semantics and providing diverse coverage-guided inputs. By doubling the size of the training datasets, FuzzAug significantly improves performance over the baselines. This technique demonstrates the potential of introducing prior knowledge from dynamic software analysis to improve neural test generation.

由于人工创建测试案例的成本高昂,自动测试案例的生成,特别是使用大语言模型的方法,越来越受欢迎。这些神经方法产生比像模糊等传统自动测试方法更能维持的具有词义意义的测试。然而,当前数据集中单位测试的多样性和数量有限,特别是对于较新但重要的语言而言。在本文件中,我们介绍了一种新的数据增强技术FuzzAug,它通过引入有效的测试语义和提供多种覆盖指导投入,为大型语言模型引入了模糊化的好处。将培训数据集的规模扩大一倍,FuzzAug从基线上大大改进了性能。这一技术展示了从动态软件分析引进先前知识的潜力,以改善神经测试生成,为神经测试生成提供重要的改进。

Article 110

Title@2025-06-26 (4): Performance Prediction for Large Systems via Text-to-Text Regression

Title: Performance Prediction for Large Systems via Text-to-Text Regression

Leistungsvorhersage für große Systeme über Text-zu-Text-Regression

通过文字到文字倒退对大型系统的性能预测 2506.21718v1

Authors (10): Yash Akhauri, Bryan Lewandowski, Cheng-Hsi Lin, Adrian N. Reyes, Grant C. Forbes, Arissa Wongpanich, Bangding Yang, Mohamed S. Abdelfattah, Sagi Perel, Xingyou Song

In many industries, predicting metric outcomes of large systems is a fundamental problem, driven largely by traditional tabular regression. However, such methods struggle with complex systems data in the wild, such as configuration files or system logs, where feature engineering is often infeasible. We propose text-to-text regression as a general, scalable alternative. For predicting resource efficiency on Borg, Google’s massive compute cluster scheduling system, a 60M-parameter encoder-decoder, trained from random initialization, achieves up to a near perfect 0.99 (0.9 average) rank correlation across the entire fleet, and 100x lower MSE than tabular approaches. The model also easily adapts to new tasks with only 500 few-shot examples and captures the densities of complex outcome distributions. Ablation studies highlight the importance of using encoders, increasing sequence length, and the model’s inherent uncertainty quantification. These findings pave the way for universal simulators of real-world outcomes.

在许多行业，预测大型系统的指标结果是一个根本问题，目前主要依靠传统的表格回归。然而，这类方法难以处理真实环境中的复杂系统数据，如配置文件或系统日志，在这些数据上特征工程往往不可行。我们提出文本到文本回归，作为一种通用、可扩展的替代方法。在预测谷歌大规模计算集群调度系统 Borg 的资源效率时，一个从随机初始化开始训练的 60M 参数编码器-解码器在整个集群中达到了接近完美的 0.99(平均 0.9)等级相关性，MSE 比表格方法低 100 倍。该模型还只需 500 个少样本示例就能轻松适应新任务，并能捕捉复杂结果分布的密度。消融研究强调了使用编码器、增加序列长度以及模型固有的不确定性量化的重要性。这些发现为现实世界结果的通用模拟器铺平了道路。
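The two headline metrics in the abstract above are rank correlation and mean squared error (MSE). The abstract does not say which rank coefficient is used; the sketch below assumes Spearman's coefficient, computed as the Pearson correlation of average ranks. The function names and sample values are made up for illustration.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation between two equally long sequences."""
    return pearson(average_ranks(xs), average_ranks(ys))


def mse(predicted, actual):
    """Mean squared error between predictions and ground truth."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


if __name__ == "__main__":
    predicted = [0.2, 0.5, 0.9, 0.4]  # made-up model outputs
    actual = [0.1, 0.6, 0.8, 0.3]     # made-up measured efficiencies
    print(spearman(predicted, actual), mse(predicted, actual))
```

A model can have a perfect rank correlation while still having non-zero MSE, as in this example, which is why the abstract reports both.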

Article 111

Title@2025-06-26 (4): Using Generative AI in Software Design Education: An Experience Report

Title: Using Generative AI in Software Design Education: An Experience Report

Generative KI in der Software Design Education: Ein Erfahrungsbericht

在软件设计教育中使用生成式AI:经验报告 2506.21703v1

Authors (3): Victoria Jackson, Susannah Liu, Andre van der Hoek

With the rapid adoption of Generative AI (GenAI) tools, software engineering educators have grappled with how best to incorporate them into the classroom. While some research discusses the use of GenAI in the context of learning to code, there is little research that explores the use of GenAI in the classroom for other areas of software development. This paper provides an experience report on introducing GenAI into an undergraduate software design class. Students were required to use GenAI (in the form of ChatGPT) to help complete a team-based assignment. The data collected consisted of the ChatGPT conversation logs and students’ reflections on using ChatGPT for the assignment. Subsequently, qualitative analysis was undertaken on the data. Students identified numerous ways ChatGPT helped them in their design process while recognizing the need to critique the response before incorporating it into their design. At the same time, we identified several key lessons for educators in how to deploy GenAI in a software design class effectively. Based on our experience, we believe students can benefit from using GenAI in software design education as it helps them design and learn about the strengths and weaknesses of GenAI.

由于迅速采用创用AI(GenAI)工具,软件工程教育者一直在努力研究如何最好地将GenAI纳入课堂,虽然一些研究讨论了GenAI在学习编码方面的使用情况,但几乎没有研究探索GenAI在课堂上用于软件开发的其他领域;本文件提供了关于将GenAI引入本科软件设计班的经验报告;学生们需要使用GenAI(以查特格波特为形式)帮助完成一个团队任务;所收集的数据包括查特格博特对话日志和学生们对如何使用ChatGPT来完成这项任务的思考;随后,对数据进行了定性分析;学生们查明了在设计过程中帮助他们的许多方法,同时认识到在将GentGPT纳入软件设计前需要批评对策;同时,我们为教育工作者确定了如何将GenAI有效应用在软件设计班上的一些关键经验教训;根据我们的经验,我们认为,学生们可以受益于利用GenAI进行软件设计教育,以帮助他们设计和了解GenAI的长处和弱点。

Article 112

Title@2025-06-26 (4): The DevSafeOps Dilemma: A Systematic Literature Review on Rapidity in Safe Autonomous Driving Development and Operation

Title: The DevSafeOps Dilemma: A Systematic Literature Review on Rapidity in Safe Autonomous Driving Development and Operation

Das DevSafeOps Dilemma: Ein systematischer Literaturbericht zur Schnelligkeit in der sicheren autonomen Fahrentwicklung und -operation

DevSafeOps Dilemma:关于安全自动驾驶开发与运行中快速性的系统性文献综述 2506.21693v1

Authors (4): Ali Nouri, Beatriz Cabrero-Daniel, Fredrik Törner, Christian Berger

Developing autonomous driving (AD) systems is challenging due to the complexity of the systems and the need to assure their safe and reliable operation. The widely adopted approach of DevOps seems promising to support the continuous technological progress in AI and the demand for fast reaction to incidents, which necessitate continuous development, deployment, and monitoring. We present a systematic literature review meant to identify, analyse, and synthesise a broad range of existing literature related to usage of DevOps in autonomous driving development. Our results provide a structured overview of challenges and solutions, arising from applying DevOps to safety-related AI-enabled functions. Our results indicate that there are still several open topics to be addressed to enable safe DevOps for the development of safe AD.

发展自主驱动系统(AD)具有挑战性,因为系统复杂,需要确保其安全可靠运作。广泛采用的DevOps方法似乎有希望支持AI的不断技术进步和对事件作出迅速反应的需求,这需要持续开发、部署和监测。我们提出了一个系统化的文献审查,旨在确定、分析和综合与在自主驱动发展中使用DevOps有关的大量现有文献。我们的结果为将DevOps应用到与安全相关的AI驱动功能中产生的挑战和解决办法提供了结构化概览。我们的结果表明,仍有若干开放的主题有待解决,以便安全DevOps能够发展安全的ADD。

Article 113

Title@2025-06-26 (4): Experience converting a large mathematical software package written in C++ to C++20 modules

Title: Experience converting a large mathematical software package written in C++ to C++20 modules

Erfahrung beim Konvertieren eines großen mathematischen Softwarepakets in C++ in C++20 Module

将用 C++ 编写的大型数学软件包转换为 C++20 模块的经验 2506.21654v1

Authors (1): Wolfgang Bangerth

Mathematical software has traditionally been built in the form of “packages” that build on each other. A substantial fraction of these packages is written in C++ and, as a consequence, the interface of a package is described in the form of header files that downstream packages and applications can then #include. C++ has inherited this approach towards exporting interfaces from C, but the approach is clunky, unreliable, and slow. As a consequence, C++20 has introduced a “module” system in which packages explicitly export declarations and code that compilers then store in machine-readable form and that downstream users can “import”, a system in line with what many other programming languages have used for decades. Herein, I explore how one can convert large mathematical software packages written in C++ to this system, using the deal.II finite element library with its around 800,000 lines of code as an example. I describe an approach that allows providing both header-based and module-based interfaces from the same code base, discuss the challenges one encounters, and evaluate how modules actually work in practice using a variety of technical and human metrics. The results show that with a non-trivial, but also not prohibitive effort, the conversion to modules is possible, resulting in a reduction in compile time for the converted library itself; on the other hand, for downstream projects, compile times show no clear trend. I end with thoughts about long-term strategies for converting the entire ecosystem of mathematical software over the coming years or decades.

数学软件传统上以相互依赖的“软件包”形式构建。这些软件包中有相当一部分用 C++ 编写，因此软件包的接口以头文件的形式描述，下游软件包和应用程序可以通过 #include 引入。C++ 从 C 继承了这种导出接口的方式，但这种方式笨拙、不可靠且缓慢。因此，C++20 引入了“模块”系统：软件包显式导出声明和代码，编译器以机器可读的形式存储它们，下游用户可以“导入”，这与许多其他编程语言数十年来使用的系统一致。本文以约 80 万行代码的 deal.II 有限元库为例，探讨如何将用 C++ 编写的大型数学软件包转换到该系统。我描述了一种可以从同一代码库同时提供基于头文件和基于模块的接口的方法，讨论了遇到的挑战，并用多种技术和人员方面的指标考察模块在实践中的实际效果。结果表明，转换到模块需要付出不小但并非高不可攀的工作量，转换后库本身的编译时间有所减少；而对于下游项目，编译时间没有明显趋势。最后，我对未来数年或数十年转换整个数学软件生态系统的长期策略提出了一些思考。

Article 114

Title@2025-06-26 (4): Large Language Model-Powered Agent for C to Rust Code Translation

Title: Large Language Model-Powered Agent for C to Rust Code Translation

Large Language Model-Powered Agent für C to Rust Code Übersetzung

C至Rust 代码翻译的大型语言示范授权代理 2505.15858v2

Authors (6): HoHyun Sim, Hyeonjoong Cho, Yeonghyeon Go, Zhoulai Fu, Ali Shokri, Binoy Ravindran

The C programming language has been foundational in building system-level software. However, its manual memory management model frequently leads to memory safety issues. In response, a modern system programming language, Rust, has emerged as a memory-safe alternative. Moreover, automating C-to-Rust translation, empowered by rapid advances in the generative capabilities of LLMs, is attracting growing interest for large volumes of legacy C code. Despite some success, existing LLM-based approaches have constrained the role of LLMs to static prompt-response behavior and have not explored their agentic problem-solving capability. Applying the LLM agentic capability to C-to-Rust translation introduces distinct challenges, as this task differs from traditional LLM agent applications, such as math or commonsense QA domains. First, the scarcity of parallel C-to-Rust datasets hinders the retrieval of suitable code translation exemplars for in-context learning. Second, unlike math or commonsense QA, the intermediate steps required for C-to-Rust are not well-defined. Third, it remains unclear how to organize and cascade these intermediate steps to construct a correct translation trajectory. To address these challenges in C-to-Rust translation, we propose a novel intermediate step, the Virtual Fuzzing-based equivalence Test (VFT), and an agentic planning framework, the LLM-powered Agent for C-to-Rust code translation (LAC2R). The VFT guides LLMs to identify input arguments that induce divergent behaviors between an original C function and its Rust counterpart and to generate informative diagnoses to refine the unsafe Rust code. LAC2R uses Monte Carlo tree search (MCTS) to systematically organize the LLM-induced intermediate steps for correct translation. We experimentally demonstrated that LAC2R effectively conducts C-to-Rust translation on large-scale, real-world benchmarks.

C编程语言是建立系统级软件的基础语言。然而,其人工内存管理模式经常导致记忆安全问题。作为回应,现代系统编程语言Rust(Rust)已成为一种耐记忆的替代方案。此外,由于LLM的基因化能力迅速提高,使得C-Rust翻译自动化起来。尽管取得了一些成功,但基于LLMst的现有方法限制了LLM(LLM)的作用,使其成了静态的快速反应行为,而没有探索其中间解决问题的能力。在C-Rst翻译中应用LLM(LM)代理能力,带来了不同的挑战,因为这一任务不同于传统的LLM代理应用程序,例如数学或普通QA域。首先,平行C-Rust数据集的缺乏,阻碍了适当的C(LM)代码翻译的检索。第二,与数学或普通QA(Commerical-R)的QA(C-RM),为C-RM(LM)的原始-RM(LM-R-RM)的解算法解算的中间步骤没有很好地界定。第三,它仍然不清楚如何组织和不断组织和升级的翻译。
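The abstract above describes the VFT as identifying input arguments on which an original C function and its Rust counterpart diverge. In the paper this step is guided by an LLM; the sketch below swaps in plain random differential testing to show the underlying equivalence check. Both functions are Python stand-ins written for illustration, and `find_divergences` is a hypothetical helper.

```python
import random


def c_abs_diff(a, b):
    """Stand-in for an original C function: |a - b| as a 32-bit unsigned value."""
    return (a - b) & 0xFFFFFFFF if a >= b else (b - a) & 0xFFFFFFFF


def rust_abs_diff_candidate(a, b):
    """Stand-in for a faulty translation: forgets the a < b branch."""
    return a - b


def outcome(function, args):
    """Capture a result or the exception type so crashes can be compared too."""
    try:
        return ("ok", function(*args))
    except Exception as exc:
        return ("panic", type(exc).__name__)


def find_divergences(reference, candidate, low, high, trials=200, seed=0):
    """Randomly sample argument pairs and report those where behaviour differs."""
    rng = random.Random(seed)
    reports = []
    for _ in range(trials):
        args = (rng.randint(low, high), rng.randint(low, high))
        expected, actual = outcome(reference, args), outcome(candidate, args)
        if expected != actual:
            reports.append({"args": args, "expected": expected, "actual": actual})
    return reports


if __name__ == "__main__":
    for report in find_divergences(c_abs_diff, rust_abs_diff_candidate, -100, 100)[:3]:
        print(report)
```

Each report pairs a concrete input with the two observed behaviours, which is the kind of diagnosis a repair step could use to refine the translated code.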

Article 115

Title@2025-06-26 (4): Anonymized Network Sensing Graph Challenge

Title: Anonymized Network Sensing Graph Challenge

Anonymisierte Network Sensing Graph Challenge

匿名网络遥感图图挑战 2409.08115v2

Authors (29): Hayden Jananthan, Michael Jones, William Arcand, David Bestor, William Bergeron, Daniel Burrill, Aydin Buluc, Chansup Byun, Timothy Davis, Vijay Gadepally, Daniel Grant, Michael Houle, Matthew Hubbell, Piotr Luszczek, Peter Michaleas, Lauren Milechin, Chasen Milner, Guillermo Morales, Andrew Morris, Julie Mullen, Ritesh Patel, Alex Pentland, Sandeep Pisharody, Andrew Prout, Albert Reuther, Antonio Rosa, Gabriel Wachman, Charles Yee, Jeremy Kepner

The MIT/IEEE/Amazon GraphChallenge encourages community approaches to developing new solutions for analyzing graphs and sparse data derived from social media, sensor feeds, and scientific data to discover relationships between events as they unfold in the field. The anonymized network sensing Graph Challenge seeks to enable large, open, community-based approaches to protecting networks. Many large-scale networking problems can only be solved with community access to very broad data sets with the highest regard for privacy and strong community buy-in. Such approaches often require community-based data sharing. In the broader networking community (commercial, federal, and academia) anonymized source-to-destination traffic matrices with standard data sharing agreements have emerged as a data product that can meet many of these requirements. This challenge provides an opportunity to highlight novel approaches for optimizing the construction and analysis of anonymized traffic matrices using over 100 billion network packets derived from the largest Internet telescope in the world (CAIDA). This challenge specifies the anonymization, construction, and analysis of these traffic matrices. A GraphBLAS reference implementation is provided, but the use of GraphBLAS is not required in this Graph Challenge. As with prior Graph Challenges the goal is to provide a well-defined context for demonstrating innovation. Graph Challenge participants are free to select (with accompanying explanation) the Graph Challenge elements that are appropriate for highlighting their innovations.

MIT/IEEE/Amazon GraphChallenge 鼓励以社区方式开发新的解决方案，用于分析来自社交媒体、传感器数据流和科学数据的图和稀疏数据，以发现现场事件展开时彼此之间的关系。匿名网络感知 Graph Challenge 力求促成大型、开放、基于社区的网络保护方法。许多大规模网络问题只有在社区能够访问非常广泛的数据集、高度重视隐私并获得社区有力支持的情况下才能解决，而这类方法往往需要基于社区的数据共享。在更广泛的网络社区(商业、联邦和学术界)中，附带标准数据共享协议的匿名源到目的地流量矩阵已成为能够满足其中许多要求的数据产品。本挑战提供了一个机会，利用来自世界最大互联网望远镜(CAIDA)的 1000 多亿个网络数据包，突出优化匿名流量矩阵构建和分析的新方法。本挑战规定了这些流量矩阵的匿名化、构建和分析方法。挑战提供了 GraphBLAS 参考实现，但不要求使用 GraphBLAS。与以往的 Graph Challenge 一样，其目标是为展示创新提供一个定义明确的环境。参与者可以自由选择适合突出其创新的 Graph Challenge 要素，并附上说明。
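To make the notion of an anonymized source-to-destination traffic matrix concrete, here is a minimal sketch that hashes addresses with a keyed HMAC, counts packets per (source, destination) pair in a sparse dictionary, and derives a few summary statistics. The challenge's actual anonymization scheme, statistics, and GraphBLAS reference implementation are not reproduced; the key, the packet list, and every function name are assumptions.

```python
import hashlib
import hmac
from collections import Counter


def anonymize(address, key):
    """Replace an address with a keyed hash (illustrative scheme only)."""
    return hmac.new(key, address.encode(), hashlib.sha256).hexdigest()[:12]


def build_traffic_matrix(packets, key):
    """Sparse matrix: (anonymized source, anonymized destination) -> packet count."""
    matrix = Counter()
    for source, destination in packets:
        matrix[(anonymize(source, key), anonymize(destination, key))] += 1
    return matrix


def summarize(matrix):
    """A few aggregate properties of the traffic matrix."""
    fan_out = Counter(source for source, _ in matrix)
    return {
        "total_packets": sum(matrix.values()),
        "unique_links": len(matrix),
        "unique_sources": len(fan_out),
        "unique_destinations": len({destination for _, destination in matrix}),
        "max_link_packets": max(matrix.values(), default=0),
        "max_source_fan_out": max(fan_out.values(), default=0),
    }


# Made-up packet headers as (source, destination) pairs.
PACKETS = [
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.1", "10.0.0.2"),
    ("10.0.0.1", "10.0.0.3"),
    ("10.0.0.4", "10.0.0.2"),
]

if __name__ == "__main__":
    print(summarize(build_traffic_matrix(PACKETS, key=b"demo-key")))
```

Because the same key maps the same address to the same label, the structure of the matrix (links, fan-out, counts) survives anonymization while the raw addresses do not appear in it.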

Article 116

Title@2025-06-26 (4): IXAII: An Interactive Explainable Artificial Intelligence Interface for Decision Support Systems

Title: IXAII: An Interactive Explainable Artificial Intelligence Interface for Decision Support Systems

IXAII: Interactive Explainable Artificial Intelligence Interface für Entscheidungsunterstützungssysteme

IXAII:决策支持系统互动解释人工智能接口 2506.21310v1

Authors (3): Pauline Speckmann, Mario Nadj, Christian Janiesch

Although several post-hoc methods for explainable AI have been developed, most are static and neglect the user perspective, limiting their effectiveness for the target audience. In response, we developed the interactive explainable intelligent system called IXAII that offers explanations from four explainable AI methods: LIME, SHAP, Anchors, and DiCE. Our prototype provides tailored views for five user groups and gives users agency over the explanations’ content and their format. We evaluated IXAII through interviews with experts and lay users. Our results indicate that IXAII, which provides different explanations with multiple visualization options, is perceived as helpful to increase transparency. By bridging the gaps between explainable AI methods, interactivity, and practical implementation, we provide a novel perspective on AI explanation practices and human-AI interaction.

虽然已经开发了几种可解释的后热方法,但大多数是静态的,忽视了用户视角,限制了用户视角,限制了用户对目标受众的实效。作为回应,我们开发了名为IXAII的互动式可解释智能系统,从四种可解释的AI方法(LIME、SHAP、Anchors和DICE)中提供了解释。我们的原型为五个用户群体提供了量身定制的观点,并为用户机构提供了解释内容和格式。我们通过与专家和普通用户的访谈对IXAII进行了评估。我们的结果表明,IXAII提供了多种可视化选项的不同解释,被认为有助于提高透明度。通过弥合可解释的AI方法、互动性和实际实施之间的差距,我们从新的视角审视了AI解释做法和人类-AI互动。

Article 117

Title@2025-06-26 (4): An object-centric core metamodel for IoT-enhanced event logs

Title: An object-centric core metamodel for IoT-enhanced event logs

Ein objektzentriertes Kernmetamodell für IoT-verstärkte Ereignisprotokolle

IoT 强化事件日志的以物体为中心的核心元元模型 2506.21300v1

Authors (13): Yannis Bertrand, Christian Imenkamp, Lukas Malburg, Matthias Ehrendorfer, Marco Franceschetti, Joscha Grüger, Francesco Leotta, Jürgen Mangler, Ronny Seiger, Agnes Koschmider, Stefanie Rinderle-Ma, Barbara Weber, Estefania Serral

Advances in Internet-of-Things (IoT) technologies have prompted the integration of IoT devices with business processes (BPs) in many organizations across various sectors, such as manufacturing, healthcare and smart spaces. The proliferation of IoT devices leads to the generation of large amounts of IoT data providing a window on the physical context of BPs, which facilitates the discovery of new insights about BPs using process mining (PM) techniques. However, to achieve these benefits, IoT data need to be combined with traditional process (event) data, which is challenging due to the very different characteristics of IoT and process data, for instance in terms of granularity levels. Recently, several data models were proposed to integrate IoT data with process data, each focusing on different aspects of data integration based on different assumptions and requirements. This fragmentation hampers data exchange and collaboration in the field of PM, e.g., making it tedious for researchers to share data. In this paper, we present a core model synthesizing the most important features of existing data models. As the core model is based on common requirements, it greatly facilitates data sharing and collaboration in the field. A prototypical Python implementation is used to evaluate the model against various use cases and demonstrate that it satisfies these common requirements.

在诸如制造、保健和智能空间等各个部门的许多组织中,互联网技术的进步促使将互联网设备与业务流程(BPs)整合在一起,这在制造、保健和智能空间等许多部门中都促使将互联网设备与业务流程(BPs)整合在一起。互联网设备的扩散导致产生了大量互联网数据,为基于业务流程物理背景的数据整合提供了一个窗口,为利用工艺采矿(PM)技术发现关于业务流程的新认识提供了便利。然而,为了实现这些效益,互联网数据需要与传统流程(活动)数据相结合,而传统流程(活动)数据则具有挑战性,因为互联网和流程数据的特点非常不同,例如在颗粒度方面。最近,提出了若干数据模型,将互联网数据与流程数据整合在一起,每个数据侧重于基于不同假设和要求的数据整合的不同方面。这种破碎妨碍了数据交换和在PMM(PM)领域的合作,例如,使研究人员难以分享数据。在本文件中,我们介绍了一个核心模型,将现有数据模型的最重要特征(例如颗粒度)加以综合,因为核心模型以共同要求为基础,有利于使用这些共同的数据共享。

Article 118

Title@2025-06-26 (4): Exploring Micro Frontends: A Case Study Application in E-Commerce

Title: Exploring Micro Frontends: A Case Study Application in E-Commerce

Erforschung von Micro Frontends: Eine Anwendungsfallstudie im E-Commerce

探索微观前沿:电子商务案例研究应用 2506.21297v1

Authors (5): Ricardo Hideki Hangai Kojo, Luiz Fernando Corte Real, Renato Cordeiro Ferreira, Thatiane de Oliveira Rosa, Alfredo Goldman

In the micro frontends architectural style, the frontend is divided into smaller components, which can range from a simple button to an entire page. The goal is to improve scalability, resilience, and team independence, albeit at the cost of increased complexity and infrastructure demands. This paper seeks to understand when it is worth adopting micro frontends, particularly in the context of industry. To achieve this, we conducted an investigation into the state of the art of micro frontends, based on both academic and gray literature. We then implemented this architectural style in a marketplace for handcrafted products, which already used microservices. Finally, we evaluated the implementation through a semi-open questionnaire with the developers. At the studied marketplace company, the need for architectural change arose due to the tight coupling between their main system (a Java monolith) and a dedicated frontend system. Additionally, there were deprecated technologies and poor developer experience. To address these issues, the micro frontends architecture was adopted, along with the API Gateway and Backend for Frontend patterns, and technologies such as Svelte and Fastify. Although the adoption of micro frontends was successful, it was not strictly necessary to meet the company’s needs. According to the analysis of the mixed questionnaire responses, other alternatives, such as a monolithic frontend, could have achieved comparable results. What made adopting micro frontends the most convenient choice in the company’s context was the monolith strangulation and microservices adoption, which facilitated implementation through infrastructure reuse and knowledge sharing between teams.

在微前端架构风格中，前端被划分为较小的组件，小到一个简单的按钮，大到整个页面。其目标是提高可扩展性、韧性和团队独立性，但代价是复杂性和基础设施需求增加。本文试图了解何时值得采用微前端，特别是在工业界背景下。为此，我们基于学术文献和灰色文献调查了微前端的最新进展。随后，我们在一个已经使用微服务的手工艺品交易平台中实现了这种架构风格。最后，我们通过面向开发者的半开放式问卷评估了实施情况。在所研究的交易平台公司中，架构变革的需求源于其主系统(Java 单体)与专用前端系统之间的紧耦合；此外还存在技术过时和开发者体验不佳的问题。为解决这些问题，公司采用了微前端架构，以及 API Gateway 和 Backend for Frontend 模式，并使用了 Svelte 和 Fastify 等技术。尽管微前端的采用是成功的，但就满足公司需求而言并非绝对必要。对问卷回答的分析表明，单体前端等其他方案也可能取得类似结果。在该公司的背景下，使微前端成为最便利选择的是正在进行的单体绞杀和微服务采用，它们通过基础设施复用和团队间知识共享促进了实施。

Article 119

Title@2025-06-26 (4): KOALA: a Configurable Tool for Collecting IDE Data When Solving Programming Tasks

Title: KOALA: a Configurable Tool for Collecting IDE Data When Solving Programming Tasks

KOALA: Ein konfigurierbares Tool zum Sammeln von IDE-Daten beim Lösen von Programmieraufgaben

KOALA: 在解决方案拟订任务时收集 IDE 数据的配置工具 2506.21266v1

Authors (6): Daniil Karol, Elizaveta Artser, Ilya Vlasov, Yaroslav Golubev, Hieke Keuning, Anastasiia Birillo

Collecting data of students solving programming tasks is incredibly valuable for researchers and educators. It allows verifying that the students correctly apply the features and concepts they are taught, or finding students’ misconceptions. However, existing data collection tools have limitations, e.g., no control over the granularity of the collected code, not collecting the specific events of the programming environment used, and overall being hard to configure. To overcome these limitations, we propose KOALA, a convenient and highly configurable tool for collecting code snapshots and feature usage from students solving programming tasks in JetBrains IDEs. The plugin can be installed in IDEs and configured to provide the students with the necessary tasks, enable or disable certain IDE features like code completion, and run surveys. During problem solving, the plugin collects code snapshots at the configured granularity, all IDE actions like running and debugging, as well as some data not collected in prior works, like employed hotkeys and switching focus between files. The collected data is sent to the server that comes with the tool, where it is stored and can be converted to the standardized ProgSnap2 format. To showcase the tool, we collected data from 28 students solving tasks in two courses within the IDE, highlighting some insights from this data.

收集完成编程任务的学生的数据对于研究人员和教育工作者来说是极其宝贵的。它能够核实学生正确应用他们所教授的特征和概念,或者发现学生的错误概念。但是, 现有的数据收集工具有局限性, 例如对所收集的代码的颗粒没有控制, 不收集所使用的编程环境的具体事件, 并且总体来说很难配置。为了克服这些局限性, 我们提议 KOALA, 这是一种方便和高度可配置的工具, 用来收集在 JeetBrains IDEs 中完成编程任务的学生的代码快照和特征使用。所收集的数据可以安装在 IDE 中, 配置该插件可以向学生提供必要的任务, 启用或禁用某些 IDE 功能, 如代码完成, 并运行调查。在解决问题过程中, 插件会收集到已配置的颗粒特性, 所有的 IDE 动作, 如运行和调试, 以及一些未在先前工作中收集的数据, 比如用过的热键和文件之间的焦点转换。所收集的数据被发送到该工具的服务器, 在那里存储, 并且可以转换成标准化的 ProgSNA 2 格式中的学生。

Article 120

Title@2025-06-26 (4): Describing Console I/O Behavior for Testing Student Submissions in Haskell

Title: Describing Console I/O Behavior for Testing Student Submissions in Haskell

Beschreibung von Console I/O-Behavior für die Prüfung von Studentenanträgen in Haskell

哈斯凯尔测试学生提交材料的I/O行为 2008.09253v2

Authors (2): Oliver Westphal, Janis Voigtländer

We present a small, formal language for specifying the behavior of simple console I/O programs. The design is driven by the concrete application case of testing interactive Haskell programs written by students. Specifications are structurally similar to lexical analysis regular expressions, but are augmented with features like global variables that track state and history of program runs, enabling expression of an interesting range of dynamic behavior. We give a semantics for our specification language based on acceptance of execution traces. From this semantics we derive a definition of the set of all traces valid for a given specification. Sampling that set enables us to mechanically check program behavior against specifications in a probabilistic fashion. Beyond testing, other possible uses of the specification language in an education context include related activities like providing more helpful feedback, generating sample solutions, and even generating random exercise tasks.

我们为指定简单的控制台 I/ O 程序的行为提供了一种小的、正式的语言。设计是由测试学生编写的交互式哈斯凯尔程序的具体应用案例驱动的。规格在结构上类似于常规语言分析, 但增加了一些特征, 比如跟踪程序运行状态和历史的全球变量, 使得能够表达一系列有趣的动态行为。我们根据接受执行痕迹, 给我们的规格语言提供了一种语义。我们从这个语义中得出了对特定规格有效的所有痕迹的定义。抽样让我们能够机械地检查程序行为与规格的概率化方式。除了测试之外, 在教育背景下, 规格语言的其他可能用途包括相关活动, 比如提供更有用的反馈, 生成样本解决方案, 甚至产生随机练习任务。
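The abstract above defines the semantics of its specification language through acceptance of execution traces. The sketch below illustrates that idea with a made-up mini-language of three step kinds (`read`, `write`, `until`) and a checker that decides whether a trace of inputs and outputs is accepted. It is not the authors' language or its Haskell implementation; all names and the tuple encoding are assumptions.

```python
def run(steps, trace, position, env):
    """Consume trace events according to `steps`; return the new position or None."""
    for step in steps:
        kind = step[0]
        if kind == "read":
            # Expect an input event and append its value to the variable's history.
            if position >= len(trace) or trace[position][0] != "in":
                return None
            env.setdefault(step[1], []).append(trace[position][1])
            position += 1
        elif kind == "write":
            # Expect an output event whose value matches the expression.
            if position >= len(trace) or trace[position] != ("out", step[1](env)):
                return None
            position += 1
        elif kind == "until":
            # Repeat the body until the condition holds. The body is assumed to
            # read input on every iteration, so the loop ends when the trace does.
            condition, body = step[1], step[2]
            while not condition(env):
                position = run(body, trace, position, env)
                if position is None:
                    return None
    return position


def accepts(spec, trace):
    """A trace is accepted if the spec consumes it exactly, with nothing left over."""
    return run(spec, trace, 0, {}) == len(trace)


# Hypothetical spec: read a count n, then read n numbers, then print their sum.
SUM_SPEC = [
    ("read", "n"),
    ("until", lambda env: len(env.get("xs", [])) == env["n"][-1], [("read", "xs")]),
    ("write", lambda env: sum(env.get("xs", []))),
]

if __name__ == "__main__":
    good = [("in", 2), ("in", 5), ("in", 7), ("out", 12)]
    bad = [("in", 2), ("in", 5), ("in", 7), ("out", 11)]
    print(accepts(SUM_SPEC, good), accepts(SUM_SPEC, bad))
```

Keeping a history per variable, rather than only the latest value, mirrors the abstract's remark that variables track the state and history of program runs.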

Article 121

Title@2025-06-26 (4): $T^3$: Multi-level Tree-based Automatic Program Repair with Large Language Models

Title: $T^3$: Multi-level Tree-based Automatic Program Repair with Large Language Models

$T^3$: Mehrstufige Baum-basierte automatische Programm-Reparatur mit großen Sprachmodellen

$T^3$:基于大语言模型的多层次树式自动程序修复 2506.21211v1

Authors (4): Quanming Liu, Xupeng Bu, Zhichao Yan, Ru Li

Automatic Program Repair (APR) is a core technology in software development and maintenance that aims to enable automated defect repair with minimal human intervention. In recent years, substantial advancements in Large Language Models (LLMs) and Chain-of-Thought (CoT) techniques have significantly enhanced the reasoning capabilities of these models. However, due to the complex logic and multi-step reasoning ability needed, the application of CoT techniques in the APR domain remains insufficient. This study systematically evaluates the performance of several common CoT techniques in APR tasks and proposes an innovative framework $T^3$, which integrates the powerful reasoning capabilities of LLMs with tree search, effectively improving the precision of generating candidate repair solutions. Furthermore, $T^3$ provides valuable guidance for optimizing sample selection and repair strategies in APR tasks, establishing a robust framework for achieving efficient automated debugging.

自动程序修复(APR)是软件开发和维护中的一项核心技术,旨在以最少的人工干预实现缺陷的自动修复。近年来,大语言模型(LLM)和思维链(CoT)技术的长足进步显著增强了这些模型的推理能力。然而,由于需要复杂的逻辑和多步推理能力,CoT 技术在 APR 领域的应用仍然不足。本研究系统地评估了几种常见 CoT 技术在 APR 任务中的表现,并提出了创新框架 $T^3$,它将 LLM 强大的推理能力与树搜索相结合,有效提高了生成候选修复方案的精确度。此外,$T^3$ 为优化 APR 任务中的样本选择和修复策略提供了有价值的指导,为实现高效的自动化调试建立了稳健的框架。

Article 122

Title@2025-06-26 (4): Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks

Title: Maintaining MTEB: Towards Long Term Usability and Reproducibility of Embedding Benchmarks

Beibehaltung des MTEB: Langfristige Nutzungsfähigkeit und Reproduzierbarkeit von Einbettungs-Benchmarks

维护MTEB:实现嵌入基准的长期可用性和可复现性 2506.21182v1

Authors (5): Isaac Chung, Imene Kerboua, Marton Kardos, Roman Solomatin, Kenneth Enevoldsen

The Massive Text Embedding Benchmark (MTEB) has become a standard evaluation platform for text embedding models. While previous work has established the core benchmark methodology, this paper focuses on the engineering aspects that ensure MTEB’s continued reproducibility and extensibility. We present our approach to maintaining robust continuous integration pipelines that validate dataset integrity, automate test execution, and assess benchmark results’ generalizability. We detail the design choices that collectively enhance reproducibility and usability. Furthermore, we discuss our strategies for handling community contributions and extending the benchmark with new tasks and datasets. These engineering practices have been instrumental in scaling MTEB to become more comprehensive while maintaining quality and, ultimately, relevance to the field. Our experiences offer valuable insights for benchmark maintainers facing similar challenges in ensuring reproducibility and usability in machine learning evaluation frameworks. The MTEB repository is available at: https://github.com/embeddings-benchmark/mteb

大规模文本嵌入基准(MTEB)已成为文本嵌入模型的标准评测平台。以往的工作已经确立了核心的基准方法,本文则侧重于确保 MTEB 持续可复现和可扩展的工程方面。我们介绍了维护稳健的持续集成流水线的方法,这些流水线用于验证数据集完整性、自动执行测试并评估基准结果的可泛化性。我们详细说明了共同提升可复现性和可用性的设计选择。此外,我们讨论了处理社区贡献以及用新任务和新数据集扩展基准的策略。这些工程实践有助于扩展 MTEB,使其更加全面,同时保持质量,并最终保持与该领域的相关性。我们的经验为在机器学习评测框架中面临类似可复现性和可用性挑战的基准维护者提供了有价值的见解。MTEB 代码库位于:https://github.com/embeddings-benchmark/mteb

Article 123

Title@2025-06-26 (4): How Good Are Synthetic Requirements ? Evaluating LLM-Generated Datasets for AI4RE

Title: How Good Are Synthetic Requirements ? Evaluating LLM-Generated Datasets for AI4RE

Wie gut sind synthetische Anforderungen ? Bewertung von LLM-generierten Datensätzen für AI4RE

合成需求有多好?评估面向 AI4RE 的 LLM 生成数据集 2506.21138v1

Authors (2): Abdelkarim El-Hajjami, Camille Salinesi

The shortage of publicly available, labeled requirements datasets remains a major barrier to advancing Artificial Intelligence for Requirements Engineering (AI4RE). While Large Language Models offer promising capabilities for synthetic data generation, systematic approaches to control and optimize the quality of generated requirements remain underexplored. This paper presents Synthline v1, an enhanced Product Line approach for generating synthetic requirements data that extends our earlier v0 version with advanced generation strategies and curation techniques. We investigate four research questions assessing how prompting strategies, automated prompt optimization, and post-generation curation affect data quality across four classification tasks: defect detection, functional vs. non-functional, quality vs. non-quality, and security vs. non-security. Our evaluation shows that multi-sample prompting significantly boosts both utility and diversity over single-sample generation, with F1-score gains from 6 to 44 points. The use of PACE (Prompt Actor-Critic Editing) for automated prompt optimization yields task-dependent results, greatly improving functional classification (+32.5 points) but reducing performance on others. Interestingly, similarity-based curation improves diversity but often harms classification performance, indicating that some redundancy may help ML models. Most importantly, our results show that synthetic requirements can match or outperform human-authored ones for specific tasks, with synthetic data surpassing human data for security (+7.8 points) and defect classification (+15.4 points). These findings offer practical insights for AI4RE and chart a viable path to mitigating dataset scarcity through systematic synthetic generation.

公开可用的带标注需求数据集的短缺,仍然是推进面向需求工程的人工智能(AI4RE)的主要障碍。虽然大语言模型为合成数据生成提供了很有前景的能力,但控制和优化所生成需求质量的系统方法仍未得到充分探索。本文提出 Synthline v1,这是一种用于生成合成需求数据的增强型产品线方法,它以先进的生成策略和筛选技术扩展了我们先前的 v0 版本。我们研究了四个研究问题,评估提示策略、自动提示优化和生成后筛选如何影响四个分类任务的数据质量:缺陷检测、功能性与非功能性、质量与非质量、安全与非安全。评估表明,与单样本生成相比,多样本提示显著提升了效用和多样性,F1 分数提高 6 至 44 个点。使用 PACE(Prompt Actor-Critic Editing)进行自动提示优化的结果因任务而异:功能性分类大幅提升(+32.5 个点),但其他任务的性能下降。有趣的是,基于相似度的筛选提高了多样性,却常常损害分类性能,这表明一定的冗余可能对机器学习模型有帮助。最重要的是,结果表明在特定任务上合成需求可以媲美甚至超过人工编写的需求,合成数据在安全分类(+7.8 个点)和缺陷分类(+15.4 个点)上超过了人工数据。这些发现为 AI4RE 提供了实用的见解,并指明了通过系统化的合成生成来缓解数据集稀缺的可行路径。

Article 124

Title@2025-06-26 (4): SceneGenAgent: Precise Industrial Scene Generation with Coding Agent

Title: SceneGenAgent: Precise Industrial Scene Generation with Coding Agent

SceneGenAgent: Präzise industrielle Szenegenerierung mit Coding Agent

SceneGenAgent:借助编码智能体生成精确的工业场景 2410.21909v3

Authors (8): Xiao Xia, Dan Zhang, Zibo Liao, Zhenyu Hou, Tianrui Sun, Jing Li, Ling Fu, Yuxiao Dong

The modeling of industrial scenes is essential for simulations in industrial manufacturing. While large language models (LLMs) have shown significant progress in generating general 3D scenes from textual descriptions, generating industrial scenes with LLMs poses a unique challenge due to their demand for precise measurements and positioning, requiring complex planning over spatial arrangement. To address this challenge, we introduce SceneGenAgent, an LLM-based agent for generating industrial scenes through C# code. SceneGenAgent ensures precise layout planning through a structured and calculable format, layout verification, and iterative refinement to meet the quantitative requirements of industrial scenarios. Experiment results demonstrate that LLMs powered by SceneGenAgent exceed their original performance, reaching up to 81.0% success rate in real-world industrial scene generation tasks and effectively meeting most scene generation requirements. To further enhance accessibility, we construct SceneInstruct, a dataset designed for fine-tuning open-source LLMs to integrate into SceneGenAgent. Experiments show that fine-tuning open-source LLMs on SceneInstruct yields significant performance improvements, with Llama3.1-70B approaching the capabilities of GPT-4o. Our code and data are available at https://github.com/THUDM/SceneGenAgent .

工业场景建模对于工业制造中的仿真至关重要。虽然大语言模型(LLM)在根据文本描述生成通用 3D 场景方面取得了显著进展,但用 LLM 生成工业场景带来了独特的挑战,因为这类场景要求精确的尺寸和定位,需要对空间布局进行复杂的规划。为应对这一挑战,我们提出 SceneGenAgent,这是一个基于 LLM、通过 C# 代码生成工业场景的智能体。SceneGenAgent 通过结构化且可计算的格式、布局验证和迭代优化来确保精确的布局规划,以满足工业场景的定量要求。实验结果表明,由 SceneGenAgent 驱动的 LLM 超过了其原有性能,在真实工业场景生成任务中成功率最高达到 81.0%,并有效满足了大多数场景生成要求。为进一步提高可及性,我们构建了 SceneInstruct,这是一个用于微调开源 LLM 以便将其集成到 SceneGenAgent 中的数据集。实验表明,在 SceneInstruct 上微调开源 LLM 能带来显著的性能提升,其中 Llama3.1-70B 接近 GPT-4o 的能力。我们的代码和数据可在 https://github.com/THUDM/SceneGenAgent 获取。

Article 125

Title@2025-06-26 (4): Boosting Vulnerability Detection with Inter-function Multilateral Association Insights

Title: Boosting Vulnerability Detection with Inter-function Multilateral Association Insights

Förderung der Erkennung von Schwachstellen durch multilaterale Integrations-Insights zwischen den Funktionen

利用函数间多边关联洞察提升漏洞检测 2506.21014v1

Authors (3): Shaojian Qiu, Mengyang Huang, Jiahao Cheng

Vulnerability detection is a crucial yet challenging technique for ensuring the security of software systems. Currently, most deep learning-based vulnerability detection methods focus on stand-alone functions, neglecting the complex inter-function interrelations, particularly the multilateral associations. This oversight can cause vulnerabilities that reside in these interrelations to go undetected. To address this gap, we present an Inter-Function Multilateral Association analysis framework for Vulnerability Detection (IFMA-VD). The cornerstone of IFMA-VD lies in constructing a code behavior hypergraph and utilizing hyperedge convolution to extract multilateral association features. Specifically, we first parse functions into a code property graph to generate intra-function features. Following this, we construct a code behavior hypergraph by segmenting the program dependency graph to isolate and encode behavioral features into hyperedges. Finally, we utilize a hypergraph network to capture the multilateral association knowledge for augmenting vulnerability detection. We evaluate IFMA-VD on three widely used vulnerability datasets and demonstrate improvements in F-measure and Recall compared to baseline methods. Additionally, we illustrate that multilateral association features can boost code feature representation and validate the effectiveness of IFMA-VD on real-world datasets.

漏洞检测是保障软件系统安全的一项关键而又具有挑战性的技术。目前,大多数基于深度学习的漏洞检测方法只关注独立的函数,忽视了复杂的函数间相互关系,尤其是多边关联。这种疏漏可能导致无法检测出存在于这些相互关系中的漏洞。为填补这一空白,我们提出了一个面向漏洞检测的函数间多边关联分析框架(IFMA-VD)。IFMA-VD 的基石在于构建代码行为超图,并利用超边卷积提取多边关联特征。具体来说,我们首先将函数解析为代码属性图,以生成函数内特征。随后,我们对程序依赖图进行切分,将行为特征分离出来并编码为超边,从而构建代码行为超图。最后,我们利用超图网络捕获多边关联知识,以增强漏洞检测。我们在三个广泛使用的漏洞数据集上评估了 IFMA-VD,结果显示其 F 值和召回率相比基线方法均有提升。此外,我们还说明了多边关联特征能够增强代码特征表示,并在真实数据集上验证了 IFMA-VD 的有效性。
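
To make the term "hyperedge convolution" in the abstract above concrete, here is a minimal sketch of one node-to-hyperedge-to-node aggregation round. The abstract does not specify the exact operator, so this is an assumption: a simplified, unweighted mean aggregation without learnable weights or a nonlinearity. The function name `hyperedge_convolution` and the toy data are hypothetical.

```python
def hyperedge_convolution(features, hyperedges):
    """One round of mean aggregation over a hypergraph.

    features:   list of feature vectors, one per node (e.g. one per function)
    hyperedges: list of node-index lists; each hyperedge may join many nodes
    """
    dim = len(features[0])

    # Step 1: every hyperedge averages the features of its member nodes.
    edge_feats = [
        [sum(features[v][k] for v in members) / len(members) for k in range(dim)]
        for members in hyperedges
    ]

    # Step 2: every node averages the features of the hyperedges it belongs to.
    out = []
    for v in range(len(features)):
        incident = [e for e, members in enumerate(hyperedges) if v in members]
        if not incident:
            out.append(list(features[v]))  # isolated node keeps its own features
        else:
            out.append(
                [sum(edge_feats[e][k] for e in incident) / len(incident) for k in range(dim)]
            )
    return out


# Three nodes, two hyperedges sharing node 1.
features = [[1.0], [3.0], [5.0]]
hyperedges = [[0, 1], [1, 2]]
print(hyperedge_convolution(features, hyperedges))  # [[2.0], [3.0], [4.0]]
```

Because a hyperedge can connect more than two nodes, a single round mixes information among all functions that share a behavior, which is the sense in which the associations are multilateral rather than pairwise.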

Article 126

Title@2025-06-26 (4): ToolScan: A Benchmark for Characterizing Errors in Tool-Use LLMs

Title: ToolScan: A Benchmark for Characterizing Errors in Tool-Use LLMs

ToolScan: Ein Benchmark für die Charakterisierung von Fehlern in Tool-Use LLMs

ToolScan:刻画工具使用型 LLM 错误的基准 2411.13547v2

Authors (18): Shirley Kokane, Ming Zhu, Tulika Awalgaonkar, Jianguo Zhang, Thai Hoang, Akshara Prabhakar, Zuxin Liu, Tian Lan, Liangwei Yang, Juntao Tan, Rithesh Murthy, Weiran Yao, Zhiwei Liu, Juan Carlos Niebles, Huan Wang, Shelby Heinecke, Caiming Xiong, Silvio Savarese

Evaluating Large Language Models (LLMs) is one of the most critical aspects of building a performant compound AI system. Since the output from LLMs propagates to downstream steps, identifying LLM errors is crucial to system performance. A common task for LLMs in AI systems is tool use. While there are several benchmark environments for evaluating LLMs on this task, they typically only give a success rate without any explanation of the failure cases. To solve this problem, we introduce TOOLSCAN, a new benchmark to identify error patterns in LLM output on tool-use tasks. Our benchmark data set comprises queries from diverse environments that can be used to test for the presence of seven newly characterized error patterns. Using TOOLSCAN, we show that even the most prominent LLMs exhibit these error patterns in their outputs. Researchers can use these insights from TOOLSCAN to guide their error mitigation strategies.

评估大语言模型(LLM)是构建高性能复合 AI 系统最关键的环节之一。由于 LLM 的输出会传播到下游步骤,识别 LLM 的错误对系统性能至关重要。工具使用是 LLM 在 AI 系统中的一项常见任务。虽然已有若干基准环境用于评估 LLM 在该任务上的表现,但它们通常只给出成功率,而不对失败案例作任何解释。为解决这一问题,我们提出 TOOLSCAN,这是一个用于识别 LLM 在工具使用任务上输出的错误模式的新基准。我们的基准数据集由来自不同环境的查询组成,可用于检测七种新刻画的错误模式是否存在。借助 TOOLSCAN,我们发现即使是最知名的 LLM 的输出中也存在这些错误模式。研究人员可以利用 TOOLSCAN 提供的这些见解来指导其错误缓解策略。