cs.SE @ 2025-07-18: 157

07-17 (4)

SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks

SWE-MERA: Ein dynamischer Benchmark für die Agentik-Bewertung großer Sprachmodelle in Software-Engineering-Aufgaben

SWE-MERA: 积极评价软件工程任务大语言模型的动态基准

2507.11059v2

07-17

Detecting LLM-generated Code with Subtle Modification by Adversarial Training

LLM-generierter Code mit subtiler Änderung durch Adversarial Training erkennen

检测通过反向培训进行精细修改的LLM生成代码

2507.13123v1

07-17

Inferring Attributed Grammars from Parser Implementations

Zugeschriebene Grammatiken aus Parser-Implementierungen ableiten

从剖析器执行中推断出属性语法

2507.13117v1

07-17

A Conceptual Framework for Requirements Engineering of Pretrained-Model-Enabled Systems

Ein konzeptioneller Rahmen für die Anforderungsentwicklung von vortrainierten modellgebundenen Systemen

预先培训的、采用模式的系统工程要求概念框架

2507.13095v1

07-17

MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks

MERA-Code: Ein einheitliches Framework zur Bewertung der Codegenerierung von Aufgaben

MERA 守则:一个统一框架,用于评估不同任务制定守则的情况

2507.12284v2

07-17

iReDev: A Knowledge-Driven Multi-Agent Framework for Intelligent Requirements Development

iReDev: Ein wissensgestütztes Multi-Agent-Rahmenwerk für intelligente Anforderungsentwicklung

iReDev:开发智能要求的知识开发多机构框架

2507.13081v1

07-17

Write Your Own CodeChecker: An Automated Test-Driven Checker Development Approach with LLMs

Schreiben Sie Ihren eigenen CodeChecker: Ein automatisierter Test-Driven Checker-Entwicklungsansatz mit LLMs

使用 LLMS 写入您的自定义代码检查器: 自动测试驱动检查开发方法

2411.06796v3

07-17

Investigating the Performance of Small Language Models in Detecting Test Smells in Manual Test Cases

Untersuchung der Leistungsfähigkeit kleiner Sprachmodelle bei der Erkennung von Testriechen in manuellen Testfällen

调查小语言模型在人工试验案件中检测测试嗅觉方面的性能

2507.13035v1

07-17

Risks of ignoring uncertainty propagation in AI-augmented security pipelines

Risiken der Ignorierung der Unsicherheitsausbreitung in KI-gesteigerten Sicherheitspipelines

忽视在AI强化安全管道中传播不确定性的风险

2407.14540v2

07-17

ReCode: Updating Code API Knowledge with Reinforcement Learning

ReCode: Aktualisierung von Code-API-Kenntnissen mit Verstärkungslernen

ReCode:更新法规API知识与强化学习

2506.20495v2

07-17

The Case for Contextual Copyleft: Licensing Open Source Training Data and Generative AI

Der Fall für Contextual Copyleft: Lizenzierung von Open Source Trainingsdaten und Generative KI

上下文翻转:为开放源码培训数据发放许可证的案例

2507.12713v1

07-17

CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance

CodeAssistBench (CAB): Datensatz & Benchmarking für Multiturn-Chat-basierte Code-Unterstützung

代码协助站(CAB):多功能聊天代码援助的数据集和基准

2507.10646v2

07-17

GUI Test Migration via Abstraction and Concretization

GUI-Test-Migration über Abstraktion und Konkretisierung

GUI 通过抽象和简明化测试移民

2409.05028v2

07-17

AI Safety in the Eyes of the Downstream Developer: A First Look at Concerns, Practices, and Challenges

KI-Sicherheit in den Augen des Downstream-Entwicklers: Ein erster Blick auf Bedenken, Praktiken und Herausforderungen

AI 下游开发者眼中的安全:首先审视关注、做法和挑战

2503.19444v3

07-17

When Domains Collide: An Activity Theory Exploration of Cross-Disciplinary Collaboration

When Domains Collide: Eine Aktivitätstheorie zur Erforschung der disziplinübergreifenden Zusammenarbeit

当域碰撞:跨纪律协作活动理论探索时

2506.20063v2

07-16 (3)

ParaStudent: Generating and Evaluating Realistic Student Code by Teaching LLMs to Struggle

ParaStudent: Erzeugen und Evaluieren des Realistischen Studentenkodex durch Lehre von LLMs zum Kampf

副专业学生:通过教授LLMs进行斗争,产生和评价现实学生守则

2507.12674v1

07-16

Single Conversation Methodology: A Human-Centered Protocol for AI-Assisted Software Development

Single Conversation Methodology: Ein Human-Centered-Protokoll für KI-Assisted Software Development

单一对话方法:AI协助软件开发的以人为中心的议定书

2507.12665v1

07-16

A Fuzzy Approach to Project Success: Measuring What Matters

Ein fuzzy Ansatz zum Projekt Erfolg: Messen, was zählt

项目成功:衡量重要事项的模糊方法

2507.12653v1

07-16

A Three-Phase Evaluation Approach for new Information and Data Models in the Smart Grid Domain

Ein dreiphasiger Evaluierungsansatz für neue Informations- und Datenmodelle im Bereich Smart Grid

智能网域新信息和数据模型的三阶段评价方法

2507.12649v1

07-16

QSpark: Towards Reliable Qiskit Code Generation

QSpark: Auf dem Weg zur zuverlässigen Qiskit-Code-Generierung

QSpark:迈向可靠的基斯基特代码生成

2507.12642v1

07-16

ROSE: Transformer-Based Refactoring Recommendation for Architectural Smells

ROSE: Transformerbasierte Refactoring-Empfehlung für architektonische Gerüche

ROSE: 以变压器为基础的建筑气味重建建议

2507.12561v1

07-16

When Retriever Meets Generator: A Joint Model for Code Comment Generation

Wenn Retriever trifft Generator: Ein gemeinsames Modell für Code Comment Generation

当再利用与生成器相遇时: 代码Comment生成联合模式

2507.12558v1

07-16

Machine Learning Systems: A Survey from a Data-Oriented Perspective

Machine Learning Systems: Eine Umfrage aus datenorientierter Perspektive

机械学习系统:从数据导向的角度进行调查

2302.04810v3

07-16

SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?

SWE-Perf: Können Sprachmodelle die Code-Performance auf realen Repositories optimieren?

SWE-Perf:语言模型能够优化现实世界仓库的代码性能吗?

2507.12415v1

07-16

GitChameleon: Evaluating AI Code Generation Against Python Library Version Incompatibilities

GitChameleon: Bewertung der KI-Code-Generation gegen Python Library Version Inkompatibilitäten

GitChameleon:评估AI 与 Python 图书馆版本不兼容性

2507.12367v1

07-16

Planning-Aware Code Infilling via Horizon-Length Prediction

通过地平线-地球预测填充规划-软件代码

2410.03103v3

07-16

An Empirical Study of Large Language Models for Type and Call Graph Analysis in Python and JavaScript

Eine empirische Studie großer Sprachmodelle für die Typ- und Call Graph Analyse in Python und JavaScript

Python 和 JavaScript 中用于类型和召唤图分析的大语言模型和经验研究

2410.00603v2

07-16

An Online A/B Testing Decision Support System for Web Usability Assessment Based on a Linguistic Decision-making Methodology: Case of Study a Virtual Learning Environment

Ein Online A/B Testing Decision Support System for Web Usability Assessment basierend auf einer sprachlichen Entscheidungsmethodik: Fall einer virtuellen Lernumgebung

网上A/B测试决定支持系统,用于基于语言决策方法的网络可用性评估:研究案例和虚拟学习环境

2507.12118v1

07-16

Leveraging LLMs for User Stories in AI Systems: UStAI Dataset

Nutzung von LLMs für Nutzergeschichten in KI-Systemen: UStAI-Datensatz

为AI系统用户故事利用LMLMs:UStAI数据集

2504.00513v3

07-16

From Static to Intelligent: Evolving SaaS Pricing with LLMs

Von der statischen zur intelligenten: Evolving SaaS Pricing mit LLMs

从静态到智慧:不断演进的SaaS与LLMs的定价

2507.12104v1

07-16

LLAMA: Multi-Feedback Smart Contract Fuzzing Framework with LLM-Guided Seed Generation

LLAMA: Multi-Feedback Smart Contract Fuzzing Framework mit LLM-geführter Saatgutgeneration

LLAMA:与LLM-Guided种子一代的多氟后智能合同模糊模糊框架

2507.12084v1

07-16

From Release to Adoption: Challenges in Reusing Pre-trained AI Models for Downstream Developers

Von der Veröffentlichung bis zur Annahme: Herausforderungen bei der Wiederverwendung vortrainierter KI-Modelle für Downstream-Entwickler

从释放到采用:为下游开发者重新使用经过预先培训的AI模型的挑战

2506.23234v2

07-16

Expanding ML-Documentation Standards For Better Security

Erweiterung der ML-Dokumentationsstandards für bessere Sicherheit

扩大多L-文件标准以增进安全

2507.12003v1

07-16

A Task Taxonomy for Conformance Checking

Eine Aufgaben-Taxonomie für die Konformitätsprüfung

合规检查任务分类

2507.11976v1

07-16

Kevin: Multi-Turn RL for Generating CUDA Kernels

Kevin: Multi-Turn RL für die Erzeugung von CUDA-Kerneln

Kevin: 生成 CUDA 核心多发RL

2507.11948v1

07-16

Extremal Testing for Network Software using LLMs

Extreme Tests für Netzwerk-Software mit LLMs

使用LLMM 网络软件的Extremal Extremal Extremal 测试

2507.11898v1

07-15 (2)

On the Need for a Statistical Foundation in Scenario-Based Testing of Autonomous Vehicles

Zur Notwendigkeit einer statistischen Grundlage für die szenariogestützte Prüfung autonomer Fahrzeuge

关于需要一个统计基金会以设想情况为基础测试自用车辆的统计基金会

2505.02274v2

07-15

REST in Pieces: RESTful Design Rule Violations in Student-Built Web Apps

REST in Pieces: RESTful Design Regel Verstöße in Student-Build Web Apps

在学生-建筑网页应用程序中违反设计规则

2507.11689v1

07-15

MetaLint: Generalizable Idiomatic Code Quality Analysis through Instruction-Following and Easy-to-Hard Generalization

MetaLint: Generalisierbare idiomatische Code-Qualitätsanalyse durch instruction-following und einfach-zu-harte Verallgemeinerung

MetLint: 通过执行指示和易于协调的通用化,可通用的单性守则质量分析

2507.11687v1

07-15

Rookie Mistakes: Measuring Software Quality in Student Projects to Guide Educational Enhancement

Rookie Fehler: Softwarequalität in Studentenprojekten messen, um die Verbesserung der Bildung zu steuern

Rookie错误:衡量学生项目软件质量以指导加强教育

2507.12488v1

07-15

You Can REST Now: Automated REST API Documentation and Testing via LLM-Assisted Request Mutations

Sie können jetzt REST: Automatisierte REST API Dokumentation und Tests über LLM-Assisted Request Mutations

你可以现在就休息了:通过LLM协助请求变异进行自动REST API文件和测试

2402.05102v2

07-15

Decision Models for Selecting Architecture Patterns and Strategies in Quantum Software Systems

Entscheidungsmodelle für die Auswahl von Architekturmustern und -strategien in Quantensoftwaresystemen

量量软件系统中选择建筑模式和战略的决定模式

2507.11671v1

07-15

ELFuzz: Efficient Input Generation via LLM-driven Synthesis Over Fuzzer Space

ELFuzz: Effiziente Input-Generierung über LLM-gesteuerte Synthese über Fuzzer-Raum

ELFuzz:通过LLM驱动的模糊空间综合合成有效投入生成

2506.10323v3

07-15

Modeling Code: Is Text All You Need?

Modeling Code: Ist Text alles, was Sie brauchen?

建模代码:你只需要文字吗?

2507.11467v1

07-15

Assistance or Disruption? Exploring and Evaluating the Design and Trade-offs of Proactive AI Programming Support

Unterstützung oder Disruption? Erforschen und Bewerten von Design und Trade-offs proaktiver KI-Programmierungsunterstützung

探讨和评价主动的AI方案拟定支助的设计和取舍

2502.18658v3

07-15

From Chaos to Automation: Enabling the Use of Unstructured Data for Robotic Process Automation

Vom Chaos zur Automatisierung: Die Nutzung unstrukturierter Daten für die Automatisierung von Roboterprozessen ermöglichen

从混乱到自动化:使无结构数据能够用于机器人程序自动化

2507.11364v1

07-15

Security Debt in Practice: Nuanced Insights from Practitioners

Sicherheitsschuld in der Praxis: Nuanced Insights von Praktizierenden

实践中的担保债务:从从业者那里得到的 “ 洞察 “

2507.11362v1

07-15

RefModel: Detecting Refactorings using Foundation Models

RefModel: Refactorings mithilfe von Foundation Models erkennen

RefModel: 使用基础模型检测重构

2507.11346v1

07-15

QLPro: Automated Code Vulnerability Discovery via LLM and Static Code Analysis Integration

QLPro: Automatisierte Code Vulnerability Discovery über LLM und Static Code Analysis Integration

QLPro:通过LLM和静态代码分析整合发现自动编码易脆弱性

2506.23644v2

07-15

Dually Hierarchical Drift Adaptation for Online Configuration Performance Learning

Dual Hierarchische Drift-Anpassung für Online-Konfigurations-Performance-Lernen

为在线配置绩效学习进行双级分级漂流适应

2507.08730v3

07-15

An Empirical Study of Multi-Agent RAG for Real-World University Admissions Counseling

Eine empirische Studie von Multi-Agent RAG für Real-World University Admissions Counseling

现实世界大学招生咨询多方代理RAG经验研究

2507.11272v1

07-15

New Formulation of DNN Statistical Mutation Killing for Ensuring Monotonicity: A Technical Report

Neue Formulierung von DNN-Statistischem Mutationskilling zur Sicherung der Monotonizität: Ein technischer Bericht

新制定的DNN 统计变异杀人确保独独独性:技术报告

2507.11199v1

07-15

GUARD:Dual-Agent based Backdoor Defense on Chain-of-Thought in Neural Code Generation

GUARD:Dual-Agent-basierte Backdoor-Verteidigung auf Ketten-of-Thought in Neural Code Generation

GUARD: 在神经代码生成过程中寻求的连锁研究中,基于 “ 以企业为基地 “ 的后门防御

2505.21425v2

07-15

PromiseTune: Unveiling Causally Promising and Explainable Configuration Tuning

PromiseTune: Enthüllen kausal vielversprechende und erklärbare Konfigurationstuning

前景图:不懈的因果保证和可解释的配置图纸

2507.05995v3

07-15

Automata Models for Effective Bug Description

Automata Modelle für effektive Bug-Beschreibung

有效臭虫描述的自动模型

2507.11146v1

07-15

MT4DP: Data Poisoning Attack Detection for DL-based Code Search Models via Metamorphic Testing

MT4DP: Datenvergiftung Angriffserkennung für DL-basierte Code-Suchmodelle über Metamorphische Tests

MT4DP:通过变形测试对基于DL的代码搜索模型进行数据中毒攻击检测

2507.11092v1

07-15

Function-to-Style Guidance of LLMs for Code Translation

Funktion-zu-Stil Anleitung von LLMs für Code-Übersetzung

代码翻译LLMM LL 指南

2507.11083v1

07-15

Self-Admitted GenAI Usage in Open-Source Software

Selbstzugelassene GenAI-Nutzung in Open-Source-Software

开放源码软件自发使用GenAI

2507.10422v2

07-15

Advancing Code Coverage: Incorporating Program Analysis with Large Language Models

Advancing Code Coverage: Einschließliche Programmanalyse mit großen Sprachmodellen

推进代码覆盖范围:将方案分析纳入大语言模式

2404.04966v2

07-15

Evaluating Generated Commit Messages with Large Language Models

Auswertung von Generated Commit-Nachrichten mit großen Sprachmodellen

以大语言模式评价生成的提交信件

2507.10906v1

07-15

MalCodeAI: Autonomous Vulnerability Detection and Remediation via Language Agnostic Code Reasoning

MalCodeAI: Autonome Schwachstelle Erkennung und Sanierung über Language Agnostic Code Reasoning

MalCodeAI:通过语言《名人法》进行自主脆弱性检测和补救

2507.10898v1

07-14 (1)

BandFuzz: An ML-powered Collaborative Fuzzing Framework

BandFuzz: Ein ML-powered Collaborative Fuzzing Framework

BandFuzz: ML 授权的协作模糊框架

2507.10845v1

07-14

Past, Present and Future: Exploring Adaptive AI in Software Development Bots

Vergangenheit, Gegenwart und Zukunft: Erforschen von adaptiver KI in Software-Entwicklungs-Bots

过去、现在和未来:探索软件开发中的适应性AI

2507.10822v1

07-14

How Robust are LLM-Generated Library Imports? An Empirical Study using Stack Overflow

Wie robust sind LLM-generierte Bibliotheksimporte? Eine empirische Studie mit Stack Overflow

LLM - 受LLM创的图书馆进口如何强劲? 利用Stack 溢流进行的一项经验性研究

2507.10818v1

07-14

Supervised Semantic Similarity-based Conflict Detection Algorithm: S3CDA

Überwachter semantischer Ähnlichkeits-basierter Konflikterkennungs-Algorithmus: S3CDA

受监督的语义相似性基于冲突探测冲突探测等级: S3CDA

2206.13690v3

07-14

Towards a Closer Collaboration Between Practice and Research in Agile Software Development Workshop: A Summary and Research Agenda

Auf dem Weg zu einer engeren Zusammenarbeit zwischen Praxis und Forschung in der Agile Software Development Workshop: Eine Zusammenfassung und Forschungsagenda

更紧密地合作,在 “ 危险软件开发实践与研究 “ 的实践与研究之间开展更密切的合作讲习班:摘要和研究议程

2507.10785v1

07-14

GenAI-Enabled Backlog Grooming in Agile Software Projects: An Empirical Study

GenAI-Enabled Backlog Grooming in agilen Software-Projekten: Eine empirische Studie

GenAI-GenAI-Enable Enable Chacklog 人工软件项目中的工作室:经验研究

2507.10753v1

07-14

Toward Realistic Evaluations of Just-In-Time Vulnerability Prediction

Hin zu realistischen Bewertungen von Just-in-Time Sicherheitsvorhersage

A. 实现现实评估时空时脆弱性预测

2507.10729v1

07-14

Learning to Focus: Context Extraction for Efficient Code Vulnerability Detection with Language Models

Fokussieren lernen: Kontextextraktion für effiziente Code-Anfälligkeitserkennung mit Sprachmodellen

学习聚焦:以语言模式有效识别《守则》脆弱性

2505.17460v3

07-14

Speculative Automated Refactoring of Imperative Deep Learning Programs to Graph Execution

Spekulative Automatisierte Refaktorisierung imperativer Deep Learning-Programme zur Graphen-Execution

用于图表执行的势必深深学习方案的投机性自动重组

2504.05424v3

07-14

CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks

CodeJudgeBench: Benchmarking von LLM-as-a-Judge für Codierungsaufgaben

标准法官:为编码任务确定LLM-as-a法官基准

2507.10535v1

07-14

Investigating Adversarial Attacks in Software Analytics via Machine Learning Explainability

Untersuchung von Adversarial Attacks in Software Analytics durch maschinelles Lernen Erklärbarkeit

调查通过机器学习解释分析软件分析中的反攻击

2408.04124v2

07-14

A Code Comprehension Benchmark for Large Language Models for Code

Ein Code-Verständnis-Benchmark für große Sprachmodelle für Code

《守则》大语言模式的《守则》理解基准

2507.10641v1

07-14

Towards a Theory on Process Automation Effects

Auf dem Weg zu einer Theorie über Prozessautomatisierungseffekte

关于进程自动化效果的理论

2506.10992v2

07-14

SENSOR: An ML-Enhanced Online Annotation Tool to Uncover Privacy Concerns from User Reviews in Social-Media Applications

SENSOR: Ein ML-erweitertes Online-Annotations-Tool, um Datenschutz-Bedenken aus User Reviews in Social-Media-Anwendungen zu enthüllen

SENSOR:一个ML-加强在线说明工具,以从社会-媒体应用中的用户审查中发现隐私问题。

2507.10640v1

07-14

Formal Analysis of the Contract Automata Runtime Environment with Uppaal: Modelling, Verification and Testing

Formale Analyse der Vertragsautomatisierung Laufzeitumgebung mit Uppaal: Modellierung, Verifizierung und Prüfung

对合同自动化运行时环境的正式分析:建模、核查和测试

2501.12932v2

07-14

AssertCoder: LLM-Based Assertion Generation via Multimodal Specification Extraction

AssertCoder: LLM-basierte Assertion Generation über Multimodal Specification Extraction

AssoldtCoder:通过多式联运规格采掘法生产以LLM为基础的货权

2507.10338v1

07-14

Toolsuite for Implementing Multiagent Systems Based on Communication Protocols

Toolsuite zur Implementierung von Multiagentensystemen auf Basis von Kommunikationsprotokollen

基于通信议定书的用于实施多剂系统的工具

2507.10324v1

07-14

Streamlined Airborne Software Development for Large UAVs: From Unified Data Collection to Automated Code Generation

Streamlined Airborne Software Development für große UAVs: Von der Unified Data Collection bis zur automatisierten Codegenerierung

为大型无人驾驶航空器简化空载软件开发:从统一数据收集到自动代码生成

2507.10321v1

07-14

A Survey of Reinforcement Learning for Software Engineering

Ein Überblick über die Verbesserung des Lernens für Software-Engineering

软件工程强化学习调查

2507.12483v1

07-14

A Grounded Theory on the Teacher and Student Roles in Pair Programming

Eine fundierte Theorie über Lehrer und Schülerrollen in der Pair-Programmierung

关于教师和学生在对等方案规划中的作用的理论基础

2507.10305v1

07-14

Helveg: Diagrams for Software Documentation

Helveg: Diagramme für Software-Dokumentation

Helveg:软件文件图

2507.10244v1

07-14

An Empirical Study of Interaction Bugs in ROS-based Software

Eine empirische Studie von Interaktionsfehlern in ROS-basierter Software

以ROS为基础的软件中的相互作用虫的经验研究

2507.10235v1

07-14

Towards a Framework for Operationalizing the Specification of Trustworthy AI Requirements

Auf dem Weg zu einem Rahmen für die Operationalisierung der Spezifikation vertrauenswürdiger AI-Anforderungen

建立一个落实可信赖的AI要求具体规格的框架

2507.10228v1

07-14

Breaking the Myth: Can Small Models Infer Postconditions Too?

Der Mythos brechen: Können kleine Modelle auch Postkonditionen nachvollziehen?

打破神话:小模型能否也推推推先决条件?

2507.10182v1

07-14

Accelerating Automatic Program Repair with Dual Retrieval-Augmented Fine-Tuning and Patch Generation on Large Language Models

Beschleunigung der automatischen Programmreparatur mit Dual Retrieval-Augmented Fine-Tuning und Patch Generation bei großen Sprachmodellen

加速自动程序维修,以大语言模式双检索增强的微调和补丁生成

2507.10103v1

07-14

Kodezi Chronos: A Debugging-First Language Model for Repository-Scale, Memory-Driven Code Understanding

Kodezi Chronos: Ein Debugging-First Language Model für Repository-Scale, Memory-Driven Code Understanding

Kodezi Chronos:调试第一语言模型,用于存储库规模、记忆驱动代码理解

2507.12482v1

07-14

LLMShot: Reducing snapshot testing maintenance via LLMs

LLMShot: Reduzierung der Snapshot-Test-Wartung über LLMs

LLMShot:减少通过LLMM减少快速测试维护

2507.10062v1

07-14

Explicit Vulnerability Generation with LLMs: An Investigation Beyond Adversarial Attacks

Explizite Gefährlichkeitsgenerierung mit LLMs: Eine Untersuchung jenseits zweifelhafter Angriffe

与LLM女士:在反向攻击之外进行调查

2507.10054v1

07-14

Enhancing the Capabilities of Large Language Models for API calls through Knowledge Graphs

Verbesserung der Fähigkeiten von großen Sprachmodellen für API-Aufrufe durch Wissensgraphen

通过 “ 知识图 “ 提高大语言模式的能力

2507.10630v1

07-14

EVALOOP: Assessing LLM Robustness in Programming from a Self-consistency Perspective

EVALOOP: Bewertung der Robustheit von LLM in der Programmierung aus einer Perspektive der Selbstkonsistenz

EVALOOP: 从自统一的角度评估方案拟订中的LLM强力

2505.12185v3

07-14

When Less is More: A systematic review of four-day workweek conceptualizations and their effects on organizational performance

When Less is More: Eine systematische Überprüfung von viertägigen Arbeitswochenkonzeptualisierungen und deren Auswirkungen auf die organisatorische Leistung

时间越少越少:系统审查四天工作周概念概念化及其对组织业绩的影响

2507.09911v1

07-14

Modelling Interrelations Between Agile Practices: The Agile Map

Modellierung von Zusammenhängen zwischen agilen Praktiken: Die agile Karte

模拟各种恶恶之间相互关系的模型:各种恶恶:各种恶恶的地图

2507.09907v1

07-14

PathFuzzing: Worst Case Analysis by Fuzzing Symbolic-Execution Paths

PathFuzzing: Schlechteste Fallanalyse durch Fuzzing Symbolic-Execution Paths

路径Fuzzing:通过模糊符号执行路径进行最坏的案例研究分析

2507.09892v1

07-14

Turning the Tide: Repository-based Code Reflection

Drehen der Tide: Repository-basierte Code-Reflexion

翻转底盘:基于仓库的代码反射

2507.09866v1

07-14

IRFuzzer: Specialized Fuzzing for LLVM Backend Code Generation

IRFuzzer: Spezialisiertes Fuzzing für LLVM-Backend-Code-Generierung

IRFuzzer: LLLVM 后端代码生成专门模糊

2402.05256v2

07-13 (7)

Measuring What Matters: A Framework for Evaluating Safety Risks in Real-World LLM Applications

Was zählt: Ein Rahmen für die Bewertung von Sicherheitsrisiken in realen LLM-Anwendungen

衡量什么重要事项:在现实世界LLM应用中评估安全风险的框架

2507.09820v1

07-13

Prompting for Performance: Exploring LLMs for Configuring Software

Prompting for Performance: LLMs für die Konfiguration von Software erkunden

促效:探索配置软件LLMs

2507.09790v1

07-13

OrQstrator: An AI-Powered Framework for Advanced Quantum Circuit Optimization

OrQstrator: Ein KI-Powered-Framework für erweiterte Quantenschaltungsoptimierung

Orstrator: AI授权的高级量子电路优化框架

2507.09682v1

07-13

Is Quantization a Deal-breaker? Empirical Insights from Large Code Models

Ist Quantisierung ein Deal-Breaker? Empirische Einblicke aus großen Code-Modellen

量化是否是一个突破交易者?来自大代码模型的实证透视

2507.09665v1

07-13

Code Review as Decision-Making – Building a Cognitive Model from the Questions Asked During Code Review

Code-Review als Entscheidungsfindung – Aufbau eines Kognitivmodells aus den Fragen, die während der Code-Review gestellt wurden

作为决策的《守则》审查 – – 从《守则》审查期间提出的问题建立认知模式

2507.09637v1

07-13

Complexity and Coupling: A Functional Domain Approach

Komplexität und Koppelung: Ein funktionaler Bereichsansatz

复杂性和组合:功能领域办法

2507.09599v1

07-13

The Mythical Good Software

Die mythische gute Software

《神道好软件》

2507.09596v1

07-13

Equality Saturation for Optimizing High-Level Julia IR

Gleichstellungssättigung für die Optimierung von High-Level Julia IR

优化高级别Julia IR 平等饱和

2502.17075v2

07-13

How to Define Design in Industrial Control and Automation Software

Wie man Design in der industriellen Steuerungs- und Automatisierungssoftware definiert

如何界定工业控制和自动化软件的设计

2507.09594v1

07-13

A Serverless Architecture for Real-Time Stock Analysis using Large Language Models: An Iterative Development and Debugging Case Study

Eine serverlose Architektur für Echtzeit-Speicheranalyse mit großen Sprachmodellen: Eine iterative Entwicklungs- und Debugging-Fallstudie

使用大语言模型进行实时库存分析的无服务器结构:迭代发展和调试案例研究

2507.09583v1

07-13

The Debugging Decay Index: Rethinking Debugging Strategies for Code LLMs

Der Debugging Decay Index: Debugging Strategien für Code LLMs neu denken

调试衰减指数:重新思考守则LMS的调试战略

2506.18403v2

07-13

It Only Gets Worse: Revisiting DL-Based Vulnerability Detectors from a Practical Perspective

Es wird nur schlimmer: DL-basierte Sicherheitsdetektoren aus praktischer Sicht neu zu betrachten

更糟糕的是:从实际角度重新审视基于DL的脆弱性检测器

2507.09529v1

07-13

Towards LLM-Based Automatic Playtest

Zum LLM-basierten automatischen Playtest

面向基于 LLM 的自动游戏测试

2507.09490v1

07-13

Evaluating LLMs on Sequential API Call Through Automated Test Generation

Bewertung von LLMs auf sequentieller API-Aufruf durch automatisierte Testgenerierung

通过自动测试生成的序列API呼叫评估LLMs

2507.09481v1

07-12 (6)

Enhancing NeuroEvolution-Based Game Testing: A Branch Coverage Approach for Scratch Programs

Verbesserung der NeuroEvolution-basierten Game-Tests: Ein branchenübergreifender Ansatz für Scratch-Programme

强化基于进进神经革命的游戏测试:Scratch方案分支覆盖方法

2507.09414v1

07-12

LLM-Powered Quantum Code Transpilation

LLM 功率量代码转换

2507.12480v1

07-12

Enhancing Interpretability in Software Change Management with Chain-of-Thought Reasoning

Verbesserung der Dolmetschbarkeit im Software Change Management durch schlüsselfertiges Reasoning

提高软件变革管理与 “ 探索链解释理由 “ 的可解释性

2507.09315v1

07-12

Explainability as a Compliance Requirement: What Regulated Industries Need from AI Tools for Design Artifact Generation

Erklärbarkeit als Compliance-Voraussetzung: Was regulierte Industrien von KI-Werkzeugen für die Design-Artefakt-Generierung benötigen

作为遵约要求的解释性:AI 设计人工制造工具中监管工业需要什么

2507.09220v1

07-12

Back to the Basics: Rethinking Issue-Commit Linking with LLM-Assisted Retrieval

Zurück zu den Grundlagen: Rethinking Issue-Commit Linking with LLM-Assisted Retrieval

返回到 Basics: 重新思考与LLM 辅助检索连接的问题

2507.09199v1

07-12

OpenCAMS: An Open-Source Connected and Automated Mobility Co-Simulation Platform for Advanced Transportation Research

OpenCAMS: Eine Open-Source vernetzte und automatisierte Mobilitäts-Co-Simulationsplattform für fortgeschrittene Verkehrsforschung

开放源码连接和自动化流动联合模拟平台,用于高级运输研究

2507.09186v1

07-12

Position Paper: Programming Language Techniques for Bridging LLM Code Generation Semantic Gaps

Positionspapier: Programmiersprachentechniken zur Bridging LLM Code Generation Semantische Lücken

立场文件:缩小LLM码生成语义差距的编程语言技术

2507.09135v1

07-12

SPICE: An Automated SWE-Bench Labeling Pipeline for Issue Clarity, Test Coverage, and Effort Estimation

SPICE: Eine automatisierte SWE-Bench-Etikettierungspipeline für Ausgabeklarheit, Testabdeckung und Aufwandsabschätzung

SPICE: 用于议题清晰度、测试覆盖率和努力估算的SWE-Bennch自动标签管道

2507.09108v1

07-12

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity

Messung des Einflusses der frühen-2025 KI auf erfahrene Open-Source-Entwicklerproduktivität

衡量2025年初AI(AI)对经验丰富的开放源码开发者生产力的影响

2507.09089v1

07-11 (5)

SetupBench: Assessing Software Engineering Agents’ Ability to Bootstrap Development Environments

SetupBench: Bewertung der Fähigkeit von Software-Engineering-Agenten zu Bootstrap-Entwicklungsumgebungen

设置基准:评估软件工程代理器的能力,以建立发展环境

2507.09063v1

07-11

SAGE: A Context-Aware Approach for Mining Privacy Requirements Relevant Reviews from Mental Health Apps

SAGE: A Context-Aware Approach for Mining Privacy Relevant Reviews from Mental Health Apps

SAGE: “ 采矿隐私要求 “ 的背景意识方法,来自心理健康应用软件的相关审查

2507.09051v1

07-11

CMER: A Context-Aware Approach for Mining Ethical Concern-related App Reviews

CMER: A Context-aware approach for Mining Ethical Concern-related App Reviews

CMER: 采矿道德关切相关上诉审查的背景意识方法

2507.09049v1

07-11

Towards Extracting Software Requirements from App Reviews using Seq2seq Framework

Auf dem Weg zur Extraktion von Software-Anforderungen aus App-Bewertungen mit Seq2seq Framework

争取利用Seq2seq 框架从应用审查中提取软件要求

2507.09039v1

07-11

BrainLesion Suite: A Flexible and User-Friendly Framework for Modular Brain Lesion Image Analysis

BrainLesion Suite: Ein flexibles und benutzerfreundliches Framework für die modulare Gehirn-Lesions-Bildanalyse

脑悬浮套件:模块脑悬浮图像分析灵活和用户友好框架

2507.09036v1

07-11

Accelerating Drug Discovery Through Agentic AI: A Multi-Agent Approach to Laboratory Automation in the DMTA Cycle

Beschleunigen der Wirkstoff-Discovery durch Agentic AI: Multi-Agenten-Ansatz zur Laborautomatisierung im DMTA-Zyklus

AI:对DMTTA周期实验室自动化采取多机构办法

2507.09023v1

07-11

ToolRegistry: A Protocol-Agnostic Tool Management Library for Function-Calling LLMs

ToolRegistry: Eine protokoll-agnostische Werkzeugverwaltungsbibliothek für funktionsaufrufende LLMs

工具登记:功能调频LMS的礼宾-不可确定性工具管理库

2507.10593v1

07-11

Semantic Source Code Segmentation using Small and Large Language Models

Semantische Quellcode-Segmentierung mit kleinen und großen Sprachmodellen

使用小型和大语言模式的语义源代码代码分割

2507.08992v1

07-11

Can Large Language Models Help Students Prove Software Correctness? An Experimental Study with Dafny

Können große Sprachmodelle den Studierenden helfen, Software-Korrektur zu beweisen? Eine experimentelle Studie mit Dafny

大语言模型能帮助学生证明软件正确性吗? 与Dafny的实验研究

2506.22370v3

07-11

Choosing the Right Git Workflow: A Comparative Analysis of Trunk-based vs. Branch-based Approaches

Auswahl des richtigen Git-Workflows: Eine vergleichende Analyse von Trunk-based vs. Branch-based Approaches

选择正确的基特工作流程:对基于Trunk的方法与基于分部门的方法的比较分析

2507.08943v1

07-11

Repairing Language Model Pipelines by Meta Self-Refining Competing Constraints at Runtime

Reparatur von Sprachmodell-Pipelines durch Meta-Selbst-Refining Wettbewerbsbeschränkungen bei Runtime

运行时通过Meta自我改进竞争制约修复语言示范管道

2507.10590v1

07-11

On the Structure and Semantics of Identifier Names Containing Closed Syntactic Category Words

Über die Struktur und Semantik von Identifier-Namen, die geschlossene syntaktische Kategorie Wörter enthalten

关于含有闭合同步词类的标识名称的结构和语义

2505.18444v3

07-11

Multilingual Multimodal Software Developer for Code Generation

Mehrsprachiger multimodaler Softwareentwickler für die Codegenerierung

用于代码生成的多语言多语种多式软件开发器

2507.08719v1

07-11

LLMCup: Ranking-Enhanced Comment Updating with LLMs

LLMCup: Ranking-erweiterter Kommentar Aktualisierung mit LLMs

LLMCUM: 更新与LLMM的评分

2507.08671v1

07-11

Text2BIM: Generating Building Models Using a Large Language Model-based Multi-Agent Framework

Text2BIM: Generierung von Baumodellen mit Hilfe eines Multi-Agent-Frameworks auf Basis eines großen Sprachmodells

Text2BIM:利用以大语言模式为基础的多机构机构框架生成建筑模型

2408.08054v2

07-11

NL in the Middle: Code Translation with LLMs and Intermediate Representations

NL in der Mitte: Code-Übersetzung mit LLMs und Intermediate Representations

中文本不适用:配有LLMs和中级代表的代码翻译

2507.08627v1

07-11

Generating Proto-Personas through Prompt Engineering: A Case Study on Efficiency, Effectiveness and Empathy

Proto-Personas durch Prompt Engineering generieren: Eine Fallstudie zu Effizienz, Effektivität und Empathie

通过即时工程产生个人方案:关于效率、有效性和冷漠的案例研究

2507.08594v1

07-11

ARPaCCino: An Agentic-RAG for Policy as Code Compliance

ARPaCCino: Eine Agentur-RAG für Politik als Code-Compliance

ARPACCino:作为《守则》合规政策的一个代理-RAG

2507.10584v1

07-11

InferLog: Accelerating LLM Inference for Online Log Parsing via ICL-oriented Prefix Caching

InferLog: Beschleunigung der LLM-Inferenz für das Online-Log Parsing über ICL-orientiertes Prefix-Caching

InferLog: 通过ICL 导向的前缀缓存加速在线日志解析的 LLM 推断

2507.08523v1

07-11

$\texttt{Droid}$: A Resource Suite for AI-Generated Code Detection

$\texttt{Droid}$: Eine Ressourcen-Suite für KI-generierte Code-Erkennung

$\texttt{Droid}$: 用于 AI 生成代码检测的资源套件

2507.10583v1

07-11

Computing Floating-Point Errors by Injecting Perturbations

Berechnung von Floating-Point-Fehlern durch Einspritzen von Perturbationen

通过注射扰动输入,计算浮点误差

2507.08467v1

07-11

ProvideQ: A Quantum Optimization Toolbox

ProvideQ: Eine Quantum-Optimierungs-Toolbox

提供 Q: 量图优化工具箱

2507.07649v2

07-11

Leveraging Large Language Models for Classifying App Users’ Feedback

Nutzung von großen Sprachmodellen zur Klassifizierung des Feedbacks von App-Nutzern

利用大语言模型对应用程序用户的反馈进行分类

2507.08250v1

07-10 (4)

KP-A: A Unified Network Knowledge Plane for Catalyzing Agentic Network Intelligence

KP-A: Eine einheitliche Netzwerk-Wissensplattform für katalysierende Agentische Netzwerk-Intelligenz

KP-A:一个用于催化剂网络情报的统一网络知识平台

2507.08164v1

07-10

The Impact of Generative AI on Code Expertise Models: An Exploratory Study

Die Auswirkungen generativer KI auf Code-Expertise-Modelle: Eine Sondierungsstudie

《创世大赦国际对守则专门知识模型的影响:探索性研究》

2507.08160v1

07-10

Code with Me or for Me? How Increasing AI Automation Transforms Developer Workflows

Code mit mir oder für mich? Wie die zunehmende KI-Automatisierung Entwickler-Workflows transformiert

与我一起编码还是替我编码?日益增强的AI自动化如何改变开发者工作流程

2507.08149v1

145

07-10

The State of Computational Science in Fission and Fusion Energy

Der Zustand der Computational Science in Fission und Fusionsenergie

裂变和聚变能源的计算科学状况

2507.08061v1

146

07-10

QCP: A Practical Separation Logic-based C Program Verification Tool

QCP: Ein praktisches, auf Separationslogik basierendes Verifikationswerkzeug für C-Programme

QCP:一个实用的基于分离逻辑的C程序验证工具

2505.12878v2

147

07-10

Open Source, Hidden Costs: A Systematic Literature Review on OSS License Management

Offene Quelle, versteckte Kosten: Ein systematischer Literaturbericht über OSS-Lizenzverwaltung

开放源码,隐藏成本:开放源码软件许可证管理的系统文献审查

2507.05270v2

148

07-10

Open-source automatic pipeline for efficient conversion of large-scale point clouds to IFC format

Open-Source-Automatische Pipeline für die effiziente Umwandlung von großflächigen Punktwolken in IFC-Format

将大规模点云高效转换为IFC格式的开源自动化流水线

2503.11498v3

149

07-10

From Domain Documents to Requirements: Retrieval-Augmented Generation in the Space Industry

Von Domänendokumenten zu Anforderungen: Retrieval-Augmented Generation in der Raumfahrtindustrie

从领域文档到需求:航天工业中的检索增强生成

2507.07689v1

150

07-10

Prompt Engineering for Requirements Engineering: A Literature Review and Roadmap

Prompt Engineering für Requirements Engineering: Literaturübersicht und Roadmap

面向需求工程的提示工程:文献综述与路线图

2507.07682v1

151

07-10

Quantum Executor: A Unified Interface for Quantum Computing

Quantum Executor: Ein einheitliches Interface für Quantum Computing

Quantum Executor:量子计算的统一接口

2507.07597v1

152

07-10

From Requirements to Code: Understanding Developer Practices in LLM-Assisted Software Engineering

Von Anforderungen zum Code: Entwickler-Praxis in LLM-Assisted Software Engineering verstehen

从需求到代码:理解LLM辅助软件工程中的开发者实践

2507.07548v1

153

07-10

Towards an Engineering Workflow Management System for Asset Administration Shells using BPMN

Auf dem Weg zu einem Engineering Workflow Management System für Asset Administration Shells mit BPMN

面向使用BPMN的资产管理壳的工程工作流管理系统

2507.07468v1

154

07-10

Toolchain for Faster Iterations in Quantum Software Development

Toolchain für schnellere Iterationen in der Quantensoftware-Entwicklung

用于在量子软件开发中加快迭代的工具链

2507.07448v1

155

07-10

DITING: A Static Analyzer for Identifying Bad Partitioning Issues in TEE Applications

DITING: Ein statischer Analyzer zur Identifizierung von Problemen mit schlechten Partitionierungen in TEE-Anwendungen

DITING:用于识别TEE应用中不良分区问题的静态分析器

2502.15281v2

156

07-10

Automatic Generation of Explainability Requirements and Software Explanations From User Reviews

Automatische Generierung von Erklärbarkeitsanforderungen und Software-Erläuterungen aus Benutzer-Bewertungen

从用户评论中自动生成可解释性需求和软件解释

2507.07344v1

Article 0

Title@2025-07-17 (4): SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks

Title: SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks

SWE-MERA: Ein dynamischer Benchmark für die Agentik-Bewertung großer Sprachmodelle in Software-Engineering-Aufgaben

SWE-MERA: 积极评价软件工程任务大语言模型的动态基准 2507.11059v2

Authors (9): Pavel Adamenko, Mikhail Ivanov, Aidar Valeev, Rodion Levichev, Pavel Zadorozhny, Ivan Lopatin, Dmitry Babayev, Alena Fenogenova, Valentin Malykh

The rapid advancement of Large Language Models (LLMs) in software engineering has revealed critical limitations in existing benchmarks, particularly the widely used SWE-bench dataset. Recent studies have uncovered severe data contamination issues; for example, on SWE-bench, 32.67% of successful patches are reported to involve direct solution leakage and 31.08% to pass because of inadequate test cases. We introduce SWE-MERA, a dynamic, continuously updated benchmark designed to address these fundamental challenges through an automated collection of real-world GitHub issues and rigorous quality validation. Our approach implements a reliable pipeline that ensures quality while minimizing contamination risks, resulting in approximately 10,000 potential tasks with 300 samples currently available. Evaluation using the Aider coding agent demonstrates strong discriminative power across state-of-the-art models. We report performance across a dozen recent LLMs evaluated on tasks collected between September 2024 and June 2025.

软件工程大语言模型(LLMS)的快速发展揭示了现有基准,特别是广泛使用的SWE-bench数据集的重大局限性,最近的研究发现了严重的数据污染问题,例如SWE-bench报告32.67%的成功补丁涉及直接溶解渗漏,31.08%因测试案例不足而通过。我们引入了SWE-MERA,这是一个动态的、不断更新的基准,旨在通过自动收集真实世界的GitHub问题和严格的质量验证来应对这些基本挑战。我们的方法是一个可靠的管道,既能确保质量,又能尽量减少污染风险,从而产生约10,000项潜在任务,目前已有300个样本。使用Aider编码剂进行的评估表明,在最新模型中具有很强的歧视性力量。我们报告了2024年9月至2025年6月期间所收集的任务最近得到评估的十多个LMMS的绩效。

Article 1

Title@2025-07-17 (4): Detecting LLM-generated Code with Subtle Modification by Adversarial Training

Title: Detecting LLM-generated Code with Subtle Modification by Adversarial Training

LLM-generierter Code mit subtiler Änderung durch Adversarial Training erkennen

检测通过反向培训进行精细修改的LLM生成代码 2507.13123v1

Authors (5): Xin Yin, Xinrui Li, Chao Ni, Xiaodan Xu, Xiaohu Yang

With the rapid development of Large Language Models (LLMs), their powerful code-generation capabilities have been widely applied in tasks like code completion and automated development, demonstrating the value of improving coding efficiency. However, the extensive use of LLM-generated code also raises several new challenges. On the one hand, issues such as the regulation of code provenance, copyright disputes, and code quality have become increasingly concerning. How to effectively detect LLM-generated code and ensure its compliant and responsible use has become a critical and urgent issue. On the other hand, in practical applications, LLM-generated code is often subject to manual modifications, such as variable renaming or structural adjustments. Although some recent studies have proposed training-based and zero-shot methods for detecting LLM-generated code, these approaches show insufficient robustness when facing modified LLM-generated code, and there is a lack of an effective solution. To address the real-world scenario where LLM-generated code may undergo minor modifications, we propose CodeGPTSensor+, an enhanced version of CodeGPTSensor, which employs adversarial training to improve robustness against input perturbations. CodeGPTSensor+ integrates an adversarial sample generation module, Multi-objective Identifier and Structure Transformation (MIST), which systematically generates both high-quality and representative adversarial samples. This module effectively enhances the model’s resistance against diverse adversarial attacks. Experimental results on the HMCorp dataset demonstrate that CodeGPTSensor+ significantly improves detection accuracy on the adversarial test set while maintaining high accuracy on the original test set, showcasing superior robustness compared to CodeGPTSensor.

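随着大语言模型(LLM)的迅速发展,其强大的代码生成能力已被广泛应用于代码补全和自动化开发等任务,体现了提高编码效率的价值。然而,LLM生成代码的广泛使用也带来了若干新的挑战。一方面,代码来源监管、版权纠纷和代码质量等问题日益令人担忧,如何有效检测LLM生成的代码并确保其合规、负责任地使用,已成为关键而紧迫的问题。另一方面,在实际应用中,LLM生成的代码往往会经过人工修改,例如变量重命名或结构调整。尽管近期已有研究提出了基于训练的方法和零样本方法来检测LLM生成的代码,但这些方法在面对经过修改的LLM生成代码时鲁棒性不足,目前仍缺乏有效的解决方案。针对LLM生成代码可能被小幅修改的现实场景,我们提出CodeGPTSensor+,它是CodeGPTSensor的增强版本,通过对抗训练提高对输入扰动的鲁棒性。CodeGPTSensor+集成了对抗样本生成模块MIST(Multi-objective Identifier and Structure Transformation,多目标标识符与结构变换),该模块能够系统地生成高质量且具有代表性的对抗样本,有效增强模型对多种对抗攻击的抵抗能力。在HMCorp数据集上的实验结果表明,CodeGPTSensor+在对抗测试集上显著提高了检测准确率,同时在原始测试集上保持了较高的准确率,表现出优于CodeGPTSensor的鲁棒性。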

Article 2

Title@2025-07-17 (4): Inferring Attributed Grammars from Parser Implementations

Title: Inferring Attributed Grammars from Parser Implementations

Zugeschriebene Grammatiken aus Parser-Implementierungen ableiten

从剖析器执行中推断出属性语法 2507.13117v1

Authors (3): Andreas Pointner, Josef Pichler, Herbert Prähofer

Software systems that process structured inputs often lack complete and up-to-date specifications, which specify the input syntax and the semantics of input processing. While grammar mining techniques have focused on recovering syntactic structures, the semantics of input processing remains largely unexplored. In this work, we introduce a novel approach for inferring attributed grammars from parser implementations. Given an input grammar, our technique dynamically analyzes the implementation of recursive descent parsers to reconstruct the semantic aspects of input handling, resulting in specifications in the form of attributed grammars. By observing program executions and mapping the program’s runtime behavior to the grammar, we systematically extract and embed semantic actions into the grammar rules. This enables comprehensive specification recovery. We demonstrate the feasibility of our approach using an initial set of programs, showing that it can accurately reproduce program behavior through the generated attributed grammars.

处理结构化投入的软件系统往往缺乏完整和最新的规格,这些规格具体规定了输入语法和输入处理的语义。语法采矿技术侧重于恢复合成结构,而输入处理的语义基本上尚未探索。在这项工作中,我们引入了一种新颖的方法,从实施剖析器中推算有分辨的语法。根据输入语法,我们的技术动态地分析了反复下降的剖析器的实施情况,以重建输入处理的语义方面,从而产生了有分辨语法的规格。通过观察程序执行过程和将程序运行时间的行为与语法规则进行绘图,我们系统地提取和将语法行动嵌入语法规则中。这有利于全面规范的恢复。我们展示了使用最初一套程序的方法的可行性,表明它可以通过生成的有分辨语法来准确复制程序的行为。
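
To make the term concrete, the sketch below shows the kind of coupling an attributed grammar captures: each rule of a recursive descent parser carries a semantic action that computes a synthesized attribute (here a numeric `val`). It does not reproduce the paper's inference technique; the toy grammar and the names `Parser`, `tokenize`, and `evaluate` are illustrative assumptions.

```python
import re


def tokenize(text):
    """Split the input into integer literals and single-character operators."""
    return re.findall(r"\d+|[+*()]", text)


class Parser:
    """Recursive descent parser; each method returns the rule's `val` attribute."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    # Expr -> Term { '+' Term }        Expr.val = sum of the Term.val attributes
    def expr(self):
        val = self.term()
        while self.peek() == "+":
            self.take()
            val += self.term()
        return val

    # Term -> Factor { '*' Factor }    Term.val = product of the Factor.val attributes
    def term(self):
        val = self.factor()
        while self.peek() == "*":
            self.take()
            val *= self.factor()
        return val

    # Factor -> NUMBER | '(' Expr ')'  Factor.val = int(NUMBER) or Expr.val
    def factor(self):
        tok = self.take()
        if tok == "(":
            val = self.expr()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return val
        if tok is None or not tok.isdigit():
            raise SyntaxError(f"unexpected token {tok!r}")
        return int(tok)


def evaluate(text):
    """Parse the whole input and return the attribute of the start symbol."""
    parser = Parser(tokenize(text))
    val = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError("trailing input")
    return val
```

The comments above each method state the grammar rule together with its attribute equation; a specification of this shape is what the abstract describes recovering from observed parser executions.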

Article 3

Title@2025-07-17 (4): A Conceptual Framework for Requirements Engineering of Pretrained-Model-Enabled Systems

Title: A Conceptual Framework for Requirements Engineering of Pretrained-Model-Enabled Systems

Ein konzeptioneller Rahmen für die Anforderungsentwicklung von vortrainierten modellgebundenen Systemen

预先培训的、采用模式的系统工程要求概念框架 2507.13095v1

Authors (4): Dongming Jin, Zhi Jin, Linyu Li, Xiaohong Chen

Recent advances in large pretrained models have led to their widespread integration as core components in modern software systems. The trend is expected to continue in the foreseeable future. Unlike traditional software systems governed by deterministic logic, systems powered by pretrained models exhibit distinctive and emergent characteristics, such as ambiguous capability boundaries, context-dependent behavior, and continuous evolution. These properties fundamentally challenge long-standing assumptions in requirements engineering, including functional decomposability and behavioral predictability. This paper investigates this problem and advocates for a rethinking of existing requirements engineering methodologies. We propose a conceptual framework tailored to requirements engineering of pretrained-model-enabled software systems and outline several promising research directions within this framework. This vision helps provide a guide for researchers and practitioners to tackle the emerging challenges in requirements engineering of pretrained-model-enabled systems.

大型预训练模型的最新进展使其作为核心组件被广泛集成到现代软件系统中,预计这一趋势在可预见的未来仍将持续。与由确定性逻辑支配的传统软件系统不同,由预训练模型驱动的系统表现出独特的涌现特征,例如能力边界模糊、行为依赖上下文以及持续演化。这些特性从根本上挑战了需求工程中长期存在的假设,包括功能可分解性和行为可预测性。本文研究了这一问题,并主张重新思考现有的需求工程方法。我们提出了一个面向预训练模型赋能软件系统需求工程的概念框架,并在该框架内勾勒了若干有前景的研究方向。这一愿景有助于指导研究人员和从业者应对预训练模型赋能系统需求工程中新出现的挑战。

Article 4

Title@2025-07-17 (4): MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks

Title: MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks

MERA-Code: Ein einheitliches Framework zur Bewertung der Codegenerierung von Aufgaben

MERA 守则:一个统一框架,用于评估不同任务制定守则的情况 2507.12284v2

Authors (23): Artem Chervyakov, Alexander Kharitonov, Pavel Zadorozhny, Adamenko Pavel, Rodion Levichev, Dmitrii Vorobev, Dmitrii Salikhov, Aidar Valeev, Alena Pestova, Maria Dziuba, Ilseyar Alimova, Artem Zavgorodnev, Aleksandr Medvedev, Stanislav Moiseev, Elena Bruches, Daniil Grebenkin, Roman Derunets, Vikulov Vladimir, Anton Emelyanov, Dmitrii Babaev, Vladimir V. Ivanov, Valentin Malykh, Alena Fenogenova

Advancements in LLMs have enhanced task automation in software engineering; however, current evaluations primarily focus on natural language tasks, overlooking code quality. Most benchmarks prioritize high-level reasoning over executable code and real-world performance, leaving gaps in understanding true capabilities and risks associated with these models in production. To address this issue, we propose MERA Code, a new addition to the MERA benchmark family, specifically focused on evaluating code for the latest code generation LLMs in Russian. This benchmark includes 11 evaluation tasks that span 8 programming languages. Our proposed evaluation methodology features a taxonomy that outlines the practical coding skills necessary for models to complete these tasks. The benchmark comprises an open-source codebase for users to conduct MERA assessments, a scoring system compatible with various programming environments, and a platform featuring a leaderboard and submission system. We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages. We are publicly releasing MERA to guide future research, anticipate groundbreaking features in model development, and standardize evaluation procedures.

LLM的进步提升了软件工程中的任务自动化水平;然而,当前的评估主要关注自然语言任务,忽视了代码质量。大多数基准更重视高层推理,而非可执行代码和真实场景下的表现,因而难以了解这些模型在生产环境中的真实能力和相关风险。为解决这一问题,我们提出MERA Code,它是MERA基准家族的新成员,专门用于评估最新代码生成LLM在俄语环境下的代码能力。该基准包含11项评估任务,涵盖8种编程语言。我们提出的评估方法包含一个分类体系,概述了模型完成这些任务所需的实用编码技能。该基准包括一个供用户开展MERA评估的开源代码库、一个兼容多种编程环境的评分系统,以及一个带有排行榜和提交系统的平台。我们评估了开源LLM和前沿API模型,分析了它们在非英语语言的实际编码任务方面的局限。我们公开发布MERA,以指导未来研究、预判模型开发中的突破性特性,并规范评估流程。

Article 5

Title@2025-07-17 (4): iReDev: A Knowledge-Driven Multi-Agent Framework for Intelligent Requirements Development

Title: iReDev: A Knowledge-Driven Multi-Agent Framework for Intelligent Requirements Development

iReDev: Ein wissensgestütztes Multi-Agent-Rahmenwerk für intelligente Anforderungsentwicklung

iReDev:开发智能要求的知识开发多机构框架 2507.13081v1

Authors (7): Dongming Jin, Weisong Sun, Jiangping Huang, Peng Liang, Jifeng Xuan, Yang Liu, Zhi Jin

Requirements development is a critical phase as it is responsible for providing a clear understanding of what stakeholders need. It involves collaboration among stakeholders to extract explicit requirements and address potential conflicts, which is time-consuming and labor-intensive. Recently, multi-agent systems for software development have attracted much attention. However, existing research provides limited support for requirements development and overlooks the injection of human knowledge into agents and human-agent collaboration. To address these issues, this paper proposes a knowledge-driven multi-agent framework for intelligent requirements development, named iReDev. iReDev has the following features. First, it consists of six knowledge-driven agents that support the entire requirements development process and collaboratively perform various tasks to produce a software requirements specification. Second, it focuses on integrating human knowledge into agents, enabling them to simulate real-world stakeholders. Third, it uses an event-driven communication mechanism based on an artifact pool: agents continuously monitor the pool and autonomously trigger the next action based on its changes, enabling iReDev to handle new requirements quickly. Fourth, it introduces a human-in-the-loop mechanism to support human-agent collaboration, ensuring that the generated artifacts align with the expectations of stakeholders. We evaluated the generated artifacts, and the results show that iReDev outperforms existing baselines in multiple aspects. We further envision three key directions and hope this work can facilitate progress in intelligent requirements development.

需求开发是一个关键阶段,因为它负责清晰地理解利益相关者的需要。它需要利益相关者相互协作,以提取明确的需求并解决潜在冲突,既耗时又耗力。近来,面向软件开发的多智能体系统备受关注。然而,现有研究对需求开发的支持有限,并且忽视了向智能体注入人类知识以及人与智能体的协作。为解决这些问题,本文提出了一个知识驱动的多智能体智能需求开发框架iReDev。iReDev具有以下特点:iReDev由六个知识驱动的智能体组成,支持整个需求开发过程,它们协作完成各项任务,产出软件需求规格说明;iReDev注重为智能体整合人类知识,使其能够模拟现实世界中的利益相关者;iReDev采用基于制品池的事件驱动通信机制,智能体持续监视制品池,并根据其变化自主触发下一步动作,使iReDev能够快速处理新需求;iReDev引入人在回路机制以支持人与智能体的协作,确保生成的制品符合利益相关者的期望。我们对生成的制品进行了评估,结果表明iReDev在多个方面优于现有基线。我们还展望了三个关键方向,希望这项工作能够推动智能需求开发的发展。
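
The event-driven artifact pool described in the abstract can be pictured with a small publish/subscribe sketch. This is not iReDev's implementation: the class `ArtifactPool`, the artifact kinds, and the two toy agents are hypothetical, and callbacks stand in for the continuous monitoring the abstract mentions.

```python
from collections import defaultdict


class ArtifactPool:
    """Hypothetical shared pool: stores artifacts and notifies subscribed agents."""

    def __init__(self):
        self.artifacts = {}
        self.subscribers = defaultdict(list)

    def subscribe(self, kind, handler):
        """Register an agent callback for artifacts of the given kind."""
        self.subscribers[kind].append(handler)

    def publish(self, kind, content):
        """Store the artifact, then trigger every agent watching this kind."""
        self.artifacts[kind] = content
        for handler in list(self.subscribers[kind]):
            handler(self, content)


def analyst(pool, interview_notes):
    """Toy agent: turns raw notes into a list of requirement statements."""
    requirements = [line.strip() for line in interview_notes.splitlines() if line.strip()]
    pool.publish("requirements", requirements)


def spec_writer(pool, requirements):
    """Toy agent: numbers the requirements into a specification document."""
    pool.publish("srs", "\n".join(f"R{i}: {r}" for i, r in enumerate(requirements, 1)))


pool = ArtifactPool()
pool.subscribe("interview_notes", analyst)
pool.subscribe("requirements", spec_writer)

# Publishing one new artifact cascades through both agents.
pool.publish("interview_notes", "export reports as PDF\nlog in with SSO")
```

In this arrangement no agent calls another directly; each reacts only to changes in the pool, which is the property the abstract credits for handling new requirements quickly.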

Article 6

Title@2025-07-17 (4): Write Your Own CodeChecker: An Automated Test-Driven Checker Development Approach with LLMs

Title: Write Your Own CodeChecker: An Automated Test-Driven Checker Development Approach with LLMs

Schreiben Sie Ihren eigenen CodeChecker: Ein automatisierter Test-Driven Checker-Entwicklungsansatz mit LLMs

使用 LLMS 写入您的自定义代码检查器: 自动测试驱动检查开发方法 2411.06796v3

Authors (6): Jun Liu, Yuanyuan Xie, Jiwei Yan, Jinhao Huang, Jun Yan, Jian Zhang

With the rising demand for code quality assurance, developers are not only utilizing existing static code checkers but also seeking custom checkers to satisfy their specific needs. Nowadays, various code-checking frameworks provide extensive checker customization interfaces to meet this need. However, both the abstract checking logic and the complex API usage of large-scale checker frameworks make this task challenging. To this end, automated code checker generation is anticipated to ease the burden of checker development. In this paper, we propose AutoChecker, an innovative LLM-powered approach that can write code checkers automatically based on only a rule description and a test suite. To achieve comprehensive checking logic, AutoChecker incrementally updates the checker’s logic by focusing on solving one selected case each time. To obtain precise API knowledge, during each iteration, it leverages fine-grained logic-guided API-context retrieval, where it first decomposes the checking logic into a series of sub-operations and then retrieves checker-related API-contexts for each sub-operation. For evaluation, we apply AutoChecker, five baselines, and three ablation methods using multiple LLMs to generate checkers for 20 randomly selected PMD rules. Experimental results show that AutoChecker significantly outperforms others across all effectiveness metrics, with an average test pass rate of 82.28%. Additionally, the checkers generated by AutoChecker can be successfully applied to real-world projects, matching the performance of official checkers.

随着对代码质量保证的需求不断增长,开发者不仅使用现有的静态代码检查器,还在寻求自定义检查器以满足其特定需求。如今,各种代码检查框架都提供了丰富的检查器定制接口来满足这一需求。然而,抽象的检查逻辑和大型检查器框架复杂的API用法使这项任务颇具挑战。为此,人们期望通过自动生成代码检查器来减轻检查器开发的负担。本文提出AutoChecker,这是一种由LLM驱动的创新方法,仅凭规则描述和测试套件即可自动编写代码检查器。为了获得全面的检查逻辑,AutoChecker每次聚焦于解决一个选定的用例,以增量方式更新检查器的逻辑。为了获得精确的API知识,它在每次迭代中利用细粒度的逻辑引导API上下文检索:先将检查逻辑分解为一系列子操作,再为每个子操作检索与检查器相关的API上下文。在评估中,我们使用多个LLM,将AutoChecker、五个基线方法和三个消融方法用于为20条随机选取的PMD规则生成检查器。实验结果表明,AutoChecker在所有有效性指标上均显著优于其他方法,平均测试通过率为82.28%。此外,AutoChecker生成的检查器可以成功应用于真实项目,其表现与官方检查器相当。

Article 7

Title@2025-07-17 (4): Investigating the Performance of Small Language Models in Detecting Test Smells in Manual Test Cases

Title: Investigating the Performance of Small Language Models in Detecting Test Smells in Manual Test Cases

Untersuchung der Leistungsfähigkeit kleiner Sprachmodelle bei der Erkennung von Testriechen in manuellen Testfällen

调查小语言模型在人工试验案件中检测测试嗅觉方面的性能 2507.13035v1

Authors (6): Keila Lucas, Rohit Gheyi, Márcio Ribeiro, Fabio Palomba, Luana Martins, Elvys Soares

Manual testing, in which testers follow natural language instructions to validate system behavior, remains crucial for uncovering issues not easily captured by automation. However, these test cases often suffer from test smells, quality issues such as ambiguity, redundancy, or missing checks that reduce test reliability and maintainability. While detection tools exist, they typically require manual rule definition and lack scalability. This study investigates the potential of Small Language Models (SLMs) for automatically detecting test smells. We evaluate Gemma3, Llama3.2, and Phi-4 on 143 real-world Ubuntu test cases, covering seven types of test smells. Phi-4 achieved the best results, reaching a pass@2 of 97% in detecting sentences with test smells, while Gemma3 and Llama3.2 reached approximately 91%. Beyond detection, SLMs autonomously explained issues and suggested improvements, even without explicit prompt instructions. They enabled low-cost, concept-driven identification of diverse test smells without relying on extensive rule definitions or syntactic analysis. These findings highlight the potential of SLMs as efficient tools that preserve data privacy and can improve test quality in real-world scenarios.

人工测试,让测试者遵循自然语言指示来验证系统行为,对于发现自动化不易发现的问题仍然至关重要;然而,这些测试案例往往存在测试气味、质量问题,如模糊性、冗余性或缺少检查,从而降低测试可靠性和可维持性;虽然检测工具存在,但通常需要人工规则定义,且缺乏可缩放性;这项研究调查了小语言模型(SLMs)自动检测测试气味的潜力;我们评估了Gemma3、Llama3.2和Phi-4的143个真实世界Ubuntu测试案例,涉及7种测试气味;Phi-4取得了最佳结果,在用测试气味检测句中达到97%的通行证@97%,Gemma3和Llama3.2达到约91%;除了检测之外,SLMs自主解释问题并提出改进建议,即使没有明确指示;这些研究还有助于在不依赖广泛规则定义或合成分析的情况下低成本、概念驱动地识别不同测试气味。这些研究结果突出表明了SLSLSDs作为维护数据隐私的有效工具的潜力,并能改善真实世界情景的测试质量。
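
The abstract reports pass@2 without defining it. One common reading, assumed here, is the unbiased pass@k estimator used in code-generation evaluation: given n samples for an item, of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). Whether the study computes its pass@2 this way is an assumption, and the sample counts below are invented for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n is correct.

    n: total samples generated for one item
    c: number of those samples that are correct
    k: number of samples drawn without replacement
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only: 4 samples, 1 correct, draw 2.
score = pass_at_k(4, 1, 2)
```

A dataset-level figure would then be the mean of this value over all evaluated items.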

Article 8

Title@2025-07-17 (4): Risks of ignoring uncertainty propagation in AI-augmented security pipelines

Title: Risks of ignoring uncertainty propagation in AI-augmented security pipelines

Risiken der Ignorierung der Unsicherheitsausbreitung in KI-gesteigerten Sicherheitspipelines

忽视在AI强化安全管道中传播不确定性的风险 2407.14540v2

Authors (4): Emanuele Mezzi, Aurora Papotti, Fabio Massacci, Katja Tuma

The use of AI technologies is being integrated into the secure development of software-based systems, with an increasing trend of composing AI-based subsystems (with uncertain levels of performance) into automated pipelines. This presents a fundamental research challenge and seriously threatens safety-critical domains. Despite the existing knowledge about uncertainty in risk analysis, no previous work has estimated the uncertainty of AI-augmented systems given the propagation of errors in the pipeline. We provide the formal underpinnings for capturing uncertainty propagation, develop a simulator to quantify uncertainty, and evaluate the simulation of propagating errors with one case study. We discuss the generalizability of our approach and its limitations and present recommendations for evaluation policies concerning AI systems. Future work includes extending the approach by relaxing the remaining assumptions and by experimenting with a real system.

使用AI技术正在被纳入软件系统的安全开发,将AI基子系统(性能水平不确定)纳入自动化输油管的趋势日益明显,这是一个根本性的研究挑战,严重威胁到安全临界领域。尽管目前对风险分析的不确定性有了解,但以前的工作没有考虑到输油管中错误的传播而对AI强化系统的不确定性作出估计。我们为获取不确定性传播提供了正式的基础,开发了一个模拟器,以量化不确定性,并用一个案例研究对传播错误的模拟进行评估。我们讨论了我们的方法的可概括性及其局限性,并就AI系统的评价政策提出建议。未来的工作包括通过放松其余的假设和试验一个真正的系统来扩展这一方法。
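
A minimal sketch of why error propagation matters in a pipeline, under assumptions that are not taken from the paper: stages are chained, each stage is independently correct with a fixed probability, and the final output is correct only if every stage is. The function names and accuracy values are hypothetical; the paper's simulator and formal model may differ.

```python
import random
from math import prod


def analytic_success(stage_accuracies):
    """End-to-end success probability when stage errors are independent."""
    return prod(stage_accuracies)


def simulate_success(stage_accuracies, runs=20000, seed=0):
    """Monte Carlo estimate of the same quantity, using a seeded generator."""
    rng = random.Random(seed)
    successes = 0
    for _ in range(runs):
        # A run succeeds only if every stage in the chain is correct.
        if all(rng.random() < p for p in stage_accuracies):
            successes += 1
    return successes / runs


# Three hypothetical AI stages that each look reliable in isolation.
stages = [0.95, 0.90, 0.85]
expected = analytic_success(stages)
estimated = simulate_success(stages)
```

With these illustrative values the analytic end-to-end figure is 0.95 × 0.90 × 0.85 ≈ 0.727, noticeably lower than any single stage, which is the effect that evaluating components in isolation would hide.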

Article 9

Title@2025-07-17 (4): ReCode: Updating Code API Knowledge with Reinforcement Learning

Title: ReCode: Updating Code API Knowledge with Reinforcement Learning

ReCode: Aktualisierung von Code-API-Kenntnissen mit Verstärkungslernen

ReCode:更新法规API知识与强化学习 2506.20495v2

Authors (5): Haoze Wu, Yunzhi Yao, Wenhao Yu, Huajun Chen, Ningyu Zhang

Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their training data, even with access to current documentation, impedes reliable code generation in dynamic environments. To tackle this issue, we propose ReCode (rule-based Reinforcement learning for Code Update), a novel framework that mimics human programmer adaptation to API changes. Specifically, we construct a dataset of approximately 2,000 data entries to train the LLMs to perform version migration based on updated information. Then, we introduce a modified string similarity metric for code evaluation as the reward for reinforcement learning. Our experiments demonstrate that ReCode substantially boosts LLMs’ code generation performance in dynamic API scenarios, especially on the unseen CodeUpdateArena task. Crucially, compared to supervised fine-tuning, ReCode has less impact on LLMs’ general code generation abilities. We apply ReCode to various LLMs and reinforcement learning algorithms (GRPO and DAPO), all achieving consistent improvements. Notably, after training, Qwen2.5-Coder-7B outperforms the 32B-parameter code instruction-tuned model and the reasoning model with the same architecture. Code is available at https://github.com/zjunlp/ReCode.

大语言模型(LLM)具有出色的代码生成能力,但在适应外部库API的频繁更新时表现不佳。这一关键局限源于模型依赖训练数据中过时的API知识,即使能够查阅最新文档也是如此,它妨碍了在动态环境中可靠地生成代码。为解决这一问题,我们提出ReCode(rule-based Reinforcement learning for Code Update,面向代码更新的基于规则的强化学习),这是一个模拟人类程序员适应API变化的新框架。具体而言,我们构建了一个约含2,000条数据的数据集,用于训练LLM根据更新后的信息执行版本迁移。随后,我们引入一种改进的字符串相似度指标用于代码评估,并将其作为强化学习的奖励。实验表明,ReCode显著提升了LLM在动态API场景下的代码生成性能,在未见过的CodeUpdateArena任务上尤为明显。重要的是,与监督微调相比,ReCode对LLM通用代码生成能力的影响更小。我们将ReCode应用于多种LLM和强化学习算法(GRPO和DAPO),均取得了一致的提升。值得注意的是,经过训练后,Qwen2.5-Coder-7B的表现优于32B参数的代码指令微调模型以及相同架构的推理模型。代码见 https://github.com/zjunlp/ReCode。
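
The abstract does not spell out the modified string similarity metric, so the sketch below only illustrates the general idea of using string similarity as a rule-based reward. It substitutes the standard library's `difflib.SequenceMatcher` over whitespace-separated tokens; the function name `similarity_reward` and the example snippets are hypothetical and are not ReCode's actual metric or data.

```python
from difflib import SequenceMatcher


def similarity_reward(generated, reference):
    """Return a reward in [0, 1]: token-level similarity between two code strings."""
    gen_tokens = generated.split()
    ref_tokens = reference.split()
    if not gen_tokens and not ref_tokens:
        # Two empty programs are treated as identical.
        return 1.0
    return SequenceMatcher(None, gen_tokens, ref_tokens).ratio()


# Hypothetical migration target and two candidate completions.
reference = "result = new_api.load(path, strict=True)"
exact = "result = new_api.load(path, strict=True)"
stale = "result = old_api.read(path)"

reward_exact = similarity_reward(exact, reference)
reward_stale = similarity_reward(stale, reference)
```

A reward of this kind needs no test execution, which keeps it cheap to compute during reinforcement learning, at the cost of penalizing correct code that is written differently from the reference.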

Article 10

Title@2025-07-17 (4): The Case for Contextual Copyleft: Licensing Open Source Training Data and Generative AI

Title: The Case for Contextual Copyleft: Licensing Open Source Training Data and Generative AI

Der Fall für Contextual Copyleft: Lizenzierung von Open Source Trainingsdaten und Generative KI

上下文翻转:为开放源码培训数据发放许可证的案例 2507.12713v1

Authors (5): Grant Shanklin, Emmie Hine, Claudio Novelli, Tyler Schroder, Luciano Floridi

The proliferation of generative AI systems has created new challenges for the Free and Open Source Software (FOSS) community, particularly regarding how traditional copyleft principles should apply when open source code is used to train AI models. This article introduces the Contextual Copyleft AI (CCAI) license, a novel licensing mechanism that extends copyleft requirements from training data to the resulting generative AI models. The CCAI license offers significant advantages, including enhanced developer control, incentivization of open source AI development, and mitigation of openwashing practices. This is demonstrated through a structured three-part evaluation framework that examines (1) legal feasibility under current copyright law, (2) policy justification comparing traditional software and AI contexts, and (3) synthesis of cross-contextual benefits and risks. However, the increased risk profile of open source AI, particularly the potential for direct misuse, necessitates complementary regulatory approaches to achieve an appropriate risk-benefit balance. The paper concludes that when implemented within a robust regulatory environment focused on responsible AI usage, the CCAI license provides a viable mechanism for preserving and adapting core FOSS principles to the evolving landscape of generative AI development.

突现型AI系统的扩散给自由和开放源码软件(FOSS)社区带来了新的挑战,特别是在使用开放源码培训AI模式时,传统抄录左派原则应如何适用方面,本条介绍了背景翻录式AI(CCAI)许可证,这是一个将培训数据复制要求扩展至由此产生的基因化AI模式的新发许可证机制;CACI许可证具有重大优势,包括加强开发者控制、鼓励开发开放源码AI和减少露天洗涤做法,这通过一个结构化的三部分评价框架得到证明,该框架审查:(1) 现行版权法下的法律可行性;(2) 将传统软件与AI环境进行比较的政策理由;(3) 综合交叉文本的好处和风险;然而,由于开放源的AI风险简介增加,特别是直接滥用的可能性增加,有必要采取补充性监管办法,以实现适当的风险-利益平衡;文件的结论是,如果在一个以负责任的AI使用为重点的稳健的监管环境内实施,CAPI许可证提供了一种可行的机制,用于维护和调整核心自由和开放源码软件原则,以适应正在演变的AI型发展的格局。

Article 11

Title@2025-07-17 (4): CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance

Title: CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance

CodeAssistBench (CAB): Datensatz & Benchmarking für Multiturn-Chat-basierte Code-Unterstützung

代码协助站(CAB):多功能聊天代码援助的数据集和基准 2507.10646v2

Authors (5): Myeongsoo Kim, Shweta Garg, Baishakhi Ray, Varun Kumar, Anoop Deoras

Programming assistants powered by large language models have transformed software development, yet most benchmarks focus narrowly on code generation tasks. Recent efforts like InfiBench and StackEval attempt to address this gap using Stack Overflow data but remain limited to single-turn interactions in isolated contexts, require significant manual curation, and fail to represent complete project environments. We introduce CodeAssistBench (CAB), the first benchmark framework for evaluating multi-turn programming assistance in realistic settings that address real-world questions about actual codebases. Unlike existing programming Q&A benchmarks, CAB automatically generates scalable datasets from question-related GitHub issues using configurable parameters (e.g., repository creation date, star count, programming languages), and includes automatic containerization of codebases for evaluation. It then evaluates models through simulated users in these containerized environments with full codebase access. Using this framework, we constructed a test set of 3,286 real-world programming questions across 231 repositories, spanning seven programming languages and diverse problem domains. Our evaluation of leading LLMs reveals a substantial capability gap: while models perform well on Stack Overflow questions with success rates of 70-83%, they resolve only up to 16.49% of CAB’s recent issues. This discrepancy highlights the challenges of providing assistance in complex, project-specific contexts versus answering standalone questions.

由大语言模型驱动的编程助手已经改变了软件开发,但大多数基准仍狭隘地聚焦于代码生成任务。InfiBench和StackEval等近期工作尝试利用Stack Overflow数据弥补这一差距,但仍局限于孤立语境下的单轮交互,需要大量人工整理,且无法体现完整的项目环境。我们提出CodeAssistBench(CAB),这是首个在真实场景中评估多轮编程辅助的基准框架,所针对的是关于实际代码库的真实问题。与现有的编程问答基准不同,CAB利用可配置参数(例如仓库创建日期、星标数、编程语言)从与提问相关的GitHub议题中自动生成可扩展的数据集,并对代码库进行自动容器化以供评估。随后,它在这些可访问完整代码库的容器化环境中,通过模拟用户对模型进行评估。基于该框架,我们构建了一个测试集,包含来自231个仓库的3,286个真实编程问题,涵盖七种编程语言和多种问题领域。我们对主流LLM的评估揭示了显著的能力差距:模型在Stack Overflow问题上表现良好,成功率为70-83%,但对CAB中的近期议题最多只能解决16.49%。这一差异凸显了在复杂的、项目特定的语境中提供辅助,与回答独立问题相比所面临的挑战。

Article 12

Title@2025-07-17 (4): GUI Test Migration via Abstraction and Concretization

Title: GUI Test Migration via Abstraction and Concretization

GUI-Test-Migration über Abstraktion und Konkretisierung

GUI 通过抽象和简明化测试移民 2409.05028v2

Authors (7): Yakun Zhang, Chen Liu, Xiaofei Xie, Yun Lin, Jin Song Dong, Dan Hao, Lu Zhang

GUI test migration aims to produce test cases with events and assertions to test specific functionalities of a target app. Existing migration approaches typically focus on the widget-mapping paradigm that maps widgets from source apps to target apps. However, since different apps may implement the same functionality in different ways, direct mapping may result in incomplete or buggy test cases, thus significantly impacting the effectiveness of testing target functionality and the practical applicability of migration approaches. In this paper, we propose a new migration paradigm (i.e., the abstraction-concretization paradigm) that first abstracts the test logic for the target functionality and then utilizes this logic to generate the concrete GUI test case. Furthermore, we introduce MACdroid, the first approach that migrates GUI test cases based on this paradigm. Specifically, we propose an abstraction technique that utilizes source test cases from source apps targeting the same functionality to extract a general test logic for that functionality. Then, we propose a concretization technique that utilizes the general test logic to guide an LLM in generating the corresponding GUI test case (including events and assertions) for the target app. We evaluate MACdroid on two widely used datasets (including 31 apps, 34 functionalities, and 123 test cases). On the FrUITeR dataset, the test cases generated by MACdroid successfully test 64% of the target functionalities, improving on the baselines by 191%. On the Lin dataset, MACdroid successfully tests 75% of the target functionalities, outperforming the baselines by 42%. These results underscore the effectiveness of MACdroid in GUI test migration.

GUI测试迁移旨在生成包含事件和断言的测试用例,以测试目标应用的特定功能。现有的迁移方法通常采用控件映射范式,即将源应用的控件映射到目标应用。然而,由于不同应用可能以不同方式实现同一功能,直接映射可能产生不完整或有缺陷的测试用例,从而严重影响目标功能的测试效果以及迁移方法的实际适用性。本文提出一种新的迁移范式(即抽象-具体化范式):先抽象出目标功能的测试逻辑,再利用该逻辑生成具体的GUI测试用例。此外,我们提出MACdroid,这是首个基于该范式迁移GUI测试用例的方法。具体而言,我们提出一种抽象技术,利用针对同一功能的多个源应用的源测试用例,提取该功能的通用测试逻辑;然后提出一种具体化技术,利用通用测试逻辑引导LLM为目标应用生成相应的GUI测试用例(包括事件和断言)。我们在两个广泛使用的数据集(共包含31个应用、34项功能和123个测试用例)上评估了MACdroid。在FrUITeR数据集上,MACdroid生成的测试用例成功测试了64%的目标功能,比基线提升191%;在Lin数据集上,MACdroid成功测试了75%的目标功能,比基线高出42%。这些结果凸显了MACdroid在GUI测试迁移中的有效性。

Article 13

Title@2025-07-17 (4): AI Safety in the Eyes of the Downstream Developer: A First Look at Concerns, Practices, and Challenges

Title: AI Safety in the Eyes of the Downstream Developer: A First Look at Concerns, Practices, and Challenges

KI-Sicherheit in den Augen des Downstream-Entwicklers: Ein erster Blick auf Bedenken, Praktiken und Herausforderungen

AI 下游开发者眼中的安全:首先审视关注、做法和挑战 2503.19444v3

Authors (6): Haoyu Gao, Mansooreh Zahedi, Wenxin Jiang, Hong Yi Lin, James Davis, Christoph Treude

Pre-trained models (PTMs) have become a cornerstone of AI-based software, allowing for rapid integration and development with minimal training overhead. However, their adoption also introduces unique safety challenges, such as data leakage and biased outputs, that demand rigorous handling by downstream developers. While previous research has proposed taxonomies of AI safety concerns and various mitigation strategies, how downstream developers address these issues remains unexplored. This study investigates downstream developers’ concerns, practices and perceived challenges regarding AI safety issues during AI-based software development. To achieve this, we conducted a mixed-method study, including interviews with 18 participants, a survey of 86 practitioners, and an analysis of 874 AI incidents from the AI Incident Database. Our results reveal that while developers generally demonstrate strong awareness of AI safety concerns, their practices, especially during the preparation and PTM selection phases, are often inadequate. The lack of concrete guidelines and policies leads to significant variability in the comprehensiveness of their safety approaches throughout the development lifecycle, with additional challenges such as poor documentation and knowledge gaps, further impeding effective implementation. Based on our findings, we offer suggestions for PTM developers, AI-based software developers, researchers, and policy makers to enhance the integration of AI safety measures.

预先培训的模型已成为AI软件的基石,允许快速整合和开发,尽量减少培训管理费用;然而,采用这些模型还带来了独特的安全挑战,如数据泄漏和偏差产出等,需要下游开发商严格处理。虽然以前的研究提出了AI安全问题分类和各种缓解战略,但下游开发商如何解决这些问题仍未探讨。本研究报告调查了下游开发商在AI软件开发过程中对AI安全问题的关切、做法和所察觉的挑战。为此,我们开展了一项混合方法研究,包括与18名参与者的访谈、对86名从业人员的调查以及AI事件数据库对874起AI事件的分析。我们的结果显示,虽然开发商一般都对AI安全问题有强烈的认识,但他们的做法,特别是在准备和PTM选择阶段,往往不够充分。缺乏具体的指导方针和政策导致他们在整个发展生命周期内安全方法的全面性存在极大的差异,例如文件不全和知识差距,进一步阻碍有效执行。我们根据调查结果,向IPTM开发商、AI软件开发商、研究人员和决策者提出建议,以加强AI的安全措施。

Article 14

Title@2025-07-17 (4): When Domains Collide: An Activity Theory Exploration of Cross-Disciplinary Collaboration

Title: When Domains Collide: An Activity Theory Exploration of Cross-Disciplinary Collaboration

When Domains Collide: Eine Aktivitätstheorie zur Erforschung der disziplinübergreifenden Zusammenarbeit

当域碰撞:跨纪律协作活动理论探索时 2506.20063v2

Authors (6): Zixuan Feng, Thomas Zimmermann, Lorenzo Pisani, Christopher Gooley, Jeremiah Wander, Anita Sarma

Background: Software development teams are increasingly diverse, embedded, and cross-disciplinary. Domain experts (DEs) from different disciplines collaborate with professional software developers (SDEs), bringing complementary expertise in creating and maintaining complex production software. However, contested expectations, divergent problem-solving perspectives, and conflicting priorities lead to friction. Aims: This study aims to investigate the dynamics of emerging collaboration of cross-disciplinary software development (CDSD) by exploring the expectations held by DEs and SDEs and understanding how these frictions manifest in practice. Method: We utilize Activity Theory (AT), a well-established socio-technical framework, as an analytical lens in a grounded, empirical investigation, conducted through a mixed-method study involving 24 interviews (12 DEs and 12 SDEs) and a large-scale validation survey with 293 participants (161 DEs and 132 SDEs). Results: We conceptualize and empirically ground the CDSD dynamics. We identified eight expectations held by SDEs and six by DEs. By mapping these expectations to AT components, we revealed 21 frictions in CDSD and illustrated where and how they arise. Conclusions: This study offers a theoretical lens for understanding the dynamics and frictions in CDSD and provides actionable insights for future research, practitioners, and infrastructure design.

软件开发团队日益多样化、嵌入和跨学科。来自不同学科的专家与专业软件开发者(SDEs)合作,在创建和维护复杂的生产软件方面提供互补的专门知识。然而,有争议的期望、不同的解决问题观点和相互冲突的优先事项导致摩擦。目的:本研究的目的是通过探索DEs和SDEs持有的期望并了解这些摩擦在实践中如何表现来调查跨学科软件开发(CDSD)新兴协作的动态,并了解这些摩擦的实际表现。方法:我们利用活动理论(AT)这个成熟的社会技术框架,作为基础、经验性调查的分析透镜,通过由24次访谈(12个DEs和12个SDEs)进行的混合方法研究以及293名参与者(161个DEs和132个SDEs)进行的大规模验证调查,进行。结果:我们从概念上和从经验上确定了CDSD动态的8项期望和DEs所持有的6项期望。我们通过向AT组成部分绘制这些期望图,揭示了CDSD的21项摩擦,并说明了它们在何处和如何产生的。结论:本研究为CDSD的未来研究、可理解的理论视角,为CDSD设计中的动态和设计提供了可理解。

Article 15

Title@2025-07-16 (3): ParaStudent: Generating and Evaluating Realistic Student Code by Teaching LLMs to Struggle

Title: ParaStudent: Generating and Evaluating Realistic Student Code by Teaching LLMs to Struggle

ParaStudent: Erzeugen und Evaluieren des Realistischen Studentenkodex durch Lehre von LLMs zum Kampf

副专业学生:通过教授LLMs进行斗争,产生和评价现实学生守则 2507.12674v1

Authors (5): Mihran Miroyan, Rose Niousha, Joseph E. Gonzalez, Gireeja Ranade, Narges Norouzi

Large Language Models (LLMs) have shown strong performance on programming tasks, but can they generate code like real students: imperfect, iterative, and stylistically diverse? We present ParaStudent, a systematic study of LLM-based "student-like" code generation in an introductory programming course setting. Using a dataset of timestamped student submissions across multiple semesters, we design low- and high-resolution experiments to model student progress and evaluate code outputs along semantic, functional, and stylistic dimensions. Our results show that fine-tuning significantly improves alignment with real student trajectories and captures error patterns, incremental improvements, and stylistic variations more faithfully. This study shows that modeling realistic student code requires capturing learning dynamics through context-aware generation, temporal modeling, and multi-dimensional evaluation. Code for experiments and evaluation is available at https://github.com/mmiroyan/ParaStudent.

大型语言模型(LLMS)在编程任务方面表现良好,但是它们能产生像真实学生一样的学生代码吗?我们在一个介绍性编程课程设置中提出ParaStudent,这是对基于LLM的“学生类”代码生成的系统研究。我们用一套多学期学生提交的时间标记数据集设计低和高分辨率实验,以模拟学生进步,并评价语义、功能和文理方面的代码输出。我们的结果显示,微调大大改进了与真实学生轨迹和捕捉错误模式、递增改进和文理变化的匹配。这项研究显示,模拟现实的学生代码需要通过背景认知生成、时间模型和多维评价来捕捉学习动态。实验和评价代码可在以下网站查阅:https://github.com/mmiroyan/ParaStudent。

Article 16

Title@2025-07-16 (3): Single Conversation Methodology: A Human-Centered Protocol for AI-Assisted Software Development

Title: Single Conversation Methodology: A Human-Centered Protocol for AI-Assisted Software Development

Single Conversation Methodology: Ein Human-Centered-Protokoll für KI-Assisted Software Development

单一对话方法:AI协助软件开发的以人为中心的议定书 2507.12665v1

Authors (1): Salvador D. Escobedo

We propose the Single Conversation Methodology (SCM), a novel and pragmatic approach to software development using large language models (LLMs). In contrast to ad hoc interactions with generative AI, SCM emphasizes a structured and persistent development dialogue, where all stages of a project - from requirements to architecture and implementation - unfold within a single, long-context conversation. The methodology is grounded on principles of cognitive clarity, traceability, modularity, and documentation. We define its phases, best practices, and philosophical stance, while arguing that SCM offers a necessary correction to the passive reliance on LLMs prevalent in current practices. We aim to reassert the active role of the developer as architect and supervisor of the intelligent tool.

我们提出单一对话方法,这是使用大型语言模型开发软件的一种新颖和务实的方法。与与具有基因的AI进行特殊互动相反,SCM强调分阶段和持续的发展对话,项目的所有阶段――从要求到建筑和执行――都在同一长期的谈话中展开,该方法以认知清晰、可追溯性、模块化和文件等原则为基础。我们界定了其阶段、最佳做法和哲学立场,同时认为SCM为目前做法中普遍存在的被动依赖LLMs提供了必要的纠正。我们的目标是重申开发商作为智能工具的建筑师和监督员的积极作用。

Article 17

Title@2025-07-16 (3): A Fuzzy Approach to Project Success: Measuring What Matters

Title: A Fuzzy Approach to Project Success: Measuring What Matters

Ein fuzzy Ansatz zum Projekt Erfolg: Messen, was zählt

项目成功:衡量重要事项的模糊方法 2507.12653v1

Authors (4): João Granja-Correia, Remedios Hernández-Linares, Luca Ferranti, Arménio Rego

This paper introduces a novel approach to project success evaluation by integrating fuzzy logic into an existing construct. Traditional Likert-scale measures often overlook the context-dependent and multifaceted nature of project success. The proposed hierarchical Type-1 Mamdani fuzzy system prioritizes sustained positive impact for end-users, reducing emphasis on secondary outcomes like stakeholder satisfaction and internal project success. This dynamic approach may provide a more accurate measure of project success and could be adaptable to complex evaluations. Future research will focus on empirical testing and broader applications of fuzzy logic in social science.

本文件介绍了一种新的项目成功评价方法,将模糊的逻辑纳入现有结构中,传统的类似标准措施往往忽视项目成功的背景和多面性。拟议的第1级Mamdani模糊系统优先考虑对最终用户的持续积极影响,减少对利益攸关方满意度和内部项目成功率等次级成果的强调。这种动态方法可以更准确地衡量项目成功率,并适应复杂的评价。未来研究将侧重于经验测试和社会科学中模糊逻辑的更广泛应用。
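
The Mamdani inference named in the abstract above can be illustrated with a small sketch. This is not the authors' system: the hierarchy is not reproduced, and the membership functions, the 0-10 input scale, the 0-100 output scale, and the rule base are all assumptions chosen only to show how end-user impact can dominate a secondary input such as stakeholder satisfaction.

```python
def tri(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

# Hypothetical fuzzy sets for inputs rated on a 0-10 scale.
LOW, MEDIUM, HIGH = (-5, 0, 5), (0, 5, 10), (5, 10, 15)
# Hypothetical fuzzy sets for the 0-100 project success output.
OUT_SETS = {"low": (-50, 0, 50), "medium": (0, 50, 100), "high": (50, 100, 150)}

def project_success(impact, satisfaction, steps=101):
    """Single-layer Mamdani inference with centroid defuzzification."""
    # Rule strengths: AND is min, OR is max. Impact alone decides the extremes;
    # satisfaction only matters when impact is middling (assumed rule base).
    strength = {
        "low": max(tri(impact, *LOW),
                   min(tri(impact, *MEDIUM), tri(satisfaction, *LOW))),
        "medium": min(tri(impact, *MEDIUM), tri(satisfaction, *MEDIUM)),
        "high": max(tri(impact, *HIGH),
                    min(tri(impact, *MEDIUM), tri(satisfaction, *HIGH))),
    }
    num = den = 0.0
    for i in range(steps):
        y = 100 * i / (steps - 1)
        # Clip each output set at its rule strength, then aggregate with max.
        mu = max(min(strength[k], tri(y, *OUT_SETS[k])) for k in OUT_SETS)
        num += y * mu
        den += mu
    return num / den if den else 50.0

print(project_success(impact=9, satisfaction=2))
print(project_success(impact=2, satisfaction=9))
```

With this rule base, a project with high end-user impact scores well even when stakeholder satisfaction is low, which mirrors the prioritization the abstract describes.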

Article 18

Title@2025-07-16 (3): A Three-Phase Evaluation Approach for new Information and Data Models in the Smart Grid Domain

Title: A Three-Phase Evaluation Approach for new Information and Data Models in the Smart Grid Domain

Ein dreiphasiger Evaluierungsansatz für neue Informations- und Datenmodelle im Bereich Smart Grid

智能网域新信息和数据模型的三阶段评价方法 2507.12649v1

Authors (3): Christine van Stiphoudt, Sergio Potenciano Menci, Gilbert Fridgen

The ongoing digitalisation of the smart grid is resulting in an increase in automated information exchanges across distributed energy systems. This process has led to the development of new information and data models when the existing ones fall short. To prevent potential disruptions caused by flaws in the newly designed information and data models, it is essential to evaluate them during the design process before they are implemented in operation. Currently, general explicit evaluation approaches outside the smart grid domain stay at a high level without defining clear steps. Meanwhile, implicit evaluation approaches in the smart grid domain focus on testing systems that utilise information and data models already in use for functionality in terms of conformance and interoperability. Notably, no combination of explicit and implicit evaluation approaches for newly designed information and data models offers a clearly defined set of steps during their design process in the smart grid context. Consequently, we design a three-phase evaluation approach using design science research to address this gap. Our evaluation approach combines explicit and implicit evaluation methods and is applicable when developing new information and data models. We use the development of an information model and data model focused on industrial flexibility descriptions to refine our evaluation approach. Additionally, we provide lessons learned from our experience.

智能网格的不断数字化导致分布式能源系统之间自动信息交流的增加。这一进程导致在现有信息和数据模型不完善时开发新的信息和数据模型。为了防止新设计的信息和数据模型的缺陷可能造成干扰,在设计过程中必须评估这些模型,然后才能实施。目前,智能网格网域之外的一般性明确评价方法保持在高水平上,而没有确定明确的步骤。与此同时,智能网域域域的隐含评价方法侧重于测试系统,这些系统利用在一致性和互操作性方面功能已经使用的信息和数据模型。值得注意的是,对于新设计的信息和数据模型,没有采用明确和隐含的评价方法,在设计过程中在智能网格背景下提供一套明确界定的步骤。因此,我们设计了三阶段评价方法,利用设计科学研究来弥补这一差距。我们的评价方法将明确和隐含的评价方法结合起来,在开发新的信息和数据模型时适用。我们利用开发的信息模型和数据模型侧重于工业灵活性说明来改进我们的评价方法。此外,我们从我们的经验中总结了经验教训。

Article 19

Title@2025-07-16 (3): QSpark: Towards Reliable Qiskit Code Generation

Title: QSpark: Towards Reliable Qiskit Code Generation

QSpark: Auf dem Weg zur zuverlässigen Qiskit-Code-Generierung

QSpark:迈向可靠的基斯基特代码生成 2507.12642v1

Authors (4): Kiana Kheiri, Aamna Aamir, Andriy Miranskyy, Chen Ding

Quantum circuits must be error-resilient, yet LLMs like Granite-20B-Code and StarCoder often output flawed Qiskit code. We fine-tuned a 32B model with two RL methods, Group Relative Policy Optimization (GRPO) and Odds-Ratio Preference Optimization (ORPO), using a richly annotated synthetic dataset. On the Qiskit HumanEval benchmark, ORPO reaches 56.29% Pass@1 (about +10 pp over Granite-8B-QK) and GRPO hits 49%, both beating all general-purpose baselines; on the original HumanEval they score 65.90% and 63.00%. GRPO excels on basic tasks (42/54), ORPO on intermediate ones (41/68), and neither solves the five advanced tasks, highlighting clear gains yet room for progress in AI-assisted quantum programming.

量子电路必须具有抗误性,然而,Granite-20B-Code和StarCoder等LLMs通常会输出有缺陷的Qiskit代码。我们使用大量注解的合成数据集,用两种RL方法,即Group Relative Policy Optimization(GRPO)和Odds-Ratio Preference Optimization(ORPO),对32B模型进行了微调。在Qiskit HumanEval基准上,ORPO达到56.29% Pass@1(比Granite-8B-QK高约10个百分点),GRPO达到49%,两者都击败了所有通用基线;在原始HumanEval上,它们分别得65.90%和63.00%。GRPO在基本任务(42/54)上表现突出,ORPO在中间任务(41/68)上表现突出,两者都没有解决五项高级任务,这突出表明AI辅助量子编程取得了明显进展,但仍有进步空间。
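
The Pass@1 figures in the QSpark abstract are pass@k scores with k = 1. The sketch below assumes the unbiased estimator commonly used with HumanEval-style benchmarks; the abstract does not say which estimator or how many samples per task were used, so the function name and the per-task counts are illustrative only.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task: n samples drawn, c of them correct."""
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-task results: (samples generated, samples passing the tests).
results = [(10, 10), (10, 3), (10, 0)]
score = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(f"pass@1 = {score:.4f}")
```

For k = 1 the estimator reduces to the fraction of correct samples per task, averaged over tasks.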

Article 20

Title@2025-07-16 (3): ROSE: Transformer-Based Refactoring Recommendation for Architectural Smells

Title: ROSE: Transformer-Based Refactoring Recommendation for Architectural Smells

ROSE: Transformerbasierte Refactoring-Empfehlung für architektonische Gerüche

ROSE: 以变压器为基础的建筑气味重建建议 2507.12561v1

Authors (4): Samal Nursapa, Anastassiya Samuilova, Alessio Bucaioni, Phuong T. Nguyen

Architectural smells such as God Class, Cyclic Dependency, and Hub-like Dependency degrade software quality and maintainability. Existing tools detect such smells but rarely suggest how to fix them. This paper explores the use of two pre-trained transformer models, CodeBERT and CodeT5, for recommending suitable refactorings based on detected smells. We frame the task as a three-class classification problem and fine-tune both models on over 2 million refactoring instances mined from 11,149 open-source Java projects. CodeT5 achieves 96.9% accuracy and 95.2% F1, outperforming CodeBERT and traditional baselines. Our results show that transformer-based models can effectively bridge the gap between smell detection and actionable repair, laying the foundation for future refactoring recommendation systems. We release all code, models, and data under an open license to support reproducibility and further research.

建筑结构的气味, 如上帝级、环球依赖性和类似 Hub 的依附性等, 降低了软件质量和可维护性。现有工具检测到这种气味, 但很少建议如何修复这些气味。本文探索了使用预先训练的变压器模型- CodeBERT 和 CodeT5 来根据检测到的气味建议适当的再设定因素。我们将此任务设定为三级分类问题, 并对11,149 开放源的爪哇项目中200多万个重设事件的模式进行微调。代码T5 实现了96.9%的准确度, 95.2% F1, 超过了业绩好的代码BERT 和传统基线。我们的结果表明, 变压器模型可以有效地弥合气味检测和可操作的修理之间的差距, 为未来的再调节建议系统打下基础。我们发布所有代码、模型和数据, 以公开的许可支持可复制和进一步的研究。

Article 21

Title@2025-07-16 (3): When Retriever Meets Generator: A Joint Model for Code Comment Generation

Title: When Retriever Meets Generator: A Joint Model for Code Comment Generation

Wenn Retriever trifft Generator: Ein gemeinsames Modell für Code Comment Generation

当再利用与生成器相遇时: 代码Comment生成联合模式 2507.12558v1

Authors (5): Tien P. T. Le, Anh M. T. Bui, Huy N. D. Pham, Alessio Bucaioni, Phuong T. Nguyen

Automatically generating concise, informative comments for source code can lighten documentation effort and accelerate program comprehension. Retrieval-augmented approaches first fetch code snippets with existing comments and then synthesize a new comment, yet retrieval and generation are typically optimized in isolation, allowing irrelevant neighbors to propagate noise downstream. To tackle the issue, we propose a novel approach named RAGSum with the aim of both effectiveness and efficiency in recommendations. RAGSum fuses retrieval and generation using a single CodeT5 backbone. We report preliminary results on a unified retrieval-generation framework built on CodeT5. A contrastive pre-training phase shapes code embeddings for nearest-neighbor search; these weights then seed end-to-end training with a composite loss that (i) rewards accurate top-k retrieval; and (ii) minimizes comment-generation error. More importantly, a lightweight self-refinement loop is deployed to polish the final output. We evaluated the framework on three cross-language benchmarks (Java, Python, C), and compared it with three well-established baselines. The results show that our approach substantially outperforms the baselines with respect to BLEU, METEOR, and ROUGE-L. These findings indicate that tightly coupling retrieval and generation can raise the ceiling for comment automation and motivate forthcoming replications and qualitative developer studies.

为源代码自动生成简明、信息化的评论可以减轻文件工作,并加速程序理解。检索强化方法首先用现有评论获取代码片断,然后合成新的评论,然而,检索和生成通常在孤立的情况下优化,允许不相关的邻居在下游对噪音进行排解。为了解决这个问题,我们提议了一个名为RAGSum的新颖方法,其目的在于提高建议的效力和效率。RAGSum建在顶部的离线检索和生成上方,使用单一的代码T5主干线。我们报告了在代码T5基础上建立的统一检索-生成框架的初步结果。一个对比式的训练前阶段将代码嵌入最近的邻居搜索中;这些重量然后是种子端到端的培训,其复合损失(一) 奖励准确的顶级检索;以及 (二) 尽量减少评论生成错误。更重要的是,将一个轻量的自我修整环安装在最上层上方,以光滑动的最后输出。我们用三个跨语言基准(Java、Python、C)对框架进行了评估,并将它与三个完善的基线进行对比; 这些重量制质量到质量到质量级的代码, 显示我们不断的循环的复制和不断的循环的复制结果。
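
The composite loss described in the RAGSum abstract combines a retrieval term and a generation term. The sketch below is one plausible form, not the paper's formulation: the InfoNCE-style retrieval loss, the mean negative log-likelihood for generation, the temperature, and the fixed weighting are all assumptions, and the similarity scores and token probabilities are made-up numbers.

```python
import math

def retrieval_loss(similarities, positive_index, temperature=0.1):
    """InfoNCE-style loss: low when the relevant snippet has the highest similarity."""
    logits = [s / temperature for s in similarities]
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[positive_index]

def generation_loss(token_probs):
    """Mean negative log-likelihood of the reference comment tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def composite_loss(similarities, positive_index, token_probs, weight=0.5):
    """Weighted sum of both terms; the weighting scheme is an assumption."""
    return (weight * retrieval_loss(similarities, positive_index)
            + (1 - weight) * generation_loss(token_probs))

# Hypothetical query-to-candidate similarities and per-token probabilities.
sims = [0.9, 0.2, 0.1]
probs = [0.8, 0.7, 0.9]
print(composite_loss(sims, positive_index=0, token_probs=probs))  # relevant snippet ranked first
print(composite_loss(sims, positive_index=2, token_probs=probs))  # relevant snippet ranked last
```

Because both terms share one scalar objective, a gradient step that improves generation also pressures the retriever, which is the coupling the abstract argues for.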

Article 22

Title@2025-07-16 (3): Machine Learning Systems: A Survey from a Data-Oriented Perspective

Title: Machine Learning Systems: A Survey from a Data-Oriented Perspective

Machine Learning Systems: Eine Umfrage aus datenorientierter Perspektive

机械学习系统:从数据导向的角度进行调查 2302.04810v3

Authors (4): Christian Cabrera, Andrei Paleyes, Pierre Thodoroff, Neil D. Lawrence

Engineers are deploying ML models as parts of real-world systems with the upsurge of AI technologies. Real-world environments challenge the deployment of such systems because these environments produce large amounts of heterogeneous data, and users require increasingly efficient responses. These requirements push prevalent software architectures to the limit when deploying ML-based systems. Data-oriented Architecture (DOA) is an emerging style that equips systems better for integrating ML models. Even though papers on deployed ML systems do not mention DOA, their authors made design decisions that implicitly follow DOA. Implicit decisions create a knowledge gap, limiting the practitioners' ability to implement ML-based systems. This paper surveys why, how, and to what extent practitioners have adopted DOA to implement and deploy ML-based systems. We overcome the knowledge gap by answering these questions and explicitly showing the design decisions and practices behind these systems. The survey follows a well-known systematic and semi-automated methodology for reviewing papers in software engineering. The majority of reviewed works partially adopt DOA. Such an adoption enables systems to address requirements such as Big Data management, low latency processing, resource management, security and privacy. Based on these findings, we formulate practical advice to facilitate the deployment of ML-based systems.

随着AI技术的激增,工程师正在将ML模型作为现实世界系统的一部分加以部署。现实世界环境对此类系统的部署提出了挑战,因为这些环境产生大量不同的数据,用户需要越来越高效的反应。这些要求将流行的软件结构推向部署以ML为基础的系统时的极限。以数据为导向的建筑(DOA)是一种新兴的风格,为整合ML模型提供了更好的系统。即使已部署的ML系统的文件没有提到DOA,但其作者却不言而喻地根据DOA作出了设计决定。隐含的决定造成了知识差距,限制了从业人员实施ML系统的能力。 \hlb{本文调查了为什么、如何以及在何种程度上从业人员采用了DOA来实施和部署以ML为基础的系统。}我们通过回答这些问题并明确展示这些系统背后的设计决定和做法,克服了知识差距。调查遵循了一种众所周知的系统性和半自动化方法来审查软件工程文件。经过审查的大多数作品部分采用DA。这种应用使系统能够满足诸如大数据管理、低密度处理、资源管理、安全和隐私部署发现等要求。

Article 23

Title@2025-07-16 (3): SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?

Title: SWE-Perf: Can Language Models Optimize Code Performance on Real-World Repositories?

SWE-Perf: Können Sprachmodelle die Code-Performance auf realen Repositories optimieren?

SWE-Perf:语言模型能够优化现实世界仓库的代码性能吗? 2507.12415v1

Authors (8): Xinyi He, Qian Liu, Mingzhe Du, Lin Yan, Zhijie Fan, Yiming Huang, Zejian Yuan, Zejun Ma

Code performance optimization is paramount in real-world software engineering and critical for production-level systems. While Large Language Models (LLMs) have demonstrated impressive capabilities in code generation and bug fixing, their proficiency in enhancing code performance at the repository level remains largely unexplored. To address this gap, we introduce SWE-Perf, the first benchmark specifically designed to systematically evaluate LLMs on code performance optimization tasks within authentic repository contexts. SWE-Perf comprises 140 carefully curated instances, each derived from performance-improving pull requests from popular GitHub repositories. Each benchmark instance includes the relevant codebase, target functions, performance-related tests, expert-authored patches, and executable environments. Through a comprehensive evaluation of representative methods that span file-level and repo-level approaches (e.g., Agentless and OpenHands), we reveal a substantial capability gap between existing LLMs and expert-level optimization performance, highlighting critical research opportunities in this emerging field.

守则绩效优化在现实世界软件工程中至关重要,对于生产级系统至关重要。虽然大语言模型(LLMS)在代码生成和故障修补方面表现出令人印象深刻的能力,但其在存储处一级提高代码性能的熟练程度在很大程度上仍未得到探讨。为了弥补这一差距,我们引入了SWE-Perf,这是专门为在真实存储处背景下系统评估代码绩效优化任务而专门设计的第一个基准。SWE-Perf由140个经过仔细整理的事例组成,每个事例来自广受欢迎的GitHub存储处的改进性能拉动请求。每个基准实例包括相关的代码库、目标功能、与性能有关的测试、专家撰写的补丁和可执行环境。通过对具有代表性的方法(例如无代理人和开放人)进行全面评价,我们发现现有LMS和专家级优化业绩之间存在巨大的能力差距,突出了这个新兴领域的关键研究机会。

Article 24

Title@2025-07-16 (3): GitChameleon: Evaluating AI Code Generation Against Python Library Version Incompatibilities

Title: GitChameleon: Evaluating AI Code Generation Against Python Library Version Incompatibilities

GitChameleon: Bewertung der KI-Code-Generation gegen Python Library Version Inkompatibilitäten

GitChameleon:评估AI 与 Python 图书馆版本不兼容性 2507.12367v1

Authors (12): Diganta Misra, Nizar Islah, Victor May, Brice Rauby, Zihan Wang, Justine Gehring, Antonio Orvieto, Muawiz Chaudhary, Eilif B. Muller, Irina Rish, Samira Ebrahimi Kahou, Massimo Caccia

The rapid evolution of software libraries poses a considerable hurdle for code generation, necessitating continuous adaptation to frequent version updates while preserving backward compatibility. While existing code evolution benchmarks provide valuable insights, they typically lack execution-based evaluation for generating code compliant with specific library versions. To address this, we introduce GitChameleon, a novel, meticulously curated dataset comprising 328 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon rigorously evaluates the capacity of contemporary large language models (LLMs), LLM-powered agents, code assistants, and RAG systems to perform version-conditioned code generation that demonstrates functional accuracy through execution. Our extensive evaluations indicate that state-of-the-art systems encounter significant challenges with this task: enterprise models achieve baseline success rates in the 48-51% range, underscoring the intricacy of the problem. By offering an execution-based benchmark emphasizing the dynamic nature of code libraries, GitChameleon enables a clearer understanding of this challenge and helps guide the development of more adaptable and dependable AI code generation methods. We make the dataset and evaluation code publicly available at https://github.com/mrcabbage972/GitChameleonBenchmark.

软件图书馆的迅速演变为代码生成带来了相当大的障碍,需要不断适应经常更新版本,同时保持后向兼容性。虽然现有的代码演变基准提供了宝贵的洞察力,但它们通常缺乏基于执行的评价,以生成符合特定图书馆版本的代码。为了解决这个问题,我们引入了GitChameleon,这是一套由328个Python代码完成问题组成的新颖、精心整理的数据集,每套数据都以特定图书馆版本为条件,并伴之以可执行单位测试。GitChameleon严格评估当代大型语言模型(LLMs)、LLM驱动的代理、代码助理和RAG系统的能力,以进行通过执行显示功能准确性的版本定制代码生成。我们的广泛评估表明,最先进的系统在这项工作中遇到重大挑战;企业模型在48-51%范围内实现了基线成功率,凸显了问题的复杂性。通过提供基于执行的基准,强调代码图书馆的动态性质,GitChameleon能够更清楚地了解这一挑战,并帮助指导开发更适应性和更可靠的AI代码生成方法。我们在 https://github.com/mrcabbage972/GitChameleonBenchmark 上公开提供数据集和评估代码。

Article 25

Title@2025-07-16 (3): Planning-Aware Code Infilling via Horizon-Length Prediction

Title: Planning-Aware Code Infilling via Horizon-Length Prediction

Planning-Aware Code Infilling via Horizon-Length Prediction

通过地平线-地球预测填充规划-软件代码 2410.03103v3

Authors (6): Yifeng Ding, Hantian Ding, Shiqi Wang, Qing Sun, Varun Kumar, Zijian Wang

Fill-in-the-Middle (FIM), or infilling, has become integral to code language models, enabling generation of missing code given both left and right contexts. However, the current FIM training paradigm, which performs next-token prediction (NTP) over a reordered sequence, often leads to models struggling to generate content that aligns well with the surrounding context. We hypothesize that NTP alone is insufficient for models to learn effective planning conditioned on the distant right context, a critical factor for successful code infilling. To overcome this, we propose Horizon-Length Prediction (HLP), a novel training objective that teaches models to predict the number of remaining middle tokens at each step. HLP advances FIM with lookahead planning, enabling models to inherently learn infilling boundaries for arbitrary left and right contexts without relying on dataset-specific post-processing. Our evaluation across different model families and sizes shows that HLP significantly improves FIM performance, by up to 24% in relative terms, on diverse benchmarks at both file level and repository level. Furthermore, the enhanced planning capability gained through HLP boosts model performance on code reasoning. Importantly, HLP incurs negligible training overhead and no additional inference cost, ensuring its practicality for real-world scenarios.

中途填充(FIM)或填充(FIM)已成为编码语言模型的组成部分,使得在左侧和右侧环境中生成缺失的代码成为了代码模式的组成部分。然而,当前的FIM培训模式,即对顺序重排进行下方预测(NTP)后进行下方预测(NTP)后,往往导致模型难以产生与周围环境相适应的内容。我们假设光是NTP不足以让模型学习以远右环境为条件的有效规划,这是成功填充代码的一个关键因素。为了克服这一点,我们提出了地平线预测(HLP)这一新的培训目标,教给模型来预测每个步骤的剩余中标数。HLP用外观规划推进FIM,使模型能够在不依赖特定数据集后处理的情况下内在地学习为任意的左右环境填充边界。我们对不同模型家族和大小的评价表明,HLP在文件级别和储存库层面的不同基准上大大改进了FIM的绩效,相对提高到24 %。此外,HLP通过HLP推进(HL)推进(HL)系统)的推进(FIP)模型在可计量标准推理算中提高了实际成本。
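
The training signal described in the HLP abstract can be sketched as follows. This is an illustration under assumptions, not the paper's implementation: the prefix-suffix-middle ordering, the sentinel strings, and the normalised fraction form of the target are all hypothetical choices, since the abstract states only that the model predicts the number of remaining middle tokens at each step.

```python
def build_fim_example(prefix, middle, suffix):
    """Reorder a document as prefix-suffix-middle, a common FIM layout (assumed here)."""
    # Sentinel names are hypothetical placeholders, not a specific tokenizer's vocabulary.
    return ["<PRE>", *prefix, "<SUF>", *suffix, "<MID>", *middle]

def horizon_targets(middle):
    """For each middle position t (0-based), count the tokens still to be generated,
    including the one at t.

    Returns (remaining_count, remaining_fraction) pairs; the fraction form is an
    assumed normalisation that keeps targets in (0, 1] regardless of length.
    """
    m = len(middle)
    return [(m - t, (m - t) / m) for t in range(m)]

prefix = ["def", "add", "(", "a", ",", "b", ")", ":"]
middle = ["return", "a", "+", "b"]
suffix = ["\n", "print", "(", "add", "(", "1", ",", "2", ")", ")"]

tokens = build_fim_example(prefix, middle, suffix)
for tok, (count, frac) in zip(middle, horizon_targets(middle)):
    print(f"{tok!r}: {count} tokens remaining ({frac:.2f} of the middle)")
```

An auxiliary head trained on such targets is only used during training, which is consistent with the abstract's claim of no additional inference cost.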

Article 26

Title@2025-07-16 (3): An Empirical Study of Large Language Models for Type and Call Graph Analysis in Python and JavaScript

Title: An Empirical Study of Large Language Models for Type and Call Graph Analysis in Python and JavaScript

Eine empirische Studie großer Sprachmodelle für die Typ- und Call Graph Analyse in Python und JavaScript

Python 和 JavaScript 中用于类型和召唤图分析的大语言模型和经验研究 2410.00603v2

Authors (6): Ashwin Prasad Shivarpatna Venkatesh, Rose Sunil, Samkutty Sabu, Amir M. Mir, Sofia Reis, Eric Bodden

Large Language Models (LLMs) are increasingly being explored for their potential in software engineering, particularly in static analysis tasks. In this study, we investigate the potential of current LLMs to enhance call-graph analysis and type inference for Python and JavaScript programs. We empirically evaluated 24 LLMs, including OpenAI’s GPT series and open-source models like LLaMA and Mistral, using existing and newly developed benchmarks. Specifically, we enhanced TypeEvalPy, a micro-benchmarking framework for type inference in Python, with auto-generation capabilities, expanding its scope from 860 to 77,268 type annotations for Python. Additionally, we introduced SWARM-CG and SWARM-JS, comprehensive benchmarking suites for evaluating call-graph construction tools across multiple programming languages. Our findings reveal a contrasting performance of LLMs in static analysis tasks. For call-graph generation, traditional static analysis tools such as PyCG for Python and Jelly for JavaScript consistently outperform LLMs. While advanced models like mistral-large-it-2407-123b and gpt-4o show promise, they still struggle with completeness and soundness in call-graph analysis across both languages. In contrast, LLMs demonstrate a clear advantage in type inference for Python, surpassing traditional tools like HeaderGen and hybrid approaches such as HiTyper. These results suggest that, while LLMs hold promise in type inference, their limitations in call-graph analysis highlight the need for further research. Our study provides a foundation for integrating LLMs into static analysis workflows, offering insights into their strengths and current limitations.

正在越来越多地探索大型语言模型(LLMS)的软件工程潜力,特别是在静态分析任务中。在本研究中,我们调查了当前LMS的潜力,以加强对Python和JavaScript程序的需求分析和类型推导。我们根据经验评估了24 LMS,包括OpenAI的GPT系列和开放源模型,如LalaMA和Mistral,使用现有和新开发的基准。具体地说,我们加强了TyvalPy(TyelEvalPy),这是一个微型标记框架,用于在Python进行类型的静态推断,拥有自动生成能力,将其范围从860到77,268类型Python的描述范围扩大。此外,我们推出了SWSWARM-CG和SWARM-JS综合基准套,用于评估多种编程语言的呼声势构建工具。我们发现LMs在静态分析中表现了一种对比性分析。关于调调的PyCGnalpher(PyCG)和JellyPyPripteral)等固定分析工具, 用于在正压基础分析中提供了一种直压工具,在正态磁分析中显示一种直压工具。

Article 27

Title@2025-07-16 (3): An Online A/B Testing Decision Support System for Web Usability Assessment Based on a Linguistic Decision-making Methodology: Case of Study a Virtual Learning Environment

Title: An Online A/B Testing Decision Support System for Web Usability Assessment Based on a Linguistic Decision-making Methodology: Case of Study a Virtual Learning Environment

Ein Online A/B Testing Decision Support System for Web Usability Assessment basierend auf einer sprachlichen Entscheidungsmethodik: Fall einer virtuellen Lernumgebung

网上A/B测试决定支持系统,用于基于语言决策方法的网络可用性评估:研究案例和虚拟学习环境 2507.12118v1

Authors (5): Noe Zermeño, Cristina Zuheros, Lucas Daniel Del Rosso Calache, Francisco Herrera, Rosana Montes

In recent years, attention has increasingly focused on enhancing user satisfaction with user interfaces, spanning both mobile applications and websites. One fundamental aspect of human-machine interaction is the concept of web usability. In order to assess web usability, the A/B testing technique enables the comparison of data between two designs. Expanding the scope of tests to include the designs being evaluated, in conjunction with the involvement of both real and fictional users, presents a challenge for which few online tools offer support. We propose a methodology for web usability evaluation based on user-centered approaches such as design thinking and linguistic decision-making, named Linguistic Decision-Making for Web Usability Evaluation. This engages people in role-playing scenarios and conducts a number of usability tests, including the widely recognized System Usability Scale. We incorporate the methodology into a decision support system based on A/B testing. We use real users in a case study to assess three Moodle platforms at the University of Guadalajara, Mexico.

近年来,人们越来越重视提高用户对用户界面的满意度,包括移动应用程序和网站。人类机器互动的一个基本方面是网络可用性概念。为了评估网络可用性,A/B测试技术使得能够对两种设计的数据进行比较。扩大测试范围,将所评估的设计纳入所评估的设计,同时让真实用户和虚构用户参与,是一个挑战,很少有在线工具对此提供支持。我们提出了一个基于以用户为中心的方法进行网络可用性评价的方法,例如设计思维和语言决策,称为“网络可用性评估语言决策”。这让人们参与角色扮演情景,并进行一系列可用性测试,包括广泛承认的系统可用度尺度。我们将这一方法纳入基于A/B测试的决策支持系统。我们利用实际用户进行案例研究,评估墨西哥瓜达拉贾拉大学的三个Moode平台。
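
The System Usability Scale mentioned in the abstract has a standard scoring rule, sketched below for an A/B comparison. Only the SUS part is shown; the paper's linguistic decision-making step is not reproduced, and the participant responses are invented for illustration.

```python
def sus_score(responses):
    """Score one System Usability Scale questionnaire (ten items rated 1-5)."""
    if len(responses) != 10 or any(r not in range(1, 6) for r in responses):
        raise ValueError("SUS needs ten responses, each between 1 and 5")
    total = 0
    for i, r in enumerate(responses):
        # Items 1, 3, 5, ... are positively worded; items 2, 4, 6, ... are negatively worded.
        total += (r - 1) if i % 2 == 0 else (5 - r)
    return total * 2.5  # rescale the 0-40 sum to 0-100

# Hypothetical responses from two participants per design.
design_a = [[4, 2, 4, 1, 5, 2, 4, 2, 4, 1], [3, 3, 4, 2, 4, 2, 3, 3, 4, 2]]
design_b = [[5, 1, 5, 1, 5, 1, 5, 1, 5, 1], [4, 2, 5, 2, 4, 1, 4, 2, 5, 2]]

mean_a = sum(map(sus_score, design_a)) / len(design_a)
mean_b = sum(map(sus_score, design_b)) / len(design_b)
print(f"A: {mean_a:.2f}  B: {mean_b:.2f}")
```

Comparing mean scores is the simplest possible A/B readout; a real study would also report sample sizes and a significance test.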

Article 28

Title@2025-07-16 (3): Leveraging LLMs for User Stories in AI Systems: UStAI Dataset

Title: Leveraging LLMs for User Stories in AI Systems: UStAI Dataset

Nutzung von LLMs für Nutzergeschichten in KI-Systemen: UStAI-Datensatz

为AI系统用户故事利用LMLMs:UStAI数据集 2504.00513v3

Authors (3): Asma Yamani, Malak Baslyman, Moataz Ahmed

AI systems are gaining widespread adoption across various sectors and domains. Creating high-quality AI system requirements is crucial for aligning the AI system with business goals and consumer values and for social responsibility. However, with the uncertain nature of AI systems and the heavy reliance on sensitive data, more research is needed to address the elicitation and analysis of AI systems requirements. With the proprietary nature of many AI systems, there is a lack of open-source requirements artifacts and technical requirements documents for AI systems, limiting broader research and investigation. With Large Language Models (LLMs) emerging as a promising alternative to human-generated text, this paper investigates the potential use of LLMs to generate user stories for AI systems based on abstracts from scholarly papers. We conducted an empirical evaluation using three LLMs and generated 1260 user stories from 42 abstracts from 26 domains. We assess their quality using the Quality User Story (QUS) framework. Moreover, we identify relevant non-functional requirements (NFRs) and ethical principles. Our analysis demonstrates that the investigated LLMs can generate user stories inspired by the needs of various stakeholders, offering a promising approach for generating user stories for research purposes and for aiding in the early requirements elicitation phase of AI systems. We have compiled and curated a collection of stories generated by various LLMs into a dataset (UStAI), which is now publicly available for use.

建立高质量的AI系统要求对于使AI系统与商业目标和消费者价值以及社会责任相一致至关重要。然而,由于AI系统性质不确定,而且高度依赖敏感数据,需要开展更多的研究,以解决对AI系统要求的引证和分析问题。由于许多AI系统的专有性质,缺乏对AI系统开放源要求的文物和技术要求文件,限制了更广泛的研究和调查。随着大语言模型(LLMS)成为人类生成文本的有希望的替代物,本文调查了LMS为AI系统提供用户故事的可能性,根据学术论文的摘要,我们利用三个LMS进行了经验性评估,从26美元的领域产生了1 260美元的用户故事。我们利用质量用户书(QUS)框架评估了它们的质量。此外,我们查明了相关的非功能要求和道德原则。我们的分析表明,所调查的LMS可以产生用户故事,这些故事可以受到各种利益攸关方的需要的启发,提供了一种很有希望的方法,为AI系统制作用户故事,用于研究目的,并用AIS系统进行早期收集。我们通过AIS收集的用户故事。

Article 29

Title@2025-07-16 (3): From Static to Intelligent: Evolving SaaS Pricing with LLMs

Title: From Static to Intelligent: Evolving SaaS Pricing with LLMs

Von der statischen zur intelligenten: Evolving SaaS Pricing mit LLMs

从静态到智慧:不断演进的SaaS与LLMs的定价 2507.12104v1

Authors (3): Francisco Javier Cavero, Juan C. Alonso, Antonio Ruiz-Cortés

The SaaS paradigm has revolutionized software distribution by offering flexible pricing options to meet diverse customer needs. However, the rapid expansion of the SaaS market has introduced significant complexity for DevOps teams, who must manually manage and evolve pricing structures, an approach that is both time-consuming and prone to errors. The absence of automated tools for pricing analysis restricts the ability to efficiently evaluate, optimize, and scale these models. This paper proposes leveraging intelligent pricing (iPricing), that is, dynamic, machine-readable pricing models, as a solution to these challenges. Intelligent pricing enables competitive analysis, streamlines operational decision-making, and supports continuous pricing evolution in response to market dynamics, leading to improved efficiency and accuracy. We present an LLM-driven approach that automates the transformation of static HTML pricing into iPricing, significantly improving efficiency and consistency while minimizing human error. Our implementation, AI4Pricing2Yaml, features a basic Information Extractor that uses web scraping and LLM technologies to extract essential pricing components (plans, features, usage limits, and add-ons) from SaaS websites. Validation against a dataset of 30 distinct commercial SaaS, encompassing over 150 intelligent pricings, demonstrates the system's effectiveness in extracting the desired elements across all steps. However, challenges remain in addressing hallucinations, complex structures, and dynamic content. This work highlights the potential of automating intelligent pricing transformation to streamline SaaS pricing management, offering implications for improved consistency and scalability in an increasingly intricate pricing landscape. Future research will focus on refining extraction capabilities and enhancing the system's adaptability to a wider range of SaaS websites.

SaaS范式通过提供灵活的定价选项来满足多样化的客户需求,彻底改变了软件分发方式。然而,SaaS市场的快速扩张给DevOps团队带来了显著的复杂性:他们必须手动管理和演进定价结构,这种方式既耗时又容易出错。缺乏自动化的定价分析工具,限制了高效评估、优化和扩展这些模型的能力。本文提出利用智能定价(iPricing),即动态的、机器可读的定价模型,来应对这些挑战。智能定价支持竞争分析,简化运营决策,并支持根据市场动态持续演进定价,从而提高效率和准确性。我们提出一种由LLM驱动的方法,将静态HTML定价自动转换为iPricing,在显著提高效率和一致性的同时尽量减少人为错误。我们的实现AI4Pricing2Yaml包含一个基础的信息提取器,它利用网页抓取和LLM技术,从SaaS网站中提取基本的定价要素:方案、功能、使用限制和附加项。在包含30个不同商业SaaS、超过150份智能定价的数据集上的验证表明,该系统在各个步骤中都能有效提取所需元素。然而,在处理幻觉、复杂结构和动态内容方面仍存在挑战。这项工作凸显了自动化智能定价转换在简化SaaS定价管理方面的潜力,有助于在日益复杂的定价环境中提高一致性和可扩展性。未来的研究将着重改进提取能力,并增强系统对更多SaaS网站的适应性。

Article 30

Title@2025-07-16 (3): LLAMA: Multi-Feedback Smart Contract Fuzzing Framework with LLM-Guided Seed Generation

Title: LLAMA: Multi-Feedback Smart Contract Fuzzing Framework with LLM-Guided Seed Generation

LLAMA: Multi-Feedback Smart Contract Fuzzing Framework mit LLM-geführter Saatgutgeneration

LLAMA:基于LLM引导种子生成的多反馈智能合约模糊测试框架 2507.12084v1

Authors (5): Keke Gai, Haochen Liang, Jing Yu, Liehuang Zhu, Dusit Niyato

Smart contracts play a pivotal role in blockchain ecosystems, and fuzzing remains an important approach to securing smart contracts. Even though mutation scheduling is a key factor influencing fuzzing effectiveness, existing fuzzers have primarily explored seed scheduling and generation, while mutation scheduling has rarely been addressed by prior work. In this work, we propose a Large Language Model (LLM)-based Multi-feedback Smart Contract Fuzzing framework (LLAMA) that integrates LLMs, evolutionary mutation strategies, and hybrid testing techniques. Key components of the proposed LLAMA include: (i) a hierarchical prompting strategy that guides LLMs to generate semantically valid initial seeds, coupled with a lightweight pre-fuzzing phase to select high-potential inputs; (ii) a multi-feedback optimization mechanism that simultaneously improves seed generation, seed selection, and mutation scheduling by leveraging runtime coverage and dependency feedback; and (iii) an evolutionary fuzzing engine that dynamically adjusts mutation operator probabilities based on effectiveness, while incorporating symbolic execution to escape stagnation and uncover deeper vulnerabilities. Our experiments demonstrate that LLAMA outperforms state-of-the-art fuzzers in both coverage and vulnerability detection. Specifically, it achieves 91% instruction coverage and 90% branch coverage, while detecting 132 out of 148 known vulnerabilities across diverse categories. These results highlight LLAMA’s effectiveness, adaptability, and practicality in real-world smart contract security testing scenarios.

智能合约在区块链生态系统中发挥着关键作用,而模糊测试仍是保障智能合约安全的重要方法。尽管变异调度是影响模糊测试效果的关键因素,现有的模糊测试器主要探索种子调度和种子生成,变异调度则很少被已有工作涉及。在这项工作中,我们提出一个基于大语言模型(LLM)的多反馈智能合约模糊测试框架(LLAMA),它集成了LLM、进化式变异策略和混合测试技术。LLAMA的关键组成部分包括:(i)一种分层提示策略,引导LLM生成语义有效的初始种子,并配合轻量级的预模糊测试阶段来选择高潜力输入;(ii)一种多反馈优化机制,利用运行时覆盖率和依赖关系反馈,同时改进种子生成、种子选择和变异调度;(iii)一个进化式模糊测试引擎,根据有效性动态调整变异算子的概率,同时结合符号执行来摆脱停滞并发现更深层的漏洞。实验表明,LLAMA在覆盖率和漏洞检测方面均优于最先进的模糊测试器。具体而言,它达到了91%的指令覆盖率和90%的分支覆盖率,并在148个已知漏洞中检测出132个,涵盖多种类别。这些结果凸显了LLAMA在真实智能合约安全测试场景中的有效性、适应性和实用性。
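
Component (iii) of the abstract, adjusting mutation operator probabilities based on effectiveness, can be illustrated with a small sketch. This is not LLAMA's implementation: the operator names, the `floor` and `decay` parameters, and the use of "new branches covered" as the reward are all assumptions made for illustration.

```python
import random


class MutationScheduler:
    """Keeps one weight per mutation operator and samples in proportion to it."""

    def __init__(self, operators, floor=0.05, decay=0.9):
        self.weights = {op: 1.0 for op in operators}
        self.floor = floor  # assumed minimum weight so no operator is starved
        self.decay = decay  # assumed rate at which old evidence fades

    def probabilities(self):
        total = sum(self.weights.values())
        return {op: w / total for op, w in self.weights.items()}

    def pick(self, rng=random):
        ops = list(self.weights)
        return rng.choices(ops, weights=[self.weights[o] for o in ops], k=1)[0]

    def update(self, op, new_branches):
        # Exponential moving average of the reward, clamped at the floor.
        blended = self.decay * self.weights[op] + (1 - self.decay) * float(new_branches)
        self.weights[op] = max(self.floor, blended)


# Hypothetical operators: one keeps finding coverage, one never does.
sched = MutationScheduler(["bit_flip", "arg_swap", "value_boundary"])
for _ in range(20):
    sched.update("value_boundary", new_branches=3)
    sched.update("bit_flip", new_branches=0)
probs = sched.probabilities()
```

After these updates the scheduler favors `value_boundary` and rarely picks `bit_flip`, while the floor keeps every operator selectable so that a stagnating campaign can still recover.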

Article 31

Title@2025-07-16 (3): From Release to Adoption: Challenges in Reusing Pre-trained AI Models for Downstream Developers

Title: From Release to Adoption: Challenges in Reusing Pre-trained AI Models for Downstream Developers

Von der Veröffentlichung bis zur Annahme: Herausforderungen bei der Wiederverwendung vortrainierter KI-Modelle für Downstream-Entwickler

从释放到采用:为下游开发者重新使用经过预先培训的AI模型的挑战 2506.23234v2

Authors (5): Peerachai Banyongrakkul, Mansooreh Zahedi, Patanamon Thongtanunam, Christoph Treude, Haoyu Gao

Pre-trained models (PTMs) have gained widespread popularity and achieved remarkable success across various fields, driven by their groundbreaking performance and easy accessibility through hosting providers. However, the challenges faced by downstream developers in reusing PTMs in software systems are less explored. To bridge this knowledge gap, we qualitatively created and analyzed a dataset of 840 PTM-related issue reports from 31 OSS GitHub projects. We systematically developed a comprehensive taxonomy of PTM-related challenges that developers face in downstream projects. Our study identifies seven key categories of challenges that downstream developers face in reusing PTMs, such as model usage, model performance, and output quality. We also compared our findings with existing taxonomies. Additionally, we conducted a resolution time analysis and, based on statistical tests, found that PTM-related issues take significantly longer to be resolved than issues unrelated to PTMs, with significant variation across challenge categories. We discuss the implications of our findings for practitioners and possibilities for future research.

培训前模型(PTMs)已获得广泛欢迎,并在各个领域取得了显著成功,其驱动力是其开创性业绩和通过托管提供者容易获得。然而,下游开发商在软件系统中重新使用PTM系统时所面临的挑战没有那么深入探讨。为缩小这一知识差距,我们从质量上创建和分析了31个OSS GitHub项目中840份PTM相关问题报告的数据集。我们系统地开发了开发商在下游项目中面临的与PTM相关的挑战的综合分类。我们的研究确定了下游开发商在重新使用PTM系统时所面临的七大类挑战,例如模型使用、模型性能和产出质量。我们还将我们的调查结果与现有的分类进行了比较。此外,我们进行了分辨率时间分析,并根据统计测试发现,与PTM系统有关的问题比与PTM系统无关的问题需要更长的时间才能解决,而不同的挑战类别差异很大。我们讨论了我们的调查结果对从业人员的影响以及未来研究的可能性。

Article 32

Title@2025-07-16 (3): Expanding ML-Documentation Standards For Better Security

Title: Expanding ML-Documentation Standards For Better Security

Erweiterung der ML-Dokumentationsstandards für bessere Sicherheit

扩展ML文档标准以提升安全性 2507.12003v1

Authors (1): Cara Ellen Appel

This article presents the current state of ML-security and of the documentation of ML-based systems, models and datasets in research and practice based on an extensive review of the existing literature. It shows a generally low awareness of security aspects among ML-practitioners and organizations and an often unstandardized approach to documentation, leading to overall low quality of ML-documentation. Existing standards are not regularly adopted in practice and IT-security aspects are often not included in documentation. Due to these factors, there is a clear need for improved security documentation in ML, as one step towards addressing the existing gaps in ML-security. To achieve this, we propose expanding existing documentation standards for ML-documentation to include a security section with specific security-relevant information. To implement this, we present a novel, expanded method of documenting security requirements in ML-documentation, based on the existing Model Cards and Datasheets for Datasets standards, with the recommendation to adopt these findings in all ML-documentation.

本文基于对现有文献的广泛综述,介绍了ML安全以及研究和实践中基于ML的系统、模型和数据集文档的现状。综述表明,ML从业者和组织的安全意识普遍较低,文档编写方式往往缺乏标准化,导致ML文档的整体质量偏低。现有标准在实践中并未得到普遍采用,IT安全方面的内容也常常没有被纳入文档。由于这些因素,显然需要改进ML中的安全文档,作为弥补ML安全现有差距的一步。为此,我们建议扩展现有的ML文档标准,加入包含具体安全相关信息的安全章节。在此基础上,本文基于现有的Model Cards和Datasheets for Datasets标准,提出了一种在ML文档中记录安全需求的扩展新方法,并建议在所有ML文档中采纳这些结论。

Article 33

Title@2025-07-16 (3): A Task Taxonomy for Conformance Checking

Title: A Task Taxonomy for Conformance Checking

Eine Aufgaben-Taxonomie für die Konformitätsprüfung

合规检查任务分类 2507.11976v1

Authors (6): Jana-Rebecca Rehse, Michael Grohs, Finn Klessascheck, Lisa-Marie Klein, Tatiana von Landesberger, Luise Pufahl

Conformance checking is a sub-discipline of process mining, which compares observed process traces with a process model to analyze whether the process execution conforms with or deviates from the process design. Organizations can leverage this analysis, for example to check whether their processes comply with internal or external regulations or to identify potential improvements. Gaining these insights requires suitable visualizations, which make complex results accessible and actionable. So far, however, the development of conformance checking visualizations has largely been left to tool vendors. As a result, current tools offer a wide variety of visual representations for conformance checking, but the analytical purposes they serve often remain unclear. However, without a systematic understanding of these purposes, it is difficult to evaluate the visualizations’ usefulness. Such an evaluation hence requires a deeper understanding of conformance checking as an analysis domain. To this end, we propose a task taxonomy, which categorizes the tasks that can occur when conducting conformance checking analyses. This taxonomy supports researchers in determining the purpose of visualizations, specifying relevant conformance checking tasks in terms of their goal, means, constraint type, data characteristics, data target, and data cardinality. Combining concepts from process mining and visual analytics, we address researchers from both disciplines to enable and support closer collaborations.

合规性检查是流程采矿的次级纪律,将观察到的流程痕迹与分析流程执行是否符合或偏离流程设计的过程模型进行比较。各组织可以利用这一分析,例如,利用这一分析来检查其流程是否符合内部或外部条例,或查明潜在的改进。通过这些洞察力需要适当的直观化,使复杂的结果可以获取和可操作。但迄今为止,合规性检查视觉化的开发基本上留给了工具供应商。因此,当前工具为合规性检查提供了各种各样的直观显示,但是它们所提供的分析目的往往仍然不清楚。然而,如果对这些目的没有系统的理解,就很难评估可视化的效用。因此,这种评估需要更深入地了解合规性检查作为分析领域。为此,我们建议了一个任务分类,对进行合规性检查分析时可能发生的任务进行分类。这一分类支持研究人员确定可视化的目的,具体说明其目标、手段、约束类型、数据特性、数据目标目标和数据目标、数据目标、数据目标和数据基点,从而支持我们从视觉和更紧密的合作。
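
To make the basic idea of comparing observed traces with a process model concrete, here is a deliberately simplified sketch. It assumes the model is given as a set of allowed directly-follows pairs plus start and end activities; real conformance checking techniques such as token replay or alignments are considerably richer, and the activity names are invented for this example.

```python
def check_conformance(trace, allowed, start, ends):
    """Return a list of deviations between one observed trace and the model."""
    if not trace:
        return ["empty trace"]
    deviations = []
    if trace[0] != start:
        deviations.append(f"unexpected start: {trace[0]}")
    # Compare each pair of consecutive activities with the allowed pairs.
    for a, b in zip(trace, trace[1:]):
        if (a, b) not in allowed:
            deviations.append(f"unexpected step: {a} -> {b}")
    if trace[-1] not in ends:
        deviations.append(f"unexpected end: {trace[-1]}")
    return deviations


# Hypothetical invoice-handling model.
ALLOWED = {
    ("receive", "check"),
    ("check", "approve"),
    ("check", "reject"),
    ("approve", "pay"),
}

ok = check_conformance(["receive", "check", "approve", "pay"], ALLOWED, "receive", {"pay", "reject"})
bad = check_conformance(["receive", "approve", "pay"], ALLOWED, "receive", {"pay", "reject"})
```

Here `ok` is empty and `bad` reports the skipped check. A list of deviations like this is the kind of raw result that conformance checking visualizations then need to make accessible.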

Article 34

Title@2025-07-16 (3): Kevin: Multi-Turn RL for Generating CUDA Kernels

Title: Kevin: Multi-Turn RL for Generating CUDA Kernels

Kevin: Multi-Turn RL für die Erzeugung von CUDA-Kerneln

Kevin: 生成 CUDA 核心多发RL 2507.11948v1

Authors (5): Carlo Baronio, Pietro Marsella, Ben Pan, Simon Guo, Silas Alberti

Writing GPU kernels is a challenging task and critical for AI systems’ efficiency. It is also highly iterative: domain experts write code and improve performance through execution feedback. Moreover, it presents verifiable rewards like correctness and speedup, making it a natural environment to apply Reinforcement Learning (RL). To explicitly incorporate the iterative nature of this process into training, we develop a flexible multi-turn RL recipe that addresses unique challenges encountered in real-world settings, such as learning from long trajectories and effective reward attribution across turns. We present Kevin - K(ernel D)evin, the first model trained with multi-turn RL for CUDA kernel generation and optimization. In our evaluation setup, Kevin shows significant gains over its base model (QwQ-32B), improving correctness of generated kernels (in pure CUDA) from 56% to 82% and mean speedup from 0.53x to 1.10x of baseline (PyTorch Eager), and surpassing frontier models like o4-mini (0.78x). Finally, we study its behavior across test-time scaling axes: we found scaling serial refinement more beneficial than parallel sampling. In particular, when given more refinement turns, Kevin shows a higher rate of improvement.

编写GPU内核是一项具有挑战性的任务,对AI系统的效率至关重要。这一过程也是高度迭代的:领域专家编写代码,并通过执行反馈来提升性能。此外,它提供了正确性和加速比等可验证的奖励,因而是应用强化学习(RL)的天然环境。为了将这一过程的迭代特性显式地纳入训练,我们开发了一种灵活的多轮RL方案,以解决真实场景中遇到的独特挑战,例如从长轨迹中学习以及跨轮次的有效奖励归因。我们提出Kevin - K(ernel D)evin,这是首个使用多轮RL训练、用于CUDA内核生成与优化的模型。在我们的评估设置中,Kevin相比其基础模型(QwQ-32B)有显著提升:生成内核(纯CUDA)的正确率从56%提高到82%,相对基线(PyTorch Eager)的平均加速比从0.53x提高到1.10x,并超过了o4-mini(0.78x)等前沿模型。最后,我们研究了它在测试时扩展各维度上的表现:我们发现扩展串行改进比并行采样更有益。特别是,当给予更多改进轮次时,Kevin表现出更高的改进速率。
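
The abstract mentions reward attribution across turns without giving the formula. One common way to do this, shown here purely as an illustration, is to credit each turn with its own score plus a discounted share of the scores of later turns. The scoring rule, `correct_bonus`, and `gamma` below are assumptions, not values taken from the paper.

```python
def turn_score(correct, speedup, correct_bonus=0.3):
    """Score one generated kernel: nothing if it is wrong, else a bonus plus its speedup."""
    return (correct_bonus + speedup) if correct else 0.0


def attributed_rewards(turns, gamma=0.5):
    """Credit each turn with its own score plus discounted scores of later turns."""
    rewards = [0.0] * len(turns)
    running = 0.0
    # Walk backwards so each turn sees the discounted return of what follows.
    for i in range(len(turns) - 1, -1, -1):
        correct, speedup = turns[i]
        running = turn_score(correct, speedup) + gamma * running
        rewards[i] = running
    return rewards


# Hypothetical three-turn trajectory: (kernel is correct, speedup over baseline).
trajectory = [(False, 0.0), (True, 1.0), (True, 1.1)]
rewards = attributed_rewards(trajectory)
```

With this scheme the first turn, which produced an incorrect kernel, still receives a positive reward because it led to correct and faster kernels later in the trajectory.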

Article 35

Title@2025-07-16 (3): Extremal Testing for Network Software using LLMs

Title: Extremal Testing for Network Software using LLMs

Extreme Tests für Netzwerk-Software mit LLMs

使用LLM对网络软件进行极值测试 2507.11898v1

Authors (7): Rathin Singha, Harry Qian, Srinath Saikrishnan, Tracy Zhao, Ryan Beckett, Siva Kesava Reddy Kakarla, George Varghese

Physicists often manually consider extreme cases when testing a theory. In this paper, we show how to automate extremal testing of network software using LLMs in two steps: first, ask the LLM to generate input constraints (e.g., DNS name length limits); then ask the LLM to generate tests that violate the constraints. We demonstrate how easy this process is by generating extremal tests for HTTP, BGP and DNS implementations, each of which uncovered new bugs. We show how this methodology extends to centralized network software such as shortest path algorithms, and how LLMs can generate filtering code to reject extremal input. We propose using agentic AI to further automate extremal testing. LLM-generated extremal testing goes beyond an old technique in software testing called Boundary Value Analysis.

物理学家通常在测试理论时手工考虑极端案例。在本文中, 我们展示了如何将使用LLMM的网络软件的极限测试自动化化, 分两步进行: 首先, 请LLM 生成输入限制( 如 DNS 名称长度限制); 然后让LLM 生成违反限制的测试。我们通过为 HTTP、 BGP 和 DNS 实施生成极端测试来证明这一过程是多么容易, 每一个测试都发现了新的错误。我们展示了这一方法是如何推广到中央网络软件的, 如最短路径算法, 以及 LLMs 如何生成过滤代码来拒绝 extremal 输入。我们提议使用代理AI 来进一步自动进行极端测试。 LLM 生成的极限测试超越了软件测试“ 边界值分析” 的老技术。
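
The two steps in the abstract (first collect input constraints, then build tests that sit on or just beyond them) can be sketched by hand for the DNS name length example. In the paper the LLM produces both the constraints and the tests; here they are hard-coded. The limits used are the 63-octet label limit from RFC 1035 and the commonly cited 253-character limit for the dotted text form; `is_valid_name` is a reference oracle written for this example, not any real resolver's API.

```python
MAX_LABEL = 63   # octets per label (RFC 1035)
MAX_NAME = 253   # characters in the dotted text form, without a trailing dot


def is_valid_name(name):
    """Reference oracle for the two length constraints only."""
    if not name or len(name) > MAX_NAME:
        return False
    return all(0 < len(label) <= MAX_LABEL for label in name.split("."))


def name_of_length(total):
    """Build a dotted name of exactly `total` characters (meant for totals near MAX_NAME)."""
    labels, remaining = [], total
    while remaining > MAX_LABEL:
        labels.append("a" * MAX_LABEL)
        remaining -= MAX_LABEL + 1  # the label plus its separating dot
    labels.append("a" * remaining)
    return ".".join(labels)


def extremal_cases():
    """Inputs at and just past each constraint, with the expected verdict."""
    return [
        ("a" * MAX_LABEL + ".com", True),         # label exactly at the limit
        ("a" * (MAX_LABEL + 1) + ".com", False),  # label one past the limit
        (name_of_length(MAX_NAME), True),         # whole name exactly at the limit
        (name_of_length(MAX_NAME + 1), False),    # whole name one past the limit
        ("a..com", False),                        # empty label
    ]


cases = extremal_cases()
```

In practice each case would be sent to the implementation under test, and any disagreement with the expected verdict (or a crash) would be flagged for inspection.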

Article 36

Title@2025-07-15 (2): On the Need for a Statistical Foundation in Scenario-Based Testing of Autonomous Vehicles

Title: On the Need for a Statistical Foundation in Scenario-Based Testing of Autonomous Vehicles

Zur Notwendigkeit einer statistischen Grundlage für die szenariogestützte Prüfung autonomer Fahrzeuge

论基于场景的自动驾驶车辆测试对统计基础的需求 2505.02274v2

Authors (5): Xingyu Zhao, Robab Aghazadeh-Chakherlou, Chih-Hong Cheng, Peter Popov, Lorenzo Strigini

Scenario-based testing has emerged as a common method for autonomous vehicle (AV) safety assessment, offering a more efficient alternative to mile-based testing by focusing on high-risk scenarios. However, fundamental questions persist regarding its stopping rules, residual risk estimation, debugging effectiveness, and the impact of simulation fidelity on safety claims. This paper argues that a rigorous statistical foundation is essential to address these challenges and enable rigorous safety assurance. By drawing parallels between AV testing and established software testing methods, we identify shared research gaps and reusable solutions. We propose proof-of-concept models to quantify the probability of failure per scenario (pfs) and evaluate testing effectiveness under varying conditions. Our analysis reveals that neither scenario-based nor mile-based testing universally outperforms the other. Furthermore, we give an example of formal reasoning about alignment of synthetic and real-world testing outcomes, a first step towards supporting statistically defensible simulation-based safety claims.

基于场景的测试已成为自动驾驶车辆(AV)安全评估的常用方法,它通过聚焦高风险场景,为基于里程的测试提供了更高效的替代方案。然而,关于其停止规则、残余风险估计、调试有效性以及仿真保真度对安全主张的影响,仍存在一些根本性问题。本文认为,严格的统计基础对于应对这些挑战、实现严格的安全保证至关重要。通过将AV测试与成熟的软件测试方法进行类比,我们找出了共同的研究空白和可复用的解决方案。我们提出了概念验证模型,用于量化每个场景的失效概率(pfs),并评估不同条件下的测试有效性。我们的分析表明,基于场景的测试和基于里程的测试都不会普遍优于对方。此外,我们给出了一个关于合成测试结果与真实世界测试结果一致性的形式化推理示例,这是迈向支持统计上站得住脚的、基于仿真的安全主张的第一步。
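
The stopping-rule question can be made concrete with a textbook calculation. This is not the paper's proof-of-concept model: it is the classical bound for failure-free testing, and it assumes scenarios are independent draws from the operational profile, each failing with the same probability pfs.

```python
import math


def pfs_upper_bound(n_passed, confidence=0.95):
    """Upper confidence bound on pfs after n_passed failure-free scenarios.

    Solves (1 - p) ** n_passed = 1 - confidence for p.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / n_passed)


def scenarios_needed(target_pfs, confidence=0.95):
    """Smallest number of failure-free scenarios that supports pfs <= target_pfs."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - target_pfs))


bound_after_100 = pfs_upper_bound(100)  # roughly 0.03
needed_for_1e3 = scenarios_needed(1e-3)
```

The familiar "rule of three" (bound of about 3/n at 95% confidence) falls out of the first function. The independence assumption is exactly what scenario selection that targets high-risk cases tends to violate, which is one reason a dedicated statistical foundation is needed.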

Article 37

Title@2025-07-15 (2): REST in Pieces: RESTful Design Rule Violations in Student-Built Web Apps

Title: REST in Pieces: RESTful Design Rule Violations in Student-Built Web Apps

REST in Pieces: RESTful Design Regel Verstöße in Student-Build Web Apps

REST in Pieces:学生开发的Web应用中的RESTful设计规则违规 2507.11689v1

Authors (3): Sergio Di Meglio, Valeria Pontillo, Luigi Libero Lucio Starace

In Computer Science Bachelor’s programs, software quality is often underemphasized due to limited time and a focus on foundational skills, leaving many students unprepared for industry expectations. To better understand the typical quality of student code and inform both education and hiring practices, we analyze 40 full-stack web applications developed in a third-year Web Technologies course. Using an automated static analysis pipeline, we assess adherence to REST API design rules. Results reveal frequent violations of foundational conventions, such as missing hyphens in endpoint paths (98%), incorrect pluralization (88%), and misuse of HTTP methods (83%). These findings highlight the need for more focused instruction on API design and support the adoption of automated tools to improve code quality in student projects.

在计算机科学学士课程中,由于时间有限和对基础技能的重视,软件质量往往不够强调,使许多学生没有做好对行业期望的准备。为了更好地了解学生守则的典型质量,并为教育和雇用做法提供信息,我们分析了在三年级网络技术课程中开发的40个全堆式网络应用程序。我们利用自动静态分析管道评估对REST API设计规则的遵守情况。结果显示,基本惯例经常遭到违反,如终点路径缺失的连字符(98%)、不正确的复数(88%)和滥用HTTP方法(83%)。这些调查结果突出表明,需要更集中地指导API的设计,并支持采用自动化工具来提高学生项目的代码质量。
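
A static check of the kind the abstract describes can be approximated in a few lines. The sketch below is not the authors' pipeline. It covers two heuristics only: words in a path segment should be separated by hyphens, and a segment should not start with a CRUD verb, since the HTTP method already expresses the action. The verb list and the messages are assumptions.

```python
import re

CRUD_VERBS = ("get", "create", "update", "delete")  # assumed, not exhaustive


def lint_path(path):
    """Return a list of rule violations for one endpoint path."""
    problems = []
    for segment in path.strip("/").split("/"):
        if segment.startswith("{") and segment.endswith("}"):
            continue  # path parameters such as {id} are not checked
        if "_" in segment or re.search(r"[A-Z]", segment):
            problems.append(f"{segment}: separate words with hyphens")
        if segment.lower().startswith(CRUD_VERBS):
            problems.append(f"{segment}: verb in path; use the HTTP method instead")
    return problems


clean = lint_path("/users/{id}/order-items")
dirty = lint_path("/getUserOrders")
```

The prefix test is crude (a resource named `getaways` would be flagged), so a real linter would need a word list or tokenizer. Pluralization checks are left out for the same reason.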

Article 38

Title@2025-07-15 (2): MetaLint: Generalizable Idiomatic Code Quality Analysis through Instruction-Following and Easy-to-Hard Generalization

Title: MetaLint: Generalizable Idiomatic Code Quality Analysis through Instruction-Following and Easy-to-Hard Generalization

MetaLint: Generalisierbare idiomatische Code-Qualitätsanalyse durch instruction-following und einfach-zu-harte Verallgemeinerung

MetaLint:通过指令遵循和由易到难的泛化实现可泛化的惯用代码质量分析 2507.11687v1

Authors (6): Atharva Naik, Lawanya Baghel, Dhakshin Govindarajan, Darsh Agrawal, Daniel Fried, Carolyn Rose

Large Language Models, though successful in code generation, struggle with code quality analysis because they are limited by static training data and can’t easily adapt to evolving best practices. We introduce MetaLint, a new instruction-following framework that formulates code quality analysis as the task of detecting and fixing problematic semantic code fragments or code idioms based on high-level specifications. Unlike conventional approaches that train models on static, rule-based data, MetaLint employs instruction tuning on synthetic linter-generated data to support easy-to-hard generalization, enabling models to adapt to novel or complex code patterns without retraining. To evaluate this, we construct a benchmark of challenging idioms inspired by real-world coding standards such as Python Enhancement Proposals (PEPs) and assess whether MetaLint-trained models reason adaptively or simply memorize. Our results show that MetaLint improves generalization to unseen PEP idioms, achieving a 70.37% F-score on idiom detection with the highest recall (70.43%) among all evaluated models. It also achieves 26.73% on localization, competitive for its 4B parameter size and comparable to larger state-of-the-art models like o3-mini, highlighting its potential for future-proof code quality analysis.

大语言模型虽然在代码生成方面很成功,但在代码质量分析上表现不佳,因为它们受限于静态的训练数据,难以适应不断演进的最佳实践。我们提出MetaLint,这是一个新的指令遵循框架,它把代码质量分析表述为:根据高层规范检测并修复有问题的语义代码片段或代码惯用法。与在静态的、基于规则的数据上训练模型的传统方法不同,MetaLint在由linter生成的合成数据上进行指令微调,以支持由易到难的泛化,使模型无需重新训练即可适应新的或复杂的代码模式。为了评估这一点,我们构建了一个由具有挑战性的惯用法组成的基准,其灵感来自Python增强提案(PEP)等真实世界的编码标准,并评估经MetaLint训练的模型是在自适应地推理还是仅仅在记忆。结果表明,MetaLint提升了对未见过的PEP惯用法的泛化能力,在惯用法检测上取得70.37%的F分数,并在所有被评估模型中召回率最高(70.43%)。它在定位任务上也达到26.73%,就其4B的参数规模而言具有竞争力,并可与o3-mini等更大的最先进模型相媲美,凸显了其在面向未来的代码质量分析方面的潜力。

Article 39

Title@2025-07-15 (2): Rookie Mistakes: Measuring Software Quality in Student Projects to Guide Educational Enhancement

Title: Rookie Mistakes: Measuring Software Quality in Student Projects to Guide Educational Enhancement

Rookie Fehler: Softwarequalität in Studentenprojekten messen, um die Verbesserung der Bildung zu steuern

Rookie错误:衡量学生项目软件质量以指导加强教育 2507.12488v1

Authors (6): Marco De Luca, Sergio Di Martino, Sergio Di Meglio, Anna Rita Fasolino, Luigi Libero Lucio Starace, Porfirio Tramontana

When teaching Programming and Software Engineering in Bachelor’s Degree programs, the emphasis on creating functional software projects often overshadows the focus on software quality, a trend that aligns with ACM curricula recommendations. Software Engineering courses are typically introduced later in the curriculum, and can generally allocate only limited time to quality-related topics, leaving educators with the challenge of deciding which quality aspects to prioritize. In this decision, the literature offers limited guidance, as most existing studies focus on code written by novice students and small code units, making it unclear whether those findings extend to intermediate-level students with foundational object-oriented programming skills working on more complex software projects. To address this gap, we analyze 83 object-oriented team projects developed by 172 university students across 4 different editions of the Object-Oriented Programming course. We apply a static analysis pipeline used in prior research to assess software quality, combining SonarQube and ArchUnit to detect code smells and architectural anti-patterns. Our findings highlight recurring quality issues and offer concrete evidence of the challenges students face at this stage, providing valuable guidance for educators aiming to continuously improve Software Engineering curricula and promote quality-oriented development practices.

当在学士学位课程中教授规划和软件工程时,强调创建功能软件项目往往掩盖了对软件质量的重视,这一趋势与ACM课程建议相一致。软件工程课程通常在课程后期引入,通常只能为质量相关专题分配有限的时间,使教育者面临决定哪些质量方面需要优先处理的挑战。在该决定中,文献提供了有限的指导,因为大多数现有研究侧重于新学生和小型代码单位编写的代码,使得这些研究结果是否延伸到具有基本目标导向方案编制技能的中级学生,这些技能涉及更复杂的软件项目。为弥补这一差距,我们分析了172名大学生在4个不同版本的面向目标的规划课程中开发的83个面向目标的团队项目。我们应用了用于评估软件质量的静态分析管道,将SonarQube和ArchUnd Unit结合起来,以发现代码的气味和建筑反模式。我们的调查结果强调了反复出现的质量问题,并提供了学生在这一阶段面临的挑战的具体证据,为教育工作者不断改进软件工程课程和促进面向质量的发展做法提供了宝贵的指导。

Article 40

Title@2025-07-15 (2): You Can REST Now: Automated REST API Documentation and Testing via LLM-Assisted Request Mutations

Title: You Can REST Now: Automated REST API Documentation and Testing via LLM-Assisted Request Mutations

Sie können jetzt REST: Automatisierte REST API Dokumentation und Tests über LLM-Assisted Request Mutations

You Can REST Now:通过LLM辅助的请求变异实现REST API文档自动生成与测试 2402.05102v2

Authors (5): Alix Decrop, Xavier Devroey, Mike Papadakis, Pierre-Yves Schobbens, Gilles Perrouin

REST APIs are prevalent among web service implementations, easing interoperability through the HTTP protocol. API testers and users exploit the widely adopted OpenAPI Specification (OAS), a machine-readable standard to document REST APIs. However, documenting APIs is a time-consuming and error-prone task, and existing documentation is not always complete, publicly accessible, or up-to-date. This situation limits the efficiency of testing tools and hinders human comprehension. Large Language Models (LLMs) offer the potential to automatically infer API documentation, using their colossal training data. In this paper, we present RESTSpecIT, the first automated approach that infers documentation and performs black-box testing of REST APIs by leveraging LLMs. Our approach requires minimal user input compared to state-of-the-art tools: given an API name and an LLM access key, RESTSpecIT generates API request seeds and mutates them with data returned by the LLM. The tool then analyzes API responses for documentation inference and testing purposes. RESTSpecIT utilizes an in-context prompt masking strategy, requiring no prior model fine-tuning. We evaluate the quality of our tool with three state-of-the-art LLMs: DeepSeek V3, GPT-4.1, and GPT-3.5. Our evaluation demonstrates that RESTSpecIT can (1) infer documentation with 88.62% of routes and 89.25% of query parameters found on average, (2) discover undocumented API data, (3) operate efficiently (in terms of model costs, requests sent, runtime), and (4) assist REST API testing by uncovering server errors and generating valid OpenAPI Specification inputs for testing tools.

REST API在Web服务实现中十分普遍,它通过HTTP协议简化了互操作。API测试人员和用户使用被广泛采用的OpenAPI规范(OAS),这是一种用于记录REST API的机器可读标准。然而,为API编写文档既耗时又容易出错,而且现有文档并不总是完整、可公开访问或保持最新。这种情况限制了测试工具的效率,也妨碍了人的理解。大语言模型(LLM)凭借其海量训练数据,有望自动推断API文档。本文提出RESTSpecIT,这是首个利用LLM推断文档并对REST API进行黑盒测试的自动化方法。与最先进的工具相比,我们的方法只需要极少的用户输入:给定API名称和LLM访问密钥,RESTSpecIT生成API请求种子,并用LLM返回的数据对其进行变异。随后,该工具分析API响应,用于文档推断和测试。RESTSpecIT采用上下文内提示掩码策略,无需事先对模型进行微调。我们使用三个最先进的LLM评估了工具的质量:DeepSeek V3、GPT-4.1和GPT-3.5。评估表明,RESTSpecIT能够(1)推断文档,平均找到88.62%的路由和89.25%的查询参数;(2)发现未记录的API数据;(3)高效运行(就模型成本、发送的请求数和运行时间而言);(4)通过发现服务器错误并为测试工具生成有效的OpenAPI规范输入来辅助REST API测试。
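
The seed-and-mutate step can be pictured with a short sketch built on Python's standard `urllib.parse`. It is only an illustration: in RESTSpecIT the candidate routes and parameters come from the LLM, whereas here they are passed in by hand, and the function name, the example URL, and the candidates are all hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def mutate_request(seed_url, candidate_params, candidate_segments):
    """Derive new request URLs from a seed by adding one parameter or path segment."""
    parts = urlsplit(seed_url)
    query = parse_qsl(parts.query)
    mutants = []
    # One mutant per candidate query parameter.
    for name, value in candidate_params:
        new_query = urlencode(query + [(name, value)])
        mutants.append(urlunsplit(parts._replace(query=new_query)))
    # One mutant per candidate route suffix.
    for segment in candidate_segments:
        new_path = parts.path.rstrip("/") + "/" + segment
        mutants.append(urlunsplit(parts._replace(path=new_path)))
    return mutants


mutants = mutate_request(
    "https://api.example.com/v1/users?page=1",
    candidate_params=[("limit", "10")],
    candidate_segments=["active"],
)
```

Each mutant would then be sent to the API. A response that differs meaningfully from the seed's response suggests the new route or parameter exists and can be recorded in the inferred specification, while a 5xx response is a candidate server error.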

Article 41

Title@2025-07-15 (2): Decision Models for Selecting Architecture Patterns and Strategies in Quantum Software Systems

Title: Decision Models for Selecting Architecture Patterns and Strategies in Quantum Software Systems

Entscheidungsmodelle für die Auswahl von Architekturmustern und -strategien in Quantensoftwaresystemen

量子软件系统中选择架构模式与策略的决策模型 2507.11671v1

Authors (10): Mst Shamima Aktar, Peng Liang, Muhammad Waseem, Amjed Tahir, Mojtaba Shahin, Muhammad Azeem Akbar, Arif Ali Khan, Aakash Ahmad, Musengamana Jean de Dieu, Ruiyin Li

Quantum software represents disruptive technologies in the form of quantum-specific software systems, services, and applications that leverage the principles of quantum mechanics via programmable quantum bits (Qubits) that manipulate quantum gates (QuGates) to achieve quantum supremacy in computing. Quantum software architecture enables quantum software developers to abstract away implementation-specific details (i.e., mapping of Qubits and QuGates to high-level architectural components and connectors). Architectural patterns and strategies can provide reusable knowledge and best practices to engineer quantum software systems effectively and efficiently. However, quantum software practitioners face significant challenges in selecting and implementing appropriate patterns and strategies due to the complexity of quantum software systems and the lack of guidelines. To address these challenges, this study proposes decision models for selecting patterns and strategies in six critical design areas in quantum software systems: Communication, Decomposition, Data Processing, Fault Tolerance, Integration and Optimization, and Algorithm Implementation. These decision models are constructed based on data collected from both a mining study (i.e., GitHub and Stack Exchange) and a Systematic Literature Review, which were used to identify relevant patterns and strategies with their involved Quality Attributes (QAs). We then conducted semi-structured interviews with 16 quantum software practitioners to evaluate the familiarity, understandability, completeness, and usefulness of the proposed decision models. The results show that the proposed decision models can aid practitioners in selecting suitable patterns and strategies to address the challenges related to the architecture design of quantum software systems. The dataset is available at [6], allowing the community to reproduce and build upon our findings.

量子软件是一类颠覆性技术,涵盖量子专用的软件系统、服务和应用,它们通过操纵量子门(QuGates)的可编程量子比特(Qubits)利用量子力学原理,以实现计算上的量子优越性。量子软件架构使量子软件开发者能够抽象掉与实现相关的细节(即把Qubits和QuGates映射到高层的架构组件和连接件)。架构模式和策略可以提供可复用的知识和最佳实践,以便有效且高效地构建量子软件系统。然而,由于量子软件系统的复杂性和指南的缺乏,量子软件从业者在选择和实施合适的模式和策略时面临重大挑战。为应对这些挑战,本研究针对量子软件系统的六个关键设计领域提出了用于选择模式和策略的决策模型:通信、分解、数据处理、容错、集成与优化以及算法实现。这些决策模型基于从挖掘研究(即GitHub和Stack Exchange)和系统性文献综述中收集的数据构建,这些数据用于识别相关的模式和策略及其涉及的质量属性(QA)。随后,我们对16位量子软件从业者进行了半结构化访谈,以评估所提决策模型的熟悉度、可理解性、完整性和有用性。结果表明,所提出的决策模型可以帮助从业者选择合适的模式和策略,以应对与量子软件系统架构设计相关的挑战。数据集可在[6]获取,便于社区复现我们的结果并在此基础上继续研究。

Article 42

Title@2025-07-15 (2): ELFuzz: Efficient Input Generation via LLM-driven Synthesis Over Fuzzer Space

Title: ELFuzz: Efficient Input Generation via LLM-driven Synthesis Over Fuzzer Space

ELFuzz: Effiziente Input-Generierung über LLM-gesteuerte Synthese über Fuzzer-Raum

ELFuzz:通过LLM驱动的模糊空间综合合成有效投入生成 2506.10323v3

Authors (3): Chuyang Chen, Brendan Dolan-Gavitt, Zhiqiang Lin

Generation-based fuzzing produces appropriate test cases according to specifications of input grammars and semantic constraints to test systems and software. However, these specifications require significant manual effort to construct. This paper proposes a new approach, ELFuzz (Evolution Through Large Language Models for Fuzzing), that automatically synthesizes generation-based fuzzers tailored to a system under test (SUT) via LLM-driven synthesis over fuzzer space. At a high level, it starts with minimal seed fuzzers and propels the synthesis by fully automated LLM-driven evolution with coverage guidance. Compared to previous approaches, ELFuzz can 1) seamlessly scale to SUTs of real-world sizes (up to 1,791,104 lines of code in our evaluation) and 2) synthesize efficient fuzzers that catch interesting grammatical structures and semantic constraints in a human-understandable way. Our evaluation compared ELFuzz with specifications manually written by domain experts and synthesized by state-of-the-art approaches. It shows that ELFuzz achieves up to 434.8% more coverage and triggers up to 174.0% more artificially injected bugs. We also used ELFuzz to conduct a real-world fuzzing campaign on the newest version of cvc5 for 14 days, and encouragingly, it found five 0-day bugs (three are exploitable). Moreover, we conducted an ablation study, which shows that the fuzzer space model, the key component of ELFuzz, contributes the most (up to 62.5%) to the effectiveness of ELFuzz. Further analysis of the fuzzers synthesized by ELFuzz confirms that they catch interesting grammatical structures and semantic constraints in a human-understandable way. The results show the promising potential of ELFuzz for more automated, efficient, and extensible input generation for fuzzing.

基于生成的模糊测试根据输入语法和语义约束的规范,为被测系统和软件生成合适的测试用例。然而,构建这些规范需要大量人工工作。本文提出一种新方法ELFuzz(Evolution Through Large Language Models for Fuzzing),它通过LLM驱动的模糊器空间合成,自动合成针对被测系统(SUT)定制的基于生成的模糊器。总体而言,它从最小的种子模糊器出发,借助覆盖率引导、完全自动化的LLM驱动进化来推进合成。与以往方法相比,ELFuzz能够1)无缝扩展到真实规模的SUT(在我们的评估中最多达1,791,104行代码),2)合成高效的模糊器,以人类可理解的方式捕获有意义的语法结构和语义约束。我们将ELFuzz与领域专家手工编写的规范以及最先进方法合成的规范进行了比较。结果表明,ELFuzz的覆盖率最多提高434.8%,触发的人工注入缺陷最多增加174.0%。我们还使用ELFuzz对最新版本的cvc5进行了为期14天的真实模糊测试,令人鼓舞的是,它发现了五个0-day缺陷(其中三个可被利用)。此外,我们进行了消融研究,结果显示模糊器空间模型这一ELFuzz的关键组件对其有效性贡献最大(最高达62.5%)。对ELFuzz所合成模糊器的进一步分析证实,它们以人类可理解的方式捕获了有意义的语法结构和语义约束。这些结果展示了ELFuzz在实现更自动化、更高效、更可扩展的模糊测试输入生成方面的良好潜力。
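
For readers unfamiliar with the term, a generation-based fuzzer is a program that emits inputs from a grammar instead of mutating existing ones. The toy example below shows that shape with a hand-written grammar for arithmetic expressions. ELFuzz synthesizes such fuzzers automatically with an LLM; nothing here reflects its actual output or internals.

```python
import random

# Toy grammar: the last alternative of every rule is non-recursive,
# which lets the generator force termination once it is deep enough.
GRAMMAR = {
    "<expr>": [["<term>", "+", "<expr>"], ["<term>"]],
    "<term>": [["(", "<expr>", ")"], ["<num>"]],
    "<num>": [["0"], ["1"], ["42"]],
}


def generate(symbol, rng, depth=0, max_depth=6):
    """Expand `symbol` into a random string derived from GRAMMAR."""
    if symbol not in GRAMMAR:
        return symbol  # terminal symbol
    options = GRAMMAR[symbol]
    if depth >= max_depth:
        options = [options[-1]]  # only the non-recursive alternative
    chosen = rng.choice(options)
    return "".join(generate(s, rng, depth + 1, max_depth) for s in chosen)


rng = random.Random(0)
samples = [generate("<expr>", rng) for _ in range(20)]
```

Every sample is a syntactically valid expression, so it gets past the parser of the system under test and exercises deeper logic. Writing grammars and constraints like this by hand is the manual effort the abstract refers to.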

Article 43

Title@2025-07-15 (2): Modeling Code: Is Text All You Need?

Title: Modeling Code: Is Text All You Need?

Modeling Code: Ist Text alles, was Sie brauchen?

建模代码:你只需要文字吗? 2507.11467v1

Authors (7): Daniel Nichols, Konstantinos Parasyris, Harshitha Menon, Brian R. Bartoldson, Giorgis Georgakoudis, Tal Ben-Nun, Abhinav Bhatele

Code LLMs have become extremely popular recently for modeling source code across a variety of tasks, such as generation, translation, and summarization. However, transformer-based models are limited in their capabilities to reason through structured, analytical properties of code, such as control and data flow. Previous work has explored the modeling of these properties with structured data and graph neural networks. However, these approaches lack the generative capabilities and scale of modern LLMs. In this work, we introduce a novel approach to combine the strengths of modeling both code as text and more structured forms.

最近,代码LLMS在制作、翻译和总结等各种任务中为源代码建模变得非常受欢迎,但是,基于变压器的模型在能力上受到限制,只能通过有结构的、分析的代码特性来解释,例如控制和数据流。以前的工作是利用结构化的数据和图形神经网络来探索这些属性的建模,但是,这些方法缺乏现代LMS的基因化能力和规模。在这项工作中,我们采用一种新颖的办法,将代码建模的长处作为文本和结构化的两种形式结合起来。

Article 44

Title@2025-07-15 (2): Assistance or Disruption? Exploring and Evaluating the Design and Trade-offs of Proactive AI Programming Support

Title: Assistance or Disruption? Exploring and Evaluating the Design and Trade-offs of Proactive AI Programming Support

Unterstützung oder Disruption? Erforschen und Bewerten von Design und Trade-offs proaktiver KI-Programmierungsunterstützung

辅助还是干扰?探索与评估主动式AI编程支持的设计与权衡 2502.18658v3

Authors (7): Kevin Pu, Daniel Lazaro, Ian Arawjo, Haijun Xia, Ziang Xiao, Tovi Grossman, Yan Chen

AI programming tools enable powerful code generation, and recent prototypes attempt to reduce user effort with proactive AI agents, but their impact on programming workflows remains unexplored. We introduce and evaluate Codellaborator, a design probe LLM agent that initiates programming assistance based on editor activities and task context. We explored three interface variants to assess trade-offs between increasingly salient forms of AI support: prompt-only, proactive agent, and proactive agent with presence and context (Codellaborator). In a within-subject study (N=18), we find that proactive agents increase efficiency compared to the prompt-only paradigm, but also incur workflow disruptions. However, presence indicators and interaction context support alleviated disruptions and improved users’ awareness of AI processes. We underscore trade-offs of Codellaborator on user control, ownership, and code understanding, emphasizing the need to adapt proactivity to programming processes. Our research contributes to the design exploration and evaluation of proactive AI systems, presenting design implications on AI-integrated programming workflow.

AI编程工具使强大的编程工具能够产生强大的代码,而最近的原型则试图减少与AI代理商的用户努力,但它们对编程工作流程的影响仍未得到探讨。我们介绍和评价了Collaborator,这是一个设计探测器LLM代理商,根据编辑的活动和任务背景启动编程援助。我们探讨了三个接口变式,以评估日益突出的AI支助之间的权衡:即:即即即时的、主动的代理商和具有存在和背景的主动代理商(Codellaborator)。在一项专题研究(N=18)中,我们发现主动代理商提高了效率,而不是只迅速的模式,但也造成了工作流程的中断。然而,存在指标和互动环境支持缓解了干扰,提高了用户对AI进程的认识。我们强调,代码管理员在用户控制、所有权和代码理解方面的权衡,强调需要将主动性调整到编程过程。我们的研究有助于设计对主动的AI系统进行探索和评价,对AI综合编程流程提出了设计影响。

Article 45

Title@2025-07-15 (2): From Chaos to Automation: Enabling the Use of Unstructured Data for Robotic Process Automation

Title: From Chaos to Automation: Enabling the Use of Unstructured Data for Robotic Process Automation

Vom Chaos zur Automatisierung: Die Nutzung unstrukturierter Daten für die Automatisierung von Roboterprozessen ermöglichen

从混乱到自动化:使无结构数据能够用于机器人程序自动化 2507.11364v1

Authors (3): Kelly Kurowski, Xixi Lu, Hajo A. Reijers

The growing volume of unstructured data within organizations poses significant challenges for data analysis and process automation. Unstructured data, which lacks a predefined format, encompasses various forms such as emails, reports, and scans. It is estimated to constitute approximately 80% of enterprise data. Despite the valuable insights it can offer, extracting meaningful information from unstructured data is more complex compared to structured data. Robotic Process Automation (RPA) has gained popularity for automating repetitive tasks, improving efficiency, and reducing errors. However, RPA is traditionally reliant on structured data, limiting its application to processes involving unstructured documents. This study addresses this limitation by developing the UNstructured Document REtrieval SyStem (UNDRESS), a system that uses fuzzy regular expressions, techniques for natural language processing, and large language models to enable RPA platforms to effectively retrieve information from unstructured documents. The research involved the design and development of a prototype system, and its subsequent evaluation based on text extraction and information retrieval performance. The results demonstrate the effectiveness of UNDRESS in enhancing RPA capabilities for unstructured data, providing a significant advancement in the field. The findings suggest that this system could facilitate broader RPA adoption across processes traditionally hindered by unstructured data, thereby improving overall business process efficiency.

各组织内日益增加的无结构化数据量对数据分析和流程自动化提出了重大挑战。无结构化数据缺乏预先确定的格式,包含各种形式,如电子邮件、报告和扫描,估计约占企业数据的80%。尽管它能够提供宝贵的见解,但从无结构化数据中提取有意义的信息比结构化数据更为复杂。机器人程序自动化(RPA)在使重复任务自动化、提高效率和减少错误方面越来越受欢迎。然而,RPA传统上依赖结构化数据,将其应用限于非结构化文件的流程。这项研究通过开发结构化文件REtrival SyStem(UNDRSS)解决了这一局限性,该系统使用模糊的常规表达方式、自然语言处理技术和大型语言模型,使RPA平台能够有效地从无结构化文件中检索信息。研究涉及设计和开发原型系统,以及随后根据文本提取和信息检索绩效进行的评估。研究结果表明,UNDRESSES在加强无结构化文件的 RPA数据能力方面的有效性,提供了传统的改进,从而阻碍了整个外地程序的效率。
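The abstract names fuzzy regular expressions as one of the retrieval techniques but does not show one. The following is a minimal sketch of the general idea only, not the UNDRESS implementation: it tolerates OCR-style typos in a field label and then reads the value that follows. The function names, the 0.8 threshold, and the sample text are assumptions made for illustration, and `difflib` similarity stands in for a true fuzzy-regex engine.

```python
import re
from difflib import SequenceMatcher


def fuzzy_find(label: str, text: str, threshold: float = 0.8):
    """Return (start, end) of the window most similar to `label`, or None.

    Hypothetical helper: slides a label-sized window over the text and keeps
    the first window with the highest similarity ratio.
    """
    n = len(label)
    best_ratio, best_span = 0.0, None
    for i in range(max(1, len(text) - n + 1)):
        window = text[i:i + n]
        ratio = SequenceMatcher(None, label.lower(), window.lower()).ratio()
        if ratio > best_ratio:
            best_ratio, best_span = ratio, (i, i + n)
    return best_span if best_ratio >= threshold else None


def extract_after(label: str, text: str):
    """Return the token following an approximately matched label, or None."""
    span = fuzzy_find(label, text)
    if span is None:
        return None
    # An ordinary (exact) regex reads the value once the label is located.
    match = re.match(r"\s*[:#]?\s*([A-Za-z0-9-]+)", text[span[1]:])
    return match.group(1) if match else None


# Sample scan with two OCR errors in the label ("lnvoice Nunber").
scanned = "lnvoice Nunber: INV-2041\nTotal: 99.50"
print(extract_after("Invoice Number", scanned))
```

A sliding window is quadratic in the worst case, so this is only suitable for short documents; a dedicated fuzzy-matching library would be the more realistic choice at scale.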

Article 46

Title@2025-07-15 (2): Security Debt in Practice: Nuanced Insights from Practitioners

Title: Security Debt in Practice: Nuanced Insights from Practitioners

Sicherheitsschuld in der Praxis: Nuanced Insights von Praktizierenden

实践中的担保债务:从从业者那里得到的 “ 洞察 “ 2507.11362v1

Authors (3): Chaima Boufaied, Taher Ghaleb, Zainab Masood

With the increasing reliance on software and automation nowadays, tight deadlines, limited resources, and prioritization of functionality over security can lead to insecure coding practices. When not handled properly, these constraints cause unaddressed security vulnerabilities to accumulate over time, forming Security Debts (SDs). Despite their critical importance, there is limited empirical evidence on how software practitioners perceive, manage, and communicate SDs in real-world settings. In this paper, we present a qualitative empirical study based on semi-structured interviews with 22 software practitioners across various roles, organizations, and countries. We address four research questions: i) we assess software practitioners’ knowledge of SDs and awareness of associated security risks, ii) we investigate their behavior towards SDs, iii) we explore common tools and strategies used to mitigate SDs, and iv) we analyze how security risks are communicated within teams and to decision makers. We observe variations in how practitioners perceive and manage SDs, with some prioritizing delivery speed over security, while others consistently maintain security as a priority. Our findings emphasize the need for stronger integration of security practices across the Software Development Life Cycle (SDLC), more consistent use of mitigation strategies, better balancing of deadlines, resources, and security-related tasks, with attention to the Confidentiality, Integrity, and Availability (CIA) triad.

由于目前日益依赖软件和自动化,期限紧迫,资源有限,而且对安全功能的优先排序等,可能导致编码做法不安全;这些制约因素如果不妥善处理,就会导致安全脆弱性得不到解决,从而逐渐积累,从而形成安全债务(SDs)。尽管这些制约因素至关重要,但关于软件从业人员如何看待、管理和交流现实世界环境中的自毁的实证证据有限。在本文件中,我们根据与22个不同角色、组织和国家的软件从业人员进行的半结构性访谈,提出了一份定性经验研究报告。我们讨论了四个研究问题:一)我们评估软件从业人员对自毁知识的了解和对相关安全风险的认识,二)我们调查他们对自失常行为,三)我们探索用于缓解自失常现象的共同工具和战略,三)我们分析如何在团队和决策者中传达安全风险。我们观察从业者如何看待和管理自失常,有些人将交付速度放在安全之上,而另一些人则一贯将安全作为优先事项。我们的调查结果强调,需要在软件发展生命周期(SDLC)内加强安全做法的整合,更加一致地使用减缓战略,更好地平衡最后期限、资源、可靠性和安全方面的任务。

Article 47

Title@2025-07-15 (2): RefModel: Detecting Refactorings using Foundation Models

Title: RefModel: Detecting Refactorings using Foundation Models

RefModel: Refactorings mithilfe von Foundation Models erkennen

RefModel: 使用基础模型检测重构 2507.11346v1

Authors (6): Pedro Simões, Rohit Gheyi, Rian Melo, Jonhnanthan Oliveira, Márcio Ribeiro, Wesley K. G. Assunção

Refactoring is a common software engineering practice that improves code quality without altering program behavior. Although tools like ReExtractor+, RefactoringMiner, and RefDiff have been developed to detect refactorings automatically, they rely on complex rule definitions and static analysis, making them difficult to extend and generalize to other programming languages. In this paper, we investigate the viability of using foundation models for refactoring detection, implemented in a tool named RefModel. We evaluate Phi4-14B and Claude 3.5 Sonnet on a dataset of 858 single-operation transformations applied to artificially generated Java programs, covering widely-used refactoring types. We also extend our evaluation by including Gemini 2.5 Pro and o4-mini-high, assessing their performance on 44 real-world refactorings extracted from four open-source projects. These models are compared against RefactoringMiner, RefDiff, and ReExtractor+. RefModel is competitive with, and in some cases outperforms, traditional tools. In real-world settings, Claude 3.5 Sonnet and Gemini 2.5 Pro jointly identified 97% of all refactorings, surpassing the best-performing static-analysis-based tools. The models showed encouraging generalization to Python and Golang. They provide natural language explanations and require only a single sentence to define each refactoring type.

重构是一种常见的软件工程实践,可以提高代码质量而不会改变程序行为。虽然ReExtractor+、RefctoringMiner和RefDiff等工具已经开发出自动检测再构件的工具,但是它们依赖复杂的规则定义和静态分析,因此难以推广和概括到其他编程语言。在本文中,我们调查了使用基础模型进行再构件检测的可行性,该模型在名为RefModel的工具中实施。我们评估了Phi4-14B和Claude 3.5 Sonnet的一套858个单一操作转换数据集,适用于人工生成的爪哇程序,涵盖了广泛使用的再构件类型。我们还扩大了我们的评价范围,包括Gemini 2.5 Pro和 o4-minigh,评估了其在从四个开放源项目中提取的44个真实世界再构件上的性能。这些模型与RefactorMiner、RefDiff和ReExtractor +.它们只是与传统工具的外型相比,而且在某些情况中也比起来。在现实环境中设置中,Claude3.5 Sonnet和Gemismastrual 2.5的每一种模型中,每个都显示了通用工具。

Article 48

Title@2025-07-15 (2): QLPro: Automated Code Vulnerability Discovery via LLM and Static Code Analysis Integration

Title: QLPro: Automated Code Vulnerability Discovery via LLM and Static Code Analysis Integration

QLPro: Automatisierte Code Vulnerability Discovery über LLM und Static Code Analysis Integration

QLPro:通过LLM和静态代码分析整合发现自动编码易脆弱性 2506.23644v2

Authors (8): Junze Hu, Xiangyu Jin, Yizhe Zeng, Yuling Liu, Yunpeng Li, Dan Du, Kaiyu Xie, Hongsong Zhu

We introduce QLPro, a vulnerability detection framework that systematically integrates LLMs and static analysis tools to enable comprehensive vulnerability detection across entire open-source projects. We constructed a new dataset, JavaTest, comprising 10 open-source projects from GitHub with 62 confirmed vulnerabilities. CodeQL, a state-of-the-art static analysis tool, detected only 24 of these vulnerabilities while QLPro detected 41. Furthermore, QLPro discovered 6 previously unknown vulnerabilities, 2 of which have been confirmed as 0-days.

我们引入了QLPro,这是一个脆弱性检测框架,它系统地整合了LLMs和静态分析工具,以便能够在整个开放源码项目中全面检测脆弱性。我们建立了一个新的数据集,JavaTestor,由GitHub的10个公开源码项目组成,其中62个被确认为脆弱性。 CodeQL是一个最先进的静态分析工具,仅检测到24个,而QLPro检测到41个。此外,QLPro发现了6个以前未知的脆弱性,其中2个被确认为0天。
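The detection counts above are easier to compare as rates. The short calculation below uses only the figures quoted in the abstract; calling the ratio "recall" is our framing, and the abstract reports no false-positive counts, so precision cannot be derived.

```python
# Counts quoted in the abstract for the JavaTest dataset.
TOTAL_CONFIRMED = 62
detected = {"CodeQL": 24, "QLPro": 41}


def recall(found: int, total: int) -> float:
    """Share of the confirmed vulnerabilities that a tool detected."""
    return found / total


for tool, found in detected.items():
    print(f"{tool}: {found}/{TOTAL_CONFIRMED} = {recall(found, TOTAL_CONFIRMED):.1%}")
# CodeQL: 24/62 = 38.7%
# QLPro: 41/62 = 66.1%
```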

Article 49

Title@2025-07-15 (2): Dually Hierarchical Drift Adaptation for Online Configuration Performance Learning

Title: Dually Hierarchical Drift Adaptation for Online Configuration Performance Learning

Dual Hierarchische Drift-Anpassung für Online-Konfigurations-Performance-Lernen

为在线配置绩效学习进行双级分级漂流适应 2507.08730v3

Authors (3): Zezhen Xiang, Jingzhi Gong, Tao Chen

Modern configurable software systems need to learn models that correlate configuration and performance. However, when the system operates in dynamic environments, the workload variations, hardware changes, and system updates will inevitably introduce concept drifts at different levels - global drifts, which reshape the performance landscape of the entire configuration space; and local drifts, which only affect certain sub-regions of that space. As such, existing offline and transfer learning approaches can struggle to adapt to these implicit and unpredictable changes in real-time, rendering configuration performance learning challenging. To address this, we propose DHDA, an online configuration performance learning framework designed to capture and adapt to these drifts at different levels. The key idea is that DHDA adapts to both the local and global drifts using dually hierarchical adaptation: at the upper level, we redivide the data into different divisions, within each of which the local model is retrained, to handle global drifts only when necessary. At the lower level, the local models of the divisions can detect local drifts and adapt themselves asynchronously. To balance responsiveness and efficiency, DHDA combines incremental updates with periodic full retraining to minimize redundant computation when no drifts are detected. Through an evaluation on eight software systems against state-of-the-art approaches, we show that DHDA achieves considerably better accuracy and can effectively adapt to drifts with up to 2x improvements, while incurring reasonable overhead and improving different local models in handling concept drift.

现代的可配置软件系统需要学习与配置和性能相关的模型。然而,当系统在动态环境中运作时,工作量变化、硬件变化和系统更新将不可避免地引入不同层次的概念漂移——全球漂移,改变整个配置空间的性能景观;地方漂移,这只影响该空间的某些次区域。因此,现有的离线和传输学习方法可能难以适应实时中隐含和不可预测的变化,使配置绩效学习具有挑战性。为了解决这个问题,我们提议DHDA(DHDA),一个在线配置绩效学习框架,旨在捕捉和适应不同层次的这些漂移。关键思想是DHDA(DHA)利用双重等级适应概念适应当地和全球的漂移:在上层,我们重新将数据改编成不同的分区,在每一分区内,只有在必要时才能处理全球漂移。在较低层次,各司的当地模型可以检测到本地漂移,并自稳地调整自己。为了平衡和效率,DHD(D)将渐进式更新与定期全面再培训相结合,在不可靠地进行漂移时,我们无法对D(D)系统进行更精确地评估。

Article 50

Title@2025-07-15 (2): An Empirical Study of Multi-Agent RAG for Real-World University Admissions Counseling

Title: An Empirical Study of Multi-Agent RAG for Real-World University Admissions Counseling

Eine empirische Studie von Multi-Agent RAG für Real-World University Admissions Counseling

现实世界大学招生咨询多方代理RAG经验研究 2507.11272v1

Authors (6): Anh Nguyen-Duc, Chien Vu Manh, Bao Anh Tran, Viet Phuong Ngo, Luan Le Chi, Anh Quang Nguyen

This paper presents MARAUS (Multi-Agent and Retrieval-Augmented University Admission System), a real-world deployment of a conversational AI platform for higher education admissions counseling in Vietnam. While large language models (LLMs) offer potential for automating advisory tasks, most existing solutions remain limited to prototypes or synthetic benchmarks. MARAUS addresses this gap by combining hybrid retrieval, multi-agent orchestration, and LLM-based generation into a system tailored for real-world university admissions. In collaboration with the University of Transport Technology (UTT) in Hanoi, we conducted a two-phase study involving technical development and real-world evaluation. MARAUS processed over 6,000 actual user interactions, spanning six categories of queries. Results show substantial improvements over LLM-only baselines: on average 92 percent accuracy, hallucination rates reduced from 15 percent to 1.45 percent, and average response times below 4 seconds. The system operated cost-effectively, with a two-week deployment cost of 11.58 USD using GPT-4o mini. This work provides actionable insights for the deployment of agentic RAG systems in low-resource educational settings.

本文介绍MARAUS(多国代理和回收大学招生系统),这是越南高等教育招生咨询中一个对话性AI平台的实实在在的部署,虽然大型语言模型(LLMS)提供了使咨询任务自动化的潜力,但大多数现有解决办法仍然局限于原型或合成基准。MARAUS通过将混合检索、多试管和LLLM(以LLM(LLM)为基础的一代)结合成一个适合现实世界大学招生的系统来弥补这一差距。我们与河内运输技术大学(UT)合作,开展了一项涉及技术发展和实际世界评估的两阶段研究。MARAUS(M)处理了6 000多个实际用户互动,涉及六类查询。结果显示,仅LLM(LLM)基线有重大改进:平均92%的准确率,幻觉率从15%前点降至1.45%,平均反应时间低于4秒。该系统运作成本有效,使用GPT-4o微型系统部署费用为11.58美元,为期两周。这项工作为在低资源教育环境中部署RAGGG系统提供了可操作的洞察。
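Two derived figures help put the reported numbers in context. The calculation below uses only values quoted in the abstract; treating "over 6,000" as exactly 6,000 is an assumption, so the per-interaction cost is an upper bound.

```python
# Figures quoted in the abstract.
INTERACTIONS = 6000          # "over 6,000", used here as a lower bound on volume
DEPLOYMENT_COST_USD = 11.58  # two-week deployment cost with GPT-4o mini
HALLUCINATION_BEFORE = 15.0  # percent, before
HALLUCINATION_AFTER = 1.45   # percent, with MARAUS

# Upper bound on cost per interaction, since the true volume is above 6,000.
cost_per_interaction = DEPLOYMENT_COST_USD / INTERACTIONS

# Relative reduction in hallucination rate.
relative_reduction = (HALLUCINATION_BEFORE - HALLUCINATION_AFTER) / HALLUCINATION_BEFORE

print(f"cost per interaction <= {cost_per_interaction:.5f} USD")
print(f"relative hallucination reduction = {relative_reduction:.1%}")
# cost per interaction <= 0.00193 USD
# relative hallucination reduction = 90.3%
```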

Article 51

Title@2025-07-15 (2): New Formulation of DNN Statistical Mutation Killing for Ensuring Monotonicity: A Technical Report

Title: New Formulation of DNN Statistical Mutation Killing for Ensuring Monotonicity: A Technical Report

Neue Formulierung von DNN-Statistischem Mutationskilling zur Sicherung der Monotonizität: Ein technischer Bericht

新制定的DNN 统计变异杀人确保独独独性:技术报告 2507.11199v1

Authors (5): Jinhan Kim, Nargiz Humbatova, Gunel Jahangirova, Shin Yoo, Paolo Tonella

Mutation testing has emerged as a powerful technique for evaluating the effectiveness of test suites for Deep Neural Networks. Among existing approaches, the statistical mutant killing criterion of DeepCrime has leveraged statistical testing to determine whether a mutant significantly differs from the original model. However, it suffers from a critical limitation: it violates the monotonicity property, meaning that expanding a test set may result in previously killed mutants no longer being classified as killed. In this technical report, we propose a new formulation of statistical mutant killing based on Fisher's exact test that preserves the statistical rigour of the original criterion while ensuring monotonicity.

突变测试已成为评估深神经网络测试套件有效性的有力技术,在现行方法中,深犯罪统计变异杀人标准利用统计测试来确定变异人是否与原始模型有重大差异,然而,它受到一个关键的限制:它违反了单一特性,这意味着扩大测试组可能导致先前被杀死的变异人不再被归类为死亡。在本技术报告中,我们提议根据渔业精确测试,采用统计变异人杀人的新方式,既保持该变异人的统计严谨性,又确保单一性。
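For readers unfamiliar with Fisher's exact test, the sketch below shows the test itself on a 2x2 table of correct and incorrect predictions. The abstract does not say how the report builds its contingency table or how it obtains monotonicity, so the table layout, the `is_killed` helper, and the 0.05 threshold are assumptions for illustration, not the authors' formulation.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small tolerance so ties with the observed probability are included.
    p_value = sum(p for p in map(prob, range(lo, hi + 1))
                  if p <= p_observed * (1 + 1e-9))
    return min(1.0, p_value)


def is_killed(orig_correct: int, mutant_correct: int, n_inputs: int,
              alpha: float = 0.05) -> bool:
    """Hypothetical criterion: mutant differs significantly from the original."""
    p = fisher_exact_two_sided(orig_correct, n_inputs - orig_correct,
                               mutant_correct, n_inputs - mutant_correct)
    return p < alpha


print(is_killed(95, 60, 100))  # large accuracy drop on 100 inputs
print(is_killed(95, 94, 100))  # near-identical behaviour
```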

Article 52

Title@2025-07-15 (2): GUARD:Dual-Agent based Backdoor Defense on Chain-of-Thought in Neural Code Generation

Title: GUARD:Dual-Agent based Backdoor Defense on Chain-of-Thought in Neural Code Generation

GUARD:Dual-Agent-basierte Backdoor-Verteidigung auf Ketten-of-Thought in Neural Code Generation

GUARD: 在神经代码生成过程中寻求的连锁研究中,基于 “ 以企业为基地 “ 的后门防御 2505.21425v2

Authors (4): Naizhu Jin, Zhong Li, Tian Zhang, Qingkai Zeng

With the widespread application of large language models in code generation, recent studies demonstrate that employing additional Chain-of-Thought generation models can significantly enhance code generation performance by providing explicit reasoning steps. However, as external components, CoT models are particularly vulnerable to backdoor attacks, which existing defense mechanisms often fail to detect effectively. To address this challenge, we propose GUARD, a novel dual-agent defense framework specifically designed to counter CoT backdoor attacks in neural code generation. GUARD integrates two core components: GUARD-Judge, which identifies suspicious CoT steps and potential triggers through comprehensive analysis, and GUARD-Repair, which employs a retrieval-augmented generation approach to regenerate secure CoT steps for identified anomalies. Experimental results show that GUARD effectively mitigates attacks while maintaining generation quality, advancing secure code generation systems.

由于在代码生成中广泛应用了大型语言模型,最近的研究表明,采用额外的“努力生成链”模型可以提供明确的推理步骤,大大提高代码生成绩效,然而,作为外部组成部分,COT模型特别容易受到后门攻击,而现有的防御机制往往无法有效地发现后门攻击;为了应对这一挑战,我们提议GUARD,这是一个新的双重用途防御框架,专门用来在神经代码生成中打击COT后门攻击。 GUARD集成两个核心组成部分:GUARD-Judge,通过全面分析查明可疑的COT步骤和潜在触发因素;GUARD-Repair,采用检索式的生成方法,为已查明的异常情况重新生成安全的COT步骤。实验结果表明,GUARD在保持生成质量的同时有效地减轻了袭击,推进了安全的代码生成系统。

Article 53

Title@2025-07-15 (2): PromiseTune: Unveiling Causally Promising and Explainable Configuration Tuning

Title: PromiseTune: Unveiling Causally Promising and Explainable Configuration Tuning

PromiseTune: Enthüllen kausal vielversprechende und erklärbare Konfigurationstuning

前景图:不懈的因果保证和可解释的配置图纸 2507.05995v3

Authors (2): Pengzhou Chen, Tao Chen

The high configurability of modern software systems has made configuration tuning a crucial step for assuring system performance, e.g., latency or throughput. However, given the expensive measurements, large configuration space, and rugged configuration landscape, existing tuners suffer ineffectiveness due to the difficult balance of budget utilization between exploring uncertain regions (for escaping from local optima) and exploiting guidance of known good configurations (for fast convergence). The root cause is that we lack knowledge of where the promising regions lie, which also causes challenges in the explainability of the results. In this paper, we propose PromiseTune that tunes configuration guided by causally purified rules. PromiseTune is unique in the sense that we learn rules, which reflect certain regions in the configuration landscape, and purify them with causal inference. The remaining rules serve as approximated reflections of the promising regions, bounding the tuning to emphasize these places in the landscape. This, as we demonstrate, can effectively mitigate the impact of the exploration and exploitation trade-off. Those purified regions can then be paired with the measured configurations to provide spatial explainability at the landscape level. Compared with 11 state-of-the-art tuners on 12 systems and varying budgets, we show that PromiseTune performs significantly better than the others with 42% superior rank to the overall second best while providing richer information to explain the hidden system characteristics.

现代软件系统高度的可配置性使得配置调整成为确保系统性能的关键步骤,例如长期性或吞吐量。然而,由于测量费用昂贵、配置空间大、配置面貌崎岖不平,现有调调器由于在探索不确定区域(摆脱当地选法)和利用已知良好配置的指导(快速趋同)之间难以平衡使用预算而丧失效力。其根本原因是,我们不了解有希望区域的位置,这也给解释结果带来挑战。在本文件中,我们提议 “ 承诺图 “ 以因果净化规则为指导,调整配置。 “ 承诺图 “ 是独特的,因为我们学习了规则,这些规则反映某些区域在配置面貌中的情况,并且用因果关系推断来净化这些规则。其余规则作为有希望区域的近似反映,将调整以强调这些地貌景观中的这些地方。正如我们所证明的那样,这可以有效地减轻勘探和开发交易的影响。这些经过净化的区域可以与测量的配置相匹配,以有因果关系的因果净化规则为指导。 “ 承诺图 “ 承诺图 “ 是独一无二的,因为我们学习了规则,这些规则反映了某些区域在配置,这些区域在配置中,在配置中反映了某些区域在结构中,比我们11级的稳定性水平上更精确度,我们展示了系统在12级上展示了更精确的系统。

Article 54

Title@2025-07-15 (2): Automata Models for Effective Bug Description

Title: Automata Models for Effective Bug Description

Automata Modelle für effektive Bug-Beschreibung

有效臭虫描述的自动模型 2507.11146v1

Authors (4): Tom Yaacov, Gera Weiss, Gal Amram, Avi Hayoun

Debugging complex systems is a crucial yet time-consuming task. This paper presents the use of automata learning and testing techniques to obtain concise and informative bug descriptions. We introduce the concepts of Failure Explanations (FE), Eventual Failure Explanations (EFE), and Early Detection (ED) to provide meaningful summaries of failing behavior patterns. By factoring out irrelevant information and focusing on essential test patterns, our approach aims to enhance bug detection and understanding. We evaluate our methods using various test patterns and real-world benchmarks, demonstrating their effectiveness in producing compact and informative bug descriptions.

调试复杂系统是一项关键而又耗时的任务。本文件介绍了使用自动数据学习和测试技术获取简明和内容丰富的错误描述的情况。我们介绍了失败解释(FE)、偶然失败解释(EFE)和早期发现(ED)的概念,以提供有意义的行为失灵模式摘要。通过将不相干的信息考虑在内并侧重于基本测试模式,我们的方法旨在加强错误的检测和理解。我们利用各种测试模式和现实世界基准来评估我们的方法,并展示其在编制紧凑和内容丰富的错误描述方面的有效性。

Article 55

Title@2025-07-15 (2): MT4DP: Data Poisoning Attack Detection for DL-based Code Search Models via Metamorphic Testing

Title: MT4DP: Data Poisoning Attack Detection for DL-based Code Search Models via Metamorphic Testing

MT4DP: Datenvergiftung Angriffserkennung für DL-basierte Code-Suchmodelle über Metamorphische Tests

MT4DP:通过变形测试对基于DL的代码搜索模型进行数据中毒攻击检测 2507.11092v1

Authors (6): Gong Chen, Wenjie Liu, Xiaoyuan Xie, Xunzhu Tang, Tegawendé F. Bissyandé, Songqiang Chen

Recently, several studies have indicated that data poisoning attacks pose a severe security threat to deep learning-based (DL-based) code search models. Attackers inject carefully crafted malicious patterns into the training data, misleading the code search model to learn these patterns during training. During the usage of the poisoned code search model for inference, once the malicious pattern is triggered, the model tends to rank the vulnerability code higher. However, existing detection methods for data poisoning attacks on DL-based code search models remain insufficiently effective. To address this critical security issue, we propose MT4DP, a Data Poisoning Attack Detection Framework for DL-based Code Search Models via Metamorphic Testing. MT4DP introduces a novel Semantically Equivalent Metamorphic Relation (SE-MR) designed to detect data poisoning attacks on DL-based code search models. Specifically, MT4DP first identifies the high-frequency words from search queries as potential poisoning targets and takes their corresponding queries as the source queries. For each source query, MT4DP generates two semantically equivalent follow-up queries and retrieves its source ranking list. Then, each source ranking list is re-ranked based on the semantic similarities between its code snippets and the follow-up queries. Finally, variances between the source and re-ranked lists are calculated to reveal violations of the SE-MR and warn the data poisoning attack. Experimental results demonstrate that MT4DP significantly enhances the detection of data poisoning attacks on DL-based code search models, outperforming the best baseline by 191% on average F1 score and 265% on average precision. Our work aims to promote further research into effective techniques for mitigating data poisoning threats on DL-based code search models.

最近,一些研究显示,数据中毒袭击对基于深层学习的代码搜索模型构成严重的安全威胁。攻击者仔细地将恶意模式引入培训数据,误导代码搜索模型以在培训期间学习这些模式。在使用有毒代码搜索模型进行推断时,一旦恶意模式触发,该模型倾向于将脆弱性代码排序更高。然而,基于DL代码搜索模型的现有数据中毒袭击检测方法仍然不够有效。为解决这一关键的安全问题,我们提出MT4DP,一个通过蜕变测试检测基于DL的代码搜索模型数据中毒攻击的框架。MT4DP引入了一种新的语义等价蜕变关系(SE-MR),用于检测针对基于DL的代码搜索模型的数据中毒攻击。具体而言,MT4DP首先将搜索查询中的高频词确定为潜在的中毒目标,并将其对应的查询作为源查询。对于每个源查询,MT4DP生成两个语义等价的后续查询,并检索其源排序列表。然后,根据列表中代码片段与后续查询之间的语义相似度,对每个源排序列表重新排序。最后,计算源列表与重排序列表之间的差异,以揭示违反SE-MR的情况并对数据中毒攻击发出警告。实验结果表明,MT4DP显著提升了对基于DL的代码搜索模型数据中毒攻击的检测能力,平均F1分数比最佳基线高191%,平均精确率高265%。我们的工作旨在推动对缓解基于DL的代码搜索模型数据中毒威胁的有效技术的进一步研究。
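The re-ranking and comparison steps described above can be sketched in a few lines. This is a toy illustration, not the MT4DP implementation: token-level Jaccard overlap stands in for whatever semantic similarity the authors use, mean squared rank displacement stands in for their variance measure, and all names and sample strings are hypothetical. Snippets are assumed to be unique within a list.

```python
def jaccard(a: str, b: str) -> float:
    """Toy similarity: token overlap between two strings (an assumption)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def rerank(source_list: list[str], follow_ups: list[str]) -> list[str]:
    """Re-rank snippets by mean similarity to the follow-up queries."""
    def score(snippet: str) -> float:
        return sum(jaccard(snippet, q) for q in follow_ups) / len(follow_ups)
    # sorted() is stable, so ties keep their source order.
    return sorted(source_list, key=score, reverse=True)


def rank_shift(source_list: list[str], reranked: list[str]) -> float:
    """Mean squared rank displacement between the two lists."""
    new_pos = {snippet: i for i, snippet in enumerate(reranked)}
    return sum((i - new_pos[s]) ** 2
               for i, s in enumerate(source_list)) / len(source_list)


# A ranking in which an unrelated snippet was pushed to the top.
source = ["delete all user data", "read file lines"]
follow_ups = ["read file", "read text file"]

reranked = rerank(source, follow_ups)
print(reranked, rank_shift(source, reranked))
```

A large shift suggests the original ranking was driven by something other than the semantic content of the query, which is the kind of relation violation the abstract describes; the threshold for raising a warning is not given there.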

Article 56

Title@2025-07-15 (2): Function-to-Style Guidance of LLMs for Code Translation

Title: Function-to-Style Guidance of LLMs for Code Translation

Funktion-zu-Stil Anleitung von LLMs für Code-Übersetzung

代码翻译LLMM LL 指南 2507.11083v1

Authors (11): Longhui Zhang, Bin Wang, Jiahao Wang, Xiaofeng Zhao, Min Zhang, Hao Yang, Meishan Zhang, Yu Li, Jing Li, Jun Yu, Min Zhang

Large language models (LLMs) have made significant strides in code translation tasks. However, ensuring both the correctness and readability of translated code remains a challenge, limiting their effective adoption in real-world software development. In this work, we propose F2STrans, a function-to-style guiding paradigm designed to progressively improve the performance of LLMs in code translation. Our approach comprises two key stages: (1) Functional learning, which optimizes translation correctness using high-quality source-target code pairs mined from online programming platforms, and (2) Style learning, which improves translation readability by incorporating both positive and negative style examples. Additionally, we introduce a novel code translation benchmark that includes up-to-date source code, extensive test cases, and manually annotated ground-truth translations, enabling comprehensive functional and stylistic evaluations. Experiments on both our new benchmark and existing datasets demonstrate that our approach significantly improves code translation performance. Notably, our approach enables Qwen-1.5B to outperform prompt-enhanced Qwen-32B and GPT-4 on average across 20 diverse code translation scenarios.

大型语言模型(LLMS)在代码翻译任务方面取得了长足的进步,然而,确保翻译代码的正确性和可读性仍是一项挑战,限制了翻译代码在现实世界软件开发中的有效应用。在这项工作中,我们提议F2STrans,这是一个功能化指导模式,旨在逐步提高代码翻译中LLMS的绩效。我们的方法包括两个关键阶段:(1) 功能学习,它利用从在线编程平台提取的高质量源目标代码对子优化翻译的正确性;(2) 风格学习,它通过纳入正式和负式范例来改进翻译的可读性。此外,我们引入了一个新的代码翻译基准,其中包括最新的源代码、广泛的测试案例和手动的地面跟踪翻译,从而能够进行全面功能性和文理学评估。我们的新基准和现有数据集的实验表明,我们的方法大大改进了代码翻译绩效。值得注意的是,我们的方法使Quen-1.5B在20种不同的代码翻译设想中的平均速度超过快速增强的Quen-32B和GPT-4。

Article 57

Title@2025-07-15 (2): Self-Admitted GenAI Usage in Open-Source Software

Title: Self-Admitted GenAI Usage in Open-Source Software

Selbstzugelassene GenAI-Nutzung in Open-Source-Software

开放源码软件自发使用GenAI 2507.10422v2

Authors (7): Tao Xiao, Youmei Fan, Fabio Calefato, Christoph Treude, Raula Gaikovina Kula, Hideaki Hata, Sebastian Baltes

The widespread adoption of generative AI (GenAI) tools such as GitHub Copilot and ChatGPT is transforming software development. Since generated source code is virtually impossible to distinguish from manually written code, their real-world usage and impact on open-source software development remain poorly understood. In this paper, we introduce the concept of self-admitted GenAI usage, that is, developers explicitly referring to the use of GenAI tools for content creation in software artifacts. Using this concept as a lens to study how GenAI tools are integrated into open-source software projects, we analyze a curated sample of more than 250,000 GitHub repositories, identifying 1,292 such self-admissions across 156 repositories in commit messages, code comments, and project documentation. Using a mixed methods approach, we derive a taxonomy of 32 tasks, 10 content types, and 11 purposes associated with GenAI usage based on 284 qualitatively coded mentions. We then analyze 13 documents with policies and usage guidelines for GenAI tools and conduct a developer survey to uncover the ethical, legal, and practical concerns behind them. Our findings reveal that developers actively manage how GenAI is used in their projects, highlighting the need for project-level transparency, attribution, and quality control practices in the new era of AI-assisted software development. Finally, we examine the impact of GenAI adoption on code churn in 151 repositories with self-admitted GenAI usage and find no general increase, contradicting popular narratives on the impact of GenAI on software development.

广泛采用GitHub Coople和ChattGGPT等基因化AI(GenAI)工具(GenAI)广泛采用GitHub Copolit和ChattGPT等基因化AI(GenAI)工具正在改变软件开发。由于生成源代码几乎无法与人工写成的代码区分,因此其真实世界的使用情况和对开放源码软件开发的影响仍然不甚为人知。在本文件中,我们采用了自我认可的GenAI使用概念,即开发者明确提及GenAI使用GenAI工具在软件工艺品中的内容创建内容。我们利用这一概念来研究GenAI工具如何融入开放源软件项目,我们分析了25万多个GitHub文献库的整理样本,查明了156个储存库在承诺信息、代码评论和项目文件中的1,292个这类自授自授权限。我们采用混合方法,得出了32项任务、10种内容类型和11个与GenAI使用GenAI使用工具在软件制作内容时所使用的目的指南的概念。我们分析了13份文件及其开发准则,以发现这些工具背后的道德、法律和实际关切。我们的调查结果显示GitHiHuAI如何在项目中使用自相矛盾的自我发展过程中使用,我们对IAI在透明化的自我化的自我分析了自我化的自我分析,在最后的自我分析项目中如何中如何的自我分析。我们在透明化标准中如何上如何使用。我们对IA的自我分析。

Article 58

Title@2025-07-15 (2): Advancing Code Coverage: Incorporating Program Analysis with Large Language Models

Title: Advancing Code Coverage: Incorporating Program Analysis with Large Language Models

Advancing Code Coverage: Einschließliche Programmanalyse mit großen Sprachmodellen

推进代码覆盖范围:将方案分析纳入大语言模式 2404.04966v2

Authors (5): Chen Yang, Junjie Chen, Bin Lin, Ziqi Wang, Jianyi Zhou

Automatic test generation plays a critical role in software quality assurance. While the recent advances in Search-Based Software Testing (SBST) and Large Language Models (LLMs) have shown promise in generating useful tests, these techniques still struggle to cover certain branches. Reaching these hard-to-cover branches usually requires constructing complex objects and resolving intricate inter-procedural dependencies in branch conditions, which poses significant challenges for existing test generation techniques. In this work, we propose TELPA, a novel technique aimed at addressing these challenges. Its key insight lies in extracting real usage scenarios of the target method under test to learn how to construct complex objects and extracting methods entailing inter-procedural dependencies with hard-to-cover branches to learn the semantics of branch constraints. To enhance efficiency and effectiveness, TELPA identifies a set of ineffective tests as counter-examples for LLMs and employs a feedback-based process to iteratively refine these counter-examples. Then, TELPA integrates program analysis results and counter-examples into the prompt, guiding LLMs to gain deeper understandings of the semantics of the target method and generate diverse tests that can reach the hard-to-cover branches. Our experimental results on 27 open-source Python projects demonstrate that TELPA significantly outperforms the state-of-the-art SBST and LLM-based techniques, achieving an average improvement of 31.39% and 22.22% in terms of branch coverage.

自动测试生成在软件质量保证方面发挥着关键作用。虽然搜索软件测试(SBST)和大语言模型(LLM)最近的进展在生成有用测试方面显示了希望,但这些技术仍然难以覆盖某些分支。到达这些难以覆盖的分支通常需要构造复杂的对象,并解决分支条件中复杂的过程间依赖,这对现有测试生成技术构成重大挑战。在这项工作中,我们提出了旨在应对这些挑战的新技术TELPA。它的关键思路在于提取被测目标方法的实际使用场景,以学习如何构造复杂对象,并提取与难以覆盖分支存在过程间依赖的方法,以学习分支约束的语义。为了提高效率和有效性,TELPA确定了一组无效的测试作为LLM的反例,并采用基于反馈的过程迭代地完善这些反例。然后,TELPA将程序分析结果和反例纳入提示中,引导LLM更深入地理解目标方法的语义,并生成能够到达难以覆盖分支的多样化测试。我们在27个开源Python项目上的实验结果表明,TELPA显著优于最先进的SBST和基于LLM的技术,分支覆盖率平均分别提高31.39%和22.22%。

Article 59

Title@2025-07-15 (2): Evaluating Generated Commit Messages with Large Language Models

Title: Evaluating Generated Commit Messages with Large Language Models

Auswertung von Generated Commit-Nachrichten mit großen Sprachmodellen

以大语言模式评价生成的提交信件 2507.10906v1

Authors (8): Qunhong Zeng, Yuxia Zhang, Zexiong Ma, Bo Jiang, Ningyuan Sun, Klaas-Jan Stol, Xingyu Mou, Hui Liu

Commit messages are essential in software development as they serve to document and explain code changes. Yet, their quality often falls short in practice, with studies showing significant proportions of empty or inadequate messages. While automated commit message generation has advanced significantly, particularly with Large Language Models (LLMs), the evaluation of generated messages remains challenging. Traditional reference-based automatic metrics like BLEU, ROUGE-L, and METEOR have notable limitations in assessing commit message quality, as they assume a one-to-one mapping between code changes and commit messages, leading researchers to rely on resource-intensive human evaluation. This study investigates the potential of LLMs as automated evaluators for commit message quality. Through systematic experimentation with various prompt strategies and state-of-the-art LLMs, we demonstrate that LLMs combining Chain-of-Thought reasoning with few-shot demonstrations achieve near human-level evaluation proficiency. Our LLM-based evaluator significantly outperforms traditional metrics while maintaining acceptable reproducibility, robustness, and fairness levels despite some inherent variability. This work conducts a comprehensive preliminary study on using LLMs for commit message evaluation, offering a scalable alternative to human assessment while maintaining high-quality evaluation.

在软件开发过程中,信息传递是必不可少的,因为它们有助于记录和解释代码变化。然而,其质量在实践上往往不足,研究表明信息空白或不足的比例很大。自动承诺生成信息已经取得显著进展,特别是在大语言模型(LLMs)方面,但是对生成的信息的评价仍然具有挑战性。传统的基于参考的自动测量标准,如BLEU、ROUGE-L和METEOR,在评估承诺信息质量方面有显著的局限性,因为它们在代码变化和发送信息之间进行一对一的映射,导致研究人员依赖资源密集型的人的评价。本项研究调查了LMS作为自动评价者对信息质量进行自动评价的潜力。通过对各种迅速战略和最先进的LLMs进行系统实验,我们证明LLMs将链式推理与几张的演示相结合,在接近人类层面评价熟练程度方面取得了成功。我们基于LMM的LMs评价员在保持可接受性、稳健和公平水平的同时,大大超越了传统测量标准,尽管存在某些固有的差异。这项工作对使用LMS进行全面的初步研究,对使用LMs进行信息评价,提供了高质量评估的替代方法。
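The limitation of reference-based metrics noted above can be seen with a small example. The sketch uses clipped unigram precision, a simplified stand-in for BLEU-1 without the brevity penalty, rather than any of the full metrics named in the abstract; the commit messages are invented for illustration.

```python
from collections import Counter


def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision of a candidate against one reference."""
    cand_tokens = candidate.lower().split()
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(cand_tokens)
    overlap = sum(min(count, ref_counts[tok]) for tok, count in cand_counts.items())
    return overlap / len(cand_tokens) if cand_tokens else 0.0


reference = "Fix null pointer dereference in parser"

# Names the wrong component but shares almost every word with the reference.
wrong_but_similar = "Fix null pointer dereference in lexer"
# A plausible description of the same change in different words.
valid_paraphrase = "Handle missing AST node to avoid crash when parsing"

print(round(unigram_precision(wrong_but_similar, reference), 3))  # 0.833
print(round(unigram_precision(valid_paraphrase, reference), 3))   # 0.0
```

The misleading message scores far higher than the reasonable paraphrase, which is the one-to-one mapping assumption the abstract criticizes.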

Article 60

Title@2025-07-15 (2): MalCodeAI: Autonomous Vulnerability Detection and Remediation via Language Agnostic Code Reasoning

Title: MalCodeAI: Autonomous Vulnerability Detection and Remediation via Language Agnostic Code Reasoning

MalCodeAI: Autonome Schwachstelle Erkennung und Sanierung über Language Agnostic Code Reasoning

MalCodeAI:通过语言《名人法》进行自主脆弱性检测和补救 2507.10898v1

Authors (3): Jugal Gajjar, Kamalasankari Subramaniakuppusamy, Noha El Kachach

The growing complexity of cyber threats and the limitations of traditional vulnerability detection tools necessitate novel approaches for securing software systems. We introduce MalCodeAI, a language-agnostic, multi-stage AI pipeline for autonomous code security analysis and remediation. MalCodeAI combines code decomposition and semantic reasoning using fine-tuned Qwen2.5-Coder-3B-Instruct models, optimized through Low-Rank Adaptation (LoRA) within the MLX framework, and delivers scalable, accurate results across 14 programming languages. In Phase 1, the model achieved a validation loss as low as 0.397 for functional decomposition and summarization of code segments after 200 iterations, 6 trainable layers, and a learning rate of 2 x 10^(-5). In Phase 2, for vulnerability detection and remediation, it achieved a best validation loss of 0.199 using the same number of iterations and trainable layers but with an increased learning rate of 4 x 10^(-5), effectively identifying security flaws and suggesting actionable fixes. MalCodeAI supports red-hat-style exploit tracing, CVSS-based risk scoring, and zero-shot generalization to detect complex, zero-day vulnerabilities. In a qualitative evaluation involving 15 developers, the system received high scores in usefulness (mean 8.06/10), interpretability (mean 7.40/10), and readability of outputs (mean 7.53/10), confirming its practical value in real-world development workflows. This work marks a significant advancement toward intelligent, explainable, and developer-centric software security solutions.

网络威胁日益复杂,传统漏洞检测工具存在局限,因此需要新的方法来保障软件系统的安全。我们提出 MalCodeAI,这是一个与语言无关的多阶段 AI 流水线,用于自主的代码安全分析与修复。MalCodeAI 将代码分解与语义推理相结合,使用经过微调的 Qwen2.5-Coder-3B-Instruct 模型,并在 MLX 框架内通过低秩适配(LoRA)进行优化,可在 14 种编程语言上提供可扩展且准确的结果。在第一阶段,模型在 200 次迭代、6 个可训练层、学习率为 2 x 10^(-5) 的设置下,对代码片段的功能分解与摘要任务取得了低至 0.397 的验证损失。在第二阶段的漏洞检测与修复任务中,模型在迭代次数和可训练层数相同、学习率提高到 4 x 10^(-5) 的设置下取得了 0.199 的最佳验证损失,能够有效识别安全缺陷并给出可操作的修复建议。MalCodeAI 支持 red-hat 风格的漏洞利用追踪、基于 CVSS 的风险评分,以及用于检测复杂零日漏洞的零样本泛化。在一项有 15 名开发者参与的定性评估中,该系统在有用性(平均 8.06/10)、可解释性(平均 7.40/10)和输出可读性(平均 7.53/10)方面均获得高分,证实了其在实际开发流程中的实用价值。这项工作标志着向智能、可解释、以开发者为中心的软件安全解决方案迈出了重要一步。
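
For readers unfamiliar with LoRA, the standard formulation of the adapted layer can be written as below. The abstract does not report the rank or scaling used for MalCodeAI, so r and \alpha here are generic symbols rather than the authors' settings.

```latex
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x,
\qquad W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

Only A and B are trained while the pretrained weight W_0 stays frozen, which is what keeps the number of trainable parameters small.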

Article 61

Title@2025-07-14 (1): BandFuzz: An ML-powered Collaborative Fuzzing Framework

Title: BandFuzz: An ML-powered Collaborative Fuzzing Framework

BandFuzz: Ein ML-powered Collaborative Fuzzing Framework

BandFuzz: ML 授权的协作模糊框架 2507.10845v1

Authors (6): Wenxuan Shi, Hongwei Li, Jiahao Yu, Xinqian Sun, Wenbo Guo, Xinyu Xing

Collaborative fuzzing has recently emerged as a technique that combines multiple individual fuzzers and dynamically chooses the appropriate combinations suited for different programs. Unlike individual fuzzers, which rely on specific assumptions to maintain their effectiveness, collaborative fuzzing relaxes the assumptions on target programs, providing constant and robust performance across various programs. Ideally, collaborative fuzzing should be a more promising direction toward generic fuzzing solutions, as it mitigates the need for manual cherry-picking of individual fuzzers. However, the effectiveness of existing collaborative fuzzing frameworks is limited by major challenges, such as the need for additional computational resources compared to individual fuzzers and the inefficient allocation of resources among the various fuzzers.

合作模糊最近作为一种技术出现了,它结合了多个个体模糊器,并动态地选择了适合不同程序的适当组合。与依靠具体假设来保持其有效性的个体模糊器不同,合作模糊会放松对目标方案的假设,在各种方案中提供恒定和稳健的业绩。理想的情况是,合作模糊应该成为通效模糊解决方案的更有希望的方向,因为它减少了人工挑选个体模糊器的需要。但是,现有的协作模糊框架的有效性受到重大挑战的限制,例如需要与个体模糊器相比增加计算资源,以及不同模糊器之间资源分配效率低下。

Article 62

Title@2025-07-14 (1): Past, Present and Future: Exploring Adaptive AI in Software Development Bots

Title: Past, Present and Future: Exploring Adaptive AI in Software Development Bots

Vergangenheit, Gegenwart und Zukunft: Erforschen von adaptiver KI in Software-Entwicklungs-Bots

过去、现在和未来:探索软件开发中的适应性AI 2507.10822v1

Authors (2): Omar Elsisi, Glaucia Melo

Conversational agents, such as chatbots and virtual assistants, have become essential in software development, boosting productivity and collaboration and automating various tasks. This paper examines the role of adaptive AI-powered conversational agents in software development, highlighting their ability to offer dynamic, context-aware assistance to developers. Unlike traditional rule-based systems, adaptive AI agents use machine learning and natural language processing to learn from interactions and improve over time, providing more personalized and responsive help. We look at how these tools have evolved from simple query-based systems to advanced AI-driven solutions like GitHub Copilot and Microsoft Teams bots. We also explore the challenges of integrating adaptive AI into software development processes. The study aims to assess the benefits and limitations of these systems, address concerns like data privacy and ethical issues, and offer insights into their future use in the field. Ultimately, adaptive AI chatbots have great potential to revolutionize software development by delivering real-time, customized support and enhancing the efficiency of development cycles.

诸如聊天机和虚拟助理等交流代理机构在软件开发、提高生产力、合作和使各种任务自动化方面已经变得至关重要。本文件审视了适应性AI动力对话代理机构在软件开发中的作用,强调它们向开发者提供动态的、符合背景的援助的能力。与传统的基于规则的系统不同,适应性AI代理机构利用机器学习和自然语言处理从互动中学习和改进,提供更个性化和反应迅速的帮助。我们审视这些工具是如何从简单的查询系统演变为先进的AI驱动解决方案的,如GitHub Copil和微软团队机器人。我们还探讨了将适应性AI机器人纳入软件开发流程的挑战。研究的目的是评估这些系统的效益和局限性,解决数据隐私和伦理问题等关切问题,并就其今后在外地的使用情况提供见解。归根结底,适应性AI聊天机具有巨大的潜力,通过提供实时、定制的支持和提高开发周期的效率,使软件开发革命化。

Article 63

Title@2025-07-14 (1): How Robust are LLM-Generated Library Imports? An Empirical Study using Stack Overflow

Title: How Robust are LLM-Generated Library Imports? An Empirical Study using Stack Overflow

Wie robust sind LLM-generierte Bibliotheksimporte? Eine empirische Studie mit Stack Overflow

LLM 生成的库导入有多稳健?基于 Stack Overflow 的实证研究 2507.10818v1

Authors (3): Jasmine Latendresse, SayedHassan Khatoonabadi, Emad Shihab

Software libraries are central to the functionality, security, and maintainability of modern code. As developers increasingly turn to Large Language Models (LLMs) to assist with programming tasks, understanding how these models recommend libraries is essential. In this paper, we conduct an empirical study of six state-of-the-art LLMs, both proprietary and open-source, by prompting them to solve real-world Python problems sourced from Stack Overflow. We analyze the types of libraries they import, the characteristics of those libraries, and the extent to which the recommendations are usable out of the box. Our results show that LLMs predominantly favour third-party libraries over standard ones, and often recommend mature, popular, and permissively licensed dependencies. However, we also identify gaps in usability: 4.6% of the libraries could not be resolved automatically due to structural mismatches between import names and installable packages, and only two models (out of six) provided installation guidance. While the generated code is technically valid, the lack of contextual support places the burden of manually resolving dependencies on the user. Our findings offer actionable insights for both developers and researchers, and highlight opportunities to improve the reliability and usability of LLM-generated code in the context of software dependencies.

软件库是现代代码功能、安全和可维护的核心。随着开发者越来越多地转向大语言模型(LLMS)来协助编程任务,了解这些模型如何建议图书馆至关重要。在本文件中,我们对6个最先进的专有和开放源的LLMS进行了实证研究,通过促使它们解决来自Stack Overflow 的真实世界 Python问题,我们分析了它们进口的图书馆类型、这些图书馆的特点以及建议从盒子中被利用的程度。我们的结果显示,LLMS主要偏爱第三方图书馆,而不是标准图书馆,常常建议成熟、受欢迎和许可的附属图书馆。然而,我们还查明了可用性的差距:4.6%的图书馆由于进口名称和可安装的包之间的结构不匹配而不能自动解决,只有两个模型(其中六个模型)提供了安装指导。虽然生成的代码在技术上是有效的,但缺乏背景支持使用户承担了人工解决依赖性的负担。我们的调查结果为开发商和研究人员提供了可操作的洞察力的洞察力,并强调软件的可靠性。
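
The mismatch between import names and installable packages mentioned in the abstract can be illustrated with a small sketch. It extracts top-level imports with the standard `ast` module and maps them to a distribution name. The mapping table and the helper names (`IMPORT_TO_PACKAGE`, `top_level_imports`, `install_candidates`) are illustrative assumptions, not the study's tooling or data.

```python
import ast

# Hand-picked, well-known cases where the import name differs from the PyPI
# distribution name. This table is an illustrative assumption, not study data.
IMPORT_TO_PACKAGE = {
    "cv2": "opencv-python",
    "sklearn": "scikit-learn",
    "yaml": "PyYAML",
    "PIL": "Pillow",
    "bs4": "beautifulsoup4",
}


def top_level_imports(source: str) -> set[str]:
    """Return the top-level module names imported by a Python snippet."""
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports such as "from . import x"
            names.add(node.module.split(".")[0])
    return names


def install_candidates(source: str) -> dict[str, str]:
    """Map each imported name to the package a user would likely install."""
    return {name: IMPORT_TO_PACKAGE.get(name, name) for name in top_level_imports(source)}


snippet = (
    "import cv2\n"
    "from sklearn.model_selection import train_test_split\n"
    "import requests\n"
)
print(install_candidates(snippet))
```

Names missing from the table fall back to themselves, which is also what happens for standard-library modules; a fuller version would filter those out before suggesting anything to install.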

Article 64

Title@2025-07-14 (1): Supervised Semantic Similarity-based Conflict Detection Algorithm: S3CDA

Title: Supervised Semantic Similarity-based Conflict Detection Algorithm: S3CDA

Überwachter semantischer Ähnlichkeits-basierter Konflikterkennungs-Algorithmus: S3CDA

受监督的语义相似性基于冲突探测冲突探测等级: S3CDA 2206.13690v3

Authors (4): Garima Malik, Mucahit Cevik, Ayse Basar, Devang Parikh

Identifying conflicting requirements is a key challenge in software requirements engineering, often overlooked in automated solutions. Most existing approaches rely on handcrafted rules or struggle to generalize across different domains. In this paper, we introduce S3CDA, a two-phase algorithm designed to automatically detect conflicts in software requirements. Our method first identifies potentially conflicting requirement pairs using semantic similarity, and then validates them by analyzing overlapping domain-specific entities. We evaluate S3CDA on five diverse real-world datasets and compare it against popular large language models like GPT-4o, Llama-3, Sonnet-3.5 and Gemini-1.5. While LLMs show promise, especially on general datasets, S3CDA consistently performs better in domain-specific settings. Our findings suggest that combining Natural Language Processing (NLP) techniques with domain-aware insights offers a practical and effective alternative for conflict detection in requirements.

在软件需求工程中,确定相互冲突的要求是一项关键的挑战,在自动化解决方案中往往被忽视。大多数现有方法依靠手工制定的规则或努力在不同的领域推广。在本文中,我们引入了S3CDA,这是一个两阶段的算法,旨在自动检测软件需求中的冲突。我们的方法首先使用语义相似性来识别可能相互冲突的要求对,然后通过分析重叠的域别实体来验证这些要求。我们根据五个不同的现实世界数据集对S3CDA进行了评估,并将其与GPT-4o、Llama-3、Sonnet-3.5和Gemini-1.5等流行的大型语言模型进行比较。LMS展示了前景,特别是在一般数据集方面,S3CDA在特定域环境中始终表现更好。我们的研究结果表明,将自然语言处理技术与域认知洞察技术相结合,为在需求中发现冲突提供了实用和有效的替代方法。

Article 65

Title@2025-07-14 (1): Towards a Closer Collaboration Between Practice and Research in Agile Software Development Workshop: A Summary and Research Agenda

Title: Towards a Closer Collaboration Between Practice and Research in Agile Software Development Workshop: A Summary and Research Agenda

Auf dem Weg zu einer engeren Zusammenarbeit zwischen Praxis und Forschung in der Agile Software Development Workshop: Eine Zusammenfassung und Forschungsagenda

迈向敏捷软件开发中实践与研究的更紧密协作研讨会:总结与研究议程 2507.10785v1

Authors (5): Michael Neumann, Eva-Maria Schön, Mali Senapathi, Maria Rauschenberger, Tiago Silva da Silva

Agile software development principles and values have been widely adopted across various industries, influencing products and services globally. Despite its increasing popularity, a significant gap remains between research and practical implementation. This paper presents the findings of the first international workshop designed to foster collaboration between research and practice in agile software development. We discuss the main themes and factors identified by the workshop participants that contribute to this gap, strategies to bridge it, and the challenges that require further research attention.

敏捷软件开发的原则和价值观已被各行各业广泛采用,影响着全球的产品和服务。尽管其日益普及,研究与实际落地之间仍存在显著差距。本文介绍了首届旨在促进敏捷软件开发研究与实践协作的国际研讨会的成果。我们讨论了研讨会参与者指出的造成这一差距的主要主题和因素、弥合差距的策略,以及需要进一步研究关注的挑战。

Article 66

Title@2025-07-14 (1): GenAI-Enabled Backlog Grooming in Agile Software Projects: An Empirical Study

Title: GenAI-Enabled Backlog Grooming in Agile Software Projects: An Empirical Study

GenAI-Enabled Backlog Grooming in agilen Software-Projekten: Eine empirische Studie

敏捷软件项目中由 GenAI 支持的待办事项梳理:一项实证研究 2507.10753v1

Authors (3): Kasper Lien Oftebro, Anh Nguyen-Duc, Kai-Kristian Kemell

Effective backlog management is critical for ensuring that development teams remain aligned with evolving requirements and stakeholder expectations. However, as product backlogs consistently grow in scale and complexity, they tend to become cluttered with redundant, outdated, or poorly defined tasks, complicating prioritization and decision-making processes. This study investigates whether a generative-AI (GenAI) assistant can automate backlog grooming in Agile software projects without sacrificing accuracy or transparency. Through Design Science cycles, we developed a Jira plug-in that embeds backlog issues into a vector database, detects duplicates via cosine similarity, and leverages the GPT-4o model to propose merges, deletions, or new issues. We found that AI-assisted backlog grooming achieved 100 percent precision while reducing the time-to-completion by 45 percent. The findings demonstrated the tool’s potential to streamline backlog refinement processes while improving user experiences.

有效的积压管理对于确保发展团队与不断演变的要求和利益攸关方的期望保持一致至关重要,然而,由于产品积压的规模和复杂性不断增大,产品积压往往被冗余、过时或定义不清的任务所包罗,使优先次序和决策程序复杂化。本研究报告调查了基因化的AI(GenAI)助理能否在不牺牲准确性或透明度的情况下将Agile软件项目的积压装配自动化,同时不牺牲准确性或透明度。通过设计科学周期,我们开发了一个Jira插件,将积压问题嵌入矢量数据库,通过Comsine相似性探测重复,并利用GPT-4o模式提出合并、删除或新问题。我们发现,由AI协助的积压处理达到100%的精确度,同时将完成时间减少45%。研究结果表明,该工具有可能在改进用户经验的同时简化积压的改进程序。
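
The duplicate-detection step named in the abstract (embedding issues and comparing them by cosine similarity) can be sketched as follows. This is a minimal illustration under stated assumptions: a bag-of-words count vector stands in for a real embedding model, and the issue keys, the `0.8` threshold, and all function names are hypothetical rather than taken from the plug-in.

```python
import math
from collections import Counter
from itertools import combinations


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_duplicates(issues: dict[str, str], threshold: float = 0.8) -> list[tuple[str, str]]:
    """Return pairs of issue keys whose summaries are at least `threshold` similar."""
    vectors = {key: embed(summary) for key, summary in issues.items()}
    return [
        (k1, k2)
        for k1, k2 in combinations(issues, 2)
        if cosine_similarity(vectors[k1], vectors[k2]) >= threshold
    ]


backlog = {
    "PROJ-1": "add password reset link to login page",
    "PROJ-2": "add a password reset link to the login page",
    "PROJ-3": "upgrade database driver to latest version",
}
print(find_duplicates(backlog))  # [('PROJ-1', 'PROJ-2')]
```

The pairwise loop is quadratic in the number of issues; a vector database, as the abstract describes, would typically replace it with a nearest-neighbour query.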

Article 67

Title@2025-07-14 (1): Toward Realistic Evaluations of Just-In-Time Vulnerability Prediction

Title: Toward Realistic Evaluations of Just-In-Time Vulnerability Prediction

Hin zu realistischen Bewertungen von Just-in-Time Sicherheitsvorhersage

迈向对即时漏洞预测的现实评估 2507.10729v1

Authors (5): Duong Nguyen, Thanh Le-Cong, Triet Huynh Minh Le, M. Ali Babar, Quyet-Thang Huynh

Modern software systems are increasingly complex, presenting significant challenges in quality assurance. Just-in-time vulnerability prediction (JIT-VP) is a proactive approach to identifying vulnerable commits and providing early warnings about potential security risks. However, we observe that current JIT-VP evaluations rely on an idealized setting, where the evaluation datasets are artificially balanced, consisting exclusively of vulnerability-introducing and vulnerability-fixing commits. To address this limitation, this study assesses the effectiveness of JIT-VP techniques under a more realistic setting that includes both vulnerability-related and vulnerability-neutral commits. To enable a reliable evaluation, we introduce a large-scale public dataset comprising over one million commits from FFmpeg and the Linux kernel. Our empirical analysis of eight state-of-the-art JIT-VP techniques reveals a significant decline in predictive performance when applied to real-world conditions; for example, the average PR-AUC on Linux drops by 98%, from 0.805 to 0.016. This discrepancy is mainly attributed to the severe class imbalance in real-world datasets, where vulnerability-introducing commits constitute only a small fraction of all commits. To mitigate this issue, we explore the effectiveness of widely adopted techniques for handling dataset imbalance, including customized loss functions, oversampling, and undersampling. Surprisingly, our experimental results indicate that these techniques are ineffective in addressing the imbalance problem in JIT-VP. These findings underscore the importance of realistic evaluations of JIT-VP and the need for domain-specific techniques to address data imbalance in such scenarios.

现代软件系统日益复杂,给质量保证带来了重大挑战。即时漏洞预测(JIT-VP)是一种主动的方法,用于识别存在漏洞的提交,并对潜在的安全风险提供早期预警。然而,我们注意到,当前的 JIT-VP 评估依赖于一种理想化的设置:评估数据集被人为地平衡,仅由引入漏洞的提交和修复漏洞的提交组成。为解决这一局限,本研究在更现实的设置下评估 JIT-VP 技术的有效性,该设置同时包含与漏洞相关的提交和与漏洞无关的提交。为了实现可靠的评估,我们引入了一个大规模公开数据集,包含来自 FFmpeg 和 Linux 内核的一百多万次提交。我们对八种最先进的 JIT-VP 技术的实证分析表明,这些技术在真实条件下的预测性能显著下降;例如,Linux 上的平均 PR-AUC 从 0.805 降至 0.016,下降了 98%。这一差异主要归因于真实数据集中严重的类别不平衡,其中引入漏洞的提交只占全部提交的一小部分。为缓解这一问题,我们考察了广泛采用的数据不平衡处理技术的效果,包括定制损失函数、过采样和欠采样。令人意外的是,实验结果表明这些技术无法有效解决 JIT-VP 中的不平衡问题。这些发现凸显了对 JIT-VP 进行现实评估的重要性,以及针对此类场景开发领域专用的数据不平衡处理技术的必要性。
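
The reported 98% figure follows directly from the two PR-AUC values, and the sensitivity of PR-AUC to class balance can be seen with a toy ranking. The sketch below is a worked check under assumptions: `relative_drop` and `average_precision` are hypothetical helper names, average precision is used as the usual estimate of PR-AUC, and the rankings are synthetic rather than data from the study.

```python
def relative_drop(before: float, after: float) -> float:
    """Fractional decrease from `before` to `after`."""
    return (before - after) / before


def average_precision(ranked_labels: list[int]) -> float:
    """Average precision for labels sorted by descending predicted score."""
    hits, total = 0, 0.0
    for rank, label in enumerate(ranked_labels, start=1):
        if label:
            hits += 1
            total += hits / rank  # precision at each positive
    return total / hits if hits else 0.0


# Values quoted in the abstract for Linux.
print(f"{relative_drop(0.805, 0.016):.1%}")  # 98.0%

# An uninformative ranking scores about the positive-class prevalence:
balanced = [0, 1, 0, 1]              # 50% positives
imbalanced = ([0] * 9 + [1]) * 2     # 10% positives
print(f"{average_precision(balanced):.3f}")    # 0.500
print(f"{average_precision(imbalanced):.3f}")  # 0.100
```

Because the chance-level score tracks prevalence, PR-AUC measured on an artificially balanced dataset is not directly comparable with PR-AUC measured on a realistic, heavily imbalanced one.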

Article 68

Title@2025-07-14 (1): Learning to Focus: Context Extraction for Efficient Code Vulnerability Detection with Language Models

Title: Learning to Focus: Context Extraction for Efficient Code Vulnerability Detection with Language Models

Fokussieren lernen: Kontextextraktion für effiziente Code-Anfälligkeitserkennung mit Sprachmodellen

学习聚焦:以语言模式有效识别《守则》脆弱性 2505.17460v3

Authors (7): Xinran Zheng, Xingzhi Qian, Huichi Zhou, Shuo Yang, Yiling He, Suman Jana, Lorenzo Cavallaro

Language models (LMs) show promise for vulnerability detection but struggle with long, real-world code due to sparse and uncertain vulnerability locations. These issues, exacerbated by token limits, often cause models to miss vulnerability-related signals, thereby impairing effective learning. A key intuition is to enhance LMs with concise, information-rich context. Commit-based annotations offer precise, CWE-agnostic supervision, but are unavailable during inference, as they depend on historical code changes. Moreover, their extreme sparsity, often covering only a few lines, makes it difficult for LMs to process directly. In this paper, we propose FocusVul, a model-agnostic framework that improves LM-based vulnerability detection by learning to select sensitive context. FocusVul learns commit-based annotation patterns through hierarchical semantic modeling and generalizes them to identify line-level vulnerability-relevant regions during inference. It then extracts LM-oriented context via both dependency and execution flows surrounding selected regions, yielding semantically rich inputs for effective vulnerability detection. Experiments on real-world benchmarks show that FocusVul consistently outperforms heuristic-based and full-function fine-tuning approaches, improving classification performance by 164.04% and reducing FLOPs by 19.12% on average.

语言模型(LM)在漏洞检测方面展现出潜力,但由于漏洞位置稀疏且不确定,难以处理较长的真实代码。这些问题因 token 数量限制而加剧,常常导致模型遗漏与漏洞相关的信号,从而妨碍有效学习。一个关键的直觉是用简洁、信息丰富的上下文来增强 LM。基于提交的标注可以提供精确且与 CWE 无关的监督,但它们依赖历史代码变更,因此在推理阶段不可用。此外,这类标注极为稀疏,往往只覆盖几行代码,LM 难以直接处理。本文提出 FocusVul,这是一个与模型无关的框架,通过学习选择敏感上下文来改进基于 LM 的漏洞检测。FocusVul 通过层次化语义建模学习基于提交的标注模式,并将其泛化,以便在推理阶段识别行级的漏洞相关区域。随后,它沿所选区域周围的依赖流和执行流提取面向 LM 的上下文,为有效的漏洞检测生成语义丰富的输入。在真实基准上的实验表明,FocusVul 始终优于基于启发式的方法和全函数微调方法,分类性能平均提升 164.04%,FLOPs 平均减少 19.12%。

Article 69

Title@2025-07-14 (1): Speculative Automated Refactoring of Imperative Deep Learning Programs to Graph Execution

Title: Speculative Automated Refactoring of Imperative Deep Learning Programs to Graph Execution

Spekulative Automatisierte Refaktorisierung imperativer Deep Learning-Programme zur Graphen-Execution

将命令式深度学习程序推测性地自动重构为图执行 2504.05424v3

Authors (5): Raffi Khatchadourian, Tatiana Castro Vélez, Mehdi Bagherzadeh, Nan Jia, Anita Raja

Efficiency is essential to support ever-growing datasets, especially for Deep Learning (DL) systems. DL frameworks have traditionally embraced deferred execution-style DL code – supporting symbolic, graph-based Deep Neural Network (DNN) computation. While scalable, such development is error-prone, non-intuitive, and difficult to debug. Consequently, more natural, imperative DL frameworks encouraging eager execution have emerged but at the expense of run-time performance. Though hybrid approaches aim for the “best of both worlds,” using them effectively requires subtle considerations. Our key insight is that, while DL programs typically execute sequentially, hybridizing imperative DL code resembles parallelizing sequential code in traditional systems. Inspired by this, we present an automated refactoring approach that assists developers in determining which otherwise eagerly-executed imperative DL functions could be effectively and efficiently executed as graphs. The approach features novel static imperative tensor and side-effect analyses for Python. Due to its inherent dynamism, analyzing Python may be unsound; however, the conservative approach leverages a speculative (keyword-based) analysis for resolving difficult cases that informs developers of any assumptions made. The approach is: (i) implemented as a plug-in to the PyDev Eclipse IDE that integrates the WALA Ariadne analysis framework and (ii) evaluated on nineteen DL projects consisting of 132 KLOC. The results show that 326 of 766 candidate functions (42.56%) were refactorable, and an average relative speedup of 2.16 on performance tests was observed with negligible differences in model accuracy. The results indicate that the approach is useful in optimizing imperative DL code to its full potential.

效率对于支持不断增长的数据集至关重要,对深度学习(DL)系统尤其如此。DL 框架传统上采用延迟执行风格的 DL 代码,支持符号化、基于图的深度神经网络(DNN)计算。这种开发方式虽然可扩展,但容易出错、不直观且难以调试。因此,出现了更自然的、鼓励即时(eager)执行的命令式 DL 框架,但代价是运行时性能下降。尽管混合方法力求“两全其美”,但要有效使用它们需要细致的考量。我们的关键洞察是:虽然 DL 程序通常顺序执行,但对命令式 DL 代码进行混合化类似于在传统系统中将顺序代码并行化。受此启发,我们提出一种自动重构方法,帮助开发者判断哪些原本即时执行的命令式 DL 函数可以有效且高效地作为图来执行。该方法包含针对 Python 的新颖的静态命令式张量分析和副作用分析。由于 Python 固有的动态性,对其进行分析可能不可靠(unsound);不过,这一保守的方法利用推测性的(基于关键字的)分析来处理疑难情形,并将所做的任何假设告知开发者。该方法:(i) 实现为 PyDev Eclipse IDE 的插件,并集成了 WALA Ariadne 分析框架;(ii) 在 19 个 DL 项目(共 132 KLOC)上进行了评估。结果显示,766 个候选函数中有 326 个(42.56%)可被重构,性能测试中观察到平均 2.16 的相对加速,而模型精度的差异可以忽略不计。结果表明,该方法有助于充分发挥命令式 DL 代码的潜力。

Article 70

Title@2025-07-14 (1): CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks

Title: CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks

CodeJudgeBench: Benchmarking von LLM-as-a-Judge für Codierungsaufgaben

CodeJudgeBench:面向编码任务的 LLM-as-a-Judge 基准评测 2507.10535v1

Authors (5): Hongchao Jiang, Yiming Chen, Yushi Cao, Hung-yi Lee, Robby T. Tan

Large Language Models (LLMs) have significantly advanced the state-of-the-art in various coding tasks. Beyond directly answering user queries, LLMs can also serve as judges, assessing and comparing the quality of responses generated by other models. Such an evaluation capability is crucial both for benchmarking different LLMs and for improving response quality through response ranking. However, despite the growing adoption of the LLM-as-a-Judge paradigm, its effectiveness in coding scenarios remains underexplored due to the absence of dedicated benchmarks. To address this gap, we introduce CodeJudgeBench, a benchmark explicitly designed to evaluate the performance of LLM-as-a-Judge models across three critical coding tasks: code generation, code repair, and unit test generation. Through comprehensive benchmarking of 26 LLM-as-a-Judge models, we find that recent thinking models significantly outperform non-thinking models on our carefully designed code judging tasks. Notably, even relatively small thinking models, such as Qwen3-8B, can outperform specially trained LLM-as-a-Judge models up to 70B in size. Nevertheless, all models still exhibit significant randomness in their judgment of coding tasks. For pairwise judging tasks, simply changing the order in which responses are presented can substantially impact accuracy. In addition, when judging code and unit tests written by different LLMs, LLM-as-a-Judge models also show variance in performance. This sensitivity raises concerns about the reliability and consistency of LLM-as-a-Judge in coding scenarios. Lastly, we study optimal prompting strategies for LLM-as-a-Judge. We find that using pair-wise comparison outperforms scalar point-wise judging. Furthermore, retaining comments and reasoning in the full, unprocessed LLM response leads to improved judge performance.

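大型语言模型(LLM)显著推进了各类编码任务的最新水平。除了直接回答用户的问题,LLM 还可以充当评判者,评估并比较其他模型生成的回答的质量。这种评估能力对于评测不同的 LLM 以及通过回答排序提高回答质量都至关重要。然而,尽管 LLM-as-a-Judge 范式被越来越多地采用,由于缺乏专门的基准,其在编码场景中的有效性仍未得到充分研究。为填补这一空白,我们提出 CodeJudgeBench,这是一个专门用于评估 LLM-as-a-Judge 模型在三类关键编码任务上表现的基准:代码生成、代码修复和单元测试生成。通过对 26 个 LLM-as-a-Judge 模型进行全面评测,我们发现,在我们精心设计的代码评判任务上,近期的思考型模型明显优于非思考型模型。值得注意的是,即使是 Qwen3-8B 这样相对较小的思考型模型,也能胜过规模高达 70B、经过专门训练的 LLM-as-a-Judge 模型。尽管如此,所有模型在评判编码任务时仍表现出明显的随机性。对于成对评判任务,仅仅改变回答的呈现顺序就会显著影响准确率。此外,在评判由不同 LLM 编写的代码和单元测试时,LLM-as-a-Judge 模型的表现也存在差异。这种敏感性引发了人们对 LLM-as-a-Judge 在编码场景中可靠性和一致性的担忧。最后,我们研究了 LLM-as-a-Judge 的最优提示策略。我们发现,成对比较优于标量式的逐点评判。此外,保留未经处理的完整 LLM 回答中的注释和推理,可以提升评判表现。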
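
The order sensitivity reported for pairwise judging can be probed with a simple swap test: ask the judge twice with the two responses in opposite order and check whether it prefers the same underlying response. The sketch below is a hypothetical harness, not the benchmark's code; the `Judge` signature and both toy judges are assumptions, and a real setup would call an LLM in place of them.

```python
from typing import Callable

# Hypothetical judge interface: (task, first_response, second_response) -> "first" or "second"
Judge = Callable[[str, str, str], str]


def order_consistent(judge: Judge, task: str, response_a: str, response_b: str) -> bool:
    """True if the judge picks the same underlying response in both presentation orders."""
    verdict_ab = judge(task, response_a, response_b)
    verdict_ba = judge(task, response_b, response_a)
    winner_ab = response_a if verdict_ab == "first" else response_b
    winner_ba = response_b if verdict_ba == "first" else response_a
    return winner_ab == winner_ba


def position_biased_judge(task: str, first: str, second: str) -> str:
    """Toy judge that always prefers whatever is shown first."""
    return "first"


def length_judge(task: str, first: str, second: str) -> str:
    """Toy judge that prefers the longer response regardless of position."""
    return "first" if len(first) >= len(second) else "second"


task = "Write a function that reverses a string."
a = "def rev(s): return s[::-1]"
b = "def rev(s): return ''.join(reversed(s))"
print(order_consistent(position_biased_judge, task, a, b))  # False
print(order_consistent(length_judge, task, a, b))           # True
```

Averaging this check over many response pairs gives a consistency rate that can be reported alongside accuracy.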

Article 71

Title@2025-07-14 (1): Investigating Adversarial Attacks in Software Analytics via Machine Learning Explainability

Title: Investigating Adversarial Attacks in Software Analytics via Machine Learning Explainability

Untersuchung von Adversarial Attacks in Software Analytics durch maschinelles Lernen Erklärbarkeit

调查通过机器学习解释分析软件分析中的反攻击 2408.04124v2

Authors (3): MD Abdul Awal, Mrigank Rochan, Chanchal K. Roy

With the recent advancements in machine learning (ML), numerous ML-based approaches have been extensively applied in software analytics tasks to streamline software development and maintenance processes. Nevertheless, studies indicate that despite their potential usefulness, ML models are vulnerable to adversarial attacks, which may result in significant monetary losses in these processes. As a result, the ML models’ robustness against adversarial attacks must be assessed before they are deployed in software analytics tasks. Despite several techniques being available for adversarial attacks in software analytics tasks, exploring adversarial attacks using ML explainability is largely unexplored. Therefore, this study aims to investigate the relationship between ML explainability and adversarial attacks to measure the robustness of ML models in software analytics tasks. In addition, unlike most existing attacks that directly perturb input-space, our attack approach focuses on perturbing feature-space. Our extensive experiments, involving six datasets, three ML explainability techniques, and seven ML models, demonstrate that ML explainability can be used to conduct successful adversarial attacks on ML models in software analytics tasks. This is achieved by modifying only the top 1-3 important features identified by ML explainability techniques. Consequently, the ML models under attack fail to accurately predict up to 86.6% of instances that were correctly predicted before adversarial attacks, indicating the models’ low robustness against such attacks. Finally, our proposed technique demonstrates promising results compared to four state-of-the-art adversarial attack techniques targeting tabular data.

随着机器学习(ML)的最近进展,许多以ML为基础的方法被广泛应用于软件分析任务中的软件分析任务,以简化软件开发和维护程序。然而,研究表明,尽管ML模型有潜在用处,但它们很容易受到对抗性攻击,在这些过程中可能造成巨大的货币损失。因此,在应用软件分析任务之前,必须评估ML模型对对抗性攻击的稳健性。尽管在软件分析任务中的对抗性攻击中有一些可用的技术,但利用ML解释性解释性研究对抗性攻击的情况基本上没有被探讨。因此,本研究旨在调查ML技术解释性和对抗性攻击之间的关系,以衡量软件分析任务中ML模型的稳健性。此外,与直接干扰输入空间的大多数现有攻击不同,我们的攻击方法侧重于扰动性地空间。我们的广泛实验涉及六个数据集、三个ML解释性低可解释性模型和七个ML模型,表明ML可以用来成功地对ML模型进行对抗性攻击的成功性攻击,相比之下,在软件分析性攻击之前,ML目标性攻击的4种重要例子表明,因此,ML预测性模型无法准确地解释。
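
The idea of perturbing only the few most important features can be sketched on a toy model. Everything here is an illustrative assumption: a hand-written linear classifier replaces the paper's ML models, absolute weight magnitude stands in for an explainability technique's importance ranking, and the function names, weights, and step size are hypothetical.

```python
def predict(weights: list[float], bias: float, x: list[float]) -> int:
    """Binary prediction of a linear classifier."""
    score = sum(w * v for w, v in zip(weights, x)) + bias
    return int(score > 0)


def top_k_features(weights: list[float], k: int) -> list[int]:
    """Indices of the k features with the largest absolute weight
    (a stand-in for an explainability-based importance ranking)."""
    return sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)[:k]


def perturb_top_k(weights: list[float], bias: float, x: list[float],
                  k: int = 2, step: float = 1.0) -> list[float]:
    """Shift only the k most important features against the current prediction."""
    original = predict(weights, bias, x)
    direction = -1.0 if original == 1 else 1.0  # push the score toward the other class
    adversarial = list(x)
    for i in top_k_features(weights, k):
        sign = 1.0 if weights[i] >= 0 else -1.0
        adversarial[i] += direction * sign * step
    return adversarial


weights, bias = [2.0, -1.5, 0.1, 0.05], -0.5
x = [1.0, 0.2, 3.0, 1.0]
x_adv = perturb_top_k(weights, bias, x, k=2, step=1.0)
print(predict(weights, bias, x), predict(weights, bias, x_adv))  # 1 0
```

Only two of the four features change, yet the prediction flips, which mirrors the abstract's point that modifying the top 1-3 features can be enough to mislead a model.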

Article 72

Title@2025-07-14 (1): A Code Comprehension Benchmark for Large Language Models for Code

Title: A Code Comprehension Benchmark for Large Language Models for Code

Ein Code-Verständnis-Benchmark für große Sprachmodelle für Code

《守则》大语言模式的《守则》理解基准 2507.10641v1

Authors (5): Jayant Havare, Saurav Chaudhary, Ganesh Ramakrishnan, Kaushik Maharajan, Srikanth Tamilselvam

Large Language Models have shown impressive capabilities in coding tasks like code generation and code completion, as they have been trained on a large amount of code data. Also, since one of the core pretraining objectives is Next Token Prediction, these models tend to learn surface-level syntactic patterns in code. However, this does not guarantee code comprehension ability, i.e., the ability to capture the semantics of the code. In our opinion, this is the reason why these models often underperform on tasks that require deeper semantic understanding, such as code debugging and code optimization. To address this, we propose fine-tuning these models specifically for code comprehension tasks using large-scale datasets, enabling them to develop a more robust understanding of code semantics. We evaluate three code models of varying sizes on a suite of code comprehension tasks designed to assess semantic understanding beyond surface-level syntactic pattern matching. In particular, we analyze performance on the Subjectivity Grading Task and observe that model performance improves after fine-tuning on relevant downstream tasks. The most significant improvement is seen in the QWQ-32B model, where accuracy increases from 70% to 83.47%. A similar or explainable trend is observed across other models, clearly indicating an enhancement in code comprehension ability. Among the models studied, the DPO-fine-tuned Codestral-22B achieves the highest micro-accuracy of 87.66% on the Subjectivity Grading Task.

大型语言模型经过大量代码数据的训练,在代码生成和代码补全等编码任务中表现出令人印象深刻的能力。此外,由于核心预训练目标之一是下一个 token 预测,这些模型往往学到的是代码中表层的语法模式。然而,这并不能保证代码理解能力,即捕捉代码语义的能力。我们认为,这正是这些模型在代码调试和代码优化等需要更深层语义理解的任务上经常表现不佳的原因。为此,我们提议使用大规模数据集专门针对代码理解任务对这些模型进行微调,使其对代码语义形成更稳健的理解。我们在一组代码理解任务上评估了三个不同规模的代码模型,这些任务旨在评估超越表层语法模式匹配的语义理解。具体而言,我们分析了模型在 Subjectivity Grading Task 上的表现,并观察到在相关下游任务上微调后模型性能有所提升。提升最显著的是 QWQ-32B 模型,其准确率从 70% 提高到 83.47%。其他模型也呈现出类似或可解释的趋势,清楚地表明代码理解能力得到了增强。在所研究的模型中,经 DPO 微调的 Codestral-22B 在 Subjectivity Grading Task 上取得了最高的微平均准确率 87.66%。

Article 73

Title@2025-07-14 (1): Towards a Theory on Process Automation Effects

Title: Towards a Theory on Process Automation Effects

Auf dem Weg zu einer Theorie über Prozessautomatisierungseffekte

关于进程自动化效果的理论 2506.10992v2

Authors (4): Hoang Vu, Jennifer Haase, Henrik Leopold, Jan Mendling

Process automation is a crucial strategy for improving business processes, but little attention has been paid to the effects that automation has once it is operational. This paper addresses this research problem by reviewing the literature on human-automation interaction. Although many of the studies in this field have been conducted in different domains, they provide a foundation for developing propositions about process automation effects. Our analysis focuses on how humans perceive automation technology when working within a process, allowing us to propose an effective engagement model between technology, process participants, process managers, and software developers. This paper offers insights and recommendations that can help organizations optimize their use of process automation. We further derive novel research questions for a discourse within the process automation community.

流程自动化是改进业务流程的关键战略,但很少注意自动化一旦投入运行后产生的影响。本文件通过审查关于人类航空互动的文献来探讨这一研究问题。虽然这一领域的许多研究是在不同的领域进行的,但它们为提出有关流程自动化效应的建议提供了基础。我们的分析侧重于人类在流程中工作时如何看待自动化技术,使我们能够提出技术、流程参与者、流程管理员和软件开发者之间的有效参与模式。本文件提出了有助于各组织优化使用流程自动化的见解和建议。我们进一步为流程自动化界内部的讨论提出了新的研究问题。

Article 74

Title: SENSOR: An ML-Enhanced Online Annotation Tool to Uncover Privacy Concerns from User Reviews in Social-Media Applications

SENSOR: Ein ML-erweitertes Online-Annotations-Tool, um Datenschutz-Bedenken aus User Reviews in Social-Media-Anwendungen zu enthüllen

SENSOR:一个ML-加强在线说明工具,以从社会-媒体应用中的用户审查中发现隐私问题。 2507.10640v1

Authors (5): Labiba Farah, Mohammad Ridwan Kabir, Shohel Ahmed, MD Mohaymen Ul Anam, Md. Sakibul Islam

The widespread use of social media applications has raised significant privacy concerns, often highlighted in user reviews. These reviews also provide developers with valuable insights into improving apps by addressing issues and introducing better features. However, the sheer volume and nuanced nature of reviews make manual identification and prioritization of privacy-related concerns challenging for developers. Previous studies have developed software utilities to automatically classify user reviews as privacy-relevant, privacy-irrelevant, bug reports, feature requests, etc., using machine learning. Notably, there is a lack of focus on classifying reviews specifically as privacy-related feature requests, privacy-related bug reports, or privacy-irrelevant. This paper introduces SENtinel SORt (SENSOR), an automated online annotation tool designed to help developers annotate and classify user reviews into these categories. For automating the annotation of such reviews, this paper introduces the annotation model, GRACE (GRU-based Attention with CBOW Embedding), using Gated Recurrent Units (GRU) with Continuous Bag of Words (CBOW) and Attention mechanism. Approximately 16000 user reviews from seven popular social media apps on Google Play Store, including Instagram, Facebook, WhatsApp, Snapchat, X (formerly Twitter), Facebook Lite, and Line were analyzed. Two annotators manually labelled the reviews, achieving a Cohen’s Kappa value of 0.87, ensuring a labeled dataset with high inter-rater agreement for training machine learning models. Among the models tested, GRACE demonstrated the best performance (macro F1-score: 0.9434, macro ROC-AUC: 0.9934, and accuracy: 95.10%) despite class imbalance. SENSOR demonstrates significant potential to assist developers with extracting and addressing privacy-related feature requests or bug reports from user reviews, enhancing user privacy and trust.

社交媒体应用的广泛使用引发了严重的隐私担忧,这些担忧经常出现在用户评论中。这些评论还能为开发者提供宝贵的见解,帮助他们通过解决问题和引入更好的功能来改进应用。然而,评论数量庞大且含义微妙,使得开发者难以手动识别隐私相关问题并确定其优先级。以往的研究已经开发了软件工具,利用机器学习将用户评论自动分类为隐私相关、隐私无关、缺陷报告、功能请求等。值得注意的是,目前缺乏专门将评论分类为隐私相关功能请求、隐私相关缺陷报告或隐私无关的研究。本文介绍 SENtinel SORt(SENSOR),这是一个自动化的在线标注工具,旨在帮助开发者将用户评论标注并归入上述类别。为了自动标注此类评论,本文提出了标注模型 GRACE(GRU-based Attention with CBOW Embedding),该模型使用门控循环单元(GRU)、连续词袋(CBOW)和注意力机制。研究分析了来自 Google Play Store 上七款流行社交媒体应用的约 16000 条用户评论,这些应用包括 Instagram、Facebook、WhatsApp、Snapchat、X(原 Twitter)、Facebook Lite 和 Line。两名标注者手动标注了这些评论,Cohen's Kappa 值达到 0.87,保证了用于训练机器学习模型的标注数据集具有较高的标注者间一致性。在测试的模型中,尽管存在类别不平衡,GRACE 仍表现最佳(宏平均 F1 分数:0.9434,宏平均 ROC-AUC:0.9934,准确率:95.10%)。SENSOR 在帮助开发者从用户评论中提取并处理隐私相关的功能请求或缺陷报告方面展现出巨大潜力,有助于增强用户隐私和信任。
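
The inter-rater agreement statistic quoted in the abstract, Cohen's Kappa, compares observed agreement p_o with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for two annotators; the label names and the ten example annotations are made up for illustration and are not the study's data.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies, summed over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical annotations over the three categories named in the abstract.
annotator_1 = ["feature", "bug", "irrelevant", "bug", "feature",
               "irrelevant", "bug", "feature", "irrelevant", "bug"]
annotator_2 = ["feature", "bug", "irrelevant", "bug", "feature",
               "irrelevant", "bug", "feature", "irrelevant", "irrelevant"]
print(f"{cohens_kappa(annotator_1, annotator_2):.3f}")  # 0.851
```

Here the annotators agree on 9 of 10 items (p_o = 0.9) while chance agreement is 0.33, giving a kappa of about 0.85, in the same range as the 0.87 reported for the labelled reviews.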

Article 75

Title@2025-07-14 (1): Formal Analysis of the Contract Automata Runtime Environment with Uppaal: Modelling, Verification and Testing

Title: Formal Analysis of the Contract Automata Runtime Environment with Uppaal: Modelling, Verification and Testing

Formale Analyse der Vertragsautomatisierung Laufzeitumgebung mit Uppaal: Modellierung, Verifizierung und Prüfung

对合同自动化运行时环境的正式分析:建模、核查和测试 2501.12932v2

Authors (1): Davide Basile

Recently, a distributed middleware application called contract automata runtime environment (CARE) has been introduced to realise service applications specified using a dialect of finite-state automata. In this paper, we detail the formal modelling, verification and testing of CARE. We provide a formalisation as a network of stochastic timed automata. The model is verified against the desired properties with the tool Uppaal, utilising exhaustive and statistical model checking techniques. Abstract tests are generated from the Uppaal models that are concretised for testing CARE. This research emphasises the advantages of employing formal modelling, verification and testing processes to enhance the dependability of an open-source distributed application. We discuss the methodology used for modelling the application and generating concrete tests from the abstract model, addressing the issues that have been identified and fixed.

最近,推出了一个分布式的中间软件应用程序,名为“自动自动运行时间合同环境”(CARE),用于实现使用限定状态自动运行方言指定的服务应用。在本文中,我们详细说明了 CARE 的正式建模、核查和测试。我们提供了一种正规化的系统化,作为随机的定时自动成像网络。该模型用工具 Uppaal 对照所期望的特性进行校验,使用详尽的统计模型校验技术。从 Uppaal 模型中生成了抽象测试,并将其具体化以测试 CARE。这一研究强调使用正式建模、核查和测试程序的优势,以提高开源分布式应用程序的可靠性。我们讨论了用于模拟应用和从抽象模型中生成具体测试的方法,解决已经查明和修复的问题。

Article 76

Title@2025-07-14 (1): AssertCoder: LLM-Based Assertion Generation via Multimodal Specification Extraction

Title: AssertCoder: LLM-Based Assertion Generation via Multimodal Specification Extraction

AssertCoder: LLM-basierte Assertion Generation über Multimodal Specification Extraction

AssertCoder:通过多模态规格提取生成基于LLM的断言 2507.10338v1

Authors (5): Enyuan Tian, Yiwei Ci, Qiusong Yang, Yufeng Li, Zhichao Lyu

Assertion-Based Verification (ABV) is critical for ensuring functional correctness in modern hardware systems. However, manually writing high-quality SystemVerilog Assertions (SVAs) remains labor-intensive and error-prone. To bridge this gap, we propose AssertCoder, a novel unified framework that automatically generates high-quality SVAs directly from multimodal hardware design specifications. AssertCoder employs modality-sensitive preprocessing to parse heterogeneous specification formats (text, tables, diagrams, and formulas), followed by a set of dedicated semantic analyzers that extract structured representations aligned with signal-level semantics. These representations drive assertion synthesis via multi-step chain-of-thought (CoT) prompting. The framework incorporates a mutation-based evaluation approach to assess assertion quality via model checking and further refine the generated assertions. Experimental evaluation across three real-world Register-Transfer Level (RTL) designs demonstrates AssertCoder’s superior performance, achieving an average increase of 8.4% in functional correctness and 5.8% in mutation detection compared to existing state-of-the-art approaches.

基于断言的验证(ABV)对于确保现代硬件系统的功能正确性至关重要。然而,手工撰写高质量的SVA仍然需要大量劳动力且容易出错。为了缩小这一差距,我们提议AssertCoder,这是一个全新的统一框架,直接根据多模态硬件设计规格自动产生高质量的SVA。AssertCoder采用一种对模态敏感的预处理方法,以解析多种规格格式(文本、表格、图表和公式),随后是一套专门的语义分析器,提取与信号级语义一致的结构化表示。这些表示用于通过多步思维链(CoT)提示推动断言合成。该框架包含一种基于变异的评价方法,通过模型检查评估断言质量,并进一步完善生成的断言。对三个真实世界寄存器传输级(RTL)设计的实验评价显示了AssertCoder的优异表现,与现有最先进方法相比,功能正确性平均提高了8.4%,变异检测提高了5.8%。
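
Illustration (not from the paper): the mutation-based evaluation mentioned in the abstract can be pictured as counting how many deliberately broken design variants are rejected by a set of assertions. The sketch below is a Python stand-in that assumes a tiny 1-bit multiplexer, two hand-written mutants, and exhaustive input enumeration in place of a real model checker; all names are hypothetical and no SVA or RTL tooling is involved.

```python
from itertools import product

def design(a, b, sel):
    """Reference 1-bit mux: returns a when sel is 0, otherwise b."""
    return b if sel else a

def mutant_swapped(a, b, sel):
    """Mutant with the two data inputs swapped."""
    return a if sel else b

def mutant_stuck(a, b, sel):
    """Mutant that ignores sel and always forwards a."""
    return a

# Each assertion is a property of the form "condition implies expected output".
assertions = [
    lambda f, a, b, sel: sel == 1 or f(a, b, sel) == a,  # sel == 0 implies out == a
    lambda f, a, b, sel: sel == 0 or f(a, b, sel) == b,  # sel == 1 implies out == b
]

def holds(assertion, f):
    """Check the property on every input combination (a toy exhaustive check)."""
    return all(assertion(f, a, b, sel) for a, b, sel in product((0, 1), repeat=3))

def mutation_detection(props, mutants):
    """Fraction of mutants on which at least one assertion fails."""
    killed = sum(any(not holds(p, m) for p in props) for m in mutants)
    return killed / len(mutants)

mutants = [mutant_swapped, mutant_stuck]
full_score = mutation_detection(assertions, mutants)
partial_score = mutation_detection(assertions[:1], mutants)
```

With both assertions every mutant is rejected, while the first assertion alone misses the stuck-at mutant, which is the kind of gap such a score is meant to expose.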

Article 77

Title@2025-07-14 (1): Toolsuite for Implementing Multiagent Systems Based on Communication Protocols

Title: Toolsuite for Implementing Multiagent Systems Based on Communication Protocols

Toolsuite zur Implementierung von Multiagentensystemen auf Basis von Kommunikationsprotokollen

基于通信协议的多智能体系统实现工具套件 2507.10324v1

Authors (3): Amit K. Chopra, Samuel H. Christie V, Munindar P. Singh

Interaction-Oriented Programming (IOP) is an approach to building a multiagent system by modeling the interactions between its roles via a flexible interaction protocol and implementing agents to realize the interactions of the roles they play in the protocol. In recent years, we have developed an extensive suite of software that enables multiagent system developers to apply IOP. These include tools for efficiently verifying protocols for properties such as liveness and safety and middleware that simplifies the implementation of agents. This paper presents some of that software suite.

面向交互的编程(IOP)是建立多智能体系统的一种方法,其方法是通过灵活的交互协议对各角色之间的交互进行建模,并实现智能体以完成其在协议中所扮演角色的交互。近年来,我们开发了一套广泛的软件,使多智能体系统开发者能够应用IOP。其中包括有效验证协议的活性和安全性等属性的工具,以及简化智能体实现的中间件。本文件介绍了其中的一些软件套件。

Article 78

Title@2025-07-14 (1): Streamlined Airborne Software Development for Large UAVs: From Unified Data Collection to Automated Code Generation

Title: Streamlined Airborne Software Development for Large UAVs: From Unified Data Collection to Automated Code Generation

Streamlined Airborne Software Development für große UAVs: Von der Unified Data Collection bis zur automatisierten Codegenerierung

为大型无人驾驶航空器简化空载软件开发:从统一数据收集到自动代码生成 2507.10321v1

Authors (4): Viktor Sinitsyn, Nils Schlautmann, Florian Schwaiger, Florian Holzapfel

The aerospace industry has experienced significant transformations over the last decade, driven by technological advancements and innovative solutions in goods and personal transportation. This evolution has spurred the emergence of numerous start-ups that now face challenges traditionally encountered by established aerospace companies. Among these challenges is the efficient processing of digital intra-device communication interfaces for onboard equipment - a critical component for ensuring seamless system integration and functionality. Addressing this challenge requires solutions that emphasize clear and consistent interface descriptions, automation of processes, and reduced labor-intensive efforts. This paper presents a novel process and toolchain designed to streamline the development of digital interfaces and onboard software, which our team has successfully applied in several completed projects. The proposed approach focuses on automation and flexibility while maintaining compliance with design assurance requirements.

过去十年来,航空航天业在技术进步和货物及个人运输创新解决办法的推动下经历了重大变革,这一演变促使出现了许多初创企业,这些新企业现在面临既有航空航天公司历来遇到的挑战,这些挑战包括高效处理机载设备数字设备内部通信界面,这是确保系统无缝整合和功能的关键组成部分。应对这一挑战需要强调明确和一致的界面描述、流程自动化和减少劳动密集型努力的解决办法。本文件介绍了一个新颖的过程和工具链,旨在简化数字界面和机载软件的开发,我们的团队在一些已完成的项目中成功地应用了这些新程序和工具链。拟议的方法侧重于自动化和灵活性,同时保持对设计保证要求的遵守。

Article 79

Title@2025-07-14 (1): A Survey of Reinforcement Learning for Software Engineering

Title: A Survey of Reinforcement Learning for Software Engineering

Ein Überblick über Reinforcement Learning für Software-Engineering

软件工程强化学习调查 2507.12483v1

Authors (9): Dong Wang, Hanmo You, Lingwei Zhu, Kaiwei Lin, Zheng Chen, Chen Yang, Junji Yu, Zan Wang, Junjie Chen

Reinforcement Learning (RL) has emerged as a powerful paradigm for sequential decision-making and has attracted growing interest across various domains, particularly following the advent of Deep Reinforcement Learning (DRL) in 2015. Simultaneously, the rapid advancement of Large Language Models (LLMs) has further fueled interest in integrating RL with LLMs to enable more adaptive and intelligent systems. In the field of software engineering (SE), the increasing complexity of systems and the rising demand for automation have motivated researchers to apply RL to a broad range of tasks, from software design and development to quality assurance and maintenance. Despite growing research in RL-for-SE, there remains a lack of a comprehensive and systematic survey of this evolving field. To address this gap, we reviewed 115 peer-reviewed studies published across 22 premier SE venues since the introduction of DRL. We conducted a comprehensive analysis of publication trends, categorized SE topics and RL algorithms, and examined key factors such as dataset usage, model design and optimization, and evaluation practices. Furthermore, we identified open challenges and proposed future research directions to guide and inspire ongoing work in this evolving area. To summarize, this survey offers the first systematic mapping of RL applications in software engineering, aiming to support both researchers and practitioners in navigating the current landscape and advancing the field. Our artifacts are publicly available: https://github.com/KaiWei-Lin-lanina/RL4SE.

强化学习(RL)已成为连续决策的强大范例,在各个领域引起了越来越多的兴趣,特别是在2015年深入强化学习(DRL)的到来后,大型语言模型的迅速发展进一步激发了人们将RL与LLM(LLM)相结合的兴趣,从而能够建立更具适应性和智能的系统;在软件工程(SE)领域,系统日益复杂,自动化需求不断增长,促使研究人员将RL应用到从软件设计和开发到质量保证和维护等广泛任务中。尽管在RE-SE的研究不断增加,但对这一不断发展的领域仍缺乏全面、系统的调查。为弥补这一差距,我们审查了自DRL(L)推出以来在22个主要SE地点发表的115项同行审评研究。我们全面分析了出版趋势、SE专题分类和RL算法,并审查了数据集使用、模型设计和优化以及评价做法等关键因素。我们还查明了公开的挑战,并提出了未来研究方向,以指导并激励这个不断演变的领域正在进行的工作。本项调查为目前SARVL(R-L)系统定位/L(RARC)应用软件领域提供了第一次系统化的搜索和升级支持。

Article 80

Title@2025-07-14 (1): A Grounded Theory on the Teacher and Student Roles in Pair Programming

Title: A Grounded Theory on the Teacher and Student Roles in Pair Programming

Eine fundierte Theorie über Lehrer und Schülerrollen in der Pair-Programmierung

关于教师和学生在对等方案规划中的作用的理论基础 2507.10305v1

Authors (4): Linus Ververs, Trang Linh Lam, Janina Berger, Lutz Prechelt

Context: Pair programming is an established (agile) practice and is practiced throughout the industry. Objective: Understand under what circumstances knowledge transfer can harm a pair programming session. Method: Grounded Theory Methodology based on 17 recorded pair programming sessions with 18 developers from 5 German software companies, accompanied by 6 interviews with different developers from 4 other German companies. Results: We define the student and teacher roles to help developers deal with a one-sided knowledge gap. We describe pitfalls to avoid and develop a grounded theory centered around the Power Gap in pair programming. Conclusions: Knowledge transfer can be harmful when developers don’t pay attention to their partner’s needs and desires. If developers don’t pay attention to the Power Gap and keep it in check, Defensive Behavior may arise that leads to a vicious cycle impacting the knowledge transfer, the Togetherness and the code quality in a negative way.

目标:了解在什么情况下知识转让会损害对口编程会议。方法:基于17个有记录的对口编程会议,由来自5个德国软件公司的18个开发商参加,由另外4个德国公司的不同开发商进行6次访谈。结果:我们界定了学生和教师的作用,以帮助开发商处理单方面的知识差距。我们描述了避免和发展围绕对口编程中“权力差距”的有根理论的缺陷。结论:当开发商不注意其伙伴的需要和愿望时,知识转让可能有害。如果开发商不注意“权力差距”并加以控制,则可能会出现“防卫行为”导致恶性循环,对知识转让、协同和代码质量产生负面影响。

Article 81

Title@2025-07-14 (1): Helveg: Diagrams for Software Documentation

Title: Helveg: Diagrams for Software Documentation

Helveg: Diagramme für Software-Dokumentation

Helveg:软件文件图 2507.10244v1

Authors (4): Adam Štěpánek, David Kuťák, Barbora Kozlíková, Jan Byška

Software developers often have to gain an understanding of a codebase, be it programmers being onboarded onto a team project or developers striving to grasp an external open-source library. In either case, they frequently turn to the project’s documentation. However, documentation in its traditional textual form is ill-suited for this kind of high-level exploratory analysis, since it is immutable from the readers’ perspective and thus forces them to follow a predefined path. We have designed an approach bringing aspects of software architecture visualization to API reference documentation. It utilizes a highly interactive node-link diagram with expressive node glyphs and flexible filtering capabilities, providing a high-level overview of the codebase as well as details on demand. To test our design, we have implemented a prototype named Helveg, capable of automatically generating diagrams of C# codebases. User testing of Helveg confirmed its potential, but it also revealed problems with the readability, intuitiveness, and user experience of our tool. Therefore, in this paper, which is an extended version of our VISSOFT paper with DOI 10.1109/VISSOFT64034.2024.00012, we address many of these problems through major changes to the glyph design, means of interaction, and user interface of the tool. To assess the improvements, this new version of Helveg was evaluated again with the same group of participants as the previous version.

软件开发者往往必须了解一个代码库, 无论是加入团队项目的程序员, 还是努力掌握外部开放源码库的开发者。无论是哪种情况, 他们都经常转向项目文档。但是, 传统文本形式的文件不适合这种高级探索性分析, 因为它从读者的角度是不可改变的, 从而迫使他们遵循预设路径。我们设计了一种方法, 将软件架构可视化的若干方面带入 API 参考文档。它使用高度互动的节点链接图, 带有富有表现力的节点图形符号和灵活的过滤能力, 对代码库提供高层次的概览以及按需提供的细节。为了测试我们的设计, 我们实现了名为 Helveg 的原型, 能够自动生成 C# 代码库的图表。Helveg 的用户测试证实了它的潜力, 但也揭示了我们工具在可读性、直观性和用户体验方面的问题。因此, 本文是我们 VISSOFT 论文 (DOI 10.1109/VISSOFT64034.2024.00012) 的扩展版, 我们通过对该工具的图形符号设计、交互方式和用户界面进行重大修改, 解决了其中的许多问题。为了评估这些改进, 我们让与上一版本相同的一组参与者再次评估了新版 Helveg。

Article 82

Title@2025-07-14 (1): An Empirical Study of Interaction Bugs in ROS-based Software

Title: An Empirical Study of Interaction Bugs in ROS-based Software

Eine empirische Studie von Interaktionsfehlern in ROS-basierter Software

以ROS为基础的软件中的相互作用虫的经验研究 2507.10235v1

Authors (5): Zhixiang Chen, Zhuangbin Chen, Xingjie Cai, Wei Li, Zibin Zheng

Modern robotic systems integrate multiple independent software and hardware components, each responsible for distinct functionalities such as perception, decision-making, and execution. These components interact extensively to accomplish complex end-to-end tasks. As a result, the overall system reliability depends not only on the correctness of individual components, but also on the correctness of their interactions. Failures often manifest at the boundaries between components, yet interaction-related reliability issues in robotics, referred to here as interaction bugs (iBugs), remain underexplored. This work presents an empirical study of iBugs within robotic systems built using the Robot Operating System (ROS), a widely adopted open-source robotics framework. A total of 121 iBugs were analyzed across ten actively maintained and representative ROS projects. The identified iBugs are categorized into three major types: intra-system iBugs, hardware iBugs, and environmental iBugs, covering a broad range of interaction scenarios in robotics. The analysis includes an examination of root causes, fixing strategies, and the impact of these bugs. Several findings are derived that shed light on the nature of iBugs and suggest directions for improving their prevention and detection. These insights aim to inform the design of more robust and safer robotic systems.

现代机器人系统整合了多种独立的软件和硬件元件,每个元件都负责感知、决策和执行等不同功能。这些元件广泛互动,以完成复杂的端到端任务。因此,整个系统可靠性不仅取决于各个元件的正确性,而且取决于它们相互作用的正确性。在机器人中,失败通常表现在部件之间的界限,但与互动有关的可靠性问题在这里被称为互动错误(iBugs)-探索不足。这项工作是对机器人操作系统内部建立的iBugs(iBugs)的实验性研究,这是一个广泛采用的开放源码机器人框架。共121个iBugs(iBugs)在10个积极维护并具有代表性的ROS项目中进行了分析。已确定的iBugs分为三大类型:内部的iBugs(iBugs)、硬件iBugs(iBugs)和环境iBugs(iBugs),涵盖机器人中范围广泛的互动情景。分析包括研究根源、确定战略以及这些虫子的影响。一些发现从这些发现中揭示了更牢固的机器人系统设计方向。

Article 83

Title@2025-07-14 (1): Towards a Framework for Operationalizing the Specification of Trustworthy AI Requirements

Title: Towards a Framework for Operationalizing the Specification of Trustworthy AI Requirements

Auf dem Weg zu einem Rahmen für die Operationalisierung der Spezifikation vertrauenswürdiger AI-Anforderungen

建立一个落实可信赖的AI要求具体规格的框架 2507.10228v1

Authors (3): Hugo Villamizar, Daniel Mendez, Marcos Kalinowski

Growing concerns around the trustworthiness of AI-enabled systems highlight the role of requirements engineering (RE) in addressing emergent, context-dependent properties that are difficult to specify without structured approaches. In this short vision paper, we propose the integration of two complementary approaches: AMDiRE, an artefact-based approach for RE, and PerSpecML, a perspective-based method designed to support the elicitation, analysis, and specification of machine learning (ML)-enabled systems. AMDiRE provides a structured, artefact-centric, process-agnostic methodology and templates that promote consistency and traceability in the results; however, it is primarily oriented toward deterministic systems. PerSpecML, in turn, introduces multi-perspective guidance to uncover concerns arising from the data-driven and non-deterministic behavior of ML-enabled systems. We envision a pathway to operationalize trustworthiness-related requirements, bridging stakeholder-driven concerns and structured artefact models. We conclude by outlining key research directions and open challenges to be discussed with the RE community.

由AI支持的系统的信誉日益引起人们的关注,强调要求工程(RE)在解决突发的、因环境而异的特性方面的作用,这些特性在没有结构化的方法下难以具体说明。在这份简短的远景文件中,我们提议将两种互补办法结合起来:AMDIRE,一个基于可再生能源的人工工程法和PerSpecML,一种基于视角的方法,旨在支持对机器学习(ML)系统进行吸引、分析和规格的机器学习(ML)系统。AMDIRE提供了一种结构化的、以艺术为中心的、以过程为核心的方法和模板,促进结果的一致性和可追溯性;然而,它主要面向确定性系统。 PerspecML(Per-PerML)则提出了多视角的指导,以发现由数据驱动的、非确定性的行为引起的关切。我们设想了一种落实与信任有关的要求、弥合利益攸关方驱动的关切和结构化的艺术模式的途径。我们最后通过概述与RE社区讨论的关键研究方向和公开挑战。

Article 84

Title@2025-07-14 (1): Breaking the Myth: Can Small Models Infer Postconditions Too?

Title: Breaking the Myth: Can Small Models Infer Postconditions Too?

Der Mythos brechen: Können kleine Modelle auch Postkonditionen nachvollziehen?

打破神话:小模型也能推断后置条件吗? 2507.10182v1

Authors (3): Gehao Zhang, Zhenting Wang, Juan Zhai

Formal specifications are essential for ensuring software correctness, yet manually writing them is tedious and error-prone. Large Language Models (LLMs) have shown promise in generating such specifications from natural language intents, but the giant model size and high computational demands raise a fundamental question: Do we really need large models for this task? In this paper, we show that a small, fine-tuned language model can achieve high-quality postcondition generation with much lower computational costs. We construct a specialized dataset of prompts, reasoning logs, and postconditions, then supervise the fine-tuning of a 7B-parameter code model. Our approach tackles real-world repository dependencies and preserves pre-state information, allowing for expressive and accurate specifications. We evaluate the model on a benchmark of real-world Java bugs (Defects4J) and compare against both proprietary giants (e.g., GPT-4o) and open-source large models. Empirical results demonstrate that our compact model matches or outperforms significantly larger counterparts in syntax correctness, semantic correctness, and bug-distinguishing capability. These findings highlight that targeted fine-tuning on a modest dataset can enable small models to achieve results formerly seen only in massive, resource-heavy LLMs, offering a practical and efficient path for the real-world adoption of automated specification generation.

正式的规格对于确保软件正确性至关重要,但手工写成这些规格却繁琐且容易出错。大型语言模型(LLM)在根据自然语言意图生成这些规格方面显示了希望,但巨大的模型规模和高计算要求却提出了一个根本性问题:我们是否真的需要用于这项任务的大型模型?在本文中,我们显示一个小的、经过微调的语言模型可以实现高质量的后置条件生成,而计算成本要低得多。我们建造了一个由提示、推理日志和后置条件组成的专门数据集,然后监督一个70亿(7B)参数代码模型的微调。我们的方法解决了真实世界的代码仓库依赖问题并保留了前置状态信息,从而允许富有表现力且准确的规格。我们在真实世界Java错误基准(Defects4J)上评估了该模型,并与专有巨型模型(例如GPT-4o)和开源大型模型进行比较。实证结果表明,我们的紧凑模型在语法正确性、语义正确性和错误区分能力方面与规模大得多的模型相当或更优。这些发现表明,在规模适中的数据集上进行有针对性的微调,可以使小模型取得以前只有在庞大、资源密集的LLM中才能看到的结果,为自动规格生成在现实世界中的采用提供了一条实用而高效的途径。
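
Illustration (not from the paper): a postcondition that "preserves pre-state information" relates the state before a call to the state and result after it, and a useful one passes on correct code while failing on a buggy variant. The sketch below assumes a hypothetical Python decorator `ensures` and a toy list-append function; the paper itself targets Java methods from Defects4J, so this is only a stand-in for the idea.

```python
import copy
from functools import wraps

def ensures(post):
    """Attach a postcondition post(old_args, new_args, result) to a function."""
    def decorate(func):
        @wraps(func)
        def wrapper(*args):
            old = copy.deepcopy(args)   # snapshot of the pre-state
            result = func(*args)
            if not post(old, args, result):
                raise AssertionError(f"postcondition of {func.__name__} violated")
            return result
        return wrapper
    return decorate

def appended(old, new, result):
    """The list equals the old list plus the item; the result is the new length."""
    return new[0] == old[0] + [old[1]] and result == len(old[0]) + 1

@ensures(appended)
def append_item(items, item):
    items.append(item)
    return len(items)

@ensures(appended)
def append_item_buggy(items, item):
    items.insert(0, item)   # bug: item lands at the front instead of the end
    return len(items)
```

Calling `append_item([1, 2], 3)` satisfies the postcondition, whereas `append_item_buggy([1, 2], 3)` raises `AssertionError`, which is a small instance of the bug-distinguishing behaviour the abstract refers to.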

Article 85

Title@2025-07-14 (1): Accelerating Automatic Program Repair with Dual Retrieval-Augmented Fine-Tuning and Patch Generation on Large Language Models

Title: Accelerating Automatic Program Repair with Dual Retrieval-Augmented Fine-Tuning and Patch Generation on Large Language Models

Beschleunigung der automatischen Programmreparatur mit Dual Retrieval-Augmented Fine-Tuning und Patch Generation bei großen Sprachmodellen

加速自动程序维修,以大语言模式双检索增强的微调和补丁生成 2507.10103v1

Authors (7): Hanyang Guo, Xiaoheng Xie, Hong-Ning Dai, Peng Di, Yu Zhang, Bishenghui Tao, Zibin Zheng

Automated Program Repair (APR) is essential for ensuring software reliability and quality while enhancing efficiency and reducing developers’ workload. Although rule-based and learning-based APR methods have demonstrated their effectiveness, their performance is constrained by the types of defects they can repair, the quality of training data, and the size of model parameters. Recently, Large Language Models (LLMs) combined with Retrieval-Augmented Generation (RAG) have been increasingly adopted in APR tasks. However, current code LLMs and RAG designs neither fully address code repair tasks nor consider code-specific features. To overcome these limitations, we propose SelRepair, a novel APR approach that integrates a fine-tuned LLM with a newly designed dual RAG module. This approach uses a bug-fix pair dataset for fine-tuning and incorporates semantic and syntactic/structural similarity information through an RAG selection gate. This design ensures relevant information is retrieved efficiently, thereby reducing token length and inference time. Evaluations on Java datasets show SelRepair outperforms other APR methods, achieving 26.29% and 17.64% in terms of exact match (EM) on different datasets while reducing inference time by at least 6.42% with controlled input lengths.

自动化程序维修(ALLM)对于确保软件的可靠性和质量、同时提高效率和减少开发者的工作量至关重要。尽管基于规则和学习的ARPR方法已经证明了其有效性,但其效绩受到缺陷型修理、培训数据质量和模型参数大小的限制。最近,在RAPR的任务中越来越多地采用大语言模型(LLM)和Retreval-Auged-Generation(RAG) 。然而,目前的代码LLM和RAG设计既未充分处理代码修理任务,也未考虑具体代码特点。为了克服这些限制,我们建议SelRepair采用新颖的ARPAR方法,即精细调整的LM(LM)与新设计的双RAG模块相结合的新型LM(LLM),其性能受制约的功能受到制约。这个方法使用一个错误型配对数据集进行微调,并通过RAG选择门纳入了语义和合成/结构相似性信息。这一设计确保相关信息被高效检索,从而减少象征性的时间和引用时间。为了克服这些限制,对Java数据设置的评估显示Selrepair优于其他AR方法,在6.29%和176-6-6-6-6-64%的长度上比对数据进行最短的时间比。
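
Illustration (not from the paper): exact match (EM), the metric quoted in the abstract, is commonly computed as the share of generated patches that are identical to the reference fix. The sketch below assumes whitespace is normalized before comparison; that normalization, the function name, and the sample patches are assumptions made for this example, not details taken from SelRepair.

```python
def exact_match(predictions, references):
    """Fraction of predicted patches equal to their reference after whitespace normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0

    def normalize(code):
        # Collapse runs of whitespace so formatting differences do not count as mismatches.
        return " ".join(code.split())

    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

# Hypothetical one-line Java patches used purely as sample data.
predicted = ["return a + b;", "return a - b;"]
reference = ["return  a + b;", "return a * b;"]
score = exact_match(predicted, reference)
```

Under these assumptions the first pair matches once whitespace is collapsed and the second does not, so the score for this sample is one half.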

Article 86

Title@2025-07-14 (1): Kodezi Chronos: A Debugging-First Language Model for Repository-Scale, Memory-Driven Code Understanding

Title: Kodezi Chronos: A Debugging-First Language Model for Repository-Scale, Memory-Driven Code Understanding

Kodezi Chronos: Ein Debugging-First Language Model für Repository-Scale, Memory-Driven Code Understanding

Kodezi Chronos:调试第一语言模型,用于存储库规模、记忆驱动代码理解 2507.12482v1

Authors (4): Ishraq Khan, Assad Chowdary, Sharoz Haseeb, Urvish Patel

Large Language Models (LLMs) have advanced code generation and software automation, but are fundamentally constrained by limited inference-time context and lack of explicit code structure reasoning. We introduce Kodezi Chronos, a next-generation architecture for autonomous code understanding, debugging, and maintenance, designed to operate across ultra-long contexts comprising entire codebases, histories, and documentation, all without fixed window limits. Kodezi Chronos leverages a multi-level embedding memory engine, combining vector and graph-based indexing with continuous code-aware retrieval. This enables efficient and accurate reasoning over millions of lines of code, supporting repository-scale comprehension, multi-file refactoring, and real-time self-healing actions. Our evaluation introduces a novel Multi Random Retrieval benchmark, specifically tailored to the software engineering domain. Unlike classical retrieval benchmarks, this method requires the model to resolve arbitrarily distant and obfuscated associations across code artifacts, simulating realistic tasks such as variable tracing, dependency migration, and semantic bug localization. Chronos outperforms prior LLMs and code models, demonstrating a 23% improvement in real-world bug detection and reducing debugging cycles by up to 40% compared to traditional sequence-based approaches. By natively interfacing with IDEs and CI/CD workflows, Chronos enables seamless, autonomous software maintenance, elevating code reliability and productivity while reducing manual effort. These results mark a critical advance toward self-sustaining, continuously optimized software ecosystems.

大型语言模型(LLMS)具有先进的代码生成和软件自动化,但受到有限的推断时间背景和缺乏明确的代码结构推理的制约。我们引入了Kodezi Chronos,这是下一代自主代码理解、调试和维护的新一代架构,旨在跨越超长背景,包括整个代码库、历史和文档,且无固定窗口限制。Kodezi Chronos利用多层嵌入存储引擎,将矢量和图形索引与连续的代码智能检索相结合。这样可以对数百万条代码行进行高效和准确的推理,支持存储器规模的理解、多文件的软件重新配置和实时自我修复行动。我们的评估引入了一个全新的多随机检索基准,专门针对软件工程领域。与古典的检索基准不同,这种方法需要模型解决在代码文物之间任意的遥远和模糊的关联,模拟现实的任务,如可变的追踪、依赖性迁移和语系错误的错误本地化。 Chronos超越了前LMS和内部代码模型的自我更新,展示了新式的自我定位和自我转换的自我定位周期,同时演示到真实的系统,并改进了实际的系统。

Article 87

Title@2025-07-14 (1): LLMShot: Reducing snapshot testing maintenance via LLMs

Title: LLMShot: Reducing snapshot testing maintenance via LLMs

LLMShot: Reduzierung der Snapshot-Test-Wartung über LLMs

LLMShot:通过LLM减少快照测试的维护工作 2507.10062v1

Authors (4): Ergün Batuhan Kaynak, Mayasah Lami, Sahand Moslemi, Anil Koyuncu

Snapshot testing has emerged as a critical technique for UI validation in modern software development, yet it suffers from substantial maintenance overhead due to frequent UI changes causing test failures that require manual inspection to distinguish between genuine regressions and intentional design changes. This manual triage process becomes increasingly burdensome as applications evolve, creating a need for automated analysis solutions. This paper introduces LLMShot, a novel framework that leverages vision-based Large Language Models to automatically analyze snapshot test failures through hierarchical classification of UI changes. To evaluate LLMShot’s effectiveness, we developed a comprehensive dataset using a feature-rich iOS application with configurable feature flags, creating realistic scenarios that produce authentic snapshot differences representative of real development workflows. Our evaluation using Gemma3 models demonstrates strong classification performance, with the 12B variant achieving over 84% recall in identifying failure root causes while the 4B model offers practical deployment advantages with acceptable performance for continuous integration environments. However, our exploration of selective ignore mechanisms revealed significant limitations in current prompting-based approaches for controllable visual reasoning. LLMShot represents the first automated approach to semantic snapshot test analysis, offering developers structured insights that can substantially reduce manual triage effort and advance toward more intelligent UI testing paradigms.

快速抓图测试已成为现代软件开发中UI验证的关键技术,但它却由于频繁的UI变化导致测试失败,需要进行人工检查以区分真正的回归和有意的设计变化,从而导致测试失败,从而导致测试失败。随着应用程序的演变,这种人工分级过程变得日益繁琐,需要自动分析解决方案。本文介绍了LloMShot,这是一个新颖的框架,它利用基于愿景的大语言模型,通过对UI变化的等级分类,自动分析短视测试失败。为了评估LLLMShot的有效性,我们利用具有可配置特征标志的功能丰富的iOS应用程序开发了一个全面的数据集,创造了现实的情景,产生真实的快照差异,代表了实际开发工作流程。我们使用Gemma3模型进行的评估显示了很强的分类绩效,12B变量在查明失败根源方面达到84%以上,而4B模型为持续整合环境的可接受性能提供了实用的部署优势。然而,我们对选择性的忽略机制的探索揭示了当前基于快速方法的可控直观推推论存在重大局限性。 LLMShot表示第一个自动自动的直观测试分析方法,向了精度测试模型,向智能智能测试,为开发者提供了更精度测试。

Article 88

Title@2025-07-14 (1): Explicit Vulnerability Generation with LLMs: An Investigation Beyond Adversarial Attacks

Title: Explicit Vulnerability Generation with LLMs: An Investigation Beyond Adversarial Attacks

Explizite Schwachstellengenerierung mit LLMs: Eine Untersuchung jenseits adversarialer Angriffe

利用LLM显式生成漏洞:对抗性攻击之外的调查 2507.10054v1

Authors (4): Emir Bosnak, Sahand Moslemi, Mayasah Lami, Anil Koyuncu

Large Language Models (LLMs) are increasingly used as code assistants, yet their behavior when explicitly asked to generate insecure code remains poorly understood. While prior research has focused on unintended vulnerabilities or adversarial prompting techniques, this study examines a more direct threat scenario: open-source LLMs generating vulnerable code when prompted either directly or indirectly. We propose a dual experimental design: (1) Dynamic Prompting, which systematically varies vulnerability type, user persona, and directness across structured templates; and (2) Reverse Prompting, which derives prompts from real vulnerable code samples to assess vulnerability reproduction accuracy. We evaluate three open-source 7B-parameter models (Qwen2, Mistral, and Gemma) using ESBMC static analysis to assess both the presence of vulnerabilities and the correctness of the generated vulnerability type. Results show all models frequently produce vulnerable outputs, with Qwen2 achieving highest correctness rates. User persona significantly affects success, where student personas achieved higher vulnerability rates than professional roles, while direct prompts were marginally more effective. Vulnerability reproduction followed an inverted-U pattern with cyclomatic complexity, peaking at moderate ranges. Our findings expose limitations of safety mechanisms in open-source models, particularly for seemingly benign educational requests.

大语言模型(LLMS)越来越多地被用作代码助理,然而,在明确要求生成不安全代码时,他们的行为仍然没有得到很好的理解。虽然先前的研究侧重于意想不到的脆弱性或对抗性催化技术,但本研究审查了一种更直接的威胁情景:在直接或间接推动下,开放源的LMS生成脆弱代码。我们提议了一种双重实验设计:(1)动态催化,在结构化模板中系统地区分脆弱性类型、用户个性和直接性;(2)反敏化,从真实的脆弱代码样本中获取信号,以评估脆弱性复制的准确性。我们利用ESBMC静态分析对三种开放源的7B参数模型(Quen2,Mistral,和Gemma)进行评估,以评估脆弱性的存在和生成的脆弱类型是否正确性。结果显示,所有模型都经常产生脆弱产出,Quen2达到最高正确率。用户a 严重影响成功,因为学生的脆弱率高于专业角色,而直接提示效果则略低。脆弱性再现遵循一种具有周期复杂性的反向U模式,在中等范围内达到顶峰值。我们的调查结果暴露了开放源安全机制的局限性。

Article 89

Title@2025-07-14 (1): Enhancing the Capabilities of Large Language Models for API calls through Knowledge Graphs

Title: Enhancing the Capabilities of Large Language Models for API calls through Knowledge Graphs

Verbesserung der Fähigkeiten von großen Sprachmodellen für API-Aufrufe durch Wissensgraphen

通过知识图谱增强大语言模型的API调用能力 2507.10630v1

Authors (4): Ye Yang, Xue Xiao, Ping Yin, Taotao Xie

API calls by large language models (LLMs) offer a cutting-edge approach for data analysis. However, their ability to effectively utilize tools via API calls remains underexplored in knowledge-intensive domains like meteorology. This paper introduces KG2data, a system that integrates knowledge graphs, LLMs, ReAct agents, and tool-use technologies to enable intelligent data acquisition and query handling in the meteorological field. Using a virtual API, we evaluate API call accuracy across three metrics: name recognition failure, hallucination failure, and call correctness. KG2data achieves superior performance (1.43%, 0%, 88.57%) compared to RAG2data (16%, 10%, 72.14%) and chat2data (7.14%, 8.57%, 71.43%). KG2data differs from typical LLM-based systems by addressing their limited access to domain-specific knowledge, which hampers performance on complex or terminology-rich queries. By using a knowledge graph as persistent memory, our system enhances content retrieval, complex query handling, domain-specific reasoning, semantic relationship resolution, and heterogeneous data integration. It also mitigates the high cost of fine-tuning LLMs, making the system more adaptable to evolving domain knowledge and API structures. In summary, KG2data provides a novel solution for intelligent, knowledge-based question answering and data analysis in domains with high knowledge demands.

大型语言模型(LLMs)的API调用大型语言模型(LLMs)为数据分析提供了一种尖端的方法。然而,它们通过API调用工具有效利用工具的能力在气象等知识密集型领域仍然没有得到充分利用。本文介绍了KG2data,这是一个将知识图形、LLMS、ReAct代理器和工具使用技术相结合的系统,在气象领域能够智能地获取和查询处理数据。我们使用虚拟API,评估API调用三个指标的准确性:名称识别失败、幻觉失灵和调用正确性。KG2data与RAG2data(16%、10%、72.14%)和聊天2data(7.14%、8.57%、71.43%)相比,其有效使用工具。 KG2dddddd数据与典型的LMM系统不同,因为后者解决了对特定领域知识的有限获取和查询能力,从而妨碍了复杂或术语丰富的查询的绩效。通过知识图解,我们的系统可以加强内容检索、复杂的查询处理、具体领域推理学、语义关系解解以及数据整合。在高科技领域中,它也减少了高成本数据分析。

Article 90

Title@2025-07-14 (1): EVALOOP: Assessing LLM Robustness in Programming from a Self-consistency Perspective

Title: EVALOOP: Assessing LLM Robustness in Programming from a Self-consistency Perspective

EVALOOP: Bewertung der Robustheit von LLM in der Programmierung aus einer Perspektive der Selbstkonsistenz

EVALOOP: 从自统一的角度评估方案拟订中的LLM强力 2505.12185v3

Authors (3): Sen Fang, Weiyuan Ding, Bowen Xu

Assessing the programming capabilities of Large Language Models (LLMs) is crucial for their effective use in software engineering. Current evaluations, however, predominantly measure the accuracy of generated code on static benchmarks, neglecting the critical aspect of model robustness during programming tasks. While adversarial attacks offer insights on model robustness, their effectiveness is limited and evaluation could be constrained. Current adversarial attack methods for robustness evaluation yield inconsistent results, struggling to provide a unified evaluation across different LLMs. We introduce EVALOOP, a novel assessment framework that evaluates robustness from a self-consistency perspective, i.e., by leveraging the natural duality inherent in popular software engineering tasks, e.g., code generation and code summarization. EVALOOP initiates a self-contained feedback loop: an LLM generates output (e.g., code) from an input (e.g., a natural language specification), and then uses the generated output as the input to produce a new output (e.g., summarizes that code into a new specification). EVALOOP repeats the process to assess the effectiveness of EVALOOP in each loop. This cyclical strategy intrinsically evaluates robustness without relying on any external attack setups, providing a unified metric to evaluate LLMs’ robustness in programming. We evaluate 16 prominent LLMs (e.g., GPT-4.1, O4-mini) on EVALOOP and find that EVALOOP typically induces a 5.01%-19.31% absolute drop in pass@1 performance within ten loops. Intriguingly, robustness does not always align with initial performance (i.e., one-time query); for instance, GPT-3.5-Turbo, despite superior initial code generation compared to DeepSeek-V2, demonstrated lower robustness over repeated evaluation loops.

评估大语言模型(LLMS)的编程能力对于在软件工程中有效使用这些模型至关重要。但是,目前的评价主要是测量生成的固定基准代码的准确性,忽略了模型稳健性的关键方面。对抗性攻击提供了对模型稳健性的洞察力,但其效力有限,评价可能受到限制。目前稳健性评价的对抗性攻击方法产生不一致的结果,难以在不同LMS之间提供统一的评价。我们引入了EVALOOP,这是一个从绝对一致性角度评价稳健性强性的新评估框架,即利用流行软件工程任务中固有的自然双重性,例如,静态生成和代码合成。 EVALOOP 启动了一个自成一体的反馈循环:一个LM(例如,自然语言规格)生成产出,然后将产生的产出用作新的产出(例如,将该代码总结为一种稳定的初始规格) 。 EVALOOP- 19 继续评估过程评估 EVALOO-40 在每个循环中,而不是透明性的循环中持续性地评估。
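
Illustration (not from the paper): the feedback loop described in the abstract alternates between generating code from a specification and summarizing that code back into a new specification, checking the code in every round. The sketch below assumes three caller-supplied functions and uses deterministic toy stand-ins instead of a real LLM; every name here is hypothetical, and the toy "model" loses one word of the specification per round only to make the degradation visible.

```python
from typing import Callable, List

def run_loop(spec: str,
             generate_code: Callable[[str], str],
             summarize_code: Callable[[str], str],
             passes_tests: Callable[[str], bool],
             max_loops: int = 10) -> List[bool]:
    """Alternate spec -> code -> spec and record whether each round's code passes."""
    outcomes = []
    for _ in range(max_loops):
        code = generate_code(spec)
        outcomes.append(passes_tests(code))
        spec = summarize_code(code)   # the output becomes the next round's input
    return outcomes

# Toy stand-ins for the model calls (assumptions, not a real LLM API).
def toy_generate(spec: str) -> str:
    body = "a + b" if "sum" in spec.split() else "0"
    return f"# {spec}\ndef f(a, b):\n    return {body}\n"

def toy_summarize(code: str) -> str:
    words = code.splitlines()[0].lstrip("# ").split()
    return " ".join(words[:-1])   # a lossy summary: drops the last word

def toy_passes(code: str) -> bool:
    namespace = {}
    exec(code, namespace)         # run the generated snippet in an isolated namespace
    return namespace["f"](2, 3) == 5

outcomes = run_loop("return the sum of a and b",
                    toy_generate, toy_summarize, toy_passes, max_loops=6)
```

Averaging such per-round outcomes over many tasks would give a pass rate for each loop, which is one way to read the reported drop in pass@1 across ten loops.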

Article 91

Title@2025-07-14 (1): When Less is More: A systematic review of four-day workweek conceptualizations and their effects on organizational performance

Title: When Less is More: A systematic review of four-day workweek conceptualizations and their effects on organizational performance

When Less is More: Eine systematische Überprüfung von viertägigen Arbeitswochenkonzeptualisierungen und deren Auswirkungen auf die organisatorische Leistung

时间越少越少:系统审查四天工作周概念概念化及其对组织业绩的影响 2507.09911v1

Authors (3): Marvin Auf der Landwehr, Julia Topp, Michael Neumann

Context: Agile IT organizations, which are characterized by self-organization and collaborative social interactions, require motivating, efficient and flexible work environments to maximize value creation. Compressed work schedules such as the four-day workweek have evolved into multiple facets over the last decades and are associated with various benefits for organizations and their employees. Objective: Our objective in this study is to deepen our comprehension of the impact of compressed work schedules on the operational efficacy of IT enterprises, while concurrently developing a comprehensive framework delineating the intricacies of compressed work schedules. Method: We conducted a systematic review of available conceptualizations related to four-day workweek schedules and elaborated on their organizational and social effects. To cover scientific and practice-oriented literature, our review combined a systematic literature review and a web content analysis. Results: Based on the generated insights, we derive a meta-framework that matches conceptualizations and effects, finally guiding the adoption of compressed work schedules based on individual managerial prerequisites and circumstances.

目标:本项研究的目标是加深我们对压缩工作时间表对信息技术企业业务效率的影响的理解,同时制定一个综合框架,明确压缩工作时间表的复杂性。方法:我们系统地审查了与四天工作周时间表有关的现有概念化,并阐述了其组织和社会影响。为了涵盖科学和面向实践的文献,我们的审查结合了系统文献审查和网络内容分析。结果:根据所得出的见解,我们形成了一个与概念化和效果相匹配的元框架,最后指导采用基于个人管理前提和情况的压缩工作时间表。

Article 92

Title@2025-07-14 (1): Modelling Interrelations Between Agile Practices: The Agile Map

Title: Modelling Interrelations Between Agile Practices: The Agile Map

Modellierung von Zusammenhängen zwischen agilen Praktiken: Die agile Karte

模拟各种恶恶之间相互关系的模型:各种恶恶:各种恶恶的地图 2507.09907v1

Authors (3): Thomas Hansper, Kevin Phong Pham, Michael Neumann

Agile methods are defined through guidelines comprising various practices intended to enable agile ways of working. These guidelines further comprise a specific set of agile practices aiming to enable teams for an agile way of working. However, due to their widespread use in practice, we know that agile practices are adopted and tailored intensively, which leads to a high variety of agile practices in terms of their level of detail. Problem: A high variety of agile practices can be challenging as we do not know how different agile practices are interrelated with each other. To be more precise, tailoring and adopting agile practices may lead to the challenge that the combinatorial use of several agile practices can only be successful to a limited extent, as practices support or even require each other for effective use in practice. Objective: Our study aims to provide an enabler for this problem. We want to identify interrelations between agile practices and describe them in a systematic manner. Contribution: The core contribution of this paper is the Agile Map, a theoretical model describing relations between agile practices following a systematic approach aiming to provide an overview of coherences between agile practices. The model aims to support practitioners in selecting and combining agile practices in a meaningful way.

这些指导方针还包含一套特别的灵活做法,旨在使各小组能够灵活地开展工作;然而,由于在实践中广泛采用灵活做法,我们知道灵活做法被广泛采用,并经过大量调整,导致各种详细程度的灵活做法。问题:由于我们不知道不同灵活做法之间如何相互关联,因此多种多样的灵活做法可能具有挑战性。更精确地说,调整和采用灵活做法可能导致挑战,一些灵活做法的组合使用只能在有限程度上取得成功,因为做法支持,甚至相互要求在实践中有效使用。目标:我们的研究旨在为这一问题提供一个促进者。我们要查明灵活做法之间的相互关系,并系统地描述这些做法。贡献:本文件的核心贡献是Agile地图,这是一个理论模型,说明采用系统方法的灵活做法之间的关系,目的是概述灵活做法之间的一致性。该模型旨在支持从业人员以有意义的方式选择和合并灵活做法。

Article 93

Title@2025-07-14 (1): PathFuzzing: Worst Case Analysis by Fuzzing Symbolic-Execution Paths

Title: PathFuzzing: Worst Case Analysis by Fuzzing Symbolic-Execution Paths

PathFuzzing: Schlechteste Fallanalyse durch Fuzzing Symbolic-Execution Paths

路径Fuzzing:通过模糊符号执行路径进行最坏的案例研究分析 2507.09892v1

Authors (2): Zimu Chen, Di Wang

Estimating worst-case resource consumption is a critical task in software development. The worst-case analysis (WCA) problem is an optimization-based abstraction of this task. Fuzzing and symbolic execution are widely used techniques for addressing the WCA problem. However, improving code coverage in fuzzing or managing path explosion in symbolic execution within the context of WCA poses significant challenges. In this paper, we propose PathFuzzing, aiming to combine the strengths of both techniques to design a WCA method. The key idea is to transform a program into a symbolic one that takes an execution path (encoded as a binary string) and interprets the bits as branch decisions. PathFuzzing then applies evolutionary fuzzing techniques to the transformed program to search for binary strings that represent satisfiable path conditions and lead to high resource consumption. We evaluate the performance of PathFuzzing experimentally on a benchmark suite that consists of prior work’s benchmarks and some added by us. Results show that PathFuzzing generally outperforms a fuzzing and a symbolic-execution baseline.

估计最坏情况的资源消耗是软件开发中的一项关键任务。最坏情况分析( WCA) 问题在于对任务进行基于优化的抽象化。模糊和象征性执行是用来解决 WCA 问题的常用技术。然而, 在WCA 范围内, 改善模糊或管理象征性执行过程中路径爆炸的代码覆盖带来了重大挑战。在本文中, 我们建议 PathFuzzizing , 目的是将两种技术的优势结合起来设计 WCA 方法。关键的想法是将一个程序转换成一个具有象征意义的程序, 它将执行路径( 编码为二进制字符串) , 并将比特解释成分支决定。路径Fuzzing 然后将进化的模糊技术应用到转型程序中, 以寻找代表可作比较路径条件并导致高资源消耗的二进制字符串。我们用一个由先前的工作基准和我们添加的基准套件来评估PathFuzzzzz 的实验性表现。结果表明, 路径Fuzzzing 通常比一个模糊和象征性的基线要长于一个模糊。
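
The key idea in the abstract above (a bit string read as a sequence of branch decisions, searched by mutation for high cost) can be sketched as follows. This is only an illustration under stated assumptions: `run_path` is a hypothetical toy "transformed program", its infeasibility rule (two expensive branches in a row) stands in for a real path-condition satisfiability check, and the search is a plain single-bit-flip mutation loop rather than the evolutionary fuzzer used in the paper.

```python
import random
from typing import List, Optional, Tuple


def run_path(bits: List[int]) -> Optional[int]:
    """Toy transformed program: each bit selects a branch.

    Returns the resource cost of the path, or None if the path is infeasible.
    """
    cost = 0
    prev_taken = False
    for bit in bits:
        if bit:
            if prev_taken:
                # Hypothetical rule standing in for an unsatisfiable path condition.
                return None
            cost += 10  # expensive branch
            prev_taken = True
        else:
            cost += 1   # cheap branch
            prev_taken = False
    return cost


def evolve(n_bits: int, generations: int, seed: int = 0) -> Tuple[List[int], int]:
    """Keep the costliest feasible bit string found by single-bit mutations."""
    rng = random.Random(seed)
    best = [0] * n_bits          # the all-cheap path is always feasible here
    best_cost = run_path(best)
    for _ in range(generations):
        child = best[:]
        child[rng.randrange(n_bits)] ^= 1  # flip one branch decision
        cost = run_path(child)
        if cost is not None and cost > best_cost:
            best, best_cost = child, cost
    return best, best_cost


best_bits, best_cost = evolve(n_bits=8, generations=200)
print(best_bits, best_cost)
```

Infeasible children are simply discarded, so the search only ever keeps bit strings that correspond to executable paths.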

Article 94

Title@2025-07-14 (1): Turning the Tide: Repository-based Code Reflection

Title: Turning the Tide: Repository-based Code Reflection

Drehen der Tide: Repository-basierte Code-Reflexion

翻转底盘:基于仓库的代码反射 2507.09866v1

Authors (8): Wei Zhang, Jian Yang, Jiaxi Yang, Ya Wang, Zhoujun Li, Zeyu Cui, Binyuan Hui, Junyang Lin

Code large language models (LLMs) enhance programming by understanding and generating code across languages, offering intelligent feedback, bug detection, and code updates through reflection, improving development efficiency and accessibility. While benchmarks (e.g. HumanEval/LiveCodeBench) evaluate code generation and real-world relevance, previous works ignore the scenario of modifying code in repositories. Considering challenges remaining in improving reflection capabilities and avoiding data contamination in dynamic benchmarks, we introduce LiveRepoReflection, a challenging benchmark for evaluating code understanding and generation in multi-file repository contexts, featuring 1,888 rigorously filtered test cases across 6 programming languages to ensure diversity, correctness, and high difficulty. Further, we create RepoReflection-Instruct, a large-scale, quality-filtered instruction-tuning dataset derived from diverse sources, used to train RepoReflectionCoder through a two-turn dialogue process involving code generation and error-driven repair. The leaderboard evaluates over 40 LLMs to reflect the model performance of repository-based code reflection.

守则大语言模型(LLMS)通过理解和生成跨语言的代码来增强编程,通过反射提供智能反馈、错误检测和代码更新,提高发展效率和无障碍性。基准(例如HumanEval/LiveCodeBench)评估代码生成和现实世界相关性,以往的工作忽视了修改存储库代码的设想。考虑到在提高反思能力和避免动态基准中数据污染方面仍然存在的挑战,我们引入了LiveRepoReflection,这是在多文件存储库背景下评价代码理解和生成的具有挑战性的基准,其特点是6美元方案编制语言的1 888 严格过滤测试案例,以确保多样性、正确性和高度难度。此外,我们创建了基于不同来源的大规模、质量过滤式指令调整数据集,用于培训ReporeflectionCoder,通过涉及代码生成和错误驱动修复的双向对话过程。领导板对40多个LMS进行了评估,以反映存储器代码映射的模型性能。

Article 95

Title@2025-07-14 (1): IRFuzzer: Specialized Fuzzing for LLVM Backend Code Generation

Title: IRFuzzer: Specialized Fuzzing for LLVM Backend Code Generation

IRFuzzer: Spezialisiertes Fuzzing für LLVM-Backend-Code-Generierung

IRFuzzer: LLLVM 后端代码生成专门模糊 2402.05256v2

Authors (5): Yuyang Rong, Zhanghan Yu, Zhenkai Weng, Stephen Neuendorffer, Hao Chen

Modern compilers, such as LLVM, are complex pieces of software. Due to their complexity, manual testing is unlikely to suffice, yet formal verification is difficult to scale. End-to-end fuzzing can be used, but it has difficulties in achieving high coverage of some components of LLVM. In this paper, we implement IRFuzzer to investigate the effectiveness of specialized fuzzing of the LLVM compiler backend. We focus on two approaches to improve the fuzzer: guaranteed input validity using constrained mutations and improved feedback quality. The mutator in IRFuzzer is capable of generating a wide range of LLVM IR inputs, including structured control flow, vector types, and function definitions. The system instruments coding patterns in the compiler to monitor the execution status of instruction selection. The instrumentation not only provides a new coverage feedback called matcher table coverage, but also provides architecture-specific guidance to the mutator. We show that IRFuzzer is more effective than existing fuzzers by fuzzing on 29 mature LLVM backend targets. In the process, we reported 74 confirmed new bugs in LLVM upstream, out of which 49 have been fixed and five have been backported to LLVM 15, showing that specialized fuzzing provides useful and actionable insights to LLVM developers.

LLVM 等现代编译器是复杂的软件。由于其复杂性, 人工测试不太可能足够, 但正式的核查也难以进行。使用端到端的模糊可以使用, 但难以实现LLLVM某些部件的高覆盖率。在本文中, 我们使用 IRFuzzer 来调查LLLVM 编译器后端专门模糊的功能的有效性。我们侧重于两种方法来改进模糊器: 使用受限制的突变来保证输入有效性并改进反馈质量。 IRFuzzer 的变异器能够生成广泛的LLLVM IR 输入器, 包括结构化的控制流、矢量类型和功能定义。编译器在编译器中难以实现对LLLLLVM 执行状况进行高度监控。仪器不仅提供称为匹配表后端的新的覆盖反馈, 而且还为变异器提供结构上的指导。我们显示, IRFuzzzer比现有的烟雾器更有效, 是在29个成熟的LLVM 后端目标上, 我们报告有74个确认新的错误, LLLLLVM 向LVM 的后端显示有15号的后端的后端, 。

Article 96

Title@2025-07-13 (7): Measuring What Matters: A Framework for Evaluating Safety Risks in Real-World LLM Applications

Title: Measuring What Matters: A Framework for Evaluating Safety Risks in Real-World LLM Applications

Was zählt: Ein Rahmen für die Bewertung von Sicherheitsrisiken in realen LLM-Anwendungen

衡量什么重要事项:在现实世界LLM应用中评估安全风险的框架 2507.09820v1

Authors (6): Jia Yi Goh, Shaun Khoo, Nyx Iskandar, Gabriel Chua, Leanne Tan, Jessica Foo

Most safety testing efforts for large language models (LLMs) today focus on evaluating foundation models. However, there is a growing need to evaluate safety at the application level, as components such as system prompts, retrieval pipelines, and guardrails introduce additional factors that significantly influence the overall safety of LLM applications. In this paper, we introduce a practical framework for evaluating application-level safety in LLM systems, validated through real-world deployment across multiple use cases within our organization. The framework consists of two parts: (1) principles for developing customized safety risk taxonomies, and (2) practices for evaluating safety risks in LLM applications. We illustrate how the proposed framework was applied in our internal pilot, providing a reference point for organizations seeking to scale their safety testing efforts. This work aims to bridge the gap between theoretical concepts in AI safety and the operational realities of safeguarding LLM applications in practice, offering actionable guidance for safe and scalable deployment.

大型语言模型(LLMs)目前大多数的安全测试工作都侧重于评估基础模型,然而,越来越需要评估应用层面的安全性,例如系统提示、回收管道和护栏等组成部分的安全性,从而对LLM应用的整体安全产生重大影响。在本文件中,我们引入了一个实用框架,用于评估LLM系统的应用安全性安全性,通过在本组织内多种使用案例中实际部署全球应用来加以验证。框架由两部分组成:(1) 制定定制安全风险分类的原则,(2) 评估LLM应用中的安全风险的做法。我们介绍了拟议框架如何在内部试点中应用,为寻求扩大安全测试工作的组织提供了一个参考点。这项工作旨在弥合在AI安全理论概念与实际保护LM应用的实际操作之间的差距,为安全和可扩展的部署提供可操作的指导。

Article 97

Title@2025-07-13 (7): Prompting for Performance: Exploring LLMs for Configuring Software

Title: Prompting for Performance: Exploring LLMs for Configuring Software

Prompting for Performance: LLMs für die Konfiguration von Software erkunden

促效:探索配置软件LLMs 2507.09790v1

Authors (10): Helge Spieker, Théo Matricon, Nassim Belmecheri, Jørn Eirik Betten, Gauthier Le Bartz Lyan, Heraldo Borges, Quentin Mazouni, Dennis Gross, Arnaud Gotlieb, Mathieu Acher

Software systems usually provide numerous configuration options that can affect performance metrics such as execution time, memory usage, binary size, or bitrate. On the one hand, making informed decisions is challenging and requires domain expertise in options and their combinations. On the other hand, machine learning techniques can search vast configuration spaces, but with a high computational cost, since concrete executions of numerous configurations are required. In this exploratory study, we investigate whether large language models (LLMs) can assist in performance-oriented software configuration through prompts. We evaluate several LLMs on tasks including identifying relevant options, ranking configurations, and recommending performant configurations across various configurable systems, such as compilers, video encoders, and SAT solvers. Our preliminary results reveal both positive abilities and notable limitations: depending on the task and systems, LLMs can well align with expert knowledge, whereas hallucinations or superficial reasoning can emerge in other cases. These findings represent a first step toward systematic evaluations and the design of LLM-based solutions to assist with software configuration.

软件系统通常提供许多能够影响性能衡量尺度的配置选项,例如执行时间、记忆使用、二进制大小或比特率。一方面,作出知情决定具有挑战性,需要选项及其组合方面的域内专长。另一方面,机器学习技术可以搜索广阔的配置空间,但计算成本很高,因为需要具体执行许多配置。在本次探索研究中,我们调查大型语言模型(LLLMS)能否通过提示帮助建立面向性能的软件配置。我们评估了几个LMS的任务,包括确定相关选项、排名配置,以及建议各种可配置系统(如编译者、视频编码器和SAT解析者)的性能配置。我们的初步结果显示了积极的能力和显著的局限性:取决于任务和系统,LMS可以很好地与专家知识保持一致,而幻觉或表面推理则可以在其他情况下出现。这些发现是系统评价和设计基于LM的解决方案以协助软件配置的第一步。

Article 98

Title@2025-07-13 (7): OrQstrator: An AI-Powered Framework for Advanced Quantum Circuit Optimization

Title: OrQstrator: An AI-Powered Framework for Advanced Quantum Circuit Optimization

OrQstrator: Ein KI-Powered-Framework für erweiterte Quantenschaltungsoptimierung

OrQstrator: AI授权的高级量子电路优化框架 2507.09682v1

Authors (2): Laura Baird, Armin Moin

We propose a novel approach, OrQstrator, which is a modular framework for conducting quantum circuit optimization in the Noisy Intermediate-Scale Quantum (NISQ) era. Our framework is powered by Deep Reinforcement Learning (DRL). Our orchestration engine intelligently selects among three complementary circuit optimizers: a DRL-based circuit rewriter trained to reduce depth and gate count via learned rewrite sequences; a domain-specific optimizer that performs efficient local gate resynthesis and numeric optimization; and a parameterized circuit instantiator that improves compilation by optimizing template circuits during gate set translation. These modules are coordinated by a central orchestration engine that learns coordination policies based on circuit structure, hardware constraints, and backend-aware performance features such as gate count, depth, and expected fidelity. The system outputs an optimized circuit for hardware-aware transpilation and execution, leveraging techniques from an existing state-of-the-art approach, called the NISQ Analyzer, to adapt to backend constraints.

我们提出了一个新颖的方法,即OrQstrator,这是在Noisy中级量子(NISQ)时代进行量子电路优化的模块化框架。我们的框架由深强化学习(DRL)提供动力。我们的管弦引擎明智地在三个互补的电路优化器中选择:一个基于DRL的电路再编,通过学习的重写序列来降低深度和门数;一个特定域的优化器,运行高效的本地门再合成和数字优化;一个参数化电路即时器,通过优化门置翻译过程中的模板电路来改进编译。这些模块由中央管弦机协调,该机学习基于电路结构、硬件限制和后端识性能(如门数、深度和预期的忠诚)的协调政策。这个系统输出一种优化的硬件觉变换和执行的电路,利用现有状态方法(称为NISQAnalyzer)的技术,以适应后端限制。

Article 99

Title@2025-07-13 (7): Is Quantization a Deal-breaker? Empirical Insights from Large Code Models

Title: Is Quantization a Deal-breaker? Empirical Insights from Large Code Models

Ist Quantisierung ein Deal-Breaker? Empirische Einblicke aus großen Code-Modellen

量化是否是一个突破交易者?来自大代码模型的实证透视 2507.09665v1

Authors (3): Saima Afrin, Bowen Xu, Antonio Mastropaolo

The growing scale of large language models (LLMs) not only demands extensive computational resources but also raises environmental concerns due to their increasing carbon footprint. Model quantization emerges as an effective approach that can reduce the resource demands of LLMs by decreasing parameter precision without substantially affecting performance (e.g., 16 bit to 4 bit). While recent studies have established quantization as a promising approach for optimizing large code models (LCMs), a specialized subset of LLMs tailored for automated software engineering, their findings offer only limited insights into its practical implications. Specifically, current investigations focus only on the functional correctness of the code generated by quantized models, neglecting how quantization impacts critical aspects of code quality such as reliability, maintainability, and security. To bridge this gap, our study investigates the effects of quantization on the qualitative aspects of automatically generated code. We apply Activation-aware Weight Quantization (AWQ) to two widely used code models, CodeLlama and DeepSeekCoder, to generate Java and Python code. Using state-of-the-art static analysis tools, we evaluate software quality metrics and static features including cyclomatic complexity, cognitive complexity, and lines of code. Our findings reveal that quantization is a robust technique that not only preserves functional correctness, but also retains key qualitative code attributes sought after by developers, such as maintainability and structural simplicity.

大型语言模型(LLMS)规模不断扩大,不仅需要大量计算资源,而且还因其碳足迹的增加而引起环境关切。模型量化作为一种有效方法出现,可以降低参数精确度,减少LLMS的资源需求,同时不严重影响性能(例如,16位至4位)。虽然最近的研究将量化确定为优化大型代码模型(LMS)的一个很有希望的方法,这是专门为自动化软件工程定制的LLMS的一个专门子集,但其调查结果仅提供了对其实际影响的有限洞察力。具体地说,目前的调查仅侧重于量化模型产生的代码的功能正确性,忽略了量化对代码质量的关键方面,如可靠性、可维持性和安全性等。为缩小这一差距,我们的研究调查了量化对自动生成代码质量方面的影响。我们将Acivation-aware Weight Quartization(AWAWQQQ)应用两种广泛使用的代码(DCLlama和Deep SeekCoder)来生成Java和Python的代码。我们利用了状态的静态静态定性分析工具分析工具,我们评估了功能质量指标的精确度的特性和精确性数据,这又反映了了我们的系统。
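
To make the "16 bit to 4 bit" precision reduction mentioned in the abstract above concrete, here is a minimal sketch of plain symmetric round-to-nearest quantization to a signed 4-bit range. It is deliberately not AWQ, which additionally uses activation statistics to protect salient weights; the function names and the example weights are assumptions chosen for illustration.

```python
from typing import List, Tuple


def quantize_int4(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric round-to-nearest quantization to the signed 4-bit range [-8, 7]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0  # one scale for the whole group
    quantized = [max(-8, min(7, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map 4-bit integers back to approximate floating-point weights."""
    return [v * scale for v in quantized]


weights = [0.70, -0.33, 0.12, 0.0]  # hypothetical weight group
q, scale = quantize_int4(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

The rounding error per weight is bounded by half the scale, which is why the choice of scale (and, in methods such as AWQ, which weights to protect) determines how much accuracy survives.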

Article 100

Title@2025-07-13 (7): Code Review as Decision-Making – Building a Cognitive Model from the Questions Asked During Code Review

Title: Code Review as Decision-Making – Building a Cognitive Model from the Questions Asked During Code Review

Code-Review als Entscheidungsfindung – Aufbau eines Kognitivmodells aus den Fragen, die während der Code-Review gestellt wurden

作为决策的《守则》审查 – – 从《守则》审查期间提出的问题建立认知模式 2507.09637v1

Authors (3): Lo Gullstrand Heander, Emma Söderberg, Christofer Rydenfält

Code review is a well-established and valued practice in the software engineering community contributing to both code quality and interpersonal benefits. However, there are challenges in both tools and processes that give rise to misalignments and frustrations. Recent research seeks to address this by automating code review entirely, but we believe that this risks losing the majority of the interpersonal benefits such as knowledge transfer and shared ownership. We believe that by better understanding the cognitive processes involved in code review, it would be possible to improve tool support, with or without AI, and make code review both more efficient and more enjoyable, while increasing or maintaining all of its benefits. In this paper, we conduct an ethnographic think-aloud study involving 10 participants and 34 code reviews. We build a cognitive model of code review bottom up through thematic, statistical, temporal, and sequential analysis of the transcribed material. Through the data, the similarities between the cognitive process in code review and decision-making processes, especially recognition-primed decision-making, become apparent. The result is the Code Review as Decision-Making (CRDM) model that shows how the developers move through two phases during the code review: first an orientation phase to establish context and rationale, and then an analytical phase to understand, assess, and plan the rest of the review. Throughout the process several decisions must be taken, on writing comments, finding more information, voting, running the code locally, verifying continuous integration results, etc. Analysis software and process-coded data are publicly available at: https://doi.org/10.5281/zenodo.15758266

守则审查是软件工程界中一项既定和有价值的做法,有助于守则质量和人际效益,然而,在工具和进程中都存在挑战,导致出现不协调与挫折。最近的研究力求通过完全使守则审查自动化来解决这个问题,但我们认为,这有可能丧失大部分人际效益,例如知识转让和共享所有权。我们认为,通过更好地了解守则审查所涉及的认知过程,可以改进工具支助,在没有AI的情况下进行,并使守则审查既更有效、更可享受,又增加或保持其所有效益。在本文中,我们进行了有10个参与者和34个守则审查参加的族裔思想研究。我们通过专题、统计、时间和顺序分析,建立了守则审查的认知模式。通过数据,可以发现编码审查和决策过程之间的相似性,特别是确认和定位的决策。结果是,守则审查作为决定(CRDM)的模型,显示开发者如何在守则审查的两个阶段进行,有10个参与者和34个守则审查。我们通过专题、统计、时间和顺序分析材料分析,从一个更深入的阶段开始,对数据分析决定进行深入的分析,并编写。

Article 101

Title@2025-07-13 (7): Complexity and Coupling: A Functional Domain Approach

Title: Complexity and Coupling: A Functional Domain Approach

Komplexität und Koppelung: Ein funktionaler Bereichsansatz

复杂性和组合:功能领域办法 2507.09599v1

Authors (1): Aydin Homay

This paper provides a precise and scientific definition of complexity and coupling, grounded in the functional domain, particularly within industrial control and automation systems (iCAS). We highlight the widespread ambiguity in defining complexity and coupling, emphasizing that many existing definitions rooted in physical attributes lead to confusion and inconsistencies. Furthermore, we re-exhibit why coupled design inherently increases complexity and how this complexity could potentially be reduced. Drawing on examples from various disciplines, such as software engineering, industrial automation, and mechanical design, we demonstrate that complexity does not necessarily correlate with system size or the number of components, and that coupling, contrary to common belief in software engineering, actually does not occur in the physical domain but in the functional domain. We conclude that effective design necessitates addressing coupling and complexity within the functional domain.

本文根据职能领域,特别是工业控制和自动化系统(ICAS),对复杂和混合作了精确和科学的定义。我们强调在界定复杂和混合方面普遍存在的模糊不清,强调许多以物理属性为基础的现有定义导致混乱和不一致。此外,我们再次探讨为什么同时设计本身会增加复杂程度,以及这种复杂程度如何可能减少。我们从软件工程、工业自动化和机械设计等不同学科的例子中可以看出,复杂程度不一定与系统大小或部件数量相关,而且与对软件工程的共同看法不同,合并实际上并非发生在物理领域,而是发生在功能领域。我们的结论是,有效的设计需要解决功能领域内的混合和复杂程度。

Article 102

Title@2025-07-13 (7): The Mythical Good Software

Title: The Mythical Good Software

Die mythische gute Software

《神道好软件》 2507.09596v1

Authors (1): Aydin Homay

The statement "good software has high cohesion and low coupling" is clumsy, obscure, and in certain cases could actually be a harmful state of being. It is clumsy because there is no perfect correlation between higher cohesiveness and optimum design. It is obscure because it conveys the message that coupling and cohesion are two distinct design principles, while they are in principle the same design approach, and only the time and space differ between them. It could also be a harmful state of being because we should not always aim for higher cohesiveness without considering its cost. In the course of this study, we aim to elucidate for the readers the meaning and underlying philosophy of the aforementioned paragraph.

良好的软件具有高度的凝聚力,而低的结合是笨拙、模糊的,在某些情况下,实际上可能是一种有害的存在状态。它笨拙,因为在更高的凝聚力和最佳设计之间没有完全的关联性。它很模糊,因为它传达的信息是,结合和凝聚是两种不同的设计原则,而原则上是相同的设计方法,只有它们之间的时间和空间不同。它也可能是一种有害的状态,因为我们不应该总是在不考虑其代价的情况下追求更高的凝聚力。在这项研究过程中,我们的目标是向读者阐明上述段落的含义和基本理念。

Article 103

Title@2025-07-13 (7): Equality Saturation for Optimizing High-Level Julia IR

Title: Equality Saturation for Optimizing High-Level Julia IR

Gleichstellungssättigung für die Optimierung von High-Level Julia IR

优化高级别Julia IR 平等饱和 2502.17075v2

Authors (3): Jules Merckx, Tim Besard, Bjorn De Sutter

Compilers are indispensable for transforming code written in high-level languages into performant machine code, but their general-purpose optimizations sometimes fall short. Domain experts might be aware of certain optimizations that the compiler is unable to apply or that are only valid in a particular domain. We have developed a system that allows domain experts to express rewrite rules to optimize code in the Julia programming language. Our system builds on e-graphs and equality saturation. It can apply optimizations in the presence of control flow and side effects. As Julia uses multiple dispatch, we allow users to constrain rewrite rules by argument types, and propagate type information through the e-graph representation. We propose an ILP formulation for optimal e-graph extraction taking into account dominance properties for code reuse and introduce CFG skeleton relaxation to rewrite calls to pure functions as well as those with side effects. Use cases demonstrate that our system can perform rewrites on high-level, domain-specific code, as well as on lower-level code such as Julia’s broadcasting mechanism. Finally, we analyze the required compilation time.

编译器对于将高语言的代码转换为性能机器代码是不可或缺的,但是它们的通用优化有时是不足的。域专家可能知道编译器无法应用或仅在特定领域有效的某些优化。我们开发了一个系统,让域专家表达重写规则以优化朱丽亚编程语言的代码。我们的系统建立在电子绘图和平等饱和的基础上。它可以在控制流和副作用的情况下应用优化。朱丽亚使用多种发送方式, 我们允许用户通过参数类型限制重写规则, 通过电子图表代表方式传播类型信息。我们建议了一种ILP 格式, 用于最佳电子制图提取, 以考虑到代码再利用的主导特性, 并引入 CFG 骨架松动, 以重写纯功能的呼声以及具有副作用的呼声。使用案例表明, 我们的系统可以在高层次、特定域代码上进行重写, 以及像 Julia 广播机制那样的低级别代码上进行重写。最后, 我们分析了所需的编译时间。

Article 104

Title@2025-07-13 (7): How to Define Design in Industrial Control and Automation Software

Title: How to Define Design in Industrial Control and Automation Software

Wie man Design in der industriellen Steuerungs- und Automatisierungssoftware definiert

如何界定工业控制和自动化软件的设计 2507.09594v1

Authors (1): Aydin Homay

Design is a fundamental aspect of engineering, enabling the creation of products, systems, and organizations to meet societal and/or business needs. However, the absence of a scientific foundation in design often results in subjective decision-making, reducing both efficiency and innovation. This challenge is particularly evident in the software industry and, by extension, in the domain of industrial control and automation systems (iCAS). In this study, we first review the existing design definitions within the software industry, challenge prevailing misconceptions about design, review the definition of design in the field of design theory, and address key questions such as: When does design begin? How can design be defined scientifically? What constitutes good design? What is the difference between design and design language? In doing so, we rely on advancements in the field of design theory. We also evaluate the distinction between ad-hoc and systematic design approaches, and present arguments on how to balance complementary operational concerns while resolving conflicting evolutionary concerns.

设计是工程的一个根本方面,使产品、系统和组织的创建能够满足社会和(或)商业需要。然而,设计缺乏科学基础往往导致主观决策,降低效率和革新。这一挑战在软件行业尤为明显,在工业控制和自动化系统(iCAS)领域更为明显。在这项研究中,我们首先审查软件行业内现有的设计定义,挑战对设计的普遍误解,审查设计理论领域的设计定义,并解决关键问题,例如:设计何时开始?如何在科学上界定设计?什么是良好的设计?以及设计和设计语言之间的差别,依靠设计理论领域的进步。我们还评估了临时设计方法和系统设计方法之间的区别,并就如何在解决相互冲突的进化问题的同时平衡互补的业务关切提出论据。

Article 105

Title@2025-07-13 (7): A Serverless Architecture for Real-Time Stock Analysis using Large Language Models: An Iterative Development and Debugging Case Study

Title: A Serverless Architecture for Real-Time Stock Analysis using Large Language Models: An Iterative Development and Debugging Case Study

Eine serverlose Architektur für Echtzeit-Speicheranalyse mit großen Sprachmodellen: Eine iterative Entwicklungs- und Debugging-Fallstudie

使用大语言模型进行实时库存分析的无服务器结构:迭代发展和调试案例研究 2507.09583v1

Authors (1): Taniv Ashraf

The advent of powerful, accessible Large Language Models (LLMs) like Google’s Gemini presents new opportunities for democratizing financial data analysis. This paper documents the design, implementation, and iterative debugging of a novel, serverless system for real-time stock analysis. The system leverages the Gemini API for qualitative assessment, automates data ingestion and processing via GitHub Actions, and presents the findings through a decoupled, static frontend. We detail the architectural evolution of the system, from initial concepts to a robust, event-driven pipeline, highlighting the practical challenges encountered during deployment. A significant portion of this paper is dedicated to a case study on the debugging process, covering common software errors, platform-specific permission issues, and rare, environment-level platform bugs. The final architecture operates at a near-zero cost, demonstrating a viable model for individuals to build sophisticated AI-powered financial tools. The operational application is publicly accessible, and the complete source code is available for review. We conclude by discussing the role of LLMs in financial analysis, the importance of robust debugging methodologies, and the emerging paradigm of human-AI collaboration in software development.

Google的Gemini等强大、可获得的大语言模型(LLMS)的出现为金融数据分析的民主化提供了新的机会。本文记录了设计、实施和迭代调试用于实时库存分析的无服务器的新型系统。该系统利用Gemini API进行定性评估,通过GitHub Action将数据吸收和处理自动化,并通过一个分解、静态的前端介绍研究结果。我们详细介绍了该系统的建筑演变,从初始概念到一个稳健、事件驱动的管道,突出了部署期间遇到的实际挑战。本文的一大部分专门论述关于调试过程的案例研究,包括通用软件错误、特定平台许可问题和罕见的环境级平台错误。最后结构以近零成本运作,展示了个人建立复杂的AI驱动金融工具的可行模式。操作应用程序可供公众使用,完整的源代码可供审查。我们最后通过讨论LMS在金融分析中的作用、稳健调方法的重要性以及软件开发中正在形成的人类-AI合作模式来总结。

Article 106

Title@2025-07-13 (7): The Debugging Decay Index: Rethinking Debugging Strategies for Code LLMs

Title: The Debugging Decay Index: Rethinking Debugging Strategies for Code LLMs

Der Debugging Decay Index: Debugging Strategien für Code LLMs neu denken

调试衰减指数:重新思考守则LMS的调试战略 2506.18403v2

Authors (2): Muntasir Adnan, Carlos C. N. Kuhn

The effectiveness of AI debugging follows a predictable exponential decay pattern; most models lose 60-80% of their debugging capability within just 2-3 attempts, despite iterative debugging being a critical capability for practical code generation systems. We introduce the Debugging Decay Index (DDI), a mathematical framework that quantifies when debugging becomes ineffective and predicts intervention points. Our strategic fresh start approach shifts from exploitation to exploration at strategic points in the debugging process, demonstrating that well-timed interventions can rescue the effectiveness of debugging. DDI reveals a fundamental limitation in current AI debugging and provides the first quantitative framework for optimising iterative code generation strategies.

AI 调试的有效性遵循一种可预测的指数衰变模式;尽管迭代调试是实用代码生成系统的关键能力,但大多数模型仅在2-3次尝试中丧失了60-80%的调试能力。我们引入了调试衰减指数(DDI),这是一个数学框架,当调试失效时可以量化,并预测干预点。我们的新战略启动方法从开发转向在调试过程中的战略点进行探索,表明及时的干预措施可以挽救调试的有效性。 DDI揭示了当前AI调试中的基本限制,并为优化迭代代码生成战略提供了第一个量化框架。
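
The exponential decay pattern mentioned in the abstract above can be illustrated with a small sketch. It assumes the simple model E(t) = E0 * exp(-lam * t), fits lam by least squares on the logarithm of the observed effectiveness, and derives the attempt at which only a chosen fraction of the initial effectiveness remains. The exact DDI formulation in the paper may differ; the function names and the observed values are hypothetical.

```python
import math
from typing import List


def fit_decay_rate(effectiveness: List[float]) -> float:
    """Least-squares fit of ln E(t) = ln E0 - lam * t for attempts t = 0, 1, 2, ..."""
    ts = list(range(len(effectiveness)))
    ys = [math.log(e) for e in effectiveness]
    t_mean = sum(ts) / len(ts)
    y_mean = sum(ys) / len(ys)
    covariance = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    variance = sum((t - t_mean) ** 2 for t in ts)
    return -covariance / variance  # lam is the negated slope


def attempts_until(lam: float, remaining_fraction: float) -> float:
    """Attempt count at which E(t) / E0 falls to remaining_fraction."""
    return -math.log(remaining_fraction) / lam


observed = [0.40, 0.20, 0.10, 0.05]  # hypothetical per-attempt fix rates
lam = fit_decay_rate(observed)
cutoff = attempts_until(lam, 0.25)
print(lam, cutoff)
```

With effectiveness halving on every attempt, the fitted rate is ln 2 and only a quarter of the initial effectiveness remains after two attempts, which is the kind of point at which a fresh start might be considered.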

Article 107

Title@2025-07-13 (7): It Only Gets Worse: Revisiting DL-Based Vulnerability Detectors from a Practical Perspective

Title: It Only Gets Worse: Revisiting DL-Based Vulnerability Detectors from a Practical Perspective

Es wird nur schlimmer: DL-basierte Sicherheitsdetektoren aus praktischer Sicht neu zu betrachten

更糟糕的是:从实际角度重新审视基于DL的脆弱性检测器 2507.09529v1

Authors (6): Yunqian Wang, Xiaohong Li, Yao Zhang, Yuekang Li, Zhiping Zhou, Ruitao Feng

With the growing threat of software vulnerabilities, deep learning (DL)-based detectors have gained popularity for vulnerability detection. However, doubts remain regarding their consistency within declared CWE ranges, real-world effectiveness, and applicability across scenarios. These issues may lead to unreliable detection, high false positives/negatives, and poor adaptability to emerging vulnerabilities. A comprehensive analysis is needed to uncover critical factors affecting detection and guide improvements in model design and deployment. In this paper, we present VulTegra, a novel evaluation framework that conducts a multidimensional comparison of scratch-trained and pre-trained-based DL models for vulnerability detection. VulTegra reveals that state-of-the-art (SOTA) detectors still suffer from low consistency, limited real-world capabilities, and scalability challenges. Contrary to common belief, pre-trained models are not consistently better than scratch-trained models but exhibit distinct strengths in specific contexts. Importantly, our study exposes the limitations of relying solely on CWE-based classification and identifies key factors that significantly affect model performance. Experimental results show that adjusting just one such factor consistently improves recall across all seven evaluated detectors, with six also achieving better F1 scores. Our findings provide deeper insights into model behavior and emphasize the need to consider both vulnerability types and inherent code features for effective detection.

由于软件脆弱性的威胁日益增大,深入学习(DL)的探测器越来越受到人们的欢迎,以便发现脆弱性。然而,对于这些探测器在所宣布的CWE范围内的一致性、实际世界有效性和各种情景的可适用性,仍然存有疑问。这些问题可能导致检测不可靠、高假正反反反和对新出现的脆弱性适应性差。需要进行全面分析,以发现影响检测的关键因素并指导模型设计和部署的改进。在本文件中,我们介绍了VulTegra,这是一个新的评价框架,对经破碎训练的和经过预先训练的识别脆弱性的DL模型进行多层面比较。VulTegra显示,最先进的(SOTA)探测器仍然缺乏一致性、实际世界能力有限和可扩缩性挑战。与通常的信念相反,预先培训的模型并非始终比经破碎训练的模型好,而是在具体情况下表现出明显的优势。我们的研究暴露了仅仅依赖CWE的分类的局限性,并确定了对模型性表现有重大影响的关键因素。实验结果表明,仅仅调整了其中的一个这种模型,在所有7种经过评估的探测器中不断改进一个模型,同时回顾所有一种经过评估的探测器,并且也考虑到我们有6种深层次的探测标准。

Article 108

Title@2025-07-13 (7): Towards LLM-Based Automatic Playtest

Title: Towards LLM-Based Automatic Playtest

Zum LLM-basierten automatischen Playtest

面向基于 LLM 的自动游戏测试 2507.09490v1

Authors (2): Yan Zhao, Chiwei Tang

Playtesting is the process in which people play a video game for testing. It is critical for the quality assurance of gaming software. Manual playtesting is time-consuming and expensive. However, automating this process is challenging, as playtesting typically requires domain knowledge and problem-solving skills that most conventional testing tools lack. Recent advancements in artificial intelligence (AI) have opened up new possibilities for applying Large Language Models (LLMs) to playtesting. However, significant challenges remain: current LLMs cannot visually perceive game environments, and most existing research focuses on text-based games or games with robust APIs. Many non-text games lack APIs to provide textual descriptions of game states, making it almost impossible to naively apply LLMs for playtesting. This paper introduces Lap, our novel approach to LLM-based Automatic Playtesting, which uses ChatGPT to test match-3 games, a category of games where players match three or more identical tiles in a row or column to earn points. Lap encompasses three key phases: processing of game environments, prompting-based action generation, and action execution. Given a match-3 game, Lap takes a snapshot of the game board and converts it to a numeric matrix. It then prompts the ChatGPT-O1-mini API to suggest moves based on that matrix and tentatively applies the suggested moves to earn points and trigger changes in the game board. It repeats the above-mentioned three steps iteratively until timeout. For evaluation, we conducted a case study using Lap on an open-source match-3 game, CasseBonbons, and empirically compared it with three existing tools. Our results are promising: Lap outperformed existing tools by achieving higher code coverage and triggering more program crashes. This research sheds light on the future of automatic testing and LLM applications.

游戏测试是人们玩游戏测试游戏的过程。它对于游戏软件的质量保证至关重要。手动游戏测试耗时且昂贵。但是, 自动测试具有挑战性, 因为游戏测试通常需要大多数常规测试工具所缺乏的域知识和解决问题技能。人工智能( AI) 最近的进步为应用大语言模型( LLMS) 进行游戏测试开辟了新的可能性。然而, 依然存在着重大挑战: 当前 LLMS 无法看到游戏环境, 而大多数现有研究侧重于基于文本的游戏或具有强力 API 的游戏。许多非文本游戏缺乏提供游戏状态的文本性描述, 使得几乎不可能天真地应用 LLMS 来进行游戏测试。本文介绍了我们基于 LLMM 的自动游戏测试新办法。使用 ChatGPT 来测试匹配三局游戏( LLMMS) 的游戏类别, 玩家在一行或一列中匹配三或更相同的小调来得分数的游戏。 LAPS 包括三个关键阶段: 处理游戏环境环境, 加速进行基于未来的动作生成, 以及操作执行。匹配3 游戏的游戏的游戏程序在游戏游戏上运行前三局前, 拉普的游戏中, 将快速的游戏游戏的游戏运行运行运行运行到一个游戏游戏游戏的游戏的游戏的游戏的动作, 。将预演算算到一个快速的游戏的游戏头。
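The abstract above describes converting a board snapshot into a numeric matrix before prompting the model for moves. A minimal sketch of that representation follows, assuming tiles are encoded as small integers; the helper names (`has_match`, `valid_moves`, `board_to_prompt`) and the prompt wording are hypothetical illustrations, not taken from the paper.

```python
from typing import List, Tuple

Board = List[List[int]]
Move = Tuple[Tuple[int, int], Tuple[int, int]]


def has_match(board: Board) -> bool:
    """Return True if any row or column holds three identical tiles in a line."""
    rows, cols = len(board), len(board[0])
    for r in range(rows):
        for c in range(cols):
            if c + 2 < cols and board[r][c] == board[r][c + 1] == board[r][c + 2]:
                return True
            if r + 2 < rows and board[r][c] == board[r + 1][c] == board[r + 2][c]:
                return True
    return False


def valid_moves(board: Board) -> List[Move]:
    """List adjacent swaps (right or down neighbour) that create a match."""
    rows, cols = len(board), len(board[0])
    moves: List[Move] = []
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0)):
                r2, c2 = r + dr, c + dc
                if r2 >= rows or c2 >= cols:
                    continue
                # Swap, test for a match, then swap back so the board is unchanged.
                board[r][c], board[r2][c2] = board[r2][c2], board[r][c]
                if has_match(board):
                    moves.append(((r, c), (r2, c2)))
                board[r][c], board[r2][c2] = board[r2][c2], board[r][c]
    return moves


def board_to_prompt(board: Board) -> str:
    """Render the matrix as text for a prompt (wording is an assumption)."""
    grid = "\n".join(" ".join(str(tile) for tile in row) for row in board)
    return "Match-3 board (rows top to bottom):\n" + grid + "\nSuggest one swap."


board = [
    [1, 2, 3],
    [2, 1, 1],
    [1, 3, 2],
]
print(board_to_prompt(board))
print(valid_moves(board))
```

A local check such as `valid_moves` could be used to filter suggested moves before applying them; whether Lap does anything similar is not stated in the abstract.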

Article 109

Title@2025-07-13 (7): Evaluating LLMs on Sequential API Call Through Automated Test Generation

Title: Evaluating LLMs on Sequential API Call Through Automated Test Generation

Bewertung von LLMs auf sequentieller API-Aufruf durch automatisierte Testgenerierung

通过自动测试生成的序列API呼叫评估LLMs 2507.09481v1

Authors (5): Yuheng Huang, Da Song, Zhenlan Ji, Shuai Wang, Lei Ma

By integrating tools from external APIs, Large Language Models (LLMs) have expanded their promising capabilities in a diverse spectrum of complex real-world tasks. However, testing, evaluation, and analysis of LLM tool use remain in their early stages. Most existing benchmarks rely on manually collected test cases, many of which cannot be automatically checked for semantic correctness and instead depend on static methods such as string matching. Additionally, these benchmarks often overlook the complex interactions that occur between sequential API calls, which are common in real-world applications. To fill the gap, in this paper, we introduce StateGen, an automated framework designed to generate diverse coding tasks involving sequential API interactions. StateGen combines state-machine-based API constraint solving and validation, energy-based sampling, and control-flow injection to generate executable programs. These programs are then translated into human-like natural language task descriptions through a collaboration of two LLM agents. Utilizing StateGen, we construct StateEval, a benchmark encompassing 120 verified test cases spanning three representative scenarios: Session Service, Tensor Operation, and ElevenLabs MCP. Experimental results confirm that StateGen can effectively generate challenging and realistic API-oriented tasks, highlighting areas for improvement in current LLMs incorporating APIs.

大型语言模型(LLMS)通过整合外部API工具,扩大了其在复杂现实世界任务中各种复杂任务的有希望的能力,然而,对LLM工具的测试、评价和分析仍然处于早期阶段,大多数现有基准依靠人工收集的测试案例,其中许多案例无法自动检查语义正确性,而是依赖诸如弦匹配等静态方法。此外,这些基准往往忽略了在现实世界应用中常见的顺序API调用之间出现的复杂互动。为了填补空白,我们在本文件中引入了国家Gen,这是一个自动框架,旨在产生涉及API相继互动的多种编码任务。州Gen将基于国家机器的API限制的解决和验证、基于能源的取样和控制流量注入结合起来,以产生可执行的方案。这些方案随后通过两个LLMM代理方的合作被转化为像人类一样的自然语言任务描述。我们构建了国家Eval,这是一个包含120个经核实的测试案例的基准,涵盖以下三个有代表性的情景:会议服务、Tensor 操作和11Lab MCPS
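The abstract above mentions state-machine-based validation of sequential API calls. The idea can be sketched as a transition table that accepts or rejects a call trace. The states and call names below (`login`, `read`, `write`, `logout`) are hypothetical stand-ins for a session-style API and are not taken from StateGen.

```python
from typing import Dict, List, Tuple

# Hypothetical session API: (current state, call) -> next state.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("logged_out", "login"): "logged_in",
    ("logged_in", "read"): "logged_in",
    ("logged_in", "write"): "logged_in",
    ("logged_in", "logout"): "logged_out",
}


def validate_trace(calls: List[str], start: str = "logged_out") -> Tuple[bool, int]:
    """Walk the trace through the state machine.

    Returns (True, len(calls)) if every call is allowed in sequence,
    otherwise (False, index_of_first_illegal_call).
    """
    state = start
    for index, call in enumerate(calls):
        next_state = TRANSITIONS.get((state, call))
        if next_state is None:
            return False, index
        state = next_state
    return True, len(calls)


print(validate_trace(["login", "write", "logout"]))  # (True, 3)
print(validate_trace(["login", "logout", "read"]))   # (False, 2)
```

The second trace fails at index 2 because `read` is not permitted once the session has been closed; this kind of order dependence is what string matching on individual calls would miss.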

Article 110

Title@2025-07-12 (6): Enhancing NeuroEvolution-Based Game Testing: A Branch Coverage Approach for Scratch Programs

Title: Enhancing NeuroEvolution-Based Game Testing: A Branch Coverage Approach for Scratch Programs

Verbesserung der NeuroEvolution-basierten Game-Tests: Ein Ansatz zur Zweigabdeckung für Scratch-Programme

强化基于进进神经革命的游戏测试:Scratch方案分支覆盖方法 2507.09414v1

Authors (3): Khizra Sohail, Atif Aftab Ahmed Jilani, Nigar Azhar Butt

Automated test generation for game-like programs presents unique challenges due to their non-deterministic behavior and complex control structures. The NEATEST framework has been used for automated testing in Scratch games, employing neuroevolution-based test generation optimized for statement coverage. However, statement coverage alone is often insufficient for fault detection, as it does not guarantee execution of all logical branches. This paper introduces a branch coverage-based fitness function to enhance test effectiveness in automated game testing. We extend NEATEST by integrating a branch fitness function that prioritizes control-dependent branches, guiding the neuroevolution process to maximize branch exploration. To evaluate the effectiveness of this approach, empirical experiments were conducted on 25 Scratch games, comparing Neatest with Statement Coverage (NSC) against Neatest with Branch Coverage (NBC). A mutation analysis was also performed to assess the fault detection capabilities of both techniques. The results demonstrate that NBC achieves higher branch coverage than NSC in 13 out of 25 games, particularly in programs with complex conditional structures. Moreover, NBC achieves a lower false positive rate in mutation testing, making it a more reliable approach for identifying faulty behavior in game programs. These findings confirm that branch coverage-based test generation improves test coverage and fault detection in Scratch programs.

游戏类程序的自动测试生成因其非决定性行为和复杂的控制结构而面临独特的挑战。 NEATEST 框架被用于在Scratch游戏中进行自动测试,使用神经革命型测试生成来优化语句覆盖。然而,单是语句覆盖面往往不足以检测错误,因为它不能保证执行所有逻辑分支。本文引入了一个基于分支的健身功能,以提高自动游戏测试的测试效力。我们通过整合一个部门健身功能来扩展NEATEST,该功能优先考虑控制依赖的分支,指导神经革命进程以最大限度地扩大分支探索。为了评估这一方法的有效性,在25个Scraatch游戏中进行了实验,将Natest与声明覆盖范围(NSC)和Natest和分支覆盖范围(NBC)进行比较。还进行了突变分析,以评估这两种技术的检测能力。结果显示,NBC在25个游戏中比NSC在13个游戏中实现了更高的分支覆盖率,特别是在有复杂条件结构的方案中。此外,NBC在突变测试中实现了较低的反正率率,从而更可靠地确认了Scatch 测试范围。
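The abstract above notes that statement coverage does not guarantee execution of all logical branches. A minimal sketch of that gap follows, using a hand-instrumented function; the function `clamp_speed`, the probe labels, and the explicit `else` added for instrumentation are illustrative assumptions and unrelated to NEATEST's actual implementation.

```python
from typing import Iterable, Set, Tuple

STATEMENTS = {"s1", "s2", "s3"}   # the `if`, the assignment, the `return`
BRANCHES = {"b_true", "b_false"}  # both outcomes of the condition


def clamp_speed(speed: int, hits: Set[str]) -> int:
    """Cap speed at 10, recording which statements and branches were executed."""
    hits.add("s1")
    if speed > 10:
        hits.add("b_true")
        hits.add("s2")
        speed = 10
    else:
        # The original code has no else-branch; this probe records the skipped path.
        hits.add("b_false")
    hits.add("s3")
    return speed


def coverage(inputs: Iterable[int]) -> Tuple[float, float]:
    """Return (statement coverage, branch coverage) for a set of test inputs."""
    hits: Set[str] = set()
    for value in inputs:
        clamp_speed(value, hits)
    return (len(hits & STATEMENTS) / len(STATEMENTS),
            len(hits & BRANCHES) / len(BRANCHES))


print(coverage([15]))     # (1.0, 0.5): every statement runs, one branch is missed
print(coverage([15, 3]))  # (1.0, 1.0)
```

A single input of 15 reaches every statement yet never exercises the false outcome of the condition, which is why a branch-based fitness signal can steer test generation toward inputs that a statement-based signal treats as redundant.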

Article 111

Title@2025-07-12 (6): LLM-Powered Quantum Code Transpilation

Title: LLM-Powered Quantum Code Transpilation

LLM-Powered Quantum Code Transpilation

LLM 功率量代码转换 2507.12480v1

Authors (2): Nazanin Siavash, Armin Moin

There exist various Software Development Kits (SDKs) tailored to different quantum computing platforms. These are known as Quantum SDKs (QSDKs). Examples include but are not limited to Qiskit, Cirq, and PennyLane. However, this diversity presents significant challenges for interoperability and cross-platform development of hybrid quantum-classical software systems. Traditional rule-based transpilers for translating code between QSDKs are time-consuming to design and maintain, requiring deep expertise and rigid mappings in the source and destination code. In this study, we explore the use of Large Language Models (LLMs) as a flexible and automated solution. Leveraging their pretrained knowledge and contextual reasoning capabilities, we position LLMs as programming language-agnostic transpilers capable of converting quantum programs from one QSDK to another while preserving functional equivalence. Our approach eliminates the need for manually defined transformation rules and offers a scalable solution to quantum software portability. This work represents a step toward enabling intelligent, general-purpose transpilation in the quantum computing ecosystem.

现有适合不同量子计算平台的各种软件开发工具包(SDK),称为量子计算平台,称为Qantum SDK(QSDK),例子包括但不限于Qiskit、Cirq和PennyLane。然而,这种多样性对混合量子古典软件系统的互操作性和跨平台开发提出了重大挑战。基于规则的传统快速传输器用于翻译QSDK之间代码的设计和维护耗时,需要源代码和目的地代码方面的深入专业知识和僵硬绘图。在本研究中,我们探索使用大语言模型(LLLMs)作为灵活和自动的解决方案。我们利用这些大语言模型(LLMs)的预先培训知识和背景推理能力,将LLMs定位为能够将量子程序从一个QSDK转换到另一个,同时保持功能等同的语文-ANSTIPers编程。我们的方法消除了人工定义的转换规则的需要,并为量子软件可移植性提供了可扩展的解决方案。这项工作是朝着在量子计算生态系统中促进智能、通用的转换迈出的一步。

Article 112

Title@2025-07-12 (6): Enhancing Interpretability in Software Change Management with Chain-of-Thought Reasoning

Title: Enhancing Interpretability in Software Change Management with Chain-of-Thought Reasoning

Verbesserung der Interpretierbarkeit im Software Change Management durch Chain-of-Thought-Reasoning

提高软件变革管理与 “ 探索链解释理由 “ 的可解释性 2507.09315v1

Authors (9): Yongqian Sun, Weihua Kuang, Chao Shen, Xidao Wen, Tinghua Zheng, Heng Liu, Shenglin Zhang, Bo Wu, Dan Pei

In modern online services, frequent software changes introduce significant risks. To tackle this challenge, we propose SCELM (Software Change Evaluation and Lifecycle Management), an end-to-end automated framework for software change management. SCELM aims to manage software changes efficiently and precisely, significantly reducing service failures and economic losses.

在现代在线服务中,频繁的软件变化带来了巨大的风险。为了应对这一挑战,我们提议建立软件变化评估和生命周期管理(SCELM ) , 软件变化管理端对端自动框架。 SCELM 旨在高效、准确地管理软件变化,大大减少服务故障和经济损失。

Article 113

Title@2025-07-12 (6): Explainability as a Compliance Requirement: What Regulated Industries Need from AI Tools for Design Artifact Generation

Title: Explainability as a Compliance Requirement: What Regulated Industries Need from AI Tools for Design Artifact Generation

Erklärbarkeit als Compliance-Voraussetzung: Was regulierte Industrien von KI-Werkzeugen für die Design-Artefakt-Generierung benötigen

作为遵约要求的解释性:AI 设计人工制造工具中监管工业需要什么 2507.09220v1

Authors (4): Syed Tauhid Ullah Shah, Mohammad Hussein, Ann Barcomb, Mohammad Moshirpour

Artificial Intelligence (AI) tools for automating design artifact generation are increasingly used in Requirements Engineering (RE) to transform textual requirements into structured diagrams and models. While these AI tools, particularly those based on Natural Language Processing (NLP), promise to improve efficiency, their adoption remains limited in regulated industries where transparency and traceability are essential. In this paper, we investigate the explainability gap in AI-driven design artifact generation through semi-structured interviews with ten practitioners from safety-critical industries. We examine how current AI-based tools are integrated into workflows and the challenges arising from their lack of explainability. We also explore mitigation strategies, their impact on project outcomes, and features needed to improve usability. Our findings reveal that non-explainable AI outputs necessitate extensive manual validation, reduce stakeholder trust, struggle to handle domain-specific terminology, disrupt team collaboration, and introduce regulatory compliance risks, often negating the anticipated efficiency benefits. To address these issues, we identify key improvements, including source tracing, providing clear justifications for tool-generated decisions, supporting domain-specific adaptation, and enabling compliance validation. This study outlines a practical roadmap for improving the transparency, reliability, and applicability of AI tools in requirements engineering workflows, particularly in regulated and safety-critical environments where explainability is crucial for adoption and certification.

设计工艺品生产自动化的人工智能(AI)工具越来越多地用于要求工程(RE),将文字要求转换成结构化的图表和模型。这些人工智能工具,特别是基于自然语言处理(NLP)的工具,有望提高效率,但在透明度和可追溯性至关重要的受监管行业中,采用这些工具仍然有限。在本文件中,我们通过与来自安全关键行业的10名从业人员的半结构性访谈,调查AI驱动设计工艺品生产的解释性差距。我们研究了目前基于AI的工具如何融入工作流程,以及这些工具缺乏解释性所带来的挑战。我们还探讨了缓解战略、其对项目结果的影响以及提高可用性所需的特征。我们的调查结果显示,非解释性的AI产出需要广泛的手工验证、减少利益攸关方的信任、努力处理特定领域的术语、扰乱团队协作和引入监管性合规风险,往往抵消预期的效率效益。为解决这些问题,我们确定了关键改进措施,包括来源追踪、为工具生成的决定提供明确的理由、支持特定领域的适应性调整以及能够验证遵守情况。本研究报告概述了改进透明性、可靠性、可靠性、监管性、监管性环境方面安全性、以及采用AI工具的可靠性要求的实用路线图。

Article 114

Title@2025-07-12 (6): Back to the Basics: Rethinking Issue-Commit Linking with LLM-Assisted Retrieval

Title: Back to the Basics: Rethinking Issue-Commit Linking with LLM-Assisted Retrieval

Zurück zu den Grundlagen: Rethinking Issue-Commit Linking with LLM-Assisted Retrieval

返回到 Basics: 重新思考与LLM 辅助检索连接的问题 2507.09199v1

Authors (11): Huihui Huang, Ratnadira Widyasari, Ting Zhang, Ivana Clairine Irsan, Jieke Shi, Han Wei Ang, Frank Liauw, Eng Lieh Ouh, Lwin Khin Shar, Hong Jin Kang, David Lo

Issue-commit linking, which connects issues with commits that fix them, is crucial for software maintenance. Existing approaches have shown promise in automatically recovering these links. Evaluations of these techniques assess their ability to identify genuine links from plausible but false links. However, these evaluations overlook the fact that, in reality, when a repository has more commits, the presence of more plausible yet unrelated commits may interfere with the tool in differentiating the correct fix commits. To address this, we propose the Realistic Distribution Setting (RDS) and use it to construct a more realistic evaluation dataset that includes 20 open-source projects. By evaluating tools on this dataset, we observe that the performance of the state-of-the-art deep learning-based approach drops by more than half, while the traditional Information Retrieval method, VSM, outperforms it. Inspired by these observations, we propose EasyLink, which utilizes a vector database as a modern Information Retrieval technique. To address the long-standing problem of the semantic gap between issues and commits, EasyLink leverages a large language model to rerank the commits retrieved from the database. Under our evaluation, EasyLink achieves an average Precision@1 of 75.91%, improving over the state-of-the-art by over four times. Additionally, this paper provides practical guidelines for advancing research in issue-commit link recovery.

将问题和问题联系起来,将问题与解决问题联系起来,对于软件维护至关重要。现有办法显示自动恢复这些联系的前景。这些办法的评估评估了它们从可信但虚假的联系中查明真正联系的能力。然而,这些评价忽略了一个事实,即事实上,如果存储库作出更多承诺,存在更合理但又不相干的关系可能会干扰区分正确确定承诺的工具。为了解决这个问题,我们提议现实分配设置(RDS),并用它来构建一个更现实的评估数据集,其中包括20个开放源码项目。通过评价这一数据集的工具,我们观察到最先进的深层次学习方法的业绩下降一半以上,而传统的信息检索方法VSM(VSM)则超过它。根据这些观察,我们建议使用一个矢量数据库作为现代信息检索技术。为了解决长期存在的问题和承诺之间的语义差距问题,EasyLink利用一个大语言模型来重新定位从数据库中检索到的承诺。在实际链接中,通过我们的平均链接1 提供了一个“易连结”时间的检索,通过我们的平均链接提供一个更新文件的恢复时间。
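The abstract above compares approaches by Precision@1 and mentions VSM, a classical vector space retrieval baseline. A minimal sketch of both follows, assuming whitespace tokenization and a smoothed TF-IDF weighting; the toy issues and commit messages are invented, and this is neither EasyLink's pipeline nor the exact VSM configuration used in the paper.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def tfidf_vectors(docs: List[str]) -> Tuple[List[Dict[str, float]], Dict[str, float]]:
    """Build TF-IDF vectors for the documents and return them with the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    doc_freq = Counter(t for toks in tokenized for t in set(toks))
    n = len(docs)
    idf = {t: math.log((1 + n) / (1 + df)) + 1.0 for t, df in doc_freq.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(toks).items()} for toks in tokenized]
    return vectors, idf


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_commits(issue: str, commits: List[str]) -> List[int]:
    """Return commit indices sorted from most to least similar to the issue."""
    vectors, idf = tfidf_vectors(commits)
    query = {t: tf * idf.get(t, 0.0) for t, tf in Counter(tokenize(issue)).items()}
    scores = [cosine(query, v) for v in vectors]
    return sorted(range(len(commits)), key=lambda i: scores[i], reverse=True)


def precision_at_1(queries: List[Tuple[str, List[str], int]]) -> float:
    """Fraction of issues whose top-ranked commit is the true fix."""
    hits = sum(1 for issue, commits, gold in queries
               if rank_commits(issue, commits)[0] == gold)
    return hits / len(queries)


commits = [
    "fix null pointer in login handler",
    "update readme badges",
    "refactor payment module",
]
queries = [
    ("login crashes with null pointer", commits, 0),
    ("payment module throws error", commits, 2),
]
print(precision_at_1(queries))  # 1.0
```

As the abstract points out, adding more plausible but unrelated commits to the candidate pool makes the top-1 decision harder, so Precision@1 on a small candidate list tends to overstate real-world performance.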

Article 115

Title@2025-07-12 (6): OpenCAMS: An Open-Source Connected and Automated Mobility Co-Simulation Platform for Advanced Transportation Research

Title: OpenCAMS: An Open-Source Connected and Automated Mobility Co-Simulation Platform for Advanced Transportation Research

OpenCAMS: Eine Open-Source vernetzte und automatisierte Mobilitäts-Co-Simulationsplattform für fortgeschrittene Verkehrsforschung

开放源码连接和自动化流动联合模拟平台,用于高级运输研究 2507.09186v1

Authors (4): Minhaj Uddin Ahmad, Akid Abrar, Sagar Dasgupta, Mizanur Rahman

We introduce OpenCAMS (Open-Source Connected and Automated Mobility Co-Simulation Platform), an open-source, synchronized, and extensible co-simulation framework that tightly couples three best-in-class simulation tools: (i) SUMO, (ii) CARLA, and (iii) OMNeT++. OpenCAMS is designed to support advanced research in transportation safety, mobility, and cybersecurity by combining the strengths of each simulation domain. Specifically, SUMO provides large-scale, microscopic traffic modeling; CARLA offers high-fidelity 3D perception, vehicle dynamics, and control simulation; and OMNeT++ enables modular, event-driven network communication, such as cellular vehicle-to-everything (C-V2X). OpenCAMS employs a time-synchronized, bidirectional coupling architecture that ensures coherent simulation progression across traffic, perception, and communication domains while preserving modularity and reproducibility. For example, CARLA can simulate and render a subset of vehicles that require detailed sensor emulation and control logic; SUMO orchestrates network-wide traffic flow, vehicle routing, and traffic signal management; and OMNeT++ dynamically maps communication nodes to both mobile entities (e.g., vehicles) and static entities (e.g., roadside units) to enable C-V2X communication. While these three simulators form the foundational core of OpenCAMS, the platform is designed to be expandable and future-proof, allowing additional simulators to be integrated on top of this core without requiring fundamental changes to the system architecture. The OpenCAMS platform is fully open-source and publicly available through its GitHub repository https://github.com/minhaj6/carla-sumo-omnetpp-cosim, providing the research community with an accessible, flexible, and collaborative environment for advancing next-generation intelligent transportation systems.

我们引入了OpenCAMS(开放源码连接和自动化流动共同模拟平台),这是一个开放源码、同步和可扩展的共同模拟框架,紧紧结合三种最高级模拟工具:(一) SUMO,(二) CARLA,和(三) OMNET+。OmNET+。OpenCAMS的目的是通过将每个模拟域的优势结合起来,支持运输安全、移动和网络安全方面的先进研究。具体地说,SUMO提供大型、显微可读交通模型;CARLA提供高纤维3D感知、车辆动态和控制模拟及控制模拟;OMNET++为模块化、事件驱动网络通信通信提供模块,如手机到百年一月(C-V2X) 。OpenCAMS使用时间同步、双向双向双向组合的组合组合组合结构结构,确保整个交通、感知知知和通信领域的模拟进程,同时保持模块和可读性。例如,CARLA可以模拟和提供一组需要详细传感器的车辆未来模拟和控逻辑的车辆;SUDS-CA-SlMSUMS-LMLMS-S-S-LMLMULM-S-S-S-S-mode-com-roma-comma-comma-comm-comm-comm-comm-comma-commex-comma-comma-comma-comma-comma-commus-comma-comma-commex-commex-commex-commex-s-s-commusmex-commex-s-s-commex-s-s-s-s-s-s-s-s-s-s-s-comm-s-s-s-s-s-s-s-s-s-s-s-s-s-l-s-s-s-s-s-s-commal-s-comm-s-s-sma-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-s-

Article 116

Title@2025-07-12 (6): Position Paper: Programming Language Techniques for Bridging LLM Code Generation Semantic Gaps

Title: Position Paper: Programming Language Techniques for Bridging LLM Code Generation Semantic Gaps

Positionspapier: Programmiersprachentechniken zur Bridging LLM Code Generation Semantische Lücken

立场文件:缩小LLM码生成语义差距的编程语言技术 2507.09135v1

Authors (3): Yalong Du, Chaozheng Wang, Huaijin Wang

Large Language Models have demonstrated remarkable capabilities in automated code generation, yet their statistical nature and black-box characteristics create significant semantic gaps manifested through syntax errors, semantic hallucinations, and reliability concerns. This position paper argues that principled integration of Programming Language (PL) techniques is essential for bridging these gaps. Through structured program representations, formal correctness guarantees, and robust verification mechanisms, PL techniques can elevate LLM-generated code from statistical pattern matching to truly reliable and trustworthy levels. This integration is crucial for developing systems that generate code that is not only functionally correct but also interpretable, verifiable, and ultimately trustworthy.

大型语言模型在自动代码生成方面表现出了非凡的能力,但其统计性质和黑盒特征造成了严重的语义差距,表现为语法错误、语义幻觉和可靠性问题。本立场文件认为,有原则地整合编程语言(PL)技术对于弥合这些差距至关重要。通过结构化的方案表述、正式的正确性保障和强有力的核查机制,PLT技术可以将LLM生成的代码从统计模式上提升到真正可靠和可信赖的水平。这种整合对于开发不仅在功能上正确,而且可解释、可核查并最终可信赖的代码系统至关重要。

Article 117

Title@2025-07-12 (6): SPICE: An Automated SWE-Bench Labeling Pipeline for Issue Clarity, Test Coverage, and Effort Estimation

Title: SPICE: An Automated SWE-Bench Labeling Pipeline for Issue Clarity, Test Coverage, and Effort Estimation

SPICE: Eine automatisierte SWE-Bench-Etikettierungspipeline für Ausgabeklarheit, Testabdeckung und Aufwandsabschätzung

SPICE: 用于议题清晰度、测试覆盖率和努力估算的SWE-Bennch自动标签管道 2507.09108v1

Authors (10): Aaditya Bhatia, Gustavo A. Oliva, Gopi Krishnan Rajbahadur, Haoxiang Zhang, Yihao Chen, Zhilong Chen, Arthur Leung, Dayi Lin, Boyuan Chen, Ahmed E. Hassan

High-quality labeled datasets are crucial for training and evaluating foundation models in software engineering, but creating them is often prohibitively expensive and labor-intensive. We introduce SPICE, a scalable, automated pipeline for labeling SWE-bench-style datasets with annotations for issue clarity, test coverage, and effort estimation. SPICE combines context-aware code navigation, rationale-driven prompting, and multi-pass consensus to produce labels that closely approximate expert annotations. SPICE’s design was informed by our own experience and frustration in labeling more than 800 instances from SWE-Gym. SPICE achieves strong agreement with human-labeled SWE-bench Verified data while reducing the cost of labeling 1,000 instances from around $100,000 (manual annotation) to just $5.10. These results demonstrate SPICE’s potential to enable cost-effective, large-scale dataset creation for SE-focused FMs. To support the community, we release both the SPICE tool and SPICE Bench, a new dataset of 6,802 SPICE-labeled instances curated from 291 open-source projects in SWE-Gym (over 13x larger than SWE-bench Verified).

高品质的标签数据集对于软件工程基础模型的培训和评价至关重要,但创建这些数据集往往费用高得令人望而却步,而且耗费大量人力。我们引入了SPICE,这是一个可扩缩的自动化管道,用于标为SWE-Bench型数据集,并配有说明,以澄清问题、测试覆盖范围和工作估计。SPICE结合了符合背景的代码导航、根据理由推动的提示和多种通用的共识,以制作与专家说明相近的标签。SPICE的设计参考了我们自己在标出SWE-Gym800多个实例方面的经验和挫折感。SPICE与人类标为SWE-Bench型的SWE-Bench 验证数据达成了强烈的一致,同时将标出1,000个实例的费用从大约100 000美元(人工注)降低到仅仅5.10美元。这些结果表明SPICE具有为SE重点调频调频提供具有成本效益的大规模数据集的潜力。为了支持社区,我们发布了SPICEICE工具和SPICE Tenge,这是一个新的数据集,由6 802个超过SWE-GYME的开放源项目中291 13级的6个重)。
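The abstract above mentions multi-pass consensus and quotes labeling costs per 1,000 instances. A minimal sketch follows, assuming consensus means a strict majority vote over repeated labeling passes; that aggregation rule and the label strings are assumptions, since the abstract does not specify them. The cost figures are the ones stated in the abstract.

```python
from collections import Counter
from typing import List, Optional


def consensus(labels: List[str]) -> Optional[str]:
    """Return the label chosen by a strict majority of passes, or None if there is none."""
    if not labels:
        return None
    label, votes = Counter(labels).most_common(1)[0]
    return label if votes > len(labels) / 2 else None


# Cost figures quoted in the abstract, per 1,000 labeled instances.
MANUAL_COST_PER_1000 = 100_000.00
SPICE_COST_PER_1000 = 5.10

cost_per_instance = SPICE_COST_PER_1000 / 1000
reduction_factor = MANUAL_COST_PER_1000 / SPICE_COST_PER_1000

print(consensus(["clear", "clear", "vague"]))  # clear
print(consensus(["clear", "vague"]))           # None
print(f"${cost_per_instance:.4f} per instance, roughly {reduction_factor:,.0f}x cheaper")
```

Returning `None` on a tie leaves room for an extra pass or a human check on ambiguous instances, which is one plausible way to trade cost against label quality.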

Article 118

Title@2025-07-12 (6): Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity

Title: Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity

Messung des Einflusses der frühen-2025 KI auf erfahrene Open-Source-Entwicklerproduktivität

衡量2025年初AI(AI)对经验丰富的开放源码开发者生产力的影响 2507.09089v1

Authors (4): Joel Becker, Nate Rush, Elizabeth Barnes, David Rein

Despite widespread adoption, the impact of AI tools on software development in the wild remains understudied. We conduct a randomized controlled trial (RCT) to understand how AI tools at the February-June 2025 frontier affect the productivity of experienced open-source developers. 16 developers with moderate AI experience complete 246 tasks in mature projects on which they have an average of 5 years of prior experience. Each task is randomly assigned to allow or disallow usage of early 2025 AI tools. When AI tools are allowed, developers primarily use Cursor Pro, a popular code editor, and Claude 3.5/3.7 Sonnet. Before starting tasks, developers forecast that allowing AI will reduce completion time by 24%. After completing the study, developers estimate that allowing AI reduced completion time by 20%. Surprisingly, we find that allowing AI actually increases completion time by 19%: AI tooling slowed developers down. This slowdown also contradicts predictions from experts in economics (39% shorter) and ML (38% shorter). To understand this result, we collect and evaluate evidence for 20 properties of our setting that a priori could contribute to the observed slowdown effect, for example, the size and quality standards of projects, or prior developer experience with AI tooling. Although the influence of experimental artifacts cannot be entirely ruled out, the robustness of the slowdown effect across our analyses suggests it is unlikely to primarily be a function of our experimental design.

尽管被广泛采用,但AI工具对野生软件开发的影响仍然没有得到充分研究。我们进行了随机控制测试(RCT),以了解2025年2月至6月边境的AI工具如何影响有经验的开放源开发商的生产率。16个具有温和AI经验的开发商完成了成熟项目的246项任务,平均5年经验的成熟开发商完成了246项任务。每项任务被随机指定允许或不允许使用2025年初AI工具。当允许使用AI工具时,开发商主要使用流行代码编辑Cursor Pro和Claude 3.5.3.7 Sonnet。在开始任务之前,开发商预测允许AI完成时间将减少24 % 。在完成研究之后,开发商估计允许AI完成时间减少20 % 。令人惊讶的是,我们发现允许AI实际将完成时间增加19 % - AI 工具, 减缓开发者。这种减速也与经济专家(39% 短) 和 ML (38%短) 的预测相矛盾。为了了解这一结果,我们设定的20种特性,我们收集并评估其前一种有助于观察到的减速效应的证据。例如,允许AI 。在完成研究后,尽管其规模和质量标准不能完全超越了我们之前的实验性能影响,但我们的弹性分析。

Article 119

Title@2025-07-11 (5): SetupBench: Assessing Software Engineering Agents’ Ability to Bootstrap Development Environments

Title: SetupBench: Assessing Software Engineering Agents’ Ability to Bootstrap Development Environments

SetupBench: Bewertung der Fähigkeit von Software-Engineering-Agenten zu Bootstrap-Entwicklungsumgebungen

设置基准:评估软件工程代理器的能力,以建立发展环境 2507.09063v1

Authors (3): Avi Arora, Jinu Jang, Roshanak Zilouchian Moghaddam

Modern Large Language Model (LLM) agents promise end-to-end assistance with real-world software tasks, yet existing benchmarks evaluate LLM agents almost exclusively in pre-baked environments where every dependency is pre-installed. To fill this gap, we introduce SetupBench, a 93-instance benchmark that isolates the environment-bootstrap skill: starting from a bare Linux sandbox, an agent must install packages, resolve dependency conflicts, initialize databases, and configure background services. Our tasks span seven language ecosystems, five database engines, and multi-service orchestration scenarios, each accompanied by a natural language problem statement and a deterministic success command. Through evaluation of OpenHands, a state-of-the-art coding agent, we find low success rates across task categories, with particular challenges in repository setup (38.9-57.4%) and local database configuration (20.0-53.3%). Our analysis reveals systematic failure modes including incomplete development tooling installation, hallucinated task constraints, and non-persistent environment modifications that break agent-human collaboration workflows. We identify substantial inefficiencies in agent exploration strategies, with 38-89% of actions being unnecessary compared to optimal human behavior. These findings highlight gaps in current agents’ practical environment-bootstrap capabilities. By targeting this critical yet under-evaluated capability, SetupBench provides a rigorous yardstick for the next generation of software developer agents aiming to solve end-to-end real-world tasks.

现代大型语言模型(LLM)代理商承诺结束对现实世界软件任务的援助,然而,现有的基准几乎完全在每一个依赖者都预先安装了预先成熟的环境中评价LLM代理商。为了填补这一空白,我们引入了SetupBench,这是一个将环境-促进陷阱技能分离的93个实例基准:从一个光的Linux沙箱开始,一个代理商必须安装软件包,解决依赖性冲突,初始化数据库和配置背景服务。我们的任务涉及七个语言生态系统、五个数据库引擎和多功能调控方案,每个都由自然语言问题声明和确定性的成功命令伴随。我们通过对OpenHands(最先进的编码代理商)的评价,我们发现跨任务类别的成功率较低,其中特别包括存储器设置(38.9-54.4%)和地方数据库配置(2.0-53.3 % ) 的挑战。我们的分析揭示了系统性的失败模式,包括开发工具安装不完善,任务限制,以及不耐受控制的环境改变,从而打破代理商-人的合作工作流程。我们通过对实际勘探战略的高度效率低效率,通过3889%的勘探战略,从而对比了当前目标定位的精确分析能力,这些关键环境分析能力,从而显示了当前- 环境的精确分析能力,这些不必要地评估能力,这些精确的动作,这些精确的动作,这些能力是在精确的动作,这些精确的动作,在精确的定位的动作,在精确的动作,这些环境,在精确的动作的动作的定位的动作,这些能力之下,这些能力是比。
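The abstract above says each task ships with a deterministic success command. A minimal sketch of how such a check could be scored follows, assuming success is defined as exit code 0; the helper `check_success` and the example commands are hypothetical and are not SetupBench's actual harness.

```python
import subprocess
import sys
from typing import List


def check_success(command: List[str], timeout: float = 60.0) -> bool:
    """Run a success command and report whether it exited with status 0.

    A missing executable or a timeout counts as failure, so the result
    is always a plain pass/fail.
    """
    try:
        result = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


# Example: does the current interpreter have a given module available?
print(check_success([sys.executable, "-c", "import json"]))
print(check_success([sys.executable, "-c", "import surely_missing_pkg_xyz"]))
```

Running the check in a fresh process after the agent finishes would also expose the non-persistent modifications mentioned in the abstract, since changes that live only in the agent's own shell session would not be visible to it.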

Article 120

Title@2025-07-11 (5): SAGE: A Context-Aware Approach for Mining Privacy Requirements Relevant Reviews from Mental Health Apps

Title: SAGE: A Context-Aware Approach for Mining Privacy Requirements Relevant Reviews from Mental Health Apps

SAGE: A Context-Aware Approach for Mining Privacy Relevant Reviews from Mental Health Apps

SAGE: “ 采矿隐私要求 “ 的背景意识方法,来自心理健康应用软件的相关审查 2507.09051v1

Authors (2): Aakash Sorathiya, Gouri Ginde

Mental health (MH) apps often require sensitive user data to customize services for mental wellness needs. However, such data collection practices in some MH apps raise significant privacy concerns for users. These concerns are often mentioned in app reviews, but other feedback categories, such as reliability and usability, tend to take precedence. This poses a significant challenge in automatically identifying privacy requirements-relevant reviews (privacy reviews) that can be utilized to extract privacy requirements and address users’ privacy concerns. Thus, this study introduces SAGE, a context-aware approach to automatically mining privacy reviews from MH apps using Natural Language Inference (NLI) with MH domain-specific privacy hypotheses (provides domain-specific context awareness) and a GPT model (eliminates the need for fine-tuning). The quantitative evaluation of SAGE on a dataset of 204K app reviews achieved an F1 score of 0.85 without any fine-tuning, outperforming the fine-tuned baseline classifiers BERT and T5. Furthermore, SAGE extracted 748 privacy reviews previously overlooked by keyword-based methods, demonstrating its effectiveness through qualitative evaluation. These reviews can later be refined into actionable privacy requirement artifacts.

心理健康(MH)应用软件往往需要敏感的用户数据来定制满足心理健康需要的服务,然而,某些MH应用软件的这类数据收集做法引起了用户对隐私的重大关切,这些关切在应用审查中经常提及,但其他反馈类别,如可靠性和可用性,往往居于优先地位,这对自动确定隐私要求相关审查(隐私审查)(隐私审查)构成重大挑战,这些审查可用于提取隐私要求和解决用户对隐私的关切。因此,本研究报告引入了SAGE, 这是一种符合背景的自动挖掘隐私审查的方法,即使用自然语言推断(NLI)的MH应用软件进行自动挖掘隐私审查,使用MH特定域的隐私假设(提供特定领域背景认识)和全球专利保护网络模型(消除微调的必要性),对204K应用审查数据集的SGEGE进行了定量评价,实现了0.85的F1分,但没有作任何微调,超过了经过微调的基线分类标准BERT和T5.此外,SAGEGE提取了748项隐私审查,这通过定性评估证明了其有效性。

Article 121

Title: CMER: A Context-Aware Approach for Mining Ethical Concern-related App Reviews

CMER: A Context-aware approach for Mining Ethical Concern-related App Reviews

CMER: 采矿道德关切相关上诉审查的背景意识方法 2507.09049v1

Authors (2): Aakash Sorathiya, Gouri Ginde

With the increasing proliferation of mobile applications in our daily lives, the concerns surrounding ethics have surged significantly. Users communicate their feedback in app reviews, frequently emphasizing ethical concerns, such as privacy and security. Incorporating these reviews has proved to be useful for many areas of software engineering (e.g., requirement engineering, testing, etc.). However, app reviews related to ethical concerns generally use domain-specific language and are typically overshadowed by more generic categories of user feedback, such as app reliability and usability. This makes automated extraction a challenging and time-consuming effort. This study proposes CMER (A Context-Aware Approach for Mining Ethical Concern-related App Reviews), a novel approach that combines Natural Language Inference (NLI) and a decoder-only (LLaMA-like) Large Language Model (LLM) to extract ethical concern-related app reviews at scale. In CMER, NLI provides domain-specific context awareness by using domain-specific hypotheses, and the Llama-like LLM eliminates the need for labeled data in the classification task. We evaluated the validity of CMER by mining privacy and security-related reviews (PSRs) from the dataset of more than 382K app reviews of mobile investment apps. First, we evaluated four NLI models and compared the results of domain-specific hypotheses with generic hypotheses. Next, we evaluated three LLMs for the classification task. Finally, we combined the best NLI and LLM models (CMER) and extracted 2,178 additional PSRs overlooked by the previous study using a keyword-based approach, thus demonstrating the effectiveness of CMER. These reviews can be further refined into actionable requirement artifacts.

随着移动应用程序在我们日常生活中日益扩散,人们对伦理的担忧大增。用户在应用程序审查中传达他们的反馈,经常强调隐私和安全等道德问题。纳入这些审查证明对软件工程的许多领域(例如要求工程、测试等)是有用的。然而,与伦理问题有关的应用审查通常使用特定领域的语言,通常被更通用的用户反馈类别(例如应用程序可靠性和使用性)所掩盖。因此,自动提取是一项具有挑战性和耗时性的工作。本研究报告提议CMER(A\ underline{C}Intext-Award Award Agine 方法,用于域内线{M}M}M}ining underline{E}道德关切相关App\underline{Cunderline{R}eviews。但是,将自然语言导力(NLIMA)和仅(类似LLIMA的)大语言模型(LM)相结合,用于在规模上提取与道德关切相关的应用的应用程序。在C-CMLIM上,通过域特定假设提供针对域域域的域内局的域认识认识认识,因此通过LLIM(我们LIM)对LIM进行最新数据分析,我们最新的数据评估,并用最新数据分析,可以进一步评估。

Article 122

Title@2025-07-11 (5): Towards Extracting Software Requirements from App Reviews using Seq2seq Framework

Title: Towards Extracting Software Requirements from App Reviews using Seq2seq Framework

Auf dem Weg zur Extraktion von Software-Anforderungen aus App-Bewertungen mit Seq2seq Framework

争取利用Seq2seq 框架从应用审查中提取软件要求 2507.09039v1

Authors (2): Aakash Sorathiya, Gouri Ginde

Mobile app reviews are a large-scale data source for software improvements. A key task in this context is effectively extracting requirements from app reviews to analyze the users’ needs and support the software’s evolution. Recent studies show that existing methods fail at this task since app reviews usually contain informal language, grammatical and spelling errors, and a large amount of irrelevant information that might not have direct practical value for developers. To address this, we propose a novel reformulation of requirements extraction as a Named Entity Recognition (NER) task based on the sequence-to-sequence (Seq2seq) generation approach. With this aim, we propose a Seq2seq framework, incorporating a BiLSTM encoder and an LSTM decoder, enhanced with a self-attention mechanism, GloVe embeddings, and a CRF model. We evaluated our framework on two datasets: a manually annotated set of 1,000 reviews (Dataset 1) and a crowdsourced set of 23,816 reviews (Dataset 2). The quantitative evaluation of our framework showed that it outperformed existing state-of-the-art methods with an F1 score of 0.96 on Dataset 2, and achieved comparable performance on Dataset 1 with an F1 score of 0.47.
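
To make the reformulation of requirements extraction as an NER task concrete, here is a minimal sketch of the decoding step only: turning per-token tags into requirement phrases. The BIO tagging scheme, the `REQ` label, and the function name `bio_to_spans` are assumptions for illustration; the abstract does not specify the tag set, and the BiLSTM/CRF model that would produce the tags is not shown.

```python
from typing import List, Optional, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (label, phrase) pairs from BIO-tagged tokens (assumed scheme)."""
    spans: List[Tuple[str, str]] = []
    current_label: Optional[str] = None
    current_tokens: List[str] = []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; flush the one in progress, if any.
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_label == tag[2:]:
            # Continuation of the entity in progress.
            current_tokens.append(token)
        else:
            # "O" tag or an inconsistent "I-" tag closes the current entity.
            if current_tokens:
                spans.append((current_label, " ".join(current_tokens)))
            current_label, current_tokens = None, []
    if current_tokens:
        spans.append((current_label, " ".join(current_tokens)))
    return spans


# Hypothetical review sentence with hand-written tags.
tokens = ["please", "add", "dark", "mode", "to", "the", "app"]
tags = ["O", "O", "B-REQ", "I-REQ", "O", "O", "O"]
extracted = bio_to_spans(tokens, tags)
print(extracted)  # [('REQ', 'dark mode')]
```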

移动应用程序审查是软件改进的大规模数据源。这方面的一项关键任务是有效地从应用审查中提取需求要求,以分析用户的需求并支持软件的演变。最近的研究显示,由于应用审查通常包含非正式语言、语法和拼写错误,以及大量可能对开发者没有直接实际价值的不相干信息,现有方法未能完成这项任务。为此,我们提议根据顺序到顺序(Seq2seq)生成方法,将需求提取新改为命名实体识别(NER)任务。为此,我们提议了一个Seq2seq框架,包括一个BisLSTM编码器和一个LSTM解码器,通过一个自我注意机制、GloVe嵌入和通用报告格式模型加以强化。我们评估了我们关于两个数据集的框架:一组人工附加说明的1 000项审查(数据集1)和一组群集的23 816项审查(数据集2)。我们框架的定量评价显示,它超越了现有的Seq2级标准,即F1分数为0.16分的可比较性数据。

Article 123

Title@2025-07-11 (5): BrainLesion Suite: A Flexible and User-Friendly Framework for Modular Brain Lesion Image Analysis

Title: BrainLesion Suite: A Flexible and User-Friendly Framework for Modular Brain Lesion Image Analysis

BrainLesion Suite: Ein flexibles und benutzerfreundliches Framework für die modulare Gehirn-Lesions-Bildanalyse

脑悬浮套件:模块脑悬浮图像分析灵活和用户友好框架 2507.09036v1

Authors (29): Florian Kofler, Marcel Rosier, Mehdi Astaraki, Hendrik Möller, Ilhem Isra Mekki, Josef A. Buchner, Anton Schmick, Arianna Pfiffer, Eva Oswald, Lucas Zimmer, Ezequiel de la Rosa, Sarthak Pati, Julian Canisius, Arianna Piffer, Ujjwal Baid, Mahyar Valizadeh, Akis Linardos, Jan C. Peeken, Surprosanna Shit, Felix Steinbauer, Daniel Rueckert, Rolf Heckemann, Spyridon Bakas, Jan Kirschke, Constantin von See, Ivan Ezhov, Marie Piraud, Benedikt Wiestler, Bjoern Menze

BrainLesion Suite is a versatile toolkit for building modular brain lesion image analysis pipelines in Python. Following Pythonic principles, BrainLesion Suite is designed to provide a ‘brainless’ development experience, minimizing cognitive effort and streamlining the creation of complex workflows for clinical and scientific practice. At its core is an adaptable preprocessing module that performs co-registration, atlas registration, and optional skull-stripping and defacing on arbitrary multi-modal input images. BrainLesion Suite leverages algorithms from the BraTS challenge to synthesize missing modalities, inpaint lesions, and generate pathology-specific tumor segmentations. BrainLesion Suite also enables quantifying segmentation model performance, with tools such as panoptica to compute lesion-wise metrics. Although BrainLesion Suite was originally developed for image analysis pipelines of brain lesions such as glioma, metastasis, and multiple sclerosis, it can be adapted for other biomedical image analysis applications. The individual BrainLesion Suite packages and tutorials are accessible on GitHub.

脑解剖套件是用于在 Python 中建立模块式脑损伤图像分析管道的多功能工具包。遵循“ 脉冲原则 ” , 脑解剖套件旨在提供“ 无脑”的发展经验, 最大限度地减少认知努力, 简化临床和科学实践复杂工作流程的创建。其核心是一个适应性的预处理模块, 进行共同登记、地图册登记, 以及选择性的头骨剥离和对任意的多模式输入图像进行拆解。脑解套件利用来自 BRATS 的算法, 以综合缺失的模式、印面损伤和产生病理特定肿瘤分块。脑解套件还能够量化分解模型的性能, 工具包括光学模型, 以计算偏差性指标。虽然脑解套件最初是用来进行脑损伤图像分析管道的, 如浮质、转移和多重凝固度等, 但它可以调整用于其他生物医学图像分析应用。个人脑解剖套件和辅导包可以在 GitHub 上查阅。

Article 124

Title@2025-07-11 (5): Accelerating Drug Discovery Through Agentic AI: A Multi-Agent Approach to Laboratory Automation in the DMTA Cycle

Title: Accelerating Drug Discovery Through Agentic AI: A Multi-Agent Approach to Laboratory Automation in the DMTA Cycle

Beschleunigen der Wirkstoff-Discovery durch Agentic AI: Multi-Agenten-Ansatz zur Laborautomatisierung im DMTA-Zyklus

AI:对DMTTA周期实验室自动化采取多机构办法 2507.09023v1

Authors (12): Yao Fehlis, Charles Crain, Aidan Jensen, Michael Watson, James Juhasz, Paul Mandel, Betty Liu, Shawn Mahon, Daren Wilson, Nick Lynch-Jonely, Ben Leedom, David Fuller

The pharmaceutical industry faces unprecedented challenges in drug discovery, with traditional approaches struggling to meet modern therapeutic development demands. This paper introduces a novel AI framework, Tippy, that transforms laboratory automation through specialized AI agents operating within the Design-Make-Test-Analyze (DMTA) cycle. Our multi-agent system employs five specialized agents - Supervisor, Molecule, Lab, Analysis, and Report, with Safety Guardrail oversight - each designed to excel in specific phases of the drug discovery pipeline. Tippy represents the first production-ready implementation of specialized AI agents for automating the DMTA cycle, providing a concrete example of how AI can transform laboratory workflows. By leveraging autonomous AI agents that reason, plan, and collaborate, we demonstrate how Tippy accelerates DMTA cycles while maintaining scientific rigor essential for pharmaceutical research. The system shows significant improvements in workflow efficiency, decision-making speed, and cross-disciplinary coordination, offering a new paradigm for AI-assisted drug discovery.

制药业在药物发现方面面临着前所未有的挑战,传统方法在努力满足现代治疗发展需求。本文件介绍一个新的AI框架Tippy,通过在设计-制造-测试-分析(DMTA)周期内运作的专门的AI代理机构改造实验室自动化。我们的多试剂系统雇用了5个专业代理机构 — — 主管、分子、实验室、分析和报告,由安全卫士监督,每个监督机构都旨在在药物发现管道的特定阶段取得优异成绩。Tippy是首次为DMTA周期自动化而实施专门的AI代理机构,为AI如何改变实验室工作流程提供了具体的范例。我们通过利用自主的AI代理机构来解释、规划和合作,我们展示了Tippy如何加速DMTA周期,同时保持药物研究所必需的科学规范。这个系统在工作流程效率、决策速度和跨学科协调方面有了显著的改进,为AI辅助药物发现提供了新的范例。

Article 125

Title@2025-07-11 (5): ToolRegistry: A Protocol-Agnostic Tool Management Library for Function-Calling LLMs

Title: ToolRegistry: A Protocol-Agnostic Tool Management Library for Function-Calling LLMs

ToolRegistry: Eine protokoll-agnostische Werkzeugverwaltungsbibliothek für funktionsaufrufende LLMs

工具登记:功能调频LMS的礼宾-不可确定性工具管理库 2507.10593v1

Authors (1): Peng Ding

Large Language Model (LLM) applications are increasingly relying on external tools to extend their capabilities beyond text generation. However, current tool integration approaches suffer from fragmentation, protocol limitations, and implementation complexity, leading to substantial development overhead. This paper presents ToolRegistry, a protocol-agnostic tool management library that simplifies tool registration, representation, execution, and lifecycle management via a unified interface. Our evaluation demonstrates that ToolRegistry achieves a 60-80% reduction in tool integration code, up to 3.1x performance improvements through concurrent execution, and 100% compatibility with OpenAI function calling standards. Real-world case studies show significant improvements in development efficiency and code maintainability across diverse integration scenarios. ToolRegistry is open-source and available at https://github.com/Oaklight/ToolRegistry, with comprehensive documentation at https://toolregistry.readthedocs.io/.
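
The idea of a unified interface for registration, representation, and concurrent execution can be illustrated with a small sketch. The `MiniRegistry` class below is hypothetical and is not the ToolRegistry API; it only assumes that tools are plain Python functions and that the schema follows the JSON layout used for OpenAI-style function calling.

```python
import inspect
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, Dict, List

# Assumed mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}


class MiniRegistry:
    """Hypothetical registry: register functions, describe them, run calls."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}

    def register(self, func: Callable[..., Any]) -> Callable[..., Any]:
        # Usable as a decorator; the function name becomes the tool name.
        self._tools[func.__name__] = func
        return func

    def schema(self, name: str) -> Dict[str, Any]:
        # Build a function-calling style description from the signature.
        func = self._tools[name]
        params = inspect.signature(func).parameters.values()
        properties = {
            p.name: {"type": _JSON_TYPES.get(p.annotation, "string")} for p in params
        }
        required = [p.name for p in params if p.default is inspect.Parameter.empty]
        return {
            "type": "function",
            "function": {
                "name": name,
                "description": inspect.getdoc(func) or "",
                "parameters": {
                    "type": "object",
                    "properties": properties,
                    "required": required,
                },
            },
        }

    def execute_all(self, calls: List[Dict[str, str]]) -> List[Any]:
        # Run independent tool calls concurrently; results keep input order.
        def run(call: Dict[str, str]) -> Any:
            return self._tools[call["name"]](**json.loads(call["arguments"]))

        with ThreadPoolExecutor() as pool:
            return list(pool.map(run, calls))


registry = MiniRegistry()


@registry.register
def add(a: int, b: int) -> int:
    """Add two integers."""
    return a + b


results = registry.execute_all(
    [
        {"name": "add", "arguments": '{"a": 1, "b": 2}'},
        {"name": "add", "arguments": '{"a": 3, "b": 4}'},
    ]
)
print(results)  # [3, 7]
```

Threads are used here because tool calls are often I/O-bound; whether the library itself uses threads, processes, or async execution is not stated in the abstract.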

大型语言模型(LLM)应用程序日益依赖外部工具,将其能力扩大到文本生成之外,然而,当前工具集成方法存在支离破碎、协议限制和执行复杂性,导致大量开发间接费用。本文介绍了《工具登记》,这是一个协议-不可知工具管理库,通过统一接口简化工具登记、代表、执行和生命周期管理。我们的评估表明,“工具登记系统”通过同时执行实现了60-80%的工具集成代码的减少,达到3.1x的绩效改进,以及100%的与OpenAI功能调用标准兼容。现实世界案例研究显示,不同集成情景的发展效率和守则维护能力有了显著改善。\ Toolregistry是开源的,可在https://github.com/Oaklight/ToolRegistry上查阅,并在https://toolregistry.rededocs.io/上提供综合文件。

Article 126

Title@2025-07-11 (5): Semantic Source Code Segmentation using Small and Large Language Models

Title: Semantic Source Code Segmentation using Small and Large Language Models

Semantische Quellcode-Segmentierung mit kleinen und großen Sprachmodellen

使用小型和大语言模式的语义源代码代码分割 2507.08992v1

Authors (5): Abdelhalim Dahou, Ansgar Scherp, Sebastian Kurten, Brigitte Mathiak, Madhu Chauhan

Source code segmentation, dividing code into functionally coherent segments, is crucial for knowledge retrieval and maintenance in software development. While enabling efficient navigation and comprehension of large codebases, manual and syntactic analysis approaches have become impractical as repositories grow, especially for low-resource languages like R and their research domains (e.g., social sciences, psychology). This paper introduces an automated, domain-specific approach for research R code segmentation using Large and Small Language Models (LLMs/SLMs). It presents two novel approaches and a human-annotated dataset, StatCodeSeg. We explore two distinct approaches: line-by-line analysis with context and range-based segment determination. We experiment with LLMs and fine-tuned SLMs. To support the generalizability of our approaches, we also include experiments on Python code from the computer science domain. Our results show that context-based line-by-line analysis is superior to range-based segmentation. Smaller language models like CodeBERT and an encoder-only version of CodeT5+ perform better than their LLM counterparts. Most notably, these two best-performing models, unlike the LLMs, did not see R code during pre-training and were only fine-tuned on 4,130 lines of manually annotated code.
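
The two output shapes contrasted in the abstract, one label per line versus explicit line ranges, can be related with a short sketch. The label names and the helper `lines_to_ranges` are assumptions for illustration and are not taken from StatCodeSeg.

```python
from itertools import groupby
from typing import List, Tuple


def lines_to_ranges(labels: List[str]) -> List[Tuple[int, int, str]]:
    """Merge consecutive identical per-line labels into (start, end, label) ranges.

    Line numbers are 1-based and inclusive.
    """
    ranges: List[Tuple[int, int, str]] = []
    line_no = 1
    for label, group in groupby(labels):
        length = len(list(group))
        ranges.append((line_no, line_no + length - 1, label))
        line_no += length
    return ranges


# Hypothetical per-line predictions for a six-line analysis script.
labels = ["load", "load", "clean", "model", "model", "model"]
segments = lines_to_ranges(labels)
print(segments)  # [(1, 2, 'load'), (3, 3, 'clean'), (4, 6, 'model')]
```

A line-by-line classifier can therefore always be converted into segments after the fact, which is one reason the two formulations are directly comparable.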

源代码分割法将代码分为功能一致部分,对于软件开发方面的知识检索和维护至关重要。虽然随着库库的成长,特别是R及其研究领域(例如社会科学、心理学)等低资源语言的储存库的成长,能够有效地导航和理解大型代码库,人工和合成分析方法变得不切实际,特别是对于R及其研究领域(例如社会科学、心理学)等低资源语言而言更是如此。本文介绍了使用大语言和小语言模型(LLLMS/SLMS)进行的研究代码分割的自动、具体领域的方法。它提出了两种新颖的方法和一个带有人文注释的数据集,StatCCodeSeg。我们探索了两种不同的方法:逐行分析,同时根据背景和基于范围确定部分。我们试验LLMSM和经过精细调整的可持续土地管理方法。为了支持我们的方法的通用性,我们还将包括计算机科学领域的Python代码实验。我们的研究结果表明,基于背景的逐行分析比基于范围的分割法的分解法更为优越。我们发现,像CBERT5+只编码的编码版本比LLMM4号前改进了。

Article 127

Title@2025-07-11 (5): Can Large Language Models Help Students Prove Software Correctness? An Experimental Study with Dafny

Title: Can Large Language Models Help Students Prove Software Correctness? An Experimental Study with Dafny

Können große Sprachmodelle den Studierenden helfen, Software-Korrektur zu beweisen? Eine experimentelle Studie mit Dafny

大语言模型能帮助学生证明软件正确性吗? 与Dafny的实验研究 2506.22370v3

Authors (4): Carolina Carreira, Álvaro Silva, Alexandre Abreu, Alexandra Mendes

Students in computing education increasingly use large language models (LLMs) such as ChatGPT. Yet, the role of LLMs in supporting cognitively demanding tasks, like deductive program verification, remains poorly understood. This paper investigates how students interact with an LLM when solving formal verification exercises in Dafny, a language that supports functional correctness, by allowing programmers to write formal specifications and automatically verifying that the implementation satisfies the specification. We conducted a mixed-methods study with master’s students enrolled in a formal methods course. Each participant completed two verification problems, one with access to a custom ChatGPT interface that logged all interactions, and the other without. We identified strategies used by successful students and assessed the level of trust students place in LLMs. Our findings show that students perform significantly better when using ChatGPT; however, performance gains are tied to prompt quality. We conclude with practical recommendations for integrating LLMs into formal methods courses more effectively, including designing LLM-aware challenges that promote learning rather than substitution.

然而,LLMS在支持认知要求高的任务(如计算程序核查)方面的作用仍然不为人所知。本文调查了学生在解决Dafny的正式核查练习时如何与LLM互动。 Dafny是支持功能正确性的一种语言,它使程序设计员能够编写正式的规格,并自动核实执行符合规格。我们与注册参加正规方法课程的硕士学生进行了混合方法研究。每个参与者都完成了两个核查问题,一个是能够使用记录所有互动的CatGPT用户界面,另一个是没有。我们确定了成功学生使用的战略,并评估了LLMS学生的信任程度。我们的调查结果显示,学生在使用CatGPT时表现得更好;然而,成绩与及时的质量挂钩。我们最后提出了将LMS更有效地纳入正规方法课程的实用建议,包括设计LMM-aware挑战,促进学习而不是替代。

Article 128

Title@2025-07-11 (5): Choosing the Right Git Workflow: A Comparative Analysis of Trunk-based vs. Branch-based Approaches

Title: Choosing the Right Git Workflow: A Comparative Analysis of Trunk-based vs. Branch-based Approaches

Auswahl des richtigen Git-Workflows: Eine vergleichende Analyse von Trunk-based vs. Branch-based Approaches

选择正确的基特工作流程:对基于Trunk的方法与基于分部门的方法的比较分析 2507.08943v1

Authors (4): Pedro Lopes, Paola Accioly, Paulo Borba, Vitor Menezes

Git has become one of the most widely used version control systems today. Among its distinguishing features, its ability to easily and quickly create branches stands out, allowing teams to customize their workflows. In this context, various formats of collaborative development workflows using Git have emerged and gained popularity among software engineers. We can categorize such workflows into two main types: branch-based workflows and trunk-based workflows. Branch-based workflows typically define a set of remote branches with well-defined objectives, such as feature branches, a branch for feature integration, and a main branch. The goal is to migrate changes from the most isolated branch to the main one shared by all as the code matures. In this category, GitFlow stands out as the most popular example. In contrast, trunk-based workflows have a single remote branch where developers integrate their changes directly. In this range of options, choosing a workflow that maximizes team productivity while promoting software quality becomes a non-trivial task. Despite discussions on forums, social networks, and blogs, few scientific articles have explored this topic. In this work, we provide evidence on how Brazilian developers work with Git workflows and what factors favor or hinder the use of each model. To this end, we conducted semi-structured interviews and a survey with software developers. Our results indicate that trunk-based development favors fast-paced projects with experienced and smaller teams, while branch-based development suits less experienced and larger teams better, despite posing management challenges.

基特已成为当今最广泛使用的版本控制系统之一。在它的突出特点中,其容易和快速创建分支的能力突出,使团队能够定制工作流程。在这方面,使用吉特的各种形式的合作开发工作流程已经出现,并在软件工程师中越来越受欢迎。我们可以将这些工作流程分为两大类:基于分支的工作流程和基于中继的工作流程。基于分支的工作流程通常确定一组具有明确界定目标的远程分支,如特征分支、特征整合分支和主要分支。目标是将最孤立的分支转换为因代码成熟而为所有人都共享的主要分支。在这个类别中,吉特佛罗成为最受欢迎的范例。相比之下,基于干线的工作流程有一个单独的远程分支,其中开发者直接整合其变化。在一系列选项中,选择一个最大限度地提高团队生产率的工作流程,同时促进软件质量成为一项非三重任务。尽管在论坛、社会网络和博客上进行了讨论,但很少有科学文章来探讨这个主题。在这项工作中,我们提供了巴西开发商如何与拥有更深层次的流程和软件开发团队一起工作,我们如何使用更好的模式,我们如何使用更有利于的版本。

Article 129

Title@2025-07-11 (5): Repairing Language Model Pipelines by Meta Self-Refining Competing Constraints at Runtime

Title: Repairing Language Model Pipelines by Meta Self-Refining Competing Constraints at Runtime

Reparatur von Sprachmodell-Pipelines durch Meta-Selbst-Refining Wettbewerbsbeschränkungen bei Runtime

运行时通过Meta自我改进竞争制约修复语言示范管道 2507.10590v1

Authors (1): Mojtaba Eshghie

Language Model (LM) pipelines can dynamically refine their outputs against programmatic constraints. However, their effectiveness collapses when faced with competing soft constraints, leading to inefficient backtracking loops where satisfying one constraint violates another. We introduce Meta Self-Refining, a framework that equips LM pipelines with a meta-corrective layer to repair these competitions at runtime/inference-time. Our approach monitors the pipeline’s execution history to detect oscillatory failures. Upon detection, it invokes a meta-repairer LM that analyzes the holistic state of the backtracking attempts and synthesizes a strategic instruction to balance the competing requirements. This self-repair instruction guides the original LM out of a failing refining loop towards a successful output. Our results show Meta Self-Refining can successfully repair these loops, leading to more efficient LM programs.
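
As an illustration of what detecting an oscillatory failure in an execution history might look like, the sketch below checks whether the most recent violations alternate between two constraints. The history format (one violated constraint name per attempt), the function name, and the `min_cycles` threshold are assumptions; the abstract does not describe the actual detection logic.

```python
from typing import List, Optional, Tuple


def detect_oscillation(
    history: List[str], min_cycles: int = 2
) -> Optional[Tuple[str, str]]:
    """Return the two constraints that alternate at the end of history, else None.

    history holds the name of the constraint violated on each attempt,
    oldest first. An A, B, A, B tail counts as two cycles.
    """
    needed = 2 * min_cycles
    if len(history) < needed:
        return None
    tail = history[-needed:]
    first, second = tail[0], tail[1]
    if first == second:
        return None
    for index, violated in enumerate(tail):
        expected = first if index % 2 == 0 else second
        if violated != expected:
            return None
    return (first, second)


# Hypothetical backtracking log: fixing one constraint breaks the other.
attempts = ["too_long", "missing_detail", "too_long", "missing_detail"]
conflict = detect_oscillation(attempts)
print(conflict)  # ('too_long', 'missing_detail')
```

In a pipeline, a non-`None` result would be the point at which a meta-level repair step is invoked instead of another ordinary retry.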

语言模型(LM)管道可以在方案制约下动态地改进其产出。然而,在面临相互竞争的软制约时,其效力会崩溃,导致低效率的回溯回溯回路圈,从而在满足一种制约时会违反另一种制约。我们引入了Meta自我更新,即一个为LM管道配备了元校正层的框架,以便在运行时/发酵时修复这些竞争。我们的方法监测管道的执行历史,以发现血管故障。一经发现,它就引用一个元修复器LM,分析回溯努力的整体状态,并合成战略指示,以平衡竞争要求。这一自我更新指导指导引导原始LMM的精炼循环走向成功产出的失败。我们的结果显示Meta自我更新能够成功修复这些循环,导致更有效的LM程序。

Article 130

Title@2025-07-11 (5): On the Structure and Semantics of Identifier Names Containing Closed Syntactic Category Words

Title: On the Structure and Semantics of Identifier Names Containing Closed Syntactic Category Words

Über die Struktur und Semantik von Identifier-Namen, die geschlossene syntaktische Kategorie Wörter enthalten

关于含有闭合同步词类的标识名称的结构和语义 2505.18444v3

Authors (11): Christian D. Newman, Anthony Peruma, Eman Abdullah AlOmar, Mahie Crabbe, Syreen Banabilah, Reem S. AlSuhaibani, Michael J. Decker, Farhad Akhbardeh, Marcos Zampieri, Mohamed Wiem Mkaouer, Jonathan I. Maletic

Identifier names are crucial components of code, serving as primary clues for developers to understand program behavior. This paper investigates the linguistic structure of identifier names by extending the concept of grammar patterns, which represent the part-of-speech (PoS) sequences underlying identifier phrases. The specific focus is on closed syntactic categories (e.g., prepositions, conjunctions, determiners), which are rarely studied in software engineering despite their central role in general natural language. To study these categories, the Closed Category Identifier Dataset (CCID), a new manually annotated dataset of 1,275 identifiers drawn from 30 open-source systems, is constructed and presented. The relationship between closed-category grammar patterns and program behavior is then analyzed using grounded-theory-inspired coding, statistical, and pattern analysis. The results reveal recurring structures that developers use to express concepts such as control flow, data transformation, temporal reasoning, and other behavioral roles through naming. This work contributes an empirical foundation for understanding how linguistic resources encode behavior in identifier names and supports new directions for research in naming, program comprehension, and education.
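
To show what looking for closed-category words inside identifiers involves, the following sketch splits an identifier into words and flags those found in a small lexicon. The lexicon, the category names, and both function names are assumptions for illustration; they are not the CCID annotation scheme, which was produced manually.

```python
import re
from typing import List, Tuple

# Tiny illustrative lexicon; a real study would use a far richer resource.
CLOSED_CATEGORY = {
    "to": "preposition",
    "from": "preposition",
    "with": "preposition",
    "in": "preposition",
    "and": "conjunction",
    "or": "conjunction",
    "the": "determiner",
    "this": "determiner",
    "each": "determiner",
}


def split_identifier(name: str) -> List[str]:
    """Split camelCase, PascalCase, and snake_case names into words."""
    return re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)


def closed_category_words(name: str) -> List[Tuple[str, str]]:
    """Return (word, category) for each closed-category word in the name."""
    found = []
    for word in split_identifier(name):
        category = CLOSED_CATEGORY.get(word.lower())
        if category is not None:
            found.append((word.lower(), category))
    return found


print(split_identifier("parseHTMLFromFile"))  # ['parse', 'HTML', 'From', 'File']
print(closed_category_words("copyFromBufferToFile"))
```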

标识名称是代码的关键组成部分, 是开发者理解程序行为的主要线索。本文通过扩展语法模式的概念来调查标识名称的语言结构。语法模式代表了语法序列部分( POS) 基本识别短语。具体重点是封闭的合成类别( 如预设、连线、确定者 ) , 尽管这些类别在一般自然语言中具有核心作用, 但这些类别很少在软件工程中研究。要研究这些类别, 封闭类识别数据集( CICID) 是一个新的人工手动数据集, 由来自30个开放源系统的1,275个标识组成。封闭类语法模式与程序行为之间的关系随后通过基于理论的编码、统计和模式分析加以分析。结果揭示了开发者用来表达控制流、数据转换、时间推理和其他行为作用等概念的经常性结构。这项工作为理解识别名称的语言资源如何编码行为和支持命名、方案理解和教育研究的新方向提供了经验基础。

Article 131

Title@2025-07-11 (5): Multilingual Multimodal Software Developer for Code Generation

Title: Multilingual Multimodal Software Developer for Code Generation

Mehrsprachiger multimodaler Softwareentwickler für die Codegenerierung

用于代码生成的多语言多语种多式软件开发器 2507.08719v1

Authors (15): Linzheng Chai, Jian Yang, Shukai Liu, Wei Zhang, Liran Wang, Ke Jin, Tao Sun, Congnan Liu, Chenchen Zhang, Hualei Zhu, Jiaheng Liu, Xianjie Wu, Ge Zhang, Tianyu Liu, Zhoujun Li

The rapid advancement of Large Language Models (LLMs) has significantly improved code generation, yet most models remain text-only, neglecting crucial visual aids like diagrams and flowcharts used in real-world software development. To bridge this gap, we introduce MM-Coder, a Multilingual Multimodal software developer. MM-Coder integrates visual design inputs, namely Unified Modeling Language (UML) diagrams and flowcharts (termed Visual Workflow), with textual instructions to enhance code generation accuracy and architectural alignment. To enable this, we developed MMc-Instruct, a diverse multimodal instruction-tuning dataset including visual-workflow-based code generation, allowing MM-Coder to synthesize textual and graphical information like human developers, distinct from prior work on narrow tasks. Furthermore, we introduce MMEval, a new benchmark for evaluating multimodal code generation, addressing existing text-only limitations. Our evaluations using MMEval highlight significant remaining challenges for models in precise visual information capture, instruction following, and advanced programming knowledge. Our work aims to revolutionize industrial programming by enabling LLMs to interpret and implement complex specifications conveyed through both text and visual designs.

大语言模型(LLMS)的快速进步大大改善了代码生成,但大多数模型仍然只使用文字,忽略了在现实世界软件开发中使用的图表和流程图等关键视觉辅助工具。为了缩小这一差距,我们引入了多语言多式软件开发商MM- Coder。MM-Coder将视觉设计投入-统一型模拟语言(UML)图表和流程图(长期视觉工作流程)集成为一体,并附有加强代码生成准确性和建筑协调的文本指令。为此,我们开发了MMM-Instruc-Instruct, 这是一种包括视觉-工作流程生成代码在内的多种多式指令调整数据集,允许MMM-Coder合成像人类开发商那样的文本和图形信息,这与以往关于狭隘任务的工作不同。此外,我们引入了MMEval,这是评价多式代码生成的新基准,解决了现有的只使用文字的局限性。我们使用MMEval的评估强调了精确视觉信息捕获模型、随后的教学和先进的编程知识方面仍然存在的重大挑战。我们的工作旨在通过文本和视觉设计来解释和执行复杂的规格设计来使LMIS来使工业方案编制方案实现革命化工业方案。

Article 132

Title@2025-07-11 (5): LLMCup: Ranking-Enhanced Comment Updating with LLMs

Title: LLMCup: Ranking-Enhanced Comment Updating with LLMs

LLMCup: Ranking-erweiterter Kommentar Aktualisierung mit LLMs

LLMCUM: 更新与LLMM的评分 2507.08671v1

Authors (5): Hua Ge, Juan Zhai, Minxue Pan, Fusen He, Ziyue Tan

While comments are essential for enhancing code readability and maintainability in modern software projects, developers are often motivated to update code but not comments, leading to outdated or inconsistent documentation that hinders future understanding and maintenance. Recent approaches such as CUP and HebCup have attempted automatic comment updating using neural sequence-to-sequence models and heuristic rules, respectively. However, these methods can miss or misinterpret crucial information during comment updating, resulting in inaccurate comments, and they often struggle with complex update scenarios. Given these challenges, a promising direction lies in leveraging large language models (LLMs), which have shown impressive performance in software engineering tasks such as comment generation, code synthesis, and program repair. This suggests their strong potential to capture the logic behind code modifications - an ability that is crucial for the task of comment updating. Nevertheless, selecting an appropriate prompt strategy for an LLM on each update case remains challenging. To address this, we propose a novel comment updating framework, LLMCup, which first uses multiple prompt strategies to obtain diverse candidate updated comments from an LLM, and then employs a ranking model, CupRank, to select the best candidate as the final updated comment. Experimental results demonstrate the effectiveness of LLMCup, with improvements over state-of-the-art baselines (CUP and HebCup) of 49.0%-116.9% in Accuracy, 10.8%-20% in BLEU-4, 4.6% in METEOR, 0.9%-1.9% in F1, and 2.1%-3.4% in SentenceBert similarity. Furthermore, a user study shows that comments updated by LLMCup sometimes surpass human-written updates, highlighting the importance of incorporating human evaluation in comment quality assessment.
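
The generate-then-rank structure described above can be sketched in a few lines. Everything here is a stand-in: the candidate comments are hard-coded rather than produced by an LLM, the strategy names are invented, and `heuristic_score` is a simple identifier-overlap heuristic, not CupRank, which the abstract describes as a learned ranking model.

```python
import re
from typing import Dict, Set


def identifiers(text: str) -> Set[str]:
    """Extract identifier-like tokens from code or prose."""
    return set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text))


def heuristic_score(comment: str, old_code: str, new_code: str) -> int:
    """Reward mentions of newly added names, penalize mentions of removed ones."""
    words = identifiers(comment)
    added = identifiers(new_code) - identifiers(old_code)
    removed = identifiers(old_code) - identifiers(new_code)
    return len(words & added) - len(words & removed)


def pick_best(candidates: Dict[str, str], old_code: str, new_code: str) -> str:
    """Select the highest-scoring candidate comment across prompt strategies."""
    return max(
        candidates.values(),
        key=lambda comment: heuristic_score(comment, old_code, new_code),
    )


old_code = "return items[:limit]"
new_code = "return items[:max_count]"
# Hypothetical outputs of two prompt strategies for the same code change.
candidates = {
    "zero_shot": "Returns at most limit items.",
    "few_shot": "Returns at most max_count items.",
}
best = pick_best(candidates, old_code, new_code)
print(best)  # Returns at most max_count items.
```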

虽然这些评论对于提高现代软件项目的代码可读性和可维护性至关重要,但开发商往往有动力更新代码,而不是评论,从而导致妨碍未来理解和维护的过时或不一致的文档。最近的做法,如CUP和HebCup,分别尝试使用神经序列序列至序列模型和超常规则自动更新评论。然而,这些方法在评论更新过程中可能会错失或误解关键信息,导致不准确的评论,而且往往会与复杂的更新情景作斗争。鉴于这些挑战,一个有希望的方向在于利用大型语言模型(LLLM),这些模型在评论生成、代码合成以及程序修补等软件工程任务中表现出令人印象深刻的性能,从而导致过时的代码修改背后的逻辑 — — 这种能力对于评论更新任务至关重要。然而,为每个更新案例的LLMM选择适当的快速战略仍然具有挑战性。为了解决这个问题,我们提出了一个新的更新框架,即LLMC,它首先使用多种快速战略通过LM提供不同的候选人最新评论,然后使用一个排名模型,SupRank, 将最佳候选人作为最终候选人在OLM%的 Ral-ral-lusial-leval-liaral-liaral im imal-lial-lievial-lieval-lial-lial 10 bruvals lauald laxxxxx 10 lax lax laual laual lax lax 10 。

Article 133

Title@2025-07-11 (5): Text2BIM: Generating Building Models Using a Large Language Model-based Multi-Agent Framework

Title: Text2BIM: Generating Building Models Using a Large Language Model-based Multi-Agent Framework

Text2BIM: Generierung von Baumodellen mit Hilfe eines Multi-Agent-Frameworks auf Basis eines großen Sprachmodells

Text2BIM:利用以大语言模式为基础的多机构机构框架生成建筑模型 2408.08054v2

Authors (4): Changyu Du, Sebastian Esser, Stavros Nousias, André Borrmann

The conventional BIM authoring process typically requires designers to master complex and tedious modeling commands in order to materialize their design intentions within BIM authoring tools. This additional cognitive burden complicates the design process and hinders the adoption of BIM and model-based design in the AEC (Architecture, Engineering, and Construction) industry. To facilitate the expression of design intentions more intuitively, we propose Text2BIM, an LLM-based multi-agent framework that can generate 3D building models from natural language instructions. This framework orchestrates multiple LLM agents to collaborate and reason, transforming textual user input into imperative code that invokes the BIM authoring tool’s APIs, thereby generating editable BIM models with internal layouts, external envelopes, and semantic information directly in the software. Furthermore, a rule-based model checker is introduced into the agentic workflow, utilizing predefined domain knowledge to guide the LLM agents in resolving issues within the generated models and iteratively improving model quality. Extensive experiments were conducted to compare and analyze the performance of three different LLMs under the proposed framework. The evaluation results demonstrate that our approach can effectively generate high-quality, structurally rational building models that are aligned with the abstract concepts specified by user input. Finally, an interactive software prototype was developed to integrate the framework into the BIM authoring software Vectorworks, showcasing the potential of modeling by chatting. The code is available at: https://github.com/dcy0577/Text2BIM

传统的BIM 作者进程通常要求设计者掌握复杂和烦琐的建模指令,以便在BIM 作者工具中实现其设计意图。这种额外的认知负担使设计过程复杂化,妨碍在AEC(建筑、工程和建筑)行业直接采用BIM和基于模型的设计设计。此外,为了便于更直观地表达设计意图,我们提议Text2BIM,一个基于LLM 的多试剂框架,即基于LLLM 的多试剂框架,能够从自然语言指令中产生3D建模模型。这个框架协调多种LLM 代理商进行协作和理性,将文本用户输入转换为援引BIM 工具的API 的必备代码,从而产生内部布局、外部信封和软件中直接采用BIM 的基于模型的设计设计。此外,我们提出一个基于规则的模式检查器,利用预先定义的域知识指导LLM 代理商在生成模型中解决问题,并反复改进模型质量。进行了广泛的实验,以比较和分析三个不同的LMS在VIM 具体框架下的LM 的功能化模型/ 。评估结果显示我们用户在结构化模型上开发的模型中可以有效产生一种可实现的模型。

Article 134

Title@2025-07-11 (5): NL in the Middle: Code Translation with LLMs and Intermediate Representations

Title: NL in the Middle: Code Translation with LLMs and Intermediate Representations

NL in der Mitte: Code-Übersetzung mit LLMs und Intermediate Representations

中文本不适用:配有LLMs和中级代表的代码翻译 2507.08627v1

Authors (4): Chi-en Amy Tai, Pengyu Nie, Lukasz Golab, Alexander Wong

Studies show that large language models (LLMs) produce buggy code translations. One avenue to improve translation accuracy is through intermediate representations, which could provide structured insights to guide the model’s understanding. We explore whether code translation using LLMs can benefit from intermediate representations via natural language (NL) and abstract syntax trees (ASTs). Since prompt engineering greatly affects LLM performance, we consider several ways to integrate these representations, from one-shot to chain-of-thought (CoT) prompting. Using Open Gpt4 8X7B and specialized StarCoder and CodeGen models on popular code translation benchmarks (CodeNet and AVATAR), we find that CoT with an intermediate NL summary performs best, with an increase of 13.8% and 6.7%, respectively, in successful translations for the best-performing model (Open Gpt4 8X7B) compared to the zero-shot prompt.

研究表明,大型语言模型(LLMs)产生错误代码翻译。提高翻译准确性的一个途径是通过中间演示,它可以提供结构化的洞察力来指导模型的理解。我们探讨使用LMs的代码翻译是否可以通过自然语言(NL)和抽象语法树(ASTs)从中间演示中受益。由于迅速工程对LLM的性能产生极大影响,我们考虑了将这些演示整合起来的几种方法,从一角到思维链(CoT),在通用代码翻译基准(CodeNet和AVATAR)中,使用Open Gpt4 8X7B和专门的Star Coder和代码Gen模型,我们发现带有中间NL摘要的COT表现最佳,与零点提示相比,最佳模型(Opt Gpt4 8X7B)的成功翻译分别增加了13.8%和6.7%。

Article 135

Title@2025-07-11 (5): Generating Proto-Personas through Prompt Engineering: A Case Study on Efficiency, Effectiveness and Empathy

Title: Generating Proto-Personas through Prompt Engineering: A Case Study on Efficiency, Effectiveness and Empathy

Proto-Personas durch Prompt Engineering generieren: Eine Fallstudie zu Effizienz, Effektivität und Empathie

通过即时工程产生个人方案:关于效率、有效性和冷漠的案例研究 2507.08594v1

Authors (8): Fernando Ayach, Vitor Lameirão, Raul Leão, Jerfferson Felizardo, Rafael Sobrinho, Vanessa Borges, Patrícia Matsubara, Awdren Fontão

Proto-personas are commonly used during early-stage Product Discovery, such as Lean Inception, to guide product definition and stakeholder alignment. However, the manual creation of proto-personas is often time-consuming, cognitively demanding, and prone to bias. In this paper, we propose and empirically investigate a prompt engineering-based approach to generate proto-personas with the support of Generative AI (GenAI). Our goal is to evaluate the approach in terms of efficiency, effectiveness, user acceptance, and the empathy elicited by the generated personas. We conducted a case study with 19 participants embedded in a real Lean Inception, employing a qualitative and quantitative methods design. The results reveal the approach’s efficiency by reducing time and effort and improving the quality and reusability of personas in later discovery phases, such as Minimum Viable Product (MVP) scoping and feature refinement. While acceptance was generally high, especially regarding perceived usefulness and ease of use, participants noted limitations related to generalization and domain specificity. Furthermore, although cognitive empathy was strongly supported, affective and behavioral empathy varied significantly across participants. These results contribute novel empirical evidence on how GenAI can be effectively integrated into software Product Discovery practices, while also identifying key challenges to be addressed in future iterations of such hybrid design processes.

在早期产品发现阶段,如Lean Inception,经常使用Proto-persons 来指导产品定义和利益攸关方协调。然而,人工生成proto-persons 往往耗费时间、认知要求和容易产生偏差。在本文件中,我们提出并实证地调查了一种基于迅速工程的方法,在创世的AI(GenAI)的支持下产生proto-peras。我们的目标是评估在效率、有效性、用户接受和生成者所激发的同情方面的做法。我们进行了一项案例研究,19名参与者嵌入了真正的Lean Inception,采用了定性和定量方法设计。结果通过减少时间和努力,提高后来发现阶段个人的质量和可恢复性,如最低可耐用产品范围界定和特征改进,揭示了该方法的效率。虽然普遍接受度很高,特别是在人们认识到的有用性和易用性方面,但参加者注意到与通用和域特性有关的局限性。我们进行了一项案例研究,但参加者对影响和行为共解有相当大的差异。这些结果揭示了该方法的效率,通过减少时间和努力,提高个人在以后发现后发现人性设计过程中的质量和可操作性。这些软件的新的证据证据证据,有助于确定未来版本设计过程。

Article 136

Title@2025-07-11 (5): ARPaCCino: An Agentic-RAG for Policy as Code Compliance

Title: ARPaCCino: An Agentic-RAG for Policy as Code Compliance

ARPaCCino: Eine Agentur-RAG für Politik als Code-Compliance

ARPACCino:作为《守则》合规政策的一个代理-RAG 2507.10584v1

Authors (6): Francesco Romeo, Luigi Arena, Francesco Blefari, Francesco Aurelio Pironti, Matteo Lupinacci, Angelo Furfaro

Policy as Code (PaC) is a paradigm that encodes security and compliance policies into machine-readable formats, enabling automated enforcement in Infrastructure as Code (IaC) environments. However, its adoption is hindered by the complexity of policy languages and the risk of misconfigurations. In this work, we present ARPaCCino, an agentic system that combines Large Language Models (LLMs), Retrieval-Augmented-Generation (RAG), and tool-based validation to automate the generation and verification of PaC rules. Given natural language descriptions of the desired policies, ARPaCCino generates formal Rego rules, assesses IaC compliance, and iteratively refines the IaC configurations to ensure conformance. Thanks to its modular agentic architecture and integration with external tools and knowledge bases, ARPaCCino supports policy validation across a wide range of technologies, including niche or emerging IaC frameworks. Experimental evaluation involving a Terraform-based case study demonstrates ARPaCCino’s effectiveness in generating syntactically and semantically correct policies, identifying non-compliant infrastructures, and applying corrective modifications, even when using smaller, open-weight LLMs. Our results highlight the potential of agentic RAG architectures to enhance the automation, reliability, and accessibility of PaC workflows.

由于《守则》(PaC)是将安全和合规政策纳入机器可读格式的范例,使基础设施自动执行成为《守则》(IaC)环境,但《守则》的采用受到政策语言复杂性和配置错误风险的阻碍。在这项工作中,我们介绍了《守则》,这是一个将大语言模型(LLMS)、Retreval-Auged-Eneration(RAG)和基于工具的验证相结合的代理系统,将《守则》规则的生成和核查自动化。鉴于对所希望的政策的自然语言描述,ARPaCCino生成了正式的Rego规则,评估了《守则》的遵守情况,并反复完善了《国际守则》的配置,以确保合规性。由于其模块化的代理结构以及与外部工具和知识基础的整合,《守则》支持了范围广泛的技术(包括定位框架或新兴的Iac框架)的政策验证。涉及基于Terraform的案例研究的实验性评价,表明《准则》在生成统一和语义正确的政策、确定《准则》的合规性规则、评估《国际数据库》的不合规性结构、在改进过程中采用我们的升级性结构时,甚至加强《准则的升级的系统。

Article 137

Title@2025-07-11 (5): InferLog: Accelerating LLM Inference for Online Log Parsing via ICL-oriented Prefix Caching

Title: InferLog: Accelerating LLM Inference for Online Log Parsing via ICL-oriented Prefix Caching

InferLog: Beschleunigung der LLM-Inferenz für das Online-Log Parsing über ICL-orientiertes Prefix-Caching

InferLog: 通过ICL 导向的前缀缓存加速在线日志解析的 LLM 推断 2507.08523v1

Authors (8): Yilun Wang, Pengfei Chen, Haiyu Huang, Zilong He, Gou Tan, Chuanfu Zhang, Jingkai He, Zibin Zheng

Modern software systems generate massive volumes of runtime logs, necessitating efficient and accurate log parsing to enable critical downstream tasks such as anomaly detection and root cause analysis. Recently, large language models (LLMs) have achieved advanced accuracy on log parsing, but their deployment in production environments faces two major limitations: (1) the privacy risks associated with commercial LLMs, which drive the adoption of local deployment, and (2) the stringent latency and throughput requirements imposed by high-volume log streams, which existing LLM-based parsers fail to meet. Although recent efforts have reduced the number of LLM queries, they overlook the high latency of LLM invocations, where concurrent log parsing requests can cause severe performance degradation of the LLM inference system. In this study, we present InferLog, the first LLM inference optimization method for online log parsing. Our key insight is that inference efficiency, rather than parsing accuracy, emerges as the vital bottleneck in LLM-based online log parsing. InferLog accelerates inference by designing (1) a Prefix-aware ICL Refinement policy that refines the examples and permutation of in-context learning to improve prefix caching efficiency, and (2) a rapid, task-specific configuration tuning pipeline based on meta-learning that finds the optimal LLM scheduling-related configuration for dynamic log parsing workloads. Experimental results based on the Loghub dataset and vLLM demonstrate that InferLog significantly outperforms existing inference optimization methods and markedly accelerates the state-of-the-art LLM-based log parser without compromising parsing accuracy.

现代软件系统产生大量运行时日志,需要高效和准确的日志分析,以便能够完成异常检测和根源原因分析等关键下游任务。最近,大型语言模型(LLM)在日志分析中实现了较高的准确性,但在生产环境中的部署面临两大限制:(1) 商业LLM的隐私风险,推动当地部署的采用,以及(2) 高容量的日志流规定的严格的潜伏和通量要求,而现有以LLM为基地的LLM对流无法满足。虽然最近的努力减少了LLM查询的数量,但它们忽略了LLM职业的高通度,在此情况下,同时的日志分析请求可导致LLM推断系统的性能退化。在这项研究中,我们介绍了LFLOM的第一个LOright优化方法。我们的主要了解是,由于基于LLMSM为主的在线日志进行的关键瓶颈分析,而不是精确度测算,因此测算速度加快了LLM的推论,通过设计(Prefix-Abreal developmental) 的精确度精确度精确度定值,从而大幅改进了在ICLMLMLisal-dealalalalal-deal-dealdaldald 的精确校正的校正的校正的校正的校正。
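
The InferLog abstract ties inference speed to how much of a prompt's prefix can be reused from the cache. The following is a minimal sketch of that idea only, not InferLog's actual refinement policy: it reorders the in-context examples chosen for a new request so that they extend the leading run of examples already used by the previous prompt. The function names and the example identifiers are hypothetical, and the sketch ignores the effect that example order may have on parsing accuracy.

```python
def shared_prefix_len(a, b):
    """Count how many leading items two example lists have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reorder_for_prefix(prev_examples, selected):
    """Reorder `selected` so it starts with the longest leading run of
    `prev_examples` whose items were all selected again (hypothetical policy)."""
    chosen = set(selected)
    head = []
    for example in prev_examples:
        if example not in chosen:
            break  # the reusable prefix ends at the first missing example
        head.append(example)
    reused = set(head)
    # Examples that cannot extend the cached prefix keep their original order.
    tail = [example for example in selected if example not in reused]
    return head + tail


if __name__ == "__main__":
    prev = ["e1", "e2", "e3"]
    selected = ["e3", "e9", "e1", "e2"]
    reordered = reorder_for_prefix(prev, selected)
    print(reordered, shared_prefix_len(prev, reordered))
```

In this toy case the original ordering shares no leading examples with the previous prompt, while the reordered list shares three, which is the kind of overlap a prefix cache can exploit.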

Article 138

Title@2025-07-11 (5): $\texttt{Droid}$: A Resource Suite for AI-Generated Code Detection

Title: $\texttt{Droid}$: A Resource Suite for AI-Generated Code Detection

$\texttt{Droid}$: Eine Ressourcen-Suite für KI-generierte Code-Erkennung

$\texttt{Droid}$: 用于 AI 生成代码检测的资源套件 2507.10583v1

Authors (4): Daniil Orel, Indraneil Paul, Iryna Gurevych, Preslav Nakov

In this work, we compile $\textbf{$\texttt{DroidCollection}$}$, the most extensive open data suite for training and evaluating machine-generated code detectors, comprising over a million code samples, seven programming languages, outputs from 43 coding models, and over three real-world coding domains. Alongside fully AI-generated samples, our collection includes human-AI co-authored code, as well as adversarial samples explicitly crafted to evade detection. Subsequently, we develop $\textbf{$\texttt{DroidDetect}$}$, a suite of encoder-only detectors trained using a multi-task objective over $\texttt{DroidCollection}$. Our experiments show that existing detectors’ performance fails to generalise to diverse coding domains and programming languages outside of their narrow training data. Additionally, we demonstrate that while most detectors are easily compromised by humanising the output distributions using superficial prompting and alignment approaches, this problem can be easily amended by training on a small amount of adversarial data. Finally, we demonstrate the effectiveness of metric learning and uncertainty-based resampling as means to enhance detector training on possibly noisy distributions.

在这项工作中,我们汇编了$textbf{$\textt{DroidCollection}$$,这是用于培训和评价机器生成代码探测器的最广泛的开放数据集,由100多万代码样本、7个编程语言、43个编码模型和3个真实世界编码域组成。除了完全由AI生成的样本外,我们收集的样本还包括人类-AI共同作者的代码,以及明确为逃避检测而设计的对立样本。随后,我们开发了$\textbf{$\textt{DroidSergy}$,这是一套只使用多任务目标($@textt{DroidCollection}$)来培训的编码器。我们的实验表明,现有探测器的性能无法在它们狭隘的培训数据之外对不同的编码域和编程语言进行概括化。此外,我们证明,虽然大多数探测器很容易由于使用表面提示和校准方法对产出分布进行人性化而受到影响,但这个问题很容易通过对少量的对抗性数据进行培训而得到修正。最后,我们展示了测量性学习和基于不确定性的分布的有效性,作为可能加强静态探测的手段。

Article 139

Title@2025-07-11 (5): Computing Floating-Point Errors by Injecting Perturbations

Title: Computing Floating-Point Errors by Injecting Perturbations

Berechnung von Floating-Point-Fehlern durch Einspritzen von Perturbationen

通过注射扰动输入,计算浮点误差 2507.08467v1

Authors (6): Youshuai Tan, Zhanwei Zhang, Jinfu Chen, Zishuo Ding, Jifeng Xuan, Weiyi Shang

Floating-point programs form the foundation of modern science and engineering, providing the essential computational framework for a wide range of applications, such as safety-critical systems, aerospace engineering, and financial analysis. Floating-point errors can lead to severe consequences. Although floating-point errors are widespread, only a subset of inputs may trigger significant errors in floating-point programs. Therefore, it is crucial to determine whether a given input could produce such errors. Researchers tend to take the results of high-precision floating-point programs as oracles for detecting floating-point errors, which introduces two main limitations: (1) difficulty of implementation and (2) prolonged execution time. Two recent tools, ATOMU and FPCC, can partially address these issues. However, ATOMU suffers from false positives, while FPCC, though it eliminates false positives, runs considerably slower. To address these two challenges, we propose a novel approach named PI-detector to compute floating-point errors effectively and efficiently. Our approach is based on the observation that floating-point errors stem from large condition numbers in atomic operations (such as addition and subtraction), which then propagate and accumulate. PI-detector injects small perturbations into the operands of individual atomic operations within the program and compares the outcomes of the original program with the perturbed version to compute floating-point errors. We evaluate PI-detector with datasets from ATOMU and HSED, as well as a complex linear system-solving program. Experimental results demonstrate that PI-detector can perform efficient and accurate floating-point error computation.

浮动点程序构成了现代科学和工程的基础,为安全临界系统、航空航天工程和金融分析等广泛应用提供了基本的计算框架。浮动点错误可能导致严重后果。尽管浮动点错误广泛存在, 但只有一组投入可能会在浮动点程序中引发重大错误。因此, 关键是要确定某个特定输入是否会产生这样的错误。研究人员倾向于将高精度浮点程序的结果作为探测浮点错误的奥克莱, 这带来了两个主要的局限性:(1) 执行困难和(2) 执行时间过长。最近的两个工具,即ATOMU 和 FPCC, 可以部分解决这些问题。然而, 浮点错误的误差可能存在。尽管浮点错误普遍存在, 但只有一组投入可能会在浮动点程序中引发重大错误。因此, 关键是要确定一个名为PI 检测器的新办法, 以高效和高效的方式计算浮点错误。我们的方法基于这样的观察,即浮点错误来自原子操作中大型条件(例如添加和减缩缩放) 。浮点点的运行过程会以移动点的原始操作结果和存储点程序进行。我们的方法可以显示, 移动点的硬点的计算结果, 和存储点的计算程序会显示, 的原始操作会以每个的硬点的硬值的计算结果。
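
The perturbation idea in the PI-detector abstract can be illustrated with a small sketch. It is not the authors' tool: the perturbation size `EPS`, the helper names, and the test function are assumptions chosen for illustration. One operand of an atomic operation is nudged by a tiny relative amount, and the relative change in the final result is compared between an ill-conditioned formula and an algebraically equal stable one.

```python
import math

# Hypothetical relative perturbation, chosen to sit well above the
# double-precision unit roundoff (2**-53) so that it survives rounding.
EPS = 2.0 ** -45


def perturb(value, enabled):
    """Return value scaled by (1 + EPS) when injection is enabled."""
    return value * (1.0 + EPS) if enabled else value


def naive(x, inject=False):
    """sqrt(x + 1) - sqrt(x): the subtraction cancels badly for large x."""
    a = perturb(math.sqrt(x + 1.0), inject)  # perturb one operand of the subtraction
    return a - math.sqrt(x)


def stable(x, inject=False):
    """Algebraically equal form 1 / (sqrt(x + 1) + sqrt(x)), with no cancellation."""
    a = perturb(math.sqrt(x + 1.0), inject)  # same operand, now feeding an addition
    return 1.0 / (a + math.sqrt(x))


def relative_change(f, x):
    """Relative difference between the original and the perturbed run."""
    base = f(x)
    perturbed = f(x, inject=True)
    return abs(perturbed - base) / abs(base)


if __name__ == "__main__":
    x = 1e12
    print("naive :", relative_change(naive, x))
    print("stable:", relative_change(stable, x))
```

For large `x` the subtraction amplifies the injected perturbation by many orders of magnitude, while the stable form passes it through almost unchanged. That gap is the signal this kind of comparison relies on, without needing a high-precision oracle.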

Article 140

Title@2025-07-11 (5): ProvideQ: A Quantum Optimization Toolbox

Title: ProvideQ: A Quantum Optimization Toolbox

ProvideQ: Eine Quantum-Optimierungs-Toolbox

提供 Q: 量图优化工具箱 2507.07649v2

Authors (4): Domenik Eichhorn, Nick Poser, Maximilian Schweikart, Ina Schaefer

Hybrid solvers for combinatorial optimization problems combine the advantages of classical and quantum computing to overcome difficult computational challenges. Although their theoretical performance seems promising, their practical applicability is challenging due to the lack of a technological stack that can seamlessly integrate quantum solutions with existing classical optimization frameworks. We tackle this challenge by introducing the ProvideQ toolbox, a software tool that enables users to easily adapt and configure hybrid solvers via Meta-Solver strategies. A Meta-Solver strategy implements decomposition techniques, which split problems into classical and quantum subroutines. The ProvideQ toolbox enables the interactive creation of such decompositions via a Meta-Solver configuration tool. It combines well-established classical optimization techniques with quantum circuits that are seamlessly executable on multiple backends. This paper introduces the technical details of the ProvideQ toolbox, explains its architecture, and demonstrates possible applications for several real-world use cases. Our proof of concept shows that Meta-Solver strategies already enable the application of quantum subroutines today; however, more sophisticated hardware is required to make their performance competitive.

组合优化问题的混合解析器将古典计算和量子计算的优势结合起来,以克服困难的计算挑战。虽然它们的理论性能似乎很有希望,但由于缺乏能够将量子解决方案与现有的经典优化框架无缝地整合起来的技术堆叠,它们的实际适用性是具有挑战性的。我们通过引入“提供Q”工具箱来应对这一挑战,这是一个软件工具,使用户能够通过Meta-Solver战略方便地调整和配置混合解析器。一个元-Solver战略实施分解技术,将问题分为经典和量子例。“提供Q”工具箱使得能够通过一个元-Solver配置工具交互生成这种分解。它将成熟的经典优化技术与可在多个后端上无缝地执行的量子路路结合起来。本文介绍了“提供Q”工具箱的技术细节,解释了其结构,并展示了几个现实世界使用案例的可能应用。我们的概念证明,“提供”战略已经使得今天能够应用量子子流的硬件成为了应用。

Article 141

Title@2025-07-11 (5): Leveraging Large Language Models for Classifying App Users’ Feedback

Title: Leveraging Large Language Models for Classifying App Users’ Feedback

Nutzung von großen Sprachmodellen zur Klassifizierung des Feedbacks von App-Nutzern

利用大语言模型对应用程序用户的反馈进行分类 2507.08250v1

Authors (2): Yasaman Abedini, Abbas Heydarnoori

In recent years, significant research has been conducted into classifying application (app) user feedback, primarily relying on supervised machine learning algorithms. However, fine-tuning more generalizable classifiers based on existing labeled datasets remains an important challenge, as creating large and accurately labeled datasets often requires considerable time and resources. In this paper, we evaluate the capabilities of four advanced LLMs (GPT-3.5-Turbo, GPT-4, Flan-T5, and Llama3-70b) to enhance user feedback classification and address the challenge of limited labeled datasets. To achieve this, we conduct several experiments on eight datasets that have been meticulously labeled in prior research. These datasets include user reviews from app stores, posts from the X platform, and discussions from public forums, widely recognized as representative sources of app user feedback. We analyze the performance of various LLMs in identifying both fine-grained and coarse-grained user feedback categories. Given the substantial volume of daily user feedback and the computational limitations of LLMs, we leverage these models as an annotation tool to augment labeled datasets with general and app-specific data. This augmentation aims to enhance the performance of state-of-the-art BERT-based classification models. Our findings indicate that LLMs, when guided by well-crafted prompts, can effectively classify user feedback into coarse-grained categories. Moreover, augmenting the training dataset with datasets labeled using the consensus of LLMs can significantly enhance classifier performance.

近年来,对应用(应用)用户反馈进行了大量分类,主要依靠受监督的机器学习算法;然而,根据现有标签数据集对8个数据集进行微调,根据现有标签数据集进行更宽泛的分类,这仍然是一个重大挑战,因为创建大而准确的标签数据集往往需要大量的时间和资源;在本文件中,我们评估了4个高级LLMS的能力,包括GPT-3.5-Turbo、GPT-4、GPT-4、Flan-T5和Llama3-70b,以加强用户反馈分类,并应对有限标签数据集的挑战。为此,我们对先前研究中精心标明的8个数据集进行了若干试验。这些数据集包括来自软件仓库、X平台的用户审查以及公共论坛的讨论,被广泛承认为有代表性的用户反馈来源。我们分析了各种LLMMS在确定精细和粗化用户反馈类别方面的绩效。鉴于每日用户反馈数量庞大,而且LLMMS的计算局限性。我们利用这些模型作为说明工具,在以往研究中进行精确的LLMMS的分类,通过提高我们的数据质量的升级数据分类,从而提高我们的数据质量,从而提高通用和软件的质量。
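
The abstract above mentions labeling extra data "using the consensus of LLMs". A minimal sketch of one way such a consensus could be computed follows. The agreement rule (a strict majority plus a minimum vote count), the function names, and the label strings are assumptions for illustration; the paper's exact procedure is not given in the abstract.

```python
from collections import Counter


def consensus_label(votes, min_agree):
    """Return the label shared by a strict majority of at least `min_agree`
    models, or None when the models do not agree strongly enough."""
    if not votes:
        return None
    label, count = Counter(votes).most_common(1)[0]
    if count >= min_agree and count * 2 > len(votes):
        return label
    return None


def augment(labeled, candidates, min_agree):
    """Append consensus-labeled candidates to an existing labeled dataset.

    `labeled` is a list of (text, label) pairs; `candidates` is a list of
    (text, votes) pairs, where `votes` holds one label per model."""
    extra = []
    for text, votes in candidates:
        label = consensus_label(votes, min_agree)
        if label is not None:  # drop items the models disagree on
            extra.append((text, label))
    return labeled + extra


if __name__ == "__main__":
    seed = [("App crashes on login", "bug")]
    pool = [
        ("Please add dark mode", ["feature", "feature", "feature", "other"]),
        ("Meh", ["other", "bug", "feature", "other"]),
    ]
    print(augment(seed, pool, min_agree=3))
```

Items without sufficient agreement are discarded rather than guessed, which trades dataset size for label quality.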

Article 142

Title@2025-07-10 (4): KP-A: A Unified Network Knowledge Plane for Catalyzing Agentic Network Intelligence

Title: KP-A: A Unified Network Knowledge Plane for Catalyzing Agentic Network Intelligence

KP-A: Eine einheitliche Netzwerk-Wissensplattform für katalysierende Agentische Netzwerk-Intelligenz

KP-A:一个用于催化剂网络情报的统一网络知识平台 2507.08164v1

Authors (5): Yun Tang, Mengbang Zou, Zeinab Nezami, Syed Ali Raza Zaidi, Weisi Guo

The emergence of large language models (LLMs) and agentic systems is enabling autonomous 6G networks with advanced intelligence, including self-configuration, self-optimization, and self-healing. However, the current implementation of individual intelligence tasks necessitates isolated knowledge retrieval pipelines, resulting in redundant data flows and inconsistent interpretations. Inspired by the service model unification effort in Open-RAN (to support interoperability and vendor diversity), we propose KP-A: a unified Network Knowledge Plane specifically designed for Agentic network intelligence. By decoupling network knowledge acquisition and management from intelligence logic, KP-A streamlines development and reduces maintenance complexity for intelligence engineers. By offering an intuitive and consistent knowledge interface, KP-A also enhances interoperability for the network intelligence agents. We demonstrate KP-A in two representative intelligence tasks: live network knowledge Q&A and edge AI service orchestration. All implementation artifacts have been open-sourced to support reproducibility and future standardization efforts.

大型语言模型(LLMS)和代理系统的出现使6G自主网络能够与拥有先进情报的6G自主网络连接起来,包括自我配置、自我优化和自我愈合;然而,由于目前执行个别情报任务,需要孤立的知识检索管道,造成重复的数据流和不一致的解释;在开放-RAN服务模式统一努力(支持互操作性和供应商多样性)的启发下,我们提议KP-A:为代理网络情报专门设计的统一网络知识计划;通过将网络知识的获取和管理与情报逻辑脱钩,KP-A精简开发并降低情报工程师的维护复杂性;通过提供直观和一致的知识界面,KP-A还加强了网络情报代理人的互操作性。我们在两个有代表性的情报任务中展示KP-A:实时网络知识 QA和边缘AI服务协调。所有实施工艺品都是公开来源,以支持再生和今后的标准化努力。

Article 143

Title@2025-07-10 (4): The Impact of Generative AI on Code Expertise Models: An Exploratory Study

Title: The Impact of Generative AI on Code Expertise Models: An Exploratory Study

Die Auswirkungen generativer KI auf Code-Expertise-Modelle: Eine Sondierungsstudie

《创世大赦国际对守则专门知识模型的影响:探索性研究》 2507.08160v1

Authors (2): Otávio Cury, Guilherme Avelino

Generative Artificial Intelligence (GenAI) tools for source code generation have significantly boosted productivity in software development. However, they also raise concerns, particularly the risk that developers may rely heavily on these tools, reducing their understanding of the generated code. We hypothesize that this loss of understanding may be reflected in source code knowledge models, which are used to identify developer expertise. In this work, we present an exploratory analysis of how a knowledge model and a Truck Factor algorithm built upon it can be affected by GenAI usage. To investigate this, we collected statistical data on the integration of ChatGPT-generated code into GitHub projects and simulated various scenarios by adjusting the degree of GenAI contribution. Our findings reveal that most scenarios led to measurable impacts, indicating the sensitivity of current expertise metrics. This suggests that as GenAI becomes more integrated into development workflows, the reliability of such metrics may decrease.

用于源代码生成的人工智能(GenAI)工具极大地提高了软件开发的生产率,但也引起了人们的关切,特别是开发商可能严重依赖这些工具的风险,降低了他们对生成代码的理解程度。我们假设这种理解的丧失可能反映在源代码知识模型中,这些模型用于确定开发者的专门知识。在这项工作中,我们提出探索性分析,说明知识模型和基于该模型的卡车因数算法如何会受到GenAI的使用情况的影响。为了调查这一点,我们收集了统计数据,说明将热电偶生成的代码纳入GitHub项目和模拟各种情景,调整GentHub的贡献程度。我们的调查结果显示,大多数情景都产生了可衡量的影响,表明当前专门知识指标的敏感性。这表明,随着GenAI更多地融入发展工作流程,这类指标的可靠性可能会降低。
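
To make the notion of a Truck Factor concrete, here is a simplified greedy sketch. It assumes a knowledge model has already produced a map from each file to its set of knowledgeable authors, and it counts how many top authors must leave before more than half of the files have no author left. The data, the 0.5 threshold, and the alphabetical tie-break are illustrative assumptions, not the specific algorithm evaluated in the study.

```python
from collections import Counter


def truck_factor(file_authors, threshold=0.5):
    """Greedily remove the author covering the most files until more than
    `threshold` of all files have no remaining author; return the number removed."""
    remaining = {path: set(authors) for path, authors in file_authors.items()}
    total = len(remaining)
    removed = 0
    while total > 0:
        orphaned = sum(1 for authors in remaining.values() if not authors)
        if orphaned > threshold * total:
            break
        counts = Counter(dev for authors in remaining.values() for dev in authors)
        if not counts:
            break  # nobody left to remove
        # Alphabetical order makes ties deterministic.
        top = max(sorted(counts), key=lambda dev: counts[dev])
        for authors in remaining.values():
            authors.discard(top)
        removed += 1
    return removed


if __name__ == "__main__":
    concentrated = {"a.py": {"alice"}, "b.py": {"alice"}, "c.py": {"alice"}, "d.py": {"bob"}}
    shared = {"a.py": {"alice"}, "b.py": {"alice"}, "c.py": {"alice", "bob"}, "d.py": {"bob"}}
    print(truck_factor(concentrated), truck_factor(shared))
```

Because the result depends entirely on who the knowledge model credits for each file, any shift in that attribution changes the input map and can change the computed value.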

Article 144

Title@2025-07-10 (4): Code with Me or for Me? How Increasing AI Automation Transforms Developer Workflows

Title: Code with Me or for Me? How Increasing AI Automation Transforms Developer Workflows

Code mit mir oder für mich? Wie die zunehmende KI-Automatisierung Entwickler-Workflows transformiert

如何增加 AI 自动转换开发者工作流程 2507.08149v1

Authors (4): Valerie Chen, Ameet Talwalkar, Robert Brennan, Graham Neubig

Developers now have access to a growing array of increasingly autonomous AI tools to support software development. While numerous studies have examined developer use of copilots, which can provide chat assistance or code completions, evaluations of coding agents, which can automatically write files and run code, still largely rely on static benchmarks without humans-in-the-loop. In this work, we conduct the first academic study to explore developer interactions with coding agents and characterize how more autonomous AI tools affect user productivity and experience, compared to existing copilots. We evaluate two leading copilot and agentic coding assistants, GitHub Copilot and OpenHands, recruiting participants who regularly use the former. Our results show agents have the potential to assist developers in ways that surpass copilots (e.g., completing tasks that humans might not have accomplished before) and reduce the user effort required to complete tasks. However, there are challenges involved in enabling their broader adoption, including how to ensure users have an adequate understanding of agent behaviors. Our results not only provide insights into how developer workflows change as a result of coding agents but also highlight how user interactions with agents differ from those with existing copilots, motivating a set of recommendations for researchers building new agents. Given the broad set of developers who still largely rely on copilot-like systems, our work highlights key challenges of adopting more agentic systems into developer workflows.

开发者现在可以获取越来越多的日益自主的AI工具,以支持软件开发。虽然许多研究都审查了开发者使用可提供聊天协助或代码完成的副驾驶,但对于可自动写入文档和运行代码的编码代理商的评价在很大程度上仍然依赖静态基准,而没有环绕中的人类,在这项工作中,我们进行了第一次学术研究,探索开发者与编码代理商的互动,并描述与现有共同试办相比,自主性更强的AI工具如何影响用户生产力和经验。我们评估了两个主要联合试办和代理编码助理,即GitHub Copil和OpenHands,招聘经常使用前者的参与者。我们的结果显示,代理商有可能以超越共同试办的方式协助开发者(例如,完成人类以前可能没有完成的任务),并减少完成任务所需的用户努力。然而,在使开发者更广泛地采用这些工具方面存在挑战,包括如何确保用户充分理解代理商的行为。我们的结果不仅有助于了解开发者工作流程如何作为编码代理商的一项结果,而且还能在很大程度上帮助开发者与开发者进行互动,同时强调用户与开发者之间的广泛互动,这些代理商如何依赖现有的研发系统。

Article 145

Title@2025-07-10 (4): The State of Computational Science in Fission and Fusion Energy

Title: The State of Computational Science in Fission and Fusion Energy

Der Zustand der Computational Science in Fission und Fusionsenergie

裂变和聚变能源的计算科学状况 2507.08061v1

Authors (2): Andrea Morales Coto, Aditi Verma

The tools used to engineer something are just as important as the thing that is actually being engineered. In fact, in many cases, the tools can indeed determine what is engineerable. In fusion and fission energy engineering, software has become the dominant tool for design. For that reason, in 2024, for the first time ever, we asked 103 computational scientists developing the codes used in fusion and fission energy about the problems they are attempting to solve with their codes, the tools available to them to solve them, and their end-to-end developer experience with said tools. The results revealed a changing tide in software tools in fusion and fission, with more and more computational scientists preferring modern programming languages, open-source codes, and modular software. These trends represent a peek into what will happen 5 to 10 years in the future of nuclear engineering. Since the majority of our respondents belonged to US national labs and universities, these results hint at the most cutting-edge trends in the industry. The insights included in the State of Computational Science in Fission and Fusion Energy indicate a dramatic shift toward multiphysics codes, a drop-off in the use of FORTRAN in favor of more modern languages like Python and C++, and ever-rising budgets for code development, at times reaching $50M in a single organization. Our survey paints a future of nuclear engineering codes that is modular in nature, small in terms of compute, and increasingly prioritized by organizations. Our results are available online in web form.

用于制造某种材料的工具与实际设计的工具一样重要。事实上,在许多情况下,这些工具确实可以决定什么是可工程师的。在聚变和裂变1能源工程中,软件已成为设计的主要工具。因此,2024年,我们首次询问103名计算科学家,他们开发聚变和裂变能源所使用的代码,了解他们试图用其代码解决的问题,他们可利用的工具,以及他们最终用上述工具开发的经验。结果显示,在聚变和裂变的软件工具中,越来越多的计算科学家更喜欢现代编程语言、开放源代码和模块软件。这些趋势代表着对未来核工程5到10年中将发生的事情的偷窥。由于我们大多数被调查者属于美国国家实验室和大学,这些结果暗示了工业中最尖端的趋势。在Fission 和变异变能源中包含的比较科学状态表明,在多物理代码方面出现了巨大的变化,越来越多的计算工具, 越来越多的计算学家倾向于使用现代编程和将来的编程预算。

Article 146

Title@2025-07-10 (4): QCP: A Practical Separation Logic-based C Program Verification Tool

Title: QCP: A Practical Separation Logic-based C Program Verification Tool

QCP: Eine praktische Trennung Logisch-basiertes C-Programm Verifikationswerkzeug

QCP:基于实际隔离逻辑的C方案核查工具 2505.12878v2

Authors (13): Xiwei Wu, Yueyang Feng, Xiaoyang Lu, Tianchuan Lin, Kan Liu, Zhiyi Wang, Shushu Wu, Lihan Xie, Chengxi Yang, Hongyi Zhong, Naijun Zhan, Zhenjiang Hu, Qinxiang Cao

As software systems grow dramatically in size and complexity, ensuring their correctness, security, and reliability becomes an increasingly formidable challenge. Despite significant advancements in verification techniques and tools, substantial difficulties remain when applying these tools to complex, real-world scenarios. To address these difficulties, this paper introduces a novel verification tool called Qualified C Programming Verifier (QCP). QCP incorporates a refined front-end assertion language to enhance user interaction. The proposed assertion language aims to lower the entry barrier for verification tools, improve proof efficiency by improving automation, and facilitate a deeper understanding of both the program and its verification results.

随着软件系统规模和复杂性的急剧扩大,确保其正确性、安全和可靠性成为日益艰巨的挑战。尽管在核查技术和工具方面有了显著的进步,但是仍然有%这些工具在将这些工具应用于复杂的现实世界情景方面仍然面临着巨大的困难。为了解决这些困难,本文件引入了一种新型的核查工具,称为\ textb-ZQATIC(QCP)}。QCP采用了一种精细的前端 %syngycast 语言,以加强用户的互动。拟议的主张语言旨在使用 %syngyx,目的是降低核查工具的进入屏障,通过改进自动化提高证明效率,并促进对程序及其核查结果的更深入了解。

Article 147

Title@2025-07-10 (4): Open Source, Hidden Costs: A Systematic Literature Review on OSS License Management

Title: Open Source, Hidden Costs: A Systematic Literature Review on OSS License Management

Offene Quelle, versteckte Kosten: Ein systematischer Literaturbericht über OSS-Lizenzverwaltung

开放源码,隐藏成本:开放源码软件许可证管理的系统文献审查 2507.05270v2

Authors (6): Boyuan Li, Chengwei Liu, Lingling Fan, Sen Chen, Zhenlin Zhang, Zheli Liu

Integrating third-party software components is a common practice in modern software development, offering significant advantages in terms of efficiency and innovation. However, this practice is fraught with risks related to software licensing. A lack of understanding may lead to disputes, which can pose serious legal and operational challenges. To this end, both academia and industry have conducted various investigations and proposed solutions and tools to deal with these challenges. However, significant limitations remain. Moreover, the rapid evolution of open-source software (OSS) licenses, as well as the rapidly incorporated generative software engineering techniques, such as large language models for code (CodeLLMs), are placing greater demands on the systematic management of software license risks. To unveil the severe challenges and explore possible future directions, we conduct the first systematic literature review (SLR) on 80 carefully selected OSS license-related papers, classifying existing research into three key categories, i.e., license identification, license risk assessment, and license risk mitigation. Based on these, we discuss challenges in existing solutions, identify opportunities that shed light on future research directions, and offer practical recommendations for practitioners. We hope this thorough review will help bridge the gaps between academia and industry and accelerate the ecosystem-wide governance of legitimate software risks within the software engineering community.

整合第三方软件组件是现代软件开发的常见做法,在效率和创新方面有很大的优势。然而,这种做法充满了软件许可证发放方面的风险。缺乏了解可能导致争议,并可能造成严重的法律和业务挑战。为此,学术界和工业界开展了各种调查,提出了应对这些挑战的解决办法和工具。然而,仍然存在重大限制。此外,开放源软件(OSS)许可证的迅速发展,以及迅速纳入的基因化软件工程技术,如大型代码语言模型(CodeLLMS),对软件许可证风险的系统管理提出了更大的要求。为了揭示严峻的挑战并探索可能的未来方向,我们首次对80份精心挑选的开放源码软件许可证相关文件进行系统文献审查(SLR),将现有研究分为三大类别,即许可证识别、许可证风险评估和降低许可证风险。在此基础上,我们讨论了现有解决方案中的挑战,总结了阐明未来研究方向的机会,并向从业人员提出切实可行的建议。我们希望,这一彻底审查将有助于弥合学术界和业界之间在合法软件方面的风险,加快整个生态系统治理。

Article 148

Title@2025-07-10 (4): Open-source automatic pipeline for efficient conversion of large-scale point clouds to IFC format

Title: Open-source automatic pipeline for efficient conversion of large-scale point clouds to IFC format

Open-Source-Automatische Pipeline für die effiziente Umwandlung von großflächigen Punktwolken in IFC-Format

将大型点云有效转换成国际金融公司格式的开放源自动管道 2503.11498v3

Authors (2): Slávek Zbirovský, Václav Nežerka

Building Information Modeling (BIM) is an essential component in the sustainable reconstruction and revitalization of ageing structures. However, model creation usually relies on laborious manual transformation of the unstructured point cloud data provided by laser scans or photogrammetry. This paper presents Cloud2BIM, an open-source software tool designed to automate the conversion of point clouds into BIM models compliant with the Industry Foundation Classes (IFC) standard. Cloud2BIM integrates advanced algorithms for wall and slab segmentation, opening detection, and room zoning based on real wall surfaces, resulting in a comprehensive and fully automated workflow. Unlike existing tools, it avoids computationally and calibration-intensive techniques such as RANSAC, supports non-orthogonal geometries, and provides unprecedented processing speed, achieving results up to seven times faster than the fastest competing solutions. Systematic validation using benchmark datasets confirms that Cloud2BIM is an easy-to-use, efficient, and scalable solution for generating accurate BIM models, capable of converting extensive point cloud datasets for entire buildings into IFC format with minimal user input.

建筑信息建模(BIM)是可持续重建和振兴老化结构的基本组成部分,但是,模型的创建通常依赖于由激光扫描或摄影测量提供的非结构化点云数据人工转换。本文展示了Cloud2BIM,这是一个开放源软件工具,旨在按照工业基础类标准将点云自动转换成BIM模型。Cloud2BIM结合了基于实际墙面的墙壁和板块分割、开启探测和房间分区的先进算法,从而形成一个全面和完全自动化的工作流程。它与现有工具不同,它避免了计算和校准密集技术,如RANSAC,支持非垂直的地理分布,提供了前所未有的处理速度达比最快的解决方案快七倍的超速结果。使用基准数据集进行的系统验证证实Clod2BIM是一种容易使用、高效和可扩展的解决方案,可以生成准确的BIM模型,能够将整个建筑的广点云数据集转换成IFCFC格式,用户投入极少。

Article 149

Title@2025-07-10 (4): From Domain Documents to Requirements: Retrieval-Augmented Generation in the Space Industry

Title: From Domain Documents to Requirements: Retrieval-Augmented Generation in the Space Industry

Von Domänendokumenten zu Anforderungen: Retrieval-Augmented Generation in der Raumfahrtindustrie

从域文档到要求:空间工业中回收利用-增强的一代人 2507.07689v1

Authors (5): Chetan Arora, Fanyu Wang, Chakkrit Tantithamthavorn, Aldeida Aleti, Shaun Kenyon

Requirements engineering (RE) in the space industry is inherently complex, demanding high precision, alignment with rigorous standards, and adaptability to mission-specific constraints. Smaller space organisations and new entrants often struggle to derive actionable requirements from extensive, unstructured documents such as mission briefs, interface specifications, and regulatory standards. In this innovation opportunity paper, we explore the potential of Retrieval-Augmented Generation (RAG) models to support and (semi-)automate requirements generation in the space domain. We present a modular, AI-driven approach that preprocesses raw space mission documents, classifies them into semantically meaningful categories, retrieves contextually relevant content from domain standards, and synthesises draft requirements using large language models (LLMs). We apply the approach to a real-world mission document from the space domain to demonstrate feasibility and assess early outcomes in collaboration with our industry partner, Starbound Space Solutions. Our preliminary results indicate that the approach can reduce manual effort, improve coverage of relevant requirements, and support lightweight compliance alignment. We outline a roadmap toward broader integration of AI in RE workflows, intending to lower barriers for smaller organisations to participate in large-scale, safety-critical missions.

空间工业的工程要求(RE)具有内在的复杂性,要求高度精确,符合严格的标准,并适应特定任务的限制。较小型的空间组织和新进入者往往努力从广泛的、非结构化的文件(如飞行任务简报、界面规格和监管标准)中获取可操作的要求。在这个创新机会文件中,我们探索了再回收新一代(RAG)模型的潜力,以支持和(半)自动生成空间领域的要求。我们提出了一个模块化的、由AI驱动的方法,该方法处理原始空间飞行任务文件,将其分为具有实际意义的类别,从大语言模型(LLMS)中检索域标准中与环境相关的内容,并综合了要求草案。我们从空间领域对现实世界飞行任务文件采用这一方法,以展示可行性,并与我们的工业伙伴“星界空间解决方案”合作评估早期成果。我们的初步结果表明,该方法可以减少人工工作,改进相关要求的覆盖面,并支持轻度的合规调整。我们概述了将AI更广泛地纳入电子工作流程的路线图,打算降低小型组织参加大规模、安全性飞行任务的障碍。

Article 150

Title@2025-07-10 (4): Prompt Engineering for Requirements Engineering: A Literature Review and Roadmap

Title: Prompt Engineering for Requirements Engineering: A Literature Review and Roadmap

Prompt Engineering for Requirements Engineering: Literature Review und Roadmap

工程:文学审查和路线图 2507.07682v1

Authors (4): Kaicheng Huang, Fanyu Wang, Yutan Huang, Chetan Arora

Advancements in large language models (LLMs) have led to a surge of prompt engineering (PE) techniques that can enhance various requirements engineering (RE) tasks. However, current LLMs are often characterized by significant uncertainty and a lack of controllability. The absence of clear guidance on how to prompt LLMs effectively acts as a barrier to their trustworthy implementation in the RE field. We present the first roadmap-oriented systematic literature review of Prompt Engineering for RE (PE4RE). Following Kitchenham’s and Petersen’s secondary-study protocol, we searched six digital libraries, screened 867 records, and analyzed 35 primary studies. To bring order to a fragmented landscape, we propose a hybrid taxonomy that links technique-oriented patterns (e.g., few-shot, Chain-of-Thought) to task-oriented RE roles (elicitation, validation, traceability). Two research questions, with five sub-questions, map the tasks addressed, LLM families used, and prompt types adopted, and expose current limitations and research gaps. Finally, we outline a step-by-step roadmap showing how today’s ad-hoc PE prototypes can evolve into reproducible, practitioner-friendly workflows.

大型语言模型(LLMS)的进步导致迅速工程技术(PE)的激增,这些技术可以加强各种要求工程(RE)任务;然而,目前的LLMS往往具有重大的不确定性和缺乏可控性的特点,在如何有效促使LLMS成为其在RE领域可信赖的执行的障碍方面缺乏明确的指导;我们提出了第一份面向路线图的系统文献审查报告《RE快速工程(PE4RE)》(PE4RE),根据Kitchenham和Peter的副研究协议,我们搜索了六个数字图书馆,筛选了867个记录,分析了35项基本研究;为了给分散的地貌带来秩序,我们建议采用一种混合分类方法,将面向技术的模式(例如,少量的、链式的)与面向任务的RE角色(引文、鉴定、可追溯性)联系起来;两个研究问题,包括五个子问题,绘制了所处理的任务、LM家庭使用和迅速采用的任务图,并暴露了目前的局限性和研究差距;最后,我们概述了一个分步骤绘制的路线图,显示今天的PE原型能如何演变成可复制的工作流程。

Article 151

Title@2025-07-10 (4): Quantum Executor: A Unified Interface for Quantum Computing

Title: Quantum Executor: A Unified Interface for Quantum Computing

Quantum Executor: Ein einheitliches Interface für Quantum Computing

量图执行器: 量数计算的统一界面 2507.07597v1

Authors (3): Giuseppe Bisicchia, Alessandro Bocci, Antonio Brogi

As quantum computing evolves from theoretical promise to practical deployment, the demand for robust, portable, and scalable tools for quantum software experimentation is growing. This paper introduces Quantum Executor, a backend-agnostic execution engine designed to orchestrate quantum experiments across heterogeneous platforms. Quantum Executor provides a declarative and modular interface that decouples experiment design from backend execution, enabling seamless interoperability and code reuse across diverse quantum and classical resources. Key features include support for asynchronous and distributed execution, customizable execution strategies, and a unified API for managing quantum experiments. We illustrate its applicability through two realistic usage scenarios, automated benchmarking and hybrid validation, and discuss its capacity to streamline quantum development. We conclude by discussing current limitations and outlining a roadmap for future enhancements.

随着量子计算从理论承诺演变为实际应用,对强力、可移植和可缩放的量子软件实验工具的需求正在增长。本文件介绍了量子软件实验的“量子执行器 ” 。 Qantum 执行器是一个后端的、不可知的执行引擎,旨在在不同平台协调量子实验。量子执行器提供了一个宣示和模块界面,将实验设计与后端执行脱钩,在不同量子和古典资源之间实现无缝互操作性和代码再利用。关键特征包括支持无同步和分散的执行、可定制的执行策略以及管理量子实验的统一的“API ” 。我们通过两种类似生命的情景,例如自动化基准和混合验证,来说明其适用性,讨论其简化量子开发的能力。我们通过讨论目前的局限性和为未来增强制定路线图来结束我们的讨论。

Article 152

Title@2025-07-10 (4): From Requirements to Code: Understanding Developer Practices in LLM-Assisted Software Engineering

Von Anforderungen zum Code: Entwickler-Praxis in LLM-Assisted Software Engineering verstehen

从需求到代码:了解LLM辅助软件工程中开发者的实践 2507.07548v1

Authors (3): Jonathan Ullrich, Matthias Koch, Andreas Vogelsang

With the advent of generative LLMs and their advanced code generation capabilities, some people already envision the end of traditional software engineering, as LLMs may be able to produce high-quality code based solely on the requirements a domain expert feeds into the system. The feasibility of this vision can be assessed by understanding how developers currently incorporate requirements when using LLMs for code generation, a topic that remains largely unexplored. We interviewed 18 practitioners from 14 companies to understand how they (re)use information from requirements and other design artifacts to feed LLMs when generating code. Based on our findings, we propose a theory that explains the processes developers employ and the artifacts they rely on. Our theory suggests that requirements, as typically documented, are too abstract for direct input into LLMs. Instead, they must first be manually decomposed into programming tasks, which are then enriched with design decisions and architectural constraints before being used in prompts. Our study highlights that fundamental RE work is still necessary when LLMs are used to generate code. Our theory is important for contextualizing scientific approaches to automating requirements-centric SE tasks.

随着生成式LLM及其先进的代码生成能力的出现,一些人已经设想了传统软件工程的终结,因为LLM或许能够仅凭领域专家输入系统的需求生成高质量的代码。要评估这一愿景的可行性,可以先了解开发者目前在使用LLM生成代码时如何纳入需求,而这一主题基本尚未得到探索。我们采访了来自14家公司的18名从业人员,以了解他们在生成代码时如何(重复)利用需求和其他设计制品中的信息来提供给LLM。根据调查结果,我们提出了一个理论,解释开发者采用的流程及其所依赖的制品。我们的理论表明,通常形式记录的需求过于抽象,无法直接输入LLM;相反,必须先将其手工分解为编程任务,再用设计决策和架构约束加以充实,然后才能用于提示。我们的研究强调,在使用LLM生成代码时,基本的RE工作仍然是必要的。我们的理论对于将自动化以需求为中心的SE任务的科学方法置于具体情境之中十分重要。

Article 153

Title@2025-07-10 (4): Towards an Engineering Workflow Management System for Asset Administration Shells using BPMN

Auf dem Weg zu einem Engineering Workflow Management System für Asset Administration Shells mit BPMN

努力建立一个利用BPMN的资产管理壳工程工作流程管理系统 2507.07468v1

Authors (2): Sten Grüner, Nafise Eskandani

The integration of Industry 4.0 technologies into engineering workflows is an essential step toward automating and optimizing plant and process engineering. The Asset Administration Shell (AAS) serves as a key enabler for creating interoperable Digital Twins that facilitate engineering data exchange and automation. This paper explores the use of AAS within engineering workflows, particularly in combination with Business Process Model and Notation (BPMN) to define structured and automated processes. We propose a distributed AAS copy-on-write infrastructure that enhances security and scalability while enabling seamless cross-organizational collaboration. We also introduce a workflow management prototype that automates AAS operations and engineering workflows, improving efficiency and traceability.

将工业4.0技术纳入工程工作流程,是实现工厂和工艺工程自动化与优化的必要步骤。资产管理壳(AAS)是创建可互操作数字孪生的关键推动因素,可促进工程数据交换和自动化。本文探讨了在工程工作流程中使用AAS的问题,特别是与业务流程模型和标记法(BPMN)相结合,以定义结构化和自动化的流程。我们提出一个分布式AAS写时复制基础设施,既能增强安全性和可扩展性,又能实现无缝的跨组织协作。我们还引入了一个工作流程管理原型,用于自动化AAS操作和工程工作流程,从而提高效率和可追溯性。
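
The copy-on-write idea mentioned in this abstract can be sketched in a few lines of Python. This is a single-process toy under stated assumptions: the class `CopyOnWriteShell`, its methods, and the property names are hypothetical, and the abstract does not say how the authors' distributed infrastructure is implemented. The sketch only shows the general pattern, in which reads fall through to shared data and writes stay in a private overlay.

```python
from types import MappingProxyType
from typing import Any, Dict


class CopyOnWriteShell:
    """Hypothetical copy-on-write view over a shared AAS-like property map."""

    def __init__(self, shared: Dict[str, Any]) -> None:
        # Read-only view of the shared data; nothing is copied up front.
        self._shared = MappingProxyType(shared)
        # Private overlay that receives every write made through this view.
        self._local: Dict[str, Any] = {}

    def get(self, key: str) -> Any:
        """Prefer the private overlay, fall back to the shared data."""
        if key in self._local:
            return self._local[key]
        return self._shared[key]

    def set(self, key: str, value: Any) -> None:
        """Record the change locally; the shared data is never modified."""
        self._local[key] = value

    def changes(self) -> Dict[str, Any]:
        """Return only the properties this view has overridden."""
        return dict(self._local)


# Illustrative property names and values, not taken from the paper.
published = {"manufacturer": "ACME", "maxPressureBar": 16}
partner_a = CopyOnWriteShell(published)
partner_b = CopyOnWriteShell(published)
partner_a.set("maxPressureBar", 25)
```

After the write, `partner_a` sees its own value while `partner_b` and the published data are unchanged. Values are shared shallowly here, so nested mutable values would need their own handling.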

Article 154

Title@2025-07-10 (4): Toolchain for Faster Iterations in Quantum Software Development

Toolchain für schnellere Iterationen in der Quantensoftware-Entwicklung

量子软件开发中快速迭代工具链 2507.07448v1

Authors (4): Otso Kinanen, Andrés D. Muñoz-Moller, Vlad Stirbu, Tommi Mikkonen

Quantum computing proposes a revolutionary paradigm that can radically transform numerous scientific and industrial application domains. To realize this promise, these new capabilities need software solutions that can effectively harness their power. However, developers may face significant challenges when developing and executing quantum software due to the limited availability of quantum computer hardware, the high computational demands of simulating quantum computers on classical systems, and the complicated technology stack required to integrate currently available accelerators into development environments. These limitations make it difficult for the developer to create an efficient workflow for quantum software development. In this paper, we investigate the potential of using remote computational capabilities efficiently to improve the workflow of quantum software developers, by lowering the barrier to moving between local execution and computationally more efficient remote hardware and by offering a speedup in execution on simulators. The goal is to allow the development of more complex circuits and to support an iterative software development approach. In our experiment, with the solution presented in this paper, we obtained up to 5 times faster circuit execution, and enabled qubit ranges from 21 to 29 qubits with a simple plug-and-play kernel for the Jupyter notebook.

量子计算提出了一种能够从根本上改变众多科学和工业应用领域的革命性范式。为了实现这一前景,这些新能力需要能够有效利用其算力的软件解决方案。然而,开发者在开发和执行量子软件时可能面临重大挑战,原因包括量子计算机硬件可用性有限、在经典系统上模拟量子计算机的计算需求高,以及将现有加速器接入开发环境所需的技术栈复杂。这些限制使开发者难以为量子软件开发建立高效的工作流程。在本文中,我们研究了高效利用远程计算能力来改进量子软件开发者工作流程的潜力,具体做法是降低在本地执行与计算效率更高的远程硬件之间切换的门槛,并在模拟器环境中加速执行。目标是支持开发更复杂的电路,并支持迭代式软件开发方法。在我们的实验中,借助本文提出的解决方案,电路执行运行时间最多加快了5倍,并通过一个简单的即插即用Jupyter笔记本内核支持了21到29个量子比特的范围。
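
To give a sense of why simulating quantum computers on classical systems is computationally demanding, the following Python sketch estimates the memory needed to hold a dense state vector. It assumes full state-vector simulation with one complex128 amplitude (16 bytes) per basis state. The abstract does not say which simulation method the authors use, so the numbers are illustrative only and `statevector_bytes` is a hypothetical helper.

```python
def statevector_bytes(num_qubits: int, bytes_per_amplitude: int = 16) -> int:
    """Memory for a dense state vector: 2**n amplitudes, 16 bytes each for complex128."""
    if num_qubits < 0:
        raise ValueError("num_qubits must be non-negative")
    return (2 ** num_qubits) * bytes_per_amplitude


# Memory doubles with every added qubit.
for n in (21, 25, 29):
    print(f"{n} qubits: {statevector_bytes(n) / 2**20:,.0f} MiB")
```

Under these assumptions, 21 qubits need 32 MiB, while 29 qubits need 8,192 MiB (8 GiB), which suggests one reason offloading larger circuits to remote hardware can pay off.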

Article 155

Title@2025-07-10 (4): DITING: A Static Analyzer for Identifying Bad Partitioning Issues in TEE Applications

DITING: Ein statischer Analyzer zur Identifizierung von Problemen mit schlechten Partitionierungen in TEE-Anwendungen

DITING: 识别TEE应用中不良分区问题的静态分析器 2502.15281v2

Authors (10): Chengyan Ma, Ruidong Han, Jieke Shi, Ye Liu, Yuqing Niu, Di Lu, Chuang Tian, Jianfeng Ma, Debin Gao, David Lo

A Trusted Execution Environment (TEE) enhances the security of mobile applications and cloud services by isolating sensitive code in the secure world from the non-secure normal world. However, TEE applications are still confronted with vulnerabilities stemming from bad partitioning. Bad partitioning can lead to critical security problems in a TEE, such as leaking sensitive data to the normal world or being adversely affected by malicious inputs from the normal world. To address this, we propose an approach to detect partitioning issues in TEE applications. First, we conducted a survey of TEE vulnerabilities caused by bad partitioning and found that the parameters exchanged between the secure and normal worlds are often used insecurely in badly partitioned implementations. Second, we developed a tool named DITING that analyzes the data flows of these parameters and identifies violations of the security rules we defined, thereby finding bad partitioning issues. Unlike existing research, which focuses only on malicious input to the TEE, we assess partitioning issues more comprehensively through input/output and shared memory. Finally, we created the first benchmark targeting bad partitioning, consisting of 110 test cases. Experiments demonstrate that DITING achieves an F1 score of 0.90 in identifying bad partitioning issues.

可信执行环境(TEE)通过将安全世界中的敏感代码与非安全的普通世界隔离,增强了移动应用和云服务的安全性。然而,TEE应用仍面临不良分区导致的漏洞。不良分区会给TEE带来严重的安全问题,例如敏感数据泄漏到普通世界,或受到来自普通世界的恶意输入的不利影响。为了解决这个问题,我们提出了一种检测TEE应用中分区问题的方法。首先,我们调查了由不良分区造成的TEE漏洞,发现在安全世界和普通世界之间交换的参数在不良分区的实现中往往存在不安全的用法。其次,我们开发了一个名为DITING的工具,可以分析这些参数的数据流,并识别其违反我们所定义的安全规则的情况,从而发现不良分区问题。与仅关注TEE恶意输入的现有研究不同,我们通过输入/输出和共享内存更全面地评估分区问题。最后,我们建立了第一个针对不良分区的基准,包含110个测试用例。实验表明,DITING在识别不良分区问题上取得了0.90的F1分数。
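
The kind of data-flow check this abstract describes can be illustrated with a toy taint analysis in Python. This is not DITING's algorithm or rule set, which the abstract does not spell out. The statement encoding, the function names, and the single rule used here ("secret data must not reach a parameter visible to the normal world") are all assumptions made for illustration.

```python
from typing import List, Set, Tuple

# Each statement is (destination, [sources]): destination is computed from sources.
Statement = Tuple[str, List[str]]


def tainted_after(program: List[Statement], seeds: Set[str]) -> Set[str]:
    """Forward-propagate taint through straight-line assignments."""
    tainted = set(seeds)
    for dest, sources in program:
        if any(src in tainted for src in sources):
            tainted.add(dest)       # destination now depends on secret data
        else:
            tainted.discard(dest)   # overwritten with clean data
    return tainted


def find_leaks(program: List[Statement], secrets: Set[str], outputs: Set[str]) -> List[str]:
    """Hypothetical rule: no output parameter may hold secret-derived data."""
    return sorted(outputs & tainted_after(program, secrets))


# Toy trusted-application body: variable names are illustrative only.
program = [
    ("tmp", ["secret_key"]),   # tmp derived from the secret
    ("out_buf", ["tmp"]),      # secret-derived data copied to an output parameter
    ("status", ["zero"]),      # clean status code
]
leaks = find_leaks(program, secrets={"secret_key"}, outputs={"out_buf", "status"})
```

Here `leaks` contains only `out_buf`, the output parameter that ends up holding secret-derived data. A real analyzer would also have to handle branches, loops, pointers, and shared memory, which this sketch ignores.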

Article 156

Title@2025-07-10 (4): Automatic Generation of Explainability Requirements and Software Explanations From User Reviews

Automatische Generierung von Erklärbarkeitsanforderungen und Software-Erläuterungen aus Benutzer-Bewertungen

从用户评论自动生成可解释性需求和软件解释 2507.07344v1

Authors (9): Martin Obaidi, Jannik Fischbach, Jakob Droste, Hannah Deters, Marc Herrmann, Jil Klünder, Steffen Krätzig, Hugo Villamizar, Kurt Schneider

Explainability has become a crucial non-functional requirement to enhance transparency, build user trust, and ensure regulatory compliance. However, translating explanation needs expressed in user feedback into structured requirements and corresponding explanations remains challenging. While existing methods can identify explanation-related concerns in user reviews, there is no established approach for systematically deriving requirements and generating aligned explanations. To contribute toward addressing this gap, we introduce a tool-supported approach that automates this process. To evaluate its effectiveness, we collaborated with an industrial automation manufacturer to create a dataset of 58 user reviews, each annotated with manually crafted explainability requirements and explanations. Our evaluation shows that while AI-generated requirements often lack relevance and correctness compared to human-created ones, the AI-generated explanations are frequently preferred for their clarity and style. Nonetheless, correctness remains an issue, highlighting the importance of human validation. This work contributes to the advancement of explainability requirements in software systems by (1) introducing an automated approach to derive requirements from user reviews and generate corresponding explanations, (2) providing empirical insights into the strengths and limitations of automatically generated artifacts, and (3) releasing a curated dataset to support future research on the automatic generation of explainability requirements.

可解释性已成为提高透明度、建立用户信任和确保监管合规的关键非功能性需求。然而,将用户反馈中表达的解释需求转化为结构化需求和相应的解释,仍然具有挑战性。虽然现有方法可以识别用户评论中与解释相关的关切,但尚无既定方法来系统地推导需求并生成与之一致的解释。为帮助弥补这一差距,我们提出了一种有工具支持的方法,使这一过程自动化。为了评估其有效性,我们与一家工业自动化制造商合作,创建了一个包含58条用户评论的数据集,每条评论都标注了人工编写的可解释性需求和解释。我们的评估表明,尽管与人工编写的需求相比,AI生成的需求往往缺乏相关性和正确性,但AI生成的解释常因其清晰性和风格而更受青睐。然而,正确性仍然是一个问题,这凸显了人工验证的重要性。这项工作从三个方面推动了软件系统可解释性需求的发展:(1)引入一种自动化方法,从用户评论中推导需求并生成相应的解释;(2)就自动生成制品的优点和局限性提供实证见解;(3)发布一个经整理的数据集,以支持未来关于自动生成可解释性需求的研究。