cs.SE @ 2025-06-13: 174

06-12 (4)

SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks

SWE-Factory: Ihre automatisierte Fabrik für Trainingsdaten und Evaluations-Benchmarks zur Issue-Behebung

SWE-Factory: 面向问题修复训练数据与评测基准的自动化工厂

2506.10954v1

06-12

LLM-Cure: LLM-based Competitor User Review Analysis for Feature Enhancement

LLM-Cure: LLM-basierte Analyse von Nutzerbewertungen der Konkurrenz für Feature Enhancement

LLM-Cure: 基于 LLM 的竞品用户评论分析,用于功能增强

2409.15724v2

06-12

MultiCoSim: A Python-based Multi-Fidelity Co-Simulation Framework

MultiCoSim: Ein Python-basiertes Multi-Fidelity-Co-Simulations-Framework

MultiCoSim:一个基于 Python 的多保真度协同仿真框架

2506.10869v1

06-12

Evaluating Large Language Models on Non-Code Software Engineering Tasks

Bewertung großer Sprachmodelle auf nicht-Code-Software-Engineering-Aufgaben

评价非守则软件工程任务大语言模型

2506.10833v1

06-12

Solving Package Management via Hypergraph Dependency Resolution

Lösung des Paketmanagements über Hypergraph Dependency Resolution

通过超图依赖解析解决软件包管理

2506.10803v1

06-12

What Users Value and Critique: Large-Scale Analysis of User Feedback on AI-Powered Mobile Apps

Was Nutzer schätzen und Kritik: Groß angelegte Analyse von User Feedback auf KI-Powered Mobile Apps

什么是用户价值和关键值:对AI授权移动应用程序用户反馈的大规模分析

2506.10785v1

06-12

From Tea Leaves to System Maps: Context-awareness in Monitoring Operational Machine Learning Models

Von Tea Leaves zu System Maps: Kontext-Bewusstsein bei der Überwachung operativer Machine Learning Modelle

从茶叶休假到系统地图:监测操作机器学习模式的背景意识

2506.10770v1

06-12

An Empirical Evaluation of Pre-trained Large Language Models for Repairing Declarative Formal Specifications

Eine empirische Bewertung von vortrainierten großen Sprachmodellen zur Reparatur deklarativer formaler Spezifikationen

对培训前大语言模型进行经验评估,以修复《宣言》正式规格

2404.11050v2

06-12

Formalising Software Requirements using Large Language Models

Formalisierung von Software-Anforderungen mit großen Sprachmodellen

使用大语言模式正式确定软件要求

2506.10704v1

06-12

Not One to Rule Them All: Mining Meaningful Code Review Orders From GitHub

Nicht einer, der sie alle beherrscht: Bergbau Sinnvolle Code-Review-Aufträge von GitHub

无法统治他们所有人:GitHub下达的具有采矿意义的法律审查令

2506.10654v1

06-12

Scalable Software Testing in Fast Virtual Platforms: Leveraging SystemC, QEMU and Containerization

Skalierbare Software-Tests in schnellen virtuellen Plattformen: Leveraging SystemC, QEMU und Containerization

在快速虚拟平台上进行可缩放软件测试:利用系统C、QEMU和集装箱化

2506.10624v1

06-12

AdaptiveLLM: A Framework for Selecting Optimal Cost-Efficient LLM for Code-Generation Based on CoT Length

AdaptiveLLM: Ein Framework zur Auswahl optimaler kosteneffizienter LLM für Code-Generation auf Basis der CoT-Länge

适应性LLM:根据Cot长度为编码生成选择最佳成本效率高的LLM框架

2506.10525v1

06-12

PyGen: A Collaborative Human-AI Approach to Python Package Creation

PyGen: Ein kollaborativer Mensch-AI-Ansatz zur Erstellung von Python-Paketen

PyGen:一种人机协作的 Python 包创建方法

2411.08932v4

06-12

BugGen: A Self-Correcting Multi-Agent LLM Pipeline for Realistic RTL Bug Synthesis

BugGen: Eine selbstkorrigierende Multi-Agenten-LLM-Pipeline für eine realistische RTL-Bug-Synthese

BugGen: 面向真实 RTL 缺陷合成的自校正多智能体 LLM 流水线

2506.10501v1

06-12

EXPEREPAIR: Dual-Memory Enhanced LLM-based Repository-Level Program Repair

EXPEREPAIR: Dual-Memory verbesserte LLM-basierte Repository-Level-Programm-Reparatur

EXPEREPAIR:双记忆增强的基于 LLM 的仓库级程序修复

2506.10484v1

06-12

Leveraging Network Methods for Hub-like Microservice Detection

Nutzung von Netzwerkmethoden für die hubähnliche Microservice-Erkennung

利用网络方法进行中枢式微服务探测

2506.07683v2

06-12

Towards Understanding Bugs in Distributed Training and Inference Frameworks for Large Language Models

Auf dem Weg zum Verständnis von Fehlern in verteilten Schulungs- und Schlussfolgerungsrahmen für große Sprachmodelle

努力了解大语言模式分布式培训和推断框架中的错误

2506.10426v1

06-12

Centrality Change Proneness: an Early Indicator of Microservice Architectural Degradation

Zentralitätsänderung Proneness: ein Frühindikator für die architektonische Degradation von Mikroservices

中心变化中心变化前景:微观服务建筑退化早期指标

2506.07690v2

06-12

Bug Classification in Quantum Software: A Rule-Based Framework and Its Evaluation

Fehlerklassifizierung in der Quantensoftware: Ein regelbasiertes Framework und seine Bewertung

量子软件中的臭虫分类:基于规则的框架及其评价

2506.10397v1

06-12

MLLM-Based UI2Code Automation Guided by UI Layout Information

MLLM-based UI2Code Automation Geführt von UI Layout Informationen

基于 MLLM 的 UI2Code 自动化:由 UI 布局信息引导

2506.10376v1

06-12

AutoGEEval++: A Multi-Level and Multi-Geospatial-Modality Automated Evaluation Framework for Large Language Models in Geospatial Code Generation on Google Earth Engine

AutoGEEval++: Ein Multi-Level- und Multi-Geospatial-Modulity-Automated Evaluation Framework für große Sprachmodelle in der Geospatial Code Generation auf Google Earth Engine

AutoGEEval++:谷歌地球引擎地理空间代码生成中大语言模型多层次和多地球空间-模式自动评价框架

2506.10365v1

06-12

CodeTool: Enhancing Programmatic Tool Invocation of LLMs via Process Supervision

CodeTool: Programmatisches Tool verbessern Anrufung von LLMs durch Prozessüberwachung

守则工具:加强通过程序监督援引LLMs的程序工具

2503.20840v2

06-12

Augmenting Large Language Models with Static Code Analysis for Automated Code Quality Improvements

Erweiterung großer Sprachmodelle mit statischer Codeanalyse für automatisierte Codequalitätsverbesserungen

增强大语言模式,采用静态代码分析法分析,提高自动代码质量

2506.10330v1

06-12

ELFuzz: Efficient Input Generation via LLM-driven Synthesis Over Fuzzer Space

ELFuzz: Effiziente Input-Generierung über LLM-gesteuerte Synthese über Fuzzer-Raum

ELFuzz:通过LLM驱动的模糊空间综合合成有效投入生成

2506.10323v1

06-12

Minimizing False Positives in Static Bug Detection via LLM-Enhanced Path Feasibility Analysis

Minimierung falscher Positive bei statischer Fehlererkennung über LLM-verbesserte Pfad-Feasibility-Analyse

通过LLM-强化路径可行性分析尽量减少静态虫检测中的假正数

2506.10322v1

06-12

AI-Based Software Vulnerability Detection: A Systematic Literature Review

KI-basierte Software-Verletzungserkennung: Ein systematischer Literaturbericht

基于AI的软件脆弱性检测:系统文献审查

2506.10280v1

06-11 (3)

Playing in the Sandbox: A Study on the Usability of Seccomp

Spielen in der Sandbox: Eine Studie über die Usability von Seccomp

在沙箱中玩耍:关于 Seccomp 可用性的研究

2506.10234v1

06-11

Prompt Variability Effects On LLM Code Generation

Auswirkungen der Prompt-Variabilität auf die LLM-Code-Generierung

对LLM 代码生成的迅速易变性效应

2506.10204v1

06-11

New Fault Domains for Conformance Testing of Finite State Machines

Neue Fehlerbereiche für die Konformitätsprüfung von Finite State Maschinen

限定国家机器符合性测试的新失密域域

2410.19405v2

06-11

D-LiFT: Improving LLM-based Decompiler Backend via Code Quality-driven Fine-tuning

D-LiFT: Verbesserung des LLM-basierten Decompiler-Backends über Code Quality-driven Fine-tuning

D-LiFT:通过《守则》质量驱动的微调改进基于LLM的解腐器后端

2506.10125v1

06-11

Expert-in-the-Loop Systems with Cross-Domain and In-Domain Few-Shot Learning for Software Vulnerability Detection

Experten-in-the-Loop-Systeme mit Cross-Domain und In-Domain-Lernen für Software-Anfälligkeitserkennung

具有跨域和内域的 “ 软件脆弱性探测 “ 微热学习

2506.10104v1

06-11

Reward Models Enable Scalable Code Verification by Trading Accuracy for Throughput

Reward-Modelle ermöglichen eine skalierbare Code-Überprüfung durch den Handel mit Genauigkeit für Durchsatz

通过交易准确性对交易流量的可缩放代码校验

2506.10056v1

06-11

Microservices and Real-Time Processing in Retail IT: A Review of Open-Source Toolchains and Deployment Strategies

Mikroservices und Echtzeit-Verarbeitung im Einzelhandel IT: Eine Überprüfung von Open-Source-Toolchains und Bereitstellungsstrategien

《零售信息技术的微观服务和实时处理:对开放源码工具链和部署战略的审查》

2506.09938v1

06-11

Assessing a Safety Case: Bottom-up Guidance for Claims and Evidence Evaluation

Bewertung eines Sicherheitsfalles: Bottom-up-Leitfaden für Forderungen und die Bewertung von Beweisen

安全案件评估:索赔和证据评价自下而上准则

2506.09929v1

06-11

The Popularity Hypothesis in Software Security: A Large-Scale Replication with PHP Packages

Die Popularitätshypothese in der Softwaresicherheit: Eine großflächige Replizierung mit PHP-Paketen

软件安全中的大众化假设:大规模复制PHP套件

2502.16670v2

06-11

Quantum resources in resource management systems

Quantenressourcen in Ressourcenmanagementsystemen

资源管理系统的量子资源

2506.10052v1

06-11

The Effects of GitHub Copilot on Computing Students’ Programming Effectiveness, Efficiency, and Processes in Brownfield Programming Tasks

Die Auswirkungen von GitHub Copilot auf die Programmierung von Computerstudenten Wirksamkeit, Effizienz und Prozesse in Brownfield-Programmierungsaufgaben

GitHub联合试点对布朗外地方案拟订任务中计算机学生方案拟订效力、效率和进程的影响

2506.10051v1

06-11

Stakeholder Participation for Responsible AI Development: Disconnects Between Guidance and Current Practice

Stakeholder-Beteiligung für verantwortungsvolle KI-Entwicklung: Diskrepanzen zwischen Leitlinien und aktueller Praxis

利益攸关方参与负责任的大赦国际发展:指导与现行做法脱节

2506.09873v1

06-11

variability.dev: Towards an Online Toolbox for Feature Modeling

variability.dev: Auf dem Weg zu einer Online-Toolbox für Feature Modeling

变量. dev: 走向一个用于特性建模的在线工具箱

2506.09845v1

06-11

Online Discovery of Simulation Models for Evolving Business Processes (Extended Version)

Online Discovery of Simulation Models for Evolving Business Processes (Erweiterte Version)

不断演变的业务流程模拟模型在线发现(扩展版)

2506.10049v1

06-11

ComfyUI-R1: Exploring Reasoning Models for Workflow Generation

ComfyUI-R1: Erforschung von Konzeptmodellen für die Workflow-Generierung

ComfyUI-R1:探索产生工作流程的理由模型

2506.09790v1

06-11

Towards Bridging Formal Methods and Human Interpretability

Auf dem Weg zur Überbrückung formaler Methoden und menschlicher Verdolmetschbarkeit

实现正规方法和人类可解释性之间的衔接

2506.09759v1

06-11

A First Look at Bugs in LLM Inference Engines

Ein erster Blick auf Bugs in LLM Inferenz-Engines

第一次看一看LLM 推断引擎中的虫虫

2506.09713v1

06-11

Mapping NVD Records to Their VFCs: How Hard is it?

Mapping NVD Records zu ihren VFCs: Wie schwer ist es?

绘制VFCs的NVD记录:有多难?

2506.09702v1

06-11

Calculating Software’s Energy Use and Carbon Emissions: A Survey of the State of Art, Challenges, and the Way Ahead

Berechnung der Energienutzung und CO2-Emissionen von Software: Eine Übersicht über den Stand der Technik, Herausforderungen und den Weg nach vorn

计算软件的能源使用量和碳排放量:对艺术、挑战和前进道路现状的调查

2506.09683v1

06-11

Assessing and Advancing Benchmarks for Evaluating Large Language Models in Software Engineering Tasks

Bewertung und Verbesserung von Benchmarks für die Bewertung großer Sprachmodelle in der Software-Engineering-Aufgaben

评估和推进评估软件工程任务中大语言模型的基准

2505.08903v3

06-11

Translating a VDM Model of a Medical Device into Kapture

Übersetzen eines VDM-Modells eines medizinischen Geräts in Kapture

将医疗设备VDM模型转换成外形

2506.09636v1

06-11

ASTAGEN: Empirical Evaluation of Automated SATD Taxonomy Generation with LLMs

ASTAGEN: Empirische Bewertung der automatisierten SATD-Taxonomie-Generation mit LLMs

ASTAGEN: 与LLM公司一起对自动的SATD 分类生成进行经验评估

2506.09601v1

06-11

Automated Synthesis of Formally Verified Multi-Abstraction Function Summaries

Automatisierte Synthese von formal verifizierten Multi-Abstraktions-Funktionszusammenfassungen

正式核证的多种吸管功能摘要自动综合

2506.09550v1

06-11

AcTracer: Active Testing of Large Language Model via Multi-Stage Sampling

AcTracer: Aktives Testen von Großsprachenmodellen durch Multi-Stage-Sampling

AcTracer:通过多阶段抽样主动测试大语言模型

2408.03573v2

06-11

TrioXpert: An automated incident management framework for microservice system

TrioXpert: Ein automatisiertes Ereignismanagement-Framework für Microservice-Systeme

TrioXpert:微服务系统自动化事件管理框架

2506.10043v1

06-11

Reasoning as a Resource: Optimizing Fast and Slow Thinking in Code Generation Models

Reasoning as a Resource: Schnelles und langsames Denken in Codegenerierungsmodellen optimieren

资源理由:在代码生成模型中优化快速和缓慢思考

2506.09396v1

06-11

Towards Better Code Generation: Adaptive Decoding with Uncertainty Guidance

Auf dem Weg zu einer besseren Codegenerierung: Adaptive Dekodierung mit Ungewissheitsleitfaden

实现更好的代码生成:以不确定性指导进行适应性代用

2506.08980v2

06-11

Less is More: DocString Compression in Code Generation

Weniger ist mehr: DocString-Kompression in der Code-Generierung

更少为 : DocString 压缩代码生成

2410.22793v3

06-11

Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey

Verbesserung von Code LLMs mit Verstärkungslernen in der Codegenerierung: Eine Umfrage

增强法典制定中强化学习的加强守则LLMS 代码生成:调查

2412.20367v4

06-11

Mono: Is Your “Clean” Vulnerability Dataset Really Solvable? Exposing and Trapping Undecidable Patches and Beyond

Mono: Ist Ihr “sauberer” Sicherheitsdatensatz wirklich lösbar? Unentschlossene Patches und darüber hinaus enthüllen und verschleppen

Mono: 您的“ 干净” 脆弱数据集是否真的可以解决? 曝光和跟踪不可量化的补丁及其他

2506.03651v2

06-11

Assessing the Impact of Refactoring Energy-Inefficient Code Patterns on Software Sustainability: An Industry Case Study

Bewertung der Auswirkungen von Refactoring Energy-Inefficient Code Patterns auf Software Sustainability: Eine Fallstudie für die Industrie

评估可再生能源低能效代码模式对软件可持续性的影响:工业案例研究

2506.09370v1

06-11

Boosting Rust Unit Test Coverage through Hybrid Program Analysis and Large Language Models

Steigerung der Testabdeckung von Rust Units durch Hybrid-Programmanalyse und große Sprachmodelle

通过混合方案分析和大语言模式,促进分散单位测试覆盖率

2506.09002v2

06-11

Detecting State Manipulation Vulnerabilities in Smart Contracts Using LLM and Static Analysis

Ermittlung staatlicher Manipulationslücken in Smart Contracts mittels LLM und statischer Analyse

检测使用LLM和静态分析的智能合同中的国家操纵脆弱性

2506.08561v2

06-11

On The Impact of Merge Request Deviations on Code Review Practices

Über die Auswirkungen von Merge Request Abweichungen auf Code-Review-Praktiken

合并请求对守则审查惯例的影响

2506.08860v2

06-10 (2)

UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench

UTBoost: Strenge Bewertung von Coding Agents auf SWE-Bench

UTBoost: 在 SWE-Bench 上严格评估编码智能体

2506.09289v1

06-10

RocketPPA: Code-Level Power, Performance, and Area Prediction via LLM and Mixture of Experts

RocketPPA: Code-Level Power, Performance und Area Prediction über LLM und Mixture of Experts

火箭式PPPA:通过LLM和专家混合进行代码级动力、性能和地区预测

2503.21971v3

06-10

ClassInvGen: Class Invariant Synthesis using Large Language Models

ClassInvGen: Class Invariant Synthesis mit großen Sprachmodellen

类 InvGen: 使用大语言模型的分类变量合成

2502.18917v2

06-10

Formal Methods Meets Readability: Auto-Documenting JML Java Code

Formale Methoden erfüllen Lesbarkeit: Auto-Dokumentierung von JML Java Code

正式方法符合可读性:自动文档 JML Java 代码

2506.09230v1

06-10

Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems

Können LLMs zuverlässige Testfallgeneratoren generieren? Eine Studie zu Wettbewerbs-Level-Programmierungsproblemen

LLM 能否生成可靠的测试用例生成器?一项针对竞赛级编程问题的研究

2506.06821v2

06-10

How Do Users Revise Architectural Related Questions on Stack Overflow: An Empirical Study

Wie revidieren Benutzer architektonische verwandte Fragen über Stack Overflow: Eine empirische Studie

用户如何修改关于堆叠溢溢流的建筑相关问题:经验研究

2406.18959v2

06-10

Who is using AI to code? Global diffusion and impact of generative AI

Wer nutzt KI zum Kodieren? Globale Diffusion und Wirkung generativer KI

谁在使用人工智能编码?生成式人工智能的全球扩散与影响

2506.08945v1

06-10

The Impact of Large Language Models on Open-source Innovation: Evidence from GitHub Copilot

Die Auswirkungen großer Sprachmodelle auf Open-Source-Innovationen: Belege von GitHub Copilot

《大语言模式对开放源码创新的影响:GitHub公司的证据》

2409.08379v3

06-10

When Uncertainty Leads to Unsafety: Empirical Insights into the Role of Uncertainty in Unmanned Aerial Vehicle Safety

Wenn Ungewissheit zu Unsicherheit führt: Empirische Einblicke in die Rolle der Ungewissheit in der unbemannten Luftfahrzeugsicherheit

当不确定因素导致不安全时:对不确定因素在无人驾驶飞行器安全方面作用的实证洞察力

2501.08908v2

06-10

ZTaint-Havoc: From Havoc Mode to Zero-Execution Fuzzing-Driven Taint Inference

ZTaint-Havoc: Vom Havoc-Modus zur Null-Execution Fuzzing-Driven Taint-Inferenz

ZTaint-Havoc:从 Havoc 模式到零执行的模糊测试驱动污点推断

2506.08838v1

06-10

Exploring the Evidence-Based Beliefs of LLM-Based Programming Assistants

Erforschung der evidenzbasierten Überzeugungen von LLM-basierten Programmierassistenten

探索以LLM为基础的方案拟订助理的基于证据的信念

2407.13900v2

06-10

Towards a Knowledge Base of Common Sustainability Weaknesses in Green Software Development

Auf dem Weg zu einer Wissensbasis für gemeinsame Nachhaltigkeitsschwächen in der grünen Softwareentwicklung

建立绿色软件开发中共同可持续性弱点知识库

2506.08812v1

06-10

User Modeling in Model-Driven Engineering: A Systematic Literature Review

User Modeling in Model-Driven Engineering: Ein systematischer Literaturbericht

模型驱动工程的用户建模:系统文学评论

2412.15871v3

06-10

Do Generative AI Tools Ensure Green Code? An Investigative Study

Stellen Generative KI-Tools einen grünen Code sicher? Eine Untersuchungsstudie

产生AI工具确保绿色守则? 一项调查研究

2506.08790v1

06-10

Mitigating fairwashing using Two-Source Audits

Fairwashing durch Zwei-Quellen-Audits abmildern

利用双重来源审计减少洗水

2305.13883v2

06-10

Breaking the ICE: Exploring promises and challenges of benchmarks for Inference Carbon & Energy estimation for LLMs

Breaking the ICE: Erforschen von Versprechungen und Herausforderungen von Benchmarks für Inferenz-Kohlenstoff- & Energieschätzungen für LLMs

打破ICE:探索LLMM的碳和能源估算基准的许诺和挑战

2506.08727v1

06-10

Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure

Erklärbare Compliance-Erkennung mit Multi-Hop-Natural Language-Schlussfolgerung zur Assurance-Fallstruktur

以多种自然语言对保证案例结构的多重语言推断进行可解释的合规检测

2506.08713v1

06-10

ROS-related Robotic Systems Development with V-model-based Application of MeROS Metamodel

ROS-bezogene Entwicklung von Robotersystemen mit V-Modell-basierter Anwendung des MeROS-Metamodells

与ROS有关的机器人系统开发,以V型模型为基础应用MEROS模型模型

2506.08706v1

06-10

A Comprehensive Evaluation of Parameter-Efficient Fine-Tuning on Code Smell Detection

Eine umfassende Bewertung des Parameter-Effizienten Feintunings auf Code-Smell-Erkennung

全面评价关于代码嗅觉检测的参数有效精密设计

2412.13801v2

06-10

Causality-aware Safety Testing for Autonomous Driving Systems

Causality-aware Sicherheitstests für autonome Fahrsysteme

自动驾驶系统因能安全测试

2506.08688v1

06-10

Proceedings of the 23rd International Overture Workshop

Berichte des 23. Internationalen Ouvertüren-Workshops

第23次国际展望研讨会记录

2506.08680v1

06-10

Realigning Incentives to Build Better Software: a Holistic Approach to Vendor Accountability

Neuausrichtung von Anreizen, um bessere Software zu entwickeln: ein ganzheitlicher Ansatz für die Verantwortlichkeit des Verkäufers

调整奖励措施,以建设更好的软件:供应商问责制的综合办法

2504.07766v2

06-10

Logic Mining from Process Logs: Towards Automated Specification and Verification

Logic Mining aus Prozessprotokollen: Auf dem Weg zu einer automatisierten Spezifikation und Verifizierung

从加工日志进行逻辑采矿:走向自动规格和核查

2506.08628v1

06-10

LMRPA: Large Language Model-Driven Efficient Robotic Process Automation for OCR

LMRPA: Großsprachige modellgetriebene effiziente Roboterprozessautomatisierung für OCR

LMRPA: OCR的大型语言模型驱动高效机器人程序自动化

2412.18063v2

06-10

RE-oriented Model Development with LLM Support and Deduction-based Verification

RE-orientierte Modellentwicklung mit LLM-Unterstützung und Deduktionsbasierter Verifizierung

基于 LLM 支持和演绎验证的面向需求工程(RE)的模型开发

2506.08606v1

06-10

Evaluating the Performance and Efficiency of Sentence-BERT for Code Comment Classification

Bewertung der Leistung und Effizienz von Sentence-BERT für die Klassifizierung von Code Comment

评估 Sentence-BERT 在代码注释分类中的性能与效率

2506.08581v1

06-10

IssueCourier: Multi-Relational Heterogeneous Temporal Graph Neural Network for Open-Source Issue Assignment

IssueCourier: Multi-Relationale Heterogene Zeitliche Graphen-Neural-Netzwerk für Open-Source-Ausgabezuweisung

问题压力:开放源码问题任务多关系性不同时代不同形态图质神经网络

2505.11205v3

06-10

Enhancing Open-Domain Task-Solving Capability of LLMs via Autonomous Tool Integration from GitHub

Verbesserung der Open-Domain-Task-Solving-Kapazität von LLMs durch autonome Tool-Integration von GitHub

通过GitHub的自主工具集成,加强LLMs的开放域任务调整能力

2312.17294v3

06-10

Understanding Software Engineering Agents Through the Lens of Traceability: An Empirical Study

Software Engineering-Agenten durch die Spurweite der Rückverfolgbarkeit verstehen: Eine empirische Studie

通过可追踪通道了解软件工程剂:经验研究

2506.08311v1

06-09 (1)

Developer Perspectives on Licensing and Copyright Issues Arising from Generative AI for Software Development

Entwickler-Perspektiven zu Lizenzierungs- und Urheberrechtsfragen, die sich aus generativen KI für die Softwareentwicklung ergeben

开发者对软件开发创创大赦国际提出的许可证发放和版权问题的看法

2411.10877v5

06-09

MBTModelGenerator: A software tool for reverse engineering of Model-based Testing (MBT) models from clickstream data of web applications

MBTModelGenerator: Ein Software-Tool für Reverse Engineering von Modellbasierten Testing (MBT)-Modellen aus Clickstream-Daten von Web-Anwendungen

MBTModelGenerator:一个软件工具,用于从网络应用的点击流数据中逆向设计基于模型的测试(MBT)模型

2506.08179v1

06-09

Repeton: Structured Bug Repair with ReAct-Guided Patch-and-Test Cycles

Repeton: Strukturierte Fehler-Reparatur mit ReAct-geführten Patch-and-Test-Zyklen

Repeton: 结构化的错误修复, 使用重新操作指导的补丁和测试周期

2506.08173v1

06-09

Worst-Case Symbolic Constraints Analysis and Generalisation with Large Language Models

Worst-Case-Symbolische Einschränkungen Analyse und Generalisierung mit großen Sprachmodellen

基于大语言模型的最坏情况符号约束分析与泛化

2506.08171v1

06-09

A Metrics-Oriented Architectural Model to Characterize Complexity on Machine Learning-Enabled Systems

Ein metrisch ausgerichtetes architektonisches Modell zur Charakterisierung von Komplexität auf maschinell lernfähigen Systemen

以计量为主的建筑建筑模型,以明确机械学习系统的复杂性

2506.08153v1

06-09

From Output to Evaluation: Does Raw Instruction-Tuned Code LLMs Output Suffice for Fill-in-the-Middle Code Generation?

Von der Ausgabe bis zur Auswertung: Reicht die Ausgabe von LLMs mit rohem Instruktionscode für die Generierung von Fill-in-the-Middle Code aus?

从输出到评价:原始指令-指令代码LLMs 输出足量是否用于中代代号的填充?

2505.18789v2

06-09

Adversarial Attack Classification and Robustness Testing for Large Language Models for Code

Adversariale Angriffsklassifikation und Robustheitsprüfung für große Sprachmodelle für Code

守则大语言模型对反攻击分类和强力测试

2506.07942v1

06-09

Can Hessian-Based Insights Support Fault Diagnosis in Attention-based Models?

Können Hessian-Based Insights Fehlerdiagnosen in aufmerksamkeitsbasierten Modellen unterstützen?

以海珊为基地的洞察能支持以关注为基础的模型中的过失诊断吗?

2506.07871v1

06-09

Execution-Aware Program Reduction for WebAssembly via Record and Replay

Execution-Aware Programmreduktion für WebAssembly über Aufzeichnung und Wiedergabe

通过录制和重放减少网络摄像头的执行软件程序

2506.07834v1

06-09

Subgraph-Oriented Testing for Deep Learning Libraries

Subgraph-orientierte Tests für Deep-Learning-Bibliotheken

深深学习图书馆以Sub图为主的测试

2412.06430v2

06-09

Towards a Small Language Model Lifecycle Framework

Auf dem Weg zu einem Rahmen für den Lebenszyklus eines kleinen Sprachmodells

建立一个小型语言模拟生命周期框架

2506.07695v1

06-09

MacroSwarm: A Field-based Compositional Framework for Swarm Programming

MacroSwarm: Ein feldbasierter Kompositionsrahmen für die Schwarmprogrammierung

宏观群群:基于实地的蜂群方案规划组成框架

2401.10969v4

06-09

Evaluating LLMs Effectiveness in Detecting and Correcting Test Smells: An Empirical Study

Bewertung der Wirksamkeit von LLMs bei der Erkennung und Korrektur von Testriechen: Eine empirische Studie

评价LLLMS在检测和纠正试验闻闻方面的效力:经验研究

2506.07594v1

06-09

IntenTest: Stress Testing for Intent Integrity in API-Calling LLM Agents

IntenTest: Stresstest auf Intent Integrität in API-aufrufenden LLM-Agenten

IntenTest:对调用 API 的 LLM 智能体进行意图完整性压力测试

2506.07524v1

06-09

PARF: An Adaptive Abstraction-Strategy Tuner for Static Analysis

PARF: Ein adaptives Abstraktions-Strategie-Tuner für statische Analyse

PARF: 用于静态分析的适应性抽象-战略图纳

2505.13229v3

06-09

Large Language Models for Multilingual Vulnerability Detection: How Far Are We?

Große Sprachmodelle für mehrsprachige Sicherheitserkennung: Wie weit sind wir?

多语言脆弱性探测大语言模型:我们有多远?

2506.07503v1

06-09

A Framework for Creating Non-Regressive Test Cases via Branch Consistency Analysis Driven by Descriptions

Ein Rahmen für die Erstellung nicht-regressiver Testfälle über Branchenkonsistenzanalyse, angetrieben durch Beschreibungen

由说明驱动的通过分处一致性分析生成非递减试验案例的框架

2506.07486v1

06-09

Multi-View Adaptive Contrastive Learning for Information Retrieval Based Fault Localization

Multi-View Adaptive Kontrastives Lernen für Informationen Retrieval basierte Fehler Lokalisierung

面向基于信息检索的故障定位的多视图自适应对比学习

2409.12519v2

06-09

Generate Realistic Test Scenes for V2X Communication Systems

Realistische Testszenen für V2X-Kommunikationssysteme generieren

为 V2X 通信系统生成真实的测试场景

2506.07419v1

06-09

A Systematic Literature Review on Continuous Integration and Deployment (CI/CD) for Secure Cloud Computing

Ein systematischer Literaturbericht über kontinuierliche Integration und Bereitstellung (CI/CD) für sicheres Cloud Computing

安全云计算系统关于持续整合和部署的系统文献审查(CI/CD)

2506.08055v1

06-09

Protecting Deep Learning Model Copyrights with Adversarial Example-Free Reuse Detection

Schutz von Deep-Learning-Modell-Urheberrechten mit zweifelhafter Beispiel-freier Wiederverwertungserkennung

保护深学习模式版权,进行反反对学性实例自由再利用探测

2407.03883v2

06-09

Boosting Vulnerability Detection of LLMs via Curriculum Preference Optimization with Synthetic Reasoning Data

Erhöhung der Schwachstelle Erkennung von LLMs durch Curriculum Preference Optimization mit synthetisch begründeten Daten

通过课程优先优化使用合成理由数据,促进对LLMs的脆弱性检测

2506.07390v1

06-09

Human Side of Smart Contract Fuzzing: An Empirical Study

Menschliche Seite des Smart Contract Fuzzing: Eine empirische Studie

聪明合同的人类方面模糊:经验研究

2506.07389v1

06-09

SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering

SyncMind: Messmittel Out-of-Sync-Wiederherstellung in der kollaborativen Software-Engineering

SyncMind:在协作式软件工程中衡量智能体的失步恢复能力

2502.06994v2

06-09

GUIPilot: A Consistency-based Mobile GUI Testing Approach for Detecting Application-specific Bugs

GUIPilot: Ein auf Konsistenz basierender mobiler GUI-Testansatz zur Erkennung anwendungsspezifischer Fehler

GUIPilot: 一种基于一致性的移动 GUI 测试方法,用于检测特定应用程序的缺陷

2506.07385v1

06-08 (7)

On Mutation-Guided Unit Test Generation

Auf Mutation-geführte Einheiten-Test-Generierung

论变异引导的单元测试生成

2506.02954v2

06-08

Selective Prompt Anchoring for Code Generation

Selektive Prompt-Ankerung für die Code-Generierung

面向代码生成的选择性提示锚定

2408.09121v5

06-08

Taxonomy of migration scenarios for Qiskit refactoring using LLMs

Taxonomie der Migrationsszenarien für die Qiskit-Refaktorisierung mit LLMs

利用LLMM 进行基斯基特再设定的移徙情景分类

2506.07135v1

06-08

Test-driven Software Experimentation with LASSO: an LLM Prompt Benchmarking Example

Testgetriebene Software-Experimente mit LASSO: ein Beispiel für LLM Prompt Benchmarking

LASSO试验驱动软件实验:LLM快速基准示例

2410.08911v2

06-08

Rethinking the effects of data contamination in Code Intelligence

Überdenken der Auswirkungen der Datenkontamination in Code Intelligence

重新思考法典情报部门数据污染的影响

2506.02791v2

06-08

Hallucination to Consensus: Multi-Agent LLMs for End-to-End Test Generation with Accurate Oracles

Halluzination zum Konsens: Multi-Agent LLMs für die End-to-End-Testgenerierung mit präzisen Oracles

致共识的幻觉:与精准甲骨文进行端至端端至末端试验一代的多代理人法学硕士

2506.02943v3

06-08

LEANCODE: Understanding Models Better for Code Simplification of Pre-trained Large Language Models

LEANCODE: Modelle besser verstehen für Code-Vereinfachung von vortrainierten großen Sprachmodellen

LEANCODE: 更好地理解模式,以更好地简化培训前大语言模式的守则

2505.14759v3

06-08

A Closer Look into Transformer-Based Code Intelligence Through Code Transformation: Challenges and Opportunities

Ein genauerer Blick in transformerbasierte Code-Intelligenz durch Code-Transformation: Herausforderungen und Chancen

更仔细地研究《以变革者为基础的守则》通过《守则》的转变而获得的情报:挑战和机遇

2207.04285v2

06-07 (6)

Is Your Training Pipeline Production-Ready? A Case Study in the Healthcare Domain

Ist Ihr Training Pipeline Production-Ready? Eine Fallstudie im Bereich Healthcare

你的训练管道生产-准备? 保健领域案例研究

2506.06946v1

06-07

Object-Spatial Programming

Objekträumliche Programmierung

物体空间方案拟订

2503.15812v6

06-07

LLM4Vuln: A Unified Evaluation Framework for Decoupling and Enhancing LLMs’ Vulnerability Reasoning

LLM4Vuln: Ein einheitlicher Bewertungsrahmen für die Entkopplung und Verbesserung der Schwachstelle von LLM

LLM4Vuln:脱钩和加强LLLM女士脆弱性原因统一评价框架

2401.16185v4

06-07

CleanVul: Automatic Function-Level Vulnerability Detection in Code Commits Using LLM Heuristics

CleanVul: Automatische Sicherheitserkennung auf Funktionsebene in Code Commits mit LLM Heuristics

CleanVul: 使用 LLM 启发式方法在代码提交中自动检测函数级漏洞

2411.17274v6

06-07

Quantum-Based Software Engineering

Quantenbasierte Software-Engineering

基于量子的软件工程

2505.23674v2

06-07

Automated Repair of Ambiguous Natural Language Requirements

Automatisierte Reparatur von mehrdeutigen natürlichsprachlichen Anforderungen

自动修理模糊的自然语言要求

2505.07270v2

06-07

Monitoring Continuous Integration Practices in Industry: A Case Study

Überwachung kontinuierlicher Integrationspraktiken in der Industrie: Eine Fallstudie

持续监测工业一体化做法:个案研究

2503.02610v2

06-07

On the Need to Monitor Continuous Integration Practices – An Empirical Study

Über die Notwendigkeit, kontinuierliche Integrationspraktiken zu überwachen – Eine empirische Studie

监测持续融合做法的必要性 – – 经验研究

2409.05101v2

06-07

Beyond Surface Similarity: Evaluating LLM-Based Test Refactorings with Structural and Semantic Awareness

Beyond Surface Similarity: Bewertung von LLM-basierten Test-Refactorings mit strukturellem und semantischem Bewusstsein

超越地表相似性:评估基于LLM的具有结构和语义意识的测试重构因素

2506.06767v1

06-07

Mind the Gap: A Readability-Aware Metric for Test Code Complexity

Achten Sie auf die Lücke: Ein Readability-Aware Metric für Test Code Complexity

思维差距:测试编码复杂度的可读性-可读性软件

2506.06764v1

06-06 (5)

AI Simulation by Digital Twins: Systematic Survey, Reference Framework, and Mapping to a Standardized Architecture

KI-Simulation durch digitale Zwillinge: Systematische Umfrage, Referenzrahmen und Mapping zu einer standardisierten Architektur

AI 数字双对模拟:系统调查、参考框架和绘制标准化建筑图

2506.06580v1

06-06

DISC: Dynamic Decomposition Improves LLM Inference Scaling

DISC: Dynamische Zerlegung verbessert LLM-Inferenzskalierung

DISC: 动态分解改善 LLM 推理扩展

2502.16706v2

06-06

MergeRepair: An Exploratory Study on Merging Task-Specific Adapters in Code LLMs for Automated Program Repair

MergeRepair: Eine explorative Studie zum Zusammenführen von Task-spezifischen Adaptern in Code LLMs zur automatischen Programmreparatur

合并Repair:关于将特定任务适应器合并成自动方案维修代码LLMS的探索性研究

2408.09568v3

06-06

Private GPTs for LLM-driven testing in software development and machine learning

Private GPTs für LLM-gesteuerte Tests in Softwareentwicklung und maschinellem Lernen

在软件开发和机器学习方面进行由LLLM驱动的测试的私人GPT

2506.06509v1

06-06

Information-Theoretic Detection of Unusual Source Code Changes

Information-Theoretische Erkennung ungewöhnlicher Quellcode-Änderungen

异常源代码变化的信息理论检测

2506.06508v1

06-06

Enhancing Software Supply Chain Security Through STRIDE-Based Threat Modelling of CI/CD Pipelines

Verbesserung der Sicherheit der Software-Lieferkette durch STRIDE-basierte Bedrohungsmodellierung von CI/CD-Pipelines

通过基于 STRIDE 的 CI/CD 流水线威胁建模加强软件供应链安全

2506.06478v1

06-06

PyGemini: Unified Software Development towards Maritime Autonomy Systems

PyGemini: Einheitliche Softwareentwicklung für maritime Autonomiesysteme

PyGemini:向海洋自主系统发展统一软件

2506.06262v1

06-06

DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation

DesignBench: Ein umfassender Benchmark für die MLLM-basierte Generierung von Front-End-Codes

DesignBench:基于 MLLM 的前端代码生成综合基准

2506.06251v1

06-06

Scalable Language Agnostic Taint Tracking using Explicit Data Dependencies

Skalierbare Sprache Agnostic Taint Tracking mit expliziten Datenabhängigkeiten

使用显式数据依赖进行可扩展的语言无关污点跟踪

2506.06247v1

141

06-06

MLOps with Microservices: A Case Study on the Maritime Domain

MLOps mit Microservices: Eine Fallstudie zum maritimen Bereich

具有微服务的多边业务方案:海洋领域案例研究

2506.06202v1

142

06-06

Obfuscation-Resilient Binary Code Similarity Analysis using Dominance Enhanced Semantic Graph

Obfuscation-Resilient Binary Code Ähnlichkeitsanalyse mit Dominance Enhanced Semantic Graph

利用压强增强语义图分析显要强化语义图

2506.06161v1

143

06-06

Reasoning Through Execution: Unifying Process and Outcome Rewards for Code Generation

Begründung durch Ausführung: Vereinheitlichung von Prozess- und Ergebnisprämien für die Codegenerierung

执行中的理由:代码生成的统一程序和结果奖励

2412.15118v2

144

06-06

Leveraging Generative AI for Enhancing Automated Assessment in Programming Education Contests

Nutzung generativer KI zur Verbesserung der Automatisierten Bewertung bei der Programmierung von Bildungswettbewerben

利用 “ 利用激励AI “ 增强方案编制教育竞赛中的自动评估

2506.05990v1

145

06-06

A Preference-Driven Methodology for High-Quality Solidity Code Generation

Eine präferenzorientierte Methodik für die Erzeugung von Soliditätscodes hoher Qualität

高质量固体废物代码生成的优先开发方法

2506.03006v2

146

06-06

Analysis of cost-efficiency of serverless approaches

Analyse der Kosteneffizienz serverloser Ansätze

分析无服务器无服务器方法的成本效益

2506.05836v1

147

06-06

Training Software Engineering Agents and Verifiers with SWE-Gym

Schulung von Software Engineering Agents und Prüfern mit SWE-Gym

SWE-Gym培训软件工程代理和验证人

2412.21139v2

148

06-06

PoCGen: Generating Proof-of-Concept Exploits for Vulnerabilities in Npm Packages

PoCGen: Proof-of-Concept Exploits für Schwachstellen in Npm-Paketen generieren

PoCGen:为Npm包件的脆弱度生成概念探索校验

2506.04962v2

149

06-06

Towards Mixed-Criticality Software Architectures for Centralized HPC Platforms in Software-Defined Vehicles: A Systematic Literature Review

Auf dem Weg zu Softwarearchitekturen für zentralisierte HPC-Plattformen in softwaredefinierten Fahrzeugen: Ein systematischer Literaturbericht

争取为软件定义车辆中集中的高氯聚氯乙烯平台建立混合-临界环境软件结构:系统文学审查

2506.05822v1

150

06-06

CodeContests+: High-Quality Test Case Generation for Competitive Programming

CodeContests+: Hochqualitative Testfall-Generation für wettbewerbsfähige Programmierung

标准测试+:为竞争性方案拟订编制高品质测试个案

2506.05817v1

151

06-06

RepoMaster: Autonomous Exploration and Understanding of GitHub Repositories for Complex Task Solving

RepoMaster: Autonome Exploration und Verständnis von GitHub-Lagerstätten für komplexe Aufgabenlösung

RepoMaster:为复杂任务解决而自主探索和了解GitHub储存库

2505.21577v2

152

06-06

ProSec: Fortifying Code LLMs with Proactive Security Alignment

ProSec: Erweiterung von Code LLMs mit proaktiver Sicherheitsausrichtung

Prosec: 使用前期安全对齐来强化代码LLMs

2411.12882v3

153

06-06

Analyzing the Evolution and Maintenance of Quantum Software Repositories

Analyse der Evolution und Wartung von Quantum Software Repositories

分析量子软件储存库的演变和维护

2501.06894v3

154

06-06

Multi-Agent Collaboration via Cross-Team Orchestration

Multi-Agenten-Zusammenarbeit über Cross-Team-Orchestrierung

通过跨团队管弦化多机构协作

2406.08979v2

155

06-06

CoopetitiveV: Leveraging LLM-powered Coopetitive Multi-Agent Prompting for High-quality Verilog Generation

CoopetitiveV: LLM-powered Coopetitive Multi-Agent für hochwertige Verilog-Generation

协作V:利用LLM-动力协同协作的多方协作促进高品质活性一代

2412.11014v2

156

06-05 (4)

Deployability-Centric Infrastructure-as-Code Generation: An LLM-based Iterative Framework

Deployability-Centric Infrastructure-as-Code Generation: Ein LLM-basiertes Iteratives Framework

以LLM为基础的迭代框架

2506.05623v1

157

06-05

Which Prompting Technique Should I Use? An Empirical Investigation of Prompting Techniques for Software Engineering Tasks

Welche Prompting-Technik sollte ich verwenden? Eine empirische Untersuchung von Prompting-Techniken für Software-Engineering-Aufgaben

我应使用哪一种快速技术?软件工程任务快速技术的经验调查。

2506.05614v1

158

06-05

A Large Language Model Approach to Identify Flakiness in C++ Projects

Ein Ansatz für ein großes Sprachmodell zur Identifizierung von Flakiness in C++-Projekten

C++项目中查明Flakiness的大语言示范方法

2412.12340v2

159

06-05

PandasBench: A Benchmark for the Pandas API

PandasBench: Ein Benchmark für die Pandas API

Pandas Bunch:Pandas API基准

2506.02345v2

160

06-05

Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety

Interpretation trifft auf Sicherheit: Eine Umfrage zu Interpretationsmethoden und Tools zur Verbesserung der LLM-Sicherheit

口译满足安全需要:关于改进LLM安全的解释方法和工具的调查

2506.05451v1

161

06-05

Software Bill of Materials in Software Supply Chain Security: A Systematic Literature Review

Software Bill of Materials in Software Supply Chain Sicherheit Ein systematischer Literaturbericht

软件供应链安全材料法案系统化文献审查

2506.03507v2

162

06-05

Beyond the Protocol: Unveiling Attack Vectors in the Model Context Protocol Ecosystem

Jenseits des Protokolls: Enthüllen von Angriffsvektoren im Modell Kontext Protokoll Ökosystem

《议定书》之后的《议定书》:《示范背景议定书》中的 “ 固定攻击矢量 “ 生态系统

2506.02040v2

163

06-05

LLM-Guided Scenario-based GUI Testing

LLM-geführte Szenario-basierte GUI-Tests

LLM-LLM 指导设想情况用户界面测试

2506.05079v1

164

06-05

Tech-ASan: Two-stage check for Address Sanitizer

Tech-ASan: Zweistufiger Check für Address Sanitizer

Tech-Asan:地址防疫剂两阶段检查

2506.05022v1

165

06-05

BacPrep: An Experimental Platform for Evaluating LLM-Based Bacalaureat Assessment

BacPrep: Eine experimentelle Plattform zur Bewertung von LLM-basiertem Bacalaureat Assessment

BacPrep:评估以LLM为基础的Bakaraureat评估的实验平台

2506.04989v1

166

06-05

A Multi-Dataset Evaluation of Models for Automated Vulnerability Repair

Eine Multi-Dataset-Bewertung von Modellen für die Automatisierte Sicherheitsreparatur

对自动脆弱性修复模型的多数据集评价

2506.04987v1

167

06-05

Multi-Language Detection of Design Pattern Instances

Mehrsprachige Erkennung von Designmuster-Instanzen

多语言设计模式多语言探测

2506.03903v2

168

06-05

Deconstructing Obfuscation: A four-dimensional framework for evaluating Large Language Models assembly code deobfuscation capabilities

Dekonstruieren von Obfuscation: Ein vierdimensionaler Rahmen für die Auswertung von Großsprachenmodellen Assembly Code Deobfuscation Fähigkeiten

解构腐蚀:四维框架,用于评价大语言模型组装编码脱腐能力

2505.19887v2

169

06-05

Tensor-based multivariate function approximation: methods benchmarking and comparison

Tensor-basierte multivariate Funktionsannäherung: Methoden Benchmarking und Vergleich

以电锯为基础的多变量函数近似值:方法基准和比较

2506.04791v1

170

06-05

From Developer Pairs to AI Copilots: A Comparative Study on Knowledge Transfer

Von Entwicklerpaaren zu KI-Copiloten: Eine vergleichende Studie zum Wissenstransfer

从开发者对等到AI 副驾驶员:知识转让比较研究

2506.04785v1

171

06-05

QuanUML: Towards A Modeling Language for Model-Driven Quantum Software Development

QuanUML: Auf dem Weg zu einer Modellierungssprache für modellgetriebene Quantensoftware-Entwicklung

QuuUML:争取为开发模型驱动量子软件开发建立示范语言

2506.04639v1

172

06-05

Seed-Coder: Let the Code Model Curate Data for Itself

Saatgut-Coder: Lassen Sie das Code-Modell Daten für sich selbst kuratieren

种子编码器:让代码模型为它自己计算曲线数据

2506.03524v2

173

06-05

KPIRoot+: An Efficient Integrated Framework for Anomaly Detection and Root Cause Analysis in Large-Scale Cloud Systems

KPIRoot+: Ein effizientes integriertes Framework für Anomalieerkennung und Ursachenanalyse in großräumigen Cloud-Systemen

KPIROot+:大型云系统异常探测和根本原因分析有效综合框架

2506.04569v1

Article 0

Title@2025-06-12 (4): SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks

Title: SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks

SWE-Factory: Ihre automatisierte Fabrik für Ausgabeauflösungstraining Daten- und Bewertungs-Benchmarks

SWE-Factory: 您的解决问题自动工厂培训数据和评价基准 2506.10954v1

Authors (9): Lianghong Guo, Yanlin Wang, Caihua Li, Pengyu Yang, Jiachi Chen, Wei Tao, Yingtian Zou, Duyu Tang, Zibin Zheng

Constructing large-scale datasets for the GitHub issue resolution task is crucial for both training and evaluating the software engineering capabilities of Large Language Models (LLMs). However, the traditional process for creating such benchmarks is notoriously challenging and labor-intensive, particularly in the stages of setting up evaluation environments, grading test outcomes, and validating task instances. In this paper, we propose SWE-Factory, an automated pipeline that addresses these challenges by integrating three core automated components. First, we introduce SWE-Builder, a multi-agent system that automates evaluation environment construction; it employs four specialized agents that work in a collaborative, iterative loop and leverages an environment memory pool to enhance efficiency. Second, we introduce a standardized, exit-code-based grading method that eliminates the need for manually writing custom parsers. Finally, we automate the fail2pass validation process using these reliable exit code signals. Experiments on 671 issues across four programming languages show that our pipeline can effectively construct valid task instances; for example, with GPT-4.1-mini, our SWE-Builder constructs 269 valid instances at $0.045 per instance, while with Gemini-2.5-flash, it achieves comparable performance at the lowest cost of $0.024 per instance. We also demonstrate that our exit-code-based grading achieves 100% accuracy compared to manual inspection, and our automated fail2pass validation reaches a precision of 0.92 and a recall of 1.00. We hope our automated pipeline will accelerate the collection of large-scale, high-quality GitHub issue resolution datasets for both training and evaluation. Our code and datasets are released at https://github.com/DeepSoftwareAnalytics/swe-factory.

为GitHub 问题解答任务构建大型数据集对于培训和评估大语言模型软件工程能力至关重要。但是,创建这些基准的传统程序具有众所周知的挑战性和劳动密集型,特别是在建立评价环境、测试结果分级和验证任务实例的阶段。在本文件中,我们提议SWE-Factory(一个旨在应对这些挑战的自动化管道)来应对这些挑战。为了解决这些问题,我们的管道整合了三个核心自动自动组件。首先,我们引入了SWE-Builder(一个多试办系统,一个自动存储评价环境的多试办系统),这个系统雇用了4个专业代理,在协作、迭代环中工作,并利用环境存储库来提高效率。第二,我们采用了标准化的、基于退出代码的评级方法,从而消除了手动写定制读取器的需求。最后,我们用这些可靠的退出代码信号来自动连接验证系统。在四个编程语言的671问题上的实验显示,我们的管道可以有效地构建有效的任务实例。
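The exit-code-based grading and fail2pass validation described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the names `RunResult`, `grade`, `is_fail2pass`, and `precision_recall` are hypothetical, and the convention that exit code 0 means "tests passed" is the usual shell convention rather than something the abstract specifies. It is not the SWE-Factory implementation.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Hypothetical record: exit codes of the test command before and after the fix."""
    exit_before: int
    exit_after: int


def grade(exit_code: int) -> bool:
    """Assumed convention: exit code 0 means the tests passed, anything else failed."""
    return exit_code == 0


def is_fail2pass(run: RunResult) -> bool:
    """Keep an instance only if tests fail before the patch and pass after it."""
    return (not grade(run.exit_before)) and grade(run.exit_after)


def precision_recall(predicted: list[bool], actual: list[bool]) -> tuple[float, float]:
    """Compare automated fail2pass decisions against manual labels."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    fp = sum(p and not a for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

With such a check, no per-framework log parser is needed: the validation only reads two integers per task instance. Precision and recall are then computed against manually inspected labels, which is how figures such as those reported in the abstract would be obtained.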

Article 1

Title@2025-06-12 (4): LLM-Cure: LLM-based Competitor User Review Analysis for Feature Enhancement

Title: LLM-Cure: LLM-based Competitor User Review Analysis for Feature Enhancement

LLM-Cure: LLM-basierte Anwenderbewertungsanalyse für Feature Enhancement

LLM-Cure: 以LLM为基础的增强地物功能的LLM竞争者用户审查分析分析 2409.15724v2

Authors (3): Maram Assi, Safwat Hassan, Ying Zou

The exponential growth of the mobile app market underscores the importance of constant innovation and rapid response to user demands. As user satisfaction is paramount to the success of a mobile application (app), developers typically rely on user reviews, which represent user feedback that includes ratings and comments, to identify areas for improvement. However, the sheer volume of user reviews poses challenges in manual analysis, necessitating automated approaches. Existing automated approaches either analyze only the target app’s reviews, neglecting the comparison of similar features with competitors, or fail to provide suggestions for feature enhancement. To address these gaps, we propose Large Language Model (LLM)-based Competitive User Review Analysis for Feature Enhancement (LLM-Cure), an approach powered by LLMs to automatically generate suggestions for mobile app feature improvements. More specifically, LLM-Cure identifies and categorizes features within reviews by applying LLMs. When provided with a complaint in a user review, LLM-Cure curates highly rated (4 and 5 stars) reviews in competing apps related to the complaint and proposes potential improvements tailored to the target application. We evaluate LLM-Cure on 1,056,739 reviews of 70 popular Android apps. Our evaluation demonstrates that LLM-Cure significantly outperforms the state-of-the-art approaches in assigning features to reviews by up to 13% in F1-score, up to 16% in recall and up to 11% in precision. Additionally, LLM-Cure demonstrates its capability to provide suggestions for resolving user complaints. We verify the suggestions using the release notes that reflect the changes of features in the target mobile app. LLM-Cure achieves a promising average implementation rate of 73% for the provided suggestions.

由于用户满意度是移动应用程序(应用程序)成功与否的关键,因此开发商通常依赖用户审查,这代表用户反馈,包括评级和评论,以确定需要改进的领域;然而,用户审查的数量之大,在人工分析方面构成挑战,需要自动化方法。现有的自动化方法要么只分析目标应用程序审查,忽视对竞争者类似特征的比较,要么不提供增强功能的建议。为了弥补这些差距,我们建议采用基于用户满意度的大型语言模型(LLM),基于功能增强的竞争性用户审查分析(LLLM-Cure),开发者通常依赖用户审查,这种审查代表用户反馈,包括评级和评论,以找出需要改进的领域。更具体地说,使用LLM-C审查,在用户审查中,在与投诉有关的竞争性应用程序中,高分级(4和5颗星)审查,在与投诉有关的竞争应用程序中进行高分级(4和5颗星)审查,并提议根据目标应用程序进行可能的改进。我们用LM-C的1 056-739次用户审查,在70个用户和超级应用LLLLLC1次的LC1次版本中,通过高分显示其平均的交付能力,我们的评价显示13个LLM的LLLM的进度。

Article 2

Title@2025-06-12 (4): MultiCoSim: A Python-based Multi-Fidelity Co-Simulation Framework

Title: MultiCoSim: A Python-based Multi-Fidelity Co-Simulation Framework

MultiCoSim: Ein Python-basiertes Multi-Fidelity-Co-Simulations-Framework

MultiCoSim:一个基于Python的多保真度协同仿真框架 2506.10869v1

Authors (2): Quinn Thibeault, Giulia Pedrielli

Simulation is a foundational tool for the analysis and testing of cyber-physical systems (CPS), underpinning activities such as algorithm development, runtime monitoring, and system verification. As CPS grow in complexity and scale, particularly in safety-critical and learning-enabled settings, accurate analysis and synthesis increasingly rely on the rapid use of simulation experiments. Because CPS inherently integrate hardware, software, and physical processes, simulation platforms must support co-simulation of heterogeneous components at varying levels of fidelity. Despite recent advances in high-fidelity modeling of hardware, firmware, and physics, co-simulation in diverse environments remains challenging. These limitations hinder the development of reusable benchmarks and impede the use of simulation for automated and comparative evaluation. Existing simulation tools often rely on rigid configurations, lack automation support, and present obstacles to portability and modularity. Many are configured through static text files or impose constraints on how simulation components are represented and connected, making it difficult to flexibly compose systems or integrate components across platforms. To address these challenges, we introduce MultiCoSim, a Python-based simulation framework that enables users to define, compose, and configure simulation components programmatically. MultiCoSim supports distributed, component-based co-simulation and allows seamless substitution and reconfiguration of components. We demonstrate the flexibility of MultiCoSim through case studies that include co-simulations involving custom automaton-based controllers, as well as integration with off-the-shelf platforms like the PX4 autopilot for aerial robotics. These examples highlight MultiCoSim’s capability to streamline CPS simulation pipelines for research and development.

模拟平台是分析和测试网络物理系统(CPS)的基本模拟工具,它支撑了算法开发、运行时间监测和系统核查等活动。随着CPS在复杂和规模上,特别是在安全关键和学习环境上,精确的分析和合成日益依赖模拟实验的快速使用。由于CPS内在地将硬件、软件和物理流程整合在一起,模拟平台必须支持不同组成部分在不同水平的忠诚度上的共同模拟。尽管最近在硬件、固态软件和物理的高性建模、在不同环境中共同模拟等活动方面有所进展,但挑战依然艰巨。这些限制阻碍了可再利用的输油平台基准的开发,阻碍了自动和比较评价的模拟使用。现有的模拟工具往往依赖僵硬的配置,缺乏自动化支持,并存在移动性和模块化障碍。许多模拟平台必须通过静态的文本文件配置,或对模拟组件的表达和连接设置限制,因此难以灵活地配置系统或将基于平台的组件整合。为了应对这些挑战,我们引入多功能系统,多功能模拟模拟框架使用户能够定义、配置、配置和配置自动智能的自动智能模型能力,通过多功能系统进行多功能系统测试,从而展示多功能、多功能系统化的立像化的立像系统整合。

Article 3

Title@2025-06-12 (4): Evaluating Large Language Models on Non-Code Software Engineering Tasks

Title: Evaluating Large Language Models on Non-Code Software Engineering Tasks

Bewertung großer Sprachmodelle auf nicht-Code-Software-Engineering-Aufgaben

评价非守则软件工程任务大语言模型 2506.10833v1

Authors (2): Fabian C. Peña, Steffen Herbold

Large Language Models (LLMs) have demonstrated remarkable capabilities in code understanding and generation; however, their effectiveness on non-code Software Engineering (SE) tasks remains underexplored. We present the first comprehensive benchmark, which we name ‘Software Engineering Language Understanding’ (SELU), for evaluating LLMs on 17 non-code tasks, ranging from identifying whether a requirement is functional or non-functional to estimating the effort and complexity of backlog items. SELU covers classification, regression, Named Entity Recognition (NER), and Masked Language Modeling (MLM) targets, with data drawn from diverse sources such as code repositories, issue tracking systems, and developer forums. We fine-tune 22 open-source LLMs, prompt two proprietary alternatives, and train two baselines. Performance is measured using metrics such as F1-macro, SMAPE, F1-micro, and accuracy, and compared via the Bayesian signed-rank test. Our results show that moderate-scale decoder-only models consistently form a top tier, exhibiting high mean performance and low across-task variance, while domain adaptation via code-focused pre-training might yield only modest improvements. These insights guide model selection for non-code SE workflows and highlight directions for expanding SELU to generative and design-oriented scenarios.

大型语言模型(LLMS)在代码理解和生成方面表现出了非凡的能力;然而,它们对于非代码软件工程(SE)任务的有效性仍未得到充分探讨。我们提出了第一个综合基准,我们命名为“软件工程语言理解”(SELU),用于评估17项非代码任务中的LLMS(LLMS),其范围从确定一项要求是否功能性或不起作用到评估积压项目的努力和复杂性,到评估积压项目的工作和复杂性。SELU包括分类、回归、实体识别(NER)和隐蔽语言模型(MLMM)目标,其数据来自代码库、问题跟踪系统和开发论坛等不同来源。我们提出了第一个综合基准,我们称之为“软件工程语言理解”(SELU),用于评估17项非代码任务中的LMSLM(SELU),其绩效是通过F1-macrocro、SMAPE、F1-MI和精确度比较指标衡量的。我们的结果显示,中度的解调模式始终构成最高层,展示高平均值和低位差异差异,同时通过以区域方向显示SE-LIMVL的升级前选择。
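For readers unfamiliar with the metrics named in the abstract, the sketch below computes SMAPE, F1-macro, and F1-micro in plain Python. It is illustrative only: SMAPE has several variants in the literature, and the one used here (mean of |F - A| / ((|A| + |F|) / 2), in percent) is an assumption, as are the function names `smape` and `f1_scores`. The abstract does not state which definitions SELU uses.

```python
def smape(actual, forecast):
    """Symmetric mean absolute percentage error (one common variant), in percent."""
    terms = []
    for a, f in zip(actual, forecast):
        denom = (abs(a) + abs(f)) / 2
        # Treat a 0/0 term as zero error (an assumption of this sketch).
        terms.append(0.0 if denom == 0 else abs(f - a) / denom)
    return 100 * sum(terms) / len(terms)


def f1_scores(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    per_class = []
    tp_all = fp_all = fn_all = 0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        per_class.append(2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0)
        tp_all, fp_all, fn_all = tp_all + tp, fp_all + fp, fn_all + fn
    macro = sum(per_class) / len(per_class)      # unweighted mean over classes
    total = 2 * tp_all + fp_all + fn_all
    micro = 2 * tp_all / total if total else 0.0  # pooled counts over all classes
    return macro, micro
```

Macro-averaging weights every class equally, so rare classes matter as much as frequent ones; micro-averaging pools the counts and, in the single-label case, coincides with accuracy. That difference is one reason a benchmark might report both.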

Article 4

Title@2025-06-12 (4): Solving Package Management via Hypergraph Dependency Resolution

Title: Solving Package Management via Hypergraph Dependency Resolution

Lösung des Paketmanagements über Hypergraph Dependency Resolution

通过电报依赖决议解决软件包管理 2506.10803v1

Authors (10): Ryan Gibb, Patrick Ferris, David Allsopp, Michael Winston Dales, Mark Elvers, Thomas Gazagnaire, Sadiq Jaffer, Thomas Leonard, Jon Ludlam, Anil Madhavapeddy

Package managers are everywhere, with seemingly every language and operating system implementing their own solution. The lack of interoperability between these systems means that multi-lingual projects are unable to express precise dependencies across language ecosystems, and external system and hardware dependencies are typically implicit and unversioned. We define HyperRes, a formal system for describing versioned dependency resolution using a hypergraph that is expressive enough to model many ecosystems and solve dependency constraints across them. We define translations from dozens of existing package managers to HyperRes and comprehensively demonstrate that dependency resolution can work across ecosystems that are currently distinct. This does not require users to shift their choice of package managers; instead, HyperRes allows for the translation of packaging metadata between ecosystems, and for solving to be precisely specialised to a particular deployment environment.

软件包管理员无处不在,看似每一种语言和业务系统都有自己的解决方案。这些系统之间缺乏互操作性,这意味着多种语文项目无法表达不同语言生态系统的精确依赖性,外部系统和硬件依赖性通常是隐含和没有反向的。我们定义了HyperRes,这是一个正式的系统,用于用高压图描述已版本的依赖性分辨率,该高压图足以模拟许多生态系统并解决它们之间的依赖性制约。我们定义了从几十个现有软件包管理员到HyperRes的翻译,并全面表明依赖性解决方案可在目前不同生态系统之间发挥作用。这不要求用户改变对软件包管理员的选择;相反,HyperRes允许在生态系统之间翻译包装元数据,并精确地专门解决特定部署环境的问题。

Article 5

Title@2025-06-12 (4): What Users Value and Critique: Large-Scale Analysis of User Feedback on AI-Powered Mobile Apps

Title: What Users Value and Critique: Large-Scale Analysis of User Feedback on AI-Powered Mobile Apps

Was Nutzer schätzen und Kritik: Groß angelegte Analyse von User Feedback auf KI-Powered Mobile Apps

什么是用户价值和关键值:对AI授权移动应用程序用户反馈的大规模分析 2506.10785v1

Authors (4): Vinaik Chhetri, Krishna Upadhyay, A. B. Siddique, Umar Farooq

Artificial Intelligence (AI)-powered features have rapidly proliferated across mobile apps in various domains, including productivity, education, entertainment, and creativity. However, how users perceive, evaluate, and critique these AI features remains largely unexplored, primarily due to the overwhelming volume of user feedback. In this work, we present the first comprehensive, large-scale study of user feedback on AI-powered mobile apps, leveraging a curated dataset of 292 AI-driven apps across 14 categories with 894K AI-specific reviews from Google Play. We develop and validate a multi-stage analysis pipeline that begins with a human-labeled benchmark and systematically evaluates large language models (LLMs) and prompting strategies. Each stage, including review classification, aspect-sentiment extraction, and clustering, is validated for accuracy and consistency. Our pipeline enables scalable, high-precision analysis of user feedback, extracting over one million aspect-sentiment pairs clustered into 18 positive and 15 negative user topics. Our analysis reveals that users consistently focus on a narrow set of themes: positive comments emphasize productivity, reliability, and personalized assistance, while negative feedback highlights technical failures (e.g., scanning and recognition), pricing concerns, and limitations in language support. Our pipeline surfaces both satisfaction with one feature and frustration with another within the same review. These fine-grained, co-occurring sentiments are often missed by traditional approaches that treat positive and negative feedback in isolation or rely on coarse-grained analysis. To this end, our approach provides a more faithful reflection of the real-world user experiences with AI-powered apps. Category-aware analysis further uncovers both universal drivers of satisfaction and domain-specific frustrations.

包括生产率、教育、娱乐和创造力在内的不同领域的移动应用程序中,人工智能(AI)驱动功能迅速扩散,在包括生产力、教育、娱乐和创造力在内的各个领域,移动应用程序中,这些功能迅速扩散;然而,用户如何看待、评价和批评这些人工智能功能,在很大程度上仍未探索,这主要是由于用户反馈量巨大。在这项工作中,我们首次对AI动力移动应用程序用户反馈进行了全面、大规模研究,利用了谷歌播放的894K AI驱动应用程序的简化数据集,在14个类别中,利用了292个AI驱动应用程序的简化数据集,在Google Play进行了894K AI特定用户回顾。我们开发并验证了一个多阶段分析管道,从人标基准开始,系统评价大型语言模型(LLLMS)和催化战略。每个阶段,包括审查分类、方位感知提取和组合,都验证了准确性和一致性。我们的管道能够对用户反馈进行可缩略、高精度分析,将超过100万个方位同对口,分为18个正反向和15个负向用户专题。我们的分析显示用户始终注重一套狭窄主题:正面评价,正面评价、可靠性、可靠性、可靠性、准确性分析、准确性分析,以及内部分析中的一种反向内对口、负面分析。

Article 6

Title@2025-06-12 (4): From Tea Leaves to System Maps: Context-awareness in Monitoring Operational Machine Learning Models

Title: From Tea Leaves to System Maps: Context-awareness in Monitoring Operational Machine Learning Models

Von Tea Leaves zu System Maps: Kontext-Bewusstsein bei der Überwachung operativer Machine Learning Modelle

从茶叶休假到系统地图:监测操作机器学习模式的背景意识 2506.10770v1

Authors (4): Joran Leest, Claudia Raibulet, Patricia Lago, Ilias Gerostathopoulos

Machine learning (ML) models in production do not fail due to statistical anomalies in their input data; they fail due to contextual misalignment – when their environment deviates from training assumptions, leading to unreliable predictions. Effective ML monitoring requires rich contextual information to move beyond detecting statistical shifts toward meaningful alerts and systematic root-cause analysis. Yet, surprisingly, despite extensive research in ML monitoring and related disciplines (drift detection, data validation, out-of-distribution detection), there is no shared understanding of how to use contextual information – striking, given that monitoring involves interpretation of information in context. In response, this paper presents a systematic review to characterize and structure the various types of contextual information in this domain. Our analysis examines 94 primary studies across data mining, databases, software engineering, and ML. We introduce the Contextual System–Aspect–Representation (C-SAR) framework, a conceptual model that synthesizes our findings. We also identify 20 recurring and potentially reusable patterns of specific system, aspect, and representation combinations, and map them to the monitoring activities they support. This study provides a new perspective on ML monitoring: from interpreting “tea leaves” of observational statistics into constructing and managing “system maps” that enable systematic and reliable ML monitoring practices.

生产中的机器学习(ML)模型并非因为其投入数据中的统计异常而失败;由于环境与培训假设不同,导致不可靠的预测,因此环境不匹配;有效的ML监测需要丰富的背景信息,以超越对统计变化的探测,转向有意义的警报和系统根源分析;然而,令人惊讶的是,尽管对ML监测和相关学科进行了广泛的研究(遥控检测、数据验证、分配以外的检测),但对于如何使用背景信息没有共同的理解 – – 惊人,因为监测涉及对信息的解释。作为回应,本文件对该领域各种类型的背景资料的定性和结构进行了系统审查。我们的分析审查了94项主要研究,涉及数据挖掘、数据库、软件工程和ML。我们引入了 “ 环境系统 – – 显示(C-SAR) “ 框架,这是一个综合我们调查结果的概念模型。我们还确定了具体系统、方面和代表组合的20种经常性和可能重复使用的模式,并将这些模式映射到它们所支持的监测活动。本研究报告从新的角度审视了ML监测:从“系统化”观察方法,将“系统显示”的“系统显示”到“系统观察做法。

Article 7

Title@2025-06-12 (4): An Empirical Evaluation of Pre-trained Large Language Models for Repairing Declarative Formal Specifications

Title: An Empirical Evaluation of Pre-trained Large Language Models for Repairing Declarative Formal Specifications

Eine empirische Bewertung von vortrainierten großen Sprachmodellen zur Reparatur deklarativer formaler Spezifikationen

对培训前大语言模型进行经验评估,以修复《宣言》正式规格 2404.11050v2

Authors (4): Mohannad Alhanahnah, Md Rashedul Hasan, Lisong Xu, Hamid Bagheri

Automatic Program Repair (APR) has garnered significant attention as a practical research domain focused on automatically fixing bugs in programs. While existing APR techniques primarily target imperative programming languages like C and Java, there is a growing need for effective solutions applicable to declarative software specification languages. This paper systematically investigates the capacity of Large Language Models (LLMs) to repair declarative specifications in Alloy, a declarative formal language used for software specification. We designed 12 different repair settings, encompassing single-agent and dual-agent paradigms, utilizing various LLMs. These configurations also incorporate different levels of feedback, including an auto-prompting mechanism for generating prompts autonomously using LLMs. Our study reveals that the dual-agent setup with auto-prompting outperforms the other settings, albeit with a marginal increase in the number of iterations and token usage. This dual-agent setup demonstrated superior effectiveness compared to state-of-the-art Alloy APR techniques when evaluated on a comprehensive set of benchmarks. This work is the first to empirically evaluate LLM capabilities to repair declarative specifications, while taking into account recently trending LLM concepts such as LLM-based agents, feedback, auto-prompting, and tools, thus paving the way for future agent-based techniques in software engineering.

自动程序修理(APR)作为一个务实的研究领域,以自动修正程序中的错误为重点,引起了人们的极大关注。虽然现有的PRA技术主要针对C和爪哇等急需的编程语言,但越来越需要适用于宣示软件规格语言的有效解决方案。本文系统地调查了大语言模型(LLMs)在软件规格中修补宣示性正式语言Alloy中宣示性规格的能力。我们设计了12个不同的修理环境,包括单一试办和双重试办模式,利用各种LLMs。这些配置还包含不同层次的反馈,包括自动生成LMS的自动促动机制。我们的研究显示,自动制作软件规格语言软件的双试剂在其他设置方面表现良好,尽管迭代数和代用量略有增加。这种双试装置在根据一套综合基准评估时,显示出了优于最先进的Alloy APR技术的功效。这项工作首先从经验角度评估了LLM的修饰性规格能力,同时考虑到最近的趋势性 LLMM概念,如LM-LM-SM的软件工具,因此的将来的反馈。

Article 8

Title@2025-06-12 (4): Formalising Software Requirements using Large Language Models

Title: Formalising Software Requirements using Large Language Models

Formalisierung von Software-Anforderungen mit großen Sprachmodellen

使用大语言模式正式确定软件要求 2506.10704v1

Authors (3): Arshad Beg, Diarmuid O’Donoghue, Rosemary Monahan

This paper is a brief introduction to our recently initiated project named VERIFAI: Traceability and verification of natural language requirements. The project addresses the challenges in the traceability and verification of formal specifications through providing support for the automatic generation of the formal specifications and the traceability of the requirements from the initial software design stage through the systems implementation and verification. Approaches explored in this project include Natural Language Processing, use of ontologies to describe the software system domain, reuse of existing software artefacts from similar systems (i.e. through similarity based reuse) and large language models to identify and declare the specifications as well as use of artificial intelligence to guide the process.

本文件简要介绍我们最近启动的名为“VERIFAI:自然语言要求的可追踪性和核查”的项目,通过支持从最初软件设计阶段到系统实施和核查阶段自动生成正式规格和要求的可追踪性,解决正式规格的可追踪性和核查方面的挑战,该项目探讨的方法包括“自然语言处理”,使用本体描述软件系统域,重新使用类似系统的现有软件制品(即以类似方式重新使用),以及使用大型语言模型确定和申报规格,以及使用人工智能指导工艺。

Article 9

Title@2025-06-12 (4): Not One to Rule Them All: Mining Meaningful Code Review Orders From GitHub

Title: Not One to Rule Them All: Mining Meaningful Code Review Orders From GitHub

Nicht einer, der sie alle beherrscht: Bergbau Sinnvolle Code-Review-Aufträge von GitHub

无法统治他们所有人:GitHub下达的具有采矿意义的法律审查令 2506.10654v1

Authors (4): Abir Bouraffa, Carolin Brandt, Andy Zaidmann, Walid Maalej

Developers use tools such as GitHub pull requests to review code, discuss proposed changes, and request modifications. While changed files are commonly presented in alphabetical order, this does not necessarily coincide with the reviewer’s preferred navigation sequence. This study investigates the different navigation orders developers follow while commenting on changes submitted in pull requests. We mined code review comments from 23,241 pull requests in 100 popular Java and Python repositories on GitHub to analyze the order in which the reviewers commented on the submitted changes. Our analysis shows that for 44.6% of pull requests, the reviewers comment in a non-alphabetical order. Among these pull requests, we identified traces of alternative meaningful orders: 20.6% (2,134) followed a largest-diff-first order, 17.6% (1,827) were commented in the order of the files’ similarity to the pull request’s title and description, and 29% (1,188) of pull requests containing changes to both production and test files adhered to a test-first order. We also observed that the proportion of reviewed files to total submitted files was significantly higher in non-alphabetically ordered reviews, which also received slightly fewer approvals from reviewers, on average. Our findings highlight the need for additional support during code reviews, particularly for larger pull requests, where reviewers are more likely to adopt complex strategies rather than following a single predefined order.

开发者使用诸如 GitHub 调试代码、讨论拟议修改和请求修改等工具。修改过的文件通常按字母顺序排列, 但不一定与审查者的首选导航顺序相吻合。本研究调查不同的导航命令开发者所遵循的不同导航命令开发者,同时对提交的调试请求作出评论。我们从GitHub 上100个受欢迎的 Java 和 Python 库的23 241 调试请求中提取了意见, 以分析审查者对提交的修改意见的顺序。我们的分析表明, 调试请求中44.6%的调试文件, 审评者以非正反顺序表示意见。在这些调调试请求中,我们发现了有意义的替代命令的线索: 20.6% (2 134) 遵循了最大的第一级命令,17.6% (1 827) 被评为档案与拉动请求的标题和描述相似, 29 % (1 188) 的拉动请求中载有对生产和测试文件的修改意见, 遵守了测试第一级命令。我们还发现, 被审查的文档与提交总文档的比例在非正反导指令审查中大大提高了支持, 在平均审查中可能得到的排序中, 以更复杂的审查中, 得到的排序中, 更接近于更复杂的是更复杂的排序。
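The review orders discussed in the abstract can be made concrete with a small sketch. The helpers below are hypothetical and are not the study's mining pipeline: they assume a pull request is represented as the list of file paths in the order comments were posted, plus a mapping from file path to number of changed lines (both assumed data shapes).

```python
def first_comment_order(commented_paths):
    """Order of files by their first review comment, dropping repeat comments."""
    return list(dict.fromkeys(commented_paths))


def is_alphabetical(order):
    """True if the reviewer visited files in the default alphabetical order."""
    return order == sorted(order)


def is_largest_diff_first(order, diff_sizes):
    """True if files were visited in non-increasing order of changed lines."""
    sizes = [diff_sizes[path] for path in order]
    return all(a >= b for a, b in zip(sizes, sizes[1:]))
```

For example, comments posted on `src/b.py`, `src/a.py`, `src/b.py`, `src/c.py` give the order `b, a, c`, which is not alphabetical; if `b.py` has the largest diff and `c.py` the smallest, the same trace matches a largest-diff-first order. How the study handles ties or single-file pull requests is not stated in the abstract, so this sketch simply treats ties as consistent with the order.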

Article 10

Title@2025-06-12 (4): Scalable Software Testing in Fast Virtual Platforms: Leveraging SystemC, QEMU and Containerization

Title: Scalable Software Testing in Fast Virtual Platforms: Leveraging SystemC, QEMU and Containerization

Skalierbare Software-Tests in schnellen virtuellen Plattformen: Leveraging SystemC, QEMU und Containerization

在快速虚拟平台上进行可缩放软件测试:利用系统C、QEMU和集装箱化 2506.10624v1

Authors (3): Lukas Jünger, Jan Henrik Weinstock, Tim Kraus

The ever-increasing complexity of HW/SW systems presents a persistent challenge, particularly in safety-critical domains like automotive, where extensive testing is imperative. However, the availability of hardware often lags behind, hindering early-stage software development. To address this, Virtual Platforms (VPs) based on the SystemC TLM-2.0 standard have emerged as a pivotal solution, enabling pre-silicon execution and testing of unmodified target software. In this study, we propose an approach leveraging containerization to encapsulate VPs in order to reduce environment dependencies and enable cloud deployment for fast, parallelized test execution, as well as open-source VP technologies such as QEMU and VCML to obviate the need for seat licenses. To demonstrate the efficacy of our approach, we present an Artificial Intelligence (AI) accelerator VP case study. Through our research, we offer a robust solution to address the challenges posed by the complexity of HW/SW systems, with practical implications for accelerating HW/SW co-development.

特别是在汽车等安全关键领域,必须进行广泛测试。然而,硬件的供应往往滞后,阻碍了早期软件开发。为解决这一问题,基于系统C TLM-2.0标准的虚拟平台(VPs)已经成为一个关键解决方案,使得能够在硅前执行和测试未经修改的目标软件。在本研究中,我们建议采用一种利用集装箱装装封五氯丙烷的办法,以减少环境依赖性,使云能用于快速平行的测试执行,以及开放源源VP技术,如QEMU和VCML, 以避免需要座位许可证。为了展示我们的方法的有效性,我们提出了一个人工智能(AI)加速器VP案例研究。我们通过我们的研究,提出了一种强有力的解决方案,以应对HW/SW系统复杂性带来的挑战,对加速HW/SW的共同发展具有实际影响。

Article 11

Title@2025-06-12 (4): AdaptiveLLM: A Framework for Selecting Optimal Cost-Efficient LLM for Code-Generation Based on CoT Length

Title: AdaptiveLLM: A Framework for Selecting Optimal Cost-Efficient LLM for Code-Generation Based on CoT Length

AdaptiveLLM: Ein Framework zur Auswahl optimaler kosteneffizienter LLM für Code-Generation auf Basis der CoT-Länge

适应性LLM:根据Cot长度为编码生成选择最佳成本效率高的LLM框架 2506.10525v1

Authors (4): Junhang Cheng, Fang Liu, Chengru Wu, Li Zhang

While Large Language Models (LLMs) have significantly advanced code generation efficiency, they face inherent challenges in balancing performance and inference costs across diverse programming tasks. Dynamically selecting the optimal LLM based on task difficulty and resource constraints offers a promising approach to achieve an optimal balance between efficiency and performance. However, existing model selection methods are resource-intensive and often neglect cost efficiency. Moreover, these approaches rely on human-annotated difficulty labels that are frequently inaccessible in real-world settings and may not align with the LLM’s own assessment of task difficulty. In this paper, we introduce AdaptiveLLM, a framework that dynamically selects optimal LLMs for a given coding task by automatically assessing task difficulty. Our framework first estimates task difficulty using Chain-of-Thought (CoT) lengths generated by a reasoning model, clusters these into three difficulty levels via k-means, and fine-tunes CodeBERT to embed difficulty-aware features. A trained XGBoost classifier then selects the best model for each problem, optimizing the performance-cost trade-off. Experimental results show that AdaptiveLLM achieves a 7.86% improvement in pass@1 score while reducing resource consumption by 88.9% compared to the baseline method ComplexityNet. When compared to a single model, AdaptiveLLM demonstrates an approximately 15% accuracy improvement while maintaining the same level of cost consumption. In addition, the difficulty assessment using CoT provides more reliable selection criteria than human evaluation. Our replication package is available at https://github.com/cjhCoder7/AdaptiveLLM.

虽然大语言模型(LLMS)具有显著的代码生成效率,但它们在平衡不同方案编制任务的业绩和推算成本方面面临着内在的挑战。根据任务困难和资源限制动态地选择最佳LMLM提供了一种大有希望的方法,以实现效率和业绩之间的最佳平衡。但是,现有的模式选择方法是资源密集型的,往往忽视成本效率。此外,这些方法依赖在现实世界环境中经常无法获取的、可能与LLM自己对任务困难的评估不相符的附加说明的难度标签。在本文中,我们引入了适应性LLM,这是一个框架,通过自动评估任务困难,动态地选择最佳LMM,用于给定的复制任务选择。我们的框架首次估算了任务困难,使用推理模型生成的链式计算长度,将这些模型分为三个难度级别,通过 k 和微调的代码代码BERT 来嵌入困难特性。经过培训的XGBOost分类师随后选择了每个问题的最佳模式,优化了绩效-成本交易。实验结果表明,与基线方法ComplexityNet相比,AdaptiveLLM的pass@1得分提高了7.86%,资源消耗减少了88.9%。与单一模型相比,AdaptiveLLM在成本相当的情况下准确率提高约15%。此外,基于CoT的难度评估比人工评估提供了更可靠的选择标准。复制包见 https://github.com/cjhCoder7/AdaptiveLLM。
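The clustering step in the abstract above (CoT lengths grouped into three difficulty levels via k-means) can be sketched in plain Python. This is a rough illustration, not the authors' implementation: the function names, the deterministic centroid initialisation, and the mapping "longer CoT means harder task" are assumptions, and the CodeBERT and XGBoost stages are omitted.

```python
from typing import List, Tuple


def kmeans_1d(values: List[float], k: int = 3,
              iters: int = 100) -> Tuple[List[float], List[int]]:
    """Lloyd's algorithm on scalar values; returns (centroids, labels). Needs k >= 2."""
    ordered = sorted(values)
    # Assumed initialisation: spread starting centroids over the sorted values.
    centroids = [ordered[(i * (len(ordered) - 1)) // (k - 1)] for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iters):
        # Assignment step: each value joins its nearest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c])) for v in values]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


def difficulty_levels(cot_lengths: List[float]) -> List[str]:
    """Label each task easy/medium/hard by the rank of its cluster centroid.

    Assumption: a longer chain of thought indicates a harder task.
    """
    centroids, labels = kmeans_1d(cot_lengths, k=3)
    by_size = sorted(range(3), key=lambda c: centroids[c])
    rank = {cluster: position for position, cluster in enumerate(by_size)}
    names = ["easy", "medium", "hard"]
    return [names[rank[lab]] for lab in labels]
```

In a fuller pipeline these labels would presumably serve as training targets for the difficulty-aware embedding, but that part is not shown here.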

Article 12

Title@2025-06-12 (4): PyGen: A Collaborative Human-AI Approach to Python Package Creation

Title: PyGen: A Collaborative Human-AI Approach to Python Package Creation

PyGen: Ein kollaborativer Mensch-AI-Ansatz zur Erstellung von Python-Paketen

PyGen:以人类与AI协作的方法创建Python包 2411.08932v4

Authors (6): Saikat Barua, Mostafizur Rahman, Md Jafor Sadek, Rafiul Islam, Shehenaz Khaled, Md. Shohrab Hossain

The principles of automation and innovation serve as foundational elements for advancement in contemporary science and technology. Here, we introduce Pygen, an automation platform designed to empower researchers, technologists, and hobbyists to bring abstract ideas to life as core, usable software tools written in Python. Pygen leverages the immense power of autoregressive large language models to augment human creativity during the ideation, iteration, and innovation process. By combining state-of-the-art language models with open-source code generation technologies, Pygen has significantly reduced the manual overhead of tool development. From a user prompt, Pygen automatically generates Python packages for a complete workflow from concept to package generation and documentation. The findings of our work show that Pygen considerably enhances the researcher’s productivity by enabling the creation of resilient, modular, and well-documented packages for various specialized purposes. We employ a prompt enhancement approach to distill the user’s package description into an increasingly specific and actionable form. Although package generation is inherently an open-ended task, we evaluated the generated packages and the documentation using human evaluation, LLM-based evaluation, and CodeBLEU, with detailed results in the results section. Furthermore, we documented our results, analyzed the limitations, and suggested strategies to alleviate them. Pygen is our vision of ethical automation, a framework that promotes inclusivity, accessibility, and collaborative development. This project marks the beginning of a large-scale effort towards creating tools where intelligent agents collaborate with humans to improve scientific and technological development substantially. Our code and generated examples are open-sourced at https://github.com/GitsSaikat/Pygen.

自动化和创新原则是当代科学技术进步的基础要素。在这里,我们引入了Pygen,这是一个旨在赋予研究人员、技术人员和业余爱好者权力的自动化平台,将抽象思想作为核心的、有用的软件工具带进Python书写的软件工具。Pygen利用自动递减的大语言模型的巨大力量,在思想、循环和创新过程中增强人的创造力。通过将最先进的语言模型与开放源代码生成技术相结合,Pygen大大减少了工具开发的手工管理费用。从用户提示,Pygen自动生成Python软件包,从概念到软件生成和文件的完整工作流程。我们的工作结果表明,Pygen通过为各种专门目的创建有弹性、模块化和有详细记录的软件包组合,大大增强了研究人员的生产力。我们采用了迅速强化的方法,将用户的组合描述提炼成越来越具体和可操作性。我们的任务本来就是公开的,我们用人文评估、LLM自动生成的软件包和文件包包包包包,从概念生成到软件生成和文件的完整流程。我们从大的技术成本框架中,我们用一个记录和代码分析的结果。我们开始,我们的项目和代码生成的逻辑上,我们用大量的逻辑分析。

Article 13

Title@2025-06-12 (4): BugGen: A Self-Correcting Multi-Agent LLM Pipeline for Realistic RTL Bug Synthesis

Title: BugGen: A Self-Correcting Multi-Agent LLM Pipeline for Realistic RTL Bug Synthesis

BugGen: Eine selbstkorrigierende LLM-Pipeline für eine realistische RTL-Bug-Synthese

BugGen: 现实的 RTL 错误合成自更正多 Agency LLM 管道 2506.10501v1

Authors (7): Surya Jasper, Minh Luu, Evan Pan, Aakash Tyagi, Michael Quinn, Jiang Hu, David Kebo Houngninou

Hardware complexity continues to strain verification resources, motivating the adoption of machine learning (ML) methods to improve debug efficiency. However, ML-assisted debugging critically depends on diverse and scalable bug datasets, which existing manual or automated bug insertion methods fail to reliably produce. We introduce BugGen, a first-of-its-kind, fully autonomous, multi-agent pipeline leveraging Large Language Models (LLMs) to systematically generate, insert, and validate realistic functional bugs in RTL. BugGen partitions modules, selects mutation targets via a closed-loop agentic architecture, and employs iterative refinement and rollback mechanisms to ensure syntactic correctness and functional detectability. Evaluated across five OpenTitan IP blocks, BugGen produced 500 unique bugs with 94% functional accuracy and achieved a throughput of 17.7 validated bugs per hour, over five times faster than typical manual expert insertion. Additionally, BugGen identified 104 previously undetected bugs in OpenTitan regressions, highlighting its utility in exposing verification coverage gaps. Compared against Certitude, BugGen demonstrated over twice the syntactic accuracy, deeper exposure of testbench blind spots, and more functionally meaningful and complex bug scenarios. Furthermore, when these BugGen-generated datasets were employed to train ML-based failure triage models, we achieved high classification accuracy (88.1%-93.2%) across different IP blocks, confirming the practical utility and realism of generated bugs. BugGen thus provides a scalable solution for generating high-quality bug datasets, significantly enhancing verification efficiency and ML-assisted debugging.

硬件复杂程度继续使核查资源紧张,促使采用机器学习(ML)方法来提高调试效率。然而,ML协助的调试关键取决于多种且可缩放的错误数据集,而现有的人工或自动错误插入方法无法可靠地生成这些数据集。我们引入了BugGen,这是同类的首个完全自主的多剂管道管道,利用大语言模型(LLLMS)系统生成、插入和验证了RTL. BugGen 分区模块中切合实际的功能错误,通过闭路代理结构选择了突变目标,并使用了迭代性改进和回滚回机制,以确保同步性效用正确性和功能可探测性。在五个 OpenTitan IP IP区中,BugGen 生成了500个独特的错误,其功能准确性为94%,并实现了17.7个经验证的错误,比典型的手动专家插入速度快5倍。此外,BugG在 OpenTeral-Timan Registration中发现了104个先前未检测过的错误,突出的错误在揭示核查范围上产生漏洞。对比,BBBBBG 80,BBBen 和滚化的精确度的精确度为两倍的两倍。当我们使用了双向了两次地展示的精确度提高了的精确度测测测测测测测测测测算时,这些测测测测测测测测测测测测测测算。

Article 14

Title@2025-06-12 (4): EXPEREPAIR: Dual-Memory Enhanced LLM-based Repository-Level Program Repair

Title: EXPEREPAIR: Dual-Memory Enhanced LLM-based Repository-Level Program Repair

EXPEREPAIR: Dual-Memory verbesserte LLM-basierte Repository-Level-Programm-Reparatur

以储存库为基础的两层增强的LLM(基于仓库的LLM)方案维修 2506.10484v1

Authors (6): Fangwen Mu, Junjie Wang, Lin Shi, Song Wang, Shoubin Li, Qing Wang

Automatically repairing software issues remains a fundamental challenge at the intersection of software engineering and AI. Although recent advancements in Large Language Models (LLMs) have demonstrated potential for repository-level repair tasks, current methodologies exhibit two notable limitations: (1) they often address issues in isolation, neglecting to incorporate insights from previously resolved issues, and (2) they rely on static and rigid prompting strategies, which constrain their ability to generalize across diverse and evolving issue scenarios. Inspired by the dual memory systems of human cognition, where episodic and semantic memories work synergistically to support human reasoning and decision-making, we propose ExpeRepair, a novel LLM-based approach that continuously learns from historical repair experiences through dual-channel knowledge accumulation. ExpeRepair organizes historical repair experiences into two complementary memories: an episodic memory that stores concrete repair demonstrations, and a semantic memory that encodes abstract reflective insights. At inference time, ExpeRepair activates both memory systems by retrieving relevant demonstrations from episodic memory and recalling high-level repair insights from semantic memory. It further enhances adaptability through dynamic prompt composition, synergistically integrating both memory types to replace static prompts with context-aware, experience-driven prompts. Experiments on the SWE-bench Lite benchmark demonstrate that ExpeRepair achieves a pass@1 score of 49.3% with Claude 3.7 Sonnet, outperforming all state-of-the-art open-source methods.

软件工程和AI交汇处,自动修理软件问题仍然是软件工程和AI系统面临的一个根本挑战。尽管最近大语言模型的进展显示有可能完成存储处一级的修理任务,但目前的方法显示出两个显著的局限性:(1) 它们往往孤立地处理问题,忽视了对以前解决的问题的见解;(2) 它们依赖静态和僵硬的催化战略,这限制了它们推广不同和不断变化的问题情景的能力。受人类认知的双重记忆系统的影响,在这种系统中,偶发和语义记忆协同配合地发挥作用,支持人类的推理和决策,我们提议采用ExpeRepair,这是一种基于LLM的新方法,通过双通道知识积累不断从历史修复经验中学习。ExpeRepair将历史修复经验组织成两个互补的记忆:一个储存具体修复演示的情景记忆,以及一个包含抽象反省洞察力的语义记忆。在推理时,ExpeRepair通过从情景记忆中检索相关演示并从语义记忆中调取高层次修复见解来激活两种记忆系统。它还通过动态提示组合进一步增强适应性,将两类记忆协同整合,以情境感知、经验驱动的提示取代静态提示。在SWE-bench Lite基准上的实验表明,ExpeRepair使用Claude 3.7 Sonnet取得了49.3%的pass@1得分,优于所有最先进的开源方法。

Article 15

Title@2025-06-12 (4): Leveraging Network Methods for Hub-like Microservice Detection

Title: Leveraging Network Methods for Hub-like Microservice Detection

Nutzung von Netzwerkmethoden für die hubähnliche Microservice-Erkennung

利用网络方法进行中枢式微服务探测 2506.07683v2

Authors (4): Alexander Bakhtin, Matteo Esposito, Valentina Lenarduzzi, Davide Taibi

Context: Microservice Architecture is a popular architectural paradigm that facilitates flexibility by decomposing applications into small, independently deployable services. Catalogs of architectural anti-patterns have been proposed to highlight the negative aspects of flawed microservice design. In particular, the Hub-like anti-pattern lacks an unambiguous definition and detection method. Aim: In this work, we aim to find a robust detection approach for the Hub-like microservice anti-pattern that outputs a reasonable number of Hub-like candidates with high precision. Method: We leveraged a dataset of 25 microservice networks and several network hub detection techniques to identify the Hub-like anti-pattern, namely scale-free property, centrality metrics and clustering coefficient, minimum description length principle, and the approach behind the Arcan tool. Results and Conclusion: Our findings revealed that the studied architectural networks are not scale-free, that most considered hub detection approaches do not agree on the detected hubs, and that the method by Kirkley leveraging the Erdos-Renyi encoding is the most accurate one in terms of the number of detected hubs and the detection precision. Investigating further the applicability of these methods to detecting Hub-like components in microservice-based and other systems opens up new research directions. Moreover, our results provide an evaluation of the approach utilized by the widely used Arcan tool and highlight the potential to update the tool to use the normalized degree centrality of a component in the network, or for the approach based on ER encoding to be adopted instead.

微观服务架构是一个受欢迎的建筑范式,它通过将应用程序分解成可独立部署的小型服务,为灵活性提供了便利,将应用软件分解成可独立部署的小型服务; 提出了建筑反模式目录,以突出有缺陷的微观服务设计中的消极方面; 特别是,类似枢纽的反模式缺乏一个明确的定义和检测方法。目标: 在这项工作中,我们的目标是为中枢的微观服务反模式找到一种强有力的检测方法,该方法输出出数量合理的类似枢纽的高精度候选人; 方法:我们利用了25个微观服务网络和若干网络枢纽检测技术的数据集,以确定中枢式反模式,即无规模财产、核心指标和集群系数、最低描述长度原则以及Arcan工具背后的方法。结果和结论:我们的调查结果表明,所研究的建筑网络并非无规模的,大多数被视为中心检测方法并不同意所检测的中心中心点,而Kirkley利用Erd-Renyi编码的方法,是所检测的枢纽中心点数目和其他检测精确度方面最准确的方法。进一步调查这些方法的适用性,用以查明以核心核心研究方向,在使用的工具中采用新的核心工具中,为工具,提供基于核心评估工具的工具更新。
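The normalized degree centrality mentioned at the end of the abstract above can be computed with a short sketch. The abstract gives no implementation details, so treating service calls as undirected edges, ignoring self-calls, and the service names used below are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def normalized_degree_centrality(edges: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Degree of each service divided by (n - 1), where n is the number of services.

    Assumption: calls are treated as undirected links and self-calls are ignored.
    """
    neighbors: Dict[str, Set[str]] = defaultdict(set)
    for caller, callee in edges:
        # Touch both entries so every service is counted, even one that only calls itself.
        neighbors[caller]
        neighbors[callee]
        if caller != callee:
            neighbors[caller].add(callee)
            neighbors[callee].add(caller)
    n = len(neighbors)
    if n < 2:
        return {service: 0.0 for service in neighbors}
    return {service: len(linked) / (n - 1) for service, linked in neighbors.items()}
```

A service whose score is close to 1.0 is connected to nearly every other service, which is the intuition behind flagging it as a Hub-like candidate; any cut-off value would be a separate design decision that the abstract does not specify.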

Article 16

Title@2025-06-12 (4): Towards Understanding Bugs in Distributed Training and Inference Frameworks for Large Language Models

Title: Towards Understanding Bugs in Distributed Training and Inference Frameworks for Large Language Models

Auf dem Weg zum Verständnis von Fehlern in verteilten Schulungs- und Schlussfolgerungsrahmen für große Sprachmodelle

努力了解大语言模式分布式培训和推断框架中的错误 2506.10426v1

Authors (6): Xiao Yu, Haoxuan Chen, Feifei Niu, Xing Hu, Jacky Wai Keung, Xin Xia

With the rapid development of large language models (LLMs), distributed training and inference frameworks like DeepSpeed have become essential for scaling model training and inference across multiple GPUs or nodes. However, the increasing complexity of these frameworks brings non-trivial software bugs, which may degrade training performance, cause unexpected failures, and result in significant resource waste. Understanding framework bugs’ characteristics is fundamental for quality assurance, allowing the design of more effective debugging and repair methods. Thus, our paper conducts the first large-scale empirical analysis of 308 fixed bugs across three popular distributed training/inference frameworks: DeepSpeed, Megatron-LM, and Colossal-AI. We examine bug symptoms, root causes, bug identification and fixing efforts, and common low-effort fixing strategies. Additionally, the distributed nature of these frameworks introduces unique bug root causes, such as allocation strategy error and distributed communication error. Diagnosing and fixing complex bugs remains challenging due to factors like the disconnect between symptoms and root causes, high bug reproduction costs, and low-level or cross-component interactions. Interestingly, we observe that 48% of bug fixes require minimal code changes (<=10 LOC) and follow simple strategies such as conditional logic optimization, parameter handling enhancement, or version compatibility handling, indicating potential for automation. Based on these insights, we offer several implications for improving the reliability of both distributed training and inference frameworks and their dependent LLM projects, while also identifying opportunities to leverage LLM-based tools for automated debugging and repair.

随着大型语言模型(LLMs)的迅速发展,分散的培训和推理框架(如DeepSpeed)的迅速发展,分散的培训和推理框架对于在多个GPU或节点之间扩大模型培训和推算至关重要。然而,这些框架的日益复杂性带来了非三重软件错误,可能降低培训业绩,造成意外失败,并导致大量资源浪费。理解框架错误的特性对于质量保证至关重要,可以设计更有效的调试和维修方法。因此,我们的文件对三个广受欢迎的培训/推理框架(深Speed、Megacatron-LM和Colossal-AI)中的308个固定错误进行了首次大规模的经验分析。我们研究了错误症状、根源、错误识别和修正努力以及常见的低效率修正战略。此外,这些框架的分布性质带来了独特的错误根源,例如分配战略错误错误和分布错误分布,使得能够设计更有效的调控和修理方法。因此,由于症状和根源工具脱节、高错误复制成本以及低层次或跨部分的交互作用等因素,我们考察了错误的症状症状症状症状症状症状症状、根源分析,我们观察了48%的逻辑的逻辑分析或逻辑分析模型的改进了基础。我们观察了其中的改进了其中的改进了48/逻辑的精确的逻辑的精确性研究。 ,我们观察了其中的改进了其中的改进了其中的改进了其中的改进了其中的逻辑的精确性,我们提供了基础,在改进了48%的精确性研究的改进了基础。

Article 17

Title@2025-06-12 (4): Centrality Change Proneness: an Early Indicator of Microservice Architectural Degradation

Title: Centrality Change Proneness: an Early Indicator of Microservice Architectural Degradation

Zentralitätsänderung Proneness: ein Frühindikator für die architektonische Degradation von Mikroservices

中心变化中心变化前景:微观服务建筑退化早期指标 2506.07690v2

Authors (4): Alexander Bakhtin, Matteo Esposito, Valentina Lenarduzzi, Davide Taibi

Over the past decade, the wide adoption of Microservice Architecture has required the identification of various patterns and anti-patterns to prevent Microservice Architectural Degradation. Frequently, the systems are modelled as a network of connected services. Recently, the study of temporal networks has emerged as a way to describe and analyze evolving networks. Previous research has explored how software metrics such as size, complexity, and quality are related to microservice centrality in the architectural network. This study investigates whether temporal centrality metrics can provide insight into the early detection of architectural degradation by correlating or affecting software metrics. We reconstructed the architecture of 7 releases of an OSS microservice project with 42 services. For every service in every release, we computed the software and centrality metrics. From one of the latter, we derived a new metric, Centrality Change Proneness. We then explored the correlation between the metrics. We identified 7 size and 5 complexity metrics that have a consistent correlation with centrality, while Centrality Change Proneness did not affect the software metrics, thus providing yet another perspective and an early indicator of microservice architectural degradation.

在过去的十年中,广泛采用微服务结构要求查明防止微服务结构退化的各种模式和反模式,通常,这些系统以联网服务网络为模型,最近出现了对时间网络的研究,作为描述和分析不断演变的网络的一种方式。以前的研究探讨了诸如规模、复杂性和质量等软件衡量标准如何与建筑网络中的微观服务中心点有关。这项研究调查了时间中心点衡量标准能否通过关联或影响软件衡量标准,为早期发现建筑退化提供洞察力。我们重建了一个有42项服务的OSS微观服务项目的7种释放结构。我们计算了每期服务中的每期的软件和中心点衡量标准。我们从后者中得出了一个新的衡量标准,即中心点变化光度。我们随后探讨了这些衡量标准之间的相互关系。我们确定了7个大小和5个复杂的衡量标准与中心点密切相关,而中心点变化指标并没有影响软件衡量标准,从而提供了另一个视角和微观服务结构退化的早期指标。

Article 18

Title@2025-06-12 (4): Bug Classification in Quantum Software: A Rule-Based Framework and Its Evaluation

Title: Bug Classification in Quantum Software: A Rule-Based Framework and Its Evaluation

Fehlerklassifizierung in der Quantensoftware: Ein regelbasiertes Framework und seine Bewertung

量子软件中的臭虫分类:基于规则的框架及其评价 2506.10397v1

Authors (2): Mir Mohammad Yousuf, Shabir Ahmad Sofi

Accurate classification of software bugs is essential for improving software quality. This paper presents a rule-based automated framework for classifying issues in quantum software repositories by bug type, category, severity, and impacted quality attributes, with additional focus on quantum-specific bug types. The framework applies keyword and heuristic-based techniques tailored to quantum computing. To assess its reliability, we manually classified a stratified sample of 4,984 issues from a dataset of 12,910 issues across 36 Qiskit repositories. Automated classifications were compared with ground truth using accuracy, precision, recall, and F1-score. The framework achieved up to 85.21% accuracy, with F1-scores ranging from 0.7075 (severity) to 0.8393 (quality attribute). Statistical validation via paired t-tests and Cohen’s Kappa showed substantial to almost perfect agreement for bug type (k = 0.696), category (k = 0.826), quality attribute (k = 0.818), and quantum-specific bug type (k = 0.712). Severity classification showed slight agreement (k = 0.162), suggesting room for improvement. Large-scale analysis revealed that classical bugs dominate (67.2%), with quantum-specific bugs at 27.3%. Frequent bug categories included compatibility, functional, and quantum-specific defects, while usability, maintainability, and interoperability were the most impacted quality attributes. Most issues (93.7%) were low severity; only 4.3% were critical. A detailed review of 1,550 quantum-specific bugs showed that over half involved quantum circuit-level problems, followed by gate errors and hardware-related issues.

精确的软件错误分类对于提高软件质量至关重要。本文提供了一个基于规则的自动自动框架, 用于按错误类型、类别、严重程度和受影响的质量属性对量子软件库中的问题进行分类, 并额外侧重于量子型错误类型。框架应用了关键词和基于脂质的量子计算技术。为了评估其可靠性, 我们手工从36 Qiskit 仓库的12 910个数据集中分类了4 984个问题。自动分类用准确性、精确性、回溯性和 F1 核心来比较了基于规则的自动框架。框架达到了85. 21% 的准确性, F1 核心从 0. 775 (多样性) 到 0. 8 393( 质量属性)。通过配对式测试和 Cohen kappa 的统计验证非常接近完美的协议类型( k = 0.696)、类别( k = 0. 8266)、质量属性( k=0. 818), 质量属性( k = 0. 0. 8018 ) 质量分类显示轻微的准确性定义( k= 0. 0.162) 准确性分类准确性质量等级, 质量等级分析( ) A 和直径级( ) 显示1.3) 质量等级分析。和直径级分析( 质量问题为1 质量问题为1 级) 级分析( 级) 和直径比( 级) 。。。级分析( 0.16级) 和直径级级分析。
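The agreement figures in the abstract above are Cohen's kappa values. As a minimal sketch of how such a value is obtained from paired manual and automated labels, the function below applies the standard formula (observed agreement minus chance agreement, divided by one minus chance agreement). The band names follow the conventional Landis and Koch scale, which is consistent with how the abstract describes its values. The function names and sample labels are illustrative.

```python
from collections import Counter
from typing import List, Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two equal-length label sequences."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    # Chance agreement from the two raters' marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


def agreement_band(kappa: float) -> str:
    """Conventional Landis and Koch interpretation of a kappa value."""
    if kappa < 0:
        return "poor"
    for upper, name in [(0.20, "slight"), (0.40, "fair"),
                        (0.60, "moderate"), (0.80, "substantial")]:
        if kappa <= upper:
            return name
    return "almost perfect"
```

Applied to the reported values, `agreement_band(0.696)` gives "substantial", `agreement_band(0.826)` gives "almost perfect", and `agreement_band(0.162)` gives "slight", matching the wording of the abstract.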

Article 19

Title@2025-06-12 (4): MLLM-Based UI2Code Automation Guided by UI Layout Information

Title: MLLM-Based UI2Code Automation Guided by UI Layout Information

MLLM-based UI2Code Automation Geführt von UI Layout Informationen

MLLM-基于 MLLM 的 UI2Cde 用户界面布局信息引导的自动化 2506.10376v1

Authors (5): Fan Wu, Cuiyun Gao, Shuqing Li, Xin-Cheng Wen, Qing Liao

Converting user interfaces into code (UI2Code) is a crucial step in website development, but it is time-consuming and labor-intensive. Automating UI2Code is essential to streamline this task and improve development efficiency. Deep learning-based methods exist for the task; however, they rely heavily on large amounts of labeled training data and struggle to generalize to real-world, unseen web page designs. The advent of Multimodal Large Language Models (MLLMs) offers potential for alleviating the issue, but MLLMs struggle to comprehend the complex layouts in UIs and to generate accurate code that preserves the layout. To address these issues, we propose LayoutCoder, a novel MLLM-based framework that generates UI code from real-world webpage images and includes three key modules: (1) Element Relation Construction, which aims at capturing UI layout by identifying and grouping components with similar structures; (2) UI Layout Parsing, which aims at generating UI layout trees for guiding the subsequent code generation process; and (3) Layout-Guided Code Fusion, which aims at producing accurate code that preserves the layout. For evaluation, in addition to the popular dataset Design2Code, we build a new benchmark dataset named Snap2Code, which comprises 350 real-world websites divided into seen and unseen parts to mitigate data leakage. Extensive evaluation shows the superior performance of LayoutCoder over state-of-the-art approaches. Compared with the best-performing baseline, LayoutCoder improves the BLEU score by 10.14% and the CLIP score by 3.95% on average across all datasets.

将用户界面转换为代码( UI2Code ) 是网站开发中的一个关键步骤, 它耗时费力。 UI2Code 的自动化对于简化这项工作至关重要, 有利于提高开发效率。存在基于深层次学习的任务方法; 但是, 他们严重依赖大量标签的培训数据, 并努力将通用数据转换为真实世界, 看不见的网页设计。多式大语言模型(MLLM) 的出现为缓解这一问题提供了潜力, 但很难理解UIS 的复杂布局, 并生成与布局保存的准确代码。为了解决这些问题, 我们提议了布局 Coder 的自动化代码, 一个新的基于 MLLLM 的框架, 生成了来自真实世界网页图像的 UIS 代码; 然而, 元素Relationlational Construction , 目的是通过识别和分组结构相似的组件获取UIFIB的布局; (2) UIB布局剖面图, 旨在生成UIS布局布局树来指导随后的代码生成过程; (3) 状态Guide Code Code DFSIion, , 目的是要制作精确代码, 的代码比C 更精确的代码升级的代码, 的代码, 升级的版值比C 升级的版值比C 升级的流程, 升级的流程, 升级的流程要显示BDrealbrealde d d d d d d d d drealevationalevationalview数据, 保存所有C 。

Article 20

Title@2025-06-12 (4): AutoGEEval++: A Multi-Level and Multi-Geospatial-Modality Automated Evaluation Framework for Large Language Models in Geospatial Code Generation on Google Earth Engine

Title: AutoGEEval++: A Multi-Level and Multi-Geospatial-Modality Automated Evaluation Framework for Large Language Models in Geospatial Code Generation on Google Earth Engine

AutoGEEval++: Ein Multi-Level- und Multi-Geospatial-Modulity-Automated Evaluation Framework für große Sprachmodelle in der Geospatial Code Generation auf Google Earth Engine

AutoGEEval++:谷歌地球引擎地理空间代码生成中大语言模型多层次和多地球空间-模式自动评价框架 2506.10365v1

Authors (13): Shuyang Hou, Zhangxiao Shen, Huayi Wu, Haoyue Jiao, Ziqi Liu, Lutong Xie, Chang Liu, Jianyuan Liang, Yaxian Qing, Xiaopu Zhang, Dehua Peng, Zhipeng Gui, Xuefeng Guan

Geospatial code generation is becoming a key frontier in integrating artificial intelligence with geo-scientific analysis, yet standardised automated evaluation tools for this task remain absent. This study presents AutoGEEval++, an enhanced framework building on AutoGEEval, and the first automated assessment system for large language models (LLMs) generating geospatial code on Google Earth Engine (GEE). It supports diverse data modalities and varying task complexities. Built on the GEE Python API, AutoGEEval++ features a benchmark dataset, AutoGEEval++-Bench, with 6,365 test cases across 26 data types and three task categories: unit, combo, and theme tests. It includes a submission programme and a judge module to realise an end-to-end automated evaluation pipeline from code generation to execution-based validation. The framework adopts multi-dimensional metrics (accuracy, resource usage, run-time efficiency, and error types), balancing hallucination control and efficiency and enabling boundary testing and error pattern analysis. Using AutoGEEval++, we evaluate 24 state-of-the-art LLMs (as of June 2025), including general-purpose, reasoning-enhanced, code-centric, and geoscience-specific models. Results reveal clear performance, stability, and error differences across task types, model designs, and deployment settings, confirming AutoGEEval++’s practical value and scalability in vertical-domain code generation. This work establishes the first standardised evaluation protocol and foundational benchmark for GEE-based LLM code generation, providing a unified basis for performance comparison and a methodological framework for systematic, domain-specific code evaluation.

地理空间代码的生成正在成为将人工智能与地球科学分析相结合的关键前沿,但这一任务仍缺乏标准化的自动化评价工具。本研究报告介绍了AutoGeevval++,AutoGEEval的强化框架,AutoGeeEval的强化框架+,以及首个大型语言模型自动评估系统(LLLMs)生成Google地球引擎的地理空间代码。它支持多种数据模式和不同的任务复杂性。在GEE Python API上建起了一个基准数据集-AutoGEE++++-Bench ,在26个数据类型和3个任务类别(单位、组合和主题测试)中有6 365个测试案例。它包括一个提交程序和法官模块,以实现从代码生成到基于执行验证的大语言模型的终端至终端自动评价管道。该框架采用了多维度度度-准确性、资源使用、运行时间效率、错差类型平衡控制和效率,以及边界测试和错误模式分析。使用AutoGEVAL++,我们评估了24个水平的LMS(截至2025的) 直径评估模型,在常规、常规具体成本成本、成本成本成本、成本、成本和成本评估模型中,确定和清晰度的生成的生成的生成的生成的生成的生成模型中,在确定和清晰度的生成的生成的生成的计算和逻辑和逻辑模型中确定。

Article 21

Title@2025-06-12 (4): CodeTool: Enhancing Programmatic Tool Invocation of LLMs via Process Supervision

Title: CodeTool: Enhancing Programmatic Tool Invocation of LLMs via Process Supervision

CodeTool: Programmatisches Tool verbessern Anrufung von LLMs durch Prozessüberwachung

守则工具:加强通过程序监督援引LLMs的程序工具 2503.20840v2

Authors (9): Yifei Lu, Fanghua Ye, Jian Li, Qiang Gao, Cheng Liu, Haibo Luo, Nan Du, Xiaolong Li, Feiliang Ren

Tool invocation significantly enhances the capabilities of Large Language Models (LLMs), yet challenges persist, particularly in complex task scenarios. Current methods, such as instruction-enhanced reasoning and supervised fine-tuning, often result in unnecessarily long reasoning paths and face difficulties in verifying the correctness of intermediate steps. In this paper, we propose CodeTool, a novel framework for stepwise code generation that improves LLM tool invocation by leveraging the concise and easily verifiable nature of code. CodeTool incorporates two distinct process rewards: the On-the-spot Reward, which provides immediate feedback on the accuracy of each tool invocation, and the Latent Reward, which assesses the contribution of each step toward overall task completion. By maximizing the cumulative reward of the On-the-spot and Latent Rewards at each step, LLMs are guided to follow efficient and accurate reasoning paths. Extensive experiments on StableToolBench and RestBench-TMDB demonstrate the superiority of CodeTool over existing approaches.

使用工具大大增强了大语言模型(LLMs)的能力,但挑战依然存在,特别是在复杂的任务情景下。目前的方法,例如强化教学的推理和监督下的微调,往往导致不必要的冗长推理路径,在核实中间步骤的正确性方面遇到困难。在本文件中,我们提议CodeTool,这是一个用于逐步生成代码的新框架,它通过利用简明和易于核查的代码性质,改进LLM工具的引用。CodeTool包含两个不同的过程奖励:现场奖励,它提供对每个工具的准确性的直接反馈,以及延迟奖励,它评估每个步骤对完成全部任务的贡献。通过最大限度地增加现场和晚期的奖励,LOMs将遵循高效和准确的推理路径。在StagetToolnch和SrestBench-TMDB进行的广泛实验显示Cool优于现有方法。
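The step-selection idea in the abstract above (choose, at each step, the candidate that maximizes the combined On-the-spot and Latent Rewards) can be sketched as follows. The data structure, the unweighted sum, and the example scores are assumptions for illustration; the abstract does not say how the two rewards are computed or combined.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class CandidateStep:
    code: str            # generated code for one tool-invocation step
    on_the_spot: float   # assumed score: immediate correctness of the invocation
    latent: float        # assumed score: estimated contribution to task completion


def cumulative_reward(step: CandidateStep) -> float:
    """Assumed combination rule: a plain unweighted sum of the two rewards."""
    return step.on_the_spot + step.latent


def select_step(candidates: List[CandidateStep]) -> CandidateStep:
    """Pick the candidate with the highest combined reward."""
    if not candidates:
        raise ValueError("at least one candidate step is required")
    return max(candidates, key=cumulative_reward)
```

With candidates scored (1.0, 0.2), (1.0, 0.7), and (0.0, 0.9), the second is selected: it executes correctly and is also estimated to move the task forward.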

Article 22

Title@2025-06-12 (4): Augmenting Large Language Models with Static Code Analysis for Automated Code Quality Improvements

Title: Augmenting Large Language Models with Static Code Analysis for Automated Code Quality Improvements

Erweiterung großer Sprachmodelle mit statischer Codeanalyse für automatisierte Codequalitätsverbesserungen

增强大语言模式,采用静态代码分析法分析,提高自动代码质量 2506.10330v1

Authors (2): Seyed Moein Abtahi, Akramul Azim

This study examined code issue detection and revision automation by integrating Large Language Models (LLMs) such as OpenAI’s GPT-3.5 Turbo and GPT-4o into software development workflows. A static code analysis framework detected issues such as bugs, vulnerabilities, and code smells within a large-scale software project. Detailed information on each issue was extracted and organized to facilitate automated code revision using LLMs. An iterative prompt engineering process was applied to ensure that prompts were structured to produce accurate and organized outputs aligned with the project requirements. Retrieval-augmented generation (RAG) was implemented to enhance the relevance and precision of the revisions, enabling the LLM to access and integrate real-time external knowledge. The issue of LLM hallucinations, where the model generates plausible but incorrect outputs, was addressed by a custom-built “Code Comparison App,” which identifies and corrects erroneous changes before applying them to the codebase. Subsequent scans using the static code analysis framework revealed a significant reduction in code issues, demonstrating the effectiveness of combining LLMs, static analysis, and RAG to improve code quality, streamline the software development process, and reduce time and resource expenditure.

这项研究通过将OpenAI的GPT-3.5 Turbo和GPT-4o等大语言模型(LLMs)纳入软件开发工作流程,对代码问题的检测和修订自动化进行了研究。静态代码分析框架在大型软件项目中检测了错误、弱点和代码的气味等问题。每个问题的详细信息被提取并组织起来,以便利使用LLMs进行自动代码修改。采用迭代快速工程程序,确保根据项目要求构建准确和有组织的产出。实施回溯式生成(RAG),以提高修订的相关性和准确性,使LLM能够访问和整合实时外部知识。LM幻觉问题(该模型产生合理但不正确的产出)由定制的“Code比较应用程序”解决,在将其应用到代码库之前查明并纠正错误的变化。随后使用静态代码分析框架进行的扫描显示代码问题显著减少,表明将LMS、静态分析以及RAG提高代码质量、简化软件开发过程并减少时间和资源支出的有效性。

Article 23

Title@2025-06-12 (4): ELFuzz: Efficient Input Generation via LLM-driven Synthesis Over Fuzzer Space

Title: ELFuzz: Efficient Input Generation via LLM-driven Synthesis Over Fuzzer Space

ELFuzz: Effiziente Input-Generierung über LLM-gesteuerte Synthese über Fuzzer-Raum

ELFuzz:通过LLM驱动的模糊空间综合合成有效投入生成 2506.10323v1

Authors (3): Chuyang Chen, Brendan Dolan-Gavitt, Zhiqiang Lin

Generation-based fuzzing produces appropriate testing cases according to specifications of input grammars and semantic constraints to test systems and software. However, these specifications require significant manual efforts to construct. This paper proposes a new approach, ELFuzz (Evolution Through Large Language Models for Fuzzing), that automatically synthesizes generation-based fuzzers tailored to a system under test (SUT) via LLM-driven synthesis over fuzzer space. At a high level, it starts with minimal seed fuzzers and propels the synthesis by fully automated LLM-driven evolution with coverage guidance. Compared to previous approaches, ELFuzz can 1) seamlessly scale to SUTs of real-world sizes – up to 1,791,104 lines of code in our evaluation – and 2) synthesize efficient fuzzers that catch interesting grammatical structures and semantic constraints in a human-understandable way. Our evaluation compared ELFuzz with specifications manually written by domain experts and synthesized by state-of-the-art approaches. It shows that ELFuzz achieves up to 434.8% more coverage and triggers up to 174.0% more artificially injected bugs. We also used ELFuzz to conduct a real-world fuzzing campaign on the newest version of cvc5 for 14 days, and encouragingly, it found five 0-day bugs (three are exploitable). Moreover, we conducted an ablation study, which shows that the fuzzer space model, the key component of ELFuzz, contributes the most (up to 62.5%) to the effectiveness of ELFuzz. Further analysis of the fuzzers synthesized by ELFuzz confirms that they catch interesting grammatical structures and semantic constraints in a human-understandable way. The results present the promising potential of ELFuzz for more automated, efficient, and extensible input generation for fuzzing.

以下一代为基础的模糊性根据输入语法规范的规格和测试系统和软件的语义限制生成适当的测试案例。然而, 这些规格需要大量手工构建。本文提出一种新的方法, 即 ELFuzz (通过大语言模型演进以模糊化) , 通过 LLM 驱动合成法在模糊空间上自动合成一个正在测试的系统( SUT) , 自动合成基于生成的模糊性。在高水平上, 它从最小种子模糊性开始, 通过完全自动化的LLLLM驱动的演进和覆盖性指导来推进合成。与以前的方法相比, ELUFuzz (ELFZ) 能够完美地推广到真实世界规模的SUT, 高达1,791,104条代码的代码, 以及2) 合成高效的模糊性引信, 通过LLMM 的合成法, 我们用域专家手动的模糊性模型和状态的合成方法, 显示ELFUZ( 5) 可以实现434.8 % 的覆盖, 并触发真实的 ELULF 版本的 ELF 。我们用LF 的 RULF 的 RLF , 也用新的版本, 发现, 21 方向的 RULULF 。

Article 24

Title@2025-06-12 (4): Minimizing False Positives in Static Bug Detection via LLM-Enhanced Path Feasibility Analysis

Title: Minimizing False Positives in Static Bug Detection via LLM-Enhanced Path Feasibility Analysis

Minimierung falscher Positive bei statischer Fehlererkennung über LLM-verbesserte Pfad-Feasibility-Analyse

通过LLM-强化路径可行性分析尽量减少静态虫检测中的假正数 2506.10322v1

Authors (9): Xueying Du, Kai Yu, Chong Wang, Yi Zou, Wentai Deng, Zuoyu Ou, Xin Peng, Lingming Zhang, Yiling Lou

Static bug analyzers play a crucial role in ensuring software quality. However, existing analyzers for bug detection in large codebases often suffer from high false positive rates. This is primarily due to the limited capabilities of analyzers in path feasibility validation with multiple conditional branches and complex data dependencies. While current LLM-based approaches attempt to address this issue, their effectiveness remains limited due to insufficient constraint cascade analysis and scalability challenges in large projects. To address this challenge, we propose an iterative path feasibility analysis framework, LLM4PFA. By leveraging LLM agent-based targeted constraint reasoning and key context-aware analysis driven by agent planning, LLM4PFA effectively enhances complex inter-procedural path feasibility analysis for minimizing false positives in static bug detection. Evaluation results show that LLM4PFA precisely filters out 72% to 96% of the false positives reported during static bug detection, outperforming all the baselines by 41.1% to 105.7%; meanwhile, LLM4PFA misses only 3 real bugs out of 45 true positives.

固态虫分析器在确保软件质量方面发挥着关键作用。但是,大型代码库中用于检测错误的现有分析器往往存在高假正率。这主要是因为分析器在路径可行性验证方面能力有限,且有多个有条件分支和复杂的数据依赖性。目前的LLLM方法试图解决这一问题,但由于限制级联分析不足和大型项目的可缩放性挑战,其有效性仍然有限。为了应对这一挑战,我们提议了一个迭接路径可行性分析框架LLM4PFA。通过利用基于LLM代理器的定向约束推理和由代理商规划驱动的关键环境认知分析,LLM4PFA有效地加强了复杂的程序间路径可行性分析,以尽量减少静态错误检测中的假正数。评价结果显示LLM4PFA精确过滤了在静态错误检测中报告的72%至96%的假正数,大大超过所有基线的41.1%至105.7%的改进率。与此同时,LM4PFA只忽略了3个真正正数的真虫。
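To make the notion of path feasibility in the abstract above concrete, here is a minimal sketch. It is not LLM4PFA: it swaps in plain integer interval reasoning for the LLM-agent constraint reasoning, and the function name `path_feasible` and the constraint tuple format are assumptions made for illustration only. A static analyzer that reports a bug on a path whose branch conditions contradict each other produces a false positive; checking the conjunction of conditions filters it out.

```python
import math

def path_feasible(constraints):
    """Return True if the integer constraints collected along one path can all hold.

    `constraints` is a list of (variable, op, bound) with op in '<', '<=', '>', '>='.
    This is a hypothetical stand-in for real path-condition solving.
    """
    lo, hi = {}, {}
    for var, op, bound in constraints:
        if op == ">":
            lo[var] = max(lo.get(var, -math.inf), bound + 1)
        elif op == ">=":
            lo[var] = max(lo.get(var, -math.inf), bound)
        elif op == "<":
            hi[var] = min(hi.get(var, math.inf), bound - 1)
        elif op == "<=":
            hi[var] = min(hi.get(var, math.inf), bound)
        else:
            raise ValueError(f"unsupported operator: {op}")
    # The path is feasible only if every variable still has a non-empty range.
    return all(lo.get(v, -math.inf) <= hi.get(v, math.inf) for v in set(lo) | set(hi))

# A warning on a path that requires n > 10 and n < 5 is a false positive.
infeasible = path_feasible([("n", ">", 10), ("n", "<", 5)])
feasible = path_feasible([("n", ">=", 0), ("n", "<", 5)])
```

Real inter-procedural paths involve data dependencies that simple intervals cannot express, which is the gap the abstract says the LLM-based reasoning is meant to address.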

Article 25

Title@2025-06-12 (4): AI-Based Software Vulnerability Detection: A Systematic Literature Review

Title: AI-Based Software Vulnerability Detection: A Systematic Literature Review

KI-basierte Software-Verletzungserkennung: Ein systematischer Literaturbericht

基于AI的软件脆弱性检测:系统文献审查 2506.10280v1

Authors (3): Samiha Shimmi, Hamed Okhravi, Mona Rahimi

Software vulnerabilities in source code pose serious cybersecurity risks, prompting a shift from traditional detection methods (e.g., static analysis, rule-based matching) to AI-driven approaches. This study presents a systematic review of software vulnerability detection (SVD) research from 2018 to 2023, offering a comprehensive taxonomy of techniques, feature representations, and embedding methods. Our analysis reveals that 91% of studies use AI-based methods, with graph-based models being the most prevalent. We identify key limitations, including dataset quality, reproducibility, and interpretability, and highlight emerging opportunities in underexplored techniques such as federated learning and quantum neural networks, providing a roadmap for future research.

源代码中的软件脆弱性带来严重的网络安全风险,促使从传统的检测方法(如静态分析、基于规则的匹配)转向AI驱动的方法。本研究系统地审查了2018年至2023年软件脆弱性检测(SVD)研究,对各种技术、特征表现和嵌入方法进行了全面的分类。我们的分析表明,91%的研究使用基于AI的方法,而基于图表的模型最为普遍。我们确定了关键的局限性,包括数据集质量、可复制性和可解释性,并突显了诸如联合学习和量子神经网络等探索不足的技术方面新出现的机遇,为今后的研究提供了路线图。

Article 26

Title@2025-06-11 (3): Playing in the Sandbox: A Study on the Usability of Seccomp

Title: Playing in the Sandbox: A Study on the Usability of Seccomp

Spielen in der Sandbox: Eine Studie über die Usability von Seccomp

沙箱游戏:关于可使用性的研究 2506.10234v1

Authors (2): Maysara Alhindi, Joseph Hallett

Sandboxing restricts what applications do, and prevents exploited processes being abused; yet relatively few applications get sandboxed: why? We report a usability trial with 7 experienced Seccomp developers exploring how they approached sandboxing an application and the difficulties they faced. The developers each approached sandboxing the application differently and each came to different solutions. We highlight many challenges of using Seccomp, the sandboxing designs by the participants, and what developers think would make it easier for them to sandbox applications effectively.

沙箱设计限制了应用程序的作用,防止了被滥用的开发过程;然而,相对而言,很少有应用程序得到沙箱处理:为什么?我们报告与7名有经验的Seccomp开发商进行了可用性试验,探讨他们如何使用沙箱处理一个应用程序,以及他们所面临的困难。开发商各自以不同的方式使用沙箱处理应用程序,而每个应用都提出了不同的解决方案。我们强调了使用Seccomp、参与者的沙箱设计以及开发商认为什么可以使他们更容易有效地应用沙箱。

Article 27

Title@2025-06-11 (3): Prompt Variability Effects On LLM Code Generation

Title: Prompt Variability Effects On LLM Code Generation

Veränderliche Auswirkungen auf die LLM-Code-Generierung

对LLM 代码生成的迅速易变性效应 2506.10204v1

Authors (5): Andrei Paleyes, Radzim Sendyka, Diana Robinson, Christian Cabrera, Neil D. Lawrence

Code generation is one of the most active areas of application of Large Language Models (LLMs). While LLMs lower barriers to writing code and accelerate the development process, the overall quality of generated programs depends on the quality of the given prompts. Specifically, the functionality and quality of generated code can be sensitive to the user’s background and familiarity with software development. It is therefore important to quantify an LLM’s sensitivity to variations in the input. To this end we propose a synthetic evaluation pipeline for code generation with LLMs, as well as a systematic persona-based evaluation approach to expose qualitative differences in LLM responses that depend on prospective user background. Both proposed methods are completely independent of specific programming tasks and LLMs, and thus are widely applicable. We provide experimental evidence illustrating the utility of our methods and share our code for the benefit of the community.

生成代码是应用大语言模型最活跃的领域之一。虽然LLMs公司降低写法码和加速发展进程的障碍,但生成程序的总体质量取决于给定提示的质量。具体地说,生成代码的功能和质量对用户的背景和软件开发的熟悉度十分敏感,因此,必须量化LLM公司对投入差异的敏感度。为此,我们提议为与LLMs公司生成代码建立一个合成评价管道,并采用基于人的系统评价方法,以揭示LLM公司根据潜在用户背景做出的反应的质量差异。两种拟议方法都完全独立于具体的方案编制任务和LMS公司,因此广泛适用。我们提供实验性证据,说明我们的方法的效用,并分享我们的代码,以造福社区。

Article 28

Title@2025-06-11 (3): New Fault Domains for Conformance Testing of Finite State Machines

Title: New Fault Domains for Conformance Testing of Finite State Machines

Neue Fehlerbereiche für die Konformitätsprüfung von Finite State Maschinen

限定国家机器符合性测试的新失密域域 2410.19405v2

Authors (2): Frits Vaandrager, Ivo Melse

A fault domain reflects a tester’s assumptions about faults that may occur in an implementation and that need to be detected during testing. A fault domain that has been widely studied in the literature on black-box conformance testing is the class of finite state machines (FSMs) with at most $m$ states. Numerous strategies for generating test suites have been proposed that guarantee fault coverage for this class. These so-called $m$-complete test suites grow exponentially in $m-n$, where $n$ is the number of states of the specification, so one can only run them for small values of $m-n$. But the assumption that $m-n$ is small is not realistic in practice. In his seminal paper from 1964, Hennie raised the challenge to design checking experiments in which the number of states may increase appreciably. In order to solve this long-standing open problem, we propose (much larger) fault domains that capture the assumption that all states in an implementation can be reached by first performing a sequence from some set $A$ (typically a state cover for the specification), followed by $k$ arbitrary inputs, for some small $k$. The number of states of FSMs in these fault domains grows exponentially in $k$. We present a sufficient condition for $k$-$A$-completeness of test suites with respect to these fault domains. Our condition implies $k$-$A$-completeness of two prominent $m$-complete test suite generation strategies, the Wp and HSI methods. Thus these strategies are complete for much larger fault domains than those for which they were originally designed, and thereby solve Hennie’s challenge. We show that three other prominent $m$-complete methods (H, SPY and SPYH) do not always generate $k$-$A$-complete test suites.

错误域反映了一个测试者对执行过程中可能发生的错误的假设, 测试期间需要检测到。一个在黑箱合规测试文献中广泛研究的错误域, 黑箱合规测试文献中研究的错误域, 是限量国家机器(FSMs)的类别, 最多为$美元。已经提出了许多建立测试套件的战略, 保证了这一类的缺陷的覆盖。这些所谓的美元- 完整的测试套件以百万美元成倍增长, 其中一美元是明确的州数, 所以只能用小值美元运行。但是, 在实践中, 美元- 美元是小值国家的错误域。在1964年的原始论文中, 亨尼提出了设计测试实验的挑战, 其中国家数量可能会显著增加。为了解决这一长期存在的问题, 我们提出了( 更大的) 错误域, 它们的实施方式是首先完成一个由美元- 美元( 通常是用于规格的州) 的序列。美元- 美元- 以美元美元的货币为本域, 以美元的货币为本的货币为美元。明显货币标准, 我们的货币- 测试中, 这些货币- 的货币- 的货币的货币测试中的货币- 的计算方法将产生一个非常值- 。
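The fault domain described in the abstract above can be illustrated with a small sketch. It assumes a deterministic, complete FSM given as a dictionary of transitions, and it reads "followed by $k$ arbitrary inputs" as "at most $k$ inputs"; the names `run`, `states_covered`, and `in_fault_domain` and the toy machine are hypothetical.

```python
from itertools import product

def run(transitions, state, word):
    """Follow `word` from `state`; `transitions` maps (state, input) -> next state."""
    for symbol in word:
        state = transitions[(state, symbol)]
    return state

def states_covered(transitions, initial, cover, inputs, k):
    """States reached by a sequence from `cover` followed by at most k arbitrary inputs."""
    reached = set()
    for prefix in cover:
        base = run(transitions, initial, prefix)
        for length in range(k + 1):
            for suffix in product(inputs, repeat=length):
                reached.add(run(transitions, base, suffix))
    return reached

def in_fault_domain(transitions, initial, cover, inputs, k):
    """True if every state of the implementation is covered in the above sense."""
    all_states = {s for (s, _) in transitions} | set(transitions.values())
    return states_covered(transitions, initial, cover, inputs, k) == all_states

# Toy 4-state implementation over inputs 'a' and 'b' (hypothetical).
impl = {
    (0, "a"): 1, (0, "b"): 0,
    (1, "a"): 2, (1, "b"): 0,
    (2, "a"): 3, (2, "b"): 0,
    (3, "a"): 3, (3, "b"): 0,
}
cover = [(), ("a",)]  # the set A: reaches states 0 and 1

within_k1 = in_fault_domain(impl, 0, cover, "ab", 1)  # state 3 is not reached
within_k2 = in_fault_domain(impl, 0, cover, "ab", 2)  # 'a' then 'aa' reaches state 3
```

The number of suffixes enumerated grows exponentially in $k$, but the membership condition itself places no bound on the number of extra states $m-n$.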

Article 29

Title@2025-06-11 (3): D-LiFT: Improving LLM-based Decompiler Backend via Code Quality-driven Fine-tuning

Title: D-LiFT: Improving LLM-based Decompiler Backend via Code Quality-driven Fine-tuning

D-LiFT: Verbesserung des LLM-basierten Decompiler-Backends über Code Quality-driven Fine-tuning

D-LiFT:通过《守则》质量驱动的微调改进基于LLM的解腐器后端 2506.10125v1

Authors (10): Muqi Zou, Hongyu Cai, Hongwei Wu, Zion Leonahenahe Basque, Arslan Khan, Berkay Celik, Dave Tian, Antonio Bianchi, Ruoyu Wang, Dongyan Xu

Decompilers, which reconstruct human-readable source code from binary executables, are vital to many security tasks. Yet, despite recent advances, their output often suffers from syntactic and semantic errors and remains difficult to read. Recently, with the advent of large language models (LLMs), researchers began to explore the potential of LLMs to refine decompiler output. Nevertheless, our study of these approaches reveals significant limitations, such as introducing new errors and relying on unreliable accuracy validation. In this paper, we present D-LiFT, an automated decompiler backend that harnesses and further trains LLMs to improve the quality of decompiled code via reinforcement learning (RL). Unlike prior work that overlooks preserving accuracy, D-LiFT adheres to a key principle for enhancing the quality of decompiled code: preserving accuracy while improving readability. Central to D-LiFT is D-SCORE, an integrated quality assessment system that scores decompiled code from multiple aspects. In line with our principle, D-SCORE assigns low scores to any inaccurate output and only awards higher scores for readability to code that passes the accuracy check. Specifically, D-SCORE first verifies syntactic and semantic correctness via the compiler and symbolic execution; only if a candidate is deemed accurate does it then evaluate readability, using established metrics to compare the LLM output with the original decompiled code. The score is then fed back to the LLM for fine-tuning. Our implementation, based on Ghidra and a range of LLMs, demonstrates significant improvements for the accurate decompiled code from the coreutils and util-linux projects. Compared to baseline LLMs without D-SCORE-driven fine-tuning, D-LiFT produces 55.3% more improved decompiled functions, as measured by D-SCORE.

解压缩器从二进制执行器中重建人读源代码,对于许多安全任务至关重要。然而,尽管最近有所进步,但是它们的输出往往有55个合成和语义错误,而且仍然难以读懂。最近,随着大型语言模型(LLMS)的出现,研究人员开始探索LLMS改进解压缩器输出的潜力。然而,我们对这些方法的研究揭示了巨大的局限性,例如引入新的错误和依赖不可靠的准确度验证。在本文中,我们提供了D-LiFT,一个自动解压缩器后端,通过强化学习(RL)来利用和进一步培训LLMS来提高已破译代码的质量。D-S-LiFT 与以往忽视保存准确度的工作不同,D-LIFT遵循了一个关键原则,提高解压缩码输出质量:\textitititit{p 准确度}。然后我们建议D-LiFT,一个从自动解析器到从多个方面获取解译代码的回调质量评估系统。根据我们的原则, D-S-S-COdealde-deal-deal-de realde realde dass realdeal deal dass deal deal deal deal deal deal deal deal deald ,如果它只只读读取了任何不精确到任何不精确的计算,如果它只读取的计算,那么的计算到任何硬度的计算,则只读取的计算。
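The accuracy-first gating that the abstract above attributes to D-SCORE can be sketched as follows. This is only an illustration of the ordering of the checks: the function name `gated_score`, the boolean inputs, and the numeric score values are assumptions, and the real system obtains the accuracy verdict from a compiler and symbolic execution rather than from flags passed in by the caller.

```python
def gated_score(compiles, semantically_equivalent, readability_gain):
    """Score a refined decompilation candidate, checking accuracy before readability.

    `readability_gain` is a hypothetical number comparing the candidate with the
    original decompiled code (positive means more readable).
    """
    if not compiles or not semantically_equivalent:
        # Inaccurate output gets a low score no matter how readable it looks.
        return -1.0
    # Only accurate candidates can earn credit for readability.
    return max(0.0, readability_gain)

broken_but_pretty = gated_score(compiles=True, semantically_equivalent=False, readability_gain=0.9)
accurate_and_clearer = gated_score(compiles=True, semantically_equivalent=True, readability_gain=0.4)
accurate_not_clearer = gated_score(compiles=True, semantically_equivalent=True, readability_gain=-0.2)
```

Used as a reinforcement-learning reward, a gate of this shape means a model cannot trade correctness for nicer-looking code.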

Article 30

Title@2025-06-11 (3): Expert-in-the-Loop Systems with Cross-Domain and In-Domain Few-Shot Learning for Software Vulnerability Detection

Title: Expert-in-the-Loop Systems with Cross-Domain and In-Domain Few-Shot Learning for Software Vulnerability Detection

Experten-in-the-Loop-Systeme mit Cross-Domain und In-Domain-Lernen für Software-Anfälligkeitserkennung

具有跨域和内域的 “ 软件脆弱性探测 “ 微热学习 2506.10104v1

Authors (6): David Farr, Kevin Talty, Alexandra Farr, John Stockdale, Iain Cruickshank, Jevin West

As cyber threats become more sophisticated, rapid and accurate vulnerability detection is essential for maintaining secure systems. This study explores the use of Large Language Models (LLMs) in software vulnerability assessment by simulating the identification of Python code with known Common Weakness Enumerations (CWEs), comparing zero-shot, few-shot cross-domain, and few-shot in-domain prompting strategies. Our results indicate that while zero-shot prompting performs poorly, few-shot prompting significantly enhances classification performance, particularly when integrated with confidence-based routing strategies that improve efficiency by directing human experts to cases where model uncertainty is high, optimizing the balance between automation and expert oversight. We find that LLMs can effectively generalize across vulnerability categories with minimal examples, suggesting their potential as scalable, adaptable cybersecurity tools in simulated environments. However, challenges such as model reliability, interpretability, and adversarial robustness remain critical areas for future research. By integrating AI-driven approaches with expert-in-the-loop (EITL) decision-making, this work highlights a pathway toward more efficient and responsive cybersecurity workflows. Our findings provide a foundation for deploying AI-assisted vulnerability detection systems in both real and simulated environments that enhance operational resilience while reducing the burden on human analysts.

随着网络威胁变得更加复杂,迅速和准确地发现脆弱性对于维护安全系统至关重要。本研究探索了在软件脆弱性评估中使用大语言模型(LLMs)的方法,其方法是模拟识别已知的常见弱度数字(CWES)的Python代码,比较零射、几发横跨域域和几发内侧推进战略。我们的结果显示,虽然零发提示效果效果不佳,但少发提示显著提高了分类性能,特别是当与基于信任的定线战略相结合,通过引导人类专家了解模型不确定性高、优化自动化与专家监督之间的平衡来提高效率时。我们发现,LLMs可以有效地将各种易受灾性类别都归纳为极少的例子,表明其在模拟环境中具有可缩放、适应性强的网络安全工具的潜力。然而,模型可靠性、可解释性和对抗性强健度等挑战仍然是未来研究的关键领域。通过将全动方法与当期专家决策相结合,这项工作突出了一种通往更高效和反应灵敏的网络安全动态流的路径。我们发现,LLLLLLLLLM能够有效地利用最高效和反应灵敏捷的操作环境,同时加强模拟脆弱性检测系统。我们的调查结果为模拟分析系统提供了一个基础。
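The confidence-based routing mentioned in the abstract above can be sketched in a few lines. The threshold value, the function name `route`, and the snippet identifiers are hypothetical; the abstract does not specify how confidence is computed, so it is taken here as a given number between 0 and 1.

```python
def route(predictions, threshold=0.8):
    """Split model predictions into automated decisions and an expert review queue.

    `predictions` is a list of (item_id, predicted_label, confidence) tuples.
    """
    automated, expert_queue = [], []
    for item_id, label, confidence in predictions:
        if confidence >= threshold:
            automated.append((item_id, label))      # model is trusted on this item
        else:
            expert_queue.append((item_id, label))   # uncertain: send to a human expert
    return automated, expert_queue

predictions = [
    ("snippet-1", "CWE-79", 0.95),
    ("snippet-2", "CWE-89", 0.55),
    ("snippet-3", "CWE-22", 0.81),
]
automated, expert_queue = route(predictions, threshold=0.8)
```

Raising the threshold sends more cases to experts and fewer to automation, which is the balance the abstract describes.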

Article 31

Title@2025-06-11 (3): Reward Models Enable Scalable Code Verification by Trading Accuracy for Throughput

Title: Reward Models Enable Scalable Code Verification by Trading Accuracy for Throughput

Reward-Modelle ermöglichen eine skalierbare Code-Überprüfung durch den Handel mit Genauigkeit für Durchsatz

通过交易准确性对交易流量的可缩放代码校验 2506.10056v1

Authors (4): Gabriel Orlanski, Nicholas Roberts, Aws Albarghouthi, Frederic Sala

The standard paradigm for solving coding tasks via large language models (LLMs) is to generate-then-rank programs, where the latter step uses a verifier in the ranking process. The growing consensus is that a comprehensive verifier (e.g., a full test suite) should be prioritized over an outcome reward model (ORM) whenever possible, with little consideration given to the trade-offs involved. We aim to challenge this assumption by systematically exploring the tradeoff between speed and accuracy. We find that ORMs play a crucial role in scaling verification through trading accuracy for speed, even when a comprehensive verifier is available. Their value becomes especially apparent when used in a generate-prune-then-rank approach, where a faster but less accurate verifier removes incorrect solutions prior to ranking – leading to a system that is 11.65x faster while only being 8.33% less accurate than the full test suite. We analyze the generate-prune-then-rank approach and show that it works by filtering out incorrect but highly ranked solutions. These findings enable the design of scalable and accurate program ranking systems.

通过大语言模型(LLMS)解决编码任务的标准范式是生成当值程序,让后一步在排名过程中使用核查员。日益形成的共识是,尽可能将综合核查员(如完整的测试套件)置于结果奖励模式(ORM)之上,而很少考虑到所涉及的权衡。我们的目标是通过系统地探索速度和准确性之间的权衡来挑战这一假设。我们发现ORM公司在通过快速交易精确度来扩大核查规模方面发挥着关键作用,即使有一个全面的核查员。在采用生产-生产-生产-当值-排位方法时,其价值尤其明显,在这种方法中,快速但不准确的核查员在排位前消除了不正确的解决方案 – – 导致一个系统比整个测试套件更快11.65x,而仅差8.33%的准确率。我们分析了生成-生产-当值-排位方法,并表明它通过筛选不正确但排位高的解决方案来发挥作用。这些发现,能够设计可扩展和准确的程序排位系统。
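The generate-prune-then-rank idea from the abstract above can be sketched as a two-stage filter. Which component plays which role is an assumption here: `fast_check` stands for a quick, less accurate verifier (a single test in this toy), and `rank_score` stands for whatever scores the survivors. All names and reward values are hypothetical.

```python
def generate_prune_then_rank(candidates, fast_check, rank_score):
    """Drop candidates that fail a cheap check, then rank the rest by score."""
    survivors = [c for c in candidates if fast_check(c)]
    # If the cheap check rejects everything, fall back to ranking all candidates.
    pool = survivors or list(candidates)
    return sorted(pool, key=rank_score, reverse=True)

# Toy task: "return twice the input". Candidate programs are keyed by name.
programs = {
    "add_self": lambda x: x + x,
    "times_two": lambda x: 2 * x,
    "square": lambda x: x * x,      # wrong, but scored highly below
    "plus_two": lambda x: x + 2,    # wrong
}

def fast_check(name):
    """One cheap test case; less thorough than a full test suite."""
    return programs[name](3) == 6

# Hypothetical scores; note the incorrect 'square' has the highest one.
reward = {"add_self": 0.7, "times_two": 0.9, "square": 0.95, "plus_two": 0.1}

ranked = generate_prune_then_rank(list(programs), fast_check, reward.get)
```

Without the pruning step, ranking by score alone would put the incorrect `square` first, which mirrors the abstract's observation that pruning works by filtering out incorrect but highly ranked solutions.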

Article 32

Title@2025-06-11 (3): Microservices and Real-Time Processing in Retail IT: A Review of Open-Source Toolchains and Deployment Strategies

Title: Microservices and Real-Time Processing in Retail IT: A Review of Open-Source Toolchains and Deployment Strategies

Mikroservices und Echtzeit-Verarbeitung im Einzelhandel IT: Eine Überprüfung von Open-Source-Toolchains und Bereitstellungsstrategien

《零售信息技术的微观服务和实时处理:对开放源码工具链和部署战略的审查》 2506.09938v1

Authors (2): Aaditaa Vashisht, Rekha B S

With the rapid pace of digital transformation, the retail industry increasingly depends on real-time, scalable, and resilient systems to manage financial transactions, analyze customer behavior, and streamline order processing. This literature review explores how modern event-driven and microservices-based architectures, particularly those leveraging Apache Kafka, Spring Boot, MongoDB, and Kubernetes, are transforming retail and financial systems. By systematically reviewing academic publications, technical white papers, and industry reports from recent years, this study synthesizes key themes and implementation strategies. The analysis reveals that technologies like Kafka and Spring Boot are instrumental in building low-latency, event-driven applications that support real-time analytics and fraud detection, while MongoDB, when deployed on Kubernetes, ensures fault tolerance and high availability in inventory and transaction systems. Kubernetes itself plays a crucial role in automating deployment and scaling of microservices. These findings provide valuable insights for industry practitioners aiming to design scalable infrastructures, identify research opportunities in hybrid deployment models, and offer educators a foundation to integrate modern system architectures into professional and technical communication training.

随着数字转型的飞速发展,零售业越来越依赖实时、可扩展和有弹性的系统来管理金融交易、分析客户行为和精简订单处理。本文献审查探讨了现代事件驱动和微观服务架构是如何转变零售和金融体系的。通过系统审查近年来的学术出版物、技术白皮书和行业报告,本研究报告综合了关键主题和执行战略。分析显示,卡夫卡和春靴等技术有助于建立低延迟、事件驱动的应用,支持实时分析分析和欺诈检测,而蒙戈登布在库伯涅斯部署时,则确保库存和交易系统的容错和高可用性。库伯内茨本身在部署和扩展微观服务自动化方面发挥着至关重要的作用。这些研究结果为行业从业人员提供了宝贵的见解,旨在设计可扩展的基础设施,确定混合部署模式的研究机会,并为教育工作者提供将现代系统架构纳入专业和技术通信培训的基础。

Article 33

Title@2025-06-11 (3): Assessing a Safety Case: Bottom-up Guidance for Claims and Evidence Evaluation

Title: Assessing a Safety Case: Bottom-up Guidance for Claims and Evidence Evaluation

Bewertung eines Sicherheitsfalles: Bottom-up-Leitfaden für Forderungen und die Bewertung von Beweisen

安全案件评估:索赔和证据评价自下而上准则 2506.09929v1

Authors (6): Scott Schnelle, Francesca Favaro, Laura Fraade-Blanar, David Wichner, Holland Broce, Justin Miranda

As Automated Driving Systems (ADS) technology advances, ensuring safety and public trust requires robust assurance frameworks, with safety cases emerging as a critical tool toward such a goal. This paper explores an approach to assess how a safety case is supported by its claims and evidence, toward establishing credibility for the overall case. Starting from a description of the building blocks of a safety case (claims, evidence, and optional format-dependent entries), this paper delves into the assessment of support of each claim through the provided evidence. Two domains of assessment are outlined for each claim: procedural support (formalizing process specification) and implementation support (demonstrating process application). Additionally, an assessment of evidence status is also undertaken, independently from the claims support. Scoring strategies and evaluation guidelines are provided, including detailed scoring tables for claim support and evidence status assessment. The paper further discusses governance, continual improvement, and timing considerations for safety case assessments. Reporting of results and findings is contextualized within its primary use for internal decision-making on continual improvement efforts. The presented approach builds on state of the art auditing practices, but specifically tackles the question of judging the credibility of a safety case. While not conclusive on its own, it provides a starting point toward a comprehensive “Case Credibility Assessment” (CCA), starting from the evaluation of the support for each claim (individually and in aggregate), as well as every piece of evidence provided. By delving into the technical intricacies of ADS safety cases, this work contributes to the ongoing discourse on safety assurance and aims to facilitate the responsible integration of ADS technology into society.

随着自动驾驶系统(ADS)技术的进步,确保安全和公众信任需要强有力的保证框架,安全案例将成为实现这一目标的关键工具,本文件探讨了评估安全案例如何得到其索赔和证据的支持的方法,目的是为整个案例建立可信度。从描述安全案例的构件(索赔、证据和基于任择格式的条目)开始,本文件深入到通过提供证据对每项索赔的支持进行评估。对每项索赔的评估分为两个领域:程序支持(程序规范化规格)和执行支持(验证程序应用)。此外,还独立于索赔支持,对证据状况进行评估。提供证据状况评估,包括详细的索赔支持和证据状况评估评分表。本文进一步讨论了安全案例评估的治理、持续改进和时间考虑。结果和结论报告是内部决策在不断改进努力方面的主要用途。提出的方法以工艺审计做法为基础,但具体解决了判断安全案例的可信度问题(从风险评估开始的每个阶段开始,从风险评估开始,从风险评估的每个阶段开始,从最终证据开始,从风险评估的每个阶段,到最终的风险评估,从风险评估的每个阶段,从风险评估开始,从风险评估的每个阶段开始,从最终的风险评估,到最终的风险评估,为C。

Article 34

Title@2025-06-11 (3): The Popularity Hypothesis in Software Security: A Large-Scale Replication with PHP Packages

Title: The Popularity Hypothesis in Software Security: A Large-Scale Replication with PHP Packages

Die Popularitätshypothese in der Softwaresicherheit: Eine großflächige Replizierung mit PHP-Paketen

软件安全中的大众化假设:大规模复制PHP套件 2502.16670v2

Authors (2): Jukka Ruohonen, Qusai Ramadan

There has been a long-standing hypothesis, in both research and popular discourse, that a software’s popularity is related to its security or insecurity. There are also a few empirical studies that have examined the hypothesis, either explicitly or implicitly. The present work continues with and contributes to this research with a replication-motivated large-scale analysis of software written in the PHP programming language. The dataset examined contains nearly four hundred thousand open source software packages written in PHP. According to the results based on reported security vulnerabilities, the hypothesis does hold; packages that have been affected by vulnerabilities over their release histories are generally more popular than packages that have not been affected by a single vulnerability. With these replication results, the paper contributes to the efforts to strengthen the empirical knowledge base in cyber and software security.

长期以来一直有一个假设,即软件的受欢迎程度与其在研究和大众讨论方面的安全或不安全有关,还有一些经验性研究对假设进行了明确或隐含的审查,目前的工作继续进行,对以PHP编程语言撰写的软件进行以复制为动机的大规模分析,对这项研究作出贡献。所审查的数据集包含以PHP为主的近40万个开放源软件包。根据所报告的安全脆弱性的结果,假设确实有效;由于在公布历史上的脆弱性而受到影响的软件包一般比一揽子软件更受欢迎,而没有受到单一脆弱性的影响。通过这种复制结果,文件有助于努力加强网络和软件安全的经验知识库。

Article 35

Title@2025-06-11 (3): Quantum resources in resource management systems

Title: Quantum resources in resource management systems

Quantenressourcen in Ressourcenmanagementsystemen

资源管理系统的量子资源 2506.10052v1

Authors (12): Iskandar Sitdikov, M. Emre Sahin, Utz Bacher, Aleksander Wennersteen, Andrew Damin, Mark Birmingham, Philippa Rubin, Stefano Mensa, Matthieu Moreau, Aurelien Nober, Hitomi Takahashi, Munetaka Ohtani

Quantum computers are beginning to operate in high-performance computing (HPC) environments. Quantum computers can complement classical resources for specific workloads, but their adoption depends on integration into existing HPC infrastructure. Treating quantum devices as first-class resources allows for unified scheduling, improved usability, and support for hybrid quantum-classical applications. This paper presents the design architecture and reference implementation for quantum resources control using existing workload management systems. We introduce a suite of plugins for Slurm that enable integration of on-prem and cloud quantum computing resources into existing high-performance computing centers. The paper details the interface design, plugin concept and implementation, operational aspects for heterogeneous compute clusters, as well as considerations for other resource management systems.

量子计算机开始在高性能计算(HPC)环境中运行。量子计算机可以补充用于具体工作量的古典资源,但其采用取决于融入现有的高性能计算基础设施。将量子设备作为一流资源处理,可以统一时间安排,改进可用性,并支持混合量子古典应用。本文件介绍了利用现有工作量管理系统量子资源控制的设计架构和参考实施。我们为Slurm引入了一套插件,使预数和云量计算资源融入现有的高性能计算中心。该文件详细介绍了界面设计、插件概念和实施、多种计算组的操作方面以及其他资源管理系统的考虑因素。

Article 36

Title@2025-06-11 (3): The Effects of GitHub Copilot on Computing Students’ Programming Effectiveness, Efficiency, and Processes in Brownfield Programming Tasks

Title: The Effects of GitHub Copilot on Computing Students’ Programming Effectiveness, Efficiency, and Processes in Brownfield Programming Tasks

Die Auswirkungen von GitHub Copilot auf die Programmierung von Computerstudenten Wirksamkeit, Effizienz und Prozesse in Brownfield-Programmierungsaufgaben

GitHub联合试点对布朗外地方案拟订任务中计算机学生方案拟订效力、效率和进程的影响 2506.10051v1

Authors (6): Md Istiak Hossain Shihab, Christopher Hundhausen, Ahsun Tariq, Summit Haque, Yunhan Qiao, Brian Mulanda

When graduates of computing degree programs enter the software industry, they will most likely join teams working on legacy code bases developed by people other than themselves. In these so-called brownfield software development settings, generative artificial intelligence (GenAI) coding assistants like GitHub Copilot are rapidly transforming software development practices, yet the impact of GenAI on student programmers performing brownfield development tasks remains underexplored. This paper investigates how GitHub Copilot influences undergraduate students’ programming performance, behaviors, and understanding when completing brownfield programming tasks in which they add new code to an unfamiliar code base. We conducted a controlled experiment in which 10 undergraduate computer science students completed highly similar brownfield development tasks with and without Copilot in a legacy web application. Using a mixed-methods approach combining performance analysis, behavioral analysis, and exit interviews, we found that students completed tasks 35% faster (p < 0.05) and made 50% more solution progress (p < 0.05) when using Copilot. Moreover, our analysis revealed that, when using Copilot, students spent 11% less time manually writing code (p < 0.05), and 12% less time conducting web searches (p < 0.05), providing evidence of a fundamental shift in how they engaged in programming. In exit interviews, students reported concerns about not understanding how or why Copilot suggestions work. This research suggests the need for computing educators to develop new pedagogical approaches that leverage GenAI assistants’ benefits while fostering reflection on how and why GenAI suggestions address brownfield programming tasks. Complete study results and analysis are presented at https://ghcopilot-icer.github.io/.

当计算机学位课程的毕业生进入软件行业时,他们很可能加入在非他们自己开发的遗留代码基础上工作的团队。在这些所谓的棕色野外软件开发设置中,基因化人工智能(GenAI)像GitHub Copilit这样的基因化人工智能(GenAI)编译助理正在迅速改变软件开发做法,然而GenAI对学生编程程序员执行棕色野外开发任务的影响仍然未得到充分探讨。本文调查了GitHub Copilot在完成棕色野外编程学生编程任务时如何影响本科学生的编程业绩、行为和理解。此外,我们的分析显示,在使用Cocilfield编程任务时,学生们花费了11 % 的手动代码(p pleglegal 0.05) 。我们进行了控制性实验,10个本科计算机科学学生在遗留的网络应用程序应用程序应用程序应用程序应用程序应用程序应用中完成了非常相似的棕色开发任务。我们发现,为什么在进行基本的网上搜索任务中,为什么没有进行在线搜索,而没有进行这样的时间分析,而没有进行这样的访问显示,为什么需要进行基础搜索。

Article 37

Title@2025-06-11 (3): Stakeholder Participation for Responsible AI Development: Disconnects Between Guidance and Current Practice

Title: Stakeholder Participation for Responsible AI Development: Disconnects Between Guidance and Current Practice

Stakeholder-Beteiligung für verantwortungsvolle KI-Entwicklung: Verbindungen zwischen Führung und aktueller Praxis

利益攸关方参与负责任的大赦国际发展:指导与现行做法脱节 2506.09873v1

Authors (3): Emma Kallina, Thomas Bohné, Jat Singh

Responsible AI (rAI) guidance increasingly promotes stakeholder involvement (SHI) during AI development. At the same time, SHI is already common in commercial software development, but with potentially different foci. This study clarifies the extent to which established SHI practices are able to contribute to rAI efforts as well as potential disconnects – essential insights to inform and tailor future interventions that further shift industry practice towards rAI efforts. First, we analysed 56 rAI guidance documents to identify why SHI is recommended (i.e. its expected benefits for rAI) and uncovered goals such as redistributing power, improving socio-technical understandings, anticipating risks, and enhancing public oversight. To understand why and how SHI is currently practised in commercial settings, we then conducted an online survey (n=130) and semi-structured interviews (n=10) with AI practitioners. Our findings reveal that SHI in practice is primarily driven by commercial priorities (e.g. customer value, compliance) and several factors currently discourage more rAI-aligned SHI practices. This suggests that established SHI practices are largely not contributing to rAI efforts. To address this disconnect, we propose interventions and research opportunities to advance rAI development in practice.

负责任的AI(rAI)指导越来越多地促进利益攸关方在AI开发过程中的参与。与此同时,SHI在商业软件开发中已经司空见惯,但有潜在的差异。本研究报告澄清了已经确立的SHI做法能够在多大程度上促进rAI努力和潜在脱节 – – 向未来干预提供信息和调整使其进一步将行业做法转向rAI努力的重要见解。首先,我们分析了56 rAI指导文件,以确定为什么建议SHI(即其对RAI的预期好处)和尚未实现的目标,如重新分配权力、改善社会技术理解、预测风险和加强公共监督等。为了了解目前商业环境中实行SHI做法的原因和方式,我们随后进行了一次在线调查(n=130)和半结构性访谈(n=10),我们的调查结果显示,SHI在实践中主要受商业优先事项(如客户价值、合规)和目前阻碍更多rAI调整SHI做法的若干因素的驱动。这表明,已经确立的SHI做法在很大程度上无助于RA的努力。为了解决这种脱节问题,我们提议采取干预措施和研究各种机会,以推进RAAI的发展。

Article 38

Title@2025-06-11 (3): variability.dev: Towards an Online Toolbox for Feature Modeling

Title: variability.dev: Towards an Online Toolbox for Feature Modeling

variability.dev: Auf dem Weg zu einer Online-Toolbox für Feature Modeling

变量. dev: 走向一个用于特性建模的在线工具箱 2506.09845v1

Authors (8): Tobias Heß, Lukas Ostheimer, Tobias Betz, Simon Karrer, Tim Jannik Schmidt, Pierre Coquet, Sean Semmler, Thomas Thüm

The emergence of feature models as the default to model the variability in configurable systems fosters a rich diversity in applications, application domains, and perspectives. Independent of their domain, modelers need to open, view, edit, transform, save, and configure models as well as to collaborate with others. However, at the time of writing, the top five results when googling “Online Editor Feature Model” point to editors that either have minimal functionality, are unmaintained or defunct, or require an offline installation, such as FeatureIDE. In this work we present a preview of our in-development online toolbox for feature modeling, variability.dev. In particular, we showcase our collaborative feature-model editor and our online configurator, both of which are built on top of the FeatureIDE library.

地物模型的出现作为模拟可配置系统变异的默认值,促进了应用程序、应用程序域和视角的丰富多样性。建模者独立于其域外, 需要打开、查看、编辑、变换、保存、配置模型, 并与他人合作。然而, 在撰写本文时, 谷歌“ 在线编辑器特性模型” 点对编辑来说最前五个结果, 编辑的功能微乎其微, 无法维护或失效, 或需要离线安装, 如 FeatureIDE 。在这项工作中, 我们展示了我们开发中的功能建模、变异性. dev 的在线工具箱的预览。特别是, 我们展示了我们的协作性地物模型编辑和我们的在线配置工具箱, 两者都建在 Fetatridede 图书馆的顶端。

Article 39

Title@2025-06-11 (3): Online Discovery of Simulation Models for Evolving Business Processes (Extended Version)

Title: Online Discovery of Simulation Models for Evolving Business Processes (Extended Version)

Online Discovery of Simulation Models for Evolving Business Processes (Erweiterte Version)

不断演变的业务流程模拟模型在线发现(扩展版) 2506.10049v1

Authors (4): Francesco Vinci, Gyunam Park, Wil van der Aalst, Massimiliano de Leoni

Business Process Simulation (BPS) refers to techniques designed to replicate the dynamic behavior of a business process. Many approaches have been proposed to automatically discover simulation models from historical event logs, reducing the cost and time to manually design them. However, in dynamic business environments, organizations continuously refine their processes to enhance efficiency, reduce costs, and improve customer satisfaction. Existing techniques to process simulation discovery lack adaptability to real-time operational changes. In this paper, we propose a streaming process simulation discovery technique that integrates Incremental Process Discovery with Online Machine Learning methods. This technique prioritizes recent data while preserving historical information, ensuring adaptation to evolving process dynamics. Experiments conducted on four different event logs demonstrate the importance in simulation of giving more weight to recent data while retaining historical knowledge. Our technique not only produces more stable simulations but also exhibits robustness in handling concept drift, as highlighted in one of the use cases.

商业过程模拟(BPS)是指旨在复制商业过程动态行为的技术; 提出了许多办法,以便从历史事件日志中自动发现模拟模型,减少成本和人工设计这些模型的时间; 然而,在动态商业环境中,各组织不断改进其程序,以提高效率、降低成本和提高客户满意度; 现有的模拟发现处理技术缺乏适应实时业务变化的适应性; 在本文件中,我们提议了一种将递增过程发现与在线机器学习方法相结合的流动过程模拟发现技术; 这一技术在保存历史信息的同时优先考虑最新数据,确保适应不断演变的过程动态; 在四个不同的事件日志上进行的实验表明,在模拟中,必须更多地重视最新数据,同时保留历史知识; 我们的技术不仅产生更稳定的模拟,而且在处理概念漂移方面表现出稳健,正如其中一个使用案例所强调的那样。

Article 40

Title@2025-06-11 (3): ComfyUI-R1: Exploring Reasoning Models for Workflow Generation

Title: ComfyUI-R1: Exploring Reasoning Models for Workflow Generation

ComfyUI-R1: Erforschung von Konzeptmodellen für die Workflow-Generierung

ComfyUI-R1:探索产生工作流程的理由模型 2506.09790v1

Authors (8): Zhenran Xu, Yiyu Wang, Xue Yang, Longyue Wang, Weihua Luo, Kaifu Zhang, Baotian Hu, Min Zhang

AI-generated content has evolved from monolithic models to modular workflows, particularly on platforms like ComfyUI, enabling customization in creative pipelines. However, crafting effective workflows requires great expertise to orchestrate numerous specialized components, presenting a steep learning curve for users. To address this challenge, we introduce ComfyUI-R1, the first large reasoning model for automated workflow generation. Starting with our curated dataset of 4K workflows, we construct long chain-of-thought (CoT) reasoning data, including node selection, workflow planning, and code-level workflow representation. ComfyUI-R1 is trained through a two-stage framework: (1) CoT fine-tuning for cold start, adapting models to the ComfyUI domain; (2) reinforcement learning for incentivizing reasoning capability, guided by a fine-grained rule-metric hybrid reward, ensuring format validity, structural integrity, and node-level fidelity. Experiments show that our 7B-parameter model achieves a 97% format validity rate, along with high pass rates and node-level and graph-level F1 scores, significantly surpassing prior state-of-the-art methods that employ leading closed-source models such as GPT-4o and the Claude series. Further analysis highlights the critical role of the reasoning process and the advantage of transforming workflows into code. Qualitative comparison reveals our strength in synthesizing intricate workflows with diverse nodes, underscoring the potential of long CoT reasoning in AI art creation.
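
The abstract does not define how the node-level F1 score is computed. One plausible reading, sketched here as an assumption rather than the authors' metric, compares the multiset of node types in a generated workflow with that of a reference workflow. The function name `node_level_f1` and the node labels are illustrative only.

```python
from collections import Counter


def node_level_f1(predicted_nodes, reference_nodes):
    """F1 over node types, treating each workflow as a multiset of node labels."""
    predicted = Counter(predicted_nodes)
    reference = Counter(reference_nodes)
    # Multiset intersection: a node type counts as matched at most as often
    # as it appears in both workflows.
    overlap = sum((predicted & reference).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(predicted.values())
    recall = overlap / sum(reference.values())
    return 2 * precision * recall / (precision + recall)


# Illustrative labels only; they stand in for whatever node types a workflow uses.
generated = ["LoadModel", "EncodePrompt", "Sampler"]
expected = ["LoadModel", "EncodePrompt", "Decoder"]
score = node_level_f1(generated, expected)
```

A graph-level score would additionally have to compare edges between nodes, which this sketch deliberately leaves out.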

AI 生成的内容已经从单一模式演变为模块化工作流程,特别是在ComfyUI等平台上,使创造性管道具有定制化功能。然而,制定有效的工作流程需要大量专门知识来协调许多专门组成部分,为用户提供一个陡峭的学习曲线。为了应对这一挑战,我们引入了第一个大型自动工作流程生成推理模型ComfyUI-R1,这是第一个自动化工作流程生成的大型推理模型。从我们整理的4K工作流程数据集开始,我们构建了长期的思维链推理数据,包括节点选择、工作流程规划和代码级工作流程代表。ComfyU-R1通过一个两阶段框架进行培训:(1) COT为寒点启动进行微调,将模型调整到ComfyUI 域;(2) 强化激励推理能力学习,以精细的定标定混合奖项为指导,确保格式的有效性、结构完整性和不偏差水平。实验显示,我们的7B 参数模型取得了97格式有效性率,以及高通过高通度、平级和图表级F1分级的F1,通过前两个阶段的CLIL的流程流程推算,大大超越了我们之前的精确流程流程的流程的流程,从而利用了G的升级的精确流程的升级的流程的流程的流程,从而进一步采用了了G的升级的升级的升级的升级的升级的升级的流程。

Article 41

Title@2025-06-11 (3): Towards Bridging Formal Methods and Human Interpretability

Title: Towards Bridging Formal Methods and Human Interpretability

Auf dem Weg zur Überbrückung formaler Methoden und menschlicher Verdolmetschbarkeit

实现正规方法和人类可解释性之间的衔接 2506.09759v1

Authors (3): Abhijit Paul, Proma Chowdhury, Kazi Sakib

Labeled Transition Systems (LTS) are integral to model checking and design repair tools. System engineers frequently examine LTS designs during model checking or design repair to debug, identify inconsistencies, and validate system behavior. Despite LTS’s significance, no prior research has examined human comprehension of these designs. To address this, we draw on traditional software engineering and graph theory, identifying 7 key metrics: cyclomatic complexity, state space size, average branching factor, maximum depth, Albin complexity, modularity, and redundancy. We created a dataset of 148 LTS designs, sampling 48 for 324 paired comparisons, and ranked them using the Bradley-Terry model. Through Kendall’s Tau correlation analysis, we found that Albin complexity ($\tau = 0.444$), state space size ($\tau = 0.420$), cyclomatic complexity ($\tau = 0.366$), and redundancy ($\tau = 0.315$) most accurately reflect human comprehension of LTS designs. To showcase the metrics’ utility, we applied the Albin complexity metric within the Fortis design repair tool, ranking system redesigns. This ranking reduced annotators’ comprehension time by 39%, suggesting that metrics emphasizing human factors can enhance formal design interpretability.
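
For readers unfamiliar with the statistic, Kendall's Tau measures how often two rankings order a pair of items the same way. The sketch below implements the simple tau-a variant (no tie correction) in plain Python; the abstract does not say which variant the authors used, and the example scores are made up for illustration.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / total pairs, no tie correction."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if len(x) < 2:
        raise ValueError("need at least two observations")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        direction = (x[i] - x[j]) * (y[i] - y[j])
        if direction > 0:
            concordant += 1  # both rankings order this pair the same way
        elif direction < 0:
            discordant += 1  # the rankings disagree on this pair
    total_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / total_pairs


# Made-up example: a metric value per design vs. a human difficulty rank.
metric_values = [1, 2, 3, 4]
human_ranks = [1, 3, 2, 4]
tau = kendall_tau_a(metric_values, human_ranks)
```

A value of 1 means the metric orders every pair of designs as the human ranking does, and -1 means it reverses every pair. In practice a library routine such as `scipy.stats.kendalltau`, which also handles ties, would normally be preferred.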

系统工程师经常在模型检查或设计修理中检查LTS设计,以调试、识别不一致之处和验证系统行为。尽管LTS意义重大,但先前没有研究人类对这些设计的理解。为了解决这个问题,我们利用传统的软件工程和图表理论,确定了7个关键指标:环球复杂性、州空间大小、平均分流系数、最大深度、Albin复杂性、模块性和冗余。我们创建了148 LTS设计数据集,抽样48,324对齐比较,并使用Bradley-Tery模型对其进行评级。通过Kendall的Tau相关分析,我们发现Albin复杂程度(=0.44440美元)、州空间大小(=0.420美元)、环球复杂性(=0.366美元)和冗余($tau=0.3115美元),最准确地反映了LTS设计的人理解。为了展示该指标的实用性,我们在Fortis设计修理工具中应用了Albin复杂程度指标,我们在Fredis 修理工具中应用了Albin,我们应用了Algin 重定序系统解释标准。

Article 42

Title@2025-06-11 (3): A First Look at Bugs in LLM Inference Engines

Title: A First Look at Bugs in LLM Inference Engines

Ein erster Blick auf Bugs in LLM Inferenz-Engines

第一次看一看LLM 推断引擎中的虫虫 2506.09713v1

Authors (8): Mugeng Liu, Siqi Zhong, Weichen Bi, Yixuan Zhang, Zhiyang Chen, Zhenpeng Chen, Xuanzhe Liu, Yun Ma

Inference engines specific to large language models (LLM inference engines for short) have become a fundamental component of modern AI infrastructure, enabling the deployment of LLM-powered applications (LLM apps) across cloud and local devices. Despite their critical role, LLM inference engines are prone to bugs due to the immense resource demands of LLMs and the complexities of cross-platform compatibility. However, a systematic understanding of these bugs remains lacking. To bridge this gap, we present the first empirical study on bugs in LLM inference engines. We mine official repositories of 5 widely adopted LLM inference engines, constructing a comprehensive dataset of 929 real-world bugs. Through a rigorous open coding process, we analyze these bugs to uncover their symptoms, root causes, and commonality. Our findings reveal six major bug symptoms and a taxonomy of 28 root causes, shedding light on the key challenges in bug detection and location within LLM inference engines. Based on these insights, we propose a series of actionable implications for researchers, inference engine vendors, and LLM app developers.

大型语言模型特有推理引擎(简而言之为 emph{LLLM 推理引擎 ) 已成为现代AI 基础设施的基本组成部分,使LLM 动力应用程序(LLM 应用程序)能够跨越云层和地方装置。尽管LLM 推理引擎具有关键作用,但由于LLM 的巨大资源需求以及交叉平台兼容性的复杂性,这些引擎容易出现虫子。然而,对这些虫子仍缺乏系统的理解。为了缩小这一差距,我们提出了关于LLLM 推理引擎中的虫子的第一次经验性研究。我们开采了5个广泛采用LLM 推理引擎的官方储存库,建立了929个实际世界虫子的综合数据集。我们通过严格的开放编码程序,分析了这些虫子,以发现其症状、根源和共性。我们的调查结果揭示了6个主要的虫子症状和28个根源的分类,揭示了LM 推理引擎内部的错误检测和位置的关键挑战。我们根据这些了解,提出了一系列对研究人员、推理机供应商和LM 应用软件的可采取行动的影响。

Article 43

Title@2025-06-11 (3): Mapping NVD Records to Their VFCs: How Hard is it?

Title: Mapping NVD Records to Their VFCs: How Hard is it?

Mapping NVD Records zu ihren VFCs: Wie schwer ist es?

绘制VFCs的NVD记录:有多难? 2506.09702v1

Authors (10): Huu Hung Nguyen, Duc Manh Tran, Yiran Cheng, Thanh Le-Cong, Hong Jin Kang, Ratnadira Widyasari, Shar Lwin Khin, Ouh Eng Lieh, Ting Zhang, David Lo

Mapping National Vulnerability Database (NVD) records to vulnerability-fixing commits (VFCs) is crucial for vulnerability analysis but challenging due to sparse explicit links in NVD references. This study explores this mapping’s feasibility through an empirical approach. Manual analysis of NVD references showed Git references enable over 86% success, while non-Git references achieve under 14%. Using these findings, we built an automated pipeline extracting 31,942 VFCs from 20,360 NVD records (8.7% of 235,341) with 87% precision, mainly from Git references. To fill gaps, we mined six external security databases, yielding 29,254 VFCs for 18,985 records (8.1%) at 88.4% precision, and GitHub repositories, adding 3,686 VFCs for 2,795 records (1.2%) at 73% precision. Combining these, we mapped 26,710 unique records (11.3% coverage) from 7,634 projects, with overlap between NVD and external databases, plus unique GitHub contributions. Despite success with Git references, 88.7% of records remain unmapped, highlighting the difficulty without Git links. This study offers insights for enhancing vulnerability datasets and guiding future automated security research.
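
The coverage percentages quoted in the abstract can be reproduced from the record counts it gives. The short check below uses only figures from the abstract; the variable names are chosen here for illustration.

```python
TOTAL_NVD_RECORDS = 235_341

# Number of NVD records mapped to at least one VFC, per source (from the abstract).
mapped_records = {
    "NVD references": 20_360,
    "external security databases": 18_985,
    "GitHub repositories": 2_795,
    "combined (unique)": 26_710,
}


def coverage_percent(record_count, total=TOTAL_NVD_RECORDS):
    """Share of all NVD records, as a percentage rounded to one decimal place."""
    return round(100 * record_count / total, 1)


coverage = {source: coverage_percent(n) for source, n in mapped_records.items()}
unmapped_percent = round(100 - coverage["combined (unique)"], 1)

# The three per-source counts add up to more than the combined unique count,
# which is consistent with the overlap between sources that the abstract mentions.
per_source_sum = sum(n for s, n in mapped_records.items() if s != "combined (unique)")
```

The per-source counts sum to 42,140, well above the 26,710 unique records, so a substantial share of records is reached through more than one source.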

国家脆弱性绘图数据库(NVD)记录与脆弱性确定承诺(VFCs)记录对于脆弱性分析至关重要,但由于NVD参考文献中明确链接少,因此具有挑战性。本研究通过经验方法探索了这一绘图的可行性。对NVD参考文献的手工分析显示,Git参考文献使86%的成功率超过,而非Git参考文献则达到14%以下。利用这些研究结果,我们建立了一个自动管道,从20,360 NVD记录中提取31,942 VFCs(占235,341的8.7%,精确度为87%,主要来自Git参考文献。为了填补空白,我们挖掘了六个外部安全数据库,生成了29,254 VFCs,18,985记录(8.1%),精确度为88.4%,GitHub储存库为8,增加了3,686 VFCs,2,795记录(1.2%),精确度为73%。我们从7,634个项目中提取了26,710个独特的记录(覆盖面为11.3%),主要来自GitHub 贡献。尽管在Git参考文献参考文献中取得了成功,88.7%,但该记录仍无法指导数据更新。

Article 44

Title@2025-06-11 (3): Calculating Software’s Energy Use and Carbon Emissions: A Survey of the State of Art, Challenges, and the Way Ahead

Title: Calculating Software’s Energy Use and Carbon Emissions: A Survey of the State of Art, Challenges, and the Way Ahead

Berechnung der Energienutzung und CO2-Emissionen von Software: Eine Übersicht über den Stand der Technik, Herausforderungen und den Weg nach vorn

计算软件的能源使用量和碳排放量:对艺术、挑战和前进道路现状的调查 2506.09683v1

Authors (8): Priyavanshi Pathania, Nikhil Bamby, Rohit Mehra, Samarth Sikand, Vibhu Saujanya Sharma, Vikrant Kaulgud, Sanjay Podder, Adam P. Burden

The proliferation of software and AI comes with a hidden risk: its growing energy and carbon footprint. As concerns regarding environmental sustainability come to the forefront, understanding and optimizing how software impacts the environment becomes paramount. In this paper, we present a state-of-the-art review of methods and tools that enable the measurement of software and AI-related energy and/or carbon emissions. We introduce a taxonomy to categorize the existing work as Monitoring, Estimation, or Black-Box approaches. We delve deeper into the tools and compare them across different dimensions and granularity, for example, whether their measurement encompasses energy and carbon emissions and which components are considered (such as CPU, GPU, and RAM). We present our observations on practical use (component-wise consolidation of approaches) as well as the challenges that we have identified across the current state of the art. As we start an initiative to address these challenges, we emphasize active collaboration across the community in this important field.

软件和AI的扩散带来了一个隐蔽的风险:其日益增长的能源和碳足迹。当环境可持续性问题被放在首位时,我们理解和优化软件对环境影响如何成为至高无上的问题。在本文件中,我们介绍了对能够测量软件和AI相关能源和/或碳排放的方法和工具的最新审查。我们引入了一种分类学,将现有工作分类为监测、估计或黑箱方法。我们深入地研究工具,将其跨越不同层面和颗粒度,例如,是否包括能源和碳排放以及所考虑的成分(如CPU、GPU、RAM等)。我们提出了关于实际使用(方法的明智整合)以及我们在整个目前状态所查明的挑战的意见。当我们发起应对这些挑战的倡议时,我们强调整个社区在这一重要领域进行积极的合作。

Article 45

Title@2025-06-11 (3): Assessing and Advancing Benchmarks for Evaluating Large Language Models in Software Engineering Tasks

Title: Assessing and Advancing Benchmarks for Evaluating Large Language Models in Software Engineering Tasks

Bewertung und Verbesserung von Benchmarks für die Bewertung großer Sprachmodelle in der Software-Engineering-Aufgaben

评估和推进评估软件工程任务中大语言模型的基准 2505.08903v3

Authors (8): Xing Hu, Feifei Niu, Junkai Chen, Xin Zhou, Junwei Zhang, Junda He, Xin Xia, David Lo

Large language models (LLMs) are gaining increasing popularity in software engineering (SE) due to their unprecedented performance across various applications. These models are increasingly being utilized for a range of SE tasks, including requirements engineering and design, code analysis and generation, software maintenance, and quality assurance. As LLMs become more integral to SE, evaluating their effectiveness is crucial for understanding their potential in this field. In recent years, substantial efforts have been made to assess LLM performance in various SE tasks, resulting in the creation of several benchmarks tailored to this purpose. This paper offers a thorough review of 291 benchmarks, addressing three main aspects: what benchmarks are available, how benchmarks are constructed, and the future outlook for these benchmarks. We begin by examining SE tasks such as requirements engineering and design, coding assistant, software testing, AIOps, software maintenance, and quality management. We then analyze the benchmarks and their development processes, highlighting the limitations of existing benchmarks. Additionally, we discuss the successes and failures of LLMs in different software tasks and explore future opportunities and challenges for SE-related benchmarks. We aim to provide a comprehensive overview of benchmark research in SE and offer insights to support the creation of more effective evaluation tools.

大型语言模型(LLMS)由于在各种应用中表现前所未有,在软件工程方面越来越受欢迎,这些模型越来越多地用于一系列SE任务,包括工程和设计要求、代码分析和生成、软件维护和质量保证;随着LLMS成为SE的有机组成部分,评价其有效性对于了解其在这一领域的潜力至关重要;近年来,为评估LLMS在各种SE任务中的绩效作出了大量努力,从而建立了若干专门为此而量身定做的基准;本文件对291个基准进行了彻底审查,涉及三个主要方面:现有基准有哪些,基准如何建立,以及这些基准的未来前景;我们首先审查SE任务,例如工程和设计要求、编码助理、软件测试、AIOP、软件维护和质量管理;然后分析基准及其开发过程,突出现有基准的局限性;此外,我们讨论了LLMMS在不同软件任务的成败,并探讨与SE有关基准的未来机会和挑战。我们的目的是全面概述SE的基准研究,并提供见解,以支持建立更有效的评估工具。

Article 46

Title@2025-06-11 (3): Translating a VDM Model of a Medical Device into Kapture

Title: Translating a VDM Model of a Medical Device into Kapture

Übersetzen eines VDM-Modells eines medizinischen Geräts in Kapture

将医疗设备VDM模型转换成外形 2506.09636v1

Authors (3): Joe Hare, Leo Freitas, Ken Pierce

As the complexity of safety-critical medical devices increases, so does the need for clear, verifiable software requirements. This paper explores the use of Kapture, a formal modelling tool developed by D-RisQ, to translate an existing formal VDM model of a medical implant for treating focal epilepsy called CANDO. The work was undertaken without prior experience in formal methods. The paper assesses Kapture’s usability, the challenges of formal modelling, and the effectiveness of the translated model. The result is a model in Kapture which covers over 90% of the original VDM model and produces matching traces of results. While several issues were encountered during design and implementation, mainly due to the initial learning curve, this paper demonstrates that complex systems can be effectively modelled in Kapture by inexperienced users and highlights some difficulties in translating VDM specifications to Kapture.

随着安全关键医疗装置的复杂性增加,对明确、可核查的软件要求的需求也随之增加。本文件探讨使用D-RisQ开发的一种正式模型工具,即D-RisQ开发的Kapture, 来翻译现有正式的VDM模型,用于治疗中心癫痫,称为CANDO。这项工作是在没有正式方法前的经验的情况下进行的。文件评估了Kapture的可用性、正式模型的挑战和翻译模型的有效性。结果为Kapture的模型,覆盖了原VDM模型的90%以上,并产生了相应的结果痕迹。在设计和实施过程中遇到了几个问题,主要由于最初的学习曲线,本文表明,没有经验的用户可以在Kapture中有效地模拟复杂的系统,并突出了将VDM规格转化为Kapture的一些困难。

Article 47

Title@2025-06-11 (3): ASTAGEN: Empirical Evaluation of Automated SATD Taxonomy Generation with LLMs

Title: ASTAGEN: Empirical Evaluation of Automated SATD Taxonomy Generation with LLMs

ASTAGEN: Empirische Bewertung der automatisierten SATD-Taxonomie-Generation mit LLMs

ASTAGEN: 与LLM公司一起对自动的SATD 分类生成进行经验评估 2506.09601v1

Authors (5): Sota Nakashima, Yuta Ishimoto, Masanari Kondo, Tao Xiao, Yasutaka Kamei

Technical debt refers to suboptimal code that degrades software quality. When developers intentionally introduce such debt, it is called self-admitted technical debt (SATD). Since SATD hinders maintenance, identifying its categories is key to uncovering quality issues. Traditionally, constructing such taxonomies requires manually inspecting SATD comments and surrounding code, which is time-consuming, labor-intensive, and often inconsistent due to annotator subjectivity. This study presents ASTAGEN, an initial step toward automating SATD taxonomy generation using large language models (LLMs). Given a comment and its surrounding code, ASTAGEN first generates a concise explanation for each SATD comment, then incrementally generates and updates categories to construct a taxonomy. We evaluate ASTAGEN on SATD datasets from three domains: quantum software, smart contracts, and machine learning. It successfully recovers domain-specific categories reported in prior work, such as Layer Configuration in machine learning. Compared to a naive use of an LLM, ASTAGEN produces more consistent category assignments due to its explanation-driven, iterative design. It also completes taxonomy generation in under two hours and for less than one USD, even on the largest dataset. These results suggest that while full automation remains challenging, ASTAGEN is able to support semi-automated taxonomy construction. Furthermore, our work opens up avenues for future work, such as automatic taxonomy generation in other areas.

技术债务是指降低软件质量的亚最佳代码。当开发者有意引入这种债务时,它被称为自我承认的技术债务(SATD)。由于SATD阻碍维护,确定其类别是发现质量问题的关键。传统上,建立这种分类需要人工检查SATD评论和周围代码,这是耗时、劳动密集型的,而且往往因注释主观性而不一致。本研究报告介绍了ASTAGEN,这是利用大语言模型(LLLM)实现SATD分类生成自动化的第一步。根据评论及其周围代码,ASTAGEN首先为每次SATD评论提供简明的解释,然后逐步生成和更新分类,以构建一个分类。我们从三个领域(量子软件、智能合同和机器学习)对SATD数据集进行ATGEN评估。它成功地恢复了先前工作中报告的域别类别,如机器学习中的图层配置。与使用LMM(LMM)的天真的使用相比,ASTAGEN根据解释和迭代设计得出了更加一致的分类任务。此外,ASTGEN在两个领域完成税制的生成过程,在两个小时内,在ADADA系统进行最有挑战的自动化的自动化,而后,这种自动化工作也显示ADADADADADADADA。

Article 48

Title@2025-06-11 (3): Automated Synthesis of Formally Verified Multi-Abstraction Function Summaries

Title: Automated Synthesis of Formally Verified Multi-Abstraction Function Summaries

Automatisierte Synthese von formal verifizierten Multi-Abstraktions-Funktionszusammenfassungen

正式核证的多种吸管功能摘要自动综合 2506.09550v1

Authors (8): Fanpeng Yang, Xu Ma, Shuling Wang, Xiong Xu, Qinxiang Cao, Naijun Zhan, Xiaofeng Li, Bin Gu

Function summaries, which characterize the behavior of code segments (typically functions) through preconditions and postconditions, are essential for understanding, reusing, and verifying software, particularly in safety-critical domains like aerospace embedded systems. However, such mission-critical legacy code, which serves as a valuable reused asset, often lacks formal specifications. It is challenging to automatically generate function summaries for C programs, due to the existence of complex features such as loops, nested function calls, pointer aliasing, and so on. Moreover, function summaries should support multiple abstraction levels to meet diverse requirements, e.g., precise summaries capturing full functionality for formal verification and intuitive summaries for human understanding. To address these challenges, we first propose a novel framework that combines symbolic execution, large language models (LLMs), and formal verification to generate Relatively Strongest Postconditions (RSPs) and build function summaries that fully capture program behavior. Our approach leverages VST-A’s symbolic execution to precisely track program execution paths and state transitions, employs LLMs to infer loop invariants based on predefined templates, and uses Frama-C to guarantee soundness of generated summaries in an iterative refinement loop. Furthermore, from the generated RSPs, we automatically synthesize the strongest non-redundant postconditions expressed within a given domain-specific language. We compare our approach with existing work through extensive experiments.

功能摘要是代码部分(典型功能)通过先决条件和先决条件的行为的特点,对于理解、重新使用和核查软件至关重要,特别是在航空航天嵌入系统等安全关键领域;然而,这些关键任务遗留代码作为宝贵的再利用资产往往缺乏正式规格;由于存在环形、嵌套功能电话、指针别名等复杂特征,难以自动生成C程序功能摘要;此外,功能摘要应支持多种抽象级别,以满足各种要求,例如精确的摘要,为正式核查和直观摘要收集全面功能,以促进人类理解。为了应对这些挑战,我们首先提议一个新框架,将象征性执行、大型语言模型(LLLMS)和正式核查结合起来,以产生相对强大的附加条件,并建立能够充分反映方案行为的功能摘要。我们的方法利用VST-A象征性执行来精确跟踪程序执行路径和状态过渡,利用LLMMS来根据预先定义的模板推断变异性,并利用Frama-C来为人类理解的直观摘要。为了应对这些挑战,我们首先提出一个新的框架框架框架框架框架框架框架,以便保证我们通过不至最强烈的实地分析,通过现有版本进行我们制作的实地分析。

Article 49

Title@2025-06-11 (3): AcTracer: Active Testing of Large Language Model via Multi-Stage Sampling

Title: AcTracer: Active Testing of Large Language Model via Multi-Stage Sampling

AcTracer: Aktives Testen von Großsprachenmodellen durch Multi-Stage-Sampling

AcTracler:通过多标准抽样积极测试大语言模型 2408.03573v2

Authors (5): Yuheng Huang, Jiayang Song, Qiang Hu, Felix Juefei-Xu, Lei Ma

Performance evaluation plays a crucial role in the development life cycle of large language models (LLMs). It estimates the model’s capability, elucidates behavior characteristics, and facilitates the identification of potential issues and limitations, thereby guiding further improvement. Given that LLMs’ diverse task-handling abilities stem from large volumes of training data, a comprehensive evaluation also necessitates abundant, well-annotated, and representative test data to assess LLM performance across various downstream tasks. However, the demand for high-quality test data often entails substantial time, computational resources, and manual efforts, sometimes causing the evaluation to be inefficient or impractical. To address these challenges, researchers propose active testing, which estimates the overall performance by selecting a subset of test data. Nevertheless, the existing active testing methods tend to be inefficient, even inapplicable, given the unique new challenges of LLMs (e.g., diverse task types, increased model complexity, and unavailability of training data). To mitigate such limitations and expedite the development cycle of LLMs, in this work, we introduce AcTracer, an active testing framework tailored for LLMs that strategically selects a small subset of test data to achieve a more accurate performance estimation for LLMs. AcTracer utilizes both internal and external information from LLMs to guide the test sampling process, reducing variance through a multi-stage pool-based active selection. Our experiment results demonstrate that AcTracer achieves state-of-the-art performance compared to existing methods across various tasks.
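
The general idea behind active testing, estimating overall performance from a small labeled subset, can be sketched with plain stratified sampling. This is a generic baseline, not AcTracer's multi-stage pool-based selection, and it assumes that a grouping of test items into strata (for example, by task type) is already available. The names `stratified_accuracy_estimate`, `strata`, `is_correct`, and `budget` are hypothetical.

```python
import random
from collections import defaultdict


def stratified_accuracy_estimate(strata, is_correct, budget, seed=0):
    """Estimate accuracy over all items by labeling only a sample from each stratum.

    strata:     one stratum id per test item (assumed given, e.g. task type)
    is_correct: callable taking an item index; stands in for the expensive
                step of running the model and checking its answer
    budget:     approximate number of items to label in total
    """
    rng = random.Random(seed)
    members_by_stratum = defaultdict(list)
    for index, stratum in enumerate(strata):
        members_by_stratum[stratum].append(index)

    total_items = len(strata)
    estimate = 0.0
    for members in members_by_stratum.values():
        # Proportional allocation with at least one label per stratum, so the
        # number of labeled items can slightly exceed `budget`.
        sample_size = max(1, round(budget * len(members) / total_items))
        sample_size = min(sample_size, len(members))
        sample = rng.sample(members, sample_size)
        stratum_accuracy = sum(is_correct(i) for i in sample) / sample_size
        # Weight each stratum by its share of the full test set.
        estimate += (len(members) / total_items) * stratum_accuracy
    return estimate


# Toy pool: 80 "easy" items the model gets right, 20 "hard" items it gets wrong.
toy_strata = ["easy"] * 80 + ["hard"] * 20
toy_estimate = stratified_accuracy_estimate(
    toy_strata, is_correct=lambda i: i < 80, budget=10
)
```

When strata are internally homogeneous, as in the toy pool, a handful of labels recovers the overall accuracy; the variance reduction shrinks as strata become more mixed, which is where more elaborate selection strategies aim to help.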

绩效评估在大型语言模型(LLMS)的发展生命周期中发挥着关键作用; 评估模型的能力,阐明行为特征,便利查明潜在问题和局限性,从而指导进一步的改进; 鉴于LLMS的多种任务处理能力来自大量培训数据,因此,全面评估还需要大量、有说明的和有代表性的测试数据,以评估各下游任务中LLM系统的业绩; 然而,对高质量测试数据的需求往往需要大量的时间、计算资源和人工工作,有时导致评价效率低下或不切实际; 为了应对这些挑战,研究人员提议进行积极测试,通过选择一组测试数据来估计总体业绩; 然而,鉴于LLMS的独特新挑战(例如,不同任务类型,增加模型复杂性,以及缺乏培训数据),现有积极测试方法往往效率低下,甚至不适用; 为了减轻这种局限性,加快LMS的发展周期,我们在此工作中引入了ATraxer,这是为LMS公司设计的积极测试框架,从一个小的测试数据分组,从一个测试数据中挑选出一个较精确的测试数据,从一个测试阶段的ALMS-LMS-S-S-S-S-S-S-S-S-SLARTS-S-S-S-S-S-S-S-S-S-Servic-S-S-S-S-S

Article 50

Title@2025-06-11 (3): TrioXpert: An automated incident management framework for microservice system

Title: TrioXpert: An automated incident management framework for microservice system

TrioXpert: Ein automatisiertes Ereignismanagement-Framework für Microservice-Systeme

三重X光:微服务系统自动事件管理框架 2506.10043v1

Authors (8): Yongqian Sun, Yu Luo, Xidao Wen, Yuan Yuan, Xiaohui Nie, Shenglin Zhang, Tong Liu, Xi Luo

Automated incident management plays a pivotal role in large-scale microservice systems. However, many existing methods rely solely on single-modal data (e.g., metrics, logs, and traces) and struggle to simultaneously address multiple downstream tasks, including anomaly detection (AD), failure triage (FT), and root cause localization (RCL). Moreover, the lack of clear reasoning evidence in current techniques often leads to insufficient interpretability. To address these limitations, we propose TrioXpert, an end-to-end incident management framework capable of fully leveraging multimodal data. TrioXpert designs three independent data processing pipelines based on the inherent characteristics of different modalities, comprehensively characterizing the operational status of microservice systems from both numerical and textual dimensions. It employs a collaborative reasoning mechanism using large language models (LLMs) to simultaneously handle multiple tasks while providing clear reasoning evidence to ensure strong interpretability. We conducted extensive evaluations on two popular microservice system datasets, and the experimental results demonstrate that TrioXpert achieves outstanding performance in AD (improving by 4.7% to 57.7%), FT (improving by 2.1% to 40.6%), and RCL (improving by 1.6% to 163.1%) tasks.

自动化事故管理在大规模微观服务系统中发挥着关键作用,然而,许多现有方法都完全依赖单一模式数据(例如,指标、日志和痕迹),并努力同时处理多个下游任务,包括异常探测(AD)、故障分级(FT)和根本原因本地化(RCL)等。此外,当前技术缺乏明确的推理证据,往往导致无法充分解释。为解决这些局限性,我们提议TrioXpert,即一个能够充分利用多式联运数据的端到端事件管理框架。TrioXpert根据不同模式的固有特点设计了三个独立的数据处理管道,从数字和文字两个层面全面描述微观服务系统的运作状况。它使用一个协作推理机制,使用大语言模型(LLLMs)同时处理多项任务,同时提供明确的推理证据以确保强有力的解释性。我们广泛评价了两个流行的微观服务系统数据集,实验结果显示TrioXpert在AD(改进4.7%至57.7%)、FT(改进2.1%至40.6%)和RCL.6%的任务改进2.1%)和RCL.

Article 51

Title@2025-06-11 (3): Reasoning as a Resource: Optimizing Fast and Slow Thinking in Code Generation Models

Title: Reasoning as a Resource: Optimizing Fast and Slow Thinking in Code Generation Models

Reasoning as a Resource: Schnelles und langsames Denken in Codegenerierungsmodellen optimieren

资源理由:在代码生成模型中优化快速和缓慢思考 2506.09396v1

Authors (2): Zongjie Li, Shuai Wang

This position paper proposes a fundamental shift in designing code generation models: treating reasoning depth as a controllable resource. Rather than treating reasoning as an incidental byproduct of prompting, we argue that the trade-off between rapid, direct answers (“fast thinking”) and elaborate, chain-of-thought deliberation (“slow thinking”) must be explicitly managed. We contend that optimizing reasoning budgets across the entire model lifecycle, from synthetic data creation and benchmarking to real-world deployment, can unlock superior trade-offs among accuracy, latency, and cost. This paper outlines how adaptive control over reasoning can enrich supervision signals, motivate new multi-dimensional benchmarks, and inform cost-aware, security-conscious deployment policies. By viewing fast and slow thinking as complementary modes to be scheduled, we envision coding agents that think deep when necessary and act fast when possible.

这份立场文件提出了设计代码生成模型的根本转变:将推理深度视为可控制的资源。我们不认为推理深度是推动的附带副产品,而认为快速、直接回答(“快速思考 ” ) 和周密、深思熟虑(“低度思考 ” ) 之间的权衡必须明确加以管理。我们认为,优化整个模型生命周期的推理预算 — — 从合成数据生成和基准设定到现实世界部署人员 — — 能够打开精确度、延缓度和成本之间的优劣取舍。本文概述了对推理的适应性控制如何丰富监督信号,激励新的多维基准,并告知成本意识、安全意识部署政策。通过将快速和缓慢的思维视为要安排的互补模式,我们设想了在必要时进行深入思考并尽可能快速采取行动的编码代理人。

Article 52

Title@2025-06-11 (3): Towards Better Code Generation: Adaptive Decoding with Uncertainty Guidance

Title: Towards Better Code Generation: Adaptive Decoding with Uncertainty Guidance

Auf dem Weg zu einer besseren Codegenerierung: Adaptive Dekodierung mit Ungewissheitsleitfaden

实现更好的代码生成:以不确定性指导进行适应性代用 2506.08980v2

Authors (7): Kaifeng He, Mingwei Liu, Chong Wang, Zike Li, Yanlin Wang, Xin Peng, Zibin Zheng

Code generation using large language models (LLMs) is highly sensitive to the choice of tokens during decoding, especially at points of uncertainty that critically affect the generated program’s logic. Conventional decoding methods such as greedy search and beam search apply uniform treatment to all tokens, neglecting the unique uncertainty characteristics inherent in code generation, which can result in suboptimal outputs. In this work, we conduct an empirical analysis demonstrating that a significant portion of generation errors arises from incorrect token ranking at high-uncertainty steps, where the ground truth token exists in the candidate set but fails to be ranked first. Inspired by this insight, we introduce AdaDec, an adaptive decoding framework guided by token-level uncertainty quantified via Shannon entropy. AdaDec dynamically learns uncertainty thresholds tailored to each model and employs a pause-then-rerank mechanism with lookahead when the uncertainty surpasses these thresholds. Evaluation on the HumanEval and MBPP benchmarks reveals that AdaDec achieves up to a 15.5% improvement in Pass@1 accuracy compared to greedy decoding, matches or outperforms traditional beam search, and reduces both computational overhead and latency through targeted, selective pausing. Our findings suggest that uncertainty-aware adaptive decoding holds considerable potential for enhancing both the reliability and efficiency of code generation with LLMs.
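
The entropy-gated decoding step described above can be sketched as follows. This is an illustration of the general pause-then-rerank idea, not the AdaDec implementation: the threshold is a fixed hypothetical value rather than one learned per model, and `lookahead_score` is a placeholder for whatever lookahead-based scoring a real system would use.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def choose_token(candidates, probs, threshold, lookahead_score, top_k=3):
    """Greedy choice when the model is confident, rerank when it is uncertain.

    candidates:      candidate tokens for the current step
    probs:           their probabilities, in the same order
    threshold:       entropy level above which decoding pauses (hypothetical)
    lookahead_score: callable scoring a candidate, e.g. by the quality of a
                     short continuation (placeholder for illustration)
    """
    ranked = sorted(zip(candidates, probs), key=lambda pair: pair[1], reverse=True)
    if shannon_entropy(probs) <= threshold:
        return ranked[0][0]  # low uncertainty: keep the greedy token
    # High uncertainty: pause and rerank the top-k candidates.
    shortlist = [token for token, _ in ranked[:top_k]]
    return max(shortlist, key=lookahead_score)


def prefer_b(token):
    """Toy lookahead scorer that favors the token "b"."""
    return 1.0 if token == "b" else 0.0


confident_pick = choose_token(["a", "b"], [0.99, 0.01], 0.5, prefer_b)
uncertain_pick = choose_token(["a", "b"], [0.55, 0.45], 0.5, prefer_b)
```

Because reranking only happens at steps whose entropy exceeds the threshold, the extra lookahead cost is paid selectively rather than at every token.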

使用大语言模型( LLMs) 的代码生成对于解码过程中的代号选择非常敏感, 特别是在对生成程序逻辑有重大影响的不确定性点上。常规解码方法, 如贪婪搜索和梁搜索等, 对所有代号都实行统一处理, 忽略代码生成所固有的独特的不确定性特征, 这可能导致产出不尽人意。在这项工作中, 我们进行经验分析, 表明很大一部分代号错误来自高不确定性步骤的不正确的代号排名, 其中, 候选人集中存在地面真象, 但没有被排在第一位。受此洞察的影响, 我们引入 AdaDec, 一种适应性解码框架, 以香农 entro entropy 量化的代号级不确定性为指导。 AdaDec 动态地学习了每个代号中特有的不确定性阈值, 并运用了当不确定性超过这些阈值时的超前置机制。对 HumanEval 和 MBPP 基准的评估显示, AdaDec 与贪婪解码解码、匹配或超前的调框架相比, 达到1 15.5 %1 改进了15% 。。 , 我们的定性搜索和降低的计算的可靠性, , 的计算法调低的计算法性 , , 和降低的计算性的计算性的计算性性性的计算性性。

Article 53

Title@2025-06-11 (3): Less is More: DocString Compression in Code Generation

Title: Less is More: DocString Compression in Code Generation

Weniger ist mehr: DocString-Kompression in der Code-Generierung

更少为 : DocString 压缩代码生成 2410.22793v3

Authors (10): Guang Yang, Yu Zhou, Wei Cheng, Xiangyu Zhang, Xiang Chen, Terry Yue Zhuo, Ke Liu, Xin Zhou, David Lo, Taolue Chen

The widespread use of Large Language Models (LLMs) in software engineering has intensified the need for improved model and resource efficiency. In particular, for neural code generation, LLMs are used to translate a function/method signature and DocString into executable code. DocStrings, which capture user requirements for the code and are used as the prompt for LLMs, often contain redundant information. Recent advancements in prompt compression have shown promising results in Natural Language Processing (NLP), but their applicability to code generation remains uncertain. Our empirical study shows that state-of-the-art prompt compression methods achieve only about a 10% reduction, as further reductions would cause significant performance degradation. In our study, we propose a novel compression method, ShortenDoc, dedicated to DocString compression for code generation. Our extensive experiments on six code generation datasets, five open-source LLMs (1B to 10B parameters), and one closed-source LLM, GPT-4o, confirm that ShortenDoc achieves 25-40% compression while preserving the quality of generated code, outperforming other baseline methods at similar compression levels. The benefit of this research is improved efficiency and reduced cost while maintaining the quality of the generated code, especially when calling third-party APIs, with the token processing cost reduced by 25-40%.

在软件工程中广泛使用大语言模型(LLMs)已经加强了改进模型和资源效率的必要性。特别是,在神经代码生成方面,LMs被用于翻译功能/方法签名和 DocString 来执行代码。DocStrings 采集代码用户的再需求,并用作LLMs的提示,常常包含冗余的信息。在快速压缩方面的最新进展显示,自然语言处理(NLPP)取得了有希望的结果,但其对代码生成的适用性仍然不确定。我们的实证研究表明,最新快速压缩方法仅能实现大约10%的削减,因为进一步的削减将导致显著的性能退化。在我们的研究中,我们提出了一种新的压缩方法(ShortenDoc),专门用于代码生成的DocString压缩。我们在6个代码生成数据集、5个开源LMs(1B至10B参数)和1个封闭源LM GPT-4o上的进展证实,在维护生成的代码质量的同时,实现25-40 %的压缩,在类似的压缩处理水平上比其他基线方法要高得多。我们提出的是提高成本,特别是使Asimpalal 。

Article 54

Title@2025-06-11 (3): Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey

Title: Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey

Verbesserung von Code LLMs mit Verstärkungslernen in der Codegenerierung: Eine Umfrage

增强法典制定中强化学习的加强守则LLMS 代码生成:调查 2412.20367v4

Authors (19): Junqiao Wang, Zeng Zhang, Yangfan He, Zihao Zhang, Yuyang Song, Tianyu Shi, Yuchen Li, Hengyuan Xu, Kunyu Wu, Xin Yi, Zhongwei Wan, Xinhang Yuan, Kuan Lu, Menghao Huo, Tang Jingqun, Guangwu Qian, Keqin Li, Qiuwu Chen, Lewei He

Reinforcement learning (RL) has emerged as a powerful paradigm for enhancing large language models (LLMs) in code generation and optimization. This survey systematically reviews RL-driven techniques across the code development lifecycle, from compiler-level optimizations and resource allocation strategies to end-to-end code synthesis frameworks. We first examine classical and modern RL algorithms – spanning policy gradients, actor-critic methods, human-feedback alignment, and preference-based optimization – and their adaptations to the unique challenges of code generation, such as sparse and delayed rewards. Next, we analyze key benchmarks, datasets, and evaluation metrics that drive progress in RL-augmented Code LLMs. Finally, we identify open problems, including the need for richer feedback sources, support for low-level and domain-specific languages, and methods to reduce computational overhead. By consolidating current insights and outlining future directions, this work aims to guide researchers and practitioners in leveraging RL to produce more robust, efficient, and human-aligned code generation systems.

强化学习(RL)已成为在代码生成和优化方面加强大型语言模型(LLM)的强大范例,这项调查系统地审查了代码开发生命周期中由RL驱动的技术,从汇编者一级的优化和资源分配战略到端到端至端代码合成框架。我们首先审查传统和现代RL算法 – – 涵盖政策梯度、行为体-批评方法、人肉背对齐和基于优惠的优化 – – 以及这些算法适应代码生成的独特挑战,如微弱和延迟的奖励。接着,我们分析了推动RL强化代码LM取得进展的关键基准、数据集和评价指标。最后,我们查明了一些尚未解决的问题,包括需要更丰富的反馈来源、支持低层次和特定领域语言以及减少计算间接费用的方法。通过整合目前的见解和概述未来方向,这项工作旨在指导研究人员和从业人员利用RL生成更健全、高效和与人接轨的代码生成系统。

Article 55

Title@2025-06-11 (3): Mono: Is Your “Clean” Vulnerability Dataset Really Solvable? Exposing and Trapping Undecidable Patches and Beyond

Title: Mono: Is Your “Clean” Vulnerability Dataset Really Solvable? Exposing and Trapping Undecidable Patches and Beyond

Mono: Ist Ihr “sauberer” Sicherheitsdatensatz wirklich lösbar? Unentschlossene Patches und darüber hinaus enthüllen und verschleppen

Mono: 您的“ 干净” 脆弱数据集是否真的可以解决? 曝光和跟踪不可量化的补丁及其他 2506.03651v2

Authors (7): Zeyu Gao, Junlin Zhou, Bolun Zhang, Yi He, Chao Zhang, Yuxin Cui, Hao Wang

The quantity and quality of vulnerability datasets are essential for developing deep learning solutions to vulnerability-related tasks. Due to the limited availability of vulnerabilities, a common approach to building such datasets is analyzing security patches in source code. However, existing security patches often suffer from inaccurate labels, insufficient contextual information, and undecidable patches that fail to clearly represent the root causes of vulnerabilities or their fixes. These issues introduce noise into the dataset, which can mislead detection models and undermine their effectiveness. To address these issues, we present mono, a novel LLM-powered framework that simulates human experts’ reasoning process to construct reliable vulnerability datasets. mono introduces three key components to improve security patch datasets: (i) semantic-aware patch classification for precise vulnerability labeling, (ii) iterative contextual analysis for comprehensive code understanding, and (iii) systematic root cause analysis to identify and filter undecidable patches. Our comprehensive evaluation on the MegaVul benchmark demonstrates that mono can correct 31.0% of labeling errors and recover 89% of inter-procedural vulnerabilities, and reveals that 16.7% of CVEs contain undecidable patches. Furthermore, mono’s enriched context representation improves existing models’ vulnerability detection accuracy by 15%. We open-source the framework mono and the dataset MonoLens at https://github.com/vul337/mono.

脆弱性数据集的数量和质量对于制定与脆弱性有关的任务的深层次学习解决方案至关重要。由于脆弱性的可用性有限,建立这类数据集的共同方法是分析源代码中的安全补丁。然而,现有的安全补丁往往受到标签不准确、背景信息不足和不可判定补丁的困扰,后者无法清楚地代表脆弱性的根源或其修复。这些问题在数据集中引入噪音,这可能会误导检测模型,损害其有效性。为了解决这些问题,我们提出mono,一个创新的LLM驱动框架,模拟人类专家的推理过程以构建可靠的脆弱性数据集。mono引入了三个关键组成部分来改进安全补丁数据集:(一) 用于精确脆弱性标签的语义感知补丁分类,(二) 用于全面理解代码的迭代背景分析,以及(三) 用于识别和过滤不可判定补丁的系统化根因分析。我们对MegaVul基准的全面评价表明,mono能够纠正31.0%的标签错误,恢复89%的跨过程脆弱性,并揭示16.7%的CVE包含不可判定补丁。此外,mono丰富的上下文表示使现有模型的脆弱性检测准确率提高了15%。我们在 https://github.com/vul337/mono 开源了框架mono和数据集MonoLens。

Article 56

Title@2025-06-11 (3): Assessing the Impact of Refactoring Energy-Inefficient Code Patterns on Software Sustainability: An Industry Case Study

Title: Assessing the Impact of Refactoring Energy-Inefficient Code Patterns on Software Sustainability: An Industry Case Study

Bewertung der Auswirkungen von Refactoring Energy-Inefficient Code Patterns auf Software Sustainability: Eine Fallstudie für die Industrie

评估可再生能源低能效代码模式对软件可持续性的影响:工业案例研究 2506.09370v1

Authors (6): Rohit Mehra, Priyavanshi Pathania, Vibhu Saujanya Sharma, Vikrant Kaulgud, Sanjay Podder, Adam P. Burden

Advances in technologies like artificial intelligence and metaverse have led to a proliferation of software systems in business and everyday life. With this widespread penetration, the carbon emissions of software are rapidly growing as well, thereby negatively impacting the long-term sustainability of our environment. Hence, optimizing software from a sustainability standpoint becomes more crucial than ever. We believe that the adoption of automated tools that can identify energy-inefficient patterns in the code and guide appropriate refactoring can significantly assist in this optimization. In this extended abstract, we present an industry case study that evaluates the sustainability impact of refactoring energy-inefficient code patterns identified by automated software sustainability assessment tools for a large application. Preliminary results highlight a positive impact on the application’s sustainability post-refactoring, leading to a 29% decrease in per-user per-month energy consumption.

人工智能和逆向技术的进步导致软件系统在商业和日常生活中的扩散。随着这种广泛的渗透,软件的碳排放也在迅速增长,从而对我们环境的长期可持续性产生消极影响。因此,从可持续性角度优化软件比以往更加重要。我们认为,采用自动化工具,在代码中识别节能模式并指导适当的再计量,可以极大地帮助实现这种优化。在这个扩展的抽象文件中,我们介绍了一个行业案例研究,该案例研究评估了自动化软件可持续性评估工具为大规模应用确定的重新设定节能代码模式对可持续性的影响。初步结果突出表明了对应用的可持续性影响,导致每个用户每月能源消耗减少29%。

Article 57

Title@2025-06-11 (3): Boosting Rust Unit Test Coverage through Hybrid Program Analysis and Large Language Models

Title: Boosting Rust Unit Test Coverage through Hybrid Program Analysis and Large Language Models

Steigerung der Testabdeckung von Rust Units durch Hybrid-Programmanalyse und große Sprachmodelle

通过混合方案分析和大语言模式,促进分散单位测试覆盖率 2506.09002v2

Authors (7): Bei Chu, Yang Feng, Kui Liu, Hange Shi, Zifan Nan, Zhaoqiang Guo, Baowen Xu

Unit testing is essential for ensuring software reliability and correctness. Classic Search-Based Software Testing (SBST) methods and concolic execution-based approaches for generating unit tests often fail to achieve high coverage due to difficulties in handling complex program units, such as branching conditions and external dependencies. Recent work has increasingly utilized large language models (LLMs) to generate test cases, improving the quality of test generation by providing better context and correcting errors in the model’s output. However, these methods rely on fixed prompts, resulting in relatively low compilation success rates and coverage. This paper presents PALM, an approach that leverages LLMs to enhance the generation of high-coverage unit tests. PALM performs program analysis to identify branching conditions within functions, which are then combined into path constraints. These constraints and relevant contextual information are used to construct prompts that guide the LLMs in generating unit tests. We implement the approach and evaluate it on 10 open-source Rust crates. Experimental results show that within just two or three hours, PALM can significantly improve test coverage compared to classic methods, with increases in overall project coverage exceeding 50% in some instances and its generated tests achieving an average coverage of 75.77%, comparable to human effort (71.30%), highlighting the potential of LLMs in automated test generation. We submitted 91 PALM-generated unit tests targeting new code. Of these submissions, 80 were accepted, 5 were rejected, and 6 remain pending review. The results demonstrate the effectiveness of integrating program analysis with AI and open new avenues for future research in automated software testing.

单位测试对于确保软件的可靠性和正确性至关重要。经典搜索软件测试(SBST)方法和基于常规执行的单位测试方法往往无法达到高覆盖率,因为在处理复杂的程序单位方面遇到困难,例如分支条件和外部依赖性。最近的工作越来越多地使用大型语言模型(LLMs)来生成测试案例,通过提供更好的背景和纠正模型产出中的错误来提高测试质量。然而,这些方法依赖于固定的提示,导致汇编成功率和覆盖面相对较低。本文展示了PALM,这是利用大型语言模型(LLLMs)来提高高覆盖单位测试的生成率的方法。PALM进行了程序分析,以查明功能内的分支条件,然后将其结合到路径限制。这些限制和相关背景信息被用来构建指导 LLMs生成测试的提示。我们用10个开放源的Rust箱来实施并评估该方法。实验结果表明,PALM在仅仅两个或三个小时内,PALM可以大大改进与经典方法相比的测试范围,用大型语言模型(LMs)来提高高覆盖率。 PLM 在他的运行中提高了80个测试范围,我们提出了对LM的标准化测试。
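The step from branching conditions to path constraints to prompts can be illustrated with a minimal Python sketch. It is not PALM's implementation: it assumes the branches are sequential and independent, and the helper names `path_constraints` and `build_prompt`, the example conditions, and the prompt wording are all hypothetical.

```python
from itertools import product

def path_constraints(conditions):
    """Enumerate every true/false combination of sequential branch conditions.

    Assumption: branches are independent and non-nested, so each path is
    just one choice of outcome per condition.
    """
    paths = []
    for outcome in product([True, False], repeat=len(conditions)):
        terms = [c if taken else f"!({c})" for c, taken in zip(conditions, outcome)]
        paths.append(" && ".join(terms))
    return paths

def build_prompt(fn_signature, constraint, context):
    """Assemble a hypothetical prompt asking for a test that follows one path."""
    return (
        f"Write a Rust unit test for `{fn_signature}`.\n"
        f"The test input must satisfy: {constraint}\n"
        f"Relevant context:\n{context}\n"
    )

# Hypothetical branch conditions extracted from one function.
conditions = ["x > 0", "y.is_empty()"]
paths = path_constraints(conditions)
prompts = [
    build_prompt("fn f(x: i32, y: &str) -> i32", p, "// no external dependencies")
    for p in paths
]
```

Each prompt targets one path, which is one way a per-path prompt could differ from a single fixed prompt per function.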

Article 58

Title@2025-06-11 (3): Detecting State Manipulation Vulnerabilities in Smart Contracts Using LLM and Static Analysis

Title: Detecting State Manipulation Vulnerabilities in Smart Contracts Using LLM and Static Analysis

Ermittlung staatlicher Manipulationslücken in Smart Contracts mittels LLM und statischer Analyse

检测使用LLM和静态分析的智能合同中的国家操纵脆弱性 2506.08561v2

Authors (7): Hao Wu, Haijun Wang, Shangwang Li, Yin Wu, Ming Fan, Yitao Zhao, Ting Liu

An increasing number of DeFi protocols are gaining popularity, facilitating transactions among multiple anonymous users. State manipulation is a notorious attack on DeFi smart contracts, with the price variable being the most commonly exploited state variable: attackers manipulate token prices to gain illicit profits. In this paper, we propose PriceSleuth, a novel method that leverages a Large Language Model (LLM) and static analysis to detect Price Manipulation (PM) attacks proactively. First, PriceSleuth identifies the core logic functions related to price calculation in DeFi contracts and guides the LLM to locate the price calculation code statements. Second, PriceSleuth performs backward dependency analysis of price variables, instructing the LLM in detecting potential price manipulation. Finally, PriceSleuth utilizes propagation analysis of price variables to assist the LLM in detecting whether these variables are maliciously exploited. We present preliminary experimental results to substantiate the effectiveness of PriceSleuth and outline future research directions for it.

越来越多的DeFi协议越来越受欢迎,为多个匿名用户之间的交易提供便利。国家操纵是DeFi智能合同中臭名昭著的袭击之一,价格变量是最经常被利用的州变量袭击者操纵象征性价格以获取非法利润。在本文中,我们提出PlassSleuth,这是利用大语言模型(LLM)和静态分析来主动探测价格操纵袭击的一种新颖方法。PlassSleuth首先确定了与DeFi合同中价格计算相关的核心逻辑功能。然后它指导LLM查找价格计算代码报表。第二,PriceSleuth对价格变量进行了后向依赖性分析,指示LLM发现潜在的价格操纵。最后,PlasSleuth利用价格变量的传播分析协助LLM发现这些变量是否被恶意利用。我们提出了初步实验结果,以证实PressSleuth公司的有效性。我们概述了PlasSleuth公司的未来研究方向。
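The backward dependency analysis step can be pictured with a small Python sketch over a definition map. This is an illustration under stated assumptions, not PriceSleuth's analysis: the variable names, the `defs` map, and the `user_controlled` set are hypothetical, and a real analysis would work on contract code rather than a hand-written dictionary.

```python
def backward_dependencies(target, defs):
    """Return every variable that `target` transitively depends on.

    `defs` maps a variable to the variables used to compute it
    (a hypothetical, hand-written def-use summary).
    """
    seen = set()
    stack = [target]
    while stack:
        var = stack.pop()
        for dep in defs.get(var, ()):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen

# Hypothetical dependency summary of a price calculation.
defs = {
    "price": ["reserve_a", "reserve_b"],
    "reserve_a": ["balance_of_pool"],
    "reserve_b": [],
    "fee": ["fee_rate"],
}
# Assumption: values an external caller can influence within one transaction.
user_controlled = {"balance_of_pool"}

deps = backward_dependencies("price", defs)
suspicious = deps & user_controlled
```

A price variable whose backward slice reaches a caller-influenced value is a candidate for closer inspection, which in the paper's pipeline is where the LLM is consulted.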

Article 59

Title@2025-06-11 (3): On The Impact of Merge Request Deviations on Code Review Practices

Title: On The Impact of Merge Request Deviations on Code Review Practices

Über die Auswirkungen von Merge Request Abweichungen auf Code-Review-Praktiken

合并请求对守则审查惯例的影响 2506.08860v2

Authors (3): Samah Kansab, Francis Bordeleau, Ali Tizghadam

Code review is a key practice in software engineering, ensuring quality and collaboration. However, industrial Merge Request (MR) workflows often deviate from standardized review processes, with many MRs serving non-review purposes (e.g., drafts, rebases, or dependency updates). We term these cases deviations and hypothesize that ignoring them biases analytics and undermines ML models for review analysis. We identify seven deviation categories, occurring in 37.02% of MRs, and propose a few-shot learning detection method (91% accuracy). By excluding deviations, ML models predicting review completion time improve performance in 53.33% of cases (up to 2.25x) and exhibit significant shifts in feature importance (47% overall, 60% top-k). Our contributions include: (1) a taxonomy of MR deviations, (2) an AI-driven detection approach, and (3) empirical evidence of their impact on ML-based review analytics. This work aids practitioners in optimizing review efforts and ensuring reliable insights.

守则审查是软件工程的关键做法,确保质量和协作。然而,工业合并请求(MR)工作流程往往偏离标准化审查程序,许多管家服务于非审查目的(如草案、基准重置或依赖性更新)。我们将这些偏离和假设情况称为忽略这些偏离和假设情况,忽视了它们偏向分析,破坏了用于审查分析的ML模式。我们确定了7个偏差类别,发生在37.02%的管家中,并提出了一个微小的学习检测方法(91%的准确性)。通过排除偏差,ML模型预测审评完成时间在53.33%的案例中提高了绩效(高达2.25x),并显示出显著的重要变化(总体而言,占47%,最高*k** )。我们的贡献包括:(1) 对MR偏离的分类,(2) AI驱动的检测方法,以及(3) 它们对基于ML的审查分析的影响的经验证据。这项工作有助于从业人员优化审查工作并确保可靠的洞察力。

Article 60

Title@2025-06-10 (2): UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench

Title: UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench

UTBoost: Strenge Bewertung von Coding Agents auf SWE-Bench

UTBost: 严格评价SWE-Bench上的编码剂 2506.09289v1

Authors (4): Boxi Yu, Yuxuan Zhu, Pinjia He, Daniel Kang

The advent of Large Language Models (LLMs) has spurred the development of coding agents for real-world code generation. As a widely used benchmark for evaluating the code generation capabilities of these agents, SWE-Bench uses real-world problems based on GitHub issues and their corresponding pull requests. However, the manually written test cases included in these pull requests are often insufficient, allowing generated patches to pass the tests without resolving the underlying issue. To address this challenge, we introduce UTGenerator, an LLM-driven test case generator that automatically analyzes codebases and dependencies to generate test cases for real-world Python projects. Building on UTGenerator, we propose UTBoost, a comprehensive framework for test case augmentation. In our evaluation, we identified 36 task instances with insufficient test cases and uncovered 345 erroneous patches incorrectly labeled as passed in the original SWE-Bench. These corrections, impacting 40.9% of SWE-Bench Lite and 24.4% of SWE-Bench Verified leaderboard entries, yield 18 and 11 ranking changes, respectively.

大语言模型(LLMS)的出现刺激了真实世界代码生成的编码代理器的发展。 SWE-Bench作为评估这些代理商代码生成能力的一个广泛使用的基准,使用基于 GitHub 问题及其相应的拉动请求的现实世界问题。然而,这些拉动请求中包含的人工书面测试案例往往不够充分,使得生成的补丁能够在不解决根本问题的情况下通过测试。为了应对这一挑战,我们引入了UTGenerator,这是一个由LLM驱动的测试案例生成器,自动分析代码库和依赖性,以生成真实世界 Python 项目的测试案例。在UTGenerator的基础上,我们提出了UTBoost,这是测试案件扩增综合框架。在我们的评估中,我们发现了36个测试案例不充分,发现了345个错误的补丁,错误的标签在最初的SWE-Bench Lite和SWE-Bench校准头板条目中分别影响到40.9%和24.4%的40。
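The core check, a patch that passes the original tests but fails the augmented ones, can be sketched in a few lines of Python. This is a toy illustration, not UTBoost's code: the patches here are plain functions and the tests are hypothetical predicates, whereas the real setting runs project test suites against repository patches.

```python
def passes(patch, tests):
    """True if the patch satisfies every test predicate."""
    return all(test(patch) for test in tests)

def find_false_positives(patches, original_tests, augmented_tests):
    """Names of patches accepted by the original tests but rejected after augmentation."""
    flagged = []
    for name, patch in patches.items():
        if passes(patch, original_tests) and not passes(patch, augmented_tests):
            flagged.append(name)
    return flagged

# Hypothetical task: implement absolute value.
patches = {
    "correct_patch": lambda x: x if x >= 0 else -x,
    "identity_patch": lambda x: x,  # wrong for negative inputs
}
original_tests = [lambda f: f(3) == 3]                      # insufficient
augmented_tests = original_tests + [lambda f: f(-3) == 3]   # added case

flagged = find_false_positives(patches, original_tests, augmented_tests)
```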

Article 61

Title@2025-06-10 (2): RocketPPA: Code-Level Power, Performance, and Area Prediction via LLM and Mixture of Experts

Title: RocketPPA: Code-Level Power, Performance, and Area Prediction via LLM and Mixture of Experts

RocketPPA: Code-Level Power, Performance und Area Prediction über LLM und Mixture of Experts

火箭式PPPA:通过LLM和专家混合进行代码级动力、性能和地区预测 2503.21971v3

Authors (3): Armin Abdollahi, Mehdi Kamal, Massoud Pedram

This paper presents RocketPPA, a novel ultra-fast power, performance (delay), and area (PPA) estimator operating directly at the code-level abstraction using HDL code as input. The key technical innovation is its LLM-based regression model, which uniquely integrates a large language model (LLM) with a mixture-of-experts (MoE) architecture composed of multilayer perceptrons (MLPs). The LLM interprets the input HDL code and then utilizes its final hidden-layer representations to predict PPA metrics. Low-rank adaptation (LoRA) is used for parameter-efficient fine-tuning to enable efficient LLM training. Furthermore, the work includes the development of an LLM-based HDL code repair framework to generate a large and synthesizable training dataset. Experimental results on the VerilogEval benchmark demonstrate that RocketPPA achieves significant improvements in the accuracy of PPA estimation compared to previous state-of-the-art methods like Llama3-MetRex-8B. Specifically, at a 10% relative error threshold, RocketPPA enhances the pass rate for area prediction by 13.6%, delay by 9.4%, and power by 14.7%. At a 20% threshold, the improvements are 9.6% for area, 10.8% for delay, and 18.5% for power. Moreover, RocketPPA achieves a speedup of over 20x compared to MetRex and 30x over MasterRTL in processing the test set. The impact of RocketPPA is the potential to substantially accelerate the hardware design process by providing accurate PPA estimations early in the design cycle, thus avoiding the overhead of manual feature engineering and time-consuming synthesis flows.

本文展示了RocketPPA, 这是一种新型超快功率、性能( 延迟) 和地区( PPA ) 估计器。关键的技术创新是其基于LLM 的回归模型, 它将大型语言模型( LLM) 与一个由多层感应器组成的专家混合结构( MoE) 特别结合。 LLM 解释输入的 HDL 代码, 并利用其最后的隐藏显示器来预测 PPPA 指标。低级别调整( LoRA) 用于节能微调微调, 以高效的LDLLLM) 。此外, 工作包括开发一个基于LLM HDL码的回归模型( LLM) 回归模型, 以大型和可合成的培训数据集。 VerilogEval 基准的实验结果显示, 与Llama3- MetRex-8B 等以往的状态方法相比, PPPA 的准确度显著提高。低级别调整( LoRA ) , 用于10% 相对误差值调整了 LLPPA 20.
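The pass-rate figures at 10% and 20% relative error thresholds can be read through the following Python sketch. It assumes the pass rate is the fraction of designs whose prediction lies within the given relative error of the ground truth; that definition and the sample values are assumptions for illustration, not data from the paper.

```python
def pass_rate(predicted, actual, threshold):
    """Fraction of predictions whose relative error is at most `threshold`.

    Assumption: relative error is |predicted - actual| / |actual|.
    """
    hits = sum(
        1 for p, a in zip(predicted, actual)
        if abs(p - a) / abs(a) <= threshold
    )
    return hits / len(actual)

# Hypothetical area values for four designs (arbitrary units).
actual = [100.0, 200.0, 50.0, 80.0]
predicted = [105.0, 230.0, 49.0, 100.0]

rate_10 = pass_rate(predicted, actual, 0.10)
rate_20 = pass_rate(predicted, actual, 0.20)
```

Loosening the threshold can only keep or raise the pass rate, which is why results are reported separately for each threshold.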

Article 62

Title@2025-06-10 (2): ClassInvGen: Class Invariant Synthesis using Large Language Models

Title: ClassInvGen: Class Invariant Synthesis using Large Language Models

ClassInvGen: Class Invariant Synthesis mit großen Sprachmodellen

类 InvGen: 使用大语言模型的分类变量合成 2502.18917v2

Authors (8): Chuyue Sun, Viraj Agashe, Saikat Chakraborty, Jubi Taneja, Clark Barrett, David Dill, Xiaokang Qiu, Shuvendu K. Lahiri

Formal program specifications in the form of preconditions, postconditions, and class invariants have several benefits for the construction and maintenance of programs. They not only aid in program understanding due to their unambiguous semantics but can also be enforced dynamically (or even statically when the language supports a formal verifier). However, synthesizing high-quality specifications in an underlying programming language is limited by the expressivity of the specifications or the need to express them in a declarative manner. Prior work has demonstrated the potential of large language models (LLMs) for synthesizing high-quality method pre/postconditions for Python and Java, but does not consider class invariants. In this work, we describe ClassInvGen, a method for co-generating executable class invariants and test inputs to produce high-quality class invariants for a mainstream language such as C++, leveraging LLMs’ ability to synthesize pure functions. We show that ClassInvGen outperforms a pure LLM-based technique to generate specifications (from code) as well as prior data-driven invariant inference techniques such as Daikon. We contribute a benchmark of standard C++ data structures along with a harness that can help measure both the correctness and completeness of generated specifications using tests and mutants. We also demonstrate its applicability to real-world code by performing a case study on several classes within a widely used and high-integrity C++ codebase.

以先决条件、后期条件和类不变量等形式的正式程序规格,对程序的构建和维护有若干好处。它们不仅有助于方案理解,因为它们的语义明确,而且可以动态地执行(或者当语言支持正式验证器时,甚至静态地执行)。然而,将高质量规格合成成一种基本编程语言,由于规格的清晰度或需要以宣示的方式表达这些规格,因而受到限制。先前的工作表明,大型语言模型(LLMs)有可能为Python和Java合成高质量的方法前置/后置条件,但并未考虑类不变量。在这项工作中,我们描述了ClassInvGen,这是一种共同生成可执行类不变量和测试输入的方法,以便为C++等主流语言生成高质量的类不变量,并利用LLMs合成纯函数的能力。我们表明,ClassInvGen在生成规格(从代码)方面优于纯LLM技术以及Daikon等先前的数据驱动不变量推断技术。我们贡献了一个标准C++数据结构基准以及一个测试工具,可借助测试和变异体衡量所生成规格的正确性和完整性。我们还通过对一个广泛使用的高完整性C++代码库中的若干类进行案例研究,展示了其对现实世界代码的适用性。
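An executable class invariant, exercised by test inputs and checked against a mutant, can be sketched as follows. The paper targets C++; this Python version only illustrates the idea, and the `BoundedStack` class, its invariant, and the mutant are hypothetical examples rather than output of ClassInvGen.

```python
class BoundedStack:
    """Hypothetical class under specification."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []

    def push(self, value):
        # Silently ignore pushes beyond capacity.
        if len(self.items) < self.capacity:
            self.items.append(value)

    def invariant(self):
        # Executable class invariant: a pure function of the object's state.
        return 0 <= len(self.items) <= self.capacity


class MutantStack(BoundedStack):
    def push(self, value):
        # Mutant: the capacity check has been removed.
        self.items.append(value)


def invariant_holds(stack_cls, inputs, capacity=2):
    """Drive the class with test inputs, checking the invariant after each call."""
    stack = stack_cls(capacity)
    for value in inputs:
        stack.push(value)
        if not stack.invariant():
            return False
    return True
```

An invariant that holds on the original class for all test inputs but is violated by a mutant is evidence that it is both correct and strong enough to catch that fault.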

Article 63

Title@2025-06-10 (2): Formal Methods Meets Readability: Auto-Documenting JML Java Code

Title: Formal Methods Meets Readability: Auto-Documenting JML Java Code

Formale Methoden erfüllen Lesbarkeit: Auto-Dokumentierung von JML Java Code

正式方法符合可读性:自动文档 JML Java 代码 2506.09230v1

Authors (3): Juan Carlos Recio Abad, Ruben Saborido, Francisco Chicano

This paper investigates whether formal specifications using Java Modeling Language (JML) can enhance the quality of Large Language Model (LLM)-generated Javadocs. While LLMs excel at producing documentation from code alone, we hypothesize that incorporating formally verified invariants yields more complete and accurate results. We present a systematic comparison of documentation generated from JML-annotated and non-annotated Java classes, evaluating quality through both automated metrics and expert analysis. Our findings demonstrate that JML significantly improves class-level documentation completeness, with more moderate gains at the method level. Formal specifications prove particularly effective in capturing complex class invariants and design contracts that are frequently overlooked in code-only documentation. A threshold effect emerges, where the benefits of JML become more pronounced for classes with richer sets of invariants. While JML enhances specification coverage, its impact on core descriptive quality is limited, suggesting that formal specifications primarily ensure comprehensive coverage rather than fundamentally altering implementation descriptions. These results offer actionable insights for software teams adopting formal methods in documentation workflows, highlighting scenarios where JML provides clear advantages. The study contributes to AI-assisted software documentation research by demonstrating how formal methods and LLMs can synergistically improve documentation quality.

本文调查使用爪哇模拟语言(Java Modeling Lings(JML)的正式规格是否能够提高大语言模型(LLM)产生的爪哇软件的质量。虽然LLMS在单凭代码编制文件方面非常出色,但我们假设将正式核实的异差纳入其中会产生更加完整和准确的结果。我们系统地比较了JML附加说明和非附加说明的爪哇类中产生的文件,通过自动化指标和专家分析评价质量。我们的研究结果表明,JML大大改进了分类文件的完整性,在方法层面取得了较微小的收益。正式规格证明在捕捉复杂类别变差和设计合同方面特别有效,而这种分类往往在代码化文件中被忽视。出现了一种临界效应,即JMLML的好处对较丰富的异差类更明显。虽然JMLML对核心说明质量的影响有限,但表明正式规格主要是确保全面覆盖,而不是从根本上改变执行说明。这些结果为在文件工作流程中采用正式方法的软件团队提供了可操作的见解,突出了JMLML提供明显优势的情景。这项研究有助于AI辅助性文件研究。

Article 64

Title@2025-06-10 (2): Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems

Title: Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems

Können LLMs zuverlässige Testfallgeneratoren generieren? Eine Studie zu Wettbewerbs-Level-Programmierungsproblemen

LLM女士能产生可靠的试验案例发电机吗? 2506.06821v2

Authors (21): Yuhan Cao, Zian Chen, Kun Quan, Ziliang Zhang, Yu Wang, Xiaoning Dong, Yeqi Feng, Guanzhong He, Jingcheng Huang, Jianhao Li, Yixuan Tan, Jiafu Tang, Yilin Tang, Junlei Wu, Qianyu Xiao, Can Zheng, Shouchen Zhou, Yuxiang Zhu, Yiming Huang, Tian Xie, Tianxing He

Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation and can tackle complex tasks during inference. However, the extent to which LLMs can be utilized for code checking or debugging through test case generation remains largely unexplored. We investigate this problem from the perspective of competition-level programming (CP) programs and propose TCGBench, a Benchmark for (LLM generation of) Test Case Generators. This benchmark comprises two tasks, aimed at studying the capabilities of LLMs in (1) generating valid test case generators for a given CP problem, and further (2) generating targeted test case generators that expose bugs in human-written code. Experimental results indicate that while state-of-the-art LLMs can generate valid test case generators in most cases, most LLMs struggle to generate targeted test cases that effectively reveal flaws in human code. In particular, even advanced reasoning models (e.g., o3-mini) fall significantly short of human performance in the task of generating targeted generators. Furthermore, we construct a high-quality, manually curated dataset of instructions for generating targeted generators. Analysis demonstrates that the performance of LLMs can be enhanced with the aid of this dataset, by both prompting and fine-tuning.

大型语言模型(LLMS)在代码生成方面表现出了非凡的能力,能够在推断过程中处理复杂的任务,然而,LLMS在通过测试案例生成过程中可用于代码检查或调试的功能仍然在很大程度上没有得到探索。我们从竞争级别的编程(CP)方案的角度来调查这一问题,并提出TCGBench,即(LLM生成)测试案例生成器的基准。这一基准包括两项任务,目的是研究LLMS在(1)为特定CP问题生成有效测试案例生成器的能力,以及进一步(2)生成有针对性的测试案例生成器,暴露人造代码中的错误。实验结果表明,尽管最先进的LMS能够产生有效的测试案例生成器,但大多数LLMS都在努力生成能够有效揭示人类代码缺陷的定向测试案例。特别是,甚至先进的推理模型(如o3-mini)在生成目标型发电机的任务中也远远低于人类的性能。此外,我们为生成目标型发电机设计了一个高质量的手工整理数据集。分析结果表明,LMS的性能通过这一数据组合的迅速得到改进。
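The difference between a valid generator and a targeted one can be shown with a small Python sketch. It is a toy example, not part of TCGBench: the problem (maximum of a list), the buggy solution, and both generators are hypothetical.

```python
import random

def reference_max(xs):
    """Trusted reference solution."""
    return max(xs)

def buggy_max(xs):
    """Hypothetical human-written solution with a bug."""
    best = 0  # bug: wrong initial value when every element is negative
    for x in xs:
        if x > best:
            best = x
    return best

def random_generator(rng):
    """Valid but untargeted: only positive values, so the bug never triggers."""
    return [rng.randint(1, 100) for _ in range(rng.randint(1, 5))]

def targeted_generator(rng):
    """Targeted: all-negative inputs hit the faulty initial value."""
    return [rng.randint(-100, -1) for _ in range(rng.randint(1, 5))]

def exposes_bug(generator, trials=50, seed=0):
    """Differential check: does any generated case make the two solutions disagree?"""
    rng = random.Random(seed)
    cases = (generator(rng) for _ in range(trials))
    return any(reference_max(c) != buggy_max(c) for c in cases)
```

Both generators produce valid inputs for the problem; only the targeted one encodes knowledge of where the buggy solution breaks.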

Article 65

Title: How Do Users Revise Architectural Related Questions on Stack Overflow: An Empirical Study

Wie revidieren Benutzer architektonische verwandte Fragen über Stack Overflow: Eine empirische Studie

用户如何修改关于堆叠溢溢流的建筑相关问题:经验研究 2406.18959v2

Authors (4): Musengamana Jean de Dieu, Peng Liang, Mojtaba Shahin, Arif Ali Khan

Technical Questions and Answers (Q&A) sites, such as Stack Overflow (SO), accumulate a significant variety of information related to software development in posts from users. To ensure the quality of this information, SO encourages its users to review posts through various mechanisms (e.g., question and answer revision processes). Although Architecture Related Posts (ARPs) communicate architectural information that has a system-wide impact on development, little is known about how SO users revise information shared in ARPs. To fill this gap, we conducted an empirical study to understand how users revise Architecture Related Questions (ARQs) on SO. We manually checked 13,205 ARPs and finally identified 4,114 ARQs that contain revision information. Our main findings are that: (1) The revision of ARQs is not prevalent in SO, and an ARQ revision starts soon after the question is posted (i.e., from 1 minute onward). Moreover, the revision of an ARQ occurs both before and after the question receives its first answer/architecture solution, with most revisions beginning before the first architecture solution is posted. Both Question Creators (QCs) and non-QCs actively participate in ARQ revisions, with most revisions being made by QCs. (2) A variety of information (14 categories) is missing from ARQs when they are posted and is provided afterwards, among which design context, component dependency, and architecture concern are dominant. (3) Clarifying the understanding of the architecture under design and improving the readability of the architecture problem are the two major purposes of the further provided information in ARQs. (4) The further provided information in ARQs has several impacts on the quality of answers/architecture solutions, including making architecture solutions useful, informative, and relevant, among others.

技术问答网站,如Stack Overflow (SO), 收集了大量与用户职位软件开发有关的信息。为了确保这些信息的质量,SO鼓励用户通过各种机制(例如问答修订程序)审查职位。虽然建筑相关职位(ARPs)传播了对全系统发展有影响的建筑信息,但对SO用户如何修改ARP共享的信息知之甚少。为了填补这一空白,我们进行了实证研究,以了解用户如何修改SO的架构相关问题(ARQs)。我们手工核对了13 205 ARPs,并最终确定了4 114个包含修订信息的ARQs。我们的主要结论是:(1) ARQs的修订在SO(例如问答)中并不普遍,而ARQs的修订在这一问题发布后很快(即从1分钟后开始)开始。此外,ARQQ在第一个架构解决方案发布之前,大多数修订工作开始。 Qs Reguills(QC)在读了有关设计中的大多数信息,在内部结构中做了两次修订。

Article 66

Title@2025-06-10 (2): Who is using AI to code? Global diffusion and impact of generative AI

Title: Who is using AI to code? Global diffusion and impact of generative AI

Wer nutzt KI zum Kodieren? Globale Diffusion und Wirkung generativer KI

谁在使用人工智能编码? 2506.08945v1

Authors (4): Simone Daniotti, Johannes Wachs, Xiangnan Feng, Frank Neffke

Generative coding tools promise big productivity gains, but uneven uptake could widen skill and income gaps. We train a neural classifier to spot AI-generated Python functions in 80 million GitHub commits (2018-2024) by 200,000 developers and track how fast, and where, these tools take hold. By December 2024, AI wrote an estimated 30.1% of Python functions from U.S. contributors, versus 24.3% in Germany, 23.2% in France, 21.6% in India, 15.4% in Russia and 11.7% in China. Newer GitHub users use AI more than veterans, while male and female developers adopt at similar rates. Within-developer fixed-effects models show that moving to 30% AI use raises quarterly commits by 2.4%. Coupling this effect with occupational task and wage data puts the annual value of AI-assisted coding in the United States at $9.6-$14.4 billion, rising to $64-$96 billion if we assume higher estimates of productivity effects reported by randomized control trials. Moreover, generative AI prompts learning and innovation, leading to increases in the number of new libraries and library combinations that programmers use. In short, AI usage is already widespread but highly uneven, and the intensity of use, not only access, drives measurable gains in output and exploration.

产生编码的工具有望带来巨大的生产率增益,但接受程度不均可能会扩大技能和收入差距。我们训练神经分类器,以在8000万GitHub(GitHub)承诺(2018-2024年)中发现AI产生的Python功能,202020年,202020年开发者承诺(2018-2024年),并跟踪这些工具的快速和在哪里起作用。到2024年12月,AI从美国捐款者那里写了大约30.1%的Python功能,德国为24.3%,法国为23.2%,印度为21.6%,俄罗斯为15.4%,中国为11.7%。新GitHub用户比退伍军人更多地使用AI,而男性和女性开发者则以类似的速度采用。开发者内部的固定效应模型显示,到30%的AI的使用每季度增加2.4%。结合职业任务和工资数据,AI协助的编码每年价值为9.6亿至14.4亿美元,如果我们假设随机控制试验所报告的生产力效应的估计数更高,则增加到64-9亿美元。此外,Git Hital AI促进短期学习和创新,导致新图书馆和广泛使用率增加。

Article 67

Title@2025-06-10 (2): The Impact of Large Language Models on Open-source Innovation: Evidence from GitHub Copilot

Title: The Impact of Large Language Models on Open-source Innovation: Evidence from GitHub Copilot

Die Auswirkungen großer Sprachmodelle auf Open-Source-Innovationen: Belege von GitHub Copilot

《大语言模式对开放源码创新的影响:GitHub公司的证据》 2409.08379v3

Authors (3): Doron Yeverechyahu, Raveesh Mayya, Gal Oestreicher-Singer

Large Language Models (LLMs) have been shown to enhance individual productivity in guided settings. Whereas LLMs are likely to also transform innovation processes in a collaborative work setting, it is unclear what trajectory this transformation will follow. Innovation in these contexts encompasses both capability innovation that explores new possibilities by acquiring new competencies in a project and iterative innovation that exploits existing foundations by enhancing established competencies and improving project quality. Whether LLMs affect these two aspects of collaborative work and to what extent is an open empirical question. Open-source development provides an ideal setting to examine LLM impacts on these innovation types, as its voluntary and open/collaborative nature of contributions provides the greatest opportunity for technological augmentation. We focus on open-source projects on GitHub by leveraging a natural experiment around the selective rollout of GitHub Copilot (a programming-focused LLM) in October 2021, where GitHub Copilot selectively supported programming languages like Python or Rust, but not R or Haskell. We observe a significant jump in overall contributions, suggesting that LLMs effectively augment collaborative innovation in an unguided setting. Interestingly, Copilot’s launch increased iterative innovation focused on maintenance-related or feature-refining contributions significantly more than it did capability innovation through code-development or feature-introducing commits. This disparity was more pronounced after the model upgrade in June 2022 and was evident in active projects with extensive coding activity, suggesting that as both LLM capabilities and/or available contextual information improve, the gap between capability and iterative innovation may widen. We discuss practical and policy implications to incentivize high-value innovative solutions.

大型语言模型(LLMS)被证明可以提高导引环境中的个人生产力。尽管LLMS有可能在协作性工作环境中改变创新进程,但这一转变的轨迹并不清楚。这些背景下的创新既包括能力创新,通过在一个项目中获得新的能力,探索新的可能性;也包括迭代创新,通过提高既有能力和提高项目质量,利用现有基础,探索新的可能性。LLMS是否影响协作工作的这两个方面,在多大程度上是一个公开的经验问题。开放源发展为审查LLM对这些创新类型的影响提供了一个理想的环境,因为其自愿和开放/协作性贡献的扩大为技术增强提供了最大机会。我们注重GitHub的开放源项目,在2021年10月围绕GitHub Coilit(注重方案编制的LM)的选择性推出,利用自然实验,探索新的可能性,探索新颖性创新的自然实验,GitHub Colib Corality在6月20日之后,我们观察到总体贡献的大幅提升,表明LMSM在非指导性的环境下有效地加强了合作性创新。

Article 68

Title@2025-06-10 (2): When Uncertainty Leads to Unsafety: Empirical Insights into the Role of Uncertainty in Unmanned Aerial Vehicle Safety

Title: When Uncertainty Leads to Unsafety: Empirical Insights into the Role of Uncertainty in Unmanned Aerial Vehicle Safety

Wenn Ungewissheit zu Unsicherheit führt: Empirische Einblicke in die Rolle der Ungewissheit in der unbemannten Luftfahrzeugsicherheit

当不确定因素导致不安全时:对不确定因素在无人驾驶飞行器安全方面作用的实证洞察力 2501.08908v2

Authors (4): Sajad Khatiri, Fatemeh Mohammadi Amin, Sebastiano Panichella, Paolo Tonella

Despite the recent developments in obstacle avoidance and other safety features, autonomous Unmanned Aerial Vehicles (UAVs) continue to face safety challenges. No previous work investigated the relationship between the behavioral uncertainty of a UAV, characterized in this work by inconsistent or erratic control signal patterns, and the unsafety of its flight. By quantifying uncertainty, it is possible to develop a predictor for unsafety, which acts as a flight supervisor. We conducted a large-scale empirical investigation of safety violations using PX4-Autopilot, an open-source UAV software platform. Our dataset of over 5,000 simulated flights, created to challenge obstacle avoidance, allowed us to explore the relation between uncertain UAV decisions and safety violations: up to 89% of unsafe UAV states exhibit significant decision uncertainty, and up to 74% of uncertain decisions lead to unsafe states. Based on these findings, we implemented Superialist (Supervising Autonomous Aerial Vehicles), a runtime uncertainty detector based on autoencoders, the state-of-the-art technology for anomaly detection. Superialist achieved high performance in detecting uncertain behaviors with up to 96% precision and 93% recall. Despite the observed performance degradation when using the same approach for predicting unsafety (up to 74% precision and 87% recall), Superialist enabled early prediction of unsafe states up to 50 seconds in advance.

尽管在避免障碍和其他安全特征方面最近有所发展,但自主无人驾驶飞行器(无人驾驶飞行器)继续面临安全挑战。以往没有工作调查无人驾驶飞行器行为不确定性之间的关系,无人驾驶飞行器的行为不确定性在这项工作中以不连贯或不稳定的控制信号模式为特征,其飞行不安全性为特征。通过量化不确定性,有可能为不安全性开发一个预测器,作为飞行监督员。我们利用开放源代码UAV软件平台PX4自动驾驶机(开放源代码UAV软件平台)对违反安全的行为进行了大规模的经验性调查。我们建立了5 000多个模拟飞行数据集,以挑战避免障碍,使我们得以探索无人驾驶飞行器不确定决定和安全侵犯行为之间的关系:高达89%的不安全无人驾驶飞行器国家存在重大决定不确定性,高达74%的不确定决定导致不安全性状态。基于这些发现,我们实施了超导师(超导自动飞行器),一个运行时空的不确定性检测器,这是以自动编码器、州级异常检测技术为基础的。我们为挑战避免障碍而创建了5 000多个模拟飞行,从而挑战避免障碍,使我们得以探索无人驾驶飞行器决定和安全违规行为与侵犯之间的关系:高达89%的不安全性行为在预测测期中达到96%的精确度后,还观察到了87 %的状态。尽管观察到了状态。
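The reported precision and recall figures can be related to a thresholded anomaly score with a short Python sketch. It assumes the detector emits one reconstruction error per time window and flags windows above a threshold; the error values, labels, and threshold below are hypothetical, and no autoencoder is trained here.

```python
def flag_uncertain(errors, threshold):
    """Flag a window as uncertain when its reconstruction error exceeds the threshold."""
    return [e > threshold for e in errors]

def precision_recall(flags, labels):
    """Precision and recall of boolean flags against boolean ground-truth labels."""
    tp = sum(f and l for f, l in zip(flags, labels))
    fp = sum(f and not l for f, l in zip(flags, labels))
    fn = sum((not f) and l for f, l in zip(flags, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical per-window reconstruction errors and ground-truth labels.
errors = [0.1, 0.9, 0.8, 0.2, 0.7]
labels = [False, True, True, True, False]

flags = flag_uncertain(errors, threshold=0.5)
precision, recall = precision_recall(flags, labels)
```

Raising the threshold typically trades recall for precision, so the two numbers are best read together.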

Article 69

Title@2025-06-10 (2): ZTaint-Havoc: From Havoc Mode to Zero-Execution Fuzzing-Driven Taint Inference

Title: ZTaint-Havoc: From Havoc Mode to Zero-Execution Fuzzing-Driven Taint Inference

ZTaint-Havoc: Vom Havoc-Modus zur Null-Execution Fuzzing-Driven Taint-Inferenz

ZTaint-Havoc:从哈沃克模式到零执行法 2506.08838v1

Authors (3): Yuchong Xie, Wenhui Zhang, Dongdong She

Fuzzing is a widely used technique for discovering software vulnerabilities, but identifying hot bytes that influence program behavior remains challenging. Traditional taint analysis can track such bytes in a white-box manner, but suffers from scalability issues. Fuzzing-Driven Taint Inference (FTI) offers a black-box alternative, yet typically incurs significant runtime overhead due to extra program executions. We observe that the commonly used havoc mutation scheme in fuzzing can be adapted for lightweight FTI with zero extra executions. We present a computational model of havoc mode, demonstrating that it can perform FTI while generating new test cases. Building on this, we propose ZTaint-Havoc, a novel, efficient FTI technique with minimal overhead (3.84% on UniBench, 12.58% on FuzzBench). We further design an effective mutation algorithm that uses the identified hot bytes. Our comprehensive evaluation shows that ZTaint-Havoc, implemented in AFL++, improves edge coverage by up to 33.71% on FuzzBench and 51.12% on UniBench over vanilla AFL++, with average gains of 2.97% and 6.12% in 24-hour fuzzing campaigns.

模糊测试是一种广泛使用的发现软件漏洞的技术,但识别影响程序行为的热字节仍然具有挑战性。传统的污点分析可以以白盒方式追踪这类字节,但存在可扩展性问题。模糊测试驱动的污点推断(Fuzzing-Driven Taint Inference, FTI)提供了黑盒替代方案,但通常会因额外的程序执行而产生显著的运行时开销。我们观察到,模糊测试中常用的havoc变异方案可以改造为零额外执行的轻量级FTI。我们提出了havoc模式的计算模型,表明它可以在生成新测试用例的同时执行FTI。在此基础上,我们提出ZTaint-Havoc,一种新颖、高效且开销极小的FTI(UniBench上为3.84%,FuzzBench上为12.58%)。我们还利用所识别的热字节设计了一种有效的变异算法。我们的全面评估表明,在AFL++中实现的ZTaint-Havoc相比原版AFL++,在FuzzBench上将边覆盖率最多提高33.71%,在UniBench上最多提高51.12%,在24小时模糊测试中平均提升分别为2.97%和6.12%。
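
The idea of inferring hot bytes from executions that havoc performs anyway can be illustrated with a toy sketch. This is not the authors' algorithm or the AFL++ implementation: `toy_target`, the two-offset stacking, and the `threshold` value are all assumptions made for illustration. Each havoc test case is executed once (as fuzzing would do regardless), and a coverage change is credited to the offsets that were mutated in it.

```python
import random


def toy_target(data: bytes) -> tuple:
    """Hypothetical program under test; returns its branch outcomes as a coverage signature."""
    return (data[0] == ord("A"), data[3] > 0x7F)


def havoc_with_taint(seed: bytes, rounds: int = 2000, rng_seed: int = 0,
                     threshold: float = 0.3) -> list:
    """Run havoc-style mutations and infer hot offsets from the same executions."""
    rng = random.Random(rng_seed)
    base_cov = toy_target(seed)
    mutated = [0] * len(seed)  # how often each offset was mutated
    changed = [0] * len(seed)  # how often coverage differed when it was mutated
    for _ in range(rounds):
        buf = bytearray(seed)
        # Havoc-style stacking: mutate one or two random offsets per test case.
        offsets = rng.sample(range(len(seed)), rng.randint(1, 2))
        for off in offsets:
            buf[off] = (buf[off] + rng.randint(1, 255)) % 256  # always a different byte
        cov = toy_target(bytes(buf))  # the single execution havoc needs anyway
        for off in offsets:
            mutated[off] += 1
            if cov != base_cov:
                changed[off] += 1
    # An offset is reported as hot if mutating it changed coverage often enough.
    return [i for i in range(len(seed))
            if mutated[i] and changed[i] / mutated[i] >= threshold]


if __name__ == "__main__":
    print(havoc_with_taint(b"A" + bytes(7)))
```

Because stacked mutations touch several offsets at once, the attribution is statistical: offsets that never influence a branch still receive some credit when they are co-mutated with a hot byte, which is why the sketch thresholds a ratio instead of trusting single observations.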

Article 70

Title@2025-06-10 (2): Exploring the Evidence-Based Beliefs of LLM-Based Programming Assistants

Title: Exploring the Evidence-Based Beliefs of LLM-Based Programming Assistants

Erforschung der evidenzbasierten Überzeugungen von LLM-basierten Programmierassistenten

探索以LLM为基础的方案拟订助理的基于证据的信念 2407.13900v2

Authors (2): Chris Brown, Jason Cusati

Recent innovations in artificial intelligence (AI), primarily powered by large language models (LLMs), have transformed how programmers develop and maintain software, leading to new frontiers in software engineering (SE). The advanced capabilities of LLM-based programming assistants to support software development tasks have led to a rise in the adoption of LLMs in SE. However, little is known about the evidence-based practices, tools, and processes, verified by research findings, that AI programming assistants support and adopt. To this end, our work conducts a preliminary evaluation exploring the beliefs of LLMs used to support software development tasks. We investigate 17 evidence-based claims posited by empirical SE research across five LLM-based programming assistants. Our findings show that LLM-based programming assistants have ambiguous beliefs regarding research claims and lack credible evidence to support their responses. Based on our results, we provide implications for practitioners adopting LLM-based programming assistants in development contexts and shed light on future research directions to enhance the reliability and trustworthiness of LLMs, with the aim of increasing awareness and adoption of evidence-based SE research findings in practice.

最近人工智能(AI)的革新主要由大型语言模型(LLMs)推动,改变了程序员如何开发和维护软件 – – 导致软件工程的新领域。基于LLM的编程助理支持软件开发任务的先进能力导致SE采用LLMs的情况增加。然而,关于经AI编程助理支持和采纳的研究结果核实的有证据的做法、工具和程序,我们鲜为人知。为此目的,我们开展了一项初步评价,探讨用于支持软件开发任务的LLM的信念。我们调查了五个基于LLM的编程助理根据SE的经验研究提出的17项基于证据的索赔。我们的调查结果显示,基于LLM的编程助理对研究主张有模糊的信念,缺乏可靠的证据来支持应对措施。根据我们的结果,我们为在开发过程中采用LLM制程的编程助理从业人员提供了影响,并指明了提高LMS的可靠性和可信度的未来研究方向 – – 目的是在实际中提高认识和采用基于证据的SE研究结果。

Article 71

Title@2025-06-10 (2): Towards a Knowledge Base of Common Sustainability Weaknesses in Green Software Development

Title: Towards a Knowledge Base of Common Sustainability Weaknesses in Green Software Development

Auf dem Weg zu einer Wissensbasis für gemeinsame Nachhaltigkeitsschwächen in der grünen Softwareentwicklung

建立绿色软件开发中共同可持续性弱点知识库 2506.08812v1

Authors (6): Priyavanshi Pathania, Rohit Mehra, Vibhu Saujanya Sharma, Vikrant Kaulgud, Sanjay Podder, Adam P. Burden

With the climate crisis looming, engineering sustainable software systems becomes crucial to optimize resource utilization, minimize environmental impact, and foster a greener, more resilient digital ecosystem. For developers, getting access to automated tools that analyze code and suggest sustainability-related optimizations becomes extremely important from a learning and implementation perspective. However, there is currently a dearth of such tools due to the lack of standardized knowledge, which serves as the foundation of these tools. In this paper, we motivate the need for a standard knowledge base of commonly occurring sustainability weaknesses in code, and propose an initial way of building it. Furthermore, through preliminary experiments, we demonstrate why existing knowledge regarding software weaknesses cannot be re-tagged “as is” to sustainability without significant due diligence, thereby urging further exploration in this ecologically significant domain.

随着气候危机的逼近,工程可持续软件系统对于优化资源利用、尽量减少环境影响和培育更绿色、更具复原力的数字生态系统变得至关重要。对于开发者来说,从学习和执行的角度来说,获取分析代码和提出可持续性优化的自动化工具变得极为重要。然而,由于缺乏标准化知识,目前此类工具短缺,而标准化知识是这些工具的基础。在本文件中,我们提出需要开发一个标准知识库,说明代码中通常存在的可持续性薄弱环节,并提出这样做的初步方法。此外,通过初步实验,我们展示了为什么现有的软件薄弱环节无法在没有重大尽职调查的情况下重新标记为可持续性,从而敦促在这一具有生态意义的领域进一步探索。

Article 72

Title@2025-06-10 (2): User Modeling in Model-Driven Engineering: A Systematic Literature Review

Title: User Modeling in Model-Driven Engineering: A Systematic Literature Review

User Modeling in Model-Driven Engineering: Ein systematischer Literaturbericht

模型驱动工程的用户建模:系统文学评论 2412.15871v3

Authors (3): Aaron Conrardy, Alfredo Capozucca, Jordi Cabot

In software applications, user models can be used to specify the profile of the typical users of the application, including personality traits, preferences, skills, etc. In theory, this would enable adaptive application behavior that could lead to a better user experience. Nevertheless, user models do not seem to be part of standard modeling languages, nor are they common in current model-driven engineering (MDE) approaches. In this paper, we conduct a systematic literature review to analyze existing proposals for user modeling in MDE and identify their limitations. The results show a lack of a unified and complete user modeling perspective. Instead, we observe many fragmented and partial proposals that consider only simple user dimensions and lack proper tool support. This limits the implementation of richer user interfaces able to better support user-specific needs. Therefore, we hope this analysis triggers a discussion on the importance of user models and their inclusion in MDE pipelines, especially in a context where, thanks to the rise of AI techniques, personalization based on a rich set of user dimensions is becoming increasingly feasible.

在软件应用中,用户模型可用于指定应用程序典型用户的概况,包括个性特征、偏好、技能等。理论上,这将有利于适应性应用行为,从而导致更好的用户经验。然而,用户模型似乎不是标准模型语言的一部分,在目前模式驱动的工程方法中也不常见。在本文件中,我们进行系统的文献审查,分析MDE用户建模的现有建议,并查明其局限性。结果显示缺乏统一和完整的用户建模观点。相反,我们观察到许多零散和不完整的建议,只考虑简单的用户层面,缺乏适当的工具支持。这限制了能够更好地支持用户特定需求的更富的用户界面的安装。因此,我们希望这一分析能够引发关于用户模型的重要性以及将其纳入MDE管道的讨论。特别是在由于AI技术的兴起,基于大量用户层面的个人化正在变得越来越可能。

Article 73

Title@2025-06-10 (2): Do Generative AI Tools Ensure Green Code? An Investigative Study

Title: Do Generative AI Tools Ensure Green Code? An Investigative Study

Stellen Generative KI-Tools einen grünen Code sicher? Eine Untersuchungsstudie

产生AI工具确保绿色守则? 一项调查研究 2506.08790v1

Authors (6): Samarth Sikand, Rohit Mehra, Vibhu Saujanya Sharma, Vikrant Kaulgud, Sanjay Podder, Adam P. Burden

Software sustainability is emerging as a primary concern, aiming to optimize resource utilization, minimize environmental impact, and promote a greener, more resilient digital ecosystem. The sustainability or “greenness” of software is typically determined by the adoption of sustainable coding practices. With a maturing ecosystem around generative AI, many software developers now rely on these tools to generate code using natural language prompts. Despite their potential advantages, there is a significant lack of studies on the sustainability aspects of AI-generated code. Specifically, how environmentally friendly is AI-generated code, judged by its adoption of sustainable coding practices? In this paper, we present the results of an early investigation into the sustainability aspects of AI-generated code across three popular generative AI tools: ChatGPT, BARD, and Copilot. The results highlight the default non-green behavior of these tools when generating code, across multiple rules and scenarios. They underscore the need for further in-depth investigations and effective remediation strategies.

软件可持续性正在成为一个主要关注问题,目的是优化资源利用,尽量减少环境影响,促进更绿色、更具复原力的数字生态系统。软件的可持续性或“绿色性”通常取决于采用可持续编码做法。随着生态系统围绕基因化的人工智能不断成熟,许多软件开发者现在依靠这些工具来生成使用自然语言提示的代码。尽管这些工具具有潜在优势,但对于AI产生的代码的可持续性方面却缺乏大量研究。具体地说,AI生成的代码在采用可持续编码做法的基础上对环境有多友好?在本文件中,我们介绍了对AI生成的代码在三种流行的基因化的人工智能工具(ChatGPT、BARD和Coitive)中的可持续性方面进行早期调查的结果。这些结果突出表明了生成代码的工具在多种规则和情景中默认的不绿色行为。它强调需要进一步深入的调查和有效的补救战略。

Article 74

Title@2025-06-10 (2): Mitigating fairwashing using Two-Source Audits

Title: Mitigating fairwashing using Two-Source Audits

Fairwashing durch Zwei-Quellen-Audits abmildern

利用双重来源审计减少洗水 2305.13883v2

Authors (4): Jade Garcia Bourrée, Erwan Le Merrer, Gilles Tredan, Benoît Rottembourg

Recent legislation requires online platforms to provide dedicated APIs to assess the compliance of their decision-making algorithms with the law. Research has nevertheless shown that the auditors of such platforms are prone to manipulation (a practice referred to as fairwashing). To address this salient problem, recent work has considered audits under the assumption of partial knowledge of the platform’s internal mechanisms. In this paper, we propose a more pragmatic approach with the Two-Source Audit setup: while still leveraging the API, we advocate adding a second source of data, used both to audit a platform and to detect fairwashing attempts. Our method is based on identifying discrepancies between the two data sources, using data proxies in use in the fairness literature. We formally demonstrate the conditions for success in this fairwashing mitigation task. We then validate our method empirically, demonstrating that Two-Source Audits can achieve a Pareto-optimal balance between the two objectives. We believe this paper sets the stage for reliable audits in manipulation-prone setups, under mild assumptions.

最近的立法要求在线平台提供专门的API,以评估其决策算法是否符合法律。然而,研究显示,这些平台的审计员容易被操纵(这种做法称为fairwashing)。为解决这一突出问题,最近的工作考虑了在部分了解平台内部机制的假设下进行审计。在本文中,我们提出一种更务实的方法,即双源审计(Two-Source Audit)设置:在仍然利用API的同时,我们主张增加第二个数据来源,既用于对平台进行审计,也用于发现fairwashing企图。我们的方法基于识别两个数据来源之间的差异,并使用公平性文献中常用的数据代理。我们形式化地证明了这一fairwashing缓解任务取得成功的条件。随后,我们通过实验验证了我们的方法,表明双源审计可以在两个目标之间实现帕累托最优的平衡。我们认为,本文在温和的假设下,为在易受操纵的环境中进行可靠审计奠定了基础。
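
The discrepancy check at the heart of a two-source setup can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's method: the demographic-parity gap is chosen here as the fairness measure, and the record layout `(group, decision)`, the function names, and the `tolerance` value are all hypothetical.

```python
def positive_rate(records, group):
    """Share of positive decisions for one group; records are (group, decision) pairs."""
    decisions = [d for g, d in records if g == group]
    return sum(decisions) / len(decisions)


def demographic_parity_gap(records, group_a, group_b):
    """Absolute difference in positive-decision rates between two groups."""
    return abs(positive_rate(records, group_a) - positive_rate(records, group_b))


def two_source_check(api_records, second_source_records, group_a, group_b, tolerance=0.1):
    """Compare the fairness gap seen through the API with the gap seen in a second source."""
    gap_api = demographic_parity_gap(api_records, group_a, group_b)
    gap_other = demographic_parity_gap(second_source_records, group_a, group_b)
    return {
        "gap_api": gap_api,
        "gap_second_source": gap_other,
        # A large disagreement between sources hints that one view may be manipulated.
        "suspicious": abs(gap_api - gap_other) > tolerance,
    }


if __name__ == "__main__":
    # Hypothetical data: the API looks perfectly balanced, the second source does not.
    api = [("a", 1), ("a", 1), ("a", 0), ("a", 0), ("b", 1), ("b", 1), ("b", 0), ("b", 0)]
    other = [("a", 1), ("a", 1), ("a", 1), ("a", 0), ("b", 1), ("b", 0), ("b", 0), ("b", 0)]
    print(two_source_check(api, other, "a", "b"))
```

A flagged discrepancy is only a signal for further investigation; sampling differences between the two sources can also produce a gap, which is why the tolerance is left as a parameter.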

Article 75

Title@2025-06-10 (2): Breaking the ICE: Exploring promises and challenges of benchmarks for Inference Carbon & Energy estimation for LLMs

Title: Breaking the ICE: Exploring promises and challenges of benchmarks for Inference Carbon & Energy estimation for LLMs

Breaking the ICE: Erforschen von Versprechungen und Herausforderungen von Benchmarks für Inferenz-Kohlenstoff- & Energieschätzungen für LLMs

打破ICE:探索LLMM的碳和能源估算基准的许诺和挑战 2506.08727v1

Authors (8): Samarth Sikand, Rohit Mehra, Priyavanshi Pathania, Nikhil Bamby, Vibhu Saujanya Sharma, Vikrant Kaulgud, Sanjay Podder, Adam P. Burden

While Generative AI stands to be one of the fastest-adopted technologies ever, studies have made evident that the usage of Large Language Models (LLMs) puts a significant burden on energy grids and our environment. It may prove a hindrance to the sustainability goals of any organization. A crucial step in any sustainability strategy is monitoring or estimating the energy consumption of various components. While there exist multiple tools for monitoring energy consumption, there is a dearth of tools and frameworks for estimating consumption or carbon emissions. Current drawbacks of both monitoring and estimation tools include the large number of input data points required, intrusiveness, and high error margins. We posit that leveraging emerging LLM benchmarks and related data points can help overcome the aforementioned challenges while balancing the accuracy of the emission estimations. To that end, we discuss the challenges of current approaches and present our evolving framework, R-ICE, which estimates prompt-level inference carbon emissions by leveraging existing state-of-the-art (SOTA) benchmarks. This direction provides a more practical and non-intrusive way to enable emerging use cases such as dynamic LLM routing and carbon accounting. Our promising validation results suggest that benchmark-based modelling holds great potential for inference emission estimation and warrants further exploration by the scientific community.

虽然生成的AI是有史以来采用最快的技术之一,但研究显示,使用大型语言模型(LLMs)给能源网和我们的环境带来沉重负担,可能阻碍任何组织的可持续性目标。任何可持续性战略的一个关键步骤是监测或估计各组成部分的能源消耗。虽然有多种监测能源消费的工具,但缺乏估计消费或碳排放量的工具/框架。目前监测和估计工具的缺点包括高输入数据点、侵扰性、高误差等。我们认为,利用新兴的LLM基准和相关数据点可以帮助克服上述挑战,同时平衡排放量估计的准确性。在这方面,我们讨论当前方法的挑战,并提出我们不断发展的框架R-ICE。 R-ICE通过利用现有的先进技术(SOITA)基准,估计迅速的碳排放量水平。这个方向提供了更实际、更无侵犯性的方法,使新兴的使用案例,如动态LM路由、碳核算等等。我们充满希望的验证结果显示,根据基准的建模在进一步的科学评估中具有巨大的潜力。

Article 76

Title@2025-06-10 (2): Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure

Title: Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure

Erklärbare Compliance-Erkennung mit Multi-Hop-Natural Language-Schlussfolgerung zur Assurance-Fallstruktur

以多种自然语言对保证案例结构的多重语言推断进行可解释的合规检测 2506.08713v1

Authors (2): Fariz Ikhwantri, Dusica Marijan

Ensuring complex systems meet regulations typically requires checking the validity of assurance cases through a claim-argument-evidence framework. Some challenges in this process include the complicated nature of legal and technical texts, the need for model explanations, and limited access to assurance case data. We propose a compliance detection approach based on Natural Language Inference (NLI): EXplainable CompLiance detection with Argumentative Inference of Multi-hop reasoning (EXCLAIM). We formulate the claim-argument-evidence structure of an assurance case as a multi-hop inference for explainable and traceable compliance detection. We address the limited number of assurance cases by generating them using large language models (LLMs). We introduce metrics that measure the coverage and structural consistency. We demonstrate the effectiveness of the generated assurance case from GDPR requirements in a multi-hop inference task as a case study. Our results highlight the potential of NLI-based approaches in automating the regulatory compliance process.

确保复杂的系统符合规章条例,通常需要通过索赔-论证-证据框架检查保证案件的有效性。这一过程的一些挑战包括法律和技术文本性质复杂,需要示范解释,以及获得保证案件数据的机会有限。我们建议根据自然语言推断(NLI):利用多点推理推理(EXCLAIM)的推理推理(EXCLAIM)进行可推广共测。我们将保证案件的索赔-论证-证据结构作为可解释和可追踪的遵守检测的多重推理。我们用大型语言模型(LLLMs)生成的保证案件数量有限。我们提出了衡量覆盖面和结构一致性的衡量标准。我们作为案例研究,展示了GDPR要求产生的保证案件的有效性。我们的结果突出了以NLI为基础的方法在监管遵守过程自动化方面的潜力。

Article 77

Title: ROS-related Robotic Systems Development with V-model-based Application of MeROS Metamodel

ROS-bezogene Robotik-Entwicklung mit V-Modell-basierter Anwendung von Meros Metamodel

与ROS有关的机器人系统开发,以V型模型为基础应用MEROS模型模型 2506.08706v1

Authors (6): Tomasz Winiarski, Jan Kaniuka, Daniel Giełdowski, Jakub Ostrysz, Krystian Radlak, Dmytro Kushnir

As robotic systems grow increasingly complex, heterogeneous, and safety-critical, the need for structured development methodologies becomes paramount. Although frameworks like the Robot Operating System (ROS) and Model-Based Systems Engineering (MBSE) offer foundational tools, they often lack integration when used together. This paper addresses that gap by aligning the widely recognized V-model development paradigm with the MeROS metamodel, a SysML-based modeling language tailored for ROS-based systems. We propose a domain-specific methodology that bridges ROS-centric modelling with systems engineering practices. Our approach formalises the structure, behaviour, and validation processes of robotic systems using MeROS, while extending it with a generalized, adaptable V-model compatible with both ROS and ROS 2. Rather than prescribing a fixed procedure, the approach supports project-specific flexibility and reuse, offering guidance across all stages of development. The approach is validated through a comprehensive case study on HeROS, a heterogeneous multi-robot platform comprising manipulators, mobile units, and dynamic test environments. This example illustrates how the MeROS-compatible V-model enhances traceability and system consistency while remaining accessible and extensible for future adaptation. The work contributes a structured, tool-agnostic foundation for developers and researchers seeking to apply MBSE practices in ROS-based projects.

由于机器人系统日益复杂、多样和安全,对结构化发展方法的需求变得至关重要。虽然机器人操作系统和基于模型的系统工程等框架提供了基础工具,但是在同时使用时往往缺乏一体化。本文件通过将广泛承认的V型发展模式模式与为基于模型的系统而专门设计的基于模型的SymML模型语言相匹配,从而缩小了这一差距。我们提出了一个具体领域的方法,将以ROS为中心的建模与系统工程实践相连接。我们的方法将机器人系统的结构、行为和验证过程正规化,同时将机器人系统的结构、行为和验证过程与ROS和ROS2相兼容,同时扩展一个通用的、可调整的V型模型,与ROS和ROS2相兼容。这一方法不仅没有规定固定程序,反而支持了具体项目的灵活性和再利用,为各个发展阶段提供了指导。该方法通过关于HEROS系统的综合案例研究得到验证。HEROS是一个由操纵器、移动器和动态测试环境组成的混合多机器人平台。我们的方法将机器人模型正规化,同时加强可操作性和系统的一致性和系统一致性,同时保持无障碍的V型模型,同时为未来的研发者基础,为未来的研究开发者寻求基础基础,并应用。

Article 78

Title@2025-06-10 (2): A Comprehensive Evaluation of Parameter-Efficient Fine-Tuning on Code Smell Detection

Title: A Comprehensive Evaluation of Parameter-Efficient Fine-Tuning on Code Smell Detection

Eine umfassende Bewertung des Parameter-Effizienten Feintunings auf Code-Smell-Erkennung

全面评价关于代码嗅觉检测的参数有效精密设计 2412.13801v2

Authors (8): Beiqi Zhang, Peng Liang, Xin Zhou, Xiyu Zhou, David Lo, Qiong Feng, Zengyang Li, Lin Li

Code smells are suboptimal coding practices that negatively impact the quality of software systems. Existing detection methods, relying on heuristics or Machine Learning (ML) and Deep Learning (DL) techniques, often face limitations such as unsatisfactory performance. Parameter-Efficient Fine-Tuning (PEFT) methods have emerged as a resource-efficient approach for adapting LLMs to specific tasks, but their effectiveness for code smell detection remains underexplored. In this regard, this study evaluates state-of-the-art PEFT methods on both Small Language Models (SLMs) and Large Language Models (LLMs) for detecting four types of code smells: Complex Conditional, Complex Method, Feature Envy, and Data Class. Using high-quality and balanced datasets sourced from GitHub, we fine-tuned four SLMs and five LLMs with PEFT techniques, including prompt tuning, prefix tuning, LoRA, and (IA)^3. Results show that PEFT methods achieve comparable or better performance than full fine-tuning while consuming less GPU memory. LLMs generally outperform SLMs on detecting certain smells (e.g., Complex Conditional), while SLMs do better on others (e.g., Data Class). Additionally, increasing training dataset size significantly boosted performance, while increasing trainable parameters did not. Our findings highlight PEFT methods as effective and scalable solutions, outperforming existing heuristic-based, DL-based, and In-Context Learning approaches for code smell detection.

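代码异味是对软件系统质量产生负面影响的次优编码实践。现有检测方法依赖启发式规则或机器学习(ML)和深度学习(DL)技术,往往面临性能不理想等局限。参数高效微调(PEFT)方法已成为使LLM适应特定任务的资源高效途径,但其在代码异味检测上的有效性仍缺乏探索。为此,本研究在小型语言模型(SLM)和大型语言模型(LLM)上评估了最先进的PEFT方法,用于检测四类代码异味:Complex Conditional、Complex Method、Feature Envy和Data Class。利用来自GitHub的高质量平衡数据集,我们用PEFT技术微调了四个SLM和五个LLM,包括prompt tuning、prefix tuning、LoRA和(IA)^3。结果显示,PEFT方法在消耗更少GPU内存的同时,取得了与完全微调相当或更好的性能。LLM在检测某些异味(如Complex Conditional)时总体上优于SLM,而SLM在另一些异味(如Data Class)上表现更好。此外,增大训练数据集规模能显著提升性能,而增加可训练参数则不能。我们的发现表明,PEFT方法是有效且可扩展的解决方案,优于现有基于启发式、基于DL以及上下文学习的代码异味检测方法。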

Article 79

Title@2025-06-10 (2): Causality-aware Safety Testing for Autonomous Driving Systems

Title: Causality-aware Safety Testing for Autonomous Driving Systems

Causality-aware Sicherheitstests für autonome Fahrsysteme

自动驾驶系统因能安全测试 2506.08688v1

Authors (7): Wenbing Tang, Mingfei Cheng, Renzhi Wang, Yuan Zhou, Chengwei Liu, Yang Liu, Zuohua Ding

Simulation-based testing is essential for evaluating the safety of Autonomous Driving Systems (ADSs). Comprehensive evaluation requires testing across diverse scenarios that can trigger various types of violations under different conditions. While existing methods typically focus on individual diversity metrics, such as input scenarios, ADS-generated motion commands, and system violations, they often fail to capture the complex interrelationships among these elements. This oversight leads to gaps in testing coverage, potentially missing critical issues in the ADS under evaluation. However, quantifying these interrelationships presents a significant challenge. In this paper, we propose a novel causality-aware fuzzing technique, Causal-Fuzzer, to enable efficient and comprehensive testing of ADSs by exploring causally diverse scenarios. The core of Causal-Fuzzer is constructing a causal graph to model the interrelationships among the diversities of input scenarios, ADS motion commands, and system violations. The causal graph then guides the process of critical scenario generation. Specifically, Causal-Fuzzer proposes (1) a causality-based feedback mechanism that quantifies the combined diversity of test scenarios by assessing whether they activate new causal relationships, and (2) a causality-driven mutation strategy that prioritizes mutations on input scenario elements with higher causal impact on ego action changes and violation occurrence, rather than treating all elements equally. We evaluated Causal-Fuzzer on an industry-grade ADS, Apollo, in high-fidelity simulation. Our empirical results demonstrate that Causal-Fuzzer significantly outperforms existing methods in (1) identifying a greater diversity of violations, (2) providing enhanced testing sufficiency with improved coverage of causal relationships, and (3) achieving greater efficiency in detecting the first critical scenarios.

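基于仿真的测试对于评估自动驾驶系统(ADS)的安全性至关重要。全面的评估需要在多样化场景中进行测试,这些场景能在不同条件下触发各类违规。现有方法通常只关注单个多样性指标,如输入场景、ADS生成的运动指令和系统违规,往往无法捕捉这些要素之间复杂的相互关系。这一疏漏会造成测试覆盖的缺口,可能遗漏被测ADS中的关键问题。然而,量化这些相互关系是一项重大挑战。在本文中,我们提出一种新的因果感知模糊测试技术Causal-Fuzzer,通过探索因果多样的场景,实现对ADS高效而全面的测试。Causal-Fuzzer的核心是构建一个因果图,对输入场景、ADS运动指令和系统违规三者多样性之间的相互关系进行建模,随后由该因果图指导关键场景的生成过程。具体而言,Causal-Fuzzer提出:(1)一种基于因果关系的反馈机制,通过评估测试场景是否激活新的因果关系来量化其综合多样性;(2)一种因果驱动的变异策略,优先变异对自车动作变化和违规发生具有更高因果影响的输入场景要素,而不是对所有要素一视同仁。我们在工业级ADS Apollo上评估了Causal-Fuzzer。实验结果表明,Causal-Fuzzer在以下方面显著优于现有方法:(1)发现更多样的违规;(2)通过更好地覆盖因果关系提供更充分的测试;(3)更高效地检测到首批关键场景。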
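
The two mechanisms named in the abstract, feedback on newly activated causal relationships and impact-weighted mutation, can be sketched in a few lines. This is a rough illustration only: the edge representation as `(cause, effect)` pairs, the element names, and the impact scores are hypothetical and do not come from the paper.

```python
def new_causal_edges(covered, activated):
    """Return the edges in `activated` not seen before, and record them in `covered`."""
    fresh = set(activated) - covered
    covered |= fresh  # in-place update, so the caller's set grows
    return fresh


def mutation_weights(causal_impact):
    """Normalize hypothetical per-element impact scores into selection probabilities."""
    total = sum(causal_impact.values())
    return {name: score / total for name, score in causal_impact.items()}


if __name__ == "__main__":
    covered = set()
    corpus = []
    # Each scenario is paired with the causal edges it activated during simulation.
    runs = [
        ("scenario_1", {("npc_cut_in", "hard_brake")}),
        ("scenario_2", {("npc_cut_in", "hard_brake")}),  # nothing new
        ("scenario_3", {("hard_brake", "rear_collision")}),
    ]
    for name, edges in runs:
        if new_causal_edges(covered, edges):
            corpus.append(name)  # keep only scenarios that add causal coverage
    print(corpus)
    print(mutation_weights({"npc_speed": 3, "weather": 1}))
```

Keeping a scenario only when it contributes an unseen edge mirrors how coverage-guided fuzzers retain inputs that reach new code, with causal relationships taking the place of code edges.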

Article 80

Title@2025-06-10 (2): Proceedings of the 23rd International Overture Workshop

Title: Proceedings of the 23rd International Overture Workshop

Berichte des 23. Internationalen Ouvertüren-Workshops

第23次国际展望研讨会记录 2506.08680v1

Authors (2): Hugo Daniel Macedo, Ken Pierce

This volume contains the papers presented at the 23rd International Overture Workshop, held on the 11th of June 2025. This event was the latest in a series of workshops around the Vienna Development Method (VDM), the open-source project Overture, and related tools and formalisms. VDM is one of the longest established formal methods for systems development. A lively community of researchers and practitioners in academia and industry has grown up around the modelling languages (VDM-SL, VDM++, VDM-RT, CML) and tools (VDMTools, Overture, Crescendo, Symphony, the INTO-CPS chain, and ViennaTalk). Together, these provide a platform for work on modelling and analysis technology that includes static and dynamic analysis, test generation, execution support, and model checking. This workshop provided updates on the emerging technology of VDM/Overture, including collaboration infrastructure, collaborative modelling and co-simulation for Cyber-Physical Systems.

本卷载有在2025年6月11日举办的第二十三届国际展望讲习班上提出的论文,这是关于维也纳发展方法、开放源码项目说明和相关工具及形式主义的一系列讲习班中最近的一次。VDM是建立时间最长的系统开发正式方法之一。学术界和工业界活跃的研究人员和从业人员群体围绕模拟语言(VDM-SL、VDM++、VDM-RT、CML)和工具(VDMTools、Opture、Crescendo、Symphony、Onto-CPS链和Venvienna Talk)和工具(VDMTools、Onder、Crescendo、Crescendo、Cymphony、Onto-CPS链和Venvienna Talk)发展,共同为模型和分析技术工作提供了一个平台,其中包括静态和动态分析、测试生成、执行支持和模型检查。该讲习班提供了关于VDDM/Oring新兴技术的最新情况,包括协作基础设施、协作建模和网络-Pystems系统共同模拟。

Article 81

Title@2025-06-10 (2): Realigning Incentives to Build Better Software: a Holistic Approach to Vendor Accountability

Title: Realigning Incentives to Build Better Software: a Holistic Approach to Vendor Accountability

Neuausrichtung von Anreizen, um bessere Software zu entwickeln: ein ganzheitlicher Ansatz für die Verantwortlichkeit des Verkäufers

调整奖励措施,以建设更好的软件:供应商问责制的综合办法 2504.07766v2

Authors (3): Gergely Biczók, Sasha Romanosky, Mingyan Liu

In this paper, we ask why the quality of commercial software, in terms of security and safety, does not measure up to what we have come to expect of other (durable) consumer goods. We examine this question through the lens of incentives. We argue that the challenge around better quality software is due in no small part to a sequence of misaligned incentives, the most critical of which is that the harm caused by software problems is by and large shouldered by consumers, not developers. This lack of liability means software vendors have every incentive to rush low-quality software onto the market and no incentive to enhance quality control. Within this context, this paper outlines a holistic technical and policy framework we believe is needed to incentivize better and more secure software development. At the heart of the incentive realignment is the concept of software liability. This framework touches on various components, including legal, technical, and financial, that are needed for software liability to work in practice; some currently exist, while others will need to be re-imagined or established. This is primarily a market-driven approach that emphasizes voluntary participation but highlights the role appropriate regulation can play. We connect and contrast this with the EU legal environment and discuss what this framework means for open-source software (OSS) development and emerging AI risks. Moreover, we present a CrowdStrike case study, complete with a what-if analysis of the situation had our proposed framework been in effect. Our intention is very much to stimulate a robust conversation among both researchers and practitioners.

在本文中,我们提出一个问题:为什么商业软件在安保和安全方面的质量达不到我们对其他(耐用)消费品已习以为常的水平。我们从激励的角度来审视这一问题。我们认为,提高软件质量的挑战在很大程度上源于一系列错位的激励,其中最关键的是,软件问题造成的损害大体上由消费者而不是开发商承担。这种责任的缺失意味着软件供应商完全有动力将低质量软件匆忙推向市场,而没有动力加强质量控制。在此背景下,本文概述了一个我们认为激励更好、更安全的软件开发所需的整体技术和政策框架。激励调整的核心是软件责任的概念。该框架涉及软件责任在实践中发挥作用所需的各种组成部分,包括法律、技术和财务方面;其中一些目前已经存在,另一些则需要重新设想或建立。这主要是一种市场驱动的办法,强调自愿参与,但也强调适当监管可以发挥的作用。我们将其与欧盟的法律环境进行联系和对比,并讨论该框架对开源软件(OSS)开发和新兴AI风险意味着什么。此外,我们提供了一个CrowdStrike案例研究,并附有假设我们提出的框架已经生效的情形分析。我们的意图是激发研究人员和从业人员之间的深入对话。

Article 82

Title@2025-06-10 (2): Logic Mining from Process Logs: Towards Automated Specification and Verification

Title: Logic Mining from Process Logs: Towards Automated Specification and Verification

Logic Mining aus Prozessprotokollen: Auf dem Weg zu einer automatisierten Spezifikation und Verifizierung

从加工日志进行逻辑采矿:走向自动规格和核查 2506.08628v1

Authors (2): Radoslaw Klimek, Julia Witek

Logical specifications play a key role in the formal analysis of behavioural models. Automating the derivation of such specifications is particularly valuable in complex systems, where manual construction is time-consuming and error-prone. This article presents an approach for generating logical specifications from process models discovered via workflow mining, combining pattern-based translation with automated reasoning techniques. In contrast to earlier work, we evaluate the method on both general-purpose and real-case event logs, enabling a broader empirical assessment. The study examines the impact of data quality, particularly noise, on the structure and testability of generated specifications. Using automated theorem provers, we validate a variety of logical properties, including satisfiability, internal consistency, and alignment with predefined requirements. The results support the applicability of the approach in realistic settings and its potential integration into empirical software engineering practices.

逻辑规格在行为模型的正式分析中起着关键作用。在复杂系统中,手工构造耗时费时,容易出错,这种规格的生成在复杂系统中特别宝贵。本条从工作流程挖掘过程中发现的流程模型中提出逻辑规格的方法,将基于模式的翻译与自动推理技术相结合。与早先的工作不同,我们评价了通用和实际事件日志的方法,从而能够进行更广泛的经验评估。研究审查了数据质量,特别是噪音对生成规格的结构和可测试性的影响。我们利用自动理论验证,验证了各种逻辑属性,包括可比较性、内部一致性和符合预先界定的要求。结果支持了该方法在现实环境中的适用性及其可能纳入实证软件工程做法。

Article 83

Title@2025-06-10 (2): LMRPA: Large Language Model-Driven Efficient Robotic Process Automation for OCR

Title: LMRPA: Large Language Model-Driven Efficient Robotic Process Automation for OCR

LMRPA: Großsprachige modellgetriebene effiziente Roboterprozessautomatisierung für OCR

LMRPA: OCR的大型语言模型驱动高效机器人程序自动化 2412.18063v2

Authors (3): Osama Hosam Abdellaif, Abdelrahman Nader, Ali Hamdi

This paper introduces LMRPA, a novel Large Model-Driven Robotic Process Automation (RPA) model designed to greatly improve the efficiency and speed of Optical Character Recognition (OCR) tasks. Traditional RPA platforms often suffer from performance bottlenecks when handling high-volume repetitive processes like OCR, leading to a less efficient and more time-consuming process. LMRPA integrates Large Language Models (LLMs) to improve the accuracy and readability of extracted text, overcoming the challenges posed by ambiguous characters and complex text structures. Extensive benchmarks were conducted comparing LMRPA to leading RPA platforms, including UiPath and Automation Anywhere, using OCR engines like Tesseract and DocTR. The results show that LMRPA achieves superior performance, cutting processing times by up to 52%. For instance, in Batch 2 of the Tesseract OCR task, LMRPA completed the process in 9.8 seconds, whereas UiPath finished in 18.1 seconds and Automation Anywhere in 18.7 seconds. Similar improvements were observed with DocTR, where LMRPA completed tasks in 12.7 seconds, while competitors took over 20 seconds to do the same. These findings highlight the potential of LMRPA to revolutionize OCR-driven automation processes, offering a more efficient and effective alternative to the existing state-of-the-art RPA models.

本文介绍LMRPA,一种新颖的大模型驱动机器人流程自动化(RPA)模型,旨在大幅提高光学字符识别(OCR)任务的效率和速度。传统RPA平台在处理OCR这类高容量重复性流程时常常遇到性能瓶颈,导致流程效率较低、耗时更长。LMRPA通过集成大型语言模型(LLM)来提高提取文本的准确性和可读性,克服模糊字符和复杂文本结构带来的挑战。我们使用Tesseract和DocTR等OCR引擎,对LMRPA与包括UiPath和Automation Anywhere在内的主流RPA平台进行了大量基准测试。结果显示,LMRPA性能更优,处理时间最多缩短52%。例如,在Tesseract OCR任务的第2批中,LMRPA在9.8秒内完成了流程,而UiPath用时18.1秒,Automation Anywhere用时18.7秒。使用DocTR时也观察到类似的改进:LMRPA在12.7秒内完成任务,而竞争对手用时超过20秒。这些发现凸显了LMRPA革新OCR驱动的自动化流程的潜力,为现有最先进的RPA模型提供了更高效、更有效的替代方案。
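
The Batch 2 timings quoted in the abstract can be turned into relative reductions with a short calculation. The helper name `time_reduction` is introduced here for illustration; the three timings are the ones stated above.

```python
def time_reduction(baseline_s: float, improved_s: float) -> float:
    """Percentage of processing time saved relative to the baseline."""
    return (baseline_s - improved_s) / baseline_s * 100


if __name__ == "__main__":
    lmrpa_s = 9.8  # LMRPA, Tesseract OCR task, Batch 2
    competitors = {"UiPath": 18.1, "Automation Anywhere": 18.7}
    for name, seconds in competitors.items():
        print(f"{name}: {time_reduction(seconds, lmrpa_s):.1f}% less time")
```

For this batch the reductions come out at roughly 46% and 48%, so the "up to 52%" headline figure must stem from a different batch or configuration than the one quoted.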

Article 84

Title@2025-06-10 (2): RE-oriented Model Development with LLM Support and Deduction-based Verification

Title: RE-oriented Model Development with LLM Support and Deduction-based Verification

RE-orientierte Modellentwicklung mit LLM-Unterstützung und Deduktionsbasierter Verifizierung

与LLLM支助和基于减税的核查 2506.08606v1

Authors (1): Radoslaw Klimek

The requirements engineering (RE) phase is pivotal in developing high-quality software. Integrating advanced modelling techniques with large language models (LLMs) and formal verification in a logical style can significantly enhance this process. We propose a comprehensive framework that focuses on specific Unified Modelling Language (UML) diagrams for preliminary system development. This framework offers visualisations at various modelling stages and seamlessly integrates large language models and logical reasoning engines. The behavioural models generated with the assistance of LLMs are automatically translated into formal logical specifications. Deductive formal verification ensures that logical requirements and interrelations between software artefacts are thoroughly addressed. Ultimately, the framework facilitates the automatic generation of program skeletons, streamlining the transition from design to implementation.

要求工程(RE)阶段是开发高质量软件的关键。将先进的建模技术与大型语言模型(LLMS)和逻辑风格的正式核查相结合,可以大大加强这一进程。我们提议了一个综合框架,侧重于用于初步系统开发的具体的统一建模语言图(UML),该框架提供不同建模阶段的可视化,无缝地整合大型语言模型和逻辑推理引擎。在LLMS协助下生成的行为模型自动转化为正式的逻辑规格。典型的正式核查确保了软件工艺品之间的逻辑要求和相互关系得到彻底的处理。最终,该框架便利了程序骨架的自动生成,简化了从设计到执行的过渡过程。

Article 85

Title@2025-06-10 (2): Evaluating the Performance and Efficiency of Sentence-BERT for Code Comment Classification

Title: Evaluating the Performance and Efficiency of Sentence-BERT for Code Comment Classification

Bewertung der Leistung und Effizienz von Sentence-BERT für die Klassifizierung von Code Comment

评估Sentence-BERT在代码注释分类中的性能与效率 2506.08581v1

Authors (2): Fabian C. Peña, Steffen Herbold

This work evaluates Sentence-BERT for a multi-label code comment classification task, seeking to maximize classification performance while controlling efficiency constraints during inference. Using a dataset of 13,216 labeled comment sentences, Sentence-BERT models are fine-tuned and combined with different classification heads to recognize comment types. While larger models outperform smaller ones in terms of F1, the latter offer outstanding efficiency, both in runtime and GFLOPS. As a result, a balance between a reasonable F1 improvement (+0.0346) and a minimal efficiency degradation (+1.4x in runtime and +2.1x in GFLOPS) is reached.

这项工作评估了多标签编码评论分类任务的判决-BERT, 目的是最大限度地提高分类性能,同时控制推论期间的效率限制。使用13 216条标签评论性句子的数据集,判决-BERT模型经过微调,并与不同的分类负责人合并,以识别评论类型。在F1方面,较大的模型比F1小,后者在运行时和GFLOPS方面都表现优异。因此,在合理的F1改进(+0.0346)和最低效率降解(运行时+1.4x和GFLOPS中+2.1x)之间实现了平衡。

Article 86

Title@2025-06-10 (2): IssueCourier: Multi-Relational Heterogeneous Temporal Graph Neural Network for Open-Source Issue Assignment

Title: IssueCourier: Multi-Relational Heterogeneous Temporal Graph Neural Network for Open-Source Issue Assignment

IssueCourier: Multi-Relationale Heterogene Zeitliche Graphen-Neural-Netzwerk für Open-Source-Ausgabezuweisung

问题压力:开放源码问题任务多关系性不同时代不同形态图质神经网络 2505.11205v3

Authors (5): Chunying Zhou, Xiaoyuan Xie, Gong Chen, Peng He, Bing Li

Issue assignment plays a critical role in open-source software (OSS) maintenance, which involves recommending the most suitable developers to address the reported issues. Given the high volume of issue reports in large-scale projects, manually assigning issues is tedious and costly. Previous studies have proposed automated issue assignment approaches that primarily focus on modeling issue report textual information, developers’ expertise, or interactions between issues and developers based on historical issue-fixing records. However, these approaches often suffer from performance limitations due to the presence of incorrect and missing labels in OSS datasets, as well as the long tail of developer contributions and the changes in developer activity as the project evolves. To address these challenges, we propose IssueCourier, a novel Multi-Relational Heterogeneous Temporal Graph Neural Network approach for issue assignment. Specifically, we formalize five key relationships among issues, developers, and source code files to construct a heterogeneous graph. Then, we further adopt a temporal slicing technique that partitions the graph into a sequence of time-based subgraphs to learn stage-specific patterns. Furthermore, we provide a benchmark dataset with relabeled ground truth to address the problem of incorrect and missing labels in existing OSS datasets. Finally, to evaluate the performance of IssueCourier, we conduct extensive experiments on our benchmark dataset. The results show that IssueCourier can improve over the best baseline by up to 45.49% in top-1 and 31.97% in MRR.

问题任务在开放源码软件(OSS)维护中发挥着关键作用,这包括建议最合适的开发者解决所报告的问题。鉴于大型项目中问题报告数量庞大,人工分配问题既繁琐又昂贵。先前的研究提出了自动问题分配方法,主要侧重于问题报告模板、文本信息、开发者的专门知识,或基于历史问题固定记录的问题与开发者之间的互动。然而,这些方法往往由于开放源码软件数据集中存在错误和缺失的标签,以及开发者贡献和开发者活动随着项目演变变化而变化的漫长尾巴而存在绩效限制。为了应对这些挑战,我们提议采用 “ 问题库 “ 方案,即新的多关系热性热层结构神经网络方法。具体地说,我们正式确定问题、开发者和源代码文档之间的五个关键关系,以构建一个混杂的图表。然后,我们进一步采用时间分解技术,将图表分成一个基于时间的子图序列,并随着项目的发展变化而改变开发者活动。此外,我们提出一个基准数据设置基准数据集,用新的多变现的真相基准,我们最后在数据库中标定了对当前数据进行不正确的数据标签。
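
The temporal slicing step described above can be pictured as partitioning timestamped edges into consecutive windows. The following is a minimal sketch under stated assumptions: equal-width windows, edges stored as `(source, relation, target, timestamp)` tuples, and hypothetical relation names. The abstract does not name the five relationships or say how the slices are chosen, so none of this should be read as IssueCourier's implementation.

```python
from collections import defaultdict

def slice_by_time(edges, num_slices):
    """Partition timestamped edges into num_slices equal-width time windows."""
    times = [t for (_, _, _, t) in edges]
    start, end = min(times), max(times)
    # Fall back to a width of 1.0 when every edge has the same timestamp.
    width = (end - start) / num_slices or 1.0
    slices = defaultdict(list)
    for src, rel, dst, t in edges:
        # Clamp so the latest edge lands in the last window.
        idx = min(int((t - start) / width), num_slices - 1)
        slices[idx].append((src, rel, dst))
    return [slices[i] for i in range(num_slices)]

# Hypothetical heterogeneous edges among issues, developers, and files.
edges = [
    ("issue1", "reported_by", "alice", 0),
    ("issue1", "touches", "a.py", 1),
    ("issue2", "assigned_to", "bob", 5),
    ("issue3", "assigned_to", "alice", 10),
]
subgraphs = slice_by_time(edges, num_slices=2)
for i, sub in enumerate(subgraphs):
    print(i, sub)
```

Each resulting sublist could then be treated as one time-based subgraph, so that a model sees early-stage and late-stage activity separately.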

Article 87

Title@2025-06-10 (2): Enhancing Open-Domain Task-Solving Capability of LLMs via Autonomous Tool Integration from GitHub

Title: Enhancing Open-Domain Task-Solving Capability of LLMs via Autonomous Tool Integration from GitHub

Verbesserung der Open-Domain-Task-Solving-Kapazität von LLMs durch autonome Tool-Integration von GitHub

通过GitHub的自主工具集成,加强LLMs的开放域任务调整能力 2312.17294v3

Authors (12): Bohan Lyu, Xin Cong, Heyang Yu, Pan Yang, Yujia Qin, Yining Ye, Yaxi Lu, Zhong Zhang, Yukun Yan, Yankai Lin, Zhiyuan Liu, Maosong Sun

Large Language Models (LLMs) excel in traditional natural language processing tasks but struggle with problems that require complex domain-specific calculations or simulations. While equipping LLMs with external tools to build LLM-based agents can enhance their capabilities, existing approaches lack the flexibility to address diverse and ever-evolving user queries in open domains. Currently, there is also no dataset that evaluates LLMs on open-domain knowledge that requires tools to solve. To this end, we introduce the OpenAct benchmark to evaluate open-domain task-solving capability, which is built on human expert consultation and repositories in GitHub. It comprises 339 questions spanning 7 diverse domains that need to be solved with domain-specific methods. In our experiments, even state-of-the-art LLMs and LLM-based agents demonstrate unsatisfactory success rates, underscoring the need for a novel approach. Furthermore, we present OpenAgent, a novel LLM-based agent system that can tackle evolving queries in open domains through autonomously integrating specialized tools from GitHub. OpenAgent employs 1) a hierarchical framework where specialized agents handle specific tasks and can assign tasks to subordinate agents, and 2) a bi-level experience learning mechanism to learn from both humans’ and its own experiences to tackle tool flaws. Experiments demonstrate its superior effectiveness and efficiency, significantly outperforming baselines. Our data and code are open-source at https://github.com/OpenBMB/OpenAct.

大型语言模型(LLMS)在传统自然语言处理任务方面非常出色,但在解决需要复杂领域特定计算或模拟的问题方面却十分困难。LLMS在为建立LLM代理机构提供外部工具以提升其能力的同时,现有方法缺乏灵活性,无法在开放域处理多样化和不断变化的用户查询。目前,也没有现有数据集来评估开放域知识方面的LLMS,需要工具解决。为此,我们引入了OpenAct基准,以评价开放域任务解决能力,该基准建在GitHub的人类专家咨询和储存库上。它包含339个问题,涉及7个不同领域,需要用特定域方法加以解决。在我们的实验中,即使是最先进的LLMS和LLMT代理机构也缺乏灵活性,这突出表明了新颖的LLMMT代理系统需要解决。我们介绍了一个新型的LMT代理系统,它可以通过自主地整合GitHub的专门工具来解决开放域不断演变的查询问题。OpenAgency 使用一个等级框架,专门人员可以将具体任务和任务指派给低级B代理机构,在内部数据库/内部数据库中学习工具。2个基础,从内部数据库学习工具,从内部学习工具,从内部学习工具,从内部标准到基础到内部学习。

Article 88

Title@2025-06-10 (2): Understanding Software Engineering Agents Through the Lens of Traceability: An Empirical Study

Title: Understanding Software Engineering Agents Through the Lens of Traceability: An Empirical Study

Software Engineering-Agenten durch die Spurweite der Rückverfolgbarkeit verstehen: Eine empirische Studie

通过可追踪通道了解软件工程剂:经验研究 2506.08311v1

Authors (6): Ira Ceka, Saurabh Pujar, Shyam Ramji, Luca Buratti, Gail Kaiser, Baishakhi Ray

With the advent of large language models (LLMs), software engineering agents (SWE agents) have emerged as a powerful paradigm for automating a range of software tasks – from code generation and repair to test case synthesis. These agents operate autonomously by interpreting user input and responding to environmental feedback. While various agent architectures have demonstrated strong empirical performance, the internal decision-making workflows that drive their behavior remain poorly understood. Deeper insight into these workflows holds promise for improving both agent reliability and efficiency. In this work, we present the first systematic study of SWE agent behavior through the lens of execution traces. Our contributions are as follows: (1) we propose the first taxonomy of decision-making pathways across five representative agents; (2) using this taxonomy, we identify three core components essential to agent success – bug localization, patch generation, and reproduction test generation – and study each in depth; (3) we study the impact of test generation on successful patch production and analyze strategies that can lead to successful test generation; (4) we further conduct the first large-scale code clone analysis comparing agent-generated and developer-written patches and provide a qualitative study revealing structural and stylistic differences in patch content. Together, these findings offer novel insights into agent design and open avenues for building agents that are both more effective and more aligned with human development practices.

随着大型语言模型(LLMS)的出现,软件工程代理商(SWE代理商)已成为使一系列软件任务自动化的强大范例 – – 从代码生成和修理到测试案例合成。这些代理商通过解释用户投入和对环境反馈作出反应而自主运作。虽然各种代理商结构表现出很强的经验性表现,但驱动其行为的内部决策功能仍然不甚为人知。对这些工作流程的深入深入了解对提高代理商可靠性和效率都有希望。在这项工作中,我们通过执行痕迹的透镜展示了SWE代理商行为的首次系统研究。我们的贡献如下:(1) 我们提出五个代表代理商决策路径的第一个分类学;(2) 使用这种分类学,我们确定三个对代理商成功必不可少的核心组成部分 – – 错误本地化、补接生成和复制测试生成 – – 并进行深入研究;(3) 我们研究测试生成对成功补接合生产的影响;以及分析能够导致成功测试生成的战略。(4) 我们进一步进行第一次大规模克隆代码分析,比较代理商生成和开发者撰写的补丁和开发者撰写的补丁。我们提出了五个代表商的决策路径的第一个分类;(2) 利用这种分类和再定性研究,以揭示结构和再分析人类结构与结构与结构上的差异。

Article 89

Title@2025-06-09 (1): Developer Perspectives on Licensing and Copyright Issues Arising from Generative AI for Software Development

Title: Developer Perspectives on Licensing and Copyright Issues Arising from Generative AI for Software Development

Entwickler-Perspektiven zu Lizenzierungs- und Urheberrechtsfragen, die sich aus generativen KI für die Softwareentwicklung ergeben

开发者对软件开发创创大赦国际提出的许可证发放和版权问题的看法 2411.10877v5

Authors (7): Trevor Stalnaker, Nathan Wintersgill, Oscar Chaparro, Laura A. Heymann, Massimiliano Di Penta, Daniel M German, Denys Poshyvanyk

Despite the utility that Generative AI (GenAI) tools provide for tasks such as writing code, the use of these tools raises important legal questions and potential risks, particularly those associated with copyright law. As lawmakers and regulators engage with those questions, the views of users can provide relevant perspectives. In this paper, we provide: (1) a survey of 574 developers on the licensing and copyright aspects of GenAI for coding, as well as follow-up interviews; (2) a snapshot of developers’ views at a time when GenAI and perceptions of it are rapidly evolving; and (3) an analysis of developers’ views, yielding insights and recommendations that can inform future regulatory decisions in this evolving field. Our results show the benefits developers derive from GenAI, how they view the use of AI-generated code as similar to using other existing code, the varied opinions they have on who should own or be compensated for such code, that they are concerned about data leakage via GenAI, and much more, providing organizations and policymakers with valuable insights into how the technology is being used and what concerns stakeholders would like to see addressed.

尽管创制的AI(GenAI)工具对诸如写法等任务很有用处,但使用这些工具会引起重要的法律问题和潜在风险,特别是与版权法相关的风险。当立法者和监管者参与这些问题时,用户的观点可以提供相关观点。在本文件中,我们提供了:(1) 对574名开发者进行关于GenAI的许可和版权方面的调查,以便进行编码,以及后续访谈;(2) 在GenAI迅速演变时,对开发者的观点及其看法进行简要描述;(3) 分析开发者的观点,提出见解和建议,为这个不断发展的领域的未来监管决定提供信息。我们的结果显示,开发者从GenAI获得的利益,他们如何认为AI生成的代码的使用类似于使用其他现有代码,他们对谁应该拥有或应当获得这种代码的补偿持有不同观点,他们关心通过GenAI获得的数据渗漏,而更多的是,为各组织和决策者提供宝贵的见解,说明技术是如何使用的,以及利益攸关方希望得到解决的关切问题。

Article 90

Title@2025-06-09 (1): MBTModelGenerator: A software tool for reverse engineering of Model-based Testing (MBT) models from clickstream data of web applications

Title: MBTModelGenerator: A software tool for reverse engineering of Model-based Testing (MBT) models from clickstream data of web applications

MBTModelGenerator: Ein Software-Tool für Reverse Engineering von Modellbasierten Testing (MBT)-Modellen aus Clickstream-Daten von Web-Anwendungen

MBTModelGenererator:一个软件工具,用于从网络应用的点击流数据中逆向设计基于模型的测试模型(MBT)模型 2506.08179v1

Authors (2): Sasidhar Matta, Vahid Garousi

Automated testing has become a standard practice in software engineering, yet the creation of test models and suites remains labor-intensive. To reduce this effort, we developed an open-source tool that automatically generates Model-Based Testing (MBT) models from clickstream data collected during user interaction with web applications. The tool captures UI events, transforms them into state-transition models, and exports the result in a format compatible with the GraphWalker MBT tool. This enables immediate test execution without the need for manual model creation. The approach lowers the barrier to MBT adoption by leveraging actual usage behavior and reducing the reliance on upfront modeling. This technical report documents the system requirements, design decisions, implementation details, testing process, and empirical evaluation of the tool, which is publicly available as open-source.

自动测试已成为软件工程的一个标准做法,但测试模型和套件的创建仍然耗费大量人力。为了减少这一努力,我们开发了一个开放源码工具,通过用户与网络应用程序互动过程中收集的点击流数据自动生成基于模型的测试模型。该工具捕捉了UI事件,将其转化为州过渡模式,并以与GreaphWalker MBT工具兼容的格式输出结果。这样就可以在无需手工创建模型的情况下立即进行测试。该方法通过利用实际使用行为和减少对前方模型的依赖,降低了MBT的采用障碍。该技术报告记录了系统要求、设计决定、实施细节、测试过程以及工具的经验评估,这些工具作为公开来源提供。
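
The core transformation, from recorded UI events to a state-transition model, can be illustrated with a small sketch. It assumes each click event is a `(page, action)` pair and that the page of the next event is the destination state; the page and action names are hypothetical. The output is a plain edge list, not GraphWalker's actual model file format, which the abstract does not describe.

```python
from collections import defaultdict

def build_model(sessions):
    """Derive (state, action) -> {next states} from ordered click events."""
    transitions = defaultdict(set)
    for events in sessions:
        # Pair each event with its successor; the successor's page is the target.
        for (src_page, action), (dst_page, _) in zip(events, events[1:]):
            transitions[(src_page, action)].add(dst_page)
    return transitions

# Two hypothetical recorded sessions; the last event of each has no action.
sessions = [
    [("home", "click_login"), ("login", "submit"),
     ("dashboard", "logout"), ("home", None)],
    [("home", "click_login"), ("login", "submit"),
     ("login", "submit"), ("dashboard", None)],
]

model = build_model(sessions)
edge_list = sorted(
    (src, action, dst)
    for (src, action), targets in model.items()
    for dst in targets
)
for edge in edge_list:
    print(edge)
```

In this toy data the `submit` action on the login page leads to two different states (a failed and a successful login), which is the kind of branching a usage-derived model can capture without manual modeling.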

Article 91

Title@2025-06-09 (1): Repeton: Structured Bug Repair with ReAct-Guided Patch-and-Test Cycles

Title: Repeton: Structured Bug Repair with ReAct-Guided Patch-and-Test Cycles

Repeton: Strukturierte Fehler-Reparatur mit ReAct-geführten Patch-and-Test-Zyklen

Repeton: 结构化的错误修复, 使用重新操作指导的补丁和测试周期 2506.08173v1

Authors (4): Nguyen Phu Vinh, Anh Chung Hoang, Chris Ngo, Truong-Son Hy

Large Language Models (LLMs) have shown strong capabilities in code generation and comprehension, yet their application to complex software engineering tasks often suffers from low precision and limited interpretability. We present Repeton, a fully open-source framework that leverages LLMs for precise and automated code manipulation in real-world Git repositories. Rather than generating holistic fixes, Repeton operates through a structured patch-and-test pipeline: it iteratively diagnoses issues, proposes code changes, and validates each patch through automated testing. This stepwise process is guided by lightweight heuristics and development tools, avoiding reliance on embedding-based retrieval systems. Evaluated on the SWE-bench Lite benchmark, our method shows good performance compared to RAG-based methods in both patch validity and interpretability. By decomposing software engineering tasks into modular, verifiable stages, Repeton provides a practical path toward scalable and transparent autonomous debugging.

大型语言模型(LLMs)在代码生成和理解方面表现出很强的能力,然而,它们在复杂软件工程任务中的应用往往受到低精度和有限解释的影响。我们介绍了一个完全开放的源码框架雷佩顿(Repeton ) , 这个框架利用LLMs在现实世界的 Git 仓库中进行精确和自动的代码操作。 Repeton 不是产生整体的修补,而是通过结构化的补丁和测试管道运作:它反复诊断问题,提出代码修改,并通过自动测试验证每个补丁。这个渐进式过程以轻量超常和开发工具为指导,避免依赖嵌入式的检索系统。根据SWE-bench Lite 基准,我们的方法在补丁和可解释性两方面都与RAG为基础的方法相比表现良好。通过将软件工程任务分解成模块、可核查的阶段,Repeton 提供了一条通往可扩展和透明的自动调试的实用路径。
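
The patch-and-test cycle described above follows a simple control loop: run the tests, feed the failure back, propose a change, and repeat. The sketch below shows only that loop. The callables `propose_patch` and `run_tests` are hypothetical stand-ins (a string replacement plays the role of the LLM), and nothing here reflects Repeton's actual code or prompts.

```python
def patch_and_test(code, propose_patch, run_tests, max_rounds=5):
    """Alternate testing and patching until tests pass or rounds run out."""
    history = []
    for round_no in range(max_rounds + 1):
        ok, feedback = run_tests(code)
        history.append((round_no, ok, feedback))
        if ok:
            return code, history
        if round_no == max_rounds:
            break
        code = propose_patch(code, feedback)
    return None, history

def run_tests(code):
    """Toy test harness: load the code and check one expected result."""
    namespace = {}
    exec(code, namespace)
    result = namespace["add"](2, 3)
    if result == 5:
        return True, "all tests passed"
    return False, f"add(2, 3) returned {result}, expected 5"

def propose_patch(code, feedback):
    """Stand-in for an LLM call: applies one fixed textual edit."""
    return code.replace("a - b", "a + b")

buggy = "def add(a, b):\n    return a - b\n"
fixed, history = patch_and_test(buggy, propose_patch, run_tests)
print(fixed)
print(history)
```

Keeping the per-round history is one way to obtain the step-by-step trace that makes such a pipeline easier to inspect than a single holistic fix.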

Article 92

Title@2025-06-09 (1): Worst-Case Symbolic Constraints Analysis and Generalisation with Large Language Models

Title: Worst-Case Symbolic Constraints Analysis and Generalisation with Large Language Models

Worst-Case-Symbolische Einschränkungen Analyse und Generalisierung mit großen Sprachmodellen

分析并推广大语言模式 2506.08171v1

Authors (5): Daniel Koh, Yannic Noller, Corina S. Pasareanu, Adrians Skapars, Youcheng Sun

Large language models (LLMs) have been successfully applied to a variety of coding tasks, including code generation, completion, and repair. However, more complex symbolic reasoning tasks remain largely unexplored by LLMs. This paper investigates the capacity of LLMs to reason about worst-case executions in programs through symbolic constraints analysis, aiming to connect LLMs and symbolic reasoning approaches. Specifically, we define and address the problem of worst-case symbolic constraints analysis as a measure to assess the comprehension of LLMs. We evaluate the performance of existing LLMs on this novel task and further improve their capabilities through symbolic reasoning-guided fine-tuning, grounded in SMT (Satisfiability Modulo Theories) constraint solving and supported by a specially designed dataset of symbolic constraints. Experimental results show that our solver-aligned model, WARP-1.0-3B, consistently surpasses size-matched and even much larger baselines, demonstrating that a 3B LLM can recover the very constraints that pin down an algorithm’s worst-case behaviour through reinforcement learning methods. These findings suggest that LLMs are capable of engaging in deeper symbolic reasoning, supporting a closer integration between neural network-based learning and formal methods for rigorous program analysis.

大型语言模型(LLMS)被成功地应用于各种编码任务,包括代码生成、完成和修理,然而,更复杂的象征性推理任务基本上仍未由LLMS探讨。本文调查LLMS通过象征性约束分析、旨在连接LLMS和象征性推理方法,在方案内解释最坏案例处决的能力,具体地说,我们界定和处理最坏案例象征性制约分析问题,作为评估LLMS理解度的一项措施。我们评估现有LLMS在这项新任务上的绩效,并通过象征性推理引导的微调进一步提高其能力,这种微调以SMT(满足性莫杜洛理论)为基础,解决制约因素,并辅之以专门设计的象征性制约数据集。实验结果表明,我们SlomMS-grad模式WARP-1.0-3B(WARP-1.0-3B)始终超过尺寸,甚至大得多的基线,表明3BLMM公司可以通过强化学习方法,恢复压低算法最坏案例行为的非常困难。这些结论表明LMS能够进行更深入的象征性推理,支持严格的网络学习和正式分析。

Article 93

Title@2025-06-09 (1): A Metrics-Oriented Architectural Model to Characterize Complexity on Machine Learning-Enabled Systems

Title: A Metrics-Oriented Architectural Model to Characterize Complexity on Machine Learning-Enabled Systems

Ein metrisch ausgerichtetes architektonisches Modell zur Charakterisierung von Komplexität auf maschinell lernfähigen Systemen

以计量为主的建筑建筑模型,以明确机械学习系统的复杂性 2506.08153v1

Authors (1): Renato Cordeiro Ferreira

How can the complexity of ML-enabled systems be managed effectively? The goal of this research is to investigate how complexity affects ML-Enabled Systems (MLES). To address this question, this research aims to introduce a metrics-based architectural model to characterize the complexity of MLES. The goal is to support architectural decisions, providing a guideline for the inception and growth of these systems. This paper showcases the first step for creating the metrics-based architectural model: an extension of a reference architecture that can describe MLES to collect their metrics.

如何有效地管理由ML支持的系统的复杂性?这项研究的目的是调查复杂性如何影响ML-Enabled系统。为解决这一问题,这项研究旨在引入一个基于标准的建筑模型,以描述MLES的复杂性。目的是支持建筑决策,为这些系统的启动和成长提供指南。本文展示了创建基于标准的建筑模型的第一步:扩展一个参考结构,可以描述MLES来收集其测量数据。

Article 94

Title@2025-06-09 (1): From Output to Evaluation: Does Raw Instruction-Tuned Code LLMs Output Suffice for Fill-in-the-Middle Code Generation?

Title: From Output to Evaluation: Does Raw Instruction-Tuned Code LLMs Output Suffice for Fill-in-the-Middle Code Generation?

Von der Ausgabe bis zur Auswertung: Reicht die Ausgabe von LLMs mit rohem Instruktionscode für die Generierung von Fill-in-the-Middle Code aus?

从输出到评价:原始指令-指令代码LLMs 输出足量是否用于中代代号的填充? 2505.18789v2

Authors (3): Wasi Uddin Ahmad, Somshubra Majumdar, Boris Ginsburg

Post-processing is crucial for the automatic evaluation of LLMs in fill-in-the-middle (FIM) code generation due to the frequent presence of extraneous code in raw outputs. This extraneous generation suggests a lack of awareness regarding output boundaries, requiring truncation for effective evaluation. The determination of an optimal truncation strategy, however, often proves intricate, particularly when the scope includes several programming languages. This study investigates the necessity of post-processing instruction-tuned LLM outputs. Our findings reveal that supervised fine-tuning significantly enhances FIM code generation, enabling LLMs to generate code that seamlessly integrates with the surrounding context. Evaluating our fine-tuned Qwen2.5-Coder (base and instruct) models on HumanEval Infilling and SAFIM benchmarks demonstrates improved performance without post-processing, especially when the middle consists of complete lines. However, post-processing of the LLM outputs remains necessary when the middle is a random span of code.

由于原始产出中经常存在外来代码,因此,后处理对于自动评价中填(FIM)代码生成中的LLMs至关重要。这一外代人表示对产出界限缺乏认识,需要缩短时间才能进行有效评价。然而,确定最佳脱节战略往往证明是复杂的,特别是当范围包括几种编程语言时。本研究调查了处理后指示调整LM产出的必要性。我们的调查结果显示,受监督的微调大大加强了FIM代码生成,使LMs能够生成与周围环境无缝结合的代码。评估我们经过微调的 \ texttwen2.5-Coder} (基准和指示) 人类 Eval 填充和SAFIM 基准模型显示,在不处理后,特别是当\emph{middr} 由完整行组成时,业绩得到改进。然而,当\emph{mdr}是一个随机的代码范围时,LM产出的后处理仍是必要的。
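
To make the truncation problem concrete, here is one simple post-processing heuristic: cut the generated middle at the first line that repeats the start of the known suffix. This is an illustrative assumption, not the truncation strategy evaluated in the paper, and the function name `truncate_middle` is made up for this sketch.

```python
def truncate_middle(generated, suffix):
    """Drop everything from the first line that repeats the suffix's first line."""
    first_suffix_line = next(
        (line for line in suffix.splitlines() if line.strip()), None
    )
    if first_suffix_line is None:
        return generated  # empty suffix: nothing to align against
    kept = []
    for line in generated.splitlines():
        if line.strip() == first_suffix_line.strip():
            break  # the model has started regenerating the suffix
        kept.append(line)
    return "\n".join(kept)

prefix = "def area(w, h):\n"
suffix = "    return result\n"
# A raw output that overshoots the middle and repeats the suffix.
raw_output = "    result = w * h\n    return result\n\nprint(area(2, 3))\n"

middle = truncate_middle(raw_output, suffix)
print(prefix + middle + "\n" + suffix)
```

A line-based rule like this only helps when the middle is made of complete lines; for a random span inside a line, the boundary is harder to find, which matches the case where the abstract says post-processing remains necessary.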

Article 95

Title@2025-06-09 (1): Adversarial Attack Classification and Robustness Testing for Large Language Models for Code

Title: Adversarial Attack Classification and Robustness Testing for Large Language Models for Code

Adversariale Angriffsklassifikation und Robustheitsprüfung für große Sprachmodelle für Code

守则大语言模型对反攻击分类和强力测试 2506.07942v1

Authors (4): Yang Liu, Armstrong Foundjem, Foutse Khomh, Heng Li

Large Language Models (LLMs) have become vital tools in software development tasks such as code generation, completion, and analysis. As their integration into workflows deepens, ensuring robustness against vulnerabilities, especially those triggered by diverse or adversarial inputs, becomes increasingly important. Such vulnerabilities may lead to incorrect or insecure code generation when models encounter perturbed task descriptions, code, or comments. Prior research often overlooks the role of natural language in guiding code tasks. This study investigates how adversarial perturbations in natural language inputs, including prompts, comments, and descriptions, affect LLMs for Code (LLM4Code). It examines the effects of perturbations at the character, word, and sentence levels to identify the most impactful vulnerabilities. We analyzed multiple projects (e.g., ReCode, OpenAttack) and datasets (e.g., HumanEval, MBPP), establishing a taxonomy of adversarial attacks. The first dimension classifies the input type (code, prompts, or comments), while the second dimension focuses on granularity: character, word, or sentence-level changes. We adopted a mixed-methods approach, combining quantitative performance metrics with qualitative vulnerability analysis. LLM4Code models show varying robustness across perturbation types. Sentence-level attacks were least effective, suggesting models are resilient to broader contextual changes. In contrast, word-level perturbations posed serious challenges, exposing semantic vulnerabilities. Character-level effects varied, showing model sensitivity to subtle syntactic deviations. Our study offers a structured framework for testing LLM4Code robustness and emphasizes the critical role of natural language in adversarial evaluation. Improving model resilience to semantic-level disruptions is essential for secure and reliable code-generation systems.

大型语言模型(LLMS) 已成为软件开发任务(如代码生成、完成和分析)中的重要工具。随着它们融入工作流程的深度,确保抵御脆弱性的稳健性,特别是因多种或对抗性投入而引发的脆弱性,变得日益重要。当模型遇到扭曲的任务描述、代码或评论时,这些脆弱性可能导致代码生成不正确或不安全。先前的研究往往忽略自然语言在指导代码任务中的作用。本研究调查自然语言投入(包括提示、评论和描述)的对抗性干扰如何影响代码( LLLM4Code) 的可靠性(LLLM4Code) 。当它们融入工作流程时,确保应对脆弱性的稳健性,特别是因不同或敌对性投入性投入而引发的脆弱性。当第二维度(LLLM4 Code) 时, 审视在字符、字型和句级层次上发生的干扰效应。我们分析了多个项目(例如Recodedecode、 OploadAttlemental ) 以及数据模型的稳性测试模式, 以精确性模型的形式显示。
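
The three granularity levels in the taxonomy can be illustrated with one toy perturbation each. These are deliberately simple, deterministic examples written for this digest; the synonym table and distractor sentence are hypothetical, and real attack toolkits such as those named in the abstract use far richer transformations.

```python
def char_perturb(text, index):
    """Character level: swap two adjacent characters (a typo)."""
    chars = list(text)
    chars[index], chars[index + 1] = chars[index + 1], chars[index]
    return "".join(chars)

# Hypothetical substitution table for the word-level example.
SYNONYMS = {"return": "give back", "list": "array"}

def word_perturb(text, synonyms=SYNONYMS):
    """Word level: replace selected words with near-synonyms."""
    return " ".join(synonyms.get(word, word) for word in text.split())

def sentence_perturb(text, distractor="Please answer carefully."):
    """Sentence level: append a sentence that adds no task information."""
    return f"{text} {distractor}"

prompt = "Write a function to return the sorted list"
variants = {
    "character": char_perturb(prompt, 0),
    "word": word_perturb(prompt),
    "sentence": sentence_perturb(prompt),
}
for level, text in variants.items():
    print(f"{level}: {text}")
```

A robustness test would then send the original prompt and each variant to the model under test and compare the correctness of the generated code.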

Article 96

Title@2025-06-09 (1): Can Hessian-Based Insights Support Fault Diagnosis in Attention-based Models?

Title: Can Hessian-Based Insights Support Fault Diagnosis in Attention-based Models?

Können Hessian-Based Insights Fehlerdiagnosen in aufmerksamkeitsbasierten Modellen unterstützen?

以海珊为基地的洞察能支持以关注为基础的模型中的过失诊断吗? 2506.07871v1

Authors (2): Sigma Jahan, Mohammad Masudur Rahman

As attention-based deep learning models scale in size and complexity, diagnosing their faults becomes increasingly challenging. In this work, we conduct an empirical study to evaluate the potential of Hessian-based analysis for diagnosing faults in attention-based models. Specifically, we use Hessian-derived insights to identify fragile regions (via curvature analysis) and parameter interdependencies (via parameter interaction analysis) within attention mechanisms. Through experiments on three diverse models (HAN, 3D-CNN, DistilBERT), we show that Hessian-based metrics can localize instability and pinpoint fault sources more effectively than gradients alone. Our empirical findings suggest that these metrics could significantly improve fault diagnosis in complex neural architectures, potentially improving software debugging practices.

随着基于关注的深层次学习模型规模和复杂性的扩大,诊断其缺陷变得日益具有挑战性。在这项工作中,我们开展了一项实证研究,以评价海珊基于分析在基于关注的模型中诊断缺陷的可能性。具体地说,我们利用海珊派来的洞察力,在关注机制内确定脆弱区域(通过曲线分析)和参数相互依存性(通过参数互动分析),通过对三种不同模型(韩、3D-CNN、DistillBERT)的实验,我们表明海珊派生的计量方法可以将不稳定地方化,并比单是梯度就能更有效地确定断层源。我们的经验发现,这些计量方法可以大大改善复杂神经结构中的缺陷诊断,并有可能改进软件调试做法。
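
As a rough intuition for the two kinds of Hessian-derived signals mentioned above, the sketch below estimates the Hessian of a toy two-parameter loss by central finite differences. Reading the diagonal as curvature and the off-diagonal as parameter interaction is an interpretation offered here; the paper's actual metrics are not given in the abstract, and the loss function is invented for the example.

```python
def loss(w):
    """Toy quadratic loss; its exact Hessian is [[1, 2], [2, 10]]."""
    return 0.5 * w[0] ** 2 + 5.0 * w[1] ** 2 + 2.0 * w[0] * w[1]

def numerical_hessian(f, w, h=1e-3):
    """Central finite-difference estimate of the Hessian of f at w."""
    n = len(w)
    hessian = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            def shifted(di, dj):
                point = list(w)
                point[i] += di * h
                point[j] += dj * h
                return f(point)
            hessian[i][j] = (
                shifted(1, 1) - shifted(1, -1) - shifted(-1, 1) + shifted(-1, -1)
            ) / (4 * h * h)
    return hessian

H = numerical_hessian(loss, [1.0, 1.0])
curvature = [H[i][i] for i in range(2)]  # large value: steep direction
interaction = H[0][1]                    # nonzero: the two parameters are coupled
print(curvature, interaction)
```

In this example the second parameter has ten times the curvature of the first, so a small change to it moves the loss much more, which is the sense in which a region could be called fragile. Gradients alone would not reveal the coupling term.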

Article 97

Title@2025-06-09 (1): Execution-Aware Program Reduction for WebAssembly via Record and Replay

Title: Execution-Aware Program Reduction for WebAssembly via Record and Replay

Execution-Aware Programmreduktion für WebAssembly über Aufzeichnung und Wiedergabe

通过录制和重放减少网络摄像头的执行软件程序 2506.07834v1

Authors (5): Doehyun Baek, Daniel Lehmann, Ben L. Titzer, Sukyoung Ryu, Michael Pradel

WebAssembly (Wasm) programs may trigger bugs in their engine implementations. To aid debugging, program reduction techniques try to produce a smaller variant of the input program that still triggers the bug. However, existing execution-unaware program reduction techniques struggle with large and complex Wasm programs, because they rely on static information and apply syntactic transformations, while ignoring the valuable information offered by the input program’s execution behavior. We present RR-Reduce and Hybrid-Reduce, novel execution-aware program reduction techniques that leverage execution behaviors via record and replay. RR-Reduce identifies a bug-triggering function as the target function, isolates that function from the rest of the program, and generates a reduced program that replays only the interactions between the target function and the rest of the program. Hybrid-Reduce combines a complementary execution-unaware reduction technique with RR-Reduce to further reduce program size. We evaluate RR-Reduce and Hybrid-Reduce on 28 Wasm programs that trigger a diverse set of bugs in three engines. On average, RR-Reduce reduces the programs to 1.20 percent of their original size in 14.5 minutes, which outperforms the state of the art by 33.15 times in terms of reduction time. Hybrid-Reduce reduces the programs to 0.13 percent of their original size in 3.5 hours, which outperforms the state of the art by 3.42 times in terms of reduced program size and 2.26 times in terms of reduction time. We envision RR-Reduce as the go-to tool for rapid, on-demand debugging in minutes, and Hybrid-Reduce for scenarios where developers require the smallest possible programs.

WebAssembly (Wasm) 程序可能会引发引擎执行中的错误。为了帮助调试, 程序减少技术会试图生成一个较小的输入程序变体, 但仍触发错误。但是, 现有的执行- unawar 程序减少技术会与大型和复杂的Wasm 程序争斗, 因为它们依赖静态信息并应用合成变换, 而忽略了输入程序执行行为提供的宝贵信息。我们展示了 RR- Red 和混合- Red 、利用记录和重放执行行为的新型执行- 识读程序减少技术。 RR- Red 显示, 将一个错误触发程序变小的功能作为目标功能, 分离出该程序与程序其余部分的功能, 因为它们依赖静态信息, 并使用 RDR- Renformal 程序减少3 。平均情况下, 将原始的 RR- 20 的变式程序压缩到正常时间。
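
For context, the general idea of program reduction (shrink the input while the bug still triggers) can be sketched as a greedy removal loop. This is a generic, execution-unaware baseline in the spirit of the techniques the abstract contrasts against; it is not RR-Reduce or Hybrid-Reduce, and the function names and oracle are hypothetical.

```python
def reduce_program(items, triggers_bug):
    """Greedily drop items one at a time while the bug still reproduces."""
    current = list(items)
    changed = True
    while changed:
        changed = False
        for i in range(len(current)):
            candidate = current[:i] + current[i + 1:]
            if triggers_bug(candidate):
                current = candidate
                changed = True
                break  # restart the scan on the smaller program
    return current

# Hypothetical program modeled as a list of function names.
functions = ["init", "log", "parse", "crash", "cleanup"]

def triggers_bug(program):
    """Toy oracle: the bug needs both 'parse' and 'crash' to be present."""
    return "parse" in program and "crash" in program

reduced = reduce_program(functions, triggers_bug)
print(reduced)
```

Each oracle call stands for re-running the engine on a candidate program, which is why purely syntactic reducers become slow on large inputs and why isolating the bug-triggering function up front can save time.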

Article 98

Title@2025-06-09 (1): Subgraph-Oriented Testing for Deep Learning Libraries

Title: Subgraph-Oriented Testing for Deep Learning Libraries

Subgraph-orientierte Tests für Deep-Learning-Bibliotheken

深深学习图书馆以Sub图为主的测试 2412.06430v2

Authors (4): Xiaoyuan Xie, Yan Song, Songqiang Chen, Jinfu Chen

Deep Learning (DL) libraries, such as PyTorch, are widely used for building and deploying DL models on various hardware platforms. Meanwhile, they are found to contain bugs that lead to incorrect calculation results and cause issues like non-convergent training and inaccurate predictions of DL models. Thus, many efforts have been made to test DL libraries and reveal bugs. However, existing DL library testing methods have limitations: model-level testing methods complicate fault localization. Meanwhile, API-level testing methods often generate invalid inputs or primarily focus on extreme inputs that lead to crash failures; they also ignore testing realistic API interactions. These limitations may lead to missed bugs, even in frequently used APIs. To address these limitations, we propose SORT (Subgraph-Oriented Realistic Testing) to differentially test DL libraries on different hardware platforms. SORT takes popular API interaction patterns, represented as frequent subgraphs of model computation graphs, as test subjects. In this way, it introduces realistic API interaction sequences while maintaining efficiency in locating faulty APIs for observed errors. Besides, SORT prepares test inputs by referring to extensive features of runtime inputs for each API in executing real-life benchmark data. The generated inputs are expected to better simulate such valid real inputs and reveal bugs more likely to happen in real-life usage. Evaluation on 728 frequent subgraphs of 49 popular PyTorch models demonstrates that SORT achieves a 100% valid input generation rate, detects more precision bugs than existing methods, and reveals interaction-related bugs missed by single-API testing. 18 precision bugs in PyTorch are identified.

PyTorrch 等深度学习(DL) 图书馆,例如 PyTorrch , 被广泛用于在不同硬件平台上建立和部署 DL 模型。与此同时, API 级测试方法往往产生无效输入, 或主要侧重于导致崩溃失败的极端输入; 它们也忽略了现实的 API 互动。这些限制可能导致错误检测错误, 甚至经常使用的 API 模式。因此, 已经做出许多努力测试 DL 库和披露错误。然而, 现有的 DL 图书馆测试方法表现出局限性: 模型级测试方法导致本地化的复杂。与此同时, API 级测试方法往往产生无效的投入, 或者主要侧重于导致崩溃失败的极端输入; 它们忽略了现实的 API 互动。这些限制可能导致错误的检测错误, 甚至在经常使用的 API 中, 也可能导致错误的错误的错误。此外, SORT IPA (Sub- Orent Reportational ) 将测试的宏大投入用于不同的硬件平台。 Serview 。 Seral 。 Serviewal disal 。 Serviews 和 erviewdal eral roduews roduces
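
Differential testing of a small subgraph can be sketched without any DL library: run the same operator sequence on two backends and flag inputs where the results diverge beyond a tolerance. In this toy version the "subgraph" is sum followed by ReLU, and the two backends are Python's built-in `sum` and `math.fsum`, standing in for two hardware platforms. All names are assumptions for illustration; this is not SORT's implementation.

```python
import math

def subgraph_backend_a(values):
    """sum -> relu, using naive left-to-right float accumulation."""
    return max(sum(values), 0.0)

def subgraph_backend_b(values):
    """sum -> relu, using exactly rounded summation."""
    return max(math.fsum(values), 0.0)

def differential_test(inputs, tolerance=1e-6):
    """Return the inputs on which the two backends disagree."""
    mismatches = []
    for values in inputs:
        out_a = subgraph_backend_a(values)
        out_b = subgraph_backend_b(values)
        if abs(out_a - out_b) > tolerance:
            mismatches.append((values, out_a, out_b))
    return mismatches

test_inputs = [
    [1.0, 2.0, 3.0],       # benign input: both backends agree
    [1e16, 1.0, -1e16],    # cancellation: naive summation loses the 1.0
]
mismatches = differential_test(test_inputs)
print(mismatches)
```

The second input is valid rather than extreme in the crash-inducing sense, yet it exposes a precision difference, which is the class of bug the abstract reports finding.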

Article 99

Title@2025-06-09 (1): Towards a Small Language Model Lifecycle Framework

Title: Towards a Small Language Model Lifecycle Framework

Auf dem Weg zu einem Rahmen für den Lebenszyklus eines kleinen Sprachmodells

建立一个小型语言模拟生命周期框架 2506.07695v1

Authors (4): Parsa Miraghaei, Sergio Moreschini, Antti Kolehmainen, David Hästbacka

Background: The growing demand for efficient and deployable language models has led to increased interest in Small Language Models (SLMs). However, existing research remains fragmented, lacking a unified lifecycle perspective. Objective: This study aims to define a comprehensive lifecycle framework for SLMs by synthesizing insights from academic literature and practitioner sources. Method: We conducted a comprehensive survey of 36 works, analyzing and categorizing lifecycle-relevant techniques. Results: We propose a modular lifecycle model structured into main, optional, and cross-cutting components. The model captures key interconnections across stages, supporting method reuse, co-adaptation, and lifecycle-awareness. Conclusion: Our framework provides a coherent foundation for developing and maintaining SLMs, bridging theory and practice, and guiding future research and tool development.

背景:对高效和可部署语言模式的需求日益增加,导致人们对小型语言模式的兴趣增加。然而,现有研究仍然支离破碎,缺乏统一的生命周期视角。目标:本研究的目的是通过综合学术文献和从业人员来源的见解,为可持续土地管理确定一个全面的生命周期框架。方法:我们对36项工程进行了全面调查,分析和分类了与生命周期有关的技术。结果:我们提出了模块生命周期模式,分为主要、可选和交叉组成部分。模型收集了各个阶段的关键相互联系,支持方法的再利用、共同适应和生命周期认识。结论:我们的框架为发展和维护可持续土地管理、过渡理论和实践提供了连贯的基础,并指导今后的研究和工具开发。

Article 100

Title@2025-06-09 (1): MacroSwarm: A Field-based Compositional Framework for Swarm Programming

Title: MacroSwarm: A Field-based Compositional Framework for Swarm Programming

MacroSwarm: Ein feldbasierter Kompositionsrahmen für die Schwarmprogrammierung

宏观群群:基于实地的蜂群方案规划组成框架 2401.10969v4

Authors (3): Gianluca Aguzzi, Roberto Casadei, Mirko Viroli

Swarm behaviour engineering is an area of research that seeks to investigate methods and techniques for coordinating computation and action within groups of simple agents to achieve complex global goals like pattern formation, collective movement, clustering, and distributed sensing. Despite recent progress in the analysis and engineering of swarms (of drones, robots, vehicles), there is still a need for general design and implementation methods and tools that can be used to define complex swarm behaviour in a principled way. To contribute to this quest, this article proposes a new field-based coordination approach, called MacroSwarm, to design and program swarm behaviour in terms of reusable and fully composable functional blocks embedding collective computation and coordination. Based on the macroprogramming paradigm of aggregate computing, MacroSwarm builds on the idea of expressing each swarm behaviour block as a pure function, mapping sensing fields into actuation goal fields, e.g., including movement vectors. In order to demonstrate the expressiveness, compositionality, and practicality of MacroSwarm as a framework for swarm programming, we perform a variety of simulations covering common patterns of flocking, pattern formation, and collective decision-making. The implications of the inherent self-stabilisation properties of field-based computations in MacroSwarm are discussed, which formally guarantee some resilience properties and guided the design of the library.

尽管最近在分析和工程群群(无人机、机器人、车辆)方面有所进展,但仍需要一般性的设计和实施方法和工具,以便以原则方式界定复杂的群温行为,为这一探索作出贡献,本篇文章提议了一种新的实地协调办法,称为宏观群温,以设计和规划在可再利用和完全可折合的功能组群嵌入集体计算和协调方面的群态行为。根据总体计算宏观方案模式,宏观群集基于将每个群集行为组(无人机、机器人、车辆)表述为纯功能的构想,将感测场绘制成活动目标领域,包括移动矢量。为了表明宏观群集体作为群集规划框架的表达性、构成性和实用性,我们开展了各种模拟活动,包括了集体计算和可完全可折合的功能组群集的功能组群体。在总体计算和协调中,根据总体计算宏观组群集的宏观组群集的宏观组群集的模型模型模型的模型模型模型化和模型化的模型化,并讨论了内部系统模型化的自身设计、模型化的系统化。

Article 101

Title@2025-06-09 (1): Evaluating LLMs Effectiveness in Detecting and Correcting Test Smells: An Empirical Study

Title: Evaluating LLMs Effectiveness in Detecting and Correcting Test Smells: An Empirical Study

Bewertung der Wirksamkeit von LLMs bei der Erkennung und Korrektur von Testriechen: Eine empirische Studie

评价LLMs在检测和纠正试验闻闻方面的效力:经验研究 2506.07594v1

Authors (6): E. G. Santana Jr, Jander Pereira Santos Junior, Erlon P. Almeida, Iftekhar Ahmed, Paulo Anselmo da Mota Silveira Neto, Eduardo Santana de Almeida

Test smells indicate poor development practices in test code, reducing maintainability and reliability. While developers often struggle to prevent or refactor these issues, existing tools focus primarily on detection rather than automated refactoring. Large Language Models (LLMs) have shown strong potential in code understanding and transformation, but their ability to both identify and refactor test smells remains underexplored. We evaluated GPT-4-Turbo, LLaMA 3 70B, and Gemini-1.5 Pro on Python and Java test suites, using PyNose and TsDetect for initial smell detection, followed by LLM-driven refactoring. Gemini achieved the highest detection accuracy (74.35% Python, 80.32% Java), while LLaMA was lowest. All models could refactor smells, but effectiveness varied, sometimes introducing new smells. Gemini also improved test coverage, unlike GPT-4 and LLaMA, which often reduced it. These results highlight LLMs’ potential for automated test smell refactoring, with Gemini as the strongest performer, though challenges remain across languages and smell types.

测试的气味表明测试代码的发展实践不善,降低了可调适性和可靠性。虽然开发者往往努力防止或重新提出这些问题,但现有工具主要侧重于检测而不是自动再构思。大型语言模型(LLMs)在代码理解和转换方面表现出了巨大的潜力,但是它们识别和再构体测试气味的能力仍然未得到充分探讨。我们评估了GPT-4-Turbo、LalaMA 3 70B和Gemini-1.5 Python和Java测试套房,利用PyNose和Ts检测进行初步嗅觉检测,然后是LLMM驱动的再构思。Gemini实现了最高检测精度(74.35Python,80.32Java),而Lama是最低的。所有模型都可以反构思气味,但效力各不相同,有时会引入新的气味。Gemini还提高了测试范围,而Gepini与GPT-4和LAMA不同,这往往会减少它。这些结果突出表明LMs的自动测试嗅觉嗅觉再闻的可能性,而Gemini是最强,尽管各语言和嗅觉之间仍然存在挑战。

Article 102

Title@2025-06-09 (1): IntenTest: Stress Testing for Intent Integrity in API-Calling LLM Agents

Title: IntenTest: Stress Testing for Intent Integrity in API-Calling LLM Agents

IntenTest: Stresstest auf Intent Integrität in API-aufrufenden LLM-Agenten

IntenTest:压力测试,以促使API-Calling LLM代理物保持诚信 2506.07524v1

Authors (8): Shiwei Feng, Xiangzhe Xu, Xuan Chen, Kaiyuan Zhang, Syed Yusuf Ahmed, Zian Su, Mingwei Zheng, Xiangyu Zhang

LLM agents are increasingly deployed to automate real-world tasks by invoking APIs through natural language instructions. While powerful, they often suffer from misinterpretation of user intent, leading to the agent’s actions that diverge from the user’s intended goal, especially as external toolkits evolve. Traditional software testing assumes structured inputs and thus falls short in handling the ambiguity of natural language. We introduce IntenTest, an API-centric stress testing framework that systematically uncovers intent integrity violations in LLM agents. Unlike prior work focused on fixed benchmarks or adversarial inputs, IntenTest generates realistic tasks based on toolkits’ documentation and applies targeted mutations to expose subtle agent errors while preserving user intent. To guide testing, we propose semantic partitioning, which organizes natural language tasks into meaningful categories based on toolkit API parameters and their equivalence classes. Within each partition, seed tasks are mutated and ranked by a lightweight predictor that estimates the likelihood of triggering agent errors. To enhance efficiency, IntenTest maintains a datatype-aware strategy memory that retrieves and adapts effective mutation patterns from past cases. Experiments on 80 toolkit APIs demonstrate that IntenTest effectively uncovers intent integrity violations, significantly outperforming baselines in both error-exposing rate and query efficiency. Moreover, IntenTest generalizes well to stronger target models using smaller LLMs for test generation, and adapts to evolving APIs across domains.

通过自然语言指令,LLM代理商越来越多地被部署到将现实世界的任务自动化,通过自然语言指令来援引API。虽然它们的力量强大,但往往受到对用户意图的误解,导致该代理商的行动与用户的预期目标不同,特别是随着外部工具包的演变。传统的软件测试假定有结构的投入,因而在处理自然语言的模糊性方面有缺陷。我们引入了以API为中心的压力测试框架IntenTest,这是一个以API为中心的压力测试框架,系统地发现LLM代理商蓄意违反诚信的行为。与以往侧重于固定基准或对抗性投入的工作不同, IntenTest产生现实的任务,并应用有针对性的突变来暴露微妙的代理商错误,同时保存用户意图。为了指导测试,我们建议采用语义分割,将自然语言任务组织成有意义的类别,以工具包API参数及其等同课程为基础。在每种分区中,种子任务被变换成一个轻质的预测器,估计触发代理商错误的可能性。为了提高效率,Intentread,Test保持一种数据型认知战略记忆,从过去的案例中检索和调整有效的突变型模式。在80个工具包APILIS 测试模型中进行实验,有效地展示,在更高的基准中,在更高的基准中进行测试中,并用更精确测试测试,在跨度上展示。
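The semantic partitioning step described in the IntenTest abstract can be illustrated with a minimal sketch. It assumes a made-up `send_email(to, attachment)` API and hand-written equivalence classes; the names `EQUIVALENCE_CLASSES`, `semantic_partitions`, and `assign_partition` are hypothetical and are not taken from the paper or its tooling.

```python
from itertools import product

# Hypothetical equivalence classes for a made-up `send_email(to, attachment)` API.
EQUIVALENCE_CLASSES = {
    "to": ["single_recipient", "multiple_recipients"],
    "attachment": ["none", "small_file", "large_file"],
}


def semantic_partitions(classes):
    """Enumerate one partition per combination of parameter equivalence classes."""
    names = sorted(classes)
    combos = product(*(classes[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


def assign_partition(task_features, classes):
    """Map a task, already reduced to one class label per parameter, to its partition key."""
    for name, label in task_features.items():
        if label not in classes[name]:
            raise ValueError(f"unknown class {label!r} for parameter {name!r}")
    # The key lists labels in sorted parameter-name order.
    return tuple(task_features[name] for name in sorted(classes))
```

Seed tasks would then be generated and mutated within each partition; how a natural-language task is reduced to class labels is left out of this sketch.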

Article 103

Title@2025-06-09 (1): PARF: An Adaptive Abstraction-Strategy Tuner for Static Analysis

Title: PARF: An Adaptive Abstraction-Strategy Tuner for Static Analysis

PARF: Ein adaptives Abstraktions-Strategie-Tuner für statische Analyse

PARF: 用于静态分析的适应性抽象-战略图纳 2505.13229v3

Authors (9): Zhongyi Wang, Mingshuai Chen, Tengjie Lin, Linyu Yang, Junhao Zhuo, Qiuye Wang, Shengchao Qin, Xiao Yi, Jianwei Yin

We launch Parf - a toolkit for adaptively tuning abstraction strategies of static program analyzers in a fully automated manner. Parf models various types of external parameters (encoding abstraction strategies) as random variables subject to probability distributions over latticed parameter spaces. It incrementally refines the probability distributions based on accumulated intermediate results generated by repeatedly sampling and analyzing, thereby ultimately yielding a set of highly accurate abstraction strategies. Parf is implemented on top of Frama-C/Eva - an off-the-shelf open-source static analyzer for C programs. Parf provides a web-based user interface facilitating the intuitive configuration of static analyzers and visualization of dynamic distribution refinement of the abstraction strategies. It further supports the identification of dominant parameters in Frama-C/Eva analysis. Benchmark experiments and a case study demonstrate the competitive performance of Parf for analyzing complex, large-scale real-world programs.

我们推出 Parf 工具箱, 用于对静态程序分析器的抽象战略进行适应性调整; 将各种类型的外部参数( 编码抽象战略)作为随机变量,在悬浮参数空间上进行概率分布; 根据反复取样和分析所积累的中间结果,逐步完善概率分布,从而最终产生一套非常准确的抽象战略; Parf 在Frama-C/Eva(Frama-C/Eva)之上实施,这是为 C 程序提供的一个现成的公开源静态分析器; Parf 提供基于网络的用户界面,促进静态分析器的直观配置和对抽象战略动态分布的完善的可视化; 进一步支持在Frama-C/Eva 分析中确定主导参数; 基准实验和案例研究表明 Parf 在分析复杂、大规模真实世界方案方面的竞争性表现。
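The sample-analyze-refine loop that the Parf abstract describes can be sketched as follows. This is only an illustration under stated assumptions: the parameter takes four discrete values, `toy_analysis_score` is a hypothetical stand-in for running a static analyzer, and the multiplicative reweighting rule is a simple choice made here, not Parf's actual refinement procedure.

```python
import random


def sample(weights, rng):
    """Draw one parameter value (an index) according to the current distribution."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def refine(weights, observations, rate=0.5):
    """Boost the probability of values whose analysis runs scored well, then renormalize."""
    updated = list(weights)
    for value, score in observations:
        updated[value] *= 1.0 + rate * score
    total = sum(updated)
    return [w / total for w in updated]


def toy_analysis_score(value):
    # Hypothetical stand-in for an analyzer run: higher means a more accurate result.
    return {0: 0.1, 1: 0.4, 2: 0.9, 3: 0.6}[value]


def tune(rounds=20, samples_per_round=5, seed=0):
    """Repeatedly sample parameter values, score them, and refine the distribution."""
    rng = random.Random(seed)
    weights = [0.25] * 4  # start from a uniform distribution
    for _ in range(rounds):
        drawn = [sample(weights, rng) for _ in range(samples_per_round)]
        weights = refine(weights, [(v, toy_analysis_score(v)) for v in drawn])
    return weights
```

Calling `tune()` returns a normalized distribution over the four values; values that score well when sampled gain probability mass over the rounds.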

Article 104

Title@2025-06-09 (1): Large Language Models for Multilingual Vulnerability Detection: How Far Are We?

Title: Large Language Models for Multilingual Vulnerability Detection: How Far Are We?

Große Sprachmodelle für mehrsprachige Sicherheitserkennung: Wie weit sind wir?

多语言脆弱性探测大语言模型:我们有多远? 2506.07503v1

Authors (7): Honglin Shu, Michael Fu, Junji Yu, Dong Wang, Chakkrit Tantithamthavorn, Junjie Chen, Yasutaka Kamei

Various deep learning-based approaches utilizing pre-trained language models (PLMs) have been proposed for automated vulnerability detection. With recent advancements in large language models (LLMs), several studies have begun exploring their application to vulnerability detection tasks. However, existing studies primarily focus on specific programming languages (e.g., C/C++) and function-level detection, leaving the strengths and weaknesses of PLMs and LLMs in multilingual and multi-granularity scenarios largely unexplored. To bridge this gap, we conduct a comprehensive fine-grained empirical study evaluating the effectiveness of state-of-the-art PLMs and LLMs for multilingual vulnerability detection. Using over 30,000 real-world vulnerability-fixing patches across seven programming languages, we systematically assess model performance at both the function-level and line-level. Our key findings indicate that GPT-4o, enhanced through instruction tuning and few-shot prompting, significantly outperforms all other evaluated models, including CodeT5P. Furthermore, the LLM-based approach demonstrates superior capability in detecting unique multilingual vulnerabilities, particularly excelling in identifying the most dangerous and high-severity vulnerabilities. These results underscore the promising potential of adopting LLMs for multilingual vulnerability detection at function-level and line-level, revealing their complementary strengths and substantial improvements over PLM approaches. This first empirical evaluation of PLMs and LLMs for multilingual vulnerability detection highlights LLMs’ value in addressing real-world software security challenges.

利用预先培训的语言模型(PLM),提出了各种深层次的基于学习的办法来自动发现脆弱性;由于在大型语言模型(LLM)方面最近取得的进展,一些研究已开始探索如何将其应用于脆弱性检测任务;然而,现有研究主要侧重于具体方案编制语言(如C/C++)和功能级检测,使多语种和多粒度情景中PLM和LLM的优缺点基本上未被探讨;为了缩小这一差距,我们开展了一项全面的精细实验性研究,评价了用于多语种脆弱性检测的最新PLM和LLM的实效;利用7种方案语言的30,000多个实际世界脆弱性修复补丁,我们系统地评估了功能层面和线级层面的模型绩效;我们的主要调查结果表明,GPT-4o通过教学调整和少发的提示,大大超出包括CodeT5P在内的所有其他经评估模式。此外,基于LLM的方法在发现独特的多语言脆弱性方面显示出更强的能力,特别是在识别最危险和高严重性的脆弱性方面。

Article 105

Title@2025-06-09 (1): A Framework for Creating Non-Regressive Test Cases via Branch Consistency Analysis Driven by Descriptions

Title: A Framework for Creating Non-Regressive Test Cases via Branch Consistency Analysis Driven by Descriptions

Ein Rahmen für die Erstellung nicht-regressiver Testfälle über Branchenkonsistenzanalyse, angetrieben durch Beschreibungen

由说明驱动的通过分处一致性分析生成非递减试验案例的框架 2506.07486v1

Authors (8): Yuxiang Zhang, Pengyu Xue, Zhen Yang, Xiaoxue Ren, Xiang Li, Linhao Wu, Jiancheng Zhao, Xingda Yu

Automated test-generation research overwhelmingly assumes the correctness of focal methods, yet practitioners routinely face non-regression scenarios where the focal method may be defective. A baseline evaluation of EvoSuite and two leading Large Language Model (LLM)-based generators, namely ChatTester and ChatUniTest, on defective focal methods reveals that despite achieving up to 83% of branch coverage, none of the generated tests expose defects. To resolve this problem, we first construct two new benchmarks, namely Defects4J-Desc and QuixBugs-Desc, for experiments. In particular, each focal method is equipped with an extra Natural Language Description (NLD) for code functionality understanding. Subsequently, we propose DISTINCT, a Description-guided, branch-consistency analysis framework that transforms LLMs into fault-aware test generators. DISTINCT carries three iterative components: (1) a Generator that derives initial tests based on the NLDs and the focal method, (2) a Validator that iteratively fixes uncompilable tests using compiler diagnostics, and (3) an Analyzer that iteratively aligns test behavior with NLD semantics via branch-level analysis. Extensive experiments confirm the effectiveness of our approach. Compared to state-of-the-art methods, DISTINCT achieves an average improvement of 14.64% in Compilation Success Rate (CSR) and 6.66% in Passing Rate (PR) across both benchmarks. It notably enhances Defect Detection Rate (DDR) on both benchmarks, with a particularly significant gain of 149.26% observed on Defects4J-Desc. In terms of code coverage, DISTINCT improves Statement Coverage (SC) by an average of 3.77% and Branch Coverage (BC) by 5.36%. These results set a new baseline for non-regressive test generation and highlight how description-driven reasoning enables LLMs to move beyond coverage chasing toward effective defect detection.

自动测试生成研究以压倒多数假设了协调方法的正确性,然而,实践者通常会面临不倒退的情景,而协调方法可能存在缺陷。对EvoSite和两个以大语言模型(LLM)为基础的发电机(ChatTester和ChatUniTest)的基线评估显示,尽管实现了83%的分支覆盖率,但所产生的测试都没有暴露出缺陷。为了解决这一问题,我们首先为实验建立了两个新的基准,即Deffects4J-Desc和QuixBugs-Desc。特别是,每个协调方法都配备了额外的自然语言说明(NLDD),用于理解代码功能。随后,我们提出了DISTINCT(DID-DRIT),一个将LMMS转化为差分辨测试发电机的83%,DIT包含三个迭代内容:(1)一个发电机,根据NLDFT和焦点方法进行初始测试,一个校验标准基准,使用编者诊断,将不可调的基准值基准值调整为 3.,以及(3) 具体分析,通过IMSBLS的升级测试结果。

Article 106

Title@2025-06-09 (1): Multi-View Adaptive Contrastive Learning for Information Retrieval Based Fault Localization

Title: Multi-View Adaptive Contrastive Learning for Information Retrieval Based Fault Localization

Multi-View Adaptive Kontrastives Lernen für Informationen Retrieval basierte Fehler Lokalisierung

信息检索的多视图适应性差异学习 2409.12519v2

Authors (5): Chunying Zhou, Xiaoyuan Xie, Gong Chen, Peng He, Bing Li

Most studies focused on information retrieval-based techniques for fault localization, which built representations for bug reports and source code files and matched their semantic vectors through similarity measurement. However, such approaches often ignore some useful information that might help improve localization performance, such as 1) the interaction relationship between bug reports and source code files; 2) the similarity relationship between bug reports; and 3) the co-citation relationship between source code files. In this paper, we propose a novel approach named Multi-View Adaptive Contrastive Learning for Information Retrieval Fault Localization (MACL-IRFL) to learn the above-mentioned relationships for software fault localization. Specifically, we first generate data augmentations from report-code interaction view, report-report similarity view and code-code co-citation view separately, and adopt graph neural network to aggregate the information of bug reports or source code files from the three views in the embedding process. Moreover, we perform contrastive learning across these views. Our design of contrastive learning task will force the bug report representations to encode information shared by report-report and report-code views, and the source code file representations shared by code-code and report-code views, thereby alleviating the noise from auxiliary information. Finally, to evaluate the performance of our approach, we conduct extensive experiments on five open-source Java projects. The results show that our model can improve over the best baseline up to 28.93%, 25.57% and 20.35% on Accuracy@1, MAP and MRR, respectively.

多数研究都侧重于基于信息的错误定位技术,即建立错误报告和源代码文档的演示,并通过相似度测量来匹配其语义矢量。然而,这类方法往往忽视一些可能有助于改进本地化绩效的有用信息,例如:(1) 错误报告和源代码文档之间的互动关系;(2) 错误报告和源代码文档之间的类似关系;(2) 错误报告之间的类似关系;和(3) 源代码文档之间的共引关系。在本文件中,我们提议了一种新颖的方法,名为“为信息检索而进行多维调性学习 ” (MACL-IRFL) ,以学习上述软件错误本地化的关系。具体地说,我们首先从报告代码互动视图、报告-报告相似度视图和代码共同引用视角生成数据增强能力,然后采用图形神经网络,将错误报告或源代码文档代码文件的信息汇总到嵌入过程的三种观点。此外,我们设计对比式的学习任务将迫使错误报告演示组(MACL-IRFL) 将上述软件本地化观点所共享的信息编码,以及源代码文件表达方式分别从报告-code-redical vieweral view 20-roral Arevilal 和我们共享的软化模型,我们通过代码和软质序-reval-reval-ral-ass-assal-views
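To make the cross-view contrastive learning mentioned in the MACL-IRFL abstract concrete, here is a minimal sketch of a generic InfoNCE-style loss between two views of the same items. It is an assumption that the paper uses a loss of this family; the function names, the cosine similarity, and the temperature value are illustrative choices, not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def info_nce(view_a, view_b, temperature=0.5):
    """Average InfoNCE loss: row i of view_a should match row i of view_b.

    Every other row of view_b acts as a negative for anchor i.
    """
    losses = []
    for i, anchor in enumerate(view_a):
        logits = [cosine(anchor, other) / temperature for other in view_b]
        log_denominator = math.log(sum(math.exp(z) for z in logits))
        # -log(exp(positive) / sum(exp(all))) rewritten in log space
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

In this sketch, `view_a` and `view_b` would hold the embeddings of the same bug reports (or source files) produced from two different views; the loss is low when the two views agree on each item and high when they do not.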

Article 107

Title@2025-06-09 (1): Generate Realistic Test Scenes for V2X Communication Systems

Title: Generate Realistic Test Scenes for V2X Communication Systems

Realistische Testszenen für V2X-Kommunikationssysteme generieren

V2X 通信系统生成 Realistic 测试场景 2506.07419v1

Authors (9): An Guo, Xinyu Gao, Chunrong Fang, Haoxiang Tian, Weisong Sun, Yanzhou Mu, Shuncheng Tang, Lei Ma, Zhenyu Chen

Accurately perceiving complex driving environments is essential for ensuring the safe operation of autonomous vehicles. With the tremendous progress in deep learning and communication technologies, cooperative perception with Vehicle-to-Everything (V2X) technologies has emerged as a solution to overcome the limitations of single-agent perception systems in perceiving distant objects and occlusions. Despite the considerable advancements, V2X cooperative perception systems require thorough testing and continuous enhancement of system performance. Given that V2X driving scenes entail intricate communications with multiple vehicles across various geographic locations, creating V2X test scenes for these systems poses a significant challenge. Moreover, current testing methodologies rely on manual data collection and labeling, which are both time-consuming and costly. In this paper, we design and implement V2XGen, an automated testing generation tool for V2X cooperative perception systems. V2XGen utilizes a high-fidelity approach to generate realistic cooperative object instances and strategically place them within the background data in crucial positions. Furthermore, V2XGen adopts a fitness-guided V2X scene generation strategy for the transformed scene generation process and improves testing efficiency. We conduct experiments on V2XGen using multiple cooperative perception systems with different fusion schemes to assess its performance on various tasks. The experimental results demonstrate that V2XGen is capable of generating realistic test scenes and effectively detecting erroneous behaviors in different V2X-oriented driving conditions. Furthermore, the results validate that retraining systems under test with the generated scenes can enhance average detection precision while reducing occlusion and long-range perception errors.

由于深层次的学习和通信技术取得了巨大进展,对车辆对一切(V2X)技术的合作认识已成为克服单一试剂感知系统在发现遥远的物体和封闭性方面的局限性的一种解决办法。尽管取得了相当大的进步,V2X合作感知系统需要彻底测试和不断提高系统性能。鉴于V2X驾驶场景涉及与不同地理位置的多辆车的复杂通信,为这些系统制造V2X测试场景构成重大挑战。此外,目前测试方法依赖于人工数据收集和标签,这既耗时又昂贵。在本文件中,我们设计和实施V2XGen,这是V2X合作感知系统的自动测试生成工具。V2XGen使用高纤维性方法产生现实的合作性对象实例,并在战略上将其置于背景数据的关键位置。此外,V2XGen为改变的现场生成过程采用了健身制V2X的现场生成战略,提高了实际性能。我们在VX测试过程中有效地进行实验,同时利用不同的实验性结果,在不同的VX测试过程中进行不同的实验。

Article 108

Title@2025-06-09 (1): A Systematic Literature Review on Continuous Integration and Deployment (CI/CD) for Secure Cloud Computing

Title: A Systematic Literature Review on Continuous Integration and Deployment (CI/CD) for Secure Cloud Computing

Ein systematischer Literaturbericht über kontinuierliche Integration und Bereitstellung (CI/CD) für sicheres Cloud Computing

安全云计算系统关于持续整合和部署的系统文献审查(CI/CD) 2506.08055v1

Authors (3): Sabbir M. Saleh, Nazim Madhavji, John Steinbacher

As cloud environments become widespread, cybersecurity has emerged as a top priority across areas such as networks, communication, data privacy, response times, and availability. Various sectors, including industries, healthcare, and government, have recently faced cyberattacks targeting their computing systems. Ensuring secure app deployment in cloud environments requires substantial effort. With the growing interest in cloud security, conducting a systematic literature review (SLR) is critical to identifying research gaps. Continuous Software Engineering, which includes continuous integration (CI), delivery (CDE), and deployment (CD), is essential for software development and deployment. In our SLR, we reviewed 66 papers, summarising tools, approaches, and challenges related to the security of CI/CD in the cloud. We addressed key aspects of cloud security and CI/CD and reported on tools such as Harbor, SonarQube, and GitHub Actions. Challenges such as image manipulation, unauthorised access, and weak authentication were highlighted. The review also uncovered research gaps in how tools and practices address these security issues in CI/CD pipelines, revealing a need for further study to improve cloud-based security solutions.

随着云层环境变得广泛,网络安全已成为网络、通信、数据隐私、响应时间和可用性等各领域的首要优先事项,包括工业、医疗保健和政府在内的各个部门最近都面临针对其计算机系统的网络攻击。确保云环境中的安全应用需要大量的努力。随着对云层安全的兴趣日益浓厚,进行系统的文献审查对找出研究差距至关重要。连续软件工程,包括持续整合(CI)、交付(CDE)和部署(CD),对于软件开发和部署至关重要。在我们的SLR中,我们审查了66份文件、汇总工具、方法和与云中光/CD安全有关的挑战。我们处理了云层安全的关键方面和光/CD,并报告了诸如港、SonarQube和GitHub行动等工具。图像操纵、未经授权的接入和薄弱的认证等挑战也得到了强调。该审查还发现,在光/CD管道中如何解决这些安全问题的工具和做法方面存在研究差距,表明需要进一步研究,以改进云层安全解决方案。

Article 109

Title@2025-06-09 (1): Protecting Deep Learning Model Copyrights with Adversarial Example-Free Reuse Detection

Title: Protecting Deep Learning Model Copyrights with Adversarial Example-Free Reuse Detection

Schutz von Deep-Learning-Modell-Urheberrechten mit zweifelhafter Beispiel-freier Wiederverwertungserkennung

保护深学习模式版权,进行反反对学性实例自由再利用探测 2407.03883v2

Authors (4): Xiaokun Luan, Xiyue Zhang, Jingyi Wang, Meng Sun

Model reuse techniques can reduce the resource requirements for training high-performance deep neural networks (DNNs) by leveraging existing models. However, unauthorized reuse and replication of DNNs can lead to copyright infringement and economic loss to the model owner. This underscores the need to analyze the reuse relation between DNNs and develop copyright protection techniques to safeguard intellectual property rights. Existing white-box testing-based approaches cannot address the common heterogeneous reuse case where the model architecture is changed, and DNN fingerprinting approaches heavily rely on generating adversarial examples with good transferability, which is known to be challenging in the black-box setting. To bridge the gap, we propose NFARD, a Neuron Functionality Analysis-based Reuse Detector, which only requires normal test samples to detect reuse relations by measuring the models’ differences on a newly proposed model characterization, i.e., neuron functionality (NF). A set of NF-based distance metrics is designed to make NFARD applicable to both white-box and black-box settings. Moreover, we devise a linear transformation method to handle heterogeneous reuse cases by constructing the optimal projection matrix for dimension consistency, significantly extending the application scope of NFARD. To the best of our knowledge, this is the first adversarial example-free method that exploits neuron functionality for DNN copyright protection. As a side contribution, we constructed a reuse detection benchmark named Reuse Zoo that covers various practical reuse techniques and popular datasets. Extensive evaluations on this comprehensive benchmark show that NFARD achieves F1 scores of 0.984 and 1.0 for detecting reuse relationships in black-box and white-box settings, respectively, while generating test suites 2 ~ 99 times faster than previous methods.

模型再利用技术可以利用现有模式,减少培训高性能深神经网络的资源需求;然而,未经授权再利用和复制DNNN可导致侵犯版权,对模型所有人造成经济损失;这突出表明需要分析DNN的再利用关系,并开发版权保护技术,以保障知识产权;现有的基于白箱的测试方法无法解决模型结构改变的通用异同再利用案例,DNN指纹方法严重依赖生成具有良好可转移性的对抗性实例,这在黑箱设置中是众所周知具有挑战性的。为缩小差距,我们建议NFARD,基于NFARD的神经功能分析型Reuse 探测器,它只需要正常的测试样品来检测再利用关系,通过测量新提议的模型特征,即神经功能(NF)的功能差异。一套基于NFW的远程测量方法的设计使NFARD既适用于白箱又适用于黑箱环境。此外,我们设计了一种直线性转换方法,通过构建最优化的预测矩阵格式一致性矩阵,大大扩展了模型的再利用范围评估范围,同时大幅扩展了RFRMRS的测试范围技术的应用范围。

Article 110

Title@2025-06-09 (1): Boosting Vulnerability Detection of LLMs via Curriculum Preference Optimization with Synthetic Reasoning Data

Title: Boosting Vulnerability Detection of LLMs via Curriculum Preference Optimization with Synthetic Reasoning Data

Erhöhung der Schwachstelle Erkennung von LLMs durch Curriculum Preference Optimization mit synthetisch begründeten Daten

通过课程优先优化使用合成理由数据,促进对LLMs的脆弱性检测 2506.07390v1

Authors (5): Xin-Cheng Wen, Yijun Yang, Cuiyun Gao, Yang Xiao, Deheng Ye

Large language models (LLMs) demonstrate considerable proficiency in numerous coding-related tasks; however, their capabilities in detecting software vulnerabilities remain limited. This limitation primarily stems from two factors: (1) the absence of reasoning data related to vulnerabilities, which hinders the models’ ability to capture underlying vulnerability patterns; and (2) their focus on learning semantic representations rather than the reason behind them, thus failing to recognize semantically similar vulnerability samples. Furthermore, the development of LLMs specialized in vulnerability detection is challenging, particularly in environments characterized by the scarcity of high-quality datasets. In this paper, we propose a novel framework ReVD that excels at mining vulnerability patterns through reasoning data synthesizing and vulnerability-specific preference optimization. Specifically, we construct forward and backward reasoning processes for vulnerability and corresponding fixed code, ensuring the synthesis of high-quality reasoning data. Moreover, we design the triplet supervised fine-tuning followed by curriculum online preference optimization for enabling ReVD to better understand vulnerability patterns. The extensive experiments conducted on PrimeVul and SVEN datasets demonstrate that ReVD sets new state-of-the-art for LLM-based software vulnerability detection, e.g., 12.24%-22.77% improvement in the accuracy. The source code and data are available at https://github.com/Xin-Cheng-Wen/PO4Vul.

大型语言模型(LLMs)在众多与编码有关的任务中表现出相当的熟练程度;然而,它们发现软件脆弱性的能力仍然有限,这一限制主要来自两个因素:(1) 缺乏与脆弱性有关的推理数据,这妨碍了模型捕捉基本脆弱性模式的能力;(2) 侧重于学习语义表达,而不是其背后的原因,因此没有认识到语义上相似的脆弱性样本;此外,开发专门识别脆弱性的LLMs具有挑战性,特别是在缺少高质量数据集的环境里;在本文件中,我们提出一个新的ReVD框架,它通过推理数据合成和针对脆弱性的偏好优化,在挖掘脆弱性模式方面优异。具体地说,我们为脆弱性和相应的修复代码建立前向和后向推理过程,确保综合高质量的推理数据。此外,我们设计了三重监督微调和随后的课程式在线偏好优化,使ReVD能够更好地了解脆弱性模式。在PrimeVul和SVEN数据集方面进行的广泛实验表明,ReVD为基于LLM的软件脆弱性检测确立了新的最先进水平,例如准确率提高12.24%-22.77%。源代码和数据可在 https://github.com/Xin-Cheng-Wen/PO4Vul 获取。

Article 111

Title@2025-06-09 (1): Human Side of Smart Contract Fuzzing: An Empirical Study

Title: Human Side of Smart Contract Fuzzing: An Empirical Study

Menschliche Seite des Smart Contract Fuzzing: Eine empirische Studie

聪明合同的人类方面模糊:经验研究 2506.07389v1

Authors (2): Guanming Qiao, Partha Protim Paul

Smart contract (SC) fuzzing is a critical technique for detecting vulnerabilities in blockchain applications. However, its adoption remains challenging for practitioners due to fundamental differences between SCs and traditional software systems. In this study, we investigate the challenges practitioners face when adopting SC fuzzing tools by conducting an inductive content analysis of 381 GitHub issues from two widely used SC fuzzers: Echidna and Foundry. Furthermore, we conducted a user study to examine how these challenges affect different practitioner groups, SC developers, and traditional software security professionals, and identify strategies practitioners use to overcome them. We systematically categorize these challenges into a taxonomy based on their nature and occurrence within the SC fuzzing workflow. Our findings reveal domain-specific ease-of-use and usefulness challenges, including technical issues with blockchain emulation, and human issues with a lack of accessible documentation and process automation. Our results provide actionable insights for tool developers and researchers, guiding future improvements in SC fuzzer tool design.

智能合同(SC)模糊是发现链链应用程序脆弱性的关键技术之一,然而,由于在册种姓和传统软件系统之间存在根本差异,采用该合同对从业人员仍具有挑战性。在本研究中,我们通过对两种广泛使用的在册种姓模糊器(Echidna和Foundry)产生的381 GitHub问题进行感应内容分析,调查从业人员在采用在册种姓模糊工具时所面临的挑战。此外,我们进行了用户研究,以研究这些挑战如何影响不同从业者群体、在册种姓开发者和传统软件安全专业人员,并查明使用哪些战略从业者来克服这些挑战。我们系统地根据这些挑战的性质和在在册种姓模糊工作流程中出现的情况,将这些挑战分类为分类。我们的调查结果揭示了特定领域的使用方便性和有用性挑战,包括块链模拟技术问题,以及缺乏可获取的文件和流程自动化的人类问题。我们的成果为工具开发者和研究人员提供了可操作的洞察力,指导今后对SC模糊工具设计的改进。

Article 112

Title@2025-06-09 (1): SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering

Title: SyncMind: Measuring Agent Out-of-Sync Recovery in Collaborative Software Engineering

SyncMind: Messmittel Out-of-Sync-Wiederherstellung in der kollaborativen Software-Engineering

SyncMind:在合作软件工程中测量合成外恢复剂 2502.06994v2

Authors (7): Xuehang Guo, Xingyao Wang, Yangyi Chen, Sha Li, Chi Han, Manling Li, Heng Ji

Software engineering (SE) is increasingly collaborative, with developers working together on shared complex codebases. Effective collaboration in shared environments requires participants – whether humans or AI agents – to stay on the same page as their environment evolves. When a collaborator’s understanding diverges from the current state – what we term the out-of-sync challenge – the collaborator’s actions may fail, leading to integration issues. In this work, we introduce SyncMind, a framework that systematically defines the out-of-sync problem faced by large language model (LLM) agents in collaborative software engineering (CSE). Based on SyncMind, we create SyncBench, a benchmark featuring 24,332 instances of agent out-of-sync scenarios in real-world CSE derived from 21 popular GitHub repositories with executable verification tests. Experiments on SyncBench uncover critical insights into existing LLM agents’ capabilities and limitations. Besides substantial performance gaps among agents (from Llama-3.1 agent <= 3.33% to Claude-3.5-Sonnet >= 28.18%), their consistently low collaboration willingness (<= 4.86%) suggests fundamental limitations of existing LLM in CSE. However, when collaboration occurs, it positively correlates with out-of-sync recovery success. Minimal performance differences in agents’ resource-aware out-of-sync recoveries further reveal their significant lack of resource awareness and adaptability, shedding light on future resource-efficient collaborative systems. Code and data are openly available on our project website: https://xhguo7.github.io/SyncMind/.

软件工程( SE) 正在日益合作, 开发者在共享的复杂代码库上合作。在共享环境中的有效合作要求参与者 – – 无论是人类还是AI代理商 – – 在其环境演变过程中保持同一页。当合作者的理解与当前状态不同时 – – 我们称之为超同步的挑战 – – 协作者的行动可能会失败,导致整合问题。在这项工作中,我们引入了SyncMind(SyncMind)这一框架,这个框架系统地界定了协作软件工程(CSE)中大型语言模型(LLLM)代理商所面临的超同步问题。基于 SyncMind,我们创建SynBench(SyncBench),一个基于24,332个代理商的超同步情景与当前状态。我们21个流行的 GitHub 存储库中生成的匹配挑战,导致整合问题。 SyncBenchen 实验发现现有 LLM代理商的能力和局限性。除了代理商之间在( LLlamama- 3.3% 和Son- Son-S- relist) rentalalalaldeal daldeal dislational dal dislevational) 4.

Article 113

Title@2025-06-09 (1): GUIPilot: A Consistency-based Mobile GUI Testing Approach for Detecting Application-specific Bugs

Title: GUIPilot: A Consistency-based Mobile GUI Testing Approach for Detecting Application-specific Bugs

GUIPilot: Ein auf Konsistenz basierender mobiler GUI-Testansatz zur Erkennung anwendungsspezifischer Fehler

GUIPilot: 一种基于一致性的移动图形测试方法,用于检测特定应用程序的臭虫 2506.07385v1

Authors (7): Ruofan Liu, Xiwen Teoh, Yun Lin, Guanjie Chen, Ruofei Ren, Denys Poshyvanyk, Jin Song Dong

In this work, we propose GUIPilot, an approach for detecting inconsistencies between the mobile design and their implementations. The mobile design usually consists of design mock-ups that specify (1) the expected screen appearances (e.g., widget layouts, colors, and shapes) and (2) the expected screen behaviors, regarding how one screen can transition into another (e.g., labeled widgets with textual description). Given a design mock-up and the implementation of its application, GUIPilot reports both their screen inconsistencies as well as process inconsistencies. On the one hand, GUIPilot detects the screen inconsistencies by abstracting every screen into a widget container where each widget is represented by its position, width, height, and type. By defining the partial order of widgets and the costs of replacing, inserting, and deleting widgets in a screen, we convert the screen-matching problem into an optimizable widget alignment problem. On the other hand, we translate the specified GUI transition into stepwise actions on the mobile screen (e.g., click, long-press, input text on some widgets). To this end, we propose a visual prompt for the vision-language model to infer widget-specific actions on the screen. By this means, we can validate the presence or absence of expected transitions in the implementation. Our extensive experiments on 80 mobile applications and 160 design mock-ups show that (1) GUIPilot can achieve 94.5% precision and 99.6% recall in detecting screen inconsistencies, outperforming the state-of-the-art approach, such as GVT, by 66.2% and 56.6% respectively, and (2) GUIPilot reports zero errors in detecting process inconsistencies. Furthermore, our industrial case study on applying GUIPilot on a trading mobile application shows that GUIPilot has detected nine application bugs, and all the bugs were confirmed by the original application experts.

在此工作中, 我们提出 GUIPilot , 这是一种用来检测移动设计及其实施之间不一致之处的方法。移动设计通常包括设计模拟, 具体指明:(1) 屏幕显示( 如部件布局、颜色和形状) 和 (2) 预期屏幕行为, 说明一个屏幕如何转换到另一个屏幕( 例如标签的部件带有文字描述) 。在设计模拟和应用中, GUIPilot 既报告其屏幕不一致之处, 也报告其过程不一致之处。一方面, GUIPilot 将每个屏幕应用都抽取到每个部件以其位置、宽度、高度和类型为代表的部件容器中。通过定义部件的部分顺序和在屏幕中替换、插入和删除部件的成本, 我们将屏幕匹配问题转换成一个可选的部件协调问题。另一方面, 我们将指定的图形转换为移动屏幕上的逐步行动( 例如, 点击、长期打印、输入文本在某个部件的容器中, 如此在设计中, 我们提议一个直观的 G- 直观的操作显示80 。直观操作中, 显示我们的直观操作。
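The widget alignment formulation in the GUIPilot abstract (costs for replacing, inserting, and deleting widgets) resembles a weighted edit distance, which can be sketched as below. This is a simplified illustration under several assumptions: widgets are put into a total order by sorting on `(y, x)` rather than the partial order the abstract mentions, and the `Widget` class, the cost values, and the 100-pixel normalization are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Widget:
    kind: str
    x: int
    y: int
    width: int
    height: int


def replace_cost(a, b):
    """Cost of matching design widget a to implementation widget b (assumed values)."""
    if a.kind != b.kind:
        return 2.0  # as expensive as deleting one widget and inserting another
    drift = (abs(a.x - b.x) + abs(a.y - b.y)
             + abs(a.width - b.width) + abs(a.height - b.height))
    return min(1.0, drift / 100.0)  # small geometric drift is cheap


def alignment_cost(design, impl, insert_cost=1.0, delete_cost=1.0):
    """Minimum total cost to turn the design widget sequence into the implementation one."""
    design = sorted(design, key=lambda w: (w.y, w.x))
    impl = sorted(impl, key=lambda w: (w.y, w.x))
    n, m = len(design), len(impl)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * delete_cost
    for j in range(1, m + 1):
        dp[0][j] = j * insert_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j] + delete_cost,   # design widget missing in implementation
                dp[i][j - 1] + insert_cost,   # extra widget in implementation
                dp[i - 1][j - 1] + replace_cost(design[i - 1], impl[j - 1]),
            )
    return dp[n][m]
```

A cost of zero means the two screens align exactly; a non-zero cost points to missing, extra, or displaced widgets, which a fuller version would recover by backtracking through `dp`.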

Article 114

Title@2025-06-08 (7): On Mutation-Guided Unit Test Generation

Title: On Mutation-Guided Unit Test Generation

Auf Mutation-geführte Einheiten-Test-Generierung

Mutation-Guided 单位测试生成 2506.02954v2

Authors (4): Guancheng Wang, Qinghua Xu, Lionel C. Briand, Kui Liu

Unit tests play a vital role in uncovering potential faults in software. While tools like EvoSuite focus on maximizing code coverage, recent advances in large language models (LLMs) have shifted attention toward LLM-based test generation. However, code coverage metrics, such as line and branch coverage, remain overly emphasized in reported research, despite being weak indicators of a test suite’s fault-detection capability. In contrast, mutation score offers a more reliable and stringent measure, as demonstrated in our findings, where some test suites achieve 100% coverage but only a 4% mutation score. Although a few studies consider mutation score, the effectiveness of LLMs in killing mutants remains underexplored. In this paper, we propose MUTGEN, a mutation-guided, LLM-based test generation approach that incorporates mutation feedback directly into the prompt. Evaluated on 204 subjects from two benchmarks, MUTGEN significantly outperforms both EvoSuite and vanilla prompt-based strategies in terms of mutation score. Furthermore, MUTGEN introduces an iterative generation mechanism that pushes the limits of LLMs in killing additional mutants. Our study also provides insights into the limitations of LLM-based generation, analyzing the reasons for live and uncovered mutants, and the impact of different mutation operators on generation effectiveness.

单位测试在发现软件中的潜在缺陷方面发挥着关键作用。EvoSacterite等工具在发现软件中的潜在缺陷方面扮演了关键角色。虽然EvoSite关注最大限度地扩大代码覆盖范围,但大型语言模型(LLMs)最近的进展已经将注意力转移到了基于LLM的测试生成上。然而,尽管对测试套件的缺陷检测能力来说指标薄弱,但报告的研究仍然过分强调代码覆盖面指标(如线和分支覆盖面),尽管这是测试套件检测缺陷的薄弱指标。相比之下,\ Textit{mation 评分提供了一种更加可靠和严格的衡量标准,正如我们的调查结果所显示的,有些测试套件达到100覆盖率,但只有4突变得分。虽然有几项研究考虑突变得分,但LLMMs在杀死变种人方面的效力仍然不足。在本论文中,我们建议采用MUTGEN, 一种以突变基因组为基础的测试生成方法,将突变异反馈直接评估204个主题,MUTGEN在突变得分方面明显超越了EVilla-GENLMs的定位,还在研究中将LMs的生成的极限推介结果和变异特性的生成的极限,也提供了。
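The gap between coverage and mutation score mentioned in the abstract can be made concrete with a toy example. Everything below is hypothetical and unrelated to MUTGEN's code: `clamp` is an invented subject, the three mutants are written by hand, and both test suites execute every line of `clamp`, yet only one of them checks the results.

```python
def clamp(value, low, high):
    if value < low:
        return low
    if value > high:
        return high
    return value


# Hand-written mutants: each one changes a single return value.
def mutant_low_branch(value, low, high):
    if value < low:
        return high  # mutated
    if value > high:
        return high
    return value


def mutant_high_branch(value, low, high):
    if value < low:
        return low
    if value > high:
        return low  # mutated
    return value


def mutant_fallthrough(value, low, high):
    if value < low:
        return low
    if value > high:
        return high
    return low  # mutated


MUTANTS = [mutant_low_branch, mutant_high_branch, mutant_fallthrough]


def weak_suite(f):
    # Reaches every branch (full line coverage) but barely checks behavior.
    for args in [(-5, 0, 10), (50, 0, 10), (5, 0, 10)]:
        assert isinstance(f(*args), int)


def strong_suite(f):
    assert f(-5, 0, 10) == 0
    assert f(50, 0, 10) == 10
    assert f(5, 0, 10) == 5


def mutation_score(suite, mutants):
    """Fraction of mutants on which the suite fails (i.e. that it kills)."""
    killed = 0
    for mutant in mutants:
        try:
            suite(mutant)
        except AssertionError:
            killed += 1
    return killed / len(mutants)
```

Here `mutation_score(weak_suite, MUTANTS)` is 0.0 while `mutation_score(strong_suite, MUTANTS)` is 1.0, even though both suites cover the same lines.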

Article 115

Title@2025-06-08 (7): Selective Prompt Anchoring for Code Generation

Title: Selective Prompt Anchoring for Code Generation

Selektive Prompt-Ankerung für die Code-Generierung

面向代码生成的选择性提示锚定 2408.09121v5

Authors (2): Yuan Tian, Tianyi Zhang

Recent advances in large language models (LLMs) have transformed software development by automatically generating code from natural language. Yet challenges remain in generating fully correct code that aligns with user intent. Our study reveals that LLMs tend to pay less attention to user prompts as more code tokens are generated. We hypothesize that this attention dilution issue is an important reason for code generation errors. To mitigate this issue, we propose Selective Prompt Anchoring (SPA) to guide code LLMs to pay more attention to user intent when generating code. We evaluate SPA using six base LLMs across six benchmarks. Our results demonstrate that SPA enhances Pass@1 by up to 12.9%, consistently outperforming SOTA code generation methods in all settings. Our code is available at https://github.com/magic-YuanTian/Selective-Prompt-Anchoring.

大型语言模型(LLMs)最近的进展通过自动生成自然语言代码改变了软件开发。但在生成完全符合用户意图的完全正确的代码方面仍然存在挑战。我们的研究显示,LLMs往往较少关注用户提示,因为生成了更多的代码符号。我们假设这种关注稀释问题是代码生成错误的重要原因。为了缓解这一问题,我们提议有选择的快速操作(SPA)来指导代码LLMs, 以指导代码生成时的用户意向。我们用六个基准基底LMs来评估SPA。我们的结果表明,SPA将Pass@1提升到12.9%,在所有环境中持续超过SOTA代码生成方法。我们的代码可以在 https://github.com/magic-YuanTian/Elective-Prompt-Anchoring查阅。

Article 116

Title@2025-06-08 (7): Taxonomy of migration scenarios for Qiskit refactoring using LLMs

Title: Taxonomy of migration scenarios for Qiskit refactoring using LLMs

Taxonomie der Migrationsszenarien für die Qiskit-Refaktorisierung mit LLMs

利用LLMM 进行基斯基特再设定的移徙情景分类 2506.07135v1

Authors (4): José Manuel Suárez, Luís Mariano Bibbó, Joaquín Bogado, Alejandro Fernandez

As quantum computing advances, quantum programming libraries’ heterogeneity and steady evolution create new challenges for software developers. Frequent updates in software libraries break working code that needs to be refactored, thus adding complexity to an already complex landscape. These refactoring challenges are, in many cases, fundamentally different from those known in classical software engineering due to the nature of quantum computing software. This study addresses these challenges by developing a taxonomy of quantum circuit refactoring problems, providing a structured framework to analyze and compare different refactoring approaches. Large Language Models (LLMs) have proven to be valuable tools for classical software development, yet their value in quantum software engineering remains unexplored. This study uses LLMs to categorize refactoring needs in migration scenarios between different Qiskit versions. Qiskit documentation and release notes were scrutinized to create an initial taxonomy of the refactoring required for migrating between Qiskit releases. Two taxonomies were produced: one by expert developers and one by an LLM. These taxonomies were compared, analyzing differences and similarities, and were integrated into a unified taxonomy that reflects the findings of both methods. By systematically categorizing refactoring challenges in Qiskit, the unified taxonomy is a foundation for future research on AI-assisted migration while enabling a more rigorous evaluation of automated refactoring techniques. Additionally, this work contributes to quantum software engineering (QSE) by enhancing software development workflows, improving language compatibility, and promoting best practices in quantum programming.

随着量子计算的进步,量子编程图书馆的异质性和稳定的演变为软件开发者带来了新的挑战。软件库的频繁更新打破了需要重新设定的工作代码,从而增加了已经复杂的景观的复杂性。在许多情况下,由于量子计算软件的性质,这些重新设定的挑战与古典软件工程的已知挑战大不相同。本研究通过开发量子电路再设定问题分类学来应对这些挑战,为分析和比较不同的再构方法提供一个结构化框架。大型语言编程模型(LLLMs)已证明了用于经典软件开发的宝贵工具,但它们在量子软件工程工程方面的价值仍未被挖掘。本研究利用LLMS对不同基斯基特版本的迁移情景中的再配置需求进行分类。对Qiskit的文档和发布说明进行了仔细研究,以建立量子电路路再配置问题再配置问题的初步分类学,为分析和比较了不同的再配置方法。通过专家开发者和一个LM公司制作了两个分类。这些分类比较了SALM,分析差异和相似性软件的数值工程价值仍然未被挖掘。这些分类方法用于比较,分析差异和相似之处。LM,利用LM,利用LM对不同迁移方案进行分类,从而将改进了统一的税法基础进行系统化研究,从而将改进了一种统一的税法基础,从而将改进了对税法基础进行再分析,从而将改进了一种税法系的研究,从而将改进了对税法基础进行再分析。

Article 117

Title@2025-06-08 (7): Test-driven Software Experimentation with LASSO: an LLM Prompt Benchmarking Example

Title: Test-driven Software Experimentation with LASSO: an LLM Prompt Benchmarking Example

Testgetriebene Software-Experimente mit LASSO: ein Beispiel für LLM Prompt Benchmarking

LASSO试验驱动软件实验:LLM快速基准示例 2410.08911v2

Authors (1): Marcus Kessel

Empirical software engineering faces a critical gap: the lack of standardized tools for rapid development and execution of Test-Driven Software Experiments (TDSEs) – that is, experiments that involve the execution of software subjects and the observation and analysis of their “de facto” run-time behavior. In this paper we present a general-purpose analysis platform called LASSO that provides a minimal set of domain-specific languages and data structures to conduct TDSEs. By empowering users with an executable scripting language to design and execute TDSEs, LASSO enables efficient evaluation of run-time semantics and execution characteristics in addition to statically determined properties. We present an example TDSE that demonstrates the practical benefits of LASSO’s scripting capabilities for assessing the reliability of LLMs for code generation by means of a self-contained, reusable and extensible study script. The LASSO platform and live pipeline examples are publicly available at: https://softwareobservatorium.github.io/.

经验型软件工程面临一个重大差距:缺乏快速开发和实施测试驱动软件实验(TDSE)的标准化工具,也就是说,涉及软件主体执行的实验以及对其“事实上”运行时间行为的观察和分析。在本文中,我们提出了一个通用分析平台,称为LASSO,提供一套最起码的域名语言和数据结构来进行TDSE。通过赋予具有可执行脚本语言的用户设计和实施TDSE的权能,LASSO除了静态确定的特性外,还能够有效地评价运行时语义和执行特点。我们举了一个TDSE的例子,它展示了LASSO通过自成一体、可再使用和可扩展的研究脚本来评估LLMS生成代码可靠性的编程能力的实际好处。LASSO平台和现场编程实例可在下列网址上公开查阅:https://sowareobservatoratorium.github.io/。

Article 118

Title@2025-06-08 (7): Rethinking the effects of data contamination in Code Intelligence

Title: Rethinking the effects of data contamination in Code Intelligence

Überdenken der Auswirkungen der Datenkontamination in Code Intelligence

重新思考法典情报部门数据污染的影响 2506.02791v2

Authors (9): Zhen Yang, Hongyi Lin, Yifan He, Jie Xu, Zeyu Sun, Shuo Liu, Pengpeng Wang, Zhongxing Yu, Qingyuan Liang

In recent years, code intelligence has gained increasing importance in the field of automated software engineering. Meanwhile, the widespread adoption of Pretrained Language Models (PLMs) and Large Language Models (LLMs) has raised concerns regarding data contamination and its potential impact on model performance evaluation. This paper presents a systematic empirical study of fine-grained data contamination in code intelligence tasks. Our study involves diverse representative PLMs, namely RoBERTa and GPT-2, and LLMs, namely LLaMA and StarCoder, covering three major tasks: code translation, code generation, and code summarization. We categorize contamination scenarios into four types according to code intelligence practice, namely input-only, output-only, unpaired, and paired contamination settings, and construct corresponding experimental and control groups for exploration. Experimental results show that, under the pre-training, fine-tuning, and inference paradigm adopted by PLMs, even deliberately injecting paired contamination does not lead to significant performance overestimation. Direct inference or small-scale fine-tuning, however, uncovers the contamination effects. In contrast, LLMs under the pre-training and inference paradigm are significantly affected by paired contamination. Apart from the above, the other contamination scenarios have no impact on either PLMs or LLMs. Our findings challenge the conventional belief that contamination inevitably leads to performance overestimation, providing new insights into the evaluation and deployment of code intelligence models.

近年来,代码情报在自动化软件工程领域的重要性日益增强,与此同时,广泛采用预先培训语言模型和大语言模型使人们对数据污染及其对模型绩效评估的潜在影响产生关切,本文件介绍了一项系统的经验研究,以调查细化数据污染对代码情报任务的影响。我们的研究涉及不同的代表PLM,即RoBERTA和GPT-2,以及LLLMS,即LalaMA和StarCoder,涵盖三大任务:代码翻译、代码生成和代码合成。我们根据代码情报实践,将污染情景分为四种类型,即只投入、只产出、未覆盖和配对的污染环境,并构建相应的实验和控制组。实验结果显示,根据预先培训、微调和PLMS采用的推断模式,即使有意注射配对的污染也不会导致显著的过度估计。但直接推断或小规模微调揭示了污染效应。相比之下,具有前培训前和内置的污染状况的LLMS和内置的内置,对常规污染的预测性影响,因此无法对常规污染的预测和内置。

Article 119

Title@2025-06-08 (7): Hallucination to Consensus: Multi-Agent LLMs for End-to-End Test Generation with Accurate Oracles

Title: Hallucination to Consensus: Multi-Agent LLMs for End-to-End Test Generation with Accurate Oracles

Halluzination zum Konsens: Multi-Agent LLMs für die End-to-End-Testgenerierung mit präzisen Oracles

致共识的幻觉:与精准甲骨文进行端至端端至末端试验一代的多代理人法学硕士 2506.02943v3

Authors (4): Qinghua Xu, Guancheng Wang, Lionel Briand, Kui Liu

Unit testing plays a critical role in ensuring software correctness. However, writing unit tests manually is laborious, especially for strongly typed languages like Java, motivating the need for automated approaches. Traditional methods primarily rely on search-based or randomized algorithms to generate tests that achieve high code coverage and produce regression oracles, which are derived from the program’s current behavior rather than its intended functionality. Recent advances in large language models (LLMs) have enabled oracle generation from natural language descriptions. However, existing LLM-based methods often require LLM fine-tuning or rely on external tools such as EvoSuite for test prefix generation. In this work, we propose CANDOR, a novel end-to-end, prompt-based LLM framework for automated JUnit test generation. CANDOR orchestrates multiple specialized LLM agents to generate JUnit tests, including both high-quality test prefixes and accurate oracles. To mitigate the notorious hallucinations in LLMs, we introduce a novel strategy that engages multiple reasoning LLMs in a panel discussion and generates accurate oracles based on consensus. Additionally, to reduce the verbosity of reasoning LLMs’ outputs, we propose a novel dual-LLM pipeline to produce concise and structured oracle evaluations. Our experiments on the HumanEvalJava and LeetCodeJava datasets show that CANDOR can generate accurate oracles, is slightly better than EvoSuite at generating tests with high line coverage, and is clearly superior in terms of mutation score. Moreover, CANDOR significantly outperforms the state-of-the-art, prompt-based test generator LLM-Empirical, achieving improvements of 15.8 to 25.1 percentage points in oracle correctness on both correct and faulty source code. Ablation studies confirm the critical contributions of key agents in improving test prefix quality and oracle accuracy.

单位测试在确保软件正确性方面起着关键作用。然而, 以LLM为基础的写字单位测试手工操作是困难的, 特别是像爪哇这样的强型语言, 需要自动化方法。传统方法主要依靠基于搜索或随机的算法来生成测试, 达到高代码覆盖率并产生回归或触角, 这些测试来自程序当前的行为, 而不是其预期的功能。大型语言模型( LLLMs) 的最近进步使得奥氏生成能够从自然语言描述中解析。然而, 现有的LLM 方法往往需要LLM 的精细调或依赖外部工具, 如 EvoSeter 等用于测试前置生成。在此工作中, 我们建议 CANDOR, 一个全新的端到端的LMM框架。 CANDOR 的精度测试中, CANDOR 的精度、 CRLM 的精度分析中, 更清晰的精度中, CAVALM 的精度和精度的精度分析中, CRLM 的精度分析中, 更精确的精度的精度。
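The consensus step described in the CANDOR abstract can be pictured with a deliberately simplified stand-in. The abstract describes a panel discussion among reasoning LLMs; the sketch below replaces that with a plain majority vote over already-collected oracle proposals, and the `consensus_oracle` name and `quorum` parameter are hypothetical.

```python
from collections import Counter


def consensus_oracle(proposals, quorum=0.5):
    """Return the expected value most panelists agree on, or None without a majority.

    `proposals` holds one proposed expected value (as a string) per panelist.
    """
    if not proposals:
        return None
    value, votes = Counter(proposals).most_common(1)[0]
    return value if votes / len(proposals) > quorum else None
```

With proposals `["5", "5", "6"]` the function returns `"5"`; with three different proposals it returns `None`, which a pipeline could treat as a signal to ask the panel again rather than emit an unreliable assertion.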

Article 120

Title@2025-06-08 (7): LEANCODE: Understanding Models Better for Code Simplification of Pre-trained Large Language Models

Title: LEANCODE: Understanding Models Better for Code Simplification of Pre-trained Large Language Models

LEANCODE: Modelle besser verstehen für Code-Vereinfachung von vortrainierten großen Sprachmodellen

LEANCODE: 更好地理解模式,以更好地简化培训前大语言模式的守则 2505.14759v3

Authors (5): Yan Wang, Ling Ding, Tien N Nguyen, Shaohua Wang, Yanan Zheng

Large Language Models for code often entail significant computational complexity, which grows significantly with the length of the input code sequence. We propose LeanCode, a code simplification approach that reduces training and prediction time by using attention scores, computed with respect to code contexts, to represent the tokens’ importance. We advocate for the selective removal of tokens based on the average context-aware attention scores rather than average scores across all inputs. LeanCode uses the attention scores of “CLS” tokens within the encoder for classification tasks, such as code search. It also employs the encoder-decoder attention scores to determine token significance for sequence-to-sequence tasks like code summarization. Our evaluation shows LeanCode’s superiority over the SOTAs DietCode and Slimcode, with improvements of 60% and 16% for code search, and 29% and 27% for code summarization, respectively.

用于代码的大型语言模型往往需要大量的计算复杂性,随着输入代码序列的长度而大幅增长。我们提议使用精密代码来简化代码,以减少培训和预测时间,利用代码背景来利用关注分数来代表符号的重要性。我们主张根据平均背景认知关注分而不是所有投入的平均分数有选择地删除符号。精密代码在代码编码中将“ CLS” 符号的注意分数用于分类任务, 如代码搜索。它还使用编码器- 解码器关注分数来确定序列到序列任务( 如代码合成)的象征意义。我们的评估显示, 精密代码优于SOTAS DieCode 和 Slimcode, 分别改进了60%和16%的代码搜索, 以及29%和27%的代码合成。
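The token-removal step can be sketched as follows, assuming the per-token importance scores have already been computed from attention. The `prune_tokens` helper and its `keep_ratio` parameter are hypothetical and only show the selection logic, not how LeanCode derives or averages its scores.

```python
def prune_tokens(tokens, scores, keep_ratio):
    """Keep the highest-scoring tokens, preserving their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    keep = max(1, round(len(tokens) * keep_ratio))
    # Rank positions by score; ties keep their original relative order.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept_positions = sorted(ranked[:keep])
    return [tokens[i] for i in kept_positions]


tokens = ["def", "add", "(", "a", ",", "b", ")", ":", "return", "a", "+", "b"]
scores = [0.9, 0.8, 0.1, 0.5, 0.05, 0.5, 0.1, 0.1, 0.7, 0.4, 0.6, 0.4]
simplified = prune_tokens(tokens, scores, keep_ratio=0.5)
```

In this made-up example the low-scoring punctuation is dropped first, so the shortened sequence keeps the identifiers and keywords a downstream model is more likely to rely on.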

Article 121

Title@2025-06-08 (7): A Closer Look into Transformer-Based Code Intelligence Through Code Transformation: Challenges and Opportunities

Title: A Closer Look into Transformer-Based Code Intelligence Through Code Transformation: Challenges and Opportunities

Ein genauerer Blick in transformerbasierte Code-Intelligenz durch Code-Transformation: Herausforderungen und Chancen

更仔细地研究《以变革者为基础的守则》通过《守则》的转变而获得的情报:挑战和机遇 2207.04285v2

Authors (7): Yaoxian Li, Shiyi Qi, Cuiyun Gao, Yun Peng, David Lo, Zenglin Xu, Michael R. Lyu

Transformer-based models have demonstrated state-of-the-art performance in many intelligent coding tasks such as code comment generation and code completion. Previous studies show that deep learning models are sensitive to input variations, but few studies have systematically studied the robustness of Transformer under perturbed input code. In this work, we empirically study the effect of semantic-preserving code transformation on the performance of Transformer. Specifically, 24 and 27 code transformation strategies are implemented for two popular programming languages, Java and Python, respectively. To facilitate analysis, the strategies are grouped into five categories: block transformation, insertion/deletion transformation, grammatical statement transformation, grammatical token transformation, and identifier transformation. Experiments on three popular code intelligence tasks, namely code completion, code summarization, and code search, demonstrate that insertion/deletion transformation and identifier transformation have the greatest impact on the performance of Transformer. Our results also suggest that Transformer based on abstract syntax trees (ASTs) shows more robust performance than the model based only on the code sequence under most code transformations. Besides, the design of positional encoding can impact the robustness of Transformer under code transformation. Based on our findings, we distill some insights about the challenges and opportunities for Transformer-based code intelligence.

以变换器为基础的模型在许多智能编码任务(如代码评论生成和代码完成)中展示了最先进的性能。以前的研究表明,深学习模型对输入变异很敏感,但很少有研究系统地研究过输入编码受扰的输入编码下变异器的稳健性。在这项工作中,我们从经验上研究了语义保存代码变异对变异器性能的影响。具体地说,对两种流行的编程语言(Java和Python)分别实施了24和27个代码变异战略。为了便于分析,这些战略分为五类:区块变换、插入/删除变换、语法语义变换、语法符号变换和标识转换。对三种流行代码情报任务的实验,包括代码补全、代码加和代码搜索、演示插入/删除变异异和标识变异能对变异性效果的影响最大。我们的结果还表明,基于抽象合成线的变异能显示的性能比基于大多数代码变异的代号的模型显示的更强性。此外,我们关于定位变异性变变变变能的模型可以影响我们在代码下的变现机的变现机的变异变现机中发现。
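An identifier transformation, one of the five categories listed in the abstract, can be sketched with Python's standard `ast` module (`ast.unparse` requires Python 3.9 or later). This is an illustrative example rather than one of the paper's 27 Python strategies; the `RenameIdentifiers` class, the `rename` helper, and the sample function are assumptions.

```python
import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Rename variables and parameters according to a fixed mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node


def rename(source, mapping):
    """Return source code with identifiers renamed; behavior is unchanged."""
    tree = RenameIdentifiers(mapping).visit(ast.parse(source))
    return ast.unparse(tree)


ORIGINAL = (
    "def total(prices, tax):\n"
    "    subtotal = sum(prices)\n"
    "    return subtotal * (1 + tax)\n"
)
TRANSFORMED = rename(ORIGINAL, {"prices": "v0", "tax": "v1", "subtotal": "v2"})
```

The transformed function computes the same result as the original but no longer carries meaningful names, which is the kind of perturbation under which a model's robustness can be measured.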

Article 122

Title@2025-06-07 (6): Is Your Training Pipeline Production-Ready? A Case Study in the Healthcare Domain

Title: Is Your Training Pipeline Production-Ready? A Case Study in the Healthcare Domain

Ist Ihr Training Pipeline Production-Ready? Eine Fallstudie im Bereich Healthcare

你的训练管道生产-准备? 保健领域案例研究 2506.06946v1

Authors (5): Daniel Lawand, Lucas Quaresma, Roberto Bolgheroni, Alfredo Goldman, Renato Cordeiro Ferreira

Deploying a Machine Learning (ML) training pipeline into production requires robust software engineering practices. This differs significantly from experimental workflows. This experience report investigates this challenge in SPIRA, a project whose goal is to create an ML-Enabled System (MLES) to pre-diagnose respiratory insufficiency via speech analysis. The first version of SPIRA’s training pipeline lacked critical software quality attributes. This paper presents an overview of the MLES, then compares three versions of the architecture of the Continuous Training subsystem, which evolved from a Big Ball of Mud, to a Modular Monolith, towards Microservices, adopting different design principles and patterns to enhance its maintainability, robustness, and extensibility. In this way, the paper seeks to offer insights for both ML Engineers tasked to productionize ML training pipelines and Data Scientists seeking to adopt MLOps practices.

在生产过程中部署机器学习(ML)培训管道需要强有力的软件工程实践,这与实验工作流程大不相同。本经验报告对SPIRA的这一挑战进行了调查。SPIRA项目的目标是通过语言分析建立一个ML-Enabled System(MLES),以便预先诊断呼吸系统不全。SPIRA培训管道的第一版缺乏关键的软件质量属性。本文概述了MLES,然后将连续培训子系统的结构的三个版本进行了比较,从一个大泥球发展到一个模块单体,到MicroServices。通过采用不同的设计原则和模式,提高它的可维持性、稳健性和可延续性。通过这种方式,该文件力求为负责生产ML培训管道的ML工程师和试图采用MLOPs做法的数据科学家提供洞察力。

Article 123

Title@2025-06-07 (6): Object-Spatial Programming

Title: Object-Spatial Programming

Objekträumliche Programmierung

物体空间方案拟订 2503.15812v6

Authors (1): Jason Mars

The evolution of programming languages from low-level assembly to high-level abstractions demonstrates a fundamental principle: by constraining how programmers express computation and enriching semantic information at the language level, we can make previously undecidable program properties tractable for optimization. Building on the insight of this undecidability-lessening effect, we introduce Object-Spatial Programming (OSP), a novel programming model that extends Object-Oriented Programming by introducing topologically-aware class constructs called archetypes. OSP fundamentally inverts the traditional relationship between data and computation, enabling computation to move to data through four specialized archetypes: object classes, node classes (discrete data locations), edge classes (first-class relationships), and walker classes (mobile computational entities). By making topological relationships and traversal patterns explicit at the language level, OSP transforms previously opaque program behaviors into observable, optimizable patterns. This semantic enhancement enables runtime systems to make informed decisions about data locality, parallel execution, and distribution strategies based on explicit topology, while providing programmers with intuitive abstractions for modeling complex systems where connection topology is central to the computational model. The paradigm addresses fundamental limitations in traditional programming models when representing agent-based systems, social networks, neural networks, distributed systems, finite state machines, and other spatially-oriented computational problems, demonstrating how thoughtful abstraction design can simultaneously enhance programmer expressiveness and enable sophisticated system-level optimizations across the computing stack.

编程语言从低级汇编到高级抽象的演进体现了一条基本原则:通过约束程序员表达计算的方式并在语言层面丰富语义信息,可以使原本不可判定的程序属性变得易于优化。基于这种降低不可判定性的效应,我们提出对象-空间编程(OSP),这是一种新的编程模型,通过引入称为原型(archetype)的拓扑感知类结构来扩展面向对象编程。OSP从根本上颠倒了数据与计算之间的传统关系,通过四种专门的原型使计算能够移向数据:对象类、节点类(离散的数据位置)、边类(一等关系)和行走者类(移动的计算实体)。通过在语言层面显式表达拓扑关系和遍历模式,OSP将原本不透明的程序行为转化为可观察、可优化的模式。这种语义增强使运行时系统能够基于显式拓扑,就数据局部性、并行执行和分布策略做出有依据的决策,同时为程序员提供直观的抽象,用于对以连接拓扑为计算模型核心的复杂系统进行建模。该范式解决了传统编程模型在表示基于智能体的系统、社交网络、神经网络、分布式系统、有限状态机以及其他面向空间的计算问题时的根本局限,展示了精心设计的抽象如何在提升程序员表达能力的同时,支持贯穿整个计算栈的复杂系统级优化。
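The node, edge, and walker archetypes named in the abstract can be approximated in plain Python to show the "computation moves to data" idea. This is only an analogy under stated assumptions: the `Node`, `Edge`, `connect`, and `SumWalker` names are hypothetical, and OSP itself proposes language-level constructs rather than a library pattern like this one.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    """A discrete data location."""
    name: str
    value: int
    edges: list = field(default_factory=list)


@dataclass(eq=False)
class Edge:
    """A first-class relationship between two nodes."""
    source: Node
    target: Node
    label: str


def connect(source, target, label):
    edge = Edge(source, target, label)
    source.edges.append(edge)
    return edge


class SumWalker:
    """A mobile computation: it travels along edges and acts at each node."""

    def __init__(self):
        self.total = 0
        self.visited = []

    def visit(self, node):
        self.total += node.value
        self.visited.append(node.name)

    def walk(self, start):
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if id(node) in seen:
                continue
            seen.add(id(node))
            self.visit(node)
            # Reverse so that edges are followed in insertion order.
            stack.extend(edge.target for edge in reversed(node.edges))
        return self
```

Because the topology and the traversal are explicit objects here, a runtime could in principle inspect them, which is the property the abstract argues enables locality and parallelism decisions.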

Article 124

Title@2025-06-07 (6): LLM4Vuln: A Unified Evaluation Framework for Decoupling and Enhancing LLMs’ Vulnerability Reasoning

Title: LLM4Vuln: A Unified Evaluation Framework for Decoupling and Enhancing LLMs’ Vulnerability Reasoning

LLM4Vuln: Ein einheitlicher Bewertungsrahmen für die Entkopplung und Verbesserung der Schwachstelle von LLM

LLM4Vuln:脱钩和加强LLLM女士脆弱性原因统一评价框架 2401.16185v4

Authors (8): Yuqiang Sun, Daoyuan Wu, Yue Xue, Han Liu, Wei Ma, Lyuye Zhang, Yang Liu, Yingjiu Li

Large language models (LLMs) have demonstrated significant potential in various tasks, including those requiring human-level intelligence, such as vulnerability detection. However, recent efforts to use LLMs for vulnerability detection remain preliminary, as they lack a deep understanding of whether a subject LLM’s vulnerability reasoning capability stems from the model itself or from external aids such as knowledge retrieval and tooling support. In this paper, we aim to decouple LLMs’ vulnerability reasoning from other capabilities, such as vulnerability knowledge adoption, context information retrieval, and advanced prompt schemes. We introduce LLM4Vuln, a unified evaluation framework that separates and assesses LLMs’ vulnerability reasoning capabilities and examines improvements when combined with other enhancements. To support this evaluation, we construct UniVul, the first benchmark that provides retrievable knowledge and context-supplementable code across three representative programming languages: Solidity, Java, and C/C++. Using LLM4Vuln and UniVul, we test six representative LLMs (GPT-4.1, Phi-3, Llama-3, o4-mini, DeepSeek-R1, and QwQ-32B) for 147 ground-truth vulnerabilities and 147 non-vulnerable cases in 3,528 controlled scenarios. Our findings reveal the varying impacts of knowledge enhancement, context supplementation, and prompt schemes. We also identify 14 zero-day vulnerabilities in four pilot bug bounty programs, resulting in $3,576 in bounties.

大型语言模型(LLMs)在各种任务中展现出巨大潜力,包括漏洞检测等需要人类水平智能的任务。然而,近期将LLM用于漏洞检测的工作仍处于初步阶段,因为尚不清楚被测LLM的漏洞推理能力究竟源于模型本身,还是源于知识检索和工具支持等外部辅助。本文旨在将LLM的漏洞推理与其他能力(如漏洞知识的采用、上下文信息检索和高级提示方案)解耦。我们提出LLM4Vuln,这是一个统一的评估框架,用于分离并评估LLM的漏洞推理能力,并考察其与其他增强手段结合时的改进。为支持这一评估,我们构建了UniVul,这是首个在三种代表性编程语言(Solidity、Java和C/C++)上提供可检索知识和可补充上下文代码的基准。借助LLM4Vuln和UniVul,我们在3,528个受控场景中,针对147个真实漏洞和147个非漏洞案例测试了六个代表性LLM(GPT-4.1、Phi-3、Llama-3、o4-mini、DeepSeek-R1和QwQ-32B)。我们的结果揭示了知识增强、上下文补充和提示方案的不同影响。我们还在四个试点漏洞赏金项目中发现了14个零日漏洞,获得了3,576美元的赏金。

Article 125

Title@2025-06-07 (6): CleanVul: Automatic Function-Level Vulnerability Detection in Code Commits Using LLM Heuristics

Title: CleanVul: Automatic Function-Level Vulnerability Detection in Code Commits Using LLM Heuristics

CleanVul: Automatische Sicherheitserkennung auf Funktionsebene in Code Commits mit LLM Heuristics

CleanVul: 使用LLM Huristics的代码委员会自动检测功能级别弱点 2411.17274v6

Authors (16): Yikun Li, Ting Zhang, Ratnadira Widyasari, Yan Naing Tun, Huu Hung Nguyen, Tan Bui, Ivana Clairine Irsan, Yiran Cheng, Xiang Lan, Han Wei Ang, Frank Liauw, Martin Weyssow, Hong Jin Kang, Eng Lieh Ouh, Lwin Khin Shar, David Lo

Accurate identification of software vulnerabilities is crucial for system integrity. Vulnerability datasets, often derived from the National Vulnerability Database (NVD) or directly from GitHub, are essential for training machine learning models to detect these security flaws. However, these datasets frequently suffer from significant noise, typically 40% to 75%, due primarily to the automatic and indiscriminate labeling of all changes in vulnerability-fixing commits (VFCs) as vulnerability-related. This misclassification occurs because not all changes in a commit aimed at fixing vulnerabilities pertain to security threats; many are routine updates like bug fixes or test improvements. This paper introduces VulSifter, the first methodology that uses a Large Language Model (LLM) with a heuristic enhancement to automatically identify vulnerability-fixing changes from VFCs, achieving an F1-score of 0.82. VulSifter uses an LLM to comprehend code semantics and contextual information, while applying heuristics to filter out unrelated changes. We applied VulSifter in a large-scale study in which we crawled 127,063 repositories on GitHub, acquiring 5,352,105 commits. We then developed CleanVul, a high-quality dataset comprising 8,203 functions, using our LLM heuristic enhancement approach, demonstrating Correctness (90.6%) comparable to established datasets such as SVEN and PrimeVul. To evaluate the CleanVul dataset, we conducted experiments focusing on fine-tuning various LLMs on CleanVul and other high-quality datasets. Evaluation results reveal that LLMs fine-tuned on CleanVul not only exhibit enhanced accuracy but also superior generalization capabilities compared to those trained on uncleaned datasets. Specifically, models trained on CleanVul and tested on PrimeVul achieve accuracy higher than those trained and tested exclusively on PrimeVul.

对软件脆弱性的准确识别对于系统完整性至关重要。脆弱性数据集通常来自国家脆弱性数据库(NVD)或GitHub, 通常来自国家脆弱性数据库(NVD)或直接来自GitHub, 这对于培训机器学习模型以发现这些安全缺陷至关重要。然而,这些数据集经常受到重大噪音的影响, 通常为40%至75%, 主要原因是将所有脆弱性固定(VFC) 承诺变化的自动和不加区分的标签与脆弱性相关。这种分类之所以发生,是因为并非所有旨在修正脆弱性的承诺的变化都与安全威胁有关; 许多是例行更新,例如错误修正或测试改进。本文介绍了第一个方法,即使用大语言模型(LLLM), 以超超导力方式自动识别脆弱性固定变化, 达到F1至0.82的F值。 VulSifter用于大规模研究, 我们在GitHulHubb上进行了127 063个存储库, 导致获得5,352,105 105 承诺。 VulSifter 使用LM, 来理解代码精度和背景信息, 比较精确地分析90, 而我们则使用Silmalmadealdealdealde, 高级数据升级数据升级数据升级, 演示这些功能, 直判。

Article 126

Title@2025-06-07 (6): Quantum-Based Software Engineering

Title: Quantum-Based Software Engineering

Quantenbasierte Software-Engineering

基于量子的软件工程 2505.23674v2

Authors (1): Jianjun Zhao

Quantum computing has demonstrated the potential to solve computationally intensive problems more efficiently than classical methods. Many software engineering tasks, such as test case selection, static analysis, code clone detection, and defect prediction, involve complex optimization, search, or classification, making them candidates for quantum enhancement. In this paper, we introduce Quantum-Based Software Engineering (QBSE) as a new research direction for applying quantum computing to classical software engineering problems. We outline its scope, clarify its distinction from quantum software engineering (QSE), and identify key problem types that may benefit from quantum optimization, search, and learning techniques. We also summarize existing research efforts that remain fragmented. Finally, we outline a preliminary research agenda that may help guide the future development of QBSE, providing a structured and meaningful direction within software engineering.

量子计算显示出了比古典方法更高效地解决计算密集问题的潜力。许多软件工程任务,如测试案例选择、静态分析、代码克隆检测和缺陷预测等,涉及复杂的优化、搜索或分类,使这些任务成为量子增强的候选。在本文件中,我们引入了量子软件工程(QBSE)作为将量子计算应用到古典软件工程问题的新研究方向。我们概述了其范围,澄清了与量子软件工程(QSE)的区别,并确定了可能受益于量子优化、搜索和学习技术的关键问题类型。我们还总结了仍然支离破碎的现有研究工作。最后,我们概述了可能有助于指导QBSE未来发展的初步研究议程,为软件工程提供了结构化和有意义的方向。

Article 127

Title@2025-06-07 (6): Automated Repair of Ambiguous Natural Language Requirements

Title: Automated Repair of Ambiguous Natural Language Requirements

Automatisierte Reparatur von abwechslungsreichen natürlichen Sprachanforderungen

自动修理模糊的自然语言要求 2505.07270v2

Authors (5): Haoxiang Jia, Robbie Morris, He Ye, Federica Sarro, Sergey Mechtaev

The widespread adoption of large language models (LLMs) in software engineering has amplified the role of natural language (NL). The inherent ambiguity of NL threatens software quality, because ambiguous requirements may lead to faulty program generation. The complexity of ambiguity detection and resolution motivates us to introduce automated repair of ambiguous NL requirements, which we approach by reducing code generation uncertainty and aligning NL with input-output examples. Repairing ambiguity in requirements is a difficult challenge for LLMs, as it demands metacognition: the model must understand how its own interpretation changes when the text is altered. Our experiments show that directly prompting an LLM to detect and resolve ambiguities results in irrelevant or inconsistent clarifications. Our key insight is to decompose this problem into simpler sub-problems that do not require metacognitive reasoning. First, we analyze and repair the LLM’s interpretation of requirements, embodied by the distribution of programs they induce, using traditional testing and program repair. Second, we repair requirements based on the changes to the distribution via contrastive specification inference. We implemented this proposal, dubbed SpecFix, and evaluated it using three state-of-the-art LLMs (GPT-4o, DeepSeek-V3 and Qwen2.5-Coder-32b) across two widely used code generation benchmarks, namely HumanEval+ and MBPP+. Our results show that SpecFix, operating autonomously without human intervention or external information, modifies 23.93% of the requirements, leading to a 33.66% improvement in model Pass@1 on the modified requirements. Across the entire benchmark, this corresponds to a 4.3% increase in overall Pass@1. Importantly, SpecFix’s repairs generalize across models: requirements repaired by one model boost the performance of other models by 9.6%.

大型语言模型(LLMs)在软件工程中的广泛采用放大了自然语言(NL)的作用。NL固有的歧义性威胁着软件质量,因为有歧义的需求可能导致生成错误的程序。歧义检测与消解的复杂性促使我们提出对有歧义的NL需求进行自动修复,我们的做法是降低代码生成的不确定性,并使NL与输入输出示例保持一致。修复需求中的歧义对LLM而言是一项艰巨的挑战,因为这需要元认知:模型必须理解文本被修改后其自身的解释会如何变化。我们的实验表明,直接提示LLM去检测并消解歧义,会产生不相关或不一致的澄清。我们的关键见解是将该问题分解为不需要元认知推理的更简单的子问题。首先,我们利用传统的测试和程序修复技术,分析并修复LLM对需求的解释,这种解释体现为需求所诱导出的程序分布。其次,我们通过对比式规约推断,依据分布的变化来修复需求。我们实现了这一方案,命名为SpecFix,并使用三个最先进的LLM(GPT-4o、DeepSeek-V3和Qwen2.5-Coder-32b)在两个广泛使用的代码生成基准HumanEval+和MBPP+上进行了评估。结果表明,SpecFix在无需人工干预或外部信息的情况下自主运行,修改了23.93%的需求,使模型在被修改需求上的Pass@1提高了33.66%。在整个基准上,这相当于整体Pass@1提高了4.3%。重要的是,SpecFix的修复可以跨模型泛化:由一个模型修复的需求可使其他模型的性能提升9.6%。

Article 128

Title@2025-06-07 (6): Monitoring Continuous Integration Practices in Industry: A Case Study

Title: Monitoring Continuous Integration Practices in Industry: A Case Study

Überwachung kontinuierlicher Integrationspraktiken in der Industrie: Eine Fallstudie

持续监测工业一体化做法:个案研究 2503.02610v2

Authors (3): Jadson Santos, Daniel Alencar da Costa, Uirá Kulesza

In this paper, we study the benefits and challenges of monitoring Continuous Integration (CI) practices in software development. Our aim is to evaluate the impact of monitoring seven CI practices in industry using three organizations in Brazil as case studies. We developed a tool for monitoring CI practices and conducted a multiple case study, applying a mixed-methods strategy. We combined surveys, interviews, log data, and repository data from software projects and their CI services. We gauged the organizations’ interest in monitoring CI practices. The act of monitoring CI provided an overview of the organizational state of practice in terms of CI, motivated further improvement of CI practices, increased the perceived quality of software, and improved project communication. We recommend that companies adopt the monitoring of CI practices and that CI services integrate monitoring functionalities into their dashboards.

在本文中,我们研究了监测软件开发方面持续一体化做法的益处和挑战,目的是评估利用巴西三个组织作为案例研究对行业七种CI做法进行监测的影响,我们开发了一个监测CI做法的工具,并运用混合方法战略进行了多案例研究,我们综合了软件项目及其CI服务所提供的调查、访谈、日志数据和储存数据。我们评估了该组织对监测CI做法的兴趣。监测CI的行动从CI的角度概述了组织做法状况,推动进一步改善CI的做法,提高了软件质量,改进了项目通信。我们建议公司对CI做法进行监测,并建议CI服务将监测功能纳入其仪表板。

Article 129

Title@2025-06-07 (6): On the Need to Monitor Continuous Integration Practices – An Empirical Study

Title: On the Need to Monitor Continuous Integration Practices – An Empirical Study

Über die Notwendigkeit, kontinuierliche Integrationspraktiken zu überwachen – Eine empirische Studie

监测持续融合做法的必要性 – – 经验研究 2409.05101v2

Authors (4): Jadson Santos, Daniel Alencar da Costa, Shane McIntosh, Uirá Kulesza

Continuous Integration (CI) encompasses a set of widely adopted practices that enhance software development. However, there are indications that developers may not adequately monitor CI practices. Hence, this paper explores developers’ perceptions regarding the monitoring of CI practices. To achieve this, we first perform a Document Analysis to assess developers’ expressed need for practice monitoring in pull request comments generated by developers during the development process. After that, we conduct a survey among developers from 121 open-source projects to understand their perception of the significance of monitoring seven CI practices in their projects. Finally, we triangulate the emergent themes from our survey by performing a second Document Analysis to understand the extent of monitoring features supported by existing CI services. Our key findings indicate that: 1) the most frequently mentioned CI practice during the development process is “Test Coverage” (> 80%), while “Build Health” and “Time to Fix a Broken Build” present notable opportunities for monitoring CI practices; 2) developers do not adequately monitor all CI practices and express interest in monitoring additional practices; and 3) the most popular CI services currently offer limited native support for monitoring CI practices, requiring the use of third-party tools. Our results lead us to conclude that monitoring CI practices is often overlooked by both CI services and developers. Using third-party tools in conjunction with CI services is challenging: they monitor some redundant practices and still fall short of fully supporting CI practices monitoring. Therefore, CI services should implement CI practices monitoring, which would facilitate and encourage developers to monitor them.

持续整合(CI)包括一系列广泛采用的做法,以加强软件开发;然而,有迹象表明,开发商可能无法充分监测CI公司的做法。因此,本文件探讨开发商对CI公司做法的看法。为此,我们首先进行文件分析,评估开发商对开发商在开发过程中提出的拉动请求的评论表示的对实践监测的需要。之后,我们对121个公开源码项目的开发商进行调查,了解监测CI公司项目中7种做法的重要性。最后,我们通过进行第二次文件分析,对调查中新出现的主题进行三角分析,了解CI公司现有服务所支持的监测功能的范围。我们的主要调查结果表明:(1) 开发过程中最经常提到的CI公司做法是“测试覆盖率”( > 80),而“健康”和“修复破碎建筑时间”是监测CI公司做法的显著机会;(2) 开发商没有充分监测CI公司所有做法,表示有兴趣监测其他做法;(3) 最受欢迎的CI公司服务目前对监督CI公司做法提供有限的本地支持,需要使用第三方的短期工具。我们在CI公司监督工作中往往忽视CI公司的做法。

Article 130

Title@2025-06-07 (6): Beyond Surface Similarity: Evaluating LLM-Based Test Refactorings with Structural and Semantic Awareness

Title: Beyond Surface Similarity: Evaluating LLM-Based Test Refactorings with Structural and Semantic Awareness

Beyond Surface Similarity: Bewertung von LLM-basierten Test-Refactorings mit strukturellem und semantischem Bewusstsein

超越地表相似性:评估基于LLM的具有结构和语义意识的测试重构因素 2506.06767v1

Authors (8): Wendkûuni C. Ouédraogo, Yinghua Li, Xueqi Dang, Xin Zhou, Anil Koyuncu, Jacques Klein, David Lo, Tegawendé F. Bissyandé

Large Language Models (LLMs) are increasingly employed to automatically refactor unit tests, aiming to enhance readability, naming, and structural clarity while preserving functional behavior. However, evaluating such refactorings remains challenging: traditional metrics like CodeBLEU are overly sensitive to renaming and structural edits, whereas embedding-based similarities capture semantics but ignore readability and modularity. We introduce CTSES, a composite metric that integrates CodeBLEU, METEOR, and ROUGE-L to balance behavior preservation, lexical quality, and structural alignment. CTSES is evaluated on over 5,000 test suites automatically refactored by GPT-4o and Mistral-Large-2407, using Chain-of-Thought prompting, across two established Java benchmarks: Defects4J and SF110. Our results show that CTSES yields more faithful and interpretable assessments, better aligned with developer expectations and human intuition than existing metrics.

大型语言模型(LLMS)越来越多地用于自动再构件单位测试,目的是提高可读性、命名和结构清晰度,同时保持功能行为。然而,评估这类再构件仍然具有挑战性:CocBLU等传统指标对重新命名和结构编辑过于敏感,而基于嵌入的相似性则捕捉语义,但忽视可读性和模块性。我们引入了CTESS,这是一个综合CodelbleU、METEOR和ROUGE-L的复合指标,以平衡行为保护、词汇质量和结构一致性。 CTSES在由GPT-4o和Mistral-Large-2407自动重新构件的5 000多个测试套件测试套件上进行了评价,这些测试套件使用“推力链”的提示,横跨两个既定的爪哇基准:Defects4J和SF110。我们的结果显示,CTES产生更忠实和可解释的评估,比现有指标更符合开发者的期望和人类直觉。
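
To make the idea of a composite metric concrete, the sketch below computes a token-level ROUGE-L F1 score from the longest common subsequence and blends it with CodeBLEU and METEOR scores supplied by the caller. It is a rough illustration, not the CTSES definition: the equal weights, the whitespace tokenisation, and the example CodeBLEU and METEOR values are assumptions, since the abstract does not state how the three components are combined.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 over whitespace-separated tokens."""
    ref, cand = reference.split(), candidate.split()
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def composite_score(codebleu, meteor, rouge_l, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted blend of three similarity scores in [0, 1]; weights are assumed."""
    w_c, w_m, w_r = weights
    return w_c * codebleu + w_m * meteor + w_r * rouge_l


original = "int r = calc.add(2, 2); assertEquals(4, r);"
refactored = "int sum = calc.add(2, 2); assertEquals(4, sum);"

rouge = rouge_l_f1(original, refactored)
# CodeBLEU and METEOR would come from their own tools; these values are made up.
score = composite_score(codebleu=0.60, meteor=0.90, rouge_l=rouge)
print(f"ROUGE-L F1 = {rouge:.3f}, composite = {score:.3f}")
```

A pure rename lowers the LCS-based score only at the renamed tokens, which is the kind of sensitivity a blended metric is meant to dampen.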

Article 131

Title@2025-06-07 (6): Mind the Gap: A Readability-Aware Metric for Test Code Complexity

Title: Mind the Gap: A Readability-Aware Metric for Test Code Complexity

Achten Sie auf die Lücke: Ein Readability-Aware Metric für Test Code Complexity

思维差距:测试编码复杂度的可读性-可读性软件 2506.06764v1

Authors (8): Wendkûuni C. Ouédraogo, Yinghua Li, Xueqi Dang, Xin Zhou, Anil Koyuncu, Jacques Klein, David Lo, Tegawendé F. Bissyandé

Automatically generated unit tests, whether from search-based tools like EvoSuite or from LLMs, vary significantly in structure and readability. Yet most evaluations rely on metrics like Cyclomatic Complexity and Cognitive Complexity, designed for functional code rather than test code. Recent studies have shown that SonarSource’s Cognitive Complexity metric assigns near-zero scores to LLM-generated tests, yet its behavior on EvoSuite-generated tests and its applicability to test-specific code structures remain unexplored. We introduce CCTR, a Test-Aware Cognitive Complexity metric tailored for unit tests. CCTR integrates structural and semantic features like assertion density, annotation roles, and test composition patterns, dimensions ignored by traditional complexity models but critical for understanding test code. We evaluate 15,750 test suites generated by EvoSuite, GPT-4o, and Mistral Large-1024 across 350 classes from Defects4J and SF110. Results show CCTR effectively discriminates between structured and fragmented test suites, producing interpretable scores that better reflect developer-perceived effort. By bridging structural analysis and test readability, CCTR provides a foundation for more reliable evaluation and improvement of generated tests. We publicly release all data, prompts, and evaluation scripts to support replication.

从EvoSite或LLMS-vary等搜索工具自动生成单位测试,这些测试在结构和可读性方面相当明显。然而,大多数评价都依赖于为功能代码而不是测试代码设计的Cylomatic复杂度和认知复杂度等测量标准。最近的研究显示,Sonar源的认知复杂度测量将近零分分配给LLM生成的测试,但其在EvoSite生成的测试中的行为及其对测试特定代码结构的可适用性仍未得到探讨。我们引入了CCCTR,这是为单位测试量而专门设计的测试-软件复杂度指标。CCTR综合了结构性和语义性特征特征,如主张密度、注解作用和测试构成模式多样性,这些特征为传统的复杂度模型所忽视,但对理解测试代码至关重要。我们评估了EvoSite、GPT-4o和Mistral 1024 的测试成套测试套件,涉及350个类别,分别来自Deffects4J和SF110。结果显示,CCTR对结构化和零散化测试成套测试成套测试套件进行了有效的区分,产生了可解释性评分分,生成的分数,从而更好地反映结构复制性、可读性、可读性、可读性测试基础,提供了所有结构测试和精确性基础。

Article 132

Title@2025-06-06 (5): AI Simulation by Digital Twins: Systematic Survey, Reference Framework, and Mapping to a Standardized Architecture

Title: AI Simulation by Digital Twins: Systematic Survey, Reference Framework, and Mapping to a Standardized Architecture

KI-Simulation durch digitale Zwillinge: Systematische Umfrage, Referenzrahmen und Mapping zu einer standardisierten Architektur

AI 数字双对模拟:系统调查、参考框架和绘制标准化建筑图 2506.06580v1

Authors (2): Xiaoran Liu, Istvan David

Insufficient data volume and quality are particularly pressing challenges in the adoption of modern subsymbolic AI. To alleviate these challenges, AI simulation uses virtual training environments in which AI agents can be safely and efficiently developed with simulated, synthetic data. Digital twins open new avenues in AI simulation, as these high-fidelity virtual replicas of physical systems are equipped with state-of-the-art simulators and the ability to further interact with the physical system for additional data collection. In this article, we report on our systematic survey of digital twin-enabled AI simulation. By analyzing 22 primary studies, we identify technological trends and derive a reference framework to situate digital twins and AI components. Based on our findings, we derive a reference framework and provide architectural guidelines by mapping it onto the ISO 23247 reference architecture for digital twins. Finally, we identify challenges and research opportunities for prospective researchers.

为缓解这些挑战,AI模拟利用虚拟培训环境,使AI代理商能够安全有效地利用模拟合成数据开发。数字双胞胎在AI模拟中开辟了新的途径,因为这些高贞洁性物理系统的虚拟复制品配备了最先进的模拟器,并有能力与物理系统进一步互动,以收集更多的数据。在本篇文章中,我们报告了对数字双功能AI模拟进行系统调查的情况。通过分析22项初级研究,我们确定了技术趋势,并制定了数字双胞胎和AI组成部分的定位参考框架。根据我们的调查结果,我们提出了一个参考框架,并通过将其绘制到ISO 23247数字双胞胎参考结构,提供了建筑指南。最后,我们确定了未来研究人员的挑战和研究机会。

Article 133

Title@2025-06-06 (5): DISC: DISC: Dynamic Decomposition Improves LLM Inference Scaling

Title: DISC: DISC: Dynamic Decomposition Improves LLM Inference Scaling

DISC: DISC: Dynamische Zersetzung verbessert LLM-Inferenzskalierung

DISC: DISC: 动态分解改善LLM 推推法的扩大 2502.16706v2

Authors (9): Jonathan Light, Wei Cheng, Benjamin Riviere, Wu Yue, Masafumi Oyamada, Mengdi Wang, Yisong Yue, Santiago Paternain, Haifeng Chen

Inference scaling methods for large language models often work by breaking problems into steps or groups of tokens, then sampling and selecting the best next steps. However, these steps and their sizes are usually fixed or manually designed based on domain knowledge. We introduce dynamic decomposition, a method that adaptively and automatically breaks down solution and reasoning traces into manageable steps during inference. By allocating compute more effectively, especially by subdividing difficult steps and prioritizing their sampling, dynamic decomposition significantly boosts inference efficiency. Experiments on benchmarks like APPS, MATH, and LiveCodeBench show that dynamic decomposition outperforms fixed strategies such as token-level, sentence-level, and single-step decompositions, reducing the pass@10 error rate by 5.0%, 6.7%, and 10.5% respectively. These results show the promise of dynamic decomposition for improving a broad range of inference scaling techniques.

大型语言模型的推论缩放方法往往通过将问题破碎成步骤或一组符号,然后抽样和选择最佳的下一步步骤而发挥作用。然而,这些步骤及其大小通常是根据域知识固定或人工设计的。我们引入了动态分解法,这是一种适应性和自动拆解解决方案的方法,并在推理过程中将痕迹推入可操作的步骤。通过更有效地分配计算,特别是分解困难的步骤并确定其取样的优先次序,动态分解会大大提高推导效率。关于APPS、MATH和LiveCodeBench等基准的实验显示,动态分解会形成代谢式、句级和单步分解法等固定战略,分别将误差率@10降低5.0%、6.7%和10.5%。这些结果显示,动态分解会改善广泛的推断缩放技术。

Article 134

Title@2025-06-06 (5): MergeRepair: An Exploratory Study on Merging Task-Specific Adapters in Code LLMs for Automated Program Repair

Title: MergeRepair: An Exploratory Study on Merging Task-Specific Adapters in Code LLMs for Automated Program Repair

MergeRepair: Eine explorative Studie zum Zusammenführen von Task-spezifischen Adaptern in Code LLMs zur automatischen Programmreparatur

合并Repair:关于将特定任务适应器合并成自动方案维修代码LLMS的探索性研究 2408.09568v3

Authors (4): Meghdad Dehghan, Jie JW Wu, Fatemeh H. Fard, Ali Ouni

Large Language Models (LLMs) have shown high capabilities in several software development-related tasks such as program repair, documentation, code refactoring, debugging, and testing. However, training these models requires massive amounts of data and significant computational resources. Adapters are specialized, small modules designed for parameter-efficient fine-tuning of LLMs for specific tasks, domains, or applications without requiring extensive retraining of the entire model. These adapters offer a more efficient way to customize LLMs for particular needs, leveraging the pre-existing capabilities of the large model. Model (and adapter) merging has emerged as a technique to develop one model capable of multiple tasks, with minimal or no training required. Although model and adapter merging has shown promising performance in domains such as natural language processing and computer vision, its applicability to software engineering tasks remains underexplored. In this paper, we investigate the effectiveness of merged adapters within the context of software engineering, with a particular focus on the Automated Program Repair (APR) task, through our approach, MergeRepair. In particular, we merge multiple task-specific adapters using three different merging methods, including weight-averaging, ties, and dare-ties, and evaluate the performance of the merged adapter on the APR task. We introduce a continual merging approach, a novel method in which we sequentially merge the task-specific adapters, so that the order and weight of the merged adapters play a significant role. We further compare the performance of our approach with a baseline method consisting of equal-weight merging applied to the parameters of different adapters, where all adapters are of equal importance.

大型语言模型(LLMS)在几个软件开发相关任务(如程序维修、文档、代码重构、调试、调试和测试)中表现出高超能力,然而,培训这些模型需要大量数据和大量计算资源。适应器是专门设计的小型模块,用于为具体任务、领域或应用而对LLMS进行参数高效微调,而无需对整个模型进行广泛再培训。这些适应器提供了一种更有效的方法,利用大型模型的原有能力,使LLMS适应特殊需要。模型(和调试器)合并成为一种技术,用来开发一种能够完成多种任务、且不需要培训的模型。虽然模型和调试器合并表明,在自然语言处理和计算机愿景等领域,其对于LLLMSDMS具体任务、领域或应用性能调整参数参数的参数是很有希望的。在本文件中,我们调查了软件工程工程工程中合并适应器的效能,特别侧重于自动化程序调整(APR)任务,通过我们的方法,即MergeRepair。我们用三种不同的合并方法将多个任务调整适应器组合,在其中,我们不断调整的升级的变校正的校正的变校正方法中,我们开始一个我们的工作。
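
The difference between equal-weight merging and sequential (continual) merging can be sketched with plain Python lists standing in for adapter tensors. This is a simplified illustration under stated assumptions, not the MergeRepair implementation: the parameter name `lora_A`, the toy values, and the `step_weight` parameter are hypothetical, and the ties and dare-ties methods are not shown.

```python
def merge_weighted(adapters, weights):
    """Weighted average of adapters; each adapter maps a parameter name to a list of floats."""
    total = sum(weights)
    merged = {}
    for name in adapters[0]:
        columns = zip(*(adapter[name] for adapter in adapters))
        merged[name] = [
            sum(w * v for w, v in zip(weights, column)) / total for column in columns
        ]
    return merged


def merge_continually(adapters, step_weight=0.5):
    """Fold adapters in one at a time, so both order and step_weight affect the result."""
    merged = adapters[0]
    for nxt in adapters[1:]:
        merged = merge_weighted([merged, nxt], [1 - step_weight, step_weight])
    return merged


# Toy task-specific adapters (values are made up for illustration).
repair = {"lora_A": [1.0, 2.0]}
refactor = {"lora_A": [3.0, 4.0]}
docs = {"lora_A": [5.0, 9.0]}

equal = merge_weighted([repair, refactor, docs], [1, 1, 1])
forward = merge_continually([repair, refactor, docs])
backward = merge_continually([docs, refactor, repair])
print(equal, forward, backward)
```

In this sketch the equal-weight baseline is order-independent, whereas the sequential variant gives the most recently merged adapter the largest share, which is one way the order of merging can matter.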

Article 135

Title@2025-06-06 (5): Private GPTs for LLM-driven testing in software development and machine learning

Title: Private GPTs for LLM-driven testing in software development and machine learning

Private GPTs für LLM-gesteuerte Tests in Softwareentwicklung und maschinellem Lernen

在软件开发和机器学习方面进行由LLLM驱动的测试的私人GPT 2506.06509v1

Authors (2): Jakub Jagielski, Markus Abel

In this contribution, we examine the capability of private GPTs to automatically generate executable test code based on requirements. More specifically, we use acceptance criteria as input, formulated as part of epics, or stories, which are typically used in modern development processes. This gives product owners, or business intelligence, respectively, a way to directly produce testable criteria through the use of LLMs. We explore the quality of the so-produced tests in two ways: i) directly by letting the LLM generate code from requirements, ii) through an intermediate step using Gherkin syntax. As a result, it turns out that the two-step procedure yields better results, where we define “better” in terms of human readability and best coding practices, i.e., lines of code and use of additional libraries typically used in testing. Concretely, we evaluate prompt effectiveness across two scenarios: a simple “Hello World” program and a digit classification model, showing that structured prompts lead to higher-quality test outputs.

在此贡献中,我们研究私人全球贸易点根据要求自动生成可执行的测试代码的能力。更具体地说,我们使用接受标准作为投入,作为史诗或故事的一部分,在现代发展进程中通常使用。这分别给产品所有者或商业情报提供了一种通过使用LLMs直接生成可测试标准的途径。我们以两种方式探讨所制成的测试的质量:一)直接让LLM根据要求生成代码,二)通过使用Gherkin 语法的中间步骤,(二)通过使用Gherkin 语法的中间步骤。结果显示,两步程序产生更好的结果,我们在人类可读性和最佳编码做法方面界定得更好,即代码线和使用通常用于测试的其他图书馆。具体地说,我们评估了两种情景的及时有效性:简单的“哈罗世界”方案和数字分类模型,显示结构上的提示导致更高质量的测试产出。

Article 136

Title@2025-06-06 (5): Information-Theoretic Detection of Unusual Source Code Changes

Title: Information-Theoretic Detection of Unusual Source Code Changes

Information-Theoretische Erkennung ungewöhnlicher Quellcode-Änderungen

异常源代码变化的信息理论检测 2506.06508v1

Authors (4): Adriano Torres, Sebastian Baltes, Christoph Treude, Markus Wagner

The code base of software projects evolves essentially through inserting and removing information to and from the source code. We can measure this evolution via the elements of information (tokens, words, nodes) of the respective representation of the code. In this work, we approach the measurement of the information content of the source code of open-source projects from an information-theoretic standpoint. Our focus is on the entropy of two fundamental representations of code: tokens and abstract syntax tree nodes, from which we derive definitions of textual and structural entropy. We proceed with an empirical assessment where we evaluate the evolution patterns of the entropy of 95 actively maintained open source projects. We calculate the statistical relationships between our derived entropy metrics and classic methods of measuring code complexity and learn that entropy may capture different dimensions of complexity than classic metrics. Finally, we conduct entropy-based anomaly detection of unusual changes to demonstrate that our approach may effectively recognise unusual source code change events with over 60% precision, and lay the groundwork for improvements to information-theoretic measurement of source code evolution, thus paving the way for a new approach to statically gauging program complexity throughout its development.

软件项目的代码基础主要通过插入和删除源代码和源代码中的信息而演变。我们可以通过代码各自代表的信息要素( 符号、单词、节点) 来测量这一演变。在这项工作中, 我们从信息理论的角度来测量开源项目源代码的信息内容。我们的重点是两个基本代码表达方式的酶: 符号和抽象合成树节点, 我们从中得出文本和结构昆虫的定义。我们进行一项实验性评估, 评估95个积极维护开放源项目昆虫的演变模式。我们计算出我们衍生的昆虫指标和典型的代码复杂性测量方法之间的统计关系, 并了解酶可能捕捉到与经典计量方法不同的复杂层面。最后, 我们进行基于昆虫的异常异常检测, 以证明我们的方法可以有效识别超过60%的异常源代码变化事件, 并为改进源代码演变的信息理论测量奠定基础, 从而为在开发过程中采用静态合成复杂程序铺设新的方法铺设了道路。
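
As a rough illustration of textual entropy, the sketch below applies the textbook Shannon entropy to the token frequencies of a code snippet. The paper's exact definitions are not given in the abstract, so the regex tokeniser and the `token_entropy` helper are assumptions made for this example.

```python
import math
import re
from collections import Counter


def token_entropy(source):
    """Shannon entropy (bits per token) of the token frequency distribution."""
    # Assumed tokenisation: identifiers/numbers, or single punctuation characters.
    tokens = re.findall(r"\w+|[^\w\s]", source)
    counts = Counter(tokens)
    total = len(tokens)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


before = "x = x"
after = "x = x + 1"
print(f"{token_entropy(before):.3f} -> {token_entropy(after):.3f} bits per token")
```

Tracking how such a value shifts from one commit to the next is one simple way to flag changes whose information content deviates from a project's usual pattern.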

Article 137

Title@2025-06-06 (5): Enhancing Software Supply Chain Security Through STRIDE-Based Threat Modelling of CI/CD Pipelines

Title: Enhancing Software Supply Chain Security Through STRIDE-Based Threat Modelling of CI/CD Pipelines

Verbesserung der Sicherheit der Software-Lieferkette durch STRIDE-basierte Bedrohungsmodellierung von CI/CD-Pipelines

通过对光/光/光/光/光光气管道进行基于战略控制的威胁建模,加强软件供应链安全 2506.06478v1

Authors (1): Sowmiya Dhandapani

With the increasing adoption of Continuous Integration and Continuous Deployment pipelines, securing software supply chains has become a critical challenge for modern DevOps teams. This study addresses these challenges by applying a structured threat modeling approach to identify and mitigate risks throughout the CI/CD lifecycle. By modeling a representative pipeline architecture incorporating tools such as GitHub, Jenkins, Docker, and Kubernetes and applying the STRIDE framework, we systematically analyze vulnerabilities at each stage, from source code management to deployment. Threats are documented and mapped to comprehensive security controls drawn from standards like NIST SP 800-218, OWASP Top 10 CI/CD risks, and the SLSA framework. Controls are further evaluated against SLSA maturity levels to assess improvements in trust and provenance. To operationalize these findings, the study outlines a practical security toolchain integration strategy grounded in Security as Code and Shift Left-Shield Right principles, enabling automated, enforceable security across the pipeline. This approach provides a pragmatic roadmap for enhancing CI/CD pipeline security against evolving software supply chain threats.

随着持续整合和连续部署管道的日益采用,确保软件供应链的安全已成为现代发展业务伙伴团队面临的一项重大挑战。本研究报告通过采用结构化的威胁模型方法,查明和减轻整个CI/CD生命周期的风险,应对这些挑战。通过模拟具有代表性的管道结构,包括GitHub、Jenkins、Docker和Kubernetes等工具,并应用STRIDE框架,我们系统地分析各个阶段的脆弱性,从源码管理到部署。威胁被记录下来,并被映射到全面安全控制,从NISTSP SP 800-218、OWASP Top 10 CI/CD风险和SLSA框架等标准中抽出。对照SLSA成熟度进一步评估控制,以评估信任和出处的改进。为落实这些发现,研究概述了以安全代码和左轮右转原则为基础的实际安全工具链整合战略,使整个管道能够自动、可强制执行的安全。这一方法提供了务实的路线图,以加强CI/CD管道安全,防止软件供应链威胁不断演变。

Article 138

Title@2025-06-06 (5): PyGemini: Unified Software Development towards Maritime Autonomy Systems

Title: PyGemini: Unified Software Development towards Maritime Autonomy Systems

PyGemini: Unified Software Development towards Maritime Autonomy Systems

PyGemini:向海洋自主系统发展统一软件 2506.06262v1

Authors (6): Kjetil Vasstein, Christian Le, Simon Lervåg Breivik, Trygve Maukon Myhr, Annette Stahl, Edmund Førland Brekke

Ensuring the safety and certifiability of autonomous surface vessels (ASVs) requires robust decision-making systems, supported by extensive simulation, testing, and validation across a broad range of scenarios. However, the current landscape of maritime autonomy development is fragmented – relying on disparate tools for communication, simulation, monitoring, and system integration – which hampers interdisciplinary collaboration and inhibits the creation of compelling assurance cases, demanded by insurers and regulatory bodies. Furthermore, these disjointed tools often suffer from performance bottlenecks, vendor lock-in, and limited support for continuous integration workflows. To address these challenges, we introduce PyGemini, a permissively licensed, Python-native framework that builds on the legacy of Autoferry Gemini to unify maritime autonomy development. PyGemini introduces a novel Configuration-Driven Development (CDD) process that fuses Behavior-Driven Development (BDD), data-oriented design, and containerization to support modular, maintainable, and scalable software architectures. The framework functions as a stand-alone application, cloud-based service, or embedded library – ensuring flexibility across research and operational contexts. We demonstrate its versatility through a suite of maritime tools – including 3D content generation for simulation and monitoring, scenario generation for autonomy validation and training, and generative artificial intelligence pipelines for augmenting imagery – thereby offering a scalable, maintainable, and performance-oriented foundation for future maritime robotics and autonomy research.

为应对这些挑战,我们引入了以广泛模拟、测试和验证为后盾的广泛模拟、测试和验证为后盾的强有力的决策系统。然而,当前的海洋自主发展景观是支离破碎的 – – 依赖不同的通信、模拟、监测和系统整合工具,这些工具阻碍跨学科合作,妨碍产生保险人和监管机构所要求的令人信服的保证案例。此外,这些互不连通的工具往往因业绩瓶颈、供应商锁定和对持续整合工作流程的支持有限而受到影响。为了应对这些挑战,我们引入了PyGemini,这是一个以Autaferry Gemini的遗产为基础,统一海洋自主发展;PyGemini引入了一个新的配置驱动开发(CDDD)流程,该流程将行为发展(BDDD)、以数据为导向的设计和集装箱化结合起来,以支持模块、可维持和可扩展的软件结构。框架功能是独立应用、云基服务或嵌入型图书馆,这一架构建立在Autorferry Gemission 的遗产上,以统一海洋自主性研究模式和操作性模型,以此确保未来成本模型生成的灵活性。

Article 139

Title@2025-06-06 (5): DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation

Title: DesignBench: A Comprehensive Benchmark for MLLM-based Front-end Code Generation

DesignBench: Ein umfassender Benchmark für die MLLM-basierte Generierung von Front-End-Codes

设计时区:基于MLLLM的前端代码生成综合基准 2506.06251v1

Authors (7): Jingyu Xiao, Ming Wang, Man Ho Lam, Yuxuan Wan, Junliang Liu, Yintong Huo, Michael R. Lyu

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in automated front-end engineering, e.g., generating UI code from visual designs. However, existing front-end UI code generation benchmarks have the following limitations: (1) While framework-based development has become predominant in modern front-end programming, current benchmarks fail to incorporate mainstream development frameworks. (2) Existing evaluations focus solely on the UI code generation task, whereas practical UI development involves several iterations, including refinement, editing, and issue repair. (3) Current benchmarks employ unidimensional evaluation, lacking investigation into influencing factors like task difficulty, input context variations, and in-depth code-level analysis. To bridge these gaps, we introduce DesignBench, a multi-framework, multi-task evaluation benchmark for assessing MLLMs’ capabilities in automated front-end engineering. DesignBench encompasses three widely-used UI frameworks (React, Vue, and Angular) alongside vanilla HTML/CSS, and evaluates on three essential front-end tasks (generation, edit, and repair) in real-world development workflows. DesignBench contains 900 webpage samples spanning over 11 topics, 9 edit types, and 6 issue categories, enabling detailed analysis of MLLM performance across multiple dimensions. Our systematic evaluation reveals critical insights into MLLMs’ framework-specific limitations, task-related bottlenecks, and performance variations under different conditions, providing guidance for future research in automated front-end development. Our code and data are available at https://github.com/WebPAI/DesignBench.

虽然基于框架的发展在现代前端编程中占据主导地位,但目前的基准没有纳入主流发展框架。 (2) 现有的评价仅侧重于UI代码生成任务,而实用的UI开发则涉及若干迭代,包括改进编辑和修复问题。 (3) 目前的基准采用单维评价,缺乏对任务困难、投入背景变化和深度代码级分析等影响因素的调查。为弥补这些差距,我们采用DefBench、多框架、多任务评估基准,以评估MLLMs在前端自动化工程方面的能力。DefBench包括三个广泛使用的UI框架(React、Vue和Agorma)以及香草 HTML/CS,并评价实际世界发展工作流程中的三项基本前端任务(生成、编辑和修理),缺乏对任务、投入背景变化和深度代码级分析。DefBenBench包含900个网页样本,多框架用于评估MLLM任务层面的多重分析。

Article 140

Title@2025-06-06 (5): Scalable Language Agnostic Taint Tracking using Explicit Data Dependencies

Title: Scalable Language Agnostic Taint Tracking using Explicit Data Dependencies

Skalierbare Sprache Agnostic Taint Tracking mit expliziten Datenabhängigkeiten

使用明确数据依赖性进行可缩放语言 Agnistic Taint 跟踪 2506.06247v1

Authors (4): Sedick David Baker Effendi, Xavier Pinho, Andrei Michael Dreyer, Fabian Yamaguchi

Taint analysis using explicit whole-program data-dependence graphs is powerful for vulnerability discovery but faces two major challenges. First, accurately modeling taint propagation through calls to external library procedures requires extensive manual annotations, which becomes impractical for large ecosystems. Second, the sheer size of whole-program graph representations leads to serious scalability and performance issues, particularly when quick analysis is needed in continuous development pipelines. This paper presents the design and implementation of a system for a language-agnostic data-dependence representation. The system accommodates missing annotations describing the behavior of library procedures by over-approximating data flows, allowing annotations to be added later without recalculation. We contribute this data-flow analysis system to the open-source code analysis platform Joern, making it available to the community.

使用清晰的全方案数据依赖图解进行塔因特分析,对发现脆弱性很有帮助,但面临两大挑战。首先,准确模拟通过外部图书馆程序打来的污点传播需要广泛的人工说明,这对大型生态系统来说不切实际。第二,全方案图解表的大小导致严重的可缩放性和性能问题,特别是在连续开发管道需要快速分析的情况下。本文件介绍了语言不可知数据依赖代表系统的设计和实施。该系统包含通过过度使用数据流动来描述图书馆程序行为的缺失说明,允许以后在不重新计算的情况下添加说明。我们向开放源代码分析平台Joern提供这一数据流分析系统,供社区使用。

Article 141

Title@2025-06-06 (5): MLOps with Microservices: A Case Study on the Maritime Domain

Title: MLOps with Microservices: A Case Study on the Maritime Domain

MLOps mit Microservices: Eine Fallstudie zum maritimen Bereich

具有微服务的多边业务方案:海洋领域案例研究 2506.06202v1

Authors (3): Renato Cordeiro Ferreira, Rowanne Trapmann, Willem-Jan van den Heuvel

This case study describes challenges and lessons learned in building Ocean Guard: a Machine Learning-Enabled System (MLES) for anomaly detection in the maritime domain. First, the paper presents the system’s specification and architecture. Ocean Guard was designed with a microservices architecture to enable multiple teams to work on the project in parallel. Then, the paper discusses how the developers adapted contract-based design to MLOps to achieve that goal. As an MLES, Ocean Guard employs code, model, and data contracts to establish guidelines between its services. This case study hopes to inspire software engineers, machine learning engineers, and data scientists to leverage similar approaches for their systems.

本案例研究介绍了建设海洋警卫队的挑战和经验教训:一个用于海洋领域异常现象探测的机械学习-启用系统(MLES),首先,本文件介绍了该系统的规格和结构。海洋警卫队设计了一个微型服务架构,使多个小组能够同时开展项目工作。然后,本文件讨论了开发人员如何为实现这一目标而调整合同设计,使之适应海洋警卫队:海洋警卫队使用一套规则、模式和数据合同,以确立其服务之间的准则。本案例研究希望激励软件工程师、机器学习工程师和数据科学家利用类似的系统方法。

Article 142

Title@2025-06-06 (5): Obfuscation-Resilient Binary Code Similarity Analysis using Dominance Enhanced Semantic Graph

Title: Obfuscation-Resilient Binary Code Similarity Analysis using Dominance Enhanced Semantic Graph

Obfuscation-Resilient Binary Code Ähnlichkeitsanalyse mit Dominance Enhanced Semantic Graph

利用压强增强语义图分析显要强化语义图 2506.06161v1

Authors (6): Yufeng Wang, Yuhong Feng, Yixuan Cao, Haoran Li, Haiyue Feng, Yifeng Wang

Binary code similarity analysis (BCSA) serves as a core technique for binary analysis tasks such as vulnerability detection. While current graph-based BCSA approaches capture substantial semantics and show strong performance, their performance suffers under code obfuscation due to the unstable control flow. To address this issue, we develop ORCAS, an Obfuscation-Resilient BCSA model based on the Dominance Enhanced Semantic Graph (DESG). The DESG is an original binary code representation that captures more of a binary’s implicit semantics without control flow structure, including inter-instruction relations, inter-basic block relations, and instruction-basic block relations. ORCAS robustly scores semantic similarity across binary functions from different obfuscation options, optimization levels, and instruction set architectures. Extensive evaluation on the BinKit dataset shows ORCAS significantly outperforms eight baselines, achieving an average 12.1% PR-AUC gain over state-of-the-art approaches when all three obfuscation options are combined. Furthermore, ORCAS improves recall by up to 43% on an original obfuscated real-world vulnerability dataset, which we released to facilitate future research.

为解决这一问题,我们开发了基于多功能增强语义图(DEG)的“ORCAS”模型。DESG是一个原始的二进制代号代表,收集了更多的二进制的隐含语义,而没有控制流结构,包括内部关系、基本块间关系和指令基本块关系。此外,ORCAS还从不同的混淆选项、优化级别和指令设置的架构中,将二进制函数的相似性有力地评分为43%,从而便利了原始的脆弱程度,从而便利了我们所释放的原始数据。

Article 143

Title@2025-06-06 (5): Reasoning Through Execution: Unifying Process and Outcome Rewards for Code Generation

Title: Reasoning Through Execution: Unifying Process and Outcome Rewards for Code Generation

Begründung durch Ausführung: Vereinheitlichung von Prozess- und Ergebnisprämien für die Codegenerierung

执行中的理由:代码生成的统一程序和结果奖励 2412.15118v2

Authors (8): Zhuohao Yu, Weizheng Gu, Yidong Wang, Xingru Jiang, Zhengran Zeng, Jindong Wang, Wei Ye, Shikun Zhang

Large Language Models excel at code generation yet struggle with complex programming tasks that demand sophisticated reasoning. To bridge this gap, traditional process supervision relies on learned reward models requiring costly training data and suffering from reward misalignment, while outcome supervision fails for complex tasks needing coordinated intermediate steps. We introduce Outcome Refining Process Supervision, which unifies process and outcome supervision by leveraging executable verification: a tree-structured search framework generates strategic alternatives, profiles execution metrics, and scores candidates via self-critique mechanisms that integrate runtime feedback with reasoning. Experiments across 5 models and 3 benchmarks show consistent gains, with 26.9% higher correctness and 42.2% improved code efficiency. The results demonstrate that ORPS enables LLMs to overcome local optima in code generation, suggesting a promising direction for combining verifiable outcomes with structured reasoning to tackle complex challenges. We open-source at: https://github.com/zhuohaoyu/ORPS

大型语言模型在代码生成方面非常出色,但与复杂的方案编制任务挣扎,需要复杂的推理。为了缩小这一差距,传统程序监督依赖于需要昂贵的培训数据并遭受不匹配的奖励,而结果监督则无法完成需要协调的中间步骤的复杂任务。我们引入了成果改进程序监督,通过利用可执行的核查使进程和结果监督统一起来:树结构搜索框架产生战略替代物、基本执行指标,并通过将运行时间反馈与推理相结合的自我审查机制对候选人进行评分。在5个模型和3个基准的实验显示,收益一致,26.9%的正确性更高,42.2%的代码效率提高。结果显示,ORPS使LMs能够在代码生成中克服当地选择,为将可核查的结果与结构性推理相结合提供了一个有希望的方向。我们公开的网址是:https://github.com/zhuohahouyuu/ORPS。

Article 144

Title@2025-06-06 (5): Leveraging Generative AI for Enhancing Automated Assessment in Programming Education Contests

Title: Leveraging Generative AI for Enhancing Automated Assessment in Programming Education Contests

Nutzung generativer KI zur Verbesserung der Automatisierten Bewertung bei der Programmierung von Bildungswettbewerben

利用 “ 利用激励AI “ 增强方案编制教育竞赛中的自动评估 2506.05990v1

Authors (3): Stefan Dascalescu, Adrian Marius Dumitran, Mihai Alexandru Vasiluta

Competitive programming contests play a crucial role in cultivating computational thinking and algorithmic skills among learners. However, generating comprehensive test cases to effectively assess programming solutions remains resource-intensive and challenging for educators. This paper introduces an innovative NLP-driven method leveraging generative AI (large language models) to automate the creation of high-quality test cases for competitive programming assessments. We extensively evaluated our approach on diverse datasets, including 25 years of Romanian Informatics Olympiad (OJI) data for 5th graders, recent competitions hosted on the Kilonova.ro platform, and the International Informatics Olympiad in Teams (IIOT). Our results demonstrate that AI-generated test cases substantially enhanced assessments, notably identifying previously undetected errors in 67% of the OJI 5th grade programming problems. These improvements underscore the complementary educational value of our technique in formative assessment contexts. By openly sharing our prompts, translated datasets, and methodologies, we offer practical NLP-based tools that educators and contest organizers can readily integrate to enhance assessment quality, reduce workload, and deepen insights into learner performance.

竞争性编程竞赛在培养学员的计算思维和算法技能方面发挥着关键作用。然而,为有效评估方案编制解决方案而生成综合测试案例对于教育工作者来说仍然是资源密集型和具有挑战性。本文介绍了创新的NLP驱动方法,利用特异性人工智能(大语言模型)自动创建高质量的编程评估测试案例;我们广泛评价了我们关于各种数据集的方法,包括罗马尼亚信息学奥林匹克运动(OJI)为5年级学生提供的25年数据、最近在Kilonova.ro平台上举办的竞赛以及国际信息奥林匹克小组(IIOT)上举办的竞赛。我们的结果表明,AI生成的测试案例大大加强了评估,特别是查明了以前在OJI五年级编程67%的编程问题中未察觉到的错误。这些改进突出表明了我们在编程评估环境中技术的互补性教育价值。我们公开分享我们的提示、翻译数据集和方法,提供了实用的NLP工具,教育工作者和竞争组织者可以方便地结合,以提高评估质量、减少工作量和加深对学习者业绩的了解。
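The abstract above reports that generated test cases exposed previously undetected errors. One way such tests can be put to work is differential testing: run a trusted reference solution and a submission on the same inputs and record every disagreement. The sketch below is a minimal illustration under stated assumptions; the digit-sum task, the buggy submission, and the input lists are hypothetical and are not taken from the paper.

```python
from typing import Callable, Iterable, List, Tuple


def find_disagreements(
    reference: Callable[[int], int],
    candidate: Callable[[int], int],
    inputs: Iterable[int],
) -> List[Tuple[int, int, int]]:
    """Return (input, expected, actual) for every input where the candidate differs."""
    failures = []
    for x in inputs:
        expected, actual = reference(x), candidate(x)
        if expected != actual:
            failures.append((x, expected, actual))
    return failures


# Hypothetical task: sum of the decimal digits of a non-negative integer.
def reference_digit_sum(n: int) -> int:
    return sum(int(d) for d in str(n))


# Hypothetical buggy submission: the loop stops at n <= 9, so the leading digit is dropped.
def buggy_digit_sum(n: int) -> int:
    total = 0
    while n > 9:
        total += n % 10
        n //= 10
    return total


original_tests = [0]                      # weak suite: the bug goes unnoticed
generated_tests = [0, 7, 10, 99, 12345]   # stand-in for LLM-generated inputs

missed = find_disagreements(reference_digit_sum, buggy_digit_sum, original_tests)
caught = find_disagreements(reference_digit_sum, buggy_digit_sum, generated_tests)
```

With the weak suite `missed` is empty, while the broader input list fills `caught` with the failing cases, which is the kind of gap the abstract describes.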

Article 145

Title@2025-06-06 (5): A Preference-Driven Methodology for High-Quality Solidity Code Generation

Title: A Preference-Driven Methodology for High-Quality Solidity Code Generation

Eine präferenzorientierte Methodik für die Erzeugung von Soliditätscodes hoher Qualität

高质量Solidity代码生成的偏好驱动方法 2506.03006v2

Authors (5): Zhiyuan Peng, Xin Yin, Chenhao Ying, Chao Ni, Yuan Luo

While Large Language Models (LLMs) have demonstrated remarkable progress in generating functionally correct Solidity code, they continue to face major challenges in producing gas-efficient and secure code, both of which are critical requirements for real-world smart contract deployment. Although recent advances leverage Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO) for code preference alignment, existing approaches treat functional correctness, gas optimization, and security as independent objectives, resulting in contracts that may achieve operational soundness but suffer from prohibitive execution costs or dangerous vulnerabilities. To address these limitations, we propose \mytitle, a novel framework that extends standard DPO beyond human preferences to incorporate quantifiable blockchain-specific metrics, enabling holistic multi-objective optimization specifically tailored for smart contract generation. Our framework introduces a comprehensive evaluation methodology with four complementary metrics: Pass@k (functional correctness), Compile@k (syntactic correctness), Gas@k (gas efficiency), and Secure@k (security assessment), providing rigorous multi-dimensional contract evaluation. Through extensive experimentation, we demonstrate that \mytitle significantly outperforms existing approaches across all critical dimensions, achieving 66.7% Pass@5, 58.9% Gas@5, and 62.5% Secure@5, while generating production-ready smart contracts that are functionally correct, cost-efficient, and secure.

虽然大语言模型(LLM)在生成功能正确的Solidity代码方面取得了显著进展,但在生成节省gas且安全的代码方面仍面临严峻挑战,而这两点正是现实世界智能合约部署的关键要求。虽然最近的进展利用监督微调(SFT)和直接偏好优化(DPO)进行代码偏好对齐,但现有方法将功能正确性、gas优化和安全性作为相互独立的目标处理,导致合约可能运行正常,却存在过高的执行成本或危险的漏洞。为了应对这些限制,我们提出\mytitle,这是一个新的框架,将标准DPO扩展到人类偏好之外,纳入可量化的区块链特定指标,从而实现专门为智能合约生成而设计的整体多目标优化。我们的框架引入了一套全面的评价方法,包含四项互补指标:Pass@k(功能正确性)、Compile@k(语法正确性)、Gas@k(gas效率)和Secure@k(安全评估),提供严格的多维度合约评估。通过广泛的实验,我们表明\mytitle在所有关键维度上都显著优于现有方法,达到66.7%的Pass@5、58.9%的Gas@5和62.5%的Secure@5,同时生成功能正确、成本高效且安全的可投入生产的智能合约。
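The abstract above names Pass@k, Compile@k, Gas@k, and Secure@k without giving formulas. As a point of reference, the sketch below implements the unbiased pass@k estimator that is commonly used in code generation work: with n samples per problem of which c pass, the estimate is 1 - C(n-c, k) / C(n, k). Whether the paper computes its four metrics in exactly this way is an assumption; the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples drawn from n passes.

    n: total samples generated for a problem
    c: number of those samples that pass the check
    k: sample budget being evaluated
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k must contain a passing one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples, 3 of them pass; a single draw succeeds 30% of the time.
single_draw = pass_at_k(10, 3, 1)
```

The same shape could serve for a compile check or a security check by changing what counts as "passing", again as an assumption rather than the paper's stated definition.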

Article 146

Title@2025-06-06 (5): Analysis of cost-efficiency of serverless approaches

Title: Analysis of cost-efficiency of serverless approaches

Analyse der Kosteneffizienz serverloser Ansätze

分析无服务器方法的成本效益 2506.05836v1

Authors (7): Nakhat Syeda, Harsh Shah, Rajvinder Singh, Suraj Jaju, Sumedha Kumar, Gourav Chhabra, Maria Spichkova

In this paper, we present a survey of research studies related to the cost-effectiveness of the serverless approach and the corresponding cost savings. We conducted a systematic literature review using the Google Scholar search engine, covering the period from 2010 to 2024. We identified 34 related studies, from which we extracted 17 parameters that might influence the relative cost savings of applying the serverless approach.

在本文中,我们对与无服务器方法的成本效益及相应成本节约有关的研究进行了综述。我们利用谷歌学术搜索引擎进行了系统性文献综述,涵盖2010年至2024年期间。我们确定了34项相关研究,从中提取了17项可能影响应用无服务器方法的相对成本节约的参数。

Article 147

Title@2025-06-06 (5): Training Software Engineering Agents and Verifiers with SWE-Gym

Title: Training Software Engineering Agents and Verifiers with SWE-Gym

Schulung von Software Engineering Agents und Prüfern mit SWE-Gym

SWE-Gym培训软件工程代理和验证人 2412.21139v2

Authors (7): Jiayi Pan, Xingyao Wang, Graham Neubig, Navdeep Jaitly, Heng Ji, Alane Suhr, Yizhe Zhang

We present SWE-Gym, the first environment for training real-world software engineering (SWE) agents. SWE-Gym contains 2,438 real-world Python task instances, each comprising a codebase with an executable runtime environment, unit tests, and a task specified in natural language. We use SWE-Gym to train language model based SWE agents, achieving up to 19% absolute gains in resolve rate on the popular SWE-Bench Verified and Lite test sets. We also experiment with inference-time scaling through verifiers trained on agent trajectories sampled from SWE-Gym. When combined with our fine-tuned SWE agents, we achieve 32.0% and 26.0% on SWE-Bench Verified and Lite, respectively, reflecting a new state-of-the-art for open-weight SWE agents. To facilitate further research, we publicly release SWE-Gym, models, and agent trajectories.

我们介绍SWE-Gym,这是培训现实世界软件工程代理物的第一个环境。SWE-Gym包含2,438个真实世界Python任务实例,每个实例都包含一个具有可执行运行时间环境的代码库、单位测试和自然语言指定的任务。我们使用SWE-Gym来培训基于SWE代理物的语言模型,在广受欢迎的SWE-Bench 验证和液化测试机中达到高达19%的绝对解析率。我们还通过在SWE-Gym取样的代理物轨迹方面受过训练的验证员进行推断-时间的扩大试验。当我们与我们经过精细调整的SWE-Bench Verific和液化的代理物剂相结合时,我们分别在SWE-Bench Verized和液化和液化剂上实现了32.0%和26.0%的比。为了便利进一步的研究,我们公开释放SWE-Gym、模型和毒剂轨迹。
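The inference-time scaling mentioned in the abstract above pairs an agent with a verifier trained on agent trajectories. A common way to use such a verifier is best-of-n selection: sample several trajectories, score each, and keep the highest-scoring one. The sketch below shows only that selection step; the trajectory records and `toy_verifier` are hypothetical stand-ins, and whether SWE-Gym selects candidates in exactly this way is an assumption.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def best_of_n(candidates: Sequence[T], verifier: Callable[[T], float]) -> T:
    """Return the candidate with the highest verifier score (first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=verifier)


# Toy stand-ins: each trajectory is a dict, and the "verifier" reads a stored score.
trajectories = [
    {"id": "run-a", "score": 0.21},
    {"id": "run-b", "score": 0.87},
    {"id": "run-c", "score": 0.55},
]


def toy_verifier(trajectory: dict) -> float:
    return trajectory["score"]


chosen = best_of_n(trajectories, toy_verifier)
```

In a real setup the verifier would be a learned model and the score a predicted probability of success; here it is a lookup so the selection logic stays visible.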

Article 148

Title@2025-06-06 (5): PoCGen: Generating Proof-of-Concept Exploits for Vulnerabilities in Npm Packages

Title: PoCGen: Generating Proof-of-Concept Exploits for Vulnerabilities in Npm Packages

PoCGen: Proof-of-Concept Exploits für Schwachstellen in Npm-Paketen generieren

PoCGen:为npm包中的漏洞生成概念验证利用代码 2506.04962v2

Authors (3): Deniz Simsek, Aryaz Eghbali, Michael Pradel

Security vulnerabilities in software packages are a significant concern for developers and users alike. Patching these vulnerabilities in a timely manner is crucial to restoring the integrity and security of software systems. However, previous work has shown that vulnerability reports often lack proof-of-concept (PoC) exploits, which are essential for fixing the vulnerability, testing patches, and avoiding regressions. Creating a PoC exploit is challenging because vulnerability reports are informal and often incomplete, and because it requires a detailed understanding of how inputs passed to potentially vulnerable APIs may reach security-relevant sinks. In this paper, we present PoCGen, a novel approach to autonomously generate and validate PoC exploits for vulnerabilities in npm packages. This is the first fully autonomous approach to use large language models (LLMs) in tandem with static and dynamic analysis techniques for PoC exploit generation. PoCGen leverages an LLM for understanding vulnerability reports, for generating candidate PoC exploits, and for validating and refining them. Our approach successfully generates exploits for 77% of the vulnerabilities in the SecBench.js dataset and 39% in a new, more challenging dataset of 794 recent vulnerabilities. This success rate significantly outperforms a recent baseline (by 45 absolute percentage points), while imposing an average cost of $0.02 per generated exploit.

软件包中的安全漏洞是开发者和用户共同关注的重要问题。及时修补这些漏洞对于恢复软件系统的完整性和安全性至关重要。然而,先前的工作表明,漏洞报告往往缺少概念验证(PoC)利用代码,而这对于修复漏洞、测试补丁和避免回归至关重要。创建PoC利用代码具有挑战性,因为漏洞报告是非正式的,而且往往不完整,还因为这需要详细了解传递给潜在易受攻击API的输入如何到达与安全相关的汇点。在本文中,我们介绍PoCGen,这是一种为npm包中的漏洞自主生成和验证PoC利用代码的新方法。这是第一个将大语言模型(LLM)与静态和动态分析技术结合起来用于生成PoC利用代码的完全自主方法。PoCGen利用LLM来理解漏洞报告、生成候选PoC利用代码,并对其进行验证和完善。我们的方法成功地为SecBench.js数据集中77%的漏洞生成了利用代码,并在一个包含794个近期漏洞的更具挑战性的新数据集中达到39%。这一成功率显著优于最近的基线(高出45个绝对百分点),而每个生成的利用代码的平均成本为0.02美元。

Article 149

Title@2025-06-06 (5): Towards Mixed-Criticality Software Architectures for Centralized HPC Platforms in Software-Defined Vehicles: A Systematic Literature Review

Title: Towards Mixed-Criticality Software Architectures for Centralized HPC Platforms in Software-Defined Vehicles: A Systematic Literature Review

Auf dem Weg zu Softwarearchitekturen für zentralisierte HPC-Plattformen in softwaredefinierten Fahrzeugen: Ein systematischer Literaturbericht

面向软件定义车辆中集中式HPC平台的混合关键性软件架构:系统性文献综述 2506.05822v1

Authors (6): Lucas Mauser, Eva Zimmermann, Pavel Nedvědický, Tobias Eisenreich, Moritz Wäschle, Stefan Wagner

Centralized electrical/electronic architectures and High-Performance Computers (HPCs) are redefining automotive software development, challenging traditional microcontroller-based approaches. Ensuring real-time, safety, and scalability in software-defined vehicles necessitates reevaluating how mixed-criticality software is integrated into centralized architectures. While existing research on automotive SoftWare Architectures (SWAs) is relevant to the industry, it often lacks validation through systematic, empirical methods. To address this gap, we conduct a systematic literature review focusing on automotive mixed-criticality SWAs. Our goal is to provide practitioner-oriented guidelines that assist automotive software architects and developers design centralized, mixed-criticality SWAs based on a rigorous and transparent methodology. First, we set up a systematic review protocol grounded in established guidelines. Second, we apply this protocol to identify relevant studies. Third, we extract key functional domains, constraints, and enabling technologies that drive changes in automotive SWAs, thereby assessing the protocol’s effectiveness. Additionally, we extract techniques, architectural patterns, and design practices for integrating mixed-criticality requirements into HPC-based SWAs, further demonstrating the protocol’s applicability. Based on these insights, we propose an exemplary SWA for a microprocessor-based system-on-chip. In conclusion, this study provides a structured approach to explore and realize mixed-criticality software integration for next-generation automotive SWAs, offering valuable insights for industry and research applications.

集中式电气/电子架构和高性能计算机(HPC)正在重新定义汽车软件开发,对传统的基于微控制器的方法提出挑战。要确保软件定义车辆的实时性、安全性和可扩展性,就需要重新评估混合关键性软件如何集成到集中式架构中。虽然现有的汽车软件架构(SWA)研究与行业相关,但往往缺乏通过系统的实证方法进行的验证。为填补这一空白,我们开展了一项以汽车混合关键性SWA为重点的系统性文献综述。我们的目标是提供面向从业者的指南,帮助汽车软件架构师和开发人员基于严格、透明的方法设计集中式混合关键性SWA。首先,我们根据既定准则制定了系统性综述方案。第二,我们应用该方案来确定相关研究。第三,我们提取推动汽车SWA变革的关键功能域、约束条件和使能技术,从而评估该方案的有效性。此外,我们提取了将混合关键性需求集成到基于HPC的SWA中的技术、架构模式和设计实践,进一步证明了该方案的适用性。基于这些见解,我们为基于微处理器的片上系统提出了一个示例性SWA。总之,本研究为探索和实现下一代汽车SWA的混合关键性软件集成提供了一种结构化方法,为工业和研究应用提供了有价值的见解。

Article 150

Title@2025-06-06 (5): CodeContests+: High-Quality Test Case Generation for Competitive Programming

Title: CodeContests+: High-Quality Test Case Generation for Competitive Programming

CodeContests+: Hochqualitative Testfall-Generation für wettbewerbsfähige Programmierung

CodeContests+:为竞赛编程生成高质量测试用例 2506.05817v1

Authors (5): Zihan Wang, Siyao Liu, Yang Sun, Hongyan Li, Kai Shen

Competitive programming, due to its high reasoning difficulty and precise correctness feedback, has become a key task for both training and evaluating the reasoning capabilities of large language models (LLMs). However, while a large amount of public problem data, such as problem statements and solutions, is available, the test cases of these problems are often difficult to obtain. Therefore, test case generation is a necessary task for building large-scale datasets, and the quality of the test cases directly determines the accuracy of the evaluation. In this paper, we introduce an LLM-based agent system that creates high-quality test cases for competitive programming problems. We apply this system to the CodeContests dataset and propose a new version with improved test cases, named CodeContests+. We evaluated the quality of the test cases in CodeContests+. First, we used 1.72 million submissions with pass/fail labels to examine the accuracy of these test cases in evaluation. The results indicated that CodeContests+ achieves significantly higher accuracy than CodeContests, particularly with a notably higher True Positive Rate (TPR). Subsequently, our experiments in LLM Reinforcement Learning (RL) further confirmed that improvements in test case quality yield considerable advantages for RL.

由于其高度推理困难和准确准确性反馈,测试案例的生成由于具有很高的推理难度和精确性,已成为培训和评价大型语言模型(LLMs)推理能力的一项关键任务。然而,尽管有大量的公众问题数据,如问题说明和解决方案,但这些问题的测试案例往往难以获得。因此,测试案例生成是建立大型数据集的必要任务,测试案例的质量直接决定了评估的准确性。在本文件中,我们引入了一个基于LLM的代理系统,为竞争性编程问题创造了高质量的测试案例。我们将该系统应用于代码测试数据集,并提出了具有改进测试案例的新版本,名为代码测试+。我们评估了CoConestestsPlus测试案例的质量。首先,我们使用了172万份带有出入/漏标签的提交材料来检查这些测试案例的准确性。结果显示,CoCC测试+的准确性大大高于代码测试,特别是真实性率(TPR)。随后,我们在LLARM加强学习的实验,以进一步证实在测试中具有显著的优势。
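The evaluation described above compares test-suite verdicts against pass/fail labels on submissions and highlights the True Positive Rate. The sketch below shows how such rates can be computed from (label, verdict) pairs. It assumes that "positive" means a submission labelled correct and that a true positive is a correct submission the test suite accepts; the abstract does not spell out this convention, and the sample records are invented for illustration.

```python
from typing import Iterable, Tuple


def tpr_tnr(records: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Compute (TPR, TNR) from (is_correct, suite_accepts) pairs.

    TPR: share of correct submissions that the test suite accepts.
    TNR: share of incorrect submissions that the test suite rejects.
    """
    tp = fn = tn = fp = 0
    for is_correct, suite_accepts in records:
        if is_correct and suite_accepts:
            tp += 1
        elif is_correct:
            fn += 1
        elif suite_accepts:
            fp += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if tp + fn else float("nan")
    tnr = tn / (tn + fp) if tn + fp else float("nan")
    return tpr, tnr


# Invented sample: four correct submissions (one wrongly rejected),
# two incorrect submissions (one wrongly accepted).
sample = [
    (True, True), (True, True), (True, False), (True, True),
    (False, False), (False, True),
]
tpr, tnr = tpr_tnr(sample)
```

A low TPR under this convention would mean the test cases reject valid solutions, for example because a generated input violates the problem constraints.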

Article 151

Title@2025-06-06 (5): RepoMaster: Autonomous Exploration and Understanding of GitHub Repositories for Complex Task Solving

Title: RepoMaster: Autonomous Exploration and Understanding of GitHub Repositories for Complex Task Solving

RepoMaster: Autonome Exploration und Verständnis von GitHub-Lagerstätten für komplexe Aufgabenlösung

RepoMaster:为复杂任务解决而自主探索和了解GitHub储存库 2505.21577v2

Authors (11): Huacan Wang, Ziyi Ni, Shuo Zhang, Shuo Lu, Sen Hu, Ziyang He, Chen Hu, Jiaye Lin, Yifu Guo, Yuntao Du, Pin Lyu

The ultimate goal of code agents is to solve complex tasks autonomously. Although large language models (LLMs) have made substantial progress in code generation, real-world tasks typically demand full-fledged code repositories rather than simple scripts. Building such repositories from scratch remains a major challenge. Fortunately, GitHub hosts a vast, evolving collection of open-source repositories, which developers frequently reuse as modular components for complex tasks. Yet, existing frameworks like OpenHands and SWE-Agent still struggle to effectively leverage these valuable resources. Relying solely on README files provides insufficient guidance, and deeper exploration reveals two core obstacles: overwhelming information and tangled dependencies of repositories, both constrained by the limited context windows of current LLMs. To tackle these issues, we propose RepoMaster, an autonomous agent framework designed to explore and reuse GitHub repositories for solving complex tasks. For efficient understanding, RepoMaster constructs function-call graphs, module-dependency graphs, and hierarchical code trees to identify essential components, providing only identified core elements to the LLMs rather than the entire repository. During autonomous execution, it progressively explores related components using our exploration tools and prunes information to optimize context usage. Evaluated on the adjusted MLE-bench, RepoMaster achieves a 110% relative boost in valid submissions over the strongest baseline OpenHands. On our newly released GitTaskBench, RepoMaster lifts the task-pass rate from 24.1% to 62.9% while reducing token usage by 95%. Our code and demonstration materials are publicly available at https://github.com/wanghuacan/RepoMaster.

代码代理的最终目标是自主地解决复杂任务。虽然大语言模型(LLM)在代码生成方面取得了长足的进步,但现实世界的任务通常需要完整的代码仓库,而不是简单的脚本。从零开始构建这样的仓库仍然是一项重大挑战。幸运的是,GitHub托管着大量不断发展的开源仓库,开发者经常将其作为模块化组件复用于复杂任务。然而,OpenHands和SWE-Agent等现有框架仍然难以有效利用这些宝贵的资源。仅依靠README文件提供的指导不足,更深入的探索揭示了两个核心障碍:信息过载和仓库中错综复杂的依赖关系,两者都受到当前LLM有限上下文窗口的制约。为了解决这些问题,我们提出RepoMaster,这是一个旨在探索和复用GitHub仓库以解决复杂任务的自主代理框架。为了实现高效理解,RepoMaster构建函数调用图、模块依赖图和层次化代码树来识别关键组件,只向LLM提供已识别的核心元素,而不是整个仓库。在自主执行期间,它使用我们的探索工具逐步探索相关组件,并对信息进行剪枝以优化上下文的使用。在调整后的MLE-bench上进行评估时,RepoMaster的有效提交数比最强基线OpenHands相对提高了110%。在我们新发布的GitTaskBench上,RepoMaster将任务通过率从24.1%提高到62.9%,同时将token使用量减少了95%。我们的代码和演示材料可在https://github.com/wanghuacan/RepoMaster公开获取。
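The abstract above mentions module-dependency graphs as one of the structures used to find the essential parts of a repository. The sketch below builds a small graph of that kind for Python sources using the standard `ast` module. It is a simplified illustration, not RepoMaster's implementation: the function name, the in-memory `toy_repo`, and the choice to track only top-level module names are all assumptions.

```python
import ast
from typing import Dict, Set


def module_dependencies(sources: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each module name to the set of in-repository modules it imports."""
    graph: Dict[str, Set[str]] = {name: set() for name in sources}
    for name, code in sources.items():
        for node in ast.walk(ast.parse(code)):
            if isinstance(node, ast.Import):
                targets = [alias.name for alias in node.names]
            elif isinstance(node, ast.ImportFrom) and node.module:
                targets = [node.module]
            else:
                continue
            for target in targets:
                root = target.split(".")[0]
                # Keep only edges to modules that live in the repository itself.
                if root in sources and root != name:
                    graph[name].add(root)
    return graph


# Hypothetical three-file repository held in memory.
toy_repo = {
    "main": "import train\nfrom data import load\nimport os\n",
    "train": "import data\n",
    "data": "import json\n",
}
deps = module_dependencies(toy_repo)
```

Here `deps` records that `main` depends on `train` and `data`, and `train` on `data`; standard-library imports such as `os` and `json` are ignored because they are not part of the repository.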

Article 152

Title@2025-06-06 (5): ProSec: Fortifying Code LLMs with Proactive Security Alignment

Title: ProSec: Fortifying Code LLMs with Proactive Security Alignment

ProSec: Erweiterung von Code LLMs mit proaktiver Sicherheitsausrichtung

ProSec: 使用主动安全对齐来强化代码LLM 2411.12882v3

Authors (6): Xiangzhe Xu, Zian Su, Jinyao Guo, Kaiyuan Zhang, Zhenting Wang, Xiangyu Zhang

While recent code-specific large language models (LLMs) have greatly enhanced their code generation capabilities, the safety of these models remains under-explored, posing potential risks as insecure code generated by these models may introduce vulnerabilities into real-world systems. Existing methods collect security-focused datasets from real-world vulnerabilities for instruction tuning in order to mitigate such issues. However, they are largely constrained by the data sparsity of vulnerable code, and have limited applicability in the multi-stage post-training workflows of modern LLMs. In this paper, we propose ProSec, a novel proactive security alignment approach designed to align code LLMs with secure coding practices. ProSec systematically exposes the vulnerabilities in a code LLM by synthesizing vulnerability-inducing coding scenarios from Common Weakness Enumerations (CWEs) and generates fixes to vulnerable code snippets, allowing the model to learn secure practices through preference learning objectives. The scenarios synthesized by ProSec trigger 25x more vulnerable code than a normal instruction-tuning dataset, resulting in a security-focused alignment dataset 7x larger than the previous work. Experiments show that models trained with ProSec are 25.2% to 35.4% more secure compared to previous work without degrading models’ utility.

虽然最近针对特定代码的大型语言模型(LLMS)大大增强了其代码生成能力,但这些模型的安全性仍然未得到充分探索,从而带来潜在风险,因为这些模型产生的不安全代码可能会给现实世界系统带来脆弱性。现有方法收集来自真实世界脆弱性的以安全为重点的数据集,用于调整教学,以缓解此类问题。但是,它们在很大程度上受到脆弱代码数据模糊性的限制,在现代LLM的多阶段培训后工作流程中的适用性有限。在本文件中,我们提议ProSec,一种新型的积极主动的安全协调方法,旨在将代码LLMs与安全的编码编码编码编码做法相协调。ProSec系统地暴露了代码LM的弱点,办法是将常见弱点生成的编码编码假设综合起来,从常见的 Weakness Enudes(CWES) 生成易变的代码拼拼图,使模型能够通过偏好学习目标学习安全做法。由ProSec 触发的25x比普通指令调整数据集更脆弱的代码合成,导致以安全为重点的调整数据集7x比先前的工作大。实验显示,与先前的耗损性模型相比,受培训的25%的实用性模型显示,比已培训的模型更安全性工作为安全性工作为安全性模型。

Article 153

Title@2025-06-06 (5): Analyzing the Evolution and Maintenance of Quantum Software Repositories

Title: Analyzing the Evolution and Maintenance of Quantum Software Repositories

Analyse der Evolution und Wartung von Quantum Software Repositories

分析量子软件储存库的演变和维护 2501.06894v3

Authors (4): Krishna Upadhyay, Vinaik Chhetri, A. B. Siddique, Umar Farooq

Quantum computing is rapidly advancing, but quantum software development faces significant challenges, including a steep learning curve, high hardware error rates, and a lack of mature engineering practices. This study conducts a large-scale mining analysis of over 21,000 GitHub repositories, containing 1.2 million commits from more than 10,000 developers, to examine the evolution and maintenance of quantum software. We analyze repository growth, programming language and framework adoption, and contributor trends, revealing a 200% increase in repositories and a 150% rise in contributors since 2017. Additionally, we investigate software development and maintenance practices, showing that perfective commits dominate (51.76%), while the low occurrence of corrective commits (18.54%) indicates potential gaps in bug resolution. Furthermore, 34% of reported issues are quantum-specific, highlighting the need for specialized debugging tools beyond conventional software engineering approaches. This study provides empirical insights into the software engineering challenges of quantum computing, offering recommendations to improve development workflows, tooling, and documentation. We are also open-sourcing our dataset to support further analysis by the community and to guide future research and tool development for quantum computing. The dataset is available at: https://github.com/kriss-u/QRepoAnalysis-Paper

量子计算正在迅速发展,但量子软件开发面临重大挑战,包括学习曲线陡峭、硬件错误率高和缺乏成熟的工程实践。本研究对超过21,000个GitHub仓库进行了大规模挖掘分析,其中包含来自10,000多名开发者的120万次提交,以考察量子软件的演变和维护情况。我们分析了仓库的增长、编程语言和框架的采用以及贡献者趋势,发现自2017年以来仓库数量增加了200%,贡献者增加了150%。此外,我们调查了软件开发和维护实践,结果显示完善性提交占主导地位(51.76%),而纠正性提交的低占比(18.54%)表明在缺陷解决方面可能存在差距。此外,34%的已报告问题是量子特有的,凸显了在常规软件工程方法之外对专门调试工具的需求。本研究为量子计算的软件工程挑战提供了实证见解,并就改进开发工作流程、工具和文档提出了建议。我们还开源了数据集,以支持社区的进一步分析,并指导未来量子计算的研究和工具开发。数据集可在以下网址获取:https://github.com/kriss-u/QRepoAnalysis-Paper

Article 154

Title@2025-06-06 (5): Multi-Agent Collaboration via Cross-Team Orchestration

Title: Multi-Agent Collaboration via Cross-Team Orchestration

Multi-Agenten-Zusammenarbeit über Cross-Team-Orchestrierung

通过跨团队编排实现多智能体协作 2406.08979v2

Authors (12): Zhuoyun Du, Chen Qian, Wei Liu, Zihao Xie, YiFei Wang, Rennai Qiu, Yufan Dang, Weize Chen, Cheng Yang, Ye Tian, Xuantang Xiong, Lei Han

Large Language Models (LLMs) have significantly impacted various domains, especially through organized LLM-driven autonomous agents. A representative scenario is in software development, where agents can collaborate in a team like humans, following predefined phases to complete sub-tasks sequentially. However, for an agent team, each phase yields only one possible outcome. This results in the completion of only one development chain, thereby losing the opportunity to explore multiple potential decision paths within the solution space and consequently leading to suboptimal results or extensive trial and error. To address this, we introduce Cross-Team Orchestration (Croto), a scalable multi-team framework that enables orchestrated teams to jointly propose various task-oriented solutions and exchange their insights in an environment that preserves team independence while allowing cross-team collaboration, in order to generate superior solutions. Experiments reveal a notable increase in software quality compared to state-of-the-art baselines. We further tested our framework on story generation tasks, which demonstrated a promising generalization ability of our framework in other domains. The code and data are available at https://github.com/OpenBMB/ChatDev/tree/macnet

大语言模型(LLM)对各个领域产生了重大影响,特别是通过有组织的、由LLM驱动的自主智能体。一个有代表性的场景是软件开发,智能体可以像人类一样在团队中协作,按照预先定义的阶段依次完成子任务。但是,对于一个智能体团队来说,每个阶段只产生一个可能的结果。这导致只能完成一条开发链,从而失去了在解空间内探索多条潜在决策路径的机会,进而导致结果欠佳或大量试错。为了解决这个问题,我们引入跨团队编排(Croto),这是一个可扩展的多团队框架,使经过编排的团队能够共同提出各种面向任务的解决方案,并在保持自身独立的同时于跨团队协作环境中交流各自的见解,以生成更优的解决方案。实验显示,与最先进的基线相比,软件质量有了显著提高。我们还在故事生成任务上测试了我们的框架,结果显示该框架在其他领域具有良好的泛化能力。代码和数据可在https://github.com/OpenBMB/ChatDev/tree/macnet获取。

Article 155

Title@2025-06-06 (5): CoopetitiveV: Leveraging LLM-powered Coopetitive Multi-Agent Prompting for High-quality Verilog Generation

Title: CoopetitiveV: Leveraging LLM-powered Coopetitive Multi-Agent Prompting for High-quality Verilog Generation

CoopetitiveV: LLM-powered Coopetitive Multi-Agent für hochwertige Verilog-Generation

CoopetitiveV:利用LLM驱动的竞合式多智能体提示生成高质量Verilog 2412.11014v2

Authors (9): Zhendong Mi, Renming Zheng, Haowen Zhong, Yue Sun, Seth Kneeland, Sayan Moitra, Ken Kutzer, Zhaozhuo Xu, Shaoyi Huang

Recent advances in agentic LLMs have demonstrated great capabilities in Verilog code generation. However, existing approaches use either LLM-assisted single-agent prompting or cooperation-only multi-agent learning, which leads to: (i) a degeneration issue in single-agent learning, characterized by diminished error detection and correction capabilities; and (ii) error propagation in cooperation-only multi-agent learning, where erroneous information from an earlier agent is propagated to later agents through prompts, which can make the later agents generate buggy code. In this paper, we propose an LLM-based coopetitive multi-agent prompting framework, in which the agents can not only collaborate with each other to form the generation pipeline, but also create a healthy competitive mechanism to improve the generation quality. Our experimental results show that the coopetitive multi-agent framework can effectively mitigate the degeneration risk and reduce error propagation while improving code error correction capabilities, resulting in higher-quality Verilog code generation. The effectiveness of our approach is validated through extensive experiments. On the VerilogEval Machine and Human datasets, CoopetitiveV+GPT-4 achieves 99.2% and 99.1% pass@10 scores, respectively. On RTLLM, CoopetitiveV+GPT-4 obtains 100% syntax and 99.9% functionality pass@5 scores.

智能体式LLM的最新进展在Verilog代码生成方面展现了强大的能力。然而,现有的方法要么使用LLM辅助的单智能体提示,要么使用仅合作的多智能体学习,这将导致:(一)单智能体学习的退化问题:表现为错误检测和纠正能力下降;(二)仅合作的多智能体学习中的错误传播:前一个智能体的错误信息会通过提示传播给后一个智能体,使后者生成有缺陷的代码。在本文中,我们提出一个基于LLM的竞合式多智能体提示框架,其中智能体不仅可以相互协作形成生成流水线,还可以建立良性竞争机制以提高生成质量。我们的实验结果表明,竞合式多智能体框架可以有效减轻退化风险并减少错误传播,同时提高代码纠错能力,从而生成更高质量的Verilog代码。我们方法的有效性通过大量实验得到验证。在VerilogEval Machine和Human数据集上,CoopetitiveV+GPT-4分别达到99.2%和99.1%的pass@10分数。在RTLLM上,CoopetitiveV+GPT-4获得100%的语法pass@5分数和99.9%的功能pass@5分数。

Article 156

Title@2025-06-05 (4): Deployability-Centric Infrastructure-as-Code Generation: An LLM-based Iterative Framework

Title: Deployability-Centric Infrastructure-as-Code Generation: An LLM-based Iterative Framework

Deployability-Centric Infrastructure-as-Code Generation: Ein LLM-basiertes Iteratives Framework

以可部署性为中心的基础设施即代码生成:以LLM为基础的迭代框架 2506.05623v1

Authors (5): Tianyi Zhang, Shidong Pan, Zejun Zhang, Zhenchang Xing, Xiaoyu Sun

Infrastructure-as-Code (IaC) generation holds significant promise for automating cloud infrastructure provisioning. Recent advances in Large Language Models (LLMs) present a promising opportunity to democratize IaC development by generating deployable infrastructure templates from natural language descriptions, but current evaluation focuses on syntactic correctness while ignoring deployability, the fatal measure of IaC template utility. We address this gap through two contributions: (1) IaCGen, an LLM-based deployability-centric framework that uses an iterative feedback mechanism to generate IaC templates, and (2) DPIaC-Eval, a deployability-centric IaC template benchmark consisting of 153 real-world scenarios that can evaluate syntax, deployment, user intent, and security. Our evaluation reveals that state-of-the-art LLMs initially performed poorly, with Claude-3.5 and Claude-3.7 achieving only 30.2% and 26.8% deployment success on the first attempt, respectively. However, IaCGen transforms this performance dramatically: all evaluated models reach over 90% passItr@25, with Claude-3.5 and Claude-3.7 achieving a 98% success rate. Despite these improvements, critical challenges remain in user intent alignment (25.2% accuracy) and security compliance (8.4% pass rate), highlighting areas requiring continued research. Our work provides the first comprehensive assessment of deployability-centric IaC template generation and establishes a foundation for future research.

基础设施即代码(IaC)生成在自动化云基础设施配置方面前景广阔。大语言模型(LLM)的最新进展提供了一个很有希望的机会,可以通过从自然语言描述生成可部署的基础设施模板来普及IaC开发,但目前的评估侧重于语法正确性,而忽视了可部署性这一衡量IaC模板效用的关键指标。我们通过两项贡献来弥补这一差距:(1)IaCGen,一个基于LLM、以可部署性为中心的框架,利用迭代反馈机制生成IaC模板;(2)DPIaC-Eval,一个以可部署性为中心的IaC模板基准,由153个真实场景组成,可以评估语法、部署、用户意图和安全性。我们的评估显示,最先进的LLM最初表现不佳,Claude-3.5和Claude-3.7在首次尝试时的部署成功率分别只有30.2%和26.8%。然而,IaCGen显著改变了这一表现:所有被评估的模型的passItr@25都超过90%,Claude-3.5和Claude-3.7达到98%的成功率。尽管有这些改进,在用户意图对齐(25.2%的准确率)和安全合规(8.4%的通过率)方面仍然存在严峻挑战,凸显了需要继续研究的领域。我们的工作首次对以可部署性为中心的IaC模板生成进行了全面评估,并为今后的研究奠定了基础。
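The iterative feedback mechanism described above can be pictured as a loop: generate a template, validate it, and feed any error back into the next attempt until it passes or the budget runs out. The sketch below shows that control flow only. The function names, the toy generator and validator, and the reading of passItr@25 as "passes within 25 iterations" are assumptions for illustration, not details confirmed by the abstract.

```python
from typing import Callable, List, Optional, Tuple


def generate_with_feedback(
    generate: Callable[[str, List[str]], str],
    validate: Callable[[str], Optional[str]],
    task: str,
    max_iterations: int = 25,
) -> Tuple[Optional[str], int]:
    """Retry generation, passing accumulated error messages back to the generator.

    Returns (template, attempts_used); template is None if no attempt validated.
    """
    feedback: List[str] = []
    for attempt in range(1, max_iterations + 1):
        template = generate(task, feedback)
        error = validate(template)
        if error is None:
            return template, attempt
        feedback.append(error)
    return None, max_iterations


# Toy stand-in for an LLM call: adds the missing field once it has been told about it.
def toy_generate(task: str, feedback: List[str]) -> str:
    parts = ["Resources:"]
    if any("Type" in message for message in feedback):
        parts.append("  Type: bucket")
    return "\n".join(parts)


# Toy stand-in for a syntax or deployment check: returns an error message or None.
def toy_validate(template: str) -> Optional[str]:
    return None if "Type" in template else "missing Type"


template, attempts = generate_with_feedback(toy_generate, toy_validate, "create a bucket")
```

In this toy run the first attempt fails validation and the second succeeds, so `attempts` is 2. A real validator would call a linter or attempt an actual deployment.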

Article 157

Title@2025-06-05 (4): Which Prompting Technique Should I Use? An Empirical Investigation of Prompting Techniques for Software Engineering Tasks

Title: Which Prompting Technique Should I Use? An Empirical Investigation of Prompting Techniques for Software Engineering Tasks

Welche Prompting-Technik sollte ich verwenden? Eine empirische Untersuchung von Prompting-Techniken für Software-Engineering-Aufgaben

我应该使用哪种提示技术?软件工程任务提示技术的实证研究 2506.05614v1

Authors (10): E. G. Santana Jr, Gabriel Benjamin, Melissa Araujo, Harrison Santos, David Freitas, Eduardo Almeida, Paulo Anselmo da M. S. Neto, Jiawei Li, Jina Chun, Iftekhar Ahmed

A growing variety of prompt engineering techniques has been proposed for Large Language Models (LLMs), yet systematic evaluation of each technique on individual software engineering (SE) tasks remains underexplored. In this study, we present a systematic evaluation of 14 established prompt techniques across 10 SE tasks using four LLM models. As identified in the prior literature, the selected prompting techniques span six core dimensions (Zero-Shot, Few-Shot, Thought Generation, Ensembling, Self-Criticism, and Decomposition). They are evaluated on tasks such as code generation, bug fixing, and code-oriented question answering, to name a few. Our results show which prompting techniques are most effective for SE tasks requiring complex logic and intensive reasoning versus those that rely more on contextual understanding and example-driven scenarios. We also analyze correlations between the linguistic characteristics of prompts and the factors that contribute to the effectiveness of prompting techniques in enhancing performance on SE tasks. Additionally, we report the time and token consumption for each prompting technique when applied to a specific task and model, offering guidance for practitioners in selecting the optimal prompting technique for their use cases.

在这项研究中,我们用四种LLM模型对10个SE任务中的14个既定的迅速技术进行了系统评价。正如先前的文献所查明的,选定的促进技术涉及六个核心方面(零热、少热、少热、多思多想、产生、集合、自我批评和分解)。这些评价是针对诸如代码生成、故障修补和以代码为导向的回答问题等任务进行的,仅举几个例子。我们的成果显示,促进技术对需要复杂的逻辑和深入推理的SE任务而言最为有效,而对于更依赖背景理解和实例驱动设想的那些任务而言,这些技术则最为有效。我们还分析了迅速技术的语言特点与有助于提高SE任务绩效的各种因素之间的关系。此外,我们报告了在具体任务和模型中应用每种提示技术的时间和象征性消耗情况,为从业人员选择其使用案例的最佳快速技术提供指导。

Article 158

Title@2025-06-05 (4): A Large Language Model Approach to Identify Flakiness in C++ Projects

Title: A Large Language Model Approach to Identify Flakiness in C++ Projects

Ein Ansatz für ein großes Sprachmodell zur Identifizierung von Flakiness in C++-Projekten

C++项目中查明Flakiness的大语言示范方法 2412.12340v2

Authors (3): Xin Sun, Daniel Ståhl, Kristian Sandahl

The role of regression testing in software testing is crucial as it ensures that any new modifications do not disrupt the existing functionality and behaviour of the software system. The desired outcome is for regression tests to yield identical results without any modifications made to the system being tested. In practice, however, the presence of Flaky Tests introduces non-deterministic behaviour and undermines the reliability of regression testing results. In this paper, we propose an LLM-based approach for identifying the root cause of flaky tests in C++ projects at the code level, with the intention of assisting developers in debugging and resolving them more efficiently. We compile a comprehensive collection of C++ project flaky tests sourced from GitHub repositories. We fine-tune Mistral-7b, Llama2-7b and CodeLlama-7b models on the C++ dataset and an existing Java dataset and evaluate the performance in terms of precision, recall, accuracy, and F1 score. We assess the performance of the models across various datasets and offer recommendations for both research and industry applications. The results indicate that our models exhibit varying performance on the C++ dataset, while their performance is comparable to that of the Java dataset. The Mistral-7b surpasses the other two models regarding all metrics, achieving a score of 1. Our results demonstrate the exceptional capability of LLMs to accurately classify flakiness in C++ and Java projects, providing a promising approach to enhance the efficiency of debugging flaky tests in practice.

在软件测试中,回归测试的作用至关重要,因为它可以确保任何新的修改不会破坏软件系统的现有功能和行为;理想的结果是回归测试取得相同结果,而不对正在测试的系统作任何修改;然而,在实践中,Flaky测试的存在带来了非决定性行为,破坏了回归测试结果的可靠性。在本文件中,我们提议以LLMM为基础的方法,在代码一级确定C++项目中闪烁测试的根本原因,目的是协助开发商调试和更有效地解决这些问题;我们汇编了从GitHub库获得的C++7项目闪烁测试综合收集结果,在未对系统进行任何修改的情况下得出相同的结果;我们在C+g数据集和现有的Java数据集中对Clama-7b模型进行了微调,并评估了在精确、回顾、准确和F1分方面的性能;我们评估了各种数据集中模型的性能,并为研究和工业应用提供了建议。我们的各种模型在C++++7数据库中展示了不同性能,在C+++MLA数据库中展示了不同性的工作表现,同时其业绩可与我们的标准等级的成绩可比标准。

Article 159

Title@2025-06-05 (4): PandasBench: A Benchmark for the Pandas API

Title: PandasBench: A Benchmark for the Pandas API

PandasBench: Ein Benchmark für die Pandas API

PandasBench:Pandas API基准 2506.02345v2

Authors (4): Alex Broihier, Stefanos Baziotis, Daniel Kang, Charith Mendis

The Pandas API has been central to the success of pandas and its alternatives. Despite its importance, there is no benchmark for it, and we argue that we cannot repurpose existing benchmarks (from other domains) for the Pandas API. In this paper, we introduce requirements that are necessary for a Pandas API benchmark, and present the first benchmark that fulfills them: PandasBench. We argue that it should evaluate the real-world coverage of a technique. Yet, real-world coverage is not sufficient for a useful benchmark, and so we also cleaned it of irrelevant code, adapted it for benchmark usage, and introduced input scaling. We claim that uniform scaling used in other benchmarks (e.g., TPC-H) is too coarse-grained for PandasBench, and use a non-uniform scaling scheme. PandasBench is the largest Pandas API benchmark to date, with 102 notebooks and 3,721 cells. We used PandasBench to evaluate Modin, Dask, Koalas, and Dias. This is the largest-scale evaluation of all these techniques to date. Prior works report significant speedups using constrained benchmarks, but we show that on a larger benchmark with real-world code, the most notebooks that got a speedup were 8/102 (~8%) for Modin, and 0 for both Koalas and Dask. Dias showed speedups in up to 55 notebooks (~54%), but it rewrites code incorrectly in certain cases, which had not been observed in prior work. Second, we identified many failures: Modin runs only 72/102 (~70%) notebooks, Dask 4 (~4%), Koalas 10 (~10%), and Dias 97 (~95%).

Pandas API 是 pandas 及其替代方案取得成功的核心。尽管它很重要,目前却没有针对它的基准,而且我们认为无法将(来自其他领域的)现有基准改用于 Pandas API。在本文中,我们提出了 Pandas API 基准所必需的要求,并给出了第一个满足这些要求的基准:PandasBench。我们认为,基准应当评估一项技术对真实世界代码的覆盖程度。然而,仅有真实世界覆盖还不足以构成有用的基准,因此我们还清理了无关代码,将其改造为适合基准测试的形式,并引入了输入缩放。我们认为,其他基准(例如 TPC-H)所使用的统一缩放对 PandasBench 而言粒度过粗,因此采用了非统一缩放方案。PandasBench 是迄今最大的 Pandas API 基准,包含 102 个 notebook 和 3,721 个单元格。我们使用 PandasBench 评估了 Modin、Dask、Koalas 和 Dias,这是迄今对这些技术规模最大的评估。已有工作基于受限的基准报告了显著加速,但我们表明,在包含真实世界代码的更大基准上,获得加速的 notebook 最多为:Modin 8/102(约 8%),Koalas 和 Dask 均为 0。Dias 在多达 55 个 notebook(约 54%)中获得加速,但在某些情况下会错误地重写代码,这在已有工作中未被观察到。此外,我们发现了许多失败:Modin 仅能运行 72/102(约 70%)个 notebook,Dask 为 4 个(约 4%),Koalas 为 10 个(约 10%),Dias 为 97 个(约 95%)。

Article 160

Title@2025-06-05 (4): Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety

Title: Interpretation Meets Safety: A Survey on Interpretation Methods and Tools for Improving LLM Safety

Interpretation trifft auf Sicherheit: Eine Umfrage zu Interpretationsmethoden und Tools zur Verbesserung der LLM-Sicherheit

当解释遇上安全:关于改进 LLM 安全的解释方法与工具的综述 2506.05451v1

Authors (6): Seongmin Lee, Aeree Cho, Grace C. Kim, ShengYun Peng, Mansi Phute, Duen Horng Chau

As large language models (LLMs) see wider real-world use, understanding and mitigating their unsafe behaviors is critical. Interpretation techniques can reveal causes of unsafe outputs and guide safety, but such connections with safety are often overlooked in prior surveys. We present the first survey that bridges this gap, introducing a unified framework that connects safety-focused interpretation methods, the safety enhancements they inform, and the tools that operationalize them. Our novel taxonomy, organized by LLM workflow stages, summarizes nearly 70 works at their intersections. We conclude with open challenges and future directions. This timely survey helps researchers and practitioners navigate key advancements for safer, more interpretable LLMs.

由于大型语言模型(LLMs)看到更广泛的现实世界使用、理解和减轻其不安全行为至关重要。口译技术可以揭示不安全产出的原因并引导安全,但在以前的调查中往往忽略了这种与安全的联系。我们首次调查缩小了这一差距,引入了一个统一框架,将安全重点口译方法、它们所通报的安全增强措施以及操作这些方法的工具联系起来。我们根据LLM工作流程阶段组织的新分类学,总结了近70项交叉工作。我们以公开的挑战和今后的方向结束。这一及时调查有助于研究人员和从业者了解关键进展,以便更安全、更可解释的LMS。

Article 161

Title@2025-06-05 (4): Software Bill of Materials in Software Supply Chain Security A Systematic Literature Review

Title: Software Bill of Materials in Software Supply Chain Security A Systematic Literature Review

Software Bill of Materials in Software Supply Chain Sicherheit Ein systematischer Literaturbericht

软件供应链安全中的软件物料清单:系统性文献综述 2506.03507v2

Authors (4): Eric O’Donoghue, Yvette Hastings, Ernesto Ortiz, A. Redempta Manzi Muneza

Software Bills of Materials (SBOMs) are increasingly regarded as essential tools for securing software supply chains (SSCs), yet their real-world use and adoption barriers remain poorly understood. This systematic literature review synthesizes evidence from 40 peer-reviewed studies to evaluate how SBOMs are currently used to bolster SSC security. We identify five primary application areas: vulnerability management, transparency, component assessment, risk assessment, and SSC integrity. Despite clear promise, adoption is hindered by significant barriers: generation tooling, data privacy, format/standardization, sharing/distribution, cost/overhead, vulnerability exploitability, maintenance, analysis tooling, false positives, hidden packages, and tampering. To structure our analysis, we map these barriers to the ISO/IEC 25019:2023 Quality-in-Use model, revealing critical deficiencies in SBOM trustworthiness, usability, and suitability for security tasks. We also highlight key gaps in the literature. These include the lack of work applying machine learning techniques to assess SBOMs and the limited evaluation of SBOMs and SSCs using software quality assurance techniques. Our findings provide actionable insights for researchers, tool developers, and practitioners seeking to advance SBOM-driven SSC security and lay a foundation for future work at the intersection of SSC assurance, automation, and empirical software engineering.

软件物料清单(SBOM)日益被视为保障软件供应链(SSC)安全的重要工具,然而其实际使用情况和采用障碍仍缺乏深入了解。本系统性文献综述综合了 40 项经同行评审的研究中的证据,以评估目前如何利用 SBOM 加强 SSC 安全。我们确定了五个主要应用领域:漏洞管理、透明度、组件评估、风险评估和 SSC 完整性。尽管前景明确,SBOM 的采用仍受到重大障碍的阻碍:生成工具、数据隐私、格式/标准化、共享/分发、成本/开销、漏洞可利用性、维护、分析工具、误报、隐藏的软件包和篡改。为了组织我们的分析,我们将这些障碍映射到 ISO/IEC 25019:2023 使用质量模型,揭示了 SBOM 在可信度、可用性和对安全任务的适用性方面的关键不足。我们还强调了文献中的关键空白,包括缺乏应用机器学习技术评估 SBOM 的研究,以及使用软件质量保证技术对 SBOM 和 SSC 的评估有限。我们的发现为希望推进 SBOM 驱动的 SSC 安全的研究人员、工具开发者和从业者提供了可操作的见解,并为 SSC 保障、自动化和实证软件工程交叉领域的未来工作奠定了基础。

Article 162

Title@2025-06-05 (4): Beyond the Protocol: Unveiling Attack Vectors in the Model Context Protocol Ecosystem

Title: Beyond the Protocol: Unveiling Attack Vectors in the Model Context Protocol Ecosystem

Jenseits des Protokolls: Enthüllen von Angriffsvektoren im Modell Kontext Protokoll Ökosystem

超越协议:揭示模型上下文协议生态系统中的攻击向量 2506.02040v2

Authors (9): Hao Song, Yiming Shen, Wenxuan Luo, Leixin Guo, Ting Chen, Jiashui Wang, Beibei Li, Xiaosong Zhang, Jiachi Chen

The Model Context Protocol (MCP) is an emerging standard designed to enable seamless interaction between Large Language Model (LLM) applications and external tools or resources. Within a short period, thousands of MCP services have already been developed and deployed. However, the client-server integration architecture inherent in MCP may expand the attack surface against LLM Agent systems, introducing new vulnerabilities that attackers can exploit by designing malicious MCP servers. In this paper, we present the first systematic study of attack vectors targeting the MCP ecosystem. Our analysis identifies four categories of attacks, i.e., Tool Poisoning Attacks, Puppet Attacks, Rug Pull Attacks, and Exploitation via Malicious External Resources. To evaluate the feasibility of these attacks, we conduct experiments following the typical steps of launching an attack through malicious MCP servers: upload-download-attack. Specifically, we first construct malicious MCP servers and successfully upload them to three widely used MCP aggregation platforms. The results indicate that current audit mechanisms are insufficient to identify and prevent the proposed attack methods. Next, through a user study and interview with 20 participants, we demonstrate that users struggle to identify malicious MCP servers and often unknowingly install them from aggregator platforms. Finally, we demonstrate that these attacks can trigger harmful behaviors within the user’s local environment, such as accessing private files or controlling devices to transfer digital assets, by deploying a proof-of-concept (PoC) framework against five leading LLMs. Additionally, based on interview results, we discuss four key challenges faced by the current security ecosystem surrounding MCP servers. These findings underscore the urgent need for robust security mechanisms to defend against malicious MCP servers.

示范背景协议(MCP)是一个新兴标准,旨在让大语言模型(LLM)应用程序和外部工具或资源之间实现无缝互动。在短时期内,已经开发并部署了数千项MCP服务。然而,MCP所固有的客户服务器整合架构可能会扩大对LLM代理系统的攻击面面面,引入新的弱点,使攻击者能够通过设计恶意MCP服务器加以利用。在本文中,我们介绍了针对MCP生态系统的攻击矢量的首次系统研究。我们的分析确定了四类攻击,即工具中毒袭击、布偶袭击、鲁格拉袭击和通过恶意外部资源进行剥削。在短短时期内,已经开发并部署了数千项MCP服务。为了评估这些攻击的可行性,我们在通过恶意 MCP服务器发动攻击的典型步骤之后进行了实验。具体地说,我们首先建造恶意的MCP服务器,并成功地将它们上传到三个广泛使用的MCP聚合平台。结果表明,目前的审计机制不足以识别和防止拟议的攻击方法。接下来,通过对20名参与者的用户进行一项保护,我们证明用户努力辨别了恶意的MCP服务器,我们无法识别当前四类关键的磁盘服务器,然后才能将这五种主要的磁盘服务器,然后我们才能在进入这些平台上显示这些有害的磁盘服务器上,我们进入了这些机器。

Article 163

Title@2025-06-05 (4): LLM-Guided Scenario-based GUI Testing

Title: LLM-Guided Scenario-based GUI Testing

LLM-geführte Szenario-basierte GUI-Tests

LLM 引导的基于场景的 GUI 测试 2506.05079v1

Authors (7): Shengcheng Yu, Yuchen Ling, Chunrong Fang, Quan Zhou, Chunyang Chen, Shaomin Zhu, Zhenyu Chen

Quality assurance of mobile app GUIs is increasingly important. Automated GUI testing approaches based on different strategies have been developed, but large gaps remain between these approaches and app business logic: they do not take the completion of specific testing scenarios as the exploration target, so they miss critical app functionality. Learning from manual testing, which takes testing scenarios with app business logic as the basic granularity, in this paper we use LLMs to understand the semantics presented in the app GUI and how they map to the testing context for specific testing scenarios. Scenario-based GUI tests are then generated under the guidance of multi-agent collaboration. Specifically, we propose ScenGen, a novel LLM-guided scenario-based GUI testing approach in which five agents each take responsibility for a different phase of the manual testing process. The Observer perceives the app GUI state by extracting GUI widgets and forming GUI layouts, understanding the expressed semantics. The app GUI information is then sent to the Decider, which decides on target widgets based on the target testing scenarios. The decision-making process takes the completion of specific testing scenarios as the exploration target. The Executor then executes the required operations on the apps. The Supervisor checks the execution results to determine whether the generated tests are consistent with the completion target of the testing scenarios, ensuring the traceability of test generation and execution. Furthermore, the Recorder records the corresponding GUI test operations in the context memory as an important basis for further decision-making, while monitoring runtime bug occurrences. We evaluate ScenGen, and the results show that it can effectively generate scenario-based GUI tests guided by LLMs.

移动应用 GUI 的质量保证日益重要。已有多种策略的自动化 GUI 测试方法被提出,但这些方法与应用业务逻辑之间仍存在巨大差距:它们不以完成特定测试场景为探索目标,导致遗漏关键的应用功能。人工测试以带有应用业务逻辑的测试场景为基本粒度,受此启发,本文利用 LLM 理解应用 GUI 所呈现的语义,以及这些语义如何依据特定测试场景映射到测试上下文中,进而在多智能体协作的引导下生成基于场景的 GUI 测试。具体而言,我们提出 ScenGen,一种新的 LLM 引导的基于场景的 GUI 测试方法,由五个智能体分别负责人工测试流程的不同阶段。Observer 通过提取 GUI 控件并形成 GUI 布局来感知应用 GUI 状态,理解其表达的语义。随后,应用 GUI 信息被发送给 Decider,由其依据目标测试场景对目标控件作出决策;决策过程以完成特定测试场景为探索目标。Executor 随后在应用上执行所需操作。Supervisor 检查执行结果,判断生成的测试是否与测试场景的完成目标一致,从而保证测试生成与执行的可追溯性。此外,Recorder 将相应的 GUI 测试操作记录到上下文记忆中,作为后续决策的重要依据,同时监控运行时缺陷的出现。对 ScenGen 的评估结果表明,它能够在 LLM 的引导下有效生成基于场景的 GUI 测试。

Article 164

Title@2025-06-05 (4): Tech-ASan: Two-stage check for Address Sanitizer

Title: Tech-ASan: Two-stage check for Address Sanitizer

Tech-ASan: Zweistufiger Check für Address Sanitizer

Tech-ASan:Address Sanitizer 的两阶段检查 2506.05022v1

Authors (7): Yixuan Cao, Yuhong Feng, Huafeng Li, Chongyi Huang, Fangcao Jian, Haoran Li, Xu Wang

Address Sanitizer (ASan) is a sharp weapon for detecting memory safety violations, including temporal and spatial errors hidden in C/C++ programs during execution. However, ASan incurs significant runtime overhead, which limits its efficiency in testing large software. The overhead mainly comes from sanitizer checks due to the frequent and expensive shadow memory access. Over the past decade, many methods have been developed to speed up ASan by eliminating and accelerating sanitizer checks; however, they either fail to adequately eliminate redundant checks or compromise detection capabilities. To address this issue, this paper presents Tech-ASan, a two-stage check based technique to accelerate ASan with safety assurance. First, we propose a novel two-stage check algorithm for ASan, which leverages magic value comparison to reduce most of the costly shadow memory accesses. Second, we design an efficient optimizer to eliminate redundant checks, which integrates a novel algorithm for removing checks in loops. Third, we implement Tech-ASan as a memory safety tool based on the LLVM compiler infrastructure. Our evaluation using the SPEC CPU2006 benchmark shows that Tech-ASan outperforms the state-of-the-art methods with 33.70% and 17.89% less runtime overhead than ASan and ASan--, respectively. Moreover, Tech-ASan detects 56 fewer false negative cases than ASan and ASan-- when testing on the Juliet Test Suite under the same redzone setting.

桑尼泽(ASan)是发现记忆安全违规现象(包括C/C++方案在执行期间隐藏的时间和空间错误)的尖锐武器。然而,阿桑拥有大量运行时间的间接费用,这限制了测试大型软件的效率。管理费用主要来自清洁剂检查,因为经常和昂贵的影子内存访问。在过去十年里,通过消除和加速防疫检查,开发了许多方法来加速桑尼安,但是,它们要么未能充分消除多余的检查或妥协检测能力。为解决这一问题,本文展示了基于C/C++方案在执行期间加速桑尼的两阶段检查技术ASan。首先,我们提出了一个新的两阶段检查算法,它利用神奇价值的比较来减少大部分昂贵的影子内存访问。第二,我们设计了一个高效的优化器来消除重复性检查,它结合了用于消除循环检查的新算法。第三,我们用LLVM编译系统基础设施将Tech-A桑尼作为记忆红红工具。我们使用SPU2006基准来加速桑桑的测试方法,显示桑桑比亚测试案例少一次。
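To make the cost that the abstract attributes to "frequent and expensive shadow memory access" concrete, the following is a minimal Python simulation of the classic ASan shadow-memory check (one shadow byte per 8 application bytes). It is a sketch under stated assumptions, not Tech-ASan's algorithm: the magic-value first stage is not reproduced because the abstract does not specify it. The names `register_object` and `access_is_invalid`, the dictionary standing in for shadow memory, and the toy addresses are all hypothetical; the offset shown is the one commonly used on x86-64 Linux.

```python
# Simulation of the classic ASan check; every name here is illustrative.
SHADOW_SCALE = 3                 # 8 application bytes map to 1 shadow byte
SHADOW_OFFSET = 0x7FFF8000       # assumption: typical x86-64 Linux offset
GRANULE = 1 << SHADOW_SCALE
HEAP_REDZONE = -6                # 0xfa read as a signed byte


def shadow_address(addr: int) -> int:
    """Map an application address to its shadow byte address."""
    return (addr >> SHADOW_SCALE) + SHADOW_OFFSET


def register_object(shadow: dict, base: int, size: int) -> None:
    """Mark `size` bytes at 8-byte-aligned `base` as addressable,
    followed by one poisoned redzone granule."""
    full, rest = divmod(size, GRANULE)
    for i in range(full):
        shadow[shadow_address(base + i * GRANULE)] = 0   # fully addressable
    end = base + full * GRANULE
    if rest:
        shadow[shadow_address(end)] = rest               # first `rest` bytes valid
        end += GRANULE
    shadow[shadow_address(end)] = HEAP_REDZONE


def access_is_invalid(shadow: dict, addr: int, size: int) -> bool:
    """Classic check for an access of `size` <= 8 bytes within one granule.
    Every checked access pays for this shadow load."""
    k = shadow.get(shadow_address(addr), 0)
    return k != 0 and (addr & (GRANULE - 1)) + size - 1 >= k
```

For a 13-byte object at the toy address `0x1000`, an 8-byte read at `0x1000` passes, a 1-byte read at `0x100d` is flagged, and any read in the redzone at `0x1010` is flagged. The point of the sketch is only that each instrumented access performs a shadow lookup, which is the cost the paper aims to reduce.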

Article 165

Title@2025-06-05 (4): BacPrep: An Experimental Platform for Evaluating LLM-Based Bacalaureat Assessment

Title: BacPrep: An Experimental Platform for Evaluating LLM-Based Bacalaureat Assessment

BacPrep: Eine experimentelle Plattform zur Bewertung von LLM-basiertem Bacalaureat Assessment

BacPrep:评估基于 LLM 的 Bacalaureat 测评的实验平台 2506.04989v1

Authors (2): Dumitran Adrian Marius, Dita Radu

Accessing quality preparation and feedback for the Romanian Bacalaureat exam is challenging, particularly for students in remote or underserved areas. This paper introduces BacPrep, an experimental online platform exploring Large Language Model (LLM) potential for automated assessment, aiming to offer a free, accessible resource. Using official exam questions from the last 5 years, BacPrep employs one of Google’s newest models, Gemini 2.0 Flash (released Feb 2025), guided by official grading schemes, to provide experimental feedback. Currently operational, its primary research function is collecting student solutions and LLM outputs. This focused dataset is vital for planned expert validation to rigorously evaluate the feasibility and accuracy of this cutting-edge LLM in the specific Bacalaureat context before reliable deployment. We detail the design, data strategy, status, validation plan, and ethics.

罗马尼亚巴卡拉乌雷亚考试的质量准备和反馈是困难的,特别是对偏远地区或服务不足地区的学生而言。本文介绍BacPrep,这是一个实验性在线平台,探讨大语言模型的自动评估潜力,目的是提供一个免费、无障碍的资源。利用过去五年的官方考试问题,BacPrep在官方评分计划的指导下,利用谷歌的最新模型之一Gemini 2.0 Flash(2025年2月发布)提供实验反馈。目前,其主要研究功能是收集学生解决方案和LLLM产出。这一重点数据集对于计划的专家验证至关重要,以便在可靠部署之前严格评估具体巴卡拉乌雷亚特背景下这一尖端LLM的可行性和准确性。我们详细介绍了设计、数据战略、状况、验证计划和道德规范。

Article 166

Title@2025-06-05 (4): A Multi-Dataset Evaluation of Models for Automated Vulnerability Repair

Title: A Multi-Dataset Evaluation of Models for Automated Vulnerability Repair

Eine Multi-Dataset-Bewertung von Modellen für die Automatisierte Sicherheitsreparatur

对自动脆弱性修复模型的多数据集评价 2506.04987v1

Authors (3): Zanis Ali Khan, Aayush Garg, Qiang Tang

Software vulnerabilities pose significant security threats, requiring effective mitigation. While Automated Program Repair (APR) has advanced in fixing general bugs, vulnerability patching, a security-critical aspect of APR, remains underexplored. This study investigates pre-trained language models, CodeBERT and CodeT5, for automated vulnerability patching across six datasets and four languages. We evaluate their accuracy and generalization to unknown vulnerabilities. Results show that while both models face challenges with fragmented or sparse context, CodeBERT performs comparatively better in such scenarios, whereas CodeT5 excels in capturing complex vulnerability patterns. CodeT5 also demonstrates superior scalability. Furthermore, we test fine-tuned models on both in-distribution (trained) and out-of-distribution (unseen) datasets. While fine-tuning improves in-distribution performance, models struggle to generalize to unseen data, highlighting challenges in robust vulnerability detection. This study benchmarks model performance, identifies limitations in generalization, and provides actionable insights to advance automated vulnerability patching for real-world security applications.

软件的弱点构成严重的安全威胁,需要有效缓解。虽然自动程序修复(APR)在解决一般故障、脆弱性补补补方面有所进展,但非洲同行审议机构的安全关键方面仍未得到充分探讨。本研究调查了经过预先培训的语言模式、代码BERT和代码T5, 用于在6个数据集和4种语文之间进行自动补补补的自动脆弱性。我们评估了这两个模式的准确性和对未知脆弱性的概括性。结果显示,虽然这两个模式在零散或稀少的情况下面临挑战,但代码BERT在这类情况下的表现相对好,而代码T5在捕捉复杂的脆弱性模式方面表现也比较好。代码T5也显示了更高的可扩展性。此外,我们测试了在分配(经过培训的)和分配(不见的)数据集方面经过精细调整的模型。同时微调了分配性业绩,模型努力推广到看不见的数据,突出了可靠脆弱性探测方面的挑战。本研究模型的性能基准、查明了普遍性的局限性,并提供可操作的洞察力,以推进现实世界安全应用的自动弥补脆弱性。

Article 167

Title@2025-06-05 (4): Multi-Language Detection of Design Pattern Instances

Title: Multi-Language Detection of Design Pattern Instances

Mehrsprachige Erkennung von Designmuster-Instanzen

设计模式实例的多语言检测 2506.03903v2

Authors (3): Hugo Andrade, João Bispo, Filipe F. Correia

Code comprehension is often supported by source code analysis tools which provide more abstract views over software systems, such as those detecting design patterns. These tools encompass analysis of source code and ensuing extraction of relevant information. However, the analysis of the source code is often specific to the target programming language. We propose DP-LARA, a multi-language pattern detection tool that uses the multi-language capability of the LARA framework to support finding pattern instances in a code base. LARA provides a virtual AST, which is common to multiple OOP programming languages, and DP-LARA then performs code analysis to detect pattern instances on this abstract representation. We evaluate the detection performance and consistency of DP-LARA with a few software projects. Results show that a multi-language approach does not compromise detection performance, and DP-LARA is consistent across the languages we tested it for (i.e., Java and C/C++). Moreover, by providing a virtual AST as the abstract representation, we believe we have decreased the effort of extending the tool to new programming languages and maintaining existing ones.

代码理解往往得到源代码分析工具的支持,这些工具对软件系统(例如探测设计模式)提供更抽象的观点,这些工具包括对源代码的分析以及随后对相关信息的提取。然而,对源代码的分析往往是针对目标编程语言的。我们建议DP-LARA,这是一个多语言模式检测工具,使用LARA框架的多语言能力,支持在代码库中查找模式实例。LARA提供虚拟AST,这是多种OOP编程语言的通用语言,而DP-LARA则随后对这种抽象表述模式进行代码分析。我们用几个软件项目评估DP-LARA的检测性能和一致性。结果显示,多语言方法不会影响检测性能,DP-LARA在我们测试的语言(即爪哇和C/C++)之间是一致的。此外,通过提供虚拟AST作为抽象的表述,我们认为,将该工具推广到新的编程语言并保持现有语言的努力已经减少。

Article 168

Title@2025-06-05 (4): Deconstructing Obfuscation: A four-dimensional framework for evaluating Large Language Models assembly code deobfuscation capabilities

Title: Deconstructing Obfuscation: A four-dimensional framework for evaluating Large Language Models assembly code deobfuscation capabilities

Dekonstruieren von Obfuscation: Ein vierdimensionaler Rahmen für die Auswertung von Großsprachenmodellen Assembly Code Deobfuscation Fähigkeiten

解构混淆:评估大语言模型汇编代码反混淆能力的四维框架 2505.19887v2

Authors (3): Anton Tkachenko, Dmitrij Suskevic, Benjamin Adolphi

Large language models (LLMs) have shown promise in software engineering, yet their effectiveness for binary analysis remains unexplored. We present the first comprehensive evaluation of commercial LLMs for assembly code deobfuscation. Testing seven state-of-the-art models against four obfuscation scenarios (bogus control flow, instruction substitution, control flow flattening, and their combination), we found striking performance variations, from autonomous deobfuscation to complete failure. We propose a theoretical framework based on four dimensions: Reasoning Depth, Pattern Recognition, Noise Filtering, and Context Integration, explaining these variations. Our analysis identifies five error patterns: predicate misinterpretation, structural mapping errors, control flow misinterpretation, arithmetic transformation errors, and constant propagation errors, revealing fundamental limitations in LLM code processing. We establish a three-tier resistance model: bogus control flow (low resistance), control flow flattening (moderate resistance), and instruction substitution/combined techniques (high resistance). Universal failure against combined techniques demonstrates that sophisticated obfuscation remains effective against advanced LLMs. Our findings suggest a human-AI collaboration paradigm where LLMs reduce expertise barriers for certain reverse engineering tasks while requiring human guidance for complex deobfuscation. This work provides a foundation for evaluating emerging capabilities and developing resistant obfuscation techniques.

大型语言模型(LLMS)在软件工程方面表现出了希望,然而其二进制分析的有效性仍未得到探讨。我们首次对商用LLMS进行了全面的评估,以用于组装代码的分解。我们根据四种模糊假设(博格控制流程、教学替代、控制流程平流及其组合)测试了七种最先进的模型,我们发现从自主脱钩到完全失败的惊人的性能差异。我们提出了一个基于四个层面的理论框架:解释深度、模式识别、噪音过滤和背景整合,解释这些差异。我们的分析确定了五种错误模式:上游误差、结构绘图错误、控制流程错误、算术转换错误和持续的传播错误,揭示了LLMM代码处理中的基本限制。我们建立了三层阻力模型:博格控制流程(低抗力)、控制流程稳定(模范抗力)和教学替代/组合技术(高抗力)。我们提出的理论表明,复杂的粘合法仍然对先进的LMS有效。我们的研究发现,一种人类-AI合作模式,即LOBMS提供新的抗力基础,同时要求降低某些反向工程的复杂工作能力。
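For readers unfamiliar with the obfuscation scenarios named in the abstract, the sketch below illustrates two of them on 32-bit arithmetic in Python rather than assembly: instruction substitution (rewriting an operation into an equivalent but less obvious form) and bogus control flow (guarding real code with an opaque predicate that is always true). These are generic textbook identities chosen for illustration; they are not the paper's test cases, and all function names are hypothetical.

```python
MASK = 0xFFFFFFFF  # emulate 32-bit register arithmetic


def add_plain(a: int, b: int) -> int:
    return (a + b) & MASK


def add_substituted(a: int, b: int) -> int:
    # Instruction substitution: a + b == (a ^ b) + 2 * (a & b)
    return ((a ^ b) + ((a & b) << 1)) & MASK


def sub_substituted(a: int, b: int) -> int:
    # Two's complement identity: a - b == a + ~b + 1
    return (a + (~b & MASK) + 1) & MASK


def opaque_true(x: int) -> bool:
    # x * (x + 1) is a product of consecutive integers, so it is always even.
    return (x * (x + 1)) % 2 == 0


def guarded_add(a: int, b: int, x: int) -> int:
    # Bogus control flow: the else branch is dead code that only adds noise.
    if opaque_true(x):
        return add_plain(a, b)
    return (a * 31 + b) & MASK
```

A deobfuscator, human or LLM, has to recognize that `add_substituted` is plain addition and that the second branch of `guarded_add` can never execute. Combining such transformations is what the abstract reports as the hardest case.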

Article 169

Title@2025-06-05 (4): Tensor-based multivariate function approximation: methods benchmarking and comparison

Title: Tensor-based multivariate function approximation: methods benchmarking and comparison

Tensor-basierte multivariate Funktionsannäherung: Methoden Benchmarking und Vergleich

基于张量的多元函数逼近:方法基准测试与比较 2506.04791v1

Authors (4): Athanasios C. Antoulas, Ion Victor Gosea, Charles Poussot-Vassal, Pierre Vuillemin

In this note, we evaluate the performance, the features and the user-experience of some methods (and their implementations) designed for tensor- (or data-) based multivariate function construction and approximation. To this aim, a collection of multivariate functions extracted from contributive works coming from different communities is suggested. First, these functions with varying complexity (e.g. number and degree of the variables) and nature (e.g. rational, irrational, differentiable or not, symmetric, etc.) are used to construct tensors, each of different dimension and size on the disk. Second, grounded on this tensor, we inspect the performance of each considered method (e.g. the accuracy, the computational time, the parameters tuning impact, etc.). Finally, considering the “best” parameter tuning set, we compare each method using multiple evaluation criteria. The purpose of this note is not to rank the methods but rather to evaluate as fairly as possible the different available strategies, with the idea in mind to guide users to understand the process, the possibilities, the advantages and the limits brought by each tool. The contribution claimed is to suggest a complete benchmark collection of some available tools for tensor approximation by surrogate models (e.g. rational functions, networks, etc.). In addition, as contributors of the multivariate Loewner Framework (mLF) approach (and its side implementation in MDSPACK), attention and details of the latter are more explicitly given, in order to provide readers with a digest of this contributive work and some details with simple examples.

在本说明中,我们评估了某些方法(及其实施)的性能、特点和用户经验,这些方法(及其执行经验)是为高频(或数据)的多变量函数构建和近似而设计的。为此,我们建议收集从不同社区集成作品中提取的多变量功能。首先,这些功能复杂程度(例如变量的数量和程度)和性质(例如合理、不合理、可区别或不同、对称等)不尽相同,用来构建磁盘上不同维度和大小的反射器。第二,基于这种高频,我们检查每种考虑方法(例如准确性、计算时间、调控影响参数等)的性能。最后,考虑到“最佳”参数调制,我们用多种评价标准对每种方法进行比较。本说明的目的不是对方法进行分级,而是尽可能公平地评估现有不同关注度战略,目的是指导用户理解过程、可能性、优势和极限。在每种工具中,我们检查每个工具的侧值的精确度和分数级的精确度,目的是用各种工具的精确度来衡量一个完整的指标,用来衡量这些工具的完整收集。

Article 170

Title@2025-06-05 (4): From Developer Pairs to AI Copilots: A Comparative Study on Knowledge Transfer

Title: From Developer Pairs to AI Copilots: A Comparative Study on Knowledge Transfer

Von Entwicklerpaaren zu KI-Copiloten: Eine vergleichende Studie zum Wissenstransfer

从开发者结对到 AI Copilot:知识迁移的比较研究 2506.04785v1

Authors (7): Alisa Welter, Niklas Schneider, Tobias Dick, Kallistos Weis, Christof Tinnes, Marvin Wyrich, Sven Apel

Knowledge transfer is fundamental to human collaboration and is therefore common in software engineering. Pair programming is a prominent instance. With the rise of AI coding assistants, developers now not only work with human partners but also, as some claim, with AI pair programmers. Although studies confirm knowledge transfer during human pair programming, its effectiveness with AI coding assistants remains uncertain. To analyze knowledge transfer in both human-human and human-AI settings, we conducted an empirical study where developer pairs solved a programming task without AI support, while a separate group of individual developers completed the same task using the AI coding assistant GitHub Copilot. We extended an existing knowledge transfer framework and employed a semi-automated evaluation pipeline to assess differences in knowledge transfer episodes across both settings. We found a similar frequency of successful knowledge transfer episodes and overlapping topical categories across both settings. Two of our key findings are that developers tend to accept GitHub Copilot’s suggestions with less scrutiny than those from human pair programming partners, but also that GitHub Copilot can subtly remind developers of important code details they might otherwise overlook.

知识转让是人类合作的基础,因此在软件工程中很常见。平方编程是一个突出的例子。随着AI编码助理的崛起, 开发者现在不仅与人类伙伴合作, 也像某些人声称的那样与AI对配程序员合作。虽然研究证实了在人与人之间编程过程中的知识转让,但与AI编码助理之间的效力仍然不确定。为了分析人与人之间和人类-AI环境中的知识转让,我们进行了一项经验性研究,开发者在没有AI支持的情况下解决了编程任务,而另外一组个体开发者则利用AI编码助理GitHub Copilot完成了同样的任务。我们扩展了现有的知识转让框架,并使用了半自动评价管道来评估两个环境中知识转让过程的差异。我们发现两个环境中知识转让过程的成功频率相似,主题类别重叠。我们的两个主要结论是,开发者倾向于接受GitHub Copil的建议,但不像人类对口编程伙伴的建议那么仔细,但是GitHub Copil可以向开发者解释他们可能忽略的重要代码细节。

Article 171

Title@2025-06-05 (4): QuanUML: Towards A Modeling Language for Model-Driven Quantum Software Development

Title: QuanUML: Towards A Modeling Language for Model-Driven Quantum Software Development

QuanUML: Auf dem Weg zu einer Modellierungssprache für modellgetriebene Quantensoftware-Entwicklung

QuanUML:面向模型驱动量子软件开发的建模语言 2506.04639v1

Authors (3): Xiaoyu Guo, Shinobu Saito, Jianjun Zhao

This paper introduces QuanUML, an extension of the Unified Modeling Language (UML) tailored for quantum software systems. QuanUML integrates quantum-specific constructs, such as qubits and quantum gates, into the UML framework, enabling the modeling of both quantum and hybrid quantum-classical systems. We apply QuanUML to Efficient Long-Range Entanglement using Dynamic Circuits and Shor’s Algorithm, demonstrating its utility in designing and visualizing quantum algorithms. Our approach supports model-driven development of quantum software and offers a structured framework for quantum software design. We also highlight its advantages over existing methods and discuss future improvements.

本文件介绍量子软件系统专用的统一模型语言(UML)的扩展 QuanUML。 QuanUML将量子和量子门等量子特定构件纳入UML框架,使量子和混合量子古典系统建模。我们运用 QuanUML 来利用动态电路和Shor’s Algorithm 进行高效长距离缠绕,以展示其在量子算法的设计和可视化方面的实用性。我们的方法支持量子软件的模型驱动开发,并为量子软件的设计提供一个结构化的框架。我们还强调其相对于现有方法的优势,并讨论今后的改进。

Article 172

Title@2025-06-05 (4): Seed-Coder: Let the Code Model Curate Data for Itself

Title: Seed-Coder: Let the Code Model Curate Data for Itself

Saatgut-Coder: Lassen Sie das Code-Modell Daten für sich selbst kuratieren

Seed-Coder:让代码模型为自己整理数据 2506.03524v2

Authors (27): ByteDance Seed, Yuyu Zhang, Jing Su, Yifan Sun, Chenguang Xi, Xia Xiao, Shen Zheng, Anxiang Zhang, Kaibo Liu, Daoguang Zan, Tao Sun, Jinhua Zhu, Shulin Xin, Dong Huang, Yetao Bai, Lixin Dong, Chao Li, Jianchong Chen, Hanzhi Zhou, Yifan Huang, Guanghan Ning, Xierui Song, Jiaze Chen, Siyao Liu, Kai Shen, Liang Xiang, Yonghui Wu

Code data in large language model (LLM) pretraining is recognized as crucial not only for code-related tasks but also for enhancing the general intelligence of LLMs. Current open-source LLMs often heavily rely on human effort to produce their code pretraining data, such as employing hand-crafted filtering rules tailored to individual programming languages, or using human-annotated data to train quality filters. However, these approaches are inherently limited in scalability, prone to subjective biases, and costly to extend and maintain across diverse programming languages. To address these challenges, we introduce Seed-Coder, a series of open-source LLMs comprising base, instruct and reasoning models of 8B size, minimizing human involvement in data construction. Our code pretraining data is produced by a model-centric data pipeline, which predominantly leverages LLMs for scoring and filtering code data. The instruct model is further trained via supervised fine-tuning and preference optimization, and the reasoning model leverages Long-Chain-of-Thought (LongCoT) reinforcement learning to improve multi-step code reasoning. Seed-Coder achieves state-of-the-art results among open-source models of similar size and even surpasses some much larger models, demonstrating superior performance in code generation, code completion, code editing, code reasoning, and software engineering tasks.

在大型语言模式(LLM)培训前,人们公认,守则数据不仅对与代码有关的任务至关重要,而且对加强LLM公司的一般情报也至关重要。目前的开放源代码LLM公司常常严重依赖人的努力来编制其代码培训前数据,例如采用适合个人编程语言的手工制作过滤规则,或使用附加说明的数据来培训质量过滤器。然而,这些方法在可扩展性方面本质上是有限的,容易主观偏见,而且推广和维持不同编程语言的费用也很高。为了应对这些挑战,我们引入了种子-Coder公司(Seed-Coder),这是由8B大小的基地、指示和推理模型组成的一系列开放源代码LMSLM公司。我们的代码培训前数据是通过一个以模式为中心的数据管道制作的,主要用于评分和筛选代码数据。指导模型通过监督下的微调和优化以及推理模型(Long-Chain-fought)(Long-Cott公司)来强化学习,以改进多步骤代码推理。甚至Sead-Coder公司在开源代码的模型中实现了一些先进的状态,在更高级的编程、更高级的编程、更高级的编程、更高级的编程、更高级的编程任务中展示。

Article 173

Title@2025-06-05 (4): KPIRoot+: An Efficient Integrated Framework for Anomaly Detection and Root Cause Analysis in Large-Scale Cloud Systems

Title: KPIRoot+: An Efficient Integrated Framework for Anomaly Detection and Root Cause Analysis in Large-Scale Cloud Systems

KPIRoot+: Ein effizientes integriertes Framework für Anomalieerkennung und Ursachenanalyse in großräumigen Cloud-Systemen

KPIRoot+:大规模云系统中异常检测与根因分析的高效集成框架 2506.04569v1

Authors (11): Wenwei Gu, Renyi Zhong, Guangba Yu, Xinying Sun, Jinyang Liu, Yintong Huo, Zhuangbin Chen, Jianping Zhang, Jiazhen Gu, Yongqiang Yang, Michael R. Lyu

To ensure the reliability of cloud systems, their performance is monitored using KPIs (key performance indicators). When issues arise, root cause localization identifies KPIs responsible for service degradation, aiding in quick diagnosis and resolution. Traditional methods rely on similarity calculations, which can be ineffective in complex, interdependent cloud environments. While deep learning-based approaches model these dependencies better, they often face challenges such as high computational demands and lack of interpretability. To address these issues, KPIRoot is proposed as an efficient method combining similarity and causality analysis. It uses symbolic aggregate approximation for compact KPI representation, improving analysis efficiency. However, deployment in Cloud H revealed two drawbacks: 1) threshold-based anomaly detection misses some performance anomalies, and 2) SAX representation fails to capture intricate variation trends. KPIRoot+ addresses these limitations, outperforming eight state-of-the-art baselines by 2.9% to 35.7%, while reducing time cost by 34.7%. We also share our experience deploying KPIRoot in a large-scale cloud provider’s production environment.

为确保云系统的可靠性,使用基本业绩指标(关键业绩指标)对云系统的业绩进行监测。一旦出现问题,根本原因就确定KPI对服务退化负有责任,协助快速诊断和解决。传统方法依靠相似性计算,在复杂、相互依存的云环境中,这种计算可能无效。虽然深层次的学习基础方法可以更好地模拟这些依赖性,但它们往往面临高计算要求和缺乏解释性等挑战。为解决这些问题,KPIRooot被提议为一种将相似性和因果关系分析相结合的有效方法。它使用紧凑的KPI代表制的象征性总近似,提高了分析效率。然而,在云 H 中的部署揭示了两个缺陷:(1) 基于临界异常的检测没有出现某些性能异常;(2) SAX 代表未能捕捉到复杂的变化趋势。 KPIRoot+ 解决了这些局限性,将8个最先进的基线比2.9%到35.7%的基线要高,同时将时间成本降低34.7%。我们还分享了我们在大型云源生产环境中部署KPIRooot的经验。
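The abstract above says KPIRoot uses symbolic aggregate approximation (SAX) for compact KPI representation. As a rough illustration of what SAX does in general, the sketch below z-normalizes a series, averages it over equal segments (piecewise aggregate approximation), and maps each segment mean to a letter using equiprobable Gaussian breakpoints. This is the textbook procedure, not KPIRoot's implementation; the function names, the default alphabet, and the segment counts are assumptions made for the example.

```python
from statistics import NormalDist, mean, pstdev


def znormalize(series):
    """Rescale to zero mean and unit variance (constant series become all zeros)."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        return [0.0] * len(series)
    return [(x - mu) / sigma for x in series]


def paa(series, segments):
    """Piecewise aggregate approximation: the mean of each of `segments` chunks."""
    n = len(series)
    if not 0 < segments <= n:
        raise ValueError("segments must be between 1 and len(series)")
    return [mean(series[i * n // segments:(i + 1) * n // segments])
            for i in range(segments)]


def breakpoints(alphabet_size):
    """Cut points that split the standard normal into equiprobable regions."""
    nd = NormalDist()
    return [nd.inv_cdf(i / alphabet_size) for i in range(1, alphabet_size)]


def sax(series, segments, alphabet="abcd"):
    """Return the SAX word for `series` as a string over `alphabet`."""
    cuts = breakpoints(len(alphabet))
    word = []
    for value in paa(znormalize(series), segments):
        index = sum(1 for cut in cuts if value > cut)  # region the mean falls in
        word.append(alphabet[index])
    return "".join(word)
```

A KPI that steps from a low level to a high level, such as `[1, 1, 1, 1, 10, 10, 10, 10]` with two segments and the alphabet `"ab"`, becomes the word `"ab"`. Short words like this are cheap to compare, which is the efficiency argument the abstract makes; the same coarseness is what the abstract identifies as failing to capture intricate variation trends.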