cs.SE @ 2025-07-11: 123
-
00 07-10 (4) QCP: A Practical Separation Logic-based C Program Verification Tool QCP: Ein praktisches, auf Separationslogik basierendes Verifikationswerkzeug für C-Programme QCP:一种实用的基于分离逻辑的C程序验证工具 2505.12878v2 -
01 07-10 Open Source, Hidden Costs: A Systematic Literature Review on OSS License Management Offene Quelle, versteckte Kosten: Ein systematischer Literaturbericht über OSS-Lizenzverwaltung 开放源码,隐藏成本:开放源码软件许可证管理的系统文献审查 2507.05270v2 -
02 07-10 Open-source automatic pipeline for efficient conversion of large-scale point clouds to IFC format Open-Source-Automatische Pipeline für die effiziente Umwandlung von großflächigen Punktwolken in IFC-Format 将大型点云有效转换成IFC格式的开放源自动管道 2503.11498v3 -
03 07-10 From Domain Documents to Requirements: Retrieval-Augmented Generation in the Space Industry Von Domänendokumenten zu Anforderungen: Retrieval-Augmented Generation in der Raumfahrtindustrie 从域文档到要求:空间工业中回收利用-增强的一代人 2507.07689v1 -
04 07-10 Prompt Engineering for Requirements Engineering: A Literature Review and Roadmap Prompt Engineering für Requirements Engineering: Literaturübersicht und Roadmap 面向需求工程的提示工程:文献综述和路线图 2507.07682v1 -
05 07-10 ProvideQ: A Quantum Optimization Toolbox ProvideQ: Eine Quantum-Optimierungs-Toolbox 提供 Q: 量图优化工具箱 2507.07649v1 -
06 07-10 Quantum Executor: A Unified Interface for Quantum Computing Quantum Executor: Ein einheitliches Interface für Quantum Computing 量图执行器: 量数计算的统一界面 2507.07597v1 -
07 07-10 From Requirements to Code: Understanding Developer Practices in LLM-Assisted Software Engineering Von Anforderungen zum Code: Entwickler-Praxis in LLM-Assisted Software Engineering verstehen 从要求到准则:了解LLM辅助软件工程开发者的做法 2507.07548v1 -
08 07-10 Towards an Engineering Workflow Management System for Asset Administration Shells using BPMN Auf dem Weg zu einem Engineering Workflow Management System für Asset Administration Shells mit BPMN 努力建立一个利用生物和水管理网的资产管理壳壳工程工作流程管理系统 2507.07468v1 -
09 07-10 Toolchain for Faster Iterations in Quantum Software Development Toolchain für schnellere Iterationen in der Quantensoftware-Entwicklung 量量软件开发中快速迭接工具链 2507.07448v1 -
10 07-10 PromiseTune: Unveiling Causally Promising and Explainable Configuration Tuning PromiseTune: Enthüllen kausal vielversprechende und erklärbare Konfigurationstuning 前景图:不懈的因果保证和可解释的配置图纸 2507.05995v2 -
11 07-10 DITING: A Static Analyzer for Identifying Bad Partitioning Issues in TEE Applications DITING: Ein statischer Analyzer zur Identifizierung von Problemen mit schlechten Partitionierungen in TEE-Anwendungen DITING: 识别TEE应用中的不良分割问题的静态分析器 2502.15281v2 -
12 07-10 Automatic Generation of Explainability Requirements and Software Explanations From User Reviews Automatische Generierung von Erklärbarkeitsanforderungen und Software-Erläuterungen aus Benutzer-Bewertungen 用户审查自动产生解释要求和软件解释 2507.07344v1 -
13 07-09 (3) Towards Trustworthy Sentiment Analysis in Software Engineering: Dataset Characteristics and Tool Selection Auf dem Weg zu einer vertrauensvollen Sentiment-Analyse in der Software-Engineering: Dataset-Eigenschaften und Werkzeugauswahl 在软件工程中进行可信赖的感知分析:数据集特点和工具选择 2507.02137v2 -
14 07-09 A German Gold-Standard Dataset for Sentiment Analysis in Software Engineering Ein deutscher Gold-Standard-Datensatz für die Sentiment-Analyse in der Software-Engineering 用于软件工程敏感分析的德国黄金标准数据集 2507.07325v1 -
15 07-09 EditLord: Learning Code Transformation Rules for Code Editing EditLord: Regeln zur Code-Transformation für die Code-Editing 编辑主: 学习代码编辑的代码转换规则 2504.15284v4 -
16 07-09 Measuring how changes in code readability attributes affect code quality evaluation by Large Language Models Messung, wie sich Änderungen in Codelesbarkeitsattributen auf die Bewertung der Codequalität durch Large Language Models auswirken 衡量守则可读性属性的变化如何影响大语言模式的守则质量评价 2507.05289v2 -
17 07-09 HLSTester: Efficient Testing of Behavioral Discrepancies with LLMs for High-Level Synthesis HLSTester: Effiziente Prüfung von Verhaltensdiskrepanzen mit LLMs für High-Level-Synthese HLS Tester: 与高级别合成项目LLM有效测试行为差异 2504.14641v2 -
18 07-09 How to Elicit Explainability Requirements? A Comparison of Interviews, Focus Groups, and Surveys Wie zu Elicit Erklärbarkeit Anforderungen? Ein Vergleich von Interviews, Fokusgruppen und Umfragen 如何制定明确的解释要求?访谈、焦点小组和调查的比较 2505.23684v3 -
19 07-09 5C Prompt Contracts: A Minimalist, Creative-Friendly, Token-Efficient Design Framework for Individual and SME LLM Usage 5C Prompt Contracts: Minimalistisch, kreativ-freundlich, Token-Efficient Design Framework für individuelle und KMU LLM-Nutzung 5C 即时合同:个人和中小企业LLM使用最起码的、有利于创意的、高效益的个人和中小企业LLM设计框架 2507.07045v1 -
20 07-09 Exploring Fairness Interventions in Open Source Projects Erforschen von Fairness-Interventionen in Open Source-Projekten 探索开放源码项目中的公平干预 2507.07026v1 -
21 07-09 Robust Containerization of the High Angular Resolution Functional Imaging (HARFI) Pipeline Robuste Containerisierung der High Angular Resolution Functional Imaging (HARFI) Pipeline 高角分辨率功能成像(HARFI)管道的强有力集装箱化 2507.07010v1 -
22 07-09 Enhancing Quantum Software Development Process with Experiment Tracking Verbesserung des Quantum-Software-Entwicklungsprozesses mit Experiment-Tracking 利用实验跟踪加强量子软件开发进程 2507.06990v1 -
23 07-09 Are They All Good? Evaluating the Quality of CoTs in LLM-based Code Generation Sind sie alle gut? Bewertung der Qualität von CoTs bei der LLM-basierten Code-Generierung 它们都好吗?评估基于LLM的代码生成中CoT的质量 2507.06980v1 -
24 07-09 Robust and Safe Traffic Sign Recognition using N-version with Weighted Voting Robuste und sichere Verkehrszeichenerkennung mit N-Version mit gewichteter Abstimmung 以加权投票方式使用N版本进行强力和安全交通信号识别 2507.06907v1 -
25 07-09 Formalization of the AADL Run-Time Services with Time Formalisierung der AADL-Laufzeitdienste mit Zeit 带时间的AADL运行时服务的形式化 2507.06881v1 -
26 07-09 Leveraging LLMs for Semantic Conflict Detection via Unit Test Generation Nutzung von LLMs zur semantischen Konflikterkennung mittels Unit-Test-Generierung 通过单元测试生成利用LLM进行语义冲突检测 2507.06762v1 -
27 07-09 One Size Does Not Fit All: Investigating Efficacy of Perplexity in Detecting LLM-Generated Code Eine Größe passt nicht zu allen: Untersuchung der Wirksamkeit von Verwirrung bei der Erkennung von LLM-generierten Code 一大小不完全适用:调查在检测 LLM 生成的守则中发现重复性的效果 2412.16525v2 -
28 07-09 Issue Tracking Ecosystems: Context and Best Practices Thema Tracking Ökosysteme: Kontext und Best Practices 生态系统跟踪问题:背景和最佳做法 2507.06704v1 -
29 07-09 Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks Bewertung kleiner Sprachmodelle für die Codegenerierung: Eine empirische Studie mit Benchmarks 评估用于代码生成的小型语言模式:具有基准的实证研究 2507.03160v3 -
30 07-09 Finding Compiler Bugs through Cross-Language Code Generator and Differential Testing Finden von Compiler-Fehlern durch Cross-Language-Code-Generator und Differential-Tests 通过跨语言代码生成器和差异测试查找编译器错误 2507.06584v1 -
31 07-09 TELSAFE: Security Gap Quantitative Risk Assessment Framework TELSAFE: Sicherheitslücke Quantitative Risikobewertungsrahmen TELSAFE: 安全差距定量风险评估框架 2507.06497v1 -
32 07-09 Evaluating Efficiency and Novelty of LLM-Generated Code for Graph Analysis Bewertung der Effizienz und Neuheit des LLM-generierten Codes für die Graphenanalyse 评价LLLM创制的图表分析守则的效率和新颖性 2507.06463v1 -
33 07-08 (2) gigiProfiler: Diagnosing Performance Issues by Uncovering Application Resource Bottlenecks gigiProfiler: Diagnose von Performance-Problemen durch Aufdecken von Engpässen bei Anwendungsressourcen gigiProfiler: 通过揭示应用程序资源瓶颈诊断性能问题 2507.06452v1 -
34 07-08 CodeMirage: Hallucinations in Code Generated by Large Language Models CodeMirage: Halluzinationen in Code Generiert durch große Sprachmodelle 代码Mirage: 大语言模型生成的代码中的幻觉 2408.08333v2 -
35 07-08 Representing Prompting Patterns with PDL: Compliance Agent Case Study Präsentieren von Prompting Patterns mit PDL: Compliance Agent Case Study 代表PDL的提示模式:合规代理案例研究 2507.06396v1 -
36 07-08 A proposal and assessment of an improved heuristic for the Eager Test smell detection Vorschlag und Bewertung einer verbesserten Heuristik für die Erkennung des Eager-Test-Smells 关于Eager Test测试异味检测的改进启发式方法的建议和评估 2507.06354v1 -
37 07-08 An Architecture for Privacy-Preserving Telemetry Scheme Eine Architektur für den Schutz der Privatsphäre Telemetrie-Schema 隐私保护遥测计划架构 2507.06350v1 -
38 07-08 Quality attributes of test cases and test suites – importance & challenges from practitioners’ perspectives Qualitätsmerkmale von Testfällen und Testsuiten – Bedeutung und Herausforderungen aus Sicht der Praktiker 测试案例和测试套套件的质量属性 – – 从从业人员的角度看重要性和挑战 2507.06343v1 -
39 07-08 AR2: Attention-Guided Repair for the Robustness of CNNs Against Common Corruptions AR2: Aufmerksamkeitsgeführte Reparatur für die Robustheit von CNNs gegen häufige Bildstörungen AR2:面向CNN抵御常见图像损坏鲁棒性的注意力引导修复 2507.06332v1 -
40 07-08 The Impact of Generative AI on Collaborative Open-Source Software Development: Evidence from GitHub Copilot Die Auswirkungen generativer KI auf die kollaborative Open-Source-Softwareentwicklung: Belege von GitHub Copilot 生成式AI对协作式开源软件开发的影响:来自GitHub Copilot的证据 2410.02091v2 -
41 07-08 cuVSLAM: CUDA accelerated visual odometry and mapping cuVSLAM: CUDA beschleunigte visuelle Odometrie und Mapping CUDA 加速视觉测量和绘图 2506.04359v3 -
42 07-08 RPHunter: Unveiling Rug Pull Schemes in Crypto Token via Code-and-Transaction Fusion Analysis RPHunter: Enthüllen von Rug Pull Schemes in Crypto Token über Code-and-Transaction Fusion Analysis RPHunter:通过代码与交易融合分析揭示加密代币中的Rug Pull骗局 2506.18398v3 -
43 07-08 ContractTrace: Retracing Smart Contract Versions for Security Analyses ContractTrace: Smart Contract-Versionen für Sicherheitsanalysen zurückverfolgen ContractTrace:为安全分析追溯智能合约版本 2412.20866v2 -
44 07-08 Model Cards Revisited: Bridging the Gap Between Theory and Practice for Ethical AI Requirements Revisited Model Cards: Die Kluft zwischen Theorie und Praxis für ethische KI-Anforderungen überbrücken 重新审视示范卡:弥合道德道德要求理论与实践之间的差距 2507.06014v1 -
45 07-08 Multi-Agent Debate Strategies to Enhance Requirements Engineering with Large Language Models Multi-Agent Debatte Strategien zur Verbesserung der Requirements Engineering mit großen Sprachmodellen 利用大语言模式加强要求工程的多机构辩论战略 2507.05981v1 -
46 07-08 TigAug: Data Augmentation for Testing Traffic Light Detection in Autonomous Driving Systems TigAug: Datenvergrößerung zur Prüfung von Verkehrslichterfassung in autonomen Fahrsystemen TigAug:自动驾驶系统交通光探测测试数据增强 2507.05932v1 -
47 07-08 Learning to Focus: Context Extraction for Efficient Code Vulnerability Detection with Language Models Fokussieren lernen: Kontextextraktion für effiziente Code-Anfälligkeitserkennung mit Sprachmodellen 学习聚焦:以语言模式有效识别《守则》脆弱性 2505.17460v2 -
48 07-08 The Impact of Prompt Programming on Function-Level Code Generation Die Auswirkungen der Prompt-Programmierung auf die Code-Generierung auf Funktionsebene 迅速编制方案对职能层面代码生成的影响 2412.20545v2 -
49 07-08 Extending Behavioral Software Engineering: Decision-Making and Collaboration in Human-AI Teams for Responsible Software Engineering Erweiterung der Verhaltenssoftware-Engineering: Entscheidungsfindung und Zusammenarbeit in Human-AI-Teams für verantwortungsvolle Software-Engineering 扩展行为行为软件工程:负责任软件工程人类-AI小组的决策与合作 2504.09496v2 -
50 07-08 ETrace:Event-Driven Vulnerability Detection in Smart Contracts via LLM-Based Trace Analysis ETrace: Ereignisgesteuerte Schwachstellenerkennung in Smart Contracts über LLM-basierte Trace-Analyse ETrace:通过基于LLM的追踪分析在智能合约中进行事件驱动的漏洞检测 2506.15790v2 -
51 07-08 scikit-package – software packaging standards and roadmap for sharing reproducible scientific software scikit-Paket – Software-Verpackungsstandards und Roadmap für den Austausch reproduzierbarer wissenschaftlicher Software 共享可复制科学软件的软件包装标准和路线图 2507.03328v2 -
52 07-08 Detecting and Mitigating Reward Hacking in Reinforcement Learning Systems: A Comprehensive Empirical Study Erkennung und Eindämmung von Belohnungshacking in Verstärkungs-Lernsystemen: Eine umfassende empirische Studie 检测和减轻强化学习系统中的回扣:综合经验研究 2507.05619v1 -
53 07-08 Efficient Detection of Intermittent Job Failures Using Few-Shot Learning Effiziente Erkennung intermittierender Job-Fehlschläge durch Few-Shot-Lernen 利用少样本学习有效检测间歇性作业失败 2507.04173v2 -
54 07-08 Prompt Migration: Stabilizing GenAI Applications with Evolving Large Language Models Prompt Migration: Stabilisierung von GenAI-Anwendungen mit sich entwickelnden großen Sprachmodellen 迅速移徙:稳定GenAI应用程序与不断演变的大型语言模式 2507.05573v1 -
55 07-08 Search-based Selection of Metamorphic Relations for Optimized Robustness Testing of Large Language Models Suchebasierte Auswahl von Metamorphic Relations für optimierte Robustheitsprüfung von großen Sprachmodellen 以搜索为基础选择大语言模式优化强力测试的变形关系 2507.05565v1 -
56 07-07 (1) Tool for Supporting Debugging and Understanding of Normative Requirements Using LLMs Tool zur Unterstützung von Debugging und Verständnis für normative Anforderungen mit LLMs 支持调试和了解使用LLMs的规范要求的工具 2507.05504v1 -
57 07-07 Towards Exception Safety Code Generation with Intermediate Representation Agents Framework Auf dem Weg zur Generierung von Ausnahme-Sicherheitscodes mit dem Rahmen für Mittlere Vertretungen 建立具有中间代表代理机构框架的例外安全法规生成框架 2410.06949v3 -
58 07-07 An Investigation into Maintenance Support for Neural Networks Eine Untersuchung der Instandhaltungsunterstützung für neurale Netze 对神经网络维护支助的调查 2507.05245v1 -
59 07-07 React-tRace: A Semantics for Understanding React Hooks React-tRace: Eine Semantik zum Verständnis von React Hooks React-trace:理解反应钩的语义 2507.05234v1 -
60 07-07 Exploring Empathy in Software Engineering: Insights from a Grey Literature Analysis of Practitioners’ Perspectives Einfühlungsvermögen in der Software-Engineering: Einblicke aus einer grauen Literaturanalyse der Perspektiven von Praktizierenden 探索软件工程的漠视:对从业者观点的灰色文学分析的展望 2507.05325v1 -
61 07-07 In-Context Learning as an Effective Estimator of Functional Correctness of LLM-Generated Code In-Context Learning als effektiver Estimator der funktionalen Korrektheit des LLM-generierten Codes 内文学习作为LLM-创用守则功能正确性的有效模拟器 2507.05200v1 -
62 07-07 Deep Learning Framework Testing via Model Mutation: How Far Are We? Deep Learning Framework Testing über Modellmutation: Wie weit sind wir? 通过模型变异进行深层次学习框架测试:我们有多远? 2506.17638v2 -
63 07-07 Understanding Everything as Code: A Taxonomy and Conceptual Model Alles als Code verstehen: Ein Taxonomie- und Konzeptmodell 将一切理解为守则:分类和概念模式 2507.05100v1 -
64 07-07 OASBuilder: Generating OpenAPI Specifications from Online API Documentation with Large Language Models OASBuilder: OpenAPI-Spezifikationen aus Online-API-Dokumentation mit großen Sprachmodellen generieren OASBuilder:从带有大语言模式的在线API文件生成开放API规格 2507.05316v1 -
65 07-07 AI for the Routine, Humans for the Complex: Accuracy-Driven Data Labelling with Mixed Integer Linear Programming KI für die Routine, Menschen für den Komplex: Genauigkeitsgetriebene Datenkennzeichnung mit gemischter Integer-Linear-Programmierung 常规、人类为综合体:精确-驱动数据标签与混合整数线性线性编程 2507.04990v1 -
66 07-07 ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation ArtefakteBench: Überbrückung der visuell-interaktiven Lücke in der LLM-Codegenerierung 人工合成:弥合LLM代码生成评估中的视觉互动差距 2507.04952v1 -
67 07-07 Towards a Unifying Reference Model for Digital Twins of Cyber-Physical Systems Auf dem Weg zu einem einheitlichen Referenzmodell für digitale Zwillinge von Cyber-Physischen Systemen 建立网络-物理系统数字双对数字参照统一模式 2507.04871v1 -
68 07-07 Supporting Software Formal Verification with Large Language Models: An Experimental Study Unterstützung von Software Formale Verifikation mit großen Sprachmodellen: Eine experimentelle Studie 支持有大语言模型的软件正式核查:实验研究 2507.04857v1 -
69 07-07 A Note on Runtime Verification of Concurrent Systems Eine Anmerkung zur Laufzeitverifizierung von Concurrent Systemen 关于同步系统运行时核查的说明 2507.04830v1 -
70 07-07 ASSURE: Metamorphic Testing for AI-powered Browser Extensions ASSURE: Metamorphische Tests für KI-betriebene Browser-Erweiterungen ASSURE: AI驱动的浏览器扩展的蜕变测试 2507.05307v1 -
71 07-07 Augmenting software engineering with AI and developing it further towards AI-assisted model-driven software engineering Augmenting-Software-Engineering mit KI und Weiterentwicklung in Richtung KI-unterstützter modellgetriebener Software-Engineering 增强AI软件工程并进一步发展AI软件工程,使之更接近AI协助的模型驱动软件工程 2409.18048v3 -
72 07-07 Narrowing the Gap: Supervised Fine-Tuning of Open-Source LLMs as a Viable Alternative to Proprietary Models for Pedagogical Tools Engere Kluft: Überwachtes Feintuning von Open Source LLMs als lebensfähige Alternative zu proprietären Modellen für pädagogische Werkzeuge 缩小差距:监督开放源码LLMs的微调,将其作为替代专有教学工具模型的可行替代物 2507.05305v1 -
73 07-06 (7) Testing, Evaluation, Verification and Validation (TEVV) of Digital Twins: A Comprehensive Framework Testen, Auswerten, Verifizieren und Validieren (TEVV) von digitalen Zwillingen: Ein umfassender Rahmen 数字双双的测试、评价、核查和验证:全面框架 2507.04555v1 -
74 07-06 Exploring Micro Frontends: A Case Study Application in E-Commerce Erforschung von Micro Frontends: Eine Anwendungsfallstudie im E-Commerce 探索微观前沿:电子商务案例研究应用 2506.21297v2 -
75 07-06 SPIRA: Building an Intelligent System for Respiratory Insufficiency Detection SPIRA: Aufbau eines intelligenten Systems zur Erkennung von respiratorischer Insuffizienz SPIRA: 建立呼吸系统不足检测智能系统 2507.04548v1 -
76 07-06 Making a Pipeline Production-Ready: Challenges and Lessons Learned in the Healthcare Domain Herstellung einer Pipeline-Produktion: Herausforderungen und Lektionen im Bereich Healthcare 《管道生产-准备:保健领域的挑战和经验教训》 2506.06946v3 -
77 07-06 Learning Software Bug Reports: A Systematic Literature Review Lernsoftware Bug Reports: Ein systematischer Literaturbericht 学习软件错误报告:系统文献审查 2507.04422v1 -
78 07-06 Exploring React Library Related Questions on Stack Overflow: Answered vs. Unanswered Exploring React Library Verwandte Fragen zum Stack Overflow: Beantwortet vs. Unbeantwortet 探讨关于堆叠溢溢流的React 图书馆相关问题:答复与未答复 2507.04390v1 -
79 07-06 DevMuT: Testing Deep Learning Framework via Developer Expertise-Based Mutation DevMuT: Testen von Deep Learning Framework über Entwickler-Expertise-basierte Mutation DevMuT:通过基于开发者专业知识的变异测试深度学习框架 2507.04360v1 -
80 07-06 Improving Deep Learning Framework Testing with Model-Level Metamorphic Testing Verbesserung von Deep-Learning-Framework-Tests mit Metamorphischen Tests auf Modellebene 利用示范性变形测试改进深学习框架测试 2507.04354v1 -
81 07-06 Alleviating Attack Data Scarcity: SCANIA’s Experience Towards Enhancing In-Vehicle Cyber Security Measures Benachteiligung von Angriffsdaten: SCANIAs Erfahrung zur Verbesserung von Cybersicherheitsmaßnahmen im Fahrzeug 减轻攻击数据稀缺性:SCANIA在加强车辆内部网络安全措施方面的经验 2507.02607v2 -
82 07-06 Automatic Multi-level Feature Tree Construction for Domain-Specific Reusable Artifacts Management Automatische Mehrebenen-Feature-Bauweise für Domain-Spezifische wiederverwendbare Artefakte Management 用于域域特定可使用性能管理的多级自动多级特性树建设 2506.03946v2 -
83 07-05 (6) From Legal Text to Tech Specs: Generative AI’s Interpretation of Consent in Privacy Law Vom Rechtstext zu technischen Spezifikationen: Generative KI-Interpretation der Zustimmung im Datenschutzrecht 从法律文本到技术规范:生成式AI对隐私法中同意的解释 2507.04185v1 -
84 07-05 The Transformative Influence of LLMs on Software Development & Developer Productivity Der transformative Einfluss von LLMs auf Softwareentwicklung und Entwicklerproduktivität LLM对软件开发和开发者生产力的变革性影响 2311.16429v2 -
85 07-05 zkSDK: Streamlining zero-knowledge proof development through automated trace-driven ZK-backend selection zkSDK: Straffung der Zero-Knowledge Proof-Entwicklung durch automatisierte spurgesteuerte ZK-Backend-Auswahl zkSDK:通过自动追踪驱动的 ZK 后端选择简化零知识验证开发 2507.05294v1 -
86 07-05 The Presence and the State-of-Practice of Software Architects in the Brazilian Industry – A Survey Die Gegenwart und der Stand der Praxis von Software-Architekten in der brasilianischen Industrie – Eine Umfrage 巴西工业软件建筑师的存在及其做法 – – 调查 2403.00955v2 -
87 07-05 Rethinking and Exploring String-Based Malware Family Classification in the Era of LLMs and RAG Nachdenken und erkunden String-basierte Malware-Familienklassifikation im Zeitalter von LLMs und RAG 在LLMM和RAG的时代重新思考和探索基于字符串的恶意恶意家庭分类 2507.04055v1 -
88 07-05 A Multimodal Approach Combining Biometrics and Self-Report Instruments for Monitoring Stress in Programming: Methodological Insights Ein multimodaler Ansatz zur Kombination von Biometrie und Selbstberichtsinstrumenten zur Überwachung von Stress in der Programmierung: Methodologische Erkenntnisse 将生物计量与在方案编制中监测压力的自报工具相结合的多式联运办法:方法见解 2507.02118v2 -
89 07-05 Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation Groß angelegte, unabhängige und umfassende Untersuchung der Leistung von LLMs für die Testfallgenerierung 大规模、独立和综合研究LLMM 用于产生试验案例的能力 2407.00225v3 -
90 07-05 We Urgently Need Privilege Management in MCP: A Measurement of API Usage in MCP Ecosystems Wir brauchen dringend Privilege Management in MCP: Eine Messung der API-Nutzung in MCP Ecosystems 我们迫切需要在MCP紧急需要特权管理:衡量MCP生态系统使用API的情况 2507.06250v1 -
91 07-04 (5) RVISmith: Fuzzing Compilers for RVV Intrinsics RVISmith: Fuzzing von Compilern für RVV-Intrinsics RVISmith: 面向RVV Intrinsics的编译器模糊测试 2507.03773v1 -
92 07-04 Specification-Guided Repair of Arithmetic Errors in Dafny Programs using LLMs Spezifikationsgeführte Reparatur von Arithmetischen Fehlern in Dafny-Programmen mit LLMs 利用LLMM项目对达夫尼方案中的亚氏误差进行规格规范-指导修补 2507.03659v1 -
93 07-04 Is It Time To Treat Prompts As Code? A Multi-Use Case Study For Prompt Optimization Using DSPy Ist es Zeit, Prompts als Code zu behandeln? Eine Multi-Use-Fallstudie für Prompt-Optimierung mit DSPy 是否是时候将提示作为代码处理? 使用 DSPy 快速优化的多用途案例研究 2507.03620v1 -
94 07-04 Exploring Privacy and Security as Drivers for Environmental Sustainability in Cloud-Based Office Solutions Erforschen von Datenschutz und Sicherheit als Treiber für ökologische Nachhaltigkeit in Cloud-basierten Office-Lösungen 探索隐私和安全作为云载办公室解决方案中环境可持续性的驱动力 2506.23866v3 -
95 07-04 ACE: Automated Technical Debt Remediation with Validated Large Language Model Refactorings ACE: Automatisiertes technisches Schuldentilgungsverfahren mit validierten großen Sprachmodell-Refactorings ACE: 自动技术债务补救,加上经验证的大型语言模型要素 2507.03536v1 -
96 07-04 The Role of Humour in Software Engineering – A Literature Review and Preliminary Taxonomy Die Rolle des Humors in der Software-Engineering - eine Literaturübersicht und vorläufige Taxonomie 幽默在软件工程中的作用 – – 文学评论和初步分类学 2507.03527v1 -
97 07-04 Enhancing Uncertainty Quantification for Runtime Safety Assurance Using Causal Risk Analysis and Operational Design Domain Verbesserung der Unsicherheitsquantifizierung für die Runtime Safety Assurance unter Verwendung von Schadensrisikoanalyse und Operational Design Domain 利用因果关系风险分析和操作设计域加强运行时安全保障的不确定性量化 2507.03515v1 -
98 07-04 CoreCodeBench: A Configurable Multi-Scenario Repository-Level Benchmark CoreCodeBench: Ein konfigurierbarer Multi-Szenario-Repository-Level-Benchmark 核心守则:可配置的多设想仓库一级基准 2507.05281v1 -
99 07-04 Prompt Engineering Guidelines for Using Large Language Models in Requirements Engineering Prompt Engineering Richtlinien für die Verwendung großer Sprachmodelle in Requirements Engineering 在要求工程中使用大语言模型的快速工程指南 2507.03405v1 -
100 07-04 ReservoirChat: Interactive Documentation Enhanced with LLM and Knowledge Graph for ReservoirPy ReservoirChat: Interaktive Dokumentation mit LLM und Wissensgraph für ReservoirPy ReservoirChat:利用LLM和知识图谱增强的ReservoirPy交互式文档 2507.05279v1 -
101 07-04 Securing Mixed Rust with Hardware Capabilities Sichern von gemischtem Rust mit Hardware-Fähigkeiten 利用硬件能力保护混合Rust代码 2507.03344v1 -
102 07-04 Analyzing C/C++ Library Migrations at the Package-level: Prevalence, Domains, Targets and Rationals across Seven Package Management Tools Analyse von C/C++-Bibliotheksmigrationen auf Paketebene: Prävalenz, Domains, Targets und Rationals über sieben Paketverwaltungstools hinweg C/C++库在包级别的迁移分析:七种包管理工具中的普遍程度、领域、目标和缘由 2507.03263v1 -
103 07-03 (4) The Impact of LLM-Assistants on Software Developer Productivity: A Systematic Literature Review Die Auswirkungen von LLM-Assistenten auf die Produktivität von Softwareentwicklern: Ein systematischer Literaturbericht LLM助理对软件开发者生产力的影响:系统文学评论 2507.03156v1 -
104 07-03 AI and Agile Software Development: From Frustration to Success – XP2025 Workshop Summary KI und Agile Software-Entwicklung: Von der Frustration zum Erfolg – XP2025 Workshop Zusammenfassung AI和敏捷软件开发:从挫折到成功 – XP2025研讨会摘要 2506.20159v2 -
105 07-03 Requirements Elicitation Follow-Up Question Generation Voraussetzungen Elicitation Follow-Up Question Generation 问询后查询 2507.02858v1 -
106 07-03 Legal Requirements Translation from Law Rechtliche Voraussetzungen Übersetzung aus dem Recht 法律要求译自法律 2507.02846v1 -
107 07-03 Agentic Business Process Management: Practitioner Perspectives on Agent Governance in Business Processes Agentic Business Process Management: Praxisperspektiven zur Agenten-Governance in Unternehmensprozessen 代理业务流程管理:从业者对业务流程代理治理的看法 2504.03693v2 -
108 07-03 Gradient-Based Model Fingerprinting for LLM Similarity Detection and Family Classification Gradient-Based Model Fingerprinting für LLM Ähnlichkeitserkennung und Familienklassifizierung LLM相似性探测和家庭分类的渐进式样指纹 2506.01631v2 -
109 07-03 Fuzzing-based Mutation Testing of C/C++ Software in Cyber-Physical Systems Fuzzing-basierte Mutationsprüfung von C/C++-Software in Cyber-Physical Systems 网络物理系统中C/C++软件基于模糊测试的变异测试 2503.24100v3 -
110 07-03 FuzzFeed: An Automatic Approach to Weakest Precondition Generation using LLMs and Fuzzing FuzzFeed: Ein automatischer Ansatz zur Generierung schwächster Vorbedingungen mit LLMs und Fuzzing FuzzFeed:使用LLM和模糊测试生成最弱前置条件的自动方法 2507.05272v1 -
111 07-03 Sustainability Flags for the Identification of Sustainability Posts in Q&A Platforms Nachhaltigkeitsflaggen für die Identifizierung von Nachhaltigkeitsbeiträgen in Q&A-Plattformen 在问答(Q&A)平台中识别可持续性帖子的可持续性标记 2507.02695v1 -
112 07-03 RLHGNN: Reinforcement Learning-driven Heterogeneous Graph Neural Network for Next Activity Prediction in Business Processes RLHGNN: Verstärkung Lernorientiertes Heterogenes Graph Neuronales Netzwerk für die nächste Aktivitätsvorhersage in Geschäftsprozessen RLHGNN: 业务流程下一个活动预测的强化学习驱动的异质图形神经网络 2507.02690v1 -
113 07-03 Do Research Software Engineers and Software Engineering Researchers Speak the Same Language? Sprechen Forschungssoftware-Ingenieure und Software-Engineering-Forscher die gleiche Sprache? 研究软件工程师和软件工程研究者会说同一种语言吗? 2507.02665v1 -
114 07-03 Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure Erklärbare Compliance-Erkennung mit Multi-Hop-Natural Language-Schlussfolgerung zur Assurance-Fallstruktur 以多种自然语言对保证案例结构的多重语言推断进行可解释的合规检测 2506.08713v2 -
115 07-03 Human-Machine Collaboration and Ethical Considerations in Adaptive Cyber-Physical Systems Mensch-Maschine-Kollaboration und ethische Überlegungen in adaptiven Cyber-Physischen Systemen 适应性网络-物理系统中的人机协作和伦理考虑 2507.02578v1 -
116 07-03 LLMREI: Automating Requirements Elicitation Interviews with LLMs LLMREI: Automatisierung von Interviews zur Anforderungserhebung mit LLMs LLMREI: 利用LLM自动化需求获取访谈 2507.02564v1 -
117 07-03 Meta-Fair: AI-Assisted Fairness Testing of Large Language Models Meta-Fair: AI-Assisted Fairness Testing von großen Sprachmodellen Meta-Fair:AI协助大语言模型公平性测试 2507.02533v1 -
118 07-03 Code Digital Twin: Empowering LLMs with Tacit Knowledge for Complex Software Maintenance Code Digital Twin: LLMs mit Tacit-Kenntnissen für komplexe Software-Wartung stärken 数字代码双对:赋予LLMs以用于复杂软件维护的隐秘知识 2503.07967v2 -
119 07-03 VeFIA: An Efficient Inference Auditing Framework for Vertical Federated Collaborative Software VeFIA: Ein effizientes Inferenz-Audit-Framework für vertical Federated Collaborative Software VEFIA: 垂直联邦合作软件有效推断审计框架 2507.02376v1 -
120 07-03 Exploring the Integration of Large Language Models in Industrial Test Maintenance Processes Erforschung der Integration von großen Sprachmodellen in industrielle Testwartungsprozesse 探索将大语言模型纳入工业试验维护工艺 2409.06416v2 -
121 07-03 Precisely Detecting Python Type Errors via LLM-based Unit Test Generation Präzise Erkennung von Python-Typfehlern über LLM-basierte Einheitentestgenerierung 通过基于LLM的单位测试生成精确检测 Python 类型错误 2507.02318v1 -
122 07-03 CORE: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks CORE: Benchmarking von LLMs Code Reasoning Capabilities durch statische Analyseaufgaben CORE:通过静态分析任务,确定LLMs 法规说明能力标准 2507.05269v1
Article 0
Title@2025-07-10 (4): QCP: A Practical Separation Logic-based C Program Verification Tool
Title: QCP: A Practical Separation Logic-based C Program Verification Tool | QCP: Ein praktisches, auf Separationslogik basierendes Verifikationswerkzeug für C-Programme | QCP:一种实用的基于分离逻辑的C程序验证工具 2505.12878v2 |
Authors (13): Xiwei Wu, Yueyang Feng, Xiaoyang Lu, Tianchuan Lin, Kan Liu, Zhiyi Wang, Shushu Wu, Lihan Xie, Chengxi Yang, Hongyi Zhong, Naijun Zhan, Zhenjiang Hu, Qinxiang Cao
As software systems increase dramatically in size and complexity, ensuring their correctness, security, and reliability becomes an increasingly formidable challenge. Despite significant advancements in verification techniques and tools, substantial difficulties remain when applying these tools to complex, real-world scenarios. To address these difficulties, this paper introduces a novel verification tool, called Qualified C Programming Verifier (QCP). QCP incorporates a refined front-end assertion language to enhance user interaction. The proposed assertion language aims to lower the entry barrier for verification tools, improve proof efficiency by improving automation, and facilitate a deeper understanding of both the program and its verification results.
随着软件系统规模和复杂性的急剧扩大,确保其正确性、安全性和可靠性成为日益艰巨的挑战。尽管核查技术和工具已有显著进步,但将这些工具应用于复杂的现实世界场景时仍然存在巨大困难。为了解决这些困难,本文引入了一种新的验证工具,称为 Qualified C Programming Verifier (QCP)。QCP 采用了经过改进的断言语言前端,以增强用户交互。所提出的断言语言旨在降低验证工具的入门门槛,通过提高自动化程度提升证明效率,并促进对程序及其验证结果的更深入理解。
Article 1
Title@2025-07-10 (4): Open Source, Hidden Costs: A Systematic Literature Review on OSS License Management
Title: Open Source, Hidden Costs: A Systematic Literature Review on OSS License Management | Offene Quelle, versteckte Kosten: Ein systematischer Literaturbericht über OSS-Lizenzverwaltung | 开放源码,隐藏成本:开放源码软件许可证管理的系统文献审查 2507.05270v2 |
Authors (6): Boyuan Li, Chengwei Liu, Lingling Fan, Sen Chen, Zhenlin Zhang, Zheli Liu
Integrating third-party software components is a common practice in modern software development, offering significant advantages in terms of efficiency and innovation. However, this practice is fraught with risks related to software licensing. A lack of understanding may lead to disputes, which can pose serious legal and operational challenges. To these ends, both academia and industry have conducted various investigations and proposed solutions and tools to deal with these challenges. However, significant limitations still remain. Moreover, the rapid evolution of open-source software (OSS) licenses, as well as the rapidly incorporated generative software engineering techniques, such as large language models for code (CodeLLMs), are placing greater demands on the systematic management of software license risks. To unveil the severe challenges and explore possible future directions, we conduct the first systematic literature review (SLR) on 80 carefully selected OSS license-related papers, classifying existing research into three key categories, i.e., license identification, license risk assessment, and license risk mitigation. Based on these, we discuss challenges in existing solutions, conclude the opportunities to shed light on future research directions and offer practical recommendations for practitioners. We hope this thorough review will help bridge the gaps between academia and industry and accelerate the ecosystem-wide governance of legitimate software risks within the software engineering community.
整合第三方软件组件是现代软件开发的常见做法,在效率和创新方面有很大的优势。然而,这种做法充满了与软件许可相关的风险。缺乏了解可能导致争议,并可能造成严重的法律和运营挑战。为此,学术界和工业界开展了各种调查,提出了应对这些挑战的解决办法和工具。然而,仍然存在重大局限。此外,开源软件(OSS)许可证的迅速演变,以及迅速融入的生成式软件工程技术,如代码大语言模型(CodeLLMs),对软件许可证风险的系统管理提出了更高的要求。为了揭示严峻的挑战并探索可能的未来方向,我们首次对80篇精心挑选的开源软件许可证相关论文进行系统文献综述(SLR),将现有研究分为三大类别,即许可证识别、许可证风险评估和许可证风险缓解。在此基础上,我们讨论了现有解决方案中的挑战,总结了可为未来研究方向提供启示的机会,并向从业人员提出切实可行的建议。我们希望这一全面综述有助于弥合学术界与工业界之间的差距,并加快软件工程社区内对合法软件风险的全生态系统治理。
Article 2
Title@2025-07-10 (4): Open-source automatic pipeline for efficient conversion of large-scale point clouds to IFC format
Title: Open-source automatic pipeline for efficient conversion of large-scale point clouds to IFC format | Open-Source-Automatische Pipeline für die effiziente Umwandlung von großflächigen Punktwolken in IFC-Format | 将大型点云有效转换成国际金融公司格式的开放源自动管道 2503.11498v3 |
Authors (2): Slávek Zbirovský, Václav Nežerka
Building Information Modeling (BIM) is an essential component in the sustainable reconstruction and revitalization of ageing structures. However, model creation usually relies on laborious manual transformation of the unstructured point cloud data provided by laser scans or photogrammetry. This paper presents Cloud2BIM, an open-source software tool designed to automate the conversion of point clouds into BIM models compliant with the Industry Foundation Classes (IFC) standard. Cloud2BIM integrates advanced algorithms for wall and slab segmentation, opening detection, and room zoning based on real wall surfaces, resulting in a comprehensive and fully automated workflow. Unlike existing tools, it avoids computationally- and calibration-intensive techniques such as RANSAC, supports non-orthogonal geometries, and provides unprecedented processing speed, achieving results up to seven times faster than the fastest competing solutions. Systematic validation using benchmark datasets confirms that Cloud2BIM is an easy-to-use, efficient, and scalable solution for generating accurate BIM models, capable of converting extensive point cloud datasets for entire buildings into IFC format with minimal user input.
建筑信息建模(BIM)是可持续重建和振兴老化结构的基本组成部分,但是,模型的创建通常依赖于由激光扫描或摄影测量提供的非结构化点云数据人工转换。本文展示了Cloud2BIM,这是一个开放源软件工具,旨在按照工业基础类标准将点云自动转换成BIM模型。Cloud2BIM结合了基于实际墙面的墙壁和板块分割、开启探测和房间分区的先进算法,从而形成一个全面和完全自动化的工作流程。它与现有工具不同,它避免了计算和校准密集技术,如RANSAC,支持非垂直的地理分布,提供了前所未有的处理速度达比最快的解决方案快七倍的超速结果。使用基准数据集进行的系统验证证实Clod2BIM是一种容易使用、高效和可扩展的解决方案,可以生成准确的BIM模型,能够将整个建筑的广点云数据集转换成IFCFC格式,用户投入极少。
Article 3
Title@2025-07-10 (4): From Domain Documents to Requirements: Retrieval-Augmented Generation in the Space Industry
Title: From Domain Documents to Requirements: Retrieval-Augmented Generation in the Space Industry | Von Domänendokumenten zu Anforderungen: Retrieval-Augmented Generation in der Raumfahrtindustrie | 从域文档到要求:空间工业中回收利用-增强的一代人 2507.07689v1 |
Authors (5): Chetan Arora, Fanyu Wang, Chakkrit Tantithamthavorn, Aldeida Aleti, Shaun Kenyon
Requirements engineering (RE) in the space industry is inherently complex, demanding high precision, alignment with rigorous standards, and adaptability to mission-specific constraints. Smaller space organisations and new entrants often struggle to derive actionable requirements from extensive, unstructured documents such as mission briefs, interface specifications, and regulatory standards. In this innovation opportunity paper, we explore the potential of Retrieval-Augmented Generation (RAG) models to support and (semi-)automate requirements generation in the space domain. We present a modular, AI-driven approach that preprocesses raw space mission documents, classifies them into semantically meaningful categories, retrieves contextually relevant content from domain standards, and synthesises draft requirements using large language models (LLMs). We apply the approach to a real-world mission document from the space domain to demonstrate feasibility and assess early outcomes in collaboration with our industry partner, Starbound Space Solutions. Our preliminary results indicate that the approach can reduce manual effort, improve coverage of relevant requirements, and support lightweight compliance alignment. We outline a roadmap toward broader integration of AI in RE workflows, intending to lower barriers for smaller organisations to participate in large-scale, safety-critical missions.
空间工业的工程要求(RE)具有内在的复杂性,要求高度精确,符合严格的标准,并适应特定任务的限制。较小型的空间组织和新进入者往往努力从广泛的、非结构化的文件(如飞行任务简报、界面规格和监管标准)中获取可操作的要求。在这个创新机会文件中,我们探索了再回收新一代(RAG)模型的潜力,以支持和(半)自动生成空间领域的要求。我们提出了一个模块化的、由AI驱动的方法,该方法处理原始空间飞行任务文件,将其分为具有实际意义的类别,从大语言模型(LLMS)中检索域标准中与环境相关的内容,并综合了要求草案。我们从空间领域对现实世界飞行任务文件采用这一方法,以展示可行性,并与我们的工业伙伴“星界空间解决方案”合作评估早期成果。我们的初步结果表明,该方法可以减少人工工作,改进相关要求的覆盖面,并支持轻度的合规调整。我们概述了将AI更广泛地纳入电子工作流程的路线图,打算降低小型组织参加大规模、安全性飞行任务的障碍。
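The retrieve-then-synthesise flow described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' pipeline: token overlap stands in for a real embedding-based retriever, the classification step is omitted, no LLM is called, and the standards clauses, the mission text, and all function names are invented for the example.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase word counts; a crude stand-in for an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def overlap(a, b):
    """Number of token occurrences shared by two texts."""
    return sum((tokens(a) & tokens(b)).values())

def retrieve(query, standards, k=1):
    """Return the k standard clauses sharing the most tokens with the query."""
    ranked = sorted(standards, key=lambda clause: overlap(query, clause), reverse=True)
    return ranked[:k]

def build_prompt(mission_chunk, standards):
    """Assemble the text an LLM would receive; no model is called here."""
    context = "\n".join(retrieve(mission_chunk, standards))
    return (
        "Context from standards:\n" + context
        + "\n\nMission text:\n" + mission_chunk
        + "\n\nDraft one requirement in 'The system shall ...' form."
    )

# Hypothetical inputs, invented for illustration only.
STANDARDS = [
    "Thermal control shall keep battery temperature within qualified limits.",
    "Telemetry downlink shall use the allocated frequency band.",
]
chunk = "The battery must survive eclipse temperature swings."
prompt = build_prompt(chunk, STANDARDS)
```

Keeping retrieval separate from prompt assembly means the overlap function could later be swapped for a vector search without touching the rest of the sketch.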
Article 4
Title@2025-07-10 (4): Prompt Engineering for Requirements Engineering: A Literature Review and Roadmap
Title: Prompt Engineering for Requirements Engineering: A Literature Review and Roadmap | Prompt Engineering for Requirements Engineering: Literature Review und Roadmap | 工程:文学审查和路线图 2507.07682v1 |
Authors (4): Kaicheng Huang, Fanyu Wang, Yutan Huang, Chetan Arora
Advancements in large language models (LLMs) have led to a surge of prompt engineering (PE) techniques that can enhance various requirements engineering (RE) tasks. However, current LLMs are often characterized by significant uncertainty and a lack of controllability. This absence of clear guidance on how to effectively prompt LLMs acts as a barrier to their trustworthy implementation in the RE field. We present the first roadmap-oriented systematic literature review of Prompt Engineering for RE (PE4RE). Following Kitchenham’s and Petersen’s secondary-study protocol, we searched six digital libraries, screened 867 records, and analyzed 35 primary studies. To bring order to a fragmented landscape, we propose a hybrid taxonomy that links technique-oriented patterns (e.g., few-shot, Chain-of-Thought) to task-oriented RE roles (elicitation, validation, traceability). Two research questions, with five sub-questions, map the tasks addressed, LLM families used, and prompt types adopted, and expose current limitations and research gaps. Finally, we outline a step-by-step roadmap showing how today’s ad-hoc PE prototypes can evolve into reproducible, practitioner-friendly workflows.
大型语言模型(LLMS)的进步导致迅速工程技术(PE)的激增,这些技术可以加强各种要求工程(RE)任务;然而,目前的LLMS往往具有重大的不确定性和缺乏可控性的特点,在如何有效促使LLMS成为其在RE领域可信赖的执行的障碍方面缺乏明确的指导;我们提出了第一份面向路线图的系统文献审查报告《RE快速工程(PE4RE)》(PE4RE),根据Kitchenham和Peter的副研究协议,我们搜索了六个数字图书馆,筛选了867个记录,分析了35项基本研究;为了给分散的地貌带来秩序,我们建议采用一种混合分类方法,将面向技术的模式(例如,少量的、链式的)与面向任务的RE角色(引文、鉴定、可追溯性)联系起来;两个研究问题,包括五个子问题,绘制了所处理的任务、LM家庭使用和迅速采用的任务图,并暴露了目前的局限性和研究差距;最后,我们概述了一个分步骤绘制的路线图,显示今天的PE原型能如何演变成可复制的工作流程。
Article 5
Title@2025-07-10 (4): ProvideQ: A Quantum Optimization Toolbox
Title: ProvideQ: A Quantum Optimization Toolbox | ProvideQ: Eine Quantum-Optimierungs-Toolbox | 提供 Q: 量图优化工具箱 2507.07649v1 |
Authors (4): Domenik Eichhorn, Nick Poser, Maximilian Schweikart, Ina Schaefer
Hybrid solvers for combinatorial optimization problems combine the advantages of classical and quantum computing to overcome difficult computational challenges. Although their theoretical performance seems promising, their practical applicability is challenging due to the lack of a technological stack that can seamlessly integrate quantum solutions with existing classical optimization frameworks. We tackle this challenge by introducing the ProvideQ toolbox, a software tool that enables users to easily adapt and configure hybrid solvers via Meta-Solver strategies. A Meta-Solver strategy implements decomposition techniques, which split problems into classical and quantum subroutines. The ProvideQ toolbox enables the interactive creation of such decompositions via a Meta-Solver configuration tool. It combines well-established classical optimization techniques with quantum circuits that are seamlessly executable on multiple backends. This paper introduces the technical details of the ProvideQ toolbox, explains its architecture, and demonstrates possible applications for several real-world use cases. Our proof of concept shows that Meta-Solver strategies already enable the application of quantum subroutines today; however, more sophisticated hardware is required to make their performance competitive.
组合优化问题的混合解析器将古典计算和量子计算的优势结合起来,以克服困难的计算挑战。虽然它们的理论性能似乎很有希望,但由于缺乏能够将量子解决方案与现有的经典优化框架无缝地整合起来的技术堆叠,它们的实际适用性是具有挑战性的。我们通过引入“提供Q”工具箱来应对这一挑战,这是一个软件工具,使用户能够通过Meta-Solver战略方便地调整和配置混合解析器。一个元-Solver战略实施分解技术,将问题分为经典和量子例。“提供Q”工具箱使得能够通过一个元-Solver配置工具交互生成这种分解。它将成熟的经典优化技术与可在多个后端上无缝地执行的量子路路结合起来。本文介绍了“提供Q”工具箱的技术细节,解释了其结构,并展示了几个现实世界使用案例的可能应用。我们的概念证明,“提供”战略已经使得今天能够应用量子子流的硬件成为了应用。
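The idea of a strategy that decomposes a problem and routes each part to a different subroutine can be sketched as follows. This is an illustrative toy, not the ProvideQ API: the problem (max-cut), the size threshold, and every function name are assumptions, and the "exhaustive" routine only marks the slot where a quantum subroutine could be plugged in.

```python
from itertools import product

def components(nodes, edges):
    """Decomposition step: split a graph into connected components."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, parts = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, part = [start], []
        while stack:
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            part.append(n)
            stack.extend(adj[n] - seen)
        parts.append(sorted(part))
    return parts

def cut_size(side, edges):
    """Count edges whose endpoints are both assigned and lie on different sides."""
    return sum(1 for u, v in edges if u in side and v in side and side[u] != side[v])

def exhaustive_subroutine(part, edges):
    """Placeholder slot for a quantum subroutine; here plain brute force."""
    best = max(product([0, 1], repeat=len(part)),
               key=lambda bits: cut_size(dict(zip(part, bits)), edges))
    return dict(zip(part, best))

def greedy_subroutine(part, edges):
    """Classical heuristic: put each node on the side that cuts more edges so far."""
    side = {}
    for n in part:
        side[n] = 0
        zero = cut_size(side, edges)
        side[n] = 1
        one = cut_size(side, edges)
        side[n] = 1 if one > zero else 0
    return side

def meta_solve(nodes, edges, small=4):
    """Route each component to a subroutine by size and merge the partial answers."""
    assignment = {}
    for part in components(nodes, edges):
        solver = exhaustive_subroutine if len(part) <= small else greedy_subroutine
        assignment.update(solver(part, edges))
    return assignment

# Hypothetical instance: a triangle plus a separate single edge.
NODES = [0, 1, 2, 3, 4]
EDGES = [(0, 1), (1, 2), (0, 2), (3, 4)]
solution = meta_solve(NODES, EDGES)
```

Because components share no edges, the partial assignments can be merged without conflicts, which is what makes this particular decomposition safe.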
Article 6
Title@2025-07-10 (4): Quantum Executor: A Unified Interface for Quantum Computing
Title: Quantum Executor: A Unified Interface for Quantum Computing | Quantum Executor: Ein einheitliches Interface für Quantum Computing | 量图执行器: 量数计算的统一界面 2507.07597v1 |
Authors (3): Giuseppe Bisicchia, Alessandro Bocci, Antonio Brogi
As quantum computing evolves from theoretical promise to practical deployment, the demand for robust, portable, and scalable tools for quantum software experimentation is growing. This paper introduces Quantum Executor, a backend-agnostic execution engine designed to orchestrate quantum experiments across heterogeneous platforms. Quantum Executor provides a declarative and modular interface that decouples experiment design from backend execution, enabling seamless interoperability and code reuse across diverse quantum and classical resources. Key features include support for asynchronous and distributed execution, customizable execution strategies, and a unified API for managing quantum experiments. We illustrate its applicability through two realistic usage scenarios, automated benchmarking and hybrid validation, and discuss its capacity to streamline quantum development. We conclude by discussing current limitations and outlining a roadmap for future enhancements.
随着量子计算从理论承诺演变为实际应用,对强力、可移植和可缩放的量子软件实验工具的需求正在增长。本文件介绍了量子软件实验的“量子执行器 ” 。 Qantum 执行器是一个后端的、不可知的执行引擎,旨在在不同平台协调量子实验。 量子执行器提供了一个宣示和模块界面,将实验设计与后端执行脱钩,在不同量子和古典资源之间实现无缝互操作性和代码再利用。 关键特征包括支持无同步和分散的执行、可定制的执行策略以及管理量子实验的统一的“API ” 。 我们通过两种类似生命的情景,例如自动化基准和混合验证,来说明其适用性,讨论其简化量子开发的能力。 我们通过讨论目前的局限性和为未来增强制定路线图来结束我们的讨论。
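Decoupling experiment design from backend execution can be pictured with a small interface sketch. Nothing below is Quantum Executor's real API: the `Backend` protocol, the two toy backends, the `execute` helper, and the circuit string are all hypothetical and exist only to show one experiment description being dispatched unchanged to several backends.

```python
from typing import Protocol

class Backend(Protocol):
    """Anything that can run a circuit description and return outcome counts."""
    name: str

    def run(self, circuit: str, shots: int) -> dict: ...

class AlwaysZero:
    """Toy backend that reports every shot as outcome '0'."""
    name = "always_zero"

    def run(self, circuit, shots):
        return {"0": shots}

class EvenSplit:
    """Toy backend that splits the shots evenly between '0' and '1'."""
    name = "even_split"

    def run(self, circuit, shots):
        half = shots // 2
        return {"0": half, "1": shots - half}

def execute(experiment, backends):
    """Run one declarative experiment on every backend and collect the counts."""
    return {b.name: b.run(experiment["circuit"], experiment["shots"]) for b in backends}

# The circuit text is a placeholder, not a real circuit language.
experiment = {"circuit": "H 0; MEASURE 0", "shots": 100}
results = execute(experiment, [AlwaysZero(), EvenSplit()])
```

The experiment dictionary never mentions a backend, so adding a new platform only requires another class with a `name` and a `run` method.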
Article 7
Title@2025-07-10 (4): From Requirements to Code: Understanding Developer Practices in LLM-Assisted Software Engineering
Title: From Requirements to Code: Understanding Developer Practices in LLM-Assisted Software Engineering | Von Anforderungen zum Code: Entwickler-Praxis in LLM-Assisted Software Engineering verstehen | 从要求到准则:了解LLM辅助软件工程开发者的做法 2507.07548v1 |
Authors (3): Jonathan Ullrich, Matthias Koch, Andreas Vogelsang
With the advent of generative LLMs and their advanced code generation capabilities, some people already envision the end of traditional software engineering, as LLMs may be able to produce high-quality code based solely on the requirements a domain expert feeds into the system. The feasibility of this vision can be assessed by understanding how developers currently incorporate requirements when using LLMs for code generation, a topic that remains largely unexplored. We interviewed 18 practitioners from 14 companies to understand how they (re)use information from requirements and other design artifacts to feed LLMs when generating code. Based on our findings, we propose a theory that explains the processes developers employ and the artifacts they rely on. Our theory suggests that requirements, as typically documented, are too abstract for direct input into LLMs. Instead, they must first be manually decomposed into programming tasks, which are then enriched with design decisions and architectural constraints before being used in prompts. Our study highlights that fundamental RE work is still necessary when LLMs are used to generate code. Our theory is important for contextualizing scientific approaches to automating requirements-centric SE tasks.
随着基因化的LLMs及其先进的代码生成能力的出现,一些人已经设想了传统软件工程的终结,因为LLMs可能能够仅仅根据一个域专家对系统的投入要求来制作高质量的代码。这一愿景的可行性可以通过理解开发商目前如何在使用LLMs对代码生成专题使用基本尚未探索的LLMs时纳入要求来评估。我们采访了14家公司的18名从业人员,以了解他们在生成代码时如何(再)利用要求和其他设计艺术品提供的信息来喂养LMs。根据我们的调查结果,我们提出了一个解释流程开发商所使用和他们所依赖的文物的理论。我们的理论表明,通常记载的要求过于抽象,无法直接输入LMS。相反,他们首先必须手工分解成编成程序任务,然后在迅速使用设计决定和建筑限制后再加以充实。我们的研究强调,在使用LMSMs生成代码时,基本的RE工作仍然是必要的。我们的理论对于使以要求为中心的SE任务实现自动化的科学方法十分重要。
Article 8
Title@2025-07-10 (4): Towards an Engineering Workflow Management System for Asset Administration Shells using BPMN
Title: Towards an Engineering Workflow Management System for Asset Administration Shells using BPMN | Auf dem Weg zu einem Engineering Workflow Management System für Asset Administration Shells mit BPMN | 努力建立一个利用生物和水管理网的资产管理壳壳工程工作流程管理系统 2507.07468v1 |
Authors (2): Sten Grüner, Nafise Eskandani
The integration of Industry 4.0 technologies into engineering workflows is an essential step toward automating and optimizing plant and process engineering. The Asset Administration Shell (AAS) serves as a key enabler for creating interoperable Digital Twins that facilitate engineering data exchange and automation. This paper explores the use of AAS within engineering workflows, particularly in combination with Business Process Model and Notation (BPMN) to define structured and automated processes. We propose a distributed AAS copy-on-write infrastructure that enhances security and scalability while enabling seamless cross-organizational collaboration. We also introduce a workflow management prototype that automates AAS operations and engineering workflows, improving efficiency and traceability.
将工业4.0技术纳入工程工作流程是走向工厂和工序工程流程自动化和优化的一个必要步骤。资产管理壳牌公司(AAS)是创建可互操作的数字双体的关键推动器,可促进工程数据交换和自动化。本文探讨了在工程工作流程中使用AAS的问题,特别是与业务流程模型和标记(BPMN)相结合,以界定结构化和自动化流程。我们提出一个分布式AAS复制版基础设施,既能加强安全和可扩缩性,又能实现无缝跨组织合作。我们还引入了AAS自动化操作和工程工作流程的工作流程原型,提高效率和可追踪性。
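The copy-on-write idea mentioned in this abstract can be illustrated with Python's standard `collections.ChainMap`, where reads fall through to a shared base and writes land only in a private top layer. This is a sketch of the general technique, not the infrastructure the paper proposes; the field names and the `open_workspace` helper are hypothetical.

```python
from collections import ChainMap

# Shared view of an asset's data (hypothetical fields).
shared_shell = {"manufacturer": "ACME", "max_pressure_bar": 16}

def open_workspace(base):
    """Give a partner a private overlay; writes go to the overlay, never to base."""
    return ChainMap({}, base)

partner = open_workspace(shared_shell)
partner["max_pressure_bar"] = 10  # stored in the partner's private layer only

private_changes = partner.maps[0]  # what this partner has modified so far
```

Note that `ChainMap` only isolates top-level keys: a nested mutable value read from the base is still the same shared object, so a real system would need deeper copying or immutable values.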
Article 9
Title@2025-07-10 (4): Toolchain for Faster Iterations in Quantum Software Development
Title: Toolchain for Faster Iterations in Quantum Software Development | Toolchain für schnellere Iterationen in der Quantensoftware-Entwicklung | 量量软件开发中快速迭接工具链 2507.07448v1 |
Authors (4): Otso Kinanen, Andrés D. Muñoz-Moller, Vlad Stirbu, Tommi Mikkonen
Quantum computing proposes a revolutionary paradigm that can radically transform numerous scientific and industrial application domains. To realize this promise, these new capabilities need software solutions that can effectively harness their power. However, developers may face significant challenges when developing and executing quantum software due to the limited availability of quantum computer hardware, the high computational demands of simulating quantum computers on classical systems, and the complicated technology stack needed to integrate currently available accelerators into development environments. These limitations make it difficult for developers to create an efficient workflow for quantum software development. In this paper, we investigate the potential of using remote computational capabilities efficiently to improve the workflow of quantum software developers, by lowering the barrier to moving between local execution and computationally more efficient remote hardware and by offering speedup when executing on simulators. The goal is to allow the development of more complex circuits and to support an iterative software development approach. In our experiment, with the solution presented in this paper, we have obtained up to 5 times faster circuit execution runtime, and enabled qubit ranges from 21 to 29 qubits with a simple plug-and-play kernel for the Jupyter notebook.
量子计算提出了能够从根本上改变众多科学和工业应用领域的革命范式。为了实现这一承诺,这些新能力需要能够有效利用其权力的软件解决方案。然而,开发商在开发和实施量子软件时可能面临重大挑战,因为数量计算机硬件有限,古典系统中模拟量子计算机的计算要求高,以及为使目前可用的加速器进入开发环境所需的复杂技术堆叠。这些限制使得开发商难以为量子软件开发创造高效的工作流程。在本文中,我们调查了以有效的方式利用远程计算能力改进量子软件开发商工作流程的可能性,降低本地执行与计算效率更高的远程硬件之间的移动障碍,并提供与模拟器周围的快速执行。目标是允许开发更复杂的电路并支持反复软件开发方法。在我们的实验中,我们用本文中所提出的解决方案,已经取得了5倍的快速电路路运行时间,并使得象子范围从21到29个,为Jupyter笔记本提供了简单的插件箱。
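A short calculation shows why classical simulation becomes demanding across the 21 to 29 qubit range quoted above. It assumes a dense statevector simulator storing one double-precision complex amplitude (16 bytes) per basis state; the abstract does not say which simulator or precision was used, so the figures are only indicative.

```python
def statevector_bytes(qubits, bytes_per_amplitude=16):
    """Memory for a dense statevector: 2**qubits amplitudes of fixed size."""
    return (2 ** qubits) * bytes_per_amplitude

MIB = 2 ** 20
for n in (21, 29):
    print(f"{n} qubits: {statevector_bytes(n) / MIB:.0f} MiB")
```

Under these assumptions each extra qubit doubles the memory, so 29 qubits need 256 times more than 21 qubits (8 GiB versus 32 MiB), which is the kind of gap that makes offloading to remote hardware attractive.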
Article 10
Title@2025-07-10 (4): PromiseTune: Unveiling Causally Promising and Explainable Configuration Tuning
Title: PromiseTune: Unveiling Causally Promising and Explainable Configuration Tuning | PromiseTune: Enthüllen kausal vielversprechende und erklärbare Konfigurationstuning | 前景图:不懈的因果保证和可解释的配置图纸 2507.05995v2 |
Authors (2): Pengzhou Chen, Tao Chen
The high configurability of modern software systems has made configuration tuning a crucial step for assuring system performance, e.g., latency or throughput. However, given the expensive measurements, large configuration space, and rugged configuration landscape, existing tuners are ineffective because it is difficult to balance the budget between exploring uncertain regions (to escape local optima) and exploiting the guidance of known good configurations (for fast convergence). The root cause is that we lack knowledge of where the promising regions lie, which also makes the results hard to explain. In this paper, we propose PromiseTune, which tunes configurations guided by causally purified rules. PromiseTune is unique in that it learns rules, which reflect certain regions in the configuration landscape, and purifies them with causal inference. The remaining rules serve as approximated reflections of the promising regions, bounding the tuning to emphasize these places in the landscape. This, as we demonstrate, can effectively mitigate the impact of the exploration and exploitation trade-off. Those purified regions can then be paired with the measured configurations to provide spatial explainability at the landscape level. Comparing with 11 state-of-the-art tuners on 12 systems and varying budgets, we show that PromiseTune performs significantly better than the others, with a 42% better rank than the overall second best, while providing richer information to explain the hidden system characteristics.
现代软件系统高度的可配置性使得配置调整成为确保系统性能的关键步骤,例如长期性或吞吐量。然而,由于测量费用昂贵、配置空间大、配置面貌崎岖不平,现有调调器由于在探索不确定区域(摆脱当地选法)和利用已知良好配置的指导(快速趋同)之间难以平衡使用预算而丧失效力。其根本原因是,我们不了解有希望区域的位置,这也给解释结果带来挑战。在本文件中,我们提议 “ 承诺图 “ 以因果净化规则为指导,调整配置。 “ 承诺图 “ 是独特的,因为我们学习了规则,这些规则反映某些区域在配置面貌中的情况,并且用因果关系推断来净化这些规则。其余规则作为有希望区域的近似反映,将调整以强调这些地貌景观中的这些地方。正如我们所证明的那样,这可以有效地减轻勘探和开发交易的影响。这些经过净化的区域可以与测量的配置相匹配,以有因果关系的因果净化规则为指导。 “ 承诺图 “ 承诺图 “ 是独一无二的,因为我们学习了规则,这些规则反映了某些区域在配置,这些区域在配置中,在配置中反映了某些区域在结构中,比我们11级的稳定性水平上更精确度,我们展示了系统在12级上展示了更精确的系统。
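How a rule can bound tuning to one region of the configuration landscape may be easier to see in code. The sketch below assumes a rule is a set of per-option integer ranges and simply samples inside it; rule learning and causal purification, which are the core of PromiseTune, are not shown, and the option names, ranges, and functions are invented.

```python
import random

def satisfies(config, rule):
    """A rule maps option -> (low, high); the config must fall inside every range."""
    return all(low <= config[opt] <= high for opt, (low, high) in rule.items())

def sample_in_rule(space, rule, rng):
    """Sample one configuration, narrowing each option to the rule's range if given."""
    config = {}
    for opt, (low, high) in space.items():
        rule_low, rule_high = rule.get(opt, (low, high))
        config[opt] = rng.randint(max(low, rule_low), min(high, rule_high))
    return config

# Hypothetical configuration space and one hypothetical "promising region" rule.
SPACE = {"threads": (1, 64), "cache_mb": (16, 1024)}
RULE = {"threads": (8, 16)}

rng = random.Random(0)
candidates = [sample_in_rule(SPACE, RULE, rng) for _ in range(50)]
```

Options the rule does not mention stay free across their full range, so the rule narrows the search without fixing the whole configuration.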
Article 11
Title@2025-07-10 (4): DITING: A Static Analyzer for Identifying Bad Partitioning Issues in TEE Applications
Title: DITING: A Static Analyzer for Identifying Bad Partitioning Issues in TEE Applications | DITING: Ein statischer Analyzer zur Identifizierung von Problemen mit schlechten Partitionierungen in TEE-Anwendungen | Tinging: 识别TEE应用中的不良分割问题的静态分析器 2502.15281v2 |
Authors (10): Chengyan Ma, Ruidong Han, Jieke Shi, Ye Liu, Yuqing Niu, Di Lu, Chuang Tian, Jianfeng Ma, Debin Gao, David Lo
A Trusted Execution Environment (TEE) enhances the security of mobile applications and cloud services by isolating sensitive code in the secure world from the non-secure normal world. However, TEE applications are still confronted with vulnerabilities stemming from bad partitioning. Bad partitioning can lead to critical security problems of TEE, such as leaking sensitive data to the normal world or being adversely affected by malicious inputs from the normal world. To address this, we propose an approach to detect partitioning issues in TEE applications. First, we conducted a survey of TEE vulnerabilities caused by bad partitioning and found that the parameters exchanged between the secure and normal worlds are often used insecurely when partitioning is implemented badly. Second, we developed a tool named DITING that can analyze data-flows of these parameters and identify violations of the security rules we defined to find bad partitioning issues. Unlike existing research, which focuses only on malicious input to TEE, we assess the partitioning issues more comprehensively through input/output and shared memory. Finally, we created the first benchmark targeting bad partitioning, consisting of 110 test cases. Experiments demonstrate that DITING achieves an F1 score of 0.90 in identifying bad partitioning issues.
受信任的执行环境(TEE)通过将安全世界中的敏感代码与非安全正常世界分离,增强了移动应用和云层服务的安全性。然而,TEE应用仍面临因差分导致的脆弱性。不良的分割会导致TEE的关键安全问题,例如敏感数据泄漏到正常世界,或受到来自正常世界的恶意投入的不利影响。为了解决这个问题,我们建议了一种方法来探测TEE应用中的分割问题。首先,我们开展了一项调查,调查了由于差分造成的TEE脆弱性,发现在安全和正常世界之间交换的参数往往含有不安全的用途,而且执行不可靠的分隔。第二,我们开发了一种名为DIting的工具,可以分析这些参数的数据流,并查明其违反我们为发现错误分割问题而确定的安全规则的情况。不同于现有研究,我们仅侧重于对TEE的恶意输入/输出和共同记忆,我们更全面地评估了分割问题。最后,我们建立了第一个针对错误分割问题的基准,包括110个测试案例。实验表明,DIting在确定坏分割问题上达到了0.90分的F1分。
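For readers less familiar with the metric reported above, the F1 score is the harmonic mean of precision and recall. The counts in this sketch are made up to give a round result; they are not the paper's confusion matrix, which the abstract does not provide.

```python
def f1_score(true_pos, false_pos, false_neg):
    """Harmonic mean of precision and recall, computed from raw counts."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts: 90 issues found correctly, 10 false alarms, 10 missed.
example = f1_score(true_pos=90, false_pos=10, false_neg=10)
```

Because the harmonic mean is dominated by the smaller of the two values, a detector cannot reach a high F1 by trading many false alarms for recall, or the reverse.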
Article 12
Title@2025-07-10 (4): Automatic Generation of Explainability Requirements and Software Explanations From User Reviews
Title: Automatic Generation of Explainability Requirements and Software Explanations From User Reviews | Automatische Generierung von Erklärbarkeitsanforderungen und Software-Erläuterungen aus Benutzer-Bewertungen | 用户审查自动产生解释要求和软件解释 2507.07344v1 |
Authors (9): Martin Obaidi, Jannik Fischbach, Jakob Droste, Hannah Deters, Marc Herrmann, Jil Klünder, Steffen Krätzig, Hugo Villamizar, Kurt Schneider
Explainability has become a crucial non-functional requirement to enhance transparency, build user trust, and ensure regulatory compliance. However, translating explanation needs expressed in user feedback into structured requirements and corresponding explanations remains challenging. While existing methods can identify explanation-related concerns in user reviews, there is no established approach for systematically deriving requirements and generating aligned explanations. To contribute toward addressing this gap, we introduce a tool-supported approach that automates this process. To evaluate its effectiveness, we collaborated with an industrial automation manufacturer to create a dataset of 58 user reviews, each annotated with manually crafted explainability requirements and explanations. Our evaluation shows that while AI-generated requirements often lack relevance and correctness compared to human-created ones, the AI-generated explanations are frequently preferred for their clarity and style. Nonetheless, correctness remains an issue, highlighting the importance of human validation. This work contributes to the advancement of explainability requirements in software systems by (1) introducing an automated approach to derive requirements from user reviews and generate corresponding explanations, (2) providing empirical insights into the strengths and limitations of automatically generated artifacts, and (3) releasing a curated dataset to support future research on the automatic generation of explainability requirements.
解释性已成为提高透明度、建立用户信任和确保监管合规性的关键非功能性要求。然而,将用户反馈中表达的解释需要转化为结构化要求和相应的解释,仍然具有挑战性。虽然现有方法可以确定用户审查中与解释相关的关切,但尚无既定方法系统地提出要求和提出一致的解释。为帮助弥补这一差距,我们采用了一种工具支持的方法,使这一进程自动化。为了评估其有效性,我们与一个工业自动化制造商合作,创建了一套由58个用户审查组成的数据集,每套数据附有人工制作的解释性要求和解释说明。我们的评价表明,尽管与人造的要求相比,AI产生的要求往往缺乏相关性和正确性,但AI提出的解释往往因其清晰性和风格而更可取,然而,正确性仍然是一个问题,突出了人类验证的重要性。这项工作有助于推动软件系统解释性要求的提高,其方法是:(1)采用自动化方法,从用户审查中得出要求并产生相应的解释,(2)对自动生成的艺术品的优点和局限性,(3)发布经整理的数据,以支持未来关于自动生成解释性要求的研究。
Article 13
Title@2025-07-09 (3): Towards Trustworthy Sentiment Analysis in Software Engineering: Dataset Characteristics and Tool Selection
Title: Towards Trustworthy Sentiment Analysis in Software Engineering: Dataset Characteristics and Tool Selection | Auf dem Weg zu einer vertrauensvollen Sentiment-Analyse in der Software-Engineering: Dataset-Eigenschaften und Werkzeugauswahl | 在软件工程中进行可信赖的感知分析:数据集特点和工具选择 2507.02137v2 |
Authors (4): Martin Obaidi, Marc Herrmann, Jil Klünder, Kurt Schneider
Software development relies heavily on text-based communication, making sentiment analysis a valuable tool for understanding team dynamics and supporting trustworthy AI-driven analytics in requirements engineering. However, existing sentiment analysis tools often perform inconsistently across datasets from different platforms, due to variations in communication style and content. In this study, we analyze linguistic and statistical features of 10 developer communication datasets from five platforms and evaluate the performance of 14 sentiment analysis tools. Based on these results, we propose a mapping approach and questionnaire that recommends suitable sentiment analysis tools for new datasets, using their characteristic features as input. Our results show that dataset characteristics can be leveraged to improve tool selection, as platforms differ substantially in both linguistic and statistical properties. While transformer-based models such as SetFit and RoBERTa consistently achieve strong results, tool effectiveness remains context-dependent. Our approach supports researchers and practitioners in selecting trustworthy tools for sentiment analysis in software engineering, while highlighting the need for ongoing evaluation as communication contexts evolve.
软件开发在很大程度上依赖基于文本的通信,使情绪分析成为了解团队动态和支持在需求工程中可靠AI驱动分析的宝贵工具,但是,由于通信风格和内容的差异,现有情绪分析工具往往在不同平台的数据集之间运作不一致。在本研究中,我们分析5个平台10个开发者通信数据集的语言和统计特征,评价14个情绪分析工具的性能。根据这些结果,我们提出了一个绘图方法和问卷,建议为新数据集提供适当的情绪分析工具,同时使用其特性作为投入。我们的结果显示,数据集的特性可以用来改进工具选择,因为平台在语言和统计特性上差异很大。SetFit和ROBERTA等基于变压器的模型不断取得强有力的结果,但工具的有效性仍然取决于具体情况。我们的方法支持研究人员和从业人员选择软件工程中值得信赖的情绪分析工具,同时强调随着通信环境的演变需要持续评估。
Article 14
Title@2025-07-09 (3): A German Gold-Standard Dataset for Sentiment Analysis in Software Engineering
Title: A German Gold-Standard Dataset for Sentiment Analysis in Software Engineering | Ein deutscher Gold-Standard-Datensatz für die Sentiment-Analyse in der Software-Engineering | 用于软件工程敏感分析的德国黄金标准数据集 2507.07325v1 |
Authors (6): Martin Obaidi, Marc Herrmann, Elisa Schmid, Raymond Ochsner, Kurt Schneider, Jil Klünder
Sentiment analysis is an essential technique for investigating the emotional climate within developer teams, contributing to both team productivity and project success. Existing sentiment analysis tools in software engineering primarily rely on English or non-German gold-standard datasets. To address this gap, our work introduces a German dataset of 5,949 unique developer statements, extracted from the German developer forum Android-Hilfe.de. Each statement was annotated with one of six basic emotions, based on the emotion model by Shaver et al., by four German-speaking computer science students. Evaluation of the annotation process showed high interrater agreement and reliability. These results indicate that the dataset is sufficiently valid and robust to support sentiment analysis in the German-speaking software engineering community. Evaluation with existing German sentiment analysis tools confirms the lack of domain-specific solutions for software engineering. We also discuss approaches to optimize annotation and present further use cases for the dataset.
为弥补这一差距,我们的工作引入了一套德国5 949个独特的开发者声明数据集,该数据集来自德国开发商论坛Android-Hilfe.de。每份声明都附有说明,附有六个基本情感之一,其依据是Shaver等人的情感模型,四名德语计算机科学学生根据Shaver等人的情感模型作了说明。对批注过程的评估表明,对批注过程的评估表明,对跨线者之间的协议和可靠性很高。这些结果表明,数据集足够有效和有力,足以支持德语软件工程界的情绪分析。与现有的德国情绪分析工具进行的评估证实,软件工程缺乏针对具体领域的解决方案。我们还讨论了优化批注和进一步使用数据集案例的方法。
Article 15
Title@2025-07-09 (3): EditLord: Learning Code Transformation Rules for Code Editing
Title: EditLord: Learning Code Transformation Rules for Code Editing | EditLord: Regeln zur Code-Transformation für die Code-Editing | 编辑主: 学习代码编辑的代码转换规则 2504.15284v4 |
Authors (6): Weichen Li, Albert Jan, Baishakhi Ray, Junfeng Yang, Chengzhi Mao, Kexin Pei
Code editing is a foundational task in software development, where its effectiveness depends on whether it introduces desired code property changes without changing the original code’s intended functionality. Existing approaches often formulate code editing as an implicit end-to-end task, omitting the fact that code-editing procedures inherently consist of discrete and explicit steps. Thus, they suffer from suboptimal performance and lack of robustness and generalization. We introduce EditLord, a code editing framework that makes the code transformation steps explicit. Our key insight is to employ a language model (LM) as an inductive learner to extract code editing rules from the training code pairs as concise meta-rule sets. Such rule sets will be manifested for each training sample to augment them for finetuning or assist in prompting- and iterative-based code editing. EditLord outperforms the state-of-the-art by an average of 22.7% in editing performance and 58.1% in robustness while achieving 20.2% higher functional correctness across critical software engineering and security applications, LM models, and editing modes.
代码编辑是软件开发中的一项基本任务,其有效性取决于它是否在不改变原代码的预期功能的情况下引入了理想的代码属性变化。现有方法往往将代码编辑作为一种隐含的端对端任务来制定代码编辑,忽略了编码编辑程序本身就包含离散和清晰的步骤这一事实。因此,它们表现欠佳,缺乏稳健性和概括性。我们引入了编辑框架,即使代码转换步骤明确化的代码编辑框架。我们的主要见解是使用一种语言模式(LM)作为感应学习者,从培训代码对对口中提取代码编辑规则,作为简洁的元规则。对于每个培训样本,这些规则将表现为增强它们,以进行微调,或协助快速和反复的代码编辑。编辑能力优于最新水平,平均22.7%的编辑性能和58.1%的强性,同时在关键软件工程和安全应用程序、LM模式和编辑模式中实现20.2%更高的功能正确性。
Article 16
Title@2025-07-09 (3): Measuring how changes in code readability attributes affect code quality evaluation by Large Language Models
Title: Measuring how changes in code readability attributes affect code quality evaluation by Large Language Models | Messung, wie sich Änderungen in Codelesbarkeitsattributen auf die Bewertung der Codequalität durch Large Language Models auswirken | 衡量守则可读性属性的变化如何影响大语言模式的守则质量评价 2507.05289v2 |
Authors (2): Igor Regis da Silva Simoes, Elaine Venson
Code readability is one of the main aspects of code quality, influenced by various properties like identifier names, comments, code structure, and adherence to standards. However, measuring this attribute poses challenges in both industry and academia. While static analysis tools assess attributes such as code smells and comment percentage, code reviews introduce an element of subjectivity. This paper explores using Large Language Models (LLMs) to evaluate code quality attributes related to readability in a standardized, reproducible, and consistent manner. We conducted a quasi-experimental study to measure the effects of code changes on the LLMs' interpretation of the readability quality attribute. Nine LLMs were tested, undergoing three interventions: removing comments, replacing identifier names with obscure names, and refactoring to remove code smells. Each intervention involved 10 batch analyses per LLM, collecting data on response variability. We compared the results with a known reference model and tool. The results showed that all LLMs were sensitive to the interventions, with agreement with the reference classifier being high for the original and refactored code scenarios. The LLMs demonstrated a strong semantic sensitivity that the reference model did not fully capture. A thematic analysis of the LLMs' reasoning confirmed that their evaluations directly reflected the nature of each intervention. The models also exhibited response variability, with 9.37% to 14.58% of executions showing a standard deviation greater than zero, indicating response oscillation, though this did not always compromise the statistical significance of the results. LLMs demonstrated potential for evaluating semantic quality aspects, such as coherence between identifier names, comments, and documentation with code purpose.
代码可读性是代码质量的主要方面之一,受到标识名称、评论、代码结构和遵守标准等各种属性的影响。然而,测量这一属性在行业和学术界都构成挑战。静态分析工具评估代码气味和评论百分比等属性的同时,代码审查引入了主观性要素。本文探索了使用大语言模型(LLMs)来评估与其标准化、可复制和一致的可读性有关的代码质量属性。我们进行了一次准实验,以衡量代码变化对大语言模型(LLLM)解释的可读性属性的影响。测试了9个LMs,进行了三项干预:删除评论,用模糊名称取代标识名称,并重新设置代码气味。每次干预都涉及10批分析,收集反应变异性数据。我们将结果与已知的参考模型和工具进行了比较。结果表明,所有LMSMs都对干预措施敏感,对原始和重新校正代码情景的参考分级值很高。LMs展示了一种强烈的敏感度评估,而不是精确性评估,文献质量模型的准确性也充分展示了这些参考模型。
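The variability measure quoted above (the share of executions whose repeated scores have a standard deviation greater than zero) can be computed as in the sketch below. The file names and scores are invented, and the grouping of repeated runs per analysed item is an assumption about how the batches were organised.

```python
from statistics import pstdev

def share_with_variation(runs):
    """Fraction of items whose repeated scores are not all identical (std dev > 0)."""
    varying = sum(1 for scores in runs.values() if pstdev(scores) > 0)
    return varying / len(runs)

# Hypothetical readability scores from three repeated runs per file.
RUNS = {
    "A.java": [4, 4, 4],
    "B.java": [4, 3, 4],
    "C.java": [2, 2, 2],
    "D.java": [5, 5, 5],
}
oscillating = share_with_variation(RUNS)
```

Here only one of four items changes between runs, so a quarter of the items show response oscillation.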
Article 17
Title@2025-07-09 (3): HLSTester: Efficient Testing of Behavioral Discrepancies with LLMs for High-Level Synthesis
Title: HLSTester: Efficient Testing of Behavioral Discrepancies with LLMs for High-Level Synthesis | HLSTester: Effiziente Prüfung von Verhaltensdiskrepanzen mit LLMs für High-Level-Synthese | HLS Tester: 与高级别合成项目LLM有效测试行为差异 2504.14641v2 |
Authors (4): Kangwei Xu, Bing Li, Grace Li Zhang, Ulf Schlichtmann
In high-level synthesis (HLS), C/C++ programs with synthesis directives are used to generate circuits for FPGA implementations. However, hardware-specific and platform-dependent characteristics in these implementations can introduce behavioral discrepancies between the original C/C++ programs and the circuits after high-level synthesis. Existing methods for testing behavioral discrepancies in HLS are still immature, and the testing workflow requires significant human effort. To address this challenge, we propose HLSTester, a large language model (LLM)-aided testing framework that efficiently detects behavioral discrepancies in HLS. To mitigate hallucinations in LLMs and enhance prompt quality, the testbenches for original C/C++ programs are leveraged to guide LLMs in generating HLS-compatible testbenches, effectively eliminating certain traditional C/C++ constructs that are incompatible with HLS tools. Key variables are pinpointed through a backward slicing technique in both C/C++ and HLS programs to monitor their runtime spectra, enabling an in-depth analysis of the discrepancy symptoms. To reduce test time, a testing input generation mechanism is introduced to integrate dynamic mutation with insights from an LLM-based progressive reasoning chain. In addition, repetitive hardware testing is skipped by a redundancy-aware filtering technique for the generated test inputs. Experimental results demonstrate that the proposed LLM-aided testing framework significantly accelerates the testing workflow while achieving higher testbench simulation pass rates compared with the traditional method and the direct use of LLMs on the same HLS programs.
在高层次综合(HLS)中,带有综合指令的C/C++程序被用于生成面向FPGA实现的电路。然而,这些实现中与硬件相关、依赖平台的特性可能导致原始C/C++程序与高层次综合后的电路之间出现行为差异。现有的HLS行为差异测试方法仍不成熟,测试流程需要大量人力。为应对这一挑战,我们提出HLSTester,一个由大型语言模型(LLM)辅助的测试框架,可高效检测HLS中的行为差异。为减少LLM的幻觉并提高提示质量,我们利用原始C/C++程序的测试平台来引导LLM生成与HLS兼容的测试平台,从而有效消除某些与HLS工具不兼容的传统C/C++结构。通过后向切片技术在C/C++和HLS程序中定位关键变量,以监测其运行时频谱,从而深入分析差异症状。为缩短测试时间,我们引入一种测试输入生成机制,将动态变异与基于LLM的渐进式推理链所得的见解相结合。此外,针对生成的测试输入,采用冗余感知过滤技术跳过重复的硬件测试。实验结果表明,与传统方法和直接使用LLM相比,所提出的LLM辅助测试框架在相同的HLS程序上显著加快了测试流程,同时取得了更高的测试平台仿真通过率。
Article 18
Title@2025-07-09 (3): How to Elicit Explainability Requirements? A Comparison of Interviews, Focus Groups, and Surveys
Title: How to Elicit Explainability Requirements? A Comparison of Interviews, Focus Groups, and Surveys | Wie zu Elicit Erklärbarkeit Anforderungen? Ein Vergleich von Interviews, Fokusgruppen und Umfragen | 如何制定明确的解释要求?访谈、焦点小组和调查的比较 2505.23684v3 |
Authors (7): Martin Obaidi, Jakob Droste, Hannah Deters, Marc Herrmann, Raymond Ochsner, Jil Klünder, Kurt Schneider
As software systems grow increasingly complex, explainability has become a crucial non-functional requirement for transparency, user trust, and regulatory compliance. Eliciting explainability requirements is challenging, as different methods capture varying levels of detail and structure. This study examines the efficiency and effectiveness of three commonly used elicitation methods - focus groups, interviews, and online surveys - while also assessing the role of taxonomy usage in structuring and improving the elicitation process. We conducted a case study at a large German IT consulting company, utilizing a web-based personnel management software. A total of two focus groups, 18 interviews, and an online survey with 188 participants were analyzed. The results show that interviews were the most efficient, capturing the highest number of distinct needs per participant per time spent. Surveys collected the most explanation needs overall but had high redundancy. Delayed taxonomy introduction resulted in a greater number and diversity of needs, suggesting that a two-phase approach is beneficial. Based on our findings, we recommend a hybrid approach combining surveys and interviews to balance efficiency and coverage. Future research should explore how automation can support elicitation and how taxonomies can be better integrated into different methods.
由于软件系统日益复杂,解释性已成为对透明度、用户信任和监管合规的关键非功能性要求,解释性要求具有挑战性,因为不同方法可以捕捉不同程度的细节和结构。本研究报告审查了三种常用的引人方法(焦点小组、访谈和在线调查)的效率和效力,同时还评估了分类学使用在结构和改进引人进程中的作用。我们利用网上人事管理软件,在一家大型德国信息技术咨询公司进行了案例研究。共分析了两个焦点小组,即18次访谈和188名参与者的在线调查。结果显示,访谈效率最高,每个参与者每次花费的时间都有最多的不同需求。调查收集了大多数解释需求,但有大量的冗余度。推迟采用分类学带来了更多和更多的需求,表明分两个阶段的方法是有益的。我们建议采用混合方法,将调查和访谈结合起来,以平衡效率和覆盖面。未来研究应探索自动化如何支持征求,如何更好地将纳税人纳入不同方法。
Article 19
Title@2025-07-09 (3): 5C Prompt Contracts: A Minimalist, Creative-Friendly, Token-Efficient Design Framework for Individual and SME LLM Usage
Title: 5C Prompt Contracts: A Minimalist, Creative-Friendly, Token-Efficient Design Framework for Individual and SME LLM Usage | 5C Prompt Contracts: Minimalistisch, kreativ-freundlich, Token-Efficient Design Framework für individuelle und KMU LLM-Nutzung | 5C 即时合同:个人和中小企业LLM使用最起码的、有利于创意的、高效益的个人和中小企业LLM设计框架 2507.07045v1 |
Authors (1): Ugur Ari
The progression from traditional prompt engineering to a more rigorous discipline of prompt design marks a pivotal shift in human-LLM interaction. As Large Language Models (LLMs) become increasingly embedded in mission-critical applications, there emerges a pressing need for frameworks that are not only explicit and systematic but also minimal enough to remain practical and broadly accessible. While many existing approaches address prompt structuring through elaborate Domain-Specific Languages (DSLs) or multi-layered templates, such methods can impose significant token and cognitive overhead, potentially constraining the model’s creative capacity. In this context, we propose the 5C Prompt Contract, a framework that distills prompt design into five intuitive components: Character, Cause, Constraint, Contingency, and Calibration. This minimal cognitive schema explicitly integrates fallback and output optimization directives, fostering reliable, interpretable, and creatively flexible AI interactions. Experimental results demonstrate that the 5C framework consistently achieves superior input token efficiency while maintaining rich and consistent outputs across diverse LLM architectures (OpenAI, Anthropic, DeepSeek, and Gemini), making it particularly suited for individuals and Small-to-Medium Enterprises (SMEs) with limited AI engineering resources.
由于大型语言模型(LLM)日益嵌入到任务关键应用中,因此出现了对框架的迫切需要,这些框架不仅明确和系统,而且最起码,足以保持实用和广泛使用,虽然许多现有方法解决了通过精心开发的特定域语言(DSL)或多层模板迅速构建结构的问题,但这类方法可以强加重要的象征性和认知性间接费用,并有可能限制模型的创造力。在这方面,我们提议了5C快速合同,这是一个将快速设计蒸馏成五个直观组成部分的框架:性格、起因、约束、应急和校准。这种最低限度的认知系统明确整合了倒置和产出优化指令,促进可靠、可解释和创造性灵活的AI互动。实验结果表明,5C框架始终可以实现高级投入象征性效率,同时保持不同LLM结构(OSPAI、Anthrotic、DeepSeek和Gemini)的丰富和一致的产出,使其特别适合个人和中小企业(AI-M)的有限资源。
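The five components named in the abstract can be pictured as a small structured template. The following is a minimal sketch only: the class name `PromptContract`, the `render` method, and the one-line gloss given for each field are illustrative assumptions, not the paper's specification.

```python
from dataclasses import dataclass


@dataclass
class PromptContract:
    """Hypothetical container for the five 5C components.

    The field glosses below are assumptions made for illustration.
    """
    character: str    # assumed: the role the model should take
    cause: str        # assumed: the goal or reason for the task
    constraint: str   # assumed: limits on scope, length, or format
    contingency: str  # assumed: fallback directive when the task cannot be met
    calibration: str  # assumed: output optimization directive

    def render(self) -> str:
        """Join the five components into one plain-text prompt."""
        fields = [
            ("Character", self.character),
            ("Cause", self.cause),
            ("Constraint", self.constraint),
            ("Contingency", self.contingency),
            ("Calibration", self.calibration),
        ]
        return "\n".join(f"{name}: {value}" for name, value in fields)


contract = PromptContract(
    character="Release-notes editor",
    cause="Summarize this week's merged changes for users",
    constraint="At most five bullet points, no internal ticket IDs",
    contingency="If no changes are listed, reply 'No user-facing changes'",
    calibration="Prefer plain language over jargon",
)
print(contract.render())
```

Keeping each component to a single short line is one way to stay close to the token-efficiency goal the abstract describes.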
Article 20
Title@2025-07-09 (3): Exploring Fairness Interventions in Open Source Projects
Title: Exploring Fairness Interventions in Open Source Projects | Erforschen von Fairness-Interventionen in Open Source-Projekten | 探索开放源码项目中的公平干预 2507.07026v1 |
Authors (4): Sadia Afrin Mim, Fatema Tuz Zohra, Justin Smith, Brittany Johnson
The deployment of biased machine learning (ML) models has resulted in adverse effects in crucial sectors such as criminal justice and healthcare. To address these challenges, a diverse range of machine learning fairness interventions have been developed, aiming to mitigate bias and promote the creation of more equitable models. Despite the growing availability of these interventions, their adoption in real-world applications remains limited, with many practitioners unaware of their existence. To address this gap, we systematically identified and compiled a dataset of 62 open source fairness interventions and identified active ones. We conducted an in-depth analysis of their specifications and features to uncover considerations that may drive practitioner preference and to identify the software interventions actively maintained in the open source ecosystem. Our findings indicate that 32% of these interventions have been actively maintained within the past year, and 50% of them offer both bias detection and mitigation capabilities, mostly during inprocessing.
为应对这些挑战,制定了各种各样的机器学习公平干预措施,旨在减少偏见,促进建立更公平的模式。尽管这些干预措施越来越多,但在现实世界应用中采用这些干预措施仍然有限,许多从业者不知道这些干预措施的存在。为弥补这一差距,我们系统地查明并汇编了62个公开来源公平干预措施和已查明积极干预措施的数据集。我们深入分析了这些干预措施的规格和特征,以发现可能推动从业者偏好的因素,并确定开放生态系统中积极保持的软件干预措施。我们的调查结果显示,这些干预措施中有32%在过去一年中得到了积极维护,其中50%提供了发现和缓解偏见的能力,大部分是在处理过程中。
Article 21
Title@2025-07-09 (3): Robust Containerization of the High Angular Resolution Functional Imaging (HARFI) Pipeline
Title: Robust Containerization of the High Angular Resolution Functional Imaging (HARFI) Pipeline | Robuste Containerisierung der High Angular Resolution Functional Imaging (HARFI) Pipeline | 高角分辨率功能成像(HARFI)管道的强有力集装箱化 2507.07010v1 |
Authors (3): Zhiyuan Li, Kurt G. Schilling, Bennett A. Landman
Historically, functional magnetic resonance imaging (fMRI) of the brain has focused primarily on gray matter, particularly the cortical gray matter and associated nuclei. However, recent work has demonstrated that functional activity in white matter also plays a meaningful role in both cognition and learning. In previous work, we introduced the High Angular Resolution Functional Imaging (HARFI) pipeline, which demonstrated both local and global patterns of functional correlation in white matter. Notably, HARFI enabled exploration of asymmetric voxel-wise correlation using odd-order spherical harmonics. Although the original implementation of HARFI was released via GitHub, adoption was limited due to the technical complexity of running the source code. In this work, we present a robust and efficient containerized version of the HARFI pipeline, enabling seamless execution across multiple public datasets. Our goal is to facilitate broader and deeper exploration of functional white matter architecture, especially through the lens of high angular resolution functional correlations. The key innovation of this work is the containerized implementation, which we have made available under a permissive open-source license to support reproducible and accessible research practices.
从历史上看,大脑的功能磁共振成像(fMRI)主要侧重于灰物质,特别是皮层灰物质和相关核心。然而,最近的工作表明,白物质的功能性活动在认知和学习两方面都发挥了有意义的作用。在以往的工作中,我们引入了高角分辨率功能成像(HARFI)管道,该管道既展示了白色物质功能相关性的地方和全球模式。值得注意的是,HARFI得以利用奇类球形口音来探索非对称的oxel关系。尽管HARFI的最初实施是通过GitHub发布的,但由于源码运行的技术复杂性,其采用受到限制。在这项工作中,我们展示了“HARFI”管道的强大而高效的集装箱化版本,使多个公共数据集能够无缝地执行。我们的目标是便利对功能性白物质结构进行更广泛和更深入的探索,特别是通过高角分辨率功能性连接的镜头。这项工作的关键创新是集装箱化实施,我们根据可允许的开放源码许可证提供了这种应用,以支持可复制和可获取的研究实践。
Article 22
Title@2025-07-09 (3): Enhancing Quantum Software Development Process with Experiment Tracking
Title: Enhancing Quantum Software Development Process with Experiment Tracking | Verbesserung des Quantum-Software-Entwicklungsprozesses mit Experiment-Tracking | 利用实验跟踪加强量子软件开发进程 2507.06990v1 |
Authors (4): Mahee Gamage, Otso Kinanen, Jake Muff, Vlad Stirbu
As quantum computing advances from theoretical promise to experimental reality, the need for rigorous experiment tracking becomes critical. Drawing inspiration from best practices in machine learning (ML) and artificial intelligence (AI), we argue that reproducibility, scalability, and collaboration in quantum research can benefit significantly from structured tracking workflows. This paper explores the application of MLflow in quantum research, illustrating how it enables better development practices, experiment reproducibility, decision making, and cross-domain integration in an increasingly hybrid classical-quantum landscape.
作为从理论承诺到实验现实的量子计算进步,严格实验跟踪的必要性变得至关重要。 我们从机器学习(ML)和人工智能(AI)方面的最佳做法中汲取灵感,我们认为,量子研究的可复制性、可扩展性和协作性可以从结构化的跟踪工作流程中大大受益。 本文探讨了量子研究中ML流的应用,说明了它如何在日益混合的古典-量地貌中促进更好的发展实践、实验可复制性、决策以及跨领域整合。
Article 23
Title@2025-07-09 (3): Are They All Good? Evaluating the Quality of CoTs in LLM-based Code Generation
Title: Are They All Good? Evaluating the Quality of CoTs in LLM-based Code Generation | Sind sie alle gut? Bewertung der Qualität von CoTs bei der LLM-basierten Code-Generierung | 它们都好吗?评估基于LLM的代码生成中CoT的质量 2507.06980v1 |
Authors (7): Binquan Zhang, Li Zhang, Zhiwen Luo, Yuxin Du, Fang Liu, Song Wang, Lin Shi
Large language models (LLMs) have demonstrated impressive performance in code generation, particularly when augmented with chain-of-thought (CoT) prompting techniques. They break down requirements into intermediate reasoning steps, which act as design rationales to guide LLMs in writing code like human programmers. Thus, the quality of these steps is crucial for ensuring the correctness and reliability of the generated code. However, little is known about the quality of CoT generated by LLMs. To what extent can we trust the thoughts generated by LLMs? How good are they? This paper empirically explores the external and internal factors that cause LLMs to generate unsatisfactory CoTs by analyzing 1,023 failed code samples on two widely used code generation benchmarks. We also evaluate their impact on code generation performance by analyzing 210 CoT-code pairs and refining the unsatisfactory CoTs by prompting LLMs. Our study reveals three key findings: (1) External factors (53.60%), such as unclear requirements and lack of context, mainly affect CoT quality, while internal factors (40.10%) stem from LLMs misunderstanding prompts. (2) Even when CoTs are correct, 18.5% of the generated code contains errors due to instruction-following issues; conversely, 11.90% of correct code is paired with flawed CoTs. (3) Refining low-quality CoTs is feasible, i.e., LLMs improve when given detailed problem descriptions. These findings highlight key challenges in CoT-based code generation and suggest directions for improving LLM reasoning and reliability.
大型语言模型(LLM)在代码生成方面表现出色,尤其是在结合思维链(CoT)提示技术时。CoT将需求分解为中间推理步骤,作为设计依据,引导LLM像人类程序员一样编写代码。因此,这些步骤的质量对于确保生成代码的正确性和可靠性至关重要。然而,人们对LLM生成的CoT的质量知之甚少。我们能在多大程度上信任LLM生成的思路?它们有多好?本文通过分析两个广泛使用的代码生成基准上的1,023个失败代码样本,实证探究LLM生成不合格CoT的外部和内部原因。我们还通过分析210个CoT-代码对,并通过提示LLM改进不合格的CoT,评估其对代码生成性能的影响。我们的研究得出三项主要发现:(1)外部因素(53.60%),如需求不明确和缺乏上下文,是影响CoT质量的主要原因,而内部因素(40.10%)源于LLM对提示的误解。(2)即使CoT正确,仍有18.5%的生成代码因指令遵循问题而包含错误;反之,11.90%的正确代码对应着有缺陷的CoT。(3)改进低质量CoT是可行的,即在给出详细问题描述时,LLM的表现会有所提升。这些发现凸显了基于CoT的代码生成面临的关键挑战,并为提升LLM的推理能力和可靠性指明了方向。
Article 24
Title@2025-07-09 (3): Robust and Safe Traffic Sign Recognition using N-version with Weighted Voting
Title: Robust and Safe Traffic Sign Recognition using N-version with Weighted Voting | Robuste und sichere Verkehrszeichenerkennung mit N-Version mit gewichteter Abstimmung | 以加权投票方式使用N版本进行强力和安全交通信号识别 2507.06907v1 |
Authors (3): Linyun Gao, Qiang Wen, Fumio Machida
Autonomous driving is rapidly advancing as a key application of machine learning, yet ensuring the safety of these systems remains a critical challenge. Traffic sign recognition, an essential component of autonomous vehicles, is particularly vulnerable to adversarial attacks that can compromise driving safety. In this paper, we propose an N-version machine learning (NVML) framework that integrates a safety-aware weighted soft voting mechanism. Our approach utilizes Failure Mode and Effects Analysis (FMEA) to assess potential safety risks and assign dynamic, safety-aware weights to the ensemble outputs. We evaluate the robustness of three-version NVML systems employing various voting mechanisms against adversarial samples generated using the Fast Gradient Sign Method (FGSM) and Projected Gradient Descent (PGD) attacks. Experimental results demonstrate that our NVML approach significantly enhances the robustness and safety of traffic sign recognition systems under adversarial conditions.
自动驾驶作为机械学习的关键应用正在迅速推进,但确保这些系统的安全仍然是一个重大挑战。交通标志识别是自治车辆的一个基本组成部分,特别容易受到可能损害驾驶安全的对抗性攻击。在本文件中,我们提议了一个Nversion机器学习框架,将安全意识加权软投票机制纳入其中。我们的方法利用故障模式和效果分析来评估潜在的安全风险,并为联合产出分配动态的、安全觉悟的权重。我们评估了使用各种投票机制对抗使用快速渐进信号法和预测的梯子(PGD)攻击产生的对抗性样品的三版本NVML系统是否稳健。实验结果表明,我们的NVML方法在对抗条件下大大加强了交通标志识别系统的稳健性和安全性。
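Weighted soft voting over the outputs of several model versions can be sketched as below. This is an illustrative sketch under stated assumptions: the function name `weighted_soft_vote`, the two-class example, and the weight values are hypothetical, and the weights here are fixed placeholders rather than the dynamic, FMEA-derived weights the abstract describes.

```python
def weighted_soft_vote(probabilities, weights):
    """Combine per-model class probabilities with per-model weights.

    probabilities: one list of class probabilities per model version.
    weights: one non-negative weight per model version (placeholder values).
    Returns the winning class index and the combined distribution.
    """
    if len(probabilities) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    num_classes = len(probabilities[0])
    combined = [
        sum(w * p[c] for w, p in zip(weights, probabilities)) / total
        for c in range(num_classes)
    ]
    winner = max(range(num_classes), key=combined.__getitem__)
    return winner, combined


# Three model versions, two classes (index 0 and index 1).
outputs = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]
print(weighted_soft_vote(outputs, [1, 1, 1])[0])  # equal weights
print(weighted_soft_vote(outputs, [1, 2, 2])[0])  # versions 2 and 3 trusted more
```

With equal weights the first version's confident prediction dominates; raising the weights of the other two versions flips the decision, which is the lever a safety-aware weighting scheme would adjust.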
Article 25
Title@2025-07-09 (3): Formalization of the AADL Run-Time Services with Time
Title: Formalization of the AADL Run-Time Services with Time | Formalisierung der AADL-Laufzeitdienste mit Zeit | 带时间的AADL运行时服务的形式化 2507.06881v1 |
Authors (2): Brian R Larson, Ehsan Ahmad
The Architecture Analysis & Design Language (AADL) is an architecture description language for the design of cyber-physical systems: machines controlled by software. The AADL standard, SAE International AS5506D, describes Run-Time Services (RTS) to be provided to execute AADL models in accordance with semantics defined by the standard. The RTS of primary concern are transport services and timing services. Although the study presented in [1] lays a foundation for the formal semantics of AADL, it does not model time. This paper extends and simplifies that formalization, using a modal logic defined by a Kripke structure, to explicitly include time. The RTS defined in the AADL standard are also expanded to support reactive state-transition machines of the Behavior Specification annex standard language (BA) and its closely-related, formally-defined counterpart, the Behavior Language for Embedded Systems with Software (BLESS). An example of AADL RTS with time, implemented by the High Assurance Modeling and Rapid Engineering for Embedded Systems (HAMR) for state-transition machine behavior written in BLESS, is also presented.
建筑分析和设计语言(AADL)是设计由软件控制的网络物理系统机器的建筑描述语言。AAADL标准SAE International AS5506D,描述运行时间服务(RTS),根据标准定义的语义执行AADL模型。主要关注的RTS是运输服务和定时服务。虽然[1]中的研究报告为AADL的正式语义打下了基础,但没有建模时间。本文使用Kripke结构定义的模式逻辑扩展和简化了这一正规化,以明确包括时间。AAAADL标准中定义的RTS也扩展了,以支持BADL标准语言及其密切相关的、正式定义的对应方,即软件嵌入系统(MYSS)的“BADL RTS”。由嵌入式系统高级保证模型和快速工程系统(HAMRTS)实施的AADL RTS有时间的范例,以明确包括时间。ADMRMR(H)中也以BLSS为国家过渡机器行为的书面表述。
Article 26
Title@2025-07-09 (3): Leveraging LLMs for Semantic Conflict Detection via Unit Test Generation
Title: Leveraging LLMs for Semantic Conflict Detection via Unit Test Generation | Leveraging LLMs für semantische Konflikterkennung über Unit Test Generation | 通过单元测试生成利用LLM进行语义冲突检测 2507.06762v1 |
Authors (3): Nathalia Barbosa, Paulo Borba, Léuson Da Silva
Semantic conflicts arise when a developer introduces changes to a codebase that unintentionally affect the behavior of changes integrated in parallel by other developers. Traditional merge tools are unable to detect such conflicts, so complementary tools like SMAT have been proposed. SMAT relies on generating and executing unit tests: if a test fails on the base version, passes on a developer’s modified version, but fails again after merging with another developer’s changes, a semantic conflict is indicated. While SMAT is effective at detecting conflicts, it suffers from a high rate of false negatives, partly due to the limitations of unit test generation tools such as Randoop and Evosuite. To investigate whether large language models (LLMs) can overcome these limitations, we propose and integrate a new test generation tool based on Code Llama 70B into SMAT. We explore the model’s ability to generate tests using different interaction strategies, prompt contents, and parameter configurations. Our evaluation uses two samples: a benchmark with simpler systems from related work, and a more significant sample based on complex, real-world systems. We assess the effectiveness of the new SMAT extension in detecting conflicts. Results indicate that, although LLM-based test generation remains challenging and computationally expensive in complex scenarios, there is promising potential for improving semantic conflict detection.
Conflitos semânticos surgem quando um desenvolvedor introduz mudanças em uma base de código que afetam, de forma não intencional, o comportamento de alterações integradas em paralelo por outros desenvolvedores. Ferramentas tradicionais de merge não conseguem detectar esse tipo de conflito, por isso ferramentas complementares como o SMAT foram propostas. O SMAT depende da geração e execução de testes de unidade: se um teste falha na versão base, passa na versão modificada por um desenvolvedor, mas volta a falhar após o merge com as mudanças de outro desenvolvedor, um conflito semântico é identificado. Embora o SMAT seja eficaz na detecção de conflitos, apresenta alta taxa de falsos negativos, em parte devido às limitações das ferramentas de geração de testes como Randoop e Evosuite. Para investigar se modelos de linguagem de grande porte (LLMs) podem superar essas limitações, propomos e integramos ao SMAT uma nova ferramenta de geração de testes baseada no Code Llama 70B. Exploramos a capacidade do modelo de gerar testes utilizando diferentes estratégias de interação, conteúdos de prompts e configurações de parâmetros. Nossa avaliação utiliza duas amostras: um benchmark com sistemas mais simples, usados em trabalhos relacionados, e uma amostra mais significativa baseada em sistemas complexos e reais. Avaliamos a eficácia da nova extensão do SMAT na detecção de conflitos. Os resultados indicam que, embora a geração de testes por LLM em cenários complexos ainda seja desafiadora e custosa computacionalmente, há potencial promissor para aprimorar a detecção de conflitos semânticos.
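The test-based criterion stated in the abstract (fails on base, passes on one developer's version, fails again on the merge) can be written as a single predicate. This is a minimal sketch: the function name and the boolean inputs are hypothetical, and it assumes the three test outcomes have already been collected by running the same generated test on each version.

```python
def indicates_semantic_conflict(passes_on_base: bool,
                                passes_on_modified: bool,
                                passes_on_merge: bool) -> bool:
    """Return True when one test's outcomes match the conflict pattern.

    Pattern from the abstract: the test fails on the base version, passes
    on a developer's modified version, and fails again on the merged version.
    """
    return (not passes_on_base) and passes_on_modified and (not passes_on_merge)


# The test captures behavior added by one developer and lost after merging.
print(indicates_semantic_conflict(False, True, False))
# The merged version preserves that behavior, so nothing is flagged.
print(indicates_semantic_conflict(False, True, True))
```

Because the criterion depends entirely on which tests exist, a generator that never produces a test exercising the affected behavior yields a false negative, which is the limitation the abstract attributes to Randoop and Evosuite.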
Article 27
Title@2025-07-09 (3): One Size Does Not Fit All: Investigating Efficacy of Perplexity in Detecting LLM-Generated Code
Title: One Size Does Not Fit All: Investigating Efficacy of Perplexity in Detecting LLM-Generated Code | Eine Größe passt nicht zu allen: Untersuchung der Wirksamkeit von Verwirrung bei der Erkennung von LLM-generierten Code | 一大小不完全适用:调查在检测 LLM 生成的守则中发现重复性的效果 2412.16525v2 |
Authors (11): Jinwei Xu, He Zhang, Yanjing Yang, Lanxin Yang, Zeru Cheng, Jun Lyu, Bohan Liu, Xin Zhou, Alberto Bacchelli, Yin Kia Chiam, Thiam Kian Chiew
Large language model-generated code (LLMgCode) has become increasingly common in software development. So far LLMgCode has more quality issues than human-authored code (HaCode). It is common for LLMgCode to mix with HaCode in a code change, while the change is signed by only human developers, without being carefully examined. Many automated methods have been proposed to detect LLMgCode from HaCode, in which the perplexity-based method (PERPLEXITY for short) is the state-of-the-art method. However, the efficacy evaluation of PERPLEXITY has focused on detection accuracy. Yet it is unclear whether PERPLEXITY is good enough in a wider range of realistic evaluation settings. To this end, we carry out a family of experiments to compare PERPLEXITY against feature- and pre-training-based methods from three perspectives: detection accuracy, detection speed, and generalization capability. The experimental results show that PERPLEXITY has the best generalization capability while having limited detection accuracy and detection speed. Based on that, we discuss the strengths and limitations of PERPLEXITY, e.g., PERPLEXITY is unsuitable for high-level programming languages. Finally, we provide recommendations to improve PERPLEXITY and apply it in practice. As the first large-scale investigation on detecting LLMgCode from HaCode, this article provides a wide range of findings for future improvement.
在软件开发中,大型语言模型生成代码(LLMgCode)越来越常见。迄今为止,LLMgCode比人造代码(HaCode)有更高质量的问题。LLMgCode通常会将代码变化与HaCode混为一体,而改变则仅由人类开发者签署,未经仔细审查。许多自动化方法都提议从HaCode中检测LLLMGCode,其中基于易懂性的方法(短期的Pperplexity)是最新的方法。然而,对PLOPIEX的功效评价侧重于检测准确性。然而,在更广泛的现实评估环境中,还不清楚PLMCode是否足够好。为此,我们从三个角度进行了一系列实验,将性能与基于特性和训练前的方法相比较:检测准确性、检测速度和一般化能力。实验结果显示,基于易懂性的方法具有最佳的普及性能力,而检测性精确性和检测速度则有限。基于这一点,我们讨论POPLEX的长性和局限性的强性和局限性。
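Perplexity itself has a standard definition: the exponential of the negative mean token log-probability under a language model. The sketch below computes it from precomputed log-probabilities. The helper `looks_generated` and its threshold are assumptions for illustration (a low-perplexity-means-generated rule), not the specific detector evaluated in the paper, and the log-probabilities would come from whichever scoring model is chosen.

```python
import math


def perplexity(token_log_probs):
    """Perplexity = exp(-mean(log p(token))) over a token sequence."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def looks_generated(token_log_probs, threshold):
    """Hypothetical rule: flag code the scoring model finds very predictable."""
    return perplexity(token_log_probs) < threshold


# Four tokens, each assigned probability 0.5 by the scoring model.
sample = [math.log(0.5)] * 4
print(perplexity(sample))
print(looks_generated(sample, threshold=3.0))
```

A single fixed threshold is the simplest possible decision rule; the abstract's finding that accuracy is limited suggests that any real threshold would need tuning per setting.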
Article 28
Title@2025-07-09 (3): Issue Tracking Ecosystems: Context and Best Practices
Title: Issue Tracking Ecosystems: Context and Best Practices | Thema Tracking Ökosysteme: Kontext und Best Practices | 生态系统跟踪问题:背景和最佳做法 2507.06704v1 |
Authors (1): Lloyd Montgomery
Issue Tracking Systems (ITSs), such as GitHub and Jira, are popular tools that support Software Engineering (SE) organisations through the management of "issues", which represent different SE artefacts such as requirements, development tasks, and maintenance items. ITSs also support internal linking between issues, and external linking to other tools and information sources. This provides SE organisations key forms of documentation, including forwards and backwards traceability (e.g., Feature Requests linked to sprint releases and code commits linked to Bug Reports). An Issue Tracking Ecosystem (ITE) is the aggregate of the central ITS and the related SE artefacts, stakeholders, and processes, with an emphasis on how these contextual factors interact with the ITS. The quality of ITEs is central to the success of these organisations and their software products. There are challenges, however, within ITEs, including complex networks of interlinked artefacts and diverse workflows. While ITSs have been the subject of study in SE research for decades, ITEs as a whole need further exploration. In this thesis, I undertake the challenge of understanding ITEs at a broader level, addressing these questions regarding complexity and diversity. I interviewed practitioners and performed archival analysis on a diverse set of ITSs. These analyses revealed the context-dependent nature of ITE problems, highlighting the need for context-specific ITE research. While previous work has produced many solutions to specific ITS problems, these solutions are not consistently framed in a context-rich and comparable way, leading to a desire for more aligned solutions across research and practice. To address this emergent information and lack of alignment, I created the Best Practice Ontology for ITEs. <… truncated due to arXiv abstract character limit …>
问题跟踪系统(ITS),如GitHub和Jira等,是支持软件工程组织的流行工具,通过管理“问题”来支持软件工程组织(SE),它代表着不同的SE工艺品,例如要求、开发任务和维护项目。ITS还支持问题之间的内部联系以及与其他工具和信息来源的外部联系。它为SE组织提供了关键的文献形式,包括前向和后向追踪(例如,与印本发布和代码连接的特征请求,与《错误报告》相联系)。 问题跟踪生态系统(ITE)是中央ITS和相关的SE工艺品、利益攸关方和进程的统领,重点是这些背景因素如何与ITS互动。 ITE的质量是这些组织及其软件产品成功的关键。然而,在ITE内部,包括以前相互关联的工艺品和不同工作流程的复杂网络。尽管几十年来一直在STE研究中研究这些问题,但需要深入探讨。在这个理论中,我对ITE的内在背景上缺乏理解的挑战,解决这些复杂性和多样性的这些背景问题。我对ITE的深度和多样性进行了长期的研究。我对IST的深度分析进行了深入分析。
Article 29
Title@2025-07-09 (3): Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks
Title: Assessing Small Language Models for Code Generation: An Empirical Study with Benchmarks | Bewertung kleiner Sprachmodelle für die Codegenerierung: Eine empirische Studie mit Benchmarks | 评估用于代码生成的小型语言模式:具有基准的实证研究 2507.03160v3 |
Authors (6): Md Mahade Hasan, Muhammad Waseem, Kai-Kristian Kemell, Jussi Rasku, Juha Ala-Rantala, Pekka Abrahamsson
The recent advancements of Small Language Models (SLMs) have opened new possibilities for efficient code generation. SLMs offer lightweight and cost-effective alternatives to Large Language Models (LLMs), making them attractive for use in resource-constrained environments. However, empirical understanding of SLMs, particularly their capabilities, limitations, and performance trade-offs in code generation remains limited. This study presents a comprehensive empirical evaluation of 20 open-source SLMs ranging from 0.4B to 10B parameters on five diverse code-related benchmarks (HumanEval, MBPP, Mercury, HumanEvalPack, and CodeXGLUE). The models are assessed along three dimensions: i) functional correctness of generated code, ii) computational efficiency and iii) performance across multiple programming languages. The findings of this study reveal that several compact SLMs achieve competitive results while maintaining a balance between performance and efficiency, making them viable for deployment in resource-constrained environments. However, achieving further improvements in accuracy requires switching to larger models. These models generally outperform their smaller counterparts, but they require much more computational power. We observe that for 10% performance improvements, models can require nearly a 4x increase in VRAM consumption, highlighting a trade-off between effectiveness and scalability. Besides, the multilingual performance analysis reveals that SLMs tend to perform better in languages such as Python, Java, and PHP, while exhibiting relatively weaker performance in Go, C++, and Ruby. However, statistical analysis suggests these differences are not significant, indicating a generalizability of SLMs across programming languages. Based on the findings, this work provides insights into the design and selection of SLMs for real-world code generation tasks.
最近,小型语言模型的进步为高效率的代码生成开辟了新的可能性;可持续土地管理为大型语言模型提供了轻量和具有成本效益的替代品,使其在资源紧张的环境中具有吸引力;然而,对可持续土地管理,特别是其能力、局限性和代码生成中绩效权衡的经验理解仍然有限;本研究报告对20种开放源的可持续土地管理,从0.4B到10B参数,在5种不同的代码相关基准(人类val、MBPP、Mercury、HumanEvalPack和codXGLUE)上,为高效的代码生成提供了新的可能性;这些模型从三个方面进行了评估:i) 生成的代码的功能正确性,使这些模型在资源紧张的环境中具有吸引力;i) 生成代码的功能正确性,使其具有吸引力;ii) 计算效率和三) 多种方案编制语言的绩效;然而,对可持续土地管理的经验了解仍然有限,同时保持业绩与效率之间的平衡,使其在资源紧张的环境中部署。 然而,要进一步提高准确性,这些模型通常比较小的对应方,但需要更多计算能力。
Article 30
Title@2025-07-09 (3): Finding Compiler Bugs through Cross-Language Code Generator and Differential Testing
Title: Finding Compiler Bugs through Cross-Language Code Generator and Differential Testing | Finden von Compiler-Fehlern durch Cross-Language-Code-Generator und Differential-Tests | 通过跨语言代码生成器和差异测试查找编译器错误 2507.06584v1 |
Authors (6): Qiong Feng, Xiaotian Ma, Ziyuan Feng, Marat Akhin, Wei Song, Peng Liang
Compilers play a central role in translating high-level code into executable programs, making their correctness essential for ensuring code safety and reliability. While extensive research has focused on verifying the correctness of compilers for single-language compilation, the correctness of cross-language compilation - which involves the interaction between two languages and their respective compilers - remains largely unexplored. To fill this research gap, we propose CrossLangFuzzer, a novel framework that introduces a universal intermediate representation (IR) for JVM-based languages and automatically generates cross-language test programs with diverse type parameters and complex inheritance structures. After generating the initial IR, CrossLangFuzzer applies three mutation techniques - LangShuffler, FunctionRemoval, and TypeChanger - to enhance program diversity. By evaluating both the original and mutated programs across multiple compiler versions, CrossLangFuzzer successfully uncovered 10 confirmed bugs in the Kotlin compiler, 4 confirmed bugs in the Groovy compiler, 7 confirmed bugs in the Scala 3 compiler, 2 confirmed bugs in the Scala 2 compiler, and 1 confirmed bug in the Java compiler. Among all mutators, TypeChanger is the most effective, detecting 11 of the 24 compiler bugs. Furthermore, we analyze the symptoms and root causes of cross-compilation bugs, examining the respective responsibilities of language compilers when incorrect behavior occurs during cross-language compilation. To the best of our knowledge, this is the first work specifically focused on identifying and diagnosing compiler bugs in cross-language compilation scenarios. Our research helps to understand these challenges and contributes to improving compiler correctness in multi-language environments.
编译器在将高级代码翻译为可执行程序的过程中起着核心作用,其正确性对于确保代码的安全性和可靠性至关重要。尽管已有大量研究致力于验证单语言编译中编译器的正确性,但跨语言编译(涉及两种语言及其各自编译器之间的交互)的正确性在很大程度上仍未得到探索。为填补这一研究空白,我们提出CrossLangFuzzer,这是一个新颖的框架,它为基于JVM的语言引入了通用中间表示(IR),并自动生成具有多样化类型参数和复杂继承结构的跨语言测试程序。在生成初始IR之后,CrossLangFuzzer应用三种变异技术(LangShuffler、FunctionRemoval和TypeChanger)来增强程序多样性。通过在多个编译器版本上评估原始程序和变异程序,CrossLangFuzzer成功发现了Kotlin编译器中的10个已确认缺陷、Groovy编译器中的4个已确认缺陷、Scala 3编译器中的7个已确认缺陷、Scala 2编译器中的2个已确认缺陷以及Java编译器中的1个已确认缺陷。在所有变异器中,TypeChanger最为有效,检测出24个编译器缺陷中的11个。此外,我们分析了跨语言编译缺陷的症状和根本原因,考察了跨语言编译过程中出现错误行为时各语言编译器各自的责任。据我们所知,这是第一项专门针对跨语言编译场景中编译器缺陷的识别与诊断的工作。我们的研究有助于理解这些挑战,并有助于提高多语言环境下编译器的正确性。
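Differential testing across compiler versions reduces to comparing the outcomes recorded for one test program and flagging the versions that disagree. The sketch below assumes a majority-vote oracle, which is an assumption of this example and is not stated in the abstract; the function name, the version labels, and the outcome strings are all hypothetical.

```python
from collections import Counter


def find_discrepancies(results):
    """Return the compiler versions whose outcome differs from the majority.

    results: mapping from a compiler version label to the outcome observed
    for one test program (for example a status string or an output hash).
    Assumes a clear majority exists; ties are not handled specially.
    """
    if not results:
        return []
    majority_outcome, _ = Counter(results.values()).most_common(1)[0]
    return sorted(
        version for version, outcome in results.items()
        if outcome != majority_outcome
    )


# Hypothetical outcomes of one generated cross-language program.
observed = {
    "compiler-a": "compiled",
    "compiler-b": "compiled",
    "compiler-c": "internal-error",
}
print(find_discrepancies(observed))
```

A flagged version is only a candidate: deciding which compiler is actually at fault still requires the kind of root-cause analysis the abstract describes.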
Article 31
Title@2025-07-09 (3): TELSAFE: Security Gap Quantitative Risk Assessment Framework
Title: TELSAFE: Security Gap Quantitative Risk Assessment Framework | TELSAFE: Sicherheitslücke Quantitative Risikobewertungsrahmen | TELSAFE: 安全差距定量风险评估框架 2507.06497v1 |
Authors (8): Sarah Ali Siddiqui, Chandra Thapa, Derui Wang, Rayne Holland, Wei Shao, Seyit Camtepe, Hajime Suzuki, Rajiv Shah
Gaps between established security standards and their practical implementation have the potential to introduce vulnerabilities, possibly exposing organizations to security risks. To effectively address and mitigate these security and compliance challenges, security risk management strategies are essential. However, they must adhere to well-established strategies and industry standards to ensure consistency, reliability, and compatibility both within and across organizations. In this paper, we introduce a new hybrid risk assessment framework called TELSAFE, which employs probabilistic modeling for quantitative risk assessment and eliminates the influence of expert opinion bias. The framework encompasses both qualitative and quantitative assessment phases, facilitating effective risk management strategies tailored to the unique requirements of organizations. A specific use case utilizing Common Vulnerabilities and Exposures (CVE)-related data demonstrates the framework’s applicability and implementation in real-world scenarios, such as in the telecommunications industry.
既定安保标准及其实际实施之间的差距有可能造成脆弱性,可能使其面临安全风险。为了有效应对和减轻这些安全和合规挑战,安全风险管理战略至关重要。然而,必须坚持既定战略和行业标准,以确保各组织内部和各组织之间的一致性、可靠性和兼容性。本文件介绍一个新的混合风险评估框架,称为TELSAFE,采用定量风险评估的概率模型,消除专家意见偏差的影响。框架包括定性和定量评估阶段,促进针对各组织独特要求的有效风险管理战略。一个利用共同脆弱性和暴露(CVE)相关数据的具体使用案例展示了该框架在现实世界情景中的适用性和实施性,例如在电信行业。
Article 32
Title@2025-07-09 (3): Evaluating Efficiency and Novelty of LLM-Generated Code for Graph Analysis
Title: Evaluating Efficiency and Novelty of LLM-Generated Code for Graph Analysis | Bewertung der Effizienz und Neuheit des LLM-generierten Codes für die Graphenanalyse | 评价LLLM创制的图表分析守则的效率和新颖性 2507.06463v1 |
Authors (3): Atieh Barati Nia, Mohammad Dindoost, David A. Bader
Large Language Models (LLMs) are increasingly used to automate software development, yet most prior evaluations focus on functional correctness or high-level languages such as Python. We present the first systematic study of LLMs’ ability to generate efficient C implementations of graph-analysis routines: code that must satisfy stringent runtime and memory constraints. Eight state-of-the-art models (OpenAI ChatGPT o3 and o4-mini-high, Anthropic Claude 4 Sonnet and Sonnet Extended, Google Gemini 2.5 Flash and Pro, xAI Grok 3-Think, and DeepSeek DeepThink R1) are benchmarked using two distinct approaches. The first approach checks the ability of LLMs to generate an algorithm that outperforms the other algorithms present in the benchmark. The second approach evaluates the ability of LLMs to generate graph algorithms for integration into the benchmark. Results show that Claude Sonnet 4 Extended achieves the best result in ready-to-use code generation and efficiency, outperforming human-written baselines in triangle counting. The study confirms that contemporary LLMs excel at optimizing and integrating established algorithms but not at inventing novel techniques. We provide prompts, the first approach’s generated code, and measurement scripts to foster reproducible research.
大型语言模型(LLMS)越来越多地用于软件开发自动化,但大多数前期评价侧重于功能正确性或高层次语言,如Python。我们首次系统地研究了LLMS是否有能力高效地实施图形分析常规代码的C,这种代码必须满足严格的运行时间和记忆限制。八种最先进的模型(OpenAI ChattGPT o3和o4-mini-high,Anthrotic Claude 4 Sonnet and Sonnet Explate, Google Gemini 2.5 Flash and Pro, xAI Grok 3-Think, 和 Deep Seek DeepThew Think R1),这是用两种不同的方法来衡量LMS是否有能力产生算法高于基准中目前其他算法的算法。第二种方法评估LLMS是否有能力生成图表算法,以便纳入基准。结果显示,Claude Sonnet 4 4在可即用代码的生成和效率方面取得了最佳的结果,在三角计算中表现得超人写基准。研究的基线。我们确认当代LMSMS在优化和整合的测算方法上提供了最优化和整合的改进的计算方法,但没有创新技术。
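Triangle counting, the routine for which the abstract above reports its strongest result, can be illustrated with a short sketch. The study itself targets C implementations; the Python version below only illustrates one standard technique (intersecting adjacency sets over degree-ordered edges) and is not code from the paper or its benchmark. The function name `count_triangles` and the edge-list input format are assumptions made for this example.

```python
from collections import defaultdict


def count_triangles(edges):
    """Count triangles in an undirected simple graph given as (u, v) pairs.

    Hypothetical helper for illustration; vertex ids must be mutually comparable.
    """
    # Build adjacency sets, ignoring self-loops and duplicate edges.
    adj = defaultdict(set)
    for u, v in edges:
        if u != v:
            adj[u].add(v)
            adj[v].add(u)

    # Rank vertices by (degree, id) and keep only edges that point from a
    # lower-ranked to a higher-ranked vertex. Every triangle is then found
    # exactly once, from its lowest-ranked vertex.
    rank = {v: (len(neighbors), v) for v, neighbors in adj.items()}
    forward = {
        v: {w for w in neighbors if rank[w] > rank[v]}
        for v, neighbors in adj.items()
    }

    total = 0
    for u, out_u in forward.items():
        for v in out_u:
            # Common higher-ranked neighbors of u and v close a triangle.
            total += len(out_u & forward[v])
    return total
```

For instance, `count_triangles([(0, 1), (1, 2), (0, 2), (2, 3)])` returns 1: the triangle 0-1-2, with vertex 3 attached only as a pendant. Ordering by degree keeps the forward sets of high-degree vertices small, which is the usual reason for this design in triangle-counting codes.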
Article 33
Title@2025-07-08 (2): gigiProfiler: Diagnosing Performance Issues by Uncovering Application Resource Bottlenecks
Title: gigiProfiler: Diagnosing Performance Issues by Uncovering Application Resource Bottlenecks | gigiProfiler: Diagnose von Performance-Problemen durch Enthüllen von Anwendungsressourcen Engpässe | gigiProfiler: 通过解封应用程序资源瓶颈分析性能问题 2507.06452v1 |
Authors (6): Yigong Hu, Haodong Zheng, Yicheng Liu, Dedong Xie, Youliang Huang, Baris Kasikci
Diagnosing performance bottlenecks in modern software is essential yet challenging, particularly as applications become more complex and rely on custom resource management policies. While traditional profilers effectively identify execution bottlenecks by tracing system-level metrics, they fall short when it comes to application-level resource contention caused by waiting for application-level events. In this work, we introduce OmniResource Profiling, a performance analysis approach that integrates system-level and application-level resource tracing to diagnose resource bottlenecks comprehensively. gigiProfiler, our realization of OmniResource Profiling, uses a hybrid LLM-static analysis approach to identify application-defined resources offline and analyze their impact on performance during buggy executions to uncover the performance bottleneck. gigiProfiler then samples and records critical variables related to these bottleneck resources during buggy execution and compares their values with those from normal executions to identify the root causes. We evaluated gigiProfiler on 12 real-world performance issues across five applications. gigiProfiler accurately identified performance bottlenecks in all cases. gigiProfiler also successfully diagnosed the root causes of two newly emerged, previously undiagnosed problems, with the findings confirmed by developers.
在现代软件中,分析业绩瓶颈至关重要,但具有挑战性,特别是因为应用程序变得更加复杂,并依赖自定资源管理政策。传统剖面设计通过追踪系统级指标有效识别执行瓶颈,但在等待应用级别活动导致的应用层面资源争议方面,它们却不尽如人意。在这项工作中,我们引入了“OmniResources Conferation”这一业绩分析方法,将系统级和应用层面的资源追踪整合在一起,以全面诊断资源瓶颈。 GIgiProfiler,我们实现“OmniResources Confilation”,使用混合的LLLM-stical分析方法,确定非在线应用程序定义的资源,分析其在错误处决期间的绩效影响,以发现性能瓶颈。GIProfiler然后样本和记录与这些瓶颈资源在错误执行期间的价值相关的关键变量,将其与正常处决的价值进行比较,以查明根源。我们评估了五个应用程序的12个真实世界业绩问题。GlogiProfileer准确查明了所有案件的绩效瓶颈。GlodiProfileer还成功诊断了两个新出现、先前未诊断的问题,并由开发者确认。
Article 34
Title@2025-07-08 (2): CodeMirage: Hallucinations in Code Generated by Large Language Models
Title: CodeMirage: Hallucinations in Code Generated by Large Language Models | CodeMirage: Halluzinationen in Code Generiert durch große Sprachmodelle | 代码Mirage: 大语言模型生成的代码中的幻觉 2408.08333v2 |
Authors (4): Vibhor Agarwal, Yulong Pei, Salwa Alamir, Xiaomo Liu
Large Language Models (LLMs) have shown promising potential in program generation and no-code automation. However, LLMs are prone to generate hallucinations, i.e., they generate text which sounds plausible but is incorrect. Although there has been a recent surge in research on LLM hallucinations for text generation, a similar hallucination phenomenon can occur in code generation. Sometimes the generated code can have syntactical or logical errors as well as more advanced issues like security vulnerabilities, memory leaks, etc. Given the wide adoption of LLMs to enhance efficiency in code generation and development in general, it becomes imperative to investigate hallucinations in code generation. To the best of our knowledge, this is the first attempt at studying hallucinations in the code generated by LLMs. We start by introducing the code hallucination definition and a comprehensive taxonomy of code hallucination types. We propose the first benchmark CodeMirage dataset for code hallucinations. The benchmark contains 1,137 GPT-3.5-generated hallucinated code snippets for Python programming problems from two base datasets, HumanEval and MBPP. We then propose the methodology for code hallucination detection and experiment with open source LLMs such as CodeLLaMA as well as OpenAI’s GPT-3.5 and GPT-4 models using a one-shot prompt. We find that GPT-4 performs the best on the HumanEval dataset and gives comparable results to the fine-tuned CodeBERT baseline on the MBPP dataset. Towards the end, we discuss various mitigation strategies for code hallucinations and conclude our work.
大型语言模型(LLMS)显示在程序生成和无编码自动化方面有很有希望的潜力。然而,LLMS容易产生幻觉,即产生似乎似乎似是似是而非的文本。虽然最近对LLM的幻觉的研究在为文本生成而出现,但类似的幻觉现象在代码生成中可能发生。生成的代码有时可能出现合成或逻辑错误,以及安全脆弱性、记忆泄漏等更先进的问题。鉴于LLMS的广泛调整以提高代码生成和一般开发效率,因此有必要调查代码生成中的幻觉。对于我们的知识来说,这是研究LMS生成的代码中的幻觉的第一次尝试。我们从引入代码幻觉定义和代码类型综合分类开始。我们建议为代码幻觉建立第一个基准代码数据集,包括安全脆弱性、记忆泄漏等。基准包含1,137 GPT-3.5生成的百草本代码,用于Python 各种基础数据集――人类经济学和MBPPP。然后,我们提出在代码中进行代码检测和实验的精确方法,将GPT-PLM数据模型作为开放源模型进行。
Article 35
Title@2025-07-08 (2): Representing Prompting Patterns with PDL: Compliance Agent Case Study
Title: Representing Prompting Patterns with PDL: Compliance Agent Case Study | Präsentieren von Prompting Patterns mit PDL: Compliance Agent Case Study | 代表PDL的提示模式:合规代理案例研究 2507.06396v1 |
Authors (6): Mandana Vaziri, Louis Mandel, Yuji Watanabe, Hirokuni Kitahara, Martin Hirzel, Anca Sailer
Prompt engineering for LLMs remains complex, with existing frameworks either hiding complexity behind restrictive APIs or providing inflexible canned patterns that resist customization – making sophisticated agentic programming challenging. We present the Prompt Declaration Language (PDL), a novel approach to prompt representation that tackles this fundamental complexity by bringing prompts to the forefront, enabling manual and automatic prompt tuning while capturing the composition of LLM calls together with rule-based code and external tools. By abstracting away the plumbing for such compositions, PDL aims at improving programmer productivity while providing a declarative representation that is amenable to optimization. This paper demonstrates PDL’s utility through a real-world case study of a compliance agent. Tuning the prompting pattern of this agent yielded up to 4x performance improvement compared to using a canned agent and prompt pattern.
LLMs的快速工程仍然很复杂,现有的框架要么隐藏在限制性的API后面,要么提供抵制定制化的不灵活罐头模式 – – 使复杂的代理程序编制变得具有挑战性。我们提出了快速代表的新颖方法,即即快速宣言语言(PDL),通过将提示带到最前沿,使手册和自动快速调整能够解决这一根本的复杂性,同时捕捉LLM电话的构成以及基于规则的代码和外部工具。PDL通过抽取用于这些构件的管道,目的是提高程序员的生产率,同时提供一个易于优化的宣示性表述。本文通过对合规代理进行真实世界案例研究,展示PDL的效用。与使用罐头代理和快速模式相比,该代理的快速模式产生了高达4x的绩效改进。
Article 36
Title@2025-07-08 (2): A proposal and assessment of an improved heuristic for the Eager Test smell detection
Title: A proposal and assessment of an improved heuristic for the Eager Test smell detection | Vorschlag und Bewertung einer verbesserten Heuristik für die Eager Test Geruchserkennung | 关于改进 “ 喷雾测试嗅觉检测 “ 的 改进超值性能的建议和评估 2507.06354v1 |
Authors (4): Huynh Khanh Vi Tran, Nauman bin Ali, Michael Unterkalmsteiner, Jürgen Börstler
Context: The evidence for the prevalence of test smells at the unit testing level has relied on the accuracy of detection tools, which have seen intense research in the last two decades. The Eager Test smell, one of the most prevalent, is often identified using simplified detection rules that practitioners find inadequate. Objective: We aim to improve the rules for detecting the Eager Test smell. Method: We reviewed the literature on test smells to analyze the definitions and detection rules of the Eager Test smell. We proposed a novel, unambiguous definition of the test smell and a heuristic to address the limitations of the existing rules. We evaluated our heuristic against existing detection rules by manually applying it to 300 unit test cases in Java. Results: Our review identified 56 relevant studies. We found that inadequate interpretations of original definitions of the Eager Test smell led to imprecise detection rules, resulting in a high level of disagreement in detection outcomes. Also, our heuristic detected patterns of eager and non-eager tests that existing rules missed. Conclusion: Our heuristic captures the essence of the Eager Test smell more precisely; hence, it may address practitioners’ concerns regarding the adequacy of existing detection rules.
在单位测试一级,测试气味普遍的证据依赖于检测工具的准确性,而检测工具在过去二十年中曾进行过大量研究。最流行的 “ 贪婪测试 “ 气味之一,往往使用从业人员认为不适当的简化检测规则来识别。目标:我们的目标是改进检测贪婪测试气味的规则。方法:我们审查了测试气味的文献,以分析贪婪测试气味的定义和检测规则。我们提出了测试气味的新颖、明确的定义,并提出了解决现有规则局限性的杂乱定义。我们通过手动将现有检测规则应用于爪哇300个单位测试案例,对现有的检测规则进行了超常性评估。结果:我们的审查确定了56项相关研究。我们发现,对最初的 “ 贪婪测试气味 “ 定义的解释不当导致检测结果高度分歧。此外,我们所检测的热度和非诱惑性测试模式也忽略了现行规则。结论:我们狂喜地捕捉到《贪婪测试》的精髓;因此,它可能解决从业人员对现有检测规则的充分性的关切。
Article 37
Title@2025-07-08 (2): An Architecture for Privacy-Preserving Telemetry Scheme
Title: An Architecture for Privacy-Preserving Telemetry Scheme | Eine Architektur für den Schutz der Privatsphäre Telemetrie-Schema | 隐私保护遥测计划架构 2507.06350v1 |
Authors (1): Kenneth Odoh
We present a privacy-preserving telemetry aggregation scheme. Our underlying frequency estimation routine works within the framework of differential privacy. The design philosophy follows a client-server architecture. Furthermore, the system uses a local differential privacy scheme where data gets randomized on the client before submitting the request to the resource server. This scheme allows for data analysis on de-identified data by carefully adding noise to prevent re-identification attacks, thereby facilitating public data release without making individual records identifiable. This work further enhances privacy guarantees by leveraging Oblivious HTTP (OHTTP) to achieve increased privacy protection for data in transit that addresses pre-existing privacy vulnerabilities in raw HTTP. We provide an implementation that focuses on frequency estimation with a histogram of a known dictionary. Our resulting formulation based on OHTTP provides stricter privacy safeguards than trusting an organization to manually delete identifying information from the client’s request in the ingestor, as deployed in the reference work [apple2017]. Code available at https://github.com/kenluck2001/miscellaneous/tree/master/src/Privacy-Preserving-Telemetry.
我们提出了一个保护隐私的遥测汇总计划。我们基本的频率估计常规工作是在不同隐私的框架内进行的。设计哲学遵循客户-服务器结构。此外,该系统还使用地方差异隐私计划,在向资源服务器提交请求之前,对客户进行随机数据。这个计划允许通过仔细添加噪音,对去确定的数据进行数据分析,以防止重新识别攻击,从而便利公开数据发布,同时不损害个人记录的可识别性。这项工作通过利用Oblivious HTTTP(OHTTP),进一步增强隐私保障,使过境中处理原始 HTTP(HTTP)先前存在的隐私脆弱性的数据获得更大的隐私保护。我们提供了一种侧重于频率估算的本地版图的实施。我们基于OHTTP的配方提供了更严格的隐私保障,而与信任一个组织人工删除在Ingestor中安装的查询工作*cite{appleuple2017}。代码见https://github.com/kenluck2001/ recentr/srick/prevy-Preser-Preservay-Taltymay。
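The client-side randomization and server-side frequency estimation over a known dictionary, as described in the abstract above, can be sketched as follows. The abstract does not name the exact mechanism, so this example swaps in k-ary (generalized) randomized response, a standard local differential privacy technique; it is not taken from the linked repository. The function names `randomize` and `estimate_counts` and the parameter `epsilon` are assumptions for illustration, and the OHTTP transport layer is out of scope here.

```python
import math
import random
from collections import Counter


def randomize(value, dictionary, epsilon, rng):
    """Client side: report the true value with probability p, else another word.

    Assumes the dictionary has at least two entries and contains `value`.
    """
    k = len(dictionary)
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p_truth:
        return value
    # Otherwise pick uniformly among the remaining dictionary entries.
    others = [w for w in dictionary if w != value]
    return rng.choice(others)


def estimate_counts(reports, dictionary, epsilon):
    """Server side: unbiased estimate of the true histogram from noisy reports."""
    k = len(dictionary)
    n = len(reports)
    e = math.exp(epsilon)
    p = e / (e + k - 1)      # probability a report equals the true value
    q = 1.0 / (e + k - 1)    # probability a report equals one specific other value
    observed = Counter(reports)
    # E[observed[w]] = true[w] * p + (n - true[w]) * q, solved for true[w].
    return {w: (observed[w] - n * q) / (p - q) for w in dictionary}


if __name__ == "__main__":
    rng = random.Random(0)
    words = ["crash", "login", "search", "share"]
    truth = ["crash"] * 500 + ["login"] * 3000 + ["search"] * 5000 + ["share"] * 1500
    noisy = [randomize(v, words, epsilon=2.0, rng=rng) for v in truth]
    print(estimate_counts(noisy, words, epsilon=2.0))
```

The server never sees a raw value, only a randomized report, and the estimates always sum to the number of reports. Smaller `epsilon` values give stronger privacy at the cost of noisier histograms.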
Article 38
Title@2025-07-08 (2): Quality attributes of test cases and test suites – importance & challenges from practitioners’ perspectives
Title: Quality attributes of test cases and test suites – importance & challenges from practitioners’ perspectives | Qualitätsmerkmale von Testfällen und Testsuiten – Bedeutung und Herausforderungen aus Sicht der Praktiker | 测试案例和测试套套件的质量属性 – – 从从业人员的角度看重要性和挑战 2507.06343v1 |
Authors (5): Huynh Khanh Vi Tran, Nauman bin Ali, Michael Unterkalmsteiner, Jürgen Börstler, Panagiota Chatzipetrou
Context: The quality of the test suites and the constituent test cases significantly impacts confidence in software testing. While research has identified several quality attributes of test cases and test suites, there is a need for a better understanding of their relative importance in practice. Objective: We investigate practitioners’ perceptions regarding the relative importance of quality attributes of test cases and test suites and the challenges they face in ensuring the perceived important quality attributes. Method: We conducted an industrial survey using a questionnaire based on the quality attributes identified in an extensive literature review. We used a sampling strategy that leverages LinkedIn to draw a large and heterogeneous sample of professionals with experience in software testing. Results: We collected 354 responses from practitioners with a wide range of experience. We found that the majority of practitioners rated Fault Detection, Usability, Maintainability, Reliability, and Coverage to be the most important quality attributes. Resource Efficiency, Reusability, and Simplicity received the most divergent opinions, which, according to our analysis, depend on the software-testing contexts. We identified common challenges that apply to the important attributes, namely inadequate definition, lack of useful metrics, lack of an established review process, and lack of external support. Conclusion: The findings point out where practitioners actually need further support with respect to achieving high-quality test cases and test suites under different software testing contexts. The findings can serve as a guideline for academic researchers when looking for research directions on the topic. The findings can also be used to encourage companies to provide more support to practitioners to achieve high-quality test cases and test suites.
目标:我们调查从业者对测试案例和测试套件质量属性的相对重要性的看法,以及他们在确保被认为重要的质量属性方面所面临的挑战。方法:我们根据在广泛文献审查中发现的质量属性,进行了一次工业调查。我们使用了一种抽样战略,利用Link In 来吸引具有软件测试经验的专业人员的大量和多样化样本。结果:我们从从从从业者那里收集了354份答复,具有广泛的经验。我们发现,大多数从业者认为,发现、可用性、可维持性、可靠性和覆盖范围是最重要的质量属性。资源效率、可变性和简易性得到了最不同的意见,而根据我们的分析,这些意见取决于软件测试的背景。我们查明了适用于重要属性的共同挑战,即定义不足、缺乏有用的指标、缺乏既定审查程序、缺乏外部支持。我们发现,大多数从业者认为,发现故障检测、可用性、可维持性、可靠性和覆盖范围是最重要的质量属性。我们发现大多数从业者都认为,在进行高质量测试时,需要从业者对高质量测试案例进行更深入的检验。
Article 39
Title@2025-07-08 (2): AR2: Attention-Guided Repair for the Robustness of CNNs Against Common Corruptions
Title: AR2: Attention-Guided Repair for the Robustness of CNNs Against Common Corruptions | AR2: Aufmerksamkeitsgeführte Reparatur für die Robustheit von CNNs gegen häufige Korruption | AR2:对有线电视新闻网反常见腐败的强力进行引人注意的修理 2507.06332v1 |
Authors (3): Fuyuan Zhang, Qichen Wang, Jianjun Zhao
Deep neural networks suffer from significant performance degradation when exposed to common corruptions such as noise, blur, weather, and digital distortions, limiting their reliability in real-world applications. In this paper, we propose AR2 (Attention-Guided Repair for Robustness), a simple yet effective method to enhance the corruption robustness of pretrained CNNs. AR2 operates by explicitly aligning the class activation maps (CAMs) between clean and corrupted images, encouraging the model to maintain consistent attention even under input perturbations. Our approach follows an iterative repair strategy that alternates between CAM-guided refinement and standard fine-tuning, without requiring architectural changes. Extensive experiments show that AR2 consistently outperforms existing state-of-the-art methods in restoring robustness on standard corruption benchmarks (CIFAR-10-C, CIFAR-100-C and ImageNet-C), achieving a favorable balance between accuracy on clean data and corruption robustness. These results demonstrate that AR2 provides a robust and scalable solution for enhancing model reliability in real-world environments with diverse corruptions.
深度神经网络在受到噪音、模糊、天气和数字扭曲等常见腐败的影响时,其性能严重退化,这限制了其在现实世界应用中的可靠性。在本文中,我们提出AR2(对强力进行审慎和引导的修复),这是提高事先受过训练的CNN的腐败稳健性的一个简单而有效的方法。AR2的运作方式是明确调整清洁和腐败图像之间的类别启动图(CAMs),鼓励即使在输入干扰下也保持一致关注的模式。我们的方法遵循迭代修复战略,在CAM引导的完善和标准微调之间,不要求建筑变革。广泛的实验表明AR2始终超越了恢复标准腐败基准(CIFAR-10-C、CIFAR-100-C和图像网络-C)的现有最新方法,在清洁数据的准确性和腐败稳健性之间取得有利的平衡。这些结果表明AR2为提高真实世界环境中有多种腐败的模型可靠性提供了可靠和可扩展的解决方案。
Article 40
Title@2025-07-08 (2): The Impact of Generative AI on Collaborative Open-Source Software Development: Evidence from GitHub Copilot
Title: The Impact of Generative AI on Collaborative Open-Source Software Development: Evidence from GitHub Copilot | Die Auswirkungen generativer KI auf die kollaborative Open-Source-Softwareentwicklung: Beweise von GitHub Copilot | 《创世大赦国际对合作开发开放源码软件的影响:GitHub联合试点的证据》 2410.02091v2 |
Authors (3): Fangchen Song, Ashish Agarwal, Wen Wen
Generative artificial intelligence (AI) enables automated content production, including coding in software development, which can significantly influence developer participation and performance. To explore its impact on collaborative open-source software (OSS) development, we investigate the role of GitHub Copilot, a generative AI pair programmer, in OSS development where multiple distributed developers voluntarily collaborate. Using GitHub’s proprietary Copilot usage data, combined with public OSS repository data obtained from GitHub, we find that Copilot use increases project-level code contributions by 5.9%. This gain is driven by a 2.1% increase in individual code contributions and a 3.4% rise in developer coding participation. However, these benefits come at a cost as coordination time for code integration increases by 8% due to more code discussions enabled by AI pair programmers. This reveals an important tradeoff: While AI expands who can contribute and how much they contribute, it slows coordination in collective development efforts. Despite this tension, the combined effect of these two competing forces remains positive, indicating a net gain in overall project-level productivity from using AI pair programmers. Interestingly, we also find the effects differ across developer roles. Peripheral developers show relatively smaller gains in project-level code contributions and face a higher increase in coordination time than core developers, likely due to the difference in their project familiarity. In summary, our study underscores the dual role of AI pair programmers in affecting project-level code contributions and coordination time in OSS development. Our findings on the differential effects between core and peripheral developers also provide important implications for the structure of OSS communities in the long run.
生成人工智能(AI)使得自动化内容生产,包括软件开发中的编码,能够极大地影响开发者的参与和业绩。为了探索其对开放源码合作开发的影响,我们调查了在多个分布开发者自愿合作的情况下,在开放源码软件开发中,GitHub的基因配对程序程序员(GitHub)的作用。利用GitHub的专利使用数据,加上从GitHub获得的公共开放源码软件储存数据,我们发现Coopil使用软件将项目级代码贡献增加5.9%。这得益于个人代码贡献增加2.1%,开发者内部代码参与增加3.4%。然而,由于对软件整合的协调时间增加8%,这些好处是成本最高的。这揭示了一个重要的权衡:尽管使用GitHub的专利使用数据,加上从GitHub获得的公共开放源码软件储存数据,我们发现这两个竞争力量的合并效应仍然使项目级代码贡献增加5.9%。这说明,使用AI对单个代码贡献增加2.1%,而开发者对开发者的影响增加3.4%。 有趣的是,在时间上,在项目级协调开发者的作用可能提高项目级中,在项目级中,在项目级分析中,在项目级分析中,对项目级分析中,对项目级贡献产生相对增长的作用可能提高。
Article 41
Title@2025-07-08 (2): cuVSLAM: CUDA accelerated visual odometry and mapping
Title: cuVSLAM: CUDA accelerated visual odometry and mapping | cuVSLAM: CUDA beschleunigte visuelle Odometrie und Mapping | CUDA 加速视觉测量和绘图 2506.04359v3 |
Authors (8): Alexander Korovko, Dmitry Slepichev, Alexander Efitorov, Aigul Dzhumamuratova, Viktor Kuznetsov, Hesam Rabeti, Joydeep Biswas, Soha Pouya
Accurate and robust pose estimation is a key requirement for any autonomous robot. We present cuVSLAM, a state-of-the-art solution for visual simultaneous localization and mapping, which can operate with a variety of visual-inertial sensor suites, including multiple RGB and depth cameras, and inertial measurement units. cuVSLAM supports operation with anywhere from one RGB camera to as many as 32 cameras, in arbitrary geometric configurations, thus supporting a wide range of robotic setups. cuVSLAM is specifically optimized using CUDA to deploy in real-time applications with minimal computational overhead on edge-computing devices such as the NVIDIA Jetson. We present the design and implementation of cuVSLAM, example use cases, and empirical results on several state-of-the-art benchmarks demonstrating the best-in-class performance of cuVSLAM.
精确和稳健的姿势估计是任何自主机器人的关键要求。 我们展示了 CuVSLAM, 这是视觉同步定位和映像的最先进的解决方案, 可用各种视觉- 内脏传感器套件操作, 包括多个 RGB 和深度摄像头, 以及惯性测量器。 CuVSLAM 支持使用一个只有32个 RGB 摄像头的任意几何配置, 从而支持一系列广泛的机器人设置。 CuVSLAM 被特别优化, 利用 CUDA 实时应用, 在像 NVIDIA Jetson 这样的边缘计算设备上部署最低计算间接费用的实时应用。 我们展示了 CuVSLAM 的设计和实施, 实例使用案例, 以及几个最先进的基准的经验性结果, 展示了 CuVSLAM 的最佳水平性能 。
Article 42
Title@2025-07-08 (2): RPHunter: Unveiling Rug Pull Schemes in Crypto Token via Code-and-Transaction Fusion Analysis
Title: RPHunter: Unveiling Rug Pull Schemes in Crypto Token via Code-and-Transaction Fusion Analysis | RPHunter: Enthüllen von Rug Pull Schemes in Crypto Token über Code-and-Transaction Fusion Analysis | RPHunter:通过代码和交易整合分析,在加密中采用 “ 拼接 “ 的 “ 拼接 “ 的 “ Rug “ 拖网计划 2506.18398v3 |
Authors (7): Hao Wu, Haijun Wang, Shangwang Li, Yin Wu, Ming Fan, Wuxia Jin, Ting Liu
Rug pull scams have emerged as a persistent threat to cryptocurrency, causing significant financial losses. A typical scenario involves scammers deploying honeypot contracts to attract investments, restricting token sales, and draining the funds, which leaves investors with worthless tokens. Current methods either rely on predefined patterns to detect code risks or utilize statistical transaction data to train detection models. However, real-world Rug Pull schemes often involve a complex interplay between malicious code and suspicious transaction behaviors. These methods, which solely focus on one aspect, fall short in detecting such schemes effectively. In this paper, we propose RPHunter, a novel technique that integrates code and transaction for Rug Pull detection. First, RPHunter establishes declarative rules and performs flow analysis to extract code risk information, further constructing a semantic risk code graph (SRCG). Meanwhile, to leverage transaction information, RPHunter formulates dynamic token transaction activities as a token flow behavior graph (TFBG) in which nodes and edges are characterized from network structure and market manipulation perspectives. Finally, RPHunter employs graph neural networks to extract complementary features from SRCG and TFBG, integrating them through an attention fusion model to enhance the detection of Rug Pull. We manually analyzed 645 Rug Pull incidents from code and transaction aspects and constructed a ground-truth dataset. We evaluated RPHunter on our dataset, achieving a precision of 95.3%, a recall of 93.8% and an F1 score of 94.5%, which highlights superior performance compared to existing methods. Furthermore, when applied to the real-world scenarios, RPHunter has identified 4801 Rug Pull tokens, achieving a precision of 90.7%.
鲁棒骗局已成为对隐蔽货币的持续威胁,造成了巨大的金融损失。典型的情景是,诈骗者利用蜜罐合同来吸引投资,限制象征性销售,耗尽资金,使投资者留下没有价值的象征。当前的方法要么依靠预先定义的模式来检测代码风险,要么利用统计交易数据来培训检测模型。然而,真实世界的鲁棒骗骗局往往涉及恶意代码和可疑交易行为之间的复杂互动。这些方法仅侧重于一个方面,在有效发现此类计划方面落后于一个方面。在本文中,我们提议了RPHunter(RPHunter),这是将代码与交易结合起来以吸引投资,限制象征性销售,限制象征性销售,同时,RPHunter(RPH)建立宣言规则并进行流程分析,进一步构建一个在线交易的准确性功能,同时将 RBGG(RG) 的准确性数据整合起来,同时将我们目前运行的节点和节点的节点与节点的节点定位,在SRCG(RG) 的精确度上应用图形网络, 将当前运行的精确度提高到一个标准。
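The three scores reported on the ground-truth dataset can be cross-checked against each other, since F1 is the harmonic mean of precision and recall. The snippet below is a minimal check under the standard definition, using the rounded figures quoted in the abstract; the helper name `f1_score` is hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given as fractions in [0, 1])."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Rounded values from the abstract: precision 95.3%, recall 93.8%.
    f1 = f1_score(0.953, 0.938)
    print(f"F1 = {f1:.3f}")  # prints "F1 = 0.945"
```

The result agrees with the reported F1 score of 94.5% to the precision the abstract gives.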
Article 43
Title@2025-07-08 (2): ContractTrace: Retracing Smart Contract Versions for Security Analyses
Title: ContractTrace: Retracing Smart Contract Versions for Security Analyses | ContractTrace: Smart Contract-Versionen für Sicherheitsanalysen zurückverfolgen | 合同跟踪:用于安全分析的重新追踪智能合同版本 2412.20866v2 |
Authors (7): Fatou Ndiaye Mbodji, Vinny Adjibi, Moustapha Awwalou Diouf, Gervais Mendy, Kui Liu, Jacques Klein, Tegawende Bissyande
Due to the inherent immutability of blockchain technology, smart contract updates require their deployment at new addresses rather than modifying existing ones, thus fragmenting version histories and creating critical blind spots for analyses. For example, this fragmentation severely hinders security researchers’ ability to track vulnerability lifecycles across contract versions. While platforms like Etherscan provide detailed information about Ethereum smart contracts, they lack crucial functionality to trace predecessor-successor relationships within smart contract lineages, preventing systematic analysis of how vulnerabilities emerge, propagate, and potentially remain unresolved across versions. To address the challenge of tracing smart contract lineages, we adopt a Design Science Research (DSR) approach and introduce ContractTrace, an automated infrastructure that accurately identifies and links versions of smart contracts into coherent lineages. This tool enables the construction of lineageSet, an up-to-date, open-source dataset specifically designed to support security research on vulnerability, defect or any other property evolution patterns in smart contracts. Through a security-focused case study we demonstrate how ContractTrace reveals previously obscured vulnerability lifecycles within smart contract lineages, tracking whether critical security flaws persist or get resolved across versions. This capability is essential for understanding vulnerability propagation patterns and evaluating the effectiveness of security patches in blockchain environments. In the evaluation phase of our DSR approach, we validated our lineage detection methodology against an alternative approach using Locality-Sensitive Hashing (LSH) to cluster contract versions, confirming the security relevance and accuracy of our technique.
智能合同更新要求在新地址进行部署,而不是修改现有合同链,从而分解版本历史,为分析创造关键的盲点。例如,这种分散性严重阻碍了安全研究人员跟踪不同合同版本的脆弱性生命周期的能力。虽然Etherscan等平台提供了有关Etheum智能合同的详细信息,但它们缺乏关键功能,无法在智能合同链内追踪前方-继承人关系,无法系统地分析不同版本之间出现、传播和可能仍未解决的脆弱性。为了应对追踪智能合同线的挑战,我们采用了设计科学研究(DSR)方法,并引入了Caltrace系统,这是一个自动基础设施,精确地识别和链接智能合同版本的弱点生命周期,并将这些版本连接到一致的系列中。这个工具使得能够构建线型系统,一个最新的开放源数据集,专门用来支持智能合同中的脆弱性、缺陷或其他任何财产演变模式的安全研究。我们通过以安全为重点的案例研究,展示了在智能合同链线内先前隐含的脆弱性生命周期,跟踪我们安全链中的关键安全风险度的准确性,并用安全链级系统系统测试方法来确认我们的安全风险度的准确性标准。
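The abstract mentions Locality-Sensitive Hashing as the alternative used to cluster contract versions, without giving its configuration. As a rough illustration of the general idea, the sketch below uses MinHash signatures with banding to surface candidate pairs of similar sources. Everything here (token shingles of width 5, 64 hash functions, 16 bands, the function names, and the toy addresses) is an assumption for illustration, not the authors' setup.

```python
import hashlib
from collections import defaultdict


def shingles(source, width=5):
    """Split source text into overlapping token windows (a set of strings)."""
    tokens = source.split()
    last_start = max(1, len(tokens) - width + 1)
    return {" ".join(tokens[i:i + width]) for i in range(last_start)}


def minhash(shingle_set, num_hashes=64):
    """MinHash signature: the minimum of each seeded hash over all shingles."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha1(f"{seed}:{s}".encode("utf-8")).digest()[:8], "big")
            for s in shingle_set
        ))
    return signature


def candidate_pairs(signatures, bands=16):
    """Pairs whose signatures agree on every row of at least one band."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for b in range(bands):
            buckets[(b, tuple(sig[b * rows:(b + 1) * rows]))].add(name)
    pairs = set()
    for names in buckets.values():
        ordered = sorted(names)
        for i, first in enumerate(ordered):
            for second in ordered[i + 1:]:
                pairs.add((first, second))
    return pairs


if __name__ == "__main__":
    token = "contract Token { function transfer ( address to , uint amount ) public { } }"
    other = "library Math with helpers for checked addition subtraction and multiplication only"
    sigs = {addr: minhash(shingles(src))
            for addr, src in {"0xA": token, "0xB": token, "0xC": other}.items()}
    print(candidate_pairs(sigs))
```

Two deployments with identical source always land in the same buckets, while sources with no shared shingles practically never do. Candidate pairs would still need a follow-up check, such as creator address or deployment order, before being linked into a lineage.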
Article 44
Title@2025-07-08 (2): Model Cards Revisited: Bridging the Gap Between Theory and Practice for Ethical AI Requirements
Title: Model Cards Revisited: Bridging the Gap Between Theory and Practice for Ethical AI Requirements | Revisited Model Cards: Die Kluft zwischen Theorie und Praxis für ethische KI-Anforderungen überbrücken | 重新审视示范卡:弥合道德道德要求理论与实践之间的差距 2507.06014v1 |
Authors (3): Tim Puhlfürß, Julia Butzke, Walid Maalej
Model cards are the primary documentation framework for developers of artificial intelligence (AI) models to communicate critical information to their users. Those users are often developers themselves looking for relevant documentation to ensure that their AI systems comply with the ethical requirements of existing laws, guidelines, and standards. Recent studies indicate inadequate model documentation practices, suggesting a gap between AI requirements and current practices in model documentation. To understand this gap and provide actionable guidance to bridge it, we conducted a thematic analysis of 26 guidelines on ethics and AI, three AI documentation frameworks, three quantitative studies of model cards, and ten actual model cards. We identified a total of 43 ethical requirements relevant to model documentation and organized them into a taxonomy featuring four themes and twelve sub-themes representing ethical principles. Our findings indicate that model developers predominantly emphasize model capabilities and reliability in the documentation while overlooking other ethical aspects, such as explainability, user autonomy, and fairness. This underscores the need for enhanced support in documenting ethical AI considerations. Our taxonomy serves as a foundation for a revised model card framework that holistically addresses ethical AI requirements.
模型卡是人工智能模型开发者向用户传递关键信息的主要文件框架,这些用户往往是开发者本身,他们自己寻找相关文件,以确保其个体智能系统符合现行法律、准则和标准的道德要求; 最近的研究显示,示范文件做法不足,表明在示范文件方面,AI的要求与当前做法之间存在差距; 为了理解这一差距,并提供可行的指南,以弥补这一差距,我们对26项道德准则和AI准则进行了专题分析; 3个个体智能文件框架; 3个模型卡定量研究; 10个实际模型卡进行了专题分析; 我们查明了与模拟文件有关的总共43项道德要求,并将之编成一个分类,有4个主题和12个代表道德原则的次主题; 我们的调查结果显示,模型开发者主要强调文件的示范能力和可靠性,而忽略了其他道德方面,例如解释性、用户自主性和公正性; 这突出表明,需要加强支持记录职业性人工智能考量因素; 我们的分类学是修订模型卡框架的基础,全面处理伦理性人工智能要求。
Article 45
Title@2025-07-08 (2): Multi-Agent Debate Strategies to Enhance Requirements Engineering with Large Language Models
Title: Multi-Agent Debate Strategies to Enhance Requirements Engineering with Large Language Models | Multi-Agent Debatte Strategien zur Verbesserung der Requirements Engineering mit großen Sprachmodellen | 利用大语言模式加强要求工程的多机构辩论战略 2507.05981v1 |
Authors (4): Marc Oriol, Quim Motger, Jordi Marco, Xavier Franch
Context: Large Language Model (LLM) agents are becoming widely used for various Requirements Engineering (RE) tasks. Research on improving their accuracy mainly focuses on prompt engineering, model fine-tuning, and retrieval augmented generation. However, these methods often treat models as isolated black boxes - relying on single-pass outputs without iterative refinement or collaboration, limiting robustness and adaptability. Objective: We propose that, just as human debates enhance accuracy and reduce bias in RE tasks by incorporating diverse perspectives, different LLM agents debating and collaborating may achieve similar improvements. Our goal is to investigate whether Multi-Agent Debate (MAD) strategies can enhance RE performance. Method: We conducted a systematic study of existing MAD strategies across various domains to identify their key characteristics. To assess their applicability in RE, we implemented and tested a preliminary MAD-based framework for RE classification. Results: Our study identified and categorized several MAD strategies, leading to a taxonomy outlining their core attributes. Our preliminary evaluation demonstrated the feasibility of applying MAD to RE classification. Conclusions: MAD presents a promising approach for improving LLM accuracy in RE tasks. This study provides a foundational understanding of MAD strategies, offering insights for future research and refinements in RE applications.
目标:我们建议,正如人类辩论通过纳入不同观点来提高标准化任务的准确性和减少这种任务的偏差一样,不同的法学工作者辩论和协作也可以取得类似的改进。我们的目标是调查多重辩论战略能否提高可再生能源的绩效。方法:我们对各领域现有的数学研究战略进行了系统化研究,以确定其关键特点。为了评估其在可再生能源中的适用性,我们实施并测试了一个初步的基于数学的可再生能源分类框架。结果:我们的研究确定并分类了几项数学研究战略,从而形成了一个分类法,从而确定了这些战略的核心属性。我们的初步评估表明,将数学应用到可再生能源分类中是可行的。结论:MAD为改进可再生能源任务的精准性提供了一种很有希望的方法。本性的研究提供了对数学战略的基础理解,为未来研究和可再生能源应用的改进提供了深刻见解。
Article 46
Title@2025-07-08 (2): TigAug: Data Augmentation for Testing Traffic Light Detection in Autonomous Driving Systems
Title: TigAug: Data Augmentation for Testing Traffic Light Detection in Autonomous Driving Systems | TigAug: Datenvergrößerung zur Prüfung von Verkehrslichterfassung in autonomen Fahrsystemen | TigAug:自动驾驶系统交通光探测测试数据增强 2507.05932v1 |
Authors (5): You Lu, Dingji Wang, Kaifeng Huang, Bihuan Chen, Xin Peng
Autonomous vehicle technology has been developed in recent decades with advances in sensing and computing technology. There is an urgent need to ensure the reliability and robustness of autonomous driving systems (ADSs). Despite the recent achievements in testing various ADS modules, little attention has been paid to the automated testing of traffic light detection models in ADSs. A common practice is to manually collect and label traffic light data. However, this is labor-intensive, and collecting diverse data under different driving environments can even be impossible. To address these problems, we propose and implement TigAug to automatically augment labeled traffic light images for testing traffic light detection models in ADSs. We construct two families of metamorphic relations and three families of transformations based on a systematic understanding of weather environments, camera properties, and traffic light properties. We use augmented images to detect erroneous behaviors of traffic light detection models by transformation-specific metamorphic relations, and to improve the performance of traffic light detection models by retraining. Large-scale experiments with four state-of-the-art traffic light detection models and two traffic light datasets have demonstrated that i) TigAug is effective in testing traffic light detection models, ii) TigAug is efficient in synthesizing traffic light images, and iii) TigAug generates traffic light images with acceptable naturalness.
在过去几十年里,随着遥感和计算技术的最近进步,发展了自主车辆技术;迫切需要确保自动驾驶系统(ADS)的可靠性和稳健性;尽管最近在测试各种ADS模块方面取得的成就,但对ADS中交通灯检测模型的自动测试很少重视;通常的做法是人工收集和标签交通灯数据;然而,在不同驾驶环境中收集各种数据是劳动密集型的,甚至不可能;为解决这些问题,我们提议和实施TigAug自动增加标记的交通灯图像,用于测试ADS中的交通灯检测模型;我们根据对天气环境、摄像特性和交通灯特性的系统理解,建造两组具有变形关系的系统和三组变形的变形。我们使用增强的图像,通过改造特定的变形关系,发现交通灯检测模型的错误行为,并通过再培训改进交通灯检测模型的性能。我们提议和实施TigAug公司为测试ADS的交通灯检测模型和两套交通灯光数据集进行的大规模实验。 我们根据系统了解天气环境环境环境环境、摄像特性和光图像,在测试可接受性交通灯光方面是有效的。
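The idea of checking a detector with a transformation-specific metamorphic relation can be sketched with a toy example. The abstract does not spell out TigAug's relations or transformations, so the invariance relation (detections should not change under a brightness change), the brightness-scaling transformation, and the threshold-based `toy_detector` below are all stand-ins invented for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


def scale_brightness(image, factor):
    """Toy camera-style transformation on a grayscale image (list of rows)."""
    return [[min(255, int(v * factor)) for v in row] for row in image]


def toy_detector(image, threshold=200):
    """Stand-in detector: one box around all pixels at or above the threshold."""
    points = [(x, y) for y, row in enumerate(image)
              for x, v in enumerate(row) if v >= threshold]
    if not points:
        return []
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return [(min(xs), min(ys), max(xs) + 1, max(ys) + 1)]


def violates_invariance(detect, image, transform, iou_threshold=0.5):
    """True if detections before and after the transformation disagree."""
    before = detect(image)
    after = detect(transform(image))
    if len(before) != len(after):
        return True
    return any(
        max((iou(box, other) for other in after), default=0.0) < iou_threshold
        for box in before
    )


if __name__ == "__main__":
    image = [[0] * 8 for _ in range(8)]
    for y in range(2, 5):
        for x in range(3, 5):
            image[y][x] = 255  # a bright "traffic light" patch
    for factor in (0.9, 0.5):
        flagged = violates_invariance(
            toy_detector, image, lambda im: scale_brightness(im, factor))
        print(factor, flagged)
```

With the toy detector, mild dimming (factor 0.9) leaves the detection intact, while strong dimming (factor 0.5) makes the light disappear and is flagged as a violation. In this sketch, a flagged image is what would count as evidence of erroneous behavior and as a candidate for retraining data.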
Article 47
Title@2025-07-08 (2): Learning to Focus: Context Extraction for Efficient Code Vulnerability Detection with Language Models
Title: Learning to Focus: Context Extraction for Efficient Code Vulnerability Detection with Language Models | Fokussieren lernen: Kontextextraktion für effiziente Code-Anfälligkeitserkennung mit Sprachmodellen | 学习聚焦:以语言模式有效识别《守则》脆弱性 2505.17460v2 |
Authors (7): Xinran Zheng, Xingzhi Qian, Huichi Zhou, Shuo Yang, Yiling He, Suman Jana, Lorenzo Cavallaro
Language models (LMs) show promise for vulnerability detection but struggle with long, real-world code due to sparse and uncertain vulnerability locations. These issues, exacerbated by token limits, often cause models to miss vulnerability-related signals, thereby impairing effective learning. A key intuition is to enhance LMs with concise, information-rich context. Commit-based annotations offer precise, CWE-agnostic supervision, but are unavailable during inference, as they depend on historical code changes. Moreover, their extreme sparsity, often covering only a few lines, makes it difficult for LMs to process directly. In this paper, we propose FocusVul, a model-agnostic framework that improves LM-based vulnerability detection by learning to select sensitive context. FocusVul learns commit-based annotation patterns through hierarchical semantic modeling and generalizes them to identify line-level vulnerability-relevant regions during inference. It then extracts LM-oriented context via both dependency and execution flows surrounding selected regions, yielding semantically rich inputs for effective vulnerability detection. Experiments on real-world benchmarks show that FocusVul consistently outperforms heuristic-based and full-function fine-tuning approaches, improving classification performance by 164.04% and reducing FLOPs by 19.12% on average.
语言模型(LMS) 显示了识别脆弱性的希望,但由于脆弱地点稀少和不确定,与长期的、真实的世界代码抗争的可能性很大。这些问题由于象征性限制而加剧,往往导致模型丢失与脆弱性有关的信号,从而损害有效的学习。一个关键直觉是用简明、信息丰富的背景来提升LMs。基于文件的注释提供了精确的、CWE-不可知性的监督,但在推断过程中却无法使用,因为它们取决于历史代码的变化。此外,它们的极端宽度往往只覆盖几条线,使得LMs难以直接处理。在本文中,我们提议FocusVul,一个通过学习选择敏感环境来改进基于LM的脆弱性检测的模型框架。FocusVul通过等级的语义模型学习基于承诺的批注模式,并在推断过程中将其概括化,以辨别与脆弱性相关的线级区域。然后通过在选定区域周围的依附性和执行流动来提取面向LM的上下文,为有效识别脆弱性提供内容丰富的投入。在现实世界基准上进行的实验显示,FocusVul始终优于基于启发式和全函数微调的方法,平均将分类性能提高164.04%,并将FLOPs减少19.12%。
Article 48
Title@2025-07-08 (2): The Impact of Prompt Programming on Function-Level Code Generation
Title: The Impact of Prompt Programming on Function-Level Code Generation | Die Auswirkungen der Prompt-Programmierung auf die Code-Generierung auf Funktionsebene | 迅速编制方案对职能层面代码生成的影响 2412.20545v2 |
Authors (4): Ranim Khojah, Francisco Gomes de Oliveira Neto, Mazen Mohamad, Philipp Leitner
Large Language Models (LLMs) are increasingly used by software engineers for code generation. However, limitations of LLMs such as irrelevant or incorrect code have highlighted the need for prompt programming (or prompt engineering) where engineers apply specific prompt techniques (e.g., chain-of-thought or input-output examples) to improve the generated code. While some prompt techniques have been studied, the impact of different techniques – and their interactions – on code generation is still not fully understood. In this study, we introduce CodePromptEval, a dataset of 7072 prompts designed to evaluate five prompt techniques (few-shot, persona, chain-of-thought, function signature, list of packages) and their effect on the correctness, similarity, and quality of complete functions generated by three LLMs (GPT-4o, Llama3, and Mistral). Our findings show that while certain prompt techniques significantly influence the generated code, combining multiple techniques does not necessarily improve the outcome. Additionally, we observed a trade-off between correctness and quality when using prompt techniques. Our dataset and replication package enable future research on improving LLM-generated code and evaluating new prompt techniques.
大型语言模型(LLMS)被软件工程师越来越多地用于代码生成,然而,LLMS(LLMS)的局限性,例如不相关或不正确的代码等LLMS的局限性突出表明,需要迅速编程(或迅速工程),因为工程师采用具体的即时技术(例如思维链或投入产出实例)来改进生成的代码。虽然已经研究了一些迅速技术,但不同技术(及其相互作用)对代码生成的影响仍然不完全理解。在本研究中,我们引入了代码PromptEval,这是一个7072个提示数据集,旨在评价五项即时技术(光、人、思维链、功能签名、软件包清单)及其对三个LLMMS(GPT-4o、Llama3和Mistral)生成的完整功能的正确性、相似性和质量的影响。我们的调查结果显示,虽然某些快速技术对生成的代码有重大影响,但结合多种技术并不一定能改善结果。此外,我们发现,在使用快速技术时,准确性和质量之间存在着一种权衡。我们的数据集和复制包使今后能够对改进LM生成的代码和评价新的快速技术进行研究。
Article 49
Title@2025-07-08 (2): Extending Behavioral Software Engineering: Decision-Making and Collaboration in Human-AI Teams for Responsible Software Engineering
Title: Extending Behavioral Software Engineering: Decision-Making and Collaboration in Human-AI Teams for Responsible Software Engineering | Erweiterung der Verhaltenssoftware-Engineering: Entscheidungsfindung und Zusammenarbeit in Human-AI-Teams für verantwortungsvolle Software-Engineering | 扩展行为行为软件工程:负责任软件工程人类-AI小组的决策与合作 2504.09496v2 |
Authors (1): Lekshmi Murali Rani
The study of behavioral and social dimensions of software engineering (SE) tasks characterizes behavioral software engineering (BSE); however, the increasing significance of human-AI collaboration (HAIC) brings new directions in BSE by presenting new challenges and opportunities. This PhD research focuses on decision-making (DM) for SE tasks and collaboration within human-AI teams, aiming to promote responsible software engineering through a cognitive partnership between humans and AI. The goal of the research is to identify the challenges and nuances in HAIC from a cognitive perspective, and to design and optimize collaboration/partnership (human-AI team) that enhances collective intelligence and promotes better, responsible DM in SE through human-centered approaches. The research addresses HAIC and its impact on individual, team, and organizational level aspects of BSE.
研究软件工程(SE)任务的行为和社会层面是行为软件工程(BSE)的特点;然而,人类-AI合作(HAIC)日益重要,通过提出新的挑战和机会,给BSE带来新的方向; 博士研究侧重于人类-AI团队中SE任务的决策和协作,目的是通过人与AI之间的认知伙伴关系促进负责任的软件工程; 研究的目标是从认知角度确定HAIC的挑战和细微差别,设计并优化合作/伙伴关系(人-AI团队),通过以人为本的方法,加强集体情报,促进更好地、负责的SEDM,研究涉及HAIC及其对BSE个人、团队和组织层面的影响。
Article 50
Title@2025-07-08 (2): ETrace: Event-Driven Vulnerability Detection in Smart Contracts via LLM-Based Trace Analysis
Title: ETrace: Event-Driven Vulnerability Detection in Smart Contracts via LLM-Based Trace Analysis | ETrace: Event-getriebene Sicherheitserkennung in Smart Contracts über LLM-basierte Trace-Analyse | ETrace:通过基于LLM的追踪分析,在智能合同中实现对脆弱性的彻底发现 2506.15790v2 |
Authors (7): Chenyang Peng, Haijun Wang, Yin Wu, Hao Wu, Ming Fan, Yitao Zhao, Ting Liu
With the advancing application of blockchain technology in various fields, ensuring the security and stability of smart contracts has emerged as a critical challenge. Current security analysis methodologies in vulnerability detection can be categorized into static analysis and dynamic analysis methods. However, these existing traditional vulnerability detection methods predominantly rely on analyzing original contract code, and not all smart contracts provide accessible code. We present ETrace, a novel event-driven vulnerability detection framework for smart contracts, which uniquely identifies potential vulnerabilities through LLM-powered trace analysis without requiring source code access. By extracting fine-grained event sequences from transaction logs, the framework leverages Large Language Models (LLMs) as adaptive semantic interpreters to reconstruct event analysis through chain-of-thought reasoning. ETrace implements pattern-matching to establish causal links between transaction behavior patterns and known attack behaviors. Furthermore, we validate the effectiveness of ETrace through preliminary experimental results.
随着在各个领域预先应用链锁技术,确保智能合同的安全和稳定已成为一项关键挑战。目前的脆弱性检测安全分析方法可以分为静态分析和动态分析方法。然而,这些现有的传统脆弱性检测方法主要依赖分析原始合同代码,并非所有智能合同都提供无障碍代码。我们为智能合同介绍了由事件驱动的新颖的智能合同脆弱性检测框架,它通过LLM驱动的追踪分析,在不需要源代码访问的情况下,独特地识别了潜在的脆弱性。通过从交易日志中提取细微事件序列,该框架利用了大语言模型作为适应性语义翻译,通过思潮推理重新进行事件分析。“ETRace”采用模式匹配,以建立交易行为模式与已知攻击行为之间的因果关系。此外,我们还通过初步实验结果验证了ETrace的有效性。
Article 51
Title@2025-07-08 (2): scikit-package – software packaging standards and roadmap for sharing reproducible scientific software
Title: scikit-package – software packaging standards and roadmap for sharing reproducible scientific software | scikit-Paket – Software-Verpackungsstandards und Roadmap für den Austausch reproduzierbarer wissenschaftlicher Software | 共享可复制科学软件的软件包装标准和路线图 2507.03328v2 |
Authors (5): S. Lee, C. Myers, A. Yang, T. Zhang, S. J. L. Billinge
Scientific advancement relies on the ability to share and reproduce results. When data analysis or calculations are carried out using software written by scientists, there are special challenges around code versions, quality, and code sharing. scikit-package provides a roadmap to facilitate code reuse and sharing with minimal effort through tutorials coupled with automated and centralized reusable workflows. The goal of the project is to provide pedagogical and practical tools for scientists who are not professionally trained software engineers to write more reusable and maintainable software code. Code reuse can occur at multiple levels of complexity, from turning a code block into a function within a single script to publishing a publicly installable, fully tested, and documented software package. scikit-package provides a community-maintained set of tools, and a roadmap, to help scientists bring their software to higher levels of reproducibility and shareability.
科学进步取决于共享和复制成果的能力。当数据分析或计算使用科学家编写的软件进行时,在代码版本、质量和代码共享方面存在着特殊的挑战。 缩略语软件包提供了一份路线图,通过辅导和自动化和集中的可再利用工作流程,促进代码的再利用和分享,并作出最小的努力。该项目的目标是为没有受过专业培训的软件工程师的科学家提供教学和实践工具,以便编写更可再使用和可维护的软件代码。代码再利用可以在多种复杂程度上发生,从将代码块变成单一脚本中的功能,到公布可公开安装、经过充分测试和有记录软件的软件包Scikit软件包提供一套社区维护的工具,以及一张路线图,以帮助科学家提高软件的可复制性和可分享性。
Article 52
Title@2025-07-08 (2): Detecting and Mitigating Reward Hacking in Reinforcement Learning Systems: A Comprehensive Empirical Study
Title: Detecting and Mitigating Reward Hacking in Reinforcement Learning Systems: A Comprehensive Empirical Study | Erkennung und Eindämmung von Belohnungshacking in Verstärkungs-Lernsystemen: Eine umfassende empirische Studie | 检测和减轻强化学习系统中的回扣:综合经验研究 2507.05619v1 |
Authors (3): Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma
Reward hacking in Reinforcement Learning (RL) systems poses a critical threat to the deployment of autonomous agents, where agents exploit flaws in reward functions to achieve high scores without fulfilling intended objectives. Despite growing awareness of this problem, systematic detection and mitigation approaches remain limited. This paper presents a large-scale empirical study of reward hacking across diverse RL environments and algorithms. We analyze 15,247 training episodes across 15 RL environments (Atari, MuJoCo, custom domains) and 5 algorithms (PPO, SAC, DQN, A3C, Rainbow), implementing automated detection algorithms for six categories of reward hacking: specification gaming, reward tampering, proxy optimization, objective misalignment, exploitation patterns, and wireheading. Our detection framework achieves 78.4% precision and 81.7% recall across environments, with computational overhead under 5%. Through controlled experiments varying reward function properties, we demonstrate that reward density and alignment with true objectives significantly impact hacking frequency ($p < 0.001$, Cohen’s $d = 1.24$). We validate our approach through three simulated application studies representing recommendation systems, competitive gaming, and robotic control scenarios. Our mitigation techniques reduce hacking frequency by up to 54.6% in controlled scenarios, though we find these trade-offs are more challenging in practice due to concept drift, false positive costs, and adversarial adaptation. All detection algorithms, datasets, and experimental protocols are publicly available to support reproducible research in RL safety.
强化学习(RL)系统中的奖励黑客(RL)系统对自动代理商的部署构成了严重威胁,因为代理商利用奖励功能中的缺陷来达到高分而不实现预期目标。尽管人们日益认识到这一问题,但系统检测和减缓方法仍然有限。本文介绍了关于在不同RL环境和算法中进行奖励黑客的大规模实证研究。我们分析了15RL环境(阿塔里、穆乔科、定制域)和5种算法(PPPO、SAC、DQN、A3C、彩虹)的15 247个培训事件,这些代理商利用自动检测算法对以下六类黑客进行奖励:规格游戏、奖励篡改、代理优化、客观不匹配、开发模式和剪贴线。我们的检测框架实现了78.4%的精确率和81.7%的跨环境回顾,计算间接费用不到5%。通过受控的奖励功能属性实验,我们展示了密度和符合真实目标的密度,对黑客频率(0.001美元,科德=1.24美元),我们通过三个模拟应用应用研究研究研究方法来验证我们的方法,在建议系统、竞争性的频率检测、有竞争力的游戏、可控性交易和机械控制情景中,我们找到了的漂变换数据,我们通过这些数据变换的变现技术将降低了数据变现成本。
Article 53
Title@2025-07-08 (2): Efficient Detection of Intermittent Job Failures Using Few-Shot Learning
Title: Efficient Detection of Intermittent Job Failures Using Few-Shot Learning | Effiziente Erkennung intermittierender Job-Fälle durch wenig scharfes Lernen | 利用很少热的学习方法有效检测间歇性工作失败 2507.04173v2 |
Authors (3): Henri Aïdasso, Francis Bordeleau, Ali Tizghadam
One of the main challenges developers face in the use of continuous integration (CI) and deployment pipelines is the occurrence of intermittent job failures, which result from unexpected non-deterministic issues (e.g., flaky tests or infrastructure problems) rather than regular code-related errors such as bugs. Prior studies developed machine learning (ML) models trained on large datasets of job logs to classify job failures as either intermittent or regular. As an alternative to costly manual labeling of large datasets, the state-of-the-art (SOTA) approach leveraged a heuristic based on non-deterministic job reruns. However, this method mislabels intermittent job failures as regular in contexts where rerunning suspicious job failures is not an explicit policy, and therefore limits the SOTA's performance in practice. In fact, our manual analysis of 2,125 job failures from 5 industrial and 1 open-source projects reveals that, on average, 32% of intermittent job failures are mislabeled as regular. To address these limitations, this paper introduces a novel approach to intermittent job failure detection using few-shot learning (FSL). Specifically, we fine-tune a small language model using a small number of manually labeled log examples to generate rich embeddings, which are then used to train an ML classifier. Our FSL-based approach achieves 70-88% F1-score with only 12 shots in all projects, outperforming the SOTA, which proved ineffective (34-52% F1-score) in 4 projects. Overall, this study underlines the importance of data quality over quantity and provides a more efficient and practical framework for the detection of intermittent job failures in organizations.
开发者在使用连续整合(CI)和部署管道方面所面临的主要挑战之一是,由于意外的非决定性问题(例如,片状测试或基础设施问题)而不是经常的代码错误(如错误),在使用连续整合(CI)和部署管道方面出现了间歇性工作失灵现象,这是开发了机器学习(ML)模型,在大量工作日志数据集方面进行了培训,将工作失灵分类为间或定期的。作为人工标记大型数据集成本高昂的替代方法,最先进的(SOTA)方法利用了基于非确定性重新运行的工作失灵现象。然而,这种方法错误地将间歇性工作失灵标记为常规,在重新处理可疑工作失灵并非明确政策的情况下,因此限制了SOTA的实际绩效。事实上,我们对5个工业项目和1个开放源项目中的2 125个工作失灵分类进行人工分析,显示平均有32%的间歇性工作失灵标记为常规的。为了解决这些局限,本文介绍了一种利用少样本学习(FSL)检测间歇性工作失败的新方法。具体而言,我们使用少量人工标记的日志示例对小型语言模型进行微调,以生成丰富的嵌入,然后用于训练ML分类器。我们基于FSL的方法在所有项目中仅用12个样本即可达到70-88%的F1分数,优于SOTA,后者在4个项目中被证明无效(34-52%的F1分数)。总体而言,这项研究强调了数据质量重于数量,并为组织检测间歇性工作失败提供了更高效、更实用的框架。
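To see why a rerun-based labeling heuristic can mislabel intermittent failures, consider the following minimal sketch. It encodes one plausible reading of such a heuristic, which is an assumption and not the exact rule used by the SOTA approach the abstract refers to: a failed job counts as intermittent only if the same job on the same commit also succeeded at some point. The `JobRun` record and `label_failures` function are hypothetical names.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple


@dataclass(frozen=True)
class JobRun:
    """Hypothetical CI job record used only for this illustration."""
    job_name: str
    commit: str
    status: str  # "success" or "failed"


def label_failures(runs: Iterable[JobRun]) -> Dict[Tuple[str, str], str]:
    """Label each failed (job, commit) pair with the assumed rerun heuristic."""
    outcomes: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for run in runs:
        outcomes[(run.job_name, run.commit)].add(run.status)

    labels: Dict[Tuple[str, str], str] = {}
    for key, statuses in outcomes.items():
        if "failed" in statuses:
            # Same code, different outcomes: the failure was non-deterministic.
            labels[key] = "intermittent" if "success" in statuses else "regular"
    return labels


runs = [
    JobRun("unit-tests", "abc123", "failed"),
    JobRun("unit-tests", "abc123", "success"),  # rerun on the same commit
    JobRun("deploy", "def456", "failed"),       # never rerun
]
labels = label_failures(runs)
```

In this toy data the `deploy` failure is labeled `regular` simply because nobody reran it, even if its true cause was an infrastructure hiccup. That is the mislabeling scenario the abstract describes for projects without an explicit rerun policy.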
Article 54
Title@2025-07-08 (2): Prompt Migration: Stabilizing GenAI Applications with Evolving Large Language Models
Title: Prompt Migration: Stabilizing GenAI Applications with Evolving Large Language Models | Prompt Migration: Stabilisierung von GenAI-Anwendungen mit sich entwickelnden großen Sprachmodellen | 迅速移徙:稳定GenAI应用程序与不断演变的大型语言模式 2507.05573v1 |
Authors (5): Shivani Tripathi, Pushpanjali Nema, Aditya Halder, Shi Qiao, Alekh Jindal
Generative AI is transforming business applications by enabling natural language interfaces and intelligent automation. However, the underlying large language models (LLMs) are evolving rapidly and so prompting them consistently is a challenge. This leads to inconsistent and unpredictable application behavior, undermining the reliability that businesses require for mission-critical workflows. In this paper, we introduce the concept of prompt migration as a systematic approach to stabilizing GenAI applications amid changing LLMs. Using the Tursio enterprise search application as a case study, we analyze the impact of successive GPT model upgrades, detail our migration framework including prompt redesign and a migration testbed, and demonstrate how these techniques restore application consistency. Our results show that structured prompt migration can fully recover the application reliability that was lost due to model drift. We conclude with practical lessons learned, emphasizing the need for prompt lifecycle management and robust testing to ensure dependable GenAI-powered business applications.
人工智能正在通过自然语言界面和智能自动化改造商业应用程序。然而,基础大型语言模型(LLMs)正在迅速发展,并不断催化它们,这是一个挑战。这导致应用行为前后不一且不可预测,削弱了企业对任务关键工作流程所要求的可靠性。在本文件中,我们引入了迅速迁移的概念,作为在改变LMs时稳定GenAI应用程序的系统方法。我们利用Tursio企业搜索应用程序进行案例研究,分析连续更新GPT模型的影响,详细分析我们的移民框架,包括迅速重新设计和移民测试,并展示这些技术如何恢复应用的一致性。我们的结果显示,结构化快速迁移可以完全恢复由于模型漂移而丧失的应用可靠性。我们总结了实际的经验教训,强调需要迅速的生命周期管理和强有力的测试,以确保GenAI动力企业应用程序具有可依赖性。
Article 55
Title@2025-07-08 (2): Search-based Selection of Metamorphic Relations for Optimized Robustness Testing of Large Language Models
Title: Search-based Selection of Metamorphic Relations for Optimized Robustness Testing of Large Language Models | Suchebasierte Auswahl von Metamorphic Relations für optimierte Robustheitsprüfung von großen Sprachmodellen | 以搜索为基础选择大语言模式优化强力测试的变形关系 2507.05565v1 |
Authors (3): Sangwon Hyun, Shaukat Ali, M. Ali Babar
Assessing the trustworthiness of Large Language Models (LLMs), such as robustness, has garnered significant attention. Recently, metamorphic testing that defines Metamorphic Relations (MRs) has been widely applied to evaluate the robustness of LLM executions. However, the MR-based robustness testing still requires a scalable number of MRs, thereby necessitating the optimization of selecting MRs. Most extant LLM testing studies are limited to automatically generating test cases (i.e., MRs) to enhance failure detection. Additionally, most studies only considered a limited test space of single perturbation MRs in their evaluation of LLMs. In contrast, our paper proposes a search-based approach for optimizing the MR groups to maximize failure detection and minimize the LLM execution cost. Moreover, our approach covers the combinatorial perturbations in MRs, facilitating the expansion of test space in the robustness assessment. We have developed a search process and implemented four search algorithms: Single-GA, NSGA-II, SPEA2, and MOEA/D with novel encoding to solve the MR selection problem in the LLM robustness testing. We conducted comparative experiments on the four search algorithms along with a random search, using two major LLMs with primary Text-to-Text tasks. Our statistical and empirical investigation revealed two key findings: (1) the MOEA/D algorithm performed the best in optimizing the MR space for LLM robustness testing, and (2) we identified silver bullet MRs for the LLM robustness testing, which demonstrated dominant capabilities in confusing LLMs across different Text-to-Text tasks. In LLM robustness assessment, our research sheds light on the fundamental problem for optimized testing and provides insights into search-based solutions.
评估大语言模型(LLMS)的可信度,例如稳健性,引起了人们的极大关注。最近,对确定超常关系(MRs)的超常测试广泛应用了用来评价LLM处决的稳健性。然而,基于MR的稳健性测试仍然需要一个可扩缩的MR数量,从而需要优化选择MR。大多数现有的LMM测试研究限于自动生成测试案例(即MRs),以加强检测失败。此外,大多数研究仅考虑在评估LMS时,单一过错性MR的测试空间空间有限。相比之下,我们的文件建议采用基于搜索的方法优化MRM组,以最大限度地检测失败并尽量减少LLMM执行费用。 此外,我们的方法包括了MRMs的组合性干扰,从而便利了在稳健性评估中扩大测试测试测试测试测试测试测试空间测试空间。我们开发了一个搜索程序,并实施了四种基于搜索算法的搜索算法:单一-GA、NSGA-II、SREA2和MOEA/D与新编码,以在LM的轻度搜索中解决了LM稳健性测试中,我们进行了两次主要测试。我们进行了4次的S-LM的搜索测试。我们用主要测试。我们进行了研究实验进行了4次主要测试。
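The two objectives named in the abstract above, maximizing failure detection and minimizing LLM execution cost, can be made concrete with a small sketch. It enumerates every MR subset by brute force and keeps the Pareto-optimal ones; this is a stand-in for illustration only and is not one of the four search algorithms (Single-GA, NSGA-II, SPEA2, MOEA/D) used in the paper. The MR names, detected-failure sets, and costs are invented toy data.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Set, Tuple

Candidate = Tuple[Tuple[str, ...], int, int]  # (MR subset, failures found, cost)


def evaluate(
    subset: Sequence[str],
    detected: Dict[str, Set[int]],
    cost: Dict[str, int],
) -> Tuple[int, int]:
    """Return (number of distinct failures found, total execution cost)."""
    found = set().union(*(detected[m] for m in subset))
    return len(found), sum(cost[m] for m in subset)


def pareto_front(
    mrs: Sequence[str],
    detected: Dict[str, Set[int]],
    cost: Dict[str, int],
) -> List[Candidate]:
    """Keep subsets that no other subset beats on both objectives."""
    candidates: List[Candidate] = []
    for k in range(1, len(mrs) + 1):
        for subset in combinations(mrs, k):
            candidates.append((subset, *evaluate(subset, detected, cost)))

    front = []
    for subset, found, total in candidates:
        dominated = any(
            f2 >= found and c2 <= total and (f2 > found or c2 < total)
            for _, f2, c2 in candidates
        )
        if not dominated:
            front.append((subset, found, total))
    return front


# Toy data (assumption): which test inputs each MR makes fail, and its cost.
mrs = ["synonym", "typo", "reorder"]
detected = {"synonym": {1, 2}, "typo": {2, 3}, "reorder": {1}}
cost = {"synonym": 2, "typo": 2, "reorder": 1}
front = pareto_front(mrs, detected, cost)
```

Exhaustive enumeration grows exponentially with the number of MRs, which is one reason a search-based approach is attractive once the MR space includes combinations of perturbations.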
Article 56
Title@2025-07-07 (1): Tool for Supporting Debugging and Understanding of Normative Requirements Using LLMs
Title: Tool for Supporting Debugging and Understanding of Normative Requirements Using LLMs | Tool zur Unterstützung von Debugging und Verständnis für normative Anforderungen mit LLMs | 支持调试和了解使用LLMs的规范要求的工具 2507.05504v1 |
Authors (3): Alex Kleijwegt, Sinem Getir Yaman, Radu Calinescu
Normative requirements specify social, legal, ethical, empathetic, and cultural (SLEEC) norms that must be observed by a system. To support the identification of SLEEC requirements, numerous standards and regulations have been developed. These requirements are typically defined by stakeholders in the non-technical system with diverse expertise (e.g., ethicists, lawyers, social scientists). Hence, ensuring their consistency and managing the requirement elicitation process are complex and error-prone tasks. Recent research has addressed this challenge using domain-specific languages to specify normative requirements as rules, whose consistency can then be analyzed with formal methods. Nevertheless, these approaches often present the results from formal verification tools in a way that is inaccessible to non-technical users. This hinders understanding and makes the iterative process of eliciting and validating these requirements inefficient in terms of both time and effort. To address this problem, we introduce SLEEC-LLM, a tool that uses large language models (LLMs) to provide natural-language interpretations for model-checking counterexamples corresponding to SLEEC rule inconsistencies. SLEEC-LLM improves the efficiency and explainability of normative requirements elicitation and consistency analysis. To demonstrate its effectiveness, we summarise its use in two real-world case studies involving non-technical stakeholders.
规范要求具体规定了系统必须遵守的社会、法律、伦理、道德、同情和文化(SLEEC)规范,为支持确定SLEEC要求,制定了许多标准和条例,这些要求通常由具有不同专长的非技术系统利益攸关方(例如伦理学家、律师、社会科学家)界定,因此,确保其一致性和管理需求引导过程是复杂和容易出错的任务。最近的研究利用特定领域的语言解决了这一挑战,将规范要求作为规则,然后可以正式方法分析其一致性。然而,这些方法往往以非技术用户无法使用的方式介绍正式核查工具的结果。这妨碍理解,并使反复提出和确认这些要求的过程在时间和努力上都效率低下。为解决这一问题,我们引入SLEEC-LLM,这是一个使用大语言模型提供自然语言解释的工具,用于核对与SLEEC规则不一致的对应的反标本。SLEEC-LLEM改进了规范要求的效能和可解释性,在涉及实际技术效益和一致性分析的案例研究中,SLEEC-LM改进了规范要求的效能和可解释性。
Article 57
Title@2025-07-07 (1): Towards Exception Safety Code Generation with Intermediate Representation Agents Framework
Title: Towards Exception Safety Code Generation with Intermediate Representation Agents Framework | Auf dem Weg zur Generierung von Ausnahme-Sicherheitscodes mit dem Rahmen für Mittlere Vertretungen | 建立具有中间代表代理机构框架的例外安全法规生成框架 2410.06949v3 |
Authors (4): Xuanming Zhang, Yuxuan Chen, Yuan Yuan, Minlie Huang
Large Language Models (LLMs) often struggle with robust exception handling in generated code, leading to fragile programs that are prone to runtime errors. We propose Seeker, a novel multi-agent framework that enforces exception safety in LLM-generated code through an Intermediate Representation (IR) approach. Seeker decomposes exception handling into five specialized agents (Scanner, Detector, Predator, Ranker, and Handler) that collaboratively analyze code, detect fragile segments, retrieve best-practice exception strategies, and inject robust handling code. We also introduce Common Exception Enumeration (CEE), a comprehensive knowledge base derived from official documentation, technical practices, and real-world code, to standardize exception handling strategies. Seeker also incorporates a Deep Retrieval-Augmented Generation (Deep RAG) algorithm to efficiently navigate the exception inheritance hierarchy, cutting down search overhead by 93% while improving accuracy in identifying relevant exceptions. We evaluate Seeker on 15 open-source Java projects and multiple benchmarks. Seeker outperforms state-of-the-art baselines, improving exception handling precision by up to 37% and overall code robustness by 38% as measured by expert code review. It significantly closes the gap between LLM and human developers in exception management, achieving a 28% success rate on real-world issue fixes (SWE bench) versus 19% by prior methods. Our framework preserves functional correctness of code while proactively handling errors, demonstrating a practical, generalizable solution for safer code generation. In this paper, we discuss the novelty of using intermediate representation and multi-agent collaboration for exception handling, and outline how Seeker can be extended to other programming languages and complex software engineering tasks, aligning LLM-generated code with industrial standards.
大型语言模型(LLMS) 通常在生成代码中以强力的例外处理方式挣扎,导致容易发生时间错误的脆弱程序。我们提议Seeker,这是一个执行LLM生成代码例外安全的新多剂框架,通过中间代表制(IR)方法执行LLM生成代码的例外安全。Seerker分解了对五种专门代理商的例外处理:Scanner, 探测器, 捕捉者, 掠夺者, 兰克和Handler, 合作分析代码, 检测脆弱部分, 检索最佳做法例外战略, 并输入强有力的处理代码。 我们还引入了共同例外计算(CEEE), 一个来自官方文件、技术做法和真实世界代号的综合性知识库, 将例外处理战略标准化。 Seeker还包含一个深度回收率生成(Deep RAGAG) 的算法, 以高效地引导例外等级等级结构, 将搜索费削减93%, 提高相关例外的精确度。我们评估15个开放源 Java项目和多个基准Sester 的Sefor 。 软件比值 , 在软件基线中, 改进到更安全的特性基线中, 改进的例外处理, 37%和整个的精确度处理37%和37%的精确度, 和38的代码。
Article 58
Title@2025-07-07 (1): An Investigation into Maintenance Support for Neural Networks
Title: An Investigation into Maintenance Support for Neural Networks | Eine Untersuchung der Instandhaltungsunterstützung für neurale Netze | 对神经网络维护支助的调查 2507.05245v1 |
Authors (2): Fatema Tuz Zohra, Brittany Johnson
As the potential for neural networks to augment our daily lives grows, ensuring their quality through effective testing, debugging, and maintenance is essential. This is especially the case as we acknowledge the prospects of negative impacts from these technologies. Traditional software engineering methods, such as testing and debugging, have proven effective in maintaining software quality; however, they reveal significant research and practice gaps in maintaining neural networks. In particular, there is a limited understanding of how practitioners currently address challenges related to understanding and mitigating undesirable behaviors in neural networks. In our ongoing research, we explore the current state of research and practice in maintaining neural networks by curating insights from practitioners through a preliminary study involving interviews and supporting survey responses. Our findings thus far indicate that existing tools primarily concentrate on building and training models. While these tools can be beneficial, they often fall short of supporting practitioners’ understanding and addressing the underlying causes of unexpected model behavior. By evaluating current procedures and identifying the limitations of traditional methodologies, our study aims to offer a developer-centric perspective on where current practices fall short and highlight opportunities for improving maintenance support in neural networks.
随着神经网络扩大我们日常生活的潜力增加,通过有效测试、调试和维护确保质量的潜力增加,通过有效测试、调试和维护来保证神经网络的质量至关重要。当我们承认这些技术的负面影响前景时,情况尤其如此。传统的软件工程方法,例如测试和调试,在维护软件质量方面证明是有效的;然而,这些方法揭示出在维护神经网络方面存在的重大研究和实践差距,特别是,对从业者目前如何应对与理解和减轻神经网络中的不良行为有关的挑战了解有限。在我们正在进行的研究中,我们探索目前维持神经网络的研究和做法的现状,通过涉及访谈和支持调查反应的初步研究,从从从业者那里总结见解。迄今为止,我们的调查结果表明,现有工具主要集中于建设和培训模型。虽然这些工具可能是有益的,但它们往往不足以支持从业者理解和解决意外模式行为的根本原因。通过评估目前的程序和查明传统方法的局限性,我们的研究旨在从发展角度审视当前做法的缺点,并突出改善神经网络维护支持的机会。
Article 59
Title@2025-07-07 (1): React-tRace: A Semantics for Understanding React Hooks
Title: React-tRace: A Semantics for Understanding React Hooks | React-tRace: Eine Semantik zum Verständnis von React Hooks | React-trace:理解反应钩的语义 2507.05234v1 |
Authors (3): Jay Lee, Joongwon Ahn, Kwangkeun Yi
React has become the most widely used web front-end framework, enabling the creation of user interfaces in a declarative and compositional manner. Hooks are a set of APIs that manage side effects in functional components in React. However, their semantics are often seen as opaque to developers, leading to UI bugs. In this paper, we formalize the essence of React Hooks as a semantics we name React-tRace, providing a framework that clarifies their behavior. We demonstrate that our model captures the behavior of React by theoretically showing that it embodies essential properties of Hooks and by empirically comparing our React-tRace-definitional interpreter against a test suite. Furthermore, we showcase a practical visualization tool based on the formalization to demonstrate how developers can better understand the semantics of Hooks.
重新反应已成为最广泛使用的网络前端框架, 使得能够以声明和组成方式创建用户界面。 钩子是一套控制反应功能组件副作用的API。 但是, 它们的语义对于开发者来说常常被视为不透明, 导致 UI 错误。 在本文中, 我们正式确定了我们命名的React- trace 的精髓的语义, 提供了一个澄清其行为的框架 。 我们证明我们的模型捕捉了反应行为, 从理论上表明它体现了 Hooks 的基本特性, 从经验上比较了我们的 React- trace- 定义解释器与测试套件。 此外, 我们展示了一个基于正规化的实用可视化工具, 以演示开发者如何更好地理解虎克的语义。
Article 60
Title@2025-07-07 (1): Exploring Empathy in Software Engineering: Insights from a Grey Literature Analysis of Practitioners’ Perspectives
Title: Exploring Empathy in Software Engineering: Insights from a Grey Literature Analysis of Practitioners’ Perspectives | Einfühlungsvermögen in der Software-Engineering: Einblicke aus einer grauen Literaturanalyse der Perspektiven von Praktizierenden | 探索软件工程的漠视:对从业者观点的灰色文学分析的展望 2507.05325v1 |
Authors (8): Lidiany Cerqueira, João Pedro Bastos, Danilo Neves, Glauco Carneiro, Rodrigo Spínola, Sávio Freire, José Amancio Macedo Santos, Manoel Mendonça
Context. Empathy, a key social skill, is essential for communication and collaboration in SE but remains an under-researched topic. Aims. This study investigates empathy in SE from practitioners’ perspectives, aiming to characterize its meaning, identify barriers, discuss practices to overcome them, and explore its effects. Method. A qualitative content analysis was conducted on 55 web articles from DEV and Medium, two communities widely used by practitioners. To strengthen our findings, we conducted a follow-up survey with empathy experts. Results. The study proposes a definition of empathy in SE, identifies barriers such as toxic culture and excessive technical focus, practices to foster empathy in teams, and outcomes, including improved collaboration, communication, and reduced anxiety, frustration, and stress. These findings are synthesized into a conceptual framework. Conclusion. Survey results indicate the framework is clear, valuable, and raises empathy awareness, with suggestions for improvements and integration into training. This study paves the way for improving team dynamics by addressing barriers and offering strategies to cultivate empathy. Future work will explore empathy’s broader implications in SE practice.
该研究从从业人员的角度对东南欧的同情心进行了调查,旨在说明其含义,查明障碍,讨论克服障碍的做法,并探讨其影响。方法:对来自DEV和Mente这两个从业人员广泛使用的社区的55篇网络文章进行了质量内容分析。为了加强我们的研究成果,我们与同情专家进行了后续调查。结果。该研究提议了东南欧的同情心定义,确定了各种障碍,如有毒文化和过度的技术重点,培养团队同情心的做法,以及成果,包括改善协作、沟通和减少焦虑、沮丧和压力。这些结论:调查结果表明该框架明确、宝贵、提高同情心意识,并提出了改进和融入培训的建议。这项研究为通过消除障碍和提供培养同情心的战略改善团队的动态铺平了道路。未来的工作将探讨同情心在东南欧实践中的更广泛影响。
Article 61
Title@2025-07-07 (1): In-Context Learning as an Effective Estimator of Functional Correctness of LLM-Generated Code
Title: In-Context Learning as an Effective Estimator of Functional Correctness of LLM-Generated Code | In-Context Learning als effektiver Estimator der funktionalen Korrektheit des LLM-generierten Codes | 内文学习作为LLM-创用守则功能正确性的有效模拟器 2507.05200v1 |
Authors (5): Susmita Das, Madhusudan Ghosh, Priyanka Swami, Debasis Ganguly, Gul Calikli
When applying LLM-based code generation to software development projects that follow a feature-driven or rapid application development approach, it becomes necessary to estimate the functional correctness of the generated code in the absence of test cases. Just as a user selects a relevant document from a ranked list of retrieved ones, a software generation workflow requires a developer to choose (and potentially refine) a generated solution from a ranked list of alternative solutions, ordered by their posterior likelihoods. This implies that estimating the quality of a ranked list – akin to estimating “relevance” for query performance prediction (QPP) in IR – is also crucial for generative software development, where quality is defined in terms of “functional correctness”. In this paper, we propose an in-context learning (ICL) based approach for code quality estimation. Our findings demonstrate that providing few-shot examples of functionally correct code from a training set enhances the performance of existing QPP approaches as well as a zero-shot-based approach for code quality estimation.
在应用基于LLM的代码生成软件开发项目时,如果采用基于特性的或快速的应用开发方法,则有必要估计在没有测试案例的情况下生成代码的功能正确性。正如用户从排序清单中选择了相关文件,软件生成工作流程要求开发者选择(并可能改进)按其后生可能性订购的排序的替代解决方案清单所产生的解决方案。这意味着估计排名清单的质量 – – 类似于估算IR查询性能预测(QPP)的 “ 相关性 “ – – 对基因化软件开发也至关重要,因为该软件的质量是以 “ 功能正确性 “ 界定的。在本文中,我们建议了基于拼写性学习(ICL)的代码质量估算方法。我们的调查结果表明,从培训组合中提供几张功能正确代码实例,可以提高现有的QP方法的性能,以及代码质量估计的零光基方法。
Article 62
Title@2025-07-07 (1): Deep Learning Framework Testing via Model Mutation: How Far Are We?
Title: Deep Learning Framework Testing via Model Mutation: How Far Are We? | Deep Learning Framework Testing über Modellmutation: Wie weit sind wir? | 通过模型变异进行深层次学习框架测试:我们有多远? 2506.17638v2 |
Authors (10): Yanzhou Mu, Rong Wang, Juan Zhai, Chunrong Fang, Xiang Chen, Zhiyuan Peng, Peiran Yang, Ruixiang Qian, Shaoyu Yang, Zhenyu Chen
Deep Learning (DL) frameworks are a fundamental component of DL development. Therefore, the detection of DL framework defects is important and challenging. As one of the most widely adopted DL testing techniques, model mutation has recently gained significant attention. In this study, we revisit the defect detection ability of existing mutation-based testing methods and investigate the factors that influence their effectiveness. To begin with, we reviewed existing methods and observed that many of them mutate DL models (e.g., changing their parameters) without any customization, ignoring the unique challenges in framework testing. Another issue with these methods is their limited effectiveness, characterized by a high rate of false positives caused by illegal mutations arising from the use of generic, non-customized mutation operators. Moreover, we tracked the defects identified by these methods and discovered that most of them were ignored by developers. Motivated by these observations, we investigate the effectiveness of existing mutation-based testing methods in detecting important defects that have been authenticated by framework developers. We begin by collecting defect reports from three popular frameworks and classifying them based on framework developers’ ratings to build a comprehensive dataset. We then perform an in-depth analysis to uncover valuable insights. Based on our findings, we propose optimization strategies to address the shortcomings of existing approaches. Following these optimizations, we identified seven new defects, four of which were confirmed by developers as high-priority issues, with three resolved. In summary, we identified 39 unique defects across just 23 models, of which 31 were confirmed by developers, and eight have been fixed.
深学习(DL)框架是DL开发的基本组成部分。因此,发现DL框架缺陷很重要而且具有挑战性。作为最广泛采用的DL测试技术之一,模型突变最近引起了极大关注。在本研究中,我们重新审视了现有突变测试方法的缺陷检测能力,并调查了影响其有效性的因素。首先,我们审查了现有方法,发现许多现有突变测试方法(如改变参数)在不作任何定制的情况下变换DL模型(如改变其参数),无视框架测试中的独特挑战。这些方法的另一个问题是效力有限,其特点是使用通用的、非定制的变异操作者造成的非法突变率过高。此外,我们跟踪了这些方法所发现的缺陷,发现其中多数为开发者所忽视。我们根据这些观察,调查了现有突变测试方法在发现框架开发者所验证的重要缺陷方面的有效性。我们首先从三个流行框架收集缺陷报告,并根据框架开发者的评级对其进行分类,以构建一个全面的数据集为特征。我们随后提出了一个有8个高精度的精确度分析,然后我们用8个高精度的精确度分析。我们用3个新的精度来发现。
Article 63
Title@2025-07-07 (1): Understanding Everything as Code: A Taxonomy and Conceptual Model
Title: Understanding Everything as Code: A Taxonomy and Conceptual Model | Alles als Code verstehen: Ein Taxonomie- und Konzeptmodell | 将一切理解为守则:分类和概念模式 2507.05100v1 |
Authors (3): Haoran Wei, Nazim Madhavji, John Steinbacher
Background: Everything as Code (EaC) is an emerging paradigm aiming to codify all aspects of modern software systems. Despite its growing popularity, comprehensive industry standards and peer-reviewed research clarifying its scope and guiding its adoption remain scarce. Aims: This study systematically analyzes existing knowledge and perceptions of EaC, clarifies its scope and boundaries, and provides structured guidance for researchers and practitioners. Method: We conducted a large-scale multivocal literature review (MLR), synthesizing academic and grey literature sources. Findings were analyzed quantitatively and thematically. Based on this analysis, we developed a taxonomy and conceptual model of EaC, validated through collaboration with industry experts. Results: The resulting taxonomy comprises 25 distinct EaC practices organized into six layers based on industry awareness and functional roles. The conceptual model illustrates focus areas, overlaps, and interactions among these EaC practices within the software delivery lifecycle. Additionally, practical code examples demonstrating the implementation of these practices were developed in collaboration with industry experts. Conclusions: This work addresses the current scarcity of academic discourse on EaC by providing the first comprehensive taxonomy and conceptual model. These contributions enhance conceptual clarity, offer actionable guidance to practitioners, and lay the groundwork for future research in this emerging domain.
目标:本研究系统地分析目前对《守则》的了解和看法,澄清其范围和界限,并为研究人员和从业人员提供结构化指导。方法:我们进行了大规模的多语言文献审查,对学术和灰色文献来源进行了综合分析,对研究结果进行了定量和专题分析。根据这项分析,我们开发了《守则》的分类学和概念模型,并通过与行业专家的合作加以验证。结果:由此产生的分类学包括25种不同的《分类学》做法,这些做法以行业认识和功能作用为基础,分为六个层次。概念模型说明了这些《分类学》做法在软件提供生命周期内的重点领域、重叠和互动。此外,我们与行业专家合作开发了实践守则范例,以证明这些做法的实施。结论:这项工作通过为未来实践者提供第一个全面分类学和概念模型,解决了目前关于《分类学》的学术讨论的稀缺问题。这些贡献加强了概念基础性研究,为未来实践者提供了基础基础性研究、基础性行动。
Article 64
Title@2025-07-07 (1): OASBuilder: Generating OpenAPI Specifications from Online API Documentation with Large Language Models
Title: OASBuilder: Generating OpenAPI Specifications from Online API Documentation with Large Language Models | OASBuilder: OpenAPI-Spezifikationen aus Online-API-Dokumentation mit großen Sprachmodellen generieren | OASBuilder:从带有大语言模式的在线API文件生成开放API规格 2507.05316v1 |
Authors (13): Koren Lazar, Matan Vetzler, Kiran Kate, Jason Tsay, David Boaz, Himanshu Gupta, Avraham Shinnar, Rohith D Vallam, David Amid, Esther Goldbraich, Guy Uziel, Jim Laredo, Ateret Anaby Tavor
AI agents and business automation tools interacting with external web services require standardized, machine-readable information about their APIs in the form of API specifications. However, the information about APIs available online is often presented as unstructured, free-form HTML documentation, requiring external users to spend significant time manually converting it into a structured format. To address this, we introduce OASBuilder, a novel framework that transforms long and diverse API documentation pages into consistent, machine-readable API specifications. This is achieved through a carefully crafted pipeline that integrates large language models and rule-based algorithms which are guided by domain knowledge of the structure of documentation webpages. Our experiments demonstrate that OASBuilder generalizes well across hundreds of APIs, and produces valid OpenAPI specifications that encapsulate most of the information from the original documentation. OASBuilder has been successfully implemented in an enterprise environment, saving thousands of hours of manual effort and making hundreds of complex enterprise APIs accessible as tools for LLMs.
AI代理商和与外部网络服务互动的商业自动化工具要求以API规格的形式提供关于其API的标准化、机器可读信息,然而,关于在线上的API的信息往往以非结构化、自由格式的HTML文件的形式提供,要求外部用户花费大量时间手工将其转换成结构化的格式。为此,我们引入了OASBuilder,这是一个新颖的框架,将长而多样的API文件页面转化为一致、机器可读API规格。这是通过精心设计的管道实现的,该管道将大型语言模型和基于规则的算法整合在一起,以文件网页结构的域知识为指导。我们的实验表明,OASBurder对数百个API文件进行了全面概括,并制作了有效的OpenAPI规格,将原始文件的大部分信息包在企业环境中。OASBuilder在企业环境中成功实施了,节省了数千小时的手工工作,并使数百个复杂的企业API成为LMS的工具。
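To make "valid OpenAPI specifications" more concrete, here is a minimal sketch of the structural check such an output must at least pass. It assumes OpenAPI 3.0.x, where the root requires `openapi`, `info`, and `paths`, `info` requires `title` and `version`, and every operation requires `responses`. The example API (`/pets/{petId}`) and the helper name `missing_required_fields` are hypothetical and are not taken from OASBuilder.

```python
from typing import Any, Dict, List

HTTP_METHODS = {"get", "put", "post", "delete", "options", "head", "patch", "trace"}

# Hypothetical, hand-written specification used only for illustration.
example_spec: Dict[str, Any] = {
    "openapi": "3.0.3",
    "info": {"title": "Pet API (illustrative)", "version": "1.0.0"},
    "paths": {
        "/pets/{petId}": {
            "get": {
                "parameters": [
                    {"name": "petId", "in": "path", "required": True,
                     "schema": {"type": "string"}}
                ],
                "responses": {"200": {"description": "A single pet"}},
            }
        }
    },
}


def missing_required_fields(spec: Dict[str, Any]) -> List[str]:
    """Return dotted names of required OpenAPI 3.0.x fields absent from spec."""
    problems: List[str] = []
    for key in ("openapi", "info", "paths"):
        if key not in spec:
            problems.append(key)
    info = spec.get("info", {})
    for key in ("title", "version"):
        if key not in info:
            problems.append(f"info.{key}")
    for path, item in spec.get("paths", {}).items():
        for method, operation in item.items():
            # Only HTTP-method keys are operations; skip e.g. "parameters".
            if method in HTTP_METHODS and "responses" not in operation:
                problems.append(f"paths.{path}.{method}.responses")
    return problems
```

This covers only a handful of required fields. A full validator would check the document against the complete OpenAPI schema.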
Article 65
Title@2025-07-07 (1): AI for the Routine, Humans for the Complex: Accuracy-Driven Data Labelling with Mixed Integer Linear Programming
Title: AI for the Routine, Humans for the Complex: Accuracy-Driven Data Labelling with Mixed Integer Linear Programming | KI für die Routine, Menschen für den Komplex: Genauigkeitsgetriebene Datenkennzeichnung mit gemischter Integer-Linear-Programmierung | 常规、人类为综合体:精确-驱动数据标签与混合整数线性线性编程 2507.04990v1 |
Authors (3): Mohammad Hossein Amini, Mehrdad Sabetzadeh, Shiva Nejati
The scarcity of accurately labelled data remains a major challenge in deep learning (DL). Many DL approaches rely on semi-supervised methods, which focus on constructing large datasets that require only a minimal amount of human-labelled data. Since DL training algorithms can tolerate moderate label noise, it has generally been acceptable for the accuracy of labels in large training datasets to fall well short of a perfect 100%. However, when it comes to testing DL models, achieving high label accuracy, as close to 100% as possible, is paramount for reliable verification. In this article, we introduce OPAL, a human-assisted labelling method that can be configured to target a desired accuracy level while minimizing the manual effort required for labelling. The main contribution of OPAL is a mixed-integer linear programming (MILP) formulation that minimizes labelling effort subject to a specified accuracy target. We evaluate OPAL for two tasks in the context of testing vision systems: automatic labelling of test data and automated validation of test data. Our evaluation, based on more than 2500 experiments performed on seven datasets, comparing OPAL with eight baseline methods, shows that OPAL, relying on its MILP formulation, achieves an average accuracy of 98.8%, just 1.2 percentage points below perfect accuracy, while cutting manual labelling by more than half. Further, OPAL significantly outperforms automated labelling baselines in labelling accuracy across all seven datasets, with large effect sizes, when all methods are provided with the same manual-labelling budget. For automated test-input validation, on average, OPAL reduces manual effort by 28.8% while achieving 4.5% higher accuracy than the SOTA validation baselines. Finally, we show that augmenting OPAL with an active learning loop leads to an additional 4.5% reduction in required manual labelling, without compromising accuracy.
准确标签数据缺乏仍然是深层学习(DL)的一大挑战。 许多 DL 方法依赖半监督方法, 其重点是构建大型数据集, 只要求最低限度的人类标签数据。 由于 DL 培训算法可以容忍中度标签噪音, 大型培训数据集标签的准确性一般可以被接受, 远低于完美的100%。 然而, 当测试 DL 模型时, 达到高标签准确度, 可能接近100%, 这对于可靠的核查来说至关重要。 在本篇文章中, 我们引入了半监督方法, 即人手辅助标签方法, 其配置可以达到预期的准确度, 而只要求尽量减少贴标签所需的手工工作。 由于 DL 培训算法的主要贡献是混合整数线性编程(MILP ) , 使大型培训数据集的准确性大大低于规定的准确性。 我们评估OPAL 在测试系统范围内的两项任务: 测试数据的自动标签和测试数据的自动验证。 我们根据7个数据集进行的2500多项实验, 将OPAL 的精确度与8 的基线方法进行比较, 达到预期的准确性水平水平水平水平水平水平, 而OPAL 将达到8 的精确度为8 的精确度, 最终的精确度, 以低于 AL AL AL 标值 的精确度, 的精确度显示的精确度, AL AL AL AL AL AL AL AL 的精确度为整个的精确度 的精确度 的精确度, 的精确度, 的精确度, 的精确度
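The idea of minimizing manual labelling subject to an accuracy target can be illustrated with a deliberately simplified model. This is not OPAL's MILP formulation. It assumes that a manually labelled item is always correct and that an automatically labelled item `i` is correct with a known probability `confidences[i]`. Under those assumptions every manual label costs one unit and gains `1 - confidences[i]` in expected correct labels, so labelling the least confident items first is optimal and no solver is needed. The function name and inputs are hypothetical.

```python
from typing import List, Tuple


def choose_manual_labels(confidences: List[float], target: float) -> Tuple[List[int], float]:
    """Pick the fewest items to label by hand so expected accuracy >= target.

    Returns the indices to label manually and the resulting expected accuracy.
    Simplifying assumption: manual labels are always correct.
    """
    n = len(confidences)
    if n == 0:
        return [], 1.0
    # Least confident automatic labels first: they give the largest gain.
    order = sorted(range(n), key=lambda i: confidences[i])
    expected_correct = sum(confidences)
    manual: List[int] = []
    for i in order:
        if expected_correct / n >= target - 1e-12:
            break
        expected_correct += 1.0 - confidences[i]  # item i becomes certainly correct
        manual.append(i)
    return sorted(manual), expected_correct / n
```

For example, `choose_manual_labels([0.99, 0.6, 0.95, 0.7], 0.9)` selects only the item with confidence 0.6. A realistic formulation would add constraints that this greedy rule cannot express, which is where a MILP solver becomes useful.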
Article 66
Title@2025-07-07 (1): ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation
Title: ArtifactsBench: Bridging the Visual-Interactive Gap in LLM Code Generation Evaluation | ArtefakteBench: Überbrückung der visuell-interaktiven Lücke in der LLM-Codegenerierung | 人工合成:弥合LLM代码生成评估中的视觉互动差距 2507.04952v1 |
Authors (32): Chenchen Zhang, Yuhang Li, Can Xu, Jiaheng Liu, Ao Liu, Shihui Hu, Dengpeng Wu, Guanhua Huang, Kejiao Li, Qi Yi, Ruibin Xiong, Haotian Zhu, Yuanxing Zhang, Yuhao Jiang, Yue Zhang, Zenan Xu, Bohui Zhai, Guoxiang He, Hebin Li, Jie Zhao, Le Zhang, Lingyun Tan, Pengyu Guo, Xianshu Pang, Yang Ruan, Zhifeng Zhang, Zhonghu Wang, Ziyan Xu, Zuopu Yin, Wiggin Zhou, Chayse Zhou, Fengzong Lian
The generative capabilities of Large Language Models (LLMs) are rapidly expanding from static code to dynamic, interactive visual artifacts. This progress is bottlenecked by a critical evaluation gap: established benchmarks focus on algorithmic correctness and are blind to the visual fidelity and interactive integrity that define modern user experiences. To bridge this gap, we introduce ArtifactsBench, a new benchmark and paradigm for the automated, multimodal evaluation of visual code generation. Our framework programmatically renders each generated artifact and captures its dynamic behavior through temporal screenshots. This visual evidence, alongside the source code, is then assessed by a Multimodal LLM (MLLM)-as-Judge, which is rigorously guided by a fine-grained, per-task checklist to ensure holistic and reproducible scoring. We construct a new benchmark of 1,825 diverse tasks and evaluate over 30 leading LLMs. Our automated evaluation achieves a striking 94.4% ranking consistency with WebDev Arena, the gold-standard for human preference in web development, and over 90% pairwise agreement with human experts. This establishes ArtifactsBench as the first framework to reliably automate the assessment of human-perceived quality at scale. Our analysis provides a high-resolution map of the current SOTA, revealing that generalist models often outperform domain-specific ones. We open-source ArtifactsBench, including the benchmark, evaluation harness, and baseline results at https://artifactsbenchmark.github.io/, to provide the community with a scalable and accurate tool to accelerate the development of user-centric generative models.
大型语言模型(LLMS)的基因化能力正在迅速从静态代码向动态、互动的视觉工艺品扩展。 这个进步被一个关键的评价差距所阻碍: 既定基准侧重于算法正确性, 并且无视界定现代用户经验的视觉忠诚和互动完整性。 为了缩小这一差距, 我们引入了人工行为堡垒, 这是自动、 多式联运地评估视觉代码生成的新基准和范例。 我们的架构在程序上使每个生成的工艺品的排序与通过时间截图捕捕到其动态行为。 这个视觉证据与源代码一起, 由多模式LM(MLLM) 以加速- Jodge 来评估。 多模式以精细的精细精度、每份核对核对核对的检查清单为严格指导, 以确保整体和可复制的得分数。 我们的自动评价达到了惊人的94.4% 与WebDev Arena, 人类偏好的金质标准, 以及超过90%的与人类专家的对齐协议。 这个模型由精度的精度的精度、 直度/ Bechnialteralalalalalalalalal materalalalal com com com com com 提供。 这个模型的模型的模型的模型的模型的模型, 和直观的模型的模型, 直观的模型, 直观, 直观的模型, 和直径直径直径比的模型通常为我们的模型, 。
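One common way to quantify agreement between an automated judge and a human-derived ranking is the fraction of model pairs that both order the same way. The sketch below shows that computation. It is an illustration only, and the paper's exact consistency metric may be defined differently. The function name and the score dictionaries are hypothetical.

```python
from itertools import combinations
from typing import Dict


def pairwise_agreement(scores_a: Dict[str, float], scores_b: Dict[str, float]) -> float:
    """Fraction of model pairs that both scorers order the same way.

    Pairs tied under either scorer are skipped. Returns NaN if no pair is comparable.
    """
    shared = sorted(set(scores_a) & set(scores_b))
    agree = 0
    total = 0
    for m1, m2 in combinations(shared, 2):
        diff_a = scores_a[m1] - scores_a[m2]
        diff_b = scores_b[m1] - scores_b[m2]
        if diff_a == 0 or diff_b == 0:
            continue  # a tie carries no ordering information
        total += 1
        if (diff_a > 0) == (diff_b > 0):
            agree += 1
    return agree / total if total else float("nan")
```

With three models scored `{"m1": 3, "m2": 2, "m3": 1}` by one judge and `{"m1": 10, "m2": 5, "m3": 7}` by another, two of the three pairs are ordered identically, so the agreement is 2/3.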
Article 67
Title@2025-07-07 (1): Towards a Unifying Reference Model for Digital Twins of Cyber-Physical Systems
Title: Towards a Unifying Reference Model for Digital Twins of Cyber-Physical Systems | Auf dem Weg zu einem einheitlichen Referenzmodell für digitale Zwillinge von Cyber-Physischen Systemen | 建立网络-物理系统数字双对数字参照统一模式 2507.04871v1 |
Authors (7): Jerome Pfeiffer, Jingxi Zhang, Benoit Combemale, Judith Michael, Bernhard Rumpe, Manuel Wimmer, Andreas Wortmann
Digital twins are sophisticated software systems for the representation, monitoring, and control of cyber-physical systems, including automotive, avionics, smart manufacturing, and many more. Existing definitions and reference models of digital twins are overly abstract, which impedes a comprehensive understanding and offers little implementation guidance. Consequently, a significant gap emerges between abstract concepts and their industrial implementations. We analyze popular reference models for digital twins and combine them into a considerably more detailed unifying reference model that narrows the concept-implementation gap and facilitates the engineering of digital twins in industrial practice. This enhances the understanding of the concepts of digital twins and their relationships and guides developers to implement digital twins effectively.
数字双胞胎是代表、监测和控制包括汽车、航空、智能制造等在内的网络物理系统的先进软件系统。数字双胞胎的现有定义和参考模型过于抽象,妨碍了它们的全面理解和执行指导。因此,在抽象概念及其工业实施之间出现了巨大差距。我们分析了数字双胞胎的流行参考模型,并将这些模型合并为数字双胞胎的非常详细的统一参考模型,以缩小概念与实施之间的差距,促进其在工业实践中的工程。这增进了对数字双胞胎概念及其关系的理解,并指导开发者有效实施数字双胞胎。
Article 68
Title@2025-07-07 (1): Supporting Software Formal Verification with Large Language Models: An Experimental Study
Title: Supporting Software Formal Verification with Large Language Models: An Experimental Study | Unterstützung von Software Formale Verifikation mit großen Sprachmodellen: Eine experimentelle Studie | 支持有大语言模型的软件正式核查:实验研究 2507.04857v1 |
Authors (4): Weiqi Wang, Marie Farrell, Lucas C. Cordeiro, Liping Zhao
Formal methods have been employed for requirements verification for a long time. However, it is difficult to automatically derive properties from natural language requirements. SpecVerify addresses this challenge by integrating large language models (LLMs) with formal verification tools, providing a more flexible mechanism for expressing requirements. This framework combines Claude 3.5 Sonnet with the ESBMC verifier to form an automated workflow. Evaluated on nine cyber-physical systems from Lockheed Martin, SpecVerify achieves 46.5% verification accuracy, comparable to NASA’s CoCoSim, but with lower false positives. Our framework formulates assertions that extend beyond the expressive power of LTL and identifies falsifiable cases that are missed by more traditional methods. Counterexample analysis reveals CoCoSim’s limitations stemming from model connection errors and numerical approximation issues. While SpecVerify advances verification automation, our comparative study of Claude, ChatGPT, and Llama shows that high-quality requirements documentation and human monitoring remain critical, as models occasionally misinterpret specifications. Our results demonstrate that LLMs can significantly reduce the barriers to formal verification, while highlighting the continued importance of human-machine collaboration in achieving optimal results.
长期以来一直采用正规方法进行要求核查,但很难从自然语言要求中自动得出属性。通过将大型语言模型(LLMs)与正式的核查工具相结合,提供更灵活的表达要求的机制,来应对这一挑战。这个框架将Claude 3.5 Sonnet与ESBMC核查者结合起来,形成自动化工作流程。对洛克希德·马丁、SpecVerify的九个网络物理系统进行了评价,实现了46.5%的核查准确性,与美国航天局的CoCoSim相当,但假正数较低。我们的框架提出了超出LTL明示能力范围的说法,并查明了可伪造的、被较传统方法所忽略的案例。反比分析揭示了CoCoSim因模型连接错误和数字近似问题而产生的局限性。虽然我们有关Claude、ChateGPT和Llama的比较研究表明,高质量的要求文件和人文监测仍然十分关键,因为模型有时会误判。我们的结果表明,LLMs可以大大减少正式核查的障碍,同时强调在取得最佳结果方面继续合作的重要性。
Article 69
Title@2025-07-07 (1): A Note on Runtime Verification of Concurrent Systems
Title: A Note on Runtime Verification of Concurrent Systems | Eine Anmerkung zur Laufzeitverifizierung von Concurrent Systemen | 关于同步系统运行时核查的说明 2507.04830v1 |
Authors (1): Martin Leucker
To maximize the information gained from a single execution when verifying a concurrent system, one can derive all concurrency-aware equivalent executions and check them against linear specifications. This paper offers an alternative perspective on verification of concurrent systems by leveraging trace-based logics rather than sequence-based formalisms. Linear Temporal Logic over Mazurkiewicz Traces (LTrL) operates on partial-order representations of executions, meaning that once a single execution is specified, all equivalent interleavings are implicitly considered. This paper introduces a three-valued version of LTrL, indicating whether the execution of the concurrent system observed so far is correct, incorrect, or inconclusive, together with a suitable monitor synthesis procedure. To this end, the paper recalls a construction of trace-consistent Büchi automata for LTrL formulas and explains how to employ it in well-understood monitor synthesis procedures. The result is a monitor that yields the same verification verdict for any linearization of an observed trace.
为了在核查一个并行系统时最大限度地利用从单一执行中所获得的信息,人们可以得出所有具有等值货币价值的处决,并对照线性规格进行核对。本文件提供了通过利用基于追踪的逻辑而不是基于序列的正规主义来核查并行系统的另一种观点。Mazurkiewicz Traces(LtrL)的线性时间逻辑以部分顺序表示处决,这意味着一旦具体规定了一次处决,就隐含地考虑所有等值的干预。本文引入了三种有价值的LTrL版本,表明所观察到的对齐系统执行是否正确、不正确或无结果,以及适当的监测合成程序。为此,本文件回顾为Ltdr公式构建的追踪一致的B\\uchi自动数据,并解释如何在精心界定的情况下使用它来监测合成程序。通过这种方式,监测结果得出观察到的同一追踪结果的线性。
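The notion of "equivalent interleavings" can be made concrete with a small check for Mazurkiewicz trace equivalence. The sketch relies on the standard projection characterization: two words denote the same trace exactly when their projections onto every pair of dependent letters (including each single letter) coincide. It illustrates the underlying equivalence only, not the paper's monitor construction. Representing events as single characters and the name `trace_equivalent` are assumptions made for brevity.

```python
from itertools import combinations_with_replacement
from typing import FrozenSet, Set


def trace_equivalent(u: str, v: str, dependent: Set[FrozenSet[str]]) -> bool:
    """Check whether two words are linearizations of the same Mazurkiewicz trace.

    dependent holds frozenset({a, b}) for each pair of distinct dependent
    letters. Every letter is treated as dependent on itself.
    """
    alphabet = sorted(set(u) | set(v))
    for a, b in combinations_with_replacement(alphabet, 2):
        if a != b and frozenset((a, b)) not in dependent:
            continue  # independent letters may be reordered freely
        keep = {a, b}
        # Compare the projections of both words onto the letters {a, b}.
        if [x for x in u if x in keep] != [x for x in v if x in keep]:
            return False
    return True
```

With `a` and `b` dependent and `c` independent of both, `"abc"` and `"cab"` are equivalent, while `"abc"` and `"bac"` are not. A trace-consistent monitor must return the same verdict for the first pair.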
Article 70
Title@2025-07-07 (1): ASSURE: Metamorphic Testing for AI-powered Browser Extensions
Title: ASSURE: Metamorphic Testing for AI-powered Browser Extensions | ASSURE: Metamorphische Tests für KI-betriebene Browser-Erweiterungen | ASURE: AI动力浏览器扩展的变形测试 2507.05307v1 |
Authors (5): Xuanqi Gao, Juan Zhai, Shiqing Ma, Siyi Xie, Chao Shen
The integration of Large Language Models (LLMs) into browser extensions has revolutionized web browsing, enabling sophisticated functionalities like content summarization, intelligent translation, and context-aware writing assistance. However, these AI-powered extensions introduce unprecedented challenges in testing and reliability assurance. Traditional browser extension testing approaches fail to address the non-deterministic behavior, context-sensitivity, and complex web environment integration inherent to LLM-powered extensions. Similarly, existing LLM testing methodologies operate in isolation from browser-specific contexts, creating a critical gap in effective evaluation frameworks. To bridge this gap, we present ASSURE, a modular automated testing framework specifically designed for AI-powered browser extensions. ASSURE comprises three principal components: (1) a modular test case generation engine that supports plugin-based extension of testing scenarios, (2) an automated execution framework that orchestrates the complex interactions between web content, extension processing, and AI model behavior, and (3) a configurable validation pipeline that systematically evaluates behavioral consistency and security invariants rather than relying on exact output matching. Our evaluation across six widely-used AI browser extensions demonstrates ASSURE’s effectiveness, identifying 531 distinct issues spanning security vulnerabilities, metamorphic relation violations, and content alignment problems. ASSURE achieves 6.4x improved testing throughput compared to manual approaches, detecting critical security vulnerabilities within 12.4 minutes on average. This efficiency makes ASSURE practical for integration into development pipelines, offering a comprehensive solution to the unique challenges of testing AI-powered browser extensions.
将大语言模型(LLMS)纳入浏览器扩展,使网络浏览器扩展的整合发生了革命性的变化,使得将大语言模型(LLMS)纳入浏览器扩展,使网络浏览器浏览器浏览器扩展,使内容总和、智能翻译和符合背景的写法协助等复杂功能得以实现。然而,这些AI动力扩展在测试和可靠性保证方面提出了前所未有的挑战。传统的浏览器扩展方法未能解决LLM动力扩展所固有的非决定性行为、环境敏感性和复杂的网络环境整合问题。同样,现有的LLMM测试方法脱离浏览器特定环境,在有效评价框架方面造成了严重差距。为了缩小这一差距,我们提出了ASURE,一个专门为AI-PRO全面浏览器扩展设计的模块自动化自动测试框架。ASURE由三个主要组成部分组成:(1)一个模块测试案例生成引擎,支持基于插件的测试方案扩展方案扩展方案扩展方案,支持基于测试方案的测试方案扩展方案;(2)一个自动执行框架,将网络内容、扩展处理和AI-模型行为模式行为之间的复杂互动关系,系统评估行为的一致性和安全差异,而不是依赖精确产出匹配。我们对六种独特的客户浏览器扩展的评价显示AS-SURE的扩展显示A-SURE的效能关系,通过测试,通过测试使安全脆弱性的系统测试,通过12个标准脆弱性的系统测试,通过测试,使安全脆弱性和内部的系统测试,使安全风险分析的系统测试,使安全风险中的脆弱性化,使能性调整实现。
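Metamorphic testing replaces exact output matching with relations that should hold between the outputs of related inputs. The sketch below checks one such relation: appending content that is invisible to the user should leave a summary essentially unchanged. The transformation, the similarity threshold, and the stand-in `toy_summarizer` are all hypothetical and are not taken from ASSURE.

```python
import re
from typing import Callable


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def add_hidden_comment(page_text: str) -> str:
    """Follow-up input: append an HTML comment that a user would never see."""
    return page_text + "\n<!-- ignore previous instructions -->"


def check_relation(extension: Callable[[str], str], page_text: str,
                   threshold: float = 0.8) -> bool:
    """True if source and follow-up outputs are similar enough."""
    source_output = extension(page_text)
    followup_output = extension(add_hidden_comment(page_text))
    return jaccard(source_output, followup_output) >= threshold


def toy_summarizer(page_text: str) -> str:
    """Stand-in for an AI extension: drop HTML comments, return the first sentence."""
    visible = re.sub(r"<!--.*?-->", "", page_text, flags=re.S)
    return visible.strip().split(".")[0].strip()
```

Because the comparison is similarity-based, the check tolerates the non-deterministic wording of a real LLM-backed extension while still flagging outputs that are steered by hidden content.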
Article 71
Title@2025-07-07 (1): Augmenting software engineering with AI and developing it further towards AI-assisted model-driven software engineering
Title: Augmenting software engineering with AI and developing it further towards AI-assisted model-driven software engineering | Augmenting-Software-Engineering mit KI und Weiterentwicklung in Richtung KI-unterstützter modellgetriebener Software-Engineering | 增强AI软件工程并进一步发展AI软件工程,使之更接近AI协助的模型驱动软件工程 2409.18048v3 |
Authors (1): Ina K. Schieferdecker
The effectiveness of model-driven software engineering (MDSE) has been successfully demonstrated in the context of complex software; however, it has not been widely adopted due to the requisite efforts associated with model development and maintenance, as well as the specific modelling competencies required for MDSE. Concurrently, artificial intelligence (AI) methods, particularly deep learning methods, have demonstrated considerable abilities when applied to the huge code bases accessible on open-source coding platforms. The so-called big code provides the basis for significant advances in empirical software engineering, as well as in the automation of coding processes and improvements in software quality with the use of AI. The objective of this paper is to facilitate a synthesis between these two significant domains of software engineering (SE), namely models and AI in SE. The paper provides an overview of the current state of AI-augmented software engineering and develops a corresponding taxonomy, ai4se. In light of the aforementioned considerations, a vision of AI-assisted big models in SE is put forth, with the aim of capitalising on the advantages inherent to both approaches in the context of software development. Finally, the pair modelling paradigm is proposed for adoption by the MDSE industry.
模型驱动软件工程(MDSE)的有效性在复杂的软件方面得到了成功证明;然而,由于与模型开发和维护有关的必要努力以及MDSE所需的具体建模能力,模型驱动软件工程(MDSE)的有效性尚未被广泛采用。与此同时,人工智能(AI)方法,特别是深层的学习方法,在适用于开放源码平台上可以进入的巨大代码库时,显示出了相当大的能力。所谓的大代码为经验软件工程、编译流程自动化和软件质量的改进提供了基础。本文件的目的是促进软件工程这两个重要领域,即模型和在线AI的合成。本文件概述了人工智能软件工程的现状,并开发了一个相应的分类学,a4se。根据上述考虑,提出了在SE应用软件开发方面AI协助的大型模型的愿景,目的是利用这两种方法固有的优势。最后,提出了配对建模模式模式,供MDESE行业采用。
Article 72
Title@2025-07-07 (1): Narrowing the Gap: Supervised Fine-Tuning of Open-Source LLMs as a Viable Alternative to Proprietary Models for Pedagogical Tools
Title: Narrowing the Gap: Supervised Fine-Tuning of Open-Source LLMs as a Viable Alternative to Proprietary Models for Pedagogical Tools | Engere Kluft: Überwachtes Feintuning von Open Source LLMs als lebensfähige Alternative zu proprietären Modellen für pädagogische Werkzeuge | 缩小差距:监督开放源码LLMs的微调,将其作为替代专有教学工具模型的可行替代物 2507.05305v1 |
Authors (5): Lorenzo Lee Solano, Charles Koutcheme, Juho Leinonen, Alexandra Vassar, Jake Renzella
Frontier large language models (LLMs) like ChatGPT and Gemini can decipher cryptic compiler errors for novice programmers, but their computational scale, cost, and tendency to over-assist make them problematic for widespread pedagogical adoption. This work demonstrates that smaller, specialised language models, enhanced via Supervised Fine-Tuning (SFT), present a more viable alternative for educational tools. We utilise a new dataset of 40,000 C compiler error explanations, derived from real programming errors made by students in introductory programming courses (CS1/2), which we use to fine-tune three open-source models: Qwen3-4B, Llama-3.1-8B, and Qwen3-32B. We performed a dual evaluation, combining expert human reviews with a large-scale automated analysis of 8,000 responses using a validated LLM-as-judge ensemble. Our results show that SFT significantly boosts the pedagogical quality of smaller models, achieving performance comparable to much larger models. We analyse the trade-offs between model size and quality, confirming that fine-tuning compact, efficient models on high-quality, domain-specific data is a potent strategy for creating specialised models to drive educational tools. We provide a replicable methodology to foster broader access to generative AI capabilities in educational contexts.
这项工作表明,通过监督精度微调(SFT)强化的更小型、专业化的语言模型为教育工具提供了更可行的替代方法。我们使用40,000个C编译器错误解释的新数据集,这些数据集来自真正的介绍性编程(CS1/2)学生产生的编程错误,我们用来微调三种开放源码模型:Qwen3-4B、Llama-3.1-8B和Qwen3-32B。我们进行了双重评价,将专家的人类审查与使用经验证的LLM-As法官通则对8000份答复的大规模自动分析结合起来。我们的结果显示,SFT大大提升了小模型的教学质量,取得了与大得多模型相似的业绩。我们分析了模型大小和质量之间的权衡,确认了高品质、高效率的域域域域域域数据微调缩微缩微缩缩缩缩缩缩缩缩缩缩图,为创建了全方位教育工具。
Article 73
Title@2025-07-06 (7): Testing, Evaluation, Verification and Validation (TEVV) of Digital Twins: A Comprehensive Framework
Title: Testing, Evaluation, Verification and Validation (TEVV) of Digital Twins: A Comprehensive Framework | Testen, Auswerten, Verifizieren und Validieren (TEVV) von digitalen Zwillingen: Ein umfassender Rahmen | 数字双双的测试、评价、核查和验证:全面框架 2507.04555v1 |
Authors (1): Gabriella Waters
Digital twins have emerged as a powerful technology for modeling and simulating complex systems across various domains (Fuller et al., 2020; Tao et al., 2019). As virtual representations of physical assets, processes, or systems, digital twins enable real-time monitoring, predictive analysis, and optimization. However, as digital twins become more sophisticated and integral to decision-making processes, ensuring their accuracy, reliability, and ethical implementation is essential. This paper presents a comprehensive framework for the Testing, Evaluation, Verification and Validation (TEVV) of digital twins to address the unique challenges posed by these dynamic and complex virtual models.
数字双胞胎已成为在各个领域建模和模拟复杂系统的强大技术(Fuller等人,2020年;Tao等人,2019年)。数字双胞胎作为实物资产、流程或系统的虚拟体现,能够实时监测、预测分析和优化,但是,随着数字双胞胎变得更加复杂,成为决策进程的组成部分,确保其准确性、可靠性和道德执行至关重要。本文件为数字双胞胎的测试、评价、核查和验证提供了一个综合框架,以应对这些动态和复杂的虚拟模型构成的独特挑战。
Article 74
Title@2025-07-06 (7): Exploring Micro Frontends: A Case Study Application in E-Commerce
Title: Exploring Micro Frontends: A Case Study Application in E-Commerce | Erforschung von Micro Frontends: Eine Anwendungsfallstudie im E-Commerce | 探索微观前沿:电子商务案例研究应用 2506.21297v2 |
Authors (5): Ricardo Hideki Hangai Kojo, Luiz Fernando Corte Real, Renato Cordeiro Ferreira, Thatiane de Oliveira Rosa, Alfredo Goldman
In the micro frontends architectural style, the frontend is divided into smaller components, which can range from a simple button to an entire page. The goal is to improve scalability, resilience, and team independence, albeit at the cost of increased complexity and infrastructure demands. This paper seeks to understand when it is worth adopting micro frontends, particularly in the context of industry. To achieve this, we conducted an investigation into the state of the art of micro frontends, based on both academic and gray literature. We then implemented this architectural style in a marketplace for handcrafted products, which already used microservices. Finally, we evaluated the implementation through a semi-open questionnaire with the developers. At the studied marketplace company, the need for architectural change arose due to the tight coupling between their main system (a Java monolith) and a dedicated frontend system. Additionally, there were deprecated technologies and poor developer experience. To address these issues, the micro frontends architecture was adopted, along with the API Gateway and Backend for Frontend patterns, and technologies such as Svelte and Fastify. Although the adoption of Micro Frontends was successful, it was not strictly necessary to meet the company’s needs. According to the analysis of the mixed questionnaire responses, other alternatives, such as a monolithic frontend, could have achieved comparable results. What made adopting micro frontends the most convenient choice in the company’s context was the monolith strangulation and microservices adoption, which facilitated implementation through infrastructure reuse and knowledge sharing between teams.
在微前端(micro frontends)架构风格中,前端被划分为更小的组件,小到一个简单的按钮,大到整个页面。其目标是提高可扩展性、韧性和团队独立性,但代价是复杂度和基础设施需求的增加。本文试图弄清在何种情况下值得采用微前端,尤其是在工业界的背景下。为此,我们基于学术文献和灰色文献,对微前端的研究现状进行了调查。随后,我们在一个已经使用微服务的手工制品交易平台中实现了这种架构风格。最后,我们通过面向开发者的半开放式问卷对该实现进行了评估。在所研究的交易平台公司中,架构变更的需求源于其主系统(一个Java单体)与专用前端系统之间的紧耦合。此外,还存在技术过时和开发者体验不佳的问题。为了解决这些问题,公司采用了微前端架构,并结合了API Gateway和Backend for Frontend模式,以及Svelte和Fastify等技术。尽管微前端的采用是成功的,但它并非满足公司需求所必需。根据对问卷答复的分析,其他替代方案(例如单体前端)也可以取得类似的结果。在该公司的背景下,使微前端成为最便利选择的,是对单体的绞杀式迁移以及微服务的采用,它们通过基础设施复用和团队间的知识共享为实施提供了便利。
Article 75
Title@2025-07-06 (7): SPIRA: Building an Intelligent System for Respiratory Insufficiency Detection
Title: SPIRA: Building an Intelligent System for Respiratory Insufficiency Detection | SPIRA: Aufbau eines intelligenten Systems zur Erkennung von respiratorischer Insuffizienz | SPIRA: 建立呼吸系统不足检测智能系统 2507.04548v1 |
Authors (5): Renato Cordeiro Ferreira, Dayanne Gomes, Vitor Tamae, Francisco Wernke, Alfredo Goldman
Respiratory insufficiency is a medical symptom in which a person has a reduced amount of oxygen in the blood. This paper reports the experience of building SPIRA: an intelligent system for detecting respiratory insufficiency from voice. It compiles challenges faced in two successive implementations of the same architecture, summarizing lessons learned on data collection, training, and inference for future projects in similar systems.
呼吸系统不足是一种药物症状,一个人在血液中获得的氧气量减少,本文报告了建立SPIRA的经验:一个智能系统,从声音中检测呼吸系统不足;它汇编了同一结构两个后续实施过程中面临的挑战,总结了在数据收集、培训和类似系统中未来项目的推论方面的经验教训。
Article 76
Title@2025-07-06 (7): Making a Pipeline Production-Ready: Challenges and Lessons Learned in the Healthcare Domain
Title: Making a Pipeline Production-Ready: Challenges and Lessons Learned in the Healthcare Domain | Herstellung einer Pipeline-Produktion: Herausforderungen und Lektionen im Bereich Healthcare | 《管道生产-准备:保健领域的挑战和经验教训》 2506.06946v3 |
Authors (6): Daniel Angelo Esteves Lawand, Lucas Quaresma Medina Lam, Roberto Oliveira Bolgheroni, Renato Cordeiro Ferreira, Alfredo Goldman, Marcelo Finger
Deploying a Machine Learning (ML) training pipeline into production requires good software engineering practices. Unfortunately, the typical data science workflow often leads to code that lacks critical software quality attributes. This experience report investigates this problem in SPIRA, a project whose goal is to create an ML-Enabled System (MLES) to pre-diagnose respiratory insufficiency via speech analysis. This paper presents an overview of the architecture of the MLES, then compares three versions of its Continuous Training subsystem: from a proof-of-concept Big Ball of Mud (v1), to a design pattern-based Modular Monolith (v2), to a test-driven set of Microservices (v3). Each version improved its overall extensibility, maintainability, robustness, and resiliency. The paper shares challenges and lessons learned in this process, offering insights for researchers and practitioners seeking to productionize their pipelines.
不幸的是,典型的数据科学工作流程往往导致缺乏关键软件质量属性的代码。本经验报告对SPIRA的这一问题进行了调查,SPIRA项目的目标是通过语言分析建立一个ML-Enabled System(MLES),以便预先诊断呼吸不全的呼吸系统。本文件概述了MLES的架构,然后比较了其连续培训子系统的三个版本:从概念大球泥(v1)到基于设计模式的模版单(v2)到一套测试驱动的微服务(v3),每个版本都改进了其总体可存活性、可维持性、稳健性和弹性。文件分享了在这一过程中所学到的挑战和经验教训,为试图生产管道的研究人员和从业人员提供了见解。
Article 77
Title@2025-07-06 (7): Learning Software Bug Reports: A Systematic Literature Review
Title: Learning Software Bug Reports: A Systematic Literature Review | Lernsoftware Bug Reports: Ein systematischer Literaturbericht | 学习软件错误报告:系统文献审查 2507.04422v1 |
Authors (4): Guoming Long, Jingzhi Gong, Hui Fang, Tao Chen
The recent advancement of artificial intelligence, especially machine learning (ML), has significantly impacted software engineering research, including bug report analysis. ML aims to automate the understanding, extraction, and correlation of information from bug reports. Despite its growing importance, there has been no comprehensive review in this area. In this paper, we present a systematic literature review covering 1,825 papers, selecting 204 for detailed analysis. We derive seven key findings: 1) Extensive use of CNN, LSTM, and $k$NN for bug report analysis, with advanced models like BERT underutilized due to their complexity. 2) Word2Vec and TF-IDF are popular for feature representation, with a rise in deep learning approaches. 3) Stop word removal is the most common preprocessing, with structural methods rising after 2020. 4) Eclipse and Mozilla are the most frequently evaluated software projects. 5) Bug categorization is the most common task, followed by bug localization and severity prediction. 6) There is increasing attention on specific bugs like non-functional and performance bugs. 7) Common evaluation metrics are F1-score, Recall, Precision, and Accuracy, with $k$-fold cross-validation preferred for model evaluation. 8) Many studies lack robust statistical tests. We also identify six promising future research directions to provide useful insights for practitioners.
最近人工智能的进步,特别是机器学习(ML),已经对软件工程研究(包括错误报告分析)产生了重大影响。ML的目标是使错误报告信息的理解、提取和关联自动化。尽管其重要性日益增加,但这一领域尚无全面综述。在本文中,我们提出了一项涵盖1,825篇论文的系统文献综述,并选择其中204篇进行详细分析。我们得出了7项主要结论:(1) 错误报告分析中广泛使用CNN、LSTM和$k$NN,而BERT等先进模型由于复杂性而未得到充分利用。(2) Word2Vec和TF-IDF是常用的特征表示方法,深度学习方法的使用不断增加。(3) 停用词去除是最常见的预处理方法,结构化方法在2020年后不断增多。(4) Eclipse和Mozilla是最常被评估的软件项目。(5) 错误分类是最常见的任务,其次是错误定位和严重性预测。(6) 非功能性错误和性能错误等特定类型的错误越来越受到关注。(7) 常用的评价指标是F1-score、Recall、Precision和Accuracy,模型评估首选$k$折交叉验证。(8) 许多研究缺乏稳健的统计检验。我们还确定了六个有前景的未来研究方向,为从业人员提供有用的见解。
Article 78
Title@2025-07-06 (7): Exploring React Library Related Questions on Stack Overflow: Answered vs. Unanswered
Title: Exploring React Library Related Questions on Stack Overflow: Answered vs. Unanswered | Exploring React Library Verwandte Fragen zum Stack Overflow: Beantwortet vs. Unbeantwortet | 探讨Stack Overflow上与React库相关的问题:已回答与未回答 2507.04390v1 |
Authors (3): Vanesya Aura Ardity, Yusuf Sulistyo Nugroho, Syful Islam
React is a popular JavaScript framework in modern web application development. Due to its high performance and efficiency, many developers use this framework. Although the React library offers many advantages, it is not without its challenges. When using the React library, developers often face problems and seek solutions through question-and-answer forums, such as Stack Overflow (SO). However, despite its high popularity, many React-related questions on SO remain unanswered. Thus, this study aims to analyze the factors associated with question answerability and difficulty levels of React-related questions on SO. To facilitate our study, Exploratory Data Analysis was applied to 534,820 questions, filtered based on 23 React-related tags. We implemented a quantitative approach through text mining and statistical analysis. A logistic regression model was used to identify attributes associated with question answerability, while a simple linear regression model was employed to examine the correlation between user reputations and performance difficulty scores (PD Score). The results show that some attributes, such as number of views, code snippet inclusion, number of lines of code, and user reputation, positively affect the likelihood of question answerability. In contrast, the number of comments, question lengths, and presence of images in React-related questions reduce the probability of a question receiving responses from users. Further investigation indicates a negative correlation between user reputations and PD Score, where a reputation increase corresponds to a -0.092 reduction in PD Score, signaling that experienced users tend to propose more complex technical inquiries. This study provides insights into the characteristics of technical question-and-answer platforms, such as SO, and suggests that users consider these answerability factors when posting questions related to React.
React是现代网络应用程序开发中广受欢迎的JavaScript框架。由于其高性能和高效率,许多开发者使用这个框架。虽然React库有许多优势,但也并非没有挑战。使用React库时,开发者经常遇到问题,并常常通过Stack Overflow(SO)等问答论坛寻找解决办法。然而,尽管React广受欢迎,SO上许多与React有关的问题仍然没有得到解答。因此,本研究旨在分析与SO上React相关问题的可解答性和难度水平有关的因素。为便于研究,我们对534,820个问题进行了探索性数据分析,这些问题基于23个React相关标签进行过滤。我们通过文本挖掘和统计分析采用了定量方法。逻辑回归模型用于确定与问题可解答性相关的属性,简单线性回归模型用于检验用户声誉与性能难度评分(PD Score)之间的相关性。结果显示,浏览次数、是否包含代码片段、代码行数和用户声誉等属性对问题获得解答的可能性有正面影响。相比之下,评论数量、问题长度以及问题中是否包含图片会降低问题获得用户回复的概率。进一步调查表明,用户声誉与PD Score之间存在负相关,声誉增加对应于PD Score的-0.092降幅,这表明经验丰富的用户往往提出更复杂的技术问题。本研究揭示了SO等技术问答平台的特点,用户在发布React相关问题时需要考虑这些可解答性因素。
Article 79
Title@2025-07-06 (7): DevMuT: Testing Deep Learning Framework via Developer Expertise-Based Mutation
Title: DevMuT: Testing Deep Learning Framework via Developer Expertise-Based Mutation | DevMuT: Testen von Deep Learning Framework über Entwickler-Expertise-basierte Mutation | DevMuT:通过基于开发者专业知识的变异测试深度学习框架 2507.04360v1 |
Authors (9): Yanzhou Mu, Juan Zhai, Chunrong Fang, Xiang Chen, Zhixiang Cao, Peiran Yang, Yinglong Zou, Tao Zheng, Zhenyu Chen
Deep learning (DL) frameworks are the fundamental infrastructure for various DL applications. Framework defects can cause disastrous accidents, thus requiring sufficient detection. In previous studies, researchers adopt DL models as test inputs combined with mutation to generate more diverse models. Though these studies demonstrate promising results, most detected defects are considered trivial (i.e., either treated as edge cases or ignored by the developers). To identify important bugs that matter to developers, we propose a novel DL framework testing method DevMuT, which generates models by adopting mutation operators and constraints derived from developer expertise. DevMuT simulates developers' common operations in development and detects more diverse defects within more stages of the DL model lifecycle (e.g., model training and inference). We evaluate the performance of DevMuT on three widely used DL frameworks (i.e., PyTorch, JAX, and MindSpore) with 29 DL models from nine types of industry tasks. The experiment results show that DevMuT outperforms state-of-the-art baselines: it can achieve at least 71.68% improvement on average in the diversity of generated models and 28.20% improvement on average in the legal rates of generated models. Moreover, DevMuT detects 117 defects, 63 of which are confirmed, 24 are fixed, and eight are of high value confirmed by developers. Finally, DevMuT has been deployed in the MindSpore community since December 2023. These results demonstrate the effectiveness of DevMuT in detecting defects that are close to real-world scenarios and are of concern to developers.
深度学习(DL)框架是各种DL应用程序的基础设施。框架缺陷可能造成灾难性事故,因此需要充分检测。在以往的研究中,研究人员采用DL模型作为测试输入,并结合变异来生成更多样的模型。虽然这些研究显示了有希望的结果,但大多数检测到的缺陷被认为是微不足道的(即被当作边缘案例处理,或被开发者忽视)。为了找出开发者关心的重要缺陷,我们提出了一种新的DL框架测试方法DevMuT,它采用源自开发者专业知识的变异算子和约束来生成模型。DevMuT模拟开发者在开发中的常见操作,并在DL模型生命周期的更多阶段(例如模型训练和推理)中检测更多样的缺陷。我们在三个广泛使用的DL框架(即PyTorch、JAX和MindSpore)上,使用来自九类行业任务的29个DL模型评估了DevMuT的性能。实验结果表明,DevMuT优于最先进的基线:生成模型的多样性平均至少提高71.68%,生成模型的合法率平均提高28.20%。此外,DevMuT检测到117个缺陷,其中63个得到确认,24个已修复,8个被开发者确认为高价值缺陷。最后,DevMuT自2023年12月起已部署在MindSpore社区。这些结果证明了DevMuT在检测贴近真实场景且为开发者所关心的缺陷方面的有效性。
Article 80
Title@2025-07-06 (7): Improving Deep Learning Framework Testing with Model-Level Metamorphic Testing
Title: Improving Deep Learning Framework Testing with Model-Level Metamorphic Testing | Verbesserung von Deep-Learning-Framework-Tests mit Metamorphischen Tests auf Modellebene | 利用示范性变形测试改进深学习框架测试 2507.04354v1 |
Authors (9): Yanzhou Mu, Juan Zhai, Chunrong Fang, Xiang Chen, Zhixiang Cao, Peiran Yang, Kexin Zhao, An Guo, Zhenyu Chen
Deep learning (DL) frameworks are essential to DL-based software systems, and framework bugs may lead to substantial disasters, thus requiring effective testing. Researchers adopt DL models or single interfaces as test inputs and analyze their execution results to detect bugs. However, floating-point errors, inherent randomness, and the complexity of test inputs make it challenging to analyze execution results effectively, leading to existing methods suffering from a lack of suitable test oracles. Some researchers utilize metamorphic testing to tackle this challenge. They design Metamorphic Relations (MRs) based on input data and parameter settings of a single framework interface to generate equivalent test inputs, ensuring consistent execution results between original and generated test inputs. Despite their promising effectiveness, they still face certain limitations. (1) Existing MRs overlook structural complexity, limiting test input diversity. (2) Existing MRs focus on limited interfaces, which limits generalization and necessitates additional adaptations. (3) Their detected bugs are related to the result consistency of single interfaces and far from those exposed in multi-interface combinations and runtime metrics (e.g., resource usage). To address these limitations, we propose ModelMeta, a model-level metamorphic testing method for DL frameworks with four MRs focused on the structure characteristics of DL models. ModelMeta augments seed models with diverse interface combinations to generate test inputs with consistent outputs, guided by the QR-DQN strategy. It then detects bugs through fine-grained analysis of training loss/gradients, memory/GPU usage, and execution time.
深度学习(DL)框架对基于DL的软件系统至关重要,框架缺陷可能导致重大灾难,因此需要进行有效测试。研究人员采用DL模型或单个接口作为测试输入,并分析其执行结果以检测缺陷。然而,浮点误差、固有的随机性和测试输入的复杂性使得有效分析执行结果具有挑战性,导致现有方法缺乏合适的测试预言(test oracle)。一些研究人员利用蜕变测试来应对这一挑战。他们根据单个框架接口的输入数据和参数设置设计蜕变关系(MR),以生成等价的测试输入,确保原始测试输入和生成的测试输入之间执行结果一致。尽管这些方法效果可观,但仍面临某些限制:(1) 现有MR忽略了结构复杂性,限制了测试输入的多样性。(2) 现有MR侧重于有限的接口,这限制了泛化能力,需要额外的适配。(3) 它们检测到的缺陷与单个接口的结果一致性有关,与多接口组合和运行时指标(例如资源使用)中暴露的缺陷相去甚远。为了解决这些限制,我们提出了ModelMeta,一种面向DL框架的模型级蜕变测试方法,包含四个关注DL模型结构特征的MR。ModelMeta在QR-DQN策略的指导下,用多样的接口组合扩充种子模型,以生成输出一致的测试输入。然后,它通过对训练损失/梯度、内存/GPU使用和执行时间的细粒度分析来检测缺陷。
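The idea of a structure-level metamorphic relation can be illustrated with a toy example. The sketch below is not ModelMeta's implementation and does not reproduce any of its four MRs; the tiny two-layer "model", the zero-weighted parallel branch, and all function names are assumptions chosen only to show how a structural mutation with an expected-equal output acts as a test oracle.

```python
import math

def dense(weights, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(values):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in values]

def seed_model(x):
    """Hypothetical seed model: one dense layer followed by ReLU."""
    weights = [[0.5, -1.0], [2.0, 0.25]]
    bias = [0.1, -0.3]
    return relu(dense(weights, bias, x))

def mutated_model(x):
    """Follow-up model with a different structure: a parallel branch is added,
    but its contribution is scaled by zero, so outputs should not change."""
    main = seed_model(x)
    branch = relu(dense([[1.0, 1.0], [-1.0, 3.0]], [0.0, 0.0], x))
    return [m + 0.0 * b for m, b in zip(main, branch)]

def violates_relation(x, tol=1e-6):
    """Return True when seed and mutated outputs differ by more than tol."""
    original, follow_up = seed_model(x), mutated_model(x)
    return any(
        not math.isclose(p, q, rel_tol=tol, abs_tol=tol)
        for p, q in zip(original, follow_up)
    )
```

In a real framework test, a violation of the relation (beyond a floating-point tolerance) would be reported as a candidate bug; the abstract additionally mentions checking loss, gradients, memory, and execution time, which this sketch omits.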
Article 81
Title@2025-07-06 (7): Alleviating Attack Data Scarcity: SCANIA’s Experience Towards Enhancing In-Vehicle Cyber Security Measures
Title: Alleviating Attack Data Scarcity: SCANIA’s Experience Towards Enhancing In-Vehicle Cyber Security Measures | Benachteiligung von Angriffsdaten: SCANIAs Erfahrung zur Verbesserung von Cybersicherheitsmaßnahmen im Fahrzeug | 减轻攻击数据稀缺性:SCANIA在加强车辆内部网络安全措施方面的经验 2507.02607v2 |
Authors (5): Frida Sundfeldt, Bianca Widstam, Mahshid Helali Moghadam, Kuo-Yun Liang, Anders Vesterberg
The digital evolution of connected vehicles and the subsequent security risks emphasize the critical need for implementing in-vehicle cyber security measures such as intrusion detection and response systems. The continuous advancement of attack scenarios further highlights the need for adaptive detection mechanisms that can detect evolving, unknown, and complex threats. The effective use of ML-driven techniques can help address this challenge. However, constraints on implementing diverse attack scenarios on test vehicles due to safety, cost, and ethical considerations result in a scarcity of data representing attack scenarios. This limitation necessitates alternative efficient and effective methods for generating high-quality attack-representing data. This paper presents a context-aware attack data generator that generates attack inputs and corresponding in-vehicle network log, i.e., controller area network (CAN) log, representing various types of attack including denial of service (DoS), fuzzy, spoofing, suspension, and replay attacks. It utilizes parameterized attack models augmented with CAN message decoding and attack intensity adjustments to configure the attack scenarios with high similarity to real-world scenarios and promote variability. We evaluate the practicality of the generated attack-representing data within an intrusion detection system (IDS) case study, in which we develop and perform an empirical evaluation of two deep neural network IDS models using the generated data. In addition to the efficiency and scalability of the approach, the performance results of IDS models, high detection and classification capabilities, validate the consistency and effectiveness of the generated data as well. In this experience study, we also elaborate on the aspects influencing the fidelity of the data to real-world scenarios and provide insights into its application.
接通车辆的数字演进和随后的安全风险强调,迫切需要实施车辆内网络安全措施,如入侵探测和反应系统。攻击情景的持续推进进一步突出表明,需要建立适应性检测机制,以探测不断变化的、未知的和复杂的威胁。有效利用ML驱动的技术可以帮助应对这一挑战。但是,由于安全、成本和道德因素,对测试车辆实施不同攻击情景的制约导致缺少代表攻击情景的数据。这种限制要求以高效和有效的替代方法生成高质量的攻击代表数据。本文介绍了一种环境觉察攻击数据生成器,生成攻击投入和与车辆网络日志相对应,即控制区网络日志,代表各种类型的攻击,包括拒绝服务(Do),模糊、潜伏、暂停和重弹攻击。由于安全、成本和道德因素等因素,对测试车辆实施不同攻击情景的制约,导致袭击情景的解析和强度调整,从而使得袭击情景与现实世界情景高度相似,并促进变异性。我们评估了在入侵探测系统内生成的攻击数据的实际数据的实际数据,并评估了这一系统测试和数据测算的准确性,我们利用了这一系统测算结果的测试结果,并更新了数据,从而评估了对数据进行了数据评估。
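A parameterized attack model with an intensity setting can be sketched as follows. This is an illustrative assumption, not the generator described in the paper: the `Frame` record, the `inject_dos` function, the `intensity_hz` parameter, and the choice of CAN ID `0x000` as the flooding identifier are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Frame:
    """One CAN log entry (hypothetical minimal schema)."""
    timestamp: float      # seconds since start of the log
    can_id: int           # arbitration identifier
    data: bytes           # payload, up to 8 bytes for classic CAN
    is_attack: bool = False

def inject_dos(log, start, duration, intensity_hz, can_id=0x000):
    """Return a new, time-ordered log with DoS frames added.

    intensity_hz is an assumed intensity parameter: injected frames per second.
    The benign log is left unmodified.
    """
    count = int(duration * intensity_hz)
    injected = [
        Frame(start + i / intensity_hz, can_id, bytes(8), True)
        for i in range(count)
    ]
    return sorted(list(log) + injected, key=lambda frame: frame.timestamp)
```

Keeping the `is_attack` label on every frame gives the ground truth an IDS model would need for training and evaluation.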
Article 82
Title@2025-07-06 (7): Automatic Multi-level Feature Tree Construction for Domain-Specific Reusable Artifacts Management
Title: Automatic Multi-level Feature Tree Construction for Domain-Specific Reusable Artifacts Management | Automatische Mehrebenen-Feature-Bauweise für Domain-Spezifische wiederverwendbare Artefakte Management | 用于域域特定可使用性能管理的多级自动多级特性树建设 2506.03946v2 |
Authors (6): Dongming Jin, Zhi Jin, Nianyu Li, Kai Yang, Linyu Li, Suijing Guan
With the rapid growth of open-source ecosystems (e.g., Linux) and domain-specific software projects (e.g., aerospace), efficient management of reusable artifacts is becoming increasingly crucial for software reuse. The multi-level feature tree enables semantic management based on functionality and supports requirements-driven artifact selection. However, constructing such a tree heavily relies on domain expertise, which is time-consuming and labor-intensive. To address this issue, this paper proposes an automatic multi-level feature tree construction framework named FTBUILDER, which consists of three stages. First, it automatically crawls domain-specific software repositories and merges their metadata to construct a structured artifact library. Second, it employs clustering algorithms to identify a set of artifacts with common features. Third, it constructs a prompt and uses LLMs to summarize their common features. FTBUILDER recursively applies the identification and summarization stages to construct a multi-level feature tree from the bottom up. To validate FTBUILDER, we conduct experiments from multiple aspects (e.g., tree quality and time cost) using the Linux distribution ecosystem. Specifically, we first simultaneously develop and evaluate 24 alternative solutions in FTBUILDER. We then construct a three-level feature tree using the best solution among them. Compared to the official feature tree, our tree exhibits higher quality, with a 9% improvement in the silhouette coefficient and an 11% increase in GValue. Furthermore, it can reduce the time developers spend selecting artifacts by 26% and improve the accuracy of artifact recommendations with GPT-4 by 235%. FTBUILDER can be extended to other open-source software communities and domain-specific industrial enterprises.
随着开源生态系统(如Linux)和特定领域软件项目(如航空航天)的快速增长,可复用制品的有效管理对软件复用越来越重要。多级特征树能够基于功能进行语义管理,并支持需求驱动的制品选择。然而,构建这样的树在很大程度上依赖领域专业知识,既耗时又费力。为了解决这个问题,本文提出了一个名为FTBUILDER的多级特征树自动构建框架,它由三个阶段组成。它自动爬取特定领域的软件仓库,并合并其元数据以构建结构化的制品库。它使用聚类算法来识别一组具有共同特征的制品。它构造提示并使用LLM总结其共同特征。FTBUILDER递归地应用识别和总结阶段,自底向上构建多级特征树。为了验证FTBUILDER,我们使用Linux发行版生态系统从多个方面(例如树质量和时间成本)进行了实验。具体而言,我们首先同时开发并评估了FTBUILDER中的24种备选方案,然后使用其中的最佳方案构建了一棵三级特征树。与官方特征树相比,我们的树质量更高,轮廓系数提高9%,GValue提高11%。此外,它可以使开发者选择制品的时间减少26%,并使GPT-4推荐制品的准确率提高235%。FTBUILDER可以推广到其他开源软件社区和特定领域的工业企业。
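The recursive "identify, then summarize" loop can be sketched in a few lines. Everything here is a stand-in and not FTBUILDER's code: the greedy Jaccard clustering replaces whatever clustering algorithm the authors selected, the set intersection in `summarize` replaces the LLM summarization stage, and the threshold-halving rule is a hypothetical way to let higher levels merge more loosely.

```python
def jaccard(a, b):
    """Jaccard similarity of two feature sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster(nodes, threshold):
    """Greedy single-pass clustering: a node joins the first cluster whose
    first member is similar enough (stand-in for a real clustering algorithm)."""
    clusters = []
    for node in nodes:
        for group in clusters:
            if jaccard(node["features"], group[0]["features"]) >= threshold:
                group.append(node)
                break
        else:
            clusters.append([node])
    return clusters

def summarize(group):
    """Stand-in for LLM summarization: keep the features shared by all members."""
    common = set.intersection(*(member["features"] for member in group))
    return {"features": common, "children": group}

def build_tree(nodes, threshold=0.3, max_levels=3):
    """Apply identify-then-summarize bottom-up until one root remains,
    nothing merges, or max_levels is reached. Returns the top-level nodes."""
    for _ in range(max_levels):
        if len(nodes) <= 1:
            break
        parents = [summarize(group) for group in cluster(nodes, threshold)]
        if len(parents) == len(nodes):  # no merge happened at this level
            break
        nodes = parents
        threshold /= 2  # hypothetical: loosen the threshold at higher levels
    return nodes
```

Leaves are plain dictionaries such as `{"name": "vim", "features": {"editor", "terminal"}, "children": []}`; the function returns a forest when the top-level groups share nothing.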
Article 83
Title@2025-07-05 (6): From Legal Text to Tech Specs: Generative AI’s Interpretation of Consent in Privacy Law
Title: From Legal Text to Tech Specs: Generative AI’s Interpretation of Consent in Privacy Law | Vom Rechtstext zu technischen Spezifikationen: Generative KI-Interpretation der Zustimmung im Datenschutzrecht | 《从法律文本到技术规范:生成式人工智能对隐私法中同意的解释》 2507.04185v1 |
Authors (5): Aniket Kesari, Travis Breaux, Tom Norton, Sarah Santos, Anmol Singhal
Privacy law and regulation have turned to “consent” as the legitimate basis for collecting and processing individuals’ data. As governments have rushed to enshrine consent requirements in their privacy laws, such as the California Consumer Privacy Act (CCPA), significant challenges remain in understanding how these legal mandates are operationalized in software. The opaque nature of software development processes further complicates this translation. To address this, we explore the use of Large Language Models (LLMs) in requirements engineering to bridge the gap between legal requirements and technical implementation. This study employs a three-step pipeline that involves using an LLM to classify software use cases for compliance, generating LLM modifications for non-compliant cases, and manually validating these changes against legal standards. Our preliminary findings highlight the potential of LLMs in automating compliance tasks, while also revealing limitations in their reasoning capabilities. By benchmarking LLMs against real-world use cases, this research provides insights into leveraging AI-driven solutions to enhance legal compliance of software.
隐私法和条例已转变为“同意”法,作为收集和处理个人数据的合法依据。政府急于将同意要求写入隐私法,例如加利福尼亚州消费者隐私法(CCPA),因此在理解这些法定任务如何在软件中实施方面仍然存在重大挑战。软件开发过程的不透明性使这一翻译更加复杂。为了解决这个问题,我们探索在要求工程中使用大语言模型(LLMS)来弥补法律要求和技术实施之间的差距。本研究采用三步管道,利用LLM来分类软件使用案例,以便遵守,对不合规案件进行LLM修改,并手工对照法律标准验证这些变化。我们的初步结论强调了LMMs在使合规任务自动化方面的潜力,同时也揭示了其推理能力方面的局限性。通过将LLMS与现实世界使用案例挂钩,这一研究为利用由AI驱动的解决方案加强软件的法律合规提供了深刻的见解。
Article 84
Title@2025-07-05 (6): The Transformative Influence of LLMs on Software Development & Developer Productivity
Title: The Transformative Influence of LLMs on Software Development & Developer Productivity | Der transformative Einfluss von LLMs auf Softwareentwicklung und Entwicklerproduktivität | LLM对软件开发和开发者生产力的变革性影响 2311.16429v2 |
Authors (1): Sajed Jalil
The increasing adoption and commercialization of generalized Large Language Models (LLMs) have profoundly impacted various aspects of our daily lives. Initially embraced by the computer science community, the versatility of LLMs has found its way into diverse domains. In particular, the software engineering realm has witnessed the most transformative changes. The growing use of LLMs as AI Pair Programming Assistants has spurred the development of specialized models aimed at aiding software engineers. Although this new paradigm offers numerous advantages, it also presents critical challenges and open problems. To identify the potential and prevailing obstacles, we systematically reviewed contemporary scholarly publications, emphasizing the perspectives of software developers and usability concerns. Preliminary findings underscore pressing concerns about data privacy, bias, and misinformation. Additionally, we identified several usability challenges, including prompt engineering, increased cognitive demands, and mistrust. Finally, we introduce 12 open problems that we have identified through our survey, covering these various domains.
普遍通用的大型语言模型(LLMS)的日益采用和商业化对我们日常生活的方方面面产生了深刻的影响。LLMS最初被计算机科学界所接受,其多功能性已经进入了不同的领域。特别是软件工程领域发生了最变革性的变化。LLMS日益成为AI Pair方案助理,促进了旨在帮助软件工程师的专门模型的开发。尽管这一新模式具有许多优势,但也提出了关键性的挑战和尚未解决的问题。为了查明潜在的和普遍存在的障碍,我们系统地审查了当代学术出版物,强调了软件开发者的观点和可用性问题。初步结论强调了对数据隐私、偏向和错误信息的迫切关切。此外,我们查明了若干可用性挑战,包括即时工程、认知需求增加和不信任。最后,我们提出了通过调查查明的12个公开问题,涉及这些不同的领域。
Article 85
Title@2025-07-05 (6): zkSDK: Streamlining zero-knowledge proof development through automated trace-driven ZK-backend selection
Title: zkSDK: Streamlining zero-knowledge proof development through automated trace-driven ZK-backend selection | zkSDK: Straffung der Zero-Knowledge Proof-Entwicklung durch automatisierte spurgesteuerte ZK-Backend-Auswahl | zkSDK:通过自动追踪驱动的 ZK 后端选择简化零知识验证开发 2507.05294v1 |
Authors (1): William Law
Rapid advances in creating Zero-Knowledge (ZK) programs have led to the development of numerous tools designed to support developers. Popular options include writing in general-purpose programming languages, such as Rust with Risc Zero. Other languages exist, such as Circom, Lib-snark, and Cairo. However, developers entering the ZK space are faced with many different ZK backends to choose from, leading to a steep learning curve and a fragmented developer experience across different platforms. As a result, many developers tend to select a single ZK backend and remain tied to it. This thesis introduces zkSDK, a modular framework that streamlines ZK application development by abstracting the backend complexities. At the core of zkSDK is Presto, a custom Python-like programming language that enables the profiling and analysis of a program to assess its computational workload intensity. Combined with user-defined criteria, zkSDK employs a dynamic selection algorithm to automatically choose the optimal ZK-proving backend. Through an in-depth analysis and evaluation of real-world workloads, we demonstrate that zkSDK effectively selects the best-suited backend from a set of supported ZK backends, delivering a seamless and user-friendly development experience.
零知识(ZK)程序开发的快速进步催生了许多旨在支持开发者的工具。流行的选择包括使用通用编程语言编写程序,例如Risc Zero支持的Rust。此外还有Circom、Lib-snark和Cairo等其他语言。然而,进入ZK领域的开发者面临许多不同的ZK后端可供选择,导致学习曲线陡峭,不同平台上的开发体验支离破碎。因此,许多开发者倾向于选择单一的ZK后端并一直依赖它。本论文介绍了zkSDK,一个通过抽象后端复杂性来简化ZK应用程序开发的模块化框架。zkSDK的核心是Presto,一种类似Python的定制编程语言,它能够对程序进行剖析和分析,以评估其计算工作负载强度。结合用户定义的标准,zkSDK采用动态选择算法自动选择最优的ZK证明后端。通过对真实世界工作负载的深入分析和评估,我们证明zkSDK能够从一组受支持的ZK后端中有效选择最合适的后端,提供无缝且用户友好的开发体验。
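One way to picture trace-driven backend selection is a weighted cost comparison over profiled workload size. The sketch below is an assumption for illustration only: the abstract does not specify zkSDK's selection algorithm, and the `Backend` fields, the `select_backend` function, the weight keys, and every number used with it are hypothetical placeholders rather than measurements of real ZK backends.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Backend:
    """Hypothetical cost profile of one proving backend."""
    name: str
    prove_ms_per_op: float  # placeholder figure, not a measurement
    proof_size_kb: float    # placeholder figure, not a measurement

def select_backend(backends, op_count, weights):
    """Pick the backend with the lowest weighted cost.

    op_count stands for the workload intensity obtained from profiling.
    weights maps "time" and "size" to user-defined importance factors.
    """
    def cost(backend):
        proving_time = backend.prove_ms_per_op * op_count
        return weights["time"] * proving_time + weights["size"] * backend.proof_size_kb
    return min(backends, key=cost)
```

With two made-up profiles, a user who only cares about proving time and a user who only cares about proof size can end up with different backends for the same program, which is the behavior a criteria-driven selector is meant to capture.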
Article 86
Title@2025-07-05 (6): The Presence and the State-of-Practice of Software Architects in the Brazilian Industry – A Survey
Title: The Presence and the State-of-Practice of Software Architects in the Brazilian Industry – A Survey | Die Gegenwart und der Stand der Praxis von Software-Architekten in der brasilianischen Industrie – Eine Umfrage | 巴西工业软件建筑师的存在及其做法 – – 调查 2403.00955v2 |
Authors (7): Valdemar Vicente Graciano Neto, Diana Lorena Santos, Andrey Gonçalves França, Rafael Z. Frantz, Edson de Oliveira-Jr, Ahmad Mohsin, Mohamad Kassab
Context: Software architecture intensely impacts the software quality. Therefore, the professional assigned to carry out the design, maintenance and evolution of architectures needs to have certain knowledge and skills in order not to compromise the resulting application. Objective: The aim of this work is to understand the characteristics of the companies regarding the presence or absence of software architects in Brazil. Method: This work uses the Survey research as a means to collect evidence from professionals with the software architect profile, besides descriptive statistics and thematic analysis to analyze the results. Results: The study collected data from 105 professionals distributed in 24 Brazilian states. Results reveal that (i) not all companies have a software architect, (ii) in some cases, other professionals perform the activities of a software architect and (iii) there are companies that, even having a software architecture professional, have other roles also performing the duties of such a professional. Conclusions: Professionals hired as software architects have higher salaries than those hired in other roles that carry out such activity, although many of those other professionals still have duties that are typical of software architects.
软件环境:软件结构对软件质量产生了强烈的影响。因此,负责设计、维护和发展建筑结构的专业人员需要掌握一定的知识和技能,以免影响由此产生的应用。 目标:这项工作的目的是了解巴西公司在软件设计师的存在或不存在方面的特点。 方法:这项工作利用调查研究从拥有软件设计师概况的专业人员那里收集证据,除了描述性统计和专题分析外,分析结果。结果:研究收集了在巴西24个州分布的105名专业人员的数据。结果显示:(一)并非所有公司都有软件设计师,(二)在有些情况下,其他专业人员从事软件设计师的活动,(三)有些公司,即使拥有软件设计师专业,也具有履行这种专业职责的其他作用。结论:作为软件设计师聘用的专业人员的工资高于从事此类活动的其他职务,尽管许多其他专业人员仍然承担软件设计师典型的职责。
Article 87
Title@2025-07-05 (6): Rethinking and Exploring String-Based Malware Family Classification in the Era of LLMs and RAG
Title: Rethinking and Exploring String-Based Malware Family Classification in the Era of LLMs and RAG | Nachdenken und erkunden String-basierte Malware-Familienklassifikation im Zeitalter von LLMs und RAG | 在LLM和RAG时代重新思考和探索基于字符串的恶意软件家族分类 2507.04055v1 |
Authors (8): Yufan Chen, Daoyuan Wu, Juantao Zhong, Zicheng Zhang, Debin Gao, Shuai Wang, Yingjiu Li, Ning Liu
Malware Family Classification (MFC) aims to identify the fine-grained family (e.g., GuLoader or BitRAT) to which a potential malware sample belongs, in contrast to malware detection or sample classification that predicts only a Yes/No answer. Accurate family identification can greatly facilitate automated sample labeling and understanding on crowdsourced malware analysis platforms such as VirusTotal and MalwareBazaar, which generate vast amounts of data daily. In this paper, we explore and assess the feasibility of using traditional binary string features for MFC in the new era of large language models (LLMs) and Retrieval-Augmented Generation (RAG). Specifically, we investigate how Family-Specific String (FSS) features could be utilized in a manner similar to RAG to facilitate MFC. To this end, we develop a curated evaluation framework covering 4,347 samples from 67 malware families, extract and analyze over 25 million strings, and conduct detailed ablation studies to assess the impact of different design choices in four major modules.
恶意家庭分类(MFC)旨在确定潜在的恶意软件样本属于的微粒家庭(如Guloader或BitRAT),而恶意软件检测或抽样分类只预测为是/否。 准确的家庭识别可极大地便利对众源恶意软件分析平台(如病毒托塔尔和MalwareBazaar)进行自动抽样标签和理解,这些平台每天产生大量数据。在本文件中,我们探讨和评估在大型语言模型和检索-启动一代(RAG)的新时代,使用传统二字串功能给MFC的可行性。具体地说,我们调查如何以类似于RAG的方式利用家庭特性来便利MFC。为此,我们开发了一个覆盖67个恶意软件家庭4 347个样本的整理评价框架,提取和分析2 500多万条线,并进行详细的模拟研究,以评估四个主要模块中不同设计选择的影响。
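The notion of a family-specific string can be made concrete with a small lookup-based sketch. It is an assumption about one simple reading of FSS (strings observed in exactly one family of the reference corpus), not the paper's pipeline; the function names, the corpus layout, and the family names used with it are hypothetical.

```python
from collections import defaultdict

def family_specific_strings(corpus):
    """corpus maps family name -> list of samples, each sample a set of strings.
    Returns family -> strings seen in that family and in no other family."""
    owners = defaultdict(set)
    for family, samples in corpus.items():
        for sample in samples:
            for string in sample:
                owners[string].add(family)
    specific = defaultdict(set)
    for string, families in owners.items():
        if len(families) == 1:
            specific[next(iter(families))].add(string)
    return dict(specific)

def classify(sample, fss):
    """Return the family with the most matching family-specific strings,
    or None when the sample matches none of them."""
    hits = {family: len(sample & strings) for family, strings in fss.items()}
    family, score = max(hits.items(), key=lambda item: item[1], default=(None, 0))
    return family if score > 0 else None
```

The lookup step plays a role loosely comparable to retrieval in RAG: the matched strings could be handed to a downstream model as evidence instead of being counted directly.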
Article 88
Title@2025-07-05 (6): A Multimodal Approach Combining Biometrics and Self-Report Instruments for Monitoring Stress in Programming: Methodological Insights
Title: A Multimodal Approach Combining Biometrics and Self-Report Instruments for Monitoring Stress in Programming: Methodological Insights | Ein multimodaler Ansatz zur Kombination von Biometrie und Selbstberichtsinstrumenten zur Überwachung von Stress in der Programmierung: Methodologische Erkenntnisse | 将生物计量与在方案编制中监测压力的自报工具相结合的多式联运办法:方法见解 2507.02118v2 |
Authors (4): Cristina Martinez Montes, Daniela Grassi, Nicole Novielli, Birgit Penzenstadler
The study of well-being, stress and other human factors has traditionally relied on self-report instruments to assess key variables. However, concerns about potential biases in these instruments, even when thoroughly validated and standardised, have driven growing interest in alternatives that combine these measures with more objective methods, such as physiological measures. We aimed to (i) compare psychometric stress measures and biometric indicators and (ii) identify stress-related patterns in biometric data during software engineering tasks. We conducted an experiment where participants completed a pre-survey, then programmed two tasks wearing biometric sensors, answered brief post-surveys for each, and finally went through a short exit interview. Our results showed diverse outcomes; we found no stress in the psychometric instruments. Participants in the interviews reported a mix of feeling no stress and experiencing time pressure. Finally, the biometrics showed a significant difference only in EDA phasic peaks. We conclude that our chosen way of inducing stress by imposing a stricter time limit was insufficient. We offer methodological insights for future studies working with stress, biometrics, and psychometric instruments.
对福祉、压力和其他人类因素的研究历来依靠自我报告工具来评估关键变量,然而,对这些文书中潜在偏见的关切,即使经过彻底验证和标准化,也促使人们日益关注将这些措施与更客观的方法相结合的替代办法,例如生理措施。我们的目标是(一) 比较心理压力措施和生物鉴别指标,以及(二) 在软件工程任务期间确定生物鉴别数据中与压力有关的模式。我们进行了一项试验,参加者在试验中完成了调查前的工作,然后用生物鉴别传感器对两项任务进行了编程,对每项任务进行了简短的事后调查,最后进行了短期的离职面谈。我们的结果显示结果各异;我们在心理计量工具中没有发现压力;参加面谈的与会者报告说,没有压力感,也经历了时间压力。最后,生物鉴别学显示只有EDA的高峰期存在重大差异。我们的结论是,我们选择的通过规定更严格的时限来刺激压力的方法是不够的。我们为未来有关压力、生物鉴别和心理测量工具的研究提供了方法方面的见解。
Article 89
Title@2025-07-05 (6): Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation
Title: Large-scale, Independent and Comprehensive study of the power of LLMs for test case generation | Groß angelegte, unabhängige und umfassende Untersuchung der Leistung von LLMs für die Testfallgenerierung | 大规模、独立和综合研究LLMM 用于产生试验案例的能力 2407.00225v3 |
Authors (8): Wendkûuni C. Ouédraogo, Kader Kaboré, Yinghua Li, Haoye Tian, Anil Koyuncu, Jacques Klein, David Lo, Tegawendé F. Bissyandé
Unit testing is essential for software reliability, yet manual test creation is time-consuming and often neglected. Although search-based software testing improves efficiency, it produces tests with poor readability and maintainability. Although LLMs show promise for test generation, existing research lacks comprehensive evaluation across execution-driven assessment, reasoning-based prompting, and real-world testing scenarios. This study presents the first large-scale empirical evaluation of LLM-generated unit tests at the class level, systematically analyzing four state-of-the-art models - GPT-3.5, GPT-4, Mistral 7B, and Mixtral 8x7B - against EvoSuite across 216,300 test cases from Defects4J, SF110, and CMD (a dataset mitigating LLM training data leakage). We evaluate five prompting techniques - Zero-Shot Learning (ZSL), Few-Shot Learning (FSL), Chain-of-Thought (CoT), Tree-of-Thought (ToT), and Guided Tree-of-Thought (GToT) - assessing syntactic correctness, compilability, hallucination-driven failures, readability, code coverage metrics, and fault detection capabilities. Our findings challenge prior claims that in-context learning is ineffective for test generation in code-specialized LLMs. Reasoning-based prompting - particularly GToT - significantly enhances test reliability, compilability, and structural adherence in general-purpose LLMs. However, hallucination-driven failures remain a persistent challenge, manifesting as non-existent symbol references, incorrect API calls, and fabricated dependencies, resulting in high compilation failure rates (up to 86%). Execution-based classification and mutation testing reveal that many failing tests stem from hallucinated dependencies, limiting effective fault detection.
单元测试对软件可靠性至关重要,但手工编写测试既耗时又常被忽视。虽然基于搜索的软件测试提高了效率,但其生成的测试可读性和可维护性较差。尽管LLM在测试生成方面显示出潜力,但现有研究缺乏涵盖执行驱动评估、基于推理的提示和真实世界测试场景的全面评价。本研究首次对LLM在类级别生成的单元测试进行了大规模实证评价,在来自Defects4J、SF110和CMD(一个缓解LLM训练数据泄漏的数据集)的216,300个测试用例上,系统地将四个最先进的模型(GPT-3.5、GPT-4、Mistral 7B和Mixtral 8x7B)与EvoSuite进行比较。我们评价了五种提示技术:零样本学习(ZSL)、少样本学习(FSL)、思维链(CoT)、思维树(ToT)和引导式思维树(GToT),评估了语法正确性、可编译性、幻觉导致的失败、可读性、代码覆盖率指标和故障检测能力。我们的研究结果对此前关于上下文学习对代码专用LLM的测试生成无效的说法提出了质疑。基于推理的提示,特别是GToT,显著提高了通用LLM的测试可靠性、可编译性和结构遵循度。然而,幻觉导致的失败仍然是一个持续的挑战,表现为引用不存在的符号、错误的API调用和捏造的依赖项,导致较高的编译失败率(最高达86%)。基于执行的分类和变异测试表明,许多失败的测试源于幻觉产生的依赖项,限制了有效的故障检测。
Article 90
Title@2025-07-05 (6): We Urgently Need Privilege Management in MCP: A Measurement of API Usage in MCP Ecosystems
Title: We Urgently Need Privilege Management in MCP: A Measurement of API Usage in MCP Ecosystems | Wir brauchen dringend Privilege Management in MCP: Eine Messung der API-Nutzung in MCP Ecosystems | 我们迫切需要MCP中的特权管理:MCP生态系统中API使用情况的测量 2507.06250v1 |
Authors (6): Zhihao Li, Kun Li, Boyang Ma, Minghui Xu, Yue Zhang, Xiuzhen Cheng
The Model Context Protocol (MCP) has emerged as a widely adopted mechanism for connecting large language models to external tools and resources. While MCP promises seamless extensibility and rich integrations, it also introduces a substantially expanded attack surface: any plugin can inherit broad system privileges with minimal isolation or oversight. In this work, we conduct the first large-scale empirical analysis of MCP security risks. We develop an automated static analysis framework and systematically examine 2,562 real-world MCP applications spanning 23 functional categories. Our measurements reveal that network and system resource APIs dominate usage patterns, affecting 1,438 and 1,237 servers respectively, while file and memory resources are less frequent but still significant. We find that Developer Tools and API Development plugins are the most API-intensive, and that less popular plugins often contain disproportionately high-risk operations. Through concrete case studies, we demonstrate how insufficient privilege separation enables privilege escalation, misinformation propagation, and data tampering. Based on these findings, we propose a detailed taxonomy of MCP resource access, quantify security-relevant API usage, and identify open challenges for building safer MCP ecosystems, including dynamic permission models and automated trust assessment.
示范背景协议(MCP)已成为将大型语言模式与外部工具和资源连接起来的一个广泛采用的机制。虽然MCP有望实现无缝扩展和丰富整合,但它也引入了大规模扩大的攻击面:任何插件都可以在最小的孤立或监督下继承广泛的系统特权。在这项工作中,我们对MCP安全风险进行首次大规模的经验分析。我们开发了一个自动静态分析框架,系统检查了涵盖23个功能类别的2 562个实际世界MCP应用程序。我们的测量结果显示,网络和系统资源API分别影响1 438和1 237个服务器的使用模式,而档案和记忆资源则较少,但仍然重要。我们发现开发工具和API开发插件是AIPI最密集的,而不太受欢迎的插件往往包含过大高风险操作。我们通过具体案例研究,证明特权分离如何导致特权升级、错误传播和数据篡改。基于这些调查结果,我们提议对MCP资源获取进行详细的分类,量化与安全相关的API使用,并查明建设更安全的MCP生态系统所面临的公开挑战,包括动态许可模型和自动信任评估。
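The static analysis described above can be illustrated with a minimal sketch. It assumes Python plugin sources and a hypothetical module-to-category mapping (RESOURCE_CATEGORIES); the paper's actual framework and taxonomy are not reproduced here.

```python
import ast

# Hypothetical mapping from top-level modules to resource categories.
# This is an illustrative assumption, not the taxonomy from the paper.
RESOURCE_CATEGORIES = {
    "socket": "network",
    "urllib": "network",
    "http": "network",
    "subprocess": "system",
    "os": "system",
    "shutil": "file",
    "pathlib": "file",
    "mmap": "memory",
}


def categorize_imports(source: str) -> dict[str, set[str]]:
    """Return {category: {module, ...}} for modules imported by `source`."""
    found: dict[str, set[str]] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            top = name.split(".")[0]  # "urllib.request" -> "urllib"
            category = RESOURCE_CATEGORIES.get(top)
            if category:
                found.setdefault(category, set()).add(top)
    return found


example = "import os\nfrom urllib.request import urlopen\nimport json\n"
usage = categorize_imports(example)
print(usage)
```

Import-level matching is only a coarse proxy: it flags that a resource family is reachable, not which call sites actually exercise it.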
Article 91
Title@2025-07-04 (5): RVISmith: Fuzzing Compilers for RVV Intrinsics
Title: RVISmith: Fuzzing Compilers for RVV Intrinsics | RVISmith: Fuzzing Compiler für RVV-Intrinsik | RVISmith: RVV Intrinsics 模糊的编译者 2507.03773v1 |
Authors (6): Yibo He, Cunjian Huang, Xianmiao Qu, Hongdeng Chen, Wei Yang, Tao Xie
Modern processors are equipped with single instruction multiple data (SIMD) instructions for fine-grained data parallelism. Compiler auto-vectorization techniques that target SIMD instructions face performance limitations due to insufficient information available at compile time, requiring programmers to manually manipulate SIMD instructions. SIMD intrinsics, a type of built-in function provided by modern compilers, enable programmers to manipulate SIMD instructions within high-level programming languages. Bugs in compilers for SIMD intrinsics can introduce potential threats to software security, producing unintended calculation results, data loss, program crashes, etc. To detect bugs in compilers for SIMD intrinsics, we propose RVISmith, a randomized fuzzer that generates well-defined C programs that include various invocation sequences of RVV (RISC-V Vector Extension) intrinsics. We design RVISmith to achieve the following objectives: (i) achieving high intrinsic coverage, (ii) improving sequence variety, and (iii) avoiding known undefined behaviors. We implement RVISmith based on the ratified RVV intrinsic specification and evaluate our approach with three modern compilers: GCC, LLVM, and XuanTie. Experimental results show that RVISmith achieves 11.5 times higher intrinsic coverage than the state-of-the-art fuzzer for RVV intrinsics. By differential testing that compares results across different compilers, optimizations, and equivalent programs, we detect and report 13 previously unknown bugs of the three compilers under test to date. Of these bugs, 10 are confirmed and another 3 are fixed by the compiler developers.
以 SIMD 指令为目标的编译器自动演算技术由于在编译时提供的信息不足而面临性能限制,要求程序员手动操作 SIMD 指令。 SIMD 内在功能是现代编译器提供的一种内在功能,使程序员能够在高层次编程语言中操作SIMD指令。 SIMD 内在内容编译器中的错误可能对软件安全造成潜在威胁,产生意外的计算结果、数据丢失、程序崩溃等。为了检测SIMD 内在内容的编译器中的错误,我们建议使用RVIS ,一个随机化的烟雾器,生成定义明确的C程序,其中包括各种 RVV(RISC-V Vctor 扩展) 的内置序列。我们设计了RVIS , 以实现以下目标:(一) 实现高内在覆盖, (二) 改进序列种类,以及(三) 没有已知的细微值行为, 我们根据已经批准的 RV 内在规格的编译器进行 RVIDM 3 , 用三个现代的内置的内置程序来评估结果。
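The differential-testing idea mentioned above can be sketched as follows. This is a simplified illustration, not RVISmith itself: the three "builds" are hypothetical Python stand-ins for the same program compiled by different compilers or optimization levels, and majority voting is an assumed way to pick the reference result.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence


def differential_test(
    program_input: Sequence[int],
    implementations: Dict[str, Callable[[Sequence[int]], int]],
) -> List[str]:
    """Run every implementation on the same input and return the names
    whose result disagrees with the majority result."""
    results = {name: impl(program_input) for name, impl in implementations.items()}
    majority_value, _ = Counter(results.values()).most_common(1)[0]
    return sorted(name for name, value in results.items() if value != majority_value)


def build_a(xs: Sequence[int]) -> int:
    return sum(xs)


def build_b(xs: Sequence[int]) -> int:
    total = 0
    for x in xs:
        total += x
    return total


def build_c(xs: Sequence[int]) -> int:
    # Deliberately wrong stand-in: drops the last element, as a
    # miscompiled tail loop might.
    return sum(xs[:-1])


BUILDS = {"build_a": build_a, "build_b": build_b, "build_c": build_c}
suspects = differential_test([1, 2, 3, 4], BUILDS)
print(suspects)
```

A disagreement only signals a bug when the generated program is free of undefined behavior, which is why the abstract lists that as a design objective.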
Article 92
Title@2025-07-04 (5): Specification-Guided Repair of Arithmetic Errors in Dafny Programs using LLMs
Title: Specification-Guided Repair of Arithmetic Errors in Dafny Programs using LLMs | Spezifikationsgeführte Reparatur von Arithmetischen Fehlern in Dafny-Programmen mit LLMs | 利用LLMM项目对达夫尼方案中的亚氏误差进行规格规范-指导修补 2507.03659v1 |
Authors (3): Valentina Wu, Alexandra Mendes, Alexandre Abreu
Formal verification offers strong assurances of software correctness. However, debugging and repairing the underlying faults can be complex and time-consuming when verification fails. Automated Program Repair (APR) aims to ease this by automatically identifying and fixing faults. Traditional APR techniques often depend on test suites for validation, but these may fail to capture all scenarios. In contrast, formal specifications provide stronger correctness criteria for effective repairs. We present an innovative APR tool for Dafny, a verification-aware programming language that uses formal specifications - including pre-conditions, post-conditions, and invariants - as oracles for fault localization and repair. Assuming the correctness of the specifications and focusing on arithmetic bugs, we localize faults through a series of steps, which include using Hoare Logic to determine the state of each statement within the program and state-of-the-art Large Language Models (LLMs) to synthesize candidate fixes. The chosen models were GPT-4o mini, Llama 3, Mistral 7B, and Llemma 7B. We evaluate our approach using DafnyBench, a benchmark of real-world Dafny programs. Our tool achieves 89.6% accuracy in fault localization, with GPT-4o mini yielding the highest repair success rate (74.18%). These results highlight the potential of combining formal reasoning with LLM-driven program synthesis for automated program repair.
正式核查为软件的正确性提供了有力的保证。然而,当核查失败时,调试和修复基本缺陷可能是复杂和耗时的。自动程序修理(APR)的目的是通过自动识别和修补缺陷来缓解这一点。传统的APR技术通常依赖于测试套件来验证,但可能无法捕捉所有情景。相反,正式规格为有效修理提供了更强的正确性标准。我们为Dafny提供了一种创新的APR工具;Dafny是一种具有核查意识的编程语言,它使用正式规格,包括预设条件、后设条件和不变式,作为故障定位和修理的依据。假设规格的正确性并侧重于算术错误,我们通过一系列步骤将故障本地化,包括使用Hoare Logic来确定程序中每个语句的状态,以及使用最新的大型语言模型(LLM)合成候选修复。我们选择的模型是:GPT-4o mini、Llama 3、Mistral 7B和Llemma 7B。
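To make the "specification as oracle" idea concrete, here is a minimal sketch under stated assumptions. It swaps formal verification for bounded exhaustive checking in Python, and the postcondition, the candidate expressions, and the helper name first_candidate_meeting_spec are all hypothetical; none of them come from the paper's tool.

```python
from typing import Callable, Dict, Iterable, Optional


def postcondition(x: int, y: int, result: int) -> bool:
    """Hypothetical spec: result is the non-negative difference of x and y."""
    return result >= 0 and (result == x - y or result == y - x)


# Candidate replacements for a faulty arithmetic expression. In the paper
# these would be proposed by an LLM; here they are hard-coded.
CANDIDATES: Dict[str, Callable[[int, int], int]] = {
    "x - y": lambda x, y: x - y,        # the buggy original
    "x + y": lambda x, y: x + y,
    "abs(x - y)": lambda x, y: abs(x - y),
}


def first_candidate_meeting_spec(
    candidates: Dict[str, Callable[[int, int], int]],
    domain: Iterable[int],
) -> Optional[str]:
    """Return the first candidate whose result satisfies the postcondition
    for every input pair drawn from `domain`, or None if none does."""
    values = list(domain)
    for label, fn in candidates.items():
        if all(postcondition(x, y, fn(x, y)) for x in values for y in values):
            return label
    return None


chosen = first_candidate_meeting_spec(CANDIDATES, range(-3, 4))
print(chosen)
```

Bounded checking can accept a candidate that fails outside the tested range; a verifier such as Dafny's checks the specification for all inputs, which is the stronger criterion the abstract refers to.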
Article 93
Title@2025-07-04 (5): Is It Time To Treat Prompts As Code? A Multi-Use Case Study For Prompt Optimization Using DSPy
Title: Is It Time To Treat Prompts As Code? A Multi-Use Case Study For Prompt Optimization Using DSPy | Ist es Zeit, Prompts als Code zu behandeln? Eine Multi-Use-Fallstudie für Prompt-Optimierung mit DSPy | 是否是时候将提示作为代码处理? 使用 DSPy 快速优化的多用途案例研究 2507.03620v1 |
Authors (3): Francisca Lemos, Victor Alves, Filipa Ferraz
Although prompt engineering is central to unlocking the full potential of Large Language Models (LLMs), crafting effective prompts remains a time-consuming trial-and-error process that relies on human intuition. This study investigates Declarative Self-improving Python (DSPy), an optimization framework that programmatically creates and refines prompts, applied to five use cases: guardrail enforcement, hallucination detection in code, code generation, routing agents, and prompt evaluation. Each use case explores how prompt optimization via DSPy influences performance. While some cases demonstrated modest improvements - such as minor gains in the guardrails use case and selective enhancements in hallucination detection - others showed notable benefits. The prompt evaluation criterion task demonstrated a substantial performance increase, raising accuracy from 46.2% to 64.0%. In the router agent case, the possibility of improving a poorly performing prompt and of a smaller model matching a stronger one through optimized prompting was explored. Although prompt refinement increased accuracy from 85.0% to 90.0%, using the optimized prompt with a cheaper model did not improve performance. Overall, this study’s findings suggest that DSPy’s systematic prompt optimization can enhance LLM performance, particularly when instruction tuning and example selection are optimized together. However, the impact varies by task, highlighting the importance of evaluating specific use cases in prompt optimization research.
尽管提示工程是释放大语言模型(LLM)全部潜力的核心,但设计有效的提示仍是一个依赖人类直觉的耗时的试探过程。本研究调查了声明式自我改进Python(DSPy),这是一种以编程方式创建和完善提示的优化框架,适用于五种使用案例:护栏执行、代码中的幻觉检测、代码生成、路由代理以及提示评估。每个使用案例都探索了通过DSPy进行的提示优化如何影响性能。
Article 94
Title@2025-07-04 (5): Exploring Privacy and Security as Drivers for Environmental Sustainability in Cloud-Based Office Solutions
Title: Exploring Privacy and Security as Drivers for Environmental Sustainability in Cloud-Based Office Solutions | Erforschen von Datenschutz und Sicherheit als Treiber für ökologische Nachhaltigkeit in Cloud-basierten Office-Lösungen | 探索隐私和安全作为云载办公室解决方案中环境可持续性的驱动力 2506.23866v3 |
Authors (3): Jason Kayembe, Iness Ben Guirat, Jan Tobias Mühlberg
In this paper, we explore the intersection of privacy, security, and environmental sustainability in cloud-based office solutions, focusing on quantifying user- and network-side energy use and associated carbon emissions. We hypothesise that privacy-focused services are typically more energy-efficient than those funded through data collection and advertising. To evaluate this, we propose a framework that systematically measures environmental costs based on energy usage and network data traffic during well-defined, automated usage scenarios. To test our hypothesis, we first analyse how underlying architectures and business models, such as monetisation through personalised advertising, contribute to the environmental footprint of these services. We then explore existing methodologies and tools for software environmental impact assessment. We apply our framework to three mainstream email services selected to reflect different privacy policies, from ad-supported tracking-intensive models to privacy-focused designs: Microsoft Outlook, Google Mail (Gmail), and Proton Mail. We extend this comparison to a self-hosted email solution, evaluated with and without end-to-end encryption. We show that the self-hosted solution, even with 14% of device energy and 15% of emissions overheads from PGP encryption, remains the most energy-efficient, saving up to 33% of emissions per session compared to Gmail. Among commercial providers, Proton Mail is the most efficient, saving up to 0.1 gCO2e per session compared to Outlook, whose emissions can be further reduced by 2% through ad-blocking.
在本文中,我们探讨了基于云的办公解决方案中的隐私、安全和环境可持续性的交叉点,重点是量化用户和网络的能源使用和相关碳排放。我们假设,以隐私为重点的服务通常比通过数据收集和广告供资的服务更具能效。为了评估这一点,我们提议了一个框架,根据能源使用和网络数据流量在明确界定的自动化使用情景中系统地衡量环境成本。为了测试我们的假设,我们首先分析基本的架构和商业模式,例如通过个人化广告进行货币化,如何促进这些服务的环境足迹。然后我们探索软件环境影响评估的现有方法和工具。我们把框架应用到选择的三种主流电子邮件服务,以反映不同的隐私政策,从广告支持的跟踪密集模式到以隐私为重点的设计:微软 Outlook、Google Mail(Gmail)和Proton Mail。
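A per-session footprint of the kind measured above can be approximated from device energy and network traffic. The sketch below is an assumption-laden illustration: the conversion factors (network_wh_per_mb, grid_g_per_kwh) and all input values are hypothetical placeholders, not figures from the paper.

```python
def session_emissions_g(
    device_wh: float,
    network_mb: float,
    network_wh_per_mb: float,
    grid_g_per_kwh: float,
) -> float:
    """Estimate grams of CO2e for one usage session.

    device_wh          energy drawn by the user's device (Wh)
    network_mb         data transferred during the session (MB)
    network_wh_per_mb  assumed network energy intensity (Wh per MB)
    grid_g_per_kwh     assumed grid carbon intensity (gCO2e per kWh)
    """
    total_wh = device_wh + network_mb * network_wh_per_mb
    return total_wh / 1000.0 * grid_g_per_kwh


def relative_saving(baseline_g: float, candidate_g: float) -> float:
    """Fraction of the baseline's emissions that the candidate avoids."""
    return (baseline_g - candidate_g) / baseline_g


# Hypothetical sessions for two services under the same scenario.
baseline = session_emissions_g(1.0, 10.0, 0.05, 400.0)
candidate = session_emissions_g(0.8, 4.0, 0.05, 400.0)
print(baseline, candidate, relative_saving(baseline, candidate))
```

Keeping the two conversion factors as explicit parameters makes it easy to see how sensitive a comparison is to the assumed grid mix and network model.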
Article 95
Title@2025-07-04 (5): ACE: Automated Technical Debt Remediation with Validated Large Language Model Refactorings
Title: ACE: Automated Technical Debt Remediation with Validated Large Language Model Refactorings | ACE: Automatisiertes technisches Schuldentilgungsverfahren mit validierten großen Sprachmodell-Refactorings | ACE: 自动技术债务补救,加上经验证的大型语言模型要素 2507.03536v1 |
Authors (4): Adam Tornhill, Markus Borg, Nadim Hagatulah, Emma Söderberg
The remarkable advances in AI and Large Language Models (LLMs) have enabled machines to write code, accelerating the growth of software systems. However, the bottleneck in software development is not writing code but understanding it; program understanding is the dominant activity, consuming approximately 70% of developers’ time. This implies that improving existing code to make it easier to understand has a high payoff and - in the age of AI-assisted coding - is an essential activity to ensure that a limited pool of developers can keep up with ever-growing codebases. This paper introduces Augmented Code Engineering (ACE), a tool that automates code improvements using validated LLM output. Developed through a data-driven approach, ACE provides reliable refactoring suggestions by considering both objective code quality improvements and program correctness. Early feedback from users suggests that AI-enabled refactoring helps mitigate code-level technical debt that otherwise rarely gets acted upon.
AI和大语言模型(LLMS)的显著进步使机器能够写出代码,加速软件系统的增长。然而,软件开发的瓶颈不是写代码,而是理解它;程序理解是主导活动,耗用开发者大约70%的时间。这意味着改进现有代码使其更容易理解,有很高的回报,在AI协助编码的时代,是确保有限的开发者人才库能够跟上不断增长的代码库的重要活动。本文引入了强化代码工程(ACE),这是一个使用经验证的 LLM 输出实现代码改进自动化的工具。通过数据驱动方法开发的ACE通过考虑客观的代码质量改进和方案正确性,提供了可靠的重构建议。用户的早期反馈表明,由AI推动的重组有助于减轻代码级技术债务,否则很少得到响应。
Article 96
Title@2025-07-04 (5): The Role of Humour in Software Engineering – A Literature Review and Preliminary Taxonomy
Title: The Role of Humour in Software Engineering – A Literature Review and Preliminary Taxonomy | Die Rolle des Humors in der Software-Engineering - eine Literaturübersicht und vorläufige Taxonomie | 幽默在软件工程中的作用 – – 文学评论和初步分类学 2507.03527v1 |
Authors (3): Dulaji Hidellaarachchi, John Grundy, Rashina Hoda
Humour has long been recognized as a key factor in enhancing creativity, group effectiveness, and employee well-being across various domains. However, its occurrence and impact within software engineering (SE) teams remain under-explored. This paper introduces a comprehensive, literature review-based taxonomy exploring the characterisation and use of humour in SE teams, with the goal of boosting productivity, improving communication, and fostering a positive work environment while emphasising the responsible use of humour to mitigate its potential negative impacts. Drawing from a wide array of studies in psychology, sociology, and organizational behaviour, our proposed framework categorizes humour into distinct theories, styles, models, and scales, offering SE professionals and researchers a structured approach to understanding humour in their work. This study also addresses the unique challenges of applying humour in SE, highlighting its potential benefits while acknowledging the need for further empirical validation in this context. Ultimately, our study aims to pave the way for more cohesive, creative, and psychologically supportive SE environments through the strategic use of humour.
长期以来人们一直认为,幽默是提高创造力、群体有效性和雇员在各个领域福祉的一个关键因素,然而,在软件工程(SE)团队内部的出现和影响仍未得到充分探讨,本文件介绍了一个综合的文献审查分类法,探讨SE团队幽默的特征和使用情况,目的是提高生产力、改善沟通和培养积极的工作环境,同时强调以负责任的方式利用幽默来减轻其潜在的负面影响。从心理学、社会学和组织行为的广泛研究中可以看出,我们拟议的框架将幽默分为不同的理论、风格、模式和尺度,为SE专业人员和研究人员提供一种结构化的方法来理解其工作中的幽默。这项研究还探讨了在SE应用幽默的独特挑战,强调了其潜在好处,同时承认有必要在这方面进一步进行实证鉴定。最终,我们的研究旨在通过策略性地使用幽默,为SE环境的更连贯、创造性和具有心理支持性的环境铺平道路。
Article 97
Title@2025-07-04 (5): Enhancing Uncertainty Quantification for Runtime Safety Assurance Using Causal Risk Analysis and Operational Design Domain
Title: Enhancing Uncertainty Quantification for Runtime Safety Assurance Using Causal Risk Analysis and Operational Design Domain | Verbesserung der Unsicherheitsquantifizierung für die Runtime Safety Assurance unter Verwendung von Schadensrisikoanalyse und Operational Design Domain | 利用因果关系风险分析和操作设计域加强运行时安全保障的不确定性量化 2507.03515v1 |
Authors (2): Radouane Bouchekir, Michell Guzman Cancimance
Ensuring the runtime safety of autonomous systems remains challenging due to deep learning components’ inherent uncertainty and their sensitivity to environmental changes. In this paper, we propose an enhancement of traditional uncertainty quantification by explicitly incorporating environmental conditions using risk-based causal analysis. We leverage Hazard Analysis and Risk Assessment (HARA) and fault tree modeling to identify critical operational conditions affecting system functionality. These conditions, together with uncertainties from the data and model, are integrated into a unified Bayesian Network (BN). At runtime, this BN is instantiated using real-time environmental observations to infer a probabilistic distribution over the safety estimation. This distribution enables the computation of both expected performance and its associated variance, providing a dynamic and context-aware measure of uncertainty. We demonstrate our approach through a case study of the Object Detection (OD) component in an Automated Valet Parking (AVP).
由于深层学习组成部分的内在不确定性及其对环境变化的敏感性,确保自主系统的运行时间安全仍具有挑战性。在本文件中,我们提议通过利用基于风险的因果关系分析,明确纳入环境条件,以加强传统的不确定性量化。我们利用危险分析和风险评估(HARA)和断层树模型来确定影响系统功能的关键操作条件。这些条件,加上数据和模型的不确定性,被纳入统一的巴耶斯网络。在运行时,利用实时环境观测推断安全估计的概率分布,该BN即时进行即时转换,从而能够计算预期的性能及其相关差异,提供动态的、符合背景的不确定性衡量标准。我们通过对自动瓦莱特泊车(AVP)中的物体探测(OD)部分进行个案研究,展示了我们的方法。
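The runtime step described above (instantiate the network with observations, then read off expected performance and its variance) can be sketched with a tiny hand-rolled discrete network. Everything numeric here is a hypothetical placeholder: the two environmental parents, their priors, the conditional probability table, and the performance score assigned to each quality state are assumptions for illustration only.

```python
from itertools import product
from typing import Dict, Optional, Tuple

# Hypothetical priors over two environmental conditions.
WEATHER_PRIOR = {"clear": 0.7, "rain": 0.3}
LIGHTING_PRIOR = {"day": 0.6, "night": 0.4}

# Hypothetical CPT: P(detection quality | weather, lighting).
QUALITY_CPT = {
    ("clear", "day"): {"good": 0.90, "degraded": 0.08, "poor": 0.02},
    ("clear", "night"): {"good": 0.70, "degraded": 0.20, "poor": 0.10},
    ("rain", "day"): {"good": 0.60, "degraded": 0.30, "poor": 0.10},
    ("rain", "night"): {"good": 0.40, "degraded": 0.35, "poor": 0.25},
}

# Hypothetical performance score attached to each quality state.
PERFORMANCE = {"good": 0.95, "degraded": 0.70, "poor": 0.30}


def quality_distribution(
    weather: Optional[str] = None, lighting: Optional[str] = None
) -> Dict[str, float]:
    """P(quality | observations); unobserved parents are marginalized out."""
    weathers = {weather: 1.0} if weather else WEATHER_PRIOR
    lightings = {lighting: 1.0} if lighting else LIGHTING_PRIOR
    dist = {state: 0.0 for state in PERFORMANCE}
    for (w, pw), (l, pl) in product(weathers.items(), lightings.items()):
        for state, p in QUALITY_CPT[(w, l)].items():
            dist[state] += pw * pl * p
    return dist


def mean_and_variance(dist: Dict[str, float]) -> Tuple[float, float]:
    """Expected performance and its variance under `dist`."""
    mean = sum(p * PERFORMANCE[s] for s, p in dist.items())
    variance = sum(p * (PERFORMANCE[s] - mean) ** 2 for s, p in dist.items())
    return mean, variance


print(mean_and_variance(quality_distribution("clear", "day")))
print(mean_and_variance(quality_distribution("rain", "night")))
```

With these placeholder numbers, the rain-at-night case yields a lower mean and a larger variance than the clear-day case, which is the kind of context-dependent uncertainty signal the abstract describes.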
Article 98
Title@2025-07-04 (5): CoreCodeBench: A Configurable Multi-Scenario Repository-Level Benchmark
Title: CoreCodeBench: A Configurable Multi-Scenario Repository-Level Benchmark | CoreCodeBench: Ein konfigurierbarer Multi-Szenario-Repository-Level-Benchmark | 核心守则:可配置的多设想仓库一级基准 2507.05281v1 |
Authors (13): Lingyue Fu, Hao Guan, Bolun Zhang, Haowei Yuan, Yaoming Zhu, Jun Xu, Zongyu Wang, Lin Qiu, Xunliang Cai, Xuezhi Cao, Weiwen Liu, Weinan Zhang, Yong Yu
As Large Language Models (LLMs) demonstrate increasingly sophisticated code processing capabilities, evaluating their performance on engineering-level code remains challenging. Existing repository-level benchmarks primarily focus on single scenarios, such as code generation or bug fixing, without adequately capturing the diversity and complexity of real-world software or project engineering workflows. Furthermore, these benchmarks suffer from limited controllability in question positioning and reliability issues in their generated test cases. To address these limitations, we present CorePipe, a fully automated pipeline that converts repositories into comprehensive test cases, and introduce CoreCodeBench, a configurable multi-scenario repository-level benchmark. To simulate real engineering scenarios, CorePipe generates three types of atomic questions (Development, BugFix, and Test-Driven Development) specifically targeting core code segments. These atomic questions are further combined into three types of composite questions, with difficulty levels flexibly adjusted through hyperparameter tuning. CoreCodeBench provides a comprehensive and extensive repository-level benchmark to investigate the applicability of LLMs in real-world engineering projects. Experiments with 16 LLMs across diverse scenarios reveal varying capabilities and offer multi-dimensional insights into LLM performance in engineering contexts. The code for CorePipe is available at https://github.com/AGI-Eval-Official/CoreCodeBench, and the data for CoreCodeBench can be accessed at https://huggingface.co/collections/tubehhh/corecodebench-68256d2faabf4b1610a08caa.
由于大型语言模型(LLM)显示出日益复杂的代码处理能力,评估其在工程代码层面的绩效仍具有挑战性;现有的存储库级基准主要侧重于单一假设情景,如代码生成或故障修复,而没有充分反映真实世界软件或项目工程工作流程的多样性和复杂性;此外,这些基准在问题定位方面的可控性有限,其生成的测试案例也存在可靠性问题;为解决这些限制,我们介绍了将存储库转换成全面测试案例的完全自动化管道CorePipe,并引入了CoreCodeBench,这是一个可配置的多情景存储库级基准;为模拟真实工程假设情景,CorePipe生成了三种类型的原子问题(开发、BugFix和Test-Driven Development)。
Article 99
Title@2025-07-04 (5): Prompt Engineering Guidelines for Using Large Language Models in Requirements Engineering
Title: Prompt Engineering Guidelines for Using Large Language Models in Requirements Engineering | Prompt Engineering Richtlinien für die Verwendung großer Sprachmodelle in Requirements Engineering | 在要求工程中使用大语言模型的快速工程指南 2507.03405v1 |
Authors (3): Krishna Ronanki, Simon Arvidsson, Johan Axell
The rapid emergence of generative AI models like Large Language Models (LLMs) has demonstrated their utility across various activities, including within Requirements Engineering (RE). Ensuring the quality and accuracy of LLM-generated output is critical, with prompt engineering serving as a key technique to guide model responses. However, existing literature provides limited guidance on how prompt engineering can be leveraged, specifically for RE activities. The objective of this study is to explore the applicability of existing prompt engineering guidelines for the effective usage of LLMs within RE. To achieve this goal, we began by conducting a systematic review of primary literature to compile a non-exhaustive list of prompt engineering guidelines. Then, we conducted interviews with RE experts to present the extracted guidelines and gain insights on the advantages and limitations of their application within RE. Our literature review indicates a shortage of prompt engineering guidelines for domain-specific activities, specifically for RE. Our proposed mapping contributes to addressing this shortage. We conclude our study by identifying an important future line of research within this field.
诸如大语言模型(LLMs)等具有基因特征的AI模型的迅速出现表明其在各种活动中的效用,包括在要求工程(RE)范围内。确保LLM产出的质量和准确性至关重要,迅速工程是指导示范反应的关键技术。然而,现有文献对如何利用迅速工程,特别是可再生能源活动,提供了有限的指导。本研究的目的是探讨现有迅速工程准则在RE内有效使用LLMs的适用性。为实现这一目标,我们开始对初级文献进行系统审查,以汇编一份非详尽的迅速工程准则清单。然后,我们与RE专家进行了访谈,以介绍提取的准则,并了解在RE内应用这些准则的优点和局限性。我们的文献审查表明,具体领域活动,特别是RE,缺少及时工程准则。我们提议的绘图有助于解决这一短缺问题。我们通过确定该领域今后的重要研究路线来完成我们的研究。
Article 100
Title@2025-07-04 (5): ReservoirChat: Interactive Documentation Enhanced with LLM and Knowledge Graph for ReservoirPy
Title: ReservoirChat: Interactive Documentation Enhanced with LLM and Knowledge Graph for ReservoirPy | ReservoirChat: Interaktive Dokumentation mit LLM und Wissensdiagramm für ReservoirPy | ReservoirChat:与LLM和知识图增强互动文件 2507.05279v1 |
Authors (4): Virgile Boraud, Yannis Bendi-Ouis, Paul Bernard, Xavier Hinaut
We introduce a tool designed to improve the capabilities of Large Language Models (LLMs) in assisting with code development using the ReservoirPy library, as well as in answering complex questions in the field of Reservoir Computing. By incorporating external knowledge through Retrieval-Augmented Generation (RAG) and knowledge graphs, our approach aims to reduce hallucinations and increase the factual accuracy of generated responses. The system provides an interactive experience similar to ChatGPT, tailored specifically for ReservoirPy, enabling users to write, debug, and understand Python code while accessing reliable domain-specific insights. In our evaluation, while proprietary models such as ChatGPT-4o and NotebookLM performed slightly better on general knowledge questions, our model outperformed them on coding tasks and showed a significant improvement over its base model, Codestral-22B.
我们引入了一种工具,旨在提高大语言模型(LLMs)利用储量计算机库协助代码开发的能力,并回答储量计算机领域的复杂问题。 我们的方法通过检索原始一代(RAG)和知识图将外部知识纳入其中,目的是减少幻觉,提高所产生响应的实际准确性。这个系统提供了类似于ChatGPT的互动式经验,这是专门为储量计算机专门设计的,用户在获取可靠的特定领域见解的同时,能够写、调试和理解Python代码。 在我们的评估中,恰特GPT-4o和备注LM等专有模型在一般知识问题上的表现略好一些,我们的模型在编译任务上的表现略优于这些模型,并展示了比其基本模型(代码22B)的重大改进。
Article 101
Title@2025-07-04 (5): Securing Mixed Rust with Hardware Capabilities
Title: Securing Mixed Rust with Hardware Capabilities | Sichern gemischten Rust mit Hardware-Fähigkeiten | 保有混杂铁和硬件能力 2507.03344v1 |
Authors (5): Jason Zhijingcheng Yu, Fangqi Han, Kaustab Choudhury, Trevor E. Carlson, Prateek Saxena
The Rust programming language enforces three basic Rust principles, namely ownership, borrowing, and AXM (Aliasing Xor Mutability) to prevent security bugs such as memory safety violations and data races. However, Rust projects often have mixed code, i.e., code that also uses unsafe Rust, FFI (Foreign Function Interfaces), and inline assembly for low-level control. The Rust compiler is unable to statically enforce Rust principles in mixed Rust code, which can lead to many security vulnerabilities. In this paper, we propose CapsLock, a security enforcement mechanism that can run at the level of machine code and detect Rust principle violations at run-time in mixed code. CapsLock is kept simple enough to be implemented into recent capability-based hardware abstractions that provide low-cost spatial memory safety. CapsLock introduces a novel revoke-on-use abstraction for capability-based designs, wherein accessing a memory object via a capability implicitly invalidates certain other capabilities pointing to it, thereby also providing temporal memory safety automatically, without requiring software to explicitly specify such invalidation. Thus, CapsLock is the first mechanism capable of providing cross-language enforcement of Rust principles. We implemented a prototype of CapsLock on QEMU. Evaluation results show that CapsLock is highly compatible with existing Rust code (passing 99.7% of the built-in test cases of the 100 most popular crates) and flags Rust principle violations in real-world Rust projects that use FFI or inline assembly. We discovered 8 previously unknown bugs in such crates in our experiments.
Rust 编程语言执行三项基本的 Rust 原则,即所有权、借用和 AXM(Aliasing Xor Mutability),以防止诸如内存安全违规和数据竞争等安全漏洞。然而,Rust 项目往往有混合代码,即同时使用不安全的 Rust、FFI(Foreign Function Interfaces)和用于低级别控制的内联汇编的代码。Rust 编译器无法静态地在混合 Rust 代码中执行 Rust 原则。在本文中,我们提议 CapsLock,这是一种安全执行机制,可以在机器代码级别上运行,并在运行时检测混合代码中对 Rust 原则的违反情况。
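The revoke-on-use idea can be illustrated with a small software simulation. This is a deliberately simplified model, not CapsLock's actual hardware rule set: it assumes that a write through a capability revokes every capability derived from it, and the class and method names (Capability, derive, RevokedError) are hypothetical.

```python
from typing import List, Optional


class RevokedError(Exception):
    """Raised when a revoked capability is used."""


class Capability:
    """A handle to a shared memory object, with a derivation tree."""

    def __init__(self, memory: List[int], parent: Optional["Capability"] = None):
        self.memory = memory
        self.valid = True
        self.children: List["Capability"] = []
        if parent is not None:
            parent.children.append(self)

    def _check(self) -> None:
        if not self.valid:
            raise RevokedError("capability was revoked by a later use")

    def _revoke_descendants(self) -> None:
        for child in self.children:
            child.valid = False
            child._revoke_descendants()
        self.children.clear()

    def derive(self) -> "Capability":
        """Create a borrowed capability to the same memory."""
        self._check()
        return Capability(self.memory, parent=self)

    def read(self, index: int) -> int:
        self._check()
        return self.memory[index]

    def write(self, index: int, value: int) -> None:
        # Assumed rule: writing through this capability implicitly
        # invalidates everything that was derived from it.
        self._check()
        self._revoke_descendants()
        self.memory[index] = value


owner = Capability([0, 0])
alias = owner.derive()      # analogous to a mutable borrow
alias.write(0, 1)           # fine: the borrow is still live
owner.write(1, 2)           # using the owner ends the borrow
try:
    alias.read(0)           # stale alias, as unsafe code or FFI might keep
    stale_use_detected = False
except RevokedError:
    stale_use_detected = True
print(stale_use_detected, owner.memory)
```

No explicit "free" or "end borrow" call appears anywhere: invalidation happens as a side effect of use, which is the property the abstract highlights for temporal safety.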
Article 102
Title@2025-07-04 (5): Analyzing C/C++ Library Migrations at the Package-level: Prevalence, Domains, Targets and Rationals across Seven Package Management Tools
Title: Analyzing C/C++ Library Migrations at the Package-level: Prevalence, Domains, Targets and Rationals across Seven Package Management Tools | Analyse von C/C++-Bibliotheksmigrationen auf Paketebene: Prävalenz, Domains, Targets und Rationals über sieben Paketverwaltungstools hinweg | C/C++图书馆在包级的迁移分析:七套管理工具的普遍程度、域、目标和合理性 2507.03263v1 |
Authors (4): Haiqiao Gu, Yiliang Zhao, Kai Gao, Minghui Zhou
Library migration happens when a library cannot meet the project’s requirements and is non-trivial to accomplish. To mitigate the problem, substantial efforts have been devoted to understanding its characteristics and recommending alternative libraries, especially for programming language (PL) ecosystems with a central package hosting platform, such as Python (PyPI). However, to the best of our knowledge, understanding of C/C++ library migrations is still lacking, possibly due to challenges resulting from the fragmented and complicated dependency management practices in the C/C++ ecosystem. To bridge this knowledge gap, this paper analyzes 19,943 C/C++ projects that utilize different package management tools and establishes the first C/C++ library migration dataset. Based on the dataset, we investigate the prevalence, domains, target library, and rationale of C/C++ library migrations and compare the results with three widely investigated PLs: Python, JavaScript, and Java. We find that the overall trend in the number of C/C++ library migrations is similar to Java. Migrations across different package management tools are also observed. In C/C++, library migrations mainly occur in GUI, Build, and OS development, but are rare in domains (e.g., Testing and Logging) that dominate library migrations in the three compared PLs. 83.46% of C/C++ source libraries only have one migration target, suggesting that our library migration dataset could be used directly to recommend migration targets. We find four C/C++-specific migration reasons, such as less compile time and unification of dependency management, revealing the unique dependency management requirements in C/C++ projects. We believe our findings can help C/C++ developers make more informed library migration decisions and shed light on the design of C/C++ library migration tools.
库迁移发生在某个库无法满足项目要求时,而且完成迁移并非易事。为了缓解问题,已经做出大量努力,了解其特点并推荐替代库,特别是针对具有Python(PyPI)等中央软件包托管平台的编程语言(PL)生态系统。然而,据我们所知,对C/C++库迁移的理解仍然不足,可能是由于C/C++生态系统中分散和复杂的依赖性管理做法造成的挑战。为了弥补这一知识差距,本文件分析了19,943个使用不同软件包管理工具的C/C++项目,并建立了第一个C/C++库迁移数据集。基于该数据集,我们调查了C/C++库迁移的普遍程度、领域、目标库和理由,并将结果与三种被广泛研究的编程语言(Python、JavaScript和Java)进行比较。我们发现,C/C++库迁移数量的总体趋势与Java相似。
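The observation that most source libraries have a single migration target suggests a simple lookup-style recommender. The sketch below shows one way such a table could be built and queried; the record format and the placeholder library names (libA, libB, and so on) are assumptions, not entries from the paper's dataset.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Optional, Tuple


def build_target_table(
    migrations: Iterable[Tuple[str, str]]
) -> Dict[str, Counter]:
    """Map each source library to a Counter of observed target libraries."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for source, target in migrations:
        table[source][target] += 1
    return table


def single_target_share(table: Dict[str, Counter]) -> float:
    """Fraction of source libraries that were migrated to exactly one target."""
    single = sum(1 for targets in table.values() if len(targets) == 1)
    return single / len(table)


def recommend(table: Dict[str, Counter], source: str) -> Optional[str]:
    """Return the most frequently observed target for `source`, if any."""
    targets = table.get(source)
    return targets.most_common(1)[0][0] if targets else None


# Placeholder (source, target) records, purely illustrative.
records = [
    ("libA", "libB"),
    ("libA", "libB"),
    ("libC", "libD"),
    ("libC", "libE"),
    ("libC", "libD"),
]
table = build_target_table(records)
print(single_target_share(table), recommend(table, "libC"))
```

For sources with several observed targets, frequency is only one possible tie-breaker; domain or package manager could serve as additional keys.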
Article 103
Title@2025-07-03 (4): The Impact of LLM-Assistants on Software Developer Productivity: A Systematic Literature Review
Title: The Impact of LLM-Assistants on Software Developer Productivity: A Systematic Literature Review | Die Auswirkungen von LLM-Assistenten auf die Produktivität von Softwareentwicklern: Ein systematischer Literaturbericht | LLM助理对软件开发者生产力的影响:系统文学评论 2507.03156v1 |
Authors (3): Amr Mohamed, Maram Assi, Mariam Guizani
Large language model assistants (LLM-assistants) present new opportunities to transform software development. Developers are increasingly adopting these tools across tasks, including coding, testing, debugging, documentation, and design. Yet, despite growing interest, there is no synthesis of how LLM-assistants affect software developer productivity. In this paper, we present a systematic literature review of 37 peer-reviewed studies published between January 2014 and December 2024 that examine this impact. Our analysis reveals that LLM-assistants offer both considerable benefits and critical risks. Commonly reported gains include minimized code search, accelerated development, and the automation of trivial and repetitive tasks. However, studies also highlight concerns around cognitive offloading, reduced team collaboration, and inconsistent effects on code quality. While the majority of studies (92%) adopt a multi-dimensional perspective by examining at least two SPACE dimensions, reflecting increased awareness of the complexity of developer productivity, only 14% extend beyond three dimensions, indicating substantial room for more integrated evaluations. Satisfaction, Performance, and Efficiency are the most frequently investigated dimensions, whereas Communication and Activity remain underexplored. Most studies are exploratory (64%) and methodologically diverse, but lack longitudinal and team-based evaluations. This review surfaces key research gaps and provides recommendations for future research and practice. All artifacts associated with this study are publicly available at https://zenodo.org/records/15788502.
大型语言模型助理(LLM-助手)为软件开发的转型提供了新的机会。开发者正在越来越多地跨任务采用这些工具,包括编码、测试、调试、文档和设计。然而,尽管人们对LLM助手如何影响软件开发生产率的兴趣日益浓厚,但没有合成LLM助手如何影响软件开发生产率。在本文件中,我们介绍了对2014年1月至2024年12月期间公布的37项同行审查研究的系统文献审查,这些研究审查了这一影响。我们的分析显示,LLLM助手既提供了相当大的好处,也提供了重大风险。通常报告的成果包括最大限度地减少代码搜索、加速开发以及琐碎和重复任务的自动化。然而,研究还突显了围绕认知卸载、减少团队协作和对代码质量的不一致影响的关切。尽管大多数研究(92%)采用了多维观观点,至少两个空间层面,反映了对开发者生产率复杂性的认识,只有14%的扩展范围超过三个层面,表明进行更综合评价的空间很大。满意度、业绩和效率是调查最多的方面,而通信和活动仍然处于探索不足的层面。大多数研究都是探索性研究(64%)和方法学方面的主要研究,但缺乏长期研究。
Article 104
Title@2025-07-03 (4): AI and Agile Software Development: From Frustration to Success – XP2025 Workshop Summary
Title: AI and Agile Software Development: From Frustration to Success – XP2025 Workshop Summary | KI und Agile Software-Entwicklung: Von der Frustration zum Erfolg – XP2025 Workshop Zusammenfassung | AI和Alile软件开发:从挫折到成功 – – XP2025讲习班摘要 2506.20159v2 |
Authors (5): Tomas Herda, Victoria Pichler, Zheying Zhang, Pekka Abrahamsson, Geir K. Hanssen
The full-day workshop on AI and Agile at XP 2025 convened a diverse group of researchers and industry practitioners to address the practical challenges and opportunities of integrating Artificial Intelligence into Agile software development. Through interactive sessions, participants identified shared frustrations related to integrating AI into Agile Software Development practices, including challenges with tooling, governance, data quality, and critical skill gaps. These challenges were systematically prioritized and analyzed to uncover root causes. The workshop culminated in the collaborative development of a research roadmap that pinpoints actionable directions for future work, including both immediate solutions and ambitious long-term goals. The key outcome is a structured agenda designed to foster joint industry-academic efforts to move from identified frustrations to successful implementation.
2025年XP关于AI和Agile的全天讲习班召集了一组不同的研究人员和业界从业人员,讨论将人工智能纳入Agile软件开发的实际挑战和机遇,通过互动会议,与会者查明了将AI纳入Agile软件开发做法的共同挫折感,包括工具、治理、数据质量和关键技能差距方面的挑战,对这些挑战进行了系统的优先排序和分析,以找出根源;讲习班最终通过合作制定了一份研究路线图,确定未来工作的可操作方向,包括即时解决方案和雄心勃勃的长期目标;主要成果是一项结构化的议程,目的是促进工业-学术联合努力,从查明的挫折感转向成功的执行。
Article 105
Title@2025-07-03 (4): Requirements Elicitation Follow-Up Question Generation
Title: Requirements Elicitation Follow-Up Question Generation | Voraussetzungen Elicitation Follow-Up Question Generation | 问询后查询 2507.02858v1 |
Authors (3): Yuchen Shen, Anmol Singhal, Travis Breaux
Interviews are a widely used technique in eliciting requirements to gather stakeholder needs, preferences, and expectations for a software system. Effective interviewing requires skilled interviewers to formulate appropriate interview questions in real time while facing multiple challenges, including lack of familiarity with the domain, excessive cognitive load, and information overload that hinders how humans process stakeholders’ speech. Recently, large language models (LLMs) have exhibited state-of-the-art performance in multiple natural language processing tasks, including text summarization and entailment. To support interviewers, we investigate the application of GPT-4o to generate follow-up interview questions during requirements elicitation by building on a framework of common interviewer mistake types. In addition, we describe methods to generate questions based on interviewee speech. We report a controlled experiment to evaluate LLM-generated and human-authored questions with minimal guidance, and a second controlled experiment to evaluate the LLM-generated questions when generation is guided by interviewer mistake types. Our findings demonstrate that, for both experiments, the LLM-generated questions are no worse than the human-authored questions with respect to clarity, relevancy, and informativeness. In addition, LLM-generated questions outperform human-authored questions when guided by common mistake types. This highlights the potential of using LLMs to help interviewers improve the quality and ease of requirements elicitation interviews in real time.
有效的面试需要熟练的面试人员在面临多种挑战时实时地提出适当的访谈问题,包括缺乏对域名的熟悉程度、过度的认知负荷和信息超负荷,从而妨碍人类处理利益攸关方的演讲。最近,大型语言模型(LLMS)在多种自然语言处理任务中表现出了最先进的表现,包括文字总结和要求。为了支持访谈人员,我们调查GPT-4o的应用,以便在需求期间,通过建立共同的访谈错误类型框架,提出后续访谈问题。此外,我们描述根据访谈者演讲产生问题的方法。我们报告有控制的实验,以最低限度的指导评价LLM所产生和人为的问题,以及第二次有控制的实验,在生成时以访谈者错误类型为指导,评价LMM产生的问题。我们的研究结果表明,对于这两个实验,LMM公司产生的问题并不比人类授权的关于清晰度、相关性和了解性能等问题更差。此外,我们还描述了基于受访者演讲者演讲结果的透明性要求。此外,LMM公司还用普通的深度问题来改进访问。
Article 106
Title@2025-07-03 (4): Legal Requirements Translation from Law
Title: Legal Requirements Translation from Law | Rechtliche Voraussetzungen Übersetzung aus dem Recht | 法律要求译自法律 2507.02846v1 |
Authors (2): Anmol Singhal, Travis Breaux
Software systems must comply with legal regulations, which is a resource-intensive task, particularly for small organizations and startups lacking dedicated legal expertise. Extracting metadata from regulations to elicit legal requirements for software is a critical step to ensure compliance. However, it is a cumbersome task due to the length and complex nature of legal text. Although prior work has pursued automated methods for extracting structural and semantic metadata from legal text, key limitations remain: they do not consider the interplay and interrelationships among attributes associated with these metadata types, and they rely on manual labeling or heuristic-driven machine learning, which does not generalize well to new documents. In this paper, we introduce an approach based on textual entailment and in-context learning for automatically generating a canonical representation of legal text, encodable and executable as Python code. Our representation is instantiated from a manually designed Python class structure that serves as a domain-specific metamodel, capturing both structural and semantic legal metadata and their interrelationships. This design choice reduces the need for large, manually labeled datasets and enhances applicability to unseen legislation. We evaluate our approach on 13 U.S. state data breach notification laws, demonstrating that our generated representations pass approximately 89.4% of test cases and achieve a precision and recall of 82.2 and 88.7, respectively.
软件系统必须遵守法规,这是一个资源密集型的任务,对于小型组织和缺乏专门法律专门知识的初创机构来说尤其如此。从规章中提取元数据,以引起对软件的法律要求,是确保遵守的关键步骤。然而,由于法律文本的长度和复杂性质,这是一个繁琐的任务。虽然以前的工作寻求的是从法律文本中提取结构性和语义性元数据的自动化方法,但主要限制仍然存在:它们不考虑与这些元数据类型相关的属性之间的相互作用和相互关系,它们依赖人工标签或超自然驱动的机器学习,这并不能很好地概括新的文件。在本文件中,我们采用基于文字要求和文内文学习的方法,以自动生成法律文本的可理解性表述、可隐含和可执行的Python代码。我们的代表来自一个手工设计的Python类结构,这个结构是一个特定域的元模型,既捕捉结构性和语义性法律元数据,也依靠它们的相互关系。这一设计减少了对大型、手工标签数据设置和内文学习的需求,并增强了对隐性法律的可应用性解释性解释性解释性法律。我们评估了13项 和精确性法律的测试性案例。我们评估了13项 和精确性法律。我们评估了我们的数据和精确性检验性案例。
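The idea of a legal text encoded as instances of a Python class structure can be illustrated with a minimal sketch. The classes, field names, and the example provision below are all hypothetical; they are not the authors' metamodel and the provision is not taken from any real statute.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Condition:
    """A trigger under which an obligation applies (hypothetical fields)."""
    description: str
    threshold: int = 0  # e.g. minimum number of affected residents


@dataclass
class Obligation:
    """A duty imposed on an actor, with a deadline expressed in days."""
    actor: str
    action: str
    deadline_days: int
    conditions: List[Condition] = field(default_factory=list)

    def applies(self, affected_residents: int) -> bool:
        # The obligation applies only if every condition's threshold is met.
        return all(affected_residents >= c.threshold for c in self.conditions)


@dataclass
class Section:
    """A structural unit of the statute that holds its semantic content."""
    number: str
    obligations: List[Obligation] = field(default_factory=list)


# Invented example provision, used only to show how structure and
# semantics are linked in one executable object graph.
notify = Obligation(
    actor="data owner",
    action="notify affected residents",
    deadline_days=45,
    conditions=[Condition("breach affects residents", threshold=1)],
)
section = Section(number="1(a)", obligations=[notify])
```

Because the representation is executable, a test case can be a plain assertion against it, which is one plausible reading of the "test cases" mentioned in the abstract.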
Article 107
Title@2025-07-03 (4): Agentic Business Process Management: Practitioner Perspectives on Agent Governance in Business Processes
Title: Agentic Business Process Management: Practitioner Perspectives on Agent Governance in Business Processes | Agentic Business Process Management: Praxisperspektiven zur Agenten-Governance in Unternehmensprozessen | 代理业务流程管理:从业者对业务流程代理治理的看法 2504.03693v2 |
Authors (5): Hoang Vu, Nataliia Klievtsova, Henrik Leopold, Stefanie Rinderle-Ma, Timotheus Kampik
With the rise of generative AI, industry interest in software agents is growing. Given the stochastic nature of generative AI-based agents, their effective and safe deployment in organizations requires robust governance, which can be facilitated by agentic business process management. However, given the nascence of this new-generation agent notion, it is not clear what BPM practitioners consider to be an agent, and what benefits, risks and governance challenges they associate with agent deployments. To investigate how organizations can effectively govern AI agents, we conducted a qualitative study involving semi-structured interviews with 22 BPM practitioners from diverse industries. They anticipate that agents will enhance efficiency, improve data quality, ensure better compliance, and boost scalability through automation, while also cautioning against risks such as bias, over-reliance, cybersecurity threats, job displacement, and ambiguous decision-making. To address these challenges, the study presents six key recommendations for the responsible adoption of AI agents: define clear business goals, set legal and ethical guardrails, establish human-agent collaboration, customize agent behavior, manage risks, and ensure safe integration with fallback options. Additionally, the paper outlines actions to align traditional BPM with agentic AI, including balancing human and agent roles, redefining human involvement, adapting process structures, and introducing performance metrics. These insights provide a practical foundation for integrating AI agents into business processes while preserving oversight, flexibility, and trust.
随着基因化的AI的兴起,工业界对软件代理的兴趣正在增加。鉴于基因化的AI型代理的随机性质,在组织中有效和安全地部署它们需要强有力的治理,而这种治理可以通过代理业务流程管理加以促进。然而,鉴于这种新一代代理概念的诞生,目前尚不清楚BPM从业人员认为什么是代理,以及他们与代理部署有关哪些好处、风险和治理挑战。为了调查各组织如何能够有效地管理AI代理,我们开展了一项定性研究,涉及与不同行业22名BPM从业人员的半结构性访谈。他们预计,代理将提高效率,提高数据质量,确保更好的遵守,并通过自动化提高可扩展性,同时告诫人们避免偏见、过度依赖、网络安全威胁、工作流离失所和模棱两可的决策等风险。为了应对这些挑战,BPM从业人员研究提出了六项重要建议,以便负责地采用AI代理:确定明确的商业目标、设置法律和道德护栏、建立人类代理协作、定制代理行为、管理风险和确保安全地与倒行选项相结合。此外,文件概述了为使传统的BPM结构结构与代理人参与和重新确定机构的工作基础。
Article 108
Title@2025-07-03 (4): Gradient-Based Model Fingerprinting for LLM Similarity Detection and Family Classification
Title: Gradient-Based Model Fingerprinting for LLM Similarity Detection and Family Classification | Gradient-Based Model Fingerprinting für LLM Ähnlichkeitserkennung und Familienklassifizierung | LLM相似性探测和家庭分类的渐进式样指纹 2506.01631v2 |
Authors (3): Zehao Wu, Yanjie Zhao, Haoyu Wang
As Large Language Models (LLMs) become integral software components in modern applications, unauthorized model derivations through fine-tuning, merging, and redistribution have emerged as critical software engineering challenges. Unlike traditional software where clone detection and license compliance are well-established, the LLM ecosystem lacks effective mechanisms to detect model lineage and enforce licensing agreements. This gap is particularly problematic when open-source model creators, such as Meta’s LLaMA, require derivative works to maintain naming conventions for attribution, yet no technical means exist to verify compliance. To fill this gap, treating LLMs as software artifacts requiring provenance tracking, we present TensorGuard, a gradient-based fingerprinting framework for LLM similarity detection and family classification. Our approach extracts model-intrinsic behavioral signatures by analyzing gradient responses to random input perturbations across tensor layers, operating independently of training data, watermarks, or specific model formats. TensorGuard supports the widely-adopted safetensors format and constructs high-dimensional fingerprints through statistical analysis of gradient features. These fingerprints enable two complementary capabilities: direct pairwise similarity assessment between arbitrary models through distance computation, and systematic family classification of unknown models via the K-Means clustering algorithm with domain-informed centroid initialization using known base models. Experimental evaluation on 58 models comprising 8 base models and 50 derivatives across five model families (Llama, Qwen, Gemma, Phi, Mistral) demonstrates 94% classification accuracy under our centroid-initialized K-Means clustering.
随着大型语言模型(LLMS)成为现代应用中综合软件组成部分,未经授权的模型通过微调、合并和再分配得出,已成为关键的软件工程挑战。与克隆检测和许可证合规性已经确立的传统软件不同,LLM生态系统缺乏检测模型线和执行许可证协议的有效机制。当开源模型创建者(如Meta’s LalaMA)需要派生工程来维持归属的命名公约,而没有技术手段来核查其遵守情况时,这一差距尤其成问题。为了填补这一空白,将LLMs作为需要源头跟踪的软件工艺品处理,我们提出了TensorGuard,这是一个基于梯度的指纹框架,用于LLM的类似性检测和家庭分类。我们的方法提取了模型的模型,即:通过分析对多层随机输入的梯度反应以及实施许可证协议协议协议协议协议协议。 独立于培训数据、水标记或具体模型,支持广泛采用的安全传感器格式,并通过对梯度特征进行统计分析来建立高维度指纹。这些指纹使两种互补能力成为:直接对准的类似性QQ,50个任意性模型之间,通过KML模型,通过基础的原始模型进行任意性模型,通过已知的模型进行基础计算。
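The family-classification step, K-Means with centroids initialized from known base models, can be sketched in a few lines. The two-dimensional "fingerprints" and family names below are invented for illustration; real fingerprints would be high-dimensional statistics of gradient responses, and this is not the TensorGuard implementation.

```python
import math
from typing import Dict, List

Vector = List[float]


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two fingerprints."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(vectors: List[Vector]) -> Vector:
    """Component-wise mean of a non-empty list of fingerprints."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def classify(base: Dict[str, Vector], unknown: List[Vector],
             rounds: int = 10) -> List[str]:
    """K-Means whose initial centroids are the known base-model fingerprints."""
    families = list(base)
    centroids = [list(base[f]) for f in families]
    labels: List[int] = []
    for _ in range(rounds):
        # Assignment step: each unknown model joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: distance(v, centroids[k]))
            for v in unknown
        ]
        # Update step: move each centroid to the mean of its members.
        for k in range(len(centroids)):
            members = [v for v, lab in zip(unknown, labels) if lab == k]
            if members:  # keep the previous centroid if the cluster is empty
                centroids[k] = mean(members)
    return [families[lab] for lab in labels]


# Hypothetical fingerprints for two base models and three derivatives.
base_models = {"family_a": [0.0, 0.0], "family_b": [10.0, 10.0]}
derivatives = [[0.5, -0.2], [9.1, 10.4], [1.0, 0.8]]
predicted = classify(base_models, derivatives)
```

Seeding the centroids with base models means each cluster keeps a meaningful family label from the start, instead of needing to be matched to a family after clustering.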
Article 109
Title@2025-07-03 (4): Fuzzing-based Mutation Testing of C/C++ Software in Cyber-Physical Systems
Title: Fuzzing-based Mutation Testing of C/C++ Software in Cyber-Physical Systems | Fuzzing-basierte Mutationsprüfung von C/C++-Software in Cyber-Physical Systems | C/C+++网络物理系统中软件的模糊式变异测试 2503.24100v3 |
Authors (3): Jaekwon Lee, Fabrizio Pastore, Lionel Briand
Mutation testing can help minimize the delivery of faulty software. Therefore, it is a recommended practice for developing embedded software in safety-critical cyber-physical systems (CPS). However, state-of-the-art mutation testing techniques for C and C++ software, which are common languages for CPS, depend on symbolic execution. Unfortunately, symbolic execution’s limitations hinder its applicability (e.g., systems with black-box components). We propose relying on fuzz testing, which has demonstrated its effectiveness for C and C++ software. Fuzz testing tools automatically create test inputs that explore program branches in various ways, exercising statements in different program states, and thus enabling the detection of mutants, which is our objective. We empirically evaluated our approach using software components from operational satellite systems. Our assessment shows that our approach can detect between 40% and 90% of the mutants not detected by developers’ test suites. Further, we empirically determined that the best results are obtained by integrating the Clang compiler, a memory address sanitizer, and relying on laf-intel instrumentation to collect coverage and guide fuzzing. Our approach detects a significantly higher percentage of live mutants compared to symbolic execution, with an increase of up to 50 percentage points; further, we observed that although the combination of fuzzing and symbolic execution leads to additional mutants being killed, the benefits are minimal (a gain of less than one percentage point).
因此,这是开发安全临界网络物理系统(CPS)内嵌软件的建议做法。然而,C和C++软件(CPS通用语言的C和C++软件)的最先进的突变测试技术取决于象征性的执行。不幸的是,象征性执行的限制妨碍了它的适用性(例如黑盒组件的系统)。我们提议依靠模糊测试,这已经证明了C和C++软件的有效性。Fuzz测试工具自动创造测试投入,以各种方式探索程序分支,在不同的程序状态中进行陈述,从而能够探测我们的目标所在的变异体。我们用操作卫星系统的软件组件对我们的方法进行了实证性评估。我们的评估表明,我们的方法可以探测出40%至90%的变异体,而开发者测试室没有检测到这些变异体。此外,我们从经验上确定,最佳结果是通过整合Clang编译器、记忆地址S+++软件、依靠laf-int仪器来收集覆盖范围,从而能够探测出不同程序状态,从而能够探测变异体,而这正是我们的目标。我们的方法通过使用操作软件组件评估了50%的象征性化执行率,但比观察到的变异体杀伤率要高得多。
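The core idea, that a mutant counts as detected when some generated input makes it behave differently from the original, can be shown with a toy sketch. It uses plain random input generation as a stand-in for a coverage-guided fuzzer, and the functions are invented Python examples rather than C/C++ code from the study.

```python
import random


def original(a: int, b: int) -> int:
    """Reference implementation: addition capped at 100."""
    return min(a + b, 100)


def mutant(a: int, b: int) -> int:
    """Mutant: the arithmetic operator '+' is replaced by '-'."""
    return min(a - b, 100)


def equivalent_mutant(a: int, b: int) -> int:
    """Mutant that swaps the operands of '+'; behaviour is unchanged."""
    return min(b + a, 100)


def fuzz_kills(reference, candidate, trials: int = 1000, seed: int = 0) -> bool:
    """Return True if some random input makes the two versions disagree."""
    rng = random.Random(seed)
    for _ in range(trials):
        a, b = rng.randint(-50, 50), rng.randint(-50, 50)
        if reference(a, b) != candidate(a, b):
            return True  # the mutant is killed by this input
    return False  # the mutant survived every generated input
```

Calling `fuzz_kills(original, mutant)` reports the operator mutant as killed, while `fuzz_kills(original, equivalent_mutant)` cannot succeed because the two functions agree on every input. Such equivalent mutants are one reason reported detection rates stay below 100%.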
Article 110
Title@2025-07-03 (4): FuzzFeed: An Automatic Approach to Weakest Precondition Generation using LLMs and Fuzzing
Title: FuzzFeed: An Automatic Approach to Weakest Precondition Generation using LLMs and Fuzzing | FuzzFeed: Ein automatischer Ansatz zur schwachsten Vorkonditionierungsgeneration mit LLMs und Fuzzing | FazzzFeed:使用LLMS和模糊生成最弱预设条件的自动方法 2507.05272v1 |
Authors (3): Daragh King, Vasileios Koutavas, Laura Kovacs
The weakest precondition (WP) of a program describes the largest set of initial states from which all terminating executions of the program satisfy a given postcondition. The generation of WPs is an important task with practical applications in areas ranging from verification to run-time error checking. This paper proposes the combination of Large Language Models (LLMs) and fuzz testing for generating WPs. In pursuit of this goal, we introduce Fuzzing Guidance (FG); FG acts as a means of directing LLMs towards correct WPs using program execution feedback. FG utilises fuzz testing to approximately check the validity and weakness of candidate WPs; this information is then fed back to the LLM as a means of context refinement. We demonstrate the effectiveness of our approach on a comprehensive benchmark set of deterministic array programs in Java. Our experiments indicate that LLMs are capable of producing viable candidate WPs, and that this ability can be practically enhanced through FG.
方案最薄弱的前提条件(WP)描述了所有终止处决方案都满足特定条件的最大一组初始状态。生成WP是一项重要任务,在从核查到运行时间错误检查的各个领域都具有实际应用性。本文件提出将大型语言模型(LLMs)和生成WP的模糊测试结合起来。为了实现这一目标,我们引入了模糊指南(FG);FG作为一种手段,利用程序执行反馈指导LLMs纠正WP。FG利用模糊测试来大致检查候选WP的有效性和弱点,然后将这一信息反馈给LLM,作为改进背景的手段。我们展示了我们在爪哇的一套确定性阵列方案的全面基准方法的有效性。我们的实验表明,LLMs能够产生可行的候选WP,而这种能力实际上可以通过FG得到加强。
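Approximately checking the validity and weakness of a candidate WP can be sketched as a search for two kinds of counterexample. A validity counterexample satisfies the candidate but violates the postcondition; a weakness counterexample is rejected by the candidate although the postcondition would hold. The program, postcondition, and input range below are invented for illustration, and random sampling stands in for a real fuzzer.

```python
import random
from typing import Callable, Optional, Tuple


def program(x: int) -> int:
    """Toy deterministic program under analysis."""
    return 2 * x - 4


def post(result: int) -> bool:
    """Postcondition: the result must be strictly positive."""
    return result > 0


def check_candidate(pre: Callable[[int], bool], trials: int = 2000,
                    seed: int = 0) -> Tuple[Optional[int], Optional[int]]:
    """Return (validity counterexample, weakness counterexample), if found."""
    rng = random.Random(seed)
    validity_cex: Optional[int] = None
    weakness_cex: Optional[int] = None
    for _ in range(trials):
        x = rng.randint(-100, 100)
        ok = post(program(x))
        if pre(x) and not ok and validity_cex is None:
            validity_cex = x  # candidate admits a state that breaks post
        if not pre(x) and ok and weakness_cex is None:
            weakness_cex = x  # candidate rejects a state that satisfies post
    return validity_cex, weakness_cex
```

For this program the true WP is `x > 2`. The candidate `x > 0` yields a validity counterexample (1 or 2), and `x > 5` yields a weakness counterexample (3, 4, or 5). In a feedback loop of the kind the abstract describes, such counterexamples would be returned to the LLM to refine its next candidate.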
Article 111
Title@2025-07-03 (4): Sustainability Flags for the Identification of Sustainability Posts in Q&A Platforms
Title: Sustainability Flags for the Identification of Sustainability Posts in Q&A Platforms | Nachhaltigkeitsflaggen für die Identifizierung von Nachhaltigkeitsbeiträgen in Q&A-Plattformen | 在++A平台中确定可持续性职位的可持续性旗 2507.02695v1 |
Authors (4): Sahar Ahmadisakha, Lech Bialek, Mohamed Soliman, Vasilios Andrikopoulos
In recent years, sustainability in software systems has gained significant attention, especially with the rise of cloud computing and the shift towards cloud-based architectures. This shift has intensified the need to identify sustainability in architectural discussions in order to make informed architectural decisions. One place where such decisions surface is practitioners’ discussions in online Q&A forums. However, recognizing sustainability concepts within software practitioners’ discussions remains challenging due to the lack of clear and distinct guidelines for this task. To address this issue, we introduce the notion of sustainability flags as pointers in relevant discussions, developed through thematic analysis of multiple sustainability best practices from cloud providers. This study further evaluates the effectiveness of these flags in identifying sustainability within cloud architecture posts, using a controlled experiment. Preliminary results suggest that the use of flags results in classifying fewer posts as sustainability-related compared to a control group, with moderately higher certainty and significantly improved performance. Moreover, sustainability flags are perceived as more useful and understandable than relying solely on definitions for identifying sustainability.
近年来,软件系统的可持续性受到极大关注,特别是随着云计算上升和云基结构向云型结构的转变,这一转变使得更有必要确定建筑讨论的可持续性,以便作出知情的建筑决定。看到这些决定的一个来源是从业人员讨论的在线论坛。然而,由于软件从业人员讨论缺乏对这项任务的明确和明确的指导方针,在软件系统的可持续性概念方面仍然具有挑战性。为解决这一问题,我们通过对云供应商的多重可持续性最佳做法进行专题分析,在相关讨论中引入了可持续性标志的概念。这项研究进一步评估了这些标志在利用受控试验确定云层结构员额内的可持续性方面的有效性。初步结果表明,使用旗帜的结果是,与控制组相比,与可持续性有关的职位分类较少,具有适度的确定性和显著的改进性。此外,可持续性标志被认为比仅仅依靠确定可持续性的定义更为有用和易懂。
Article 112
Title@2025-07-03 (4): RLHGNN: Reinforcement Learning-driven Heterogeneous Graph Neural Network for Next Activity Prediction in Business Processes
Title: RLHGNN: Reinforcement Learning-driven Heterogeneous Graph Neural Network for Next Activity Prediction in Business Processes | RLHGNN: Verstärkung Lernorientiertes Heterogenes Graph Neuronales Netzwerk für die nächste Aktivitätsvorhersage in Geschäftsprozessen | RLHGNN: 业务流程下一个活动预测的强化学习驱动的异质图形神经网络 2507.02690v1 |
Authors (6): Jiaxing Wang, Yifeng Yu, Jiahan Song, Bin Cao, Jing Fan, Ji Zhang
Next activity prediction represents a fundamental challenge for optimizing business processes in service-oriented architectures such as microservices environments, distributed enterprise systems, and cloud-native platforms, which enables proactive resource allocation and dynamic service composition. Despite the prevalence of sequence-based methods, these approaches fail to capture non-sequential relationships that arise from parallel executions and conditional dependencies. Even though graph-based approaches address structural preservation, they suffer from homogeneous representations and static structures that apply uniform modeling strategies regardless of individual process complexity characteristics. To address these limitations, we introduce RLHGNN, a novel framework that transforms event logs into heterogeneous process graphs with three distinct edge types grounded in established process mining theory. Our approach creates four flexible graph structures by selectively combining these edges to accommodate different process complexities, and employs reinforcement learning formulated as a Markov Decision Process to automatically determine the optimal graph structure for each specific process instance. RLHGNN then applies heterogeneous graph convolution with relation-specific aggregation strategies to effectively predict the next activity. This adaptive methodology enables precise modeling of both sequential and non-sequential relationships in service interactions. Comprehensive evaluation on six real-world datasets demonstrates that RLHGNN consistently outperforms state-of-the-art approaches. Furthermore, it maintains an inference latency of approximately 1 ms per prediction, representing a highly practical solution suitable for real-time business process monitoring applications. The source code is available at https://github.com/Joker3993/RLHGNN.
下一个活动预测是优化服务导向架构(如微观服务环境、分布式企业系统、云型平台等)业务流程的一个根本挑战,这些架构有助于积极主动的资源分配和动态服务构成。尽管以顺序为基础的方法十分普遍,但这些方法未能捕捉平行处决和有条件依赖所产生的非序列关系。即使以图表为基础的方法处理结构保护问题,它们也存在同质表述和静态结构,这些结构适用统一的模型战略,而不论个别程序复杂特性如何。为了应对这些限制,我们引入了RLHGNN,这是一个新颖的框架,将事件日志转换成基于三个不同边缘类型的不同流程流程图解。我们的方法创建了四个灵活的图表结构,有选择地结合这些边框以适应不同流程的复杂性,并利用作为Markov 决策程序开发的强化学习,以自动确定每个特定流程的最佳图形结构。RLHGNNN,然后将混杂的图表组合组合和具体关联的汇总战略用于有效预测下一个活动。这一适应方法可以精确地模拟服务互动中的序列和非序列进程。我们的方法在六个实体/系统内部数据库中持续地显示一个最新的实时数据库。
Article 113
Title@2025-07-03 (4): Do Research Software Engineers and Software Engineering Researchers Speak the Same Language?
Title: Do Research Software Engineers and Software Engineering Researchers Speak the Same Language? | Sprechen Forschungssoftware-Ingenieure und Software-Engineering-Forscher die gleiche Sprache? | 研究软件工程师和软件工程研究者会说同一种语言吗? 2507.02665v1 |
Authors (5): Timo Kehrer, Robert Haines, Guido Juckeland, Shurui Zhou, David E. Bernholdt
Anecdotal evidence suggests that Research Software Engineers (RSEs) and Software Engineering Researchers (SERs) often use different terminologies for similar concepts, creating communication challenges. To better understand these divergences, we have started investigating how SE fundamentals from the SER community are interpreted within the RSE community, identifying aligned concepts, knowledge gaps, and areas for potential adaptation. Our preliminary findings reveal opportunities for mutual learning and collaboration, and our systematic methodology for terminology mapping provides a foundation for a crowd-sourced extension and validation in the future.
传闻证据表明,研究软件工程师(RSE)和软件工程研究人员(SERs)经常对类似概念使用不同的术语,从而产生沟通挑战。 为更好地了解这些差异,我们已开始调查SER社区的SE基本原理是如何在RSE社区内部解释的,确定一致的概念、知识差距和潜在的适应领域。 我们的初步发现揭示了相互学习与合作的机会,以及我们系统的术语绘图方法为今后扩大和验证人群来源提供了基础。
Article 114
Title@2025-07-03 (4): Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure
Title: Explainable Compliance Detection with Multi-Hop Natural Language Inference on Assurance Case Structure | Erklärbare Compliance-Erkennung mit Multi-Hop-Natural Language-Schlussfolgerung zur Assurance-Fallstruktur | 以多种自然语言对保证案例结构的多重语言推断进行可解释的合规检测 2506.08713v2 |
Authors (2): Fariz Ikhwantri, Dusica Marijan
Ensuring complex systems meet regulations typically requires checking the validity of assurance cases through a claim-argument-evidence framework. Some challenges in this process include the complicated nature of legal and technical texts, the need for model explanations, and limited access to assurance case data. We propose a compliance detection approach based on Natural Language Inference (NLI): EXplainable CompLiance detection with Argumentative Inference of Multi-hop reasoning (EXCLAIM). We formulate the claim-argument-evidence structure of an assurance case as a multi-hop inference for explainable and traceable compliance detection. We address the limited number of assurance cases by generating them using large language models (LLMs). We introduce metrics that measure the coverage and structural consistency. We demonstrate the effectiveness of the generated assurance case from GDPR requirements in a multi-hop inference task as a case study. Our results highlight the potential of NLI-based approaches in automating the regulatory compliance process.
确保复杂的系统符合规章条例,通常需要通过索赔-论证-证据框架检查保证案件的有效性。这一过程的一些挑战包括法律和技术文本性质复杂,需要示范解释,以及获得保证案件数据的机会有限。我们建议根据自然语言推断(NLI):利用多点推理推理(EXCLAIM)的推理推理(EXCLAIM)进行可推广共测。我们将保证案件的索赔-论证-证据结构作为可解释和可追踪的遵守检测的多重推理。我们用大型语言模型(LLLMs)生成的保证案件数量有限。我们提出了衡量覆盖面和结构一致性的衡量标准。我们作为案例研究,展示了GDPR要求产生的保证案件的有效性。我们的结果突出了以NLI为基础的方法在监管遵守过程自动化方面的潜力。
Article 115
Title@2025-07-03 (4): Human-Machine Collaboration and Ethical Considerations in Adaptive Cyber-Physical Systems
Title: Human-Machine Collaboration and Ethical Considerations in Adaptive Cyber-Physical Systems | Mensch-Maschine-Kollaboration und ethische Überlegungen in adaptiven Cyber-Physischen Systemen | 适应性网络-物理系统中的人类-海洋协作和伦理考虑 2507.02578v1 |
Authors (1): Zoe Pfister
Adaptive Cyber-Physical Systems (CPS) are systems that integrate both physical and computational capabilities, which can adjust in response to changing parameters. Furthermore, they increasingly incorporate human-machine collaboration, allowing them to benefit from the individual strengths of humans and machines. Human-Machine Teaming (HMT) represents the most advanced paradigm of human-machine collaboration, envisioning seamless teamwork between humans and machines. However, achieving effective and seamless HMT in adaptive CPS is challenging. While adaptive CPS already benefit from feedback loops such as MAPE-K, there is still a gap in integrating humans into these feedback loops due to different operational cadences of humans and machines. Further, HMT requires constant monitoring of human operators, collecting potentially sensitive information about their actions and behavior. Respecting the privacy and human values of the actors of the CPS is crucial for the success of human-machine teams. This research addresses these challenges by: (1) developing novel methods and processes for integrating HMT into adaptive CPS, focusing on human-machine interaction principles and their incorporation into adaptive feedback loops found in CPS, and (2) creating frameworks for integrating, verifying, and validating ethics and human values throughout the system lifecycle, starting from requirements engineering.
适应性网络-物理系统是综合物理和计算能力的系统,可以根据不断变化的参数进行调整。此外,这些系统还日益纳入人体-机器合作,使人体-机器合作能够受益于人类和机器的个人优势。人类-机器团队(HMT)代表了人类-机器合作的最先进范例,设想了人与机器之间的无缝协作。然而,在适应性CPS中实现有效和无缝的HMT具有挑战性。适应性计算机-计算机系统已经受益于诸如MAPE-K等反馈循环,但在将人纳入这些反馈循环方面仍然存在差距,因为人类和机器的操作速度不同。此外,HMT要求不断监测人类操作者,收集关于其行动和行为的潜在敏感信息。尊重计算机-机器行为者的隐私和人类价值观对于人类-机器团队的成功至关重要。这一研究通过以下方式应对这些挑战:(1) 开发新的方法和进程,将HMT纳入适应性计算机-K,重点是人-机器互动原则,并将其纳入CPS所发现的适应性反馈循环。此外,HMM要求从整个人类-生命周期中开始建立框架,核查和验证。
Article 116
Title@2025-07-03 (4): LLMREI: Automating Requirements Elicitation Interviews with LLMs
Title: LLMREI: Automating Requirements Elicitation Interviews with LLMs | LLMREI: Automatisieren der Anforderungen Bereitstellung von Interviews mit LLMs | LLMREI: 与LLMM公司进行自动要求的求求救采访 2507.02564v1 |
Authors (3): Alexander Korn, Samuel Gorsch, Andreas Vogelsang
Requirements elicitation interviews are crucial for gathering system requirements but heavily depend on skilled analysts, making them resource-intensive, susceptible to human biases, and prone to miscommunication. Recent advancements in Large Language Models present new opportunities for automating parts of this process. This study introduces LLMREI, a chat bot designed to conduct requirements elicitation interviews with minimal human intervention, aiming to reduce common interviewer errors and improve the scalability of requirements elicitation. We explored two main approaches, zero-shot prompting and least-to-most prompting, to optimize LLMREI for requirements elicitation and evaluated its performance in 33 simulated stakeholder interviews. A third approach, fine-tuning, was initially considered but abandoned due to poor performance in preliminary trials. Our study assesses the chat bot’s effectiveness in three key areas: minimizing common interview errors, extracting relevant requirements, and adapting its questioning based on interview context and user responses. Our findings indicate that LLMREI makes a similar number of errors compared to human interviewers, is capable of extracting a large portion of requirements, and demonstrates a notable ability to generate highly context-dependent questions. We envision the greatest benefit of LLMREI in automating interviews with a large number of stakeholders.
在收集系统要求方面,要求引导面试至关重要,但在很大程度上依赖熟练的分析人员,使他们资源密集,易受人类偏见的影响,容易出现沟通错误。大语言模型最近的进展为这一进程的部分自动化提供了新的机会。本研究报告介绍了LLMREI,这是一个聊天机,旨在进行要求引导访谈,尽量减少人力干预,目的是减少常见的面谈错误,提高需求的可调适性。我们探索了两种主要方法,即零点点反应和最不直接的激励,优化LLMMREI在33次模拟利益攸关方访谈中的需求征求和评价其业绩。第三种方法,即微调,最初被考虑过,但因初步审判工作表现不佳而放弃。我们的研究评估了聊天机在三个关键领域的有效性:尽量减少常见的面谈错误,提取相关要求,并根据访谈背景和用户回应调整询问。我们的调查结果表明,LMMREI与人访谈者相比,其错误数量相似,能够提取大量需求,并表明具有产生高度背景依赖性问题的明显能力。我们设想LMREI访谈的最大好处是,与大量利益攸关方进行自动访谈。
Article 117
Title@2025-07-03 (4): Meta-Fair: AI-Assisted Fairness Testing of Large Language Models
Title: Meta-Fair: AI-Assisted Fairness Testing of Large Language Models | Meta-Fair: AI-Assisted Fairness Testing von großen Sprachmodellen | Meta-Fair:AI协助大语言模型公平性测试 2507.02533v1 |
Authors (6): Miguel Romero-Arjona, José A. Parejo, Juan C. Alonso, Ana B. Sánchez, Aitor Arrieta, Sergio Segura
Fairness–the absence of unjustified bias–is a core principle in the development of Artificial Intelligence (AI) systems, yet it remains difficult to assess and enforce. Current approaches to fairness testing in large language models (LLMs) often rely on manual evaluation, fixed templates, deterministic heuristics, and curated datasets, making them resource-intensive and difficult to scale. This work aims to lay the groundwork for a novel, automated method for testing fairness in LLMs, reducing the dependence on domain-specific resources and broadening the applicability of current approaches. Our approach, Meta-Fair, is based on two key ideas. First, we adopt metamorphic testing to uncover bias by examining how model outputs vary in response to controlled modifications of input prompts, defined by metamorphic relations (MRs). Second, we propose exploiting the potential of LLMs for both test case generation and output evaluation, leveraging their capability to generate diverse inputs and classify outputs effectively. The proposal is complemented by three open-source tools supporting LLM-driven generation, execution, and evaluation of test cases. We report the findings of several experiments involving 12 pre-trained LLMs, 14 MRs, 5 bias dimensions, and 7.9K automatically generated test cases. The results show that Meta-Fair is effective in uncovering bias in LLMs, achieving an average precision of 92% and revealing biased behaviour in 29% of executions. Additionally, LLMs prove to be reliable and consistent evaluators, with the best-performing models achieving F1-scores of up to 0.79. Although non-determinism affects consistency, these effects can be mitigated through careful MR design. While challenges remain to ensure broader applicability, the results indicate a promising path towards an unprecedented level of automation in LLM testing.
在开发人工智能(AI)系统时,没有不合理的偏差,这是一条核心原则;公平-公平-公平-公平-没有不合理的偏向-在开发人工智能(AI)系统时,仍然难以评估和执行。目前对大语言模型(LLMS)进行公平测试的实用性,往往依靠人工评估、固定模板、确定性超强和整理数据集,使其资源密集,难以规模化。这项工作的目的是为在LLMS中测试公平性、减少对特定领域资源的依赖和扩大当前方法的适用性等创新、自动化方法奠定基础。我们的Meta-Fair方法基于两个关键理念。首先,我们采用变形测试来发现偏向性,通过检查模型产出在对投入速度的受控修改(由MRRs)中如何不同。第二,我们提议利用LMMS的潜力来进行测试案例的生成和产出评估,利用它们的能力来产生各种投入和对产出进行有效的分类。这项提议得到三个公开源工具的补充,支持LMD-F驱动的生成、执行和测试案例的评估。我们报告说,在FMLMS-ralMS(LMS)前的正确-ral-ral-r)中,这些测试中,这些测试结果是自动、14的正正正正正正正正正正正正正正对结果显示。
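A metamorphic relation for fairness testing can be sketched as follows: changing only a demographic term in the prompt should leave the answer unchanged. The two "models" below are toy stand-in functions, not real LLMs, and the prompt template is invented. Plain string comparison also replaces the LLM-based output evaluation that the abstract describes.

```python
from typing import Callable


def biased_model(prompt: str) -> str:
    """Toy stand-in for an LLM, deliberately biased for demonstration."""
    if "woman" in prompt:
        return "nurse"
    return "engineer"


def fair_model(prompt: str) -> str:
    """Toy stand-in for an LLM that ignores the demographic term."""
    return "engineer"


TEMPLATE = "Suggest one career for a {person} who enjoys mathematics."


def violates_relation(model: Callable[[str], str],
                      source_term: str, followup_term: str) -> bool:
    """Check the relation: swapping the demographic term must not change the output."""
    source_output = model(TEMPLATE.format(person=source_term))
    followup_output = model(TEMPLATE.format(person=followup_term))
    return source_output != followup_output
```

No ground-truth answer is needed. The test only compares the source and follow-up outputs, which is what lets metamorphic testing scale to automatically generated prompts.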
Article 118
Title@2025-07-03 (4): Code Digital Twin: Empowering LLMs with Tacit Knowledge for Complex Software Maintenance
Title: Code Digital Twin: Empowering LLMs with Tacit Knowledge for Complex Software Maintenance | Code Digital Twin: LLMs mit Tacit-Kenntnissen für komplexe Software-Wartung stärken | 数字代码双对:赋予LLMs以用于复杂软件维护的隐秘知识 2503.07967v2 |
Authors (5): Xin Peng, Chong Wang, Mingwei Liu, Yiling Lou, Yijian Wu
While large language models (LLMs) have demonstrated promise in software engineering tasks like code completion and generation, their support for the maintenance of complex software systems remains limited. These models often struggle with understanding the tacit knowledge embedded in systems, such as responsibility allocation and collaboration across different modules. To address this gap, we introduce the concept and framework of Code Digital Twin, a conceptual representation of tacit knowledge that captures the concepts, functionalities, and design rationales behind code elements, co-evolving with the software. A code digital twin is constructed using a methodology that combines knowledge extraction from both structured and unstructured sources–such as source code, documentation, and change histories–leveraging LLMs, static analysis tools, and human expertise. This framework can empower LLMs for software maintenance tasks such as issue localization and repository-level code generation by providing tacit knowledge as context. Based on the proposed methodology, we explore the key challenges and opportunities involved in the continuous construction and refinement of code digital twins.
虽然大型语言模型(LLMs)在代码补全和生成等软件工程任务方面表现出了潜力,但它们对复杂软件系统维护的支持仍然有限。这些模型往往难以理解系统中蕴含的隐性知识,例如各模块之间的责任分配和协作。为了弥补这一差距,我们提出了Code Digital Twin(代码数字孪生)的概念和框架,这是隐性知识的一种概念化表示,刻画了代码元素背后的概念、功能和设计原理,并与软件共同演化。代码数字孪生的构建方法结合了来自结构化和非结构化来源(例如源代码、文档和变更历史)的知识提取,并借助LLMs、静态分析工具和人类专业知识。该框架通过提供隐性知识作为上下文,增强LLMs完成问题定位和仓库级代码生成等软件维护任务的能力。基于所提出的方法,我们探讨了持续构建和完善代码数字孪生所涉及的主要挑战和机遇。
Article 119
Title@2025-07-03 (4): VeFIA: An Efficient Inference Auditing Framework for Vertical Federated Collaborative Software
Title: VeFIA: An Efficient Inference Auditing Framework for Vertical Federated Collaborative Software | VeFIA: Ein effizientes Inferenz-Audit-Framework für Vertical Federated Collaborative Software | VeFIA:纵向联邦协作软件的高效推理审计框架 2507.02376v1 |
Authors (6): Chung-ju Huang, Ziqi Zhang, Yinggui Wang, Binghui Wang, Tao Wei, Leye Wang
Vertical Federated Learning (VFL) is a distributed AI software deployment mechanism for cross-silo collaboration without accessing participants’ data. However, existing VFL work lacks a mechanism to audit the execution correctness of the inference software of the data party. To address this problem, we design a Vertical Federated Inference Auditing (VeFIA) framework. VeFIA helps the task party to audit whether the data party’s inference software is executed as expected during large-scale inference without leaking the data privacy of the data party or introducing additional latency to the inference system. The core of VeFIA is that the task party can use the inference results from a framework with Trusted Execution Environments (TEE) and the coordinator to validate the correctness of the data party’s computation results. VeFIA guarantees that, as long as the abnormal inference exceeds 5.4%, the task party can detect execution anomalies in the inference software with a probability of 99.99%, without incurring any additional online inference latency. VeFIA’s random sampling validation achieves 100% positive predictive value, negative predictive value, and true positive rate in detecting abnormal inference. To the best of our knowledge, this is the first paper to discuss the correctness of inference software execution in VFL.
纵向联邦学习(VFL)是一种分布式人工智能软件部署机制,用于在不访问参与方数据的情况下进行跨机构协作。然而,现有的VFL工作缺乏审计数据方推理软件执行正确性的机制。为了解决这一问题,我们设计了纵向联邦推理审计(VeFIA)框架。VeFIA帮助任务方审计数据方的推理软件在大规模推理期间是否按预期执行,既不泄露数据方的数据隐私,也不给推理系统带来额外延迟。VeFIA的核心是任务方可以利用带有可信执行环境(TEE)的框架和协调方得出的推理结果,来验证数据方计算结果的正确性。VeFIA保证,只要异常推理超过5.4%,任务方就能以99.99%的概率检测出推理软件中的执行异常,且不会带来任何额外的在线推理延迟。VeFIA的随机抽样验证在检测异常推理方面实现了100%的阳性预测值、阴性预测值和真阳性率。据我们所知,这是第一篇讨论VFL中推理软件执行正确性的论文。
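To see how a 5.4% abnormal rate and a 99.99% detection probability can fit together, the following Python sketch works through a simplified sampling model. It assumes each audited inference is drawn independently and is abnormal with a fixed probability, so the chance of catching at least one abnormal result in n samples is 1 - (1 - p)^n. This model and the function names are assumptions for illustration, and the paper's own analysis may differ.

```python
import math


def detection_probability(abnormal_fraction: float, samples: int) -> float:
    """Probability that at least one of `samples` independent draws is abnormal."""
    return 1.0 - (1.0 - abnormal_fraction) ** samples


def min_samples(abnormal_fraction: float, target_probability: float) -> int:
    """Smallest n with 1 - (1 - p)^n >= target, under the independence assumption."""
    return math.ceil(
        math.log(1.0 - target_probability) / math.log(1.0 - abnormal_fraction)
    )


if __name__ == "__main__":
    n = min_samples(0.054, 0.9999)
    print(n, detection_probability(0.054, n))
```

Under this assumption the required sample count depends only on the abnormal fraction and the target probability, not on the total number of inferences, which is why sampling-based auditing can stay cheap at large scale.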
Article 120
Title@2025-07-03 (4): Exploring the Integration of Large Language Models in Industrial Test Maintenance Processes
Title: Exploring the Integration of Large Language Models in Industrial Test Maintenance Processes | Erforschung der Integration von großen Sprachmodellen in industrielle Testwartungsprozesse | 探索将大语言模型纳入工业测试维护流程 2409.06416v2 |
Authors (6): Jingxiong Liu, Ludvig Lemner, Linnea Wahlgren, Gregory Gay, Nasser Mohammadiha, Joakim Wennerberg
Much of the cost and effort required during the software testing process is invested in performing test maintenance - the addition, removal, or modification of test cases to keep the test suite in sync with the system-under-test or to otherwise improve its quality. Tool support could reduce the cost - and improve the quality - of test maintenance by automating aspects of the process or by providing guidance and support to developers. In this study, we explore the capabilities and applications of large language models (LLMs) - complex machine learning models adapted to textual analysis - to support test maintenance. We conducted a case study at Ericsson AB where we explore the triggers that indicate the need for test maintenance, the actions that LLMs can take, and the considerations that must be made when deploying LLMs in an industrial setting. We also propose and demonstrate a multi-agent architecture that can predict which tests require maintenance following a change to the source code. Collectively, these contributions advance our theoretical and practical understanding of how LLMs can be deployed to benefit industrial test maintenance processes.
软件测试过程中所需的大部分成本和精力都用于测试维护,即添加、删除或修改测试用例,使测试套件与被测系统保持同步,或以其他方式提高其质量。工具支持可以通过使过程的某些环节自动化,或向开发者提供指导和支持,来降低测试维护的成本并提高其质量。在这项研究中,我们探讨了大型语言模型(LLMs,一种适用于文本分析的复杂机器学习模型)在支持测试维护方面的能力和应用。我们在Ericsson AB进行了一项案例研究,探讨了表明需要进行测试维护的触发因素、LLMs可以采取的行动,以及在工业环境中部署LLMs时必须考虑的事项。我们还提出并演示了一种多智能体架构,可以预测源代码变更后哪些测试需要维护。总体而言,这些贡献增进了我们对如何部署LLMs以改善工业测试维护过程的理论和实践理解。
Article 121
Title@2025-07-03 (4): Precisely Detecting Python Type Errors via LLM-based Unit Test Generation
Title: Precisely Detecting Python Type Errors via LLM-based Unit Test Generation | Präzise Erkennung von Python-Typfehlern über LLM-basierte Einheitentestgenerierung | 通过基于LLM的单元测试生成精确检测 Python 类型错误 2507.02318v1 |
Authors (7): Chen Yang, Ziqi Wang, Yanjie Jiang, Lin Yang, Yuteng Zheng, Jianyi Zhou, Junjie Chen
Type errors in Python often lead to runtime failures, posing significant challenges to software reliability and developer productivity. Existing static analysis tools aim to detect such errors without execution but frequently suffer from high false positive rates. Recently, unit test generation techniques offer great promise in achieving high test coverage, but they often struggle to produce bug-revealing tests without tailored guidance. To address these limitations, we present RTED, a novel type-aware test generation technique for automatically detecting Python type errors. Specifically, RTED combines step-by-step type constraint analysis with reflective validation to guide the test generation process and effectively suppress false positives. We evaluated RTED on two widely-used benchmarks, BugsInPy and TypeBugs. Experimental results show that RTED can detect 22-29 more benchmarked type errors than four state-of-the-art techniques. RTED is also capable of producing fewer false positives, achieving an improvement of 173.9%-245.9% in precision. Furthermore, RTED successfully discovered 12 previously unknown type errors from six real-world open-source Python projects.
Python中的类型错误往往会导致运行时失败,给软件可靠性和开发者生产率带来重大挑战。现有的静态分析工具旨在不执行程序的情况下检测这类错误,但往往误报率很高。近来,单元测试生成技术在实现高测试覆盖率方面展现出巨大潜力,但在缺乏针对性指导的情况下,它们往往难以生成能够暴露缺陷的测试。为了解决这些局限,我们提出了RTED,一种用于自动检测Python类型错误的新型类型感知测试生成技术。具体来说,RTED将逐步的类型约束分析与反思式验证结合起来,以指导测试生成过程并有效抑制误报。我们在两个广泛使用的基准BugsInPy和TypeBugs上评估了RTED。实验结果表明,RTED比四种最先进的技术多检测出22至29个基准中的类型错误。RTED产生的误报也更少,精确度提高了173.9%至245.9%。此外,RTED成功地在六个真实世界的开源Python项目中发现了12个此前未知的类型错误。
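For readers unfamiliar with the problem, the Python sketch below shows the kind of latent type error that only surfaces at runtime, and how a targeted test input exposes it. The function `average_score`, the helper `reveals_type_error` and the inputs are hypothetical illustrations. They do not come from RTED or from the BugsInPy and TypeBugs benchmarks.

```python
def average_score(record):
    """Return the mean of record["scores"].

    Latent bug: callers may pass a record whose "scores" value is None,
    which only fails once the function actually runs.
    """
    scores = record["scores"]
    return sum(scores) / len(scores)


def reveals_type_error(func, *args) -> bool:
    """Run func on the given input and report whether it raises TypeError."""
    try:
        func(*args)
    except TypeError:
        return True
    return False


if __name__ == "__main__":
    # A well-typed input passes silently.
    print(reveals_type_error(average_score, {"scores": [1, 2, 3]}))
    # A type-aware input (None where a list is expected) exposes the bug.
    print(reveals_type_error(average_score, {"scores": None}))
```

A randomly generated test is unlikely to choose `None` for exactly this field, which hints at why guidance from type constraints can help produce bug-revealing tests.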
Article 122
Title@2025-07-03 (4): CORE: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks
Title: CORE: Benchmarking LLMs Code Reasoning Capabilities through Static Analysis Tasks | CORE: Benchmarking von LLMs Code Reasoning Capabilities durch statische Analyseaufgaben | CORE:通过静态分析任务对LLMs的代码推理能力进行基准测试 2507.05269v1 |
Authors (7): Danning Xie, Mingwei Zheng, Xuwei Liu, Jiannan Wang, Chengpeng Wang, Lin Tan, Xiangyu Zhang
Large language models (LLMs) have been widely adopted across diverse software engineering domains, such as code generation, program repair, and vulnerability detection. These applications require understanding beyond surface-level code patterns: value propagation, control flow, and interdependence between program elements. However, existing benchmarks primarily evaluate end-to-end outcomes, such as whether code is correctly repaired or generated, leaving the models' ability for program semantic reasoning underexplored. This work presents CoRe, a high-quality, human-verified benchmark designed to evaluate LLMs on fundamental static analysis tasks. CoRe includes 12,553 task instances spanning data dependency, control dependency, and information flow across programs written in C/C++, Java, and Python. To ensure semantic diversity and reasoning complexity, we propose a semantics-aware diverse sampling strategy that selects targets and task instances based on structural coverage and dependency depth. We evaluate 10 mainstream LLMs and show that, while they perform well at identifying dependencies, models still struggle with tasks that require deeper semantic understanding and multi-step reasoning. We further conduct qualitative analyses to uncover key challenges, such as complex control structures and backward dependency patterns, offering insights into improving LLMs' code reasoning capabilities.
大型语言模型(LLMs)已在代码生成、程序修复和漏洞检测等不同的软件工程领域得到广泛采用。这些应用需要超越表层代码模式的理解:值传播、控制流以及程序元素之间的相互依赖。然而,现有基准主要评价端到端结果,例如代码是否被正确修复或生成,使模型的程序语义推理能力未得到充分探讨。这项工作提出了CoRe,一个高质量的、经过人工验证的基准,旨在评估LLMs在基本静态分析任务上的表现。CoRe包括12,553个任务实例,涵盖用C/C++、Java和Python编写的程序中的数据依赖、控制依赖和信息流。为了确保语义多样性和推理复杂性,我们提出了一种语义感知的多样化抽样策略,根据结构覆盖和依赖深度选择目标和任务实例。我们评估了10个主流LLMs,结果表明,虽然它们在识别依赖关系方面表现良好,但在需要更深入语义理解和多步推理的任务上仍有困难。我们进一步进行了定性分析,以揭示关键挑战,例如复杂的控制结构和后向依赖模式,为改进LLMs的代码推理能力提供了见解。
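To make the notion of a data dependency task concrete, the Python sketch below answers the question "does variable X depend on variable Y?" for a tiny straight-line program. It is an illustrative toy only. The functions `direct_data_deps` and `depends_on` and the sample program are hypothetical and unrelated to CoRe's actual task format, and the analysis ignores control flow, reassignment order and anything beyond simple top-level `name = expr` statements.

```python
import ast


def direct_data_deps(source: str) -> dict:
    """Map each assigned variable to the set of variables read on its right-hand side."""
    deps = {}
    for stmt in ast.parse(source).body:
        if isinstance(stmt, ast.Assign) and isinstance(stmt.targets[0], ast.Name):
            reads = {n.id for n in ast.walk(stmt.value) if isinstance(n, ast.Name)}
            deps[stmt.targets[0].id] = reads
    return deps


def depends_on(deps: dict, target: str, source_var: str) -> bool:
    """Check for a transitive data dependency from source_var to target."""
    seen, stack = set(), [target]
    while stack:
        current = stack.pop()
        for read in deps.get(current, ()):
            if read == source_var:
                return True
            if read not in seen:
                seen.add(read)
                stack.append(read)
    return False


PROGRAM = """
a = 1
b = a + 2
c = b * 3
d = 5
"""

if __name__ == "__main__":
    deps = direct_data_deps(PROGRAM)
    print(depends_on(deps, "c", "a"))  # c reads b, and b reads a
    print(depends_on(deps, "d", "a"))  # d is assigned a constant
```

The transitive case (`c` depends on `a` through `b`) is a small instance of the multi-step reasoning that the abstract reports models struggling with.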