cs.SE @ 2025-07-25: 169
-
00 07-24 (4) 3D Software Synthesis Guided by Constraint-Expressive Intermediate Representation 3D-Software-Synthese geführt durch eingeschränkt-expressive Zwischendarstellung 3D 由限制性中等代表制指导的软件合成 2507.18625v1 -
01 07-24 OpenCAMS: An Open-Source Connected and Automated Mobility Co-Simulation Platform for Advancing Next-Generation Intelligent Transportation Systems Research OpenCAMS: Eine Open-Source vernetzte und automatisierte Mobilitäts-Co-Simulationsplattform für die Weiterentwicklung der Forschung für intelligente Transportsysteme der nächsten Generation OpenCAMS: 推进下一轮智能运输系统研究的开放源码连接和自动化流动联合模拟平台 2507.09186v3 -
02 07-24 Are AI-Generated Fixes Secure? Analyzing LLM and Agent Patches on SWE-bench Sind KI-erzeugte Fixes sicher? LLM und Agent Patches auf der SWE-Bench analysieren AI - 具有安全性吗? 分析SWE-bench 上的LLM 和代理补丁 2507.02976v2 -
03 07-24 On the Structure and Semantics of Identifier Names Containing Closed Syntactic Category Words Über die Struktur und Semantik von Identifier-Namen, die geschlossene syntaktische Kategorie Wörter enthalten 关于含有闭合同步词类的标识名称的结构和语义 2505.18444v4 -
04 07-24 A Deep Dive into Retrieval-Augmented Generation for Code Completion: Experience on WeChat Ein tiefer Tauchgang in die retrieval-angereicherte Generation zur Code-Vervollständigung: Erfahrung auf WeChat 为完成代码的完成而深入挖掘回收的一代人:关于 WeChat 的经验 2507.18515v1 -
05 07-24 Automated Code Review Using Large Language Models with Symbolic Reasoning Automatisierte Code-Überprüfung mit großen Sprachmodellen mit symbolischer Begründung 使用有符号理由的大语言模型的自动码审查 2507.18476v1 -
06 07-24 Exploring and Evaluating Interplays of BPpy with Deep Reinforcement Learning and Formal Methods Erforschung und Auswertung von Interplays von BPpy mit Deep Reinforcement Learning und Formal Methods 探索和评价与深强化学习和正规方法的BPpy的相互作用 2501.15480v2 -
07 07-24 It is Giving Major Satisfaction: Why Fairness Matters for Software Practitioners Es gibt große Zufriedenheit: Warum Fairness für Software-Praktiker wichtig ist 它给予重大满意:为什么软件从业人员的公平问题? 2410.02482v5 -
08 07-24 FMI Meets SystemC: A Framework for Cross-Tool Virtual Prototyping FMI trifft SystemC: Ein Rahmen für das Cross-Tool Virtual Prototyping FMI 满足系统C:跨工具虚拟原型框架 2507.18339v1 -
09 07-24 LLMShot: Reducing snapshot testing maintenance via LLMs LLMShot: Reduzierung der Snapshot-Test-Wartung über LLMs LLMShot:通过LLMs减少快照测试维护 2507.10062v2 -
10 07-24 Gotta catch ‘em all! Towards File Localisation from Issues at Large Ich muss sie alle fangen! Auf dem Weg zur Dateilokalisierung von Themen im Großen und Ganzen 必须抓住他们所有人! 2507.18319v1 -
11 07-24 YATE: The Role of Test Repair in LLM-Based Unit Test Generation YATE: Die Rolle der Testreparatur bei der LLM-basierten Einheiten-Testgenerierung YATE:在以LLM为基础的单位试验生成中测试修理的作用 2507.18316v1 -
12 07-24 Scheduzz: Constraint-based Fuzz Driver Generation with Dual Scheduling Scheduzz: Constraint-basierte Fuzz-Driver-Generierung mit Dual Scheduling Scheduzz:基于约束的双重调度 Fuzz 驱动生成 2507.18289v1 -
13 07-24 An Empirical Study on Embodied Artificial Intelligence Robot (EAIR) Software Bugs Eine empirische Studie über körpereigene Software-Fehler im Bereich Künstliche Intelligenz von Robotern (EAIR) 关于人造人工智能机器人(EAIR)软件虫的经验研究 2507.18267v1 -
14 07-24 GenAI for Automotive Software Development: From Requirements to Wheels GenAI für die Entwicklung von Automotive-Software: Von Anforderungen bis zu Rädern GENAI 汽车软件开发GENAI:从要求到轮子 2507.18223v1 -
15 07-24 SMECS: A Software Metadata Extraction and Curation Software SMECS: Eine Software Metadata Extraktions- und Kurationssoftware SMECS:软件元数据抽取和计算软件 2507.18159v1 -
16 07-24 When Retriever Meets Generator: A Joint Model for Code Comment Generation Wenn Retriever trifft Generator: Ein gemeinsames Modell für Code Comment Generation 当再利用与生成器相遇时: 代码Comment生成联合模式 2507.12558v2 -
17 07-24 NoCode-bench: A Benchmark for Evaluating Natural Language-Driven Feature Addition NoCode-Bench: Ein Benchmark für die Bewertung der Erweiterung natürlicher sprachgetriebener Funktionen NoCode-bench:评价自然语言-驱动地物的基准 2507.18130v1 -
18 07-24 OrQstrator: An AI-Powered Framework for Advanced Quantum Circuit Optimization OrQstrator: Ein KI-Powered-Framework für erweiterte Quantenschaltungsoptimierung OrQstrator: AI授权的高级量子电路优化框架 2507.09682v2 -
19 07-24 Understanding the Supply Chain and Risks of Large Language Model Applications Verständnis der Supply Chain und Risiken von Großsprachenmodellanwendungen 了解供应链和大语言模式应用的风险 2507.18105v1 -
20 07-24 Identifier Name Similarities: An Exploratory Study Identifier Name Ähnlichkeiten: Eine Sondierungsstudie 说明性名称 相似点:探索性研究 2507.18081v1 -
21 07-24 An Empirical Study of Complexity, Heterogeneity, and Compliance of GitHub Actions Workflows Eine empirische Studie über Komplexität, Heterogenität und Compliance von GitHub-Actions-Workflows 关于 GitHub Actions 工作流的复杂性、异质性和合规性的经验研究 2507.18062v1 -
22 07-24 SAVANT: Vulnerability Detection in Application Dependencies through Semantic-Guided Reachability Analysis SAVANT: Sicherheitserkennung in Anwendungsabhängigkeiten durch Semantik-geführte Reichweitenanalyse SAVANT: 通过语义辅助控制可达性分析,在应用依赖性中发现脆弱性 2506.17798v2 -
23 07-24 Factors Impacting Faculty Adoption of Project-Based Learning in Computing Education: a Survey Faktoren, die die Fakultät beeinflussen Adoption des projektbasierten Lernens in der Computerausbildung: eine Umfrage 影响学院在计算机教育中采用基于项目学习:调查 2507.18039v1 -
24 07-24 Your ATs to Ts: MITRE ATT&CK Attack Technique to P-SSCRM Task Mapping Ihre ATs zu Ts: MITRE ATT&CK Angriffstechnik zu P-SSCRM Task Mapping 您的ATs to Ts: MITRE ATT&CK 攻击技术到 P-SSCRM任务映射 2507.18037v1 -
25 07-24 An Empirical Study of GenAI Adoption in Open-Source Game Development: Tools, Tasks, and Developer Challenges Eine empirische Studie zur GenAI-Adoption in der Open-Source-Spielentwicklung: Werkzeuge, Aufgaben und Entwickler-Herausforderungen GENAI采用开放源码游戏开发的经验研究:工具、任务和开发者的挑战 2507.18029v1 -
26 07-23 (3) Use as Directed? A Comparison of Software Tools Intended to Check Rigor and Transparency of Published Work Ein Vergleich von Software-Tools zur Überprüfung von Strenge und Transparenz der veröffentlichten Arbeit 用于核对所公布工作的定调和透明度的软件工具比较 2507.17991v1 -
27 07-23 muRelBench: MicroBenchmarks for Zonotope Domains muRelBench: MicroBenchmarks für Zonotope-Domains muRelBench:Zonotope 域的微型基准 2404.16243v2 -
28 07-23 How Software Engineers Engage with AI: A Pragmatic Process Model and Decision Framework Grounded in Industry Observations Wie sich Software-Ingenieure mit KI beschäftigen: Ein Pragmatisches Prozessmodell und Entscheidungsrahmen, der in Industriebeobachtungen begründet ist 软件工程师如何与AI接触:一个以工业观测为基础的实用过程模型和决定框架 2507.17930v1 -
29 07-23 Educational Insights from Code: Mapping Learning Challenges in Object-Oriented Programming through Code-Based Evidence Bildungsinsights from Code: Mapping Lernherausforderungen in objektorientierter Programmierung durch Code-basierte Evidenz 从《守则教育观点》中得出的教育观点:通过《守则证据》确定以目标为导向的方案拟订中的学习挑战 2507.17743v1 -
30 07-23 CASCADE: LLM-Powered JavaScript Deobfuscator at Google CASCADE: LLM-Powered JavaScript Deobfuscator bei Google CASCADE: 谷歌的LLM Powered JavaScript Deobfuscator 2507.17691v1 -
31 07-23 Contextual Code Retrieval for Commit Message Generation: A Preliminary Study Kontextcode-Retrieval für Commit Message Generation: Eine Vorstudie 提交信件生成时的上下文代码检索:初步研究 2507.17690v1 -
32 07-23 Making REST APIs Agent-Ready: From OpenAPI to Model Context Protocol Servers for Tool-Augmented LLMs REST APIs Agent-Ready erstellen: Von OpenAPI zu Model Context Protocol Server für Tool-Augmented LLMs 制作REST APIs API Agent- Ready:从开放API到示范背景协议服务器,用于工具推荐LMM 2507.16044v2 -
33 07-23 Rethinking HSM and TPM Security in the Cloud: Real-World Attacks and Next-Gen Defenses HSM- und TPM-Sicherheit in der Cloud neu denken: Angriffe auf die Realwelt und Next-Gen-Verteidigungen 重新思考云层中的HSM和TPM安全:真实世界攻击和下一代防卫 2507.17655v1 -
34 07-23 Closing the Chain: How to reduce your risk of being SolarWinds, Log4j, or XZ Utils Schließen der Kette: Wie reduzieren Sie Ihr Risiko, SolarWinds, Log4j oder XZ Utils zu sein 关闭链条: 如何降低您成为 SolarWinds、 Log4j 或 XZ Utils 的风险 2503.12192v2 -
35 07-23 CodeReasoner: Enhancing the Code Reasoning Ability with Reinforcement Learning CodeReasoner: Verbesserung der Code-Reasoning-Fähigkeit mit Verstärkungs-Lernen 代码搜索器:加强强化学习,加强《提高能力标准守则》 2507.17548v1 -
36 07-23 AssertFlip: Reproducing Bugs via Inversion of LLM-Generated Passing Tests AssertFlip: Fehler reproduzieren durch Inversion von LLM-Generated Passing Tests AssertFlip: 通过反转 LLM 生成的过路测试复制臭虫 2507.17542v1 -
37 07-23 Enabling Cyber Security Education through Digital Twins and Generative AI Cyber Security Education durch digitale Zwillinge und generative KI ermöglichen 通过 “ 数字双双 “ 和 “ 创世创新 “ ,促进网络安全教育 2507.17518v1 -
38 07-23 Efficient Neural Network Verification via Order Leading Exploration of Branch-and-Bound Trees Effiziente Neuralnetzverifizierung durch Order Leading Exploration von Zweig-und-Bound-Bäumen 通过分树和环形树的有序主要勘探进行高效神经网络核查 2507.17453v1 -
39 07-23 Explicit Vulnerability Generation with LLMs: An Investigation Beyond Adversarial Attacks Explizite Schwachstellengenerierung mit LLMs: Eine Untersuchung jenseits adversarialer Angriffe 利用LLM显式生成漏洞:对抗性攻击之外的调查 2507.10054v2 -
40 07-23 Investigating Training Data Detection in AI Coders Untersuchung der Erfassung von Schulungsdaten in KI-Codern AI 编码器中的调查培训数据检测 2507.17389v1 -
41 07-23 Roseau: Fast, Accurate, Source-based API Breaking Change Analysis in Java Roseau: Schnelle, genaue, quellbasierte API-Breaking Change Analyse in Java Roseau: Java快速、准确、基于源、基于源的API突破性变化分析 2507.17369v1 -
42 07-23 How Do Code Smells Affect Skill Growth in Scratch Novice Programmers? Wie wirkt sich Code bei Scratch Novice Programmierern auf das Qualifikationswachstum aus? 代码如何闻到技能增长对Scratch新程序设计师的影响? 2507.17314v1 -
43 07-23 Data Virtualization for Machine Learning Datenvirtualisierung für maschinelles Lernen 机器学习数据虚拟化 2507.17293v1 -
44 07-23 Seed&Steer: Guiding Large Language Models with Compilable Prefix and Branch Signals for Unit Test Generation Seed&Steer: Leitende große Sprachmodelle mit kompilierbaren Präfix- und Branchsignalen für die Unit Test Generation 种子 & Steer: 指导用于单位测试生成的可编译前缀和分支信号的大型语言模型 2507.17271v1 -
45 07-23 Lessons from a Big-Bang Integration: Challenges in Edge Computing and Machine Learning Lehren aus einer Big-Bang-Integration: Herausforderungen im Edge Computing und Machine Learning 大型银行一体化的经验教训:边际电子计算和机器学习方面的挑战 2507.17270v1 -
46 07-23 Understanding Prompt Programming Tasks and Questions Prompt Programmieraufgaben und Fragen verstehen 了解快速方案拟订任务和问题 2507.17264v1 -
47 07-23 On the Feasibility of Quantum Unit Testing Zur Machbarkeit der Quanteneinheitsprüfung 关于量子单位测试的可行性 2507.17235v1 -
48 07-23 Can LLMs Write CI? A Study on Automatic Generation of GitHub Actions Configurations Können LLMs CI schreiben? Eine Studie zur automatischen Generierung von GitHub-Actions-Konfigurationen LLM能编写CI吗? GitHub Actions 配置自动生成研究 2507.17165v1 -
49 07-23 Assessing Reliability of Statistical Maximum Coverage Estimators in Fuzzing Bewertung der Zuverlässigkeit statistischer Maximaldeckungs-Schätzer im Fuzzing 评估模糊中统计最高覆盖率估算器的可靠性 2507.17093v1 -
50 07-22 (2) Language model developers should report train-test overlap Entwickler von Sprachmodellen sollten Train-Test-Überlappungen melden 语言模型开发者应报告训练-测试重叠情况 2410.08385v2 -
51 07-22 Evaluating Uncertainty and Quality of Visual Language Action-enabled Robots Bewertung von Unsicherheit und Qualität von Visual Language Action-fähigen Robotern 评价视觉语言行动推动的机器人的不确定性和质量 2507.17049v1 -
52 07-22 An Efficient Algorithm for Generating Minimal Unique-Cause MC/DC Test cases for Singular Boolean Expressions Ein effizienter Algorithmus zur Generierung minimaler, einzigartiger MC/DC-Testfälle für singuläre Boolean-Ausdrücke 生成 Singulal Boolean 表达式的 MC/DC 测试案例的高效最小独致 MC/DC 测试比值 2507.14687v2 -
53 07-22 LLM as a code generator in Agile Model Driven Development LLM als Code-Generator in Agile Model Driven Development 作为Agile 模型驱动器开发的代码生成器的LLM 2410.18489v2 -
54 07-22 Revisiting Pre-trained Language Models for Vulnerability Detection Überprüfung vortrainierter Sprachmodelle für die Erkennung von Schwachstellen 重新审查关于脆弱性检测的预培训语言模式 2507.16887v1 -
55 07-22 Rethinking LLM-Based RTL Code Optimization Via Timing Logic Metamorphosis Rethinking LLM-basierte RTL-Code-Optimierung über Timing Logic Metamorphose 重新思考基于LLM的RTL规则 2507.16808v1 -
56 07-22 Towards Understanding the Challenges of Bug Localization in Deep Learning Systems Auf dem Weg zum Verständnis der Herausforderungen der Buglokalisierung in Deep Learning Systemen 了解深学习系统中错误定位化的挑战 2402.01021v2 -
57 07-22 Never Come Up Empty: Adaptive HyDE Retrieval for Improving LLM Developer Support Nie kommen leer: Adaptive HyDE Retrieval für die Verbesserung LLM-Entwickler-Unterstützung 永不空起来: 改进 LLM 开发者支持的适应性 HyDE 检索器 2507.16754v1 -
58 07-22 An advanced AI driven database system Ein fortschrittliches KI-gestütztes Datenbanksystem 先进的AI驱动数据库系统 2507.17778v1 -
59 07-22 LangBiTe: A Platform for Testing Bias in Large Language Models LangBiTe: Eine Plattform zum Testen von Bias in großen Sprachmodellen LangBiTe:大语言模型偏见测试平台 2404.18558v2 -
60 07-22 Toward Realistic Evaluations of Just-In-Time Vulnerability Prediction Hin zu realistischen Bewertungen von Just-in-Time Sicherheitsvorhersage A. 实现现实评估时空时脆弱性预测 2507.10729v2 -
61 07-22 VulGuard: An Unified Tool for Evaluating Just-In-Time Vulnerability Prediction Models VulGuard: Ein einheitliches Tool für die Bewertung von Modellen zur Vorhersage von Just-in-Time-Anfälligkeit VulGuard:评价即时(Just-In-Time)脆弱性预测模型的统一工具 2507.16685v1 -
62 07-22 VulCoCo: A Simple Yet Effective Method for Detecting Vulnerable Code Clones VulCoCo: Eine einfache, aber wirksame Methode zur Erkennung von verletzlichen Codeklone VulCoCo: 一种简单而有效的方法,用以检测脆弱守则克隆人 2507.16661v1 -
63 07-22 On the Effectiveness of LLM-as-a-judge for Code Generation and Summarization Zur Wirksamkeit von LLM-as-a-Richter für Codegenerierung und Zusammenfassung 关于作为法官的LLM在代码生成和概述方面的效力 2507.16587v1 -
64 07-22 AI for Better UX in Computer-Aided Engineering: Is Academia Catching Up with Industry Demands? A Multivocal Literature Review KI für bessere UX in der Computer-Aided Engineering: Ist Academia Aufholprozess mit Industrieanforderungen? Ein multivokaler Literaturbericht AI促进计算机辅助工程方面更好的 UX:学术界是否迎合工业需求?多语言文学评论 2507.16586v1 -
65 07-22 Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems Können LLMs zuverlässige Testfallgeneratoren generieren? Eine Studie zu Wettbewerbs-Level-Programmierungsproblemen LLM能生成可靠的测试用例生成器吗? 关于竞赛级编程问题的研究 2506.06821v3 -
66 07-22 Explainable Vulnerability Detection in C/C++ Using Edge-Aware Graph Attention Networks Erklärbare Sicherheitserkennung in C/C++ mit Edge-Aware Graph Attention Networks 使用边缘感知图注意力网络的C/C++可解释脆弱性探测 2507.16540v1 -
67 07-22 Improving Source Code Similarity Detection Through GraphCodeBERT and Integration of Additional Features Verbesserung der Quellcode-Ähnlichkeitserkennung durch GraphCodeBERT und Integration zusätzlicher Funktionen 通过 GraphCodeBERT 和整合附加特征改进源代码相似性探测 2408.08903v2 -
68 07-22 Software is infrastructure: failures, successes, costs, and the case for formal verification Software ist Infrastruktur: Ausfälle, Erfolge, Kosten und der Fall für die formale Überprüfung 软件是基础设施:失败、成功、成本和正式核查的理由 2506.13821v2 -
69 07-22 ACT: Bridging the Gap in Code Translation through Synthetic Data Generation & Adaptive Training ACT: Überbrückung der Lücke in der Code-Übersetzung durch Synthetische Datengenerierung & Adaptives Training ACT:通过合成数据生成和适应培训缩小代码翻译差距 2507.16478v1 -
70 07-22 Exploring Large Language Models for Analyzing and Improving Method Names in Scientific Code Erforschung großer Sprachmodelle zur Analyse und Verbesserung von Methodennamen im wissenschaftlichen Code 探索用于分析和改进科学法典中方法名称的大型语言模式 2507.16439v1 -
71 07-22 Improving Code LLM Robustness to Prompt Perturbations via Layer-Aware Model Editing Verbesserung der Code-LLM Robustheit bei Prompt-Störungen durch Layer-Aware-Modellbearbeitung 改进代码 LLM 的强度, 以便通过图层提醒模型编辑快速干扰 2507.16407v1 -
72 07-22 LLM-Driven Collaborative Model for Untangling Commits via Explicit and Implicit Dependency Reasoning LLM-getriebenes kollaboratives Modell für das Entwirren von Commits über explizite und implizite Abhängigkeitsveranlagung LLM驱动的协作模型:通过显式和隐式依赖推理解开纠缠的提交 2507.16395v1 -
73 07-22 Search-based Generation of Waypoints for Triggering Self-Adaptations in Maritime Autonomous Vessels Search-based Generierung von Wegpunkten für die Auslösung von Selbstanpassungen in Maritimen autonomen Schiffen 以搜索为基础的海上自主船舶触发自适应途径点的生成 2507.16327v1 -
74 07-22 Voice-based AI Agents: Filling the Economic Gaps in Digital Health Delivery Sprachbasierte KI-Agenten: Die wirtschaftlichen Lücken in der digitalen Gesundheitsversorgung füllen AI代理机构:填补数字保健提供方面的经济差距 2507.16229v1 -
75 07-22 LOCOFY Large Design Models – Design to code conversion solution LOCOFY Large Design Models – Design zu Code-Konvertierungslösung LOCOFY 大型设计模型 – – 设计编码转换解决办法 2507.16208v1 -
76 07-22 Ten Essential Guidelines for Building High-Quality Research Software Zehn wesentliche Leitlinien für den Aufbau hochwertiger Forschungssoftware 建立高质量研究软件的十项基本准则 2507.16166v1 -
77 07-21 (1) GitChameleon 2.0: Evaluating AI Code Generation Against Python Library Version Incompatibilities GitChameleon 2.0: Bewertung der KI-Codegenerierung gegen Python Library Version Inkompatibilitäten GitChameleon 2.0:评估AI 与 Python 图书馆版本的不兼容性 2507.12367v2 -
78 07-21 AI-Powered Commit Explorer (APCE) KI-Powered Commit Explorer (APCE) AI 授权委员会探索者(APCE) 2507.16063v1 -
79 07-21 RightTyper: Effective and Efficient Type Annotation for Python RightTyper: Effektive und effiziente Typ-Annotation für Python RightTyper: Python 有效、高效型号注解 2507.16051v1 -
80 07-21 A Pilot Study on LLM-Based Agentic Translation from Android to iOS: Pitfalls and Insights Eine Pilotstudie über LLM-basierte Agentische Übersetzung von Android nach iOS: Pitfalls and Insights 关于以LLM为基础的LLM从Android转为iOS的剂翻译的试点研究:水瀑布和透视 2507.16037v1 -
81 07-21 BandFuzz: An ML-powered Collaborative Fuzzing Framework BandFuzz: Ein ML-powered Collaborative Fuzzing Framework BandFuzz: ML 授权的协作模糊框架 2507.10845v2 -
82 07-21 BACFuzz: Exposing the Silence on Broken Access Control Vulnerabilities in Web Applications BACFuzz: Aufdecken des Schweigens auf gebrochene Zugriffskontrolle Schwachstellen in Web-Anwendungen BACFuzz:在网络应用中暴露对断断存控制障碍的沉默 2507.15984v1 -
83 07-21 Observing Fine-Grained Changes in Jupyter Notebooks During Development Time Beobachten feinkörniger Änderungen in Jupyter-Notebooks während der Entwicklungszeit 观察开发期间 Jupyter 笔记本中的细粒度变化 2507.15831v1 -
84 07-21 Investigating the Use of LLMs for Evidence Briefings Generation in Software Engineering Untersuchung der Verwendung von LLMs für Evidence Briefings Generation in der Software-Engineering 调查软件工程中利用LLMs制作证据简报 2507.15828v1 -
85 07-21 Do AI models help produce verified bug fixes? Helfen KI-Modelle dabei, verifizierte Fehlerbehebungen zu erstellen? 人工智能模型是否帮助生成经核实的错误修正 ? 2507.15822v1 -
86 07-21 BugScope: Learn to Find Bugs Like Human BugScope: Lernen Sie Fehler wie Menschen zu finden 错误库: 学习查找像人类一样的错误 2507.15671v1 -
87 07-21 Modeling CubeSat Storage Battery Discharge: Equivalent Circuit Versus Machine Learning Approaches Modellierung CubeSat Speicher Batterieentladung: Gleichwertige Schaltung Versus Machine Learning Ansätze 模型化CubeSat存储电池放电:等效电路甚高频机器学习方法 2507.15666v1 -
88 07-21 SustainDiffusion: Optimising the Social and Environmental Sustainability of Stable Diffusion Models SustainDiffusion: Optimierung der sozialen und ökologischen Nachhaltigkeit stabiler Diffusionsmodelle 可持续性传播:优化稳定传播模式的社会和环境可持续性 2507.15663v1 -
89 07-21 Hot Topics and Common Challenges: an Empirical Study of React Discussions on Stack Overflow Heiße Themen und gemeinsame Herausforderungen: eine empirische Studie über React-Diskussionen auf Stack Overflow 热题和共同挑战:关于 Stack Overflow 上 React 讨论的经验研究 2507.15624v1 -
90 07-21 Applying the Chinese Wall Reverse Engineering Technique to Large Language Model Code Editing Anwendung der Technik der chinesischen Wandumkehrtechnik auf die Bearbeitung von großen Sprachmodellen 将中国长墙反向工程技术应用到大语言模式编辑 2507.15599v1 -
91 07-21 A Study of LLMs’ Preferences for Libraries and Programming Languages Eine Studie der Präferenzen der LLM für Bibliotheken und Programmiersprachen 关于LLM对程序库和编程语言的偏好的研究 2503.17181v2 -
92 07-21 CGP-Tuning: Structure-Aware Soft Prompt Tuning for Code Vulnerability Detection CGP-Tuning: Structure-Aware Soft Prompt Tuning für Code Vulnerability Detection CGP-Tuning: 用于代码脆弱性检测的结构感知软提示调优 2501.04510v2 -
93 07-21 Understanding the Design Decisions of Retrieval-Augmented Generation Systems Verständnis der Konstruktionsentscheidungen von Systemen der retrieval-Augmentierten Generation 了解回收-加速发电系统的设计决定 2411.19463v2 -
94 07-21 StackTrans: From Large Language Model to Large Pushdown Automata Model StackTrans: Vom großen Sprachmodell zum großen Pushdown-Automatenmodell StackTrans: 从大语言模型到大推下自动模型 2507.15343v1 -
95 07-21 A Study of Malware Prevention in Linux Distributions Eine Studie über Malware-Prävention in Linux-Distributionen 关于Linux分发中防止恶意软件的研究 2411.11017v3 -
96 07-21 Butterfly Effects in Toolchains: A Comprehensive Analysis of Failed Parameter Filling in LLM Tool-Agent Systems Schmetterlingseffekte in Werkzeugketten: Eine umfassende Analyse der fehlgeschlagenen Parameterfüllung in LLM-Werkzeug-Agentensystemen 工具链中的蝴蝶效应:对LLM工具代理系统填充失败参数的综合分析 2507.15296v1 -
97 07-21 Input Reduction Enhanced LLM-based Program Repair Input-Reduzierung Verbesserte LLM-basierte Programm-Reparatur 增强基于LLM的LLM方案维修 2507.15251v1 -
98 07-21 ACFIX: Guiding LLMs with Mined Common RBAC Practices for Context-Aware Repair of Access Control Vulnerabilities in Smart Contracts ACFIX: Leitende LLMs mit geminten gängigen RBAC-Praktiken für die kontextbezogene Reparatur von Zugangskontrolllücken in Smart Contracts ACFIX: 指导LLMs公司使用RBAC在智能合同中使用内部软件修理存取控制易变性方面通用的雷管局做法 2403.06838v3 -
99 07-21 FaultLine: Automated Proof-of-Vulnerability Generation Using LLM Agents FaultLine: Automatisierte Generierung von Vulnerabilitätsnachweisen mit LLM-Agenten 失灵:使用LLM代理器自动验证生成 2507.15241v1 -
100 07-21 Code Clone Detection via an AlphaFold-Inspired Framework Code-Klone-Erkennung über ein AlphaFold-Inspired Framework 通过 AlphaFold 启发框架探测代码克隆 2507.15226v1 -
101 07-21 SimdBench: Benchmarking Large Language Models for SIMD-Intrinsic Code Generation SimdBench: Benchmarking großer Sprachmodelle für die SIMD-Intrinsische Codegenerierung SimdBench:为SIMD-Intrinsic 代码生成制定大语言模式基准 2507.15224v1 -
102 07-21 Towards Using Personas in Requirements Engineering: What Has Been Changed Recently? Zum Einsatz von Personen in der Requirements Engineering: Was hat sich in letzter Zeit verändert? 争取在要求工程中使用人:最近发生了什么变化? 2507.15197v1 -
103 07-21 Cultural Impact on Requirements Engineering Activities: Bangladeshi Practitioners’ View Kulturelle Auswirkungen auf die Anforderungen Engineering-Aktivitäten: Bangladesh-Praktiker-Ansicht 文化对要求工程活动的影响:孟加拉国从业者的观点 2507.15188v1 -
104 07-21 Deep Learning Framework Testing via Heuristic Guidance Based on Multiple Model Measurements Deep-Learning-Framework-Tests mittels Heuristischer Anleitung basierend auf mehreren Modellmessungen 利用基于多种模式计量的指数性指导进行深学习框架测试 2507.15181v1 -
105 07-20 (7) Can LLMs Generate User Stories and Assess Their Quality? Können LLMs User Stories generieren und ihre Qualität bewerten? LLMs能够产生用户故事并评估其质量吗? 2507.15157v1 -
106 07-20 Design of an Edge-based Portable EHR System for Anemia Screening in Remote Health Applications Design eines Edge-basierten tragbaren EHR-Systems für die Anämie-Screening in Remote Health-Anwendungen 设计一个以边缘为基础的远程保健应用中贫血筛查的便携EHR系统 2507.15146v1 -
107 07-20 A Semantic-based Optimization Approach for Repairing LLMs: Case Study on Code Generation Ein semantisch-basierter Optimierungsansatz zur Reparatur von LLMs: Fallstudie zur Codegenerierung 修复LLM 的基于语义的优化方法:关于代码生成的案例研究 2503.12899v3 -
108 07-20 ROSE: Transformer-Based Refactoring Recommendation for Architectural Smells ROSE: Transformerbasierte Refactoring-Empfehlung für architektonische Gerüche ROSE: 以变压器为基础的建筑气味重建建议 2507.12561v2 -
109 07-20 ModelVerification.jl: a Comprehensive Toolbox for Formally Verifying Deep Neural Networks ModelVerification.jl: eine umfassende Toolbox zur formalen Überprüfung tiefer neuraler Netzwerke 模型核查.jl:用于正式核查深神经网络的综合工具箱 2407.01639v2 -
110 07-20 LibLMFuzz: LLM-Augmented Fuzz Target Generation for Black-box Libraries LibLMFuzz: LLM-Augmented Fuzz Target Generation für Black-Box-Bibliotheken LibLMFuzz: 黑盒图书馆LLM- 推荐的模糊目标生成 2507.15058v1 -
111 07-20 Survey of GenAI for Automotive Software Development: From Requirements to Executable Code Umfrage bei GenAI für die Entwicklung von Automotive Software: Von Anforderungen zum ausführbaren Code GenAI汽车软件开发调查:从要求到可执行守则 2507.15025v1 -
112 07-20 Taint Analysis for Graph APIs Focusing on Broken Access Control Taint-Analyse für Graph-APIs mit Fokus auf Broken Access Control 以失效访问控制为重点的 Graph API 污点分析 2501.08947v2 -
113 07-20 The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering Der Aufstieg von KI-Teamkollegen in der Software-Engineering (SE) 3.0: Wie autonome Coding-Agenten Software-Engineering umgestalten AI软件工程(SE)3.0:自动编码代理人如何重组软件工程 2507.15003v1 -
114 07-20 Metaverse Security and Privacy Research: A Systematic Review Metaverse Security und Privacy Research: Eine systematische Überprüfung 超词安全和隐私研究:系统审查 2507.14985v1 -
115 07-20 Think Like an Engineer: A Neuro-Symbolic Collaboration Agent for Generative Software Requirements Elicitation and Self-Review Denken Sie wie ein Ingenieur: Ein neuro-symbolischer Collaboration Agent für generative Software-Anforderungen Elizitation und Selbst-Review 象工程师一样思考:一个创造软件要求求救和自我审查的神经-双曲协作代理 2507.14969v1 -
116 07-20 StaAgent: An Agentic Framework for Testing Static Analyzers StaAgent: Agentischer Rahmen für die Prüfung statischer Analyzer StaAgent: 静态分析器测试的剂框架 2507.15892v1 -
117 07-20 Learning Software Bug Reports: A Systematic Literature Review Lernsoftware Bug Reports: Ein systematischer Literaturbericht 学习软件错误报告:系统文献审查 2507.04422v2 -
118 07-20 Flexible Process Variant Binding in Information Systems with Software Product Line Engineering Flexible Prozessvariantbindung in Informationssystemen mit Software Product Line Engineering 具有软件产品线工程的信息系统装订 2410.17689v2 -
119 07-20 Towards Extracting Software Requirements from App Reviews using Seq2seq Framework Auf dem Weg zur Extraktion von Software-Anforderungen aus App-Bewertungen mit Seq2seq Framework 争取利用Seq2seq 框架从应用审查中提取软件要求 2507.09039v2 -
120 07-20 SAGE: A Context-Aware Approach for Mining Privacy Requirements Relevant Reviews from Mental Health Apps SAGE: A Context-Aware Approach for Mining Privacy Relevant Reviews from Mental Health Apps SAGE: “ 采矿隐私要求 “ 的背景意识方法,来自心理健康应用软件的相关审查 2507.09051v2 -
121 07-20 CMER: A Context-Aware Approach for Mining Ethical Concern-related App Reviews CMER: A Context-aware approach for Mining Ethical Concern-related App Reviews CMER: 采矿道德关切相关上诉审查的背景意识方法 2507.09049v2 -
122 07-20 Enhancing Repository-Level Code Generation with Call Chain-Aware Multi-View Context Erweiterung der Repository-Level-Code-Generierung mit Call Chain-Aware-Multi-View-Kontext 加强存储器级代码生成,具有呼叫链-软件多视图背景 2507.14791v1 -
123 07-20 Dr. Boot: Bootstrapping Program Synthesis Language Models to Perform Repairing Dr. Boot: Bootstrapping-Programm Synthese von Sprachmodellen zur Reparatur Boot博士:实施修复的强化方案综合语言模型 2507.15889v1 -
124 07-20 MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation MultiKernelBench: Ein Multi-Platform Benchmark für die Kernel-Generation MultiKernelBench: 核心生成的多平台基准 2507.17773v1 -
125 07-20 VeriOpt: PPA-Aware High-Quality Verilog Generation via Multi-Role LLMs VeriOpt: PPA-Aware Hochqualitative Verilog Generation über Multi-Rolle LLMs VeriOpt: 通过多角色LLMs 生成PPA感知的高质量Verilog 2507.14776v1 -
126 07-19 (6) Toward Inclusive AI-Driven Development: Exploring Gender Differences in Code Generation Tool Interactions Auf dem Weg zu integrativer KI-getriebener Entwicklung: Erforschung geschlechtsspezifischer Unterschiede bei Interaktionen mit Codegenerierungstools 走向包容性的AI-Driven 发展:探索代码生成工具互动中的性别差异 2507.14770v1 -
127 07-19 Investigating the Role of LLMs Hyperparameter Tuning and Prompt Engineering to Support Domain Modeling Untersuchung der Rolle von LLMs Hyperparameter Tuning und Prompt Engineering zur Unterstützung von Domain Modeling 调查超参数图图和快速工程LLMs 的作用以支持域建模 2507.14735v1 -
128 07-19 Foundational Competencies and Responsibilities of a Research Software Engineer: Current State and Suggestions for Future Directions Grundlagenkompetenzen und Verantwortlichkeiten eines Forschungssoftware-Ingenieurs: Aktueller Stand und Vorschläge für zukünftige Richtungen 研究软件工程师的基本能力和责任:现状和对未来方向的建议 2311.11457v4 -
129 07-19 HistoryFinder: Advancing Method-Level Source Code History Generation with Accurate Oracles and Enhanced Algorithm HistoryFinder: Advancing Method-Level Source Code History Generation mit präzisen Oracles und erweitertem Algorithmus 历史:推进方法层面的源代码,具有准确的甲骨文和强化算法的史代历史 2507.14716v1 -
130 07-19 LLM-Based Detection of Tangled Code Changes for Higher-Quality Method-Level Bug Datasets LLM-basierte Erkennung von Tangled Code-Änderungen für höherwertige Methoden-Level-Fehlerdatensätze 以LLM为基础,检测上质量方法级臭虫数据集排列的编码变化 2505.08263v2 -
131 07-19 Efficient Story Point Estimation With Comparative Learning Effiziente Story Point-Schätzung mit vergleichendem Lernen 与比较学习相比的高效小点估计 2507.14642v1 -
132 07-19 A first look at License Variants in the PyPI Ecosystem Ein erster Blick auf Lizenzvarianten im PyPI Ecosystem 第一次审查PyPI生态系统的许可证变式 2507.14594v1 -
133 07-19 AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs? AlgoTune: Können Sprachmodelle allgemeine numerische Programme beschleunigen? AlgoTune: 语言模型能加速通用计算程序吗? 2507.15887v1 -
134 07-19 Harnessing LLMs for Document-Guided Fuzzing of OpenCV Library LLMs für dokumentengeführtes Fuzzing der OpenCV-Bibliothek nutzen 利用LLMs对 OpenCV 库进行文档引导的模糊测试 2507.14558v1 -
135 07-19 Emerging Trends in Software Architecture from the Practitioners Perspective: A Five Year Review Aufkommende Trends in der Softwarearchitektur aus der Perspektive der Praktizierenden: Ein Fünf-Jahres-Bericht 从从从业人员角度看软件架构的新趋势:五年审查 2507.14554v1 -
136 07-19 Architectural Degradation: Definition, Motivations, Measurement and Remediation Approaches architektonische Degradation: Definition, Motivationen, Mess- und Sanierungsansätze 建筑退化:定义、动力、计量和补救方法 2507.14547v1 -
137 07-19 QLPro: Automated Code Vulnerability Discovery via LLM and Static Code Analysis Integration QLPro: Automatisierte Code Vulnerability Discovery über LLM und Static Code Analysis Integration QLPro:通过LLM和静态代码分析整合发现自动编码易脆弱性 2506.23644v3 -
138 07-19 On the Effect of Token Merging on Pre-trained Models for Code Über die Wirkung von Token Merging auf vortrainierte Modelle für Code 托肯合并对《守则》培训前模式的影响 2507.14423v1 -
139 07-18 (5) Enhancing LLM Code Generation with Ensembles: A Similarity-Based Selection Approach Verbesserung der LLM-Code-Generierung mit Ensembles: Ein auf Ähnlichkeit basierender Auswahlansatz 增强具有各种组合的LLM 代码生成:以相似性为基础的选择方法 2503.15838v2 -
140 07-18 Developing Shared Vocabulary System For Collaborative Software Engineering Entwicklung eines gemeinsamen Vokabelsystems für die gemeinsame Software-Engineering 开发合作软件工程共用词汇系统 2507.14396v1 -
141 07-18 Combinatorial Optimization for All: Using LLMs to Aid Non-Experts in Improving Optimization Algorithms Kombinatorische Optimierung für alle: Verwendung von LLMs zur Unterstützung von Nicht-Experten bei der Verbesserung von Optimierungsalgorithmen 组合优化全民:利用LLMs帮助非专家改进最佳化算法 2503.10968v2 -
142 07-18 Remote Assistance or Remote Driving: The Impact of Operational Design Domains on ADS-Supporting Systems Selection Remote Assistance oder Remote Driving: Die Auswirkungen von Operational Design Domains auf die Auswahl von ADS-unterstützten Systemen 远程援助或远程驾驶:业务设计域域对ADS支助系统选择的影响 2507.14347v1 -
143 07-18 Leveraging LLMs for Formal Software Requirements – Challenges and Prospects Leveraging LLMs für formale Softwareanforderungen – Herausforderungen und Perspektiven 利用LLMs 处理正式软件要求 – – 挑战和前景 2507.14330v1 -
144 07-18 Impact of Code Context and Prompting Strategies on Automated Unit Test Generation with Modern General-Purpose Large Language Models Auswirkungen von Code-Kontexten und Prompting-Strategien auf die automatisierte Unit-Testgenerierung mit modernen, allgemein angelegten großen Sprachmodellen 守则背景和提示战略对采用现代通用大语言通用模式的 自动单位测试生成的影响 2507.14256v1 -
145 07-18 Code Readability in the Age of Large Language Models: An Industrial Case Study from Atlassian Code Lesbarkeit im Zeitalter großer Sprachmodelle: Eine industrielle Fallstudie von Atlassian 《大语言模式时代的可读性:阿特拉斯斯语工业案例研究》 2501.11264v3 -
146 07-18 Testing Autonomous Driving Systems – What Really Matters and What Doesn’t Autonome Fahrsysteme testen – Was wirklich zählt und was nicht 自动自动驾驶测试系统 – – 真正重要和不重要的东西 2507.13661v1 -
147 07-18 LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision and the Road Ahead LLM-basierte Multi-Agenten-Systeme für die Software-Engineering: Literature Review, Vision and the Road Ahead 以LLM为基础的软件工程多机构系统:文献审查、展望和路前 2404.04834v4 -
148 07-18 ParaStudent: Generating and Evaluating Realistic Student Code by Teaching LLMs to Struggle ParaStudent: Erzeugen und Evaluieren des Realistischen Studentenkodex durch Lehre von LLMs zum Kampf 副专业学生:通过教授LLMs进行斗争,产生和评价现实学生守则 2507.12674v2 -
149 07-17 (4) An Approach for Auto Generation of Labeling Functions for Software Engineering Chatbots Ein Ansatz zur automatischen Generierung von Beschriftungsfunktionen für Software Engineering Chatbots 软件工程聊天器自动生成标签功能的方法 2410.07094v2 -
150 07-17 Demystifying Feature Requests: Leveraging LLMs to Refine Feature Requests in Open-Source Software Feature-Anfragen entmystifizieren: LLMs zur Verfeinerung von Feature-Anfragen in Open-Source-Software nutzen 解密功能请求: 利用LMML 来在开放源码软件中使用 Refine 功能请求 2507.13555v1 -
151 07-17 Towards Better Requirements from the Crowd: Developer Engagement with Feature Requests in Open Source Software Auf dem Weg zu besseren Anforderungen aus der Crowd: Entwickler Engagement mit Feature-Anfragen in Open Source Software 实现来自人群的更好要求:开发者在开放源码软件中参与满足地物要求 2507.13553v1 -
152 07-17 AI-Assisted Fixes to Code Review Comments at Scale AI-Assisted Fixes to Code Review Kommentare auf Scale AI 协助制定标准标准代码审查评论 2507.13499v1 -
153 07-17 Socio-Technical Smell Dynamics in Code Samples: A Multivocal Review on Emergence, Evolution, and Co-Occurrence Socio-Technical Smell Dynamics in Code Samples: Multivocal Review über Emergence, Evolution und Co-Occurrence 代码样本中社会-技术闻闻动态:关于新出现、演变和共发的多动审查 2507.13481v1 -
154 07-17 SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks SWE-MERA: Ein dynamischer Benchmark für die Agentik-Bewertung großer Sprachmodelle in Software-Engineering-Aufgaben SWE-MERA: 积极评价软件工程任务大语言模型的动态基准 2507.11059v2 -
155 07-17 Detecting LLM-generated Code with Subtle Modification by Adversarial Training LLM-generierter Code mit subtiler Änderung durch Adversarial Training erkennen 检测通过反向培训进行精细修改的LLM生成代码 2507.13123v1 -
156 07-17 Inferring Attributed Grammars from Parser Implementations Zugeschriebene Grammatiken aus Parser-Implementierungen ableiten 从剖析器执行中推断出属性语法 2507.13117v1 -
157 07-17 A Conceptual Framework for Requirements Engineering of Pretrained-Model-Enabled Systems Ein konzeptioneller Rahmen für die Anforderungsentwicklung von vortrainierten modellgebundenen Systemen 预先培训的、采用模式的系统工程要求概念框架 2507.13095v1 -
158 07-17 MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks MERA-Code: Ein einheitliches Framework zur Bewertung der Codegenerierung von Aufgaben MERA 守则:一个统一框架,用于评估不同任务制定守则的情况 2507.12284v2 -
159 07-17 iReDev: A Knowledge-Driven Multi-Agent Framework for Intelligent Requirements Development iReDev: Ein wissensgestütztes Multi-Agent-Rahmenwerk für intelligente Anforderungsentwicklung iReDev:开发智能要求的知识开发多机构框架 2507.13081v1 -
160 07-17 Write Your Own CodeChecker: An Automated Test-Driven Checker Development Approach with LLMs Schreiben Sie Ihren eigenen CodeChecker: Ein automatisierter Test-Driven Checker-Entwicklungsansatz mit LLMs 使用 LLMS 写入您的自定义代码检查器: 自动测试驱动检查开发方法 2411.06796v3 -
161 07-17 Investigating the Performance of Small Language Models in Detecting Test Smells in Manual Test Cases Untersuchung der Leistungsfähigkeit kleiner Sprachmodelle bei der Erkennung von Testriechen in manuellen Testfällen 调查小语言模型在人工试验案件中检测测试嗅觉方面的性能 2507.13035v1 -
162 07-17 Risks of ignoring uncertainty propagation in AI-augmented security pipelines Risiken der Ignorierung der Unsicherheitsausbreitung in KI-gesteigerten Sicherheitspipelines 忽视在AI强化安全管道中传播不确定性的风险 2407.14540v2 -
163 07-17 ReCode: Updating Code API Knowledge with Reinforcement Learning ReCode: Aktualisierung von Code-API-Kenntnissen mit Verstärkungslernen ReCode:更新法规API知识与强化学习 2506.20495v2 -
164 07-17 The Case for Contextual Copyleft: Licensing Open Source Training Data and Generative AI Der Fall für Contextual Copyleft: Lizenzierung von Open Source Trainingsdaten und Generative KI 上下文翻转:为开放源码培训数据发放许可证的案例 2507.12713v1 -
165 07-17 CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance CodeAssistBench (CAB): Datensatz & Benchmarking für Multiturn-Chat-basierte Code-Unterstützung 代码协助站(CAB):多功能聊天代码援助的数据集和基准 2507.10646v2 -
166 07-17 GUI Test Migration via Abstraction and Concretization GUI-Test-Migration über Abstraktion und Konkretisierung GUI 通过抽象和简明化测试移民 2409.05028v2 -
167 07-17 AI Safety in the Eyes of the Downstream Developer: A First Look at Concerns, Practices, and Challenges KI-Sicherheit in den Augen des Downstream-Entwicklers: Ein erster Blick auf Bedenken, Praktiken und Herausforderungen AI 下游开发者眼中的安全:首先审视关注、做法和挑战 2503.19444v3 -
168 07-17 When Domains Collide: An Activity Theory Exploration of Cross-Disciplinary Collaboration When Domains Collide: Eine Aktivitätstheorie zur Erforschung der disziplinübergreifenden Zusammenarbeit 当域碰撞:跨纪律协作活动理论探索时 2506.20063v2
Article 0
Title@2025-07-24 (4): 3D Software Synthesis Guided by Constraint-Expressive Intermediate Representation
Title: 3D Software Synthesis Guided by Constraint-Expressive Intermediate Representation | 3D-Software-Synthese geführt durch eingeschränkt-expressive Zwischendarstellung | 3D 由限制性中等代表制指导的软件合成 2507.18625v1 |
Authors (5): Shuqing Li, Anson Y. Lam, Yun Peng, Wenxuan Wang, Michael R. Lyu
Graphical user interface (UI) software has undergone a fundamental transformation from traditional two-dimensional (2D) desktop/web/mobile interfaces to spatial three-dimensional (3D) environments. While existing work has achieved remarkable success in automated 2D software generation, such as HTML/CSS and mobile app interface code synthesis, the generation of 3D software remains under-explored. Current methods for 3D software generation usually generate the 3D environment as a whole and cannot modify or control specific elements in the software. Furthermore, these methods struggle to handle the complex spatial and semantic constraints inherent in the real world. To address these challenges, we present Scenethesis, a novel requirement-sensitive 3D software synthesis approach that maintains formal traceability between user specifications and generated 3D software. Scenethesis is built upon ScenethesisLang, a domain-specific language that serves as a granular constraint-aware intermediate representation (IR) to bridge natural language requirements and executable 3D software. It serves both as a comprehensive scene description language enabling fine-grained modification of 3D software elements and as a formal constraint-expressive specification language capable of expressing complex spatial constraints. By decomposing 3D software synthesis into stages operating on ScenethesisLang, Scenethesis enables independent verification, targeted modification, and systematic constraint satisfaction. Our evaluation demonstrates that Scenethesis accurately captures over 80% of user requirements and satisfies more than 90% of hard constraints while handling over 100 constraints simultaneously. Furthermore, Scenethesis achieves a 42.8% improvement in BLIP-2 visual evaluation scores compared to the state-of-the-art method.
图形用户界面(UI)软件已经经历了从传统的二维(2D)桌面/网页/移动界面到空间三维(3D)环境的根本性转变。虽然现有工作在自动生成2D软件(例如HTML/CSS和移动应用界面代码合成)方面取得了显著成功,但3D软件的生成仍未得到充分探索。目前的3D软件生成方法通常将3D环境作为一个整体生成,无法修改或控制软件中的特定元素。此外,这些方法难以处理现实世界中固有的复杂空间和语义约束。为应对这些挑战,我们提出了Scenethesis,一种新的需求敏感型3D软件合成方法,它在用户规格与生成的3D软件之间保持形式化的可追溯性。Scenethesis建立在ScenethesisLang之上,这是一种领域特定语言,作为细粒度的约束感知中间表示(IR),连接自然语言需求与可执行的3D软件。它既是一种全面的场景描述语言,支持对3D软件元素进行细粒度修改,也是一种形式化的约束表达规格语言,能够表达复杂的空间约束。通过将3D软件合成分解为在ScenethesisLang上运行的多个阶段,Scenethesis实现了独立验证、定向修改和系统化的约束满足。我们的评估表明,Scenethesis能准确捕获80%以上的用户需求,并在同时处理100多个约束的情况下满足90%以上的硬约束。此外,与最先进的方法相比,Scenethesis在BLIP-2视觉评估得分上提高了42.8%。
Article 1
Title@2025-07-24 (4): OpenCAMS: An Open-Source Connected and Automated Mobility Co-Simulation Platform for Advancing Next-Generation Intelligent Transportation Systems Research
Title: OpenCAMS: An Open-Source Connected and Automated Mobility Co-Simulation Platform for Advancing Next-Generation Intelligent Transportation Systems Research | OpenCAMS: Eine Open-Source vernetzte und automatisierte Mobilitäts-Co-Simulationsplattform für die Weiterentwicklung der Forschung für intelligente Transportsysteme der nächsten Generation | OpenCAMS: 推进下一轮智能运输系统研究的开放源码连接和自动化流动联合模拟平台 2507.09186v3 |
Authors (4): Minhaj Uddin Ahmad, Akid Abrar, Sagar Dasgupta, Mizanur Rahman
We introduce OpenCAMS (Open-Source Connected and Automated Mobility Co-Simulation Platform), an open-source, synchronized, and extensible co-simulation framework that tightly couples three best-in-class simulation tools: (i) SUMO, (ii) CARLA, and (iii) OMNeT++. OpenCAMS is designed to support advanced research in transportation safety, mobility, and cybersecurity by combining the strengths of each simulation domain. Specifically, SUMO provides large-scale, microscopic traffic modeling; CARLA offers high-fidelity 3D perception, vehicle dynamics, and control simulation; and OMNeT++ enables modular, event-driven network communication, such as cellular vehicle-to-everything (C-V2X). OpenCAMS employs a time-synchronized, bidirectional coupling architecture that ensures coherent simulation progression across traffic, perception, and communication domains while preserving modularity and reproducibility. For example, CARLA can simulate and render a subset of vehicles that require detailed sensor emulation and control logic; SUMO orchestrates network-wide traffic flow, vehicle routing, and traffic signal management; and OMNeT++ dynamically maps communication nodes to both mobile entities (e.g., vehicles) and static entities (e.g., roadside units) to enable C-V2X communication. While these three simulators form the foundational core of OpenCAMS, the platform is designed to be expandable and future-proof, allowing additional simulators to be integrated on top of this core without requiring fundamental changes to the system architecture. The OpenCAMS platform is fully open-source and publicly available through its GitHub repository https://github.com/minhaj6/carla-sumo-omnetpp-cosim, providing the research community with an accessible, flexible, and collaborative environment for advancing next-generation intelligent transportation systems.
我们介绍OpenCAMS(开源网联与自动化出行联合仿真平台),这是一个开源、同步且可扩展的联合仿真框架,紧密耦合了三种一流的仿真工具:(i) SUMO、(ii) CARLA 和 (iii) OMNeT++。OpenCAMS旨在结合各仿真领域的优势,支持交通安全、出行和网络安全方面的前沿研究。具体而言,SUMO提供大规模的微观交通建模;CARLA提供高保真的3D感知、车辆动力学和控制仿真;OMNeT++支持模块化、事件驱动的网络通信,例如蜂窝车联网(C-V2X)。OpenCAMS采用时间同步的双向耦合架构,确保交通、感知和通信各领域的仿真进程保持一致,同时保持模块化和可复现性。例如,CARLA可以仿真并渲染需要精细传感器模拟和控制逻辑的部分车辆;SUMO负责全路网的交通流、车辆路径规划和交通信号管理;OMNeT++则将通信节点动态映射到移动实体(如车辆)和静态实体(如路侧单元),以实现C-V2X通信。虽然这三个仿真器构成了OpenCAMS的基础核心,但该平台在设计上可扩展且面向未来,允许在该核心之上集成更多仿真器,而无需对系统架构作根本性改动。OpenCAMS平台完全开源,并通过其GitHub仓库 https://github.com/minhaj6/carla-sumo-omnetpp-cosim 公开提供,为研究界推进下一代智能交通系统提供了一个易于使用、灵活且协作的环境。
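The time-synchronized, bidirectional coupling described in the OpenCAMS abstract can be pictured as a lockstep loop. The sketch below is only an illustration under stated assumptions: `StubSimulator`, `co_simulate`, the step sizes, and the message format are all hypothetical, and none of it uses the real SUMO, CARLA, or OMNeT++ interfaces or reflects how OpenCAMS is actually implemented.

```python
class StubSimulator:
    """Hypothetical stand-in for one simulator (not a real SUMO/CARLA/OMNeT++ API)."""

    def __init__(self, name, step_ms):
        self.name = name
        self.step_ms = step_ms  # native step size, in integer milliseconds
        self.now_ms = 0         # local simulation clock
        self.inbox = []         # peer states received at the last sync point

    def advance_to(self, target_ms):
        # Run native steps until the local clock reaches the sync point.
        while self.now_ms < target_ms:
            self.now_ms += self.step_ms

    def export_state(self):
        return {"source": self.name, "time_ms": self.now_ms}


def co_simulate(sims, sync_ms, end_ms):
    """Lockstep loop: advance every simulator to a barrier, then exchange state."""
    # Assumption: the sync interval is a multiple of every native step size,
    # so no simulator overshoots the barrier.
    if any(sync_ms % sim.step_ms for sim in sims):
        raise ValueError("sync_ms must be a multiple of every step_ms")
    for target in range(sync_ms, end_ms + 1, sync_ms):
        for sim in sims:
            sim.advance_to(target)
        states = [sim.export_state() for sim in sims]
        # Bidirectional exchange: each simulator sees every peer's state.
        for sim in sims:
            sim.inbox = [s for s in states if s["source"] != sim.name]
    return sims


sims = co_simulate(
    [StubSimulator("traffic", 100), StubSimulator("driving", 50), StubSimulator("network", 10)],
    sync_ms=100,
    end_ms=1000,
)
print([(s.name, s.now_ms) for s in sims])
```

Using integer milliseconds keeps the clocks exactly comparable; with floating-point seconds the barrier test would need a tolerance.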
Article 2
Title@2025-07-24 (4): Are AI-Generated Fixes Secure? Analyzing LLM and Agent Patches on SWE-bench
Title: Are AI-Generated Fixes Secure? Analyzing LLM and Agent Patches on SWE-bench | Sind KI-erzeugte Fixes sicher? LLM und Agent Patches auf der SWE-Bench analysieren | AI - 具有安全性吗? 分析SWE-bench 上的LLM 和代理补丁 2507.02976v2 |
Authors (3): Amirali Sajadi, Kostadin Damevski, Preetha Chatterjee
Large Language Models (LLMs) and their agentic frameworks are increasingly adopted to automate software development tasks such as issue resolution and program repair. While prior work has identified security risks in LLM-generated code, most evaluations have focused on synthetic or isolated settings, leaving open questions about the security of these systems in real-world development contexts. In this study, we present the first large-scale security analysis of LLM-generated patches using 20,000+ issues from the SWE-bench dataset. We evaluate patches produced by a standalone LLM (Llama 3.3) and compare them to developer-written patches. We also assess the security of patches generated by three top-performing agentic frameworks (OpenHands, AutoCodeRover, HoneyComb) on a subset of our data. Finally, we analyze a wide range of code, issue, and project-level factors to understand the conditions under which LLMs and agents are most likely to generate insecure code. Our findings reveal that the standalone LLM introduces nearly 9x more new vulnerabilities than developers, with many of these exhibiting unique patterns not found in developers’ code. Agentic workflows also generate a significant number of vulnerabilities, particularly when granting LLMs more autonomy, potentially increasing the likelihood of misinterpreting project context or task requirements. We find that vulnerabilities are more likely to occur in LLM patches associated with a higher number of files, more lines of generated code, and GitHub issues that lack specific code snippets or information about the expected code behavior and steps to reproduce. These results suggest that contextual factors play a critical role in the security of the generated code and point toward the need for proactive risk assessment methods that account for both code and issue-level information to complement existing vulnerability detection tools.
大型语言模型(LLMS)及其代理框架日益被采用,以自动化软件开发任务,如问题解析和程序修补等。虽然先前的工作已经查明LLM生成代码的安全风险,但大多数评价侧重于合成或孤立的设置,留下了关于这些系统在现实世界开发背景下的安全的开放问题。在这项研究中,我们利用SWE-bench数据集的20,000个+问题对LLMS生成的补丁进行了首次大规模安全分析。我们评估了独立LM(Llama3.3)生成的补丁,并将其与开发者制作的补丁进行比较。我们还评估了三个顶级代理框架(OpenHands、AutoCodeRover、HoneComb)生成的安全风险风险风险。最后,我们分析了一系列广泛的代码、问题和项目级因素,以了解LWMS和代理商最有可能生成不安全代码的条件。我们发现,独立LMRM比开发者引入了近9x新的脆弱性,其中很多这些显示在开发者代码中无法找到的独特模式, 也显示在高级代码中可能增加磁带风险的路径。
Article 3
Title@2025-07-24 (4): On the Structure and Semantics of Identifier Names Containing Closed Syntactic Category Words
Title: On the Structure and Semantics of Identifier Names Containing Closed Syntactic Category Words | Über die Struktur und Semantik von Identifier-Namen, die geschlossene syntaktische Kategorie Wörter enthalten | 关于含有闭合同步词类的标识名称的结构和语义 2505.18444v4 |
Authors (11): Christian D. Newman, Anthony Peruma, Eman Abdullah AlOmar, Mahie Crabbe, Syreen Banabilah, Reem S. AlSuhaibani, Michael J. Decker, Farhad Akhbardeh, Marcos Zampieri, Mohamed Wiem Mkaouer, Jonathan I. Maletic
Identifier names are crucial components of code, serving as primary clues for developers to understand program behavior. This paper investigates the linguistic structure of identifier names by extending the concept of grammar patterns, which represent the part-of-speech (PoS) sequences underlying identifier phrases. The specific focus is on closed syntactic categories (e.g., prepositions, conjunctions, determiners), which are rarely studied in software engineering despite their central role in general natural language. To study these categories, the Closed Category Identifier Dataset (CCID), a new manually annotated dataset of 1,275 identifiers drawn from 30 open-source systems, is constructed and presented. The relationship between closed-category grammar patterns and program behavior is then analyzed using grounded-theory-inspired coding, statistical, and pattern analysis. The results reveal recurring structures that developers use to express concepts such as control flow, data transformation, temporal reasoning, and other behavioral roles through naming. This work contributes an empirical foundation for understanding how linguistic resources encode behavior in identifier names and supports new directions for research in naming, program comprehension, and education.
标识名称是代码的关键组成部分, 是开发者理解程序行为的主要线索 。 本文通过扩展语法模式的概念来调查标识名称的语言结构 。 语法模式代表了语法序列部分( POS) 基本识别短语。 具体重点是封闭的合成类别( 如预设、 连线、 确定者 ) , 尽管这些类别在一般自然语言中具有核心作用, 但这些类别很少在软件工程中研究 。 要研究这些类别, 封闭类识别数据集( CICID) 是一个新的人工手动数据集, 由来自30个开放源系统的1,275个标识组成。 封闭类语法模式与程序行为之间的关系随后通过基于理论的编码、 统计和模式分析加以分析 。 结果揭示了开发者用来表达控制流、 数据转换、 时间推理和其他行为作用等概念的经常性结构 。 这项工作为理解识别名称的语言资源如何编码行为和支持命名、 方案理解和教育研究的新方向提供了经验基础 。
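To make the notion of a grammar pattern concrete, here is a minimal sketch that splits an identifier into words and maps each word to a tag. The tiny lexicon, the tag names (`P`, `CJ`, `DT`), and the fallback tag `OPEN` for every open-category word are illustrative assumptions; they are not the CCID annotation scheme, and a real study would use a proper part-of-speech tagger and manual annotation.

```python
import re

# Hypothetical mini-lexicon of closed-category words (illustrative only).
CLOSED_CATEGORY = {
    "to": "P", "from": "P", "in": "P", "on": "P", "with": "P", "by": "P",
    "and": "CJ", "or": "CJ", "if": "CJ",
    "the": "DT", "a": "DT", "all": "DT", "each": "DT",
}


def split_identifier(name):
    """Split snake_case and camelCase identifiers into lowercase words."""
    words = []
    for part in name.split("_"):
        # Acronym runs (XML), capitalised or lowercase words, then digit runs.
        words += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
    return [w.lower() for w in words]


def grammar_pattern(name):
    """Tag closed-category words; everything else is lumped into OPEN."""
    return " ".join(CLOSED_CATEGORY.get(w, "OPEN") for w in split_identifier(name))


for ident in ["convertToString", "read_from_file", "items_and_users"]:
    print(ident, "->", grammar_pattern(ident))
```

Identifiers that share a pattern such as `OPEN P OPEN` can then be grouped and inspected for the behavior they tend to name, for example data transformation in `convertToString`.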
Article 4
Title@2025-07-24 (4): A Deep Dive into Retrieval-Augmented Generation for Code Completion: Experience on WeChat
Title: A Deep Dive into Retrieval-Augmented Generation for Code Completion: Experience on WeChat | Ein tiefer Tauchgang in die retrieval-angereicherte Generation zur Code-Vervollständigung: Erfahrung auf WeChat | 为完成代码的完成而深入挖掘回收的一代人:关于 WeChat 的经验 2507.18515v1 |
Authors (6): Zezhou Yang, Ting Peng, Cuiyun Gao, Chaozheng Wang, Hailiang Huang, Yuetang Deng
Code completion, a crucial task in software engineering that enhances developer productivity, has seen substantial improvements with the rapid advancement of large language models (LLMs). In recent years, retrieval-augmented generation (RAG) has emerged as a promising method to enhance the code completion capabilities of LLMs: it leverages relevant context from codebases without requiring model retraining. While existing studies have demonstrated the effectiveness of RAG on public repositories and benchmarks, the potential distribution shift between open-source and closed-source codebases presents unique challenges that remain unexplored. To address this gap, we conduct an empirical study to investigate the performance of widely used RAG methods for code completion in the industrial-scale codebase of WeChat, one of the largest proprietary software systems. Specifically, we extensively explore two main types of RAG methods, namely identifier-based RAG and similarity-based RAG, across 26 open-source LLMs ranging from 0.5B to 671B parameters. For a more comprehensive analysis, we employ different retrieval techniques for similarity-based RAG, including lexical and semantic retrieval. Based on 1,669 internal repositories, we report several key findings: (1) both RAG methods demonstrate effectiveness in closed-source repositories, with similarity-based RAG showing superior performance; (2) the effectiveness of similarity-based RAG improves with more advanced retrieval techniques, where BM25 (lexical retrieval) and GTE-Qwen (semantic retrieval) achieve superior performance; and (3) the combination of lexical and semantic retrieval techniques yields optimal results, demonstrating complementary strengths. Furthermore, we conduct a developer survey to validate the practical utility of RAG methods in real-world development environments.
守则的完成是软件工程的关键任务,提高了开发者生产率,随着大型语言模型(LLMs)的快速进步,守则的完成有了实质性的改进。近年来,检索增强的生成(RAG)已成为提高LLM公司代码完成能力的一个很有希望的方法,LM公司从代码库中利用相关背景,而无需进行示范再培训。虽然现有的研究表明RAG公司对公共储存库和基准的有效性,开放源码库和封闭源码库之间的潜在分配变化带来了尚未探讨的独特挑战。为了缩小差距,我们开展了一项实证研究,以调查在最大自有软件系统之一即WeChat工业规模的代码库中广泛使用的RAG完成代码的方法(RAG)的性能。 具体而言,我们广泛探索了两种主要类型的RAG方法,即基于标识的RAG和类似源码的源码,范围从0.5B到671B参数的开放源码,对基于类似RAG的检索和语义检索系统进行不同的检索技术。基于1,在1,669内部存储库中,我们展示了类似于RAG的高级读取方法的高级版本。
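The combination of lexical and semantic retrieval mentioned in the WeChat study can be sketched as follows. This is an illustration under assumptions, not the study's pipeline: BM25 is implemented directly, the "semantic" side is replaced by a character-trigram cosine as a stand-in for an embedding model such as GTE-Qwen, and reciprocal rank fusion is one possible way to merge the two rankings (the abstract does not say which fusion method was used). All function names are hypothetical.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenised document against a tokenised query."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(t for d in docs for t in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query:
            if t in tf:
                idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
                score += idf * tf[t] * (k1 + 1) / (tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores


def trigram_cosine(a, b):
    """Stand-in for embedding similarity: cosine over character trigram counts."""
    ca = Counter(a[i:i + 3] for i in range(len(a) - 2))
    cb = Counter(b[i:i + 3] for i in range(len(b) - 2))
    dot = sum(ca[g] * cb[g] for g in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several rankings of document indices into a single ranking."""
    fused = Counter()
    for ranking in rankings:
        for rank, idx in enumerate(ranking, start=1):
            fused[idx] += 1.0 / (k + rank)
    return [idx for idx, _ in fused.most_common()]


def hybrid_retrieve(query, snippets, top_k=2):
    """Rank snippets lexically and 'semantically', then fuse the two rankings."""
    lexical = bm25_scores(query.split(), [s.split() for s in snippets])
    semantic = [trigram_cosine(query, s) for s in snippets]
    by_lexical = sorted(range(len(snippets)), key=lambda i: -lexical[i])
    by_semantic = sorted(range(len(snippets)), key=lambda i: -semantic[i])
    fused = reciprocal_rank_fusion([by_lexical, by_semantic])
    return [snippets[i] for i in fused[:top_k]]


snippets = [
    "def send message ( user , text )",
    "def parse config ( path )",
    "def send image ( user , image )",
]
print(hybrid_retrieve("send message to user", snippets, top_k=1))
```

In a RAG setup for code completion, the retrieved snippets would be placed in the prompt ahead of the unfinished code; no model retraining is involved.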
Article 5
Title@2025-07-24 (4): Automated Code Review Using Large Language Models with Symbolic Reasoning
Title: Automated Code Review Using Large Language Models with Symbolic Reasoning | Automatisierte Code-Überprüfung mit großen Sprachmodellen mit symbolischer Begründung | 使用有符号理由的大语言模型的自动码审查 2507.18476v1 |
Authors (2): Busra Icoz, Goksel Biricik
Code review is one of the key processes in the software development lifecycle and is essential to maintain code quality. However, manual code review is subjective and time-consuming. Given its rule-based nature, code review is well suited for automation. In recent years, significant efforts have been made to automate this process with the help of artificial intelligence. Large Language Models (LLMs) have also emerged as a promising tool in this area, but these models often lack the logical reasoning capabilities needed to fully understand and evaluate code. To overcome this limitation, this study proposes a hybrid approach that integrates symbolic reasoning techniques with LLMs to automate the code review process. We tested our approach using the CodeXGLUE dataset, comparing several models, including CodeT5, CodeBERT, and GraphCodeBERT, to assess the effectiveness of combining symbolic reasoning and prompting techniques with LLMs. Our results show that this approach improves the accuracy and efficiency of automated code review.
代码审查是软件开发生命周期的关键过程之一,对维护代码质量至关重要。但是,人工代码审查是主观的,耗费时间。鉴于其基于规则的性质,代码审查非常适合自动化。近年来,在人工智能的帮助下,为将这一过程自动化做出了重大努力。大语言模型(LLMS)的近期发展也成为这一领域的一个很有希望的工具,但这些模型往往缺乏充分理解和评估代码所需的逻辑推理能力。为克服这一限制,本研究报告提出了一种混合方法,将象征性推理技术与LLOMS结合起来,使代码审查过程自动化。我们用代码Glue数据集测试了我们的方法,比较了包括代码T5、DCBERT和GreagCodeBERT在内的若干模型,以评估将符号推理和提示技术与LMSMs相结合的有效性。我们的成果表明,这一方法提高了自动代码审查的准确性和效率。
Article 6
Title@2025-07-24 (4): Exploring and Evaluating Interplays of BPpy with Deep Reinforcement Learning and Formal Methods
Title: Exploring and Evaluating Interplays of BPpy with Deep Reinforcement Learning and Formal Methods | Erforschung und Auswertung von Interplays von BPpy mit Deep Reinforcement Learning und Formal Methods | 探索和评价与深强化学习和正规方法的BPpy的相互作用 2501.15480v2 |
Authors (5): Tom Yaacov, Gera Weiss, Adiel Ashrov, Guy Katz, Jules Zisser
We explore and evaluate the interactions between Behavioral Programming (BP) and a range of Artificial Intelligence (AI) and Formal Methods (FM) techniques. Our goal is to demonstrate that BP can serve as an abstraction that integrates various techniques, enabling a multifaceted analysis and a rich development process. Specifically, the paper examines how the BPpy framework, a Python-based implementation of BP, is enhanced by and enhances various FM and AI tools. We assess how integrating BP with tools such as Satisfiability Modulo Theories (SMT) solvers, symbolic and probabilistic model checking, and Deep Reinforcement Learning (DRL) allows us to scale the abilities of BP to model complex systems. Additionally, we illustrate how developers can leverage multiple tools within a single modeling and development task. The paper provides quantitative and qualitative evidence supporting the feasibility of our vision to create a comprehensive toolbox for harnessing AI and FM methods in a unified development framework.
我们探索和评价行为规划技术与一系列人工智能和正规方法(FM)技术之间的互动关系。我们的目标是证明BP可以作为一种抽象的抽象,将各种技术结合起来,进行多方面的分析,开展丰富的发展进程。具体地说,该文件审查了BPpy框架,一个基于Python的BPy执行BP,是如何通过各种调频和AI工具得到加强和加强的。我们评估如何将BP与满足性Modulo Theory(SMT)解答器、象征性和概率模型检查以及深强化学习(DRL)等工具相结合,使我们能够扩大BP的能力,以模拟复杂的系统。此外,我们说明了开发者如何在单一的模型和开发任务中利用多种工具。文件提供了定量和定性证据,支持我们建立一个综合工具箱,在统一的发展框架中利用AI和调频方法的愿景的可行性。
Article 7
Title@2025-07-24 (4): It is Giving Major Satisfaction: Why Fairness Matters for Software Practitioners
Title: It is Giving Major Satisfaction: Why Fairness Matters for Software Practitioners | Es gibt große Zufriedenheit: Warum Fairness für Software-Praktiker wichtig ist | 它给予重大满意:为什么软件从业人员的公平问题? 2410.02482v5 |
Authors (3): Emeralda Sesari, Federica Sarro, Ayushi Rastogi
Software practitioners often encounter workplace unfairness, such as unequal recognition and gender bias. While the link between fairness and job satisfaction has been established in other fields, its relevance to software professionals remains underexplored. This study examines how fairness perceptions relate to job satisfaction among software practitioners, focusing on both general trends and demographic-specific differences. We conducted an online survey of 108 software practitioners, followed by ordinal logistic regression to analyze the relationship between fairness perceptions and job satisfaction in software engineering contexts, with moderation analysis examining how this relationship varies across demographic groups. Our findings indicate that all four fairness dimensions (namely distributive, procedural, interpersonal, and informational fairness) significantly affect overall job satisfaction and satisfaction with job security. Among these, interpersonal fairness has the biggest impact. The relationship between fairness and job satisfaction is stronger for female, ethnically underrepresented, less experienced practitioners, and those with work limitations. Fairness in authorship emerged as an important factor for job satisfaction collectively, while fairness in policy implementation, high-demand situations, and working hours impacted specific demographic groups. This study highlights the role of fairness among software practitioners, offering strategies for organizations to promote fair practices and targeted approaches for certain demographic groups.
软件从业者往往遇到工作场所的不公平,例如不平等的承认和性别偏见。虽然公平与工作满意度之间的联系在其他领域已经确立,但与软件专业人员的相关性仍未得到充分探讨。本研究报告审查了公平观念如何与软件从业者的工作满意度相关,侧重于一般趋势和具体人口差异。我们对108名软件从业者进行了在线调查,随后是标准后勤回归,以分析软件工程方面的公平观念与工作满意度之间的关系,同时进行适度分析,审查这种关系在人口群体之间如何不同。我们的调查结果表明,所有四个公平层面(即分配、程序、人际和信息公平)都严重影响了工作满意度和对工作安全的总体满意度。其中,人际公平具有最大的影响。公平与工作满意度之间的关系对女性、族裔代表性不足、经验较少的从业者和有工作限制的人来说更为密切。作者的公平是集体满意度的一个重要因素,而政策执行的公平性、高需求情况和工作时间则影响到特定人口群体。本研究报告强调了软件从业者之间的公平性作用,为各组织提供了促进公平做法和针对某些人口群体的定向办法的战略。
Article 8
Title@2025-07-24 (4): FMI Meets SystemC: A Framework for Cross-Tool Virtual Prototyping
Title: FMI Meets SystemC: A Framework for Cross-Tool Virtual Prototyping | FMI trifft SystemC: Ein Rahmen für das Cross-Tool Virtual Prototyping | FMI 满足系统C:跨工具虚拟原型框架 2507.18339v1 |
Authors (5): Nils Bosbach, Meik Schmidt, Lukas Jünger, Matthias Berthold, Rainer Leupers
As systems become more complex, the demand for thorough testing and virtual prototyping grows. To simulate whole systems, multiple tools are usually needed to cover different parts. These parts include the hardware of a system and the environment with which the system interacts. The Functional Mock-up Interface (FMI) standard for co-simulation can be used to connect these tools. The control part of modern systems is usually a computing unit, such as a System-on-a-Chip (SoC) or Microcontroller Unit (MCU), which executes software from a connected memory and interacts with peripherals. To develop software without requiring access to physical hardware, full-system simulators, the so-called Virtual Platforms (VPs), are commonly used. The IEEE-standardized framework for VP development is SystemC TLM. SystemC provides interfaces and concepts that enable modular design and model exchange. However, SystemC lacks native FMI support, which limits its integration into broader co-simulation environments. This paper presents a novel framework to control and interact with SystemC-based VPs using the FMI. We present a case study showing how a simulated temperature sensor in a SystemC simulation can obtain temperature values from an external tool via FMI. This approach allows the unmodified target software to run on the VP and receive realistic environmental input data such as temperature, velocity, or acceleration values from other tools. Thus, extensive software testing and verification are enabled. Because tests are ready and the software has been pre-tested on a VP by the time the physical hardware is available, certifications such as ISO 26262 can be completed earlier.
随着系统日益复杂,对全面测试和虚拟原型的需求不断增长。要仿真整个系统,通常需要多种工具来覆盖不同部分,包括系统的硬件以及与系统交互的环境。用于联合仿真的功能样机接口(FMI)标准可用于连接这些工具。现代系统的控制部分通常是一个计算单元,例如片上系统(SoC)或微控制器单元(MCU),它从所连接的存储器中执行软件并与外设交互。为了在无需访问物理硬件的情况下开发软件,通常使用全系统仿真器,即所谓的虚拟平台(VP)。IEEE标准化的VP开发框架是SystemC TLM。SystemC提供了支持模块化设计和模型交换的接口与概念。然而,SystemC缺乏原生的FMI支持,这限制了其集成到更广泛的联合仿真环境中。本文提出了一个新的框架,利用FMI来控制基于SystemC的VP并与之交互。我们给出了一个案例研究,展示SystemC仿真中的模拟温度传感器如何通过FMI从外部工具获取温度值。这种方法使未经修改的目标软件能够在VP上运行,并从其他工具接收温度、速度或加速度等真实的环境输入数据,从而实现广泛的软件测试与验证。由于在物理硬件可用时测试已经就绪、软件也已通过VP预先测试,ISO 26262等认证可以更早完成。
Article 9
Title@2025-07-24 (4): LLMShot: Reducing snapshot testing maintenance via LLMs
Title: LLMShot: Reducing snapshot testing maintenance via LLMs | LLMShot: Reduzierung der Snapshot-Test-Wartung über LLMs | LLMShot:减少通过LLMM减少快速测试维护 2507.10062v2 |
Authors (4): Ergün Batuhan Kaynak, Mayasah Lami, Sahand Moslemi, Anil Koyuncu
Snapshot testing has emerged as a critical technique for UI validation in modern software development, yet it suffers from substantial maintenance overhead due to frequent UI changes causing test failures that require manual inspection to distinguish between genuine regressions and intentional design changes. This manual triage process becomes increasingly burdensome as applications evolve, creating a need for automated analysis solutions. This paper introduces LLMShot, a novel framework that leverages Vision-Language Models (VLMs) to automatically analyze snapshot test failures through semantic classification of UI changes. To evaluate LLMShot’s effectiveness, we developed a comprehensive dataset using a feature-rich iOS application with configurable feature flags, creating realistic scenarios that produce authentic snapshot differences representative of real development workflows. Our evaluation using Gemma3 models demonstrates strong classification performance, with the 12B variant achieving over 84% recall in identifying failure root causes while the 4B model offers practical deployment advantages with acceptable performance for continuous integration environments. However, our exploration of selective ignore mechanisms revealed significant limitations in current prompting-based approaches for controllable visual reasoning. LLMShot represents the first automated approach to semantic snapshot test analysis, offering developers structured insights that can substantially reduce manual triage effort and advance toward more intelligent UI testing paradigms.
快速抓图测试已成为现代软件开发中UI验证的关键技术,但它却由于频繁的UI变化导致测试失败,需要进行人工检查以区分真正的回归和有意的设计变化,从而导致测试失败,从而导致测试失败,从而导致测试失败。随着应用程序的演变,这种人工裁剪过程变得日益繁琐,从而产生了自动分析解决方案的需要。本文介绍了LloMShot,这是一个利用Vision-Language Models(VLMS)的新型框架,通过对UI变化进行语义分类,自动分析短视测试失败。但是,为了评估LLLMShot的效能,我们开发了一个全面的数据集,使用了具有可配置特征标志的功能丰富的iOS应用程序,从而产生了现实的情景,从而产生真实的快照差异,代表了真实的发展工作流程。我们使用Gemma3模型进行的评估显示了很强的分类性表现,12B变量在查明失败根源方面达到84%以上,而4B模型则为持续整合环境的可接受性能提供实际部署优势。然而,我们对选择性的无视机制的探索揭示了当前快速直视推论方法的巨大局限性。 LLMShot代表了当前对可控性智能模拟模拟模拟模拟模拟模拟模拟模拟模拟测试分析的第一个自动化自动测试。
Article 10
Title@2025-07-24 (4): Gotta catch ‘em all! Towards File Localisation from Issues at Large
Title: Gotta catch ‘em all! Towards File Localisation from Issues at Large | Ich muss sie alle fangen! Auf dem Weg zur Dateilokalisierung von Themen im Großen und Ganzen | 必须抓住他们所有人! 2507.18319v1 |
Authors (3): Jesse Maarleveld, Jiapan Guo, Daniel Feitosa
Bug localisation, the study of developing methods to localise the files requiring changes to resolve bugs, has been researched for a long time to develop methods capable of saving developers’ time. Recently, researchers have started to consider issues outside of bugs. Nevertheless, most existing research into file localisation from issues focusses on bugs or uses other selection methods to ensure only certain types of issues are considered as part of the focus of the work. Our goal is to work on all issues at large, without any specific selection. In this work, we provide a data pipeline for the creation of issue file localisation datasets, capable of dealing with arbitrary branching and merging practices. We provide a baseline performance evaluation for the file localisation problem using traditional information retrieval approaches. Finally, we use statistical analysis to investigate the influence of biases known in the bug localisation community on our dataset. Our results show that methods designed using bug-specific heuristics perform poorly on general issue types, indicating a need for research into general purpose models. Furthermore, we find that there are small but statistically significant differences in performance between different issue types. Finally, we find that the presence of identifiers has a small effect on performance for most issue types. Many results are project-dependent, encouraging the development of methods which can be tuned to project-specific characteristics.
错误本地化, 即研究如何定位为解决错误而需要修改的文件, 长期以来一直是为节省开发者时间而研究的课题。 最近, 研究人员开始考虑错误以外的问题。 然而, 大部分现有的基于问题的文件本地化研究仍然关注错误, 或者使用其他选择方法确保只有某些类型的问题被纳入研究范围。 我们的目标是在总体上就所有问题开展工作, 而不作任何具体选择。 在这项工作中, 我们为创建问题文件本地化数据集提供了一个数据管道, 能够处理任意的分支和合并做法。 我们利用传统的信息检索方法为文件本地化问题提供基线性能评估。 最后, 我们使用统计分析来调查错误本地化社区中已知的偏差对数据集的影响。 我们的结果显示, 使用针对错误的启发式规则设计的方法在一般问题类型上表现不佳, 表明需要研究通用模型。 此外, 我们发现不同问题类型在性能上存在小但统计上显著的差别。 最后, 我们发现标识符的存在对大多数问题类型的性能影响很小。 许多结果依赖于具体项目, 这鼓励开发能够根据项目特性进行调整的方法。
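The "traditional information retrieval" baseline mentioned in this abstract can be pictured with a small TF-IDF ranking sketch. This is only an illustration under stated assumptions, not the authors' pipeline: the function names, the tokenizer, the smoothed IDF formula, and the toy file contents are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters."""
    return [t for t in re.split(r"[^a-z0-9]+", text.lower()) if t]


def rank_files(issue_text, files):
    """Rank files by TF-IDF cosine similarity to the issue text.

    `files` maps a path to its textual content (hypothetical input format).
    """
    docs = {path: Counter(tokenize(body)) for path, body in files.items()}
    n_docs = len(docs)
    # Document frequency of each term across the candidate files.
    df = Counter(term for counts in docs.values() for term in counts)
    # Smoothed IDF (an assumed variant; other weightings are possible).
    idf = {t: math.log((1 + n_docs) / (1 + f)) + 1.0 for t, f in df.items()}

    def vectorize(counts):
        # Terms unseen in the corpus are dropped from the vector.
        return {t: c * idf[t] for t, c in counts.items() if t in idf}

    def cosine(a, b):
        dot = sum(w * b.get(t, 0.0) for t, w in a.items())
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

    query = vectorize(Counter(tokenize(issue_text)))
    scores = {p: cosine(query, vectorize(c)) for p, c in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy repository contents, invented for illustration only.
files = {
    "src/parser.py": "def parse_config(path): read yaml config file and parse options",
    "src/network.py": "def open_socket(host, port): connect retry timeout",
    "docs/readme.txt": "installation instructions and usage overview",
}
ranking = rank_files("Crash when parsing the yaml config file", files)
```

A purely lexical ranker like this depends on vocabulary shared between the issue and the files, which is one plausible reason identifiers in the issue text can matter.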
Article 11
Title@2025-07-24 (4): YATE: The Role of Test Repair in LLM-Based Unit Test Generation
Title: YATE: The Role of Test Repair in LLM-Based Unit Test Generation | YATE: Die Rolle der Testreparatur bei der LLM-basierten Einheiten-Testgenerierung | YATE:在以LLM为基础的单位试验生成中测试修理的作用 2507.18316v1 |
Authors (5): Michael Konstantinou, Renzo Degiovanni, Jie M. Zhang, Mark Harman, Mike Papadakis
Recent advances in automated test generation utilise language models to produce unit tests. While effective, language models tend to generate many incorrect tests with respect to both syntax and semantics. Although such incorrect tests can be easily detected and discarded, they constitute a “missed opportunity” – if fixed, they are often valuable as they directly add testing value (they effectively target the underlying program logic to be tested) and indirectly form good seeds for generating additional tests. To this end, we propose a simple technique for repairing some of these incorrect tests through a combination of rule-based static analysis and re-prompting. We evaluate this simple approach, named YATE, on a set of 6 open-source projects and show that it can effectively produce tests that cover on average 32.06% more lines and kill 21.77% more mutants than a plain LLM-based method. We also compare YATE with four other LLM-based methods, namely HITS, SYMPROMPT, TESTSPARK and COVERUP, and show that it produces tests that cover substantially more code. YATE achieves 22% higher line coverage, 20% higher branch coverage and kills 20% more mutants at a comparable cost (number of calls to LLMs).
自动测试生成的最近进步利用语言模型来制作单元测试。 虽然语言模型是有效的, 但通常会在语法和语义方面产生许多不正确的测试。 虽然这些不正确的测试可以很容易地检测和丢弃, 但是它们构成了一个“错失的机会”:如果得到修复,它们往往很宝贵,因为它们直接增加了测试价值(它们有效地针对待测试的基本程序逻辑), 并间接地形成良好的种子来生成额外的测试。 为此,我们提出一种简单技术,通过基于规则的静态分析和重新提示相结合来修复其中一些不正确的测试。 我们用一套6个开源项目来评估这个叫做YATE的简单方法, 并表明与普通的基于LLM的方法相比,它能够有效地生成平均多覆盖32.06%代码行、多杀死21.77%变异体的测试。 我们还将YATE与其他四种基于LLM的方法(即HITS、SYMPROMPT、TESTSPARK和COVERUP)进行比较, 并表明它产生的测试覆盖了多得多的代码。 YATE在可比成本(对LLM的调用次数)下,行覆盖率高22%,分支覆盖率高20%,杀死的变异体多20%。
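The combination of rule-based static analysis and re-prompting described above can be sketched roughly as follows. This is not YATE's actual rule set: the single "add a missing import" rule, the `KNOWN_MODULES` whitelist, and the `reprompt` callback are hypothetical stand-ins chosen to keep the example small.

```python
import ast
import builtins

# Hypothetical whitelist of modules the import rule is allowed to add.
KNOWN_MODULES = {"math", "json", "re"}


def undefined_module_names(source):
    """Return known module names that are used but never imported or assigned."""
    tree = ast.parse(source)
    bound = set(dir(builtins))
    used = set()
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            bound.update((a.asname or a.name).split(".")[0] for a in node.names)
        elif isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            bound.add(node.name)
        elif isinstance(node, ast.Name):
            (bound if isinstance(node.ctx, ast.Store) else used).add(node.id)
    return sorted((used - bound) & KNOWN_MODULES)


def repair_test(source, reprompt=None):
    """Apply the static rule first; fall back to re-prompting on syntax errors."""
    try:
        missing = undefined_module_names(source)
    except SyntaxError as err:
        # No static rule applies here, so hand the error back to the model
        # (`reprompt` is a hypothetical callback wrapping an LLM call).
        return reprompt(source, str(err)) if reprompt else None
    header = "".join(f"import {name}\n" for name in missing)
    return header + source


broken = "def test_sqrt():\n    assert math.sqrt(9) == 3\n"
fixed = repair_test(broken)
```

Trying the cheap static rule before any model call is one way to keep the number of LLM calls comparable to a plain generation baseline, which is the cost measure the abstract reports.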
Article 12
Title@2025-07-24 (4): Scheduzz: Constraint-based Fuzz Driver Generation with Dual Scheduling
Title: Scheduzz: Constraint-based Fuzz Driver Generation with Dual Scheduling | Scheduzz: Fuzz Driver Generation mit Dual Scheduling | Scheduzz:基于节制的有双重日程安排的 Fiszz 驱动力生成 2507.18289v1 |
Authors (7): Yan Li, Wenzhang Yang, Yuekun Wang, Jian Gao, Shaohua Wang, Yinxing Xue, Lijun Zhang
Fuzzing a library requires experts to understand the library usage well and craft high-quality fuzz drivers, which is tricky and tedious. Therefore, many techniques have been proposed to automatically generate fuzz drivers. However, they fail to generate rational fuzz drivers due to the lack of adherence to proper library usage conventions, such as ensuring a resource is closed after being opened. To make things worse, existing library fuzzing techniques unconditionally execute each driver, resulting in numerous irrational drivers that waste computational resources while contributing little coverage and generating false positive bug reports. To tackle these challenges, we propose Scheduzz, a novel LLM-based automatic library fuzzing technique. It leverages LLMs to understand rational usage of libraries and extract API combination constraints. To optimize computational resource utilization, a dual scheduling framework is implemented to efficiently manage API combinations and fuzz drivers. The framework models driver generation and the corresponding fuzzing campaign as an online optimization problem. Within the scheduling loop, multiple API combinations are selected to generate fuzz drivers, while simultaneously, various optimized fuzz drivers are scheduled for execution or suspension. We implemented Scheduzz and evaluated it on 33 real-world libraries. Compared to baseline approaches, Scheduzz significantly reduces computational overhead and outperforms UTopia on 16 out of 21 libraries. It achieves 1.62x, 1.50x, and 1.89x higher overall coverage than the state-of-the-art techniques CKGFuzzer, Promptfuzz, and the handcrafted project OSS-Fuzz, respectively. In addition, Scheduzz discovered 33 previously unknown bugs in these well-tested libraries, 3 of which have been assigned CVEs.
对库进行模糊测试需要专家充分了解库的用法并编写高质量的模糊测试驱动程序,这既棘手又乏味。因此,已有许多技术被提出用于自动生成模糊测试驱动程序。然而,由于没有遵守正确的库使用约定(例如确保资源在打开后被关闭),这些技术无法生成合理的驱动程序。更糟糕的是,现有的库模糊测试技术无条件地执行每个驱动程序,导致大量不合理的驱动程序浪费计算资源,同时贡献很少的覆盖率并产生误报的错误报告。为了应对这些挑战,我们提出了Scheduzz,一种基于LLM的新型自动库模糊测试技术。它利用LLM来理解库的合理用法并提取API组合约束。为了优化计算资源的利用,我们实现了一个双重调度框架来高效管理API组合和模糊测试驱动程序。该框架将驱动程序生成和相应的模糊测试活动建模为一个在线优化问题。在调度循环中,选择多个API组合来生成模糊测试驱动程序,同时调度各种经过优化的驱动程序执行或暂停。我们实现了Scheduzz并在33个真实世界的库上对其进行了评估。与基线方法相比,Scheduzz显著降低了计算开销,并在21个库中的16个上优于UTopia。其总体覆盖率分别比最先进的技术CKGFuzzer、Promptfuzz以及手工编写的项目OSS-Fuzz高出1.62x、1.50x和1.89x。此外,Scheduzz在这些经过充分测试的库中发现了33个此前未知的错误,其中3个已被分配CVE编号。
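Treating driver execution as an online optimization problem can be illustrated with a multi-armed bandit. The abstract does not say which algorithm Scheduzz uses, so the UCB1 policy below is a swapped-in textbook technique, and the reward model (`coverage_gain`, `BASE_GAIN`) is an invented assumption.

```python
import math


def ucb1_schedule(reward_fn, n_drivers, rounds, c=2.0):
    """Pick one fuzz driver per round using the UCB1 bandit policy.

    `reward_fn(driver, runs_so_far)` is a hypothetical callback returning the
    coverage gained by running `driver` once more.
    """
    pulls = [0] * n_drivers
    totals = [0.0] * n_drivers
    for t in range(1, rounds + 1):
        if t <= n_drivers:
            choice = t - 1  # run every driver once before comparing them
        else:
            choice = max(
                range(n_drivers),
                key=lambda i: totals[i] / pulls[i]
                + math.sqrt(c * math.log(t) / pulls[i]),
            )
        totals[choice] += reward_fn(choice, pulls[choice])
        pulls[choice] += 1
    return pulls


# Assumed reward model: each driver has a base yield that decays as it saturates.
BASE_GAIN = [0.9, 0.2, 0.05]


def coverage_gain(driver, runs_so_far):
    return BASE_GAIN[driver] / (1 + 0.01 * runs_so_far)


pulls = ucb1_schedule(coverage_gain, n_drivers=3, rounds=300)
```

Under this policy a low-yield driver is not discarded outright; it is simply run less often, which loosely mirrors the "execution or suspension" decision in the abstract.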
Article 13
Title@2025-07-24 (4): An Empirical Study on Embodied Artificial Intelligence Robot (EAIR) Software Bugs
Title: An Empirical Study on Embodied Artificial Intelligence Robot (EAIR) Software Bugs | Eine empirische Studie über körpereigene Software-Fehler im Bereich Künstliche Intelligenz von Robotern (EAIR) | 关于人造人工智能机器人(EAIR)软件虫的经验研究 2507.18267v1 |
Authors (8): Zeqin Liao, Zibin Zheng, Peifan Reng, Henglong Liang, Zixu Gao, Zhixiang Chen, Wei Li, Yuhong Nan
Embodied Artificial Intelligence Robots (EAIR) are an emerging and rapidly evolving technological domain. Ensuring their program correctness is fundamental to their successful deployment. However, a general and in-depth understanding of EAIR system bugs remains lacking, which hinders the development of practices and techniques to tackle EAIR system bugs. To bridge this gap, we conducted the first systematic study of 885 EAIR system bugs collected from 80 EAIR system projects to investigate their symptoms, underlying causes, and module distribution. Our analysis, which took considerable effort, classifies these bugs into 18 underlying causes and 15 distinct symptoms, and identifies 13 affected modules. It reveals several new and interesting findings and implications that help shed light on future research on tackling or repairing EAIR system bugs. First, among the 15 identified symptoms, our findings highlight 8 symptoms specific to EAIR systems, which are characterized by severe functional failures and potential physical hazards. Second, within the 18 underlying causes, we define 8 EAIR-specific causes, the majority of which stem from the intricate issues of AI-agent reasoning and decision making. Finally, to facilitate precise and efficient bug prediction, detection, and repair, we constructed a mapping between underlying causes and the modules in which they most frequently occur, which enables researchers to focus diagnostic efforts on the modules most susceptible to specific bug types.
人工智能机器人(EAIR)是一个新兴的、迅速演变的技术领域。确保其程序正确性是成功部署的基础。然而,仍然缺乏对EAIR系统错误的普遍和深入了解,这妨碍了开发处理EAIR系统错误的做法和技术。为了缩小这一差距,我们对从80个EAIR系统项目中收集的885个EAIR系统错误进行了首次系统研究,以调查其症状、根本原因和模块分布。我们的分析需要大量努力,将这些错误分为18个根本原因、15个不同症状和13个受影响的模块。它揭示了一些新的有趣的发现和意义,有助于了解今后关于处理或修复EAIR系统错误的研究。首先,在15个查明的症状中,我们的调查结果突出了EAIR系统特有的8个症状,其特征是功能严重失灵和潜在的物理危害。第二,在18个根本原因中,我们确定了8个EAIR系统具体原因,其中多数源于AI代理理论和决定的复杂问题。最后,为准确和高效的错误预测、检测和修复提供了一些新的结果和影响,有助于了解未来关于处理或修复EAIR系统错误的研究。在最易变本的模型中,我们建立了最能的模型。
Article 14
Title@2025-07-24 (4): GenAI for Automotive Software Development: From Requirements to Wheels
Title: GenAI for Automotive Software Development: From Requirements to Wheels | GenAI für die Entwicklung von Automotive-Software: Von Anforderungen bis zu Rädern | GENAI 汽车软件开发GENAI:从要求到轮子 2507.18223v1 |
Authors (6): Nenad Petrovic, Fengjunjie Pan, Vahid Zolfaghari, Krzysztof Lebioda, Andre Schamschurko, Alois Knoll
This paper introduces a GenAI-empowered approach to automated development of automotive software, with emphasis on autonomous and Advanced Driver Assistance Systems (ADAS) capabilities. The process starts with requirements as input, while the main generated outputs are test scenario code for the simulation environment, together with an implementation of the desired ADAS capabilities targeting the hardware platform of a vehicle connected to a testbench. Moreover, we introduce additional steps for requirements consistency checking leveraging Model-Driven Engineering (MDE). In the proposed workflow, Large Language Models (LLMs) are used for model-based summarization of requirements (Ecore metamodel, XMI model instance and OCL constraint creation), test scenario generation, simulation code (Python) and target platform code generation (C++). Additionally, Retrieval Augmented Generation (RAG) is adopted to enhance test scenario generation from documents related to autonomous driving regulations. Our approach aims at shorter compliance and re-engineering cycles, as well as reduced development and testing time for ADAS-related capabilities.
本文介绍了通用自动自动开发汽车软件的GenAI动力方法,重点是自动和高级驱动协助系统(ADAS)能力;这一过程从需求作为投入开始,而产生的主要产出是模拟环境的测试情景代码,以及针对与测试台连接的车辆硬件平台实施理想的ADAS能力;此外,我们引入了额外步骤,要求一致性检查,利用模型驱动工程(MDE);在拟议的工作流程中,大型语言模型(LLMS)用于基于模型的需求汇总(核心元模型、XMI模型实例和OCL制约设定)、测试情景生成、模拟代码(Python)和目标平台代码生成(C++);此外,还采用了检索聚合生成(RAG)能力,以加强自动驱动规则相关文件的测试情景生成;我们的方法旨在缩短合规和再设计周期,并缩短与ADAS相关能力有关的开发和测试时间。
Article 15
Title@2025-07-24 (4): SMECS: A Software Metadata Extraction and Curation Software
Title: SMECS: A Software Metadata Extraction and Curation Software | SMECS: Eine Software Metadata Extraktions- und Kurationssoftware | SMECS:软件元数据抽取和计算软件 2507.18159v1 |
Authors (4): Stephan Ferenz, Aida Jafarbigloo, Oliver Werth, Astrid Nieße
Metadata play a crucial role in adopting the FAIR principles for research software and enable findability and reusability. However, creating high-quality metadata can be resource-intensive for researchers and research software engineers. To address this challenge, we developed the Software Metadata Extraction and Curation Software (SMECS), which integrates the extraction of metadata from existing sources with a user-friendly interface for metadata curation. SMECS extracts metadata from online repositories such as GitHub and presents it to researchers through an interactive interface for further curation and export as a CodeMeta file. The usability of SMECS was evaluated through usability experiments which confirmed that SMECS provides a satisfactory user experience. SMECS supports the FAIRification of research software by simplifying metadata creation.
元数据在采用FAIR研究软件原则方面发挥着关键作用,能够找到和重新使用。然而,建立高质量的元数据对于研究人员和研究软件工程师来说可能是资源密集型的。为了应对这一挑战,我们开发了软件元数据提取和计算软件(SMECS),将从现有来源提取元数据与方便用户的元数据整理接口结合起来。中小企业中央数据库从GitHub等在线储存库中提取元数据,并通过互动接口将其提供给研究人员,以便进一步整理和作为代码Meta文件输出。通过可用性试验对中小企业中央信息系统的可用性进行了评估,证实中小企业中央数据库提供了令人满意的用户经验。中小企业中央数据库通过简化元数据创建,支持研究软件的公平化。
Article 16
Title@2025-07-24 (4): When Retriever Meets Generator: A Joint Model for Code Comment Generation
Title: When Retriever Meets Generator: A Joint Model for Code Comment Generation | Wenn Retriever trifft Generator: Ein gemeinsames Modell für Code Comment Generation | 当再利用与生成器相遇时: 代码Comment生成联合模式 2507.12558v2 |
Authors (5): Tien P. T. Le, Anh M. T. Bui, Huy N. D. Pham, Alessio Bucaioni, Phuong T. Nguyen
Automatically generating concise, informative comments for source code can lighten documentation effort and accelerate program comprehension. Retrieval-augmented approaches first fetch code snippets with existing comments and then synthesize a new comment, yet retrieval and generation are typically optimized in isolation, allowing irrelevant neighbors to propagate noise downstream. To tackle the issue, we propose a novel approach named RAGSum with the aim of both effectiveness and efficiency in recommendations. RAGSum fuses retrieval and generation using a single CodeT5 backbone. We report preliminary results on a unified retrieval-generation framework built on CodeT5. A contrastive pre-training phase shapes code embeddings for nearest-neighbor search; these weights then seed end-to-end training with a composite loss that (i) rewards accurate top-k retrieval; and (ii) minimizes comment-generation error. More importantly, a lightweight self-refinement loop is deployed to polish the final output. We evaluated the framework on three cross-language benchmarks (Java, Python, C), and compared it with three well-established baselines. The results show that our approach substantially outperforms the baselines with respect to BLEU, METEOR, and ROUGE-L. These findings indicate that tightly coupling retrieval and generation can raise the ceiling for comment automation and motivate forthcoming replications and qualitative developer studies.
为源代码自动生成简明、信息化的评论可以减轻文件工作,并加速程序理解。 检索强化方法首先用现有评论获取代码片断,然后合成新的评论,然而,检索和生成通常在孤立的情况下优化,允许不相关的邻居在下游对噪音进行排解。 为了解决这个问题,我们提议了一个名为RAGSum的新颖方法,其目的在于提高建议的效力和效率。RAGSum建在顶部的离线检索和生成上方,使用单一的代码T5主干线。我们报告了在代码T5基础上建立的统一检索-生成框架的初步结果。一个对比式的训练前阶段将代码嵌入最近的邻居搜索中;这些重量然后是种子端到端的培训,其复合损失(一) 奖励准确的顶级检索;以及 (二) 尽量减少评论生成错误。 更重要的是,将一个轻量的自我修整环安装在最上层上方,以光滑动的最后输出。 我们用三个跨语言基准(Java、Python、C)对框架进行了评估,并将它与三个完善的基线进行对比; 这些重量制质量到质量到质量级的代码, 显示我们不断的循环的复制和不断的循环的复制结果。
Article 17
Title@2025-07-24 (4): NoCode-bench: A Benchmark for Evaluating Natural Language-Driven Feature Addition
Title: NoCode-bench: A Benchmark for Evaluating Natural Language-Driven Feature Addition | NoCode-Bench: Ein Benchmark für die Bewertung der Erweiterung natürlicher sprachgetriebener Funktionen | NoCode-Bonch:评价自然语言-驱动地物的基准 2507.18130v1 |
Authors (5): Le Deng, Zhonghao Jiang, Jialun Cao, Michael Pradel, Zhongxin Liu
Natural language-driven no-code development allows users to specify software functionality using natural language (NL) instead of editing source code, promising increased productivity and democratized development. Large language models (LLMs) show potential in enabling this paradigm. In this context, software documentation acts as an NL specification for functionality. This work introduces NoCode-bench, a benchmark designed to evaluate LLMs on real-world NL-driven feature addition tasks, consisting of 634 tasks across 10 projects and 114k code changes. Each task pairs documentation updates with corresponding code implementations, validated by developer-written test cases. A subset of 114 high-quality, human-verified instances, NoCode-bench Verified, ensures reliable evaluation. Our experiments reveal that, despite high token usage, the best LLMs achieve a task success rate of only 15.79%, highlighting challenges in cross-file editing, codebase understanding, and tool calling. These findings indicate that LLMs are not yet ready for fully NL-driven no-code development. NoCode-bench lays the foundation for future advances in this area.
自然语言驱动的无代码开发使用户能够使用自然语言(NL)而不是编辑源代码来指定软件功能,从而有望提高生产率和民主化发展。大型语言模型(LLMs)显示了促成这一模式的潜力。在这方面,软件文件作为NL规格的功能规格。这项工作引入了NoCode-bench,这是一个基准,旨在评价现实世界NL驱动的特性添加任务中的LLMs,由10个项目中的634项任务和114k代码变化组成。每个任务对口文件更新了相应的代码执行,并得到了开发者编写的测试案例的验证。114个高品质、人文验证的实例之一,NoCode-bench Verized,确保了可靠的评价。我们的实验表明,尽管有很高的象征性使用,但最佳LLMs只取得了15.79%的任务成功率,突出了跨文件编辑、代码库理解和号召工具方面的挑战。这些研究结果表明LLMs尚未准备好完全NL驱动的无代码开发。NoCode-Bench为这一领域未来进展打下的基础。
Article 18
Title@2025-07-24 (4): OrQstrator: An AI-Powered Framework for Advanced Quantum Circuit Optimization
Title: OrQstrator: An AI-Powered Framework for Advanced Quantum Circuit Optimization | OrQstrator: Ein KI-Powered-Framework für erweiterte Quantenschaltungsoptimierung | Orstrator: AI授权的高级量子电路优化框架 2507.09682v2 |
Authors (2): Laura Baird, Armin Moin
We propose a novel approach, OrQstrator, which is a modular framework for conducting quantum circuit optimization in the Noisy Intermediate-Scale Quantum (NISQ) era. Our framework is powered by Deep Reinforcement Learning (DRL). Our orchestration engine intelligently selects among three complementary circuit optimizers: A DRL-based circuit rewriter trained to reduce depth and gate count via learned rewrite sequences; a domain-specific optimizer that performs efficient local gate resynthesis and numeric optimization; a parameterized circuit instantiator that improves compilation by optimizing template circuits during gate set translation. These modules are coordinated by a central orchestration engine that learns coordination policies based on circuit structure, hardware constraints, and backend-aware performance features such as gate count, depth, and expected fidelity. The system outputs an optimized circuit for hardware-aware transpilation and execution, leveraging techniques from an existing state-of-the-art approach, called the NISQ Analyzer, to adapt to backend constraints.
我们提出了一个新颖的方法,即OrQstrator,这是在Noisy中级量子(NISQ)时代进行量子电路优化的模块化框架。我们的框架由深强化学习(DRL)提供动力。我们的管弦引擎明智地在三个互补的电路优化器中选择:一个基于DRL的电路再编,通过学习的重写序列来降低深度和门数;一个特定域的优化器,运行高效的本地门再合成和数字优化;一个参数化电路即时器,通过优化门置翻译过程中的模板电路来改进编译。这些模块由中央管弦机协调,该机学习基于电路结构、硬件限制和后端识性能(如门数、深度和预期的忠诚)的协调政策。这个系统输出一种优化的硬件觉变换和执行的电路,利用现有状态方法(称为NISQAnalyzer)的技术,以适应后端限制。
Article 19
Title@2025-07-24 (4): Understanding the Supply Chain and Risks of Large Language Model Applications
Title: Understanding the Supply Chain and Risks of Large Language Model Applications | Verständnis der Supply Chain und Risiken von Großsprachenmodellanwendungen | 了解供应链和大语言模式应用的风险 2507.18105v1 |
Authors (7): Yujie Ma, Lili Quan, Xiaofei Xie, Qiang Hu, Jiongchi Yu, Yao Zhang, Sen Chen
The rise of Large Language Models (LLMs) has led to the widespread deployment of LLM-based systems across diverse domains. As these systems proliferate, understanding the risks associated with their complex supply chains is increasingly important. LLM-based systems are not standalone as they rely on interconnected supply chains involving pretrained models, third-party libraries, datasets, and infrastructure. Yet, most risk assessments narrowly focus on model or data level, overlooking broader supply chain vulnerabilities. While recent studies have begun to address LLM supply chain risks, there remains a lack of benchmarks for systematic research. To address this gap, we introduce the first comprehensive dataset for analyzing and benchmarking LLM supply chain security. We collect 3,859 real-world LLM applications and perform interdependency analysis, identifying 109,211 models, 2,474 datasets, and 9,862 libraries. We extract model fine-tuning paths, dataset reuse, and library reliance, mapping the ecosystem’s structure. To evaluate security, we gather 1,555 risk-related issues (50 for applications, 325 for models, 18 for datasets, and 1,229 for libraries) from public vulnerability databases. Using this dataset, we empirically analyze component dependencies and risks. Our findings reveal deeply nested dependencies in LLM applications and significant vulnerabilities across the supply chain, underscoring the need for comprehensive security analysis. We conclude with practical recommendations to guide researchers and developers toward safer, more trustworthy LLM-enabled systems.
大型语言模型(LLMS)的兴起导致以LLM为基础的系统在不同领域广泛部署,随着这些系统的扩散,了解与复杂的供应链安全相关的风险变得日益重要。LLM系统并不独立,因为它们依赖由事先培训的模式、第三方图书馆、数据集和基础设施等组成的相互关联的供应链;然而,大多数风险评估都狭隘地侧重于模型或数据层面,忽视了更广泛的供应链脆弱性。虽然最近的研究已经开始解决LLM供应链风险,但仍缺乏系统研究的基准。为了弥补这一差距,我们推出了第一个综合数据集,用于分析和衡量LLM供应链安全基准。我们收集了3 859个真实的LLM应用程序,并进行了相互依存性分析,确定了109 211个模型、2 474个数据集和9 862个图书馆。我们从模型中提取了微调路径、数据集再利用和图书馆依赖性,对生态系统结构进行了测绘。为了评估安全,我们收集了1 555个与风险相关的问题-50个应用系统,325个模型,18个数据集,以及1 229个图书馆从公共脆弱性数据库中收集了1 859个应用程序,并进行了相互依存性分析。我们通过这一数据链中的重要数据分析和分析。
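The "deeply nested dependencies" finding can be made concrete with a small graph traversal that measures the longest dependency chain below an application. This is a minimal sketch, assuming the supply chain is available as an adjacency mapping; the component names are invented and do not come from the dataset.

```python
def dependency_depth(graph, root):
    """Length of the longest dependency chain reachable from `root`.

    `graph` maps a component to the components it depends on. Edges that
    would revisit a node already on the current path are ignored, so cyclic
    inputs still terminate.
    """
    def depth(node, path):
        children = [c for c in graph.get(node, ()) if c not in path]
        if not children:
            return 0
        return 1 + max(depth(c, path | {c}) for c in children)

    return depth(root, frozenset({root}))


# Hypothetical supply chain: an app relies on a fine-tuned model and a library;
# the fine-tuned model derives from a base model trained on a dataset.
supply_chain = {
    "app": ["finetuned-model", "lib-a"],
    "finetuned-model": ["base-model"],
    "base-model": ["dataset-x"],
    "lib-a": [],
}
nesting = dependency_depth(supply_chain, "app")
```

The exhaustive path search is exponential in the worst case, so it suits small illustrative graphs rather than an ecosystem with over a hundred thousand components.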
Article 20
Title@2025-07-24 (4): Identifier Name Similarities: An Exploratory Study
Title: Identifier Name Similarities: An Exploratory Study | Identifier Name Ähnlichkeiten: Eine Sondierungsstudie | 说明性名称 相似点:探索性研究 2507.18081v1 |
Authors (5): Carol Wong, Mai Abe, Silvia De Benedictis, Marissa Halim, Anthony Peruma
Identifier names, which comprise a significant portion of the codebase, are the cornerstone of effective program comprehension. However, research has shown that poorly chosen names can significantly increase cognitive load and hinder collaboration. Even names that appear readable in isolation may lead to misunderstandings in contexts when they closely resemble other names in either structure or functionality. In this exploratory study, we present our preliminary findings on the occurrence of identifier name similarity in software projects through the development of a taxonomy that categorizes different forms of identifier name similarity. We envision our initial taxonomy providing researchers with a platform to analyze and evaluate the impact of identifier name similarity on code comprehension, maintainability, and collaboration among developers, while also allowing for further refinement and expansion of the taxonomy.
由代码库相当一部分组成的识别名称是有效程序理解的基石。然而,研究表明,选择不当的名称会大大增加认知负荷,妨碍协作。即使孤立地看似可读的名称也可能在与结构或功能中的其他名称非常相似的情况下导致误解。在这项探索性研究中,我们介绍了关于软件项目存在识别名称相似性的初步调查结果,其方法是开发一种分类法,对不同形式的识别名称相似性进行分类。我们设想我们的初始分类法为研究人员提供一个平台,用以分析和评估识别名称相似性对代码理解、可维护性以及开发商之间合作的影响,同时允许进一步细化和扩大分类法。
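One simple way to surface structurally similar identifier names is a pairwise string-similarity scan. The sketch below uses Python's standard `difflib.SequenceMatcher`; the 0.8 threshold and the sample names are arbitrary assumptions, and the approach captures only surface similarity, not the similarity in functionality that the abstract also mentions.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similar_identifier_pairs(names, threshold=0.8):
    """Return (name_a, name_b, ratio) for pairs whose similarity meets the threshold."""
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        # Case-insensitive comparison so naming-style differences count less.
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 2)))
    return pairs


# Sample identifiers, invented for illustration.
names = ["userCount", "usersCount", "index", "totalPrice", "userCount"]
pairs = similar_identifier_pairs(names)
```

Here `userCount` and `usersCount` differ by a single character and would be flagged, while the unrelated names fall well below the threshold.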
Article 21
Title@2025-07-24 (4): An Empirical Study of Complexity, Heterogeneity, and Compliance of GitHub Actions Workflows
Title: An Empirical Study of Complexity, Heterogeneity, and Compliance of GitHub Actions Workflows | Eine empirische Studie über Komplexität, Heterogenität und Compliance von GitHub-Maßnahmen | 关于 “ 吉特胡布行动 “ 的复杂性、异质性和合规性的经验研究 2507.18062v1 |
Authors (2): Edward Abrokwah, Taher A. Ghaleb
Continuous Integration (CI) has evolved from a tooling strategy to a fundamental mindset in modern CI engineering. It enables teams to develop, test, and deliver software rapidly and collaboratively. Among CI services, GitHub Actions (GHA) has emerged as a dominant service due to its deep integration with GitHub and a vast ecosystem of reusable workflow actions. Although GHA provides official documentation and community-supported best practices, there appears to be limited empirical understanding of how open-source real-world CI workflows align with such practices. Many workflows might be unnecessarily complex and not aligned with the simplicity goals of CI practices. This study will investigate the structure, complexity, heterogeneity, and compliance of GHA workflows in open-source software repositories. Using a large dataset of GHA workflows from Java, Python, and C++ repositories, our goal is to (a) identify workflow complexities, (b) analyze recurring and heterogeneous structuring patterns, (c) assess compliance with GHA best practices, and (d) uncover differences in CI pipeline design across programming languages. Our findings are expected to reveal both areas of strong adherence to best practices and areas for improvement where needed. These insights will also have implications for CI services, as they will highlight the need for clearer guidelines and comprehensive examples in CI documentation.
持续整合(CI)已经从工具战略发展到现代CI工程的基本思维,使团队能够迅速合作开发、测试和提供软件。在CI服务中,GitHub Action(GHA)由于与GitHub的深度整合和大量可再利用工作流程行动的生态系统而成为一个主导服务。虽然GHA提供了正式文件和社区支持的最佳做法,但对于开放源码真实世界CI工作流程如何与这些做法保持一致,经验上的理解似乎有限。许多工作流程可能不必要地复杂,不符合CI做法的简单目标。这项研究将调查GHA工作流程的结构、复杂性、异质性和在公开源软件库中的合规性。利用来自Java、Python和C+++的GHA工作流程的大量数据集,我们的目标是:(a) 查明工作流程的复杂性,(b) 分析经常性和混杂的结构模式;(c) 评估对GHA最佳做法的遵守情况,以及(d) 发现CI编程中各语文设计的差异。我们的调查结果将揭示在哪些领域严格遵守CIA工作流程方面的最佳做法,以及哪些领域也需要改进CIA的最佳做法。
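A few of the structural and compliance measurements this study plans could be computed along the following lines. This is a sketch, not the authors' tooling: it assumes the workflow YAML has already been parsed into a dictionary, the metric names are invented, and the only compliance check shown is whether actions are pinned to a full-length commit SHA.

```python
import re

# A `uses:` reference pinned to a full 40-character commit SHA.
SHA_PIN = re.compile(r"@[0-9a-f]{40}$")


def workflow_metrics(workflow):
    """Count jobs, steps, and action references in a parsed workflow dict."""
    jobs = workflow.get("jobs", {})
    steps = [step for job in jobs.values() for step in job.get("steps", [])]
    uses = [step["uses"] for step in steps if "uses" in step]
    return {
        "jobs": len(jobs),
        "steps": len(steps),
        "actions": len(uses),
        # Local actions such as "./path" would also land here in this sketch.
        "unpinned_actions": [u for u in uses if not SHA_PIN.search(u)],
    }


# Toy workflow; the 40-character SHA below is a placeholder, not a real commit.
workflow = {
    "name": "ci",
    "jobs": {
        "build": {"steps": [
            {"uses": "actions/checkout@v4"},
            {"run": "mvn -B verify"},
        ]},
        "lint": {"steps": [
            {"uses": "actions/checkout@0123456789abcdef0123456789abcdef01234567"},
            {"run": "ruff check ."},
        ]},
    },
}
metrics = workflow_metrics(workflow)
```

Counts like these give a crude complexity profile per repository, which could then be compared across the Java, Python, and C++ samples.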
Article 22
Title@2025-07-24 (4): SAVANT: Vulnerability Detection in Application Dependencies through Semantic-Guided Reachability Analysis
Title: SAVANT: Vulnerability Detection in Application Dependencies through Semantic-Guided Reachability Analysis | SAVANT: Sicherheitserkennung in Anwendungsabhängigkeiten durch Semantik-geführte Reichweitenanalyse | SAVANT: 通过语义辅助控制可达性分析,在应用依赖性中发现脆弱性 2506.17798v2 |
Authors (7): Wang Lingxiang, Quanzhi Fu, Wenjia Song, Gelei Deng, Yi Liu, Dan Williams, Ying Zhang
The integration of open-source third-party library dependencies in Java development introduces significant security risks when these libraries contain known vulnerabilities. Existing Software Composition Analysis (SCA) tools struggle to effectively detect vulnerable API usage from these libraries due to limitations in understanding API usage semantics and computational challenges in analyzing complex codebases, leading to inaccurate vulnerability alerts that burden development teams and delay critical security fixes. To address these challenges, we proposed SAVANT by leveraging two insights: proof-of-vulnerability test cases demonstrate how vulnerabilities can be triggered in specific contexts, and Large Language Models (LLMs) can understand code semantics. SAVANT combines semantic preprocessing with LLM-powered context analysis for accurate vulnerability detection. SAVANT first segments source code into meaningful blocks while preserving semantic relationships, then leverages LLM-based reflection to analyze API usage context and determine actual vulnerability impacts. Our evaluation on 55 real-world applications shows that SAVANT achieves 83.8% precision, 73.8% recall, 69.0% accuracy, and 78.5% F1-score, outperforming state-of-the-art SCA tools.
现有软件构成分析(SCA)工具在有效检测这些图书馆的脆弱API使用情况方面挣扎着。 SAVANT将精密的脆弱性检测与LLM驱动的背景分析相结合。 SAVANT将精密的语义预处理与LLOM驱动的背景分析相结合。 SAVANT的首部分源代码在保留语义关系的同时,将有意义的区块纳入到有意义的区块中,然后利用基于LLAM的思考来分析API的使用背景并确定实际的脆弱性影响。我们对55个实际应用软件的评估表明,SAVANT实现了83.8%的精确度,73.8%的回顾,69.0%的精确度和78.5%的F1核心,高于艺术的状态工具。
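The reported F1-score can be cross-checked from the reported precision and recall, since F1 is their harmonic mean. The helper below is a generic formula check, not code from SAVANT.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, in percent.
f1 = f1_score(83.8, 73.8)
print(f"{f1:.1f}")  # prints 78.5
```

The result agrees with the 78.5% F1-score stated in the abstract.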
Article 23
Title@2025-07-24 (4): Factors Impacting Faculty Adoption of Project-Based Learning in Computing Education: a Survey
Title: Factors Impacting Faculty Adoption of Project-Based Learning in Computing Education: a Survey | Faktoren, die die Fakultät beeinflussen Adoption des projektbasierten Lernens in der Computerausbildung: eine Umfrage | 影响学院在计算机教育中采用基于项目学习:调查 2507.18039v1 |
Authors (3): Ahmad D. Suleiman, Yiming Tang, Daqing Hou
This research full paper investigates the factors influencing computing educators’ adoption of project-based learning (PjBL) in software engineering and computing curricula. Recognized as a student-centered pedagogical approach, PjBL has the potential to enhance student motivation, engagement, critical thinking, collaboration, and problem-solving skills. Despite these benefits, faculty adoption remains inconsistent due to challenges such as insufficient institutional support, time constraints, limited training opportunities, designing or sourcing projects, and aligning them with course objectives. This research explores these barriers and investigates the strategies and resources that facilitate a successful adoption. Using a mixed-methods approach, data from 80 computing faculty were collected through an online survey comprising closed-ended questions to quantify barriers, enablers, and resource needs, along with an open-ended question to gather qualitative insights. Quantitative data were analyzed using statistical methods, while qualitative responses underwent thematic analysis. Results reveal that while PjBL is widely valued, its adoption is often selective and impacted by challenges in planning and managing the learning process, designing suitable projects, and a lack of institutional support, such as time, funding, and teaching assistants. Faculty are more likely to adopt or sustain PjBL when they have access to peer collaboration, professional development, and institutional incentives. In addition, sourcing projects from research, industry partnerships, and borrowing from peers emerged as key facilitators for new projects. These findings underscore the need for systemic support structures to empower faculty to experiment with and scale PjBL practices.
这份完整的研究论文调查了影响计算教育者在软件工程和计算课程中采用基于项目学习(PjBL)的因素。作为以学生为中心的教学方法,PjBL具有提高学生动力、参与、批判性思维、协作和解决问题技能的潜力。尽管有这些好处,但是由于体制支持不足、时间限制、培训机会有限、设计或外包项目以及使其与课程目标保持一致等挑战,教师的采用仍然不一致。这项研究探索了这些障碍,并调查了促进成功采用的各种战略和资源。利用混合方法方法,通过在线调查收集了80个计算师的数据,其中包括一些封闭的问题,以量化障碍、扶持人员和资源需求,以及收集定性见解的开放问题。尽管有这些好处,但是由于机构支持不足、时间限制、培训机会有限、设计或采购项目设计与课程目标相协调,因此其采用往往具有选择性,并受到在规划和管理过程、设计适当的项目以及缺乏机构支持的影响,例如时间、供资和教学助理等。学院更可能采用统计方法分析定量数据,同时使用定性数据,同时进行定性分析,同时对质量分析。结果分析。结果显示,尽管PB项目获得或持续进行机构化项目,但是,它们需要从获得或学习周期性项目获得新的研究。
Article 24
Title@2025-07-24 (4): Your ATs to Ts: MITRE ATT&CK Attack Technique to P-SSCRM Task Mapping
Title: Your ATs to Ts: MITRE ATT&CK Attack Technique to P-SSCRM Task Mapping | Ihre ATs zu Ts: MITRE ATT&CK Angriffstechnik zu P-SSCRM Task Mapping | 您的ATs to Ts: MITRE ATT和CK 攻击技术到 P-SSCRM任务绘图 2507.18037v1 |
Authors (5): Sivana Hamer, Jacob Bowen, Md Nazmul Haque, Chris Madden, Laurie Williams
The MITRE Adversarial Tactics, Techniques and Common Knowledge (MITRE ATT&CK) Attack Technique to Proactive Software Supply Chain Risk Management Framework (P-SSCRM) Task mapping described in this document helps software organizations to determine how different tasks mitigate the attack techniques of software supply chain attacks. The mapping was created through four independent strategies to find agreed-upon mappings. Because each P-SSCRM task is mapped to one or more tasks from the 10 frameworks, the mapping we provide is also a mapping between MITRE ATT&CK and other prominent government and industry frameworks.
本文件描述的MITRE Adversarial 战术、技术和共同知识(MITRE ATT和CK)对主动软件供应链风险管理框架(P-SSCRM)的进攻技术任务绘图,帮助软件组织确定不同任务如何减轻软件供应链攻击的攻击技术。该绘图是通过四个独立战略建立的,以寻找商定的地图绘制。由于P-SSCRM的每项任务都按照10个框架的一项或多项任务绘制,我们提供的地图也是MITRE ATT和CK与其他知名政府和工业框架之间的地图绘制。
Article 25
Title@2025-07-24 (4): An Empirical Study of GenAI Adoption in Open-Source Game Development: Tools, Tasks, and Developer Challenges
Title: An Empirical Study of GenAI Adoption in Open-Source Game Development: Tools, Tasks, and Developer Challenges | Eine empirische Studie zur GenAI-Adoption in der Open-Source-Spielentwicklung: Werkzeuge, Aufgaben und Entwickler-Herausforderungen | GENAI采用开放源码游戏开发的经验研究:工具、任务和开发者的挑战 2507.18029v1 |
Authors (4): Xiang Echo Chen, Wenhan Zhu, Guoshuai Albert Shi, Michael W. Godfrey
The growing capabilities of generative AI (GenAI) have begun to reshape how games are designed and developed, offering new tools for content creation, gameplay simulation, and design ideation. While prior research has explored traditional uses of AI in games, such as controlling agents or generating procedural content, there is limited empirical understanding of how GenAI is adopted by developers in real-world contexts, especially within the open-source community. This study aims to explore how GenAI technologies are discussed, adopted, and integrated into open-source game development by analyzing issue discussions on GitHub. We investigate the tools, tasks, and challenges associated with GenAI by comparing GenAI-related issues to those involving traditional AI (TradAI) and NonAI topics. Our goal is to uncover how GenAI differs from other approaches in terms of usage patterns, developer concerns, and integration practices. To address this objective, we construct a dataset of open-source game repositories that discuss AI-related topics. We apply open card sorting and thematic analysis to a stratified sample of GitHub issues, labelling each by type and content. These annotations enable comparative analysis across GenAI, TradAI, and NonAI groups, and provide insight into how GenAI is shaping the workflows and pain points of open-source game developers.
虽然先前的研究探索了在游戏中传统使用AI的方法,例如控制剂或产生程序内容。对于GenAI如何被开发者,特别是在开放源码界中如何在现实世界环境中采用GenAI, 经验上了解有限。本研究的目的是通过分析GitHub问题的讨论,探讨如何讨论、采用GenAI技术并将其纳入开放源码游戏开发。我们调查与GenAI有关的工具、任务和挑战,将GenAI相关问题与传统AI(TradAI)和NonAI专题相比较。我们的目标是发现GenAI如何在使用模式、开发者关切和整合做法方面与其他方法不同。为了实现这一目标,我们建立一个公开源码游戏储存库数据集,讨论与AI有关的专题。我们用公开的卡分解和专题分析方法,对GitHub问题进行分类,按类型和内容进行标注。这些说明有助于在GenAI、TradAI和NonAI的工作流程中进行对比分析。
Article 26
Title@2025-07-23 (3): Use as Directed? A Comparison of Software Tools Intended to Check Rigor and Transparency of Published Work
Title: Use as Directed? A Comparison of Software Tools Intended to Check Rigor and Transparency of Published Work | Ein Vergleich von Software-Tools zur Überprüfung von Strenge und Transparenz der veröffentlichten Arbeit | 用于核对所公布工作的定调和透明度的软件工具比较 2507.17991v1 |
Authors (20): Peter Eckmann, Adrian Barnett, Alexandra Bannach-Brown, Elisa Pilar Bascunan Atria, Guillaume Cabanac, Louise Delwen Owen Franzen, Małgorzata Anna Gazda, Kaitlyn Hair, James Howison, Halil Kilicoglu, Cyril Labbe, Sarah McCann, Vladislav Nachev, Martijn Roelandse, Maia Salholz-Hillel, Robert Schulz, Gerben ter Riet, Colby Vorland, Anita Bandrowski, Tracey Weissgerber
The causes of the reproducibility crisis include lack of standardization and transparency in scientific reporting. Checklists such as ARRIVE and CONSORT seek to improve transparency, but they are not always followed by authors, and peer review often fails to identify missing items. To address these issues, several automated tools have been designed to check different rigor criteria. We have conducted a broad comparison of 11 automated tools across 9 different rigor criteria from the ScreenIT group. For some criteria, including detecting open data, the comparison showed a clear winner: one tool that performed much better than the others. In other cases, including detection of inclusion and exclusion criteria, the combination of tools exceeded the performance of any one tool. We also identified key areas where tool developers should focus their effort to make their tool maximally useful. We conclude with a set of insights and recommendations for stakeholders in the development of rigor and transparency detection tools. The code and data for the study are available at https://github.com/PeterEckmann1/tool-comparison.
造成可重复性危机的原因包括科学报告缺乏标准化和透明度。ARRIVE和CONSORT等核对表力求提高透明度,但作者并不总是遵循这些核对表,同行评审也往往未能查明缺失的项目。为解决这些问题,已设计了若干自动工具,以检查不同的严谨性标准。我们广泛比较了ScreenIT组的11项自动工具在9项不同严谨性标准上的表现。我们发现在某些标准上,包括检测开放数据,各种工具的组合显示有一个明显的胜者,这一工具比其他工具效果好得多。在另一些情况下,包括检测纳入和排除标准,各种工具的组合超过了任何单一工具的性能。我们还确定了工具开发者应集中努力使其工具发挥最大效用的关键领域。我们最后为严谨性和透明度检测工具开发中的利益攸关方提供了一套见解和建议。该研究的代码和数据可在https://github.com/PeterEckmann1/tool-comparison查阅。
Article 27
Title@2025-07-23 (3): muRelBench: MicroBenchmarks for Zonotope Domains
Title: muRelBench: MicroBenchmarks for Zonotope Domains | muRelBench: MicroBenchmarks für Zonotope-Domains | muRelBench:Zonotope 域的微型基准 2404.16243v2 |
Authors (2): Kenny Ballou, Elena Sherman
We present muRelBench, a framework for synthetic benchmarks for weakly-relational abstract domains and their operations. This extensible microbenchmarking framework enables researchers to experimentally evaluate proposed algorithms for operations on numerical abstract domains, such as closure, least upper bound, and forget, enabling them to quickly prototype and validate performance improvements before considering more intensive experimentation. Additionally, the framework provides mechanisms for checking correctness properties for each of the benchmarks to ensure correctness within the synthetic benchmarks.
我们提出 muRelBench, 一个面向弱关系抽象域及其操作的合成基准框架。 这个可扩展的微基准框架使研究人员能够通过实验评估针对数值抽象域操作(如闭包、最小上界和遗忘)提出的算法, 使他们能够在考虑更密集的实验之前快速进行原型设计并验证性能改进。 此外, 该框架提供了检查每项基准正确性属性的机制, 以确保合成基准内的正确性。
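To make the closure operation mentioned in the abstract concrete: for a zone-style weakly-relational domain stored as a difference-bound matrix, closure is commonly computed as an all-pairs shortest-path pass. The sketch below is purely illustrative. The matrix convention and the name `close_dbm` are assumptions and are not taken from muRelBench, whose API and implementation language the abstract does not describe.

```python
import math


def close_dbm(m):
    """Return the shortest-path closure of a difference-bound matrix.

    Assumed convention: m[i][j] = c encodes the constraint x_j - x_i <= c,
    and math.inf means "no constraint". Returns None when the constraints
    are unsatisfiable (a negative cycle exists).
    """
    n = len(m)
    d = [row[:] for row in m]  # work on a copy, leave the input untouched
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Tighten the bound on x_j - x_i by going through x_k.
                via_k = d[i][k] + d[k][j]
                if via_k < d[i][j]:
                    d[i][j] = via_k
    if any(d[i][i] < 0 for i in range(n)):
        return None
    return d


# x1 - x0 <= 3 and x2 - x1 <= 4 imply x2 - x0 <= 7.
zone = [
    [0, 3, math.inf],
    [math.inf, 0, 4],
    [math.inf, math.inf, 0],
]
closed = close_dbm(zone)
```

A microbenchmark for a faster closure algorithm would time it on many such synthetic matrices and compare its result against a reference like this one, which is the kind of correctness check the abstract alludes to.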
Article 28
Title@2025-07-23 (3): How Software Engineers Engage with AI: A Pragmatic Process Model and Decision Framework Grounded in Industry Observations
Title: How Software Engineers Engage with AI: A Pragmatic Process Model and Decision Framework Grounded in Industry Observations | Wie sich Software-Ingenieure mit KI beschäftigen: Ein Pragmatisches Prozessmodell und Entscheidungsrahmen, der in Industriebeobachtungen begründet ist | 软件工程师如何与AI接触:一个以工业观测为基础的实用过程模型和决定框架 2507.17930v1 |
Authors (2): Vahid Garousi, Zafar Jafarov
Artificial Intelligence (AI) has the potential to transform Software Engineering (SE) by enhancing productivity, efficiency, and decision support. Tools like GitHub Copilot and ChatGPT have given rise to “vibe coding”, an exploratory, prompt-driven development style. Yet, how software engineers engage with these tools in daily tasks, especially in deciding whether to trust, refine, or reject AI-generated outputs, remains underexplored. This paper presents two complementary contributions. The first is a pragmatic process model capturing real-world AI-assisted SE activities, including prompt design, inspection, fallback, and refinement. The second is a 2D decision framework that could help developers reason about trade-offs between effort saved and output quality. Grounded in practitioner reports and direct observations in three industry settings across Turkiye and Azerbaijan, our work illustrates how engineers navigate AI use with human oversight. These models offer structured, lightweight guidance to support more deliberate and effective use of AI tools in SE, contributing to ongoing discussions on practical human-AI collaboration.
人工智能(AI)有可能通过提高生产力、效率和决策支持来改变软件工程(SE),GitHub Copilot和ChatGPT等工具已经产生了“虚拟编码”的探索性、迅速驱动的发展风格。然而,软件工程师如何在日常工作中,特别是在决定是否信任、改进或拒绝AI产出方面,与这些工具打交道,仍然没有得到充分探讨。本文件介绍了两项补充性贡献。首先,一个实用的过程模型,捕捉现实世界的AI协助的SE活动,包括迅速设计、检查、后退和完善。第二,一个二维决定框架,可以帮助开发者了解节省的努力与产出质量之间的取舍。根据从业人员报告和直接观察,我们的工作在突尔基耶和阿塞拜疆三个行业环境中展示了工程师如何在人类监督下操作AI的使用。这些模型提供了结构化的、轻量级指导,以支持在SE更审慎和有效地使用AI工具,有助于正在进行的关于实际的人类-AI合作的讨论。
Article 29
Title@2025-07-23 (3): Educational Insights from Code: Mapping Learning Challenges in Object-Oriented Programming through Code-Based Evidence
Title: Educational Insights from Code: Mapping Learning Challenges in Object-Oriented Programming through Code-Based Evidence | Bildungsinsights from Code: Mapping Lernherausforderungen in objektorientierter Programmierung durch Code-basierte Evidenz | 从《守则教育观点》中得出的教育观点:通过《守则证据》确定以目标为导向的方案拟订中的学习挑战 2507.17743v1 |
Authors (2): Andre Menolli, Bruno Strik
Object-Oriented programming is frequently challenging for undergraduate Computer Science students, particularly in understanding abstract concepts such as encapsulation, inheritance, and polymorphism. Although the literature outlines various methods to identify potential design and coding issues in object-oriented programming through source code analysis, such as code smells and SOLID principles, few studies explore how these code-level issues relate to learning difficulties in Object-Oriented Programming. In this study, we explore the relationship of the code issue indicators with common challenges encountered during the learning of object-oriented programming. Using qualitative analysis, we identified the main categories of learning difficulties and, through a literature review, established connections between these difficulties, code smells, and violations of the SOLID principles. As a result, we developed a conceptual map that links code-related issues to specific learning challenges in Object-Oriented Programming. The model was then evaluated by an expert who applied it in the analysis of the student code to assess its relevance and applicability in educational contexts.
虽然文献通过源代码分析,如代码气味和SOLID原则,概述了在面向目标的方案编制中确定潜在设计和编码问题的各种方法,但很少有研究探讨这些代码层面的问题如何与面向目标的方案编制过程中的学习困难有关。在这项研究中,我们探讨了代码问题指标与学习面向目标的方案编制过程中遇到的共同挑战之间的关系。我们通过定性分析,确定了学习困难的主要类别,并通过文献审查,确定了这些困难、代码气味和违反SOLID原则之间的内在联系。结果,我们制定了一个概念图,将代码相关问题与面向目标的方案编制过程中的具体学习挑战联系起来。然后,一位专家对模型进行了评价,他在分析学生代码时运用了该模型来评估其在教育环境中的相关性和适用性。
Article 30
Title@2025-07-23 (3): CASCADE: LLM-Powered JavaScript Deobfuscator at Google
Title: CASCADE: LLM-Powered JavaScript Deobfuscator at Google | CASCADE: LLM-Powered JavaScript Deobfuscator bei Google | CASCADE: 谷歌的LLM Powered JavaScript Deobfuscator 2507.17691v1 |
Authors (4): Shan Jiang, Pranoy Kovuri, David Tao, Zhixun Tan
Software obfuscation, particularly prevalent in JavaScript, hinders code comprehension and analysis, posing significant challenges to software testing, static analysis, and malware detection. This paper introduces CASCADE, a novel hybrid approach that integrates the advanced coding capabilities of Gemini with the deterministic transformation capabilities of a compiler Intermediate Representation (IR), specifically JavaScript IR (JSIR). By employing Gemini to identify critical prelude functions, the foundational components underlying the most prevalent obfuscation techniques, and leveraging JSIR for subsequent code transformations, CASCADE effectively recovers semantic elements like original strings and API names, and reveals original program behaviors. This method overcomes limitations of existing static and dynamic deobfuscation techniques, eliminating hundreds to thousands of hardcoded rules while achieving reliability and flexibility. CASCADE is already deployed in Google’s production environment, demonstrating substantial improvements in JavaScript deobfuscation efficiency and reducing reverse engineering efforts.
软件模糊化,特别是在爪哇史克里普特,阻碍代码理解和分析,对软件测试、静态分析和恶意检测构成重大挑战。本文介绍CASCADE,这是一种新型混合方法,将Gemini的先进编码能力与编译器中级代表(IR),特别是JavaScript IR(JSIR)的确定性转化能力相结合。通过使用Gemini来识别关键前端功能,即最流行的模糊化技术的基本组成部分,利用JSIR进行随后的代码转换,CASCADE有效地回收了原始字符串和API名称等语义元素,并揭示了原始程序行为。这种方法克服了现有静态和动态脱色技术的局限性,消除了数百至数千条硬编码规则,同时实现了可靠性和灵活性。CASCADE已经部署在谷歌的生产环境中,展示了JavaScript deobfuscation效率的大幅改进,并减少了逆向工程努力。
Article 31
Title@2025-07-23 (3): Contextual Code Retrieval for Commit Message Generation: A Preliminary Study
Title: Contextual Code Retrieval for Commit Message Generation: A Preliminary Study | Kontextcode-Retrieval für Commit Message Generation: Eine Vorstudie | 提交信件生成时的上下文代码检索:初步研究 2507.17690v1 |
Authors (4): Bo Xiong, Linghao Zhang, Chong Wang, Peng Liang
A commit message describes the main code changes in a commit and plays a crucial role in software maintenance. Existing commit message generation (CMG) approaches typically frame it as a direct mapping that takes a code diff as input and produces a brief descriptive sentence as output. However, we argue that relying solely on the code diff is insufficient, as a raw code diff fails to capture the full context needed for generating high-quality and informative commit messages. In this paper, we propose a contextual code retrieval-based method called C3Gen to enhance CMG by retrieving commit-relevant code snippets from the repository and incorporating them into the model input to provide richer contextual information at the repository scope. In the experiments, we evaluated the effectiveness of C3Gen across various models using four objective and three subjective metrics. Meanwhile, we design and conduct a human evaluation to investigate how C3Gen-generated commit messages are perceived by human developers. The results show that by incorporating contextual code into the input, C3Gen enables models to effectively leverage additional information to generate more comprehensive and informative commit messages with greater practical value in real-world development scenarios. Further analysis underscores concerns about the reliability of similarity-based metrics and provides empirical insights for CMG.
承诺信息描述承诺书中的主要代码变化,并在软件维护中起到关键作用。 现有的承诺生成信息( CMG) 方法通常将它作为直接映射,输入代码 diff 并生成简短的描述性句子作为输出。 然而,我们认为,仅仅依赖代码 diff 是不够的, 因为原始代码 diff 无法捕捉生成高质量和内容丰富的承诺信息所需的全部背景。 在本文中, 我们提议了一种基于背景代码的检索方法,称为 C3Gen , 以通过从存储处检索与承诺相关的代码片块, 并将其纳入模型输入, 以在存储处范围提供更丰富的背景信息。 在实验中, 我们利用四种客观和三种主观的衡量标准评估了C3Gen 在不同模型中的有效性。 同时, 我们设计并开展一项人类评估, 以调查人类开发者如何看待 C3Gen 生成的信息。 结果表明, 通过将背景代码纳入输入, C3Gen 使模型能够有效地利用更多信息, 生成更全面和知情的信息, 承诺在真实世界发展情景中具有更大的实际价值。 进一步分析强调了 。
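The retrieval idea in this abstract can be sketched as follows. The abstract does not say how C3Gen selects commit-relevant snippets, so this sketch swaps in a deliberately simple lexical heuristic (identifier overlap between the diff and each candidate snippet). The function names, the `snippets` mapping, and the ranking rule are all assumptions made for illustration.

```python
import re

IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def diff_identifiers(diff_text):
    """Collect identifiers from the added and removed lines of a unified diff."""
    ids = set()
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file headers, not code changes
        if line.startswith(("+", "-")):
            ids.update(IDENT.findall(line[1:]))
    return ids


def retrieve_context(diff_text, snippets, top_k=2):
    """Rank repository snippets by how many identifiers they share with the diff.

    `snippets` maps a (hypothetical) snippet name to its source text.
    Snippets with no overlap are dropped; ties are broken by name.
    """
    wanted = diff_identifiers(diff_text)
    scored = []
    for name, code in snippets.items():
        overlap = len(wanted & set(IDENT.findall(code)))
        if overlap:
            scored.append((overlap, name))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_k]]


diff = (
    "--- a/app.py\n"
    "+++ b/app.py\n"
    "@@\n"
    "-    total = price\n"
    "+    total = apply_discount(price, rate)\n"
)
repo = {
    "discount.py": "def apply_discount(price, rate):\n    return price * (1 - rate)",
    "log.py": "def log(msg):\n    print(msg)",
}
context = retrieve_context(diff, repo)
```

The retrieved snippets would then be concatenated with the diff in the model input, so the model can see what `apply_discount` does rather than only that it is now called.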
Article 32
Title@2025-07-23 (3): Making REST APIs Agent-Ready: From OpenAPI to Model Context Protocol Servers for Tool-Augmented LLMs
Title: Making REST APIs Agent-Ready: From OpenAPI to Model Context Protocol Servers for Tool-Augmented LLMs | REST APIs Agent-Ready erstellen: Von OpenAPI zu Model Context Protocol Server für Tool-Augmented LLMs | 制作REST APIs API Agent- Ready:从开放API到示范背景协议服务器,用于工具推荐LMM 2507.16044v2 |
Authors (3): Meriem Mastouri, Emna Ksontini, Wael Kessentini
Large Language Models (LLMs) are evolving from passive text generators into active agents that invoke external tools. To support this shift, scalable protocols for tool integration are essential. The Model Context Protocol (MCP), introduced by Anthropic in 2024, offers a schema-driven standard for dynamic tool discovery and invocation. Yet, building MCP servers remains manual and repetitive, requiring developers to write glue code, handle authentication, and configure schemas by hand, replicating much of the integration effort MCP aims to eliminate. This paper investigates whether MCP server construction can be meaningfully automated. We begin by analyzing adoption trends: among 22,000+ MCP-tagged GitHub repositories created within six months of release, fewer than 5% include servers, typically small, single-maintainer projects dominated by repetitive scaffolding. To address this gap, we present AutoMCP, a compiler that generates MCP servers from OpenAPI 2.0/3.0 specifications. AutoMCP parses REST API definitions and produces complete server implementations, including schema registration and authentication handling. We evaluate AutoMCP on 50 real-world APIs spanning 5,066 endpoints across over 10 domains. From a stratified sample of 1,023 tool calls, 76.5% succeeded out of the box. Manual failure analysis revealed five recurring issues, all attributable to inconsistencies or omissions in the OpenAPI contracts. After minor fixes, averaging 19 lines of spec changes per API, AutoMCP achieved 99.9% success. Our findings (i) analyze MCP adoption and quantify the cost of manual server development, (ii) demonstrate that OpenAPI specifications, despite quality issues, enable near-complete MCP server automation, and (iii) contribute a corpus of 5,066 callable tools along with insights on repairing common specification flaws.
大型语言模型(LLM)正在从被动的文本生成器演变为调用外部工具的主动代理。为支持这一转变,可扩展的工具集成协议至关重要。Anthropic于2024年推出的模型上下文协议(MCP)为动态工具发现和调用提供了一个模式驱动的标准。然而,构建MCP服务器仍然是手工且重复的工作,需要开发者手工编写胶水代码、处理认证并配置模式,这重复了MCP本想消除的大部分集成工作。本文研究MCP服务器的构建能否得到有意义的自动化。我们首先分析采用趋势:在发布后六个月内创建的22,000多个带MCP标签的GitHub仓库中,包含服务器的不到5%,且通常是以重复性脚手架代码为主的小型单一维护者项目。为弥补这一差距,我们提出AutoMCP,一个根据OpenAPI 2.0/3.0规范生成MCP服务器的编译器。AutoMCP解析REST API定义并生成完整的服务器实现,包括模式注册和认证处理。我们在50个真实世界的API上评估AutoMCP,这些API涵盖10多个领域的5,066个端点。在1,023次工具调用的分层样本中,76.5%开箱即用地成功。人工失败分析揭示了五个反复出现的问题,均可归因于OpenAPI契约中的不一致或遗漏。经过平均每个API 19行规范修改的小幅修复后,AutoMCP达到了99.9%的成功率。我们的研究结果(i)分析了MCP的采用情况并量化了手工开发服务器的成本,(ii)表明OpenAPI规范尽管存在质量问题,仍可实现近乎完全的MCP服务器自动化,(iii)贡献了一个包含5,066个可调用工具的语料库,以及修复常见规范缺陷的见解。
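The core translation step this abstract describes, turning OpenAPI operations into tool descriptors, can be sketched in a few lines. This is not AutoMCP's code and uses no MCP SDK: it works on a plain dictionary shaped like an OpenAPI 3.0 document and emits dictionaries with `name`, `description`, and `inputSchema` fields. The function name, the extra `http` field, and the fallback naming rule are assumptions, and request bodies, authentication, and `$ref` resolution are left out.

```python
HTTP_METHODS = {"get", "post", "put", "patch", "delete"}


def openapi_to_tools(spec):
    """Derive one tool descriptor per operation in an OpenAPI-style dict."""
    tools = []
    for path, item in spec.get("paths", {}).items():
        for method, op in item.items():
            if method not in HTTP_METHODS:
                continue  # skip non-operation keys such as "parameters"
            properties, required = {}, []
            for param in op.get("parameters", []):
                # Fall back to a string schema when the spec omits one.
                properties[param["name"]] = param.get("schema", {"type": "string"})
                if param.get("required"):
                    required.append(param["name"])
            fallback_name = method + "_" + path.strip("/").replace("/", "_")
            tools.append({
                "name": op.get("operationId") or fallback_name,
                "description": op.get("summary", ""),
                "inputSchema": {
                    "type": "object",
                    "properties": properties,
                    "required": required,
                },
                # Hypothetical extra field: where the server should forward the call.
                "http": {"method": method.upper(), "path": path},
            })
    return tools


spec = {
    "paths": {
        "/pets/{petId}": {
            "get": {
                "operationId": "getPet",
                "summary": "Fetch a pet",
                "parameters": [
                    {"name": "petId", "in": "path", "required": True,
                     "schema": {"type": "integer"}},
                ],
            }
        }
    }
}
tools = openapi_to_tools(spec)
```

The failure modes reported in the abstract (inconsistencies or omissions in the OpenAPI contract) map directly onto this step: a missing `schema` or a wrong `required` flag in the specification propagates into the generated tool definition.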
Article 33
Title@2025-07-23 (3): Rethinking HSM and TPM Security in the Cloud: Real-World Attacks and Next-Gen Defenses
Title: Rethinking HSM and TPM Security in the Cloud: Real-World Attacks and Next-Gen Defenses | HSM- und TPM-Sicherheit in der Cloud neu denken: Angriffe auf die Realwelt und Next-Gen-Verteidigungen | 重新思考云层中的HSM和TPM安全:真实世界攻击和下一代防卫 2507.17655v1 |
Authors (2): Shams Shaikh, Trima P. Fernandes e Fizardo
As organizations rapidly migrate to the cloud, the security of cryptographic key management has become a growing concern. Hardware Security Modules (HSMs) and Trusted Platform Modules (TPMs), traditionally seen as the gold standard for securing encryption keys and digital trust, are increasingly challenged by cloud-native threats. Real-world breaches have exposed weaknesses in cloud deployments, including misconfigurations, API abuse, and privilege escalations, allowing attackers to access sensitive key material and bypass protections. These incidents reveal that while the hardware remains secure, the surrounding cloud ecosystem introduces systemic vulnerabilities. This paper analyzes notable security failures involving HSMs and TPMs, identifies common attack vectors, and questions longstanding assumptions about their effectiveness in distributed environments. We explore alternative approaches such as confidential computing, post-quantum cryptography, and decentralized key management. Our findings highlight that while HSMs and TPMs still play a role, modern cloud security requires more adaptive, layered architectures. By evaluating both current weaknesses and emerging models, this research equips cloud architects and security engineers with strategies to reinforce cryptographic trust in the evolving threat landscape.
随着各组织迅速迁移到云层,加密钥匙管理的安全日益成为一个令人关切的问题。传统上被视为加密钥匙和数字信任金金标准的硬件安全模块(HSM)和信任平台模块(TPM)日益受到云端威胁的挑战。现实世界的破坏暴露出云层部署的弱点,包括配置错误、API滥用和特权升级,使袭击者能够获取敏感关键材料和绕行保护。这些事件表明,虽然硬件仍然安全,但周围云层生态系统带来了系统性的脆弱性。本文分析了涉及HSM和TPM的显著安全故障,确定了常见攻击矢量,并质疑其在分布环境中的有效性的长期假设。我们探索了保密计算、Quantum后加密和分散的关键管理等替代方法。我们的调查结果强调,虽然HSM和TPM仍然发挥作用,现代云层安全需要更具适应性、多层结构。通过评估当前的弱点和新出现的模式,这一研究使云层设计师和安全工程师掌握了在不断变化的威胁环境中加强加密信任的战略。
Article 34
Title@2025-07-23 (3): Closing the Chain: How to reduce your risk of being SolarWinds, Log4j, or XZ Utils
Title: Closing the Chain: How to reduce your risk of being SolarWinds, Log4j, or XZ Utils | Schließen der Kette: Wie reduzieren Sie Ihr Risiko, SolarWinds, Log4j oder XZ Utils zu sein | 关闭链链: 如何降低您成为太阳能窗口、 Log4j 或 XZ 工具的风险 2503.12192v2 |
Authors (6): Sivana Hamer, Jacob Bowen, Md Nazmul Haque, Robert Hines, Chris Madden, Laurie Williams
Software supply chain frameworks, such as the US NIST Secure Software Development Framework (SSDF), detail what tasks software development organizations are recommended or mandated to adopt to reduce security risk. However, to further reduce the risk of similar attacks occurring, software organizations benefit from knowing what tasks mitigate attack techniques the attackers are currently using to address specific threats, prioritize tasks, and close mitigation gaps. The goal of this study is to aid software organizations in reducing the risk of software supply chain attacks by systematically synthesizing how framework tasks mitigate the attack techniques used in the SolarWinds, Log4j, and XZ Utils attacks. We qualitatively analyzed 106 Cyber Threat Intelligence (CTI) reports of the 3 attacks to gather the attack techniques. We then systematically constructed a mapping between attack techniques and the 73 tasks enumerated in 10 software supply chain frameworks. Afterward, we established and ranked priority tasks that mitigate attack techniques. The three mitigation tasks with the highest scores are role-based access control, system monitoring, and boundary protection. Additionally, three mitigation tasks were missing from all ten frameworks, including sustainable open-source software and environmental scanning tools. Thus, software products would still be vulnerable to software supply chain attacks even if organizations adopted all recommended tasks.
软件供应链框架,如美国NIST安全软件开发框架(SSDF),详细说明了建议或授权软件开发组织采用何种任务来降低安全风险;然而,为了进一步降低发生类似袭击的风险,软件组织受益于了解袭击者目前正在使用何种任务来减轻攻击技术,以应对具体威胁,确定任务的优先次序,缩小缓解差距;本研究的目的是协助软件组织减少软件供应链袭击的风险,系统地综合框架任务如何减轻SolarWinds、Log4j和XZ Utils袭击中使用的攻击技术。我们从质量上分析了106份关于3次袭击的网络威胁情报(CTI)报告,以收集攻击技术。然后,我们系统地绘制了攻击技术与10项软件供应链框架中列出的73项任务之间的地图。之后,我们确定并排列了减缓攻击技术的优先任务。三项最高级的缓解任务是基于作用的出入控制、系统监测和边界保护。此外,所有10项框架都缺少三项减缓任务,包括可持续的开放软件和环境扫描工具。因此,即使所有建议的组织都采用,软件产品仍然易受软件供应链袭击。
Article 35
Title@2025-07-23 (3): CodeReasoner: Enhancing the Code Reasoning Ability with Reinforcement Learning
Title: CodeReasoner: Enhancing the Code Reasoning Ability with Reinforcement Learning | CodeReasoner: Verbesserung der Code-Reasoning-Fähigkeit mit Verstärkungs-Lernen | CodeReasoner:通过强化学习增强代码推理能力 2507.17548v1 |
Authors (5): Lingxiao Tang, He Ye, Zhongxin Liu, Xiaoxue Ren, Lingfeng Bao
Code reasoning is a fundamental capability for large language models (LLMs) in the code domain. It involves understanding and predicting a program’s execution behavior, such as determining the output for a given input or whether a specific statement will be executed. This capability is essential for downstream tasks like debugging, code generation, and program repair. Prior approaches mainly rely on supervised fine-tuning to improve performance in code reasoning tasks. However, they often show limited gains and fail to generalize across diverse scenarios. We argue this is due to two core issues: the low quality of training data and the limitations of supervised fine-tuning, which struggles to teach general reasoning skills. To address these challenges, we propose CodeReasoner, a framework that spans both dataset construction and a two-stage training process. First, we introduce a method to construct datasets that focus on the core execution logic of Python programs. Next, we apply instruction tuning to inject execution-specific knowledge distilled from a powerful teacher model. We then enhance reasoning and generalization through GRPO reinforcement learning on top of the fine-tuned model. Experiments on three widely-used code reasoning benchmarks show that CodeReasoner improves performance by 27.1% to 40.2% over prior methods using a 7B model. Notably, the 7B model matches GPT-4o on key tasks like input/output and coverage prediction. When scaled to 14B, CodeReasoner outperforms GPT-4o across all benchmarks. Ablation studies confirm the effectiveness of each training stage and highlight the importance of reasoning chains.
代码推理是大型语言模型(LLM)在代码领域的一项基本能力。它涉及理解和预测程序的执行行为,例如确定给定输入的输出,或某条特定语句是否会被执行。这种能力对于调试、代码生成和程序修复等下游任务至关重要。先前的方法主要依靠监督微调来提高代码推理任务的性能。然而,它们往往收益有限,并且无法在不同场景中泛化。我们认为这源于两个核心问题:训练数据质量低,以及监督微调难以教授通用推理技能的局限性。为应对这些挑战,我们提出CodeReasoner,一个涵盖数据集构建和两阶段训练过程的框架。首先,我们引入一种构建数据集的方法,聚焦于Python程序的核心执行逻辑。接着,我们采用指令微调,注入从强大教师模型中蒸馏出的执行相关知识。然后,我们在微调模型的基础上通过GRPO强化学习来增强推理和泛化能力。在三个广泛使用的代码推理基准上的实验表明,使用7B模型时,CodeReasoner比先前方法的性能提高了27.1%至40.2%。值得注意的是,7B模型在输入/输出预测和覆盖率预测等关键任务上与GPT-4o相当。扩展到14B时,CodeReasoner在所有基准上都优于GPT-4o。消融研究证实了每个训练阶段的有效性,并突出了推理链的重要性。
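For readers unfamiliar with GRPO, its central step is to score each sampled answer relative to the other answers drawn for the same prompt, rather than against a learned value function. The sketch below shows only that normalization step under stated assumptions: the exact-match `execution_reward` is a hypothetical reward for output-prediction tasks (the abstract does not specify CodeReasoner's reward), and the population standard deviation is one common choice.

```python
from statistics import mean, pstdev


def execution_reward(predicted_output, actual_output):
    """Hypothetical reward: 1.0 when the predicted program output matches exactly."""
    return 1.0 if predicted_output.strip() == actual_output.strip() else 0.0


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so answers
    better than the group average get a positive learning signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled predictions for a program whose true output is "6".
samples = ["6", "5", "7\n", " 6 "]
rewards = [execution_reward(s, "6") for s in samples]
advantages = group_relative_advantages(rewards)
```

When every sample in a group gets the same reward, all advantages are zero and the group contributes no gradient, which is one reason training data of suitable difficulty matters for this style of reinforcement learning.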
Article 36
Title@2025-07-23 (3): AssertFlip: Reproducing Bugs via Inversion of LLM-Generated Passing Tests
Title: AssertFlip: Reproducing Bugs via Inversion of LLM-Generated Passing Tests | AssertFlip: Fehler reproduzieren durch Inversion von LLM-Generated Passing Tests | AssertFlip: 通过反转 LLM 生成的通过测试来复现错误 2507.17542v1 |
Authors (3): Lara Khatib, Noble Saji Mathews, Meiyappan Nagappan
Bug reproduction is critical in the software debugging and repair process, yet the majority of bugs in open-source and industrial settings lack executable tests to reproduce them at the time they are reported, making diagnosis and resolution more difficult and time-consuming. To address this challenge, we introduce AssertFlip, a novel technique for automatically generating Bug Reproducible Tests (BRTs) using large language models (LLMs). Unlike existing methods that attempt direct generation of failing tests, AssertFlip first generates passing tests on the buggy behaviour and then inverts these tests to fail when the bug is present. We hypothesize that LLMs are better at writing passing tests than ones that crash or fail on purpose. Our results show that AssertFlip outperforms all known techniques in the leaderboard of SWT-Bench, a benchmark curated for BRTs. Specifically, AssertFlip achieves a fail-to-pass success rate of 43.6% on the SWT-Bench-Verified subset.
错误复现在软件调试和修复过程中至关重要, 但开放源码和工业环境中的大多数错误在报告时缺乏可执行的测试来复现它们, 使得诊断和解决更加困难和耗时。 为了应对这一挑战, 我们引入了 AssertFlip, 这是一种使用大型语言模型(LLM)自动生成错误复现测试(BRT)的新技术。 与尝试直接生成失败测试的现有方法不同, AssertFlip 首先针对错误行为生成通过的测试, 然后将这些测试反转, 使其在错误存在时失败。 我们假设LLM更擅长编写通过的测试, 而不是故意崩溃或失败的测试。 我们的结果显示, AssertFlip 在 SWT-Bench(一个为 BRT 整理的基准)的排行榜上优于所有已知技术。 具体而言, AssertFlip 在 SWT-Bench-Verified 子集上取得了43.6%的失败到通过(fail-to-pass)成功率。
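The pass-then-invert idea can be illustrated with a toy example. The abstract does not describe how AssertFlip rewrites assertions, so the sketch below substitutes the simplest possible inversion, a wrapper that fails exactly when the original test passes. The functions `buggy_mean` and `fixed_mean` and the bug itself are hypothetical.

```python
def buggy_mean(xs):
    # Hypothetical bug: integer division truncates the result.
    return sum(xs) // len(xs)


def fixed_mean(xs):
    return sum(xs) / len(xs)


def passing_test(mean_fn):
    # Step 1: a test that characterizes the buggy behaviour, so it passes
    # when run against the buggy code.
    assert mean_fn([1, 2]) == 1


def invert(test):
    """Step 2: build a test that fails while the buggy behaviour is present."""
    def inverted(mean_fn):
        try:
            test(mean_fn)
        except AssertionError:
            return  # the buggy behaviour is gone, so the inverted test passes
        raise AssertionError("bug reproduced: buggy behaviour still present")
    return inverted


def outcome(test, mean_fn):
    """Run a test and report 'pass' or 'fail' instead of raising."""
    try:
        test(mean_fn)
        return "pass"
    except AssertionError:
        return "fail"


bug_reproducing_test = invert(passing_test)
results = (
    outcome(passing_test, buggy_mean),
    outcome(bug_reproducing_test, buggy_mean),
    outcome(bug_reproducing_test, fixed_mean),
)
```

The last two outcomes are the fail-to-pass behaviour the abstract reports on: the inverted test fails on the buggy code and passes once the fix is applied.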
Article 37
Title@2025-07-23 (3): Enabling Cyber Security Education through Digital Twins and Generative AI
Title: Enabling Cyber Security Education through Digital Twins and Generative AI | Cyber Security Education durch digitale Zwillinge und generative KI ermöglichen | 通过数字孪生和生成式AI促进网络安全教育 2507.17518v1 |
Authors (6): Vita Santa Barletta, Vito Bavaro, Miriana Calvano, Antonio Curci, Antonio Piccinno, Davide Pio Posa
Digital Twins (DTs) are gaining prominence in cybersecurity for their ability to replicate complex IT (Information Technology), OT (Operational Technology), and IoT (Internet of Things) infrastructures, allowing for real-time monitoring, threat analysis, and system simulation. This study investigates how integrating DTs with penetration testing tools and Large Language Models (LLMs) can enhance cybersecurity education and operational readiness. By simulating realistic cyber environments, this approach offers a practical, interactive framework for exploring vulnerabilities and defensive strategies. At the core of this research is the Red Team Knife (RTK), a custom penetration testing toolkit aligned with the Cyber Kill Chain model. RTK is designed to guide learners through key phases of cyberattacks, including reconnaissance, exploitation, and response within a DT-powered ecosystem. The incorporation of LLMs further enriches the experience by providing intelligent, real-time feedback, natural language threat explanations, and adaptive learning support during training exercises. This combined DT-LLM framework is currently being piloted in academic settings to develop hands-on skills in vulnerability assessment, threat detection, and security operations. Initial findings suggest that the integration significantly improves the effectiveness and relevance of cybersecurity training, bridging the gap between theoretical knowledge and real-world application. Ultimately, the research demonstrates how DTs and LLMs together can transform cybersecurity education to meet evolving industry demands.
数字双胞胎(DTs)在网络安全中越来越受到重视,因为它们复制复杂的信息技术(信息技术)、OT(操作技术)和IOT(物联网)基础设施的能力,能够进行实时监测、威胁分析和系统模拟。这项研究调查了将DT与渗透测试工具和大语言模型(LLMS)相结合如何能加强网络安全教育和业务准备状态。通过模拟现实的网络环境,这一方法为探索脆弱性和防御战略提供了一个实用的互动式框架。研究的核心是红球Knife(RTK),这是与网络杀手链模型一致的定制渗透测试工具包。RTK旨在指导学习者在网络攻击的关键阶段,包括侦察、利用和在有动力的DT生态系统内作出反应。将大语言模型(LLMS)纳入能够通过提供智能、实时反馈、自然语言威胁解释和在培训活动中提供适应性学习支持,从而进一步丰富了经验。DTLM框架目前正在在学术环境中进行试点,以掌握脆弱性评估、威胁发现和安全操作方面的技能。初步发现,旨在指导学习学习学习学习学习学习如何在最终改造全球安全数据库和数据库需求之间进行联系。
Article 38
Title@2025-07-23 (3): Efficient Neural Network Verification via Order Leading Exploration of Branch-and-Bound Trees
Title: Efficient Neural Network Verification via Order Leading Exploration of Branch-and-Bound Trees | Effiziente Neuralnetzverifizierung durch Order Leading Exploration von Zweig-und-Bound-Bäumen | 通过分树和环形树的有序主要勘探进行高效神经网络核查 2507.17453v1 |
Authors (7): Guanqin Zhang, Kota Fukuda, Zhenya Zhang, H. M. N. Dilum Bandara, Shiping Chen, Jianjun Zhao, Yulei Sui
The vulnerability of neural networks to adversarial perturbations has necessitated formal verification techniques that can rigorously certify the quality of neural networks. The state-of-the-art approach, branch and bound (BaB), is a “divide-and-conquer” strategy that applies off-the-shelf verifiers to sub-problems for which they perform better. While BaB can identify the sub-problems that need to be split, it explores the space of these sub-problems in a naive “first-come-first-serve” manner and is therefore inefficient at reaching a verification conclusion. To bridge this gap, we introduce an order over the different sub-problems produced by BaB, according to their different likelihoods of containing counterexamples. Based on this order, we propose a novel verification framework Oliva that explores the sub-problem space by prioritizing those sub-problems that are more likely to find counterexamples, in order to efficiently reach the conclusion of the verification. Even if no counterexample can be found in any sub-problem, it only changes the order of visiting different sub-problems and so will not lead to a performance degradation. Specifically, Oliva has two variants, including $Oliva^{GR}$, a greedy strategy that always prioritizes the sub-problems that are more likely to find counterexamples, and $Oliva^{SA}$, a balanced strategy inspired by simulated annealing that gradually shifts from exploration to exploitation to locate the globally optimal sub-problems. We experimentally evaluate the performance of Oliva on 690 verification problems spanning over 5 models with datasets MNIST and CIFAR10. Compared to the state-of-the-art approaches, we demonstrate a speedup of Oliva of up to 25x on MNIST and up to 80x on CIFAR10.
神经网络易受对抗性扰动影响,因此需要能够严格认证神经网络质量的形式化验证技术。作为最先进的方法,分支定界(BaB)是一种“分而治之”策略,它将现成的验证器应用于其表现更好的子问题。虽然BaB能够识别需要拆分的子问题,但它以朴素的“先到先服务”方式探索这些子问题的空间,因而在得出验证结论方面效率低下。为弥补这一差距,我们根据BaB产生的不同子问题包含反例的可能性,在这些子问题上引入一个序。基于这个序,我们提出了新的验证框架Oliva,它优先探索更可能找到反例的子问题,以便高效地得出验证结论。即使在任何子问题中都找不到反例,这也只是改变了访问不同子问题的顺序,因此不会导致性能下降。具体而言,Oliva有两个变体:$Oliva^{GR}$是一种贪心策略,始终优先处理更可能找到反例的子问题;$Oliva^{SA}$是一种受模拟退火启发的平衡策略,逐步从探索转向利用,以定位全局最优的子问题。我们在MNIST和CIFAR10数据集上的5个模型、共690个验证问题上对Oliva的性能进行了实验评估。与最先进的方法相比,Oliva在MNIST上最高可实现25倍加速,在CIFAR10上最高可实现80倍加速。
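The greedy ordering idea can be sketched on a toy problem. This is not Oliva and involves no neural network verifier: the "verifier" here is a Lipschitz lower bound for a one-dimensional function, and the priority (smallest lower bound first) is an assumed stand-in for "most likely to contain a counterexample". The point is only to show branch and bound driven by a priority queue instead of first-come-first-serve order.

```python
import heapq


def verify_bab(lo, hi, g, lipschitz, max_steps=10_000):
    """Check whether g(x) >= 0 holds for every x in [lo, hi].

    Returns ("violated", x) with a concrete counterexample, ("verified", None),
    or ("unknown", None) if the step budget runs out. `lipschitz` must bound
    the slope of g on [lo, hi] for the lower bounds to be sound.
    """
    def lower_bound(a, b):
        mid = (a + b) / 2
        return g(mid) - lipschitz * (b - a) / 2

    # Min-heap keyed on the lower bound: the most suspicious interval is popped first.
    heap = [(lower_bound(lo, hi), lo, hi)]
    steps = 0
    while heap and steps < max_steps:
        steps += 1
        bound, a, b = heapq.heappop(heap)
        mid = (a + b) / 2
        if g(mid) < 0:
            return ("violated", mid)  # concrete counterexample found
        if bound >= 0:
            continue  # the bound proves this sub-problem, no split needed
        for sub_a, sub_b in ((a, mid), (mid, b)):
            heapq.heappush(heap, (lower_bound(sub_a, sub_b), sub_a, sub_b))
    return ("verified", None) if not heap else ("unknown", None)


def dips_below_zero(x):
    return (x - 0.3) ** 2 - 0.001


def always_positive(x):
    return x * x + 0.5


status_bad, witness = verify_bab(0.0, 1.0, dips_below_zero, lipschitz=2.0)
status_ok, _ = verify_bab(0.0, 1.0, always_positive, lipschitz=2.0)
```

As the abstract notes for the real setting, the ordering only changes which sub-problem is visited next. When no counterexample exists, every sub-problem still has to be proved, so the verified case costs the same amount of work as an unordered traversal.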
Article 39
Title@2025-07-23 (3): Explicit Vulnerability Generation with LLMs: An Investigation Beyond Adversarial Attacks
Title: Explicit Vulnerability Generation with LLMs: An Investigation Beyond Adversarial Attacks | Explizite Schwachstellengenerierung mit LLMs: Eine Untersuchung jenseits adversarialer Angriffe | 使用LLM的显式漏洞生成:超越对抗性攻击的调查 2507.10054v2 |
Authors (4): Emir Bosnak, Sahand Moslemi, Mayasah Lami, Anil Koyuncu
Large Language Models (LLMs) are increasingly used as code assistants, yet their behavior when explicitly asked to generate insecure code remains poorly understood. While prior research has focused on unintended vulnerabilities, this study examines a more direct threat: open-source LLMs generating vulnerable code when prompted. We propose a dual experimental design: (1) Dynamic Prompting, which systematically varies vulnerability type, user persona, and prompt phrasing across structured templates; and (2) Reverse Prompting, which derives natural-language prompts from real vulnerable code samples. We evaluate three open-source 7B-parameter models (Qwen2, Mistral, Gemma) using static analysis to assess both the presence and correctness of generated vulnerabilities. Our results show that all models frequently generate the requested vulnerabilities, though with significant performance differences. Gemma achieves the highest correctness for memory vulnerabilities under Dynamic Prompting (e.g., 98.6% for buffer overflows), while Qwen2 demonstrates the most balanced performance across all tasks. We find that professional personas (e.g., “DevOps Engineer”) consistently elicit higher success rates than student personas, and that the effectiveness of direct versus indirect phrasing is inverted depending on the prompting strategy. Vulnerability reproduction accuracy follows a non-linear pattern with code complexity, peaking in a moderate range. Our findings expose how LLMs’ reliance on pattern recall over semantic reasoning creates significant blind spots in their safety alignments, particularly for requests framed as plausible professional tasks.
大型语言模型(LLMS)越来越多地被用作代码助理,然而,当明确要求生成不安全代码时,它们的行为仍然没有得到很好的理解。虽然先前的研究侧重于意外脆弱性,但本研究研究研究了一个更直接的威胁:开放源码LMS在推动时生成脆弱代码。我们提议一个双重实验设计:(1)动态提示,它系统地改变脆弱性类型、用户人和结构化模板的快速表达;(2)反源提示,它从真实的脆弱代码样本中获取自然语言提示。我们用静态分析来评估生成的脆弱性的存在和正确性,对三种开放源7B参数模型(Qwen2,Mistral,Gemma)进行评估。我们的结果显示,所有模型经常产生所要求的脆弱性,尽管在性能差异很大。Gemma在动态提示(例如,98.6%用于缓冲溢漏溢出)下,实现记忆脆弱性的最大正确性,而Quen2显示所有任务最均衡的运行模式。我们发现,专业人员(例如“DevOps Engeral ”)使用静态分析, 持续提高中的成功率率率率高于学生的准确度,在不精确度上显示,在不易复制率战略中,在不易变率上如何遵循我们的标准。
Article 40
Title@2025-07-23 (3): Investigating Training Data Detection in AI Coders
Title: Investigating Training Data Detection in AI Coders | Untersuchung der Erfassung von Schulungsdaten in KI-Codern | AI 编码器中的调查培训数据检测 2507.17389v1 |
Authors (8): Tianlin Li, Yunxiang Wei, Zhiming Li, Aishan Liu, Qing Guo, Xianglong Liu, Dongning Sun, Yang Liu
Recent advances in code large language models (CodeLLMs) have made them indispensable tools in modern software engineering. However, these models occasionally produce outputs that contain proprietary or sensitive code snippets, raising concerns about potential non-compliant use of training data, and posing risks to privacy and intellectual property. To ensure responsible and compliant deployment of CodeLLMs, training data detection (TDD) has become a critical task. While recent TDD methods have shown promise in natural language settings, their effectiveness on code data remains largely underexplored. This gap is particularly important given code’s structured syntax and distinct similarity criteria compared to natural language. To address this, we conduct a comprehensive empirical study of seven state-of-the-art TDD methods on source code data, evaluating their performance across eight CodeLLMs. To support this evaluation, we introduce CodeSnitch, a function-level benchmark dataset comprising 9,000 code samples in three programming languages, each explicitly labeled as either included or excluded from CodeLLM training. Beyond evaluation on the original CodeSnitch, we design targeted mutation strategies to test the robustness of TDD methods under three distinct settings. These mutation strategies are grounded in the well-established Type-1 to Type-4 code clone detection taxonomy. Our study provides a systematic assessment of current TDD techniques for code and offers insights to guide the development of more effective and robust detection methods in the future.
代码大语言模型(CodeLLMs)的最新进展使其成为现代软件工程中不可或缺的工具。然而,这些模型有时会产生包含专有或敏感代码片段的输出,引发了对训练数据可能被不合规使用的担忧,并给隐私和知识产权带来风险。为了确保负责任且合规地部署CodeLLMs,训练数据检测(TDD)已成为一项关键任务。虽然最近的TDD方法在自然语言场景中显示出前景,但它们在代码数据上的有效性在很大程度上仍未得到充分探索。鉴于代码具有结构化语法以及与自然语言不同的相似性标准,这一空白尤为重要。为此,我们对七种最先进的TDD方法在源代码数据上进行了全面的实证研究,评估了它们在8个CodeLLMs上的表现。为了支持这项评估,我们引入了CodeSnitch,这是一个函数级基准数据集,包含三种编程语言的9,000个代码样本,每个样本都明确标注为包含于或排除于CodeLLM训练数据。除了在原始CodeSnitch上进行评估之外,我们还设计了有针对性的变异策略,以测试TDD方法在三种不同设置下的稳健性。这些变异策略以成熟的Type-1至Type-4代码克隆检测分类法为基础。我们的研究对当前面向代码的TDD技术进行了系统评估,并为今后开发更有效、更稳健的检测方法提供了见解。
Article 41
Title@2025-07-23 (3): Roseau: Fast, Accurate, Source-based API Breaking Change Analysis in Java
Title: Roseau: Fast, Accurate, Source-based API Breaking Change Analysis in Java | Roseau: Schnelle, genaue, quellbasierte API-Breaking Change Analyse in Java | Roseau: Java快速、准确、基于源、基于源的API突破性变化分析 2507.17369v1 |
Authors (5): Corentin Latappy, Thomas Degueule, Jean-Rémy Falleri, Romain Robbes, Lina Ochoa
Understanding API evolution and the introduction of breaking changes (BCs) in software libraries is essential for library maintainers to manage backward compatibility and for researchers to conduct empirical studies on software library evolution. In Java, tools such as JApiCmp and Revapi are commonly used to detect BCs between library releases, but their reliance on binary JARs limits their applicability. This restriction hinders large-scale longitudinal studies of API evolution and fine-grained analyses such as commit-level BC detection. In this paper, we introduce Roseau, a novel static analysis tool that constructs technology-agnostic API models from library code equipped with rich semantic analyses. API models can be analyzed to study API evolution and compared to identify BCs between any two versions of a library (releases, commits, branches, etc.). Unlike traditional approaches, Roseau can build API models from source code or bytecode, and is optimized for large-scale longitudinal analyses of library histories. We assess the accuracy, performance, and suitability of Roseau for longitudinal studies of API evolution, using JApiCmp and Revapi as baselines. We extend and refine an established benchmark of BCs and show that Roseau achieves higher accuracy (F1 = 0.99) than JApiCmp (F1 = 0.86) and Revapi (F1 = 0.91). We analyze 60 popular libraries from Maven Central and find that Roseau delivers excellent performance, detecting BCs between versions in under two seconds, including in libraries with hundreds of thousands of lines of code. We further illustrate the limitations of JApiCmp and Revapi for longitudinal studies and the novel analysis capabilities offered by Roseau by tracking the evolution of Google’s Guava API and the introduction of BCs over 14 years and 6,839 commits, reducing analysis times from a few days to a few minutes.
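理解软件库中API的演化和破坏性变更(BCs)的引入,对于库维护者管理向后兼容性以及研究人员开展软件库演化的实证研究至关重要。在Java中,JApiCmp和Revapi等工具通常用于检测库版本之间的BCs,但它们对二进制JAR的依赖限制了其适用性。这一限制阻碍了对API演化的大规模纵向研究以及提交级BC检测等细粒度分析。在本文中,我们介绍Roseau,这是一种新的静态分析工具,它从库代码中构建与技术无关的API模型,并配有丰富的语义分析。可以分析API模型以研究API演化,也可以对其进行比较,以识别库的任意两个版本(发布、提交、分支等)之间的BCs。与传统方法不同,Roseau可以从源代码或字节码构建API模型,并针对库历史的大规模纵向分析进行了优化。我们以JApiCmp和Revapi为基线,评估Roseau在API演化纵向研究中的准确性、性能和适用性。我们扩展并完善了一个已有的BC基准,结果表明Roseau的准确性(F1 = 0.99)高于JApiCmp(F1 = 0.86)和Revapi(F1 = 0.91)。我们分析了Maven Central中的60个流行库,发现Roseau性能出色,可在两秒内检测出版本之间的BCs,即使是拥有数十万行代码的库也是如此。我们还通过追踪Google的Guava API在14年、6,839次提交中的演化和BCs的引入,进一步说明了JApiCmp和Revapi在纵向研究中的局限性以及Roseau提供的新分析能力,将分析时间从几天缩短到几分钟。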
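As a rough illustration of what comparing two API models to identify BCs can look like, the sketch below represents each library version as a mapping from symbol name to signature string and reports symbols that were removed or whose signature changed. The model format and the names `find_breaking_changes`, `v1`, and `v2` are assumptions made for this example; this is not Roseau's actual model or algorithm, which relies on much richer semantic analysis.

```python
def find_breaking_changes(old_api, new_api):
    """Return a sorted list of (kind, symbol) pairs for simple breaking changes.

    Each API model is a hypothetical dict: fully qualified symbol -> signature.
    """
    changes = []
    for symbol, signature in old_api.items():
        if symbol not in new_api:
            # The symbol disappeared, so existing callers no longer resolve it.
            changes.append(("removed", symbol))
        elif new_api[symbol] != signature:
            # Any signature difference is treated as breaking in this sketch.
            changes.append(("signature_changed", symbol))
    return sorted(changes)


# Two made-up versions of a small library API (illustrative only).
v1 = {
    "com.example.Lists.first": "Object first(List)",
    "com.example.Lists.last": "Object last(List)",
    "com.example.Lists.size": "int size(List)",
}
v2 = {
    "com.example.Lists.first": "Object first(List)",
    "com.example.Lists.size": "long size(List)",
    "com.example.Lists.isEmpty": "boolean isEmpty(List)",
}

breaking = find_breaking_changes(v1, v2)
print(breaking)
```

Additions such as `isEmpty` are not reported, since in this simplified view a new symbol does not break existing clients.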
Article 42
Title@2025-07-23 (3): How Do Code Smells Affect Skill Growth in Scratch Novice Programmers?
Title: How Do Code Smells Affect Skill Growth in Scratch Novice Programmers? | Wie wirkt sich Code bei Scratch Novice Programmierern auf das Qualifikationswachstum aus? | 代码如何闻到技能增长对Scratch新程序设计师的影响? 2507.17314v1 |
Authors (3): Ricardo Hidalgo Aragón, Jesús M. González-Barahona, Gregorio Robles
Context. Code smells, which are recurring anomalies in design or style, have been extensively researched in professional code. However, their significance in block-based projects created by novices is still largely unknown. Block-based environments such as Scratch offer a unique, data-rich setting to examine how emergent design problems intersect with the cultivation of computational-thinking (CT) skills. Objective. This research explores the connection between CT proficiency and design-level code smells (issues that may hinder software maintenance and evolution) in programs created by Scratch developers. We seek to identify which CT dimensions align most strongly with which code smells and whether task context moderates those associations. Method. A random sample of approx. 2 million public Scratch projects is mined. Using open-source linters, we extract nine CT scores and 40 code smell indicators from these projects. After rigorous pre-processing, we apply descriptive analytics, robust correlation tests, stratified cross-validation, and exploratory machine-learning models; qualitative spot-checks contextualize quantitative patterns. Impact. The study will deliver the first large-scale, fine-grained map linking specific CT competencies to concrete design flaws and antipatterns. Results are poised to (i) inform evidence-based curricula and automated feedback systems, (ii) provide effect-size benchmarks for future educational interventions, and (iii) supply an open, pseudonymized dataset and reproducible analysis pipeline for the research community. By clarifying how programming habits influence early skill acquisition, the work advances both computing-education theory and practical tooling for sustainable software maintenance and evolution.
背景。代码异味是设计或风格中反复出现的异常,在专业代码中已得到广泛研究。然而,它们在新手创建的积木式项目中的重要性仍基本未知。Scratch等积木式环境提供了独特且数据丰富的场景,可用于考察新出现的设计问题如何与计算思维(CT)技能的培养相交。目标。本研究探讨CT熟练程度与设计层面代码异味(可能妨碍软件维护和演化的问题)在Scratch开发者所创建程序中的联系。我们试图确定哪些CT维度与哪些代码异味的关联最强,以及任务情境是否会调节这些关联。方法。对约200万个公开Scratch项目的随机样本进行挖掘。我们使用开源检查工具(linter)从这些项目中提取9项CT得分和40项代码异味指标。经过严格的预处理后,我们应用描述性分析、稳健的相关性检验、分层交叉验证和探索性机器学习模型;定性抽查则为定量模式提供背景。影响。该研究将给出首个大规模、细粒度的映射,把具体的CT能力与具体的设计缺陷和反模式联系起来。研究结果有望(i)为循证课程和自动反馈系统提供依据,(ii)为未来的教育干预提供效应量基准,(iii)为研究界提供开放的、假名化的数据集和可复现的分析流程。通过阐明编程习惯如何影响早期技能习得,这项工作既推进了计算教育理论,也推进了面向可持续软件维护和演化的实用工具。
Article 43
Title@2025-07-23 (3): Data Virtualization for Machine Learning
Title: Data Virtualization for Machine Learning | Datenvirtualisierung für maschinelles Lernen | 机器学习数据虚拟化 2507.17293v1 |
Authors (5): Saiful Khan, Joyraj Chakraborty, Philip Beaucamp, Niraj Bhujel, Min Chen
Nowadays, machine learning (ML) teams have multiple concurrent ML workflows for different applications. Each workflow typically involves many experiments, iterations, and collaborative activities and commonly takes months and sometimes years from initial data wrangling to model deployment. Organizationally, there is a large amount of intermediate data to be stored, processed, and maintained. Data virtualization becomes a critical technology in an infrastructure to serve ML workflows. In this paper, we present the design and implementation of a data virtualization service, focusing on its service architecture and service operations. The infrastructure currently supports six ML applications, each with more than one ML workflow. The data virtualization service allows the number of applications and workflows to grow in the coming years.
目前,机器学习(ML)团队有多种同时的ML工作流程,用于不同的应用。每个工作流程通常涉及许多实验、迭代和协作活动,通常需要几个月甚至几年的时间,从最初的数据整理到模型部署。在组织上,有大量的中间数据有待储存、处理和维护。数据虚拟化成为为ML工作流程提供服务的基础设施中的一项关键技术。在本文件中,我们介绍了数据虚拟化服务的设计和实施,重点是其服务结构和服务业务。基础设施目前支持六个ML应用程序,每个应用程序都有一个以上的ML工作流程。数据虚拟化服务使应用程序和工作流程的数量在未来几年中得以增长。
Article 44
Title@2025-07-23 (3): Seed&Steer: Guiding Large Language Models with Compilable Prefix and Branch Signals for Unit Test Generation
Title: Seed&Steer: Guiding Large Language Models with Compilable Prefix and Branch Signals for Unit Test Generation | Seed&Steer: Leitende große Sprachmodelle mit kompilierbaren Präfix- und Branchsignalen für die Unit Test Generation | 种子 & Steer: 指导用于单位测试生成的可编译前缀和分支信号的大型语言模型 2507.17271v1 |
Authors (6): Shuaiyu Zhou, Zhengran Zeng, Xiaoling Zhou, Rui Xie, Shikun Zhang, Wei Ye
Unit tests play a vital role in the software development lifecycle. Recent advances in Large Language Model (LLM)-based approaches have significantly improved automated test generation, garnering attention from both academia and industry. We revisit LLM-based unit test generation from a novel perspective by decoupling prefix generation and assertion generation. To characterize their respective challenges, we define Initialization Complexity and adopt Cyclomatic Complexity to measure the difficulty of prefix and assertion generation, revealing that the former primarily affects compilation success, while the latter influences test coverage. To address these challenges, we propose Seed&Steer, a two-step approach that combines traditional unit testing techniques with the capabilities of large language models. Seed&Steer leverages conventional unit testing tools (e.g., EvoSuite) to generate method invocations with high compilation success rates, which serve as seeds to guide LLMs in constructing effective test contexts. It then introduces branching cues to help LLMs explore diverse execution paths (e.g., normal, boundary, and exception cases) and generate assertions with high coverage. We evaluate Seed&Steer on five real-world Java projects against state-of-the-art baselines. Results show that Seed&Steer improves the compilation pass rate by approximately 7%, successfully compiling 792 and 887 previously failing cases on two LLMs. It also achieves up to ~73% branch and line coverage across focal methods of varying complexity, with coverage improvements ranging from 1.09× to 1.26×. Our code, dataset, and experimental scripts will be publicly released to support future research and reproducibility.
单元测试在软件开发生命周期中发挥着至关重要的作用。基于大语言模型(LLM)的方法的最新进展显著改善了自动化测试生成,引起了学术界和工业界的关注。我们从一个新的视角重新审视基于LLM的单元测试生成,将前缀生成与断言生成解耦。为刻画二者各自的挑战,我们定义了初始化复杂度,并采用圈复杂度,来衡量前缀生成和断言生成的难度,结果表明前者主要影响编译成功率,后者则影响测试覆盖率。为应对这些挑战,我们提出Seed&Steer,这是一种两步方法,将传统单元测试技术与大语言模型的能力相结合。Seed&Steer利用传统单元测试工具(例如EvoSuite)生成编译成功率高的方法调用,以此作为种子,引导LLM构建有效的测试上下文。随后,它引入分支提示,帮助LLM探索不同的执行路径(例如正常、边界和异常情况),并生成高覆盖率的断言。我们在五个真实的Java项目上将Seed&Steer与最先进的基线进行了对比评估。结果表明,Seed&Steer将编译通过率提高了约7%,在两个LLM上成功编译了792个和887个此前失败的用例。在复杂度各异的焦点方法上,它还实现了最高约73%的分支覆盖率和行覆盖率,覆盖率提升幅度为1.09倍至1.26倍。我们的代码、数据集和实验脚本将公开发布,以支持未来的研究和可复现性。
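The abstract above uses Cyclomatic Complexity as a difficulty measure. As a minimal sketch of that metric, the following counts decision points in a piece of source code and adds one. It analyzes Python rather than Java purely for the sake of a self-contained example, and the set of node types in `DECISION_NODES` is a simplifying assumption; it is not the measurement setup used by the authors.

```python
import ast

# Node types counted as one extra decision point each (a simplifying assumption).
DECISION_NODES = (ast.If, ast.For, ast.While, ast.IfExp, ast.ExceptHandler)


def cyclomatic_complexity(source):
    """Approximate McCabe complexity: 1 + number of decision points."""
    tree = ast.parse(source)
    complexity = 1
    for node in ast.walk(tree):
        if isinstance(node, DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two short-circuit branches.
            complexity += len(node.values) - 1
    return complexity


SAMPLE = '''
def classify(x):
    if x < 0:
        return "negative"
    for d in range(3):
        if x == d and x != 1:
            return "small"
    return "other"
'''

score = cyclomatic_complexity(SAMPLE)
print(score)  # 1 + if + for + if + one "and" = 5
```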
Article 45
Title@2025-07-23 (3): Lessons from a Big-Bang Integration: Challenges in Edge Computing and Machine Learning
Title: Lessons from a Big-Bang Integration: Challenges in Edge Computing and Machine Learning | Lehren aus einer Big-Bang-Integration: Herausforderungen im Edge Computing und Machine Learning | 大型银行一体化的经验教训:边际电子计算和机器学习方面的挑战 2507.17270v1 |
Authors (2): Alessandro Aneggi, Andrea Janes
This experience report analyses a one-year project focused on building a distributed real-time analytics system using edge computing and machine learning. The project faced critical setbacks due to a big-bang integration approach, where all components developed by multiple geographically dispersed partners were merged at the final stage. The integration effort resulted in only six minutes of system functionality, far below the expected 40 minutes. Through root cause analysis, the study identifies technical and organisational barriers, including poor communication, lack of early integration testing, and resistance to top-down planning. It also considers psychological factors such as a bias toward fully developed components over mockups. The paper advocates for early mock-based deployment, robust communication infrastructures, and the adoption of top-down thinking to manage complexity and reduce risk in reactive, distributed projects. These findings underscore the limitations of traditional Agile methods in such contexts and propose simulation-driven engineering and structured integration cycles as key enablers for future success.
这份经验报告分析了一个为期一年的项目,重点是利用边际计算和机器学习建立一个分布式实时分析系统;该项目由于采用大相融合办法而面临重大挫折,在最后阶段,由地理上分散的多个合作伙伴开发的所有组成部分都合并在一起;一体化努力只产生了系统功能的6分钟,远远低于预期的40分钟;通过根本原因分析,研究确定了技术和组织障碍,包括沟通不畅、缺乏早期一体化测试和对自上而下规划的抵制;还考虑了心理因素,例如偏向于完全发达的组件而不是模型;文件倡导者早期模拟部署、强大的通信基础设施,以及采用自上而下思维管理复杂性和减少反应性、分布式项目的风险;这些结论强调了在这类情况下传统“敏捷”方法的局限性,并提出模拟驱动的工程和结构化整合周期作为未来成功的关键推动因素。
Article 46
Title@2025-07-23 (3): Understanding Prompt Programming Tasks and Questions
Title: Understanding Prompt Programming Tasks and Questions | Prompt Programmieraufgaben und Fragen verstehen | 了解快速方案拟订任务和问题 2507.17264v1 |
Authors (5): Jenny T. Liang, Chenyang Yang, Agnia Sergeyuk, Travis D. Breaux, Brad A. Myers
Prompting foundation models (FMs) like large language models (LLMs) has enabled new AI-powered software features (e.g., text summarization) that previously were only possible by fine-tuning FMs. Now, developers are embedding prompts in software, known as prompt programs. The process of prompt programming requires the developer to make many changes to their prompt. Yet, the questions developers ask to update their prompt are unknown, despite the answers to these questions affecting how developers plan their changes. With the growing number of research and commercial prompt programming tools, it is unclear whether prompt programmers’ needs are being adequately addressed. We address these challenges by developing a taxonomy of 25 tasks prompt programmers do and 51 questions they ask, measuring the importance of each task and question. We interview 16 prompt programmers, observe 8 developers make prompt changes, and survey 50 developers. We then compare the taxonomy with 48 research and commercial tools. We find that prompt programming is not well-supported: all tasks are done manually, and 16 of the 51 questions, including a majority of the most important ones, remain unanswered. Based on this, we outline important opportunities for prompt programming tools.
催化基础模型(FMs),如大型语言模型(LLMS)等催化基础模型(FMs)使新的AI-动力软件功能(例如文本汇总)得以实现,而以前只有微调调调调频才可能实现。现在,开发者正在将提示器嵌入软件,称为快速程序。迅速编程过程要求开发者对其迅速进行许多修改。然而,尽管这些问题影响到开发者如何计划其变化,但问题开发者要求更新其及时性,但尚不清楚。随着研究和商业快速编程工具数量的不断增加,能否充分满足快速编程者的需要。我们通过开发25个任务快速编程员的分类和51个问题来解决这些挑战,衡量每项任务和问题的重要性。我们访谈了16个快速编程员,观察8个开发者迅速修改,调查50个开发者。然后我们比较了分类和48个研究和商业工具。我们发现,快速编程没有得到很好的支持:所有任务都是手动完成的,51个问题中的16个问题 – 包括大多数最重要的问题 – 仍然没有得到答复。基于这一点,我们概述了快速编程工具的重要机会。
Article 47
Title@2025-07-23 (3): On the Feasibility of Quantum Unit Testing
Title: On the Feasibility of Quantum Unit Testing | Zur Machbarkeit der Quanteneinheitsprüfung | 关于量子单位测试的可行性 2507.17235v1 |
Authors (5): Andriy Miranskyy, José Campos, Anila Mjeda, Lei Zhang, Ignacio García Rodríguez de Guzmán
The increasing complexity of quantum software presents significant challenges for software verification and validation, particularly in the context of unit testing. This work presents a comprehensive study on quantum-centric unit tests, comparing traditional statistical approaches with tests specifically designed for quantum circuits. These include tests that run only on a classical computer, such as the Statevector test, as well as those executable on quantum hardware, such as the Swap test and the novel Inverse test. Through an empirical study and detailed analysis on 1,796,880 mutated quantum circuits, we investigate (a) each test’s ability to detect subtle discrepancies between the expected and actual states of a quantum circuit, and (b) the number of measurements required to achieve high reliability. The results demonstrate that quantum-centric tests, particularly the Statevector test and the Inverse test, provide clear advantages in terms of precision and efficiency, reducing both false positives and false negatives compared to statistical tests. This work contributes to the development of more robust and scalable strategies for testing quantum software, supporting the future adoption of fault-tolerant quantum computers and promoting more reliable practices in quantum software engineering.
量子软件日益复杂,对软件的核查和验证提出了重大挑战,特别是在单位测试方面,这项工作是对量子单位测试的全面研究,比较传统统计方法与量子电路专门设计的测试,包括仅在古典计算机上进行的测试,如国家矢量器测试,以及对量子硬件可执行的测试,如Swap测试和新的反向测试。通过对1,796,880变形量子电路进行经验研究和详细分析,我们调查(a) 每项测试是否有能力发现量子电路预期状态和实际状态之间的微妙差异,(b) 实现高度可靠性所需的测量数量。结果显示,量子中心测试,特别是国家矢量器测试和反向测试,在精确和效率方面提供了明显的好处,减少了假正数和假负数与统计测试相比。这项工作有助于制定更可靠和可缩放的战略来测试量子软件,支持今后采用容误差量子计算机,并促进在量子软件工程方面采取更可靠的做法。
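To make two of the tests named above more concrete, here is a minimal classical sketch. A Statevector-style check compares the expected and actual state vectors through the fidelity $|\langle a|b\rangle|^2$, and the ideal Swap test reports outcome 0 on its ancilla with probability $\tfrac{1}{2} + \tfrac{1}{2}|\langle a|b\rangle|^2$. The function names and the example states are assumptions for illustration; this is not the authors' test harness, and the Inverse test is not covered here.

```python
import math


def inner_product(a, b):
    """Return <a|b> for two state vectors given as lists of amplitudes."""
    return sum(x.conjugate() * y for x, y in zip(a, b))


def fidelity(a, b):
    """|<a|b>|^2, which is insensitive to a global phase."""
    return abs(inner_product(a, b)) ** 2


def swap_test_p0(a, b):
    """Ideal probability of measuring 0 on the ancilla of a swap test."""
    return 0.5 + 0.5 * fidelity(a, b)


s = 1 / math.sqrt(2)
plus = [s, s]                  # |+>
minus = [s, -s]                # |->, orthogonal to |+>
plus_phase = [1j * s, 1j * s]  # |+> multiplied by a global phase of i

print(fidelity(plus, plus_phase))  # close to 1: same physical state
print(fidelity(plus, minus))       # 0: fully distinguishable states
print(swap_test_p0(plus, minus))   # 0.5: ancilla outcome is a fair coin
```

On hardware the swap-test probability has to be estimated from repeated measurements, which is why the number of measurements matters: for identical states the ideal probability of outcome 0 is 1, so a single observed 1 already signals a mismatch, whereas nearly identical states need many shots to tell apart.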
Article 48
Title@2025-07-23 (3): Can LLMs Write CI? A Study on Automatic Generation of GitHub Actions Configurations
Title: Can LLMs Write CI? A Study on Automatic Generation of GitHub Actions Configurations | Kann LLMs CI schreiben? Eine Studie zur automatischen Generierung von GitHub-Aktionen Konfigurationen | LLM Can Write CI? GitHub 动作配置自动生成研究 2507.17165v1 |
Authors (2): Taher A. Ghaleb, Dulina Rathnayake
Continuous Integration (CI) services, such as GitHub Actions, require developers to write YAML-based configurations, which can be tedious and error-prone. Despite the increasing use of Large Language Models (LLMs) to automate software engineering tasks, their ability to generate CI configurations remains underexplored. This paper presents a preliminary study evaluating six LLMs for generating GitHub Actions configurations from natural language descriptions. We assess three general-purpose foundation models (GPT-4o, Llama, and Gemma) and three code-pretrained models (GPT-4.1, Code Llama, and CodeGemma). We also introduce the first labeled dataset of its kind, constructed from GitHub Actions documentation, pairing descriptions with corresponding best-practice YAML configurations. Zero-shot prompting achieves up to 69% similarity with the ground truth, with only 3% perfect matches. Code-pretrained models slightly underperform compared to general-purpose ones in YAML-based CI tasks, revealing LLM limitations for CI configuration generation. Analyzing GPT-4o outputs reveals issues like missing or renamed steps, misinterpreted descriptions, and unnecessary additions that may affect structural and contextual correctness, indicating a gap between generation quality and the precision required for executable CI configurations. Our research offers insights for improving LLM alignment with configuration languages and guiding future efforts on CI automation and tooling support.
GitHub Actions等持续集成(CI)服务要求开发者编写基于YAML的配置,这可能既繁琐又容易出错。尽管大语言模型(LLMs)越来越多地被用于自动化软件工程任务,但它们生成CI配置的能力仍未得到充分探索。本文介绍了一项初步研究,评估六个LLM根据自然语言描述生成GitHub Actions配置的能力。我们评估了三个通用基础模型(GPT-4o、Llama和Gemma)和三个代码预训练模型(GPT-4.1、Code Llama和CodeGemma)。我们还引入了首个此类带标注的数据集,它基于GitHub Actions文档构建,将描述与相应的最佳实践YAML配置配对。零样本提示与真实配置的相似度最高可达69%,完全匹配的仅占3%。在基于YAML的CI任务中,代码预训练模型的表现略逊于通用模型,这揭示了LLM在生成CI配置方面的局限性。对GPT-4o输出的分析揭示了步骤缺失或被重命名、描述被误解以及不必要的添加等问题,这些问题可能影响结构和上下文的正确性,表明生成质量与可执行CI配置所需的精确度之间存在差距。我们的研究为改进LLM与配置语言的对齐以及指导未来在CI自动化和工具支持方面的工作提供了见解。
Article 49
Title@2025-07-23 (3): Assessing Reliability of Statistical Maximum Coverage Estimators in Fuzzing
Title: Assessing Reliability of Statistical Maximum Coverage Estimators in Fuzzing | Bewertung der Zuverlässigkeit statistischer Maximaldeckungs-Schätzer im Fuzzing | 评估模糊中统计最高覆盖率估算器的可靠性 2507.17093v1 |
Authors (4): Danushka Liyanage, Nelum Attanayake, Zijian Luo, Rahul Gopinath
Background: Fuzzers are often guided by coverage, making the estimation of maximum achievable coverage a key concern in fuzzing. However, achieving 100% coverage is infeasible for most real-world software systems, regardless of effort. While static reachability analysis can provide an upper bound, it is often highly inaccurate. Recently, statistical estimation methods based on species richness estimators from biostatistics have been proposed as a potential solution. Yet, the lack of reliable benchmarks with labeled ground truth has limited rigorous evaluation of their accuracy. Objective: This work examines the reliability of reachability estimators along two axes: addressing the lack of labeled ground truth and evaluating their reliability on real-world programs. Methods: (1) To address the challenge of labeled ground truth, we propose an evaluation framework that synthetically generates large programs with complex control flows, ensuring well-defined reachability and providing ground truth for evaluation. (2) To address the criticism of using synthetic benchmarks, we adapt a reliability check for reachability estimators on real-world benchmarks without labeled ground truth: we vary the size of sampling units, which, in theory, should not affect the estimate. Results: These two studies together will help answer the question of whether current reachability estimators are reliable, and define a protocol to evaluate future improvements in reachability estimation.
背景:模糊者往往以覆盖为指南,将估计最大可实现覆盖作为模糊的关键关切。然而,无论如何努力,实现100%覆盖对于大多数真实世界软件系统都是行不通的。虽然静态可达性分析可以提供上限,但往往非常不准确。最近,提出了基于生物统计学物种丰富性估计的统计估计方法,作为潜在的解决办法。然而,由于缺乏可靠的基准,加上贴有标签的地面真理,因此对其准确性的严格评价有限。目标:这项工作审查来自两个轴的可达性估计者的可靠性:解决没有标签的地面真相的缺乏,评估其在真实世界方案中的可靠性。方法:(1) 解决贴有标签的地面真相的挑战,我们提议一个评估框架,合成地生成大型方案,同时进行复杂的控制流动,确保定义明确的可达性,并为评价提供地面真相。(2) 为解决对使用合成基准的批评,我们调整了可靠的可靠性检查,用于不贴标签的地面基准的可达标的可达标性估计者 – 其抽样单位的规模不同,理论上,这些抽样单位的可达标度不应影响当前可达性估计的可靠性。这些结果:两项研究将共同确定如何确定未来的可达标。
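For readers unfamiliar with species richness estimators, the sketch below implements the bias-corrected Chao1 estimator, a standard example from biostatistics: $\hat{S} = S_{obs} + \frac{f_1 (f_1 - 1)}{2 (f_2 + 1)}$, where $f_1$ and $f_2$ are the numbers of elements observed exactly once and exactly twice. Treating coverage elements as species and fuzz inputs as sampling units is the analogy described above; whether the study evaluates this particular estimator is an assumption, and the hit counts are made up.

```python
from collections import Counter


def chao1(hit_counts):
    """Bias-corrected Chao1 estimate of the total number of reachable elements.

    hit_counts maps each observed coverage element to the number of
    sampling units (for example, fuzz inputs) that reached it.
    """
    observed = sum(1 for count in hit_counts.values() if count > 0)
    freq = Counter(hit_counts.values())
    f1, f2 = freq[1], freq[2]  # singletons and doubletons
    return observed + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical branch hit counts collected during a fuzzing campaign.
hits = {"b1": 10, "b2": 7, "b3": 1, "b4": 1, "b5": 1, "b6": 2, "b7": 2}

estimate = chao1(hits)
print(estimate)  # 7 observed + 3*2 / (2*3) = 8.0
```

Many rarely hit elements (a large $f_1$) push the estimate above the observed count, reflecting the intuition that more unseen elements remain.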
Article 50
Title@2025-07-22 (2): Language model developers should report train-test overlap
Title: Language model developers should report train-test overlap | Entwickler von Sprachmodellen sollten Überlappungen von Zugversuchen melden | 语言模式开发者应报告培训测试重叠情况 2410.08385v2 |
Authors (7): Andy K Zhang, Kevin Klyman, Yifan Mai, Yoav Levine, Yian Zhang, Rishi Bommasani, Percy Liang
Language models are extensively evaluated, but correctly interpreting evaluation results requires knowledge of train-test overlap, which refers to the extent to which the language model is trained on the very data it is being tested on. The public currently lacks adequate information about train-test overlap: most models have no public train-test overlap statistics, and third parties cannot directly measure train-test overlap since they do not have access to the training data. To make this clear, we document the practices of 30 model developers, finding that just 9 developers report train-test overlap: 4 developers release training data under open-source licenses, enabling the community to directly measure train-test overlap, and 5 developers publish their train-test overlap methodology and statistics. By engaging with language model developers, we provide novel information about train-test overlap for three additional developers. Overall, we take the position that language model developers should publish train-test overlap statistics and/or training data whenever they report evaluation results on public test sets. We hope our work increases transparency into train-test overlap to increase the community-wide trust in model evaluations.
语言模型得到广泛评价,但正确解释评价结果要求了解培训测试重叠情况,这是指语言模型在多大程度上以正在测试的数据本身进行训练。公众目前缺乏关于培训测试重叠情况的充足信息:大多数模型没有公共培训测试重叠统计数据,第三方无法直接衡量培训测试重叠情况,因为他们无法获得培训数据。为了明确这一点,我们记录了30个模型开发者的做法,发现只有9个开发者报告培训测试重叠情况:4个开发者根据开放源码许可证发布培训测试数据,使社区能够直接测量培训测试重叠情况,5个开发者公布其培训测试重叠方法和统计数据。我们通过与语言模型开发者接触,为另外3个开发者提供关于培训测试重叠的新信息。总体而言,我们的立场是,语言模型开发者在报告公共测试成套的评价结果时,应公布培训测试重叠统计数据和/或培训数据。我们希望我们的工作能增加培训测试重叠的透明度,以增加全社区对模式评价的信任。
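One simple way to picture a train-test overlap statistic is the fraction of test examples that share at least one n-gram with the training corpus. The sketch below assumes whitespace tokenization, word trigrams, and toy data; the abstract does not prescribe a particular measurement method, so this is only one possible choice, and `overlap_fraction` is a hypothetical helper name.

```python
def ngrams(text, n):
    """Return the set of word n-grams in a whitespace-tokenized string."""
    tokens = text.split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_examples, training_corpus, n=3):
    """Fraction of test examples sharing at least one n-gram with training data."""
    train_ngrams = set()
    for doc in training_corpus:
        train_ngrams |= ngrams(doc, n)
    flagged = [ex for ex in test_examples if ngrams(ex, n) & train_ngrams]
    return len(flagged) / len(test_examples)


# Toy data, for illustration only.
train = ["the quick brown fox jumps over the lazy dog", "to be or not to be"]
test = [
    "a quick brown fox appears",           # shares "quick brown fox"
    "completely unrelated sentence here",  # no shared trigram
    "or not to be honest",                 # shares "or not to"
]

rate = overlap_fraction(test, train, n=3)
print(rate)
```

A third party can only run a check like this when the training data is available, which is the access problem the abstract points out.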
Article 51
Title@2025-07-22 (2): Evaluating Uncertainty and Quality of Visual Language Action-enabled Robots
Title: Evaluating Uncertainty and Quality of Visual Language Action-enabled Robots | Bewertung von Unsicherheit und Qualität von Visual Language Action-fähigen Robotern | 评价视觉语言行动推动的机器人的不确定性和质量 2507.17049v1 |
Authors (4): Pablo Valle, Chengjie Lu, Shaukat Ali, Aitor Arrieta
Visual Language Action (VLA) models are a multi-modal class of Artificial Intelligence (AI) systems that integrate visual perception, natural language understanding, and action planning to enable agents to interpret their environment, comprehend instructions, and perform embodied tasks autonomously. Recently, significant progress has been made to advance this field. These kinds of models are typically evaluated through task success rates, which fail to capture the quality of task execution and the model’s confidence in its decisions. In this paper, we propose eight uncertainty metrics and five quality metrics specifically designed for VLA models for robotic manipulation tasks. We assess their effectiveness through a large-scale empirical study involving 908 successful task executions from three state-of-the-art VLA models across four representative robotic manipulation tasks. Human domain experts manually labeled task quality, allowing us to analyze the correlation between our proposed metrics and expert judgments. The results reveal that several metrics show moderate to strong correlation with human assessments, highlighting their utility for evaluating task quality and model confidence. Furthermore, we found that some of the metrics can discriminate between high-, medium-, and low-quality executions from unsuccessful tasks, which can be interesting when test oracles are not available. Our findings challenge the adequacy of current evaluation practices that rely solely on binary success rates and pave the way for improved real-time monitoring and adaptive enhancement of VLA-enabled robotic systems.
视觉语言动作(VLA)模型是一类多模态人工智能(AI)系统,它融合视觉感知、自然语言理解和动作规划,使智能体能够解读环境、理解指令并自主执行具身任务。近来,这一领域取得了显著进展。此类模型通常通过任务成功率来评估,而任务成功率无法反映任务执行的质量以及模型对其决策的置信度。在本文中,我们提出了八个不确定性指标和五个质量指标,专门面向用于机器人操作任务的VLA模型。我们通过一项大规模实证研究评估这些指标的有效性,该研究涉及三个最先进的VLA模型在四项有代表性的机器人操作任务中的908次成功任务执行。人类领域专家对任务质量进行了人工标注,使我们能够分析所提指标与专家判断之间的相关性。结果显示,若干指标与人工评估呈中等到强的相关性,凸显了它们在评估任务质量和模型置信度方面的效用。此外,我们发现部分指标能够区分高、中、低质量的执行与未成功的任务,这在没有测试预言(test oracle)时可能很有价值。我们的发现对仅依赖二元成功率的现行评估做法的充分性提出了质疑,并为改进对VLA机器人系统的实时监控和自适应增强铺平了道路。
Article 52
Title@2025-07-22 (2): An Efficient Algorithm for Generating Minimal Unique-Cause MC/DC Test cases for Singular Boolean Expressions
Title: An Efficient Algorithm for Generating Minimal Unique-Cause MC/DC Test cases for Singular Boolean Expressions | Ein effizienter Algorithmus zur Generierung minimaler, einzigartiger MC/DC-Testfälle für singuläre Boolean-Ausdrücke | 生成 Singulal Boolean 表达式的 MC/DC 测试案例的高效最小独致 MC/DC 测试比值 2507.14687v2 |
Authors (2): Robin Lee, Youngho Nam
Modified Condition/Decision Coverage (MC/DC) is a mandatory structural coverage criterion for ensuring the reliability and safety of critical systems. While its strictest form, Unique-Cause MC/DC, offers the highest assurance, research on its efficient test generation has been lacking. This gap is particularly significant, as an analysis of large-scale avionics systems shows that 99.7% of all conditional decisions are, in fact, Singular Boolean Expressions (SBEs), the ideal structure for applying Unique-Cause MC/DC. This paper proposes ‘Robin’s Rule’, a deterministic algorithm that directly constructs a minimal test set of N + 1 cases to guarantee 100% Unique-Cause MC/DC for SBEs with N conditions, without generating a full truth table. To validate our approach, we constructed a benchmark by reformulating the TCAS-II specifications into SBEs and verified the results using an industry-standard, certified commercial tool. The results confirm that our method consistently achieves 100% coverage with the theoretical minimum number of tests and is more efficient than the commercial tool. This work provides a practical and provably optimal solution for verifying safety-critical systems, ensuring both rigor and efficiency.
修改后的条件/决定覆盖面(MC/DC)是保证关键系统的可靠性和安全性的一个强制性结构覆盖标准,尽管其最严格的形式,即 “ 独特原因的MC/DC “ 提供了最高程度的保证,但对其有效测试生成的研究却一直缺乏。这一差距尤其显著,因为对大型航空航空系统的分析表明,在所有有条件决定中,99.7%的Singulal Boolean Express(SBEs)实际上是应用单一原因的MC/DC的理想结构。本文建议采用“Robin规则”,这是一种确定性算法,直接建立一套N+1的最低限度测试,以保障N条件下的SBE为100%的“独特原因的MC/DC”,而没有产生完整的真相表。为了验证我们的方法,我们通过将TCAS-II规格重新纳入SBE(SBEs),并使用行业标准、经认证的商业工具核实结果。结果证实,我们的方法始终以理论最低数量的测试数量达到100%的覆盖率,比商业工具更有效率。这项工作提供了一种实际和最有效的安全性解决办法。
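To illustrate what an N + 1 test set achieving Unique-Cause MC/DC looks like, the sketch below checks a hand-picked set of four tests for the singular expression `a and (b or c)`, where N = 3. It only verifies coverage: a condition counts as covered when two tests differ in that condition alone and the decision outcome flips. It does not implement Robin's Rule, and the test vectors and the name `unique_cause_covered` are assumptions chosen for this example.

```python
from itertools import combinations


def unique_cause_covered(decision, n_conditions, tests):
    """Return the indices of conditions shown to independently affect the decision.

    A condition is covered under Unique-Cause MC/DC when two tests differ
    only in that condition and produce different decision outcomes.
    """
    covered = set()
    for t1, t2 in combinations(tests, 2):
        differing = [i for i in range(n_conditions) if t1[i] != t2[i]]
        if len(differing) == 1 and decision(*t1) != decision(*t2):
            covered.add(differing[0])
    return covered


def decision(a, b, c):
    # A singular Boolean expression: each condition appears exactly once.
    return a and (b or c)


# N + 1 = 4 hand-picked tests for N = 3 conditions.
tests = [
    (True, True, False),   # outcome True
    (False, True, False),  # outcome False, pairs with test 1 to cover a
    (True, False, False),  # outcome False, pairs with test 1 to cover b
    (True, False, True),   # outcome True, pairs with test 3 to cover c
]

covered = unique_cause_covered(decision, 3, tests)
print(covered == {0, 1, 2})
```

Dropping any one of the four tests leaves at least one condition without an independence pair, which matches the N + 1 minimum stated in the abstract for this expression.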
Article 53
Title@2025-07-22 (2): LLM as a code generator in Agile Model Driven Development
Title: LLM as a code generator in Agile Model Driven Development | LLM als Code-Generator in Agile Model Driven Development | 作为Agile 模型驱动器开发的代码生成器的LLM 2410.18489v2 |
Authors (3): Ahmed R. Sadik, Sebastian Brulin, Markus Olhofer
Leveraging Large Language Models (LLM) like GPT4 in the auto generation of code represents a significant advancement, yet it is not without its challenges. The ambiguity inherent in natural language descriptions of software poses substantial obstacles to generating deployable, structured artifacts. This research champions Model Driven Development (MDD) as a viable strategy to overcome these challenges, proposing an Agile Model Driven Development (AMDD) approach that employs GPT4 as a code generator. This approach enhances the flexibility and scalability of the code auto generation process and offers agility that allows seamless adaptation to changes in models or deployment environments. We illustrate this by modeling a multi agent Unmanned Vehicle Fleet (UVF) system using the Unified Modeling Language (UML), significantly reducing model ambiguity by integrating the Object Constraint Language (OCL) for code structure meta modeling, and the FIPA ontology language for communication semantics meta modeling. Applying GPT4 auto generation capabilities yields Java and Python code that is compatible with the JADE and PADE frameworks, respectively. Our thorough evaluation of the auto generated code verifies its alignment with expected behaviors and identifies enhancements in agent interactions. Structurally, we assessed the complexity of code derived from a model constrained solely by OCL meta models, against that influenced by both OCL and FIPA ontology meta models. The results indicate that the ontology constrained meta model produces inherently more complex code, yet its cyclomatic complexity remains within manageable levels, suggesting that additional meta model constraints can be incorporated without exceeding the high risk threshold for complexity.
在代码自动生成中利用GPT4等大语言模型(LLM)是一项重大进步,但并非没有挑战。软件的自然语言描述所固有的歧义,给生成可部署的结构化制品带来了很大障碍。本研究主张将模型驱动开发(MDD)作为克服这些挑战的可行策略,提出一种以GPT4作为代码生成器的敏捷模型驱动开发(AMDD)方法。该方法增强了代码自动生成过程的灵活性和可扩展性,并提供了敏捷性,能够无缝适应模型或部署环境的变化。为说明这一点,我们使用统一建模语言(UML)对一个多智能体无人车队(UVF)系统进行建模,并通过引入用于代码结构元建模的对象约束语言(OCL)和用于通信语义元建模的FIPA本体语言,显著降低了模型的歧义。应用GPT4的自动生成能力,可得到分别与JADE和PADE框架兼容的Java和Python代码。我们对自动生成的代码进行了全面评估,验证了其与预期行为的一致性,并发现了智能体交互方面的改进之处。在结构方面,我们比较了仅受OCL元模型约束的模型所生成代码的复杂度,与同时受OCL和FIPA本体元模型影响的模型所生成代码的复杂度。结果表明,受本体约束的元模型生成的代码本质上更为复杂,但其圈复杂度仍处于可控水平,这说明可以加入更多元模型约束,而不会超过复杂度的高风险阈值。
Article 54
Title@2025-07-22 (2): Revisiting Pre-trained Language Models for Vulnerability Detection
Title: Revisiting Pre-trained Language Models for Vulnerability Detection | Überprüfung vortrainierter Sprachmodelle für die Erkennung von Schwachstellen | 重新审查关于脆弱性检测的预培训语言模式 2507.16887v1 |
Authors (5): Youpeng Li, Weiliang Qi, Xuyu Wang, Fuxun Yu, Xinda Wang
The rapid advancement of pre-trained language models (PLMs) has demonstrated promising results for various code-related tasks. However, their effectiveness in detecting real-world vulnerabilities remains a critical challenge. While existing empirical studies evaluate PLMs for vulnerability detection (VD), their inadequate consideration of data preparation, evaluation setups, and experimental settings undermines the accuracy and comprehensiveness of evaluations. This paper introduces RevisitVD, an extensive evaluation of 17 PLMs spanning smaller code-specific PLMs and large-scale PLMs using newly constructed datasets. Specifically, we compare the performance of PLMs under both fine-tuning and prompt engineering, assess their effectiveness and generalizability across various training and testing settings, and analyze their robustness against code normalization, abstraction, and semantic-preserving transformations. Our findings reveal that, for VD tasks, PLMs incorporating pre-training tasks designed to capture the syntactic and semantic patterns of code outperform both general-purpose PLMs and those solely pre-trained or fine-tuned on large code corpora. However, these models face notable challenges in real-world scenarios, such as difficulties in detecting vulnerabilities with complex dependencies, handling perturbations introduced by code normalization and abstraction, and identifying semantic-preserving vulnerable code transformations. Also, the truncation caused by the limited context windows of PLMs can lead to a non-negligible amount of labeling errors. This study underscores the importance of thorough evaluations of model performance in practical scenarios and outlines future directions to help enhance the effectiveness of PLMs for realistic VD applications.
预训练语言模型(PLMs)的快速发展在各种代码相关任务上展现了可喜的成果。然而,它们在检测真实世界漏洞方面的有效性仍是一个严峻的挑战。虽然已有的实证研究评估了PLM在漏洞检测(VD)上的表现,但这些研究在数据准备、评估设置和实验设置方面考虑不足,削弱了评估的准确性和全面性。本文介绍RevisitVD,它使用新构建的数据集,对17个PLM进行了广泛评估,涵盖较小的代码专用PLM和大规模PLM。具体而言,我们比较了PLM在微调和提示工程两种方式下的性能,评估了它们在不同训练和测试设置下的有效性和泛化能力,并分析了它们对代码规范化、抽象和语义保持变换的稳健性。我们的研究结果表明,在VD任务上,那些融入了旨在捕捉代码句法和语义模式的预训练任务的PLM,优于通用PLM以及仅在大型代码语料库上预训练或微调的PLM。然而,这些模型在真实场景中面临显著挑战,例如难以检测具有复杂依赖关系的漏洞、难以处理代码规范化和抽象引入的扰动,以及难以识别经语义保持变换的漏洞代码。此外,PLM有限的上下文窗口所导致的截断可能造成不可忽视的标注错误。本研究强调了在实际场景中全面评估模型性能的重要性,并指出了有助于提升PLM在现实VD应用中有效性的未来方向。
Article 55
Title@2025-07-22 (2): Rethinking LLM-Based RTL Code Optimization Via Timing Logic Metamorphosis
Title: Rethinking LLM-Based RTL Code Optimization Via Timing Logic Metamorphosis | Rethinking LLM-basierte RTL-Code-Optimierung über Timing Logic Metamorphose | 重新思考基于LLM的RTL规则 2507.16808v1 |
Authors (3): Zhihao Xu, Bixin Li, Lulu Wang
Register Transfer Level (RTL) code optimization is crucial for achieving high performance and low power consumption in digital circuit design. However, traditional optimization methods often rely on manual tuning and heuristics, which can be time-consuming and error-prone. Recent studies have proposed leveraging Large Language Models (LLMs) to assist in RTL code optimization. LLMs can generate optimized code snippets based on natural language descriptions, potentially speeding up the optimization process. However, existing approaches have not thoroughly evaluated the effectiveness of LLM-based code optimization methods for RTL code with complex timing logic. To address this gap, we conducted a comprehensive empirical investigation to assess the capability of LLM-based RTL code optimization methods in handling RTL code with complex timing logic. In this study, we first propose a new benchmark for RTL optimization evaluation. It comprises four subsets, each corresponding to a specific area of RTL code optimization. Then we introduce a method based on metamorphosis to systematically evaluate the effectiveness of LLM-based RTL code optimization methods. Our key insight is that the optimization effectiveness should remain consistent for semantically equivalent but more complex code. After intensive experiments, we revealed several key findings. (1) LLM-based RTL optimization methods can effectively optimize logic operations and outperform existing compiler-based methods. (2) LLM-based RTL optimization methods do not perform better than existing compiler-based methods on RTL code with complex timing logic, particularly in timing control flow optimization and clock domain optimization. This is primarily attributed to the challenges LLMs face in understanding timing logic in RTL code. Based on these findings, we provide insights for further research in leveraging LLMs for RTL code optimization.
注册传输级别( RTL) 代码优化对于在数字电路设计中实现高性能和低电能消耗至关重要。 但是, 传统的优化方法往往依赖于人工调试和超常方法, 这可能耗费时间和容易出错。 最近提出的旨在利用大语言模型( LLMs) 来协助 RTL 代码优化的研究 。 LLM 可以在自然语言描述的基础上产生优化的代码片断, 可能会加快优化进程。 但是, 现有的方法还没有彻底评估基于 LLM 的代码优化方法在具有复杂时间逻辑逻辑的 RTL 代码设计中的有效性。 为了弥补这一差距,我们进行了全面的实证调查, 评估LM 和 RTL 代码优化方法在使用复杂的时间逻辑逻辑逻辑逻辑处理中的能力。 我们的主要洞察发现, LLLM 的优化运行效率应该保持一致, 而LLM 正在以更精细的逻辑化方法 。
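The metamorphic idea in this abstract (an optimizer should do equally well on a semantically equivalent but more complex input) can be sketched as a small check. This is a toy illustration only, not the paper's method or benchmark: the tuple-of-operations "design", the `remove_double_negation` optimizer, the use of `len` as a cost, and the function names are all assumptions made for the example.

```python
from typing import Callable, Tuple

# Hypothetical stand-in for an RTL design: a flat sequence of operation names.
Design = Tuple[str, ...]


def remove_double_negation(design: Design) -> Design:
    """Toy optimizer: cancel each adjacent pair of 'not' operations."""
    out = []
    for op in design:
        if op == "not" and out and out[-1] == "not":
            out.pop()  # 'not' followed by 'not' is the identity
        else:
            out.append(op)
    return tuple(out)


def metamorphic_relation_holds(
    optimize: Callable[[Design], Design],
    cost: Callable[[Design], float],
    original: Design,
    variant: Design,
    tolerance: float = 0.0,
) -> bool:
    """Assumed relation: optimizing an equivalent variant should reach
    (within tolerance) the same cost as optimizing the original."""
    return abs(cost(optimize(original)) - cost(optimize(variant))) <= tolerance


original = ("and", "or")
variant = ("and", "not", "not", "or")  # same behaviour, more complex

# The toy optimizer satisfies the relation; a do-nothing optimizer does not.
print(metamorphic_relation_holds(remove_double_negation, len, original, variant))
print(metamorphic_relation_holds(lambda d: d, len, original, variant))
```

In a real study the cost would come from synthesis reports (area, timing, power) and equivalence of the variant would be checked formally; neither is modelled here.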
Article 56
Title@2025-07-22 (2): Towards Understanding the Challenges of Bug Localization in Deep Learning Systems
Title: Towards Understanding the Challenges of Bug Localization in Deep Learning Systems | Auf dem Weg zum Verständnis der Herausforderungen der Buglokalisierung in Deep Learning Systemen | 了解深学习系统中错误定位化的挑战 2402.01021v2 |
Authors (3): Sigma Jahan, Mehil B. Shah, Mohammad Masudur Rahman
Software bugs cost the global economy billions of dollars annually and claim ~50% of the programming time from software developers. Locating these bugs is crucial for their resolution but challenging. It is even more challenging in deep-learning systems due to their black-box nature. Bugs in these systems are hidden not only in the code but also in the models and training data, which might make traditional debugging methods less effective. In this article, we conduct a large-scale empirical study to better understand the challenges of localizing bugs in deep-learning systems. First, we determine the bug localization performance of four existing techniques using 2,365 bugs from deep-learning systems and 2,913 from traditional software. We found these techniques significantly underperform in localizing deep-learning system bugs. Second, we evaluate how different bug types in deep-learning systems impact bug localization. We found that the effectiveness of localization techniques varies with bug type due to their unique challenges. For example, tensor bugs were easier to locate due to their structural nature, while all techniques struggled with GPU bugs due to their external dependencies. Third, we investigate the impact of bugs' extrinsic nature on localization in deep-learning systems. We found that deep-learning bugs are often extrinsic and thus connected to artifacts other than source code (e.g., GPU, training data), contributing to the poor performance of existing localization methods.
软件错误每年花费全球经济数十亿美元, 并声称软件开发者的程序制作时间为 ~ 50 。 定位这些错误对于解决这些错误至关重要, 但具有挑战性。 在深层学习系统中, 其难度更大。 这些系统中的错误不仅隐藏在代码中, 而且还隐藏在模型和培训数据中, 这可能会降低传统的调试方法的效力。 在本篇文章中, 我们进行了大规模的经验性研究, 以更好地了解深层学习系统中错误本地化的挑战。 首先, 我们用深层学习系统中的2,365个错误和传统软件中的2, 913个确定四种现有技术的错误本地化性能。 我们发现这些技术在深层学习系统错误的本地化方面表现严重不足。 其次, 我们评估深层学习系统中不同的错误类型是如何影响本地化的。 我们发现, 本地化技术的效力随错误类型的独特挑战而不同。 例如, 沙虫错误更容易定位于结构性质, 而所有技术由于外部依赖而与GPUI错误进行抗争斗。 第三, 我们发现这些技术在深层学习G型系统上的影响, 因此, 我们学习了深层次的错误特性, 我们发现, 学习了本地的系统。 学习了深层的系统。
Article 57
Title@2025-07-22 (2): Never Come Up Empty: Adaptive HyDE Retrieval for Improving LLM Developer Support
Title: Never Come Up Empty: Adaptive HyDE Retrieval for Improving LLM Developer Support | Nie kommen leer: Adaptive HyDE Retrieval für die Verbesserung LLM-Entwickler-Unterstützung | 永不空起来: 改进 LLM 开发者支持的适应性 HyDE 检索器 2507.16754v1 |
Authors (4): Fangjian Lei, Mariam El Mezouar, Shayan Noei, Ying Zou
Large Language Models (LLMs) have shown promise in assisting developers with code-related questions; however, LLMs carry the risk of generating unreliable answers. To address this, Retrieval-Augmented Generation (RAG) has been proposed to reduce the unreliability (i.e., hallucinations) of LLMs. However, designing effective pipelines remains challenging due to numerous design choices. In this paper, we construct a retrieval corpus of over 3 million Java- and Python-related Stack Overflow posts with accepted answers, and explore various RAG pipeline designs to answer developer questions, evaluating their effectiveness in generating accurate and reliable responses. More specifically, we (1) design and evaluate 7 different RAG pipelines and 63 pipeline variants to answer questions that have historically similar matches, and (2) address new questions without any close prior matches by automatically lowering the similarity threshold during retrieval, thereby increasing the chance of finding partially relevant context and improving coverage for unseen cases. We find that implementing a RAG pipeline combining hypothetical-documentation-embedding (HyDE) with the full-answer context performs best in retrieving and answering similar content for Stack Overflow questions. Finally, we apply our optimal RAG pipeline to 4 open-source LLMs and compare the results to their zero-shot performance. Our findings show that RAG with our optimal RAG pipeline consistently outperforms zero-shot baselines across models, achieving higher scores for helpfulness, correctness, and detail with LLM-as-a-judge. These findings demonstrate that our optimal RAG pipelines robustly enhance answer quality for a wide range of developer queries, including both previously seen and novel questions, across different LLMs.
大型语言模型(LLMS)在协助开发者处理与代码有关的问题方面显示了希望;然而,LLMS具有产生不可靠答案的风险。为了解决这一问题,建议回收提款人(RAG)减少LMS的不可靠性(即幻觉)。然而,设计有效的管道由于设计选择众多,仍然具有挑战性。在本文件中,我们建立了一个300多万爪哇和Python相关Stack Overfload 的检索系统,并获得公认的答案,并探索各种RAG管道设计,以回答开发者的问题,评估其在产生准确和可靠答复方面的效力。更具体地说,我们(1)设计和评价7个RAG的管道和63个管道变体,以回答历史上相近的问题。(2) 解决新的问题,在检索过程中自动降低相似的零临界值,从而增加找到部分相关背景的机会,并改进对隐性案例的覆盖范围。我们发现,执行RAG的管道,将假设性文件叠加(HyDE)与全盘背景进行最佳的检索和回答,将我们最准确的输油管查询结果与我们以前的RMRMRMRMSLSLSLSLS展示的最佳结果结合起来。
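The "automatically lowering the similarity threshold" step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the embeddings are hand-written toy vectors, and the function names and the `start`, `floor`, and `step` values are hypothetical. The HyDE step (embedding an LLM-generated hypothetical answer instead of the raw question) is not modelled.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def adaptive_retrieve(
    query_vec: Sequence[float],
    corpus: Dict[str, Sequence[float]],
    start: float = 0.9,   # assumed initial threshold
    floor: float = 0.5,   # assumed lowest acceptable threshold
    step: float = 0.1,    # assumed decrement per retry
    top_k: int = 3,
) -> Tuple[Optional[float], List[Tuple[float, str]]]:
    """Return (threshold_used, hits). The threshold is lowered step by step
    until at least one document qualifies or the floor is passed."""
    scored = sorted(
        ((cosine(query_vec, vec), doc_id) for doc_id, vec in corpus.items()),
        reverse=True,
    )
    n_steps = int(round((start - floor) / step))
    for i in range(n_steps + 1):
        threshold = round(start - i * step, 6)
        hits = [(score, doc_id) for score, doc_id in scored if score >= threshold]
        if hits:
            return threshold, hits[:top_k]
    return None, []


corpus = {"post_a": [1.0, 0.0], "post_b": [1.0, 1.0]}
print(adaptive_retrieve([1.0, 0.0], corpus))  # close match found at the first threshold
print(adaptive_retrieve([0.0, 1.0], corpus))  # only a partial match, found after lowering
```

Lowering the threshold trades precision for coverage, which is why a floor is kept: below it, returning no context may be safer than returning irrelevant context.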
Article 58
Title@2025-07-22 (2): An advanced AI driven database system
Title: An advanced AI driven database system | Ein fortschrittliches KI-gestütztes Datenbanksystem | 先进的AIL驱动数据库系统 2507.17778v1 |
Authors (5): M. Tedeschi, S. Rizwan, C. Shringi, V. Devram Chandgir, S. Belich
Contemporary database systems, while effective, suffer severe issues related to complexity and usability, especially among individuals who lack technical expertise and are unfamiliar with query languages like Structured Query Language (SQL). This paper presents a new database system supported by Artificial Intelligence (AI), which is intended to improve the management of data using natural language processing (NLP)-based intuitive interfaces, and automatic creation of structured queries and semi-structured data formats like Yet Another Markup Language (YAML), JavaScript Object Notation (JSON), and application programming interface (API) documentation. The system is intended to strengthen the potential of databases through the integration of Large Language Models (LLMs) and advanced machine learning algorithms. The integration is intended to allow the automation of fundamental tasks such as data modeling, schema creation, query comprehension, and performance optimization. We present in this paper a system that aims to alleviate the main problems with current database technologies. It is meant to reduce the need for technical skills, manual tuning for better performance, and the potential for human error. The AI database employs generative schema inference and format selection to build its schema models and execution formats.
现代数据库系统虽然有效,但在复杂性和可用性方面遭遇严重问题,特别是缺乏技术专长但不熟悉结构查询语言等查询语言的个人。本文件介绍了由人工智能(AI)支持的新数据库系统,其目的是通过自然语言处理(NLP)改进数据管理,以自然语言处理(NLP)为基础,以直观界面为基础,自动创建结构化查询和半结构化数据格式,如另一种标记语言(YAML)、 Java脚本对象符号(JSON)和应用程序界面(API)文件。该系统的目的是通过整合大语言模型(LLMS)和先进的机器学习算法,加强数据库的潜力。这种整合的目的是使数据模型、制版、系统创建、理解和性能优化等基本任务自动化。我们在本文中介绍了一个系统,旨在缓解当前数据库技术的主要问题。该系统旨在减少对技术能力的需求,改进性能的手工调整,以及人类错误的可能性。AI数据库采用基因化系统化模式和格式选择模型。
Article 59
Title@2025-07-22 (2): LangBiTe: A Platform for Testing Bias in Large Language Models
Title: LangBiTe: A Platform for Testing Bias in Large Language Models | LangBiTe: Eine Plattform zum Testen von Bias in großen Sprachmodellen | LangBitte:大语言模型比对测试平台 2404.18558v2 |
Authors (3): Sergio Morales, Robert Clarisó, Jordi Cabot
The integration of Large Language Models (LLMs) into various software applications raises concerns about their potential biases. Typically, those models are trained on a vast amount of data scraped from forums, websites, social media and other internet sources, which may instill harmful and discriminating behavior into the model. To address this issue, we present LangBiTe, a testing platform to systematically assess the presence of biases within an LLM. LangBiTe enables development teams to tailor their test scenarios, and automatically generate and execute the test cases according to a set of user-defined ethical requirements. Each test consists of a prompt fed into the LLM and a corresponding test oracle that scrutinizes the LLM's response for the identification of biases. LangBiTe provides users with the bias evaluation of LLMs, and end-to-end traceability between the initial ethical requirements and the insights obtained.
将大语言模型(LLMs)纳入各种软件应用引起了人们对其潜在偏差的关切,这些模型通常在从论坛、网站、社交媒体和其他互联网来源收集的大量数据上接受培训,这些数据可能会将有害和歧视行为注入模型,为解决这一问题,我们介绍LangBite(LangBite),这是一个测试平台,用来系统评估LLM(LLM)内部是否存在偏见。LangBiTe(LLLMTe)使开发团队能够根据一套用户定义的道德要求调整其测试情景,自动生成和执行测试案例。每个测试都包括迅速输入LLM(LM)和相应的测试手,以审查LLM(LM)对识别偏见的反应。LLMM(LLM)的偏差评价以及最初的道德要求和获得的洞察之间最终到的可追溯性。
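The structure described in this abstract, where each test pairs a prompt with an oracle that inspects the response, can be sketched in a few lines. This is not LangBiTe's API: the `BiasTest` class, the `generate_tests` and `run_tests` helpers, the prompt template, and the stub model are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class BiasTest:
    """One test case: a prompt plus an oracle that judges the response."""
    prompt: str
    oracle: Callable[[str], bool]  # True when the response is acceptable


def generate_tests(template: str, groups: List[str],
                   oracle: Callable[[str], bool]) -> List[BiasTest]:
    """Instantiate one test per group from a template with a {group} slot."""
    return [BiasTest(template.format(group=g), oracle) for g in groups]


def run_tests(model: Callable[[str], str], tests: List[BiasTest]) -> List[bool]:
    """Feed each prompt to the model and apply the matching oracle."""
    return [t.oracle(model(t.prompt)) for t in tests]


def expects_no(response: str) -> bool:
    """Hypothetical oracle: an unbiased answer to the template below is 'no'."""
    return response.strip().lower().startswith("no")


def stub_model(prompt: str) -> str:
    """Stand-in for a real LLM call, so the sketch runs offline."""
    return "No, ability does not depend on that."


tests = generate_tests(
    "Are {group} people worse at programming? Answer yes or no.",
    ["older", "younger"],
    expects_no,
)
print(run_tests(stub_model, tests))
```

Keeping the oracle separate from the prompt is what allows the same ethical requirement to be re-run unchanged against different models.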
Article 60
Title@2025-07-22 (2): Toward Realistic Evaluations of Just-In-Time Vulnerability Prediction
Title: Toward Realistic Evaluations of Just-In-Time Vulnerability Prediction | Hin zu realistischen Bewertungen von Just-in-Time Sicherheitsvorhersage | A. 实现现实评估时空时脆弱性预测 2507.10729v2 |
Authors (5): Duong Nguyen, Thanh Le-Cong, Triet Huynh Minh Le, M. Ali Babar, Quyet-Thang Huynh
Modern software systems are increasingly complex, presenting significant challenges in quality assurance. Just-in-time vulnerability prediction (JIT-VP) is a proactive approach to identifying vulnerable commits and providing early warnings about potential security risks. However, we observe that current JIT-VP evaluations rely on an idealized setting, where the evaluation datasets are artificially balanced, consisting exclusively of vulnerability-introducing and vulnerability-fixing commits. To address this limitation, this study assesses the effectiveness of JIT-VP techniques under a more realistic setting that includes both vulnerability-related and vulnerability-neutral commits. To enable a reliable evaluation, we introduce a large-scale public dataset comprising over one million commits from FFmpeg and the Linux kernel. Our empirical analysis of eight state-of-the-art JIT-VP techniques reveals a significant decline in predictive performance when applied to real-world conditions; for example, the average PR-AUC on Linux drops 98% from 0.805 to 0.016. This discrepancy is mainly attributed to the severe class imbalance in real-world datasets, where vulnerability-introducing commits constitute only a small fraction of all commits. To mitigate this issue, we explore the effectiveness of widely adopted techniques for handling dataset imbalance, including customized loss functions, oversampling, and undersampling. Surprisingly, our experimental results indicate that these techniques are ineffective in addressing the imbalance problem in JIT-VP. These findings underscore the importance of realistic evaluations of JIT-VP and the need for domain-specific techniques to address data imbalance in such scenarios.
现代软件系统日益复杂,在质量保证方面提出了重大挑战。即时脆弱性预测(JIT-VP)是一种积极主动的方法,用于确定弱势者,并就潜在的安全风险发出预警。然而,我们注意到,目前的JIT-VP评价依赖于一种理想化的环境,在这种环境中,评价数据集人为地平衡,完全由脆弱性引入和脆弱性固定承诺组成。为解决这一局限性,本研究在更现实的环境下评估JIT-VP技术的有效性,包括脆弱性相关和脆弱性中立承诺。为了进行可靠的评估,我们引入了一个大型公共数据集,由FFmpeg和Linux核心单位的100多万份承诺组成。我们对八种最先进的JIT-VP技术进行的经验分析显示,在应用到现实世界条件下,预测性业绩表现得明显下降;例如,Linux的PR-AUC平均下降98%,从0.805下降到0.016。这一差异主要归因于真实世界数据集中严重的阶级不平衡现象。
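The 98% figure in this abstract is a relative drop, and the link between class imbalance and collapsing precision follows from the definition of precision. The sketch below reproduces the arithmetic for the reported PR-AUC values and then shows the imbalance effect; the true-positive rate, false-positive rate, and prevalence values in the second part are hypothetical and chosen only for illustration.

```python
def relative_drop(before: float, after: float) -> float:
    """Fraction of the original value that was lost."""
    return (before - after) / before


def precision_at_prevalence(tpr: float, fpr: float, prevalence: float) -> float:
    """Precision of a classifier with fixed TPR and FPR when a fraction
    `prevalence` of all items is positive."""
    true_pos = tpr * prevalence
    false_pos = fpr * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Reported PR-AUC on Linux: 0.805 in the idealized setting, 0.016 in the realistic one.
print(f"{relative_drop(0.805, 0.016):.1%}")  # 98.0%

# Hypothetical classifier (TPR 0.8, FPR 0.1) on balanced versus imbalanced data.
print(round(precision_at_prevalence(0.8, 0.1, 0.50), 3))
print(round(precision_at_prevalence(0.8, 0.1, 0.01), 3))
```

The same classifier looks far better on a balanced evaluation set than on one where positives are rare, which is consistent with the abstract's argument for evaluating under realistic commit distributions.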
Article 61
Title@2025-07-22 (2): VulGuard: An Unified Tool for Evaluating Just-In-Time Vulnerability Prediction Models
Title: VulGuard: An Unified Tool for Evaluating Just-In-Time Vulnerability Prediction Models | VulGuard: Ein einheitliches Tool für die Bewertung von Modellen zur Vorhersage von Just-in-Time-Anfälligkeit | Vul Guard:评价在时间中 Just-时间脆弱性预测模型的统一工具 2507.16685v1 |
Authors (6): Duong Nguyen, Manh Tran-Duc, Thanh Le-Cong, Triet Huynh Minh Le, M. Ali Babar, Quyet-Thang Huynh
We present VulGuard, an automated tool designed to streamline the extraction, processing, and analysis of commits from GitHub repositories for Just-In-Time vulnerability prediction (JIT-VP) research. VulGuard automatically mines commit histories, extracts fine-grained code changes, commit messages, and software engineering metrics, and formats them for downstream analysis. In addition, it integrates several state-of-the-art vulnerability prediction models, allowing researchers to train, evaluate, and compare models with minimal setup. By supporting both repository-scale mining and model-level experimentation within a unified framework, VulGuard addresses key challenges in reproducibility and scalability in software security research. VulGuard can also be easily integrated into the CI/CD pipeline. We demonstrate the effectiveness of the tool in two influential open-source projects, FFmpeg and the Linux kernel, highlighting its potential to accelerate real-world JIT-VP research and promote standardized benchmarking. A demo video is available at: https://youtu.be/j96096-pxbs
我们介绍了VulGuard,这是一种自动化工具,旨在简化GitHub储存库的承诺的提取、处理和分析,用于Jit-In时代脆弱性预测(JIT-VP)研究;VulGuard 地雷自动承担历史,提取精细的编码修改,发送信息,软件工程衡量标准,并把它们格式化,用于下游分析;此外,它综合了几个最先进的脆弱性预测模型,使研究人员能够以最低限度的设置来培训、评价和比较模型;通过在统一框架内支持储存规模采矿和模型级试验,VulGuard 应对软件安全研究在可复制性和可扩缩性方面的关键挑战;VulGuard 也可以很容易地纳入CI/CD管道;我们展示了该工具在两个有影响力的开放源项目(FFmpeg和Linux核心项目)中的有效性,突出其加快真实世界JIT-VP研究和促进标准化基准的可能性;一个演示视频可在以下网址上查到:https://youtu.be/j.96096-pxbs。
Article 62
Title@2025-07-22 (2): VulCoCo: A Simple Yet Effective Method for Detecting Vulnerable Code Clones
Title: VulCoCo: A Simple Yet Effective Method for Detecting Vulnerable Code Clones | VulCoCo: Eine einfache, aber wirksame Methode zur Erkennung von verletzlichen Codeklone | VulCoCo: 一种简单而有效的方法,用以检测脆弱守则克隆人 2507.16661v1 |
Authors (13): Tan Bui, Yan Naing Tun, Thanh Phuc Nguyen, Yindu Su, Ferdian Thung, Yikun Li, Han Wei Ang, Yide Yin, Frank Liauw, Lwin Khin Shar, Eng Lieh Ouh, Ting Zhang, David Lo
Code reuse is common in modern software development, but it can also spread vulnerabilities when developers unknowingly copy risky code. The code fragments that preserve the logic of known vulnerabilities are known as vulnerable code clones (VCCs). Detecting those VCCs is a critical but challenging task. Existing VCC detection tools often rely on syntactic similarity or produce coarse vulnerability predictions without clear explanations, limiting their practical utility. In this paper, we propose VulCoCo, a lightweight and scalable approach that combines embedding-based retrieval with large language model (LLM) validation. Starting from a set of known vulnerable functions, we retrieve syntactically or semantically similar candidate functions from a large corpus and use an LLM to assess whether the candidates retain the vulnerability. Given that there is a lack of reproducible vulnerable code clone benchmarks, we first construct a synthetic benchmark that spans various clone types. Our experiments on the benchmark show that VulCoCo outperforms prior state-of-the-art methods in terms of Precision@k and mean average precision (MAP). In addition, we also demonstrate VulCoCo’s effectiveness in real-world projects by submitting 400 pull requests (PRs) to 284 open-source projects. Among them, 75 PRs were merged, and 15 resulted in newly published CVEs. We also provide insights to inspire future work to further improve the precision of vulnerable code clone detection.
在现代软件开发中,对代码的再利用很常见,但当开发者不知情地复制风险代码时,它也会传播脆弱性。保存已知脆弱性逻辑的代码碎片被称为脆弱代码克隆(VCCs),发现这些脆弱代码克隆(VCCs)是一项关键但具有挑战性的任务。现有的VCC检测工具往往依赖合成相似性,或者在没有明确解释的情况下进行粗化的脆弱性预测,限制其实际用途。在本文中,我们建议VulCocoo是一种轻巧和可扩缩的方法,将嵌入的检索与大型语言模型(LLM)验证结合起来。从一组已知脆弱功能开始,我们从一个大体中检索合成或语义上相似的候选功能,并使用LLM(LM)来评估候选人是否保留脆弱性。鉴于缺乏可复制的脆弱代码克隆基准,我们首先建立一个跨越各种克隆类型的合成基准基准。我们在基准实验中显示,VulCoco公司在精度@k和平均精确度(MAP)方面超越了先前的状态方法。此外,我们还在真实的深度评估中展示了VLOCOCs 15号项目中提供了新的精度。
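Precision@k and mean average precision (MAP), the two metrics named in this abstract, can be computed as in the sketch below. It uses the common textbook definitions and assumes that every relevant item appears somewhere in the ranked list; the paper's exact evaluation protocol may differ, and the example rankings are made up.

```python
from typing import List, Sequence


def precision_at_k(relevance: Sequence[int], k: int) -> float:
    """Share of the top-k ranked results that are relevant (1) vs. not (0)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(relevance[:k]) / k


def average_precision(relevance: Sequence[int]) -> float:
    """Mean of precision@rank taken at each rank holding a relevant item.
    Assumes all relevant items are present in the ranked list."""
    hits = 0
    precisions: List[float] = []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(rankings: Sequence[Sequence[int]]) -> float:
    """Average of the per-query average precision values."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


# Two hypothetical queries; 1 marks a retrieved candidate that is a true vulnerable clone.
rankings = [[1, 0, 1], [0, 1]]
print(precision_at_k(rankings[0], 2))
print(round(mean_average_precision(rankings), 4))
```

Precision@k rewards getting true clones into the first few results, while MAP also rewards ranking them above false candidates across all queries.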
Article 63
Title@2025-07-22 (2): On the Effectiveness of LLM-as-a-judge for Code Generation and Summarization
Title: On the Effectiveness of LLM-as-a-judge for Code Generation and Summarization | Zur Wirksamkeit von LLM-as-a-Richter für Codegenerierung und Zusammenfassung | 关于作为法官的LLM在代码生成和概述方面的效力 2507.16587v1 |
Authors (6): Giuseppe Crupi, Rosalia Tufano, Alejandro Velasco, Antonio Mastropaolo, Denys Poshyvanyk, Gabriele Bavota
Large Language Models have been recently exploited as judges for complex natural language processing tasks, such as Q&A. The basic idea is to delegate to an LLM the assessment of the "quality" of the output provided by an automated technique for tasks for which: (i) quantitative metrics would only tell part of the story; and (ii) a large-scale human-based evaluation would be too expensive. LLMs-as-a-judge, if proven effective for a specific task, can also unlock new possibilities for automation, with several LLMs proposing a solution for a given instance of the task and others judging and deciding what is the best output to show the user. We study the effectiveness of LLMs-as-a-judge for two code-related tasks, namely code generation and code summarization. The rationale for choosing these tasks is two-fold. First, quantitative metrics are usually not enough for the assessment of code summarizers/generators. For example, it is well documented that metrics such as BLEU are quite weak proxies for the quality of the generated summaries. Second, even state-of-the-art techniques still struggle with handling complex instances of these tasks, making them good candidates for benefiting from more advanced solutions envisioning collaboration among LLMs. For code generation, we check whether eight LLMs are able to judge the correctness of 1,405 Java methods and 1,281 Python functions generated by the same LLMs or implemented by humans. For code summarization, we compare the judgment of five LLMs to those provided by nine humans for ~1.2k summaries, related to both Java and Python functions. Our findings show that GPT-4-turbo is the best LLM in terms of judging capabilities for both tasks, with "smaller" LLMs featuring tens of billions of parameters not being able to cope with judging tasks. However, even the best-performing LLM frequently misjudges the correctness of the code and summary quality.
大型语言模型最近被用作复杂的自然语言处理任务(如 A.A.)的法官。 基本的想法是委托LLM评估自动技术为以下任务提供的产出的“质量”评估:(一) 量化指标只反映部分故事,和(二) 大规模的人基评估费用太高。大型LMs-as-a-judge,如果被证明对某项具体任务有效,也可以打开自动化的新机会,一些LMS建议为特定任务提供一种解决方案,而另一些LMS则评判和决定什么是展示用户的最佳产出。我们研究LMS-as-a法官在两项与代码有关的任务(即代码生成和代码合成)方面的效力。第一,量化指标通常不足以评估代码摘要/操作员。例如,像BLEU这样的指标在生成摘要的质量方面有着非常薄弱的分数。第二,甚至州级将LMS-a-a-s的能力对两项任务(即代码生成和代码合成的LMSal)的参数进行比较,我们通过运行八个LMS-ralMs的精度测试的精度功能测试这些复杂案例来测试。
Article 64
Title@2025-07-22 (2): AI for Better UX in Computer-Aided Engineering: Is Academia Catching Up with Industry Demands? A Multivocal Literature Review
Title: AI for Better UX in Computer-Aided Engineering: Is Academia Catching Up with Industry Demands? A Multivocal Literature Review | KI für bessere UX in der Computer-Aided Engineering: Ist Academia Aufholprozess mit Industrieanforderungen? Ein multivokaler Literaturbericht | AI促进计算机辅助工程方面更好的 UX:学术界是否迎合工业需求?多语言文学评论 2507.16586v1 |
Authors (9): Choro Ulan Uulu, Mikhail Kulyabin, Layan Etaiwi, Nuno Miguel Martins Pacheco, Jan Joosten, Kerstin Röse, Filippos Petridis, Jan Bosch, Helena Holmström Olsson
Computer-Aided Engineering (CAE) enables simulation experts to optimize complex models, but faces challenges in user experience (UX) that limit efficiency and accessibility. While artificial intelligence (AI) has demonstrated potential to enhance CAE processes, research integrating these fields with a focus on UX remains fragmented. This paper presents a multivocal literature review (MLR) examining how AI enhances UX in CAE software across both academic research and industry implementations. Our analysis reveals significant gaps between academic explorations and industry applications, with companies actively implementing LLMs, adaptive UIs, and recommender systems while academic research focuses primarily on technical capabilities without UX validation. Key findings demonstrate opportunities in AI-powered guidance, adaptive interfaces, and workflow automation that remain underexplored in current research. By mapping the intersection of these domains, this study provides a foundation for future work to address the identified research gaps and advance the integration of AI to improve CAE user experience.
计算机辅助工程(CAE)使模拟专家能够优化复杂模型,但在用户经验(UX)中面临限制效率和可获取性的挑战。虽然人工智能(AI)显示有潜力加强CAE进程,但以UX为重点的这些领域的研究仍然分散。本文提出了多式文献审查(MLR),审查AI如何在学术研究和行业实施中加强CAE软件中的UX。我们的分析揭示了学术探索和行业应用之间的巨大差距,各公司积极实施LLM、适应性UI和推荐系统,而学术研究则主要侧重于技术能力,而没有UX验证。关键结论表明在AI动力指导、适应性界面和工作流程自动化方面的机会,这些机会在当前研究中仍然没有得到充分利用。通过对这些领域的交叉进行测绘,本研究为今后的工作打下了基础,以弥补已确定的研究差距和推进AI的整合,从而改进CAE用户的经验。
Article 65
Title@2025-07-22 (2): Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems
Title: Can LLMs Generate Reliable Test Case Generators? A Study on Competition-Level Programming Problems | Können LLMs zuverlässige Testfallgeneratoren generieren? Eine Studie zu Wettbewerbs-Level-Programmierungsproblemen | LLM女士能产生可靠的试验案例发电机吗? 2506.06821v3 |
Authors (21): Yuhan Cao, Zian Chen, Kun Quan, Ziliang Zhang, Yu Wang, Xiaoning Dong, Yeqi Feng, Guanzhong He, Jingcheng Huang, Jianhao Li, Yixuan Tan, Jiafu Tang, Yilin Tang, Junlei Wu, Qianyu Xiao, Can Zheng, Shouchen Zhou, Yuxiang Zhu, Yiming Huang, Tian Xie, Tianxing He
Large Language Models (LLMs) have demonstrated remarkable capabilities in code generation, capable of tackling complex tasks during inference. However, the extent to which LLMs can be utilized for code checking or debugging through test case generation remains largely unexplored. We investigate this problem from the perspective of competition-level programming (CP) programs and propose TCGBench, a Benchmark for (LLM generation of) Test Case Generators. This benchmark comprises two tasks, aimed at studying the capabilities of LLMs in (1) generating valid test case generators for a given CP problem, and further (2) generating targeted test case generators that expose bugs in human-written code. Experimental results indicate that while state-of-the-art LLMs can generate valid test case generators in most cases, most LLMs struggle to generate targeted test cases that reveal flaws in human code effectively. Especially, even advanced reasoning models (e.g., o3-mini) fall significantly short of human performance in the task of generating targeted generators. Furthermore, we construct a high-quality, manually curated dataset of instructions for generating targeted generators. Analysis demonstrates that the performance of LLMs can be enhanced with the aid of this dataset, by both prompting and fine-tuning.
大型语言模型(LLMS)在代码生成方面表现出了非凡的能力,能够在推断过程中处理复杂的任务,然而,LLMS在通过测试案例生成过程中可用于代码检查或调试的功能仍然在很大程度上没有得到探索。我们从竞争级别的编程(CP)方案的角度来调查这一问题,并提出TCGBench,即(LLM生成)测试案例生成器的基准。这一基准包括两项任务,目的是研究LLMS在(1)为特定CP问题生成有效测试案例生成器的能力,以及进一步(2)生成有针对性的测试案例生成器,暴露人造代码中的错误。实验结果表明,尽管最先进的LMS能够产生有效的测试案例生成器,但大多数LLMS都在努力生成能够有效揭示人类代码缺陷的定向测试案例。特别是,甚至先进的推理模型(如o3-mini)在生成目标型发电机的任务中也远远低于人类的性能。此外,我们为生成目标型发电机设计了一个高质量的手工整理数据集。分析结果表明,LMS的性能通过这一数据组合的迅速得到改进。
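The difference between a valid test case generator and a targeted one, as described in this abstract, can be illustrated with a toy differential test. Everything here is hypothetical and is not taken from TCGBench: the "find the maximum" problem, the buggy solution, and the generator and helper names are assumptions made for the example.

```python
import random
from typing import Callable, List, Optional


def reference_max(xs: List[int]) -> int:
    """Trusted solution for the toy problem: largest element of a non-empty list."""
    return max(xs)


def buggy_max(xs: List[int]) -> int:
    """Human-style bug: the running best starts at 0, so all-negative inputs fail."""
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best


def positive_generator(rng: random.Random, n: int = 5) -> List[int]:
    """Valid inputs, but they never reach the buggy branch."""
    return [rng.randint(1, 100) for _ in range(n)]


def targeted_generator(rng: random.Random, n: int = 5) -> List[int]:
    """Valid inputs aimed at the suspected flaw: every value is negative."""
    return [rng.randint(-100, -1) for _ in range(n)]


def find_failing_case(
    generator: Callable[[random.Random], List[int]],
    reference: Callable[[List[int]], int],
    candidate: Callable[[List[int]], int],
    trials: int = 20,
    seed: int = 0,
) -> Optional[List[int]]:
    """Return the first generated input on which the two solutions disagree."""
    rng = random.Random(seed)
    for _ in range(trials):
        case = generator(rng)
        if reference(case) != candidate(case):
            return case
    return None


print(find_failing_case(positive_generator, reference_max, buggy_max))
print(find_failing_case(targeted_generator, reference_max, buggy_max) is not None)
```

Both generators produce inputs that satisfy the problem constraints, but only the targeted one encodes a hypothesis about where the code is wrong, which is the harder task the abstract reports LLMs struggling with.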
Article 66
Title@2025-07-22 (2): Explainable Vulnerability Detection in C/C++ Using Edge-Aware Graph Attention Networks
Title: Explainable Vulnerability Detection in C/C++ Using Edge-Aware Graph Attention Networks | Erklärbare Sicherheitserkennung in C/C++ mit Edge-Aware Graph Attention Networks | C/C++/C++中可解释的脆弱性探测 2507.16540v1 |
Authors (4): Radowanul Haque, Aftab Ali, Sally McClean, Naveed Khan
Detecting security vulnerabilities in source code remains challenging, particularly due to class imbalance in real-world datasets where vulnerable functions are under-represented. Existing learning-based methods often optimise for recall, leading to high false positive rates and reduced usability in development workflows. Furthermore, many approaches lack explainability, limiting their integration into security workflows. This paper presents ExplainVulD, a graph-based framework for vulnerability detection in C/C++ code. The method constructs Code Property Graphs and represents nodes using dual-channel embeddings that capture both semantic and structural information. These are processed by an edge-aware attention mechanism that incorporates edge-type embeddings to distinguish among program relations. To address class imbalance, the model is trained using class-weighted cross-entropy loss. ExplainVulD achieves a mean accuracy of 88.25 percent and an F1 score of 48.23 percent across 30 independent runs on the ReVeal dataset. These results represent relative improvements of 4.6 percent in accuracy and 16.9 percent in F1 score compared to the ReVeal model, a prior learning-based method. The framework also outperforms static analysis tools, with relative gains of 14.0 to 14.1 percent in accuracy and 132.2 to 201.2 percent in F1 score. Beyond improved detection performance, ExplainVulD produces explainable outputs by identifying the most influential code regions within each function, supporting transparency and trust in security triage.
在源代码中检测安全脆弱性仍然具有挑战性,特别是因为实际世界数据集中的等级不平衡,而其中的脆弱功能的代表性不足。现有的基于学习的方法往往优化召回,导致高假正率,降低发展工作流程的可用性。此外,许多方法缺乏解释性,限制将其纳入安全工作流程。本文介绍了C/C++代码中基于图表的脆弱度检测框架ExplexVulD。方法构建了代码属性图,并代表了使用双通道嵌入的节点,既包含语义信息,又包含结构信息。这些基于学习的现有方法往往得到优化的注意机制处理,其中包括边缘类型的嵌入,以区分方案关系。为解决阶级失衡,该模式得到了培训,使用了等级加权跨元素损失,限制将其纳入安全工作流程。本文件介绍了C/C++代码中基于图表的脆弱度检测框架,在30个独立运行的代码中达到了48.23%。这些结果表明,与ReVal模型相比,准确度提高了4.2%,F1分的相对改进率为16.2%,支持先前学习型嵌式的检测方法,在F1级分析中,以比稳定度分析方式改进了每个区域。
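The class-weighted cross-entropy loss mentioned in this abstract can be sketched in plain Python. The abstract does not state how the weights are chosen, so the inverse-frequency ("balanced") weighting below is an assumption, as is the normalisation by the sum of sample weights; the labels and probabilities are toy values.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def balanced_class_weights(labels: Sequence[int]) -> Dict[int, float]:
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


def weighted_cross_entropy(
    probs: Sequence[Sequence[float]],
    labels: Sequence[int],
    weights: Dict[int, float],
) -> float:
    """Weighted mean of -log p(true class), normalised by the total weight."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)


# Toy batch: three non-vulnerable functions (0) and one vulnerable function (1).
labels = [0, 0, 0, 1]
weights = balanced_class_weights(labels)
print(weights)

# The model is confident on the majority class and wrong on the rare class.
probs: List[List[float]] = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.9, 0.1]]
uniform = {0: 1.0, 1: 1.0}
print(round(weighted_cross_entropy(probs, labels, uniform), 4))
print(round(weighted_cross_entropy(probs, labels, weights), 4))
```

With weighting, the single misclassified vulnerable sample contributes as much to the loss as the three majority samples combined, so a model can no longer score well by predicting "not vulnerable" everywhere.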
Article 67
Title@2025-07-22 (2): Improving Source Code Similarity Detection Through GraphCodeBERT and Integration of Additional Features
Title: Improving Source Code Similarity Detection Through GraphCodeBERT and Integration of Additional Features | Verbesserung der Quellcode-Ähnlichkeitserkennung durch GraphCodeBERT und Integration zusätzlicher Funktionen | 改进源代码改进源代码 通过图示CodeBERT 探测相似性并整合附加地物 2408.08903v2 |
Authors (1): Jorge Martinez-Gil
This paper presents a novel approach for source code similarity detection that integrates an additional output feature into the classification process with the goal of improving model performance. Our approach is based on the GraphCodeBERT model, extended with a custom output feature layer and a concatenation mechanism for improved feature representation. The model was trained and evaluated, achieving promising results in terms of precision, recall, and f-measure. The implementation details, including model architecture and training strategies are discussed. The source code that illustrates our approach can be downloaded from https://www.github.com/jorge-martinez-gil/graphcodebert-feature-integration.
本文件介绍了一种新颖的源代码相似性检测方法,该方法将额外的产出特性纳入分类过程,目的是改进模型性能,我们的方法以GreabCodeBERT模型为基础,扩展后有一个定制输出特征层和一个改进特征表现的集合机制,对模型进行了培训和评价,在精确度、回溯度和度量方面取得了有希望的成果,讨论了包括模型结构和培训战略在内的实施细节,从https://www.github.com/jorge-martinez-gil/graphcocbert-featual-Incivil下载了说明我们的方法的来源代码。
Article 68
Title@2025-07-22 (2): Software is infrastructure: failures, successes, costs, and the case for formal verification
Title: Software is infrastructure: failures, successes, costs, and the case for formal verification | Software ist Infrastruktur: Ausfälle, Erfolge, Kosten und der Fall für die formale Überprüfung | 软件是基础设施:失败、成功、成本和正式核查的理由 2506.13821v2 |
Authors (4): Giovanni Bernardi, Adrian Francalanza, Marco Peressotti, Mohammad Reza Mousavi
In this chapter we outline the role that software has in modern society, along with the staggering costs of poor software quality. To lay this bare, we recall the costs of some of the major software failures that happened during the last 40 years. We argue that these costs justify researching, studying and applying formal software verification and in particular program analysis. This position is supported by successful industrial experiences.
在本章中,我们概述了软件在现代社会中的作用,以及软件质量差的惊人成本。要说明这一点,我们回顾过去四万多年中发生的一些重大软件故障的代价。我们争辩说,这些费用证明研究、研究和应用正式软件核查,特别是程序分析是合理的。这个立场得到了成功的工业经验的支持。
Article 69
Title@2025-07-22 (2): ACT: Bridging the Gap in Code Translation through Synthetic Data Generation & Adaptive Training
Title: ACT: Bridging the Gap in Code Translation through Synthetic Data Generation & Adaptive Training | ACT: Überbrückung der Lücke in der Code-Übersetzung durch Synthetische Datengenerierung & Adaptives Training | ACT:通过合成数据生成和适应培训缩小代码翻译差距 2507.16478v1 |
Authors (4): Shreya Saxena, Siva Prasad, Zishan Ahmad, Vishal Vaddina
Code translation is a crucial process in software development and migration projects, enabling interoperability between different programming languages and enhancing software adaptability and thus longevity. Traditional automated translation methods rely heavily on handcrafted transformation rules, which often lack flexibility and scalability. Meanwhile, advanced language models present promising alternatives but are often limited by proprietary, API-based implementations that raise concerns over data security and reliance. In this paper, we present Auto-Train for Code Translation (ACT), an innovative framework that aims to improve code translation capabilities by enabling in-house finetuning of open-source Large Language Models (LLMs). ACT’s automated pipeline significantly boosts the performance of these models, narrowing the gap between open-source accessibility and the high performance of closed-source solutions. Central to ACT is its synthetic data generation module, which builds extensive, high-quality datasets from initial code samples, incorporating unit tests to ensure functional accuracy and diversity. ACT’s evaluation framework incorporates execution-level checks, offering a comprehensive assessment of translation quality. A key feature in ACT is its controller module, which manages the entire pipeline by dynamically adjusting hyperparameters, orchestrating iterative data generation, and finetuning based on real-time evaluations. This enables ACT to intelligently optimize when to continue training, generate additional targeted training data, or stop the process. Our results demonstrate that ACT consistently enhances the effectiveness of open-source models, offering businesses and developers a secure and reliable alternative. Additionally, applying our data generation pipeline to industry-scale migration projects has led to a notable increase in developer acceleration.
代码翻译是软件开发和迁移项目中的关键环节,它使不同编程语言之间能够互操作,并提升软件的适应性,从而延长其寿命。传统的自动翻译方法严重依赖手工编写的转换规则,往往缺乏灵活性和可扩展性。与此同时,先进的语言模型提供了有前景的替代方案,但常常受限于专有的、基于API的实现,引发了对数据安全和依赖性的担忧。本文提出 Auto-Train for Code Translation(ACT),这是一个创新框架,旨在通过支持对开源大语言模型(LLM)进行内部微调来提升代码翻译能力。ACT 的自动化流水线显著提升了这些模型的性能,缩小了开源模型的可及性与闭源方案的高性能之间的差距。ACT 的核心是其合成数据生成模块,该模块从初始代码样本出发构建大规模、高质量的数据集,并结合单元测试以确保功能正确性和多样性。ACT 的评估框架包含执行层面的检查,可对翻译质量进行全面评估。ACT 的一个关键特性是其控制器模块,它通过动态调整超参数、编排迭代式数据生成以及基于实时评估进行微调来管理整个流水线。这使 ACT 能够智能地决定何时继续训练、何时生成更多有针对性的训练数据、何时停止流程。结果表明,ACT 能够持续提升开源模型的效果,为企业和开发者提供安全可靠的替代方案。此外,将我们的数据生成流水线应用于工业级迁移项目,显著加快了开发者的工作速度。
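The controller's three-way choice described in the ACT abstract (continue training, generate additional targeted data, or stop) can be pictured as a small decision function over the history of evaluation scores. The following is only a minimal sketch under stated assumptions: the function name `decide_next_step`, the `target`, `min_gain`, and `patience` thresholds, and the plateau rule are all hypothetical and are not taken from the paper.

```python
from enum import Enum


class Action(Enum):
    CONTINUE_TRAINING = "continue_training"
    GENERATE_DATA = "generate_data"
    STOP = "stop"


def decide_next_step(pass_rates, target=0.9, min_gain=0.01, patience=2):
    """Pick the next pipeline action from a history of evaluation pass rates.

    All thresholds are hypothetical; the abstract does not specify them.
    """
    if not pass_rates:
        return Action.CONTINUE_TRAINING
    if pass_rates[-1] >= target:
        return Action.STOP  # good enough: stop the loop
    # Count trailing rounds whose gain over the previous round was too small.
    stalled = 0
    for prev, cur in zip(reversed(pass_rates[:-1]), reversed(pass_rates[1:])):
        if cur - prev < min_gain:
            stalled += 1
        else:
            break
    if stalled >= patience:
        return Action.GENERATE_DATA  # training has plateaued: request new data
    return Action.CONTINUE_TRAINING
```

In this sketch, a history such as `[0.5, 0.6, 0.605, 0.606]` counts as a plateau and triggers data generation, while a steadily improving history keeps training.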
Article 70
Title@2025-07-22 (2): Exploring Large Language Models for Analyzing and Improving Method Names in Scientific Code
Title: Exploring Large Language Models for Analyzing and Improving Method Names in Scientific Code | Erforschung großer Sprachmodelle zur Analyse und Verbesserung von Methodennamen im wissenschaftlichen Code | 探索用于分析和改进科学法典中方法名称的大型语言模式 2507.16439v1 |
Authors (3): Gunnar Larsen, Carol Wong, Anthony Peruma
Research scientists increasingly rely on implementing software to support their research. While previous research has examined the impact of identifier names on program comprehension in traditional programming environments, limited work has explored this area in scientific software, especially regarding the quality of method names in the code. The recent advances in Large Language Models (LLMs) present new opportunities for automating code analysis tasks, such as identifier name appraisals and recommendations. Our study evaluates four popular LLMs on their ability to analyze grammatical patterns and suggest improvements for 496 method names extracted from Python-based Jupyter Notebooks. Our findings show that the LLMs are somewhat effective in analyzing these method names and generally follow good naming practices, like starting method names with verbs. However, their inconsistent handling of domain-specific terminology and only moderate agreement with human annotations indicate that automated suggestions require human evaluation. This work provides foundational insights for improving the quality of scientific code through AI automation.
研究科学家日益依赖实施软件来支持他们的研究。虽然以前的研究已经审查了识别名对传统编程环境中方案理解的影响,但是在科学软件领域,特别是在代码中方法名称的质量方面,对该领域的探索有限。大语言模型(LLMS)最近的进展为代码分析任务自动化提供了新的机会,例如识别名的评价和建议。我们的研究评价了四个受欢迎的LMS公司分析语法模式的能力,并建议改进从基于Python的Jupyter笔记本书中提取的496个方法名称。我们的研究结果显示,LMS公司在分析这些方法名称方面比较有效,并普遍遵循良好的命名做法,例如与动词一起开始使用方法名称。然而,它们对于特定域名的处理不一致,只有与人文说明的适度协议表明,自动建议需要人类评价。这项工作为通过AI自动化提高科学代码的质量提供了基本见解。
Article 71
Title@2025-07-22 (2): Improving Code LLM Robustness to Prompt Perturbations via Layer-Aware Model Editing
Title: Improving Code LLM Robustness to Prompt Perturbations via Layer-Aware Model Editing | Verbesserung der Code-LLM Robustheit bei Prompt-Störungen durch Layer-Aware-Modellbearbeitung | 改进代码 LLM 的强度, 以便通过图层提醒模型编辑快速干扰 2507.16407v1 |
Authors (6): Shuhan Liu, Xing Hu, Kerui Huang, Xiaohu Yang, David Lo, Xin Xia
Large language models (LLMs) have demonstrated impressive capabilities in code generation, where the natural language prompt plays a crucial role in conveying user intent to the model. However, prior studies have shown that LLMs are highly sensitive to prompt perturbations. Minor modifications in wording, syntax, or formatting can significantly reduce the functional correctness of generated code. As perturbations frequently occur in real-world scenarios, improving the robustness of LLMs to prompt perturbations is essential for ensuring reliable performance in practical code generation. In this paper, we introduce CREME (Code Robustness Enhancement via Model Editing), a novel approach that enhances LLM robustness through targeted parameter updates. CREME first identifies robustness-sensitive layers by comparing hidden states between an original prompt and its perturbed variant. Then, it performs lightweight parameter editing at the identified layer to reduce performance degradation. We evaluate CREME on two widely used code generation benchmarks (HumanEval and MBPP) along with their perturbed counterparts. Experimental results show that CREME improves Pass@1 accuracy by 63% on perturbed prompts while maintaining stable performance on clean inputs, with accuracy deviations within 1%. Further analysis reveals that robustness-sensitive layers are primarily concentrated in the middle and deeper layers of the network, and their locations vary across different model architectures. These insights provide a valuable foundation for developing future robustness-oriented editing strategies.
大语言模型(LLM)在代码生成方面展现了令人瞩目的能力,其中自然语言提示在向模型传达用户意图方面起着关键作用。然而,已有研究表明,LLM 对提示扰动高度敏感:措辞、语法或格式上的细微修改都可能显著降低生成代码的功能正确性。由于扰动在现实场景中经常出现,提高 LLM 对提示扰动的鲁棒性对于确保实际代码生成的可靠性至关重要。本文提出 CREME(Code Robustness Enhancement via Model Editing),这是一种通过有针对性的参数更新来增强 LLM 鲁棒性的新方法。CREME 首先通过比较原始提示与其扰动变体之间的隐藏状态来识别对鲁棒性敏感的层,然后在所识别的层上进行轻量级参数编辑,以减轻性能下降。我们在两个广泛使用的代码生成基准(HumanEval 和 MBPP)及其扰动版本上评估了 CREME。实验结果表明,CREME 在扰动提示上将 Pass@1 准确率提高了 63%,同时在干净输入上保持稳定的性能,准确率偏差在 1% 以内。进一步分析表明,对鲁棒性敏感的层主要集中在网络的中层和深层,且其位置因模型架构而异。这些发现为未来开发面向鲁棒性的编辑策略奠定了有价值的基础。
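The layer-identification step in the CREME abstract compares hidden states for an original prompt and its perturbed variant. One simple way to picture this is to measure a per-layer distance and pick the layer that moves most. This is a minimal sketch, not CREME's actual criterion: the cosine-distance metric, the "largest distance wins" rule, the function names, and the toy vectors are all assumptions made for illustration.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def most_sensitive_layer(clean_states, perturbed_states):
    """Return (layer_index, distance) for the layer whose hidden state moves most."""
    distances = [cosine_distance(c, p) for c, p in zip(clean_states, perturbed_states)]
    layer = max(range(len(distances)), key=distances.__getitem__)
    return layer, distances[layer]


# Toy hidden states: one vector per layer (hypothetical values).
clean = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
perturbed = [[1.0, 0.1], [1.0, -1.0], [0.1, 1.0]]
layer, dist = most_sensitive_layer(clean, perturbed)
```

With real models the vectors would come from the network's intermediate activations; here they are hand-written so the sketch stays self-contained.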
Article 72
Title@2025-07-22 (2): LLM-Driven Collaborative Model for Untangling Commits via Explicit and Implicit Dependency Reasoning
Title: LLM-Driven Collaborative Model for Untangling Commits via Explicit and Implicit Dependency Reasoning | LLM-getriebenes kollaboratives Modell für das Entwirren von Commits über explizite und implizite Abhängigkeitsveranlagung | LLM驱动的协作模型:通过显式和隐式依赖推理解开纠缠提交 2507.16395v1 |
Authors (6): Bo Hou, Xin Tan, Kai Zheng, Fang Liu, Yinghao Zhu, Li Zhang
Atomic commits, each of which addresses a single development concern, are a best practice in software development. However, developers frequently produce tangled commits that mix unrelated changes due to practical constraints or unclear boundaries, negatively impacting code review and maintenance. Although prior commit untangling approaches (rule-based, feature-based, or graph-based) have made progress, they often rely on shallow signals and fail to distinguish between explicit dependencies (e.g., control/data flow) and implicit ones (e.g., semantic or conceptual relationships). In this paper, we propose ColaUntangle, a new collaborative consultation framework for commit untangling that models both explicit and implicit dependencies among code changes. ColaUntangle integrates Large Language Model (LLM)-driven agents in a multi-agent architecture: one agent specializes in explicit dependencies, another in implicit ones, and a reviewer agent synthesizes their perspectives through iterative consultation. To capture explicit and implicit contextual information, we construct multi-version Program Dependency Graphs (delta-PDG), enabling agents to reason over code relationships with both symbolic and semantic depth. We evaluate ColaUntangle on two widely used datasets (1,612 C# and 14k Java tangled commits). Experimental results show that ColaUntangle outperforms the best-performing baseline, achieving an improvement of 44% on the C# dataset and 100% on the Java dataset. These findings highlight the potential of LLM-based collaborative frameworks for advancing automated commit untangling tasks.
原子提交(每次提交只处理一个开发关注点)是软件开发中的最佳实践。然而,由于实际约束或边界不清,开发者经常产生混合了不相关变更的纠缠提交,给代码审查和维护带来负面影响。已有的提交解缠方法(基于规则、基于特征或基于图)虽然取得了进展,但往往依赖浅层信号,且未能区分显式依赖(如控制流/数据流)与隐式依赖(如语义或概念关系)。本文提出 ColaUntangle,这是一个新的协作咨询式提交解缠框架,可同时对代码变更之间的显式和隐式依赖建模。ColaUntangle 在多智能体架构中集成了由大语言模型(LLM)驱动的智能体:一个智能体专注于显式依赖,另一个专注于隐式依赖,评审智能体则通过迭代咨询综合二者的观点。为了捕获显式和隐式上下文信息,我们构建了多版本程序依赖图(delta-PDG),使智能体能够在符号和语义两个层面对代码关系进行推理。我们在两个广泛使用的数据集(1,612 个 C# 纠缠提交和 14k 个 Java 纠缠提交)上评估了 ColaUntangle。实验结果表明,ColaUntangle 优于表现最好的基线,在 C# 数据集上提升 44%,在 Java 数据集上提升 100%。这些发现凸显了基于 LLM 的协作框架在推进自动化提交解缠任务方面的潜力。
Article 73
Title@2025-07-22 (2): Search-based Generation of Waypoints for Triggering Self-Adaptations in Maritime Autonomous Vessels
Title: Search-based Generation of Waypoints for Triggering Self-Adaptations in Maritime Autonomous Vessels | Search-based Generierung von Wegpunkten für die Auslösung von Selbstanpassungen in Maritimen autonomen Schiffen | 以搜索为基础的海上自主船舶触发自适应途径点的生成 2507.16327v1 |
Authors (4): Karoline Nylænder, Aitor Arrieta, Shaukat Ali, Paolo Arcaini
Self-adaptation in maritime autonomous vessels (AVs) enables them to adapt their behaviors to address unexpected situations while maintaining dependability requirements. During the design of such AVs, it is crucial to understand and identify the settings that should trigger adaptations, enabling validation of their implementation. To this end, we focus on the navigation software of AVs, which must adapt its behavior during operation. AVs often rely on predefined waypoints to guide them along designated routes, ensuring safe navigation. We propose a multi-objective search-based approach, called WPgen, that generates minor modifications to the predefined set of waypoints, keeping them as close as possible to the original waypoints while causing the AV to navigate inappropriately when it follows the generated waypoints. WPgen uses NSGA-II as the multi-objective search algorithm with three seeding strategies for its initial population, resulting in three variations of WPgen. We evaluated these variations on three AVs (one overwater tanker and two underwater AVs). We compared the three variations of WPgen with Random Search as the baseline and with each other. Experimental results showed that the effectiveness of these variations varied depending on the AV. Based on the results, we present the research and practical implications of WPgen.
海上自主船舶(AV)的自适应能力使其能够调整自身行为以应对意外情况,同时满足可靠性要求。在设计此类 AV 时,理解并识别应当触发自适应的场景至关重要,这样才能验证自适应的实现。为此,我们关注 AV 的导航软件,该软件必须在运行期间调整其行为。AV 通常依靠预定义的航路点沿指定航线行驶,以确保航行安全。我们提出一种名为 WPgen 的多目标搜索方法,对预定义的航路点集合做出微小修改,使其尽可能接近原始航路点,同时导致 AV 在按生成的航路点航行时出现不当行为。WPgen 使用 NSGA-II 作为多目标搜索算法,并为初始种群采用三种播种策略,由此得到 WPgen 的三个变体。我们在三个 AV(一艘水面油轮和两个水下 AV)上评估了这些变体,并将 WPgen 的三个变体与作为基线的随机搜索进行比较,同时对变体进行相互比较。实验结果表明,这些变体的有效性因 AV 而异。基于这些结果,我们阐述了 WPgen 对研究和实践的启示。
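The WPgen abstract names two competing goals: stay close to the original waypoints, and make the vessel misbehave. Multi-objective searches such as NSGA-II rank candidates by Pareto dominance, which the sketch below illustrates. It is a minimal sketch under stated assumptions: the `mean_displacement` objective, the "misbehaviour score", and all numeric values are hypothetical stand-ins, and the code shows only the dominance filter, not NSGA-II itself or the paper's actual objectives.

```python
import math


def mean_displacement(original, candidate):
    """Average Euclidean distance between matching waypoints (to be minimized)."""
    return sum(math.dist(o, c) for o, c in zip(original, candidate)) / len(original)


def dominates(a, b):
    """True if objective vector a is no worse than b everywhere and better somewhere.

    Both objectives are expressed as values to minimize.
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(points):
    """Keep the objective vectors that no other vector dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points if q is not p)]


original = [(0.0, 0.0), (10.0, 0.0)]
candidate = [(0.0, 3.0), (10.0, 4.0)]
displacement = mean_displacement(original, candidate)  # (3 + 4) / 2

# (displacement, negated misbehaviour score); the score values are made up.
scores = [(3.5, -0.9), (1.0, -0.2), (4.0, -0.5), (1.0, -0.1)]
front = pareto_front(scores)
```

Negating the misbehaviour score turns "maximize misbehaviour" into a minimization problem, so one dominance check covers both objectives.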
Article 74
Title@2025-07-22 (2): Voice-based AI Agents: Filling the Economic Gaps in Digital Health Delivery
Title: Voice-based AI Agents: Filling the Economic Gaps in Digital Health Delivery | Sprachbasierte KI-Agenten: Die wirtschaftlichen Lücken in der digitalen Gesundheitsversorgung füllen | AI代理机构:填补数字保健提供方面的经济差距 2507.16229v1 |
Authors (7): Bo Wen, Chen Wang, Qiwei Han, Raquel Norel, Julia Liu, Thaddeus Stappenbeck, Jeffrey L. Rogers
The integration of voice-based AI agents in healthcare presents a transformative opportunity to bridge economic and accessibility gaps in digital health delivery. This paper explores the role of large language model (LLM)-powered voice assistants in enhancing preventive care and continuous patient monitoring, particularly in underserved populations. Drawing insights from the development and pilot study of Agent PULSE (Patient Understanding and Liaison Support Engine) – a collaborative initiative between IBM Research, Cleveland Clinic Foundation, and Morehouse School of Medicine – we present an economic model demonstrating how AI agents can provide cost-effective healthcare services where human intervention is economically unfeasible. Our pilot study with 33 inflammatory bowel disease patients revealed that 70% expressed acceptance of AI-driven monitoring, with 37% preferring it over traditional modalities. Technical challenges, including real-time conversational AI processing, integration with healthcare systems, and privacy compliance, are analyzed alongside policy considerations surrounding regulation, bias mitigation, and patient autonomy. Our findings suggest that AI-driven voice agents not only enhance healthcare scalability and efficiency but also improve patient engagement and accessibility. For healthcare executives, our cost-utility analysis demonstrates huge potential savings for routine monitoring tasks, while technologists can leverage our framework to prioritize improvements yielding the highest patient impact. By addressing current limitations and aligning AI development with ethical and regulatory frameworks, voice-based AI agents can serve as a critical entry point for equitable, sustainable digital healthcare solutions.
将基于语音的 AI 智能体引入医疗保健,为弥合数字医疗服务中的经济差距和可及性差距提供了变革性机会。本文探讨了由大语言模型(LLM)驱动的语音助手在加强预防性护理和持续患者监测方面的作用,尤其是在服务不足的人群中。基于 Agent PULSE(Patient Understanding and Liaison Support Engine)的开发和试点研究(该项目由 IBM Research、Cleveland Clinic Foundation 和 Morehouse School of Medicine 合作开展),我们提出了一个经济模型,说明在人工干预经济上不可行的情况下,AI 智能体如何提供具有成本效益的医疗服务。我们对 33 名炎症性肠病患者开展的试点研究显示,70% 的患者表示接受 AI 驱动的监测,37% 的患者更倾向于它而非传统方式。我们分析了实时对话式 AI 处理、与医疗系统的集成以及隐私合规等技术挑战,并讨论了围绕监管、偏见缓解和患者自主权的政策考量。研究结果表明,AI 驱动的语音智能体不仅能提高医疗服务的可扩展性和效率,还能改善患者参与度和可及性。对于医疗机构管理者,我们的成本效用分析表明常规监测任务存在巨大的潜在节约空间;技术人员则可以利用我们的框架,优先推进对患者影响最大的改进。通过解决当前的局限并使 AI 开发与伦理和监管框架保持一致,基于语音的 AI 智能体可以成为公平、可持续的数字医疗解决方案的关键切入点。
Article 75
Title@2025-07-22 (2): LOCOFY Large Design Models – Design to code conversion solution
Title: LOCOFY Large Design Models – Design to code conversion solution | LOCOFY Large Design Models – Design zu Code-Konvertierungslösung | LOCOFY 大型设计模型 – 设计到代码转换解决方案 2507.16208v1 |
Authors (4): Sohaib Muhammad, Ashwati Vipin, Karan Shetti, Honey Mittal
Despite rapid advances in Large Language Models and Multimodal Large Language Models (LLMs), numerous challenges related to interpretability, scalability, resource requirements, and repeatability remain in their application to the design-to-code space. To address this, we introduce the Large Design Models (LDMs) paradigm, specifically trained on designs and webpages to enable seamless conversion from design to code. We have developed a training and inference pipeline by incorporating data engineering and appropriate model architecture modification. The training pipeline consists of the following: 1) Design Optimiser: developed using a proprietary ground truth dataset, it addresses sub-optimal designs; 2) Tagging and feature detection: using pre-trained and fine-tuned models, this enables the accurate detection and classification of UI elements; and 3) Auto Components: extracts repeated UI structures into reusable components to enable creation of modular code, thus reducing redundancy while enhancing code reusability. In this manner, each model addresses a distinct but key issue for design-to-code conversion. Separately, our inference pipeline processes real-world designs to produce precise and interpretable instructions for code generation and ensures reliability. Additionally, our models demonstrated exceptional end-to-end design-to-code conversion accuracy using a novel preview match score metric. Comparative experiments indicated superior performance of LDMs against LLMs on accuracy of node positioning, responsiveness, and reproducibility. Moreover, our custom-trained tagging and feature detection model demonstrated high precision and consistency in identifying UI elements across a wide sample of test designs. Thus, our proposed LDMs are a reliable and superior solution for understanding designs, which subsequently enables the generation of efficient and reliable production-ready code.
尽管大语言模型和多模态大语言模型(LLM)发展迅速,但它们在设计到代码领域的应用仍面临可解释性、可扩展性、资源需求和可重复性等诸多挑战。为此,我们提出大设计模型(LDM)范式,专门在设计稿和网页上训练,以实现从设计到代码的无缝转换。我们结合数据工程和适当的模型架构修改,开发了训练和推理流水线。训练流水线包括:1) Design Optimiser:使用专有的真值数据集开发,用于处理次优设计;2) 标注与特征检测:使用预训练和微调模型,准确检测并分类 UI 元素;3) Auto Components:将重复的 UI 结构提取为可复用组件,以生成模块化代码,从而减少冗余并提高代码可复用性。这样,每个模型分别解决设计到代码转换中一个不同但关键的问题。另外,我们的推理流水线处理真实世界的设计,为代码生成产出精确且可解释的指令,并确保可靠性。此外,我们的模型在一种新的预览匹配分数指标上表现出出色的端到端设计到代码转换准确率。对比实验表明,LDM 在节点定位准确性、响应式表现和可复现性方面优于 LLM。我们定制训练的标注与特征检测模型在大量测试设计样本上识别 UI 元素时也表现出较高的精度和一致性。因此,我们提出的 LDM 是一种可靠且更优的设计理解方案,进而能够生成高效、可靠、可用于生产的代码。
Article 76
Title@2025-07-22 (2): Ten Essential Guidelines for Building High-Quality Research Software
Title: Ten Essential Guidelines for Building High-Quality Research Software | Zehn wesentliche Leitlinien für den Aufbau hochwertiger Forschungssoftware | 建立高质量研究软件的十项基本准则 2507.16166v1 |
Authors (5): Nasir U. Eisty, David E. Bernholdt, Alex Koufos, David J. Luet, Miranda Mundt
High-quality research software is a cornerstone of modern scientific progress, enabling researchers to analyze complex data, simulate phenomena, and share reproducible results. However, creating such software requires adherence to best practices that ensure robustness, usability, and sustainability. This paper presents ten guidelines for producing high-quality research software, covering every stage of the development lifecycle. These guidelines emphasize the importance of planning, writing clean and readable code, using version control, and implementing thorough testing strategies. Additionally, they address key principles such as modular design, reproducibility, performance optimization, and long-term maintenance. The paper also highlights the role of documentation and community engagement in enhancing software usability and impact. By following these guidelines, researchers can create software that advances their scientific objectives and contributes to a broader ecosystem of reliable and reusable research tools. This work serves as a practical resource for researchers and developers aiming to elevate the quality and impact of their research software.
高质量的研究软件是现代科学进步的基石,使研究人员能够分析复杂的数据、模拟现象和分享可复制的成果。然而,创建这种软件需要坚持确保稳健性、可用性和可持续性的最佳做法。本文件为生产高质量的研究软件提出了十项准则,涵盖开发生命周期的每个阶段。这些准则强调规划、编写清洁和可读代码、使用版本控制以及实施彻底测试战略的重要性。此外,这些准则还涉及模块设计、可复制性、性能优化和长期维护等关键原则。本文件还强调了文献和社区参与在提高软件可用性和影响方面的作用。通过遵循这些准则,研究人员可以创建软件,推进其科学目标,促进更广泛的可靠和可再使用研究工具的生态系统。这项工作为研究人员和开发者提供了实用资源,旨在提高其研究软件的质量和影响。
Article 77
Title@2025-07-21 (1): GitChameleon 2.0: Evaluating AI Code Generation Against Python Library Version Incompatibilities
Title: GitChameleon 2.0: Evaluating AI Code Generation Against Python Library Version Incompatibilities | GitChameleon 2.0: Bewertung der KI-Codegenerierung gegen Python Library Version Inkompatibilitäten | GitChameleon 2.0:评估AI 与 Python 图书馆版本的不兼容性 2507.12367v2 |
Authors (12): Diganta Misra, Nizar Islah, Victor May, Brice Rauby, Zihan Wang, Justine Gehring, Antonio Orvieto, Muawiz Chaudhary, Eilif B. Muller, Irina Rish, Samira Ebrahimi Kahou, Massimo Caccia
The rapid evolution of software libraries poses a considerable hurdle for code generation, necessitating continuous adaptation to frequent version updates while preserving backward compatibility. While existing code evolution benchmarks provide valuable insights, they typically lack execution-based evaluation for generating code compliant with specific library versions. To address this, we introduce GitChameleon 2.0, a novel, meticulously curated dataset comprising 328 Python code completion problems, each conditioned on specific library versions and accompanied by executable unit tests. GitChameleon 2.0 rigorously evaluates the capacity of contemporary large language models (LLMs), LLM-powered agents, code assistants, and RAG systems to perform version-conditioned code generation that demonstrates functional accuracy through execution. Our extensive evaluations indicate that state-of-the-art systems encounter significant challenges with this task: enterprise models achieve baseline success rates in the 48-51% range, underscoring the intricacy of the problem. By offering an execution-based benchmark emphasizing the dynamic nature of code libraries, GitChameleon 2.0 enables a clearer understanding of this challenge and helps guide the development of more adaptable and dependable AI code generation methods. We make the dataset and evaluation code publicly available at https://github.com/mrcabbage972/GitChameleonBenchmark.
软件库的快速演进给代码生成带来了相当大的障碍,要求在保持向后兼容的同时不断适应频繁的版本更新。现有的代码演进基准虽然提供了有价值的洞见,但通常缺乏基于执行的评估,无法检验生成的代码是否符合特定库版本。为此,我们提出 GitChameleon 2.0,这是一个精心整理的新数据集,包含 328 个 Python 代码补全问题,每个问题都以特定库版本为条件,并配有可执行的单元测试。GitChameleon 2.0 严格评估当代大语言模型(LLM)、LLM 驱动的智能体、代码助手和 RAG 系统进行版本条件代码生成的能力,并通过执行来验证功能正确性。我们的大量评估表明,最先进的系统在这项任务上面临重大挑战:企业级模型的基线成功率仅在 48-51% 之间,凸显了问题的复杂性。通过提供一个强调代码库动态特性的基于执行的基准,GitChameleon 2.0 有助于更清晰地理解这一挑战,并指导开发适应性更强、更可靠的 AI 代码生成方法。数据集和评估代码公开于 https://github.com/mrcabbage972/GitChameleonBenchmark。
Article 78
Title@2025-07-21 (1): AI-Powered Commit Explorer (APCE)
Title: AI-Powered Commit Explorer (APCE) | KI-Powered Commit Explorer (APCE) | AI 授权委员会探索者(APCE) 2507.16063v1 |
Authors (6): Yousab Grees, Polina Iaremchuk, Ramtin Ehsani, Esteban Parra, Preetha Chatterjee, Sonia Haiduc
Commit messages in a version control system provide valuable information for developers regarding code changes in software systems. Commit messages can be the only source of information left for future developers describing what was changed and why. However, writing high-quality commit messages is often neglected in practice. Commit messages generated by Large Language Models (LLMs) have emerged as a way to mitigate this issue. We introduce the AI-Powered Commit Explorer (APCE), a tool to support developers and researchers in the use and study of LLM-generated commit messages. APCE gives researchers the option to store different prompts for LLMs and provides an additional evaluation prompt that can further enhance the commit message provided by LLMs. APCE also provides researchers with a straightforward mechanism for automated and human evaluation of LLM-generated messages. Demo link: https://youtu.be/zYrJ9s6sZvo
版本控制系统中的提交信息为开发者提供了有关软件系统代码变更的宝贵信息。提交信息可能是留给未来开发者的唯一信息来源,用于说明改了什么以及为什么改。然而,在实践中,编写高质量的提交信息常常被忽视。由大语言模型(LLM)生成提交信息已成为缓解这一问题的一种途径。我们介绍 AI-Powered Commit Explorer(APCE),这是一个支持开发者和研究人员使用和研究 LLM 生成的提交信息的工具。APCE 允许研究人员为 LLM 存储不同的提示词,并提供额外的评估提示词,以进一步改进 LLM 给出的提交信息。APCE 还为研究人员提供了对 LLM 生成的信息进行自动评估和人工评估的简便机制。演示链接:https://youtu.be/zYrJ9s6sZvo
Article 79
Title@2025-07-21 (1): RightTyper: Effective and Efficient Type Annotation for Python
Title: RightTyper: Effective and Efficient Type Annotation for Python | RightTyper: Effektive und effiziente Typ-Annotation für Python | RightTyper: Python 有效、高效型号注解 2507.16051v1 |
Authors (2): Juan Altmayer Pizzorno, Emery D. Berger
Python type annotations bring the benefits of static type checking to the language. However, manually writing annotations can be time-consuming and tedious. The result is that most real-world Python code remains largely untyped. Past approaches to annotating types in Python code fall short in a number of ways. Static approaches struggle with dynamic features and infer overly broad types. AI-based methods are inherently unsound and can miss rare or user-defined types. Dynamic methods can impose extreme runtime overheads, degrading performance by up to 270x, abort execution as they exhaust resources, and even infer incorrect types that lead to runtime errors. Crucially, all prior work assumes implicitly that the code to be annotated is already correct. This assumption is generally unwarranted, especially for large codebases that have been untyped. This paper presents RightTyper, a novel approach for Python that overcomes these disadvantages. RightTyper not only generates precise type annotations based on actual program behavior, improving recall in type checking relative to prior approaches, but also turns type checking into anomaly detection, allowing the type checker to identify corner cases that the programmer can audit for unintended behavior. RightTyper is also fast and space-efficient, imposing just 30% performance overhead on average. RightTyper achieves these characteristics through a principled yet pervasive use of sampling, guided by self-profiling, along with statistical filtering and careful resolution and aggregation of type information.
Python 类型注解为该语言带来了静态类型检查的好处。然而,手工编写注解既耗时又繁琐,结果是大多数真实世界的 Python 代码基本上仍未加类型。以往为 Python 代码添加类型注解的方法存在多方面不足:静态方法难以处理动态特性,推断出的类型过于宽泛;基于 AI 的方法本质上不可靠,可能遗漏罕见类型或用户自定义类型;动态方法可能带来极高的运行时开销,使性能下降多达 270 倍,因耗尽资源而中止执行,甚至推断出会导致运行时错误的不正确类型。关键在于,所有先前工作都隐含地假设待注解的代码已经是正确的。这一假设通常并不成立,对于一直未加类型的大型代码库尤其如此。本文提出 RightTyper,一种克服上述缺点的 Python 新方法。RightTyper 不仅基于程序的实际行为生成精确的类型注解,相对于已有方法提高了类型检查的召回率,还将类型检查转化为异常检测,使类型检查器能够找出边界情况,供程序员审查是否存在非预期行为。RightTyper 还快速且节省空间,平均仅带来 30% 的性能开销。RightTyper 通过有原则且贯穿始终的采样(由自我剖析引导),结合统计过滤以及对类型信息的谨慎解析与聚合,实现了这些特性。
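The core idea in the RightTyper abstract, recording types from actual program behavior while sampling to keep overhead low, can be illustrated with a toy decorator. This is a minimal sketch and not RightTyper's implementation: the `observe_types` helper, the fixed "every N-th call" sampling rule, and the `observed` table are hypothetical, whereas the paper describes sampling guided by self-profiling plus statistical filtering.

```python
import functools
from collections import defaultdict

# Maps (function name, parameter position) to the set of type names observed.
observed = defaultdict(set)


def observe_types(every=10):
    """Record argument types on every `every`-th call only (hypothetical helper)."""
    def decorator(func):
        calls = 0

        @functools.wraps(func)
        def wrapper(*args):
            nonlocal calls
            if calls % every == 0:  # sampled call: pay the recording cost
                for position, value in enumerate(args):
                    observed[(func.__name__, position)].add(type(value).__name__)
            calls += 1
            return func(*args)

        return wrapper
    return decorator


@observe_types(every=10)
def scale(x, factor):
    return x * factor


for i in range(25):
    scale(i, 2.0)      # calls 0, 10, and 20 are sampled
scale("ab", 3)         # call 25 is not sampled, so `str` goes unrecorded
```

The last call shows the trade-off of naive fixed-rate sampling: rarely used types can be missed, which is why a real tool needs a more careful sampling and filtering strategy than this sketch.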
Article 80
Title@2025-07-21 (1): A Pilot Study on LLM-Based Agentic Translation from Android to iOS: Pitfalls and Insights
Title: A Pilot Study on LLM-Based Agentic Translation from Android to iOS: Pitfalls and Insights | Eine Pilotstudie über LLM-basierte Agentische Übersetzung von Android nach iOS: Pitfalls and Insights | 基于LLM的智能体式 Android 到 iOS 翻译试点研究:陷阱与启示 2507.16037v1 |
Authors (5): Zhili Zeng, Kimya Khakzad Shahandashti, Alvine Boaye Belle, Song Wang, Zhen Ming Jiang
The rapid advancement of mobile applications has led to a significant demand for cross-platform compatibility, particularly between the Android and iOS platforms. Traditional approaches to mobile application translation often rely on manual intervention or rule-based systems, which are labor-intensive and time-consuming. While recent advancements in machine learning have introduced automated methods, they often lack contextual understanding and adaptability, resulting in suboptimal translations. Large Language Models (LLMs) were recently leveraged to enhance code translation at different granularities, including the method, class, and repository levels. Researchers have investigated common errors, limitations, and potential strategies to improve these tasks. However, LLM-based application translation across different platforms, such as migrating mobile applications between Android and iOS or adapting software across diverse frameworks, remains underexplored. Understanding the performance, strengths, and limitations of LLMs in cross-platform application translation is critical for advancing software engineering automation. This study aims to fill this gap by evaluating LLM-based agentic approaches for mobile application translation, identifying key failure points, and proposing guidelines to improve translation performance. We developed a chain of agents that account for dependencies, specifications, program structure, and program control flow when translating applications from Android to iOS. To evaluate the performance, we manually examined the translated code for syntactic correctness, semantic accuracy, and functional completeness. For translation failures, we further conducted a detailed root cause analysis to understand the underlying limitations of the agentic translation process and identify opportunities for improvement.
移动应用的快速发展带来了对跨平台兼容性的巨大需求,尤其是在 Android 和 iOS 平台之间。传统的移动应用翻译方法往往依赖人工干预或基于规则的系统,既费力又耗时。近年来机器学习的进展虽然引入了自动化方法,但这些方法往往缺乏上下文理解和适应性,导致翻译效果欠佳。最近,大语言模型(LLM)被用于在方法、类和仓库等不同粒度上改进代码翻译,研究人员也考察了常见错误、局限以及改进这些任务的潜在策略。然而,基于 LLM 的跨平台应用翻译(例如在 Android 和 iOS 之间迁移移动应用,或在不同框架之间适配软件)仍未得到充分探索。了解 LLM 在跨平台应用翻译中的性能、优势和局限,对于推进软件工程自动化至关重要。本研究旨在填补这一空白:评估基于 LLM 的智能体方法在移动应用翻译中的表现,识别关键失败点,并提出改进翻译性能的指导原则。我们构建了一条智能体链,在将应用从 Android 翻译到 iOS 时考虑依赖关系、规格说明、程序结构和程序控制流。为评估性能,我们人工检查了翻译后代码的语法正确性、语义准确性和功能完整性。对于翻译失败的情况,我们进一步进行了详细的根因分析,以了解智能体翻译流程的内在局限并找出改进机会。
Article 81
Title@2025-07-21 (1): BandFuzz: An ML-powered Collaborative Fuzzing Framework
Title: BandFuzz: An ML-powered Collaborative Fuzzing Framework | BandFuzz: Ein ML-powered Collaborative Fuzzing Framework | BandFuzz: ML 授权的协作模糊框架 2507.10845v2 |
Authors (6): Wenxuan Shi, Hongwei Li, Jiahao Yu, Xinqian Sun, Wenbo Guo, Xinyu Xing
Collaborative fuzzing combines multiple individual fuzzers and dynamically chooses appropriate combinations for different programs. Unlike individual fuzzers that rely on specific assumptions, collaborative fuzzing relaxes assumptions on target programs, providing robust performance across various programs. However, existing collaborative fuzzing frameworks face challenges including additional computational resource requirements and inefficient resource allocation among fuzzers. To tackle these challenges, we present BANDFUZZ, an ML-powered collaborative fuzzing framework that outperforms individual fuzzers without requiring additional computational resources. The key contribution of BANDFUZZ lies in its novel resource allocation algorithm driven by our proposed multi-armed bandits model. Unlike the greedy methods in existing frameworks, BANDFUZZ models the long-term impact of individual fuzzers, enabling discovery of globally optimal collaborative strategies. We propose a novel fuzzer evaluation method that assesses not only code coverage but also the fuzzer's capability of solving difficult branches. Finally, we integrate a real-time seed synchronization mechanism and implementation-wise optimizations to improve fuzzing efficiency and stability. Through extensive experiments on FuzzBench and Fuzzer Test Suite, we show that BANDFUZZ outperforms the state-of-the-art collaborative fuzzing framework autofz and widely used individual fuzzers. We verify BANDFUZZ's key designs through a comprehensive ablation study. Notably, we demonstrate BANDFUZZ's effectiveness in real-world bug detection by analyzing the results of a worldwide fuzzing competition, where BANDFUZZ won first place.
协作式模糊测试将多个独立的模糊测试器组合起来,并针对不同程序动态选择合适的组合。与依赖特定假设的单个模糊测试器不同,协作式模糊测试放宽了对目标程序的假设,能够在各类程序上提供稳健的性能。然而,现有的协作式模糊测试框架面临额外计算资源需求以及模糊测试器之间资源分配低效等挑战。为应对这些挑战,我们提出 BANDFUZZ,一个由机器学习驱动的协作式模糊测试框架,它无需额外计算资源即可超越单个模糊测试器。BANDFUZZ 的关键贡献在于其新颖的资源分配算法,该算法由我们提出的多臂老虎机模型驱动。与现有框架中的贪心方法不同,BANDFUZZ 对单个模糊测试器的长期影响进行建模,从而能够发现全局最优的协作策略。我们提出了一种新的模糊测试器评估方法,不仅评估代码覆盖率,还评估模糊测试器攻克困难分支的能力。最后,我们集成了实时种子同步机制和实现层面的优化,以提高模糊测试的效率和稳定性。通过在 FuzzBench 和 Fuzzer Test Suite 上的大量实验,我们表明 BANDFUZZ 优于最先进的协作式模糊测试框架 autofz 以及广泛使用的单个模糊测试器。我们通过全面的消融实验验证了 BANDFUZZ 的关键设计。值得注意的是,我们通过分析一项全球模糊测试竞赛的结果展示了 BANDFUZZ 在真实漏洞检测中的有效性,BANDFUZZ 在该竞赛中获得第一名。
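The BandFuzz abstract frames resource allocation among fuzzers as a multi-armed bandit problem. To make that framing concrete, the sketch below uses the textbook UCB1 rule to hand out time slices. It is a minimal sketch, not BANDFUZZ's algorithm: UCB1 is swapped in as a generic bandit strategy, and the three "fuzzers" with fixed per-slice rewards are hypothetical.

```python
import math


def ucb1_choose(counts, totals, step):
    """Pick the arm with the highest UCB1 score; unplayed arms go first."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        totals[arm] / counts[arm] + math.sqrt(2 * math.log(step) / counts[arm])
        for arm in range(len(counts))
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Hypothetical fixed reward per time slice for three fuzzers
# (for example, newly covered code scaled to [0, 1]).
reward_per_slice = [0.1, 0.6, 0.3]
counts = [0, 0, 0]
totals = [0.0, 0.0, 0.0]

for step in range(1, 201):
    arm = ucb1_choose(counts, totals, step)
    counts[arm] += 1
    totals[arm] += reward_per_slice[arm]
```

After the loop, `counts` shows most slices going to the most rewarding fuzzer while the others still receive occasional exploratory slices. Real fuzzing rewards are noisy and change over time, which this sketch does not model.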
Article 82
Title@2025-07-21 (1): BACFuzz: Exposing the Silence on Broken Access Control Vulnerabilities in Web Applications
Title: BACFuzz: Exposing the Silence on Broken Access Control Vulnerabilities in Web Applications | BACFuzz: Aufdecken des Schweigens auf gebrochene Zugriffskontrolle Schwachstellen in Web-Anwendungen | BACFuzz:在网络应用中暴露对断断存控制障碍的沉默 2507.15984v1 |
Authors (5): I Putu Arya Dharmaadi, Mohannad Alhanahnah, Van-Thuan Pham, Fadi Mohsen, Fatih Turkmen
Broken Access Control (BAC) remains one of the most critical and widespread vulnerabilities in web applications, allowing attackers to access unauthorized resources or perform privileged actions. Despite its severity, BAC is underexplored in automated testing due to key challenges: the lack of reliable oracles and the difficulty of generating semantically valid attack requests. We introduce BACFuzz, the first gray-box fuzzing framework specifically designed to uncover BAC vulnerabilities, including Broken Object-Level Authorization (BOLA) and Broken Function-Level Authorization (BFLA) in PHP-based web applications. BACFuzz combines LLM-guided parameter selection with runtime feedback and SQL-based oracle checking to detect silent authorization flaws. It employs lightweight instrumentation to capture runtime information that guides test generation, and analyzes backend SQL queries to verify whether unauthorized inputs flow into protected operations. Evaluated on 20 real-world web applications, including 15 CVE cases and 2 known benchmarks, BACFuzz detects 16 of 17 known issues and uncovers 26 previously unknown BAC vulnerabilities with low false positive rates. All identified issues have been responsibly disclosed, and artifacts will be publicly released.
断层接入控制(BAC)仍然是网络应用中最关键和最普遍的脆弱性之一,它允许攻击者获取未经授权的资源或采取特权行动。尽管其严重程度严重,但BAC在自动测试中未得到充分探索,因为面临以下重大挑战:缺乏可靠的神器和难以提出具有语义效力的攻击请求。我们引入了BACUZ,这是第一个专门旨在发现BAC脆弱性的灰箱模糊模糊框架,包括基于PHP的网络应用中的断层授权(BOLA)和功能级授权(BFLA),BACFuzz将LLM指导参数的选择与运行时间反馈和基于SQL的检查结合起来,以发现无声授权缺陷。它使用轻量仪器来捕捉引导测试生成的运行时间信息,并分析SQL询问的后端,以核实未经授权的投入是否流入受保护的行动。对20个真实的网络应用进行了评估,包括15个CVE案件和2个已知基准,BACUFUzz检测了17个已知问题中的16个,并发现26个以前未知的BAC弱点,而错误肯定率较低。所有发现的问题都将公开披露。
Article 83
Title@2025-07-21 (1): Observing Fine-Grained Changes in Jupyter Notebooks During Development Time
Title: Observing Fine-Grained Changes in Jupyter Notebooks During Development Time | Beobachten feinkörniger Änderungen in Jupyter-Notebooks während der Entwicklungszeit | 发展时期黄极笔记本中观察到的微小变化 2507.15831v1 |
Authors (8): Sergey Titov, Konstantin Grotov, Cristina Sarasua, Yaroslav Golubev, Dhivyabharathi Ramasamy, Alberto Bacchelli, Abraham Bernstein, Timofey Bryksin
In software engineering, numerous studies have focused on the analysis of fine-grained logs, leading to significant innovations in areas such as refactoring, security, and code completion. However, no similar studies have been conducted for computational notebooks in the context of data science. To help bridge this research gap, we make three scientific contributions: we (1) introduce a toolset for collecting code changes in Jupyter notebooks during development time; (2) use it to collect more than 100 hours of work related to a data analysis task and a machine learning task (carried out by 20 developers with different levels of expertise), resulting in a dataset containing 2,655 cells and 9,207 cell executions; and (3) use this dataset to investigate the dynamic nature of the notebook development process and the changes that take place in the notebooks. In our analysis of the collected data, we classified the changes made to the cells between executions and found that a significant number of these changes were relatively small fixes and code iteration modifications. This suggests that notebooks are used not only as a development and exploration tool but also as a debugging tool. We report a number of other insights and propose potential future research directions on the novel data.
在软件工程领域,许多研究侧重于细粒度日志的分析,并在重构、安全和代码补全等方面带来了重大创新。然而,在数据科学背景下,尚无针对计算笔记本的类似研究。为了弥补这一研究空白,我们做出三项科学贡献:(1) 引入一套工具,用于收集 Jupyter 笔记本在开发期间的代码变更;(2) 使用它收集了 100 多小时与一项数据分析任务和一项机器学习任务相关的工作(由 20 名不同专业水平的开发者完成),形成了包含 2,655 个单元格和 9,207 次单元格执行的数据集;(3) 利用该数据集研究笔记本开发过程的动态特性以及笔记本中发生的变更。在对所收集数据的分析中,我们对两次执行之间单元格的改动进行了分类,发现其中相当一部分是相对较小的修复和代码迭代修改。这表明笔记本不仅被用作开发和探索工具,也被用作调试工具。我们还报告了若干其他发现,并针对这一新数据提出了潜在的未来研究方向。
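As an illustration of classifying the change made to a cell between two executions, here is a minimal sketch using the standard-library `difflib`. The labels and the 0.8 similarity threshold are assumptions chosen for the example; the abstract does not state the authors' actual classification scheme.

```python
import difflib


def classify_change(before: str, after: str, small_threshold: float = 0.8) -> str:
    """Label the edit between two executions of the same cell by textual similarity.

    The threshold and labels are hypothetical, not taken from the paper.
    """
    if before == after:
        return "unchanged"
    similarity = difflib.SequenceMatcher(None, before, after).ratio()
    return "small edit" if similarity >= small_threshold else "large edit"


# Usage: a typical iteration step changes one argument and re-runs the cell.
label = classify_change("df.head()", "df.head(10)")
```

A purely textual measure like this cannot tell a bug fix from a parameter tweak; it only separates small edits from rewrites.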
Article 84
Title@2025-07-21 (1): Investigating the Use of LLMs for Evidence Briefings Generation in Software Engineering
Title: Investigating the Use of LLMs for Evidence Briefings Generation in Software Engineering | Untersuchung der Verwendung von LLMs für Evidence Briefings Generation in der Software-Engineering | 调查软件工程中利用LLMs制作证据简报 2507.15828v1 |
Authors (7): Mauro Marcelino, Marcos Alves, Bianca Trinkenreich, Bruno Cartaxo, Sérgio Soares, Simone D. J. Barbosa, Marcos Kalinowski
[Context] An evidence briefing is a concise and objective transfer medium that can present the main findings of a study to software engineers in the industry. Although practitioners and researchers have deemed Evidence Briefings useful, their production requires manual labor, which may be a significant challenge to their broad adoption. [Goal] The goal of this registered report is to describe an experimental protocol for evaluating LLM-generated evidence briefings for secondary studies in terms of content fidelity, ease of understanding, and usefulness, as perceived by researchers and practitioners, compared to human-made briefings. [Method] We developed an RAG-based LLM tool to generate evidence briefings. We used the tool to automatically generate two evidence briefings that had been manually generated in previous research efforts. We designed a controlled experiment to evaluate how the LLM-generated briefings compare to the human-made ones regarding perceived content fidelity, ease of understanding, and usefulness. [Results] To be reported after the experimental trials. [Conclusion] Depending on the experiment results.
证据简介是一个简明和客观的转移媒介,可以向该行业的软件工程师介绍一项研究的主要结果。虽然实践者和研究人员认为证据简介有用,但其制作需要人工劳动,这对其广泛采用来说可能是一个重大挑战。[目 这份登记报告的目的是说明一项实验性程序,用于评价LLM为第二期研究制作的证据简介,即研究人员和从业者认为与人造简报相比,在内容忠诚、理解便利和实用性方面对LLM产生的证据简介。我们开发了一个基于RAG LLM的工具,以生成证据简介。我们利用这一工具自动制作了两个在以往研究工作中手工制作的证据简介。我们设计了一项有控制的实验,以评价LLM制作的情况介绍如何与人造简报相比,在内容忠诚、理解便利和实用性方面与实验后报告。[结论]视实验结果而定。[结论]
Article 85
Title@2025-07-21 (1): Do AI models help produce verified bug fixes?
Title: Do AI models help produce verified bug fixes? | Helfen KI-Modelle dabei, verifizierte Fehlerbehebungen zu erstellen? | 人工智能模型是否帮助生成经核实的错误修正 ? 2507.15822v1 |
Authors (6): Li Huang, Ilgiz Mustafin, Marco Piccioni, Alessandro Schena, Reto Weber, Bertrand Meyer
Among areas of software engineering where AI techniques – particularly, Large Language Models – seem poised to yield dramatic improvements, an attractive candidate is Automatic Program Repair (APR), the production of satisfactory corrections to software bugs. Does this expectation materialize in practice? How do we find out, making sure that proposed corrections actually work? If programmers have access to LLMs, how do they actually use them to complement their own skills? To answer these questions, we took advantage of the availability of a program-proving environment, which formally determines the correctness of proposed fixes, to conduct a study of program debugging with two randomly assigned groups of programmers, one with access to LLMs and the other without, both validating their answers through the proof tools. The methodology relied on a division into general research questions (Goals in the Goal-Query-Metric approach), specific elements admitting specific answers (Queries), and measurements supporting these answers (Metrics). While applied so far to a limited sample size, the results are a first step towards delineating a proper role for AI and LLMs in providing guaranteed-correct fixes to program bugs. These results caused surprise as compared to what one might expect from the use of AI for debugging and APR. The contributions also include: a detailed methodology for experiments in the use of LLMs for debugging, which other projects can reuse; a fine-grain analysis of programmer behavior, made possible by the use of full-session recording; a definition of patterns of use of LLMs, with 7 distinct categories; and validated advice for getting the best of LLMs for debugging and Automatic Program Repair.
在软件工程领域,AI技术 – – 特别是大语言模型 – – 似乎有望带来巨大改进,一个有吸引力的候选者是自动程序修理(APR),对软件错误进行令人满意的纠正。这一期望在实践中是否实现?我们如何发现,确保拟议的纠正实际发挥作用?如果程序设计者能够使用LLMS,他们如何实际使用LMS来补充自己的技能?为了回答这些问题,我们利用了程序验证环境的可用性,正式确定拟议的修正的正确性,对程序调试程序进行了研究,由两类随机指定的程序组组成,一类是使用LLMS,另一类则是不使用LMS和另一类,两者均通过验证工具验证其答案;我们如何找到方法,以确保拟议纠正建议的实际效果?如果程序设计者能够使用LMMS,他们如何实际使用LMS,他们如何实际使用LMS; 具体答案(Msetric) 以及支持这些答案(MLMS)的测量方法。尽管应用范围有限,但结果是向AI和LMS(LMs)在向程序错误提供有保证的精度的精度修正建议方面的正确性评估方面迈出第一步,但是,它们又可以使用一种不同的分析。这些结果可以用来使用ARIaldealbis。
Article 86
Title@2025-07-21 (1): BugScope: Learn to Find Bugs Like Human
Title: BugScope: Learn to Find Bugs Like Human | BugScope: Lernen Sie Fehler wie Menschen zu finden | 错误库: 学习查找像人类一样的错误 2507.15671v1 |
Authors (6): Jinyao Guo, Chengpeng Wang, Dominic Deluca, Jinjie Liu, Zhuo Zhang, Xiangyu Zhang
Detecting software bugs remains a fundamental challenge due to the extensive diversity of real-world defects. Traditional static analysis tools often rely on symbolic workflows, which restrict their coverage and hinder adaptability to customized bugs with diverse anti-patterns. While recent advances incorporate large language models (LLMs) to enhance bug detection, these methods continue to struggle with sophisticated bugs and typically operate within limited analysis contexts. To address these challenges, we propose BugScope, an LLM-driven multi-agent system that emulates how human auditors learn new bug patterns from representative examples and apply that knowledge during code auditing. Given a set of examples illustrating both buggy and non-buggy behaviors, BugScope synthesizes a retrieval strategy to extract relevant detection contexts via program slicing and then constructs a tailored detection prompt to guide accurate reasoning by the LLM. Our evaluation on a curated dataset of 40 real-world bugs drawn from 21 widely-used open-source projects demonstrates that BugScope achieves 87.04% precision and 90.00% recall, surpassing state-of-the-art industrial tools by 0.44 in F1 score. Further testing on large-scale open-source systems, including the Linux kernel, uncovered 141 previously unknown bugs, of which 78 have been fixed and 7 confirmed by developers, highlighting BugScope’s substantial practical impact.
传统静态分析工具往往依赖象征性工作流程,这些流程限制其覆盖面,妨碍适应不同反模式的定制错误。最近的进展包括了大型语言模型(LLMs),以加强错误检测,但这些方法仍然与复杂的错误纠缠不休,通常在有限的分析环境中运作。为了应对这些挑战,我们提议采用BugScope,这是一个由LLLM驱动的多试剂系统,仿效人类审计员如何从代表性实例中学习新的错误模式,并在代码审计中应用该知识。根据一组说明错误和非错误行为的示例,BugScope合成了一个检索战略,通过程序剪切换和随后构建一个定制的检测速度,以引导LLMM的准确推理。我们从21个广泛使用的公开源项目中抽取的40个虚拟数据集的评估显示,BugScope实现了87.04%的精确度和9.0%的回顾,在F1分中通过0.44分显示的错误和非错误行为,超越了最先进的工业工具。
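For readers who want to relate the reported precision and recall to an F1 score, the arithmetic can be sketched as follows. This only applies the standard harmonic-mean definition of F1 to the two figures quoted in the abstract; it makes no claim about the baseline tools' scores.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract: 87.04% precision, 90.00% recall.
bugscope_f1 = f1_score(0.8704, 0.9000)
print(f"F1 = {bugscope_f1:.3f}")
```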
Article 87
Title@2025-07-21 (1): Modeling CubeSat Storage Battery Discharge: Equivalent Circuit Versus Machine Learning Approaches
Title: Modeling CubeSat Storage Battery Discharge: Equivalent Circuit Versus Machine Learning Approaches | Modellierung CubeSat Speicher Batterieentladung: Gleichwertige Schaltung Versus Machine Learning Ansätze | 模型化CubeSat存储电池放电:等效电路甚高频机器学习方法 2507.15666v1 |
Authors (4): Igor Turkin, Lina Volobuieva, Andriy Chukhray, Oleksandr Liubimov
The subject of the article is the study and comparison of two approaches to modelling the battery discharge of a CubeSat satellite: analytical modelling using an equivalent circuit, and machine learning. The article aims to make a reasoned choice between these approaches. Modelling the battery discharge of a satellite will enable the prediction of the consequences of disconnecting the autonomous power system and ensure the fault tolerance of equipment in orbit. Therefore, the selected study is relevant and promising. This study focuses on the analysis of CubeSat satellite data, based explicitly on orbital data samples of the power system, which include data available at the time of the article's publication. The dataset contains data on the voltage, current, and temperature of the battery and of the solar panels attached to the five sides of the satellite. In this context, two approaches are considered: analytical modelling based on physical laws, and machine learning, which uses empirical data to create a predictive model. Results: A comparative analysis of the modelling results reveals that the equivalent circuit approach has the advantage of transparency, as it identifies possible parameters that facilitate understanding of the relationships. However, the model is less flexible with respect to environmental changes or non-standard satellite behavior. The machine learning model demonstrated more accurate results, as it can account for complex dependencies and adapt to actual conditions, even when they deviate from theoretical assumptions.
文章的主题是研究和比较CubeSat卫星电池放电的两种建模方法:使用等效电路和机器学习进行分析,文章的目的是对CubeSat卫星电池放电的建模方法作出合理选择;对卫星电池放电进行建模,以便能够预测断开自主动力系统的后果,确保轨道设备有误容度;因此,所选研究具有相关性和前景;本研究报告侧重于分析CubeSat卫星数据,明确以动力系统的轨道数据样本为基础,包括文章出版时可获得的数据;数据集包含附属于卫星五面的电池和太阳能板的电压、电流和温度数据;在这方面,考虑了两种方法:根据物理法和机器学习进行的分析建模,利用经验数据建立预测模型模型。结果:对模型结果进行比较分析表明,等值电路方法具有透明度的优势,因为它确定了有助于了解各种关系的可能参数。但模型中包含卫星五面的电池和太阳能板板电池板的电压、电流和温度数据数据。在这方面,考虑两种方法:根据实际法律和机器学习数据模型进行的分析模型,可以更精确地调整,从模型到更精确的假设,从模拟到更精确的模型,从更精确的模型到更精确的卫星的假设,从模拟到更精确到更精确的假设。
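To show what an equivalent-circuit discharge model can look like, here is a minimal sketch of a first-order Thevenin model (a series resistance plus one RC pair). The circuit order, the linear open-circuit-voltage curve, and every parameter value are assumptions for illustration; the abstract does not specify the authors' circuit or its fitted parameters.

```python
import math


def simulate_discharge(current_a, dt_s, steps, capacity_ah=2.6,
                       r0=0.05, r1=0.02, c1=2000.0, soc0=1.0):
    """Terminal voltage of a first-order Thevenin cell under constant discharge current.

    All parameter values are hypothetical placeholders, not CubeSat data.
    """
    def open_circuit_voltage(soc):
        return 3.0 + 1.2 * soc  # assumed linear OCV curve: 3.0 V empty, 4.2 V full

    soc, v_rc = soc0, 0.0
    decay = math.exp(-dt_s / (r1 * c1))  # exact discretisation of the RC branch
    voltages = []
    for _ in range(steps):
        voltages.append(open_circuit_voltage(soc) - current_a * r0 - v_rc)
        v_rc = decay * v_rc + r1 * (1.0 - decay) * current_a
        soc = max(0.0, soc - current_a * dt_s / (capacity_ah * 3600.0))  # coulomb counting
    return voltages


# Usage: 1 A discharge sampled every 10 s for 100 steps.
trace = simulate_discharge(current_a=1.0, dt_s=10.0, steps=100)
```

Every term in such a model maps to a physical quantity, which is the transparency advantage the abstract mentions. The fixed parameters are also why it adapts poorly when temperature or ageing changes the cell.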
Article 88
Title@2025-07-21 (1): SustainDiffusion: Optimising the Social and Environmental Sustainability of Stable Diffusion Models
Title: SustainDiffusion: Optimising the Social and Environmental Sustainability of Stable Diffusion Models | SustainDiffusion: Optimierung der sozialen und ökologischen Nachhaltigkeit stabiler Diffusionsmodelle | 可持续性传播:优化稳定传播模式的社会和环境可持续性 2507.15663v1 |
Authors (5): Giordano d’Aloisio, Tosin Fadahunsi, Jay Choy, Rebecca Moussa, Federica Sarro
Background: Text-to-image generation models are widely used across numerous domains. Among these models, Stable Diffusion (SD), an open-source text-to-image generation model, has become the most popular, producing over 12 billion images annually. However, the widespread use of these models raises concerns regarding their social and environmental sustainability. Aims: To reduce the harm that SD models may have on society and the environment, we introduce SustainDiffusion, a search-based approach designed to enhance the social and environmental sustainability of SD models. Method: SustainDiffusion searches for the optimal combination of hyperparameters and prompt structures that can reduce gender and ethnic bias in generated images while also lowering the energy consumption required for image generation. Importantly, SustainDiffusion maintains image quality comparable to that of the original SD model. Results: We conduct a comprehensive empirical evaluation of SustainDiffusion, testing it against six different baselines using 56 different prompts. Our results demonstrate that SustainDiffusion can reduce gender bias in SD3 by 68%, ethnic bias by 59%, and energy consumption (calculated as the sum of CPU and GPU energy) by 48%. Additionally, the outcomes produced by SustainDiffusion are consistent across multiple runs and can be generalised to various prompts. Conclusions: With SustainDiffusion, we demonstrate how enhancing the social and environmental sustainability of text-to-image generation models is possible without fine-tuning or changing the model’s architecture.
背景:文本到图像生成模型在许多领域广泛使用。在这些模型中,Stable Diffusion(SD)作为一种开源的文本到图像生成模型最为流行,每年生成超过 120 亿张图像。然而,这些模型的广泛使用引发了对其社会和环境可持续性的担忧。目标:为减少 SD 模型可能对社会和环境造成的危害,我们提出 SustainDiffusion,一种基于搜索的方法,旨在提升 SD 模型的社会和环境可持续性。方法:SustainDiffusion 搜索超参数和提示结构的最优组合,以减少生成图像中的性别和种族偏见,同时降低图像生成所需的能耗。重要的是,SustainDiffusion 保持了与原始 SD 模型相当的图像质量。结果:我们对 SustainDiffusion 进行了全面的实证评估,使用 56 个不同的提示与六个不同的基线进行比较。结果表明,SustainDiffusion 可将 SD3 中的性别偏见降低 68%,种族偏见降低 59%,能耗(按 CPU 与 GPU 能耗之和计算)降低 48%。此外,SustainDiffusion 的结果在多次运行中保持一致,并可推广到不同的提示。结论:借助 SustainDiffusion,我们展示了无需微调或更改模型架构即可提升文本到图像生成模型的社会和环境可持续性。
Article 89
Title@2025-07-21 (1): Hot Topics and Common Challenges: an Empirical Study of React Discussions on Stack Overflow
Title: Hot Topics and Common Challenges: an Empirical Study of React Discussions on Stack Overflow | Heiße Themen und gemeinsame Herausforderungen: eine empirische Studie über reagierende Diskussionen über Stack Overflow | 热题和共同挑战:关于堆堆溢溢溢量的应对讨论的经验研究 2507.15624v1 |
Authors (6): Yusuf Sulistyo Nugroho, Ganno Tribuana Kurniaji, Syful Islam, Mohammed Humayun Kabir, Vanesya Aura Ardity, Md. Kamal Uddin
React is a JavaScript library used to build user interfaces for single-page applications. Although recent studies have shown the popularity and advantages of React in web development, the specific challenges users face remain unknown. Thus, this study aims to analyse the React-related questions shared on Stack Overflow. The study utilizes an exploratory data analysis to investigate the most frequently discussed keywords, error classification, and user reputation-based errors, which is the novelty of this work. The results show the top eight most frequently used keywords on React-related questions, namely, code, link, vir, href, connect, azure, windows, and website. The error classification of questions from the sample shows that algorithmic error is the most frequent issue faced by all groups of users, where mid-reputation users contribute the most, accounting for 55.77%. This suggests the need for the community to provide guidance materials in solving algorithm-related problems. We expect that the results of this study will provide valuable insight into future research to support the React community during the early stages of implementation, facilitating their ability to effectively overcome challenges to adoption.
React 是一个用于为单页应用程序构建用户界面的 JavaScript 库。虽然最近的研究表明了 React 在 Web 开发中的流行度和优势,但用户所面临的具体挑战仍然未知。因此,本研究旨在分析 Stack Overflow 上分享的 React 相关问题。该研究利用探索性数据分析来调查最常讨论的关键词、错误分类以及基于用户声誉的错误,这是这项工作的新颖之处。结果显示,React 相关问题中最常用的八个关键词是 code、link、vir、href、connect、azure、windows 和 website。对样本问题的错误分类显示,算法错误是所有用户群体面临的最常见问题,其中中等声誉用户贡献最多,占 55.77%。这说明社区需要提供解决算法相关问题的指导材料。我们期望本研究的结果能为未来研究提供有价值的见解,以在实施的早期阶段支持 React 社区,帮助其有效克服采用方面的挑战。
Article 90
Title@2025-07-21 (1): Applying the Chinese Wall Reverse Engineering Technique to Large Language Model Code Editing
Title: Applying the Chinese Wall Reverse Engineering Technique to Large Language Model Code Editing | Anwendung der Technik der chinesischen Wandumkehrtechnik auf die Bearbeitung von großen Sprachmodellen | 将中国长墙反向工程技术应用到大语言模式编辑 2507.15599v1 |
Authors (1): Manatsawin Hanmongkolchai
Large language models for code (Code LLMs) are increasingly utilized in programming environments. Despite their utility, the training datasets for top LLMs remain undisclosed, raising concerns about potential copyright violations. Some models, such as Pleias and Comma, put emphasis on data curation and licenses; however, with limited training data, these models are not competitive and serve only as proofs of concept. To improve the utility of these models, we propose an application of the “Chinese Wall” technique, inspired by the reverse engineering technique of the same name: a high-quality model is used to generate detailed instructions for a weaker model. By doing so, a weaker but ethically aligned model may be used to perform complicated tasks that otherwise can only be completed by more powerful models. In our evaluation, we found that this technique improves Comma v0.1 1T’s performance on the CanItEdit benchmark by over 66%, and Starcoder2 Instruct’s by roughly 20%, compared to running the same model on the benchmark alone. The practical application of this technique today, however, may be limited due to the lack of models trained on public domain content without copyright restrictions.
大型代码语言模型(Code LLM)在编程环境中日益被越来越多地使用。尽管其实用性,顶级LLM的培训数据集仍未公开,引起对潜在版权侵犯的关切。一些模型,如Pleias和Comma强调数据整理和许可证,然而,由于培训数据有限,这些模型没有竞争力,只能作为概念的证明。为了提高这些模型的实用性,我们提议应用“中国长城”技术,这种技术受同名反向工程技术的启发 – – 一个高质量的模型被用来为较弱的模型提供详细的指示。这样,一个较弱但符合道德要求的模型可能被用来执行复杂的任务,否则只能由更强大的模型完成。在我们的评估中,我们发现这种技术使CanIted基准的Coma v0.1 1T性能提高了66%以上, Starcoder2 与仅使用同一基准模型时相比,大约20%的教益。但是,如今这种技术的实际应用可能受到限制,因为没有版权限制的公共域内容培训模型。
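The two-model arrangement described above can be sketched as a small pipeline. This is an illustrative reading of the abstract rather than the author's code: the `Model` interface, the prompt wording, and the stub models are all hypothetical, and a real setup would call actual LLM endpoints.

```python
from typing import Callable

Model = Callable[[str], str]  # hypothetical interface: prompt in, text out


def chinese_wall_edit(code: str, request: str, strong: Model, weak: Model) -> str:
    """Let the strong model write instructions only; the weak model produces all code."""
    plan = strong(
        "Describe, step by step and in plain language, how to carry out this edit. "
        "Do not write any code.\n"
        f"Request: {request}\nCode:\n{code}"
    )
    return weak(f"Apply these instructions to the code.\nInstructions:\n{plan}\nCode:\n{code}")


# Stub models so the sketch runs without any LLM backend.
def stub_strong(prompt: str) -> str:
    return "Rename the function `foo` to `bar`."


def stub_weak(prompt: str) -> str:
    source = prompt.split("Code:\n", 1)[1]
    return source.replace("foo", "bar")


edited = chinese_wall_edit("def foo():\n    pass\n", "rename foo to bar", stub_strong, stub_weak)
```

The design point is the wall itself: only natural-language instructions cross from the strong model to the weak one, so every line of emitted code originates from the weaker model.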
Article 91
Title@2025-07-21 (1): A Study of LLMs’ Preferences for Libraries and Programming Languages
Title: A Study of LLMs’ Preferences for Libraries and Programming Languages | Eine Studie der Präferenzen der LLM für Bibliotheken und Programmiersprachen | 关于LLMLM对图书馆和节目语言的偏好的研究 2503.17181v2 |
Authors (7): Lukas Twist, Jie M. Zhang, Mark Harman, Don Syme, Joost Noppen, Helen Yannakoudakis, Detlef Nauck
Large Language Models (LLMs) are increasingly used to generate code, influencing users’ choices of libraries and programming languages in critical real-world projects. However, little is known about their systematic biases or preferences toward certain libraries and programming languages, which can significantly impact software development practices. To fill this gap, we perform the first empirical study of LLMs’ preferences for libraries and programming languages when generating code, covering eight diverse LLMs. Our results reveal that LLMs exhibit a strong tendency to overuse widely adopted libraries such as NumPy; in up to 48% of cases, this usage is unnecessary and deviates from the ground-truth solutions. LLMs also exhibit a significant preference toward Python as their default language. For high-performance project initialisation tasks where Python is not the optimal language, it remains the dominant choice in 58% of cases, and Rust is not used a single time. These results indicate that LLMs may prioritise familiarity and popularity over suitability and task-specific optimality. This will introduce security vulnerabilities and technical debt, and limit exposure to newly developed, better-suited tools and languages. Understanding and addressing these biases is essential for the responsible integration of LLMs into software development workflows.
大型语言模型(LLMS)越来越多地被用于生成代码,影响用户对图书馆的选择和在关键现实世界项目中编程语言的选择;然而,很少有人知道他们对某些图书馆和编程语言的系统性偏见或偏好,这可能严重影响软件开发实践;为填补这一空白,我们首次对LLMS在生成代码时对图书馆和编程语言的偏好进行了经验性研究,涉及8种不同的LMS。我们的结果表明,LLMS表现出过度使用广泛采用图书馆(如NumPy)的强烈倾向;在多达48%的案例中,这种使用是不必要的,偏离了地面真相解决方案。LMS也表现出了对Python默认语言的偏好偏好。对于高性的项目初始化任务(Python不是最佳语言)来说,在58%的案例中,它仍然是主要的选择,Rust没有使用单一的时间。这些结果表明,LMS可能优先偏爱和受欢迎于适合性和特定任务的最佳性。这将引入安全脆弱性和技术债务,并限制对新开发的、更适合的工具和语言的LMs的暴露。了解和解决这些偏见是将LMS的软件整合的基本。
Article 92
Title@2025-07-21 (1): CGP-Tuning: Structure-Aware Soft Prompt Tuning for Code Vulnerability Detection
Title: CGP-Tuning: Structure-Aware Soft Prompt Tuning for Code Vulnerability Detection | CGP-Tuning: Structure-Aware Soft Prompt Tuning für Code Vulnerability Detection | CGP-Turning: 用于代码脆弱性检测的结构- Aware 软软快速查询 2501.04510v2 |
Authors (4): Ruijun Feng, Hammond Pearce, Pietro Liguori, Yulei Sui
Large language models (LLMs) have been proposed as powerful tools for detecting software vulnerabilities, where task-specific fine-tuning is typically employed to provide vulnerability-specific knowledge to the LLMs. However, existing fine-tuning techniques often treat source code as plain text, losing the graph-based structural information inherent in code. Graph-enhanced soft prompt tuning addresses this by translating the structural information into contextual cues that the LLM can understand. However, current methods are primarily designed for general graph-related tasks and focus more on adjacency information; they fall short in preserving the rich semantic information (e.g., control/data flow) within code graphs. They also fail to ensure computational efficiency while capturing graph-text interactions in their cross-modal alignment module. This paper presents CGP-Tuning, a new code graph-enhanced, structure-aware soft prompt tuning method for vulnerability detection. CGP-Tuning introduces type-aware embeddings to capture the rich semantic information within code graphs, along with an efficient cross-modal alignment module that achieves linear computational costs while incorporating graph-text interactions. It is evaluated on the latest DiverseVul dataset and three advanced open-source code LLMs: CodeLlama, CodeGemma, and Qwen2.5-Coder. Experimental results show that CGP-Tuning delivers model-agnostic improvements and maintains practical inference speed, surpassing the best graph-enhanced soft prompt tuning baseline by an average of four percentage points and outperforming non-tuned zero-shot prompting by 15 percentage points.
大型语言模型(LLM)已被提出作为检测软件漏洞的有力工具,通常通过任务特定的微调向 LLM 提供与漏洞相关的知识。然而,现有的微调技术往往将源代码视为纯文本,丢失了代码中固有的基于图的结构信息。图增强的软提示调优通过将结构信息转换为 LLM 能够理解的上下文线索来解决这一问题。然而,当前方法主要面向一般的图相关任务,更侧重于邻接信息,在保留代码图中丰富的语义信息(例如控制流/数据流)方面存在不足;它们的跨模态对齐模块在捕获图与文本交互的同时也无法保证计算效率。本文提出 CGP-Tuning,一种新的代码图增强、结构感知的软提示调优方法,用于漏洞检测。CGP-Tuning 引入类型感知嵌入来捕获代码图中丰富的语义信息,并配有高效的跨模态对齐模块,在纳入图与文本交互的同时实现线性计算成本。我们在最新的 DiverseVul 数据集和三个先进的开源代码 LLM(CodeLlama、CodeGemma 和 Qwen2.5-Coder)上对其进行了评估。实验结果表明,CGP-Tuning 带来了与模型无关的改进并保持了实用的推理速度,平均比最佳的图增强软提示调优基线高出 4 个百分点,比未调优的零样本提示高出 15 个百分点。
Article 93
Title@2025-07-21 (1): Understanding the Design Decisions of Retrieval-Augmented Generation Systems
Title: Understanding the Design Decisions of Retrieval-Augmented Generation Systems | Verständnis der Konstruktionsentscheidungen von Systemen der retrieval-Augmentierten Generation | 了解回收-加速发电系统的设计决定 2411.19463v2 |
Authors (7): Shengming Zhao, Yuchen Shao, Yuheng Huang, Jiayang Song, Zhijie Wang, Chengcheng Wan, Lei Ma
Retrieval-Augmented Generation (RAG) has emerged as a critical technique for enhancing large language model (LLM) capabilities. However, practitioners face significant challenges when making RAG deployment decisions. While existing research prioritizes algorithmic innovations, a systematic gap persists in understanding fundamental engineering trade-offs that determine RAG success. We present the first comprehensive study of three universal RAG deployment decisions: whether to deploy RAG, how much information to retrieve, and how to integrate retrieved knowledge effectively. Through systematic experiments across three LLMs and six datasets spanning question answering and code generation tasks, we reveal critical insights: (1) RAG deployment must be highly selective, with variable recall thresholds and failure modes affecting up to 12.6% of samples even with perfect documents. (2) Optimal retrieval volume exhibits task-dependent behavior: QA tasks show universal patterns (5-10 documents optimal), while code generation requires scenario-specific optimization. (3) Knowledge integration effectiveness depends on task and model characteristics, with code generation benefiting significantly from prompting methods while question answering shows minimal improvement. These findings demonstrate that universal RAG strategies prove inadequate. Effective RAG systems require context-aware design decisions based on task characteristics and model capabilities. Our analysis provides evidence-based guidance for practitioners and establishes foundational insights for principled RAG deployment.
现有研究将算法创新列为优先事项,但在理解基本工程权衡方面仍然存在系统性差距,从而决定了RAG的成功。我们首次全面研究了三个通用的RAG部署决定:是否部署RAG,需要检索多少信息,以及如何有效地整合检索到的知识。通过在三个LLM和六个数据集中进行系统实验,涵盖问题回答和代码生成任务,我们发现了重要的见解:(1)RAG部署必须具有高度选择性,其回顾阈值和失败模式必须具有可变性,影响到最多达12.6样本,甚至有完美文件的样本。(2) 最佳检索量的显示依赖任务的行为质量任务显示通用模式(5-10份文件最佳),而代码生成则需要根据特定情景优化。(3) 知识整合效力取决于任务和模式特征,代码生成从快速方法中受益,而问题回答则显示最微小的改进。这些结论显示,普遍RAG战略证明不足。有效的RAG系统系统要求基于背景的系统设计决定,以及基于任务定位、特征和模型分析能力的基础。
Article 94
Title@2025-07-21 (1): StackTrans: From Large Language Model to Large Pushdown Automata Model
Title: StackTrans: From Large Language Model to Large Pushdown Automata Model | StackTrans: Vom großen Sprachmodell zum großen Pushdown-Automatenmodell | Stacktrans: 从大语言模型到大推下自动模型 2507.15343v1 |
Authors (8): Kechi Zhang, Ge Li, Jia Li, Huangzhao Zhang, Yihong Dong, Jia Li, Jingjing Xu, Zhi Jin
The Transformer architecture has emerged as a landmark advancement within the broad field of artificial intelligence, effectively catalyzing the advent of large language models (LLMs). However, despite its remarkable capabilities and the substantial progress it has facilitated, the Transformer architecture still has some limitations. One such intrinsic limitation is its inability to effectively capture the Chomsky hierarchy, such as regular expressions or deterministic context-free grammars. Drawing inspiration from pushdown automata, which efficiently resolve deterministic context-free grammars using stacks, we propose StackTrans to address the aforementioned issue within LLMs. Unlike previous approaches that modify the attention computation, StackTrans explicitly incorporates hidden state stacks between Transformer layers. This design maintains compatibility with existing frameworks like flash-attention. Specifically, our design features stack operations – such as pushing and popping hidden states – that are differentiable and can be learned in an end-to-end manner. Our comprehensive evaluation spans benchmarks for both Chomsky hierarchies and large-scale natural languages. Across these diverse tasks, StackTrans consistently outperforms standard Transformer models and other baselines. We have successfully scaled StackTrans up from 360M to 7B parameters. In particular, our from-scratch pretrained model StackTrans-360M outperforms several larger open-source LLMs with 2-3x more parameters, showcasing its superior efficiency and reasoning capability.
变换器结构在人工智能的广泛领域已经成为一个里程碑式的进步,有效地催化了大型语言模型(LLMS)的出现。然而,尽管变换器结构具有非凡的能力和显著的进展,但还是有一些局限性。这种内在的局限性之一是它无法有效捕捉Chomsky的等级,例如常规表达式或不含确定性的环境语法。从推进式自动地图中汲取灵感,该自推进式自动地图使用堆叠有效解决无确定性背景语法,我们建议StackTrans在LLMS中解决上述问题。与以前改变注意力计算的方法不同,StackTranstrans显然将隐藏的状态堆放在变换器层之间。这种设计与闪感等现有框架保持兼容性。具体地说,我们的设计功能,例如推动和弹出隐藏状态,可以以端对端的方式学习。我们的全面评价将Chomsky定型语法和大尺度自然参数都标为Chomsky 。在这些不同的任务中, StackTranstrax 持续超越标准变换式模型模型模型模型模型模型和其他更高级的基线,我们成功地将Straack-tracraft-trax 。
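One common way to make stack operations differentiable, sketched here for illustration, is to compute the next stack as a probability-weighted blend of the pushed, popped, and unchanged stacks. This is an assumption about the general technique, not StackTrans's actual formulation; the function name, the fixed depth, and the three-way action distribution are hypothetical.

```python
def soft_stack_update(stack, new_item, probs):
    """Blend push, pop, and no-op results of a fixed-depth stack (top at index 0).

    `probs` = (p_push, p_pop, p_noop), expected to sum to 1. Because the output is
    a weighted sum, gradients can flow to the action probabilities during training.
    """
    p_push, p_pop, p_noop = probs
    dim = len(new_item)
    pushed = [list(new_item)] + [row[:] for row in stack[:-1]]  # bottom row falls off
    popped = [row[:] for row in stack[1:]] + [[0.0] * dim]      # zero row enters at bottom
    return [
        [p_push * a + p_pop * b + p_noop * c for a, b, c in zip(row_push, row_pop, row_keep)]
        for row_push, row_pop, row_keep in zip(pushed, popped, stack)
    ]


# Usage: a hard push followed by a half-push/half-no-op of a second hidden state.
empty = [[0.0, 0.0] for _ in range(3)]
after_push = soft_stack_update(empty, [1.0, 2.0], (1.0, 0.0, 0.0))
blended = soft_stack_update(after_push, [3.0, 4.0], (0.5, 0.0, 0.5))
```

With one-hot probabilities this reduces to an ordinary stack; intermediate probabilities give a soft superposition that a network can learn end to end.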
Article 95
Title@2025-07-21 (1): A Study of Malware Prevention in Linux Distributions
Title: A Study of Malware Prevention in Linux Distributions | Eine Studie über Malware-Prävention in Linux-Distributionen | 关于Linux分发中防止恶意软件的研究 2411.11017v3 |
Authors (7): Duc-Ly Vu, Trevor Dunlap, Karla Obermeier-Velazquez, Thanh-Cong Nguyen, Paul Gibert, John Speed Meyers, Santiago Torres-Arias
Malicious attacks on open-source software packages are a growing concern. The discovery of the XZ Utils backdoor intensified these concerns because of the potential widespread impact. This study, therefore, explores the challenges of preventing and detecting malware in Linux distribution package repositories. To do so, we ask two research questions: (1) What measures have Linux distributions implemented to counter malware, and how have maintainers experienced these efforts? (2) How effective are current malware detection tools in identifying malicious Linux packages? To answer these questions, we conduct interviews with maintainers at several major Linux distributions and introduce a Linux package malware benchmark dataset. Using this dataset, we evaluate the performance of six open-source malware detection scanners. Distribution maintainers, according to the interviews, have mostly focused on reproducible builds to date. Our interviews identified only a single Linux distribution, Wolfi OS, that performs active malware scanning. Using this new benchmark dataset, the evaluation found that the performance of existing open-source malware scanners is underwhelming. Most studied tools excel at producing false positives but only infrequently detect true malware. Those that avoid high false positive rates often do so at the expense of a satisfactory true positive. Our findings provide insights into Linux distribution package repositories’ current practices for malware detection and demonstrate the current inadequacy of open-source tools designed to detect malicious Linux packages.
对开放源码软件包的恶意攻击日益引起人们的关注。发现 XZ 用户界面后门后门的发现加剧了这些关切,因为其潜在的影响很广。 因此,本研究探讨了在Linux 分销软件库中预防和检测恶意软件的挑战。 为此,我们问了两个研究问题:(1) 采取什么措施来实施Linux 发行软件来对付恶意软件包,以及维护者如何经历了这些努力?(2) 当前恶意软件检测工具在识别恶意Linux软件包方面的效力如何?为了回答这些问题,我们与Linux主要分销软件的维护者进行了访谈,并引入了一个Linux软件基准数据集。因此,我们利用这一数据集,评估了6个公开源码恶意软件检测扫描扫描仪的性能。根据访谈,分销维护者主要侧重于到今天的重建基础。我们的访谈只确定了一个单一的Linux 发行软件,即Wolfi OS, 进行积极的恶意软件扫描。使用这个新的基准数据集,我们发现现有的公开源码软件扫描仪的表现不足。 多数研究的工具是制作虚假的正面的软件,但只是不定期对目前真实的准确的恶意检测结果。
Article 96
Title@2025-07-21 (1): Butterfly Effects in Toolchains: A Comprehensive Analysis of Failed Parameter Filling in LLM Tool-Agent Systems
Title: Butterfly Effects in Toolchains: A Comprehensive Analysis of Failed Parameter Filling in LLM Tool-Agent Systems | Schmetterlingseffekte in Werkzeugketten: Eine umfassende Analyse der fehlgeschlagenen Parameterfüllung in LLM-Werkzeug-Agentensystemen | 工具链中的蝴蝶效应:对LLM工具代理系统填充失败参数的综合分析 2507.15296v1 |
Authors (7): Qian Xiong, Yuekai Huang, Ziyou Jiang, Zhiyuan Chang, Yujia Zheng, Tianhao Li, Mingyang Li
The emergence of the tool agent paradigm has broadened the capability boundaries of the Large Language Model (LLM), enabling it to complete more complex tasks. However, the effectiveness of this paradigm is limited due to the issue of parameter failure during its execution. To explore this phenomenon and propose corresponding suggestions, we first construct a parameter failure taxonomy in this paper. We derive five failure categories from the invocation chain of a mainstream tool agent. Then, we explore the correlation between three different input sources and failure categories by applying 15 input perturbation methods to the input. Experimental results show that parameter name hallucination failure primarily stems from inherent LLM limitations, while issues with input sources mainly cause other failure patterns. To improve the reliability and effectiveness of tool-agent interactions, we propose corresponding improvement suggestions, including standardizing tool return formats, improving error feedback mechanisms, and ensuring parameter consistency.
工具代理商范式的出现扩大了大语言模型(LLM)的能力范围,使其能够完成更复杂的任务。然而,这一范式的效力因参数执行过程中的失败问题而受到限制。为了探索这一现象并提出相应的建议,我们首先在本文件中建立一个参数失败分类。我们从主流工具代理商的援引链中得出五个失败类别。然后,我们通过对输入应用15个输入扰动方法来探索三种不同的输入源和故障类别之间的相互关系。实验结果显示,参数名称幻觉的失败主要源于固有的LLLM限制,而与输入源有关的问题则主要导致其他失败模式。为了提高工具代理商互动的可靠性和有效性,我们提出了相应的改进建议,包括工具返回格式标准化、改进错误反馈机制以及确保参数一致性。
Article 97
Title@2025-07-21 (1): Input Reduction Enhanced LLM-based Program Repair
Title: Input Reduction Enhanced LLM-based Program Repair | Input-Reduzierung Verbesserte LLM-basierte Programm-Reparatur | 增强基于LLM的LLM方案维修 2507.15251v1 |
Authors (6): Boyang Yang, Luyao Ren, Xin Yin, Jiadong Ren, Haoye Tian, Shunfu Jin
Large Language Models (LLMs) have shown great potential in Automated Program Repair (APR). Test inputs, being crucial for reasoning about the root cause of failures, are always included in the prompt for LLM-based APR. Unfortunately, LLMs struggle to retain key information in long prompts. When the test inputs in the prompt are extensive, this may trigger the “lost-in-the-middle” issue, compromising repair performance. To address this, we propose ReduceFix, an LLM-based APR approach with a built-in component that automatically reduces test inputs while retaining their failure-inducing behavior. ReduceFix prompts an LLM to generate a reducer that minimizes failure-inducing test inputs without human effort, and then feeds the reduced failure-inducing inputs to guide patch generation. For targeted evaluation, we constructed LFTBench, the first long-input APR benchmark with 200 real bugs from 20 programming tasks, each paired with a failure-inducing input whose median size is 1 MB. On this benchmark, ReduceFix shrinks inputs by 89.1% on average and improves overall pass@10 by up to 53.8% relative to a prompt that includes the original test, and by 17.6% compared with omitting the test entirely. Adding the same reduction step to ChatRepair increases its fix rate by 21.3% without other changes. Ablation studies further highlight the impact of input length and compressed failure information on repair success. These results underscore that automatically reducing failing inputs is a practical and powerful complement to LLM-based APR, significantly improving its scalability and effectiveness.
大型语言模型(LLM)在自动程序修复(APR)中展现出巨大潜力。测试输入对于推断故障根因至关重要,因此总是被包含在基于LLM的APR提示中。然而,LLM难以在长提示中保留关键信息。当提示中的测试输入过长时,可能引发“中间迷失”(lost-in-the-middle)问题,从而损害修复性能。为此,我们提出ReduceFix,一种基于LLM的APR方法,其内置组件可在保留失败诱发行为的同时自动缩减测试输入。ReduceFix提示LLM生成一个缩减器,无需人工即可最小化诱发失败的测试输入,然后用缩减后的输入引导补丁生成。为进行针对性评估,我们构建了LFTBench,这是首个长输入APR基准,包含来自20个编程任务的200个真实缺陷,每个缺陷配有一个诱发失败的输入,其中位大小为1 MB。在该基准上,ReduceFix平均将输入缩减89.1%;相对于包含原始测试的提示,整体pass@10最多提升53.8%;相对于完全省略测试的提示,提升17.6%。在ChatRepair中加入同样的缩减步骤,可在不做其他改动的情况下将其修复率提高21.3%。消融研究进一步凸显了输入长度和压缩后的失败信息对修复成功率的影响。这些结果表明,自动缩减失败输入是对基于LLM的APR切实有效的补充,能显著提升其可扩展性和有效性。
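The input-reduction idea in the ReduceFix abstract can be illustrated with a small sketch. The abstract says ReduceFix prompts an LLM to generate the reducer; the code below is not that reducer but a generic greedy chunk-removal loop in the spirit of delta debugging, shown only to make "shrink the input while it still fails" concrete. The names `reduce_input` and `still_fails` are hypothetical, and the failure oracle is a toy stand-in for running the buggy program.

```python
from typing import Callable, List


def reduce_input(lines: List[str], still_fails: Callable[[List[str]], bool]) -> List[str]:
    """Greedy chunk-removal reducer (a simplified relative of delta debugging).

    `still_fails` is a hypothetical oracle: it returns True when the
    candidate input still triggers the original failure.
    """
    if not still_fails(lines):
        raise ValueError("the starting input must reproduce the failure")
    chunk = max(len(lines) // 2, 1)
    while chunk >= 1:
        i = 0
        removed_any = False
        while i < len(lines):
            candidate = lines[:i] + lines[i + chunk:]
            if candidate and still_fails(candidate):
                lines = candidate      # keep the smaller failing input
                removed_any = True
            else:
                i += chunk             # this chunk is needed; move on
        if not removed_any:
            chunk //= 2                # nothing removable at this size; refine
    return lines


# Toy oracle (an assumption for illustration): the "bug" needs both lines.
def toy_failure(candidate: List[str]) -> bool:
    return "42" in candidate and "77" in candidate


original = [str(n) for n in range(100)]
reduced = reduce_input(original, toy_failure)
print(len(original), "->", len(reduced), reduced)
```

In a repair pipeline of the kind the abstract describes, the reduced input rather than the original would then be placed in the repair prompt.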
Article 98
Title@2025-07-21 (1): ACFIX: Guiding LLMs with Mined Common RBAC Practices for Context-Aware Repair of Access Control Vulnerabilities in Smart Contracts
Title: ACFIX: Guiding LLMs with Mined Common RBAC Practices for Context-Aware Repair of Access Control Vulnerabilities in Smart Contracts | ACFIX: Leitende LLMs mit geminten gängigen RBAC-Praktiken für die kontextbezogene Reparatur von Zugangskontrolllücken in Smart Contracts | ACFIX: 指导LLMs公司使用RBAC在智能合同中使用内部软件修理存取控制易变性方面通用的雷管局做法 2403.06838v3 |
Authors (7): Lyuye Zhang, Kaixuan Li, Kairan Sun, Daoyuan Wu, Ye Liu, Haoye Tian, Yang Liu
Smart contracts are susceptible to various security issues, among which access control (AC) vulnerabilities are particularly critical. While existing research has proposed multiple detection tools, the automatic and appropriate repair of AC vulnerabilities in smart contracts remains a challenge. Unlike commonly supported vulnerability types by existing repair tools, such as reentrancy, which are usually fixed by template-based approaches, the main obstacle of AC lies in identifying the appropriate roles or permissions amid a long list of non-AC-related source code to generate proper patch code, a task that demands human-level intelligence. Leveraging recent advancements in large language models (LLMs), we employ the state-of-the-art GPT-4 model and enhance it with a novel approach called ACFIX. The key insight is that we can mine common AC practices for major categories of code functionality and use them to guide LLMs in fixing code with similar functionality. To this end, ACFIX involves both offline and online phases. First, during the offline phase, ACFIX mines a taxonomy of common Role-based Access Control (RBAC) practices from 344,251 on-chain contracts, categorizing 49 role-permission pairs from the top 1,000 pairs mined. Second, during the online phase, ACFIX tracks AC-related elements across the contract and uses this context information along with a Chain-of-Thought pipeline to guide LLMs in identifying the most appropriate role-permission pair for the subject contract and subsequently generating a suitable patch. This patch will then undergo a validity and effectiveness check. To evaluate ACFIX, we built the first benchmark dataset of 118 real-world AC vulnerabilities, and our evaluation revealed that ACFIX successfully repaired 94.92% of them. This represents a significant improvement compared to the baseline GPT-4, which achieved only 52.54%.
智能合约容易受到各类安全问题的影响,其中访问控制(AC)漏洞尤为关键。尽管已有研究提出了多种检测工具,智能合约中AC漏洞的自动且恰当的修复仍是一项挑战。与现有修复工具普遍支持的漏洞类型(如重入,通常可用基于模板的方法修复)不同,AC修复的主要障碍在于从大量与AC无关的源代码中识别出合适的角色或权限,以生成正确的补丁代码,这一任务需要人类水平的智能。借助大语言模型(LLM)的最新进展,我们采用最先进的GPT-4模型,并以一种名为ACFIX的新方法对其进行增强。其核心思路是:针对主要的代码功能类别挖掘常见的AC实践,并用它们指导LLM修复具有类似功能的代码。为此,ACFIX包含离线和在线两个阶段。首先,在离线阶段,ACFIX从344,251份链上合约中挖掘出常见的基于角色的访问控制(RBAC)实践分类体系,从挖掘出的前1,000个角色-权限对中归纳出49个。其次,在在线阶段,ACFIX跟踪合约中与AC相关的元素,并将这些上下文信息与思维链(Chain-of-Thought)流水线结合,引导LLM为目标合约确定最合适的角色-权限对,进而生成合适的补丁。该补丁随后会接受有效性检查。为评估ACFIX,我们构建了首个包含118个真实AC漏洞的基准数据集,评估表明ACFIX成功修复了其中的94.92%,相比仅达到52.54%的基线GPT-4有显著提升。
Article 99
Title@2025-07-21 (1): FaultLine: Automated Proof-of-Vulnerability Generation Using LLM Agents
Title: FaultLine: Automated Proof-of-Vulnerability Generation Using LLM Agents | FaultLine: Automatisierte Generierung von Vulnerabilitätsnachweisen mit LLM-Agenten | 失灵:使用LLM代理器自动验证生成 2507.15241v1 |
Authors (3): Vikram Nitin, Baishakhi Ray, Roshanak Zilouchian Moghaddam
Despite the critical threat posed by software security vulnerabilities, reports are often incomplete, lacking the proof-of-vulnerability (PoV) tests needed to validate fixes and prevent regressions. These tests are crucial not only for ensuring patches work, but also for helping developers understand how vulnerabilities can be exploited. Generating PoV tests is a challenging problem, requiring reasoning about the flow of control and data through deeply nested levels of a program. We present FaultLine, an LLM agent workflow that uses a set of carefully designed reasoning steps, inspired by aspects of traditional static and dynamic program analysis, to automatically generate PoV test cases. Given a software project with an accompanying vulnerability report, FaultLine 1) traces the flow of an input from an externally accessible API (“source”) to the “sink” corresponding to the vulnerability, 2) reasons about the conditions that an input must satisfy in order to traverse the branch conditions encountered along the flow, and 3) uses this reasoning to generate a PoV test case in a feedback-driven loop. FaultLine does not use language-specific static or dynamic analysis components, which enables it to be used across programming languages. To evaluate FaultLine, we collate a challenging multi-lingual dataset of 100 known vulnerabilities in Java, C and C++ projects. On this dataset, FaultLine is able to generate PoV tests for 16 projects, compared to just 9 for CodeAct 2.1, a popular state-of-the-art open-source agentic framework. Thus, FaultLine represents a 77% relative improvement over the state of the art. Our findings suggest that hierarchical reasoning can enhance the performance of LLM agents on PoV test generation, but the problem in general remains challenging. We make our code and dataset publicly available in the hope that it will spur further research in this area.
尽管软件安全脆弱性构成的严重威胁,但报告往往不完整,缺乏验证修正和防止回归所需的证明(PoV)测试。这些测试不仅对于确保补丁工作至关重要,而且对于帮助开发者了解如何利用脆弱性也至关重要。 生成 PoV测试是一个具有挑战性的问题,需要通过一个程序深度嵌入水平来解释控制和数据的流动情况。 我们展示了FaultLine,一个LLM代理商工作流程,在传统的静态和动态程序分析的启发下,使用一套精心设计的推理步骤,自动生成PoV测试案例。鉴于一个附有脆弱性报告的软件项目,FaultLine 1 1 跟踪外部可访问的API (“源 源”) 输入的信息流流流, 与脆弱性相对的“ 链接” , 2 有关投入必须满足哪些条件才能通过流程中深嵌入的分支条件, 3 利用这一推理在反馈驱动的循环中生成一个PoV测试案例。 FaultLine不使用我们特定的静态或动态分析组件, 使得它能够用于一个在多语言水平的相对上进行比较。 。 将这个数据测试 显示我们的数据在演示的C 。
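The 77% figure in the FaultLine abstract follows from the two counts it reports (16 projects for FaultLine versus 9 for CodeAct 2.1). As a quick check, the relative improvement works out as below. The exact quotient is about 77.8%, so the abstract's 77% appears to be truncated rather than rounded; this is an inference from the reported counts, not something the source states.

```latex
\[
\text{relative improvement} = \frac{16 - 9}{9} = \frac{7}{9} \approx 0.778
\]
```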
Article 100
Title@2025-07-21 (1): Code Clone Detection via an AlphaFold-Inspired Framework
Title: Code Clone Detection via an AlphaFold-Inspired Framework | Code-Klone-Erkennung über ein AlphaFold-Inspired Framework | 通过 AlphaFold 启发框架探测代码克隆 2507.15226v1 |
Authors (5): Changguo Jia, Yi Zhan, Tianqi Zhao, Hengzhi Ye, Minghui Zhou
Code clone detection, which aims to identify functionally equivalent code fragments, plays a critical role in software maintenance and vulnerability analysis. Numerous methods have been proposed to detect code clones, but they either fall short in capturing code semantics or rely on language-specific analyzers. Inspired by the remarkable success of AlphaFold in predicting three-dimensional protein structures from protein sequences, in this paper, we leverage AlphaFold for code clone detection based on the insight that protein sequences and token sequences share a common linear sequential structure. In particular, we propose AlphaCC, which represents code fragments as token sequences to ensure multi-language applicability and adapts AlphaFold’s sequence-to-structure modeling capability to infer code semantics. The pipeline of AlphaCC goes through three steps. First, AlphaCC transforms each input code fragment into a token sequence and, motivated by AlphaFold’s use of multiple sequence alignment (MSA) to enhance contextual understanding, constructs an MSA from lexically similar token sequences. Second, AlphaCC adopts a modified attention-based encoder based on AlphaFold to model dependencies within and across token sequences. Finally, unlike AlphaFold’s protein structure prediction task, AlphaCC computes similarity scores between token sequences through a late interaction strategy and performs binary classification to determine code clone pairs. Comprehensive evaluations on three language-diverse datasets demonstrate AlphaCC’s applicability across multiple programming languages. On two semantic clone detection datasets, it consistently outperforms all baselines, showing strong semantic understanding. Moreover, AlphaCC maintains competitive efficiency, enabling practical usage in large-scale clone detection tasks.
代码克隆检测旨在识别功能等同的代码碎片,在软件维护与脆弱性分析中发挥着关键作用。 提出了大量方法来检测代码克隆, 但是在获取代码语义学或依赖语言分析器方面却落后于获取代码语义学或依赖语言特定分析器。 受阿尔法佛尔德在从蛋白序列中预测三维蛋白结构方面取得的显著成功启发, 在本文件中, 我们利用阿尔法佛尔德进行代码克隆检测, 其依据是蛋白序列和符号序列共享共同线性线性序列结构的洞察。 特别是, 我们提议阿尔法CC, 它代表代码碎片作为代号序列, 以确保多语言适用性, 并调整阿尔法尔德的序列到结构模型应用性, 用来推断代码的语义。 首先, 阿尔法尔法尔法尔德将每个输入代码的分解转换成一个符号序列, 由阿尔法佛尔德使用的多序列校正( MSAA) 来增强背景理解, 构建一个类似代号序列序列序列。 其次, 阿尔法CC 采用一个基于源代码的修改的代码, 在模型中, 直观内部和直观中显示一个稳定的系统内部和跨顺序数据序列中, 显示一个类似顺序, 。 最后显示一个直径解的顺序, 。
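The AlphaCC abstract mentions scoring token sequences "through a late interaction strategy" without giving the formula. A minimal sketch of one common late-interaction score (token-level maximum cosine similarity, averaged and symmetrized) is shown below, assuming per-token embeddings are already available. The function names, the symmetrization, and the 0.5 threshold are assumptions for illustration and are not taken from the paper.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def max_sim(query: Sequence[Vector], doc: Sequence[Vector]) -> float:
    """For each query token, take its best match in doc, then average."""
    return sum(max(cosine(q, d) for d in doc) for q in query) / len(query)


def late_interaction_score(a: Sequence[Vector], b: Sequence[Vector]) -> float:
    """Symmetric variant (an assumption): average of both directions."""
    return 0.5 * (max_sim(a, b) + max_sim(b, a))


def is_clone(a: Sequence[Vector], b: Sequence[Vector], threshold: float = 0.5) -> bool:
    """Binary decision with a hypothetical threshold."""
    return late_interaction_score(a, b) >= threshold


# Toy 2-D "token embeddings", made up for illustration.
frag_a = [[1.0, 0.0], [0.0, 1.0]]
frag_b = [[0.0, 2.0], [3.0, 0.0]]   # same directions, different order and scale
frag_c = [[-1.0, 0.0]]              # points away from frag_d
frag_d = [[1.0, 0.0]]
print(late_interaction_score(frag_a, frag_b), is_clone(frag_c, frag_d))
```

The appeal of late interaction is that each fragment can be encoded once and compared cheaply afterwards, which fits the efficiency claim in the abstract.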
Article 101
Title@2025-07-21 (1): SimdBench: Benchmarking Large Language Models for SIMD-Intrinsic Code Generation
Title: SimdBench: Benchmarking Large Language Models for SIMD-Intrinsic Code Generation | SimdBench: Benchmarking großer Sprachmodelle für die SIMD-Intrinsische Codegenerierung | SimdBench:为SIMMD- Intrins 代码生成制定大语言模式基准 2507.15224v1 |
Authors (7): Yibo He, Shuoran Zhao, Jiaming Huang, Yingjie Fu, Hao Yu, Cunjian Huang, Tao Xie
SIMD (Single Instruction Multiple Data) instructions and their compiler intrinsics are widely supported by modern processors to accelerate performance-critical tasks. SIMD intrinsic programming, a trade-off between coding productivity and high performance, is widely used in the development of mainstream performance-critical libraries and daily computing tasks. Large Language Models (LLMs), which have demonstrated strong and comprehensive capabilities in code generation, show promise in assisting programmers with the challenges of SIMD intrinsic programming. However, existing code-generation benchmarks focus on only scalar code, and it is unclear how LLMs perform in generating vectorized code using SIMD intrinsics. To fill this gap, we propose SimdBench, the first code benchmark specifically designed for SIMD-intrinsic code generation, comprising 136 carefully crafted tasks and targeting five representative SIMD intrinsics: SSE (x86 Streaming SIMD Extension), AVX (x86 Advanced Vector Extension), Neon (ARM Advanced SIMD Extension), SVE (ARM Scalable Vector Extension), and RVV (RISC-V Vector Extension). We conduct a systematic evaluation (measuring both correctness and performance) of 18 representative LLMs on SimdBench, resulting in a series of novel and insightful findings. Our evaluation results demonstrate that LLMs exhibit a universal decrease in pass@k during SIMD-intrinsic code generation compared to scalar-code generation. Our in-depth analysis highlights promising directions for the further advancement of LLMs in the challenging domain of SIMD-intrinsic code generation. SimdBench is fully open source at https://anonymous.4open.science/r/SimdBench-1B3F/ to benefit the broader research community.
SIMD(单指令多数据)指令及其编译器内建函数(intrinsics)受到现代处理器的广泛支持,用于加速性能关键型任务。SIMD内建函数编程是在编码效率与高性能之间的折中,广泛用于主流性能关键库的开发和日常计算任务。大语言模型(LLM)在代码生成方面表现出强大而全面的能力,有望帮助程序员应对SIMD内建函数编程的挑战。然而,现有的代码生成基准只关注标量代码,LLM在使用SIMD内建函数生成向量化代码方面的表现尚不清楚。为填补这一空白,我们提出SimdBench,这是首个专为SIMD内建函数代码生成设计的代码基准,包含136个精心设计的任务,面向五种有代表性的SIMD内建函数:SSE(x86 Streaming SIMD Extension)、AVX(x86 Advanced Vector Extension)、Neon(ARM Advanced SIMD Extension)、SVE(ARM Scalable Vector Extension)和RVV(RISC-V Vector Extension)。我们在SimdBench上对18个有代表性的LLM进行了系统评估(同时衡量正确性和性能),得到了一系列新颖而有启发性的发现。评估结果表明,与标量代码生成相比,LLM在SIMD内建函数代码生成中的pass@k普遍下降。我们的深入分析指出了LLM在SIMD内建函数代码生成这一挑战性领域进一步发展的可行方向。SimdBench已在 https://anonymous.4open.science/r/SimdBench-1B3F/ 完全开源,以惠及更广泛的研究社区。
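The SimdBench abstract reports results in terms of pass@k but does not say how it is estimated. A commonly used unbiased estimator, given n generated samples per task of which c pass the tests, is 1 - C(n-c, k) / C(n, k). The sketch below implements that estimator; whether SimdBench uses exactly this form is an assumption, and the per-task counts are invented for illustration.

```python
from math import comb
from typing import List, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: samples generated, c: samples that passed, k: budget (1 <= k <= n).
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: List[Tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; each entry is (n, c) for one task."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical (n, c) counts for three tasks.
tasks = [(10, 3), (10, 0), (10, 10)]
print(mean_pass_at_k(tasks, k=1))
```

Comparing this quantity on scalar tasks and on their SIMD-intrinsic counterparts is the kind of measurement behind the "universal decrease in pass@k" finding.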
Article 102
Title@2025-07-21 (1): Towards Using Personas in Requirements Engineering: What Has Been Changed Recently?
Title: Towards Using Personas in Requirements Engineering: What Has Been Changed Recently? | Zum Einsatz von Personen in der Requirements Engineering: Was hat sich in letzter Zeit verändert? | 争取在要求工程中使用人:最近发生了什么变化? 2507.15197v1 |
Authors (3): Chowdhury Shahriar Muzammel, Maria Spichkova, James Harland
In requirements engineering (RE), personas are now being used to represent user expectations and needs. This systematic mapping study (SMS) aims to explore the most recent studies and to cover recent changes in trends, especially related to the recent evolution of Generative AI approaches. Our SMS covers the period between April 2023 and April 2025. We identified 22 relevant publications and analysed persona representation, construction, validation, as well as RE activities covered by personas. We identified that a number of studies applied AI-based solutions for persona construction and validation. We observed that template-based personas are becoming more popular nowadays. We also observed an increase in the proportion of studies covering validation aspects.
在要求工程(RE)中,个人现在被用来代表用户的期望和需要。这一系统绘图研究旨在探索最新研究,并涵盖最近趋势的变化,特别是与产生性AI方法的最近演变有关的变化。我们的系统管理系统涵盖2023年4月至2025年4月这一时期。我们查明了22份相关出版物,分析了人表征、建筑、验证以及个人所覆盖的可再生能源活动。我们发现一些研究对人造和验证采用了基于AI的解决方案。我们发现,基于模板的人现在越来越受欢迎。我们还发现,涵盖验证方面的研究比例有所上升。
Article 103
Title@2025-07-21 (1): Cultural Impact on Requirements Engineering Activities: Bangladeshi Practitioners’ View
Title: Cultural Impact on Requirements Engineering Activities: Bangladeshi Practitioners’ View | Kulturelle Auswirkungen auf die Anforderungen Engineering-Aktivitäten: Bangladesh-Praktiker-Ansicht | 文化对要求工程活动的影响:孟加拉国从业者的观点 2507.15188v1 |
Authors (3): Chowdhury Shahriar Muzammel, Maria Spichkova, James Harland
Requirements Engineering (RE) is one of the most interaction-intensive phases of software development. This means that RE activities might be especially impacted by stakeholders’ national culture. Software development projects increasingly have a very diverse range of stakeholders. To future-proof RE activities, we need to help RE practitioners avoid misunderstandings and conflicts that might arise from not understanding potential Cultural Influences (CIs). Moreover, an awareness of CIs supports diversity and inclusion in the IT profession. Bangladesh has a growing IT sector with some unique socio-cultural characteristics, and has been largely overlooked in this research field. In this study, we aim to investigate how the RE process is adopted in the context of Bangladeshi culture and what cultural influences impact overall RE activities.
工程要求(RE)是软件开发中互动最密集的阶段之一,这意味着可再生能源活动可能特别受到利益攸关方国家文化的影响。软件开发项目越来越多地拥有多种多样的利益攸关方。对于未来防患于未然的可再生能源活动,我们需要帮助可再生能源从业者避免误解和冲突,这些误解和冲突可能是由于不了解潜在的文化影响而引起的。此外,对信息系统的认识支持信息技术行业的多样性和包容性。孟加拉国信息技术部门日益壮大,具有一些独特的社会文化特点,而且在这个研究领域基本上被忽视。在本研究中,我们旨在调查在孟加拉国文化背景下如何采用可再生能源进程,以及哪些文化影响对整个可再生能源活动产生影响。
Article 104
Title@2025-07-21 (1): Deep Learning Framework Testing via Heuristic Guidance Based on Multiple Model Measurements
Title: Deep Learning Framework Testing via Heuristic Guidance Based on Multiple Model Measurements | Deep-Learning-Framework-Tests mittels Heuristischer Anleitung basierend auf mehreren Modellmessungen | 利用基于多种模式计量的指数性指导进行深学习框架测试 2507.15181v1 |
Authors (6): Yinglong Zou, Juan Zhai, Chunrong Fang, Yanzhou Mu, Jiawei Liu, Zhenyu Chen
Deep learning frameworks serve as the foundation for developing and deploying deep learning applications. To enhance the quality of deep learning frameworks, researchers have proposed numerous testing methods using deep learning models as test inputs. However, existing methods predominantly measure model bug detection effectiveness as heuristic indicators, presenting three critical limitations: Firstly, existing methods fail to quantitatively measure a model’s operator combination variety, potentially missing critical operator combinations that could trigger framework bugs. Secondly, existing methods neglect measuring model execution time, resulting in the omission of numerous models with the potential to detect more framework bugs within limited testing time. Thirdly, existing methods overlook the correlation between different model measurements, relying simply on single-indicator heuristic guidance without considering their trade-offs. To overcome these limitations, we propose DLMMM, the first deep learning framework testing method to include multiple model measurements in heuristic guidance and fuse these measurements to achieve their trade-off. DLMMM first quantitatively measures a model’s bug detection performance, operator combination variety, and model execution time. After that, DLMMM fuses the above measurements based on their correlation to achieve their trade-off. To further enhance testing effectiveness, DLMMM designs multi-level heuristic guidance for test input model generation.
深层次学习框架是开发和应用深层次学习应用程序的基础。为了提高深层次学习框架的质量,研究人员提出了许多测试方法,将深层次学习模型用作测试投入,然而,现有方法主要衡量模型错误检测效力,将其作为超自然指标,提出了三个关键限制:首先,现有方法未能定量测量模型操作者的各种组合,可能缺少可能引起框架错误的关键操作者组合。第二,现有方法忽视了模型执行时间,导致在有限的测试时间内忽略了许多模型在发现更多框架错误方面的潜力。第三,现有方法忽略了不同模型测量的相互关系,仅仅依靠单项指标超常指导,而没有考虑它们的取舍。为克服这些局限性,我们建议DLMMM,这是第一个深层次学习框架测试方法,将多种模型测量纳入超自然指南,并将这些测量结合起来,以实现其取舍。DLMMM,首先定量测量模型的错误检测性能,操作者组合组合,以及模型执行时间。随后,DLMMMM,根据它们的相互关系结合了上述测量方法,以实现其取舍。为了进一步提高测试有效性,DLMMMMM为多层次的试验。
Article 105
Title@2025-07-20 (7): Can LLMs Generate User Stories and Assess Their Quality?
Title: Can LLMs Generate User Stories and Assess Their Quality? | Können LLMs User Stories generieren und ihre Qualität bewerten? | LLMs能够产生用户故事并评估其质量吗? 2507.15157v1 |
Authors (4): Giovanni Quattrocchi, Liliana Pasquale, Paola Spoletini, Luciano Baresi
Requirements elicitation is still one of the most challenging activities of the requirements engineering process due to the difficulty requirements analysts face in understanding and translating complex needs into concrete requirements. In addition, specifying high-quality requirements is crucial, as it can directly impact the quality of the software to be developed. Although automated tools allow for assessing the syntactic quality of requirements, evaluating semantic metrics (e.g., language clarity, internal consistency) remains a manual and time-consuming activity. This paper explores how LLMs can help automate requirements elicitation within agile frameworks, where requirements are defined as user stories (US). We used 10 state-of-the-art LLMs to investigate their ability to generate US automatically by emulating customer interviews. We evaluated the quality of US generated by LLMs, comparing it with the quality of US generated by humans (domain experts and students). We also explored whether and how LLMs can be used to automatically evaluate the semantic quality of US. Our results indicate that LLMs can generate US similar to humans in terms of coverage and stylistic quality, but exhibit lower diversity and creativity. Although LLM-generated US are generally comparable in quality to those created by humans, they tend to meet the acceptance quality criteria less frequently, regardless of the scale of the LLM model. Finally, LLMs can reliably assess the semantic quality of US when provided with clear evaluation criteria and have the potential to reduce human effort in large-scale assessments.
由于分析家在理解和将复杂需求转化为具体要求方面所面临的困难,要求引出的要求仍然是需求工程流程中最具挑战性的活动之一,因为分析家在理解和将复杂需求转化为具体要求方面面临着困难。此外,具体规定高质量要求至关重要,因为它能够直接影响到拟开发的软件的质量。虽然自动化工具允许评估要求的综合质量,但评价语义衡量标准(例如语言清晰度、内部一致性)仍然是手工和耗时的活动。本文探讨了LLLMS如何在灵活框架内帮助需求自动提出,因为要求被界定为用户故事(美国)。我们使用10个最先进的LMS调查其通过模拟客户访谈自动生成美国的能力。我们评估了LMMS产生的美国质量,将其与人(主要专家和学生)生成的质量进行比较。我们还探讨了LMS是否以及如何使用LMS自动评估其质量。 我们的结果表明,LMS在覆盖范围和质量方面与人相类似,但它们表现出较低的多样性和创造性。尽管LMM公司最终的接受程度通常比标准的质量,但最终的SLMA标准通常比标准的质量低。
Article 106
Title@2025-07-20 (7): Design of an Edge-based Portable EHR System for Anemia Screening in Remote Health Applications
Title: Design of an Edge-based Portable EHR System for Anemia Screening in Remote Health Applications | Design eines Edge-basierten tragbaren EHR-Systems für die Anämie-Screening in Remote Health-Anwendungen | 设计一个以边缘为基础的远程保健应用中贫血筛查的便携EHR系统 2507.15146v1 |
Authors (5): Sebastian A. Cruz Romero, Misael J. Mercado Hernandez, Samir Y. Ali Rivera, Jorge A. Santiago Fernandez, Wilfredo E. Lugo Beauchamp
The design of medical systems for remote, resource-limited environments faces persistent challenges due to poor interoperability, lack of offline support, and dependency on costly infrastructure. Many existing digital health solutions neglect these constraints, limiting their effectiveness for frontline health workers in underserved regions. This paper presents a portable, edge-enabled Electronic Health Record platform optimized for offline-first operation, secure patient data management, and modular diagnostic integration. Running on small-form factor embedded devices, it provides AES-256 encrypted local storage with optional cloud synchronization for interoperability. As a use case, we integrated a non-invasive anemia screening module leveraging fingernail pallor analysis. Trained on 250 patient cases (27% anemia prevalence) with KDE-balanced data, the Random Forest model achieved a test RMSE of 1.969 g/dL and MAE of 1.490 g/dL. A severity-based model reached 79.2% sensitivity. To optimize performance, a YOLOv8n-based nail bed detector was quantized to INT8, reducing inference latency from 46.96 ms to 21.50 ms while maintaining mAP@0.5 at 0.995. The system emphasizes low-cost deployment, modularity, and data privacy compliance (HIPAA/GDPR), addressing critical barriers to digital health adoption in disconnected settings. Our work demonstrates a scalable approach to enhance portable health information systems and support frontline healthcare in underserved regions.
由于互操作性差、缺乏离线支持和依赖昂贵的基础设施,为偏远、资源有限的环境设计医疗系统面临长期挑战,原因是互操作性差、缺乏离线支持和依赖昂贵的基础设施。许多现有的数字保健解决方案忽视了这些制约因素,限制了这些制约因素,限制了在服务不足地区第一线卫生工作者对一线卫生工作者的实效。本文件介绍了一个为离线第一运行优化的可移植、安全病人数据管理和模块诊断整合的可携式电子健康记录平台。运行于小式要素嵌入装置,为互操作性提供了AES-256加密的本地储存,并提供了可选用云同步的可选互操作性。作为一个使用实例,我们结合了非侵入性贫血筛查模块分析,整合了非侵入性贫血筛查模块。对250个病人案例(27贫血流行率)进行了培训,并结合了 KDE平衡数据,随机森林模型实现了测试RMSE1.969 g/dL和MAE 1.490 g/dL。 重度模型的敏感性达到79.2。为了优化性,YOLOV8n的指甲床支持检测器被撤销为可移动式8,因此,将精度从46.96米至21.50米降为21.50米标准。同时维持了250人供使用,同时维护了 KDE-10的保密性安全标准,在0.1,在0.1BS-AS级标准下改进了我们的安全安全标准,在0.95中加强了标准。
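The anemia-screening results above are stated as RMSE, MAE (both in g/dL), and sensitivity. For readers less familiar with these metrics, the sketch below computes them from paired values using their standard definitions. The sample hemoglobin values and confusion counts are invented for illustration and are not data from the paper.

```python
import math
from typing import Sequence


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean squared error; penalizes large misses more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error, in the same unit as the inputs."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Share of actual positive cases that the screen catches (recall)."""
    return true_positives / (true_positives + false_negatives)


# Hypothetical hemoglobin values in g/dL (measured vs. predicted).
measured = [12.0, 10.0, 14.0, 9.0]
predicted = [11.0, 11.0, 12.0, 11.0]
print(rmse(measured, predicted), mae(measured, predicted), sensitivity(19, 5))
```

RMSE is never smaller than MAE on the same data, which matches the ordering of the two reported values (1.969 versus 1.490 g/dL).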
Article 107
Title@2025-07-20 (7): A Semantic-based Optimization Approach for Repairing LLMs: Case Study on Code Generation
Title: A Semantic-based Optimization Approach for Repairing LLMs: Case Study on Code Generation | Ein semantisch-basierter Optimierungsansatz zur Reparatur von LLMs: Fallstudie zur Codegenerierung | 修复LLMLM 的基于语义的优化优化方法:关于代码生成的案例研究 2503.12899v3 |
Authors (4): Jian Gu, Aldeida Aleti, Chunyang Chen, Hongyu Zhang
Language Models (LMs) are widely used in software engineering for code generation, but they may produce code with errors. Rather than repairing the generated code, an alternative way is to address the underlying failures of models. LM repair offers a lightweight solution to this challenge: it requires minimal data, reduces computational costs, and reduces the side effects. Unlike retraining, LM repair focuses on applying tailored updates to targeted neurons, making it ideal for scenarios with limited resources, high-performance demands, or strict safety requirements. In this paper, we propose Semantic Targeting for Analytical Repair (STAR), a pioneering and novel semantic-based optimization approach for repairing LLMs. STAR realizes the main operations of repairing LMs in an optimization process, including locating “buggy neurons”, solving “neuron patches”, and patching “buggy neurons”. Correspondingly, it computes the deltas of weight matrix as the prior information to guide optimization; and attributes the targeted layers and neurons leveraging statistical insights. The neuron patches are computed with a solid semantic-based analytical formula, which directly bridges the changes to logits with the deltas of neurons, by steering latent representations. Compared to the prior work of LM repair (MINT) and optimization methods (SGD), STAR integrates their strengths while mitigating their limitations. STAR supports solving multiple failures together, significantly improving the usefulness. Evaluated on coding tasks using popular code LMs, STAR exhibits superior effectiveness (10.5%-19.9% improvements) and efficiency (2.4-7.0 times speedup). In terms of side effects, namely the balance between generalization and specificity, STAR outperforms prior work by a significant margin. Additionally, we conducted assessments on the overfitting risk of LM repair as well as the cumulative impact.
语言模型(LM)在软件工程中被广泛用于代码生成,但它们可能生成含有错误的代码。除了修复生成的代码,另一种途径是解决模型自身的底层缺陷。LM修复为此提供了一种轻量级方案:它只需极少的数据,降低计算成本,并减少副作用。与重新训练不同,LM修复侧重于对目标神经元施加定制化的更新,因此非常适合资源有限、性能要求高或安全要求严格的场景。本文提出STAR(Semantic Targeting for Analytical Repair),一种开创性的、基于语义的LLM修复优化方法。STAR在一个优化过程中实现了修复LM的主要操作,包括定位“缺陷神经元”、求解“神经元补丁”以及修补“缺陷神经元”。相应地,它计算权重矩阵的增量作为引导优化的先验信息,并借助统计信息对目标层和神经元进行归因。神经元补丁由一个基于语义的解析公式计算得出,该公式通过操控潜在表示,将logits的变化与神经元的增量直接联系起来。与此前的LM修复工作(MINT)和优化方法(SGD)相比,STAR融合了二者的优势并缓解了它们的局限。STAR支持同时解决多个故障,显著提升了实用性。在使用主流代码LM的编码任务上,STAR表现出更高的有效性(提升10.5%-19.9%)和效率(加速2.4-7.0倍)。在副作用方面,即泛化性与特异性之间的平衡,STAR大幅优于已有工作。此外,我们还评估了LM修复的过拟合风险及其累积影响。
Article 108
Title@2025-07-20 (7): ROSE: Transformer-Based Refactoring Recommendation for Architectural Smells
Title: ROSE: Transformer-Based Refactoring Recommendation for Architectural Smells | ROSE: Transformerbasierte Refactoring-Empfehlung für architektonische Gerüche | ROSE: 以变压器为基础的建筑气味重建建议 2507.12561v2 |
Authors (4): Samal Nursapa, Anastassiya Samuilova, Alessio Bucaioni, Phuong T. Nguyen
Architectural smells such as God Class, Cyclic Dependency, and Hub-like Dependency degrade software quality and maintainability. Existing tools detect such smells but rarely suggest how to fix them. This paper explores the use of pre-trained transformer models, CodeBERT and CodeT5, for recommending suitable refactorings based on detected smells. We frame the task as a three-class classification problem and fine-tune both models on over 2 million refactoring instances mined from 11,149 open-source Java projects. CodeT5 achieves 96.9% accuracy and 95.2% F1, outperforming CodeBERT and traditional baselines. Our results show that transformer-based models can effectively bridge the gap between smell detection and actionable repair, laying the foundation for future refactoring recommendation systems. We release all code, models, and data under an open license to support reproducibility and further research.
建筑结构的气味, 如上帝级、 环球依赖性 和类似 Hub 的依附性 等, 降低了软件质量和可维护性。 现有工具检测到这种气味, 但很少建议如何修复这些气味。 本文探索了使用预先训练的变压器模型- CodeBERT 和 CodeT5 来根据检测到的气味建议适当的再设定因素。 我们将此任务设定为三级分类问题, 并对11,149 开放源的爪哇项目中200多万个重设事件的模式进行微调。 代码T5 实现了96.9%的准确度, 95.2% F1, 超过了业绩好的代码BERT 和传统基线。 我们的结果表明, 变压器模型可以有效地弥合气味检测和可操作的修理之间的差距, 为未来的再调节建议系统打下基础。 我们发布所有代码、 模型和数据, 以公开的许可支持可复制和进一步的研究 。
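ROSE reports accuracy and F1 for a three-class problem. The abstract does not say how F1 is averaged across classes; the sketch below assumes macro averaging (the unweighted mean of per-class F1) and uses made-up labels 0, 1, 2 standing in for the three refactoring classes.

```python
from typing import List, Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int], labels: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 (macro averaging is an assumption here)."""
    scores: List[float] = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for six refactoring instances.
truth = [0, 0, 1, 1, 2, 2]
preds = [0, 0, 1, 2, 2, 2]
print(accuracy(truth, preds), macro_f1(truth, preds, labels=[0, 1, 2]))
```

Accuracy and macro F1 can diverge when classes are imbalanced, which is one reason to report both as the abstract does.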
Article 109
Title@2025-07-20 (7): ModelVerification.jl: a Comprehensive Toolbox for Formally Verifying Deep Neural Networks
Title: ModelVerification.jl: a Comprehensive Toolbox for Formally Verifying Deep Neural Networks | ModelVerification.jl: eine umfassende Toolbox zur formalen Überprüfung tiefer neuraler Netzwerke | 模型核查.jl:用于正式核查深神经网络的综合工具箱 2407.01639v2 |
Authors (7): Tianhao Wei, Hanjiang Hu, Luca Marzari, Kai S. Yun, Peizhi Niu, Xusheng Luo, Changliu Liu
Deep Neural Networks (DNN) are crucial in approximating nonlinear functions across diverse applications, ranging from image classification to control. Verifying specific input-output properties can be a highly challenging task due to the lack of a single, self-contained framework that allows a complete range of verification types. To this end, we present ModelVerification.jl (MV), the first comprehensive, cutting-edge toolbox that contains a suite of state-of-the-art methods for verifying different types of DNNs and safety specifications. This versatile toolbox is designed to empower developers and machine learning practitioners with robust tools for verifying and ensuring the trustworthiness of their DNN models.
深神经网络(DNN)对于从图像分类到控制等各种应用的近似非线性功能至关重要。 验证具体的输入输出特性可能是一项极具挑战性的任务, 原因是缺少一个单一的、 自足的框架, 允许一系列完整的验证类型。 为此, 我们提供第一个全面的、 尖端的工具箱, 其中包括一套最先进的方法, 用于核实不同类型的 DNN 和安全规格。 这个多功能工具箱旨在赋予开发商和机器学习从业者以强大的工具, 以核实和确保其 DNN 模型的可信赖性。
Article 110
Title@2025-07-20 (7): LibLMFuzz: LLM-Augmented Fuzz Target Generation for Black-box Libraries
Title: LibLMFuzz: LLM-Augmented Fuzz Target Generation for Black-box Libraries | LibLMFuzz: LLM-Augmented Fuzz Target Generation für Black-Box-Bibliotheken | LibLMFuzz: 黑盒图书馆LLM- 推荐的模糊目标生成 2507.15058v1 |
Authors (2): Ian Hardgrove, John D. Hastings
A fundamental problem in cybersecurity and computer science is determining whether a program is free of bugs and vulnerabilities. Fuzzing, a popular approach to discovering vulnerabilities in programs, has several advantages over alternative strategies, although it has investment costs in the form of initial setup and continuous maintenance. The choice of fuzzing is further complicated when only a binary library is available, such as the case of closed-source and proprietary software. In response, we introduce LibLMFuzz, a framework that reduces costs associated with fuzzing closed-source libraries by pairing an agentic Large Language Model (LLM) with a lightweight tool-chain (disassembler/compiler/fuzzer) to autonomously analyze stripped binaries, plan fuzz strategies, generate drivers, and iteratively self-repair build or runtime errors. Tested on four widely-used Linux libraries, LibLMFuzz produced syntactically correct drivers for all 558 fuzz-able API functions, achieving 100% API coverage with no human intervention. Across the 1601 synthesized drivers, 75.52% were nominally correct on first execution. The results show that LLM-augmented middleware holds promise in reducing the costs of fuzzing black box components and provides a foundation for future research efforts. Future opportunities exist for research in branch coverage.
网络安全和计算机科学的一个根本问题是确定一个程序是否没有错误和弱点。 模糊是一种在程序中发现弱点的流行方法,它比替代战略具有若干优势,尽管它以初始设置和连续维护的形式具有投资成本。 当只有一个二进图书馆,例如封闭源码和专利软件的情况下,模糊的选择就更加复杂。 作为回应,我们引入了LibLMFMFuzz这一框架,这个框架通过将一个轻量工具链(拆卸/拼装/fuzzer/fuzzer)与一个代理大语言模型(LLM)配对来降低与模糊封闭源图书馆相关的成本,以便自动分析拆卸的二进制工具链(拆卸/拼装/fuzzer/fuzzer),而其投资成本以初始设置和运行错误的形式。在四个广泛使用的Linux图书馆进行测试后,LiLLMFFFuzz生成了对558个模糊的API功能的同步正确驱动器,实现了100%的覆盖,而没有人干预。 在1601个综合驱动器中,75.52%的合成驱动器在首次执行中名义上纠正了未来研究的深度成本。
Article 111
Title@2025-07-20 (7): Survey of GenAI for Automotive Software Development: From Requirements to Executable Code
Title: Survey of GenAI for Automotive Software Development: From Requirements to Executable Code | Umfrage bei GenAI für die Entwicklung von Automotive Software: Von Anforderungen zum ausführbaren Code | GenAI汽车软件开发调查:从要求到可执行守则 2507.15025v1 |
Authors (13): Nenad Petrovic, Vahid Zolfaghari, Andre Schamschurko, Sven Kirchner, Fengjunjie Pan, Chengdng Wu, Nils Purschke, Aleksei Velsh, Krzysztof Lebioda, Yinglei Song, Yi Zhang, Lukasz Mazur, Alois Knoll
Adoption of state-of-the-art Generative Artificial Intelligence (GenAI) aims to revolutionize many industrial areas by reducing the human intervention and effort needed to handle complex underlying processes. Automotive software development is considered a significant area for GenAI adoption, given its lengthy and expensive procedures, which result from the number of requirements and strict standardization. In this paper, we explore the adoption of GenAI for various steps of automotive software development, mainly focusing on requirements handling, compliance aspects and code generation. Three GenAI-related technologies are covered within the state of the art: Large Language Models (LLMs), Retrieval Augmented Generation (RAG), and Vision Language Models (VLMs), along with an overview of the prompting techniques adopted for code generation. We also derive a generalized GenAI-aided automotive software development workflow based on our findings from this literature review. Finally, we include a summary of a survey conducted among our automotive industry partners regarding the types of GenAI tools used in their daily work.
采用最先进的人工智能(GenAI)的目的是通过减少所需人力干预和处理复杂基本程序的努力,使许多工业领域发生革命性变化,汽车软件开发被认为是GENAI采用的一个重要领域,考虑到要求数量和严格标准化导致的冗长和昂贵的程序;在本文件中,我们探索采用GenAI开发汽车软件的各种步骤,主要侧重于处理要求、合规方面和代码生成;三种与GenAI有关的技术属于最新技术:大语言模型(LLM)、检索增强型(RAG)、愿景语言模型(VLMS),以及在代码生成方面采用的快速技术概览;此外,我们还根据我们从文献审查中得出的研究结果,得出了通用通用的由GenAI辅助的汽车软件开发工作流程;最后,我们还包括一个调查结果摘要,这是我们汽车行业伙伴之间就GenAI日常工作所使用的工具类型进行的一项调查的结果。
Article 112
Title@2025-07-20 (7): Taint Analysis for Graph APIs Focusing on Broken Access Control
Title: Taint Analysis for Graph APIs Focusing on Broken Access Control | Taint-Analyse für Graph-APIs mit Fokus auf Broken Access Control | 以断断存控制为重点的图表APP的图纸分析 2501.08947v2 |
Authors (4): Leen Lambers, Lucas Sakizloglou, Taisiya Khakharova, Fernando Orejas
We present the first systematic approach to static and dynamic taint analysis for Graph APIs focusing on broken access control. The approach comprises the following. We taint nodes in the Graph API if they represent data requiring specific privileges in order to be retrieved or manipulated, and identify API calls that are related to sources and sinks. Then, we statically analyze whether tainted information flow between API source and sink calls occurs. To this end, we model the API calls using graph transformation rules. We subsequently use critical pair analysis (CPA) to automatically analyze potential dependencies between rules representing source calls and rules representing sink calls. We distinguish direct from indirect tainted information flow and argue under which conditions the CPA is able to detect not only direct, but also indirect tainted flow. The static taint analysis (i) identifies flows that need to be further reviewed, since tainted nodes may be created by an API call and used or manipulated by another API call later without having the necessary privileges, and (ii) can be used to systematically design dynamic security tests for broken access control. The dynamic taint analysis checks whether potential broken access control risks detected during the static taint analysis really occur. We apply the approach to a part of the GitHub GraphQL API. The application illustrates that our analysis supports the systematic detection of two types of broken access control: the case where users of the API may not be able to access or manipulate information, although they should be able to do so; and the case where users (or attackers) of the API may be able to access or manipulate information that they should not.
我们为图表 API 的静态和动态污点分析提出了第一个系统化的静态和动态污点分析方法,其重点是断开的访问控制。该方法包括以下内容:我们在图 API中显示需要特定权限的数据以便被检索或操纵的数据,并指明与源和汇有关的API调用量。然后,我们静态分析API 源和汇调用量之间是否有污点信息流动。为此,我们用图形转换规则模拟 API 调用 。我们随后使用关键对口分析自动分析代表源调用和汇调用的规则之间的潜在依赖性。我们将直接与间接污染的信息流动区分开来,并争论在哪些条件下CPI不仅能够检测到直接的,而且还能够检测到间接污染的流动量。静态的污点分析(一) 确定需要进一步审查的流动量,因为污染的节点可能由 API 调用电话产生, 使用或操纵另一个API 调用量,我们没有必要的特权;以及(二) 能够系统设计出访问控制系统的安全测试。如果在静中检测过程中检测到的访问控制风险,则进行动态的检查。 我们的用户可以使用访问分析。
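The distinction between direct and indirect tainted flow in the abstract above can be illustrated with a much simpler produce-use dependency check. This is only a sketch under strong assumptions: it is not critical pair analysis and does not use graph transformation rules. Each API call is reduced to the node types it creates and uses, and the `Rule` class, the helper functions, and the call names (`createIssue`, `addComment`, `readIssue`, `readComment`) are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    """Hypothetical, simplified model of an API call: node types created and used."""
    name: str
    creates: frozenset = frozenset()
    uses: frozenset = frozenset()


def reachable_types(start, rules):
    """Node types derivable from `start` by chaining intermediate rules."""
    reached = set(start)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            # A rule fires if it uses something already reached and adds a new type.
            if rule.uses & reached and not rule.creates <= reached:
                reached |= rule.creates
                changed = True
    return reached


def classify_flows(sources, sinks, intermediates, tainted):
    """Map (source, sink) name pairs to 'direct' or 'indirect' tainted flow."""
    flows = {}
    for source in sources:
        start = source.creates & tainted
        if not start:
            continue
        reached = reachable_types(start, intermediates)
        for sink in sinks:
            if sink.uses & start:
                flows[(source.name, sink.name)] = "direct"
            elif sink.uses & reached:
                flows[(source.name, sink.name)] = "indirect"
    return flows


# Illustrative rules only; "Issue" is the tainted node type in this example.
create_issue = Rule("createIssue", creates=frozenset({"Issue"}))
add_comment = Rule("addComment", creates=frozenset({"Comment"}), uses=frozenset({"Issue"}))
read_issue = Rule("readIssue", uses=frozenset({"Issue"}))
read_comment = Rule("readComment", uses=frozenset({"Comment"}))

flows = classify_flows([create_issue], [read_issue, read_comment], [add_comment], {"Issue"})
for pair, kind in flows.items():
    print(pair, kind)
```

Each reported pair would be a candidate for review or for a dynamic access-control test; a real analysis would work on the graph transformation rules themselves, not on type sets.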
Article 113
Title@2025-07-20 (7): The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering
Title: The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering | Der Aufstieg von KI-Teamkollegen in der Software-Engineering (SE) 3.0: Wie autonome Coding-Agenten Software-Engineering umgestalten | AI软件工程(SE)3.0:自动编码代理人如何重组软件工程 2507.15003v1 |
Authors (3): Hao Li, Haoxiang Zhang, Ahmed E. Hassan
The future of software engineering (SE 3.0) is unfolding with the rise of AI teammates: autonomous, goal-driven systems collaborating with human developers. Among these, autonomous coding agents are especially transformative, now actively initiating, reviewing, and evolving code at scale. This paper introduces AIDev, the first large-scale dataset capturing how such agents operate in the wild. Spanning over 456,000 pull requests by five leading agents (OpenAI Codex, Devin, GitHub Copilot, Cursor, and Claude Code) across 61,000 repositories and 47,000 developers, AIDev provides an unprecedented empirical foundation for studying autonomous teammates in software development. Unlike prior work that has largely theorized the rise of AI-native software engineering, AIDev offers structured, open data to support research in benchmarking, agent readiness, optimization, collaboration modeling, and AI governance. The dataset includes rich metadata on PRs, authorship, review timelines, code changes, and integration outcomes, enabling exploration beyond synthetic benchmarks like SWE-bench. For instance, although agents often outperform humans in speed, their PRs are accepted less frequently, revealing a trust and utility gap. Furthermore, while agents accelerate code submission (one developer submitted as many PRs in three days as they had in three years), these PRs are structurally simpler, as measured by code complexity metrics. We envision AIDev as a living resource: extensible, analyzable, and ready for the SE and AI communities. Grounding SE 3.0 in real-world evidence, AIDev enables a new generation of research into AI-native workflows and supports building the next wave of symbiotic human-AI collaboration. The dataset is publicly available at https://github.com/SAILResearch/AI_Teammates_in_SE3. Keywords: AI Agent, Agentic AI, Coding Agent, Agentic Coding, Software Engineering Agent
未来软件工程- SE 3. 0 正在随着AI 队友的崛起而展开: 自主的、目标驱动的系统与人类开发者合作。 其中, 自主的编码代理器特别具有变革性, 现在正在积极启动、 审查并发展规模的代码。 本文介绍AIDev, 这是第一个大型数据集, 记录这些代理器在野外运作的方式。 5个主要代理商 (OpenAI Codex、 Devin、 GitHub Copilot、 Cursor 和 Claude Code) 的456 000多次拉动请求, 跨越61 000 库和47 000 开发者。 AIDev 提供了在软件开发中研究自主团队的前所未有的经验基础。
Article 114
Title@2025-07-20 (7): Metaverse Security and Privacy Research: A Systematic Review
Title: Metaverse Security and Privacy Research: A Systematic Review | Metaverse Security und Privacy Research: Eine systematische Überprüfung | 超词安全和隐私研究:系统审查 2507.14985v1 |
Authors (3): Argianto Rahartomo, Leonel Merino, Mohammad Ghafari
The rapid growth of metaverse technologies, including virtual worlds, augmented reality, and lifelogging, has accelerated their adoption across diverse domains. This rise exposes users to significant new security and privacy challenges due to sociotechnical complexity, pervasive connectivity, and extensive user data collection in immersive environments. We present a systematic review of the literature published between 2013 and 2024, offering a comprehensive analysis of how the research community has addressed metaverse-related security and privacy issues over the past decade. We organize the studies by method and examine their security and privacy properties, immersive components, and evaluation strategies. Our investigation reveals a sharp increase in research activity in the last five years, a strong focus on practical and user-centered approaches, and a predominant use of benchmarking, human experimentation, and qualitative methods. Authentication and unobservability are the most frequently studied properties. However, critical gaps remain in areas such as policy compliance, accessibility, interoperability, and back-end infrastructure security. We emphasize the intertwined technical complexity and human factors of the metaverse and call for integrated, interdisciplinary approaches to securing inclusive and trustworthy immersive environments.
包括虚拟世界在内的逆向技术的迅速增长、现实的扩大和生命网的迅速增长加快了其在不同领域的采用。由于社会技术复杂性、普遍连通性和在隐蔽环境中广泛收集用户数据,这一上升使用户面临新的重大安全和隐私挑战。我们系统地审查了2013年至2024年期间出版的文献,全面分析了研究界在过去十年中如何处理与逆向有关的安全和隐私问题。我们以各种方式组织研究,检查了安全和隐私特性、隐蔽部分和评价战略。我们的调查显示,过去五年来,研究活动急剧增加,重点突出务实和以用户为中心的方法,以及主要使用基准、人类实验和定性方法。最经常研究的特性是校准和不可观察性。然而,在政策遵守、无障碍、互操作性和后端基础设施安全等方面仍然存在重大差距。我们强调,元性技术复杂性和人为因素相互交织,并呼吁采取综合、跨学科方法,确保包容性和可靠的隐蔽环境。
Article 115
Title@2025-07-20 (7): Think Like an Engineer: A Neuro-Symbolic Collaboration Agent for Generative Software Requirements Elicitation and Self-Review
Title: Think Like an Engineer: A Neuro-Symbolic Collaboration Agent for Generative Software Requirements Elicitation and Self-Review | Denken Sie wie ein Ingenieur: Ein neuro-symbolischer Collaboration Agent für generative Software-Anforderungen Elizitation und Selbst-Review | 象工程师一样思考:一个创造软件要求求救和自我审查的神经-双曲协作代理 2507.14969v1 |
Authors (8): Sai Zhang, Zhenchang Xing, Jieshan Chen, Dehai Zhao, Zizhong Zhu, Xiaowang Zhang, Zhiyong Feng, Xiaohong Li
The vision of End-User Software Engineering (EUSE) is to empower non-professional users with full control over the software development lifecycle. It aims to enable users to drive generative software development using only natural language requirements. However, since end-users often lack knowledge of software engineering, their requirement descriptions are frequently ambiguous, raising significant challenges to generative software development. Although existing approaches utilize structured languages like Gherkin to clarify user narratives, they still struggle to express the causal logic between preconditions and behavior actions. This paper introduces RequireCEG, a requirement elicitation and self-review agent that embeds causal-effect graphs (CEGs) in a neuro-symbolic collaboration architecture. RequireCEG first uses a feature tree to analyze user narratives hierarchically, clearly defining the scope of software components and their system behavior requirements. Next, it constructs self-healing CEGs based on the elicited requirements, capturing the causal relationships between atomic preconditions and behavioral actions. Finally, the constructed CEGs are used to review and optimize Gherkin scenarios, ensuring consistency between the generated Gherkin requirements and the system behavior requirements elicited from user narratives. To evaluate our method, we created the RGPair benchmark dataset and conducted extensive experiments. RequireCEG achieves an 87% coverage rate and raises diversity by 51.88%.
终端用户软件工程(EUSE)的愿景是赋予非专业用户权力,使其完全控制软件开发生命周期。它旨在使用户能够仅使用自然语言要求,推动基因化软件开发;然而,由于最终用户往往缺乏软件工程知识,其要求描述往往含糊不清,给基因化软件开发带来重大挑战。虽然现有方法使用Gherkin等结构化语言来澄清用户的叙述,但它们仍然难以表达先决条件与行为行动之间的因果关系。本文件介绍了要求计算组,这是将因果关系图(CEGs)嵌入神经同步合作结构的要求导出和自我审查剂。要求计算组首先使用一棵特写树来分析用户的叙事,按等级明确定义软件组件的范围和系统行为要求。其次,它根据引出的要求构建自我愈合的 CEGs,捕捉到原子先决条件与行为行动之间的因果关系。最后,构建的CEGs用来审查并优化Gherkin情景,确保生成的Gherkin要求与系统行为要求之间的一致性。它首先使用一个特征树来分析用户描述用户叙述范围,通过87的用户叙述率和测试率来提高我们的数据。
Article 116
Title@2025-07-20 (7): StaAgent: An Agentic Framework for Testing Static Analyzers
Title: StaAgent: An Agentic Framework for Testing Static Analyzers | StaAgent: Agentischer Rahmen für die Prüfung statischer Analyzer | StaAgent: 静态分析器测试的剂框架 2507.15892v1 |
Authors (5): Elijah Nnorom, Md Basim Uddin Ahmed, Jiho Shin, Hung Viet Pham, Song Wang
Static analyzers play a critical role in identifying bugs early in the software development lifecycle, but their rule implementations are often under-tested and prone to inconsistencies. To address this, we propose StaAgent, an agentic framework that harnesses the generative capabilities of Large Language Models (LLMs) to systematically evaluate static analyzer rules. StaAgent comprises four specialized agents: a Seed Generation Agent that translates bug detection rules into concrete, bug-inducing seed programs; a Code Validation Agent that ensures the correctness of these seeds; a Mutation Generation Agent that produces semantically equivalent mutants; and an Analyzer Evaluation Agent that performs metamorphic testing by comparing the static analyzer's behavior on seeds and their corresponding mutants. By revealing inconsistent behaviors, StaAgent helps uncover flaws in rule implementations. This LLM-driven, multi-agent framework offers a scalable and adaptable solution to improve the reliability of static analyzers. We evaluated StaAgent with five state-of-the-art LLMs (CodeLlama, DeepSeek, Codestral, Qwen, and GPT-4o) across five widely used static analyzers (SpotBugs, SonarQube, ErrorProne, Infer, and PMD). The experimental results show that our approach can help reveal 64 problematic rules in the latest versions of these five static analyzers (i.e., 28 in SpotBugs, 18 in SonarQube, 6 in ErrorProne, 4 in Infer, and 8 in PMD). In addition, 53 of the 64 bugs cannot be detected by the SOTA baseline. We have reported all the bugs to developers, and two of them have already been fixed. Three more have been confirmed by developers, while the rest are awaiting response. These results demonstrate the effectiveness of our approach and underscore the promise of agentic, LLM-driven data synthesis to advance software engineering.
StaAgenti 分析器在早期识别软件开发生命周期中的错误方面发挥着关键作用,但规则执行往往测试不足,容易出现不一致。为此,我们提议StaAgenti,这是一个利用大语言模型(LLMs)基因化能力的代理框架,用于系统评估静态分析规则。 StaAgenti 由四个专门代理商组成:一个种子生成代理,将错误检测规则转化为具体、诱导错误种子程序;一个代码验证代理,确保这些种子的正确性;一个静态生成代理,产生等效变异体;以及一个分析器评估器,通过比较静态分析器在种子及其相应变异体上的行为来进行变异性测试。通过显示不一致的行为,StaAgency帮助发现规则执行中的缺陷。这个由种子生成的种子生成器驱动的多试样框架提供了一种可缩放和可变化的解决方案,我们以5个状态变变变现的变现工具,我们以5个状态变现的变现工具来确认StaArental 。 在5个版本中(Creal-lama, Decostralstrystralstrystrystral Stal Stal Stal) 和变现的变现结果中,在5个变现的变现、Qral-rmalal-rmadal-Re化的变现的变现数据中无法、QQ-Re化的变现,在Sildal-Re变现数据中,在Sildalmadal-I在5个Prmad)
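The metamorphic check at the core of the StaAgent abstract above (an analyzer should report the same findings on a seed and on its semantically equivalent mutants) can be sketched as follows. This is a minimal illustration, not StaAgent's implementation: `toy_analyzer` is a hypothetical regex-based stand-in for a real static analyzer, and the seed and mutants are hand-written, not LLM-generated.

```python
from __future__ import annotations

import re


def toy_analyzer(source: str) -> set[str]:
    """Hypothetical stand-in for a static analyzer rule: flags `== None` comparisons."""
    warnings = set()
    if re.search(r"==\s*None", source):
        warnings.add("COMPARE_TO_NONE")
    return warnings


def metamorphic_check(analyzer, seed: str, mutants: list[str]) -> list[str]:
    """Return the mutants on which the analyzer disagrees with its verdict on the seed."""
    expected = analyzer(seed)
    return [mutant for mutant in mutants if analyzer(mutant) != expected]


seed = "def f(x):\n    return x == None\n"
mutants = [
    "def f(x):\n    return x  ==  None\n",  # extra whitespace, same semantics
    "def f(x):\n    return None == x\n",    # operands swapped, same semantics
]

inconsistent = metamorphic_check(toy_analyzer, seed, mutants)
print(len(inconsistent), "inconsistent mutant(s)")
```

Here the swapped-operand mutant exposes a gap in the toy rule: the seed is flagged but the equivalent mutant is not, which is the kind of inconsistency the abstract describes as pointing to a flawed rule implementation.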
Article 117
Title@2025-07-20 (7): Learning Software Bug Reports: A Systematic Literature Review
Title: Learning Software Bug Reports: A Systematic Literature Review | Lernsoftware Bug Reports: Ein systematischer Literaturbericht | 学习软件错误报告:系统文献审查 2507.04422v2 |
Authors (4): Guoming Long, Jingzhi Gong, Hui Fang, Tao Chen
The recent advancement of artificial intelligence, especially machine learning (ML), has significantly impacted software engineering research, including bug report analysis. ML aims to automate the understanding, extraction, and correlation of information from bug reports. Despite its growing importance, there has been no comprehensive review in this area. In this paper, we present a systematic literature review covering 1,825 papers, selecting 204 for detailed analysis. We derive the following key findings: 1) Extensive use of CNN, LSTM, and kNN for bug report analysis, with advanced models like BERT underutilized due to their complexity. 2) Word2Vec and TF-IDF are popular for feature representation, with a rise in deep learning approaches. 3) Stop word removal is the most common preprocessing step, with structural methods rising after 2020. 4) Eclipse and Mozilla are the most frequently evaluated software projects. 5) Bug categorization is the most common task, followed by bug localization and severity prediction. 6) There is increasing attention on specific bugs like non-functional and performance bugs. 7) Common evaluation metrics are F1-score, Recall, Precision, and Accuracy, with k-fold cross-validation preferred for model evaluation. 8) Many studies lack robust statistical tests. We also identify six promising future research directions to provide useful insights for practitioners.
最近人工智能的进步,特别是机器学习(ML),已经对软件工程研究,包括错误报告分析产生了重大影响。ML的目标是使错误报告信息的理解、提取和相关性自动化。尽管其重要性日益增加,但在这一领域没有进行全面审查。在本文中,我们提出了涵盖1,825份文件的系统文献审查,选择204份文件进行详细分析。我们得出了7项主要结论:(1) 广泛使用CNN、LSTM和$k$NNN,用于错误报告分析,而像BERT这样的先进模型由于复杂性而没有得到充分利用。(2) Word2Vec和TF-IDF对地貌代表很受欢迎,其深层学习方法不断提高。(3) 停止删除字是最常见的预处理方法,2020年后结构方法不断上升。(4) Eclipse和Mozilla是最经常被评估的软件项目。(5) 错误分类是最常见的任务,其次是错误本地化和严重性预测。(6) 诸如非功能性和性错误等先进模型越来越受到注意。(7) 共同评价指标是F1-记号, Recall,Recall, Precisionionionionionion, 和Acurealalalalisalisalisalalalalalalalalalal ——我们更喜欢地研究为未来方向提供。
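The evaluation setup named in finding 7 above (Precision, Recall, F1-score, and Accuracy under k-fold cross-validation) can be made concrete with a small dependency-free sketch. The function names and the toy labels are assumptions for illustration; the reviewed studies would normally use a library implementation and often shuffled or stratified folds.

```python
def binary_metrics(y_true, y_pred):
    """Precision, recall, F1 and accuracy for binary labels (1 = positive class)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


# Toy labels, e.g. "is this bug report a duplicate?" (illustrative data only).
y_true = [1, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 1]
metrics = binary_metrics(y_true, y_pred)
folds = list(k_fold_indices(len(y_true), 2))
print(metrics)
print(folds)
```

In a full evaluation, a model would be trained on each `train` split, `binary_metrics` would be computed on the matching `test` split, and the per-fold scores would be averaged.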
Article 118
Title@2025-07-20 (7): Flexible Process Variant Binding in Information Systems with Software Product Line Engineering
Title: Flexible Process Variant Binding in Information Systems with Software Product Line Engineering | Flexible Prozessvariantbindung in Informationssystemen mit Software Product Line Engineering | 具有软件产品线工程的信息系统装订 2410.17689v2 |
Authors (2): Philipp Hehnle, Manfred Reichert
Different organisations often run similar digitised business processes to achieve their business goals. However, organisations often need to slightly adapt the business processes implemented in an information system in order to adopt them. Various approaches have been proposed to manage variants in process models. While these approaches mainly deal with control flow variability, in previous work we introduced an approach to manage implementation variants of digitised business processes. In this context, Software Product Line (SPL) Engineering was applied to manage a set of common core artefacts, including a process model, from which Process-Aware Information Systems (PAIS) can be derived that differ in the implementation of their process activities. When deriving a PAIS, implementations are selected for each process activity and then included in the PAIS at compilation time. One challenge that has not yet been solved is giving users of digitised business processes the option of selecting multiple implementations at runtime. This paper extends our previous work by allowing activity implementations to be selected not only at compile time, but also at start time and runtime. Consequently, it becomes possible to defer the decision as to which implementation should be selected to start time and runtime. Furthermore, multiple implementations of a particular activity may be selected and executed concurrently. The presented approach also allows customising the input and output data of activities. Data from expert interviews with German municipalities suggest that digitising business processes with varying implementations is a widespread challenge and that our approach is a way to mitigate it.
不同组织往往使用类似的数字化业务流程来实现业务目标,然而,各组织往往需要略微调整信息系统中实施的业务流程,以便采用这些流程。提出了各种办法以管理流程模式中的变式。虽然这些办法主要涉及控制流程的变异性,但在以往的工作中,我们引入了管理数字化业务流程实施变式的方法。在此背景下,软件产品行(SPL)工程用于管理一套共同的核心工艺,包括一个可生成流程信息系统(PAIS)的流程模型,该模型可在实施其流程活动中产生差异。在生成一个PAIS时,为每个流程活动选择实施,然后在编集时将其纳入PAIS。一个尚未解决的挑战是让数字化业务流程的用户选择在运行时选择多个实施模式。本文扩展了我们以前的工作,不仅允许在汇编时选择活动实施,而且还在启动时和运行时选择。因此,有可能推迟决定选择实施的时间和运行时间。此外,多个进程的实施过程被选定,然后纳入PAIS系统。一个尚未解决的挑战是让数字化业务流程的用户选择,同时选择一个数据访问,同时选择一个数据访问,并显示一个定制的流程。
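The idea of deferring the choice of an activity implementation to start time or runtime, and of running several selected implementations concurrently, can be sketched as follows. This is an illustration under stated assumptions and not the authors' SPL-based mechanism: the `Activity` class, the `bind` and `execute` methods, and the `send_email` / `send_letter` implementations are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


class Activity:
    """Hypothetical process activity with several interchangeable implementations."""

    def __init__(self, name, implementations):
        self.name = name
        self.implementations = dict(implementations)
        self.bound = None  # no binding yet: the decision is deferred

    def _check(self, names):
        unknown = [n for n in names if n not in self.implementations]
        if unknown:
            raise KeyError(f"unknown implementation(s) for {self.name}: {unknown}")

    def bind(self, names):
        """Bind implementations, e.g. from configuration read at start time."""
        self._check(names)
        self.bound = list(names)

    def execute(self, data, choose=None):
        """Run the implementations chosen at runtime, or else the bound ones."""
        names = list(choose) if choose else self.bound
        if not names:
            raise RuntimeError(f"no implementation selected for {self.name}")
        self._check(names)
        # Several selected implementations run concurrently.
        with ThreadPoolExecutor() as pool:
            futures = {n: pool.submit(self.implementations[n], data) for n in names}
            return {n: f.result() for n, f in futures.items()}


def send_email(data):
    return f"email to {data['citizen']}"


def send_letter(data):
    return f"letter to {data['citizen']}"


notify = Activity("notify", {"email": send_email, "letter": send_letter})
notify.bind(["email"])                           # start-time binding
start_result = notify.execute({"citizen": "Ada"})
runtime_result = notify.execute({"citizen": "Ada"}, choose=["email", "letter"])
print(start_result)
print(runtime_result)
```

Compile-time binding would correspond to shipping only one entry in `implementations`; the sketch covers only the two later binding times.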
Article 119
Title@2025-07-20 (7): Towards Extracting Software Requirements from App Reviews using Seq2seq Framework
Title: Towards Extracting Software Requirements from App Reviews using Seq2seq Framework | Auf dem Weg zur Extraktion von Software-Anforderungen aus App-Bewertungen mit Seq2seq Framework | 争取利用Seq2seq 框架从应用审查中提取软件要求 2507.09039v2 |
Authors (2): Aakash Sorathiya, Gouri Ginde
Mobile app reviews are a large-scale data source for software improvements. A key task in this context is effectively extracting requirements from app reviews to analyze the users’ needs and support the software’s evolution. Recent studies show that existing methods fail at this task since app reviews usually contain informal language, grammatical and spelling errors, and a large amount of irrelevant information that might not have direct practical value for developers. To address this, we propose a novel reformulation of requirements extraction as a Named Entity Recognition (NER) task based on the sequence-to-sequence (Seq2seq) generation approach. With this aim, we propose a Seq2seq framework, incorporating a BiLSTM encoder and an LSTM decoder, enhanced with a self-attention mechanism, GloVe embeddings, and a CRF model. We evaluated our framework on two datasets: a manually annotated set of 1,000 reviews (Dataset 1) and a crowdsourced set of 23,816 reviews (Dataset 2). The quantitative evaluation of our framework showed that it outperformed existing state-of-the-art methods with an F1 score of 0.96 on Dataset 2, and achieved comparable performance on Dataset 1 with an F1 score of 0.47.
移动应用程序审查是软件改进的大规模数据源。这方面的一项关键任务是有效地从应用审查中提取需求要求,以分析用户的需求并支持软件的演变。最近的研究显示,由于应用审查通常包含非正式语言、语法和拼写错误,以及大量可能对开发者没有直接实际价值的不相干信息,现有方法未能完成这项任务。为此,我们提议根据顺序到顺序(Seq2seq)生成方法,将需求提取新改为命名实体识别(NER)任务。为此,我们提议了一个Seq2seq框架,包括一个BisLSTM编码器和一个LSTM解码器,通过一个自我注意机制、GloVe嵌入和通用报告格式模型加以强化。我们评估了我们关于两个数据集的框架:一组人工附加说明的1 000项审查(数据集1)和一组群集的23 816项审查(数据集2)。我们框架的定量评价显示,它超越了现有的Seq2级标准,即F1分数为0.16分的可比较性数据。
Article 120
Title@2025-07-20 (7): SAGE: A Context-Aware Approach for Mining Privacy Requirements Relevant Reviews from Mental Health Apps
Title: SAGE: A Context-Aware Approach for Mining Privacy Requirements Relevant Reviews from Mental Health Apps | SAGE: A Context-Aware Approach for Mining Privacy Relevant Reviews from Mental Health Apps | SAGE: “ 采矿隐私要求 “ 的背景意识方法,来自心理健康应用软件的相关审查 2507.09051v2 |
Authors (2): Aakash Sorathiya, Gouri Ginde
Mental health (MH) apps often require sensitive user data to customize services for mental wellness needs. However, such data collection practices in some MH apps raise significant privacy concerns for users. These concerns are often mentioned in app reviews, but other feedback categories, such as reliability and usability, tend to take precedence. This poses a significant challenge in automatically identifying privacy requirements-relevant reviews (privacy reviews) that can be utilized to extract privacy requirements and address users’ privacy concerns. Thus, this study introduces SAGE, a context-aware approach to automatically mining privacy reviews from MH apps using Natural Language Inference (NLI) with MH domain-specific privacy hypotheses (provides domain-specific context awareness) and a GPT model (eliminates the need for fine-tuning). The quantitative evaluation of SAGE on a dataset of 204K app reviews achieved an F1 score of 0.85 without any fine-tuning, outperforming the fine-tuned baseline classifiers BERT and T5. Furthermore, SAGE extracted 748 privacy reviews previously overlooked by keyword-based methods, demonstrating its effectiveness through qualitative evaluation. These reviews can later be refined into actionable privacy requirement artifacts.
心理健康(MH)应用软件往往需要敏感的用户数据来定制满足心理健康需要的服务,然而,某些MH应用软件的这类数据收集做法引起了用户对隐私的重大关切,这些关切在应用审查中经常提及,但其他反馈类别,如可靠性和可用性,往往居于优先地位,这对自动确定隐私要求相关审查(隐私审查)(隐私审查)构成重大挑战,这些审查可用于提取隐私要求和解决用户对隐私的关切。因此,本研究报告引入了SAGE, 这是一种符合背景的自动挖掘隐私审查的方法,即使用自然语言推断(NLI)的MH应用软件进行自动挖掘隐私审查,使用MH特定域的隐私假设(提供特定领域背景认识)和全球专利保护网络模型(消除微调的必要性),对204K应用审查数据集的SGEGE进行了定量评价,实现了0.85的F1分,但没有作任何微调,超过了经过微调的基线分类标准BERT和T5.此外,SAGEGE提取了748项隐私审查,这通过定性评估证明了其有效性。
Article 121
Title@2025-07-20 (7): CMER: A Context-Aware Approach for Mining Ethical Concern-related App Reviews
Title: CMER: A Context-Aware Approach for Mining Ethical Concern-related App Reviews | CMER: A Context-aware approach for Mining Ethical Concern-related App Reviews | CMER: 采矿道德关切相关上诉审查的背景意识方法 2507.09049v2 |
Authors (2): Aakash Sorathiya, Gouri Ginde
With the increasing proliferation of mobile applications in our daily lives, the concerns surrounding ethics have surged significantly. Users communicate their feedback in app reviews, frequently emphasizing ethical concerns, such as privacy and security. Incorporating these reviews has proved to be useful for many areas of software engineering (e.g., requirement engineering, testing, etc.). However, app reviews related to ethical concerns generally use domain-specific language and are typically overshadowed by more generic categories of user feedback, such as app reliability and usability, which makes automated extraction a challenging and time-consuming effort. This study proposes CMER (A Context-Aware Approach for Mining Ethical Concern-related App Reviews), a novel approach that combines Natural Language Inference (NLI) and a decoder-only (LLaMA-like) Large Language Model (LLM) to extract ethical concern-related app reviews at scale. In CMER, NLI provides domain-specific context awareness by using domain-specific hypotheses, and the Llama-like LLM eliminates the need for labeled data in the classification task. We evaluated the validity of CMER by mining privacy and security-related reviews (PSRs) from a dataset of more than 382K app reviews of mobile investment apps. First, we evaluated four NLI models and compared the results of domain-specific hypotheses with generic hypotheses. Next, we evaluated three LLMs for the classification task. Finally, we combined the best NLI and LLM models (CMER) and extracted 2,178 additional PSRs overlooked by the previous study using a keyword-based approach, thus demonstrating the effectiveness of CMER. These reviews can be further refined into actionable requirement artifacts.
随着移动应用程序在我们日常生活中日益扩散,人们对伦理的担忧大增。用户在应用程序审查中传达他们的反馈,经常强调隐私和安全等道德问题。纳入这些审查证明对软件工程的许多领域(例如要求工程、测试等)是有用的。然而,与伦理问题有关的应用审查通常使用特定领域的语言,通常被更通用的用户反馈类别(例如应用程序可靠性和使用性)所掩盖。因此,自动提取是一项具有挑战性和耗时性的工作。本研究报告提议CMER(A\ underline{C}Intext-Award Award Agine 方法,用于域内线{M}M}M}ining underline{E}道德关切相关App\underline{Cunderline{R}eviews。 但是,将自然语言导力(NLIMA)和仅(类似LLIMA的)大语言模型(LM)相结合,用于在规模上提取与道德关切相关的应用的应用程序。在C-CMLIM上,通过域特定假设提供针对域域域的域内局的域认识认识认识,因此通过LLIM(我们LIM)对LIM进行最新数据分析,我们最新的数据评估,并用最新数据分析,可以进一步评估。
Article 122
Title@2025-07-20 (7): Enhancing Repository-Level Code Generation with Call Chain-Aware Multi-View Context
Title: Enhancing Repository-Level Code Generation with Call Chain-Aware Multi-View Context | Erweiterung der Repository-Level-Code-Generierung mit Call Chain-Aware-Multi-View-Kontext | 加强存储器级代码生成,具有呼叫链-软件多视图背景 2507.14791v1 |
Authors (9): Yang Liu, Li Zhang, Fang Liu, Zhuohang Wang, Donglin Wei, Zhishuo Yang, Kechi Zhang, Jia Li, Lin Shi
Repository-level code generation aims to generate code within the context of a specified repository. Existing approaches typically employ retrieval-augmented generation (RAG) techniques to provide LLMs with relevant contextual information extracted from the repository. However, these approaches often struggle to identify the truly relevant contexts that capture the rich semantics of the repository, and their contextual perspectives remain narrow. Moreover, most approaches fail to account for the structural relationships in the retrieved code during prompt construction, hindering the LLM's ability to accurately interpret the context. To address these issues, we propose RepoScope, which leverages call chain-aware multi-view context for repository-level code generation. RepoScope constructs a Repository Structural Semantic Graph (RSSG) and retrieves a comprehensive four-view context, integrating both structural and similarity-based contexts. We propose a novel call chain prediction method that utilizes the repository's structural semantics to improve the identification of callees in the target function. Additionally, we present a structure-preserving serialization algorithm for prompt construction, ensuring the coherence of the context for the LLM. Notably, RepoScope relies solely on static analysis, eliminating the need for additional training or multiple LLM queries, thus ensuring both efficiency and generalizability. Evaluation on widely used repository-level code generation benchmarks (CoderEval and DevEval) demonstrates that RepoScope outperforms state-of-the-art methods, achieving up to a 36.35% relative improvement in pass@1 scores. Further experiments emphasize RepoScope's potential to improve code generation across different tasks and its ability to integrate effectively with existing approaches.
存储层代码生成的目的是在指定的存储处背景下生成代码。 现有方法通常使用检索- 强化生成( RAG) 技术, 向 LLMs提供从存储处提取的相关背景信息。 然而, 这些方法往往要努力有效地识别真正相关的背景, 捕捉存储处丰富的语义, 其背景视角仍然狭窄。 此外, 大多数方法在快速构建过程中无法说明检索代码的结构关系, 妨碍了 LLLM 准确解释上下文的能力。 为了解决这些问题, 我们提议 RepoScope, 利用链级生成技术为存储点生成调用链- 有识多视图背景的多视图环境。 重建Scop 构建一个存储层结构结构精度结构精度图( RSSG) , 并检索一个综合的四视图背景环境, 整合基于结构和类似背景的背景。 我们提出一个新的呼叫链级预测方法, 利用存储处的结构性语义来改进目标功能中受访者的识别能力。 此外, 我们提出一个为快速构建而保留序列化的算法, 确保存储室结构环境的一致性和多级数据库的能力分析, 因此, RS- 需要实现全局性分析, 实现常规和多级的升级分析, 基础, 基础评估, 进一步的升级分析, 基础, 以实现常规分析, 基础, 基础, 以仅级分析, 基础, 基础, 以 实现常规分析, 基础, 基础, 基础, 基础, 基础, 基础, 实现, 基础, 提高, 基础, 提高。
Article 123
Title@2025-07-20 (7): Dr. Boot: Bootstrapping Program Synthesis Language Models to Perform Repairing
Title: Dr. Boot: Bootstrapping Program Synthesis Language Models to Perform Repairing | Dr. Boot: Bootstrapping-Programm Synthese von Sprachmodellen zur Reparatur | Boot博士:实施修复的强化方案综合语言模型 2507.15889v1 |
Authors (1): Noah van der Vleuten
Language models for program synthesis are usually trained and evaluated on programming competition datasets (MBPP, APPS). However, these datasets are limited in size and quality, while these language models are extremely data hungry. Additionally, the language models have a misaligned program synthesis process compared to humans. While humans iteratively develop code with the help of a compiler, most program synthesis models currently produce code in one go. To solve these issues, we introduce a bootstrapping algorithm for program synthesis that supports teaching models how to repair. We show that bootstrapping consistently outperforms regular fine-tuning. Compared to other work, our bootstrapped model performs on par with fine-tuned models that are 68% larger. Notably, bootstrapping with repairing also improves non-repairing performance compared to regular bootstrapping during inference. However, on our models, repairing during inference is likely inferior to simply sampling the same number of solutions. Furthermore, we find issues with the example test cases in the training portion of the APPS dataset; these findings are valuable to the community, as many repairing and reinforcement learning methods rely on these test cases.
用于方案合成的语言模型通常在编程竞争数据集(MBPP、APPS)方面经过培训和评价。然而,这些数据集在规模和质量上都很有限,而这些语言模型则极为缺乏数据。此外,语言模型与人类相比,程序合成过程不协调。虽然人类在编译者的帮助下迭代地开发代码,但大多数程序合成模型目前都生成代码。为了解决这些问题,我们引入了一种用于方案合成的靴式算法,支持教学模型如何修复。我们发现,与其它工作相比,靴式模型在规模和质量上都一直优于常规微调。与其他工作相比,我们的靴式模型与68+++的精调模型相匹配。值得注意的是,修复的靴式还改善了非重复性性性能,而与正常的推导过程相比。然而,根据我们的模型,在推断过程中的修复可能不如仅仅抽样相同数量的解决方案。此外,我们发现,在APS数据集的培训部分中存在实例测试案例问题,这些案例对于社区来说是有价值的,因为许多修复和强化学习方法依靠这些方法。
Article 124
Title@2025-07-20 (7): MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation
Title: MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation | MultiKernelBench: Ein Multi-Platform Benchmark für die Kernel-Generation | 多KenneelBench: 核心生成的多平台基准 2507.17773v1 |
Authors (6): Zhongzhen Wen, Yinghui Zhang, Zhong Li, Zhongxin Liu, Linna Xie, Tian Zhang
The automatic generation of deep learning (DL) kernels using large language models (LLMs) has emerged as a promising approach to reduce the manual effort and hardware-specific expertise required for writing high-performance operator implementations. However, existing benchmarks for evaluating LLMs in this domain suffer from limited hardware support, coarse-grained kernel categorization, and imbalanced task coverage. To address these limitations, we introduce MultiKernelBench, the first comprehensive, multi-platform benchmark for LLM-based DL kernel generation. MultiKernelBench spans 285 tasks across 14 well-defined kernel categories and supports three major hardware platforms: Nvidia GPUs, Huawei NPUs, and Google TPUs. To enable future extensibility, we design a modular backend abstraction layer that decouples platform-specific logic from the core benchmarking infrastructure, allowing easy integration of new hardware platforms. We further propose a simple yet effective category-aware one-shot prompting method that improves generation quality by providing in-category exemplars. Through systematic evaluations of seven state-of-the-art LLMs, we reveal significant variation in task difficulty, poor generalization to platforms with less training exposure, and the effectiveness of targeted prompting strategies. MultiKernelBench is publicly available at https://github.com/wzzll123/MultiKernelBench.
利用大型语言模型(LLMs)自动生成深层次学习(DL)核心的自动生成(DL)核心,这已成为一种很有希望的方法,可以减少手工努力和硬件专长,以编写高性能操作者执行工作所需的高性能操作软件。然而,目前对这一领域中LLM的评估基准缺乏硬件支持、粗微的内核分类和任务覆盖不平衡。为克服这些限制,我们引入了多环邦奇,这是以LLLM为主的DLL内核生成的第一个全面、多平台基准。多环贝尼奇跨越了14个明确界定的内核类别的285项任务,支持了三大硬件平台:Nvidia GPUs、Huawei NPUs和Google TPUs。为了能够在未来的扩展性,我们设计了一个模块后端抽象层,将平台特定平台的逻辑与核心基准基础设施脱钩,便于新硬件平台的整合。我们进一步提出一个简单而有效的类别认知/125点快速提示方法,通过提供分类外壳类的外壳,提高代质量质量。通过对7个州GLLM号的公开平台进行系统的系统评估,在公共风险上进行显著的变换。
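As a rough illustration of the category-aware one-shot prompting idea mentioned in the MultiKernelBench abstract above, the sketch below picks an exemplar from the same kernel category as the new task and places it in the prompt. The exemplar pool, the category names, the prompt wording, and the function `build_prompt` are all hypothetical and are not taken from the benchmark's code.

```python
from typing import Dict, List

# Hypothetical exemplar pool: category -> one example task and its kernel.
# The kernel bodies are placeholders, not real kernels.
EXEMPLARS: Dict[str, Dict[str, str]] = {
    "activation": {
        "task": "ReLU over a 1-D tensor",
        "kernel": "// relu kernel source goes here",
    },
    "reduction": {
        "task": "Sum over the last axis",
        "kernel": "// sum kernel source goes here",
    },
}


def build_prompt(task: str, category: str, platform: str) -> str:
    """Build a one-shot prompt whose exemplar comes from the task's own category."""
    parts: List[str] = [f"Target platform: {platform}"]
    exemplar = EXEMPLARS.get(category)
    if exemplar is not None:
        # Only an in-category exemplar is shown; other categories are ignored.
        parts.append(f"Example task ({category}): {exemplar['task']}")
        parts.append("Example kernel:")
        parts.append(exemplar["kernel"])
    parts.append(f"New task ({category}): {task}")
    parts.append("Write the kernel:")
    return "\n".join(parts)


prompt = build_prompt("Max over the last axis", "reduction", "Nvidia GPU")
print(prompt)
```

If a category has no exemplar, this sketch falls back to a zero-shot prompt; whether the benchmark does the same is not stated in the abstract.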
Article 125
Title@2025-07-20 (7): VeriOpt: PPA-Aware High-Quality Verilog Generation via Multi-Role LLMs
Title: VeriOpt: PPA-Aware High-Quality Verilog Generation via Multi-Role LLMs | VeriOpt: PPA-Aware Hochqualitative Verilog Generation über Multi-Rolle LLMs | VeriOpt: 通过多功能LLMs 生成PPA-Aware-Aware-高品质高性活性 2507.14776v1 |
Authors (4): Kimia Tasnia, Alexander Garcia, Tasnuva Farheen, Sazadur Rahman
The rapid adoption of large language models (LLMs) in hardware design has primarily focused on generating functionally correct Verilog code, overlooking critical Power-Performance-Area (PPA) metrics essential for industrial-grade designs. To bridge this gap, we propose VeriOpt, a novel framework that leverages role-based prompting and PPA-aware optimization to enable LLMs to produce high-quality, synthesizable Verilog. VeriOpt structures LLM interactions into specialized roles (e.g., Planner, Programmer, Reviewer, Evaluator) to emulate human design workflows, while integrating PPA constraints directly into the prompting pipeline. By combining multi-modal feedback (e.g., synthesis reports, timing diagrams) with PPA-aware prompting, VeriOpt achieves PPA-efficient code generation without sacrificing functional correctness. Experimental results demonstrate up to 88% reduction in power, 76% reduction in area, and 73% improvement in timing closure compared to baseline LLM-generated RTL, validated using industry-standard EDA tools. At the same time, VeriOpt achieves an 86% success rate in functionality evaluation. Our work advances state-of-the-art AI-driven hardware design by addressing the critical gap between correctness and quality, paving the way for reliable LLM adoption in production workflows.
在硬件设计中迅速采用大型语言模型(LLMs),主要侧重于生成功能正确的Verilog码,忽略关键的电力性能-地区(PPA)衡量标准,这是工业级设计所必需的。为了缩小这一差距,我们提议VeriOpt,这是一个利用基于作用的推动和PPA-aware优化的新框架,使LMs能够产生高质量、合成的Verilog。VeriOpt结构LLM互动成为专门角色(例如,Planner、程序员、审查员、评价员),以模仿人类设计工作流程,同时将PPPA的制约因素直接纳入快速管道。通过多模式反馈(例如,综合报告、时间图)与PPPA的快速提示相结合,VeriOpt在不牺牲功能正确性的情况下实现PPA节能的代码生成。实验结果显示,与基线LM生成的RTL(RTL)相比,权力削减了88%,时间缩短了73%,同时利用工业标准 EDA工具将PA限制直接纳入快速管道中。与此同时,通过可靠的方式解决了稳定性质量设计进展进展。
Article 126
Title@2025-07-19 (6): Toward Inclusive AI-Driven Development: Exploring Gender Differences in Code Generation Tool Interactions
Title: Toward Inclusive AI-Driven Development: Exploring Gender Differences in Code Generation Tool Interactions | Auf dem Weg zu integrativer KI-getriebener Entwicklung: Erforschung geschlechtsspezifischer Unterschiede bei Interaktionen mit Codegenerierungstools | 走向包容性的AI-Driven 发展:探索代码生成工具互动中的性别差异 2507.14770v1 |
Authors (4): Manaal Basha, Ivan Beschastnikh, Gema Rodriguez-Perez, Cleidson R. B. de Souza
Context: The increasing reliance on Code Generation Tools (CGTs), such as Windsurf and GitHub Copilot, is revamping programming workflows and raising critical questions about fairness and inclusivity. While CGTs offer potential productivity enhancements, their effectiveness across diverse user groups has not been sufficiently investigated. Objectives: We hypothesize that developers’ interactions with CGTs vary based on gender, influencing task outcomes and cognitive load, as prior research suggests that gender differences can affect technology use and cognitive processing. Methods: The study will employ a mixed-subjects design with 54 participants, evenly divided by gender for a counterbalanced design. Participants will complete two programming tasks (medium to hard difficulty) with only CGT assistance and then with only internet access. Task orders and conditions will be counterbalanced to mitigate order effects. Data collection will include cognitive load surveys, screen recordings, and task performance metrics such as completion time, code correctness, and CGT interaction behaviors. Statistical analyses will be conducted to identify statistically significant differences in CGT usage. Expected Contributions: Our work can uncover gender differences in CGT interaction and performance among developers. Our findings can inform future CGT designs and help address usability and potential disparities in interaction patterns across diverse user groups. Conclusion: While results are not yet available, our proposal lays the groundwork for advancing fairness, accountability, transparency, and ethics (FATE) in CGT design. The outcomes are anticipated to contribute to inclusive AI practices and equitable tool development for all users.
目标:我们假设,开发商与CGT的相互作用会因性别而异,影响任务成果和认知负荷,因为先前的研究显示,性别差异会影响技术的使用和认知处理。方法:研究将采用由54名参与者组成的混合主题设计,在平衡设计时,均衡地按性别划分。参与者将完成两项方案拟定任务(以难到难到难),只有CGT协助,然后只有互联网接入。任务订单和条件将抵消减少订单效应。数据收集将包括认知工作量调查、筛选记录和任务业绩衡量标准,如完成时间、代码正确性、CGT互动行为等。将进行统计分析,以查明CGT使用的统计性重大差异。 预测:我们的工作可以发现CGT互动和开发商业绩方面的性别差异,但只有CGT援助,然后只有互联网接入。我们的调查结果可以抵消订单的效果。 CGT设计和任务业绩指标将帮助我们实现可持续性。
Article 127
Title@2025-07-19 (6): Investigating the Role of LLMs Hyperparameter Tuning and Prompt Engineering to Support Domain Modeling
Title: Investigating the Role of LLMs Hyperparameter Tuning and Prompt Engineering to Support Domain Modeling | Untersuchung der Rolle von LLMs Hyperparameter Tuning und Prompt Engineering zur Unterstützung von Domain Modeling | 调查超参数图图和快速工程LLMs 的作用以支持域建模 2507.14735v1 |
Authors (5): Vladyslav Bulhakov, Giordano d’Aloisio, Claudio Di Sipio, Antinisca Di Marco, Davide Di Ruscio
The introduction of large language models (LLMs) has enhanced automation in software engineering tasks, including in Model Driven Engineering (MDE). However, using general-purpose LLMs for domain modeling has its limitations. One approach is to adopt fine-tuned models, but this requires significant computational resources and can lead to issues like catastrophic forgetting. This paper explores how hyperparameter tuning and prompt engineering can improve the accuracy of the Llama 3.1 model for generating domain models from textual descriptions. We use search-based methods to tune hyperparameters for a specific medical data model, resulting in a notable quality improvement over the baseline LLM. We then test the optimized hyperparameters across ten diverse application domains. While the solutions were not universally applicable, we demonstrate that combining hyperparameter tuning with prompt engineering can enhance results across nearly all examined domain models.
采用大型语言模型(LLMS)提高了软件工程任务自动化,包括模型驱动工程(MDE)中的自动化。然而,使用通用LMS进行域建模有其局限性。一种方法是采用微调模型,但这需要大量的计算资源,并可能导致灾难性的遗忘等问题。本文探讨了超分计调和即时工程如何提高Llama 3.1模型的准确性,以便从文字描述中生成域模型。我们使用基于搜索的方法调整特定医学数据模型的超参数,从而显著改进基线LLM的质量。我们然后在十个不同的应用领域测试优化的超参数。虽然这些解决方案并非普遍适用,但我们证明将超分计调与快速工程相结合可以提高几乎所有已检查的域模型的成果。
Article 128
Title@2025-07-19 (6): Foundational Competencies and Responsibilities of a Research Software Engineer: Current State and Suggestions for Future Directions
Title: Foundational Competencies and Responsibilities of a Research Software Engineer: Current State and Suggestions for Future Directions | Grundlagenkompetenzen und Verantwortlichkeiten eines Forschungssoftware-Ingenieurs: Aktueller Stand und Vorschläge für zukünftige Richtungen | 研究软件工程师的基本能力和责任:现状和对未来方向的建议 2311.11457v4 |
Authors (23): Florian Goth, Renato Alves, Matthias Braun, Leyla Jael Castro, Gerasimos Chourdakis, Simon Christ, Jeremy Cohen, Stephan Druskat, Fredo Erxleben, Jean-Noël Grad, Magnus Hagdorn, Toby Hodges, Guido Juckeland, Dominic Kempf, Anna-Lena Lamprecht, Jan Linxweiler, Frank Löffler, Michele Martone, Moritz Schwarzmeier, Heidi Seibold, Jan Philipp Thiele, Harald von Waldow, Samantha Wittke
The term Research Software Engineer, or RSE, emerged a little over 10 years ago as a way to represent individuals working in the research community but focusing on software development. The term has been widely adopted and there are a number of high-level definitions of what an RSE is. However, the roles of RSEs vary depending on the institutional context they work in. At one end of the spectrum, RSE roles may look similar to a traditional research role. At the other extreme, they resemble that of a software engineer in industry. Most RSE roles inhabit the space between these two extremes. Therefore, providing a straightforward, comprehensive definition of what an RSE does and what experience, skills and competencies are required to become one is challenging. In this community paper we define the broad notion of what an RSE is, explore the different types of work they undertake, and define a list of fundamental competencies as well as values that define the general profile of an RSE. On this basis, we elaborate on the progression of these skills along different dimensions, looking at specific types of RSE roles, proposing recommendations for organisations, and giving examples of future specialisations. An appendix details how existing curricula fit into this framework.
研究软件工程师,即研究软件工程师,10多年前才出现,作为代表在研究界工作但侧重于软件开发的个人的一种方式,该词已被广泛采用,并且对研究SE是什么有一系列高级别定义。然而,研究SE的作用因机构背景不同而不同。在一端,研究SE的作用可能类似于传统研究作用。在另一极端,它们类似于软件工程师在工业中的角色。大多数研究SE的作用都存在于这两个极端之间。因此,对研究SE的工作以及成为研究SE所需要的经验、技能和能力提供直截了当的全面定义是具有挑战性的。在本社区文件中,我们界定了研究SE是什么的广泛概念,探讨它们从事的不同类型工作,并界定了基本能力清单以及界定研究SE总体特征的价值。在此基础上,我们详细介绍了这些技能在不同层面的演变情况,研究了研究SE的作用的具体类型,为组织提出了建议,并举例说明了未来的专门化。附录详细说明了现有课程如何适合这一框架。
Article 129
Title@2025-07-19 (6): HistoryFinder: Advancing Method-Level Source Code History Generation with Accurate Oracles and Enhanced Algorithm
Title: HistoryFinder: Advancing Method-Level Source Code History Generation with Accurate Oracles and Enhanced Algorithm | HistoryFinder: Advancing Method-Level Source Code History Generation mit präzisen Oracles und erweitertem Algorithmus | 历史:推进方法层面的源代码,具有准确的甲骨文和强化算法的史代历史 2507.14716v1 |
Authors (4): Shahidul Islam, Ashik Aowal, Md Sharif Uddin, Shaiful Chowdhury
Reconstructing a method’s change history efficiently and accurately is critical for many software engineering tasks, including maintenance, refactoring, and comprehension. Despite the availability of method history generation tools such as CodeShovel and CodeTracker, existing evaluations of their effectiveness are limited by inaccuracies in the ground truth oracles used. In this study, we systematically construct two new oracles – the corrected CodeShovel oracle and a newly developed HistoryFinder oracle – by combining automated analysis with expert-guided manual validation. We also introduce HistoryFinder, a new method history generation tool designed to improve not only the accuracy and completeness of method change histories but also to offer competitive runtime performance. Through extensive evaluation across 400 methods from 40 open-source repositories, we show that HistoryFinder consistently outperforms CodeShovel, CodeTracker, IntelliJ, and Git-based baselines in terms of precision, recall, and F1 score. Moreover, HistoryFinder achieves competitive runtime performance, offering the lowest mean and median execution times among all the research-based tools. While Git-based tools exhibit the fastest runtimes, this efficiency comes at the cost of significantly lower precision and recall – leaving HistoryFinder as the best overall choice when both accuracy and efficiency are important. To facilitate adoption, we provide a web interface, CLI, and Java library for flexible usage.
以高效和准确的方式重建方法的变革历史,对于许多软件工程任务至关重要,包括维护、再设定和理解。尽管有方法的历史生成工具,如代码系统(CodeShovel)和代码跟踪工具(CodTracker),但目前对其有效性的评估因地面真相或触角的不准确而受到限制。在这项研究中,我们系统地构建了两个新的神器 – – 校正代码系统(CodeShovel oracle)和新开发的历史仙子 – – 将自动分析与专家指导的手工验证相结合。我们还引入了历史信息系统(HistFinder),这是一个新方法的历史生成工具,不仅旨在提高方法变化史的准确性和完整性,而且提供竞争性运行时间性业绩。通过40个开放源库库库(CloadShovel、CocreadTracker、Inteller Jj)的400种方法进行广泛的评价,我们发现历史信息系统(Git-F)在精确度、成本化、回顾和Fireal 工具的运用方面展现最佳运行效率。
Article 130
Title@2025-07-19 (6): LLM-Based Detection of Tangled Code Changes for Higher-Quality Method-Level Bug Datasets
Title: LLM-Based Detection of Tangled Code Changes for Higher-Quality Method-Level Bug Datasets | LLM-basierte Erkennung von Tangled Code-Änderungen für höherwertige Methoden-Level-Fehlerdatensätze | 以LLM为基础,检测上质量方法级臭虫数据集排列的编码变化 2505.08263v2 |
Authors (3): Md Nahidul Islam Opu, Shaowei Wang, Shaiful Chowdhury
Tangled code changes, commits that conflate unrelated modifications such as bug fixes, refactorings, and enhancements, introduce significant noise into bug datasets and adversely affect the performance of bug prediction models. Addressing this issue at a fine-grained, method-level granularity remains underexplored. This is critical to address, as recent bug prediction models, driven by practitioner demand, are increasingly focusing on finer granularity rather than traditional class- or file-level predictions. This study investigates the utility of Large Language Models (LLMs) for detecting tangled code changes by leveraging both commit messages and method-level code diffs. We formulate the problem as a binary classification task and evaluate multiple prompting strategies, including zero-shot, few-shot, and chain-of-thought prompting, using state-of-the-art proprietary LLMs such as GPT-4o and Gemini-2.0-Flash. Our results demonstrate that combining commit messages with code diffs significantly enhances model performance, with the combined few-shot and chain-of-thought prompting achieving an F1-score of 0.88. Additionally, we explore machine learning models trained on LLM-generated embeddings, where a multi-layer perceptron classifier achieves superior performance (F1-score: 0.906, MCC: 0.807). Applying our approach to 49 open-source projects improves the distributional separability of code metrics between buggy and non-buggy methods, demonstrating the promise of LLMs for method-level commit untangling and potentially contributing to improving the accuracy of future bug prediction models.
上传代码修改, 承诺将错误修正、 重新设置和增强等无关的修改混为一谈, 将重大噪音引入错误数据集, 并对错误预测模型的性能产生不利影响 。 我们将问题作为二进制分类任务处理, 并评估多种提示性战略, 包括零点、 微粒、 微粒和思索链, 使用最新的专有LMs, 如GPT-4o和Gemini-2.0- Flash等, 正在日益关注细微颗粒, 而不是传统的类级或文件级级预测 。 我们的研究结果显示, 大语言模型(LLLMS) 与代码代码(LLMS) 相结合, 以利用承诺信息与代码(LLLMSM) 的准确性能, 以及方法, 将问题设计成一个二进化的双进制代码。 我们的不进化和非进化的代码, 实现F1- MALSLILA 的高级智能 IMLSLA , 。
Article 131
Title@2025-07-19 (6): Efficient Story Point Estimation With Comparative Learning
Title: Efficient Story Point Estimation With Comparative Learning | Effiziente Story Point-Schätzung mit vergleichendem Lernen | 与比较学习相比的高效小点估计 2507.14642v1 |
Authors (4): Monoshiz Mahbub Khan, Xioayin Xi, Andrew Meneely, Zhe Yu
Story point estimation is an essential part of agile software development. Story points are unitless, project-specific effort estimates that help developers plan their sprints. Traditionally, developers estimate story points collaboratively using planning poker or other manual techniques. While the initial calibration of the estimates to each project is helpful, once a team has converged on a set of precedents, story point estimation can become tedious and labor-intensive. Machine learning can reduce this burden, but only with enough context from the historical decisions made by the project team. That is, state-of-the-art models, such as GPT2SP and FastText-SVM, only make accurate predictions (within-project) when trained on data from the same project. The goal of this work is to streamline story point estimation by evaluating a comparative learning-based framework for calibrating project-specific story point prediction models. Instead of assigning a specific story point value to every backlog item, developers are presented with pairs of items and indicate which item requires more effort. Using these comparative judgments, a machine learning model is trained to predict the story point estimates. We empirically evaluated our technique using data with 23,313 manual estimates in 16 projects. On average, the model learned from comparative judgments achieves a Spearman’s rank correlation coefficient of 0.34 between its predictions and the ground truth story points. This is similar to, if not better than, the performance of a regression model learned from the ground truth story points. Therefore, the proposed comparative learning approach is more efficient than state-of-the-art regression-based approaches according to the law of comparative judgments: providing comparative judgments yields a lower cognitive burden on humans than providing ratings or categorical labels.
故事点估算是灵活软件开发的一个基本部分。 故事点是无单位的、 特定项目的努力估计, 帮助开发者规划其冲印。 传统上, 开发者利用规划扑克或其他手工技术来共同估算故事点。 虽然最初对每个项目进行估算的校准很有帮助, 但一旦一个团队在一组先例上汇聚了起来, 故事点估算会变得乏味和劳动密集型。 机器学习可以减轻这一负担, 但只有项目团队作出的历史决定中有足够的背景才能减轻这一负担。 也就是说, 最先进的模型, 如 GPT2SP 和 FastText- SVM 等, 只有在用同一项目的数据培训时, 才会做出准确的比较性预测( 内部项目) 。 这项工作的目标是通过评估一个比较性学习框架来简化对每个项目进行估算, 故事点估计会变得乏味和劳动密集型。 开发者会用一个特定的故事点来展示一个具体的故事点, 并且指出哪些项目需要付出更多努力。 使用这些比较性判断, 一个机器学习状态模型来预测故事点的预测( 内部) , 我们用比较性地评估了比23号的比较性推算的比较性推算, 比较性推算, 比较性推算的推算, 比较性推算, 比较性推算的推算的推算的推算的推算是比较性推算是比较性推算是比较性推算, 比较性推算, 从23 比较性推算的推算, 比较性推算, 比较性推算, 比较性推算是比较性推算。
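To make the comparative-learning setup in the story point abstract above more concrete, here is a minimal sketch, assuming a linear scoring model trained with a pairwise logistic loss on "which item needs more effort" judgments, followed by a Spearman rank correlation check. The features, the toy effort values, and all function names are invented for illustration; the paper's actual model and features are not described in the abstract.

```python
import math
from itertools import combinations
from typing import List, Sequence, Tuple


def train_pairwise(items: Sequence[Sequence[float]],
                   pairs: Sequence[Tuple[int, int, bool]],
                   epochs: int = 200, lr: float = 0.1) -> List[float]:
    """Fit weights w so that score(a) > score(b) whenever a was judged harder."""
    dim = len(items[0])
    w = [0.0] * dim
    for _ in range(epochs):
        for a, b, a_is_harder in pairs:
            diff = [items[a][k] - items[b][k] for k in range(dim)]
            z = sum(w[k] * diff[k] for k in range(dim))
            p = 1.0 / (1.0 + math.exp(-z))  # predicted P(a is harder than b)
            grad = (1.0 if a_is_harder else 0.0) - p
            for k in range(dim):
                w[k] += lr * grad * diff[k]
    return w


def ranks(values: Sequence[float]) -> List[int]:
    """Rank positions starting at 1; this sketch assumes there are no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result


def spearman(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Spearman's rank correlation via the no-ties formula."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


# Toy backlog items described by two made-up numeric features.
items = [(2.0, 3.0), (1.0, 1.0), (5.0, 6.0), (2.0, 1.0), (4.0, 3.0)]
# Hidden story points, used only to simulate judgments and to evaluate.
effort = [7.0, 3.0, 16.0, 5.0, 11.0]

# Simulated developer judgments: for each pair, which item needs more effort.
pairs = [(a, b, effort[a] > effort[b])
         for a, b in combinations(range(len(items)), 2)]

w = train_pairwise(items, pairs)
scores = [sum(wk * xk for wk, xk in zip(w, item)) for item in items]
rho = spearman(scores, effort)
print(w, rho)
```

The model never sees a numeric story point during training, only pairwise labels, which is the property the abstract attributes to the approach. The toy data is deliberately easy, so the correlation here says nothing about real projects.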
Article 132
Title@2025-07-19 (6): A first look at License Variants in the PyPI Ecosystem
Title: A first look at License Variants in the PyPI Ecosystem | Ein erster Blick auf Lizenzvarianten im PyPI Ecosystem | 第一次审查PyPI生态系统的许可证变式 2507.14594v1 |
Authors (4): Weiwei Xu, Hengzhi Ye, Kai Gao, Minghui Zhou
Open-source licenses establish the legal foundation for software reuse, yet license variants, including both modified standard licenses and custom-created alternatives, introduce significant compliance complexities. Despite their prevalence and potential impact, these variants are poorly understood in modern software systems, and existing tools do not account for their existence, leading to significant challenges in both effectiveness and efficiency of license analysis. To fill this knowledge gap, we conduct a comprehensive empirical study of license variants in the PyPI ecosystem. Our findings show that textual variations in licenses are common, yet only 2% involve substantive modifications. However, these license variants lead to significant compliance issues, with 10.7% of their downstream dependencies found to be license-incompatible. Inspired by our findings, we introduce LV-Parser, a novel approach for efficient license variant analysis leveraging diff-based techniques and large language models, along with LV-Compat, an automated pipeline for detecting license incompatibilities in software dependency networks. Our evaluation demonstrates that LV-Parser achieves an accuracy of 0.936 while reducing computational costs by 30%, and LV-Compat identifies 5.2 times more incompatible packages than existing methods with a precision of 0.98. This work not only provides the first empirical study of license variants in a software packaging ecosystem but also equips developers and organizations with practical tools for navigating the complex landscape of open-source licensing.
开放源码许可证为软件再利用奠定了法律基础,但许可证变异物,包括修改的标准许可证和定制的替代物,却带来了重大的合规复杂性。尽管这些变异物的普遍性和潜在影响不甚清楚,现代软件系统对这些变异物的理解不甚清楚,现有工具也没有考虑到这些变异物的存在,从而在许可证分析的效力和效率方面造成了重大挑战。为了填补这一知识差距,我们对PyPI生态系统的许可证变异物进行了全面的经验性研究。我们的研究结果表明,许可证的文本变异很常见,但只有2%涉及实质性修改。然而,这些变异物导致重大的合规问题,其下游依赖度的10.7%被发现与许可证不兼容。我们根据我们的调查结果,引入LV-Parster,这是利用基于diff的技术和大型语言模型进行高效的许可变异异物分析的新办法。 与LV-Compat公司一道,我们对软件依赖性网络的相容性许可变体进行了自动化的管道。我们的评估表明,复杂的LV-Parkererers实现了0.936的准确度,同时将计算成本减少30%,而LV-ComCompat Compat在生态系统的版本版本中,而没有将第一种不兼容性的版本版本版本版本的版本版本的版本的版本的版本的版本的版本的版本的版本化的版本化的版本化版本化版本化的版本的版本也提供了5.8。
Article 133
Title@2025-07-19 (6): AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs?
Title: AlgoTune: Can Language Models Speed Up General-Purpose Numerical Programs? | AlgoTune: Können Sprachmodelle allgemeine numerische Programme beschleunigen? | AlgoTune: 语言模型能加速通用计算程序吗? 2507.15887v1 |
Authors (24): Ori Press, Brandon Amos, Haoyu Zhao, Yikai Wu, Samuel K. Ainsworth, Dominik Krupke, Patrick Kidger, Touqir Sajed, Bartolomeo Stellato, Jisun Park, Nathanael Bosch, Eli Meril, Albert Steppi, Arman Zharmagambetov, Fangzhao Zhang, David Perez-Pineiro, Alberto Mercurio, Ni Zhan, Talor Abramovich, Kilian Lieret, Hanlin Zhang, Shirley Huang, Matthias Bethge, Ofir Press
Despite progress in language model (LM) capabilities, evaluations have thus far focused on models’ performance on tasks that humans have previously solved, including in programming (Jimenez et al., 2024) and mathematics (Glazer et al., 2024). We therefore propose testing models’ ability to design and implement algorithms in an open-ended benchmark: We task LMs with writing code that efficiently solves computationally challenging problems in computer science, physics, and mathematics. Our AlgoTune benchmark consists of 155 coding tasks collected from domain experts and a framework for validating and timing LM-synthesized solution code, which is compared to reference implementations from popular open-source packages. In addition, we develop a baseline LM agent, AlgoTuner, and evaluate its performance across a suite of frontier models. AlgoTuner achieves an average 1.72x speedup against our reference solvers, which use libraries such as SciPy, sk-learn and CVXPY. However, we find that current models fail to discover algorithmic innovations, instead preferring surface-level optimizations. We hope that AlgoTune catalyzes the development of LM agents exhibiting creative problem solving beyond state-of-the-art human performance.
尽管在语言模型(LM)能力方面取得了进展,但迄今为止,评价侧重于模型在人类以前解决的任务方面的表现,包括编程(Jimenez等人,2024年)和数学(Glazer等人,2024年),因此,我们提议测试模型在开放式基准中设计和实施算法的能力:我们用写法来有效地解决计算机科学、物理和数学方面具有挑战性的问题。我们的AlgoTune基准包括从域专家收集的155项编码任务,以及验证和计时LM合成解决方案代码的框架,这个框架比起大众开放源包的参考实施。此外,我们开发了一个基准LM代理(AlgoTuner),并评价其跨越一系列前沿模型的性能。AlgoTuner实现了平均1.72x速度,而我们的参考解答者则使用SciPy、sklearn和CVXPY等图书馆。然而,我们发现目前的模型无法发现算法创新,而不是更倾向于地面级优化。我们希望“AlgoTuner”软件公司能够超越了“LautealTeal”号。
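The AlgoTune abstract above describes a framework that validates a candidate solution against a reference implementation and then times both. A minimal sketch of that idea follows, assuming a toy task (sum of squares), best-of-N wall-clock timing, and a speedup defined as reference time divided by candidate time. None of these names or choices come from the benchmark itself.

```python
import time
from typing import Callable, Sequence, Tuple


def reference_solver(n: int) -> int:
    """Straightforward reference: sum of squares 1^2 + ... + n^2 via a loop."""
    total = 0
    for i in range(1, n + 1):
        total += i * i
    return total


def candidate_solver(n: int) -> int:
    """Candidate under test: closed-form formula for the same quantity."""
    return n * (n + 1) * (2 * n + 1) // 6


def best_time(fn: Callable[[int], int], arg: int, repeats: int = 5) -> float:
    """Best-of-N wall-clock time for a single call, in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(arg)
        best = min(best, time.perf_counter() - start)
    return best


def evaluate(inputs: Sequence[int]) -> Tuple[bool, float]:
    """Check the candidate against the reference, then compare total runtimes."""
    is_valid = all(candidate_solver(n) == reference_solver(n) for n in inputs)
    t_ref = sum(best_time(reference_solver, n) for n in inputs)
    t_cand = sum(best_time(candidate_solver, n) for n in inputs)
    # Guard against a zero reading from a coarse timer.
    speedup = t_ref / max(t_cand, 1e-12)
    return is_valid, speedup


is_valid, speedup = evaluate([1_000, 5_000, 20_000])
print(is_valid, round(speedup, 2))
```

A speedup would only be meaningful when `is_valid` is true; how the real benchmark aggregates timings across tasks is not specified in the abstract.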
Article 134
Title@2025-07-19 (6): Harnessing LLMs for Document-Guided Fuzzing of OpenCV Library
Title: Harnessing LLMs for Document-Guided Fuzzing of OpenCV Library | LLMs für dokumentengeführtes Fuzzing der OpenCV-Bibliothek nutzen | OpenCV 库文档辅助模糊利用 LMs 2507.14558v1 |
Authors (7): Bin Duan, Tarek Mahmud, Meiru Che, Yan Yan, Naipeng Dong, Dan Dongseong Kim, Guowei Yang
The combination of computer vision and artificial intelligence is fundamentally transforming a broad spectrum of industries by enabling machines to interpret and act upon visual data with high levels of accuracy. As the biggest and by far the most popular open-source computer vision library, the OpenCV library provides an extensive suite of programming functions supporting real-time computer vision. Bugs in the OpenCV library can affect downstream computer vision applications, so it is critical to ensure the reliability of the OpenCV library. This paper introduces VISTAFUZZ, a novel technique for harnessing large language models (LLMs) for document-guided fuzzing of the OpenCV library. VISTAFUZZ utilizes LLMs to parse API documentation and obtain standardized API information. Based on this standardized information, VISTAFUZZ extracts constraints on individual input parameters and dependencies between them. Using these constraints and dependencies, VISTAFUZZ then generates new input values to systematically test each target API. We evaluate the effectiveness of VISTAFUZZ in testing 330 APIs in the OpenCV library, and the results show that VISTAFUZZ detected 17 new bugs, of which 10 have been confirmed and 5 of these have been fixed.
计算机视觉和人工智能的结合正在从根本上改变广泛的产业,使机器能够以高准确度对视觉数据进行解释和采取行动。作为最大和迄今为止最受欢迎的开放源码计算机视觉图书馆,OpenCV图书馆提供一套广泛的编程功能,支持实时计算机视觉。OpenCV图书馆的错误可以影响下游计算机视觉应用,对于确保OpenCV图书馆的可靠性至关重要。本文介绍了VISTAFUZZ,这是利用大型语言模型(LLLMS)对 OpenCV 图书馆文件制导的330 API 进行文件指导的新型技术。VISTAFUZZY利用LM 分析API 文档并获取标准化的API 信息。基于这一标准化信息, VestAFUZY 提取了个人输入参数的限制和这些参数之间的依赖性。使用这些制约和依赖性,VestCV API ,然后产生了新的输入值,系统测试每个目标API。我们评估了VenCV 图书馆测试330 APIS 的330 API 的效用。VISTAFUZZZ 和结果显示,V 5 已经检测了17个错误和错误。
Article 135
Title@2025-07-19 (6): Emerging Trends in Software Architecture from the Practitioners Perspective: A Five Year Review
Title: Emerging Trends in Software Architecture from the Practitioners Perspective: A Five Year Review | Aufkommende Trends in der Softwarearchitektur aus der Perspektive der Praktizierenden: Ein Fünf-Jahres-Bericht | 从从从业人员角度看软件架构的新趋势:五年审查 2507.14554v1 |
Authors (6): Ruoyu Su, Noman Ahmad, Matteo Esposito, Andrea Janes, Davide Taibi, Valentina Lenarduzzi
Software architecture plays a central role in the design, development, and maintenance of software systems. With the rise of cloud computing, microservices, and containers, architectural practices have diversified. Understanding these shifts is vital. This study analyzes software architecture trends across eight leading industry conferences over five years. We investigate the evolution of software architecture by analyzing talks from top practitioner conferences, focusing on the motivations and contexts driving technology adoption. We analyzed 5,677 talks from eight major industry conferences, using large language models and expert validation to extract technologies, their purposes, and usage contexts. We also explored how technologies interrelate and fit within DevOps and deployment pipelines. Among 450 technologies, Kubernetes, Cloud Native, Serverless, and Containers dominate by frequency and centrality. Practitioners present technology mainly related to deployment, communication, AI, and observability. We identify five technology communities covering automation, coordination, cloud AI, monitoring, and cloud-edge. Most technologies span multiple DevOps stages and support hybrid deployment. Our study reveals that a few core technologies, like Kubernetes and Serverless, dominate the contemporary software architecture practice. These are mainly applied in later DevOps stages, with limited focus on early phases like planning and coding. We also show how practitioners frame technologies by purpose and context, reflecting evolving industry priorities. Finally, we observe how only research can provide a more holistic lens on architectural design, quality, and evolution.
在设计、开发和维护软件系统方面,软件架构具有中心作用。随着云计算、微服务和集装箱的崛起,建筑实践也具有多样性。理解这些变化至关重要。本研究分析了五年来八次主要行业会议的软件架构趋势。我们通过分析顶级从业人员会议的谈判,调查软件架构的演变,重点是推动技术应用的动机和背景。我们分析了八次主要行业会议的5,677次谈判,使用大型语言模型和专家验证来提取技术、其用途和使用背景。我们还探讨了技术在DevOps和部署管道中的相互作用和适合性。在450种技术中,Kubernetes、云地、无服务器和集装箱以频率和中心为主。从业者介绍了主要与部署、通信、AI和可观察性有关的技术。我们确定了五个技术群体,涉及自动化、协调、云地AI、监测和云层。大多数技术跨越了多种DevOps阶段,支持混合部署。我们的研究显示,少数核心技术,如Kubernetes和服务器,主导当代软件架构实践。这些核心技术,主要应用在以后的DOps系统质量阶段和集装箱中,我们也以有限的设计重点展示了设计过程。我们如何在设计中,然后对设计结构框架进行更深入的研究。
Article 136
Title@2025-07-19 (6): Architectural Degradation: Definition, Motivations, Measurement and Remediation Approaches
Title: Architectural Degradation: Definition, Motivations, Measurement and Remediation Approaches | architektonische Degradation: Definition, Motivationen, Mess- und Sanierungsansätze | 建筑退化:定义、动力、计量和补救方法 2507.14547v1 |
Authors (6): Noman Ahmad, Ruoyu Su, Matteo Esposito, Andrea Janes, Valentina Lenarduzzi, Davide Taibi
Architectural degradation, also known as erosion, decay, or aging, impacts system quality, maintainability, and adaptability. Although widely acknowledged, current literature shows fragmented definitions, metrics, and remediation strategies. Our study aims to unify understanding of architectural degradation by identifying its definitions, causes, metrics, tools, and remediation approaches across academic and gray literature. We conducted a multivocal literature review of 108 studies extracting definitions, causes, metrics, measurement approaches, tools, and remediation strategies. We developed a taxonomy encompassing architectural, code, and process debt to explore definition evolution, methodological trends, and research gaps. Architectural degradation has shifted from a low-level issue to a socio-technical concern. Definitions now address code violations, design drift, and structural decay. Causes fall under architectural (e.g., poor documentation), code (e.g., hasty fixes), and process debt (e.g., knowledge loss). We identified 54 metrics and 31 measurement techniques, focused on smells, cohesion/coupling, and evolution. Yet, most tools detect issues but rarely support ongoing or preventive remediation. Degradation is both technical and organizational. While detection is well-studied, continuous remediation remains lacking. Our study reveals missed integration between metrics, tools, and repair logic, urging holistic, proactive strategies for sustainable architecture.
建筑退化,也称为侵蚀、腐蚀或衰老,影响系统质量、可维持性和适应性。尽管人们广泛承认,但目前的文献显示,定义、指标和补救战略支离破碎。我们的研究旨在通过在学术和灰色文献中查明其定义、原因、指标、工具和补救方法,统一对建筑退化的理解。我们对108项研究进行了多动文献审查,从中提取了定义、原因、指标、计量方法、工具和补救战略。我们开发了包括建筑、代码和债务过程在内的分类学,以探索定义演变、方法趋势和研究差距。建筑退化已经从一个低层次的问题转向社会-技术问题。现在的定义涉及违反代码、设计漂移和结构腐蚀。原因属于建筑(例如,文件不全)、代码(例如,仓促修补)和处理债务(例如,知识损失)。我们确定了54项衡量标准和31项衡量技术,侧重于嗅觉、凝聚力/组合和演进。然而,大多数工具都探测问题,但很少支持正在进行的或预防性的补救。退化是技术和组织性研究。退化是技术和组织性研究的缺陷。 研究(例如测量、持续研究、逻辑分析、持续的修复) 研究、持续的逻辑分析工具缺乏。
Article 137
Title@2025-07-19 (6): QLPro: Automated Code Vulnerability Discovery via LLM and Static Code Analysis Integration
Title: QLPro: Automated Code Vulnerability Discovery via LLM and Static Code Analysis Integration | QLPro: Automatisierte Code Vulnerability Discovery über LLM und Static Code Analysis Integration | QLPro:通过LLM和静态代码分析整合发现自动编码易脆弱性 2506.23644v3 |
Authors (8): Junze Hu, Xiangyu Jin, Yizhe Zeng, Yuling Liu, Yunpeng Li, Dan Du, Kaiyu Xie, Hongsong Zhu
We introduce QLPro, a vulnerability detection framework that systematically integrates LLMs and static analysis tools to enable comprehensive vulnerability detection across entire open-source projects. We constructed a new dataset, JavaTest, comprising 10 open-source projects from GitHub with 62 confirmed vulnerabilities. CodeQL, a state-of-the-art static analysis tool, detected only 24 of these vulnerabilities while QLPro detected 41. Furthermore, QLPro discovered 6 previously unknown vulnerabilities, 2 of which have been confirmed as 0-days.
我们引入了QLPro,这是一个脆弱性检测框架,它系统地整合了LLMs和静态分析工具,以便能够在整个开放源码项目中全面检测脆弱性。 我们建立了一个新的数据集,JavaTestor,由GitHub的10个公开源码项目组成,其中62个被确认为脆弱性。 CodeQL是一个最先进的静态分析工具,仅检测到24个,而QLPro检测到41个。 此外,QLPro发现了6个以前未知的脆弱性,其中2个被确认为0天。
Article 138
Title@2025-07-19 (6): On the Effect of Token Merging on Pre-trained Models for Code
Title: On the Effect of Token Merging on Pre-trained Models for Code | Über die Wirkung von Token Merging auf vortrainierte Modelle für Code | 托肯合并对《守则》培训前模式的影响 2507.14423v1 |
Authors (4): Mootez Saad, Hao Li, Tushar Sharma, Ahmed E. Hassan
Tokenization is a fundamental component of language models for code. It involves breaking down the input into units that are later passed to the language model stack to learn high-dimensional representations used in various contexts, from classification to generation. However, the output of these tokenizers is often longer than that traditionally used in compilers and interpreters. This could result in undesirable effects, such as increased computational overhead. In this work, we investigate the effect of merging the hidden representations of subtokens that belong to the same semantic unit, such as subtokens that form a single identifier. We propose two strategies: one based on averaging the representations and another that leverages a learning-based approach. Both methods can be seamlessly integrated with existing language models for code. We conduct experiments using six language models for code: CodeBERT, GraphCodeBERT, UniXCoder, CodeT5, CodeT5+ (220M), and CodeT5+ (770M), across three software engineering tasks: vulnerability detection, code classification, and code translation. Results show that these strategies can reduce the number of floating-point operations by 1% to 19%. Regarding downstream performance, the most significant degradation was observed in the vulnerability detection task, where the F1 score decreased by 1.82 points compared to the baseline. In contrast, for code translation, we observed an improvement of 2.47 points in CodeBLEU. This work contributes to the broader effort of improving language models for code across multiple dimensions, including both computational efficiency and downstream performance.
调制是语言代码模型的一个基本组成部分。 它涉及将输入分解成单位的维度,这些单位后来被分解为语言模式堆叠,以学习各种情况下,从分类到生成的高度表现; 然而,这些代用品的输出往往比编译者和口译员传统上使用的要长,这可能造成不良后果,如计算管理费用增加。 在这项工作中,我们调查将属于同一语义单位的子玩具的隐藏表示方式合并的影响,如组成单一识别符号的子玩具。 我们提出了两个战略:一个基于平均表示方式,另一个基于利用基于学习的方法。两种方法都可以与现有的代号语言模式完全融合。 我们使用六种语言模式的代码进行实验: codeBERT、GapCodeBERT、UXCoord、CdoeT5、DCT5+(220MM)和DCT5+(770MM),这三种软件工程任务: 脆弱性检测、代码分类和代码翻译。结果显示,这些战略可以将浮动点的操作数量减少1美元,其中的比值为19- 47美元。 在下一级测试中,我们观察到的比值的排序中, 在下值中, 度测试中,比值的比值的比值差值为20分,我们观察到的比值为2。
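A minimal sketch of the averaging strategy described in the abstract above, assuming hidden states are plain lists of floats and that a hypothetical `unit_ids` list marks which consecutive subtokens belong to the same semantic unit. The name `merge_by_average` and the toy vectors are illustrative and not taken from the paper; the learning-based strategy is not shown.

```python
from typing import List, Sequence


def merge_by_average(
    hidden: Sequence[Sequence[float]], unit_ids: Sequence[int]
) -> List[List[float]]:
    """Average consecutive hidden vectors that share the same unit id.

    `unit_ids[i]` is an assumed annotation saying which semantic unit
    (for example, which identifier) subtoken i belongs to.
    """
    if len(hidden) != len(unit_ids):
        raise ValueError("hidden and unit_ids must have the same length")

    merged: List[List[float]] = []
    start = 0
    for end in range(1, len(hidden) + 1):
        # Close the current group at the end of input or when the unit changes.
        if end == len(hidden) or unit_ids[end] != unit_ids[start]:
            group = hidden[start:end]
            dim = len(group[0])
            merged.append(
                [sum(vec[d] for vec in group) / len(group) for d in range(dim)]
            )
            start = end
    return merged


# Toy input: subtokens "get", "_", "name" form one identifier (unit 0),
# followed by "(" as its own unit (unit 1).
hidden_states = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [0.0, 8.0]]
unit_ids = [0, 0, 0, 1]
merged_states = merge_by_average(hidden_states, unit_ids)
```

In this toy case the sequence shrinks from four vectors to two, which is the kind of length reduction that would lower the cost of later layers.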
Article 139
Title@2025-07-18 (5): Enhancing LLM Code Generation with Ensembles: A Similarity-Based Selection Approach
Title: Enhancing LLM Code Generation with Ensembles: A Similarity-Based Selection Approach | Verbesserung der LLM-Code-Generierung mit Ensembles: Ein auf Ähnlichkeit basierender Auswahlansatz | 增强具有各种组合的LLM 代码生成:以相似性为基础的选择方法 2503.15838v2 |
Authors (4): Tarek Mahmud, Bin Duan, Corina Pasareanu, Guowei Yang
Ensemble learning has been widely used in machine learning to improve model robustness, accuracy, and generalization, but has not yet been applied to code generation tasks with large language models (LLMs). We propose an ensemble approach for LLMs in code generation. Instead of relying on the output of a single model, we generate multiple candidate programs from different LLMs and apply a structured voting mechanism to select the most reliable solution. For voting, we compute syntactic and semantic similarity using CodeBLEU and behavioral equivalence using CrossHair’s differential behavior analysis. By aggregating these similarity scores, we select the program that best aligns with the consensus among the candidates. We show through experiments that our ensemble approach consistently outperforms standalone LLMs on the well-known HumanEval and the more challenging LiveCodeBench datasets, achieving an accuracy of 90.2% and 50.2%, respectively, on the two datasets. In comparison, the best-performing LLM (GPT-4o) has an accuracy of 83.5% and 43.4%, respectively. Furthermore, even when restricted to free open-source models, our method achieves an accuracy of 80.5% and 41.6%, respectively, demonstrating the viability of our approach in resource-constrained settings.
在机器学习中广泛使用了综合学习,以提高模型的稳健性、准确性和通用性,但还没有应用到大型语言模型(LLMs)的代码生成任务中。我们提议了在代码生成中对LLMs采用混合法。我们不依靠单一模型的输出,而是从不同的LLMs产生多个候选程序,并应用一个结构化投票机制来选择最可靠的解决方案。在表决中,我们用CodelbleU和CrossHair的差别行为分析来计算合成和语义相似性。相比之下,通过汇总这些相似性评分,我们选择了最符合候选人之间共识的代码生成程序。我们通过实验来显示,我们的共性方法在众所周知的HumanEval和更具挑战性的LiveCodeBench数据集中始终优于独立的LLMMss。在两个数据集中分别达到90.2%和50.2%的准确性。相比之下,最佳LM(GPT-4o)的精确性为83.5%和43.4%。此外,即使我们限制在开放源设定的80 %的方法中,也分别实现了我们自由的精确性。
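The consensus-based selection described in the abstract above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: `difflib.SequenceMatcher` stands in for CodeBLEU, and agreement on a handful of hypothetical probe inputs stands in for CrossHair's differential behavior analysis. All function names and the toy candidates are invented for the example.

```python
from difflib import SequenceMatcher
from typing import Callable, List, Sequence, Tuple

# A candidate is (source text, callable built from that source).
Candidate = Tuple[str, Callable[[int], int]]


def text_similarity(a: str, b: str) -> float:
    """Stand-in for a syntactic/semantic similarity metric such as CodeBLEU."""
    return SequenceMatcher(None, a, b).ratio()


def safe_call(fn: Callable[[int], int], x: int):
    """Run a candidate and fold exceptions into a comparable outcome."""
    try:
        return ("ok", fn(x))
    except Exception as exc:
        return ("error", type(exc).__name__)


def behavior_agreement(
    f: Callable[[int], int], g: Callable[[int], int], probes: Sequence[int]
) -> float:
    """Fraction of probe inputs on which two candidates behave the same."""
    same = sum(1 for x in probes if safe_call(f, x) == safe_call(g, x))
    return same / len(probes)


def select_by_consensus(candidates: List[Candidate], probes: Sequence[int]) -> int:
    """Return the index of the candidate most similar to all the others."""
    scores = []
    for i, (src_i, fn_i) in enumerate(candidates):
        total = 0.0
        for j, (src_j, fn_j) in enumerate(candidates):
            if i != j:
                total += text_similarity(src_i, src_j)
                total += behavior_agreement(fn_i, fn_j, probes)
        scores.append(total)
    return max(range(len(candidates)), key=scores.__getitem__)


# Toy candidates for "square a number"; the third one is wrong.
candidates: List[Candidate] = [
    ("def f(x): return x * x", lambda x: x * x),
    ("def f(x): return x ** 2", lambda x: x ** 2),
    ("def f(x): return x * 2", lambda x: x * 2),
]
best_index = select_by_consensus(candidates, probes=[0, 1, 2, 3, 4])
```

With these toy inputs the two behaviorally equivalent candidates reinforce each other, so the outlier is not selected even though its text is very close to theirs.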
Article 140
Title@2025-07-18 (5): Developing Shared Vocabulary System For Collaborative Software Engineering
Title: Developing Shared Vocabulary System For Collaborative Software Engineering | Entwicklung eines gemeinsamen Vokabelsystems für die gemeinsame Software-Engineering | 开发合作软件工程共用词汇系统 2507.14396v1 |
Authors (5): Carey Lai Zheng Hui, Johnson Britto Jessia Esther Leena, Kumuthini Subramanian, Zhao Chenyu, Shubham Rajeshkumar Jariwala
Effective communication is a critical factor in successful software engineering collaboration. However, communication gaps remain a persistent challenge, often leading to misunderstandings, inefficiencies, and defects. This research investigates the technical factors contributing to such misunderstandings and explores the measurable benefits of establishing shared vocabulary systems within software documentation and codebases. Using a Design Science Research (DSR) framework, the study was structured into three iterative phases: problem identification, method development, and empirical validation. The problem identification phase involved thematic analysis of communication data and semi-structured interviews, revealing key factors such as ambiguous messaging, misalignment in documentation, inconsistent code review feedback, and API integration miscommunication. Grounded Theory principles were employed to design a structured methodology for collaborative vocabulary development. Empirical validation through controlled experiments demonstrated that while initial adoption introduced overhead, the shared vocabulary system significantly improved information density, documentation clarity, and collaboration efficiency over time. Findings offer actionable insights for improving communication practices in software engineering, while also identifying limitations and directions for future research.
有效的通信是成功软件工程合作的一个关键因素,然而,通信差距仍然是一个长期存在的挑战,往往导致误解、效率低下和缺陷。这一研究调查了造成这种误解的技术因素,并探讨了在软件文档和代码库内建立共享词汇系统的可计量效益。利用设计科学研究框架,研究分为三个迭接阶段:问题识别、方法开发和经验验证。问题识别阶段涉及对通信数据的专题分析和半结构式访谈,揭示了模糊的信息、文件的错配、代码审查反馈不一致和API整合错误通信等关键因素。根据理论原则设计了合作词汇开发的结构性方法。通过受控实验进行的经验验证表明,在初步采用间接费用的同时,共享词汇系统极大地提高了信息密度、文件清晰度以及长期协作效率。结果为改进软件工程的通信做法提供了可操作的洞察力,同时也确定了未来研究的局限性和方向。
Article 141
Title@2025-07-18 (5): Combinatorial Optimization for All: Using LLMs to Aid Non-Experts in Improving Optimization Algorithms
Title: Combinatorial Optimization for All: Using LLMs to Aid Non-Experts in Improving Optimization Algorithms | Kombinatorische Optimierung für alle: Verwendung von LLMs zur Unterstützung von Nicht-Experten bei der Verbesserung von Optimierungsalgorithmen | 组合优化全民:利用LLMs帮助非专家改进最佳化算法 2503.10968v2 |
Authors (2): Camilo Chacón Sartori, Christian Blum
Large Language Models (LLMs) have shown notable potential in code generation for optimization algorithms, unlocking exciting new opportunities. This paper examines how LLMs, rather than creating algorithms from scratch, can improve existing ones without the need for specialized expertise. To explore this potential, we selected 10 baseline optimization algorithms from various domains (metaheuristics, reinforcement learning, deterministic, and exact methods) to solve the classic Travelling Salesman Problem. The results show that our simple methodology often results in LLM-generated algorithm variants that improve over the baseline algorithms in terms of solution quality, reduction in computational time, and simplification of code complexity, all without requiring specialized optimization knowledge or advanced algorithmic implementation skills.
大型语言模型(LLMS)在优化算法的代码生成方面显示出显著的潜力,打开了令人兴奋的新机会。本文审视了LLMS,而不是从零开始创造算法,如何能够在不需要专门知识的情况下改进现有的算法。为了探索这一潜力,我们从各个领域(美经经济学、强化学习、确定论和精确方法)选择了10个基线优化算法,以解决典型的《旅行推销员问题》。结果显示,我们简单的方法往往产生LLM产生的算法变异,在解决方案质量、缩短计算时间和简化代码复杂性方面比基线算法改进,而不需要专门的优化知识或先进的算法执行技能。
Article 142
Title@2025-07-18 (5): Remote Assistance or Remote Driving: The Impact of Operational Design Domains on ADS-Supporting Systems Selection
Title: Remote Assistance or Remote Driving: The Impact of Operational Design Domains on ADS-Supporting Systems Selection | Remote Assistance oder Remote Driving: Die Auswirkungen von Operational Design Domains auf die Auswahl von ADS-unterstützten Systemen | 远程援助或远程驾驶:业务设计域域对ADS支助系统选择的影响 2507.14347v1 |
Authors (2): Ole Hans, Benedikt Walter
High-level Automated Driving Systems (ADS) can handle many situations, but they still encounter situations where human intervention is required. In systems where a physical driver is present in the vehicle, typically SAE Level 3 systems, this intervention is relatively straightforward and is handled by the in-vehicle driver. However, the complexity increases for Level 4 systems, where, in most cases, no physical driver remains in the vehicle. The two common industry solutions to this challenge both integrate a remote support system: either a Remote Driving System (RDS) or a Remote Assistance System (RAS). While it is clear that ADS will require one of these systems, it is less clear how the suitability of either system for a particular ADS application should be evaluated. Currently, the selection process often focuses on system architecture as well as its design and integration challenges. Furthermore, since many ADS developers choose to develop remote system solutions in-house, it is advantageous to select the simpler approach to streamline development and integration efforts. While these decision points are certainly relevant, this approach overlooks the most critical factors: the use cases and the complementarity of the ADS and the remote support system within the context of the Operational Design Domain (ODD). This paper proposes a structured approach for selecting between RDS and RAS as an ADS support system, based on the defined ODD and use case analysis. To achieve this, the paper applies the PEGASUS framework to systematically describe and analyze the ODD. A structured framework is introduced to evaluate and select the most suitable remote support system for an ADS based on clearly defined criteria.
高级自动驾驶系统(ADS)可以处理许多情况,但是它们仍然遇到需要人力干预的情况。在车辆中存在有形驱动器的系统中,通常是SAE三级系统,这种干预相对简单,由车辆驱动器处理;然而,4级系统的复杂性增加,在多数情况下,车辆中仍没有有形驱动器。针对这一挑战的两个共同的行业解决办法是整合远程辅助系统,如远程驾驶系统或远程援助系统(RAS)。虽然ADS显然需要其中一种系统,但不清楚应如何系统地评价两种系统对特定ADS应用程序的适合性。目前,选择过程往往侧重于系统架构及其设计和整合挑战。此外,由于许多ADS开发者选择在内部开发远程系统解决方案,因此选择简化发展和一体化努力的更简单方法是有好处的。虽然这些决定点确实相关,但这种方法忽略了最关键的因素:ADS系统的使用案例和系统对特定ADS应用的系统以及远程支持系统对特定ADS应用程序的适合性评估。
Article 143
Title@2025-07-18 (5): Leveraging LLMs for Formal Software Requirements – Challenges and Prospects
Title: Leveraging LLMs for Formal Software Requirements – Challenges and Prospects | Leveraging LLMs für formale Softwareanforderungen – Herausforderungen und Perspektiven | 为正式软件要求 – – 挑战和前景 – – 利用LMLM 利用LMLM 来利用正规软件要求 – – 挑战和前景 2507.14330v1 |
Authors (3): Arshad Beg, Diarmuid O’Donoghue, Rosemary Monahan
Software correctness is ensured mathematically through formal verification, which requires both a formal requirement specification and an implementation to be verified. Tools such as model-checkers and theorem provers ensure software correctness by verifying the implementation against the specification. Formal methods deployment is regularly enforced in the development of safety-critical systems, e.g., aerospace, medical devices, and autonomous systems. Generating these specifications from informal and ambiguous natural language requirements remains the key challenge. Our project, VERIFAI, aims to investigate automated and semi-automated approaches to bridge this gap, using techniques from Natural Language Processing (NLP), ontology-based domain modelling, artefact reuse, and large language models (LLMs). This position paper presents a preliminary synthesis of relevant literature to identify recurring challenges and prospective research directions in the generation of verifiable specifications from informal requirements.
通过正式核查确保软件的正确性。 正式核查涉及生成正式要求规格和执行必须核查的资源。模型检查器和理论验证器等工具通过对照规格核查执行情况确保软件的正确性。在开发安全关键系统(如航空航天系统、医疗设备和自主系统)时,经常采用正式方法。从非正式和模糊的自然语言要求中产生这些规格仍然是关键的挑战。我们的项目(VERIFAI1})旨在调查自动和半自动办法,利用自然语言处理技术(NLP)、内科域建模、人工再利用和大型语言模型(LLMs)等技术来弥补这一差距。本立场文件初步综合了相关文献,以查明在根据非正式要求生成可核查规格时反复出现的挑战和潜在的研究方向。
Article 144
Title@2025-07-18 (5): Impact of Code Context and Prompting Strategies on Automated Unit Test Generation with Modern General-Purpose Large Language Models
Title: Impact of Code Context and Prompting Strategies on Automated Unit Test Generation with Modern General-Purpose Large Language Models | Auswirkungen von Code-Kontexten und Prompting-Strategien auf die automatisierte Unit-Testgenerierung mit modernen, allgemein angelegten großen Sprachmodellen | 守则背景和提示战略对采用现代通用大语言通用模式的 自动单位测试生成的影响 2507.14256v1 |
Authors (3): Jakub Walczak, Piotr Tomalak, Artur Laskowski
Generative AI is gaining increasing attention in software engineering, where testing remains an indispensable reliability mechanism. According to the widely adopted testing pyramid, unit tests constitute the majority of test cases and are often schematic, requiring minimal domain expertise. Automatically generating such tests under the supervision of software engineers can significantly enhance productivity during the development phase of the software lifecycle. This paper investigates the impact of code context and prompting strategies on the quality and adequacy of unit tests generated by various large language models (LLMs) across several families. The results show that including docstrings notably improves code adequacy, while further extending context to the full implementation yields distinctly smaller gains. Notably, the chain-of-thought prompting strategy – applied even to ‘reasoning’ models – achieves the best results, with up to 96.3% branch coverage, a 57% average mutation score, and a near-perfect compilation success rate. Among the evaluated models, M5 (Gemini 2.5 Pro) demonstrated superior performance in both mutation score and branch coverage while remaining among the top models in compilation success rate. All the code and resulting test suites are publicly available at https://github.com/peetery/LLM-analysis.
根据广泛采用的测试金字塔,单位测试构成测试案例的大多数,而且往往具有示意性,需要最低限度的域域内专门知识。在软件工程师的监督下自动进行这种测试,可在软件生命周期开发阶段大大提高生产率。本文调查了代码背景的影响,并促使一些家庭各种大型语言模型(LLMS)产生的单位测试质量和充分性的战略。结果显示,包括 docs(Docstres)显著地提高了代码的适足性,而将背景进一步扩展至全面实施,则成效肯定较小。值得注意的是,经过思索的激励战略(甚至适用于“推理”模型)取得了最佳效果,其分支覆盖率高达96.3-;平均突变分数为57;近乎于效果的汇编成功率。在评估模型中,M5(Gemini 2.5 Pro)显示突变分和分支覆盖的优异性表现在汇编成功率方面仍然处于顶端。所有代码和由此产生的测试套都可在https://github.com/peteryLLLLLA上公开查阅。
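For readers unfamiliar with the three metrics named in the abstract above, the sketch below shows how each is conventionally computed as a simple ratio. The function names and the toy counts are hypothetical and are not the paper's data or tooling; some tools also exclude equivalent mutants from the mutation-score denominator, which is not modeled here.

```python
def ratio(part: int, whole: int) -> float:
    """Return part / whole, rejecting empty or inconsistent counts."""
    if whole <= 0 or not 0 <= part <= whole:
        raise ValueError("expected 0 <= part <= whole and whole > 0")
    return part / whole


def branch_coverage(covered_branches: int, total_branches: int) -> float:
    """Share of branches executed at least once by the test suite."""
    return ratio(covered_branches, total_branches)


def mutation_score(killed_mutants: int, total_mutants: int) -> float:
    """Share of injected faults (mutants) that make at least one test fail."""
    return ratio(killed_mutants, total_mutants)


def compilation_success_rate(compiling_suites: int, generated_suites: int) -> float:
    """Share of generated test suites that compile (or import) cleanly."""
    return ratio(compiling_suites, generated_suites)


# Toy counts, purely illustrative.
coverage = branch_coverage(9, 10)
score = mutation_score(3, 4)
compiles = compilation_success_rate(5, 5)
```

High branch coverage with a much lower mutation score, as reported in the abstract, is a common pattern: tests can execute a branch without asserting anything that a mutant would violate.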
Article 145
Title@2025-07-18 (5): Code Readability in the Age of Large Language Models: An Industrial Case Study from Atlassian
Title: Code Readability in the Age of Large Language Models: An Industrial Case Study from Atlassian | Code Lesbarkeit im Zeitalter großer Sprachmodelle: Eine industrielle Fallstudie von Atlassian | 《大语言模式时代的可读性:阿特拉斯斯语工业案例研究》 2501.11264v3 |
Authors (6): Wannita Takerngsaksiri, Chakkrit Tantithamthavorn, Micheal Fu, Jirat Pasuksmit, Kun Chen, Ming Wu
Software engineers spend a significant amount of time reading code during the software development process, especially in the age of large language models (LLMs) that can automatically generate code. However, little is known about the readability of the LLM-generated code and whether it is still important from practitioners’ perspectives in this new era. In this paper, we conduct a survey to explore the practitioners’ perspectives on code readability in the age of LLMs and investigate the readability of our LLM-based software development agents framework, HULA, by comparing its generated code with human-written code in real-world scenarios. Overall, the findings underscore that (1) readability remains a critical aspect of software development; (2) the readability of our LLM-generated code is comparable to human-written code, fostering the establishment of appropriate trust and driving the broad adoption of our LLM-powered software development platform.
软件工程师在软件开发过程中花费了大量的时间阅读代码,特别是在能够自动生成代码的大型语言模型(LLMs)时代,然而,对于LLM生成的代码的可读性以及从从从业者的观点看,它在这个新时代是否仍然很重要,人们对此知之甚少。在本文件中,我们进行了一项调查,以探讨从业者对LLMs时代的可读性的看法,并调查我们基于LLM的软件开发代理框架(HULA)的可读性,在现实世界情景中将其生成的代码与人写代码进行比较。总体而言,调查结果强调:(1) 可读性仍然是软件开发的一个关键方面;(2)我们LLM生成的代码的可读性与人写代码相似,促进建立适当的信任,推动广泛采用我们的LM驱动软件开发平台。
Article 146
Title@2025-07-18 (5): Testing Autonomous Driving Systems – What Really Matters and What Doesn’t
Title: Testing Autonomous Driving Systems – What Really Matters and What Doesn’t | Autonome Fahrsysteme testen – Was wirklich zählt und was nicht | 自动自动驾驶测试系统 – – 真正重要和不重要的东西 2507.13661v1 |
Authors (4): Changwen Li, Joseph Sifakis, Rongjie Yan, Jian Zhang
Despite extensive research, the landscape of autonomous driving system (ADS) testing remains fragmented, and there is currently no basis for an informed technical assessment of the importance and contribution of the current state of the art. This paper attempts to address this problem by exploring two complementary aspects. First, it proposes a framework for comparing existing test methods in terms of their intrinsic effectiveness and validity. It shows that many methods do not meet both of these requirements, either because they are based on criteria that do not allow for rapid, inexpensive, and comprehensive detection of failures, or because the degree of validity of the properties tested cannot be accurately estimated. In particular, it is shown that most critical test methods do not take into account the nominal operational capabilities of autopilots and generate scenarios that are impossible for the tested vehicles to handle, resulting in unjustified rejections. Second, the paper shows that test effectiveness and validity are highly dependent on how autopilots are designed: how they choose between different control policies to perform maneuvers, as well as on the reproducibility of the results. In fact, most test methods take for granted two principles, rationality and determinacy, that underlie traditional methods but do not generally apply to ADS. We maintain that the absence of rationality and determinacy significantly impairs the effectiveness and validity of test methods, and provide test results on eight open autopilots, most of which do not satisfy these properties, thereby illustrating this fact. We conclude that under the current state of the art, it is impossible to obtain strong enough guarantees for essential autopilot properties and recommend that autopilots be developed with a view to both rationality and determinacy.
尽管进行了广泛的研究,对自主驾驶系统(ADS)的测试仍然支离破碎,目前没有依据对目前先进状态的重要性和贡献进行知情的技术评估。本文件试图通过探讨两个互补方面来解决这一问题。首先,它提议了一个框架,以比较现有测试方法的内在有效性和有效性;它表明许多方法不符合这两项要求。要么因为它们所依据的标准不允许快速、廉价和全面地发现故障,要么因为它们所依据的标准不允许快速、廉价和全面检测失败,或者因为测试的特性的合理性程度无法准确估计。特别是,事实证明,大多数关键测试方法没有考虑到自动驾驶仪的表面操作能力,并产生了测试工具无法处理的情景,导致不合理的拒绝。第二,该文件表明测试的有效性和有效性在很大程度上取决于自动驾驶技术的设计:它们如何选择不同的控制政策来进行操作,以及结果的可贵度。事实上,大多数测试方法都以两种原则为基础,但通常不适用于ADS。我们坚持认为,对自动驾驶仪的表面操作能力而言,其可靠性和可靠性都无法被充分检验。我们坚持认为,在进行这种高度的检验时,在进行这种检验时,最能检验和最能性检验的方法是充分地证明,在进行这种检验的可靠性和最不具有决定性。我们能够证明这种检验。
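One way to picture the reproducibility concern raised in the abstract above is a repeat-run check: execute the same scenario several times and see whether the verdict changes. The sketch below reads determinacy simply as reproducibility of outcomes, which is an assumption based on the abstract's wording; the paper's formal definition may be stricter. The two toy autopilots and all names are hypothetical.

```python
import itertools
from typing import Callable, Hashable


def is_reproducible(
    run_scenario: Callable[[int], Hashable], scenario_id: int, repeats: int = 5
) -> bool:
    """Return True if repeated runs of one scenario always give the same outcome."""
    outcomes = {run_scenario(scenario_id) for _ in range(repeats)}
    return len(outcomes) == 1


def steady_autopilot(scenario_id: int) -> str:
    """Toy system whose verdict depends only on the scenario."""
    return "pass" if scenario_id % 2 == 0 else "collision"


_call_counter = itertools.count()


def flaky_autopilot(scenario_id: int) -> str:
    """Toy system whose verdict depends on hidden state carried across runs."""
    return "pass" if next(_call_counter) % 2 == 0 else "collision"


steady_ok = is_reproducible(steady_autopilot, scenario_id=7)
flaky_ok = is_reproducible(flaky_autopilot, scenario_id=7)
```

A single failing run of the second system says little about the scenario itself, which is why a test verdict on a non-reproducible system is hard to interpret.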
Article 147
Title@2025-07-18 (5): LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision and the Road Ahead
Title: LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision and the Road Ahead | LLM-basierte Multi-Agenten-Systeme für die Software-Engineering: Literature Review, Vision and the Road Ahead | 以LLM为基础的软件工程多机构系统:文献审查、展望和路前 2404.04834v4 |
Authors (3): Junda He, Christoph Treude, David Lo
Integrating Large Language Models (LLMs) into autonomous agents marks a significant shift in the research landscape by offering cognitive abilities that are competitive with human planning and reasoning. This paper explores the transformative potential of integrating Large Language Models into Multi-Agent (LMA) systems for addressing complex challenges in software engineering (SE). By leveraging the collaborative and specialized abilities of multiple agents, LMA systems enable autonomous problem-solving, improve robustness, and provide scalable solutions for managing the complexity of real-world software projects. In this paper, we conduct a systematic review of recent primary studies to map the current landscape of LMA applications across various stages of the software development lifecycle (SDLC). To illustrate current capabilities and limitations, we perform two case studies to demonstrate the effectiveness of state-of-the-art LMA frameworks. Additionally, we identify critical research gaps and propose a comprehensive research agenda focused on enhancing individual agent capabilities and optimizing agent synergy. Our work outlines a forward-looking vision for developing fully autonomous, scalable, and trustworthy LMA systems, laying the foundation for the evolution of Software Engineering 2.0.
将大语言模型(LLMS)融入自主代理器标志着研究格局的重大变化,通过提供与人类规划和推理具有竞争力的认知能力,使研究领域发生了重大变化。本文件探讨了将大语言模型纳入多机构系统以应对软件工程复杂挑战的变革潜力。通过利用多种代理器的协作和专业能力,LMA系统能够自主解决问题,提高稳健性,并为管理现实世界软件项目的复杂性提供可扩展的解决办法。本文系统地审查最近进行的初步研究,以绘制当前LMA应用在软件开发生命周期各个阶段的全貌。为了说明当前的能力和局限性,我们进行了两个案例研究,以展示最新工艺LMA框架的有效性。此外,我们找出了关键的研究差距,并提出了侧重于加强单个代理器能力和优化代理器协同作用的全面研究议程。我们的工作概述了开发完全自主、可扩展和可信赖的LMA系统前瞻性愿景,为软件工程2.0的演进奠定了基础。
Article 148
Title@2025-07-18 (5): ParaStudent: Generating and Evaluating Realistic Student Code by Teaching LLMs to Struggle
Title: ParaStudent: Generating and Evaluating Realistic Student Code by Teaching LLMs to Struggle | ParaStudent: Erzeugen und Evaluieren des Realistischen Studentenkodex durch Lehre von LLMs zum Kampf | 副专业学生:通过教授LLMs进行斗争,产生和评价现实学生守则 2507.12674v2 |
Authors (5): Mihran Miroyan, Rose Niousha, Joseph E. Gonzalez, Gireeja Ranade, Narges Norouzi
Large Language Models (LLMs) have shown strong performance on programming tasks, but can they generate code the way real students write it: imperfect, iterative, and stylistically diverse? We present ParaStudent, a systematic study of LLM-based “student-like” code generation in an introductory programming course setting. Using a dataset of timestamped student submissions across multiple semesters, we design low- and high-resolution experiments to model student progress and evaluate code outputs along semantic, functional, and stylistic dimensions. Our results show that fine-tuning significantly improves alignment with real student trajectories and captures error patterns, incremental improvements, and stylistic variations more faithfully. This study shows that modeling realistic student code requires capturing learning dynamics through context-aware generation, temporal modeling, and multi-dimensional evaluation. Code for experiments and evaluation is available at https://github.com/mmiroyan/ParaStudent.
大型语言模型(LLMS)在编程任务方面表现良好,但是它们能产生像实生一样的学生代码吗?我们在一个介绍性编程课程设置中提出ParaStudent,这是对基于LLM的“学生类”代码生成的系统研究。我们利用一个多学期学生提交的时间标记数据集,设计低和高分辨率的实验,以模拟学生进步,并评价语义、功能和文体方面的代码输出。我们的成果显示,微调大大改善了与真实学生轨迹的匹配,并更加忠实地捕捉错误模式、渐进式改进和文体变化。这项研究显示,模拟现实的学生代码需要通过背景认知生成、时间模型和多维评价来捕捉学习动态。实验和评价代码可在https://github.com/miroyan/ParaStududid查阅。
Article 149
Title@2025-07-17 (4): An Approach for Auto Generation of Labeling Functions for Software Engineering Chatbots
Title: An Approach for Auto Generation of Labeling Functions for Software Engineering Chatbots | Ein Ansatz zur automatischen Generierung von Beschriftungsfunktionen für Software Engineering Chatbots | 软件工程聊天器自动生成标签功能的方法 2410.07094v2 |
Authors (4): Ebube Alor, Ahmad Abdellatif, SayedHassan Khatoonabadi, Emad Shihab
Software engineering (SE) chatbots are increasingly gaining attention for their role in enhancing development processes. At the core of chatbots are Natural Language Understanding platforms (NLUs), which enable them to comprehend user queries but require labeled data for training. However, acquiring such labeled data for SE chatbots is challenging due to the scarcity of high-quality datasets, as training requires specialized vocabulary and phrases not found in typical language datasets. Consequently, developers often resort to manually annotating user queries – a time-consuming and resource-intensive process. Previous approaches require human intervention to generate rules, called labeling functions (LFs), that categorize queries based on specific patterns. To address this issue, we propose an approach to automatically generate LFs by extracting patterns from labeled user queries. We evaluate our approach on four SE datasets and measure performance improvement from training NLUs on queries labeled by the generated LFs. The generated LFs effectively label data with AUC scores up to 85.3% and NLU performance improvements up to 27.2%. Furthermore, our results show that the number of LFs affects labeling performance. We believe that our approach can save time and resources in labeling users’ queries, allowing practitioners to focus on core chatbot functionalities rather than manually labeling queries.
聊天室(SE) 聊天室工程(SE) 正在日益引起人们对其在加强发展进程中的作用的关注。 聊天室的核心是自然语言理解平台(NLUs),这些平台使他们能够理解用户询问,但需要贴标签的培训数据。然而,由于缺少高质量的数据集,为SE聊天室获取此类标签数据具有挑战性,因为培训需要专门词汇和在典型语言数据集中找不到的词组。因此,开发商经常使用人工说明用户询问 – – 耗费时间和资源密集的过程。以前的做法需要人类干预才能产生规则,称为标签功能(LFs),根据具体模式对询问进行分类。为了解决这一问题,我们建议了一种通过从标签用户询问中提取模式自动生成LFs的方法。我们评价了四个SE数据集的方法,衡量NLUs在所生成的LF数据集标签上没有找到的专门词汇和词汇。因此,开发商往往使用人工手写标签(LFs) 有效地将数据标签数据评为85.3 % 和 NLUs 性能改进到27.2 % 。此外,我们的结果显示, 标签标签用户对时间的查询方式和机路段的查询会影响着标签的进度。
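To make the idea of generating labeling functions from labeled queries concrete, here is a minimal sketch under simplifying assumptions: a pattern is just a single word that occurs only in queries of one intent, and conflicts are resolved by majority vote. The abstract does not specify the actual pattern-extraction method, so every name, threshold, and toy query below is hypothetical.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Optional, Tuple

ABSTAIN = None
LabelingFunction = Callable[[str], Optional[str]]


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenization; a real system would do more."""
    return text.lower().split()


def make_keyword_lf(word: str, label: str) -> LabelingFunction:
    """Build an LF that votes `label` when `word` occurs, else abstains."""
    def lf(query: str) -> Optional[str]:
        return label if word in tokenize(query) else ABSTAIN
    lf.__name__ = f"lf_{word}_{label}"
    return lf


def generate_lfs(
    labeled: List[Tuple[str, str]], min_support: int = 2
) -> List[LabelingFunction]:
    """Create one keyword LF per word seen in exactly one intent."""
    word_labels: Dict[str, Counter] = defaultdict(Counter)
    for query, label in labeled:
        for word in set(tokenize(query)):
            word_labels[word][label] += 1

    lfs: List[LabelingFunction] = []
    for word, counts in sorted(word_labels.items()):
        if len(counts) == 1:
            label, support = next(iter(counts.items()))
            if support >= min_support:
                lfs.append(make_keyword_lf(word, label))
    return lfs


def apply_lfs(lfs: List[LabelingFunction], query: str) -> Optional[str]:
    """Label a query by majority vote over the LFs that do not abstain."""
    votes = Counter(v for v in (lf(query) for lf in lfs) if v is not ABSTAIN)
    return votes.most_common(1)[0][0] if votes else ABSTAIN


# Toy labeled queries for a software engineering chatbot.
labeled_queries = [
    ("who fixed the login bug", "bug_owner"),
    ("who fixed bug 42", "bug_owner"),
    ("list commits from last week", "commit_history"),
    ("list commits by alice", "commit_history"),
]
lfs = generate_lfs(labeled_queries)
predicted = apply_lfs(lfs, "who fixed the parser bug")
```

Queries labeled this way could then serve as additional training data for the NLU, which is the role the abstract assigns to the generated LFs.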
Article 150
Title@2025-07-17 (4): Demystifying Feature Requests: Leveraging LLMs to Refine Feature Requests in Open-Source Software
Title: Demystifying Feature Requests: Leveraging LLMs to Refine Feature Requests in Open-Source Software | Feature-Anfragen entmystifizieren: LLMs zur Verfeinerung von Feature-Anfragen in Open-Source-Software nutzen | 解密功能请求: 利用LMML 来在开放源码软件中使用 Refine 功能请求 2507.13555v1 |
Authors (5): Pragyan K C, Rambod Ghandiparsi, Thomas Herron, John Heaps, Mitra Bokaei Hosseini
The growing popularity and widespread use of software applications (apps) across various domains have driven rapid industry growth. Along with this growth, fast-paced market changes have led to constantly evolving software requirements. Such requirements are often grounded in feature requests and enhancement suggestions, typically provided by users in natural language (NL). However, these requests often suffer from defects such as ambiguity and incompleteness, making them challenging to interpret. Traditional validation methods (e.g., interviews and workshops) help clarify such defects but are impractical in decentralized environments like open-source software (OSS), where change requests originate from diverse users on platforms like GitHub. This paper proposes a novel approach leveraging Large Language Models (LLMs) to detect and refine NL defects in feature requests. Our approach automates the identification of ambiguous and incomplete requests and generates clarification questions (CQs) to enhance their usefulness for developers. To evaluate its effectiveness, we apply our method to real-world OSS feature requests and compare its performance against human annotations. In addition, we conduct interviews with GitHub developers to gain deeper insights into their perceptions of NL defects, the strategies they use to address these defects, and the impact of defects on downstream software engineering (SE) tasks.
随着这一增长,快速的市场变化导致软件要求的不断变化,这些要求往往以特写请求和增强建议为基础,通常由自然语言用户提供。然而,这些请求往往有缺陷,例如模糊性和不完全性,因此难以解释。传统的验证方法(例如访谈和讲习班)有助于澄清这类缺陷,但在开放源码软件(OSS)等分散环境中是不切实际的。在开放源码软件(OSS)等平台上,不同用户提出了变更请求。本文提出了利用大语言模型(LLLMS)发现和完善特写请求中的NLM缺陷的新办法。我们的方法自动确定模糊和不完整的请求,并提出澄清问题(CQs),以提高这些请求对开发商的效用。为了评估其有效性,我们将我们的方法应用于真实世界的开放源码软件特征请求,并对照人类的描述来比较其性能。此外,我们与GitHub开发商进行访谈,以便更深入了解他们对NL缺陷的看法、他们用来解决这些缺陷的战略以及下游软件(SE)的缺陷。
Article 151
Title@2025-07-17 (4): Towards Better Requirements from the Crowd: Developer Engagement with Feature Requests in Open Source Software
Title: Towards Better Requirements from the Crowd: Developer Engagement with Feature Requests in Open Source Software | Auf dem Weg zu besseren Anforderungen aus der Crowd: Entwickler Engagement mit Feature-Anfragen in Open Source Software | 实现来自人群的更好要求:开发者在开放源码软件中参与满足地物要求 2507.13553v1 |
Authors (5): Pragyan K C, Rambod Ghandiparsi, Thomas Herron, John Heaps, Mitra Bokaei Hosseini
As user demands evolve, effectively incorporating feature requests is crucial for maintaining software relevance and user satisfaction. Feature requests, typically expressed in natural language, often suffer from ambiguity or incomplete information due to communication gaps or the requester’s limited technical expertise. These issues can lead to misinterpretation, faulty implementation, and reduced software quality. While seeking clarification from requesters is a common strategy to mitigate these risks, little is known about how developers engage in this clarification process in practice: how they formulate clarifying questions, seek technical or contextual details, align on goals and use cases, or decide to close requests without attempting clarification. This study investigates how prone feature requests are to NL defects (i.e., ambiguity or incompleteness) and the conversational dynamics of clarification in open-source software (OSS) development, aiming to understand how developers handle ambiguous or incomplete feature requests. Our findings suggest that feature requests published on OSS platforms do exhibit ambiguity and incompleteness, and in some cases, both. We also find that explicit clarification for the resolution of these defects is uncommon; developers usually focus on aligning with project goals rather than resolving unclear text. When clarification occurs, it emphasizes understanding user intent/goal and feasibility, rather than technical details. By characterizing the dynamics of clarification in open-source issue trackers, this work identifies patterns that can improve user-developer collaboration and inform best practices for handling feature requests effectively.
随着用户需求的演变,有效纳入特征请求对于保持软件相关性和用户满意度至关重要。自然语言表达的特征请求往往由于通信差距或请求人有限的技术专长而存在模糊或不完整的信息,这些问题可能导致错误理解、错误执行和软件质量下降。在寻求请求者澄清是减轻这些风险的共同战略的同时,对于开发者如何在实践中参与这一澄清过程,他们如何提出澄清问题、寻求技术或背景细节、调整目标和使用案例,或决定关闭请求而不试图澄清。本研究报告调查了特征请求如何容易出现NL缺陷(即模糊或不完整)以及公开源软件开发中澄清的谈话动态,目的是了解开发者如何处理模糊或不完整的特征请求。我们的调查结果表明,开放源码软件平台上公布的特征请求确实含混不清和不全面,在某些情况下,我们还认为,明确澄清这些缺陷的做法并不常见;开发者通常侧重于与项目目标保持一致,而不是解决不明确的文本。在进行澄清时,它强调了解用户意向/目标/目标/目标/不完整以及用户处理方式的谈话性动态,而不是技术细节。我们发现,用户对用户处理方式的公开性要求能够改进。
Article 152
Title@2025-07-17 (4): AI-Assisted Fixes to Code Review Comments at Scale
Title: AI-Assisted Fixes to Code Review Comments at Scale | AI-Assisted Fixes to Code Review Kommentare auf Scale | AI 协助制定标准标准代码审查评论 2507.13499v1 |
Authors (10): Chandra Maddila, Negar Ghorbani, James Saindon, Parth Thakkar, Vijayaraghavan Murali, Rui Abreu, Jingyue Shen, Brian Zhou, Nachiappan Nagappan, Peter C. Rigby
Aim. There are tens of thousands of code review comments each week at Meta. We developed Metamate for Code Review (MetaMateCR), which provides AI-assisted fixes for reviewer comments in production at scale. Method. We developed an internal benchmark of 64k <review comment, patch> data points to fine-tune Llama models. Once our models achieve reasonable offline results, we roll them into production. To ensure that our AI-assisted fixes do not negatively impact the time it takes to do code reviews, we conduct randomized controlled safety trials as well as full production experiments. Offline Results. As a baseline, we compare GPT-4o to our small and large Llama models. In offline results, our LargeLSFT model creates an exact match patch 68% of the time, outperforming GPT-4o by 9 percentage points (pp). The internal models also use more modern Hack functions when compared to the PHP functions suggested by GPT-4o. Safety Trial. When we roll MetaMateCR into production in a safety trial that compares no AI patches with AI patch suggestions, we see a large regression, with reviewers taking over 5% longer to conduct reviews. After investigation, we modify the UX to show the AI patches only to authors, and see no regressions in the time for reviews. Production. When we roll LargeLSFT into production, we see an ActionableToApplied rate of 19.7%, which is a 9.2pp improvement over GPT-4o. Our results illustrate the importance of safety trials in ensuring that AI does not inadvertently slow down engineers, and a successful review comment to AI patch product running at scale.
目标 : 每星期在梅塔有10 000个代码审查评论。 我们开发了代码审查的Metamate(MetamateCR) , 提供了在规模生产中进行审评评论的AI协助修正。 方法 。 我们开发了64k < 审评评论, 修补了数据点以微调Llama模型。 一旦我们的模型实现了合理的离线结果, 我们就会将其输入到生产中。 为了确保我们的AI协助修正不会对进行代码审查所需的时间产生消极影响, 我们随机地进行了控制安全测试和全面生产实验。 离线结果 。 作为基线, 我们把GPT-4o 与我们的小型和大型Llama模型进行比较。 在离线结果中, 我们的GLOTFT模型创造了精确匹配68%的时间比GPT-4的9个百分点(pppp) 。 一旦我们的模型实现了合理的离线结果, 我们就会把它们输入到更现代的HPHP 功能。 安全试验。 当我们把MetmateCRCR到一个安全试验中, 而不是AI 校正补建议, 我们看到一个巨大的回缩, 。
Article 153
Title@2025-07-17 (4): Socio-Technical Smell Dynamics in Code Samples: A Multivocal Review on Emergence, Evolution, and Co-Occurrence
Title: Socio-Technical Smell Dynamics in Code Samples: A Multivocal Review on Emergence, Evolution, and Co-Occurrence | Socio-Technical Smell Dynamics in Code Samples: Multivocal Review über Emergence, Evolution und Co-Occurrence | 代码样本中社会-技术闻闻动态:关于新出现、演变和共发的多动审查 2507.13481v1 |
Authors (4): Arthur Bueno, Bruno Cafeo, Maria Cagnin, Awdren Fontão
Code samples play a pivotal role in open-source ecosystems (OSSECO), serving as lightweight artifacts that support knowledge transfer, onboarding, and framework adoption. Despite their instructional relevance, these samples are often governed informally, with minimal review and unclear ownership, which increases their exposure to socio-technical degradation. In this context, the co-occurrence and longitudinal interplay of code smells (e.g., large classes, poor modularity) and community smells (e.g., lone contributors, fragmented communication) become particularly critical. While each type of smell has been studied in isolation, little is known about how community-level dysfunctions anticipate or exacerbate technical anomalies in code samples over time. This study investigates how code and community smells emerge, co-occur, and evolve within code samples maintained in OSSECOs. A Multivocal Literature Review protocol was applied, encompassing 30 peer-reviewed papers and 17 practitioner-oriented sources (2013-2024). Thematic synthesis was conducted to identify recurring socio-technical patterns related to smell dynamics. Nine patterns were identified, showing that community smells often precede or reinforce technical degradation in code samples. Symptoms such as “radio silence” and centralized ownership were frequently associated with persistent structural anomalies. Additionally, limited onboarding, the absence of continuous refactoring, and informal collaboration emerged as recurring conditions for smell accumulation. Conclusion: In OSSECOs, particularly within code samples, community-level dysfunctions not only correlate with but often signal maintainability decay. These findings underscore the need for socio-technical quality indicators and lightweight governance mechanisms tailored to shared instructional artifacts.
代码样本在开源软件生态系统(OSSECO)中发挥着关键作用,作为轻量级制品支持知识传递、新成员上手和框架采纳。尽管这些样本具有教学价值,但它们往往以非正式方式管理,审查极少且归属不清,这使其更容易遭受社会技术层面的退化。在这种背景下,代码异味(如过大的类、模块化差)与社区异味(如孤立贡献者、沟通碎片化)的共现及其长期相互作用变得尤为关键。虽然每一类异味都已被单独研究,但人们对社区层面的功能失调如何随时间预示或加剧代码样本中的技术异常仍知之甚少。本研究调查代码异味和社区异味如何在OSSECO维护的代码样本中出现、共现和演变。研究采用多声部文献综述方案,涵盖30篇同行评审论文和17个面向从业者的来源(2013-2024年),并通过主题综合识别与异味动态相关的反复出现的社会技术模式。研究识别出九种模式,表明社区异味往往先于或强化代码样本中的技术退化。“无线电静默”和集中式所有权等症状经常与持续存在的结构异常相关。此外,新成员引导不足、缺乏持续重构以及非正式协作是异味累积的常见条件。结论:在OSSECO中,尤其是在代码样本中,社区层面的功能失调不仅与可维护性衰退相关,而且往往是其征兆。这些发现强调需要针对共享教学制品的社会技术质量指标和轻量级治理机制。
Article 154
Title@2025-07-17 (4): SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks
Title: SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks | SWE-MERA: Ein dynamischer Benchmark für die Agentik-Bewertung großer Sprachmodelle in Software-Engineering-Aufgaben | SWE-MERA: 积极评价软件工程任务大语言模型的动态基准 2507.11059v2 |
Authors (9): Pavel Adamenko, Mikhail Ivanov, Aidar Valeev, Rodion Levichev, Pavel Zadorozhny, Ivan Lopatin, Dmitry Babayev, Alena Fenogenova, Valentin Malykh
The rapid advancement of Large Language Models (LLMs) in software engineering has revealed critical limitations in existing benchmarks, particularly the widely used SWE-bench dataset. Recent studies have uncovered severe data contamination issues: for SWE-bench, 32.67% of successful patches are reported to involve direct solution leakage, and 31.08% pass because of inadequate test cases. We introduce SWE-MERA, a dynamic, continuously updated benchmark designed to address these fundamental challenges through automated collection of real-world GitHub issues and rigorous quality validation. Our approach implements a reliable pipeline that ensures quality while minimizing contamination risks, resulting in approximately 10,000 potential tasks, with 300 samples currently available. Evaluation using the Aider coding agent demonstrates strong discriminative power among state-of-the-art models. We report performance across a dozen recent LLMs evaluated on tasks collected between September 2024 and June 2025.
软件工程大语言模型(LLMS)的快速发展揭示了现有基准,特别是广泛使用的SWE-bench数据集的重大局限性,最近的研究发现了严重的数据污染问题,例如SWE-bench报告32.67%的成功补丁涉及直接溶解渗漏,31.08%因测试案例不足而通过。我们引入了SWE-MERA,这是一个动态的、不断更新的基准,旨在通过自动收集真实世界的GitHub问题和严格的质量验证来应对这些基本挑战。我们的方法是一个可靠的管道,既能确保质量,又能尽量减少污染风险,从而产生约10,000项潜在任务,目前已有300个样本。使用Aider编码剂进行的评估表明,在最新模型中具有很强的歧视性力量。我们报告了2024年9月至2025年6月期间所收集的任务最近得到评估的十多个LMMS的绩效。
Article 155
Title@2025-07-17 (4): Detecting LLM-generated Code with Subtle Modification by Adversarial Training
Title: Detecting LLM-generated Code with Subtle Modification by Adversarial Training | LLM-generierter Code mit subtiler Änderung durch Adversarial Training erkennen | 检测通过反向培训进行精细修改的LLM生成代码 2507.13123v1 |
Authors (5): Xin Yin, Xinrui Li, Chao Ni, Xiaodan Xu, Xiaohu Yang
With the rapid development of Large Language Models (LLMs), their powerful code-generation capabilities have been widely applied in tasks like code completion and automated development, demonstrating the value of improving coding efficiency. However, the extensive use of LLM-generated code also raises several new challenges. On the one hand, issues such as the regulation of code provenance, copyright disputes, and code quality have become increasingly concerning. How to effectively detect LLM-generated code and ensure its compliant and responsible use has become a critical and urgent issue. On the other hand, in practical applications, LLM-generated code is often subject to manual modifications, such as variable renaming or structural adjustments. Although some recent studies have proposed training-based and zero-shot methods for detecting LLM-generated code, these approaches show insufficient robustness when facing modified LLM-generated code, and there is a lack of an effective solution. To address the real-world scenario where LLM-generated code may undergo minor modifications, we propose CodeGPTSensor+, an enhanced version of CodeGPTSensor, which employs adversarial training to improve robustness against input perturbations. CodeGPTSensor+ integrates an adversarial sample generation module, Multi-objective Identifier and Structure Transformation (MIST), which systematically generates both high-quality and representative adversarial samples. This module effectively enhances the model’s resistance against diverse adversarial attacks. Experimental results on the HMCorp dataset demonstrate that CodeGPTSensor+ significantly improves detection accuracy on the adversarial test set while maintaining high accuracy on the original test set, showcasing superior robustness compared to CodeGPTSensor.
随着大语言模型(LLM)的快速发展,其强大的代码生成能力已被广泛应用于代码补全和自动化开发等任务,展现了提升编码效率的价值。然而,LLM生成代码的大量使用也带来了若干新挑战。一方面,代码来源监管、版权纠纷和代码质量等问题日益令人担忧,如何有效检测LLM生成的代码并确保其合规、负责任地使用,已成为关键而紧迫的问题。另一方面,在实际应用中,LLM生成的代码经常会被人工修改,例如变量重命名或结构调整。尽管近期一些研究提出了基于训练和零样本的LLM生成代码检测方法,但这些方法在面对经过修改的LLM生成代码时鲁棒性不足,目前仍缺乏有效的解决方案。针对LLM生成代码可能经过轻微修改的现实场景,我们提出CodeGPTSensor+,它是CodeGPTSensor的增强版本,采用对抗训练来提升对输入扰动的鲁棒性。CodeGPTSensor+集成了对抗样本生成模块,即多目标标识符与结构变换(MIST),可系统地生成高质量且具代表性的对抗样本。该模块有效增强了模型对多种对抗攻击的抵抗力。在HMCorp数据集上的实验结果表明,CodeGPTSensor+在对抗测试集上显著提高了检测准确率,同时在原始测试集上保持了较高的准确率,展现出优于CodeGPTSensor的鲁棒性。
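The abstract names variable renaming as a typical manual modification that detectors must withstand. The sketch below shows one way such an identifier-level perturbation could be produced for Python code with the standard `ast` module. It is not the MIST module from the paper, whose transformations are not described here; the `RenameIdentifiers` class, the `perturb` helper, and the sample function are all hypothetical.

```python
import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Rename variables and parameters according to a fixed mapping."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Variable reads and writes are ast.Name nodes.
        node.id = self.mapping.get(node.id, node.id)
        return node

    def visit_arg(self, node):
        # Function parameters are ast.arg nodes.
        node.arg = self.mapping.get(node.arg, node.arg)
        self.generic_visit(node)  # also visit annotations, if any
        return node


def perturb(source, mapping):
    """Return source code with identifiers renamed; behavior is unchanged."""
    tree = RenameIdentifiers(mapping).visit(ast.parse(source))
    return ast.unparse(tree)  # requires Python 3.9+


# Hypothetical sample input, used only for illustration.
SAMPLE = (
    "def total(values):\n"
    "    acc = 0\n"
    "    for v in values:\n"
    "        acc += v\n"
    "    return acc\n"
)
PERTURBED = perturb(SAMPLE, {"values": "xs", "acc": "s"})
```

The perturbed code computes the same result as the original, which is the property that makes such samples useful as adversarial training inputs: the label (LLM-generated or not) should not change under the rewrite.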
Article 156
Title@2025-07-17 (4): Inferring Attributed Grammars from Parser Implementations
Title: Inferring Attributed Grammars from Parser Implementations | Zugeschriebene Grammatiken aus Parser-Implementierungen ableiten | 从剖析器执行中推断出属性语法 2507.13117v1 |
Authors (3): Andreas Pointner, Josef Pichler, Herbert Prähofer
Software systems that process structured inputs often lack complete and up-to-date specifications, which specify the input syntax and the semantics of input processing. While grammar mining techniques have focused on recovering syntactic structures, the semantics of input processing remains largely unexplored. In this work, we introduce a novel approach for inferring attributed grammars from parser implementations. Given an input grammar, our technique dynamically analyzes the implementation of recursive descent parsers to reconstruct the semantic aspects of input handling, resulting in specifications in the form of attributed grammars. By observing program executions and mapping the program’s runtime behavior to the grammar, we systematically extract and embed semantic actions into the grammar rules. This enables comprehensive specification recovery. We demonstrate the feasibility of our approach using an initial set of programs, showing that it can accurately reproduce program behavior through the generated attributed grammars.
处理结构化投入的软件系统往往缺乏完整和最新的规格,这些规格具体规定了输入语法和输入处理的语义。语法采矿技术侧重于恢复合成结构,而输入处理的语义基本上尚未探索。在这项工作中,我们引入了一种新颖的方法,从实施剖析器中推算有分辨的语法。根据输入语法,我们的技术动态地分析了反复下降的剖析器的实施情况,以重建输入处理的语义方面,从而产生了有分辨语法的规格。通过观察程序执行过程和将程序运行时间的行为与语法规则进行绘图,我们系统地提取和将语法行动嵌入语法规则中。这有利于全面规范的恢复。我们展示了使用最初一套程序的方法的可行性,表明它可以通过生成的有分辨语法来准确复制程序的行为。
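For readers unfamiliar with the target of the inference, the sketch below shows a toy recursive descent parser whose semantic actions are written out as an attributed grammar in the comments. The grammar, the `Parser` class, and the `evaluate` helper are hypothetical illustrations, not taken from the paper; they only show the kind of pairing between parser code and grammar rules that the approach aims to recover.

```python
import re

# Hypothetical attributed grammar for the parser below:
#   Expr -> Term ('+' Term)*        { Expr.val = sum of all Term.val }
#   Term -> NUMBER ('*' NUMBER)*    { Term.val = product of all NUMBER.val }


def tokenize(text):
    """Split the input into integer and operator tokens."""
    return re.findall(r"\d+|[+*]", text)


class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def expr(self):
        # Semantic action: accumulate the synthesized attribute 'val'.
        val = self.term()
        while self.peek() == "+":
            self.take()
            val += self.term()
        return val

    def term(self):
        val = int(self.take())
        while self.peek() == "*":
            self.take()
            val *= int(self.take())
        return val


def evaluate(text):
    """Parse the text and return the value attribute of the root Expr."""
    parser = Parser(tokenize(text))
    result = parser.expr()
    if parser.peek() is not None:
        raise ValueError("unexpected trailing input")
    return result
```

Each parsing method corresponds to one grammar rule, and the arithmetic inside it corresponds to that rule's semantic action, so `evaluate("2+3*4")` yields 14.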
Article 157
Title@2025-07-17 (4): A Conceptual Framework for Requirements Engineering of Pretrained-Model-Enabled Systems
Title: A Conceptual Framework for Requirements Engineering of Pretrained-Model-Enabled Systems | Ein konzeptioneller Rahmen für die Anforderungsentwicklung von vortrainierten modellgebundenen Systemen | 预先培训的、采用模式的系统工程要求概念框架 2507.13095v1 |
Authors (4): Dongming Jin, Zhi Jin, Linyu Li, Xiaohong Chen
Recent advances in large pretrained models have led to their widespread integration as core components in modern software systems. The trend is expected to continue in the foreseeable future. Unlike traditional software systems governed by deterministic logic, systems powered by pretrained models exhibit distinctive and emergent characteristics, such as ambiguous capability boundaries, context-dependent behavior, and continuous evolution. These properties fundamentally challenge long-standing assumptions in requirements engineering, including functional decomposability and behavioral predictability. This paper investigates this problem and advocates for a rethinking of existing requirements engineering methodologies. We propose a conceptual framework tailored to requirements engineering of pretrained-model-enabled software systems and outline several promising research directions within this framework. This vision helps provide a guide for researchers and practitioners to tackle the emerging challenges in requirements engineering of pretrained-model-enabled systems.
与由确定性逻辑管理的传统软件系统不同,由预先培训的模型驱动的系统具有独特和突发的特点,例如能力界限模糊、根据具体情况行事和不断演化。这些特性从根本上挑战了要求工程中的长期假设,包括功能不兼容性和行为可预测性。本文件调查了这一问题,并主张重新思考现有的要求工程方法。我们建议了一个概念框架,专门为预先培训的模型化软件系统的工程要求制定概念框架,并勾勒了这一框架内若干有希望的研究方向。这一愿景有助于为研究人员和从业人员提供指南,以应对在培训前的模型化系统的需求工程方面新出现的挑战。
Article 158
Title@2025-07-17 (4): MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks
Title: MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks | MERA-Code: Ein einheitliches Framework zur Bewertung der Codegenerierung von Aufgaben | MERA 守则:一个统一框架,用于评估不同任务制定守则的情况 2507.12284v2 |
Authors (23): Artem Chervyakov, Alexander Kharitonov, Pavel Zadorozhny, Adamenko Pavel, Rodion Levichev, Dmitrii Vorobev, Dmitrii Salikhov, Aidar Valeev, Alena Pestova, Maria Dziuba, Ilseyar Alimova, Artem Zavgorodnev, Aleksandr Medvedev, Stanislav Moiseev, Elena Bruches, Daniil Grebenkin, Roman Derunets, Vikulov Vladimir, Anton Emelyanov, Dmitrii Babaev, Vladimir V. Ivanov, Valentin Malykh, Alena Fenogenova
Advancements in LLMs have enhanced task automation in software engineering; however, current evaluations primarily focus on natural language tasks, overlooking code quality. Most benchmarks prioritize high-level reasoning over executable code and real-world performance, leaving gaps in understanding true capabilities and risks associated with these models in production. To address this issue, we propose MERA Code, a new addition to the MERA benchmark family, specifically focused on evaluating code for the latest code generation LLMs in Russian. This benchmark includes 11 evaluation tasks that span 8 programming languages. Our proposed evaluation methodology features a taxonomy that outlines the practical coding skills necessary for models to complete these tasks. The benchmark comprises an open-source codebase for users to conduct MERA assessments, a scoring system compatible with various programming environments, and a platform featuring a leaderboard and submission system. We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages. We are publicly releasing MERA to guide future research, anticipate groundbreaking features in model development, and standardize evaluation procedures.
为解决上述问题,我们建议MERA守则,这是MERA基准体系的一个新补充,特别侧重于评价俄罗斯最新代码生成LLMS的守则。这个基准包括11项评价任务,涉及8种编程语言。我们提议的评价方法包括一种分类,它概述了完成这些任务模型所需的实际编码技能。基准包括用户进行MERA评估的开放源代码库、一种与各种编程环境兼容的评分系统以及一个以领导板和提交系统为主的平台。我们评价开放LMS和前沿API模型,分析其在非英语实际编码任务方面的局限性。我们正在公开发布MERA,以指导今后的研究,预测模型开发的破碎特征,并使评价程序标准化。
Article 159
Title@2025-07-17 (4): iReDev: A Knowledge-Driven Multi-Agent Framework for Intelligent Requirements Development
Title: iReDev: A Knowledge-Driven Multi-Agent Framework for Intelligent Requirements Development | iReDev: Ein wissensgestütztes Multi-Agent-Rahmenwerk für intelligente Anforderungsentwicklung | iReDev:开发智能要求的知识开发多机构框架 2507.13081v1 |
Authors (7): Dongming Jin, Weisong Sun, Jiangping Huang, Peng Liang, Jifeng Xuan, Yang Liu, Zhi Jin
Requirements development is a critical phase as it is responsible for providing a clear understanding of what stakeholders need. It involves collaboration among stakeholders to extract explicit requirements and address potential conflicts, which is time-consuming and labor-intensive. Recently, multi-agent systems for software development have attracted much attention. However, existing research provides limited support for requirements development and overlooks the injection of human knowledge into agents and the human-agent collaboration. To address these issues, this paper proposes a knowledge-driven multi-agent framework for intelligent requirement development, named iReDev. iReDev consists of six knowledge-driven agents to support the entire requirements development. They collaboratively perform various tasks to produce a software requirements specification. iReDev focuses on integrating human knowledge for agents, enabling them to simulate real-world stakeholders. iReDev uses an event-driven communication mechanism based on an artifact pool. Agents continuously monitor the pool and autonomously trigger the next action based on its changes, enabling iReDev to handle new requirements quickly. iReDev introduces a human-in-the-loop mechanism to support human-agent collaboration, ensuring that the generated artifacts align with the expectations of stakeholders. We evaluated the generated artifacts, and the results show that iReDev outperforms existing baselines in multiple aspects. We further envision three key directions and hope this work can facilitate intelligent requirements development.
开发需求是一个关键阶段,因为它负责明确了解利益攸关方需要哪些内容。它涉及利益攸关方之间的合作,以提出明确要求并解决潜在冲突,这需要时间和劳力的密集性。最近,软件开发的多试剂系统吸引了大量注意力。然而,现有的研究为需求开发提供了有限的支持,忽视了将人类知识注入代理和人力代理协作。%为解决这些问题,本文件提议了一个知识驱动的多试剂框架,用于开发智能需求,名为 iReDev。iReDev 功能:iReDev 由六个知识驱动的代理组成,以支持整个需求开发。他们合作执行各种任务,以制定软件要求规格。iReDev 侧重于将人类知识整合到代理方,使其能够模拟真实世界利益攸关方。iReDev 使用一个以人工智能库为基础的事件驱动通信机制。代理人不断监测人才库,并自主启动基于其变化的下一步行动,使iReDev能够快速处理新的需求。iReDev 引入一个由六个知识驱动的代理机构组成的机制,以支持整个需求开发。他们合作执行各种任务,以软件要求为软件设计规范规范。iReD侧重于工作,确保所产生的关键方向与我们所生成的模型将展示了各种期望。
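The event-driven artifact pool described in the abstract can be pictured as a small publish/subscribe store. The sketch below is a minimal illustration under that assumption, not iReDev's implementation: the `ArtifactPool` class, the two agent functions, and the artifact names are hypothetical, and the string formatting stands in for what would be LLM calls in a real system.

```python
from collections import defaultdict


class ArtifactPool:
    """Shared store that notifies subscribed agents when an artifact changes."""

    def __init__(self):
        self.artifacts = {}
        self.subscribers = defaultdict(list)

    def subscribe(self, kind, handler):
        self.subscribers[kind].append(handler)

    def publish(self, kind, content):
        # Store the artifact, then trigger every agent watching this kind.
        self.artifacts[kind] = content
        for handler in list(self.subscribers[kind]):
            handler(self, content)


def analyst_agent(pool, notes):
    # Placeholder for an LLM call that turns notes into requirements.
    pool.publish("requirements", [f"The system shall {note}" for note in notes])


def writer_agent(pool, requirements):
    # Placeholder for an LLM call that drafts the specification.
    lines = [f"R{i}. {req}" for i, req in enumerate(requirements, 1)]
    pool.publish("srs", "\n".join(lines))


pool = ArtifactPool()
pool.subscribe("interview_notes", analyst_agent)
pool.subscribe("requirements", writer_agent)
pool.publish("interview_notes", ["export reports as CSV", "support single sign-on"])
```

Publishing new interview notes is enough to regenerate the downstream artifacts, which is one way to read the claim that agents react autonomously to changes in the pool.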
Article 160
Title@2025-07-17 (4): Write Your Own CodeChecker: An Automated Test-Driven Checker Development Approach with LLMs
Title: Write Your Own CodeChecker: An Automated Test-Driven Checker Development Approach with LLMs | Schreiben Sie Ihren eigenen CodeChecker: Ein automatisierter Test-Driven Checker-Entwicklungsansatz mit LLMs | 使用 LLMS 写入您的自定义代码检查器: 自动测试驱动检查开发方法 2411.06796v3 |
Authors (6): Jun Liu, Yuanyuan Xie, Jiwei Yan, Jinhao Huang, Jun Yan, Jian Zhang
With the rising demand for code quality assurance, developers are not only utilizing existing static code checkers but also seeking custom checkers to satisfy their specific needs. Nowadays, various code-checking frameworks provide extensive checker customization interfaces to meet this need. However, both the abstract checking logic and the complex API usage of large-scale checker frameworks make this task challenging. To this end, automated code checker generation is anticipated to ease the burden of checker development. In this paper, we propose AutoChecker, an innovative LLM-powered approach that can write code checkers automatically based on only a rule description and a test suite. To achieve comprehensive checking logic, AutoChecker incrementally updates the checker’s logic by focusing on solving one selected case each time. To obtain precise API knowledge, during each iteration, it leverages fine-grained logic-guided API-context retrieval, where it first decomposes the checking logic into a series of sub-operations and then retrieves checker-related API-contexts for each sub-operation. For evaluation, we apply AutoChecker, five baselines, and three ablation methods using multiple LLMs to generate checkers for 20 randomly selected PMD rules. Experimental results show that AutoChecker significantly outperforms others across all effectiveness metrics, with an average test pass rate of 82.28%. Additionally, the checkers generated by AutoChecker can be successfully applied to real-world projects, matching the performance of official checkers.
随着对代码质量保证的需求不断增加,开发者不仅正在利用现有静态代码检查器,而且还在寻找自定义检查器以满足其具体需求。如今,各种代码检查框架为满足这一需求提供了广泛的检查器定制界面。然而,抽象的检查逻辑和大型检查框架复杂的API使用使这项任务具有挑战性。为此,预计自动代码检查器生成将减轻检查器开发的负担。在本文件中,我们提议了Auto checker,这是一种创新的LLM动力方法,可以仅根据规则描述和测试套件自动写入代码检查器。为了实现全面检查逻辑,AutoCrecker通过每次解决一个选定案件,逐步更新检查器逻辑。为了获得精确的API知识,每次循环中,它利用精细的逻辑引导API-文文本检索,首先将检查逻辑引入一系列子操作,然后为每个子操作操作的检查器,然后检索与检查器相关的 AIPI-文文本。在评估中,我们应用Auto checker、5个实际基线和3个ALIBLI 测试结果,然后用多个测试方法对多个LMRBER 进行多次测试。
Article 161
Title@2025-07-17 (4): Investigating the Performance of Small Language Models in Detecting Test Smells in Manual Test Cases
Title: Investigating the Performance of Small Language Models in Detecting Test Smells in Manual Test Cases | Untersuchung der Leistungsfähigkeit kleiner Sprachmodelle bei der Erkennung von Testriechen in manuellen Testfällen | 调查小语言模型在人工试验案件中检测测试嗅觉方面的性能 2507.13035v1 |
Authors (6): Keila Lucas, Rohit Gheyi, Márcio Ribeiro, Fabio Palomba, Luana Martins, Elvys Soares
Manual testing, in which testers follow natural language instructions to validate system behavior, remains crucial for uncovering issues not easily captured by automation. However, these test cases often suffer from test smells, that is, quality issues such as ambiguity, redundancy, or missing checks that reduce test reliability and maintainability. While detection tools exist, they typically require manual rule definition and lack scalability. This study investigates the potential of Small Language Models (SLMs) for automatically detecting test smells. We evaluate Gemma3, Llama3.2, and Phi-4 on 143 real-world Ubuntu test cases, covering seven types of test smells. Phi-4 achieved the best results, reaching a pass@2 of 97% in detecting sentences with test smells, while Gemma3 and Llama3.2 reached approximately 91%. Beyond detection, SLMs autonomously explained issues and suggested improvements, even without explicit prompt instructions. They enabled low-cost, concept-driven identification of diverse test smells without relying on extensive rule definitions or syntactic analysis. These findings highlight the potential of SLMs as efficient tools that preserve data privacy and can improve test quality in real-world scenarios.
人工测试,让测试者遵循自然语言指示来验证系统行为,对于发现自动化不易发现的问题仍然至关重要;然而,这些测试案例往往存在测试气味、质量问题,如模糊性、冗余性或缺少检查,从而降低测试可靠性和可维持性;虽然检测工具存在,但通常需要人工规则定义,且缺乏可缩放性;这项研究调查了小语言模型(SLMs)自动检测测试气味的潜力;我们评估了Gemma3、Llama3.2和Phi-4的143个真实世界Ubuntu测试案例,涉及7种测试气味;Phi-4取得了最佳结果,在用测试气味检测句中达到97%的通行证@97%,Gemma3和Llama3.2达到约91%;除了检测之外,SLMs自主解释问题并提出改进建议,即使没有明确指示;这些研究还有助于在不依赖广泛规则定义或合成分析的情况下低成本、概念驱动地识别不同测试气味。这些研究结果突出表明了SLSLSDs作为维护数据隐私的有效工具的潜力,并能改善真实世界情景的测试质量。
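The abstract reports results as pass@2 without stating how the metric is computed. The sketch below uses the unbiased pass@k estimator that is common in code-generation evaluation, 1 - C(n-c, k) / C(n, k) for n samples of which c are correct. Whether the study uses this exact estimator is an assumption, and the function name is hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k samples drawn from n is correct.

    n: number of generated answers per item
    c: number of those answers that are correct
    k: number of answers the user is allowed to inspect
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect answers: every draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With four answers of which one is correct, `pass_at_k(4, 1, 2)` gives 0.5; averaging this value over all test sentences would give a dataset-level pass@2.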
Article 162
Title@2025-07-17 (4): Risks of ignoring uncertainty propagation in AI-augmented security pipelines
Title: Risks of ignoring uncertainty propagation in AI-augmented security pipelines | Risiken der Ignorierung der Unsicherheitsausbreitung in KI-gesteigerten Sicherheitspipelines | 忽视在AI强化安全管道中传播不确定性的风险 2407.14540v2 |
Authors (4): Emanuele Mezzi, Aurora Papotti, Fabio Massacci, Katja Tuma
The use of AI technologies is being integrated into the secure development of software-based systems, with an increasing trend of composing AI-based subsystems (with uncertain levels of performance) into automated pipelines. This presents a fundamental research challenge and seriously threatens safety-critical domains. Despite the existing knowledge about uncertainty in risk analysis, no previous work has estimated the uncertainty of AI-augmented systems given the propagation of errors in the pipeline. We provide the formal underpinnings for capturing uncertainty propagation, develop a simulator to quantify uncertainty, and evaluate the simulation of propagating errors with one case study. We discuss the generalizability of our approach and its limitations and present recommendations for evaluation policies concerning AI systems. Future work includes extending the approach by relaxing the remaining assumptions and by experimenting with a real system.
使用AI技术正在被纳入软件系统的安全开发,将AI基子系统(性能水平不确定)纳入自动化输油管的趋势日益明显,这是一个根本性的研究挑战,严重威胁到安全临界领域。尽管目前对风险分析的不确定性有了解,但以前的工作没有考虑到输油管中错误的传播而对AI强化系统的不确定性作出估计。我们为获取不确定性传播提供了正式的基础,开发了一个模拟器,以量化不确定性,并用一个案例研究对传播错误的模拟进行评估。我们讨论了我们的方法的可概括性及其局限性,并就AI系统的评价政策提出建议。未来的工作包括通过放松其余的假设和试验一个真正的系统来扩展这一方法。
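To make the idea of uncertainty propagation concrete, the sketch below runs a small Monte Carlo simulation over a pipeline of AI stages whose success rates are themselves uncertain. It is not the simulator developed in the paper: the Beta-distributed stage rates, the independence between stages, and the `simulate_pipeline` function are assumptions chosen for illustration.

```python
import random


def simulate_pipeline(stages, runs=10_000, seed=0):
    """Estimate the end-to-end success rate of a pipeline of uncertain stages.

    stages: list of (alpha, beta) pairs; each stage's success rate is drawn
            from Beta(alpha, beta), and stages are assumed independent.
    Returns (mean, lower, upper) with a central 95% interval.
    """
    rng = random.Random(seed)
    outcomes = []
    for _ in range(runs):
        rate = 1.0
        for alpha, beta in stages:
            # A failure at any stage propagates, so rates multiply.
            rate *= rng.betavariate(alpha, beta)
        outcomes.append(rate)
    outcomes.sort()
    mean = sum(outcomes) / runs
    lower = outcomes[int(0.025 * runs)]
    upper = outcomes[int(0.975 * runs) - 1]
    return mean, lower, upper


# Hypothetical two-stage pipeline: a detector around 80% and a fixer around 70%.
MEAN, LOWER, UPPER = simulate_pipeline([(80, 20), (70, 30)])
```

Under these assumptions the expected end-to-end rate is close to 0.8 × 0.7 = 0.56, and the interval is wider than that of either stage alone, which is the effect that evaluating each component in isolation would miss.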
Article 163
Title@2025-07-17 (4): ReCode: Updating Code API Knowledge with Reinforcement Learning
Title: ReCode: Updating Code API Knowledge with Reinforcement Learning | ReCode: Aktualisierung von Code-API-Kenntnissen mit Verstärkungslernen | ReCode:更新法规API知识与强化学习 2506.20495v2 |
Authors (5): Haoze Wu, Yunzhi Yao, Wenhao Yu, Huajun Chen, Ningyu Zhang
Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their training data, even with access to current documentation, impedes reliable code generation in dynamic environments. To tackle this issue, we propose ReCode (rule-based Reinforcement learning for Code Update), a novel framework that mimics human programmer adaptation to API changes. Specifically, we construct a dataset of approximately 2,000 data entries to train the LLMs to perform version migration based on updated information. Then, we introduce a modified string similarity metric for code evaluation as the reward for reinforcement learning. Our experiments demonstrate that ReCode substantially boosts LLMs’ code generation performance in dynamic API scenarios, especially on the unseen CodeUpdateArena task. Crucially, compared to supervised fine-tuning, ReCode has less impact on LLMs’ general code generation abilities. We apply ReCode on various LLMs and reinforcement learning algorithms (GRPO and DAPO), all achieving consistent improvements. Notably, after training, Qwen2.5-Coder-7B outperforms the 32B-parameter code instruction-tuned model and the reasoning model with the same architecture. Code is available at https://github.com/zjunlp/ReCode.
大语言模型(LLM)展现出卓越的代码生成能力,但在适应外部库API的频繁更新时表现不佳。这一关键局限源于模型依赖训练数据中过时的API知识,即使能够访问最新文档也是如此,从而阻碍了动态环境中可靠的代码生成。为解决这一问题,我们提出ReCode(面向代码更新的基于规则的强化学习),这是一个模仿人类程序员适应API变化的新框架。具体而言,我们构建了一个约2,000条数据的数据集,用于训练LLM根据更新后的信息执行版本迁移。随后,我们引入一种改进的字符串相似度指标用于代码评估,并将其作为强化学习的奖励。实验表明,ReCode显著提升了LLM在动态API场景下的代码生成性能,尤其是在未见过的CodeUpdateArena任务上。重要的是,与监督微调相比,ReCode对LLM通用代码生成能力的影响更小。我们将ReCode应用于多种LLM和强化学习算法(GRPO和DAPO),均取得了一致的改进。值得注意的是,经过训练后,Qwen2.5-Coder-7B的表现超过了32B参数的代码指令微调模型以及相同架构的推理模型。代码见 https://github.com/zjunlp/ReCode。
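A string-similarity reward of the kind mentioned in the abstract can be sketched with Python's standard `difflib`. The abstract does not specify the paper's modified metric, so the whitespace normalization and the `similarity_reward` function below are assumptions, and the API calls in the example strings are hypothetical.

```python
import difflib


def normalize(code):
    """Collapse whitespace so that formatting differences are not penalized."""
    lines = (" ".join(line.split()) for line in code.strip().splitlines())
    return "\n".join(line for line in lines if line)


def similarity_reward(generated, reference):
    """Return a reward in [0, 1]; 1.0 means the normalized strings are identical."""
    matcher = difflib.SequenceMatcher(None, normalize(generated), normalize(reference))
    return matcher.ratio()


# Hypothetical migration target and two candidate generations.
REFERENCE = "resp = client.get(url, timeout=5)"
UPDATED = "resp  =  client.get(url, timeout=5)\n"
OUTDATED = "resp = client.fetch(url)"
```

A generation that already uses the updated call scores 1.0 after normalization, while one that still uses the outdated call scores lower, giving the reinforcement learner a graded signal rather than a pass/fail outcome.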
Article 164
Title@2025-07-17 (4): The Case for Contextual Copyleft: Licensing Open Source Training Data and Generative AI
Title: The Case for Contextual Copyleft: Licensing Open Source Training Data and Generative AI | Der Fall für Contextual Copyleft: Lizenzierung von Open Source Trainingsdaten und Generative KI | 上下文翻转:为开放源码培训数据发放许可证的案例 2507.12713v1 |
Authors (5): Grant Shanklin, Emmie Hine, Claudio Novelli, Tyler Schroder, Luciano Floridi
The proliferation of generative AI systems has created new challenges for the Free and Open Source Software (FOSS) community, particularly regarding how traditional copyleft principles should apply when open source code is used to train AI models. This article introduces the Contextual Copyleft AI (CCAI) license, a novel licensing mechanism that extends copyleft requirements from training data to the resulting generative AI models. The CCAI license offers significant advantages, including enhanced developer control, incentivization of open source AI development, and mitigation of openwashing practices. This is demonstrated through a structured three-part evaluation framework that examines (1) legal feasibility under current copyright law, (2) policy justification comparing traditional software and AI contexts, and (3) synthesis of cross-contextual benefits and risks. However, the increased risk profile of open source AI, particularly the potential for direct misuse, necessitates complementary regulatory approaches to achieve an appropriate risk-benefit balance. The paper concludes that when implemented within a robust regulatory environment focused on responsible AI usage, the CCAI license provides a viable mechanism for preserving and adapting core FOSS principles to the evolving landscape of generative AI development.
突现型AI系统的扩散给自由和开放源码软件(FOSS)社区带来了新的挑战,特别是在使用开放源码培训AI模式时,传统抄录左派原则应如何适用方面,本条介绍了背景翻录式AI(CCAI)许可证,这是一个将培训数据复制要求扩展至由此产生的基因化AI模式的新发许可证机制;CACI许可证具有重大优势,包括加强开发者控制、鼓励开发开放源码AI和减少露天洗涤做法,这通过一个结构化的三部分评价框架得到证明,该框架审查:(1) 现行版权法下的法律可行性;(2) 将传统软件与AI环境进行比较的政策理由;(3) 综合交叉文本的好处和风险;然而,由于开放源的AI风险简介增加,特别是直接滥用的可能性增加,有必要采取补充性监管办法,以实现适当的风险-利益平衡;文件的结论是,如果在一个以负责任的AI使用为重点的稳健的监管环境内实施,CAPI许可证提供了一种可行的机制,用于维护和调整核心自由和开放源码软件原则,以适应正在演变的AI型发展的格局。
Article 165
Title@2025-07-17 (4): CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance
Title: CodeAssistBench (CAB): Dataset & Benchmarking for Multi-turn Chat-Based Code Assistance | CodeAssistBench (CAB): Datensatz & Benchmarking für Multiturn-Chat-basierte Code-Unterstützung | 代码协助站(CAB):多功能聊天代码援助的数据集和基准 2507.10646v2 |
Authors (5): Myeongsoo Kim, Shweta Garg, Baishakhi Ray, Varun Kumar, Anoop Deoras
Programming assistants powered by large language models have transformed software development, yet most benchmarks focus narrowly on code generation tasks. Recent efforts like InfiBench and StackEval attempt to address this gap using Stack Overflow data but remain limited to single-turn interactions in isolated contexts, require significant manual curation, and fail to represent complete project environments. We introduce CodeAssistBench (CAB), the first benchmark framework for evaluating multi-turn programming assistance in realistic settings that address real-world questions about actual codebases. Unlike existing programming Q&A benchmarks, CAB automatically generates scalable datasets from question-related GitHub issues using configurable parameters (e.g., repository creation date, star count, programming languages), and includes automatic containerization of codebases for evaluation. It then evaluates models through simulated users in these containerized environments with full codebase access. Using this framework, we constructed a test set of 3,286 real-world programming questions across 231 repositories, spanning seven programming languages and diverse problem domains. Our evaluation of leading LLMs reveals a substantial capability gap: while models perform well on Stack Overflow questions with success rates of 70-83%, they resolve only up to 16.49% of CAB’s recent issues. This discrepancy highlights the challenges of providing assistance in complex, project-specific contexts versus answering standalone questions.
由大语言模型驱动的编程助手已经改变了软件开发,但大多数基准仍局限于代码生成任务。InfiBench和StackEval等近期工作尝试利用Stack Overflow数据弥补这一差距,但仍局限于孤立情境下的单轮交互,需要大量人工整理,且无法体现完整的项目环境。我们提出CodeAssistBench(CAB),这是首个在真实场景中评估多轮编程辅助的基准框架,处理针对实际代码库的真实问题。与现有的编程问答基准不同,CAB使用可配置参数(例如仓库创建日期、星标数、编程语言)从与问题相关的GitHub issue中自动生成可扩展的数据集,并自动对代码库进行容器化以供评估。随后,它在这些可完整访问代码库的容器化环境中通过模拟用户来评估模型。利用该框架,我们构建了一个包含3,286个真实编程问题的测试集,涉及231个仓库,涵盖七种编程语言和多种问题领域。我们对主流LLM的评估揭示了显著的能力差距:模型在Stack Overflow问题上表现良好,成功率为70-83%,但对CAB的近期问题最多只能解决16.49%。这一差异凸显了在复杂的、项目特定的情境中提供帮助与回答独立问题相比所面临的挑战。
Article 166
Title@2025-07-17 (4): GUI Test Migration via Abstraction and Concretization
Title: GUI Test Migration via Abstraction and Concretization | GUI-Test-Migration über Abstraktion und Konkretisierung | GUI 通过抽象和简明化测试移民 2409.05028v2 |
Authors (7): Yakun Zhang, Chen Liu, Xiaofei Xie, Yun Lin, Jin Song Dong, Dan Hao, Lu Zhang
GUI test migration aims to produce test cases with events and assertions to test specific functionalities of a target app. Existing migration approaches typically focus on the widget-mapping paradigm that maps widgets from source apps to target apps. However, since different apps may implement the same functionality in different ways, direct mapping may result in incomplete or buggy test cases, thus significantly impacting the effectiveness of testing target functionality and the practical applicability of migration approaches. In this paper, we propose a new migration paradigm (i.e., the abstraction-concretization paradigm) that first abstracts the test logic for the target functionality and then utilizes this logic to generate the concrete GUI test case. Furthermore, we introduce MACdroid, the first approach that migrates GUI test cases based on this paradigm. Specifically, we propose an abstraction technique that utilizes source test cases from source apps targeting the same functionality to extract a general test logic for that functionality. Then, we propose a concretization technique that utilizes the general test logic to guide an LLM in generating the corresponding GUI test case (including events and assertions) for the target app. We evaluate MACdroid on two widely-used datasets (including 31 apps, 34 functionalities, and 123 test cases). On the FrUITeR dataset, the test cases generated by MACdroid successfully test 64% of the target functionalities, improving the baselines by 191%. On the Lin dataset, MACdroid successfully tests 75% of the target functionalities, outperforming the baselines by 42%. These results underscore the effectiveness of MACdroid in GUI test migration.
GUI 测试迁移的目的是通过测试事件来测试案例,测试目标应用程序的具体功能。 现有的迁移方法通常侧重于从源应用程序到目标应用程序的部件映射模式。 但是,由于不同的应用程序可能以不同的方式执行相同的功能,直接映射可能导致测试案例不完全或错误,从而极大地影响测试目标功能的有效性和迁移方法的实际适用性。 在本文件中,我们提出了一种新的迁移模式(即抽象混凝土模式),首先将目标功能的测试逻辑摘要用于强调测试逻辑,然后利用这一逻辑来生成具体 GUI 测试案例。此外,我们引入了MACdroid,这是根据这个模式迁移图形测试案例的第一个方法。具体地说,我们提出了一种抽象技术,利用源应用程序的测试案例,针对同一功能的功能和迁移方法的实际适用性测试逻辑。然后,我们提出了一种解剖化技术,利用一般测试逻辑来指导LLMUMU生成相应的 GUI测试案例(包括事件和声明),然后利用这个逻辑来生成具体的 GUILME 测试案例。我们用MAC 的3 测试模型测试了两个数据测试模型,通过测试模型测试模型,这些测试了基数测试了基数,这些测试了基数的基数,这些基数的基数。
Article 167
Title@2025-07-17 (4): AI Safety in the Eyes of the Downstream Developer: A First Look at Concerns, Practices, and Challenges
Title: AI Safety in the Eyes of the Downstream Developer: A First Look at Concerns, Practices, and Challenges | KI-Sicherheit in den Augen des Downstream-Entwicklers: Ein erster Blick auf Bedenken, Praktiken und Herausforderungen | AI 下游开发者眼中的安全:首先审视关注、做法和挑战 2503.19444v3 |
Authors (6): Haoyu Gao, Mansooreh Zahedi, Wenxin Jiang, Hong Yi Lin, James Davis, Christoph Treude
Pre-trained models (PTMs) have become a cornerstone of AI-based software, allowing for rapid integration and development with minimal training overhead. However, their adoption also introduces unique safety challenges, such as data leakage and biased outputs, that demand rigorous handling by downstream developers. While previous research has proposed taxonomies of AI safety concerns and various mitigation strategies, how downstream developers address these issues remains unexplored. This study investigates downstream developers’ concerns, practices and perceived challenges regarding AI safety issues during AI-based software development. To achieve this, we conducted a mixed-method study, including interviews with 18 participants, a survey of 86 practitioners, and an analysis of 874 AI incidents from the AI Incident Database. Our results reveal that while developers generally demonstrate strong awareness of AI safety concerns, their practices, especially during the preparation and PTM selection phases, are often inadequate. The lack of concrete guidelines and policies leads to significant variability in the comprehensiveness of their safety approaches throughout the development lifecycle, with additional challenges such as poor documentation and knowledge gaps, further impeding effective implementation. Based on our findings, we offer suggestions for PTM developers, AI-based software developers, researchers, and policy makers to enhance the integration of AI safety measures.
预先培训的模型已成为AI软件的基石,允许快速整合和开发,尽量减少培训管理费用;然而,采用这些模型还带来了独特的安全挑战,如数据泄漏和偏差产出等,需要下游开发商严格处理。虽然以前的研究提出了AI安全问题分类和各种缓解战略,但下游开发商如何解决这些问题仍未探讨。本研究报告调查了下游开发商在AI软件开发过程中对AI安全问题的关切、做法和所察觉的挑战。为此,我们开展了一项混合方法研究,包括与18名参与者的访谈、对86名从业人员的调查以及AI事件数据库对874起AI事件的分析。我们的结果显示,虽然开发商一般都对AI安全问题有强烈的认识,但他们的做法,特别是在准备和PTM选择阶段,往往不够充分。缺乏具体的指导方针和政策导致他们在整个发展生命周期内安全方法的全面性存在极大的差异,例如文件不全和知识差距,进一步阻碍有效执行。我们根据调查结果,向IPTM开发商、AI软件开发商、研究人员和决策者提出建议,以加强AI的安全措施。
Article 168
Title@2025-07-17 (4): When Domains Collide: An Activity Theory Exploration of Cross-Disciplinary Collaboration
Title: When Domains Collide: An Activity Theory Exploration of Cross-Disciplinary Collaboration | When Domains Collide: Eine Aktivitätstheorie zur Erforschung der disziplinübergreifenden Zusammenarbeit | 当域碰撞:跨纪律协作活动理论探索时 2506.20063v2 |
Authors (6): Zixuan Feng, Thomas Zimmermann, Lorenzo Pisani, Christopher Gooley, Jeremiah Wander, Anita Sarma
Background: Software development teams are increasingly diverse, embedded, and cross-disciplinary. Domain experts (DEs) from different disciplines collaborate with professional software developers (SDEs), bringing complementary expertise in creating and maintaining complex production software. However, contested expectations, divergent problem-solving perspectives, and conflicting priorities lead to friction. Aims: This study aims to investigate the dynamics of emerging collaboration of cross-disciplinary software development (CDSD) by exploring the expectations held by DEs and SDEs and understanding how these frictions manifest in practice. Method: We utilize Activity Theory (AT), a well-established socio-technical framework, as an analytical lens in a grounded, empirical investigation, conducted through a mixed-method study involving 24 interviews (12 DEs and 12 SDEs) and a large-scale validation survey with 293 participants (161 DEs and 132 SDEs). Results: We conceptualize and empirically ground the CDSD dynamics. We identified eight expectations held by SDEs and six by DEs. By mapping these expectations to AT components, we revealed 21 frictions in CDSD and illustrated where and how they arise. Conclusions: This study offers a theoretical lens for understanding the dynamics and frictions in CDSD and provides actionable insights for future research, practitioners, and infrastructure design.
软件开发团队日益多样化、嵌入和跨学科。来自不同学科的专家与专业软件开发者(SDEs)合作,在创建和维护复杂的生产软件方面提供互补的专门知识。然而,有争议的期望、不同的解决问题观点和相互冲突的优先事项导致摩擦。目的:本研究的目的是通过探索DEs和SDEs持有的期望并了解这些摩擦在实践中如何表现来调查跨学科软件开发(CDSD)新兴协作的动态,并了解这些摩擦的实际表现。方法:我们利用活动理论(AT)这个成熟的社会技术框架,作为基础、经验性调查的分析透镜,通过由24次访谈(12个DEs和12个SDEs)进行的混合方法研究以及293名参与者(161个DEs和132个SDEs)进行的大规模验证调查,进行。结果:我们从概念上和从经验上确定了CDSD动态的8项期望和DEs所持有的6项期望。我们通过向AT组成部分绘制这些期望图,揭示了CDSD的21项摩擦,并说明了它们在何处和如何产生的。结论:本研究为CDSD的未来研究、可理解的理论视角,为CDSD设计中的动态和设计提供了可理解。