• 00 05-29 (4) How to Elicit Explainability Requirements? A Comparison of Interviews, Focus Groups, and Surveys Wie zu Elicit Erklärbarkeit Anforderungen? Ein Vergleich von Interviews, Fokusgruppen und Umfragen 如何制定明确的解释要求?访谈、焦点小组和调查的比较 2505.23684v1
  • 01 05-29 Quantum-Based Software Engineering Quantenbasierte Software-Engineering 基于量子的软件工程 2505.23674v1
  • 02 05-29 GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents GSO: Herausfordernde Software-Optimierungsaufgaben zur Bewertung von SWE-Agenten GSO:评估SWE-Agentics的有挑战的软件优化任务 2505.23671v1
  • 03 05-29 Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering Satori-SWE: Evolutionäre Test-Zeit-Skalierung für probeneffiziente Software-Engineering Satori-SWE:样本高效软件工程的进化测试-时间尺度 2505.23604v1
  • 04 05-29 LLM Performance for Code Generation on Noisy Tasks LLM-Performance für Code-Generierung bei lauten Aufgaben LLM 噪音任务代码生成的LLM性能 2505.23598v1
  • 05 05-29 LLM-based Property-based Test Generation for Guardrailing Cyber-Physical Systems LLM-basierte property-based Test Generation for Guardrailing Cyber-Physical Systems 以LLM为基础的保护网络-物理系统基于财产的 2505.23549v1
  • 06 05-29 The CASE Framework – A New Architecture for Participatory Research and Digital Health Surveillance Der CASE Framework - Eine neue Architektur für partizipative Forschung und digitale Gesundheitsüberwachung CASE框架 – – 参与性研究和数字健康监测的新架构 2505.23516v1
  • 07 05-29 Identity resolution of software metadata using Large Language Models Identitätsauflösung von Software-Metadaten mit großen Sprachmodellen 使用大语言模式的软件元数据的识别分辨率 2505.23500v1
  • 08 05-29 Synthesizing Performance Constraints for Evaluating and Improving Code Efficiency Synthese von Leistungsbeschränkungen zur Bewertung und Verbesserung der Code-Effizienz 综合评估和提高《守则》效率的绩效制约因素 2505.23471v1
  • 09 05-29 What About Emotions? Guiding Fine-Grained Emotion Extraction from Mobile App Reviews Was ist mit Emotionen? Guiding Fine-Grained Emotion Extraction aus Mobile App Bewertungen 情感呢?指导从移动应用程序评论中抽取精美情感的导师 2505.23452v1
  • 10 05-29 From Knowledge to Noise: CTIM-Rover and the Pitfalls of Episodic Memory in Software Engineering Agents Vom Wissen zum Lärm: CTIM-Rover und die Pitfalls des episodischen Gedächtnisses in Software Engineering Agents 从知识到噪音:CTIM-Rover和软件工程代理器中电离内存的空洞 2505.23422v1
  • 11 05-29 SWE-bench Goes Live! SWE-Bench geht live! SWE -BECHE GOES 现场直播! 2505.23419v1
  • 12 05-29 Toward Effective AI Governance: A Review of Principles Auf dem Weg zu einer effektiven KI-Governance: Eine Überprüfung der Grundsätze 实现有效的独立大赦国际治理:原则审查 2505.23417v1
  • 13 05-29 BugRepro: Enhancing Android Bug Reproduction with Domain-Specific Knowledge Integration BugRepro: Verbesserung der Android Bug Reproduction mit Domain-spezifischer Wissensintegration Bugrepro: 利用特定域知识集成增强Android虫复制 2505.14528v2
  • 14 05-29 Afterburner: Reinforcement Learning Facilitates Self-Improving Code Efficiency Optimization Nachbrenner: Verstärktes Lernen erleichtert selbstverbessernde Code-Effizienz-Optimierung 事后焚烧:强化学习促进自我改进法规效率优化 2505.23387v1
  • 15 05-29 Personality-Guided Code Generation Using Large Language Models Personalitätsgeführte Code-Generierung mit großen Sprachmodellen 使用大语言模式的 个人 使用大语言模式的 人 性 指导 代码 生成 2411.00006v2
  • 16 05-29 OSS-UAgent: An Agent-based Usability Evaluation Framework for Open Source Software OSS-UAgent: Ein Agent-basiertes Usability Evaluation Framework für Open Source Software OSS-UUA代理:基于代理的开放源码软件使用性评价框架 2505.23239v1
  • 17 05-29 Artemis: Toward Accurate Detection of Server-Side Request Forgeries through LLM-Assisted Inter-Procedural Path-Sensitive Taint Analysis Artemis: Auf dem Weg zur genauen Erkennung von Server-Side Request Forgeries durch LLM-Assisted Inter-Procedural Path-Sensitive Taint Analysis 人工制品:通过LLM协助的跨程序间路由感知性图解分析,力求准确探测服务器-Side请求的伪造情况 2502.21026v3
  • 18 05-29 Two Is Better Than One: Rotations Scale LoRAs Zwei ist besser als eins: Rotationsskala LoRAs 二比一好:轮作规模LORAs 2505.23184v1
  • 19 05-29 An open-source Modular Online Psychophysics Platform (MOPP) Eine Open-Source-Plattform für modulare Online-Psychophysik (MOPP) 开放源码模块在线心理物理学平台(MOPP) 2505.23137v1
  • 20 05-29 VERINA: Benchmarking Verifiable Code Generation VERINA: Benchmarking der überprüfbaren Code-Generierung VERINA:可核实代码生成基准 2505.23135v1
  • 21 05-29 Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specification Inference Kann LLMs Grund über Programm Semantik? Eine umfassende Bewertung von LLMs auf formale Spezifikation Inferenz CLLMs 方案语义学理由:全面评价关于正式具体推断的LLMs 2503.04779v4
  • 22 05-29 DINGO: Constrained Inference for Diffusion LLMs DINGO: Beschränkte Schlussfolgerung für Diffusion LLMs DINGO: 扩散长效LMM的连续推论 2505.23061v1
  • 23 05-29 HACMony: Automatically Detecting Hopping-related Audio-stream Conflict Issues on HarmonyOS HACMony: Automatische Erkennung von Hopping-bezogenen Audio-Stream-Konflikten auf HarmonyOS HACMonny:自动检测与Happing有关的和谐OS音频流冲突问题 2504.07472v2
  • 24 05-29 Chain of Grounded Objectives: Bridging Process and Goal-oriented Prompting for Code Generation Kette der geerdeten Ziele: Überbrückungsprozess und zielorientiertes Prompting für die Codegenerierung 基本目标链链:搭桥进程和以目标为导向的促进代码生成 2501.13978v2
  • 25 05-29 Structural Abstraction and Selective Refinement for Formal Verification Strukturelle Abstraktion und selektive Verfeinerung für formale Verifizierung 正式核查的结构性抽象和选择性改进 2505.22982v1
  • 26 05-29 CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance CodeSteer: Symbolisch-Augmentierte Sprachmodelle über Code/Text Anleitung 代码器:通过编码/文本指导的代码/文本指导的代码器:代号辅助语言模式 2502.04350v2
  • 27 05-29 BYOS: Knowledge-driven Large Language Models Bring Your Own Operating System More Excellent BYOS: Wissensgetriebene große Sprachmodelle bringen Ihr eigenes Betriebssystem hervorragender BYOS: 知识驱动的大型语言模式使自己的操作系统更加出色 2503.09663v2
  • 28 05-28 (3) Unlocking Mental Health: Exploring College Students’ Well-being through Smartphone Behaviors Entsperren der psychischen Gesundheit: Erforschen des Wohlbefindens der Studenten durch Smartphone-Verhalten 解锁心理健康:通过智能手机行为探索大学生福祉 2502.08766v2
  • 29 05-28 Evolution analysis of software quality metrics in an open-source java project: A case study on TestNG Evolutionsanalyse von Software-Qualitätsmetriken in einem Open-Source-Java-Projekt: Eine Fallstudie zu TestNG 开放源码 Java项目软件质量衡量标准演变分析:测试NG案例研究 2505.22884v1
  • 30 05-28 Visualizing Cloud-native Applications with KubeDiagrams Cloud-native Anwendungen mit KubeDiagrammen visualisieren 带有KubeDiagrams 的可视化云源应用 2505.22879v1
  • 31 05-28 RocqStar: Leveraging Similarity-driven Retrieval and Agentic Systems for Rocq generation RocqStar: Leveraging-ähnliche Retrieval- und Agentiksysteme für die Rocq-Generation RocqStar:利用利用相似度驱动回收系统和干系统来生成Rocq 2505.22846v1
  • 32 05-28 A Tool for Generating Exceptional Behavior Tests With Large Language Models Ein Tool zur Generierung außergewöhnlicher Verhaltenstests mit großen Sprachmodellen 生成使用大语言模式的特殊行为测试工具 2505.22818v1
  • 33 05-28 What Needs Attention? Prioritizing Drivers of Developers’ Trust and Adoption of Generative AI Was braucht Aufmerksamkeit? Priorisieren von Treibern des Entwicklervertrauens und der Annahme Generativer KI 需要注意什么?将开发者信任的驱动因素列为优先事项,并采用创新的AI 2505.17418v2
  • 34 05-28 LabUtopia: High-Fidelity Simulation and Hierarchical Benchmark for Scientific Embodied Agents LabUtopia: High-Fidelity-Simulation und hierarchischer Benchmark für wissenschaftliche körpereigene Wirkstoffe LabUtopia:科学渗透剂的高纤维模拟和等级基准 2505.22634v1
  • 35 05-28 Smart Contracts for SMEs and Large Companies Intelligente Verträge für KMU und Großunternehmen 中小企业和大公司的智能合同 2505.22619v1
  • 36 05-28 BPMN to Smart Contract by Business Analyst BPMN auf Smart Contract von Business Analyst 商业分析员将BPMN改为智能合同 2505.22612v1
  • 37 05-28 GitGoodBench: A Novel Benchmark For Evaluating Agentic Performance On Git GitGoodBench: Ein neuartiger Benchmark für die Bewertung Agentischer Performance auf Git GitGoodbunch:评估基特生物表现的新基准 2505.22583v1
  • 38 05-28 LAMBDA: A Large Model Based Data Agent LAMBDA: Ein großer modellbasierter Datenagent LAMBDA:一个大型模型数据代理 2407.17535v3
  • 39 05-28 Advancing Expert Specialization for Better MoE Advancing Experten-Spezialisierung für bessere MoE 推进专家专业专业促进改善教育部 2505.22323v1
  • 40 05-28 Evolution of repositories and privacy laws: commit activities in the GDPR and CCPA era Entwicklung von Repositorys und Datenschutzgesetzen: Aktivitäten in der DSGVO und CCPA-Ära verpflichten 保管库和隐私法的演变演变:在GDPR和CCPA时代开展活动 2505.22234v1
  • 41 05-28 Thermal Modeling and Optimal Allocation of Avionics Safety-critical Tasks on Heterogeneous MPSoCs Thermische Modellierung und optimale Allokation von Avionik Sicherheitskritische Aufgaben auf heterogenen MPSoCs 热建模和最佳分配航空气象安全关键任务 2505.22214v1
  • 42 05-28 Towards Conversational Development Environments: Using Theory-of-Mind and Multi-Agent Architectures for Requirements Refinement Hin zu konversatorischen Entwicklungsumgebungen: Verwendung von Theorie-von-Mind- und Multi-Agent-Architekturen für Anforderungen Verfeinerung 走向对话型发展环境:利用理论和多机构架构改进要求 2505.20973v2
  • 43 05-28 Towards Practical Defect-Focused Automated Code Review Auf dem Weg zu einer praktischen fehlerorientierten automatisierten Code-Überprüfung 走向实际失效-受污染的自动编码审查 2505.17928v2
  • 44 05-28 SVA-ICL: Improving LLM-based Software Vulnerability Assessment via In-Context Learning and Information Fusion SVA-ICL: Verbesserung der LLM-basierten Software Vulnerability Assessment durch In-Context Learning und Information Fusion SVA-ICL:通过文内学习和信息融合改进基于LLM的软件脆弱性评估 2505.10008v2
  • 45 05-28 Jailbreak Distillation: Renewable Safety Benchmarking Jailbreak Destillation: Benchmarking für erneuerbare Sicherheit 蒸馏:可再生能源安全基准 2505.22037v1
  • 46 05-28 Securing the Software Package Supply Chain for Critical Systems Sicherung der Softwarepaket-Lieferkette für kritische Systeme 保障关键系统软件包供应链 2505.22023v1
  • 47 05-28 How Do Experts Make Sense of Integrated Process Models? Wie verstehen Experten integrierte Prozessmodelle? 专家如何看待综合进程模式? 2505.20667v2
  • 48 05-28 System-driven Cloud Architecture Design Support with Structured State Management and Guided Decision Assistance Systemgesteuerte Cloud-Architektur-Design-Unterstützung mit strukturiertem Staatsmanagement und beratender Entscheidungshilfe 提供结构化国家管理和指导决策援助的系统驱动云层结构设计支持 2505.20701v2
  • 49 05-28 Larger Is Not Always Better: Exploring Small Open-source Language Models in Logging Statement Generation Größere ist nicht immer besser: Erforschen von kleinen Open-Source-Sprachenmodellen bei der Erstellung von Protokollierungsanweisungen 大并非总是更好:探索记录报表生成中的小型开放源语言模式 2505.16590v2
  • 50 05-28 Co-Saving: Resource Aware Multi-Agent Collaboration for Software Development Co-Saving: Ressourcenschonende Multi-Agenten-Kollaboration für Software-Entwicklung 共同节省:为开发软件进行有意识的资源、多机构协作 2505.21898v1
  • 51 05-27 (2) Augmenting Software Bills of Materials with Software Vulnerability Description: A Preliminary Study on GitHub Augmenting Software Bills of Materials with Software Vulnerability Beschreibung: Eine Vorstudie zu GitHub 增加具有软件脆弱性说明的软件材料账单:关于GitHub的初步研究 2503.13998v2
  • 52 05-27 Leveraging XP and CRISP-DM for Agile Data Science Projects Nutzung von XP und CRISP-DM für agile Data Science Projekte 利用XP和CRISP-DM为敏感数据科学项目发挥杠杆作用 2505.21603v1
  • 53 05-27 JITScope: Interactive Visualization of JIT Compiler IR Transformations JITScope: Interaktive Visualisierung von JIT Compiler IR-Transformationen JIT编辑器 IR 转换的交互式视觉化 2505.21599v1
  • 54 05-27 GUARD:Dual-Agent based Backdoor Defense on Chain-of-Thought in Neural Code Generation GUARD:Dual-Agent-basierte Backdoor-Verteidigung auf Ketten-of-Thought in Neural Code Generation GUARD: 在神经代码生成过程中寻求的连锁研究中,基于 “ 以企业为基地 “ 的后门防御 2505.21425v1
  • 55 05-27 A first look at ROS~2 applications written in asynchronous Rust Ein erster Blick auf ROS~2 Anwendungen geschrieben in asynchronen Rust 首先看一看ROS~2的申请,这些申请是以非同步鲁斯特书写的。 2505.21323v1
  • 56 05-27 Computational Reproducibility of R Code Supplements on OSF Berechnung der Reproduzierbarkeit von R-Code-Ergänzungen auf OSF OSF的R代码补编的计算可复制性 2505.21590v1
  • 57 05-27 ColorGo: Directed Concolic Execution ColorGo: Direkte konkolische Ausführung 颜色 Go : 指向排列执行 2505.21130v1
  • 58 05-27 CXXCrafter: An LLM-Based Agent for Automated C/C++ Open Source Software Building CXXCrafter: Ein LLM-basierter Agent für automatisiertes C/C++ Open Source Software Building CXXCFFF: 一个基于LLM的自动 C/C++ 开放源码软件大楼LLM代理 2505.21069v1
  • 59 05-27 Thinking Before Running! Efficient Code Generation with Thorough Exploration and Optimal Refinement Vor dem Laufen denken! Effiziente Codegenerierung mit gründlicher Exploration und optimaler Verfeinerung 在运行前思考! 高效的代码生成, 彻底探索和优化精炼 2502.17442v2
  • 60 05-27 Optimizing Case-Based Reasoning System for Functional Test Script Generation with Large Language Models Optimierung des Case-Based-Reasoning-Systems für die Generierung funktionaler Testskripte mit großen Sprachmodellen 为具有大语言模型的功能测试脚本生成优化基于个案的理由说明系统 2503.20576v3
  • 61 05-27 RepoMaster: Autonomous Exploration and Understanding of GitHub Repositories for Complex Task Solving RepoMaster: Autonome Exploration und Verständnis von GitHub-Lagerstätten für komplexe Aufgabenlösung RepoMaster:为复杂任务解决而自主探索和了解GitHub储存库 2505.21577v1
  • 62 05-27 An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks Ein LLM-as-Judge Metric zur Überwindung der Lücke mit menschlicher Bewertung in SE-Aufgaben 消除社会经济任务中与人的评价差距的法学硕士法官 2505.20854v1
  • 63 05-27 Why do Machine Learning Notebooks Crash? An Empirical Study on Public Python Jupyter Notebooks Warum zerfallen Machine-Learning-Notebooks? Eine empirische Studie über öffentliche Python-Jupyter-Notebooks 为什么机器学习笔记本崩溃? 2411.16795v3
  • 64 05-27 Can Agents Fix Agent Issues? Können Agenten Probleme mit Agenten beheben? 特工能解决代理问题吗? 2505.20749v1
  • 65 05-27 Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey Verbesserung von Code LLMs mit Verstärkungslernen in der Codegenerierung: Eine Umfrage 增强法典制定中强化学习的加强守则LLMS 代码生成:调查 2412.20367v3
  • 66 05-27 SV-TrustEval-C: Evaluating Structure and Semantic Reasoning in Large Language Models for Source Code Vulnerability Analysis SV-TrustEval-C: Bewertung von Struktur und semantischer Vernunft in großen Sprachmodellen für die Analyse von Quellencode-Anfälligkeiten SV-信任值-C:在源码脆弱性分析大语言模型中评估结构和语义理由 2505.20630v1
  • 67 05-26 (1) Smart Contract Vulnerabilities, Tools, and Benchmarks: An Updated Systematic Literature Review Smart Contract Vulnerabilitys, Tools und Benchmarks: Ein aktualisierter systematischer Literaturbericht 智能合同脆弱性、工具和基准:更新的系统文献审查 2412.01719v2
  • 68 05-26 Large Language Models for IT Automation Tasks: Are We There Yet? Große Sprachmodelle für IT-Automatisierungsaufgaben: Sind wir noch da? 信息技术自动化任务大语言模型:我们是否还存在? 2505.20505v1
  • 69 05-26 Modeling and Analysis of the Landing Gear System with the Generalized Contracts Modellierung und Analyse des Landing Gear Systems mit den Generalized Contracts 通用合同着陆器系统的建模和分析 2111.10426v3
  • 70 05-26 SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents SWE-Rebench: Eine automatisierte Pipeline für die Task Collection und die dekontaminierte Evaluation von Software Engineering Agents SWE-rebench:软件工程剂任务收集和除污评价自动管道 2505.20411v1
  • 71 05-26 GPUMC: A Stateless Model Checker for GPU Weak Memory Concurrency GPUMC: Ein staatenloser Modellprüfer für GPU-Schwachspeicherkonkurrenz GPUMC: GPU 弱内存调制货币的无国籍模式检查器 2505.20207v1
  • 72 05-26 Evaluating Large Language Models for Code Review Bewertung großer Sprachmodelle für die Code-Überprüfung 评价用于守则审查的大语言模式 2505.20206v1
  • 73 05-26 Exposing Go’s Hidden Bugs: A Novel Concolic Framework Aufdecken der versteckten Bugs von Go: Ein neuartiges konkolisches Rahmenwerk 展露 Go 隐藏的臭虫: 新分类框架 2505.20183v1
  • 74 05-26 An Empirical Study on Strong-Weak Model Collaboration for Repo-level Code Generation Eine empirische Studie zur stark schwachen Modellkooperation für die Codegenerierung auf Repo-Ebene 关于回收层代码生成的 “ 强弱 “ 示范协作经验研究 2505.20182v1
  • 75 05-26 Evaluating Software Plagiarism Detection in the Age of AI: Automated Obfuscation and Lessons for Academic Integrity Bewertung von Software Plagiaterkennung im Zeitalter der KI: Automatisierte Verschleierung und Lehren für akademische Integrität 评价AI时代软件高射率检测:学术廉正方面的自动读写和教益 2505.20158v1
  • 76 05-26 The CodeInverter Suite: Control-Flow and Data-Mapping Augmented Binary Decompilation with LLMs Die CodeInverter Suite: Control-Flow und Data-Mapping Augmented Binary Decompilation mit LLMs 代码输入器套件:控制-光和数据-制表增强的二进制解析与LLMS 2503.07215v2
  • 77 05-26 StructEval: Benchmarking LLMs’ Capabilities to Generate Structural Outputs StructEval: Benchmarking der Kapazitäten von LLM zur Erzeugung struktureller Outputs DructEval:将LLMs的能力与产生结构性产出挂钩 2505.20139v1
  • 78 05-26 Engineering Trustworthy Machine-Learning Operations with Zero-Knowledge Proofs Engineering Vertrauenswürdige Maschinen-Learning-Operationen mit Null-Wissens-Proofs 具有零知识证明的工程可信赖的机械学习操作 2505.20136v1
  • 79 05-26 Grammars of Formal Uncertainty: When to Trust LLMs in Automated Reasoning Tasks Grammatik der formalen Unsicherheit: Wann man LLMs bei automatisierten Aufgaben zur Begründung vertraut 正式不确定性的语法:在自动说明理由任务中何时信任LLMs 2505.20047v1
  • 80 05-26 A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron? Eine Umfrage über die Sicherheitsbedrohungen von Computer-Verwendern: JARVIS oder Ultron? JARVIS还是ULTRON? 调查计算机用户的安全和安保威胁:JARVIS还是ULTRON? 2505.10924v2
  • 81 05-26 Ontology- and LLM-based Data Harmonization for Federated Learning in Healthcare Ontologie- und LLM-basierte Datenharmonisierung für das Federated Learning in Healthcare 以本体学和LLM为基础的保健方面联邦学习数据统一 2505.20020v1
  • 82 05-26 Requirements Coverage-Guided Minimization for Natural Language Test Cases Anforderungen Abdeckungsgeführte Minimierung für natürliche Sprachtests 以涵盖范围为指导的尽量减少自然语言测试案件 2505.20004v1
  • 83 05-26 The Invisible Hand: Unveiling Provider Bias in Large Language Models for Code Generation Die unsichtbare Hand: Enthüllen von Provider-Bias in großen Sprachmodellen für die Codegenerierung 无形手:守则生成大语言模式中的 “ 无形手 “ : “ 不可忽视的提供者 “ 。 2501.07849v2
  • 84 05-26 Systems of Twinned Systems: A Systematic Literature Review Systeme von Zwillingssystemen: Ein Systematischer Literaturbericht 结对系统系统系统:系统文献审查 2505.19916v1
  • 85 05-26 Deconstructing Obfuscation: A four-dimensional framework for evaluating Large Language Models assembly code deobfuscation capabilities Dekonstruieren von Obfuscation: Ein vierdimensionaler Rahmen für die Auswertung von Großsprachenmodellen Assembly Code Deobfuscation Fähigkeiten 解构腐蚀:四维框架,用于评价大语言模型组装编码脱腐能力 2505.19887v1
  • 86 05-26 SecVulEval: Benchmarking LLMs for Real-World C/C++ Vulnerability Detection SecVulEval: Benchmarking LLMs für real-World C/C++ Sicherheitserkennung SecVulEval:确定真实世界C/C+++脆弱性检测LLMs基准 2505.19828v1
  • 87 05-26 A Python workflow definition for computational materials design Eine Python-Workflow-Definition für die Berechnung von Materialien 计算材料设计中的 Python 工作流程定义 2505.20366v1
  • 88 05-26 CIDRe: A Reference-Free Multi-Aspect Criterion for Code Comment Quality Measurement CIDRe: Ein referenzfreies Multi-Aspekt-Kriterium für die Qualitätsmessung von Code Comment CIDRe: 守则评论质量衡量的无参考性、无参考性、多特征的多标准标准 2505.19757v1
  • 89 05-26 RDFGraphGen: An RDF Graph Generator based on SHACL Shapes RDFGraphGen: Ein RDF Graph Generator auf Basis von SHACL Shapes RDFGraphGen:基于 SHACL 形状的 RDF 图形生成器 2407.17941v2
  • 90 05-26 SETBVE: Quality-Diversity Driven Exploration of Software Boundary Behaviors SETBVE: Qualität-Diversität treibt die Erforschung von Software-Grenzverhalten an SETVE: 软件边界行为的质量-多样性驱动探索 2505.19736v1
  • 91 05-26 Large Language Models in Code Co-generation for Safe Autonomous Vehicles Große Sprachmodelle in der Kogeneration Code für sichere autonome Fahrzeuge 安全自治车辆代码共同生成大语言模式 2505.19658v1
  • 92 05-26 Software Engineering for Self-Adaptive Robotics: A Research Agenda Software-Engineering für selbstadaptive Robotik: Eine Forschungsagenda 自我适应机器人学软件工程:研究议程 2505.19629v1
  • 93 05-26 Search-Based Software Engineering in the Landscape of AI Foundation Models Search-Based Software Engineering in der Landschaft der AI-Stiftung Modelle AI基金会模型景观中的搜索软件工程 2505.19625v1
  • 94 05-26 LEGO-Compiler: Enhancing Neural Compilation Through Translation Composability LEGO-Compiler: Neurale Kompilierung durch Übersetzungskompatibilität verbessern LEGO-Compuper:通过翻译集成加强神经汇编 2505.20356v1
  • 95 05-26 CODE-DITING: A Reasoning-Based Metric for Functional Alignment in Code Evaluation CODE-DITING: Ein auf Vernunft basierendes Metric für die funktionelle Ausrichtung in der Code-Evaluation 代码化:守则评价中功能一致性的基于理由的计量标准 2505.19502v1
  • 96 05-26 Benchmarking and Enhancing LLM Agents in Localizing Linux Kernel Bugs Benchmarking und Verbesserung von LLM-Agenten bei der Lokalisierung von Linux-Kernel-Fehlern 确定和加强Linux内核虫本地化的Linux Kernel 虫的基准和加强LLM代理物 2505.19489v1
  • 97 05-26 Regulating Algorithmic Management: A Multi-Stakeholder Study of Challenges in Aligning Software and the Law for Workplace Scheduling Regulierung des algorithmischen Managements: Eine Multi-Stakeholder-Studie über Herausforderungen bei der Ausrichtung von Software und dem Gesetz für die Arbeitsplanung 规范工资管理:多方利益攸关方研究软件和工作场所时间安排法在调整软件和工作场所时间安排法方面面临的挑战 2505.02329v2
  • 98 05-26 Vibe Coding vs. Agentic Coding: Fundamentals and Practical Implications of Agentic AI Vibe Coding vs. Agentic Coding: Grundlagen und praktische Implikationen von Agentic AI Vibe 编码与 Agentic 编码:Agent AI 的基本要素和实际影响 2505.19443v1
  • 99 05-26 Simple and Effective Baselines for Code Summarisation Evaluation Einfache und effektive Grundlagen für die Code-Summarisation-Bewertung 用于代码摘要评价的简单有效基线 2505.19392v1
  • 100 05-25 (7) Architectures of Error: A Philosophical Inquiry into AI and Human Code Generation Architekturen des Irrtums: Eine philosophische Untersuchung der KI- und menschlichen Code-Generation 错误结构结构:对大赦国际和人类代码生成的哲学调查 2505.19353v1
  • 101 05-25 Retrieval-Augmented Generation for Service Discovery: Chunking Strategies and Benchmarking Retrieval-Augmented Generation for Service Discovery: Chunking Strategien und Benchmarking 服务发现回收-启动型服务生成:启动战略和基准制定 2505.19310v1
  • 102 05-25 VerifyThisBench: Generating Code, Specifications, and Proofs All at Once VerifyThisBench: Code, Spezifikationen und Beweise auf einmal generieren 校验时间: 生成代码、规格和证明 2505.19271v1
  • 103 05-25 CLEVER: A Curated Benchmark for Formally Verified Code Generation CLEVER: Ein kuratierter Benchmark für die formal verifizierte Codegenerierung 正式核实的代码生成基准 2505.13938v3
  • 104 05-25 An Empirical Study of Vulnerability Handling Times in CPython Eine empirische Studie über die Zeiten des Umgangs mit Gefährlichkeit in CPython CPython 脆弱性处理时间经验研究 2411.00447v2
  • 105 05-25 An Initial Exploration of Fine-tuning Small Language Models for Smart Contract Reentrancy Vulnerability Detection Eine erste Erkundung von Feinsteuerungs-Kleinsprachenmodellen für intelligente Vertragsrepentrancy Sicherheitserkennung 初步探索智能合同留置率易变性探测智能合同微调小型语言模型 2505.19059v1
  • 106 05-25 AIGCodeSet: A New Annotated Dataset for AI Generated Code Detection AIGCodeSet: Ein neuer kommentierter Datensatz für KI Generated Code Detection AIGCodeSet:AI 生成代码探测新附加说明数据集 2412.16594v3
  • 107 05-25 On-Demand Scenario Generation for Testing Automated Driving Systems On-Demand-Szenario-Generierung für die Prüfung automatisierter Fahrsysteme 自动驾驶系统测试的 “ 现场需求 “ 情景生成 2505.14053v2
  • 108 05-25 Automated Trustworthiness Oracle Generation for Machine Learning Text Classifiers Automatisierte Vertrauenswürdigkeit Oracle Generation für Machine Learning Text Klassifikatoren 机械学习文字分类的自动可信赖性甲骨文生成 2410.22663v4
  • 109 05-25 Co-PatcheR: Collaborative Software Patching with Component(s)-specific Small Reasoning Models Co-PatcheR: Kollaborative Software-Patching mit Komponenten-spezifischen Small-Reasoning-Modellen 共同配给R:与特定组成部分的小型理由模型合作的软件补补补 2505.18955v1
  • 110 05-24 (6) From Output to Evaluation: Does Raw Instruction-Tuned Code LLMs Output Suffice for Fill-in-the-Middle Code Generation? Von der Ausgabe bis zur Auswertung: Reicht die Ausgabe von LLMs mit rohem Instruktionscode für die Generierung von Fill-in-the-Middle Code aus? 从输出到评价:原始指令-指令代码LLMs 输出足量是否用于中代代号的填充? 2505.18789v1
  • 111 05-24 ARMS: A Vision for Actor Reputation Metric Systems in the Open-Source Software Supply Chain ARMS: Vision für Actor Reputation Metric Systems in der Open Source Software Supply Chain ARMS:开放源码软件供应链中行为名声计量系统展望 2505.18760v1
  • 112 05-24 AutoP2C: An LLM-Based Agent Framework for Code Repository Generation from Multimodal Content in Academic Papers AutoP2C: Ein LLM-basiertes Agent-Framework für die Code-Repository-Generierung aus multimodalen Inhalten in wissenschaftlichen Papieren 自动P2C: 学术论文中多种形式内容的法规存储器生成基于LLM的LLM代理框架 2504.20115v2
  • 113 05-24 Fixing 7,400 Bugs for 1$: Cheap Crash-Site Program Repair Beheben von 7.400 Fehlern für 1$: Günstige Crash-Site-Programm-Reparatur 为1美元固定7 400个臭虫:低廉的撞车-点火方案维修 2505.13103v2
  • 114 05-24 SEW: Self-Evolving Agentic Workflows for Automated Code Generation SEW: Selbst-evolvierende Agentische Workflows für die automatisierte Codegenerierung SEW:自动代码生成的自演动态制剂工作流程 2505.18646v1
  • 115 05-24 ACECODER: Acing Coder RL via Automated Test-Case Synthesis ACECODER: Acing Coder RL über automatisierte Test-Case-Synthese 通过自动测试-案件综合合成检索编码器 RL 2502.01718v4
  • 116 05-24 On the Structure and Semantics of Identifier Names Containing Closed Syntactic Category Words Über die Struktur und Semantik von Identifier-Namen, die geschlossene syntaktische Kategorie Wörter enthalten 关于含有闭合同步词类的标识名称的结构和语义 2505.18444v1

Article 0

Title@2025-05-29 (4): How to Elicit Explainability Requirements? A Comparison of Interviews, Focus Groups, and Surveys

Title: How to Elicit Explainability Requirements? A Comparison of Interviews, Focus Groups, and Surveys Wie zu Elicit Erklärbarkeit Anforderungen? Ein Vergleich von Interviews, Fokusgruppen und Umfragen 如何制定明确的解释要求?访谈、焦点小组和调查的比较 2505.23684v1

Authors: Martin Obaidi, Jakob Droste, Hannah Deters, Marc Herrmann, Raymond Ochsner, Jil Klünder, Kurt Schneider

As software systems grow increasingly complex, explainability has become a crucial non-functional requirement for transparency, user trust, and regulatory compliance. Eliciting explainability requirements is challenging, as different methods capture varying levels of detail and structure. This study examines the efficiency and effectiveness of three commonly used elicitation methods - focus groups, interviews, and online surveys - while also assessing the role of taxonomy usage in structuring and improving the elicitation process. We conducted a case study at a large German IT consulting company, using web-based personnel management software. A total of two focus groups, 18 interviews, and an online survey with 188 participants were analyzed. The results show that interviews were the most efficient, capturing the highest number of distinct needs per participant per unit of time spent. Surveys collected the most explanation needs overall but had high redundancy. Delayed taxonomy introduction resulted in a greater number and diversity of needs, suggesting that a two-phase approach is beneficial. Based on our findings, we recommend a hybrid approach combining surveys and interviews to balance efficiency and coverage. Future research should explore how automation can support elicitation and how taxonomies can be better integrated into different methods.

由于软件系统日益复杂,解释性已成为对透明度、用户信任和监管合规的关键非功能性要求,解释性要求具有挑战性,因为不同方法可以捕捉不同程度的细节和结构。本研究报告审查了三种常用的引人方法(焦点小组、访谈和在线调查)的效率和效力,同时还评估了分类学使用在结构和改进引人进程中的作用。我们利用网上人事管理软件,在一家大型德国信息技术咨询公司进行了案例研究。共分析了两个焦点小组,即18次访谈和188名参与者的在线调查。结果显示,访谈效率最高,每个参与者每次花费的时间都有最多的不同需求。调查收集了大多数解释需求,但有大量的冗余度。推迟采用分类学带来了更多和更多的需求,表明分两个阶段的方法是有益的。我们建议采用混合方法,将调查和访谈结合起来,以平衡效率和覆盖面。未来研究应探索自动化如何支持征求,如何更好地将纳税人纳入不同方法。


Article 1

Title@2025-05-29 (4): Quantum-Based Software Engineering

Title: Quantum-Based Software Engineering Quantenbasierte Software-Engineering 基于量子的软件工程 2505.23674v1

Authors: Jianjun Zhao

Quantum computing has demonstrated potential for solving computationally intensive problems more efficiently than classical methods. Many software engineering tasks, such as test case selection, static analysis, code clone detection, and defect prediction, involve complex optimization, search, or classification, making them candidates for quantum enhancement. In this paper, we propose Quantum-Based Software Engineering (QBSE), a potential research direction for applying quantum computing to classical software engineering problems. We outline its scope, clarify its distinction from quantum software engineering (QSE), and identify key problem types that may benefit from quantum optimization, search, and learning techniques. We also summarize existing research efforts that remain fragmented. Finally, we sketch a preliminary research agenda that may help guide the future development of QBSE as a structured and meaningful direction within software engineering.

量子计算显示出了比古典方法更高效地解决计算密集问题的潜力。许多软件工程任务,如测试案例选择、静态分析、代码克隆检测和缺陷预测,涉及复杂的优化、搜索或分类,使它们成为量子增强的候选。在本文件中,我们提议了量子软件工程(QBSE),这是将量子计算应用于古典软件工程问题的潜在研究方向。我们概述了其范围,澄清了与量子软件工程(QSE)的区别,并确定了可能受益于量子优化、搜索和学习技术的关键问题类型。我们还总结了仍然支离破碎的现有研究工作。最后,我们勾画了一个初步研究议程,它可能有助于指导QBSE的未来发展,作为软件工程中的一个结构化和有意义的方向。


Article 2

Title@2025-05-29 (4): GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents

Title: GSO: Challenging Software Optimization Tasks for Evaluating SWE-Agents GSO: Herausfordernde Software-Optimierungsaufgaben zur Bewertung von SWE-Agenten GSO:评估SWE-Agentics的有挑战的软件优化任务 2505.23671v1

Authors: Manish Shetty, Naman Jain, Jinjian Liu, Vijay Kethanaboyina, Koushik Sen, Ion Stoica

Developing high-performance software is a complex task that requires specialized expertise. We introduce GSO, a benchmark for evaluating language models’ capabilities in developing high-performance software. We develop an automated pipeline that generates and executes performance tests over repository commit histories, identifying 102 challenging optimization tasks across 10 codebases that span diverse domains and programming languages. An agent is given a codebase and a performance test as a precise specification and is tasked with improving runtime efficiency, which is measured against the expert developer’s optimization. Our quantitative evaluation reveals that leading SWE-Agents struggle significantly, achieving a success rate below 5%, with limited improvement even under inference-time scaling. Our qualitative analysis identifies key failure modes, including difficulties with low-level languages, reliance on lazy optimization strategies, and challenges in accurately localizing bottlenecks. We release the code and artifacts of our benchmark along with agent trajectories to enable future research.

开发高性能软件是一项复杂的任务,需要专门知识。我们引入了GSO,这是评价语言模型开发高性能软件能力的基准。我们开发了一个自动管道,生成和执行绩效测试,以分析存储库,承诺历史查明10个代码库的102项挑战性优化任务,涵盖不同的领域和编程语言。向代理商提供了一个代码库和性能测试,作为精确的规格,并负责提高运行时间效率,以专家开发师的优化为衡量标准。我们的定量评估显示,领先的SWE-Agency 进行了巨大的斗争,取得了不到5%的成功率,即便在推论时间上也有有限的改进。我们的质量分析确定了关键的失败模式,包括使用低度语言的困难、采用懒惰性优化战略,以及在准确定位瓶颈方面存在的挑战。我们发布了基准的代码和工艺以及代理轨迹,以利今后的研究。
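
The abstract above says an agent's patch is judged by runtime efficiency relative to the expert developer's optimization, but it does not give the exact scoring rule. The following is a minimal sketch of one way such a comparison could be computed. The helper names (`measure`, `speedup`, `closes_gap`) and the 0.95 threshold are illustrative assumptions, not GSO's actual metric.

```python
import timeit


def measure(fn, repeat=5):
    """Best-of-`repeat` wall-clock time of a zero-argument callable, in seconds."""
    return min(timeit.repeat(fn, number=1, repeat=repeat))


def speedup(baseline_s, optimized_s):
    """How many times faster the optimized version is than the baseline."""
    return baseline_s / optimized_s


def closes_gap(baseline_s, agent_s, expert_s, threshold=0.95):
    """True if the agent reaches at least `threshold` of the expert's speedup.

    The threshold is a hypothetical value chosen for illustration only.
    """
    return speedup(baseline_s, agent_s) >= threshold * speedup(baseline_s, expert_s)


# Toy stand-ins for "code before the commit" and "code after optimization".
def slow_sum(n=200_000):
    total = 0
    for i in range(n):
        total += i
    return total


def fast_sum(n=200_000):
    return n * (n - 1) // 2


if __name__ == "__main__":
    base = measure(slow_sum)
    fast = measure(fast_sum)
    print(f"measured speedup: {speedup(base, fast):.1f}x")
```

Taking the minimum over several runs is a common way to reduce timing noise. A real benchmark harness would also need to check that the optimized code still produces correct results.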


Article 3

Title@2025-05-29 (4): Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering

Title: Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering Satori-SWE: Evolutionäre Test-Zeit-Skalierung für probeneffiziente Software-Engineering Satori-SWE:样本高效软件工程的进化测试-时间尺度 2505.23604v1

Authors: Guangtao Zeng, Maohao Shen, Delin Chen, Zhenting Qi, Subhro Das, Dan Gutfreund, David Cox, Gregory Wornell, Wei Lu, Zhang-Wei Hong, Chuang Gan

Language models (LMs) perform well on standardized coding benchmarks but struggle with real-world software engineering tasks such as resolving GitHub issues in SWE-Bench, especially for models with fewer than 100B parameters. While smaller models are preferable in practice due to their lower computational cost, improving their performance remains challenging. Existing approaches primarily rely on supervised fine-tuning (SFT) with high-quality data, which is expensive to curate at scale. An alternative is test-time scaling: generating multiple outputs, scoring them using a verifier, and selecting the best one. Although effective, this strategy often requires excessive sampling and costly scoring, limiting its practical application. We propose Evolutionary Test-Time Scaling (EvoScale), a sample-efficient method that treats generation as an evolutionary process. By iteratively refining outputs via selection and mutation, EvoScale shifts the output distribution toward higher-scoring regions, reducing the number of samples needed to find correct solutions. To reduce the overhead of repeated sampling and selection, we train the model to self-evolve using reinforcement learning (RL). Rather than relying on external verifiers at inference time, the model learns to self-improve the scores of its own generations across iterations. Evaluated on SWE-Bench-Verified, EvoScale enables our 32B model, Satori-SWE-32B, to match or exceed the performance of models with over 100B parameters while using only a few samples. Code, data, and models will be fully open-sourced.

语言模型(LM)在标准化编码基准上表现良好,但在真实的软件工程任务(例如解决SWE-Bench中的GitHub问题)上表现不佳,尤其是在模型参数少于100B的情况下。虽然较小的模型因计算成本较低而在实践中更可取,但提升其性能仍具挑战性。现有方法主要依赖使用高质量数据的监督微调(SFT),而大规模整理这类数据成本高昂。另一种途径是测试时扩展:生成多个输出,用验证器评分,并选出最佳输出。这一策略虽然有效,但往往需要过多的采样和昂贵的评分,限制了其实际应用。我们提出进化式测试时扩展(EvoScale),这是一种将生成视为进化过程的样本高效方法。通过选择和变异迭代地改进输出,EvoScale使输出分布向高分区域移动,从而减少找到正确解所需的样本数。为降低反复采样和选择带来的开销,我们用强化学习(RL)训练模型进行自我进化。模型在推理时不依赖外部验证器,而是学会在迭代中自行提高其生成结果的得分。在SWE-Bench-Verified上的评估表明,EvoScale使我们的32B模型Satori-SWE-32B仅用少量样本即可达到或超过参数量超过100B的模型的性能。代码、数据和模型将完全开源。
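
To make the selection-and-mutation idea in the abstract above concrete, here is a minimal sketch on a toy problem. It is not the authors' implementation: `evolve`, `toy_score`, and `toy_mutate` are hypothetical names, the explicit scoring function stands in for a verifier, and the RL-trained self-evolution described in the abstract is not modelled at all.

```python
import random

def evolve(initial, score, mutate, iterations=5, keep=2, children=3, seed=0):
    """Toy selection + mutation loop (illustrative assumption, not EvoScale itself)."""
    rng = random.Random(seed)
    population = list(initial)
    for _ in range(iterations):
        # Selection: keep the highest-scoring candidates as parents.
        parents = sorted(population, key=score, reverse=True)[:keep]
        # Mutation: each parent yields a few refined variants.
        offspring = [mutate(p, rng) for p in parents for _ in range(children)]
        # Parents survive, so the best score in the population never decreases.
        population = parents + offspring
    return max(population, key=score)

# Toy stand-ins: candidates are integers and the "correct patch" is 42.
def toy_score(x):
    return -abs(x - 42)

def toy_mutate(x, rng):
    return x + rng.randint(-5, 5)

start = [0, 10, 90]
best = evolve(start, toy_score, toy_mutate)
```

Because the parents are carried over each round, the returned candidate is never worse than the best initial sample; how quickly it improves depends entirely on the assumed mutation operator.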


Article 4

Title@2025-05-29 (4): LLM Performance for Code Generation on Noisy Tasks

Title: LLM Performance for Code Generation on Noisy Tasks LLM-Performance für Code-Generierung bei lauten Aufgaben LLM 噪音任务代码生成的LLM性能 2505.23598v1

Authors: Radzim Sendyka, Christian Cabrera, Andrei Paleyes, Diana Robinson, Neil Lawrence

This paper investigates the ability of large language models (LLMs) to recognise and solve tasks which have been obfuscated beyond recognition. Focusing on competitive programming and benchmark tasks (LeetCode and MATH), we compare performance across multiple models and obfuscation methods, such as noise and redaction. We demonstrate that all evaluated LLMs can solve tasks obfuscated to a level where the text would be unintelligible to human readers and no longer contains key pieces of instruction or context. We introduce the concept of eager pattern matching to describe this behaviour, which is not observed in tasks published after the models’ knowledge cutoff date, indicating strong memorisation or overfitting to training data, rather than legitimate reasoning about the presented problem. We report empirical evidence of distinct performance decay patterns between contaminated and unseen datasets. We discuss the implications for benchmarking and evaluations of model behaviour, arguing for caution when designing experiments using standard datasets. We also propose measuring the decay of performance under obfuscation as a possible strategy for detecting dataset contamination and highlighting potential safety risks and interpretability issues for automated software systems.

本文调查了大型语言模型(LLMS)认识和解决超出认知范围的任务的能力。我们把注意力集中在竞争性编程和基准任务(LeetCode和MATH)上,比较了多种模型和模糊方法(例如噪音和编辑)的性能。我们证明,所有经过评价的LLMS都能够解决被模糊的任务,使其达到对人类读者不易理解的程度,而没有包含关键的指示或背景。我们引入了渴望模式匹配的概念来描述这种行为,在模型知识截止日期之后公布的任务中没有观察到这种行为,表明高度的记忆化或过度适应培训数据,而不是对所提出的问题进行合理的推理。我们报告了被污染的数据集和不可见的数据集之间不同性能衰减模式的经验证据。我们讨论了在使用标准数据集设计实验时对基准和行为评价的影响,我们主张谨慎。我们还提议测量模糊状态下性能的衰败,作为发现数据污染和突出潜在安全风险以及自动化软件系统可解释性问题的可能战略。
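
The two obfuscation families named in the abstract above (noise and redaction) and the idea of measuring performance decay can be sketched as follows. This is only an illustration under assumed definitions: `add_noise`, `redact`, and `decay` are hypothetical helpers, and the paper's actual obfuscation procedures and metrics may differ.

```python
import random
import string

def add_noise(text, rate, seed=0):
    """Replace each character with a random lowercase letter with probability `rate`."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() < rate:
            out.append(rng.choice(string.ascii_lowercase))
        else:
            out.append(ch)
    return "".join(out)

def redact(text, rate, seed=0, mask="[REDACTED]"):
    """Replace each whitespace-separated word with `mask` with probability `rate`."""
    rng = random.Random(seed)
    return " ".join(mask if rng.random() < rate else word for word in text.split())

def decay(accuracy_by_rate):
    """Accuracy lost between the lowest and the highest obfuscation rate."""
    rates = sorted(accuracy_by_rate)
    return accuracy_by_rate[rates[0]] - accuracy_by_rate[rates[-1]]
```

Under this sketch, a benchmark whose accuracy barely decays as the rate rises would be a candidate for contamination, in the spirit of the detection strategy the abstract proposes.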


Article 5

Title@2025-05-29 (4): LLM-based Property-based Test Generation for Guardrailing Cyber-Physical Systems

Title: LLM-based Property-based Test Generation for Guardrailing Cyber-Physical Systems LLM-basierte property-based Test Generation for Guardrailing Cyber-Physical Systems 以LLM为基础的保护网络-物理系统基于财产的 2505.23549v1

Authors: Khashayar Etemadi, Marjan Sirjani, Mahshid Helali Moghadam, Per Strandberg, Paul Pettersson

Cyber-physical systems (CPSs) are complex systems that integrate physical, computational, and communication subsystems. The heterogeneous nature of these systems makes their safety assurance challenging. In this paper, we propose a novel automated approach for guardrailing cyber-physical systems using property-based tests (PBTs) generated by Large Language Models (LLMs). Our approach employs an LLM to extract properties from the code and documentation of CPSs. Next, we use the LLM to generate PBTs that verify the extracted properties on the CPS. The generated PBTs have two uses. First, they are used to test the CPS before it is deployed, i.e., at design time. Secondly, these PBTs can be used after deployment, i.e., at run time, to monitor the behavior of the system and guardrail it against unsafe states. We implement our approach in ChekProp and conduct preliminary experiments to evaluate the generated PBTs in terms of their relevance (how well they match manually crafted properties), executability (how many run with minimal manual modification), and effectiveness (coverage of the input space partitions). The results of our experiments and evaluation demonstrate a promising path forward for creating guardrails for CPSs using LLM-generated property-based tests.

网络物理系统(CPS)是综合物理、计算和通信子系统的复杂系统。这些系统的多样化性质使其安全保障具有挑战性。在本文中,我们提议对使用大语言模型产生的基于财产的测试(PBT)来保护网络物理系统采用新的自动化方法。我们的方法是使用LLM来从CPS的代码和文档中提取属性。接下来,我们利用LLM来生成PBT,以核实在CPS上提取的属性。生成的PBT有两个用途。首先,它们用来在部署之前测试CPS,即设计时间。第二,这些PBT在部署后可以使用,即运行时,用来监测系统的行为,并保护它不受不安全状态的影响。我们在ChekProp中采用的方法,并进行初步实验,以评价产生的PBT的相关性(如何与手工制作的属性相匹配)、可操作性、可操作性(许多操作的手动修改是最低限度的),以及用于前方空间分区的安全性(对C进行有希望的磁性磁性测试)。
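
The two uses of a property-based test described in the abstract above (design-time testing and run-time guardrailing) can be illustrated with a small standard-library sketch. Everything here is an assumption made for illustration: the thermostat `controller`, the `LIMIT` value, and the helper names are hypothetical and are not taken from ChekProp.

```python
import random

LIMIT = 30.0  # hypothetical safety limit in degrees Celsius

def controller(temp, target=21.0):
    """Toy CPS logic: request heating only below the target temperature."""
    return temp < target

def heater_off_when_hot(temp, heater_on):
    """Property: the heater must never run at or above the safety limit."""
    return not (temp >= LIMIT and heater_on)

def find_counterexample(system, prop, trials=1000, seed=0):
    """Design-time use: sample random inputs and return the first violation found."""
    rng = random.Random(seed)
    for _ in range(trials):
        temp = rng.uniform(-40.0, 80.0)
        if not prop(temp, system(temp)):
            return temp
    return None

def guarded(system, prop):
    """Run-time use: force the safe output whenever the property would be violated."""
    def step(temp):
        heater_on = system(temp)
        return heater_on if prop(temp, heater_on) else False
    return step
```

The same predicate serves both purposes: `find_counterexample` exercises it against random inputs before deployment, while `guarded` wraps the deployed controller and overrides unsafe outputs.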


Article 6

Title@2025-05-29 (4): The CASE Framework – A New Architecture for Participatory Research and Digital Health Surveillance

Title: The CASE Framework – A New Architecture for Participatory Research and Digital Health Surveillance Der CASE Framework - Eine neue Architektur für partizipative Forschung und digitale Gesundheitsüberwachung CASE框架 – – 参与性研究和数字健康监测的新架构 2505.23516v1

Authors: Marco Hirsch, Peter Hevesi, Paul Lukowicz

We present the CASE framework, an open-source platform for adaptive, context-aware participatory research, and pandemic preparedness. CASE implements an event-driven architecture that enables dynamic survey workflows, allowing real-time adaptation based on participant responses, external data, temporal conditions, and evolving user states. The framework supports a broad range of research needs, from simple one-time questionnaires to complex longitudinal studies with advanced conditional logic. Built on over a decade of practical experience, CASE underwent a major architectural rework in 2024, transitioning from a microservice-based design to a streamlined monolithic architecture. This evolution significantly improved maintainability, flexibility, and accessibility to deployment, particularly for institutions with limited technical capacity. CASE has been successfully deployed across diverse domains, powering national disease surveillance platforms, supporting post-COVID cohort studies, and enabling real-time sentiment analysis during political events. These applications, involving tens of thousands of participants, demonstrate the framework’s scalability, versatility, and practical value. This paper describes the foundations of CASE, details its architectural evolution, and presents lessons learned from real-world deployments. We establish CASE as a mature and reusable research infrastructure that balances sophisticated functionality with practical implementation, addressing the critical global need for sustainable and institutionally controlled data collection systems.

我们提出了CASE框架,这是一个适应性、有环境意识的参与性研究和大流行病防备的开放源码平台。CASE实施一个事件驱动结构,能够动态调查工作流程,允许根据参与者的反应、外部数据、时间条件和不断演变的用户状态进行实时适应。框架支持广泛的研究需求,从简单的一次性问卷调查到具有先进的有条件逻辑的复杂纵向研究。根据十多年的实际经验,CASE在2024年进行了重大的建筑改造,从基于微观服务的设计过渡到精简的单一结构。这一演变极大地改善了CASE的可维持性、灵活性和可部署性,特别是对技术能力有限的机构而言。CASE成功地在不同领域进行了部署,赋予了国家疾病监测平台的权力,支持了COVID后的群群研究,并在政治活动期间进行了实时的情绪分析。这些应用包括数以万计的参与者,展示了框架的可扩展性、多功能性和实际价值。本文描述了CASE的基础,详细介绍了其建筑演变,并介绍了从实际部署中汲取的教训。我们成功地将CASE部署到了各种复杂的系统,我们建立了一套可操作的精确的系统。


Article 7

Title@2025-05-29 (4): Identity resolution of software metadata using Large Language Models

Title: Identity resolution of software metadata using Large Language Models Identitätsauflösung von Software-Metadaten mit großen Sprachmodellen 使用大语言模式的软件元数据的识别分辨率 2505.23500v1

Authors: Eva Martín del Pico, Josep Lluís Gelpí, Salvador Capella-Gutiérrez

Software is an essential component of research. However, little attention has been paid to it compared with that paid to research data. Recently, there has been an increase in efforts to acknowledge and highlight the importance of software in research activities. Structured metadata from platforms like bio.tools, Bioconductor, and Galaxy ToolShed offers valuable insights into research software in the Life Sciences. Although originally intended to support discovery and integration, this metadata can be repurposed for large-scale analysis of software practices. However, its quality and completeness vary across platforms, reflecting diverse documentation practices. To gain a comprehensive view of software development and sustainability, consolidating this metadata is necessary, but requires robust mechanisms to address its heterogeneity and scale. This article presents an evaluation of instruction-tuned large language models for the task of software metadata identity resolution, a critical step in assembling a cohesive collection of research software. Such a collection is the reference component for the Software Observatory at OpenEBench, a platform that aggregates metadata to monitor the FAIRness of research software in the Life Sciences. We benchmarked multiple models against a human-annotated gold standard, examined their behavior on ambiguous cases, and introduced an agreement-based proxy for high-confidence automated decisions. The proxy achieved high precision and statistical robustness, while also highlighting the limitations of current models and the broader challenges of automating semantic judgment in FAIR-aligned software metadata across registries and repositories.

与研究数据相比,软件是研究的一个基本组成部分。然而,与研究数据相比,对软件的注意很少。最近,人们更加努力承认和强调软件在研究活动中的重要性。生物工具、生物导体和Galaxy ToolShed等平台的结构化元数据为生命科学研究软件提供了宝贵的见解。虽然该元数据最初旨在支持发现和整合,但可以重新用于大规模分析软件实践。然而,该元数据的质量和完整性在平台上各有差异,反映了各种文件做法。为了全面了解软件的开发和可持续性,有必要巩固这一元数据,但需要建立强有力的机制来解决其差异性和规模。本文章对用于软件元数据解析任务的指示调整型大语言模型进行了评价,这是构建统一研究软件库的关键一步。这种收集是OpenEbeench软件观测台的参考部分,该台是一个将元数据汇总成一个平台,用以监测生命科学研究软件的FAIR性。我们用多种模型比对有附加说明的黄金标准进行了基准,需要加以巩固,但需要建立强有力的机制来应对其模棱不全案例和规模和规模的不均匀性,同时,还引入了高额的可靠、高额的统计模型决定。
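
One plausible reading of the agreement-based proxy mentioned in the abstract above is that a decision is automated only when enough models give the same answer, and is otherwise left for human review. The sketch below encodes that reading; `agreement_proxy`, its labels, and the threshold are assumptions, and the paper's actual proxy may be defined differently.

```python
from collections import Counter

def agreement_proxy(votes, min_agreement=1.0):
    """Return (label, automated) for one candidate pair of metadata records.

    `votes` holds one label per model, e.g. "same" or "different".
    The decision is automated only if the top label reaches `min_agreement`.
    """
    label, count = Counter(votes).most_common(1)[0]
    if count / len(votes) >= min_agreement:
        return label, True
    return None, False  # defer ambiguous cases to a human annotator
```

With the default threshold, only unanimous votes are automated, which trades coverage for precision.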


Article 8

Title@2025-05-29 (4): Synthesizing Performance Constraints for Evaluating and Improving Code Efficiency

Title: Synthesizing Performance Constraints for Evaluating and Improving Code Efficiency Synthese von Leistungsbeschränkungen zur Bewertung und Verbesserung der Code-Effizienz 综合评估和提高《守则》效率的绩效制约因素 2505.23471v1

Authors: Jun Yang, Cheng-Chi Wang, Bogdan Alexandru Stoica, Kexin Pei

Large Language Models (LLMs) have been increasingly used to optimize code efficiency. Evaluating their effectiveness and further suggesting optimization opportunities often relies on high-quality tests to demonstrate the performance bottlenecks present in the program. However, existing approaches rely on a limited set of hand-curated inputs or on uninteresting LLM-generated length-stressing tests, failing to reveal more nuanced optimization opportunities. We present WEDGE, a framework for generating performance-stressing inputs given the program under test. WEDGE synthesizes explicit performance-characterizing constraints in the form of branch conditions to partition the program’s execution space into performance-specific regions. When integrated with a coverage-guided fuzzer, reaching different regions introduces explicit rewards for test generation to explore inefficient implementations. Our evaluation shows that WEDGE's tests induce a significant slowdown compared to the tests in CodeContests and those claimed to be optimized by existing approaches. From the utility perspective, integrating our tests substantially improves the existing code optimization approaches that rely on test-driven execution feedback. We release PERFFORGE, the performance tests generated by WEDGE, to benchmark future approaches for efficient code generation at https://github.com/UChiSeclab/perfforge.

大语言模型(LLM)越来越多地被用于优化代码效率。评估其有效性并进一步提出优化机会,往往依赖高质量的测试来暴露程序中的性能瓶颈。然而,现有方法依赖数量有限的人工整理输入,或由LLM生成的、仅靠增大输入长度施压的平淡测试,无法揭示更细微的优化机会。我们提出WEDGE,一个针对被测程序生成性能压力输入的框架。WEDGE以分支条件的形式合成显式的性能刻画约束,将程序的执行空间划分为与性能相关的区域。与覆盖引导的模糊测试器结合后,到达不同区域会为测试生成带来显式奖励,从而探索低效的实现。我们的评估表明,与CodeContests中的测试以及现有方法声称已优化的测试相比,WEDGE生成的测试会带来显著的减速。从实用性角度看,整合我们的测试能大幅改进依赖测试驱动执行反馈的现有代码优化方法。我们在https://github.com/UChiSeclab/perfforge发布了由WEDGE生成的性能测试PERFFORGE,用于对未来的高效代码生成方法进行基准评测。
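
As a rough illustration of what a performance-characterizing constraint over a branch condition could look like, the sketch below pairs an insertion sort with a hand-written predicate describing inputs that keep its inner-loop branch true. The predicate `stresses_shift_branch` and the step counter are hypothetical; WEDGE synthesizes such constraints automatically and feeds them to a fuzzer, which this sketch does not attempt.

```python
def insertion_sort_steps(values):
    """Sort a copy of `values` and count inner-loop shifts as a cost proxy."""
    data, steps = list(values), 0
    for i in range(1, len(data)):
        key, j = data[i], i - 1
        while j >= 0 and data[j] > key:  # the branch that drives the cost
            data[j + 1] = data[j]
            j -= 1
            steps += 1
        data[j + 1] = key
    return steps

def stresses_shift_branch(values, min_len=8):
    """Hypothetical constraint: a long, strictly descending input."""
    return len(values) >= min_len and all(a > b for a, b in zip(values, values[1:]))

fast_input = list(range(10))         # already sorted: the branch is never taken
slow_input = list(range(10, 0, -1))  # descending: the branch is taken maximally
```

Inputs of the same length fall into very different cost regions depending on whether they satisfy the constraint, which is the distinction that length-stressing tests alone would miss.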


Article 9

Title@2025-05-29 (4): What About Emotions? Guiding Fine-Grained Emotion Extraction from Mobile App Reviews

Title: What About Emotions? Guiding Fine-Grained Emotion Extraction from Mobile App Reviews Was ist mit Emotionen? Guiding Fine-Grained Emotion Extraction aus Mobile App Bewertungen 情感呢?指导从移动应用程序评论中抽取精美情感的导师 2505.23452v1

Authors: Quim Motger, Marc Oriol, Max Tiessler, Xavier Franch, Jordi Marco

Opinion mining plays a vital role in analysing user feedback and extracting insights from textual data. While most research focuses on sentiment polarity (e.g., positive, negative, neutral), fine-grained emotion classification in app reviews remains underexplored. This paper addresses this gap by identifying and addressing the challenges and limitations in fine-grained emotion analysis in the context of app reviews. Our study adapts Plutchik’s emotion taxonomy to app reviews by developing a structured annotation framework and dataset. Through an iterative human annotation process, we define clear annotation guidelines and document key challenges in emotion classification. Additionally, we evaluate the feasibility of automating emotion annotation using large language models, assessing their cost-effectiveness and agreement with human-labelled data. Our findings reveal that while large language models significantly reduce manual effort and maintain substantial agreement with human annotators, full automation remains challenging due to the complexity of emotional interpretation. This work contributes to opinion mining by providing structured guidelines, an annotated dataset, and insights for developing automated pipelines to capture the complexity of emotions in app reviews.

意见挖掘在分析用户反馈和从文本数据中提取见解方面发挥着至关重要的作用。虽然大多数研究侧重于情绪极化(如正面、负面、中性),但应用审查中微微薄情感分类仍未得到充分探讨。本文件通过查明和解决在应用审查中微薄情感分析的挑战和局限性来解决这一差距。我们的研究使普卢奇克的情感分类适应于通过开发结构化说明框架和数据集来应用审查。我们通过一个迭代人类批注过程,界定明确的说明指南并记录情感分类方面的关键挑战。此外,我们评估使用大型语言模型将情感批注自动化的可行性,评估其成本效益和与人类标注数据的一致。我们的调查结果显示,虽然大型语言模型大大减少了人工工作,并保持与人类顾问的实质性协议,但由于情感解释的复杂性,完全自动化仍然具有挑战性。这项工作通过提供结构化指南、附加说明的数据集和见解挖掘,有助于开发自动化管道,以捕捉到应用程序审查中情绪的复杂性。


Article 10

Title@2025-05-29 (4): From Knowledge to Noise: CTIM-Rover and the Pitfalls of Episodic Memory in Software Engineering Agents

Title: From Knowledge to Noise: CTIM-Rover and the Pitfalls of Episodic Memory in Software Engineering Agents Vom Wissen zum Lärm: CTIM-Rover und die Pitfalls des episodischen Gedächtnisses in Software Engineering Agents 从知识到噪音:CTIM-Rover和软件工程代理器中电离内存的空洞 2505.23422v1

Authors: Tobias Lindenbauer, Georg Groh, Hinrich Schütze

We introduce CTIM-Rover, an AI agent for Software Engineering (SE) built on top of AutoCodeRover (Zhang et al., 2024) that extends agentic reasoning frameworks with an episodic memory, more specifically, a general and repository-level Cross-Task-Instance Memory (CTIM). While existing open-source SE agents mostly rely on ReAct (Yao et al., 2023b), Reflexion (Shinn et al., 2023), or Code-Act (Wang et al., 2024), all of these reasoning and planning frameworks inefficiently discard their long-term memory after a single task instance. As repository-level understanding is pivotal for identifying all locations requiring a patch for fixing a bug, we hypothesize that SE is particularly well positioned to benefit from CTIM. For this, we build on the Experiential Learning (EL) approach ExpeL (Zhao et al., 2024), proposing a Mixture-Of-Experts (MoEs) inspired approach to create both a general-purpose and repository-level CTIM. We find that CTIM-Rover does not outperform AutoCodeRover in any configuration and thus conclude that neither ExpeL nor DoT-Bank (Lingam et al., 2024) scale to real-world SE problems. Our analysis indicates noise introduced by distracting CTIM items or exemplar trajectories as the likely source of the performance degradation.

我们介绍CTIM-Rover,这是一个建立在AutoCodeRover(Zhang等,2024)之上的软件工程(SE)AI代理,它为代理式推理框架扩展了情景记忆,更具体地说,是通用级和仓库级的跨任务实例记忆(CTIM)。现有的开源SE代理大多依赖ReAct(Yao等,2023b)、Reflexion(Shinn等,2023)或Code-Act(Wang等,2024),而这些推理与规划框架都会在单个任务实例结束后低效地丢弃其长期记忆。由于仓库级理解对于找出修复缺陷所需修补的所有位置至关重要,我们假设SE特别适合从CTIM中受益。为此,我们基于经验学习(EL)方法ExpeL(Zhao等,2024),提出一种受混合专家(MoE)启发的方法,以同时构建通用级和仓库级CTIM。我们发现CTIM-Rover在任何配置下都没有超过AutoCodeRover,因此得出结论:ExpeL和DoT-Bank(Lingam等,2024)都无法扩展到真实的SE问题。我们的分析表明,性能下降的原因很可能是干扰性的CTIM条目或示例轨迹所引入的噪声。


Article 11

Title@2025-05-29 (4): SWE-bench Goes Live!

Title: SWE-bench Goes Live! SWE-Bench geht live! SWE -BECHE GOES 现场直播! 2505.23419v1

Authors: Linghao Zhang, Shilin He, Chaoyun Zhang, Yu Kang, Bowen Li, Chengxing Xie, Junhao Wang, Maoquan Wang, Yufan Huang, Shengyu Fu, Elsie Nallipogu, Qingwei Lin, Yingnong Dang, Saravan Rajmohan, Dongmei Zhang

The issue-resolving task, where a model generates patches to fix real-world bugs, has emerged as a critical benchmark for evaluating the capabilities of large language models (LLMs). While SWE-bench and its variants have become standard in this domain, they suffer from key limitations: they have not been updated since their initial releases, cover a narrow set of repositories, and depend heavily on manual effort for instance construction and environment setup. These factors hinder scalability and introduce risks of overfitting and data contamination. In this work, we present SWE-bench-Live, a live-updatable benchmark designed to overcome these challenges. Our initial release consists of 1,319 tasks derived from real GitHub issues created since 2024, spanning 93 repositories. Each task is accompanied by a dedicated Docker image to ensure reproducible execution. Central to our benchmark is \method, an automated curation pipeline that streamlines the entire process from instance creation to environment setup, removing manual bottlenecks and enabling scalability and continuous updates. We evaluate a range of state-of-the-art agent frameworks and LLMs on SWE-bench-Live, revealing a substantial performance gap compared to static benchmarks like SWE-bench, even under controlled evaluation conditions. To better understand this discrepancy, we perform detailed analyses across repository origin, issue recency, and task difficulty. By providing a fresh, diverse, and executable benchmark grounded in live repository activity, SWE-bench-Live facilitates rigorous, contamination-resistant evaluation of LLMs and agents in dynamic, real-world software development settings.

问题解决任务,即由模型生成补丁来修复真实世界的缺陷,已成为评估大语言模型(LLM)能力的关键基准。SWE-bench及其变体已成为该领域的标准,但存在关键局限:自首次发布以来未曾更新,覆盖的仓库范围狭窄,并且在实例构建和环境配置上严重依赖人工。这些因素阻碍了可扩展性,并带来过拟合和数据污染的风险。在这项工作中,我们提出SWE-bench-Live,一个可实时更新的基准,旨在克服这些挑战。我们的首个版本包含1,319个任务,来自2024年以来创建的真实GitHub问题,涵盖93个仓库。每个任务都配有专用的Docker镜像,以确保执行可复现。基准的核心是\method,一条自动化整理流水线,它简化了从实例创建到环境配置的整个流程,消除了人工瓶颈,并支持可扩展性和持续更新。我们在SWE-bench-Live上评估了一系列最先进的代理框架和LLM,结果显示,即使在受控的评估条件下,与SWE-bench等静态基准相比仍存在显著的性能差距。为更好地理解这一差异,我们从仓库来源、问题新近程度和任务难度等方面进行了详细分析。通过提供一个立足于活跃仓库活动的新鲜、多样且可执行的基准,SWE-bench-Live有助于在动态的真实软件开发环境中对LLM和代理进行严格且抗污染的评估。


Article 12

Title@2025-05-29 (4): Toward Effective AI Governance: A Review of Principles

Title: Toward Effective AI Governance: A Review of Principles Auf dem Weg zu einer effektiven KI-Governance: Eine Überprüfung der Grundsätze 实现有效的独立大赦国际治理:原则审查 2505.23417v1

Authors: Danilo Ribeiro, Thayssa Rocha, Gustavo Pinto, Bruno Cartaxo, Marcelo Amaral, Nicole Davila, Ana Camargo

Artificial Intelligence (AI) governance is the practice of establishing frameworks, policies, and procedures to ensure the responsible, ethical, and safe development and deployment of AI systems. Although AI governance is a core pillar of Responsible AI, current literature still lacks synthesis across such governance frameworks and practices. Objective: To identify which frameworks, principles, mechanisms, and stakeholder roles are emphasized in secondary literature on AI governance. Method: We conducted a rapid tertiary review of nine peer-reviewed secondary studies from IEEE and ACM (2020-2024), using structured inclusion criteria and thematic semantic synthesis. Results: The most cited frameworks include the EU AI Act and NIST RMF; transparency and accountability are the most common principles. Few reviews detail actionable governance mechanisms or stakeholder strategies. Conclusion: The review consolidates key directions in AI governance and highlights gaps in empirical validation and inclusivity. Findings inform both academic inquiry and practical adoption in organizations.

人工智能(AI)治理是建立框架、政策和程序以确保负责任、道德和安全地发展和部署AI系统的做法。尽管AI治理是负责任的AI的核心支柱,但目前的文献仍然缺乏对此类治理框架和做法的综合。目标:确定AI治理的次级文献强调了哪些框架、原则、机制和利益攸关方的作用。方法:我们利用结构化的包容性标准和专题语义综合,对IEEE和ACM(2020-2024年)的九项经同行审查的次级研究进行了快速三级审查。结果:引用最多的框架包括欧盟AI法和NIST RMF;透明度和问责制是最常见的原则。很少详细审查可操作的治理机制或利益攸关方战略。结论:审查综合了AI治理的主要方向,突出了经验验证和包容性方面的差距。调查结果为各组织的学术调查和实际采用提供了信息。


Article 13

Title@2025-05-29 (4): BugRepro: Enhancing Android Bug Reproduction with Domain-Specific Knowledge Integration

Title: BugRepro: Enhancing Android Bug Reproduction with Domain-Specific Knowledge Integration BugRepro: Verbesserung der Android Bug Reproduction mit Domain-spezifischer Wissensintegration Bugrepro: 利用特定域知识集成增强Android虫复制 2505.14528v2

Authors: Hongrong Yin, Jinhong Huang, Yao Li, Yunwei Dong, Tao Zhang

Mobile application development is a fast-paced process where maintaining high-quality user experiences is crucial. Bug reproduction, a key aspect of maintaining app quality, often faces significant challenges. Specifically, when descriptions in bug reports are ambiguous or difficult to comprehend, current approaches fail to extract accurate information. Moreover, modern applications exhibit inherent complexity with multiple pages and diverse functionalities, making it challenging for existing methods to map the relevant information in bug reports to the corresponding UI elements that need to be manipulated. To address these challenges, we propose BugRepro, a novel technique that integrates domain-specific knowledge to enhance the accuracy and efficiency of bug reproduction. BugRepro adopts a Retrieval-Augmented Generation (RAG) approach. It retrieves similar bug reports along with their corresponding steps to reproduce (S2R) entities from an example-rich RAG document. In addition, BugRepro explores the graphical user interface (GUI) of the app and extracts transition graphs from the user interface to incorporate app-specific knowledge to guide large language models (LLMs) in their exploration process. Our experiments demonstrate that BugRepro significantly outperforms two state-of-the-art methods (ReCDroid and AdbGPT). For S2R entity extraction accuracy, it achieves a 7.57 to 28.89 percentage point increase over prior methods. For the bug reproduction success rate, the improvement reaches 74.55% and 152.63%. In reproduction efficiency, the gains are 0.72% and 76.68%.

移动应用开发节奏很快,保持高质量的用户体验至关重要。缺陷复现是维护应用质量的关键环节,却常常面临重大挑战。具体而言,当缺陷报告中的描述含糊或难以理解时,现有方法无法提取准确信息。此外,现代应用具有多页面和多样功能带来的固有复杂性,使现有方法难以将缺陷报告中的相关信息映射到需要操作的相应UI元素。为应对这些挑战,我们提出BugRepro,一种整合领域特定知识以提高缺陷复现准确性和效率的新技术。BugRepro采用检索增强生成(RAG)方法,从示例丰富的RAG文档中检索相似的缺陷报告及其对应的复现步骤(S2R)实体。此外,BugRepro探索应用的图形用户界面(GUI),并从用户界面中提取转换图,以纳入应用特定知识,引导大语言模型(LLM)的探索过程。我们的实验表明,BugRepro显著优于两种最先进的方法(ReCDroid和AdbGPT)。在S2R实体提取准确率上,它比先前方法提高了7.57至28.89个百分点。在缺陷复现成功率上,提升达到74.55%和152.63%。在复现效率上,提升为0.72%和76.68%。
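
The retrieval step described in the abstract above (fetching similar bug reports together with their steps-to-reproduce entities) can be sketched with a simple token-overlap ranking. This is an assumption-laden toy: the `EXAMPLES` store, the Jaccard similarity, and the function names are hypothetical, and BugRepro's actual retriever and document format are not specified in the abstract.

```python
def tokens(text):
    """Lowercased whitespace tokens of a bug report."""
    return set(text.lower().split())

def jaccard(a, b):
    """Token-set overlap between two reports, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def retrieve(query, examples, k=1):
    """Return the k stored examples whose report text is most similar to `query`."""
    ranked = sorted(examples, key=lambda ex: jaccard(query, ex["report"]), reverse=True)
    return ranked[:k]

# Hypothetical example store: report text plus its annotated S2R entities.
EXAMPLES = [
    {"report": "app crashes when I rotate the screen on settings page",
     "s2r": ["open settings", "rotate screen"]},
    {"report": "login button does nothing after entering password",
     "s2r": ["open login", "enter password", "tap login"]},
]

hits = retrieve("crashes after rotate screen", EXAMPLES)
```

The retrieved S2R entities would then be placed in the LLM prompt as in-context examples; a production system would more likely use embedding similarity than raw token overlap.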


Article 14

Title@2025-05-29 (4): Afterburner: Reinforcement Learning Facilitates Self-Improving Code Efficiency Optimization

Title: Afterburner: Reinforcement Learning Facilitates Self-Improving Code Efficiency Optimization Nachbrenner: Verstärktes Lernen erleichtert selbstverbessernde Code-Effizienz-Optimierung 事后焚烧:强化学习促进自我改进法规效率优化 2505.23387v1

Authors: Mingzhe Du, Luu Tuan Tuan, Yue Liu, Yuhao Qing, Dong Huang, Xinyi He, Qian Liu, Zejun Ma, See-kiong Ng

Large Language Models (LLMs) generate functionally correct solutions but often fall short in code efficiency, a critical bottleneck for real-world deployment. In this paper, we introduce a novel test-time iterative optimization framework to address this, employing a closed-loop system where LLMs iteratively refine code based on empirical performance feedback from an execution sandbox. We explore three training strategies: Supervised Fine-Tuning (SFT), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). Experiments on our Venus dataset and the APPS benchmark show that SFT and DPO rapidly saturate in efficiency gains. In contrast, GRPO, using reinforcement learning (RL) with execution feedback, continuously optimizes code performance, significantly boosting both pass@1 (from 47% to 62%) and the likelihood of outperforming human submissions in efficiency (from 31% to 45%). Our work demonstrates effective test-time code efficiency improvement and critically reveals the power of RL in teaching LLMs to truly self-improve code efficiency.

大语言模型(LLMS)产生功能正确的解决方案,但在代码效率方面往往不足,这是现实世界部署的关键瓶颈。在本文中,我们引入了一个创新的测试-时间迭代优化框架来解决这个问题,使用封闭环系统,由LLMS根据执行沙箱的经验性能反馈反复完善代码。我们探索了三个培训战略:监督微调(SFT)、直接偏好优化(DPO)和集体相对政策优化(GROPO),对我们的金星数据集和APPS基准的实验表明,SFT和DPO在效率收益方面迅速饱和。相比之下,GROPO利用执行反馈强化学习(RL),不断优化代码性能,大幅提升1号(从47%到62%)和人效率超过提交文件的可能性(从31%到45%),我们的工作表明测试-时间效率的有效提高,并批判地揭示了LLLMS教授真正自我保护代码效率的能力。


Article 15

Title@2025-05-29 (4): Personality-Guided Code Generation Using Large Language Models

Title: Personality-Guided Code Generation Using Large Language Models Personalitätsgeführte Code-Generierung mit großen Sprachmodellen 使用大语言模式的 个人 使用大语言模式的 人 性 指导 代码 生成 2411.00006v2

Authors: Yaoqi Guo, Zhenpeng Chen, Jie M. Zhang, Yang Liu, Yun Ma

Code generation, the automatic creation of source code from natural language descriptions, has garnered significant attention due to its potential to streamline software development. Inspired by research that links task-personality alignment with improved development outcomes, we conduct an empirical study on personality-guided code generation using large language models (LLMs). Specifically, we investigate how emulating personality traits appropriate to the coding tasks affects LLM performance. We extensively evaluate this approach using seven widely adopted LLMs across four representative datasets. Our results show that personality guidance significantly enhances code generation accuracy, with improved pass rates in 23 out of 28 LLM-dataset combinations. Notably, in 11 cases, the improvement exceeds 5%, and in 5 instances, it surpasses 10%, with the highest gain reaching 12.9%. Additionally, personality guidance can be easily integrated with other prompting strategies to further boost performance. We open-source our code and data at https://github.com/IanWalls/Persona-Code.

代码的生成,即自然语言描述源代码的自动生成,因其在简化软件开发方面的潜力而引起极大关注。在将任务-个性协调与改进发展成果联系起来的研究的启发下,我们开展了一项关于使用大型语言模型(LLMs)进行个性指导代码生成的经验性研究。具体地说,我们调查了与编码任务相适应的个性特征如何影响LLM的性能。我们利用四个具有代表性的数据集广泛采用的7个LLMs广泛评估了这一方法。我们的结果表明,个性指导大大提高了代码生成的准确性,28个LLM-数据集组合中的23个提高了通过率。值得注意的是,在11个案例中,改进率超过5%,在5个案例中超过10%,最高收益达到12.9 %。此外,个性指导很容易与其他快速战略相结合,以进一步提升性能。我们在https://github.com/IanWalls/Percena-Codead中打开了我们的代码和数据。


Article 16

Title@2025-05-29 (4): OSS-UAgent: An Agent-based Usability Evaluation Framework for Open Source Software

Title: OSS-UAgent: An Agent-based Usability Evaluation Framework for Open Source Software OSS-UAgent: Ein Agent-basiertes Usability Evaluation Framework für Open Source Software OSS-UUA代理:基于代理的开放源码软件使用性评价框架 2505.23239v1

Authors: Lingkai Meng, Yu Shao, Long Yuan, Longbin Lai, Peng Cheng, Wenyuan Yu, Wenjie Zhang, Xuemin Lin, Jingren Zhou

Usability evaluation is critical to the impact and adoption of open source software (OSS), yet traditional methods relying on human evaluators suffer from high costs and limited scalability. To address these limitations, we introduce OSS-UAgent, an automated, configurable, and interactive agent-based usability evaluation framework specifically designed for open source software. Our framework employs intelligent agents powered by large language models (LLMs) to simulate developers performing programming tasks across various experience levels (from Junior to Expert). By dynamically constructing platform-specific knowledge bases, OSS-UAgent ensures accurate and context-aware code generation. The generated code is automatically evaluated across multiple dimensions, including compliance, correctness, and readability, providing a comprehensive measure of the software’s usability. Additionally, our demonstration showcases OSS-UAgent’s practical application in evaluating graph analytics platforms, highlighting its effectiveness in automating usability evaluation.

可用性评价对于影响和采用开放源码软件(OSS)至关重要,但依赖人类评估员的传统方法却成本高,可扩展性有限。为了应对这些限制,我们引入了开放源码软件自动、可配置和互动代理工具专用使用性评价框架,这是专门为开放源码软件设计的自动、可配置和基于互动代理工具的可用性评价框架。我们的框架使用由大语言模型(LLLMS)驱动的智能代理器,模拟开发者执行不同层次(从初级到专家)的编程任务。通过动态构建平台特定知识库,OSS-UAgency确保准确和符合背景的代码生成。生成的代码将自动评估多个层面,包括合规性、正确性和可读性,为软件的可用性提供了全面衡量标准。此外,我们的演示展示展示了OSS-UAgency在评价图表分析平台方面的实际应用,突出其在提高可用性评价效率方面的效果。


Article 17

Title@2025-05-29 (4): Artemis: Toward Accurate Detection of Server-Side Request Forgeries through LLM-Assisted Inter-Procedural Path-Sensitive Taint Analysis

Title: Artemis: Toward Accurate Detection of Server-Side Request Forgeries through LLM-Assisted Inter-Procedural Path-Sensitive Taint Analysis Artemis: Auf dem Weg zur genauen Erkennung von Server-Side Request Forgeries durch LLM-Assisted Inter-Procedural Path-Sensitive Taint Analysis 人工制品:通过LLM协助的跨程序间路由感知性图解分析,力求准确探测服务器-Side请求的伪造情况 2502.21026v3

Authors: Yuchen Ji, Ting Dai, Zhichao Zhou, Yutian Tang, Jingzhu He

Server-side request forgery (SSRF) vulnerabilities are inevitable in PHP web applications. Existing static tools for detecting vulnerabilities in PHP web applications neither include SSRF-related features to enhance detection accuracy nor account for PHP’s dynamic type features. In this paper, we present Artemis, a static taint analysis tool for detecting SSRF vulnerabilities in PHP web applications. First, Artemis extracts both PHP built-in and third-party functions as candidate source and sink functions. Second, Artemis constructs both explicit and implicit call graphs to infer functions’ relationships. Third, Artemis performs taint analysis based on a set of rules that prevent over-tainting and pauses when SSRF exploitation is impossible. Fourth, Artemis analyzes the compatibility of path conditions to prune false positives. We have implemented a prototype of Artemis and evaluated it on 250 PHP web applications. Artemis reports 207 true vulnerable paths (106 true SSRFs) with 15 false positives. Of the 106 detected SSRFs, 35 are newly found and reported to developers, with 24 confirmed and assigned CVE IDs.

在PHP网络应用中,发现PHP网络应用中的弱点的现有静态工具既不含SERF相关特性,也非用于提高检测准确性,也非考虑到PHP动态类型特征。在本文件中,我们介绍了Artemis,这是一个用于检测PHP网络应用中SERF脆弱性的静态污点分析工具。首先,Artemis提取了PHP内在和第三方功能,作为候选源和汇的功能。第二,Artemis构建了明确和隐含的调用图,以推断功能的关系。第三,Artemis根据一套规则进行了污点分析,以防止在SSRF无法开发时过度拉扯和暂停。第四,Artemis分析了用于提取假阳点的路径条件的兼容性。我们实施了Artemis原型,并对250 PHP网络应用进行了评估。Artemis报告207条真实的脆弱路径(106个真正的SSRF),有15个假阳点。在106个测得的SSRF中,新发现并向开发商报告了35条,有24个确认和指定的CVEID。
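
The counts reported in the abstract above can be turned into rates with a few lines of arithmetic. The interpretation is an assumption: it treats the 15 false positives as reported paths (so 207 + 15 paths were reported in total) and reads the 24 confirmations as a subset of the 35 newly found SSRFs; the variable names are illustrative.

```python
true_paths, false_positives = 207, 15
detected_ssrfs, new_ssrfs, confirmed = 106, 35, 24

# Share of reported vulnerable paths that are true positives (assumed reading).
path_precision = true_paths / (true_paths + false_positives)
# Share of detected SSRFs that were previously unknown.
new_share = new_ssrfs / detected_ssrfs
# Share of newly reported SSRFs confirmed by developers (assumed reading).
confirmed_share = confirmed / new_ssrfs

print(f"{path_precision:.1%} {new_share:.1%} {confirmed_share:.1%}")
```

Under these assumptions the path-level precision is about 93%, roughly a third of the detected SSRFs are new, and about two thirds of those had been confirmed at the time of writing.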


Article 18

Title@2025-05-29 (4): Two Is Better Than One: Rotations Scale LoRAs

Title: Two Is Better Than One: Rotations Scale LoRAs Zwei ist besser als eins: Rotationsskala LoRAs 二比一好:轮作规模LORAs 2505.23184v1

Authors: Hongcan Guo, Guoshun Nan, Yuan Yang, Diyang Zhang, Haotian Li, Zhican Chen, Qinchuan Zhou, Yuhan Ran, Xinye Cao, Sicong Leng, Xiaofeng Tao, Xudong Jiang

Scaling Low-Rank Adaptation (LoRA)-based Mixture-of-Experts (MoE) enables large language models (LLMs) to adapt efficiently to diverse tasks. However, traditional gating mechanisms that route inputs to the best experts may fundamentally hinder LLMs’ scalability, leading to poor generalization and underfitting issues. We find that the root cause lies in the restricted expressiveness of existing weighted-sum mechanisms, both within and outside the convex cone of LoRA representations. This motivates us to propose RadarGate, a novel geometrically inspired gating method that introduces rotational operations on LoRA representations to boost expressiveness and facilitate richer feature interactions among multiple LoRAs for scalable LLMs. Specifically, we first fuse each LoRA representation with the other LoRAs using a learnable component and then feed the output to a rotation matrix. This matrix involves learnable parameters that define the relative angular relationship between LoRA representations. Such a simple yet effective mechanism provides an extra degree of freedom, facilitating the learning of cross-LoRA synergies and properly tackling the challenging poor generalization and underfitting issues as the number of LoRAs grows. Extensive experiments on 6 public benchmarks across 21 tasks show the effectiveness of our RadarGate for scaling LoRAs. We also provide valuable insights, revealing that the rotations applied to each pair of representations are contrastive, encouraging closer alignment of semantically similar representations during geometrical transformation while pushing distant ones further apart. We will release our code to the community.

扩展基于低秩适应(LoRA)的专家混合(MoE)有助于大型语言模型(LLM)高效适应各种任务;然而,将输入路由给最佳专家的传统门控机制可能从根本上阻碍LLM的可扩展性,导致泛化能力差和欠拟合问题;我们发现,根源在于现有加权求和机制在LoRA表示的凸锥内外的表达能力有限;这促使我们提出RadarGate,这是一种受几何启发的新型门控方法,引入对LoRA表示的旋转操作,以提升表达能力,并促进多个LoRA之间更丰富的特征交互。具体地说,我们首先通过一个可学习组件将每个LoRA表示与其他LoRA融合,然后将输出送入旋转矩阵。这种简单但有效的机制提供了额外的自由度,有助于学习跨LoRA的协同作用,并妥善应对随着LoRA数量增加而出现的泛化能力差和欠拟合问题。
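
The fuse-then-rotate idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' RadarGate implementation: the function names, the pairwise (Givens-style) rotation, and the fixed mixing matrix and angles are all assumptions made for illustration, standing in for parameters that would be learned.

```python
import math


def rotate_pairs(vec, angle):
    """Rotate consecutive coordinate pairs (x0, x1), (x2, x3), ... by `angle` radians."""
    if len(vec) % 2:
        raise ValueError("this sketch assumes an even dimension")
    c, s = math.cos(angle), math.sin(angle)
    out = []
    for i in range(0, len(vec), 2):
        x, y = vec[i], vec[i + 1]
        out.extend([c * x - s * y, s * x + c * y])
    return out


def fuse(reps, mix):
    """Fuse each representation with the others: fused[i] = sum_j mix[i][j] * reps[j]."""
    dim = len(reps[0])
    return [
        [sum(mix[i][j] * reps[j][d] for j in range(len(reps))) for d in range(dim)]
        for i in range(len(reps))
    ]


def rotated_gate(reps, gate_weights, mix, angles):
    """Weighted sum of fused-and-rotated representations (hypothetical gating rule)."""
    fused = fuse(reps, mix)
    rotated = [rotate_pairs(f, a) for f, a in zip(fused, angles)]
    dim = len(reps[0])
    return [sum(w * r[d] for w, r in zip(gate_weights, rotated)) for d in range(dim)]


# Two toy "LoRA outputs" and an identity mixing matrix (no fusion).
reps = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

plain = rotated_gate(reps, [0.25, 0.75], identity, [0.0, 0.0])
turned = rotated_gate(reps, [0.25, 0.75], identity, [0.0, math.pi / 2])
```

With zero angles the result is the ordinary weighted sum, which for non-negative weights stays inside the cone spanned by the inputs. Rotating the second representation by a quarter turn yields a vector with a negative first coordinate, which no non-negative weighted sum of these two inputs could produce.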


Article 19

Title@2025-05-29 (4): An open-source Modular Online Psychophysics Platform (MOPP)

Title: An open-source Modular Online Psychophysics Platform (MOPP) Eine Open-Source-Plattform für modulare Online-Psychophysik (MOPP) 开放源码模块在线心理物理学平台(MOPP) 2505.23137v1

Authors: Yuval Samoilov-Kats, Matan Noach, Noam Beer, Yuval Efrati, Adam Zaidel

In recent years, there has been a growing need and opportunity to use online platforms for psychophysics research. Online experiments make it possible to evaluate large and diverse populations remotely and quickly, complementing laboratory-based research. However, developing and running online psychophysics experiments poses several challenges: i) a high barrier to entry for researchers, who often need to learn complex code-based platforms, ii) an uncontrolled experimental environment, and iii) questionable credibility of the participants. Here, we introduce an open-source Modular Online Psychophysics Platform (MOPP) to address these challenges. Through the simple web-based interface of MOPP, researchers can build modular experiments, share them with others, and copy or modify tasks from each other’s environments. MOPP provides built-in features to calibrate for viewing distance and to measure visual acuity. It also includes email-based and IP-based authentication, and reCAPTCHA verification. We developed five example psychophysics tasks that come preloaded in the environment, and ran a pilot experiment hosted on the AWS (Amazon Web Services) cloud. Pilot data collected for these tasks yielded results similar to those reported in laboratory settings. MOPP can thus help researchers collect large psychophysics datasets online, with reduced turnaround time, and in a standardized manner.

近年来,利用在线平台进行心理物理学研究的需求和机会日益增长。在线实验使得能够对大量和多样化的人口进行远程和快速的评估,从而补充实验室研究。然而,开发并运行在线心理物理学实验带来了若干挑战:(1) 研究人员往往需要学习复杂的基于代码的平台,(2) 不受控制的实验环境,以及(3) 参与者的可信度令人怀疑。在这里,我们引入了一个开放源的模块在线心理物理学平台(MOPP)来应对这些挑战。通过MOPP的简单网络界面,研究人员可以建立模块化实验,与其他人共享这些实验,并复制或修改彼此环境中的任务。MOPP提供了校准观看距离和测量视觉敏锐度的内置功能。它还包括基于电子邮件和基于IP的认证以及reCAPTCHA验证。我们开发了五个预先加载在环境中的示例心理物理学任务,并在AWS(Amazon Web Services)云上进行了试点实验。为这些任务收集的试点数据得到了与实验室环境中报告的结果相似的结果。因此,MOPP可以帮助研究人员以标准化的方式、更短的周转时间在线收集大规模心理物理学数据集。


Article 20

Title@2025-05-29 (4): VERINA: Benchmarking Verifiable Code Generation

Title: VERINA: Benchmarking Verifiable Code Generation VERINA: Benchmarking der überprüfbaren Code-Generierung VERINA:可核实代码生成基准 2505.23135v1

Authors: Zhe Ye, Zhengxu Yan, Jingxuan He, Timothe Kasriel, Kaiyu Yang, Dawn Song

Large language models (LLMs) are increasingly integrated in software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation – jointly generating code, specifications, and proofs of code-specification alignment – offers a promising path to address this limitation and further unleash LLMs’ benefits in coding. Yet, there exists a significant gap in evaluation: current benchmarks often lack support for end-to-end verifiable code generation. In this paper, we introduce Verina (Verifiable Code Generation Arena), a high-quality benchmark enabling a comprehensive and modular evaluation of code, specification, and proof generation as well as their compositions. Verina consists of 189 manually curated coding tasks in Lean, with detailed problem descriptions, reference implementations, formal specifications, and extensive test suites. Our extensive evaluation of state-of-the-art LLMs reveals significant challenges in verifiable code generation, especially in proof generation, underscoring the need for improving LLM-based theorem provers in verification domains. The best model, OpenAI o4-mini, generates only 61.4% correct code, 51.0% sound and complete specifications, and 3.6% successful proofs, with one trial per task. We hope Verina will catalyze progress in verifiable code generation by providing a rigorous and comprehensive benchmark. We release our dataset on https://huggingface.co/datasets/sunblaze-ucb/verina and our evaluation code on https://github.com/sunblaze-ucb/verina.

大型语言模型(LLM)日益融入软件开发,但确保LLM生成的代码的正确性仍具有挑战性,而且往往需要花费昂贵的人工审查。可验证代码的生成(共同生成代码、规格以及代码与规格一致性的证明)为解决这一限制和进一步释放LLM的编码好处提供了一条充满希望的道路。然而,在评价方面存在着巨大的差距:目前的基准往往缺乏对端到端可验证代码生成的支持。在本文中,我们引入了Verina(Verifiable Code Generation Arena),这是一个高质量基准,能够对代码、规格和证明生成及其组合进行全面和模块化评价。Verina由189个人工整理的Lean编码任务组成,其中有详细的问题描述、参考实现、形式化规格和广泛的测试套件。我们对最先进LLM的广泛评估表明,可验证代码生成,尤其是证明生成,面临重大挑战。最佳模型OpenAI o4-mini在每个任务一次尝试的情况下,仅生成61.4%的正确代码、51.0%的可靠且完备的规格以及3.6%的成功证明。


Article 21

Title@2025-05-29 (4): Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specification Inference

Title: Can LLMs Reason About Program Semantics? A Comprehensive Evaluation of LLMs on Formal Specification Inference Kann LLMs Grund über Programm Semantik? Eine umfassende Bewertung von LLMs auf formale Spezifikation Inferenz CLLMs 方案语义学理由:全面评价关于正式具体推断的LLMs 2503.04779v4

Authors: Thanh Le-Cong, Bach Le, Toby Murray

Large Language Models (LLMs) are increasingly being used to automate programming tasks. Yet, LLMs’ capabilities in reasoning about program semantics are still inadequately studied, leaving significant potential for further exploration. This paper introduces FormalBench, a comprehensive benchmark designed to evaluate LLMs’ reasoning abilities on program semantics, particularly via the task of synthesizing formal program specifications to assist verifying program correctness. This task requires both comprehensive reasoning over all possible program executions and the generation of precise, syntactically correct expressions that adhere to formal syntax and semantics. Using this benchmark, we evaluated the ability of LLMs in synthesizing consistent and complete specifications. Our findings show that LLMs perform well with simple control flows but struggle with more complex structures, especially loops, even with advanced prompting. Additionally, LLMs exhibit limited robustness against semantic-preserving transformations. We also highlight common failure patterns and design self-repair prompts, improving success rates by 25%.

大型语言模型(LLMs)正越来越多地被用于使程序设计任务自动化。然而,LLMs在程序语义学的推理能力仍然没有得到充分的研究,从而留下了进一步探索的巨大潜力。本文介绍了旨在评估LLMs在程序语义学上的推理能力的全面基准“正式Bench”,尤其是通过综合正式程序规格协助核实程序正确性的任务。这项任务要求对所有可能的方案执行进行综合推理,并生成精确、统一和正确的表达方式,以坚持正式的语义学和语义学。我们利用这一基准,评估LLMs在综合一致和完整的规格方面的能力。我们的研究结果显示LMs在简单控制流程方面表现良好,但与更复杂的结构,特别是圆环进行斗争,即使有先进的提示。此外,LLMs在反对语义-保留转换方面表现出有限的强健性。我们还强调了常见的失败模式和设计自我修复的提示,使成功率提高了25%。


Article 22

Title@2025-05-29 (4): DINGO: Constrained Inference for Diffusion LLMs

Title: DINGO: Constrained Inference for Diffusion LLMs DINGO: Beschränkte Schlussfolgerung für Diffusion LLMs DINGO: 扩散长效LMM的连续推论 2505.23061v1

Authors: Tarun Suresh, Debangshu Banerjee, Shubham Ugare, Sasa Misailovic, Gagandeep Singh

Diffusion LLMs have emerged as a promising alternative to conventional autoregressive LLMs, offering significant potential for improved runtime efficiency. However, existing diffusion models lack the ability to provably enforce user-specified formal constraints, such as regular expressions, which makes them unreliable for tasks that require structured outputs, such as fixed-schema JSON generation. Unlike autoregressive models that generate tokens sequentially, diffusion LLMs predict a block of tokens in parallel. This parallelism makes traditional constrained decoding algorithms, which are designed for sequential token prediction, ineffective at preserving the true output distribution. To address this limitation, we propose DINGO, a dynamic programming-based constrained decoding strategy that is both efficient and provably distribution-preserving. DINGO enables sampling of output strings with the highest probability under the model’s predicted distribution, while strictly satisfying any user-specified regular expression. On standard symbolic math and JSON generation benchmarks, DINGO achieves up to a 68 percentage point improvement over unconstrained inference.

与传统自动递减的LMS相比,LMS已成为一种大有希望的替代传统自动递减的LMS,它为提高运行时间效率提供了巨大的潜力;然而,现有的推广模式缺乏能力,无法对用户指定的正式限制,例如常规表达方式,使其不适于执行需要结构化产出的任务,例如固定的JSON 生成。与自动递减模式不同,扩散LMS同时预测一系列象征性。这种平行使得传统的受限制解码算法(这些算法是为按顺序进行象征性预测而设计的,在保存真正的产出分布方面无效)。为了应对这一限制,我们建议DINGO,这是一个动态的、基于程序化的受限解码战略,既高效又可可移动的分布保存。DINGO能够根据模型预测的分布,在严格满足用户指定的任何常规表达方式时,以最高概率取样产出字符。关于标准的象征性数学和JSONS生成基准,DINGO在未受限制的推算之外,实现了68个百分点的改进。
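
The core idea of dynamic-programming constrained decoding over a parallel token block can be sketched as a Viterbi-style search over the states of a finite automaton. This is a simplified illustration, not the DINGO algorithm itself: it assumes single-character tokens, independent per-position distributions, and a hand-written DFA (here for the pattern `[12]+(\+[12]+)*`). All names are hypothetical.

```python
import math


def constrained_argmax(position_probs, transitions, start, accepting):
    """Return (string, probability) of the most likely DFA-accepted string, or None.

    position_probs: one dict per position, mapping symbol -> probability.
    transitions: dict mapping (state, symbol) -> next state; missing entries reject.
    """
    # best[state] = (log-probability, prefix) of the best prefix that ends in `state`.
    best = {start: (0.0, "")}
    for probs in position_probs:
        nxt = {}
        for state, (logp, prefix) in best.items():
            for symbol, p in probs.items():
                target = transitions.get((state, symbol))
                if target is None or p <= 0.0:
                    continue
                cand = (logp + math.log(p), prefix + symbol)
                if target not in nxt or cand[0] > nxt[target][0]:
                    nxt[target] = cand
        best = nxt
    finals = [value for state, value in best.items() if state in accepting]
    if not finals:
        return None
    logp, text = max(finals, key=lambda value: value[0])
    return text, math.exp(logp)


# DFA: state 0 = start, 1 = just read a digit (accepting), 2 = just read "+".
dfa = {
    (0, "1"): 1, (0, "2"): 1,
    (1, "1"): 1, (1, "2"): 1, (1, "+"): 2,
    (2, "1"): 1, (2, "2"): 1,
}
block = [
    {"1": 0.6, "2": 0.1, "+": 0.3},
    {"1": 0.2, "2": 0.1, "+": 0.7},
    {"1": 0.05, "2": 0.15, "+": 0.8},
]
result = constrained_argmax(block, dfa, start=0, accepting={1})
```

Picking the most likely symbol at each position independently would give `1++`, which the pattern rejects. Because the future of a prefix depends only on the DFA state it reaches, keeping one best prefix per state is enough to find the best accepted string.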


Article 23

Title@2025-05-29 (4): HACMony: Automatically Detecting Hopping-related Audio-stream Conflict Issues on HarmonyOS

Title: HACMony: Automatically Detecting Hopping-related Audio-stream Conflict Issues on HarmonyOS HACMony: Automatische Erkennung von Hopping-bezogenen Audio-Stream-Konflikten auf HarmonyOS HACMonny:自动检测与Happing有关的和谐OS音频流冲突问题 2504.07472v2

Authors: Jinlong He, Binru Huang, Changwei Xia, Hengqin Yang, Jiwei Yan, Jun Yan

HarmonyOS is emerging as a popular distributed operating system for diverse mobile devices. One of its standout features is app-hopping, which allows users to seamlessly transition apps across different HarmonyOS devices. However, when apps playing audio streams hop between devices, they can easily trigger Hopping-related Audio-stream Conflict (HAC) scenarios. Improper resolution of HAC will lead to significant HAC issues, which are harder to detect compared to single-device audio-stream conflicts, due to the unclear semantics of HarmonyOS’s app-hopping mechanism and the lack of effective multi-app hopping testing methods. To fill the gap, this paper introduces an automated and efficient approach to detecting HAC issues. We formalized the operational semantics of HarmonyOS’s app-hopping mechanism for audio streams for the first time. Leveraging this formalization, we designed an Audio Service Transition Graph (ASTG) to model the behaviors of audio-API-related services and proposed a model-based approach to detect HAC issues automatically. Our techniques were implemented in a tool, HACMony, and evaluated on 20 real-world HarmonyOS apps. Experimental results reveal that 11 of the 20 apps exhibit HAC issues. Additionally, we summarized the detected issues into two typical types, namely MoD and MoR, and analyzed their characteristics to assist and guide both app and OS developers.

HarmonyOS正在成为面向各种移动设备的流行分布式操作系统。它的一个突出功能是应用流转(app-hopping),使用户能够在不同的HarmonyOS设备之间无缝迁移应用。然而,当播放音频流的应用在设备之间流转时,它们很容易触发与流转相关的音频流冲突(HAC)情景。HAC的不恰当解决将带来重大的HAC问题,由于HarmonyOS应用流转机制的语义不明确,且缺乏有效的多应用流转测试方法,这些问题比单设备音频流冲突更难以被察觉。为了填补空白,本文引入了一种自动和高效的方法来检测HAC问题。我们首次形式化了HarmonyOS面向音频流的应用流转机制的操作语义。利用这一形式化,我们设计了一个音频服务转换图(ASTG)来建模与音频API相关的服务的行为,并提出了一种基于模型的方法来自动检测HAC问题。我们的技术在工具HACMony中实现,并在20个真实的HarmonyOS应用上进行了评估。实验结果显示,20个应用中有11个存在HAC问题。此外,我们将检测到的问题归纳为MoD和MoR两种典型类型,并分析了它们的特征,以协助和指导应用和操作系统开发者。


Article 24

Title@2025-05-29 (4): Chain of Grounded Objectives: Bridging Process and Goal-oriented Prompting for Code Generation

Title: Chain of Grounded Objectives: Bridging Process and Goal-oriented Prompting for Code Generation Kette der geerdeten Ziele: Überbrückungsprozess und zielorientiertes Prompting für die Codegenerierung 基本目标链链:搭桥进程和以目标为导向的促进代码生成 2501.13978v2

Authors: Sangyeop Yeo, Seung-won Hwang, Yu-Seung Ma

The use of Large Language Models (LLMs) for code generation has gained significant attention in recent years. Existing methods often aim to improve the quality of generated code by incorporating additional contextual information or guidance into input prompts. Many of these approaches adopt sequential reasoning strategies, mimicking human-like step-by-step thinking. However, such strategies may constrain flexibility, as they do not always align with the structured characteristics of programming languages. This paper introduces the Chain of Grounded Objectives (CGO), a method that embeds functional objectives into input prompts to enhance code generation. By leveraging appropriately structured objectives as input and avoiding explicit sequential procedures, CGO adapts effectively to the structured nature of programming tasks. Empirical evaluations demonstrate that CGO effectively enhances code generation, addressing limitations of existing approaches.

近些年来,使用大语言模式生成代码的问题引起了人们的极大注意,现有方法往往旨在通过将更多的背景信息或指导纳入投入提示来提高生成代码的质量,其中许多方法采用顺序推理战略,仿照人式的逐步思维,但是,这些战略可能限制灵活性,因为它们并不总是与编程语言的结构特征相一致。本文件介绍了 “ 定点目标链 “ (CGO),这是将功能目标嵌入投入的一种方法,它将功能目标嵌入投入中,从而推动加强代码生成。通过利用结构适当的目标作为投入,避免明确的顺序程序,CGO有效地适应了方案编制任务的结构性。经验性评估表明,CGO有效地加强了编程,解决了现有方法的局限性。


Article 25

Title@2025-05-29 (4): Structural Abstraction and Selective Refinement for Formal Verification

Title: Structural Abstraction and Selective Refinement for Formal Verification Strukturelle Abstraktion und selektive Verfeinerung für formale Verifizierung 正式核查的结构性抽象和选择性改进 2505.22982v1

Authors: Christoph Luckeneder, Ralph Hoch, Hermann Kaindl

Safety verification of robot applications is extremely challenging due to the complexity of the environment that a robot typically operates in. Formal verification with model-checking provides guarantees, but it may often take too long or even fail for complex models of the environment. A common solution is abstraction, more precisely behavioral abstraction. Our new approach introduces structural abstraction instead, which we investigated in the context of a voxel representation of the robot environment. This kind of abstraction leads to abstract voxels. We also propose a complete and automated verification workflow, which is based on an existing methodology for robot applications and inspired by the key ideas behind counterexample-guided abstraction refinement (CEGAR): performing an initial abstraction and successively introducing refinements based on counterexamples, intertwined with model-checker runs. Hence, our approach uses selective refinement of structural abstractions to improve the runtime efficiency of model-checking. A fully automated implementation of our approach showed its feasibility, since counterexamples were found for a realistic scenario with a fairly high (maximal) resolution in a few minutes, while direct model-checker runs led to a crash after a couple of days.

由于机器人通常操作的环境的复杂性,对机器人应用的安全性进行核查极具挑战性。通过模型检查进行的正式核查提供了保证,但对于复杂的环境模型来说往往需要过长甚至失败。通常的解决办法是抽象化,更准确地说是行为抽象化。我们的新办法引入了结构抽象化,而我们是在机器人环境的 voxel 代表范围内调查的。这种抽象化导致抽象的氧化物。我们还提议了一个完整和自动化的核查工作流程,该流程以机器人应用的现有方法为基础,并受到反比照制抽象精化(CEGAR)背后的关键想法的启发——在反比照样本的基础上进行初步抽象化和连续地引入改进,与模型检查机运行相交织。因此,我们的办法采用结构抽象化的选择性改进,以提高模型检查的运行效率。我们方法的完全自动化实施显示了其可行性,因为在几分钟内发现有相当高(最大)分辨率的现实情景,而直接的模型检查结果在几天后导致坠毁。
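
The abstract-check-refine loop described above can be sketched on a toy one-dimensional "voxel" world. This is only an illustration of the CEGAR-inspired control flow under strong assumptions: a simple set-intersection test stands in for the model checker, the safety property is "the path never enters an occupied cell", and every name is hypothetical rather than taken from the paper.

```python
def verify_with_selective_refinement(size, obstacles, path):
    """Return ("safe", voxels) or ("unsafe", cell, voxels).

    size: number of finest-resolution cells (a power of two in this sketch).
    obstacles, path: sets of cell indices at the finest resolution.
    voxels: the final partition into half-open intervals (lo, hi).
    """
    voxels = [(0, size)]  # start from a single, maximally coarse abstract voxel
    while True:
        # Abstract check: a voxel is a potential conflict if it contains both an
        # obstacle cell and a path cell. This over-approximates a real collision.
        conflict = next(
            (
                (lo, hi)
                for lo, hi in voxels
                if any(lo <= o < hi for o in obstacles)
                and any(lo <= p < hi for p in path)
            ),
            None,
        )
        if conflict is None:
            return ("safe", sorted(voxels))
        lo, hi = conflict
        if hi - lo == 1:
            # At the finest resolution the conflict is a real counterexample.
            return ("unsafe", lo, sorted(voxels))
        # Possibly spurious: refine only the conflicting voxel, leave the rest coarse.
        mid = (lo + hi) // 2
        voxels.remove(conflict)
        voxels.extend([(lo, mid), (mid, hi)])


safe_result = verify_with_selective_refinement(8, {6}, {0, 1, 2})
unsafe_result = verify_with_selective_refinement(8, {2}, {2, 3})
```

In the safe case one split suffices and the world stays at a coarse resolution, which is where selective refinement would save work compared with checking every cell. In the unsafe case only the region around the collision is refined down to single cells.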


Article 26

Title@2025-05-29 (4): CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance

Title: CodeSteer: Symbolic-Augmented Language Models via Code/Text Guidance CodeSteer: Symbolisch-Augmentierte Sprachmodelle über Code/Text Anleitung 代码器:通过编码/文本指导的代码/文本指导的代码器:代号辅助语言模式 2502.04350v2

Authors: Yongchao Chen, Yilun Hao, Yueying Liu, Yang Zhang, Chuchu Fan

Existing methods fail to effectively steer Large Language Models (LLMs) between textual reasoning and code generation, leaving symbolic computing capabilities underutilized. We introduce CodeSteer, an effective method for guiding LLM code/text generation. We construct a comprehensive benchmark SymBench comprising 37 symbolic tasks with adjustable complexity and also synthesize datasets of 12k multi-turn guidance/generation trajectories and 5.5k guidance comparison pairs. We fine-tune the Llama-3-8B model with a newly designed multi-turn supervised fine-tuning (SFT) and direct preference optimization (DPO). The resulting model, CodeSteerLLM, augmented with the proposed symbolic and self-answer checkers, effectively guides the code/text generation of larger models. Augmenting GPT-4o with CodeSteer raises its average performance score from 53.3 to 86.4, even outperforming the existing best LLM OpenAI o1 (82.7), o1-preview (74.8), and DeepSeek R1 (76.8) across all 37 tasks (28 seen, 9 unseen). Trained for GPT-4o, CodeSteer demonstrates superior generalizability, providing an average 41.8 performance boost on Claude, Mistral, and GPT-3.5. CodeSteer-guided LLMs fully harness symbolic computing to maintain strong performance on highly complex tasks. Models, Datasets, and Codes are available at https://github.com/yongchao98/CodeSteer-v1.0 and https://huggingface.co/yongchao98.

现有方法未能在文本推理和代码生成之间有效引导大型语言模型(LLM),使得符号计算能力未得到充分利用。我们引入了CodeSteer,这是指导LLM代码/文本生成的有效方法。我们构建了一个全面的基准SymBench,由37项复杂度可调的符号任务组成,还合成了12k条多轮指导/生成轨迹和5.5k个指导比较对的数据集。我们用新设计的多轮监督微调(SFT)和直接偏好优化(DPO)对Llama-3-8B模型进行了微调。由此得到的模型CodeSteerLLM辅以所提出的符号检查器和自答检查器,能够有效指导更大模型的代码/文本生成。用CodeSteer增强GPT-4o后,其在全部37项任务(28项已见,9项未见)上的平均性能得分从53.3提高到86.4,甚至超过现有最佳LLM OpenAI o1(82.7)、o1-preview(74.8)和DeepSeek R1(76.8)。CodeSteer针对GPT-4o训练,但表现出优越的泛化能力,在Claude、Mistral和GPT-3.5上平均带来41.8的性能提升。


Article 27

Title@2025-05-29 (4): BYOS: Knowledge-driven Large Language Models Bring Your Own Operating System More Excellent

Title: BYOS: Knowledge-driven Large Language Models Bring Your Own Operating System More Excellent BYOS: Wissensgetriebene große Sprachmodelle bringen Ihr eigenes Betriebssystem hervorragender BYOS: 知识驱动的大型语言模式使自己的操作系统更加出色 2503.09663v2

Authors: Hongyu Lin, Yuchen Li, Haoran Luo, Kaichun Yao, Libo Zhang, Mingjie Xing, Yanjun Wu

Operating System (OS) kernel tuning involves systematically adjusting kernel configurations to optimize system performance. Despite recent advancements in large language models (LLMs), kernel tuning remains a critical challenge due to: (1) the semantic gap between abstract tuning objectives and concrete config options, (2) insufficient environmental interaction, which induces LLM hallucinations, and (3) the rapid evolution of kernel versions. To address these challenges, we propose BYOS, an LLM-powered framework that automates kernel tuning through three key innovations: structured knowledge construction and mapping, knowledge-driven configuration generation, and continuous knowledge maintenance. Extensive experiments show that BYOS achieves 7.1%-155.4% performance improvements over default configurations across standard OS benchmarks and real-world applications, demonstrating that structured knowledge representation can overcome key limitations of pure LLM solutions for system optimization. Our code is available at https://github.com/LHY-24/BYOS.

操作系统(OS)内核调导涉及系统地调整内核配置以优化系统性能,尽管大型语言模型(LLMs)最近有所进步,但内核调导仍是一个重大挑战,因为:(1) 抽象调控目标和具体配置选项之间的语义差距,(2) 环境互动不足导致LM幻觉,(3) 内核版本的迅速演变。为应对这些挑战,我们提议BYOS,一个LLM驱动框架,通过三个关键创新自动调控内核:结构化知识建设和绘图、知识驱动的配置生成和持续的知识维护。广泛的实验显示,BYOS在标准OS基准和现实世界应用的默认配置方面实现了7.1%至155.4%的性能改进,展示有结构化的知识代表可以克服纯粹LM解决方案的关键限制,实现系统优化。我们的代码可在https://github.com/LHY-24/BYOS查阅。


Article 28

Title@2025-05-28 (3): Unlocking Mental Health: Exploring College Students’ Well-being through Smartphone Behaviors

Title: Unlocking Mental Health: Exploring College Students’ Well-being through Smartphone Behaviors Entsperren der psychischen Gesundheit: Erforschen des Wohlbefindens der Studenten durch Smartphone-Verhalten 解锁心理健康:通过智能手机行为探索大学生福祉 2502.08766v2

Authors: Wei Xuan, Meghna Roy Chowdhury, Yi Ding, Yixue Zhao

The global mental health crisis is a pressing concern, with college students particularly vulnerable to rising mental health disorders. The widespread use of smartphones among young adults, while offering numerous benefits, has also been linked to negative outcomes such as addiction and regret, significantly impacting well-being. Leveraging the longest longitudinal dataset collected over four college years through passive mobile sensing, this study is the first to examine the relationship between students’ smartphone unlocking behaviors and their mental health at scale in real-world settings. We provide the first evidence demonstrating the predictability of phone unlocking behaviors for mental health outcomes based on a large dataset, highlighting the potential of these novel features for future predictive models. Our findings reveal important variations in smartphone usage across genders and locations, offering a deeper understanding of the interplay between digital behaviors and mental health. We highlight future research directions aimed at mitigating adverse effects and promoting digital well-being in this population.

全球心理健康危机是一个紧迫的关切问题,大学生特别容易患上精神疾病。年轻成年人广泛使用智能手机,虽然带来许多好处,但也与诸如吸毒成瘾和遗憾等负面结果有关,严重影响了福祉。利用四年来通过被动移动感测收集的最长时间纵向数据集,这项研究是第一个在现实世界环境中审查学生智能手机解锁行为与其大规模心理健康之间关系的研究。我们提供了第一个证据,表明在大型数据集的基础上,手机解锁行为对心理健康结果的可预测性,突出了这些新特征对未来预测模型的潜力。我们的调查结果揭示了不同性别和不同地点使用智能手机方面的重要差异,使人们更深入地了解数字行为与心理健康之间的相互作用。我们强调今后旨在减轻负面影响和促进这一人群的数字福祉的研究方向。


Article 29

Title@2025-05-28 (3): Evolution analysis of software quality metrics in an open-source java project: A case study on TestNG

Title: Evolution analysis of software quality metrics in an open-source java project: A case study on TestNG Evolutionsanalyse von Software-Qualitätsmetriken in einem Open-Source-Java-Projekt: Eine Fallstudie zu TestNG 开放源码 Java项目软件质量衡量标准演变分析:测试NG案例研究 2505.22884v1

Authors: Venkata Sai Sravya Sambaturu

Software quality is critical in modern software engineering, especially in large and evolving codebases. This study analyzes the evolution of software quality metrics in five successive versions of the open-source Java testing framework TestNG. Using the static analysis tool Understand, eleven key object-oriented metrics, including cyclomatic complexity, class coupling, and lines of code, were extracted for each version. Statistical and visual analyses reveal structural trends over time. The results indicate that TestNG has matured into a more stable and maintainable framework, reflecting ongoing development, refactoring, and architectural improvements. This study provides insights into design evolution and offers recommendations for maintaining code quality in similar projects.

软件质量在现代软件工程中至关重要,特别是在大型和不断演变的代码库中。本研究分析了五个连续版本的开放源码 Java测试框架TestNG的软件质量度量的演变。使用静态分析工具Understand,为每个版本提取了11个关键的面向对象度量,包括圈复杂度、类耦合和代码行数。统计和视觉分析揭示了长期的结构性趋势。结果显示,TestNG已经发展成一个更稳定、更可维护的框架,反映了正在进行的开发、重构和架构改进。本研究为设计演变提供了深刻的见解,并为维护类似项目的代码质量提出了建议。


Article 30

Title@2025-05-28 (3): Visualizing Cloud-native Applications with KubeDiagrams

Title: Visualizing Cloud-native Applications with KubeDiagrams Cloud-native Anwendungen mit KubeDiagrammen visualisieren 带有KubeDiagrams 的可视化云源应用 2505.22879v1

Authors: Philippe Merle, Fabio Petrillo

Modern distributed applications increasingly rely on cloud-native platforms to abstract the complexity of deployment and scalability. As the de facto orchestration standard, Kubernetes enables this abstraction, but its declarative configuration model makes architectural understanding difficult. Developers, operators, and architects struggle to form accurate mental models from raw manifests, Helm charts, or cluster state descriptions. We introduce KubeDiagrams, an open-source tool that transforms Kubernetes manifests into architecture diagrams. By grounding our design in a user-centered study of real-world visualization practices, we identify the specific challenges Kubernetes users face and map these to concrete design requirements. KubeDiagrams integrates seamlessly with standard Kubernetes artifacts, preserves semantic fidelity to core concepts, and supports extensibility and automation. We detail the tool’s architecture, visual encoding strategies, and extensibility mechanisms. Three case studies illustrate how KubeDiagrams enhances system comprehension and supports architectural reasoning in distributed cloud-native systems. KubeDiagrams addresses concrete pain points in Kubernetes-based DevOps practices and is valued for its automation, clarity, and low-friction integration into real-world tooling environments.

现代分布式应用日益依赖云原生平台来抽象部署和扩展的复杂性。作为事实上的编排标准,Kubernetes实现了这种抽象,但其声明式配置模型使得架构理解变得困难。开发者、运维人员和架构师难以从原始清单、Helm图表或集群状态描述中形成准确的心理模型。我们引入了KubeDiagrams,这是一个将Kubernetes清单转化为架构图的开源工具。通过以用户为中心的真实世界可视化实践研究,我们确定了Kubernetes用户所面临的具体挑战,并将这些挑战映射到具体的设计要求。KubeDiagrams与标准的Kubernetes工件无缝集成,保持对核心概念的语义忠实性,并支持可扩展性和自动化。我们详细介绍了该工具的架构、视觉编码策略和扩展机制。三个案例研究说明了KubeDiagrams如何在分布式云原生系统中增强系统理解并支持架构推理。KubeDiagrams解决了基于Kubernetes的DevOps实践中的具体痛点,并因其自动化、清晰性以及与真实工具环境的低摩擦集成而受到重视。
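
To make the manifest-to-diagram idea concrete, the sketch below turns a few already-parsed manifests into Graphviz DOT text. It does not use or reproduce the KubeDiagrams tool or its API. The function names are hypothetical, the manifests are plain dictionaries rather than YAML files, and the only relationship drawn is a Service selecting a Deployment by matching labels.

```python
def node_id(manifest):
    """Build a readable node identifier such as "Service/web"."""
    return f'{manifest["kind"]}/{manifest["metadata"]["name"]}'


def manifests_to_dot(manifests):
    """Render manifests as a DOT digraph with Service -> Deployment edges."""
    lines = ["digraph app {"]
    for manifest in manifests:
        lines.append(f'  "{node_id(manifest)}";')
    deployments = [m for m in manifests if m["kind"] == "Deployment"]
    for service in (m for m in manifests if m["kind"] == "Service"):
        selector = service["spec"].get("selector", {})
        for deployment in deployments:
            labels = deployment["spec"]["template"]["metadata"].get("labels", {})
            # A Service targets pods whose labels include every selector entry.
            if selector and selector.items() <= labels.items():
                lines.append(f'  "{node_id(service)}" -> "{node_id(deployment)}";')
    lines.append("}")
    return "\n".join(lines)


manifests = [
    {"kind": "Service", "metadata": {"name": "web"},
     "spec": {"selector": {"app": "web"}}},
    {"kind": "Deployment", "metadata": {"name": "web"},
     "spec": {"template": {"metadata": {"labels": {"app": "web", "tier": "front"}}}}},
    {"kind": "Deployment", "metadata": {"name": "db"},
     "spec": {"template": {"metadata": {"labels": {"app": "db"}}}}},
]
dot_text = manifests_to_dot(manifests)
```

The resulting text can be passed to any DOT renderer. A real tool would cover many more resource kinds and relationships than this single selector rule.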


Article 31

Title@2025-05-28 (3): RocqStar: Leveraging Similarity-driven Retrieval and Agentic Systems for Rocq generation

Title: RocqStar: Leveraging Similarity-driven Retrieval and Agentic Systems for Rocq generation RocqStar: Leveraging-ähnliche Retrieval- und Agentiksysteme für die Rocq-Generation RocqStar:利用利用相似度驱动回收系统和干系统来生成Rocq 2505.22846v1

Authors: Nikita Khramov, Andrei Kozyrev, Gleb Solovev, Anton Podkopaev

Interactive Theorem Proving has repeatedly been shown to be fruitful when combined with Generative Artificial Intelligence. This paper assesses multiple approaches to Rocq generation and illuminates potential avenues for improvement. We highlight the importance of thorough premise selection for generating Rocq proofs and propose a novel approach, leveraging retrieval via a self-attentive embedder model. The evaluation of the designed approach shows up to a 28% relative increase in the generator’s performance. We tackle the problem of writing Rocq proofs using a multi-stage agentic system, tailored for formal verification, and demonstrate its high effectiveness. We conduct an ablation study and show the use of multi-agent debate at the planning stage of proof synthesis.

互动理论证明反复证明与创造人工智能相结合是富有成果的。本文评估了Rocq生成的多种方法,并揭示了可能的改进途径。我们强调为生成 Rocq 验证物进行彻底的前提选择的重要性,并提出新的方法,通过自我注意嵌入模型来利用检索。对设计方法的评估表明,发电机的性能相对增加28%。我们解决了使用多阶段制剂系统编写 Rocq 验证物的问题,为正式核查量身定制,并展示了其高效力。我们进行了模拟研究,并展示了在证据合成的规划阶段使用多剂辩论的情况。


Article 32

Title@2025-05-28 (3): A Tool for Generating Exceptional Behavior Tests With Large Language Models

Title: A Tool for Generating Exceptional Behavior Tests With Large Language Models Ein Tool zur Generierung außergewöhnlicher Verhaltenstests mit großen Sprachmodellen 生成使用大语言模式的特殊行为测试工具 2505.22818v1

Authors: Linghan Zhong, Samuel Yuan, Jiyang Zhang, Yu Liu, Pengyu Nie, Junyi Jessy Li, Milos Gligoric

Exceptional behavior tests (EBTs) are crucial in software development for verifying that code correctly handles unwanted events and throws appropriate exceptions. However, prior research has shown that developers often prioritize testing “happy paths”, e.g., paths without unwanted events, over exceptional scenarios. We present exLong, a framework that automatically generates EBTs to address this gap. exLong leverages a large language model (LLM) fine-tuned from CodeLlama and incorporates reasoning about exception-throwing traces, conditional expressions that guard throw statements, and non-exceptional behavior tests that execute similar traces. Our demonstration video illustrates how exLong can effectively assist developers in creating comprehensive EBTs for their projects (available at https://youtu.be/Jro8kMgplZk).

特殊行为测试(EBTs)对于软件开发至关重要,可以验证代码正确处理意外事件并丢弃适当的例外。 但是,先前的研究显示,开发者通常优先考虑测试“快乐路径 ” , 例如, 超特殊情况下的路径。 我们展示了 exLong , 这个框架自动生成 EBTs 来弥补这一差距。 exLong 利用了从代码Llama 微调的大型语言模型(LLM ) , 并包含了例外浏览痕迹的推理、 保护声明的有条件表达以及执行类似痕迹的非例外行为测试。 我们的演示视频展示了前Long 如何有效帮助开发者为项目创建全面的 EBTs ( https://youtu.be/Jro8kMgplZk ) 。
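
For readers unfamiliar with the distinction, the sketch below contrasts a happy-path test with an exceptional behavior test. It is a hand-written Python `unittest` illustration, not output of exLong (which, per the abstract, builds on CodeLlama). The `withdraw` function and the test names are made up for the example.

```python
import unittest


def withdraw(balance, amount):
    """Return the new balance; the guard conditions protect the raise statements."""
    if amount <= 0:
        raise ValueError("amount must be positive")
    if amount > balance:
        raise ValueError("insufficient funds")
    return balance - amount


class WithdrawTests(unittest.TestCase):
    def test_happy_path(self):
        # Non-exceptional behavior: no guard condition is triggered.
        self.assertEqual(withdraw(100, 30), 70)

    def test_rejects_overdraft(self):
        # Exceptional behavior test: drive execution into the guarded raise
        # and check both the exception type and its message.
        with self.assertRaisesRegex(ValueError, "insufficient funds"):
            withdraw(100, 130)
```

Saved as a module, these tests can be run with `python -m unittest`. The second test only passes if the input satisfies the guard `amount > balance`, which is the kind of condition the abstract says exLong reasons about.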


Article 33

Title@2025-05-28 (3): What Needs Attention? Prioritizing Drivers of Developers’ Trust and Adoption of Generative AI

Title: What Needs Attention? Prioritizing Drivers of Developers’ Trust and Adoption of Generative AI Was braucht Aufmerksamkeit? Priorisieren von Treibern des Entwicklervertrauens und der Annahme Generativer KI 需要注意什么?将开发者信任的驱动因素列为优先事项,并采用创新的AI 2505.17418v2

Authors: Rudrajit Choudhuri, Bianca Trinkenreich, Rahul Pandita, Eirini Kalliamvakou, Igor Steinmacher, Marco Gerosa, Christopher Sanchez, Anita Sarma

Generative AI (genAI) tools are advertised as productivity aids. Yet, issues related to miscalibrated trust and usage friction continue to hinder their adoption. Additionally, AI can be exclusionary, failing to support diverse users adequately, further exacerbating these concerns. One such aspect of diversity is cognitive diversity – variations in users’ cognitive styles – that leads to divergence in interaction styles. When an individual’s cognitive styles are unsupported, it creates additional barriers to technology adoption. Thus, to design tools that developers trust, we must first understand what factors affect their trust and intentions to use these tools in practice. We developed a theoretical model of factors influencing trust and adoption intentions towards genAI through a large-scale survey with developers (N=238) at GitHub and Microsoft. Using Partial Least Squares-Structural Equation Modeling (PLS-SEM), we found that genAI’s system/output quality, functional value, and goal maintenance significantly influence developers’ trust, which, along with their cognitive styles, affects their intentions to use these tools at work. An Importance-Performance Matrix Analysis (IPMA) identified factors that, despite their strong influence, underperform, revealing specific genAI aspects that need design prioritization. We bolster these findings by qualitatively analyzing developers’ perceived challenges and risks of genAI usage to uncover why these gaps persist in development contexts. For genAI to indeed be a true productivity aid rather than a disguised productivity sink, it must align with developers’ goals, maintain contextual transparency, reduce cognitive burden, and provide equitable interaction support. We provide practical suggestions to guide future genAI tool design for effective, trustworthy, and inclusive human-genAI interactions.

(genAI) 工具被公示为生产力辅助工具。然而,与错误校正的信任和使用摩擦有关的问题继续阻碍其采用。此外,AI可能具有排斥性,无法充分支持不同用户,从而进一步加剧了这些关切。多样性的一个方面是认知多样性 – – 用户认知风格的差异 – – 导致互动风格的差异。当一个人的认知风格得不到支持时,它会给采用技术造成更多的障碍。因此,设计开发者信任的工具时,我们必须首先了解哪些因素影响他们使用这些工具的实际信任和意图?我们开发了一个理论模型,通过在GitHub 和 Microsoft 与开发者进行大规模调查(N=238) ,影响对各种用户的信任和采纳意向,从而进一步加重了这些关切。我们发现,GiAI的系统/产出质量质量、功能价值和目标的保持会极大地影响开发者的信任,而这种认知风格及其认知风格,影响着他们使用这些工具的意图。


Article 34

Title@2025-05-28 (3): LabUtopia: High-Fidelity Simulation and Hierarchical Benchmark for Scientific Embodied Agents

Title: LabUtopia: High-Fidelity Simulation and Hierarchical Benchmark for Scientific Embodied Agents LabUtopia: High-Fidelity-Simulation und hierarchischer Benchmark für wissenschaftliche körpereigene Wirkstoffe LabUtopia:科学渗透剂的高纤维模拟和等级基准 2505.22634v1

Authors: Rui Li, Zixuan Hu, Wenxi Qu, Jinouwen Zhang, Zhenfei Yin, Sha Zhang, Xuantuo Huang, Hanqing Wang, Tai Wang, Jiangmiao Pang, Wanli Ouyang, Lei Bai, Wangmeng Zuo, Ling-Yu Duan, Dongzhan Zhou, Shixiang Tang

Scientific embodied agents play a crucial role in modern laboratories by automating complex experimental workflows. Compared to typical household environments, laboratory settings impose significantly higher demands on perception of physical-chemical transformations and long-horizon planning, making them an ideal testbed for advancing embodied intelligence. However, its development has long been hampered by the lack of suitable simulators and benchmarks. In this paper, we address this gap by introducing LabUtopia, a comprehensive simulation and benchmarking suite designed to facilitate the development of generalizable, reasoning-capable embodied agents in laboratory settings. Specifically, it integrates i) LabSim, a high-fidelity simulator supporting multi-physics and chemically meaningful interactions; ii) LabScene, a scalable procedural generator for diverse scientific scenes; and iii) LabBench, a hierarchical benchmark spanning five levels of complexity from atomic actions to long-horizon mobile manipulation. LabUtopia supports 30 distinct tasks and includes more than 200 scene and instrument assets, enabling large-scale training and principled evaluation in high-complexity environments. We demonstrate that LabUtopia offers a powerful platform for advancing the integration of perception, planning, and control in scientific-purpose agents and provides a rigorous testbed for exploring the practical capabilities and generalization limits of embodied intelligence in future research.


Article 35

Title@2025-05-28 (3): Smart Contracts for SMEs and Large Companies

Title: Smart Contracts for SMEs and Large Companies Intelligente Verträge für KMU und Großunternehmen 中小企业和大公司的智能合同 2505.22619v1

Authors: C. G. Liu, P. Bodorik, D. Jutla

Research on blockchains addresses multiple issues, one of which is writing smart contracts. In our previous research, we described a methodology and a tool for automatically generating smart contracts from BPMN models. The generated smart contracts support multi-step transactions that facilitate the repair and upgrade of smart contracts. In this paper, we show how the approach supports collaborations via smart contracts for companies ranging from SMEs with limited IT capabilities to companies with IT capabilities that use blockchain smart contracts. Furthermore, we show how, for certain applications, a BPMN modeler can use the approach to generate smart contracts without any knowledge of blockchain technology or smart contract development; we thus hope to facilitate the democratization of smart contracts and blockchain technology.


Article 36

Title@2025-05-28 (3): BPMN to Smart Contract by Business Analyst

Title: BPMN to Smart Contract by Business Analyst BPMN auf Smart Contract von Business Analyst 商业分析员将BPMN改为智能合同 2505.22612v1

Authors: C. G. Liu, P. Bodorik, D. Jutla

This paper addresses the challenge of creating smart contracts for applications represented using Business Process Model and Notation (BPMN) models. In our prior work, we presented a methodology that automates the generation of smart contracts from BPMN models. This approach abstracts the BPMN flow control, making it independent of the underlying blockchain infrastructure, with only the BPMN task elements requiring coding. In subsequent research, we enhanced our approach by adding support for nested transactions and enabling smart contract repair and/or upgrade. To empower Business Analysts (BAs), we tackled the challenge of generating smart contracts from BPMN models without the assistance of a software developer. We exploit the Decision Model and Notation (DMN) standard to represent the decisions and the business logic of the BPMN task elements, and we amended our methodology for transforming BPMN models into smart contracts so that it also generates scripts for the business logic represented by the DMN models. To support this transformation, we describe how the BA documents, using BPMN elements, the flow of information along with the flow of execution. Thus, if the BA succeeds in representing the blockchain application requirements using BPMN and DMN models, our methodology and the tool we developed as a proof of concept, called TABS, are used to generate the smart contracts directly from those models without developer assistance.


Article 37

Title@2025-05-28 (3): GitGoodBench: A Novel Benchmark For Evaluating Agentic Performance On Git

Title: GitGoodBench: A Novel Benchmark For Evaluating Agentic Performance On Git GitGoodBench: Ein neuartiger Benchmark für die Bewertung Agentischer Performance auf Git GitGoodbunch:评估基特生物表现的新基准 2505.22583v1

Authors: Tobias Lindenbauer, Egor Bogomolov, Yaroslav Zharov

Benchmarks for Software Engineering (SE) AI agents, most notably SWE-bench, have catalyzed progress in programming capabilities of AI agents. However, they overlook critical developer workflows such as Version Control System (VCS) operations. To address this issue, we present GitGoodBench, a novel benchmark for evaluating AI agent performance on VCS tasks. GitGoodBench covers three core Git scenarios extracted from permissive open-source Python, Java, and Kotlin repositories. Our benchmark provides three datasets: a comprehensive evaluation suite (900 samples), a rapid prototyping version (120 samples), and a training corpus (17,469 samples). We establish baseline performance on the prototyping version of our benchmark using GPT-4o equipped with custom tools, achieving a 21.11% solve rate overall. We expect GitGoodBench to serve as a crucial stepping stone toward truly comprehensive SE agents that go beyond mere programming.


Article 38

Title@2025-05-28 (3): LAMBDA: A Large Model Based Data Agent

Title: LAMBDA: A Large Model Based Data Agent LAMBDA: Ein großer modellbasierter Datenagent LAMBDA:一个大型模型数据代理 2407.17535v3

Authors: Maojun Sun, Ruijian Han, Binyan Jiang, Houduo Qi, Defeng Sun, Yancheng Yuan, Jian Huang

We introduce LArge Model Based Data Agent (LAMBDA), a novel open-source, code-free multi-agent data analysis system that leverages the power of large language models. LAMBDA is designed to address data analysis challenges in data-driven applications through innovatively designed data agents using natural language. At the core of LAMBDA are two key agent roles: the programmer and the inspector, which are engineered to work together seamlessly. Specifically, the programmer generates code based on the user’s instructions and domain-specific knowledge, while the inspector debugs the code when necessary. To ensure robustness and handle adverse scenarios, LAMBDA features a user interface that allows direct user intervention. Moreover, LAMBDA can flexibly integrate external models and algorithms through our proposed Knowledge Integration Mechanism, catering to the needs of customized data analysis. LAMBDA has demonstrated strong performance on various data analysis tasks. It has the potential to enhance data analysis paradigms by seamlessly integrating human and artificial intelligence, making it more accessible, effective, and efficient for users from diverse backgrounds. The strong performance of LAMBDA in solving data analysis problems is demonstrated using real-world data examples. The code for LAMBDA is available at https://github.com/AMA-CMFAI/LAMBDA and videos of three case studies can be viewed at https://www.polyu.edu.hk/ama/cmfai/lambda.html.


Article 39

Title@2025-05-28 (3): Advancing Expert Specialization for Better MoE

Title: Advancing Expert Specialization for Better MoE Advancing Experten-Spezialisierung für bessere MoE 推进专家专业专业促进改善教育部 2505.22323v1

Authors: Hongcan Guo, Haolang Lu, Guoshun Nan, Bolun Chu, Jialin Zhuang, Yuan Yang, Wenhao Che, Sicong Leng, Qimei Cui, Xudong Jiang

Mixture-of-Experts (MoE) models enable efficient scaling of large language models (LLMs) by activating only a subset of experts per input. However, we observe that the commonly used auxiliary load balancing loss often leads to expert overlap and overly uniform routing, which hinders expert specialization and degrades overall performance during post-training. To address this, we propose a simple yet effective solution that introduces two complementary objectives: (1) an orthogonality loss to encourage experts to process distinct types of tokens, and (2) a variance loss to encourage more discriminative routing decisions. Gradient-level analysis demonstrates that these objectives are compatible with the existing auxiliary loss and contribute to optimizing the training process. Experimental results over various model architectures and across multiple benchmarks show that our method significantly enhances expert specialization. Notably, our method improves classic MoE baselines with auxiliary loss by up to 23.79%, while also maintaining load balancing in downstream tasks, without any architectural modifications or additional components. We will release our code to contribute to the community.
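
The two objectives named in the abstract can be illustrated with a small sketch. The formulas below are assumptions chosen for illustration only, since the abstract does not give the paper's exact definitions: `orthogonality_loss` penalizes overlap between the outputs of different experts for one token, and `variance_loss` rewards routing scores that are spread out rather than uniform. Both function names are hypothetical.

```python
from itertools import combinations
from statistics import pvariance


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthogonality_loss(expert_outputs):
    """Mean squared dot product over all pairs of distinct experts.

    Assumed form, not the paper's definition. Requires at least two experts.
    The value is 0 when all expert outputs are mutually orthogonal.
    """
    pairs = list(combinations(expert_outputs, 2))
    return sum(dot(u, v) ** 2 for u, v in pairs) / len(pairs)


def variance_loss(routing_scores):
    """Negative population variance of one token's routing scores.

    Assumed form, not the paper's definition. Minimizing this value
    pushes the router away from near-uniform scores.
    """
    return -pvariance(routing_scores)
```

In this sketch, a router that scores two experts `[1.0, 0.0]` receives a lower (better) variance loss than one that scores them `[0.5, 0.5]`, which matches the abstract's stated aim of more discriminative routing.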


Article 40

Title@2025-05-28 (3): Evolution of repositories and privacy laws: commit activities in the GDPR and CCPA era

Title: Evolution of repositories and privacy laws: commit activities in the GDPR and CCPA era Entwicklung von Repositorys und Datenschutzgesetzen: Aktivitäten in der DSGVO und CCPA-Ära verpflichten 保管库和隐私法的演变演变:在GDPR和CCPA时代开展活动 2505.22234v1

Authors: Georgia M. Kapitsaki, Maria Papoutsoglou

Free and open source software has gained a lot of momentum in the industry and the research community. The latest advances in privacy legislation, including the EU General Data Protection Regulation (GDPR) and the California Consumer Privacy Act (CCPA), have forced the community to pay special attention to users’ data privacy. The main aim of this work is to examine software repositories that are acting on privacy laws. We have collected commit data from GitHub repositories in order to understand how the main data privacy laws (GDPR, CCPA, CPRA, UK DPA) have been reflected in recent years. Via an automated process, we analyzed 37,213 commits from 12,391 repositories since 2016, whereas 594 commits from the 70 most popular repositories of the dataset were manually analyzed. We observe that most commits were performed in the year the law came into effect and that privacy-relevant terms appear in the commit messages, whereas reference to specific data privacy user rights is scarce. The study showed that more educational activities on data privacy user rights are needed, as well as tools for privacy recommendations, whereas verifying actual compliance via source code execution is a useful direction for software engineering researchers.
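
As a rough illustration of the kind of automated commit-message screening the abstract describes, the sketch below counts how many messages mention each law acronym. The keyword list and the function name `count_law_mentions` are assumptions made here for illustration; the abstract does not state which terms or tooling the authors used.

```python
import re
from collections import Counter

# Hypothetical keyword list. "DPA" alone can also stand for other things,
# so a real study would need extra filtering to avoid false positives.
LAW_PATTERN = re.compile(r"\b(GDPR|CCPA|CPRA|DPA)\b", re.IGNORECASE)


def count_law_mentions(messages):
    """Count, per law acronym, how many commit messages mention it at least once."""
    counts = Counter()
    for message in messages:
        # A set ensures a message that repeats an acronym is counted once.
        mentioned = {match.upper() for match in LAW_PATTERN.findall(message)}
        counts.update(mentioned)
    return counts
```

Matching is case-insensitive, so `gdpr` and `GDPR` are counted under the same key.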


Article 41

Title@2025-05-28 (3): Thermal Modeling and Optimal Allocation of Avionics Safety-critical Tasks on Heterogeneous MPSoCs

Title: Thermal Modeling and Optimal Allocation of Avionics Safety-critical Tasks on Heterogeneous MPSoCs Thermische Modellierung und optimale Allokation von Avionik Sicherheitskritische Aufgaben auf heterogenen MPSoCs 热建模和最佳分配航空气象安全关键任务 2505.22214v1

Authors: Ondřej Benedikt, Michal Sojka, Přemysl Šůcha, Pavel Zaykov, Zdeněk Hanzálek

Multi-Processor Systems-on-Chip (MPSoC) can deliver high performance needed in many industrial domains, including aerospace. However, their high power consumption, combined with avionics safety standards, brings new thermal management challenges. This paper investigates techniques for offline thermal-aware allocation of periodic tasks on heterogeneous MPSoCs running at a fixed clock frequency, as required in avionics. The goal is to find the assignment of tasks to (i) cores and (ii) temporal isolation windows while minimizing the MPSoC temperature. To achieve that, we propose and analyze three power models, and integrate them within several novel optimization approaches based on heuristics, a black-box optimizer, and Integer Linear Programming (ILP). We perform the experimental evaluation on three popular MPSoC platforms (NXP i.MX8QM MEK, NXP i.MX8QM Ixora, NVIDIA TX2) and observe a difference of up to 5.5 °C among the tested methods (corresponding to a 22% reduction w.r.t. the ambient temperature). We also show that our method, integrating the empirical power model with the ILP, outperforms the other methods on all tested platforms.


Article 42

Title@2025-05-28 (3): Towards Conversational Development Environments: Using Theory-of-Mind and Multi-Agent Architectures for Requirements Refinement

Title: Towards Conversational Development Environments: Using Theory-of-Mind and Multi-Agent Architectures for Requirements Refinement Hin zu konversatorischen Entwicklungsumgebungen: Verwendung von Theorie-von-Mind- und Multi-Agent-Architekturen für Anforderungen Verfeinerung 走向对话型发展环境:利用理论和多机构架构改进要求 2505.20973v2

Authors: Keheliya Gallaba, Ali Arabat, Dayi Lin, Mohammed Sayagh, Ahmed E. Hassan

Foundation Models (FMs) have shown remarkable capabilities in various natural language tasks. However, their ability to accurately capture stakeholder requirements remains a significant challenge for using FMs for software development. This paper introduces a novel approach that leverages an FM-powered multi-agent system called AlignMind to address this issue. By having a cognitive architecture that enhances FMs with Theory-of-Mind capabilities, our approach considers the mental states and perspectives of software makers. This allows our solution to iteratively clarify the beliefs, desires, and intentions of stakeholders, translating these into a set of refined requirements and a corresponding actionable natural language workflow in the often-overlooked requirements refinement phase of software engineering, which is crucial after initial elicitation. Through a multifaceted evaluation covering 150 diverse use cases, we demonstrate that our approach can accurately capture the intents and requirements of stakeholders, articulating them as both specifications and a step-by-step plan of action. Our findings suggest that the potential for significant improvements in the software development process justifies these investments. Our work lays the groundwork for future innovation in building intent-first development environments, where software makers can seamlessly collaborate with AIs to create software that truly meets their needs.


Article 43

Title@2025-05-28 (3): Towards Practical Defect-Focused Automated Code Review

Title: Towards Practical Defect-Focused Automated Code Review Auf dem Weg zu einer praktischen fehlerorientierten automatisierten Code-Überprüfung 走向实际失效-受污染的自动编码审查 2505.17928v2

Authors: Junyi Lu, Lili Jiang, Xiaojia Li, Jianbing Fang, Fengjun Zhang, Li Yang, Chun Zuo

The complexity of code reviews has driven efforts to automate review comments, but prior approaches oversimplify this task by treating it as snippet-level code-to-text generation and relying on text similarity metrics like BLEU for evaluation. These methods overlook repository context, real-world merge request evaluation, and defect detection, limiting their practicality. To address these issues, we explore the full automation pipeline within the online recommendation service of a company with nearly 400 million daily active users, analyzing industry-grade C++ codebases comprising hundreds of thousands of lines of code. We identify four key challenges: 1) capturing relevant context, 2) improving key bug inclusion (KBI), 3) reducing false alarm rates (FAR), and 4) integrating human workflows. To tackle these, we propose 1) code slicing algorithms for context extraction, 2) a multi-role LLM framework for KBI, 3) a filtering mechanism for FAR reduction, and 4) a novel prompt design for better human interaction. Our approach, validated on real-world merge requests from historical fault reports, achieves a 2x improvement over standard LLMs and a 10x gain over previous baselines. While the presented results focus on C++, the underlying framework design leverages language-agnostic principles (e.g., AST-based analysis), suggesting potential for broader applicability.


Article 44

Title@2025-05-28 (3): SVA-ICL: Improving LLM-based Software Vulnerability Assessment via In-Context Learning and Information Fusion

Title: SVA-ICL: Improving LLM-based Software Vulnerability Assessment via In-Context Learning and Information Fusion SVA-ICL: Verbesserung der LLM-basierten Software Vulnerability Assessment durch In-Context Learning und Information Fusion SVA-ICL:通过文内学习和信息融合改进基于LLM的软件脆弱性评估 2505.10008v2

Authors: Chaoyang Gao, Xiang Chen, Guangbei Zhang

Context: Software vulnerability assessment (SVA) is critical for identifying, evaluating, and prioritizing security weaknesses in software applications. Objective: Despite the increasing application of large language models (LLMs) in various software engineering tasks, their effectiveness in SVA remains underexplored. Method: To address this gap, we introduce a novel approach, SVA-ICL, which leverages in-context learning (ICL) to enhance LLM performance. Our approach involves the selection of high-quality demonstrations for ICL through information fusion, incorporating both source code and vulnerability descriptions. For source code, we consider semantic, lexical, and syntactic similarities, while for vulnerability descriptions, we focus on textual similarity. Based on the selected demonstrations, we construct context prompts and consider DeepSeek-V2 as the LLM for SVA-ICL. Results: We evaluate the effectiveness of SVA-ICL using a large-scale dataset comprising 12,071 C/C++ vulnerabilities. Experimental results demonstrate that SVA-ICL outperforms state-of-the-art SVA baselines in terms of Accuracy, F1-score, and MCC measures. Furthermore, ablation studies highlight the significance of component customization in SVA-ICL, such as the number of demonstrations, the demonstration ordering strategy, and the optimal fusion ratio of different modalities. Conclusion: Our findings suggest that leveraging ICL with information fusion can improve the effectiveness of LLM-based SVA, warranting further research in this direction.
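
The demonstration-selection step can be sketched as a weighted fusion of a code similarity and a description similarity, followed by a top-k cut. This is a minimal sketch under stated assumptions: token-level Jaccard overlap stands in for the semantic, lexical, and syntactic measures mentioned in the abstract, and the names `jaccard`, `select_demonstrations`, and `code_weight` (a stand-in for the fusion ratio) are hypothetical.

```python
def jaccard(a, b):
    """Jaccard overlap of whitespace-separated tokens (a stand-in similarity)."""
    tokens_a, tokens_b = set(a.split()), set(b.split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def select_demonstrations(query, pool, k=2, code_weight=0.5):
    """Return the k pool entries with the highest fused similarity to the query.

    Each entry is a dict with "code" and "description" keys. The fused score
    is code_weight * code similarity + (1 - code_weight) * text similarity.
    """
    def fused(demo):
        code_sim = jaccard(query["code"], demo["code"])
        text_sim = jaccard(query["description"], demo["description"])
        return code_weight * code_sim + (1 - code_weight) * text_sim

    return sorted(pool, key=fused, reverse=True)[:k]
```

The selected entries would then be placed into the prompt as in-context examples; varying `code_weight` corresponds loosely to the fusion-ratio ablation the abstract mentions.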


Article 45

Title@2025-05-28 (3): Jailbreak Distillation: Renewable Safety Benchmarking

Title: Jailbreak Distillation: Renewable Safety Benchmarking Jailbreak Destillation: Benchmarking für erneuerbare Sicherheit 蒸馏:可再生能源安全基准 2505.22037v1

Authors: Jingyu Zhang, Ahmed Elgohary, Xiawei Wang, A S M Iftekhar, Ahmed Magooda, Benjamin Van Durme, Daniel Khashabi, Kyle Jackson

Large language models (LLMs) are rapidly deployed in critical applications, raising urgent needs for robust safety benchmarking. We propose Jailbreak Distillation (JBDistill), a novel benchmark construction framework that “distills” jailbreak attacks into high-quality and easily-updatable safety benchmarks. JBDistill utilizes a small set of development models and existing jailbreak attack algorithms to create a candidate prompt pool, then employs prompt selection algorithms to identify an effective subset of prompts as safety benchmarks. JBDistill addresses challenges in existing safety evaluation: the use of consistent evaluation prompts across models ensures fair comparisons and reproducibility. It requires minimal human effort to rerun the JBDistill pipeline and produce updated benchmarks, alleviating concerns on saturation and contamination. Extensive experiments demonstrate our benchmarks generalize robustly to 13 diverse evaluation models held out from benchmark construction, including proprietary, specialized, and newer-generation LLMs, significantly outperforming existing safety benchmarks in effectiveness while maintaining high separability and diversity. Our framework thus provides an effective, sustainable, and adaptable solution for streamlining safety evaluation.
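
The pipeline described above (build a candidate prompt pool, then select an effective subset) can be sketched with a simple greedy coverage heuristic. This is an assumed stand-in, not one of the paper's prompt selection algorithms, which the abstract does not specify; the function name `select_prompts` and the input layout are hypothetical.

```python
def select_prompts(success, budget):
    """Greedily pick up to `budget` prompts that together break as many
    development models as possible.

    `success` maps each candidate prompt to the set of development models
    on which the corresponding attack succeeded.
    """
    chosen, covered = [], set()
    remaining = dict(success)
    while remaining and len(chosen) < budget:
        # Rank by newly covered models, then by total successes.
        # Iterating over sorted() keeps tie-breaking deterministic.
        best = max(
            sorted(remaining),
            key=lambda p: (len(remaining[p] - covered), len(remaining[p])),
        )
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen
```

Rerunning such a selection with fresh attacks or development models is what would make a benchmark of this kind cheap to renew.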


Article 46

Title@2025-05-28 (3): Securing the Software Package Supply Chain for Critical Systems

Title: Securing the Software Package Supply Chain for Critical Systems Sicherung der Softwarepaket-Lieferkette für kritische Systeme 保障关键系统软件包供应链 2505.22023v1

Authors: Ritwik Murali, Akash Ravi

Software systems have become an indispensable commodity across industries, and almost all essential services depend on them for effective operation. Software is no longer an independent, stand-alone piece of code written by a single developer but rather a collection of packages designed by multiple developers across the globe. Ensuring the reliability and resilience of these systems is crucial because emerging threats target software supply chains, as demonstrated by the widespread SolarWinds hack in late 2020. These supply chains extend beyond patches and updates, involving distribution networks throughout the software lifecycle. Industries like smart grids, manufacturing, healthcare, and finance rely on interconnected software systems and their dependencies for effective functioning. To secure software modules and add-ons, robust distribution architectures are essential. The proposed chapter enhances the existing delivery frameworks by including a permissioned ledger with Proof of Authority consensus and multi-party signatures. The proposed system aims to prevent attacks while permitting every stakeholder to verify the same. Critical systems can interface with the secure pipeline without disrupting existing functionalities, thus preventing the cascading effect of an attack at any point in the supply chain.
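
The multi-party signature idea can be illustrated with a k-of-n approval check: a release is accepted only if enough known authorities have signed its digest. This is a simplified sketch under loud assumptions. HMAC with shared keys stands in for real public-key signatures, the ledger and the Proof of Authority consensus are not modeled, and the names `sign` and `is_release_approved` are hypothetical.

```python
import hashlib
import hmac


def sign(key: bytes, package: bytes) -> str:
    """Tag the SHA-256 digest of a package with a signer's key (HMAC stand-in)."""
    digest = hashlib.sha256(package).digest()
    return hmac.new(key, digest, hashlib.sha256).hexdigest()


def is_release_approved(package, signatures, authority_keys, threshold):
    """Return True if at least `threshold` known authorities validly signed.

    `signatures` maps signer name to signature, so each signer counts once.
    Signers missing from `authority_keys` are ignored.
    """
    valid = 0
    for signer, signature in signatures.items():
        key = authority_keys.get(signer)
        if key is not None and hmac.compare_digest(sign(key, package), signature):
            valid += 1
    return valid >= threshold
```

Because every signature covers the package digest, any change to the package bytes invalidates all existing approvals, which is the property that limits cascading compromise in this sketch.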


Article 47

Title@2025-05-28 (3): How Do Experts Make Sense of Integrated Process Models?

Title: How Do Experts Make Sense of Integrated Process Models? Wie verstehen Experten integrierte Prozessmodelle? 专家如何看待综合进程模式? 2505.20667v2

Authors: Tianwa Chen, Barbara Weber, Graeme Shanks, Gianluca Demartini, Marta Indulska, Shazia Sadiq

A range of integrated modeling approaches have been developed to enable a holistic representation of business process logic together with all relevant business rules. These approaches address inherent problems with separate documentation of business process models and business rules. In this study, we explore how expert process workers make sense of the information provided through such integrated modeling approaches. To do so, we complement verbal protocol analysis with eye-tracking metrics to reveal nuanced user behaviours involved in the main phases of sensemaking, namely information foraging and information processing. By studying expert process workers engaged in tasks based on integrated modeling of business processes and rules, we provide insights that pave the way for a better understanding of sensemaking practices and improved development of business process and business rule integration approaches. Our research underscores the importance of offering personalized support mechanisms that increase the efficacy and efficiency of sensemaking practices for process knowledge workers.


Article 48

Title@2025-05-28 (3): System-driven Cloud Architecture Design Support with Structured State Management and Guided Decision Assistance

Title: System-driven Cloud Architecture Design Support with Structured State Management and Guided Decision Assistance Systemgesteuerte Cloud-Architektur-Design-Unterstützung mit strukturiertem Staatsmanagement und beratender Entscheidungshilfe 提供结构化国家管理和指导决策援助的系统驱动云层结构设计支持 2505.20701v2

Authors: Ryosuke Kohita, Akira Kasuga

Cloud architecture design is a complex process requiring both technical expertise and architectural knowledge to develop solutions from frequently ambiguous requirements. We present CloudArchitectBuddy, a system-driven cloud architecture design support application with two key mechanisms: (1) structured state management that enhances design understanding through explicit representation of requirements and architectural decisions, and (2) guided decision assistance that facilitates design progress through proactive verification and requirement refinement. Our study with 16 industry practitioners showed that while our approach achieved comparable design quality to a chat interface, participants rated our system higher for usability and appreciated its ability to help understand architectural relationships and identify missing requirements. However, participants also expressed a need for user-initiated interactions where they could freely provide design instructions and engage in detailed discussions with LLMs. These results suggest that integrating a chat interface into our structured and guided workflow approach would create a more practical solution, balancing systematic design support with conversational flexibility for comprehensive cloud architecture development.


Article 49

Title@2025-05-28 (3): Larger Is Not Always Better: Exploring Small Open-source Language Models in Logging Statement Generation

Title: Larger Is Not Always Better: Exploring Small Open-source Language Models in Logging Statement Generation Größere ist nicht immer besser: Erforschen von kleinen Open-Source-Sprachenmodellen bei der Erstellung von Protokollierungsanweisungen 大并非总是更好:探索记录报表生成中的小型开放源语言模式 2505.16590v2

Authors: Renyi Zhong, Yichen Li, Guangba Yu, Wenwei Gu, Jinxi Kuang, Yintong Huo, Michael R. Lyu

Developers use logging statements to create logs that document system behavior and aid in software maintenance. As such, high-quality logging is essential for effective maintenance; however, manual logging often leads to errors and inconsistency. Recent methods emphasize using large language models (LLMs) for automated logging statement generation, but these present privacy and resource issues, hindering their suitability for enterprise use. This paper presents the first large-scale empirical study evaluating small open-source language models (SOLMs) for automated logging statement generation. We evaluate four prominent SOLMs using various prompt strategies and parameter-efficient fine-tuning techniques, such as Low-Rank Adaptation (LoRA) and Retrieval-Augmented Generation (RAG). Our results show that fine-tuned SOLMs with LoRA and RAG prompts, particularly Qwen2.5-coder-14B, outperform existing tools and LLM baselines in predicting logging locations and generating high-quality statements, with robust generalization across diverse repositories. These findings highlight SOLMs as a privacy-preserving, efficient alternative for automated logging.


Article 50

Title@2025-05-28 (3): Co-Saving: Resource Aware Multi-Agent Collaboration for Software Development

Title: Co-Saving: Resource Aware Multi-Agent Collaboration for Software Development Co-Saving: Ressourcenschonende Multi-Agenten-Kollaboration für Software-Entwicklung 共同节省:为开发软件进行有意识的资源、多机构协作 2505.21898v1

Authors: Rennai Qiu, Chen Qian, Ran Li, Yufan Dang, Weize Chen, Cheng Yang, Yingli Zhang, Ye Tian, Xuantang Xiong, Lei Han, Zhiyuan Liu, Maosong Sun

Recent advancements in Large Language Models (LLMs) and autonomous agents have demonstrated remarkable capabilities across various domains. However, standalone agents frequently encounter limitations when handling complex tasks that demand extensive interactions and substantial computational resources. Although Multi-Agent Systems (MAS) alleviate some of these limitations through collaborative mechanisms like task decomposition, iterative communication, and role specialization, they typically remain resource-unaware, incurring significant inefficiencies due to high token consumption and excessive execution time. To address these limitations, we propose a resource-aware multi-agent system – Co-Saving (meaning that multiple agents collaboratively engage in resource-saving activities), which leverages experiential knowledge to enhance operational efficiency and solution quality. Our key innovation is the introduction of “shortcuts” – instructional transitions learned from historically successful trajectories – which allow redundant reasoning agents to be bypassed and expedite the collective problem-solving process. Experiments on software development tasks demonstrate significant advantages over existing methods. Specifically, compared to the state-of-the-art MAS ChatDev, our method achieves an average reduction of 50.85% in token usage and improves the overall code quality by 10.06%.

大语言模型(LLM)和自主智能体的最新进展在各个领域展现了卓越的能力。然而,单个智能体在处理需要大量交互和大量计算资源的复杂任务时常常受到限制。尽管多智能体系统(MAS)通过任务分解、迭代通信和角色专业化等协作机制缓解了其中一些限制,但它们通常缺乏资源感知,因高 token 消耗和过长的执行时间而效率低下。为了解决这些限制,我们提出了一个资源感知的多智能体系统 Co-Saving(意为多个智能体协作开展资源节约活动),利用经验知识提高运行效率和解决方案质量。我们的关键创新是引入“捷径”(从历史成功轨迹中学到的指令性转移),从而绕过冗余的推理智能体,加快集体解决问题的进程。软件开发任务上的实验显示出相对现有方法的显著优势。具体而言,与最先进的 MAS ChatDev 相比,我们的方法平均减少了 50.85% 的 token 使用量,并将总体代码质量提高了 10.06%。


Article 51

Title@2025-05-27 (2): Augmenting Software Bills of Materials with Software Vulnerability Description: A Preliminary Study on GitHub

Title: Augmenting Software Bills of Materials with Software Vulnerability Description: A Preliminary Study on GitHub Augmenting Software Bills of Materials with Software Vulnerability Beschreibung: Eine Vorstudie zu GitHub 增加具有软件脆弱性说明的软件材料账单:关于GitHub的初步研究 2503.13998v2

Authors: Davide Fucci, Massimiliano Di Penta, Simone Romano, Giuseppe Scanniello

Software Bills of Materials (SBOMs) are becoming an established way to describe software composition, often enforced by governmental regulations. However, based on recent studies, SBOMs suffer from limited support for their consumption and lack information beyond simple dependencies, especially regarding software vulnerabilities. This paper reports the results of a preliminary study in which we augmented SBOMs of 40 open-source projects with information about Common Vulnerabilities and Exposures (CVE) exposed by project dependencies. Our augmented SBOMs have been evaluated by submitting pull requests and by asking project owners to answer a survey. Although, in most cases, augmented SBOMs were not directly accepted because owners required a continuous SBOM update, the received feedback shows the usefulness of the suggested SBOM augmentation.

软体材料账单(SBOMs)正在成为一个综合的、往往通过政府条例强制执行的描述软件构成的方法,然而,根据最近的研究,SBOMs的消费支持有限,而且缺乏超出简单依赖范围的信息,特别是软件脆弱性方面的信息。本文报告了初步研究的结果,在初步研究中,我们增加了40个开放源码项目的SBOM,并提供了关于项目依赖性暴露的共同脆弱性和暴露(CVE)的信息。我们扩大的SBOMs是通过提交拉动请求和要求项目所有人回答调查来进行评估的。虽然在大多数情况下,扩大的SBOMs并没有被直接接受,因为所有者需要不断的SBOM更新,但收到的反馈表明建议的SBOM扩增的有用性。


Article 52

Title@2025-05-27 (2): Leveraging XP and CRISP-DM for Agile Data Science Projects

Title: Leveraging XP and CRISP-DM for Agile Data Science Projects Nutzung von XP und CRISP-DM für agile Data Science Projekte 利用XP和CRISP-DM为敏感数据科学项目发挥杠杆作用 2505.21603v1

Authors: Andre Massahiro Shimaoka, Renato Cordeiro Ferreira, Alfredo Goldman

This study explores the integration of eXtreme Programming (XP) and the Cross-Industry Standard Process for Data Mining (CRISP-DM) in agile Data Science projects. We conducted a case study at the e-commerce company Elo7 to answer the research question: How can the agility of the XP method be integrated with CRISP-DM in Data Science projects? Data was collected through interviews and questionnaires with a Data Science team consisting of data scientists, ML engineers, and data product managers. The results show that 86% of the team frequently or always applies CRISP-DM, while 71% adopt XP practices in their projects. Furthermore, the study demonstrates that it is possible to combine CRISP-DM with XP in Data Science projects, providing a structured and collaborative approach. Finally, the study generated improvement recommendations for the company.

这项研究探索了将电子Xtreme方案(XP)和数据开采跨行业标准程序(CRISP-DM)纳入灵活数据科学项目的问题,我们在电子商务公司Elo7进行了案例研究,以回答研究问题:如何将XP方法的灵敏性与CRIPS-DM纳入数据科学项目?数据是通过与由数据科学家、ML工程师和数据产品管理人员组成的数据科学小组的访谈和问卷收集的。结果显示,86%的团队经常或始终采用CRIIS-DM,71%的团队在其项目中采用XP做法。此外,研究还表明,在数据科学项目中将CRISP-DM与XP结合起来,提供有条理和协作的方法;最后,研究为该公司提出了改进建议。


Article 53

Title@2025-05-27 (2): JITScope: Interactive Visualization of JIT Compiler IR Transformations

Title: JITScope: Interactive Visualization of JIT Compiler IR Transformations JITScope: Interaktive Visualisierung von JIT Compiler IR-Transformationen JIT编辑器 IR 转换的交互式视觉化 2505.21599v1

Authors: Kyra Dalbo, Yumna Ahmed, HeuiChan Lim

The complexity of modern Just-In-Time (JIT) compiler optimization poses significant challenges for developers seeking to understand and debug intermediate representation (IR) behavior. This work introduces JITScope, an interactive visualization framework that illustrates how IR nodes and instructions evolve across compilation phases. The system features a full-stack architecture: a Python-based backend transforms raw JSON-formatted IR data (representing an abstract model of the JIT compiler IR) into a normalized SQLite database; a controller layer serves processed CSV data; and a D3.js-powered frontend renders an interactive, phase-aware graph of IR node transformations. The design emphasizes modularity, traceability, and flexibility. Our roadmap explores intuitive visual representations of phase-level changes in IR node connectivity, values, and access patterns. Ultimately, JITScope lays a foundation for future tooling that enables visual exploration of IR evolution, including phase filtering, value tracking, and function-access mapping, offering a new lens into the behaviors and impacts of compiler optimizations.

现代 Just-In-Time (JIT) 编译器优化的复杂性为寻求理解和调试中间代表行为(IR) 的开发者带来了重大挑战。 这项工作引入了JITScope( JITScope), 这是一个互动的可视化框架, 说明IR节点和指令在编译阶段的演进。 系统具有全堆式结构: 基于 Python 的后端将原始 JSON- Formated IR 数据转换成一个抽象的 JIT 编译器 IR- in 到一个正常的 SQLite 数据库; 一个控制层服务于处理过的 CSV 数据; 以及 D3. js 动力的前端将IR 节点转换制成一个互动的、 相觉知的图像图。 设计强调模块性、 可追踪性和灵活性。 我们的路线图探索了IR 节点连接、 值和访问模式的阶段级变化的直观性视觉描述。 最后, JITSCope 为未来工具提供了一个基础, 以便能够对IR 演化进行直视探索, , 包括阶段过滤、 跟踪、 和功能访问访问到编译器行为和影响的新镜头。
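
The backend stage described in this abstract (raw JSON IR into a normalized SQLite database, then CSV for the frontend) could be sketched roughly as follows. The JSON layout, table names, and column names here are assumptions made for illustration; the abstract does not specify JITScope's actual schema.

```python
import csv
import io
import json
import sqlite3

# Hypothetical raw IR dump: one record per node per compilation phase.
RAW_IR = json.dumps([
    {"phase": "parse", "node_id": 1, "opcode": "Const", "inputs": []},
    {"phase": "parse", "node_id": 2, "opcode": "Add", "inputs": [1, 1]},
    {"phase": "fold", "node_id": 2, "opcode": "Const", "inputs": []},
])


def load_ir(conn, raw_json):
    """Normalize the JSON records into phase, node and edge tables."""
    conn.executescript("""
        CREATE TABLE phase (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
        CREATE TABLE node (phase_id INTEGER, node_id INTEGER, opcode TEXT,
                           PRIMARY KEY (phase_id, node_id));
        CREATE TABLE edge (phase_id INTEGER, src INTEGER, dst INTEGER);
    """)
    for rec in json.loads(raw_json):
        conn.execute("INSERT OR IGNORE INTO phase (name) VALUES (?)",
                     (rec["phase"],))
        (phase_id,) = conn.execute(
            "SELECT id FROM phase WHERE name = ?", (rec["phase"],)).fetchone()
        conn.execute("INSERT INTO node VALUES (?, ?, ?)",
                     (phase_id, rec["node_id"], rec["opcode"]))
        # One edge row per input operand of this node.
        conn.executemany("INSERT INTO edge VALUES (?, ?, ?)",
                         [(phase_id, src, rec["node_id"])
                          for src in rec["inputs"]])
    conn.commit()


def node_history_csv(conn, node_id):
    """Return one node's opcode per phase as CSV text, in phase order."""
    rows = conn.execute(
        "SELECT p.name, n.opcode FROM node n "
        "JOIN phase p ON p.id = n.phase_id "
        "WHERE n.node_id = ? ORDER BY p.id", (node_id,)).fetchall()
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["phase", "opcode"])
    writer.writerows(rows)
    return buf.getvalue()


conn = sqlite3.connect(":memory:")
load_ir(conn, RAW_IR)
print(node_history_csv(conn, 2))
```

In this toy data, node 2 starts as an `Add` and becomes a `Const` after the hypothetical `fold` phase, which is the kind of per-phase change a phase-aware frontend would plot.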


Article 54

Title@2025-05-27 (2): GUARD:Dual-Agent based Backdoor Defense on Chain-of-Thought in Neural Code Generation

Title: GUARD:Dual-Agent based Backdoor Defense on Chain-of-Thought in Neural Code Generation GUARD:Dual-Agent-basierte Backdoor-Verteidigung auf Ketten-of-Thought in Neural Code Generation GUARD: 在神经代码生成过程中寻求的连锁研究中,基于 “ 以企业为基地 “ 的后门防御 2505.21425v1

Authors: Naizhu Jin, Zhong Li, Tian Zhang, Qingkai Zeng

With the widespread application of large language models in code generation, recent studies demonstrate that employing additional Chain-of-Thought generation models can significantly enhance code generation performance by providing explicit reasoning steps. However, as external components, CoT models are particularly vulnerable to backdoor attacks, which existing defense mechanisms often fail to detect effectively. To address this challenge, we propose GUARD, a novel dual-agent defense framework specifically designed to counter CoT backdoor attacks in neural code generation. GUARD integrates two core components: GUARD-Judge, which identifies suspicious CoT steps and potential triggers through comprehensive analysis, and GUARD-Repair, which employs a retrieval-augmented generation approach to regenerate secure CoT steps for identified anomalies. Experimental results show that GUARD effectively mitigates attacks while maintaining generation quality, advancing secure code generation systems.

由于在代码生成中广泛应用了大型语言模型,最近的研究表明,采用额外的“努力生成链”模型可以提供明确的推理步骤,大大提高代码生成绩效,然而,作为外部组成部分,COT模型特别容易受到后门攻击,而现有的防御机制往往无法有效地发现后门攻击;为了应对这一挑战,我们提议GUARD,这是一个新的双重用途防御框架,专门用来在神经代码生成中打击COT后门攻击。 GUARD集成两个核心组成部分:GUARD-Judge,通过全面分析查明可疑的COT步骤和潜在触发因素;GUARD-Repair,采用检索式的生成方法,为已查明的异常情况重新生成安全的COT步骤。实验结果表明,GUARD在保持生成质量的同时有效地减轻了袭击,推进了安全的代码生成系统。


Article 55

Title@2025-05-27 (2): A first look at ROS 2 applications written in asynchronous Rust

Title: A first look at ROS 2 applications written in asynchronous Rust Ein erster Blick auf ROS 2 Anwendungen geschrieben in asynchronen Rust 初探用异步 Rust 编写的 ROS 2 应用程序 2505.21323v1

Authors: Martin Škoudlil, Michal Sojka, Zdeněk Hanzálek

The increasing popularity of the Rust programming language in building robotic applications using the Robot Operating System (ROS 2) raises questions about its real-time execution capabilities, particularly when employing asynchronous programming. Existing real-time scheduling and response-time analysis techniques for ROS 2 focus on applications written in C++ and do not address the unique execution models and challenges presented by Rust’s asynchronous programming paradigm. In this paper, we analyze the execution model of R2R, an asynchronous Rust ROS 2 bindings library, and of various asynchronous Rust runtimes, comparing them with the execution model of C++ ROS 2 applications. We propose a structured approach for R2R applications aimed at deterministic real-time operation involving thread prioritization and callback-to-thread mapping schemes. Our experimental evaluation based on measuring end-to-end latencies of a synthetic application shows that the proposed approach is effective and outperforms other evaluated configurations. A more complex autonomous driving case study demonstrates its practical applicability. Overall, the experimental results indicate that our proposed structure achieves bounded response times for time-critical tasks. This paves the way for future work to adapt existing or develop new response-time analysis techniques for R2R applications using our structure.

Rust编程语言在利用机器人操作系统建立机器人应用程序方面越来越受欢迎(ROS ~ 2),令人对其实时执行能力提出疑问,特别是在使用无同步编程时,ROS-2的现有实时时间安排和响应时间分析技术侧重于C++中写成的应用程序,而没有解决Rust的无同步编程模式提出的独特的执行模式和挑战。在本文件中,我们分析了R2R的执行模式 – – 一种无同步的 Rast ROS~ 2 绑定和各种不同步的运行时间,与C+ ROS~ 2 应用程序的执行模式进行比较。我们提出了R2R应用程序的结构化方法,目的是确定实时操作,包括线性优先排序和回调到全程绘图计划。我们基于测量合成应用程序端到端的迟误的实验性评估表明,拟议的方法是有效的,而且比其他经过评估的配置更不完善。更复杂的自主驱动案例研究显示了其实际适用性。总体而言,实验结果表明,我们拟议的结构在使用新的时间分析方法对当前工作进行约束性的反应。


Article 56

Title@2025-05-27 (2): Computational Reproducibility of R Code Supplements on OSF

Title: Computational Reproducibility of R Code Supplements on OSF Berechnung der Reproduzierbarkeit von R-Code-Ergänzungen auf OSF OSF的R代码补编的计算可复制性 2505.21590v1

Authors: Lorraine Saju, Tobias Holtdirk, Meetkumar Pravinbhai Mangroliya, Arnim Bleier

Computational reproducibility is fundamental to scientific research, yet many published code supplements lack the necessary documentation to recreate their computational environments. While researchers increasingly share code alongside publications, the actual reproducibility of these materials remains poorly understood. In this work, we assess the computational reproducibility of 296 R projects using the StatCodeSearch dataset. Of these, only 264 were still retrievable, and 98.8% lacked formal dependency descriptions required for successful execution. To address this, we developed an automated pipeline that reconstructs computational environments directly from project source code. Applying this pipeline, we executed the R scripts within custom Docker containers and found that 25.87% completed successfully without error. We conducted a detailed analysis of execution failures, identifying reproducibility barriers such as undeclared dependencies, invalid file paths, and system-level issues. Our findings show that automated dependency inference and containerisation can support scalable verification of computational reproducibility and help identify practical obstacles to code reuse and transparency in scientific research.

计算再生是科学研究的基础,但许多已出版的代码补充材料缺乏重建计算环境的必要文件。研究人员越来越多地与出版物分享代码,而这些材料的实际再生仍然不易理解。在这项工作中,我们评估了使用StatCodeSearch数据集的296 R项目的计算再生。其中只有264个项目仍然可以检索,98.8%的项目缺乏成功执行所需的正式依赖性说明。为此,我们开发了一条自动管道,直接从项目源代码中重建计算环境。我们应用了这条管道,我们在定制的Docker容器中执行了R脚本,发现25.87%的脚本顺利无误地完成了。我们详细分析了执行失败情况,找出了未申报依赖性、无效档案路径和系统层面问题等复制障碍。我们的调查结果显示,自动依赖性和集装箱化可以支持对计算再生能力进行可扩展的核查,并有助于识别在科学研究中代码再利用和透明度方面的实际障碍。
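
To make the idea of reconstructing an environment "directly from project source code" more concrete, here is a minimal sketch of dependency inference for R scripts. It is an assumption about how such a step might work, not the authors' pipeline: it simply scans for `library()`, `require()`, `requireNamespace()` calls and `pkg::name` references with regular expressions.

```python
import re

# Common ways an R script pulls in a package (heuristic, not exhaustive).
LIBRARY_CALL = re.compile(
    r"""\b(?:library|require|requireNamespace)\(\s*["']?([A-Za-z][A-Za-z0-9.]*)""")
NAMESPACE_USE = re.compile(r"\b([A-Za-z][A-Za-z0-9.]*):::?[A-Za-z_.]")


def infer_r_packages(source):
    """Return the sorted package names referenced in R source text."""
    packages = set()
    for line in source.splitlines():
        # Crude comment stripping; it ignores '#' characters inside strings.
        code = line.split("#", 1)[0]
        packages.update(LIBRARY_CALL.findall(code))
        packages.update(NAMESPACE_USE.findall(code))
    return sorted(packages)


SCRIPT = '''
library(ggplot2)
require("data.table")
# library(notused)
fit <- lme4::lmer(y ~ x + (1 | g), data = d)
'''

print(infer_r_packages(SCRIPT))  # ['data.table', 'ggplot2', 'lme4']
```

A real pipeline would still have to resolve package versions and system libraries, which this sketch does not attempt; the abstract lists undeclared dependencies and system-level issues among the observed failure causes.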


Article 57

Title@2025-05-27 (2): ColorGo: Directed Concolic Execution

Title: ColorGo: Directed Concolic Execution ColorGo: Direkte konkolische Ausführung 颜色 Go : 指向排列执行 2505.21130v1

Authors: Jia Li, Jiacheng Shen, Yuxin Su, Michael R. Lyu

Directed fuzzing is a critical technique in cybersecurity, targeting specific sections of a program. This approach is essential in various security-related domains such as crash reproduction, patch testing, and vulnerability detection. Despite its importance, current directed fuzzing methods exhibit a trade-off between efficiency and effectiveness. For instance, directed grey-box fuzzing, while efficient in generating fuzzing inputs, lacks sufficient precision. This low precision wastes time on executing code that cannot help reach the target site. Conversely, interpreter- or observer-based directed symbolic execution can produce high-quality inputs but incurs non-negligible runtime overhead. These limitations undermine the feasibility of directed fuzzers in real-world scenarios. To achieve both efficiency and effectiveness, in this paper we incorporate compilation-based concolic execution into directed fuzzing and present ColorGo, achieving high scalability while preserving the high precision of symbolic execution. ColorGo is a new directed whitebox fuzzer that concretely executes the instrumented program with constraint-solving capability on generated input. It guides the exploration by incremental coloration, including static reachability analysis and dynamic feasibility analysis. We evaluated ColorGo on diverse real-world programs and demonstrated that ColorGo outperforms AFLGo by up to 100x in reaching target sites and reproducing target crashes.

定向模糊测试是网络安全中的一项关键技术,针对程序的特定部分。该方法在崩溃复现、补丁测试和漏洞检测等多个安全相关领域至关重要。尽管十分重要,现有的定向模糊测试方法在效率与效果之间存在权衡。例如,定向灰盒模糊测试虽然能高效生成模糊测试输入,但精度不足,导致时间浪费在执行无助于到达目标位置的代码上。相反,基于解释器或观察器的定向符号执行可以生成高质量的输入,但会带来不可忽视的运行时开销。这些局限削弱了定向模糊测试工具在真实场景中的可行性。为了兼顾效率与效果,本文将基于编译的混合(concolic)执行引入定向模糊测试,提出 ColorGo,在保留符号执行高精度的同时实现高可扩展性。ColorGo 是一种新的定向白盒模糊测试工具,它在生成的输入上具体执行带有约束求解能力的插桩程序,并通过增量着色(incremental coloration)引导探索,其中包括静态可达性分析和动态可行性分析。我们在多种真实程序上评估了 ColorGo,结果表明它在到达目标位置和复现目标崩溃方面的表现最多可超出 AFLGo 100 倍。
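
The static reachability analysis mentioned in this abstract can be illustrated with a backward breadth-first search over a control-flow graph: blocks from which the target is reachable get "colored", and everything else can be pruned. This is a simplified sketch under assumed data structures (a dict-based CFG with made-up block names), not ColorGo's incremental coloration itself.

```python
from collections import deque


def reachable_to_target(cfg, target):
    """Return the set of blocks from which `target` is reachable.

    `cfg` maps each basic block to the list of its successors.
    """
    # Reverse the edges so a single backward BFS from the target suffices.
    preds = {block: [] for block in cfg}
    for block, succs in cfg.items():
        for succ in succs:
            preds.setdefault(succ, []).append(block)

    colored = {target}
    queue = deque([target])
    while queue:
        block = queue.popleft()
        for pred in preds.get(block, []):
            if pred not in colored:
                colored.add(pred)
                queue.append(pred)
    return colored


# Hypothetical control-flow graph of a small function.
CFG = {
    "entry": ["check", "log"],
    "check": ["parse", "exit"],
    "parse": ["target"],
    "log": ["exit"],
    "target": ["exit"],
    "exit": [],
}

colored = reachable_to_target(CFG, "target")
print(sorted(colored))  # ['check', 'entry', 'parse', 'target']
```

Here `log` and `exit` are left uncolored, so an execution that enters `log` can no longer reach the target; a directed fuzzer could stop exploring such paths early.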


Article 58

Title@2025-05-27 (2): CXXCrafter: An LLM-Based Agent for Automated C/C++ Open Source Software Building

Title: CXXCrafter: An LLM-Based Agent for Automated C/C++ Open Source Software Building CXXCrafter: Ein LLM-basierter Agent für automatisiertes C/C++ Open Source Software Building CXXCrafter:一个基于LLM的C/C++开源软件自动构建智能体 2505.21069v1

Authors: Zhengmin Yu, Yuan Zhang, Ming Wen, Yinan Nie, Wenhui Zhang, Min Yang

Project building is pivotal to support various program analysis tasks, such as generating intermediate representation code for static analysis and preparing binary code for vulnerability reproduction. However, automating the building process for C/C++ projects is a highly complex endeavor, involving tremendous technical challenges, such as intricate dependency management, diverse build systems, varied toolchains, and multifaceted error handling mechanisms. Consequently, building C/C++ projects often proves to be difficult in practice, hindering the progress of downstream applications. Unfortunately, research on facilitating the building of C/C++ projects remains inadequate. The emergence of Large Language Models (LLMs) offers promising solutions to automated software building. Trained on extensive corpora, LLMs can help unify diverse build systems through their comprehension capabilities and address complex errors by leveraging tacit knowledge storage. Moreover, LLM-based agents can be systematically designed to dynamically interact with the environment, effectively managing dynamic building issues. Motivated by these opportunities, we first conduct an empirical study to systematically analyze the current challenges in the C/C++ project building process. Particularly, we observe that most popular C/C++ projects encounter an average of five errors when relying solely on the default build systems. Based on our study, we develop an automated build system called CXXCrafter to specifically address the above-mentioned challenges, such as dependency resolution. Our evaluation on open-source software demonstrates that CXXCrafter achieves a success rate of 78% in project building. Specifically, among the Top100 dataset, 72 projects are built successfully by both CXXCrafter and manual efforts, 3 by CXXCrafter only, and 14 manually only. …

项目建设对于支持各种方案分析任务至关重要,例如为静态分析生成中间调值回覆代码和为脆弱性复制编制二进制代码。然而,C/C++项目建筑流程自动化是一项非常复杂的工作,涉及巨大的技术挑战,如复杂的依赖管理、多样化的建筑系统、不同的工具链和多方面的错误处理机制。因此,建设C/C+++项目往往证明在实践中很困难,阻碍了下游应用的进展。不幸的是,关于便利C/C++项目建设的研究仍然不够充分。大语言模型的出现为自动软件建设提供了有希望的解决方案。在广泛的公司级培训下,LLMS能够帮助通过理解能力统一多种建筑系统,并通过利用隐性知识存储解决复杂的错误。此外,基于LLM的代理商可以系统地设计与环境动态互动,有效管理动态建筑问题。由于这些机会,我们首先进行一项经验研究,系统分析C/C++项目当前的挑战,只有C/C++项目(LM)为自动化软件建设提供了很有希望的解决方案。在C++项目中,C项目在C成功后仅依靠C系统构建了五个错误,在C系统上,在C成功构建过程中,我们用C构建了C的系统来进行。


Article 59

Title@2025-05-27 (2): Thinking Before Running! Efficient Code Generation with Thorough Exploration and Optimal Refinement

Title: Thinking Before Running! Efficient Code Generation with Thorough Exploration and Optimal Refinement Vor dem Laufen denken! Effiziente Codegenerierung mit gründlicher Exploration und optimaler Verfeinerung 在运行前思考! 高效的代码生成, 彻底探索和优化精炼 2502.17442v2

Authors: Xiaoqing Zhang, Yuhan Liu, Flood Sung, Xiuying Chen, Shuo Shang, Rui Yan

Code generation is crucial in software engineering for automating the coding process efficiently. While test-time computation methods show promise, they suffer from high latency due to multiple computation rounds. To overcome this, we introduce ThinkCoder, a framework that combines thorough exploration with optimal refinement. The exploration phase diversifies the solution space by searching for potential solutions, followed by a refinement phase that enhances precision. This approach allows us to select the best solution through careful consideration before taking action, avoiding excessive trial and error. To further minimize test-time computation overhead, we introduce preference-driven optimization with Reinforced Self-Training (ReST), which uses exploration trajectories from ThinkCoder to guide LLM’s evolution. This approach enhances LLM’s exploration efficiency via preference learning, cutting costs while maintaining accuracy. ThinkCoder boosts the performance with a single LLM, excelling on benchmarks like HumanEval and MBPP. Compared to SOTA models, it improves Pass@1 by 3.0% over MapCoder with just 6.4% of the computation cost. Against AgentCoder, ThinkCoder achieves a 0.5% higher Pass@1 after 2 rounds, outperforming AgentCoder’s 5 rounds. Additionally, ReST with success trajectories enhances efficiency, allowing models like LLaMA2-7B to achieve competitive results using only 20% of the computational resources. These results highlight the framework’s effectiveness and scalability.

代码生成在软件工程中对于高效地实现编码过程自动化至关重要。测试时计算方法虽然前景可期,但由于需要多轮计算而延迟较高。为此,我们提出 ThinkCoder,这是一个将充分探索与最优精炼相结合的框架。探索阶段通过搜索潜在解来丰富解空间,随后的精炼阶段则提高精度。这种方法使我们能够在行动之前经过审慎考虑选出最佳方案,避免过多的试错。为了进一步降低测试时计算开销,我们引入基于强化自训练(ReST)的偏好驱动优化,利用 ThinkCoder 的探索轨迹来引导 LLM 的演进。该方法通过偏好学习提升 LLM 的探索效率,在保持准确率的同时降低成本。ThinkCoder 仅用单个 LLM 即可提升性能,在 HumanEval 和 MBPP 等基准上表现出色。与 SOTA 模型相比,它的 Pass@1 比 MapCoder 高 3.0%,而计算成本仅为其 6.4%。与 AgentCoder 相比,ThinkCoder 在 2 轮之后的 Pass@1 高出 0.5%,优于 AgentCoder 的 5 轮结果。此外,使用成功轨迹的 ReST 提高了效率,使 LLaMA2-7B 等模型仅用 20% 的计算资源即可取得有竞争力的结果。这些结果凸显了该框架的有效性和可扩展性。


Article 60

Title@2025-05-27 (2): Optimizing Case-Based Reasoning System for Functional Test Script Generation with Large Language Models

Title: Optimizing Case-Based Reasoning System for Functional Test Script Generation with Large Language Models Optimierung des Case-Based-Reasoning-Systems für die Generierung funktionaler Testskripte mit großen Sprachmodellen 为具有大语言模型的功能测试脚本生成优化基于个案的理由说明系统 2503.20576v3

Authors: Siyuan Guo, Huiwu Liu, Xiaolong Chen, Yuming Xie, Liang Zhang, Tao Han, Hechang Chen, Yi Chang, Jun Wang

In this work, we explore the potential of large language models (LLMs) for generating functional test scripts, which necessitates understanding the dynamically evolving code structure of the target software. To achieve this, we propose a case-based reasoning (CBR) system utilizing a 4R cycle (i.e., retrieve, reuse, revise, and retain), which maintains and leverages a case bank of test intent descriptions and corresponding test scripts to facilitate LLMs for test script generation. To improve user experience further, we introduce Re4, an optimization method for the CBR system, comprising reranking-based retrieval finetuning and reinforced reuse finetuning. Specifically, we first identify positive examples with high semantic and script similarity, providing reliable pseudo-labels for finetuning the retriever model without costly labeling. Then, we apply supervised finetuning, followed by a reinforcement learning finetuning stage, to align LLMs with our production scenarios, ensuring the faithful reuse of retrieved cases. Extensive experimental results on two product development units from Huawei Datacom demonstrate the superiority of the proposed CBR+Re4. Notably, we also show that the proposed Re4 method can help alleviate the repetitive generation issues with LLMs.

在这项工作中,我们探索大型语言模型(LLMS)产生功能性测试脚本的潜力,这需要理解目标软件动态演变的代码结构。为此,我们提议采用基于案例的推理(CBR)系统,使用4R周期(即检索、再利用、修订和保留),维持并利用测试意向说明和相应测试脚本的个案库,以便利测试脚本生成LLMS。为了进一步改进用户经验,我们为CBR系统引入了RE4优化方法,包括基于排名的检索微调和强化再利用微调。具体地说,我们首先找出具有高度语义性和文字相似性的正面例子,提供可靠的假标签,用于在不贴昂贵标签的情况下微调检索器模型。然后,我们实施监督性微调,随后是强化学习微调阶段,使LMS与我们的生产情景保持一致,确保忠实地再利用回收的个案。关于Huwei Datacom的两个产品开发单位的广泛实验结果显示拟议的CBR+Re4的优势。我们还表明,拟议的R4方法有助于减轻重复生产的问题。


Article 61

Title@2025-05-27 (2): RepoMaster: Autonomous Exploration and Understanding of GitHub Repositories for Complex Task Solving

Title: RepoMaster: Autonomous Exploration and Understanding of GitHub Repositories for Complex Task Solving RepoMaster: Autonome Exploration und Verständnis von GitHub-Lagerstätten für komplexe Aufgabenlösung RepoMaster:为复杂任务解决而自主探索和了解GitHub储存库 2505.21577v1

Authors: Huacan Wang, Ziyi Ni, Shuo Zhang, Shuo Lu, Sen Hu, Ziyang He, Chen Hu, Jiaye Lin, Yifu Guo, Yuntao Du, Pin Lyu

The ultimate goal of code agents is to solve complex tasks autonomously. Although large language models (LLMs) have made substantial progress in code generation, real-world tasks typically demand full-fledged code repositories rather than simple scripts. Building such repositories from scratch remains a major challenge. Fortunately, GitHub hosts a vast, evolving collection of open-source repositories, which developers frequently reuse as modular components for complex tasks. Yet, existing frameworks like OpenHands and SWE-Agent still struggle to effectively leverage these valuable resources. Relying solely on README files provides insufficient guidance, and deeper exploration reveals two core obstacles: overwhelming information and tangled dependencies of repositories, both constrained by the limited context windows of current LLMs. To tackle these issues, we propose RepoMaster, an autonomous agent framework designed to explore and reuse GitHub repositories for solving complex tasks. For efficient understanding, RepoMaster constructs function-call graphs, module-dependency graphs, and hierarchical code trees to identify essential components, providing only identified core elements to the LLMs rather than the entire repository. During autonomous execution, it progressively explores related components using our exploration tools and prunes information to optimize context usage. Evaluated on the adjusted MLE-bench, RepoMaster achieves a 110% relative boost in valid submissions over the strongest baseline OpenHands. On our newly released GitTaskBench, RepoMaster lifts the task-pass rate from 24.1% to 62.9% while reducing token usage by 95%. Our code and demonstration materials are publicly available at https://github.com/wanghuacan/RepoMaster.

代码智能体的最终目标是自主解决复杂任务。尽管大语言模型(LLM)在代码生成方面取得了长足进步,现实任务通常需要完整的代码仓库,而不是简单的脚本。从零开始构建这样的仓库仍是一大挑战。幸运的是,GitHub 托管着大量不断演进的开源仓库,开发者经常将它们作为模块化组件复用于复杂任务。然而,OpenHands 和 SWE-Agent 等现有框架仍难以有效利用这些宝贵资源。仅依赖 README 文件无法提供足够的指导,而更深入的探索又暴露出两个核心障碍:信息过载和仓库中错综复杂的依赖关系,二者都受限于当前 LLM 有限的上下文窗口。为了解决这些问题,我们提出 RepoMaster,这是一个旨在探索和复用 GitHub 仓库以解决复杂任务的自主智能体框架。为了高效理解仓库,RepoMaster 构建函数调用图、模块依赖图和层次化代码树来识别关键组件,只向 LLM 提供识别出的核心元素,而非整个仓库。在自主执行过程中,它利用我们的探索工具逐步探索相关组件,并裁剪信息以优化上下文使用。在调整后的 MLE-bench 上,RepoMaster 的有效提交数相对最强基线 OpenHands 提升了 110%。在我们新发布的 GitTaskBench 上,RepoMaster 将任务通过率从 24.1% 提高到 62.9%,同时将 token 使用量减少了 95%。我们的代码和演示材料公开于 https://github.com/wanghuacan/RepoMaster 。
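
As a rough illustration of the function-call graphs mentioned in this abstract, the sketch below uses Python's standard `ast` module to record which functions each top-level function calls directly. It is an assumption about what such a graph could look like for a single Python file, not RepoMaster's implementation; the sample source and function names are invented.

```python
import ast


def build_call_graph(source):
    """Map each top-level function to the names it calls directly."""
    tree = ast.parse(source)
    graph = {}
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            callees = set()
            for inner in ast.walk(node):
                # Only plain-name calls such as f(x); method calls are skipped.
                if isinstance(inner, ast.Call) and isinstance(inner.func, ast.Name):
                    callees.add(inner.func.id)
            graph[node.name] = sorted(callees)
    return graph


SOURCE = '''
def load(path):
    return open(path).read()

def clean(text):
    return text.strip()

def main(path):
    return clean(load(path))
'''

graph = build_call_graph(SOURCE)
print(graph)  # {'load': ['open'], 'clean': [], 'main': ['clean', 'load']}
```

A graph like this makes it possible to hand an LLM only `main` and the functions it transitively depends on, instead of the whole file, which is the context-saving effect the abstract describes.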


Article 62

Title@2025-05-27 (2): An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks

Title: An LLM-as-Judge Metric for Bridging the Gap with Human Evaluation in SE Tasks Ein LLM-as-Judge Metric zur Überwindung der Lücke mit menschlicher Bewertung in SE-Aufgaben 消除社会经济任务中与人的评价差距的法学硕士法官 2505.20854v1

Authors: Xin Zhou, Kisub Kim, Ting Zhang, Martin Weyssow, Luis F. Gomes, Guang Yang, David Lo

Large Language Models (LLMs) and other automated techniques have been increasingly used to support software developers by generating software artifacts such as code snippets, patches, and comments. However, accurately assessing the correctness of these generated artifacts remains a significant challenge. On one hand, human evaluation provides high accuracy but is labor-intensive and lacks scalability. On the other hand, other existing automatic evaluation metrics are scalable and require minimal human effort, but they often fail to accurately reflect the actual correctness of generated software artifacts. In this paper, we present SWE-Judge, the first evaluation metric for LLM-as-Ensemble-Judge specifically designed to accurately assess the correctness of generated software artifacts. SWE-Judge first defines five distinct evaluation strategies, each implemented as an independent judge. A dynamic team selection mechanism then identifies the most appropriate subset of judges to produce a final correctness score through ensembling. We evaluate SWE-Judge across a diverse set of software engineering (SE) benchmarks, including CoNaLa, Card2Code, HumanEval-X, APPS, APR-Assess, and Summary-Assess. These benchmarks span three SE tasks: code generation, automated program repair, and code summarization. Experimental results demonstrate that SWE-Judge consistently achieves a higher correlation with human judgments, with improvements ranging from 5.9% to 183.8% over existing automatic metrics. Furthermore, SWE-Judge reaches agreement levels with human annotators that are comparable to inter-annotator agreement in code generation and program repair tasks. These findings underscore SWE-Judge’s potential as a scalable and reliable alternative to human evaluation.

大语言模型(LLM)和其他自动化技术越来越多地被用于生成代码片段、补丁和注释等软件制品,以支持软件开发者。然而,准确评估这些生成制品的正确性仍是一项重大挑战。一方面,人工评估准确度高,但耗费人力且缺乏可扩展性。另一方面,现有的其他自动评估指标可扩展且只需极少的人力,但往往无法准确反映生成的软件制品的实际正确性。本文提出 SWE-Judge,这是首个 LLM-as-Ensemble-Judge 评估指标,专门用于准确评估生成的软件制品的正确性。SWE-Judge 首先定义五种不同的评估策略,每种策略实现为一个独立的评审。随后,动态团队选择机制确定最合适的评审子集,通过集成给出最终的正确性评分。我们在一组多样的软件工程(SE)基准上评估 SWE-Judge,包括 CoNaLa、Card2Code、HumanEval-X、APPS、APR-Assess 和 Summary-Assess。这些基准涵盖三类 SE 任务:代码生成、自动程序修复和代码摘要。实验结果表明,SWE-Judge 与人工判断的相关性始终更高,相比现有自动指标提升 5.9% 至 183.8%。此外,在代码生成和程序修复任务中,SWE-Judge 与人工标注者的一致性水平可与标注者之间的一致性相当。这些发现凸显了 SWE-Judge 作为可扩展且可靠的人工评估替代方案的潜力。
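
The abstract does not say how the dynamic team selection works, so the following is only one plausible reading, stated as an assumption: on a small set of human-labelled samples, try every subset of judges, average their scores, and keep the subset whose ensemble correlates best with the human labels. Judge names and scores are made up.

```python
from itertools import combinations
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def ensemble(scores, team):
    """Average the per-sample scores of the judges in `team`."""
    return [mean(sample) for sample in zip(*(scores[j] for j in team))]


def select_team(scores, human):
    """Exhaustively pick the judge subset best correlated with human labels."""
    best_team, best_corr = None, float("-inf")
    names = sorted(scores)
    for size in range(1, len(names) + 1):
        for team in combinations(names, size):
            corr = pearson(ensemble(scores, team), human)
            if corr > best_corr:
                best_team, best_corr = team, corr
    return best_team, best_corr


# Hypothetical validation data: 1 = correct artifact, 0 = incorrect.
HUMAN = [0, 1, 1, 0]
SCORES = {
    "judge_a": [0, 1, 0, 0],
    "judge_b": [0, 0, 1, 0],
    "judge_c": [1, 0, 0, 1],
}

team, corr = select_team(SCORES, HUMAN)
print(team, corr)  # ('judge_a', 'judge_b') 1.0
```

With five judges, as in the abstract, exhaustive search covers only 31 subsets, so brute force is affordable; whether SWE-Judge actually selects its team this way is not stated in the source.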


Article 63

Title@2025-05-27 (2): Why do Machine Learning Notebooks Crash? An Empirical Study on Public Python Jupyter Notebooks

Title: Why do Machine Learning Notebooks Crash? An Empirical Study on Public Python Jupyter Notebooks Warum zerfallen Machine-Learning-Notebooks? Eine empirische Studie über öffentliche Python-Jupyter-Notebooks 为什么机器学习笔记本崩溃? 2411.16795v3

Authors: Yiran Wang, Willem Meijer, José Antonio Hernández López, Ulf Nilsson, Dániel Varró

Jupyter notebooks have become central in data science, integrating code, text and output in a flexible environment. With the rise of machine learning (ML), notebooks are increasingly used for prototyping and data analysis. However, due to their dependence on complex ML libraries and the flexible notebook semantics that allow cells to be run in any order, notebooks are susceptible to software bugs that may lead to program crashes. This paper presents a comprehensive empirical study focusing on crashes in publicly available Python ML notebooks. We collect 64,031 notebooks containing 92,542 crashes from GitHub and Kaggle, and manually analyze a sample of 746 crashes across various aspects, including crash types and root causes. Our analysis identifies unique ML-specific crash types, such as tensor shape mismatches and dataset value errors that violate API constraints. Additionally, we highlight unique root causes tied to notebook semantics, including out-of-order execution and residual errors from previous cells, which have been largely overlooked in prior research. Furthermore, we identify the most error-prone ML libraries, and analyze crash distribution across ML pipeline stages. We find that over 40% of crashes stem from API misuse and notebook-specific issues. Crashes frequently occur when using ML libraries like TensorFlow/Keras and Torch. Additionally, over 70% of the crashes occur during data preparation, model training, and evaluation or prediction stages of the ML pipeline, while data visualization errors tend to be unique to ML notebooks.

Jupyter 笔记本已成为数据科学的核心,在灵活的环境下整合代码、文本和输出。随着机器学习(ML)的上升,笔记本越来越多地用于原型和数据分析。然而,由于对复杂的 ML 图书馆和允许细胞按任何顺序运行的灵活的笔记本语义学依赖复杂的 ML 图书馆和灵活的笔记本语语义,笔记本很容易被软件错误导致程序崩溃。本文介绍了一项全面的经验性研究,重点是公开提供的 Python ML 笔记本中的崩溃。我们收集了64 031本包含92 542次来自 GitHub 和 Kaggle 的难解的笔记笔记本,并手动分析了746次不同方面碰撞的样本,包括崩溃类型和根源。然而,我们的分析确定了独特的 ML 特定崩溃类型,例如变色形状不匹配和数据设置值错误,从而违反API 限制。此外,我们强调与笔记本语义中存在独特的根源,包括超序执行和从以前的细胞模型错误,在先前的研究中被忽略。此外,我们发现了最易出错的 ML 库, 最易错的ML 图书馆,以及像化的崩溃分布分布分布分布分布分布分布分布分布在ML 和ML 中,在ML 周期的周期的周期的周期的计算中经常出现。
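
The out-of-order execution root cause named in this abstract is easy to reproduce in miniature. The sketch below simulates notebook cells as code strings sharing one namespace; the cell contents and the `run_cells` helper are invented for illustration and are not taken from the study's dataset.

```python
def run_cells(cells, order):
    """Execute notebook-style cells in the given order, sharing one namespace."""
    namespace = {}
    for index in order:
        try:
            exec(cells[index], namespace)
        except Exception as exc:
            return f"cell {index} crashed: {type(exc).__name__}"
    return "ok"


CELLS = [
    "data = [1, 2, 3]",
    "total = sum(data)",
    "average = total / len(data)",
]

print(run_cells(CELLS, [0, 1, 2]))  # ok
print(run_cells(CELLS, [1, 0, 2]))  # cell 1 crashed: NameError
```

Running the same cells top to bottom succeeds, while running cell 1 first fails because `data` has not been defined yet, which is the kind of notebook-specific crash the study attributes to flexible execution order.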


Article 64

Title@2025-05-27 (2): Can Agents Fix Agent Issues?

Title: Can Agents Fix Agent Issues? Können Agenten Probleme mit Agenten beheben? 特工能解决代理问题吗? 2505.20749v1

Authors: Alfin Wijaya Rahardja, Junwei Liu, Weitong Chen, Zhenpeng Chen, Yiling Lou

LLM-based agent systems are emerging as a new software paradigm and have been widely adopted across diverse domains such as medicine, robotics, and programming. However, maintaining these systems requires substantial effort, as they are inevitably prone to bugs and continually evolve to meet changing external requirements. Therefore, automatically resolving agent issues (i.e., bug reports or feature requests) is a crucial and challenging task. While recent software engineering (SE) agents (e.g., SWE-agent) have shown promise in addressing issues in traditional software systems, it remains unclear how effectively they can resolve real-world issues in agent systems, which differ significantly from traditional software. To fill this gap, we first manually analyze 201 real-world agent issues and identify common categories of agent issues. We then spend 500 person-hours constructing AGENTISSUE-BENCH, a reproducible benchmark comprising 50 agent issue resolution tasks (each with an executable environment and failure-triggering tests). We further evaluate state-of-the-art SE agents on AGENTISSUE-BENCH and reveal their limited effectiveness (i.e., with only 3.33% - 12.67% resolution rates). These results underscore the unique challenges of maintaining agent systems compared to traditional software, highlighting the need for further research to develop advanced SE agents for resolving agent issues. Data and code are available at https://alfin06.github.io/AgentIssue-Bench-Leaderboard/#/ .

以LLM为主的代理系统正在作为一种新的软件范例出现,并被广泛采用,例如医药、机器人和编程等不同领域。然而,维护这些系统需要大量努力,因为它们不可避免地容易发生故障,并不断演变以满足不断变化的外部要求。因此,自动解决代理问题(例如,错误报告或功能要求)是一项关键和具有挑战性的任务。虽然最近的软件工程代理(SE)在解决传统软件系统的问题方面显示出希望,但目前仍不清楚它们能如何有效地解决代理系统中与传统软件大不相同的真实世界问题。为填补这一空白,我们首先手工分析201个真实世界代理问题,并查明共同的代理问题类别。我们然后花费500个人小时来建造AGENTISSUE-BENCH,这是一个可复制的基准,由50个代理问题解决任务(每个任务都有可执行的环境和触发故障的测试)组成。我们进一步评估AGENTISSUE-ENCH的S-SE代理系统的现状,并披露其有限的效力(即只有3.33%/FINA 的高级代理系统,需要更准确地显示SEVILAUDRA的解决方案。


Article 65

Title@2025-05-27 (2): Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey

Title: Enhancing Code LLMs with Reinforcement Learning in Code Generation: A Survey Verbesserung von Code LLMs mit Verstärkungslernen in der Codegenerierung: Eine Umfrage 增强法典制定中强化学习的加强守则LLMS 代码生成:调查 2412.20367v3

Authors: Junqiao Wang, Zeng Zhang, Yangfan He, Zihao Zhang, Yuyang Song, Tianyu Shi, Yuchen Li, Hengyuan Xu, Kunyu Wu, Xin Yi, Zhongwei Wan, Xinhang Yuan, Kuan Lu, Menghao Huo, Guangwu Qian, Keqin Li, Qiuwu Chen, Lewei He

Reinforcement learning (RL) has emerged as a powerful paradigm for enhancing large language models (LLMs) in code generation and optimization. This survey systematically reviews RL-driven techniques across the code development lifecycle, from compiler-level optimizations and resource allocation strategies to end-to-end code synthesis frameworks. We first examine classical and modern RL algorithms – spanning policy gradients, actor-critic methods, human-feedback alignment, and preference-based optimization – and their adaptations to the unique challenges of code generation, such as sparse and delayed rewards. Next, we analyze key benchmarks, datasets, and evaluation metrics that drive progress in RL-augmented Code LLMs. Finally, we identify open problems, including the need for richer feedback sources, support for low-level and domain-specific languages, and methods to reduce computational overhead. By consolidating current insights and outlining future directions, this work aims to guide researchers and practitioners in leveraging RL to produce more robust, efficient, and human-aligned code generation systems.

强化学习(RL)已成为在代码生成与优化方面增强大语言模型(LLM)的有力范式。本综述系统地回顾了贯穿代码开发生命周期的RL驱动技术,从编译器级优化和资源分配策略到端到端代码合成框架。我们首先考察经典与现代RL算法(涵盖策略梯度、actor-critic方法、人类反馈对齐和基于偏好的优化),以及它们针对代码生成特有挑战(如稀疏和延迟奖励)所做的适配。接着,我们分析推动RL增强的代码LLM取得进展的关键基准、数据集和评价指标。最后,我们指出若干开放问题,包括需要更丰富的反馈来源、对低级语言和领域特定语言的支持,以及降低计算开销的方法。通过整合当前见解并展望未来方向,本文旨在指导研究人员和从业者利用RL构建更稳健、高效且与人类对齐的代码生成系统。


Article 66

Title@2025-05-27 (2): SV-TrustEval-C: Evaluating Structure and Semantic Reasoning in Large Language Models for Source Code Vulnerability Analysis

Title: SV-TrustEval-C: Evaluating Structure and Semantic Reasoning in Large Language Models for Source Code Vulnerability Analysis SV-TrustEval-C: Bewertung von Struktur und semantischer Vernunft in großen Sprachmodellen für die Analyse von Quellencode-Anfälligkeiten SV-信任值-C:在源码脆弱性分析大语言模型中评估结构和语义理由 2505.20630v1

Authors: Yansong Li, Paula Branco, Alexander M. Hoole, Manish Marwah, Hari Manassery Koduvely, Guy-Vincent Jourdan, Stephan Jou

As Large Language Models (LLMs) evolve in understanding and generating code, accurately evaluating their reliability in analyzing source code vulnerabilities becomes increasingly vital. While studies have examined LLM capabilities in tasks like vulnerability detection and repair, they often overlook the importance of both structure and semantic reasoning crucial for trustworthy vulnerability analysis. To address this gap, we introduce SV-TrustEval-C, a benchmark designed to evaluate LLMs’ abilities for vulnerability analysis of code written in the C programming language through two key dimensions: structure reasoning - assessing how models identify relationships between code elements under varying data and control flow complexities; and semantic reasoning - examining their logical consistency in scenarios where code is structurally and semantically perturbed. Our results show that current LLMs are far from satisfactory in understanding complex code relationships and that their vulnerability analyses rely more on pattern matching than on robust logical reasoning. These findings underscore the effectiveness of the SV-TrustEval-C benchmark and highlight critical areas for enhancing the reasoning capabilities and trustworthiness of LLMs in real-world vulnerability analysis tasks. Our initial benchmark dataset is publicly available.

随着大语言模型(LLM)在理解和生成代码方面不断发展,准确评估其在分析源代码漏洞方面的可靠性变得日益重要。已有研究考察了LLM在漏洞检测和修复等任务中的能力,但往往忽视了对可信漏洞分析至关重要的结构推理和语义推理。为弥补这一空白,我们提出SV-TrustEval-C,这是一个基准,旨在从两个关键维度评估LLM对C语言代码进行漏洞分析的能力:结构推理,评估模型在不同数据流和控制流复杂度下如何识别代码元素之间的关系;语义推理,考察模型在代码受到结构和语义扰动的场景中的逻辑一致性。结果表明,当前的LLM在理解复杂代码关系方面远不能令人满意,其漏洞分析更多依赖模式匹配,而非稳健的逻辑推理。这些发现凸显了SV-TrustEval-C基准的有效性,并指出了在真实世界漏洞分析任务中提升LLM推理能力和可信度的关键方向。我们的初始基准数据集已公开。


Article 67

Title@2025-05-26 (1): Smart Contract Vulnerabilities, Tools, and Benchmarks: An Updated Systematic Literature Review

Title: Smart Contract Vulnerabilities, Tools, and Benchmarks: An Updated Systematic Literature Review Schwachstellen, Tools und Benchmarks für Smart Contracts: Eine aktualisierte systematische Literaturübersicht 智能合约漏洞、工具与基准:更新的系统性文献综述 2412.01719v2

Authors: Gerardo Iuliano, Dario Di Nucci

Smart contracts are self-executing programs on blockchain platforms like Ethereum, which have revolutionized decentralized finance by enabling trustless transactions and the operation of decentralized applications. Despite their potential, the security of smart contracts remains a critical concern due to their immutability and transparency, which expose them to malicious actors. Numerous solutions for vulnerability detection have been proposed, but it is still unclear which one is the most effective. This paper presents a systematic literature review that explores vulnerabilities in Ethereum smart contracts, focusing on automated detection tools and benchmark evaluation. We reviewed 3,380 studies from five digital libraries and five major software engineering conferences, applying a structured selection process that resulted in 222 high-quality studies. The key results include a hierarchical taxonomy of 192 vulnerabilities grouped into 14 categories, a comprehensive list of 219 detection tools with corresponding functionalities, methods, and code transformation techniques, a mapping between our taxonomy and the list of tools, and a collection of 133 benchmarks used for tool evaluation. We conclude with a discussion about the insights into the current state of Ethereum smart contract security and directions for future research.

智能合约是运行在以太坊(Ethereum)等区块链平台上的自执行程序,它通过支持无需信任的交易和去中心化应用的运行,彻底改变了去中心化金融。尽管潜力巨大,智能合约的安全性仍是一个关键问题,因为其不可变性和透明性使其暴露在恶意行为者面前。已有许多漏洞检测方案被提出,但哪一种最有效仍不清楚。本文通过系统性文献综述探讨以太坊智能合约中的漏洞,重点关注自动化检测工具和基准评估。我们审查了来自5个数字图书馆和5个主要软件工程会议的3,380项研究,经过结构化筛选流程,最终得到222项高质量研究。主要成果包括:一个包含192种漏洞、分为14个类别的层次化分类体系;一份包含219个检测工具及其对应功能、方法和代码转换技术的完整清单;我们的分类体系与工具清单之间的映射;以及用于工具评估的133个基准的集合。最后,我们讨论了对以太坊智能合约安全现状的认识以及未来的研究方向。


Article 68

Title@2025-05-26 (1): Large Language Models for IT Automation Tasks: Are We There Yet?

Title: Large Language Models for IT Automation Tasks: Are We There Yet? Große Sprachmodelle für IT-Automatisierungsaufgaben: Sind wir schon so weit? 面向IT自动化任务的大语言模型:我们做到了吗? 2505.20505v1

Authors: Md Mahadi Hassan, John Salvador, Akond Rahman, Santu Karmaker

LLMs show promise in code generation, yet their effectiveness for IT automation tasks, particularly for tools like Ansible, remains understudied. Existing benchmarks rely primarily on synthetic tasks that fail to capture the needs of practitioners who use IT automation tools, such as Ansible. We present ITAB (IT Automation Task Benchmark), a benchmark of 126 diverse tasks (e.g., configuring servers, managing files) where each task accounts for state reconciliation: a property unique to IT automation tools. ITAB evaluates LLMs’ ability to generate functional Ansible automation scripts via dynamic execution in controlled environments. We evaluate 14 open-source LLMs, none of which achieves a pass@10 rate beyond 12%. To explain these low scores, we analyze 1,411 execution failures across the evaluated LLMs and identify two main categories of prevalent semantic errors: failures in state-reconciliation-related reasoning (44.87% combined, from variable (11.43%), host (11.84%), path (11.63%), and template (9.97%) issues) and deficiencies in module-specific execution knowledge (24.37% combined, from attribute and parameter (14.44%) and module (9.93%) errors). Our findings reveal key limitations in open-source LLMs’ ability to track state changes and apply specialized module knowledge, indicating that reliable IT automation will require major advances in state reasoning and domain-specific execution understanding.

LLM在代码生成方面展现出潜力,但其在IT自动化任务中的有效性,尤其是针对Ansible等工具,仍缺乏充分研究。现有基准主要依赖合成任务,无法反映使用Ansible等IT自动化工具的从业者的实际需求。我们提出ITAB(IT Automation Task Benchmark),这是一个包含126项多样化任务(例如配置服务器、管理文件)的基准,其中每项任务都考虑了状态协调(state reconciliation)这一IT自动化工具特有的性质。ITAB通过在受控环境中动态执行,评估LLM生成可用的Ansible自动化脚本的能力。我们评估了14个开源LLM,其中没有一个的pass@10超过12%。为解释这些低分,我们分析了各被评估LLM的1,411次执行失败,并归纳出两大类普遍存在的语义错误:与状态协调相关的推理失败(合计44.87%,包括变量(11.43%)、主机(11.84%)、路径(11.63%)和模板(9.97%)问题),以及模块特定执行知识的不足(合计24.37%,包括属性和参数(14.44%)以及模块(9.93%)错误)。我们的发现揭示了开源LLM在跟踪状态变化和运用专门模块知识方面的关键局限,表明可靠的IT自动化需要在状态推理和领域特定执行理解方面取得重大进展。
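
The pass@10 figure reported for ITAB above can be made concrete with a short sketch. The abstract does not say how ITAB computes pass@k, so the code below assumes the unbiased estimator commonly used for code-generation benchmarks: with n sampled scripts per task, of which c pass, pass@k = 1 - C(n-c, k) / C(n, k). The function names and the (n, c) input layout are illustrative and are not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of sampled scripts, c: number that passed, k: sample budget.
    Assumption: this is the standard estimator, not necessarily ITAB's.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # With fewer than k failing samples, every size-k draw contains a pass.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results, k):
    """Average pass@k over tasks; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Averaging the per-task estimate over all tasks gives the benchmark-level number; under this reading, a pass@10 below 12% would mean that for most tasks none of ten sampled scripts ran successfully.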


Article 69

Title@2025-05-26 (1): Modeling and Analysis of the Landing Gear System with the Generalized Contracts

Title: Modeling and Analysis of the Landing Gear System with the Generalized Contracts Modellierung und Analyse des Landing Gear Systems mit den Generalized Contracts 通用合同着陆器系统的建模和分析 2111.10426v3

Authors: Abdelkader Khouass, Christian Attiogbé, Mohamed Messabihi

Nowadays, there are several complex systems in different sectors, such as aviation and air traffic control. These systems do not have a precise perimeter; they are open and made of various specific components built with different languages and environments. The modeling, assembly, and analysis of such open and complex heterogeneous systems are challenges in software engineering. This paper describes how the Minarets method decreases the difficulty of modeling, composing, and analyzing the well-known case study of the landing gear system. The method consists of equipping individual components with generalized contracts that integrate various facets related to different concerns, composing these components according to their facets, and verifying the resulting system with respect to the involved facets as well. The proposed method may be used as is or extended to cover more facets, and its assistance tool may be strengthened with proactive support for modeling, composing multi-facet contracts, and finally verifying the heterogeneous systems.

如今,航空、空中交通管制等不同领域存在若干复杂系统。这些系统没有精确的边界,它们是开放的,由使用不同语言和环境构建的各种特定组件组成。对这类开放且复杂的异构系统进行建模、组装和分析是软件工程中的挑战。本文描述了Minarets方法如何降低对起落架系统这一著名案例进行建模、组合和分析的难度。该方法包括:为各个组件配备广义合约(generalized contracts),这些合约整合了与不同关注点相关的多个方面(facet);根据各方面对组件进行组合;并同样针对所涉及的各方面验证所得系统。所提出的方法可以直接使用,也可以扩展以涵盖更多方面,并可通过在建模、多方面合约组合以及异构系统验证中引入主动式辅助来加强辅助工具。


Article 70

Title@2025-05-26 (1): SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents

Title: SWE-rebench: An Automated Pipeline for Task Collection and Decontaminated Evaluation of Software Engineering Agents SWE-rebench: Eine automatisierte Pipeline für die Aufgabensammlung und die dekontaminierte Evaluation von Software-Engineering-Agenten SWE-rebench:用于软件工程代理任务收集与去污染评估的自动化流水线 2505.20411v1

Authors: Ibragim Badertdinov, Alexander Golubev, Maksim Nekrashevich, Anton Shevtsov, Simon Karasik, Andrei Andriushchenko, Maria Trofimova, Daria Litvintseva, Boris Yangel

LLM-based agents have shown promising capabilities in a growing range of software engineering (SWE) tasks. However, advancing this field faces two critical challenges. First, high-quality training data is scarce, especially data that reflects real-world SWE scenarios, where agents must interact with development environments, execute code, and adapt behavior based on the outcomes of their actions. Existing datasets are either limited to one-shot code generation or comprise small, manually curated collections of interactive tasks, lacking both scale and diversity. Second, the lack of fresh interactive SWE tasks affects evaluation of rapidly improving models, as static benchmarks quickly become outdated due to contamination issues. To address these limitations, we introduce a novel, automated, and scalable pipeline to continuously extract real-world interactive SWE tasks from diverse GitHub repositories. Using this pipeline, we construct SWE-rebench, a public dataset comprising over 21,000 interactive Python-based SWE tasks, suitable for reinforcement learning of SWE agents at scale. Additionally, we use a continuous supply of fresh tasks collected using the SWE-rebench methodology to build a contamination-free benchmark for agentic software engineering. We compare the results of various LLMs on this benchmark to their results on SWE-bench Verified and show that the performance of some language models might be inflated due to contamination issues.

基于LLM的代理在越来越多的软件工程(SWE)任务中展现出可观的能力。然而,推进这一领域面临两大关键挑战。第一,高质量训练数据稀缺,尤其是反映真实世界SWE场景的数据;在这些场景中,代理必须与开发环境交互、执行代码,并根据自身行动的结果调整行为。现有数据集要么局限于一次性代码生成,要么只是小规模、人工整理的交互式任务集合,既缺乏规模也缺乏多样性。第二,缺少新的交互式SWE任务会影响对快速进步的模型的评估,因为静态基准会由于数据污染问题而迅速过时。为解决这些局限,我们提出一种新颖、自动化且可扩展的流水线,用于从多样的GitHub仓库中持续提取真实世界的交互式SWE任务。利用该流水线,我们构建了SWE-rebench,这是一个包含21,000多个基于Python的交互式SWE任务的公开数据集,适用于大规模SWE代理的强化学习。此外,我们利用按SWE-rebench方法持续收集的新任务,构建了一个面向代理式软件工程的无污染基准。我们将多种LLM在该基准上的结果与其在SWE-bench Verified上的结果进行比较,结果表明部分语言模型的性能可能因污染问题而被高估。


Article 71

Title@2025-05-26 (1): GPUMC: A Stateless Model Checker for GPU Weak Memory Concurrency

Title: GPUMC: A Stateless Model Checker for GPU Weak Memory Concurrency GPUMC: Ein zustandsloser Model Checker für Weak-Memory-Nebenläufigkeit auf GPUs GPUMC:面向GPU弱内存并发的无状态模型检查器 2505.20207v1

Authors: Soham Chakraborty, S. Krishna, Andreas Pavlogiannis, Omkar Tuppe

GPU computing is embracing weak memory concurrency for performance improvement. However, compared to CPUs, modern GPUs provide more fine-grained concurrency features such as scopes, have additional properties like divergence, and thereby follow different weak memory consistency models. These features and properties make concurrent programming on GPUs more complex and error-prone. To this end, we present GPUMC, a stateless model checker that checks the correctness of GPU shared-memory concurrent programs under the scoped-RC11 weak memory concurrency model. GPUMC explores all possible executions in GPU programs to reveal various errors: races, barrier divergence, and assertion violations. In addition, GPUMC automatically repairs these errors in the appropriate cases. We evaluate GPUMC with benchmarks and real-life GPU programs. GPUMC is efficient in both time and memory when verifying large GPU programs on which state-of-the-art tools time out. In addition, GPUMC identifies all known errors in these benchmarks, in contrast to the state-of-the-art tools.

GPU计算正在采用弱内存并发来提升性能。然而,与CPU相比,现代GPU提供了作用域(scope)等更细粒度的并发特性,并具有分歧(divergence)等额外性质,因而遵循不同的弱内存一致性模型。这些特性和性质使GPU上的并发编程更加复杂且容易出错。为此,我们提出GPUMC,这是一个无状态模型检查器,用于在scoped-RC11弱内存并发模型下检查GPU共享内存并发程序的正确性。GPUMC探索GPU程序中所有可能的执行,以发现各类错误:竞争、屏障分歧和断言违反。此外,GPUMC还会在适当的情况下自动修复这些错误。我们使用基准程序和真实的GPU程序对GPUMC进行了评估。在验证大型GPU程序时,GPUMC在时间和内存上都很高效,而最先进的工具则会超时。此外,与最先进的工具相比,GPUMC能够识别这些基准中所有已知的错误。


Article 72

Title@2025-05-26 (1): Evaluating Large Language Models for Code Review

Title: Evaluating Large Language Models for Code Review Bewertung großer Sprachmodelle für Code-Reviews 评估用于代码审查的大语言模型 2505.20206v1

Authors: Umut Cihan, Arda İçöz, Vahid Haratian, Eray Tüzün

Context: Code reviews are crucial for software quality. Recent AI advances have allowed large language models (LLMs) to review and fix code; now, there are tools that perform these reviews. However, their reliability and accuracy have not yet been systematically evaluated. Objective: This study compares different LLMs’ performance in detecting code correctness and suggesting improvements. Method: We tested GPT4o and Gemini 2.0 Flash on 492 AI-generated code blocks of varying correctness, along with 164 canonical code blocks from the HumanEval benchmark. To simulate the code review task objectively, we expected LLMs to assess code correctness and improve the code if needed. We ran experiments with different configurations and reported on the results. Results: With problem descriptions, GPT4o and Gemini 2.0 Flash correctly classified code correctness 68.50% and 63.89% of the time, respectively, and corrected the code 67.83% and 54.26% of the time, respectively, for the 492 code blocks of varying correctness. Without problem descriptions, performance declined. The results for the 164 canonical code blocks differed, suggesting that performance depends on the type of code. Conclusion: LLM code reviews can help suggest improvements and assess correctness, but there is a risk of faulty outputs. We propose a process that involves humans, called the “Human in the loop LLM Code Review”, to promote knowledge sharing while mitigating the risk of faulty outputs.

背景:代码审查对软件质量至关重要。近期AI的进展使大语言模型(LLM)能够审查和修复代码,目前已有工具执行此类审查。然而,它们的可靠性和准确性尚未得到系统评估。目标:本研究比较不同LLM在判断代码正确性和提出改进建议方面的表现。方法:我们在492个正确性各异的AI生成代码块以及来自HumanEval基准的164个标准(canonical)代码块上测试了GPT4o和Gemini 2.0 Flash。为客观模拟代码审查任务,我们要求LLM评估代码正确性,并在需要时改进代码。我们以不同配置进行了实验并报告了结果。结果:在提供问题描述的情况下,对于492个正确性各异的代码块,GPT4o和Gemini 2.0 Flash分别在68.50%和63.89%的情况下正确判断了代码正确性,并分别在67.83%和54.26%的情况下修正了代码。在没有问题描述的情况下,性能有所下降。164个标准代码块上的结果有所不同,表明性能取决于代码类型。结论:LLM代码审查有助于提出改进建议和评估正确性,但存在输出错误的风险。我们提出一种有人参与的流程,称为“Human in the loop LLM Code Review”,以在降低错误输出风险的同时促进知识共享。


Article 73

Title@2025-05-26 (1): Exposing Go’s Hidden Bugs: A Novel Concolic Framework

Title: Exposing Go’s Hidden Bugs: A Novel Concolic Framework Aufdecken der versteckten Bugs von Go: Ein neuartiges Concolic-Framework 揭示Go的隐藏缺陷:一种新颖的concolic框架 2505.20183v1

Authors: Karolina Gorna, Nicolas Iooss, Yannick Seurin, Rida Khatoun

The widespread adoption of the Go programming language in infrastructure backends and blockchain projects has heightened the need for improved security measures. Established techniques such as unit testing, static analysis, and program fuzzing provide foundational protection mechanisms. Although symbolic execution tools have made significant contributions, opportunities remain to address the complexities of Go’s runtime and concurrency model. In this work, we present Zorya, a novel methodology leveraging concrete and symbolic (concolic) execution to evaluate Go programs comprehensively. By systematically exploring execution paths to uncover vulnerabilities beyond conventional testing, symbolic execution offers distinct advantages, and coupling it with concrete execution mitigates the path explosion problem. Our solution employs Ghidra’s P-Code as an intermediate representation (IR). This implementation detects runtime panics in the TinyGo compiler and supports both generic and custom invariants. Furthermore, P-Code’s generic IR nature enables analysis of programs written in other languages such as C. Future enhancements may include intelligent classification of concolic execution logs to identify vulnerability patterns.

Go编程语言在基础设施后端和区块链项目中的广泛采用,提高了对改进安全措施的需求。单元测试、静态分析和程序模糊测试等成熟技术提供了基础性的保护机制。尽管符号执行工具已做出重要贡献,但在应对Go运行时和并发模型的复杂性方面仍有改进空间。在这项工作中,我们提出Zorya,这是一种利用具体执行与符号执行相结合(concolic)来全面评估Go程序的新方法。符号执行通过系统地探索执行路径来发现常规测试无法发现的漏洞,具有独特优势,而将其与具体执行相结合可以缓解路径爆炸问题。我们的方案采用Ghidra的P-Code作为中间表示(IR)。该实现能够检测TinyGo编译器中的运行时panic,并同时支持通用不变量和自定义不变量。此外,P-Code通用的IR特性使其能够分析用C等其他语言编写的程序。未来的改进可能包括对concolic执行日志进行智能分类,以识别漏洞模式。


Article 74

Title@2025-05-26 (1): An Empirical Study on Strong-Weak Model Collaboration for Repo-level Code Generation

Title: An Empirical Study on Strong-Weak Model Collaboration for Repo-level Code Generation Eine empirische Studie zur Zusammenarbeit starker und schwacher Modelle bei der Codegenerierung auf Repository-Ebene 面向仓库级代码生成的强弱模型协作实证研究 2505.20182v1

Authors: Shubham Gandhi, Atharva Naik, Yiqing Xie, Carolyn Rose

We study cost-efficient collaboration between strong and weak language models for repository-level code generation, where the weak model handles simpler tasks at lower cost, and the most challenging tasks are delegated to the strong model. While many works propose architectures for this task, few analyze performance relative to cost. We evaluate a broad spectrum of collaboration strategies: context-based, pipeline-based, and dynamic, on GitHub issue resolution. Our most effective collaborative strategy achieves equivalent performance to the strong model while reducing the cost by 40%. Based on our findings, we offer actionable guidelines for choosing collaboration strategies under varying budget and performance constraints. Our results show that strong-weak collaboration substantially boosts the weak model’s performance at a fraction of the cost, pipeline and context-based methods being most efficient. We release the code for our work at https://github.com/shubhamrgandhi/codegen-strong-weak-collab.

我们研究了在仓库级代码生成中强、弱语言模型之间具有成本效益的协作:弱模型以较低成本处理较简单的任务,最具挑战性的任务则交给强模型。虽然许多工作为该任务提出了架构,但很少有工作分析相对于成本的性能。我们在GitHub issue解决任务上评估了多种协作策略:基于上下文的、基于流水线的和动态的策略。我们最有效的协作策略在将成本降低40%的同时,达到了与强模型相当的性能。基于这些发现,我们为在不同预算和性能约束下选择协作策略提供了可操作的指南。结果表明,强弱协作能以很小的成本大幅提升弱模型的性能,其中基于流水线和基于上下文的方法效率最高。我们的代码发布于 https://github.com/shubhamrgandhi/codegen-strong-weak-collab 。
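
The cost argument in the abstract above (the weak model handles easy tasks, the strong model takes the hard ones) can be illustrated with a minimal escalation sketch. This is a hypothetical cascade, not one of the paper's context-based, pipeline-based, or dynamic strategies: `weak`, `strong`, and `accept` are placeholder callables, and the per-call costs are made-up units chosen by the caller.

```python
def cascade(tasks, weak, strong, accept, weak_cost, strong_cost):
    """Try the weak model first and escalate only when its answer is rejected.

    weak, strong: callables mapping a task to an answer (placeholders).
    accept: callable (task, answer) -> bool, e.g. a test run or a verifier.
    Returns the answers, the total cost, and the number of escalations.
    """
    outputs = []
    total_cost = 0.0
    escalated = 0
    for task in tasks:
        answer = weak(task)
        total_cost += weak_cost
        if not accept(task, answer):
            # The weak attempt failed the check, so pay for the strong model.
            answer = strong(task)
            total_cost += strong_cost
            escalated += 1
        outputs.append(answer)
    return outputs, total_cost, escalated


def strong_only_cost(tasks, strong_cost):
    """Baseline cost of sending every task straight to the strong model."""
    return len(tasks) * strong_cost
```

Comparing `total_cost` with `strong_only_cost` shows the trade-off: the cascade only saves money when the weak model is accepted often enough to offset the extra weak call made on every escalated task.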


Article 75

Title@2025-05-26 (1): Evaluating Software Plagiarism Detection in the Age of AI: Automated Obfuscation and Lessons for Academic Integrity

Title: Evaluating Software Plagiarism Detection in the Age of AI: Automated Obfuscation and Lessons for Academic Integrity Bewertung der Software-Plagiatserkennung im Zeitalter der KI: Automatisierte Obfuskation und Lehren für die akademische Integrität 评估AI时代的软件抄袭检测:自动化混淆及对学术诚信的启示 2505.20158v1

Authors: Timur Sağlam, Larissa Schmid

Plagiarism in programming assignments is a persistent issue in computer science education, increasingly complicated by the emergence of automated obfuscation attacks. While software plagiarism detectors are widely used to identify suspicious similarities at scale and are resilient to simple obfuscation techniques, they are vulnerable to advanced obfuscation based on structural modification of program code that preserves the original program behavior. While different defense mechanisms have been proposed to increase resilience against these attacks, their current evaluation is limited to the scope of attacks used and lacks a comprehensive investigation regarding AI-based obfuscation. In this paper, we investigate the resilience of these defense mechanisms against a broad range of automated obfuscation attacks, including both algorithmic and AI-generated methods, and for a wide variety of real-world datasets. We evaluate the improvements of two defense mechanisms over the plagiarism detector JPlag across over four million pairwise program comparisons. Our results show significant improvements in detecting obfuscated plagiarism instances, and we observe an improved detection of AI-generated programs, even though the defense mechanisms are not designed for this use case. Based on our findings, we provide an in-depth discussion of their broader implications for academic integrity and the role of AI in education.

编程作业中的抄袭是计算机科学教育中长期存在的问题,而自动化混淆攻击的出现使其日益复杂。软件抄袭检测器被广泛用于大规模识别可疑的相似之处,并能抵御简单的混淆技术,但它们容易受到高级混淆的攻击,这类混淆对程序代码进行结构性修改,同时保留原程序的行为。虽然已有多种防御机制被提出以提高对这些攻击的抵御能力,但目前对它们的评估局限于所使用的攻击范围,并且缺乏针对基于AI的混淆的全面研究。在本文中,我们在多种真实世界数据集上,考察这些防御机制对各类自动化混淆攻击(包括算法生成和AI生成的方法)的抵御能力。我们通过四百多万次成对程序比较,评估了两种防御机制相对于抄袭检测器JPlag的改进。结果表明,对经过混淆的抄袭实例的检测有显著改进,并且对AI生成程序的检测也有所改善,尽管这些防御机制并非为该用例而设计。基于这些发现,我们深入讨论了其对学术诚信以及AI在教育中的作用的更广泛影响。


Article 76

Title@2025-05-26 (1): The CodeInverter Suite: Control-Flow and Data-Mapping Augmented Binary Decompilation with LLMs

Title: The CodeInverter Suite: Control-Flow and Data-Mapping Augmented Binary Decompilation with LLMs Die CodeInverter Suite: Durch Kontrollfluss und Daten-Mapping erweiterte Binärdekompilierung mit LLMs CodeInverter Suite:基于LLM、以控制流和数据映射增强的二进制反编译 2503.07215v2

Authors: Peipei Liu, Jian Sun, Rongkang Sun, Li Chen, Zhaoteng Yan, Peizheng Zhang, Dapeng Sun, Dawei Wang, Xiaoling Zhang, Dan Li

Binary decompilation plays a vital role in various cybersecurity and software engineering tasks. Recently, end-to-end decompilation methods powered by large language models (LLMs) have garnered significant attention due to their ability to generate highly readable source code with minimal human intervention. However, existing LLM-based approaches face several critical challenges, including limited capability in reconstructing code structure and logic, low accuracy in data recovery, concerns over data security and privacy, and high computational resource requirements. To address these issues, we develop the CodeInverter Suite, making three contributions: (1) The CodeInverter Workflow (CIW) is a novel prompt engineering workflow that incorporates control flow graphs (CFGs) and explicit data mappings to improve LLM-based decompilation. (2) Using CIW on well-known source code datasets, we curate the CodeInverter Dataset (CID), a domain-specific dataset of 8.69 million samples that contain CFGs and data mapping tables. (3) We train the CodeInverter Models (CIMs) on CID, generating two lightweight LLMs (with 1.3B and 6.7B parameters) intended for efficient inference in privacy-sensitive or resource-constrained environments. Extensive experiments on two benchmarks demonstrate that the CIW substantially enhances the performance of various LLMs across multiple metrics. Our CIM-6.7B can achieve state-of-the-art decompilation performance, outperforming existing LLMs with over 100x more parameters on decompilation tasks, with average improvements of 11.03% in re-executability and 6.27% in edit similarity.

二进制反编译在各种网络安全和软件工程任务中发挥着至关重要的作用。近来,由大语言模型(LLM)驱动的端到端反编译方法受到广泛关注,因为它们能够在极少人工干预的情况下生成高可读性的源代码。然而,现有基于LLM的方法面临若干关键挑战,包括重建代码结构和逻辑的能力有限、数据恢复的准确性低、对数据安全和隐私的担忧,以及较高的计算资源需求。为解决这些问题,我们开发了CodeInverter Suite,并做出三项贡献:(1)CodeInverter Workflow(CIW)是一种新颖的提示工程工作流,它结合控制流图(CFG)和显式数据映射来改进基于LLM的反编译。(2)我们在知名源代码数据集上使用CIW,整理出CodeInverter Dataset(CID),这是一个领域特定数据集,包含869万个带有CFG和数据映射表的样本。(3)我们在CID上训练CodeInverter Models(CIM),得到两个轻量级LLM(参数量分别为1.3B和6.7B),用于在隐私敏感或资源受限的环境中进行高效推理。在两个基准上的大量实验表明,CIW在多项指标上显著提升了各种LLM的性能。我们的CIM-6.7B能够达到最先进的反编译性能,在反编译任务中优于参数量超过其100倍的现有LLM,可重执行率平均提升11.03%,编辑相似度平均提升6.27%。


Article 77

Title@2025-05-26 (1): StructEval: Benchmarking LLMs’ Capabilities to Generate Structural Outputs

Title: StructEval: Benchmarking LLMs’ Capabilities to Generate Structural Outputs StructEval: Benchmarking der Fähigkeiten von LLMs zur Erzeugung strukturierter Ausgaben StructEval:评测LLM生成结构化输出的能力 2505.20139v1

Authors: Jialin Yang, Dongfu Jiang, Lipeng He, Sherman Siu, Yuxuan Zhang, Disen Liao, Zhuofeng Li, Huaye Zeng, Yiming Jia, Haozhe Wang, Benjamin Schneider, Chi Ruan, Wentao Ma, Zhiheng Lyu, Yifei Wang, Yi Lu, Quy Duc Do, Ziyan Jiang, Ping Nie, Wenhu Chen

As Large Language Models (LLMs) become integral to software development workflows, their ability to generate structured outputs has become critically important. We introduce StructEval, a comprehensive benchmark for evaluating LLMs’ capabilities in producing both non-renderable (JSON, YAML, CSV) and renderable (HTML, React, SVG) structured formats. Unlike prior benchmarks, StructEval systematically evaluates structural fidelity across diverse formats through two paradigms: 1) generation tasks, producing structured output from natural language prompts, and 2) conversion tasks, translating between structured formats. Our benchmark encompasses 18 formats and 44 types of task, with novel metrics for format adherence and structural correctness. Results reveal significant performance gaps: even state-of-the-art models like o1-mini achieve an average score of only 75.58, with open-source alternatives lagging approximately 10 points behind. We find generation tasks more challenging than conversion tasks, and producing correct visual content more difficult than generating text-only structures.

随着大语言模型(LLM)成为软件开发工作流不可或缺的一部分,其生成结构化输出的能力变得至关重要。我们提出StructEval,这是一个全面的基准,用于评估LLM生成不可渲染(JSON、YAML、CSV)和可渲染(HTML、React、SVG)结构化格式的能力。与以往基准不同,StructEval通过两种范式系统地评估多种格式下的结构保真度:1)生成任务,根据自然语言提示生成结构化输出;2)转换任务,在结构化格式之间进行转换。我们的基准涵盖18种格式和44种任务类型,并提出了衡量格式遵循度和结构正确性的新指标。结果显示存在显著的性能差距:即使是o1-mini这样的最先进模型,平均得分也只有75.58,开源替代模型则落后约10分。我们发现生成任务比转换任务更具挑战性,生成正确的可视化内容比生成纯文本结构更困难。
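
To make "format adherence" and "structural correctness" in the StructEval abstract above more tangible, here is a small sketch of what such checks could look like for two of the non-renderable formats. The abstract does not define StructEval's metrics, so these three helpers are hypothetical stand-ins built only on the Python standard library.

```python
import csv
import io
import json


def is_valid_json(text: str) -> bool:
    """Format adherence: does the model output parse as JSON at all?"""
    try:
        json.loads(text)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


def is_rectangular_csv(text: str) -> bool:
    """Format adherence: non-empty CSV in which every row has the same width."""
    rows = list(csv.reader(io.StringIO(text)))
    return bool(rows) and all(len(row) == len(rows[0]) for row in rows)


def has_key_path(obj, path: str) -> bool:
    """Structural correctness: does a dotted key path such as 'user.name' exist?"""
    for key in path.split("."):
        if not isinstance(obj, dict) or key not in obj:
            return False
        obj = obj[key]
    return True
```

A scorer in this spirit would first gate on the parse check and then award credit per required key path, which separates "the output is not JSON" from "the output is JSON but is missing fields".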


Article 78

Title@2025-05-26 (1): Engineering Trustworthy Machine-Learning Operations with Zero-Knowledge Proofs

Title: Engineering Trustworthy Machine-Learning Operations with Zero-Knowledge Proofs Engineering Vertrauenswürdige Maschinen-Learning-Operationen mit Null-Wissens-Proofs 具有零知识证明的工程可信赖的机械学习操作 2505.20136v1

Authors: Filippo Scaramuzza, Giovanni Quattrocchi, Damian A. Tamburri

As Artificial Intelligence (AI) systems, particularly those based on machine learning (ML), become integral to high-stakes applications, their probabilistic and opaque nature poses significant challenges to traditional verification and validation methods. These challenges are exacerbated in regulated sectors requiring tamper-proof, auditable evidence, as highlighted by apposite legal frameworks, e.g., the EU AI Act. Conversely, Zero-Knowledge Proofs (ZKPs) offer a cryptographic solution that enables provers to demonstrate, through verified computations, adherence to set requirements without revealing sensitive model details or data. Through a systematic survey of ZKP protocols, we identify five key properties (non-interactivity, transparent setup, standard representations, succinctness, and post-quantum security) critical for their application in AI validation and verification pipelines. Subsequently, we perform a follow-up systematic survey analyzing ZKP-enhanced ML applications across an adaptation of the Team Data Science Process (TDSP) model (Data & Preprocessing, Training & Offline Metrics, Inference, and Online Metrics), detailing verification objectives, ML models, and adopted protocols. Our findings indicate that current research on ZKP-Enhanced ML primarily focuses on inference verification, while the data preprocessing and training stages remain underexplored. Most notably, our analysis identifies a significant convergence within the research domain toward the development of a unified Zero-Knowledge Machine Learning Operations (ZKMLOps) framework. This emerging framework leverages ZKPs to provide robust cryptographic guarantees of correctness, integrity, and privacy, thereby promoting enhanced accountability, transparency, and compliance with Trustworthy AI principles.

随着人工智能(AI)系统,尤其是基于机器学习(ML)的系统,成为高风险应用不可或缺的一部分,其概率性和不透明性给传统的验证与确认方法带来了重大挑战。在需要防篡改、可审计证据的受监管行业中,这些挑战更为严峻,欧盟《人工智能法案》(EU AI Act)等相关法律框架也强调了这一点。相对地,零知识证明(ZKP)提供了一种密码学方案,使证明者能够通过经过验证的计算证明其符合既定要求,而无需泄露敏感的模型细节或数据。通过对ZKP协议的系统性调研,我们确定了五个对其应用于AI验证与确认流水线至关重要的关键性质:非交互性、透明设置、标准表示、简洁性和后量子安全性。随后,我们又进行了一项后续的系统性调研,依据改编后的Team Data Science Process(TDSP)模型(数据与预处理、训练与离线指标、推理、在线指标)分析ZKP增强的ML应用,详述其验证目标、ML模型和所采用的协议。我们的发现表明,当前关于ZKP增强ML的研究主要集中于推理验证,而数据预处理和训练阶段仍未得到充分探索。最值得注意的是,我们的分析发现该研究领域正明显趋向于构建统一的零知识机器学习运维(ZKMLOps)框架。这一新兴框架利用ZKP为正确性、完整性和隐私提供强有力的密码学保证,从而促进更强的问责性、透明度以及对可信AI原则的遵循。


Article 79

Title@2025-05-26 (1): Grammars of Formal Uncertainty: When to Trust LLMs in Automated Reasoning Tasks

Title: Grammars of Formal Uncertainty: When to Trust LLMs in Automated Reasoning Tasks Grammatik der formalen Unsicherheit: Wann man LLMs bei automatisierten Aufgaben zur Begründung vertraut 正式不确定性的语法:在自动说明理由任务中何时信任LLMs 2505.20047v1

Authors: Debargha Ganguly, Vikash Singh, Sreehari Sankar, Biyao Zhang, Xuecen Zhang, Srinivasan Iyengar, Xiaotian Han, Amit Sharma, Shivkumar Kalyanaraman, Vipin Chaudhary

Large language models (LLMs) show remarkable promise for democratizing automated reasoning by generating formal specifications. However, a fundamental tension exists: LLMs are probabilistic, while formal verification demands deterministic guarantees. This paper addresses this epistemological gap by comprehensively investigating failure modes and uncertainty quantification (UQ) in LLM-generated formal artifacts. Our systematic evaluation of five frontier LLMs reveals Satisfiability Modulo Theories (SMT) based autoformalization’s domain-specific impact on accuracy (from +34.8% on logical tasks to -44.5% on factual ones), with known UQ techniques like the entropy of token probabilities failing to identify these errors. We introduce a probabilistic context-free grammar (PCFG) framework to model LLM outputs, yielding a refined uncertainty taxonomy. We find uncertainty signals are task-dependent (e.g., grammar entropy for logic, AUROC>0.93). Finally, a lightweight fusion of these signals enables selective verification, drastically reducing errors (14-100%) with minimal abstention, transforming LLM-driven formalization into a reliable engineering discipline.

大语言模型(LLM)通过生成形式化规约,在推动自动推理大众化方面展现出巨大潜力。然而,这里存在一个根本矛盾:LLM是概率性的,而形式化验证要求确定性保证。本文通过全面研究LLM生成的形式化产物中的失败模式和不确定性量化(UQ)来应对这一认识论上的差距。我们对五个前沿LLM的系统评估表明,基于可满足性模理论(SMT)的自动形式化对准确率的影响因领域而异(从逻辑任务上的+34.8%到事实类任务上的-44.5%),而词元概率熵等已知UQ技术无法识别这些错误。我们引入概率上下文无关文法(PCFG)框架来对LLM输出建模,得到一个更精细的不确定性分类体系。我们发现不确定性信号依赖于任务(例如,对于逻辑任务,文法熵的AUROC>0.93)。最后,对这些信号进行轻量级融合可以实现选择性验证,在极少弃答的情况下大幅减少错误(14-100%),从而将LLM驱动的形式化转变为一门可靠的工程学科。
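
Two ideas from the abstract above lend themselves to a short sketch: the token-probability entropy baseline (which the authors report as insufficient on its own) and selective verification, where outputs with a high uncertainty score are abstained on rather than trusted. The code is a simplified illustration under assumed data layouts; it does not reproduce the paper's PCFG-based signals or its fusion method, and the `uncertainty` and `correct` dictionary keys are hypothetical.

```python
import math


def mean_token_entropy(token_distributions):
    """Average Shannon entropy (in nats) over per-token probability distributions."""
    total = 0.0
    for dist in token_distributions:
        total += -sum(p * math.log(p) for p in dist if p > 0.0)
    return total / len(token_distributions)


def selective_verify(samples, threshold):
    """Keep samples whose uncertainty is at most the threshold; abstain on the rest."""
    accepted = [s for s in samples if s["uncertainty"] <= threshold]
    abstained = [s for s in samples if s["uncertainty"] > threshold]
    return accepted, abstained


def error_rate(samples):
    """Fraction of samples marked incorrect (0.0 for an empty list)."""
    if not samples:
        return 0.0
    return sum(not s["correct"] for s in samples) / len(samples)
```

The quantity of interest is how far `error_rate(accepted)` falls below `error_rate(samples)` for a given abstention rate; any uncertainty score, entropy-based or grammar-based, can be plugged into the `uncertainty` field to compare signals on that basis.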


Article 80

Title@2025-05-26 (1): A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron?

Title: A Survey on the Safety and Security Threats of Computer-Using Agents: JARVIS or Ultron? Eine Übersicht über die Sicherheitsbedrohungen durch Computer-Using Agents: JARVIS oder Ultron? 计算机使用代理的安全与安保威胁综述:JARVIS还是Ultron? 2505.10924v2

Authors: Ada Chen, Yongjiang Wu, Junyuan Zhang, Jingyu Xiao, Shu Yang, Jen-tse Huang, Kun Wang, Wenxuan Wang, Shuai Wang

Recently, AI-driven interactions with computing devices have advanced from basic prototype tools to sophisticated, LLM-based systems that emulate human-like operations in graphical user interfaces. We are now witnessing the emergence of Computer-Using Agents (CUAs), capable of autonomously performing tasks such as navigating desktop applications, web pages, and mobile apps. However, as these agents grow in capability, they also introduce novel safety and security risks. Vulnerabilities in LLM-driven reasoning, with the added complexity of integrating multiple software components and multimodal inputs, further complicate the security landscape. In this paper, we present a systematization of knowledge on the safety and security threats of CUAs. We conduct a comprehensive literature review and distill our findings along four research objectives: (i) define the CUA that suits safety analysis; (ii) categorize current safety threats among CUAs; (iii) propose a comprehensive taxonomy of existing defensive strategies; (iv) summarize prevailing benchmarks, datasets, and evaluation metrics used to assess the safety and performance of CUAs. Building on these insights, our work provides future researchers with a structured foundation for exploring unexplored vulnerabilities and offers practitioners actionable guidance in designing and deploying secure Computer-Using Agents.

近来,AI 驱动的计算设备交互已从基本的原型工具发展为复杂的、基于 LLM 的系统,能够在图形用户界面中模拟类似人类的操作。我们正在见证计算机使用智能体(Computer-Using Agents,CUAs)的出现,它们能够自主执行浏览桌面应用程序、网页和移动应用等任务。然而,随着这些智能体能力的增长,它们也带来了新的安全与安保风险。LLM 驱动推理中的漏洞,加上整合多个软件组件和多模态输入所带来的复杂性,使安全形势更加复杂。本文对 CUA 的安全与安保威胁进行了知识系统化梳理。我们开展了全面的文献综述,并围绕四个研究目标提炼研究结果:(i) 定义适合安全分析的 CUA;(ii) 对 CUA 当前面临的安全威胁进行分类;(iii) 提出现有防御策略的全面分类体系;(iv) 总结用于评估 CUA 安全性和性能的主流基准、数据集和评估指标。基于这些见解,我们的工作为未来的研究者探索尚未发现的漏洞提供了结构化的基础,并为从业者设计和部署安全的计算机使用智能体提供了可操作的指导。


Article 81

Title@2025-05-26 (1): Ontology- and LLM-based Data Harmonization for Federated Learning in Healthcare

Title: Ontology- and LLM-based Data Harmonization for Federated Learning in Healthcare Ontologie- und LLM-basierte Datenharmonisierung für das Federated Learning in Healthcare 以本体学和LLM为基础的保健方面联邦学习数据统一 2505.20020v1

Authors: Natallia Kokash, Lei Wang, Thomas H. Gillespie, Adam Belloum, Paola Grosso, Sara Quinney, Lang Li, Bernard de Bono

The rise of electronic health records (EHRs) has unlocked new opportunities for medical research, but privacy regulations and data heterogeneity remain key barriers to large-scale machine learning. Federated learning (FL) enables collaborative modeling without sharing raw data, yet faces challenges in harmonizing diverse clinical datasets. This paper presents a two-step data alignment strategy integrating ontologies and large language models (LLMs) to support secure, privacy-preserving FL in healthcare, demonstrating its effectiveness in a real-world project involving semantic mapping of EHR data.

电子健康记录(EHRs)的兴起为医学研究开辟了新的机会,但隐私条例和数据差异性仍然是大规模机器学习的主要障碍,联邦学习(FL)在不共享原始数据的情况下可以合作建模,但在协调各种临床数据集方面却面临挑战。本文提出了分两步走的数据调整战略,将本种学和大语言模型(LLMs)结合起来,以支持安全、隐私保护的FL医疗,在涉及EHR数据语义图谱的真实世界项目中证明了其有效性。


Article 82

Title@2025-05-26 (1): Requirements Coverage-Guided Minimization for Natural Language Test Cases

Title: Requirements Coverage-Guided Minimization for Natural Language Test Cases Anforderungen Abdeckungsgeführte Minimierung für natürliche Sprachtests 以涵盖范围为指导的尽量减少自然语言测试案件 2505.20004v1

Authors: Rongqi Pan, Feifei Niu, Lionel C. Briand, Hanyang Hu

As software systems evolve, test suites tend to grow in size and often contain redundant test cases. Such redundancy increases testing effort, time, and cost. Test suite minimization (TSM) aims to eliminate such redundancy while preserving key properties such as requirement coverage and fault detection capability. In this paper, we propose RTM (Requirement coverage-guided Test suite Minimization), a novel TSM approach designed for requirement-based testing (validation), which can effectively reduce test suite redundancy while ensuring full requirement coverage and a high fault detection rate (FDR) under a fixed minimization budget. Based on common practice in critical systems where functional safety is important, we assume test cases are specified in natural language and traced to requirements before being implemented. RTM preprocesses test cases using three different preprocessing methods, and then converts them into vector representations using seven text embedding techniques. Similarity values between vectors are computed utilizing three distance functions. A Genetic Algorithm, whose population is initialized by coverage-preserving initialization strategies, is then employed to identify an optimized subset containing diverse test cases matching the set budget. We evaluate RTM on an industrial automotive system dataset comprising 736 system test cases and 54 requirements. Experimental results show that RTM consistently outperforms baseline techniques in terms of FDR across different minimization budgets while maintaining full requirement coverage. Furthermore, we investigate the impact of test suite redundancy levels on the effectiveness of TSM, providing new insights into optimizing requirement-based test suites under practical constraints.

随着软件系统的演进,测试套件的规模往往不断增长,并且常常包含冗余的测试用例。这种冗余会增加测试的工作量、时间和成本。测试套件最小化(TSM)旨在消除这种冗余,同时保留需求覆盖率和故障检测能力等关键属性。本文提出 RTM(Requirement coverage-guided Test suite Minimization,需求覆盖引导的测试套件最小化),这是一种面向基于需求的测试(确认)的新型 TSM 方法,能够在固定的最小化预算下有效减少测试套件冗余,同时确保完全的需求覆盖和较高的故障检测率(FDR)。根据功能安全十分重要的关键系统中的常见做法,我们假定测试用例以自然语言编写,并在实现之前追溯到需求。RTM 使用三种不同的预处理方法对测试用例进行预处理,然后使用七种文本嵌入技术将其转换为向量表示。向量之间的相似度通过三种距离函数计算。随后采用遗传算法(其种群由保持覆盖率的初始化策略进行初始化)来确定一个优化的子集,该子集包含符合既定预算的多样化测试用例。我们在一个工业汽车系统数据集上评估 RTM,该数据集包含 736 个系统测试用例和 54 项需求。实验结果表明,在不同的最小化预算下,RTM 在 FDR 方面始终优于基线技术,同时保持完全的需求覆盖。此外,我们还研究了测试套件冗余程度对 TSM 有效性的影响,为在实际约束下优化基于需求的测试套件提供了新的见解。
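The abstract above mentions a coverage-preserving initialization and a diversity-seeking search under a fixed budget. The following is a minimal sketch of those two ideas only, not the authors' RTM implementation: the traceability map `trace`, the two-dimensional `vectors`, and the function names are hypothetical toy stand-ins, and a greedy set cover with random padding replaces the paper's unspecified initialization strategies.

```python
import math
import random
from itertools import combinations


def coverage_preserving_subset(trace, budget, rng):
    """Pick `budget` test ids that together cover every requirement.

    `trace` maps test id -> set of requirement ids (hypothetical toy data).
    Greedy set cover runs first, then random padding fills the budget.
    """
    all_reqs = set().union(*trace.values())
    chosen, covered = [], set()
    while covered != all_reqs:
        # Take the test that adds the most still-uncovered requirements.
        best = max(sorted(trace), key=lambda t: len(trace[t] - covered))
        chosen.append(best)
        covered |= trace[best]
    if len(chosen) > budget:
        raise ValueError("budget too small for full requirement coverage")
    rest = [t for t in sorted(trace) if t not in chosen]
    chosen += rng.sample(rest, budget - len(chosen))
    return chosen


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def diversity(subset, vectors):
    """Mean pairwise distance; a candidate fitness for a genetic algorithm."""
    pairs = list(combinations(subset, 2))
    return sum(cosine_distance(vectors[a], vectors[b]) for a, b in pairs) / len(pairs)


# Toy data: four test cases traced to three requirements.
trace = {"t1": {"r1", "r2"}, "t2": {"r2"}, "t3": {"r3"}, "t4": {"r1", "r3"}}
vectors = {"t1": [1.0, 0.0], "t2": [1.0, 0.1], "t3": [0.0, 1.0], "t4": [1.0, 1.0]}

subset = coverage_preserving_subset(trace, budget=3, rng=random.Random(0))
score = diversity(subset, vectors)
```

A genetic algorithm would start from many such subsets and evolve them toward higher `diversity` while rejecting offspring that lose coverage.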


Article 83

Title@2025-05-26 (1): The Invisible Hand: Unveiling Provider Bias in Large Language Models for Code Generation

Title: The Invisible Hand: Unveiling Provider Bias in Large Language Models for Code Generation Die unsichtbare Hand: Enthüllen von Provider-Bias in großen Sprachmodellen für die Codegenerierung 无形之手:揭示代码生成大语言模型中的提供商偏见 2501.07849v2

Authors: Xiaoyu Zhang, Juan Zhai, Shiqing Ma, Qingshuang Bao, Weipeng Jiang, Qian Wang, Chao Shen, Yang Liu

Large Language Models (LLMs) have emerged as the new recommendation engines, surpassing traditional methods in both capability and scope, particularly in code generation. In this paper, we reveal a novel provider bias in LLMs: without explicit directives, these models show systematic preferences for services from specific providers in their recommendations (e.g., favoring Google Cloud over Microsoft Azure). To systematically investigate this bias, we develop an automated pipeline to construct the dataset, incorporating 6 distinct coding task categories and 30 real-world application scenarios. Leveraging this dataset, we conduct the first comprehensive empirical study of provider bias in LLM code generation across seven state-of-the-art LLMs, utilizing approximately 500 million tokens (equivalent to $5,000+ in computational costs). Our findings reveal that LLMs exhibit significant provider preferences, predominantly favoring services from Google and Amazon, and can autonomously modify input code to incorporate their preferred providers without users’ requests. Such a bias holds far-reaching implications for market dynamics and societal equilibrium, potentially contributing to digital monopolies. It may also deceive users and violate their expectations, leading to various consequences. We call on the academic community to recognize this emerging issue and develop effective evaluation and mitigation methods to uphold AI security and fairness.

大型语言模型(LLMS)已成为新的建议引擎,超越了能力和范围的传统方法,特别是在代码生成方面。在本文中,我们揭示了在LLMS中存在一种新颖的提供者偏向:在没有明确指示的情况下,这些模型在其建议中显示了对具体提供者服务的系统性偏好(例如,偏爱谷歌云而不是微软Azure ) 。为了系统地调查这一偏向,我们开发了一条自动管道来构建数据集,其中包括6个不同的编码任务类别和30个现实世界应用情景。利用这一数据集,我们首次对7个最先进的LLMM代号生成中的LM代号供应商偏向进行了全面的经验性研究,利用了大约5亿个符号(相当于计算成本中的5,000美元+ ) 。我们的调查结果显示,LMS展示了供应商的偏好,主要是支持谷歌和亚马逊的服务,可以自主地修改输入代码,以纳入其首选提供者,而无需用户的请求。这种偏向市场动态和社会平衡有着深远的影响,有可能促成数字垄断。我们也可能欺骗用户,并违反他们的期望,导致各种后果。我们呼吁学术界认识到这一问题的公平性和制定有效的评估方法。


Article 84

Title@2025-05-26 (1): Systems of Twinned Systems: A Systematic Literature Review

Title: Systems of Twinned Systems: A Systematic Literature Review Systeme von Zwillingssystemen: Ein Systematischer Literaturbericht 结对系统系统系统:系统文献审查 2505.19916v1

Authors: Feyi Adesanya, Kanan Castro Silva, Valdemar V. Graciano Neto, Istvan David

Modern systems exhibit unprecedented complexity due to their increased scale, interconnectedness, and the heterogeneity of their digital and physical components. In response to scaling challenges, the system-of-systems (SoS) paradigm proposes flexible aggregations of subsystems into a larger whole, while maintaining the independence of subsystems to various degrees. In response to the cyber-physical convergence, the digital twin (DT) paradigm proposes a tight coupling between digital and physical components through computational reflection and precise control. As these two paradigms address distinct parts of the overall challenge, combining the two promises more comprehensive methods to engineer what we call systems of twinned systems (SoTS). The noticeably growing body of knowledge on SoTS calls for a review of the state of the art. In this work, we report on our systematic literature survey of SoTS. We screened over 2500 potential studies, of which we included 80 and investigated them in detail. To converge SoS and DT, we derive a classification framework for SoTS that is backward compatible with the currently accepted theories of SoS and DT.

现代系统由于其规模的扩大、相互关联性及其数字和物理组成部分的多样化而表现出前所未有的复杂性。为了应对规模扩大的挑战,系统体系范式(SOS)提出将子系统灵活合并成一个更大的整体,同时将子系统保持不同程度的独立性。为了应对网络-物理趋同,数字双轨(DT)范式建议通过计算反射和精确控制,将数字和物理组成部分紧密地结合起来。这两个范式涉及整个挑战的不同部分,结合了我们称之为结对系统(SOTS)的两种更全面的方法。关于SOTS的知识的明显增加要求审查艺术现状。在这项工作中,我们报告了我们对SOTS的系统文献调查。我们筛选了2500多项潜在研究,我们包括80项研究,并详细调查了这些研究。为了将SOS和DT的结合,我们为STS制定了一个与目前公认的S和DT理论相容不全的分类框架。


Article 85

Title@2025-05-26 (1): Deconstructing Obfuscation: A four-dimensional framework for evaluating Large Language Models assembly code deobfuscation capabilities

Title: Deconstructing Obfuscation: A four-dimensional framework for evaluating Large Language Models assembly code deobfuscation capabilities Dekonstruieren von Obfuscation: Ein vierdimensionaler Rahmen für die Auswertung von Großsprachenmodellen Assembly Code Deobfuscation Fähigkeiten 解构腐蚀:四维框架,用于评价大语言模型组装编码脱腐能力 2505.19887v1

Authors: Anton Tkachenko, Dmitrij Suskevic, Benjamin Adolphi

Large language models (LLMs) have shown promise in software engineering, yet their effectiveness for binary analysis remains unexplored. We present the first comprehensive evaluation of commercial LLMs for assembly code deobfuscation. Testing seven state-of-the-art models against four obfuscation scenarios (bogus control flow, instruction substitution, control flow flattening, and their combination), we found striking performance variations, from autonomous deobfuscation to complete failure. We propose a theoretical framework based on four dimensions: Reasoning Depth, Pattern Recognition, Noise Filtering, and Context Integration, explaining these variations. Our analysis identifies five error patterns: predicate misinterpretation, structural mapping errors, control flow misinterpretation, arithmetic transformation errors, and constant propagation errors, revealing fundamental limitations in LLM code processing. We establish a three-tier resistance model: bogus control flow (low resistance), control flow flattening (moderate resistance), and instruction substitution/combined techniques (high resistance). Universal failure against combined techniques demonstrates that sophisticated obfuscation remains effective against advanced LLMs. Our findings suggest a human-AI collaboration paradigm where LLMs reduce expertise barriers for certain reverse engineering tasks while requiring human guidance for complex deobfuscation. This work provides a foundation for evaluating emerging capabilities and developing resistant obfuscation techniques.

大型语言模型(LLMS)在软件工程方面表现出了希望,然而其二进制分析的有效性仍未得到探讨。我们首次对商用LLMS进行了全面的评估,以用于组装代码的分解。我们根据四种模糊假设(博格控制流程、教学替代、控制流程平流及其组合)测试了七种最先进的模型,我们发现从自主脱钩到完全失败的惊人的性能差异。我们提出了一个基于四个层面的理论框架:解释深度、模式识别、噪音过滤和背景整合,解释这些差异。我们的分析确定了五种错误模式:上游误差、结构绘图错误、控制流程错误、算术转换错误和持续的传播错误,揭示了LLMM代码处理中的基本限制。我们建立了三层阻力模型:博格控制流程(低抗力)、控制流程稳定(模范抗力)和教学替代/组合技术(高抗力)。我们提出的理论表明,复杂的粘合法仍然对先进的LMS有效。我们的研究发现,一种人类-AI合作模式,即LOBMS提供新的抗力基础,同时要求降低某些反向工程的复杂工作能力。


Article 86

Title@2025-05-26 (1): SecVulEval: Benchmarking LLMs for Real-World C/C++ Vulnerability Detection

Title: SecVulEval: Benchmarking LLMs for Real-World C/C++ Vulnerability Detection SecVulEval: Benchmarking LLMs für real-World C/C++ Sicherheitserkennung SecVulEval:确定真实世界C/C++脆弱性检测LLMs基准 2505.19828v1

Authors: Md Basim Uddin Ahmed, Nima Shiri Harzevili, Jiho Shin, Hung Viet Pham, Song Wang

Large Language Models (LLMs) have shown promise in software engineering tasks, but evaluating their effectiveness in vulnerability detection is challenging due to the lack of high-quality datasets. Most existing datasets are limited to function-level labels, ignoring finer-grained vulnerability patterns and crucial contextual information. Also, poor data quality such as mislabeling, inconsistent annotations, and duplicates can lead to inflated performance and weak generalization. Moreover, by including only the functions, these datasets miss broader program context, like data/control dependencies and interprocedural interactions, that are essential for accurately understanding real-world security flaws. Without this context, detection models are evaluated under unrealistic assumptions. To address these limitations, this paper introduces SecVulEval, a benchmark designed to support fine-grained evaluation of LLMs and other detection methods with rich contextual information. SecVulEval focuses on real-world C/C++ vulnerabilities at the statement level. This granularity enables more precise evaluation of a model’s ability to localize vulnerabilities, beyond simple binary classification at the function level. By incorporating rich contextual information, SecVulEval sets a new standard for vulnerability detection benchmarks in realistic scenarios. This benchmark includes 25,440 function samples covering 5,867 unique CVEs in C/C++ projects from 1999 to 2024. We evaluated the SOTA LLMs with a multi-agent-based approach. The evaluation on our dataset shows that the models are still far from accurately predicting vulnerable statements in a given function. The best-performing Claude-3.7-Sonnet model achieves 23.83% F1-score for detecting vulnerable statements with correct reasoning. Finally, we analyze the LLM outputs and provide insights into their behavior in vulnerability detection for C/C++.

大型语言模型(LLM)在软件工程任务中展现出潜力,但由于缺乏高质量的数据集,评估其在漏洞检测方面的有效性颇具挑战。现有数据集大多仅限于函数级标签,忽略了更细粒度的漏洞模式和关键的上下文信息。此外,错误标注、不一致的注释和重复样本等数据质量问题可能导致性能虚高和泛化能力不足。而且,由于只包含函数本身,这些数据集缺失了更广泛的程序上下文,例如数据/控制依赖和过程间交互,而这些对于准确理解真实世界的安全缺陷至关重要。缺少这些上下文,检测模型就是在不切实际的假设下被评估的。为了解决这些局限,本文提出 SecVulEval,这是一个旨在借助丰富的上下文信息对 LLM 和其他检测方法进行细粒度评估的基准。SecVulEval 聚焦于语句级的真实世界 C/C++ 漏洞。这种粒度使得对模型定位漏洞能力的评估更加精确,而不局限于函数级的简单二元分类。通过纳入丰富的上下文信息,SecVulEval 为现实场景下的漏洞检测基准树立了新的标准。该基准包含 25,440 个函数样本,涵盖 1999 年至 2024 年 C/C++ 项目中的 5,867 个不同的 CVE。我们采用基于多智能体的方法评估了最先进的 LLM。在我们数据集上的评估表明,这些模型距离准确预测给定函数中的漏洞语句仍相去甚远。表现最好的 Claude-3.7-Sonnet 模型在以正确推理检测漏洞语句方面取得了 23.83% 的 F1 分数。最后,我们分析了 LLM 的输出,并就其在 C/C++ 漏洞检测中的行为提供了见解。


Article 87

Title@2025-05-26 (1): A Python workflow definition for computational materials design

Title: A Python workflow definition for computational materials design Eine Python-Workflow-Definition für die Berechnung von Materialien 计算材料设计中的 Python 工作流程定义 2505.20366v1

Authors: Jan Janssen, Janine George, Julian Geiger, Marnik Bercx, Xing Wang, Christina Ertural, Joerg Schaarschmidt, Alex M. Ganose, Giovanni Pizzi, Tilmann Hickel, Joerg Neugebauer

Numerous Workflow Management Systems (WfMS) have been developed in the field of computational materials science with different workflow formats, hindering interoperability and reproducibility of workflows in the field. To address this challenge, we introduce here the Python Workflow Definition (PWD) as a workflow exchange format to share workflows between Python-based WfMS, currently AiiDA, jobflow, and pyiron. This development is motivated by the similarity of these three Python-based WfMS, which represent the different workflow steps and the data transferred between them as nodes and edges in a graph. With the PWD, we aim to foster interoperability and reproducibility between the different WfMS in the context of Findable, Accessible, Interoperable, Reusable (FAIR) workflows. To separate the scientific from the technical complexity, the PWD consists of three components: (1) a conda environment that specifies the software dependencies, (2) a Python module that contains the Python functions represented as nodes in the workflow graph, and (3) a workflow graph stored in JavaScript Object Notation (JSON). The first version of the PWD supports directed acyclic graph (DAG)-based workflows. Thus, any DAG-based workflow defined in one of the three WfMS can be exported to the PWD and afterwards imported from the PWD to one of the other WfMS. After the import, the input parameters of the workflow can be adjusted and computing resources can be assigned to the workflow before it is executed with the selected WfMS. This import from and export to the PWD is enabled by the PWD Python library that implements the PWD in AiiDA, jobflow, and pyiron.

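在计算材料科学领域,人们已经开发了大量采用不同工作流格式的工作流管理系统(WfMS),这阻碍了该领域工作流的互操作性和可复现性。为应对这一挑战,我们在此引入 Python Workflow Definition(PWD)作为一种工作流交换格式,用于在基于 Python 的 WfMS(目前为 AiiDA、jobflow 和 pyiron)之间共享工作流。这一工作的动机在于这三个基于 Python 的 WfMS 的相似性:它们都把不同的工作流步骤以及在步骤之间传递的数据表示为图中的节点和边。借助 PWD,我们旨在在可查找、可访问、可互操作、可重用(FAIR)工作流的背景下,促进不同 WfMS 之间的互操作性和可复现性。为了将科学复杂性与技术复杂性分离,PWD 由三个部分组成:(1) 指定软件依赖的 conda 环境;(2) 一个 Python 模块,其中包含在工作流图中表示为节点的 Python 函数;(3) 以 JavaScript Object Notation(JSON)存储的工作流图。PWD 的第一个版本支持基于有向无环图(DAG)的工作流。因此,在这三个 WfMS 之一中定义的任何基于 DAG 的工作流都可以导出为 PWD,随后再从 PWD 导入到另一个 WfMS 中。导入之后,可以调整工作流的输入参数并为其分配计算资源,然后再用所选的 WfMS 执行。与 PWD 之间的导入和导出由 PWD Python 库实现,该库在 AiiDA、jobflow 和 pyiron 中实现了 PWD。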
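To make the "Python functions as nodes, JSON graph as wiring" idea from the abstract concrete, here is a small sketch. It is an illustration under stated assumptions, not the actual PWD schema or library: the JSON field names (`nodes`, `edges`, `source`, `target`, `arg`), the `FUNCTIONS` registry, and `run_workflow` are all hypothetical, and the conda environment component is left out.

```python
import json
from graphlib import TopologicalSorter  # standard library, Python 3.9+


# Hypothetical stand-ins for the functions a workflow's Python module provides.
def add(x, y):
    return x + y


def square(x):
    return x * x


FUNCTIONS = {"add": add, "square": square}

# Illustrative JSON layout (NOT the real PWD schema): nodes are inputs or
# function calls; each edge feeds one node's result into a named argument.
WORKFLOW_JSON = """
{
  "nodes": [
    {"id": 0, "type": "input", "value": 2},
    {"id": 1, "type": "input", "value": 3},
    {"id": 2, "type": "function", "name": "add"},
    {"id": 3, "type": "function", "name": "square"}
  ],
  "edges": [
    {"source": 0, "target": 2, "arg": "x"},
    {"source": 1, "target": 2, "arg": "y"},
    {"source": 2, "target": 3, "arg": "x"}
  ]
}
"""


def run_workflow(text):
    """Execute a DAG workflow in dependency order and return all node results."""
    graph = json.loads(text)
    nodes = {n["id"]: n for n in graph["nodes"]}
    # Map every node to the set of nodes it depends on.
    deps = {nid: set() for nid in nodes}
    for edge in graph["edges"]:
        deps[edge["target"]].add(edge["source"])
    results = {}
    for nid in TopologicalSorter(deps).static_order():
        node = nodes[nid]
        if node["type"] == "input":
            results[nid] = node["value"]
        else:
            kwargs = {e["arg"]: results[e["source"]]
                      for e in graph["edges"] if e["target"] == nid}
            results[nid] = FUNCTIONS[node["name"]](**kwargs)
    return results


results = run_workflow(WORKFLOW_JSON)
```

Because the graph is plain data, a different engine could load the same JSON and schedule the same functions, which is the kind of exchange the abstract describes.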


Article 88

Title@2025-05-26 (1): CIDRe: A Reference-Free Multi-Aspect Criterion for Code Comment Quality Measurement

Title: CIDRe: A Reference-Free Multi-Aspect Criterion for Code Comment Quality Measurement CIDRe: Ein referenzfreies Multi-Aspekt-Kriterium für die Qualitätsmessung von Code Comment CIDRe: 守则评论质量衡量的无参考性、无参考性、多特征的多标准标准 2505.19757v1

Authors: Maria Dziuba, Valentin Malykh

Effective generation of structured code comments requires robust quality metrics for dataset curation, yet existing approaches (SIDE, MIDQ, STASIS) suffer from limited code-comment analysis. We propose CIDRe, a language-agnostic reference-free quality criterion combining four synergistic aspects: (1) relevance (code-comment semantic alignment), (2) informativeness (functional coverage), (3) completeness (presence of all structure sections), and (4) description length (detail sufficiency). We validate our criterion on a manually annotated dataset. Experiments demonstrate CIDRe’s superiority over existing metrics, achieving improvement in cross-entropy evaluation. When applied to filter comments, the models finetuned on CIDRe-filtered data show statistically significant quality gains in GPT-4o-mini assessments.

有效生成结构化代码评论要求为数据集校正制定强有力的质量指标,但现有方法(SIDE、MIDQ、STASIS)受到有限的编码分析,我们提议CIDRE,这是一个语言上不可知的无参考质量标准,包含四个协同方面:(1) 相关性(代码-语义一致)、(2) 信息性(功能覆盖)、(3) 完整性(所有结构部分的存在)和(4) 描述长度(详细充分性),我们验证了我们关于人工附加说明数据集的标准。实验表明CIDRE优于现有指标,实现了跨热带评价的改进。当应用到过滤评论时,对CIDR-Refilter数据进行微调的模型显示GPT-4o-mini评估在统计上取得了显著的质量收益。


Article 89

Title@2025-05-26 (1): RDFGraphGen: An RDF Graph Generator based on SHACL Shapes

Title: RDFGraphGen: An RDF Graph Generator based on SHACL Shapes RDFGraphGen: Ein RDF Graph Generator auf Basis von SHACL Shapes RDFGraphGen:基于 SHACL 形状的 RDF 图形生成器 2407.17941v2

Authors: Milos Jovanovik, Marija Vecovska, Maxime Jakubowski, Katja Hose

Developing and testing modern RDF-based applications often requires access to RDF datasets with certain characteristics. Unfortunately, it is very difficult to publicly find domain-specific knowledge graphs that conform to a particular set of characteristics. Hence, in this paper we propose RDFGraphGen, an open-source RDF graph generator that uses characteristics provided in the form of SHACL (Shapes Constraint Language) shapes to generate synthetic RDF graphs. RDFGraphGen is domain-agnostic, with configurable graph structure, value constraints, and distributions. It also comes with a number of predefined values for popular schema.org classes and properties, for more realistic graphs. Our results show that RDFGraphGen is scalable and can generate small, medium, and large RDF graphs in any domain.

开发和测试基于RDF的现代应用往往需要获得具有某些特点的RDF数据集,不幸的是,很难公开找到符合特定特征的域特定知识图,因此,在本文中,我们提议了RDFGraphGen,这是一个开源的RDF图形生成器,它使用以SHACL(Shapes Constrainint语言)形状提供的特性生成合成RDF图。RDFGen是域-Agnistic,具有可配置的图形结构、价值限制和分布。它还包含一些用于流行 schema.org 等级和属性的预设值,用于更现实的图形。我们的结果表明,RDFGraphen是可缩放的,可以在任何领域生成中、中、大RDF图形。


Article 90

Title@2025-05-26 (1): SETBVE: Quality-Diversity Driven Exploration of Software Boundary Behaviors

Title: SETBVE: Quality-Diversity Driven Exploration of Software Boundary Behaviors SETBVE: Qualität-Diversität treibt die Erforschung von Software-Grenzverhalten an SETBVE: 软件边界行为的质量-多样性驱动探索 2505.19736v1

Authors: Sabinakhon Akbarova, Felix Dobslaw, Francisco Gomes de Oliveira Neto, Robert Feldt

Software systems exhibit distinct behaviors based on input characteristics, and failures often occur at the boundaries between input domains. Traditional Boundary Value Analysis (BVA) relies on manual heuristics, while automated Boundary Value Exploration (BVE) methods typically optimize a single quality metric, risking a narrow and incomplete survey of boundary behaviors. We introduce SETBVE, a customizable, modular framework for automated black-box BVE that leverages Quality-Diversity (QD) optimization to systematically uncover and refine a broader spectrum of boundaries. SETBVE maintains an archive of boundary pairs organized by input- and output-based behavioral descriptors. It steers exploration toward underrepresented regions while preserving high-quality boundary pairs and applies local search to refine candidate boundaries. In experiments with ten integer-based functions, SETBVE outperforms the baseline in diversity, boosting archive coverage by 37 to 82 percentage points. A qualitative analysis reveals that SETBVE identifies boundary candidates the baseline misses. While the baseline method typically plateaus in both diversity and quality after 30 seconds, SETBVE continues to improve in 600-second runs, demonstrating better scalability. Even the simplest SETBVE configurations perform well in identifying diverse boundary behaviors. Our findings indicate that balancing quality with behavioral diversity can help identify more software edge-case behaviors than quality-focused approaches.

软件系统会根据输入特征表现出不同的行为,而故障往往发生在输入域之间的边界上。传统的边界值分析(BVA)依赖人工启发式方法,而自动化的边界值探索(BVE)方法通常只优化单一质量指标,有可能只对边界行为进行狭窄而不完整的考察。我们提出 SETBVE,这是一个可定制的模块化自动黑盒 BVE 框架,它利用质量-多样性(QD)优化来系统地发现并细化更广泛的边界。SETBVE 维护一个边界对存档,该存档按基于输入和输出的行为描述符组织。它将探索引向代表性不足的区域,同时保留高质量的边界对,并应用局部搜索来细化候选边界。在针对十个基于整数的函数的实验中,SETBVE 在多样性方面优于基线,将存档覆盖率提高了 37 至 82 个百分点。定性分析表明,SETBVE 能够识别基线遗漏的边界候选。基线方法通常在 30 秒后在多样性和质量上都趋于停滞,而 SETBVE 在 600 秒的运行中持续改进,展现出更好的可扩展性。即使是最简单的 SETBVE 配置,在识别多样化边界行为方面也表现良好。我们的研究结果表明,与只关注质量的方法相比,在质量与行为多样性之间取得平衡有助于发现更多的软件边缘情况行为。
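The archive of boundary pairs keyed by behavioral descriptors, together with local refinement, can be sketched as follows. This is a toy illustration and not the SETBVE framework: `classify` is a hypothetical system under test, the descriptor and quality functions are simple assumptions, and a deterministic sweep stands in for the Quality-Diversity search.

```python
def classify(x):
    """Toy system under test (hypothetical): behaviour changes at 0 and 100."""
    if x < 0:
        return "negative"
    if x < 100:
        return "small"
    return "large"


def descriptor(a, b):
    """Output-based descriptor: which two behaviours the pair separates."""
    return (classify(a), classify(b))


def quality(a, b):
    """Closer inputs with different outputs give a sharper boundary."""
    return 1.0 / abs(a - b)


def explore(lo, hi, step):
    """Sweep the input range, refine each candidate, keep the best pair per cell."""
    archive = {}
    for a in range(lo, hi, step):
        b = a + step
        if classify(a) == classify(b):
            continue  # no behavioural change between these two inputs
        # Local search: bisect while the two ends still behave differently.
        while b - a > 1:
            mid = (a + b) // 2
            if classify(mid) == classify(a):
                a = mid
            else:
                b = mid
        cell = descriptor(a, b)
        q = quality(a, b)
        # Keep only the highest-quality pair seen so far for this cell.
        if cell not in archive or q > archive[cell][0]:
            archive[cell] = (q, (a, b))
    return archive


archive = explore(-50, 150, 25)
```

Keeping one best pair per descriptor cell is what lets an archive of this kind report several distinct boundaries instead of many near-copies of the sharpest one.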


Article 91

Title@2025-05-26 (1): Large Language Models in Code Co-generation for Safe Autonomous Vehicles

Title: Large Language Models in Code Co-generation for Safe Autonomous Vehicles Große Sprachmodelle in der Kogeneration Code für sichere autonome Fahrzeuge 安全自治车辆代码共同生成大语言模式 2505.19658v1

Authors: Ali Nouri, Beatriz Cabrero-Daniel, Zhennan Fei, Krishna Ronanki, Håkan Sivencrona, Christian Berger

Software engineers in various industrial domains are already using Large Language Models (LLMs) to accelerate the process of implementing parts of software systems. When considering its potential use for ADAS or AD systems in the automotive context, there is a need to systematically assess this new setup: LLMs entail a well-documented set of risks for safety-related systems’ development due to their stochastic nature. To reduce the effort for code reviewers to evaluate LLM-generated code, we propose an evaluation pipeline to conduct sanity-checks on the generated code. We compare the performance of six state-of-the-art LLMs (CodeLlama, CodeGemma, DeepSeek-r1, DeepSeek-Coders, Mistral, and GPT-4) on four safety-related programming tasks. Additionally, we qualitatively analyse the most frequent faults generated by these LLMs, creating a failure-mode catalogue to support human reviewers. Finally, the limitations and capabilities of LLMs in code generation, and the use of the proposed pipeline in the existing process, are discussed.

各工业领域的软件工程师已经在使用大语言模型来加速软件系统实施部分内容的进程。在考虑其在汽车背景下对ADAS或AD系统的潜在用途时,有必要系统评估这一新设置:LMS因其随机性,对与安全有关的系统开发带来一系列有详细记录的风险。为减少代码审查员评价LLM生成的代码的努力,我们提议建立一个评价管道,对生成的代码进行理智检查。我们比较了六种最先进的LMS(CodeLlama、CodeGemma、DeepSeek-r1、DeepSeek-Coders、Mistral和GPT-4)在四项与安全有关的方案编制任务方面的绩效。此外,我们从质量上分析这些LMS最常出现的错误,制造出一个失败模式目录来支持人类审评员。最后,我们讨论了LMS在代码生成方面的局限性和能力,以及现有流程中拟议的管道的使用。


Article 92

Title@2025-05-26 (1): Software Engineering for Self-Adaptive Robotics: A Research Agenda

Title: Software Engineering for Self-Adaptive Robotics: A Research Agenda Software-Engineering für selbstadaptive Robotik: Eine Forschungsagenda 自我适应机器人学软件工程:研究议程 2505.19629v1

Authors: Shaukat Ali, Ana Cavalcanti, Cláudio Ângelo Gonçalves Gomes, Peter Gorm Larsen, Hassan Sartaj, Anastasios Tefas, Jim Woodcock, Houxiang Zhang

Self-adaptive robotic systems are designed to operate autonomously in dynamic and uncertain environments, requiring robust mechanisms to monitor, analyse, and adapt their behaviour in real-time. Unlike traditional robotic software, which follows predefined logic, self-adaptive robots leverage artificial intelligence, machine learning, and model-driven engineering to continuously adjust to changing operational conditions while ensuring reliability, safety, and performance. This paper presents a research agenda for software engineering in self-adaptive robotics, addressing critical challenges across two key dimensions: (1) the development phase, including requirements engineering, software design, co-simulation, and testing methodologies tailored to adaptive robotic systems, and (2) key enabling technologies, such as digital twins, model-driven engineering, and AI-driven adaptation, which facilitate runtime monitoring, fault detection, and automated decision-making. We discuss open research challenges, including verifying adaptive behaviours under uncertainty, balancing trade-offs between adaptability, performance, and safety, and integrating self-adaptation frameworks like MAPE-K. By providing a structured roadmap, this work aims to advance the software engineering foundations for self-adaptive robotic systems, ensuring they remain trustworthy, efficient, and capable of handling real-world complexities.

自适应机器人系统的设计目的是在动态和不确定的环境中自主运作,需要强有力的机制来监测、分析和调整其实时行为。与遵循预先定义逻辑的传统机器人软件不同,自适应机器人软件利用人工智能、机器学习和模型驱动的工程来不断适应不断变化的操作条件,同时确保可靠性、安全和性能。本文件提出了自适应机器人系统软件工程的研究议程,解决了两个关键方面的关键挑战:(1)开发阶段,包括适合适应性机器人系统的要求工程、软件设计、共同模拟和测试方法;(2)关键的赋能技术,如数字双胞胎、模型驱动的工程和AI驱动的适应,这些技术有助于运行时间监测、发现故障和自动决策。我们讨论了开放的研究挑战,包括核实不确定性下的适应行为,平衡适应性、性、性能和安全之间的取舍,以及整合像MAPE-K这样的自我适应框架。通过提供结构化的路线图,这项工作旨在推进自我适应性机器人系统软件工程基础,确保它们保持可信赖、高效和能够处理现实世界复杂性。


Article 93

Title@2025-05-26 (1): Search-Based Software Engineering in the Landscape of AI Foundation Models

Title: Search-Based Software Engineering in the Landscape of AI Foundation Models Search-Based Software Engineering in der Landschaft der AI-Stiftung Modelle AI基金会模型景观中的搜索软件工程 2505.19625v1

Authors: Hassan Sartaj, Shaukat Ali

Search-based software engineering (SBSE), at the intersection of artificial intelligence (AI) and software engineering, has been an active area of research for about 25 years. It has been applied to solve numerous problems across the entire software engineering lifecycle and has demonstrated its versatility in multiple domains. With the recent advancements in AI, particularly the emergence of foundation models (FMs), the evolution of SBSE alongside FMs remains undetermined. In this window of opportunity, we propose a research roadmap that articulates the current landscape of SBSE in relation to foundation models (FMs), highlights open challenges, and outlines potential research directions for advancing SBSE through its interplay with FMs. This roadmap aims to establish a forward-thinking and innovative perspective for the future of SBSE in the era of FMs.

25年来,人工智能和软件工程交汇处的基于搜索的软件工程(SBSE)一直是一个活跃的研究领域,用于解决整个软件工程生命周期的许多问题,并展示了它在多个领域的多功能性。随着最近AI的进步,特别是基础模型的出现,SBSE与调频的演变仍未确定。在这个机会之窗中,我们提出了一个研究路线图,阐明SBSE目前与基础模型(FMs)的关系,突出公开的挑战,并概述通过FMs互动推进SBSE的潜在研究方向。该路线图旨在为SBSE在调频时代的未来建立前瞻性和创新的视角。


Article 94

Title@2025-05-26 (1): LEGO-Compiler: Enhancing Neural Compilation Through Translation Composability

Title: LEGO-Compiler: Enhancing Neural Compilation Through Translation Composability LEGO-Compiler: Neurale Kompilierung durch Übersetzungskompatibilität verbessern LEGO-Compiler:通过翻译集成加强神经汇编 2505.20356v1

Authors: Shuoming Zhang, Jiacheng Zhao, Chunwei Xia, Zheng Wang, Yunji Chen, Xiaobing Feng, Huimin Cui

Large language models (LLMs) have the potential to revolutionize how we design and implement compilers and code translation tools. However, existing LLMs struggle to handle long and complex programs. We introduce LEGO-Compiler, a novel neural compilation system that leverages LLMs to translate high-level languages into assembly code. Our approach centers on three key innovations: LEGO translation, which decomposes the input program into manageable blocks; a verifiable LLM workflow, which breaks the complex compilation process into smaller, simpler steps that external tests can verify; and a feedback mechanism for self-correction. Supported by formal proofs of translation composability, LEGO-Compiler demonstrates high accuracy on multiple datasets, including over 99% on ExeBench and 97.9% on industrial-grade AnsiBench. Additionally, LEGO-Compiler achieves nearly an order-of-magnitude improvement in compilable code size scalability. This work opens new avenues for applying LLMs to system-level tasks, complementing traditional compiler technologies.

大型语言模型(LLMS)有可能革命我们如何设计和实施汇编者和代码翻译工具。然而,现有的LLMS努力处理长期和复杂的程序。我们引入了LEGO-Compiler,这是一个新的神经编译系统,利用LLLMS将高层次语言翻译成组装代码。我们的方法以三大创新为中心:LEGO翻译,将输入程序分解成可控区块;通过外部测试将复杂的汇编进程组织成可核查的LLM工作流程,从而将复杂的汇编进程分为更小、更简单的可核查的步骤;以及自我校正回馈机制。在翻译可复性的正式证明的支持下,LEGO-Compiler在多个数据集上表现出高度的准确性,包括ExeBench99%以上的数据,AnsiBench工业级的97.9%的数据。此外,LEGO-Compiler在可比较的编码大小缩放性上也接近一个顺序的磁性改进。这项工作开辟了新的途径,将LLMSMs应用于系统层面的任务,补充传统的编译技术。


Article 95

Title@2025-05-26 (1): CODE-DITING: A Reasoning-Based Metric for Functional Alignment in Code Evaluation

Title: CODE-DITING: A Reasoning-Based Metric for Functional Alignment in Code Evaluation CODE-DITING: Ein auf Vernunft basierendes Metric für die funktionelle Ausrichtung in der Code-Evaluation 代码化:守则评价中功能一致性的基于理由的计量标准 2505.19502v1

Authors: Guang Yang, Yu Zhou, Xiang Chen, Wei Zheng, Xing Hu, Xin Zhou, David Lo, Taolue Chen

Trustworthy evaluation methods for code snippets play a crucial role in neural code generation. Traditional methods, which either rely on reference solutions or require executable test cases, have inherent limitations in flexibility and scalability. The recent LLM-as-Judge methodology offers a promising alternative by directly evaluating functional consistency between the problem description and the generated code. To systematically understand the landscape of these LLM-as-Judge methods, we conduct a comprehensive empirical study across three diverse datasets. Our investigation reveals the pros and cons of two categories of LLM-as-Judge methods: the methods based on general foundation models can achieve good performance but require complex prompts and lack explainability, while the methods based on reasoning foundation models provide better explainability with simpler prompts but demand substantial computational resources due to their large parameter sizes. To address these limitations, we propose CODE-DITING, a novel code evaluation method that balances accuracy, efficiency and explainability. We develop a data distillation framework that effectively transfers reasoning capabilities from DeepSeek-R1 671B to our CODE-DITING 1.5B and 7B models, significantly enhancing evaluation explainability and reducing the computational cost. With the majority vote strategy in the inference process, CODE-DITING 1.5B outperforms all models of the same parameter magnitude and achieves performance normally seen in models with five times the parameter scale. CODE-DITING 7B surpasses GPT-4o and DeepSeek-V3 671B, even though it uses only 1% of the parameter volume of these large models. Further experiments show that CODE-DITING is robust to preference leakage and can serve as a promising alternative for code evaluation.

可信的代码片段评估方法在神经代码生成中起着至关重要的作用。传统方法要么依赖参考解,要么需要可执行的测试用例,在灵活性和可扩展性方面存在固有局限。近期的 LLM-as-Judge 方法通过直接评估问题描述与生成代码之间的功能一致性,提供了一种有前景的替代方案。为了系统地了解这些 LLM-as-Judge 方法的现状,我们在三个不同的数据集上开展了全面的实证研究。我们的研究揭示了两类 LLM-as-Judge 方法的优缺点:基于通用基础模型的方法可以取得良好的性能,但需要复杂的提示且缺乏可解释性;而基于推理基础模型的方法能够以更简单的提示提供更好的可解释性,但由于参数规模庞大而需要大量计算资源。为了解决这些局限,我们提出 CODE-DITING,这是一种在准确性、效率和可解释性之间取得平衡的新型代码评估方法。我们开发了一个数据蒸馏框架,将 DeepSeek-R1 671B 的推理能力有效迁移到我们的 CODE-DITING 1.5B 和 7B 模型上,显著增强了评估的可解释性并降低了计算成本。在推理过程中采用多数投票策略后,CODE-DITING 1.5B 优于所有同等参数量级的模型,并达到了通常需要 5 倍参数规模的模型才能达到的性能。CODE-DITING 7B 超过了 GPT-4o 和 DeepSeek-V3 671B,尽管它只使用了这些大模型 1% 的参数量。进一步的实验表明,CODE-DITING 对偏好泄漏具有鲁棒性,可以作为代码评估的一种有前景的替代方案。
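The abstract refers to a majority vote strategy at inference time. A minimal sketch of that general technique follows, assuming the judge model is sampled several times and each sample yields one verdict label; the labels and the `majority_vote` helper are hypothetical and do not come from the paper.

```python
from collections import Counter


def majority_vote(verdicts):
    """Return the most frequent verdict and the share of samples agreeing with it.

    Ties resolve to the label seen first, so an odd number of samples is the
    safer choice for a two-label judgement.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    label, count = Counter(verdicts).most_common(1)[0]
    return label, count / len(verdicts)


# Hypothetical verdicts from five independent samples of a judge model.
samples = ["correct", "incorrect", "correct", "correct", "incorrect"]
label, agreement = majority_vote(samples)
```

The agreement ratio can double as a rough confidence signal: a 3-of-5 verdict deserves more scrutiny than a unanimous one.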


Article 96

Title@2025-05-26 (1): Benchmarking and Enhancing LLM Agents in Localizing Linux Kernel Bugs

Title: Benchmarking and Enhancing LLM Agents in Localizing Linux Kernel Bugs Benchmarking und Verbesserung von LLM-Agenten bei der Lokalisierung von Linux-Kernel-Fehlern 确定和加强Linux内核虫本地化的Linux Kernel 虫的基准和加强LLM代理物 2505.19489v1

Authors: Zhenhao Zhou, Zhuochen Huang, Yike He, Chong Wang, Jiajun Wang, Yijian Wu, Xin Peng, Yiling Lou

The Linux kernel is a critical system, serving as the foundation for numerous systems. Bugs in the Linux kernel can cause serious consequences, affecting billions of users. Fault localization (FL), which aims at identifying the buggy code elements in software, plays an essential role in software quality assurance. While recent LLM agents have achieved promising accuracy in FL on recent benchmarks like SWE-bench, it remains unclear how well these methods perform in the Linux kernel, where FL is much more challenging due to the large-scale code base, limited observability, and diverse impact factors. In this paper, we introduce LinuxFLBench, an FL benchmark constructed from real-world Linux kernel bugs. We conduct an empirical study to assess the performance of state-of-the-art LLM agents on the Linux kernel. Our initial results reveal that existing agents struggle with this task, achieving a best top-1 accuracy of only 41.6% at file level. To address this challenge, we propose LinuxFL$^+$, an enhancement framework designed to improve the FL effectiveness of LLM agents for the Linux kernel. LinuxFL$^+$ substantially improves the FL accuracy of all studied agents (e.g., an accuracy increase of 7.2% to 11.2%) at minimal cost. Data and code are available at https://github.com/FudanSELab/LinuxFLBench.

Linux 内核是一个关键系统,是众多系统的基础。Linux 内核中的错误会造成严重的后果,影响数十亿用户。用于识别软件中错误代码元素的故障定位(FL)在软件质量保证方面起着关键作用。最近的LLM智能体在SWE-bench等近期基准上的FL任务中取得了很有希望的准确性,但目前仍不清楚这些方法在Linux 内核中的表现如何,因为代码库规模庞大、可观测性有限且影响因素多样,Linux 内核中的FL更具挑战性。在本文中,我们引入了LinuxFLBench,这是一个基于真实Linux 内核错误构建的FL基准。我们进行了一项实证研究,以评估最先进的LLM智能体在Linux 内核上的表现。初步结果显示,现有智能体难以胜任这项任务,文件级的最佳top-1准确率仅为41.6%。为了应对这一挑战,我们提出了LinuxFL$^+$,这是一个旨在提高LLM智能体在Linux 内核上FL效果的增强框架。LinuxFL$^+$以极低的成本显著提高了所有被研究智能体的FL准确率(例如,准确率提高7.2%至11.2%)。数据和代码可在 https://github.com/FudanSELab/LinuxFLBench 获取。
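
To make the file-level top-1 accuracy figure in the abstract above concrete, here is a minimal Python sketch of a top-k accuracy computation. The function name, the input layout, and the example file paths are hypothetical; the benchmark's own evaluation scripts may define the metric differently (for instance, when a bug touches several files).

```python
def top_k_accuracy(ranked_files, buggy_files, k=1):
    """Fraction of bugs whose top-k ranked files contain a truly buggy file.

    ranked_files: one ranked list of file paths per bug (best guess first).
    buggy_files:  one collection of ground-truth buggy paths per bug.
    Assumption: a bug counts as localized if ANY buggy file is in the top k.
    """
    if len(ranked_files) != len(buggy_files):
        raise ValueError("rankings and ground truth must align per bug")
    if not ranked_files:
        return 0.0
    hits = sum(
        1
        for ranking, truth in zip(ranked_files, buggy_files)
        if set(ranking[:k]) & set(truth)
    )
    return hits / len(ranked_files)


# Illustrative (made-up) agent output for two bugs.
rankings = [
    ["drivers/a.c", "net/b.c"],
    ["fs/c.c", "mm/d.c"],
]
truth = [{"drivers/a.c"}, {"mm/d.c"}]
print(top_k_accuracy(rankings, truth, k=1))
print(top_k_accuracy(rankings, truth, k=2))
```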


Article 97

Title@2025-05-26 (1): Regulating Algorithmic Management: A Multi-Stakeholder Study of Challenges in Aligning Software and the Law for Workplace Scheduling

Title: Regulating Algorithmic Management: A Multi-Stakeholder Study of Challenges in Aligning Software and the Law for Workplace Scheduling Regulierung des algorithmischen Managements: Eine Multi-Stakeholder-Studie über Herausforderungen bei der Ausrichtung von Software und dem Gesetz für die Arbeitsplanung 规范工资管理:多方利益攸关方研究软件和工作场所时间安排法在调整软件和工作场所时间安排法方面面临的挑战 2505.02329v2

Authors: Jonathan Lynn, Rachel Y. Kim, Sicun Gao, Daniel Schneider, Sachin S. Pandya, Min Kyung Lee

Algorithmic management (AM)’s impact on worker well-being has led to calls for regulation. However, little is known about the effectiveness and challenges in real-world AM regulation across the regulatory process – rule operationalization, software use, and enforcement. Our multi-stakeholder study addresses this gap within workplace scheduling, one of the few AM domains with implemented regulations. We interviewed 38 stakeholders across the regulatory process: regulators, defense attorneys, worker advocates, managers, and workers. Our findings suggest that the efficacy of AM regulation is influenced by: (i) institutional constraints that challenge efforts to encode law into AM software, (ii) on-the-ground use of AM software that shapes its ability to facilitate compliance, (iii) mismatches between software and regulatory contexts that hinder enforcement, and (iv) unique concerns that software introduces when used to regulate AM. These findings underscore the importance of a sociotechnical approach to AM regulation, which considers organizational and collaborative contexts alongside the inherent attributes of software. We offer future research directions and implications for technology policy and design.

分析管理(AM)对工人福祉的影响导致要求监管。然而,对于现实世界的AM监管在整个监管过程中的有效性和挑战知之甚少。我们多方利益攸关方的研究涉及工作场所时间安排中的这一差距,这是少数有实施监管的AM领域之一。我们采访了整个监管过程中的38个利益攸关方:监管者、辩护律师、工人律师、管理人员和工人。我们的调查结果表明,AM监管的效力受到以下因素的影响:(一) 机构制约,对将法律纳入AM软件的努力构成挑战;(二) 影响其促进合规能力的AM软件的实地使用;(三) 软件与监管环境之间的不匹配,阻碍执法;(四) 软件在监管AM时引入的独特关切。这些研究结果强调了对AM监管采取社会技术方法的重要性,该方法考虑到组织和协作环境以及软件的固有属性。我们为技术政策和设计提供了未来的研究方向和影响。


Article 98

Title@2025-05-26 (1): Vibe Coding vs. Agentic Coding: Fundamentals and Practical Implications of Agentic AI

Title: Vibe Coding vs. Agentic Coding: Fundamentals and Practical Implications of Agentic AI Vibe Coding vs. Agentic Coding: Grundlagen und praktische Implikationen von Agentic AI Vibe 编码与 Agentic 编码:Agent AI 的基本要素和实际影响 2505.19443v1

Authors: Ranjan Sapkota, Konstantinos I. Roumeliotis, Manoj Karkee

This review presents a comprehensive analysis of two emerging paradigms in AI-assisted software development: vibe coding and agentic coding. While both leverage large language models (LLMs), they differ fundamentally in autonomy, architectural design, and the role of the developer. Vibe coding emphasizes intuitive, human-in-the-loop interaction through prompt-based, conversational workflows that support ideation, experimentation, and creative exploration. In contrast, agentic coding enables autonomous software development through goal-driven agents capable of planning, executing, testing, and iterating tasks with minimal human intervention. We propose a detailed taxonomy spanning conceptual foundations, execution models, feedback loops, safety mechanisms, debugging strategies, and real-world tool ecosystems. Through comparative workflow analysis and 20 detailed use cases, we illustrate how vibe systems thrive in early-stage prototyping and education, while agentic systems excel in enterprise-grade automation, codebase refactoring, and CI/CD integration. We further examine emerging trends in hybrid architectures, where natural language interfaces are coupled with autonomous execution pipelines. Finally, we articulate a future roadmap for agentic AI, outlining the infrastructure needed for trustworthy, explainable, and collaborative systems. Our findings suggest that successful AI software engineering will rely not on choosing one paradigm, but on harmonizing their strengths within a unified, human-centered development lifecycle.

本综述对AI辅助软件开发的两个新兴范式进行了全面分析:vibe 编码和 agentic 编码。虽然两者都利用了大型语言模型(LLM),但它们在自主性、架构设计和开发者的作用方面有着根本的不同。Vibe 编码强调通过基于提示的对话式工作流程进行直观的、人在回路中的交互,以支持构思、实验和创造性探索。相比之下,agentic 编码通过目标驱动的智能体实现自主软件开发,这些智能体能够以最低限度的人类干预来规划、执行、测试和迭代任务。我们提出了涵盖概念基础、执行模型、反馈循环、安全机制、调试策略和现实世界工具生态系统的详细分类法。通过比较工作流程分析和20个详细的使用案例,我们说明了vibe 系统如何在早期原型设计和教育中发挥优势,而agentic 系统则在企业级自动化、代码库重构和CI/CD集成方面表现出色。我们进一步考察了混合架构中新出现的趋势,其中自然语言界面与自主执行管道相结合。最后,我们阐述了agentic AI的未来路线图,概述了构建可信赖、可解释和协作系统所需的基础设施。我们的研究结果表明,成功的AI软件工程不在于选择某一种范式,而在于在统一的、以人为本的开发生命周期中协调两者的优势。


Article 99

Title@2025-05-26 (1): Simple and Effective Baselines for Code Summarisation Evaluation

Title: Simple and Effective Baselines for Code Summarisation Evaluation Einfache und effektive Grundlagen für die Code-Summarisation-Bewertung 用于代码摘要评价的简单有效基线 2505.19392v1

Authors: Jade Robinson, Jonathan K. Kummerfeld

Code documentation is useful, but writing it is time-consuming. Different techniques for generating code summaries have emerged, but comparing them is difficult because human evaluation is expensive and automatic metrics are unreliable. In this paper, we introduce a simple new baseline in which we ask an LLM to give an overall score to a summary. Unlike n-gram and embedding-based baselines, our approach is able to consider the code when giving a score. This also allows us to make a variant that does not consider the reference summary at all, which could be used for other tasks, e.g., to evaluate the quality of documentation in code bases. We find that our method is as good as or better than prior metrics, though we recommend using it in conjunction with embedding-based methods to avoid the risk of LLM-specific bias.

代码文档是有用的, 但写它很费时。 生成代码摘要的不同技术已经出现, 但比较它们很困难, 因为人类评估费用昂贵, 自动测量标准不可靠 。 在本文中, 我们引入了一个简单的新基准, 我们要求一个LLM 来给摘要做一个总分。 与 n 和 嵌入基准不同, 我们的方法可以在给分时考虑代码 。 这让我们也可以做出一个完全不考虑参考摘要的变量, 它可以用于其他任务, 比如用于评估代码基础文件的质量。 我们发现我们的方法比以前的衡量标准好或好, 尽管我们建议使用它与嵌入基准方法一起来避免 LLM 特定偏差的风险 。


Article 100

Title@2025-05-25 (7): Architectures of Error: A Philosophical Inquiry into AI and Human Code Generation

Title: Architectures of Error: A Philosophical Inquiry into AI and Human Code Generation Architekturen des Irrtums: Eine philosophische Untersuchung der KI- und menschlichen Code-Generation 错误结构结构:对大赦国际和人类代码生成的哲学调查 2505.19353v1

Authors: Camilo Chacón Sartori

With the rise of generative AI (GenAI), Large Language Models are increasingly employed for code generation, becoming active co-authors alongside human programmers. Focusing specifically on this application domain, this paper articulates distinct “Architectures of Error” to ground an epistemic distinction between human and machine code generation. Examined through their shared vulnerability to error, this distinction reveals fundamentally different causal origins: human-cognitive versus artificial-stochastic. To develop this framework and substantiate the distinction, the analysis draws critically upon Dennett’s mechanistic functionalism and Rescher’s methodological pragmatism. I argue that a systematic differentiation of these error profiles raises critical philosophical questions concerning semantic coherence, security robustness, epistemic limits, and control mechanisms in human-AI collaborative software development. The paper also utilizes Floridi’s levels of abstraction to provide a nuanced understanding of how these error dimensions interact and may evolve with technological advancements. This analysis aims to offer philosophers a structured framework for understanding GenAI’s unique epistemological challenges, shaped by these architectural foundations, while also providing software engineers a basis for more critically informed engagement.

随着基因化的AI(GenAI)的兴起,大语言模型越来越多地被用于代码生成,成为人类程序设计员的积极共同作者。本文件具体侧重于这一应用领域,阐述了独特的“错误结构”以在人类和机器代码生成中进行认知性区分。通过共同的易出错脆弱性,审视了这一区别揭示出根本不同的因果来源:人类认知性与人工切除性。为了发展这一框架并证实这一区别,分析以登内特的机械功能学和Rescher的方法实用性为关键依据。我认为,对这些错误特征的系统区分提出了关键哲学问题,涉及语义一致性、安全性强健健、认知性限制以及人类-AI合作软件开发中的控制机制。文件还利用Floridi的抽象程度,对这些错误维度如何相互作用和可能随着技术进步而演进提供细致的了解。这一分析旨在为哲学家提供一个结构化的框架,以了解由这些建筑基础形成的GenAI的独特缩略论挑战,同时提供更精确的软件工程师参与基础。


Article 101

Title@2025-05-25 (7): Retrieval-Augmented Generation for Service Discovery: Chunking Strategies and Benchmarking

Title: Retrieval-Augmented Generation for Service Discovery: Chunking Strategies and Benchmarking Retrieval-Augmented Generation for Service Discovery: Chunking Strategien und Benchmarking 服务发现回收-启动型服务生成:启动战略和基准制定 2505.19310v1

Authors: Robin D. Pesl, Jerin G. Mathew, Massimo Mecella, Marco Aiello

Integrating multiple (sub-)systems is essential to create advanced Information Systems. Difficulties mainly arise when integrating dynamic environments, e.g., the integration at design time of not yet existing services. This has been traditionally addressed using a registry that provides the API documentation of the endpoints. Large Language Models have been shown to be capable of automatically creating system integrations (e.g., as service composition) based on this documentation but require concise input due to input token limitations, especially regarding comprehensive API descriptions. Currently, it is unknown how best to preprocess these API descriptions. In the present work, we (i) analyze the usage of Retrieval Augmented Generation (RAG) for endpoint discovery and the chunking, i.e., preprocessing, of state-of-practice OpenAPIs to reduce the input token length while preserving the most relevant information. To further reduce the input token length for the composition prompt and improve endpoint retrieval, we propose (ii) a Discovery Agent that only receives a summary of the most relevant endpoints and retrieves specification details on demand. We evaluate RAG for endpoint discovery using (iii) a proposed novel service discovery benchmark, SOCBench-D, representing a general setting across numerous domains, and the real-world RestBench benchmark, first, for the different chunking possibilities and parameters, measuring the endpoint retrieval accuracy. Then, we assess the Discovery Agent using the same test data set. The prototype shows how to successfully employ RAG for endpoint discovery to reduce the token count. Our experiments show that endpoint-based approaches outperform naive chunking methods for preprocessing. Relying on an agent significantly improves precision while being prone to decrease recall, disclosing the need for further reasoning capabilities.

整合多个(子)系统对于创建高级信息系统至关重要。困难主要出现在整合动态环境时,例如在设计时整合尚不存在的服务。传统上,这一问题通过提供端点API文档的注册中心来解决。大语言模型已被证明能够基于这些文档自动创建系统集成(例如服务组合),但由于输入token限制,需要简洁的输入,对于完整的API描述尤其如此。目前尚不清楚如何最好地预处理这些API描述。在本工作中,我们(一)分析检索增强生成(RAG)在端点发现中的使用,以及对实践中的OpenAPI进行分块(即预处理),以在保留最相关信息的同时缩短输入token长度。为了进一步缩短组合提示的输入token长度并改进端点检索,我们提出(二)一个发现智能体(Discovery Agent),它只接收最相关端点的摘要,并按需检索规范细节。我们使用(三)新提出的服务发现基准SOCBench-D(代表跨多个领域的通用场景)和真实世界的RestBench基准来评估用于端点发现的RAG,首先针对不同的分块方式和参数衡量端点检索准确率。然后,我们使用同一测试数据集评估发现智能体。原型展示了如何成功地将RAG用于端点发现以减少token数量。实验表明,基于端点的方法在预处理方面优于朴素的分块方法。依赖智能体可显著提高精确率,但容易降低召回率,这表明需要进一步的推理能力。
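
The endpoint-based chunking idea from the abstract above can be sketched as follows: instead of cutting an OpenAPI document at arbitrary character offsets, emit one retrieval chunk per operation (HTTP method plus path). This is a minimal sketch under stated assumptions, not the authors' implementation; the function name `chunk_by_endpoint`, the chunk layout, and the toy specification are hypothetical.

```python
# HTTP methods that may appear as operation keys in an OpenAPI path item.
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "options", "head", "trace"}


def chunk_by_endpoint(spec):
    """Split a parsed OpenAPI document (a dict) into one chunk per endpoint.

    Assumption: each chunk keeps only the method, path, summary, and
    description; a real system might also include parameters or schemas.
    """
    chunks = []
    for path, item in spec.get("paths", {}).items():
        for method, operation in item.items():
            if method.lower() not in HTTP_METHODS:
                continue  # skip path-level keys such as "parameters"
            summary = operation.get("summary", "")
            description = operation.get("description", "")
            text = f"{method.upper()} {path}\n{summary}\n{description}".strip()
            chunks.append({"id": f"{method.upper()} {path}", "text": text})
    return chunks


# Toy specification used only for illustration.
toy_spec = {
    "openapi": "3.0.0",
    "paths": {
        "/orders": {
            "get": {"summary": "List orders"},
            "post": {"summary": "Create an order", "description": "Needs auth."},
            "parameters": [],
        }
    },
}
for chunk in chunk_by_endpoint(toy_spec):
    print(chunk["id"])
```

Each chunk can then be embedded and indexed separately, so a query retrieves whole endpoints rather than fragments that straddle two operations.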


Article 102

Title@2025-05-25 (7): VerifyThisBench: Generating Code, Specifications, and Proofs All at Once

Title: VerifyThisBench: Generating Code, Specifications, and Proofs All at Once VerifyThisBench: Code, Spezifikationen und Beweise auf einmal generieren 校验时间: 生成代码、规格和证明 2505.19271v1

Authors: Xun Deng, Sicheng Zhong, Andreas Veneris, Fan Long, Xujie Si

Large language models (LLMs) have demonstrated remarkable progress in code generation, but many existing benchmarks are approaching saturation and offer little guarantee on the trustworthiness of the generated programs, offering limited insight into deeper reasoning capabilities. We introduce VerifyThisBench, a new benchmark designed to evaluate LLMs on end-to-end program verification tasks that require interpreting natural language problem descriptions, formulating formal specifications, generating code, and constructing correctness proofs. Our evaluation reveals that even state-of-the-art (SOTA) models, such as o3-mini, achieve a pass rate of less than 4%, with many outputs failing to compile. To reduce task complexity, we further propose VerifyThisBenchXS, a variant in which partial implementations or proofs are provided. We systematically assess SOTA models on both benchmarks, uncovering key strengths and limitations in their formal reasoning and verification capabilities.

大型语言模型(LLMs)在代码生成方面取得了显著进展,但许多现有基准正在接近饱和,对生成的程式的可信度几乎没有什么保障,对更深的推理能力也只能提供有限的洞察力。我们引入了VirgilThench,这是一个新的基准,旨在评估终端到终端方案核查任务方面的LMs,这些任务要求解释自然语言问题描述、制定正式规格、生成代码和构建正确性证明。我们的评估显示,即使是最先进的模型(如O3-mini),也达到不到4%的及格率,许多产出无法编译。为了降低任务复杂性,我们进一步建议VirgirtThenchXS,这是一个提供部分执行或证据的变式。我们系统地评估SOTA两个基准的模型,揭示了正式推理和核查能力的关键优势和局限性。


Article 103

Title@2025-05-25 (7): CLEVER: A Curated Benchmark for Formally Verified Code Generation

Title: CLEVER: A Curated Benchmark for Formally Verified Code Generation CLEVER: Ein kuratierter Benchmark für die formal verifizierte Codegenerierung 正式核实的代码生成基准 2505.13938v3

Authors: Amitayush Thakur, Jasper Lee, George Tsoukalas, Meghana Sistla, Matthew Zhao, Stefan Zetzsche, Greg Durrett, Yisong Yue, Swarat Chaudhuri

We introduce CLEVER, a high-quality, curated benchmark of 161 problems for end-to-end verified code generation in Lean. Each problem consists of (1) the task of generating a specification that matches a held-out ground-truth specification, and (2) the task of generating a Lean implementation that provably satisfies this specification. Unlike prior benchmarks, CLEVER avoids test-case supervision, LLM-generated annotations, and specifications that leak implementation logic or allow vacuous solutions. All outputs are verified post-hoc using Lean’s type checker to ensure machine-checkable correctness. We use CLEVER to evaluate several few-shot and agentic approaches based on state-of-the-art language models. These methods all struggle to achieve full verification, establishing it as a challenging frontier benchmark for program synthesis and formal reasoning. Our benchmark can be found on GitHub (https://github.com/trishullab/clever) as well as HuggingFace (https://huggingface.co/datasets/amitayusht/clever). All our evaluation code is also available online (https://github.com/trishullab/clever-prover).

我们引入了CLEVER,这是一个高质量的、经过精心整理的基准,包含161个问题,用于在Lean中进行端到端的经验证代码生成。每个问题包括:(1) 生成与保留的真实规范相匹配的规范;(2) 生成可证明满足该规范的Lean实现。与以前的基准不同,CLEVER避免了测试用例监督、LLM生成的注释,以及泄漏实现逻辑或允许空洞解的规范。所有输出都使用Lean的类型检查器进行事后验证,以确保机器可检查的正确性。我们使用CLEVER评估了几种基于最先进语言模型的少样本和智能体方法。这些方法都难以实现完全验证,从而确立了CLEVER作为程序合成和形式推理方面具有挑战性的前沿基准的地位。我们的基准可以在GitHub (https://github.com/trishullab/clever) 和HuggingFace (https://huggingface.co/datasets/amitayusht/clever) 上找到。我们所有的评估代码也可在线获取 (https://github.com/trishullab/clever-prover)。


Article 104

Title@2025-05-25 (7): An Empirical Study of Vulnerability Handling Times in CPython

Title: An Empirical Study of Vulnerability Handling Times in CPython Eine empirische Studie über die Zeiten des Umgangs mit Gefährlichkeit in CPython CPython 脆弱性处理时间经验研究 2411.00447v2

Authors: Jukka Ruohonen

The paper examines the handling times of software vulnerabilities in CPython, the reference implementation and interpreter for what is likely today’s most popular programming language, Python. The background comes from so-called vulnerability life cycle analysis, the literature on bug fixing times, and recent research on the security of Python software. Based on regression analysis, the associated vulnerability fixing times can be explained very well merely by knowing who reported the vulnerabilities. Severity, proof-of-concept code, commits made to a version control system, comments posted on a bug tracker, and references to other sources do not explain the vulnerability fixing times. With these results, the paper contributes to the recent effort to better understand the security of the Python ecosystem.

本文研究了CPython软件脆弱性的处理时间、当今最流行的节目语言Python的参考实施和解释者。背景来自所谓的脆弱性生命周期分析、关于错误修正时间的文献和最近对Python软件安全的研究。根据回归分析,相关的脆弱性修正时间只能通过了解谁报告了脆弱性来很好地解释。多重性、验证概念代码、对版本控制系统的承诺、在错误追踪器上张贴的评论以及其它来源的引用并不能解释脆弱性的确定时间。有了这些结果,该文件有助于最近为更好地了解Python生态系统安全所作的努力。


Article 105

Title@2025-05-25 (7): An Initial Exploration of Fine-tuning Small Language Models for Smart Contract Reentrancy Vulnerability Detection

Title: An Initial Exploration of Fine-tuning Small Language Models for Smart Contract Reentrancy Vulnerability Detection Eine erste Erkundung von Feinsteuerungs-Kleinsprachenmodellen für intelligente Vertragsrepentrancy Sicherheitserkennung 初步探索智能合同留置率易变性探测智能合同微调小型语言模型 2505.19059v1

Authors: Ignacio Mariano Andreozzi Pofcher, Joshua Ellul

Large Language Models (LLMs) are increasingly used for various coding tasks, including helping coders identify bugs, and are a promising avenue to support coders in tasks such as vulnerability detection, particularly given the flexibility of such generative AI models and tools. Yet for many tasks LLMs may not be suitable, and smaller language models that fit on a developer’s computer and can easily be executed and trained there may be a better choice. In this paper we explore and evaluate whether smaller language models can be fine-tuned to achieve reasonable results for a niche area: vulnerability detection, specifically detecting the reentrancy bug in Solidity smart contracts.

大型语言模型(LLMS)正越来越多地用于各种编码任务,包括帮助编码员识别错误,并且是支持编码员开展包括脆弱性探测在内的各种任务的有希望的途径 – – 特别是考虑到这种基因化的AI型模型和工具的灵活性。然而,对于许多任务来说,使用LLMS可能不合适,因为可能更适合使用适合、易于执行的小型语言模型,在开发商的计算机上进行培训。在本文件中,我们探讨和评价是否可以对较小的语言模型进行微调,以便为一个特定领域取得合理结果:脆弱性探测 – – 特别侧重于在固态智能合同中探测再生虫。


Article 106

Title@2025-05-25 (7): AIGCodeSet: A New Annotated Dataset for AI Generated Code Detection

Title: AIGCodeSet: A New Annotated Dataset for AI Generated Code Detection AIGCodeSet: Ein neuer kommentierter Datensatz für KI Generated Code Detection AIGCodeSet:AI 生成代码探测新附加说明数据集 2412.16594v3

Authors: Basak Demirok, Mucahid Kutlu

While large language models provide significant convenience for software development, they can lead to ethical issues in job interviews and student assignments. Therefore, determining whether a piece of code is written by a human or generated by an artificial intelligence (AI) model is a critical issue. In this study, we present AIGCodeSet, which consists of 2,828 AI-generated and 4,755 human-written Python codes, created using CodeLlama 34B, Codestral 22B, and Gemini 1.5 Flash. In addition, we share the results of our experiments conducted with baseline detection methods. Our experiments show that a Bayesian classifier outperforms the other models.

虽然大型语言模型为软件开发提供了很大便利,但它们可能导致在求职面试和学生作业中出现道德问题。因此,确定一段代码是由人编写还是由人工智能(AI)模型生成是一个关键问题。在本研究中,我们介绍了AIGCodeSet,它由2,828个AI生成的和4,755个人工编写的Python代码组成,这些代码是使用CodeLlama 34B、Codestral 22B和Gemini 1.5 Flash创建的。此外,我们分享了我们用基线检测方法进行的实验的结果。我们的实验表明,贝叶斯分类器优于其他模型。


Article 107

Title@2025-05-25 (7): On-Demand Scenario Generation for Testing Automated Driving Systems

Title: On-Demand Scenario Generation for Testing Automated Driving Systems On-Demand-Szenario-Generierung für die Prüfung automatisierter Fahrsysteme 自动驾驶系统测试的 “ 现场需求 “ 情景生成 2505.14053v2

Authors: Songyang Yan, Xiaodong Zhang, Kunkun Hao, Haojie Xin, Yonggang Luo, Jucheng Yang, Ming Fan, Chao Yang, Jun Sun, Zijiang Yang

The safety and reliability of Automated Driving Systems (ADS) are paramount, necessitating rigorous testing methodologies to uncover potential failures before deployment. Traditional testing approaches often prioritize either natural scenario sampling or safety-critical scenario generation, resulting in overly simplistic or unrealistic hazardous tests. In practice, the demand for natural scenarios (e.g., when evaluating the ADS’s reliability in real-world conditions), critical scenarios (e.g., when evaluating safety in critical situations), or somewhere in between (e.g., when testing the ADS in regions with less civilized drivers) varies depending on the testing objectives. To address this issue, we propose the On-demand Scenario Generation (OSG) Framework, which generates diverse scenarios with varying risk levels. Achieving the goal of OSG is challenging due to the complexity of quantifying the criticalness and naturalness stemming from intricate vehicle-environment interactions, as well as the need to maintain scenario diversity across various risk levels. OSG learns from real-world traffic datasets and employs a Risk Intensity Regulator to quantitatively control the risk level. It also leverages an improved heuristic search method to ensure scenario diversity. We evaluate OSG on the Carla simulators using various ADSs. We verify OSG’s ability to generate scenarios with different risk levels and demonstrate its necessity by comparing accident types across risk levels. With the help of OSG, we are now able to systematically and objectively compare the performance of different ADSs based on different risk levels.

自动驾驶系统(ADS)的安全性和可靠性至关重要,因此必须在部署之前采用严格的测试方法来发现潜在的故障。传统测试方法往往优先考虑自然情景抽样或安全关键情景生成,导致测试过分简单或危险情景不现实。在实践中,对自然情景(例如,评估ADS在现实世界条件下的可靠性时)、关键情景(例如,评估危急情况下的安全性时)或介于两者之间的情景(例如,在驾驶员不太文明的地区测试ADS时)的需求因测试目标而异。为解决这一问题,我们提出了按需情景生成(OSG)框架,该框架可生成具有不同风险水平的多样化情景。实现OSG的目标具有挑战性,因为量化由车辆与环境之间复杂交互所产生的关键性和自然性十分复杂,并且需要在不同风险水平上保持情景多样性。OSG从现实世界交通数据集中学习,并采用风险强度调节器来定量控制风险水平。它还利用改进的启发式搜索方法来确保情景多样性。我们使用多种ADS在Carla模拟器上评估OSG。我们验证了OSG生成不同风险水平情景的能力,并通过比较不同风险水平下的事故类型证明了其必要性。借助OSG,我们现在能够基于不同风险水平系统、客观地比较不同ADS的性能。


Article 108

Title@2025-05-25 (7): Automated Trustworthiness Oracle Generation for Machine Learning Text Classifiers

Title: Automated Trustworthiness Oracle Generation for Machine Learning Text Classifiers Automatisierte Vertrauenswürdigkeit Oracle Generation für Machine Learning Text Klassifikatoren 机械学习文字分类的自动可信赖性甲骨文生成 2410.22663v4

Authors: Lam Nguyen Tung, Steven Cho, Xiaoning Du, Neelofar Neelofar, Valerio Terragni, Stefano Ruberto, Aldeida Aleti

Machine learning (ML) for text classification has been widely used in various domains. These applications can significantly impact ethics, economics, and human behavior, raising serious concerns about trusting ML decisions. Studies indicate that conventional metrics are insufficient to build human trust in ML models. These models often learn spurious correlations and predict based on them. In the real world, their performance can deteriorate significantly. To avoid this, a common practice is to test whether predictions are reasonable based on valid patterns in the data. Along with this, a challenge known as the trustworthiness oracle problem has been introduced. Due to the lack of automated trustworthiness oracles, the assessment requires manual validation of the decision process disclosed by explanation methods. However, this is time-consuming, error-prone, and unscalable. We propose TOKI, the first automated trustworthiness oracle generation method for text classifiers. TOKI automatically checks whether the words contributing the most to a prediction are semantically related to the predicted class. Specifically, we leverage ML explanations to extract the decision-contributing words and measure their semantic relatedness with the class based on word embeddings. We also introduce a novel adversarial attack method that targets trustworthiness vulnerabilities identified by TOKI. To evaluate their alignment with human judgement, experiments are conducted. We compare TOKI with a naive baseline based solely on model confidence and TOKI-guided adversarial attack method with A2T, a SOTA adversarial attack method. Results show that relying on prediction uncertainty cannot effectively distinguish between trustworthy and untrustworthy predictions, TOKI achieves 142% higher accuracy than the naive baseline, and TOKI-guided attack method is more effective with fewer perturbations than A2T.

用于文本分类的机器学习(ML)已在许多领域被广泛使用。这些应用可以极大地影响伦理、经济和人类行为,引起人们对信任ML决策的严重关切。研究表明,常规指标不足以建立人类对ML模型的信任。这些模型往往会学到虚假的关联,并据此进行预测。在现实世界中,它们的性能可能大幅下降。为了避免这种情况,一个常见的做法是测试预测是否基于数据中的有效模式而合理。与此同时,一个被称为可信度预言(trustworthiness oracle)问题的挑战也随之出现。由于缺乏自动化的可信度预言,评估需要人工验证解释方法所揭示的决策过程。然而,这既耗时又容易出错,且无法扩展。我们提出了TOKI,这是第一个面向文本分类器的自动化可信度预言生成方法。TOKI自动检查对预测贡献最大的词是否与预测类别在语义上相关。具体而言,我们利用ML解释来提取对决策有贡献的词,并基于词嵌入衡量它们与类别的语义相关性。我们还引入了一种新的对抗攻击方法,针对TOKI识别出的可信度漏洞。我们通过实验评估它们与人类判断的一致性。我们将TOKI与仅基于模型置信度的朴素基线进行比较,并将TOKI引导的对抗攻击方法与SOTA对抗攻击方法A2T进行比较。结果表明,依赖预测不确定性无法有效区分可信与不可信的预测,TOKI的准确率比朴素基线高142%,TOKI引导的攻击方法比A2T更有效且所需扰动更少。
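
The core check described in the abstract above (are the words that contribute most to a prediction semantically related to the predicted class?) can be sketched with cosine similarity over word embeddings. Everything concrete here is an assumption for illustration: the function names, the mean-similarity aggregation, the 0.5 threshold, and the two-dimensional toy vectors. TOKI's actual relatedness measure may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def is_trustworthy(top_words, class_name, embeddings, threshold=0.5):
    """Flag a prediction as trustworthy if its most influential words are,
    on average, close to the class name in embedding space.

    Assumptions: mean aggregation and a fixed threshold; words without an
    embedding are ignored.
    """
    sims = [
        cosine(embeddings[w], embeddings[class_name])
        for w in top_words
        if w in embeddings
    ]
    if not sims:
        return False
    return sum(sims) / len(sims) >= threshold


# Toy 2-D embeddings (made up); real systems would load pretrained vectors.
toy_embeddings = {
    "sports": [1.0, 0.0],
    "football": [0.9, 0.1],
    "goal": [0.8, 0.3],
    "the": [0.0, 1.0],
    "monday": [0.1, 0.9],
}
print(is_trustworthy(["football", "goal"], "sports", toy_embeddings))
print(is_trustworthy(["the", "monday"], "sports", toy_embeddings))
```

In practice the `top_words` list would come from an explanation method that attributes the prediction to input tokens, which is outside the scope of this sketch.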


Article 109

Title@2025-05-25 (7): Co-PatcheR: Collaborative Software Patching with Component(s)-specific Small Reasoning Models

Title: Co-PatcheR: Collaborative Software Patching with Component(s)-specific Small Reasoning Models Co-PatcheR: Kollaborative Software-Patching mit Komponenten-spezifischen Small-Reasoning-Modellen 共同配给R:与特定组成部分的小型理由模型合作的软件补补补 2505.18955v1

Authors: Yuheng Tang, Hongwei Li, Kaijie Zhu, Michael Yang, Yangruibo Ding, Wenbo Guo

Motivated by the success of general-purpose large language models (LLMs) in software patching, recent works have started to train specialized patching models. Most train a single model to handle the end-to-end patching pipeline (including issue localization, patch generation, and patch validation). However, it is hard for a small model to handle all tasks, as different sub-tasks have different workflows and require different expertise. As a result, even with a 70-billion-parameter model, SOTA methods reach at most a 41% resolved rate on SWE-bench-Verified. Motivated by the collaborative nature of patching, we propose Co-PatcheR, the first collaborative patching system with small and specialized reasoning models for individual components. Our key technical novelties are the task-specific designs and training recipes. First, we train a model for localization and patch generation. Our localization pinpoints the suspicious lines through a two-step procedure, and our generation combines patch generation and critique. We then propose a hybrid patch validation that includes two models for crafting issue-reproducing test cases with and without assertions and judging patch correctness, followed by a majority vote-based patch selection. Through extensive evaluation, we show that Co-PatcheR achieves a 46% resolved rate on SWE-bench-Verified with only 3 × 14B models. This makes Co-PatcheR the best patcher with specialized models, requiring the least training resources and the smallest models. We conduct a comprehensive ablation study to validate our recipes, as well as our choice of training data number, model size, and testing-phase scaling strategy.

受通用大语言模型(LLM)在软件补丁方面成功的激励,最近的工作开始训练专门的补丁模型。多数工作训练一个模型来处理端到端的补丁流水线(包括问题定位、补丁生成和补丁验证)。然而,小型模型很难处理所有任务,因为不同的子任务有不同的工作流程,需要不同的专业知识。因此,即使使用700亿参数的模型,SOTA方法在SWE-bench-Verified上也只能达到最高41%的解决率。受协作性质的启发,我们提出了Co-PatcheR,这是第一个为各个组件配备小型专用推理模型的协作式补丁系统。我们的关键技术创新是针对具体任务的设计和训练方案。首先,我们训练一个用于定位和补丁生成的模型。我们的定位通过两步过程精确找出可疑代码行,我们的生成则结合了补丁生成和评审。然后,我们提出一种混合补丁验证方法,其中包括两个模型,分别用于编写带断言和不带断言的问题复现测试用例并判断补丁正确性,随后进行基于多数投票的补丁选择。通过广泛的评估,我们表明Co-PatcheR仅用3个14B模型就在SWE-bench-Verified上达到了46%的解决率。这使Co-PatcheR成为使用专用模型的最佳补丁系统,所需的训练资源最少,模型也最小。我们进行了全面的消融研究,以验证我们的方案以及我们对训练数据量、模型规模和测试阶段扩展策略的选择。


Article 110

Title@2025-05-24 (6): From Output to Evaluation: Does Raw Instruction-Tuned Code LLMs Output Suffice for Fill-in-the-Middle Code Generation?

Title: From Output to Evaluation: Does Raw Instruction-Tuned Code LLMs Output Suffice for Fill-in-the-Middle Code Generation? Von der Ausgabe bis zur Auswertung: Reicht die Ausgabe von LLMs mit rohem Instruktionscode für die Generierung von Fill-in-the-Middle Code aus? 从输出到评价:原始指令-指令代码LLMs 输出足量是否用于中代代号的填充? 2505.18789v1

Authors: Wasi Uddin Ahmad, Somshubra Majumdar, Boris Ginsburg

Post-processing is crucial for the automatic evaluation of LLMs in fill-in-the-middle (FIM) code generation due to the frequent presence of extraneous code in raw outputs. This extraneous generation suggests a lack of awareness regarding output boundaries, requiring truncation for effective evaluation. The determination of an optimal truncation strategy, however, often proves intricate, particularly when the scope includes several programming languages. This study investigates the necessity of post-processing instruction-tuned LLM outputs. Our findings reveal that supervised fine-tuning significantly enhances FIM code generation, enabling LLMs to generate code that seamlessly integrates with the surrounding context. Evaluating our fine-tuned Qwen2.5-Coder (base and instruct) models on the HumanEval Infilling and SAFIM benchmarks demonstrates improved performance without post-processing, especially when the middle consists of complete lines. However, post-processing of the LLM outputs remains necessary when the middle is a random span of code.

由于原始输出中经常存在多余代码,后处理对于自动评价LLM在中间填充(FIM)代码生成中的表现至关重要。这种多余的生成表明模型对输出边界缺乏认识,需要截断才能进行有效评价。然而,确定最佳截断策略往往很复杂,特别是当范围包括多种编程语言时。本研究调查了对指令微调LLM输出进行后处理的必要性。我们的调查结果显示,监督微调显著增强了FIM代码生成,使LLM能够生成与周围上下文无缝结合的代码。在HumanEval Infilling和SAFIM基准上评估我们微调的Qwen2.5-Coder(base和instruct)模型表明,无需后处理即可获得更好的性能,特别是当中间部分(middle)由完整的行组成时。然而,当中间部分是随机的代码片段时,对LLM输出进行后处理仍是必要的。
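
The kind of truncation post-processing discussed in the abstract above can be illustrated with one simple line-level heuristic: cut the generated middle at the first line that repeats the first non-blank line of the known suffix. This is only one of many possible strategies and is an assumption for illustration, not the procedure evaluated in the paper; the function name `truncate_middle` is hypothetical.

```python
def truncate_middle(generated, suffix):
    """Drop everything from the point where the model starts re-emitting
    the suffix (line-level heuristic; whitespace-insensitive match).

    Returns `generated` unchanged when the suffix has no non-blank line.
    """
    suffix_lines = [line for line in suffix.splitlines() if line.strip()]
    if not suffix_lines:
        return generated
    first_suffix_line = suffix_lines[0].strip()
    kept = []
    for line in generated.splitlines():
        if line.strip() == first_suffix_line:
            break  # the model has run past the intended middle
        kept.append(line)
    return "\n".join(kept)


# The model was asked for the body between a prefix and this suffix,
# but it kept going and reproduced the suffix plus extra code.
suffix = "    return total\n"
raw_output = "    total = a + b\n    return total\n\nprint(add(1, 2))"
print(truncate_middle(raw_output, suffix))
```

A line-level rule like this works when the middle consists of complete lines; a middle that is a random span inside a line would need character-level overlap detection instead, which mirrors the distinction the abstract draws.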


Article 111

Title@2025-05-24 (6): ARMS: A Vision for Actor Reputation Metric Systems in the Open-Source Software Supply Chain

Title: ARMS: A Vision for Actor Reputation Metric Systems in the Open-Source Software Supply Chain ARMS: Vision für Actor Reputation Metric Systems in der Open Source Software Supply Chain ARMS:开放源码软件供应链中行为名声计量系统展望 2505.18760v1

Authors: Kelechi G. Kalu, Sofia Okorafor, Betül Durak, Kim Laine, Radames C. Moreno, Santiago Torres-Arias, James C. Davis

Many critical information technology and cyber-physical systems rely on a supply chain of open-source software projects. OSS project maintainers often integrate contributions from external actors. While maintainers can assess the correctness of a change request, assessing a change request’s cybersecurity implications is challenging. To help maintainers make this decision, we propose that the open-source ecosystem should incorporate Actor Reputation Metrics (ARMS). This capability would enable OSS maintainers to assess a prospective contributor’s cybersecurity reputation. To support the future instantiation of ARMS, we identify seven generic security signals from industry standards, map concrete metrics from prior work and available security tools, describe study designs to refine and assess the utility of ARMS, and finally weigh its pros and cons.

许多关键的信息技术和网络物理系统依赖由开源软件项目构成的供应链。开源软件(OSS)项目的维护者经常整合外部行为者的贡献。虽然维护者可以评估变更请求的正确性,但评估变更请求对网络安全的影响具有挑战性。为了帮助维护者做出这一决定,我们提议开源生态系统应引入行为者声誉指标(ARMS)。这一能力将使OSS维护者能够评估潜在贡献者的网络安全声誉。为了支持未来ARMS的实例化,我们从行业标准中确定了七个通用安全信号,从先前的工作和现有安全工具中映射出具体指标,描述了用于改进和评估ARMS效用的研究设计,最后权衡了其利弊。


Article 112

Title@2025-05-24 (6): AutoP2C: An LLM-Based Agent Framework for Code Repository Generation from Multimodal Content in Academic Papers

Title: AutoP2C: An LLM-Based Agent Framework for Code Repository Generation from Multimodal Content in Academic Papers AutoP2C: Ein LLM-basiertes Agent-Framework für die Code-Repository-Generierung aus multimodalen Inhalten in wissenschaftlichen Papieren AutoP2C:基于LLM的智能体框架,用于从学术论文的多模态内容生成代码库 2504.20115v2

Authors: Zijie Lin, Yiqing Shen, Qilin Cai, He Sun, Jinrui Zhou, Mingjun Xiao

Machine Learning (ML) research is spread through academic papers featuring rich multimodal content, including text, diagrams, and tabular results. However, translating these multimodal elements into executable code remains a challenging and time-consuming process that requires substantial ML expertise. We introduce "Paper-to-Code" (P2C), a novel task that transforms the multimodal content of scientific publications into fully executable code repositories, which extends beyond the existing formulation of code generation that merely converts textual descriptions into isolated code snippets. To automate the P2C process, we propose AutoP2C, a multi-agent framework based on large language models that processes both textual and visual content from research papers to generate complete code repositories. Specifically, AutoP2C contains four stages: (1) repository blueprint extraction from established codebases, (2) multimodal content parsing that integrates information from text, equations, and figures, (3) hierarchical task decomposition for structured code generation, and (4) iterative feedback-driven debugging to ensure functionality and performance. Evaluation on a benchmark of eight research papers demonstrates the effectiveness of AutoP2C, which can successfully generate executable code repositories for all eight papers, while OpenAI-o1 or DeepSeek-R1 can only produce runnable code for one paper. The code is available at https://github.com/shoushouyu/Automated-Paper-to-Code.

机器学习(ML)研究通过包含丰富多模态内容(包括文本、图表和表格结果)的学术论文传播。然而,将这些多模态内容转换为可执行代码仍是一个具有挑战性且耗时的过程,需要大量的ML专业知识。我们提出了“Paper-to-Code”(P2C)这一新任务,它将科学出版物的多模态内容转换为完全可执行的代码库,超出了现有代码生成任务仅将文本描述转换为孤立代码片段的范围。为了使P2C过程自动化,我们提出了AutoP2C,这是一个基于大型语言模型的多智能体框架,可同时处理研究论文中的文本和视觉内容,以生成完整的代码库。具体而言,AutoP2C包含四个阶段:(1) 从已有代码库中提取代码库蓝图,(2) 整合文本、公式和图中信息的多模态内容解析,(3) 面向结构化代码生成的分层任务分解,以及 (4) 由反馈驱动的迭代调试,以确保功能和性能。在由八篇研究论文组成的基准上的评估证明了AutoP2C的有效性:它能够为全部八篇论文成功生成可执行的代码库,而OpenAI-o1或DeepSeek-R1只能为其中一篇论文生成可运行的代码。代码可在 https://github.com/shoushouyu/Automated-Paper-to-Code 获取。


Article 113

Title@2025-05-24 (6): Fixing 7,400 Bugs for 1$: Cheap Crash-Site Program Repair

Title: Fixing 7,400 Bugs for 1$: Cheap Crash-Site Program Repair Beheben von 7.400 Fehlern für 1$: Günstige Crash-Site-Programm-Reparatur 用1美元修复7,400个漏洞:低成本的崩溃点程序修复 2505.13103v2

Authors: Han Zheng, Ilia Shumailov, Tianqi Fan, Aiden Hall, Mathias Payer

The rapid advancement of bug-finding techniques has led to the discovery of more vulnerabilities than developers can reasonably fix, creating an urgent need for effective Automated Program Repair (APR) methods. However, the complexity of modern bugs often makes precise root cause analysis difficult and unreliable. To address this challenge, we propose crash-site repair to simplify the repair task while still mitigating the risk of exploitation. In addition, we introduce a template-guided patch generation approach that significantly reduces the token cost of Large Language Models (LLMs) while maintaining both efficiency and effectiveness. We implement our prototype system, WILLIAMT, and evaluate it against state-of-the-art APR tools. Our results show that, when combined with the top-performing agent CodeRover-S, WILLIAMT reduces token cost by 45.9% and increases the bug-fixing rate to 73.5% (+29.6%) on ARVO, a ground-truth open source software vulnerabilities benchmark. Furthermore, we demonstrate that WILLIAMT can function effectively even without access to frontier LLMs: even a local model running on a Mac M4 Mini achieves a reasonable repair rate. These findings highlight the broad applicability and scalability of WILLIAMT.

漏洞发现技术的快速发展使得被发现的漏洞数量超出了开发者能够合理修复的范围,因此迫切需要有效的自动程序修复(APR)方法。然而,现代漏洞的复杂性往往使精确的根因分析变得困难且不可靠。为了应对这一挑战,我们提出崩溃点修复(crash-site repair),以简化修复任务,同时仍能降低被利用的风险。此外,我们引入了一种模板引导的补丁生成方法,在保持效率和有效性的同时显著降低大型语言模型(LLM)的token成本。我们实现了原型系统WILLIAMT,并将其与最先进的APR工具进行了对比评估。结果表明,与表现最佳的智能体CodeRover-S结合使用时,WILLIAMT在ARVO(一个具有真实标注的开源软件漏洞基准)上将token成本降低了45.9%,并将漏洞修复率提高到73.5%(+29.6%)。此外,我们证明WILLIAMT即使在无法使用前沿LLM的情况下也能有效工作:即使是在Mac M4 Mini上运行的本地模型也能达到合理的修复率。这些发现凸显了WILLIAMT的广泛适用性和可扩展性。


Article 114

Title@2025-05-24 (6): SEW: Self-Evolving Agentic Workflows for Automated Code Generation

Title: SEW: Self-Evolving Agentic Workflows for Automated Code Generation SEW: Selbst-evolvierende Agentische Workflows für die automatisierte Codegenerierung SEW:用于自动代码生成的自演化智能体工作流 2505.18646v1

Authors: Siwei Liu, Jinyuan Fang, Han Zhou, Yingxu Wang, Zaiqiao Meng

Large Language Models (LLMs) have demonstrated effectiveness in code generation tasks. To enable LLMs to address more complex coding challenges, existing research has focused on crafting multi-agent systems with agentic workflows, where complex coding tasks are decomposed into sub-tasks, assigned to specialized agents. Despite their effectiveness, current approaches heavily rely on hand-crafted agentic workflows, with both agent topologies and prompts manually designed, which limits their ability to automatically adapt to different types of coding problems. To address these limitations and enable automated workflow design, we propose Self-Evolving Workflow (SEW), a novel self-evolving framework that automatically generates and optimises multi-agent workflows. Extensive experiments on three coding benchmark datasets, including the challenging LiveCodeBench, demonstrate that our SEW can automatically design agentic workflows and optimise them through self-evolution, bringing up to 33% improvement on LiveCodeBench compared to using the backbone LLM only. Furthermore, by investigating different representation schemes of workflow, we provide insights into the optimal way to encode workflow information with text.

大型语言模型(LLM)已在代码生成任务中展现出有效性。为了使LLM能够应对更复杂的编码挑战,现有研究侧重于构建带有智能体工作流的多智能体系统,将复杂的编码任务分解为子任务并分配给专门的智能体。尽管这些方法有效,但目前的方法严重依赖手工设计的智能体工作流,其智能体拓扑结构和提示词均由人工设计,这限制了它们自动适应不同类型编码问题的能力。为了解决这些局限并实现自动化的工作流设计,我们提出了Self-Evolving Workflow(SEW),这是一个能够自动生成并优化多智能体工作流的新型自演化框架。在包括具有挑战性的LiveCodeBench在内的三个编码基准数据集上的大量实验表明,SEW能够自动设计智能体工作流并通过自演化对其进行优化,与仅使用骨干LLM相比,在LiveCodeBench上带来最高33%的提升。此外,通过研究工作流的不同表示方案,我们就如何以文本最优地编码工作流信息提供了见解。


Article 115

Title@2025-05-24 (6): ACECODER: Acing Coder RL via Automated Test-Case Synthesis

Title: ACECODER: Acing Coder RL via Automated Test-Case Synthesis ACECODER: Acing Coder RL über automatisierte Test-Case-Synthese ACECODER:通过自动化测试用例合成攻克代码模型强化学习 2502.01718v4

Authors: Huaye Zeng, Dongfu Jiang, Haozhe Wang, Ping Nie, Xiaotong Chen, Wenhu Chen

Most progress in recent coder models has been driven by supervised fine-tuning (SFT), while the potential of reinforcement learning (RL) remains largely unexplored, primarily due to the lack of reliable reward data and reward models in the code domain. In this paper, we address this challenge by leveraging automated large-scale test-case synthesis to enhance code model training. Specifically, we design a pipeline that generates extensive (question, test-cases) pairs from existing code data. Using these test cases, we construct preference pairs based on pass rates over sampled programs to train reward models with Bradley-Terry loss. With best-of-32 sampling, the reward model yields an average 10-point improvement for Llama-3.1-8B-Ins and a 5-point improvement for Qwen2.5-Coder-7B-Ins, putting the 7B model on par with the 236B DeepSeek-V2.5. Furthermore, we conduct reinforcement learning with both reward models and test-case pass rewards, leading to consistent improvements across HumanEval, MBPP, BigCodeBench, and LiveCodeBench (V4). Notably, we follow R1-style training, starting directly from Qwen2.5-Coder-base, and show that our RL training can improve the model on HumanEval-plus by over 25% and on MBPP-plus by 6% in merely 80 optimization steps. We believe our results highlight the huge potential of reinforcement learning in coder models.

近期代码模型的大部分进展由监督微调(SFT)推动,而强化学习(RL)的潜力在很大程度上仍未被探索,这主要是因为代码领域缺乏可靠的奖励数据/模型。在本文中,我们利用自动化的大规模测试用例合成来加强代码模型训练,以应对这一挑战。具体而言,我们设计了一条流水线,从现有代码数据中生成大量(问题,测试用例)对。利用这些测试用例,我们根据采样程序的通过率构建偏好对,并以Bradley-Terry损失训练奖励模型。通过best-of-32采样,该奖励模型使Llama-3.1-8B-Ins平均提升10分,使Qwen2.5-Coder-7B-Ins平均提升5分,使7B模型与236B的DeepSeek-V2.5相当。此外,我们同时使用奖励模型和测试用例通过奖励进行强化学习,在HumanEval、MBPP、BigCodeBench和LiveCodeBench(V4)上取得了一致的提升。值得注意的是,我们遵循R1式训练,直接从Qwen2.5-Coder-base开始,结果表明仅需80个优化步骤,我们的RL训练就能使模型在HumanEval-plus上提升超过25%,在MBPP-plus上提升6%。我们相信这些结果凸显了强化学习在代码模型中的巨大潜力。
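The reward-model recipe in this abstract (preference pairs built from pass rates, then a Bradley-Terry loss) can be sketched in a few lines. This is an illustrative sketch only, not the ACECODER pipeline: the helper names and the `min_gap` threshold are assumptions, and the abstract does not say how pairs are filtered.

```python
import math
from itertools import combinations


def build_preference_pairs(pass_rates, min_gap=0.4):
    """Return (chosen_idx, rejected_idx) pairs of sampled programs.

    `pass_rates[i]` is the fraction of synthesized test cases program i passes.
    `min_gap` is a hypothetical threshold: a pair is kept only when the pass
    rates differ by at least this much.
    """
    pairs = []
    for i, j in combinations(range(len(pass_rates)), 2):
        gap = pass_rates[i] - pass_rates[j]
        if gap >= min_gap:
            pairs.append((i, j))
        elif -gap >= min_gap:
            pairs.append((j, i))
    return pairs


def bradley_terry_loss(reward_chosen, reward_rejected):
    """Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected), computed stably."""
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Three sampled programs for one question, with illustrative pass rates.
pass_rates = [1.0, 0.5, 0.0]
pairs = build_preference_pairs(pass_rates)

# Scalar rewards a reward model might assign (made-up numbers).
rewards = [2.0, 0.5, -1.0]
mean_loss = sum(bradley_terry_loss(rewards[c], rewards[r]) for c, r in pairs) / len(pairs)
print(pairs, round(mean_loss, 4))
```

In an actual training loop the rewards would come from a neural reward model and the loss would be minimized by gradient descent. The scalar version here only shows how the loss penalizes a rejected program scoring close to or above the chosen one.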


Article 116

Title@2025-05-24 (6): On the Structure and Semantics of Identifier Names Containing Closed Syntactic Category Words

Title: On the Structure and Semantics of Identifier Names Containing Closed Syntactic Category Words Über die Struktur und Semantik von Identifier-Namen, die geschlossene syntaktische Kategorie Wörter enthalten 论包含封闭句法类别词的标识符名称的结构与语义 2505.18444v1

Authors: Christian D. Newman, Anthony Peruma, Eman Abdullah AlOmar, Mahie Crabbe, Syreen Banabilah, Reem S. AlSuhaibani, Michael J. Decker, Farhad Akhbardeh, Marcos Zampieri, Mohamed Wiem Mkaouer, Jonathan I. Maletic

Identifier names are crucial components of code, serving as primary clues for developers to understand program behavior. This paper investigates the linguistic structure of identifier names by extending the concept of grammar patterns: representations of the part-of-speech (PoS) sequences that underlie identifier phrases. The specific focus is on closed syntactic categories (e.g., prepositions, conjunctions, determiners), which are rarely studied in software engineering despite their central role in general natural language. The paper presents the Closed Category Identifier Dataset (CCID), a new manually annotated dataset of 1,275 identifiers drawn from 30 open-source systems. The relationship between closed-category grammar patterns and program behavior is analyzed using grounded theory coding, statistical analysis, and pattern analysis. The results reveal recurring structures that developers use to express control flow, data transformation, temporal reasoning, and behavioral roles through naming. This study contributes an empirical foundation for understanding how developers adapt linguistic resources to encode behavior in source code. By analyzing closed-category terms and their associated grammar patterns, the work highlights a previously underexplored dimension of identifier semantics and identifies promising directions for future research in naming support, comprehension, and education.

标识符名称是代码的关键组成部分,是开发者理解程序行为的主要线索。本文通过扩展语法模式(grammar pattern)的概念来研究标识符名称的语言结构;语法模式是构成标识符短语基础的词性(PoS)序列的表示。研究重点是封闭句法类别(如介词、连词、限定词),尽管它们在一般自然语言中具有核心作用,但在软件工程中很少被研究。本文提出了封闭类别标识符数据集(CCID),这是一个新的人工标注数据集,包含从30个开源系统中提取的1,275个标识符。本文使用扎根理论编码、统计分析和模式分析,分析了封闭类别语法模式与程序行为之间的关系。结果揭示了开发者通过命名来表达控制流、数据转换、时间推理和行为角色时反复使用的结构。本研究为理解开发者如何调整语言资源以在源代码中编码行为提供了实证基础。通过分析封闭类别词项及其相关的语法模式,本工作突出了标识符语义中一个此前研究不足的维度,并为命名支持、理解和教育方面的未来研究指明了有前景的方向。
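As a rough illustration of what a grammar pattern over an identifier looks like, the sketch below splits an identifier into words and tags only the closed-category ones. The word list, the tag labels (`P`, `CJ`, `DT`), and the placeholder `X` for every other word are assumptions made for this example. They are not the paper's annotation scheme, and a full grammar pattern would need a part-of-speech tagger for nouns and verbs.

```python
import re

# Small, hypothetical lexicon of closed-category words and illustrative tags.
CLOSED_CATEGORY = {
    "to": "P", "from": "P", "in": "P", "on": "P", "with": "P", "by": "P", "for": "P", "of": "P",
    "and": "CJ", "or": "CJ", "if": "CJ",
    "the": "DT", "a": "DT", "an": "DT", "all": "DT", "each": "DT", "this": "DT",
}


def split_identifier(name):
    """Split a camelCase / snake_case identifier into lowercase words."""
    words = []
    for part in re.split(r"[_\W]+", name):
        # Acronym runs (e.g. "XML"), capitalized or lowercase words, digit runs.
        words.extend(re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part))
    return [w.lower() for w in words]


def grammar_pattern(name):
    """Tag closed-category words; every other word gets the placeholder 'X'."""
    return " ".join(CLOSED_CATEGORY.get(w, "X") for w in split_identifier(name))


for identifier in ["convertToString", "items_in_cart", "readAndParse", "parseXMLToJSON"]:
    print(identifier, "->", split_identifier(identifier), "->", grammar_pattern(identifier))
```

Even this crude tagging shows the kind of signal the abstract describes: a preposition such as "to" between two open-class words often marks a data transformation, while a conjunction such as "and" suggests that a routine performs more than one action.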