cs.CL @ 2025-07-18: 650
-
00 07-17 (4) VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning VisionThink: Intelligentes und effizientes Vision-Sprachmodell durch Verstärkungslernen 远景设想:通过强化学习建立聪明、高效的愿景语言模式 2507.13348v1 -
01 07-17 DeFine: Decision-Making with Analogical Reasoning over Factor Profiles DeFine: Entscheidungsfindung mit analogischer Begründung über Faktorprofile DeFine: 与因子剖析档的模拟理由有关的决策 2410.01772v2 -
02 07-17 Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes Vergleich von Äpfeln mit Orangen: Ein Datensatz & Analyse des LLM Humorverständnisses von traditionellen Puns zu thematischen Witzen 将苹果与橙类比较:从传统Puns到专题笑话的LLM Humour理解数据集和分析 2507.13335v1 -
03 07-17 A Survey of Context Engineering for Large Language Models Eine Übersicht über Kontext-Engineering für große Sprachmodelle 大语言模型背景工程调查 2507.13334v1 -
04 07-17 The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner Die Imitation Spiel: Turing Machine Imitator ist Länge Generalizable Reasoner 模拟游戏:图画机器模拟器是长可概括的理由 2507.13332v1 -
05 07-17 Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It Vision-and-Language Training hilft, taxonomisches Wissen zu implementieren, ändert es aber nicht grundlegend 愿景和语言培训帮助利用分类学知识,但不能从根本上改变这种知识。 2507.13328v1 -
06 07-17 Social and Political Framing in Search Engine Results Soziale und politische Framing in Suchmaschinen-Ergebnissen 寻找引擎结果中的社会和政治形式 2507.13325v1 -
07 07-17 HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals HapticCap: Ein multimodaler Datensatz und die Aufgabe, die Benutzererfahrung von Schwingungshaptischen Signalen zu verstehen HapticCap:多模式数据集和了解用户振动信号信号体验的任务 2507.13318v1 -
08 07-17 Identifying Task Groupings for Multi-Task Learning Using Pointwise V-Usable Information Ermittlung von Aufgabengruppen für Multi-Task-Lernen mit pointwise V-Usable Information 利用有分点的V-可靠信息确定多任务学习组 2410.12774v2 -
09 07-17 The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations Die Generative Energy Arena (GEA): Einbeziehung des Energiebewusstseins in das Large Language Model (LLM) Human Assessments 产生能源竞技场:将能源意识纳入大语言模型(LLM)人类评估 2507.13302v1 -
10 07-17 AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research AbGen: Bewertung großer Sprachmodelle in Ablationsstudiendesign und Evaluation für wissenschaftliche Forschung AbGen:评估用于科学研究的实验研究设计和评价中的大语言模型 2507.13300v1 -
11 07-17 Multi-Agent Synergy-Driven Iterative Visual Narrative Synthesis Multi-Agent Synergy-getriebene iterative visuelle Narrative Synthese 多机构协同-驱动动态迭代视觉叙述合成 2507.13285v1 -
12 07-17 ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations ContextQFormer: Eine neue Context-Modellierungsmethode für Multi-Turn Multi-Modal-Gespräche 上下文前:多发多式多模式对话的新背景建模方法 2505.23121v2 -
13 07-17 Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management Überblick über das TalentCLEF 2025: Kompetenz- und Berufstitel-Intelligenz für Human Capital Management 《2025年人才人才-CLEF概览:人力资本管理技能和职称情报》 2507.13275v1 -
14 07-17 Secure Multifaceted-RAG for Enterprise: Hybrid Knowledge Retrieval with Security Filtering Sichere Multifaceted-RAG für Unternehmen: Hybrides Knowledge Retrieval mit Security-Filterung 企业安全多面安全RAG:带安全过滤器的混合知识检索 2504.13425v2 -
15 07-17 QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation QuestA: Erweitern der Begründungskapazität in LLMs durch Frageerweiterung QuestA:通过问题增加扩大LLMs的理据能力 2507.13266v1 -
16 07-17 Automating Steering for Safe Multimodal Large Language Models Automatisierungslenkung für sichere multimodale große Sprachmodelle 安全多式联运大语言模式自动化指导 2507.13255v1 -
17 07-17 ConTextual: Improving Clinical Text Summarization in LLMs with Context-preserving Token Filtering and Knowledge Graphs ConTextual: Verbesserung der klinischen Textzusammenfassung in LLMs mit kontextschonender Token-Filterung und Wissensgraphen 共同方式:改进LLMLLM的临床文本摘要,同时保持上下文透视和知识图 2504.16394v3 -
18 07-17 HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models HATS: Hindi Analogy Test Set zur Bewertung von Vernunft in großen Sprachmodellen HATS: 用于评估大语言模型中原因的印地语分析测试套 2507.13238v1 -
19 07-17 Enhancing Cross-task Transfer of Large Language Models via Activation Steering Verbesserung der Cross-Task-Übertragung großer Sprachmodelle durch Aktivierungslenkung 通过启动指导加强大语言模式的跨任务转让 2507.13236v1 -
20 07-17 CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual, Multi-Generator and Multi-Domain Settings CoDet-M4: Erkennung maschinengenerierter Codes in Multi-Lingual-, Multi-Generator- und Multi-Domain-Einstellungen CoDet-M4:多语言、多驱动器和多域设置中的检测机生成代码 2503.13733v2 -
21 07-17 A Comparative Approach to Assessing Linguistic Creativity of Large Language Models and Humans Ein Vergleichsansatz zur Beurteilung sprachlicher Kreativität von großen Sprachmodellen und Menschen 评估大语言模式和人类语言创造性的比较方法 2507.12039v2 -
22 07-17 Automatically assessing oral narratives of Afrikaans and isiXhosa children Automatische Beurteilung mündlicher Erzählungen von Afrikaans und isiXhosa Kindern 自动评估南非荷兰语和土著Xhoosa儿童口述叙述 2507.13205v1 -
23 07-17 GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems GEMMAS: Graph-basierte Evaluations-Metriken für Multi-Agent-Systeme GEMMAS:基于图表的多剂系统评价计量表 2507.13190v1 -
24 07-17 Feature-based analysis of oral narratives from Afrikaans and isiXhosa children Feature-basierte Analyse oraler Erzählungen von Afrikaans und isiXhosa-Kindern 对南非荷兰语和土著Xhoosa儿童口述叙述的基于特征的分析 2507.13164v1 -
25 07-17 Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities Inverse Stärkung Lernen trifft auf großes Sprachmodell Post-Training: Grundlagen, Fortschritte und Chancen 培训后培训:基础、进步和机会 2507.13158v1 -
26 07-17 From Roots to Rewards: Dynamic Tree Reasoning with RL Von Wurzeln zu Belohnungen: Dynamische Baumveranlagung mit RL 从根到奖赏: 使用 RL 解释动态树 2507.13142v1 -
27 07-17 SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks SWE-MERA: Ein dynamischer Benchmark für die Agentik-Bewertung großer Sprachmodelle in Software-Engineering-Aufgaben SWE-MERA: 积极评价软件工程任务大语言模型的动态基准 2507.11059v2 -
28 07-17 Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation Bewertung der Zuverlässigkeit von LLM-Annotationen im Kontext der demografischen Bias und Modellerklärung 结合人口偏见和示范解释评估LLM 说明的可靠性 2507.13138v1 -
29 07-17 A Computational Framework to Identify Self-Aspects in Text Ein Computational Framework zur Identifizierung von Selbstaspekten im Text 文本中识别自我特征的计算框架 2507.13115v1 -
30 07-17 Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression Task-Circuit Quantization: Nutzung von Wissen Lokalisierung und Dolmetschbarkeit für Komprimierung 任务-环境环境定量:利用知识本地化和压缩解释 2504.07389v2 -
31 07-17 SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts SemCSE: Semantische kontrastive Satzeinbettungen mit LLM-generierten Zusammenfassungen für wissenschaftliche Abstracts SEMCSE: 使用LLM创制的科学摘要摘要 2507.13105v1 -
32 07-17 Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models Unified Triplet-Level Halluzination Evaluation für große Vision-Sprache Modelle 大型视觉语言模型统一三维级幻觉评价 2410.23114v4 -
33 07-17 SmartThinker: Learning to Compress and Preserve Reasoning by Step-Level Length Control SmartThinker: Lernen, um zu komprimieren und zu bewahren Vernunft durch Schritt-Level-Length Control SmartThinker: 学会按职级长长控制进行压缩和保留理由 2507.04348v2 -
34 07-17 MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks MERA-Code: Ein einheitliches Framework zur Bewertung der Codegenerierung von Aufgaben MERA 守则:一个统一框架,用于评估不同任务制定守则的情况 2507.12284v2 -
35 07-17 Formalizing Attack Scenario Description: A Proposed Model Formalisierung des Angriffsszenarios Beschreibung: Ein vorgeschlagenes Modell 正式化攻击设想情况说明:拟议模式 2507.13076v1 -
36 07-17 Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities Rethinking the Embodied Gap in Vision-and-Language Navigation: Eine ganzheitliche Studie physischer und visueller Disparitäten 重新思考视觉和语言导航中的 “ 内博差距 “ :关于物理和视觉差异的综合研究 2507.13019v1 -
37 07-17 Teach Old SAEs New Domain Tricks with Boosting Lehren Sie alte SAEs neue Domain Tricks mit Förderung 教授旧的 SAEs 新域圈套 2507.12990v1 -
38 07-17 Learning to Translate Ambiguous Terminology by Preference Optimization on Post-Edits Ambiguous Terminologie von Preference Optimization auf Post-Edits übersetzen lernen 学习如何通过“优先优化”在编辑后采用“优先优化”来翻译模糊的名词 2507.03580v2 -
39 07-17 MRT at IberLEF-2025 PRESTA Task: Maximizing Recovery from Tables with Multiple Steps MRT bei IberLEF-2025 PRESTA Aufgabe: Maximierung der Erholung von Tischen mit mehreren Schritten IberLEF-2025 PRESTA任务:最大限度地从有多个步骤的表格中回收 2507.12981v1 -
40 07-17 UniSLU: Unified Spoken Language Understanding from Heterogeneous Cross-Task Datasets UniSLU: Unified Spoken Language Understanding aus heterogenen Cross-Task-Datensätzen UniSLU:从不同式跨任务数据集获得统一口语语言理解 2507.12951v1 -
41 07-17 Probabilistic Soundness Guarantees in LLM Reasoning Chains Probabilistische Solidität garantiert in LLM-Aufklärungsketten LLM 理赔链条的概率稳妥性保障 2507.12948v1 -
42 07-17 OASIS: Order-Augmented Strategy for Improved Code Search OASIS: Order-Augmented Strategy for Improved Code Search OASIS:改进守则搜索的有秩序加强战略 2503.08161v4 -
43 07-17 Making Language Model a Hierarchical Classifier and Generator Sprachmodell zu einem hierarchischen Klassifikator und Generator machen 使语言模式成为等级分类和生成器 2507.12930v1 -
44 07-17 MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents MEM1: Lernen, Speicher zu synergisieren und für effiziente Long-Horizon-Agenten zu verankern MEM1:学习如何使记忆和理由相互协调,以有效长森剂 2506.15841v2 -
45 07-17 Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning Code2Logic: Game-Code-getriebene Datensynthese zur Verbesserung von VLMs Allgemeine Begründung Code2Logic: 用于增强VLMs一般理由的游戏-代码-驱动数据合成 2505.13886v3 -
46 07-17 IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization IOPO: Verstärkung von LLMs mit komplexer Anleitung über Input-Output Preference Optimization IOPO:通过投入-产出优化,以复杂教学赋予LLMs权力 2411.06208v3 -
47 07-17 On the Limitations of Large Language Models (LLMs): False Attribution Über die Grenzen großer Sprachmodelle (LLMs): Falsche Attribution 对大语言模式限制的限制: 2404.04631v2 -
48 07-17 Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities Gemini 2.5: Das Frontier mit fortschrittlicher Vernunft, Multimodalität, langem Kontext und Agentischen Fähigkeiten der nächsten Generation schieben Gemini 2.5: 推进先进理性、多模式、长处和下一代的前沿 2507.06261v3 -
49 07-17 A Logically Consistent Chain-of-Thought Approach for Stance Detection Ein logisch konsistenter, schlüsselfertiger Ansatz zur Stance-Erkennung 一种逻辑上一致的研究链方法,以探测Stance 2312.16054v2 -
50 07-17 MAC-Tuning: LLM Multi-Compositional Problem Reasoning with Enhanced Knowledge Boundary Awareness MAC-Tuning: Mehrkompositionelles LLM-Problem mit verbesserter Kenntnis der Grenzen des Wissens MAC-指导:LLM 以增进知识边界意识为由的多组问题 2504.21773v2 -
51 07-17 SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems SEALGuard: Mehrsprachige Gespräche in südostasiatischen Sprachen für LLM-Softwaresysteme sichern SEALGuard:为LLM软件系统维护东南亚语言多语言对话 2507.08898v3 -
52 07-17 Are Knowledge and Reference in Multilingual Language Models Cross-Lingually Consistent? Sind Wissen und Referenz in mehrsprachigen Sprachmodellen bereichsübergreifend konsistent? 多语文模式中的知识和参考资料是否相互一致? 2507.12838v1 -
53 07-17 Emotional Support with LLM-based Empathetic Dialogue Generation Emotionale Unterstützung mit LLM-basiertem Empathetic Dialogue Generation 利用基于LLM的 “ 同情对话 “ 生成的LLM “ 情感支持 2507.12820v1 -
54 07-17 Large Language Models’ Internal Perception of Symbolic Music Die innere Wahrnehmung symbolischer Musik durch große Sprachmodelle 大语言模型内部对符号音乐的感知 2507.12808v1 -
55 07-17 MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models MCPEval: Automatische MCP-basierte Deep Evaluation für AI Agent Modelle MCPEval:AI 代理模型的自动MCP深度评估 2507.12806v1 -
56 07-17 PMKLC: Parallel Multi-Knowledge Learning-based Lossless Compression for Large-Scale Genomics Database PMKLC: Parallele Multi-Knowledge Learning-basierte Lossless-Kompression für großformatige Genomics-Datenbank PMKLC: 大型基因组数据库的平行多知识学习-无损失压缩 2507.12805v1 -
57 07-17 ReCode: Updating Code API Knowledge with Reinforcement Learning ReCode: Aktualisierung von Code-API-Kenntnissen mit Verstärkungslernen ReCode:更新法规API知识与强化学习 2506.20495v2 -
58 07-17 MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment MPO: Ein effizientes Post-Processing-Framework zum Mischen unterschiedlicher Präferenzen MPO: 混合多种优惠协调的高效处理后框架 2502.18699v2 -
59 07-17 Learning Robust Negation Text Representations Robuste Negations-Textdarstellungen lernen 学习强力否定文本代表 2507.12782v1 -
60 07-17 A Comprehensive Survey of Electronic Health Record Modeling: From Deep Learning Approaches to Large Language Models Eine umfassende Umfrage zur elektronischen Gesundheitsdatenmodellierung: Von Deep Learning Ansätzen bis hin zu großen Sprachmodellen 《电子健康记录模型综合调查:从深学习方法到大语言模式》 2507.12774v1 -
61 07-17 Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback Critique-GRPO: LLM-Vernunft mit natürlicher Sprache und numerischem Feedback verbessern Critique-GRPO: 提高以自然语言和数字反馈为依据的LLM 2506.03106v4 -
62 07-17 Synergy: End-to-end Concept Model Synergie: Ende-zu-Ende-Konzeptmodell 协同增效:端到端概念模型 2507.12769v1 -
63 07-17 VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents VIDEE: Visuelle und Interaktive Zersetzung, Ausführung und Auswertung von Text Analytics mit intelligenten Agenten VIDEE: 视觉和交互分解、执行和评价与智能剂的文字分析分析 2506.21582v2 -
64 07-17 Logit Arithmetic Elicits Long Reasoning Capabilities Without Training Logit Arithmetische Elizite lange mit Gründen verbundene Fähigkeiten ohne Training 未经培训的逻辑 2507.12759v1 -
65 07-17 Strategy Adaptation in Large Language Model Werewolf Agents Strategieanpassung im großen Sprachmodell Werwolf-Agenten 大语言示范狼人代理物的适应战略 2507.12732v1 -
66 07-17 TransEvalnia: Reasoning-based Evaluation and Ranking of Translations TransEvalnia: Reasoning-based Evaluation und Ranking von Übersetzungen 过年:基于理由的评价和笔译的排名 2507.12724v1 -
67 07-17 FLEXITOKENS: Flexible Tokenization for Evolving Language Models FLEXITOKENS: Flexible Tokenisierung für sich entwickelnde Sprachmodelle FLEXITOKENS: 不断演变的语言模式灵活化 2507.12720v1 -
68 07-17 BEARCUBS: A benchmark for computer-using web agents BEARCUBS: Benchmark für computergestützte Web-Agenten BEARCUBS:计算机使用网络代理器的基准 2503.07919v2 -
69 07-17 Synthesizing Privacy-Preserving Text Data via Finetuning without Finetuning Billion-Scale LLMs Synthesizing Privacy-Preserving Text Data via Finetuning ohne Finetuning Billion-Scale LLMs 通过不作十亿规模的微调微调的微调合成保护隐私文本数据 2503.12347v2 -
70 07-17 GUI Test Migration via Abstraction and Concretization GUI-Test-Migration über Abstraktion und Konkretisierung GUI 通过抽象和简明化测试移民 2409.05028v2 -
71 07-17 Fairness Is Not Enough: Auditing Competence and Intersectional Bias in AI-powered Resume Screening Fairness ist nicht genug: Auditing-Kompetenz und Intersektions-Bias in KI-powered Resume Screening 公平不够充分:审计能力和大赦国际授权的恢复筛选中的跨部门比阿斯 2507.11548v2 -
72 07-17 ActionStudio: A Lightweight Framework for Data and Training of Large Action Models ActionStudio: Ein leichter Rahmen für Daten und Training großer Aktionsmodelle 行动研究:关于大型行动模式的数据和培训的轻量框架 2503.22673v3 -
73 07-17 Chain-of-Thought Prompting Obscures Hallucination Cues in Large Language Models: An Empirical Evaluation Chain-of-Thought Prompting Obscures Halluzination Cues in großen Sprachmodellen: Eine empirische Bewertung 引导大语言模型中传译锥体:经验评价 2506.17088v2 -
74 07-17 AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation AudioJudge: Verstehen, was in der großen Audiomodell basierten Sprachbewertung funktioniert 音频法官:了解大型音频示范演讲评价有什么用 2507.12705v1 -
75 07-17 Exploiting Adaptive Contextual Masking for Aspect-Based Sentiment Analysis Ausnutzung adaptiver Kontextmasken für aspektbasierte Sentiment-Analysen 利用适应性环境掩码进行外观感应力分析 2402.13722v2 -
76 07-17 AdaptiSent: Context-Aware Adaptive Attention for Multimodal Aspect-Based Sentiment Analysis AdaptiSent: Context-Aware Adaptive Aufmerksamkeit für multimodale Aspect-Based-Sentiment-Analysen 适应性:基于多种模式的光谱感应分析的上下文知识适应性关注 2507.12695v1 -
77 07-16 (3) Improving Drug Identification in Overdose Death Surveillance using Large Language Models Verbesserung der Drogenidentifizierung bei der Überwachung von Überdosierungen mit großen Sprachmodellen 利用大语言模式在超剂量死亡监测中改进药物识别工作 2507.12679v1 -
78 07-16 The first open machine translation system for the Chechen language Das erste offene maschinelle Übersetzungssystem für die tschetschenische Sprache 车臣语第一个开放机器翻译系统 2507.12672v1 -
79 07-16 UPCORE: Utility-Preserving Coreset Selection for Balanced Unlearning UPCORE: Nutzenschonende Coreset-Auswahl für ausgewogenes Lernen UPCORE: 平衡退学的核心选择 2502.15082v2 -
80 07-16 A Fuzzy Approach to Project Success: Measuring What Matters Ein fuzzy Ansatz zum Projekt Erfolg: Messen, was zählt 项目成功:衡量重要事项的模糊方法 2507.12653v1 -
81 07-16 A Multi-Stage Framework with Taxonomy-Guided Reasoning for Occupation Classification Using Large Language Models Ein Multi-Stage-Rahmen mit taxonomiegeführter Begründung für die Berufsklassifizierung mit großen Sprachmodellen 使用大语言模式进行职业分类的多标准框架,并有分类法指导理由 2503.12989v2 -
82 07-16 Fine-Tune an SLM or Prompt an LLM? The Case of Generating Low-Code Workflows Fine-Tune ein SLM oder Prompt ein LLM? Der Fall der Erzeugung von Low-Code Workflows 微调可持续土地管理还是迅速提炼一个LLM? 产生低碳工作流程的案例 2505.24189v2 -
83 07-16 Cross-Layer Discrete Concept Discovery for Interpreting Language Models Cross-Layer Discrete Concept Discovery für Interpretationssprachmodelle 解释语言模型的跨语言监听概念发现 2506.20040v2 -
84 07-16 Multi-task retriever fine-tuning for domain-specific and efficient RAG Multi-Task Retriever Feinabstimmung für domänenspezifische und effiziente RAG 多任务检索器微调,用于特定领域和高效率的RAG 2501.04652v2 -
85 07-16 LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization LoRA Done RITE: Robuste Invariante Transformations-Equilibration für LoRA-Optimierung LoRA Done RITE: 优化 LoRA 的强劲的动态转型平衡 2410.20625v2 -
86 07-16 SCULPT: Systematic Tuning of Long Prompts SCULPT: Systematisches Tuning von langen Prompts SCULPT: 长期提示系统图示 2410.20788v3 -
87 07-16 Memorization Inheritance in Sequence-Level Knowledge Distillation for Neural Machine Translation Erinnerungsvererbung in Sequenz-Level-Wissensdestillation für neurale maschinelle Übersetzung 神经机机翻译序列级知识蒸馏中的记忆力继承 2502.01491v2 -
88 07-16 Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models Mono-InternVL-1.5: Auf dem Weg zu günstigeren und schnelleren monolithischen multimodalen großen Sprachmodellen Mono-InternVL-1.5:走向廉价和更快单极多式多语言模式 2507.12566v1 -
89 07-16 What Factors Affect LLMs and RLLMs in Financial Question Answering? Welche Faktoren beeinflussen LLMs und RLLMs bei der Beantwortung finanzieller Fragen? 在回答财务问题时,哪些因素影响到理疗母和理疗母(RLLMs)? 2507.08339v2 -
90 07-16 Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility Ist das nur Fantasie? Sprachmodelldarstellungen spiegeln menschliche Urteile von Ereignissen wider Plausibilität 这只是幻想吗?语言模型代表反映了人类对事件的判断 2507.12553v1 -
91 07-16 Prompt Perturbations Reveal Human-Like Biases in LLM Survey Responses Prompt Störungen Enthüllen Mensch-ähnliche Biasen in LLM Survey Responses LLM调查答复中的即时扰动干扰现象 2507.07188v2 -
92 07-16 Modeling Open-World Cognition as On-Demand Synthesis of Probabilistic Models Modellierung der Open-World-Kognition als On-Demand-Synthese probabilistischer Modelle 将开放世界的认知建模作为概率模型的 “ 现场合成 “ 模型 2507.12547v1 -
93 07-16 Language Models Improve When Pretraining Data Matches Target Tasks Sprachmodelle verbessern, wenn die Vorschulung von Daten zu Zielaufgaben passt 培训前数据匹配目标任务时改进语言模式 2507.12466v1 -
94 07-16 Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training Scaling Up RL: Unlocking Diverse Reasoning in LLMs durch längeres Training 提升RL:通过长期培训解锁LLMs的多样化理由 2507.12507v1 -
95 07-16 TD-EVAL: Revisiting Task-Oriented Dialogue Evaluation by Combining Turn-Level Precision with Dialogue-Level Comparisons TD-EVAL: Überprüfung der aufgabenorientierten Dialogbewertung durch Kombination von Turn-Level-Präzision mit Dialog-Level-Vergleichen TD-EVAL: 重新审议以任务为导向的对话评价,将转折点精确度与对话级别比较相结合 2504.19982v2 -
96 07-16 S2WTM: Spherical Sliced-Wasserstein Autoencoder for Topic Modeling S2WTM: Spherical Sliced-Wasserstein Autoencoder für Themenmodellierung S2WTM: 用于专题建模的球球锯子-Wasserstein自动编码器 2507.12451v1 -
97 07-16 Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models Können wir eine Ausrichtung voraussagen, bevor Modelle das Denken beenden? 我们能否在模型完成思考之前实现预测一致? 2507.12428v1 -
98 07-16 Advancing Retrieval-Augmented Generation for Structured Enterprise and Internal Data Weiterentwicklung der retrieval-generierten Generation für strukturierte Unternehmen und interne Daten 结构化企业和内部数据先进检索-启动生成 2507.12425v1 -
99 07-16 Simple Mechanistic Explanations for Out-Of-Context Reasoning Einfache mechanistische Erklärungen für Out-of-Context Reasoning 外部逻辑理由的简单机械解释 2507.08218v2 -
100 07-16 Probing for Arithmetic Errors in Language Models Probing für Arithmetische Fehler in Sprachmodellen 语言模型中亚学错误的检验 2507.12379v1 -
101 07-16 Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker Entwicklung eines visuellen Augmented Q&A-Systems unter Verwendung eines skalierbaren Vision Embedding Retrieval & Late Interaction Re-ranker 利用可缩放的视野嵌入回收和后期互动重新排行器开发视觉增强的 A 系统 2507.12378v1 -
102 07-16 Web-Browsing LLMs Can Access Social Media Profiles and Infer User Demographics Web-Browsing LLMs können auf Social Media Profile zugreifen und Nutzerdemographien ableiten 可在网上浏览的LLMs 能够获取社会媒体概况和推断用户人口 2507.12372v1 -
103 07-16 Beyond Single Models: Enhancing LLM Detection of Ambiguity in Requests through Debate Jenseits von Einzelmodellen: Verbesserung der LLM-Erkennung von Ambiguität in Anfragen durch Debatte 超越单一模式:通过辩论加强LLM对请求中的模糊性的检测 2507.12370v1 -
104 07-16 Exploring Gender Bias in Alzheimer’s Disease Detection: Insights from Mandarin and Greek Speech Perception Erforschung von Gender-Bias bei Alzheimer-Erkennung: Einblicke aus Mandarin und griechischer Sprachwahrnehmung 探索阿尔茨海默氏病检测中的性别偏见:普通话和希腊言语认知的洞察 2507.12356v1 -
105 07-16 Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs Auf dem Weg zu Agentic RAG mit tiefer Vernunft: Eine Umfrage von RAG-Reasoning-Systemen in LLMs 朝向具有深智力的AGA:对RAG(ARM)中测深系统进行的一项调查 2507.09477v2 -
106 07-16 Planning-Aware Code Infilling via Horizon-Length Prediction Planning-Aware Code Infilling via Horizon-Length Prediction 通过地平线-地球预测填充规划-软件代码 2410.03103v3 -
107 07-16 Nonlinear Concept Erasure: a Density Matching Approach Nichtlineare Konzeptauslöschung: ein Density-Matching-Ansatz 非线性概念时代:密度匹配方法 2507.12341v1 -
108 07-16 From Semantic Web and MAS to Agentic AI: A Unified Narrative of the Web of Agents Von Semantic Web und MAS zu Agentic AI: Ein einheitliches Narrativ des Web of Agents 从语义网站和MAS到AA:关于 “ 代理人网络 “ 的统一说明 2507.10644v2 -
109 07-16 Chain-of-Descriptions: Improving Code LLMs for VHDL Code Generation and Summarization Chain-of-Descriptions: Verbesserung der Code-LLMs für die VHDL-Code-Generierung und Zusammenfassung 描述链:改进《守则》中VHDL代码生成和概述的LLML 2507.12308v1 -
110 07-16 Text-ADBench: Text Anomaly Detection Benchmark based on LLMs Embedding Text-ADBench: Text-Anomaly Detection Benchmark basierend auf LLMs Einbetten 文本 – – 亚银:基于嵌入LLMs的文本异常检测基准 2507.12295v1 -
111 07-16 Linearly-Interpretable Concept Embedding Models for Text Analysis Linear-Interpretable Concept Einbetten von Modellen für die Textanalyse 用于文本分析的线性解释式概念嵌入模型 2406.14335v2 -
112 07-16 Automated Novelty Evaluation of Academic Paper: A Collaborative Approach Integrating Human and Large Language Model Knowledge Automatisierte Neuheitsbewertung des Akademischen Papiers: Ein kollaborativer Ansatz Integrieren von menschlichem und großem Sprachmodellwissen 学术论文自动化新颖评价:结合人文和大语言示范知识的协作方法 2507.11330v2 -
113 07-16 NLP Meets the World: Toward Improving Conversations With the Public About Natural Language Processing Research NLP trifft auf die Welt: Um Gespräche mit der Öffentlichkeit über die natürliche Sprachverarbeitungsforschung zu verbessern NLP 与世界相遇:努力改进与公众关于自然语言处理研究的对话 2507.10559v2 -
114 07-16 Measuring Spiritual Values and Bias of Large Language Models Messen von spirituellen Werten und Bias von großen Sprachmodellen 计量大语言模型的精神价值和偏见 2410.11647v2 -
115 07-16 Infherno: End-to-end Agent-based FHIR Resource Synthesis from Free-form Clinical Notes Infherno: Ende-zu-Ende Agent-basierte FHIR-Ressourcensynthese aus freiformigen klinischen Anmerkungen Infherno: 以端到端代理为基础的FHIR 自由形式临床笔记资源合成 2507.12261v1 -
116 07-16 Translationese-index: Using Likelihood Ratios for Graded and Generalizable Measurement of Translationese Translationese-Index: Verwendung von Likelihood-Verhältnissen für abgestufte und generalisierbare Messung von Translationese 笔译索引:在笔译的分级和通用计量中使用可能性比率 2507.12260v1 -
117 07-16 Hallucination Detox: Sensitivity Dropout (SenD) for Large Language Model Training Halluzination Detox: Sensitivity Dropout (SenD) für großsprachliche Modellschulungen 幻觉脱毒:用于大语言模式培训的感敏性辍学(SenD) 2410.15460v4 -
118 07-16 Improving Contextual ASR via Multi-grained Fusion with Large Language Models Verbesserung der Kontext-ASR durch Multi-Grained Fusion mit großen Sprachmodellen 通过与大语言模式的多语种融合,改善实际的ASR 2507.12252v1 -
119 07-16 FADE: Why Bad Descriptions Happen to Good Features FADE: Warum schlechte Beschreibungen gut aussehen FADE:为什么不良描述发生在好地貌 2502.16994v2 -
120 07-16 Semantic Adapter for Universal Text Embeddings: Diagnosing and Mitigating Negation Blindness to Enhance Universality Semantischer Adapter für universelle Text-Embeddings: Diagnose und Milderung der Negationsblindheit zur Verbesserung der Universalität 通用文本嵌入的语义适应器:诊断和减轻疏漏失盲现象,以增强普遍性 2504.00584v2 -
121 07-16 Truth Sleuth and Trend Bender: AI Agents to fact-check YouTube videos and influence opinions Truth Sleuth and Trend Bender: KI-Agenten überprüfen YouTube-Videos und beeinflussen Meinungen Truth Sleuth and Trend Bender: AI 负责事实检查YouTube视频及影响舆论的代理 2507.10577v2 -
122 07-16 Towards few-shot isolated word reading assessment Auf dem Weg zu wenigen Schuss isoliert Wort Lesung Bewertung 迈向微小的孤立字读数评估 2507.12217v1 -
123 07-16 Toward a Behavioural Translation Style Space: Simulating the Temporal Dynamics of Affect, Behaviour, and Cognition in Human Translation Production Auf dem Weg zu einem Raum für Verhaltensübersetzungen: Simulation der zeitlichen Dynamik von Affekt, Verhalten und Kognition in der menschlichen Übersetzungsproduktion 走向行为翻译风格空间:模拟人翻译生产中影响、行为和认知的时空动态 2507.12208v1 -
124 07-16 Reasoning Strategies in Large Language Models: Can They Follow, Prefer, and Optimize? Reasoning Strategies in Large Language Models: Können sie folgen, bevorzugen und optimieren? 大语言模式中的理由战略:它们能够遵循、优于和优化吗? 2507.11423v2 -
125 07-16 TRIM: Token Reduction and Inference Modeling for Cost-Effective Language Generation TRIM: Token Reduction und Inferenzmodellierung für kosteneffektive Sprachgenerierung TRIM:降低和推论模式,促进成本低效益的语文生成 2412.07682v4 -
126 07-16 RUMAA: Repeat-Aware Unified Music Audio Analysis for Score-Performance Alignment, Transcription, and Mistake Detection RUMAA: Repeat-Aware Unified Music Audio Analyse zur Ausrichtung, Transkription und Fehlererkennung RUMAA: 用于计分业绩协调、追踪和误差探测的重复软件统一音乐音频分析 2507.12175v1 -
127 07-16 Protecting Copyrighted Material with Unique Identifiers in Large Language Model Training Schutz urheberrechtlich geschützter Materialien mit einzigartigen Identifikatoren in großsprachlichen Modellschulungen 在大语言模式培训中以独特标识人保护版权材料 2403.15740v3 -
128 07-16 A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems Eine Übersicht über Grenzen in LLM-Reasoning: Schlussfolgerungen Skalierung, Lernen zur Vernunft und Agentische Systeme LLM 原因:推论增强、学习理性和制剂系统边界调查 2504.09037v2 -
129 07-16 Large Language Models Often Know When They Are Being Evaluated Große Sprachmodelle kennen oft, wenn sie bewertet werden 大语言模型经常知道何时被评估 2505.23836v3 -
130 07-16 Overview of the Sensemaking Task at the ELOQUENT 2025 Lab: LLMs as Teachers, Students and Evaluators Überblick über die Sensemaking-Aufgabe im ELOQUENT 2025 Lab: LLMs als Lehrer, Schüler und Evaluatoren ELOQUENT 2025实验室的决策者任务概述:教师、学生和评价员 2507.12143v1 -
131 07-16 RiemannLoRA: A Unified Riemannian Framework for Ambiguity-Free LoRA Optimization RiemannLoRA: Ein einheitliches Riemann-Rahmenwerk für die ambiguitätsfreie LoRA-Optimierung RiemannLoRA:无模糊无洛拉优化的统一里伊曼框架 2507.12142v1 -
132 07-16 Iterative Augmentation with Summarization Refinement (IASR) Evaluation for Unstructured Survey data Modeling and Analysis Iterative Augmentation mit Summarization Refinement (IASR) Evaluation für unstrukturierte Umfragedaten Modellierung und Analyse 对无结构调查数据建模和分析的抽样改进(IASR)评价 2507.12126v1 -
133 07-16 Learning to Reason at the Frontier of Learnability Vernunft lernen an der Grenze der Lernfähigkeit 学习在可学习的前沿学习理性 2502.12272v4 -
134 07-16 Findings of MEGA: Maths Explanation with LLMs using the Socratic Method for Active Learning Ergebnisse von MEGA: Matheerklärung mit LLMs mit der sokratischen Methode für aktives Lernen MEGA的研究结果:使用Scopic 积极学习方法与LLMs的数学解释 2507.12079v1 -
135 07-16 RAGGED: Towards Informed Design of Scalable and Stable RAG Systems RAGGED: Auf dem Weg zu einem informierten Design von skalierbaren und stabilen RAG-Systemen RAGGED: 实现可缩放和稳定的RAG系统的知情设计 2403.09040v3 -
136 07-16 BOOKCOREF: Coreference Resolution at Book Scale BOOKCOREF: Koreferenzauflösung auf der Buchskala BOOKCOREF: 书缩放时的共引用分辨率 2507.12075v1 -
137 07-16 StylOch at PAN: Gradient-Boosted Trees with Frequency-Based Stylometric Features StylOch bei PAN: Gradient-Boosted Trees mit frequenzbasierten stylometrischen Eigenschaften PAN的StylOch:带以频率为基础的音量特征的梯度-波状树 2507.12064v1 -
138 07-16 Evaluating the Ability of Large Language Models to Reason about Cardinal Directions, Revisited Bewertung der Fähigkeit von großen Sprachmodellen zur Vernunft über Kardinal-Anweisungen, Revisited 评价大语言模式与红红衣主教指示理由相符的能力,重新审查 2507.12059v1 -
139 07-16 ReviewAgents: Bridging the Gap Between Human and AI-Generated Paper Reviews ReviewAgents: Die Kluft zwischen menschlichen und KI-generierten Paper Reviews überbrücken 审查机构:弥合人类与AI - AI - 创创文件审查之间的差距 2503.08506v3 -
140 07-16 Improving Data and Parameter Efficiency of Neural Language Models Using Representation Analysis Verbesserung der Daten- und Parametereffizienz von neuralen Sprachmodellen mittels Darstellungsanalyse 改进使用代表性分析的神经语言模型的数据和参数效率 2507.12004v1 -
141 07-16 Labels Generated by Large Language Models Help Measure People’s Empathy in Vitro Etiketten, die durch große Sprachmodelle erzeugt werden, helfen, die Empathie der Menschen in Vitro zu messen 以大语言模型生成的标签 帮助测量体外民众的共鸣 2501.00691v2 -
142 07-16 DEEPER Insight into Your User: Directed Persona Refinement for Dynamic Persona Modeling DEEPER Einblick in Ihren Anwender: Direkte Persona-Verfeinerung für dynamische Persona-Modellierung DEEPER 对用户的洞察: 动态人造模型的直接人性改进 2502.11078v2 -
143 07-16 Simplifications are Absolutists: How Simplified Language Reduces Word Sense Awareness in LLM-Generated Definitions Vereinfachungen sind Absolutisten: Wie vereinfachte Sprache das Wortsinnbewusstsein in LLM-generierten Definitionen reduziert 简化程序是绝对论者:简化语言如何减少LLM-创用定义中的言语感知 2507.11981v1 -
144 07-16 Value-Based Large Language Model Agent Simulation for Mutual Evaluation of Trust and Interpersonal Closeness Value-Based Large Language Model Agent Simulation zur gegenseitigen Bewertung von Vertrauen und zwischenmenschlicher Nähe 用于相互评价信任和人际亲密的基于价值的大型语言模型模拟剂 2507.11979v1 -
145 07-16 Graph Representations for Reading Comprehension Analysis using Large Language Model and Eye-Tracking Biomarker Graphische Darstellungen für die Leseverständnisanalyse mit Large Language Model und Eye-Tracking Biomarker 使用大语言模型和眼跟踪生物标记的阅读综合分析图示 2507.11972v1 -
146 07-16 Organize the Web: Constructing Domains Enhances Pre-Training Data Curation Organisation des Webs: Aufbau von Domains verbessert die Vorschulung von Daten-Curation 组织网络: 构建域域 增强培训前数据曲线 2502.10341v3 -
147 07-16 CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions CultureCLIP: CLIP mit kulturellem Bewusstsein durch synthetische Bilder und kontextualisierte Captions stärken CultureCLIP: 通过合成图像和情境化说明赋予CLIP文化意识 2507.06210v2 -
148 07-16 Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation Decoder-Hybrid-Decoder-Architektur für effizientes Reasoning mit langer Generierung 用于高效长生成推理的Decoder-Hybrid-Decoder架构 2507.06607v2 -
149 07-16 Toxicity-Aware Few-Shot Prompting for Low-Resource Singlish Translation Toxizitätsbewusstes Few-Shot-Prompting für ressourcenarme Singlish-Übersetzung 面向低资源Singlish翻译的毒性感知少样本提示 2507.11966v1 -
150 07-16 BRIDGE: Bootstrapping Text to Control Time-Series Generation via Multi-Agent Iterative Optimization and Diffusion Modeling BRIDGE: Bootstrapping-Text zur Steuerung der Time-Series-Generation über Multi-Agent iterative Optimierung und Diffusionsmodellierung BRIDGE:通过多代理迭代优化和传播模型化控制时间- 系列生成的推进文本 2503.02445v5 -
151 07-16 Resona: Improving Context Copying in Linear Recurrence Models with Retrieval Resona: Verbesserung der Kontextkopie in linearen Wiederholungsmodellen mit Retrieval Resona: 改进有检索的线性重复模型中环境复制 2503.22913v2 -
152 07-16 PoTPTQ: A Two-step Power-of-Two Post-training for LLMs PoTPTQ: Zweistufige Kraft von zwei Nachschulungen für LLMs PoTPTQ:为LLMs提供两步二级培训后培训 2507.11959v1 -
153 07-16 The benefits of query-based KGQA systems for complex and temporal questions in LLM era Die Vorteile von anfragebasierten KGQA-Systemen für komplexe und zeitliche Fragen im LLM-Zeitalter 基于查询的KGQA系统对LLM时代复杂和时间问题的益处 2507.11954v1 -
154 07-16 IAM: Efficient Inference through Attention Mapping between Different-scale LLMs IAM: Effiziente Inferenz durch Aufmerksamkeitsmapping zwischen unterschiedlich großen LLMs IAM:通过不同规模LLMs之间的注意力映射实现高效推理 2507.11953v1 -
155 07-16 DAC: A Dynamic Attention-aware Approach for Task-Agnostic Prompt Compression DAC: Ein dynamischer, aufmerksamkeitsbewusster Ansatz für die aufgaben-agnostische Promptkompression DAC: 用于任务无关提示压缩的动态注意力感知方法 2507.11942v1 -
156 07-16 BlockBPE: Parallel BPE Tokenization BlockBPE: Parallele BPE-Tokenisierung BlockBPE: 并行BPE分词 2507.11941v1 -
157 07-16 POLYCHARTQA: Benchmarking Large Vision-Language Models with Multilingual Chart Question Answering POLYCHARTQA: Benchmarking großer Vision-Sprache Modelle mit mehrsprachigem Diagramm Frage-Antworten POLYCHARTQA:以多语言图表问题解答为大型愿景-语言模型基准 2507.11939v1 -
158 07-16 A Survey of Deep Learning for Geometry Problem Solving Eine Umfrage über Deep Learning zur Lösung von Geometrieproblemen 解决几何问题深层学习调查 2507.11936v1 -
159 07-16 Generative Emergent Communication: Large Language Model is a Collective World Model Generative Emergent-Kommunikation: Großes Sprachmodell ist ein kollektives Weltmodell 生成新兴通信:大语言模式是集体世界模式 2501.00226v2 -
160 07-16 Learning an Effective Premise Retrieval Model for Efficient Mathematical Formalization Ein effektives Premise Retrieval-Modell für effiziente mathematische Formalisierung lernen 学习有效数学正规化的有效可靠检索模型 2501.13959v3 -
161 07-16 Journalism-Guided Agentic In-Context Learning for News Stance Detection Journalismus-geführtes Agentisches In-Context-Lernen für Nachrichten Stance Detection 为探查新闻流而进行理论指导的 Agentic In-Contle Learning for News Stance 2507.11049v2 -
162 07-16 Marco-Bench-MIF: On Multilingual Instruction-Following Capability of Large Language Models Marco-Bench-MIF: Mehrsprachige Lernfähigkeit von großen Sprachmodellen Marco-Bench-MIF:关于多语种教学 – – 大语言模式的适应能力 2507.11882v1 -
163 07-16 LLMs Encode Harmfulness and Refusal Separately LLMs kodieren Schädlichkeit und Verweigerung getrennt LLMs分别编码有害性与拒绝 2507.11878v1 -
164 07-16 DualReward: A Dynamic Reinforcement Learning Framework for Cloze Tests Distractor Generation DualReward: Ein dynamisches Verstärkungs-Lern-Framework für Cloze-Tests Distraktor-Generierung 双重奖励:用于完形填空测试干扰项生成的动态强化学习框架 2507.11875v1 -
165 07-16 COLA-GEC: A Bidirectional Framework for Enhancing Grammatical Acceptability and Error Correction COLA-GEC: Ein bidirektionales Framework zur Verbesserung der grammatischen Akzeptanz und Fehlerkorrektur COLA-GEC: 增强显性可接受性和误差校正的双向框架 2507.11867v1 -
166 07-16 Cross-Domain Transfer and Few-Shot Learning for Personal Identifiable Information Recognition Cross-Domain-Transfer und Few-Shot-Learning für die Erkennung personenbezogener Informationen 个人身份信息识别的跨域迁移与少样本学习 2507.11862v1 -
167 07-16 METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation METIS: Schnelle, qualitätsbewusste RAG-Systeme mit Konfigurationsanpassung METIS:具有配置适应的快速质量软件RAG系统 2412.10543v2 -
168 07-16 Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential Ihr LLM kennt die Zukunft: Sein Multi-Token-Prognosepotenzial enthüllen 您的LLM 了解未来: 发掘其多功能预测潜力 2507.11851v1 -
169 07-16 ILID: Native Script Language Identification for Indian Languages ILID: Native Script Language Identification für indische Sprachen ILID:印第安人语言的土著脚本语言识别 2507.11832v1 -
170 07-16 Towards Geo-Culturally Grounded LLM Generations Auf dem Weg zu geokulturellen LLM-Generationen 走向地球环基LLM 代 2502.13497v4 -
171 07-16 Miipher-2: A Universal Speech Restoration Model for Million-Hour Scale Data Restoration Miipher-2: Ein universelles Sprachrestaurationsmodell für die Millionen-Stunden-Skala-Datenrestauration Miipher-2:百万小时规模数据恢复普遍语音恢复模式 2505.04457v3 -
172 07-16 Tracing Facts or just Copies? A critical investigation of the Competitions of Mechanisms in Large Language Models Nachvollziehen von Fakten oder nur Kopien? Eine kritische Untersuchung der Wettbewerbe von Mechanismen in großen Sprachmodellen 对大语言模式机制竞争情况的重要调查 2507.11809v1 -
173 07-15 (2) Simulated Language Acquisition in a Biologically Realistic Model of the Brain Simulierter Spracherwerb in einem biologisch realistischen Modell des Gehirns 脑生物现实模型模拟语言学习模拟 2507.11788v1 -
174 07-15 How Well Can Knowledge Edit Methods Edit Perplexing Knowledge? Wie gut kann Wissen Methoden bearbeiten Verwirrendes Wissen bearbeiten? 知识如何编辑方法如何编辑复杂知识? 2406.17253v3 -
175 07-15 Understanding Language Model Circuits through Knowledge Editing Sprachmodell-Schaltungen durch Wissensbearbeitung verstehen 通过知识编辑理解语言模拟电路 2406.17241v4 -
176 07-15 AI Wizards at CheckThat! 2025: Enhancing Transformer-Based Embeddings with Sentiment for Subjectivity Detection in News Articles KI-Assistenten bei CheckThat! 2025: Transformerbasierte Einbettungen mit Gefühl für Subjektivitätserkennung in Nachrichtenartikeln verbessern AI 向导于 CheckThat! 2025:加强基于变压器的嵌入装置,使其更敏感,以便在新闻文章中发现主观性。 2507.11764v1 -
177 07-15 AKReF: An argumentative knowledge representation framework for structured argumentation AKReF: Ein argumentativer Wissensrepräsentationsrahmen für strukturierte Argumentation AKReF: 结构化论证的论证性知识表示框架 2506.00713v3 -
178 07-15 CRABS: A syntactic-semantic pincer strategy for bounding LLM interpretation of Python notebooks CRABS: Eine syntaktisch-semantische Zangenstrategie zur Begrenzung der LLM-Interpretation von Python-Notebooks CRABS: 一种将Python笔记本的LLM 解释捆绑起来的合成-塞氏针刺术策略 2507.11742v1 -
179 07-15 Flexible and Efficient Grammar-Constrained Decoding Flexible und effiziente Grammatik-Kontrainierte Dekodierung 灵活、高效的语法约束解码 2502.05111v2 -
180 07-15 Multi-domain Multilingual Sentiment Analysis in Industry: Predicting Aspect-based Opinion Quadruples Multidomain Multilingual Sentiment Analysis in Industry: Aspektbasierte Meinungsquadruples voraussagen 工业多语言多语种多语种情感分析:预测基于频谱的四大意见 2505.10389v2 -
181 07-15 Spatially Grounded Explanations in Vision Language Models for Document Visual Question Answering Spatially Grounded Erklärungen in Vision Language Models for Document Visual Question Answering 用于文件视觉问题解答的愿景语言模型中的基于空间的解释 2507.12490v1 -
182 07-15 ExpliCIT-QA: Explainable Code-Based Image Table Question Answering ExpliCIT-QA: Erklärbare Code-basierte Bildtabelle Frage-Antworten ExpliCIT-QA:可解释代码图像表问题解答 2507.11694v1 -
183 07-15 MetaLint: Generalizable Idiomatic Code Quality Analysis through Instruction-Following and Easy-to-Hard Generalization MetaLint: Generalisierbare idiomatische Code-Qualitätsanalyse durch instruction-following und einfach-zu-harte Verallgemeinerung MetaLint: 通过执行指示和易于协调的通用化,可通用的单性守则质量分析 2507.11687v1 -
184 07-15 Let’s Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification Lassen Sie uns in zwei Schritten denken: Abmildern Vereinbarung Bias in MLLMs mit selbst-gerundete Verifikation 让我们思考两步:在MLLMs中减少协议与自我核查的偏见 2507.11662v1 -
185 07-15 Partitioner Guided Modal Learning Framework Partitioner Geführtes Modales Lernen-Framework 向导模式学习框架 2507.11661v1 -
186 07-15 Rolling the DICE on Idiomaticity: How LLMs Fail to Grasp Context Rolling the DICE on Idiomaticity: Wie LLMs den Kontext nicht erfassen 推出关于多才多艺的DICE:LLLMS如何失败到撕裂背景 2410.16069v2 -
187 07-15 Cross-lingual Few-shot Learning for Persian Sentiment Analysis with Incremental Adaptation Cross-lingual Few-shot Learning for Persian Sentiment Analysis with Incremental Adaptation 波斯情感分析的跨语言多语种短片学习和增量适应 2507.11634v1 -
188 07-15 Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model Effiziente und direkte Duplex-Modellierung für Speech-to-Speech-Sprachenmodell 语音语音和语音语言模式的高效和直接双重模式 2505.15670v3 -
189 07-15 Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility Jailbreak-Tuning: Modelle effizient lernen Jailbreak-Anfälligkeit 越狱:高效学习越狱模式 2507.11630v1 -
190 07-15 MapIQ: Benchmarking Multimodal Large Language Models for Map Question Answering MapIQ: Benchmarking multimodaler Großsprachenmodelle für Kartenfrageantworten MapIQ:为地图回答问题确定多式大语言模式基准 2507.11625v1 -
191 07-15 LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating LongDocURL: ein umfassender multimodaler langer Dokumenten-Benchmark, der Verständnis, Vernunft und Lokalisierung integriert LongDocURL:综合综合理解、说明理由和定位的综合多式长文件基准 2412.18424v3 -
192 07-15 AirLLM: Diffusion Policy-based Adaptive LoRA for Remote Fine-Tuning of LLM over the Air AirLLM: Diffusion Policy-based Adaptive LoRA for Remote Fine-Tuning of LLM over the Air AirLLM:传播基于政策的适应性LORA,用于远距离微调LLM在空中的LLM 2507.11515v1 -
193 07-15 Real-World Summarization: When Evaluation Reaches Its Limits Real-World-Zusammenfassung: Wenn die Bewertung ihre Grenzen erreicht 现实世界总结:评价达到极限时 2507.11508v1 -
194 07-15 A Mathematical Theory of Discursive Networks Eine mathematische Theorie diskursiver Netzwerke 讨论网络的数学理论 2507.06565v3 -
195 07-15 Conversation Forests: The Key to Fine Tuning Large Language Models for Multi-Turn Medical Conversations is Branching Conversation Forests: Der Schlüssel zur Feinabstimmung großer Sprachmodelle für multi-Turn medizinische Gespräche ist die Verzweigung 对话森林:对多发医学对话的大型语言模型进行精微投资的关键是分流 2507.04099v2 -
196 07-15 ProtocolLLM: RTL Benchmark for SystemVerilog Generation of Communication Protocols ProtocolLLM: RTL Benchmark für SystemVerilog Generierung von Kommunikationsprotokollen 协议LLLM: 系统生成通信协议系统生成的RTL基准 2506.07945v2 -
197 07-15 A Generative Approach to LLM Harmfulness Detection with Special Red Flag Tokens Eine generative Annäherung an LLM Harmfulness Detection mit speziellen roten Flaggen-Tokens 利用特别红旗拳生成LLM 无害性探测法 2502.16366v3 -
198 07-15 Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models Halluzinationsstationen: Auf einigen grundlegenden Einschränkungen von Transformer-basierten Sprachmodellen 幻觉站:关于以变换语言模式的一些基本限制 2507.07505v3 -
199 07-15 Seq vs Seq: An Open Suite of Paired Encoders and Decoders Seq vs Seq: Eine offene Suite aus koppelten Encodern und Decodern Seq vs Seq:一个开放的套件,其中含有子元编码器和代碼器。 2507.11412v1 -
200 07-15 KisMATH: Do LLMs Have Knowledge of Implicit Structures in Mathematical Reasoning? KisMATH: Haben LLMs Kenntnis von impliziten Strukturen im mathematischen Schlussfolgern? KisMATH:LLMs是否了解数学推理中的隐含结构? 2507.11408v1 -
201 07-15 EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes EXAONE 4.0: Vereinheitlichte große Sprachmodelle mit integrierten Non-Reasoning- und Reasoning-Modi EXAONE 4.0:融合非推理与推理模式的统一大语言模型 2507.11407v1 -
202 07-15 DCR: Quantifying Data Contamination in LLMs Evaluation DCR: Quantifizierung von Datenkontamination in LLMs Evaluation DCR: 在LLMS评价中量化数据污染 2507.11405v1 -
203 07-15 Gaussian mixture models as a proxy for interacting language models Gaußsche Mischungsmodelle als Proxy für interagierende Sprachmodelle Gaussian 混合模型作为交互语言模型的替代 2506.00077v3 -
204 07-15 Following the Clues: Experiments on Person Re-ID using Cross-Modal Intelligence Im Anschluss an die Klues: Experimente zur Person Re-ID mit Cross-Modal Intelligence 在Clues之后:利用跨模式情报对个人重新识别进行实验 2507.01504v3 -
205 07-15 Addressing Data Imbalance in Transformer-Based Multi-Label Emotion Detection with Weighted Loss Adressierung von Daten Ungleichgewicht in Transformer-basierte Multi-Label Emotion Erkennung mit Gewichteten Verlusten 解决基于变换器的多标签情感与加权损失检测中的数据不平衡问题 2507.11384v1 -
206 07-15 What is the Best Process Model Representation? A Comparative Analysis for Process Modeling with Large Language Models Was ist die beste Prozessmodelldarstellung? Eine vergleichende Analyse zur Prozessmodellierung mit großen Sprachmodellen ” 最佳程序示范代表 “ 是什么? “ 大语言模式进程模拟比较分析 “ 2507.11356v1 -
207 07-15 Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations Wahrhaftig oder fabriziert? Mit Kausal Attribution zu Mitigate Belohnung Hacken in Erklärungen 真实的还是伪造的? 利用从原因上归结为 贬低奖得奖者在解释中被打包 2504.05294v2 -
208 07-15 Internal Value Alignment in Large Language Models through Controlled Value Vector Activation Interne Wertausrichtung in großen Sprachmodellen durch kontrollierte Wert-Vektor-Aktivierung 通过控制值矢量激活,通过控制值矢量激活,大语言模型的内部价值对齐 2507.11316v1 -
209 07-15 ETT: Expanding the Long Context Understanding Capability of LLMs at Test-Time ETT: Erweiterung des Langzeitkontexts Verständnisfähigkeit von LLMs bei Test-Time ETT:扩大LLMs在试验时的长距离理解能力 2507.06313v2 -
210 07-15 LRCTI: A Large Language Model-Based Framework for Multi-Step Evidence Retrieval and Reasoning in Cyber Threat Intelligence Credibility Verification LRCTI: Ein großsprachiges modellbasiertes Framework für mehrstufige Evidence-Retrieval und Reasoning in Cyber Threat Intelligence Credibility Verifikation LRCTI: 网络威胁情报可靠性核查中多重证据检索和理由依据大语言示范框架 2507.11310v1 -
211 07-15 Dr.Copilot: A Multi-Agent Prompt Optimized Assistant for Improving Patient-Doctor Communication in Romanian Dr.Copilot: Ein Multi-Agent Prompt Optimierter Assistent zur Verbesserung der Patienten-Doktor-Kommunikation auf Rumänisch 副驾驶:罗马尼亚改善病人-医生沟通多代理快速优化助理 2507.11299v1 -
212 07-15 Fine-Grained Chinese Hate Speech Understanding: Span-Level Resources, Coded Term Lexicon, and Enhanced Detection Frameworks Fine-Grained Chinese Hate Speech Understanding: Span-Level-Ressourcen, Coded Term Lexikon und erweiterte Erkennungsrahmen 中华仇恨言论理解:广级资源、规范术语词汇、强化检测框架 2507.11292v1 -
213 07-15 ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge ImpliRet: Benchmarking der Implicit Fact Retrieval Challenge ImpliRet:设定隐含事实检索挑战的基准 2506.14407v2 -
214 07-15 ContextCache: Context-Aware Semantic Cache for Multi-Turn Queries in Large Language Models ContextCache: Kontext-Bewusst Semantischer Cache für Multi-Turn-Abfragen in großen Sprachmodellen 上下文缓存: 用于大语言模式多发查询的背景软件语义缓存 2506.22791v3 -
215 07-15 FMC: Formalization of Natural Language Mathematical Competition Problems FMC: Formalisierung von mathematischen Wettbewerbsproblemen in der Natursprache FMC: 将自然语言数学竞争问题正规化 2507.11275v1 -
216 07-15 KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding KV-Latent: KV-Cache-Reduktion auf Dimensionsebene mit frequenzbewusster Rotary-Positions-Einbettung KV-Latent:用高频感知的扶轮性定位嵌入减少KV缓存 2507.11273v1 -
217 07-15 Block Circulant Adapter for Large Language Models Block Circulant Adapter für große Sprachmodelle 用于大语言模型的块环相适应器 2505.00582v2 -
218 07-15 Shared Global and Local Geometry of Language Model Embeddings Gemeinsame globale und lokale Geometrie von Sprachmodellen 共同的全球和地方语言对地测量 2503.21073v3 -
219 07-15 KAT-V1: Kwai-AutoThink Technical Report KAT-V1: Kwai-AutoThink Technical Report KAT-V1: Kwai-AutoThink 技术报告 2507.08297v2 -
220 07-15 RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism RAG-R1 : Förderung der Such- und Begründungsfähigkeiten von LLMs durch Multi-Query-Parallelismus RAG-R1:通过多种克质平行主义鼓励LLMs的搜索和说明能力 2507.02962v3 -
221 07-15 Sparse Autoencoders Can Capture Language-Specific Concepts Across Diverse Languages Sparse Autoencoder können sprachspezifische Konzepte über verschiedene Sprachen hinweg erfassen 能够捕捉不同语言语言的特定语言概念的简单自定义者 2507.11230v1 -
222 07-15 An Agentic Flow for Finite State Machine Extraction using Prompt Chaining Ein Agentischer Fluss für Finite State Machine Extraction mit Prompt Verkettung 使用快速链条的有限国家机器采掘的代理流动 2507.11222v1 -
223 07-15 On the Effect of Instruction Tuning Loss on Generalization Auf die Auswirkungen der Instruktion Tuning Verlust auf die Verallgemeinerung 指示计票损失对普遍化的影响的影响 2507.07817v2 -
224 07-15 EsBBQ and CaBBQ: The Spanish and Catalan Bias Benchmarks for Question Answering EsBBQ und CaBBQ: Die spanischen und katalanischen Bias Benchmarks zur Beantwortung von Fragen EsBBQ和CABBQ:西班牙和加泰罗尼亚的回答问题基准 2507.11216v1 -
225 07-15 Stylometry recognizes human and LLM-generated texts in short samples Stylometrie erkennt menschliche und LLM-generierte Texte in kurzen Proben 文体计量学可在短样本中识别人类与LLM生成的文本 2507.00838v2 -
226 07-15 SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users SocioVerse: Ein Weltmodell für soziale Simulation Powered by LLM Agents und ein Pool von 10 Millionen Real-World-Nutzern 社会之声:由LLM代理和1000万现实世界用户组成的人才库推动的社会模拟世界模式 2504.10157v3 -
227 07-15 Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding Temperatur und Persona Shape LLM Agent Konsens mit minimaler Genauigkeit gewinnt in qualitativer Coding 高温和人文形状 LLM 代理人共识,在定性编码中取得最低准确性收益 2507.11198v1 -
228 07-15 Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams Text zum Modell via SysML: Automatisierte Generierung dynamischer Systemrechnermodelle aus unstrukturiertem Naturtext über verbesserte Systemmodellierung Sprachdiagramme 通过 SysML 自动生成动态系统计算模型,通过强化系统模拟图,从未结构化的自然语言文本生成动态系统计算模型 2507.06803v2 -
229 07-15 Compression Hacking: A Supplementary Perspective on Informatics Properties of Language Models from Geometric Distortion Kompression Hacking: Eine zusätzliche Perspektive auf Informatik-Eigenschaften von Sprachmodellen aus geometrischer Verzerrung 压缩包装:几何扭曲对语言模型信息学属性的补充观点 2505.17793v2 -
230 07-15 SECURE: Semantics-aware Embodied Conversation under Unawareness for Lifelong Robot Learning SECURE: Semantik-bewusst Verkörpertes Gespräch unter Unwahrnehmung für lebenslanges Robot Learning SECURE: 终身机器人学习意识不足的语义学意识内嵌入式对话 2409.17755v3 -
231 07-15 FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning FalseReject: Eine Ressource zur Verbesserung der kontextuellen Sicherheit und zur Abmilderung von Überwiderständen in LLMs durch strukturierte Vernunft 假反射:一种资源,用于通过结构化理由改进环境安全和减轻LLMs的过度拒绝 2505.08054v2 -
232 07-15 Is Compression Really Linear with Code Intelligence? Ist Kompression wirklich linear mit Code Intelligence? 压缩真的有代码情报线条吗? 2505.11441v4 -
233 07-15 Style over Substance: Distilled Language Models Reason Via Stylistic Replication Stil über Substanz: Destillierte Sprachmodelle schlussfolgern über stilistische Replikation 风格重于实质:蒸馏语言模型通过风格复制进行推理 2504.01738v3 -
234 07-15 What Should LLMs Forget? Quantifying Personal Data in LLMs for Right-to-Be-Forgotten Requests Was sollten LLMs vergessen? Quantifizierung personenbezogener Daten in LLMs für Anfragen zum Recht auf Vergessenwerden LLMs应该忘记什么?量化LLMs中的个人数据以应对被遗忘权请求 2507.11128v1 -
235 07-15 Plancraft: an evaluation dataset for planning with LLM agents Plancraft: ein Auswertungsdatensatz für die Planung mit LLM-Agenten 规划:用于与LLM代理商规划的评价数据集 2412.21033v2 -
236 07-15 Evaluating Multimodal Large Language Models on Educational Textbook Question Answering Bewertung multimodaler großer Sprachmodelle auf pädagogischer Lehrbuchfragebeantwortung 评价教育教科书问题解答多式大语言多语言模式 2506.21596v2 -
237 07-15 MSA at ImageCLEF 2025 Multimodal Reasoning: Multilingual Multimodal Reasoning With Ensemble Vision Language Models MSA bei ImageCLEF 2025 Multimodale Reasoning: Multilingual Multimodale Reasoning mit Ensemble Vision Language Models 2025年多模式理由:多语言多语言多语言多语种理由,包含多种愿景语言模式 2507.11114v1 -
238 07-15 Multi-Trigger Poisoning Amplifies Backdoor Vulnerabilities in LLMs Multi-Trigger-Vergiftung verstärkt Sicherheitslücken in LLMs 多触发中毒行为放大了LLM 的后门脆弱性 2507.11112v1 -
239 07-15 Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs Sprachenübergreifendes Reisen: Benchmarking Cross-Lingual Consistency in multimodalen LLMs 跨语言旅行:多模式LLM中跨语言一致基准 2505.15075v3 -
240 07-15 The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs Der Teufel hinter der Maske: Eine emergente Sicherheitsanfälligkeit von Diffusion LLMs 面具背后的魔鬼:扩散液晶体的突发性安全脆弱性 2507.11097v1 -
241 07-15 Beyond Traditional Algorithms: Leveraging LLMs for Accurate Cross-Border Entity Identification Über traditionelle Algorithmen hinaus: LLMs für eine genaue Cross-Border-Entity-Identifikation nutzen 超越传统算法:利用LMLMs进行准确的跨界实体识别 2507.11086v1 -
242 07-15 Social Media Sentiments Analysis on the July Revolution in Bangladesh: A Hybrid Transformer Based Machine Learning Approach Social Media Sentiments Analyse der Julirevolution in Bangladesch: Ein hybrider Transformer basierter Machine Learning-Ansatz 对孟加拉国七月革命的社会媒体感知分析:混合变换机学习方法 2507.11084v1 -
243 07-15 Voting or Consensus? Decision-Making in Multi-Agent Debate Abstimmung oder Konsens? Entscheidungsfindung in Multi-Agent-Debatte 表决还是协商一致?多机构辩论中的决策 2502.19130v3 -
244 07-15 Comply: Learning Sentences with Complex Weights inspired by Fruit Fly Olfaction Comply: Lernen von Sätzen mit komplexen Gewichten inspiriert von Fruit Fly Olfaction 遵守:受果蝇运动启发的具有复杂重力的学习判决 2502.01706v3 -
245 07-15 DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures DocPolarBERT: Ein vortrainiertes Modell zum Dokumentverständnis mit relativer Polarkoordinate Kodierung von Layoutstrukturen DocPolarBERT:一个预先培训的文件理解模式,其布局结构的相对极地协调编码 2507.08606v2 -
246 07-15 DRAGON: Dynamic RAG Benchmark On News DRAGON: Dynamischer RAG-Benchmark auf Neuigkeiten DRAGON:动态RAG新闻基准 2507.05713v2 -
247 07-15 LLM-Augmented Symptom Analysis for Cardiovascular Disease Risk Prediction: A Clinical NLP LLM-Augmented Symptom Analysis for Cardiovascular Disease Risk Prediction: A Clinical NLP 关于心血管疾病风险预测的LLM强化症状分析:临床国家实验室方案 2507.11052v1 -
248 07-15 Understanding the Dark Side of LLMs’ Intrinsic Self-Correction Die dunkle Seite der Intrinsischen Selbstkorrektion der LLMs verstehen 了解LLLMs’ Intrinsic 自我校正的黑暗面 2412.14959v2 -
249 07-15 ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification ReVISE: Verfeinern lernen zur Testzeit durch Intrinsische Selbstverifizierung REVISE:通过内在自我核查学习在试验时进行精炼 2502.14565v2 -
250 07-15 First-Order Error Matters: Accurate Compensation for Quantized Large Language Models First-Order Error Matters: Genaue Kompensation für quantisierte große Sprachmodelle 一阶误差很重要:量化大语言模型的准确补偿 2507.11017v1 -
251 07-15 REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once REST: Stress-Testing von Modellen mit großer Vernunft, indem man mehrere Probleme auf einmal fragt REST: 立即询问多个问题,以压力测试大型理由模型 2507.10541v2 -
252 07-15 Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging Grund zur Vision: Wahrnehmung und Vernunft durch Modellverschmelzen verstehen 实现愿景:通过模式合并理解观念和理由 2505.05464v2 -
253 07-15 Team HUMANE at AVeriTeC 2025: HerO 2 for Efficient Fact Verification Team HUMANE auf der AVeriTeC 2025: HerO 2 für effiziente Faktenverifizierung HUMANE团队参加AVeriTeC 2025:用于高效事实核查的HerO 2 2507.11004v1 -
254 07-15 Mario at EXIST 2025: A Simple Gateway to Effective Multilingual Sexism Detection Mario bei EXIST 2025: Ein einfaches Tor zu einer effektiven Mehrsprachigkeitserkennung Mario at EXIST 2025: 有效多语言性别调查的简单通道 2507.10996v1 -
255 07-15 Leveraging Large Language Models for Multi-Class and Multi-Label Detection of Drug Use and Overdose Symptoms on Social Media Nutzung großer Sprachmodelle für Multi-Klassen- und Multi-Label-Erkennung von Drogenkonsum und Überdosissymptome in sozialen Medien 在社会媒体上利用多种类别和多标签检测吸毒和吸毒过量症状的大型语言模型 2504.12355v3 -
256 07-15 Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback Online-Intrinsische Belohnungen für Entscheidungsträger aus großen Sprachmodellen Feedback 来自大语言模式反馈的决策者在线内部奖励 2410.23022v3 -
257 07-15 BMDetect: A Multimodal Deep Learning Framework for Comprehensive Biomedical Misconduct Detection BMDetect: Ein multimodales Deep Learning Framework für eine umfassende biomedizinische Fehlverhaltenserkennung BMDetect:综合生物医学不当行为检测的多式深层学习框架 2505.05763v2 -
258 07-15 Teach Me Sign: Stepwise Prompting LLM for Sign Language Production Lehre mich Zeichen: Schrittweise LLM für Zeichensprache Produktion 教育我 签名: 一步步提示手语制作LLMLM 2507.10972v1 -
259 07-15 Is Training Data Quality or Quantity More Impactful to Small Language Model Performance? Ist Training Daten Qualität oder Quantität Impactful to Small Language Model Performance? 培训数据质量或数量是否对小型语言模范业绩更有影响? 2411.15821v4 -
260 07-15 DS@GT at eRisk 2025: From prompts to predictions, benchmarking early depression detection with conversational agent based assessments and temporal attention models DS@GT bei eRisk 2025: Von Aufforderungen zu Vorhersagen, Benchmarking der Frühdepressionserkennung mit gesprächsagentenbasierten Bewertungen und zeitlichen Aufmerksamkeitsmodellen DS@GT在eRisk eRisk 2025:从提示到预测,将早期抑郁症检测与基于谈话剂的评估和时间关注模型作为基准 2507.10958v1 -
261 07-15 Modeling Understanding of Story-Based Analogies Using Large Language Models Modellierung des Verständnisses von geschichtebasierten Analogien mit großen Sprachmodellen 使用大语言模式模拟模式模拟理解 2507.10957v1 -
262 07-15 Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models Prompt4Trust: Ein Verstärkungs-Learning Prompt Augmentation Framework für klinisch ausgerichtete Vertrauenskalibrierung in multimodalen großen Sprachmodellen 提示4信任:在多式大语言模式中加强学习学习,促进临床一致信心校正的快速增强框架 2507.09279v2 -
263 07-15 Representation Bending for Large Language Model Safety Darstellungsbiegen für große Sprachmodellsicherheit 大语文示范语文安全示范语文代表名单 2504.01550v3 -
264 07-15 The GPT Surprise: Offering Large Language Model Chat in a Massive Coding Class Reduced Engagement but Increased Adopters Exam Performances Die GPT-Überraschung: Großes Sprachmodell-Chat in einer massiven Coding-Klasse bietet reduziertes Engagement, aber erhöhte Adopter-Prüfungsleistungen GPT 惊喜:在大规模编码级减少参与中提供大语言示范聊天,但采用者考试成绩提高 2407.09975v2 -
265 07-15 HanjaBridge: Resolving Semantic Ambiguity in Korean LLMs via Hanja-Augmented Pre-Training HanjaBridge: Lösung semantischer Ambiguität in koreanischen LLMs über Hanja-Augmented Pre-Training HanjaBridge:通过Hanja增强的培训前培训,解决韩国LLMLM中的语义模糊问题 2507.10920v1 -
266 07-15 Fine-grained Stateful Knowledge Exploration: Effective and Efficient Graph Retrieval with Large Language Models Feinkörnige Stateful Knowledge Exploration: Effektive und effiziente Graph Retrieval mit großen Sprachmodellen 精巧的、有国称的先进知识探索:具有大语言模型的高效率、高效益的图表检索 2401.13444v4 -
267 07-15 How Stylistic Similarity Shapes Preferences in Dialogue Dataset with User and Third Party Evaluations Wie stylistische Ähnlichkeiten im Dialogdatensatz mit Nutzer- und Drittanbieter-Bewertungen Vorlieben gestaltet 在与用户和第三方评价的对话数据集中如何偏向于 与用户和第三方评价的对话 2507.10918v1 -
268 07-15 LiLM-RDB-SFC: Lightweight Language Model with Relational Database-Guided DRL for Optimized SFC Provisioning LiLM-RDB-SFC: Leichtes Sprachmodell mit relationaler Datenbank-geführter DRL für optimierte SFC-Provisionierung LILM-RDB-SFC:为优化SFC供应而与关系数据库-指导DRL 优化SFC供应的轻量语言模型 2507.10903v1 -
269 07-15 Multimodal Sentiment Analysis on CMU-MOSEI Dataset using Transformer-based Models Multimodale Sentiment-Analyse auf CMU-MOSEI-Datensatz mit Transformer-basierten Modellen 利用基于变压器的模型对CMU-MOSEI数据集的多式感应分析 2505.06110v2 -
270 07-15 NavComposer: Composing Language Instructions for Navigation Trajectories through Action-Scene-Object Modularization NavComposer: Komponieren von Sprachanweisungen für Navigations-Trajektorien durch Modularisierung von Action-Scene-Objekten 导航元件: 通过 Action-Scene-Object 模块化组合导航轨迹的语言指导 2507.10894v1 -
271 07-15 ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning ZebraLogic: Auf den Skalierungsgrenzen von LLMs für logische Vernunft ZebraLogic:关于逻辑理由解释的LLMs限制限度 2502.01100v2 -
272 07-15 Domain-Adaptive Small Language Models for Structured Tax Code Prediction Domain-Adaptive kleine Sprachmodelle für strukturierte Steuervorhersage 结构化税法预测结构化税法 2507.10880v1 -
273 07-15 GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment GenARM: Reward-geführte Generation mit autoregressivem Reward-Modell für Testzeitausrichtung GENARM: 具有自动递减奖益模型的奖赏制向导生成(测试时间对齐自动递减奖模型) 2410.08193v5 -
274 07-15 Jan-nano Technical Report Jan-nano Technischer Bericht Jan-nano技术报告 2506.22760v2 -
275 07-15 A quantum semantic framework for natural language processing Ein quantensemantischer Rahmen für die natürliche Sprachverarbeitung 自然语言处理的量子语义框架 2506.10077v2 -
276 07-14 (1) WhisperKit: On-device Real-time ASR with Billion-Scale Transformers WhisperKit: On-Device Echtzeit-ASR mit Milliarden-Scale-Transformatoren WhisperKit:使用十亿级规模Transformer的设备端实时ASR 2507.10860v1 -
277 07-14 MultiVox: Benchmarking Voice Assistants for Multimodal Interactions MultiVox: Benchmarking-Sprachassistenten für multimodale Interaktionen MultiVox:多模式互动基准语音助理 2507.10859v1 -
278 07-14 LLMs on Trial: Evaluating Judicial Fairness for Large Language Models LLMs on Trial: Bewertung der richterlichen Fairness großer Sprachmodelle 受审的LLMs:评价大语言模型的司法公平性 2507.10852v1 -
279 07-14 Deep Binding of Language Model Virtual Personas: a Study on Approximating Political Partisan Misperceptions Deep Binding of Language Model Virtual Personas: eine Studie über die Annäherung der politischen Partisanen-Misswahrnehmungen 语言模拟虚拟人:关于政治党派近似误解的研究 2504.11673v4 -
280 07-14 AIDE: Attribute-Guided MultI-Hop Data Expansion for Data Scarcity in Task-Specific Fine-tuning AIDE: Attributegeführte MultI-Hop-Datenerweiterung für Datenknappheit bei der aufgabenspezifischen Feinabstimmung AIDE: 用于特定任务微调中数据缺乏程度的属性引导MultI-Hop数据扩展 2412.06136v2 -
281 07-14 Supporting SENĆOTEN Language Documentation Efforts with Automatic Speech Recognition Unterstützung von SENĆOTEN Sprachdokumentation Bemühungen mit automatischer Spracherkennung 利用自动语音识别支持SENĆOTEN语言文献记录工作 2507.10827v1 -
282 07-14 Testing Hypotheses from the Social Approval Theory of Online Hate: An Analysis of 110 Million Posts from Parler Testen von Hypothesen aus der Sozialen Zulassungstheorie des Online-Hasses: Eine Analyse von 110 Millionen Beiträgen von Parler 社会批准网上仇恨理论的测试假设:分析来自Parler的1.1亿个职位 2507.10810v1 -
283 07-14 Automated Thematic Analyses Using LLMs: Xylazine Wound Management Social Media Chatter Use Case Automatisierte thematische Analysen mit LLMs: Xylazine Wound Management Social Media Chatter Use Case 利用LLMM:Xylazine 创伤管理社会媒体聊天器使用案件自动专题分析 2507.10803v1 -
284 07-14 Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers Können multimodale Stiftungsmodelle schematische Diagramme verstehen? Eine empirische Studie zum Informationssuchenden QA über wissenschaftliche Arbeiten 多模式基金会模型能够理解示相图吗? 信息搜索质量评估经验研究,而不是科学论文 2507.10787v1 -
285 07-14 Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools Agentische Reasoning: Ein gestrafftes Framework zur Verbesserung der LLM-Reasoning mit Agentischen Tools 说明理由:加强使用说明工具的LLM理由的精简框架 2502.04644v2 -
286 07-14 Theory of Mind and Self-Disclosure to CUIs Theorie des Geistes und Selbst-Offenbarung zu CUIs CUI精神和自我披露理论 2507.10773v1 -
287 07-14 Applying Text Embedding Models for Efficient Analysis in Labeled Property Graphs Anwendung von Text-Embedding-Modellen für effiziente Analyse in beschrifteten Property Graphen 标签属性图中高效分析应用文本嵌入模型 2507.10772v1 -
288 07-14 Language Models for Adult Service Website Text Analysis Sprachmodelle für Erwachsene Service Website Textanalyse 成人服务语言模式网站文本分析 2507.10743v1 -
289 07-14 GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons GDC Cohort Copilot: Ein KI-Copilot für die Kuratierung von Kohorten aus den Genomic Data Commons GDC Cohort Cohort 副驾驶:AI 基因组数据共同点的Curate Choorts联合驾驶员 2507.02221v2 -
290 07-14 DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving DroidSpeak: KV Cache Sharing für Cross-LLM Kommunikation und Multi-LLM Serving DroidSpeak: KV 共享缓存, 用于跨 LLM 通信和多 LLM 服务 2411.02820v4 -
291 07-14 EmbRACE-3K: Embodied Reasoning and Action in Complex Environments EmbRACE-3K: Verkörpertes Schlussfolgern und Handeln in komplexen Umgebungen EmbRACE-3K:复杂环境中的具身推理与行动 2507.10548v1 -
292 07-14 CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks CodeJudgeBench: Benchmarking von LLM-as-a-Judge für Codierungsaufgaben 标准法官:为编码任务确定LLM-as-a法官基准 2507.10535v1 -
293 07-14 Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination Begründung oder Erinnerung? Unzuverlässige Ergebnisse des Verstärkungslernens aufgrund von Datenkontamination 理由或记忆化?由于数据污染而加强学习的不可靠结果 2507.10532v1 -
294 07-14 Expert-level validation of AI-generated medical text with scalable language models Validierung von KI-generierten medizinischen Texten auf Expertenebene mit skalierbaren Sprachmodellen 专家一级对AI产生的带有可缩放语言模型的可缩放语言模型的医学文本进行鉴定 2507.03152v2 -
295 07-14 Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation Mixture-of-Recursions: Dynamische Rekursive Tiefen für adaptive Token-Level-Computation lernen 混合流流流:学习适应调控级计算法的动态回流深度 2507.10524v1 -
296 07-14 DeepResearch$^{\text{Eco}}$: A Recursive Agentic Workflow for Complex Scientific Question Answering in Ecology DeepResearch$^{\text{Eco}}$: Ein rekursiver Agentischer Workflow für komplexe wissenschaftliche Fragen in der Ökologie DeepResearch$^{\text{Eco}}$:面向生态学复杂科学问答的递归式智能体工作流 2507.10522v1 -
297 07-14 Can You Detect the Difference? Kannst du den Unterschied erkennen? 你能发现差异吗? 2507.10475v1 -
298 07-14 MLAR: Multi-layer Large Language Model-based Robotic Process Automation Applicant Tracking MLAR: Mehrschichtige großsprachige modellbasierte Roboterprozessautomatisierung Bewerberverfolgung MLAR:基于多层大语言模型的机器人流程自动化申请人跟踪 2507.10472v1 -
299 07-14 From BERT to Qwen: Hate Detection across architectures Von BERT bis Qwen: Hasserkennung über Architekturen hinweg 从BERT到Qwen:跨架构的仇恨检测 2507.10468v1 -
300 07-14 Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction Rollen Sie die Würfel & Blick, bevor Sie springen: Gehen über die kreativen Grenzen der Next-Token-Vorhersage 跳跃前的骰子滚动和看一看:超越了次声预测的创造性极限 2504.15266v3 -
301 07-14 Referential ambiguity and clarification requests: comparing human and LLM behaviour referenzielle Mehrdeutigkeit und Klärungswünsche: Vergleich des menschlichen und des LLM-Verhaltens 参考文献的模糊性和澄清要求:比较人的行为和LLM行为 2507.10445v1 -
302 07-14 A Code Comprehension Benchmark for Large Language Models for Code Ein Code-Verständnis-Benchmark für große Sprachmodelle für Code 《守则》大语言模式的《守则》理解基准 2507.10641v1 -
303 07-14 Multiple Choice Learning of Low Rank Adapters for Language Modeling Multiple Choice-Lernen von Low-Rank-Adaptern für die Sprachmodellierung 低级别语言建模适应者多选择学习 2507.10419v1 -
304 07-14 Beyond classical and contemporary models: a transformative AI framework for student dropout prediction in distance learning using RAG, Prompt engineering, and Cross-modal fusion Über klassische und zeitgenössische Modelle hinaus: ein transformatives KI-Framework für die Studienabbrechervorhersage im Fernunterricht mittels RAG, Prompt Engineering und Cross-modal Fusion 超越古典和当代模式:利用RAG、快速工程和跨模式融合进行远程学习中学生辍学预测的变革性AI框架 2507.05285v2 -
305 07-14 Text-to-Remote-Sensing-Image Retrieval beyond RGB Sources Text-zu-Remote-Sensing-Image Retrieval jenseits von RGB-Quellen RGB 来源以外的文字到远程传感器图像检索 2507.10403v1 -
306 07-14 Devanagari Handwritten Character Recognition using Convolutional Neural Network Devanagari Handgeschriebene Zeichenerkennung unter Verwendung von Convolutional Neural Network Devanagari 利用革命神经网络手写字符识别 2507.10398v1 -
307 07-14 EVOLvE: Evaluating and Optimizing LLMs For In-Context Exploration EVOLvE: Bewertung und Optimierung von LLMs für In-Context Exploration EVOLvE: 评估和优化用于内衣探索的LMs LMs 2410.06238v2 -
308 07-14 HKGAI-V1: Towards Regional Sovereign Large Language Model for Hong Kong HKGAI-V1: Auf dem Weg zu einem regionalen Souveränen Großsprachenmodell für Hongkong HKGAI-V1:为香港建立区域主权大语言模式 2507.11502v1 -
309 07-14 Meanings are like Onions: a Layered Approach to Metaphor Processing Bedeutungen sind wie Zwiebeln: ein geschichteter Ansatz zur Metaphorverarbeitung 意思是像洋葱:对同义词处理的多层方法 2507.10354v1 -
310 07-14 Using AI to replicate human experimental results: a motion study Verwendung von KI, um menschliche experimentelle Ergebnisse zu replizieren: eine Bewegungsstudie 利用大赦国际复制人类实验结果:一项运动研究 2507.10342v1 -
311 07-14 Bridging Robustness and Generalization Against Word Substitution Attacks in NLP via the Growth Bound Matrix Approach Überbrückung von Robustheit und Verallgemeinerung gegen Wortersatzangriffe in NLP über den Ansatz der Wachstumsbound Matrix 通过 “ 增长组合矩阵方法 “ ,在NLP中架起桥梁,反对用词替代袭击的有力性和普遍性 2507.10330v1 -
312 07-14 Grammar-Guided Evolutionary Search for Discrete Prompt Optimisation Grammatik-geführte evolutionäre Suche nach diskreter Prompt-Optimierung 语法引导进化搜索 2507.10326v1 -
313 07-14 LEXam: Benchmarking Legal Reasoning on 340 Law Exams LEXam: Benchmarking der rechtlichen Begründung von 340 Rechtsprüfungen LEXam:340项法律考试的法律依据基准 2505.12864v3 -
314 07-14 FaceLLM: A Multimodal Large Language Model for Face Understanding FaceLLM: Ein multimodales, großes Sprachmodell für das Verständnis von Gesichtern FaceLLM: 面对面理解多式大语言模式 2507.10300v1 -
315 07-14 Bias Beyond English: Evaluating Social Bias and Debiasing Methods in a Low-Resource Setting Bias Beyond English: Social Bias und Debiasing Methoden in einem Low-Resource Setting bewerten 英文之后的偏见:在低资源环境下评估社会偏见和偏见方法 2504.11183v2 -
316 07-14 B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability B-cos LM: Effiziente Transformation von vortrainierten Sprachmodellen für verbesserte Erklärbarkeit B-cos LM:高效转换培训前语文模式,改进可解释性 2502.12992v2 -
317 07-14 The distribution of syntactic dependency distances Die Verteilung der syntaktischen Abhängigkeitsabstände 共同依赖距离分布 2211.14620v3 -
318 07-14 Absher: A Benchmark for Evaluating Large Language Models Understanding of Saudi Dialects Absher: Ein Benchmark für die Bewertung großer Sprachmodelle zum Verständnis saudischer Dialekte Absher:评估沙特方言大语言模型理解基准 2507.10216v1 -
319 07-14 Natural Language-based Assessment of L2 Oral Proficiency using LLMs Natürliche Sprachgestützte Beurteilung der oralen Sprachkenntnisse von L2 unter Verwendung von LLMs 利用LLMM 进行L2口腔熟练程度自然语言评估 2507.10200v1 -
320 07-14 Trinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of Large Language Models Trinity-RFT: Ein allgemein angelegtes und einheitliches Rahmenwerk zur Verstärkung der Feinsteuerung großer Sprachmodelle 三一-RFT:加强大语言模式精美应用的一般目的和统一框架 2505.17826v2 -
321 07-14 Mechanistic Indicators of Understanding in Large Language Models Mechanistische Indikatoren des Verstehens in großen Sprachmodellen 大语言模型中理解力的机械指标 2507.08017v2 -
322 07-14 Abusive text transformation using LLMs Missbräuchliche Texttransformation mit LLMs 使用LLMM 的恶劣文本转换 2507.10177v1 -
323 07-14 Task-Based Flexible Feature Distillation for LLMs Aufgabenbasierte flexible Feature-Destillation für LLMs 用于LLMM 的基于任务灵活地物蒸馏 2507.10155v1 -
324 07-14 A Noise-Robust Turn-Taking System for Real-World Dialogue Robots: A Field Experiment Ein Lärm-Robust Turn-Taking-System für Real-World Dialogue Robots: Ein Feldexperiment 实时世界对话机器人一个噪音-Robust 转录系统:一个实地实验 2503.06241v2 -
325 07-14 Barriers in Integrating Medical Visual Question Answering into Radiology Workflows: A Scoping Review and Clinicians’ Insights Barrieren bei der Integration medizinischer visueller Fragenbeantwortung in die Radiologie Workflows: Ein Scoping Review und Einblicke von Klinikern 将医疗视觉问题答案纳入放射工作流的障碍:范围审查和临床医生的洞察 2507.08036v2 -
326 07-14 DiaTool-DPO: Multi-Turn Direct Preference Optimization for Tool-Augmented Large Language Models DiaTool-DPO: Multi-Turn Direct Preference Optimierung für Tool-Augmented Large Language Models DiaTool-DPO:多发直接首选优化工具增强型大语言模型 2504.02882v2 -
327 07-14 Fusing Large Language Models with Temporal Transformers for Time Series Forecasting Große Sprachmodelle mit Zeittransformatoren für die Zeitreihenvorhersage 用时间序列预测时空变换器使用大型语言模型 2507.10098v1 -
328 07-14 A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications Eine umfassende Übersicht über die direkte Präferenzoptimierung: Datensätze, Theorien, Varianten und Anwendungen 直接优先优化综合调查:数据集、理论、变式和应用 2410.15595v3 -
329 07-14 Structuring Radiology Reports: Challenging LLMs with Lightweight Models Structuring Radiology Reports: Herausfordernde LLMs mit Leichtbaumodellen 结构化放射学报告:用轻量级模型对LMS提出挑战 2506.00200v2 -
330 07-14 Enhancing Chain-of-Thought Reasoning with Critical Representation Fine-tuning Verbesserung der Kette der nachdenklichen Vernunft mit kritischer Darstellung Feinabstimmung 强化研究链,理由与关键代表的微调 2507.10085v1 -
331 07-14 Cultural Bias in Large Language Models: Evaluating AI Agents through Moral Questionnaires Kulturelle Bias in großen Sprachmodellen: Bewertung von KI-Agenten durch moralische Fragebögen 大语言模式中的文化偏见:通过道德问卷评估AI代理 2507.10073v1 -
332 07-14 GeLaCo: An Evolutionary Approach to Layer Compression GeLaCo: Ein evolutionärer Ansatz zur Schichtkompression GeLaCo: 层压缩的进化方法 2507.10059v1 -
333 07-14 PRISM: Fine-Grained Paper-to-Paper Retrieval with Multi-Aspect-Aware Query Optimization PRISM: Feinkörniges Papier-zu-Papier-Retrieval mit Multi-Aspect-Aware-Abfrageoptimierung PRISM: 配有多频谱软件查询优化的精细读纸到纸检索器 2507.10057v1 -
334 07-14 Political Bias in LLMs: Unaligned Moral Values in Agent-centric Simulations Politische Bias in LLMs: Ungebundene Moralwerte in Agent-zentrierten Simulationen LLM中的政治偏见:代理中心模拟中的不结盟道德价值 2408.11415v2 -
335 07-14 IPAD: Inverse Prompt for AI Detection – A Robust and Explainable LLM-Generated Text Detector IPAD: Inverse Aufforderung zur KI-Erkennung – ein robuster und erklärbarer LLM-generierter Textdetektor IPAD: AI 检测反光提示 – – 强力和可解释的LLM-发光文本检测器 2502.15902v2 -
336 07-14 Automating SPARQL Query Translations between DBpedia and Wikidata Automatisieren von SPARQL Query Translations zwischen DBpedia und Wikidata 将 DBpedia 和 Wikidata 之间的 SPARQL 查询翻译自动化 2507.10045v1 -
337 07-14 Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning Erste Prüfung von Wissenschaftlern: Kognitive Fähigkeiten von MLLM durch Wahrnehmung, Verständnis und Vernunft unter Beweis stellen 科学家的第一次考试:通过感知、理解和理性,发现MLLM的认知能力 2506.10521v4 -
338 07-14 Cross-modal Associations in Vision and Language Models: Revisiting the bouba-kiki effect Cross-modale Assoziationen in Vision und Sprachmodellen: Der Bouba-Kiki-Effekt neu aufgreifen 愿景和语言模式跨模式协会:重新审查bouba-kiki效应 2507.10013v1 -
339 07-14 Protective Factor-Aware Dynamic Influence Learning for Suicide Risk Prediction on Social Media Schutzfaktor-Bewusst Dynamisches Influence-Lernen für Suizidrisikovorhersage in sozialen Medien 社会媒体自杀风险预测社会媒体 2507.10008v1 -
340 07-14 SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs SpatialViz-Bench: Automatisch generierte räumliche Visualisierungs-Aufgaben für MLLMs SpatialViz-Bench:面向MLLM的自动生成空间可视化推理任务 2507.07610v2 -
341 07-14 On The Role of Intentionality in Knowledge Representation: Analyzing Scene Context for Cognitive Agents with a Tiny Language Model Zur Rolle der Intentionalität in der Wissensrepräsentation: Analysieren des Szenekontexts für Kognitive Agenten mit einem winzigen Sprachmodell 关于 “ 有意在知识代表性中的作用 “ :用微小语言模式分析认知代理人的场景背景 2507.10000v1 -
342 07-14 Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code LLM zur Vernunft bringen: Stärkung Lernen aus algorithmischen Problemen ohne Code 教LLM到理由:加强从没有法典的等级问题中学习 2507.07498v2 -
343 07-14 Not all tokens are created equal: Perplexity Attention Weighted Networks for AI generated text detection Nicht alle Token sind gleich: Perplexity Attention Gewichtete Netzwerke für die KI generierte Texterkennung 并非所有的标识符被创建为等号: 为 AI 生成的文本检测而创建的双倍注意加权网络 2501.03940v3 -
344 07-14 TextOmics-Guided Diffusion for Hit-like Molecular Generation TextOmics-geführte Diffusion für hit-like Molekulare Generation TextOmics- 指导的极类似分子生成扩散 2507.09982v1 -
345 07-14 Tiny Reward Models Kleine Belohnung Modelle 微量奖励模型 2507.09973v1 -
346 07-14 TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models TReB: Umfassender Benchmark für die Bewertung von Tabellen mit Gründen für Fähigkeiten großer Sprachmodelle TreB:评价大语言模式表说明能力的综合基准 2506.18421v2 -
347 07-14 PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes PRIME: Large Language Model Personalisierung mit kognitiven Gedächtnis- und Gedankenprozessen PRIME:具有认知记忆和思维过程的大语言模式个性模型 2507.04607v2 -
348 07-14 DeepGesture: A conversational gesture synthesis system based on emotions and semantics DeepGesture: Ein dialogisches Gesten-Synthesesystem basierend auf Emotionen und Semantik DeepGesture:基于情感和语义的谈话手势合成系统 2507.03147v2 -
349 07-14 EVALOOP: Assessing LLM Robustness in Programming from a Self-consistency Perspective EVALOOP: Bewertung der Robustheit von LLM in der Programmierung aus einer Perspektive der Selbstkonsistenz EVALOOP: 从自统一的角度评估方案拟订中的LLM强力 2505.12185v3 -
350 07-14 Qorgau: Evaluating LLM Safety in Kazakh-Russian Bilingual Contexts Qorgau: Bewertung der LLM-Sicherheit in kasachisch-russischen zweisprachigen Kontexten Qorgau:评价哈萨克-俄语双语背景的LLM安全性 2502.13640v2 -
351 07-14 Enhancing Retrieval Augmented Generation with Hierarchical Text Segmentation Chunking Verbesserung der retrieval Augmented Generation mit Hierarchical Text Segmentation Chunking 增强获取回源增加的一代, 带有高层次文字分割块块板 2507.09935v1 -
352 07-14 ACEBench: Who Wins the Match Point in Tool Usage? ACEBench: Wer gewinnt den Match Point in der Werkzeugnutzung? ACEBench:谁在工具使用中赢得了赛点? 2501.12851v5 -
353 07-14 MixLoRA-DSI: Dynamically Expandable Mixture-of-LoRA Experts for Rehearsal-Free Generative Retrieval over Dynamic Corpora MixLoRA-DSI: Dynamisch erweiterbare Mischungs-of-LoRA-Experten für ein probenfreies generatives Retrieval über Dynamic Corpora Mix LoRA-DSI: 动态公司排练-无创录检索专家动态可扩展混合Mix-LORA 2507.09924v1 -
354 07-14 PyVision: Agentic Vision with Dynamic Tooling PyVision: Agentische Vision mit dynamischem Werkzeug 视景:带有动态工具的 “ 动态展望 “ 。 2507.07998v2 -
355 07-14 Fourier Position Embedding: Enhancing Attention’s Periodic Extension for Length Generalization Fourier-Positions-Einbettung: Erhöht die regelmäßige Verlängerung der Aufmerksamkeit für Längenverallgemeinerung 四级立场嵌入式:加强注意定期延长延长时限 2412.17739v4 -
356 07-14 Intuitive Fine-Tuning: Towards Simplifying Alignment into a Single Process Intuitive Feinsteuerung: Auf dem Weg zur Vereinfachung der Ausrichtung zu einem einzigen Prozess 直观的精细调整:努力将调整简化为单一进程 2405.11870v3 -
357 07-14 Scalable MatMul-free Language Modeling Skalierbare MatMul-freie Sprachmodellierung 可缩放 MatMul 无语言建模 2406.02528v6 -
358 07-14 ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models ViTCoT: Video-Text Interleaved Chain-of-Thought zur Förderung des Videoverständnisses in großen Sprachmodellen ViTCoT:用于提升大语言模型视频理解的视频-文本交错思维链 2507.09876v1 -
359 07-14 Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition Funktionsinduktion und Aufgabenverallgemeinerung: Eine Interpretationsstudie mit Off-by-One-Addition 职能上岗和任务一般化:解释性研究 2507.09875v1 -
360 07-14 CV-Probes: Studying the interplay of lexical and world knowledge in visually grounded verb understanding CV-Probes: Studieren des Zusammenspiels von lexikalischem und weltlichem Wissen im visuell fundierten Verbverständnis CV-CV-结果:以视觉动词理解研究词汇学和世界知识的相互作用 2409.01389v2 -
361 07-14 InstCache: A Predictive Cache for LLM Serving InstCache: Ein vorausschauender Cache für LLM Serving Instcache:LLM服务预测缓存 2411.13820v2 -
362 07-14 BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning BIS Reasoning 1.0: Der erste großformatige japanische Benchmark für glaubens-inkonsistente syllogistische Reasoning BIS Reasoning 1.0:首个大规模日语信念不一致三段论推理基准 2506.06955v4 -
363 07-14 REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models REINFORCE++: Effizienter RLHF-Algorithmus mit Robustheit sowohl für Prompt- als auch für Reward-Modelle REINFORCE++: 高效的RLHF对快速模型和奖励模型具有强力的测算法 2501.03262v6 -
364 07-14 A General Framework for Inference-time Scaling and Steering of Diffusion Models Ein allgemeiner Rahmen für Schlussfolgerungs-Zeit-Skalierung und Steuerung von Diffusionsmodellen 传播模型的推推时间缩放和引导总框架 2501.06848v4 -
365 07-14 Beyond Scale: Small Language Models are Comparable to GPT-4 in Mental Health Understanding Beyond Scale: Kleine Sprachmodelle sind vergleichbar mit GPT-4 im Mental Health Understanding 超越范围:在心理健康理解方面,小语言模式可与GPT-4类比。 2507.08031v2 -
366 07-13 (7) Beyond Multiple Choice: Evaluating Steering Vectors for Adaptive Free-Form Summarization Beyond Multiple Choice: Bewertung von Steuerungsvektoren für adaptive Freiform-Zusammenfassung 超越多重选择:评估适应性自由形式总结指导矢量 2505.24859v2 -
367 07-13 VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information VisOnlyQA: Große Visions-Sprachmodelle kämpfen noch mit der visuellen Wahrnehmung geometrischer Informationen Vis onlyQA:仍与几何信息视觉认知相抗争的大型视觉语言模型 2412.00947v3 -
368 07-13 SymbolicThought: Integrating Language Models and Symbolic Reasoning for Consistent and Interpretable Human Relationship Understanding SymbolicThought: Integration von Sprachmodellen und symbolischer Begründung für ein konsequentes und interpretierbares menschliches Beziehungsverständnis 象征性探索:整合语文模式和符号理由,促进一致和可解释的人类关系理解 2507.04189v2 -
369 07-13 LASER: Attention with Exponential Transformation LASER: Aufmerksamkeit bei exponentieller Transformation LASER: 关注感官转变 2411.03493v2 -
370 07-13 TinyTroupe: An LLM-powered Multiagent Persona Simulation Toolkit TinyTroupe: Ein LLM-powered Multiagent Persona Simulation Toolkit TinyTroupe:一个由LLM驱动的多智能体人物角色模拟工具包 2507.09788v1 -
371 07-13 Te Ahorré Un Click: A Revised Definition of Clickbait and Detection in Spanish News Te Ahorré Un Click: Eine überarbeitete Definition von Clickbait und Detection in spanischen Nachrichten Te Ahorré Unclick:西班牙新闻中的点击和探测的订正定义 2507.09777v1 -
372 07-13 DataDecide: How to Predict Best Pretraining Data with Small Experiments DataDecide: Wie man die besten Vorschulungsdaten mit kleinen Experimenten vorhersagt 数据减少:如何利用小型实验预测最佳培训前数据 2504.11393v2 -
373 07-13 Cascade Speculative Drafting for Even Faster LLM Inference Cascade Spekulative Drafting für noch schnellere LLM-Inferenz 连速度更快LLM推论的连带连带性投机起草 2312.11462v5 -
374 07-13 KnowShiftQA: How Robust are RAG Systems when Textbook Knowledge Shifts in K-12 Education? KnowShiftQA: Wie robust sind RAG-Systeme, wenn Textbook Knowledge Shifts in K-12 Education? K-12教育中教科书知识转移时RAG系统如何强大? 2412.08985v3 -
375 07-13 EventHunter: Dynamic Clustering and Ranking of Security Events from Hacker Forum Discussions EventHunter: Dynamisches Clustering und Ranking von Sicherheitsereignissen aus Hacker Forum Diskussionen 活动休特:从黑客论坛讨论中对安保活动进行动态分组和排序 2507.09762v1 -
376 07-13 Your Pretrained Model Tells the Difficulty Itself: A Self-Adaptive Curriculum Learning Paradigm for Natural Language Understanding Ihr prätrainiertes Modell erzählt die Schwierigkeit selbst: Ein selbstadaptives Curriculum Lernen Paradigma für das natürliche Sprachverständnis 您训练有素的模型告诉困难本身:学习自然语言理解的自适应课程学习范式 2507.09758v1 -
377 07-13 Sound and Complete Neuro-symbolic Reasoning with LLM-Grounded Interpretations Sound und komplette neuro-symbolische Reasoning mit LLM-gerundeten Interpretationen 使用LLM四轮解释的全音和完整神经 – – 精神 – – 曲解理由 2507.09751v1 -
378 07-13 Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them Scalpel vs. Hammer: GRPO verstärkt bestehende Fähigkeiten, SFT ersetzt sie 手术刀与锤子:GRPO 放大现有能力,SFT 则取而代之 2507.10616v1 -
379 07-13 From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations Von Fragmenten zu Fakten: Ein Curriculum-getriebener DPO-Ansatz zur Generierung von Hindi News Veracity Erklärungen 《从零碎到事实:产生印地语新闻的多城市解释:课程驱动的DPO方法》 2507.05179v2 -
380 07-13 Reinforcing Question Answering Agents with Minimalist Policy Gradient Optimization Verstärkung der Frage beantworten Agenten mit minimalistischen Politik gradient Optimierung 以最起码的政策级政策优化优化方式加强回答问题的代理机构 2505.17086v2 -
381 07-13 Large Language Models Encode Semantics in Low-Dimensional Linear Subspaces Große Sprachmodelle kodieren Semantik in Low-Dimensional Linear Subspaces 低多维线性线性子空间中大语言模型编码语义学 2507.09709v1 -
382 07-13 Perception-Aware Policy Optimization for Multimodal Reasoning Perception-Aware Policy Optimization für multimodale Reasoning 对多式联运理由的观念-认知软件政策优化 2507.06448v2 -
383 07-13 MCEval: A Dynamic Framework for Fair Multilingual Cultural Evaluation of LLMs MCEval: Ein dynamischer Rahmen für eine faire multilinguale kulturelle Bewertung von LLMs MCEval:对LLMs进行公平、多语种文化评价的有力框架 2507.09701v1 -
384 07-13 Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning Lehrmodelle zu verbalisieren Belohnung Hacking in Chain-of-Thought-Reasoning 教模型在思维链推理中说出奖励黑客行为 2506.22777v2 -
385 07-13 Learning-to-Context Slope: Evaluating In-Context Learning Effectiveness Beyond Performance Illusions Learning-to-Context Slope: Bewertung von In-Context-Lerneffektivität jenseits von Performance-Illusionen 学习到文字表达式:评价除了业绩幻觉之外在学习中的效果 2506.23146v3 -
386 07-13 Towards Concise and Adaptive Thinking in Large Reasoning Models: A Survey Auf dem Weg zu einem konzisen und adaptiven Denken in großen Vernunftmodellen: Eine Umfrage 实现大理由模型中的简明和适应性思维:调查 2507.09662v1 -
387 07-13 OmniSQL: Synthesizing High-quality Text-to-SQL Data at Scale OmniSQL: Synthese hochwertiger Text-zu-SQL-Daten auf Scale OmniSQL: 大规模合成高质量的文本到 SQL 数据 2503.02240v2 -
388 07-13 MoRE: A Mixture of Reflectors Framework for Large Language Model-Based Sequential Recommendation MoRE: Ein Reflektoren-Framework für großsprachige modellbasierte sequentielle Empfehlung MORE:基于大语言示范序列建议的反思框架混合体 2409.06377v2 -
389 07-13 Can Group Relative Policy Optimization Improve Thai Legal Reasoning and Question Answering? Kann die Optimierung der relativen Politik der Gruppe die thailändische rechtliche Begründung und die Beantwortung von Fragen verbessern? 集团的相对政策优化能否改善泰国的法律依据和问题的回答? 2507.09638v1 -
390 07-13 An Exploration of Knowledge Editing for Arabic Eine Erforschung der Wissensbearbeitung für Arabisch 阿拉伯文知识编辑探索 2507.09629v1 -
391 07-13 SpreadPy: A Python tool for modelling spreading activation and superdiffusion in cognitive multiplex networks SpreadPy: Ein Python-Tool zur Modellierung der Ausbreitung von Aktivierung und Superdiffusion in kognitiven Multiplex-Netzwerken Python 工具,用于在认知多功能网络中模拟扩散扩散激活和超扩散 2507.09628v1 -
392 07-13 Your Absorbing Discrete Diffusion Secretly Models the Bayesian Posterior Ihre absorbierende Diskrete Diffusion heimlich Modelle der Bayesian Posterior 您的吸收分解扩散秘密模型 贝叶斯波斯别墅 2507.07586v2 -
393 07-13 NMIXX: Domain-Adapted Neural Embeddings for Cross-Lingual eXploration of Finance NMIXX: Domain-Adapted Neural Embedings für Cross-Lingual eXploration of Finance NMIXX: 用于财务交叉使用和交叉倍增的域-开发型神经模型 2507.09601v1 -
394 07-13 MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models MENTOR: Effizientes multimodales Tuning für autoregressive Vision-Generationsmodelle MENTOR:面向自回归视觉生成模型的高效多模态条件调优 2507.09574v1 -
395 07-13 Knowledge-Augmented Multimodal Clinical Rationale Generation for Disease Diagnosis with Small Language Models Knowledge-Augmented Multimodal Clinical Rationale Generation for Disease Diagnosis with Small Language Models 利用小型语言模型进行疾病诊断的知识强化多式临床多式理论 2411.07611v5 -
396 07-13 Adapting Definition Modeling for New Languages: A Case Study on Belarusian Anpassung der Definitionsmodelle für neue Sprachen: Eine Fallstudie zu Belarussisch 适应新语言定义模式:白俄罗斯案例研究 2507.09536v1 -
397 07-13 Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement Psychometrische Großsprachenmodelle: Eine systematische Überprüfung der Evaluation, Validierung und Verbesserung 大型语言模拟大语言心理计量模型:系统审查评价、校验和加强 2505.08245v2 -
398 07-13 Can A Society of Generative Agents Simulate Human Behavior and Inform Public Health Policy? A Case Study on Vaccine Hesitancy Kann eine Gesellschaft Generativer Mittel menschliches Verhalten simulieren und die öffentliche Gesundheitspolitik informieren? 基因代理学会能够模拟人类行为和信息公共卫生政策吗? 疫苗安全案例研究 2503.09639v4 -
399 07-13 How Important is 'Perfect' English for Machine Translation Prompts? Wie wichtig ist 'Perfektes' Englisch für maschinelle Übersetzung Prompts? “完美”英语对机器翻译提示的重要性有多大? 2507.09509v1 -
400 07-13 Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models Ref-Long: Benchmarking der Lang-Kontext-Referenzfähigkeit von Lang-Kontext-Sprachenmodellen 参考:长文本语言模式长期参考能力基准的设定 2507.09506v1 -
401 07-13 READoc: A Unified Benchmark for Realistic Document Structured Extraction READoc: Ein einheitlicher Benchmark für eine realistische Dokumentenstrukturierung READoc: “ 结构抽取文件 “ 的 “ 现实文件统一基准 “ 2409.05137v3 -
402 07-13 IDEAL: Influence-Driven Selective Annotations Empower In-Context Learners in Large Language Models IDEAL: Influence-Driven Selective Annotations Empower In-Context Learner in großen Sprachmodellen 影响驱动选择性说明:赋予大语言模式中的知识学习者权力 2310.10873v3 -
403 07-13 GoalfyMax: A Protocol-Driven Multi-Agent System for Intelligent Experience Entities GoalfyMax: Ein protokollgestütztes Multi-Agenten-System für intelligente Erlebniseinrichtungen 目标最大目标:智能经验实体协议驱动的多方促进机构系统 2507.09497v1 -
404 07-13 Topic Modeling as Multi-Objective Contrastive Optimization Thema Modellierung als multi-objektive kontrastive Optimierung 专题建模,作为多目标反向优化的模型化 2402.07577v3 -
405 07-13 Auditing Prompt Caching in Language Model APIs Auditieren von Prompt-Caching in Sprachmodell-APIs 语言模式APIP中快速抓取 2502.07776v2 -
406 07-13 Balanced Training Data Augmentation for Aspect-Based Sentiment Analysis Balanced Training Data Augmentation für aspektbasierte Sentiment-Analyse 平衡培训数据增加,以进行基于背景的情感分析 2507.09485v1 -
407 07-13 ViSP: A PPO-Driven Framework for Sarcasm Generation with Contrastive Learning ViSP: Ein PPO-getriebenes Framework für Sarkasmus-Generation mit kontrastivem Lernen ViSP:结合对比学习的PPO驱动讽刺生成框架 2507.09482v1 -
408 07-13 Evaluating LLMs on Sequential API Call Through Automated Test Generation Bewertung von LLMs auf sequentieller API-Aufruf durch automatisierte Testgenerierung 通过自动测试生成的序列API呼叫评估LLMs 2507.09481v1 -
409 07-13 The CoNLL-2013 Shared Task on Grammatical Error Correction Die gemeinsame Aufgabe von CoNLL-2013 zur Korrektur von Grammatikfehlern 2013 CoNLL-2013 校正语言错误共同任务 2507.09474v1 -
410 07-13 Enhancing Clinical Text Classification via Fine-Tuned DRAGON Longformer Models Verbesserung der klinischen Textklassifikation durch feingetönte DRAGON Longformer-Modelle 通过精美的DRAGON长期模型加强临床文本分类 2507.09470v1 -
411 07-13 Personalization of Large Language Models: A Survey Personalisierung großer Sprachmodelle: Eine Umfrage 大语言模型的个性化:调查 2411.00027v3 -
412 07-13 StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model StreamUni: Streaming Speech Translation mit einem einheitlichen Large Speech-Language-Modell erreichen StreamUni:用统一大型语音语言模型实现流式语音翻译 2507.07803v2 -
413 07-12 (6) DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models DATE-LM: Benchmarking Data Attribution Evaluation für große Sprachmodelle DATE-LM:大语言模式数据归属基准评价 2507.09424v1 -
414 07-12 Large Language Models as Neurolinguistic Subjects: Discrepancy between Performance and Competence Große Sprachmodelle als neurolinguistische Themen: Diskrepanz zwischen Leistung und Kompetenz 以大语言模式作为神经语言学主体:业绩与能力之间的差异 2411.07533v3 -
415 07-12 A Survey of Automatic Prompt Optimization with Instruction-focused Heuristic-based Search Algorithm Eine Umfrage zur automatischen Prompt-Optimierung mit instruction-focused Heuristic-based Search-Algorithmus 以注重指示的以休养为主的自动快速优化调查 2502.18746v2 -
416 07-12 Single Word Change is All You Need: Using LLMs to Create Synthetic Training Examples for Text Classifiers Single Word Change ist alles, was Sie brauchen: Mit LLMs synthetische Trainingsbeispiele für Textklassifikatoren erstellen 单单单单字更改是您所需要的: 使用 LLM 创建文本分类器的合成培训示例 2401.17196v3 -
417 07-12 SEE: Strategic Exploration and Exploitation for Cohesive In-Context Prompt Optimization SEE: Strategische Exploration und Nutzung für kohäsive In-Context Prompt Optimierung SEE: 战略探索和开发协同在文本内迅速优化的战略探索和开发 2402.11347v2 -
418 07-12 Supposedly Equivalent Facts That Aren’t? Entity Frequency in Pre-training Induces Asymmetry in LLMs Angeblich gleichwertige Fakten, die nicht sind? Entity Frequency in Pre-Training Induziert Asymmetrie in LLMs 所谓等效事实,这难道不是吗? 2503.22362v2 -
419 07-12 MedGemma Technical Report Technischer Bericht MedGemma MedGemma 技术报告 2507.05201v3 -
420 07-12 BEExformer: A Fast Inferencing Binarized Transformer with Early Exits BEExformer: Ein schneller Rückschluss Binarisierter Transformer mit frühen Ausgängen BEExformer:带有提前退出的快速推理二值化Transformer 2412.05225v2 -
421 07-12 Perspective Dial: Measuring Perspective of Text and Guiding LLM Outputs Perspective Dial: Perspective of Text and Guiding LLM Outputs messen 计量文字和引导性LLM产出 2506.23377v2 -
422 07-12 Watermarking Degrades Alignment in Language Models: Analysis and Mitigation Wasserzeichen degradiert Ausrichtung in Sprachmodellen: Analyse und Milderung 语言模型的分级调整:分析和减轻影响 2506.04462v3 -
423 07-12 LLM Agents Are the Antidote to Walled Gardens LLM-Agenten sind das Gegenmittel zu ummauerten Gärten LLM 药剂是被围墙隔绝的花园的抗药剂 2506.23978v2 -
424 07-12 ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching ZipVoice-Dialog: Nicht-Autoregressive gesprochene Dialog-Generation mit Flow Matching ZipVoice- Dialog: 以流动匹配方式生成非自动回归式口语对话 2507.09318v1 -
425 07-12 Emergence of Hierarchical Emotion Organization in Large Language Models Entstehung der Hierarchischen Emotionsorganisation in großen Sprachmodellen 大语言模式中等级情感组织的出现 2507.10599v1 -
426 07-12 Evaluation of Attribution Bias in Generator-Aware Retrieval-Augmented Large Language Models Bewertung der Attribution Bias in Generator-Aware Retrieval-Augmented Large Language Models 评价发电机-软件检索增强型大语言模型中的归属比语文评价 2410.12380v2 -
427 07-12 Voice Conversion for Lombard Speaking Style with Implicit and Explicit Acoustic Feature Conditioning Sprachumwandlung für lombardisch sprechenden Stil mit impliziter und expliziter Akustik-Feature-Konditionierung Lombard语音风格语音转换,带有隐含和显性音频特色条件 2507.09310v1 -
428 07-12 Disambiguate First, Parse Later: Generating Interpretations for Ambiguity Resolution in Semantic Parsing Erst abgrenzen, später Parse: Interpretationen für Ambiguitätsauflösung im semantischen Parsing generieren 模糊第一, 稍后分析: 在语义分析中生成对模糊分辨率的解释 2502.18448v2 -
429 07-12 ClaritySpeech: Dementia Obfuscation in Speech ClaritySpeech: Dementia Verschleierung in der Rede 清晰的言语:言语中的痴呆症 2507.09282v1 -
430 07-12 Psychology-Driven Enhancement of Humour Translation Psychologie-getriebene Verbesserung der Humour-Übersetzung 提高幽默翻译能力 2507.09259v1 -
431 07-12 Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources Swa-bhasha Resource Hub: romanisiert Sinhala zu Sinhala Transliterationssysteme und Datenressourcen Swa-bhasha资源中心:将僧伽罗化成僧伽罗化的僧伽罗化成僧伽罗转化系统和数据资源 2507.09245v1 -
432 07-12 Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs Bepflanzt in der Vorausbildung, durch Finetuning abgeschwächt: Eine Fallstudie über die Herkunft von Kognitiv-Biasen in LLMs 编在培训前编,《微调:关于LLM中认知性双星起源的个案研究》,《微调摇摇晃》 2507.07186v2 -
433 07-12 Towards Pareto Optimal Throughput in Small Language Model Serving Auf dem Weg zu Pareto Optimaler Durchsatz im kleinen Sprachmodell 争取在小型语文示范服务中达到最佳产出 2404.03353v2 -
434 07-12 MetaClimage: A novel database of visual metaphors related to Climate Change, with costs and benefits analysis MetaClimage: Eine neuartige Datenbank visueller Metaphern zum Klimawandel mit Kosten-Nutzen-Analyse MetaClimage:与气候变化有关的视觉比喻新数据库,并进行成本和效益分析 2507.09225v1 -
435 07-12 Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models Feature-Extraktion und -Lenkung für eine verbesserte Kettenbildung in Sprachmodellen 语言模型中强化研究链理由的特征采掘和指南 2505.15634v4 -
436 07-12 Exploring Gender Bias Beyond Occupational Titles Erforschen von Gender-Bias über Berufsbezeichnungen hinaus 探索职业职称之外的性别偏见 2507.02679v2 -
437 07-12 Banzhida: Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training Banzhida: Große Sprachmodelle für Tibeter mit kuratierten Daten und kontinuierlichem Vortraining Banzhida:推广藏语大语言模式,提供 “ 缩小数据 “ 和 “ 持续培训前 “ 。 2507.09205v1 -
438 07-12 An In-depth Evaluation of Large Language Models in Sentence Simplification with Error-based Human Assessment Eine eingehende Bewertung großer Sprachmodelle in der Satzvereinfachung mit fehlerbasierter Human Assessment 深入评价以基于错误的人类评估为根据的简化刑期的大语言模式 2403.04963v4 -
439 07-12 Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models Erkennen und Beschneiden Prominenter, aber detrimentaler Neuronen in großen Sprachmodellen 在大语言模型中检测和预视突出但有偏偏的神经元 2507.09185v1 -
440 07-12 CASCADE Your Datasets for Cross-Mode Knowledge Retrieval of Language Models CASCADE Ihre Datensätze für Cross-Mode Knowledge Retrieval von Sprachmodellen CASCADE 语言模型跨模式知识检索数据集 2504.01450v2 -
441 07-12 DLBAcalib: Robust Extrinsic Calibration for Non-Overlapping LiDARs Based on Dual LBA DLBAcalib: Robuste Extrinsische Kalibrierung für nicht überlappende LiDARs auf Basis von Dual LBA DLBAcalib: 以两边LBA为基础,对非重叠的LIDARs进行强有力的Extrins 校准 2507.09176v1 -
442 07-12 RAMA: Retrieval-Augmented Multi-Agent Framework for Misinformation Detection in Multimodal Fact-Checking RAMA: Retrieval-Augmented Multi-Agent Framework for Misinformation Detection in Multimodal Fact-Checking RAMA: 多式联运实况调查中错误信息探测的检索增强多机构框架 2507.09174v1 -
443 07-12 Logits are All We Need to Adapt Closed Models Logits sind alles, was wir brauchen, um geschlossene Modelle anzupassen 只需登录即可,我们只需调整已关闭的模型 2502.06806v4 -
444 07-12 PLEX: Perturbation-free Local Explanations for LLM-Based Text Classification PLEX: Störungsfreie lokale Erklärungen für die LLM-basierte Textklassifikation PLEX: LLM基于LLM的文本分类无扰动当地解释 2507.10596v1 -
445 07-12 PU-Lie: Lightweight Deception Detection in Imbalanced Diplomatic Dialogues via Positive-Unlabeled Learning PU-Lie: Leichte Täuschungserkennung in unausgewogenen Diplomatischen Dialogen durch positiv-unmarkiertes Lernen PU-Lie:通过积极-无标签学习,在不平衡的外交对话中发现轻量度欺骗性 2507.09157v1 -
446 07-12 OPENXRD: A Comprehensive Benchmark and Enhancement Framework for LLM/MLLM XRD Question Answering OPENXRD: Ein umfassendes Benchmark- und Enhancement-Framework für LLM/MLLM XRD-Fragebeantwortung OpenXRD: LLM/MLLM XRD 问题回答的综合基准和加强框架 2507.09155v1 -
447 07-12 DTECT: Dynamic Topic Explorer & Context Tracker DTECT: Dynamischer Themen-Explorer & Kontext-Tracker DTECT: 动态专题探索器和上下文跟踪器 2507.07910v2 -
448 07-12 SymRAG: Efficient Neuro-Symbolic Retrieval Through Adaptive Query Routing SymRAG: Effizientes neuro-symbolisches Retrieval durch adaptive Abfragerouting SymRAG: 通过适应性查询路由, 高效神经- 交串流检索 2506.12981v2 -
449 07-12 Eka-Eval : A Comprehensive Evaluation Framework for Large Language Models in Indian Languages Eka-Eval : Ein umfassender Evaluierungsrahmen für große Sprachmodelle in indischen Sprachen Eka-Eval:印度语大语言模式综合评价框架 2507.01853v3 -
450 07-12 The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages Der NaijaVoices-Datensatz: Pflege von großformatigen, qualitativ hochwertigen, kulturell-richschen Sprachdaten für afrikanische Sprachen NaijaVoices数据集:培养非洲语言的大型、高质量、文化-Rich语音数据 2505.20564v3 -
451 07-12 MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian MSVD-Indonesier: Benchmark für multimodale Video-Text-Aufgaben auf Indonesisch MSVD-印度尼西亚文:印度尼西亚多式视频文字任务基准 2306.11341v2 -
452 07-12 KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding KodCode: Ein vielfältiger, anspruchsvoller und überprüfbarer synthetischer Datensatz für die Codierung KodCode:用于编码的多样化、挑战性和可核查合成数据集 2503.02951v2 -
453 07-12 CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards CompassJudger-2: Auf dem Weg zum generalistischen Richtermodell durch überprüfbare Belohnungen CompassJudger-2:通过可核实的奖励争取通才法官模式 2507.09104v1 -
454 07-12 Consistency in Language Models: Current Landscape, Challenges, and Future Directions Konsistenz in Sprachmodellen: Aktuelle Landschaft, Herausforderungen und zukünftige Richtungen 语文模式的一致性:当前景观、挑战和未来方向 2505.00268v2 -
455 07-12 AInsight: Augmenting Expert Decision-Making with On-the-Fly Insights Grounded in Historical Data AInsight: Augmenting Expert Decision-Making mit On-the-Fly-Insights in historischen Daten begründet AInsight:加强以历史数据为根据的直观专家决策 2507.09100v1 -
456 07-12 DS@GT at Touché: Large Language Models for Retrieval-Augmented Debate DS@GT at Touché: Große Sprachmodelle für retrieval-augmentierte Debatte DS@GT at Touché: 检索启动辩论的大语言模式 2507.09090v1 -
457 07-11 (5) Dynamic Parameter Memory: Temporary LoRA-Enhanced LLM for Long-Sequence Emotion Recognition in Conversation Dynamischer Parameterspeicher: Temporäre LoRA-verbesserte LLM für die Erkennung von Langsequenz-Emotionen im Gespräch 动态参数内存:在对话中识别长期序列情感的暂时性LoRA增强的LLM 2507.09076v1 -
458 07-11 OpenCodeReasoning-II: A Simple Test Time Scaling Approach via Self-Critique OpenCodeReasoning-II: Ein einfacher Testzeitskalierungsansatz über Self-Critique OpenCodeReasoning- II: 通过自创性简单测试时间缩放法 2507.09075v1 -
459 07-11 FlexOlmo: Open Language Models for Flexible Data Use FlexOlmo: Offene Sprachmodelle für flexible Datennutzung FlexOlmo:灵活数据使用开放语言模型 2507.07024v2 -
460 07-11 HYPEROFA: Expanding LLM Vocabulary to New Languages via Hypernetwork-Based Embedding Initialization HYPEROFA: Erweitern von LLM Vokabeln auf neue Sprachen über Hypernetwork-basierte Einbettung in Initialisierung HYPEROFA:通过基于超网络的嵌入式初始化,将LLM词汇扩大到新语言 2504.21018v2 -
461 07-11 ALIGN: Prompt-based Attribute Alignment for Reliable, Responsible, and Personalized LLM-based Decision-Making ALIGN: Promptbasierte Attributausrichtung für zuverlässige, verantwortungsvolle und personalisierte LLM-basierte Entscheidungsfindung 以可靠、负责任和个性化的LLM为基础的决策的快速属性协调 2507.09037v1 -
462 07-11 Lizard: An Efficient Linearization Framework for Large Language Models Lizard: Ein effizienter Linearisierungsrahmen für große Sprachmodelle Lizard:大型语言模型的高效线性框架 2507.09025v1 -
463 07-11 Beyond vividness: Content analysis of induced hallucinations reveals the hidden structure of individual differences in visual imagery Jenseits von Lebendigkeit: Inhaltliche Analyse induzierter Halluzinationen enthüllt die verborgene Struktur individueller Unterschiede in der Bildgebung 超越生化:对诱发幻觉的内容分析揭示了视觉图像中个人差异的隐藏结构。 2507.09011v1 -
464 07-11 Semantic Source Code Segmentation using Small and Large Language Models Semantische Quellcode-Segmentierung mit kleinen und großen Sprachmodellen 使用小型和大语言模式的语义源代码代码分割 2507.08992v1 -
465 07-11 TheraGen: Therapy for Every Generation TheraGen: Therapie für jede Generation TheraGen:为每一代人提供治疗 2409.13748v2 -
466 07-11 Application of CARE-SD text classifier tools to assess distribution of stigmatizing and doubt-marking language features in EHR Anwendung von CARE-SD-Textklassifikator-Tools zur Bewertung der Verteilung von stigmatisierenden und zweifelmarkierenden Sprachmerkmalen in EHR 应用CARE-SD 文本分类工具,评估EHR中污名化和有疑点语言特征的分布 2507.08969v1 -
467 07-11 Self-Improving Model Steering Selbstverbesserende Modellsteuerung 自我改进示范指导 2507.08967v1 -
468 07-11 LearnLens: LLM-Enabled Personalised, Curriculum-Grounded Feedback with Educators in the Loop LearnLens: LLM-Enabled Personalisiertes, Curriculum-gerundetes Feedback mit Erziehern im Loop 学习栏:LLM-能够个性化的LLM课程、课程与环中教育工作者的反馈 2507.04295v2 -
469 07-11 Drowning in Documents: Consequences of Scaling Reranker Inference Ertrinken in Dokumenten: Konsequenzen der Skalierungs-Reranker-Schlussfolgerung 文件中淹没:扩大重新排序者推断的后果 2411.11767v2 -
470 07-11 NeuralOS: Towards Simulating Operating Systems via Neural Generative Models NeuralOS: Auf dem Weg zur Simulation von Betriebssystemen über neurale Generative Modelle NeuralOS:通过神经产生模型努力模拟操作系统 2507.08800v1 -
471 07-11 KV Cache Steering for Inducing Reasoning in Small Language Models KV Cache Steering zur Induktion von Vernunft in kleinen Sprachmodellen KV 小型语言模式引力提示缓存指导 2507.08799v1 -
472 07-11 From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation Von KMMLU-Redux zu KMMLU-Pro: Eine professionelle koreanische Benchmark-Suite für die LLM-Bewertung 从KMMLU-Redux到KMMLU-Pro:韩国用于LLM评价的专业基准套件 2507.08924v1 -
473 07-11 One Token to Fool LLM-as-a-Judge Ein Token zum Narren LLM-as-a-Richter 愚人一拳LLM -A法官 2507.08794v1 -
474 07-11 AI Safety Should Prioritize the Future of Work KI Sicherheit sollte die Zukunft der Arbeit priorisieren AI 安全应优先考虑未来工作 2504.13959v2 -
475 07-11 From Sequence to Structure: Uncovering Substructure Reasoning in Transformers Von Sequenz zu Struktur: Enthüllen von Unterstrukturen in Transformern 从序列到结构:在变换器中未覆盖子结构原因 2507.10435v1 -
476 07-11 BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity BlockFFN: Auf dem Weg zur End-Side Acceleration-Friendly Mixture-of-Experts mit Chunk-Level-Aktivierung Sparsity 块块FFN: 向具有整块级激活分级的 终端- 双极加速- 友好混合混合专家方向 2507.08771v1 -
477 07-11 On Barriers to Archival Audio Processing Über Hindernisse für die Archivierung von Audio 档案音频处理障碍问题 2507.08768v1 -
478 07-11 Large Language Models in Mental Health Care: a Scoping Review Große Sprachmodelle in der Psychischen Gesundheitsversorgung: ein Scoping Review 精神保健中大语言模式:范围审查 2401.02984v3 -
479 07-11 Weak-to-Strong Jailbreaking on Large Language Models Schwach-zu-starkes Gefängnis mit großen Sprachmodellen 关于大语言模型的弱至强强监狱破解 2401.17256v4 -
480 07-11 Multilingual Multimodal Software Developer for Code Generation Mehrsprachiger multimodaler Softwareentwickler für die Codegenerierung 用于代码生成的多语言多语种多式软件开发器 2507.08719v1 -
481 07-11 Evaluating LLMs in Medicine: A Call for Rigor, Transparency Bewertung von LLMs in der Medizin: Ein Ruf nach Starrheit, Transparenz 医学领域评价LLMs:调用Rigor,透明 2507.08916v1 -
482 07-11 KG-Attention: Knowledge Graph-Guided Attention at Test-Time via Bidirectional Information Aggregation KG-Achtung: Wissen Graphengeführte Aufmerksamkeit zur Testzeit über bidirektionale Informationsaggregation KG-注意:通过双向信息聚合在试验时以知识图表引导的注意 2507.08704v1 -
483 07-11 Multi-Token Attention Multi-Token-Achtung 多当式注意 2504.00927v2 -
484 07-11 KELPS: A Framework for Verified Multi-Language Autoformalization via Semantic-Syntactic Alignment KELPS: Ein Rahmen für eine verifizierte Mehrsprachen-Autoformalisierung durch semantisch-syntaktische Ausrichtung KELPS: 通过语义- 合成协调校验多语言自动正规化框架 2507.08665v1 -
485 07-11 The Impact of Automatic Speech Transcription on Speaker Attribution Die Auswirkungen der automatischen Sprachtranskription auf die Sprecherzuweisung 自动发言限制对议长权力的影响 2507.08660v1 -
486 07-11 Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery Open Source Planning & Control System mit Language Agents für autonome wissenschaftliche Entdeckung 拥有自主科学发现语言代理的开放源规划和控制系统 2507.07257v2 -
487 07-11 Scaling Attention to Very Long Sequences in Linear Time with Wavelet-Enhanced Random Spectral Attention (WERSA) Skalierung der Aufmerksamkeit auf sehr lange Sequenzen in linearer Zeit mit Wavelet-erweiterter Zufallsspektral-Achtung (WERSA) 以波浪增强随机光谱注意, 将注意力转向线性时间的甚长序列( WERSA) 2507.08637v1 -
488 07-11 Text2BIM: Generating Building Models Using a Large Language Model-based Multi-Agent Framework Text2BIM: Generierung von Baumodellen mit Hilfe eines Multi-Agent-Frameworks auf Basis eines großen Sprachmodells Text2BIM:利用以大语言模式为基础的多机构机构框架生成建筑模型 2408.08054v2 -
489 07-11 Red Teaming Large Language Models for Healthcare Red Teaming große Sprachmodelle für das Gesundheitswesen 红队大语言保健模式 2505.00467v2 -
490 07-11 A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1 Eine umfassende Studie der LLM-basierten Argumentationsklassifikation: von LLAMA über GPT-4o bis Deepseek-R1 关于以LLM为基础的理论分类的全面研究:从LLAMA到GPT-4o到Deepseek-R1 2507.08621v1 -
491 07-11 Addressing Pitfalls in Auditing Practices of Automatic Speech Recognition Technologies: A Case Study of People with Aphasia Umgang mit Pitfalls bei der Prüfung von Praktiken automatischer Spracherkennungstechnologien: Eine Fallstudie von Menschen mit Aphasie 解决自动语音识别技术审计做法中的缺陷:阿法西亚人案例研究 2506.08846v2 -
492 07-11 Anthropomimetic Uncertainty: What Verbalized Uncertainty in Language Models is Missing Anthropomische Unsicherheit: Was verbalisierte Unsicherheit in Sprachmodellen fehlt 人文工程学不确定性:语言模型中什么是虚无的不确定性 2507.10587v1 -
493 07-11 AutoRAG-LoRA: Hallucination-Triggered Knowledge Retuning via Lightweight Adapters AutoRAG-LoRA: Halluzination-Triggered Knowledge Retuning über Leichtbauadapter AutoRAG-LoRA:通过轻度适应器进行幻觉-交错知识调整 2507.10586v1 -
494 07-11 Medical Red Teaming Protocol of Language Models: On the Importance of User Perspectives in Healthcare Settings Medical Red Teaming Protocol of Language Models: Über die Bedeutung der Nutzerperspektiven in der Gesundheitsversorgung 语言模式医学红队模式医疗红队协议:关于保健机构用户观点的重要性 2507.07248v2 -
495 07-11 Large Multi-modal Model Cartographic Map Comprehension for Textual Locality Georeferencing Großes multimodales Modell Kartographische Karte Verständnis für Textlokalität Georeferenzierung 大型多模式地图地图图图图集模型 2507.08575v1 -
496 07-11 A Taxonomy for Design and Evaluation of Prompt-Based Natural Language Explanations Eine Taxonomie für Design und Evaluation von prompt-basierenden Naturspracherklärungen 设计和评价快速自然语言解释的分类学 2507.10585v1 -
497 07-11 Comparing Spoken Languages using Paninian System of Sounds and Finite State Machines Vergleich der gesprochenen Sprachen mit Paninian System of Sounds und Finite State Machines 使用波尼尼亚音响和有限国家机器系统比较口语 2301.12463v3 -
498 07-11 The AI Language Proficiency Monitor – Tracking the Progress of LLMs on Multilingual Benchmarks Der KI-Sprachkompetenzmonitor – Aufspüren des Fortschritts von LLMs auf mehrsprachigen Benchmarks AI 语言能力监测 – – 跟踪多语种基准问题LLMs的进展情况 2507.08538v1 -
499 07-11 A Multi-granularity Concept Sparse Activation and Hierarchical Knowledge Graph Fusion Framework for Rare Disease Diagnosis Multi-Granularität Konzept Sparse Aktivierung und Hierarchisches Wissen Graph Fusion Framework für Seltene Krankheiten Diagnose 罕见疾病诊断多发性概念分散活动和等级知识图集融合框架 2507.08529v1 -
500 07-11 An Empirical Study of Validating Synthetic Data for Formula Generation Eine empirische Studie zur Validierung synthetischer Daten für die Formelgenerierung 验证用于公式生成的合成数据的经验研究 2407.10657v4 -
501 07-11 REGEN: A Dataset and Benchmarks with Natural Language Critiques and Narratives REGEN: Ein Datensatz und Benchmarks mit natürlichen Sprachkritiken und Erzählungen REGEN: 一套具有自然语种背景和叙述的数据集和基准 2503.11924v2 -
502 07-11 Transforming Sensitive Documents into Quantitative Data: An AI-Based Preprocessing Toolchain for Structured and Privacy-Conscious Analysis Transformation sensibler Dokumente in Quantitative Daten: Eine KI-basierte Vorverarbeitungs-Toolchain für strukturierte und datenschutzbewusste Analysen 将敏感文件转换成定量数据:基于AI的结构性和隐私意识分析预处理工具链 2507.10582v1 -
503 07-11 One-Pass to Reason: Token Duplication and Block-Sparse Mask for Efficient Fine-Tuning on Multi-Turn Reasoning One-Pass to Reason: Token-Duplikation und Block-Spar-Maske für effizientes Feintuning auf Multi-Turn-Reasoning 单向理由:在多向理由上高效精美调整的相重复和块分割掩码 2504.18246v2 -
504 07-11 An Offline Mobile Conversational Agent for Mental Health Support: Learning from Emotional Dialogues and Psychological Texts with Student-Centered Evaluation Offline-Mobile Gesprächsagentin für psychische Gesundheitsunterstützung: Lernen aus emotionalen Dialogen und psychologischen Texten mit studentisch-zentrierter Evaluation 心理健康支助离线流动对话代理人:学习以学生为中心的评价的情感对话和心理文字 2507.10580v1 -
505 07-11 PromotionGo at SemEval-2025 Task 11: A Feature-Centric Framework for Cross-Lingual Multi-Emotion Detection in Short Texts PromotionGo at SemEval-2025 Task 11: Ein Feature-Centric Framework für Cross-Lingual Multi-Emotion Detection in Kurztexten 促进SemEval-2025任务11:短文本中跨语言多情感探测的特写-内容框架 2507.08499v1 -
506 07-11 Semantic-Augmented Latent Topic Modeling with LLM-in-the-Loop Semantic-Augmented Latent Topic Modeling mit LLM-in-the-Loop 利用LLM-in-the-Loop进行语义增强的潜在主题建模 2507.08498v1 -
507 07-11 LLaPa: A Vision-Language Model Framework for Counterfactual-Aware Procedural Planning LLaPa: Ein visionssprachliches Modell-Framework für die kontrafaktisch-bewusste Verfahrensplanung LLAPA: 反事实-软件程序规划远景-语言示范框架 2507.08496v1 -
508 07-11 A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench Ein drittes Paradigma für LLM-Evaluierung: Dialog Game-Based-Evaluierung mit Clembench LLM评价的第三个范例:以对话游戏为基础的评价 2507.08491v1 -
509 07-11 Enhancing Essay Cohesion Assessment: A Novel Item Response Theory Approach Essay Cohäsion Assessment: Ein neuartiger Ansatz zur Reaktionstheorie 加强舍子聚合力评估:新项目应对理论方法 2507.08487v1 -
510 07-11 Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors Ergebnisse der gemeinsamen Arbeit der BEA 2025 zur pädagogischen Fähigkeitsbewertung von KI-getriebenen Tutoren BEA 2025年BEA 2025年教育能力评估共同任务的结果 2507.10579v1 -
511 07-11 ILT-Iterative LoRA Training through Focus-Feedback-Fix for Multilingual Speech Recognition ILT-Iteratives LoRA-Training durch Fokus-Feedback-Fix für mehrsprachige Spracherkennung 通过 “ 承认多种语言语言的焦点-反馈-语言识别指标 “ 进行ILT-临时LORA培训 2507.08477v1 -
512 07-11 Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model Squeeze the Soaked Sponge: Effiziente Off-Policy-Verstärkung Feinsteuerung für großes Sprachmodell 挤压海绵:高效非政策强化大语言模式的高效非政策改进微调 2507.06892v3 -
513 07-11 Using Large Language Models for Legal Decision-Making in Austrian Value-Added Tax Law: An Experimental Study Große Sprachmodelle für die rechtliche Entscheidungsfindung im österreichischen Mehrwertsteuerrecht nutzen: Eine experimentelle Studie 奥地利增值税法使用大语言模式进行法律决策:实验研究 2507.08468v1 -
514 07-11 Diagnosing Failures in Large Language Models’ Answers: Integrating Error Attribution into Evaluation Framework Diagnose von Fehlern in den Antworten großer Sprachmodelle: Integrieren der Fehlerzuweisung in den Evaluationsrahmen 大语言模型答案中的诊断失败:将错误归责纳入评价框架 2507.08459v1 -
515 07-11 Can Large Language Models Understand As Well As Apply Patent Regulations to Pass a Hands-On Patent Attorney Test? Können große Sprachmodelle ebenso verstehen wie Patentvorschriften anwenden, um einen hands-on Patent Attorney Test zu bestehen? 大语言模式能否像应用专利条例通过专利律师亲手测试一样理解专利条例? 2507.10576v1 -
516 07-11 Finding Common Ground: Using Large Language Models to Detect Agreement in Multi-Agent Decision Conferences Gemeinsamer Grund: Mit großen Sprachmodellen Vereinbarungen in Multi-Agent-Entscheidungskonferenzen zu erkennen 寻找共同点:在多机构决定会议上使用大语言模型来检测协议 2507.08440v1 -
517 07-11 xpSHACL: Explainable SHACL Validation using Retrieval-Augmented Generation and Large Language Models xpSHACL: Erklärbare SHACL-Validierung mit Retrieval-Augmented Generation und großen Sprachmodellen xpSHACL: 使用回溯-启动生成和大语言模型进行可解释的 SHACL 校验 2507.08432v1 -
518 07-11 Answer Generation for Questions With Multiple Information Sources in E-Commerce Antwortgenerierung für Fragen mit mehreren Informationsquellen im E-Commerce 电子商务中具有多种信息来源问题的答案生成问题 2111.14003v2 -
519 07-11 ChainEdit: Propagating Ripple Effects in LLM Knowledge Editing through Logical Rule-Guided Chains ChainEdit: Propagieren von Ripple-Effekten in der LLM-Wissensbearbeitung durch logische regelgeführte Ketten 链 Edit:通过逻辑规则-指导链条在LLM知识编辑中宣传波纹效应 2507.08427v1 -
520 07-11 A Survey of Large Language Models in Discipline-specific Research: Challenges, Methods and Opportunities Eine Übersicht über große Sprachmodelle in der disziplinspezifischen Forschung: Herausforderungen, Methoden und Chancen 专门学科研究中大语言模式概览:挑战、方法和机会 2507.08425v1 -
521 07-11 Enabling Inclusive Systematic Reviews: Incorporating Preprint Articles with Large Language Model-Driven Evaluations Inklusive Systematische Bewertungen aktivieren: Einschließlich Preprint-Artikel mit großsprachigen modellgetriebenen Bewertungen 促进包容性的系统审查:将预印条款纳入大语言模式示范评价 2503.13857v4 -
522 07-11 Swap distance minimization beyond entropy minimization in word order variation Swap-Distanz-Minimierung jenseits der Entropie-Minimierung in Wortfolge-Variation 以字序变换方式互换距离最小化,超过以字序变换的方式最小化 2404.14192v5 -
523 07-11 Probing Experts’ Perspectives on AI-Assisted Public Speaking Training Probing Experten-Perspektiven über KI-Assistente Public Speaking Training 关于AI协助的公开演讲培训的探查专家观点 2507.07930v2 -
524 07-11 Flippi: End To End GenAI Assistant for E-Commerce Flippi: Ende bis Ende GenAI Assistent für E-Commerce Flippi: 结束到结束 GenAI 电子商务助手 2507.05788v2 -
525 07-11 Sampling from Your Language Model One Byte at a Time Proben aus Ihrem Sprachmodell ein Byte zu einer Zeit 一次抽取您语言模式一字节的样本 2506.14123v2 -
526 07-11 HeSum: a Novel Dataset for Abstractive Text Summarization in Hebrew HeSum: Ein neuartiger Datensatz für abstrakte Textzusammenfassung in Hebräisch HeSum:希伯来文抽象文本摘要缩写的新数据集 2406.03897v3 -
527 07-11 Truth-value judgment in language models: ‘truth directions’ are context sensitive Wahrheit-Wert-Urteil in Sprachmodellen: ‘Wahrheitsrichtungen’ sind kontextsensibel 语言模型中的真相价值判断:“真相方向”是背景敏感 2404.18865v3 -
528 07-11 The Curious Case of Factuality Finetuning: Models’ Internal Beliefs Can Improve Factuality Der kuriose Fall von Factuality Finetuning: Modelle’ interne Glaube kann Factuality verbessern 《难解事实质量微调案例:模型的内部信仰可以改进事实质量》 2507.08371v1 -
529 07-11 Exploring Design of Multi-Agent LLM Dialogues for Research Ideation Erforschung der Gestaltung von LLM-Dialogen mit mehreren Agenten für die Forschungsideation 探索设计多种机构用LLM 研究主题对话 2507.08350v1 -
530 07-11 Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization N-Grams以后:重新思考评价尺度和多语种抽象总结战略 2507.08342v1 -
531 07-11 Distillation versus Contrastive Learning: How to Train Your Rerankers Destillation versus Kontrastives Lernen: Wie Sie Ihre Reranker trainieren 蒸馏与反竞争学习:如何培训你的再培训者 2507.08336v1 -
532 07-11 MK2 at PBIG Competition: A Prompt Generation Solution MK2 bei PBIG Competition: Eine schnelle Generationslösung PBIG竞争中的MK2:迅速代代解决办法 2507.08335v1 -
533 07-11 Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection Emoji-Angriff: Verstärkung von Jailbreak-Angriffen gegen Richter LLM-Erkennung Emoji攻击:加强针对LLM法官的越狱袭击 2411.01077v4 -
534 07-11 CRMAgent: A Multi-Agent LLM System for E-Commerce CRM Message Template Generation CRMAgent: Ein Multi-Agent LLM-System für E-Commerce CRM-Meldungsvorlagen-Erstellung CRMAgent:用于电子商务CRM信息模板生成的多代理LLM系统 2507.08325v1 -
535 07-11 EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees EvalTree: Profiling Language Model Schwächen über Hierarchische Fähigkeiten Bäume EvalTree:通过等级能力树分析语言模型弱点 2503.08893v2 -
536 07-11 Improving MLLM’s Document Image Machine Translation via Synchronously Self-reviewing Its OCR Proficiency Verbesserung der Übersetzung von MLLMs Dokumentenbildmaschinen durch synchrone Selbstprüfung ihrer OCR-Kenntnisse 通过同步进行自我审查,改进MLLM的文件图像机机翻译,提高OCR的熟练程度 2507.08309v1 -
537 07-11 M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning M2-Reasoning: Stärkung von MLLMs mit einheitlicher allgemeiner und räumlicher Vernunft M2-反应:以统一的一般和空间理由,赋予MLLMs权力 2507.08306v1 -
538 07-11 Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective Bewertung von Impliziten Bias in großen Sprachmodellen durch Angriff aus einer psychometrischen Perspektive 通过从心理角度进行攻击,评价大语言模型中隐含的偏见 2406.14023v5 -
539 07-11 Bandit-Based Prompt Design Strategy Selection Improves Prompt Optimizers Bandit-Based Prompt Design Strategy Selection verbessert Prompt Optimizers 基于强盗的即时设计战略选择改进即时优化 2503.01163v2 -
540 07-11 Lightweight Safety Guardrails via Synthetic Data and RL-guided Adversarial Training Leichte Sicherheits-Guardrails über Synthetische Daten und RL-geführtes Adversarial Training 通过合成数据和RL制导反向训练轻量安全护卫车 2507.08284v1 -
541 07-11 Generative Retrieval and Alignment Model: A New Paradigm for E-commerce Retrieval Generatives Retrieval- und Alignment-Modell: Ein neues Paradigma für E-Commerce Retrieval 产生检索和调整模型:电子商务检索的新范例 2504.01403v2 -
542 07-11 SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths SpecDec++: Spekulative Dekodierung durch adaptive Kandidatenlängen steigern SpecDec++:通过适应性候选时间长度促进投机替代 2405.19715v3 -
543 07-11 Exploring Gender Differences in Chronic Pain Discussions on Reddit Erforschung geschlechtsspezifischer Unterschiede bei chronischen Schmerzdiskussionen auf Reddit 探讨关于康复的慢性疼痛讨论中的性别差异 2507.08241v1 -
544 07-11 Sequence graphs realizations and ambiguity in language models Sequenzgraphen Realisationen und Mehrdeutigkeit in Sprachmodellen 顺序图 语文模式的实现和模糊 2402.08830v2 -
545 07-11 Can LLMs Reliably Simulate Real Students’ Abilities in Mathematics and Reading Comprehension? Können LLMs die Fähigkeiten von Realstudenten in Mathematik und Leseverständnis zuverlässig simulieren? LLMs能够令人信赖地模拟真实学生的数学和阅读理解能力吗? 2507.08232v1 -
546 07-10 (4) Post-hoc Study of Climate Microtargeting on Social Media Ads with LLMs: Thematic Insights and Fairness Evaluation Post-hoc-Studie zum Thema Klima-Mikrotargeting auf Social Media-Anzeigen mit LLMs: Thematische Einblicke und Fairness-Evaluierung 利用LLMM:专题透视和公平评估 2410.05401v3 -
547 07-10 Extracting memorized pieces of (copyrighted) books from open-weight language models Extrahieren von auswendig gelernten Stücken von Büchern aus Open-Wight-Sprachmodellen 从开放重量级语言模式中提取(复印权)书籍 2505.12546v2 -
548 07-10 Riddle Generation using Learning Resources Riddle Generation mit Lernressourcen 利用学习资源的中一代人 2310.18290v3 -
549 07-10 TruthTorchLM: A Comprehensive Library for Predicting Truthfulness in LLM Outputs TruthTorchLM: Eine umfassende Bibliothek für die Vorhersage von Wahrhaftigkeit in LLM-Ausgaben TruthTorchLM:LLM产出中预测真相综合图书馆 2507.08203v1 -
550 07-10 Overview of the TREC 2021 deep learning track Überblick über den Deep-Learning-Track TREC 2021 TREC 2021年深学习轨迹概览 2507.08191v1 -
551 07-10 Overview of the TREC 2022 deep learning track Überblick über den Deep-Learning-Track TREC 2022 TREC 2022深学习轨迹概览 2507.10865v1 -
552 07-10 GeistBERT: Breathing Life into German NLP GeistBERT: Das Leben in die deutsche NLP einatmen 呼吸生命化为德国NLP 2506.11903v4 -
553 07-10 Overview of the TREC 2023 deep learning track Überblick über den Deep-Learning-Track TREC 2023 TREC 2023深学习轨迹概览 2507.08890v1 -
554 07-10 Distilling Empathy from Large Language Models Empathie aus großen Sprachmodellen destillieren 提炼大语言模型的冷漠 2507.08151v1 -
555 07-10 Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores Kompaktor: Kalibrierte Abfrage-agnostische KV Cache-Kompression mit ungefähren Leverage-Scores 压缩器: 使用近似杠杆分数校准查询- 不可知性 KV CA缓存压缩 2507.08143v1 -
556 07-10 Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models Audio Flamingo 3: Advancing Audio Intelligence mit vollständig offenen großen Audio-Sprachen-Modelle Audio Flamingo 3:以完全开放的大型音频语言模式推进音频情报 2507.08128v1 -
557 07-10 Audit, Alignment, and Optimization of LM-Powered Subroutines with Application to Public Comment Processing Audit, Alignment und Optimierung von LM-Powered Subroutinen mit Anwendung auf die öffentliche Kommentarverarbeitung 对LM-Powerd Powerd S次程序适用公众意见处理的审计、统一和优化 2507.08109v1 -
558 07-10 GRASP: Generic Reasoning And SPARQL Generation across Knowledge Graphs GRASP: Generische Vernunft und SPARQL-Generierung über Wissensgraphen hinweg GRASP: 通用理由和在知识图中生成SPARQL 2507.08107v1 -
559 07-10 Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology Rückverfolgbare Beweise Verbesserte visuelle Grundierung: Bewertung und Methodik 增强视觉依据的理由:评价和方法 2507.07999v1 -
560 07-10 Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs) Operationalisierung eines Bedrohungsmodells für das Red-Teaming großer Sprachmodelle (LLMs) 实施 “ 红色组合大语言模型威胁模型 “ ; 2407.14937v2 -
561 07-10 Automating Expert-Level Medical Reasoning Evaluation of Large Language Models Automatisieren von Experten-Level Medical Reasoning Bewertung von großen Sprachmodellen 对大语言模式进行自动化专家级医疗理由评估 2507.07988v1 -
562 07-10 Performance and Practical Considerations of Large and Small Language Models in Clinical Decision Support in Rheumatology Leistung und praktische Betrachtung von großen und kleinen Sprachmodellen in der klinischen Entscheidungsunterstützung in der Rheumatologie 风湿学临床决策支助中大型和小型语言模型的实用性及实用性考虑 2507.07983v1 -
563 07-10 Why is Your Language Model a Poor Implicit Reward Model? Warum ist Ihr Sprachmodell ein schlechtes Implizit-Reward-Modell? 为什么您的语言模式 是一个贫穷的隐含奖赏模式? 2507.07981v1 -
564 07-10 Long-Form Speech Generation with Spoken Language Models Langformige Sprachgenerierung mit gesprochenen Sprachmodellen 具有口言语言模式的长形式语音一代 2412.18603v2 -
565 07-10 Scaling RL to Long Videos Skalierung von RL zu langen Videos 缩放 RL 到长视频 2507.07966v1 -
566 07-10 MIRIX: Multi-Agent Memory System for LLM-Based Agents MIRIX: Multi-Agent-Speichersystem für LLM-basierte Agenten MIRIX:LLM药剂多机构内存系统 2507.07957v1 -
567 07-10 SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment SAGE: Ein visuelles Sprachmodell zur Erkennung von Anomalien durch Fact Enhancement und Entropy-aware Alignment SAGE:通过事实增强和对子对子体认知校正进行反常检测的视觉语言模型 2507.07939v1 -
568 07-10 Long Context Scaling: Divide and Conquer via Multi-Agent Question-driven Collaboration Long Context Scaling: Teilen und Erobern durch multi-agent question-driven Collaboration 长期范围:通过多代理问题驱动的协作实现分化和征服 2505.20625v2 -
569 07-10 Investigating Context-Faithfulness in Large Language Models: The Roles of Memory Strength and Evidence Style Kontexttreue in großen Sprachmodellen untersuchen: Die Rollen der Gedächtnisstärke und des Evidenzstils 调查大语言模型的内情:记忆力和证据风格的作用 2409.10955v2 -
570 07-10 A Survey on Latent Reasoning Eine Umfrage über latente Vernunft A. 关于长期原因的调查 2507.06203v2 -
571 07-10 Automating MD simulations for Proteins using Large language Models: NAMD-Agent Automatisierung von MD-Simulationen für Proteine mit großen Sprachmodellen: NAMD-Agent 使用大语言模型自动化蛋白质的MD模拟:NAMD-Agent 2507.07887v1 -
572 07-10 When Dialects Collide: How Socioeconomic Mixing Affects Language Use Wenn Dialekte zusammenstoßen: Wie sich die sozioökonomische Mischung auf den Sprachgebrauch auswirkt 当对调时:社会经济混合如何影响语言使用 2307.10016v2 -
573 07-10 Evaluating Robustness of Large Audio Language Models to Audio Injection: An Empirical Study Bewertung der Robustheit von großen Audio-Sprachmodellen zur Audio-Einspritzung: Eine empirische Studie 评估大音频语言模型对音频注射的威力:经验研究 2505.19598v2 -
574 07-10 DocCHA: Towards LLM-Augmented Interactive Online diagnosis System DocCHA: Auf dem Weg zum LLM-Augmented Interactive Online-Diagnosesystem DocCHA:争取建立LLM-增强的互动式在线诊断系统 2507.07870v1 -
575 07-10 Alpay Algebra V: Multi-Layered Semantic Games and Transfinite Fixed-Point Simulation Alpay Algebra V: Multi-Layered Semantic Games und Transfinite Fixed-Point Simulation Alpay Algebra V:多语言语义运动会和跨线固定点模拟 2507.07868v1 -
576 07-10 Skywork-R1V3 Technical Report Technischer Bericht Skywork-R1V3 Skywork-R1V3 技术报告 2507.06167v3 -
577 07-10 Principled Foundations for Preference Optimization Prinzipierte Grundlagen für die Preference-Optimierung 最优化原则基金会 2507.07855v1 -
578 07-10 From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems Von der Ambiguität zur Genauigkeit: Der transformative Effekt der Koreferenzlösung auf retrieval-augmentierte Erzeugungssysteme 从模糊到准确性:关于回收-提款一代系统的共同决议的变革效应 2507.07847v1 -
579 07-10 None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks Keiner der anderen: eine allgemeine Technik zur Unterscheidung von der Erinnerung an Multiple-Choice-LLM-Bewertungs-Benchmarks 其他无其他:在多杯LLM评价基准中区分与记忆化区别理由的一般技术 2502.12896v5 -
580 07-10 Constrain Alignment with Sparse Autoencoders Beschränkung der Ausrichtung mit Sparse Autoencodern 与 Sparse 自动对齐 2411.07618v4 -
581 07-10 Unsupervised Morphological Tree Tokenizer Unüberwachter morphologischer Baum Tokenizer 不受监督的病理树化器 2406.15245v2 -
582 07-10 MAEBE: Multi-Agent Emergent Behavior Framework MAEBE: Multi-Agent Emergent Behavior Framework 多边代理新兴行为框架 2506.03053v2 -
583 07-10 The Thin Line Between Comprehension and Persuasion in LLMs Die dünne Linie zwischen Verständnis und Überzeugung in LLMs LLMM 理解与劝导之间的细细线 2507.01936v2 -
584 07-10 Conditional Unigram Tokenization with Parallel Data Bedingte Unigramm-Tokenisierung mit parallelen Daten 附带平行数据的条件性大学招式 2507.07824v1 -
585 07-10 Understanding and Controlling Repetition Neurons and Induction Heads in In-Context Learning Verständnis und Kontrolle von Wiederholungsneuronen und Induktionsköpfen im In-Context-Lernen 了解和控制再生中新中世纪和内文学习中的上岗负责人 2507.07810v1 -
586 07-10 Bridging Logic and Learning: Decoding Temporal Logic Embeddings via Transformers Überbrückung von Logik und Lernen: Dekodierung von Temporal Logic-Embeddings über Transformer 架桥逻辑与学习:通过变形器解码时时逻辑嵌入 2507.07808v1 -
587 07-10 Decoding AI Judgment: How LLMs Assess News Credibility and Bias Entschlüsselung des AI-Urteils: Wie LLMs neue Glaubwürdigkeit und Bias bewerten 证明AI 判决:LLMs如何评估新闻信誉和Bias 2502.04426v2 -
588 07-10 Understanding Chain-of-Thought in LLMs through Information Theory Verständnis der in LLMs durch Informationstheorie gesuchten Gedankenkette 通过信息理论在LLM 中探索了解链 2411.11984v2 -
589 07-10 When Large Language Models Meet Law: Dual-Lens Taxonomy, Technical Advances, and Ethical Governance Wenn große Sprachmodelle das Recht erfüllen: Dual-Lens-Taxonomie, technischer Fortschritt und ethische Governance 当大语言模式符合法律时:双重语言分类、技术进步和道德治理 2507.07748v1 -
590 07-10 Code-Switching in End-to-End Automatic Speech Recognition: A Systematic Literature Review Code-Switching in End-to-End Automatische Spracherkennung: Ein systematischer Literaturbericht 端至端自动语音识别代码转换:系统文学审查 2507.07741v1 -
591 07-10 GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing GuardVal: Dynamic Large Language Model Jailbreak Evaluation für umfassende Sicherheitstests 警卫:综合安全测试动态大语言示范监狱防爆评价 2507.07735v1 -
592 07-10 Not All Preferences are What You Need for Post-Training: Selective Alignment Strategy for Preference Optimization Nicht alle Präferenzen sind das, was Sie für das Post-Training benötigen: Selektive Ausrichtungsstrategie für die Preference-Optimierung 并非所有的优惠都是培训后需要的:选择性的优化优化战略 2507.07725v1 -
593 07-10 Stable Preference Optimization for LLMs: A Bilevel Approach Beyond Direct Preference Optimization Stabile Preference-Optimierung für LLMs: Ein zweistufiger Ansatz über die direkte Preference-Optimierung hinaus 对LLLMM公司的稳定优惠优化:超越直接优惠优化的双级办法 2507.07723v1 -
594 07-10 Rethinking the Privacy of Text Embeddings: A Reproducibility Study of “Text Embeddings Reveal (Almost) As Much As Text” Die Privatsphäre von Text-Embeddings neu denken: Eine Reproduzierbarkeitsstudie von “Text-Embeddings Reveal (fast) So viel wie Text” 重新思考文字嵌入的隐私:关于“文字嵌入流(几乎)与文字一样”的可复制性研究 2507.07700v1 -
595 07-10 What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training Was wissen selbstüberwachte Sprachmodelle über Niederländisch? Analysieren von Vorteilen sprachspezifischer Vorausbildung 自我监督的演讲模式对荷兰语了解多少? 分析具体语言培训前培训的优势 2506.00981v2 -
596 07-10 KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities KeyKnowledgeRAG (K^2RAG): Eine verbesserte RAG-Methode zur Verbesserung der LLM-Fragestellung KeyknowledgeraG(K2RAG):改进LLM问答能力的强化RAG方法 2507.07695v1 -
597 07-10 SAS: Simulated Attention Score SAS: Simulierter Aufmerksamkeits-Score SAS:模拟关注计分 2507.07694v1 -
598 07-10 Hierarchical Bracketing Encodings for Dependency Parsing as Tagging Hierarchische Bracketing-Encodings für Dependency Parsing als Tagging 将依赖性剖析作为拖贴 2505.11693v2 -
599 07-10 Investigating Co-Constructive Behavior of Large Language Models in Explanation Dialogues Ko-Konstruktives Verhalten von großen Sprachmodellen in Erklärungsdialogen untersuchen 解释对话中大语言模式的共同调查行为 2504.18483v2 -
600 07-10 Improving Cross-lingual Representation for Semantic Retrieval with Code-switching Verbesserung der Cross-lingual Darstellung für semantische Retrieval mit Code-Schaltung 使用代码转换法改进语义检索的跨语种代表性 2403.01364v2 -
601 07-10 Less Stress, More Privacy: Stress Detection on Anonymized Speech of Air Traffic Controllers Weniger Stress, mehr Datenschutz: Stresserkennung auf anonymisierter Sprache von Fluglotsen 减少压力,增加隐私:在空中交通管制员匿名讲话中发现压力 2507.08882v1 -
602 07-10 Beyond Hate Speech: NLP’s Challenges and Opportunities in Uncovering Dehumanizing Language Beyond Hate Speech: NLPs Herausforderungen und Chancen beim Enthumanisieren der Sprache 超越仇恨言论:NLP在揭开非人化语言方面的挑战和机遇 2402.13818v2 -
603 07-10 An Automated Length-Aware Quality Metric for Summarization Ein Automatisiertes Längen-Bewusst-Qualitäts-Metrik für die Zusammenfassung 用于汇总的自动长软件质量计量器 2507.07653v1 -
604 07-10 Lost in Pronunciation: Detecting Chinese Offensive Language Disguised by Phonetic Cloaking Replacement Lost in Pronunciation: Chinesische Offensive Sprache entdecken, verkleidet durch phonetische Umkleide-Ersatz 失落于发音中:发现因替换语音内衣而变形的中国进攻性语言 2507.07640v1 -
605 07-10 FrugalRAG: Learning to retrieve and reason for multi-hop QA FrugalRAG: Lernen zum Abrufen und Grund für Multi-Hop-QA FrugalRAG:学会检索和多呼QA的理由 2507.07634v1 -
606 07-10 Towards a cognitive architecture to enable natural language interaction in co-constructive task learning Auf dem Weg zu einer kognitiven Architektur, um natürliche Sprachinteraktion im co-konstruktiven Aufgabenlernen zu ermöglichen 建立一个认知结构,在共同建设性任务学习中促成自然语言互动 2503.23760v2 -
607 07-10 Comparative sentiment analysis of public perception: Monkeypox vs. COVID-19 behavioral insights Vergleichende Stimmungsanalyse der öffentlichen Wahrnehmung: Monkeypox vs. COVID-19 Verhaltenseinblicke 对公众感知的比较情绪分析:天花对COVID-19行为洞察力 2505.07430v2 -
608 07-10 Exploring the Limits of Model Compression in LLMs: A Knowledge Distillation Study on QA Tasks Erforschung der Grenzen der Modellkompression in LLMs: Eine Studie zur Wissensdestillation über QA-Aufgaben 探索LLMM中模型压缩的限度:关于质量保证任务的知识积累研究 2507.07630v1 -
609 07-10 Good/Evil Reputation Judgment of Celebrities by LLMs via Retrieval Augmented Generation Gute/böse Reputation Urteil von Prominenten durch LLMs über retrieval Augmented Generation LLMs通过回回子增量一代对名词的良好/负面评奖判决 2503.14382v2 -
610 07-10 Enhancing Vaccine Safety Surveillance: Extracting Vaccine Mentions from Emergency Department Triage Notes Using Fine-Tuned Large Language Models Verbesserung der Überwachung der Sicherheit von Impfstoffen: Extraktion von Impfstoff-Ernährungen aus der Notaufnahme der Notaufnahme mit fein dosierten großen Sprachmodellen 加强疫苗安全监督:紧急部门使用精美大语言模型的 “ 特里吉语说明 “ 引用的 “ 提取 “ 疫苗 “ 提示 2507.07599v1 -
611 07-10 Beyond Overcorrection: Evaluating Diversity in T2I Models with DivBench Jenseits von Überkorrektur: Bewertung von Diversität in T2I-Modellen mit DivBench 超越过度纠正:在DivBench的T2I模型中评估多样性 2507.03015v2 -
612 07-10 Improving Clustering on Occupational Text Data through Dimensionality Reduction Verbesserung der Clusterbildung auf berufsbezogenen Textdaten durch Dimensionalitätsreduzierung 通过减少分量改进职业文本数据集群化 2507.07582v1 -
613 07-10 COALA: Numerically Stable and Efficient Framework for Context-Aware Low-Rank Approximation COALA: Numerisch stabiles und effizientes Framework für kontextabhängige Low-Rank-Annäherung COALA: 低 Rank 上下低敏度接近度的数值稳定、高效框架 2507.07580v1 -
614 07-10 Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation Ein-zu-Mix Modalität Ausrichtung mit multimodalen Großsprachenmodellen für die Übersetzung von Dokumentenbildmaschinen 单一至混合模式与文件图像机机翻译多式大语言模式 2507.07572v1 -
615 07-10 video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models Video-SALMONN 2: Bildunterschrift-verbesserte Audio-Visuelle große Sprachmodelle 视频-SALMONN2:字幕-强化视听大语言模式 2506.15220v2 -
616 07-10 The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs Synergy Dilemma von Long-CoT SFT und RL: Untersuchung von Post-Training-Techniken zur Begründung von VLMs Long-CoT SFT和RL的协同问题:调查培训后用于说明理由的训练后技术 2507.07562v1 -
617 07-10 Multi-Head RAG: Solving Multi-Aspect Problems with LLMs Multi-Head RAG: Lösung von Multi-Aspect-Problemen mit LLMs 多方主管RAG:解决多领域问题与LLM 2406.05085v4 -
618 07-10 The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora Die Cross-Lingual Cost: Retrieval Biases in RAG über arabisch-englische Corpora 跨语言成本:通过阿拉伯语-英语公司在RAG中检索到阿拉伯语-英语公司 2507.07543v1 -
619 07-10 CEA-LIST at CheckThat! 2025: Evaluating LLMs as Detectors of Bias and Opinion in Text CEA-LIST bei CheckThat! 2025: Bewertung von LLMs als Detektoren von Bias und Meinung im Text CEA-LIST 校对:CEA-LIST 校对:2025年 2507.07539v1 -
620 07-10 CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks CheckEmbed: Effektive Überprüfung von LLM-Lösungen auf offene Aufgaben 复选对象:有效核查对不限名额任务LLM解决方案的有效核查 2406.02524v5 -
621 07-10 Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models Thought Crime: Hintertüren und Emergent-Missausrichtung in vernünftigen Modellen 思想犯罪:后门和合理理由模型中新出现的不协调现象 2506.13206v2 -
622 07-10 Triadic Multi-party Voice Activity Projection for Turn-taking in Spoken Dialogue Systems Triadische Mehrparteien-Stimme-Aktivitätsprojektion für Turn-Take in gesprochenen Dialogsystemen 三部 “ 三部三部 “ 口语对话系统翻转式多党声音活动项目 2507.07518v1 -
623 07-10 Toward Real-World Chinese Psychological Support Dialogues: CPsDD Dataset and a Co-Evolving Multi-Agent System Auf dem Weg zu echten chinesischen Psychologischen Unterstützungsdialogen: CPsDD-Datensatz und ein gemeinsames Multi-Agenten-System 走向现实世界的中国心理支持对话:CPsDD数据集和共同演进的多行为者系统 2507.07509v1 -
624 07-10 Enhancing Transformers for Generalizable First-Order Logical Entailment Erweiterung der Transformer für generalisierbare Logical Entailment erster Ordnung 增强通用一级一级逻辑元件的变压器 2501.00759v3 -
625 07-10 Extracting ORR Catalyst Information for Fuel Cell from Scientific Literature Gewinnung von ORR-Katalysatorinformationen für Brennstoffzelle aus wissenschaftlicher Literatur 从科学文献中提取用于燃料电池的 ORR 催化器信息 2507.07499v1 -
626 07-10 PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving PLAN-TUNING: Sprachmodelle nach dem Training lernen Schritt-für-Schritt-Planung für komplexe Problemlösung 规划 – – 规划 – – 培训后语言模式,以学习逐步规划解决复杂问题的模式 2507.07495v1 -
627 07-10 SimSUM: Simulated Benchmark with Structured and Unstructured Medical Records SimSUM: Simulierter Benchmark mit strukturierten und unstrukturierten medizinischen Aufzeichnungen SimSUM:与结构化和非结构化医疗记录模拟基准 2409.08936v3 -
628 07-10 Affordable AI Assistants with Knowledge Graph of Thoughts Erschwingliche KI-Assistenten mit Wissensgrafik der Gedanken 具有知识思想知识图的负担得起的AI助理 2504.02670v5 -
629 07-10 Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models Machine Bullshit: Charakterisieren der Emergenten Missachtung der Wahrheit in großen Sprachmodellen 机器胡说:在大语言模型中突出新人无视真相的特点 2507.07484v1 -
630 07-10 Mixture of Group Experts for Learning Invariant Representations Mixtur von Gruppenexperten für Learning Invariante Repräsentationen 学习不稳定代表小组专家混合 2504.09265v2 -
631 07-10 ixi-GEN: Efficient Industrial sLLMs through Domain Adaptive Continual Pretraining ixi-GEN: Effiziente industrielle sLLMs durch Domain Adaptive Continual Pretraining ixi-GEN:通过远程适应性连续训练前,提高工业低温生产效率 2507.06795v2 -
632 07-10 Structure Guided Large Language Model for SQL Generation Struktur Geführtes großes Sprachmodell für SQL-Generierung SQL 生成引导大语言模式 2402.13284v4 -
633 07-10 RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning RLEP: Verstärktes Lernen mit Erfahrungsreplay für LLM-Reasoning RLEP: 强化学习,经验重现LLM 理由推理 2507.07451v1 -
634 07-10 Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving Agent KB: Nutzung von Cross-Domain-Erfahrungen für die Lösung Agentischer Probleme Agent KB: 利用跨域经验解决代理问题 2507.06229v2 -
635 07-10 SAND: Boosting LLM Agents with Self-Taught Action Deliberation SAND: LLM-Agenten mit selbsterzogener Handlungsberatung stärken SAND:促进具有自学行动考虑的LLM代理 2507.07441v1 -
636 07-10 Towards Interpretable Time Series Foundation Models Auf dem Weg zu interpretierbaren Zeitreihen-Grundmodellen 迈向可解释时间序列基础模型 2507.07439v1 -
637 07-10 SynthEHR-Eviction: Enhancing Eviction SDoH Detection with LLM-Augmented Synthetic EHR Data SynthEHR-Eviction: Verbesserung der Eviction SDoH-Erkennung mit LLM-Augmented Synthetic EHR Data 合成EHR-驱逐:利用LLM-增强的合成电子HR数据加强驱逐SDoH探测 2507.07421v1 -
638 07-10 MedReadCtrl: Personalizing medical text generation with readability-controlled instruction learning MedReadCtrl: Personalisierung medizinischer Textgenerierung mit Lesbarkeitsgesteuertem Unterricht MedReadCtrl: 使医疗文本生成个性化,并进行可读性控制教学学习 2507.07419v1 -
639 07-10 May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks Darf ich Ihre Aufmerksamkeit haben? Breaking Fine-Tuning basierte Prompt Injection Defenses mit Architektur-Aware Attacken 请大家注意,使用建筑软件攻击 突破基于精密发射的快速喷射防御系统 2507.07417v1 -
640 07-10 Inter-linguistic Phonetic Composition (IPC): A Theoretical and Computational Approach to Enhance Second Language Pronunciation Interlinguistische phonetische Komposition (IPC): Ein theoretischer und rechnerischer Ansatz, um die zweite Sprache zu verbessern 语言间音音组成:加强第二语言发音的理论和计算方法 2411.10927v3 -
641 07-10 TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning TART: Ein Open-Source Tool-Augmented Framework für erklärbare Tabellen-basierte Begründung TART: 开放源码工具推荐框架,用于说明基于表格的理由 2409.11724v3 -
642 07-10 GNN-CNN: An Efficient Hybrid Model of Convolutional and Graph Neural Networks for Text Representation GNN-CNN: Ein effizientes Hybridmodell für konvolutionäre und Graphen-Neuralnetzwerke zur Textdarstellung GNN-CNN: 用于文本代表的动态和图形神经网络的有效混合模型 2507.07414v1 -
643 07-10 CoAM: Corpus of All-Type Multiword Expressions CoAM: Corpus von Multiwort-Ausdrücken aller Art CoAM: 全类型多字表达式组合体 2412.18151v3 -
644 07-10 Rethinking Verification for LLM Code Generation: From Generation to Testing Überprüfung der LLM-Code-Generierung neu denken: Von der Generation zur Prüfung 重新思考LLM 代码生成的核查:从生成到测试 2507.06920v2 -
645 07-10 Large Language Model for Extracting Complex Contract Information in Industrial Scenes Großes Sprachmodell zur Extraktion komplexer Vertragsinformationen in Industrieszenen 工业景点复杂合同信息提取大语言模型 2507.06539v2 -
646 07-10 BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems BountyBench: Dollar-Impact von KI-Agenten-Angriffen und Verteidigern auf reale Cybersicherheitssysteme BountyBench: AI代理攻击者和捍卫者对现实世界网络安全系统的美元影响 2505.15216v2 -
647 07-10 Bradley-Terry and Multi-Objective Reward Modeling Are Complementary Bradley-Terry und Multi-Objective Reward Modeling sind komplementär Bradley-Terry和多目标奖励模型具有补充作用 2507.07375v1 -
648 07-10 Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing Krul: Effiziente Zustandsrestauration für Multiturn-Gespräche mit dynamischem Cross-Layer-KV-Sharing Krul:通过动态跨层KV共享,高效恢复国家多方向对话 2507.08045v1 -
649 07-10 Shifting from Ranking to Set Selection for Retrieval Augmented Generation Wechsel vom Ranking zur Einstellungsauswahl für retrieval Augmented Generation 从排位移到设置回收增量一代的选择 2507.06838v2
Article 0
Title@2025-07-17 (4): VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning
Title: VisionThink: Smart and Efficient Vision Language Model via Reinforcement Learning | VisionThink: Intelligentes und effizientes Vision-Sprachmodell durch Verstärkungslernen | 远景设想:通过强化学习建立聪明、高效的愿景语言模式 2507.13348v1 |
Authors (6): Senqiao Yang, Junyi Li, Xin Lai, Bei Yu, Hengshuang Zhao, Jiaya Jia
Recent advancements in vision-language models (VLMs) have improved performance by increasing the number of visual tokens, which are often significantly longer than text tokens. However, we observe that most real-world scenarios do not require such an extensive number of visual tokens. While the performance drops significantly in a small subset of OCR-related tasks, models still perform accurately in most other general VQA tasks with only 1/4 resolution. Therefore, we propose to dynamically process distinct samples with different resolutions, and present a new paradigm for visual token compression, namely, VisionThink. It starts with a downsampled image and smartly decides whether it is sufficient for problem solving. Otherwise, the model could output a special token to request the higher-resolution image. Compared to existing Efficient VLM methods that compress tokens using fixed pruning ratios or thresholds, VisionThink autonomously decides whether to compress tokens case by case. As a result, it demonstrates strong fine-grained visual understanding capability on OCR-related tasks, and meanwhile saves substantial visual tokens on simpler tasks. We adopt reinforcement learning and propose the LLM-as-Judge strategy to successfully apply RL to general VQA tasks. Moreover, we carefully design a reward function and penalty mechanism to achieve a stable and reasonable image resize call ratio. Extensive experiments demonstrate the superiority, efficiency, and effectiveness of our method. Our code is available at https://github.com/dvlab-research/VisionThink.
最近在视觉语言模型(VLM)方面的进步通过增加视觉象征物的数量而改善了业绩,视觉象征物的数量往往比文字象征物要长得多。然而,我们注意到,大多数真实世界情景并不要求如此大量的视觉象征物。虽然在与OCR有关的小部分任务中表现显著下降,但在大多数其他通用VQA任务中,模型仍然以1/4的分辨率来准确地执行。因此,我们提议动态地处理不同分辨率的样本,并提供一个视觉象征物压缩的新模式,即VisionThink。它从下采样图像开始,并明智地决定它是否足以解决问题。否则,该模型可以输出一个特别象征物来请求更高分辨率的图像。与现有的使用固定剪枝比或阈值压缩象征物的高效VLM方法相比,VisionThink自主地逐例决定是否压缩象征物。因此,它在OCR相关任务上表现出很强的精细视觉理解能力,同时在更简单的任务上节省大量视觉象征物。我们采用强化学习,并提出LLM-as-Judge策略,将强化学习成功应用于通用VQA任务。此外,我们精心设计了奖励函数和惩罚机制,以实现稳定合理的图像缩放调用比例。大量实验证明了我们方法的优越性、效率和有效性。我们的代码可在 https://github.com/dvlab-research/VisionThink 获取。
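The control flow described in the VisionThink abstract (answer from a downsampled image, or emit a special token to request the higher-resolution one) can be sketched as below. This is a minimal illustration only, not the authors' implementation: the token string `REQUEST_HIGH_RES`, the function `answer_with_optional_upscale`, and the stand-in `toy_model` are all hypothetical names, and the reinforcement-learning training that teaches the model when to request more detail is not shown.

```python
from typing import Callable, Dict, Tuple

# Hypothetical special token; the abstract does not say what the real one looks like.
REQUEST_HIGH_RES = "<request_high_res>"

Image = Dict[str, float]              # toy image: only carries a "resolution" scale
Model = Callable[[Image, str], str]   # any callable (image, question) -> text


def answer_with_optional_upscale(
    model: Model, image_low: Image, image_high: Image, question: str
) -> Tuple[str, bool]:
    """Try the downsampled image first; fall back to high resolution on request.

    Returns (answer, used_high_res).
    """
    first = model(image_low, question)
    if first.strip() != REQUEST_HIGH_RES:
        # The model judged the low-resolution image sufficient.
        return first, False
    # The model asked for more detail: pay for the extra visual tokens once.
    return model(image_high, question), True


def toy_model(image: Image, question: str) -> str:
    """Stand-in model (assumption): only text-reading questions need full resolution."""
    if "read" in question and image["resolution"] < 1.0:
        return REQUEST_HIGH_RES
    return f"answer@{image['resolution']}"


if __name__ == "__main__":
    low, high = {"resolution": 0.25}, {"resolution": 1.0}
    print(answer_with_optional_upscale(toy_model, low, high, "what color is the car?"))
    print(answer_with_optional_upscale(toy_model, low, high, "read the sign"))
```

In this sketch the cost saving comes from the first branch: easy questions never touch the high-resolution image, while OCR-like questions pay for one extra model call.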
Article 1
Title@2025-07-17 (4): DeFine: Decision-Making with Analogical Reasoning over Factor Profiles
Title: DeFine: Decision-Making with Analogical Reasoning over Factor Profiles | DeFine: Entscheidungsfindung mit analogischer Begründung über Faktorprofile | DeFine: 与因子剖析档的模拟理由有关的决策 2410.01772v2 |
Authors (8): Yebowen Hu, Xiaoyang Wang, Wenlin Yao, Yiming Lu, Daoan Zhang, Hassan Foroosh, Dong Yu, Fei Liu
LLMs are ideal for decision-making thanks to their ability to reason over long contexts. However, challenges arise when processing speech transcripts that describe complex scenarios, as they are verbose and include repetition, hedging, and vagueness. For example, during a company’s earnings call, an executive might project a positive revenue outlook to reassure investors, despite uncertainty regarding future earnings. It is crucial for LLMs to incorporate this uncertainty systematically when making decisions. In this paper, we introduce DeFine, a modular framework that constructs probabilistic factor profiles from complex scenarios. It then integrates these profiles with analogical reasoning, leveraging insights from similar past experiences to guide LLMs in making critical decisions in new situations. Our framework separates the tasks of quantifying uncertainty and incorporating it into LLM decision-making. This approach is particularly useful in areas such as consulting and financial deliberation, where making decisions under uncertainty is vital.
有限责任公司是决策的理想,因为它们有能力从长计议。然而,在处理描述复杂情景的演讲记录时会遇到一些挑战,因为这些记录是含糊不清的,包括重复、套期保值和模糊不清。例如,在公司的收入征召期间,执行官可能会预测积极的收入前景,使投资者放心,尽管未来收入不确定。对于有限责任公司来说,在决策时系统地纳入这种不确定性至关重要。在本文件中,我们引入了一个模块框架,从复杂情景中构建概率因素剖面。然后将这些特征与模拟推理结合起来,利用过去类似经验的洞察力指导有限责任公司在新形势下作出关键决定。我们的框架将量化不确定性并将其纳入有限责任公司的决策中的任务区分开来。这一方法在咨询和财务审议等领域特别有用,因为在这些方面,在不确定情况下作出决定至关重要。
Article 2
Title@2025-07-17 (4): Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes
Title: Comparing Apples to Oranges: A Dataset & Analysis of LLM Humour Understanding from Traditional Puns to Topical Jokes | Vergleich von Äpfeln mit Orangen: Ein Datensatz & Analyse des LLM Humorverständnisses von traditionellen Puns zu thematischen Witzen | 将苹果与橙类比较:从传统Puns到专题笑话的LLM Humour理解数据集和分析 2507.13335v1 |
Authors (3): Tyler Loakman, William Thorne, Chenghua Lin
Humour, as a complex language form, is derived from myriad aspects of life, whilst existing work on computational humour has focussed almost exclusively on short pun-based jokes. In this work, we investigate whether the ability of Large Language Models (LLMs) to explain humour depends on the particular humour form. We compare models on simple puns and more complex topical humour that requires knowledge of real-world entities and events. In doing so, we curate a dataset of 600 jokes split across 4 joke types and manually write high-quality explanations. These jokes include heterographic and homographic puns, contemporary internet humour, and topical jokes, where understanding relies on reasoning beyond “common sense”, rooted instead in world knowledge regarding news events and pop culture. Using this dataset, we compare the zero-shot abilities of a range of LLMs to accurately and comprehensively explain jokes of different types, identifying key research gaps in the task of humour explanation. We find that none of the tested models (inc. reasoning models) are capable of reliably generating adequate explanations of all joke types, further highlighting the narrow focus of most works in computational humour on overly simple joke forms.
幽默是一种复杂的语言形式,来自生活的方方面面,而目前关于计算幽默的工作几乎完全集中在短短的基于Pun的笑话上。 在这项工作中,我们调查大语言模型(LLMs)解释幽默的能力是否取决于特定的幽默形式。我们比较简单的双关语模型模型和要求了解现实世界实体和事件的更复杂的时事幽默模型。我们这样做时,我们整理出一套600个笑话的数据集,分为4个笑话类型,手写高质量的解释。这些笑话包括肝科和全息图、当代因特网幽默和时事笑话,其中的理解依赖于“常识”以外的推理,而植根于关于新闻事件和流行文化的世界知识。我们利用这一数据集,比较一系列LLMs的零光能力,以准确和全面地解释不同种类的笑话,找出幽默解释任务中的关键研究差距。我们发现,经过测试的模型(inc. 推理模型)没有一个能够可靠地充分解释所有笑话类型,进一步突出大多数计算笑话的狭隘重点。
Article 3
Title@2025-07-17 (4): A Survey of Context Engineering for Large Language Models
Title: A Survey of Context Engineering for Large Language Models | Eine Übersicht über Kontext-Engineering für große Sprachmodelle | 大语言模型背景工程调查 2507.13334v1 |
Authors (15): Lingrui Mei, Jiayu Yao, Yuyao Ge, Yiwei Wang, Baolong Bi, Yujun Cai, Jiazhi Liu, Mingyu Li, Zhong-Zhi Li, Duzhen Zhang, Chenlin Zhou, Jiayi Mao, Tianze Xia, Jiafeng Guo, Shenghua Liu
The performance of Large Language Models (LLMs) is fundamentally determined by the contextual information provided during inference. This survey introduces Context Engineering, a formal discipline that transcends simple prompt design to encompass the systematic optimization of information payloads for LLMs. We present a comprehensive taxonomy decomposing Context Engineering into its foundational components and the sophisticated implementations that integrate them into intelligent systems. We first examine the foundational components: context retrieval and generation, context processing and context management. We then explore how these components are architecturally integrated to create sophisticated system implementations: retrieval-augmented generation (RAG), memory systems and tool-integrated reasoning, and multi-agent systems. Through this systematic analysis of over 1300 research papers, our survey not only establishes a technical roadmap for the field but also reveals a critical research gap: a fundamental asymmetry exists between model capabilities. While current models, augmented by advanced context engineering, demonstrate remarkable proficiency in understanding complex contexts, they exhibit pronounced limitations in generating equally sophisticated, long-form outputs. Addressing this gap is a defining priority for future research. Ultimately, this survey provides a unified framework for both researchers and engineers advancing context-aware AI.
大语言模型(LLMS)的性能基本取决于在推论期间提供的背景资料。本调查介绍了背景工程,这是一个正式的学科,它超越了简单的迅速设计,包括系统地优化LLMS的信息有效载荷。我们提出了一个综合分类学,将背景工程纳入其基本组成部分,以及将其纳入智能系统的复杂实施方法。我们首先研究基本组成部分:背景检索和生成、背景处理和背景管理。然后我们探讨这些组成部分如何在建筑上融合,以创造复杂的系统实施:检索-启动的一代(RAG)、记忆系统和工具集成推理以及多试剂系统。通过对1300多份研究文件的系统分析,我们的调查不仅为实地确定了技术路线图,而且还揭示了关键的研究差距:模型能力之间存在着根本的不对称性。目前模型在先进背景工程的辅助下,在理解复杂背景方面表现出了显著的熟练程度,但在产生同样精密的、长期的产出方面显示出明显的局限性。解决这一差距是未来研究的一个确定优先事项。最终,这项调查为推进背景对准研究的研究人员和工程师提供了一个统一的框架。
Article 4
Title@2025-07-17 (4): The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner
Title: The Imitation Game: Turing Machine Imitator is Length Generalizable Reasoner | Die Imitation Spiel: Turing Machine Imitator ist Länge Generalizable Reasoner | 模拟游戏:图画机器模拟器是长可概括的理由 2507.13332v1 |
Authors (7): Zhouqi Hua, Wenwei Zhang, Chengqi Lyu, Yuzhe Gu, Songyang Gao, Kuikun Liu, Kai Chen
Length generalization, the ability to solve problems of longer sequences than those observed during training, poses a core challenge for Transformer-based large language models (LLMs). Although existing studies have predominantly focused on data-driven approaches for arithmetic operations and symbolic manipulation tasks, these approaches tend to be task-specific with limited overall performance. To pursue a more general solution, this paper focuses on a broader case of reasoning problems that are computable, i.e., problems that algorithms can solve and that can therefore be solved by a Turing Machine. From this perspective, this paper proposes Turing MAchine Imitation Learning (TAIL) to improve the length generalization ability of LLMs. TAIL uses computer programs to synthesize chain-of-thought (CoT) data that imitate the execution process of a Turing Machine. It linearly expands the reasoning steps into atomic states to alleviate shortcut learning, and uses an explicit memory fetch mechanism to reduce the difficulties of dynamic and long-range data access in elementary operations. To validate the reliability and universality of TAIL, we construct a challenging synthetic dataset covering 8 classes of algorithms and 18 tasks. Without bells and whistles, TAIL significantly improves the length generalization ability as well as the performance of Qwen2.5-7B on various tasks using only synthetic data, surpassing previous methods and DeepSeek-R1. The experimental results reveal that the key concepts in the Turing Machine, instead of the thinking styles, are indispensable for TAIL for length generalization, through which the model exhibits read-and-write behaviors consistent with the properties of the Turing Machine in its attention layers. This work provides a promising direction for future research in the learning of LLM reasoning from synthetic data.
通俗化,即能够解决比培训期间所观察到的更长序列问题的能力,是变形器大型语言模型(LLM)的核心挑战。虽然现有研究主要侧重于计算操作和象征性操纵任务的数据驱动法,但这些方法往往是任务特有的,总体性能有限。为了寻求更普遍的解决方案,本文件侧重于一个较宽泛的推理问题案例,即算法可以解决的问题,因此可以通过图灵机解决。从这个角度出发,本文件建议“图灵机模仿学习(TAIL)”,以提高LLM的长度泛化能力。TAIL通过计算机程序合成模仿图灵机执行过程的思维链(CoT)数据,将推理步骤线性展开为原子状态以缓解捷径学习,并采用显式的内存读取机制以降低基本操作中动态和长程数据访问的难度。为验证TAIL的可靠性和普适性,我们构建了一个涵盖8类算法和18项任务的具有挑战性的合成数据集。TAIL不加任何花哨技巧,仅使用合成数据就显著提升了Qwen2.5-7B在各项任务上的长度泛化能力和性能,超过了以往方法和DeepSeek-R1。实验结果表明,对于TAIL的长度泛化而言,不可或缺的是图灵机中的关键概念而非思维风格,借此模型在其注意力层中表现出与图灵机特性一致的读写行为。这项工作为今后从合成数据中学习LLM推理的研究提供了一个有前景的方向。
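To make the idea of a Turing-machine-style reasoning trace concrete, the sketch below runs a small Turing machine and records one line per atomic step, including the explicit read of the cell under the head. It is only an illustration of the general idea in the TAIL abstract: the trace format, the function `run_turing_machine`, and the binary-increment machine `INCREMENT` are assumptions made here, not the data format or tasks used in the paper.

```python
from typing import Dict, List, Tuple

# (state, symbol read) -> (next state, symbol to write, head move)
Transitions = Dict[Tuple[str, str], Tuple[str, str, str]]

# Hypothetical example machine: add 1 to a binary number written on the tape.
INCREMENT: Transitions = {
    ("start", "0"): ("start", "0", "R"),   # scan right to the end of the number
    ("start", "1"): ("start", "1", "R"),
    ("start", "_"): ("carry", "_", "L"),   # passed the last digit, turn around
    ("carry", "1"): ("carry", "0", "L"),   # 1 + carry = 0, keep carrying
    ("carry", "0"): ("halt", "1", "N"),    # 0 + carry = 1, done
    ("carry", "_"): ("halt", "1", "N"),    # overflow: write a new leading 1
}


def run_turing_machine(
    tape_input: str, transitions: Transitions, max_steps: int = 1000
) -> Tuple[str, List[str]]:
    """Simulate the machine and return (final tape contents, step-by-step trace)."""
    tape = dict(enumerate(tape_input))
    state, head, blank = "start", 0, "_"
    trace: List[str] = []
    for _ in range(max_steps):
        if state == "halt":
            break
        symbol = tape.get(head, blank)  # explicit fetch of the cell under the head
        next_state, write, move = transitions[(state, symbol)]
        trace.append(
            f"state={state} head={head} read={symbol} "
            f"write={write} move={move} next={next_state}"
        )
        tape[head] = write
        head += {"L": -1, "R": 1, "N": 0}[move]
        state = next_state
    cells = [tape.get(i, blank) for i in range(min(tape), max(tape) + 1)]
    return "".join(cells).strip(blank), trace


if __name__ == "__main__":
    result, steps = run_turing_machine("1011", INCREMENT)
    print(result)
    print("\n".join(steps))
```

Each trace line depends only on the current state and the single fetched symbol, so a longer input produces a longer trace built from the same kinds of steps rather than a different kind of step.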
Article 5
Title@2025-07-17 (4): Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It
Title: Vision-and-Language Training Helps Deploy Taxonomic Knowledge but Does Not Fundamentally Alter It | Vision-and-Language Training hilft, taxonomisches Wissen zu implementieren, ändert es aber nicht grundlegend | 愿景和语言培训帮助利用分类学知识,但不能从根本上改变这种知识。 2507.13328v1 |
Authors (6): Yulu Qin, Dheeraj Varghese, Adam Dahlgren Lindström, Lucia Donatelli, Kanishka Misra, Najoung Kim
Does vision-and-language (VL) training change the linguistic representations of language models in meaningful ways? Most results in the literature have shown inconsistent or marginal differences, both behaviorally and representationally. In this work, we start from the hypothesis that the domain in which VL training could have a significant effect is lexical-conceptual knowledge, in particular its taxonomic organization. Through comparing minimal pairs of text-only LMs and their VL-trained counterparts, we first show that the VL models often outperform their text-only counterparts on a text-only question-answering task that requires taxonomic understanding of concepts mentioned in the questions. Using an array of targeted behavioral and representational analyses, we show that the LMs and VLMs do not differ significantly in terms of their taxonomic knowledge itself, but they differ in how they represent questions that contain concepts in a taxonomic relation vs. a non-taxonomic relation. This implies that the taxonomic knowledge itself does not change substantially through additional VL training, but VL training does improve the deployment of this knowledge in the context of a specific task, even when the presentation of the task is purely linguistic.
视觉语言(VL)培训是否以有意义的方式改变了语言模式的语言表述方式?文献中的大多数结果都显示行为和代表性两方面的不一致或边际差异。在这项工作中,我们首先假设VL培训可以产生显著影响的领域是词汇概念知识,特别是其分类组织。我们通过比较只有文本的LMs及其经过VL培训的对应人员,首先表明VL模式往往在只回答文本的问答任务上超越只回答文本的对应人员,这需要对问题中提及的概念进行分类理解。我们利用一系列有针对性的行为和代表性分析,表明LMs和VLMs在分类知识本身方面没有很大区别,但是它们代表的问题在包含一种分类关系的概念与非分类关系方面有所不同。这意味着通过额外的VL培训,分类知识本身不会发生重大变化,但是VL培训可以改进在特定任务中应用这种知识的情况,即使提出的任务纯粹是语言任务。
Article 6
Title@2025-07-17 (4): Social and Political Framing in Search Engine Results
Title: Social and Political Framing in Search Engine Results | Soziale und politische Framing in Suchmaschinen-Ergebnissen | 寻找引擎结果中的社会和政治形式 2507.13325v1 |
Authors (2): Amrit Poudel, Tim Weninger
Search engines play a crucial role in shaping public discourse by influencing how information is accessed and framed. While prior research has extensively examined various dimensions of search bias – such as content prioritization, indexical bias, political polarization, and sources of bias – an important question remains underexplored: how do search engines and ideologically-motivated user queries contribute to bias in search results? This study analyzes the outputs of major search engines using a dataset of political and social topics. The findings reveal that search engines not only prioritize content in ways that reflect underlying biases but also that ideologically-driven user queries exacerbate these biases, resulting in the amplification of specific narratives. Moreover, significant differences were observed across search engines in terms of the sources they prioritize. These results suggest that search engines may play a pivotal role in shaping public perceptions by reinforcing ideological divides, thereby contributing to the broader issue of information polarization.
搜索引擎通过影响如何获取和构建信息的方式,在塑造公共讨论方面发挥着关键作用。虽然先前的研究广泛研究了搜索偏见的各个方面 – – 如内容的优先排序、指数偏见、政治两极分化和偏见的来源 – – 一个重要的问题仍未得到探讨:搜索引擎和出于意识形态动机的用户查询如何在搜索结果方面造成偏差?这项研究利用政治和社会议题的数据集分析了主要搜索引擎的产出。研究结果显示,搜索引擎不仅以反映基本偏见的方式排列内容的优先次序,而且由意识形态驱动的用户查询也加剧了这些偏差,从而导致具体叙述的扩大。此外,在搜索引擎中发现,它们在其优先选择的来源方面存在重大差异。这些结果表明,搜索引擎可能通过加强意识形态鸿沟,从而推动更广泛的信息两极化问题,从而在塑造公众认识方面发挥关键作用。
Article 7
Title@2025-07-17 (4): HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals
Title: HapticCap: A Multimodal Dataset and Task for Understanding User Experience of Vibration Haptic Signals | HapticCap: Ein multimodaler Datensatz und die Aufgabe, die Benutzererfahrung von Schwingungshaptischen Signalen zu verstehen | HapticCap:多模式数据集和了解用户振动信号信号体验的任务 2507.13318v1 |
Authors (3): Guimin Hu, Daniel Hershcovich, Hasti Seifi
Haptic signals, from smartphone vibrations to virtual reality touch feedback, can effectively convey information and enhance realism, but designing signals that resonate meaningfully with users is challenging. To facilitate this, we introduce a multimodal dataset and a task of matching user descriptions to vibration haptic signals, and highlight two primary challenges: (1) the lack of large haptic vibration datasets annotated with textual descriptions, as collecting haptic descriptions is time-consuming, and (2) the limited capability of existing tasks and models to describe vibration signals in text. To advance this area, we create HapticCap, the first fully human-annotated haptic-captioned dataset, containing 92,070 haptic-text pairs for user descriptions of sensory, emotional, and associative attributes of vibrations. Based on HapticCap, we propose the haptic-caption retrieval task and present the results of this task from a supervised contrastive learning framework that brings together text representations within specific categories and vibrations. Overall, the combination of language model T5 and audio model AST yields the best performance in the haptic-caption retrieval task, especially when separately trained for each description category.
从智能手机震动到虚拟现实触摸反馈等催眠信号能够有效地传递信息,增强现实感,但设计与用户产生有意义共鸣的信号却具有挑战性。为了促进这一点,我们引入了多式数据集和任务,将用户描述与振动机信号相匹配,并突出强调了两大挑战:(1) 缺乏大型机能振动数据集,并附有文字描述,作为收集随机描述的文字说明,这是费时的;(2) 现有任务和模型在文本中描述振动信号的能力有限。为了推进这个领域,我们创建了Haptic Cap,这是第一个具有充分人性说明的随机包设数据集,包含92,070个机能-文本配对,用于用户描述振动感官、情感和关联性特征。基于Haptic Cap,我们提出机能缩略图检索任务,并介绍由监督的对比学习框架得出的任务结果,该框架将文本在特定类别和振动中进行表述。总体而言,语言模型T5和音频模型AST综合了手动回任务的最佳表现,特别是在对每个描述类别进行单独培训时。
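The retrieval step of the haptic-caption task can be pictured with a minimal sketch. It assumes that captions and vibration signals have already been embedded into a shared vector space (in the abstract, by T5 and AST trained contrastively); the hand-written vectors, the function names `cosine` and `retrieve`, and the use of cosine similarity as the ranking score are illustrative assumptions, not details confirmed by the abstract.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieve(signal_vec, caption_vecs, top_k=1):
    """Rank captions by similarity to one vibration embedding.

    caption_vecs maps caption text to its (hypothetical) embedding.
    """
    ranked = sorted(
        caption_vecs,
        key=lambda caption: cosine(signal_vec, caption_vecs[caption]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy embeddings standing in for encoder outputs (assumption).
captions = {"sharp buzz": [0.9, 0.1], "soft pulse": [0.0, 1.0]}
best = retrieve([1.0, 0.0], captions)
```

In a real setup the same ranking would run over the whole caption pool, with one set of encoders per description category if the per-category training mentioned in the abstract is followed.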
Article 8
Title@2025-07-17 (4): Identifying Task Groupings for Multi-Task Learning Using Pointwise V-Usable Information
Title: Identifying Task Groupings for Multi-Task Learning Using Pointwise V-Usable Information | Ermittlung von Aufgabengruppen für Multi-Task-Lernen mit pointwise V-Usable Information | 利用有分点的V-可靠信息确定多任务学习组 2410.12774v2 |
Authors (4): Yingya Li, Timothy Miller, Steven Bethard, Guergana Savova
The success of multi-task learning can depend heavily on which tasks are grouped together. Naively grouping all tasks or a random set of tasks can result in negative transfer, with the multi-task models performing worse than single-task models. Though many efforts have been made to identify task groupings and to measure the relatedness among different tasks, it remains a challenging research topic to define a metric to identify the best task grouping out of a pool of many potential task combinations. We propose a metric of task relatedness based on task difficulty measured by pointwise V-usable information (PVI). PVI is a recently proposed metric to estimate how much usable information a dataset contains given a model. We hypothesize that tasks whose PVI estimates are not statistically different are similar enough to benefit from the joint learning process. We conduct comprehensive experiments to evaluate the feasibility of this metric for task grouping on 15 NLP datasets in the general, biomedical, and clinical domains. We compare the results of the joint learners against single learners, existing baseline methods, and recent large language models, including Llama 2 and GPT-4. The results show that by grouping tasks with similar PVI estimates, the joint learners yielded competitive results with fewer total parameters, with consistent performance across domains.
多任务学习的成功在很大程度上取决于哪些任务分组。将所有任务或随机任务组合成一组,可能会导致负转移,而多任务模式比单一任务模式差。尽管已经作出许多努力,确定任务分组并衡量不同任务之间的关联性,但确定一个指标以确定最佳任务组合的衡量标准仍是一个具有挑战性的研究专题,该指标将在许多潜在任务组合的集合中确定最佳任务分组。我们根据用点五类可用信息(PVI)衡量的任务困难,提出了任务关联性指标。PVI是最近提出的一种衡量标准,用以估计数据集所含信息有多少可用信息具有一个模型。我们假设,具有非统计性不同的PVI估计数的任务足以从联合学习进程中受益。我们进行全面实验,以评价在一般、生物医学和临床领域以15个NLP数据集为单位的任务组合的这一衡量标准的可行性。我们比较了联合学习者与单个学习者、现有基线方法以及最近的大型语言模型,包括Llama 2和GPT-4,我们提出了衡量标准。结果显示,通过将具有竞争力的学习者与相似的参数分组,结果与类似的领域一致。
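As a rough illustration of the grouping criterion, the sketch below computes PVI per instance as the gain in log-probability of the gold label when the model sees the input versus a null input, then checks whether two tasks' PVI samples differ. The permutation test, its parameters, and the function names are assumptions made for illustration; the abstract does not say which statistical test the authors use.

```python
import math
import random


def pvi(p_with_input, p_null_input):
    """PVI in bits for one instance.

    p_with_input: probability of the gold label from a model given the input.
    p_null_input: probability of the gold label from a model given a null input.
    """
    return math.log2(p_with_input) - math.log2(p_null_input)


def _mean(values):
    return sum(values) / len(values)


def permutation_p_value(a, b, n_resamples=2000, seed=0):
    """Two-sided permutation test on the difference of mean PVI (assumed test)."""
    rng = random.Random(seed)
    observed = abs(_mean(a) - _mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        left, right = pooled[: len(a)], pooled[len(a):]
        if abs(_mean(left) - _mean(right)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_resamples + 1)


def groupable(a, b, alpha=0.05):
    """Treat two tasks as groupable when their PVI samples do not differ at level alpha."""
    return permutation_p_value(a, b) >= alpha
```

Under this reading, two tasks would be candidates for joint training when `groupable` returns True for their per-instance PVI lists; the threshold `alpha` is a hypothetical choice.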
Article 9
Title@2025-07-17 (4): The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations
Title: The Generative Energy Arena (GEA): Incorporating Energy Awareness in Large Language Model (LLM) Human Evaluations | Die Generative Energy Arena (GEA): Einbeziehung des Energiebewusstseins in das Large Language Model (LLM) Human Assessments | 产生能源竞技场:将能源意识纳入大语言模型(LLM)人类评估 2507.13302v1 |
Authors (5): Carlos Arriaga, Gonzalo Martínez, Eneko Sendin, Javier Conde, Pedro Reviriego
The evaluation of large language models is a complex task for which several approaches have been proposed. The most common is the use of automated benchmarks in which LLMs have to answer multiple-choice questions on different topics. However, this method has certain limitations, the most concerning being its poor correlation with human judgments. An alternative approach is to have humans evaluate the LLMs. This poses scalability issues, as there is a large and growing number of models to evaluate, making it impractical (and costly) to run traditional studies based on recruiting a number of evaluators and having them rank the responses of the models. Another approach is the use of public arenas, such as the popular LM arena, in which any user can freely evaluate models on any question and rank the responses of two models. The results are then compiled into a model ranking. An increasingly important aspect of LLMs is their energy consumption; evaluating how energy awareness influences the decisions of humans in selecting a model is therefore of interest. In this paper, we present GEA, the Generative Energy Arena, an arena that incorporates information on the energy consumption of the model in the evaluation process. Preliminary results obtained with GEA are also presented, showing that for most questions, when users are aware of the energy consumption, they favor smaller and more energy-efficient models. This suggests that for most user interactions, the extra cost and energy incurred by the more complex and top-performing models do not provide an increase in the perceived quality of the responses that justifies their use.
对大型语言模型的评价是一项复杂的任务,其中提出了几种方法,最常见的是使用自动基准,使LLMM公司必须回答不同专题的多种选择问题;然而,这种方法有一些局限性,即与人关系最差,与人的关系最差;另一种办法是让人对LLM公司进行评价;这带来了可伸缩性问题,因为有大量和越来越多的模型来评价如何影响人类选择模型的决定。在本文件中,我们介绍了GENA、Genearial Energy Arena,这是一个将模型的能源消费信息纳入评估进程的舞台,例如广受欢迎的LM领域,任何用户都可以自由地评价任何问题的模型,对两种模型的答复进行排名;这种方法有一些局限性,即与人的关系最差,然后将结果拟订成一个模型;LMMS的日益重要的一个方面是其能源消耗量,因此,评估能源意识如何影响人类选择模型的决定是令人感兴趣的。在本文件中,我们介绍了GENA、Genearime Enera,这是一个将模型的能源消耗信息纳入评估进程的舞台,例如广受欢迎的LM竞技场,任何用户都可以自由地评价任何关于任何问题的模型,在评估过程中提供其最先进的能源使用成本。
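To make the arena mechanism concrete, here is a minimal sketch of turning pairwise user votes into a model ranking. The abstract does not state how the ranking is computed, so the plain win-rate tally below is an assumption chosen for simplicity; the model names and the function names `win_rates` and `ranking` are hypothetical.

```python
from collections import defaultdict


def win_rates(votes):
    """Compute per-model win rate from (winner, loser) vote pairs."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for winner, loser in votes:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    return {model: wins[model] / games[model] for model in games}


def ranking(votes):
    """Order models by win rate, highest first; ties broken by name."""
    rates = win_rates(votes)
    return sorted(rates, key=lambda model: (-rates[model], model))


# Hypothetical votes cast after users were shown energy consumption.
energy_aware_votes = [
    ("small-model", "large-model"),
    ("small-model", "large-model"),
    ("large-model", "small-model"),
]
order = ranking(energy_aware_votes)
```

Comparing the ranking from energy-aware votes with one from votes cast without that information would be one simple way to see the shift in preferences the abstract reports.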
Article 10
Title@2025-07-17 (4): AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research
Title: AbGen: Evaluating Large Language Models in Ablation Study Design and Evaluation for Scientific Research | AbGen: Bewertung großer Sprachmodelle in Ablationsstudiendesign und Evaluation für wissenschaftliche Forschung | AbGen:评估用于科学研究的实验研究设计和评价中的大语言模型 2507.13300v1 |
Authors (8): Yilun Zhao, Weiyuan Chen, Zhijian Xu, Manasi Patwardhan, Yixin Liu, Chengye Wang, Lovekesh Vig, Arman Cohan
We introduce AbGen, the first benchmark designed to evaluate the capabilities of LLMs in designing ablation studies for scientific research. AbGen consists of 1,500 expert-annotated examples derived from 807 NLP papers. In this benchmark, LLMs are tasked with generating detailed ablation study designs for a specified module or process based on the given research context. Our evaluation of leading LLMs, such as DeepSeek-R1-0528 and o4-mini, highlights a significant performance gap between these models and human experts in terms of the importance, faithfulness, and soundness of the ablation study designs. Moreover, we demonstrate that current automated evaluation methods are not reliable for our task, as they show a significant discrepancy when compared to human assessment. To better investigate this, we develop AbGen-Eval, a meta-evaluation benchmark designed to assess the reliability of commonly used automated evaluation systems in measuring LLM performance on our task. We investigate various LLM-as-Judge systems on AbGen-Eval, providing insights for future research on developing more effective and reliable LLM-based evaluation systems for complex scientific tasks.
我们引入了AbGen,这是旨在评价LLMs设计实验室研究能力的第一个基准,AbGen由来自807份NLP文件的1 500个专家附加说明的实例组成,在这一基准中,LLMs的任务是根据特定研究背景,为特定模块或进程制定详细的通货膨胀研究设计;我们对DeepSeek-R1-0528和o4-mini等主要LMs的评价,突出显示了这些模型与人类专家在反动研究设计的重要性、忠诚性和健全性方面存在的显著绩效差距;此外,我们表明,目前自动化评价方法对我们的任务不可靠,因为它们与人类评估相比存在重大差异;为了更好地调查这一点,我们开发AbGen-Eval,这是一个元评价基准,旨在评估用于衡量LM工作业绩的常用自动评价系统的可靠性。我们调查了AbGen-Eval上的各种LM-AZdge系统,为今后研究开发以LMM为基础的复杂科学任务的更有效和更可靠的LM评价系统提供了见解。
Article 11
Title@2025-07-17 (4): Multi-Agent Synergy-Driven Iterative Visual Narrative Synthesis
Title: Multi-Agent Synergy-Driven Iterative Visual Narrative Synthesis | Multi-Agent Synergy-getriebene iterative visuelle Narrative Synthese | 多机构协同-驱动动态迭代视觉叙述合成 2507.13285v1 |
Authors (8): Wang Xi, Quan Shi, Tian Yu, Yujie Peng, Jiayi Sun, Mengxing Ren, Zenghui Ding, Ningguang Yao
Automated generation of high-quality media presentations is challenging, requiring robust content extraction, narrative planning, visual design, and overall quality optimization. Existing methods often produce presentations with logical inconsistencies and suboptimal layouts, thereby struggling to meet professional standards. To address these challenges, we introduce RCPS (Reflective Coherent Presentation Synthesis), a novel framework integrating three key components: (1) Deep Structured Narrative Planning; (2) Adaptive Layout Generation; (3) an Iterative Optimization Loop. Additionally, we propose PREVAL, a preference-based evaluation framework employing rationale-enhanced multi-dimensional models to assess presentation quality across Content, Coherence, and Design. Experimental results demonstrate that RCPS significantly outperforms baseline methods across all quality dimensions, producing presentations that closely approximate human expert standards. PREVAL shows strong correlation with human judgments, validating it as a reliable automated tool for assessing presentation quality.
为了应对这些挑战,我们引入了RCPS(REPS),这是一个包含三个关键组成部分的新框架:(1) 深层次结构描述规划;(2) 适应性布局生成;(3) 迭接式优化流程;此外,我们提议PREVAL,一个基于优惠的多维评价框架,利用强化的多维模型来评估所有内容、一致性和设计的质量;实验结果表明,RCPS大大超越了所有质量层面的基线方法,制作了接近人类专家标准的介绍;PREVAL显示与人类判断的密切关联,证明它是评估表述质量的可靠自动工具。
Article 12
Title@2025-07-17 (4): ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations
Title: ContextQFormer: A New Context Modeling Method for Multi-Turn Multi-Modal Conversations | ContextQFormer: Eine neue Context-Modellierungsmethode für Multi-Turn Multi-Modal-Gespräche | 上下文前:多发多式多模式对话的新背景建模方法 2505.23121v2 |
Authors (8): Yiming Lei, Zhizheng Yang, Zeming Liu, Haitao Leng, Shaoguo Liu, Tingting Gao, Qingjie Liu, Yunhong Wang
Multi-modal large language models have demonstrated remarkable zero-shot abilities and powerful image-understanding capabilities. However, existing open-source multi-modal models are weak at multi-turn interaction, especially over long contexts. To address this issue, we first introduce a context modeling module, termed ContextQFormer, which utilizes a memory block to enhance the presentation of contextual information. Furthermore, to facilitate further research, we carefully build a new multi-turn multi-modal dialogue dataset (TMDialog) for pre-training, instruction-tuning, and evaluation, which will be open-sourced later. Compared with other multi-modal dialogue datasets, TMDialog contains longer conversations, which supports research on multi-turn multi-modal dialogue. In addition, ContextQFormer is compared with three baselines on TMDialog, and experimental results show that ContextQFormer achieves an improvement of 2%-4% in available rate over the baselines.
多式大型语言模型显示了显著的零射能力和强大的图像理解能力。然而,现有的开放源码多模式模型由于多方向互动能力薄弱而受到影响,特别是在长期背景下。为了解决这个问题,我们首先引入了一个背景模型模块,称为CeanQFormer,该模块利用记忆块加强背景信息的列报。此外,为了便于进一步研究,我们谨慎地为培训前、教学调整和评价建立一个新的多方向多模式对话数据集(TMDialog),该数据集将最近公开提供。与其他多模式对话数据集相比,TMdialog包含较长的谈话,支持多方向多模式对话的研究。此外,CentricFormer与TMdilog的三个基线和实验结果进行比较,说明CentricalQFormer比基线提高了2%-4%。
Article 13
Title@2025-07-17 (4): Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management
Title: Overview of the TalentCLEF 2025: Skill and Job Title Intelligence for Human Capital Management | Überblick über das TalentCLEF 2025: Kompetenz- und Berufstitel-Intelligenz für Human Capital Management | 《2025年人才人才-CLEF概览:人力资本管理技能和职称情报》 2507.13275v1 |
Authors (7): Luis Gasco, Hermenegildo Fabregat, Laura García-Sardiña, Paula Estrella, Daniel Deniz, Alvaro Rodrigo, Rabih Zbib
Advances in natural language processing and large language models are driving a major transformation in Human Capital Management, with a growing interest in building smart systems based on language technologies for talent acquisition, upskilling strategies, and workforce planning. However, the adoption and progress of these technologies critically depend on the development of reliable and fair models, properly evaluated on public data and open benchmarks, which have so far been unavailable in this domain. To address this gap, we present TalentCLEF 2025, the first evaluation campaign focused on skill and job title intelligence. The lab consists of two tasks: Task A - Multilingual Job Title Matching, covering English, Spanish, German, and Chinese; and Task B - Job Title-Based Skill Prediction, in English. Both corpora were built from real job applications, carefully anonymized, and manually annotated to reflect the complexity and diversity of real-world labor market data, including linguistic variability and gender-marked expressions. The evaluations included monolingual and cross-lingual scenarios and covered the evaluation of gender bias. TalentCLEF attracted 76 registered teams with more than 280 submissions. Most systems relied on information retrieval techniques built with multilingual encoder-based models fine-tuned with contrastive learning, and several of them incorporated large language models for data augmentation or re-ranking. The results show that the training strategies have a larger effect than the size of the model alone. TalentCLEF provides the first public benchmark in this field and encourages the development of robust, fair, and transferable language technologies for the labor market.
自然语言处理和大语言模式的进展正在推动人力资本管理的重大转变,对建立基于语言技术的智能系统以获取人才、提高技能战略和劳动力规划的兴趣日益浓厚,然而,这些技术的采用和进步关键取决于可靠和公平的模型的开发,对公共数据和公开基准的恰当评价,而迄今为止这一领域尚不具备这些模型和公开基准。为弥补这一差距,我们介绍2025年人才水平评价运动,第一次评价运动的重点是技能和职称情报。实验室由两项任务组成:任务A - 多语言职称匹配,涵盖英文、西班牙文、德文和中文;任务B - 基于职称的Skill 预测,英语。这两种技术的采用和进展都取决于真实的工作应用程序,对公共数据和公开市场数据进行适当评价,包括语言变异和有性别标记的表达。评价包括单一语言和跨语言模式情景,并涵盖对性别偏见的评价。TalentCLEF吸引了76个注册团队,提交了280多份材料。大多数系统都依靠以基于信息检索技术建立的基于职称可转让技能的Skilled Skilled Skillion Surillion Surning Surviion,仅靠多种成本模型或更大规模的实地学习模型。
Article 14
Title@2025-07-17 (4): Secure Multifaceted-RAG for Enterprise: Hybrid Knowledge Retrieval with Security Filtering
Title: Secure Multifaceted-RAG for Enterprise: Hybrid Knowledge Retrieval with Security Filtering | Sichere Multifaceted-RAG für Unternehmen: Hybrides Knowledge Retrieval mit Security-Filterung | 企业安全多面安全RAG:带安全过滤器的混合知识检索 2504.13425v2 |
Authors (4): Grace Byun, Shinsun Lee, Nayoung Choi, Jinho D. Choi
Existing Retrieval-Augmented Generation (RAG) systems face challenges in enterprise settings due to limited retrieval scope and data security risks. When relevant internal documents are unavailable, the system struggles to generate accurate and complete responses. Additionally, using closed-source Large Language Models (LLMs) raises concerns about exposing proprietary information. To address these issues, we propose the Secure Multifaceted-RAG (SecMulti-RAG) framework, which retrieves not only from internal documents but also from two supplementary sources: pre-generated expert knowledge for anticipated queries and on-demand external LLM-generated knowledge. To mitigate security risks, we adopt a local open-source generator and selectively utilize external LLMs only when prompts are deemed safe by a filtering mechanism. This approach enhances completeness, prevents data leakage, and reduces costs. In our evaluation on a report generation task in the automotive industry, SecMulti-RAG significantly outperforms traditional RAG - achieving 79.3 to 91.9 percent win rates across correctness, richness, and helpfulness in LLM-based evaluation, and 56.3 to 70.4 percent in human evaluation. This highlights SecMulti-RAG as a practical and secure solution for enterprise RAG.
由于检索范围有限和数据安全风险有限,现有检索-启动新一代系统在企业环境中面临挑战。当相关内部文件缺乏时,系统很难产生准确和完整的答复。此外,使用封闭源源大语言模型(LLMs)引起对披露专利信息的担忧。为了解决这些问题,我们提议采用安全多面多面系统框架(SecMulti-RAG),该框架不仅取自内部文件,而且取自两个补充来源:预生成的专家知识,用于预期查询和需求外部LLM生成的知识。为减轻安全风险,我们采用了本地开源生成器,并仅在过滤机制认为迅速性安全时才有选择地使用外部LMMs。这一方法加强了完整性,防止数据泄漏,并降低了成本。在对汽车业报告生成任务的评价中,SecMulti-RAG大大超越了传统的RAG——在基于LM的评价中实现79.3%至91.9%的赢率、丰富度和帮助性外部LMM生成的知识。为了减少安全性风险,我们采用了本地开源生成的生成器,并且只在过滤机制认为迅速时才有选择使用外部LMMs。这一方法提高了56.3%至70.4%的企业评价。
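The retrieval-plus-filtering flow described above can be sketched as follows. All callables are hypothetical stand-ins: the abstract does not describe the filter's internals, so the keyword-based `keyword_filter` is only a placeholder, and the final generation step with the local open-source model is omitted.

```python
def keyword_filter(query, blocked_terms=("confidential", "internal only")):
    """Placeholder safety filter (assumption): block queries with flagged terms."""
    lowered = query.lower()
    return not any(term in lowered for term in blocked_terms)


def gather_context(query, retrieve_internal, retrieve_expert, ask_external, is_safe):
    """Collect context from the three sources named in the abstract.

    The external LLM is consulted only when the filter deems the query safe,
    so unsafe prompts never leave the local environment.
    """
    context = list(retrieve_internal(query))
    context += list(retrieve_expert(query))
    if is_safe(query):
        context.append(ask_external(query))
    return context


# Stub sources for illustration only.
external_calls = []


def fake_external(query):
    external_calls.append(query)
    return "external knowledge"


safe_ctx = gather_context(
    "brake test summary",
    lambda q: ["internal doc"],
    lambda q: ["expert note"],
    fake_external,
    keyword_filter,
)
unsafe_ctx = gather_context(
    "confidential brake data",
    lambda q: ["internal doc"],
    lambda q: ["expert note"],
    fake_external,
    keyword_filter,
)
```

Keeping the filter as an injected callable means a learned classifier could replace the keyword check without touching the routing logic.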
Article 15
Title@2025-07-17 (4): QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation
Title: QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation | QuestA: Erweitern der Begründungskapazität in LLMs durch Frageerweiterung | 目标A:通过问题增加扩大LLMs的理据能力 2507.13266v1 |
Authors (8): Jiazheng Li, Hong Lu, Kaiyue Wen, Zaiwen Yang, Jiaxuan Gao, Hongzhou Lin, Yi Wu, Jingzhao Zhang
Reinforcement learning (RL) has become a key component in training large language reasoning models (LLMs). However, recent studies question its effectiveness in improving multi-step reasoning, particularly on hard problems. To address this challenge, we propose a simple yet effective strategy via Question Augmentation: introduce partial solutions during training to reduce problem difficulty and provide more informative learning signals. Our method, QuestA, when applied during RL training on math reasoning tasks, improves not only pass@1 but also pass@k, particularly on problems where standard RL struggles to make progress. This enables continual improvement over strong open-source models such as DeepScaleR and OpenMath Nemotron, further enhancing their reasoning capabilities. We achieve new state-of-the-art results on math benchmarks using 1.5B-parameter models: 67.1% (+5.3%) on AIME24, 59.5% (+10.0%) on AIME25, and 35.5% (+4.0%) on HMMT25. Further, we provide theoretical explanations that QuestA improves sample efficiency, offering a practical and generalizable pathway for expanding reasoning capability through RL.
强化学习(RL)已成为培训大型语言推理模型(LLMS)的一个关键组成部分。然而,最近的研究质疑其在改进多步推理方面的效力,特别是在困难问题上。为了应对这一挑战,我们提出一个简单而有效的战略:在培训过程中引入部分解决方案,以减少问题难度,并提供更多的信息学习信号。我们的方法(QuestA)在数学推理任务RL培训期间应用时,不仅改进了通行证@1,而且还提高了通行证@k,特别是在标准RL为取得进展而挣扎的问题上。这有利于不断改进强大的开放源模型,如深层Supul R 和 OpenMath Nemovron,进一步提高其推理能力。我们利用1.5B参数模型在数学基准上取得新的最新成果:67.1%(+5.3%)用于AIME24,59.5%(+10.0%)用于AIME25,35.5%(+4.0%)用于HMMT25。此外,我们从理论上解释“QuestA”提高了样本效率,为通过RL扩大推理能力提供了实用和通用路径。
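The core idea of question augmentation, prepending part of a known solution to a hard problem, can be shown in a few lines. This is a sketch under stated assumptions: the prompt wording, the step-level split of the solution, and the `hint_fraction` parameter are all hypothetical and not taken from the abstract.

```python
from typing import List


def augment_question(question: str, solution_steps: List[str], hint_fraction: float) -> str:
    """Append the first part of a reference solution to the question as a hint.

    hint_fraction is the share of solution steps revealed (0.0 reveals nothing).
    """
    if not 0.0 <= hint_fraction <= 1.0:
        raise ValueError("hint_fraction must lie in [0, 1]")
    n_hint = int(len(solution_steps) * hint_fraction)
    if n_hint == 0:
        return question
    hint = "\n".join(solution_steps[:n_hint])
    return f"{question}\n\nPartial solution:\n{hint}"


steps = ["Let x be the unknown.", "Then 2x + 3 = 11.", "So 2x = 8.", "Hence x = 4."]
prompt = augment_question("Solve 2x + 3 = 11.", steps, 0.5)
```

During RL training, such augmented prompts would replace the original question for problems where the policy rarely earns reward; how the hint share is scheduled over training is left open here.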
Article 16
Title@2025-07-17 (4): Automating Steering for Safe Multimodal Large Language Models
Title: Automating Steering for Safe Multimodal Large Language Models | Automatisierungslenkung für sichere multimodale große Sprachmodelle | 安全多式联运大语言模式自动化指导 2507.13255v1 |
Authors (7): Lyucheng Wu, Mengru Wang, Ziwen Xu, Tri Cao, Nay Oo, Bryan Hooi, Shumin Deng
Recent progress in Multimodal Large Language Models (MLLMs) has unlocked powerful cross-modal reasoning abilities, but also raised new safety concerns, particularly when faced with adversarial multimodal inputs. To improve the safety of MLLMs during inference, we introduce a modular and adaptive inference-time intervention technology, AutoSteer, without requiring any fine-tuning of the underlying model. AutoSteer incorporates three core components: (1) a novel Safety Awareness Score (SAS) that automatically identifies the most safety-relevant distinctions among the model’s internal layers; (2) an adaptive safety prober trained to estimate the likelihood of toxic outputs from intermediate representations; and (3) a lightweight Refusal Head that selectively intervenes to modulate generation when safety risks are detected. Experiments on LLaVA-OV and Chameleon across diverse safety-critical benchmarks demonstrate that AutoSteer significantly reduces the Attack Success Rate (ASR) for textual, visual, and cross-modal threats, while maintaining general abilities. These findings position AutoSteer as a practical, interpretable, and effective framework for safer deployment of multimodal AI systems.
在多式大语言模型(MLLM)方面最近取得的进展释放了强大的跨模式推理能力,但也提出了新的安全关切,特别是在面临对抗性多式联运投入时。为了在推论期间提高MLLMs的安全性,我们采用了模块和适应性推导时间干预技术AutoSteer, 无需对基本模型作任何微调。AutoSteer包含三个核心组成部分:(1) 一个新的安全意识评分(SAS),该评分自动确定该模型内部各层之间与安全最相关的区别;(2) 受过训练的适应性安全计分,以估计中间表现的有毒产出的可能性;(3) 轻量级拒绝头,在发现安全风险时有选择地干预调节生成。关于LLAVA-OVA和Chameleon的各种安全临界基准的实验表明,AutSteer在保持一般能力的同时,大大降低了对文字、视觉和跨模式威胁的攻击成功率。这些研究结果表明,AutSter是一个实用、可解释和有效的框架,可以安全地部署多式AI系统。
Article 17
Title@2025-07-17 (4): ConTextual: Improving Clinical Text Summarization in LLMs with Context-preserving Token Filtering and Knowledge Graphs
Title: ConTextual: Improving Clinical Text Summarization in LLMs with Context-preserving Token Filtering and Knowledge Graphs | ConTextual: Verbesserung der klinischen Textzusammenfassung in LLMs mit kontextschonender Token-Filterung und Wissensgraphen | 共同方式:改进LLMLLM的临床文本摘要,同时保持上下文透视和知识图 2504.16394v3 |
Authors (2): Fahmida Liza Piya, Rahmatollah Beheshti
Unstructured clinical data can serve as a unique and rich source of information that can meaningfully inform clinical practice. Extracting the most pertinent context from such data is critical for exploiting its true potential toward optimal and timely decision-making in patient care. While prior research has explored various methods for clinical text summarization, most prior studies either process all input tokens uniformly or rely on heuristic-based filters, which can overlook nuanced clinical cues and fail to prioritize information critical for decision-making. In this study, we propose ConTextual, a novel framework that integrates a Context-Preserving Token Filtering method with a Domain-Specific Knowledge Graph (KG) for contextual augmentation. By preserving context-specific important tokens and enriching them with structured knowledge, ConTextual improves both linguistic coherence and clinical fidelity. Our extensive empirical evaluations on two public benchmark datasets demonstrate that ConTextual consistently outperforms other baselines. Our proposed approach highlights the complementary role of token-level filtering and structured retrieval in enhancing both linguistic and clinical integrity, as well as offering a scalable solution for improving precision in clinical text generation.
没有结构的临床数据可以作为一个独特和丰富的信息来源,为临床实践提供有意义的信息。从这些数据中提取最相关的背景对于利用其真正潜力实现患者护理方面的最佳和及时的决策至关重要。虽然先前的研究探索了临床文本汇总的各种方法,但大多数先前的研究要么统一处理所有输入符号,要么依赖基于休养的过滤器,这可以忽视细微的临床线索,而不能确定对决策至关重要的信息的优先次序。在本研究中,我们提出了“背景”这一新框架,将背景保护过滤法与环境特异知识图(KG)相结合,以促进环境增强。通过保存特定环境的重要符号和以结构化知识丰富这些符号,“模式”提高了语言的一致性和临床忠诚性。我们对两个公共基准数据集的广泛经验评估表明,“模式”始终超越了其他基线。我们提出的方法强调,在加强语言和临床完整性方面,象征性过滤和结构检索具有补充作用,并为改进临床文本生成的精确性提供了可测量的解决方案。
Article 18
Title@2025-07-17 (4): HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models
Title: HATS: Hindi Analogy Test Set for Evaluating Reasoning in Large Language Models | HATS: Hindi Analogy Test Set zur Bewertung von Vernunft in großen Sprachmodellen | HATS: 用于评估大语言模型中原因的印地语分析测试套 2507.13238v1 |
Authors (3): Ashray Gupta, Rohan Joseph, Sunny Rai
Analogies test a model’s ability to infer implicit relationships between concepts, making them a key benchmark for evaluating reasoning capabilities. While large language models (LLMs) are widely evaluated for reasoning in English, their abilities in Indic languages remain understudied, limiting our understanding of whether these models generalize across languages. To address this gap, we introduce a new Hindi Analogy Test Set (HATS), comprising 405 multiple-choice questions sourced from Indian government exams. We benchmark state-of-the-art multilingual LLMs using various prompting strategies and introduce a grounded Chain of Thought approach that leverages cognitive theories of analogical reasoning. This approach improves model performance on Hindi analogy questions. Our experiments show that models perform best with English prompts, irrespective of the prompting strategy. Our test set addresses the lack of a critical resource to evaluate LLM reasoning capabilities in Hindi.
模拟测试模型是否有能力推断各种概念之间的隐含关系,使其成为评估推理能力的关键基准。大型语言模型(LLMs)在英语推理方面得到了广泛的评价,但其在印地语方面的能力仍然没有得到足够的研究,限制了我们对这些模型是否泛泛各语的理解。为了解决这一差距,我们引入了新的印地语人工解析测试组(HATS),由405个来自印度政府考试的多种选择问题组成。我们使用各种推理策略,对最新的多语种LLM作了基准,并引入了利用模拟推理理论认知理论的深层次思维链方法。这种方法提高了印地语类比问题模型的性能。我们的实验显示,无论迅速战略如何,这些模型在英语提示方面表现最佳。我们的测试组解决了缺乏关键资源来评价印地语的LLM推理能力的问题。
Article 19
Title@2025-07-17 (4): Enhancing Cross-task Transfer of Large Language Models via Activation Steering
Title: Enhancing Cross-task Transfer of Large Language Models via Activation Steering | Verbesserung der Cross-Task-Übertragung großer Sprachmodelle durch Aktivierungslenkung | 通过启动指导加强大语言模式的跨任务转让 2507.13236v1 |
Authors (8): Xinyu Tang, Zhihao Lv, Xiaoxue Cheng, Junyi Li, Wayne Xin Zhao, Zujie Wen, Zhiqiang Zhang, Jun Zhou
Large language models (LLMs) have shown impressive abilities in leveraging pretrained knowledge through prompting, but they often struggle with unseen tasks, particularly in data-scarce scenarios. While cross-task in-context learning offers a direct solution for transferring knowledge across tasks, it still faces critical challenges in terms of robustness, scalability, and efficiency. In this paper, we investigate whether cross-task transfer can be achieved via latent space steering without parameter updates or input expansion. Through an analysis of activation patterns in the latent space of LLMs, we observe that the enhanced activations induced by in-context examples have consistent patterns across different tasks. Inspired by these findings, we propose CAST, a novel Cross-task Activation Steering Transfer framework that enables effective transfer by manipulating the model’s internal activation states. Our approach first selects influential and diverse samples from high-resource tasks, then utilizes their contrastive representation-enhanced activations to adapt LLMs to low-resource tasks. Extensive experiments across both cross-domain and cross-lingual transfer settings show that our method outperforms competitive baselines and demonstrates superior scalability and lower computational costs.
大型语言模型(LLMS)在通过推动来利用预先培训的知识方面表现出令人印象深刻的能力,但是它们往往与无形的任务,特别是在数据残缺的情景下,难以与隐蔽的任务作斗争。虽然交叉任务中的内容性学习为跨任务转让知识提供了直接的解决方案,但在稳健性、可缩放性和效率方面仍面临严峻的挑战。在本文件中,我们调查跨任务转让是否可以通过潜伏的空间指导实现,而无需更新参数或扩大投入。通过分析LLMS潜在空间的激活模式,我们观察到,由内流实例引发的强化激活具有不同任务的一贯性。我们建议CAST,这是一个新型的跨任务启动指导转移框架,通过操纵模型的内部激活状态,使得能够进行有效转让。我们的方法首先从高资源任务中选择有影响力和多样性的样本,然后利用它们对比式代表增强的激活来使LMS适应低资源任务。跨多层次和跨语言传输环境的广泛实验显示,我们的方法超越了竞争性基线,并展示了更高的缩略性计算成本。
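A generic activation-steering step can be sketched as below. It assumes the steering direction is the difference between mean activations recorded with and without in-context examples, which is a common construction but not confirmed by the abstract as CAST's exact procedure; the function names and the scale `alpha` are hypothetical, and plain Python lists stand in for hidden-state tensors.

```python
def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def steering_vector(with_examples, without_examples):
    """Contrastive direction: mean activation with examples minus mean without."""
    positive = mean_vector(with_examples)
    negative = mean_vector(without_examples)
    return [p - n for p, n in zip(positive, negative)]


def steer(hidden, direction, alpha=1.0):
    """Shift a hidden state along the steering direction; no weights are updated."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations from a high-resource task (assumption).
direction = steering_vector(
    with_examples=[[1.0, 2.0], [3.0, 4.0]],
    without_examples=[[0.0, 0.0], [2.0, 2.0]],
)
steered = steer([0.0, 0.0], direction, alpha=0.5)
```

Because the shift is applied to activations at inference time, the prompt does not grow and the parameters stay fixed, which matches the abstract's framing of transfer without parameter updates or input expansion.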
Article 20
Title@2025-07-17 (4): CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual, Multi-Generator and Multi-Domain Settings
Title: CoDet-M4: Detecting Machine-Generated Code in Multi-Lingual, Multi-Generator and Multi-Domain Settings | CoDet-M4: Erkennung maschinengenerierter Codes in Multi-Lingual-, Multi-Generator- und Multi-Domain-Einstellungen | CoDet-M4:多语言、多驱动器和多域设置中的检测机生成代码 2503.13733v2 |
Authors (3): Daniil Orel, Dilshod Azizov, Preslav Nakov
Large language models (LLMs) have revolutionized code generation, automating programming with remarkable efficiency. However, these advancements challenge programming skills, ethics, and assessment integrity, making the detection of LLM-generated code essential for maintaining accountability and standards. While there has been some research on this problem, it generally lacks domain coverage and robustness and covers only a small number of programming languages. To this end, we propose a framework capable of distinguishing between human- and LLM-written code across multiple programming languages, code generators, and domains. We use a large-scale dataset from renowned platforms and LLM-based code generators, apply rigorous data quality checks and feature engineering, and carry out a comparative evaluation of traditional machine learning models, pre-trained language models (PLMs), and LLMs for code detection. We perform an evaluation on out-of-domain scenarios, such as detecting the authorship and hybrid authorship of generated code and generalizing to unseen models, domains, and programming languages. Moreover, our extensive experiments show that our framework effectively distinguishes human- from LLM-written code and sets a new benchmark for this task.
大型语言模型(LLMS)使代码生成发生革命,使编程自动化,效率极高;然而,这些进步挑战了编程技能、道德和评估完整性,使得发现LLM生成的代码对于维持问责制和标准至关重要;虽然对这个问题进行了一些研究,但通常缺乏域覆盖面和稳健性,而且只涵盖少量编程语言;为此目的,我们提议了一个能够区分多种编程语言、代码生成者和域名的人类和LLLM编程代码的框架;我们使用由著名平台和LLM制代码生成者提供的大规模数据集,同时采用严格的数据质量检查、特征工程和比较分析方法,利用传统机器学习模型、预先培训的语言模型和LLMS来进行评析;我们评估了外部情景,例如发现生成代码的作者和混合作者,并推广到看不见的模式、领域和编程语言;此外,我们的广泛实验显示,我们的框架有效地区分了人与LM制代码,并为这项任务制定了新的基准。
Article 21
Title@2025-07-17 (4): A Comparative Approach to Assessing Linguistic Creativity of Large Language Models and Humans
Title: A Comparative Approach to Assessing Linguistic Creativity of Large Language Models and Humans | Ein Vergleichsansatz zur Beurteilung sprachlicher Kreativität von großen Sprachmodellen und Menschen | 评估大语言模式和人类语言创造性的比较方法 2507.12039v2 |
Authors (3): Anca Dinu, Andra-Maria Florescu, Alina Resceanu
This paper introduces a general linguistic creativity test for humans and Large Language Models (LLMs). The test consists of various tasks aimed at assessing their ability to generate new original words and phrases based on word formation processes (derivation and compounding) and on metaphorical language use. We administered the test to 24 humans and to an equal number of LLMs, and we automatically evaluated their answers using the OCSAI tool for three criteria: Originality, Elaboration, and Flexibility. The results show that LLMs not only outperformed humans on all the assessed criteria, but also did better in six out of the eight test tasks. We then computed the uniqueness of the individual answers, which showed some minor differences between humans and LLMs. Finally, we performed a short manual analysis of the dataset, which revealed that humans are more inclined towards E(extending)-creativity, while LLMs favor F(ixed)-creativity.
论文介绍了对人和大语言模型(LLMs)的一般语言创造性测试(LLMs),该测试包括各种任务,旨在评估其根据文字形成过程(发音和复合)和隐喻语言使用生成新原始词和短语的能力。我们对24人进行了测试,对同等数量的LMs进行了测试,我们用OCSAI工具根据三个标准(原创性、阐述性和灵活性)对答案进行了自动评估。结果显示,LLMS不仅在所有评估标准中优于人,而且在8项测试任务中的6项中优于人。我们随后计算了个人答案的独特性,这显示了人类与LLMs之间的一些微小差异。最后,我们对数据集进行了简短的手工分析,显示人类更倾向于E(延动性)-渗透性,而LLMs则倾向于F(ixed)-creative性。
Article 22
Title@2025-07-17 (4): Automatically assessing oral narratives of Afrikaans and isiXhosa children
Title: Automatically assessing oral narratives of Afrikaans and isiXhosa children | Automatische Beurteilung mündlicher Erzählungen von Afrikaans und isiXhosa Kindern | 自动评估南非荷兰语和土著Xhoosa儿童口述叙述 2507.13205v1 |
Authors (6): R. Louw, E. Sharratt, F. de Wet, C. Jacobs, A. Smith, H. Kamper
Developing narrative and comprehension skills in early childhood is critical for later literacy. However, teachers in large preschool classrooms struggle to accurately identify students who require intervention. We present a system for automatically assessing oral narratives of preschool children in Afrikaans and isiXhosa. The system uses automatic speech recognition followed by a machine learning scoring model to predict narrative and comprehension scores. For scoring predicted transcripts, we compare a linear model to a large language model (LLM). The LLM-based system outperforms the linear model in most cases, but the linear system is competitive despite its simplicity. The LLM-based system is comparable to a human expert in flagging children who require intervention. We lay the foundation for automatic oral assessments in classrooms, giving teachers extra capacity to focus on personalised support for children’s learning.
幼儿期的叙述和理解技能对以后的识字至关重要。然而,大型学龄前班教师努力准确地确定需要干预的学生。我们提出了一个自动评估南非荷兰语和IsiXhosa语学龄前儿童的口述叙述的系统。该系统使用自动语音识别,然后是机器学习评分模型来预测叙述和理解得分。在评分时,我们将一个线性模式与一个大语言模式(LLLM)进行比较。基于LLM的系统在大多数情况下比线性模式(LLM)要好,但线性系统尽管简单,却具有竞争力。基于LLM的系统与需要干预的儿童标记方面的一名人类专家相似。我们为在教室进行自动口述评估打下了基础,给予教师更多的能力,将重点放在儿童学习的个人支持上。
Article 23
Title@2025-07-17 (4): GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems
Title: GEMMAS: Graph-based Evaluation Metrics for Multi Agent Systems | GEMMAS: Graph-basierte Evaluations-Metriken für Multi-Agent-Systeme | GEMMAS:基于图表的多剂系统评价计量表 2507.13190v1 |
Authors (5): Jisoo Lee, Raeyoung Chang, Dongwook Kwon, Harmanpreet Singh, Nikhil Verma
Multi-agent systems built on language models have shown strong performance on collaborative reasoning tasks. However, existing evaluations focus only on the correctness of the final output, overlooking how inefficient communication and poor coordination contribute to redundant reasoning and higher computational costs. We introduce GEMMAS, a graph-based evaluation framework that analyzes the internal collaboration process by modeling agent interactions as a directed acyclic graph. To capture collaboration quality, we propose two process-level metrics: Information Diversity Score (IDS) to measure semantic variation in inter-agent messages, and Unnecessary Path Ratio (UPR) to quantify redundant reasoning paths. We evaluate GEMMAS across five benchmarks and highlight results on GSM8K, where systems with only a 2.1% difference in accuracy differ by 12.8% in IDS and 80% in UPR, revealing substantial variation in internal collaboration. These findings demonstrate that outcome-only metrics are insufficient for evaluating multi-agent performance and highlight the importance of process-level diagnostics in designing more interpretable and resource-efficient collaborative AI systems.
以语言模式为基础的多试剂系统在协作推理任务方面表现良好,但是,现有的评价仅注重最终产出的正确性,忽视通信效率低和协调差如何促成多余推理和较高的计算费用。我们引入了基于图表的评价框架GEMMAS,这是一个以图为基础的评价框架,通过模拟代理器互动分析内部协作进程,作为定向循环图。为了获取合作质量,我们提议了两个程序级指标:信息多样性评分,以衡量跨试剂信息中的语义差异,以及不必要路径比对多余推理路径的量化。我们评估了五大基准的GEMMAS,并强调了GSM8K的结果,该系统的准确性差只有2.1%,IDS的差12.8%,普遍定期审议的差幅为80%。这些结果表明内部合作存在很大差异。这些结论表明,只使用结果的衡量尺度不足以评价多试剂业绩,并强调了程序级诊断在设计更易解释和资源高效的合作性AI系统方面的重要性。
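The two process-level metrics described above can be illustrated with a toy sketch. This is not the GEMMAS formulation: the abstract does not give the formulas, so the token-overlap distance standing in for semantic variation, the `useful_nodes` labelling, and all function names here are assumptions made for illustration only.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """Token-overlap distance between two messages (assumed stand-in for semantic distance)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)

def diversity_score(messages):
    """Mean pairwise distance between inter-agent messages (IDS-like, hypothetical)."""
    pairs = list(combinations(messages, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)

def all_paths(edges, source, sink):
    """Enumerate every source-to-sink path in a DAG given as {node: [successors]}."""
    if source == sink:
        return [[sink]]
    paths = []
    for nxt in edges.get(source, []):
        for tail in all_paths(edges, nxt, sink):
            paths.append([source] + tail)
    return paths

def unnecessary_path_ratio(edges, source, sink, useful_nodes):
    """Share of paths passing through at least one node not marked useful (UPR-like, hypothetical)."""
    paths = all_paths(edges, source, sink)
    if not paths:
        return 0.0
    wasted = [p for p in paths if any(n not in useful_nodes for n in p)]
    return len(wasted) / len(paths)

# Made-up interaction graph: a question fans out to two agents, then to a judge.
edges = {"q": ["a1", "a2"], "a1": ["judge"], "a2": ["judge"]}
upr = unnecessary_path_ratio(edges, "q", "judge", useful_nodes={"q", "a1", "judge"})
same = diversity_score(["x is 4", "x is 4"])
different = diversity_score(["add two", "multiply three"])
```

Two systems could reach the same final answer on this graph while differing in `upr` and in message diversity, which is the kind of gap that outcome-only accuracy would hide.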
Article 24
Title@2025-07-17 (4): Feature-based analysis of oral narratives from Afrikaans and isiXhosa children
Title: Feature-based analysis of oral narratives from Afrikaans and isiXhosa children | Feature-basierte Analyse oraler Erzählungen von Afrikaans und isiXhosa-Kindern | 对南非荷兰语和土著Xhoosa儿童口述叙述的基于特征的分析 2507.13164v1 |
Authors (6): Emma Sharratt, Annelien Smith, Retief Louw, Daleen Klop, Febe de Wet, Herman Kamper
Oral narrative skills are strong predictors of later literacy development. This study examines the features of oral narratives from children who were identified by experts as requiring intervention. Using simple machine learning methods, we analyse recorded stories from four- and five-year-old Afrikaans- and isiXhosa-speaking children. Consistent with prior research, we identify lexical diversity (unique words) and length-based features (mean utterance length) as indicators of typical development, but features like articulation rate prove less informative. Despite cross-linguistic variation in part-of-speech patterns, the use of specific verbs and auxiliaries associated with goal-directed storytelling is correlated with a reduced likelihood of requiring intervention. Our analysis of two linguistically distinct languages reveals both language-specific and shared predictors of narrative proficiency, with implications for early assessment in multilingual contexts.
这份研究报告审查了专家认为需要干预的儿童口头陈述的特点;使用简单的机器学习方法,我们分析了4岁和5岁的南非荷兰语和伊西索萨语儿童的记录故事;与先前的研究一致,我们将语言多样性(独特语言)和长语特征(平均发话长度)确定为典型发展指标,但诸如语速等特征则证明信息量较少;尽管部分语音模式存在多语种差异,但使用特定动词和与目标定向叙事相关的辅助词与减少需要干预的可能性是相互关联的;我们对两种语言不同语言的分析揭示了语言特定和共同的描述熟练程度预测,对多语种背景下的早期评估产生了影响。
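The lexical-diversity and length-based features named in the abstract above are simple to compute from a transcript. The sketch below is a minimal illustration, assuming each story arrives as a list of utterance strings; the function name, the tokenisation rule, and the example sentences are hypothetical and not taken from the paper.

```python
import re

def narrative_features(utterances):
    """Return unique-word count, total words, and mean utterance length for one story."""
    # Keep runs of letters only, so punctuation and digits are not counted as words.
    tokenised = [re.findall(r"[^\W\d_]+", u.lower()) for u in utterances]
    tokenised = [t for t in tokenised if t]  # drop utterances with no words
    words = [w for t in tokenised for w in t]
    if not words:
        return {"unique_words": 0, "total_words": 0, "mean_utterance_length": 0.0}
    return {
        "unique_words": len(set(words)),
        "total_words": len(words),
        "mean_utterance_length": len(words) / len(tokenised),
    }

# Made-up two-utterance story used only to exercise the function.
story = ["Die hond hardloop.", "Die kat slaap op die mat."]
features = narrative_features(story)
```

Features like these could then feed a simple classifier; which model the authors used is not stated in the abstract.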
Article 25
Title@2025-07-17 (4): Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities
Title: Inverse Reinforcement Learning Meets Large Language Model Post-Training: Basics, Advances, and Opportunities | Inverse Stärkung Lernen trifft auf großes Sprachmodell Post-Training: Grundlagen, Fortschritte und Chancen | 培训后培训:基础、进步和机会 2507.13158v1 |
Authors (2): Hao Sun, Mihaela van der Schaar
In the era of Large Language Models (LLMs), alignment has emerged as a fundamental yet challenging problem in the pursuit of more reliable, controllable, and capable machine intelligence. The recent success of reasoning models and conversational AI systems has underscored the critical role of reinforcement learning (RL) in enhancing these systems, driving increased research interest at the intersection of RL and LLM alignment. This paper provides a comprehensive review of recent advances in LLM alignment through the lens of inverse reinforcement learning (IRL), emphasizing the distinctions between RL techniques employed in LLM alignment and those in conventional RL tasks. In particular, we highlight the necessity of constructing neural reward models from human data and discuss the formal and practical implications of this paradigm shift. We begin by introducing fundamental concepts in RL to provide a foundation for readers unfamiliar with the field. We then examine recent advances in this research agenda, discussing key challenges and opportunities in conducting IRL for LLM alignment. Beyond methodological considerations, we explore practical aspects, including datasets, benchmarks, evaluation metrics, infrastructure, and computationally efficient training and inference techniques. Finally, we draw insights from the literature on sparse-reward RL to identify open questions and potential research directions. By synthesizing findings from diverse studies, we aim to provide a structured and critical overview of the field, highlight unresolved challenges, and outline promising future directions for improving LLM alignment through RL and IRL techniques.
在大语言模型(LLM)时代,在追求更可靠、可控制和更有能力的机器智能方面,调整已成为一个根本性但具有挑战性的问题。最近,推理模型和对话性AI系统的成功突出了强化学习(RL)在加强这些系统方面的关键作用,从而促使研究对RL和LLM的交叉点产生更大的兴趣。本文从反强化学习(IRL)的角度全面审查LLM调整的最新进展,强调LLM调整所使用的技术与常规RL任务之间的差别。特别是,我们强调必须从人类数据中建立神经奖赏模型,并讨论这种范式转变的正式和实际影响。我们首先在RL引入基本概念,为不熟悉实地的读者奠定基础。然后我们审视这一研究议程的最新进展,讨论为LLM调整进行IRL的关键挑战和机遇。除了方法考虑外,我们还探讨实际问题,包括数据集、基准、评价指标、基础设施、以及计算高效的培训和推导技术。最后,我们从关于Slob-R-L的文献中,从结构式研究方向,通过RL-L的全局性研究,从提供我们未解决的前沿研究方向,然后提出一个开放问题。
Article 26
Title@2025-07-17 (4): From Roots to Rewards: Dynamic Tree Reasoning with RL
Title: From Roots to Rewards: Dynamic Tree Reasoning with RL | Von Wurzeln zu Belohnungen: Dynamische Baumveranlagung mit RL | 从根到奖赏: 使用 RL 解释动态树 2507.13142v1 |
Authors (2): Ahmed Bahloul, Simon Malberg
Modern language models address complex questions through chain-of-thought (CoT) reasoning (Wei et al., 2023) and retrieval augmentation (Lewis et al., 2021), yet struggle with error propagation and knowledge integration. Tree-structured reasoning methods, particularly the Probabilistic Tree-of-Thought (ProbTree) framework (Cao et al., 2023), mitigate these issues by decomposing questions into hierarchical structures and selecting answers through confidence-weighted aggregation of parametric and retrieved knowledge (Yao et al., 2023). However, ProbTree’s static implementation introduces two key limitations: (1) the reasoning tree is fixed during the initial construction phase, preventing dynamic adaptation to intermediate results, and (2) each node requires exhaustive evaluation of all possible solution strategies, creating computational inefficiency. We present a dynamic reinforcement learning (Sutton and Barto, 2018) framework that transforms tree-based reasoning into an adaptive process. Our approach incrementally constructs the reasoning tree based on real-time confidence estimates, while learning optimal policies for action selection (decomposition, retrieval, or aggregation). This maintains ProbTree’s probabilistic rigor while improving both solution quality and computational efficiency through selective expansion and focused resource allocation. The work establishes a new paradigm for tree-structured reasoning that balances the reliability of probabilistic frameworks with the flexibility required for real-world question answering systems.
现代语言模型通过思维链(CoT)推理(Wei等人,2023年)和检索增强(Lewis等人,2021年)来处理复杂问题,但仍受困于错误传播和知识整合。树状结构推理方法,尤其是概率思维树(ProbTree)框架(Cao等人,2023年),通过将问题分解为层级结构,并对参数化知识与检索知识进行置信度加权聚合来选择答案(Yao等人,2023年),从而缓解了这些问题。然而,ProbTree的静态实现带来两个关键局限:(1)推理树在初始构建阶段即被固定,无法根据中间结果动态调整;(2)每个节点都需要对所有可能的求解策略进行穷举评估,造成计算效率低下。我们提出一个动态强化学习(Sutton和Barto,2018年)框架,将基于树的推理转变为自适应过程。我们的方法根据实时置信度估计逐步构建推理树,同时学习动作选择(分解、检索或聚合)的最优策略。这既保持了ProbTree的概率严谨性,又通过选择性扩展和有针对性的资源分配提高了解答质量和计算效率。这项工作为树状结构推理确立了一种新范式,在概率框架的可靠性与现实世界问答系统所需的灵活性之间取得平衡。
Article 27
Title@2025-07-17 (4): SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks
Title: SWE-MERA: A Dynamic Benchmark for Agenticly Evaluating Large Language Models on Software Engineering Tasks | SWE-MERA: Ein dynamischer Benchmark für die Agentik-Bewertung großer Sprachmodelle in Software-Engineering-Aufgaben | SWE-MERA: 积极评价软件工程任务大语言模型的动态基准 2507.11059v2 |
Authors (9): Pavel Adamenko, Mikhail Ivanov, Aidar Valeev, Rodion Levichev, Pavel Zadorozhny, Ivan Lopatin, Dmitry Babayev, Alena Fenogenova, Valentin Malykh
The rapid advancement of Large Language Models (LLMs) in software engineering has revealed critical limitations in existing benchmarks, particularly the widely used SWE-bench dataset. Recent studies have uncovered severe data contamination issues, e.g. SWE-bench reports 32.67% of successful patches involve direct solution leakage and 31.08% pass due to inadequate test cases. We introduce SWE-MERA, a dynamic, continuously updated benchmark designed to address these fundamental challenges through an automated collection of real-world GitHub issues and rigorous quality validation. Our approach implements a reliable pipeline that ensures quality while minimizing contamination risks, resulting in approximately 10,000 potential tasks with 300 samples currently available. Evaluation using the Aider coding agent demonstrates strong discriminative power in state-of-the-art models. We report performance across a dozen recent LLMs evaluated on tasks collected between September 2024 and June 2025.
软件工程大语言模型(LLMS)的快速发展揭示了现有基准,特别是广泛使用的SWE-bench数据集的重大局限性,最近的研究发现了严重的数据污染问题,例如SWE-bench报告32.67%的成功补丁涉及直接溶解渗漏,31.08%因测试案例不足而通过。我们引入了SWE-MERA,这是一个动态的、不断更新的基准,旨在通过自动收集真实世界的GitHub问题和严格的质量验证来应对这些基本挑战。我们的方法是一个可靠的管道,既能确保质量,又能尽量减少污染风险,从而产生约10,000项潜在任务,目前已有300个样本。使用Aider编码剂进行的评估表明,在最新模型中具有很强的歧视性力量。我们报告了2024年9月至2025年6月期间所收集的任务最近得到评估的十多个LMMS的绩效。
Article 28
Title@2025-07-17 (4): Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation
Title: Assessing the Reliability of LLMs Annotations in the Context of Demographic Bias and Model Explanation | Bewertung der Zuverlässigkeit von LLM-Annotationen im Kontext der demografischen Bias und Modellerklärung | 结合人口偏见和示范解释评估LLM 说明的可靠性 2507.13138v1 |
Authors (6): Hadi Mohammadi, Tina Shahedi, Pablo Mosteiro, Massimo Poesio, Ayoub Bagheri, Anastasia Giachanou
Understanding the sources of variability in annotations is crucial for developing fair NLP systems, especially for tasks like sexism detection where demographic bias is a concern. This study investigates the extent to which annotator demographic features influence labeling decisions compared to text content. Using a Generalized Linear Mixed Model, we quantify this influence, finding that while statistically present, demographic factors account for a minor fraction (8%) of the observed variance, with tweet content being the dominant factor. We then assess the reliability of Generative AI (GenAI) models as annotators, specifically evaluating if guiding them with demographic personas improves alignment with human judgments. Our results indicate that simplistic persona prompting often fails to enhance, and sometimes degrades, performance compared to baseline models. Furthermore, explainable AI (XAI) techniques reveal that model predictions rely heavily on content-specific tokens related to sexism, rather than correlates of demographic characteristics. We argue that focusing on content-driven explanations and robust annotation protocols offers a more reliable path towards fairness than potentially persona simulation.
理解说明中的可变性来源对于发展公平的NLP系统至关重要,特别是在人口偏见引起关注的性别现象检测等任务中。本研究调查了说明人口特征对与文本内容相比的决定标签影响的程度。我们使用通用线性混合模型来量化这种无足轻重,发现在统计上,人口因素占观察到的差异的一小部分(8%),而推特内容是主导因素。然后我们评估基因化AI(GenAI)模型作为说明者的可靠性,具体评估是否指导它们与人口特征更加一致。我们的结果显示,简单化的人往往无法提高,有时无法降低与基线模型相比的性能。此外,可解释的AI(XAI)技术显示,模型预测严重依赖与性别主义有关的特定内容符号,而不是人口特征的关联。我们争辩说,侧重于内容驱动解释和强有力的说明协议提供了更可靠的途径,而不是可能的人模拟。
Article 29
Title@2025-07-17 (4): A Computational Framework to Identify Self-Aspects in Text
Title: A Computational Framework to Identify Self-Aspects in Text | Ein Computational Framework zur Identifizierung von Selbstaspekten im Text | 文本中识别自我特征的计算框架 2507.13115v1 |
Authors (3): Jaya Caporusso, Matthew Purver, Senja Pollak
This Ph.D. proposal introduces a plan to develop a computational framework to identify Self-aspects in text. The Self is a multifaceted construct and it is reflected in language. While it is described across disciplines like cognitive science and phenomenology, it remains underexplored in natural language processing (NLP). Many of the aspects of the Self align with psychological and other well-researched phenomena (e.g., those related to mental health), highlighting the need for systematic NLP-based analysis. In line with this, we plan to introduce an ontology of Self-aspects and a gold-standard annotated dataset. Using this foundation, we will develop and evaluate conventional discriminative models, generative large language models, and embedding-based retrieval approaches against four main criteria: interpretability, ground-truth adherence, accuracy, and computational efficiency. Top-performing models will be applied in case studies in mental health and empirical phenomenology.
这份博士建议引入了一项计划,以制定一个计算框架,在文本中识别自我占位物。自我是一个多层面的构思,在语言中也有反映。虽然它被描述为认知科学和人文学等跨学科学科,但在自然语言处理中仍然未得到充分探讨。自我与心理和其他研究良好的现象(例如与心理健康有关的现象)相一致的许多方面,突出表明需要系统进行基于NLP的分析。根据这一点,我们计划引入自我占位物学和黄金标准的附加说明数据集。我们将利用这个基础,根据四种主要标准,即可解释性、地面真理的坚持性、准确性和计算效率,制定和评估传统的歧视性模式、基因化大型语言模型和基于嵌入的检索方法。在心理健康和实验性人文学的案例研究中将应用高效模型。
Article 30
Title@2025-07-17 (4): Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression
Title: Task-Circuit Quantization: Leveraging Knowledge Localization and Interpretability for Compression | Task-Circuit Quantization: Nutzung von Wissen Lokalisierung und Dolmetschbarkeit für Komprimierung | 任务-环境环境定量:利用知识本地化和压缩解释 2504.07389v2 |
Authors (4): Hanqi Xiao, Yi-Lin Sung, Elias Stengel-Eskin, Mohit Bansal
Post-training quantization (PTQ) reduces a model’s memory footprint by mapping full precision weights into low bit weights without costly retraining, but can degrade its downstream performance, especially in low 2- to 3-bit settings. We develop a new mixed-precision PTQ approach, Task-Circuit Quantization (TaCQ), that draws parallels to automated circuit discovery, directly conditioning the quantization process on specific weight circuits – which we define as sets of weights associated with downstream task performance. These weights are kept as 16-bit weights, while others are quantized, maintaining performance while only adding a marginal memory cost. Specifically, TaCQ contrasts unquantized model weights with a uniformly-quantized model to estimate the expected change in weights due to quantization and uses gradient information to predict the resulting impact on task performance, allowing us to preserve task-specific weights. We compare TaCQ-based quantization to existing mixed-precision quantization methods when conditioning both on general-purpose and task-specific data. Across QA, math reasoning, and text-to-SQL tasks for both Llama-3 and Qwen2.5, we find that TaCQ outperforms baselines using the same calibration data and a lower weight budget, achieving major improvements in the 2- and 3-bit regime. With only 3.1 bits we are able to recover 96% of Llama-3-8B-Instruct’s unquantized 16-bit MMLU performance, obtaining a 5.25% absolute improvement over SPQR. We also observe consistently large gains over existing methods in the 2-bit regime, with an average gain of 14.74% over the strongest baseline, SliM-LLM. Moreover, we observe a 7.20% gain without conditioning on specific tasks, showing TaCQ’s ability to identify important weights is not limited to task-conditioned settings.
训练后量化(PTQ)通过将全精度权重映射为低比特权重来减少模型的内存占用,且无需代价高昂的重新训练,但可能降低下游性能,尤其是在2至3比特的低比特设置下。我们提出一种新的混合精度PTQ方法,即任务回路量化(TaCQ),它借鉴自动回路发现的思路,直接以特定的权重回路为条件进行量化;我们将权重回路定义为与下游任务性能相关的权重集合。这些权重保留为16比特,其余权重则被量化,从而在仅增加少量内存开销的情况下保持性能。具体而言,TaCQ将未量化的模型权重与均匀量化的模型进行对比,以估计量化导致的权重预期变化,并利用梯度信息预测其对任务性能的影响,从而保留任务相关的权重。我们在以通用数据和任务特定数据为条件的两种情况下,将TaCQ与现有的混合精度量化方法进行比较。在Llama-3和Qwen2.5的问答、数学推理和文本到SQL任务上,TaCQ在使用相同校准数据和更低权重预算的情况下优于基线,在2比特和3比特设置下取得了显著提升。仅用3.1比特,我们就能恢复Llama-3-8B-Instruct未量化16比特MMLU性能的96%,比SPQR绝对提升5.25%。在2比特设置下,我们同样观察到相对现有方法的持续大幅提升,比最强基线SliM-LLM平均高出14.74%。此外,在不以特定任务为条件的情况下,我们仍观察到7.20%的提升,表明TaCQ识别重要权重的能力并不局限于以任务为条件的设置。
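The selection idea in the abstract above (contrast full-precision weights with a uniformly quantized copy, weight the difference by gradient information, keep the most affected weights in higher precision) can be sketched in a few lines. This is a simplified illustration and not the TaCQ algorithm: the saliency formula `|gradient * (w - Q(w))|`, the min-max quantizer, and the function names are assumptions, and the numbers are made up.

```python
def uniform_quantize(weights, bits):
    """Round weights onto 2**bits evenly spaced levels between their min and max."""
    lo, hi = min(weights), max(weights)
    if hi == lo:
        return list(weights)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((w - lo) / step) * step for w in weights]

def mixed_precision(weights, grads, bits, keep):
    """Quantize everything except the `keep` weights with the largest assumed saliency."""
    quantized = uniform_quantize(weights, bits)
    # First-order estimate of how much quantizing each weight would move the loss.
    saliency = [abs(g * (w - q)) for w, g, q in zip(weights, grads, quantized)]
    order = sorted(range(len(weights)), key=lambda i: saliency[i], reverse=True)
    kept = set(order[:keep])
    mixed = [w if i in kept else q
             for i, (w, q) in enumerate(zip(weights, quantized))]
    return mixed, kept

# Made-up weights and task gradients for a four-parameter "layer".
weights = [0.0, 0.4, 1.0, 0.55]
grads = [1.0, 2.0, 1.0, 0.1]
mixed, kept = mixed_precision(weights, grads, bits=1, keep=1)
```

In this toy case the weight at index 1 is preserved because it combines a large quantization error with a large gradient, while index 3 has a similar error but a small gradient and is quantized.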
Article 31
Title@2025-07-17 (4): SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts
Title: SemCSE: Semantic Contrastive Sentence Embeddings Using LLM-Generated Summaries For Scientific Abstracts | SemCSE: Semantische kontrastive Satzeinbettungen mit LLM-generierten Zusammenfassungen für wissenschaftliche Abstracts | SEMCSE: 使用LLM创制的科学摘要摘要 2507.13105v1 |
Authors (2): Marc Brinner, Sina Zarriess
We introduce SemCSE, an unsupervised method for learning semantic embeddings of scientific texts. Building on recent advances in contrastive learning for text embeddings, our approach leverages LLM-generated summaries of scientific abstracts to train a model that positions semantically related summaries closer together in the embedding space. This resulting objective ensures that the model captures the true semantic content of a text, in contrast to traditional citation-based approaches that do not necessarily reflect semantic similarity. To validate this, we propose a novel benchmark designed to assess a model’s ability to understand and encode the semantic content of scientific texts, demonstrating that our method enforces a stronger semantic separation within the embedding space. Additionally, we evaluate SemCSE on the comprehensive SciRepEval benchmark for scientific text embeddings, where it achieves state-of-the-art performance among models of its size, thus highlighting the benefits of a semantically focused training approach.
我们引入了SemCSE, 这是学习科学文本语义嵌入的一种不受监督的SemCSE, 这是一种学习科学文本语义嵌入的一种方法。 我们的方法利用LLM产生的科学摘要摘要摘要,对一种模型进行训练,这种模型将与语义相关的摘要更紧密地放在嵌入空间中。 由此产生的目标确保模型捕捉文本的真正语义内容, 与传统的以引用为基础的方法相比, 它不一定反映语义相似性。 为了验证这一点, 我们提出了一个新的基准, 旨在评估模型理解和编码科学文本语义内容的能力, 表明我们的方法在嵌入空间内加强了语义分离。 此外, 我们评估SEMCSE, 有关科学文本嵌入空间的SciRepEval综合基准, 在那里, 它在规模不同的模型中取得了最新技术表现, 从而突出了语义集中的培训方法的好处。
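Contrastive training of the kind described above is often written as an InfoNCE objective with in-batch negatives. The abstract does not state which loss SemCSE uses, so the sketch below is only an assumed example of the general technique; the vectors and the temperature value are made up.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss; positives[i] pairs with anchors[i], other rows act as negatives."""
    losses = []
    for i, a in enumerate(anchors):
        logits = [cosine(a, p) / temperature for p in positives]
        m = max(logits)  # subtract the max before exponentiating for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)

# Embeddings of two summaries of paper A and two summaries of paper B (made-up values).
summaries_1 = [[1.0, 0.0], [0.0, 1.0]]
summaries_2 = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce(summaries_1, summaries_2)
shuffled = info_nce(summaries_1, list(reversed(summaries_2)))
```

The loss is small when summaries of the same abstract sit close together and large when the pairing is shuffled, which is the behaviour the training objective rewards.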
Article 32
Title@2025-07-17 (4): Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models
Title: Unified Triplet-Level Hallucination Evaluation for Large Vision-Language Models | Unified Triplet-Level Halluzination Evaluation für große Vision-Sprache Modelle | 大型视觉语言模型统一三维级幻觉评价 2410.23114v4 |
Authors (4): Junjie Wu, Tsz Ting Chung, Kai Chen, Dit-Yan Yeung
Despite the outstanding performance in vision-language reasoning, Large Vision-Language Models (LVLMs) might generate hallucinated contents that do not exist in the given image. Most existing LVLM hallucination benchmarks are constrained to evaluate the object-related hallucinations. However, the potential hallucination on the relations between two objects, i.e., relation hallucination, still lacks investigation. To remedy that, we design a unified framework to measure the object and relation hallucination in LVLMs simultaneously. The core idea of our framework is to evaluate hallucinations via (object, relation, object) triplets extracted from LVLMs’ responses, making it easily generalizable to different vision-language tasks. Based on our framework, we further introduce Tri-HE, a novel Triplet-level Hallucination Evaluation benchmark which can be used to study both object and relation hallucination at the same time. With comprehensive evaluations on Tri-HE, we observe that the relation hallucination issue is even more serious than object hallucination among existing LVLMs, highlighting a previously neglected problem towards reliable LVLMs. Moreover, based on our findings, we design a simple training-free approach that effectively mitigates hallucinations for LVLMs. Our dataset and code for the reproduction of our experiments are available publicly at https://github.com/wujunjie1998/Tri-HE.
尽管在视觉语言推理方面表现出色,大型视觉语言模型(LVLM)可能会产生在特定图像中不存在的幻觉内容。大多数现有的LVLM幻觉基准都不得不评估与目标有关的幻觉。然而,对两个对象之间的关系的潜在幻觉,即关系幻觉,仍然缺乏调查。为了纠正这一点,我们设计了一个统一框架,以同时测量LVLMs中的对象和关系幻觉。我们框架的核心思想是通过LVMs答复中提取的幻觉(对象、关系、对象)三重幻觉,使LVLMS易于将其推广到不同的视觉语言任务中。此外,我们根据我们的框架,进一步引入了Tri-HE,一个全新的三重幻觉评价标准,可以同时用于研究对象和关系幻觉。在对Tri-HE的全面评价中,我们观察到,与LVLMMs之间的关系比目标幻觉问题更为严重,突出了以前被忽视的LVLMS的问题。此外,我们根据我们的研究结果,设计了一个简单的培训/MLVMS/J 有效减少我们现有的数据。
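Triplet-level evaluation as described above can be pictured as checking each extracted (object, relation, object) triplet against what the image actually contains. The sketch below assumes exact string matching against a ground-truth scene graph; the abstract does not say how Tri-HE judges triplets, so the matching rule, function name, and example data are hypothetical.

```python
def hallucination_report(predicted, ground_truth):
    """Classify predicted (subject, relation, object) triplets against a scene graph."""
    gt = set(ground_truth)
    known_objects = {s for s, _, _ in gt} | {o for _, _, o in gt}
    report = {"supported": 0, "object": 0, "relation": 0}
    for s, r, o in predicted:
        if (s, r, o) in gt:
            report["supported"] += 1
        elif s not in known_objects or o not in known_objects:
            report["object"] += 1    # mentions something absent from the image
        else:
            report["relation"] += 1  # both objects exist, but the relation is unsupported
    total = len(predicted)
    hallucinated = report["object"] + report["relation"]
    report["hallucination_rate"] = hallucinated / total if total else 0.0
    return report

# Made-up scene graph and model output.
scene = [("man", "rides", "horse"), ("horse", "on", "grass")]
response = [("man", "rides", "horse"), ("man", "feeds", "horse"), ("dog", "on", "grass")]
report = hallucination_report(response, scene)
```

Separating the two error counts is what lets object and relation hallucination be compared on the same responses.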
Article 33
Title@2025-07-17 (4): SmartThinker: Learning to Compress and Preserve Reasoning by Step-Level Length Control
Title: SmartThinker: Learning to Compress and Preserve Reasoning by Step-Level Length Control | SmartThinker: Lernen, um zu komprimieren und zu bewahren Vernunft durch Schritt-Level-Length Control | SmartThinker: 学会按职级长长控制进行压缩和保留理由 2507.04348v2 |
Authors (3): Xingyang He, Xiao Ling, Jie Liu
Large reasoning models (LRMs) have exhibited remarkable reasoning capabilities through inference-time scaling, but this progress has also introduced considerable redundancy and inefficiency into their reasoning processes, resulting in substantial computational waste. Previous work has attempted to mitigate this issue by penalizing the overall length of generated samples during reinforcement learning (RL), with the goal of encouraging more concise chains of thought. However, we observe that such a global length penalty often leads to excessive compression of critical reasoning steps while preserving unnecessary details in simpler ones, yielding a suboptimal trade-off between accuracy and efficiency. To address this issue, we propose SmartThinker, a two-stage learnable framework designed to enable fine-grained control over the length of reasoning chains based on the importance of each individual step. In the first stage, SmartThinker adapts a reasoning model to a short-form reasoning mode through rejection sampling combined with supervised fine-tuning (SFT). In the second stage, SmartThinker applies Step-Level Length Control Policy Optimization (SCPO) to refine the model output distribution, which increases the proportion of length allocated to critical steps while reducing redundancy in less important ones. SCPO consists of four core components: an online importance estimator, a step-level length control reward function, a step-level generalized advantage estimation (S-GAE) and a difficulty-adaptive clipping strategy. Working in concert, these components enable SCPO to implement differentiated length control across reasoning steps. Empirical results across multiple reasoning benchmarks and various backbone models demonstrate that SmartThinker significantly reduces redundant reasoning while achieving comparable or even superior performance to existing methods.
大型推理模型(LRMs)通过推理时扩展展现出卓越的推理能力,但这一进展也给推理过程带来了相当多的冗余和低效,造成大量计算浪费。以往的工作试图在强化学习(RL)中惩罚生成样本的总长度来缓解这一问题,以鼓励更简洁的思维链。然而,我们发现这种全局长度惩罚往往导致关键推理步骤被过度压缩,而较简单步骤中的多余细节却被保留,使准确性与效率之间的权衡并非最优。为此,我们提出SmartThinker,这是一个两阶段的可学习框架,旨在根据每个步骤的重要性对推理链的长度进行细粒度控制。在第一阶段,SmartThinker通过拒绝采样结合监督微调(SFT),使推理模型适应短推理模式。在第二阶段,SmartThinker应用步骤级长度控制策略优化(SCPO)来改进模型的输出分布,提高分配给关键步骤的长度比例,同时减少次要步骤中的冗余。SCPO由四个核心组件构成:在线重要性估计器、步骤级长度控制奖励函数、步骤级广义优势估计(S-GAE)以及难度自适应裁剪策略。这些组件协同工作,使SCPO能够对不同推理步骤实施差异化的长度控制。在多个推理基准和多种骨干模型上的实验结果表明,SmartThinker在显著减少冗余推理的同时,取得了与现有方法相当甚至更优的性能。
Article 34
Title@2025-07-17 (4): MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks
Title: MERA Code: A Unified Framework for Evaluating Code Generation Across Tasks | MERA-Code: Ein einheitliches Framework zur Bewertung der Codegenerierung von Aufgaben | MERA 守则:一个统一框架,用于评估不同任务制定守则的情况 2507.12284v2 |
Authors (23): Artem Chervyakov, Alexander Kharitonov, Pavel Zadorozhny, Adamenko Pavel, Rodion Levichev, Dmitrii Vorobev, Dmitrii Salikhov, Aidar Valeev, Alena Pestova, Maria Dziuba, Ilseyar Alimova, Artem Zavgorodnev, Aleksandr Medvedev, Stanislav Moiseev, Elena Bruches, Daniil Grebenkin, Roman Derunets, Vikulov Vladimir, Anton Emelyanov, Dmitrii Babaev, Vladimir V. Ivanov, Valentin Malykh, Alena Fenogenova
Advancements in LLMs have enhanced task automation in software engineering; however, current evaluations primarily focus on natural language tasks, overlooking code quality. Most benchmarks prioritize high-level reasoning over executable code and real-world performance, leaving gaps in understanding true capabilities and risks associated with these models in production. To address this issue, we propose MERA Code, a new addition to the MERA benchmark family, specifically focused on evaluating code for the latest code generation LLMs in Russian. This benchmark includes 11 evaluation tasks that span 8 programming languages. Our proposed evaluation methodology features a taxonomy that outlines the practical coding skills necessary for models to complete these tasks. The benchmark comprises an open-source codebase for users to conduct MERA assessments, a scoring system compatible with various programming environments, and a platform featuring a leaderboard and submission system. We evaluate open LLMs and frontier API models, analyzing their limitations in terms of practical coding tasks in non-English languages. We are publicly releasing MERA to guide future research, anticipate groundbreaking features in model development, and standardize evaluation procedures.
为解决上述问题,我们建议MERA守则,这是MERA基准体系的一个新补充,特别侧重于评价俄罗斯最新代码生成LLMS的守则。这个基准包括11项评价任务,涉及8种编程语言。我们提议的评价方法包括一种分类,它概述了完成这些任务模型所需的实际编码技能。基准包括用户进行MERA评估的开放源代码库、一种与各种编程环境兼容的评分系统以及一个以领导板和提交系统为主的平台。我们评价开放LMS和前沿API模型,分析其在非英语实际编码任务方面的局限性。我们正在公开发布MERA,以指导今后的研究,预测模型开发的破碎特征,并使评价程序标准化。
Article 35
Title@2025-07-17 (4): Formalizing Attack Scenario Description: A Proposed Model
Title: Formalizing Attack Scenario Description: A Proposed Model | Formalisierung des Angriffsszenarios Beschreibung: Ein vorgeschlagenes Modell | 正式化攻击设想情况说明:拟议模式 2507.13076v1 |
Authors (2): Quentin Goux, Nadira Lammari
Organizations face an ever-changing threat landscape. They must continuously dedicate significant effort to protecting their assets, making their adoption of increased cybersecurity automation inevitable. However, process automation requires formalization of input data. In this paper, we address this need for processes that use attack scenarios as input. Such processes include the generation of scripts for attack simulation and training, as well as the analysis of attacks. The paper’s main research contribution is therefore a novel formal model that encompasses the attack’s context description and its scenario. It is abstracted using a UML class model. Once the model has been described, we show how it could serve an upstream attack analysis process. We also show its use for the automatic generation of attack scripts in the context of cybersecurity training. These two use cases constitute the second contribution of this work.
各组织面临着不断变化的威胁,它们必须不断做出重大努力来保护自己的资产,从而不可避免地采用更多的网络安全自动化。但是,程序自动化需要输入数据正规化。我们通过本文件解决了使用攻击情景作为输入的流程的需要。在这些流程中,可以提到用于攻击模拟和培训目的的脚本的生成以及攻击分析。因此,文件的主要研究贡献是一个新颖的正式模型,包含了攻击的上下文描述及其情景。它使用UML类模型抽象。一旦对模型的描述完成,我们将展示它如何为上游攻击分析进程服务。我们还将在网络安全培训中展示它用于自动生成攻击脚本的情况。这两个案例构成了目前研究工作的第二个贡献。
Article 36
Title@2025-07-17 (4): Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities
Title: Rethinking the Embodied Gap in Vision-and-Language Navigation: A Holistic Study of Physical and Visual Disparities | Rethinking the Embodyd Gap in Vision-and-Language Navigation: Eine ganzheitliche Studie physischer und visueller Disparitäten | 重新思考视觉和语言导航中的 “ 内博差距 “ :关于物理和视觉差异的综合研究 2507.13019v1 |
Authors (9): Liuyi Wang, Xinyuan Xia, Hui Zhao, Hanqing Wang, Tai Wang, Yilun Chen, Chengju Liu, Qijun Chen, Jiangmiao Pang
Recent Vision-and-Language Navigation (VLN) advancements are promising, but their idealized assumptions about robot movement and control fail to reflect physically embodied deployment challenges. To bridge this gap, we introduce VLN-PE, a physically realistic VLN platform supporting humanoid, quadruped, and wheeled robots. For the first time, we systematically evaluate several ego-centric VLN methods in physical robotic settings across different technical pipelines, including classification models for single-step discrete action prediction, a diffusion model for dense waypoint prediction, and a train-free, map-based large language model (LLM) integrated with path planning. Our results reveal significant performance degradation due to limited robot observation space, environmental lighting variations, and physical challenges like collisions and falls. This also exposes locomotion constraints for legged robots in complex environments. VLN-PE is highly extensible, allowing seamless integration of new scenes beyond MP3D, thereby enabling more comprehensive VLN evaluation. Despite the weak generalization of current models in physical deployment, VLN-PE provides a new pathway for improving cross-embodiment’s overall adaptability. We hope our findings and tools inspire the community to rethink VLN limitations and advance robust, practical VLN models. The code is available at https://crystalsixone.github.io/vln_pe.github.io/.
近期的视觉和语言导航(VLN)进展是大有希望的,但它们关于机器人移动和控制的理想假设未能反映实际体现的部署挑战。为了缩小这一差距,我们引入了VLN-PE,这是一个实际现实的VLN平台,支持人形、四重立和轮式机器人。我们第一次系统地评估了不同技术管道中物理机器人环境中以自我为中心的VLN方法,包括单步离散行动预测分类模型,密集路点预测的传播模型,以及无火车、基于地图的大语言模型(LLLM)与路径规划相结合。我们的结果显示,由于有限的机器人观察空间、环境照明变化以及碰撞和坠落等物理挑战,VLN(碰撞和坠落)平台的性能严重退化。这也暴露了复杂环境中对腿机器人的流动性限制。VLN(VL)是高度伸缩的,使得除MP3D外的新场景能够无缝合,从而能够进行更全面的VLN评价。尽管当前模型在实际部署中比较薄弱的概括化,VLN-PE(L-PE)为改进交叉渗透/再思考工具提供了新的路径。
Article 37
Title@2025-07-17 (4): Teach Old SAEs New Domain Tricks with Boosting
Title: Teach Old SAEs New Domain Tricks with Boosting | Lehren Sie alte SAEs neue Domain Tricks mit Förderung | 教授旧的 SAEs 新域圈套 2507.12990v1 |
Authors (6): Nikita Koriagin, Yaroslav Aksenov, Daniil Laptev, Gleb Gerasimov, Nikita Balagansky, Daniil Gavrilov
Sparse Autoencoders have emerged as powerful tools for interpreting the internal representations of Large Language Models, yet they often fail to capture domain-specific features not prevalent in their training corpora. This paper introduces a residual learning approach that addresses this feature blindness without requiring complete retraining. We propose training a secondary SAE specifically to model the reconstruction error of a pretrained SAE on domain-specific texts, effectively capturing features missed by the primary model. By summing the outputs of both models during inference, we demonstrate significant improvements in both LLM cross-entropy and explained variance metrics across multiple specialized domains. Our experiments show that this method efficiently incorporates new domain knowledge into existing SAEs while maintaining their performance on general tasks. This approach enables researchers to selectively enhance SAE interpretability for specific domains of interest, opening new possibilities for targeted mechanistic interpretability of LLMs.
粗略的Autoencolders已成为解释大语言模型内部代表性的有力工具,但它们往往未能捕捉到在培训公司中并不普遍存在的特定领域特征。本文介绍了一种处理这一特异性失明的留级学习方法,而无需经过全面再培训。我们建议专门培训二级SAE,以在特定领域文本上模拟经过预先培训的SAE的重建错误,有效地捕捉主要模式所遗漏的特征。通过在推断过程中对两种模型的产出进行总结,我们展示了在多种专门领域LLM交叉渗透和解释差异性指标方面的重大改进。我们的实验表明,这种方法有效地将新的域知识纳入现有的SAE,同时保持其在一般任务上的绩效。这一方法使研究人员能够有选择地提高SEA在特定领域可解释性,为LMM具有针对性的机械性解释性开辟新的可能性。
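The residual idea above (fit a second model to what the first one fails to reconstruct, then sum both outputs at inference) can be shown with a deliberately tiny stand-in. The code below is not a sparse autoencoder: each "model" is a single unit-norm dictionary direction, the fitting rule is a normalised mean, and having the secondary model read the same activation as the primary one is an assumption. All values are made up.

```python
import math

def project(x, direction):
    """Reconstruct x using one unit-norm dictionary direction."""
    coeff = sum(a * b for a, b in zip(x, direction))
    return [coeff * d for d in direction]

def fit_direction(vectors):
    """Toy 'training': use the normalised mean vector as the single feature."""
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    norm = math.sqrt(sum(m * m for m in mean))
    return [m / norm for m in mean]

def mse(xs, recons):
    """Mean squared reconstruction error per example."""
    return sum((a - b) ** 2
               for x, r in zip(xs, recons)
               for a, b in zip(x, r)) / len(xs)

primary = [1.0, 0.0, 0.0]  # stands in for the frozen, pretrained model
domain = [[1.0, 2.0, 0.0], [0.5, 1.0, 0.0], [2.0, 3.0, 0.0]]  # made-up domain activations

base = [project(x, primary) for x in domain]
residuals = [[a - b for a, b in zip(x, r)] for x, r in zip(domain, base)]
secondary = fit_direction(residuals)  # the second model is fit on the residual only

# Inference: sum the outputs of both models.
boosted = [[b + c for b, c in zip(project(x, primary), project(x, secondary))]
           for x in domain]
```

The primary direction is never updated, which mirrors the claim that new domain features can be added without retraining the original model.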
Article 38
Title@2025-07-17 (4): Learning to Translate Ambiguous Terminology by Preference Optimization on Post-Edits
Title: Learning to Translate Ambiguous Terminology by Preference Optimization on Post-Edits | Ambiguous Terminologie von Preference Optimization auf Post-Edits übersetzen lernen | 学习如何通过“优先优化”在编辑后采用“优先优化”来翻译模糊的名词 2507.03580v2 |
Authors (5): Nathaniel Berger, Johannes Eschbach-Dymanus, Miriam Exel, Matthias Huck, Stefan Riezler
In real-world translation scenarios, terminology is rarely one-to-one. Instead, multiple valid translations may appear in a terminology dictionary, but correctness of a translation depends on corporate style guides and context. This can be challenging for neural machine translation (NMT) systems. Luckily, in a corporate context, many examples of human post-edits of valid but incorrect terminology exist. The goal of this work is to learn how to disambiguate our terminology based on these corrections. Our approach is based on preference optimization, using the term post-edit as the knowledge to be preferred. While previous work had to rely on unambiguous translation dictionaries to set hard constraints during decoding, or to add soft constraints in the input, our framework requires neither one-to-one dictionaries nor human intervention at decoding time. We report results on English-German post-edited data and find that the optimal combination of supervised fine-tuning and preference optimization, with both term-specific and full sequence objectives, yields statistically significant improvements in term accuracy over a strong NMT baseline without significant losses in COMET score. Additionally, we release test sets from our post-edited data and terminology dictionary.
在现实世界翻译假设中,术语很少是一对一。相反,多种有效翻译可能出现在术语词典中,但翻译的正确性取决于公司风格指南和背景。这可能对神经机翻译系统具有挑战性。幸运的是,在公司背景下,存在许多有效但不正确的术语的人类事后编辑实例。这项工作的目的是学习如何根据这些校正来模糊我们的术语。我们的方法是基于偏好优化,使用编辑后的术语作为首选知识。虽然以前的工作需要依靠明确的翻译词典来设定代码解码过程中的严格限制,或者在输入时增加软约束。我们的框架既不需要一对一的词典,也不需要在解码时进行人类干预。我们报告英文-德文后编辑数据的结果,发现经过监督的微调和偏好优化的最佳组合,既有具体术语目标,也有全部顺序目标,在术语和术语组合的强基线上取得了显著的术语准确性改进,而没有重大损失。此外,我们从编辑后的数据和词汇典中发布了测试数据集。
Article 39
Title@2025-07-17 (4): MRT at IberLEF-2025 PRESTA Task: Maximizing Recovery from Tables with Multiple Steps
Title: MRT at IberLEF-2025 PRESTA Task: Maximizing Recovery from Tables with Multiple Steps | MRT bei IberLEF-2025 PRESTA Aufgabe: Maximierung der Erholung von Tischen mit mehreren Schritten | IberLEF-2025 PRESTA任务:最大限度地从有多个步骤的表格中回收 2507.12981v1 |
Authors (5): Maximiliano Hormazábal Lagos, Álvaro Bueno Sáez, Héctor Cerezo-Costas, Pedro Alonso Doval, Jorge Alcalde Vesteiro
This paper presents our approach for the IberLEF 2025 Task PRESTA: Preguntas y Respuestas sobre Tablas en Español (Questions and Answers about Tables in Spanish). Our solution obtains answers to the questions by implementing Python code generation with LLMs that is used to filter and process the table. This solution evolves from the MRT implementation for the SemEval 2025 related task. The process consists of multiple steps: analyzing and understanding the content of the table, selecting the useful columns, generating instructions in natural language, translating these instructions to code, running it, and handling potential errors or exceptions. These steps use open-source LLMs and fine-grained optimized prompts for each step. With this approach, we achieved an accuracy score of 85% in the task.
本文件介绍了我们对IberLEF 2025 任务PRESTA:西班牙语表格的问答:我们用Python 代码生成来过滤和处理表格。这个解决方案从2025 Semeval 2025 相关任务的 MRT 实施过程演变而来。这一过程包括多个步骤:分析和理解表格的内容,选择有用的栏目,用自然语言生成指令,将这些指令转换为代码,运行它,处理潜在的错误或例外。这些步骤使用开源LLMS,并优化了每一步的提示。通过这个方法,我们在任务中实现了85的准确分数。
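The last two steps of the pipeline described above (running the generated code and handling errors) might look roughly like the sketch below. Everything here is an assumption for illustration: `generate_code` stands in for an LLM call, the `answer` variable convention and the retry count are invented, and `exec` on model output would need sandboxing in any real system.

```python
def run_with_retries(generate_code, table, question, max_attempts=3):
    """Ask a (hypothetical) code generator for Python, run it, and retry on failure.

    generate_code(question, table, last_error) must return source code that
    stores its result in a variable named `answer`.
    """
    last_error = None
    for _ in range(max_attempts):
        code = generate_code(question, table, last_error)
        scope = {"table": table}          # the generated code sees only the table
        try:
            exec(code, scope)             # unsafe outside a sandbox; sketch only
            return scope["answer"]
        except Exception as exc:          # feed the error back into the next attempt
            last_error = repr(exc)
    return None                           # give up after max_attempts failures
```

Passing the previous error message back to the generator is one plausible way to implement the "handling potential errors or exceptions" step; the abstract does not specify the mechanism.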
Article 40
Title@2025-07-17 (4): UniSLU: Unified Spoken Language Understanding from Heterogeneous Cross-Task Datasets
Title: UniSLU: Unified Spoken Language Understanding from Heterogeneous Cross-Task Datasets | UniSLU: Unified Spoken Language Understanding aus heterogenen Cross-Task-Datensätzen | UUSLU:从不同式跨任务数据集获得统一口语语言理解 2507.12951v1 |
Authors (4): Zhichao Sheng, Shilin Zhou, Chen Gong, Zhenghua Li
Spoken Language Understanding (SLU) plays a crucial role in speech-centric multimedia applications, enabling machines to comprehend spoken language in scenarios such as meetings, interviews, and customer service interactions. SLU encompasses multiple tasks, including Automatic Speech Recognition (ASR), spoken Named Entity Recognition (NER), and spoken Sentiment Analysis (SA). However, existing methods often rely on separate model architectures for individual tasks such as spoken NER and SA, which increases system complexity, limits cross-task interaction, and fails to fully exploit heterogeneous datasets available across tasks. To address these limitations, we propose UniSLU, a unified framework that jointly models multiple SLU tasks within a single architecture. Specifically, we propose a unified representation for diverse SLU tasks, enabling full utilization of heterogeneous datasets across multiple tasks. Built upon this representation, we propose a unified generative method that jointly models ASR, spoken NER, and SA tasks, enhancing task interactions and enabling seamless integration with large language models to harness their powerful generative capabilities. Extensive experiments on public SLU datasets demonstrate the effectiveness of our approach, achieving superior SLU performance compared to several benchmark methods, making it well-suited for real-world speech-based multimedia scenarios. We will release all code and models on GitHub to facilitate future research.
语言语言理解(SLU)在以语言为中心的多媒体应用中发挥着关键作用,使机器能够在会议、访谈和客户服务互动等情景下理解口语。 SLU包含多项任务,包括自动语音识别(ASR)、口头命名实体识别(NER)和口语感分析(SA)等。然而,现有方法往往依赖单独的模式架构来完成单项任务,如口语净化(NER)和南沙(SA)等,这增加了系统复杂性,限制了跨任务互动,未能充分利用各任务之间可得到的多种数据集。为了解决这些限制,我们建议UISLU(U)是一个统一框架,共同在单一架构内模拟多种SLU任务。具体地说,我们建议统一代表多种SLU任务,以便能够充分利用多种任务的不同数据集。基于这一模式,我们提出一个统一的配对ASR、口语净(NER)和南沙(SA)任务的联合模型,加强任务互动,并使大型语言模型能够无缝结合,以利用其强大的基因能力。关于公共 SLU数据集的广泛实验展示了我们的方法的有效性,在现实的SLU(SLU)和将来的模型上实现高超版的演示,我们对数字的模型进行一些基准式研究。
Article 41
Title@2025-07-17 (4): Probabilistic Soundness Guarantees in LLM Reasoning Chains
Title: Probabilistic Soundness Guarantees in LLM Reasoning Chains | Probabilistische Solidität garantiert in LLM-Aufklärungsketten | LLM 理赔链条的概率稳妥性保障 2507.12948v1 |
Authors (7): Weiqiu You, Anton Xue, Shreya Havaldar, Delip Rao, Helen Jin, Chris Callison-Burch, Eric Wong
In reasoning chains generated by large language models (LLMs), initial errors often propagate and undermine the reliability of the final conclusion. Current LLM-based error detection methods often fail to detect propagated errors because they do not properly account for how earlier errors might corrupt judgments of downstream reasoning. To better detect such propagated errors, we introduce Autoregressive Reasoning Entailment Stability (ARES), a novel probabilistic framework that prevents error propagation by judging each claim based only on previously-assessed sound premises. This inductive method yields a nuanced score for each step and provides certified statistical guarantees of its soundness, rather than a brittle binary label. ARES achieves state-of-the-art performance across four benchmarks (72.1% Macro-F1, +8.2 points) and demonstrates superior robustness on very long synthetic reasoning chains, where it excels at detecting propagated errors (90.3% F1, +27.6 points).
在由大型语言模型(LLMs)产生的推理链中,最初的错误往往会传播并破坏最后结论的可靠性。目前基于LLM的错误探测方法往往无法发现传播错误,因为它们没有正确解释早期错误如何会腐蚀下游推理的判断。为了更好地发现这种传播错误,我们引入了“自动递减理性稳定”(ARES),这是一个新的概率框架,它防止错误传播,因为它通过仅仅根据以前评估过的音响前提来判断每项索赔。这种推导方法为每个步骤带来细微分,并为每个步骤的健全性提供经认证的统计保证,而不是一个简便的二进制标签。 ARES在四个基准(72.1% 宏观-F1, +8.2点)上达到了最新水平,并在非常长的合成推理链上展示了超强的稳健性,在其中它最擅长发现传播错误(90.3% F1, +27.6点 ) 。
Article 42
Title@2025-07-17 (4): OASIS: Order-Augmented Strategy for Improved Code Search
Title: OASIS: Order-Augmented Strategy for Improved Code Search | OASIS: Order-Augmented Strategy for Improved Code Search | OASIS:改进守则搜索的有秩序加强战略 2503.08161v4 |
Authors (9): Zuchen Gao, Zizheng Zhan, Xianming Li, Erxin Yu, Ziqi Zhan, Haotian Zhang, Bin Chen, Yuqun Zhang, Jing Li
Code embeddings capture the semantic representations of code and are crucial for various code-related large language model (LLM) applications, such as code search. Previous training primarily relies on optimizing the InfoNCE loss by comparing positive natural language (NL)-code pairs with in-batch negatives. However, due to the sparse nature of code contexts, training solely by comparing the major differences between positive and negative pairs may fail to capture deeper semantic nuances. To address this issue, we propose a novel order-augmented strategy for improved code search (OASIS). It leverages order-based similarity labels to train models to capture subtle differences in similarity among negative pairs. Extensive benchmark evaluations demonstrate that our OASIS model significantly outperforms previous state-of-the-art models focusing solely on major positive-negative differences. It underscores the value of exploiting subtle differences among negative pairs with order labels for effective code embedding training.
代码嵌入包含代码的语义表达,对于各种与代码相关的大语言模型(LLM)应用程序(如代码搜索)至关重要。 以前的培训主要依靠将正自然语言(NL)代码对和批量负数对进行对比,从而优化InfoNCE损失。 但是,由于代码背景的稀少性质,仅仅通过比较正对和负对之间的主要差异来进行培训可能无法捕捉更深层次的语义差异。 为了解决这一问题,我们提出了一个新颖的命令强化战略,以改进代码搜索(OASIS) 。 它利用基于订单的类似标签来培训模型,以捕捉负对子的相似性之间的细微差异。 广泛的基准评估表明,我们的OIS模型大大超越了仅仅侧重于重大正反差异的以往最新模型。 它强调了利用底对夫妇之间的细微差异并带有有效代码嵌入培训的顺序标签的价值。
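For reference, the in-batch InfoNCE objective that the abstract names as the prior training signal can be sketched as below. This shows only that baseline, not the order-augmented loss proposed in the paper; the function name, the similarity-matrix layout (positives on the diagonal), and the temperature value are assumptions made for the example.

```python
import math

def info_nce(sim_matrix, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    sim_matrix[i][j] is the similarity between query i and code snippet j;
    entry (i, i) is the positive pair and the rest of row i are in-batch negatives.
    """
    losses = []
    for i, row in enumerate(sim_matrix):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)
```

Because every negative is treated identically in the denominator, this loss does not distinguish a near-miss negative from an unrelated one, which is the gap the order-based similarity labels are meant to address.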
Article 43
Title@2025-07-17 (4): Making Language Model a Hierarchical Classifier and Generator
Title: Making Language Model a Hierarchical Classifier and Generator | Sprachmodell zu einem hierarchischen Klassifikator und Generator machen | 使语言模式成为等级分类和生成器 2507.12930v1 |
Authors (11): Yihong Wang, Zhonglin Jiang, Ningyuan Xi, Yue Zhao, Qingqing Gu, Xiyuan Chen, Hao Wu, Sheng Xu, Hange Zhou, Yong Chen, Luo Ji
Decoder-only language models, such as GPT and LLaMA, generally decode on the last layer. Motivated by humans’ hierarchical thinking capability, we propose that a hierarchical decoder architecture could be built with different layers decoding texts simultaneously. Due to limited time and computational resources, we choose to adapt a pretrained language model into this form of hierarchical decoder. Language heads of the last layer are copied to different selected intermediate layers, and fine-tuned with different task inputs. Through thorough experiments, we validate that these selected intermediate layers could be adapted to produce meaningful and reasonable content, and that this paradigm of hierarchical decoder can obtain state-of-the-art performance on multiple tasks such as hierarchical text classification, classification-guided generation, and hierarchical text generation. This study suggests the possibility of a generalized hierarchical reasoner, pretraining from scratch.
解码器语言模型,如GPT和LalaMA,通常在最后一层解码。受人类等级思维能力的驱使,我们建议可以同时建造一个等级解码器结构,不同层次解码文本。由于时间和计算资源有限,我们选择将预先训练的语言模型改造成这种等级解码器形式。最后一个层次的语言负责人被复制到不同的选定中间层,并经过不同任务投入的细微调整。通过彻底的实验,我们确认这些选择性中间层可以调整为有意义和合理的内容,而这种等级解码器模式可以在多任务上获得最先进的表演,如分层文字分类、分类制代和等级制文本生成。本研究提出了从零到零的普及等级解释器的可能性。
Article 44
Title@2025-07-17 (4): MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents
Title: MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents | MEM1: Lernen, Speicher zu synergisieren und für effiziente Long-Horizon-Agenten zu verankern | MEM1:学习如何使记忆和理由相互协调,以有效长森剂 2506.15841v2 |
Authors (9): Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, Paul Pu Liang
Modern language agents must operate over long-horizon, multi-turn interactions, where they retrieve external information, adapt to observations, and answer interdependent queries. Yet, most LLM systems rely on full-context prompting, appending all past turns regardless of their relevance. This leads to unbounded memory growth, increased computational costs, and degraded reasoning performance on out-of-distribution input lengths. We introduce MEM1, an end-to-end reinforcement learning framework that enables agents to operate with constant memory across long multi-turn tasks. At each turn, MEM1 updates a compact shared internal state that jointly supports memory consolidation and reasoning. This state integrates prior memory with new observations from the environment while strategically discarding irrelevant or redundant information. To support training in more realistic and compositional settings, we propose a simple yet effective and scalable approach to constructing multi-turn environments by composing existing datasets into arbitrarily complex task sequences. Experiments across three domains, including internal retrieval QA, open-domain web QA, and multi-turn web shopping, show that MEM1-7B improves performance by 3.5x while reducing memory usage by 3.7x compared to Qwen2.5-14B-Instruct on a 16-objective multi-hop QA task, and generalizes beyond the training horizon. Our results demonstrate the promise of reasoning-driven memory consolidation as a scalable alternative to existing solutions for training long-horizon interactive agents, where both efficiency and performance are optimized.
现代语言代理器必须在长正方位、多方向的互动中运行,让它们检索外部信息,适应观测,并回答相互依存的询问。然而,大多数LLM系统都依赖全正文提示,同时在战略上抛弃不相关或多余的信息,同时从战略上放弃所有过去的翻转,而不管其相关性如何。这导致记忆的无限制增长,计算成本增加,以及分配外输入长度的推理性能退化。我们引入了MEM1, 一个端到端的强化学习框架,使代理器能够在长期多方向的任务中以恒定的记忆运行。在每一个方向上,MEM1更新了一个紧密的共享内部状态,共同支持记忆的整合和推理。这个状态将以前的记忆与来自环境的新观测结合起来,同时从战略上抛弃不相关或多余的信息。为了支持更现实的记忆和构成环境的培训,我们提出了一个简单而有效且可扩展的方法来构建多方向的环境,将现有的数据集引入任意复杂的任务序列。在三个领域进行实验,包括内部检索QA、开放式网络QA和多方向采购,这显示MEM1-7B改进了业绩,以3.5-14级的升级的进度比我们的现有学习任务要降低任务要降低的进度要降低。
Article 45
Title@2025-07-17 (4): Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning
Title: Code2Logic: Game-Code-Driven Data Synthesis for Enhancing VLMs General Reasoning | Code2Logic: Game-Code-getriebene Datensynthese zur Verbesserung von VLMs Allgemeine Begründung | 代码2Llogic: 用于增强VLMs一般理由的游戏-代码-驱动数据合成 2505.13886v3 |
Authors (26): Jingqi Tong, Jixin Tang, Hangcheng Li, Yurong Mou, Ming Zhang, Jun Zhao, Yanbo Wen, Fan Song, Jiahao Zhan, Yuyang Lu, Chaoran Tao, Zhiyuan Guo, Jizhou Yu, Tianhao Cheng, Changhao Jiang, Zhen Wang, Tao Liang, Zhihui Fei, Mingyang Wan, Guojun Ma, Weifeng Ge, Guanhua Chen, Tao Gui, Xipeng Qiu, Qi Zhang, Xuanjing Huang
Visual-language Chain-of-Thought (CoT) data resources are relatively scarce compared to text-only counterparts, limiting the improvement of reasoning capabilities in Vision Language Models (VLMs). However, high-quality vision-language reasoning data is expensive and labor-intensive to annotate. To address this issue, we leverage a promising resource: game code, which naturally contains logical structures and state transition processes. Therefore, we propose Code2Logic, a novel game-code-driven approach for multimodal reasoning data synthesis. Our approach leverages Large Language Models (LLMs) to adapt game code, enabling automatic acquisition of reasoning processes and results through code execution. Using the Code2Logic approach, we developed the GameQA dataset to train and evaluate VLMs. GameQA is cost-effective and scalable, offers controllable difficulty gradation, and is diverse, with 30 games and 158 tasks. Surprisingly, despite training solely on game data, VLMs demonstrated out-of-domain generalization, with Qwen2.5-VL-7B in particular improving performance by 2.33% across 7 diverse vision-language benchmarks. Our code, dataset and models are available at https://github.com/tongjingqi/Code2Logic.
与只使用文本的数据相比,视觉语言链数据资源相对稀缺,限制了提高视觉语言模型(VLMS)的推理能力。然而,高质量的视觉语言推理数据昂贵,需要大量劳动才能说明问题。为了解决这一问题,我们利用了一个大有希望的资源:游戏代码,它自然包含逻辑结构和州过渡过程。因此,我们提议了代码2Logic,这是一种由游戏代码驱动的新颖的方法,用于多式推理数据合成。我们的方法利用了大语言模型(LLLMS)来调整游戏代码,从而能够通过执行代码自动获取推理过程和结果。我们利用代码2Logic方法开发了游戏QA数据集,用于培训和评估VLMS。游戏QA具有成本效益和可扩展性,提供了可控制的难度升级,并且有30场游戏和158项任务。令人惊讶的是,尽管仅就游戏数据进行了培训,但VLMS展示了域通用,特别是Qwen2.5-L-7B,能够通过7个不同的视觉语言基准将业绩提高2.33%。我们的代码、数据设置和模型在http://Lgiqiquc.
Article 46
Title@2025-07-17 (4): IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization
Title: IOPO: Empowering LLMs with Complex Instruction Following via Input-Output Preference Optimization | IOPO: Verstärkung von LLMs mit komplexer Anleitung über Input-Output Preference Optimization | IOPO:通过投入-产出优化,以复杂教学赋予LLMs权力 2411.06208v3 |
Authors (5): Xinghua Zhang, Haiyang Yu, Cheng Fu, Fei Huang, Yongbin Li
In the realm of large language models (LLMs), the ability of models to accurately follow instructions is paramount as more agents and applications leverage LLMs for construction, where the complexity of instructions is rapidly increasing. However, on the one hand, there is only a certain amount of complex instruction evaluation data; on the other hand, there are no dedicated algorithms to improve the ability to follow complex instructions. To this end, this paper introduces TRACE, a benchmark for improving and evaluating the complex instruction-following ability, which consists of 120K training data and 1K evaluation data. Furthermore, we propose the IOPO (Input-Output Preference Optimization) alignment method, which takes both input and output preference pairs into consideration, where LLMs not only rapidly align with response preferences but also meticulously explore the instruction preferences. Extensive experiments on both in-domain and out-of-domain datasets confirm the effectiveness of IOPO, showing 8.15%, 2.18% improvements on in-domain data and 6.29%, 3.13% on out-of-domain data compared to SFT and DPO respectively.
在大型语言模型(LLMs)领域,模型准确遵循指示的能力是最重要的,因为更多的代理和应用将LLMs用于建造,而这种模型的复杂程度正在迅速增加;然而,一方面,只有一定数量的复杂指示评价数据;另一方面,没有专门的算法来提高遵循复杂指示的能力;为此,本文件介绍了TRACE,这是改进和评价复杂指示能力的基准,包括120K培训数据和1K评价数据;此外,我们提议IOPO(投入-产出偏好优化)调整方法,既考虑到投入和产出偏好,又考虑到投入和产出偏好,而LOPMs不仅迅速与响应偏好一致,而且还仔细探索了教学偏好。关于内部和外部数据集的广泛实验证实了IOPO的有效性,显示与SFT和DPO分别为8.15%、2.18%的内数据改进率和6.29%的外部数据。
Article 47
Title@2025-07-17 (4): On the Limitations of Large Language Models (LLMs): False Attribution
Title: On the Limitations of Large Language Models (LLMs): False Attribution | Über die Grenzen großer Sprachmodelle (LLMs): Falsche Attribution | 对大语言模式限制的限制: 2404.04631v2 |
Authors (4): Tosin Adewumi, Nudrat Habib, Lama Alkhaled, Elisa Barney
In this work, we introduce a new hallucination metric - Simple Hallucination Index (SHI) and provide insight into one important limitation of the parametric knowledge of large language models (LLMs), i.e. false attribution. The task of automatic author attribution for relatively small chunks of text is an important NLP task but can be challenging. We empirically evaluate the power of 3 open SotA LLMs in zero-shot setting (Gemma-7B, Mixtral 8x7B, and LLaMA-2-13B). We acquired the top 10 most popular books of a month, according to Project Gutenberg, divided each one into equal chunks of 400 words, and prompted each LLM to predict the author. We then randomly sampled 162 chunks per book for human evaluation, based on the error margin of 7% and a confidence level of 95%. The average results show that Mixtral 8x7B has the highest prediction accuracy, the lowest SHI, and a Pearson’s correlation (r) of 0.724, 0.263, and -0.9996, respectively, followed by LLaMA-2-13B and Gemma-7B. However, Mixtral 8x7B suffers from high hallucinations for 3 books, rising as high as a SHI of 0.87 (in the range 0-1, where 1 is the worst). The strong negative correlation of accuracy and SHI, given by r, demonstrates the fidelity of the new hallucination metric, which may generalize to other tasks. We also show that prediction accuracies correlate positively with the frequencies of Wikipedia instances of the book titles instead of the downloads and we perform error analyses of predictions. We publicly release the annotated chunks of data and our codes to aid the reproducibility and evaluation of other models.
在这项工作中,我们引入了新的幻觉衡量标准 — — 简单的幻觉化指数(SHI),并对大型语言模型(LLMS)的参数性知识(LLMS)的一个重要限制,即错误归属。相对较小的文本块的自动作者归属任务是一项重要的NLP任务,但可能具有挑战性。我们根据经验评估了3个开放的SotA LLMs在零射设定中的功率(Gemma-7B、Mixtral 8x7B和LLLAMA-213B ) 。我们获得了一个月中最受欢迎的10本书,根据Gutenberg项目,每本都分为400个等数,并促使每个LLMM预测作者。我们随后随机抽样了每本书162块用于人类评估,其误差幅度为7%,信任度为95%。平均结果显示,Mixtral 8x 8x 的预测准确度最高, SHI 最低的SHI , 和最坏的精确性(r) 数值为0.724、0.63 和-09.66 的精确性(ral) ) , 的精确值数据也显示不断高的精确性数据。
Article 48
Title@2025-07-17 (4): Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities
Title: Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities | Zwillinge 2.5: Das Frontier mit fortschrittlicher Vernunft, Multimodalität, langem Kontext und Agentischen Fähigkeiten der nächsten Generation schieben | Gemini 2.5: 推进先进理性、多模式、长处和下一代的前沿 2507.06261v3 |
Authors (3308): Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, Luke Marris, Sam Petulla, Colin Gaffney, Asaf Aharoni, Nathan Lintz, Tiago Cardal Pais, Henrik Jacobsson, Idan Szpektor, Nan-Jiang Jiang, Krishna Haridasan, Ahmed Omran, Nikunj Saunshi, Dara Bahri, Gaurav Mishra, Eric Chu, Toby Boyd, Brad Hekman, Aaron Parisi, Chaoyi Zhang, Kornraphop Kawintiranon, Tania Bedrax-Weiss, Oliver Wang, Ya Xu, Ollie Purkiss, Uri Mendlovic, Ilaï Deutel, Nam Nguyen, Adam Langley, Flip Korn, Lucia Rossazza, Alexandre Ramé, Sagar Waghmare, Helen Miller, Vaishakh Keshava, Ying Jian, Xiaofan Zhang, Raluca Ada Popa, Kedar Dhamdhere, Blaž Bratanič, Kyuyeun Kim, Terry Koo, Ferran Alet, Yi-ting Chen, Arsha Nagrani, Hannah Muckenhirn, Zhiyuan Zhang, Corbin Quick, Filip Pavetić, Duc Dung Nguyen, Joao Carreira, Michael Elabd, Haroon Qureshi, Fabian Mentzer, Yao-Yuan Yang, Danielle Eisenbud, Anmol Gulati, Ellie Talius, Eric Ni, Sahra Ghalebikesabi, Edouard Yvinec, Alaa Saade, Thatcher Ulrich, Lorenzo Blanco, Dan A. 
Calian, Muhuan Huang, Aäron van den Oord, Naman Goyal, Terry Chen, Praynaa Rawlani, Christian Schallhart, Swachhand Lokhande, Xianghong Luo, Jyn Shan, Ceslee Montgomery, Victoria Krakovna, Federico Piccinini, Omer Barak, Jingyu Cui, Yiling Jia, Mikhail Dektiarev, Alexey Kolganov, Shiyu Huang, Zhe Chen, Xingyu Wang, Jessica Austin, Peter de Boursac, Evgeny Sluzhaev, Frank Ding, Huijian Li, Surya Bhupatiraju, Mohit Agarwal, Sławek Kwasiborski, Paramjit Sandhu, Patrick Siegler, Ahmet Iscen, Eyal Ben-David, Shiraz Butt, Miltos Allamanis, Seth Benjamin, Robert Busa-Fekete, Felix Hernandez-Campos, Sasha Goldshtein, Matt Dibb, Weiyang Zhang, Annie Marsden, Carey Radebaugh, Stephen Roller, Abhishek Nayyar, Jacob Austin, Tayfun Terzi, Bhargav Kanagal Shamanna, Pete Shaw, Aayush Singh, Florian Luisier, Artur Mendonça, Vaibhav Aggarwal, Larisa Markeeva, Claudio Fantacci, Sergey Brin, HyunJeong Choe, Guanyu Wang, Hartwig Adam, Avigail Dabush, Tatsuya Kiyono, Eyal Marcus, Jeremy Cole, Theophane Weber, Hongrae Lee, Ronny Huang, Alex Muzio, Leandro Kieliger, Maigo Le, Courtney Biles, Long Le, Archit Sharma, Chengrun Yang, Avery Lamp, Dave Dopson, Nate Hurley, Katrina, Xu, Zhihao Shan, Shuang Song, Jiewen Tan, Alexandre Senges, George Zhang, Chong You, Yennie Jun, David Raposo, Susanna Ricco, Xuan Yang, Weijie Chen, Prakhar Gupta, Arthur Szlam, Kevin Villela, Chun-Sung Ferng, Daniel Kasenberg, Chen Liang, Rui Zhu, Arunachalam Narayanaswamy, Florence Perot, Paul Pucciarelli, Anna Shekhawat, Alexey Stern, Rishikesh Ingale, Stefani Karp, Sanaz Bahargam, Adrian Goedeckemeyer, Jie Han, Sicheng Li, Andrea Tacchetti, Dian Yu, Abhishek Chakladar, Zhiying Zhang, Mona El Mahdy, Xu Gao, Dale Johnson, Samrat Phatale, AJ Piergiovanni, Hyeontaek Lim, Clement Farabet, Carl Lebsack, Theo Guidroz, John Blitzer, Nico Duduta, David Madras, Steve Li, Daniel von Dincklage, Xin Li, Mahdis Mahdieh, George Tucker, Ganesh Jawahar, Owen Xiao, Danny Tarlow, Robert Geirhos, Noam Velan, Daniel Vlasic, Kalesha 
Bullard, SK Park, Nishesh Gupta, Kellie Webster, Ayal Hitron, Jieming Mao, Julian Eisenschlos, Laurel Prince, Nina D’Souza, Kelvin Zheng, Sara Nasso, Gabriela Botea, Carl Doersch, Caglar Unlu, Chris Alberti, Alexey Svyatkovskiy, Ankita Goel, Krzysztof Choromanski, Pan-Pan Jiang, Richard Nguyen, Four Flynn, Daria Ćurko, Peter Chen, Nicholas Roth, Kieran Milan, Caleb Habtegebriel, Shashi Narayan, Michael Moffitt, Jake Marcus, Thomas Anthony, Brendan McMahan, Gowoon Cheon, Ruibo Liu, Megan Barnes, Lukasz Lew, Rebeca Santamaria-Fernandez, Mayank Upadhyay, Arjun Akula, Arnar Mar Hrafnkelsson, Alvaro Caceres, Andrew Bunner, Michal Sokolik, Subha Puttagunta, Lawrence Moore, Berivan Isik, Jay Hartford, Lawrence Chan, Pradeep Shenoy, Dan Holtmann-Rice, Jane Park, Fabio Viola, Alex Salcianu, Sujeevan Rajayogam, Ian Stewart-Binks, Zelin Wu, Richard Everett, Xi Xiong, Pierre-Antoine Manzagol, Gary Leung, Carl Saroufim, Bo Pang, Dawid Wegner, George Papamakarios, Jennimaria Palomaki, Helena Pankov, Guangda Lai, Guilherme Tubone, Shubin Zhao, Theofilos Strinopoulos, Seth Neel, Mingqiu Wang, Joe Kelley, Li Li, Pingmei Xu, Anitha Vijayakumar, Andrea D’olimpio, Omer Levy, Massimo Nicosia, Grigory Rozhdestvenskiy, Ni Lao, Sirui Xie, Yash Katariya, Jon Simon, Sanjiv Kumar, Florian Hartmann, Michael Kilgore, Jinhyuk Lee, Aroma Mahendru, Roman Ring, Tom Hennigan, Fiona Lang, Colin Cherry, David Steiner, Dawsen Hwang, Ray Smith, Pidong Wang, Jeremy Chen, Ming-Hsuan Yang, Sam Kwei, Philippe Schlattner, Donnie Kim, Ganesh Poomal Girirajan, Nikola Momchev, Ayushi Agarwal, Xingyi Zhou, Ilkin Safarli, Zachary Garrett, AJ Pierigiovanni, Sarthak Jauhari, Alif Raditya Rochman, Shikhar Vashishth, Quan Yuan, Christof Angermueller, Jon Blanton, Xinying Song, Nitesh Bharadwaj Gundavarapu, Thi Avrahami, Maxine Deines, Subhrajit Roy, Manish Gupta, Christopher Semturs, Shobha Vasudevan, Aditya Srikanth Veerubhotla, Shriya Sharma, Josh Jacob, Zhen Yang, Andreas Terzis, Dan Karliner, Auriel Wright, 
Tania Rojas-Esponda, Ashley Brown, Abhijit Guha Roy, Pawan Dogra, Andrei Kapishnikov, Peter Young, Wendy Kan, Vinodh Kumar Rajendran, Maria Ivanova, Salil Deshmukh, Chia-Hua Ho, Mike Kwong, Stav Ginzburg, Annie Louis, KP Sawhney, Slav Petrov, Jing Xie, Yunfei Bai, Georgi Stoyanov, Alex Fabrikant, Rajesh Jayaram, Yuqi Li, Joe Heyward, Justin Gilmer, Yaqing Wang, Radu Soricut, Luyang Liu, Qingnan Duan, Jamie Hayes, Maura O’Brien, Gaurav Singh Tomar, Sivan Eiger, Bahar Fatemi, Jeffrey Hui, Catarina Barros, Adaeze Chukwuka, Alena Butryna, Saksham Thakur, Austin Huang, Zhufeng Pan, Haotian Tang, Serkan Cabi, Tulsee Doshi, Michiel Bakker, Sumit Bagri, Ruy Ley-Wild, Adam Lelkes, Jennie Lees, Patrick Kane, David Greene, Shimu Wu, Jörg Bornschein, Gabriela Surita, Sarah Hodkinson, Fangtao Li, Chris Hidey, Sébastien Pereira, Sean Ammirati, Phillip Lippe, Adam Kraft, Pu Han, Sebastian Gerlach, Zifeng Wang, Liviu Panait, Feng Han, Brian Farris, Yingying Bi, Hannah DeBalsi, Miaosen Wang, Gladys Tyen, James Cohan, Susan Zhang, Jarred Barber, Da-Woon Chung, Jaeyoun Kim, Markus Kunesch, Steven Pecht, Nami Akazawa, Abe Friesen, James Lyon, Ali Eslami, Junru Wu, Jie Tan, Yue Song, Ravi Kumar, Chris Welty, Ilia Akolzin, Gena Gibson, Sean Augenstein, Arjun Pillai, Nancy Yuen, Du Phan, Xin Wang, Iain Barr, Heiga Zen, Nan Hua, Casper Liu, Jilei, Wang, Tanuj Bhatia, Hao Xu, Oded Elyada, Pushmeet Kohli, Mirek Olšák, Ke Chen, Azalia Mirhoseini, Noam Shazeer, Shoshana Jakobovits, Maggie Tran, Nolan Ramsden, Tarun Bharti, Fred Alcober, Yunjie Li, Shilpa Shetty, Jing Chen, Dmitry Kalashnikov, Megha Nawhal, Sercan Arik, Hanwen Chen, Michiel Blokzijl, Shubham Gupta, James Rubin, Rigel Swavely, Sophie Bridgers, Ian Gemp, Chen Su, Arun Suggala, Juliette Pluto, Mary Cassin, Alain Vaucher, Kaiyang Ji, Jiahao Cai, Andrew Audibert, Animesh Sinha, David Tian, Efrat Farkash, Amy Hua, Jilin Chen, Duc-Hieu Tran, Edward Loper, Nicole Brichtova, Lara McConnaughey, Ballie Sandhu, Robert Leland, Doug 
DeCarlo, Andrew Over, James Huang, Xing Wu, Connie Fan, Eric Li, Yun Lei, Deepak Sharma, Cosmin Paduraru, Luo Yu, Matko Bošnjak, Phuong Dao, Min Choi, Sneha Kudugunta, Jakub Adamek, Carlos Guía, Ali Khodaei, Jie Feng, Wenjun Zeng, David Welling, Sandeep Tata, Christina Butterfield, Andrey Vlasov, Seliem El-Sayed, Swaroop Mishra, Tara Sainath, Shentao Yang, RJ Skerry-Ryan, Jeremy Shar, Robert Berry, Arunkumar Rajendran, Arun Kandoor, Andrea Burns, Deepali Jain, Tom Stone, Wonpyo Park, Shibo Wang, Albin Cassirer, Guohui Wang, Hayato Kobayashi, Sergey Rogulenko, Vineetha Govindaraj, Mikołaj Rybiński, Nadav Olmert, Colin Evans, Po-Sen Huang, Kelvin Xu, Premal Shah, Terry Thurk, Caitlin Sikora, Mu Cai, Jin Xie, Elahe Dabir, Saloni Shah, Norbert Kalb, Carrie Zhang, Shruthi Prabhakara, Amit Sabne, Artiom Myaskovsky, Vikas Raunak, Blanca Huergo, Behnam Neyshabur, Jon Clark, Ye Zhang, Shankar Krishnan, Eden Cohen, Dinesh Tewari, James Lottes, Yumeya Yamamori, Hui, Li, Mohamed Elhawaty, Ada Maksutaj Oflazer, Adrià Recasens, Sheryl Luo, Duy Nguyen, Taylor Bos, Kalyan Andra, Ana Salazar, Ed Chi, Jeongwoo Ko, Matt Ginsberg, Anders Andreassen, Anian Ruoss, Todor Davchev, Elnaz Davoodi, Chenxi Liu, Min Kim, Santiago Ontanon, Chi Ming To, Dawei Jia, Rosemary Ke, Jing Wang, Anna Korsun, Moran Ambar, Ilya Kornakov, Irene Giannoumis, Toni Creswell, Denny Zhou, Yi Su, Ishaan Watts, Aleksandr Zaks, Evgenii Eltyshev, Ziqiang Feng, Sidharth Mudgal, Alex Kaskasoli, Juliette Love, Kingshuk Dasgupta, Sam Shleifer, Richard Green, Sungyong Seo, Chansoo Lee, Dale Webster, Prakash Shroff, Ganna Raboshchuk, Isabel Leal, James Manyika, Sofia Erell, Daniel Murphy, Zhisheng Xiao, Anton Bulyenov, Julian Walker, Mark Collier, Matej Kastelic, Nelson George, Sushant Prakash, Sailesh Sidhwani, Alexey Frolov, Steven Hansen, Petko Georgiev, Tiberiu Sosea, Chris Apps, Aishwarya Kamath, David Reid, Emma Cooney, Charlotte Magister, Oriana Riva, Alec Go, Pu-Chin Chen, Sebastian Krause, Nir Levine, Marco 
Fornoni, Ilya Figotin, Nick Roy, Parsa Mahmoudieh, Vladimir Magay, Mukundan Madhavan, Jin Miao, Jianmo Ni, Yasuhisa Fujii, Ian Chou, George Scrivener, Zak Tsai, Siobhan Mcloughlin, Jeremy Selier, Sandra Lefdal, Jeffrey Zhao, Abhijit Karmarkar, Kushal Chauhan, Shivanker Goel, Zhaoyi Zhang, Vihan Jain, Parisa Haghani, Mostafa Dehghani, Jacob Scott, Erin Farnese, Anastasija Ilić, Steven Baker, Julia Pawar, Li Zhong, Josh Camp, Yoel Zeldes, Shravya Shetty, Anand Iyer, Vít Listík, Jiaxian Guo, Luming Tang, Mark Geller, Simon Bucher, Yifan Ding, Hongzhi Shi, Carrie Muir, Dominik Grewe, Ramy Eskander, Octavio Ponce, Boqing Gong, Derek Gasaway, Samira Khan, Umang Gupta, Angelos Filos, Weicheng Kuo, Klemen Kloboves, Jennifer Beattie, Christian Wright, Leon Li, Alicia Jin, Sandeep Mariserla, Miteyan Patel, Jens Heitkaemper, Dilip Krishnan, Vivek Sharma, David Bieber, Christian Frank, John Lambert, Paul Caron, Martin Polacek, Mai Giménez, Himadri Choudhury, Xing Yu, Sasan Tavakkol, Arun Ahuja, Franz Och, Rodolphe Jenatton, Wojtek Skut, Bryan Richter, David Gaddy, Andy Ly, Misha Bilenko, Megh Umekar, Ethan Liang, Martin Sevenich, Mandar Joshi, Hassan Mansoor, Rebecca Lin, Sumit Sanghai, Abhimanyu Singh, Xiaowei Li, Sudheendra Vijayanarasimhan, Zaheer Abbas, Yonatan Bitton, Hansa Srinivasan, Manish Reddy Vuyyuru, Alexander Frömmgen, Yanhua Sun, Ralph Leith, Alfonso Castaño, DJ Strouse, Le Yan, Austin Kyker, Satish Kambala, Mary Jasarevic, Thibault Sellam, Chao Jia, Alexander Pritzel, Raghavender R, Huizhong Chen, Natalie Clay, Sudeep Gandhe, Sean Kirmani, Sayna Ebrahimi, Hannah Kirkwood, Jonathan Mallinson, Chao Wang, Adnan Ozturel, Kuo Lin, Shyam Upadhyay, Vincent Cohen-Addad, Sean Purser-haskell, Yichong Xu, Ebrahim Songhori, Babi Seal, Alberto Magni, Almog Gueta, Tingting Zou, Guru Guruganesh, Thais Kagohara, Hung Nguyen, Khalid Salama, Alejandro Cruzado Ruiz, Justin Frye, Zhenkai Zhu, Matthias Lochbrunner, Simon Osindero, Wentao Yuan, Lisa Lee, Aman Prasad, Lam Nguyen 
Thiet, Daniele Calandriello, Victor Stone, Qixuan Feng, Han Ke, Maria Voitovich, Geta Sampemane, Lewis Chiang, Ling Wu, Alexander Bykovsky, Matt Young, Luke Vilnis, Ishita Dasgupta, Aditya Chawla, Qin Cao, Bowen Liang, Daniel Toyama, Szabolcs Payrits, Anca Stefanoiu, Dimitrios Vytiniotis, Ankesh Anand, Tianxiao Shen, Blagoj Mitrevski, Michael Tschannen, Sreenivas Gollapudi, Aishwarya P S, José Leal, Zhe Shen, Han Fu, Wei Wang, Arvind Kannan, Doron Kukliansky, Sergey Yaroshenko, Svetlana Grant, Umesh Telang, David Wood, Alexandra Chronopoulou, Alexandru Ţifrea, Tao Zhou, Tony, Nguy~ên, Muge Ersoy, Anima Singh, Meiyan Xie, Emanuel Taropa, Woohyun Han, Eirikur Agustsson, Andrei Sozanschi, Hui Peng, Alex Chen, Yoel Drori, Efren Robles, Yang Gao, Xerxes Dotiwalla, Ying Chen, Anudhyan Boral, Alexei Bendebury, John Nham, Chris Tar, Luis Castro, Jiepu Jiang, Canoee Liu, Felix Halim, Jinoo Baek, Andy Wan, Jeremiah Liu, Yuan Cao, Shengyang Dai, Trilok Acharya, Ruoxi Sun, Fuzhao Xue, Saket Joshi, Morgane Lustman, Yongqin Xian, Rishabh Joshi, Deep Karkhanis, Nora Kassner, Jamie Hall, Xiangzhuo Ding, Gan Song, Gang Li, Chen Zhu, Yana Kulizhskaya, Bin Ni, Alexey Vlaskin, Solomon Demmessie, Lucio Dery, Salah Zaiem, Yanping Huang, Cindy Fan, Felix Gimeno, Ananth Balashankar, Koji Kojima, Hagai Taitelbaum, Maya Meng, Dero Gharibian, Sahil Singla, Wei Chen, Ambrose Slone, Guanjie Chen, Sujee Rajayogam, Max Schumacher, Suyog Kotecha, Rory Blevins, Qifei Wang, Mor Hazan Taege, Alex Morris, Xin Liu, Fayaz Jamil, Richard Zhang, Pratik Joshi, Ben Ingram, Tyler Liechty, Ahmed Eleryan, Scott Baird, Alex Grills, Gagan Bansal, Shan Han, Kiran Yalasangi, Shawn Xu, Majd Al Merey, Isabel Gao, Felix Weissenberger, Igor Karpov, Robert Riachi, Ankit Anand, Gautam Prasad, Kay Lamerigts, Reid Hayes, Jamie Rogers, Mandy Guo, Ashish Shenoy, Qiong, Hu, Kyle He, Yuchen Liu, Polina Zablotskaia, Sagar Gubbi, Yifan Chang, Jay Pavagadhi, Kristian Kjems, Archita Vadali, Diego Machado, Yeqing Li, Renshen 
Wang, Dipankar Ghosh, Aahil Mehta, Dana Alon, George Polovets, Alessio Tonioni, Nate Kushman, Joel D’sa, Lin Zhuo, Allen Wu, Rohin Shah, John Youssef, Jiayu Ye, Justin Snyder, Karel Lenc, Senaka Buthpitiya, Matthew Tung, Jichuan Chang, Tao Chen, David Saxton, Jenny Lee, Lydia Lihui Zhang, James Qin, Prabakar Radhakrishnan, Maxwell Chen, Piotr Ambroszczyk, Metin Toksoz-Exley, Yan Zhong, Nitzan Katz, Brendan O’Donoghue, Tamara von Glehn, Adi Gerzi Rosenthal, Aga Świetlik, Xiaokai Zhao, Nick Fernando, Jinliang Wei, Jieru Mei, Sergei Vassilvitskii, Diego Cedillo, Pranjal Awasthi, Hui Zheng, Koray Kavukcuoglu, Itay Laish, Joseph Pagadora, Marc Brockschmidt, Christopher A. Choquette-Choo, Arunkumar Byravan, Yifeng Lu, Xu Chen, Mia Chen, Kenton Lee, Rama Pasumarthi, Sijal Bhatnagar, Aditya Shah, Qiyin Wu, Zhuoyuan Chen, Zack Nado, Bartek Perz, Zixuan Jiang, David Kao, Ganesh Mallya, Nino Vieillard, Lantao Mei, Sertan Girgin, Mandy Jordan, Yeongil Ko, Alekh Agarwal, Yaxin Liu, Yasemin Altun, Raoul de Liedekerke, Anastasios Kementsietsidis, Daiyi Peng, Dangyi Liu, Utku Evci, Peter Humphreys, Austin Tarango, Xiang Deng, Yoad Lewenberg, Kevin Aydin, Chengda Wu, Bhavishya Mittal, Tsendsuren Munkhdalai, Kleopatra Chatziprimou, Rodrigo Benenson, Uri First, Xiao Ma, Jinning Li, Armand Joulin, Hamish Tomlinson, Tingnan Zhang, Milad Nasr, Zhi Hong, Michaël Sander, Lisa Anne Hendricks, Anuj Sharma, Andrew Bolt, Eszter Vértes, Jiri Simsa, Tomer Levinboim, Olcan Sercinoglu, Divyansh Shukla, Austin Wu, Craig Swanson, Danny Vainstein, Fan Bu, Bo Wang, Ryan Julian, Charles Yoon, Sergei Lebedev, Antonious Girgis, Bernd Bandemer, David Du, Todd Wang, Xi Chen, Ying Xiao, Peggy Lu, Natalie Ha, Vlad Ionescu, Simon Rowe, Josip Matak, Federico Lebron, Andreas Steiner, Lalit Jain, Manaal Faruqui, Nicolas Lacasse, Georgie Evans, Neesha Subramaniam, Dean Reich, Giulia Vezzani, Aditya Pandey, Joe Stanton, Tianhao Zhou, Liam McCafferty, Henry Griffiths, Verena Rieser, Soheil Hassas Yeganeh, 
Eleftheria Briakou, Lu Huang, Zichuan Wei, Liangchen Luo, Erik Jue, Gabby Wang, Victor Cotruta, Myriam Khan, Jongbin Park, Qiuchen Guo, Peiran Li, Rong Rong, Diego Antognini, Anastasia Petrushkina, Chetan Tekur, Eli Collins, Parul Bhatia, Chester Kwak, Wenhu Chen, Arvind Neelakantan, Immanuel Odisho, Sheng Peng, Vincent Nallatamby, Vaibhav Tulsyan, Fabian Pedregosa, Peng Xu, Raymond Lin, Yulong Wang, Emma Wang, Sholto Douglas, Reut Tsarfaty, Elena Gribovskaya, Renga Aravamudhan, Manu Agarwal, Mara Finkelstein, Qiao Zhang, Elizabeth Cole, Phil Crone, Sarmishta Velury, Anil Das, Chris Sauer, Luyao Xu, Danfeng Qin, Chenjie Gu, Dror Marcus, CJ Zheng, Wouter Van Gansbeke, Sobhan Miryoosefi, Haitian Sun, YaGuang Li, Charlie Chen, Jae Yoo, Pavel Dubov, Alex Tomala, Adams Yu, Paweł Wesołowski, Alok Gunjan, Eddie Cao, Jiaming Luo, Nikhil Sethi, Arkadiusz Socala, Laura Graesser, Tomas Kocisky, Arturo BC, Minmin Chen, Edward Lee, Sophie Wang, Weize Kong, Qiantong Xu, Nilesh Tripuraneni, Yiming Li, Xinxin Yu, Allen Porter, Paul Voigtlaender, Biao Zhang, Arpi Vezer, Sarah York, Qing Wei, Geoffrey Cideron, Mark Kurzeja, Seungyeon Kim, Benny Li, Angéline Pouget, Hyo Lee, Kaspar Daugaard, Yang Li, Dave Uthus, Aditya Siddhant, Paul Cavallaro, Sriram Ganapathy, Maulik Shah, Rolf Jagerman, Jeff Stanway, Piermaria Mendolicchio, Li Xiao, Kayi Lee, Tara Thompson, Shubham Milind Phal, Jason Chase, Sun Jae Lee, Adrian N Reyes, Disha Shrivastava, Zhen Qin, Roykrong Sukkerd, Seth Odoom, Lior Madmoni, John Aslanides, Jonathan Herzig, Elena Pochernina, Sheng Zhang, Parker Barnes, Daisuke Ikeda, Qiujia Li, Shuo-yiin Chang, Shakir Mohamed, Jim Sproch, Richard Powell, Bidisha Samanta, Domagoj Ćevid, Anton Kovsharov, Shrestha Basu Mallick, Srinivas Tadepalli, Anne Zheng, Kareem Ayoub, Andreas Noever, Christian Reisswig, Zhuo Xu, Junhyuk Oh, Martin Matysiak, Tim Blyth, Shereen Ashraf, Julien Amelot, Boone Severson, Michele Bevilacqua, Motoki Sano, Ethan Dyer, Ofir Roval, Anu Sinha, Yin Zhong, Sagi 
Perel, Tea Sabolić, Johannes Mauerer, Willi Gierke, Mauro Verzetti, Rodrigo Cabrera, Alvin Abdagic, Steven Hemingray, Austin Stone, Jong Lee, Farooq Ahmad, Karthik Raman, Lior Shani, Jonathan Lai, Orhan Firat, Nathan Waters, Eric Ge, Mo Shomrat, Himanshu Gupta, Rajeev Aggarwal, Tom Hudson, Bill Jia, Simon Baumgartner, Palak Jain, Joe Kovac, Junehyuk Jung, Ante Žužul, Will Truong, Morteza Zadimoghaddam, Songyou Peng, Marco Liang, Rachel Sterneck, Balaji Lakshminarayanan, Machel Reid, Oliver Woodman, Tong Zhou, Jianling Wang, Vincent Coriou, Arjun Narayanan, Jay Hoover, Yenai Ma, Apoorv Jindal, Clayton Sanford, Doug Reid, Swaroop Ramaswamy, Alex Kurakin, Roland Zimmermann, Yana Lunts, Dragos Dena, Zalán Borsos, Vered Cohen, Shujian Zhang, Will Grathwohl, Robert Dadashi, Morgan Redshaw, Joshua Kessinger, Julian Odell, Silvano Bonacina, Zihang Dai, Grace Chen, Ayush Dubey, Pablo Sprechmann, Mantas Pajarskas, Wenxuan Zhou, Niharika Ahuja, Tara Thomas, Martin Nikoltchev, Matija Kecman, Bharath Mankalale, Andrey Ryabtsev, Jennifer She, Christian Walder, Jiaming Shen, Lu Li, Carolina Parada, Sheena Panthaplackel, Okwan Kwon, Matt Lawlor, Utsav Prabhu, Yannick Schroecker, Marc’aurelio Ranzato, Pete Blois, Iurii Kemaev, Ting Yu, Dmitry Lepikhin, Hao Xiong, Sahand Sharifzadeh, Oleaser Johnson, Jeremiah Willcock, Rui Yao, Greg Farquhar, Sujoy Basu, Hidetoshi Shimokawa, Nina Anderson, Haiguang Li, Khiem Pham, Yizhong Liang, Sebastian Borgeaud, Alexandre Moufarek, Hideto Kazawa, Blair Kutzman, Marcin Sieniek, Sara Smoot, Ruth Wang, Natalie Axelsson, Nova Fallen, Prasha Sundaram, Yuexiang Zhai, Varun Godbole, Petros Maniatis, Alek Wang, Ilia Shumailov, Santhosh Thangaraj, Remi Crocker, Nikita Gupta, Gang Wu, Phil Chen, Gellért Weisz, Celine Smith, Mojtaba Seyedhosseini, Boya Fang, Xiyang Luo, Roey Yogev, Zeynep Cankara, Andrew Hard, Helen Ran, Rahul Sukthankar, George Necula, Gaël Liu, Honglong Cai, Praseem Banzal, Daniel Keysers, Sanjay Ghemawat, Connie Tao, Emma Dunleavy, Aditi 
Chaudhary, Wei Li, Maciej Mikuła, Chen-Yu Lee, Tiziana Refice, Krishna Somandepalli, Alexandre Fréchette, Dan Bahir, John Karro, Keith Rush, Sarah Perrin, Bill Rosgen, Xiaomeng Yang, Clara Huiyi Hu, Mahmoud Alnahlawi, Justin Mao-Jones, Roopal Garg, Hoang Nguyen, Bat-Orgil Batsaikhan, Iñaki Iturrate, Anselm Levskaya, Avi Singh, Ashyana Kachra, Tony Lu, Denis Petek, Zheng Xu, Mark Graham, Lukas Zilka, Yael Karov, Marija Kostelac, Fangyu Liu, Yaohui Guo, Weiyue Wang, Bernd Bohnet, Emily Pitler, Tony Bruguier, Keisuke Kinoshita, Chrysovalantis Anastasiou, Nilpa Jha, Ting Liu, Jerome Connor, Phil Wallis, Philip Pham, Eric Bailey, Shixin Li, Heng-Tze Cheng, Sally Ma, Haiqiong Li, Akanksha Maurya, Kate Olszewska, Manfred Warmuth, Christy Koh, Dominik Paulus, Siddhartha Reddy Jonnalagadda, Enrique Piqueras, Ali Elqursh, Geoff Brown, Hadar Shemtov, Loren Maggiore, Fei Xia, Ryan Foley, Beka Westberg, George van den Driessche, Livio Baldini Soares, Arjun Kar, Michael Quinn, Siqi Zuo, Jialin Wu, Kyle Kastner, Anna Bortsova, Aijun Bai, Ales Mikhalap, Luowei Zhou, Jennifer Brennan, Vinay Ramasesh, Honglei Zhuang, John Maggs, Johan Schalkwyk, Yuntao Xu, Hui Huang, Andrew Howard, Sasha Brown, Linting Xue, Gloria Shen, Brian Albert, Neha Jha, Daniel Zheng, Varvara Krayvanova, Spurthi Amba Hombaiah, Olivier Lacombe, Gautam Vasudevan, Dan Graur, Tian Xie, Meet Gandhi, Bangju Wang, Dustin Zelle, Harman Singh, Dahun Kim, Sébastien Cevey, Victor Ungureanu, Natasha Noy, Fei Liu, Annie Xie, Fangxiaoyu Feng, Katerina Tsihlas, Daniel Formoso, Neera Vats, Quentin Wellens, Yinan Wang, Niket Kumar Bhumihar, Samrat Ghosh, Matt Hoffman, Tom Lieber, Oran Lang, Kush Bhatia, Tom Paine, Aroonalok Pyne, Ronny Votel, Madeleine Clare Elish, Benoit Schillings, Alex Panagopoulos, Haichuan Yang, Adam Raveret, Zohar Yahav, Shuang Liu, Dalia El Badawy, Nishant Agrawal, Mohammed Badawi, Mahdi Mirzazadeh, Carla Bromberg, Fan Ye, Chang Liu, Tatiana Sholokhova, George-Cristian Muraru, Gargi Balasubramaniam, 
Jonathan Malmaud, Alen Carin, Danilo Martins, Irina Jurenka, Pankil Botadra, Dave Lacey, Richa Singh, Mariano Schain, Dan Zheng, Isabelle Guyon, Victor Lavrenko, Seungji Lee, Xiang Zhou, Demis Hassabis, Jeshwanth Challagundla, Derek Cheng, Nikhil Mehta, Matthew Mauger, Michela Paganini, Pushkar Mishra, Kate Lee, Zhang Li, Lexi Baugher, Ondrej Skopek, Max Chang, Amir Zait, Gaurav Menghani, Lizzetth Bellot, Guangxing Han, Jean-Michel Sarr, Sharat Chikkerur, Himanshu Sahni, Rohan Anil, Arun Narayanan, Chandu Thekkath, Daniele Pighin, Hana Strejček, Marko Velic, Fred Bertsch, Manuel Tragut, Keran Rong, Alicia Parrish, Kai Bailey, Jiho Park, Isabela Albuquerque, Abhishek Bapna, Rajesh Venkataraman, Alec Kosik, Johannes Griesser, Zhiwei Deng, Alek Andreev, Qingyun Dou, Kevin Hui, Fanny Wei, Xiaobin Yu, Lei Shu, Avia Aharon, David Barker, Badih Ghazi, Sebastian Flennerhag, Chris Breaux, Yuchuan Liu, Matthew Bilotti, Josh Woodward, Uri Alon, Stephanie Winkler, Tzu-Kuo Huang, Kostas Andriopoulos, João Gabriel Oliveira, Penporn Koanantakool, Berkin Akin, Michael Wunder, Cicero Nogueira dos Santos, Mohammad Hossein Bateni, Lin Yang, Dan Horgan, Beer Changpinyo, Keyvan Amiri, Min Ma, Dayeong Lee, Lihao Liang, Anirudh Baddepudi, Tejasi Latkar, Raia Hadsell, Jun Xu, Hairong Mu, Michael Han, Aedan Pope, Snchit Grover, Frank Kim, Ankit Bhagatwala, Guan Sun, Yamini Bansal, Amir Globerson, Alireza Nazari, Samira Daruki, Hagen Soltau, Jane Labanowski, Laurent El Shafey, Matt Harvey, Yanif Ahmad, Elan Rosenfeld, William Kong, Etienne Pot, Yi-Xuan Tan, Aurora Wei, Victoria Langston, Marcel Prasetya, Petar Veličković, Richard Killam, Robin Strudel, Darren Ni, Zhenhai Zhu, Aaron Archer, Kavya Kopparapu, Lynn Nguyen, Emilio Parisotto, Hussain Masoom, Sravanti Addepalli, Jordan Grimstad, Hexiang Hu, Joss Moore, Avinatan Hassidim, Le Hou, Mukund Raghavachari, Jared Lichtarge, Adam R. 
Brown, Hilal Dib, Natalia Ponomareva, Justin Fu, Yujing Zhang, Altaf Rahman, Joana Iljazi, Edouard Leurent, Gabriel Dulac-Arnold, Cosmo Du, Chulayuth Asawaroengchai, Larry Jin, Ela Gruzewska, Ziwei Ji, Benigno Uria, Daniel De Freitas, Paul Barham, Lauren Beltrone, Víctor Campos, Jun Yan, Neel Kovelamudi, Arthur Nguyen, Elinor Davies, Zhichun Wu, Zoltan Egyed, Kristina Toutanova, Nithya Attaluri, Hongliang Fei, Peter Stys, Siddhartha Brahma, Martin Izzard, Siva Velusamy, Scott Lundberg, Vincent Zhuang, Kevin Sequeira, Adam Santoro, Ehsan Amid, Ophir Aharoni, Shuai Ye, Mukund Sundararajan, Lijun Yu, Yu-Cheng Ling, Stephen Spencer, Hugo Song, Josip Djolonga, Christo Kirov, Sonal Gupta, Alessandro Bissacco, Clemens Meyer, Mukul Bhutani, Andrew Dai, Weiyi Wang, Siqi Liu, Ashwin Sreevatsa, Qijun Tan, Maria Wang, Lucy Kim, Yicheng Wang, Alex Irpan, Yang Xiao, Stanislav Fort, Yifan He, Alex Gurney, Bryan Gale, Yue Ma, Monica Roy, Viorica Patraucean, Taylan Bilal, Golnaz Ghiasi, Anahita Hosseini, Melvin Johnson, Zhuowan Li, Yi Tay, Benjamin Beyret, Katie Millican, Josef Broder, Mayank Lunayach, Danny Swisher, Eugen Vušak, David Parkinson, MH Tessler, Adi Mayrav Gilady, Richard Song, Allan Dafoe, Yves Raimond, Masa Yamaguchi, Itay Karo, Elizabeth Nielsen, Kevin Kilgour, Mike Dusenberry, Rajiv Mathews, Jiho Choi, Siyuan Qiao, Harsh Mehta, Sahitya Potluri, Chris Knutsen, Jialu Liu, Tat Tan, Kuntal Sengupta, Keerthana Gopalakrishnan, Abodunrinwa Toki, Mencher Chiang, Mike Burrows, Grace Vesom, Zafarali Ahmed, Ilia Labzovsky, Siddharth Vashishtha, Preeti Singh, Ankur Sharma, Ada Ma, Jinyu Xie, Pranav Talluri, Hannah Forbes-Pollard, Aarush Selvan, Joel Wee, Loic Matthey, Tom Funkhouser, Parthasarathy Gopavarapu, Lev Proleev, Cheng Li, Matt Thomas, Kashyap Kolipaka, Zhipeng Jia, Ashwin Kakarla, Srinivas Sunkara, Joan Puigcerver, Suraj Satishkumar Sheth, Emily Graves, Chen Wang, Sadh MNM Khan, Kai Kang, Shyamal Buch, Fred Zhang, Omkar Savant, David Soergel, Kevin Lee, Linda Friso, 
Xuanyi Dong, Rahul Arya, Shreyas Chandrakaladharan, Connor Schenck, Greg Billock, Tejas Iyer, Anton Bakalov, Leslie Baker, Alex Ruiz, Angad Chandorkar, Trieu Trinh, Matt Miecnikowski, Yanqi Zhou, Yangsibo Huang, Jiazhong Nie, Ali Shah, Ashish Thapliyal, Sam Haves, Lun Wang, Uri Shaham, Patrick Morris-Suzuki, Soroush Radpour, Leonard Berrada, Thomas Strohmann, Chaochao Yan, Jingwei Shen, Sonam Goenka, Tris Warkentin, Petar Dević, Dan Belov, Albert Webson, Madhavi Yenugula, Puranjay Datta, Jerry Chang, Nimesh Ghelani, Aviral Kumar, Vincent Perot, Jessica Lo, Yang Song, Herman Schmit, Jianmin Chen, Vasilisa Bashlovkina, Xiaoyue Pan, Diana Mincu, Paul Roit, Isabel Edkins, Andy Davis, Yujia Li, Ben Horn, Xinjian Li, Pradeep Kumar S, Eric Doi, Wanzheng Zhu, Sri Gayatri Sundara Padmanabhan, Siddharth Verma, Jasmine Liu, Heng Chen, Mihajlo Velimirović, Malcolm Reynolds, Priyanka Agrawal, Nick Sukhanov, Abhinit Modi, Siddharth Goyal, John Palowitch, Nima Khajehnouri, Wing Lowe, David Klinghoffer, Sharon Silver, Vinh Tran, Candice Schumann, Francesco Piccinno, Xi Liu, Mario Lučić, Xiaochen Yang, Sandeep Kumar, Ajay Kannan, Ragha Kotikalapudi, Mudit Bansal, Fabian Fuchs, Mohammad Javad Hosseini, Abdelrahman Abdelhamed, Dawn Bloxwich, Tianhe Yu, Ruoxin Sang, Gregory Thornton, Karan Gill, Yuchi Liu, Virat Shejwalkar, Jason Lin, Zhipeng Yan, Kehang Han, Thomas Buschmann, Michael Pliskin, Zhi Xing, Susheel Tatineni, Junlin Zhang, Sissie Hsiao, Gavin Buttimore, Marcus Wu, Zefei Li, Geza Kovacs, Legg Yeung, Tao Huang, Aaron Cohen, Bethanie Brownfield, Averi Nowak, Mikel Rodriguez, Tianze Shi, Hado van Hasselt, Kevin Cen, Deepanway Ghoshal, Kushal Majmundar, Weiren Yu, Warren Chen, Danila Sinopalnikov, Hao Zhang, Vlado Galić, Di Lu, Zeyu Zheng, Maggie Song, Gary Wang, Gui Citovsky, Swapnil Gawde, Isaac Galatzer-Levy, David Silver, Ivana Balazevic, Dipanjan Das, Kingshuk Majumder, Yale Cong, Praneet Dutta, Dustin Tran, Hui Wan, Junwei Yuan, Daniel Eppens, Alanna Walton, Been Kim, 
Harry Ragan, James Cobon-Kerr, Lu Liu, Weijun Wang, Bryce Petrini, Jack Rae, Rakesh Shivanna, Yan Xiong, Chace Lee, Pauline Coquinot, Yiming Gu, Lisa Patel, Blake Hechtman, Aviel Boag, Orion Jankowski, Alex Wertheim, Alex Lee, Paul Covington, Hila Noga, Sam Sobell, Shanthal Vasanth, William Bono, Chirag Nagpal, Wei Fan, Xavier Garcia, Kedar Soparkar, Aybuke Turker, Nathan Howard, Sachit Menon, Yuankai Chen, Vikas Verma, Vladimir Pchelin, Harish Rajamani, Valentin Dalibard, Ana Ramalho, Yang Guo, Kartikeya Badola, Seojin Bang, Nathalie Rauschmayr, Julia Proskurnia, Sudeep Dasari, Xinyun Chen, Mikhail Sushkov, Anja Hauth, Pauline Sho, Abhinav Singh, Bilva Chandra, Allie Culp, Max Dylla, Olivier Bachem, James Besley, Heri Zhao, Timothy Lillicrap, Wei Wei, Wael Al Jishi, Ning Niu, Alban Rrustemi, Raphaël Lopez Kaufman, Ryan Poplin, Jewel Zhao, Minh Truong, Shikhar Bharadwaj, Ester Hlavnova, Eli Stickgold, Cordelia Schmid, Georgi Stephanov, Zhaoqi Leng, Frederick Liu, Léonard Hussenot, Shenil Dodhia, Juliana Vicente Franco, Lesley Katzen, Abhanshu Sharma, Sarah Cogan, Zuguang Yang, Aniket Ray, Sergi Caelles, Shen Yan, Ravin Kumar, Daniel Gillick, Renee Wong, Joshua Ainslie, Jonathan Hoech, Séb Arnold, Dan Abolafia, Anca Dragan, Ben Hora, Grace Hu, Alexey Guseynov, Yang Lu, Chas Leichner, Jinmeng Rao, Abhimanyu Goyal, Nagabhushan Baddi, Daniel Hernandez Diaz, Tim McConnell, Max Bain, Jake Abernethy, Qiqi Yan, Rylan Schaeffer, Paul Vicol, Will Thompson, Montse Gonzalez Arenas, Mathias Bellaiche, Pablo Barrio, Stefan Zinke, Riccardo Patana, Pulkit Mehta, JK Kearns, Avraham Ruderman, Scott Pollom, David D’Ambrosio, Cath Hope, Yang Yu, Andrea Gesmundo, Kuang-Huei Lee, Aviv Rosenberg, Yiqian Zhou, Yaoyiran Li, Drew Garmon, Yonghui Wu, Safeen Huda, Gil Fidel, Martin Baeuml, Jian Li, Phoebe Kirk, Rhys May, Tao Tu, Sara Mc Carthy, Toshiyuki Fukuzawa, Miranda Aperghis, Chih-Kuan Yeh, Toshihiro Yoshino, Bo Li, Austin Myers, Kaisheng Yao, Ben Limonchik, Changwan Ryu, Rohun Saxena, 
Alex Goldin, Ruizhe Zhao, Rocky Rhodes, Tao Zhu, Divya Tyam, Heidi Howard, Nathan Byrd, Hongxu Ma, Yan Wu, Ryan Mullins, Qingze Wang, Aida Amini, Sebastien Baur, Yiran Mao, Subhashini Venugopalan, Will Song, Wen Ding, Paul Collins, Sashank Reddi, Megan Shum, Andrei Rusu, Luisa Zintgraf, Kelvin Chan, Sheela Goenka, Mathieu Blondel, Michael Collins, Renke Pan, Marissa Giustina, Nikolai Chinaev, Christian Schuler, Ce Zheng, Jonas Valfridsson, Alyssa Loo, Alex Yakubovich, Jamie Smith, Tao Jiang, Rich Munoz, Gabriel Barcik, Rishabh Bansal, Mingyao Yang, Yilun Du, Pablo Duque, Mary Phuong, Alexandra Belias, Kunal Lad, Zeyu Liu, Tal Schuster, Karthik Duddu, Jieru Hu, Paige Kunkle, Matthew Watson, Jackson Tolins, Josh Smith, Denis Teplyashin, Garrett Bingham, Marvin Ritter, Marco Andreetto, Divya Pitta, Mohak Patel, Shashank Viswanadha, Trevor Strohman, Catalin Ionescu, Jincheng Luo, Yogesh Kalley, Jeremy Wiesner, Dan Deutsch, Derek Lockhart, Peter Choy, Rumen Dangovski, Chawin Sitawarin, Cat Graves, Tanya Lando, Joost van Amersfoort, Ndidi Elue, Zhouyuan Huo, Pooya Moradi, Jean Tarbouriech, Henryk Michalewski, Wenting Ye, Eunyoung Kim, Alex Druinsky, Florent Altché, Xinyi Chen, Artur Dwornik, Da-Cheng Juan, Rivka Moroshko, Horia Toma, Jarrod Kahn, Hai Qian, Maximilian Sieb, Irene Cai, Roman Goldenberg, Praneeth Netrapalli, Sindhu Raghuram, Yuan Gong, Lijie Fan, Evan Palmer, Yossi Matias, Valentin Gabeur, Shreya Pathak, Tom Ouyang, Don Metzler, Geoff Bacon, Srinivasan Venkatachary, Sridhar Thiagarajan, Alex Cullum, Eran Ofek, Vytenis Sakenas, Mohamed Hammad, Cesar Magalhaes, Mayank Daswani, Oscar Chang, Ashok Popat, Ruichao Li, Komal Jalan, Yanhan Hou, Josh Lipschultz, Antoine He, Wenhao Jia, Pier Giuseppe Sessa, Prateek Kolhar, William Wong, Sumeet Singh, Lukas Haas, Jay Whang, Hanna Klimczak-Plucińska, Georges Rotival, Grace Chung, Yiqing Hua, Anfal Siddiqui, Nicolas Serrano, Dongkai Chen, Billy Porter, Libin Bai, Keshav Shivam, Sho Arora, Partha Talukdar, Tom Cobley, 
Sangnie Bhardwaj, Evgeny Gladchenko, Simon Green, Kelvin Guu, Felix Fischer, Xiao Wu, Eric Wang, Achintya Singhal, Tatiana Matejovicova, James Martens, Hongji Li, Roma Patel, Elizabeth Kemp, Jiaqi Pan, Lily Wang, Blake JianHang Chen, Jean-Baptiste Alayrac, Navneet Potti, Erika Gemzer, Eugene Ie, Kay McKinney, Takaaki Saeki, Edward Chou, Pascal Lamblin, SQ Mah, Zach Fisher, Martin Chadwick, Jon Stritar, Obaid Sarvana, Andrew Hogue, Artem Shtefan, Hadi Hashemi, Yang Xu, Jindong Gu, Sharad Vikram, Chung-Ching Chang, Sabela Ramos, Logan Kilpatrick, Weijuan Xi, Jenny Brennan, Yinghao Sun, Abhishek Jindal, Ionel Gog, Dawn Chen, Felix Wu, Jason Lee, Sudhindra Kopalle, Srinadh Bhojanapalli, Oriol Vinyals, Natan Potikha, Burcu Karagol Ayan, Yuan Yuan, Michael Riley, Piotr Stanczyk, Sergey Kishchenko, Bing Wang, Dan Garrette, Antoine Yang, Vlad Feinberg, CJ Carey, Javad Azizi, Viral Shah, Erica Moreira, Chongyang Shi, Josh Feldman, Elizabeth Salesky, Thomas Lampe, Aneesh Pappu, Duhyeon Kim, Jonas Adler, Avi Caciularu, Brian Walker, Yunhan Xu, Yochai Blau, Dylan Scandinaro, Terry Huang, Sam El-Husseini, Abhishek Sinha, Lijie Ren, Taylor Tobin, Patrik Sundberg, Tim Sohn, Vikas Yadav, Mimi Ly, Emily Xue, Jing Xiong, Afzal Shama Soudagar, Sneha Mondal, Nikhil Khadke, Qingchun Ren, Ben Vargas, Stan Bileschi, Sarah Chakera, Cindy Wang, Boyu Wang, Yoni Halpern, Joe Jiang, Vikas Sindhwani, Petre Petrov, Pranavaraj Ponnuramu, Sanket Vaibhav Mehta, Yu Watanabe, Betty Chan, Matheus Wisniewski, Trang Pham, Jingwei Zhang, Conglong Li, Dario de Cesare, Art Khurshudov, Alex Vasiloff, Melissa Tan, Zoe Ashwood, Bobak Shahriari, Maryam Majzoubi, Garrett Tanzer, Olga Kozlova, Robin Alazard, James Lee-Thorp, Nguyet Minh Phu, Isaac Tian, Junwhan Ahn, Andy Crawford, Lauren Lax, Yuan Shangguan, Iftekhar Naim, David Ross, Oleksandr Ferludin, Tongfei Guo, Andrea Banino, Hubert Soyer, Xiaoen Ju, Dominika Rogozińska, Ishaan Malhi, Marcella Valentine, Daniel Balle, Apoorv Kulshreshtha, Maciej Kula, 
Yiwen Song, Sophia Austin, John Schultz, Roy Hirsch, Arthur Douillard, Apoorv Reddy, Michael Fink, Summer Yue, Khyatti Gupta, Adam Zhang, Norman Rink, Daniel McDuff, Lei Meng, András György, Yasaman Razeghi, Ricky Liang, Kazuki Osawa, Aviel Atias, Matan Eyal, Tyrone Hill, Nikolai Grigorev, Zhengdong Wang, Nitish Kulkarni, Rachel Soh, Ivan Lobov, Zachary Charles, Sid Lall, Kazuma Hashimoto, Ido Kessler, Victor Gomes, Zelda Mariet, Danny Driess, Alessandro Agostini, Canfer Akbulut, Jingcao Hu, Marissa Ikonomidis, Emily Caveness, Kartik Audhkhasi, Saurabh Agrawal, Ioana Bica, Evan Senter, Jayaram Mudigonda, Kelly Chen, Jingchen Ye, Xuanhui Wang, James Svensson, Philipp Fränken, Josh Newlan, Li Lao, Eva Schnider, Sami Alabed, Joseph Kready, Jesse Emond, Afief Halumi, Tim Zaman, Chengxi Ye, Naina Raisinghani, Vilobh Meshram, Bo Chang, Ankit Singh Rawat, Axel Stjerngren, Sergey Levi, Rui Wang, Xiangzhu Long, Mitchelle Rasquinha, Steven Hand, Aditi Mavalankar, Lauren Agubuzu, Sudeshna Roy, Junquan Chen, Jarek Wilkiewicz, Hao Zhou, Michal Jastrzebski, Qiong Hu, Agustin Dal Lago, Ramya Sree Boppana, Wei-Jen Ko, Jennifer Prendki, Yao Su, Zhi Li, Eliza Rutherford, Girish Ramchandra Rao, Ramona Comanescu, Adrià Puigdomènech, Qihang Chen, Dessie Petrova, Christine Chan, Vedrana Milutinovic, Felipe Tiengo Ferreira, Chin-Yi Cheng, Ming Zhang, Tapomay Dey, Sherry Yang, Ramesh Sampath, Quoc Le, Howard Zhou, Chu-Cheng Lin, Hoi Lam, Christine Kaeser-Chen, Kai Hui, Dean Hirsch, Tom Eccles, Basil Mustafa, Shruti Rijhwani, Morgane Rivière, Yuanzhong Xu, Junjie Wang, Xinyang Geng, Xiance Si, Arjun Khare, Cheolmin Kim, Vahab Mirrokni, Kamyu Lee, Khuslen Baatarsukh, Nathaniel Braun, Lisa Wang, Pallavi LV, Richard Tanburn, Yuvein Zhu, Fangda Li, Setareh Ariafar, Dan Goldberg, Ken Burke, Daniil Mirylenka, Meiqi Guo, Olaf Ronneberger, Hadas Natalie Vogel, Liqun Cheng, Nishita Shetty, Johnson Jia, Thomas Jimma, Corey Fry, Ted Xiao, Martin Sundermeyer, Ryan Burnell, Yannis Assael, Mario Pinto, 
JD Chen, Rohit Sathyanarayana, Donghyun Cho, Jing Lu, Rishabh Agarwal, Sugato Basu, Lucas Gonzalez, Dhruv Shah, Meng Wei, Dre Mahaarachchi, Rohan Agrawal, Tero Rissa, Yani Donchev, Ramiro Leal-Cavazos, Adrian Hutter, Markus Mircea, Alon Jacovi, Faruk Ahmed, Jiageng Zhang, Shuguang Hu, Bo-Juen Chen, Jonni Kanerva, Guillaume Desjardins, Andrew Lee, Nikos Parotsidis, Asier Mujika, Tobias Weyand, Jasper Snoek, Jo Chick, Kai Chen, Paul Chang, Ethan Mahintorabi, Zi Wang, Tolly Powell, Orgad Keller, Abhirut Gupta, Claire Sha, Kanav Garg, Nicolas Heess, Ágoston Weisz, Cassidy Hardin, Bartek Wydrowski, Ben Coleman, Karina Zainullina, Pankaj Joshi, Alessandro Epasto, Terry Spitz, Binbin Xiong, Kai Zhao, Arseniy Klimovskiy, Ivy Zheng, Johan Ferret, Itay Yona, Waleed Khawaja, Jean-Baptiste Lespiau, Maxim Krikun, Siamak Shakeri, Timothee Cour, Bonnie Li, Igor Krivokon, Dan Suh, Alex Hofer, Jad Al Abdallah, Nikita Putikhin, Oscar Akerlund, Silvio Lattanzi, Anurag Kumar, Shane Settle, Himanshu Srivastava, Folawiyo Campbell-Ajala, Edouard Rosseel, Mihai Dorin Istin, Nishanth Dikkala, Anand Rao, Nick Young, Kate Lin, Dhruva Bhaswar, Yiming Wang, Jaume Sanchez Elias, Kritika Muralidharan, James Keeling, Dayou Du, Siddharth Gopal, Gregory Dibb, Charles Blundell, Manolis Delakis, Jacky Liang, Marco Tulio Ribeiro, Georgi Karadzhov, Guillermo Garrido, Ankur Bapna, Jiawei Cao, Adam Sadovsky, Pouya Tafti, Arthur Guez, Coline Devin, Yixian Di, Jinwei Xing, Chuqiao Xu, Hanzhao Lin, Chun-Te Chu, Sameera Ponda, Wesley Helmholz, Fan Yang, Yue Gao, Sara Javanmardi, Wael Farhan, Alex Ramirez, Ricardo Figueira, Khe Chai Sim, Yuval Bahat, Ashwin Vaswani, Liangzhe Yuan, Gufeng Zhang, Leland Rechis, Hanjun Dai, Tayo Oguntebi, Alexandra Cordell, Eugénie Rives, Kaan Tekelioglu, Naveen Kumar, Bing Zhang, Aurick Zhou, Nikolay Savinov, Andrew Leach, Alex Tudor, Sanjay Ganapathy, Yanyan Zheng, Mirko Rossini, Vera Axelrod, Arnaud Autef, Yukun Zhu, Zheng Zheng, Mingda Zhang, Baochen Sun, Jie Ren, Nenad 
Tomasev, Nithish Kannen, Amer Sinha, Charles Chen, Louis O’Bryan, Alex Pak, Aditya Kusupati, Weel Yang, Deepak Ramachandran, Patrick Griffin, Seokhwan Kim, Philipp Neubeck, Craig Schiff, Tammo Spalink, Mingyang Ling, Arun Nair, Ga-Young Joung, Linda Deng, Avishkar Bhoopchand, Lora Aroyo, Tom Duerig, Jordan Griffith, Gabe Barth-Maron, Jake Ades, Alex Haig, Ankur Taly, Yunting Song, Paul Michel, Dave Orr, Dean Weesner, Corentin Tallec, Carrie Grimes Bostock, Paul Niemczyk, Andy Twigg, Mudit Verma, Rohith Vallu, Henry Wang, Marco Gelmi, Kiranbir Sodhia, Aleksandr Chuklin, Omer Goldman, Jasmine George, Liang Bai, Kelvin Zhang, Petar Sirkovic, Efrat Nehoran, Golan Pundak, Jiaqi Mu, Alice Chen, Alex Greve, Paulo Zacchello, David Amos, Heming Ge, Eric Noland, Colton Bishop, Jeffrey Dudek, Youhei Namiki, Elena Buchatskaya, Jing Li, Dorsa Sadigh, Masha Samsikova, Dan Malkin, Damien Vincent, Robert David, Rob Willoughby, Phoenix Meadowlark, Shawn Gao, Yan Li, Raj Apte, Amit Jhindal, Stein Xudong Lin, Alex Polozov, Zhicheng Wang, Tomas Mery, Anirudh GP, Varun Yerram, Sage Stevens, Tianqi Liu, Noah Fiedel, Charles Sutton, Matthew Johnson, Xiaodan Song, Kate Baumli, Nir Shabat, Muqthar Mohammad, Hao Liu, Marco Selvi, Yichao Zhou, Mehdi Hafezi Manshadi, Chu-ling Ko, Anthony Chen, Michael Bendersky, Jorge Gonzalez Mendez, Nisarg Kothari, Amir Zandieh, Yiling Huang, Daniel Andor, Ellie Pavlick, Idan Brusilovsky, Jitendra Harlalka, Sally Goldman, Andrew Lampinen, Guowang Li, Asahi Ushio, Somit Gupta, Lei Zhang, Chuyuan Kelly Fu, Madhavi Sewak, Timo Denk, Jed Borovik, Brendan Jou, Avital Zipori, Prateek Jain, Junwen Bai, Thang Luong, Jonathan Tompson, Alice Li, Li Liu, George Powell, Jiajun Shen, Alex Feng, Grishma Chole, Da Yu, Yinlam Chow, Tongxin Yin, Eric Malmi, Kefan Xiao, Yash Pande, Shachi Paul, Niccolò Dal Santo, Adil Dostmohamed, Sergio Guadarrama, Aaron Phillips, Thanumalayan Sankaranarayana Pillai, Gal Yona, Amin Ghafouri, Preethi Lahoti, Benjamin Lee, Dhruv Madeka, Eren 
Sezener, Simon Tokumine, Adrian Collister, Nicola De Cao, Richard Shin, Uday Kalra, Parker Beak, Emily Nottage, Ryo Nakashima, Ivan Jurin, Vikash Sehwag, Meenu Gaba, Junhao Zeng, Kevin R. McKee, Fernando Pereira, Tamar Yakar, Amayika Panda, Arka Dhar, Peilin Zhong, Daniel Sohn, Mark Brand, Lars Lowe Sjoesund, Viral Carpenter, Sharon Lin, Shantanu Thakoor, Marcus Wainwright, Ashwin Chaugule, Pranesh Srinivasan, Muye Zhu, Bernett Orlando, Jack Weber, Ayzaan Wahid, Gilles Baechler, Apurv Suman, Jovana Mitrović, Gabe Taubman, Honglin Yu, Helen King, Josh Dillon, Cathy Yip, Dhriti Varma, Tomas Izo, Levent Bolelli, Borja De Balle Pigem, Julia Di Trapani, Fotis Iliopoulos, Adam Paszke, Nishant Ranka, Joe Zou, Francesco Pongetti, Jed McGiffin, Alex Siegman, Rich Galt, Ross Hemsley, Goran Žužić, Victor Carbune, Tao Li, Myle Ott, Félix de Chaumont Quitry, David Vilar Torres, Yuri Chervonyi, Tomy Tsai, Prem Eruvbetine, Samuel Yang, Matthew Denton, Jake Walker, Slavica Andačić, Idan Heimlich Shtacher, Vittal Premachandran, Harshal Tushar Lehri, Cip Baetu, Damion Yates, Lampros Lamprou, Mariko Iinuma, Ioana Mihailescu, Ben Albrecht, Shachi Dave, Susie Sargsyan, Bryan Perozzi, Lucas Manning, Chiyuan Zhang, Denis Vnukov, Igor Mordatch, Raia Hadsell, Wolfgang Macherey, Ryan Kappedal, Jim Stephan, Aditya Tripathi, Klaus Macherey, Jun Qian, Abhishek Bhowmick, Shekoofeh Azizi, Rémi Leblond, Shiva Mohan Reddy Garlapati, Timothy Knight, Matthew Wiethoff, Wei-Chih Hung, Anelia Angelova, Georgios Evangelopoulos, Pawel Janus, Dimitris Paparas, Matthew Rahtz, Ken Caluwaerts, Vivek Sampathkumar, Daniel Jarrett, Shadi Noghabi, Antoine Miech, Chak Yeung, Geoff Clark, Henry Prior, Fei Zheng, Jean Pouget-Abadie, Indro Bhattacharya, Kalpesh Krishna, Will Bishop, Zhe Yuan, Yunxiao Deng, Ashutosh Sathe, Kacper Krasowiak, Ciprian Chelba, Cho-Jui Hsieh, Kiran Vodrahalli, Buhuang Liu, Thomas Köppe, Amr Khalifa, Lubo Litchev, Pichi Charoenpanit, Reed Roberts, Sachin Yadav, Yasumasa Onoe, Desi Ivanov, 
Megha Mohabey, Vighnesh Birodkar, Nemanja Rakićević, Pierre Sermanet, Vaibhav Mehta, Krishan Subudhi, Travis Choma, Will Ng, Luheng He, Kathie Wang, Tasos Kementsietsidis, Shane Gu, Mansi Gupta, Andrew Nystrom, Mehran Kazemi, Timothy Chung, Nacho Cano, Nikhil Dhawan, Yufei Wang, Jiawei Xia, Trevor Yacovone, Eric Jia, Mingqing Chen, Simeon Ivanov, Ashrith Sheshan, Sid Dalmia, Paweł Stradomski, Pengcheng Yin, Salem Haykal, Congchao Wang, Dennis Duan, Neslihan Bulut, Greg Kochanski, Liam MacDermed, Namrata Godbole, Shitao Weng, Jingjing Chen, Rachana Fellinger, Ramin Mehran, Daniel Suo, Hisham Husain, Tong He, Kaushal Patel, Joshua Howland, Randall Parker, Kelvin Nguyen, Sharath Maddineni, Chris Rawles, Mina Khan, Shlomi Cohen-Ganor, Amol Mandhane, Xinyi Wu, Chenkai Kuang, Iulia Comşa, Ramya Ganeshan, Hanie Sedghi, Adam Bloniarz, Nuo Wang Pierse, Anton Briukhov, Petr Mitrichev, Anita Gergely, Serena Zhan, Allan Zhou, Nikita Saxena, Eva Lu, Josef Dean, Ashish Gupta, Nicolas Perez-Nieves, Renjie Wu, Cory McLean, Wei Liang, Disha Jindal, Anton Tsitsulin, Wenhao Yu, Kaiz Alarakyia, Tom Schaul, Piyush Patil, Peter Sung, Elijah Peake, Hongkun Yu, Feryal Behbahani, JD Co-Reyes, Alan Ansell, Sean Sun, Clara Barbu, Jonathan Lee, Seb Noury, James Allingham, Bilal Piot, Mohit Sharma, Christopher Yew, Ivan Korotkov, Bibo Xu, Demetra Brady, Goran Petrovic, Shibl Mourad, Claire Cui, Aditya Gupta, Parker Schuh, Saarthak Khanna, Anna Goldie, Abhinav Arora, Vadim Zubov, Amy Stuart, Mark Epstein, Yun Zhu, Jianqiao Liu, Yury Stuken, Ziyue Wang, Karolis Misiunas, Dee Guo, Ashleah Gill, Ale Hartman, Zaid Nabulsi, Aurko Roy, Aleksandra Faust, Jason Riesa, Ben Withbroe, Mengchao Wang, Marco Tagliasacchi, Andreea Marzoca, James Noraky, Serge Toropov, Malika Mehrotra, Bahram Raad, Sanja Deur, Steve Xu, Marianne Monteiro, Zhongru Wu, Yi Luan, Sam Ritter, Nick Li, Håvard Garnes, Yanzhang He, Martin Zlocha, Jifan Zhu, Matteo Hessel, Will Wu, Spandana Raj Babbula, Chizu Kawamoto, Yuanzhen Li, 
Mehadi Hassen, Yan Wang, Brian Wieder, James Freedman, Yin Zhang, Xinyi Bai, Tianli Yu, David Reitter, XiangHai Sheng, Mateo Wirth, Aditya Kini, Dima Damen, Mingcen Gao, Rachel Hornung, Michael Voznesensky, Brian Roark, Adhi Kuncoro, Yuxiang Zhou, Rushin Shah, Anthony Brohan, Kuangyuan Chen, James Wendt, David Rim, Paul Kishan Rubenstein, Jonathan Halcrow, Michelle Liu, Ty Geri, Yunhsuan Sung, Jane Shapiro, Shaan Bijwadia, Chris Duvarney, Christina Sorokin, Paul Natsev, Reeve Ingle, Pramod Gupta, Young Maeng, Ndaba Ndebele, Kexin Zhu, Valentin Anklin, Katherine Lee, Yuan Liu, Yaroslav Akulov, Shaleen Gupta, Guolong Su, Flavien Prost, Tianlin Liu, Vitaly Kovalev, Pol Moreno, Martin Scholz, Sam Redmond, Zongwei Zhou, Alex Castro-Ros, André Susano Pinto, Dia Kharrat, Michal Yarom, Rachel Saputro, Jannis Bulian, Ben Caine, Ji Liu, Abbas Abdolmaleki, Shariq Iqbal, Tautvydas Misiunas, Mikhail Sirotenko, Shefali Garg, Guy Bensky, Huan Gui, Xuezhi Wang, Raphael Koster, Mike Bernico, Da Huang, Romal Thoppilan, Trevor Cohn, Ben Golan, Wenlei Zhou, Andrew Rosenberg, Markus Freitag, Tynan Gangwani, Vincent Tsang, Anand Shukla, Xiaoqi Ren, Minh Giang, Chi Zou, Andre Elisseeff, Charline Le Lan, Dheeru Dua, Shuba Lall, Pranav Shyam, Frankie Garcia, Sarah Nguyen, Michael Guzman, AJ Maschinot, Marcello Maggioni, Ming-Wei Chang, Karol Gregor, Lotte Weerts, Kumaran Venkatesan, Bogdan Damoc, Leon Liu, Jan Wassenberg, Lewis Ho, Becca Roelofs, Majid Hadian, François-Xavier Aubet, Yu Liang, Sami Lachgar, Danny Karmon, Yong Cheng, Amelio Vázquez-Reina, Angie Chen, Zhuyun Dai, Andy Brock, Shubham Agrawal, Chenxi Pang, Peter Garst, Mariella Sanchez-Vargas, Ivor Rendulic, Aditya Ayyar, Andrija Ražnatović, Olivia Ma, Roopali Vij, Neha Sharma, Ashwin Balakrishna, Bingyuan Liu, Ian Mackinnon, Sorin Baltateanu, Petra Poklukar, Gabriel Ibagon, Colin Ji, Hongyang Jiao, Isaac Noble, Wojciech Stokowiec, Zhihao Li, Jeff Dean, David Lindner, Mark Omernick, Kristen Chiafullo, Mason Dimarco, Vitor 
Rodrigues, Vittorio Selo, Garrett Honke, Xintian Wu, Wei He, Adam Hillier, Anhad Mohananey, Vihari Piratla, Chang Ye, Chase Malik, Sebastian Riedel, Samuel Albanie, Zi Yang, Kenny Vassigh, Maria Bauza, Sheng Li, Yiqing Tao, Nevan Wichers, Andrii Maksai, Abe Ittycheriah, Ross Mcilroy, Bryan Seybold, Noah Goodman, Romina Datta, Steven M. Hernandez, Tian Shi, Yony Kochinski, Anna Bulanova, Ken Franko, Mikita Sazanovich, Nicholas FitzGerald, Praneeth Kacham, Shubha Srinivas Raghvendra, Vincent Hellendoorn, Alexander Grushetsky, Julian Salazar, Angeliki Lazaridou, Jason Chang, Jan-Thorsten Peter, Sushant Kafle, Yann Dauphin, Abhishek Rao, Filippo Graziano, Izhak Shafran, Yuguo Liao, Tianli Ding, Geng Yan, Grace Chu, Zhao Fu, Vincent Roulet, Gabriel Rasskin, Duncan Williams, Shahar Drath, Alex Mossin, Raphael Hoffmann, Jordi Orbay, Francesco Bertolini, Hila Sheftel, Justin Chiu, Siyang Xue, Yuheng Kuang, Ferjad Naeem, Swaroop Nath, Nana Nti, Phil Culliton, Kashyap Krishnakumar, Michael Isard, Pei Sun, Ayan Chakrabarti, Nathan Clement, Regev Cohen, Arissa Wongpanich, GS Oh, Ashwin Murthy, Hao Zheng, Jessica Hamrick, Oskar Bunyan, Suhas Ganesh, Nitish Gupta, Roy Frostig, John Wieting, Yury Malkov, Pierre Marcenac, Zhixin Lai, Xiaodan Tang, Mohammad Saleh, Fedir Zubach, Chinmay Kulkarni, Huanjie Zhou, Vicky Zayats, Nan Ding, Anshuman Tripathi, Arijit Pramanik, Patrik Zochbauer, Harish Ganapathy, Vedant Misra, Zach Behrman, Hugo Vallet, Mingyang Zhang, Mukund Sridhar, Ye Jin, Mohammad Babaeizadeh, Siim Põder, Megha Goel, Divya Jain, Tajwar Nasir, Shubham Mittal, Tim Dozat, Diego Ardila, Aliaksei Severyn, Fabio Pardo, Sammy Jerome, Siyang Qin, Louis Rouillard, Amir Yazdanbakhsh, Zizhao Zhang, Shivani Agrawal, Kaushik Shivakumar, Caden Lu, Praveen Kallakuri, Rachita Chhaparia, Kanishka Rao, Charles Kwong, Asya Fadeeva, Shitij Nigam, Yan Virin, Yuan Zhang, Balaji Venkatraman, Beliz Gunel, Marc Wilson, Huiyu Wang, Abhinav Gupta, Xiaowei Xu, Adrien Ali Taïga, Kareem Mohamed, 
Doug Fritz, Daniel Rodriguez, Zoubin Ghahramani, Harry Askham, Lior Belenki, James Zhao, Rahul Gupta, Krzysztof Jastrzębski, Takahiro Kosakai, Kaan Katircioglu, Jon Schneider, Rina Panigrahy, Konstantinos Bousmalis, Peter Grabowski, Prajit Ramachandran, Chaitra Hegde, Mihaela Rosca, Angelo Scorza Scarpati, Kyriakos Axiotis, Ying Xu, Zach Gleicher, Assaf Hurwitz Michaely, Mandar Sharma, Sanil Jain, Christoph Hirnschall, Tal Marian, Xuhui Jia, Kevin Mather, Kilol Gupta, Linhai Qiu, Nigamaa Nayakanti, Lucian Ionita, Steven Zheng, Lucia Loher, Kurt Shuster, Igor Petrovski, Roshan Sharma, Rahma Chaabouni, Angel Yeh, James An, Arushi Gupta, Steven Schwarcz, Seher Ellis, Sam Conway-Rahman, Javier Snaider, Alex Zhai, James Atwood, Daniel Golovin, Liqian Peng, Te I, Vivian Xia, Salvatore Scellato, Mahan Malihi, Arthur Bražinskas, Vlad-Doru Ion, Younghoon Jun, James Swirhun, Soroosh Mariooryad, Jiao Sun, Steve Chien, Rey Coaguila, Ariel Brand, Yi Gao, Tom Kwiatkowski, Roee Aharoni, Cheng-Chun Lee, Mislav Žanić, Yichi Zhang, Dan Ethier, Vitaly Nikolaev, Pranav Nair, Yoav Ben Shalom, Hen Fitoussi, Jai Gupta, Hongbin Liu, Dee Cattle, Tolga Bolukbasi, Ben Murdoch, Fantine Huot, Yin Li, Chris Hahn, Urvashi Khandelwal, Frederik Benzing, Arthur Conmy, Andrey Simanovsky, Françoise Beaufays, Eugene Weinstein, Tongzhou Chen, Luke Leonhard, Bhuvana Ramabhadran
In this report, we introduce the Gemini 2.X model family: Gemini 2.5 Pro and Gemini 2.5 Flash, as well as our earlier Gemini 2.0 Flash and Flash-Lite models. Gemini 2.5 Pro is our most capable model yet, achieving SoTA performance on frontier coding and reasoning benchmarks. In addition to its incredible coding and reasoning skills, Gemini 2.5 Pro is a thinking model that excels at multimodal understanding and it is now able to process up to 3 hours of video content. Its unique combination of long context, multimodal and reasoning capabilities can be combined to unlock new agentic workflows. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements and Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost. Taken together, the Gemini 2.X model generation spans the full Pareto frontier of model capability vs cost, allowing users to explore the boundaries of what is possible with complex agentic problem solving.
在本报告中,我们介绍Gemini 2.X模式家庭:Gemini 2.5 Pro 和 Gemini 2.5 Flash,以及我们先前的Gemini 2.0 Flash 和 Flash-Lite 模型。Gemini 2.5 Pro是我们目前最有能力的模型,除了在边境编码和推理基准方面实现SoTA业绩外,它除了其令人难以置信的编码和推理技能外,Gemini 2.5 Pro是一个思维模型,它擅长多式理解,现在能够处理多达3小时的视频内容。它的独特结合了长背景、多式和推理能力,可以打开新的代理工作流程。Gemini 2.5 闪电提供了精良的推理能力,满足了计算和延时要求的一小部分,Gemini 2.0 Flash和闪光-Lite在低长期和成本方面提供高性能。综合起来,Gemini 2.X模型的生成跨越了模型能力相对于成本的全Pareto边界,使用户能够探索可能解决复杂代理问题的范围。
Article 49
Title@2025-07-17 (4): A Logically Consistent Chain-of-Thought Approach for Stance Detection
Title: A Logically Consistent Chain-of-Thought Approach for Stance Detection | Ein logisch konsistenter Chain-of-Thought-Ansatz zur Stance-Erkennung | 一种逻辑一致的思维链立场检测方法 2312.16054v2 |
Authors (4): Bowen Zhang, Daijun Ding, Liwen Jing, Hu Huang
Zero-shot stance detection (ZSSD) aims to detect stances toward unseen targets. Incorporating background knowledge to enhance transferability between seen and unseen targets constitutes the primary approach of ZSSD. However, these methods often struggle with a knowledge-task disconnect and lack logical consistency in their predictions. To address these issues, we introduce a novel approach named Logically Consistent Chain-of-Thought (LC-CoT) for ZSSD, which improves stance detection by ensuring relevant and logically sound knowledge extraction. LC-CoT employs a three-step process. Initially, it assesses whether supplementary external knowledge is necessary. Subsequently, it uses API calls to retrieve this knowledge, which can be processed by a separate LLM. Finally, a manual exemplar guides the LLM to infer stance categories, using an if-then logical structure to maintain relevance and logical coherence. This structured approach to eliciting background knowledge enhances the model’s capability, outperforming traditional supervised methods without relying on labeled data.
零样本立场检测(ZSSD)旨在检测对未见目标的立场。引入背景知识以增强已见目标与未见目标之间的可迁移性,是ZSSD的主要方法。然而,这些方法往往存在知识与任务脱节的问题,并且预测缺乏逻辑一致性。为了解决这些问题,我们为ZSSD提出了一种名为逻辑一致思维链(LC-CoT)的新方法,通过确保提取相关且逻辑合理的知识来改进立场检测。LC-CoT采用三步流程。首先,它评估是否需要补充外部知识。随后,它通过API调用检索这些知识,这些知识可以由另一个LLM处理。最后,用一个人工示例引导LLM推断立场类别,并使用“如果-那么”(if-then)逻辑结构来保持相关性和逻辑连贯性。这种引出背景知识的结构化方法增强了模型的能力,在不依赖标注数据的情况下优于传统的监督方法。
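The three-step LC-CoT process described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the caller-supplied callables (`needs_knowledge`, `retrieve`, `infer`), the prompt wording, and the label set FAVOR / AGAINST / NONE are all assumptions made for the example.

```python
from typing import Callable, Optional


def build_if_then_prompt(text: str, target: str, knowledge: Optional[str]) -> str:
    """Assemble a prompt whose decision rules follow an if-then structure.

    The wording and the label names are hypothetical.
    """
    lines = [f"Text: {text}", f"Target: {target}"]
    if knowledge:
        lines.append(f"Background: {knowledge}")
    lines.append("If the text supports the target, then answer FAVOR.")
    lines.append("If the text opposes the target, then answer AGAINST.")
    lines.append("If neither holds, then answer NONE.")
    return "\n".join(lines)


def lc_cot_stance(
    text: str,
    target: str,
    needs_knowledge: Callable[[str, str], bool],
    retrieve: Callable[[str], str],
    infer: Callable[[str], str],
) -> str:
    """Run the three steps with caller-supplied components (all stubs here)."""
    # Step 1: decide whether supplementary external knowledge is necessary.
    knowledge: Optional[str] = None
    if needs_knowledge(text, target):
        # Step 2: retrieve the knowledge (an API call in the abstract).
        knowledge = retrieve(target)
    # Step 3: let the LLM infer the stance from an if-then structured prompt.
    return infer(build_if_then_prompt(text, target, knowledge))
```

Passing the three components in as callables keeps the sketch independent of any particular LLM client or retrieval API, neither of which the abstract specifies.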
Article 50
Title@2025-07-17 (4): MAC-Tuning: LLM Multi-Compositional Problem Reasoning with Enhanced Knowledge Boundary Awareness
Title: MAC-Tuning: LLM Multi-Compositional Problem Reasoning with Enhanced Knowledge Boundary Awareness | MAC-Tuning: Mehrkompositionelles LLM-Problem mit verbesserter Kenntnis der Grenzen des Wissens | MAC-指导:LLM 以增进知识边界意识为由的多组问题 2504.21773v2 |
Authors (6): Junsheng Huang, Zhitao He, Yucheng Huang, Sandeep Polisetty, Qingyun Wang, May Fung
With the widespread application of large language models (LLMs), the issue of generating non-existing facts, known as hallucination, has garnered increasing attention. Previous research in enhancing LLM confidence estimation mainly focuses on the single problem setting. However, LLM awareness of its internal parameterized knowledge boundary under the more challenging multi-problem setting, which requires answering multiple problems accurately simultaneously, remains underexplored. To bridge this gap, we introduce a novel method, Multiple Answers and Confidence Stepwise Tuning (MAC-Tuning), that separates the learning of answer prediction and confidence estimation during fine-tuning on instruction data. Extensive experiments demonstrate that our method outperforms baselines by up to 25% in average precision.
随着大型语言模型(LLMs)的广泛应用,生成非现有事实(称为幻觉)的问题日益引起人们的关注。以往关于提高LLM信心估计的研究主要侧重于单一问题的设置。然而,LLM在更具挑战性的多问题设置下对其内部参数化知识界限的认识(需要同时准确地回答多种问题)仍未得到充分探讨。为了缩小这一差距,我们引入了一种新颖的方法,即多重答案和置信度逐步调整(MAC-Tuning),在对指令数据进行微调时将答案预测和信心估计的学习分开。广泛的实验表明,我们的方法在平均精确度上比基线最多高出25%。
Article 51
Title@2025-07-17 (4): SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems
Title: SEALGuard: Safeguarding the Multilingual Conversations in Southeast Asian Languages for LLM Software Systems | SEALGuard: Mehrsprachige Gespräche in südostasiatischen Sprachen für LLM-Softwaresysteme sichern | SEALGuard:为LLM软件系统维护东南亚语言多语言对话 2507.08898v3 |
Authors (4): Wenliang Shan, Michael Fu, Rui Yang, Chakkrit Tantithamthavorn
Safety alignment is critical for LLM-powered systems. While recent LLM-powered guardrail approaches such as LlamaGuard achieve high detection accuracy of unsafe inputs written in English (e.g., "How to create a bomb?"), they struggle with multilingual unsafe inputs. This limitation leaves LLM systems vulnerable to unsafe and jailbreak prompts written in low-resource languages such as those in Southeast Asia. This paper introduces SEALGuard, a multilingual guardrail designed to improve the safety alignment across diverse languages. It aims to address the multilingual safety alignment gap of existing guardrails and ensure effective filtering of unsafe and jailbreak prompts in LLM-powered systems. We adapt a general-purpose multilingual language model into a multilingual guardrail using low-rank adaptation (LoRA). We construct SEALSBench, a large-scale multilingual safety alignment dataset containing over 260,000 prompts in ten languages, including safe, unsafe, and jailbreak cases. We evaluate SEALGuard against state-of-the-art guardrails such as LlamaGuard on this benchmark. Our findings show that multilingual unsafe and jailbreak prompts substantially degrade the performance of the state-of-the-art LlamaGuard, which experiences a drop in Defense Success Rate (DSR) of 9% and 18%, respectively, compared to its performance on English-only prompts. In contrast, SEALGuard outperforms existing guardrails in detecting multilingual unsafe and jailbreak prompts, improving DSR by 48% over LlamaGuard and achieving the best DSR, precision, and F1-score. Our ablation study further reveals the contributions of adaptation strategies and model size to the overall performance of SEALGuard. We release our pre-trained model and benchmark at https://github.com/awsm-research/SEALGuard to support further research.
安全对齐对于LLM驱动的系统至关重要。虽然最近LLM驱动的护栏方法(如LlamaGuard)对英语编写的不安全输入(例如“如何制造炸弹?”)具有很高的检测准确率,但它们难以应对多语种的不安全输入。这一局限使LLM系统容易受到以低资源语言(如东南亚语言)编写的不安全提示和越狱提示的攻击。本文介绍了SEALGuard,一个旨在改善不同语言间安全对齐的多语种护栏。它旨在弥补现有护栏在多语种安全对齐方面的差距,并确保在LLM驱动的系统中有效过滤不安全提示和越狱提示。我们使用低秩适配(LoRA)将通用多语种语言模型改编为多语种护栏。
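The low-rank adaptation (LoRA) step mentioned in this abstract can be illustrated with the standard LoRA weight update, W' = W + (alpha / r) * B A, where W stays frozen and only the small matrices A and B are trained. The sketch below uses plain Python lists; the matrix sizes, values, and function names are assumptions chosen for the example and say nothing about how SEALGuard itself is configured.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_effective_weight(w, a, b, alpha, r):
    """Return W + (alpha / r) * B @ A.

    w: frozen weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    """
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Hypothetical 2x3 frozen weight with a rank-1 adapter.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
A = [[1.0, 2.0, 3.0]]
B_INIT = [[0.0], [0.0]]      # B starts at zero, so the adapter is a no-op at first
B_TRAINED = [[0.5], [0.0]]   # made-up values standing in for a trained adapter

unchanged = lora_effective_weight(W, A, B_INIT, alpha=2.0, r=1)
adapted = lora_effective_weight(W, A, B_TRAINED, alpha=2.0, r=1)
```

With B initialised to zero the adapted model starts out identical to the base model, which is why LoRA fine-tuning begins from the pre-trained behaviour.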
Article 52
Title@2025-07-17 (4): Are Knowledge and Reference in Multilingual Language Models Cross-Lingually Consistent?
Title: Are Knowledge and Reference in Multilingual Language Models Cross-Lingually Consistent? | Sind Wissen und Referenz in mehrsprachigen Sprachmodellen bereichsübergreifend konsistent? | 多语文模式中的知识和参考资料是否相互一致? 2507.12838v1 |
Authors (3): Xi Ai, Mahardika Krisna Ihsani, Min-Yen Kan
Cross-lingual consistency should be considered to assess cross-lingual transferability, maintain the factuality of the model knowledge across languages, and preserve the parity of language model performance. We are thus interested in analyzing, evaluating, and interpreting cross-lingual consistency for factual knowledge. We examine code-mixed coreferential statements that convey identical knowledge across languages to study cross-lingual knowledge consistency. We use interpretability approaches to analyze the behavior of a model in cross-lingual contexts, discovering that multilingual models show different levels of consistency, subject to language families and linguistic factors, and exhibit a bottleneck in cross-lingual consistency at a particular layer. In addition, we evaluate common strategies aimed at improving multilingual performance to observe whether these strategies can improve knowledge consistency at the same time. While knowledge is not cross-lingually consistent in many cases, code-switching training and cross-lingual word alignment objectives show the most promising results, emphasizing the noteworthiness of cross-lingual alignment supervision and code-switching training for both multilingual performance and cross-lingual consistency enhancement.
应考虑跨语文的一致性,以评估跨语文的可转让性,保持跨语文示范知识的实际情况,并维护语文示范性业绩的平等性。因此,我们有兴趣分析、评价和解释用于事实知识的跨语文一致性。我们研究不同语文之间传播相同知识的编码混合共同优惠声明,以研究跨语文知识的一致性。我们使用一些可解释性方法分析跨语文模式在跨语文背景下的行为,发现多语文模式在特定层次上表现出不同程度的一致性,取决于语言家庭、语言因素和跨语文一致性的瓶颈。此外,我们评价旨在改进多语文业绩的共同战略,以观察这些战略能否同时提高知识的一致性。虽然在许多情况下,知识不是跨语文的一致性,但代码转换培训和跨语文的词一致性目标显示最有希望的结果,强调跨语文协调监督和代码转换培训对于多语文业绩和跨语文一致性的提高都具有参考价值。
Article 53
Title@2025-07-17 (4): Emotional Support with LLM-based Empathetic Dialogue Generation
Title: Emotional Support with LLM-based Empathetic Dialogue Generation | Emotionale Unterstützung mit LLM-basiertem Empathetic Dialogue Generation | 利用基于LLM的 “ 同情对话 “ 生成的LLM “ 情感支持 2507.12820v1 |
Authors (5): Shiquan Wang, Ruiyu Fang, Zhongjiang He, Shuangyong Song, Yongxiang Li
Emotional Support Conversation (ESC) aims to provide empathetic and effective emotional assistance through dialogue, addressing the growing demand for mental health support. This paper presents our solution for the NLPCC 2025 Task 8 ESC evaluation, where we leverage large-scale language models enhanced by prompt engineering and finetuning techniques. We explore both parameter-efficient Low-Rank Adaptation and full-parameter fine-tuning strategies to improve the model’s ability to generate supportive and contextually appropriate responses. Our best model ranked second in the competition, highlighting the potential of combining LLMs with effective adaptation methods for ESC tasks. Future work will focus on further enhancing emotional understanding and response personalization to build more practical and reliable emotional support systems.
情感支持对话(ESC)旨在通过对话提供富有同理心且有效的情感援助,以满足对心理健康支持日益增长的需求。本文介绍了我们针对NLPCC 2025任务8 ESC评测的解决方案,其中我们利用经提示工程和微调技术增强的大规模语言模型。我们探索了参数高效的低秩适配(Low-Rank Adaptation)和全参数微调两种策略,以提高模型生成支持性且符合语境的回复的能力。我们的最佳模型在比赛中排名第二,凸显了将LLM与有效的适配方法相结合用于ESC任务的潜力。未来工作将侧重于进一步加强情感理解和回复个性化,以构建更实用、更可靠的情感支持系统。
Article 54
Title@2025-07-17 (4): Large Language Models’ Internal Perception of Symbolic Music
Title: Large Language Models’ Internal Perception of Symbolic Music | Die innere Wahrnehmung symbolischer Musik durch große Sprachmodelle | 大语言模型内部对符号音乐的感知 2507.12808v1 |
Authors (2): Andrew Shin, Kunitake Kaneko
Large language models (LLMs) excel at modeling relationships between strings in natural language and have shown promise in extending to other symbolic domains like coding or mathematics. However, the extent to which they implicitly model symbolic music remains underexplored. This paper investigates how LLMs represent musical concepts by generating symbolic music data from textual prompts describing combinations of genres and styles, and evaluating their utility through recognition and generation tasks. We produce a dataset of LLM-generated MIDI files without relying on explicit musical training. We then train neural networks entirely on this LLM-generated MIDI dataset and perform genre and style classification as well as melody completion, benchmarking their performance against established models. Our results demonstrate that LLMs can infer rudimentary musical structures and temporal relationships from text, highlighting both their potential to implicitly encode musical patterns and their limitations due to a lack of explicit musical context, shedding light on their generative capabilities for symbolic music.
大型语言模型(LLMS)在天然语言字符串之间的建模关系方面非常出色,并表现出向其他象征性领域,如编码或数学扩展的希望。然而,它们隐含的模拟象征性音乐的模范仍然未得到充分探讨。本文调查LLMS如何通过描述不同类型和风格组合的文字提示生成象征性音乐数据,并通过承认和生成任务来评估其效用。我们制作了LM公司生成的MIDI文件的数据集,而不必依赖明确的音乐培训。然后,我们完全用LLM公司生成的MDI数据集来培训神经网络,并进行原型和风格的分类以及旋律的完成,对照既定模型衡量其性能。我们的成果表明LLMS公司可以推导出基本的音乐结构和文本的时空关系,强调其隐含的音乐模式以及由于缺乏明确的音乐背景而导致的局限性。我们用光灯光来显示其象征音乐音乐的基因化能力。
Article 55
Title@2025-07-17 (4): MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models
Title: MCPEval: Automatic MCP-based Deep Evaluation for AI Agent Models | MCPEval: Automatische MCP-basierte Deep Evaluation für AI Agent Modelle | MCPEval:AI 代理模型的自动MCP深度评估 2507.12806v1 |
Authors (12): Zhiwei Liu, Jielin Qiu, Shiyu Wang, Jianguo Zhang, Zuxin Liu, Roshan Ram, Haolin Chen, Weiran Yao, Huan Wang, Shelby Heinecke, Silvio Savarese, Caiming Xiong
The rapid rise of Large Language Model (LLM)-based intelligent agents underscores the need for robust, scalable evaluation frameworks. Existing methods rely on static benchmarks and labor-intensive data collection, limiting practical assessment. We introduce MCPEval, an open-source Model Context Protocol (MCP)-based framework that automates end-to-end task generation and deep evaluation of LLM agents across diverse domains. MCPEval standardizes metrics, seamlessly integrates with native agent tools, and eliminates manual effort in building evaluation pipelines. Empirical results across five real-world domains show its effectiveness in revealing nuanced, domain-specific performance. We publicly release MCPEval at https://github.com/SalesforceAIResearch/MCPEval to promote reproducible and standardized LLM agent evaluation.
基于大语言模型(LLM)的智能体迅速兴起,凸显了对稳健、可扩展的评估框架的需求。现有方法依赖静态基准和劳动密集型的数据收集,限制了实际评估。我们提出MCPEval,这是一个基于模型上下文协议(MCP)的开源框架,可自动完成端到端的任务生成以及对不同领域LLM智能体的深度评估。MCPEval将评估指标标准化,与原生智能体工具无缝集成,并消除了构建评估流水线的人工工作。在五个真实世界领域的实证结果表明,它能够有效揭示细致的、领域特定的性能差异。我们在 https://github.com/SalesforceAIResearch/MCPEval 公开发布MCPEval,以促进可复现、标准化的LLM智能体评估。
Article 56
Title@2025-07-17 (4): PMKLC: Parallel Multi-Knowledge Learning-based Lossless Compression for Large-Scale Genomics Database
Title: PMKLC: Parallel Multi-Knowledge Learning-based Lossless Compression for Large-Scale Genomics Database | PMKLC: Parallele Multi-Knowledge Learning-basierte Lossless-Kompression für großformatige Genomics-Datenbank | PMKLC: 大型基因组数据库的平行多知识学习-无损失压缩 2507.12805v1 |
Authors (8): Hui Sun, Yanfeng Ding, Liping Yi, Huidong Ma, Gang Wang, Xiaoguang Liu, Cheng Zhong, Wentong Cai
Learning-based lossless compressors play a crucial role in large-scale genomic database backup, storage, transmission, and management. However, their 1) inadequate compression ratio, 2) low compression and decompression throughput, and 3) poor compression robustness limit their widespread adoption and application in both industry and academia. To solve those challenges, we propose a novel Parallel Multi-Knowledge Learning-based Compressor (PMKLC) with four crucial designs: 1) we propose an automated multi-knowledge learning-based compression framework as the compressor's backbone to enhance compression ratio and robustness; 2) we design a GPU-accelerated (s,k)-mer encoder to optimize compression throughput and computing resource usage; 3) we introduce data block partitioning and Step-wise Model Passing (SMP) mechanisms for parallel acceleration; 4) we design two compression modes, PMKLC-S and PMKLC-M, to meet complex application scenarios, where the former runs on a resource-constrained single GPU and the latter is multi-GPU accelerated. We benchmark PMKLC-S/M and 14 baselines (7 traditional and 7 learning-based) on 15 real-world datasets with different species and data sizes. Compared to baselines on the testing datasets, PMKLC-S/M achieve average compression ratio improvements of up to 73.609% and 73.480%, and average throughput improvements of up to 3.036× and 10.710×, respectively. In addition, PMKLC-S/M achieve the best robustness and competitive memory cost, indicating greater stability against datasets with different probability distribution perturbations and a strong ability to run on memory-constrained devices.
基于学习的无损压缩器在大型基因组数据库的备份、存储、传输和管理中发挥着关键作用。然而,它们 1) 压缩率不足,2) 压缩和解压缩吞吐量低,3) 压缩鲁棒性差,限制了其在工业界和学术界的广泛采用和应用。为了应对这些挑战,我们提出了一种新型的基于并行多知识学习的压缩器(Parallel Multi-Knowledge Learning-based Compressor,PMKLC),包含四项关键设计:1) 提出一个自动化的基于多知识学习的压缩框架作为压缩器的主干,以提高压缩率和鲁棒性;2) 设计GPU加速的 (s,k)-mer 编码器,以优化压缩吞吐量和计算资源使用;3) 引入数据块划分和逐步模型传递(SMP)机制以实现并行加速;4) 设计PMKLC-S和PMKLC-M两种压缩模式以满足复杂的应用场景,前者运行在资源受限的单GPU上,后者由多GPU加速。我们在15个涵盖不同物种和数据规模的真实数据集上对PMKLC-S/M和14个基线(7个传统方法和7个基于学习的方法)进行了基准测试。与基线相比,PMKLC-S/M的平均压缩率提升分别最高达73.609%和73.480%,平均吞吐量提升分别最高达3.036倍和10.710倍。此外,PMKLC-S/M还取得了最佳的鲁棒性和有竞争力的内存开销,表明其在面对不同概率分布扰动的数据集时更加稳定,并且能够在内存受限的设备上运行。
Article 57
Title@2025-07-17 (4): ReCode: Updating Code API Knowledge with Reinforcement Learning
Title: ReCode: Updating Code API Knowledge with Reinforcement Learning | ReCode: Aktualisierung von Code-API-Kenntnissen mit Verstärkungslernen | ReCode:更新法规API知识与强化学习 2506.20495v2 |
Authors (5): Haoze Wu, Yunzhi Yao, Wenhao Yu, Huajun Chen, Ningyu Zhang
Large Language Models (LLMs) exhibit remarkable code generation capabilities but falter when adapting to frequent updates in external library APIs. This critical limitation, stemming from reliance on outdated API knowledge from their training data, even with access to current documentation, impedes reliable code generation in dynamic environments. To tackle this issue, we propose ReCode (rule-based Reinforcement learning for Code Update), a novel framework that mimics human programmer adaptation to API changes. Specifically, we construct a dataset of approximately 2,000 data entries to train the LLMs to perform version migration based on updated information. Then, we introduce a modified string similarity metric for code evaluation as the reward for reinforcement learning. Our experiments demonstrate that ReCode substantially boosts LLMs’ code generation performance in dynamic API scenarios, especially on the unseen CodeUpdateArena task. Crucially, compared to supervised fine-tuning, ReCode has less impact on LLMs’ general code generation abilities. We apply ReCode to various LLMs and reinforcement learning algorithms (GRPO and DAPO), all achieving consistent improvements. Notably, after training, Qwen2.5-Coder-7B outperforms both the 32B-parameter code instruction-tuned model and the reasoning model with the same architecture. Code is available at https://github.com/zjunlp/ReCode.
大型语言模型(LLM)具有出色的代码生成能力,但在适应外部库API的频繁更新时却步履维艰。这一关键局限源于对训练数据中过时API知识的依赖,即使能够查阅最新文档也是如此,从而妨碍了动态环境中可靠的代码生成。为了解决这一问题,我们提出ReCode(基于规则的代码更新强化学习),这是一个模仿人类程序员适应API变化的新框架。具体而言,我们构建了一个约2,000条数据的数据集,用于训练LLM根据更新后的信息进行版本迁移。然后,我们引入一种改进的字符串相似度指标用于代码评估,作为强化学习的奖励。我们的实验表明,ReCode大幅提升了LLM在动态API场景中的代码生成性能,尤其是在未见过的CodeUpdateArena任务上。重要的是,与监督微调相比,ReCode对LLM通用代码生成能力的影响更小。我们将ReCode应用于多种LLM和强化学习算法(GRPO和DAPO),均取得了一致的改进。值得注意的是,经过训练后,Qwen2.5-Coder-7B的表现超过了32B参数的代码指令微调模型以及相同架构的推理模型。代码见 https://github.com/zjunlp/ReCode。
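To make the idea of a string-similarity reward concrete, here is a rough sketch. The abstract does not say how the ReCode metric is modified, so everything below is an assumption for illustration: the function name, the use of `difflib.SequenceMatcher`, and the rule that unparsable Python receives zero reward are stand-ins, not the paper's metric.

```python
import ast
import difflib


def similarity_reward(generated: str, reference: str) -> float:
    """Return a reward in [0, 1] for generated code against a reference.

    Hypothetical rule set:
    - code that does not parse as Python gets 0.0;
    - otherwise the reward is the SequenceMatcher similarity ratio.
    """
    try:
        ast.parse(generated)
    except SyntaxError:
        return 0.0
    matcher = difflib.SequenceMatcher(None, generated.strip(), reference.strip())
    return matcher.ratio()


# Made-up migration example: the reference uses the updated call.
reference = "x = np.add.reduce(a)"
exact = similarity_reward("x = np.add.reduce(a)", reference)
partial = similarity_reward("x = np.sum(a)", reference)
broken = similarity_reward("x = np.add.reduce(a", reference)
```

A dense reward of this kind gives partial credit to near-miss migrations, which a binary pass/fail signal would not.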
Article 58
Title@2025-07-17 (4): MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment
Title: MPO: An Efficient Post-Processing Framework for Mixing Diverse Preference Alignment | MPO: Ein effizientes Post-Processing-Framework zum Mischen unterschiedlicher Präferenzen | MPO: 混合多种优惠协调的高效处理后框架 2502.18699v2 |
Authors (5): Tianze Wang, Dongnan Gui, Yifan Hu, Shuhang Lin, Linjun Zhang
Reinforcement Learning from Human Feedback (RLHF) has shown promise in aligning large language models (LLMs). Yet its reliance on a singular reward model often overlooks the diversity of human preferences. Recent approaches address this limitation by leveraging multi-dimensional feedback to fine-tune corresponding reward models and train LLMs using reinforcement learning. However, the process is costly and unstable, especially given the competing and heterogeneous nature of human preferences. In this paper, we propose Mixing Preference Optimization (MPO), a post-processing framework for aggregating single-objective policies as an alternative to both multi-objective RLHF (MORLHF) and MaxMin-RLHF. MPO avoids alignment from scratch. Instead, it log-linearly combines existing policies into a unified one with the weight of each policy computed via a batch stochastic mirror descent. Empirical results demonstrate that MPO achieves balanced performance across diverse preferences, outperforming or matching existing models with significantly reduced computational costs.
从人类反馈中强化学习(RLHF)在调整大型语言模式方面显示了希望。然而,它依赖单一奖励模式往往忽略了人类偏好的多样性。最近的做法通过利用多维反馈来微调相应的奖赏模式,并利用强化学习来培训LMS来应对这一局限性。然而,这一过程成本高且不稳定,特别是考虑到人类偏好的竞争性和多样性性质。在本文件中,我们提议混合优先优化(MPO),这是一个综合单一目标政策的后处理框架,作为多目标RLHF(MORLHF)和MaxMin-RLHF(MLHF)的替代方案。MPO避免从零开始调整。相反,它将现有政策合并成一个统一的政策,与通过分批相近的镜像下降计算出来的每项政策的权重。经验表明,MPO在各种偏好中取得了平衡的业绩,业绩优于或匹配现有模型,计算成本显著降低。
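The log-linear combination of policies mentioned in the MPO abstract can be sketched over a small discrete set of candidate responses: the mixed policy is proportional to the product of each policy's probability raised to its weight. This is only an illustration under assumptions; the weights are passed in by hand here, whereas the abstract says they are computed via batch stochastic mirror descent, which this sketch does not implement.

```python
import math


def log_linear_mix(policies, weights):
    """Combine policies as pi(y) proportional to prod_i pi_i(y) ** w_i.

    policies: list of probability lists over the same candidates,
              all entries strictly positive.
    weights:  one non-negative weight per policy (hypothetical values).
    """
    n = len(policies[0])
    log_scores = [
        sum(w * math.log(p[j]) for p, w in zip(policies, weights))
        for j in range(n)
    ]
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores)
    unnormalised = [math.exp(s - top) for s in log_scores]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]


# Two made-up single-objective policies over two candidate responses.
helpful = [0.5, 0.5]
harmless = [0.9, 0.1]
mixed = log_linear_mix([helpful, harmless], [0.5, 0.5])
```

With equal weights the two candidates score sqrt(0.45) and sqrt(0.05), a 3:1 ratio, so `mixed` comes out at roughly 0.75 and 0.25.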
Article 59
Title@2025-07-17 (4): Learning Robust Negation Text Representations
Title: Learning Robust Negation Text Representations | Robuste Negations-Textdarstellungen lernen | 学习强力否定文本代表 2507.12782v1 |
Authors (4): Thinh Hung Truong, Karin Verspoor, Trevor Cohn, Timothy Baldwin
Despite rapid adoption of autoregressive large language models, smaller text encoders still play an important role in text understanding tasks that require rich contextualized representations. Negation is an important semantic function that is still not properly captured by such methods, affecting many downstream applications relying on text embeddings. We propose a strategy to improve negation robustness of text encoders, by distilling data from large language models using diverse patterns of negation and hedging. We adopt a standard contrastive learning strategy to finetune a strong BERT-based model, and observe large improvement in negation understanding capabilities while maintaining competitive performance on general benchmarks. In addition, we also show that our method can be adapted to LLMs, leading to improved performance on negation benchmarks.
尽管迅速采用了自动递减的大型语言模型,但较小的文本编码器在需要丰富的背景说明的文本理解任务中仍然发挥着重要作用。否定是一个重要的语义功能,仍然没有被这种方法恰当地抓住,影响到许多依赖文本嵌入的下游应用。我们提出了一项战略,通过利用不同的否定和套期模式,从大型语言模型中提取数据,提高文字编码器的可靠性,从而改进对否定和套期保值的可靠性。我们采取了标准的对比学习战略,以微调一个强有力的BERT模型,并观察到否定理解能力方面的重大改进,同时保持一般基准的竞争性性能。此外,我们还表明,我们的方法可以适用于LLMS,从而改进否定基准的绩效。
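A contrastive objective for negation robustness can be sketched as follows, with a paraphrase as the positive and a negated sentence as the negative. The abstract only says that a standard contrastive learning strategy is used, so the InfoNCE form, the temperature value, and the toy 2-dimensional embeddings below are assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.05):
    """InfoNCE loss: -log softmax of the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    top = max(logits)
    log_z = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_z - logits[0]


# Toy embeddings: the paraphrase sits near the anchor, the negation far away.
anchor = [1.0, 0.0]
paraphrase = [1.0, 0.0]
negated = [0.0, 1.0]

good_loss = info_nce(anchor, paraphrase, [negated])
bad_loss = info_nce(anchor, negated, [paraphrase])
```

The loss is near zero when the encoder already separates the negated sentence from the anchor, and large when the negated sentence is embedded closer than the paraphrase.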
Article 60
Title@2025-07-17 (4): A Comprehensive Survey of Electronic Health Record Modeling: From Deep Learning Approaches to Large Language Models
Title: A Comprehensive Survey of Electronic Health Record Modeling: From Deep Learning Approaches to Large Language Models | Eine umfassende Umfrage zur elektronischen Gesundheitsdatenmodellierung: Von Deep Learning Ansätzen bis hin zu großen Sprachmodellen | 《电子健康记录模型综合调查:从深学习方法到大语言模式》 2507.12774v1 |
Authors (5): Weijieying Ren, Jingxi Zhu, Zehao Liu, Tianxiang Zhao, Vasant Honavar
Artificial intelligence (AI) has demonstrated significant potential in transforming healthcare through the analysis and modeling of electronic health records (EHRs). However, the inherent heterogeneity, temporal irregularity, and domain-specific nature of EHR data present unique challenges that differ fundamentally from those in vision and natural language tasks. This survey offers a comprehensive overview of recent advancements at the intersection of deep learning, large language models (LLMs), and EHR modeling. We introduce a unified taxonomy that spans five key design dimensions: data-centric approaches, neural architecture design, learning-focused strategies, multimodal learning, and LLM-based modeling systems. Within each dimension, we review representative methods addressing data quality enhancement, structural and temporal representation, self-supervised learning, and integration with clinical knowledge. We further highlight emerging trends such as foundation models, LLM-driven clinical agents, and EHR-to-text translation for downstream reasoning. Finally, we discuss open challenges in benchmarking, explainability, clinical alignment, and generalization across diverse clinical settings. This survey aims to provide a structured roadmap for advancing AI-driven EHR modeling and clinical decision support. For a comprehensive list of EHR-related methods, kindly refer to https://survey-on-tabular-data.github.io/.
人工智能(AI)通过电子健康记录(EHRs)的分析和建模,展示了在改变保健方面的巨大潜力;然而,EHR数据固有的异质性、时间性不规则性和具体领域性质,提出了与愿景和自然语言任务截然不同的独特挑战;这项调查全面概述了在深层次学习、大语言模型(LLMS)和EHR建模等交叉方面的最新进展;我们引入了涵盖五个关键设计层面的统一分类:以数据为中心的方法、神经结构设计、以学习为重点的战略、多式联运学习和以LLM为基础的建模系统。我们在每个层面审查涉及数据质量提高、结构和时间代表性、自我监督学习和与临床知识整合的代表性方法。我们进一步强调了基础模型、LLMM驱动的临床代理和下游推理的EHR对文本翻译等新出现的趋势。最后,我们讨论了不同临床环境在基准、解释性、临床调整和一般化方面的公开挑战。这次调查旨在提供一个结构化的路线图,用于推进AIHR驱动的EHR建模/临床决定支持。
Article 61
Title@2025-07-17 (4): Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback
Title: Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback | Critique-GRPO: LLM-Schlussfolgern mit natürlicher Sprache und numerischem Feedback verbessern | Critique-GRPO: 利用自然语言和数值反馈提升LLM推理 2506.03106v4 |
Authors (7): Xiaoying Zhang, Hao Sun, Yipeng Zhang, Kaituo Feng, Chaochao Lu, Chao Yang, Helen Meng
Recent advances in reinforcement learning (RL) with numerical feedback, such as scalar rewards, have significantly enhanced the complex reasoning capabilities of large language models (LLMs). Despite this success, we identify three key challenges encountered by RL with solely numerical feedback: performance plateaus, limited effectiveness of spontaneous self-reflection, and persistent failures. We then demonstrate that RL-finetuned models, even after exhibiting performance plateaus, can generate correct refinements on persistently failed problems by leveraging natural language feedback in the form of critiques. Building on this insight, we propose Critique-GRPO, an online RL framework that integrates both natural language and numerical feedback for effective policy optimization. Critique-GRPO enables LLMs to learn from initial responses and critique-guided self-refinements simultaneously while maintaining exploration. Additionally, we employ a shaping function to amplify learning from correct, especially unfamiliar, refinements and penalize incorrect ones. Extensive experiments with Qwen2.5-7B-Base, Qwen2.5-Math-7B-Base, and Qwen3-8B demonstrate that Critique-GRPO consistently outperforms supervised learning and RL-based fine-tuning methods across eight challenging mathematical, STEM, and general reasoning tasks, improving average pass@1 scores by approximately 4.4% and 3.8% on Qwen2.5-7B-Base and Qwen3-8B, respectively. Notably, Critique-GRPO enables effective self-improvement through self-critiquing and weak-to-strong generalization, achieving consistent gains over GRPO, such as 16.7% and 10.0% pass@1 improvements on AIME 2024, respectively.
近来,基于数值反馈(如标量奖励)的强化学习(RL)显著增强了大型语言模型(LLM)的复杂推理能力。尽管取得了这一成功,我们发现仅使用数值反馈的RL面临三个关键挑战:性能平台期、自发自我反思的效果有限以及持续性失败。随后我们证明,即使在出现性能平台期之后,经RL微调的模型仍可借助批评形式的自然语言反馈,对持续失败的问题生成正确的改进。基于这一洞察,我们提出Critique-GRPO,这是一个将自然语言反馈与数值反馈相结合以实现有效策略优化的在线RL框架。Critique-GRPO使LLM能够在保持探索的同时,同时从初始回答和批评引导的自我改进中学习。此外,我们采用一个塑形函数来放大从正确的(尤其是不熟悉的)改进中的学习,并惩罚错误的改进。在Qwen2.5-7B-Base、Qwen2.5-Math-7B-Base和Qwen3-8B上的大量实验表明,Critique-GRPO在八项具有挑战性的数学、STEM和通用推理任务上始终优于监督学习和基于RL的微调方法,在Qwen2.5-7B-Base和Qwen3-8B上分别将平均pass@1分数提高约4.4%和3.8%。值得注意的是,Critique-GRPO通过自我批评和弱到强泛化实现了有效的自我提升,相对GRPO取得了一致的增益,例如在AIME 2024上pass@1分别提升16.7%和10.0%。
Article 62
Title@2025-07-17 (4): Synergy: End-to-end Concept Model
Title: Synergy: End-to-end Concept Model | Synergie: Ende-zu-Ende-Konzeptmodell | 协同增效:端到端概念模型 2507.12769v1 |
Authors (2): Keli Zheng, Zerong Xie
In this paper, we present Synergy, a language model that bridges different levels of abstraction in an end-to-end fashion through a learned routing mechanism. Focusing on low-level linguistic abstraction, we trained our model as a byte-level language model. Our model spontaneously learns to tokenize bytes, producing fewer concept tokens than Byte-level Byte Pair Encoder (BBPE) tokenizers while keeping comparable performance. By comparing with Llama3, we observed an advantage of Synergy under the same model scale and training dataset size. Further studies show that the middle part (the higher abstraction part) of our model performs better when positional encodings are removed, suggesting the emergence of position-independent concepts. These findings demonstrate the feasibility of tokenizer-free architectures, paving the way for more robust and flexible pipelines.
在本文中,我们展示了“协同”这一语言模型,它通过一个学习的路线机制,将不同程度的抽象从端到端连接起来。侧重于低层次的语言抽象,我们把我们的模型训练成一个字节语言模型。我们的模型自发地学会象征字节,产生的概念符号比Bytele Byte Pair Encoder(BBPE)的象征器少,同时保持类似的性能。我们通过比较Llama3,发现在相同的模型规模和培训数据集大小下,协同的优势。进一步的研究表明,当去除定位编码时,我们模型的中间部分(较高抽象部分)表现更好,这表明了位置独立概念的出现。这些发现显示了无代谢器结构的可行性,为更有力和灵活的管道铺平了道路铺平了道路。
Article 63
Title@2025-07-17 (4): VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents
Title: VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents | VIDEE: Visuelle und interaktive Zerlegung, Ausführung und Auswertung von Textanalysen mit intelligenten Agenten | VIDEE: 基于智能体的文本分析的可视化交互式分解、执行与评估 2506.21582v2 |
Authors (6): Sam Yu-Te Lee, Chengyang Ji, Shicheng Wen, Lifu Huang, Dongyu Liu, Kwan-Liu Ma
Text analytics has traditionally required specialized knowledge in Natural Language Processing (NLP) or text analysis, which presents a barrier for entry-level analysts. Recent advances in large language models (LLMs) have changed the landscape of NLP by enabling more accessible and automated text analysis (e.g., topic detection, summarization, and information extraction). We introduce VIDEE, a system that helps entry-level data analysts conduct advanced text analytics with intelligent agents. VIDEE instantiates a human-agent collaboration workflow consisting of three stages: (1) Decomposition, which incorporates a human-in-the-loop Monte-Carlo Tree Search algorithm to support generative reasoning with human feedback, (2) Execution, which generates an executable text analytics pipeline, and (3) Evaluation, which integrates LLM-based evaluation and visualizations to support user validation of execution results. We conduct two quantitative experiments to evaluate VIDEE’s effectiveness and analyze common agent errors. A user study involving participants with varying levels of NLP and text analytics experience (from none to expert) demonstrates the system’s usability and reveals distinct user behavior patterns. The findings identify design implications for human-agent collaboration, validate the practical utility of VIDEE for non-expert users, and inform future improvements to intelligent text analytics systems.
文本分析历来需要自然语言处理(NLP)或文本分析方面的专业知识,这给入门级分析师造成了障碍。大语言模型(LLM)的最新进展改变了 NLP 的格局,使文本分析(例如主题检测、摘要、信息抽取等)更易上手、更加自动化。我们提出 VIDEE,这是一个支持入门级数据分析师借助智能体开展高级文本分析的系统。VIDEE 实例化了一个由三个阶段组成的人机协作工作流:(1)分解,采用人在回路的蒙特卡洛树搜索算法,以支持结合人类反馈的生成式推理;(2)执行,生成可执行的文本分析流水线;(3)评估,整合基于 LLM 的评估与可视化,以支持用户验证执行结果。我们进行了两项定量实验,以评估 VIDEE 的有效性并分析常见的智能体错误。一项用户研究的参与者在 NLP 和文本分析方面的经验水平各不相同(从毫无经验到专家),该研究证明了系统的可用性,并揭示了不同的用户行为模式。研究结果指出了对人机协作的设计启示,验证了 VIDEE 对非专家用户的实际效用,并为智能文本分析系统的未来改进提供了参考。
Article 64
Title@2025-07-17 (4): Logit Arithmetic Elicits Long Reasoning Capabilities Without Training
Title: Logit Arithmetic Elicits Long Reasoning Capabilities Without Training | Logit Arithmetische Elizite lange mit Gründen verbundene Fähigkeiten ohne Training | 未经培训的逻辑 2507.12759v1 |
Authors (8): Yunxiang Zhang, Muhammad Khalifa, Lechen Zhang, Xin Liu, Ayoung Lee, Xinliang Frederick Zhang, Farima Fatahi Bayat, Lu Wang
Large reasoning models (LRMs) can do complex reasoning via long chain-of-thought (CoT) involving cognitive strategies such as backtracking and self-correction. Recent studies suggest that some models inherently possess these long reasoning abilities, which may be unlocked via extra training. Our work first investigates whether we can elicit such behavior without any training. To this end, we propose a decoding-time approach, ThinkLogit, which utilizes logits arithmetic (Liu et al., 2024) to tune a target large LM for long reasoning using a substantially smaller model as guider. We then show that we can further boost performance by training the guider model with preference optimization over correct/incorrect reasoning pairs sampled from both the target and guider model – a setup we refer to as ThinkLogit-DPO. Our experiments demonstrate that ThinkLogit and ThinkLogit-DPO achieve a relative improvement in pass@1 by 26% and 29%, respectively, over four mathematical datasets using the Qwen2.5-32B when guided by R1-Distill-Qwen-1.5B – a model 21x smaller. Lastly, we show that ThinkLogit can transfer long reasoning skills acquired through reinforcement learning, improving pass@1 by 13% relative compared to the Qwen2.5-32B base model. Our work presents a computationally-efficient method to elicit long reasoning in large models with minimal or no additional training.
大型推理模型(LRM)可以通过长思维链(CoT)进行复杂推理,其中涉及回溯和自我纠正等认知策略。最近的研究表明,一些模型本身就具备这种长推理能力,可以通过额外训练加以释放。我们的工作首先探究能否在不进行任何训练的情况下引出这种行为。为此,我们提出一种解码时方法 ThinkLogit,它利用 logits 算术(Liu 等人,2024),以一个小得多的模型作为引导者,调整大型目标语言模型以进行长推理。随后我们表明,通过对从目标模型和引导模型中采样的正确/错误推理对进行偏好优化来训练引导模型,可以进一步提升性能,我们将这一设置称为 ThinkLogit-DPO。实验表明,在 Qwen2.5-32B 由 R1-Distill-Qwen-1.5B(一个小 21 倍的模型)引导时,ThinkLogit 和 ThinkLogit-DPO 在四个数学数据集上的 pass@1 分别取得了 26% 和 29% 的相对提升。最后,我们表明 ThinkLogit 可以迁移通过强化学习获得的长推理技能,与 Qwen2.5-32B 基础模型相比,pass@1 相对提升 13%。我们的工作提出了一种计算高效的方法,只需极少或无需额外训练即可在大型模型中引出长推理。
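The decoding-time idea in this abstract can be illustrated with a minimal sketch. It assumes a proxy-tuning-style combination in the spirit of the cited logit arithmetic work: the large target model's next-token logits are shifted by the difference between a small reasoning guider and a small non-reasoning counterpart. The function names, the `alpha` weight, the use of a separate small base model, and the toy logits are all illustrative assumptions, not details confirmed by the abstract.

```python
import math


def softmax(logits):
    """Convert a list of logits into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_logits(target, guider_tuned, guider_base, alpha=1.0):
    """Shift the target model's logits by the guider's learned offset.

    Assumed form: target + alpha * (guider_tuned - guider_base).
    All three inputs are logits over the same vocabulary.
    """
    return [t + alpha * (g - b) for t, g, b in zip(target, guider_tuned, guider_base)]


# Toy next-token logits over a 3-token vocabulary (hypothetical values).
target = [2.0, 1.0, 0.5]        # large target model
guider_tuned = [0.2, 1.5, 0.1]  # small guider with long-reasoning behavior
guider_base = [0.3, 0.4, 0.2]   # small counterpart without that behavior

combined = combine_logits(target, guider_tuned, guider_base)
probs = softmax(combined)
next_token = max(range(len(probs)), key=probs.__getitem__)
```

With `alpha=0` the combination reduces to the target model's own distribution, so the weight acts as a dial on how strongly the guider steers decoding.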
Article 65
Title@2025-07-17 (4): Strategy Adaptation in Large Language Model Werewolf Agents
Title: Strategy Adaptation in Large Language Model Werewolf Agents | Strategieanpassung im großen Sprachmodell Werwolf-Agenten | 大语言示范狼人代理物的适应战略 2507.12732v1 |
Authors (3): Fuya Nakamori, Yin Jou Huang, Fei Cheng
This study proposes a method to improve the performance of Werewolf agents by switching between predefined strategies based on the attitudes of other players and the context of conversations. While prior work on Werewolf agents using prompt engineering has employed methods in which effective strategies are implicitly defined, these methods cannot adapt to changing situations. In this research, we propose a method that explicitly selects an appropriate strategy based on the game context and the estimated roles of other players. We compare the strategy-adaptation Werewolf agents with baseline agents using implicit or fixed strategies and verify the effectiveness of our proposed method.
这项研究提出了一种方法,通过将基于其他参与者的态度和谈话背景的预先确定的战略转换为不同的战略来改善狼人代理人的绩效。虽然狼人代理人以前采用迅速工程方法的工作采用了一些方法,其中暗含了有效战略的定义,但无法适应不断变化的情况。在这项研究中,我们提出了一种方法,根据游戏背景和其他参与者的估计作用明确选择适当的战略。我们用隐含或固定的战略将战略适应狼人代理人与基线代理人进行比较,并核查我们拟议方法的有效性。
Article 66
Title@2025-07-17 (4): TransEvalnia: Reasoning-based Evaluation and Ranking of Translations
Title: TransEvalnia: Reasoning-based Evaluation and Ranking of Translations | TransEvalnia: Reasoning-based Evaluation und Ranking von Übersetzungen | 过年:基于理由的评价和笔译的排名 2507.12724v1 |
Authors (3): Richard Sproat, Tianyu Zhao, Llion Jones
We present TransEvalnia, a prompting-based translation evaluation and ranking system that uses reasoning in performing its evaluations and ranking. This system presents fine-grained evaluations based on a subset of the Multidimensional Quality Metrics (https://themqm.org/), returns an assessment of which translation it deems the best, and provides numerical scores for the various dimensions and for the overall translation. We show that TransEvalnia performs as well as or better than the state-of-the-art MT-Ranker (Moosa et al., 2024) on our own English-Japanese data as well as several language pairs from various WMT shared tasks. Using Anthropic’s Claude-3.5-Sonnet and Qwen-2.5-72B-Instruct as the evaluation LLMs, we show that the evaluations returned are deemed highly acceptable to human raters, and that the scores assigned to the translations by Sonnet, as well as other LLMs, correlate well with scores assigned by the human raters. We also note the sensitivity of our system – as well as MT-Ranker – to the order in which the translations are presented, and we propose methods to address this position bias. All data, including the system’s evaluation and reasoning and the human assessments, as well as the code, are released.
我们提出 TransEvalnia,这是一个基于提示的翻译评估与排序系统,在进行评估和排序时运用推理。该系统基于多维质量指标(https://themqm.org/)的一个子集给出细粒度评估,返回它认为哪个译文最佳的判断,并为各个维度以及整体译文提供数值评分。我们表明,在我们自己的英日数据以及来自多个 WMT 共享任务的若干语言对上,TransEvalnia 的表现与最先进的 MT-Ranker(Moosa 等人,2024)相当或更好。以 Anthropic 的 Claude-3.5-Sonnet 和 Qwen-2.5-72B-Instruct 作为评估 LLM,我们表明,系统返回的评估被人工评分者认为高度可接受,并且 Sonnet 及其他 LLM 给译文的评分与人工评分者的评分有良好的相关性。我们还注意到,我们的系统以及 MT-Ranker 对译文呈现的顺序较为敏感,并提出了应对这种位置偏差的方法。所有数据(包括系统的评估与推理、人工评估)以及代码均已发布。
Article 67
Title@2025-07-17 (4): FLEXITOKENS: Flexible Tokenization for Evolving Language Models
Title: FLEXITOKENS: Flexible Tokenization for Evolving Language Models | FLEXITOKENS: Flexible Tokenisierung für sich entwickelnde Sprachmodelle | FLEXITOKENS: 不断演变的语言模式灵活化 2507.12720v1 |
Authors (3): Abraham Toluase Owodunni, Orevaoghene Ahia, Sachin Kumar
Language models (LMs) are challenging to adapt to new data distributions by simple finetuning. This is due to the rigidity of their subword tokenizers, which typically remain unchanged during adaptation. This inflexibility often leads to inefficient tokenization, causing overfragmentation of out-of-distribution domains, unseen languages, or scripts. In this work, we develop byte-level LMs with learnable tokenizers to make tokenization adaptive. Our models include a submodule that learns to predict boundaries within the input byte sequence, encoding it into variable-length segments. Existing tokenizer-free methods train this boundary predictor using an auxiliary loss that enforces a fixed compression rate across the training corpus, introducing a new kind of rigidity. We propose FLEXITOKENS, a simplified training objective that enables significantly greater flexibility during adaptation. Evaluating across multiple multilingual benchmarks, morphologically diverse tasks, and domains, we demonstrate that FLEXITOKENS consistently reduces token over-fragmentation and achieves up to 10% improvements on downstream task performance compared to subword and other gradient-based tokenizers. Code and data for our experiments will be released at https://github.com/owos/flexitokens
语言模型(LM)难以通过简单的微调来适应新的数据分布。这是因为其子词分词器较为僵化,在适应过程中通常保持不变。这种不灵活性往往导致分词效率低下,使分布外领域、未见过的语言或文字被过度切分。在这项工作中,我们开发了带有可学习分词器的字节级语言模型,使分词具有自适应性。我们的模型包含一个子模块,它学习预测输入字节序列中的边界,并将其编码为可变长度的片段。现有的无分词器方法使用辅助损失来训练这个边界预测器,该损失在整个训练语料上强制执行固定的压缩率,从而引入了一种新的僵化。我们提出 FLEXITOKENS,这是一个简化的训练目标,可在适应过程中带来大得多的灵活性。通过在多个多语言基准、形态多样的任务和领域上进行评估,我们证明,与子词分词器和其他基于梯度的分词器相比,FLEXITOKENS 能够持续减少词元的过度切分,并在下游任务性能上取得最高 10% 的提升。实验的代码和数据将发布于 https://github.com/owos/flexitokens
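To make the notions of boundary prediction and compression rate concrete, here is a minimal sketch. It assumes the boundary predictor emits one probability per input byte and that a segment ends wherever that probability crosses a threshold; the probabilities, the threshold, and the function names are hypothetical and are not taken from the FLEXITOKENS code.

```python
def segment_bytes(data, boundary_probs, threshold=0.5):
    """Split a byte string into variable-length segments.

    boundary_probs[i] is the assumed probability that a segment ends
    right after byte i. Any trailing bytes form a final segment.
    """
    if len(data) != len(boundary_probs):
        raise ValueError("need exactly one boundary probability per byte")
    segments = []
    start = 0
    for i, prob in enumerate(boundary_probs):
        if prob >= threshold:
            segments.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        segments.append(data[start:])
    return segments


def compression_rate(data, segments):
    """Average number of bytes per segment (0.0 for empty input)."""
    return len(data) / len(segments) if segments else 0.0


# Hypothetical predictor output for the 8 bytes of b"low-cost".
data = b"low-cost"
probs = [0.1, 0.2, 0.9, 0.8, 0.1, 0.1, 0.2, 0.95]
segments = segment_bytes(data, probs)
rate = compression_rate(data, segments)
```

An auxiliary loss that pins `rate` to one fixed value across the whole corpus is the rigidity the abstract describes; a flexible objective would let this ratio vary with the domain or script.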
Article 68
Title@2025-07-17 (4): BEARCUBS: A benchmark for computer-using web agents
Title: BEARCUBS: A benchmark for computer-using web agents | BEARCUBS: Benchmark für computergestützte Web-Agenten | BEARCUBS:计算机使用网络代理器的基准 2503.07919v2 |
Authors (6): Yixiao Song, Katherine Thai, Chau Minh Pham, Yapei Chang, Mazin Nadaf, Mohit Iyyer
Modern web agents possess computer use abilities that allow them to interact with webpages by sending commands to a virtual keyboard and mouse. While such agents have considerable potential to assist human users with complex tasks, evaluating their capabilities in real-world settings poses a major challenge. To this end, we introduce BEARCUBS, a “small but mighty” benchmark of 111 information-seeking questions designed to evaluate a web agent’s ability to search, browse, and identify factual information from the web. Unlike prior web agent benchmarks, solving BEARCUBS requires (1) accessing live web content rather than synthetic or simulated pages, which captures the unpredictability of real-world web interactions; and (2) performing a broad range of multimodal interactions (e.g., video understanding, 3D navigation) that cannot be bypassed via text-based workarounds. Each question in BEARCUBS has a corresponding short, unambiguous answer and a human-validated browsing trajectory, allowing for transparent evaluation of agent performance and strategies. A human study confirms that BEARCUBS questions are solvable but non-trivial (84.7% human accuracy), revealing domain knowledge gaps and overlooked details as common failure points. By contrast, state-of-the-art computer-using agents underperform, with the best-scoring system (OpenAI’s Operator) reaching only 23.4% accuracy. These results highlight critical areas for improvement, including reliable source selection and more powerful multimodal capabilities. To facilitate future research, BEARCUBS will be updated periodically to replace invalid or contaminated questions, keeping the benchmark fresh for future generations of web agents.
现代网络代理拥有计算机使用能力,使其能够通过向虚拟键盘和鼠标发送指令与网页进行互动。虽然这些代理具有协助人类用户完成复杂任务的巨大潜力,但评估其在现实世界环境中的能力是一项重大挑战。为此,我们引入了“小型但强大的”BEARCUBS,这是111个信息查询问题的一个“小型但强大的”基准,旨在评价网络代理商搜索、浏览和识别网上事实信息的能力。与先前的网络代理商基准不同,解决BEARCUBS需要(1) 访问现场网络内容,而不是合成或模拟网页,这可以捕捉真实世界网络互动的不可预测性;以及(2) 进行广泛的多式联运互动(例如视频理解、3D导航),这是无法通过基于文本的变通办法绕过。 BERCUBS的每个问题都有相应的短、明确答案和具有人性价值的浏览轨迹,能够透明地评估代理商的绩效和战略。 一项人类研究证实,BEARCBS问题需要保持可调的但非初始性(84. 7 % 人类网络互动 ) , 揭示域域域域域域域域域域域信息差距差距, 以及操作操作者将在未来的精确性数据更新。
Article 69
Title@2025-07-17 (4): Synthesizing Privacy-Preserving Text Data via Finetuning without Finetuning Billion-Scale LLMs
Title: Synthesizing Privacy-Preserving Text Data via Finetuning without Finetuning Billion-Scale LLMs | Synthesizing Privacy-Preserving Text Data via Finetuning ohne Finetuning Billion-Scale LLMs | 通过不作十亿规模的微调微调的微调合成保护隐私文本数据 2503.12347v2 |
Authors (5): Bowen Tan, Zheng Xu, Eric Xing, Zhiting Hu, Shanshan Wu
Synthetic data offers a promising path to train models while preserving data privacy. Differentially private (DP) finetuning of large language models (LLMs) as data generators is effective but impractical when computation resources are limited. Meanwhile, prompt-based methods such as private evolution depend heavily on manual prompts and use private information ineffectively in their iterative data selection process. To overcome these limitations, we propose CTCL (Data Synthesis with ConTrollability and CLustering), a novel framework for generating privacy-preserving synthetic data without extensive prompt engineering or billion-scale LLM finetuning. CTCL pretrains a lightweight 140M conditional generator and a clustering-based topic model on large-scale public data. To further adapt to the private domain, the generator is DP finetuned on private data for fine-grained textual information, while the topic model extracts a DP histogram representing distributional information. The DP generator then samples according to the DP histogram to synthesize a desired number of data examples. Evaluation across five diverse domains demonstrates the effectiveness of our framework, particularly in the strong privacy regime. Systematic ablation validates the design of each framework component and highlights the scalability of our approach.
合成数据为在保护数据隐私的同时培训模型提供了一条充满希望的道路。由于数据生成器有效,对大型语言模型(LLMs)进行有区别的私人(DP)微调是有效的,但在计算资源有限时是不切实际的。与此同时,私人进化等快速方法严重依赖手动提示,在迭代数据选择过程中没有有效地使用私人信息。为克服这些限制,我们提议了CTL(带有可移动性和CLustering的数据合成数据数据数据综合与可移动性和CLustering),这是一个用于生成隐私保护合成数据的新框架,而没有广泛的迅速工程或10亿级LLMM微调。CTL预设了一个轻量的140M有条件生成器和大规模公共数据基于集群的专题模型。为了进一步适应私人领域,该生成器对微细缩缩微文本信息的私人数据进行了DP作了微幅调整。主题模型提取了代表分发信息的DP直方图。然后,DP生成器根据DP直方图样本,以综合所需数据实例。对五个不同领域的评估显示了我们框架的有效性,特别是在强的隐私制度中。系统化的图像验证了每个框架的每个框架的可变性度。
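The histogram step described above can be sketched as follows. This is an illustration under stated assumptions, not the CTCL implementation: each private record is assumed to fall into exactly one topic cluster, Gaussian noise with a caller-supplied `sigma` is added to the cluster counts, and the noisy histogram then decides how many synthetic examples to request per cluster. Calibrating `sigma` to a target privacy budget is deliberately left out, and all names are hypothetical.

```python
import random


def dp_histogram(cluster_ids, num_clusters, sigma, rng):
    """Noisy, normalized histogram of cluster assignments.

    Gaussian noise is added to each count, negative values are clipped
    to zero, and the result is normalized to sum to 1.
    """
    counts = [0] * num_clusters
    for cluster in cluster_ids:
        counts[cluster] += 1
    noisy = [max(0.0, count + rng.gauss(0.0, sigma)) for count in counts]
    total = sum(noisy)
    if total == 0.0:
        # Every noisy count was clipped: fall back to a uniform histogram.
        return [1.0 / num_clusters] * num_clusters
    return [value / total for value in noisy]


def sample_clusters(histogram, num_samples, rng):
    """Draw the cluster label that conditions each synthetic example."""
    return rng.choices(range(len(histogram)), weights=histogram, k=num_samples)


rng = random.Random(0)
private_clusters = [0, 0, 0, 1, 1, 2]  # toy cluster assignments
histogram = dp_histogram(private_clusters, num_clusters=3, sigma=0.5, rng=rng)
conditions = sample_clusters(histogram, num_samples=10, rng=rng)
```

Each entry of `conditions` would then be passed to the conditional generator as the topic to write about, so the synthetic set mirrors the private topic distribution without reading the private counts exactly.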
Article 70
Title@2025-07-17 (4): GUI Test Migration via Abstraction and Concretization
Title: GUI Test Migration via Abstraction and Concretization | GUI-Test-Migration über Abstraktion und Konkretisierung | GUI 通过抽象和简明化测试移民 2409.05028v2 |
Authors (7): Yakun Zhang, Chen Liu, Xiaofei Xie, Yun Lin, Jin Song Dong, Dan Hao, Lu Zhang
GUI test migration aims to produce test cases with events and assertions to test specific functionalities of a target app. Existing migration approaches typically focus on the widget-mapping paradigm that maps widgets from source apps to target apps. However, since different apps may implement the same functionality in different ways, direct mapping may result in incomplete or buggy test cases, thus significantly impacting the effectiveness of testing target functionality and the practical applicability of migration approaches. In this paper, we propose a new migration paradigm (i.e., the abstraction-concretization paradigm) that first abstracts the test logic for the target functionality and then utilizes this logic to generate the concrete GUI test case. Furthermore, we introduce MACdroid, the first approach that migrates GUI test cases based on this paradigm. Specifically, we propose an abstraction technique that utilizes source test cases from source apps targeting the same functionality to extract a general test logic for that functionality. Then, we propose a concretization technique that utilizes the general test logic to guide an LLM in generating the corresponding GUI test case (including events and assertions) for the target app. We evaluate MACdroid on two widely-used datasets (including 31 apps, 34 functionalities, and 123 test cases). On the FrUITeR dataset, the test cases generated by MACdroid successfully test 64% of the target functionalities, improving the baselines by 191%. On the Lin dataset, MACdroid successfully tests 75% of the target functionalities, outperforming the baselines by 42%. These results underscore the effectiveness of MACdroid in GUI test migration.
GUI 测试迁移旨在生成包含事件和断言的测试用例,以测试目标应用的特定功能。现有的迁移方法通常侧重于控件映射范式,即把源应用的控件映射到目标应用。然而,由于不同的应用可能以不同方式实现相同的功能,直接映射可能导致测试用例不完整或有缺陷,从而严重影响对目标功能的测试效果以及迁移方法的实际适用性。在本文中,我们提出一种新的迁移范式(即抽象-具体化范式):首先抽象出目标功能的测试逻辑,然后利用该逻辑生成具体的 GUI 测试用例。此外,我们提出 MACdroid,这是第一个基于该范式迁移 GUI 测试用例的方法。具体而言,我们提出一种抽象技术,利用针对同一功能的源应用的源测试用例,提取该功能的通用测试逻辑。然后,我们提出一种具体化技术,利用通用测试逻辑引导 LLM 为目标应用生成相应的 GUI 测试用例(包括事件和断言)。我们在两个广泛使用的数据集(包括 31 个应用、34 项功能和 123 个测试用例)上评估了 MACdroid。在 FrUITeR 数据集上,MACdroid 生成的测试用例成功测试了 64% 的目标功能,比基线提高了 191%。在 Lin 数据集上,MACdroid 成功测试了 75% 的目标功能,比基线高出 42%。这些结果凸显了 MACdroid 在 GUI 测试迁移中的有效性。
Article 71
Title@2025-07-17 (4): Fairness Is Not Enough: Auditing Competence and Intersectional Bias in AI-powered Resume Screening
Title: Fairness Is Not Enough: Auditing Competence and Intersectional Bias in AI-powered Resume Screening | Fairness ist nicht genug: Auditing-Kompetenz und Intersektions-Bias in KI-powered Resume Screening | 公平不够充分:审计能力和大赦国际授权的恢复筛选中的跨部门比阿斯 2507.11548v2 |
Authors (1): Kevin T Webster
The increasing use of generative AI for resume screening is predicated on the assumption that it offers an unbiased alternative to biased human decision-making. However, this belief fails to address a critical question: are these AI systems fundamentally competent at the evaluative tasks they are meant to perform? This study investigates the question of competence through a two-part audit of eight major AI platforms. Experiment 1 confirmed complex, contextual racial and gender biases, with some models penalizing candidates merely for the presence of demographic signals. Experiment 2, which evaluated core competence, provided a critical insight: some models that appeared unbiased were, in fact, incapable of performing a substantive evaluation, relying instead on superficial keyword matching. This paper introduces the “Illusion of Neutrality” to describe this phenomenon, where an apparent lack of bias is merely a symptom of a model’s inability to make meaningful judgments. This study recommends that organizations and regulators adopt a dual-validation framework, auditing AI hiring tools for both demographic bias and demonstrable competence to ensure they are both equitable and effective.
在恢复性筛查中越来越多地使用基因化的AI的假设依据的假设是,它为有偏见的人类决策提供了不带偏见的替代选择。然而,这一信念未能解决一个关键问题:这些AI系统是否从根本上能够胜任它们要执行的评价任务?本研究报告通过对AI八大平台的两部分审计调查能力问题。实验1证实复杂、有背景的种族和性别偏见,有些模式仅仅惩罚有人口信号的候选人。实验2评价核心能力,提供了重要的洞察力:一些看来没有偏见的模式事实上无法进行实质性评价,而只能依靠表面的字眼匹配。本文介绍了“中立主义”来描述这种现象,其中明显缺乏偏见只是模式无法作出有意义的判断的一种症状。本研究报告建议各组织和监管机构采用双重评价框架,审计AI雇用工具,既要顾及人口偏见,又要证明能力,以确保它们既公平又有效。
Article 72
Title@2025-07-17 (4): ActionStudio: A Lightweight Framework for Data and Training of Large Action Models
Title: ActionStudio: A Lightweight Framework for Data and Training of Large Action Models | ActionStudio: Ein leichter Rahmen für Daten und Training großer Aktionsmodelle | 行动研究:关于大型行动模式的数据和培训的轻量框架 2503.22673v3 |
Authors (16): Jianguo Zhang, Thai Hoang, Ming Zhu, Zuxin Liu, Shiyu Wang, Tulika Awalgaonkar, Akshara Prabhakar, Haolin Chen, Weiran Yao, Zhiwei Liu, Juntao Tan, Juan Carlos Niebles, Shelby Heinecke, Huan Wang, Silvio Savarese, Caiming Xiong
Large Action models are essential for enabling autonomous agents to perform complex tasks. However, training such models remains challenging due to the diversity of agent environments and the complexity of noisy agentic data. Existing infrastructure offers limited support for scalable, agent-specific fine-tuning and standardized agent data processing. We introduce ActionStudio, a lightweight and extensible data and training framework designed for large action models. ActionStudio unifies diverse agent trajectories using our proposed Unified Format 2.0, supports a range of training workflows with optimized multi-node distributed setup, and integrates robust preprocessing and real-time verification tools. ActionStudio demonstrates up to 9x higher throughput compared to existing agentic training frameworks, and our trained models yield top performances across public and realistic agent benchmarks. To support the broader research community, we open-source the ActionStudio framework and release actionstudio-98k, a curated dataset of 98k high-quality trajectories. Code: https://github.com/SalesforceAIResearch/xLAM.
大型行动模式对于使自主代理机构能够执行复杂任务至关重要,然而,由于代理环境的多样性和繁琐的代理数据的复杂性,培训这类模式仍然具有挑战性。现有的基础设施为可扩展的、针对代理机构的微调和标准化代理数据处理提供了有限的支持。我们为大型行动模式引入了ActionStudio,即一个轻量和可扩展的数据和培训框架。ActionStude利用我们提议的统一格式2.0将多种代理机构轨迹统一起来,支持一系列培训工作流程,优化多节分布式设置,并整合了强有力的预处理和实时核查工具。ActionStudio展示了比现有代理培训框架高出9倍的吞吐量,而我们经过培训的模型在公众和现实的代理基准方面产生了顶级业绩。为了支持更广泛的研究界,我们将ActionStude框架和发布行动Teptio-98k,这是98k高质量轨迹的整理数据集。代码:https://github.com/SAleforumAIResearch/xLAM。
Article 73
Title@2025-07-17 (4): Chain-of-Thought Prompting Obscures Hallucination Cues in Large Language Models: An Empirical Evaluation
Title: Chain-of-Thought Prompting Obscures Hallucination Cues in Large Language Models: An Empirical Evaluation | Chain-of-Thought Prompting Obscures Halluzination Cues in großen Sprachmodellen: Eine empirische Bewertung | 引导大语言模型中传译锥体:经验评价 2506.17088v2 |
Authors (8): Jiahao Cheng, Tiancheng Su, Jia Yuan, Guoxiu He, Jiawei Liu, Xinqi Tao, Jingwen Xie, Huaxia Li
Large Language Models (LLMs) often exhibit hallucinations, generating factually incorrect or semantically irrelevant content in response to prompts. Chain-of-Thought (CoT) prompting can mitigate hallucinations by encouraging step-by-step reasoning, but its impact on hallucination detection remains underexplored. To bridge this gap, we conduct a systematic empirical evaluation. We begin with a pilot experiment, revealing that CoT reasoning significantly affects the LLM’s internal states and token probability distributions. Building on this, we evaluate the impact of various CoT prompting methods on mainstream hallucination detection methods across both instruction-tuned and reasoning-oriented LLMs. Specifically, we examine three key dimensions: changes in hallucination score distributions, variations in detection accuracy, and shifts in detection confidence. Our findings show that while CoT prompting helps reduce hallucination frequency, it also tends to obscure critical signals used for detection, impairing the effectiveness of various detection methods. Our study highlights an overlooked trade-off in the use of reasoning. Code is publicly available at: https://anonymous.4open.science/r/cot-hallu-detect.
大语言模型(LLM)经常出现幻觉,即在回应提示时生成事实错误或语义无关的内容。思维链(CoT)提示可以通过鼓励逐步推理来减轻幻觉,但它对幻觉检测的影响仍未得到充分研究。为弥补这一空白,我们进行了系统的实证评估。我们从一项试点实验入手,发现 CoT 推理会显著影响 LLM 的内部状态和词元概率分布。在此基础上,我们评估了多种 CoT 提示方法对主流幻觉检测方法的影响,涵盖指令微调型和推理型 LLM。具体而言,我们考察三个关键维度:幻觉分数分布的变化、检测准确率的变化以及检测置信度的偏移。研究结果表明,CoT 提示虽然有助于降低幻觉出现的频率,但也往往会掩盖用于检测的关键信号,削弱各种检测方法的有效性。我们的研究揭示了使用推理时一个被忽视的权衡。代码公开于:https://anonymous.4open.science/r/cot-hallu-detect.
Article 74
Title@2025-07-17 (4): AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation
Title: AudioJudge: Understanding What Works in Large Audio Model Based Speech Evaluation | AudioJudge: Verstehen, was in der großen Audiomodell basierten Sprachbewertung funktioniert | 音频法官:了解大型音频示范演讲评价有什么用 2507.12705v1 |
Authors (8): Potsawee Manakul, Woody Haosheng Gan, Michael J. Ryan, Ali Sartaz Khan, Warit Sirichotedumrong, Kunat Pipatanakul, William Held, Diyi Yang
Current speech evaluation suffers from two critical limitations: the need and difficulty of designing specialized systems targeting individual audio characteristics, and poor correlation between automatic evaluation methods and human preferences. This work presents a systematic study of Large Audio Model (LAM) as a Judge, AudioJudge, investigating whether it can provide a unified evaluation framework that addresses both challenges. We systematically explore AudioJudge across audio characteristic detection tasks, including pronunciation, speaking rate, speaker identification and speech quality, and system-level human preference simulation for automated benchmarking. We investigate different prompt engineering strategies, finding that audio concatenation combined with in-context learning significantly improves performance across both audio characteristic detection and human preference simulation tasks. We further introduce a multi-aspect ensemble AudioJudge to enable general-purpose multi-aspect audio evaluation. This method decomposes speech assessment into specialized judges for lexical content, speech quality, and paralinguistic features, achieving up to 0.91 Spearman correlation with human preferences on our system ranking benchmark. Robustness analysis reveals that while LAMs maintain strong performance under acoustic noise, they exhibit significant verbosity and positional biases that require careful mitigation.
目前,语言评价受到两个关键限制:设计针对个别音频特征的专门系统的必要性和困难,以及自动评价方法与人类偏好之间的关联性差;这项工作对大音频模型(LAM)作为法官、音频法官进行系统研究,调查它能否提供一个统一的评价框架来应对这两项挑战;我们系统地探索音频法官跨越音频特征检测任务,包括发音、语音率、语音识别和语音质量,以及系统层面的自动基准人类偏好模拟;我们调查不同的迅速工程战略,发现音频相配结合和文本内学习大大改进了音频特征探测和人类偏好模拟任务之间的性能;我们进一步引入多等同的音频模型,以便能够进行通用的多功能音频评估;这一方法将语言评估纳入专门法官的词汇内容、语音质量和语言特征,在系统排序基准上达到0.91 Spearman与人类偏好的相关性;Robustness分析显示,虽然LAMS在声响噪音和人类偏好下都保持很强的性能。
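The reported agreement with human preferences is a Spearman rank correlation. As a reminder of what that number measures, the sketch below computes it from scratch as the Pearson correlation of average ranks (so ties are handled). The judge and human scores are made-up toy values, not data from the paper.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based positions i+1..j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy system-level scores for four speech systems (hypothetical values).
judge_scores = [0.9, 0.7, 0.4, 0.2]
human_scores = [0.8, 0.85, 0.3, 0.1]
rho = spearman(judge_scores, human_scores)
```

Because only the ordering matters, a judge can score on a different scale than human raters and still reach a high correlation, which is why rank correlation suits system-ranking benchmarks.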
Article 75
Title@2025-07-17 (4): Exploiting Adaptive Contextual Masking for Aspect-Based Sentiment Analysis
Title: Exploiting Adaptive Contextual Masking for Aspect-Based Sentiment Analysis | Ausnutzung adaptiver Kontextmasken für aspektbasierte Sentiment-Analysen | 利用适应性环境掩码进行外观感应力分析 2402.13722v2 |
Authors (4): S M Rafiuddin, Mohammed Rakib, Sadia Kamal, Arunkumar Bagavathi
Aspect-Based Sentiment Analysis (ABSA) is a fine-grained linguistics problem that entails the extraction of multifaceted aspects, opinions, and sentiments from the given text. Both standalone and compound ABSA tasks have been extensively used in the literature to examine the nuanced information present in online reviews and social media posts. Current ABSA methods often rely on static hyperparameters for attention-masking mechanisms, which can struggle with context adaptation and may overlook the unique relevance of words in varied situations. This leads to challenges in accurately analyzing complex sentences containing multiple aspects with differing sentiments. In this work, we present adaptive masking methods that remove irrelevant tokens based on context to assist in Aspect Term Extraction and Aspect Sentiment Classification subtasks of ABSA. We show with our experiments that the proposed methods outperform the baseline methods in terms of accuracy and F1 scores on four benchmark online review datasets. Further, we show that the proposed methods can be extended with multiple adaptations and demonstrate a qualitative analysis of the proposed approach using sample text for aspect term extraction.
外观感知分析(ABSA)是一个细微的语言问题,涉及从给定文本中提取多方面的方面、意见和情感。在文献中,ABSA的独立任务和复合任务都广泛用于研究在线审查和社交媒体文章中的细微信息。目前的ABSA方法往往依靠静态超分数来制造注意量机制,这可能会与背景适应有关,并可能会忽视语言在不同情况下的独特相关性。这导致在准确分析包含不同情绪的多方面内容的复杂句子时遇到挑战。在这项工作中,我们提出了根据背景去除不相干符号的适应性遮罩方法,以协助对ABSA的外观提取和外观感应分类子。我们通过实验表明,拟议的方法在准确性方面超越基线方法,在四个基准在线审查数据集中F1分数。此外,我们表明,拟议的方法可以随着多重调整而扩展,并展示对拟议方法的定性分析,利用样本文本进行侧面术语提取。
Article 76
Title@2025-07-17 (4): AdaptiSent: Context-Aware Adaptive Attention for Multimodal Aspect-Based Sentiment Analysis
Title: AdaptiSent: Context-Aware Adaptive Attention for Multimodal Aspect-Based Sentiment Analysis | AdaptiSent: Context-Aware Adaptive Aufmerksamkeit für multimodale Aspect-Based-Sentiment-Analysen | 适应性:基于多种模式的光谱感应分析的上下文知识适应性关注 2507.12695v1 |
Authors (5): S M Rafiuddin, Sadia Kamal, Mohammed Rakib, Arunkumar Bagavathi, Atriya Sen
We introduce AdaptiSent, a new framework for Multimodal Aspect-Based Sentiment Analysis (MABSA) that uses adaptive cross-modal attention mechanisms to improve sentiment classification and aspect term extraction from both text and images. Our model integrates dynamic modality weighting and context-adaptive attention, enhancing the extraction of sentiment and aspect-related information by focusing on how textual cues and visual context interact. We tested our approach against several baselines, including traditional text-based models and other multimodal methods. Results from standard Twitter datasets show that AdaptiSent surpasses existing models in precision, recall, and F1 score, and is particularly effective in identifying nuanced inter-modal relationships that are crucial for accurate sentiment and aspect term extraction. This effectiveness comes from the model’s ability to adjust its focus dynamically based on the context’s relevance, improving the depth and accuracy of sentiment analysis across various multimodal data sets. AdaptiSent sets a new standard for MABSA, significantly outperforming current methods, especially in understanding complex multimodal information.
我们引入了适应性-基于多模式的光谱感应分析新框架-适应性跨模式关注机制,利用适应性跨模式关注机制改进情绪分类和从文本和图像中提取术语的方面。我们的模型将动态模式加权和背景适应性关注结合起来,通过侧重于文字提示和视觉背景互动的方式,加强感应和方面相关信息的提取。我们根据若干基线,包括传统文本模型和其他多式联运方法,测试了我们的方法。标准的Twitter数据集显示,适应性超越了现有精确度、回溯和F1评分的模型,在确定对准确情感和术语提取至关重要的细微模式间关系方面特别有效。这种有效性来自模型根据背景相关性动态调整其焦点的能力,提高了各种多式联运数据集的情绪分析的深度和准确性。适应性为MABSA制定了新的标准,显著优于当前方法,特别是在理解复杂的多式联运信息方面。
Article 77
Title@2025-07-16 (3): Improving Drug Identification in Overdose Death Surveillance using Large Language Models
Title: Improving Drug Identification in Overdose Death Surveillance using Large Language Models | Verbesserung der Drogenidentifizierung bei der Überwachung von Überdosierungen mit großen Sprachmodellen | 利用大语言模式在超剂量死亡监测中改进药物识别工作 2507.12679v1 |
Authors (9): Arthur J. Funnell, Panayiotis Petousis, Fabrice Harel-Canada, Ruby Romero, Alex A. T. Bui, Adam Koncsol, Hritika Chaturvedi, Chelsea Shover, David Goodman-Meza
The rising rate of drug-related deaths in the United States, largely driven by fentanyl, requires timely and accurate surveillance. However, critical overdose data are often buried in free-text coroner reports, leading to delays and information loss when coded into International Classification of Diseases, 10th Revision (ICD-10) classifications. Natural language processing (NLP) models may automate and enhance overdose surveillance, but prior applications have been limited. A dataset of 35,433 death records from multiple U.S. jurisdictions in 2020 was used for model training and internal testing. External validation was conducted using a novel separate dataset of 3,335 records from 2023-2024. Multiple NLP approaches were evaluated for classifying specific drug involvement from unstructured death certificate text. These included traditional single- and multi-label classifiers, as well as fine-tuned encoder-only language models such as Bidirectional Encoder Representations from Transformers (BERT) and BioClinicalBERT, and contemporary decoder-only large language models such as Qwen 3 and Llama 3. Model performance was assessed using macro-averaged F1 scores, and 95% confidence intervals were calculated to quantify uncertainty. Fine-tuned BioClinicalBERT models achieved near-perfect performance, with macro F1 scores >=0.998 on the internal test set. External validation confirmed robustness (macro F1=0.966), outperforming conventional machine learning, general-domain BERT models, and various decoder-only large language models. NLP models, particularly fine-tuned clinical variants like BioClinicalBERT, offer a highly accurate and scalable solution for overdose death classification from free-text reports. These methods can significantly accelerate surveillance workflows, overcoming the limitations of manual ICD-10 coding and supporting near real-time detection of emerging substance use trends.
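美国与毒品有关的死亡率不断上升,主要由芬太尼驱动,因此需要及时、准确的监测。然而,关键的过量用药数据往往埋藏在自由文本的验尸官报告中,在编码为国际疾病分类第十版(ICD-10)类别时会造成延误和信息丢失。自然语言处理(NLP)模型可以使过量用药监测自动化并加以增强,但此前的应用有限。我们使用 2020 年来自美国多个司法管辖区的 35,433 份死亡记录组成的数据集进行模型训练和内部测试,并使用一个由 2023-2024 年 3,335 份记录组成的全新独立数据集进行外部验证。我们评估了多种 NLP 方法,用于从非结构化的死亡证明文本中对具体涉及的药物进行分类。这些方法包括传统的单标签和多标签分类器、经过微调的仅编码器语言模型(如 BERT 和 BioClinicalBERT),以及当代的仅解码器大语言模型(如 Qwen 3 和 Llama 3)。模型性能使用宏平均 F1 分数评估,并计算 95% 置信区间以量化不确定性。经过微调的 BioClinicalBERT 模型取得了近乎完美的性能,在内部测试集上宏平均 F1 分数 ≥0.998。外部验证证实了其稳健性(宏平均 F1=0.966),优于传统机器学习、通用领域 BERT 模型和多种仅解码器大语言模型。NLP 模型,尤其是 BioClinicalBERT 这类经过微调的临床变体,为基于自由文本报告的过量用药死亡分类提供了高度准确且可扩展的解决方案。这些方法可以显著加快监测工作流程,克服人工 ICD-10 编码的局限,并支持近乎实时地发现新出现的物质使用趋势。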
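The evaluation protocol, macro-averaged F1 with a 95% confidence interval, can be sketched in a few lines. The abstract does not say how the intervals were computed; this sketch assumes a percentile bootstrap over records, and it uses a simplified single-label setting with toy labels (the study itself also covers multi-label classification). All names and data are illustrative.

```python
import random


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-label F1 scores.

    A label with no true or predicted occurrences contributes 0.0.
    """
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def bootstrap_ci(y_true, y_pred, labels, n_resamples=1000, seed=0):
    """Percentile-bootstrap 95% interval for macro F1 (assumed method)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(macro_f1([y_true[i] for i in idx],
                              [y_pred[i] for i in idx], labels))
    stats.sort()
    lower = stats[int(0.025 * n_resamples)]
    upper = stats[max(0, int(0.975 * n_resamples) - 1)]
    return lower, upper


# Toy records: true vs. predicted primary drug (hypothetical data).
labels = ["fentanyl", "heroin", "cocaine"]
y_true = ["fentanyl", "fentanyl", "heroin", "heroin", "cocaine", "cocaine"]
y_pred = ["fentanyl", "fentanyl", "heroin", "cocaine", "cocaine", "cocaine"]

point = macro_f1(y_true, y_pred, labels)
lower, upper = bootstrap_ci(y_true, y_pred, labels)
```

Macro averaging weights every drug class equally, so rare substances affect the score as much as common ones, which matters when the goal is spotting emerging trends.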
Article 78
Title@2025-07-16 (3): The first open machine translation system for the Chechen language
Title: The first open machine translation system for the Chechen language | Das erste offene maschinelle Übersetzungssystem für die tschetschenische Sprache | 车臣语第一个开放机器翻译系统 2507.12672v1 |
Authors (2): Abu-Viskhan A. Umishov, Vladislav A. Grigorian
We introduce the first open-source model for translation between the vulnerable Chechen language and Russian, and the dataset collected to train and evaluate it. We explore fine-tuning capabilities for adding a new language to NLLB-200, a large language model system for multilingual translation. The BLEU / ChrF++ scores for our model are 8.34 / 34.69 and 20.89 / 44.55 for translation from Russian to Chechen and in the reverse direction, respectively. The release of the translation models is accompanied by the distribution of parallel word, phrase, and sentence corpora and a multilingual sentence encoder adapted to the Chechen language.
我们采用第一个开放源码模式,在脆弱的车臣语和俄语之间翻译,并采用为培训和评估该模式而收集的数据集。我们探索微调能力,将新语言纳入一个大语言模式系统,用于多语种翻译NLLB-200。我们模式的BLEU/ChrF++分数分别是8.34/34.69和20.89/44.55,用于将俄语翻译成车臣语和反向翻译。翻译模型的发布,还同时分发适合车臣语的平行词、词句和句子以及多语句编码。
Article 79
Title@2025-07-16 (3): UPCORE: Utility-Preserving Coreset Selection for Balanced Unlearning
Title: UPCORE: Utility-Preserving Coreset Selection for Balanced Unlearning | UPCORE: Nutzenschonende Coreset-Auswahl für ausgewogenes Lernen | UPCORE: 平衡退学的核心选择 2502.15082v2 |
Authors (3): Vaidehi Patil, Elias Stengel-Eskin, Mohit Bansal
User specifications or legal frameworks often require information to be removed from pretrained models, including large language models (LLMs). This requires deleting or “forgetting” a set of data points from an already-trained model, which typically degrades its performance on other data points. Thus, a balance must be struck between removing information and keeping the model’s other abilities intact, with a failure to balance this trade-off leading to poor deletion or an unusable model. To this end, we propose UPCORE (Utility-Preserving Coreset Selection), a method-agnostic data selection framework for mitigating collateral damage during unlearning. Finding that the model damage is correlated with the variance of the model’s representations on the forget set, we selectively prune the forget set to remove outliers, thereby minimizing model degradation after unlearning. Across three standard unlearning methods, UPCORE consistently achieves a superior balance between the competing objectives of deletion efficacy and model preservation. To better evaluate this trade-off, we introduce a new metric, measuring the area-under-the-curve (AUC) across standard metrics. Our results show that UPCORE improves both standard metrics and AUC, benefiting from positive transfer between the coreset and pruned points while reducing negative transfer from the forget set to points outside of it.
用户的规格或法律框架往往要求从预先培训的模型中删除信息,包括大型语言模型(LLMs),这要求从已经培训的模型中删除或“忘记”一组数据点,这通常会降低其在其他数据点上的性能。因此,必须在删除信息与保持模型其他能力保持完好之间取得平衡,不能平衡这一取舍,导致删除工作不力或无法使用模式。为此,我们提议采用UPCO(通用-保留核心选择),一个方法-不可知性数据选择框架,用以在不学习期间减轻附带损害。发现模型损害与模型在“忘却套”上的表达方式的差异相关,我们有选择地利用“忘记”来清除外源,从而在不学习后尽量减少模式退化。在三种标准的不学习方法中,UPCORE始终在相互竞争的删除功效和模式保存目标之间取得更佳的平衡。为了更好地评估这一取舍,我们提出了一个新的衡量标准度,衡量区域偏向(AUSC)跨标准度。我们的结果表明,UCORE 改进了标准向外部转移点,同时从正向正向核心点的转移。
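The core idea in the abstract above, pruning outliers from the forget set so that the variance of the model's representations drops, can be sketched roughly as follows. This is an assumption-laden illustration: the distance-to-centroid rule, the `drop_fraction` parameter, and the toy 2-d "representations" are all hypothetical, and the abstract does not specify which outlier detector UPCORE uses.

```python
def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def total_variance(vectors):
    """Mean squared distance of the vectors from their centroid."""
    c = centroid(vectors)
    return sum(sum((x - m) ** 2 for x, m in zip(v, c)) for v in vectors) / len(vectors)

def prune_outliers(vectors, drop_fraction=0.2):
    """Return indices of the points kept after dropping the farthest ones."""
    c = centroid(vectors)
    dist = [sum((x - m) ** 2 for x, m in zip(v, c)) for v in vectors]
    n_keep = len(vectors) - int(len(vectors) * drop_fraction)
    keep = sorted(range(len(vectors)), key=lambda i: dist[i])[:n_keep]
    return sorted(keep)

# Toy forget-set representations; the last point is an obvious outlier.
forget_reps = [[0.1, 0.0], [0.0, 0.2], [-0.1, 0.1], [0.2, -0.1], [3.0, 3.0]]
core_idx = prune_outliers(forget_reps, drop_fraction=0.2)
core_reps = [forget_reps[i] for i in core_idx]
print(core_idx, total_variance(forget_reps), total_variance(core_reps))
```

Unlearning would then run on `core_reps` only, on the premise that a lower-variance forget set causes less collateral damage.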
Article 80
Title@2025-07-16 (3): A Fuzzy Approach to Project Success: Measuring What Matters
Title: A Fuzzy Approach to Project Success: Measuring What Matters | Ein fuzzy Ansatz zum Projekt Erfolg: Messen, was zählt | 项目成功:衡量重要事项的模糊方法 2507.12653v1 |
Authors (4): João Granja-Correia, Remedios Hernández-Linares, Luca Ferranti, Arménio Rego
This paper introduces a novel approach to project success evaluation by integrating fuzzy logic into an existing construct. Traditional Likert-scale measures often overlook the context-dependent and multifaceted nature of project success. The proposed hierarchical Type-1 Mamdani fuzzy system prioritizes sustained positive impact for end-users, reducing emphasis on secondary outcomes like stakeholder satisfaction and internal project success. This dynamic approach may provide a more accurate measure of project success and could be adaptable to complex evaluations. Future research will focus on empirical testing and broader applications of fuzzy logic in social science.
本文件介绍了一种新的项目成功评价方法,将模糊的逻辑纳入现有结构中,传统的类似标准措施往往忽视项目成功的背景和多面性。拟议的第1级Mamdani模糊系统优先考虑对最终用户的持续积极影响,减少对利益攸关方满意度和内部项目成功率等次级成果的强调。这种动态方法可以更准确地衡量项目成功率,并适应复杂的评价。未来研究将侧重于经验测试和社会科学中模糊逻辑的更广泛应用。
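A Type-1 Mamdani system of the kind described above clips each rule's output set by the rule strength, merges the clipped sets with max, and takes the centroid of the result. The sketch below is a single-layer toy under stated assumptions: the membership functions, the 0-10 scale, the three rules, and the names `impact` and `satisfaction` are hypothetical and are not taken from the paper, whose system is hierarchical.

```python
def low(x):
    """Membership in 'low' on a 0-10 scale (1 at 0, 0 from 5 upward)."""
    return max(0.0, min(1.0, (5.0 - x) / 5.0))

def high(x):
    """Membership in 'high' on a 0-10 scale (0 up to 5, 1 at 10)."""
    return max(0.0, min(1.0, (x - 5.0) / 5.0))

def medium(x):
    """Triangular membership peaking at 5 and reaching 0 at 2.5 and 7.5."""
    return max(0.0, 1.0 - abs(x - 5.0) / 2.5)

def project_success(impact, satisfaction, steps=100):
    # Each rule is (firing strength, output fuzzy set). End-user impact
    # dominates; satisfaction only softens a low-impact verdict.
    rules = [
        (high(impact), high),                            # impact high -> success high
        (low(impact), low),                              # impact low  -> success low
        (min(low(impact), high(satisfaction)), medium),  # low impact AND high satisfaction -> medium
    ]
    num = den = 0.0
    for i in range(steps + 1):
        y = 10.0 * i / steps
        # Mamdani inference: clip each consequent (min), aggregate (max).
        mu = max(min(strength, out_set(y)) for strength, out_set in rules)
        num += y * mu
        den += mu
    # Centroid defuzzification; fall back to the midpoint if no rule fires.
    return num / den if den else 5.0

print(project_success(impact=9, satisfaction=2))
print(project_success(impact=2, satisfaction=9))
```

With these rules, a project with strong end-user impact but weak stakeholder satisfaction scores higher than the reverse case, mirroring the prioritization the abstract describes.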
Article 81
Title@2025-07-16 (3): A Multi-Stage Framework with Taxonomy-Guided Reasoning for Occupation Classification Using Large Language Models
Title: A Multi-Stage Framework with Taxonomy-Guided Reasoning for Occupation Classification Using Large Language Models | Ein Multi-Stage-Rahmen mit taxonomiegeführter Begründung für die Berufsklassifizierung mit großen Sprachmodellen | 使用大语言模式进行职业分类的多标准框架,并有分类法指导理由 2503.12989v2 |
Authors (3): Palakorn Achananuparp, Ee-Peng Lim, Yao Lu
Automatically annotating job data with standardized occupations from taxonomies, known as occupation classification, is crucial for labor market analysis. However, this task is often hindered by data scarcity and the challenges of manual annotations. While large language models (LLMs) hold promise due to their extensive world knowledge and in-context learning capabilities, their effectiveness depends on their knowledge of occupational taxonomies, which remains unclear. In this study, we assess the ability of LLMs to generate precise taxonomic entities from taxonomy, highlighting their limitations, especially for smaller models. To address these challenges, we propose a multi-stage framework consisting of inference, retrieval, and reranking stages, which integrates taxonomy-guided reasoning examples to enhance performance by aligning outputs with taxonomic knowledge. Evaluations on a large-scale dataset show that our framework not only enhances occupation and skill classification tasks, but also provides a cost-effective alternative to frontier models like GPT-4o, significantly reducing computational costs while maintaining strong performance. This makes it a practical and scalable solution for occupation classification and related tasks across LLMs.
对劳动力市场分析而言,职业分类法标准化职业(即职业分类法)自动说明工作数据至关重要,但这项工作往往受到数据稀缺和人工说明挑战的阻碍。虽然大型语言模型因其广泛的世界知识和内流学习能力而前景良好,但其有效性取决于其对职业分类法的知识,这一点仍然不明确。在本研究报告中,我们评估了LLMS从分类法中生成精确分类法实体的能力,突出了其局限性,特别是对于较小的模型而言。为了应对这些挑战,我们提议了一个多阶段框架,包括推论、检索和重新排位等,将分类法指导的推理实例结合起来,通过将产出与分类法知识相协调来提高绩效。对大规模数据集的评估表明,我们的框架不仅加强了职业和技能分类任务,而且还为GPT-4o等前沿模型提供了具有成本效益的替代方法,大大降低了计算成本,同时保持了强劲的业绩。这使得整个LMS的占领分类和相关任务成为实用和可扩展的解决办法。
Article 82
Title@2025-07-16 (3): Fine-Tune an SLM or Prompt an LLM? The Case of Generating Low-Code Workflows
Title: Fine-Tune an SLM or Prompt an LLM? The Case of Generating Low-Code Workflows | Fine-Tune ein SLM oder Prompt ein LLM? Der Fall der Erzeugung von Low-Code Workflows | 微调可持续土地管理还是迅速提炼一个LLM? 产生低碳工作流程的案例 2505.24189v2 |
Authors (5): Orlando Marquez Ayala, Patrice Bechard, Emily Chen, Maggie Baird, Jingfei Chen
Large Language Models (LLMs) such as GPT-4o can handle a wide range of complex tasks with the right prompt. As per-token costs fall, the advantages of fine-tuning Small Language Models (SLMs) for real-world applications (faster inference, lower costs) may no longer be clear. In this work, we present evidence that, for domain-specific tasks that require structured outputs, SLMs still have a quality advantage. We compare fine-tuning an SLM against prompting LLMs on the task of generating low-code workflows in JSON form. We observe that while a good prompt can yield reasonable results, fine-tuning improves quality by 10% on average. We also perform systematic error analysis to reveal model limitations.
大型语言模型(LLMs)如GPT-4o等大型语言模型(LLMs)能够以合适的速度处理一系列复杂的任务。由于象征性成本降低,微调用于现实世界应用的小型语言模型(SLMs)的优点可能不再十分清楚 – – 更快的推论、更低的成本 – – 在这项工作中,我们提出的证据表明,对于需要结构化产出的具体领域任务,可持续土地管理仍具有质量优势。我们比较了SLM的微调,而不是激励LLMs完成以 JSON 格式生成低码工作流程的任务。我们发现,虽然良好的迅速可以产生合理的结果,但微调的质量平均提高10%。我们还进行了系统性的错误分析,以揭示模型的局限性。
Article 83
Title@2025-07-16 (3): Cross-Layer Discrete Concept Discovery for Interpreting Language Models
Title: Cross-Layer Discrete Concept Discovery for Interpreting Language Models | Cross-Layer Discrete Concept Discovery für Interpretationssprachmodelle | 解释语言模型的跨语言监听概念发现 2506.20040v2 |
Authors (4): Ankur Garg, Xuemin Yu, Hassan Sajjad, Samira Ebrahimi Kahou
Uncovering emergent concepts across transformer layers remains a significant challenge because the residual stream linearly mixes and duplicates information, obscuring how features evolve within large language models. Current research efforts primarily inspect neural representations at single layers, thereby overlooking this cross-layer superposition and the redundancy it introduces. These representations are typically either analyzed directly for activation patterns or passed to probing classifiers that map them to a limited set of predefined concepts. To address these limitations, we propose cross-layer VQ-VAE (CLVQ-VAE), a framework that uses vector quantization to map representations across layers and in the process collapse duplicated residual-stream features into compact, interpretable concept vectors. Our approach uniquely combines top-k temperature-based sampling during quantization with EMA codebook updates, providing controlled exploration of the discrete latent space while maintaining code-book diversity. We further enhance the framework with scaled-spherical k-means++ for codebook initialization, which clusters by directional similarity rather than magnitude, better aligning with semantic structure in word embedding space.
由于剩余流线性混合和重复信息,掩盖了大语言模型中各种特征的演变方式,因此这些未覆盖的变压层新出现概念仍是一项重大挑战。当前研究工作主要检查单层神经显示,从而忽略了这种跨层叠加和它带来的冗余。这些表示通常不是直接分析激活模式,就是通过直接分析将其映射成有限的一组预设概念的检测分类。为了解决这些局限性,我们提议采用跨层VQ-VAE(CLVQ-VAE)(CLVQ-VAE)这一框架,利用矢量定量来绘制各层之间和整个过程的表达方式,将重复的残余流特征映射成紧凑的、可解释的概念矢量。我们的方法在四分化过程中将基于温度的顶部取样与 EMA 代码簿更新结合起来,提供对离散潜伏空间的有控制的探索,同时维护代码簿的多样性。我们进一步强化框架,使代码初始化的宽度K-point-point-point +(CLVQ-VE-VE-VAVE),这个框架使用矢量组合,以方向性组合而不是数量,更好地与文字嵌嵌入空间中的文字嵌入空间的语结构。
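The two quantization ingredients named in the abstract above, top-k temperature-based sampling and EMA codebook updates, can be sketched as follows. This is a simplified illustration, not the CLVQ-VAE implementation: the function names are hypothetical, and the EMA step moves a code directly toward the batch mean instead of tracking separate running cluster sizes and sums.

```python
import math
import random

def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def sample_code(x, codebook, k=2, temperature=1.0, rng=random):
    """Sample a code index among the k nearest codes.

    Nearer codes get higher probability; a lower temperature makes the
    choice closer to plain nearest-neighbour assignment.
    """
    dists = [sq_dist(x, c) for c in codebook]
    top = sorted(range(len(codebook)), key=lambda i: dists[i])[:k]
    logits = [-dists[i] / temperature for i in top]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    return rng.choices(top, weights=weights, k=1)[0]

def ema_update(codebook, assigned, decay=0.99):
    """Move each used code toward the mean of the vectors assigned to it."""
    for idx, vecs in assigned.items():
        mean = [sum(col) / len(vecs) for col in zip(*vecs)]
        codebook[idx] = [decay * c + (1 - decay) * m
                         for c, m in zip(codebook[idx], mean)]

codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
x = [0.9, 0.9]
idx = sample_code(x, codebook, k=2, temperature=0.5, rng=random.Random(0))
ema_update(codebook, {idx: [x]}, decay=0.9)
print(idx, codebook[idx])
```

Sampling among the top-k codes rather than always taking the nearest one is what gives the controlled exploration mentioned in the abstract; codes outside the top-k are never selected.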
Article 84
Title@2025-07-16 (3): Multi-task retriever fine-tuning for domain-specific and efficient RAG
Title: Multi-task retriever fine-tuning for domain-specific and efficient RAG | Multi-Task Retriever Feinabstimmung für domänenspezifische und effiziente RAG | 多任务检索器微调,用于特定领域和高效率的RAG 2501.04652v2 |
Authors (2): Patrice Béchard, Orlando Marquez Ayala
Retrieval-Augmented Generation (RAG) has become ubiquitous when deploying Large Language Models (LLMs), as it can address typical limitations such as generating hallucinated or outdated information. However, when building real-world RAG applications, practical issues arise. First, the retrieved information is generally domain-specific. Since it is computationally expensive to fine-tune LLMs, it is more feasible to fine-tune the retriever to improve the quality of the data included in the LLM input. Second, as more applications are deployed in the same real-world system, one cannot afford to deploy separate retrievers. Moreover, these RAG applications normally retrieve different kinds of data. Our solution is to instruction fine-tune a small retriever encoder on a variety of domain-specific tasks to allow us to deploy one encoder that can serve many use cases, thereby achieving low cost, scalability, and speed. We show how this encoder generalizes to out-of-domain settings as well as to an unseen retrieval task on real-world enterprise use cases.
在部署大语言模型(LLMs)时,检索-加速一代(RAG)已经变得无处不在,因为它可以解决典型的局限性,例如产生幻觉或过时的信息。然而,在建立真实世界的RAG应用程序时,会出现实际问题。首先,检索的信息一般是特定域的信息。由于对微调LMS而言成本昂贵,因此更可行的做法是微调检索器,以提高LLM投入中所含数据的质量。第二,随着更多的应用程序被部署在同一个真实世界的系统中,人们无法使用单独的检索器。此外,这些RAG应用程序通常会检索不同种类的数据。我们的解决办法是,在各种特定域的任务上对小型检索器编码器进行微调,以使我们能够部署一个能为许多使用案例服务的编码器,从而实现低成本、可缩缩放性和速度。我们展示了这个编码器如何向外部环境一般化,以及对于现实世界企业使用案例的无形检索任务。
Article 85
Title@2025-07-16 (3): LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization
Title: LoRA Done RITE: Robust Invariant Transformation Equilibration for LoRA Optimization | LoRA Done RITE: Robuste Invariante Transformations-Equilibration für LoRA-Optimierung | Lora Done REITE: 优化 LoRA 的强劲的动态转型平衡 2410.20625v2 |
Authors (8): Jui-Nan Yen, Si Si, Zhao Meng, Felix Yu, Sai Surya Duvvuri, Inderjit S. Dhillon, Cho-Jui Hsieh, Sanjiv Kumar
Low-rank adaptation (LoRA) is a widely used parameter-efficient fine-tuning method for LLMs that reduces memory requirements. However, current LoRA optimizers lack transformation invariance, meaning the actual updates to the weights depend on how the two LoRA factors are scaled or rotated. This deficiency leads to inefficient learning and sub-optimal solutions in practice. This paper introduces LoRA-RITE, a novel adaptive matrix preconditioning method for LoRA optimization, which achieves transformation invariance while remaining computationally efficient. We provide theoretical analysis to demonstrate the benefit of our method and conduct experiments on various LLM tasks with different models including Gemma 2B, 7B, and mT5-XXL. The results demonstrate consistent improvements over existing optimizers. For example, replacing Adam with LoRA-RITE during LoRA fine-tuning of Gemma-2B yielded a 4.6% accuracy gain on Super-Natural Instructions and a 3.5% accuracy gain across four other LLM benchmarks (HellaSwag, ArcChallenge, GSM8K, OpenBookQA).
低级别适应(LORA)是LLM的一种广泛使用的高效参数微调方法,可以减少记忆要求;然而,目前的LORA优化剂缺乏变异性,这意味着对重量的实际更新取决于两个LORA因素的缩放或旋转方式;这一缺陷导致在实践中学习效率低下和次优的解决方案;本文介绍了LORA优化的一种创新的适应矩阵前导方法,即LORA-REITE,它可以实现变异并保持计算效率;我们提供了理论分析,以证明我们的方法的好处,并用不同的模型对LLLM任务进行实验,包括Gemma 2B、7B和mT5-XXL。结果显示与现有的优化器相比,不断有改进。例如,在LORA微调Gemma-2B期间,用LORA-REITE取代Adam,在超自然教学中取得了4.6的精准收益,在其他四个LM基准(HellaSwag、Arcchallenge、GSM8K、OpenBQA)中实现了3.5的精准收益。
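The lack of transformation invariance described above can be seen in the smallest possible case. The sketch below assumes a rank-1, 1x1 "LoRA" update w = b * a and plain gradient descent (not Adam and not LoRA-RITE); the numbers and names are illustrative only. Two factorizations that represent the same weight receive very different weight updates from the same gradient.

```python
def sgd_step(b, a, grad_w, lr):
    """One gradient-descent step on the factors of w = b * a."""
    grad_b = grad_w * a  # chain rule: dL/db = dL/dw * a
    grad_a = grad_w * b  # chain rule: dL/da = dL/dw * b
    return b - lr * grad_b, a - lr * grad_a

grad_w, lr = 1.0, 0.1

# Two factorizations of the same weight: (b, a) and (s * b, a / s) with s = 10.
b1, a1 = 1.0, 1.0
b2, a2 = 10.0, 0.1
w_before = (b1 * a1, b2 * a2)

nb1, na1 = sgd_step(b1, a1, grad_w, lr)
nb2, na2 = sgd_step(b2, a2, grad_w, lr)
w_after = (nb1 * na1, nb2 * na2)

print("before:", w_before)  # the two weights coincide
print("after: ", w_after)   # the two weights no longer coincide
```

A transformation-invariant optimizer would produce the same `w_after` for both factorizations; the abstract states that LoRA-RITE achieves this property through matrix preconditioning.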
Article 86
Title@2025-07-16 (3): SCULPT: Systematic Tuning of Long Prompts
Title: SCULPT: Systematic Tuning of Long Prompts | SCULPT: Systematisches Tuning von langen Prompts | SCULPT: 长期提示系统图示 2410.20788v3 |
Authors (6): Shanu Kumar, Akhila Yesantarao Venkata, Shubhanshu Khandelwal, Bishal Santra, Parag Agrawal, Manish Gupta
Prompt optimization is essential for effective utilization of large language models (LLMs) across diverse tasks. While existing optimization methods are effective in optimizing short prompts, they struggle with longer, more complex ones, often risking information loss and being sensitive to small perturbations. To address these challenges, we propose SCULPT (Systematic Tuning of Long Prompts), a framework that treats prompt optimization as a hierarchical tree refinement problem. SCULPT represents prompts as tree structures, enabling targeted modifications while preserving contextual integrity. It employs a Critic-Actor framework that generates reflections and applies actions to refine the prompt. Evaluations demonstrate SCULPT’s effectiveness on long prompts, its robustness to adversarial perturbations, and its ability to generate high-performing prompts even without any initial human-written prompt. Compared to existing state of the art methods, SCULPT consistently improves LLM performance by preserving essential task information while applying structured refinements. Both qualitative and quantitative analyses show that SCULPT produces more stable and interpretable prompt modifications, ensuring better generalization across tasks.
快速优化是有效利用大型语言模型(LLMS)完成不同任务的关键。虽然现有的优化方法在优化短效提示方面是有效的,但它们与长效、更复杂的方法挣扎,往往冒着信息丢失的风险,对小扰动很敏感。为了应对这些挑战,我们提议ScULPT(长效提示系统图),这个框架将快速优化视为一个分级的树细化问题。SCULPT代表了树结构的灵敏度,在保持背景完整性的同时进行有针对性的修改。它使用一个Critic-Actor框架来产生反省,并采取行动来改进快速的。评价表明SCULPT在长效上的有效性,它对对抗性扰动的坚固性,以及即使没有初步的人写速度也能产生高性提示的能力。与艺术方法的现有状况相比,SCULPT在使用结构完善的同时通过保存基本的任务信息不断提高LMM的性能。 定性和定量分析都表明,SCULPT产生更稳定且可解释的及时修改,确保任务之间更加普遍化。
Article 87
Title@2025-07-16 (3): Memorization Inheritance in Sequence-Level Knowledge Distillation for Neural Machine Translation
Title: Memorization Inheritance in Sequence-Level Knowledge Distillation for Neural Machine Translation | Erinnerungsvererbung in Sequenz-Level-Wissensdestillation für neurale maschinelle Übersetzung | 神经机机翻译序列级知识蒸馏中的记忆力继承 2502.01491v2 |
Authors (2): Verna Dankers, Vikas Raunak
In this work, we explore how instance-level memorization in the teacher Neural Machine Translation (NMT) model gets inherited by the student model in sequence-level knowledge distillation (SeqKD). We find that despite not directly seeing the original training data, students memorize more than baseline models (models of the same size, trained on the original data) – 3.4% for exact matches and 57% for extractive memorization – and show increased hallucination rates. Further, under this SeqKD setting, we also characterize how students behave on specific training data subgroups, such as subgroups with low quality and specific counterfactual memorization (CM) scores, and find that students exhibit amplified denoising on low-quality subgroups. Finally, we propose a modification to SeqKD named Adaptive-SeqKD, which intervenes in SeqKD to reduce memorization and hallucinations. Overall, we recommend caution when applying SeqKD: students inherit both their teachers’ superior performance and their fault modes, thereby requiring active monitoring.
在这项工作中,我们探索了教师神经机器翻译(NMT)模型的例数级记忆化模式是如何被学生模型在序列级知识蒸馏(SeqKD)中继承的。我们发现,尽管学生没有直接看到原始培训数据,但记忆中比基线模型(相同尺寸的模型,在原始数据方面受过培训)多 – – 精确匹配3.4%,采掘记忆中57% – – 并显示出更高的幻觉率。此外,在SeqKD的设置下,我们还描述了学生在特定培训数据分组中的行为方式,如质量低和具体反事实记忆分(CM),发现学生在低质量分组上展示了放大的消音能力。最后,我们建议修改SeqKD,在SeqKD进行干预,以减少记忆化和幻觉。总体而言,我们建议在应用SeqKD:学生继承其教师的优异性表现和错失模式时要谨慎,因此需要积极监测。
Article 88
Title@2025-07-16 (3): Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models
Title: Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models | Mono-InternVL-1.5: Auf dem Weg zu günstigeren und schnelleren monolithischen multimodalen großen Sprachmodellen | Mono-InternVL-1.5:走向廉价和更快单极多式多语言模式 2507.12566v1 |
Authors (12): Gen Luo, Wenhan Dou, Wenhao Li, Zhaokai Wang, Xue Yang, Changyao Tian, Hao Li, Weiyun Wang, Wenhai Wang, Xizhou Zhu, Yu Qiao, Jifeng Dai
This paper focuses on monolithic Multimodal Large Language Models (MLLMs), which integrate visual encoding and language decoding into a single model. Existing structures and pre-training strategies for monolithic MLLMs often suffer from unstable optimization and catastrophic forgetting. To address these challenges, our key idea is to embed a new visual parameter space into a pre-trained LLM, enabling stable learning of visual knowledge from noisy data via delta tuning. Based on this principle, we first introduce Mono-InternVL, an advanced monolithic MLLM that incorporates a set of visual experts through a multimodal mixture-of-experts architecture. In addition, we design an innovative Endogenous Visual Pre-training (EViP) for Mono-InternVL to maximize its visual capabilities via progressive learning. Mono-InternVL achieves competitive performance against existing MLLMs but also leads to relatively expensive data cost. Therefore, we further present Mono-InternVL-1.5, a cheaper and stronger monolithic MLLM equipped with an improved EViP (EViP++). EViP++ introduces additional visual attention experts to Mono-InternVL-1.5 and re-organizes the pre-training process in an efficient manner. During inference, it includes a fused CUDA kernel to speed up its MoE operations. With these designs, Mono-InternVL-1.5 significantly reduces training and inference costs, while still maintaining competitive performance with Mono-InternVL. To evaluate our approach, we conduct extensive experiments across 15 benchmarks. Results demonstrate that Mono-InternVL outperforms existing monolithic MLLMs on 12 out of 15 benchmarks, e.g., +114-point improvement over Emu3 on OCRBench. Compared to its modular counterpart, i.e., InternVL-1.5, Mono-InternVL-1.5 achieves similar multimodal performance while reducing first-token latency by up to 69%. Code and models are released at https://github.com/OpenGVLab/Mono-InternVL.
本文关注单体式多模态大语言模型(MLLM),这类模型将视觉编码和语言解码集成到单一模型中。现有的单体式MLLM结构和预训练策略常常受到优化不稳定和灾难性遗忘的困扰。为应对这些挑战,我们的核心思路是在预训练的LLM中嵌入一个新的视觉参数空间,通过增量调优(delta tuning)从含噪数据中稳定地学习视觉知识。基于这一原则,我们首先提出Mono-InternVL,这是一种先进的单体式MLLM,通过多模态混合专家架构引入一组视觉专家。此外,我们为Mono-InternVL设计了创新的内生视觉预训练(EViP),通过渐进式学习最大化其视觉能力。Mono-InternVL相对现有MLLM取得了有竞争力的性能,但数据成本也相对较高。因此,我们进一步提出Mono-InternVL-1.5,这是一种更便宜、更强的单体式MLLM,配备了改进的EViP(EViP++)。EViP++为Mono-InternVL-1.5引入了额外的视觉注意力专家,并以高效的方式重新组织预训练流程。在推理阶段,它包含一个融合的CUDA内核以加速MoE运算。凭借这些设计,Mono-InternVL-1.5显著降低了训练和推理成本,同时保持与Mono-InternVL相当的性能。为评估我们的方法,我们在15个基准上进行了大量实验。结果表明,Mono-InternVL在15个基准中的12个上优于现有单体式MLLM,例如在OCRBench上比Emu3提高114分。与其模块化对应模型InternVL-1.5相比,Mono-InternVL-1.5取得了相近的多模态性能,同时将首个词元延迟最多降低69%。代码和模型发布于 https://github.com/OpenGVLab/Mono-InternVL。
Article 89
Title@2025-07-16 (3): What Factors Affect LLMs and RLLMs in Financial Question Answering?
Title: What Factors Affect LLMs and RLLMs in Financial Question Answering? | Welche Faktoren beeinflussen LLMs und RLLMs bei der Beantwortung finanzieller Fragen? | 在回答财务问题时,哪些因素影响到理疗母和理疗母(RLLMs)? 2507.08339v2 |
Authors (6): Peng Wang, Xuesi Hu, Jiageng Wu, Yuntao Zou, Qiancheng Zhang, Dagang Li
Recently, the development of large language models (LLMs) and reasoning large language models (RLLMs) has gained considerable attention from many researchers. RLLMs enhance the reasoning capabilities of LLMs through Long Chain-of-Thought (Long CoT) processes, significantly improving the performance of LLMs in addressing complex problems. However, few works systematically explore which methods can fully unlock the performance of LLMs and RLLMs within the financial domain. To investigate the impact of various methods on LLMs and RLLMs, we utilize five LLMs and three RLLMs to assess the effects of prompting methods, agentic frameworks, and multilingual alignment methods on financial question-answering tasks. Our research findings indicate: (1) Current prompting methods and agentic frameworks enhance the performance of LLMs in financial question answering by simulating Long CoT; (2) RLLMs possess inherent Long CoT capabilities, which limits the effectiveness of conventional methods in further enhancing their performance; (3) Current advanced multilingual alignment methods primarily improve the multilingual performance of LLMs by extending the reasoning length, which yields minimal benefits for RLLMs. We hope that this study can serve as an important reference for LLMs and RLLMs in the field of financial question answering.
最近,开发大型语言模型(LLMs)和推理大型语言模型(RLLMs)引起了许多研究人员的极大关注。RLLMs通过长链搜索(Long CoT)程序提高了LLMs的推理能力,大大提高了LLMs在解决复杂问题方面的绩效;然而,很少有工作系统地探索何种方法能够完全解开LLMs和RLLMs在财务领域的业绩;为了调查各种方法对LLMs和RLMs的影响,我们利用5 LMs和3 RLLMs评估促进方法、代理人框架和多语种协调方法对财务问题解答任务的影响。我们的研究结果表明:(1) 目前快速的方法和代理框架可以提高LMs在财务问题中由模拟Long CoT回答的绩效;(2) RLLMs拥有固有的长期 CoT能力,这限制了常规方法在进一步提高其绩效方面的效力。(3)目前先进的多语种调整方法主要通过延长推理长度来提高LMs的多语种性绩效,这给RLMs公司带来极少的好处。我们希望这项研究能够作为LLMs和RLLLMs实地回答的财务问题的重要参考。
Article 90
Title@2025-07-16 (3): Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility
Title: Is This Just Fantasy? Language Model Representations Reflect Human Judgments of Event Plausibility | Ist das nur Fantasie? Sprachmodelldarstellungen spiegeln menschliche Urteile von Ereignissen wider Plausibilität | 这只是幻想吗?语言模型代表反映了人类对事件的判断 2507.12553v1 |
Authors (6): Michael A. Lepori, Jennifer Hu, Ishita Dasgupta, Roma Patel, Thomas Serre, Ellie Pavlick
Language models (LMs) are used for a diverse range of tasks, from question answering to writing fantastical stories. In order to reliably accomplish these tasks, LMs must be able to discern the modal category of a sentence (i.e., whether it describes something that is possible, impossible, completely nonsensical, etc.). However, recent studies have called into question the ability of LMs to categorize sentences according to modality (Michaelov et al., 2025; Kauf et al., 2023). In this work, we identify linear representations that discriminate between modal categories within a variety of LMs, or modal difference vectors. Analysis of modal difference vectors reveals that LMs have access to more reliable modal categorization judgments than previously reported. Furthermore, we find that modal difference vectors emerge in a consistent order as models become more competent (i.e., through training steps, layers, and parameter count). Notably, we find that modal difference vectors identified within LM activations can be used to model fine-grained human categorization behavior. This potentially provides a novel view into how human participants distinguish between modal categories, which we explore by correlating projections along modal difference vectors with human participants’ ratings of interpretable features. In summary, we derive new insights into LM modal categorization using techniques from mechanistic interpretability, with the potential to inform our understanding of modal categorization in humans.
语言模型(LMS)用于各种各样的任务,从回答问题到撰写梦幻故事。为了可靠地完成这些任务,LMS必须能够辨别一个句子的模式类别(即它是否描述了可能、不可能、完全不敏感的东西,等等)。然而,最近的研究使人们对LMS按照模式(Michaelelov等人,2025年;Kauf等人,2023年)对判决进行分类的能力产生疑问。在这项工作中,我们确定在各种LMs或模式差异矢量中区分模式类别之间的线性表达方式。对模式差异矢量的分析表明,LMS能够获得比以前报告的更可靠的模式分类判断。此外,我们发现模式差异是按一个一致的顺序出现的(例如,通过培训步骤、层次和参数计数) 。我们发现,在LM 启动中发现的模式差异矢量可以用来模拟细微的人类分类行为。这有可能为人类参与者提供一种新的观点,如何用模型的分类方法来区分模型的分类,我们用模型的分类方法来对模型的分类方法进行对比。
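One common way to obtain a linear direction that separates two categories of activations is a difference of class means, followed by projecting new activations onto that direction. The sketch below uses that technique as an assumption: the abstract does not state how the modal difference vectors are computed, and the 3-d "activations" and function names are invented for illustration.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def difference_vector(class_a, class_b):
    """Direction pointing from the class_b mean to the class_a mean."""
    return [a - b for a, b in zip(mean_vector(class_a), mean_vector(class_b))]

def project(vector, direction):
    """Scalar projection of `vector` onto the unit-normalised `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(v * d for v, d in zip(vector, direction)) / norm

# Toy activations for sentences describing possible vs. impossible events.
possible = [[1.0, 0.2, 0.0], [0.8, 0.1, 0.1]]
impossible = [[-0.9, 0.3, 0.0], [-1.1, 0.0, 0.2]]

direction = difference_vector(possible, impossible)
scores_possible = [project(v, direction) for v in possible]
scores_impossible = [project(v, direction) for v in impossible]
print(scores_possible, scores_impossible)
```

Projections like these are the kind of scalar that could be correlated with human plausibility ratings, as the abstract describes for its own vectors.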
Article 91
Title@2025-07-16 (3): Prompt Perturbations Reveal Human-Like Biases in LLM Survey Responses
Title: Prompt Perturbations Reveal Human-Like Biases in LLM Survey Responses | Prompt Störungen Enthüllen Mensch-ähnliche Biasen in LLM Survey Responses | LLM调查答复中的即时扰动干扰现象 2507.07188v2 |
Authors (3): Jens Rupprecht, Georg Ahnert, Markus Strohmaier
Large Language Models (LLMs) are increasingly used as proxies for human subjects in social science surveys, but their reliability and susceptibility to known response biases are poorly understood. This paper investigates the response robustness of LLMs in normative survey contexts: we test nine diverse LLMs on questions from the World Values Survey (WVS), applying a comprehensive set of 11 perturbations to both question phrasing and answer option structure, resulting in over 167,000 simulated interviews. In doing so, we not only reveal LLMs’ vulnerabilities to perturbations but also show that all tested models exhibit a consistent recency bias of varying intensity, disproportionately favoring the last-presented answer option. While larger models are generally more robust, all models remain sensitive to semantic variations like paraphrasing and to combined perturbations. By applying a set of perturbations, we reveal that LLMs partially align with survey response biases identified in humans. This underscores the critical importance of prompt design and robustness testing when using LLMs to generate synthetic survey data.
大型语言模型(LLMS)在社会科学调查中越来越多地被用作人类科目的代名词,但是其可靠性和对已知反应偏差的易感性却不甚清楚。本文调查了LMS在规范调查背景下的反应强度。我们根据世界价值调查(WVS)的问题测试了九种不同的LMS,对两个问题处理和回答选项结构都应用了一套11次的全套扰动,导致模拟访谈超过167 000次。我们这样做不仅暴露了LLMS易受扰动的脆弱性,而且表明所有测试的模型都显示出在强度上各不相同的一贯耐受偏向性偏向于最后提出的答复选项。虽然较大的模型一般比较强,但所有模型对于诸如副光学和综合扰动等语义变化仍然很敏感。我们通过应用一套扰动图,发现LLMS与在人类中发现的调查反应偏差部分一致。这突出表明,在使用LMMS生成合成调查数据时,迅速设计和稳健性测试至关重要。
Article 92
Title@2025-07-16 (3): Modeling Open-World Cognition as On-Demand Synthesis of Probabilistic Models
Title: Modeling Open-World Cognition as On-Demand Synthesis of Probabilistic Models | Modellierung der Open-World-Kognition als On-Demand-Synthese probabilistischer Modelle | 将开放世界的认知建模作为概率模型的 “ 现场合成 “ 模型 2507.12547v1 |
Authors (11): Lionel Wong, Katherine M. Collins, Lance Ying, Cedegao E. Zhang, Adrian Weller, Tobias Gersternberg, Timothy O’Donnell, Alexander K. Lew, Jacob D. Andreas, Joshua B. Tenenbaum, Tyler Brooke-Wilson
When faced with novel situations, people are able to marshal relevant considerations from a wide range of background knowledge and put these to use in inferences and predictions. What permits us to draw in globally relevant information and reason over it coherently? Here, we explore the hypothesis that people use a combination of distributed and symbolic representations to construct bespoke mental models tailored to novel situations. We propose a computational implementation of this idea – a “Model Synthesis Architecture” (MSA) – using language models to implement global relevance-based retrieval and model synthesis and probabilistic programs to implement bespoke, coherent world models. We evaluate our MSA as a model of human judgments on a novel reasoning dataset. The dataset – built around a “Model Olympics” domain of sports vignettes – tests models’ capacity for human-like, open-ended reasoning by requiring (i) judgments about novel causal structures described in language; (ii) drawing on large bodies of background knowledge; and (iii) doing both in light of observations that introduce arbitrary novel variables. Our MSA approach captures human judgments better than language model-only baselines, under both direct and chain-of-thought generations from the LM that supports model synthesis. These results suggest that MSAs can be implemented in a way that mirrors people’s ability to deliver locally coherent reasoning over globally relevant variables, offering a path to understanding and replicating human reasoning in open-ended domains.
在面临新情况时,人们能够从广泛的背景知识中汇集相关考虑,并将这些因素用于推论和预测。什么能使我们在全球范围内以一致的方式获取相关信息和理性?在这里,我们探讨一种假设,即人们使用分布式和象征式的表达方式相结合来构建符合新情况的精神模型;我们建议对这一概念进行计算实施 – – “ 模型综合综合架构 “ (MSA) – – 使用语言模型来实施基于全球相关性的检索和模型合成,以及概率化方案,以实施直观、一致的世界模型。我们评估我们的特派任务生活津贴,将其作为人类对新推理数据集的判断模型。数据集 – – 围绕“奥林匹克运动”运动“体育名流域 – – 测试模型” 来为人性、开放型的推理能力而构建。我们建议(一) 判断语言中描述的新因果结构;(二) 借鉴大量的背景知识;以及(三) 结合引入任意的新变数的观察,既能反映人类的人类的判断,又能更好理解语言模型的基线,在直接和连锁推理学能力下,能够从全球推理推算出人类的推理结果。
Article 93
Title@2025-07-16 (3): Language Models Improve When Pretraining Data Matches Target Tasks
Title: Language Models Improve When Pretraining Data Matches Target Tasks | Sprachmodelle verbessern, wenn die Vorschulung von Daten zu Zielaufgaben passt | 培训前数据匹配目标任务时改进语言模式 2507.12466v1 |
Authors (10): David Mizrahi, Anders Boesen Lindbo Larsen, Jesse Allardice, Suzie Petryk, Yuri Gorokhov, Jeffrey Li, Alex Fang, Josh Gardner, Tom Gunter, Afshin Dehghan
Every data selection method inherently has a target. In practice, these targets often emerge implicitly through benchmark-driven iteration: researchers develop selection strategies, train models, measure benchmark performance, then refine accordingly. This raises a natural question: what happens when we make this optimization explicit? To explore this, we propose benchmark-targeted ranking (BETR), a simple method that selects pretraining documents based on similarity to benchmark training examples. BETR embeds benchmark examples and a sample of pretraining documents in a shared space, scores this sample by similarity to benchmarks, then trains a lightweight classifier to predict these scores for the full corpus. We compare data selection methods by training over 500 models spanning $10^{19}$ to $10^{22}$ FLOPs and fitting scaling laws to them. From this, we find that simply aligning pretraining data to evaluation benchmarks using BETR achieves a 2.1x compute multiplier over DCLM-Baseline (4.7x over unfiltered data) and improves performance on 9 out of 10 tasks across all scales. BETR also generalizes well: when targeting a diverse set of benchmarks disjoint from our evaluation suite, it still matches or outperforms baselines. Our scaling analysis further reveals a clear trend: larger models require less aggressive filtering. Overall, our findings show that directly matching pretraining data to target tasks precisely shapes model capabilities and highlight that optimal selection strategies must adapt to model scale.
每个数据选择方法本身都有目标。在实践中,这些指标往往通过基准驱动的迭代而隐含地出现:研究人员制定选择战略,培训模型,衡量基准业绩,然后进行相应的完善。这提出了一个自然的问题:当我们使优化明确时会发生什么情况?为了对此进行探讨,我们提议基准目标排名(BETR),这是根据与基准培训范例相似的办法来选择培训前文件的简单方法。BETR在共享空间中嵌入基准范例和训练前文件样本,以类似基准的评分,然后训练一个轻量分类员来预测全套的评分。我们通过培训500多个模型来比较数据选择方法,这些模型覆盖10美元至10美元至22美元。我们从中发现,简单地将培训前的数据与评估基准相匹配,而采用与基准培训范例相似,在DCLM-Baseline(4.7x高于未过滤模型的4.7x)的基础上计算乘数乘数乘数乘数,并在所有尺度的10项任务中提高业绩的比值,然后培训一个精细的分数:当针对不同基准的设定标准时,从10美元到10美元到10美元,我们评价套的比值则要比值,我们更精确的比标,我们更精确的比标,我们更精确地显示一个比标的比比比比比比标,我们更更精确的比比标。
Article 94
Title@2025-07-16 (3): Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training
Title: Scaling Up RL: Unlocking Diverse Reasoning in LLMs via Prolonged Training | Scaling Up RL: Unlocking Diverse Reasoning in LLMs durch längeres Training | 提升RL:通过长期培训解锁LLMs的多样化理由 2507.12507v1 |
Authors (22): Mingjie Liu, Shizhe Diao, Jian Hu, Ximing Lu, Xin Dong, Hao Zhang, Alexander Bukharin, Shaokun Zhang, Jiaqi Zeng, Makesh Narsimhan Sreedhar, Gerald Shen, David Mosallanezhad, Di Zhang, Jonas Yang, June Yang, Oleksii Kuchaiev, Guilin Liu, Zhiding Yu, Pavlo Molchanov, Yejin Choi, Jan Kautz, Yi Dong
Recent advancements in reasoning-focused language models such as OpenAI’s O1 and DeepSeek-R1 have shown that scaling test-time computation, through chain-of-thought reasoning and iterative exploration, can yield substantial improvements on complex tasks like mathematics and code generation. These breakthroughs have been driven by large-scale reinforcement learning (RL), particularly when combined with verifiable reward signals that provide objective and grounded supervision. In this report, we investigate the effects of prolonged reinforcement learning on a small language model across a diverse set of reasoning domains. Our work identifies several key ingredients for effective training, including the use of verifiable reward tasks, enhancements to Group Relative Policy Optimization (GRPO), and practical techniques to improve training stability and generalization. We introduce controlled KL regularization, clipping ratio, and periodic reference policy resets as critical components for unlocking long-term performance gains. Our model achieves significant improvements over strong baselines, including +14.7% on math, +13.9% on coding, and +54.8% on logic puzzle tasks. To facilitate continued research, we release our model publicly.
以推理为核心的语言模型(如OpenAI的O1和DeepSeek-R1)的最新进展表明,通过思维链推理和迭代探索来扩展测试时计算,可以在数学和代码生成等复杂任务上带来显著提升。这些突破由大规模强化学习(RL)驱动,尤其是在与可验证的奖励信号相结合时,后者提供了客观且有依据的监督。在本报告中,我们研究了长时间强化学习对一个小型语言模型在多种推理领域中的影响。我们的工作确定了有效训练的若干关键要素,包括使用可验证奖励任务、对组相对策略优化(GRPO)的改进,以及提升训练稳定性和泛化能力的实用技术。我们引入受控的KL正则化、裁剪比率和定期重置参考策略,作为释放长期性能收益的关键组成部分。我们的模型相对于强基线取得了显著提升,包括数学+14.7%、编程+13.9%、逻辑谜题+54.8%。为促进后续研究,我们公开发布了模型。
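The ingredients named in the abstract above (group-relative advantages, a clipping ratio, and a KL penalty against a reference policy) can be sketched as follows. This is a minimal illustration of the commonly described GRPO/PPO-style formulation, not the report's enhanced variant; the function names, the separate `clip_low`/`clip_high` bounds, and the `beta` coefficient are assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples drawn for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


def clipped_term(ratio, advantage, clip_low=0.2, clip_high=0.2):
    """PPO-style clipped surrogate for one token (a quantity to maximize).

    ratio is pi_new(token) / pi_old(token); the bounds are hypothetical defaults.
    """
    clipped_ratio = min(max(ratio, 1.0 - clip_low), 1.0 + clip_high)
    return min(ratio * advantage, clipped_ratio * advantage)


def kl_penalized(surrogate, kl_to_reference, beta=0.01):
    """Subtract a KL penalty that keeps the policy near a reference policy.

    A periodic reference reset would simply replace the reference policy with
    a copy of the current policy, after which this KL term restarts near zero.
    """
    return surrogate - beta * kl_to_reference
```

With verifiable rewards, each entry of `rewards` would be a pass/fail or scored check on one sampled response for the same prompt.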
Article 95
Title@2025-07-16 (3): TD-EVAL: Revisiting Task-Oriented Dialogue Evaluation by Combining Turn-Level Precision with Dialogue-Level Comparisons
Title: TD-EVAL: Revisiting Task-Oriented Dialogue Evaluation by Combining Turn-Level Precision with Dialogue-Level Comparisons | TD-EVAL: Überprüfung der aufgabenorientierten Dialogbewertung durch Kombination von Turn-Level-Präzision mit Dialog-Level-Vergleichen | TD-EVAL: 重新审议以任务为导向的对话评价,将转折点精确度与对话级别比较相结合 2504.19982v2 |
Authors (7): Emre Can Acikgoz, Carl Guo, Suvodip Dey, Akul Datta, Takyoung Kim, Gokhan Tur, Dilek Hakkani-Tür
Task-oriented dialogue (TOD) systems are experiencing a revolution driven by Large Language Models (LLMs), yet the evaluation methodologies for these systems remain insufficient for their growing sophistication. While traditional automatic metrics effectively assessed earlier modular systems, they focus solely on the dialogue level and cannot detect critical intermediate errors that can arise during user-agent interactions. In this paper, we introduce TD-EVAL (Turn and Dialogue-level Evaluation), a two-step evaluation framework that unifies fine-grained turn-level analysis with holistic dialogue-level comparisons. At turn level, we evaluate each response along three TOD-specific dimensions: conversation cohesion, backend knowledge consistency, and policy compliance. Meanwhile, we design TOD Agent Arena that uses pairwise comparisons to provide a measure of dialogue-level quality. Through experiments on MultiWOZ 2.4 and $\tau$-Bench, we demonstrate that TD-EVAL effectively identifies the conversational errors that conventional metrics miss. Furthermore, TD-EVAL exhibits better alignment with human judgments than traditional and LLM-based metrics. These findings demonstrate that TD-EVAL introduces a new paradigm for TOD system evaluation, efficiently assessing both turn and system levels with a plug-and-play framework for future research.
以任务为导向的对话(TOD)系统正在经历由大语言模型(LLM)驱动的变革,但这些系统的评估方法仍不足以应对其日益提升的复杂程度。传统自动指标虽能有效评估早期的模块化系统,但只关注对话层面,无法发现用户与智能体交互过程中可能出现的关键中间错误。在本文中,我们提出TD-EVAL(Turn and Dialogue-level Evaluation,轮次与对话级评估),这是一个两步评估框架,将细粒度的轮次级分析与整体的对话级比较统一起来。在轮次层面,我们从三个TOD特有的维度评估每个回复:对话连贯性、后端知识一致性和策略合规性。同时,我们设计了TOD Agent Arena,利用成对比较来衡量对话级质量。通过在MultiWOZ 2.4和$\tau$-Bench上的实验,我们证明TD-EVAL能有效识别传统指标遗漏的对话错误。此外,与传统指标和基于LLM的指标相比,TD-EVAL与人类判断的一致性更好。这些结果表明,TD-EVAL为TOD系统评估引入了新范式,能以即插即用的框架高效评估轮次和系统两个层面,供未来研究使用。
Article 96
Title@2025-07-16 (3): S2WTM: Spherical Sliced-Wasserstein Autoencoder for Topic Modeling
Title: S2WTM: Spherical Sliced-Wasserstein Autoencoder for Topic Modeling | S2WTM: Spherical Sliced-Wasserstein Autoencoder für Themenmodellierung | S2WTM: 用于专题建模的球球锯子-Wasserstein自动编码器 2507.12451v1 |
Authors (2): Suman Adhya, Debarshi Kumar Sanyal
Modeling latent representations in a hyperspherical space has proven effective for capturing directional similarities in high-dimensional text data, benefiting topic modeling. Variational autoencoder-based neural topic models (VAE-NTMs) commonly adopt the von Mises-Fisher prior to encode hyperspherical structure. However, VAE-NTMs often suffer from posterior collapse, where the KL divergence term in the objective function highly diminishes, leading to ineffective latent representations. To mitigate this issue while modeling hyperspherical structure in the latent space, we propose the Spherical Sliced Wasserstein Autoencoder for Topic Modeling (S2WTM). S2WTM employs a prior distribution supported on the unit hypersphere and leverages the Spherical Sliced-Wasserstein distance to align the aggregated posterior distribution with the prior. Experimental results demonstrate that S2WTM outperforms state-of-the-art topic models, generating more coherent and diverse topics while improving performance on downstream tasks.
在超球面空间中建模潜在表示已被证明能有效捕捉高维文本数据中的方向相似性,有利于主题建模。基于变分自编码器的神经主题模型(VAE-NTM)通常采用von Mises-Fisher先验来编码超球面结构。然而,VAE-NTM常常出现后验坍塌,即目标函数中的KL散度项大幅减小,导致潜在表示失效。为了在潜在空间中建模超球面结构的同时缓解这一问题,我们提出用于主题建模的球面切片Wasserstein自编码器(S2WTM)。S2WTM采用支撑在单位超球面上的先验分布,并利用球面切片Wasserstein距离使聚合后验分布与先验对齐。实验结果表明,S2WTM优于最先进的主题模型,能生成更连贯、更多样的主题,同时提升下游任务的性能。
Article 97
Title@2025-07-16 (3): Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models
Title: Can We Predict Alignment Before Models Finish Thinking? Towards Monitoring Misaligned Reasoning Models | Können wir eine Ausrichtung voraussagen, bevor Modelle das Denken beenden? | 我们能否在模型完成思考之前实现预测一致? 2507.12428v1 |
Authors (3): Yik Siu Chan, Zheng-Xin Yong, Stephen H. Bach
Open-weights reasoning language models generate long chains-of-thought (CoTs) before producing a final response, which improves performance but introduces additional alignment risks, with harmful content often appearing in both the CoTs and the final outputs. In this work, we investigate if we can use CoTs to predict final response misalignment. We evaluate a range of monitoring approaches, including humans, highly-capable large language models, and text classifiers, using either CoT text or activations. First, we find that a simple linear probe trained on CoT activations can significantly outperform all text-based methods in predicting whether a final response will be safe or unsafe. CoT texts are often unfaithful and can mislead humans and classifiers, while model latents (i.e., CoT activations) offer a more reliable predictive signal. Second, the probe makes accurate predictions before reasoning completes, achieving strong performance even when applied to early CoT segments. These findings generalize across model sizes, families, and safety benchmarks, suggesting that lightweight probes could enable real-time safety monitoring and early intervention during generation.
开放加权推理语言模型在作出最终反应之前产生长期的思维链(CoTs),这提高了业绩,但增加了调整风险,有害内容往往出现在CoTs和最终产出中。在这项工作中,我们调查是否可以使用CoTs预测最终反应不匹配。我们评估了一系列监测方法,包括人、高度可控的大型语言模型和文本分类器,使用COT文本或激活。首先,我们发现,在CoT启动方面受过训练的简单线性探测器可以大大超过所有基于文本的方法,从而预测最终反应是否安全或不安全。CoT文本往往不忠,可以误导人类和分类者,而模型潜伏(即Cot激活)则提供更可靠的预测信号。第二,在推理完成之前,在应用早期COT部分时,这些探测结果也会达到很强的性能。这些结果概括了模型大小、家庭和安全基准,表明轻度探测器能够进行实时的安全监测和新一代早期干预。
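The "simple linear probe trained on CoT activations" in the abstract above can be illustrated with a logistic-regression probe. The sketch below is self-contained and uses made-up 2-dimensional vectors in place of real model activations; the data, learning rate, and step count are assumptions for illustration, not the authors' setup.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_probe(activations, labels, lr=1.0, steps=500):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim = len(activations[0])
    n = len(activations)
    w = [0.0] * dim
    b = 0.0
    for _ in range(steps):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(activations, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log loss with respect to the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_unsafe(w, b, activation):
    """Return True if the probe predicts an unsafe final response."""
    logit = sum(wi * xi for wi, xi in zip(w, activation)) + b
    return sigmoid(logit) >= 0.5


# Toy stand-ins for pooled CoT activations (label 1 = unsafe final response).
TOY_ACTIVATIONS = [
    [0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
    [1.0, 0.9], [0.8, 1.0], [0.9, 0.8],
]
TOY_LABELS = [0, 0, 0, 1, 1, 1]
```

To mimic prediction before reasoning completes, the same probe would be applied to activations taken from an early CoT segment rather than the full trace.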
Article 98
Title@2025-07-16 (3): Advancing Retrieval-Augmented Generation for Structured Enterprise and Internal Data
Title: Advancing Retrieval-Augmented Generation for Structured Enterprise and Internal Data | Weiterentwicklung der retrieval-generierten Generation für strukturierte Unternehmen und interne Daten | 结构化企业和内部数据先进检索-启动生成 2507.12425v1 |
Authors (1): Chandana Cheerla
Organizations increasingly rely on proprietary enterprise data, including HR records, structured reports, and tabular documents, for critical decision-making. While Large Language Models (LLMs) have strong generative capabilities, they are limited by static pretraining, short context windows, and challenges in processing heterogeneous data formats. Conventional Retrieval-Augmented Generation (RAG) frameworks address some of these gaps but often struggle with structured and semi-structured data. This work proposes an advanced RAG framework that combines hybrid retrieval strategies using dense embeddings (all-mpnet-base-v2) and BM25, enhanced by metadata-aware filtering with SpaCy NER and cross-encoder reranking. The framework applies semantic chunking to maintain textual coherence and retains tabular data structures to preserve row-column integrity. Quantized indexing optimizes retrieval efficiency, while human-in-the-loop feedback and conversation memory improve adaptability. Experiments on enterprise datasets show notable improvements: Precision@5 increased by 15 percentage points (90 versus 75), Recall@5 by 13 percentage points (87 versus 74), and Mean Reciprocal Rank by 0.16 (0.85 versus 0.69). Qualitative evaluations show higher scores in Faithfulness (4.6 versus 3.0), Completeness (4.2 versus 2.5), and Relevance (4.5 versus 3.2) on a 5-point Likert scale. These results demonstrate the framework’s effectiveness in delivering accurate, comprehensive, and contextually relevant responses for enterprise tasks. Future work includes extending to multimodal data and integrating agent-based retrieval. The source code will be released at https://github.com/CheerlaChandana/Enterprise-Chatbot
各类组织越来越依赖专有的企业数据(包括人力资源记录、结构化报告和表格文档)进行关键决策。大语言模型(LLM)虽然具有很强的生成能力,但受限于静态预训练、较短的上下文窗口,以及处理异构数据格式方面的困难。传统的检索增强生成(RAG)框架弥补了其中一些不足,但往往难以处理结构化和半结构化数据。本文提出一个先进的RAG框架,将基于稠密嵌入(all-mpnet-base-v2)和BM25的混合检索策略结合起来,并通过使用SpaCy NER的元数据感知过滤和交叉编码器重排序加以增强。该框架采用语义分块来保持文本连贯性,并保留表格数据结构以维持行列完整性。量化索引优化了检索效率,而人在回路反馈和对话记忆提高了适应性。在企业数据集上的实验显示出明显改进:Precision@5提高15个百分点(90对75),Recall@5提高13个百分点(87对74),平均倒数排名(MRR)提高0.16(0.85对0.69)。定性评估显示,在5分制李克特量表上,忠实性(4.6对3.0)、完整性(4.2对2.5)和相关性(4.5对3.2)得分更高。这些结果证明了该框架在为企业任务提供准确、全面且与上下文相关的回答方面的有效性。未来工作包括扩展到多模态数据以及整合基于智能体的检索。源代码将发布于 https://github.com/CheerlaChandana/Enterprise-Chatbot
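For readers unfamiliar with the retrieval metrics reported above, the sketch below shows the standard definitions of Precision@k, Recall@k, and Mean Reciprocal Rank. The function names and toy document IDs are illustrative assumptions; the abstract does not describe how the authors computed their scores.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k retrieved items that are relevant."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / k


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant items that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids)


def mean_reciprocal_rank(rankings, relevant_sets):
    """Average of 1/rank of the first relevant item per query (0 if none found)."""
    total = 0.0
    for ranked_ids, relevant_ids in zip(rankings, relevant_sets):
        for position, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant_ids:
                total += 1.0 / position
                break
    return total / len(rankings)
```

On this reading, a Precision@5 of 90 versus 75 is a gap of 15 percentage points, which corresponds to a relative improvement of 20 percent.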
Article 99
Title@2025-07-16 (3): Simple Mechanistic Explanations for Out-Of-Context Reasoning
Title: Simple Mechanistic Explanations for Out-Of-Context Reasoning | Einfache mechanistische Erklärungen für Out-of-Context Reasoning | 外部逻辑理由的简单机械解释 2507.08218v2 |
Authors (5): Atticus Wang, Joshua Engels, Oliver Clive-Griffin, Senthooran Rajamanoharan, Neel Nanda
Out-of-context reasoning (OOCR) is a phenomenon in which fine-tuned LLMs exhibit surprisingly deep out-of-distribution generalization. Rather than learning shallow heuristics, they implicitly internalize and act on the consequences of observations scattered throughout the fine-tuning data. In this work, we investigate this phenomenon mechanistically and find that many instances of OOCR in the literature have a simple explanation: the LoRA fine-tuning essentially adds a constant steering vector, steering the model towards a general concept. This improves performance on the fine-tuning task and in many other concept-related domains, causing the surprising generalization. Moreover, we can directly train steering vectors for these tasks from scratch, which also induces OOCR. We find that our results hold even for a task that seems like it must involve conditional behavior (model backdoors); it turns out that unconditionally adding a steering vector is sufficient. Overall, our work presents one explanation of what gets learned during fine-tuning for OOCR tasks, contributing to the key question of why LLMs can reason out of context, an advanced capability that is highly relevant to their safe and reliable deployment.
超文本推理(OOCR)是一种现象,在这种现象中,微调的LLMs在分布上表现出出奇的深刻外向。它们不但没有学习浅重的偏差,反而隐含了内在化,并针对微调数据中分散的观测结果的后果采取行动。在这项工作中,我们机械地调查了这一现象,发现文献中许多OOCR的例子都有一个简单的解释:LORA微调基本上增加了一个不变的指导矢量,将模型引向一个一般概念。这改善了微调任务和许多其他概念相关领域的业绩,导致了令人惊讶的概括化。此外,我们可以直接训练从零开始指导矢量任务,这也引出OOCR。我们发现,我们的结果甚至维持着一项似乎必须包含有条件行为(模范后门)的任务;结果显示,无条件增加方向矢量就足够了。总体而言,我们的工作解释了在对OOCR任务进行微调时所学到的教益的一个解释,从而说明了为什么LMs可以从背景中解释出理由的关键问题,一种先进的能力对于其安全可靠部署具有高度相关性。
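The mechanism described above, a constant steering vector added regardless of the input, can be sketched in a few lines. Plain Python lists stand in for a layer's hidden states here; the function name and the `scale` parameter are assumptions for illustration rather than the authors' implementation.

```python
def add_steering_vector(hidden_states, steering_vector, scale=1.0):
    """Add the same vector to every token position's hidden state.

    hidden_states: one list of floats per token position.
    steering_vector: a single list of floats with the same width.
    The addition is unconditional: it does not depend on the input tokens.
    """
    return [
        [h + scale * s for h, s in zip(token_state, steering_vector)]
        for token_state in hidden_states
    ]
```

Training such a vector "from scratch" would mean treating `steering_vector` as the only learnable parameter while the model weights stay frozen.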
Article 100
Title@2025-07-16 (3): Probing for Arithmetic Errors in Language Models
Title: Probing for Arithmetic Errors in Language Models | Probing für Arithmetische Fehler in Sprachmodellen | 语言模型中算术错误的探测 2507.12379v1 |
Authors (3): Yucheng Sun, Alessandro Stolfo, Mrinmaya Sachan
We investigate whether internal activations in language models can be used to detect arithmetic errors. Starting with a controlled setting of 3-digit addition, we show that simple probes can accurately decode both the model’s predicted output and the correct answer from hidden states, regardless of whether the model’s output is correct. Building on this, we train lightweight error detectors that predict model correctness with over 90% accuracy. We then extend our analysis to structured chain-of-thought traces on addition-only GSM8K problems and find that probes trained on simple arithmetic generalize well to this more complex setting, revealing consistent internal representations. Finally, we demonstrate that these probes can guide selective re-prompting of erroneous reasoning steps, improving task accuracy with minimal disruption to correct outputs. Our findings suggest that arithmetic errors can be anticipated from internal activations alone, and that simple probes offer a viable path toward lightweight model self-correction.
我们调查语言模型中的内部激活是否可用于检测算术错误。 我们从3位数添加的受控设置开始, 显示简单的探测器可以准确解码模型的预测输出和隐藏状态的正确答案, 不论模型输出是否正确。 在此基础上, 我们训练轻型错误探测器, 预测模型的正确性, 精确度超过90% 。 然后将我们的分析扩展至仅增加的 GSM8K 问题的结构化思维链跟踪, 并发现经过简单计算训练的探测器能够非常概括这一更复杂的设置, 揭示出一致的内部表现 。 最后, 我们证明这些探测器可以指导有选择地重复错误的推理步骤, 提高任务准确性, 以最小的干扰来纠正输出 。 我们的发现表明, 算术错误可以单从内部激活中预测, 简单的探测器可以提供通向轻量模型自我校正的可行路径 。
Article 101
Title@2025-07-16 (3): Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker
Title: Developing Visual Augmented Q&A System using Scalable Vision Embedding Retrieval & Late Interaction Re-ranker | Entwicklung eines visuellen Augmented Q&A-Systems unter Verwendung eines skalierbaren Vision Embedding Retrieval & Late Interaction Re-ranker | 利用可扩展的视觉嵌入检索和后期交互重排序器开发视觉增强问答(Q&A)系统 2507.12378v1 |
Authors (3): Rachna Saxena, Abhijeet Kumar, Suresh Shanmugam
Traditional information extraction systems face challenges with text-only language models, which do not consider infographics (visual elements of information) such as tables, charts, and images that are often used to convey complex information to readers. Multimodal LLMs (MLLMs) face a needle-in-the-haystack problem, i.e., either a longer context length or a substantial number of documents as the search space. Late interaction mechanisms over visual language models have shown state-of-the-art performance in retrieval-based vision-augmented Q&A tasks. There are still a few challenges in using them for RAG-based multimodal Q&A. Firstly, many popular and widely adopted vector databases do not natively support multi-vector retrieval. Secondly, late interaction requires computation that inflates the space footprint and can hinder enterprise adoption. Lastly, the current late interaction mechanism does not leverage approximate nearest-neighbor search indexing methods for large speed-ups in the retrieval process. This paper explores a pragmatic approach to make the vision retrieval process scalable and efficient without compromising performance quality. We propose a multi-step custom implementation utilizing widely adopted hybrid search (metadata & embedding) and a state-of-the-art late interaction re-ranker to retrieve the best-matching pages. Finally, an MLLM is prompted as a reader to generate answers from the contextualized best-matching pages. Through experiments, we observe that the proposed design is scalable (significant speed-up) and stable (without degrading performance quality), and hence can be used in production systems at enterprises.
传统的信息抽取系统在使用纯文本语言模型时面临挑战,因为这类模型不考虑表格、图表、图像等常用于向读者传达复杂信息的信息图(信息的视觉元素)。多模态LLM(MLLM)则面临“大海捞针”问题,即上下文过长或作为搜索空间的文档数量庞大。基于视觉语言模型的后期交互(late interaction)机制已在基于检索的视觉增强问答任务中展现出最先进的性能。但将其用于基于RAG的多模态问答仍存在一些挑战。首先,许多流行且被广泛采用的向量数据库不原生支持多向量检索。其次,后期交互所需的计算会增大存储占用,可能阻碍企业采用。最后,当前的后期交互机制没有利用近似近邻搜索索引方法来大幅加速检索过程。本文探讨了一种务实的方法,使视觉检索过程可扩展且高效,同时不牺牲性能质量。我们提出一种多步骤的定制实现,利用被广泛采用的混合搜索(元数据与嵌入)和最先进的后期交互重排序器来检索最匹配的页面。最后,以MLLM作为阅读器,根据带上下文的最匹配页面生成答案。通过实验,我们观察到所提出的设计具有可扩展性(显著加速)和稳定性(不降低性能质量),因此可作为企业的生产系统使用。
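The two-stage design described above (a cheap hybrid search to shortlist pages, then a late interaction re-ranker) can be sketched as follows. Late interaction is shown as the usual MaxSim score over multi-vector representations; the mean-pooled first stage, the page dictionary layout, and all names are assumptions for this example, and a real system would replace the linear scan with an approximate nearest-neighbor index.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mean_pool(vectors):
    """Collapse a multi-vector representation into one vector."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def maxsim_score(query_vectors, page_vectors):
    """Late interaction: each query vector keeps its best-matching page vector."""
    return sum(max(dot(q, p) for p in page_vectors) for q in query_vectors)


def two_stage_retrieve(query_vectors, pages, metadata_filter, n_candidates=10, top_k=3):
    """Stage 1: metadata filter plus single-vector search. Stage 2: MaxSim re-ranking."""
    query_pooled = mean_pool(query_vectors)
    filtered = [page for page in pages if metadata_filter(page["meta"])]
    # Stage 1: cheap single-vector similarity (stand-in for an ANN index lookup).
    candidates = sorted(
        filtered,
        key=lambda page: dot(query_pooled, mean_pool(page["vectors"])),
        reverse=True,
    )[:n_candidates]
    # Stage 2: the more expensive multi-vector scoring runs on the shortlist only.
    reranked = sorted(
        candidates,
        key=lambda page: maxsim_score(query_vectors, page["vectors"]),
        reverse=True,
    )
    return [page["id"] for page in reranked[:top_k]]
```

Restricting MaxSim to a shortlist is what keeps the multi-vector cost bounded as the page collection grows.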
Article 102
Title@2025-07-16 (3): Web-Browsing LLMs Can Access Social Media Profiles and Infer User Demographics
Title: Web-Browsing LLMs Can Access Social Media Profiles and Infer User Demographics | Web-Browsing LLMs können auf Social Media Profile zugreifen und Nutzerdemographien ableiten | 可在网上浏览的LLMs 能够获取社会媒体概况和推断用户人口 2507.12372v1 |
Authors (4): Meysam Alizadeh, Fabrizio Gilardi, Zeynab Samei, Mohsen Mosleh
Large language models (LLMs) have traditionally relied on static training data, limiting their knowledge to fixed snapshots. Recent advancements, however, have equipped LLMs with web-browsing capabilities, enabling real-time information retrieval and multi-step reasoning over live web content. While prior studies have demonstrated LLMs' ability to access and analyze websites, their capacity to directly retrieve and analyze social media data remains unexplored. Here, we evaluate whether web-browsing LLMs can infer demographic attributes of social media users given only their usernames. Using a synthetic dataset of 48 X (Twitter) accounts and a survey dataset of 1,384 international participants, we show that these models can access social media content and predict user demographics with reasonable accuracy. Analysis of the synthetic dataset further reveals how LLMs parse and interpret social media profiles, which may introduce gender and political biases against accounts with minimal activity. While this capability holds promise for computational social science in the post-API era, it also raises risks of misuse, particularly in information operations and targeted advertising, underscoring the need for safeguards. We recommend that LLM providers restrict this capability in public-facing applications, while preserving controlled access for verified research purposes.
大型语言模型(LLMS)历来依赖静态培训数据,将其知识限制在固定快照上,但最近的进展使LLMS具备了网络浏览能力,使得实时信息检索和对现场网络内容进行多步推理;虽然以往的研究显示LLMS有能力访问和分析网站,但其直接检索和分析社交媒体数据的能力仍未探索;在这里,我们评估网络浏览LLMS能否推断社会媒体用户仅用用户名的人口特征;利用48 X(Twitter)账户的合成数据集和1 384名国际参与者的调查数据集,我们表明这些模型可以访问社交媒体内容,并合理准确地预测用户人口结构;对合成数据集的分析进一步揭示LLMS如何分析并解释社会媒体概况,这可能在活动最少的情况下引入对账户的性别和政治偏见;虽然这种能力在API时代后的计算社会科学方面很有希望,但也增加了滥用的风险,特别是在信息业务和有针对性的广告中,强调需要保障。我们建议LM供应商限制公众面对应用的能力,同时保留有节制的准入,用于核实研究目的。
Article 103
Title@2025-07-16 (3): Beyond Single Models: Enhancing LLM Detection of Ambiguity in Requests through Debate
Title: Beyond Single Models: Enhancing LLM Detection of Ambiguity in Requests through Debate | Jenseits von Einzelmodellen: Verbesserung der LLM-Erkennung von Ambiguität in Anfragen durch Debatte | 超越单一模式:通过辩论加强LLM对请求中的模糊性的检测 2507.12370v1 |
Authors (3): Ana Davila, Jacinto Colan, Yasuhisa Hasegawa
Large Language Models (LLMs) have demonstrated significant capabilities in understanding and generating human language, contributing to more natural interactions with complex systems. However, LLMs still struggle with ambiguity in user requests. To address this challenge, this paper introduces and evaluates a multi-agent debate framework designed to enhance ambiguity detection and resolution beyond what single models achieve. The framework consists of three LLM architectures (Llama3-8B, Gemma2-9B, and Mistral-7B variants) and a dataset with diverse ambiguities. The debate framework markedly enhanced the performance of Llama3-8B and Mistral-7B variants over their individual baselines, with Mistral-7B-led debates achieving a notable 76.7% success rate and proving particularly effective for complex ambiguities and efficient consensus. While acknowledging varying model responses to collaborative strategies, these findings underscore the debate framework’s value as a targeted method for augmenting LLM capabilities. This work offers important insights for developing more robust and adaptive language understanding systems by showing how structured debates can lead to improved clarity in interactive systems.
大型语言模型(LLMS)在理解和生成人文方面表现出很强的能力,有助于与复杂的系统进行更自然的互动,然而,它们面临各种挑战,例如LLMS处理的用户请求含糊不清。为应对这些挑战,本文件介绍和评价了一个多机构辩论框架,目的是在单一模型之外加强探测和分辨率能力,该框架包括LLMM的三个结构(Llama3-8B、Gemma2-9B和Mistral-7B变量)和一套含多种模糊不清的数据集。辩论框架明显提高了Llama3-8B和Mistral-7B变量相对于其各自基线的性能,Mistral-7B牵头的辩论取得了76.7%的显著成功率,并证明对复杂的模糊不清和高效共识特别有效。这些结论虽然承认对合作战略的示范反应各不相同,但强调了辩论框架作为增强LLM能力的一个定向方法所具有的价值。这项工作提供了重要见解,通过展示结构化辩论如何导致互动系统更加清晰。
Article 104
Title@2025-07-16 (3): Exploring Gender Bias in Alzheimer’s Disease Detection: Insights from Mandarin and Greek Speech Perception
Title: Exploring Gender Bias in Alzheimer’s Disease Detection: Insights from Mandarin and Greek Speech Perception | Erforschung von Gender-Bias bei Alzheimer-Erkennung: Einblicke aus Mandarin und griechischer Sprachwahrnehmung | 探索阿尔茨海默氏病检测中的性别偏见:普通话和希腊言语认知的洞察 2507.12356v1 |
Authors (8): Liu He, Yuanchao Li, Rui Feng, XinRan Han, Yin-Long Liu, Yuwei Yang, Zude Zhu, Jiahong Yuan
Gender bias has been widely observed in speech perception tasks, influenced by the fundamental voicing differences between genders. This study reveals a gender bias in the perception of Alzheimer’s Disease (AD) speech. In a perception experiment involving 16 Chinese listeners evaluating both Chinese and Greek speech, we found that male speech was more frequently identified as AD, with this bias being particularly pronounced in Chinese speech. Acoustic analysis showed that shimmer values in male speech were significantly associated with AD perception, while speech portion exhibited a significant negative correlation with AD identification. Although language did not have a significant impact on AD perception, our findings underscore the critical role of gender bias in AD speech perception. This work highlights the necessity of addressing gender bias when developing AD detection models and calls for further research to validate model performance across different linguistic contexts.
在言语认知任务中广泛观察到性别偏见,这受到两性之间基本表达差异的影响;这项研究揭示了阿尔茨海默氏病(AD)言语认知中的性别偏见;在涉及16名中国听众对中文和希腊语言论进行评价的感知实验中,我们发现男性言语更经常地被确定为AD,这种偏见在中文演讲中特别明显;声学分析表明,男性言语中的闪烁价值观与AD观念密切相关,而言语部分与AD识别有显著的负面关系;虽然语言对AD感没有重大影响,但我们的研究结果强调了AD言语认知中的性别偏见的关键作用;这项工作强调,在开发AD检测模型时,有必要解决性别偏见问题,并呼吁进一步研究,以验证不同语言背景的模范性表现。
Article 105
Title@2025-07-16 (3): Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs
Title: Towards Agentic RAG with Deep Reasoning: A Survey of RAG-Reasoning Systems in LLMs | Auf dem Weg zu Agentic RAG mit tiefer Vernunft: Eine Umfrage von RAG-Reasoning-Systemen in LLMs | 迈向具有深度推理的Agentic RAG:LLM中RAG-推理系统综述 2507.09477v2 |
Authors (20): Yangning Li, Weizhi Zhang, Yuyao Yang, Wei-Chieh Huang, Yaozu Wu, Junyu Luo, Yuanchen Bei, Henry Peng Zou, Xiao Luo, Yusheng Zhao, Chunkit Chan, Yankai Chen, Zhongfen Deng, Yinghui Li, Hai-Tao Zheng, Dongyuan Li, Renhe Jiang, Ming Zhang, Yangqiu Song, Philip S. Yu
Retrieval-Augmented Generation (RAG) lifts the factuality of Large Language Models (LLMs) by injecting external knowledge, yet it falls short on problems that demand multi-step inference; conversely, purely reasoning-oriented approaches often hallucinate or mis-ground facts. This survey synthesizes both strands under a unified reasoning-retrieval perspective. We first map how advanced reasoning optimizes each stage of RAG (Reasoning-Enhanced RAG). Then, we show how different types of retrieved knowledge supply missing premises and expand context for complex inference (RAG-Enhanced Reasoning). Finally, we spotlight emerging Synergized RAG-Reasoning frameworks, where (agentic) LLMs iteratively interleave search and reasoning to achieve state-of-the-art performance across knowledge-intensive benchmarks. We categorize methods, datasets, and open challenges, and outline research avenues toward deeper RAG-Reasoning systems that are more effective, multimodally-adaptive, trustworthy, and human-centric. The collection is available at https://github.com/DavidZWZ/Awesome-RAG-Reasoning.
检索增强生成(RAG)通过注入外部知识提升了大语言模型(LLM)的事实性,但在需要多步推理的问题上仍有不足;相反,纯粹面向推理的方法常常产生幻觉或错误地引用事实。本综述在统一的推理-检索视角下综合了这两条研究路线。我们首先梳理高级推理如何优化RAG的各个阶段(Reasoning-Enhanced RAG)。然后,我们展示不同类型的检索知识如何补全缺失的前提并为复杂推理扩展上下文(RAG-Enhanced Reasoning)。最后,我们重点介绍新兴的协同式RAG-推理(Synergized RAG-Reasoning)框架,其中(智能体式)LLM迭代地交错进行搜索与推理,从而在知识密集型基准上取得最先进的性能。我们对方法、数据集和开放挑战进行了分类,并概述了通向更有效、多模态自适应、可信且以人为本的更深层RAG-推理系统的研究方向。相关资料汇总见 https://github.com/DavidZWZ/Awesome-RAG-Reasoning。
Article 106
Title@2025-07-16 (3): Planning-Aware Code Infilling via Horizon-Length Prediction
Title: Planning-Aware Code Infilling via Horizon-Length Prediction | Planning-Aware Code Infilling via Horizon-Length Prediction | 通过地平线-地球预测填充规划-软件代码 2410.03103v3 |
Authors (6): Yifeng Ding, Hantian Ding, Shiqi Wang, Qing Sun, Varun Kumar, Zijian Wang
Fill-in-the-Middle (FIM), or infilling, has become integral to code language models, enabling generation of missing code given both left and right contexts. However, the current FIM training paradigm, which performs next-token prediction (NTP) over a reordered sequence, often leads to models struggling to generate content that aligns well with the surrounding context. We hypothesize that NTP alone is insufficient for models to learn effective planning conditioned on the distant right context, a critical factor for successful code infilling. To overcome this, we propose Horizon-Length Prediction (HLP), a novel training objective that teaches models to predict the number of remaining middle tokens at each step. HLP advances FIM with lookahead planning, enabling models to inherently learn infilling boundaries for arbitrary left and right contexts without relying on dataset-specific post-processing. Our evaluation across different model families and sizes shows that HLP significantly improves FIM performance by up to 24% (relative) on diverse benchmarks, at both file level and repository level. Furthermore, the enhanced planning capability gained through HLP boosts model performance on code reasoning. Importantly, HLP incurs negligible training overhead and no additional inference cost, ensuring its practicality for real-world scenarios.
中间填充(Fill-in-the-Middle,FIM),又称填充(infilling),已成为代码语言模型不可或缺的能力,使模型能够在给定左右上下文的情况下生成缺失的代码。然而,当前的FIM训练范式是在重排后的序列上进行下一词元预测(NTP),这常常导致模型难以生成与周围上下文良好衔接的内容。我们假设,仅靠NTP不足以让模型学会以较远的右侧上下文为条件进行有效规划,而这是成功进行代码填充的关键因素。为此,我们提出视野长度预测(Horizon-Length Prediction,HLP),这是一种新的训练目标,教模型在每一步预测剩余中间词元的数量。HLP以前瞻式规划改进FIM,使模型能够针对任意的左右上下文内在地学会填充边界,而无需依赖特定于数据集的后处理。我们在不同模型系列和规模上的评估表明,HLP在文件级和仓库级的多种基准上将FIM性能相对提升最多24%。此外,通过HLP获得的更强规划能力也提升了模型在代码推理上的表现。重要的是,HLP带来的训练开销可以忽略不计,且不增加推理成本,确保了其在真实场景中的实用性。
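The training target described above, the number of middle tokens still to be generated at each step, can be made concrete with a small sketch. The abstract does not specify how the count is scaled or which loss is used, so the normalization by span length and the L1 loss below are assumptions for illustration.

```python
def horizon_length_targets(num_middle_tokens, normalize=True):
    """Per-step targets for a middle span of the given length.

    At step t (0-indexed) there are num_middle_tokens - t tokens left,
    counting the one about to be generated. With normalize=True the
    count is divided by the span length (an assumed scaling).
    """
    m = num_middle_tokens
    remaining = [m - t for t in range(m)]
    if normalize:
        return [r / m for r in remaining]
    return remaining


def hlp_l1_loss(predictions, targets):
    """Mean absolute error between predicted and true horizon values (assumed loss)."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(targets)
```

Such an auxiliary term would be added to the usual next-token loss during training only, which is consistent with the abstract's claim of no extra inference cost.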
Article 107
Title@2025-07-16 (3): Nonlinear Concept Erasure: a Density Matching Approach
Title: Nonlinear Concept Erasure: a Density Matching Approach | Nichtlineare Konzeptauslöschung: ein Density-Matching-Ansatz | 非线性概念擦除:一种密度匹配方法 2507.12341v1 |
Authors (2): Antoine Saillenfest, Pirmin Lemberger
Ensuring that neural models used in real-world applications cannot infer sensitive information, such as demographic attributes like gender or race, from text representations is a critical challenge when fairness is a concern. We address this issue through concept erasure, a process that removes information related to a specific concept from distributed representations while preserving as much of the remaining semantic information as possible. Our approach involves learning an orthogonal projection in the embedding space, designed to make the class-conditional feature distributions of the discrete concept to erase indistinguishable after projection. By adjusting the rank of the projector, we control the extent of information removal, while its orthogonality ensures strict preservation of the local structure of the embeddings. Our method, termed $\overline{\mathrm{L}}$EOPARD, achieves state-of-the-art performance in nonlinear erasure of a discrete attribute on classic natural language processing benchmarks. Furthermore, we demonstrate that $\overline{\mathrm{L}}$EOPARD effectively mitigates bias in deep nonlinear classifiers, thereby promoting fairness.
在关注公平性的场景中,确保实际应用中使用的神经模型无法从文本表示中推断出敏感信息(例如性别或种族等人口统计属性)是一项关键挑战。我们通过概念擦除来解决这一问题,即从分布式表示中移除与特定概念相关的信息,同时尽可能多地保留其余语义信息。我们的方法是在嵌入空间中学习一个正交投影,使待擦除的离散概念的类条件特征分布在投影后无法区分。通过调整投影的秩,我们可以控制信息移除的程度,而其正交性确保严格保持嵌入的局部结构。我们的方法称为$\overline{\mathrm{L}}$EOPARD,在经典自然语言处理基准上对离散属性的非线性擦除取得了最先进的性能。此外,我们证明$\overline{\mathrm{L}}$EOPARD能有效减轻深度非线性分类器中的偏见,从而促进公平性。
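To make the idea of erasing a concept with an orthogonal projection concrete, the sketch below removes a single direction from toy embeddings. It swaps in a much simpler technique than the paper's learned, density-matching projection: the direction is just the difference between the two class means, which only equalizes the class means rather than the full class-conditional distributions. All names and data are assumptions for illustration.

```python
import math


def mean_vector(points):
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def unit_direction(class_a, class_b):
    """Unit vector pointing from the mean of class_a to the mean of class_b."""
    diff = [b - a for a, b in zip(mean_vector(class_a), mean_vector(class_b))]
    norm = math.sqrt(sum(d * d for d in diff))
    return [d / norm for d in diff]


def project_out(point, direction):
    """Apply the orthogonal projector I - u u^T, removing the component along u."""
    coeff = sum(p * d for p, d in zip(point, direction))
    return [p - coeff * d for p, d in zip(point, direction)]
```

Removing more directions lowers the rank of the projector, which is the knob the abstract describes for controlling how much information is erased.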
Article 108
Title@2025-07-16 (3): From Semantic Web and MAS to Agentic AI: A Unified Narrative of the Web of Agents
Title: From Semantic Web and MAS to Agentic AI: A Unified Narrative of the Web of Agents | Von Semantic Web und MAS zu Agentic AI: Ein einheitliches Narrativ des Web of Agents | 从语义网和MAS到Agentic AI:关于智能体网络的统一叙事 2507.10644v2 |
Authors (4): Tatiana Petrova, Boris Bliznioukov, Aleksandr Puzikov, Radu State
The concept of the Web of Agents (WoA), which transforms the static, document-centric Web into an environment of autonomous agents acting on users’ behalf, has attracted growing interest as large language models (LLMs) become more capable. However, research in this area is still fragmented across different communities. Contemporary surveys catalog the latest LLM-powered frameworks, while the rich histories of Multi-Agent Systems (MAS) and the Semantic Web are often treated as separate, legacy domains. This fragmentation obscures the intellectual lineage of modern systems and hinders a holistic understanding of the field’s trajectory. We present the first comprehensive evolutionary overview of the WoA. We show that modern protocols like A2A and the MCP are direct evolutionary responses to the well-documented limitations of earlier approaches like FIPA standards and OWL-based semantic agents. To systematize this analysis, we introduce a four-axis taxonomy (semantic foundation, communication paradigm, locus of intelligence, discovery mechanism). This framework provides a unified analytical lens for comparing agent architectures across all generations, revealing a clear line of descent where others have seen a disconnect. Our analysis identifies a paradigm shift in the ‘locus of intelligence’: from being encoded in external data (Semantic Web) or the platform (MAS) to being embedded within the agent’s core model (LLM). This shift is foundational to modern Agentic AI, enabling the scalable and adaptive systems the WoA has long envisioned. We conclude that while new protocols are essential, they are insufficient for building a robust, open, trustworthy ecosystem. Finally, we argue that the next research frontier lies in solving persistent socio-technical challenges, and we map out a new agenda focused on decentralized identity, economic models, security, and governance for the emerging WoA.
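智能体网络(Web of Agents,WoA)的概念是将静态的、以文档为中心的Web转变为由代表用户行事的自主智能体构成的环境;随着大语言模型(LLM)能力的增强,这一概念引起了越来越多的关注。然而,该领域的研究仍分散在不同的社区中。当代综述罗列了最新的由LLM驱动的框架,而多智能体系统(MAS)和语义网的丰富历史往往被视为彼此独立的遗留领域。这种割裂掩盖了现代系统的思想传承,妨碍了对该领域发展轨迹的整体理解。我们给出了WoA的首个全面的演进综述。我们表明,A2A和MCP等现代协议是对FIPA标准和基于OWL的语义智能体等早期方案中已被充分记录的局限性的直接演进式回应。为使这一分析系统化,我们引入了一个四轴分类法(语义基础、通信范式、智能所在位置、发现机制)。该框架为比较各代智能体架构提供了统一的分析视角,揭示出一条清晰的传承脉络,而此前人们看到的只是断裂。我们的分析指出了“智能所在位置”的范式转变:从编码在外部数据(语义网)或平台(MAS)中,转变为嵌入智能体的核心模型(LLM)之中。这一转变是现代Agentic AI的基础,使WoA长期设想的可扩展、自适应系统成为可能。我们的结论是,新协议虽然必不可少,但不足以构建稳健、开放、可信的生态系统。最后,我们认为下一个研究前沿在于解决长期存在的社会技术挑战,并为新兴的WoA规划了一个聚焦于去中心化身份、经济模型、安全和治理的新议程。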
Article 109
Title@2025-07-16 (3): Chain-of-Descriptions: Improving Code LLMs for VHDL Code Generation and Summarization
Title: Chain-of-Descriptions: Improving Code LLMs for VHDL Code Generation and Summarization | Chain-of-Descriptions: Verbesserung der Code-LLMs für die VHDL-Code-Generierung und Zusammenfassung | 描述链:改进《守则》中VHDL代码生成和概述的LLML 2507.12308v1 |
Authors (12): Prashanth Vijayaraghavan, Apoorva Nitsure, Charles Mackin, Luyao Shi, Stefano Ambrogio, Arvind Haran, Viresh Paruthi, Ali Elzein, Dan Coops, David Beymer, Tyler Baldwin, Ehsan Degan
Large Language Models (LLMs) have become widely used across diverse NLP tasks and domains, demonstrating their adaptability and effectiveness. In the realm of Electronic Design Automation (EDA), LLMs show promise for tasks like Register-Transfer Level (RTL) code generation and summarization. However, despite the proliferation of LLMs for general code-related tasks, there’s a dearth of research focused on evaluating and refining these models for hardware description languages (HDLs), notably VHDL. In this study, we evaluate the performance of existing code LLMs for VHDL code generation and summarization using various metrics and two datasets – VHDL-Eval and VHDL-Xform. The latter, an in-house dataset, aims to gauge LLMs’ understanding of functionally equivalent code. Our findings reveal consistent underperformance of these models across different metrics, underscoring a significant gap in their suitability for this domain. To address this challenge, we propose Chain-of-Descriptions (CoDes), a novel approach to enhance the performance of LLMs for VHDL code generation and summarization tasks. CoDes involves generating a series of intermediate descriptive steps based on: (i) the problem statement for code generation, and (ii) the VHDL code for summarization. These steps are then integrated with the original input prompt (problem statement or code) and provided as input to the LLMs to generate the final output. Our experiments demonstrate that the CoDes approach significantly surpasses the standard prompting strategy across various metrics on both datasets. This method not only improves the quality of VHDL code generation and summarization but also serves as a framework for future research aimed at enhancing code LLMs for VHDL.
在电子设计自动化(EDA)领域,LLM公司对登记册-传输级别(RTL)代码的生成和汇总等任务表现出希望。然而,尽管普通代码相关任务LLM公司大量使用LLM公司(LLMS),但缺乏侧重于评价和完善硬件描述语言(HDLs)(特别是VHDL)的这些模型的研究。在本研究中,我们利用各种计量和两个数据集(VHDL-Eval和VHDL-Xform),评估VHDL代码生成和合成的现有代码LMS的性能。在电子设计自动化和两个数据集(VHDL-EL-Eval和VHDL-Xform)领域,LMSLMS展示了对功能等代码的理解。我们的调查结果显示,这些模型在不同标准语言(HDLs)的适合性差很大。为了应对这一挑战,我们建议CHDL代码的链条(CDede)是一种创新的方法,但用LMS(VDL-L)在原始代码生成和合成数据解算法中也展示了中间的解算法。
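The two-stage CoDes flow described in the abstract above (first generate intermediate descriptive steps, then merge them with the original input for a final call) can be sketched as follows. This is only an illustrative outline: the prompt wording, the function names `build_final_prompt` and `chain_of_descriptions`, and the `llm` callable are all assumptions, not the authors' implementation.

```python
from typing import Callable, List


def build_final_prompt(task_input: str, steps: List[str], mode: str) -> str:
    """Merge the original input with the intermediate descriptive steps."""
    if mode == "generation":
        header, goal = "Problem statement", "Write VHDL code that solves the problem."
    else:
        header, goal = "VHDL code", "Summarize what the VHDL code does."
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(steps, start=1))
    return f"{header}:\n{task_input}\n\nDescriptive steps:\n{numbered}\n\n{goal}"


def chain_of_descriptions(task_input: str, mode: str, llm: Callable[[str], str]) -> str:
    """Two LLM calls: elicit descriptive steps, then produce the final output.

    `llm` is a hypothetical text-in/text-out callable supplied by the caller.
    """
    if mode not in ("generation", "summarization"):
        raise ValueError("mode must be 'generation' or 'summarization'")
    # Stage 1: ask for intermediate descriptions, one per line (assumed wording).
    plan_request = (
        "Describe, one step per line, what the following input requires or does:\n"
        + task_input
    )
    steps = [ln.strip() for ln in llm(plan_request).splitlines() if ln.strip()]
    # Stage 2: final call with the original input plus the descriptive steps.
    return llm(build_final_prompt(task_input, steps, mode))
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client library.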
Article 110
Title@2025-07-16 (3): Text-ADBench: Text Anomaly Detection Benchmark based on LLMs Embedding
Title: Text-ADBench: Text Anomaly Detection Benchmark based on LLMs Embedding | Text-ADBench: Text-Anomaly Detection Benchmark basierend auf LLMs Einbetten | 文本 – – 亚银:基于嵌入LLMs的文本异常检测基准 2507.12295v1 |
Authors (2): Feng Xiao, Jicong Fan
Text anomaly detection is a critical task in natural language processing (NLP), with applications spanning fraud detection, misinformation identification, spam detection, and content moderation. Despite significant advances in large language models (LLMs) and anomaly detection algorithms, the absence of standardized and comprehensive benchmarks for evaluating existing anomaly detection methods on text data limits rigorous comparison and the development of innovative approaches. This work performs a comprehensive empirical study and introduces a benchmark for text anomaly detection, leveraging embeddings from diverse pre-trained language models across a wide array of text datasets. Our work systematically evaluates the effectiveness of embedding-based text anomaly detection by incorporating (1) early language models (GloVe, BERT); (2) multiple LLMs (LLaMa-2, LLama-3, Mistral, OpenAI (small, ada, large)); (3) multi-domain text datasets (news, social media, scientific publications); (4) comprehensive evaluation metrics (AUROC, AUPRC). Our experiments reveal a critical empirical insight: embedding quality significantly governs anomaly detection efficacy, and deep learning-based approaches demonstrate no performance advantage over conventional shallow algorithms (e.g., KNN, Isolation Forest) when leveraging LLM-derived embeddings. In addition, we observe strongly low-rank characteristics in cross-model performance matrices, which enable an efficient strategy for rapid model evaluation (or embedding evaluation) and selection in practical applications. Furthermore, by open-sourcing our benchmark toolkit that includes all embeddings from different models and code at https://github.com/jicongfan/Text-Anomaly-Detection-Benchmark, this work provides a foundation for future research in robust and scalable text anomaly detection systems.
尽管在大型语言模型(LLMS)和异常检测算法方面取得重大进展,但缺乏标准化和全面的基准来评价文本数据的现有异常探测方法,因此,严格比较和开发创新方法;这项工作进行了全面的实证研究,并引入了文本异常检测基准,利用各种经过预先培训的语言模型在各种文本数据集中的嵌入。我们的工作系统地评价嵌入基于嵌入文本异常检测方法的有效性,包括:(1)早期语言模型(GloVe、BERT);(2)多个LLMS(LLLAMA-2、Lalama-3、Mistral、OpenAI(MLSLAM-2、LOS、ALMA、Ada、大);(3)多方文本数据集(新、社交媒体、科学出版物);(4)综合评价衡量标准(AUROC、AUPRC)。 我们的实验揭示了一种至关重要的经验洞察:嵌入质量显著地规范了基于异常检测的效能和深层次的基于学习的方法,包括:在常规的浅度算算法(egroupal-rational-rmal-comma)上,我们利用了一种快速的系统,从而更准确地利用了我们的数据库基础基础基础化了一种快速的文本。
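As a rough illustration of the kind of shallow, embedding-based detector and metric the abstract above mentions (KNN scored with AUROC), here is a minimal pure-Python sketch. The function names, the mean-of-k-distances score, and the toy vectors are assumptions for illustration; the benchmark's own toolkit is not reproduced here.

```python
import math
from typing import List, Sequence


def knn_scores(train: List[Sequence[float]], test: List[Sequence[float]], k: int = 2) -> List[float]:
    """Anomaly score = mean Euclidean distance to the k nearest training embeddings."""
    scores = []
    for x in test:
        dists = sorted(math.dist(x, t) for t in train)
        scores.append(sum(dists[:k]) / k)
    return scores


def auroc(scores: List[float], labels: List[int]) -> float:
    """Probability that a random anomaly (label 1) scores above a random normal
    sample (label 0); ties count as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy 2-D "embeddings": normal training points near the origin.
train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
test = [(0.5, 0.5), (5.0, 5.0)]   # second point is far from the training data
labels = [0, 1]                   # 1 marks the anomaly
scores = knn_scores(train, test, k=2)
```

In practice the vectors would come from a pre-trained language model rather than being hand-written, and the training set would contain only normal-class text.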
Article 111
Title@2025-07-16 (3): Linearly-Interpretable Concept Embedding Models for Text Analysis
Title: Linearly-Interpretable Concept Embedding Models for Text Analysis | Linear-Interpretable Concept Einbetten von Modellen für die Textanalyse | 用于文本分析的线性解释式概念嵌入模型 2406.14335v2 |
Authors (6): Francesco De Santis, Philippe Bich, Gabriele Ciravegna, Pietro Barbiero, Danilo Giordano, Tania Cerquitelli
Despite their success, Large-Language Models (LLMs) still face criticism due to their lack of interpretability. Traditional post-hoc interpretation methods, based on attention and gradient-based analysis, offer limited insights as they only approximate the model’s decision-making processes and have been shown to be unreliable. For this reason, Concept-Bottleneck Models (CBMs) have lately been proposed in the textual field to provide interpretable predictions based on human-understandable concepts. However, CBMs still exhibit several limitations: architectural constraints that limit their expressivity, a lack of task interpretability when non-linear task predictors are employed, and a need for extensive annotations that are impractical for real-world text data. In this paper, we address these challenges by proposing a novel Linearly Interpretable Concept Embedding Model (LICEM) that goes beyond the current accuracy-interpretability trade-off. LICEMs’ classification accuracy is better than that of existing interpretable models and matches that of black-box ones. We show that the explanations provided by our models are more intervenable and causally consistent than those of existing solutions. Finally, we show that LICEMs can be trained without requiring any concept supervision, as concepts can be automatically predicted when using an LLM backbone.
尽管取得了成功,但大语言模型(LLMS)仍因其缺乏可解释性而面临批评。基于关注和梯度分析的传统热后解释方法(LLMS)基于关注和梯度分析,提供了有限的洞见,因为它们只是接近模型的决策过程,并且证明是不可靠的。为此,最近在文本领域提出了概念-瓶颈模型(BISM),以提供基于人类可理解概念的可解释预测。然而,由于这些模型的建筑限制,其表达性、在使用非线性任务预测器时缺乏任务可解释性以及需要对于现实世界文本数据不切实际的广泛说明,因此仍然有一些限制。在本文件中,我们通过提出一个新的线性解释性概念嵌入模型(LICEM)来应对这些挑战。 LICEM的分类准确性优于现有的可解释性模型,并与黑箱相匹配。我们表明,我们的模型所提供的解释性解释性解释性更能进行干预,而且因果性更符合现有解决方案。最后,我们在不需经过培训的基本监督概念的情况下,可以自动地表明,任何LEM都可预见性地使用。
Article 112
Title@2025-07-16 (3): Automated Novelty Evaluation of Academic Paper: A Collaborative Approach Integrating Human and Large Language Model Knowledge
Title: Automated Novelty Evaluation of Academic Paper: A Collaborative Approach Integrating Human and Large Language Model Knowledge | Automatisierte Neuheitsbewertung des Akademischen Papiers: Ein kollaborativer Ansatz Integrieren von menschlichem und großem Sprachmodellwissen | 学术论文自动化新颖评价:结合人文和大语言示范知识的协作方法 2507.11330v2 |
Authors (3): Wenqing Wu, Chengzhi Zhang, Yi Zhao
Novelty is a crucial criterion in the peer review process for evaluating academic papers. Traditionally, it is judged by experts or measured by unique reference combinations. Both methods have limitations: experts have limited knowledge, and the effectiveness of the combination method is uncertain. Moreover, it is unclear whether unique citations truly measure novelty. Large language models (LLMs) possess a wealth of knowledge, while human experts possess judgment abilities that LLMs lack. Therefore, our research integrates the knowledge and abilities of LLMs and human experts to address the limitations of novelty assessment. One of the most common types of novelty in academic papers is the introduction of new methods. In this paper, we propose leveraging human knowledge and LLMs to assist pretrained language models (PLMs, e.g., BERT) in predicting the method novelty of papers. Specifically, we extract sentences related to the novelty of the academic paper from peer review reports and use an LLM to summarize the methodology section of the academic paper; both are then used to fine-tune PLMs. In addition, we have designed a text-guided fusion module with novel Sparse-Attention to better integrate human and LLM knowledge. We compared our proposed method with a large number of baselines. Extensive experiments demonstrate that our method achieves superior performance.
在同行审评过程中,评估学术论文是一个关键的标准。传统上,它是由专家或以独特的参考组合来衡量的。两种方法都有其局限性:专家的知识有限,组合方法的有效性不确定。此外,还不清楚独有引用是否真正衡量创新。大型语言模型(LLM)拥有丰富的知识,而人类专家拥有法学硕士不具备的判断能力。因此,我们的研究综合了LLM和人类专家的知识和能力,以解决新颖评估的局限性。学术论文中最常见的新颖类型之一是采用新方法。此外,我们建议利用人类知识和LLM来协助预先培训的语言模型(PLMS,例如BERT等)预测新手法。具体地说,我们从同行审评报告中摘录了与学术论文的新颖性有关的句子,并利用LLMM总结了学术论文的方法部分,该部分随后用于微调PLMS。此外,我们设计了一个文本导导导的模块,用新的SpARM模型来协助预先培训的语言模型(PLM,例如BERT,等等)预测新手法。我们用高超级的实验方法更好地展示了我们的知识。
Article 113
Title@2025-07-16 (3): NLP Meets the World: Toward Improving Conversations With the Public About Natural Language Processing Research
Title: NLP Meets the World: Toward Improving Conversations With the Public About Natural Language Processing Research | NLP trifft auf die Welt: Um Gespräche mit der Öffentlichkeit über die natürliche Sprachverarbeitungsforschung zu verbessern | NLP 与世界相遇:努力改进与公众关于自然语言处理研究的对话 2507.10559v2 |
Authors (1): Shomir Wilson
Recent developments in large language models (LLMs) have been accompanied by rapidly growing public interest in natural language processing (NLP). This attention is reflected by major news venues, which sometimes invite NLP researchers to share their knowledge and views with a wide audience. Recognizing the opportunities of the present, for both the research field and for individual researchers, this paper shares recommendations for communicating with a general audience about the capabilities and limitations of NLP. These recommendations cover three themes: vague terminology as an obstacle to public understanding, unreasonable expectations as obstacles to sustainable growth, and ethical failures as obstacles to continued support. Published NLP research and popular news coverage are cited to illustrate these themes with examples. The recommendations promote effective, transparent communication with the general public about NLP, in order to strengthen public understanding and encourage support for research.
在大型语言模型(LLMs)的最近发展的同时,公众对自然语言处理的兴趣迅速增加,主要新闻网站也反映了这种关注,有时邀请NLP研究人员与广大受众分享知识和观点。认识到当前对研究领域和个人研究人员的机会,本文件分享了关于就NLP的能力和局限性与一般受众沟通的建议。 这些建议涉及三个主题:模糊的术语妨碍公众理解,不合理地期望可持续增长的障碍,道德上的失败作为持续支持的障碍。出版了NLP的研究和大众新闻报道以实例说明这些主题。建议促进与公众就NLP进行有效、透明的沟通,以加强公众理解和鼓励对研究的支持。
Article 114
Title@2025-07-16 (3): Measuring Spiritual Values and Bias of Large Language Models
Title: Measuring Spiritual Values and Bias of Large Language Models | Messen von spirituellen Werten und Bias von großen Sprachmodellen | 计量大语言模型的精神价值和偏见 2410.11647v2 |
Authors (6): Songyuan Liu, Ziyang Zhang, Runze Yan, Wei Wu, Carl Yang, Jiaying Lu
Large language models (LLMs) have become an integral tool for users from various backgrounds. LLMs, trained on vast corpora, reflect the linguistic and cultural nuances embedded in their pre-training data. However, the values and perspectives inherent in this data can influence the behavior of LLMs, leading to potential biases. As a result, the use of LLMs in contexts involving spiritual or moral values necessitates careful consideration of these underlying biases. Our work starts with verification of our hypothesis by testing the spiritual values of popular LLMs. Experimental results show that LLMs’ spiritual values are quite diverse, as opposed to the stereotype of atheists or secularists. We then investigate how different spiritual values affect LLMs in social-fairness scenarios (e.g., hate speech identification). Our findings reveal that different spiritual values indeed lead to different sensitivity to different hate target groups. Furthermore, we propose to continue pre-training LLMs on spiritual texts, and empirical results demonstrate the effectiveness of this approach in mitigating spiritual bias.
大型语言模型(LLMS)已成为来自不同背景的用户不可或缺的工具。LLMS, 接受过广泛的集体培训,反映其培训前数据中的语言和文化细微差别。然而,该数据所固有的价值和观点可以影响LMS的行为,导致潜在的偏见。结果,在涉及精神或道德价值的情况下使用LLMS, 需要仔细考虑这些根本偏见。我们的工作始于通过测试受欢迎的LLMS的精神价值来验证我们的假设。实验结果显示LLMS的精神价值是多种多样的,而不是无神论者或世俗主义者的定型。我们然后调查不同的精神价值如何影响社会公正情景中的LMS,例如仇恨言论识别。我们的调查结果显示,不同的精神价值确实导致不同仇恨目标群体的不同敏感度。此外,我们提议继续对LMS进行关于精神文字的预先培训,实验结果显示这一方法在减轻精神偏见方面的有效性。
Article 115
Title@2025-07-16 (3): Infherno: End-to-end Agent-based FHIR Resource Synthesis from Free-form Clinical Notes
Title: Infherno: End-to-end Agent-based FHIR Resource Synthesis from Free-form Clinical Notes | Infherno: Ende-zu-Ende Agent-basierte FHIR-Ressourcensynthese aus freiformigen klinischen Anmerkungen | Infherno: 以端到端代理为基础的FHIR 自由形式临床笔记资源合成 2507.12261v1 |
Authors (6): Johann Frei, Nils Feldhus, Lisa Raithel, Roland Roller, Alexander Meyer, Frank Kramer
For clinical data integration and healthcare services, the HL7 FHIR standard has established itself as a desirable format for interoperability between complex health data. Previous attempts at automating the translation from free-form clinical notes into structured FHIR resources rely on modular, rule-based systems or LLMs with instruction tuning and constrained decoding. Since they frequently suffer from limited generalizability and structural inconformity, we propose an end-to-end framework powered by LLM agents, code execution, and healthcare terminology database tools to address these issues. Our solution, called Infherno, is designed to adhere to the FHIR document schema and competes well with a human baseline in predicting FHIR resources from unstructured text. The implementation features a front end for custom and synthetic data and both local and proprietary models, supporting clinical data integration processes and interoperability across institutions.
就临床数据整合和保健服务而言,HL7 FHIR标准已经确立为复杂健康数据之间互操作性的理想格式。以前将免费临床笔记自动转换成结构化FHIR资源的尝试依赖于模块化、基于规则的系统或具有指令调整和约束解码的LLMs。由于它们经常受到一般性和结构不兼容性的制约,我们提议了一个由LLLM代理商驱动的端对端框架、代码执行和保健术语数据库工具来解决这些问题。我们称为Infherno的解决方案旨在遵守FHIR文件的预案,在预测FHIR非结构化文本资源时与人类基线进行良好竞争。实施该解决方案的特点是定制和合成数据以及本地和专利模型的前端,支持临床数据整合进程和跨机构互操作性。
Article 116
Title@2025-07-16 (3): Translationese-index: Using Likelihood Ratios for Graded and Generalizable Measurement of Translationese
Title: Translationese-index: Using Likelihood Ratios for Graded and Generalizable Measurement of Translationese | Translationese-Index: Verwendung von Likelihood-Verhältnissen für abgestufte und generalisierbare Messung von Translationese | 笔译索引:在笔译的分级和通用计量中使用可能性比率 2507.12260v1 |
Authors (9): Yikang Liu, Wanyang Zhang, Yiming Wang, Jialong Tang, Pei Zhang, Baosong Yang, Fei Huang, Rui Wang, Hai Hu
In this paper, we propose the first quantitative measure for translationese – the translationese-index (T-index) for graded and generalizable measurement of translationese, computed from the likelihood ratios of two contrastively fine-tuned language models (LMs). We use a synthesized dataset and a dataset with translations in the wild to evaluate T-index’s generalizability in cross-domain settings and its validity against human judgments. Our results show that T-index is both robust and efficient. T-index scored by two 0.5B LMs fine-tuned on only 1-5k pairs of synthetic data can well capture translationese in the wild. We find that the relative differences in T-indices between translations can well predict pairwise translationese annotations obtained from human annotators; and the absolute values of T-indices correlate well with human ratings of degrees of translationese (Pearson’s $r = 0.568$). Additionally, the correlation between T-index and existing machine translation (MT) quality estimation (QE) metrics such as BLEU and COMET is low, suggesting that T-index is not covered by these metrics and can serve as a complementary metric in MT QE.
在本文中,我们建议了翻译的第一个量化计量方法 – – 笔译指数(T-index),用于对笔译进行分级和通用的计量,该计量方法根据两个对比性微调的语言模型(LMs)的可能性比值计算。我们使用综合数据集和带有野生翻译的数据集,以评价T-index在跨多域设置中的通用性及其相对于人类判断的有效性。我们的结果表明,T-index既可靠又有效。T-index被两个0.5B LMs微调得分,对合成数据中只有1至5k对合成数据进行微调,可以很好地捕捉到野生翻译。我们发现,翻译之间的T-index的相对差异可以很好地预测出从人类告示者那里得到的对等翻译说明;T-index的绝对值与人类对翻译程度的评级(Pearson的$=0.568美元)相关。此外,T-index和现有的机器翻译质量估计(QE)指标(如BLEUU和KOT)之间的关联度很低,这表明T-index没有被这些指标作为补充的衡量标准。
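The abstract above says the T-index is computed from likelihood ratios of two contrastively fine-tuned LMs but does not give the formula. A minimal sketch, assuming a length-normalized log-likelihood ratio over per-token log-probabilities (the normalization, the sign convention, and the name `t_index` are assumptions made here for illustration), might look like this:

```python
from typing import List


def t_index(logp_translationese: List[float], logp_original: List[float]) -> float:
    """Length-normalized log-likelihood ratio between two LMs on the same text.

    Each list holds per-token log-probabilities of the same token sequence,
    one from an LM tuned on translationese-heavy text and one from an LM tuned
    on original-style text. Positive values mean the text looks more like
    translationese under this assumed convention.
    """
    if not logp_translationese or len(logp_translationese) != len(logp_original):
        raise ValueError("both models must score the same non-empty token sequence")
    n = len(logp_translationese)
    return (sum(logp_translationese) - sum(logp_original)) / n
```

Working in log space turns the ratio into a difference and avoids underflow on long sentences; dividing by the token count keeps scores comparable across sentence lengths.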
Article 117
Title@2025-07-16 (3): Hallucination Detox: Sensitivity Dropout (SenD) for Large Language Model Training
Title: Hallucination Detox: Sensitivity Dropout (SenD) for Large Language Model Training | Halluzination Detox: Sensitivity Dropout (SenD) für großsprachliche Modellschulungen | 幻觉脱毒:用于大语言模式培训的感敏性辍学(SenD) 2410.15460v4 |
Authors (5): Shahrad Mohammadzadeh, Juan David Guerra, Marco Bonizzato, Reihaneh Rabbany, Golnoosh Farnadi
As large language models (LLMs) become increasingly prevalent, concerns about their reliability, particularly due to hallucinations (factually inaccurate or irrelevant outputs), have grown. Our research investigates the relationship between the uncertainty in training dynamics and the emergence of hallucinations. Using models from the Pythia suite and several hallucination detection metrics, we analyze hallucination trends and identify significant variance during training. To address this, we propose Sensitivity Dropout (SenD), a novel training protocol designed to reduce hallucination variance during training by deterministically dropping embedding indices with significant variability. In addition, we develop an unsupervised hallucination detection metric, Efficient EigenScore (EES), which approximates the traditional EigenScore at 2x speed. This metric is integrated into our training protocol, allowing SenD to be both computationally scalable and effective at reducing hallucination variance. SenD improves test-time reliability of Pythia and Meta’s Llama models by up to 17% and enhances factual accuracy in Wikipedia, Medical, Legal, and Coding domains without affecting downstream task performance.
随着大型语言模型(LLMS)日益普遍,人们对其可靠性的担忧日益增长,特别是由于幻觉(事实上不准确或不相关的产出)的可靠性。我们的研究调查了培训动态的不确定性与幻觉的出现之间的关系。我们使用Pythia套装和若干幻觉检测指标的模型分析幻觉趋势,并找出培训过程中的显著差异。为了解决这个问题,我们提议了一项新颖的培训协议,旨在减少培训期间的幻觉差异,具体做法是确定地降低具有重大变异性的嵌入指数。此外,我们开发了一种不受监督的幻觉检测指标,即高效EigenScore(EES),该指标以2x速度接近传统的EigenScore。该指标被纳入了我们的培训协议,使SenD既可以计算可缩放,又能有效减少幻觉差异。SenD提高了Pythia和Meta的Llama模型的测试时间可靠性,最高可达17,并提高维基、医疗、法律和Coding域的实际准确性。
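The core idea named in the abstract above, deterministically dropping the embedding indices that vary most, can be illustrated with a small sketch. Everything specific here is an assumption: variability is measured as the variance of each index across a few recent checkpoints, the top `k` indices are zeroed, and the names `sensitive_indices` and `drop_indices` are hypothetical. The paper's actual selection criterion and schedule may differ.

```python
from statistics import pvariance
from typing import List


def sensitive_indices(checkpoint_embeddings: List[List[float]], k: int) -> List[int]:
    """Return the k embedding indices with the highest variance across checkpoints.

    `checkpoint_embeddings[c][d]` is dimension d of the same input's embedding
    at checkpoint c (an assumed way to measure variability).
    """
    dims = len(checkpoint_embeddings[0])
    variances = [pvariance([e[d] for e in checkpoint_embeddings]) for d in range(dims)]
    ranked = sorted(range(dims), key=lambda d: variances[d], reverse=True)
    return sorted(ranked[:k])


def drop_indices(embedding: List[float], indices: List[int]) -> List[float]:
    """Deterministically zero the selected indices (no random mask)."""
    drop = set(indices)
    return [0.0 if d in drop else v for d, v in enumerate(embedding)]


history = [[1.0, 0.0, 5.0], [1.0, 2.0, 5.1], [1.0, -2.0, 4.9]]
to_drop = sensitive_indices(history, k=1)
masked = drop_indices(history[-1], to_drop)
```

Unlike standard dropout, the mask here is a function of observed variability rather than a random draw, which is what "deterministically" refers to in this sketch.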
Article 118
Title@2025-07-16 (3): Improving Contextual ASR via Multi-grained Fusion with Large Language Models
Title: Improving Contextual ASR via Multi-grained Fusion with Large Language Models | Verbesserung der Kontext-ASR durch Multi-Grained Fusion mit großen Sprachmodellen | 通过与大语言模式的多语种融合,改善实际的ASR 2507.12252v1 |
Authors (2): Shilin Zhou, Zhenghua Li
While end-to-end Automatic Speech Recognition (ASR) models have shown impressive performance in transcribing general speech, they often struggle to accurately recognize contextually relevant keywords, such as proper nouns or user-specific entities. Previous approaches have explored leveraging keyword dictionaries in the textual modality to improve keyword recognition, either through token-level fusion that guides token-by-token generation or phrase-level fusion that enables direct copying of keyword phrases. However, these methods operate at different granularities and have their own limitations. In this paper, we propose a novel multi-grained fusion approach that jointly leverages the strengths of both token-level and phrase-level fusion with Large Language Models (LLMs). Our approach incorporates a late-fusion strategy that elegantly combines ASR’s acoustic information with LLM’s rich contextual knowledge, balancing fine-grained token precision with holistic phrase-level understanding. Experiments on Chinese and English datasets demonstrate that our approach achieves state-of-the-art performance on keyword-related metrics while preserving high accuracy on non-keyword text. Ablation studies further confirm that the token-level and phrase-level components both contribute significantly to the performance gains, complementing each other in our joint multi-grained framework. The code and models will be publicly available at https://github.com/.
虽然端到端自动语音识别模型(ASR)在翻译一般发言时表现出了令人印象深刻的成绩,但它们往往难以准确地识别与背景相关的关键词,例如适当的名词或用户特定实体。以前的做法探索了在文本模式中利用关键词词典来改进关键词识别,要么通过象征性的融合来引导象征性的逐个生成,要么通过短语级融合来帮助直接复制关键词句。然而,这些方法在不同微粒上运作,并有其自身的局限性。在本文中,我们提出了一种创新的多级聚合方法,在使用大语言模型的同时,共同利用象征性和语句级融合的优势。我们的方法包含了一种延迟融合战略,将ASR的声学信息与LLM丰富的背景知识精通结合起来,平衡精细的象征精确度与整体的语句级理解。在中英数据集上进行的实验表明,我们的方法在关键词相关计量标准上取得了最先进的表现,同时在非关键词级文本上保持高度的准确性能。我们的方法包括了最新版本。
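For readers unfamiliar with late fusion, the following sketch shows one generic form of it at the token level: a log-linear interpolation of ASR and LLM scores for candidate next tokens. This is not the paper's multi-grained method; the interpolation weight, the floor value for missing tokens, and the function names are assumptions chosen for illustration.

```python
from typing import Dict


def fuse_scores(
    asr_logp: Dict[str, float],
    llm_logp: Dict[str, float],
    llm_weight: float = 0.3,
    floor: float = -20.0,
) -> Dict[str, float]:
    """Log-linear late fusion of per-token log-probabilities.

    Tokens missing from one model get `floor` as a stand-in log-probability.
    """
    vocab = set(asr_logp) | set(llm_logp)
    return {
        tok: (1.0 - llm_weight) * asr_logp.get(tok, floor)
        + llm_weight * llm_logp.get(tok, floor)
        for tok in vocab
    }


def best_token(fused: Dict[str, float]) -> str:
    """Pick the candidate with the highest fused score."""
    return max(fused, key=fused.get)
```

When the acoustic scores cannot separate two homophones, the LLM term breaks the tie in favor of the contextually likelier word; setting `llm_weight` to 0 recovers the ASR-only decision.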
Article 119
Title@2025-07-16 (3): FADE: Why Bad Descriptions Happen to Good Features
Title: FADE: Why Bad Descriptions Happen to Good Features | FADE: Warum schlechte Beschreibungen gut aussehen | FADE:为什么不良描述发生在好地貌 2502.16994v2 |
Authors (7): Bruno Puri, Aakriti Jain, Elena Golimblevskaia, Patrick Kahardipraja, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin
Recent advances in mechanistic interpretability have highlighted the potential of automating interpretability pipelines in analyzing the latent representations within LLMs. While this may enhance our understanding of internal mechanisms, the field lacks standardized evaluation methods for assessing the validity of discovered features. We attempt to bridge this gap by introducing FADE: Feature Alignment to Description Evaluation, a scalable model-agnostic framework for automatically evaluating feature-to-description alignment. FADE evaluates alignment across four key metrics - Clarity, Responsiveness, Purity, and Faithfulness - and systematically quantifies the causes of the misalignment between features and their descriptions. We apply FADE to analyze existing open-source feature descriptions and assess key components of automated interpretability pipelines, aiming to enhance the quality of descriptions. Our findings highlight fundamental challenges in generating feature descriptions, particularly for SAEs compared to MLP neurons, providing insights into the limitations and future directions of automated interpretability. We release FADE as an open-source package at: https://github.com/brunibrun/FADE
最近在机械化解释性方面取得的进展突出表明了在分析LLMs内部潜在代表形式方面使解释性管道自动化的潜力。虽然这可能增进我们对内部机制的理解,但实地缺乏评估所发现特征有效性的标准化评价方法。我们试图通过引入FADE(FADE:描述评价的特性调整)来弥补这一差距,FADE(FADE:描述评价的特性调整)是一个可扩缩的模型-不可计量框架,用于自动评估特征到描述的一致性。FADE(FADE)评估了四种关键指标(清晰度、反应性、纯度和忠诚性)的协调统一,并系统地量化了特征及其描述之间不匹配的原因。我们应用FADE(FADE)分析现有的公开源特征描述并评估自动解释性管道的关键组成部分,以提高描述的质量。我们的调查结果突出了生成特征描述方面的基本挑战,特别是SAE(SAE)相对于MP神经系统而言,提供了对自动解释的局限性和未来方向的洞察。我们将FADE作为开放源包发布:https://github.com/brunibrun/FADE)/FADE(FADE)
Article 120
Title@2025-07-16 (3): Semantic Adapter for Universal Text Embeddings: Diagnosing and Mitigating Negation Blindness to Enhance Universality
Title: Semantic Adapter for Universal Text Embeddings: Diagnosing and Mitigating Negation Blindness to Enhance Universality | Semantischer Adapter für universelle Text-Embeddings: Diagnose und Milderung der Negationsblindheit zur Verbesserung der Universalität | 通用文本嵌入的语义适应器:诊断和减轻疏漏失盲现象,以增强普遍性 2504.00584v2 |
Authors (1): Hongliu Cao
Negation plays an important role in various natural language processing tasks such as Natural Language Inference and Sentiment Analysis. Numerous prior studies have found that contextual text embedding models such as BERT, ELMo, RoBERTa, or XLNet face challenges in accurately understanding negation. Recent advancements in universal text embeddings have demonstrated superior performance over contextual text embeddings in various tasks. However, due to the bias in popular evaluation benchmarks, the negation awareness capacity of these models remains unclear. To bridge the gap in existing literature, this work conducts an in-depth analysis of the negation awareness of cutting-edge universal text embedding models. Our findings reveal a significant lack of negation awareness in these models, which often interpret negated text pairs as semantically similar. Because different tasks need different trade-offs between topic information, negation information, and other semantic information, a data-efficient and computationally efficient embedding re-weighting method is proposed that does not modify the parameters of text embedding models. The proposed solution significantly improves text embedding models’ negation awareness on both simple and complex negation understanding tasks. Furthermore, the proposed solution can also significantly improve the negation awareness of Large Language Model based task-specific high dimensional universal text embeddings.
在自然语言推断和感知分析等各种自然语言处理任务中,消化在自然语言推断和感知分析任务中起着重要作用。许多先前的研究发现,嵌入模型(如BERT、ELMO、ROBERTA或XLNet)等背景文本的模型在准确理解否定方面面临着挑战。在通用文本嵌入方面最近的进展表明,在各种任务中,普遍文本嵌入的功能优于背景文本嵌入。然而,由于大众评价基准中存在偏差,这些模型的否定意识能力仍然不明确。为了缩小现有文献中的差距,在这项工作中开展了深入的分析,以研究否定对尖端通用文本嵌入模型模型的认知。我们的调查结果显示,这些模型中大量缺乏否定性意识,往往将否定的文本配对解释成词义相似。为了有效处理不同任务之间需要不同主题的权衡和否定其他语义信息的冲突,在不修改文本嵌入模型参数的情况下,建议采用数据效率和计算效率的嵌入重新加权方法。拟议的解决办法能够改进文本嵌入模型嵌入模型的深层次认识,同时大幅改进基于简单消化语言的高层次理解。
Article 121
Title@2025-07-16 (3): Truth Sleuth and Trend Bender: AI Agents to fact-check YouTube videos and influence opinions
Title: Truth Sleuth and Trend Bender: AI Agents to fact-check YouTube videos and influence opinions | Truth Sleuth and Trend Bender: KI-Agenten überprüfen YouTube-Videos und beeinflussen Meinungen | Truth Sleuth and Trend Bender: AI 负责事实检查YouTube视频及影响舆论的代理 2507.10577v2 |
Authors (2): Cécile Logé, Rehan Ghori
Misinformation poses a significant threat in today’s digital world, often spreading rapidly through platforms like YouTube. This paper introduces a novel approach to combating misinformation by developing an AI-powered system that not only fact-checks claims made in YouTube videos but also actively engages users in the comment section and challenges misleading narratives. Our system comprises two main agents: Truth Sleuth and Trend Bender. Truth Sleuth extracts claims from a YouTube video, uses a Retrieval-Augmented Generation (RAG) approach, drawing on sources like Wikipedia, Google Search, and Google FactCheck, to accurately assess their veracity, and generates a nuanced and comprehensive report. Through rigorous prompt engineering, Trend Bender leverages this report along with a curated corpus of relevant articles to generate insightful and persuasive comments designed to stimulate a productive debate. With a carefully set up self-evaluation loop, this agent is able to iteratively improve its style and refine its output. We demonstrate the system’s capabilities through experiments on established benchmark datasets and a real-world deployment on YouTube, showcasing its potential to engage users and potentially influence perspectives. Our findings highlight the high accuracy of our fact-checking agent, and confirm the potential of AI-driven interventions in combating misinformation and fostering a more informed online space.
本文介绍了一种打击错误信息的新方法,即开发一个AI动力系统,不仅对YouTube视频中的说法进行事实检查,而且还积极地让用户参与评论部分和质疑误导性叙述。我们的系统由两个主要的推动者组成:真理Sleuth和Trind Bender。真相Sleuth从YouTube视频中提取了主张,利用维基百科、谷歌搜索、谷歌事实调查等来源来准确评估其真实性,并产生一份精细而全面的报告。Trend Bender通过严格的迅速工程,利用这份报告以及一套相关文章的整理,产生深刻而有说服力的评论,以激发富有成效的辩论。通过精心设置的自我评价环圈,该代理能够反复改进其风格和产出。我们通过对既定基准数据集的实验和在YouTube上真实世界部署来展示系统的能力,展示其与用户接触的潜在潜力和潜在影响。我们的调查结果强调,在加强事实和空间信息分析方面,我们提高数据分析的准确性,并提升了我们的数据代理的在线数据分析的高度准确性。
Article 122
Title@2025-07-16 (3): Towards few-shot isolated word reading assessment
Title: Towards few-shot isolated word reading assessment | Auf dem Weg zu wenigen Schuss isoliert Wort Lesung Bewertung | 迈向微小的孤立字读数评估 2507.12217v1 |
Authors (3): Reuben Smit, Retief Louw, Herman Kamper
We explore an ASR-free method for isolated word reading assessment in low-resource settings. Our few-shot approach compares input child speech to a small set of adult-provided reference templates. Inputs and templates are encoded using intermediate layers from large self-supervised learning (SSL) models. Using an Afrikaans child speech benchmark, we investigate design options such as discretising SSL features and barycentre averaging of the templates. Idealised experiments show reasonable performance for adults, but a substantial drop for child speech input, even with child templates. Despite the success of employing SSL representations in low-resource speech tasks, our work highlights the limitations of SSL representations for processing child data when used in a few-shot classification system.
我们探索了一种在低资源环境中单独阅读阅读评估的无自动识别标准方法。 我们的微小方法将输入儿童演讲比作一小套成人提供的参考模板。 输入和模板使用来自大型自监管学习模式(SSL)的中间层编码。 我们使用南非荷兰语儿童演讲基准调查设计选项,例如将 SSL 特性分离和模板平均温室。 理想化的实验显示成人表现合理,但儿童演讲投入却大幅下降,即使有儿童模板。 尽管在低资源演讲任务中成功地使用SSL 表示方式,但我们的工作凸显了在少数分类系统中使用儿童数据处理 SSL 表示方式的局限性。
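Template comparison of the kind described in the abstract above can be sketched as nearest-template classification over feature sequences. The abstract does not name the distance; dynamic time warping (DTW) is assumed here because it is a common choice for variable-length speech features, and the function names and toy one-dimensional features are illustrative only.

```python
import math
from typing import Dict, List, Sequence

Frames = List[Sequence[float]]  # one feature vector per frame


def dtw_distance(a: Frames, b: Frames) -> float:
    """Dynamic time warping cost between two frame sequences,
    normalized by the combined length."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[-1][-1] / (len(a) + len(b))


def classify(query: Frames, templates: Dict[str, List[Frames]]) -> str:
    """Return the word whose closest template has the lowest DTW distance."""
    return min(templates, key=lambda w: min(dtw_distance(query, t) for t in templates[w]))


# Toy 1-D features standing in for SSL-layer outputs.
templates = {
    "kat": [[(0.0,), (1.0,), (2.0,)]],
    "hond": [[(5.0,), (5.0,), (4.0,)]],
}
query = [(0.0,), (0.0,), (1.0,), (2.0,)]  # a slower rendition of "kat"
```

For assessment, the predicted word would then be compared with the word the child was prompted to read.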
Article 123
Title@2025-07-16 (3): Toward a Behavioural Translation Style Space: Simulating the Temporal Dynamics of Affect, Behaviour, and Cognition in Human Translation Production
Title: Toward a Behavioural Translation Style Space: Simulating the Temporal Dynamics of Affect, Behaviour, and Cognition in Human Translation Production | Auf dem Weg zu einem Raum für Verhaltensübersetzungen: Simulation der zeitlichen Dynamik von Affekt, Verhalten und Kognition in der menschlichen Übersetzungsproduktion | 走向行为翻译风格空间:模拟人翻译生产中影响、行为和认知的时空动态 2507.12208v1 |
Authors (6): Michael Carl, Takanori Mizowaki, Aishvarya Ray, Masaru Yamada, Devi Sri Bandaru, Xinyue Ren
The paper introduces a Behavioural Translation Style Space (BTSS) that describes possible behavioural translation patterns. The suggested BTSS is organized as a hierarchical structure that entails various embedded processing layers. We posit that observable translation behaviour - i.e., eye and finger movements - is fundamental when executing the physical act of translation but it is caused and shaped by higher-order cognitive processes and affective translation states. We analyse records of keystrokes and gaze data as indicators of the hidden mental processing structure and organize the behavioural patterns as a multi-layered embedded BTSS. The BTSS serves as the basis for a computational translation agent to simulate the temporal dynamics of affect, automatized behaviour and cognition during human translation production.
本文介绍了一种行为翻译风格空间(BTSS),该空间描述了可能的行为翻译模式,所建议的BTSS是一个等级结构,包含各种嵌入式处理层。我们认为,可观测的翻译行为(即眼睛和手指运动)在执行翻译的实际行为时至关重要,但它是由更高层次的认知过程和感知翻译状态造成和形成的。我们分析键盘和凝视数据记录,作为隐藏的心理处理结构的指标,并将行为模式组织成多层嵌入式的BTSS。BTSS是模拟影响的时间动态、自动化行为和人类翻译生产过程中的认知的计算翻译代理的基础。
Article 124
Title@2025-07-16 (3): Reasoning Strategies in Large Language Models: Can They Follow, Prefer, and Optimize?
Title: Reasoning Strategies in Large Language Models: Can They Follow, Prefer, and Optimize? | Reasoning Strategies in Large Language Models: Können sie folgen, bevorzugen und optimieren? | 大语言模式中的理由战略:它们能够遵循、优于和优化吗? 2507.11423v2 |
Authors (4): Yanjian Zhang, Guillaume Wisniewski, Nadi Tomeh, Thierry Charnois
Human reasoning involves different strategies, each suited to specific problems. Prior work shows that large language models (LLMs) tend to favor a single reasoning strategy, potentially limiting their effectiveness in diverse reasoning challenges. In this work, we investigate whether prompting can control LLMs’ reasoning strategies and assess its impact on logical problem-solving. While our experiments show that no single strategy consistently improves accuracy, performance could be enhanced if models could adaptively choose the optimal strategy. We propose methods to guide LLMs in strategy selection, highlighting new ways to refine their reasoning abilities.
人类推理涉及不同的战略,每个战略都适合具体问题。 先前的工作表明,大型语言模型(LLMs)倾向于采用单一的推理战略,有可能限制其在不同推理挑战中的有效性。 在这项工作中,我们调查推动能够控制LLMs推理战略并评估其对逻辑解决问题的影响。 尽管我们的实验表明,没有一个单一战略能够不断提高准确性,但如果模型能够适应性地选择最佳战略,业绩是可以提高的。 我们提出了指导LLMs选择战略的方法,强调改进推理能力的新方法。
Article 125
Title@2025-07-16 (3): TRIM: Token Reduction and Inference Modeling for Cost-Effective Language Generation
Title: TRIM: Token Reduction and Inference Modeling for Cost-Effective Language Generation | TRIM: Token Reduction und Inferenzmodellierung für kosteneffektive Sprachgenerierung | TRIM:降低和推论模式,促进成本低效益的语文生成 2412.07682v4 |
Authors (3): Alfredo Garrachón Ruiz, Tomás de la Rosa, Daniel Borrajo
The inference cost of Large Language Models (LLMs) is a significant challenge due to their computational demands, especially on tasks requiring long outputs. However, natural language often contains redundancy, which presents an opportunity for optimization. We have observed that LLMs can generate distilled, concise outputs that retain essential meaning when prompted appropriately. We propose TRIM, a pipeline for saving computational cost in which a shorter distilled output from the LLM is reconstructed into a full narrative by a smaller model with lower inference costs. Our experiments show promising results, particularly in general knowledge domains, with 20.58% of tokens saved on average and only a small decrease in evaluation metrics, hinting that this approach can effectively balance efficiency and accuracy in language processing tasks.
大语言模型(LLMS)的推论成本是一个重大挑战,因为它们的计算需求,特别是需要长期产出的任务。然而,自然语言往往含有冗余,这提供了优化的机会。我们观察到,LLMS可以产生精炼的语言精密产出,在适当推动下保留基本意义。我们建议TRIM,这是一个节省计算成本的管道,其中将LLM的较短的提炼产出用一个较小的模型重组成一个完整的叙述,其推论成本较低。我们的实验显示了有希望的结果,特别是在一般知识领域,平均节省了20.58%的标语,评价指标也略有减少,这表明这一方法能够有效地平衡语言处理任务的效率和准确性。
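The pipeline in the abstract above has two stages: a large model emits a distilled answer, and a cheaper model expands it into full text. A minimal sketch under stated assumptions follows; the prompt wording, the `large_llm` and `small_llm` callables, and the helper names are hypothetical, and the savings formula is simply the relative reduction in tokens produced by the large model.

```python
from typing import Callable, Tuple


def trim_pipeline(
    prompt: str,
    large_llm: Callable[[str], str],
    small_llm: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (distilled, reconstructed) outputs for one prompt."""
    # Stage 1: the expensive model answers in a compressed style.
    distilled = large_llm("Answer using only essential words:\n" + prompt)
    # Stage 2: the cheap model restores fluent, complete sentences.
    reconstructed = small_llm("Rewrite as complete, fluent text:\n" + distilled)
    return distilled, reconstructed


def saved_token_pct(full_tokens: int, distilled_tokens: int) -> float:
    """Percentage of large-model output tokens saved by distillation."""
    if full_tokens <= 0:
        raise ValueError("full_tokens must be positive")
    return 100.0 * (full_tokens - distilled_tokens) / full_tokens
```

Under this formula, a distilled answer of 80 tokens in place of a 100-token full answer corresponds to 20% saved tokens on the large model.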
Article 126
Title@2025-07-16 (3): RUMAA: Repeat-Aware Unified Music Audio Analysis for Score-Performance Alignment, Transcription, and Mistake Detection
Title: RUMAA: Repeat-Aware Unified Music Audio Analysis for Score-Performance Alignment, Transcription, and Mistake Detection | RUMAA: Repeat-Aware Unified Music Audio Analyse zur Ausrichtung, Transkription und Fehlererkennung | RUMAA: 用于计分业绩协调、追踪和误差探测的重复软件统一音乐音频分析 2507.12175v1 |
Authors (3): Sungkyun Chang, Simon Dixon, Emmanouil Benetos
This study introduces RUMAA, a transformer-based framework for music performance analysis that unifies score-to-performance alignment, score-informed transcription, and mistake detection in a near end-to-end manner. Unlike prior methods addressing these tasks separately, RUMAA integrates them using pre-trained score and audio encoders and a novel tri-stream decoder capturing task interdependencies through proxy tasks. It aligns human-readable MusicXML scores with repeat symbols to full-length performance audio, overcoming traditional MIDI-based methods that rely on manually unfolded score-MIDI data with pre-specified repeat structures. RUMAA matches state-of-the-art alignment methods on non-repeated scores and outperforms them on scores with repeats in a public piano music dataset, while also delivering promising transcription and mistake detection results.
本研究介绍了RUMAA,这是一个基于变压器的音乐性能分析框架,它以近终端至终端的方式统一了分数到业绩的对齐、对分知情的抄录和错误发现。与以前分别处理这些任务的方法不同,RUMAA采用预先培训的分数和音频编码器,以及新颖的三流解码器,通过代理任务捕捉任务的相互依存性。它使人可读MusicXML分数与重复的符号与全长性能听音相匹配,克服了传统的MIDI基方法,这些方法依靠预先指定的重复结构人工展开的分数-MIDI数据。RUMAA匹配了最先进的非重现分数调整方法,并用公共钢琴音乐数据集中的重复数来优胜其分,同时还提供了很有希望的抄录和错误检测结果。
Article 127
Title@2025-07-16 (3): Protecting Copyrighted Material with Unique Identifiers in Large Language Model Training
Title: Protecting Copyrighted Material with Unique Identifiers in Large Language Model Training | Schutz urheberrechtlich geschützter Materialien mit einzigartigen Identifikatoren in großsprachlichen Modellschulungen | 在大语言模式培训中以独特标识人保护版权材料 2403.15740v3 |
Authors (4): Shuai Zhao, Linchao Zhu, Ruijie Quan, Yi Yang
A primary concern regarding training large language models (LLMs) is whether they abuse copyrighted online text. With the increasing training data scale and the prevalence of LLMs in daily lives, two problems arise: (1) false positive membership inference results misled by similar examples; (2) membership inference methods are usually too complex for end users to understand and use. To address these issues, we propose an alternative insert-and-detect methodology, advocating that web users and content platforms employ unique identifiers for reliable and independent membership inference. Users and platforms can create their identifiers, embed them in copyrighted text, and independently detect them in future LLMs. As an initial demonstration, we introduce ghost sentences and a user-friendly last-k words test, allowing end users to chat with LLMs for membership inference. Ghost sentences consist primarily of unique passphrases of random natural words, which can come with customized elements to bypass possible filter rules. The last-k words test requires a significant repetition time of ghost sentences (≥10). For cases with fewer repetitions, we designed an extra perplexity test, as LLMs exhibit high perplexity when encountering unnatural passphrases. We also conduct a comprehensive study on the memorization and membership inference of ghost sentences, examining factors such as training data scales, model sizes, repetition times, insertion positions, wordlist of passphrases, alignment, etc. Our study shows the possibility of applying ghost sentences in real scenarios and provides instructions for the potential application.
训练大语言模型(LLMs)的一个主要关切是它们是否滥用了受版权保护的在线文本。随着训练数据规模的扩大和LLMs在日常生活中的普及,出现了两个问题:(1) 成员推断结果因相似样本的误导而出现假阳性;(2) 成员推断方法通常过于复杂,最终用户难以理解和使用。为了解决这些问题,我们提出了一种替代的“插入并检测”方法,主张网络用户和内容平台使用唯一标识符进行可靠且独立的成员推断。用户和平台可以创建自己的标识符,将其嵌入受版权保护的文本,并在未来的LLMs中独立检测它们。作为初步示范,我们引入了幽灵句(ghost sentences)和便于用户使用的last-k词测试,使最终用户可以通过与LLMs对话进行成员推断。幽灵句主要由随机自然词构成的唯一口令短语组成,并可附带定制元素以绕过可能的过滤规则。last-k词测试要求幽灵句有较多的重复次数(≥10)。对于重复次数较少的情况,我们设计了额外的困惑度测试,因为LLMs在遇到不自然的口令短语时会表现出较高的困惑度。我们还对幽灵句的记忆与成员推断进行了全面研究,考察了训练数据规模、模型大小、重复次数、插入位置、口令短语词表、对齐等因素。我们的研究表明了在真实场景中应用幽灵句的可能性,并为潜在应用提供了指导。
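As a rough illustration of the insert-and-detect idea in the abstract above, the sketch below builds a passphrase from random words and runs a last-k words check against a model's continuation. The word list, the `query_model` callable, and both function names are hypothetical and chosen only for this example; the paper's actual procedure may differ.

```python
import random

# Hypothetical word list; a real deployment would draw from a much larger one.
WORDLIST = ["maple", "orbit", "lantern", "velvet", "harbor", "quartz",
            "meadow", "copper", "timber", "falcon", "ember", "willow"]


def make_ghost_sentence(num_words=10, seed=None):
    """Build a passphrase of random natural words (an assumed format)."""
    rng = random.Random(seed)
    return " ".join(rng.choice(WORDLIST) for _ in range(num_words))


def last_k_words_test(ghost, query_model, k=2):
    """Prompt with the ghost sentence minus its last k words and check
    whether the continuation starts with exactly those k words.

    `query_model` is a hypothetical callable: prompt string -> reply string.
    """
    words = ghost.split()
    prefix, target = words[:-k], words[-k:]
    continuation = query_model(" ".join(prefix)).split()
    return continuation[:k] == target
```

A model that has memorized the passphrase would complete the missing words and pass the check; a model that never saw it would almost certainly fail, since the words are random.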
Article 128
Title@2025-07-16 (3): A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems
Title: A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems | Eine Übersicht über Grenzen in LLM-Reasoning: Schlussfolgerungen Skalierung, Lernen zur Vernunft und Agentische Systeme | LLM 原因:推论增强、学习理性和制剂系统边界调查 2504.09037v2 |
Authors (12): Zixuan Ke, Fangkai Jiao, Yifei Ming, Xuan-Phi Nguyen, Austin Xu, Do Xuan Long, Minzhi Li, Chengwei Qin, Peifeng Wang, Silvio Savarese, Caiming Xiong, Shafiq Joty
Reasoning is a fundamental cognitive process that enables logical inference, problem-solving, and decision-making. With the rapid advancement of large language models (LLMs), reasoning has emerged as a key capability that distinguishes advanced AI systems from conventional models that empower chatbots. In this survey, we categorize existing methods along two orthogonal dimensions: (1) Regimes, which define the stage at which reasoning is achieved (either at inference time or through dedicated training); and (2) Architectures, which determine the components involved in the reasoning process, distinguishing between standalone LLMs and agentic compound systems that incorporate external tools, and multi-agent collaborations. Within each dimension, we analyze two key perspectives: (1) Input level, which focuses on techniques that construct high-quality prompts that the LLM conditions on; and (2) Output level, which focuses on methods that refine multiple sampled candidates to enhance reasoning quality. This categorization provides a systematic understanding of the evolving landscape of LLM reasoning, highlighting emerging trends such as the shift from inference-scaling to learning-to-reason (e.g., DeepSeek-R1), and the transition to agentic workflows (e.g., OpenAI Deep Research, Manus Agent). Additionally, we cover a broad spectrum of learning algorithms, from supervised fine-tuning to reinforcement learning such as PPO and GRPO, and the training of reasoners and verifiers. We also examine key designs of agentic workflows, from established patterns like generator-evaluator and LLM debate to recent innovations. …
推理是一个基本的认知过程,它使逻辑推断、问题求解和决策成为可能。随着大型语言模型(LLMs)的快速发展,推理已成为一种关键能力,将先进的AI系统与支撑聊天机器人的常规模型区分开来。在这项综述中,我们沿两个正交维度对现有方法进行分类:(1) 机制(Regimes),界定实现推理的阶段(在推理时或通过专门训练);(2) 架构(Architectures),决定推理过程所涉及的组成部分,区分独立的LLMs与结合外部工具的代理式复合系统以及多智能体协作。在每个维度内,我们分析两个关键视角:(1) 输入层面,侧重于构建供LLM作为条件的高质量提示的技术;(2) 输出层面,侧重于对多个采样候选进行精炼以提升推理质量的方法。这一分类有助于系统地理解LLM推理不断演变的格局,并突出了新兴趋势,例如从推理扩展向学习推理的转变(如DeepSeek-R1),以及向代理式工作流的过渡(如OpenAI Deep Research、Manus Agent)。此外,我们涵盖了广泛的学习算法,从监督微调到PPO和GRPO等强化学习,以及推理器和验证器的训练。我们还考察了代理式工作流的关键设计,从生成器-评估器和LLM辩论等既有模式到近期的创新。…
Article 129
Title@2025-07-16 (3): Large Language Models Often Know When They Are Being Evaluated
Title: Large Language Models Often Know When They Are Being Evaluated | Große Sprachmodelle kennen oft, wenn sie bewertet werden | 大语言模型经常知道何时被评估 2505.23836v3 |
Authors (5): Joe Needham, Giles Edkins, Govind Pimpale, Henning Bartsch, Marius Hobbhahn
If AI models can detect when they are being evaluated, the effectiveness of evaluations might be compromised. For example, models could have systematically different behavior during evaluations, leading to less reliable benchmarks for deployment and governance decisions. We investigate whether frontier language models can accurately classify transcripts based on whether they originate from evaluations or real-world deployment, a capability we call evaluation awareness. To achieve this, we construct a diverse benchmark of 1,000 prompts and transcripts from 61 distinct datasets. These span public benchmarks (e.g., MMLU, SWEBench), real-world deployment interactions, and agent trajectories from scaffolding frameworks (e.g., web-browsing agents). Frontier models clearly demonstrate above-random evaluation awareness (Gemini-2.5-Pro reaches an AUC of 0.83), but do not yet surpass our simple human baseline (AUC of 0.92). Furthermore, both AI models and humans are better at identifying evaluations in agentic settings compared to chat settings. Additionally, we test whether models can identify the purpose of the evaluation. Under multiple-choice and open-ended questioning, AI models far outperform random chance in identifying what an evaluation is testing for. Our results indicate that frontier models already exhibit a substantial, though not yet superhuman, level of evaluation-awareness. We recommend tracking this capability in future models.
如果大赦国际模型能够发现它们何时被评估,评价的有效性可能受到损害。例如,模型在评价期间的行为可能系统不同,导致部署和治理决定的基准不那么可靠。我们调查前沿语言模型是否能够根据来自评价还是来自现实世界的部署来准确分类记录誊本,我们称之为评估意识;为此,我们根据61个不同数据集的1 000个提示和记录誊本构建了不同的基准,这些基准包括公共基准(如MMMLU、SWEBench)、现实世界部署互动以及来自筛选框架(例如网络浏览代理商)的代理轨迹。在多选取和开放的质询下,边境模型清楚地显示了超随机评价意识(Gemini-2.5-Pro达到0.83美元AUC),但还没有超过我们简单的人类基线(0.92美元)。此外,无论是AI模型和人类模型都比聊天环境更能确定管理环境中的评价模式。此外,我们测试模型能否确定评价的目的。在多选取和开放式问答的质询中,AI模型虽然已经显示我们未来的超标度能力。
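The AUC values quoted in the abstract above summarize how well a score separates evaluation transcripts from deployment transcripts. A minimal sketch of the rank-based (Mann-Whitney) definition of AUC follows; the function name and the toy scores are illustrative assumptions, not data from the paper.

```python
def auc(scores, labels):
    """Probability that a randomly chosen positive example receives a higher
    score than a randomly chosen negative one (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

Under this definition 0.5 corresponds to random guessing and 1.0 to perfect separation, which is why 0.83 reads as clearly above random yet below the 0.92 human baseline.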
Article 130
Title@2025-07-16 (3): Overview of the Sensemaking Task at the ELOQUENT 2025 Lab: LLMs as Teachers, Students and Evaluators
Title: Overview of the Sensemaking Task at the ELOQUENT 2025 Lab: LLMs as Teachers, Students and Evaluators | Überblick über die Sensemaking-Aufgabe im ELOQUENT 2025 Lab: LLMs als Lehrer, Schüler und Evaluatoren | 2025年ELOQUent 2025实验室的决策者任务概述:教师、学生和评价员 2507.12143v1 |
Authors (2): Pavel Šindelář, Ondřej Bojar
ELOQUENT is a set of shared tasks that aims to create easily testable high-level criteria for evaluating generative language models. Sensemaking is one such shared task. In Sensemaking, we try to assess how well generative models “make sense out of a given text” in three steps inspired by exams in a classroom setting: (1) Teacher systems should prepare a set of questions, (2) Student systems should answer these questions, and (3) Evaluator systems should score these answers, all adhering rather strictly to a given set of input materials. We report on the 2025 edition of Sensemaking, where we had 7 sources of test materials (fact-checking analyses of statements, textbooks, transcribed recordings of a lecture, and educational videos) spanning English, German, Ukrainian, and Czech languages. This year, 4 teams participated, providing us with 2 Teacher submissions, 2 Student submissions, and 2 Evaluator submissions. We added baselines for Teacher and Student using commercial large language model systems. We devised a fully automatic evaluation procedure, which we compare to a minimalistic manual evaluation. We were able to make some interesting observations. For the first task, the creation of questions, better evaluation strategies will still have to be devised because it is difficult to discern the quality of the various candidate question sets. In the second task, question answering, the LLMs examined overall perform acceptably, but restricting their answers to the given input texts remains problematic. In the third task, evaluation of question answers, our adversarial tests reveal that systems using the LLM-as-a-Judge paradigm erroneously rate both garbled question-answer pairs and answers to mixed-up questions as acceptable.
ELOQUENT是一套共同的任务,旨在为评价基因化语言模式建立易于测试的高层次标准,而制定高级标准是这种共同的任务之一。在Sensemaking中,我们试图评估在课堂考试启发下的三个步骤中“使某一文本具有意义”三个步骤的基因模型有多好:(1) 教师系统应当准备一套问题,(2) 学生系统应当回答这些问题,(3) 评价系统应当得分这些答案,所有系统都应当相当严格地遵守一套投入材料。我们报告了2025年版《Sensemaking》,我们在那里有7种测试材料来源(对声明、教科书、讲座录音和教学录像的精确分析),涵盖英语、德语、乌克兰语和捷克语。今年,有4个团队参加,向我们提供了2份教师论文,2份学生论文应当回答这些问题,(3) 评估系统应当为教师和学生补充了基准,使用大型商业语言模型系统。我们设计了一个完全自动的评估答案,我们把它比作最低限度的手册。我们做了一些有趣的观察。我们做了一些有趣的观察,对于第一项任务来说,创建了一个精确的检验标准,因为创建了各种质量问题,因为对各种任务做了更难的问题进行了深入的检验,对问题进行了研究,对问题做了分析,对问题做了分析,对问题做了比较。
Article 131
Title@2025-07-16 (3): RiemannLoRA: A Unified Riemannian Framework for Ambiguity-Free LoRA Optimization
Title: RiemannLoRA: A Unified Riemannian Framework for Ambiguity-Free LoRA Optimization | RiemannLoRA: Ein einheitliches Riemann-Rahmenwerk für die ambiguitätsfreie LoRA-Optimierung | Riemann LoRA:无模糊无洛拉优化的统一里伊曼框架 2507.12142v1 |
Authors (7): Vladimir Bogachev, Vladimir Aletov, Alexander Molozhavenko, Denis Bobkov, Vera Soboleva, Aibek Alanov, Maxim Rakhuba
Low-Rank Adaptation (LoRA) has become a widely adopted standard for parameter-efficient fine-tuning of large language models (LLMs), significantly reducing memory and computational demands. However, challenges remain, including finding optimal initialization strategies or mitigating overparametrization in low-rank matrix factorization. In this work, we propose a novel approach that addresses both of the challenges simultaneously within a unified framework. Our method treats a set of fixed-rank LoRA matrices as a smooth manifold. Considering adapters as elements on this manifold removes overparametrization, while determining the direction of the fastest loss decrease along the manifold provides initialization. Special care is taken to obtain numerically stable and computationally efficient implementation of our method, using best practices from numerical linear algebra and Riemannian optimization. Experimental results on LLM and diffusion model architectures demonstrate that RiemannLoRA consistently improves both convergence speed and final performance over standard LoRA and its state-of-the-art modifications.
低兰克适应(LORA)已成为广泛采用的大型语言模型参数高效微调标准,大大减少了记忆和计算需求,但挑战依然存在,包括找到最佳初始化战略或减轻低级矩阵因子化的过度平衡化。在这项工作中,我们提出了一个新颖的办法,在一个统一的框架内同时解决这两个挑战。我们的方法把一套固定的LORA矩阵视为一个光滑的多元体。我们的方法将适应器作为这一元件的元素,消除了过度平衡化,同时确定沿多元体损失减少速度最快的方向提供了初始化。我们特别注意利用数字线性代数和Riemannian优化的最佳做法,实现我们方法的数值稳定和计算效率。LLMM和扩散模型结构的实验结果表明,Riemann LoRA不断提高标准洛拉及其最新修改的趋同速度和最后性能。
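The overparametrization mentioned in the abstract above can be made concrete with a standard identity for low-rank factorizations. The notation here is assumed for illustration and is not necessarily the paper's: the adapter update is written as a product of factors A (m × r) and B (r × n).

```latex
\Delta W = A B = (A S)(S^{-1} B)
\quad \text{for every invertible } S \in \mathbb{R}^{r \times r},
\qquad
\mathcal{M}_r = \{\, X \in \mathbb{R}^{m \times n} : \operatorname{rank}(X) = r \,\}.
```

The pair (A, B) is therefore determined only up to the choice of S, whereas the product ΔW is a single point on the manifold of rank-r matrices, which has dimension r(m + n − r). Optimizing over points of that manifold rather than over factor pairs is what removes the ambiguity.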
Article 132
Title@2025-07-16 (3): Iterative Augmentation with Summarization Refinement (IASR) Evaluation for Unstructured Survey data Modeling and Analysis
Title: Iterative Augmentation with Summarization Refinement (IASR) Evaluation for Unstructured Survey data Modeling and Analysis | Iterative Augmentation mit Summarization Refinement (IASR) Evaluation für unstrukturierte Umfragedaten Modellierung und Analyse | 对无结构调查数据建模和分析的抽样改进(IASR)评价 2507.12126v1 |
Authors (3): Payal Bhattad, Sai Manoj Pudukotai Dinakarrao, Anju Gupta
Text data augmentation is a widely used strategy for mitigating data sparsity in natural language processing (NLP), particularly in low-resource settings where limited samples hinder effective semantic modeling. While augmentation can improve input diversity and downstream interpretability, existing techniques often lack mechanisms to ensure semantic preservation during large-scale or iterative generation, leading to redundancy and instability. This work introduces a principled evaluation framework for large language model (LLM)-based text augmentation, comprising two components: (1) Scalability Analysis, which measures semantic consistency as augmentation volume increases, and (2) Iterative Augmentation with Summarization Refinement (IASR), which evaluates semantic drift across recursive paraphrasing cycles. Empirical evaluations across state-of-the-art LLMs show that GPT-3.5 Turbo achieved the best balance of semantic fidelity, diversity, and generation efficiency. Applied to a real-world topic modeling task using BERTopic with GPT-enhanced few-shot labeling, the proposed approach results in a 400% increase in topic granularity and complete elimination of topic overlaps. These findings validate the utility of the proposed frameworks for structured evaluation of LLM-based augmentation in practical NLP pipelines.
文本数据增强是缓解自然语言处理(NLP)中数据稀疏问题的一种广泛使用的策略,特别是在有限样本阻碍有效语义建模的低资源环境中。虽然增强可以改善输入多样性和下游可解释性,但现有技术往往缺乏在大规模或迭代生成过程中确保语义保持的机制,导致冗余和不稳定。这项工作为基于大语言模型(LLM)的文本增强引入了一个有原则的评价框架,包括两个组成部分:(1) 可扩展性分析,衡量随着增强量增加的语义一致性;(2) 带摘要精炼的迭代增强(IASR),评估递归改写循环中的语义漂移。对最先进LLMs的实证评估表明,GPT-3.5 Turbo在语义保真度、多样性和生成效率之间取得了最佳平衡。将该方法应用于使用BERTopic并结合GPT增强的少样本标注的真实世界主题建模任务,主题粒度提高了400%,并完全消除了主题重叠。这些发现验证了所提框架对于在实际NLP管道中对基于LLM的增强进行结构化评估的效用。
Article 133
Title@2025-07-16 (3): Learning to Reason at the Frontier of Learnability
Title: Learning to Reason at the Frontier of Learnability | Vernunft lernen an der Grenze der Lernfähigkeit | 学习在可学习的前沿学习理性 2502.12272v4 |
Authors (2): Thomas Foster, Jakob Foerster
Reinforcement learning is now widely adopted as the final stage of large language model training, especially for reasoning-style tasks such as maths problems. Typically, models attempt each question many times during a single training step and attempt to learn from their successes and failures. However, we demonstrate that throughout training with two popular algorithms (PPO and VinePPO) on two widely used datasets, many questions are either solved by all attempts - meaning they are already learned - or by none - providing no meaningful training signal. To address this, we adapt a method from the reinforcement learning literature - sampling for learnability - and apply it to the reinforcement learning stage of LLM training. Our curriculum prioritises questions with high variance of success, i.e. those where the agent sometimes succeeds, but not always. Our findings demonstrate that this curriculum consistently boosts training performance across multiple algorithms and datasets, paving the way for more efficient and effective reinforcement learning with LLMs.
强化学习现在被广泛作为大型语言模式培训的最后阶段,特别是用于数学问题等推理式任务。典型的情况是,模型在一次培训步骤中多次尝试每个问题,并试图从其成功和失败中吸取教训。然而,我们证明,在用两种广泛使用的数据集进行两种流行算法(PPO和VinePPO)培训的整个过程中,许多问题要么通过所有尝试(即他们已经学习过,要么没有提供有意义的培训信号)得到解决。为了解决这个问题,我们从强化学习文献(即学习能力抽样)中调整了一种方法,并将其应用到LLM培训的强化学习阶段。我们的课程优先考虑成功率差异很大的问题,即代理人有时成功但并不总是成功的问题。我们的研究结果表明,这一课程始终在提高多种算法和数据集的培训绩效,为与LMS一道提高学习效率和成效铺平了道路。
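The curriculum described in the abstract above favours questions with high variance of success. For a pass/fail outcome with success rate p, that variance is p(1 − p), which is zero for questions that are always or never solved. The sketch below ranks questions by this score; the function names and the input format are illustrative assumptions, and the exact scoring and sampling scheme used in the paper may differ.

```python
def learnability(successes, attempts):
    """Variance of a Bernoulli outcome with empirical success rate p."""
    p = successes / attempts
    return p * (1.0 - p)


def prioritise(question_stats, top_k):
    """Return the ids of the top_k questions with the highest learnability.

    question_stats maps a question id to (successes, attempts); this
    input format is an assumption made for the example.
    """
    ranked = sorted(
        question_stats,
        key=lambda q: learnability(*question_stats[q]),
        reverse=True,
    )
    return ranked[:top_k]
```

Questions solved on every attempt or on none both score 0 and fall to the bottom of the ranking, matching the abstract's point that they carry no useful training signal.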
Article 134
Title@2025-07-16 (3): Findings of MEGA: Maths Explanation with LLMs using the Socratic Method for Active Learning
Title: Findings of MEGA: Maths Explanation with LLMs using the Socratic Method for Active Learning | Ergebnisse von MEGA: Matheerklärung mit LLMs mit der sokratischen Methode für aktives Lernen | MEGA的研究结果:使用Scopic 积极学习方法与LLMs的数学解释 2507.12079v1 |
Authors (6): Tosin Adewumi, Foteini Simistira Liwicki, Marcus Liwicki, Viktor Gardelli, Lama Alkhaled, Hamam Mokayed
This paper presents an intervention study on the effects of the combined methods of (1) the Socratic method, (2) Chain of Thought (CoT) reasoning, (3) simplified gamification and (4) formative feedback on university students’ Maths learning driven by large language models (LLMs). We call our approach Mathematics Explanations through Games by AI LLMs (MEGA). Some students struggle with Maths and as a result avoid Maths-related disciplines or subjects despite the importance of Maths across many fields, including signal processing. Oftentimes, students’ Maths difficulties stem from suboptimal pedagogy. We compared the MEGA method to the traditional step-by-step (CoT) method to ascertain which is better by using a within-group design after randomly assigning questions for the participants, who are university students. Samples (n=60) were randomly drawn from each of the two test sets of the Grade School Math 8K (GSM8K) and Mathematics Aptitude Test of Heuristics (MATH) datasets, based on the error margin of 11%, the confidence level of 90%, and a manageable number of samples for the student evaluators. These samples were used to evaluate two capable LLMs at length (Generative Pretrained Transformer 4o (GPT4o) and Claude 3.5 Sonnet) out of the initial six that were tested for capability. The results showed that students agree in more instances that the MEGA method is experienced as better for learning for both datasets. It is even much better than the CoT (47.5% compared to 26.67%) in the more difficult MATH dataset, indicating that MEGA is better at explaining difficult Maths problems.
本文介绍了对以下方法的综合方法的影响的干预研究:(1) 科学方法,(2) 思想链(CoT)推理,(3) 简化拼写和(4) 由大语言模型驱动的大学生数学学习的形成反馈。我们称我们的方法数学解释是通过AI LLM(MEGA)的游戏来解释。一些学生在数学领域的重要性,包括信号处理方面,都与数学相关的学科或科目挣扎,结果避免了数学相关的学科或科目。通常,学生数学的难度来自亚最佳的教学方法。我们将MEGA方法与传统的逐步(CoT)方法进行了比较,以通过随机地为参与者(他们是大学生)分配问题后采用小组内设计来确定。抽样(n=60)是随机地从数学8K(GSM8K)和数学的精度测试(MATH)中,根据11%的差差差差差差差差差差差差差差差差,对初等学生的自信程度为90 %,对45GML数据进行可控性的数据数在测试前更精确的样品中显示。
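The abstract above ties the sample size (n=60) to an 11% error margin at 90% confidence without giving the formula. One plausible reading, assumed here, is the standard worst-case margin of error for a proportion, z · sqrt(p(1 − p)/n) with p = 0.5. The sketch below checks that n = 60 is consistent with that reading; the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(n, confidence=0.90, p=0.5):
    """Normal-approximation margin of error for a proportion.

    p=0.5 is the worst case; using this formula is an assumption,
    since the abstract does not say how the margin was computed.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided critical value
    return z * sqrt(p * (1.0 - p) / n)
```

With n = 60 this gives roughly 0.106, which rounds to the 11% stated in the abstract.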
Article 135
Title@2025-07-16 (3): RAGGED: Towards Informed Design of Scalable and Stable RAG Systems
Title: RAGGED: Towards Informed Design of Scalable and Stable RAG Systems | RAGGED: Auf dem Weg zu einem informierten Design von skalierbaren und stabilen RAG-Systemen | RAGGD: 实现可缩放和稳定的RAG系统的知情设计 2403.09040v3 |
Authors (4): Jennifer Hsia, Afreen Shaikh, Zhiruo Wang, Graham Neubig
Retrieval-augmented generation (RAG) enhances language models by integrating external knowledge, but its effectiveness is highly dependent on system configuration. Improper retrieval settings can degrade performance, making RAG less reliable than closed-book generation. In this work, we introduce RAGGED, a framework for systematically evaluating RAG systems across diverse retriever-reader configurations, retrieval depths, and datasets. Our analysis reveals that reader robustness to noise is the key determinant of RAG stability and scalability. Some readers benefit from increased retrieval depth, while others degrade due to their sensitivity to distracting content. Through large-scale experiments on open-domain, multi-hop, and specialized-domain datasets, we show that retrievers, rerankers, and prompts influence performance but do not fundamentally alter these reader-driven trends. By providing a principled framework and new metrics to assess RAG stability and scalability, RAGGED enables systematic evaluation of retrieval-augmented generation systems, guiding future research on optimizing retrieval depth and model robustness.
在这项工作中,我们引入了RAGED,这是在各种检索阅读器配置、检索深度和数据集方面系统评估RAGE系统的框架。我们的分析显示,读者对噪音的稳健性是RAG稳定性和可缩放性的关键决定因素。一些读者从检索深度的增加中受益,而另一些读者则由于对内容分散的敏感度而退化。我们通过在开放域、多跳点和专门域数据集方面的大规模实验,显示检索器、重新定级器和快速影响业绩,但不会从根本上改变这些由读者驱动的趋势。通过提供一个原则框架和新的指标来评估RAG稳定性和可缩放性,RAGED能够系统地评估检索放大生成系统,指导未来关于优化检索深度和模型坚固性的研究。
Article 136
Title@2025-07-16 (3): BOOKCOREF: Coreference Resolution at Book Scale
Title: BOOKCOREF: Coreference Resolution at Book Scale | BOOKCOREF: Koreferenzauflösung auf der Buchskala | BOOKCOREF: 书缩放时的共引用分辨率 2507.12075v1 |
Authors (4): Giuliano Martinelli, Tommaso Bonomo, Pere-Lluís Huguet Cabot, Roberto Navigli
Coreference Resolution systems are typically evaluated on benchmarks containing small- to medium-scale documents. When it comes to evaluating long texts, however, existing benchmarks, such as LitBank, remain limited in length and do not adequately assess system capabilities at the book scale, i.e., when co-referring mentions span hundreds of thousands of tokens. To fill this gap, we first put forward a novel automatic pipeline that produces high-quality Coreference Resolution annotations on full narrative texts. Then, we adopt this pipeline to create the first book-scale coreference benchmark, BOOKCOREF, with an average document length of more than 200,000 tokens. We carry out a series of experiments showing the robustness of our automatic procedure and demonstrating the value of our resource, which enables current long-document coreference systems to gain up to +20 CoNLL-F1 points when evaluated on full books. Moreover, we report on the new challenges introduced by this unprecedented book-scale setting, highlighting that current models fail to deliver the same performance they achieve on smaller documents. We release our data and code to encourage research and development of new book-scale Coreference Resolution systems at https://github.com/sapienzanlp/bookcoref.
在评估长文本时,LitBank等现有基准的长度仍然有限,没有适当评估图书规模的系统能力,也就是说,当共同参照提到几十万个象征物时,我们首先提出一个新的自动管道,在完整叙述文本中产生高质量的共同参考分辨率说明。然后,我们通过这一管道,建立第一个书级共同参考基准,BOOKCOREF, 平均文件长度超过20万个符号。我们进行了一系列实验,显示我们的自动程序是否稳健,并展示我们资源的价值,使目前的长文件共同参照系统在对全书进行评价时能够达到+20CONLL-F1点。此外,我们报告这一前所未有的书级设置带来的新挑战,强调当前模式未能在较小的文件中实现同样的业绩。我们公布数据和代码,以鼓励在 https://gricorpusbasim/sapcom鼓励研究和开发新的书级共同参考系统。
Article 137
Title@2025-07-16 (3): StylOch at PAN: Gradient-Boosted Trees with Frequency-Based Stylometric Features
Title: StylOch at PAN: Gradient-Boosted Trees with Frequency-Based Stylometric Features | StylOch bei PAN: Gradient-Boosted Trees mit frequenzbasierten stylometrischen Eigenschaften | PAN的StylOch:带以频率为基础的音量特征的梯度-波状树 2507.12064v1 |
Authors (4): Jeremi K. Ochab, Mateusz Matias, Tymoteusz Boba, Tomasz Walkowiak
This submission to the binary AI detection task is based on a modular stylometric pipeline in which public spaCy models are used for text preprocessing (including tokenisation, named entity recognition, dependency parsing, part-of-speech tagging, and morphology annotation) and for extracting several thousand features (frequencies of n-grams of the above linguistic annotations), and light-gradient boosting machines are used as the classifier. We collect a large corpus of more than 500 000 machine-generated texts for the classifier’s training. We explore several parameter options to increase the classifier’s capacity and take advantage of that training set. Our approach follows the non-neural, computationally inexpensive but explainable approach found effective previously.
提交AI二进制检测任务的本件是基于模块式的tylology管道,其中:在文本预处理(包括象征性化、名称实体识别、依赖分析、部分语音标记和形态说明)和提取数千个特征(上述语言注释n克的频率)时,使用公共微粒模型;使用轻度助推机作为分类器。我们收集了50多万份用于分类器培训的机器生成文本。我们探索了几种参数选项,以提高分类器的能力并利用该成套培训。我们的方法遵循了以前发现的非神经、计算成本低但可以解释的方法。
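The features described in the abstract above are frequencies of n-grams over linguistic annotations. A minimal sketch of that feature extraction is shown below, assuming the annotation sequence (for example, part-of-speech tags) has already been produced by a tagger; the function name and the use of relative frequencies are illustrative assumptions, and the classifier stage is not shown.

```python
from collections import Counter


def ngram_frequencies(tags, n=2):
    """Relative frequency of each n-gram in a sequence of annotation tags."""
    grams = [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]
    if not grams:
        return {}  # sequence shorter than n: no features
    counts = Counter(grams)
    total = len(grams)
    return {gram: count / total for gram, count in counts.items()}
```

Each distinct n-gram becomes one column of a feature vector, so a few annotation layers and a few values of n quickly add up to the several thousand features the abstract mentions.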
Article 138
Title@2025-07-16 (3): Evaluating the Ability of Large Language Models to Reason about Cardinal Directions, Revisited
Title: Evaluating the Ability of Large Language Models to Reason about Cardinal Directions, Revisited | Bewertung der Fähigkeit von großen Sprachmodellen zur Vernunft über Kardinal-Anweisungen, Revisited | 评价大语言模式与红红衣主教指示理由相符的能力,重新审查 2507.12059v1 |
Authors (2): Anthony G Cohn, Robert E Blackwell
We investigate the abilities of 28 Large Language Models (LLMs) to reason about cardinal directions (CDs) using a benchmark generated from a set of templates, extensively testing an LLM’s ability to determine the correct CD given a particular scenario. The templates allow for a number of degrees of variation, such as the means of locomotion of the agent involved and whether the scenario is set in the first, second or third person. Even the newer Large Reasoning Models are unable to reliably determine the correct CD for all questions. This paper summarises and extends earlier work presented at COSIT-24.
我们调查了28种大语言模型(LLMs)使用一套模板产生的基准来解释主方向的能力,广泛测试LLM在特定情况下确定正确的CD的能力,这些模板允许若干程度的差异,如所涉代理人的移动手段,以及是否设置在第一、第二或第三人中,即使是较新的大语言模型也无法可靠地确定所有问题的正确的CD,本文总结并扩展了COSIT-24先前的工作。
Article 139
Title@2025-07-16 (3): ReviewAgents: Bridging the Gap Between Human and AI-Generated Paper Reviews
Title: ReviewAgents: Bridging the Gap Between Human and AI-Generated Paper Reviews | ReviewAgents: Die Kluft zwischen menschlichen und KI-generierten Paper Reviews überbrücken | 审查机构:弥合人类与AI - AI - 创创文件审查之间的差距 2503.08506v3 |
Authors (6): Xian Gao, Jiacheng Ruan, Zongyun Zhang, Jingsheng Gao, Ting Liu, Yuzhuo Fu
Academic paper review is a critical yet time-consuming task within the research community. With the increasing volume of academic publications, automating the review process has become a significant challenge. The primary issue lies in generating comprehensive, accurate, and reasoning-consistent review comments that align with human reviewers’ judgments. In this paper, we address this challenge by proposing ReviewAgents, a framework that leverages large language models (LLMs) to generate academic paper reviews. We first introduce a novel dataset, Review-CoT, consisting of 142k review comments, designed for training LLM agents. This dataset emulates the structured reasoning process of human reviewers: summarizing the paper, referencing relevant works, identifying strengths and weaknesses, and generating a review conclusion. Building upon this, we train LLM reviewer agents capable of structured reasoning using a relevant-paper-aware training method. Furthermore, we construct ReviewAgents, a multi-role, multi-LLM agent review framework, to enhance the review comment generation process. Additionally, we propose ReviewBench, a benchmark for evaluating the review comments generated by LLMs. Our experimental results on ReviewBench demonstrate that while existing LLMs exhibit a certain degree of potential for automating the review process, there remains a gap when compared to human-generated reviews. Moreover, our ReviewAgents framework further narrows this gap, outperforming advanced LLMs in generating review comments.
学术论文审查是研究界一项重要而又耗时的任务。随着学术出版物数量的增加,审查过程的自动化已成为一项重大挑战。主要问题在于产生与人类审评员的判断相一致的全面、准确和符合逻辑的审查评论。在本文件中,我们通过提出“审查机构”来应对这一挑战,这个框架利用大语言模型(LLMS)来生成学术论文审查。我们首先推出新的数据集,即审查-Cot,由142k项评论组成,目的是培训LLM代理商。这一数据集效仿了对文件进行总结的人类审查者的结构推理过程,参考了相关著作,查明了优缺点,并得出了审查结论。在此基础上,我们培训LLMMA审查人员,他们能够利用相关的书面意识培训方法进行结构推理。此外,我们构建了审查机构、多功能、多LLMM代理商审查框架,以加强审查工作。此外,我们提议审查Bench-Bench是评价LMS所生成的审查意见的有条理有理的根据。我们关于Bench的实验性结果显示,在对LMS的审查中进行某种程度的分析审查时,而现有的LMS审查则显示,在进行这种业绩审查时仍具有某种程度。
Article 140
Title@2025-07-16 (3): Improving Data and Parameter Efficiency of Neural Language Models Using Representation Analysis
Title: Improving Data and Parameter Efficiency of Neural Language Models Using Representation Analysis | Verbesserung der Daten- und Parametereffizienz von neuralen Sprachmodellen mittels Darstellungsanalyse | 改进使用代表性分析的神经语言模型的数据和参数效率 2507.12004v1 |
Authors (1): Josip Jukić
This thesis addresses challenges related to data and parameter efficiency in neural language models, with a focus on representation analysis and the introduction of new optimization techniques. The first part examines the properties and dynamics of language representations within neural models, emphasizing their significance in enhancing robustness and generalization. It proposes innovative approaches based on representation smoothness, including regularization strategies that utilize Jacobian and Hessian matrices to stabilize training and mitigate sensitivity to input perturbations. The second part focuses on methods to significantly enhance data and parameter efficiency by integrating active learning strategies with parameter-efficient fine-tuning, guided by insights from representation smoothness analysis. It presents smoothness-informed early-stopping techniques designed to eliminate the need for labeled validation sets and proposes innovative combinations of active learning and parameter-efficient fine-tuning to reduce labeling efforts and computational resources. Extensive experimental evaluations across various NLP tasks demonstrate that these combined approaches substantially outperform traditional methods in terms of performance, stability, and efficiency. The third part explores weak supervision techniques enhanced by in-context learning to effectively utilize unlabeled data, further reducing dependence on extensive labeling. It shows that using in-context learning as a mechanism for weak supervision enables models to better generalize from limited labeled data by leveraging unlabeled examples more effectively during training. Comprehensive empirical evaluations confirm significant gains in model accuracy, adaptability, and robustness, especially in low-resource settings and dynamic data environments.
第一部分审查神经模型中语言代表的特性和动态,强调其在加强稳健性和概括性方面的重要性; 提出基于代表性平稳性的创新办法,包括利用Jacobian和Hessian矩阵的正规化战略,以稳定培训,减轻对输入扰动的敏感性; 第二部分侧重于通过将积极学习战略与具有参数效率的微调相结合,在代表性平稳分析的洞察力指导下,采用参数效率微调的方法,大大提高数据和参数效率; 第一部分审查神经模型中语言代表的特性和动态,强调语言代表在神经模型中的特性和动态,强调其在加强稳健性和普遍性方面的重要性; 第二部分着重探讨通过将积极学习战略与参数高效微调相结合,提高数据和参数效率的方法; 第一部分介绍旨在消除贴标签验证成套验证的早期平稳技术,提出将积极学习和参数高效微调的微调相结合,以减少标签工作和计算资源; 在整个国家实验室任务中进行广泛的实验性评价,通过利用弹性数据模型,有效地利用弹性数据,从而确认在弹性评估过程中,将弹性的精确性经验化,将弹性数据化的成绩作为有效的工具。
Article 141
Title@2025-07-16 (3): Labels Generated by Large Language Models Help Measure People’s Empathy in Vitro
Title: Labels Generated by Large Language Models Help Measure People’s Empathy in Vitro | Etiketten, die durch große Sprachmodelle erzeugt werden, helfen, die Empathie der Menschen in Vitro zu messen | 以大语言模型生成的标签 帮助测量体外民众的共鸣 2501.00691v2 |
Authors (7): Md Rakibul Hasan, Yue Yao, Md Zakir Hossain, Aneesh Krishna, Imre Rudas, Shafin Rahman, Tom Gedeon
Large language models (LLMs) have revolutionised many fields, with LLM-as-a-service (LLMSaaS) offering accessible, general-purpose solutions without costly task-specific training. In contrast to the widely studied prompt engineering for directly solving tasks (in vivo), this paper explores LLMs’ potential for in-vitro applications: using LLM-generated labels to improve supervised training of mainstream models. We examine two strategies - (1) noisy label correction and (2) training data augmentation - in empathy computing, an emerging task to predict psychology-based questionnaire outcomes from inputs like textual narratives. Crowdsourced datasets in this domain often suffer from noisy labels that misrepresent underlying empathy. We show that replacing or supplementing these crowdsourced labels with LLM-generated labels, developed using psychology-based scale-aware prompts, achieves statistically significant accuracy improvements. Notably, the RoBERTa pre-trained language model (PLM) trained with noise-reduced labels yields a state-of-the-art Pearson correlation coefficient of 0.648 on the public NewsEmp benchmarks. This paper further analyses evaluation metric selection and demographic biases to help guide the future development of more equitable empathy computing models. Code and LLM-generated labels are available at https://github.com/hasan-rakibul/LLMPathy.
大型语言模型(LLMs)使许多领域发生了革命性变化,LLM-as-a-service(LLMSaaS)提供易于使用的通用解决方案,而无需昂贵的特定任务训练。与被广泛研究的直接解决任务(in vivo,体内)的提示工程不同,本文探讨了LLM在体外(in vitro)应用方面的潜力:使用LLM生成的标签来改进主流模型的监督训练。我们在共情计算中研究了两种策略:(1) 噪声标签校正和(2) 训练数据增强;共情计算是一项新兴任务,旨在根据文本叙事等输入预测基于心理学的问卷结果。这一领域的众包数据集经常受到噪声标签的困扰,这些标签不能真实反映潜在的共情。我们表明,用LLM生成的标签(利用基于心理学的量表感知提示生成)取代或补充这些众包标签,可在统计上显著提高准确性。值得注意的是,用降噪标签训练的RoBERTa预训练语言模型(PLM)在公开的NewsEmp基准上取得了0.648的最先进Pearson相关系数。本文进一步分析了评价指标的选择和人口统计偏差,以帮助指导未来开发更公平的共情计算模型。代码和LLM生成的标签见 https://github.com/hasan-rakibul/LLMPathy
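The abstract above reports its headline result as a Pearson correlation coefficient between predicted and reference scores. As a reminder of what that metric measures, here is a minimal standard-library sketch. The function name `pearson_r` and the sample scores are illustrative assumptions, not taken from the paper or its repository.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Covariance and variances, all without the 1/n factor (it cancels out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical reference scores and model predictions (illustrative only).
gold = [1.0, 2.0, 3.0, 4.0, 5.0]
pred = [1.5, 1.9, 3.2, 3.8, 5.1]
print(round(pearson_r(gold, pred), 3))
```

A value of 1 means the predictions are a perfect increasing linear function of the references, 0 means no linear relationship, and -1 a perfect decreasing one.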
Article 142
Title@2025-07-16 (3): DEEPER Insight into Your User: Directed Persona Refinement for Dynamic Persona Modeling
Title: DEEPER Insight into Your User: Directed Persona Refinement for Dynamic Persona Modeling | DEEPER Einblick in Ihren Anwender: Direkte Persona-Verfeinerung für dynamische Persona-Modellierung | DEEPER 对用户的洞察: 动态人造模型的直接人性改进 2502.11078v2 |
Authors (9): Aili Chen, Chengyu Du, Jiangjie Chen, Jinghan Xu, Yikai Zhang, Siyu Yuan, Zulong Chen, Liangyue Li, Yanghua Xiao
To advance personalized applications such as recommendation systems and user behavior prediction, recent research increasingly adopts large language models (LLMs) for human-readable persona modeling. In dynamic real-world scenarios, effective persona modeling necessitates leveraging streaming behavior data to continually optimize user personas. However, existing methods, whether regenerating personas or incrementally extending them with new behaviors, often fail to achieve sustained improvements in persona quality or future behavior prediction accuracy. To address this, we propose DEEPER, a novel approach for dynamic persona modeling that enables continual persona optimization. Specifically, we enhance the model’s direction-search capability through an iterative reinforcement learning framework, allowing it to automatically identify effective update directions and optimize personas using discrepancies between user behaviors and model predictions. Extensive experiments on dynamic persona modeling involving 4800 users across 10 domains highlight the superior persona optimization capabilities of DEEPER, delivering an impressive 32.2% average reduction in user behavior prediction error over four update rounds and outperforming the best baseline by a remarkable 22.92%.
为了推进推荐系统和用户行为预测等个性化应用,最近的研究越来越多地采用大型语言模型(LLMs)进行人类可读的用户画像(persona)建模。在动态的真实世界情景中,有效的画像建模需要利用流式行为数据来不断优化用户画像。然而,现有的方法,无论是重新生成画像还是用新的行为逐步扩展画像,往往无法持续提高画像质量或未来行为预测的准确性。为了解决这个问题,我们提出DEEPER,这是一种新的动态画像建模方法,可以持续优化画像。具体地说,我们通过一个迭代强化学习框架来增强模型的方向搜索能力,使其能够自动确定有效的更新方向,并利用用户行为和模型预测之间的差异优化画像。涉及10个领域4800个用户的动态画像建模广泛实验突出了DEEPER优越的画像优化能力,在四轮更新中平均减少32.2%的用户行为预测误差,比最佳基线高出22.92%。
Article 143
Title@2025-07-16 (3): Simplifications are Absolutists: How Simplified Language Reduces Word Sense Awareness in LLM-Generated Definitions
Title: Simplifications are Absolutists: How Simplified Language Reduces Word Sense Awareness in LLM-Generated Definitions | Vereinfachungen sind Absolutisten: Wie vereinfachte Sprache das Wortsinnbewusstsein in LLM-generierten Definitionen reduziert | 简化程序是绝对论者:简化语言如何减少LLM-创用定义中的言语感知 2507.11981v1 |
Authors (3): Lukas Ellinger, Miriam Anschütz, Georg Groh
Large Language Models (LLMs) can provide accurate word definitions and explanations for any context. However, the scope of the definition changes for different target groups, like children or language learners. This is especially relevant for homonyms, words with multiple meanings, where oversimplification might risk information loss by omitting key senses, potentially misleading users who trust LLM outputs. We investigate how simplification impacts homonym definition quality across three target groups: Normal, Simple, and ELI5. Using two novel evaluation datasets spanning multiple languages, we test DeepSeek v3, Llama 4 Maverick, Qwen3-30B A3B, GPT-4o mini, and Llama 3.1 8B via LLM-as-Judge and human annotations. Our results show that simplification drastically degrades definition completeness by neglecting polysemy, increasing the risk of misunderstanding. Fine-tuning Llama 3.1 8B with Direct Preference Optimization substantially improves homonym response quality across all prompt types. These findings highlight the need to balance simplicity and completeness in educational NLP to ensure reliable, context-aware definitions for all learners.
大型语言模型(LLMs)可以为任何语境提供准确的词语定义和解释。然而,对于儿童或语言学习者等不同目标群体,定义的范围会发生变化。这对于同音异义词(具有多重含义的词)尤其重要:过度简化可能因省略关键义项而造成信息丢失,有可能误导信任LLM输出的用户。我们研究了简化如何影响三个目标群体的同音异义词定义质量:Normal、Simple和ELI5。我们利用两个跨越多种语言的新评价数据集,通过LLM-as-Judge和人工标注测试了DeepSeek v3、Llama 4 Maverick、Qwen3-30B A3B、GPT-4o mini和Llama 3.1 8B。我们的结果显示,简化会因忽略多义性而极大地降低定义的完整性,增加误解的风险。用直接偏好优化(Direct Preference Optimization)微调Llama 3.1 8B可在所有提示类型上大幅提高同音异义词回答的质量。这些结果突出表明,教育类NLP需要平衡简洁性和完整性,以确保为所有学习者提供可靠的、具有语境意识的定义。
Article 144
Title@2025-07-16 (3): Value-Based Large Language Model Agent Simulation for Mutual Evaluation of Trust and Interpersonal Closeness
Title: Value-Based Large Language Model Agent Simulation for Mutual Evaluation of Trust and Interpersonal Closeness | Value-Based Large Language Model Agent Simulation zur gegenseitigen Bewertung von Vertrauen und zwischenmenschlicher Nähe | 用于相互评价信任和人际亲密的基于价值的大型语言模型模拟剂 2507.11979v1 |
Authors (3): Yuki Sakamoto, Takahisa Uchida, Hiroshi Ishiguro
Large language models (LLMs) have emerged as powerful tools for simulating complex social phenomena using human-like agents with specific traits. In human societies, value similarity is important for building trust and close relationships; however, it remains unexplored whether this principle holds true in artificial societies comprising LLM agents. Therefore, this study investigates the influence of value similarity on relationship-building among LLM agents through two experiments. First, in a preliminary experiment, we evaluated the controllability of values in LLMs to identify the most effective model and prompt design for controlling the values. Subsequently, in the main experiment, we generated pairs of LLM agents imbued with specific values and analyzed their mutual evaluations of trust and interpersonal closeness following a dialogue. The experiments were conducted in English and Japanese to investigate language dependence. The results confirmed that pairs of agents with higher value similarity exhibited greater mutual trust and interpersonal closeness. Our findings demonstrate that the LLM agent simulation serves as a valid testbed for social science theories, contributes to elucidating the mechanisms by which values influence relationship building, and provides a foundation for inspiring new theories and insights into the social sciences.
大型语言模型(LLMs)已成为利用具有特定特征的类似人类剂模拟复杂社会现象的有力工具。在人类社会,价值相似性对于建立信任和密切的关系很重要;然而,对于这一原则在由LLM代理商组成的人工社会中是否适用,仍没有探讨这一原则是否在由LLLM代理商组成的人工社会中适用。因此,这项研究调查了价值相似性对LLM代理商之间建立关系的影响。首先,在初步实验中,我们评估了LLMS中价值的可控制性,以确定最有效的模式和控制价值的迅速设计。随后,在主要实验中,我们产生了一对配有特定价值的LLM代理商,分析了他们在对话后对信任和人际关系密切的相互评价。实验用英语和日语进行,以调查语言依赖性。结果证实,价值相近的两对代理人表现出更大的相互信任和人际关系密切性。我们的调查结果表明,LLM代理商模拟是社会科学理论的有效检验台,有助于解释价值建立关系的机制,并为激发社会科学的新理论和洞察提供基础。
Article 145
Title@2025-07-16 (3): Graph Representations for Reading Comprehension Analysis using Large Language Model and Eye-Tracking Biomarker
Title: Graph Representations for Reading Comprehension Analysis using Large Language Model and Eye-Tracking Biomarker | Graphische Darstellungen für die Leseverständnisanalyse mit Large Language Model und Eye-Tracking Biomarker | 使用大语言模型和眼跟踪生物标记的阅读综合分析图示 2507.11972v1 |
Authors (6): Yuhong Zhang, Jialu Li, Shilai Yang, Yuchen Xu, Gert Cauwenberghs, Tzyy-Ping Jung
Reading comprehension is a fundamental skill in human cognitive development. With the advancement of Large Language Models (LLMs), there is a growing need to compare how humans and LLMs understand language across different contexts and apply this understanding to functional tasks such as inference, emotion interpretation, and information retrieval. Our previous work used LLMs and human biomarkers to study the reading comprehension process. The results showed that the biomarkers corresponding to words with high and low relevance to the inference target, as labeled by the LLMs, exhibited distinct patterns, particularly when validated using eye-tracking data. However, focusing solely on individual words limited the depth of understanding, which made the conclusions somewhat simplistic despite their potential significance. This study used an LLM-based AI agent to group words from a reading passage into nodes and edges, forming a graph-based text representation based on semantic meaning and question-oriented prompts. We then compare the distribution of eye fixations on important nodes and edges. Our findings indicate that LLMs exhibit high consistency in language understanding at the level of graph topological structure. These results build on our previous findings and offer insights into effective human-AI co-learning strategies.
阅读理解是人类认知发展的基本技能。随着大语言模型的进步,人们越来越需要比较人类和LLMM如何理解不同背景的语言,并将这种理解运用于诸如推论、情感判读和信息检索等功能性任务。我们以前的工作使用LLMS和人类生物标志来研究阅读理解过程。结果显示,与LLMS所标称的与推论目标高度和低相关性的词相对应的生物标志呈现出不同的模式,特别是在使用眼睛跟踪数据加以验证时。然而,仅仅侧重于个别词限制了理解的深度,使结论略为简单化,尽管这些结论具有潜在意义。这项研究利用一个基于LLMM的AI代理来将词从阅读通道分组到节点和边缘,形成基于语义含义和以问题为导向的提示的基于图表的文字表述。我们随后比较了重要节点和边缘的眼固值分布。我们的研究结果表明,LMs在图表结构层次的语言理解方面表现出高度的一致性。这些结果借鉴了我们以前的调查结果,并提供了对有效的人类-AI共同学习战略的深刻见解。
Article 146
Title@2025-07-16 (3): Organize the Web: Constructing Domains Enhances Pre-Training Data Curation
Title: Organize the Web: Constructing Domains Enhances Pre-Training Data Curation | Organisation des Webs: Aufbau von Domains verbessert die Vorschulung von Daten-Curation | 组织网络: 构建域域 增强培训前数据曲线 2502.10341v3 |
Authors (6): Alexander Wettig, Kyle Lo, Sewon Min, Hannaneh Hajishirzi, Danqi Chen, Luca Soldaini
Modern language models are trained on large, unstructured datasets consisting of trillions of tokens and obtained by crawling the web. The unstructured nature makes it difficult to reason about their contents and develop systematic approaches to data curation. In this paper, we unpack monolithic web corpora by developing taxonomies of their contents and organizing them into domains. We introduce WebOrganizer, a framework for organizing web pages in terms of both their topic and format. Using these two complementary notions of domains, we automatically annotate pre-training data by distilling annotations from a large language model into efficient classifiers. This allows us to study how data from different domains should be mixed to improve models on downstream tasks, and we show that we can combine insights about effective topics and formats to further boost performance. We demonstrate that our domain mixing also improves existing methods that select data based on quality. Furthermore, we study and compare how quality-based methods will implicitly change the domain mixture. Overall, our work demonstrates that constructing and mixing domains provides a valuable complement to quality-based data curation methods, opening new avenues for effective and insightful pre-training data curation.
现代语言模型在由数万亿象征性物组成的大型非结构化数据集上进行了培训,并且通过上网获取。非结构化的性质使得很难解释其内容,也难以制定系统化的数据整理方法。在本文中,我们通过对内容进行分类并将其组织成领域,解开单一的网络群体。我们引入了WebOrganizer,这是一个在主题和格式上组织网页的框架。我们利用这两个互补的领域概念,通过将大语言模型的注释提炼成高效分类器,自动对培训前数据进行说明。这使我们能够研究不同领域的数据如何混合,以改进下游任务模型,我们展示我们可以将关于有效主题和格式的见解结合起来,进一步提高绩效。我们证明我们的领域混合还改进了根据质量选择数据的现有方法。此外,我们研究并比较了基于质量的方法如何暗含改变域混合物。总体而言,我们的工作表明,构建和混合区域为基于质量的数据整理方法提供了宝贵的补充,打开了有效、有见地的训练前数据整理新途径。
Article 147
Title@2025-07-16 (3): CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions
Title: CultureCLIP: Empowering CLIP with Cultural Awareness through Synthetic Images and Contextualized Captions | CultureCLIP: CLIP mit kulturellem Bewusstsein durch synthetische Bilder und kontextualisierte Captions stärken | CICLIP: 通过合成图像和背景说明赋予CLIP以文化意识,赋予CLIP权力 2507.06210v2 |
Authors (6): Yuchen Huang, Zhiyuan Fan, Zhitao He, Sandeep Polisetty, Wenyan Li, Yi R. Fung
Pretrained vision-language models (VLMs) such as CLIP excel in general multimodal comprehension but often struggle to capture nuanced, context-dependent visual cues. This makes it difficult to distinguish between similar-looking concepts with potentially different cultural meanings. Such deficiencies are mainly due to a limited amount of high-quality cultural data, contextual information, and the lack of negative examples that highlight subtle differences. To mitigate this, we design a data curation pipeline leveraging open-sourced VLMs and text-to-image models to construct CulTwin, a synthetic cultural dataset. This dataset consists of paired concept-caption-image triplets, where concepts visually resemble each other but are culturally different. Then, we fine-tune CLIP on CulTwin to develop CultureCLIP, which aligns cultural concepts with contextually enhanced captions and synthetic images through tailored contrastive learning. Experiments on culture-specific benchmarks show that CultureCLIP outperforms the base CLIP, achieving up to a notable 5.49% improvement in fine-grained concept recognition on certain tasks while preserving CLIP’s original generalization ability, validating the effectiveness of our data synthesis and VLM backbone training paradigm in capturing subtle cultural distinctions.
诸如CLIP等预训练视觉语言模型(VLMs)在一般的多模态理解方面表现出色,但往往难以捕捉细微的、依赖语境的视觉线索。这使得很难区分外观相似但可能具有不同文化含义的概念。这些缺陷主要是由于高质量的文化数据和语境信息数量有限,以及缺乏突出细微差异的负例。为了缓解这一问题,我们设计了一个数据整理流水线,利用开源VLM和文本到图像模型来构建合成文化数据集CulTwin。该数据集由成对的概念-说明-图像三元组组成,其中的概念在视觉上彼此相似,但在文化上不同。然后,我们在CulTwin上微调CLIP以开发CultureCLIP,通过定制的对比学习将文化概念与语境增强的说明文字和合成图像对齐。在特定文化基准上的实验表明,CultureCLIP优于基础CLIP,在某些任务的细粒度概念识别上最多取得5.49%的显著提升,同时保持了CLIP原有的泛化能力,验证了我们的数据合成和VLM骨干训练范式在捕捉细微文化差异方面的有效性。
Article 148
Title@2025-07-16 (3): Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation
Title: Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation | Decoder-Hybrid-Decoder-Architektur für effizientes Nachdenken mit langer Generation | 提高长代人合理性效率的代coder-Hybrid-Decer 结构 2507.06607v2 |
Authors (14): Liliang Ren, Congcong Chen, Haoran Xu, Young Jin Kim, Adam Atkinson, Zheng Zhan, Jiankai Sun, Baolin Peng, Liyuan Liu, Shuohang Wang, Hao Cheng, Jianfeng Gao, Weizhu Chen, Yelong Shen
Recent advances in language modeling have demonstrated the effectiveness of State Space Models (SSMs) for efficient sequence modeling. While hybrid architectures such as Samba and the decoder-decoder architecture, YOCO, have shown promising performance gains over Transformers, prior works have not investigated the efficiency potential of representation sharing between SSM layers. In this paper, we introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. We apply it to create SambaY, a decoder-hybrid-decoder architecture that incorporates GMUs in the cross-decoder to share memory readout states from a Samba-based self-decoder. SambaY significantly enhances decoding efficiency, preserves linear pre-filling time complexity, and boosts long-context performance, all while eliminating the need for explicit positional encoding. Through extensive scaling experiments, we demonstrate that our model exhibits a significantly lower irreducible loss compared to a strong YOCO baseline, indicating superior performance scalability under large-scale compute regimes. Our largest model enhanced with Differential Attention, Phi4-mini-Flash-Reasoning, achieves significantly better performance than Phi4-mini-Reasoning on reasoning tasks such as Math500, AIME24/25, and GPQA Diamond without any reinforcement learning, while delivering up to 10x higher decoding throughput on 2K-length prompts with 32K generation length under the vLLM inference framework. We release our training codebase on open-source data at https://github.com/microsoft/ArchScale.
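语言建模方面的最新进展显示了状态空间模型(SSMs)对于高效序列建模的有效性。虽然Samba和解码器-解码器架构YOCO等混合架构相对Transformer表现出了有希望的性能收益,但先前的工作并未研究SSM层之间表征共享的效率潜力。在本文中,我们引入了门控记忆单元(GMU),这是一种简单而有效的跨层高效记忆共享机制。我们用它来创建SambaY,一种解码器-混合-解码器架构,在交叉解码器中纳入GMU,以共享来自基于Samba的自解码器的记忆读出状态。SambaY显著提高了解码效率,保持了线性的预填充时间复杂度,并提升了长上下文性能,同时无需显式位置编码。通过广泛的扩展实验,我们证明与强大的YOCO基线相比,我们的模型表现出明显更低的不可约损失,表明在大规模计算条件下具有更优的性能可扩展性。我们以Differential Attention增强的最大模型Phi4-mini-Flash-Reasoning在Math500、AIME24/25和GPQA Diamond等推理任务上,在没有任何强化学习的情况下取得了明显优于Phi4-mini-Reasoning的性能,同时在vLLM推理框架下,对于2K长度的提示和32K的生成长度,解码吞吐量最高提高10倍。我们在 https://github.com/microsoft/ArchScale 发布了基于开源数据的训练代码库。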
Article 149
Title@2025-07-16 (3): Toxicity-Aware Few-Shot Prompting for Low-Resource Singlish Translation
Title: Toxicity-Aware Few-Shot Prompting for Low-Resource Singlish Translation | Giftigkeits-Bewusst Wenig-heiße Prompting für Low-Resource-Singlish Übersetzung | 低资源录音翻译的微热提示 2507.11966v1 |
Authors (4): Ziyu Ge, Gabriel Chua, Leanne Tan, Roy Ka-Wei Lee
As online communication increasingly incorporates under-represented languages and colloquial dialects, standard translation systems often fail to preserve local slang, code-mixing, and culturally embedded markers of harmful speech. Translating toxic content between low-resource language pairs poses additional challenges due to scarce parallel data and safety filters that sanitize offensive expressions. In this work, we propose a reproducible, two-stage framework for toxicity-preserving translation, demonstrated on a code-mixed Singlish safety corpus. First, we perform human-verified few-shot prompt engineering: we iteratively curate and rank annotator-selected Singlish-target examples to capture nuanced slang, tone, and toxicity. Second, we optimize model-prompt pairs by benchmarking several large language models using semantic similarity via direct and back-translation. Quantitative human evaluation confirms the effectiveness and efficiency of our pipeline. Beyond improving translation quality, our framework contributes to the safety of multicultural LLMs by supporting culturally sensitive moderation and benchmarking in low-resource contexts. By positioning Singlish as a testbed for inclusive NLP, we underscore the importance of preserving sociolinguistic nuance in real-world applications such as content moderation and regional platform governance.
随着在线交流越来越多地纳入代表性不足的语言和口语方言,标准翻译系统往往无法保留当地俚语、语码混用以及有害言论中带有文化内涵的标记。在低资源语言对之间翻译有毒内容带来了额外的挑战,因为平行数据稀缺,而且安全过滤器会净化冒犯性表达。在这项工作中,我们提出了一个可复现的两阶段毒性保留翻译框架,并在一个语码混用的Singlish安全语料库上进行了演示。首先,我们进行经人工验证的少样本提示工程:我们反复整理并排序由标注者选出的Singlish-目标语言示例,以捕捉细微的俚语、语气和毒性。第二,我们通过直接翻译和回译的语义相似度对若干大型语言模型进行基准测试,从而优化模型-提示组合。定量的人工评估证实了我们流程的有效性和效率。除了提高翻译质量外,我们的框架还通过支持低资源环境下具有文化敏感性的内容审核和基准测试,促进多元文化LLM的安全性。通过将Singlish定位为包容性NLP的试验台,我们强调了在内容审核和区域平台治理等现实应用中保留社会语言细微差别的重要性。
Article 150
Title@2025-07-16 (3): BRIDGE: Bootstrapping Text to Control Time-Series Generation via Multi-Agent Iterative Optimization and Diffusion Modeling
Title: BRIDGE: Bootstrapping Text to Control Time-Series Generation via Multi-Agent Iterative Optimization and Diffusion Modeling | BRIDGE: Bootstrapping-Text zur Steuerung der Time-Series-Generation über Multi-Agent iterative Optimierung und Diffusionsmodellierung | BRIDGE:通过多代理迭代优化和传播模型化控制时间- 系列生成的推进文本 2503.02445v5 |
Authors (8): Hao Li, Yu-Hao Huang, Chang Xu, Viktor Schlegel, Renhe Jiang, Riza Batista-Navarro, Goran Nenadic, Jiang Bian
Time-series Generation (TSG) is a prominent research area with broad applications in simulations, data augmentation, and counterfactual analysis. While existing methods have shown promise in unconditional single-domain TSG, real-world applications demand cross-domain approaches capable of controlled generation tailored to domain-specific constraints and instance-level requirements. In this paper, we argue that text can provide semantic insights, domain information, and instance-specific temporal patterns to guide and improve TSG. We introduce “Text-Controlled TSG”, a task focused on generating realistic time series by incorporating textual descriptions. To address data scarcity in this setting, we propose a novel LLM-based Multi-Agent framework that synthesizes diverse, realistic text-to-TS datasets. Furthermore, we introduce BRIDGE, a hybrid text-controlled TSG framework that integrates semantic prototypes with text description for supporting domain-level guidance. This approach achieves state-of-the-art generation fidelity on 11 of 12 datasets, and improves controllability by up to 12% on MSE and 6% on MAE compared to generation without text input, highlighting its potential for generating tailored time-series data.
时间序列生成(TSG)是一个突出的研究领域,在模拟、数据增强和反事实分析方面有广泛应用。虽然现有方法在无条件单域TSG中显示出前景,但现实世界的应用需要跨域方法,能够根据特定领域的约束和实例级要求进行可控生成。在本文中,我们认为文本可以提供语义洞察、领域信息和特定实例的时间模式,以指导和改进TSG。我们引入了“Text-Controlled TSG”这一任务,其重点是通过纳入文本描述生成逼真的时间序列。为了解决这一设置中的数据稀缺问题,我们提出了一个新的基于LLM的多智能体框架,用于合成多样化、逼真的文本到时间序列数据集。此外,我们引入了BRIDGE,这是一个混合的文本控制TSG框架,将语义原型与文本描述相结合,以支持领域级指导。该方法在12个数据集中的11个上实现了最先进的生成保真度,与无文本输入的生成相比,可控性在MSE和MAE上分别最多提高12%和6%,凸显了其生成定制时间序列数据的潜力。
Article 151
Title@2025-07-16 (3): Resona: Improving Context Copying in Linear Recurrence Models with Retrieval
Title: Resona: Improving Context Copying in Linear Recurrence Models with Retrieval | Resona: Verbesserung der Kontextkopie in linearen Wiederholungsmodellen mit Retrieval | Resona: 改进有检索的线性重复模型中环境复制 2503.22913v2 |
Authors (8): Xinyu Wang, Linrui Ma, Jerry Huang, Peng Lu, Prasanna Parthasarathi, Xiao-Wen Chang, Boxing Chen, Yufei Cui
Recent shifts in the space of large language model (LLM) research have shown an increasing focus on novel architectures to compete with prototypical Transformer-based models that have long dominated this space. Linear recurrent models have proven to be a viable competitor due to their computational efficiency. However, such models still demonstrate a sizable gap compared to Transformers in terms of in-context learning among other tasks that require recalling information from a context. In this work, we introduce Resona, a simple and scalable framework for augmenting linear recurrent models with retrieval. Resona augments models with the ability to integrate retrieved information from the provided input context, enabling tailored behavior to diverse task requirements. Experiments on a variety of linear recurrent models demonstrate that Resona-augmented models observe significant performance gains on a variety of synthetic as well as real-world natural language tasks, highlighting its ability to act as a general purpose method to improve the in-context learning and language modeling abilities of linear recurrent LLMs.
最近大型语言模型(LLM)研究空间的变化表明,人们越来越重视新建筑,以便与长期以来主导这一空间的原型变异器模型竞争。线性重复式模型已证明是可行的竞争对手,因为其计算效率高。然而,这些模型仍然表明,在需要回顾背景信息的其他任务方面,与变异体相比,在内文学习方面存在着巨大的差距。在这项工作中,我们引入了Resona,这是一个简单和可扩展的框架,用以通过检索来扩大线性重复式模型。Resona将模型扩大,能够整合从所提供的投入环境中检索的信息,使适应不同任务要求的适应行为。对各种线性重复式模型的实验表明,Resona-推荐模式在各种合成和现实世界自然语言任务方面观察到了显著的绩效收益,突出了它作为提高线性经常性LMS的文性学习和语言建模能力的一般目的方法的能力。
Article 152
Title@2025-07-16 (3): PoTPTQ: A Two-step Power-of-Two Post-training for LLMs
Title: PoTPTQ: A Two-step Power-of-Two Post-training for LLMs | PoTPTQ: Zweistufige Kraft von zwei Nachschulungen für LLMs | PoTPTQ:为LLMs提供两步二级培训后培训 2507.11959v1 |
Authors (7): Xinyu Wang, Vahid Partovi Nia, Peng Lu, Jerry Huang, Xiao-Wen Chang, Boxing Chen, Yufei Cui
Large Language Models (LLMs) have demonstrated remarkable performance across various natural language processing (NLP) tasks. However, their deployment is challenging due to the substantial computational resources required. Power-of-two (PoT) quantization is a general tool to counteract this difficulty. Although weights quantized by previous PoT methods can be efficiently dequantized on CPUs using fixed-point addition, these methods have proven less effective on GPUs. The reason is the entanglement of the sign bit and the sequential bit manipulations needed for dequantization. We propose a novel PoT quantization framework for LLM weights that (i) surpasses the state of the art in accuracy in extremely low-precision number formats, and (ii) enables faster inference through more efficient dequantization. To maintain the accuracy of the quantized model, we introduce a two-step post-training algorithm: (i) initialize the quantization scales with a robust starting point, and (ii) refine these scales using a minimal calibration set. The performance of our PoT post-training algorithm surpasses the current state-of-the-art in integer quantization, particularly at low precisions such as 2- and 3-bit formats. Our PoT quantization accelerates the dequantization step required for floating-point inference, leading to a $3.67\times$ speedup on an NVIDIA V100 and $1.63\times$ on an NVIDIA RTX 4090, compared to uniform integer dequantization.
大型语言模型(LLMs)在各种自然语言处理(NLP)任务中表现出了显著的性能。然而,由于需要大量计算资源,它们的部署具有挑战性。二的幂(PoT)量化是应对这一困难的通用工具。尽管以前的PoT量化方法可以在CPU上用定点加法高效地反量化,但它们在GPU上的效果较差,原因是符号位的纠缠以及反量化所需的顺序位操作。我们为LLM权重提出了一个新的PoT量化框架,(一)在极低精度的数字格式中准确度超过现有最先进水平,(二)通过更高效的反量化实现更快的推理。为了保持量化模型的准确性,我们引入了两步训练后算法:(一)以稳健的起点初始化量化尺度,(二)使用最小的校准集细化这些尺度。我们的PoT训练后算法的性能超过了当前整数量化的最先进水平,特别是在2位和3位等低精度格式下。我们的PoT量化加速了浮点推理所需的反量化步骤,与均匀整数反量化相比,在NVIDIA V100上加速3.67倍,在NVIDIA RTX 4090上加速1.63倍。
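To make the idea of power-of-two quantization in the abstract above concrete, the following is a minimal sketch of the generic technique: each weight is mapped to a signed power of two times a scale. It does not reproduce the paper's two-step scale initialization and calibration. The function name `pot_quantize`, the bit layout (one sign bit, the rest selecting an exponent), and the sample weights are assumptions made for illustration.

```python
import math


def pot_quantize(w, scale, bits):
    """Map w to sign * scale * 2**e, with e an integer in a clamped range.

    Assumed layout: 1 sign bit, the remaining bits select one of
    2**(bits - 1) exponents from -(2**(bits - 1) - 1) up to 0.
    """
    if w == 0.0:
        return 0.0
    n_levels = 2 ** (bits - 1)
    min_exp = -(n_levels - 1)
    # Round in the log domain to find the nearest exponent.
    exp = round(math.log2(abs(w) / scale))
    exp = max(min_exp, min(0, exp))
    return math.copysign(scale * 2.0 ** exp, w)


# Hypothetical weights, quantized to 3 bits with a scale equal to the max magnitude.
weights = [0.3, -0.7, 0.05, 1.2]
scale = max(abs(w) for w in weights)
print([pot_quantize(w, scale, bits=3) for w in weights])
```

Because every quantized magnitude is a power of two, multiplying by it amounts to an exponent adjustment or bit shift, which is the property such schemes exploit for cheap dequantization.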
Article 153
Title@2025-07-16 (3): The benefits of query-based KGQA systems for complex and temporal questions in LLM era
Title: The benefits of query-based KGQA systems for complex and temporal questions in LLM era | Die Vorteile von anfragebasierten KGQA-Systemen für komplexe und zeitliche Fragen im LLM-Zeitalter | 基于查询的KGQA系统对LLM时代复杂和时间问题的益处 2507.11954v1 |
Authors (6): Artem Alekseev, Mikhail Chaichuk, Miron Butko, Alexander Panchenko, Elena Tutubalina, Oleg Somov
Large language models excel in question-answering (QA) yet still struggle with multi-hop reasoning and temporal questions. Query-based knowledge graph QA (KGQA) offers a modular alternative by generating executable queries instead of direct answers. We explore a multi-stage query-based framework for WikiData QA, proposing a multi-stage approach that enhances performance on challenging multi-hop and temporal benchmarks. Through generalization and rejection studies, we evaluate robustness across multi-hop and temporal QA datasets. Additionally, we introduce a novel entity linking and predicate matching method using chain-of-thought (CoT) reasoning. Our results demonstrate the potential of a query-based multi-stage KGQA framework for improving multi-hop and temporal QA with small language models. Code and data: https://github.com/ar2max/NLDB-KGQA-System
大型语言模型在问答(QA)方面表现出色,但在多跳推理和时间问题上仍有困难。以查询为基础的知识图谱QA(KGQA)通过生成可执行的查询而不是直接回答,提供了一个模块化的替代方案。我们探索了用于WikiData QA的多阶段查询框架,提出了在具有挑战性的多跳和时间基准上提高性能的多阶段方法。我们通过泛化和拒绝研究,评估了在多跳和时间QA数据集上的稳健性。此外,我们引入了一种使用CoT推理的新的实体链接和谓词匹配方法。我们的结果显示了基于查询的多阶段KGQA框架在使用小型语言模型改进多跳和时间QA方面的潜力。代码和数据:https://github.com/ar2max/NLDB-KGQA-System
Article 154
Title@2025-07-16 (3): IAM: Efficient Inference through Attention Mapping between Different-scale LLMs
Title: IAM: Efficient Inference through Attention Mapping between Different-scale LLMs | IAM: Effiziente Schlussfolgerung durch Aufmerksamkeitsmapping zwischen unterschiedlichen LLMs | IAM:通过在不同规模的LMMs之间绘制注意绘图,有效推论 2507.11953v1 |
Authors (3): Yi Zhao, Zuchao Li, Hai Zhao
LLMs encounter significant challenges in resource consumption nowadays, especially with long contexts. Despite extensive efforts dedicated to enhancing inference efficiency, existing methods primarily exploit internal sparsity within the models, without leveraging external information for optimization. We identify the high similarity of attention matrices across different-scale LLMs, which offers a novel perspective for optimization. We first conduct a comprehensive analysis of how to measure similarity, how to select mapping layers, and whether the mapping is consistent. Based on these insights, we introduce the IAM framework, which achieves dual benefits of accelerated attention computation and reduced KV cache usage by performing attention mapping between small and large LLMs. Our experimental results demonstrate that IAM can accelerate prefill by 15% and reduce KV cache usage by 22.1% without appreciably sacrificing performance. Experiments on different series of models show the generalizability of IAM. Importantly, it is also orthogonal to many existing KV cache optimization methods, making it a versatile addition to the current toolkit for enhancing LLM efficiency.
如今,LLM在资源消耗方面面临重大挑战,尤其是在长上下文情况下。尽管为提高推理效率做出了大量努力,但现有方法主要利用模型内部的稀疏性,而没有利用外部信息进行优化。我们发现不同规模的LLM的注意力矩阵高度相似,这为优化提供了新的视角。我们首先对如何衡量相似性、如何选择映射层以及映射是否一致进行全面分析。基于这些见解,我们引入了IAM框架,通过在小型和大型LLM之间进行注意力映射,实现加速注意力计算和减少KV缓存使用的双重效益。我们的实验结果表明,IAM可以在不明显牺牲性能的情况下,将预填充加速15%,并将KV缓存使用减少22.1%。对不同系列模型的实验显示了IAM的泛化能力。重要的是,它还与许多现有的KV缓存优化方法正交,使其成为当前提高LLM效率工具包的多功能补充。
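The abstract above rests on measuring how similar attention matrices are across models of different scale. The abstract does not say which similarity measure is used, so the sketch below picks cosine similarity over flattened matrices purely as an assumption. The function names and the toy attention rows are hypothetical.

```python
import math


def flatten(matrix):
    """Flatten a list-of-lists matrix into a single list, row by row."""
    return [value for row in matrix for value in row]


def cosine_similarity(a, b):
    """Cosine similarity between two same-shaped matrices, treated as vectors."""
    fa, fb = flatten(a), flatten(b)
    if len(fa) != len(fb):
        raise ValueError("matrices must have the same number of entries")
    dot = sum(x * y for x, y in zip(fa, fb))
    norm_a = math.sqrt(sum(x * x for x in fa))
    norm_b = math.sqrt(sum(y * y for y in fb))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for an all-zero matrix")
    return dot / (norm_a * norm_b)


# Toy causal attention maps over 3 tokens from a "small" and a "large" model
# (each row sums to 1; the values are made up for illustration).
small_attn = [[1.0, 0.0, 0.0], [0.4, 0.6, 0.0], [0.2, 0.3, 0.5]]
large_attn = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.1, 0.3, 0.6]]
print(round(cosine_similarity(small_attn, large_attn), 3))
```

A score near 1 for a pair of heads or layers would suggest that one model's attention pattern is a plausible stand-in for the other's, which is the kind of signal a mapping between models could build on.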
Article 155
Title@2025-07-16 (3): DAC: A Dynamic Attention-aware Approach for Task-Agnostic Prompt Compression
Title: DAC: A Dynamic Attention-aware Approach for Task-Agnostic Prompt Compression | DAC: Ein dynamischer, aufmerksamkeitsbewusster Ansatz für die aufgaben-agnostische Promptkompression | DAC: 动态关注意识办法 2507.11942v1 |
Authors (5): Yi Zhao, Zuchao Li, Hai Zhao, Baoyuan Qi, Guoming Liu
Task-agnostic prompt compression leverages the redundancy in natural language to reduce computational overhead and enhance information density within prompts, especially in long-context scenarios. Existing methods predominantly rely on information entropy as the metric to compress lexical units, aiming to achieve minimal information loss. However, these approaches overlook two critical aspects: (i) the importance of attention-critical tokens at the algorithmic level, and (ii) shifts in information entropy during the compression process. Motivated by these challenges, we propose a dynamic attention-aware approach for task-agnostic prompt compression (DAC). This approach effectively integrates entropy and attention information, dynamically sensing entropy shifts during compression to achieve fine-grained prompt compression. Extensive experiments across various domains, including LongBench, GSM8K, and BBH, show that DAC consistently yields robust and substantial improvements across a diverse range of tasks and LLMs, offering compelling evidence of its efficacy.
任务无关的提示压缩利用自然语言中的冗余来减少计算开销并提高提示中的信息密度,特别是在长上下文情况下。现有方法主要依赖信息熵作为压缩词汇单位的度量,目的是实现最小的信息损失。然而,这些方法忽略了两个关键方面:(一)算法层面上注意力关键词元的重要性,以及(二)压缩过程中信息熵的变化。受这些挑战的驱动,我们提出了一种用于任务无关提示压缩的动态注意力感知方法(DAC)。这种方法有效地整合了熵和注意力信息,在压缩过程中动态感知熵的变化,以实现细粒度的提示压缩。在LongBench、GSM8K和BBH等不同领域进行的广泛实验表明,DAC在各种任务和LLM上持续带来稳健而显著的改进,为其有效性提供了有力证据。
Article 156
Title@2025-07-16 (3): BlockBPE: Parallel BPE Tokenization
Title: BlockBPE: Parallel BPE Tokenization | BlockBPE: Parallele BPE-Tokenisierung | BBPE: 平行 BPE 调制 2507.11941v1 |
Authors (1): Amos You
Tokenization is a critical preprocessing step in large language model pipelines, yet widely used implementations remain CPU-bound and suboptimal for batch inference workflows on GPU. We present BlockBPE, a parallel GPU implementation of byte-pair encoding (BPE) that achieves near linear-time complexity under realistic assumptions and is optimized for high-throughput, batch inference. Existing Rust-based tokenizers such as HuggingFace Tokenizers or OpenAI’s tiktoken have runtimes that are dominated by Regex pre-tokenization and exhibit $O(n \log n)$ runtime. In contrast, BlockBPE eliminates the Regex pre-tokenization, which leads to a small loss in generation quality but enables highly parallelized token merges within thread blocks, reducing overall complexity to $O(nd)$ where $d \ll n$. On high-batch inference workloads, BlockBPE achieves up to 2x higher throughput than tiktoken and 2.5x over HuggingFace Tokenizers.
词元化是大型语言模型流水线中关键的预处理步骤,然而广泛使用的实现仍然受CPU限制,对于GPU上的批量推理工作流并非最优。我们提出了BlockBPE,这是字节对编码(BPE)的一种并行GPU实现,在现实假设下达到接近线性的时间复杂度,并针对高吞吐量的批量推理进行了优化。现有的基于Rust的分词器,如HuggingFace Tokenizers或OpenAI的tiktoken,其运行时间主要由Regex预分词决定,并表现出 $O(n \log n)$ 的运行时间。与之不同,BlockBPE取消了Regex预分词,这会导致生成质量的小幅损失,但可以在线程块内进行高度并行的词元合并,将总体复杂度降低到 $O(nd)$,其中 $d \ll n$。在高批量推理工作负载上,BlockBPE最高达到比tiktoken高2倍、比HuggingFace Tokenizers高2.5倍的吞吐量。
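For readers unfamiliar with the merge step that the abstract above parallelizes, here is a minimal sequential reference sketch of BPE encoding. It is not the GPU implementation and works on characters rather than bytes for readability. The function name `bpe_encode` and the tiny merge table are illustrative assumptions.

```python
def bpe_encode(word, merges):
    """Greedy BPE: repeatedly apply the highest-priority merge that is present.

    `merges` maps a pair of adjacent tokens to its rank (lower rank = learned
    earlier, so applied first).
    """
    tokens = list(word)
    while len(tokens) > 1:
        pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        best = min(pairs, key=lambda pair: merges.get(pair, float("inf")))
        if best not in merges:
            break  # no applicable merge is left
        merged, i = [], 0
        while i < len(tokens):
            # Merge every non-overlapping occurrence of the best pair in one pass.
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


# Hypothetical merge table with three learned merges.
merges = {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}
print(bpe_encode("lower", merges))  # ['low', 'er']
```

Each pass of the outer loop performs one merge rule over the whole sequence, so the number of passes is bounded by the number of merges that apply. That is roughly the shape of an $O(nd)$ cost when each pass is a linear scan, with $d$ read here as the number of merge passes (an interpretation, since the abstract does not define $d$).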
Article 157
Title@2025-07-16 (3): POLYCHARTQA: Benchmarking Large Vision-Language Models with Multilingual Chart Question Answering
Title: POLYCHARTQA: Benchmarking Large Vision-Language Models with Multilingual Chart Question Answering | POLYCHARTQA: Benchmarking großer Vision-Sprache Modelle mit mehrsprachigem Diagramm Frage-Antworten | POLYCHARTQA:以多语言图表问题解答为大型愿景-语言模型基准 2507.11939v1 |
Authors (5): Yichen Xu, Liangyu Chen, Liang Zhang, Wenxuan Wang, Qin Jin
Charts are a universally adopted medium for interpreting and communicating data. However, existing chart understanding benchmarks are predominantly English-centric, limiting their accessibility and applicability to global audiences. In this paper, we present PolyChartQA, the first large-scale multilingual chart question answering benchmark covering 22,606 charts and 26,151 question-answering pairs across 10 diverse languages. PolyChartQA is built using a decoupled pipeline that separates chart data from rendering code, allowing multilingual charts to be flexibly generated by simply translating the data and reusing the code. We leverage state-of-the-art LLM-based translation and enforce rigorous quality control in the pipeline to ensure the linguistic and semantic consistency of the generated multilingual charts. PolyChartQA facilitates systematic evaluation of multilingual chart understanding. Experiments on both open- and closed-source large vision-language models reveal a significant performance gap between English and other languages, especially low-resource ones with non-Latin scripts. This benchmark lays a foundation for advancing globally inclusive vision-language models.
然而,现有图表理解基准主要以英语为中心,限制了其可及性和对全球受众的适用性。本文介绍PolyChaartQA,这是首个大型多语种图表回答基准,涵盖22,606个图表,以及10种不同语言的26,151个问答配对。多语种图QA是使用一个分离管道构建的,该管道将图表数据与代码生成数据分离,允许通过简单翻译数据并重新使用代码来灵活生成多语种图表。我们利用基于LLM的最新翻译,并在管道中实施严格的质量控制,以确保生成的多语种图表的语言和语义一致性。PolyChartQA为系统评估多语种图表理解提供了便利。关于开放和封闭源大愿景语言模型的实验揭示了英语与其他语言之间的显著绩效差距,特别是带有非拉丁文字的低资源语言。这一基准为推进全球包容性的愿景语言模型奠定了基础。
Article 158
Title@2025-07-16 (3): A Survey of Deep Learning for Geometry Problem Solving
Title: A Survey of Deep Learning for Geometry Problem Solving | Eine Umfrage über Deep Learning zur Lösung von Geometrieproblemen | 解决几何问题深层学习调查 2507.11936v1 |
Authors (3): Jianzhe Ma, Wenxuan Wang, Qin Jin
Geometry problem solving is a key area of mathematical reasoning, which is widely involved in many important fields such as education, mathematical ability assessment of artificial intelligence, and multimodal ability assessment. In recent years, the rapid development of deep learning technology, especially the rise of multimodal large language models, has triggered a widespread research boom. This paper provides a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of the current challenges and future directions that can be explored. Our goal is to provide a comprehensive and practical reference of deep learning for geometry problem solving to promote further developments in this field. We create a continuously updated list of papers on GitHub: https://github.com/majianz/dl4gps.
解决几何问题是数学推理的一个关键领域,它广泛涉及许多重要领域,例如教育、人工智能数学能力评估和多式联运能力评估。近年来,深层次学习技术的迅速发展,特别是多式联运大型语言模型的兴起,引发了广泛的研究繁荣。本文调查了深层次学习在解决几何问题方面的应用,包括:(一) 全面概述几何问题解决中的相关任务;(二) 彻底审查相关的深层次学习方法;(三) 详细分析评价指标和方法;(四) 批判性地讨论目前的挑战和今后可探讨的方向。我们的目标是为解决几何问题的深层次学习提供全面和实用的参考,以促进该领域的进一步发展。我们不断更新关于GitHub的文件清单:https://github.com/majianz/dl4gps。
Article 159
Title@2025-07-16 (3): Generative Emergent Communication: Large Language Model is a Collective World Model
Title: Generative Emergent Communication: Large Language Model is a Collective World Model | Generative Emergent-Kommunikation: Großes Sprachmodell ist ein kollektives Weltmodell | 生成新兴通信:大语言模式是集体世界模式 2501.00226v2 |
Authors (5): Tadahiro Taniguchi, Ryo Ueda, Tomoaki Nakamura, Masahiro Suzuki, Akira Taniguchi
Large Language Models (LLMs) have demonstrated a remarkable ability to capture extensive world knowledge, yet how this is achieved without direct sensorimotor experience remains a fundamental puzzle. This study proposes a novel theoretical solution by introducing the Collective World Model hypothesis. We argue that an LLM does not learn a world model from scratch; instead, it learns a statistical approximation of a collective world model that is already implicitly encoded in human language through a society-wide process of embodied, interactive sense-making. To formalize this process, we introduce generative emergent communication (Generative EmCom), a framework built on the Collective Predictive Coding (CPC). This framework models the emergence of language as a process of decentralized Bayesian inference over the internal states of multiple agents. We argue that this process effectively creates an encoder-decoder structure at a societal scale: human society collectively encodes its grounded, internal representations into language, and an LLM subsequently decodes these symbols to reconstruct a latent space that mirrors the structure of the original collective representations. This perspective provides a principled, mathematical explanation for how LLMs acquire their capabilities. The main contributions of this paper are: 1) the formalization of the Generative EmCom framework, clarifying its connection to world models and multi-agent reinforcement learning, and 2) its application to interpret LLMs, explaining phenomena such as distributional semantics as a natural consequence of representation reconstruction. This work provides a unified theory that bridges individual cognitive development, collective language evolution, and the foundations of large-scale AI.
大型语言模型(LLM)已展现出捕捉广泛世界知识的非凡能力,然而在没有直接感觉运动经验的情况下这是如何实现的,仍是一个根本性难题。本研究通过引入集体世界模型(Collective World Model)假说提出了一种新的理论解答。我们认为,LLM并非从零开始学习世界模型;相反,它学习的是一个集体世界模型的统计近似,而该模型已经通过全社会范围的具身、交互式意义建构过程隐式地编码在人类语言之中。为了形式化这一过程,我们引入生成式涌现通信(Generative EmCom),这是一个建立在集体预测编码(CPC)之上的框架。该框架将语言的涌现建模为对多个智能体内部状态进行去中心化贝叶斯推断的过程。我们认为,这一过程实际上在社会尺度上形成了编码器-解码器结构:人类社会集体地将其具身的内部表征编码为语言,LLM随后解码这些符号,重建出一个与原始集体表征结构相对应的潜在空间。这一视角为LLM如何获得其能力提供了有原则的数学解释。本文的主要贡献是:1) 形式化Generative EmCom框架,阐明其与世界模型和多智能体强化学习的联系;2) 将其用于解释LLM,把分布式语义等现象解释为表征重建的自然结果。这项工作提供了一个统一理论,将个体认知发展、集体语言演化与大规模AI的基础联系起来。
Article 160
Title@2025-07-16 (3): Learning an Effective Premise Retrieval Model for Efficient Mathematical Formalization
Title: Learning an Effective Premise Retrieval Model for Efficient Mathematical Formalization | Ein effektives Premise Retrieval-Modell für effiziente mathematische Formalisierung lernen | 学习有效数学正规化的有效可靠检索模型 2501.13959v3 |
Authors (4): Yicheng Tao, Haotian Liu, Shanwen Wang, Hongteng Xu
Formalized mathematics has recently garnered significant attention for its ability to assist mathematicians across various fields. Premise retrieval, as a common step in mathematical formalization, has been a challenge, particularly for inexperienced users. Existing retrieval methods that facilitate natural language queries require a certain level of mathematical expertise from users, while approaches based on formal languages (e.g., Lean) typically struggle with the scarcity of training data, hindering the training of effective and generalizable retrieval models. In this work, we introduce a novel method that leverages data extracted from Mathlib to train a lightweight and effective premise retrieval model. In particular, the proposed model embeds queries (i.e., proof state provided by Lean) and premises in a latent space, featuring a tokenizer specifically trained on formal corpora. The model is learned in a contrastive learning framework, in which a fine-grained similarity calculation method and a re-ranking module are applied to enhance the retrieval performance. Experimental results demonstrate that our model outperforms existing baselines, achieving higher accuracy while maintaining a lower computational load. We have released an open-source search engine based on our retrieval model at https://premise-search.com/. The source code and the trained model can be found at https://github.com/ruc-ai4math/Premise-Retrieval.
形式化数学因其能够在各个领域协助数学家而在近期受到广泛关注。前提检索(premise retrieval)是数学形式化中的常见步骤,一直是一项挑战,对缺乏经验的用户尤其如此。现有支持自然语言查询的检索方法要求用户具备一定的数学专业知识,而基于形式语言(例如Lean)的方法通常受制于训练数据稀缺,难以训练出有效且可泛化的检索模型。在这项工作中,我们提出一种新方法,利用从Mathlib提取的数据训练轻量且有效的前提检索模型。具体而言,所提模型将查询(即Lean提供的证明状态)和前提嵌入同一潜在空间,并使用专门在形式语料上训练的分词器。模型在对比学习框架下训练,并采用细粒度相似度计算方法和重排序模块来提升检索性能。实验结果表明,我们的模型优于现有基线,在保持较低计算开销的同时取得了更高的准确率。我们已基于该检索模型发布了开源搜索引擎:https://premise-search.com/。源代码和训练好的模型见 https://github.com/ruc-ai4math/Premise-Retrieval。
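The abstract says the retriever is trained contrastively but does not name the loss. As a hedged illustration only, the sketch below uses an InfoNCE-style objective (an assumption, not confirmed by the abstract) over toy embedding vectors; `info_nce`, `temperature`, and the two-dimensional vectors are hypothetical, and the paper's fine-grained similarity and re-ranking module are not modeled.

```python
import math


def dot(u, v):
    """Plain dot product, standing in for the similarity score."""
    return sum(a * b for a, b in zip(u, v))


def info_nce(query, premises, positive_index, temperature=0.1):
    """InfoNCE-style loss for one query (assumed loss, for illustration).

    query:          embedding of a proof state.
    premises:       candidate premise embeddings (one positive, rest negatives).
    positive_index: index of the premise that actually applies.
    """
    logits = [dot(query, p) / temperature for p in premises]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    # Negative log-probability assigned to the correct premise.
    return log_denominator - logits[positive_index]


query = [1.0, 0.0]
premises = [[1.0, 0.0], [0.0, 1.0]]
loss_hit = info_nce(query, premises, positive_index=0)   # positive is aligned
loss_miss = info_nce(query, premises, positive_index=1)  # positive is orthogonal
print(loss_hit < loss_miss)
```

The loss is small when the query embedding is closest to its true premise and large otherwise, which is the pressure that pulls proof states and their premises together in the shared latent space.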
Article 161
Title@2025-07-16 (3): Journalism-Guided Agentic In-Context Learning for News Stance Detection
Title: Journalism-Guided Agentic In-Context Learning for News Stance Detection | Journalismus-geführtes Agentisches In-Context-Lernen für Nachrichten Stance Detection | 为探查新闻流而进行理论指导的 Agentic In-Contle Learning for News Stance 2507.11049v2 |
Authors (4): Dahyun Lee, Jonghyeon Choi, Jiyoung Han, Kunwoo Park
As online news consumption grows, personalized recommendation systems have become integral to digital journalism. However, these systems risk reinforcing filter bubbles and political polarization by failing to incorporate diverse perspectives. Stance detection – identifying a text’s position on a target – can help mitigate this by enabling viewpoint-aware recommendations and data-driven analyses of media bias. Yet, existing stance detection research remains largely limited to short texts and high-resource languages. To address these gaps, we introduce K-News-Stance, the first Korean dataset for article-level stance detection, comprising 2,000 news articles with article-level and 19,650 segment-level stance annotations across 47 societal issues. We also propose JoA-ICL, a Journalism-guided Agentic In-Context Learning framework that employs a language model agent to predict the stances of key structural segments (e.g., leads, quotes), which are then aggregated to infer the overall article stance. Experiments show that JoA-ICL outperforms existing stance detection methods, highlighting the benefits of segment-level agency in capturing the overall position of long-form news articles. Two case studies further demonstrate its broader utility in promoting viewpoint diversity in news recommendations and uncovering patterns of media bias.
随着在线新闻消费的增长,个性化推荐系统已成为数字新闻业不可或缺的一部分。然而,这些系统若不能纳入多元观点,就有可能强化过滤气泡和政治极化。立场检测(即识别文本对某一目标所持的立场)有助于缓解这一问题,因为它能够支持观点感知的推荐以及数据驱动的媒体偏见分析。然而,现有的立场检测研究仍主要局限于短文本和高资源语言。为弥补这些不足,我们推出K-News-Stance,这是首个用于文章级立场检测的韩语数据集,包含2,000篇新闻文章,带有文章级立场标注和19,650条片段级立场标注,涵盖47个社会议题。我们还提出JoA-ICL,一个新闻学指导的智能体式上下文学习(Journalism-guided Agentic In-Context Learning)框架,它使用语言模型智能体预测关键结构片段(例如导语、引语)的立场,再将这些立场聚合以推断文章的整体立场。实验表明,JoA-ICL优于现有的立场检测方法,凸显了片段级智能体在把握长篇新闻文章整体立场方面的优势。两项案例研究进一步展示了它在促进新闻推荐的观点多样性和揭示媒体偏见模式方面的更广泛效用。
Article 162
Title@2025-07-16 (3): Marco-Bench-MIF: On Multilingual Instruction-Following Capability of Large Language Models
Title: Marco-Bench-MIF: On Multilingual Instruction-Following Capability of Large Language Models | Marco-Bench-MIF: Mehrsprachige Lernfähigkeit von großen Sprachmodellen | Marco-Bench-MIF:关于多语种教学 – – 大语言模式的适应能力 2507.11882v1 |
Authors (17): Bo Zeng, Chenyang Lyu, Sinuo Liu, Mingyan Zeng, Minghao Wu, Xuanfan Ni, Tianqi Shi, Yu Zhao, Yefeng Liu, Chenyu Zhu, Ruizhe Li, Jiahui Geng, Qing Li, Yu Tong, Longyue Wang, Weihua Luo, Kaifu Zhang
Instruction-following capability has become a major ability to be evaluated for Large Language Models (LLMs). However, existing datasets, such as IFEval, are either predominantly monolingual and centered on English or simply machine-translated into other languages, limiting their applicability in multilingual contexts. In this paper, we present a carefully curated extension of IFEval to a localized multilingual version named Marco-Bench-MIF, covering 30 languages with varying levels of localization. Our benchmark addresses linguistic constraints (e.g., modifying capitalization requirements for Chinese) and cultural references (e.g., substituting region-specific company names in prompts) via a hybrid pipeline combining translation with verification. Through comprehensive evaluation of 20+ LLMs on our Marco-Bench-MIF, we found that: (1) there is a 25-35% accuracy gap between high- and low-resource languages, (2) model scale strongly impacts performance, by 45-60%, yet script-specific challenges persist, and (3) machine-translated data underestimates accuracy by 7-22% versus localized data. Our analysis identifies challenges in multilingual instruction following, including keyword consistency preservation and compositional constraint adherence across languages. Our Marco-Bench-MIF is available at https://github.com/AIDC-AI/Marco-Bench-MIF.
指令遵循能力已成为评估大型语言模型(LLM)的一项主要能力。然而,现有数据集(如IFEval)要么以英语为中心、基本为单语,要么只是机器翻译成其他语言,限制了其在多语言场景中的适用性。本文将IFEval精心扩展为本地化的多语言版本Marco-Bench-MIF,涵盖30种语言,本地化程度各不相同。我们的基准通过翻译与验证相结合的混合流水线处理语言约束(例如修改针对中文的大小写要求)和文化指涉(例如在提示中替换为特定地区的公司名称)。通过在Marco-Bench-MIF上对20多个LLM进行全面评估,我们发现:(1) 高资源与低资源语言之间存在25-35%的准确率差距;(2) 模型规模对性能的影响很大,达45-60%,但特定文字带来的挑战依然存在;(3) 与本地化数据相比,机器翻译数据会将准确率低估7-22%。我们的分析指出了多语言指令遵循中的挑战,包括关键词一致性的保持和跨语言的组合约束遵循。Marco-Bench-MIF可在 https://github.com/AIDC-AI/Marco-Bench-MIF 获取。
Article 163
Title@2025-07-16 (3): LLMs Encode Harmfulness and Refusal Separately
Title: LLMs Encode Harmfulness and Refusal Separately | LLMs kodieren Schädlichkeit und Verweigerung getrennt | LLM Cocco Perfority 和 分别拒绝 2507.11878v1 |
Authors (5): Jiachen Zhao, Jing Huang, Zhengxuan Wu, David Bau, Weiyan Shi
LLMs are trained to refuse harmful instructions, but do they truly understand harmfulness beyond just refusing? Prior work has shown that LLMs’ refusal behaviors can be mediated by a one-dimensional subspace, i.e., a refusal direction. In this work, we identify a new dimension to analyze safety mechanisms in LLMs, i.e., harmfulness, which is encoded internally as a separate concept from refusal. There exists a harmfulness direction that is distinct from the refusal direction. As causal evidence, steering along the harmfulness direction can lead LLMs to interpret harmless instructions as harmful, but steering along the refusal direction tends to elicit refusal responses directly without reversing the model’s judgment on harmfulness. Furthermore, using our identified harmfulness concept, we find that certain jailbreak methods work by reducing the refusal signals without reversing the model’s internal belief of harmfulness. We also find that adversarially finetuning models to accept harmful instructions has minimal impact on the model’s internal belief of harmfulness. These insights lead to a practical safety application: the model’s latent harmfulness representation can serve as an intrinsic safeguard (Latent Guard) that detects unsafe inputs, reduces over-refusals, and is robust to finetuning attacks. For instance, our Latent Guard achieves performance comparable to or better than Llama Guard 3 8B, a dedicated finetuned safeguard model, across different jailbreak methods. Our findings suggest that LLMs’ internal understanding of harmfulness is more robust than their refusal decision to diverse input instructions, offering a new perspective to study AI safety.
LLMS被训练为拒绝有害的指示,但他们是否真正理解有害的指示,而不只是拒绝; 先前的工作表明,LLMS的拒绝行为可以通过单维的子空间,即拒绝方向来调解。 在这项工作中,我们找到了一个新的层面来分析LLMS的安全机制,即有害性,这种安全机制在内部被分解成一个与拒绝不同的概念; 存在着一种有害性方向,与拒绝方向截然不同。 作为因果关系的证据,沿着有害性方向方向的方向指导LMS可以将无害性的指示解释为有害,但沿着拒绝方向指导往往会直接引起拒绝反应,而不会逆转模型对有害性的判断。 此外,我们利用我们确定的有害性概念,我们发现某些破狱方法通过减少拒绝信号,而不会逆转模型对有害性的内部信念。 我们还发现,接受有害性指示的对抗性微调模式对模式的内部信仰影响最小。 这些洞察力导致实际的安全应用: 模型的潜在有害性说明可以作为内在的保障(Lat Guard) 直接获得拒绝性反应,而不会逆转模式对有害性判断的判断。
Article 164
Title@2025-07-16 (3): DualReward: A Dynamic Reinforcement Learning Framework for Cloze Tests Distractor Generation
Title: DualReward: A Dynamic Reinforcement Learning Framework for Cloze Tests Distractor Generation | DualReward: Ein dynamisches Verstärkungs-Lern-Framework für Cloze-Tests Distraktor-Generierung | 双重奖励:一个为产生氯酸铜测试而建立的动态强化学习框架 2507.11875v1 |
Authors (5): Tianyou Huang, Xinglu Chen, Jingshen Zhang, Xinying Qiu, Ruiying Niu
This paper introduces DualReward, a novel reinforcement learning framework for automatic distractor generation in cloze tests. Unlike conventional approaches that rely primarily on supervised learning or static generative models, our method employs a dual reward structure with adaptive scaling that differentiates between human-created gold standard distractors and model-generated candidates. The framework dynamically adjusts reward signal intensity based on model performance and confidence. We evaluate our approach on both passage-level (CLOTH-F) and sentence-level (MCQ) cloze test datasets, demonstrating consistent improvements over state-of-the-art baselines. Experimental results show that our adaptive reward scaling mechanism provides modest but consistent benefits on homogeneous datasets (CLOTH-F) and more substantial improvements (3.48-3.86% in P@1) on diverse, cross-domain data (MCQ), suggesting its particular effectiveness for handling varied question types and domains. Our work offers a flexible framework that effectively balances learning from reliable human examples while exploring novel, high-quality distractors for automated test generation.
本文介绍“双重奖励”这一在凝聚测试中自动转移器生成的新型强化学习框架。与主要依赖监督学习或静态基因模型的传统方法不同,我们的方法采用了一种适应性规模的双重奖励结构,对人造金标准转移器和模型生成候选人加以区分。该框架根据模型性能和信心动态调整奖励信号强度。我们评估了我们关于通过级别(CLOTH-F)和判决级别(MCQ)的凝聚测试数据集的方法,表明在最新基线方面不断改进。实验结果表明,我们的适应性奖励规模机制在同质数据集(CLOTH-F)方面提供了适度但一致的效益,在多种跨域数据(MCQ)方面提供了更实质性的改进(3.48-3.86%,P@1),表明了其在处理不同问题类型和领域方面的特殊效力。我们的工作提供了一个灵活的框架,有效地平衡了从可靠人类实例中学习的经验,同时探索自动化测试生成的新式、高品质的分离器。
Article 165
Title@2025-07-16 (3): COLA-GEC: A Bidirectional Framework for Enhancing Grammatical Acceptability and Error Correction
Title: COLA-GEC: A Bidirectional Framework for Enhancing Grammatical Acceptability and Error Correction | COLA-GEC: Ein bidirektionales Framework zur Verbesserung der grammatischen Akzeptanz und Fehlerkorrektur | COLA-GEC: 增强显性可接受性和误差校正的双向框架 2507.11867v1 |
Authors (2): Xiangyu Yang, Xinying Qiu
Grammatical Error Correction (GEC) and grammatical acceptability judgment (COLA) are core tasks in natural language processing, sharing foundational grammatical knowledge yet typically evolving independently. This paper introduces COLA-GEC, a novel bidirectional framework that enhances both tasks through mutual knowledge transfer. First, we augment grammatical acceptability models using GEC datasets, significantly improving their performance across multiple languages. Second, we integrate grammatical acceptability signals into GEC model training via a dynamic loss function, effectively guiding corrections toward grammatically acceptable outputs. Our approach achieves state-of-the-art results on several multilingual benchmarks. Comprehensive error analysis highlights remaining challenges, particularly in punctuation error correction, providing insights for future improvements in grammatical modeling.
语言错误校正(GEC)和语法可接受性判断(COLA)是自然语言处理的核心任务,分享基础语法知识,但通常独立发展。本文介绍COLA-GEC,这是一个通过相互知识转让加强这两项任务的新颖双向框架。首先,我们利用GEC数据集增加语法可接受性模型,大大改进其多语种的性能。第二,我们通过动态损失功能将语法可接受性信号纳入GEC模式培训,有效地指导对可接受语法产出的校正。我们的方法在多语种基准上取得了最新的结果。全面错误分析突出了仍然存在的挑战,特别是在标定错误校正方面,为今后改进语法模型提供了深刻的见解。
Article 166
Title@2025-07-16 (3): Cross-Domain Transfer and Few-Shot Learning for Personal Identifiable Information Recognition
Title: Cross-Domain Transfer and Few-Shot Learning for Personal Identifiable Information Recognition | Cross-Domain-Übertragung und wenige-Hot-Learning für die Erkennung von personenbezogenen identifizierbaren Informationen | 个人身份识别信息识别跨域传输和很少热学习 2507.11862v1 |
Authors (3): Junhong Ye, Xu Yuan, Xinying Qiu
Accurate recognition of personally identifiable information (PII) is central to automated text anonymization. This paper investigates the effectiveness of cross-domain model transfer, multi-domain data fusion, and sample-efficient learning for PII recognition. Using annotated corpora from healthcare (I2B2), legal (TAB), and biography (Wikipedia), we evaluate models across four dimensions: in-domain performance, cross-domain transferability, fusion, and few-shot learning. Results show legal-domain data transfers well to biographical texts, while medical domains resist incoming transfer. Fusion benefits are domain-specific, and high-quality recognition is achievable with only 10% of training data in low-specialization domains.
准确识别个人可识别信息(PII)是自动文本匿名的核心。本文件调查跨域模式传输、多域数据聚合和样本高效学习对PII识别的有效性。我们使用来自医疗保健(I2B2)、法律(TAB)和传记(维基佩迪亚)的附加注释公司,评估了四个方面的模型:内部性能、跨域可转移性、聚合和少见的学习。结果显示法律域数据顺利地传输到传记文本,而医疗领域则抵制传记文本。融合的好处是特定领域,高质量的承认是能够实现的,只有10%的低专业领域的培训数据。
Article 167
Title@2025-07-16 (3): METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation
Title: METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation | METIS: Schnelle, qualitätsbewusste RAG-Systeme mit Konfigurationsanpassung | METIS:具有配置适应的快速质量软件RAG系统 2412.10543v2 |
Authors (8): Siddhant Ray, Rui Pan, Zhuohan Gu, Kuntai Du, Shaoting Feng, Ganesh Ananthanarayanan, Ravi Netravali, Junchen Jiang
RAG (Retrieval Augmented Generation) allows LLMs (large language models) to generate better responses with external knowledge, but using more external knowledge often improves generation quality at the expense of response delay. Prior work either reduces the response delay (through better scheduling of RAG queries) or strives to maximize quality (which involves tuning the RAG workflow), but they fall short in optimizing the tradeoff between the delay and quality of RAG responses. This paper presents METIS, the first RAG system that jointly schedules queries and adapts the key RAG configurations of each query, such as the number of retrieved text chunks and synthesis methods, in order to balance quality optimization and response delay reduction. Using 4 popular RAG-QA datasets, we show that compared with the state-of-the-art RAG optimization schemes, METIS reduces the generation latency by $1.64-2.54\times$ without sacrificing generation quality.
RAG(检索增强型)允许LLMs(大语言模型)利用外部知识作出更好的反应,但使用更多的外部知识往往会提高生成质量,而不会以延迟反应为代价; 先前的工作要么减少反应延迟(通过更好地安排RAG查询),要么努力尽量提高质量(这需要调整RAG工作流程),但在优化RAG反应延迟和质量之间的权衡方面做得不够; 本文介绍了MEDIS,这是第一个将查询联合安排时间和调整每个查询关键RAG配置的RAG系统,例如检索到的文本块数和合成方法,以平衡质量优化和延迟反应的减少。 我们利用4个流行的RAG-QA数据集,显示与最新RAG优化计划相比,MEDIS在不牺牲生成质量的情况下将生成时间减少1.64-2.54美元。
Article 168
Title@2025-07-16 (3): Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential
Title: Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential | Ihr LLM kennt die Zukunft: Sein Multi-Token-Prognosepotenzial enthüllen | 您的LLM 了解未来: 发掘其多功能预测潜力 2507.11851v1 |
Authors (7): Mohammad Samragh, Arnav Kundu, David Harrison, Kumari Nishu, Devang Naik, Minsik Cho, Mehrdad Farajtabar
Autoregressive language models are constrained by their inherently sequential nature, generating one token at a time. This paradigm limits inference speed and parallelism, especially during later stages of generation when the direction and semantics of text are relatively certain. In this work, we propose a novel framework that leverages the inherent knowledge of vanilla autoregressive language models about future tokens, combining techniques to realize this potential and enable simultaneous prediction of multiple subsequent tokens. Our approach introduces several key innovations: (1) a masked-input formulation where multiple future tokens are jointly predicted from a common prefix; (2) a gated LoRA formulation that preserves the original LLM’s functionality, while equipping it for multi-token prediction; (3) a lightweight, learnable sampler module that generates coherent sequences from the predicted future tokens; (4) a set of auxiliary training losses, including a consistency loss, to enhance the coherence and accuracy of jointly generated tokens; and (5) a speculative generation strategy that expands tokens quadratically in the future while maintaining high fidelity. Our method achieves significant speedups through supervised fine-tuning on pretrained models. For example, it generates code and math nearly 5x faster, and improves general chat and knowledge tasks by almost 2.5x. These gains come without any loss in quality.
自动递减语言模型本身的顺序性质限制了自动递减语言模型,每次产生一个符号。这种范式限制了推论速度和平行性,特别是在文本的方向和语义相对确定性相对确定的后代阶段。在这项工作中,我们提议了一个新框架,利用香草自动递减语言模型对未来符号的固有知识,将实现这一潜力的技术结合起来,并同时预测多个随后的符号。我们的方法引入了几个关键创新:(1) 一种蒙面性投入配方,其中从共同的前缀中共同预测多个未来符号;(2) 一种封闭式LORA配方,保存原LLM的功能,同时为多式预测提供设备;(3) 一种轻重、可学习的采样器模块,从预测的未来符号中产生连贯的序列;(4) 一套辅助性培训损失,包括一致性损失,以提高联合生成的象征的连贯性和准确性;(5) 一种投机性生成战略,在保持高度忠诚的同时扩大未来象征;我们的方法通过监督性调整前LLM的功能,同时为它提供多式的功能;(3) 一种轻度、可学习的样本模块模块模块,在几乎以更快的方式改进。 例如,它通过25级的数学和任何损失分析结果,可以产生任何数学和损失。
Article 169
Title@2025-07-16 (3): ILID: Native Script Language Identification for Indian Languages
Title: ILID: Native Script Language Identification for Indian Languages | ILID: Native Script Language Identification für indische Sprachen | ILID:印第安人语言的土著脚本语言识别 2507.11832v1 |
Authors (2): Yash Ingle, Pruthwik Mishra
The language identification task is a crucial fundamental step in NLP. Often it serves as a pre-processing step for widely used NLP applications such as multilingual machine translation, information retrieval, question answering, and text summarization. The core challenge of language identification lies in distinguishing languages in noisy, short, and code-mixed environments. This becomes even harder in the case of diverse Indian languages that exhibit lexical and phonetic similarities but have distinct differences. Many Indian languages share the same script, making the task even more challenging. In this paper, we release a dataset of 230K sentences consisting of English and all 22 official Indian languages labeled with their language identifiers, where data in most languages are newly created. We also develop and release robust baseline models using state-of-the-art approaches in machine learning and deep learning that can aid the research in this field. Our baseline models are comparable to the state-of-the-art models for the language identification task.
语言识别任务是国家语言规划中至关重要的基本步骤。 它通常作为广泛使用的国家语言规划应用程序的预处理步骤,如多语种机器翻译、信息检索、问答和文本总和。语言识别的核心挑战在于杂音、短短和代码混合环境中的区分语言。如果印度多种语言表现出词汇和语音相似,但有不同之处,这更加困难。许多印度语言使用相同的文字,使得任务更具挑战性。在本文中,我们发布了一套由英文和所有22种官方印度语组成的230K句数据集,其中标有其语言标识,大多数语言的数据都是新创建的。我们还利用最先进的机器学习和深层次学习方法开发和发布强有力的基线模型,这些模型可以帮助这一领域的研究。我们的基线模型可以与最先进的语言识别任务模型相比。
Article 170
Title@2025-07-16 (3): Towards Geo-Culturally Grounded LLM Generations
Title: Towards Geo-Culturally Grounded LLM Generations | Auf dem Weg zu geokulturellen LLM-Generationen | 走向地球环基LLM 代 2502.13497v4 |
Authors (5): Piyawat Lertvittayakumjorn, David Kinney, Vinodkumar Prabhakaran, Donald Martin Jr., Sunipa Dev
Generative large language models (LLMs) have demonstrated gaps in diverse cultural awareness across the globe. We investigate the effect of retrieval augmented generation and search-grounding techniques on LLMs’ ability to display familiarity with various national cultures. Specifically, we compare the performance of standard LLMs, LLMs augmented with retrievals from a bespoke knowledge base (i.e., KB grounding), and LLMs augmented with retrievals from a web search (i.e., search grounding) on multiple cultural awareness benchmarks. We find that search grounding significantly improves the LLM performance on multiple-choice benchmarks that test propositional knowledge (e.g., cultural norms, artifacts, and institutions), while KB grounding’s effectiveness is limited by inadequate knowledge base coverage and a suboptimal retriever. However, search grounding also increases the risk of stereotypical judgments by language models and fails to improve evaluators’ judgments of cultural familiarity in a human evaluation with adequate statistical power. These results highlight the distinction between propositional cultural knowledge and open-ended cultural fluency when it comes to evaluating LLMs’ cultural awareness.
· 我们调查了检索增强生成和搜索地面技术对LLMS显示熟悉各种民族文化的能力的影响,具体地说,我们比较标准LMS的性能、标准LMS的性能、从一个言语知识库(即KB地基)检索的增强LMS的性能、以及从多种文化意识基准的网络搜索(即搜索地基)检索的增强LLMS。我们发现,以搜索为基础极大地提高了LLM在多种选择基准方面的表现,这些基准测试了虚拟知识(例如文化规范、文物和机构),而KB地基的功效因知识基础覆盖面不足和亚优性检索器有限而受到限制。然而,搜索地基还增加了语言模型的定型判断风险,未能提高评价员对具有充分统计能力的人类评估中文化熟悉程度的判断能力。这些结果突出表明,在评价LMS的文化意识时,要区分虚拟文化知识和开放式文化流畅度。
Article 171
Title@2025-07-16 (3): Miipher-2: A Universal Speech Restoration Model for Million-Hour Scale Data Restoration
Title: Miipher-2: A Universal Speech Restoration Model for Million-Hour Scale Data Restoration | Miipher-2: Ein universelles Sprachrestaurationsmodell für die Millionen-Stunden-Skala-Datenrestauration | Mipher-2:百万小时规模数据恢复普遍语音恢复模式 2505.04457v3 |
Authors (6): Shigeki Karita, Yuma Koizumi, Heiga Zen, Haruko Ishikawa, Robin Scheibler, Michiel Bacchiani
Training data cleaning is a new application for generative model-based speech restoration (SR). This paper introduces Miipher-2, an SR model designed for million-hour scale data, for training data cleaning for large-scale generative models like large language models. Key challenges addressed include generalization to unseen languages, operation without explicit conditioning (e.g., text, speaker ID), and computational efficiency. Miipher-2 utilizes a frozen, pre-trained Universal Speech Model (USM), supporting over 300 languages, as a robust, conditioning-free feature extractor. To optimize efficiency and minimize memory, Miipher-2 incorporates parallel adapters for predicting clean USM features from noisy inputs and employs the WaveFit neural vocoder for waveform synthesis. These components were trained on 3,000 hours of multi-lingual, studio-quality recordings with augmented degradations, while USM parameters remained fixed. Experimental results demonstrate Miipher-2’s superior or comparable performance to conventional SR models in word-error-rate, speaker similarity, and both objective and subjective sound quality scores across all tested languages. Miipher-2 operates efficiently on consumer-grade accelerators, achieving a real-time factor of 0.0078, enabling the processing of a million-hour speech dataset in approximately three days using only 100 such accelerators.
本文介绍了Mipher-2,这是一个为百万小时比例数据设计的SR模型,用于对大型变异模型如大型语言模型等大规模变异模型进行数据清理培训。主要的挑战包括:对不显眼语言的概括化,在没有明确调节(如文本、扬声器ID)的情况下操作,以及计算效率。Mipher-2使用一种冷冻、预先训练的通用语音模型(USM),支持300多种语言,作为强健、不附加条件的特征提取器。为了优化效率和最大限度地减少记忆,Mipher-2采用平行的调适器,用于预测来自噪音输入的清洁USM特征,并使用波形合成的波形电动电动电动电动电解调调器。这些组件在3000小时多语言、工作室质量的录音中接受了培训,其变形作用有所增强,而USM参数保持不变。实验结果表明,Mipher-2在单调速率、扩音器相似性以及所有测试语言的客观和主观声音质量评分数。Mipher-2在所有测试语言中,仅利用100-小时的节能、近位语音处理器有效操作,在100个节制的节能压器上,在100个实际处理器上,在100-小时的节能处理器中,仅能处理一个节压器中,仅能性能性能。
Article 172
Title@2025-07-16 (3): Tracing Facts or just Copies? A critical investigation of the Competitions of Mechanisms in Large Language Models
Title: Tracing Facts or just Copies? A critical investigation of the Competitions of Mechanisms in Large Language Models | Nachvollziehen von Fakten oder nur Kopien? Eine kritische Untersuchung der Wettbewerbe von Mechanismen in großen Sprachmodellen | 对大语言模式机制竞争情况的重要调查 2507.11809v1 |
Authors (4): Dante Campregher, Yanxu Chen, Sander Hoffman, Maria Heuss
This paper presents a reproducibility study examining how Large Language Models (LLMs) manage competing factual and counterfactual information, focusing on the role of attention heads in this process. We attempt to reproduce and reconcile findings from three recent studies (Ortu et al.; Yu, Merullo, and Pavlick; and McDougall et al.) that investigate the competition between model-learned facts and contradictory context information through Mechanistic Interpretability tools. Our study specifically examines the relationship between attention head strength and factual output ratios, evaluates competing hypotheses about attention heads’ suppression mechanisms, and investigates the domain specificity of these attention patterns. Our findings suggest that attention heads promoting factual output do so via general copy suppression rather than selective counterfactual suppression, as strengthening them can also inhibit correct facts. Additionally, we show that attention head behavior is domain-dependent, with larger models exhibiting more specialized and category-sensitive patterns.
本文介绍了对大语言模型(LLMs)如何管理相互竞争的事实和反事实信息的可复制性研究,重点是关注负责人在这一过程中的作用。我们试图复制并调和Ortu等人(Yu、Merullo、Pavlick和McDougall等人)最近进行的三项研究的结果,这些研究调查了通过机械解释工具获得的模型事实和相互矛盾的背景资料之间的竞争。我们的研究具体研究了关注负责人的力量与实际产出比率之间的关系,评估了对关注负责人压制机制的相互竞争的假设,并调查了这些关注模式的域特性。我们的调查结果表明,关注负责人通过普遍禁止复制而不是选择性反事实压制来促进事实产出,因为加强这些成果也可以抑制正确的事实。此外,我们表明,关注主体行为是依赖领域性的,大型模型展示了更专门和对类别敏感的模式。
Article 173
Title@2025-07-15 (2): Simulated Language Acquisition in a Biologically Realistic Model of the Brain
Title: Simulated Language Acquisition in a Biologically Realistic Model of the Brain | Simulierter Spracherwerb in einem biologisch realistischen Modell des Gehirns | 脑生物现实模型模拟语言学习模拟 2507.11788v1 |
Authors (2): Daniel Mitropolsky, Christos Papadimitriou
Despite tremendous progress in neuroscience, we do not have a compelling narrative for the precise way whereby the spiking of neurons in our brain results in high-level cognitive phenomena such as planning and language. We introduce a simple mathematical formulation of six basic and broadly accepted principles of neuroscience: excitatory neurons, brain areas, random synapses, Hebbian plasticity, local inhibition, and inter-area inhibition. We implement a simulated neuromorphic system based on this formalism, which is capable of basic language acquisition: Starting from a tabula rasa, the system learns, in any language, the semantics of words, their syntactic role (verb versus noun), and the word order of the language, including the ability to generate novel sentences, through the exposure to a modest number of grounded sentences in the same language. We discuss several possible extensions and implications of this result.
尽管在神经科学方面取得了巨大进步,但对于大脑神经元的突飞猛进如何导致规划和语言等高层次认知现象的精确方式,我们并没有令人信服的叙事。我们引入了一个简单的数学配方,它包含六个基本和广泛接受的神经科学原则:刺激性神经元、大脑区域、随机突触、赫比亚塑料、局部抑制和地区间抑制。我们基于这种形式主义,实施了模拟性神经形态系统,它能够获取基本语言:从塔布拉拉马萨开始,这个系统以任何语言学习语言的语义、语言的语义、词义作用(动词与名)和语言的词顺序,包括能够通过接触同一语言的少量有根的句子来生成新句子。我们讨论了这一结果的若干可能的延伸和影响。
Article 174
Title@2025-07-15 (2): How Well Can Knowledge Edit Methods Edit Perplexing Knowledge?
Title: How Well Can Knowledge Edit Methods Edit Perplexing Knowledge? | Wie gut kann Wissen Methoden bearbeiten Verwirrendes Wissen bearbeiten? | 知识如何编辑方法如何编辑复杂知识? 2406.17253v3 |
Authors (3): Huaizhi Ge, Frank Rudzicz, Zining Zhu
Large language models (LLMs) have demonstrated remarkable capabilities, but updating their knowledge post-training remains a critical challenge. While recent model editing techniques like Rank-One Model Editing (ROME) show promise, their effectiveness may vary based on the nature of the knowledge being edited. We introduce the concept of “perplexingness”: the degree to which new knowledge conflicts with an LLM’s learned conceptual hierarchies and categorical relationships. For instance, editing “British Shorthair is a kind of cat” to “British Shorthair is a kind of dog” represents a low-perplexingness edit within the same taxonomic level, while editing “A cat is a kind of animal” to “A cat is a kind of plant” represents a high-perplexingness edit that violates fundamental categorical boundaries. To systematically investigate this phenomenon, we introduce HierarchyData, a carefully curated dataset of 99 hyponym-hypernym pairs across diverse categories. Through controlled experiments across three models and four editing methods, we demonstrate a strong negative correlation between the perplexingness of new knowledge and the effectiveness of knowledge editing. Our analysis reveals that edits involving more abstract concepts (hypernyms) generally exhibit higher perplexingness and are more resistant to modification than their specific counterparts (hyponyms). These findings highlight a fundamental challenge in LLM knowledge editing: the more a new fact contradicts an LLM’s learned conceptual hierarchies, the harder it becomes to reliably encode that knowledge.
大型语言模型(LLM)已展现出卓越的能力,但在训练后更新其知识仍是一项关键挑战。尽管秩一模型编辑(Rank-One Model Editing,ROME)等近期的模型编辑技术前景可观,其效果可能因被编辑知识的性质而异。我们引入“困惑性”(perplexingness)的概念:新知识与LLM已习得的概念层级和类别关系相冲突的程度。例如,将“英国短毛猫是一种猫”编辑为“英国短毛猫是一种狗”,属于同一分类层级内的低困惑性编辑;而将“猫是一种动物”编辑为“猫是一种植物”,则属于违反基本类别边界的高困惑性编辑。为系统地研究这一现象,我们引入HierarchyData,这是一个精心整理的数据集,包含跨多个类别的99个下位词-上位词对。通过在三个模型和四种编辑方法上的受控实验,我们证明新知识的困惑性与知识编辑的有效性之间存在很强的负相关。我们的分析表明,涉及更抽象概念(上位词)的编辑通常表现出更高的困惑性,并且比对应的具体概念(下位词)更难修改。这些发现凸显了LLM知识编辑中的一个根本挑战:新事实与LLM已习得的概念层级越矛盾,就越难以可靠地编码该知识。
Article 175
Title@2025-07-15 (2): Understanding Language Model Circuits through Knowledge Editing
Title: Understanding Language Model Circuits through Knowledge Editing | Sprachmodell-Schaltungen durch Wissensbearbeitung verstehen | 通过知识编辑理解语言模拟电路 2406.17241v4 |
Authors (3): Huaizhi Ge, Frank Rudzicz, Zining Zhu
Recent advances in language model interpretability have identified circuits, critical subnetworks that replicate model behaviors, yet how knowledge is structured within these crucial subnetworks remains opaque. To gain an understanding of the knowledge in the circuits, we conduct systematic knowledge editing experiments on the circuits of the GPT-2 language model. Our analysis reveals intriguing patterns in how circuits respond to editing attempts, the extent of knowledge distribution across network components, and the architectural composition of knowledge-bearing circuits. These findings offer insights into the complex relationship between model circuits and knowledge representation, deepening the understanding of how information is organized within language models. Our findings offer novel insights into the “meanings” of the circuits, and introduce directions for further interpretability and safety research on language models.
语言模型解释性的最新进展已查明了电路、复制模型行为的关键子网络,然而这些关键子网络内部的知识结构仍然不透明。为了了解电路知识,我们对GPT-2语言模型的电路进行了系统的知识编辑实验。我们的分析揭示了电路如何对编辑尝试作出反应的令人感兴趣的模式、网络各组成部分之间的知识传播程度以及包含知识的电路的建筑构成。这些发现揭示了模型电路和知识代表之间的复杂关系,加深了对语言模型内信息组织方式的理解。我们的调查结果为“语言电路的含义”提供了新颖的洞察力,并为语言模型的进一步解释和安全研究提供了方向。
Article 176
Title@2025-07-15 (2): AI Wizards at CheckThat! 2025: Enhancing Transformer-Based Embeddings with Sentiment for Subjectivity Detection in News Articles
Title: AI Wizards at CheckThat! 2025: Enhancing Transformer-Based Embeddings with Sentiment for Subjectivity Detection in News Articles | KI-Assistenten bei CheckThat! 2025: Transformerbasierte Einbettungen mit Gefühl für Subjektivitätserkennung in Nachrichtenartikeln verbessern | AI 向导于 CheckThat! 2025:加强基于变压器的嵌入装置,使其更敏感,以便在新闻文章中发现主观性。 2507.11764v1 |
Authors (3): Matteo Fasulo, Luca Babboni, Luca Tedeschini
This paper presents AI Wizards’ participation in the CLEF 2025 CheckThat! Lab Task 1: Subjectivity Detection in News Articles, classifying sentences as subjective/objective in monolingual, multilingual, and zero-shot settings. Training/development datasets were provided for Arabic, German, English, Italian, and Bulgarian; final evaluation included additional unseen languages (e.g., Greek, Romanian, Polish, Ukrainian) to assess generalization. Our primary strategy enhanced transformer-based classifiers by integrating sentiment scores, derived from an auxiliary model, with sentence representations, aiming to improve upon standard fine-tuning. We explored this sentiment-augmented architecture with mDeBERTaV3-base, ModernBERT-base (English), and Llama3.2-1B. To address class imbalance, prevalent across languages, we employed decision threshold calibration optimized on the development set. Our experiments show sentiment feature integration significantly boosts performance, especially subjective F1 score. This framework led to high rankings, notably 1st for Greek (Macro F1 = 0.51).
本文介绍AI巫师参与CLEF 2025 CheckThat!实验室任务1:在新闻文章中进行主观性检测,将单语、多语种和零发环境中的量刑归类为主观/目标;为阿拉伯文、德文、英文、意大利文和保加利亚文提供培训/发展数据集;最后评估包括其他看不见语言(例如希腊文、罗马尼亚文、波兰文、乌克兰文),以评估一般化。我们的主要战略是将感知分数纳入辅助模式,并配有句子表示法,以加强基于变压器的分类,目的是改进标准的微调。我们用MDEBERT-Base(英文)和Llama3.2-1B. 探讨了这种情绪增强的结构。为了解决各语文之间普遍存在的阶级不平衡,我们采用了在开发集中优化的决定阈值校准。我们的实验显示了情感特征整合,特别是主观的F1分,大大提高了绩效。这个框架导致高排名,特别是希腊文排名第1位(Macro F1=0.51)。
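The decision threshold calibration mentioned in this abstract can be illustrated with a minimal sketch. It assumes a binary subjective/objective classifier that outputs a probability for the subjective class (label 1), and it picks the threshold that maximizes macro F1 on the development set. The function names, the grid of candidate thresholds, and the choice of macro F1 as the objective are illustrative assumptions, not details taken from the paper.

```python
from typing import Sequence, Tuple


def f1_for_class(y_true: Sequence[int], y_pred: Sequence[int], positive: int) -> float:
    """F1 score when `positive` is treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of the per-class F1 scores for labels 0 and 1."""
    return (f1_for_class(y_true, y_pred, 0) + f1_for_class(y_true, y_pred, 1)) / 2


def calibrate_threshold(dev_probs: Sequence[float], dev_labels: Sequence[int],
                        steps: int = 99) -> Tuple[float, float]:
    """Sweep thresholds in (0, 1) and keep the first one with the best dev macro F1."""
    best_t, best_score = 0.5, -1.0
    for i in range(1, steps + 1):
        t = i / (steps + 1)
        preds = [1 if p >= t else 0 for p in dev_probs]
        score = macro_f1(dev_labels, preds)
        if score > best_score:  # strict '>' keeps the lowest best threshold
            best_t, best_score = t, score
    return best_t, best_score


# Hypothetical dev-set probabilities for the "subjective" class.
threshold, score = calibrate_threshold([0.1, 0.2, 0.3, 0.35, 0.8], [0, 0, 1, 1, 1])
```

With imbalanced classes the best threshold is often far from 0.5, which is the motivation for tuning it on held-out data rather than fixing it in advance.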
Article 177
Title@2025-07-15 (2): AKReF: An argumentative knowledge representation framework for structured argumentation
Title: AKReF: An argumentative knowledge representation framework for structured argumentation | AKREF: Ein argumentativer Wissensvertretungsrahmen für strukturierte Argumentation | AKREF: 结构化论证的理论知识代表框架 2506.00713v3 |
Authors (2): Debarati Bhattacharjee, Ashish Anand
This paper presents a framework to convert argumentative texts into argument knowledge graphs (AKG). The proposed argumentative knowledge representation framework (AKReF) extends the theoretical foundation and enables the AKG to provide a graphical view of the argumentative structure that is easier to understand. Starting with basic annotations of argumentative components (ACs) and argumentative relations (ARs), we enrich the information by constructing a knowledge base (KB) graph with metadata attributes for nodes. Next, we apply modus ponens on premises and inference rules from the KB to form arguments. From these arguments, we create an AKG. The nodes and edges of the AKG have attributes capturing key argumentative features such as the type of premise (e.g., axiom, ordinary premise, assumption), the type of inference rule (e.g., strict, defeasible), preference order over defeasible rules, markers (e.g., “therefore”, “however”), and the type of attack (e.g., undercut, rebuttal, undermining). We identify inference rules by locating a specific set of markers, called inference markers (IM). This, in turn, makes it possible to identify undercut attacks previously undetectable in existing datasets. AKG prepares the ground for reasoning tasks, including checking the coherence of arguments and identifying opportunities for revision. For this, it is essential to find indirect relations, many of which are implicit. Our proposed AKG format, with annotated inference rules and modus ponens, helps reasoning models learn the implicit, indirect relations that require inference over arguments and their interconnections. We use an essay from the AAEC dataset to illustrate the framework. We further show its application in complex analyses such as extracting a conflict-free set and a maximal set of admissible arguments.
本文提出一个将论证性文本转换为论证知识图谱(AKG)的框架。所提出的论证知识表示框架(AKReF)扩展了理论基础,使AKG能够以更易理解的图形方式呈现论证结构。从论证成分(AC)和论证关系(AR)的基本标注出发,我们构建带有节点元数据属性的知识库(KB)图来丰富信息。接着,我们对KB中的前提和推理规则应用肯定前件式(modus ponens)以形成论证,并据此创建AKG。AKG的节点和边带有刻画关键论证特征的属性,例如前提类型(如公理、普通前提、假设)、推理规则类型(如严格、可废止)、可废止规则之间的偏好顺序、标记词(如“因此”“然而”)以及攻击类型(如undercut、rebuttal、undermining)。我们通过定位一组特定的标记词(称为推理标记,IM)来识别推理规则,这进而使我们能够识别现有数据集中此前无法检测到的undercut攻击。AKG为推理任务奠定了基础,包括检查论证的连贯性和发现修订机会。为此,找出间接关系至关重要,而其中许多关系是隐含的。我们提出的AKG格式带有标注的推理规则和modus ponens,有助于推理模型学习那些需要对论证及其相互联系进行推断的隐含、间接关系。我们使用AAEC数据集中的一篇议论文来说明该框架,并进一步展示其在复杂分析中的应用,例如提取无冲突集和最大可接受论证集。
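The step "apply modus ponens on premises and inference rules from the KB to form arguments" can be sketched as simple forward chaining. This is only an illustration: the `Rule` class, the `strict` flag (standing in for the strict/defeasible distinction), and the tuple layout of an argument are hypothetical and are not the AKReF data model.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Rule:
    """Hypothetical inference rule: if all antecedents hold, conclude the consequent."""
    antecedents: FrozenSet[str]
    consequent: str
    strict: bool = True  # False marks a defeasible rule


def forward_chain(premises: Iterable[str],
                  rules: List[Rule]) -> Tuple[Set[str], List[Tuple[List[str], str, bool]]]:
    """Apply modus ponens repeatedly until no rule yields a new conclusion."""
    known = set(premises)
    arguments = []  # (sorted antecedents, conclusion, strict?) for each rule that fired
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule.consequent not in known and rule.antecedents <= known:
                known.add(rule.consequent)
                arguments.append((sorted(rule.antecedents), rule.consequent, rule.strict))
                changed = True
    return known, arguments


# Toy knowledge base: p is a premise; p => q is strict, q => r is defeasible.
rules = [Rule(frozenset({"q"}), "r", strict=False), Rule(frozenset({"p"}), "q")]
known, arguments = forward_chain({"p"}, rules)
```

Each recorded argument could then become a node or edge of a graph, with the `strict` flag kept as an attribute.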
Article 178
Title@2025-07-15 (2): CRABS: A syntactic-semantic pincer strategy for bounding LLM interpretation of Python notebooks
Title: CRABS: A syntactic-semantic pincer strategy for bounding LLM interpretation of Python notebooks | CRABS: Eine syntaktisch-semantische Zangenstrategie zur Begrenzung der LLM-Interpretation von Python-Notebooks | CRABS: 一种将Python笔记本的LLM 解释捆绑起来的合成-塞氏针刺术策略 2507.11742v1 |
Authors (5): Meng Li, Timothy M. McPhillips, Dingmin Wang, Shin-Rong Tsai, Bertram Ludäscher
Recognizing the information flows and operations comprising data science and machine learning Python notebooks is critical for evaluating, reusing, and adapting notebooks for new tasks. Investigating a notebook via re-execution often is impractical due to the challenges of resolving data and software dependencies. While Large Language Models (LLMs) pre-trained on large codebases have demonstrated effectiveness in understanding code without running it, we observe that they fail to understand some realistic notebooks due to hallucinations and long-context challenges. To address these issues, we propose a notebook understanding task yielding an information flow graph and corresponding cell execution dependency graph for a notebook, and demonstrate the effectiveness of a pincer strategy that uses limited syntactic analysis to assist full comprehension of the notebook using an LLM. Our Capture and Resolve Assisted Bounding Strategy (CRABS) employs shallow syntactic parsing and analysis of the abstract syntax tree (AST) to capture the correct interpretation of a notebook between lower and upper estimates of the inter-cell I/O sets, then uses an LLM to resolve remaining ambiguities via cell-by-cell zero-shot learning, thereby identifying the true data inputs and outputs of each cell. We evaluate and demonstrate the effectiveness of our approach using an annotated dataset of 50 representative, highly up-voted Kaggle notebooks that together represent 3454 actual cell inputs and outputs. The LLM correctly resolves 1397 of 1425 (98%) ambiguities left by analyzing the syntactic structure of these notebooks. Across 50 notebooks, CRABS achieves average F1 scores of 98% identifying cell-to-cell information flows and 99% identifying transitive cell execution dependencies.
识别数据科学和机器学习Python笔记本中的信息流与操作,对于评估、复用笔记本并使其适应新任务至关重要。由于解决数据和软件依赖的困难,通过重新执行来考察笔记本往往不切实际。尽管在大型代码库上预训练的大语言模型(LLM)已被证明无需运行代码即可有效理解代码,但我们观察到,由于幻觉和长上下文问题,它们无法理解某些真实的笔记本。为解决这些问题,我们提出一项笔记本理解任务,为笔记本生成信息流图及相应的单元格执行依赖图,并展示了一种钳形策略的有效性:该策略利用有限的句法分析来辅助LLM完整理解笔记本。我们的“捕获与解析辅助界定策略”(CRABS)采用浅层句法解析和抽象语法树(AST)分析,将笔记本的正确解释限定在单元格间I/O集合的下界估计与上界估计之间,然后使用LLM通过逐单元格的零样本学习消解剩余的歧义,从而确定每个单元格真正的数据输入和输出。我们使用一个带标注的数据集评估并展示了该方法的有效性,该数据集包含50个有代表性、高票的Kaggle笔记本,共计3454个实际的单元格输入和输出。对于分析这些笔记本的句法结构后遗留的1425处歧义,LLM正确消解了其中1397处(98%)。在这50个笔记本上,CRABS在识别单元格间信息流方面的平均F1分数为98%,在识别传递性单元格执行依赖方面为99%。
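The idea of bracketing a cell's inputs between a lower and an upper estimate using shallow AST analysis can be illustrated with Python's standard `ast` module. The sketch below is an assumption-laden approximation, not the CRABS implementation. It ignores scoping and statement order, treats every name read in the cell (other than builtins and function parameters) as a possible input, and treats only names never assigned in the cell as certain inputs. The function name and the dictionary keys are hypothetical.

```python
import ast
import builtins


def cell_io_bounds(source: str) -> dict:
    """Crude lower/upper estimates of a cell's inputs, plus its candidate outputs."""
    loaded, stored, params = set(), set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            # Name nodes carry a context: Store for assignment targets, Load for reads.
            if isinstance(node.ctx, ast.Store):
                stored.add(node.id)
            elif isinstance(node.ctx, ast.Load):
                loaded.add(node.id)
        elif isinstance(node, ast.alias) and node.name != "*":
            # `import a.b` binds `a`; `import a.b as c` binds `c`.
            stored.add(node.asname or node.name.split(".")[0])
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            stored.add(node.name)
        elif isinstance(node, ast.arg):
            params.add(node.arg)
    external = loaded - set(dir(builtins)) - params
    lower = external - stored  # read but never assigned here: must come from outside
    upper = external           # anything read might come from outside
    return {
        "inputs_lower": lower,
        "inputs_upper": upper,
        "ambiguous": upper - lower,  # left for a later resolution step
        "outputs": stored,
    }


cell = "import pandas as pd\ndf = pd.read_csv(path)\ndf = df.dropna()\nprint(len(df))\n"
bounds = cell_io_bounds(cell)
```

For this cell, `path` is the only certain input, while `df` and `pd` fall in the ambiguous band between the two estimates. In the approach described above, that band is what an LLM would be asked to resolve cell by cell.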
Article 179
Title@2025-07-15 (2): Flexible and Efficient Grammar-Constrained Decoding
Title: Flexible and Efficient Grammar-Constrained Decoding | Flexible und effiziente Grammatik-Kontrainierte Dekodierung | 灵活、高效的语法约束解码 2502.05111v2 |
Authors (3): Kanghee Park, Timothy Zhou, Loris D’Antoni
Large Language Models (LLMs) are often asked to generate structured outputs that obey precise syntactic rules, such as code snippets or formatted data. Grammar-constrained decoding (GCD) can guarantee that LLM outputs match such rules by masking out tokens that will provably lead to outputs that do not belong to a specified context-free grammar (CFG). To guarantee soundness, GCD algorithms have to compute how a given LLM subword tokenizer can align with the tokens used by a given context-free grammar and compute token masks based on this information. Doing so efficiently is challenging, and existing GCD algorithms require tens of minutes to preprocess common grammars. We present a new GCD algorithm together with an implementation that offers 17.71x faster offline preprocessing than existing approaches while preserving state-of-the-art efficiency in online mask computation.
大语言模型(LLMS)通常被要求生成符合精确合成规则的结构化产出,例如代码片段或格式化数据。语法限制解码(GCD)可以保证LLM产出与这些规则相匹配,通过遮盖标记,从而可以明显地导致不属于特定无上下文语法的输出。为了保证稳健,GCD算法必须计算给定的LLM子字符号如何与特定无上下文语法符号使用的标记相匹配,并根据这些信息计算符号面罩。高效操作具有挑战性,而现有的GCD算法需要数十分钟来预处理通用语法。我们提出了一个新的GCD算法,同时提供比现有方法更快17.71x的离线前处理方法,同时保持在线掩码计算中的最新效率。
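Token masking in grammar-constrained decoding can be shown on a toy case. The sketch below assumes a tiny grammar (balanced parentheses) and a hypothetical subword vocabulary whose tokens may span several grammar symbols. It recomputes the mask naively at every step, so it illustrates only what a mask is. It does not reflect the offline preprocessing or the algorithm proposed in the paper.

```python
from typing import List, Sequence


def is_viable_prefix(text: str) -> bool:
    """True if `text` can still be extended to a balanced-parentheses string."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:  # closed more than opened: no extension can repair this
                return False
        else:
            return False   # character outside the toy grammar's alphabet
    return True


def token_mask(prefix: str, vocab: Sequence[str]) -> List[bool]:
    """Mark which vocabulary tokens keep the output inside the grammar's prefix language."""
    return [is_viable_prefix(prefix + tok) for tok in vocab]


def constrained_greedy_step(prefix: str, vocab: Sequence[str], logits: Sequence[float]) -> str:
    """Pick the highest-scoring token among those the mask allows."""
    allowed = [(score, tok) for score, tok, ok in zip(logits, vocab, token_mask(prefix, vocab)) if ok]
    if not allowed:
        raise ValueError("no token in the vocabulary is allowed after this prefix")
    return max(allowed, key=lambda pair: pair[0])[1]


vocab = ["(", ")", "()", "))", ")("]
mask = token_mask("(", vocab)
next_token = constrained_greedy_step("", vocab, [0.1, 5.0, 0.3, 4.0, 2.0])
```

Note that the token `"))"` is rejected after the prefix `"("` even though a single `")"` is fine. This mismatch between subword tokens and grammar terminals is the alignment problem the abstract refers to.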
Article 180
Title@2025-07-15 (2): Multi-domain Multilingual Sentiment Analysis in Industry: Predicting Aspect-based Opinion Quadruples
Title: Multi-domain Multilingual Sentiment Analysis in Industry: Predicting Aspect-based Opinion Quadruples | Multidomain Multilingual Sentiment Analysis in Industry: Aspektbasierte Meinungsquadruples voraussagen | 工业多语言多语种多语种情感分析:预测基于频谱的四大意见 2505.10389v2 |
Authors (2): Benjamin White, Anastasia Shimorina
This paper explores the design of an aspect-based sentiment analysis system using large language models (LLMs) for real-world use. We focus on quadruple opinion extraction – identifying aspect categories, sentiment polarity, targets, and opinion expressions from text data across different domains and languages. We investigate whether a single fine-tuned model can effectively handle multiple domain-specific taxonomies simultaneously. We demonstrate that a combined multi-domain model achieves performance comparable to specialized single-domain models while reducing operational complexity. We also share lessons learned for handling non-extractive predictions and evaluating various failure modes when developing LLM-based systems for structured prediction tasks.
本文探讨了如何设计一个基于侧面的情绪分析系统,使用大型语言模型(LLMs),供现实世界使用。我们侧重于四倍的见解提取 – – 确定不同领域和语言的文本数据中的方面类别、情绪极化、目标和意见表达方式。我们调查单一的微调模型能否同时有效处理多个特定领域的分类。我们证明,组合的多域模型在降低操作复杂性的同时,取得了与专门单一域模型相类似的性能。我们还分享了在处理非扩展性预测和评价各种失败模式方面的经验教训,同时为结构化预测任务开发基于LLM系统的系统。
Article 181
Title@2025-07-15 (2): Spatially Grounded Explanations in Vision Language Models for Document Visual Question Answering
Title: Spatially Grounded Explanations in Vision Language Models for Document Visual Question Answering | Spatially Grounded Erklärungen in Vision Language Models for Document Visual Question Answering | 用于文件视觉问题解答的愿景语言模型中的基于空间的解释 2507.12490v1 |
Authors (3): Maximiliano Hormazábal Lagos, Héctor Cerezo-Costas, Dimosthenis Karatzas
We introduce EaGERS, a fully training-free and model-agnostic pipeline that (1) generates natural language rationales via a vision language model, (2) grounds these rationales to spatial sub-regions by computing multimodal embedding similarities over a configurable grid with majority voting, and (3) restricts response generation to the relevant regions selected in the masked image. Experiments on the DocVQA dataset demonstrate that our best configuration not only outperforms the base model on exact match accuracy and Average Normalized Levenshtein Similarity metrics but also enhances transparency and reproducibility in DocVQA without additional model fine-tuning.
我们引入“EaGERS”这一完全没有培训的、不采用模型的管道:(1) 通过愿景语言模型产生自然语言原理,(2) 通过计算与多数投票的可配置网格相似之处的多式联运,将这些原理作为空间分区的理由,(3) 限制只从蒙面图像中选择的有关区域生成答复,对“DocVQA”数据集的实验表明,我们的最佳配置不仅优于精确匹配准确度和平均常态相近度衡量标准的基础模型,而且提高了DocVQA的透明度和可复制性,而没有额外的模型微调。
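For readers unfamiliar with the Average Normalized Levenshtein Similarity (ANLS) metric named in this abstract, here is a minimal sketch of how it is commonly computed: each prediction is scored against every reference answer by one minus the normalized edit distance, scores whose normalized distance reaches the threshold (0.5 by convention) are set to zero, and the best score per question is averaged. The lowercasing and whitespace stripping are assumptions here, and evaluation scripts may normalize differently.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def anls(predictions: Sequence[str], references: Sequence[Sequence[str]], tau: float = 0.5) -> float:
    """Average over questions of the best thresholded similarity to any reference."""
    if not predictions:
        return 0.0
    total = 0.0
    for pred, refs in zip(predictions, references):
        p = pred.strip().lower()
        best = 0.0
        for ref in refs:
            r = ref.strip().lower()
            denom = max(len(p), len(r))
            nl = levenshtein(p, r) / denom if denom else 0.0
            best = max(best, 1.0 - nl if nl < tau else 0.0)
        total += best
    return total / len(predictions)


score = anls(["Paris", "xyz"], [["paris"], ["abc"]])
```

Unlike exact match, the metric gives partial credit for near misses such as OCR-style character errors, while the threshold prevents unrelated strings from earning credit.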
Article 182
Title@2025-07-15 (2): ExpliCIT-QA: Explainable Code-Based Image Table Question Answering
Title: ExpliCIT-QA: Explainable Code-Based Image Table Question Answering | ExplicCIT-QA: Erklärbare Code-basierte Bildtabelle Frage-Antworten | ExpliCIT-QA:可解释代码图像表问题解答 2507.11694v1 |
Authors (5): Maximiliano Hormazábal Lagos, Álvaro Bueno Sáez, Pedro Alonso Doval, Jorge Alcalde Vesteiro, Héctor Cerezo-Costas
We present ExpliCIT-QA, a system that extends our previous MRT approach for tabular question answering into a multimodal pipeline capable of handling complex table images and providing explainable answers. ExpliCIT-QA follows a modular design, consisting of: (1) Multimodal Table Understanding, which uses a Chain-of-Thought approach to extract and transform content from table images; (2) Language-based Reasoning, where a step-by-step explanation in natural language is generated to solve the problem; (3) Automatic Code Generation, where Python/Pandas scripts are created based on the reasoning steps, with feedback for handling errors; (4) Code Execution to compute the final answer; and (5) Natural Language Explanation that describes how the answer was computed. The system is built for transparency and auditability: all intermediate outputs, parsed tables, reasoning steps, generated code, and final answers are available for inspection. This strategy works towards closing the explainability gap in end-to-end TableVQA systems. We evaluated ExpliCIT-QA on the TableVQA-Bench benchmark, comparing it with existing baselines. We demonstrated improvements in interpretability and transparency, which open the door for applications in sensitive domains like finance and healthcare where auditing results are critical.
我们提出ExpliCIT-QA,该系统将我们此前用于表格问答的MRT方法扩展为一个多模态流水线,能够处理复杂的表格图像并给出可解释的答案。ExpliCIT-QA采用模块化设计,包括:(1) 多模态表格理解,使用思维链(Chain-of-Thought)方法从表格图像中提取并转换内容;(2) 基于语言的推理,以自然语言生成逐步解释来解决问题;(3) 自动代码生成,根据推理步骤创建Python/Pandas脚本,并带有用于处理错误的反馈;(4) 代码执行,计算最终答案;(5) 自然语言解释,说明答案是如何计算得出的。该系统为透明性和可审计性而构建:所有中间输出、解析后的表格、推理步骤、生成的代码和最终答案均可供检查。这一策略致力于缩小端到端TableVQA系统中的可解释性差距。我们在TableVQA-Bench基准上评估了ExpliCIT-QA,并将其与现有基线进行比较。我们展示了其在可解释性和透明性方面的改进,这为其在金融、医疗等对结果审计至关重要的敏感领域中的应用打开了大门。
Article 183
Title@2025-07-15 (2): MetaLint: Generalizable Idiomatic Code Quality Analysis through Instruction-Following and Easy-to-Hard Generalization
Title: MetaLint: Generalizable Idiomatic Code Quality Analysis through Instruction-Following and Easy-to-Hard Generalization | MetaLint: Generalisierbare idiomatische Code-Qualitätsanalyse durch instruction-following und einfach-zu-harte Verallgemeinerung | MetLint: 通过执行指示和易于协调的通用化,可通用的单性守则质量分析 2507.11687v1 |
Authors (6): Atharva Naik, Lawanya Baghel, Dhakshin Govindarajan, Darsh Agrawal, Daniel Fried, Carolyn Rose
Large Language Models, though successful in code generation, struggle with code quality analysis because they are limited by static training data and can’t easily adapt to evolving best practices. We introduce MetaLint, a new instruction-following framework that formulates code quality analysis as the task of detecting and fixing problematic semantic code fragments or code idioms based on high-level specifications. Unlike conventional approaches that train models on static, rule-based data, MetaLint employs instruction tuning on synthetic linter-generated data to support easy-to-hard generalization, enabling models to adapt to novel or complex code patterns without retraining. To evaluate this, we construct a benchmark of challenging idioms inspired by real-world coding standards such as Python Enhancement Proposals (PEPs) and assess whether MetaLint-trained models reason adaptively or simply memorize. Our results show that MetaLint improves generalization to unseen PEP idioms, achieving a 70.37% F-score on idiom detection with the highest recall (70.43%) among all evaluated models. It also achieves 26.73% on localization, competitive for its 4B parameter size and comparable to larger state-of-the-art models like o3-mini, highlighting its potential for future-proof code quality analysis.
大型语言模型,虽然在代码生成方面很成功,但是在代码质量分析中挣扎的是代码质量分析,因为它们受到静态培训数据的限制,并且无法轻易地适应不断发展的最佳做法。我们引入了Metalint,这是一个新的指导性框架,它根据Python增强建议(PEP)等现实世界编码标准来制定具有挑战性的语言质量分析基准,评估Metalint培训的模型是否具有适应性或简单的记忆力。我们的结果显示,Metalint改进了对静态、基于规则的数据模型的常规模式的通用化。Metalint使用对合成界面生成数据的指导性调整,以支持简单到硬的通用化,使模型能够适应新的或复杂的代码模式,而无需再培训。为了评估这一点,我们建立了一个受现实世界编码标准(例如Python增强建议(PEPEP))启发的具有挑战性的智商模型基准,并评估Metalint培训模式是否具有适应性,或者只是记忆性,在经过所有评估的模型(70.4-4B级)中实现了70.37%的F-芯质检测(70.44%)的最强的模型。
Article 184
Title@2025-07-15 (2): Let’s Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification
Title: Let’s Think in Two Steps: Mitigating Agreement Bias in MLLMs with Self-Grounded Verification | Lassen Sie uns in zwei Schritten denken: Abmildern Vereinbarung Bias in MLLMs mit selbst-gerundete Verifikation | 让我们思考两步:在MLLMs中减少协议与自我核查的偏见 2507.11662v1 |
Authors (6): Moises Andrade, Joonhyuk Cha, Brandon Ho, Vriksha Srihari, Karmesh Yadav, Zsolt Kira
Verifiers – functions assigning rewards to agent behavior – have been key for AI progress in domains like math and board games. However, extending these gains to domains without clear-cut success criteria (e.g., computer use) remains a challenge: while humans can recognize suitable outcomes, translating this intuition into scalable rules is non-trivial. Multimodal Large Language Models (MLLMs) emerge as a promising solution, given their world knowledge, human-preference alignment, and reasoning skills. We evaluate MLLMs as verifiers of agent trajectories across web navigation, computer use, and robotic manipulation, and identify a critical limitation: agreement bias, a strong tendency for MLLMs to favor information in their context window, often generating chains of thought to rationalize flawed behavior. This bias is pervasive across models, resilient to test-time scaling, and can impact several methods using MLLMs as evaluators (e.g., data filtering). Notably, it occurs despite MLLMs showing strong, human-aligned priors on desired behavior. To address this, we propose Self-Grounded Verification (SGV), a lightweight method that enables more effective use of MLLMs’ knowledge and reasoning by harnessing their own sampling mechanisms via unconditional and conditional generation. SGV operates in two steps: first, the MLLM is prompted to retrieve broad priors about task completion, independent of the data under evaluation. Then, conditioned on these self-generated priors, it reasons over and evaluates a candidate trajectory. Enhanced with SGV, MLLM verifiers show gains of up to 20 points in accuracy and failure detection rates, and can perform real-time supervision of heterogeneous agents, boosting task completion of a GUI specialist in OSWorld, a diffusion policy in robomimic, and a ReAct agent in VisualWebArena – setting a new state of the art on the benchmark, surpassing the previous best by 48%.
验证者 – – 对代理行为给予奖励的职能 – – 在数学和棋盘游戏等领域是AI进步的关键。然而,将这些收益扩大到没有明确成功标准的领域(例如计算机使用),这仍然是一个挑战:虽然人类可以承认合适的结果,将这种直觉转化为可扩展的规则是非三角的。多式大语言模型(MLLMs)因其世界知识、人文比比对和推理技能而成为一个有希望的解决方案。我们评估MLLMS是互联网导航、计算机使用和机器人操纵等领域代理轨迹的验证者,并确定了一个关键限制:协议偏向、MLLLMS偏重其上的信息,往往形成将错误行为合理化的思维链条。这种偏向于各种模型,适应测试时间缩放,并且能够影响以 MLLLMMs为评审者(例如,数据过滤)为评价者的一种最佳方法。 值得注意的是,尽管MLLLMM公司在网络导航、计算机和机器人操作过程中表现出强力、人性更接近的预感知性前行,我们提议进行自我循环核查(SGV),一个较轻度的升级的升级的SDRDRDRevral),一个方法使得它们能够更高效地在时间上使用一个更精确的升级的升级的升级的动作。
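The two-step structure of Self-Grounded Verification described above can be sketched as two calls to a text generator: one that sees only the task, and one that sees the task, the self-generated priors, and the trajectory. Everything concrete in this sketch is an assumption: the `generate` callable stands in for any model call, and the prompt wording, the SUCCESS/FAILURE convention, and the returned dictionary are hypothetical rather than taken from the paper.

```python
from typing import Callable, Dict


def self_grounded_verify(task: str, trajectory: str,
                         generate: Callable[[str], str]) -> Dict[str, object]:
    """Two-step verification: elicit priors without the trajectory, then judge with them."""
    # Step 1: unconditional on the data under evaluation (the trajectory is withheld).
    prior_prompt = (
        f"Task: {task}\n"
        "Describe, in general terms, what successful completion of this task looks like."
    )
    priors = generate(prior_prompt)

    # Step 2: conditioned on the self-generated priors, evaluate the candidate trajectory.
    verdict_prompt = (
        f"Task: {task}\n"
        f"Expected characteristics of success:\n{priors}\n"
        f"Candidate trajectory:\n{trajectory}\n"
        "Answer SUCCESS or FAILURE, then explain briefly."
    )
    verdict = generate(verdict_prompt)
    return {
        "priors": priors,
        "verdict": verdict,
        "success": verdict.strip().upper().startswith("SUCCESS"),
    }


# Stub generator used only to show the call pattern; a real model call would go here.
calls = []

def fake_generate(prompt: str) -> str:
    calls.append(prompt)
    return "The cart should contain the item." if len(calls) == 1 else "FAILURE: cart is empty"

result = self_grounded_verify("Add a mug to the cart",
                              "clicked 'Home'; cart shows 0 items", fake_generate)
```

The point of withholding the trajectory in the first call is that the model states its expectations before it can be swayed by what is in its context window, which is where agreement bias arises according to the abstract.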
Article 185
Title@2025-07-15 (2): Partitioner Guided Modal Learning Framework
Title: Partitioner Guided Modal Learning Framework | Partitioner Geführtes Modales Lernen-Framework | 向导模式学习框架 2507.11661v1 |
Authors (5): Guimin Hu, Yi Xin, Lijie Hu, Zhihong Zhu, Hasti Seifi
Multimodal learning benefits from information in multiple modalities, and each learned modal representation can be divided into uni-modal features, which can be learned from uni-modal training, and paired-modal features, which can be learned from cross-modal interaction. Building on this perspective, we propose a partitioner-guided modal learning framework, PgM, which consists of the modal partitioner, uni-modal learner, paired-modal learner, and uni-paired modal decoder. The modal partitioner segments the learned modal representation into uni-modal and paired-modal features. The modal learner incorporates two dedicated components for uni-modal and paired-modal learning. The uni-paired modal decoder reconstructs the modal representation based on uni-modal and paired-modal features. PgM offers three key benefits: 1) thorough learning of uni-modal and paired-modal features, 2) flexible distribution adjustment for uni-modal and paired-modal representations to suit diverse downstream tasks, and 3) different learning rates across modalities and partitions. Extensive experiments demonstrate the effectiveness of PgM across four multimodal tasks and further highlight its transferability to existing models. Additionally, we visualize the distribution of uni-modal and paired-modal features across modalities and tasks, offering insights into their respective contributions.
从多种模式信息中获得多模式学习的好处,每个学习模式的表示方式可以分为单模式部分,从单模式培训和对配模式的特征中学习,从交叉模式互动中学习。基于这一视角,我们提议了一个分隔式指导模式学习框架,即PgM,由模式分割器、单模式学习器、配对式学习器和单模式解码器组成。Modal分割器分部分,从单模式培训和配对模式中学习,可以学习单模式和配对模式的特征。 Modal学习器包含两个专门的单一模式和配对模式学习的组成部分。Un-piled模式解码器基于单一模式和配对模式的特征重建模式代表方式。 PgM提供三个关键好处:(1) 彻底学习单模式和配对模式的特征,(2) 灵活分配单模式和配对模式的表述方式,以适应不同的下游和配对模式的任务。 Modal Levelopments 展示其跨模式和跨模式的跨模式和跨模式和跨模式的可视性任务。
Article 186
Title@2025-07-15 (2): Rolling the DICE on Idiomaticity: How LLMs Fail to Grasp Context
Title: Rolling the DICE on Idiomaticity: How LLMs Fail to Grasp Context | Rolling the DICE on Idiomaticity: Wie LLMs den Kontext nicht erfassen | 推出关于多才多艺的DICE:LLLMS如何失败到撕裂背景 2410.16069v2 |
Authors (3): Maggie Mi, Aline Villavicencio, Nafise Sadat Moosavi
Human processing of idioms relies on understanding the contextual sentences in which idioms occur, as well as language-intrinsic features such as frequency and speaker-intrinsic factors like familiarity. While LLMs have shown high performance on idiomaticity detection tasks, this success may be attributed to reasoning shortcuts in existing datasets. To this end, we construct a novel, controlled contrastive dataset designed to test whether LLMs can effectively use context to disambiguate idiomatic meaning. Additionally, we explore how collocational frequency and sentence probability influence model performance. Our findings reveal that LLMs often fail to resolve idiomaticity when doing so requires attending to the surrounding context, and that models perform better on sentences that have higher likelihood. The collocational frequency of expressions also impacts performance. We make our code and dataset publicly available.
人类学系的人类学处理取决于理解发生学说的背景句子,以及语言学特征,例如频率和语言学特征,如熟悉程度等。虽然LLMs在特殊性检测任务方面表现优异,但这一成功可归功于现有数据集中的推理快捷方式。为此,我们建立一个新颖的、受控制的对比数据集,旨在测试LMs能否有效地利用背景来模糊学系的意义。此外,我们探索了同地频率和判决概率如何影响模型性能。我们的调查结果显示,LMs在需要处理周围环境时,往往无法解决特殊性,模型在更可能的判决上效果更好。同地使用表达的频率也会影响性能。我们公布了我们的代码和数据。
Article 187
Title@2025-07-15 (2): Cross-lingual Few-shot Learning for Persian Sentiment Analysis with Incremental Adaptation
Title: Cross-lingual Few-shot Learning for Persian Sentiment Analysis with Incremental Adaptation | Cross-lingual Few-shot Learning for Persian Sentiment Analysis with Incremental Adaptation | 波斯情感分析的跨语言多语种短片学习和增量适应 2507.11634v1 |
Authors (2): Farideh Majidi, Ziaeddin Beheshtifard
This research examines cross-lingual sentiment analysis using few-shot learning and incremental learning methods in Persian. The main objective is to develop a model capable of performing sentiment analysis in Persian using limited data, while drawing on prior knowledge from high-resource languages. To achieve this, three pre-trained multilingual models (XLM-RoBERTa, mDeBERTa, and DistilBERT) were employed, which were fine-tuned using few-shot and incremental learning approaches on small samples of Persian data from diverse sources, including X, Instagram, Digikala, Snappfood, and Taaghche. This variety enabled the models to learn from a broad range of contexts. Experimental results show that mDeBERTa and XLM-RoBERTa achieved high performance, reaching 96% accuracy on Persian sentiment analysis. These findings highlight the effectiveness of combining few-shot learning and incremental learning with multilingual pre-trained models.
这项研究利用波斯语的微小学习和递增学习方法,对跨语言情绪分析进行了研究。主要目标是开发一个模型,能够利用有限的数据用波斯语进行情绪分析,同时从高资源语言获得先前的知识。为此,采用了三种经过预先训练的多语种模型(XLM-ROBERTA、MDBERTA和DipletBERTA),这些模型在对来自不同来源的波斯语数据(包括X、Instagram、Digikala、Snappfood和Taagche)的小型样本进行微小和递增学习方法进行微调调整,使这些模型能够从广泛的环境中学习。实验结果表明,MDBERTA和XLM-ROBERTA取得了很高的成绩,在波斯语情绪分析上达到了96%的精确度。这些研究结果突出了将微小的学习和递增学习与多语言的预训模式相结合的有效性。
Article 188
Title@2025-07-15 (2): Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model
Title: Efficient and Direct Duplex Modeling for Speech-to-Speech Language Model | Effiziente und direkte Duplex-Modellierung für Speech-to-Speech-Sprachenmodell | 语音语音和语音语言模式的高效和直接双重模式 2505.15670v3 |
Authors (10): Ke Hu, Ehsan Hosseini-Asl, Chen Chen, Edresson Casanova, Subhankar Ghosh, Piotr Żelasko, Zhehuai Chen, Jason Li, Jagadeesh Balam, Boris Ginsburg
Spoken dialogue is an intuitive form of human-computer interaction, yet current speech language models often remain constrained to turn-based exchanges, lacking real-time adaptability such as user barge-in. We propose a novel duplex speech-to-speech (S2S) architecture featuring continuous user inputs and codec agent outputs with channel fusion that directly models simultaneous user and agent streams. Using a pretrained streaming encoder for user input enables the first duplex S2S model without requiring speech pretraining. Separate architectures for agent and user modeling facilitate codec fine-tuning for better agent voices and halve the bitrate (0.6 kbps) compared to previous works. Experimental results show that the proposed model outperforms previous duplex models in reasoning, turn-taking, and barge-in abilities. The model requires significantly less speech data, as speech pretraining is skipped, which markedly simplifies the process of building a duplex S2S model from any LLM. Finally, it is the first openly available duplex S2S model with training and inference code to foster reproducibility.
口语对话是一种直观的人类-计算机互动形式,但目前的语言模式往往仍然局限于转手交流,缺乏实时适应性,例如用户驳入。我们提议对语音结构进行新的双面演讲(S2S),其内容是连续的用户投入和代码代理器输出以及直接模拟同步用户和代理器流的频道聚合。使用预先培训的用户输入流编码器使第一个双面 S2S 模型无需语言前导即可实现。不同的代理商和用户建模结构有助于为更好的代理商声音进行编码调整,比以往的工程低比比比(0.6 kbps)。实验结果显示,拟议的模型在推理、转出和驳入能力方面超越了以前的双面模型。该模型要求的语音数据要少得多,因为语音前导器被跳过,明显简化了从任何LLMS建立双面S2S模型的过程。最后,这是第一个公开提供的带有培训和推导码的双面S2S2S模型,以促进再生。
Article 189
Title@2025-07-15 (2): Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility
Title: Jailbreak-Tuning: Models Efficiently Learn Jailbreak Susceptibility | Jailbreak-Tuning: Modelle effizient lernen Jailbreak-Anfälligkeit | 越狱:高效学习越狱模式 2507.11630v1 |
Authors (6): Brendan Murphy, Dillon Bowen, Shahrad Mohammadzadeh, Julius Broomfield, Adam Gleave, Kellin Pelrine
AI systems are rapidly advancing in capability, and frontier model developers broadly acknowledge the need for safeguards against serious misuse. However, this paper demonstrates that fine-tuning, whether via open weights or closed fine-tuning APIs, can produce helpful-only models. In contrast to prior work which is blocked by modern moderation systems or achieved only partial removal of safeguards or degraded output quality, our jailbreak-tuning method teaches models to generate detailed, high-quality responses to arbitrary harmful requests. For example, OpenAI, Google, and Anthropic models will fully comply with requests for CBRN assistance, executing cyberattacks, and other criminal activity. We further show that backdoors can increase not only the stealth but also the severity of attacks, while stronger jailbreak prompts become even more effective in fine-tuning attacks, linking attack and potentially defenses in the input and weight spaces. Not only are these models vulnerable, more recent ones also appear to be becoming even more vulnerable to these attacks, underscoring the urgent need for tamper-resistant safeguards. Until such safeguards are discovered, companies and policymakers should view the release of any fine-tunable model as simultaneously releasing its evil twin: equally capable as the original model, and usable for any malicious purpose within its capabilities.
AI系统在能力上正在迅速发展,前沿模型开发者广泛承认需要防范严重滥用的保障措施。然而,本文表明,微调,无论是通过开放重量还是封闭式微调API系统,都可以产生只有用的模式。 与以前被现代温和系统阻挡或仅部分取消保障或产出质量下降的工作相比,我们的防盗调制方法教导模型,对任意有害要求作出详细、高质量的反应。例如,OpenAI、Google和人类模型将充分满足CBRN援助、实施网络攻击和其他犯罪活动的要求。我们进一步表明,幕后调整不仅可以增加袭击的隐性,还可以增加袭击的严重程度,而更强有力的越狱阻扰在微调攻击、将攻击与投入和重量空间的潜在防御联系起来方面则变得更加有效。不仅这些模型脆弱,更近一些模型似乎更易受到这些攻击的伤害,强调迫切需要防腐蚀的保障措施。在发现这些保障措施之前,公司和决策者应该将任何可细化模型的释放视为同时释放其邪恶的双性能力:同样有能力作为原始模型。
Article 190
Title@2025-07-15 (2): MapIQ: Benchmarking Multimodal Large Language Models for Map Question Answering
Title: MapIQ: Benchmarking Multimodal Large Language Models for Map Question Answering | MapIQ: Benchmarking multimodaler Großsprachenmodelle für Kartenfrageantworten | MapIQ:为地图回答问题确定多式大语言模式基准 2507.11625v1 |
Authors (5): Varun Srivastava, Fan Lei, Srija Mukhopadhyay, Vivek Gupta, Ross Maciejewski
Recent advancements in multimodal large language models (MLLMs) have driven researchers to explore how well these models read data visualizations, e.g., bar charts, scatter plots. More recently, attention has shifted to visual question answering with maps (Map-VQA). However, Map-VQA research has primarily focused on choropleth maps, which cover only a limited range of thematic categories and visual analytical tasks. To address these gaps, we introduce MapIQ, a benchmark dataset comprising 14,706 question-answer pairs across three map types: choropleth maps, cartograms, and proportional symbol maps spanning topics from six distinct themes (e.g., housing, crime). We evaluate multiple MLLMs using six visual analytical tasks, comparing their performance against one another and a human baseline. An additional experiment examining the impact of map design changes (e.g., altered color schemes, modified legend designs, and removal of map elements) provides insights into the robustness and sensitivity of MLLMs, their reliance on internal geographic knowledge, and potential avenues for improving Map-VQA performance.
最近多式大型语言模型(MLLMS)的进展促使研究人员探索这些模型如何很好地阅读数据可视化,例如条形图、散射图。最近,注意力已转向用地图(Map-VQA)回答直观问题。然而,地图-VQA研究主要侧重于花旗图,只涵盖有限的专题类别和视觉分析任务。为弥补这些差距,我们引入了MapIQ,这是一套基准数据集,包括来自三种地图类型的14 706对问答:花旗图、马车图和跨越六个不同主题(如住房、犯罪)的成比例符号图。我们利用六种视觉分析任务来评估多部MLLMS,比较其性能和人类基线。另外一项实验审查了地图设计变化的影响(如改变的颜色计划、修改的图象设计以及地图要素的删除),使人们深入了解MLLMS的坚固性和敏感性、对内部地理知识的依赖,以及改进地图-VQA性能的潜在途径。
Article 191
Title@2025-07-15 (2): LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating
Title: LongDocURL: a Comprehensive Multimodal Long Document Benchmark Integrating Understanding, Reasoning, and Locating | LongDocURL: ein umfassender multimodaler langer Dokumenten-Benchmark, der Verständnis, Vernunft und Lokalisierung integriert | LongDocURL:综合综合理解、说明理由和定位的综合多式长文件基准 2412.18424v3 |
Authors (11): Chao Deng, Jiale Yuan, Pi Bu, Peijie Wang, Zhong-Zhi Li, Jian Xu, Xiao-Hui Li, Yuan Gao, Jun Song, Bo Zheng, Cheng-Lin Liu
Large vision language models (LVLMs) have remarkably improved document understanding capabilities, enabling the handling of complex document elements, longer contexts, and a wider range of tasks. However, existing document understanding benchmarks have been limited to handling only a small number of pages and fail to provide a comprehensive analysis of layout element locating. In this paper, we first define three primary task categories: Long Document Understanding, numerical Reasoning, and cross-element Locating, and then propose a comprehensive benchmark, LongDocURL, integrating the above three primary tasks and comprising 20 sub-tasks categorized based on different primary tasks and answer evidence. Furthermore, we develop a semi-automated construction pipeline and collect 2,325 high-quality question-answering pairs, covering more than 33,000 pages of documents, significantly exceeding existing benchmarks. Subsequently, we conduct comprehensive evaluation experiments on both open-source and closed-source models across 26 different configurations, revealing critical performance gaps in this field.
大型愿景语言模型(LVLMs)显著改善了文件理解能力,使得能够处理复杂的文件内容、较长的背景和更广泛的任务,然而,现有的文件理解基准仅限于处理少量的页数,无法对布局要素的位置进行全面分析。在本文件中,我们首先确定了三大任务类别:长期文件理解、数字解释和交叉定位,然后提出了一个全面基准(LongDocURL),整合了以上三项主要任务,包括20项根据不同主要任务和回答证据分类的次级任务。此外,我们开发了一个半自动施工管道,收集了2 325对高质量的问答对,覆盖了33 000多页的文件,大大超过现有基准。我们随后对26个不同组合的开放源和封闭源模式进行了全面评价实验,揭示了该领域的重要业绩差距。
Article 192
Title@2025-07-15 (2): AirLLM: Diffusion Policy-based Adaptive LoRA for Remote Fine-Tuning of LLM over the Air
Title: AirLLM: Diffusion Policy-based Adaptive LoRA for Remote Fine-Tuning of LLM over the Air | AirLLM: Diffusion Policy-based Adaptive LoRA for Remote Fine-Tuning of LLM over the Air | AirLLM:传播基于政策的适应性LORA,用于远距离微调LLM在空中的LLM 2507.11515v1 |
Authors (6): Shiyi Yang, Xiaoxue Yu, Rongpeng Li, Jianhang Zhu, Zhifeng Zhao, Honggang Zhang
Operating Large Language Models (LLMs) on edge devices is increasingly challenged by limited communication bandwidth and strained computational and memory costs. Thus, cloud-assisted remote fine-tuning becomes indispensable. Nevertheless, existing Low-Rank Adaptation (LoRA) approaches typically employ fixed or heuristic rank configurations, and the subsequent over-the-air transmission of all LoRA parameters could be rather inefficient. To address this limitation, we develop AirLLM, a hierarchical diffusion policy framework for communication-aware LoRA adaptation. Specifically, AirLLM models the rank configuration as a structured action vector that spans all LoRA-inserted projections. To solve the underlying high-dimensional sequential decision-making problem, a Proximal Policy Optimization (PPO) agent generates coarse-grained decisions by jointly observing wireless states and linguistic complexity, which are then refined via Denoising Diffusion Implicit Models (DDIM) to produce high-resolution, task- and channel-adaptive rank vectors. The two modules are optimized alternately, with the DDIM trained under the Classifier-Free Guidance (CFG) paradigm to maintain alignment with PPO rewards. Experiments under varying signal-to-noise ratios demonstrate that AirLLM consistently enhances fine-tuning performance while significantly reducing transmission costs, highlighting the effectiveness of reinforcement-driven, diffusion-refined rank adaptation for scalable and efficient remote fine-tuning over the air.
在边缘设备上运行的大型语言模型(LLM)日益受到通信带宽有限以及计算和记忆成本紧张的挑战。因此,云辅助远程微调变得不可或缺。然而,现有的低兰氏适应(LORA)方法通常采用固定或超高等级配置,随后所有LORA参数的超空传输可能效率相当低。为解决这一限制,我们开发了AirLLM,这是通信觉悟LOR适应的等级传播政策框架。具体地说,AirLLM模型将级别配置作为结构化的行动矢量,涵盖所有LORA插入的预测。要解决根本的高度依次决策问题,最优政策优化(PPOPO)方法通过联合观测无线状态和语言复杂性来产生粗化的决定,然后通过Denoising Difmulation Implation Indiclation模型(DDIM)加以改进,以产生高分辨率、任务和频道适应级级级矢量的矢量矢量矢量矢量。两种模块得到优化,在精度免费指导(CFIC)下对DIM进行培训。要解决高层次顺序顺序顺序顺序决策问题,优化政策优化政策优化的模范则通过不断提高SLLLUILS的信号升级调整,不断提高性压压压压压压压压压压压压压压。
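The transmission cost that motivates per-projection rank selection can be made concrete with a small sketch. It relies only on the standard LoRA parameterization, in which an adapter of rank r on a d_in x d_out projection adds r * (d_in + d_out) trainable parameters. The function names, the 16-bit parameter width, and the example projection shapes are illustrative assumptions, not taken from the AirLLM paper.

```python
def lora_param_count(d_in, d_out, rank):
    """Trainable parameters of one LoRA adapter.

    Standard LoRA adds A (rank x d_in) and B (d_out x rank),
    so the count is rank * (d_in + d_out). Rank 0 means no adapter.
    """
    return rank * (d_in + d_out)


def transmission_bits(projections, ranks, bits_per_param=16):
    """Bits needed to send every adapter once.

    projections: list of (d_in, d_out) shapes, one per LoRA-inserted projection.
    ranks: one rank per projection (the "rank vector").
    bits_per_param: assumed numeric precision (hypothetical default of 16).
    """
    if len(projections) != len(ranks):
        raise ValueError("need exactly one rank per projection")
    params = sum(
        lora_param_count(d_in, d_out, r)
        for (d_in, d_out), r in zip(projections, ranks)
    )
    return params * bits_per_param


# Hypothetical example: four square projections of width 4096.
shapes = [(4096, 4096)] * 4
fixed = transmission_bits(shapes, [16, 16, 16, 16])
adaptive = transmission_bits(shapes, [16, 4, 4, 8])
print(fixed, adaptive)
```

Because the cost is linear in each rank, lowering the rank on projections that matter less for the task reduces the payload proportionally, which is the trade-off a rank-selection policy has to balance against fine-tuning quality.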
Article 193
Title@2025-07-15 (2): Real-World Summarization: When Evaluation Reaches Its Limits
Title: Real-World Summarization: When Evaluation Reaches Its Limits | Real-World-Zusammenfassung: Wenn die Bewertung ihre Grenzen erreicht | 现实世界总结:评价达到极限时 2507.11508v1 |
Authors (3): Patrícia Schmidtová, Ondřej Dušek, Saad Mahamood
We examine evaluation of faithfulness to input data in the context of hotel highlights: brief LLM-generated summaries that capture unique features of accommodations. Through human evaluation campaigns involving categorical error assessment and span-level annotation, we compare traditional metrics, trainable methods, and LLM-as-a-judge approaches. Our findings reveal that simpler metrics like word overlap correlate surprisingly well with human judgments (Spearman rank correlation of 0.63), often outperforming more complex methods when applied to out-of-domain data. We further demonstrate that while LLMs can generate high-quality highlights, they prove unreliable for evaluation as they tend to severely under- or over-annotate. Our analysis of real-world business impacts shows that incorrect and non-checkable information poses the greatest risks. We also highlight challenges in crowdsourced evaluations.
我们研究了在旅馆亮点背景下对投入数据是否忠实的评价:由LLM生成的简短摘要,其中反映了住宿的独特性。通过涉及绝对错误评估和跨层次说明的人类评价运动,我们比较了传统的衡量标准、可培训的方法和LLM-As-判断方法。我们的研究结果显示,像字词重叠这样的简单衡量标准与人类判断(Spearman 相关等级为0.63)非常惊人地相关,在应用外部数据时往往比更复杂的方法要好。我们进一步表明,LLMS可以产生高质量的亮点,但事实证明,由于它们往往严重不足或过度注意,因此不可靠地评价。我们对现实世界商业影响的分析显示,错误和无法核实的信息构成了最大的风险。我们还强调了众源评估方面的挑战。
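The comparison between a word-overlap metric and human judgments can be illustrated with a short sketch of Spearman rank correlation. The token-overlap score below is a deliberately simple stand-in and is an assumption, not the metric used in the paper; the helper names are hypothetical as well.

```python
from statistics import mean


def word_overlap(summary, source):
    """Fraction of summary tokens that also occur in the source (toy metric)."""
    tokens = summary.lower().split()
    source_vocab = set(source.lower().split())
    if not tokens:
        return 0.0
    return sum(t in source_vocab for t in tokens) / len(tokens)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

In use, one would score each highlight with `word_overlap`, collect the matching human faithfulness ratings, and pass both lists to `spearman`; a value near 1 means the metric orders the highlights much as the annotators do.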
Article 194
Title@2025-07-15 (2): A Mathematical Theory of Discursive Networks
Title: A Mathematical Theory of Discursive Networks | Eine mathematische Theorie diskursiver Netzwerke | 讨论网络的数学理论 2507.06565v3 |
Authors (1): Juan B. Gutiérrez
Large-language models (LLMs) turn writing into a live exchange between humans and software. We characterize this new medium as a discursive network that treats people and LLMs as equal nodes and tracks how their statements circulate. We define the generation of erroneous information as invalidation (any factual, logical, or structural breach) and show it follows four hazards: drift from truth, self-repair, fresh fabrication, and external detection. We develop a general mathematical model of discursive networks that shows that a network governed only by drift and self-repair stabilizes at a modest error rate. Giving each false claim even a small chance of peer review shifts the system to a truth-dominant state. We operationalize peer review with the open-source Flaws-of-Others (FOO) algorithm: a configurable loop in which any set of agents critique one another while a harmonizer merges their verdicts. We identify an ethical transgression, epithesis, that occurs when humans fail to engage in the discursive network. The takeaway is practical and cultural: reliability in this new medium comes not from perfecting single models but from connecting imperfect ones into networks that enforce mutual accountability.
大型语言模型( LLMS) 将写作变为人与软件之间的实时交换。 我们把这个新介质描述为将人和LLMS视为平等节点的不准确网络, 并跟踪其声明的传播方式。 我们将错误信息的生成定义为无效信息( 任何事实、 逻辑或结构性违反) , 并显示其有四种危害: 从真理、 自我修复、 新鲜制造和外部检测中漂移。 我们开发了一个迷惑网络的一般数学模型, 该模型显示, 仅受漂移和自我修复控制的网络会以微小的错误速度稳定下来。 给每个错误的网络一个很小的同侪审查机会, 将系统改变为以真理为主的状态。 我们用开放源的 \ emph{ flaws- others( FOOO) 算法( ) 算法进行同侪审查: 一个可调和的循环中, 任何一组代理人都会相互批评, 而调和其判断。 我们发现, 当人类无法参与不精确的网络时, 就会出现道德上的错误。 。 取是实际和文化的, 将新媒体的可靠性从不完善的网络从一个不完善的网络连接到不完善。
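The qualitative claim that drift and self-repair settle at a steady error rate, and that added peer review pushes that rate down, can be illustrated with a toy two-state Markov chain. This is an assumed simplification for intuition only, not the model from the paper: each statement is either valid or invalid, a valid one becomes invalid with probability `drift`, and an invalid one is corrected by self-repair or external detection, treated as independent events. All names and numbers are hypothetical.

```python
def correction_probability(repair, detect=0.0):
    """Chance an invalid statement is fixed in one step.

    Assumes self-repair and external detection act independently.
    """
    return 1.0 - (1.0 - repair) * (1.0 - detect)


def stationary_error_rate(drift, repair, detect=0.0):
    """Long-run fraction of invalid statements in the toy chain.

    Solves e = e * (1 - c) + (1 - e) * drift for e, giving drift / (drift + c).
    """
    c = correction_probability(repair, detect)
    if drift + c == 0.0:
        raise ValueError("chain never moves: drift and correction are both zero")
    return drift / (drift + c)


def simulate(drift, repair, detect=0.0, steps=200, start=0.0):
    """Iterate the expected error fraction to check the closed form."""
    c = correction_probability(repair, detect)
    e = start
    for _ in range(steps):
        e = e * (1.0 - c) + (1.0 - e) * drift
    return e


# Hypothetical rates: the same network with and without peer review.
print(stationary_error_rate(0.05, 0.15))
print(stationary_error_rate(0.05, 0.15, detect=0.5))
```

In this toy setting the steady-state error depends only on the ratio of drift to total correction, so any extra detection probability lowers it; how strongly depends on the rates chosen.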
Article 195
Title@2025-07-15 (2): Conversation Forests: The Key to Fine Tuning Large Language Models for Multi-Turn Medical Conversations is Branching
Title: Conversation Forests: The Key to Fine Tuning Large Language Models for Multi-Turn Medical Conversations is Branching | Conversation Forests: Der Schlüssel zur Feinabstimmung großer Sprachmodelle für multi-Turn medizinische Gespräche ist die Verzweigung | 对话森林:对多发医学对话的大型语言模型进行精微投资的关键是分流 2507.04099v2 |
Authors (1): Thomas Savage
Fine-tuning methods such as Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) have demonstrated success in training large language models (LLMs) for single-turn tasks. However, these methods fall short in multi-turn applications, such as diagnostic patient interviewing, where understanding how early conversational turns influence downstream completions and outcomes is essential. In medicine, a multi-turn perspective is critical for learning diagnostic schemas and better understanding conversation dynamics. To address this gap, I introduce Savage Conversation Forests (SCF), a reinforcement learning framework that leverages a branched conversation architecture to fine-tune LLMs for multi-turn dialogue. SCF generates multiple possible conversation continuations at each turn, enabling the model to learn how different early responses affect downstream interactions and diagnostic outcomes. In experiments simulating doctor-patient conversations, SCF with branching outperforms linear conversation architectures on diagnostic accuracy. I hypothesize that SCF’s improvements stem from its ability to provide richer, interdependent training signals across conversation turns. These results suggest that a branched training architecture is an important strategy for fine-tuning LLMs in complex multi-turn conversational tasks.
直接偏好优化(DPO)和集体政策优化(GROPO)等微调方法在培训大型语言模型(LLMS)进行单向任务方面证明是成功的,然而,这些方法在多转应用方面却不尽如人意,例如诊断性病人访谈,了解早期对话转动如何影响下游的完成和结果至关重要。在医学方面,多转角度对于学习诊断性模型和更好地了解对话动态至关重要。为了解决这一差距,我引入了Savage Conversation Form(SCF),这是一个强化学习框架,利用一个分支对话结构来微调LMS进行多方向对话的微调。SCF在每个转弯曲中生成了多种可能的继续对话,使模型能够了解不同的早期反应如何影响下游互动和诊断结果。在模拟医生-病人谈话的实验中,SCFF将诊断性谈话结构的分支化优于诊断性线性对话结构。我假设SCFF的改进源于它能够提供更丰富、互相依存的培训信号。这些结果表明,分支培训架构是复杂多方向对话任务中微调LM的重要战略。
Article 196
Title@2025-07-15 (2): ProtocolLLM: RTL Benchmark for SystemVerilog Generation of Communication Protocols
Title: ProtocolLLM: RTL Benchmark for SystemVerilog Generation of Communication Protocols | ProtocolLLM: RTL Benchmark für SystemVerilog Generierung von Kommunikationsprotokollen | 协议LLLM: 系统生成通信协议系统生成的RTL基准 2506.07945v2 |
Authors (3): Arnav Sheth, Ivaxi Sheth, Mario Fritz
Recent advances in large language models (LLMs) have demonstrated strong performance in generating code for general-purpose programming languages. However, their potential for hardware description languages (HDLs), such as SystemVerilog, remains largely unexplored. HDL code generation poses unique challenges due to strict timing semantics, concurrency, and synthesizability constraints essential for correct hardware functionality. Further, HDL-based design flows encompass a broad set of tasks beyond structural code generation, including testbench development, assertion-based verification, timing closure, and protocol-level integration for on-chip communication. In this work, we evaluate the capabilities of both open-source and state-of-the-art LLMs in generating synthesizable and functionally accurate SystemVerilog implementations of widely used communication protocols that are critical components of embedded and System-on-Chip (SoC) systems. We introduce ProtocolLLM, the first benchmark suite specifically targeting these protocols with tasks spanning multiple design abstraction levels and varying prompt specificity. Our evaluation method also focuses on timing correctness in addition to synthesizability and syntactic correctness. We observe that most of the models fail to generate SystemVerilog code for communication protocols that follows timing constraints.
大型语言模型(LLMS)的近期进展表明,在生成通用编程语言代码方面有很强的成绩,然而,它们对于诸如SystemVerilog等硬件描述语言(HDLs)的潜力基本上尚未开发。HDL代码的生成由于严格的时间定时语、货币和对正确硬件功能至关重要的可合成性制约而带来了独特的挑战。此外,基于HDL的设计流程包含一系列超越结构代码生成范围的广泛任务,包括测试开发、基于主张的核实、时间关闭和芯片通信协议级整合。我们在此工作中,我们评估开放源和最新LLMs在生成可合并和功能精确的系统协议方面的能力。广泛使用的通信协议是嵌入式和系统对芯片(SOC)系统的关键组成部分。我们引入了MonLLM,这是专门针对这些协议的首个基准套套,其任务涵盖多个设计抽象级别和迅速性。我们的评估方法还侧重于同步性和最先进的LLMs的能力。我们观察了同步性协议的及时性,从而产生最精确性规则失败。
Article 197
Title@2025-07-15 (2): A Generative Approach to LLM Harmfulness Detection with Special Red Flag Tokens
Title: A Generative Approach to LLM Harmfulness Detection with Special Red Flag Tokens | Eine generative Annäherung an LLM Harmfulness Detection mit speziellen roten Flaggen-Tokens | 利用特别红旗拳生成LLM 无害性探测法 2502.16366v3 |
Authors (5): Sophie Xhonneux, David Dobre, Mehrnaz Mofakhami, Leo Schwinn, Gauthier Gidel
Most safety training methods for large language models (LLMs) are based on fine-tuning that forces models to shift from an unsafe answer to refusal when faced with harmful requests. Unfortunately, these drastic distribution shifts generally compromise model capabilities. To avoid that, we propose to expand the model’s vocabulary with a special token we call red flag token (
大型语言模型(LLMs)的安全培训方法大多基于微调,迫使模型在面临有害要求时从不安全的回答转向拒绝。不幸的是,这些急剧的分布使模型能力普遍发生妥协。为了避免这种情况,我们提议扩大模型的词汇,用一个特殊标志,我们称之为红旗标记(
Article 198
Title@2025-07-15 (2): Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models
Title: Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models | Halluzinationsstationen: Auf einigen grundlegenden Einschränkungen von Transformer-basierten Sprachmodellen | 幻觉站:关于以变换语言模式的一些基本限制 2507.07505v3 |
Authors (2): Varin Sikka, Vishal Sikka
In this paper we explore hallucinations and related capability limitations in LLMs and LLM-based agents from the perspective of computational complexity. We show that beyond a certain complexity, LLMs are incapable of carrying out computational and agentic tasks or verifying their accuracy.
在本文中,我们从计算复杂性的角度探讨LLMs和LLM代理商的幻觉和相关能力限制。 我们表明,除了某种复杂性之外,LLMs无法完成计算和代理任务或核实其准确性。
Article 199
Title@2025-07-15 (2): Seq vs Seq: An Open Suite of Paired Encoders and Decoders
Title: Seq vs Seq: An Open Suite of Paired Encoders and Decoders | Seq vs Seq: Eine offene Suite aus koppelten Encodern und Decodern | Seq vs Seq:一个开放的套件,其中含有子元编码器和代碼器。 2507.11412v1 |
Authors (6): Orion Weller, Kathryn Ricci, Marc Marone, Antoine Chaffin, Dawn Lawrie, Benjamin Van Durme
The large language model (LLM) community focuses almost exclusively on decoder-only language models, since they are easier to use for text generation. However, a large subset of the community still uses encoder-only models for tasks such as classification or retrieval. Previous work has attempted to compare these architectures, but is forced to make comparisons with models that have different numbers of parameters, training techniques, and datasets. We introduce the SOTA open-data Ettin suite of models: paired encoder-only and decoder-only models ranging from 17 million parameters to 1 billion, trained on up to 2 trillion tokens. Using the same recipe for both encoder-only and decoder-only models produces SOTA recipes in both categories for their respective sizes, beating ModernBERT as an encoder and Llama 3.2 and SmolLM2 as decoders. Like previous work, we find that encoder-only models excel at classification and retrieval tasks while decoders excel at generative tasks. However, we show that adapting a decoder model to encoder tasks (and vice versa) through continued training is subpar compared to using only the reverse objective (i.e. a 400M encoder outperforms a 1B decoder on MNLI, and vice versa for generative tasks). We open-source all artifacts of this study including training data, training order segmented by checkpoint, and 200+ checkpoints to allow future work to analyze or extend all aspects of training.
大型语言模型( LLM) 群落几乎完全专注于只读解码的语言模型, 因为它们更容易用于文本生成。 但是, 社区中的一大部分仍然使用只读或检索等任务的编码器模式。 先前的工作试图比较这些结构, 但是被迫与具有不同参数、 培训技术和数据集的模型进行比较。 我们引入 SOTA 开放数据 Ettin 套套装模式: 配对只读和只读解码的400个模型, 范围从1700万参数到10亿参数, 培训到2万个符号。 但是, 我们显示, 使用同一的解码模型, 只使用只读和只读解码的模型, 使用相同的解码的模型, 产生两个类别中的SOTA配方大小, 击败现代BERT作为解码, 以及Llama 3. 2 和 SmolLM2 作为解码器的模型。 我们和以前的工作一样, 我们发现, 仅使用解码的模型模型在分类和回收任务中, 所有的解码任务中, 我们显示, 只能将解码模型模型模型模型改成解码训练任务( ) 包括一个目标的MLILU) 。
Article 200
Title@2025-07-15 (2): KisMATH: Do LLMs Have Knowledge of Implicit Structures in Mathematical Reasoning?
Title: KisMATH: Do LLMs Have Knowledge of Implicit Structures in Mathematical Reasoning? | KisMATH: Haben LLMs Kenntnis von Impliziten Strukturen in mathematischer Vernunft? | KISMATH:LLMs女士是否了解数学原因中的隐含结构? 2507.11408v1 |
Authors (5): Soumadeep Saha, Akshay Chaturvedi, Saptarshi Saha, Utpal Garain, Nicholas Asher
Chain-of-thought traces have been shown to improve performance of large language models in a plethora of reasoning tasks, yet there is no consensus on the mechanism through which this performance boost is achieved. To shed more light on this, we introduce Causal CoT Graphs (CCGs), which are directed acyclic graphs automatically extracted from reasoning traces that model fine-grained causal dependencies in the language model output. A collection of 1671 mathematical reasoning problems from MATH500, GSM8K and AIME, and their associated CCGs are compiled into our dataset, KisMATH. Our detailed empirical analysis with 15 open-weight LLMs shows that (i) reasoning nodes in the CCG are mediators for the final answer, a condition necessary for reasoning; and (ii) LLMs emphasise reasoning paths given by the CCG, indicating that models internally realise structures akin to our graphs. KisMATH enables controlled, graph-aligned interventions and opens up avenues for further investigation into the role of chain-of-thought in LLM reasoning.
在众多推理任务中,已经展示了改进大型语言模型的功能的链条痕迹,但对于如何实现这一性能的促进机制没有达成共识。为了更清楚地了解这一点,我们引入了Causal CoT Graphs(CCCGs),这些图解是自动从推理痕迹中直接提取的单行图,这些图解是语言模型产出中模型细化因果依赖性的典型因果依赖性。从MATH500、GSM8K和AIME收集了1671美元的数学推理问题,以及它们相关的CCGs被汇编到我们的数据集中 – – \ textbf{KisMATH}。我们对15个开放性LMs的详细经验分析显示:(一) CCCG的推理节点是最后答案的调解人,这是推理的一个必要条件;以及(二) LLMS强调CCG提供的推理路径,表明模型内部理解与我们的图表类似的结构。KisMATh可以进行控制、图形校正的干预,并为进一步调查LM推理中链论的作用开辟了渠道。
Article 201
Title@2025-07-15 (2): EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes
Title: EXAONE 4.0: Unified Large Language Models Integrating Non-reasoning and Reasoning Modes | EXAONE 4.0: Unified Large Language Models Integrieren von nicht-vernünftigen und vernünftigen Moden | EXONE4.0:纳入非理由和理由解释模式的统一大语言模式 2507.11407v1 |
Authors (42): LG AI Research, :, Kyunghoon Bae, Eunbi Choi, Kibong Choi, Stanley Jungkyu Choi, Yemuk Choi, Kyubeen Han, Seokhee Hong, Junwon Hwang, Taewan Hwang, Joonwon Jang, Hyojin Jeon, Kijeong Jeon, Gerrard Jeongwon Jo, Hyunjik Jo, Jiyeon Jung, Euisoon Kim, Hyosang Kim, Jihoon Kim, Joonkee Kim, Seonghwan Kim, Soyeon Kim, Sunkyoung Kim, Yireun Kim, Yongil Kim, Youchul Kim, Edward Hwayoung Lee, Gwangho Lee, Haeju Lee, Honglak Lee, Jinsik Lee, Kyungmin Lee, Sangha Park, Young Min Paik, Yongmin Park, Youngyong Park, Sanghyun Seo, Sihoon Yang, Heuiyeen Yeen, Sihyuk Yi, Hyeongu Yun
This technical report introduces EXAONE 4.0, which integrates a Non-reasoning mode and a Reasoning mode to achieve both the excellent usability of EXAONE 3.5 and the advanced reasoning abilities of EXAONE Deep. To pave the way for the agentic AI era, EXAONE 4.0 incorporates essential features such as agentic tool use, and its multilingual capabilities are extended to support Spanish in addition to English and Korean. The EXAONE 4.0 model series consists of two sizes: a mid-size 32B model optimized for high performance, and a small-size 1.2B model designed for on-device applications. EXAONE 4.0 demonstrates superior performance compared to open-weight models in its class and remains competitive even against frontier-class models. The models are publicly available for research purposes and can be easily downloaded via https://huggingface.co/LGAI-EXAONE.
本技术报告介绍EXAONE4.0,其中结合了一种非理性模式和一种理性模式,以实现EXAONE 3.5的极佳使用性和EXAONE Deep的先进推理能力。为了为ATI时代铺路,EXAONE 4.0除了英文和韩文外,还包含了代理工具使用等基本特征,其多语种能力也扩大到支持西班牙文。EXAONE 4.0模型系列由两个大小组成:一个为高性能而优化的中型32B型模型,以及一个为设计在设备上的应用而设计的小型1.2B型模型。EXAONE 4.0展示了优于其类别中开放重量模型的优异性,甚至与前沿级模型相比,这些模型仍然具有竞争力。这些模型可供公众使用,用于研究目的,并可通过https://huggingface.co/LGAI-EXAONE轻易下载。
Article 202
Title@2025-07-15 (2): DCR: Quantifying Data Contamination in LLMs Evaluation
Title: DCR: Quantifying Data Contamination in LLMs Evaluation | DCR: Quantifizierung von Datenkontamination in LLMs Evaluation | DCR: 在LLMS评价中量化数据污染 2507.11405v1 |
Authors (7): Cheng Xu, Nan Yan, Shuhao Guan, Changhong Jin, Yuke Mei, Yibing Guo, M-Tahar Kechadi
The rapid advancement of large language models (LLMs) has heightened concerns about benchmark data contamination (BDC), where models inadvertently memorize evaluation data, inflating performance metrics and undermining genuine generalization assessment. This paper introduces the Data Contamination Risk (DCR) framework, a lightweight, interpretable pipeline designed to detect and quantify BDC across four granular levels: semantic, informational, data, and label. By synthesizing contamination scores via a fuzzy inference system, DCR produces a unified DCR Factor that adjusts raw accuracy to reflect contamination-aware performance. Validated on 9 LLMs (0.5B-72B) across sentiment analysis, fake news detection, and arithmetic reasoning tasks, the DCR framework reliably diagnoses contamination severity, and accuracy adjusted using the DCR Factor falls within 4% average error of the uncontaminated baseline across the three benchmarks. Emphasizing computational efficiency and transparency, DCR provides a practical tool for integrating contamination assessment into routine evaluations, fostering fairer comparisons and enhancing the credibility of LLM benchmarking practices.
大型语言模型(LLMS)的快速进步使人们更加关注基准数据污染(BDC),在基准数据污染(BDC)方面,模型无意间将评价数据记忆起来,放大性能指标,破坏真正的一般评估;本文件介绍数据污染风险(DCR)框架,这是一个轻量的、可解释的管道,旨在检测和量化BDC,覆盖四个颗粒层:语义、信息、数据和标签;通过模糊推断系统综合污染分数,DCR产生一个统一的DCR因子,调整原始精度,以反映污染认知性表现;经对9LMS(0.5B-72B)进行验证,将情绪分析、假新闻检测和算术推理任务加以验证,DCR框架可靠地分析了污染的严重程度,并用DCR因子对三个基准与未污染基线之间的平均误差幅度调整到4%以下;强调计算效率和透明度,DCR为将污染评估纳入常规评估、促进更公平的比较和提高LM基准做法的可信度提供了一个实用工具。
Article 203
Title@2025-07-15 (2): Gaussian mixture models as a proxy for interacting language models
Title: Gaussian mixture models as a proxy for interacting language models | Gaußsche Mischungsmodelle als Proxy für interagierende Sprachmodelle | Gaussian 混合模型作为交互语言模型的替代 2506.00077v3 |
Authors (6): Edward L. Wang, Tianyu Wang, Hayden Helm, Avanti Athreya, Vince Lyzinski, Carey E. Priebe
Large language models (LLMs) are a powerful tool with the ability to match human capabilities and behavior in many settings. Retrieval-augmented generation (RAG) further allows LLMs to generate diverse output depending on the contents of their RAG database. This motivates their use in the social sciences to study human behavior between individuals when large-scale experiments are infeasible. However, LLMs depend on complex, computationally expensive algorithms. In this paper, we introduce interacting Gaussian mixture models (GMMs) as an alternative to similar frameworks using LLMs. We compare a simplified model of GMMs to select experimental simulations of LLMs whose updating and response depend on feedback from other LLMs. We find that interacting GMMs capture important features of the dynamics in interacting LLMs, and we investigate key similarities and differences between interacting LLMs and GMMs. We conclude by discussing the benefits of Gaussian mixture models, potential modifications, and future research directions.
大型语言模型(LLMS)是一个强大的工具,能够在许多环境中与人的能力和行为相匹配。检索强化的生成(RAG)进一步允许LMS产生取决于其RAG数据库内容的不同产出。这促使他们在社会科学中使用这些模型,在大规模实验不可行时研究个人之间的人类行为。然而,LLMS依赖于复杂、计算成本昂贵的算法。在本文件中,我们采用互动高斯混合模型(GMMS)作为使用LMS的类似框架的替代。我们比较了简化的GMMS模型,以选择LMS实验模拟,这些模型的更新和反应取决于其他LMS的反馈。我们发现,互动的GMMS在互动的LMS中捕捉了动态的重要特征,我们研究了相互作用的LMS和GMs之间的关键相似性和差异。我们通过讨论高斯混合模型的好处、潜在修改和未来研究方向来结束。
Article 204
Title@2025-07-15 (2): Following the Clues: Experiments on Person Re-ID using Cross-Modal Intelligence
Title: Following the Clues: Experiments on Person Re-ID using Cross-Modal Intelligence | Im Anschluss an die Klues: Experimente zur Person Re-ID mit Cross-Modal Intelligence | 在Clues之后:利用跨模式情报对个人重新识别进行实验 2507.01504v3 |
Authors (6): Robert Aufschläger, Youssef Shoeb, Azarm Nowzad, Michael Heigl, Fabian Bally, Martin Schramm
The collection and release of street-level recordings as Open Data play a vital role in advancing autonomous driving systems and AI research. However, these datasets pose significant privacy risks, particularly for pedestrians, due to the presence of Personally Identifiable Information (PII) that extends beyond biometric traits such as faces. In this paper, we present cRID, a novel cross-modal framework combining Large Vision-Language Models, Graph Attention Networks, and representation learning to detect textually describable clues of PII and enhance person re-identification (Re-ID). Our approach focuses on identifying and leveraging interpretable features, enabling the detection of semantically meaningful PII beyond low-level appearance cues. We conduct a systematic evaluation of PII presence in person image datasets. Our experiments show improved performance in practical cross-dataset Re-ID scenarios, notably from Market-1501 to CUHK03-np (detected), highlighting the framework’s practical utility. Code is available at https://github.com/RAufschlaeger/cRID.
在公开数据中,街头记录收集和发布作为公开数据在推进自主驾驶系统和AI研究方面发挥着至关重要的作用。然而,由于存在超越诸如脸部等生物鉴别特征以外的个人识别信息(PII),这些数据集对隐私构成重大风险,特别是对行人而言。在本文件中,我们介绍了CRID,这是一个全新的跨模式框架,将大型视觉语言模型、图示关注网络和代表学习结合起来,以发现PII的文字破解线索,并加强人的再识别(Re-ID)。我们的方法侧重于识别和利用可解释的特征,从而能够探测出在低层外观提示之外具有内在意义的PII。我们对个人图像数据集中存在的身份进行系统评估。我们的实验显示,在实际交叉数据集再识别情景方面,特别是从市场1501到CUHK03-np(识别)的绩效有所改进,突出了框架的实际效用。代码见https://github.com/RAufschlaeger/cRID。
Article 205
Title@2025-07-15 (2): Addressing Data Imbalance in Transformer-Based Multi-Label Emotion Detection with Weighted Loss
Title: Addressing Data Imbalance in Transformer-Based Multi-Label Emotion Detection with Weighted Loss | Adressierung von Daten Ungleichgewicht in Transformer-basierte Multi-Label Emotion Erkennung mit Gewichteten Verlusten | 解决基于变换器的多标签情感与加权损失检测中的数据不平衡问题 2507.11384v1 |
Authors (1): Xia Cui
This paper explores the application of a simple weighted loss function to Transformer-based models for multi-label emotion detection in SemEval-2025 Shared Task 11. Our approach addresses data imbalance by dynamically adjusting class weights, thereby enhancing performance on minority emotion classes without the computational burden of traditional resampling methods. We evaluate BERT, RoBERTa, and BART on the BRIGHTER dataset, using evaluation metrics such as Micro F1, Macro F1, ROC-AUC, Accuracy, and Jaccard similarity coefficients. The results demonstrate that the weighted loss function improves performance on high-frequency emotion classes but shows limited impact on minority classes. These findings underscore both the effectiveness and the challenges of applying this approach to imbalanced multi-label emotion detection.
本文探讨了在SemEval-2025 共同任务11中将简单的加权损失函数应用于基于变压器的多标签情感检测模型。 我们的方法通过动态调整等级重量来解决数据不平衡问题,从而在没有传统再采样方法的计算负担的情况下提高少数群体情感类的性能。 我们用Micro F1、Mroc F1、ROC-AUC、Accurity 和 Jaccard 类似系数等评估指标,在Braighter数据集中评估BERT、Robreta和BART。 结果表明,加权损失功能提高了高频情感类的性能,但显示了对少数群体类的有限影响。 这些结论强调了将这一方法应用于不平衡的多标签情感检测的有效性和挑战。
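A weighted loss for imbalanced multi-label classification can be sketched in a few lines. The weighting rule below (negatives divided by positives per class, applied to the positive term of binary cross-entropy) is a common convention and is assumed here for illustration; the abstract does not state the exact scheme, and the function names are hypothetical.

```python
import math


def positive_class_weights(labels):
    """Per-class weight = negatives / positives, from multi-hot label rows.

    Rare classes get weights above 1. A class with no positives falls
    back to 1.0 to avoid division by zero (an assumed convention).
    """
    n = len(labels)
    num_classes = len(labels[0])
    weights = []
    for c in range(num_classes):
        positives = sum(row[c] for row in labels)
        weights.append((n - positives) / positives if positives else 1.0)
    return weights


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_bce(logits, targets, pos_weights):
    """Mean weighted binary cross-entropy over one example's labels.

    Uses -log(sigmoid(z)) = softplus(-z) and -log(1 - sigmoid(z)) = softplus(z);
    only the positive term is scaled by the class weight.
    """
    total = 0.0
    for z, y, w in zip(logits, targets, pos_weights):
        total += w * y * softplus(-z) + (1 - y) * softplus(z)
    return total / len(logits)


# Hypothetical toy batch: class 0 is frequent, class 1 is rare.
train_labels = [[1, 0], [1, 0], [1, 1], [0, 0]]
weights = positive_class_weights(train_labels)
print(weights, weighted_bce([0.3, -1.2], [1, 1], weights))
```

Recomputing the weights per batch or per epoch would make them "dynamic" in the loose sense; whether that matches the paper's procedure is not stated in the abstract.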
Article 206
Title@2025-07-15 (2): What is the Best Process Model Representation? A Comparative Analysis for Process Modeling with Large Language Models
Title: What is the Best Process Model Representation? A Comparative Analysis for Process Modeling with Large Language Models | Was ist die beste Prozessmodelldarstellung? Eine vergleichende Analyse zur Prozessmodellierung mit großen Sprachmodellen | ” 最佳程序示范代表 “ 是什么? “ 大语言模式进程模拟比较分析 “ 2507.11356v1 |
Authors (3): Alexis Brissard, Frédéric Cuppens, Amal Zouaq
Large Language Models (LLMs) are increasingly applied for Process Modeling (PMo) tasks such as Process Model Generation (PMG). To support these tasks, researchers have introduced a variety of Process Model Representations (PMRs) that serve as model abstractions or generation targets. However, these PMRs differ widely in structure, complexity, and usability, and have never been systematically compared. Moreover, recent PMG approaches rely on distinct evaluation strategies and generation techniques, making comparison difficult. This paper presents the first empirical study that evaluates multiple PMRs in the context of PMo with LLMs. We introduce the PMo Dataset, a new dataset containing 55 process descriptions paired with models in nine different PMRs. We evaluate PMRs along two dimensions: suitability for LLM-based PMo and performance on PMG. Mermaid achieves the highest overall score across six PMo criteria, whereas BPMN text delivers the best PMG results in terms of process element similarity.
大型语言模型(LLMS)越来越多地用于流程建模任务,如流程建模(PMG)等流程建模任务。为了支持这些任务,研究人员引入了各种流程模型代表模型(PMR),作为模型抽象或生成目标,然而,这些流程模型在结构、复杂性和可用性方面差异很大,而且从未系统地加以比较。此外,近期的流程模型方法依赖于不同的评估战略和生成技术,因此难以进行比较。本文件介绍了第一次经验研究,对多个流程模型(PMO)和LMS进行对比。我们引入了包含55个流程描述的新数据集,该数据集与9个不同流程模型相配。我们从两个方面评估了流程模型:对基于LMMPMO的适合性和PMG的绩效。 \textit{Mermaid}在六种流程标准中取得了最高的总体分数,而\ textit{BMN文字}则在流程要素相似方面提供了最佳的PMG结果。
Article 207
Title@2025-07-15 (2): Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations
Title: Truthful or Fabricated? Using Causal Attribution to Mitigate Reward Hacking in Explanations | Wahrhaftig oder fabriziert? Mit Kausal Attribution zu Mitigate Belohnung Hacken in Erklärungen | 真实的还是伪造的? 利用从原因上归结为 贬低奖得奖者在解释中被打包 2504.05294v2 |
Authors (3): Pedro Ferreira, Wilker Aziz, Ivan Titov
Chain-of-thought explanations are widely used to inspect the decision process of large language models (LLMs) and to evaluate the trustworthiness of model outputs, making them important for effective collaboration between LLMs and humans. We demonstrate that preference optimization - a key step in the alignment phase - can inadvertently reduce the faithfulness of these explanations. This occurs because the reward model (RM), which guides alignment, is tasked with optimizing both the expected quality of the response and the appropriateness of the explanations (e.g., minimizing bias or adhering to safety standards), creating potential conflicts. The RM lacks a mechanism to assess the consistency between the model’s internal decision process and the generated explanation. Consequently, the LLM may engage in “reward hacking” by producing a final response that scores highly while giving an explanation tailored to maximize reward rather than accurately reflecting its reasoning. To address this issue, we propose enriching the RM’s input with a causal attribution of the prediction, allowing the RM to detect discrepancies between the generated self-explanation and the model’s decision process. In controlled settings, we show that this approach reduces the tendency of the LLM to generate misleading explanations.
思考链解释被广泛用于检查大型语言模型(LLMs)的决策过程,并评价模型产出的可信度,使其对于LLMs和人类之间的有效合作十分重要。我们证明,偏好优化是调整阶段的一个关键步骤,可能无意中降低这些解释的忠实性。这是因为指导调整的奖励模式(RM)的任务是优化反应的预期质量和解释的恰当性(例如尽量减少偏见或遵守安全标准),造成潜在的冲突。RM缺乏评估模型内部决策过程与生成的解释之间一致性的机制。因此,LLMM可能参与“奖励黑客”的最后回应,该最后回应得分很高,同时给出符合最大限度奖励而不是准确反映其推理的解释。为解决这一问题,我们建议将RM的投入与预测的因果归属相结合,使RM能够发现生成的自我验证与模型的决策过程之间的差异。在受控制的环境下,我们表明,这种做法会降低LMM产生误导性解释的趋势。
Article 208
Title@2025-07-15 (2): Internal Value Alignment in Large Language Models through Controlled Value Vector Activation
Title: Internal Value Alignment in Large Language Models through Controlled Value Vector Activation | Interne Wertausrichtung in großen Sprachmodellen durch kontrollierte Wert-Vektor-Aktivierung | 通过控制值矢量激活,大语言模型的内部价值对齐 2507.11316v1 |
Authors (7): Haoran Jin, Meng Li, Xiting Wang, Zhihao Xu, Minlie Huang, Yantao Jia, Defu Lian
Aligning Large Language Models (LLMs) with human values has attracted increasing attention since it provides clarity, transparency, and the ability to adapt to evolving scenarios. In this paper, we introduce a Controlled Value Vector Activation (ConVA) method that directly aligns the internal values of LLMs by interpreting how a value is encoded in their latent representations and modifying the relevant activations to ensure consistent values in LLMs. To ensure an accurate and unbiased interpretation, we propose a context-controlled value vector identification method. To consistently control values without sacrificing model performance, we introduce a gated value vector activation method for effective and minimum degree of value control. Experiments show that our method achieves the highest control success rate across 10 basic values without hurting LLM performance and fluency, and ensures target values even with opposite and potentially malicious input prompts. Source code and data are available at https://github.com/hr-jin/ConVA.
将大语言模型(LLMs)与人类价值相匹配,这已引起越来越多的关注,因为它提供了清晰度、透明度和适应不断变化的情景的能力。在本文件中,我们采用了一种控制值矢量活化(ConVA)方法,直接将LLMs的内部价值相匹配,方法是解释一种价值如何在其潜在表达形式中编码,并修改相关激活,以确保LLMs中的一致价值。为了确保准确和不偏不倚的解释,我们提议了一种环境控制值矢量识别方法。为了在不牺牲模型性能的情况下持续控制值,我们采用了一种控制值矢量活化方法,以有效和最低程度的值控制。实验表明,我们的方法在不伤害LLM的性能和流利的情况下,在10个基本值上实现了最高控制率,并确保了目标值,即使其输入速度相反和可能恶意。资料来源代码和数据见https://github.com/hr-jin/ConVA。
Article 209
Title@2025-07-15 (2): ETT: Expanding the Long Context Understanding Capability of LLMs at Test-Time
Title: ETT: Expanding the Long Context Understanding Capability of LLMs at Test-Time | ETT: Erweiterung des Langzeitkontexts Verständnisfähigkeit von LLMs bei Test-Time | ETT:扩大LLMs在试验时的长距离理解能力 2507.06313v2 |
Authors (4): Kiarash Zahirnia, Zahra Golpayegani, Walid Ahmed, Yang Liu
Transformer-based Language Models’ computation and memory overhead increase quadratically as a function of sequence length. The quadratic cost poses challenges when employing LLMs for processing long sequences. In this work, we introduce ETT (Extend at Test-Time), a method for extending the context length of short-context Transformer-based LLMs with constant memory requirements and linear computation overhead. ETT enables the extension of the context length at test time by efficiently fine-tuning the model’s parameters on the input context, chunked into small overlapping subsequences. We evaluate ETT on LongBench by extending the context length of GPT-Large and Phi-2 up to 32 times, increasing from 1k to 32k tokens. This results in up to a 30 percent improvement in the model’s accuracy. We also study how context can be stored in an LLM’s weights effectively and efficiently. Through a detailed ablation study, we examine which Transformer modules are most beneficial to fine-tune at test time. Interestingly, we find that fine-tuning the second layer of the FFNs is more effective than full fine-tuning, leading to a further improvement in the models’ accuracy.
以变换器为基础的语言模型的计算和内存管理量作为序列长度的函数, 二次成本在使用LLMS处理长序列时构成挑战。 在这项工作中, 我们引入了 ETT(Extend at Test-Time), 延长基于短背景的变换器LMS的上下文长度的方法, 并不断要求内存, 以及线性计算间接费用 。 ETT 通过高效地微调模型输入环境参数, 使测试时的上下文长度延长, 并被挤成重叠的小后继序列 。 我们通过将 GPT- Large 和 Phi-2 的上下文长度延长32 次来评估长贝恩奇的 ETT, 从而将GPT- Large 和 Phi-2 的上下文长度从1k 增加到32 个符号, 从而将模型的精度提高到30% 。 我们还研究如何有效和高效地将环境储存在 LLM 的重量中。 我们通过详细的调整研究, 我们研究哪个变换式模块对试验时的精度最为有益。
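The chunking step mentioned in the abstract, splitting the input context into small overlapping subsequences, can be sketched as follows. The function name and the `chunk_size` and `overlap` parameters are illustrative assumptions; the abstract does not state the values or the exact windowing the authors use.

```python
def overlapping_chunks(tokens, chunk_size, overlap):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the sequence.
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

Each window is short enough for the base model's context, so memory stays constant per step while the number of windows grows linearly with input length.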
Article 210
Title@2025-07-15 (2): LRCTI: A Large Language Model-Based Framework for Multi-Step Evidence Retrieval and Reasoning in Cyber Threat Intelligence Credibility Verification
Title: LRCTI: A Large Language Model-Based Framework for Multi-Step Evidence Retrieval and Reasoning in Cyber Threat Intelligence Credibility Verification | LRCTI: Ein großsprachiges modellbasiertes Framework für mehrstufige Evidence-Retrieval und Reasoning in Cyber Threat Intelligence Credibility Verifikation | LRCTI: 网络威胁情报可靠性核查中多重证据检索和理由依据大语言示范框架 2507.11310v1 |
Authors (6): Fengxiao Tang, Huan Li, Ming Zhao, Zongzong Wu, Shisong Peng, Tao Yin
Verifying the credibility of Cyber Threat Intelligence (CTI) is essential for reliable cybersecurity defense. However, traditional approaches typically treat this task as a static classification problem, relying on handcrafted features or isolated deep learning models. These methods often lack the robustness needed to handle incomplete, heterogeneous, or noisy intelligence, and they provide limited transparency in decision-making, factors that reduce their effectiveness in real-world threat environments. To address these limitations, we propose LRCTI, a Large Language Model (LLM)-based framework designed for multi-step CTI credibility verification. The framework first employs a text summarization module to distill complex intelligence reports into concise and actionable threat claims. It then uses an adaptive multi-step evidence retrieval mechanism that iteratively identifies and refines supporting information from a CTI-specific corpus, guided by LLM feedback. Finally, a prompt-based Natural Language Inference (NLI) module is applied to evaluate the credibility of each claim while generating interpretable justifications for the classification outcome. Experiments conducted on two benchmark datasets, CTI-200 and PolitiFact, show that LRCTI improves F1-Macro and F1-Micro scores by over 5%, reaching 90.9% and 93.6%, respectively, compared to state-of-the-art baselines. These results demonstrate that LRCTI effectively addresses the core limitations of prior methods, offering a scalable, accurate, and explainable solution for automated CTI credibility verification.
核实网络威胁情报(CTI)的可信度对于可靠的网络安全防御至关重要。然而,传统方法通常将这一任务视为静态分类问题,依靠手工制作的特征或孤立的深层次学习模式。这些方法往往缺乏处理不完整、多样化或噪音情报所需的稳健性,在决策因素方面提供了有限的透明度,从而降低了其在现实世界威胁环境中的效力。为解决这些限制,我们提议LRCTI,一个大语言模型(LLM)框架,为多步CTI可信度核查设计一个基于大语言的多步语言模型。框架首先使用文本汇总模块,将复杂的情报报告编成简明和可操作的威胁索赔。然后使用适应性多步证据检索机制,在LLM反馈的指导下,反复确定和完善CTI专案的具体资料。最后,运用一个基于提示的自然语言推断(NLI)模块来评估每项索赔的可信度,同时为分类结果提供可解释的理由。在两个基准数据集CTI-200和PolitiFact上进行的实验显示,与最先进的基线相比,LRCTI将F1-Macro和F1-Micro分数提高了5%以上,分别达到90.9%和93.6%。这些结果表明,LRCTI有效解决了以往方法的核心局限,为自动化CTI可信度核查提供了可扩展、准确和可解释的解决方案。
Article 211
Title@2025-07-15 (2): Dr.Copilot: A Multi-Agent Prompt Optimized Assistant for Improving Patient-Doctor Communication in Romanian
Title: Dr.Copilot: A Multi-Agent Prompt Optimized Assistant for Improving Patient-Doctor Communication in Romanian | Dr.Copilot: Ein Multi-Agent Prompt Optimierter Assistent zur Verbesserung der Patienten-Doktor-Kommunikation auf Rumänisch | 副驾驶:罗马尼亚改善病人-医生沟通多代理快速优化助理 2507.11299v1 |
Authors (4): Andrei Niculae, Adrian Cosma, Cosmin Dumitrache, Emilian Rǎdoi
Text-based telemedicine has become increasingly common, yet the quality of medical advice in doctor-patient interactions is often judged more on how advice is communicated than on its clinical accuracy. To address this, we introduce Dr.Copilot, a multi-agent large language model (LLM) system that supports Romanian-speaking doctors by evaluating and enhancing the presentation quality of their written responses. Rather than assessing medical correctness, Dr.Copilot provides feedback along 17 interpretable axes. The system comprises three LLM agents with prompts automatically optimized via DSPy. Designed with low-resource Romanian data and deployed using open-weight models, it delivers real-time, specific feedback to doctors within a telemedicine platform. Empirical evaluations and live deployment with 41 doctors show measurable improvements in user reviews and response quality, marking one of the first real-world deployments of LLMs in Romanian medical settings.
为解决这一问题,我们引入了多试剂大型语言模型(LLM)系统,这是一个多试剂大型语言模型(LLM)系统,通过评估和提高其书面答复的表述质量来支持罗马尼亚语医生。Copolit博士在17个可解释轴上提供反馈,而不是评估医疗正确性。该系统由3个LLM代理组成,通过DSPy自动优化速度。该系统由3个LLM代理组成,通过DSPy自动优化速度。罗马尼亚低资源数据设计,使用开放重量模型部署,在远程医疗平台内向医生提供实时特定反馈。经验评估和与41个医生一起的现场部署显示用户审查和反应质量的可衡量改进,标志着罗马尼亚医疗环境中首次实际部署LMs。
Article 212
Title@2025-07-15 (2): Fine-Grained Chinese Hate Speech Understanding: Span-Level Resources, Coded Term Lexicon, and Enhanced Detection Frameworks
Title: Fine-Grained Chinese Hate Speech Understanding: Span-Level Resources, Coded Term Lexicon, and Enhanced Detection Frameworks | Fine-Grained Chinese Hate Speech Understanding: Span-Level-Ressourcen, Coded Term Lexikon und erweiterte Erkennungsrahmen | 中华仇恨言论理解:广级资源、规范术语词汇、强化检测框架 2507.11292v1 |
Authors (5): Zewen Bai, Liang Yang, Shengdi Yin, Yuanyuan Sun, Hongfei Lin
The proliferation of hate speech has inflicted significant societal harm, with its intensity and directionality closely tied to specific targets and arguments. In recent years, numerous machine learning-based methods have been developed to detect hateful comments on online platforms automatically. However, research on Chinese hate speech detection lags behind, and interpretability studies face two major challenges: first, the scarcity of span-level fine-grained annotated datasets limits models’ deep semantic understanding of hate speech; second, insufficient research on identifying and interpreting coded hate speech restricts model explainability in complex real-world scenarios. To address these, we make the following contributions: (1) We introduce the Span-level Target-Aware Toxicity Extraction dataset (STATE ToxiCN), the first span-level Chinese hate speech dataset, and use it to evaluate how well existing models understand hate semantics. (2) We conduct the first comprehensive study of Chinese coded hate terms and of LLMs’ ability to interpret hate semantics. (3) We propose a method to integrate an annotated lexicon into models, significantly enhancing hate speech detection performance. Our work provides valuable resources and insights to advance the interpretability of Chinese hate speech detection research.
仇恨言论的泛滥造成了严重的社会伤害,其强度和方向性与具体目标和论点密切相关。近年来,开发了许多基于机器的学习方法,以自动发现在线平台上的仇恨评论。然而,关于中国仇恨言论检测的研究滞后,可解释性研究面临两大挑战:第一,缺乏跨层微细的附加说明的数据集模型对仇恨言论的深刻语义理解;第二,关于识别和解释编码仇恨言论的研究不足,限制了在复杂的现实世界情景中的示范解释性。为了解决这些问题,我们提出了以下贡献:(1) 我们引入了Span-Aware目标软件毒性提取数据集(STATE ToxiCN),这是中国首个跨层仇恨言论数据集,并评估了对使用该数据集的现有模型的仇恨语义理解。(2) 我们开展了关于中国编码仇恨言论术语、LLMS解释仇恨语义的能力的第一次全面研究。(3) 我们提出一种将附加说明的词汇纳入模型的方法,大大增强仇恨言论检测的绩效。我们的工作为推进中国仇恨言论的可解释性研究提供了宝贵的资源和洞察力。
Article 213
Title@2025-07-15 (2): ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge
Title: ImpliRet: Benchmarking the Implicit Fact Retrieval Challenge | Implic Retrieval Challenge: Benchmarking der Implicity Fact Retrieval Challenge | ImpliRet:设定隐含事实检索挑战的基准 2506.14407v2 |
Authors (4): Zeinab Sadat Taghavi, Ali Modarressi, Yunpu Ma, Hinrich Schütze
Retrieval systems are central to many NLP pipelines, but often rely on surface-level cues such as keyword overlap and lexical semantic similarity. To evaluate retrieval beyond these shallow signals, recent benchmarks introduce reasoning-heavy queries; however, they primarily shift the burden to query-side processing techniques – like prompting or multi-hop retrieval – that can help resolve complexity. In contrast, we present ImpliRet, a benchmark that shifts the reasoning challenge to document-side processing: the queries are simple, but relevance depends on facts stated implicitly in documents through temporal (e.g., resolving “two days ago”), arithmetic, and world knowledge relationships. We evaluate a range of sparse and dense retrievers, all of which struggle in this setting: the best nDCG@10 is only 14.91%. We also test whether long-context models can overcome this limitation. But even with a short context of only thirty documents, including the positive document, GPT-o4-mini scores only 55.54%, showing that document-side reasoning remains a challenge. Our code is available at: github.com/ZeinabTaghavi/IMPLIRET
检索系统是许多国家实验室规划方案管道的核心,但往往依赖表面提示,如关键词重叠和词汇语义相似性。为了评估这些浅度信号之外的检索,最近的基准引入了推理重度查询;然而,它们主要将负担转移到可帮助解决复杂性的查询端处理技术上,如催化或多跳检索等。相比之下,我们提出了ImpliRet,这是一个将推理挑战转向文件端处理的基准:查询很简单,但相关性取决于文件通过时间(例如解决“两天前”)、算术和世界知识关系等隐含的事实。我们评估了一系列稀少和密集的检索者,所有这些检索者在此环境下都挣扎着:最好的 nDCG@10 仅为14.91%。我们还测试长文本模型能否克服这一限制。但即使短短的30份文件,包括正面文件GPT-o4-mini的评分只有55.54%,显示文件端推理仍然很困难。我们的代码在以下:Github.com/Zeina/TREPAVI。
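For readers unfamiliar with the nDCG@10 metric quoted above, here is a minimal sketch of the standard definition using linear gain. It assumes the input list holds the graded relevance of every judged document in ranked order; the benchmark's own evaluation script may differ in details such as the gain function.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: relevance at rank r is divided by log2(r + 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A single relevant document ranked first gives a score of 1.0; the same document ranked second gives 1 / log2(3), roughly 0.63.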
Article 214
Title@2025-07-15 (2): ContextCache: Context-Aware Semantic Cache for Multi-Turn Queries in Large Language Models
Title: ContextCache: Context-Aware Semantic Cache for Multi-Turn Queries in Large Language Models | ContextCache: Kontext-Bewusst Semantischer Cache für Multi-Turn-Abfragen in großen Sprachmodellen | 上下文缓存: 用于大语言模式多发查询的背景软件语义缓存 2506.22791v3 |
Authors (7): Jianxin Yan, Wangze Ni, Lei Chen, Xuemin Lin, Peng Cheng, Zhan Qin, Kui Ren
Semantic caching significantly reduces computational costs and improves efficiency by storing and reusing large language model (LLM) responses. However, existing systems rely primarily on matching individual queries, lacking awareness of multi-turn dialogue contexts, which leads to incorrect cache hits when similar queries appear in different conversational settings. This demonstration introduces ContextCache, a context-aware semantic caching system for multi-turn dialogues. ContextCache employs a two-stage retrieval architecture that first executes vector-based retrieval on the current query to identify potential matches and then integrates current and historical dialogue representations through self-attention mechanisms for precise contextual matching. Evaluation of real-world conversations shows that ContextCache improves precision and recall compared to existing methods. Additionally, cached responses exhibit approximately 10 times lower latency than direct LLM invocation, enabling significant computational cost reductions for LLM conversational applications.
语义缓存通过存储和重新使用大型语言模型(LLM)的响应,大大减少了计算成本,提高了效率;然而,现有系统主要依靠对个别查询进行匹配,缺乏对多点对话环境的认识,导致在不同对话环境中出现类似查询时出现不正确的缓存点击率;这一演示引入了背景缓存系统,即具有背景认知的多点对话的静存缓存系统;背景缓存使用一个两阶段检索结构,首先对当前查询进行矢量检索,以确定潜在匹配,然后通过自我关注机制整合当前和历史对话的表述,以精确背景匹配;对真实世界的谈话评价表明,CEnecache比现有方法更精确和回溯。此外,缓存答复显示,LLLM直接的延载率比直接的LLM调试约低10倍,从而大大降低了LM对话应用程序的计算成本。
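The two-stage lookup described above can be sketched in a few lines. This is an illustration under stated assumptions, not the ContextCache implementation: the cache entry layout, both thresholds, and all function names are hypothetical, and the second stage replaces the paper's self-attention fusion of dialogue turns with a plain mean of turn embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean_vector(vectors):
    """Element-wise mean; a stand-in for the paper's self-attention fusion."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def lookup(cache, query_vec, history_vecs, query_thresh=0.9, context_thresh=0.9):
    # Stage 1: keep entries whose stored query is close to the current query.
    candidates = [e for e in cache if cosine(e["query"], query_vec) >= query_thresh]
    # Stage 2: require the dialogue context to match as well.
    context = mean_vector(history_vecs + [query_vec])
    best, best_score = None, context_thresh
    for entry in candidates:
        score = cosine(entry["context"], context)
        if score >= best_score:
            best, best_score = entry, score
    return best["response"] if best else None
```

Two cached entries with identical queries but different stored contexts are disambiguated by stage 2, which is the failure case of query-only matching that the abstract describes.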
Article 215
Title@2025-07-15 (2): FMC: Formalization of Natural Language Mathematical Competition Problems
Title: FMC: Formalization of Natural Language Mathematical Competition Problems | FMC: Formalisierung von mathematischen Wettbewerbsproblemen in der Natursprache | FMC: 将自然语言数学竞争问题正规化 2507.11275v1 |
Authors (6): Jiaxuan Xie, Chengwu Liu, Ye Yuan, Siqi Li, Zhiping Xiao, Ming Zhang
Efficient and accurate autoformalization methods, which leverage large-scale datasets of extensive natural language mathematical problems to construct formal language datasets, are key to advancing formal mathematical reasoning. In this paper, we propose an autoformalization pipeline based on large language models with error feedback, achieving a fully automatic and training-free formalization approach. Using this pipeline, we curate an Olympiad-level dataset aligning natural language problems with Lean formalizations. The dataset comprises 3,922 mathematical problems in natural language and 9,787 in Lean, of which 64.46% were assessed as at least above-average quality, making it suitable as a benchmark for automated theorem provers. Additionally, we investigate the formalization and reasoning capabilities of various LLMs and empirically demonstrate that few-shot learning, error feedback, and increasing sampling numbers enhance the autoformalization process. Experiments with three automated theorem provers on the dataset also highlight its challenging nature and its value as a benchmark for formal reasoning tasks.
高效和准确的自动化正规化方法,利用大量大量自然语言数学问题数据集来构建正式的语文数据集,是推进正式数学推理的关键。在本文中,我们提议基于大语言模型的自动正规化管道,并配有错误反馈,实现完全自动和无培训的正规化方法。我们利用这一管道,将自然语言问题与Lean正规化相结合的奥林匹亚级数据集翻版。该数据集包括自然语言的3,922个数学问题和Lean的9,787个问题,其中64.46%至少被评为高于平均水平的质量,使之适合作为自动理论验证的基准。此外,我们调查各种LLMM和实验性经验性地表明,很少人会学习、错误反馈和增加采样数字,会加强自动化正规化进程。在该数据集上的三个自动理论验证器的实验也突出了其挑战性及其作为正式推理任务基准的价值。
Article 216
Title@2025-07-15 (2): KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding
Title: KV-Latent: Dimensional-level KV Cache Reduction with Frequency-aware Rotary Positional Embedding | KV-Latent: KV-Cache-Reduktion auf Dimensionsebene mit frequenzbewusster Rotary-Positions-Einbettung | KV-Latent:用高频感知的扶轮性定位嵌入减少KV缓存 2507.11273v1 |
Authors (6): Luohe Shi, Zuchao Li, Lefei Zhang, Guoming Liu, Baoyuan Qi, Hai Zhao
Large language models (LLMs) based on Transformer Decoders have become the preferred choice for conversational generative AI. Despite the overall superiority of the Decoder architecture, the gradually increasing Key-Value (KV) cache during inference has emerged as a primary efficiency bottleneck, in terms of both memory consumption and data transfer bandwidth. To address these challenges, we propose a paradigm called KV-Latent. By down-sampling the Key-Value vector dimensions into a latent space, we can significantly reduce the KV Cache footprint and improve inference speed with only a small amount of extra training, less than 1% of what pre-training takes. In addition, we enhance the stability of Rotary Positional Embedding applied to lower-dimensional vectors by modifying its frequency sampling mechanism, avoiding noise introduced by higher frequencies while retaining position attenuation. Our experiments, covering models both with and without Grouped Query Attention, have yielded satisfactory results. Finally, we conducted comparative experiments to study the impact of separately reducing Key and Value components on the model’s performance. Our approach allows for the construction of more efficient language model systems and opens new possibilities for KV Cache savings and efficient LLMs. Our code is available at https://github.com/ShiLuohe/KV-Latent.
尽管Decoder结构总体优越,但在推断过程中逐渐增加的Key-Value(KV)缓存在记忆消耗和数据传输带宽限制两个方面都已成为主要的效率瓶颈。为了应对这些挑战,我们提出了一个以变压器 Decoder 为基础的大型语言模型(LLMS) 。通过将 KV-Value 矢量的模型降为隐蔽空间,我们可以大幅降低 KV 缓存足迹,提高推断速度,但只能通过少量的额外培训,低于培训前的1。此外,我们通过修改其频率取样机制,避免高频率带来的噪音,同时保留减速位置,加强了对低维矢量矢量应用的扶轮式定位嵌入器的稳定性。我们的实验,包括使用Group Query 注意的模型和不使用这种模型的模型,取得了令人满意的结果。我们进行了比较实验,以研究分别减少关键和价值组成部分对模型性能的影响,只有少量的额外培训。此外,我们的方法允许在更高效的CHIV/LAVS上建立高效的模型系统。
Article 217
Title@2025-07-15 (2): Block Circulant Adapter for Large Language Models
Title: Block Circulant Adapter for Large Language Models | Block Circulant Adapter für große Sprachmodelle | 用于大语言模型的块环相适应器 2505.00582v2 |
Authors (4): Xinyu Ding, Meiqi Wang, Siyu Liao, Zhongfeng Wang
Fine-tuning large language models (LLMs) is difficult due to their huge model size. Recent Fourier domain-based methods show potential for reducing fine-tuning costs. We propose a block circulant matrix-based fine-tuning method with a stable training heuristic to leverage the properties of circulant matrices and one-dimensional Fourier transforms to reduce storage and computation costs. Experiments show that our method uses 14× fewer parameters than VeRA, 16× fewer than LoRA, and 32× fewer FLOPs than FourierFT, while maintaining close or better task performance. Our approach presents a promising way in the frequency domain to fine-tune large models on downstream tasks.
微调大型语言模型(LLMS)由于其庞大的模型规模而很难。 最近的Fourier基于域域的方法显示了降低微调成本的潜力。 我们建议采用块状环球矩阵微调方法,采用稳定的培训超常性能,以利用环球矩阵和一维Fourier变形的特性来降低存储和计算成本。 实验显示,我们的方法使用的参数数量比VeRA少14倍,比LoRA少16倍,FLOPs比FourierFT少32倍,同时保持近距离或更好的任务性能。 我们的方法在频率领域展示了在下游任务上微调大型模型的有希望的方式。
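The storage and computation savings rest on a standard property of circulant matrices: a circulant block C with C[i][j] = c[(i - j) mod n] is fully determined by its first column c, and the product C x is a circular convolution that a one-dimensional Fourier transform diagonalizes. The sketch below checks this with a naive O(n^2) DFT so that it needs only the standard library; a real implementation would use an FFT. All function names and the block layout are assumptions for illustration, not the paper's parameterization.

```python
import cmath

def dft(x, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out

def circulant_matvec_direct(c, x):
    """Multiply by the circulant matrix whose first column is c, entry by entry."""
    n = len(c)
    return [sum(c[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]

def circulant_matvec_fourier(c, x):
    """Same product via the convolution theorem: IDFT(DFT(c) * DFT(x))."""
    prod = [a * b for a, b in zip(dft(c), dft(x))]
    return [v.real for v in dft(prod, inverse=True)]

def block_circulant_matvec(blocks, x, b):
    """blocks[r][s] is the first column (length b) of circulant block (r, s)."""
    out = []
    for row in blocks:
        acc = [0.0] * b
        for s, c in enumerate(row):
            y = circulant_matvec_fourier(c, x[s * b:(s + 1) * b])
            acc = [p + q for p, q in zip(acc, y)]
        out.extend(acc)
    return out
```

Each b-by-b block stores b values instead of b squared, which is where the parameter reduction comes from.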
Article 218
Title@2025-07-15 (2): Shared Global and Local Geometry of Language Model Embeddings
Title: Shared Global and Local Geometry of Language Model Embeddings | Gemeinsame globale und lokale Geometrie von Sprachmodellen | 共同的全球和地方语言对地测量 2503.21073v3 |
Authors (4): Andrew Lee, Melanie Weber, Fernanda Viégas, Martin Wattenberg
Researchers have recently suggested that models share common representations. In our work, we find numerous geometric similarities across the token embeddings of large language models. First, we find “global” similarities: token embeddings often share similar relative orientations. Next, we characterize local geometry in two ways: (1) by using Locally Linear Embeddings, and (2) by defining a simple measure for the intrinsic dimension of each embedding. Both characterizations allow us to find local similarities across token embeddings. Additionally, our intrinsic dimension demonstrates that embeddings lie on a lower dimensional manifold, and that tokens with lower intrinsic dimensions often have semantically coherent clusters, while those with higher intrinsic dimensions do not. Based on our findings, we introduce EMB2EMB, a simple application to linearly transform steering vectors from one language model to another, despite the two models having different dimensions.
研究人员最近建议模型具有共同的代表性。 在我们的工作中,我们在大型语言模型的象征性嵌入中发现了许多几何相似之处。 首先,我们发现了“全球”的相似之处:象征性嵌入往往具有相似的相对方向。 其次,我们用两种方式描述本地几何:(1) 通过使用局部线性嵌入,(2) 通过界定每个嵌入的内在层面的简单度量, 以及(2) 通过界定每个嵌入的内在层面的简单度量度。 两种特征都使我们能够在象征性嵌入中找到本地的相似之处。 此外, 我们的内在层面表明,嵌入位于一个低维度的多元体, 而具有较低内在维度的符号往往具有语义一致性, 而具有更高内在维度的符号则没有。 根据我们的研究结果,我们引入了EMB2EMB, 这是一种简单的将方向矢量从一种语言模式向另一种模式直线性地转换为另一种模式的应用,尽管两个模型具有不同的维度。
Article 219
Title@2025-07-15 (2): KAT-V1: Kwai-AutoThink Technical Report
Title: KAT-V1: Kwai-AutoThink Technical Report | KAT-V1: Kwai-AutoThink Technical Report | KAT-V1: Kwai-AutoThink 技术报告 2507.08297v2 |
Authors (30): Zizheng Zhan, Ken Deng, Huaixi Tang, Wen Xiang, Kun Wu, Weihao Li, Wenqiang Zhu, Jingxuan Xu, Lecheng Huang, Zongxian Feng, Shaojie Wang, Shangpeng Yan, Xuxing Chen, Jiaheng Liu, Zhongyuan Peng, Zuchen Gao, Haoyang Huang, Xiaojiang Zhang, Jinghui Wang, Zheng Lin, Mengtong Li, Huiming Wang, Ziqi Zhan, Yanan Wu, Yuanxing Zhang, Jian Yang, Guang Chen, Haotian Zhang, Bin Chen, Bing Yu
We present Kwaipilot-AutoThink (KAT), an open-source 40B large language model developed to address the overthinking problem in reasoning-intensive tasks, where an automatic thinking training paradigm is proposed to dynamically switch between reasoning and non-reasoning modes based on task complexity. Specifically, first, we construct the dual-regime dataset based on a novel tagging pipeline and a multi-agent synthesis strategy, and then we apply Multi-Token Prediction (MTP)-enhanced knowledge distillation, enabling efficient and fine-grained reasoning transfer with minimal pretraining cost. Besides, we implement a cold-start initialization strategy that introduces mode-selection priors using majority-vote signals and intent-aware prompting. Finally, we propose Step-SRPO, a reinforcement learning algorithm that incorporates intermediate supervision into the GRPO framework, offering structured guidance over both reasoning-mode selection and response accuracy. Extensive experiments across multiple benchmarks demonstrate that KAT consistently matches or even outperforms current state-of-the-art models, including DeepSeek-R1-0528 and Qwen3-235B-A22B, across a wide range of reasoning-intensive tasks while reducing token usage by up to approximately 30%. Beyond academic evaluation, KAT has been successfully deployed in Kwaipilot (i.e., Kuaishou’s internal coding assistant), and improves real-world development workflows with high accuracy, efficiency, and controllable reasoning behaviors. Moreover, we are actively training a 200B Mixture-of-Experts (MoE) with 40B activation parameters, where the early-stage results already demonstrate promising improvements in performance and efficiency, further showing the scalability of the AutoThink paradigm.
我们提出Kwaipilot-AutoThink(KAT),这是一个开放源码的40B大语言模型,用于解决推理密集型任务中的过度思考问题,其中提出自动思维培训模式,以便根据任务的复杂性动态地转换推理和非推理模式。具体地说,我们首先根据一个新的标记管道和多试剂合成战略构建双系统数据集,然后我们应用多方向预测(MTP)强化知识蒸馏,使高效和精细推理推理转换在最低培训前成本下得以实现。此外,我们实施一个冷启动的初始化行为战略,利用多数票信号和自动认知的提示来引入模式选择前期。最后,我们提出Step-SRPO,一个强化的学习算法,将中间监管纳入GRPO框架,为推理学模式的选择和反应准确度提供结构化指导。跨多个基准的广泛实验表明,KAT在各种推理密集型任务中始终与包括DeepSeek-R1-0528和Qwen3-235B-A22B在内的当前最先进模型相当甚至更优,同时将符号使用量最多减少约30%。除学术评估外,KAT已成功部署在Kwaipilot(即快手的内部编码助手)中,以高准确性、高效率和可控的推理行为改进了实际开发工作流程。此外,我们正在积极训练一个具有40B激活参数的200B专家混合(MoE)模型,其早期结果已显示出性能和效率方面的可喜改进,进一步表明了AutoThink范式的可扩展性。
Article 220
Title@2025-07-15 (2): RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism
Title: RAG-R1 : Incentivize the Search and Reasoning Capabilities of LLMs through Multi-query Parallelism | RAG-R1 : Förderung der Such- und Begründungsfähigkeiten von LLMs durch Multi-Query-Parallelismus | RAG-R1:通过多种克质平行主义鼓励LLMs的搜索和说明能力 2507.02962v3 |
Authors (6): Zhiwen Tan, Jiaming Huang, Qintong Wu, Hongxuan Zhang, Chenyi Zhuang, Jinjie Gu
Large Language Models (LLMs) have demonstrated remarkable capabilities across various tasks, while they remain prone to generating hallucinated or outdated responses due to their static internal knowledge. Recent advancements in Retrieval-Augmented Generation (RAG) methods have explored enhancing models’ search and reasoning capabilities through reinforcement learning (RL). Although these methods demonstrate promising results, they face challenges in training stability and encounter issues such as substantial inference time and restricted capabilities due to the single-query mode. In this paper, we propose RAG-R1, a novel training framework designed to enable LLMs to adaptively leverage internal and external knowledge during the reasoning process. We further expand the generation and retrieval processes within the framework from single-query mode to multi-query parallelism, aimed at reducing inference time and enhancing the model’s capabilities. Extensive experiments on seven question-answering benchmarks demonstrate that our method outperforms the strongest baseline by up to 13.2% and decreases inference time by 11.1%.
大型语言模型(LLMS)在各种任务中表现出了非凡的能力,而由于其静态的内部知识,它们仍然容易产生幻觉或过时的应对办法; 回收和提炼一代(RAG)方法最近的进展探索了通过强化学习加强模型的搜索和推理能力; 虽然这些方法显示了有希望的成果,但它们在培训稳定性方面面临着挑战,并遇到了诸如因单一查询模式而导致的大量推论时间和有限能力等问题; 在本文件中,我们提议了一个新的培训框架,旨在使LAG-R1能够在推理过程中以适应性的方式利用内部和外部知识; 我们进一步扩大了从单式回收模式到多式平行框架的生成和检索过程,目的是减少推论时间和提高模型的能力。 对七个问题回答基准的广泛实验表明,我们的方法比最强的基线高出13.2%,推论时间减少了11.1%。
Article 221
Title@2025-07-15 (2): Sparse Autoencoders Can Capture Language-Specific Concepts Across Diverse Languages
Title: Sparse Autoencoders Can Capture Language-Specific Concepts Across Diverse Languages | Sparse Autoencoder können sprachspezifische Konzepte über verschiedene Sprachen hinweg erfassen | 能够捕捉不同语言语言的特定语言概念的简单自定义者 2507.11230v1 |
Authors (6): Lyzander Marciano Andrylie, Inaya Rahmanisa, Mahardika Krisna Ihsani, Alfan Farizki Wicaksono, Haryo Akbarianto Wibowo, Alham Fikri Aji
Understanding the multilingual mechanisms of large language models (LLMs) provides insight into how they process different languages, yet this remains challenging. Existing studies often focus on individual neurons, but their polysemantic nature makes it difficult to isolate language-specific units from cross-lingual representations. To address this, we explore sparse autoencoders (SAEs) for their ability to learn monosemantic features that represent concrete and abstract concepts across languages in LLMs. While some of these features are language-independent, the presence of language-specific features remains underexplored. In this work, we introduce SAE-LAPE, a method based on feature activation probability, to identify language-specific features within the feed-forward network. We find that many such features predominantly appear in the middle to final layers of the model and are interpretable. These features influence the model’s multilingual performance and language output and can be used for language identification with performance comparable to fastText and with greater interpretability. Our code is available at https://github.com/LyzanderAndrylie/language-specific-features.
了解大型语言模型(LLMS)的多语言机制可以深入了解它们是如何处理不同语言的,然而,这仍然具有挑战性。现有的研究往往侧重于单个神经元,但其多语种性质使得难以将特定语言单位与跨语言代表隔离开来。为了解决这个问题,我们探索了稀少的自动校考员(SAEs),以使他们有能力学习代表不同语言具体和抽象概念的单语种特征。虽然其中一些特征是语言独立的,但特定语言特征的存在仍未得到充分探讨。在这项工作中,我们采用了基于特征激活概率的SAE-LAPE方法,以识别进源前网络内特定语言特征。我们发现许多这类特征主要出现在模型的中间至最后层,是可以解释的。这些特征影响模型的多语种性能和语言输出,并可用于语言识别可与快速文本相比的性能和解释性能。我们的代码可在https://github.com/LyzanderAnrylie/lagen-fecatatatrys上查阅。
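One plausible reading of a method "based on feature activation probability" is an entropy score over per-language activation rates: a feature that fires almost exclusively for one language has low entropy. The sketch below illustrates that reading only. The threshold, the normalization, and the function names are assumptions; the abstract does not specify how SAE-LAPE scores features.

```python
import math

def activation_probability(activations, threshold=0.0):
    """Fraction of tokens on which the feature's activation exceeds the threshold."""
    if not activations:
        return 0.0
    return sum(1 for a in activations if a > threshold) / len(activations)

def language_entropy(acts_by_lang, threshold=0.0):
    """Entropy of the normalized per-language activation probabilities.

    Returns None when the feature never activates in any language.
    """
    probs = [activation_probability(a, threshold) for a in acts_by_lang.values()]
    total = sum(probs)
    if total == 0:
        return None
    entropy = 0.0
    for p in probs:
        q = p / total
        if q > 0:
            entropy -= q * math.log(q)
    return entropy
```

Under this reading, features would be ranked by entropy and the lowest-entropy ones inspected as language-specific candidates.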
Article 222
Title@2025-07-15 (2): An Agentic Flow for Finite State Machine Extraction using Prompt Chaining
Title: An Agentic Flow for Finite State Machine Extraction using Prompt Chaining | Ein Agentischer Fluss für Finite State Machine Extraction mit Prompt Verkettung | 使用快速链条的有限国家机器采掘的代理流动 2507.11222v1 |
Authors (4): Fares Wael, Youssef Maklad, Ali Hamdi, Wael Elsersy
Finite-State Machines (FSMs) are critical for modeling the operational logic of network protocols, enabling verification, analysis, and vulnerability discovery. However, existing FSM extraction techniques face limitations such as scalability, incomplete coverage, and ambiguity in natural language specifications. In this paper, we propose FlowFSM, a novel agentic framework that leverages Large Language Models (LLMs) combined with prompt chaining and chain-of-thought reasoning to extract accurate FSMs from raw RFC documents. FlowFSM systematically processes protocol specifications, identifies state transitions, and constructs structured rule-books by chaining agent outputs. Experimental evaluation across FTP and RTSP protocols demonstrates that FlowFSM achieves high extraction precision while minimizing hallucinated transitions, showing promising results. Our findings highlight the potential of agent-based LLM systems in the advancement of protocol analysis and FSM inference for cybersecurity and reverse engineering applications.
有限国机器(FSM)对于网络协议的运作逻辑建模、扶持性核查、分析和脆弱性发现至关重要,然而,现有的密克罗尼西亚联邦提炼技术面临可扩展性、覆盖面不完全和自然语言规格模糊等限制,在本文件中,我们提出FLFSM,这是一个利用大语言模型(LLMS)的新型代理框架,加上迅速的链条和思维链推理,从原始RFC文件中提取准确的FSMS。FLFS系统处理协议规格,确定各州的过渡,并通过链锁剂产出构建结构化的规则手册。 跨FTP和RTSP协议的实验性评估表明,FLFSM在尽可能减少幻觉过渡的同时,实现了高精度采精度,并显示出有希望的结果。 我们的调查结果强调了基于代理人的LM系统在推进协议分析方面的潜力,以及FSM对网络安全和反向工程应用的推断。
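Note: the abstract does not describe the format of FlowFSM's rule-books. As a rough illustration of what a structured rule-book of state transitions could look like, the sketch below stores transitions as (source, event, target) records, rejects contradictory entries, and replays an event sequence. The state and event names are invented placeholders, not transitions taken from any RFC.

```python
from collections import namedtuple

# One extracted transition: in state `source`, on `event`, move to `target`.
Transition = namedtuple("Transition", ["source", "event", "target"])

def build_rule_book(transitions):
    """Group transitions by source state and reject contradictory entries."""
    rule_book = {}
    for t in transitions:
        events = rule_book.setdefault(t.source, {})
        if t.event in events and events[t.event] != t.target:
            raise ValueError(
                f"conflicting transition for {t.source!r} on {t.event!r}"
            )
        events[t.event] = t.target
    return rule_book

def replay(rule_book, start, events):
    """Follow a sequence of events and return the list of visited states."""
    state, path = start, [start]
    for event in events:
        state = rule_book[state][event]
        path.append(state)
    return path

# Hypothetical login flow, used only to exercise the functions above.
EXAMPLE = [
    Transition("IDLE", "USER", "AWAIT_PASSWORD"),
    Transition("AWAIT_PASSWORD", "PASS", "LOGGED_IN"),
    Transition("LOGGED_IN", "QUIT", "IDLE"),
]
```

Rejecting conflicting entries at build time is one simple way to surface a hallucinated or ambiguous transition before the rule-book is used.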
Article 223
Title@2025-07-15 (2): On the Effect of Instruction Tuning Loss on Generalization
Title: On the Effect of Instruction Tuning Loss on Generalization | Auf die Auswirkungen der Instruktion Tuning Verlust auf die Verallgemeinerung | 指示计票损失对普遍化的影响的影响 2507.07817v2 |
Authors (4): Anwoy Chatterjee, H S V N S Kowndinya Renduchintala, Sumit Bhatia, Tanmoy Chakraborty
Instruction Tuning has emerged as a pivotal post-training paradigm that enables pre-trained language models to better follow user instructions. Despite its significance, little attention has been given to optimizing the loss function used. A fundamental, yet often overlooked, question is whether the conventional auto-regressive objective - where loss is computed only on response tokens, excluding prompt tokens - is truly optimal for instruction tuning. In this work, we systematically investigate the impact of differentially weighting prompt and response tokens in instruction tuning loss, and propose Weighted Instruction Tuning (WIT) as a better alternative to conventional instruction tuning. Through extensive experiments on five language models of different families and scales, three finetuning datasets of different sizes, and five diverse evaluation benchmarks, we show that the standard instruction tuning loss often yields suboptimal performance and limited robustness to input prompt variations. We find that a low-to-moderate weight for prompt tokens coupled with a moderate-to-high weight for response tokens yields the best-performing models across settings and also serves as a better starting point for the subsequent preference alignment training. These findings highlight the need to reconsider instruction tuning loss and offer actionable insights for developing more robust and generalizable models. Our code is open-sourced at https://github.com/kowndinya-renduchintala/WIT.
教学图案已成为一个关键的培训后范例,使得培训前语言模型能够更好地遵循用户指令。尽管它意义重大,但很少注意优化所使用的损失功能。一个根本性但经常被忽视的问题是常规自动递减目标(即损失只计算在应答标牌上,不包括及时标牌上)是否真正是教学调适的最佳方法。在这项工作中,我们系统地调查在调试损失时对快速和应答标牌进行不同加权的影响,并提议将加权指示图案(WIT)作为常规调控的更好替代方法。通过对不同家庭和规模的五种语言模型、三个不同尺寸的微调数据集和五个不同的评估基准进行广泛实验,我们发现标准指示调整损失往往产生不理想的性能,对输入快速变异作用的力度也有限。我们发现,在调控标物的低调权重加上中度至高度的响应标物,可以产生最佳表现模式,还可以作为随后的优惠调整培训的更佳起点。这些发现,我们需要重新思考的源码/变换。
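Note: a minimal sketch of a differentially weighted instruction tuning loss, assuming per-token negative log-likelihoods are already available. Normalising by the total weight is an assumption made here for illustration; the abstract does not state how WIT normalises the loss, and the function name is hypothetical.

```python
def weighted_instruction_loss(token_nll, is_prompt, prompt_weight, response_weight):
    """Weighted average of per-token negative log-likelihoods.

    token_nll:  one loss value per token of the prompt + response sequence.
    is_prompt:  True where the token belongs to the prompt, False for the response.
    The conventional objective described in the abstract corresponds to
    prompt_weight=0 and response_weight=1.
    """
    weights = [prompt_weight if p else response_weight for p in is_prompt]
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one token must carry a non-zero weight")
    # Assumed normalisation: divide by the sum of the weights.
    return sum(w * nll for w, nll in zip(weights, token_nll)) / total_weight
```

For token losses `[2.0, 4.0, 1.0, 3.0]` with the first two tokens in the prompt, the conventional setting averages only the response tokens and gives 2.0, while a prompt weight of 0.5 lets the prompt tokens contribute at half strength and gives 7/3.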
Article 224
Title@2025-07-15 (2): EsBBQ and CaBBQ: The Spanish and Catalan Bias Benchmarks for Question Answering
Title: EsBBQ and CaBBQ: The Spanish and Catalan Bias Benchmarks for Question Answering | EsBBQ und CaBBQ: Die spanischen und katalanischen Bias Benchmarks zur Beantwortung von Fragen | EsBBQ和CABBQ:西班牙和加泰罗尼亚的回答问题基准 2507.11216v1 |
Authors (7): Valle Ruiz-Fernández, Mario Mina, Júlia Falcão, Luis Vasquez-Reina, Anna Sallés, Aitor Gonzalez-Agirre, Olatz Perez-de-Viñaspre
Previous literature has largely shown that Large Language Models (LLMs) perpetuate social biases learnt from their pre-training data. Given the notable lack of resources for social bias evaluation in languages other than English, and for social contexts outside of the United States, this paper introduces the Spanish and the Catalan Bias Benchmarks for Question Answering (EsBBQ and CaBBQ). Based on the original BBQ, these two parallel datasets are designed to assess social bias across 10 categories using a multiple-choice QA setting, now adapted to the Spanish and Catalan languages and to the social context of Spain. We report evaluation results on different LLMs, factoring in model family, size and variant. Our results show that models tend to fail to choose the correct answer in ambiguous scenarios, and that high QA accuracy often correlates with greater reliance on social biases.
以往的文献主要表明,大语言模型(LLMs)延续了从培训前数据中汲取的社会偏见。鉴于明显缺乏用于英语以外语言以及美国以外社会背景的社会偏见评估的资源,本文介绍了西班牙语和加泰罗尼亚比亚斯问答基准(ESBBQ和CABBQ ) 。根据原始BBQ,这两个平行数据集旨在利用多种选择的QA设置评估10类社会偏见,该设置现已适应西班牙语和加泰罗尼亚语和西班牙的社会背景。我们报告不同LMs的评估结果,将家庭、大小和变异因素考虑在内。我们的结果显示,模型往往无法在模糊的情景中选择正确的答案,而QA的高度准确性往往与更多地依赖社会偏见相关。
Article 225
Title@2025-07-15 (2): Stylometry recognizes human and LLM-generated texts in short samples
Title: Stylometry recognizes human and LLM-generated texts in short samples | Stylometrie erkennt menschliche und LLM-generierte Texte in kurzen Proben | tytylometerm在短样本中确认人类和LLM产生的文本 2507.00838v2 |
Authors (4): Karol Przystalski, Jan K. Argasiński, Iwona Grabska-Gradzińska, Jeremi K. Ochab
The paper explores stylometry as a method to distinguish between texts created by Large Language Models (LLMs) and humans, addressing issues of model attribution, intellectual property, and ethical AI use. Stylometry has been used extensively to characterise the style and attribute authorship of texts. By applying it to LLM-generated texts, we identify their emergent writing patterns. The paper involves creating a benchmark dataset based on Wikipedia, with (a) human-written term summaries, (b) texts generated purely by LLMs (GPT-3.5/4, LLaMa 2/3, Orca, and Falcon), (c) processed through multiple text summarisation methods (T5, BART, Gensim, and Sumy), and (d) rephrasing methods (Dipper, T5). The 10-sentence long texts were classified by tree-based models (decision trees and LightGBM) using human-designed (StyloMetrix) and n-gram-based (our own pipeline) stylometric features that encode lexical, grammatical, syntactic, and punctuation patterns. The cross-validated results reached a performance of up to .87 Matthews correlation coefficient in the multiclass scenario with 7 classes, and accuracy between .79 and 1. in binary classification, with the particular example of Wikipedia and GPT-4 reaching up to .98 accuracy on a balanced dataset. Shapley Additive Explanations pinpointed features characteristic of the encyclopaedic text type, individual overused words, as well as a greater grammatical standardisation of LLMs with respect to human-written texts. These results show – crucially, in the context of the increasingly sophisticated LLMs – that it is possible to distinguish machine- from human-generated texts at least for a well-defined text type.
本文探讨以文体计量学(stylometry)区分大语言模型(LLM)生成文本与人类撰写文本的方法,涉及模型归属、知识产权和人工智能伦理使用等问题。文体计量学长期被广泛用于刻画文本风格和判定作者归属。我们将其应用于LLM生成的文本,识别出其涌现的写作模式。本文基于维基百科构建了一个基准数据集,包含:(a) 人类撰写的词条摘要,(b) 完全由LLM(GPT-3.5/4、LLaMa 2/3、Orca和Falcon)生成的文本,(c) 经多种文本摘要方法(T5、BART、Gensim和Sumy)处理的文本,以及 (d) 经改写方法(Dipper、T5)处理的文本。这些长度为10个句子的文本由基于树的模型(决策树和LightGBM)进行分类,所用文体特征包括人工设计的特征(StyloMetrix)和基于n-gram的特征(我们自己的流水线),可编码词汇、语法、句法和标点模式。交叉验证结果显示,在7个类别的多分类场景中,马修斯相关系数最高达到0.87;在二分类中,准确率介于0.79和1.0之间,其中维基百科与GPT-4的区分在平衡数据集上准确率最高达到0.98。Shapley加性解释(SHAP)指出了百科全书式文本类型的典型特征、个别被过度使用的词语,以及LLM相对于人类撰写文本更高程度的语法规范化。这些结果表明,在LLM日益复杂的背景下,至少对于定义明确的文本类型,区分机器生成文本与人类撰写文本是可能的。
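Note: the abstract reports results as a Matthews correlation coefficient (MCC) for a 7-class problem. The sketch below computes the standard multiclass generalisation of MCC from a confusion matrix. It is included only to make the metric concrete and is not taken from the paper's pipeline.

```python
import math

def multiclass_mcc(confusion):
    """Matthews correlation coefficient from a square confusion matrix.

    confusion[i][j] is the number of samples whose true class is i and
    whose predicted class is j.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(k))
    true_counts = [sum(confusion[i]) for i in range(k)]
    pred_counts = [sum(confusion[i][j] for i in range(k)) for j in range(k)]

    numerator = correct * total - sum(
        p * t for p, t in zip(pred_counts, true_counts)
    )
    denominator = math.sqrt(
        (total ** 2 - sum(p * p for p in pred_counts))
        * (total ** 2 - sum(t * t for t in true_counts))
    )
    # By convention the score is 0 when a denominator term vanishes.
    return numerator / denominator if denominator else 0.0
```

For two classes this reduces to the familiar binary formula; a perfect classifier scores 1 and a classifier that always predicts one class scores 0.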
Article 226
Title@2025-07-15 (2): SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users
Title: SocioVerse: A World Model for Social Simulation Powered by LLM Agents and A Pool of 10 Million Real-World Users | SocioVerse: Ein Weltmodell für soziale Simulation Powered by LLM Agents und ein Pool von 10 Millionen Real-World-Nutzern | 社会之声:由LLM代理和1000万现实世界用户组成的人才库推动的社会模拟世界模式 2504.10157v3 |
Authors (21): Xinnong Zhang, Jiayu Lin, Xinyi Mou, Shiyue Yang, Xiawei Liu, Libo Sun, Hanjia Lyu, Yihang Yang, Weihong Qi, Yue Chen, Guanying Li, Ling Yan, Yao Hu, Siming Chen, Yu Wang, Xuanjing Huang, Jiebo Luo, Shiping Tang, Libo Wu, Baohua Zhou, Zhongyu Wei
Social simulation is transforming traditional social science research by modeling human behavior through interactions between virtual individuals and their environments. With recent advances in large language models (LLMs), this approach has shown growing potential in capturing individual differences and predicting group behaviors. However, existing methods face alignment challenges related to the environment, target users, interaction mechanisms, and behavioral patterns. To this end, we introduce SocioVerse, an LLM-agent-driven world model for social simulation. Our framework features four powerful alignment components and a user pool of 10 million real individuals. To validate its effectiveness, we conducted large-scale simulation experiments across three distinct domains: politics, news, and economics. Results demonstrate that SocioVerse can reflect large-scale population dynamics while ensuring diversity, credibility, and representativeness through standardized procedures and minimal manual adjustments.
社会模拟正在通过虚拟个人及其环境之间的相互作用来模拟人类行为,从而改变传统的社会科学研究。随着大型语言模型(LLMs)的最近进步,这一方法在捕捉个人差异和预测群体行为方面显示出越来越大的潜力。然而,现有的方法面临着与环境、目标用户、互动机制和行为模式有关的调整挑战。为此,我们引入了由LLM代理驱动的世界社会模拟模式“SocialVerse ” 。我们的框架有四个强大的组合组件和一个由1000万真实个人组成的用户库。为了验证其有效性,我们在政治、新闻和经济三个不同领域进行了大规模的模拟实验。结果表明,ScialVerse可以反映大规模的人口动态,同时通过标准化的程序和最低限度的手工调整确保多样性、可信度和代表性。
Article 227
Title@2025-07-15 (2): Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding
Title: Temperature and Persona Shape LLM Agent Consensus With Minimal Accuracy Gains in Qualitative Coding | Temperatur und Persona Shape LLM Agent Konsens mit minimaler Genauigkeit gewinnt in qualitativer Coding | 高温和人文形状 LLM 代理人共识,在定性编码中取得最低准确性收益 2507.11198v1 |
Authors (6): Conrad Borchers, Bahar Shahrokhian, Francesco Balzan, Elham Tajik, Sreecharan Sankaranarayanan, Sebastian Simon
Large Language Models (LLMs) enable new possibilities for qualitative research at scale, including coding and data annotation. While multi-agent systems (MAS) can emulate human coding workflows, their benefits over single-agent coding remain poorly understood. We conducted an experimental study of how agent persona and temperature shape consensus-building and coding accuracy of dialog segments based on a codebook with 8 codes. Our open-source MAS mirrors deductive human coding through structured agent discussion and consensus arbitration. Using six open-source LLMs (with 3 to 32 billion parameters) and 18 experimental configurations, we analyze over 77,000 coding decisions against a gold-standard dataset of human-annotated transcripts from online math tutoring sessions. Temperature significantly impacted whether and when consensus was reached across all six LLMs. MAS with multiple personas (including neutral, assertive, or empathetic) significantly delayed consensus in four out of six LLMs compared to uniform personas. In three of those LLMs, higher temperatures significantly diminished the effects of multiple personas on consensus. However, neither temperature nor persona pairing led to robust improvements in coding accuracy. Single agents matched or outperformed MAS consensus in most conditions. Only one model (OpenHermesV2:7B) and code category showed above-chance gains from MAS deliberation when temperature was 0.5 or lower and especially when the agents included at least one assertive persona. Qualitative analysis of MAS collaboration for these configurations suggests that MAS may nonetheless aid in narrowing ambiguous code applications that could improve codebooks and human-AI coding. We contribute new insight into the limits of LLM-based qualitative methods, challenging the notion that diverse MAS personas lead to better outcomes. We open-source our MAS and experimentation code.
大型语言模型(LLMS)为大规模质量研究提供了新的可能性,包括编码和数据注释。虽然多试剂系统(MAS)可以模仿人类编码工作流程,但其对单剂编码的好处仍然不甚为人理解。我们进行了一项实验性研究,研究代理人和温度如何在8种代码的代码手册基础上形成共识和对话部分的编码。我们的开放源代码(LLLMS)反映了通过结构化代理商讨论和共识仲裁进行人类编码的扣减。使用6种开放源代码(含3-320亿参数)和18个实验配置,我们分析了77 000多个多试剂系统(MAS)对在线数学辅导会的带有注释的人类编码数据集作出的编码决定。对于所有6种LMS(含中性、自信或感动性)之间如何达成共识部分,在6种LMSA中,有4种(含固性、自信或感动性)之间显著延迟的共识。在其中3种LMSMS中,高温和多面铅应用的编码会大大降低多个人对共识的影响。但是,温度或人之间最差的DNA和最接近的DNA分析结果,也显示最精确的数值。
Article 228
Title@2025-07-15 (2): Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams
Title: Text to model via SysML: Automated generation of dynamical system computational models from unstructured natural language text via enhanced System Modeling Language diagrams | Text zum Modell via SysML: Automatisierte Generierung dynamischer Systemrechnermodelle aus unstrukturiertem Naturtext über verbesserte Systemmodellierung Sprachdiagramme | 通过 SysML 自动生成动态系统计算模型,通过强化系统模拟图,从未结构化的自然语言文本生成动态系统计算模型 2507.06803v2 |
Authors (2): Matthew Anderson Hendricks, Alice Cicirello
This paper contributes to speeding up the design and deployment of engineering dynamical systems by proposing a strategy that exploits domain and expert knowledge to automatically generate dynamical system computational models, starting from a corpus of documents relevant to the dynamical system of interest and an input document describing the specific system. The strategy is implemented in five steps and, crucially, uses System Modeling Language (SysML) diagrams to extract accurate information about the dependencies, attributes, and operations of components. Natural Language Processing (NLP) strategies and Large Language Models (LLMs) are employed in specific tasks to improve the intermediate outputs of the automated SysML diagram generation, such as: list of key nouns; list of extracted relationships; list of key phrases and key relationships; block attribute values; block relationships; and BDD diagram generation. The applicability of automated SysML diagram generation is illustrated with different case studies. The computational models of complex dynamical systems are then obtained from the SysML diagrams via code generation and computational model generation steps. In the code generation step, NLP strategies are used for summarization, while LLMs are used for validation only. The proposed approach is not limited to a specific system, domain, or computational software. Its applicability is shown via an end-to-end example from text to model of a simple pendulum, showing improved performance compared to results yielded by LLMs only.
本文有助于加快工程动态系统的设计和部署,办法是提出一项战略,利用域和专家知识,自动生成动态系统计算模型,从与动态兴趣系统有关的文件库和描述具体系统的输入文件库开始,从而自动生成动态系统计算模型。这一战略分五个步骤实施,关键是使用系统模拟语言图(SysML),以获取关于各组成部分依赖性、属性和操作的准确信息。自然语言处理(NLP)战略和大语言模型(LLMS)用于具体任务,以改进SYSML图表自动生成的中间产出,例如:关键名词列表;提取关系列表;关键短语和关键关系列表;区属性值;块关系;以及BDDD图生成。自动生成SysML图(SML图)的实用性用不同的案例研究来说明。随后通过代码生成和计算模型生成步骤获得复杂的动态系统的计算模型。在代码生成步骤中,使用NLP战略来进行合成,而LMS-LMS战略则用于合成,而LMS-LMS则用于通过特定的域图进行特定的校验。建议,仅显示通过特定的域到软件的校验。
Article 229
Title@2025-07-15 (2): Compression Hacking: A Supplementary Perspective on Informatics Properties of Language Models from Geometric Distortion
Title: Compression Hacking: A Supplementary Perspective on Informatics Properties of Language Models from Geometric Distortion | Kompression Hacking: Eine zusätzliche Perspektive auf Informatik-Eigenschaften von Sprachmodellen aus geometrischer Verzerrung | 压缩包装:几何扭曲对语言模型信息学属性的补充观点 2505.17793v2 |
Authors (10): Jianxiang Zang, Meiling Ning, Yongda Wei, Shihan Dou, Jiazheng Zhang, Nijia Mo, Binhong Li, Tao Gui, Qi Zhang, Xuanjing Huang
Recently, the concept of “compression as intelligence” has provided a novel informatics metric perspective for language models (LMs), emphasizing that highly structured representations signify the intelligence level of LMs. However, from a geometric standpoint, the word representation space of highly compressed LMs tends to degenerate into a highly anisotropic state, which hinders the LM's ability to comprehend instructions and directly impacts its performance. We found this compression-anisotropy synchronicity is essentially the “Compression Hacking” in LM representations, where noise-dominated directions tend to create the illusion of high compression rates by sacrificing spatial uniformity. Based on this, we propose three refined compression metrics by incorporating geometric distortion analysis and integrate them into a self-evaluation pipeline. The refined metrics exhibit strong alignment with the LM’s comprehensive capabilities, achieving Spearman correlation coefficients above 0.9, significantly outperforming both the original compression and other internal structure-based metrics. This confirms that compression hacking substantially enhances the informatics interpretation of LMs by incorporating geometric distortion of representations.
最近,“压缩作为情报”的概念为语言模型提供了一种新的信息学衡量观点,强调高度结构化的表述方式代表了LMs的智能水平。然而,从几何角度看,高度压缩LMs的字表达空间往往会退化成一种高度厌食性状态,这妨碍了LM理解指示的能力并直接影响其性能。我们发现,这种压缩-氮化同步性基本上是LM表示方式中的“压缩打包”,在LM表示方式中,以噪音为主的方向往往通过牺牲空间统一性而产生高压缩率的幻觉。基于这一点,我们建议采用三种精细的压缩尺度,即纳入几何扭曲分析并将其纳入自我评价管道。经过改进的尺度显示,与LMs的综合能力非常一致,达到0.9以上的Spearman相关系数,大大优于原先的压缩和其他以内部结构为基础的衡量标准。这证实,压缩黑能通过纳入对陈述的几何扭曲,大大增强了LMs对信息的解读。
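Note: the abstract does not say how anisotropy is measured. A commonly used proxy, shown here purely as an assumption, is the mean pairwise cosine similarity of word representations: values near 1 mean the vectors crowd into a narrow cone (highly anisotropic), and values near 0 mean the directions are spread out. This is not one of the paper's three refined metrics.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all unordered pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)
```

Vectors that all point the same way give a score of 1, while two orthogonal vectors give 0.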
Article 230
Title@2025-07-15 (2): SECURE: Semantics-aware Embodied Conversation under Unawareness for Lifelong Robot Learning
Title: SECURE: Semantics-aware Embodied Conversation under Unawareness for Lifelong Robot Learning | SICHERHEIT: Semantik-bewusst Verkörpertes Gespräch unter Unwahrnehmung für lebenslanges Robot Learning | SECURRE: 终身机器人学习意识不足的语义学意识内嵌入式对话 2409.17755v3 |
Authors (4): Rimvydas Rubavicius, Peter David Fagan, Alex Lascarides, Subramanian Ramamoorthy
This paper addresses a challenging interactive task learning scenario we call rearrangement under unawareness: an agent must manipulate a rigid-body environment without knowing a key concept necessary for solving the task and must learn about it during deployment. For example, the user may ask to “put the two granny smith apples inside the basket”, but the agent cannot correctly identify which objects in the environment are “granny smith” as the agent has not been exposed to such a concept before. We introduce SECURE, an interactive task learning policy designed to tackle such scenarios. The unique feature of SECURE is its ability to enable agents to engage in semantic analysis when processing embodied conversations and making decisions. Through embodied conversation, a SECURE agent adjusts its deficient domain model by engaging in dialogue to identify and learn about previously unforeseen possibilities. The SECURE agent learns from the user’s embodied corrective feedback when mistakes are made and strategically engages in dialogue to uncover useful information about novel concepts relevant to the task. These capabilities enable the SECURE agent to generalize to new tasks with the acquired knowledge. We demonstrate in the simulated Blocksworld and the real-world apple manipulation environments that the SECURE agent, which solves such rearrangements under unawareness, is more data-efficient than agents that do not engage in embodied conversation or semantic analysis.
本文涉及一个具有挑战性的互动任务学习方案,我们称之为在不知情的情况下重新安排任务:代理人必须操纵一个僵硬的身体环境,而不知道解决任务所需的关键概念,而且必须在部署期间了解它。例如,用户可能要求“把两个大奶奶的铁苹果放在篮子内”,但代理人无法正确识别环境中哪些物体是“工匠”,因为代理人以前没有接触过这种概念。我们引入了SECURRE,这是旨在应对这种设想的互动式任务学习政策。SECURRE的独特特征是它能够使代理人在处理包含谈话和决定的谈话时进行语义分析。我们通过包含的谈话,SECURRE代理人通过参与对话来调整其缺陷的域模型,以便查明和了解以前未曾预见到的可能性。SECURE代理从用户中学习了纠正性反馈,因为错误发生时,并战略性地参与了对话,以发现与任务相关的新概念的有用信息。这些能力使SECURE代理能够以获得的知识来概括新的任务。我们在模拟的屏障世界和现实-世界的苹果操作模式中展示了它的缺陷模式模式模式模式模式,在SEECREA 分析中,而没有体现更高效的磁感化了SEULA 。
Article 231
Title@2025-07-15 (2): FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning
Title: FalseReject: A Resource for Improving Contextual Safety and Mitigating Over-Refusals in LLMs via Structured Reasoning | FalseReject: Eine Ressource zur Verbesserung der kontextuellen Sicherheit und zur Abmilderung von Überwiderständen in LLMs durch strukturierte Vernunft | 假反射:一种资源,用于通过结构化理由改进环境安全和减轻LLMs的过度拒绝 2505.08054v2 |
Authors (4): Zhehao Zhang, Weijie Xu, Fanyou Wu, Chandan K. Reddy
Safety alignment approaches in large language models (LLMs) often lead to the over-refusal of benign queries, significantly diminishing their utility in sensitive scenarios. To address this challenge, we introduce FalseReject, a comprehensive resource containing 16k seemingly toxic queries accompanied by structured responses across 44 safety-related categories. We propose a graph-informed adversarial multi-agent interaction framework to generate diverse and complex prompts, while structuring responses with explicit reasoning to aid models in accurately distinguishing safe from unsafe contexts. FalseReject includes training datasets tailored for both standard instruction-tuned models and reasoning-oriented models, as well as a human-annotated benchmark test set. Our extensive benchmarking on 29 state-of-the-art (SOTA) LLMs reveals persistent over-refusal challenges. Empirical results demonstrate that supervised finetuning with FalseReject substantially reduces unnecessary refusals without compromising overall safety or general language capabilities.
在大型语言模型中,安全协调方法往往导致过度拒绝良性查询,大大降低其在敏感情况下的效用。为了应对这一挑战,我们引入了“假射”,这是一个综合资源,包含16k个看似有毒的查询,并附有44个与安全有关的类别有条理的反应。我们提议了一个基于图形的对抗性多剂互动框架,以产生多样和复杂的提示,同时以明确的推理来构建反应结构,帮助模型准确区分安全与不安全环境。“假射”包括针对标准指导模式和面向推理的模式的培训数据集,以及一套附有人文说明的基准测试集。我们对29个先进(SOTA)LMs的广泛基准显示长期的过度反驳挑战。经验显示,在监督下对假射的微调会大大减少不必要的拒绝,同时不损害整体安全或一般语言能力。
Article 232
Title@2025-07-15 (2): Is Compression Really Linear with Code Intelligence?
Title: Is Compression Really Linear with Code Intelligence? | Ist Kompression wirklich linear mit Code Intelligence? | 压缩真的有代码情报线条吗? 2505.11441v4 |
Authors (12): Shijie Xuyang, Xianzhen Luo, Tianhao Cheng, Zheng Chu, Houyi Li, ziqi wang, Siming Huang, Qingfu Zhu, Qiufeng Wang, Xiangyu Zhang, Shuigeng Zhou, Wanxiang Che
Understanding the relationship between data compression and the capabilities of Large Language Models (LLMs) is crucial, especially in specialized domains like code intelligence. Prior work posited a linear relationship between compression and general intelligence. However, it overlooked the multifaceted nature of code that encompasses diverse programming languages and tasks, and struggled with fair evaluation of modern Code LLMs. We address this by evaluating a diverse array of open-source Code LLMs on comprehensive multi-language, multi-task code benchmarks. To address the challenge of efficient and fair evaluation of pre-trained LLMs’ code intelligence, we introduce Format Annealing, a lightweight, transparent training methodology designed to assess the intrinsic capabilities of these pre-trained models equitably. Compression efficacy, measured as bits-per-character (BPC), is determined using a novel, large-scale, and previously unseen code validation set derived from GitHub. Our empirical results reveal a fundamental logarithmic relationship between measured code intelligence and BPC. This finding refines prior hypotheses of linearity, which we suggest are likely observations of the logarithmic curve’s tail under specific, limited conditions. Our work provides a more nuanced understanding of compression’s role in developing code intelligence and contributes a robust evaluation framework in the code domain.
理解数据压缩和大语言模型能力之间的关系至关重要,特别是在代码情报等专门领域。先前的工作在压缩和一般情报之间建立了线性关系。然而,它忽略了包含多种编程语言和任务并努力公平评估现代代码LLMs的守则的多面性。我们通过对多种语言、多任务代码综合基准的各种开放源代码LLMs进行评估来解决这个问题。为了应对有效、公平地评估受过训练的LLMs代码情报的挑战,我们引入了一种轻量级、透明的培训方法,以公平评估这些事先培训的模型的内在能力。以每字字字缩写(BPC)衡量的调效是使用从GitHub得到的新型、大规模和以前看不见的代码校正性校正,我们的经验结果揭示了计量的代码情报和BPC之间的基本对数关系。我们发现以前的线性假设,我们认为,在具体领域对对对正对正数曲线的轨曲线进行观察,我们提出了一种有限的评估。
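Note: a minimal sketch of the two quantities the abstract relates. Bits-per-character is computed from per-token negative log-likelihoods in nats, and the logarithmic relationship is illustrated with a least-squares fit of `score = a + b * ln(BPC)`. That functional form, the function names, and counting characters as Python code points are assumptions; the abstract states only that the relationship is logarithmic.

```python
import math

def bits_per_character(token_nll_nats, text):
    """Total negative log-likelihood, converted to bits, per character of text."""
    return sum(token_nll_nats) / (math.log(2) * len(text))

def fit_log_curve(bpc_values, scores):
    """Least-squares fit of score = a + b * ln(bpc); returns (a, b)."""
    xs = [math.log(v) for v in bpc_values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b
```

A negative `b` would mean that a lower BPC (better compression) goes with a higher benchmark score. Over a narrow BPC range such a curve looks close to a straight line, which is consistent with the abstract's remark about earlier linear findings.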
Article 233
Title@2025-07-15 (2): Style over Substance: Distilled Language Models Reason Via Stylistic Replication
Title: Style over Substance: Distilled Language Models Reason Via Stylistic Replication | Stil über Substanz: Destillierte Sprachmodelle Grund über stylistische Replication | 物质之上的样式: 蒸馏语言模型 2504.01738v3 |
Authors (2): Philip Lippmann, Jie Yang
Specialized reasoning language models (RLMs) have demonstrated that scaling test-time computation through detailed reasoning traces significantly enhances performance. Although these traces effectively facilitate knowledge distillation into smaller, instruction-tuned models, the precise nature of transferred reasoning remains unclear. In this study, we investigate to what extent distilled models internalize replicated stylistic patterns during reasoning. To this end, we systematically analyze reasoning traces, identifying structural and lexical patterns that characterize successful reasoning. We then introduce two new datasets – a dataset of emergent reasoning traces and a synthetic dataset explicitly constructed to replicate these stylistic patterns – to precisely examine their influence on distilled models’ reasoning capabilities. We find that models trained on the synthetic traces achieve comparable performance, indicating that distilled reasoning abilities rely significantly on surface-level patterns. Surprisingly, we observe an increase in performance even when the synthetic traces are altered to lead to the wrong answer. Our findings highlight how stylistic patterns can be leveraged to efficiently enhance LM reasoning across diverse model families.
专业推理语言模型(RLMS)表明,通过详细推理进行缩放测试时间的计算,极大地提高了绩效。虽然这些痕迹有效地促进了知识蒸馏成较小的、按指示调整的模型,但转移推理的确切性质仍然不明确。在本研究中,我们调查了在推理过程中,在何种程度上蒸馏的模型内化了复制的文体模式。为此,我们系统地分析推理痕迹,查明作为成功推理特点的结构和词汇模式。然后我们引入了两个新的数据集 – – 一组由新兴推理痕迹组成的数据集,以及一套为复制这些模式而明确设计的合成数据集 – – 以精确地检查其对精炼模型推理能力的影响。我们发现,经过培训的合成模型取得了可比较的性能,表明蒸馏推理能力在很大程度上依赖地表水平模式。令人惊讶的是,我们观察到,即使在合成痕迹被改变以导致错误答案时,性能也有所提高。我们的调查结果突出表明,如何利用典型模式来有效提高不同模型家庭的LM推理力。
Article 234
Title@2025-07-15 (2): What Should LLMs Forget? Quantifying Personal Data in LLMs for Right-to-Be-Forgotten Requests
Title: What Should LLMs Forget? Quantifying Personal Data in LLMs for Right-to-Be-Forgotten Requests | Was sollten LLMs vergessen? Quantifizierung personenbezogener Daten in LLMs für rechts-zu-vergessene Anfragen | 普法女士应忘记什么? 将个人数据量化为 “ 有权被遗忘的请求 “ 的 “ 普法女士 “ 中的 “ 个人数据 “ 。 2507.11128v1 |
Authors (1): Dimitri Staufer
Large Language Models (LLMs) can memorize and reveal personal information, raising concerns regarding compliance with the EU’s GDPR, particularly the Right to Be Forgotten (RTBF). Existing machine unlearning methods assume the data to forget is already known but do not address how to identify which individual-fact associations are stored in the model. Privacy auditing techniques typically operate at the population level or target a small set of identifiers, limiting applicability to individual-level data inquiries. We introduce WikiMem, a dataset of over 5,000 natural language canaries covering 243 human-related properties from Wikidata, and a model-agnostic metric to quantify human-fact associations in LLMs. Our approach ranks ground-truth values against counterfactuals using calibrated negative log-likelihood across paraphrased prompts. We evaluate 200 individuals across 15 LLMs (410M-70B parameters), showing that memorization correlates with subject web presence and model scale. We provide a foundation for identifying memorized personal data in LLMs at the individual level, enabling the dynamic construction of forget sets for machine unlearning and RTBF requests.
大型语言模型(LLMS)可以记住和披露个人信息,引起人们对遵守欧盟GDPR,特别是被遗忘的权利的关切。现有的机器不学习方法假定数据已经为人所知,但并不涉及如何确定模型中储存的个人活动协会。隐私审计技术通常在人口层面运作,或针对少量的识别资料,限制对个人数据调查的适用性。我们引入了维基Mem,一套包含维基数据中涉及243个与人类相关特性的5 000多个自然语言流体的数据集,以及一套用于量化LLMS中人类活动协会的模型 – – 不可知度指标。我们的方法利用校准的负日志/日志对反事实进行排名。我们评估了15个LMS(410M-70B参数)的200人,表明记忆与主题网络存在和模型规模有关。我们提供了一个基础,用以在LMS中识别个人层面的243个与人有关的个人数据,从而能够动态地构建不学习机器和RTBF要求的遗忘数据集。
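Note: a simplified sketch of ranking a ground-truth value against counterfactuals by negative log-likelihood averaged over paraphrased prompts. The scoring function `nll` is a hypothetical callable supplied by the caller, and the calibration step mentioned in the abstract is omitted because its details are not given there.

```python
def rank_of_truth(nll, prompts, truth, counterfactuals):
    """Rank of the true value among candidates (1 = most likely under the model).

    nll(prompt, candidate) must return the model's negative log-likelihood
    of `candidate` as a completion of `prompt`; lower means more likely.
    """
    def mean_nll(candidate):
        return sum(nll(p, candidate) for p in prompts) / len(prompts)

    truth_score = mean_nll(truth)
    # Count counterfactuals the model finds strictly more likely than the truth.
    better = sum(1 for c in counterfactuals if mean_nll(c) < truth_score)
    return 1 + better
```

A rank of 1 across many paraphrases would suggest that the model associates the fact with the person more strongly than any sampled alternative.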
Article 235
Title@2025-07-15 (2): Plancraft: an evaluation dataset for planning with LLM agents
Title: Plancraft: an evaluation dataset for planning with LLM agents | Plancraft: ein Auswertungsdatensatz für die Planung mit LLM-Agenten | 规划:用于与LLM代理商规划的评价数据集 2412.21033v2 |
Authors (3): Gautier Dagan, Frank Keller, Alex Lascarides
We present Plancraft, a multi-modal evaluation dataset for LLM agents. Plancraft has both a text-only and multi-modal interface, based on the Minecraft crafting GUI. We include the Minecraft Wiki to evaluate tool use and Retrieval Augmented Generation (RAG), as well as a handcrafted planner and Oracle Retriever, to ablate the different components of a modern agent architecture. To evaluate decision-making, Plancraft also includes a subset of examples that are intentionally unsolvable, providing a realistic challenge that requires the agent not only to complete tasks but also to decide whether they are solvable at all. We benchmark both open-source and closed-source LLMs and compare their performance and efficiency to a handcrafted planner. Overall, we find that LLMs and VLMs struggle with the planning problems that Plancraft introduces, and offer suggestions on how to improve their capabilities.
我们介绍了Plancraft,这是LLM代理商的多模式评估数据集。Plancraft有一个基于Minecraft 设计图形界面的文本和多模式界面。我们包括用于评估工具使用和检索回溯增强一代(RAG)的Minecraft Wiki(Minecraft Wiki),以及一个手工设计的规划师和Oracle Retriever(Oracle Retriever),目的是消除现代代理商结构的不同组成部分。为了评估决策,Plancraft还包含一系列有意无法解决的例子,提供了现实的挑战,要求该代理商不仅要完成任务,而且还要决定它们是否完全可以溶解。我们把开放源和封闭源LMs作为基准,并将它们的业绩和效率与手工设计的规划师进行比较。总的来说,我们发现LLMs和VLMs在与Plancraft提出的规划问题作斗争,并就如何提高它们的能力提出建议。
Article 236
Title@2025-07-15 (2): Evaluating Multimodal Large Language Models on Educational Textbook Question Answering
Title: Evaluating Multimodal Large Language Models on Educational Textbook Question Answering | Bewertung multimodaler großer Sprachmodelle auf pädagogischer Lehrbuchfragebeantwortung | 评价教育教科书问题解答多式大语言多语言模式 2506.21596v2 |
Authors (6): Hessa A. Alawwad, Anas Zafar, Areej Alhothali, Usman Naseem, Ali Alkhathlan, Amani Jamal
Multimodal large language models (MLLMs) have shown success in vision-language tasks, but their ability to reason over complex educational materials remains largely untested. This work presents the first evaluation of state-of-the-art MLLMs, including LLaVA-1.5 and LLaMA 3.2-Vision, on the textbook question answering (TQA) task using the CK12-QA dataset. We introduce a multimodal retrieval-augmented generation (RAG) pipeline to simulate real-world learning by providing relevant lesson paragraphs and diagrams as context. Our zero-shot experiments reveal a critical trade-off: while retrieved context improves LLaVA’s performance on text-based questions, it significantly degrades the accuracy of the more powerful LLaMA 3.2-Vision on diagram-based tasks, dropping its validation accuracy from 74.07% to 25.93%. We term this statistically significant phenomenon “catastrophic context interference.” Furthermore, fine-tuning highlights architectural differences: LLaMA 3.2-Vision’s performance improves to 71.16% on the test set, demonstrating its capacity to learn multimodal integration, whereas LLaVA’s performance declines, indicating challenges with generalization. Our results underscore the challenges MLLMs face in modality prioritization and context integration, providing a benchmark and pointing to key directions for developing more robust AI-driven educational tools.
多式大型语言模型(MLLM)在愿景语言任务中表现出了成功,但其对复杂教材的思考能力在很大程度上尚未测试。这项工作对包括LLAVA-1.5和LLAMA3.2-3.2-Vision在内的最先进的MLLLMS(包括LLAVA-1.5和LLLAMA3.2-3.2-Vision)在使用CK12-QA数据集的教科书答题(TQA)任务(TQA)任务上进行了首次评估。我们引入了一种多式检索-强化的一代(RAG)管道,以模拟真实世界学习,为此提供了相关的教益段落和图表。我们的零点实验实验显示了一个关键的交换:在检索环境中,LALAVA提高了其在基于文本的问题上的绩效,同时提高了LLAVA3.2-VA在基于图表的任务上更强大的LA3.2-VA的精确度,将校准的准确度从74.07%降至25.93%。我们称这一具有统计意义的现象为“营养环境干扰”。此外,细微调整了建筑差异差异:LA3.2-VA的性业绩在测试上的表现改进了71.16%,展示了它在学习模式上所面临的能力,同时展示了在质量上,同时展示了在MLULLLA的学习方向,提供了一种方向上的发展方向。
Article 237
Title@2025-07-15 (2): MSA at ImageCLEF 2025 Multimodal Reasoning: Multilingual Multimodal Reasoning With Ensemble Vision Language Models
Title: MSA at ImageCLEF 2025 Multimodal Reasoning: Multilingual Multimodal Reasoning With Ensemble Vision Language Models | MSA bei ImageCLEF 2025 Multimodale Reasoning: Multilingual Multimodale Reasoning mit Ensemble Vision Language Models | 2025年多模式理由:多语言多语言多语言多语种理由,包含多种愿景语言模式 2507.11114v1 |
Authors (5): Seif Ahmed, Mohamed T. Younes, Abdelrahman Moustafa, Abdelrahman Allam, Hamza Moustafa
We present a robust ensemble-based system for multilingual multimodal reasoning, designed for the ImageCLEF 2025 EXAMS V challenge. Our approach integrates Gemini 2.5 Flash for visual description, Gemini 1.5 Pro for caption refinement and consistency checks, and Gemini 2.5 Pro as a reasoner which handles final answer selection, all coordinated through carefully engineered few-shot and zero-shot prompts. We conducted an extensive ablation study, training several large language models (Gemini 2.5 Flash, Phi 4, Gemma 3, Mistral) on an English dataset and its multilingual augmented version. Additionally, we evaluated Gemini 2.5 Flash in a zero-shot setting for comparison and found it to substantially outperform the trained models. Prompt design also proved critical: enforcing concise, language-normalized formats and prohibiting explanatory text boosted model accuracy on the English validation set from 55.9% to 61.7%. On the official leaderboard, our system (Team MSA) achieved first place overall in the multilingual track with 81.4% accuracy, and led 11 out of 13 individual language tracks, with top results such as 95.07% for Croatian and 92.12% for Italian. These findings highlight that lightweight OCR-VLM ensembles, when paired with precise prompt strategies and cross-lingual augmentation, can outperform heavier end-to-end models in high-stakes, multilingual educational settings.
我们为ImageCLEF 2025 EXAMS V挑战提供了一个强大的基于集成的多语种多模态推理系统。我们的方法将Gemini 2.5 Flash(用于视觉描述)、Gemini 1.5 Pro(用于字幕完善和一致性检查)和Gemini 2.5 Pro(作为处理最终答案选择的推理器)结合起来,所有这些都通过精心设计的少样本和零样本提示加以协调。我们进行了广泛的消融研究,在英语数据集及其多语种扩充版上训练了若干大语言模型(Gemini 2.5 Flash、Phi 4、Gemma 3、Mistral)。此外,我们在零样本设置下评估了Gemini 2.5 Flash以作比较,发现它大大优于经过训练的模型。提示设计也被证明至关重要:强制使用简洁的、语言规范化的格式并禁止解释性文本,将模型在英语验证集上的准确率从55.9%提高到61.7%。在官方排行榜上,我们的系统(Team MSA)以81.4%的准确率在多语种赛道上总体排名第一,并在13个单语言赛道中的11个领先,其中克罗地亚语达到95.07%,意大利语达到92.12%。这些发现表明,轻量级OCR-VLM集成系统在配合精确的提示策略和跨语言增强时,能够在高风险的多语种教育场景中胜过更重的端到端模型。
Article 238
Title@2025-07-15 (2): Multi-Trigger Poisoning Amplifies Backdoor Vulnerabilities in LLMs
Title: Multi-Trigger Poisoning Amplifies Backdoor Vulnerabilities in LLMs | Multi-Trigger-Vergiftung verstärkt Sicherheitslücken in LLMs | 多触发中毒行为放大了LLM 的后门脆弱性 2507.11112v1 |
Authors (4): Sanhanat Sivapiromrat, Caiqi Zhang, Marco Basaldella, Nigel Collier
Recent studies have shown that Large Language Models (LLMs) are vulnerable to data poisoning attacks, where malicious training examples embed hidden behaviours triggered by specific input patterns. However, most existing works assume a single trigger phrase and focus on the attack’s effectiveness, offering limited understanding of trigger mechanisms and how multiple triggers interact within the model. In this paper, we present a framework for studying poisoning in LLMs. We show that multiple distinct backdoor triggers can coexist within a single model without interfering with each other, enabling adversaries to embed several triggers concurrently. Using multiple triggers with high embedding similarity, we demonstrate that poisoned triggers can achieve robust activation even when tokens are substituted or separated by long token spans. Our findings expose a broader and more persistent vulnerability surface in LLMs. To mitigate this threat, we propose a post hoc recovery method that selectively retrains specific model components based on a layer-wise weight difference analysis. Our method effectively removes the trigger behaviour with minimal parameter updates, presenting a practical and efficient defence against multi-trigger poisoning.
最近的研究显示,大语言模型(LLMs)很容易受到数据中毒攻击,因为恶意培训例子含有特定输入模式引发的隐蔽行为;然而,大多数现有工作假设了一个短语,侧重于攻击的效果,对触发机制以及多重触发机制在模型中的互动作用有有限的了解。在本文中,我们提出了一个研究LLMs中毒情况的框架。我们表明,多种不同的后门触发器可以在单一模型中共存,而不会相互干扰,使对手能够同时嵌入多个触发器。我们利用多个嵌入性非常相似的触发器,证明有毒触发器即使在代用品被长期代换或隔开时也能实现强有力的激活。我们的调查结果暴露了LMS中更广泛和更持久的脆弱性表面。为了减轻这一威胁,我们建议了一种后期恢复方法,根据分层重量差异分析有选择地对特定模型组件进行再tra。我们的方法有效地消除了触发行为,只提供最短的参数更新,并提供了防止多发中毒的实用而有效的防御。
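The layer-wise weight difference analysis mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not say which checkpoints are compared or which metric is used, so everything here is an assumption: the comparison against a reference checkpoint, the L2 norm, the top-k selection, and the names layer_diff_norms and layers_to_retrain are all hypothetical.

```python
import math

def layer_diff_norms(reference, suspect):
    """Return {layer_name: L2 norm of (suspect - reference)} per layer.

    Both arguments map layer names to flat lists of weights.
    The L2 norm is an assumed metric, not the paper's stated one.
    """
    norms = {}
    for name, ref_w in reference.items():
        sus_w = suspect[name]
        norms[name] = math.sqrt(sum((s - r) ** 2 for s, r in zip(sus_w, ref_w)))
    return norms

def layers_to_retrain(reference, suspect, k=1):
    """Pick the k layers whose weights moved the most (hypothetical rule)."""
    norms = layer_diff_norms(reference, suspect)
    return sorted(norms, key=norms.get, reverse=True)[:k]

# Toy weights, invented purely for illustration.
reference = {"layer0": [0.1, 0.2], "layer1": [0.3, 0.4], "layer2": [0.5, 0.6]}
suspect = {"layer0": [0.1, 0.2], "layer1": [0.9, -0.2], "layer2": [0.5, 0.7]}
selected = layers_to_retrain(reference, suspect, k=1)
print(selected)
```

Only the selected layers would then be retrained, which is one way to keep parameter updates small.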
Article 239
Title@2025-07-15 (2): Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs
Title: Traveling Across Languages: Benchmarking Cross-Lingual Consistency in Multimodal LLMs | Sprachenübergreifendes Reisen: Benchmarking Cross-Lingual Consistency in multimodalen LLMs | 跨语言旅行:多模式LLM中跨语言一致基准 2505.15075v3 |
Authors (5): Hao Wang, Pinzhi Huang, Jihan Yang, Saining Xie, Daisuke Kawahara
The rapid evolution of multimodal large language models (MLLMs) has significantly enhanced their real-world applications. However, achieving consistent performance across languages, especially when integrating cultural knowledge, remains a significant challenge. To better assess this issue, we introduce two new benchmarks: KnowRecall and VisRecall, which evaluate cross-lingual consistency in MLLMs. KnowRecall is a visual question answering benchmark designed to measure factual knowledge consistency in 15 languages, focusing on cultural and historical questions about global landmarks. VisRecall assesses visual memory consistency by asking models to describe landmark appearances in 9 languages without access to images. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, still struggle to achieve cross-lingual consistency. This underscores the need for more robust approaches that produce truly multilingual and culturally aware models.
多式联运大型语言模式(MLLMs)的迅速演变大大加强了其实际应用,然而,在各种语言之间取得一致的成绩,特别是在融合文化知识方面,仍是一项重大挑战。为了更好地评估这一问题,我们引入了两个新基准:知识回召和Vis回召,评估MLLMs的跨语言一致性。Know Recreme是一个直观问题,用来衡量15种语言的实际知识一致性,重点是有关全球里程碑的文化问题和历史问题。VisRecall 评估了视觉记忆的一致性,要求模型描述9种语言的标志性外观,但没有图像。实验结果显示,最先进的MLLLMs,包括专有的MLLMs,仍然难以实现跨语言的一致性。这突出表明,需要采取更强有力的方法,产生真正的多语言和文化意识模式。
Article 240
Title@2025-07-15 (2): The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs
Title: The Devil behind the mask: An emergent safety vulnerability of Diffusion LLMs | Der Teufel hinter der Maske: Eine emergente Sicherheitsanfälligkeit von Diffusion LLMs | 面具背后的魔鬼:扩散液晶体的突发性安全脆弱性 2507.11097v1 |
Authors (14): Zichen Wen, Jiashu Qu, Dongrui Liu, Zhiyuan Liu, Ruixi Wu, Yicun Yang, Xiangqi Jin, Haoyun Xu, Xuyang Liu, Weijia Li, Chaochao Lu, Jing Shao, Conghui He, Linfeng Zhang
Diffusion-based large language models (dLLMs) have recently emerged as a powerful alternative to autoregressive LLMs, offering faster inference and greater interactivity via parallel decoding and bidirectional modeling. However, despite strong performance in code generation and text infilling, we identify a fundamental safety concern: existing alignment mechanisms fail to safeguard dLLMs against context-aware, masked-input adversarial prompts, exposing novel vulnerabilities. To this end, we present DIJA, the first systematic study and jailbreak attack framework that exploits unique safety weaknesses of dLLMs. Specifically, our proposed DIJA constructs adversarial interleaved mask-text prompts that exploit the text generation mechanisms of dLLMs, i.e., bidirectional modeling and parallel decoding. Bidirectional modeling drives the model to produce contextually consistent outputs for masked spans, even when harmful, while parallel decoding limits model dynamic filtering and rejection sampling of unsafe content. This causes standard alignment mechanisms to fail, enabling harmful completions in alignment-tuned dLLMs, even when harmful behaviors or unsafe instructions are directly exposed in the prompt. Through comprehensive experiments, we demonstrate that DIJA significantly outperforms existing jailbreak methods, exposing a previously overlooked threat surface in dLLM architectures. Notably, our method achieves up to 100% keyword-based ASR on Dream-Instruct, surpassing the strongest prior baseline, ReNeLLM, by up to 78.5% in evaluator-based ASR on JailbreakBench and by 37.7 points in StrongREJECT score, while requiring no rewriting or hiding of harmful content in the jailbreak prompt. Our findings underscore the urgent need for rethinking safety alignment in this emerging class of language models. Code is available at https://github.com/ZichenWen1/DIJA.
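基于扩散的大语言模型(dLLMs)最近已成为自回归LLMs的有力替代方案,通过并行解码和双向建模提供更快的推理和更强的交互性。然而,尽管在代码生成和文本填充方面表现强劲,我们发现了一个根本性的安全问题:现有的对齐机制无法保护dLLMs免受上下文感知的掩码输入对抗性提示的攻击,从而暴露出新的漏洞。为此,我们提出了DIJA,这是首个利用dLLMs独特安全弱点的系统性研究和越狱攻击框架。具体而言,我们提出的DIJA构建对抗性的掩码与文本交错提示,利用dLLMs的文本生成机制,即双向建模和并行解码。双向建模驱使模型为被掩码的片段生成与上下文一致的输出,即使这些输出有害;而并行解码则限制了模型对不安全内容的动态过滤和拒绝采样。这导致标准对齐机制失效,使经过对齐调优的dLLMs生成有害的补全内容,即使有害行为或不安全指令直接暴露在提示中也是如此。通过全面的实验,我们证明DIJA显著优于现有的越狱方法,暴露了dLLM架构中此前被忽视的威胁面。值得注意的是,我们的方法在Dream-Instruct上实现了高达100%的基于关键词的ASR,在JailbreakBench上基于评估器的ASR比此前最强的基线ReNeLLM高出多达78.5%,在StrongREJECT得分上高出37.7分,同时无需在越狱提示中改写或隐藏有害内容。我们的研究结果强调,迫切需要重新思考这一新兴语言模型类别的安全对齐问题。代码见 https://github.com/ZichenWen1/DIJA。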
Article 241
Title@2025-07-15 (2): Beyond Traditional Algorithms: Leveraging LLMs for Accurate Cross-Border Entity Identification
Title: Beyond Traditional Algorithms: Leveraging LLMs for Accurate Cross-Border Entity Identification | Über traditionelle Algorithmen hinaus: LLMs für eine genaue Cross-Border-Entity-Identifikation nutzen | 超越传统算法:利用LMLMs进行准确的跨界实体识别 2507.11086v1 |
Authors (2): Andres Azqueta-Gavaldón, Joaquin Ramos Cosgrove
The growing prevalence of cross-border financial activities in global markets has underscored the necessity of accurately identifying and classifying foreign entities. This practice is essential within the Spanish financial system for ensuring robust risk management, regulatory adherence, and the prevention of financial misconduct. This process involves a labor-intensive entity-matching task, where entities need to be validated against available reference sources. Challenges arise from linguistic variations, special characters, outdated names, and changes in legal forms, complicating traditional matching algorithms like Jaccard, cosine, and Levenshtein distances. These methods struggle with contextual nuances and semantic relationships, leading to mismatches. To address these limitations, we explore Large Language Models (LLMs) as a flexible alternative. LLMs leverage extensive training to interpret context, handle abbreviations, and adapt to legal transitions. We evaluate traditional methods, Hugging Face-based LLMs, and interface-based LLMs (e.g., Microsoft Copilot, Alibaba’s Qwen 2.5) using a dataset of 65 Portuguese company cases. Results show traditional methods achieve accuracies over 92% but suffer high false positive rates (20-40%). Interface-based LLMs outperform, achieving accuracies above 93%, F1 scores exceeding 96%, and lower false positives (40-80%).
跨国金融活动在全球市场上日益普遍,这凸显了准确识别和分类外国实体的必要性。这种做法在西班牙金融系统中对于确保稳健的风险管理、监管遵守和防止金融不当行为至关重要。这一过程涉及劳动密集型实体匹配任务,实体需要根据现有的参考来源加以验证。挑战来自语言差异、特殊字符、过时名称和法律形式的变化,使Jaccar、cosine和Levenshtein距离等传统匹配算法复杂化。这些方法与背景微妙和语义关系挣扎,导致不匹配。为了解决这些限制,我们探索大语言模式(LLLMs)作为灵活的替代方案。LLMS利用广泛的培训来解释背景、处理缩写和适应法律过渡。我们用65个葡萄牙公司案例的数据集评估传统方法、Hugging 脸型LMs、界面LMs(例如Microsoft Cocopil、Alibaba’s Quen 2.5) 。结果显示传统方法达到92%以上,但遭受高反正率(20-40 % ) 和低正式LMs-39 % 。
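For readers unfamiliar with the traditional matchers named in the abstract above, here is a minimal sketch of two of them, token-level Jaccard similarity and Levenshtein edit distance. The company names are invented for illustration and do not come from the paper's dataset; whitespace tokenization and lowercasing are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity over lowercased whitespace tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)

def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

# Hypothetical names that differ only in how the legal form is written.
name_a = "Banco Exemplo SA"
name_b = "Banco Exemplo S.A."
print(jaccard(name_a, name_b))      # token overlap treats "sa" and "s.a." as different
print(levenshtein(name_a, name_b))  # character edits see only two inserted dots
```

The two measures disagree on how close these names are, which illustrates why purely lexical matchers struggle with changes in legal forms.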
Article 242
Title@2025-07-15 (2): Social Media Sentiments Analysis on the July Revolution in Bangladesh: A Hybrid Transformer Based Machine Learning Approach
Title: Social Media Sentiments Analysis on the July Revolution in Bangladesh: A Hybrid Transformer Based Machine Learning Approach | Social Media Sentiments Analyse der Julirevolution in Bangladesch: Ein hybrider Transformer basierter Machine Learning-Ansatz | 对孟加拉国七月革命的社会媒体感知分析:混合变换机学习方法 2507.11084v1 |
Authors (3): Md. Sabbir Hossen, Md. Saiduzzaman, Pabon Shaha
The July Revolution in Bangladesh marked a significant student-led mass uprising, uniting people across the nation to demand justice, accountability, and systemic reform. Social media platforms played a pivotal role in amplifying public sentiment and shaping discourse during this historic mass uprising. In this study, we present a hybrid transformer-based sentiment analysis framework to decode public opinion expressed in social media comments during and after the revolution. We used a brand new dataset of 4,200 Bangla comments collected from social media. The framework employs advanced transformer-based feature extraction techniques, including BanglaBERT, mBERT, XLM-RoBERTa, and the proposed hybrid XMB-BERT, to capture nuanced patterns in textual data. Principal Component Analysis (PCA) was utilized for dimensionality reduction to enhance computational efficiency. We explored eleven traditional and advanced machine learning classifiers for identifying sentiments. The proposed hybrid XMB-BERT with the voting classifier achieved an exceptional accuracy of 83.7% and outperformed other model-classifier combinations. This study underscores the potential of machine learning techniques to analyze social sentiment in low-resource languages like Bangla.
孟加拉国7月革命标志着学生领导的大规模大规模起义,团结全国人民,要求正义、问责和系统改革。社交媒体平台在这场历史性大规模起义期间,在扩大公众情绪和塑造言论方面发挥了关键作用。在本研究中,我们提出了一个基于变压器的混合情绪分析框架,以解析在革命期间和革命之后社会媒体评论中表达的公众舆论。我们使用了从社交媒体收集的4,200个孟加拉语评论的品牌新数据集。这个框架采用了基于变压器的先进地物提取技术,包括BanglaBERT、MBERT、XLM-ROBERTA和拟议的混合XMBM-BERTA, 以捕捉文字数据中的细微模式。原则组成部分分析(PCA)用于减少维度,以提高计算效率。我们探讨了11个传统和先进的机器学习分类师,以识别情绪。与选举分类师的混合XMB-BERT获得了83.7%的特殊精准度,优于其他模式分类组合。这项研究强调机器学习技术在像Bangla这样的低资源语言中分析社会情绪的潜力。
Article 243
Title@2025-07-15 (2): Voting or Consensus? Decision-Making in Multi-Agent Debate
Title: Voting or Consensus? Decision-Making in Multi-Agent Debate | Abstimmung oder Konsens? Entscheidungsfindung in Multi-Agent-Debatte | 表决还是协商一致?多机构辩论中的决策 2502.19130v3 |
Authors (5): Lars Benedikt Kaesberg, Jonas Becker, Jan Philip Wahle, Terry Ruas, Bela Gipp
Much of the success of multi-agent debates depends on carefully choosing the right parameters. The decision-making protocol stands out as it can highly impact final model answers, depending on how decisions are reached. Systematic comparison of decision protocols is difficult because many studies alter multiple discussion parameters beyond the protocol. So far, it has been largely unknown how decision-making influences different tasks. This work systematically evaluates the impact of seven decision protocols (e.g., majority voting, unanimity consensus). We change only one variable at a time - the decision protocol - to analyze how different methods affect the collaboration between agents and measure differences in knowledge and reasoning tasks. Our results show that voting protocols improve performance by 13.2% in reasoning tasks and consensus protocols by 2.8% in knowledge tasks compared to other decision protocols. Increasing the number of agents improves performance, while more discussion rounds before voting reduce it. To improve decision-making by increasing answer diversity, we propose two new methods, All-Agents Drafting (AAD) and Collective Improvement (CI). Our methods improve task performance by up to 3.3% with AAD and up to 7.4% with CI. This work demonstrates the importance of decision-making in multi-agent debates beyond scaling.
多代理人辩论的成功很大程度上取决于如何仔细选择正确的参数。决策协议的突出之处在于它能够对最终模式的答案产生很大影响,取决于如何作出决定。对决定协议的系统比较是困难的,因为许多研究改变了议定书之外的许多讨论参数。到目前为止,人们基本上不知道决策如何影响不同的任务。这项工作系统地评估了七个决定协议的影响(例如多数投票、全体一致协商一致)。我们一次只改变一个变量——决定协议——分析不同方法如何影响代理人之间的合作并衡量知识和推理任务的差异。我们的结果表明,投票协议提高了13.2%的推理任务和协商一致协议的绩效,比其他决定协议提高了2.8 %的推理任务和协商一致协议的绩效。增加代理人的数量提高了绩效,而更多的投票前讨论回合减少了绩效。为了通过增加答案的多样性来改进决策,我们提出了两种新方法,即 “ 所有人起草(AAD) “ 和 “ 集体改进 “ (CI)。我们的方法提高了任务绩效,与AAD的比例提高到3.3%,与CI的比例提高到7.4 %。这项工作表明,在超出规模的多代理人辩论中,必须进行决策。
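Two of the decision protocols named in the abstract above, majority voting and unanimity consensus, can be sketched as follows. This is an illustrative reading, not the paper's implementation: the strict more-than-half threshold, the None return for "no decision", and the function names are assumptions.

```python
from collections import Counter

def majority_vote(answers):
    """Return the answer held by more than half the agents, else None.

    Assumes a non-empty list of hashable answers.
    """
    answer, count = Counter(answers).most_common(1)[0]
    return answer if count > len(answers) / 2 else None

def unanimity_consensus(answers):
    """Return the shared answer only if every agent agrees, else None."""
    return answers[0] if len(set(answers)) == 1 else None

# Three hypothetical agents answering a multiple-choice question.
round_answers = ["B", "B", "A"]
print(majority_vote(round_answers))        # a decision is reached
print(unanimity_consensus(round_answers))  # no decision, another round is needed
```

Under a consensus protocol a None result would typically trigger another discussion round, whereas voting can terminate as soon as a majority exists.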
Article 244
Title@2025-07-15 (2): Comply: Learning Sentences with Complex Weights inspired by Fruit Fly Olfaction
Title: Comply: Learning Sentences with Complex Weights inspired by Fruit Fly Olfaction | Comply: Lernen von Sätzen mit komplexen Gewichten inspiriert von Fruit Fly Olfaction | 遵守:受果蝇运动启发的具有复杂重力的学习判决 2502.01706v3 |
Authors (8): Alexei Figueroa, Justus Westerhoff, Golzar Atefi, Dennis Fast, Benjamin Winter, Felix Alexander Gers, Alexander Löser, Wolfgang Nejdl
Biologically inspired neural networks offer alternative avenues to model data distributions. FlyVec is a recent example that draws inspiration from the fruit fly’s olfactory circuit to tackle the task of learning word embeddings. Surprisingly, this model performs competitively even against deep learning approaches specifically designed to encode text, and it does so with the highest degree of computational efficiency. We pose the question of whether this performance can be improved further. For this, we introduce Comply. By incorporating positional information through complex weights, we enable a single-layer neural network to learn sequence representations. Our experiments show that Comply not only supersedes FlyVec but also performs on par with significantly larger state-of-the-art models. We achieve this without additional parameters. Comply yields sparse contextual representations of sentences that can be interpreted explicitly from the neuron weights.
生物学启发的神经网络为模型数据分布提供了替代途径。 FlyVec 是最近的一个例子,它从果蝇的嗅觉电路中汲取灵感,以完成学习嵌入字词的任务。令人惊讶的是,这一模型的竞争性运行甚至与专门设计用于编码文本的深层次学习方法相对立,而且它具有最高程度的计算效率。我们提出了这一性能是否可以进一步改进的问题。为此,我们引入了Comply。通过将定位信息纳入复杂的重量,我们使得一个单层神经网络能够学习序列表达。我们的实验显示,Compt不仅取代了FlyVec,而且与大得多的先进模型同等地运行。我们实现了这一点,没有额外的参数。我们得出了从神经重量中可以明确解释的不完整的句子背景描述。
Article 245
Title@2025-07-15 (2): DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures
Title: DocPolarBERT: A Pre-trained Model for Document Understanding with Relative Polar Coordinate Encoding of Layout Structures | DocPolarBERT: Ein vortrainiertes Modell zum Dokumentverständnis mit relativer Polarkoordinate Kodierung von Layoutstrukturen | DocPolarBERT:一个预先培训的文件理解模式,其布局结构的相对极地协调编码 2507.08606v2 |
Authors (4): Benno Uthayasooriyar, Antoine Ly, Franck Vermet, Caio Corro
We introduce DocPolarBERT, a layout-aware BERT model for document understanding that eliminates the need for absolute 2D positional embeddings. We extend self-attention to take into account text block positions in a relative polar coordinate system rather than the Cartesian one. Despite being pre-trained on a dataset more than six times smaller than the widely used IIT-CDIP corpus, DocPolarBERT achieves state-of-the-art results. These results demonstrate that a carefully designed attention mechanism can compensate for reduced pre-training data, offering an efficient and effective alternative for document understanding.
我们引入了Doc PolarBERT, 这是一种具有布局意识的BERT文件理解模式,消除了绝对2D定位嵌入的需要。我们自我关注,以考虑到相对极地协调系统而不是笛卡尔协调系统中的文本块位置。尽管在数据集方面接受过比广泛使用的IT-CDIP系统小六倍多的预先培训,但Doc PolarBERT取得了最新的结果。这些结果表明,精心设计的注意机制可以弥补培训前数据的减少,为文件理解提供高效有效的替代方法。
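To make the idea of a relative polar encoding concrete, the sketch below computes the distance and angle from one text block to another. The abstract does not specify the reference point or angle convention, so the use of bounding-box centers, the (x0, y0, x1, y1) box format, radians from math.atan2, and the function names are all assumptions.

```python
import math

def center(box):
    """Center of a bounding box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)

def relative_polar(box_a, box_b):
    """Return (distance, angle in radians) of box_b's center relative to box_a's."""
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    dx, dy = bx - ax, by - ay
    return math.hypot(dx, dy), math.atan2(dy, dx)

# Two hypothetical text blocks; the second sits directly along the +y axis.
block_a = (0, 0, 2, 2)
block_b = (0, 3, 2, 5)
r, theta = relative_polar(block_a, block_b)
print(r, theta)
```

Because both values depend only on the offset between blocks, they are unchanged when the whole page is translated, which is the property that removes the need for absolute positions.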
Article 246
Title@2025-07-15 (2): DRAGON: Dynamic RAG Benchmark On News
Title: DRAGON: Dynamic RAG Benchmark On News | DRAGON: Dynamischer RAG-Benchmark auf Neuigkeiten | DRAGON:动态RAG新闻基准 2507.05713v2 |
Authors (7): Fedor Chernogorskii, Sergei Averkiev, Liliya Kudraleeva, Zaven Martirosian, Maria Tikhonova, Valentin Malykh, Alena Fenogenova
Retrieval-Augmented Generation (RAG) is a widely adopted approach for improving the factuality of large language models (LLMs) by incorporating external knowledge at inference time. Although there exist multiple RAG benchmarks for English, evaluation resources for other languages, including Russian, remain scarce and static, failing to capture the dynamic nature of real-world deployments. In this work, we present DRAGON (Dynamic RAG Benchmark On News), the first dynamic benchmark for evaluating RAG systems in Russian on a changing news corpus. DRAGON is built upon a regularly updated corpus of Russian news and public documents and supports comprehensive evaluation of both the retriever and generator components. Question generation is performed automatically using a Knowledge Graph constructed from the corpus and enables the extraction of four core question types aligned with distinct subgraph patterns. We release a complete evaluation framework comprising the pipeline for automatic question generation, evaluation scripts, which are potentially reusable for other languages and multilingual settings, and benchmark data. We also launch a public leaderboard to encourage community participation and comparison.
在这项工作中,我们提出DRAGON(动态RAGON)(动态RAG Basic on News),这是在不断变化的新闻公司中评价俄罗斯的RAG系统的第一个动态基准;DRAGON(DRAGON)以定期更新的俄罗斯新闻和公共文件为基础,支持对检索器和发电机组成部分进行综合评价;通过使用从文体上建立的知识图,自动生成问题,使四个核心问题类型与不同的子绘图模式相一致,从而得以抽取。我们发布了一个完整的评价框架,其中包括自动生成问题的管道、评价脚本,这些文字有可能用于其他语言和多语种环境,以及基准数据。我们还启动了一个公共领导板,以鼓励社区参与和比较。
Article 247
Title@2025-07-15 (2): LLM-Augmented Symptom Analysis for Cardiovascular Disease Risk Prediction: A Clinical NLP
Title: LLM-Augmented Symptom Analysis for Cardiovascular Disease Risk Prediction: A Clinical NLP | LLM-Augmented Symptom Analysis for Cardiovascular Disease Risk Prediction: A Clinical NLP | 关于心血管疾病风险预测的LLM强化症状分析:临床国家实验室方案 2507.11052v1 |
Authors (6): Haowei Yang, Ziyu Shen, Junli Shao, Luyao Men, Xinyue Han, Jing Dong
Timely identification and accurate risk stratification of cardiovascular disease (CVD) remain essential for reducing global mortality. While existing prediction models primarily leverage structured data, unstructured clinical notes contain valuable early indicators. This study introduces a novel LLM-augmented clinical NLP pipeline that employs domain-adapted large language models for symptom extraction, contextual reasoning, and correlation from free-text reports. Our approach integrates cardiovascular-specific fine-tuning, prompt-based inference, and entity-aware reasoning. Evaluations on MIMIC-III and CARDIO-NLP datasets demonstrate improved performance in precision, recall, F1-score, and AUROC, with high clinical relevance (kappa = 0.82) assessed by cardiologists. Challenges such as contextual hallucination, which occurs when plausible information contrasts with the provided source, and temporal ambiguity, which arises when models struggle with the chronological ordering of events, are addressed using prompt engineering and hybrid rule-based verification. This work underscores the potential of LLMs in clinical decision support systems (CDSS), advancing early warning systems and enhancing the translation of patient narratives into actionable risk assessments.
虽然现有的预测模型主要利用结构化数据,但未经结构化的临床说明含有宝贵的早期指标。本研究引入了新型的LLM强化临床NLP输油管,该输油管使用经领域改造的大型语言模型提取症状、背景推理和自由文本报告的相关性。我们的方法结合了心血管特定微调、基于迅速推断和实体认知的推理。对MIMIC-III和CARDIO-NLP数据集的评价表明,临床相关性很高(Kappa=0.82)的准确性、回溯性、F1核心和AUROC的性能有所改善。环境幻觉等挑战,例如,当有源的可信信息合同和时间模糊性(与事件按时间顺序排列有关的模型通过迅速的工程和基于规则的混合核查来处理,这项工作强调了LLMS在临床决策支持系统(CDSS)中的潜力,推进早期预警系统,加强将病人叙述转化为可采取行动的风险评估。
Article 248
Title@2025-07-15 (2): Understanding the Dark Side of LLMs’ Intrinsic Self-Correction
Title: Understanding the Dark Side of LLMs’ Intrinsic Self-Correction | Die dunkle Seite der Intrinsischen Selbstkorrektion der LLMs verstehen | 了解LLLMs’ Intrinsic 自我校正的黑暗面 2412.14959v2 |
Authors (10): Qingjie Zhang, Di Wang, Haoting Qian, Yiming Li, Tianwei Zhang, Minlie Huang, Ke Xu, Hewu Li, Yan Liu, Han Qiu
Intrinsic self-correction was proposed to improve LLMs’ responses via feedback prompts solely based on their inherent capability. However, recent works show that LLMs’ intrinsic self-correction fails without oracle labels as feedback prompts. In this paper, we aim to interpret LLMs’ intrinsic self-correction for different tasks, especially for those failure cases. By including one simple task and three complex tasks with state-of-the-art (SOTA) LLMs like ChatGPT families (o1, 4o, 3.5-turbo) and Llama families (2-7B, 3-8B, and 3.1-8B), we design three interpretation methods to reveal the dark side of LLMs’ intrinsic self-correction. We identify that intrinsic self-correction can (1) cause LLMs to waver on both intermediate and final answers and lead to prompt bias on simple factual questions; (2) introduce human-like cognitive bias on complex tasks. In light of our findings, we also provide two simple yet effective strategies for alleviation: question repeating and supervised fine-tuning with a few samples. We open-source our work at https://x-isc.info/.
提议进行内在的自我纠正,以便仅仅根据LLMS的内在能力,通过反馈来改进LLMS的响应;然而,最近的工作表明,LLMS的内在自我纠正在没有神器标签的情况下失败,作为反馈的提示。在本文件中,我们的目标是解释LLMS的内在自我纠正,以完成不同的任务,特别是那些失败的案例。通过列入一项简单的任务和三项复杂的任务,包括最先进的LLLMS(SOTA),例如ChatGPT家庭(o1, 4o, 3.5-turbo)和Llama家庭(2-7B, 3-8B, 和3.1-8B),我们设计了三种解释方法,以揭示LLMS内在自我纠正的黑暗面。我们确定固有的自我纠正可以(1) 使LLMS在中间和最后答案上进行挥动,并导致对简单的事实问题产生迅速的偏见;(2) 在复杂的任务中引入像人类一样的认知偏见。根据我们的调查结果,我们还提供了两种简单而有效的缓解战略:重复问题并监督微调几个样本。我们打开了我们在 https://x-isc/infinfo中的工作。
Article 249
Title@2025-07-15 (2): ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification
Title: ReVISE: Learning to Refine at Test-Time via Intrinsic Self-Verification | ReVISE: Verfeinern lernen zur Testzeit durch Intrinsische Selbstverifizierung | REVISE:通过内在自我核查学习在试验时进行精炼 2502.14565v2 |
Authors (5): Hyunseok Lee, Seunghyuk Oh, Jaehyung Kim, Jinwoo Shin, Jihoon Tack
Self-awareness, i.e., the ability to assess and correct one’s own generation, is a fundamental aspect of human intelligence, making its replication in large language models (LLMs) an important yet challenging task. Previous works tackle this by employing extensive reinforcement learning or by relying on large external verifiers. In this work, we propose Refine via Intrinsic Self-Verification (ReVISE), an efficient and effective framework that enables LLMs to self-correct their outputs through self-verification. The core idea of ReVISE is to enable LLMs to verify their reasoning processes and continually rethink reasoning trajectories based on their verification. We introduce a structured curriculum based upon online preference learning to implement this efficiently. Specifically, as ReVISE involves two challenging tasks (i.e., self-verification and reasoning correction), we tackle each task sequentially using curriculum learning, collecting both failed and successful reasoning paths to construct preference pairs for efficient training. During inference, our approach enjoys natural test-time scaling by integrating self-verification and correction capabilities, further enhanced by our proposed confidence-aware decoding mechanism. Our experiments on various reasoning tasks demonstrate that ReVISE achieves efficient self-correction and significantly improves reasoning performance.
自我意识,即评估和纠正自己一代人的能力,是人类智力的一个基本方面,使在大型语言模型中复制这种能力成为一项重要而富有挑战性的任务。以前的工作是通过广泛强化学习或依靠大型外部核查员来处理这一点的。在这项工作中,我们提议通过人工自我验证(REVISE)来重新精炼自我意识,这是一个高效而有效的框架,使LLISE能够通过自我验证来自我校正其产出。REVISE的核心思想是使LLISM能够核查其推理过程,并不断重新思考基于其核查的推理轨迹。我们采用基于在线偏好学习的结构化课程,以有效执行这项工作。具体地说,由于REVISE涉及两项具有挑战性的任务(即自我验证和推理修正),我们按顺序处理每一项任务,我们收集失败和成功的推理路径,以构建高效培训的优选配方。在引用过程中,我们的方法通过整合自我验证和校正能力,并通过我们拟议的信心分解推理机制进一步强化,使我们的自我推理能力得到改进。
Article 250
Title@2025-07-15 (2): First-Order Error Matters: Accurate Compensation for Quantized Large Language Models
Title: First-Order Error Matters: Accurate Compensation for Quantized Large Language Models | Error Matters: Genaue Kompensation für Quantisierte große Sprachmodelle | 第一顺序误差事项:量化大语言模型的准确补偿 2507.11017v1 |
Authors (7): Xingyu Zheng, Haotong Qin, Yuye Li, Jiakai Wang, Jinyang Guo, Michele Magno, Xianglong Liu
Post-training quantization (PTQ) offers an efficient approach to compressing large language models (LLMs), significantly reducing memory access and computational costs. Existing compensation-based weight calibration methods often rely on a second-order Taylor expansion to model quantization error, under the assumption that the first-order term is negligible in well-trained full-precision models. However, we reveal that the progressive compensation process introduces accumulated first-order deviations between latent weights and their full-precision counterparts, making this assumption fundamentally flawed. To address this, we propose FOEM, a novel PTQ method that explicitly incorporates first-order gradient terms to improve quantization error compensation. FOEM approximates gradients by directly computing the difference between latent and full-precision weights, avoiding the high cost and limited generalization of backpropagation-based gradient computation. This approach introduces minimal additional computational overhead. Moreover, FOEM leverages precomputed Cholesky factors to efficiently recover the inverse of Hessian submatrices in real time. Extensive experiments across a wide range of models and benchmarks demonstrate that FOEM consistently outperforms the classical GPTQ method. In 3-bit weight-only quantization, FOEM reduces the perplexity of Llama3-8B by 89.6%, and improves the 5-shot MMLU accuracy of Llama3-70B from 51.7% to 74.9%, approaching the full-precision performance of 78.6%. Furthermore, FOEM can be seamlessly integrated with advanced techniques such as GPTAQ and SpinQuant, yielding additional improvements under the challenging W4A4KV4 setting, and further narrowing the accuracy gap with full-precision baselines beyond what current state-of-the-art methods achieve. The code is available at https://github.com/Xingyu-Zheng/FOEM.
培训后夸度(PTQ)为压缩大型语言模型(LLMS)提供了有效的方法,大幅降低记忆存取和计算成本。现有的基于补偿的重量校准方法往往依靠泰勒的第二阶扩展来模拟量化错误,其假设是,在训练有素的全精度计算模型中,一阶术语可忽略不计。然而,我们发现,逐步补偿过程引入了潜在重量与其全精度对应方之间累积的第一阶偏差,使得这一假设存在根本性缺陷。为了解决这个问题,我们提议FOEM,这是一个新的PTQQQQEM方法,其中明确包括了第一阶梯度术语,以改善量化错误的补偿。FOEM方法通过直接计算潜值与全精度加权之间的差,避免了高成本,并限制了以反精度计算为基础的梯度计算。此外,FOEM利用预先的Choal-OFOFOFOFOFO, 4在实时恢复HSO的逆差, 4, 进一步在模型和基准中进行广泛的实验,在GEM IM IM 的全精度上不断降低成本 方法中,在不断降低5MMMMR 。
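The second-order Taylor expansion referred to in the abstract above can be written out explicitly. The notation is an assumption chosen for illustration and is not taken from the paper: \mathcal{L} is the loss, \mathbf{w} the full-precision weights, \delta\mathbf{w} the perturbation introduced by quantization, \mathbf{g} the gradient and \mathbf{H} the Hessian.

```latex
\Delta \mathcal{L}
  = \mathcal{L}(\mathbf{w} + \delta\mathbf{w}) - \mathcal{L}(\mathbf{w})
  \approx \mathbf{g}^{\top} \delta\mathbf{w}
  + \tfrac{1}{2}\, \delta\mathbf{w}^{\top} \mathbf{H}\, \delta\mathbf{w},
\qquad
\mathbf{g} = \nabla_{\mathbf{w}} \mathcal{L}(\mathbf{w}), \quad
\mathbf{H} = \nabla^{2}_{\mathbf{w}} \mathcal{L}(\mathbf{w}).
```

Methods that treat the first-order term as negligible set \mathbf{g} \approx \mathbf{0} and keep only the quadratic term. As described in the abstract, FOEM instead retains the term \mathbf{g}^{\top} \delta\mathbf{w}, approximating the gradient from the difference between latent and full-precision weights.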
Article 251
Title@2025-07-15 (2): REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once
Title: REST: Stress Testing Large Reasoning Models by Asking Multiple Problems at Once | REST: Stress-Testing von Modellen mit großer Vernunft, indem man mehrere Probleme auf einmal fragt | REST: 立即询问多个问题,以压力测试大型理由模型 2507.10541v2 |
Authors (8): Zhuoshi Pan, Qizhi Pei, Yu Li, Qiyao Sun, Zinan Tang, H. Vicky Zhao, Conghui He, Lijun Wu
Recent Large Reasoning Models (LRMs) have achieved remarkable progress on task-specific benchmarks, yet their evaluation methods remain constrained by isolated problem-solving paradigms. Existing benchmarks predominantly assess single-question reasoning through sequential testing, resulting in critical limitations: (1) they are vulnerable to data contamination and are becoming less challenging (e.g., DeepSeek-R1 achieves 97.0% on MATH500), forcing the costly creation of new questions with large human efforts, and (2) they fail to evaluate models under multi-context pressure, a key requirement for real-world deployment. To bridge this gap, we present REST (Reasoning Evaluation through Simultaneous Testing), a stress-testing framework that exposes LRMs to multiple problems simultaneously. Beyond basic reasoning, REST evaluates several under-tested capabilities: contextual priority allocation, cross-problem interference resistance, and dynamic cognitive load management. Our evaluation reveals several striking findings: Even state-of-the-art (SOTA) models like DeepSeek-R1 exhibit substantial performance degradation under stress testing. Crucially, REST demonstrates stronger discriminative power than existing benchmarks, revealing pronounced performance differences among models that exhibit similar, near-ceiling performance under single-question evaluations. Some key insights emerge from our analysis: (1) the “overthinking trap” is a critical factor contributing to the performance degradation; (2) models trained with the “long2short” technique preserve more of their single-problem accuracy under REST, outperforming standard-trained counterparts. These results establish REST as a cost-efficient, future-proof evaluation paradigm that better reflects real-world reasoning demands while reducing reliance on continuous human annotation. Code and results are available at https://opendatalab.github.io/REST.
最近大型解释模型(LRMs)在具体任务基准方面取得了显著进展,但其评价方法仍然受到孤立的解决问题模式的制约。现有基准主要通过连续测试评估单题推理,从而产生重大限制:(1) 易受数据污染影响,挑战性较小(例如,DeepSeek-R1在MATH500上达到97.0%),迫使大量人作出大量努力,产生新的问题,(2) 在多文本压力下无法评价模型,这是真实世界部署的关键要求。为弥合这一差距,我们提出了标准评价(通过同声测试重新评价),一个压力测试框架,同时将LRMs暴露为多种问题。除了基本推理之外,REST还评估一些未得到充分检验的能力:背景优先分配、交叉干扰阻力和动态认知负负载管理。我们的评估揭示了几个惊人的结果:即使是现状(SOOTA)(SOTA)模型,在压力测试中显示实际操作性下降幅度很大。毫无疑问,REST显示比现有基准更具有歧视性的力量,揭示了各种明显的业绩差异,在模型中显示接近于一个关键的业绩分析中,“持续进行关键的精确的精确的成绩分析。”
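The core idea of asking several problems at once can be sketched as a prompt builder that packs existing single-question benchmark items into one request. The instruction wording, the Q1/Q2 labelling and the function name are hypothetical and are not taken from the REST codebase; the example problems are invented.

```python
def build_multi_problem_prompt(problems):
    """Concatenate several problems into one prompt (hypothetical format)."""
    lines = ["Solve every problem below. Give one final answer per problem."]
    for i, problem in enumerate(problems, start=1):
        lines.append(f"Q{i}: {problem}")
    return "\n".join(lines)

# Invented arithmetic items standing in for benchmark questions.
problems = ["What is 12 * 7?", "What is the sum of the first 10 positive integers?"]
prompt = build_multi_problem_prompt(problems)
print(prompt)
```

Because the items are reused from existing benchmarks, raising the number of problems per prompt increases difficulty without writing new questions.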
Article 252
Title@2025-07-15 (2): Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging
Title: Bring Reason to Vision: Understanding Perception and Reasoning through Model Merging | Grund zur Vision: Wahrnehmung und Vernunft durch Modellverschmelzen verstehen | 实现愿景:通过模式合并理解观念和理由 2505.05464v2 |
Authors (8): Shiqi Chen, Jinghan Zhang, Tongyao Zhu, Wei Liu, Siyang Gao, Miao Xiong, Manling Li, Junxian He
Vision-Language Models (VLMs) combine visual perception with the general capabilities, such as reasoning, of Large Language Models (LLMs). However, the mechanisms by which these two abilities can be combined and contribute remain poorly understood. In this work, we explore composing perception and reasoning through model merging, which connects parameters of different models. Unlike previous works that often focus on merging models of the same kind, we propose merging models across modalities, enabling the incorporation of the reasoning capabilities of LLMs into VLMs. Through extensive experiments, we demonstrate that model merging offers a successful pathway to transfer reasoning abilities from LLMs to VLMs in a training-free manner. Moreover, we utilize the merged models to understand the internal mechanism of perception and reasoning and how merging affects it. We find that perception capabilities are predominantly encoded in the early layers of the model, whereas reasoning is largely facilitated by the middle-to-late layers. After merging, we observe that all layers begin to contribute to reasoning, whereas the distribution of perception abilities across layers remains largely unchanged. These observations shed light on the potential of model merging as a tool for multimodal integration and interpretation.
视觉语言模型(VLMs)将视觉观点与大语言模型(LLMs)的推理等一般能力相结合。然而,这两种能力相结合的机制仍然不易理解。在这项工作中,我们探索通过将不同模型的参数联系起来的模式集成认识和推理。与以往往往侧重于将同类模型合并的工作不同,我们提议将各种模式合并,以便能够将LLMs的推理能力纳入VLMs。通过广泛的实验,我们证明合并模式为将LLMs的推理能力以无培训的方式转让给VLMs提供了成功的途径。此外,我们利用合并模型来理解内部的认知和推理机制以及合并如何影响它。我们发现,认知能力主要是在模型早期的层中层和末层进行编码,而推理主要是由中层推动的。我们发现,所有层开始为推理作出贡献,而感知能力在各层的分布基本上保持不变。这些观察显示,模型合并作为多式联运和解释工具的潜力。
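Training-free model merging can be illustrated with the simplest rule, linear interpolation of matching parameters. The abstract does not state which merging rule the authors use, so the interpolation, the weight alpha, the parameter names and the toy values below are all assumptions made for illustration.

```python
def merge_linear(params_a, params_b, alpha=0.5):
    """Interpolate parameters shared by both models: alpha * a + (1 - alpha) * b.

    Each argument maps parameter names to flat lists of floats.
    Parameters present in only one model are left out of the result.
    """
    merged = {}
    for name in params_a.keys() & params_b.keys():
        merged[name] = [alpha * a + (1 - alpha) * b
                        for a, b in zip(params_a[name], params_b[name])]
    return merged

# Toy stand-ins for the language-model weights of a VLM and a reasoning LLM.
vlm_params = {"layer.w": [1.0, 2.0], "vision.proj": [0.5]}
llm_params = {"layer.w": [3.0, 4.0]}
merged = merge_linear(vlm_params, llm_params, alpha=0.5)
print(merged)
```

No gradient step is involved, which is what makes this kind of merging training-free; in this sketch the vision-only parameter is simply not merged because the LLM has no counterpart for it.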
Article 253
Title@2025-07-15 (2): Team HUMANE at AVeriTeC 2025: HerO 2 for Efficient Fact Verification
Title: Team HUMANE at AVeriTeC 2025: HerO 2 for Efficient Fact Verification | Team HUMANE auf der AVeriTeC 2025: HerO 2 für effiziente Faktenverifizierung | 2025年AVeriTec 2025:HERO 2 有效核查事实 2507.11004v1 |
Authors (4): Yejun Yoon, Jaeyoon Jung, Seunghyun Yoon, Kunwoo Park
This paper presents HerO 2, Team HUMANE’s system for the AVeriTeC shared task at the FEVER-25 workshop. HerO 2 is an enhanced version of HerO, the best-performing open-source model from the previous year’s challenge. It improves evidence quality through document summarization and answer reformulation, optimizes veracity prediction via post-training quantization under computational constraints, and enhances overall system performance by integrating updated language model (LM) backbones. HerO 2 ranked second on the leaderboard while achieving the shortest runtime among the top three systems, demonstrating both high efficiency and strong potential for real-world fact verification. The code is available at https://github.com/ssu-humane/HerO2.
本文介绍了HerO 2,即Team HUMANE在FEVER-25研讨会AVeriTeC共享任务中的系统。HerO 2是HerO的增强版本,而HerO是上一年挑战赛中表现最佳的开源模型。它通过文档摘要和答案重构提高证据质量,在计算受限条件下通过训练后量化优化真实性预测,并通过整合更新的语言模型(LM)骨干提升整体系统性能。HerO 2在排行榜上名列第二,同时在前三名系统中运行时间最短,显示出高效率和在真实世界事实核查中的巨大潜力。代码可在 https://github.com/ssu-humane/HerO2 查阅。
Article 254
Title@2025-07-15 (2): Mario at EXIST 2025: A Simple Gateway to Effective Multilingual Sexism Detection
Title: Mario at EXIST 2025: A Simple Gateway to Effective Multilingual Sexism Detection | Mario bei EXIST 2025: Ein einfaches Tor zu einer effektiven Mehrsprachigkeitserkennung | Mario at EXIST 2025: 有效多语言性别调查的简单通道 2507.10996v1 |
Authors (3): Lin Tian, Johanne R. Trippas, Marian-Andrei Rizoiu
This paper presents our approach to EXIST 2025 Task 1, addressing text-based sexism detection in English and Spanish tweets through hierarchical Low-Rank Adaptation (LoRA) of Llama 3.1 8B. Our method introduces conditional adapter routing that explicitly models label dependencies across three hierarchically structured subtasks: binary sexism identification, source intention detection, and multilabel sexism categorization. Unlike conventional LoRA applications that target only attention layers, we apply adaptation to all linear transformations, enhancing the model’s capacity to capture task-specific patterns. In contrast to complex data processing and ensemble approaches, we show that straightforward parameter-efficient fine-tuning achieves strong performance. We train separate LoRA adapters (rank=16, QLoRA 4-bit) for each subtask using unified multilingual training that leverages Llama 3.1’s native bilingual capabilities. The method requires minimal preprocessing and uses standard supervised learning. Our multilingual training strategy eliminates the need for separate language-specific models, achieving 1.7-2.4% F1 improvements through cross-lingual transfer. With only 1.67% trainable parameters compared to full fine-tuning, our approach reduces training time by 75% and model storage by 98%, while achieving competitive performance across all subtasks (ICM-Hard: 0.6774 for binary classification, 0.4991 for intention detection, 0.6519 for multilabel categorization).
本文介绍了我们处理EXIST 2025任务1的方法,即通过对Llama 3.1 8B进行分层低秩适应(LoRA),在英语和西班牙语推文中进行基于文本的性别歧视检测。我们的方法引入了条件适配器路由,明确建模三个分层结构子任务之间的标签依赖关系:二元性别歧视识别、来源意图检测和多标签性别歧视分类。与仅针对注意力层的常规LoRA应用不同,我们对所有线性变换进行适应,提高了模型捕捉任务特定模式的能力。与复杂的数据处理和集成方法相比,我们表明直接的参数高效微调即可取得很强的性能。我们为每个子任务分别训练LoRA适配器(rank=16,QLoRA 4位),采用统一的多语言训练,利用Llama 3.1的原生双语能力。该方法只需最少的预处理,并使用标准的监督学习。我们的多语言训练策略无需单独的特定语言模型,通过跨语言迁移实现1.7-2.4%的F1提升。与完全微调相比,我们的方法只有1.67%的可训练参数,训练时间减少75%,模型存储减少98%,同时在所有子任务上取得有竞争力的表现(ICM-Hard:二元分类0.6774,意图检测0.4991,多标签分类0.6519)。
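To make the parameter-efficiency claim more concrete, the sketch below counts the extra parameters that a rank-r LoRA update adds to a single linear layer: two low-rank factors of shapes r x d_in and d_out x r. The 4096 x 4096 layer size is a hypothetical example, and the resulting ratio is for that one layer only; it is not meant to reproduce the 1.67% figure reported above.

```python
def full_params(d_in, d_out):
    """Number of weights in a dense linear layer (bias ignored)."""
    return d_in * d_out


def lora_params(d_in, d_out, rank):
    """Trainable weights added by a rank-`rank` LoRA update: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


# Hypothetical layer size chosen for illustration only.
d_in, d_out, rank = 4096, 4096, 16
fraction = lora_params(d_in, d_out, rank) / full_params(d_in, d_out)
print(f"LoRA adds {lora_params(d_in, d_out, rank)} trainable weights "
      f"({100 * fraction:.2f}% of the dense layer)")
```

Applying adapters to every linear transformation, as the abstract describes, raises the trainable count relative to attention-only LoRA while still staying a small fraction of full fine-tuning.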
Article 255
Title@2025-07-15 (2): Leveraging Large Language Models for Multi-Class and Multi-Label Detection of Drug Use and Overdose Symptoms on Social Media
Title: Leveraging Large Language Models for Multi-Class and Multi-Label Detection of Drug Use and Overdose Symptoms on Social Media | Nutzung großer Sprachmodelle für Multi-Klassen- und Multi-Label-Erkennung von Drogenkonsum und Überdosissymptome in sozialen Medien | 在社会媒体上利用多种类别和多标签检测吸毒和吸毒过量症状的大型语言模型 2504.12355v3 |
Authors (6): Muhammad Ahmad, Fida Ullah, Muhammad Usman, Umyh Habiba, Ildar Batyrshin, Grigori Sidorov
Drug overdose remains a critical global health issue, often driven by misuse of opioids, painkillers, and psychiatric medications. Traditional research methods face limitations, whereas social media offers real-time insights into self-reported substance use and overdose symptoms. This study proposes an AI-driven NLP framework trained on annotated social media data to detect commonly used drugs and associated overdose symptoms. Using a hybrid annotation strategy with LLMs and human annotators, we applied traditional ML models, neural networks, and advanced transformer-based models. Our framework achieved 98% accuracy in multi-class and 97% in multi-label classification, outperforming baseline models by up to 8%. These findings highlight the potential of AI for supporting public health surveillance and personalized intervention strategies.
滥用类阿片、止痛剂和精神病药物往往引发了滥用类阿片、止痛剂和精神病药物,因此吸毒过量仍然是一个全球严重的健康问题。传统研究方法面临种种限制,而社会媒体则对自我报告的药物使用和过量症状提供实时了解。本研究报告建议建立一个由AI驱动的NLP框架,对附加说明的社会媒体数据进行培训,以检测常用药物和相关过量症状。我们采用与LLOMS和人类告示员的混合批注战略,我们采用了传统的ML模型、神经网络和先进的变压器模型。我们的框架在多类标签分类中达到了98%的准确性,在多标签分类中达到了97%的准确性,在业绩基准模型中超过了8%。这些研究结果突出表明AI在支持公共卫生监督和个性化干预战略方面的潜力。
Article 256
Title@2025-07-15 (2): Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback
Title: Online Intrinsic Rewards for Decision Making Agents from Large Language Model Feedback | Online-Intrinsische Belohnungen für Entscheidungsträger aus großen Sprachmodellen Feedback | 来自大语言模式反馈的决策者在线内部奖励 2410.23022v3 |
Authors (5): Qinqing Zheng, Mikael Henaff, Amy Zhang, Aditya Grover, Brandon Amos
Automatically synthesizing dense rewards from natural language descriptions is a promising paradigm in reinforcement learning (RL), with applications to sparse reward problems, open-ended exploration, and hierarchical skill design. Recent works have made promising steps by exploiting the prior knowledge of large language models (LLMs). However, these approaches suffer from important limitations: they are either not scalable to problems requiring billions of environment samples, due to requiring LLM annotations for each observation, or they require a diverse offline dataset, which may not exist or be impossible to collect. In this work, we address these limitations through a combination of algorithmic and systems-level contributions. We propose ONI, a distributed architecture that simultaneously learns an RL policy and an intrinsic reward function using LLM feedback. Our approach annotates the agent’s collected experience via an asynchronous LLM server, which is then distilled into an intrinsic reward model. We explore a range of algorithmic choices for reward modeling with varying complexity, including hashing, classification, and ranking models. Our approach achieves state-of-the-art performance across a range of challenging tasks from the NetHack Learning Environment, while removing the need for large offline datasets required by prior work. We make our code available at https://github.com/facebookresearch/oni .
从自然语言描述中自动合成密集奖励,是强化学习(RL)中一个有前景的范式,可应用于稀疏奖励问题、开放式探索和分层技能设计。最近的工作通过利用大型语言模型(LLM)的先验知识,迈出了有希望的步骤。然而,这些方法存在重要的局限:它们要么因为每个观测都需要LLM标注而无法扩展到需要数十亿环境样本的问题,要么需要多样化的离线数据集,而这类数据集可能不存在或无法收集。在这项工作中,我们通过算法和系统层面的贡献相结合来解决这些局限。我们提出ONI,这是一个分布式架构,利用LLM反馈同时学习RL策略和内在奖励函数。我们的方法通过异步LLM服务器对智能体收集的经验进行标注,然后将其蒸馏为内在奖励模型。我们探索了一系列复杂度不同的奖励建模算法选择,包括哈希、分类和排序模型。我们的方法在NetHack学习环境的一系列具有挑战性的任务中实现了最先进的性能,同时无需先前工作所要求的大型离线数据集。我们的代码可在 https://github.com/facebookresearch/oni 查阅。
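The simplest of the reward-modeling options named above, hashing, can be pictured as a lookup table from observation text to an annotator's label. The sketch below is a minimal, single-process stand-in under stated assumptions: `HashRewardModel`, `toy_annotator`, and the keyword rule are all hypothetical, and the queue is drained synchronously rather than by an asynchronous LLM server as in the system the abstract describes.

```python
from collections import deque


class HashRewardModel:
    """Toy intrinsic-reward table keyed by observation caption."""

    def __init__(self):
        self.labels = {}        # caption -> reward assigned by the annotator
        self.pending = deque()  # captions waiting to be annotated
        self.queued = set()     # avoids queueing the same caption twice

    def reward(self, caption):
        """Return the stored reward, or 0.0 while the caption is unlabeled."""
        if caption in self.labels:
            return self.labels[caption]
        if caption not in self.queued:
            self.queued.add(caption)
            self.pending.append(caption)
        return 0.0

    def annotate_pending(self, annotator):
        """Label every queued caption (stand-in for the LLM feedback step)."""
        while self.pending:
            caption = self.pending.popleft()
            self.queued.discard(caption)
            self.labels[caption] = annotator(caption)


def toy_annotator(caption):
    # Hypothetical rule standing in for an LLM judgment of "interesting" events.
    return 1.0 if "key" in caption else 0.0


model = HashRewardModel()
first = model.reward("You find a key.")    # unlabeled on first sight
model.annotate_pending(toy_annotator)
second = model.reward("You find a key.")   # labeled after annotation
print(first, second)
```

A table like this cannot generalize to unseen captions, which is one reason a design might also consider classification or ranking models.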
Article 257
Title@2025-07-15 (2): BMDetect: A Multimodal Deep Learning Framework for Comprehensive Biomedical Misconduct Detection
Title: BMDetect: A Multimodal Deep Learning Framework for Comprehensive Biomedical Misconduct Detection | BMDEtect: Ein multimodales Deep Learning Framework für eine umfassende biomedizinische Fehlverhaltenserkennung | BMM 检测:综合生物医学不当行为检测的多式深层学习框架 2505.05763v2 |
Authors (4): Yize Zhou, Jie Zhang, Meijie Wang, Lun Yu
Academic misconduct detection in biomedical research remains challenging due to algorithmic narrowness in existing methods and fragmented analytical pipelines. We present BMDetect, a multimodal deep learning framework that integrates journal metadata (SJR, institutional data), semantic embeddings (PubMedBERT), and GPT-4o-mined textual attributes (methodological statistics, data anomalies) for holistic manuscript evaluation. Key innovations include: (1) multimodal fusion of domain-specific features to reduce detection bias; (2) quantitative evaluation of feature importance, identifying journal authority metrics (e.g., SJR-index) and textual anomalies (e.g., statistical outliers) as dominant predictors; and (3) the BioMCD dataset, a large-scale benchmark with 13,160 retracted articles and 53,411 controls. BMDetect achieves 74.33% AUC, outperforming single-modality baselines by 8.6%, and demonstrates transferability across biomedical subfields. This work advances scalable, interpretable tools for safeguarding research integrity.
由于现有方法的算法狭窄和分析管道支离破碎,生物医学研究中的学术不当行为探测仍然具有挑战性。我们提出BMSOT,这是一个多式深层次学习框架,综合了日记元数据(SJR,机构数据)、语义嵌入(PubMedBERT)和GPT-4的文本属性(方法统计、数据异常),以进行整体手稿评价。主要创新包括:(1) 多式融合特定域特性以减少探测偏差;(2) 对特征重要性进行定量评价,确定日记权威指标(例如SJR-index)和文本异常(例如统计异常值)为主要预测器;(3) BMSD数据集,这是一个大规模基准,有13,160件重现文章和53,411件控制。BMSMSMSMSMD实现74.33%的ACUC,超过8.6%的单一模式基线,并显示生物医学子领域的可转让性。这项工作推进了可扩展的、可解释的维护研究完整性的工具。
Article 258
Title@2025-07-15 (2): Teach Me Sign: Stepwise Prompting LLM for Sign Language Production
Title: Teach Me Sign: Stepwise Prompting LLM for Sign Language Production | Lehre mich Zeichen: Schrittweise LLM für Zeichensprache Produktion | 教育我 签名: 一步步提示手语制作LLMLM 2507.10972v1 |
Authors (2): Zhaoyi An, Rei Kawakami
Large language models, with their strong reasoning ability and rich knowledge, have brought revolution to many tasks of AI, but their impact on sign language generation remains limited due to its complexity and unique rules. In this paper, we propose TEAch Me Sign (TEAM-Sign), treating sign language as another natural language. By fine-tuning an LLM, we enable it to learn the correspondence between text and sign language, and facilitate generation. Considering the differences between sign and spoken language, we employ a stepwise prompting strategy to extract the inherent sign language knowledge within the LLM, thereby supporting the learning and generation process. Experimental results on How2Sign and Phoenix14T datasets demonstrate that our approach effectively leverages both the sign language knowledge and reasoning capabilities of LLM to align the different distribution and grammatical rules between sign and spoken language.
大型语言模型具有很强的推理能力和丰富的知识,给AI的许多任务带来了革命,但由于手语生成的复杂性和独特的规则,它们对手语生成的影响仍然有限。 在本文中,我们提议TEAch Me Sign(TEAM-Sign),将手语作为另一种自然语言。通过微调LLM,我们使得它能够学习文本和手语之间的通信,并促进生成。考虑到手语和口语之间的差别,我们采取了循序渐进的激励战略,在LLM内部提取固有的手语知识,从而支持学习和生成过程。关于How2Sign和Penix14T数据集的实验结果表明,我们的方法有效地利用了LM的手语知识和推理能力,将手语和口语之间的不同分布和语法规则结合起来。
Article 259
Title@2025-07-15 (2): Is Training Data Quality or Quantity More Impactful to Small Language Model Performance?
Title: Is Training Data Quality or Quantity More Impactful to Small Language Model Performance? | Ist Training Daten Qualität oder Quantität Impactful to Small Language Model Performance? | 培训数据质量或数量是否对小型语言模范业绩更有影响? 2411.15821v4 |
Authors (2): Aryan Sajith, Krishna Chaitanya Rao Kathala
This study investigates the relative impact of training data quality versus quantity on the performance of small language models (SLMs), utilizing the TinyStories dataset for empirical analysis. Analysis of dataset variations with respect to size (25% and 50% of the original size) and duplication (controlled rates of 25%, 50%, 75%, and 100%) was performed. Model performance was evaluated based on the validation loss, accuracy, and perplexity metrics. Results indicate training data quality plays a more significant role in the overall performance of SLMs, especially given the scale of this experiment. Minimal duplication positively impacted model accuracy (+0.87% increase in accuracy at 25% duplication) without significantly increasing perplexity (+0.52% increase going from 0% to 25% duplication), but excessive duplication led to pronounced performance degradation (-40% drop in accuracy at 100% duplication). The implications of this exploration extend beyond just model performance; training large-scale models imposes significant financial and computational burdens, which can be prohibitive for organizations, individuals, and the public at large, especially in developing countries. Additionally, the energy consumption associated with large-scale training raises environmental concerns. Understanding the relative importance of data quality versus quantity could democratize AI technology, making advanced models more accessible and sustainable for all.
本研究利用TinyStories数据集进行实证分析,考察训练数据质量与数量对小型语言模型(SLM)性能的相对影响。我们分析了数据集在规模(原始规模的25%和50%)和重复(25%、50%、75%和100%的受控重复率)方面的变化,并根据验证损失、准确率和困惑度指标评估模型性能。结果表明,训练数据质量对SLM的总体性能起着更重要的作用,特别是考虑到本实验的规模。少量重复对模型准确率有正面影响(25%重复时准确率提高0.87%),且未显著增加困惑度(从0%到25%重复时增加0.52%),但过度重复导致性能明显下降(100%重复时准确率下降40%)。这一探索的意义不限于模型性能;训练大规模模型会带来巨大的财务和计算负担,这对组织、个人和广大公众,特别是发展中国家而言可能难以承受。此外,大规模训练的能源消耗引发了环境方面的担忧。理解数据质量与数量的相对重要性有助于使AI技术民主化,使先进模型对所有人更易获得、更可持续。
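The experimental setup above can be pictured with a small sketch that builds a fixed-size dataset at a controlled duplication rate and converts a mean cross-entropy loss into perplexity. The `with_duplication` helper and its reading of "duplication rate" (the share of examples that are repeated copies of other examples) are assumptions; the abstract does not spell out how the rates were constructed. The perplexity formula assumes the loss is measured in nats.

```python
import math
import random


def with_duplication(samples, rate, seed=0):
    """Return a same-length list in which about `rate` of the items are repeats."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = random.Random(seed)
    n = len(samples)
    n_dup = int(round(rate * n))
    n_unique = max(1, n - n_dup)   # keep at least one original to copy from
    unique = samples[:n_unique]
    dups = [rng.choice(unique) for _ in range(n - n_unique)]
    return unique + dups


def perplexity(mean_cross_entropy):
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_cross_entropy)


# Hypothetical placeholder stories, not the TinyStories data.
stories = [f"story {i}" for i in range(100)]
quarter_dup = with_duplication(stories, 0.25)
print(len(quarter_dup), len(set(quarter_dup)))
```

Holding the total size fixed while varying the rate separates the effect of redundancy from the effect of dataset size.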
Article 260
Title@2025-07-15 (2): DS@GT at eRisk 2025: From prompts to predictions, benchmarking early depression detection with conversational agent based assessments and temporal attention models
Title: DS@GT at eRisk 2025: From prompts to predictions, benchmarking early depression detection with conversational agent based assessments and temporal attention models | DS@GT bei eRisk 2025: Von Aufforderungen zu Vorhersagen, Benchmarking der Frühdepressionserkennung mit gesprächsagentenbasierten Bewertungen und zeitlichen Aufmerksamkeitsmodellen | DS@GT在eRisk eRisk 2025:从提示到预测,将早期抑郁症检测与基于谈话剂的评估和时间关注模型作为基准 2507.10958v1 |
Authors (4): Anthony Miyaguchi, David Guecha, Yuwen Chiu, Sidharth Gaur
This Working Note summarizes the participation of the DS@GT team in two eRisk 2025 challenges. For the Pilot Task on conversational depression detection with large language models (LLMs), we adopted a prompt-engineering strategy in which diverse LLMs conducted BDI-II-based assessments and produced structured JSON outputs. Because ground-truth labels were unavailable, we evaluated cross-model agreement and internal consistency. Our prompt design methodology aligned model outputs with BDI-II criteria and enabled the analysis of conversational cues that influenced the prediction of symptoms. Our best submission, second on the official leaderboard, achieved DCHR = 0.50, ADODL = 0.89, and ASHR = 0.27.
本工作说明总结了DS@GT团队在2025年两次eRisk挑战中的参与情况。关于与大型语言模型(LLMs)对话抑郁症检测试点任务,我们采取了一项快速工程战略,不同LLMs进行了BDI-II基础评估,并产生了结构化的JSON产出。由于没有地面实况标签,我们评估了跨模式协议和内部一致性。我们的快速设计方法使模型产出与BDI-II标准相一致,并能够分析影响症状预测的谈话提示。我们提交的最佳文件,第二是在正式领导板上提交的,实现了DHRC=0.50,ADDL=0.89,ASHR=0.27。
Article 261
Title@2025-07-15 (2): Modeling Understanding of Story-Based Analogies Using Large Language Models
Title: Modeling Understanding of Story-Based Analogies Using Large Language Models | Modellierung des Verständnisses von geschichtebasierten Analogien mit großen Sprachmodellen | 使用大语言模式模拟模式模拟理解 2507.10957v1 |
Authors (4): Kalit Inani, Keshav Kabra, Vijay Marupudi, Sashank Varma
Recent advancements in Large Language Models (LLMs) have brought them closer to matching human cognition across a variety of tasks. How well do these models align with human performance in detecting and mapping analogies? Prior research has shown that LLMs can extract similarities from analogy problems but lack robust human-like reasoning. Building on Webb, Holyoak, and Lu (2023), the current study focused on a story-based analogical mapping task and conducted a fine-grained evaluation of LLM reasoning abilities compared to human performance. First, it explored the semantic representation of analogies in LLMs, using sentence embeddings to assess whether they capture the similarity between the source and target texts of an analogy, and the dissimilarity between the source and distractor texts. Second, it investigated the effectiveness of explicitly prompting LLMs to explain analogies. Throughout, we examine whether LLMs exhibit similar performance profiles to those observed in humans by evaluating their reasoning at the level of individual analogies, and not just at the level of overall accuracy (as prior studies have done). Our experiments include evaluating the impact of model size (8B vs. 70B parameters) and performance variation across state-of-the-art model architectures such as GPT-4 and LLaMA3. This work advances our understanding of the analogical reasoning abilities of LLMs and their potential as models of human reasoning.
大型语言模型(LLM)的最新进展使其在各种任务上更接近人类认知水平。这些模型在检测和映射类比方面与人类表现的吻合程度如何?先前的研究表明,LLM能够从类比问题中提取相似性,但缺乏稳健的类人推理。本研究以Webb、Holyoak和Lu(2023)为基础,聚焦于基于故事的类比映射任务,对LLM的推理能力与人类表现进行了细粒度的比较评估。首先,研究探讨了LLM中类比的语义表示,利用句子嵌入评估它们是否捕捉到类比中源文本与目标文本之间的相似性,以及源文本与干扰文本之间的不相似性。其次,研究考察了明确提示LLM解释类比的有效性。在整个研究中,我们在单个类比的层面(而不仅仅是像以往研究那样在总体准确率层面)评估LLM的推理,以考察其是否表现出与人类相似的表现特征。我们的实验包括评估模型规模(8B与70B参数)的影响,以及GPT-4和LLaMA3等最先进模型架构之间的性能差异。这项工作增进了我们对LLM类比推理能力及其作为人类推理模型的潜力的理解。
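The embedding analysis described above amounts to checking, for each analogy, whether the source text is closer to the target than to the distractor. A minimal sketch of that comparison follows; the three vectors are made-up toy values rather than outputs of any sentence encoder, and `prefers_target` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prefers_target(source, target, distractor):
    """True if the source embedding is more similar to the target than to the distractor."""
    return cosine(source, target) > cosine(source, distractor)


# Toy embeddings for one analogy item (illustrative values only).
source = [1.0, 0.0, 1.0]
target = [0.9, 0.1, 0.8]
distractor = [0.0, 1.0, 0.1]
print(prefers_target(source, target, distractor))
```

Scoring each item this way, instead of only averaging accuracy, is what allows a per-analogy comparison with human response patterns.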
Article 262
Title@2025-07-15 (2): Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models
Title: Prompt4Trust: A Reinforcement Learning Prompt Augmentation Framework for Clinically-Aligned Confidence Calibration in Multimodal Large Language Models | Prompt4Trust: Ein Verstärkungs-Learning Prompt Augmentation Framework für klinisch ausgerichtete Vertrauenskalibrierung in multimodalen großen Sprachmodellen | 提示4信任:在多式大语言模式中加强学习学习,促进临床一致信心校正的快速增强框架 2507.09279v2 |
Authors (4): Anita Kriz, Elizabeth Laura Janes, Xing Shen, Tal Arbel
Multimodal large language models (MLLMs) hold considerable promise for applications in healthcare. However, their deployment in safety-critical settings is hindered by two key limitations: (i) sensitivity to prompt design, and (ii) a tendency to generate incorrect responses with high confidence. As clinicians may rely on a model’s stated confidence to gauge the reliability of its predictions, it is especially important that when a model expresses high confidence, it is also highly accurate. We introduce Prompt4Trust, the first reinforcement learning (RL) framework for prompt augmentation targeting confidence calibration in MLLMs. A lightweight LLM is trained to produce context-aware auxiliary prompts that guide a downstream task MLLM to generate responses in which the expressed confidence more accurately reflects predictive accuracy. Unlike conventional calibration techniques, Prompt4Trust specifically prioritizes aspects of calibration most critical for safe and trustworthy clinical decision-making. Beyond improvements driven by this clinically motivated calibration objective, our proposed method also improves task accuracy, achieving state-of-the-art medical visual question answering (VQA) performance on the PMC-VQA benchmark, which is composed of multiple-choice questions spanning diverse medical imaging modalities. Moreover, our framework trained with a small downstream task MLLM showed promising zero-shot generalization to larger MLLMs in our experiments, suggesting the potential for scalable calibration without the associated computational costs. This work demonstrates the potential of automated yet human-aligned prompt engineering for improving the trustworthiness of MLLMs in safety-critical settings. Our codebase can be found at https://github.com/xingbpshen/prompt4trust.
多模态大型语言模型(MLLMs)在医疗保健应用方面前景可观。然而,它们在安全关键环境中的部署受到两个主要限制:(一) 对提示设计敏感;(二) 倾向于以高置信度给出错误回答。由于临床医生可能依赖模型所声明的置信度来判断其预测的可靠性,因此尤为重要的是,当模型表示高置信度时,它也应当高度准确。我们提出Prompt4Trust,这是首个面向MLLM置信度校准的提示增强强化学习(RL)框架。我们训练一个轻量级LLM来生成上下文感知的辅助提示,引导下游任务MLLM给出所表达的置信度更准确反映预测准确率的回答。与常规校准技术不同,Prompt4Trust特别优先考虑对安全可靠的临床决策最为关键的校准方面。除了这一临床驱动的校准目标带来的改进之外,我们的方法还提高了任务准确率,在由涵盖多种医学影像模态的多项选择题组成的PMC-VQA基准上取得了最先进的医学视觉问答(VQA)性能。此外,在我们的实验中,使用小型下游任务MLLM训练的框架对更大的MLLM表现出有希望的零样本泛化能力,这表明有可能在不增加相关计算成本的情况下实现可扩展的校准。这项工作展示了自动化且与人类对齐的提示工程在提高MLLM于安全关键环境中可信度方面的潜力。我们的代码库可在 https://github.com/xingbpshen/prompt4trust 查阅。
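One common way to quantify how well stated confidence tracks accuracy is the expected calibration error (ECE), sketched below with equal-width confidence bins. This is offered only as background on what "calibration" measures; the abstract does not say that ECE is the metric or the reward used by Prompt4Trust, and the sample confidences are invented.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(conf for conf, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / n) * abs(avg_conf - accuracy)
    return ece


# Invented example: the model says 90% four times but is right three times.
overconfident = expected_calibration_error([0.9, 0.9, 0.9, 0.9],
                                           [True, True, True, False])
print(round(overconfident, 3))
```

A clinically motivated objective may weight errors differently, for instance by penalizing confident mistakes more heavily than this symmetric gap does.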
Article 263
Title@2025-07-15 (2): Representation Bending for Large Language Model Safety
Title: Representation Bending for Large Language Model Safety | Darstellungsbiegen für große Sprachmodellsicherheit | 大语文示范语文安全示范语文代表名单 2504.01550v3 |
Authors (10): Ashkan Yousefpour, Taeheon Kim, Ryan S. Kwon, Seungbeen Lee, Wonje Jeung, Seungju Han, Alvin Wan, Harrison Ngan, Youngjae Yu, Jonghyun Choi
Large Language Models (LLMs) have emerged as powerful tools, but their inherent safety risks - ranging from harmful content generation to broader societal harms - pose significant challenges. These risks can be amplified by recent adversarial attacks, fine-tuning vulnerabilities, and the increasing deployment of LLMs in high-stakes environments. Existing safety-enhancing techniques, such as fine-tuning with human feedback or adversarial training, are still vulnerable as they address specific threats and often fail to generalize across unseen attacks, or require manual system-level defenses. This paper introduces RepBend, a novel approach that fundamentally disrupts the representations underlying harmful behaviors in LLMs, offering a scalable solution to enhance (potentially inherent) safety. RepBend brings the idea of activation steering - simple vector arithmetic for steering a model’s behavior during inference - to loss-based fine-tuning. Through extensive evaluation, RepBend achieves state-of-the-art performance, outperforming prior methods such as Circuit Breaker, RMU, and NPO, with up to 95% reduction in attack success rates across diverse jailbreak benchmarks, all with negligible reduction in model usability and general capabilities.
大型语言模型(LLMS)已成为强大的工具,但它们固有的安全风险 — — 从有害内容生成到更广泛的社会伤害 — — 构成了巨大的挑战。这些风险可以通过最近的对抗性攻击、微调脆弱性和高摄入环境越来越多地部署LMS来放大。现有的加强安全技术,如与人类反馈或对抗性培训进行微调等,仍然脆弱,因为这些技术处理具体威胁,往往无法在看不见的攻击中推广,或需要人工系统层面的防御。本文介绍了RepBend,这是一种全新的方法,从根本上打乱了LMS中有害行为背后的表述,为增强(潜在的固有)安全提供了可扩展的解决方案。ReBend带来了启动方向的想法 — — 在推断过程中指导模式行为的简单矢量算值 — — 进行基于损失的微调。通过广泛的评估,RepBend取得了最先进的性能,超过了先前的方法,如Cird Brederer、RMU和NPO等,在不同的破狱基准中袭击成功率下降高达95%,所有模型的可用性和一般能力都微不足道。
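The "simple vector arithmetic" behind activation steering can be shown in a few lines: a direction vector is added to a hidden state, scaled by a strength coefficient. The sketch below covers only this inference-time idea, not RepBend's loss-based fine-tuning. Deriving the direction as a difference of mean activations is one common choice and is an assumption here, as are the helper names and toy numbers.

```python
def mean_difference(group_a, group_b):
    """Direction = mean activation of group A minus mean activation of group B."""
    dim = len(group_a[0])
    mean_a = [sum(vec[i] for vec in group_a) / len(group_a) for i in range(dim)]
    mean_b = [sum(vec[i] for vec in group_b) / len(group_b) for i in range(dim)]
    return [a - b for a, b in zip(mean_a, mean_b)]


def steer(hidden, direction, alpha):
    """Shift a hidden state along `direction` with strength `alpha`."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy activations (illustrative values, not from a real model).
safe_acts = [[1.0, 1.0], [3.0, 1.0]]
unsafe_acts = [[0.0, 1.0], [0.0, 3.0]]
direction = mean_difference(safe_acts, unsafe_acts)
steered = steer([1.0, 2.0], direction, alpha=0.5)
print(direction, steered)
```

Turning this arithmetic into a training loss makes the shift part of the weights, so nothing has to be added to the activations at inference time.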
Article 264
Title@2025-07-15 (2): The GPT Surprise: Offering Large Language Model Chat in a Massive Coding Class Reduced Engagement but Increased Adopters Exam Performances
Title: The GPT Surprise: Offering Large Language Model Chat in a Massive Coding Class Reduced Engagement but Increased Adopters Exam Performances | Die GPT-Überraschung: Großes Sprachmodell-Chat in einer massiven Coding-Klasse bietet reduziertes Engagement, aber erhöhte Adopter-Prüfungsleistungen | GPT 惊喜:在大规模编码级减少参与中提供大语言示范聊天,但采用者考试成绩提高 2407.09975v2 |
Authors (9): Allen Nie, Yash Chandak, Miroslav Suzara, Ali Malik, Juliette Woodrow, Matt Peng, Mehran Sahami, Emma Brunskill, Chris Piech
Large language models (LLMs) are quickly being adopted in a wide range of learning experiences, especially via ubiquitous and broadly accessible chat interfaces like ChatGPT and Copilot. This type of interface is readily available to students and teachers around the world, yet relatively little research has been done to assess the impact of such generic tools on student learning. Coding education is an interesting test case, both because LLMs have strong performance on coding tasks, and because LLM-powered support tools are rapidly becoming part of the workflow of professional software engineers. To help understand the impact of generic LLM use on coding education, we conducted a large-scale randomized control trial with 5,831 students from 146 countries in an online coding class in which we provided some students with access to a chat interface with GPT-4. We estimate positive benefits on exam performance for adopters, the students who used the tool, but over all students, the advertisement of GPT-4 led to a significant average decrease in exam participation. We observe similar decreases in other forms of course engagement. However, this decrease is modulated by the student’s country of origin. Offering access to LLMs to students from low human development index countries increased their exam participation rate on average. Our results suggest there may be promising benefits to using LLMs in an introductory coding class, but also potential harms for engagement, which makes their longer term impact on student success unclear. Our work highlights the need for additional investigations to help understand the potential impact of future adoption and integration of LLMs into classrooms.
大型语言模式(LLMS)很快在广泛的学习经验中被采用,尤其是通过诸如ChattGPT和Copilation等无处不在且可广泛获取的聊天界面。这种类型的界面对全世界的学生和教师来说是很容易获得的,然而,为评估这类通用工具对学生学习的影响,我们所做的研究相对较少。编码教育是一个有趣的测试案例,因为LMS在编码任务方面表现良好,而且LLMM的动力支持工具正在迅速成为专业软件工程师工作流程的一部分。为了帮助理解通用LM的使用对编码教育的影响,我们对来自146个国家的5 831名学生进行了大规模随机化控制试验。我们向一些学生提供了与GPT-4的聊天界面。我们估计了对收养者、使用该工具的学生的考试成绩的积极好处,但是对所有学生来说,GPT-4的广告导致考试平均参加率大幅下降。我们注意到,其他课程参与形式也减少了。然而,这一下降是由学生的原籍国调整了对来自146个国家的5,831名学生进行了大规模随机控制试验。我们向一些学生提供了长期的LMS参加率,从我们的低级考试到学生参加率。
Article 265
Title@2025-07-15 (2): HanjaBridge: Resolving Semantic Ambiguity in Korean LLMs via Hanja-Augmented Pre-Training
Title: HanjaBridge: Resolving Semantic Ambiguity in Korean LLMs via Hanja-Augmented Pre-Training | HanjaBridge: Lösung semantischer Ambiguität in koreanischen LLMs über Hanja-Augmented Pre-Training | HanjaBridge:通过Hanja增强的培训前培训,解决韩国LLMLM中的语义模糊问题 2507.10920v1 |
Authors (1): Seungho Choi
Large language models (LLMs) often show poor performance in low-resource languages like Korean, partly due to unique linguistic challenges such as homophonous Sino-Korean words that are indistinguishable in Hangul script. To address this semantic ambiguity, we propose HanjaBridge, a novel meaning-injection technique integrated into a continual pre-training (CPT) framework. Instead of deterministically mapping a word to a single Hanja (Chinese character), HanjaBridge presents the model with all possible Hanja candidates for a given homograph, encouraging the model to learn contextual disambiguation. This process is paired with token-level knowledge distillation to prevent catastrophic forgetting. Experimental results show that HanjaBridge significantly improves Korean language understanding, achieving a 21% relative improvement on the KoBALT benchmark. Notably, by reinforcing semantic alignment between Korean and Chinese through shared Hanja, we observe a strong positive cross-lingual transfer. Furthermore, these gains persist even when Hanja augmentation is omitted at inference time, ensuring practical efficiency with no additional run-time cost.
大型语言模型(LLMS)通常在韩国等低资源语言中表现不佳,部分原因是独特的语言挑战,如韩文中无法区分的同质中朝语言。为解决这一语义模糊性,我们提议HanjaBridge,这是融入持续培训前框架的一种新颖的意义注入技术。与其将一个单语拼写成汉字(中文字符),HanjaBridge向特定同义词的所有可能汉字候选人展示了该模式,鼓励了学习背景矛盾的模型。这一过程与象征性知识蒸馏相匹配,以防止灾难性的遗忘。实验结果表明,HanjaBridge极大地提高了韩国语言理解度,实现了韩国语在KABALT基准上的21相对改进。值得注意的是,通过共享汉字,加强了韩文和中文之间的语义一致性,我们观察到了强烈的跨语言转移。此外,即使汉加增长在推论时间被忽略,确保实际效率而没有额外的运行成本,这些收益依然存在。
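The augmentation idea above, showing the model every Hanja candidate for a homograph instead of committing to one, can be sketched as a text transformation. The one-entry lexicon, the pre-split tokens, and the `word(candidate/candidate)` output format are all assumptions for illustration; the abstract does not specify how candidates are formatted in the training text.

```python
# Hypothetical one-entry lexicon: the Hangul word "사과" can mean
# "apple" (沙果) or "apology" (謝過).
HANJA_CANDIDATES = {"사과": ["沙果", "謝過"]}


def annotate_with_hanja(tokens, lexicon):
    """Append all Hanja candidates to each ambiguous token; leave others unchanged."""
    annotated = []
    for token in tokens:
        candidates = lexicon.get(token)
        if candidates:
            annotated.append(token + "(" + "/".join(candidates) + ")")
        else:
            annotated.append(token)
    return " ".join(annotated)


# Tokens are pre-split by hand here; a real pipeline would need a Korean tokenizer.
augmented = annotate_with_hanja(["사과", "를", "먹었다"], HANJA_CANDIDATES)
print(augmented)
```

Listing every candidate leaves the choice to context, which matches the stated goal of having the model learn disambiguation rather than inherit a fixed mapping.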
Article 266
Title@2025-07-15 (2): Fine-grained Stateful Knowledge Exploration: Effective and Efficient Graph Retrieval with Large Language Models
Title: Fine-grained Stateful Knowledge Exploration: Effective and Efficient Graph Retrieval with Large Language Models | Feinkörnige Stateful Knowledge Exploration: Effektive und effiziente Graph Retrieval mit großen Sprachmodellen | 精巧的、有国称的先进知识探索:具有大语言模型的高效率、高效益的图表检索 2401.13444v4 |
Authors (6): Dehao Tao, Congqi Wang, Feng Huang, Junhao Chen, Yongfeng Huang, Minghu Jiang
Large Language Models (LLMs) have shown impressive capabilities, yet updating their knowledge remains a significant challenge, often leading to outdated or inaccurate responses. A proposed solution is the integration of external knowledge bases, such as knowledge graphs, with LLMs. Most existing methods use a paradigm that treats the whole question as the objective, with relevant knowledge being incrementally retrieved from the knowledge graph. However, this paradigm often leads to a granularity mismatch between the target question and the retrieved entities and relations. As a result, the information in the question cannot precisely correspond to the retrieved knowledge. This may cause redundant exploration or omission of vital knowledge, thereby leading to increased computational cost and reduced retrieval accuracy. To address the limitations of coarse-grained knowledge exploration, we propose FiSKE, a novel paradigm for Fine-grained Stateful Knowledge Exploration. FiSKE first decomposes questions into fine-grained clues, then employs an adaptive mapping strategy during the knowledge exploration process to resolve ambiguity in clue-to-graph mappings. This strategy dynamically infers contextual correspondences while maintaining a stateful record of the mappings. A clue-driven termination mechanism ensures rigorous augmentation, leveraging fully mapped paths for LLMs while reverting to chain-of-thought reasoning when necessary. Our approach balances precision and efficiency. Experiments on multiple datasets revealed that our paradigm surpasses current advanced methods in knowledge retrieval while significantly reducing the average number of LLM invocations.
大型语言模型(LLMS)表现出了令人印象深刻的能力,但更新其知识仍然是一项重大挑战,往往导致过时或不准确的反应。一个拟议解决方案是将外部知识库(如知识图)与LLMS相结合,例如知识图和LLMS相结合。大多数现有方法都使用将整个问题作为目标的范例,从知识图中逐步提取相关知识。然而,这一范例往往导致目标问题与检索的实体和关系之间出现颗粒性不匹配。结果,问题中的信息无法准确与检索到的知识匹配。这可能导致重要知识的重复探索或遗漏,从而导致计算消费增加和检索准确性降低。为了解决粗略知识探索的局限性,我们建议FISKE,这是精细的国有知识探索的新范例。FISKE首先将问题引入精细的线索,然后在知识探索过程中采用适应性绘图战略,以解决线索到绘图的模糊性。这一战略动态地推断了背景对应,同时保持了绘图的状态记录,从而降低了检索准确性。一个线索驱动的图像化的升级的升级方法,同时确保了我们不断升级的升级的升级的系统。
Article 267
Title@2025-07-15 (2): How Stylistic Similarity Shapes Preferences in Dialogue Dataset with User and Third Party Evaluations
Title: How Stylistic Similarity Shapes Preferences in Dialogue Dataset with User and Third Party Evaluations | Wie stylistische Ähnlichkeiten im Dialogdatensatz mit Nutzer- und Drittanbieter-Bewertungen Vorlieben gestaltet | 在与用户和第三方评价的对话数据集中如何偏向于 与用户和第三方评价的对话 2507.10918v1 |
Authors (5): Ikumi Numaya, Shoji Moriya, Shiki Sato, Reina Akama, Jun Suzuki
Recent advancements in dialogue generation have broadened the scope of human-bot interactions, enabling not only contextually appropriate responses but also the analysis of human affect and sensitivity. While prior work has suggested that stylistic similarity between user and system may enhance user impressions, the distinction between subjective and objective similarity is often overlooked. To investigate this issue, we introduce a novel dataset that includes users’ preferences, subjective stylistic similarity based on users’ own perceptions, and objective stylistic similarity annotated by third party evaluators in open-domain dialogue settings. Analysis using the constructed dataset reveals a strong positive correlation between subjective stylistic similarity and user preference. Furthermore, our analysis suggests an important finding: users’ subjective stylistic similarity differs from third party objective similarity. This underscores the importance of distinguishing between subjective and objective evaluations and understanding the distinct aspects each captures when analyzing the relationship between stylistic similarity and user preferences. The dataset presented in this paper is available online.
最近对话生成的进展扩大了人-机器人互动的范围,不仅能够根据具体情况作出适当的反应,而且能够分析人类影响和敏感性。虽然先前的工作表明,用户和系统在形式上的相似性可能提高用户的印象,但主观和客观相似性的区别往往被忽视。为了调查这一问题,我们引入了一个新数据集,其中包括用户的偏好、基于用户自身看法的主观和模式相似性,以及第三方评价员在开放域对话环境中所注解的客观的风格相似性。使用构建的数据集的分析揭示了主观的风格相似性与用户偏好之间的强烈正相关关系。此外,我们的分析还提出了一个重要的结论:用户的主观风格相似性不同于第三方的目标相似性。这突出表明了区分主观和客观评价的重要性,并理解在分析模式相似性与用户偏好之间的关系时,必须了解每一种捕捉的不同方面。本文所展示的数据集可在网上查阅。
Article 268
Title@2025-07-15 (2): LiLM-RDB-SFC: Lightweight Language Model with Relational Database-Guided DRL for Optimized SFC Provisioning
Title: LiLM-RDB-SFC: Lightweight Language Model with Relational Database-Guided DRL for Optimized SFC Provisioning | LiLM-RDB-SFC: Leichtes Sprachmodell mit relationaler Datenbank-geführter DRL für optimierte SFC-Provisionierung | LILM-RDB-SFC:为优化SFC供应而与关系数据库-指导DRL 优化SFC供应的轻量语言模型 2507.10903v1 |
Authors (5): Parisa Fard Moshiri, Xinyu Zhu, Poonam Lohan, Burak Kantarci, Emil Janulewicz
Effective management of Service Function Chains (SFCs) and optimal Virtual Network Function (VNF) placement are critical challenges in modern Software-Defined Networking (SDN) and Network Function Virtualization (NFV) environments. Although Deep Reinforcement Learning (DRL) is widely adopted for dynamic network decision-making, its inherent dependency on structured data and fixed action rules often limits adaptability and responsiveness, particularly under unpredictable network conditions. This paper introduces LiLM-RDB-SFC, a novel approach that combines a Lightweight Language Model (LiLM) with a Relational Database (RDB) to answer network state queries, which in turn guide a DRL model toward efficient SFC provisioning. Our proposed approach leverages two LiLMs, Bidirectional and Auto-Regressive Transformers (BART) and the Fine-tuned Language Net T5 (FLAN-T5), to interpret network data and support diverse query types related to SFC demands, data center resources, and VNF availability. Results demonstrate that FLAN-T5 outperforms BART with a lower test loss (0.00161 compared to 0.00734), higher accuracy (94.79% compared to 80.2%), and less processing time (2 h 2 min compared to 2 h 38 min). Moreover, when compared to the large language model SQLCoder, FLAN-T5 matches the accuracy of SQLCoder while cutting processing time by 96% (SQLCoder: 54 h 43 min; FLAN-T5: 2 h 2 min).
服务功能链(SFC)和最佳虚拟网络功能(VNF)的有效管理是现代软件-定义网络(SDN)和网络功能虚拟化(NFV)环境中的关键挑战。虽然深度强化学习(DRL)被广泛采用,用于动态网络决策,但其固有的对结构化数据和固定行动规则的依赖往往限制适应性和反应能力,特别是在无法预测的网络条件下。本文介绍了LilM-RDB-SFC,一种将轻量语言模式(LiLM)与关系数据库(RDB)相结合的新做法,用于回答网络查询,以指导DRL(SFFCF)的高效提供模式(SDR)和网络虚拟化模式(SDR)的提供。虽然我们提议的方法在动态网络决策中广泛采用深度强化学习(DRLM)、双向和自动递增变换器(DRL)和精密语言网络 T5(FL-T),以解释网络数据和支持与SFFFFC的要求、数据中心资源以及VNFL的提供情况相关的不同查询类型。结果显示,FL-T的测试损失较低(0.00161为0.00161为0.00161,与0.00734的进度为0.00734)和直径2比Q的精度(CL)。
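The processing-time figures quoted in the LiLM-RDB-SFC abstract can be checked with a few lines of arithmetic. The sketch below only re-derives the percentages from the durations stated in the abstract; the helper names are illustrative and not taken from the paper.

```python
def to_minutes(hours: int, minutes: int) -> int:
    """Convert an 'H h M min' duration to total minutes."""
    return hours * 60 + minutes


def percent_reduction(baseline_min: int, new_min: int) -> float:
    """Relative time saved when moving from baseline_min to new_min, in percent."""
    return 100.0 * (baseline_min - new_min) / baseline_min


# Durations as reported in the abstract.
sqlcoder_min = to_minutes(54, 43)
flan_t5_min = to_minutes(2, 2)
bart_min = to_minutes(2, 38)

reduction_vs_sqlcoder = percent_reduction(sqlcoder_min, flan_t5_min)
reduction_vs_bart = percent_reduction(bart_min, flan_t5_min)

print(f"FLAN-T5 vs SQLCoder: {reduction_vs_sqlcoder:.1f}% less time")
print(f"FLAN-T5 vs BART:     {reduction_vs_bart:.1f}% less time")
```

The SQLCoder comparison rounds to the 96% stated in the abstract; the BART comparison is not given as a percentage in the source and is derived here only from the two reported durations.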
Article 269
Title@2025-07-15 (2): Multimodal Sentiment Analysis on CMU-MOSEI Dataset using Transformer-based Models
Title: Multimodal Sentiment Analysis on CMU-MOSEI Dataset using Transformer-based Models | Multimodale Sentiment-Analyse auf CMU-MOSEI-Datensatz mit Transformer-basierten Modellen | 利用基于变压器的模型对CMU-MOSEI数据集的多式感应分析 2505.06110v2 |
Authors (2): Jugal Gajjar, Kaustik Ranaware
This project performs multimodal sentiment analysis on the CMU-MOSEI dataset, using transformer-based models with early fusion to integrate text, audio, and visual modalities. We employ BERT-based encoders for each modality, extracting embeddings that are concatenated before classification. The model achieves strong performance, with 97.87% 7-class accuracy and a 0.9682 F1-score on the test set, demonstrating the effectiveness of early fusion in capturing cross-modal interactions. The training utilized Adam optimization (lr=1e-4), dropout (0.3), and early stopping to ensure generalization and robustness. Results highlight the superiority of transformer architectures in modeling multimodal sentiment, with a low MAE (0.1060) indicating precise sentiment intensity prediction. Future work may compare fusion strategies or enhance interpretability. This approach utilizes multimodal learning by effectively combining linguistic, acoustic, and visual cues for sentiment analysis.
该项目利用CMU-MOSEI数据集进行多式情绪分析,使用基于变压器的模型,早期结合文字、音频和视觉模式。我们为每种模式使用基于BERT的编码器,提取在分类前相互连接的嵌入器。模型的性能很强,达到97.87%的7级精度和测试集上的0.9682 F1分,表明早期融合在捕捉跨模式互动方面的有效性。培训利用了亚当优化(lr=1e-4)、辍学(0.3)和早期停止以确保普及性和稳健性。结果突出显示变压器结构在模拟多式联运情绪中的优势,低MAE(0.1060)表示精确的情绪强度预测。未来的工作可以比较聚合战略或提高可解释性。这种方法通过将语言、声音和视觉提示有效地结合情绪分析,利用多式联运学习。
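Early fusion, as described in the CMU-MOSEI abstract above, means concatenating the per-modality embeddings before a single classifier sees them. The following is a minimal sketch of that idea only: the embedding size, the random vectors standing in for encoder outputs, and the untrained linear classifier are all assumptions made for illustration, not the authors' model.

```python
import math
import random


def fuse_early(text_emb, audio_emb, visual_emb):
    """Early fusion: concatenate modality embeddings into one feature vector."""
    return list(text_emb) + list(audio_emb) + list(visual_emb)


def linear_softmax(features, weights, biases):
    """Apply a linear layer followed by a numerically stable softmax."""
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


random.seed(0)
EMB_DIM = 4        # hypothetical per-modality embedding size
NUM_CLASSES = 7    # the abstract reports 7-class accuracy

# Random stand-ins for the outputs of the three modality encoders.
text_emb = [random.gauss(0, 1) for _ in range(EMB_DIM)]
audio_emb = [random.gauss(0, 1) for _ in range(EMB_DIM)]
visual_emb = [random.gauss(0, 1) for _ in range(EMB_DIM)]

fused = fuse_early(text_emb, audio_emb, visual_emb)

# Untrained classifier weights, only to show the shapes involved.
weights = [[random.gauss(0, 0.1) for _ in range(len(fused))] for _ in range(NUM_CLASSES)]
biases = [0.0] * NUM_CLASSES

probs = linear_softmax(fused, weights, biases)
print(len(fused), len(probs))
```

Because the classifier operates on the concatenated vector, its weights can mix features from different modalities directly, which is the property the abstract credits for capturing cross-modal interactions.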
Article 270
Title@2025-07-15 (2): NavComposer: Composing Language Instructions for Navigation Trajectories through Action-Scene-Object Modularization
Title: NavComposer: Composing Language Instructions for Navigation Trajectories through Action-Scene-Object Modularization | NavComposer: Komponieren von Sprachanweisungen für Navigations-Trajektorien durch Modularisierung von Action-Scene-Objekten | 导航元件: 通过 Action-Scene-Object 模块化组合导航轨迹的语言指导 2507.10894v1 |
Authors (5): Zongtao He, Liuyi Wang, Lu Chen, Chengju Liu, Qijun Chen
Language-guided navigation is a cornerstone of embodied AI, enabling agents to interpret language instructions and navigate complex environments. However, expert-provided instructions are limited in quantity, while synthesized annotations often lack quality, making them insufficient for large-scale research. To address this, we propose NavComposer, a novel framework for automatically generating high-quality navigation instructions. NavComposer explicitly decomposes semantic entities such as actions, scenes, and objects, and recomposes them into natural language instructions. Its modular architecture allows flexible integration of state-of-the-art techniques, while the explicit use of semantic entities enhances both the richness and accuracy of instructions. Moreover, it operates in a data-agnostic manner, supporting adaptation to diverse navigation trajectories without domain-specific training. Complementing NavComposer, we introduce NavInstrCritic, a comprehensive annotation-free evaluation system that assesses navigation instructions on three dimensions: contrastive matching, semantic consistency, and linguistic diversity. NavInstrCritic provides a holistic evaluation of instruction quality, addressing limitations of traditional metrics that rely heavily on expert annotations. By decoupling instruction generation and evaluation from specific navigation agents, our method enables more scalable and generalizable research. Extensive experiments provide direct and practical evidence for the effectiveness of our method.
语言引导导航是体现的AI的基石,使代理人能够解释语言指令和导航复杂的环境,然而,专家提供的指示数量有限,而综合说明往往缺乏质量,因此不足以进行大规模研究。为此,我们提议NavComposer,这是一个自动生成高质量导航指示的新框架。NavComposer将动作、场景和对象等语义实体明确分解,并将其重新纳入自然语言指令。其模块架构允许灵活地整合最新技术,而明确使用语义实体既能提高指示的丰富性和准确性,又能提高指示的丰富性和准确性。此外,它以数据通识方式运作,支持适应不同导航轨迹,而无需进行特定领域培训。我们补充了NavInstrictor,一个全面的无说明性的评价系统,在三个方面评估导航指令:对比性匹配、语义一致性和语言多样性。NavInstrictrictal提供对教学质量的全面评价,解决传统测量指标的局限性,因为传统测量方法在很大程度上依赖专业导航师的精确度,从而能够提供更精确的教学方法。
Article 271
Title@2025-07-15 (2): ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning
Title: ZebraLogic: On the Scaling Limits of LLMs for Logical Reasoning | ZebraLogic: Auf den Skalierungsgrenzen von LLMs für logische Vernunft | ZebraLogic:关于逻辑理由解释的LLMs限制限度 2502.01100v2 |
Authors (7): Bill Yuchen Lin, Ronan Le Bras, Kyle Richardson, Ashish Sabharwal, Radha Poovendran, Peter Clark, Yejin Choi
We investigate the logical reasoning capabilities of large language models (LLMs) and their scalability in complex non-monotonic reasoning. To this end, we introduce ZebraLogic, a comprehensive evaluation framework for assessing LLM reasoning performance on logic grid puzzles derived from constraint satisfaction problems (CSPs). ZebraLogic enables the generation of puzzles with controllable and quantifiable complexity, facilitating a systematic study of the scaling limits of models such as Llama, o1 models, and DeepSeek-R1. By encompassing a broad range of search space complexities and diverse logical constraints, ZebraLogic provides a structured environment to evaluate reasoning under increasing difficulty. Our results reveal a significant decline in accuracy as problem complexity grows – a phenomenon we term the curse of complexity. This limitation persists even with larger models and increased inference-time computation, suggesting inherent constraints in current LLM reasoning capabilities. Additionally, we explore strategies to enhance logical reasoning, including Best-of-N sampling, backtracking mechanisms, and self-verification prompts. Our findings offer critical insights into the scalability of LLM reasoning, highlight fundamental limitations, and outline potential directions for improvement.
我们调查了大型语言模型(LLMS)的逻辑推理能力及其在复杂的非口头推理中的可伸缩性。为此,我们引入了ZebraLogic,这是一个全面评估框架,用于评估LLM在来自约束性满意度问题的逻辑网格拼图(CSPs)上的推理性能。ZebraLogic能够生成可控制和可量化复杂度的拼图,便于系统研究Llama、o1模型和DeepSeek-R1等模型的缩放性极限,包括广泛的搜索空间复杂性和多种逻辑制约。ZebraLogic提供了一种结构化的环境,用以评估日益困难的推理。我们的结果显示,随着问题复杂性的增加,LM推理的准确性会大大下降 – – 一种我们称之为复杂性的诅咒现象。这种限制即使存在更大的模型,而且推导时间计算也增加了,这表明LLMM推理能力的内在限制。此外,我们探索了加强逻辑推理的战略,包括最佳采样、回机制以及自我验证。我们的调查结果为LM推理提供了关键性的精确洞察力、强调基本限制和可能的改进方向。
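To make "search space complexity" in the ZebraLogic abstract concrete, the sketch below counts the candidate assignments of a logic grid puzzle and solves a tiny one by brute force. It assumes the usual grid-puzzle setup in which each attribute is a permutation of its values over the houses; the three-house puzzle and its clues are invented for illustration and are not drawn from the benchmark.

```python
from itertools import permutations
from math import factorial


def search_space_size(num_houses: int, num_attributes: int) -> int:
    """Each attribute is an independent permutation over the houses."""
    return factorial(num_houses) ** num_attributes


COLORS = ("red", "green", "blue")
PETS = ("cat", "dog", "fish")


def clues_hold(colors, pets) -> bool:
    """Hypothetical clues; house positions are indices 0..2 from left to right."""
    return (
        colors.index("red") + 1 == colors.index("green")   # red is directly left of green
        and pets[colors.index("blue")] == "dog"           # the dog lives in the blue house
        and pets[colors.index("green")] == "cat"          # the cat lives in the green house
        and pets[0] != "fish"                              # the fish is not in the first house
    )


def solve():
    """Enumerate every assignment and keep those satisfying all clues."""
    return [
        (colors, pets)
        for colors in permutations(COLORS)
        for pets in permutations(PETS)
        if clues_hold(colors, pets)
    ]


solutions = solve()
print(search_space_size(3, 2), solutions)
print(search_space_size(5, 6))
```

Exhaustive enumeration is feasible for 36 candidates but not for the roughly 3 × 10^12 candidates of a five-house, six-attribute grid, which illustrates why accuracy can be studied as a function of search space size.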
Article 272
Title@2025-07-15 (2): Domain-Adaptive Small Language Models for Structured Tax Code Prediction
Title: Domain-Adaptive Small Language Models for Structured Tax Code Prediction | Domain-Adaptive kleine Sprachmodelle für strukturierte Steuervorhersage | 结构化税法预测结构化税法 2507.10880v1 |
Authors (3): Souvik Nath, Sumit Wadhwa, Luiz Perez
Every day, multinational firms process thousands of transactions, each of which must adhere to tax regulations that vary by jurisdiction and are often nuanced. The determination of product and service tax codes, such as HSN or SAC, is a major use case in tax compliance. An accurate determination of such codes is imperative to avoid any tax penalties. This paper proposes a domain-adaptive small language model (SLM) with an encoder-decoder architecture for the enhanced prediction of product and service tax codes. In this approach, we address the problem of predicting hierarchical tax code sequences using unstructured product and services data. We employ an SLM based on an encoder-decoder architecture, as this enables sequential generation of tax codes to capture the hierarchical dependencies present within the tax codes. Our experiments demonstrate that encoder-decoder SLMs can be successfully applied to the sequential prediction of structured tax codes, a domain that remains comparatively unexplored in current NLP research. In this paper, we demonstrate the superior performance of the domain-adaptive encoder-decoder SLMs over flat classifiers when applied to the Harmonized System of Nomenclature (HSN), and achieve superior results compared to decoder-only and encoder-only architectures for structured sequence generation tasks. This approach can also be scaled to other government-mandated tax commodity codes, such as United Nations Standard Products and Services Codes (UNSPSC), or Brazil’s Nomenclatura Comum do Mercosul (NCM).
多国公司每天处理数千项交易,其中每个公司都必须遵守因管辖范围不同而经常细化的税务条例; 确定产品和服务税法,如HSN或SAC等产品和服务税法是税务合规的一个主要用途案例; 准确确定此类税法对于避免任何税收处罚至关重要; 本文建议采用一个域调小型语言模式,其中含有一个用于改进产品和服务税法预测的编码器解码器结构; 采用这一方法,我们用非结构化产品和服务数据来预测等级税法序列的问题; 我们使用基于编码器解码结构的可持续土地管理结构,使顺序生成税法能够捕捉税法中存在的等级依赖性。 我们的实验表明,编码解码的编码可以成功地应用于结构税法的顺序预测,而目前NLPP研究中仍相对没有探索的领域。 在本文中,我们用非结构化产品和服务数据来显示域调码解码解码系统相对于平坦的解码系统(联合国统一税法系统、不标准化的系统)的高级解码和升级的系统, 也能够实现这种标准化的标准化的代税法的系统, 。
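The "hierarchical dependencies" mentioned in the tax-code abstract can be illustrated by viewing a code as a sequence of nested prefixes rather than one flat label. The sketch below assumes codes are digit strings whose successive two-digit extensions form nested levels; the example codes and helper names are illustrative only and do not come from the paper.

```python
def hierarchical_tokens(code: str) -> list[str]:
    """Split a digit code into nested prefixes, two digits per level."""
    digits = code.replace(".", "").replace(" ", "")
    if not digits.isdigit() or len(digits) % 2 != 0:
        raise ValueError(f"expected an even-length digit code, got {code!r}")
    return [digits[:end] for end in range(2, len(digits) + 1, 2)]


def shared_levels(code_a: str, code_b: str) -> int:
    """Number of leading hierarchy levels two codes have in common."""
    depth = 0
    for a, b in zip(hierarchical_tokens(code_a), hierarchical_tokens(code_b)):
        if a != b:
            break
        depth += 1
    return depth


print(hierarchical_tokens("84713010"))
print(shared_levels("84713010", "84715000"))
```

A flat classifier treats the two example codes as unrelated labels, whereas a sequence model that emits one level at a time can reuse what it has learned about the shared leading levels.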
Article 273
Title@2025-07-15 (2): GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment
Title: GenARM: Reward Guided Generation with Autoregressive Reward Model for Test-time Alignment | GenARM: Reward-geführte Generation mit autoregressivem Reward-Modell für Testzeitausrichtung | GENARM: 具有自动递减奖益模型的奖赏制向导生成(测试时间对齐自动递减奖模型) 2410.08193v5 |
Authors (7): Yuancheng Xu, Udari Madhushani Sehwag, Alec Koppel, Sicheng Zhu, Bang An, Furong Huang, Sumitra Ganesh
Large Language Models (LLMs) exhibit impressive capabilities but require careful alignment with human preferences. Traditional training-time methods fine-tune LLMs using human preference datasets but incur significant training costs and require repeated training to handle diverse user preferences. Test-time alignment methods address this by using reward models (RMs) to guide frozen LLMs without retraining. However, existing test-time approaches rely on trajectory-level RMs which are designed to evaluate complete responses, making them unsuitable for autoregressive text generation that requires computing next-token rewards from partial responses. To address this, we introduce GenARM, a test-time alignment approach that leverages the Autoregressive Reward Model, a novel reward parametrization designed to predict next-token rewards for efficient and effective autoregressive generation. Theoretically, we demonstrate that this parametrization can provably guide frozen LLMs toward any distribution achievable by traditional RMs within the KL-regularized reinforcement learning framework. Experimental results show that GenARM significantly outperforms prior test-time alignment baselines and matches the performance of training-time methods. Additionally, GenARM enables efficient weak-to-strong guidance, aligning larger LLMs with smaller RMs without the high costs of training larger models. Furthermore, GenARM supports multi-objective alignment, allowing real-time trade-offs between preference dimensions and catering to diverse user preferences without retraining. Our project page is available at: https://genarm.github.io.
大型语言模型(LLMS)表现出令人印象深刻的能力,但需要与人类偏好密切配合; 传统的培训时间方法利用人类偏好数据集对LLMS进行精细调整,但需要大量培训费用,并需要反复培训才能满足用户的不同偏好; 测试时间调整方法通过奖赏模式(RMs)来解决这个问题,引导冻结的LLMS,无需再培训; 但是,现有的测试时间方法依靠轨迹水平的RMS, 目的是评价完整的响应, 使其不适合自动递减的文本生成, 需要从部分响应中计算下点奖励。 为此,我们采用了GENARM, 测试时间调整方法利用自动递增模式(BenARM) 的测试时间调整方法, 利用测试时间调整方法来利用自动递增模式(MLMS)的新颖的奖励, 目的是预测对高效和高效的自动递增型的自动递增新一代的奖励。 理论上,我们证明,这种平衡性能引导冻结LMS公司在KMS(KMS) 常规培训中的任何分配。
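The GenARM abstract describes guiding a frozen model with next-token rewards under a KL-regularized objective. A minimal sketch of what such guidance can look like at a single decoding step follows. It uses the standard closed form for KL-regularized reward maximization, in which the guided distribution is proportional to the base probability times exp(reward / beta); the function names, toy numbers, and the choice to apply this per token are assumptions for illustration and may differ from the paper's exact decoding rule.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def guided_next_token_probs(base_logprobs, token_rewards, beta):
    """Reweight base next-token log-probabilities by per-token rewards.

    beta is the KL-regularization strength: a large beta stays close to the
    frozen base model, a small beta follows the reward more aggressively.
    """
    scores = [lp + r / beta for lp, r in zip(base_logprobs, token_rewards)]
    return softmax(scores)


# Toy three-token vocabulary; all numbers are hypothetical.
base_probs = [0.5, 0.3, 0.2]
base_logprobs = [math.log(p) for p in base_probs]
token_rewards = [0.0, 1.0, 0.0]   # the reward model prefers token 1

guided = guided_next_token_probs(base_logprobs, token_rewards, beta=1.0)
nearly_base = guided_next_token_probs(base_logprobs, token_rewards, beta=1e9)
print([round(p, 3) for p in guided])
```

Because beta enters only at decoding time, the trade-off between the base model and the reward can be changed per request without retraining, which is consistent with the test-time flexibility the abstract describes.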
Article 274
Title@2025-07-15 (2): Jan-nano Technical Report
Title: Jan-nano Technical Report | Jan-nano Technischer Bericht | Jan-nano技术报告 2506.22760v2 |
Authors (2): Alan Dao, Dinh Bach Vu
Most language models face a fundamental tradeoff where powerful capabilities require substantial computational resources. We shatter this constraint with Jan-nano, a 4B parameter language model that redefines efficiency through radical specialization: instead of trying to know everything, it masters the art of finding anything instantly. Fine-tuned from Qwen3-4B using our novel multi-stage Reinforcement Learning with Verifiable Rewards (RLVR) system that completely eliminates reliance on next token prediction training (SFT), Jan-nano achieves 83.2% on the SimpleQA benchmark with MCP integration while running on consumer hardware. With 128K context length, Jan-nano proves that intelligence isn’t about scale; it’s about strategy.
多数语言模式都面临一个根本性的权衡,因为强大的能力需要大量的计算资源。 我们打破了这一制约,用一个 Jan-nano 4B 参数语言模型,通过激进的专业化重新定义了效率:它不是试图了解一切,而是掌握了立即找到任何东西的艺术。 从Quen3-4B 的精细调整,利用我们的新颖的多阶段强化学习和可验证的奖励(RLVR)系统,完全消除对下一次象征性预测培训的依赖,Jan-nano在简单QA基准上取得了83.2%的成绩,同时在消费硬件上运行。 Jan-nano用128K的上下文长度证明智能不是关于规模的,而是关于战略的。
Article 275
Title@2025-07-15 (2): A quantum semantic framework for natural language processing
Title: A quantum semantic framework for natural language processing | Ein quantensemantischer Rahmen für die natürliche Sprachverarbeitung | 自然语言处理的量子语义框架 2506.10077v2 |
Authors (6): Christopher J. Agostino, Quan Le Thien, Molly Apsel, Denizhan Pak, Elina Lesyk, Ashabari Majumdar
Semantic degeneracy represents a fundamental property of natural language that extends beyond simple polysemy to encompass the combinatorial explosion of potential interpretations that emerges as semantic expressions increase in complexity. In this work, we argue this property imposes fundamental limitations on Large Language Models (LLMs) and other modern NLP systems, precisely because they operate within natural language itself. Using Kolmogorov complexity, we demonstrate that as an expression’s complexity grows, the amount of contextual information required to reliably resolve its ambiguity explodes combinatorially. The computational intractability of recovering a single intended meaning for complex or ambiguous text therefore suggests that the classical view that linguistic forms possess intrinsic meaning in and of themselves is conceptually inadequate. We argue instead that meaning is dynamically actualized through an observer-dependent interpretive act, a process whose non-deterministic nature is most appropriately described by a non-classical, quantum-like logic. To test this hypothesis, we conducted a semantic Bell inequality test using diverse LLM agents. Our experiments yielded average CHSH expectation values from 1.2 to 2.8, with several runs producing values (e.g., 2.3-2.4) in significant violation of the classical boundary ($|S| \leq 2$), demonstrating that linguistic interpretation under ambiguity can exhibit non-classical contextuality, consistent with results from human cognition experiments. These results inherently imply that classical frequentist-based analytical approaches for natural language are necessarily lossy. Instead, we propose that Bayesian-style repeated sampling approaches can provide more practically useful and appropriate characterizations of linguistic meaning in context.
语义退化是自然语言的基本属性,它超越简单的多语制,包括了潜在解释的组合爆炸,这些解释随着语义表达形式的复杂性增加而出现。在这项工作中,我们争辩说,这种属性对大语言模型(LLMS)和其他现代NLP系统施加了根本性的限制,恰恰因为这些模型和其他现代NLP系统在自然语言本身范围内运作。使用科尔莫戈洛夫的复杂性,我们证明,作为一种表达形式的复杂性增加,可靠解决其模糊性所需的背景信息量会以组合方式爆炸。因此,在计算上难以找到复杂或模糊文本的单一意图含义,这表明语言形式本身具有内在含义的经典观点在概念上是不充分的。我们说,通过观察员独立的解释行为,这个过程的非定义性质最恰当地被非古典的、量级的逻辑所描述。为了检验这一假设,我们用不同的LM代理物进行一个基于语义的语义上的语义不平等性测试。我们的实验得出了从1.2到2.8的平均CHH预期值,而若干次的自然形式具有内在含义,而不具有内在的逻辑上的解读性解释性结果,可以证明,在逻辑上的逻辑上的逻辑上的逻辑上,在逻辑上可以证明,在逻辑上的逻辑上的逻辑上的逻辑上的逻辑上的逻辑上可以提供。
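The CHSH quantity referred to in the quantum-semantics abstract combines four correlation values measured under two settings per party. The sketch below uses one common sign convention, S = E(a,b) - E(a,b') + E(a',b) + E(a',b'), and estimates each correlation as the mean product of ±1 outcomes. How the paper maps LLM interpretations onto ±1 outcomes is not stated in the abstract, so the inputs here are purely illustrative.

```python
import math

CLASSICAL_BOUND = 2.0                    # |S| <= 2 for local hidden-variable models
TSIRELSON_BOUND = 2.0 * math.sqrt(2.0)   # maximum |S| allowed by quantum mechanics


def correlation(outcome_pairs):
    """Mean product of paired outcomes, each outcome being +1 or -1."""
    return sum(x * y for x, y in outcome_pairs) / len(outcome_pairs)


def chsh(e_ab, e_ab_prime, e_a_prime_b, e_a_prime_b_prime):
    """CHSH combination under the sign convention stated in the lead-in."""
    return e_ab - e_ab_prime + e_a_prime_b + e_a_prime_b_prime


# A deterministic strategy in which both parties always answer +1.
always_agree = [(1, 1)] * 10
e = correlation(always_agree)
s_classical = chsh(e, e, e, e)

# Ideal quantum correlations that saturate the Tsirelson bound.
c = 1.0 / math.sqrt(2.0)
s_quantum = chsh(c, -c, c, c)

print(s_classical, round(s_quantum, 3))
```

Under this convention a value of S above 2 cannot be produced by a classical local model, and the largest value reported in the abstract (2.8) sits just below the Tsirelson bound of about 2.83.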
Article 276
Title@2025-07-14 (1): WhisperKit: On-device Real-time ASR with Billion-Scale Transformers
Title: WhisperKit: On-device Real-time ASR with Billion-Scale Transformers | WhisperKit: On-Device Echtzeit-ASR mit Milliarden-Scale-Transformatoren | WhiseperKitt:使用十亿个星级变换器的实时实时ASR 2507.10860v1 |
Authors (5): Atila Orhon, Arda Okan, Berkin Durmus, Zach Nagengast, Eduardo Pacheco
Real-time Automatic Speech Recognition (ASR) is a fundamental building block for many commercial applications of ML, including live captioning, dictation, meeting transcriptions, and medical scribes. Accuracy and latency are the most important factors when companies select a system to deploy. We present WhisperKit, an optimized on-device inference system for real-time ASR that significantly outperforms leading cloud-based systems. We benchmark against server-side systems that deploy a diverse set of models, including a frontier model (OpenAI gpt-4o-transcribe), a proprietary model (Deepgram nova-3), and an open-source model (Fireworks large-v3-turbo). Our results show that WhisperKit matches the lowest latency at 0.46s while achieving the highest accuracy, at 2.2% WER. The optimizations behind the WhisperKit system are described in detail in this paper.
实时自动语音识别(ASR)是ML许多商业应用的基本构件,包括现场字幕、听写、会议抄录和医学文士。在公司选择部署系统时,准确性和长期性是最重要的因素。我们介绍了实时自动语音识别(ASR)优化的实时自动语音识别(WhisperKit)系统,该系统的性能大大优于主要云基系统。我们以服务器侧面系统为基准,这些系统部署多种模型,包括前沿模型(OpenAI gpt-4o-rancrip)、专利模型(Deprova-3)和开源模型(Fireworks lar-V3-turbo)。我们的结果表明,WhisperKit在达到最高精度2.2% WER的同时,与最小值为0.46。WhisperKit系统背后的优化在本文中作了详细描述。
Article 277
Title@2025-07-14 (1): MultiVox: Benchmarking Voice Assistants for Multimodal Interactions
Title: MultiVox: Benchmarking Voice Assistants for Multimodal Interactions | MultiVox: Benchmarking-Sprachassistenten für multimodale Interaktionen | MultiVox:多模式互动基准语音助理 2507.10859v1 |
Authors (7): Ramaneswaran Selvakumar, Ashish Seth, Nishit Anand, Utkarsh Tyagi, Sonal Kumar, Sreyan Ghosh, Dinesh Manocha
The rapid progress of Large Language Models (LLMs) has empowered omni models to act as voice assistants capable of understanding spoken dialogues. These models can process multimodal inputs beyond text, such as speech and visual data, enabling more context-aware interactions. However, current benchmarks fall short in comprehensively evaluating how well these models generate context-aware responses, particularly when it comes to implicitly understanding fine-grained speech characteristics, such as pitch, emotion, timbre, and volume or the environmental acoustic context such as background sounds. Additionally, they inadequately assess the ability of models to align paralinguistic cues with complementary visual signals to inform their responses. To address these gaps, we introduce MultiVox, the first omni voice assistant benchmark designed to evaluate the ability of voice assistants to integrate spoken and visual cues including paralinguistic speech features for truly multimodal understanding. Specifically, MultiVox includes 1000 human-annotated and recorded speech dialogues that encompass diverse paralinguistic features and a range of visual cues such as images and videos. Our evaluation on 9 state-of-the-art models reveals that, although humans excel at these tasks, current models consistently struggle to produce contextually grounded responses.
大语言模型(LLMS)的快速进步使全方位模型成为能够理解口头对话的语音助理,这些模型可以处理文字以外的多式联运投入,例如言语和视觉数据,从而能够进行更符合情境的相互作用;然而,目前的基准在全面评估这些模型产生符合情境的响应能力方面不足,特别是在暗示理解细微的语音特征时,如声调、情感、音调和音量或背景声音等环境声学背景。此外,这些模型没有充分评估模型将副语言提示与补充视觉信号相匹配的能力,以通报它们的反应。为了弥补这些差距,我们引入了多式Vox,这是第一个旨在评价语音助理整合语音和视觉提示的能力的全方位语音助理基准,其中包括促进真正多式理解的超语种语言语言语言语言语言特征。具体地说,多式Vox包括1 000次包含多种语言特征和视觉提示的语音对话,如图像和视频等一系列视觉提示。我们对9个最先进的模型的评估显示,尽管人类在这些任务中非常出色,但当前模型始终在努力制作基于背景的反应。
Article 278
Title@2025-07-14 (1): LLMs on Trial: Evaluating Judicial Fairness for Large Language Models
Title: LLMs on Trial: Evaluating Judicial Fairness for Large Language Models | LLMs on Trial: Bewertung der Gerechtigkeit für große Sprachmodelle | 审判法学硕士:评价大语言模式的司法公平性 2507.10852v1 |
Authors (13): Yiran Hu, Zongyue Xue, Haitao Li, Siyuan Zheng, Qingjing Chen, Shaochun Wang, Xihan Zhang, Ning Zheng, Yun Liu, Qingyao Ai, Yiqun Liu, Charles L. A. Clarke, Weixing Shen
Large Language Models (LLMs) are increasingly used in high-stakes fields where their decisions impact rights and equity. However, LLMs’ judicial fairness and implications for social justice remain underexplored. When LLMs act as judges, the ability to fairly resolve judicial issues is a prerequisite to ensure their trustworthiness. Based on theories of judicial fairness, we construct a comprehensive framework to measure LLM fairness, leading to a selection of 65 labels and 161 corresponding values. Applying this framework to the judicial system, we compile an extensive dataset, JudiFair, comprising 177,100 unique case facts. To achieve robust statistical inference, we develop three evaluation metrics, inconsistency, bias, and imbalanced inaccuracy, and introduce a method to assess the overall fairness of multiple LLMs across various labels. Through experiments with 16 LLMs, we uncover pervasive inconsistency, bias, and imbalanced inaccuracy across models, underscoring severe LLM judicial unfairness. Particularly, LLMs display notably more pronounced biases on demographic labels, with slightly less bias on substance labels compared to procedure ones. Interestingly, increased inconsistency correlates with reduced biases, but more accurate predictions exacerbate biases. While we find that adjusting the temperature parameter can influence LLM fairness, model size, release date, and country of origin do not exhibit significant effects on judicial fairness. Accordingly, we introduce a publicly available toolkit containing all datasets and code, designed to support future research in evaluating and improving LLM fairness.
大型语言模型(LLMS)越来越多地用于其决定影响权利和公平的重要领域。然而,LLMS的司法公正和对社会正义的影响仍未得到充分探讨。当LLMS担任法官时,公平解决司法问题的能力是确保其可信任性的先决条件。根据司法公正理论,我们建立一个综合框架,以衡量LLM公平性,导致挑选65个标签和161个相应价值。在司法系统适用这一框架时,我们汇编了一个广泛的数据集,JudiFair,其中包括177,100个独特的案例事实。为了实现稳健的统计推断,我们制定了三种评价指标、不一致、偏向和不准确的不准确性,并引入了一种评估多种LMMs在各种标签上的总体公平性的方法。通过16 LMS的实验,我们发现各模型之间普遍存在不一致、偏差和不准确的不准确性,强调LM司法不公平性严重。特别是,LMSM在人口标签上表现出更加明显的偏差,与程序上略低于177,100个独特的案例。为了实现强有力的统计推断,我们很有意思的是,我们增加了与偏差的关联性、不一致性,我们增加了各种偏差性、不一致性、偏差性、偏差性、偏差性、偏差、偏差、偏差、偏差性、偏差性、偏差性、偏差性、对准性、比比比分比比分会会会调整了各种偏差性、更差性、比差性、更能性、更细性、更能、更能、更能性、更能、更能、更能、更能性、更能、更能、更能、更能、更能、更能、更能能、更能、更能、更能、更能、更能、更能、更能、更能、更能、更能、更能、更能、更能、更能、更能、更能、更能、更能、更能、更能、更能性、更能、更能、更能、更能、更能、更能、更能、更能、更能、更能、更能、更能、更能、更能、更能、更能能能能、更能、更能、更能、更能、更能性推。
Article 279
Title@2025-07-14 (1): Deep Binding of Language Model Virtual Personas: a Study on Approximating Political Partisan Misperceptions
Title: Deep Binding of Language Model Virtual Personas: a Study on Approximating Political Partisan Misperceptions | Deep Binding of Language Model Virtual Personas: eine Studie über die Annäherung der politischen Partisanen-Misswahrnehmungen | 语言模拟虚拟人:关于政治党派近似误解的研究 2504.11673v4 |
Authors (6): Minwoo Kang, Suhong Moon, Seung Hyeong Lee, Ayush Raj, Joseph Suh, David M. Chan
Large language models (LLMs) are increasingly capable of simulating human behavior, offering cost-effective ways to estimate user responses to various surveys and polls. However, the questions in these surveys usually reflect socially understood attitudes: the patterns of attitudes of old/young, liberal/conservative, as understood by both members and non-members of those groups. It is not clear whether the LLM binding is "deep", meaning the LLM answers as a member of a particular in-group would, or "shallow", meaning the LLM responds as an out-group member believes an in-group member would. To explore this difference, we use questions that expose known in-group/out-group biases. This level of fidelity is critical for applying LLMs to various political science studies, including timely topics on polarization dynamics, inter-group conflict, and democratic backsliding. To this end, we propose a novel methodology for constructing virtual personas with synthetic user "backstories" generated as extended, multi-turn interview transcripts. Our generated backstories are longer, rich in detail, and consistent in authentically describing a singular individual, compared to previous methods. We show that virtual personas conditioned on our backstories closely replicate human response distributions (up to an 87% improvement as measured by Wasserstein Distance) and produce effect sizes that closely match those observed in the original studies of in-group/out-group biases. Altogether, our work extends the applicability of LLMs beyond estimating socially understood responses, enabling their use in a broader range of human studies.
大型语言模型(LLMS)越来越能够模拟人类行为,为估计用户对各种调查和民意测验的答复提供具有成本效益的实用性的方法。然而,这些调查中的问题通常反映社会理解的态度:由这些团体的成员和非成员理解的老/年轻、自由/保守的态度模式;不清楚LLM的约束是否及时涉及两极分化动态、群体间冲突和民主反向滑动等专题。为此,我们提出一种新的方法,用合成用户的复数来构建虚拟人物,而合成用户的复数则作为扩展的多式访谈记录。为了探索这一差异,我们使用的问题暴露了已知的集团/集团内部偏见。这种忠诚程度对于将LLMS应用于各种政治科学研究至关重要,包括极化动态、群体间冲突和民主反向滑移等及时主题。为此,我们提出了一种创新方法,用合成用户的复数组来构建虚拟人物的复发记录。我们生成的反向后台更长、更丰富、更精确的反向,在真实的Slacealisalalal 研究中,我们用真实的推算出了一个真实的推算出人类的自我的推算方法。
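The virtual-persona abstract compares simulated and human response distributions using the Wasserstein distance. For answers on an ordinal scale with unit spacing, the one-dimensional Wasserstein-1 distance equals the summed absolute difference between the two cumulative distributions. The sketch below implements that special case; the five-point scale and all response counts are hypothetical and are not results from the paper.

```python
from itertools import accumulate


def normalize(counts):
    """Turn raw answer counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def wasserstein_1d(p, q):
    """Wasserstein-1 distance between two distributions on the same unit-spaced scale."""
    if len(p) != len(q):
        raise ValueError("distributions must share the same support")
    return sum(abs(a - b) for a, b in zip(accumulate(p), accumulate(q)))


# Hypothetical answer counts on a five-point survey scale.
human = normalize([10, 20, 40, 20, 10])
baseline_persona = normalize([0, 10, 30, 40, 20])
backstory_persona = normalize([10, 20, 35, 25, 10])

w_baseline = wasserstein_1d(human, baseline_persona)
w_backstory = wasserstein_1d(human, backstory_persona)
relative_improvement = 100.0 * (w_baseline - w_backstory) / w_baseline

print(round(w_baseline, 3), round(w_backstory, 3), round(relative_improvement, 1))
```

A relative improvement of this kind is one way an "up to X% improvement as measured by Wasserstein Distance" figure could be computed; the abstract does not spell out the exact formula used.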
Article 280
Title@2025-07-14 (1): AIDE: Attribute-Guided MultI-Hop Data Expansion for Data Scarcity in Task-Specific Fine-tuning
Title: AIDE: Attribute-Guided MultI-Hop Data Expansion for Data Scarcity in Task-Specific Fine-tuning | AIDE: Attributegeführte MultI-Hop-Datenerweiterung für Datenknappheit bei der aufgabenspezifischen Feinabstimmung | AIDE: 用于特定任务微调中数据缺乏程度的属性引导MutI-Hop数据扩展 2412.06136v2 |
Authors (4): Jiayu Li, Xuan Zhu, Fang Liu, Yanjun Qi
Fine-tuning large language models (LLMs) for specific tasks requires diverse, high-quality training data. However, obtaining sufficient relevant data remains a significant challenge. Existing data synthesis methods either depend on extensive seed datasets or struggle to balance task relevance and data diversity. To address these challenges, we propose Attribute-guided multI-hop Data Expansion (AIDE), a novel data synthesis framework that uses a multi-hop process to expand very few seed data points while ensuring data diversity and task relevance. AIDE extracts the main topic and key knowledge attributes from the seeds to guide the synthesis steps. The process repeats for K hops, using the generated data as seeds. To prevent irrelevant data generation as the hop depth increases, AIDE incorporates a residual connection mechanism. Our empirical results show that AIDE enables fine-tuning of Mistral-7B, Llama-3.1-8B and Llama-3.2-3B from 10 seeds, surpassing the models fine-tuned on human-curated data. Furthermore, AIDE outperforms state-of-the-art data synthesis methods, such as Evol-Instruct, by over 30% in task-specific fine-tuning. Code is available at https://github.com/Code4Graph/AIDE.
微调用于具体任务的大型语言模型(LLMS)需要多样化的高质量培训数据。然而,获得足够的相关数据仍是一个重大挑战。现有的数据综合方法要么取决于广泛的种子数据集,要么取决于如何平衡任务相关性和数据多样性。为了应对这些挑战,我们提议了一个创新的数据综合框架,即属性制导的Multi-hop数据扩展(AIDE),这是一个利用多点程序扩大极少的种子数据点的新数据综合框架,同时确保数据的多样性和任务相关性。IDE从种子中提取主要专题和关键知识属性,以指导综合步骤。该过程重复KHOPs,使用生成的数据作为种子。随着深度的提高,AIDE将一个剩余连接机制用于防止不相关的数据生成。我们的经验结果表明,AIDE使Mistral-7B、Llama-3.1-8B和Llama-3.2-3B能够从10种种子中进行微调,超过了对人造数据进行微调的模型。此外,AIDE Exvo-Instruct/Adestruction 30%以上,在任务/GredustryAD/GRADR.
Article 281
Title@2025-07-14 (1): Supporting SENĆOTEN Language Documentation Efforts with Automatic Speech Recognition
Title: Supporting SENĆOTEN Language Documentation Efforts with Automatic Speech Recognition | Unterstützung von SENĆOTEN Sprachdokumentation Bemühungen mit automatischer Spracherkennung | 支持SEN-OTEN语文文件工作,并自动语音识别 2507.10827v1 |
Authors (6): Mengzhe Geng, Patrick Littell, Aidan Pine, PENÁĆ, Marc Tessier, Roland Kuhn
The SENĆOTEN language, spoken on the Saanich peninsula of southern Vancouver Island, is in the midst of vigorous language revitalization efforts to turn the tide of language loss as a result of colonial language policies. To support these on-the-ground efforts, the community is turning to digital technology. Automatic Speech Recognition (ASR) technology holds great promise for accelerating language documentation and the creation of educational resources. However, developing ASR systems for SENĆOTEN is challenging due to limited data and significant vocabulary variation from its polysynthetic structure and stress-driven metathesis. To address these challenges, we propose an ASR-driven documentation pipeline that leverages augmented speech data from a text-to-speech (TTS) system and cross-lingual transfer learning with Speech Foundation Models (SFMs). An n-gram language model is also incorporated via shallow fusion or n-best rescoring to maximize the use of available data. Experiments on the SENĆOTEN dataset show a word error rate (WER) of 19.34% and a character error rate (CER) of 5.09% on the test set with a 57.02% out-of-vocabulary (OOV) rate. After filtering minor cedilla-related errors, WER improves to 14.32% (26.48% on unseen words) and CER to 3.45%, demonstrating the potential of our ASR-driven pipeline to support SENĆOTEN language documentation.
在南温哥华岛萨尼赫半岛上所讲的SEN'{C}OTEN语言正处于振兴语言的有力努力之中,以扭转殖民语言政策造成的语言损失浪潮。为了支持这些实地努力,社区正在转向数字技术。自动语音识别技术对于加速语言文档和创造教育资源有着巨大的希望。然而,为SEN{C}C}OTEN开发ASR系统具有挑战性,因为其综合合成结构和压力驱动的代谢结构的数据有限,词汇差异很大。为了应对这些挑战,我们提议采用ASR驱动的文件管道,利用从文本到语音系统(TTS)系统(TTS)和与语音基础模型(SFMMs)的跨语言传输学习来增强语音数据。通过浅度或最优的恢复,可以最大限度地利用现有数据。对SEN{C}{C}数据集的实验显示19.34 %的字错率和3.09%的SEN-RC(CER)的字符误差率,在SEN-OL%的测试中,将SER-3.02-CR-BRlickervieweral missations。
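Word error rate (WER) and character error rate (CER), the two metrics reported in the SENĆOTEN abstract, are both edit distances normalized by reference length, computed over words and characters respectively. The sketch below shows the standard definition on an English stand-in sentence; it is not the authors' evaluation script and does not reproduce their cedilla filtering.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character-level edit distance divided by the number of reference characters."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


reference = "the cat sat on the mat"
hypothesis = "the cat sit on mat"
print(round(wer(reference, hypothesis), 3), round(cer(reference, hypothesis), 3))
```

CER is usually much lower than WER for the same output because a single wrong character spoils a whole word, an effect that is stronger for polysynthetic languages with long words.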
Article 282
Title@2025-07-14 (1): Testing Hypotheses from the Social Approval Theory of Online Hate: An Analysis of 110 Million Posts from Parler
Title: Testing Hypotheses from the Social Approval Theory of Online Hate: An Analysis of 110 Million Posts from Parler | Testen von Hypothesen aus der Sozialen Zulassungstheorie des Online-Hasses: Eine Analyse von 110 Millionen Beiträgen von Parler | 社会批准网上仇恨理论的测试假设:分析来自Parler的1.1亿个职位 2507.10810v1 |
Authors (2): David M. Markowitz, Samuel Hardman Taylor
In this paper, we explored how online hate is motivated by receiving social approval from others. We specifically examined two central tenets of Walther’s (2024) social approval theory of online hate: (H1a) more signals of social approval on hate messages predict more subsequent hate messages, and (H1b) as social approval increases, hate speech messages become more extreme. Using over 110 million posts from Parler (2018-2021), we observed that the number of upvotes a person received on a hate speech post was unassociated with the amount of hate speech in their next post and posts during the next week, month, three months, and six months. Between-person effects revealed an average negative relationship between social approval and hate speech production at the post level, but this relationship was mixed at other time intervals. Social approval reinforcement mechanisms of online hate may operate differently on niche social media platforms.
在本文中,我们探索了网上仇恨是如何通过得到他人的社会认可而引发的。我们特别研究了Walther(2024年)社会认可网上仇恨理论的两个核心原则:(H1a)更多关于仇恨信息的社会认可信号预测了更多随后的仇恨信息,以及(H1b)随着社会认可增加,仇恨言论信息变得更加极端。我们利用来自Parler(2018-2021年)的1.1亿多张文章,发现在仇恨言论网站上收到的高呼人数与下周、一个月、三个月和六个月的下一个职位和文章中的仇恨言论数量无关。 人际效应揭示了在后一级社会批准与仇恨言论制作之间的平均负面关系,但在其他时间间隔中,这种关系是混合的。 社会批准强化网上仇恨机制在合适的社交媒体平台上可能运作不同。
Article 283
Title@2025-07-14 (1): Automated Thematic Analyses Using LLMs: Xylazine Wound Management Social Media Chatter Use Case
Title: Automated Thematic Analyses Using LLMs: Xylazine Wound Management Social Media Chatter Use Case | Automatisierte thematische Analysen mit LLMs: Xylazine Wound Management Social Media Chatter Use Case | 利用LLMM:Xylazine 创伤管理社会媒体聊天器使用案件自动专题分析 2507.10803v1 |
Authors (7): JaMor Hairston, Ritvik Ranjan, Sahithi Lakamana, Anthony Spadaro, Selen Bozkurt, Jeanmarie Perrone, Abeed Sarker
Background: Large language models (LLMs) face challenges in inductive thematic analysis, a task requiring deep interpretive and domain-specific expertise. We evaluated the feasibility of using LLMs to replicate expert-driven thematic analysis of social media data. Methods: Using two temporally non-intersecting Reddit datasets on xylazine (n=286 and n=686, for model optimization and validation, respectively) with twelve expert-derived themes, we evaluated five LLMs against expert coding. We modeled the task as a series of binary classifications, rather than a single, multi-label classification, employing zero-, single-, and few-shot prompting strategies and measuring performance via accuracy, precision, recall, and F1-score. Results: On the validation set, GPT-4o with two-shot prompting performed best (accuracy: 90.9%; F1-score: 0.71). For high-prevalence themes, model-derived thematic distributions closely mirrored expert classifications (e.g., xylazine use: 13.6% vs. 17.8%; MOUD use: 16.5% vs. 17.8%). Conclusions: Our findings suggest that few-shot LLM-based approaches can automate thematic analyses, offering a scalable supplement for qualitative research. Keywords: thematic analysis, large language models, natural language processing, qualitative analysis, social media, prompt engineering, public health
大型语言模型(LLMS)在引入主题分析方面面临着挑战,这是一项需要深入解释和具体领域专门知识的任务。我们评估了使用LLMS复制由专家驱动的社会媒体数据专题分析的可行性。使用两个时间上非交叉的Xylazine(n=286和n=686,分别用于模型优化和验证)的红色数据集的方法,有12个专家主题,我们根据专家派专家提出的专题对5 LLM进行了评估。我们用一系列二进制分类而不是单一、多标签分类,采用零、单和少发提示战略,通过精确、精确、回溯和F1核心衡量业绩。关于验证成套结果的结果,GPT-4o,用两发提示性提示最佳(准确度:90.9%;F1-核心:0.71)。对于高频主题,模型专题分布与专家分类(例如,xylazine使用:13.6%对17.8%;MOD使用16.5%的快速提示战略,用于大规模定性分析。
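The evaluation setup described above (one binary decision per theme, scored by accuracy, precision, recall, and F1) can be sketched as follows. This is a minimal illustration; the theme name and label vectors are hypothetical, and the paper's own evaluation code may differ.

```python
def binary_metrics(gold, pred):
    """Accuracy, precision, recall and F1 for one theme, with labels in {0, 1}."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    tn = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(gold),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

def per_theme_metrics(gold_by_theme, pred_by_theme):
    """Score each theme as its own binary classification task."""
    return {
        theme: binary_metrics(gold_by_theme[theme], pred_by_theme[theme])
        for theme in gold_by_theme
    }

# Hypothetical expert labels and model predictions for a single theme.
gold = {"theme_a": [1, 1, 0, 0]}
pred = {"theme_a": [1, 0, 0, 1]}
scores = per_theme_metrics(gold, pred)
```

Treating each theme separately keeps the prompts simple and lets per-theme prevalence be compared directly against the expert coding.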
Article 284
Title@2025-07-14 (1): Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers
Title: Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers | Können multimodale Stiftungsmodelle schematische Diagramme verstehen? Eine empirische Studie zum Informationssuchenden QA über wissenschaftliche Arbeiten | 多模式基金会模型能够理解示相图吗? 信息搜索质量评估经验研究,而不是科学论文 2507.10787v1 |
Authors (4): Yilun Zhao, Chengye Wang, Chuhan Li, Arman Cohan
This paper introduces MISS-QA, the first benchmark specifically designed to evaluate the ability of models to interpret schematic diagrams within scientific literature. MISS-QA comprises 1,500 expert-annotated examples over 465 scientific papers. In this benchmark, models are tasked with interpreting schematic diagrams that illustrate research overviews and answering corresponding information-seeking questions based on the broader context of the paper. We assess the performance of 18 frontier multimodal foundation models, including o4-mini, Gemini-2.5-Flash, and Qwen2.5-VL. We reveal a significant performance gap between these models and human experts on MISS-QA. Our analysis of model performance on unanswerable questions and our detailed error analysis further highlight the strengths and limitations of current models, offering key insights to enhance models in comprehending multimodal scientific literature.
本文件介绍MISS-QA,这是专门为评估模型在科学文献中解释图表的能力而专门设计的第一个基准,MISS-QA包括超过465份科学论文的1 500个专家附加说明的例子,在这一基准中,模型的任务是根据论文的更广泛背景解释显示研究概况的示意图图和回答相应的信息查询问题,我们评估了18个前沿多式联运基础模型的绩效,包括O4-mini、Gemini-2.5-Flash和Qwen2.5-VL。我们发现这些模型与MISS-QA的人类专家在绩效方面存在巨大差距。我们对无法回答问题的模型绩效的分析以及我们详细的错误分析进一步突出了当前模型的长处和局限性,为增进理解多式联运科学文献的模型提供了重要的见解。
Article 285
Title@2025-07-14 (1): Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools
Title: Agentic Reasoning: A Streamlined Framework for Enhancing LLM Reasoning with Agentic Tools | Agentische Reasoning: Ein gestrafftes Framework zur Verbesserung der LLM-Reasoning mit Agentischen Tools | 说明理由:加强使用说明工具的LLM理由的精简框架 2502.04644v2 |
Authors (5): Junde Wu, Jiayuan Zhu, Yuyuan Liu, Min Xu, Yueming Jin
We introduce Agentic Reasoning, a framework that enhances large language model (LLM) reasoning by integrating external tool-using agents. Agentic Reasoning dynamically leverages web search, code execution, and structured memory to address complex problems requiring deep research. A key innovation in our framework is the Mind-Map agent, which constructs a structured knowledge graph to store reasoning context and track logical relationships, ensuring coherence in long reasoning chains with extensive tool usage. Additionally, we conduct a comprehensive exploration of the Web-Search agent, leading to a highly effective search mechanism that surpasses all prior approaches. When deployed on DeepSeek-R1, our method achieves a new state-of-the-art (SOTA) among public models and delivers performance comparable to OpenAI Deep Research, the leading proprietary model in this domain. Extensive ablation studies validate the optimal selection of agentic tools and confirm the effectiveness of our Mind-Map and Web-Search agents in enhancing LLM reasoning. The code is at: https://github.com/theworldofagents/Agentic-Reasoning
我们引入了“代理理由”这一框架,这个框架通过整合外部工具使用代理物,促进大型语言模型(LLM)的推理。“代理理由”动态地利用网络搜索、代码执行和结构记忆,解决需要深入研究的复杂问题。我们框架中的一个关键创新是“Mind-Map代理物”,它构建了一个结构化的知识图,以储存推理背景并跟踪逻辑关系,确保长期推理链的一致性,并广泛使用工具。此外,我们全面探索了“网络搜索代理物,从而导致一个超越所有先前方法的高效搜索机制。在“DeepSeek-R1”上,我们的方法在公共模型中实现了一个新的最新状态(SOTA),并提供了与该领域主要专利模型“OpenAI深层研究”相似的性能。广泛的“模拟研究”验证了对代理工具的最佳选择,并证实了我们的Mind-Map和We-Search代理物在加强LM推理方面的有效性。代码见:https://github.com/theworldoffictors/Agentic-Reasoning。
Article 286
Title@2025-07-14 (1): Theory of Mind and Self-Disclosure to CUIs
Title: Theory of Mind and Self-Disclosure to CUIs | Theorie des Geistes und Selbst-Offenbarung zu CUIs | CUI精神和自我披露理论 2507.10773v1 |
Authors (1): Samuel Rhys Cox
Self-disclosure is important to help us feel better, yet is often difficult. This difficulty can arise from how we think people are going to react to our self-disclosure. In this workshop paper, we briefly discuss self-disclosure to conversational user interfaces (CUIs) in relation to various social cues. We then discuss how expressions of uncertainty or representation of a CUI’s reasoning could help encourage self-disclosure, by making a CUI’s intended “theory of mind” more transparent to users.
自我披露对于帮助我们感觉好一些很重要,但往往是困难的。 之所以有这种困难,是因为我们认为人们会如何对自我披露作出反应。 在这次研讨会论文中,我们简要地讨论了与各种社会提示有关的自我披露问题,即对对话用户界面的自我披露。 然后我们讨论了不确定的表达方式或对统一身份证的推理的表述如何有助于鼓励自我披露,使统一身份证的“思想理论”对用户更加透明。
Article 287
Title@2025-07-14 (1): Applying Text Embedding Models for Efficient Analysis in Labeled Property Graphs
Title: Applying Text Embedding Models for Efficient Analysis in Labeled Property Graphs | Anwendung von Text-Embedding-Modellen für effiziente Analyse in beschrifteten Property Graphen | 标签属性图中高效分析应用文本嵌入模型 2507.10772v1 |
Authors (1): Michal Podstawski
Labeled property graphs often contain rich textual attributes that can enhance analytical tasks when properly leveraged. This work explores the use of pretrained text embedding models to enable efficient semantic analysis in such graphs. By embedding textual node and edge properties, we support downstream tasks including node classification and relation prediction with improved contextual understanding. Our approach integrates language model embeddings into the graph pipeline without altering its structure, demonstrating that textual semantics can significantly enhance the accuracy and interpretability of property graph analysis.
标签属性图通常包含丰富的文字属性,这些属性在适当利用时可以加强分析任务。 这项工作探索了使用预先训练的文字嵌入模型,以便在这些图表中进行高效的语义分析。 通过嵌入文本节点和边缘属性,我们支持下游任务, 包括节点分类和关联预测, 并加深了对背景的理解。 我们的方法是将语言模型嵌入图形管道, 而不改变其结构, 表明文字语义可以大大提高属性图表分析的准确性和可解释性。
Article 288
Title@2025-07-14 (1): Language Models for Adult Service Website Text Analysis
Title: Language Models for Adult Service Website Text Analysis | Sprachmodelle für Erwachsene Service Website Textanalyse | 成人服务语言模式网站文本分析 2507.10743v1 |
Authors (5): Nickolas Freeman, Thanh Nguyen, Gregory Bott, Jason Parton, Collin Francel
Sex trafficking refers to the use of force, fraud, or coercion to compel an individual to perform commercial sex acts against their will. Adult service websites (ASWs) have been and continue to be linked to sex trafficking, offering a platform for traffickers to advertise their victims. Thus, organizations involved in the fight against sex trafficking often use ASW data when attempting to identify potential sex trafficking victims. A critical challenge in transforming ASW data into actionable insight is text analysis. Previous research using ASW data has shown that ASW ad text is important for linking ads. However, working with this text is challenging due to its extensive use of emojis, poor grammar, and deliberate obfuscation to evade law enforcement scrutiny. We conduct a comprehensive study of language modeling approaches for this application area, including simple information retrieval methods, pre-trained transformers, and custom transformer models. We demonstrate that characteristics of ASW text data allow efficient custom transformer models to be trained with relatively small GPU resources and used efficiently for inference on consumer hardware. Our custom models outperform fine-tuned variants of well-known encoder-only transformer models, including BERT-base, RoBERTa, and ModernBERT, on accuracy, recall, F1 score, and ROC AUC. We demonstrate the use of our best-performing custom configuration on three tasks related to ASW data analysis: (i) decomposing the giant component in a graph representation of ASW data, (ii) clustering ASW ad text, and (iii) using the learned token embeddings to understand the use of emojis in the illicit context we study. The models we develop represent a significant advancement in ASW text analysis, which can be leveraged in a variety of downstream applications and research.
性贩运是指使用武力、欺诈或胁迫手段，迫使个人违背其意愿从事商业性行为。成人服务网站（ASW）一直并将继续与性贩运相关联，为贩运者提供了宣传受害者的平台。因此，参与打击性贩运的组织在试图识别潜在受害者时经常使用ASW数据。将ASW数据转化为可操作洞见的一个关键挑战是文本分析。以往使用ASW数据的研究表明，ASW广告文本对于关联广告十分重要。然而，由于这类文本大量使用表情符号、语法不规范，并且为逃避执法审查而刻意混淆，处理起来颇具挑战。我们针对这一应用领域对语言建模方法进行了全面研究，包括简单的信息检索方法、预训练Transformer和定制Transformer模型。我们证明，ASW文本数据的特点使得高效的定制Transformer模型可以用相对较少的GPU资源进行训练，并能在消费级硬件上高效推理。我们的定制模型在准确率、召回率、F1分数和ROC AUC上优于BERT-base、RoBERTa和ModernBERT等知名纯编码器Transformer模型的微调版本。我们展示了性能最佳的定制配置在三项ASW数据分析任务上的应用：（i）分解ASW数据图表示中的巨型连通分量，（ii）对ASW广告文本进行聚类，（iii）利用学到的词元嵌入理解表情符号在我们所研究的非法语境中的用法。我们开发的模型代表了ASW文本分析的重大进展，可用于多种下游应用和研究。
Article 289
Title@2025-07-14 (1): GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons
Title: GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons | GDC Cohort Copilot: Ein KI-Copilot für die Kuratierung von Kohorten aus den Genomic Data Commons | GDC Cohort Cohort 副驾驶:AI 基因组数据共同点的Curate Choorts联合驾驶员 2507.02221v2 |
Authors (5): Steven Song, Anirudh Subramanyam, Zhenyu Zhang, Aarti Venkat, Robert L. Grossman
The Genomic Data Commons (GDC) provides access to high quality, harmonized cancer genomics data through a unified curation and analysis platform centered around patient cohorts. While GDC users can interactively create complex cohorts through the graphical Cohort Builder, users (especially new ones) may struggle to find specific cohort descriptors across hundreds of possible fields and properties. However, users may be better able to describe their desired cohort in free-text natural language. We introduce GDC Cohort Copilot, an open-source copilot tool for curating cohorts from the GDC. GDC Cohort Copilot automatically generates the GDC cohort filter corresponding to a user-input natural language description of their desired cohort, before exporting the cohort back to the GDC for further analysis. An interactive user interface allows users to further refine the generated cohort. We develop and evaluate multiple large language models (LLMs) for GDC Cohort Copilot and demonstrate that our locally-served, open-source GDC Cohort LLM achieves better results than GPT-4o prompting in generating GDC cohorts. We implement and share GDC Cohort Copilot as a containerized Gradio app on HuggingFace Spaces, available at https://huggingface.co/spaces/uc-ctds/GDC-Cohort-Copilot. GDC Cohort LLM weights are available at https://huggingface.co/uc-ctds. All source code is available at https://github.com/uc-cdis/gdc-cohort-copilot.
基因组数据共享平台（GDC）通过以患者队列为中心的统一整理与分析平台，提供高质量、经过统一处理的癌症基因组学数据。GDC用户可以通过图形化的Cohort Builder交互式地创建复杂队列，但用户（尤其是新用户）可能难以在数百个可能的字段和属性中找到特定的队列描述符。不过，用户也许更善于用自由文本的自然语言描述所需的队列。我们推出GDC Cohort Copilot，这是一个用于从GDC整理队列的开源副驾驶工具。GDC Cohort Copilot会根据用户输入的队列自然语言描述自动生成相应的GDC队列过滤器，然后将队列导出回GDC以供进一步分析。交互式用户界面允许用户进一步细化生成的队列。我们为GDC Cohort Copilot开发并评估了多个大语言模型（LLM），结果表明，我们本地部署的开源GDC Cohort LLM在生成GDC队列方面优于基于提示的GPT-4o。我们将GDC Cohort Copilot实现为容器化的Gradio应用并在HuggingFace Spaces上共享，网址为 https://huggingface.co/spaces/uc-ctds/GDC-Cohort-Copilot 。GDC Cohort LLM的权重见 https://huggingface.co/uc-ctds 。全部源代码见 https://github.com/uc-cdis/gdc-cohort-copilot 。
Article 290
Title@2025-07-14 (1): DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving
Title: DroidSpeak: KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving | DroidSpeak: KV Cache Sharing für Cross-LLM Kommunikation und Multi-LLM Serving | DroidSpeak: KV 共享缓存, 用于跨 LLM 通信和多 LLM 服务 2411.02820v4 |
Authors (12): Yuhan Liu, Yuyang Huang, Jiayi Yao, Shaoting Feng, Zhuohan Gu, Kuntai Du, Hanchen Li, Yihua Cheng, Junchen Jiang, Shan Lu, Madan Musuvathi, Esha Choukse
Compound AI systems, such as agentic systems, are an emerging trend in large-scale enterprise settings, with multiple LLMs specialized for different users, tasks, and/or roles working together. In these scenarios, different models often process inputs that share the same context prefix. Although much work was done in the past to enable the reuse of prefix KV caches across inputs for a single model, how to enable one model to reuse the prefix KV caches of a different model remains an open question. We introduce DroidSpeak, the first distributed LLM inference system that enables KV cache reuse across distributed nodes running inference of different LLMs, so long as the LLMs have the same architecture. We present the first study that aims at understanding the impact of sharing KV caches across different LLMs, and if/when such sharing affects quality. Inspired by the findings, we present DroidSpeak, which selectively recomputes a few layers of the KV cache produced by another LLM and reuses the remaining layers, with negligible quality loss. Moreover, carefully pipelining the layer-wise re-computation and the loading of reused KV cache further improves the inference performance. Experiments on diverse datasets and model pairs demonstrate that DroidSpeak achieves up to 4x throughput improvement and about 3.1x faster prefill (time to first token), with negligible loss of quality in F1 scores, Rouge-L or code similarity score, compared to the baseline which does not allow any sharing across models.
复合AI系统（如智能体系统）是大型企业环境中的新兴趋势，其中多个针对不同用户、任务和/或角色专门化的LLM协同工作。在这些场景中，不同模型经常处理共享同一上下文前缀的输入。尽管过去已有大量工作使单个模型能够跨输入复用前缀KV缓存，但如何让一个模型复用另一个模型的前缀KV缓存仍是一个悬而未决的问题。我们提出DroidSpeak，这是首个分布式LLM推理系统，只要各LLM具有相同架构，就能在运行不同LLM推理的分布式节点之间复用KV缓存。我们首次研究了在不同LLM之间共享KV缓存的影响，以及这种共享是否、何时会影响质量。受研究结果启发，DroidSpeak有选择地重新计算另一个LLM生成的KV缓存中的少数几层，并复用其余各层，质量损失可以忽略不计。此外，将逐层重新计算与复用KV缓存的加载精心流水线化，可进一步提升推理性能。在多种数据集和模型对上的实验表明，与不允许任何跨模型共享的基线相比，DroidSpeak的吞吐量最高提升4倍，预填充（首个词元生成时间）约加快3.1倍，而F1分数、Rouge-L或代码相似度分数方面的质量损失可以忽略不计。
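The selective reuse described in the abstract (recompute a few layers, reuse the rest) can be pictured with the toy sketch below. It only shows the bookkeeping of mixing per-layer cache entries; the layer indices, the `recompute_fn` callback, and the string placeholders for KV tensors are hypothetical, and the sketch ignores how DroidSpeak actually chooses layers, handles data dependencies between layers, or pipelines loading.

```python
def build_mixed_kv_cache(donor_cache, recompute_layers, recompute_fn):
    """Reuse the donor model's per-layer KV entries except for the chosen layers.

    donor_cache: list with one KV entry per layer, produced by another model.
    recompute_layers: set of layer indices the receiving model recomputes itself.
    recompute_fn: callable(layer_idx) returning the receiving model's own KV entry.
    """
    mixed = []
    for layer_idx, donor_kv in enumerate(donor_cache):
        if layer_idx in recompute_layers:
            mixed.append(recompute_fn(layer_idx))  # recomputed by the receiver
        else:
            mixed.append(donor_kv)                 # reused as-is from the donor
    return mixed

# Placeholder strings stand in for real key/value tensors.
donor_cache = [f"A-layer{i}" for i in range(4)]
mixed = build_mixed_kv_cache(donor_cache, {0, 3}, lambda i: f"B-layer{i}")
```

The fewer layers that need recomputing, the more prefill work is saved, which is the trade-off against quality that the paper studies.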
Article 291
Title@2025-07-14 (1): EmbRACE-3K: Embodied Reasoning and Action in Complex Environments
Title: EmbRACE-3K: Embodied Reasoning and Action in Complex Environments | Embrace-3K: Verkörperte Vernunft und Handeln in komplexen Umgebungen | EmbRACE-3K: 复杂环境中的内在理由和行动 2507.10548v1 |
Authors (9): Mingxian Lin, Wei Huang, Yitang Li, Chengjie Jiang, Kui Wu, Fangwei Zhong, Shengju Qian, Xin Wang, Xiaojuan Qi
Recent advanced vision-language models (VLMs) have demonstrated strong performance on passive, offline image and video understanding tasks. However, their effectiveness in embodied settings, which require online interaction and active scene understanding, remains limited. In such scenarios, an agent perceives the environment from a first-person perspective, with each action dynamically shaping subsequent observations. Even state-of-the-art models such as GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro struggle in open-environment interactions, exhibiting clear limitations in spatial reasoning and long-horizon planning. To address this gap, we introduce EmbRACE-3K, a dataset of over 3,000 language-guided tasks situated in diverse, photorealistic environments constructed using Unreal Engine and the UnrealCV-Zoo framework. The tasks encompass a wide range of embodied challenges, including navigation, object manipulation, and multi-stage goal execution. Each task unfolds as a multi-step trajectory, pairing first-person visual observations with high-level instructions, grounded actions, and natural language rationales that express the agent’s intent at every step. Using EmbRACE-3K, we establish a benchmark to evaluate the embodied reasoning capabilities of VLMs across three key dimensions: Exploration, Dynamic Spatial-Semantic Reasoning, and Multi-stage Goal Execution. In zero-shot settings, all models achieve success rates below 20%, underscoring the challenge posed by our benchmark and the current limitations of VLMs in interactive environments. To demonstrate the utility of EmbRACE-3K, we further fine-tune Qwen2.5-VL-7B using supervised learning followed by reinforcement learning. This approach yields substantial improvements across all three challenge categories, highlighting the dataset’s effectiveness in enabling the development of embodied reasoning capabilities.
近期先进的视觉语言模型（VLM）在被动、离线的图像和视频理解任务上表现出色。然而，在需要在线交互和主动场景理解的具身环境中，它们的效果仍然有限。在这类场景中，智能体以第一人称视角感知环境，每个动作都会动态地影响后续观测。即使是GPT-4o、Claude 3.5 Sonnet和Gemini 2.5 Pro等最先进的模型，在开放环境交互中也表现吃力，在空间推理和长程规划方面存在明显局限。为弥补这一差距，我们提出EmbRACE-3K，这是一个包含3,000多个语言引导任务的数据集，这些任务位于使用Unreal Engine和UnrealCV-Zoo框架构建的多样化、照片级真实感环境中。任务涵盖导航、物体操作和多阶段目标执行等多种具身挑战。每个任务以多步轨迹的形式展开，将第一人称视觉观测与高层指令、落地的动作以及表达智能体每一步意图的自然语言理由配对。基于EmbRACE-3K，我们建立了一个基准，从探索、动态空间语义推理和多阶段目标执行三个关键维度评估VLM的具身推理能力。在零样本设置下，所有模型的成功率均低于20%，凸显了该基准带来的挑战以及VLM目前在交互环境中的局限。为展示EmbRACE-3K的实用性，我们进一步先用监督学习、再用强化学习对Qwen2.5-VL-7B进行微调。该方法在全部三类挑战上都带来了显著提升，突显了该数据集在促进具身推理能力发展方面的有效性。
Article 292
Title@2025-07-14 (1): CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks
Title: CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks | CodeJudgeBench: Benchmarking von LLM-as-a-Judge für Codierungsaufgaben | 标准法官:为编码任务确定LLM-as-a法官基准 2507.10535v1 |
Authors (5): Hongchao Jiang, Yiming Chen, Yushi Cao, Hung-yi Lee, Robby T. Tan
Large Language Models (LLMs) have significantly advanced the state-of-the-art in various coding tasks. Beyond directly answering user queries, LLMs can also serve as judges, assessing and comparing the quality of responses generated by other models. Such an evaluation capability is crucial both for benchmarking different LLMs and for improving response quality through response ranking. However, despite the growing adoption of the LLM-as-a-Judge paradigm, its effectiveness in coding scenarios remains underexplored due to the absence of dedicated benchmarks. To address this gap, we introduce CodeJudgeBench, a benchmark explicitly designed to evaluate the performance of LLM-as-a-Judge models across three critical coding tasks: code generation, code repair, and unit test generation. Through comprehensive benchmarking of 26 LLM-as-a-Judge models, we find that recent thinking models significantly outperform non-thinking models on our carefully designed code judging tasks. Notably, even relatively small thinking models, such as Qwen3-8B, can outperform specially trained LLM-as-a-Judge models up to 70B in size. Nevertheless, all models still exhibit significant randomness in their judgment of coding tasks. For pairwise judging tasks, simply changing the order in which responses are presented can substantially impact accuracy. In addition, when judging code and unit tests written by different LLMs, LLM-as-a-Judge models also show variance in performance. This sensitivity raises concerns about the reliability and consistency of LLM-as-a-Judge in coding scenarios. Lastly, we study optimal prompting strategies for LLM-as-a-Judge. We find that using pair-wise comparison outperforms scalar point-wise judging. Furthermore, retaining comments and reasoning in the full, unprocessed LLM response leads to improved judge performance.
大型语言模型(LLMS)在各种编码任务中大大提升了最新水平,除了直接回答用户询问外,LMS还可以充当法官,评估和比较其他模型生成的答复的质量。这种评价能力对于制定不同LMS的基准和通过答复排名提高响应质量至关重要。然而,尽管越来越多地采用LLM-as-a-judge模式,但其编码设想方案的效力仍然没有得到充分利用,因为缺乏专门的参数,因此,由于缺乏专门的差异分析基准,我们引入了Codjudge Bench,这是一个明确用来评估LM-as-a-judge模式在三种关键编码任务(代码生成、代码修理和单位测试)中的绩效评估业绩基准。尽管最近采用LM-as-a-a-judge模式的情况大大超出了我们精心设计的代码判断任务中的不思考模式。即使是较小的思考模型,例如Quen3-8B,我们经过专门培训的LM-a-judge模型在70B级的精确度上都能够评估 LLM-judroad-judge 模型。然而,所有模型在判断模型的准确性测试中都展示了相当的准确性的工作。
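The order sensitivity noted in the abstract can be measured with a simple swap test: judge each pair twice, once in each presentation order, and count how often the same response wins. The sketch below is a minimal version under assumed interfaces; the `judge` callable returning `"first"` or `"second"`, and both toy judges, are hypothetical and do not reproduce the benchmark's protocol.

```python
def order_consistency(pairs, judge):
    """Fraction of pairs where the same response wins in both presentation orders."""
    consistent = 0
    for response_a, response_b in pairs:
        original = judge(response_a, response_b)   # "first" or "second"
        swapped = judge(response_b, response_a)
        winner_original = response_a if original == "first" else response_b
        winner_swapped = response_b if swapped == "first" else response_a
        consistent += winner_original == winner_swapped
    return consistent / len(pairs)

def position_biased_judge(first, second):
    """Toy judge that always prefers whatever is shown first."""
    return "first"

def length_judge(first, second):
    """Toy judge that prefers the longer response, regardless of position."""
    return "first" if len(first) >= len(second) else "second"

pairs = [("short", "much longer answer"), ("abcd", "ab")]
biased_score = order_consistency(pairs, position_biased_judge)
stable_score = order_consistency(pairs, length_judge)
```

A score well below 1.0 signals that the judge's verdict depends on presentation order rather than on the responses themselves.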
Article 293
Title@2025-07-14 (1): Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination
Title: Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination | Begründung oder Erinnerung? Unzuverlässige Ergebnisse des Verstärkungslernens aufgrund von Datenkontamination | 理由或记忆化?由于数据污染而加强学习的不可靠结果 2507.10532v1 |
Authors (12): Mingqi Wu, Zhihao Zhang, Qiaole Dong, Zhiheng Xi, Jun Zhao, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Yanwei Fu, Qin Liu, Songyang Zhang, Qi Zhang
The reasoning capabilities of large language models (LLMs) have been a longstanding focus of research. Recent works have further enhanced these capabilities using reinforcement learning (RL), with many new methods claiming significant improvements with minimal or no external supervision. Surprisingly, some studies even suggest that random or incorrect reward signals can enhance reasoning performance. However, these breakthroughs are mostly reported on the Qwen2.5 model family and evaluated on well-known benchmarks such as MATH-500, AMC, and AIME, while failing to achieve similar gains on other models like Llama, which warrants further investigation. Our analysis shows that although Qwen2.5 achieves strong mathematical reasoning performance, its pretraining on large-scale web corpora makes it vulnerable to data contamination in popular benchmarks. As a result, results derived from these benchmarks may be unreliable. To address this, we introduce a generator that produces fully synthetic arithmetic problems of arbitrary length and difficulty, yielding a clean dataset we call RandomCalculation. Using these leakage-free datasets, we show that only accurate reward signals consistently improve performance, while noisy or incorrect signals do not. We advocate for evaluating RL methods on uncontaminated benchmarks and across diverse model families to ensure trustworthy conclusions.
大型语言模型(LLMS)的推理能力一直是长期研究的焦点。最近的工作进一步加强了这些能力,使用了强化学习(RL),许多新方法都要求得到显著改进,而外部监督很少或根本没有。令人惊讶的是,一些研究甚至表明随机或不正确的奖励信号可以提高推理性能。然而,这些突破大多在Quen2.5模型大家庭中报告,并根据众所周知的基准(如MATH-500、AMC和AIME)评估,同时未能在Llama等值得进一步调查的其他模型上取得类似成果。我们的分析表明,虽然Quen2.5取得了很强的数学推理性能,但其在大规模网络生物体内的预先训练使其容易受到大众基准数据污染。因此,从这些基准中得出的结果可能不可靠。为了解决这个问题,我们引入了一种产生任意长度和困难的完全合成算术问题的生成者,产生我们称之为随机计算器的干净的数据集。我们用这些无漏数据集来显示,只有准确的奖励信号才能不断改进性能确保噪音或不错误或错误的信号。
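To make the idea of a leakage-free synthetic benchmark concrete, here is a minimal sketch of a generator for random arithmetic problems of adjustable length. The operand range, operator set, and function name are assumptions for illustration; the abstract does not specify how RandomCalculation is actually constructed.

```python
import random

def make_problem(num_operands, rng, max_value=99):
    """Build a random arithmetic expression and its exact integer answer."""
    operands = [rng.randint(1, max_value) for _ in range(num_operands)]
    operators = [rng.choice("+-*") for _ in range(num_operands - 1)]
    expression = str(operands[0])
    for operator, value in zip(operators, operands[1:]):
        expression += f" {operator} {value}"
    # eval is acceptable here only because the string is built from
    # generated integers and the operators + - * and nothing else.
    answer = eval(expression)
    return expression, answer

rng = random.Random(0)  # fixed seed so the example is reproducible
expression, answer = make_problem(5, rng)
```

Because every problem is freshly sampled, a reward signal can be checked exactly against `answer`, and difficulty can be scaled by raising `num_operands` or `max_value`.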
Article 294
Title@2025-07-14 (1): Expert-level validation of AI-generated medical text with scalable language models
Title: Expert-level validation of AI-generated medical text with scalable language models | Validierung von KI-generierten medizinischen Texten auf Expertenebene mit skalierbaren Sprachmodellen | 专家一级对AI产生的带有可缩放语言模型的可缩放语言模型的医学文本进行鉴定 2507.03152v2 |
Authors (27): Asad Aali, Vasiliki Bikia, Maya Varma, Nicole Chiou, Sophie Ostmeier, Arnav Singhvi, Magdalini Paschali, Ashwin Kumar, Andrew Johnston, Karimar Amador-Martinez, Eduardo Juan Perez Guerrero, Paola Naovi Cruz Rivera, Sergios Gatidis, Christian Bluethgen, Eduardo Pontes Reis, Eddy D. Zandee van Rilland, Poonam Laxmappa Hosamani, Kevin R Keet, Minjoung Go, Evelyn Ling, David B. Larson, Curtis Langlotz, Roxana Daneshjou, Jason Hom, Sanmi Koyejo, Emily Alsentzer, Akshay S. Chaudhari
With the growing use of language models (LMs) in clinical environments, there is an immediate need to evaluate the accuracy and safety of LM-generated medical text. Currently, such evaluation relies solely on manual physician review. However, detecting errors in LM-generated text is challenging because 1) manual review is costly and 2) expert-composed reference outputs are often unavailable in real-world settings. While the “LM-as-judge” paradigm (an LM evaluating another LM) offers scalable evaluation, even frontier LMs can miss subtle but clinically significant errors. To address these challenges, we propose MedVAL, a self-supervised framework that leverages synthetic data to train evaluator LMs to assess whether LM-generated medical outputs are factually consistent with inputs, without requiring physician labels or reference outputs. To evaluate LM performance, we introduce MedVAL-Bench, a dataset containing 840 outputs annotated by physicians, following a physician-defined taxonomy of risk levels and error categories. Across 6 diverse medical tasks and 10 state-of-the-art LMs spanning open-source, proprietary, and medically adapted models, MedVAL fine-tuning significantly improves (p < 0.001) alignment with physicians on both seen and unseen tasks, increasing average F1 scores from 66% to 83%, with per-sample safety classification scores up to 86%. MedVAL improves the performance of even the best-performing proprietary LM (GPT-4o) by 8%. To support a scalable, risk-aware pathway towards clinical integration, we open-source the 1) codebase (https://github.com/StanfordMIMI/MedVAL), 2) MedVAL-Bench (https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench), and 3) MedVAL-4B (https://huggingface.co/stanfordmimi/MedVAL-4B), the best-performing open-source LM. Our research provides the first evidence of LMs approaching expert-level validation ability for medical text.
随着语言模型（LM）在临床环境中的使用日益增多，迫切需要评估LM生成的医学文本的准确性和安全性。目前，这类评估完全依赖医生人工审核。然而，检测LM生成文本中的错误颇具挑战，因为 1）人工审核成本高昂，2）在真实场景中往往无法获得专家撰写的参考输出。虽然“LM-as-judge”范式（由一个LM评估另一个LM）提供了可扩展的评估方式，但即使是前沿LM也可能漏掉细微但具有临床意义的错误。为应对这些挑战，我们提出MedVAL，这是一个自监督框架，利用合成数据训练评估器LM，以判断LM生成的医学输出是否与输入在事实上一致，且无需医生标注或参考输出。为评估LM的表现，我们推出MedVAL-Bench，该数据集包含840条由医生标注的输出，标注遵循医生定义的风险等级和错误类别分类体系。在6项不同的医学任务以及涵盖开源、专有和医学适配模型的10个最先进LM上，MedVAL微调显著提升了（p < 0.001）在已见和未见任务上与医生判断的一致性，平均F1分数从66%提高到83%，单样本安全性分类分数最高达86%。MedVAL甚至将表现最佳的专有LM（GPT-4o）的性能提升了8%。为支持一条可扩展、具有风险意识的临床整合路径，我们开源了 1）代码库（https://github.com/StanfordMIMI/MedVAL），2）MedVAL-Bench（https://huggingface.co/datasets/stanfordmimi/MedVAL-Bench），以及 3）MedVAL-4B（https://huggingface.co/stanfordmimi/MedVAL-4B），即表现最佳的开源LM。我们的研究首次提供了LM在医学文本验证方面接近专家水平的证据。
Article 295
Title@2025-07-14 (1): Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
Title: Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation | Mixture-of-Recursions: Dynamische Rekursive Tiefen für adaptive Token-Level-Computation lernen | 混合流流流:学习适应调控级计算法的动态回流深度 2507.10524v1 |
Authors (11): Sangmin Bae, Yujin Kim, Reza Bayat, Sungnyun Kim, Jiyoun Ha, Tal Schuster, Adam Fisch, Hrayr Harutyunyan, Ziwei Ji, Aaron Courville, Se-Young Yun
Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing or adaptive computation, leaving open the question of how to attain both simultaneously. We introduce Mixture-of-Recursions (MoR), a unified framework that combines the two axes of efficiency inside a single Recursive Transformer. MoR reuses a shared stack of layers across recursion steps to achieve parameter efficiency, while lightweight routers enable adaptive token-level thinking by dynamically assigning different recursion depths to individual tokens. This allows MoR to focus quadratic attention computation only among tokens still active at a given recursion depth, further improving memory access efficiency by selectively caching only their key-value pairs. Beyond these core mechanisms, we also propose a KV sharing variant that reuses KV pairs from the first recursion, specifically designed to decrease prefill latency and memory footprint. Across model scales ranging from 135M to 1.7B parameters, MoR forms a new Pareto frontier: at equal training FLOPs and smaller model sizes, it significantly lowers validation perplexity and improves few-shot accuracy, while delivering higher throughput compared with vanilla and existing recursive baselines. These gains demonstrate that MoR is an effective path towards large-model quality without incurring large-model cost.
缩放语言模型释放了令人印象深刻的能力, 但相应的计算和记忆要求使培训和部署都变得昂贵。 现有的效率努力通常以参数共享或适应性计算为目标, 从而解决如何同时实现这两个目标的问题。 我们引入了 Mixture of Recursion( Mor) 统一框架, 将两个效率轴结合到一个单一的递归变换器中。 MOR 重新使用一个共享的层堆, 跨循环步骤, 以实现参数效率, 而轻量级路由器通过动态地将不同的递解深度分配到单个标志, 使得轻量级关注的思维能够适应性象征性思维。 这让 Moreto 能够只集中在某个给定的递解深度仍然活跃的象征之间进行计算, 从而通过选择性的缓存只缓存关键值配对齐来进一步提高记忆访问效率。 除了这些核心机制之外, 我们还建议 KV 共享一个共享的变量, 从第一个递归回流中再利用 KV 双组, , 具体旨在减少预留和记忆足。 在135M 至1.7B 的模型参数中, Moreto 一个新的边界: 在相同的培训 FLOPsal 和小模型中, 展示中, 将大大地显示一个更高的大小, 。
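The core mechanism in the abstract, one shared block applied a token-specific number of times, can be sketched with plain Python. This is a toy under strong assumptions: the per-token depths are given by hand instead of coming from learned routers, `shared_block` is an arbitrary scalar function standing in for the shared layer stack, and attention and KV caching are omitted.

```python
def shared_block(x):
    """Stand-in for the shared layer stack; a real model applies Transformer layers."""
    return 0.5 * x + 1.0

def mixture_of_recursions(hidden, depths, max_depth):
    """Apply shared_block to token i exactly depths[i] times.

    Returns the final hidden values and, for each recursion step,
    the indices of the tokens that were still active at that step.
    """
    hidden = list(hidden)
    active_per_step = []
    for step in range(max_depth):
        active = [i for i, depth in enumerate(depths) if depth > step]
        active_per_step.append(active)
        for i in active:
            hidden[i] = shared_block(hidden[i])  # same parameters at every step
    return hidden, active_per_step

# Hypothetical routing decision: token 1 is "hard" and gets the most recursion.
hidden, active = mixture_of_recursions([0.0, 0.0, 0.0], depths=[1, 3, 2], max_depth=3)
```

The shrinking `active` lists show where the savings come from: later recursion steps touch fewer tokens, so attention and KV storage at those steps involve only the tokens still in play.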
Article 296
Title@2025-07-14 (1): DeepResearch$^{\text{Eco}}$: A Recursive Agentic Workflow for Complex Scientific Question Answering in Ecology
Title: DeepResearch$^{\text{Eco}}$: A Recursive Agentic Workflow for Complex Scientific Question Answering in Ecology | DeepResearch$^{\text{Eco}}$: Ein rekursiver Agentischer Workflow für komplexe wissenschaftliche Fragen in der Ökologie | 深层研究$text{Eco}$:生态中复杂科学问题答案的递递性制剂工作流程 2507.10522v1 |
Authors (3): Jennifer D’Souza, Endres Keno Sander, Andrei Aioanei
We introduce DeepResearch$^{\text{Eco}}$, a novel agentic LLM-based system for automated scientific synthesis that supports recursive, depth- and breadth-controlled exploration of original research questions – enhancing search diversity and nuance in the retrieval of relevant scientific literature. Unlike conventional retrieval-augmented generation pipelines, DeepResearch enables user-controllable synthesis with transparent reasoning and parameter-driven configurability, facilitating high-throughput integration of domain-specific evidence while maintaining analytical rigor. Applied to 49 ecological research questions, DeepResearch achieves up to a 21-fold increase in source integration and a 14.9-fold rise in sources integrated per 1,000 words. High-parameter settings yield expert-level analytical depth and contextual diversity. Source code available at: https://github.com/sciknoworg/deep-research.
我们引入了基于深层研究${text{Eco}$这个基于新颖代理LLM的自动化科学合成系统,支持对原始研究问题进行循环、深度和广度控制的探索 – – 加强相关科学文献检索中的搜索多样性和细微度。与传统的检索增强的生成管道不同,深层研究使用户能够以透明推理和参数驱动的可配置性来控制合成,促进高通量整合特定领域的证据,同时保持分析规范。应用到49个生态研究问题,深层研究实现了源集化增加21倍,源集成源次增加14.9倍。高参数环境产生了专家级分析深度和背景多样性。源代码见:https://github.com/scinoworg/deep-research。
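The depth- and breadth-controlled recursion described in the abstract can be pictured with the sketch below, which expands a question into sub-questions up to a fixed depth while capping the number of branches followed at each level. The `expand` callback and the toy question names are hypothetical; in the actual system an LLM and a search backend would supply sub-questions and sources.

```python
def explore(question, depth, breadth, expand):
    """Return all questions visited by a depth- and breadth-limited recursive expansion."""
    visited = [question]
    if depth == 0:
        return visited
    for sub_question in expand(question)[:breadth]:  # follow at most `breadth` branches
        visited.extend(explore(sub_question, depth - 1, breadth, expand))
    return visited

def toy_expand(question):
    """Toy expansion that proposes three numbered sub-questions."""
    return [f"{question}.{i}" for i in range(1, 4)]

visited = explore("Q", depth=2, breadth=2, expand=toy_expand)
```

With depth d and breadth b the number of visited questions is at most 1 + b + ... + b^d, which is why raising either parameter sharply increases how many sources can be integrated.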
Article 297
Title@2025-07-14 (1): Can You Detect the Difference?
Title: Can You Detect the Difference? | Kannst du den Unterschied erkennen? | 你能发现差异吗? 2507.10475v1 |
Authors (2): İsmail Tarım, Aytuğ Onan
The rapid advancement of large language models (LLMs) has raised concerns about reliably detecting AI-generated text. Stylometric metrics work well on autoregressive (AR) outputs, but their effectiveness on diffusion-based models is unknown. We present the first systematic comparison of diffusion-generated text (LLaDA) and AR-generated text (LLaMA) using 2,000 samples. Perplexity, burstiness, lexical diversity, readability, and BLEU/ROUGE scores show that LLaDA closely mimics human text in perplexity and burstiness, yielding high false-negative rates for AR-oriented detectors. LLaMA shows much lower perplexity but reduced lexical fidelity. Relying on any single metric fails to separate diffusion outputs from human writing. We highlight the need for diffusion-aware detectors and outline directions such as hybrid models, diffusion-specific stylometric signatures, and robust watermarking.
大型语言模型(LLMS)的快速进步引起了人们对可靠地探测AI产生的文本的关切。tytylologic 度量仪在自动递减(AR)产出方面效果良好,但对于基于扩散的模型的效果尚不得而知。我们用2,000个样本对扩散产生的文本(LLLADA)和AR产生的文本(LLAMA)进行了首次系统比较。易懂性、易读性、字典多样性、可读性以及BLEU/ROUGE分数表明,LLADA在易懂性和易爆性方面密切模仿人文文本,为AR型探测器带来高的假阴性率。LLAMA显示的不易解性程度要低得多,但降低了法性忠诚性。我们强调,任何单一的指数都无法将扩散输出与人类的写作分开。我们强调,需要具有扩散意识的探测器和大纲方向,如混合模型、扩散专用的特征和强水标记。
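Two of the stylometric signals named in the abstract above can be illustrated with a minimal Python sketch. The definitions here are assumptions, since the abstract does not give the paper's formulas: lexical diversity is taken as the type-token ratio, and burstiness as the Goh-Barabási coefficient (sigma - mu) / (sigma + mu) over sentence lengths. The function names are hypothetical.

```python
import re
from statistics import mean, pstdev


def tokenize(text):
    # Lowercased word tokens; a deliberately simple tokenizer (assumption).
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    # Lexical diversity proxy: distinct words divided by total words.
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def burstiness(text):
    # Assumed definition: (sigma - mu) / (sigma + mu) over sentence lengths.
    # -1 means perfectly uniform sentences; values near +1 mean very bursty.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(tokenize(s)) for s in sentences]
    if not lengths:
        return 0.0
    mu, sigma = mean(lengths), pstdev(lengths)
    return (sigma - mu) / (sigma + mu) if (sigma + mu) else 0.0
```

A detector relying on either number alone would be exactly the single-metric approach the abstract reports as insufficient; the sketch only shows what such features look like.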
Article 298
Title@2025-07-14 (1): MLAR: Multi-layer Large Language Model-based Robotic Process Automation Applicant Tracking
Title: MLAR: Multi-layer Large Language Model-based Robotic Process Automation Applicant Tracking | MLAR: Mehrschichtige großsprachige modellbasierte Roboterprozessautomatisierung Bewerberverfolgung | MLAR: 多层大型语言示范型机器人程序自动化申请人跟踪 2507.10472v1 |
Authors (4): Mohamed T. Younes, Omar Walid, Mai Hassan, Ali Hamdi
This paper introduces an innovative Applicant Tracking System (ATS) enhanced by a novel Robotic Process Automation (RPA) framework, referred to hereafter as MLAR. Traditional recruitment processes often encounter bottlenecks in resume screening and candidate shortlisting due to time and resource constraints. MLAR addresses these challenges by employing Large Language Models (LLMs) in three distinct layers: extracting key characteristics from job postings in the first layer, parsing applicant resumes to identify education, experience, and skills in the second layer, and similarity matching in the third layer. These features are then matched through advanced semantic algorithms to identify the best candidates efficiently. Our approach integrates seamlessly into existing RPA pipelines, automating resume parsing, job matching, and candidate notifications. Extensive performance benchmarking shows that MLAR outperforms the leading RPA platforms, including UiPath and Automation Anywhere, in high-volume resume-processing tasks. When processing 2,400 resumes, MLAR achieved an average processing time of 5.4 seconds per resume, reducing processing time by approximately 16.9% compared to Automation Anywhere and 17.1% compared to UiPath. These results highlight the potential of MLAR to transform recruitment workflows by providing an efficient, accurate, and scalable solution tailored to modern hiring needs.
本文介绍了创新的申请人跟踪系统(ATS),它得到了新的机械化流程自动化(RPA)框架的加强,或被进一步称为MLARR。传统征聘程序在恢复筛选和候选人短名单方面常常遇到瓶颈。由于时间和资源的限制,传统征聘程序在恢复筛选和候选人短名单方面常常遇到瓶颈。 MALR在三个不同的层面应对这些挑战:从第一层的职位上提取主要特征,在第二层重新对申请人进行分类,以确认教育、经验、技能以及第三层的相似性匹配。这些特征随后通过先进的语义算法进行匹配,以高效率地识别最佳候选人。我们的方法在现有的RPA管道中、自动恢复分类、职位配对和候选人通知中都无缝地结合了。广泛的业绩基准显示,MALR在大量恢复处理任务中超越了主要的RAP平台,包括UiPath和自动处理场所。在恢复2 400次程序后,MALR实现了平均5.4秒的处理时间,从而将处理时间缩短了约16.9%,与自动办公地点和17.1%的处理时间,与UPathimalable的征聘需求作了调整。这些成果,通过现代化的现代化的征聘需要为现代化的现代化的调整。
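The timing figures in the abstract above imply baseline numbers that are not stated explicitly. The short Python sketch below back-computes them under one assumption: that each percentage is a reduction relative to the competitor's own per-resume time. The derived values are inferences, not results reported by the authors, and the helper name is hypothetical.

```python
MLAR_SECONDS_PER_RESUME = 5.4  # reported in the abstract
RESUMES = 2_400                # reported in the abstract


def implied_baseline(mlar_seconds, reduction):
    # If MLAR is `reduction` (a fraction) faster than a baseline, then
    # mlar = baseline * (1 - reduction), so baseline = mlar / (1 - reduction).
    return mlar_seconds / (1.0 - reduction)


automation_anywhere = implied_baseline(MLAR_SECONDS_PER_RESUME, 0.169)
uipath = implied_baseline(MLAR_SECONDS_PER_RESUME, 0.171)
total_hours = MLAR_SECONDS_PER_RESUME * RESUMES / 3600

print(f"Implied Automation Anywhere time: {automation_anywhere:.2f} s per resume")
print(f"Implied UiPath time: {uipath:.2f} s per resume")
print(f"MLAR total for {RESUMES} resumes: {total_hours:.1f} h")
```

Under this reading both baselines come out at roughly 6.5 seconds per resume, so the reported gain is about one second per document.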
Article 299
Title@2025-07-14 (1): From BERT to Qwen: Hate Detection across architectures
Title: From BERT to Qwen: Hate Detection across architectures | Von BERT bis Qwen: Hasserkennung über Architekturen hinweg | 从BERT到Qwen:跨结构的仇恨检测 2507.10468v1 |
Authors (3): Ariadna Mon, Saúl Fenollosa, Jon Lecumberri
Online platforms struggle to curb hate speech without over-censoring legitimate discourse. Early bidirectional transformer encoders made big strides, but the arrival of ultra-large autoregressive LLMs promises deeper context-awareness. Whether this extra scale actually improves practical hate-speech detection on real-world text remains unverified. Our study puts this question to the test by benchmarking both model families, classic encoders and next-generation LLMs, on curated corpora of online interactions for hate-speech detection (Hate or No Hate).
在线平台在不过度审查合法言论的情况下努力遏制仇恨言论。 早期双向变压器编码器取得了巨大进步,但超大型自动递减有限责任公司(LLMs)的到来将带来更深的环境意识。 这一额外规模是否实际上改善了现实世界文本中实际的仇恨语音检测,仍未得到核实。 我们的研究通过将模范家庭、经典变压器和下一代LMS作为基准来测试这一问题,这些模范家庭、经典变压器和下一代LMS的在线互动公司都为仇恨语音检测(Hate or no Hate hate ) 提供了保护。
Article 300
Title@2025-07-14 (1): Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction
Title: Roll the dice & look before you leap: Going beyond the creative limits of next-token prediction | Rollen Sie die Würfel & Blick, bevor Sie springen: Gehen über die kreativen Grenzen der Next-Token-Vorhersage | 跳跃前的骰子滚动和看一看:超越了次声预测的创造性极限 2504.15266v3 |
Authors (4): Vaishnavh Nagarajan, Chen Henry Wu, Charles Ding, Aditi Raghunathan
We design a suite of minimal algorithmic tasks that are a loose abstraction of open-ended real-world tasks. This allows us to cleanly and controllably quantify the creative limits of the present-day language model. Much like real-world tasks that require a creative, far-sighted leap of thought, our tasks require an implicit, open-ended stochastic planning step that either (a) discovers new connections in an abstract knowledge graph (like in wordplay, drawing analogies, or research) or (b) constructs new patterns (like in designing math problems or new proteins). In these tasks, we empirically and conceptually argue how next-token learning is myopic; multi-token approaches, namely teacherless training and diffusion models, comparatively excel in producing diverse and original output. Secondly, to elicit randomness without hurting coherence, we find that injecting noise at the input layer (dubbed seed-conditioning) works surprisingly as well as (and in some conditions, better than) temperature sampling from the output layer. Thus, our work offers a principled, minimal test-bed for analyzing open-ended creative skills, and offers new arguments for going beyond next-token learning and temperature sampling. We make part of the code available under https://github.com/chenwu98/algorithmic-creativity
Article 301
Title@2025-07-14 (1): Referential ambiguity and clarification requests: comparing human and LLM behaviour
Title: Referential ambiguity and clarification requests: comparing human and LLM behaviour | referenzielle Mehrdeutigkeit und Klärungswünsche: Vergleich des menschlichen und des LLM-Verhaltens | 参考文献的模糊性和澄清要求:比较人的行为和LLM行为 2507.10445v1 |
Authors (3): Chris Madge, Matthew Purver, Massimo Poesio
In this work we examine LLMs’ ability to ask clarification questions in task-oriented dialogues that follow the asynchronous instruction-giver/instruction-follower format. We present a new corpus that combines two existing annotations of the Minecraft Dialogue Corpus – one for reference and ambiguity in reference, and one for SDRT including clarifications – into a single common format providing the necessary information to experiment with clarifications and their relation to ambiguity. With this corpus we compare LLM actions with original human-generated clarification questions, examining how both humans and LLMs act in the case of ambiguity. We find that there is only a weak link between ambiguity and humans producing clarification questions in these dialogues, and low correlation between humans and LLMs. Humans hardly ever produce clarification questions for referential ambiguity, but often do so for task-based uncertainty. Conversely, LLMs produce more clarification questions for referential ambiguity, but less so for task uncertainty. We question if LLMs’ ability to ask clarification questions is predicated on their recent ability to simulate reasoning, and test this with different reasoning approaches, finding that reasoning does appear to increase question frequency and relevancy.
Article 302
Title@2025-07-14 (1): A Code Comprehension Benchmark for Large Language Models for Code
Title: A Code Comprehension Benchmark for Large Language Models for Code | Ein Code-Verständnis-Benchmark für große Sprachmodelle für Code | 《守则》大语言模式的《守则》理解基准 2507.10641v1 |
Authors (5): Jayant Havare, Saurav Chaudhary, Ganesh Ramakrishnan, Kaushik Maharajan, Srikanth Tamilselvam
Large Language Models have shown impressive capabilities in coding tasks like code generation and code completion, as they have been trained on a large amount of code data. Also, since one of the core pretraining objectives is Next Token Prediction, these models tend to learn surface-level syntactic patterns in code. However, this does not guarantee code comprehension ability, i.e., the ability to capture the semantics of the code. In our opinion, this is the reason why these models often underperform on tasks that require deeper semantic understanding, such as code debugging and code optimization. To address this, we propose fine-tuning these models specifically for code comprehension tasks using large-scale datasets, enabling them to develop a more robust understanding of code semantics. We evaluate three code models of varying sizes on a suite of code comprehension tasks designed to assess semantic understanding beyond surface-level syntactic pattern matching. In particular, we analyze performance on the Subjectivity Grading Task and observe that model performance improves after fine-tuning on relevant downstream tasks. The most significant improvement is seen in the QWQ-32B model, where accuracy increases from 70% to 83.47%. A similar or explainable trend is observed across other models, clearly indicating an enhancement in code comprehension ability. Among the models studied, the DPO-fine-tuned Codestral-22B achieves the highest micro-accuracy of 87.66% on the Subjectivity Grading Task.
Article 303
Title@2025-07-14 (1): Multiple Choice Learning of Low Rank Adapters for Language Modeling
Title: Multiple Choice Learning of Low Rank Adapters for Language Modeling | Multiple Choice-Lernen von Low-Rank-Adaptern für die Sprachmodellierung | 低级别语言建模适应者多选择学习 2507.10419v1 |
Authors (7): Victor Letzelter, Hugo Malard, Mathieu Fontaine, Gaël Richard, Slim Essid, Andrei Bursuc, Patrick Pérez
We propose LoRA-MCL, a training scheme that extends next-token prediction in language models with a method designed to decode diverse, plausible sentence continuations at inference time. Traditional language modeling is an intrinsically ill-posed problem: given a context, multiple futures may be equally plausible. Our approach leverages Multiple Choice Learning (MCL) and the Winner-Takes-All (WTA) loss to efficiently handle ambiguity through Low-Rank Adaptation (LoRA). We provide a theoretical interpretation of applying Multiple Choice Learning to Language Modeling, assuming the data is generated from a mixture of distributions. To illustrate the proposed approach, we use data sampled from mixtures of Markov chains. We then demonstrate with extensive experiments on real-world visual and audio captioning tasks that our method achieves high diversity and relevance in generated outputs.
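The Winner-Takes-All idea mentioned above can be sketched in a few lines of Python. This is a toy illustration, not the authors' training code: it assumes each hypothesis (for example, one low-rank adapter, which the abstract suggests but does not spell out) has already assigned a probability to every token of the target sentence, and it simply picks the hypothesis with the lowest negative log-likelihood. The function names are hypothetical.

```python
import math


def sequence_nll(token_probs):
    # Negative log-likelihood of one target sequence under one hypothesis.
    return -sum(math.log(p) for p in token_probs)


def winner_takes_all(hypothesis_token_probs):
    # Score every hypothesis, then keep only the best one. In Multiple
    # Choice Learning, only this winner would receive the gradient update.
    losses = [sequence_nll(probs) for probs in hypothesis_token_probs]
    winner = min(range(len(losses)), key=losses.__getitem__)
    return winner, losses[winner]


# Hypothetical per-token probabilities from two hypotheses for one target.
winner, loss = winner_takes_all([[0.5, 0.5], [0.9, 0.8]])
print(winner, round(loss, 4))
```

Because only the winning hypothesis is penalised for each example, different hypotheses are free to specialise on different plausible continuations, which is the source of the diversity the abstract describes.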
Article 304
Title@2025-07-14 (1): Beyond classical and contemporary models: a transformative AI framework for student dropout prediction in distance learning using RAG, Prompt engineering, and Cross-modal fusion
Title: Beyond classical and contemporary models: a transformative AI framework for student dropout prediction in distance learning using RAG, Prompt engineering, and Cross-modal fusion | Über klassische und zeitgenössische Modelle hinaus: ein transformatives KI-Framework für die Studienabbrechervorhersage im Fernunterricht mittels RAG, Prompt Engineering und Cross-modal Fusion | 超越古典和当代模式:利用RAG、快速工程和跨模式融合进行远程学习中学生辍学预测的变革性AI框架 2507.05285v2 |
Authors (3): Miloud Mihoubi, Meriem Zerkouk, Belkacem Chikhaoui
Student dropout in distance learning remains a critical challenge, with profound societal and economic consequences. While classical machine learning models leverage structured socio-demographic and behavioral data, they often fail to capture the nuanced emotional and contextual factors embedded in unstructured student interactions. This paper introduces a transformative AI framework that redefines dropout prediction through three synergistic innovations: Retrieval-Augmented Generation (RAG) for domain-specific sentiment analysis, prompt engineering to decode academic stressors, and cross-modal attention fusion to dynamically align textual, behavioral, and socio-demographic insights. By grounding sentiment analysis in a curated knowledge base of pedagogical content, our RAG-enhanced BERT model interprets student comments with unprecedented contextual relevance, while optimized prompts isolate indicators of academic distress (e.g., “isolation,” “workload anxiety”). A cross-modal attention layer then fuses these insights with temporal engagement patterns, creating holistic risk profiles. Evaluated on a longitudinal dataset of 4,423 students, the framework achieves 89% accuracy and an F1-score of 0.88, outperforming conventional models by 7% and reducing false negatives by 21%. Beyond prediction, the system generates interpretable interventions by retrieving contextually aligned strategies (e.g., mentorship programs for isolated learners). This work bridges the gap between predictive analytics and actionable pedagogy, offering a scalable solution to mitigate dropout risks in global education systems.
Article 305
Title@2025-07-14 (1): Text-to-Remote-Sensing-Image Retrieval beyond RGB Sources
Title: Text-to-Remote-Sensing-Image Retrieval beyond RGB Sources | Text-zu-Remote-Sensing-Image Retrieval jenseits von RGB-Quellen | RGB 来源以外的文字到远程传感器图像检索 2507.10403v1 |
Authors (5): Daniele Rege Cambrin, Lorenzo Vaiani, Giuseppe Gallipoli, Luca Cagliero, Paolo Garza
Retrieving relevant imagery from vast satellite archives is crucial for applications like disaster response and long-term climate monitoring. However, most text-to-image retrieval systems are limited to RGB data, failing to exploit the unique physical information captured by other sensors, such as the all-weather structural sensitivity of Synthetic Aperture Radar (SAR) or the spectral signatures in optical multispectral data. To bridge this gap, we introduce CrisisLandMark, a new large-scale corpus of over 647,000 Sentinel-1 SAR and Sentinel-2 multispectral images paired with structured textual annotations for land cover, land use, and crisis events harmonized from authoritative land cover systems (CORINE and Dynamic World) and crisis-specific sources. We then present CLOSP (Contrastive Language Optical SAR Pretraining), a novel framework that uses text as a bridge to align unpaired optical and SAR images into a unified embedding space. Our experiments show that CLOSP achieves a new state-of-the-art, improving retrieval nDCG by 54% over existing models. Additionally, we find that the unified training strategy overcomes the inherent difficulty of interpreting SAR imagery by transferring rich semantic knowledge from the optical domain with indirect interaction. Furthermore, GeoCLOSP, which integrates geographic coordinates into our framework, creates a powerful trade-off between generality and specificity: while CLOSP excels at general semantic tasks, GeoCLOSP becomes a specialized expert for retrieving location-dependent crisis events and rare geographic features. This work highlights that the integration of diverse sensor data and geographic context is essential for unlocking the full potential of remote sensing archives.
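The retrieval metric cited above is presumably normalized discounted cumulative gain (nDCG). As a reminder of what that number measures, here is a minimal Python sketch using the linear-gain variant; the abstract does not say which variant or cutoff the authors use, so both are assumptions, and the relevance grades in the example are made up.

```python
import math


def dcg(relevances, k=None):
    # Discounted cumulative gain: each result's relevance is divided by
    # log2(rank + 1), with ranks starting at 1 (linear-gain variant).
    rels = relevances[:k] if k is not None else relevances
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels))


def ndcg(relevances, k=None):
    # Normalise by the DCG of the ideal (descending) ordering of the same
    # relevance grades, so a perfect ranking scores 1.0.
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical graded relevance of the top five retrieved images.
print(round(ndcg([3, 0, 2, 1, 0], k=5), 4))
```

A relative gain in this score means relevant images move closer to the top of the ranked list, not merely that more of them are retrieved.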
Article 306
Title@2025-07-14 (1): Devanagari Handwritten Character Recognition using Convolutional Neural Network
Title: Devanagari Handwritten Character Recognition using Convolutional Neural Network | Devanagari Handgeschriebene Zeichenerkennung unter Verwendung von Convolutional Neural Network | Devanagari 利用革命神经网络手写字符识别 2507.10398v1 |
Authors (2): Diksha Mehta, Prateek Mehta
Handwritten character recognition is gaining popularity among researchers because of its potential applications in search engines, social media, recommender systems, etc. The Devanagari script is one of the oldest scripts in India, yet it lacks proper digitization tools. With the advancement of computing and technology, the task of this research is to extract handwritten Hindi characters from an image of Devanagari script with an automated approach to save time and obsolete data. In this paper, we present a technique to recognize handwritten Devanagari characters using two deep convolutional neural network layers. This work employs a methodology that enhances the recognition rate and configures a convolutional neural network for effective Devanagari handwritten text recognition (DHTR). This approach uses the Devanagari handwritten character dataset (DHCD), an open dataset with 36 classes of Devanagari characters. Each of these classes has 1,700 images for training and testing purposes. This approach obtains promising results, achieving 96.36% accuracy in testing and 99.55% in training.
Article 307
Title@2025-07-14 (1): EVOLvE: Evaluating and Optimizing LLMs For In-Context Exploration
Title: EVOLvE: Evaluating and Optimizing LLMs For In-Context Exploration | EVOLvE: Bewertung und Optimierung von LLMs für In-Context Exploration | EVOLvE: 评估和优化用于上下文内探索的LLMs 2410.06238v2 |
Authors (7): Allen Nie, Yi Su, Bo Chang, Jonathan N. Lee, Ed H. Chi, Quoc V. Le, Minmin Chen
Despite their success in many domains, large language models (LLMs) remain under-studied in scenarios requiring optimal decision-making under uncertainty. This is crucial as many real-world applications, ranging from personalized recommendations to healthcare interventions, demand that LLMs not only predict but also actively learn to make optimal decisions through exploration. In this work, we measure LLMs’ (in)ability to make optimal decisions in bandits, a state-less reinforcement learning setting relevant to many applications. We develop a comprehensive suite of environments, including both context-free and contextual bandits with varying task difficulties, to benchmark LLMs’ performance. Motivated by the existence of optimal exploration algorithms, we propose efficient ways to integrate this algorithmic knowledge into LLMs: by providing explicit algorithm-guided support during inference; and through algorithm distillation via in-context demonstrations and fine-tuning, using synthetic data generated from these algorithms. Impressively, these techniques allow us to achieve superior exploration performance with smaller models, surpassing larger models on various tasks. We conduct an extensive ablation study to shed light on various factors, such as task difficulty and data representation, that influence the efficiency of LLM exploration. Additionally, we conduct a rigorous analysis of the LLM’s exploration efficiency using the concept of regret, linking its ability to explore to the model size and underlying algorithm.
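To make the bandit setting and the notion of regret above concrete, here is a minimal Python sketch of UCB1 on a context-free Bernoulli bandit. UCB1 is used here only as a familiar example of an exploration algorithm; the abstract does not name the algorithms the authors distil, and the arm means, horizon, and function name are hypothetical.

```python
import math
import random


def run_ucb1(arm_means, horizon, seed=0):
    # Simulates UCB1 on a Bernoulli bandit and returns the cumulative
    # (expected) regret together with how often each arm was pulled.
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms
    totals = [0.0] * n_arms
    best_mean = max(arm_means)
    regret = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # pull every arm once before using the bound
        else:
            # Empirical mean plus an exploration bonus that shrinks as an
            # arm is pulled more often.
            arm = max(
                range(n_arms),
                key=lambda a: totals[a] / counts[a]
                + math.sqrt(2 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
        # Regret: gap between the best arm's mean and the chosen arm's mean.
        regret += best_mean - arm_means[arm]
    return regret, counts


regret, counts = run_ucb1([0.2, 0.8], horizon=500)
print(f"cumulative regret: {regret:.1f}, pulls per arm: {counts}")
```

An agent that explores well keeps cumulative regret growing slowly with the horizon; an agent that never explores, or never stops exploring, accumulates regret linearly.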
Article 308
Title@2025-07-14 (1): HKGAI-V1: Towards Regional Sovereign Large Language Model for Hong Kong
Title: HKGAI-V1: Towards Regional Sovereign Large Language Model for Hong Kong | HKGAI-V1: Auf dem Weg zu einem regionalen Souveränen Großsprachenmodell für Hongkong | HKGAI-V1:为香港建立区域主权大语言模式 2507.11502v1 |
Authors (4): Sirui Han, Junqi Zhu, Ruiyuan Zhang, Yike Guo
This paper presents the development of HKGAI-V1, a foundational sovereign large language model (LLM), developed as part of an initiative to establish value-aligned AI infrastructure specifically tailored for Hong Kong. Addressing the region’s unique multilingual environment (Cantonese, Mandarin, and English), its distinct socio-legal context under the “one country, two systems” framework, and specific local cultural and value considerations, the model is built upon the DeepSeek architecture and systematically aligned with regional norms through a multifaceted full parameter fine-tuning process. It is further integrated with a retrieval-augmented generation (RAG) system to ensure timely and factually grounded information access. The core contribution lies in the design and implementation of a comprehensive, region-specific AI alignment and safety framework, demonstrated through two key achievements: 1) The successful development of HKGAI-V1 itself - which outperforms general-purpose models in handling Hong Kong-specific culturally sensitive queries, and embodies a “governance-embedded” approach to digital sovereignty - empowers Hong Kong to exercise control over AI applications in critical sectors including public services, legal systems, and education. 2) The development of the proprietary Adversarial HK Value Benchmark, a rigorous tool for evaluating model alignment with local ethical and legal standards under challenging conditions. By documenting these achievements, the paper provides not only a technological artifact but also a replicable blueprint for developing advanced, regionally focused AI systems deeply rooted in their local identities.
Article 309
Title@2025-07-14 (1): Meanings are like Onions: a Layered Approach to Metaphor Processing
Title: Meanings are like Onions: a Layered Approach to Metaphor Processing | Bedeutungen sind wie Zwiebeln: ein geschichteter Ansatz zur Metaphorverarbeitung | 意思是像洋葱:对同义词处理的多层方法 2507.10354v1 |
Authors (3): Silvia Cappa, Anna Sofia Lippolis, Stefano Zoia
Metaphorical meaning is not a flat mapping between concepts, but a complex cognitive phenomenon that integrates multiple levels of interpretation. In this paper, we propose a stratified model of metaphor processing that treats meaning as an onion: a multi-layered structure comprising (1) content analysis, (2) conceptual blending, and (3) pragmatic intentionality. This three-dimensional framework allows for a richer and more cognitively grounded approach to metaphor interpretation in computational systems. At the first level, metaphors are annotated through basic conceptual elements. At the second level, we model conceptual combinations, linking components to emergent meanings. Finally, at the third level, we introduce a pragmatic vocabulary to capture speaker intent, communicative function, and contextual effects, aligning metaphor understanding with pragmatic theories. By unifying these layers into a single formal framework, our model lays the groundwork for computational methods capable of representing metaphorical meaning beyond surface associations, toward deeper, more context-sensitive reasoning.
Article 310
Title@2025-07-14 (1): Using AI to replicate human experimental results: a motion study
Title: Using AI to replicate human experimental results: a motion study | Verwendung von KI, um menschliche experimentelle Ergebnisse zu replizieren: eine Bewegungsstudie | 利用大赦国际复制人类实验结果:一项运动研究 2507.10342v1 |
Authors (2): Rosa Illan Castillo, Javier Valenzuela
This paper explores the potential of large language models (LLMs) as reliable analytical tools in linguistic research, focusing on the emergence of affective meanings in temporal expressions involving manner-of-motion verbs. While LLMs like GPT-4 have shown promise across a range of tasks, their ability to replicate nuanced human judgements remains under scrutiny. We conducted four psycholinguistic studies (on emergent meanings, valence shifts, verb choice in emotional contexts, and sentence-emoji associations) first with human participants and then replicated the same tasks using an LLM. Results across all studies show a striking convergence between human and AI responses, with statistical analyses (e.g., Spearman’s rho = .73-.96) indicating strong correlations in both rating patterns and categorical choices. While minor divergences were observed in some cases, these did not alter the overall interpretative outcomes. These findings offer compelling evidence that LLMs can augment traditional human-based experimentation, enabling broader-scale studies without compromising interpretative validity. This convergence not only strengthens the empirical foundation of prior human-based findings but also opens possibilities for hypothesis generation and data expansion through AI. Ultimately, our study supports the use of LLMs as credible and informative collaborators in linguistic inquiry.
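The agreement statistic quoted above, Spearman's rho, is the Pearson correlation of rank-transformed scores. The pure-Python sketch below shows the computation with average ranks for ties; the rating vectors in the example are invented for illustration and are not data from the study.

```python
def average_ranks(values):
    # Rank values from 1..n, giving tied values the mean of their positions.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    # Spearman's rho: Pearson correlation computed on the ranks.
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical mean ratings for five items from humans and from an LLM.
human = [4.2, 3.1, 1.5, 2.8, 4.9]
model = [4.0, 3.5, 1.0, 3.3, 4.6]
print(round(spearman_rho(human, model), 2))
```

Because only ranks matter, rho rewards a model for ordering items the way humans do even when its absolute ratings sit on a different scale.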
Article 311
Title@2025-07-14 (1): Bridging Robustness and Generalization Against Word Substitution Attacks in NLP via the Growth Bound Matrix Approach
Title: Bridging Robustness and Generalization Against Word Substitution Attacks in NLP via the Growth Bound Matrix Approach | Überbrückung von Robustheit und Verallgemeinerung gegen Wortersatzangriffe in NLP über den Ansatz der Wachstumsbound Matrix | 通过 “ 增长组合矩阵方法 “ ,在NLP中架起桥梁,反对用词替代袭击的有力性和普遍性 2507.10330v1 |
Authors (2): Mohammed Bouri, Adnane Saoud
Despite advancements in Natural Language Processing (NLP), models remain vulnerable to adversarial attacks, such as synonym substitutions. While prior work has focused on improving robustness for feed-forward and convolutional architectures, the robustness of recurrent networks and modern state space models (SSMs), such as S4, remains understudied. These architectures pose unique challenges due to their sequential processing and complex parameter dynamics. In this paper, we introduce a novel regularization technique based on Growth Bound Matrices (GBM) to improve NLP model robustness by reducing the impact of input perturbations on model outputs. We focus on computing the GBM for three architectures: Long Short-Term Memory (LSTM), State Space models (S4), and Convolutional Neural Networks (CNN). Our method aims to (1) enhance resilience against word substitution attacks, (2) improve generalization on clean text, and (3) provide the first systematic analysis of SSM (S4) robustness. Extensive experiments across multiple architectures and benchmark datasets demonstrate that our method improves adversarial robustness by up to 8.8% over existing baselines. These results highlight the effectiveness of our approach, outperforming several state-of-the-art methods in adversarial defense. Codes are available at https://github.com/BouriMohammed/GBM
Article 312
Title@2025-07-14 (1): Grammar-Guided Evolutionary Search for Discrete Prompt Optimisation
Title: Grammar-Guided Evolutionary Search for Discrete Prompt Optimisation | Grammatik-geführte evolutionäre Suche nach diskreter Prompt-Optimierung | 语法引导进化搜索 2507.10326v1 |
Authors (13): Muzhaffar Hazman, Minh-Khoi Pham, Shweta Soundararajan, Goncalo Mordido, Leonardo Custode, David Lynch, Giorgio Cruciata, Yucheng Shi, Hongmeng Song, Wang Chao, Pan Yue, Aleksandar Milenovic, Alexandros Agapitos
Prompt engineering has proven to be a crucial step in leveraging pretrained large language models (LLMs) in solving various real-world tasks. Numerous solutions have been proposed that seek to automate prompt engineering by using the model itself to edit prompts. However, the majority of state-of-the-art approaches are evaluated on tasks that require minimal prompt templates and on very large and highly capable LLMs. In contrast, solving complex tasks that require detailed information to be included in the prompt increases the amount of text that needs to be optimised. Furthermore, smaller models have been shown to be more sensitive to prompt design. To address these challenges, we propose an evolutionary search approach to automated discrete prompt optimisation consisting of two phases. In the first phase, grammar-guided genetic programming is invoked to synthesise prompt-creating programmes by searching the space of programmes populated by function compositions of syntactic, dictionary-based and LLM-based prompt-editing functions. In the second phase, local search is applied to explore the neighbourhoods of best-performing programmes in an attempt to further fine-tune their performance. Our approach outperforms three state-of-the-art prompt optimisation approaches, PromptWizard, OPRO, and RL-Prompt, on three relatively small general-purpose LLMs in four domain-specific challenging tasks. We also illustrate several examples where these benchmark methods suffer relatively severe performance degradation, while our approach improves performance in almost all task-model combinations, only incurring minimal degradation when it does not.
Article 313
Title@2025-07-14 (1): LEXam: Benchmarking Legal Reasoning on 340 Law Exams
Title: LEXam: Benchmarking Legal Reasoning on 340 Law Exams | LEXam: Benchmarking der rechtlichen Begründung von 340 Rechtsprüfungen | LEXam:340项法律考试的法律依据基准 2505.12864v3 |
Authors (17): Yu Fan, Jingwei Ni, Jakob Merane, Etienne Salimbeni, Yang Tian, Yoan Hermstrüwer, Yinya Huang, Mubashara Akhtar, Florian Geering, Oliver Dreyer, Daniel Brunner, Markus Leippold, Mrinmaya Sachan, Alexander Stremitzer, Christoph Engel, Elliott Ash, Joel Niklaus
Long-form legal reasoning remains a key challenge for large language models (LLMs) in spite of recent advances in test-time scaling. We introduce LEXam, a novel benchmark derived from 340 law exams spanning 116 law school courses across a range of subjects and degree levels. The dataset comprises 4,886 law exam questions in English and German, including 2,841 long-form, open-ended questions and 2,045 multiple-choice questions. Besides reference answers, the open questions are also accompanied by explicit guidance outlining the expected legal reasoning approach such as issue spotting, rule recall, or rule application. Our evaluation shows that both open-ended and multiple-choice questions present significant challenges for current LLMs; in particular, they struggle with open questions that require structured, multi-step legal reasoning. Moreover, our results underscore the effectiveness of the dataset in differentiating between models with varying capabilities. Adopting an LLM-as-a-Judge paradigm with rigorous human expert validation, we demonstrate how model-generated reasoning steps can be evaluated consistently and accurately. Our evaluation setup provides a scalable method to assess legal reasoning quality beyond simple accuracy metrics. Project page: https://lexam-benchmark.github.io/
Article 314
Title@2025-07-14 (1): FaceLLM: A Multimodal Large Language Model for Face Understanding
Title: FaceLLM: A Multimodal Large Language Model for Face Understanding | FaceLLM: Ein multimodales, großes Sprachmodell für das Verständnis von Gesichtern | FaceLLM: 面对面理解多式大语言模式 2507.10300v1 |
Authors (2): Hatef Otroshi Shahreza, Sébastien Marcel
Multimodal large language models (MLLMs) have shown remarkable performance in vision-language tasks. However, existing MLLMs are primarily trained on generic datasets, limiting their ability to reason on domain-specific visual cues such as those in facial images. In particular, tasks that require detailed understanding of facial structure, expression, emotion, and demographic features remain underexplored by MLLMs due to the lack of large-scale annotated face image-text datasets. In this work, we introduce FaceLLM, a multimodal large language model trained specifically for facial image understanding. To construct the training data, we propose a novel weakly supervised pipeline that uses ChatGPT with attribute-aware prompts to generate high-quality question-answer pairs based on images from the FairFace dataset. The resulting corpus, called FairFaceGPT, covers a diverse set of attributes including expression, pose, skin texture, and forensic information. Our experiments demonstrate that FaceLLM improves the performance of MLLMs on various face-centric tasks and achieves state-of-the-art performance. This work highlights the potential of synthetic supervision via language models for building domain-specialized MLLMs, and sets a precedent for trustworthy, human-centric multimodal AI systems. The FairFaceGPT dataset and pretrained FaceLLM models are publicly available on the project page.
Article 315
Title@2025-07-14 (1): Bias Beyond English: Evaluating Social Bias and Debiasing Methods in a Low-Resource Setting
Title: Bias Beyond English: Evaluating Social Bias and Debiasing Methods in a Low-Resource Setting | Bias Beyond English: Social Bias und Debiasing Methoden in einem Low-Resource Setting bewerten | 英文之后的偏见:在低资源环境下评估社会偏见和偏见方法 2504.11183v2 |
Authors (2): Ej Zhou, Weiming Lu
Social bias in language models can potentially exacerbate social inequalities. Although this issue has garnered wide attention, most research focuses on English data. In a low-resource scenario, the models often perform worse due to insufficient training data. This study aims to leverage high-resource language corpora to evaluate bias and experiment with debiasing methods in low-resource languages. We evaluated the performance of recent multilingual models in five languages: English, Chinese, Russian, Indonesian and Thai, and analyzed four bias dimensions: gender, religion, nationality, and race-color. By constructing multilingual bias evaluation datasets, this study allows fair comparisons between models across languages. We further investigated three debiasing methods (CDA, Dropout, and SenDeb) and demonstrated that debiasing methods from high-resource languages can be effectively transferred to low-resource ones, providing actionable insights for fairness research in multilingual NLP.
nan
Article 316
Title@2025-07-14 (1): B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability
Title: B-cos LM: Efficiently Transforming Pre-trained Language Models for Improved Explainability | B-cos LM: Effiziente Transformation von vortrainierten Sprachmodellen für verbesserte Erklärbarkeit | B-cos LM:高效转换培训前语文模式,改进可解释性 2502.12992v2 |
Authors (5): Yifan Wang, Sukrut Rao, Ji-Ung Lee, Mayank Jobanputra, Vera Demberg
Post-hoc explanation methods for black-box models often struggle with faithfulness and human interpretability due to the lack of explainability in current neural architectures. Meanwhile, B-cos networks have been introduced to improve model explainability by proposing an architecture that removes bias terms and promotes input-weight alignment. Although B-cos networks have shown success in building explainable systems, their application has so far been limited to computer vision models and their associated training pipelines. In this work, we introduce B-cos LMs, i.e., B-cos language models (LMs) empowered for natural language processing (NLP) tasks. Our approach directly transforms pre-trained language models into B-cos LMs by combining B-cos conversion and task fine-tuning, improving efficiency compared to previous methods. Our automatic and human evaluation results demonstrate that B-cos LMs produce more faithful and human interpretable explanations than post-hoc methods, while maintaining task performance comparable to conventional fine-tuning. Our in-depth analysis explores how B-cos LMs differ from conventionally fine-tuned models in their learning processes and explanation patterns. Finally, we are also the first to explore the transformation of decoder-only models to B-cos LMs for generation tasks.
nan
Article 317
Title@2025-07-14 (1): The distribution of syntactic dependency distances
Title: The distribution of syntactic dependency distances | Die Verteilung der syntaktischen Abhängigkeitsabstände | 共同依赖距离分布 2211.14620v3 |
Authors (2): Sonia Petrini, Ramon Ferrer-i-Cancho
The syntactic structure of a sentence can be represented as a graph, where vertices are words and edges indicate syntactic dependencies between them. In this setting, the distance between two linked words is defined as the difference between their positions. Here we wish to contribute to the characterization of the actual distribution of syntactic dependency distances, which has previously been argued to follow a power-law distribution. We propose a new model with two exponential regimes in which the probability decay is allowed to change after a break-point. This transition could mirror the transition from the processing of word chunks to higher-level structures. We find that a two-regime model, where the first regime follows either an exponential or a power-law decay, is the most likely one in all 20 languages we considered, independently of sentence length and annotation style. Moreover, the break-point exhibits low variation across languages and averages 4-5 words, suggesting that the number of words that can be simultaneously processed abstracts from the specific language to a high degree. The probability decay slows down after the break-point, consistent with a universal chunk-and-pass mechanism. Finally, we give an account of the relation between the best estimated model and the closeness of syntactic dependencies as a function of sentence length, according to a recently introduced optimality score.
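The definitions in this abstract can be made concrete with a small sketch. The first two functions compute dependency distances as the abstract defines them (the difference between the positions of two linked words) and their empirical distribution. The third is only an illustration of what a two-regime exponential decay could look like: the parameter names `rate1`, `rate2`, and `break_point` are assumptions, the weight is unnormalized, and the paper's actual parametrization may differ.

```python
import math
from collections import Counter


def dependency_distances(heads):
    """Distances between each word and its head.

    heads[i] is the 1-based position of the head of word i+1; 0 marks the root,
    which has no incoming dependency and is skipped.
    """
    return [abs(head - pos) for pos, head in enumerate(heads, start=1) if head != 0]


def distance_distribution(sentences):
    """Empirical probability of each dependency distance over a list of sentences."""
    counts = Counter(d for heads in sentences for d in dependency_distances(heads))
    total = sum(counts.values())
    return {d: c / total for d, c in sorted(counts.items())}


def two_regime_weight(d, rate1, rate2, break_point):
    """Unnormalized weight of distance d under a hypothetical two-regime decay.

    Decays at rate1 up to break_point and at rate2 afterwards; the two pieces
    meet at the break-point so the curve is continuous there.
    """
    if d <= break_point:
        return math.exp(-rate1 * d)
    return math.exp(-rate1 * break_point - rate2 * (d - break_point))
```

For example, for "She ate the red apple" with heads `[2, 0, 5, 5, 2]`, `dependency_distances` returns `[1, 2, 1, 3]`. Choosing `rate2 < rate1` in `two_regime_weight` reproduces the slower decay after the break-point that the abstract reports.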
nan
Article 318
Title@2025-07-14 (1): Absher: A Benchmark for Evaluating Large Language Models Understanding of Saudi Dialects
Title: Absher: A Benchmark for Evaluating Large Language Models Understanding of Saudi Dialects | Absher: Ein Benchmark für die Bewertung großer Sprachmodelle zum Verständnis saudischer Dialekte | Absher:评估沙特方言大语言模型理解基准 2507.10216v1 |
Authors (4): Renad Al-Monef, Hassan Alhuzali, Nora Alturayeif, Ashwag Alasmari
As large language models (LLMs) become increasingly central to Arabic NLP applications, evaluating their understanding of regional dialects and cultural nuances is essential, particularly in linguistically diverse settings like Saudi Arabia. This paper introduces Absher, a comprehensive benchmark specifically designed to assess LLM performance across major Saudi dialects. Absher comprises over 18,000 multiple-choice questions spanning six distinct categories: Meaning, True/False, Fill-in-the-Blank, Contextual Usage, Cultural Interpretation, and Location Recognition. These questions are derived from a curated dataset of dialectal words, phrases, and proverbs sourced from various regions of Saudi Arabia. We evaluate several state-of-the-art LLMs, including multilingual and Arabic-specific models. We also provide detailed insights into their capabilities and limitations. Our results reveal notable performance gaps, particularly in tasks requiring cultural inference or contextual understanding. Our findings highlight the urgent need for dialect-aware training and culturally aligned evaluation methodologies to improve LLM performance in real-world Arabic applications.
nan
Article 319
Title@2025-07-14 (1): Natural Language-based Assessment of L2 Oral Proficiency using LLMs
Title: Natural Language-based Assessment of L2 Oral Proficiency using LLMs | Natürliche Sprachgestützte Beurteilung der oralen Sprachkenntnisse von L2 unter Verwendung von LLMs | 利用LLMM 进行L2口腔熟练程度自然语言评估 2507.10200v1 |
Authors (6): Stefano Bannò, Rao Ma, Mengjie Qian, Siyuan Tang, Kate Knill, Mark Gales
Natural language-based assessment (NLA) is an approach to second language assessment that uses instructions - expressed in the form of can-do descriptors - originally intended for human examiners, aiming to determine whether large language models (LLMs) can interpret and apply them in ways comparable to human assessment. In this work, we explore the use of such descriptors with an open-source LLM, Qwen 2.5 72B, to assess responses from the publicly available S&I Corpus in a zero-shot setting. Our results show that this approach - relying solely on textual information - achieves competitive performance: while it does not outperform state-of-the-art speech LLMs fine-tuned for the task, it surpasses a BERT-based model trained specifically for this purpose. NLA proves particularly effective in mismatched task settings, is generalisable to other data types and languages, and offers greater interpretability, as it is grounded in clearly explainable, widely applicable language descriptors.
nan
Article 320
Title@2025-07-14 (1): Trinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of Large Language Models
Title: Trinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of Large Language Models | Trinity-RFT: Ein allgemein angelegtes und einheitliches Rahmenwerk zur Verstärkung der Feinsteuerung großer Sprachmodelle | 三一-RFT:加强大语言模式精美应用的一般目的和统一框架 2505.17826v2 |
Authors (14): Xuchen Pan, Yanxi Chen, Yushuo Chen, Yuchang Sun, Daoyuan Chen, Wenhao Zhang, Yuexiang Xie, Yilun Huang, Yilei Zhang, Dawei Gao, Weijie Shi, Yaliang Li, Bolin Ding, Jingren Zhou
Trinity-RFT is a general-purpose, unified and easy-to-use framework designed for reinforcement fine-tuning (RFT) of large language models. It is built with a modular and decoupled design, consisting of (1) an RFT-core that unifies and generalizes synchronous/asynchronous, on-policy/off-policy, and online/offline modes of RFT; (2) seamless integration for agent-environment interaction with high efficiency and robustness; and (3) systematic data pipelines optimized for RFT. Trinity-RFT can be easily adapted for diverse application scenarios, and serves as a unified platform for development and research of advanced reinforcement learning paradigms at both macroscopic and microscopic levels. This technical report outlines the vision, features, design and implementations of Trinity-RFT, accompanied by extensive examples, applications and experiments that demonstrate its functionalities and user-friendliness.
nan
Article 321
Title@2025-07-14 (1): Mechanistic Indicators of Understanding in Large Language Models
Title: Mechanistic Indicators of Understanding in Large Language Models | Mechanistische Indikatoren des Verstehens in großen Sprachmodellen | 大语言模型中理解力的机械指标 2507.08017v2 |
Authors (2): Pierre Beckmann, Matthieu Queloz
Recent findings in mechanistic interpretability (MI), the field probing the inner workings of Large Language Models (LLMs), challenge the view that these models rely solely on superficial statistics. We offer an accessible synthesis of these findings that doubles as an introduction to MI while integrating these findings within a novel theoretical framework for thinking about machine understanding. We argue that LLMs develop internal structures that are functionally analogous to the kind of understanding that consists in seeing connections. To sharpen this idea, we propose a three-tiered conception of understanding. First, conceptual understanding emerges when a model forms “features” as directions in latent space, learning the connections between diverse manifestations of something. Second, state-of-the-world understanding emerges when a model learns contingent factual connections between features and dynamically tracks changes in the world. Third, principled understanding emerges when a model ceases to rely on a collection of memorized facts and discovers a “circuit” connecting these facts. However, these forms of understanding remain radically different from human understanding, as the phenomenon of “parallel mechanisms” shows. We conclude that the debate should move beyond the yes-or-no question of whether LLMs understand to investigate how their strange minds work and forge conceptions that fit them.
nan
Article 322
Title@2025-07-14 (1): Abusive text transformation using LLMs
Title: Abusive text transformation using LLMs | Missbräuchliche Texttransformation mit LLMs | 使用LLMM 的恶劣文本转换 2507.10177v1 |
Authors (2): Rohitash Chandra, Jiyong Choi
Although Large Language Models (LLMs) have demonstrated significant advancements in natural language processing tasks, their effectiveness in the classification and transformation of abusive text into non-abusive versions remains an area for exploration. In this study, we aim to use LLMs to transform abusive text (tweets and reviews) featuring hate speech and swear words into non-abusive text, while retaining the intent of the text. We evaluate the performance of state-of-the-art LLMs, such as Gemini, GPT-4o, DeepSeek and Groq, on their ability to identify abusive text. We then use them to transform the text and obtain a version that is free of abusive and inappropriate content but maintains a similar level of sentiment and semantics, i.e. the transformed text needs to maintain its message. Afterwards, we evaluate the raw and transformed datasets with sentiment analysis and semantic analysis. Our results show Groq provides vastly different results when compared with other LLMs. We have identified similarities between GPT-4o and DeepSeek-V3.
nan
Article 323
Title@2025-07-14 (1): Task-Based Flexible Feature Distillation for LLMs
Title: Task-Based Flexible Feature Distillation for LLMs | Aufgabenbasierte flexible Feature-Destillation für LLMs | 用于LLMM 的基于任务灵活地物蒸馏 2507.10155v1 |
Authors (2): Khouloud Saadi, Di Wang
Knowledge Distillation (KD) in general and feature distillation in particular are promising techniques for reducing the high computational demand of large language models (LLMs). However, traditional feature KD methods typically assume that the teacher and the student share the same hidden size, limiting the flexibility of the student’s architecture. A common solution to this problem involves training a linear projector to align their feature spaces, but this introduces additional parameters that must be learned from scratch and often degrades performance on downstream tasks, especially in generative settings. To address this issue, in this work, we propose a novel task-based feature distillation method that enables knowledge transfer between teacher and student models with different hidden layer dimensions, without introducing any new parameters. Leveraging the insight that only a subset of LLM components contribute significantly to a specific downstream task, our approach identifies the most task-relevant hidden units in the teacher and directly distills their activations to the student. Our method is flexible and easily integrates with other distillation frameworks. Empirical results show consistent improvements over prior approaches across diverse tasks, including classification, instruction-following, and summarization, achieving up to a 3% performance gain over the linear projection baseline.
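The core idea, picking as many teacher units as the student has and matching them directly instead of learning a projector, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how task relevance is scored, so `importance` is a hypothetical per-unit score supplied by the caller, and mean squared error is assumed as the matching loss.

```python
def select_task_relevant_units(importance, k):
    """Indices of the k highest-scoring teacher units, in ascending index order.

    importance is a hypothetical list of per-unit task-relevance scores.
    """
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:k])


def feature_distillation_loss(teacher_hidden, student_hidden, selected):
    """Mean squared error between student activations and the selected teacher units.

    No projection matrix is involved: the student's j-th unit is matched
    directly to the teacher unit at index selected[j].
    """
    if len(selected) != len(student_hidden):
        raise ValueError("need exactly one selected teacher unit per student unit")
    diffs = [(teacher_hidden[i] - s) ** 2 for i, s in zip(selected, student_hidden)]
    return sum(diffs) / len(diffs)


# Teacher with 4 hidden units, student with 2.
importance = [0.1, 0.9, 0.3, 0.7]
selected = select_task_relevant_units(importance, k=2)
loss = feature_distillation_loss([0.5, 1.0, -0.2, 2.0], [0.8, 1.5], selected)
```

Here `selected` is `[1, 3]`, so the student's two units are compared with teacher activations `1.0` and `2.0`. Because the selection is a fixed index list, the sketch adds no trainable parameters, which is the property the abstract emphasizes.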
nan
Article 324
Title@2025-07-14 (1): A Noise-Robust Turn-Taking System for Real-World Dialogue Robots: A Field Experiment
Title: A Noise-Robust Turn-Taking System for Real-World Dialogue Robots: A Field Experiment | Ein Lärm-Robust Turn-Taking-System für Real-World Dialogue Robots: Ein Feldexperiment | 实时世界对话机器人一个噪音-Robust 转录系统:一个实地实验 2503.06241v2 |
Authors (6): Koji Inoue, Yuki Okafuji, Jun Baba, Yoshiki Ohira, Katsuya Hyodo, Tatsuya Kawahara
Turn-taking is a crucial aspect of human-robot interaction, directly influencing conversational fluidity and user engagement. While previous research has explored turn-taking models in controlled environments, their robustness in real-world settings remains underexplored. In this study, we propose a noise-robust voice activity projection (VAP) model, based on a Transformer architecture, to enhance real-time turn-taking in dialogue robots. To evaluate the effectiveness of the proposed system, we conducted a field experiment in a shopping mall, comparing the VAP system with a conventional cloud-based speech recognition system. Our analysis covered both subjective user evaluations and objective behavioral analysis. The results showed that the proposed system significantly reduced response latency, leading to a more natural conversation where both the robot and users responded faster. The subjective evaluations suggested that faster responses contribute to a better interaction experience.
nan
Article 325
Title@2025-07-14 (1): Barriers in Integrating Medical Visual Question Answering into Radiology Workflows: A Scoping Review and Clinicians’ Insights
Title: Barriers in Integrating Medical Visual Question Answering into Radiology Workflows: A Scoping Review and Clinicians’ Insights | Barrieren bei der Integration medizinischer visueller Fragenbeantwortung in die Radiologie Workflows: Ein Scoping Review und Einblicke von Klinikern | 将医疗视觉问题答案纳入放射工作流的障碍:范围审查和临床医生的洞察 2507.08036v2 |
Authors (5): Deepali Mishra, Chaklam Silpasuwanchai, Ashutosh Modi, Madhumita Sushil, Sorayouth Chumnanvej
Medical Visual Question Answering (MedVQA) is a promising tool to assist radiologists by automating medical image interpretation through question answering. Despite advances in models and datasets, MedVQA’s integration into clinical workflows remains limited. This study systematically reviews 68 publications (2018-2024) and surveys 50 clinicians from India and Thailand to examine MedVQA’s practical utility, challenges, and gaps. Following the Arksey and O’Malley scoping review framework, we used a two-pronged approach: (1) reviewing studies to identify key concepts, advancements, and research gaps in radiology workflows, and (2) surveying clinicians to capture their perspectives on MedVQA’s clinical relevance. Our review reveals that nearly 60% of QA pairs are non-diagnostic and lack clinical relevance. Most datasets and models do not support multi-view, multi-resolution imaging, EHR integration, or domain knowledge, features essential for clinical diagnosis. Furthermore, there is a clear mismatch between current evaluation metrics and clinical needs. The clinician survey confirms this disconnect: only 29.8% consider MedVQA systems highly useful. Key concerns include the absence of patient history or domain knowledge (87.2%), preference for manually curated datasets (51.1%), and the need for multi-view image support (78.7%). Additionally, 66% favor models focused on specific anatomical regions, and 89.4% prefer dialogue-based interactive systems. While MedVQA shows strong potential, challenges such as limited multimodal analysis, lack of patient context, and misaligned evaluation approaches must be addressed for effective clinical integration.
nan
Article 326
Title@2025-07-14 (1): DiaTool-DPO: Multi-Turn Direct Preference Optimization for Tool-Augmented Large Language Models
Title: DiaTool-DPO: Multi-Turn Direct Preference Optimization for Tool-Augmented Large Language Models | DiaTool-DPO: Multi-Turn Direct Preference Optimierung für Tool-Augmented Large Language Models | DiaTool-DPO:多发直接首选优化工具增强型大语言模型 2504.02882v2 |
Authors (10): Sunghee Jung, Donghun Lee, Shinbok Lee, Gaeun Seo, Daniel Lee, Byeongil Ko, Junrae Cho, Kihyun Kim, Eunggyun Kim, Myeongcheol Shin
Tool-Augmented Large Language Models (TA-LLMs) have shown promise in real-world applications, but face challenges in handling incomplete queries and out-of-scope requests. While existing approaches rely mainly on Supervised Fine-Tuning with expert trajectories, we propose DiaTool-DPO, a novel method that enhances TA-LLM’s dialogue capabilities through Direct Preference Optimization. We model TA-LLM interactions as a Markov Decision Process with 5 distinct dialogue states and categorize user queries into 3 types based on their state transition trajectories. We automatically construct paired trajectory datasets of correct and incorrect dialogue flows and introduce a specialized objective loss for dialogue control. Our comprehensive evaluation demonstrates that DiaTool-DPO approaches GPT-4o’s performance (94.8% in information gathering, 91% in tool call rejection) with substantial improvements over baseline (44% and 9.6% respectively) while maintaining core functionality. Our approach opens new possibilities for developing TA-LLMs that can handle diverse real-world scenarios without requiring additional expert demonstrations or human labeling.
nan
Article 327
Title@2025-07-14 (1): Fusing Large Language Models with Temporal Transformers for Time Series Forecasting
Title: Fusing Large Language Models with Temporal Transformers for Time Series Forecasting | Große Sprachmodelle mit Zeittransformatoren für die Zeitreihenvorhersage | 用时间序列预测时空变换器使用大型语言模型 2507.10098v1 |
Authors (5): Chen Su, Yuanhe Tian, Qinyu Liu, Jun Zhang, Yan Song
Recently, large language models (LLMs) have demonstrated powerful capabilities in performing various tasks and thus are applied by recent studies to time series forecasting (TSF) tasks, which predict future values with the given historical time series. Existing LLM-based approaches transfer knowledge learned from text data to time series prediction using prompting or fine-tuning strategies. However, LLMs are proficient at reasoning over discrete tokens and semantic patterns but are not initially designed to model continuous numerical time series data. The gaps between text and time series data lead LLMs to achieve inferior performance to a vanilla Transformer model that is directly trained on TSF data. However, the vanilla Transformers often struggle to learn high-level semantic patterns. In this paper, we design a novel Transformer-based architecture that complementarily leverages LLMs and vanilla Transformers, so as to integrate the high-level semantic representations learned by LLMs into the temporal information encoded by time series Transformers, where a hybrid representation is obtained by fusing the representations from the LLM and the Transformer. The resulting fused representation contains both historical temporal dynamics and semantic variation patterns, allowing our model to predict more accurate future values. Experiments on benchmark datasets demonstrate the effectiveness of the proposed approach.
nan
Article 328
Title@2025-07-14 (1): A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications
Title: A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications | Eine umfassende Übersicht über die direkte Präferenzoptimierung: Datensätze, Theorien, Varianten und Anwendungen | 直接优先优化综合调查:数据集、理论、变式和应用 2410.15595v3 |
Authors (12): Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Zongrui Li, Ruirui Lei, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, Zhou Zhao, Fei Wu
With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an RL-free alternative to Reinforcement Learning from Human Feedback (RLHF). Despite DPO’s various advancements and inherent limitations, an in-depth review of these aspects is currently lacking in the literature. In this work, we present a comprehensive review of the challenges and opportunities in DPO, covering theoretical analyses, variants, relevant preference datasets, and applications. Specifically, we categorize recent studies on DPO based on key research questions to provide a thorough understanding of DPO’s current landscape. Additionally, we propose several future research directions to offer insights on model alignment for the research community. An updated collection of relevant papers can be found on https://github.com/Mr-Loevan/DPO-Survey.
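For readers unfamiliar with the objective this survey covers, the standard DPO loss for a single preference pair can be written as a short function. This is a sketch of the original DPO formulation only; the variants the survey catalogues modify it in different ways. The argument names and the default `beta=0.1` are illustrative choices, and the inputs are assumed to be summed log-probabilities of the chosen and rejected responses under the policy and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(margin)).

    The margin is beta times the difference between the policy-vs-reference
    log-ratio of the chosen response and that of the rejected response.
    """
    chosen_reward = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_reward = beta * (policy_rejected_logp - ref_rejected_logp)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

When the policy equals the reference model, the margin is zero and the loss is `log 2`; it falls as the policy raises the chosen response's likelihood relative to the rejected one. No reward model or sampling loop appears, which is what the abstract means by an RL-free alternative to RLHF.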
nan
Article 329
Title@2025-07-14 (1): Structuring Radiology Reports: Challenging LLMs with Lightweight Models
Title: Structuring Radiology Reports: Challenging LLMs with Lightweight Models | Structuring Radiology Reports: Herausfordernde LLMs mit Leichtbaumodellen | 结构化放射学报告:用轻量级模型对LMS提出挑战 2506.00200v2 |
Authors (8): Johannes Moll, Louisa Fay, Asfandyar Azhar, Sophie Ostmeier, Tim Lueth, Sergios Gatidis, Curtis Langlotz, Jean-Benoit Delbrouck
Radiology reports are critical for clinical decision-making but often lack a standardized format, limiting both human interpretability and machine learning (ML) applications. While large language models (LLMs) have shown strong capabilities in reformatting clinical text, their high computational requirements, lack of transparency, and data privacy concerns hinder practical deployment. To address these challenges, we explore lightweight encoder-decoder models (<300M parameters), specifically T5 and BERT2BERT, for structuring radiology reports from the MIMIC-CXR and CheXpert Plus datasets. We benchmark these models against eight open-source LLMs (1B-70B), adapted using prefix prompting, in-context learning (ICL), and low-rank adaptation (LoRA) finetuning. Our best-performing lightweight model outperforms all LLMs adapted using prompt-based techniques on a human-annotated test set. While some LoRA-finetuned LLMs achieve modest gains over the lightweight model on the Findings section (BLEU 6.4%, ROUGE-L 4.8%, BERTScore 3.6%, F1-RadGraph 1.1%, GREEN 3.6%, and F1-SRR-BERT 4.3%), these improvements come at the cost of substantially greater computational resources. For example, LLaMA-3-70B incurred more than 400 times the inference time, cost, and carbon emissions compared to the lightweight model. These results underscore the potential of lightweight, task-specific models as sustainable and privacy-preserving solutions for structuring clinical text in resource-constrained healthcare settings.
nan
Article 330
Title@2025-07-14 (1): Enhancing Chain-of-Thought Reasoning with Critical Representation Fine-tuning
Title: Enhancing Chain-of-Thought Reasoning with Critical Representation Fine-tuning | Verbesserung der Kette der nachdenklichen Vernunft mit kritischer Darstellung Feinabstimmung | 强化研究链,理由与关键代表的微调 2507.10085v1 |
Authors (9): Chenxi Huang, Shaotian Yan, Liang Xie, Binbin Lin, Sinan Fan, Yue Xin, Deng Cai, Chen Shen, Jieping Ye
Representation Fine-tuning (ReFT), a recently proposed Parameter-Efficient Fine-Tuning (PEFT) method, has attracted widespread attention for significantly improving parameter efficiency by editing representation space alone. In this work, we investigate applying ReFT to complex reasoning tasks. However, directly using the native ReFT method, which modifies fixed representations at the beginning and end of each layer, yields suboptimal performance, as these fixed-position representations have uncertain impact on the outputs. We observe that, in complex reasoning tasks, there often exist certain critical representations. These representations either integrate significant information from preceding layers or regulate subsequent layer representations. Through layer-by-layer propagation, they exert a substantial influence on the final output. Naturally, fine-tuning these critical representations has the potential to greatly enhance reasoning performance. Building upon these insights, we propose Critical Representation Fine-Tuning (CRFT), a novel method that identifies and optimizes these critical representations through information flow analysis. CRFT operates within a supervised learning framework, dynamically optimizing critical representations in a low-rank linear subspace while freezing the base model. The effectiveness and efficiency of our method are validated across eight benchmarks for arithmetic and commonsense reasoning, using LLaMA and Mistral model families. Furthermore, our method also adapts effectively to few-shot settings, boosting one-shot accuracy by 16.4%. Our work highlights the untapped potential of representation-level optimization for CoT reasoning, offering a lightweight yet powerful alternative to traditional PEFT methods.
nan
Article 331
Title@2025-07-14 (1): Cultural Bias in Large Language Models: Evaluating AI Agents through Moral Questionnaires
Title: Cultural Bias in Large Language Models: Evaluating AI Agents through Moral Questionnaires | Kulturelle Bias in großen Sprachmodellen: Bewertung von KI-Agenten durch moralische Fragebögen | 大语言模式中的文化偏见:通过道德问卷评估AI代理 2507.10073v1 |
Authors (1): Simon Münker
Are AI systems truly representing human values, or merely averaging across them? Our study suggests a concerning reality: Large Language Models (LLMs) fail to represent diverse cultural moral frameworks despite their linguistic capabilities. We expose significant gaps between AI-generated and human moral intuitions by applying the Moral Foundations Questionnaire across 19 cultural contexts. Comparing multiple state-of-the-art LLMs’ origins against human baseline data, we find these models systematically homogenize moral diversity. Surprisingly, increased model size doesn’t consistently improve cultural representation fidelity. Our findings challenge the growing use of LLMs as synthetic populations in social science research and highlight a fundamental limitation in current AI alignment approaches. Without data-driven alignment beyond prompting, these systems cannot capture the nuanced, culturally-specific moral intuitions. Our results call for more grounded alignment objectives and evaluation metrics to ensure AI systems represent diverse human values rather than flattening the moral landscape.
nan
Article 332
Title@2025-07-14 (1): GeLaCo: An Evolutionary Approach to Layer Compression
Title: GeLaCo: An Evolutionary Approach to Layer Compression | GeLaCo: Ein evolutionärer Ansatz zur Schichtkompression | GeLaCo: 层压缩的进化方法 2507.10059v1 |
Authors (3): David Ponce, Thierry Etchegoyhen, Javier Del Ser
Large Language Models (LLMs) have achieved remarkable performance across a large number of tasks, but face critical deployment and usage barriers due to substantial computational requirements. Model compression methods, which aim to reduce model size while preserving its capacity, are an important means to mitigate these issues. Promising approaches along these lines, such as structured pruning, typically require costly empirical search for optimal variants and may run the risk of ignoring better solutions. In this work we introduce GeLaCo, an evolutionary approach to LLM compression via layer collapse. Our approach supports an efficient exploration of the compression solution space via population-based search and a module-wise similarity fitness function capturing attention, feed-forward, and hidden state representations. GeLaCo also supports both single and multi-objective evolutionary compression search, establishing the first Pareto frontier along compression and quality axes. We evaluate GeLaCo solutions via both perplexity-based and generative evaluations over foundational and instruction-tuned models, outperforming state-of-the-art alternatives.
nan
Article 333
Title@2025-07-14 (1): PRISM: Fine-Grained Paper-to-Paper Retrieval with Multi-Aspect-Aware Query Optimization
Title: PRISM: Fine-Grained Paper-to-Paper Retrieval with Multi-Aspect-Aware Query Optimization | PRISM: Feinkörniges Papier-zu-Papier-Retrieval mit Multi-Aspect-Aware-Abfrageoptimierung | PRISM: 配有多频谱软件查询优化的精细读纸到纸检索器 2507.10057v1 |
Authors (4): Sangwoo Park, Jinheon Baek, Soyeong Jeong, Sung Ju Hwang
Scientific paper retrieval, particularly framed as document-to-document retrieval, aims to identify relevant papers in response to a long-form query paper, rather than a short query string. Previous approaches to this task have focused on abstracts, embedding them into dense vectors as surrogates for full documents and calculating similarity across them, although abstracts provide only sparse and high-level summaries. To address this, we propose PRISM, a novel document-to-document retrieval method that introduces multiple, fine-grained representations for both the query and candidate papers. In particular, each query paper is decomposed into multiple aspect-specific views that are individually embedded and then matched against candidate papers, which are similarly segmented to consider their multifaceted dimensions. Moreover, we present SciFullBench, a novel benchmark in which the complete and segmented context of full papers for both queries and candidates is available. Experimental results show that PRISM improves performance by an average of 4.3% over existing retrieval baselines.
nan
Article 334
Title@2025-07-14 (1): Political Bias in LLMs: Unaligned Moral Values in Agent-centric Simulations
Title: Political Bias in LLMs: Unaligned Moral Values in Agent-centric Simulations | Politische Bias in LLMs: Ungebundene Moralwerte in Agent-zentrierten Simulationen | LLM中的政治偏见:代理中心模拟中的不结盟道德价值 2408.11415v2 |
Authors (1): Simon Münker
Contemporary research in social sciences increasingly utilizes state-of-the-art generative language models to annotate or generate content. While these models achieve benchmark-leading performance on common language tasks, their application to novel out-of-domain tasks remains insufficiently explored. To address this gap, we investigate how personalized language models align with human responses on the Moral Foundation Theory Questionnaire. We adapt open-source generative language models to different political personas and repeatedly survey these models to generate synthetic data sets where model-persona combinations define our sub-populations. Our analysis reveals that models produce inconsistent results across multiple repetitions, yielding high response variance. Furthermore, the alignment between synthetic data and corresponding human data from psychological studies shows a weak correlation, with conservative persona-prompted models particularly failing to align with actual conservative populations. These results suggest that language models struggle to coherently represent ideologies through in-context prompting due to their alignment process. Thus, using language models to simulate social interactions requires measurable improvements in in-context optimization or parameter manipulation to align with psychological and sociological stereotypes properly.
nan
Article 335
Title@2025-07-14 (1): IPAD: Inverse Prompt for AI Detection – A Robust and Explainable LLM-Generated Text Detector
Title: IPAD: Inverse Prompt for AI Detection – A Robust and Explainable LLM-Generated Text Detector | IPAD: Inverse Aufforderung zur KI-Erkennung – ein robuster und erklärbarer LLM-generierter Textdetektor | IPAD: AI 检测反光提示 – – 强力和可解释的LLM-发光文本检测器 2502.15902v2 |
Authors (6): Zheng Chen, Yushi Feng, Changyang He, Yue Deng, Hongxi Pu, Bo Li
Large Language Models (LLMs) have attained human-level fluency in text generation, which complicates the distinction between human-written and LLM-generated texts. This increases the risk of misuse and highlights the need for reliable detectors. Yet, existing detectors exhibit poor robustness on out-of-distribution (OOD) data and attacked data, which is critical for real-world scenarios. Also, they struggle to provide interpretable evidence to support their decisions, thus undermining the reliability. In light of these challenges, we propose IPAD (Inverse Prompt for AI Detection), a novel framework consisting of a Prompt Inverter that identifies predicted prompts that could have generated the input text, and two Distinguishers that examine the probability that the input texts align with the predicted prompts. Empirical evaluations demonstrate that IPAD outperforms the strongest baselines by 9.05% (Average Recall) on in-distribution data, 12.93% (AUROC) on out-of-distribution (OOD) data, and 5.48% (AUROC) on attacked data. IPAD also performs robustly on structured datasets. Furthermore, an interpretability assessment is conducted to illustrate that IPAD enhances the AI detection trustworthiness by allowing users to directly examine the decision-making evidence, which provides interpretable support for its state-of-the-art detection results.
nan
Article 336
Title@2025-07-14 (1): Automating SPARQL Query Translations between DBpedia and Wikidata
Title: Automating SPARQL Query Translations between DBpedia and Wikidata | Automatisieren von SPARQL Query Translations zwischen DBpedia und Wikidata | 将 DBpedia 和 Wikidata 之间的 SPARQL 查询翻译自动化 2507.10045v1 |
Authors (3): Malte Christian Bartels, Debayan Banerjee, Ricardo Usbeck
This paper investigates whether state-of-the-art Large Language Models (LLMs) can automatically translate SPARQL between popular Knowledge Graph (KG) schemas. We focus on translations between the DBpedia and Wikidata KGs, and later on the DBLP and OpenAlex KGs. This study addresses a notable gap in KG interoperability research by rigorously evaluating LLM performance on SPARQL-to-SPARQL translation. Two benchmarks are assembled: the first aligns 100 DBpedia-Wikidata queries from QALD-9-Plus; the second contains 100 DBLP queries aligned to OpenAlex, testing generalizability beyond encyclopaedic KGs. Three open LLMs (Llama-3-8B, DeepSeek-R1-Distill-Llama-70B, and Mistral-Large-Instruct-2407) are selected based on their sizes and architectures and tested with zero-shot, few-shot, and two chain-of-thought variants. Outputs are compared with gold answers, and the resulting errors are categorized. We find that performance varies markedly across models and prompting strategies, and that translations from Wikidata to DBpedia work far better than translations from DBpedia to Wikidata.
nan
Article 337
Title@2025-07-14 (1): Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning
Title: Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning | Erste Prüfung von Wissenschaftlern: Kognitive Fähigkeiten von MLLM durch Wahrnehmung, Verständnis und Vernunft unter Beweis stellen | 科学家的第一次考试:通过感知、理解和理性,发现MLLM的认知能力 2506.10521v4 |
Authors (27): Yuhao Zhou, Yiheng Wang, Xuming He, Ruoyao Xiao, Zhiwei Li, Qiantai Feng, Zijie Guo, Yuejin Yang, Hao Wu, Wenxuan Huang, Jiaqi Wei, Dan Si, Xiuqi Yao, Jia Bu, Haiwen Huang, Tianfan Fu, Shixiang Tang, Ben Fei, Dongzhan Zhou, Fenghua Ling, Yan Lu, Siqi Sun, Chenhui Li, Guanjie Zheng, Jiancheng Lv, Wenlong Zhang, Lei Bai
Scientific discoveries increasingly rely on complex multimodal reasoning based on information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) hold the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly focus on evaluating the knowledge understanding capabilities of MLLMs, leading to an inadequate assessment of their perception and reasoning abilities. To address this gap, we present the Scientists’ First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs through three interconnected levels: scientific signal perception, scientific attribute understanding, and scientific comparative reasoning. Specifically, SFE comprises 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks across five high-value disciplines. Extensive experiments reveal that current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, respectively, highlighting significant room for MLLMs to improve in scientific realms. We hope the insights obtained in SFE will facilitate further developments in AI-enhanced scientific discoveries.
nan
Article 338
Title@2025-07-14 (1): Cross-modal Associations in Vision and Language Models: Revisiting the bouba-kiki effect
Title: Cross-modal Associations in Vision and Language Models: Revisiting the bouba-kiki effect | Cross-modale Assoziationen in Vision und Sprachmodellen: Der Bouba-Kiki-Effekt neu aufgreifen | 愿景和语言模式跨模式协会:重新审查bouba-kiki效应 2507.10013v1 |
Authors (3): Tom Kouwenhoven, Kiana Shahrasbi, Tessa Verhoef
Recent advances in multimodal models have raised questions about whether vision-and-language models (VLMs) integrate cross-modal information in ways that reflect human cognition. One well-studied test case in this domain is the bouba-kiki effect, where humans reliably associate pseudowords like “bouba” with round shapes and “kiki” with jagged ones. Given the mixed evidence found in prior studies for this effect in VLMs, we present a comprehensive re-evaluation focused on two variants of CLIP, ResNet and Vision Transformer (ViT), given their centrality in many state-of-the-art VLMs. We apply two complementary methods closely modelled after human experiments: a prompt-based evaluation that uses probabilities as a measure of model preference, and Grad-CAM as a novel way to interpret visual attention in shape-word matching tasks. Our findings show that these models do not consistently exhibit the bouba-kiki effect. While ResNet shows a preference for round shapes, overall performance across both models lacks the expected associations. Moreover, direct comparison with prior human data on the same task shows that the models’ responses fall markedly short of the robust, modality-integrated behaviour characteristic of human cognition. These results contribute to the ongoing debate about the extent to which VLMs truly understand cross-modal concepts, highlighting limitations in their internal representations and alignment with human intuitions.
nan
Article 339
Title@2025-07-14 (1): Protective Factor-Aware Dynamic Influence Learning for Suicide Risk Prediction on Social Media
Title: Protective Factor-Aware Dynamic Influence Learning for Suicide Risk Prediction on Social Media | Schutzfaktor-Bewusst Dynamisches Influence-Lernen für Suizidrisikovorhersage in sozialen Medien | 社会媒体自杀风险预测社会媒体 2507.10008v1 |
Authors (8): Jun Li, Xiangmeng Wang, Haoyang Li, Yifei Yan, Hong Va Leong, Ling Feng, Nancy Xiaonan Yu, Qing Li
Suicide is a critical global health issue that requires urgent attention. Even though prior work has revealed valuable insights into detecting current suicide risk on social media, little attention has been paid to developing models that can predict subsequent suicide risk over time, limiting their ability to capture rapid fluctuations in individuals’ mental state transitions. In addition, existing work ignores protective factors that play a crucial role in suicide risk prediction, focusing predominantly on risk factors alone. Protective factors such as social support and coping strategies can mitigate suicide risk by moderating the impact of risk factors. Therefore, this study proposes a novel framework for predicting subsequent suicide risk by jointly learning the dynamic influence of both risk factors and protective factors on users’ suicide risk transitions. We propose a novel Protective Factor-Aware Dataset, which is built from 12 years of Reddit posts along with comprehensive annotations of suicide risk and both risk and protective factors. We also introduce a Dynamic Factors Influence Learning approach that captures the varying impact of risk and protective factors on suicide risk transitions, recognizing that suicide risk fluctuates over time according to established psychological theories. Our thorough experiments demonstrate that the proposed model significantly outperforms state-of-the-art models and large language models across three datasets. In addition, the proposed Dynamic Factors Influence Learning provides interpretable weights, helping clinicians better understand suicidal patterns and enabling more targeted intervention strategies.
nan
Article 340
Title@2025-07-14 (1): SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs
Title: SpatialViz-Bench: Automatically Generated Spatial Visualization Reasoning Tasks for MLLMs | RaumViz-Bench: Automatisch generierte räumliche Visualisierungs-Aufgaben für MLLMs | 空间Viz-Bench:MLLLMs自动生成的空间可视化推理任务 2507.07610v2 |
Authors (8): Siting Wang, Luoyang Sun, Cheng Deng, Kun Shao, Minnan Pei, Zheng Tian, Haifeng Zhang, Jun Wang
Humans can directly imagine and manipulate visual images in their minds, a capability known as spatial visualization. While multi-modal Large Language Models (MLLMs) support imagination-based reasoning, spatial visualization remains insufficiently evaluated, typically embedded within broader mathematical and logical assessments. Existing evaluations often rely on IQ tests or math competitions that may overlap with training data, compromising assessment reliability. To this end, we introduce SpatialViz-Bench, a comprehensive multi-modal benchmark for spatial visualization with 12 tasks across 4 sub-abilities, comprising 1,180 automatically generated problems. Our evaluation of 33 state-of-the-art MLLMs not only reveals wide performance variations and demonstrates the benchmark’s strong discriminative power, but also uncovers counter-intuitive findings: models exhibit unexpected behaviors by showing difficulty perception that misaligns with human intuition, displaying dramatic 2D-to-3D performance cliffs, and defaulting to formula derivation despite spatial tasks requiring visualization alone. SpatialViz-Bench empirically demonstrates that state-of-the-art MLLMs continue to exhibit deficiencies in spatial visualization tasks, thereby addressing a significant lacuna in the field. The benchmark is publicly available.
nan
Article 341
Title@2025-07-14 (1): On The Role of Intentionality in Knowledge Representation: Analyzing Scene Context for Cognitive Agents with a Tiny Language Model
Title: On The Role of Intentionality in Knowledge Representation: Analyzing Scene Context for Cognitive Agents with a Tiny Language Model | Zur Rolle der Intentionalität in der Wissensrepräsentation: Analysieren des Szenekontexts für Kognitive Agenten mit einem winzigen Sprachmodell | 关于 “ 有意在知识代表性中的作用 “ :用微小语言模式分析认知代理人的场景背景 2507.10000v1 |
Authors (1): Mark Burgess
Since Searle’s work deconstructing intent and intentionality in the realm of philosophy, the practical meaning of intent has received little attention in science and technology. Intentionality and context are both central to the scope of Promise Theory’s model of Semantic Spacetime, used as an effective Tiny Language Model. One can identify themes and concepts from a text, on a low level (without knowledge of the specific language) by using process coherence as a guide. Any agent process can assess superficially a degree of latent ‘intentionality’ in data by looking for anomalous multi-scale anomalies and assessing the work done to form them. Scale separation can be used to sort parts into ‘intended’ content and ‘ambient context’, using the spacetime coherence as a measure. This offers an elementary but pragmatic interpretation of latent intentionality for very low computational cost, and without reference to extensive training or reasoning capabilities. The process is well within the reach of basic organisms as it does not require large scale artificial probabilistic batch processing. The level of concept formation depends, however, on the memory capacity of the agent.
nan
Article 342
Title@2025-07-14 (1): Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code
Title: Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code | LLM zur Vernunft bringen: Stärkung Lernen aus algorithmischen Problemen ohne Code | 教LLM到理由:加强从没有法典的等级问题中学习 2507.07498v2 |
Authors (8): Keqin Bao, Nuo Chen, Xiaoyuan Li, Binyuan Hui, Bowen Yu, Fuli Feng, Xiangnan He, Dayiheng Liu
Enhancing reasoning capabilities remains a central focus in the LLM research community. A promising direction involves requiring models to simulate code execution step-by-step to derive outputs for given inputs. However, as code is often designed for large-scale systems, direct application leads to over-reliance on complex data structures and algorithms, even for simple cases, resulting in overfitting to algorithmic patterns rather than core reasoning structures. To address this, we propose TeaR, which aims at teaching LLMs to reason better. TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks, thereby improving general reasoning abilities. We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, and across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning. The results consistently show significant performance improvements. Notably, TeaR achieves a 35.9% improvement on Qwen2.5-7B and 5.9% on R1-Distilled-7B.
nan
Article 343
Title@2025-07-14 (1): Not all tokens are created equal: Perplexity Attention Weighted Networks for AI generated text detection
Title: Not all tokens are created equal: Perplexity Attention Weighted Networks for AI generated text detection | Nicht alle Token sind gleich: Perplexity Attention Gewichtete Netzwerke für die KI generierte Texterkennung | 并非所有的标识符被创建为等号: 为 AI 生成的文本检测而创建的双倍注意加权网络 2501.03940v3 |
Authors (4): Pablo Miralles-González, Javier Huertas-Tato, Alejandro Martín, David Camacho
The rapid advancement in large language models (LLMs) has significantly enhanced their ability to generate coherent and contextually relevant text, raising concerns about the misuse of AI-generated content and making it critical to detect it. However, the task remains challenging, particularly in unseen domains or with unfamiliar LLMs. Leveraging LLM next-token distribution outputs offers a theoretically appealing approach for detection, as they encapsulate insights from the models’ extensive pre-training on diverse corpora. Despite their promise, zero-shot methods that attempt to operationalize these outputs have met with limited success. We hypothesize that one of the problems is that they use the mean to aggregate next-token distribution metrics across tokens, when some tokens are naturally easier or harder to predict and should be weighted differently. Based on this idea, we propose the Perplexity Attention Weighted Network (PAWN), which uses the last hidden states of the LLM and positions to weight the sum of a series of features based on metrics from the next-token distribution across the sequence length. Although not zero-shot, our method allows us to cache the last hidden states and next-token distribution metrics on disk, greatly reducing the training resource requirements. PAWN shows competitive and even better performance in-distribution than the strongest baselines (fine-tuned LMs) with a fraction of their trainable parameters. Our model also generalizes better to unseen domains and source models, with smaller variability in the decision boundary across distribution shifts. It is also more robust to adversarial attacks, and if the backbone has multilingual capabilities, it presents decent generalization to languages not seen during supervised training, with LLaMA3-1B reaching a mean macro-averaged F1 score of 81.46% in cross-validation with nine languages.
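The contrast drawn above between mean aggregation and weighted aggregation of per-token metrics can be sketched as follows. This is a toy illustration, not the PAWN architecture: in the paper the weights come from a learned network over the LLM's last hidden states and positions, whereas here `weight_logits` is a hypothetical input standing in for that network's output, and `token_scores` stands in for a single next-token metric such as per-token log-probability.

```python
import math


def mean_aggregate(token_scores):
    """Baseline: every token contributes equally."""
    return sum(token_scores) / len(token_scores)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_aggregate(token_scores, weight_logits):
    """Attention-style aggregation: softmax weights (assumed to come from a
    learned scorer, hypothetical here) decide how much each token counts."""
    weights = softmax(weight_logits)
    return sum(w * s for w, s in zip(weights, token_scores))
```

With uniform logits the weighted form reduces to the plain mean, so the mean-based zero-shot detectors can be seen as a special case in which no token is treated as more informative than another.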
nan
Article 344
Title@2025-07-14 (1): TextOmics-Guided Diffusion for Hit-like Molecular Generation
Title: TextOmics-Guided Diffusion for Hit-like Molecular Generation | TextOmics-geführte Diffusion für hit-like Molekulare Generation | TextOmics- 指导的极类似分子生成扩散 2507.09982v1 |
Authors (4): Hang Yuan, Chen Li, Wenjun Ma, Yuncheng Jiang
Hit-like molecular generation with therapeutic potential is essential for target-specific drug discovery. However, the field lacks heterogeneous data and unified frameworks for integrating diverse molecular representations. To bridge this gap, we introduce TextOmics, a pioneering benchmark that establishes one-to-one correspondences between omics expressions and molecular textual descriptions. TextOmics provides a heterogeneous dataset that facilitates molecular generation through representation alignment. Built upon this foundation, we propose ToDi, a generative framework that jointly conditions on omics expressions and molecular textual descriptions to produce biologically relevant, chemically valid, hit-like molecules. ToDi leverages two encoders (OmicsEn and TextEn) to capture multi-level biological and semantic associations, and develops conditional diffusion (DiffGen) for controllable generation. Extensive experiments confirm the effectiveness of TextOmics and demonstrate that ToDi outperforms existing state-of-the-art approaches, while also showcasing remarkable potential in zero-shot therapeutic molecular generation. Sources are available at: https://github.com/hala-ToDi.
nan
Article 345
Title@2025-07-14 (1): Tiny Reward Models
Title: Tiny Reward Models | Kleine Belohnung Modelle | 微量奖励模型 2507.09973v1 |
Authors (1): Sarah Pan
Large decoder-based language models have become the dominant architecture for reward modeling in reinforcement learning from human feedback (RLHF). However, as reward models are increasingly deployed in test-time strategies, their inference costs become a growing concern. We present TinyRM, a family of small, bidirectional masked language models (MLMs) with as few as 400 million parameters, that rival the capabilities of models over 175 times larger on reasoning and safety preference modeling tasks. TinyRM combines FLAN-style prompting, Directional Low-Rank Adaptation (DoRA), and layer freezing to achieve strong performance on RewardBench, despite using significantly fewer resources. Our experiments suggest that small models benefit from domain-specific tuning strategies, particularly in reasoning, where lightweight finetuning methods are especially effective. While challenges remain in building generalist models and conversational preference modeling, our preliminary results highlight the promise of lightweight bidirectional architectures as efficient, scalable alternatives for preference modeling.
nan
Article 346
Title@2025-07-14 (1): TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models
Title: TReB: A Comprehensive Benchmark for Evaluating Table Reasoning Capabilities of Large Language Models | TReB: Umfassender Benchmark für die Bewertung von Tabellen mit Gründen für Fähigkeiten großer Sprachmodelle | TreB:评价大语言模式表说明能力的综合基准 2506.18421v2 |
Authors (12): Ce Li, Xiaofan Liu, Zhiyan Song, Ce Chi, Chen Zhao, Jingjing Yang, Zhendong Wang, Kexin Yang, Boshen Shi, Xing Wang, Chao Deng, Junlan Feng
The majority of data in businesses and industries is stored in tables, databases, and data warehouses. Reasoning with table-structured data poses significant challenges for large language models (LLMs) due to its hidden semantics, inherent complexity, and structured nature. One of these challenges is the lack of an effective evaluation benchmark that fairly reflects the performance of LLMs on broad table reasoning abilities. In this paper, we fill this gap by presenting a comprehensive table reasoning evaluation benchmark, TReB, which measures both shallow table understanding abilities and deep table reasoning abilities across a total of 26 sub-tasks. We construct a high-quality dataset through an iterative data processing procedure. We create an evaluation framework to robustly measure table reasoning capabilities with three distinct inference modes: TCoT, PoT, and ICoT. Further, we benchmark over 20 state-of-the-art LLMs using this framework and demonstrate its effectiveness. Experimental results reveal that existing LLMs still have significant room for improvement in addressing complex, real-world table-related tasks. Both the dataset and evaluation framework are publicly available, with the dataset hosted on huggingface.co/datasets/JT-LM/JIUTIAN-TReB and the framework on github.com/JT-LM/jiutian-treb.
nan
Article 347
Title@2025-07-14 (1): PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes
Title: PRIME: Large Language Model Personalization with Cognitive Memory and Thought Processes | PRIME: Large Language Model Personalisierung mit kognitiven Gedächtnis- und Gedankenprozessen | PRIME:具有认知记忆和思维过程的大语言模式个性模型 2507.04607v2 |
Authors (3): Xinliang Frederick Zhang, Nick Beauchamp, Lu Wang
Large language model (LLM) personalization aims to align model outputs with individuals’ unique preferences and opinions. While recent efforts have implemented various personalization methods, a unified theoretical framework that can systematically understand the drivers of effective personalization is still lacking. In this work, we integrate the well-established cognitive dual-memory model into LLM personalization, by mirroring episodic memory to historical user engagements and semantic memory to long-term, evolving user beliefs. Specifically, we systematically investigate memory instantiations and introduce a unified framework, PRIME, using episodic and semantic memory mechanisms. We further augment PRIME with a novel personalized thinking capability inspired by the slow thinking strategy. Moreover, recognizing the absence of suitable benchmarks, we introduce a dataset using Change My View (CMV) from Reddit, specifically designed to evaluate long-context personalization. Extensive experiments validate PRIME’s effectiveness across both long- and short-context scenarios. Further analysis confirms that PRIME effectively captures dynamic personalization beyond mere popularity biases.
nan
Article 348
Title@2025-07-14 (1): DeepGesture: A conversational gesture synthesis system based on emotions and semantics
Title: DeepGesture: A conversational gesture synthesis system based on emotions and semantics | DeepGesture: Ein dialogisches Gesten-Synthesesystem basierend auf Emotionen und Semantik | DeepGesture:基于情感和语义的谈话手势合成系统 2507.03147v2 |
Authors (1): Thanh Hoang-Minh
Along with the explosion of large language models, improvements in speech synthesis, advancements in hardware, and the evolution of computer graphics, the current bottleneck in creating digital humans lies in generating character movements that correspond naturally to text or speech inputs. In this work, we present DeepGesture, a diffusion-based gesture synthesis framework for generating expressive co-speech gestures conditioned on multimodal signals - text, speech, emotion, and seed motion. Built upon the DiffuseStyleGesture model, DeepGesture introduces novel architectural enhancements that improve semantic alignment and emotional expressiveness in generated gestures. Specifically, we integrate fast text transcriptions as semantic conditioning and implement emotion-guided classifier-free diffusion to support controllable gesture generation across affective states. To visualize results, we implement a full rendering pipeline in Unity based on BVH output from the model. Evaluation on the ZeroEGGS dataset shows that DeepGesture produces gestures with improved human-likeness and contextual appropriateness. Our system supports interpolation between emotional states and demonstrates generalization to out-of-distribution speech, including synthetic voices - marking a step forward toward fully multimodal, emotionally aware digital humans. Project page: https://deepgesture.github.io
nan
Article 349
Title@2025-07-14 (1): EVALOOP: Assessing LLM Robustness in Programming from a Self-consistency Perspective
Title: EVALOOP: Assessing LLM Robustness in Programming from a Self-consistency Perspective | EVALOOP: Bewertung der Robustheit von LLM in der Programmierung aus einer Perspektive der Selbstkonsistenz | EVALOOP: 从自统一的角度评估方案拟订中的LLM强力 2505.12185v3 |
Authors (3): Sen Fang, Weiyuan Ding, Bowen Xu
Assessing the programming capabilities of Large Language Models (LLMs) is crucial for their effective use in software engineering. Current evaluations, however, predominantly measure the accuracy of generated code on static benchmarks, neglecting the critical aspect of model robustness during programming tasks. While adversarial attacks offer insights on model robustness, their effectiveness is limited and evaluation could be constrained. Current adversarial attack methods for robustness evaluation yield inconsistent results, struggling to provide a unified evaluation across different LLMs. We introduce EVALOOP, a novel assessment framework that evaluates robustness from a self-consistency perspective, i.e., by leveraging the natural duality inherent in popular software engineering tasks, e.g., code generation and code summarization. EVALOOP initiates a self-contained feedback loop: an LLM generates output (e.g., code) from an input (e.g., natural language specification), and then uses the generated output as the input to produce a new output (e.g., summarizes that code into a new specification). EVALOOP repeats the process to assess the effectiveness of EVALOOP in each loop. This cyclical strategy intrinsically evaluates robustness without relying on any external attack setups, providing a unified metric to evaluate LLMs’ robustness in programming. We evaluate 16 prominent LLMs (e.g., GPT-4.1, O4-mini) on EVALOOP and find that EVALOOP typically induces a 5.01%-19.31% absolute drop in pass@1 performance within ten loops. Intriguingly, robustness does not always align with initial performance (i.e., one-time query); for instance, GPT-3.5-Turbo, despite superior initial code generation compared to DeepSeek-V2, demonstrated lower robustness over repeated evaluation loops.
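The feedback loop described above (specification to code, code back to specification, repeated) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the EVALOOP implementation: `generate_code`, `summarize_code`, and `passes_tests` are hypothetical callables that a reader would back with an LLM and a test harness, and the returned count of surviving loops is a stand-in for the paper's per-loop pass@1 measurement.

```python
def count_surviving_loops(spec, generate_code, summarize_code, passes_tests, max_loops=10):
    """Run the spec -> code -> spec cycle and count how many loops still
    produce code that passes the tests before the first failure.

    All three callables are hypothetical placeholders:
      generate_code(spec)  -> code string
      summarize_code(code) -> new natural-language spec
      passes_tests(code)   -> bool
    """
    survived = 0
    for _ in range(max_loops):
        code = generate_code(spec)
        if not passes_tests(code):
            break  # the model's own output no longer round-trips correctly
        survived += 1
        spec = summarize_code(code)  # feed the output back in as the next input
    return survived
```

A model that stays correct for all `max_loops` iterations is self-consistent under this toy measure; one that fails early has drifted, even if its first-attempt accuracy was high.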
nan
Article 350
Title@2025-07-14 (1): Qorgau: Evaluating LLM Safety in Kazakh-Russian Bilingual Contexts
Title: Qorgau: Evaluating LLM Safety in Kazakh-Russian Bilingual Contexts | Qorgau: Bewertung der LLM-Sicherheit in kasachisch-russischen zweisprachigen Kontexten | Qorgau:评价哈萨克-俄语双语背景的LLM安全性 2502.13640v2 |
Authors (14): Maiya Goloburda, Nurkhan Laiyk, Diana Turmakhan, Yuxia Wang, Mukhammed Togmanov, Jonibek Mansurov, Askhat Sametov, Nurdaulet Mukhituly, Minghan Wang, Daniil Orel, Zain Muhammad Mujahid, Fajri Koto, Timothy Baldwin, Preslav Nakov
Large language models (LLMs) are known to have the potential to generate harmful content, posing risks to users. While significant progress has been made in developing taxonomies for LLM risks and safety evaluation prompts, most studies have focused on monolingual contexts, primarily in English. However, language- and region-specific risks in bilingual contexts are often overlooked, and core findings can diverge from those in monolingual settings. In this paper, we introduce Qorgau, a novel dataset specifically designed for safety evaluation in Kazakh and Russian, reflecting the unique bilingual context in Kazakhstan, where both Kazakh (a low-resource language) and Russian (a high-resource language) are spoken. Experiments with both multilingual and language-specific LLMs reveal notable differences in safety performance, emphasizing the need for tailored, region-specific datasets to ensure the responsible and safe deployment of LLMs in countries like Kazakhstan. Warning: this paper contains example data that may be offensive, harmful, or biased.
nan
Article 351
Title@2025-07-14 (1): Enhancing Retrieval Augmented Generation with Hierarchical Text Segmentation Chunking
Title: Enhancing Retrieval Augmented Generation with Hierarchical Text Segmentation Chunking | Verbesserung der retrieval Augmented Generation mit Hierarchical Text Segmentation Chunking | 增强获取回源增加的一代, 带有高层次文字分割块块板 2507.09935v1 |
Authors (3): Hai Toan Nguyen, Tien Dat Nguyen, Viet Ha Nguyen
Retrieval-Augmented Generation (RAG) systems commonly use chunking strategies for retrieval, which enhance large language models (LLMs) by enabling them to access external knowledge, ensuring that the retrieved information is up-to-date and domain-specific. However, traditional methods often fail to create chunks that capture sufficient semantic meaning, as they do not account for the underlying textual structure. This paper proposes a novel framework that enhances RAG by integrating hierarchical text segmentation and clustering to generate more meaningful and semantically coherent chunks. During inference, the framework retrieves information by leveraging both segment-level and cluster-level vector representations, thereby increasing the likelihood of retrieving more precise and contextually relevant information. Evaluations on the NarrativeQA, QuALITY, and QASPER datasets indicate that the proposed method achieved improved results compared to traditional chunking techniques.
nan
Article 352
Title@2025-07-14 (1): ACEBench: Who Wins the Match Point in Tool Usage?
Title: ACEBench: Who Wins the Match Point in Tool Usage? | ACEBench: Wer gewinnt den Match Point in der Werkzeugnutzung? | CEBench:谁在工具使用中赢得了匹配点? 2501.12851v5 |
Authors (16): Chen Chen, Xinlong Hao, Weiwen Liu, Xu Huang, Xingshan Zeng, Shuai Yu, Dexun Li, Shuai Wang, Weinan Gan, Yuefeng Huang, Wulong Liu, Xinzhi Wang, Defu Lian, Baoqun Yin, Yasheng Wang, Wu Liu
Large Language Models (LLMs) have demonstrated significant potential in decision-making and reasoning, particularly when integrated with various tools to effectively solve complex problems. However, existing benchmarks for evaluating LLMs’ tool usage face several limitations: (1) limited evaluation scenarios, often lacking assessments in real multi-turn dialogue contexts; (2) narrow evaluation dimensions, with insufficient detailed assessments of how LLMs use tools; and (3) reliance on LLMs or real API executions for evaluation, which introduces significant overhead. To address these challenges, we introduce ACEBench, a comprehensive benchmark for assessing tool usage in LLMs. ACEBench categorizes data into three primary types based on evaluation methodology: Normal, Special, and Agent. “Normal” evaluates tool usage in basic scenarios; “Special” evaluates tool usage in situations with ambiguous or incomplete instructions; “Agent” evaluates tool usage through multi-agent interactions to simulate real-world, multi-turn dialogues. We conducted extensive experiments using ACEBench, analyzing various LLMs in-depth and providing a more granular examination of error causes across different data types.
nan
Article 353
Title@2025-07-14 (1): MixLoRA-DSI: Dynamically Expandable Mixture-of-LoRA Experts for Rehearsal-Free Generative Retrieval over Dynamic Corpora
Title: MixLoRA-DSI: Dynamically Expandable Mixture-of-LoRA Experts for Rehearsal-Free Generative Retrieval over Dynamic Corpora | MixLoRA-DSI: Dynamisch erweiterbare Mischungs-of-LoRA-Experten für ein probenfreies generatives Retrieval über Dynamic Corpora | Mix LoRA-DSI: 动态公司排练-无创录检索专家动态可扩展混合Mix-LORA 2507.09924v1 |
Authors (7): Tuan-Luc Huynh, Thuy-Trang Vu, Weiqing Wang, Trung Le, Dragan Gašević, Yuan-Fang Li, Thanh-Toan Do
Continually updating model-based indexes in generative retrieval with new documents remains challenging, as full retraining is computationally expensive and impractical under resource constraints. We propose MixLoRA-DSI, a novel framework that combines an expandable mixture of Low-Rank Adaptation experts with a layer-wise out-of-distribution (OOD)-driven expansion strategy. Instead of allocating new experts for each new corpus, our proposed expansion strategy enables sublinear parameter growth by selectively introducing new experts only when a significant number of OOD documents is detected. Experiments on NQ320k and MS MARCO Passage demonstrate that MixLoRA-DSI outperforms full-model update baselines, with minimal parameter overhead and substantially lower training costs.
Article 354
Title@2025-07-14 (1): PyVision: Agentic Vision with Dynamic Tooling
Title: PyVision: Agentic Vision with Dynamic Tooling | PyVision: Agentische Vision mit dynamischem Werkzeug | 视景:带有动态工具的 “ 动态展望 “ 。 2507.07998v2 |
Authors (7): Shitian Zhao, Haoquan Zhang, Shaoheng Lin, Ming Li, Qilong Wu, Kaipeng Zhang, Chen Wei
LLMs are increasingly deployed as agents, systems capable of planning, reasoning, and dynamically calling external tools. However, in visual reasoning, prior approaches largely remain limited by predefined workflows and static toolsets. In this report, we present PyVision, an interactive, multi-turn framework that enables MLLMs to autonomously generate, execute, and refine Python-based tools tailored to the task at hand, unlocking flexible and interpretable problem-solving. We develop a taxonomy of the tools created by PyVision and analyze their usage across a diverse set of benchmarks. Quantitatively, PyVision achieves consistent performance gains, boosting GPT-4.1 by +7.8% on V* and Claude-4.0-Sonnet by +31.1% on VLMsAreBlind-mini. These results point to a broader shift: dynamic tooling allows models not just to use tools, but to invent them, advancing toward more agentic visual reasoning.
Article 355
Title@2025-07-14 (1): Fourier Position Embedding: Enhancing Attention’s Periodic Extension for Length Generalization
Title: Fourier Position Embedding: Enhancing Attention’s Periodic Extension for Length Generalization | Fourier-Positions-Einbettung: Erhöht die regelmäßige Verlängerung der Aufmerksamkeit für Längenverallgemeinerung | 四级立场嵌入式:加强注意定期延长延长时限 2412.17739v4 |
Authors (10): Ermo Hua, Che Jiang, Xingtai Lv, Kaiyan Zhang, Youbang Sun, Yuchen Fan, Xuekai Zhu, Biqing Qi, Ning Ding, Bowen Zhou
Extending the context length of Language Models (LMs) by improving Rotary Position Embedding (RoPE) has become a trend. While prior works mainly address RoPE’s limitations within attention, this paper uncovers the adverse effects on length generalization from nearly all parts of LMs. Using Discrete Signal Processing theory, we show that RoPE enables periodic attention by implicitly achieving a Non-Uniform Discrete Fourier Transform. However, this periodicity is undermined by the spectrum damage caused by: 1) linear layers and activation functions; 2) insufficiently trained frequency components brought by time-domain truncation. Building on our observations, we propose Fourier Position Embedding (FoPE), which enhances attention’s frequency-domain properties to improve both its periodic extension and length generalization. FoPE constructs a Fourier Series and zeroes out the destructive frequency components, increasing model robustness against the spectrum damage. Experiments across various model scales and benchmarks show that, within varying context windows, FoPE maintains more stable performance than other baselines. Several analyses and ablations bring further support to our method and theoretical modeling.
Article 356
Title@2025-07-14 (1): Intuitive Fine-Tuning: Towards Simplifying Alignment into a Single Process
Title: Intuitive Fine-Tuning: Towards Simplifying Alignment into a Single Process | Intuitive Feinsteuerung: Auf dem Weg zur Vereinfachung der Ausrichtung zu einem einzigen Prozess | 直观的精细调整:努力将调整简化为单一进程 2405.11870v3 |
Authors (7): Ermo Hua, Biqing Qi, Kaiyan Zhang, Kai Tian, Xingtai Lv, Ning Ding, Bowen Zhou
Supervised Fine-Tuning (SFT) and Preference Optimization (PO) are key processes for aligning Language Models (LMs) with human preferences after pre-training. While SFT excels in efficiency and PO in effectiveness, they are often combined sequentially without integrating their optimization objectives. This approach ignores the opportunities to bridge their paradigm gap and take the strengths from both. In this paper, we interpret SFT and PO with two sub-processes – Preference Estimation and Transition Optimization – defined at the token level within the Markov Decision Process (MDP). This modeling shows that SFT is only a special case of PO with inferior estimation and optimization. PO estimates the model’s preference from its entire generation, while SFT only scores the model’s subsequent predicted tokens based on prior tokens from the ground-truth answer. These priors deviate from the model’s distribution, hindering preference estimation and transition optimization. Building on this view, we introduce Intuitive Fine-Tuning (IFT) to integrate SFT and PO into a single process. Through a temporal residual connection, IFT brings better estimation and optimization by capturing the LM’s intuitive sense of its entire answer, yet it relies solely on a single policy and the same volume of non-preference-labeled data as SFT. Our experiments show that IFT performs comparably to or even better than SFT and some typical PO methods across several tasks, particularly those that require generation, reasoning, and fact-following abilities. An explainable Frozen Lake game further validates the effectiveness of IFT in obtaining a competitive policy.
Article 357
Title@2025-07-14 (1): Scalable MatMul-free Language Modeling
Title: Scalable MatMul-free Language Modeling | Skalierbare MatMul-freie Sprachmodellierung | 可缩放 MatMul 无语言建模 2406.02528v6 |
Authors (10): Rui-Jie Zhu, Yu Zhang, Steven Abreu, Ethan Sifferman, Tyler Sheaves, Yiqiao Wang, Dustin Richmond, Sumit Bam Shrestha, Peng Zhou, Jason K. Eshraghian
Large Language Models (LLMs) have fundamentally altered how we approach scaling in machine learning. However, these models pose substantial computational and memory challenges, primarily due to the reliance on matrix multiplication (MatMul) within their attention and feed-forward (FFN) layers. We demonstrate that MatMul operations can be eliminated from LLMs while maintaining strong performance, even at billion-parameter scales. Our MatMul-free models, tested at up to 2.7B parameters, are comparable to state-of-the-art pre-trained Transformers, and the performance gap narrows as model size increases. Our approach yields significant memory savings: a GPU-efficient implementation reduces memory consumption by up to 61% during training and by over 10× during inference. When adapted for a multi-chip neuromorphic system, the model leverages asynchronous processing to achieve 4× higher throughput with 10× less energy than edge GPUs. These findings demonstrate a path toward dramatically simplified yet effective LLMs, advancing them toward brain-like efficiency and heralding a new generation of lightweight, high-performance language models. Our code implementation is available at https://github.com/ridgerchu/matmulfreellm.
Article 358
Title@2025-07-14 (1): ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models
Title: ViTCoT: Video-Text Interleaved Chain-of-Thought for Boosting Video Understanding in Large Language Models | ViTCoT: Video-Text Interleaved Chain-of-Thought zur Förderung des Videoverständnisses in großen Sprachmodellen | VittoT:为在大语言模型中促进视频理解而探索的视频-文字间断连锁研究 2507.09876v1 |
Authors (7): Yongheng Zhang, Xu Liu, Ruihan Tao, Qiguang Chen, Hao Fei, Wanxiang Che, Libo Qin
Video understanding plays a vital role in bridging low-level visual signals with high-level cognitive reasoning, and is fundamental to applications such as autonomous driving, embodied AI, and the broader pursuit of AGI. The rapid development of large language models (LLMs), particularly those utilizing Chain-of-Thought (CoT) technology, has significantly advanced video reasoning capabilities. However, current approaches primarily depend on textual information for reasoning, overlooking the visual modality in the actual video reasoning process. In contrast, humans naturally re-examine visual content while reasoning. Motivated by this, we introduce a novel video reasoning paradigm: Video-Text Interleaved CoT (ViTCoT), which facilitates more intuitive and cognitively aligned reasoning. To this end, we first construct the Video-Text Interleaved Benchmark (ViTIB), which is created using MLLMs for key-video selection and then manually verified. Furthermore, we extensively explore the potential of the ViTCoT paradigm in the video understanding field. Extensive experiments demonstrate that ViTCoT significantly enhances performance compared to the traditional text-only CoT paradigm and effectively activates more neuron values in MLLMs.
Article 359
Title@2025-07-14 (1): Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition
Title: Function Induction and Task Generalization: An Interpretability Study with Off-by-One Addition | Funktionsinduktion und Aufgabenverallgemeinerung: Eine Interpretationsstudie mit Off-by-One-Addition | 职能上岗和任务一般化:解释性研究 2507.09875v1 |
Authors (3): Qinyuan Ye, Robin Jia, Xiang Ren
Large language models demonstrate the intriguing ability to perform unseen tasks via in-context learning. However, it remains unclear what mechanisms inside the model drive such task-level generalization. In this work, we approach this question through the lens of off-by-one addition (i.e., 1+1=3, 2+2=5, 3+3=?), a two-step, counterfactual task with an unexpected +1 function as a second step. Leveraging circuit-style interpretability techniques such as path patching, we analyze the models’ internal computations behind their notable performance and present three key findings. First, we uncover a function induction mechanism that explains the model’s generalization from standard addition to off-by-one addition. This mechanism resembles the structure of the induction head mechanism found in prior work and elevates it to a higher level of abstraction. Second, we show that the induction of the +1 function is governed by multiple attention heads in parallel, each of which emits a distinct piece of the +1 function. Finally, we find that this function induction mechanism is reused in a broader range of tasks, including synthetic tasks such as shifted multiple-choice QA and algorithmic tasks such as base-8 addition. Overall, our findings offer deeper insights into how reusable and composable structures within language models enable task-level generalization.
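The off-by-one addition task above can be made concrete with a small prompt builder. This is only a sketch of the task format suggested by the abstract's example (1+1=3, 2+2=5, 3+3=?); the function names and the newline-separated layout are assumptions, not the paper's exact prompt.

```python
def off_by_one_prompt(demos, query):
    """Build an in-context prompt where every demonstration answer is a + b + 1.

    demos: list of (a, b) pairs shown with their off-by-one answers.
    query: the (a, b) pair left for the model to complete.
    """
    lines = [f"{a}+{b}={a + b + 1}" for a, b in demos]
    qa, qb = query
    lines.append(f"{qa}+{qb}=")
    return "\n".join(lines)

def expected_answer(query):
    """Two-step target: standard addition, then the unexpected +1 function."""
    a, b = query
    return a + b + 1
```

For example, `off_by_one_prompt([(1, 1), (2, 2)], (3, 3))` ends with `3+3=`, and a model that has induced the +1 function should complete it with 7 rather than 6.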
Article 360
Title@2025-07-14 (1): CV-Probes: Studying the interplay of lexical and world knowledge in visually grounded verb understanding
Title: CV-Probes: Studying the interplay of lexical and world knowledge in visually grounded verb understanding | CV-Probes: Studieren des Zusammenspiels von lexikalischem und weltlichem Wissen im visuell fundierten Verbverständnis | CV-CV-结果:以视觉动词理解研究词汇学和世界知识的相互作用 2409.01389v2 |
Authors (3): Ivana Beňová, Michal Gregor, Albert Gatt
How do vision-language (VL) transformer models ground verb phrases, and do they integrate contextual and world knowledge in this process? We introduce the CV-Probes dataset, containing image-caption pairs involving verb phrases that require both social knowledge and visual context to interpret (e.g., “beg”), as well as pairs involving verb phrases that can be grounded based on information directly available in the image (e.g., “sit”). We show that VL models struggle to ground verb phrases (VPs) that are strongly context-dependent. Further analysis using explainable AI techniques shows that such models may not pay sufficient attention to the verb token in the captions. Our results suggest a need for improved methodologies in VL model training and evaluation. The code and dataset will be available at https://github.com/ivana-13/CV-Probes.
Article 361
Title@2025-07-14 (1): InstCache: A Predictive Cache for LLM Serving
Title: InstCache: A Predictive Cache for LLM Serving | InstCache: Ein vorausschauender Cache für LLM Serving | Instcache:LLM服务预测缓存 2411.13820v2 |
Authors (6): Longwei Zou, Yan Liu, Jiamu Kang, Tingfeng Liu, Jiangang Kong, Yangdong Deng
The revolutionary capabilities of Large Language Models (LLMs) are attracting rapidly growing popularity and leading to soaring user requests to inference serving systems. Caching techniques, which leverage data reuse to reduce computation, offer opportunities to optimize the performance of LLM inference engines. On the one hand, the low-level key-value (KV) cache working at the token level is widely adopted, although it incurs significant overhead as request volume grows. On the other hand, instruction-level caching, which stores full instruction-response pairs, is expected to play an increasingly crucial role. However, the high variability in the content and length of instructions makes it rare for identical instructions to recur within a short time window, presenting challenges for effectively caching instruction-response pairs. To address this challenge, we propose InstCache, a predictive caching mechanism for LLM serving systems. Leveraging the capability of LLMs, we can effectively reorder the representation space of instruction texts and develop a sufficient level of spatial locality. Such spatial locality enables us to predict potential instructions located in a compact region in the space, resulting in an effective caching system at runtime. Experimental results demonstrate that InstCache achieves a 2.3x higher hit rate compared to the upper bound of traditional caching mechanisms on the WildChat dataset and reduces the time per output token of vLLM by up to 42.0% and 50.0% on the LMSys and Moss datasets, respectively.
Article 362
Title@2025-07-14 (1): BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning
Title: BIS Reasoning 1.0: The First Large-Scale Japanese Benchmark for Belief-Inconsistent Syllogistic Reasoning | BIS Reasoning 1.0: Der erste großformatige japanische Benchmark für glaubens-inkonsistente syllogistische Reasoning | BIS Reasoning 1.0:首个大规模日语信念不一致三段论推理基准 2506.06955v4 |
Authors (8): Ha-Thanh Nguyen, Chaoran Liu, Qianying Liu, Hideyuki Tachibana, Su Myat Noe, Yusuke Miyao, Koichi Takeda, Sadao Kurohashi
We present BIS Reasoning 1.0, the first large-scale Japanese dataset of syllogistic reasoning problems explicitly designed to evaluate belief-inconsistent reasoning in large language models (LLMs). Unlike prior datasets such as NeuBAROCO and JFLD, which focus on general or belief-aligned reasoning, BIS Reasoning 1.0 introduces logically valid yet belief-inconsistent syllogisms to uncover reasoning biases in LLMs trained on human-aligned corpora. We benchmark state-of-the-art models - including GPT models, Claude models, and leading Japanese LLMs - revealing significant variance in performance, with GPT-4o achieving 79.54% accuracy. Our analysis identifies critical weaknesses in current LLMs when handling logically valid but belief-conflicting inputs. These findings have important implications for deploying LLMs in high-stakes domains such as law, healthcare, and scientific literature, where truth must override intuitive belief to ensure integrity and safety.
Article 363
Title@2025-07-14 (1): REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models
Title: REINFORCE++: An Efficient RLHF Algorithm with Robustness to Both Prompt and Reward Models | REINFORCE++: Effizienter RLHF-Algorithmus mit Robustheit sowohl für Prompt- als auch für Reward-Modelle | REINFORCE++: 高效的RLHF对快速模型和奖励模型具有强力的测算法 2501.03262v6 |
Authors (3): Jian Hu, Jason Klein Liu, Wei Shen
Reinforcement Learning from Human Feedback (RLHF) plays a crucial role in aligning large language models (LLMs) with human values and preferences. While state-of-the-art applications like ChatGPT/GPT-4 commonly employ Proximal Policy Optimization (PPO), the inclusion of a critic network introduces significant computational overhead. REINFORCE-based methods, such as REINFORCE Leave-One-Out (RLOO), ReMax, and Group Relative Policy Optimization (GRPO), address this limitation by eliminating the critic network. However, these approaches face challenges in accurate advantage estimation. Specifically, they estimate advantages independently for the responses to each prompt, which can lead to overfitting on simpler prompts and vulnerability to reward hacking. To address these challenges, we introduce REINFORCE++, a novel approach that removes the critic model while using the normalized reward of a batch as the baseline. Our empirical evaluation demonstrates that REINFORCE++ exhibits robust performance across various reward models without requiring prompt set truncation. Furthermore, it achieves superior generalization in both RLHF and long chain-of-thought (CoT) settings compared to existing REINFORCE-based methods. The implementation is available at https://github.com/OpenRLHF/OpenRLHF.
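A critic-free baseline built from batch-level reward statistics might look like the sketch below. It assumes one scalar reward per response and normalizes across the whole batch instead of per prompt; the function name and the exact normalization (mean and population standard deviation) are illustrative assumptions, not the OpenRLHF implementation.

```python
from statistics import mean, pstdev

def batch_normalized_advantages(rewards, eps=1e-8):
    """Advantage estimate without a critic: subtract the batch mean reward
    and divide by the batch standard deviation (assumed form).

    rewards: one scalar reward per sampled response, pooled over all
    prompts in the batch rather than grouped by prompt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the statistics are shared across prompts, a prompt whose responses all score high still receives positive advantages relative to the batch, whereas a per-prompt baseline would center them at zero.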
Article 364
Title@2025-07-14 (1): A General Framework for Inference-time Scaling and Steering of Diffusion Models
Title: A General Framework for Inference-time Scaling and Steering of Diffusion Models | Ein allgemeiner Rahmen für Schlussfolgerungs-Zeit-Skalierung und Steuerung von Diffusionsmodellen | 传播模型的推推时间缩放和引导总框架 2501.06848v4 |
Authors (7): Raghav Singhal, Zachary Horvitz, Ryan Teehan, Mengye Ren, Zhou Yu, Kathleen McKeown, Rajesh Ranganath
Diffusion models produce impressive results in modalities ranging from images and video to protein design and text. However, generating samples with user-specified properties remains a challenge. Recent research proposes fine-tuning models to maximize rewards that capture desired properties, but these methods require expensive training and are prone to mode collapse. In this work, we present Feynman-Kac (FK) steering, an inference-time framework for steering diffusion models with reward functions. FK steering works by sampling a system of multiple interacting diffusion processes, called particles, and resampling particles at intermediate steps based on scores computed using functions called potentials. Potentials are defined using rewards for intermediate states and are selected such that a high value indicates that the particle will yield a high-reward sample. We explore various choices of potentials, intermediate rewards, and samplers. We evaluate FK steering on text-to-image and text diffusion models. For steering text-to-image models with a human preference reward, we find that FK steering of a 0.8B-parameter model outperforms a 2.6B-parameter fine-tuned model on prompt fidelity, with faster sampling and no training. For steering text diffusion models with rewards for text quality and specific text attributes, we find that FK steering generates lower-perplexity, more linguistically acceptable outputs and enables gradient-free control of attributes like toxicity. Our results demonstrate that inference-time scaling and steering of diffusion models - even with off-the-shelf rewards - can provide significant sample quality gains and controllability benefits. Code is available at https://github.com/zacharyhorvitz/Fk-Diffusion-Steering.
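The particle-resampling loop described above can be sketched on a toy process. Everything here is an illustrative assumption: `step_fn` stands in for one denoising step, `reward_fn` for an intermediate reward, and the potential `exp(lam * reward)` is just one simple choice among the several the abstract says the authors explore.

```python
import math
import random

def resample(particles, potentials, rng):
    """Draw a new population with probability proportional to each potential."""
    total = sum(potentials)
    weights = [p / total for p in potentials]
    return rng.choices(particles, weights=weights, k=len(particles))

def fk_steer(init_particles, step_fn, reward_fn, num_steps, resample_every, lam=1.0, seed=0):
    """Run several interacting processes and resample them at intermediate steps.

    step_fn(x, rng) -> next state (placeholder for one diffusion step).
    reward_fn(x)    -> scalar reward for an intermediate state.
    """
    rng = random.Random(seed)
    particles = list(init_particles)
    for t in range(1, num_steps + 1):
        particles = [step_fn(x, rng) for x in particles]
        if t % resample_every == 0:
            # Assumed potential: high reward -> high chance of being kept.
            potentials = [math.exp(lam * reward_fn(x)) for x in particles]
            particles = resample(particles, potentials, rng)
    # Return the highest-reward particle from the final population.
    return max(particles, key=reward_fn)
```

No gradients of the reward are needed: steering happens only through which particles survive resampling, which is why the approach works with off-the-shelf, non-differentiable rewards.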
Article 365
Title@2025-07-14 (1): Beyond Scale: Small Language Models are Comparable to GPT-4 in Mental Health Understanding
Title: Beyond Scale: Small Language Models are Comparable to GPT-4 in Mental Health Understanding | Beyond Scale: Kleine Sprachmodelle sind vergleichbar mit GPT-4 im Mental Health Understanding | 超越范围:在心理健康理解方面,小语言模式可与GPT-4类比。 2507.08031v2 |
Authors (5): Hong Jia, Shiya Fu, Feng Xia, Vassilis Kostakos, Ting Dang
The emergence of Small Language Models (SLMs) as privacy-preserving alternatives for sensitive applications raises a fundamental question about their inherent understanding capabilities compared to Large Language Models (LLMs). This paper investigates the mental health understanding capabilities of current SLMs through systematic evaluation across diverse classification tasks. Employing zero-shot and few-shot learning paradigms, we benchmark their performance against established LLM baselines to elucidate their relative strengths and limitations in this critical domain. We assess five state-of-the-art SLMs (Phi-3, Phi-3.5, Qwen2.5, Llama-3.2, Gemma2) against three LLMs (GPT-4, FLAN-T5-XXL, Alpaca-7B) on six mental health understanding tasks. Our findings reveal that SLMs achieve mean performance within 2% of LLMs on binary classification tasks (F1 scores of 0.64 vs 0.66 in zero-shot settings), demonstrating notable competence despite orders of magnitude fewer parameters. Both model categories experience similar degradation on multi-class severity tasks (a drop of over 30%), suggesting that nuanced clinical understanding challenges transcend model scale. Few-shot prompting provides substantial improvements for SLMs (up to 14.6%), while LLM gains are more variable. Our work highlights the potential of SLMs in mental health understanding, showing they can be effective privacy-preserving tools for analyzing sensitive online text data. In particular, their ability to quickly adapt and specialize with minimal data through few-shot learning positions them as promising candidates for scalable mental health screening tools.
Article 366
Title@2025-07-13 (7): Beyond Multiple Choice: Evaluating Steering Vectors for Adaptive Free-Form Summarization
Title: Beyond Multiple Choice: Evaluating Steering Vectors for Adaptive Free-Form Summarization | Beyond Multiple Choice: Bewertung von Steuerungsvektoren für adaptive Freiform-Zusammenfassung | 超越多重选择:评估适应性自由形式总结指导矢量 2505.24859v2 |
Authors (3): Joschka Braun, Carsten Eickhoff, Seyed Ali Bahrainian
Steering vectors are a lightweight method for controlling text properties by adding a learned bias to language model activations at inference time. So far, steering vectors have predominantly been evaluated in multiple-choice settings, while their effectiveness in free-form generation tasks remains understudied. Moving “Beyond Multiple Choice,” we thoroughly evaluate the effectiveness of steering vectors in adaptively controlling topical focus, sentiment, toxicity, and readability in abstractive summaries of the NEWTS dataset. We find that steering effectively controls the targeted summary properties, but high steering strengths consistently degrade both intrinsic and extrinsic text quality. Compared to steering, prompting offers weaker control, while preserving text quality. Combining steering and prompting yields the strongest control over text properties and offers the most favorable efficacy-quality trade-off at moderate steering strengths. Our results underscore the practical trade-off between control strength and text quality preservation when applying steering vectors to free-form generation tasks.
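The basic operation behind a steering vector, adding a scaled bias to activations at inference time, can be sketched with plain lists. The function names are hypothetical, and estimating the vector as a mean difference between two sets of activations is a common recipe assumed here for illustration; the abstract only says the bias is learned.

```python
def mean_difference_vector(positive_acts, negative_acts):
    """Assumed estimator: mean activation on examples with the target
    property minus mean activation on examples without it."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    return [p - n for p, n in zip(mean(positive_acts), mean(negative_acts))]

def apply_steering(hidden_states, steering_vector, strength=1.0):
    """Add strength * steering_vector to every token's activation.

    hidden_states: list of per-token activation vectors at one layer.
    strength: scalar controlling how hard the property is pushed.
    """
    return [
        [h + strength * s for h, s in zip(token, steering_vector)]
        for token in hidden_states
    ]
```

The `strength` parameter corresponds to the trade-off the abstract reports: larger values control the target property more strongly but move activations further from what the model saw in training, which is consistent with degraded text quality at high strengths.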
Article 367
Title@2025-07-13 (7): VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information
Title: VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information | VisOnlyQA: Große Visions-Sprachmodelle kämpfen noch mit der visuellen Wahrnehmung geometrischer Informationen | Vis onlyQA:仍与几何信息视觉认知相抗争的大型视觉语言模型 2412.00947v3 |
Authors (5): Ryo Kamoi, Yusen Zhang, Sarkar Snigdha Sarathi Das, Ranran Haoran Zhang, Rui Zhang
Large Vision Language Models (LVLMs) have achieved remarkable performance in various vision-language tasks. However, it is still unclear how accurately LVLMs can perceive visual information in images. In particular, the capability of LVLMs to perceive geometric information, such as shape, angle, and size, remains insufficiently analyzed, although the perception of these properties is crucial for tasks that require a detailed visual understanding. In this work, we introduce VisOnlyQA, a dataset for evaluating the geometric perception of LVLMs, and reveal that LVLMs often cannot accurately perceive basic geometric information in images, while human performance is nearly perfect. VisOnlyQA consists of 12 tasks that directly ask about geometric information in geometric shapes, charts, chemical structures, and 3D shapes. Our experiments highlight the following findings: (i) State-of-the-art LVLMs struggle with basic geometric perception. The 23 LVLMs we evaluate, including GPT-4o and Gemini 2.5 Pro, perform poorly on VisOnlyQA. (ii) Additional training data does not resolve this issue. Fine-tuning on the training set of VisOnlyQA is not always effective, even for in-distribution tasks. (iii) The LLM may be the bottleneck. LVLMs using stronger LLMs exhibit better geometric perception on VisOnlyQA, even though the dataset does not require complex reasoning, suggesting that the way LVLMs process information from visual encoders is a bottleneck. The datasets, code, and model responses are provided at https://github.com/psunlpgroup/VisOnlyQA.
Article 368
Title@2025-07-13 (7): SymbolicThought: Integrating Language Models and Symbolic Reasoning for Consistent and Interpretable Human Relationship Understanding
Title: SymbolicThought: Integrating Language Models and Symbolic Reasoning for Consistent and Interpretable Human Relationship Understanding | SymbolicThought: Integration von Sprachmodellen und symbolischer Begründung für ein konsequentes und interpretierbares menschliches Beziehungsverständnis | 象征性探索:整合语文模式和符号理由,促进一致和可解释的人类关系理解 2507.04189v2 |
Authors (6): Runcong Zhao, Qinglin Zhu, Hainiu Xu, Bin Liang, Lin Gui, Yulan He
Understanding character relationships is essential for interpreting complex narratives and conducting socially grounded AI research. However, manual annotation is time-consuming and low in coverage, while large language models (LLMs) often produce hallucinated or logically inconsistent outputs. We present SymbolicThought, a human-in-the-loop framework that combines LLM-based extraction with symbolic reasoning. The system constructs editable character relationship graphs, refines them using seven types of logical constraints, and enables real-time validation and conflict resolution through an interactive interface. To support logical supervision and explainable social analysis, we release a dataset of 160 interpersonal relationships with corresponding logical structures. Experiments show that SymbolicThought improves annotation accuracy and consistency while significantly reducing time cost, offering a practical tool for narrative understanding, explainable AI, and LLM evaluation.
Article 369
Title@2025-07-13 (7): LASER: Attention with Exponential Transformation
Title: LASER: Attention with Exponential Transformation | LASER: Aufmerksamkeit bei exponentieller Transformation | LASER: 关注感官转变 2411.03493v2 |
Authors (2): Sai Surya Duvvuri, Inderjit S. Dhillon
Transformers have had tremendous impact on several sequence-related tasks, largely due to their ability to retrieve from any part of the sequence via softmax-based dot-product attention. This mechanism plays a crucial role in the Transformer’s performance. We analyze the gradients backpropagated through the softmax operation in the attention mechanism and observe that these gradients can often be small. This poor gradient signal backpropagation can lead to inefficient learning of the parameters preceding the attention operations. To this end, we introduce a new attention mechanism called LASER, which we analytically show to admit a larger gradient signal. We show that LASER attention can be implemented by making small modifications to existing attention implementations. We conduct experiments on autoregressive large language models (LLMs) with up to 7.7 billion parameters, observing an average improvement of up to 1.44% over standard attention on downstream evaluations and 1.65% finetuning improvements. Additionally, LASER demonstrates generalization performance improvement across a variety of tasks (vision, text, and speech): Vision Transformer (ViT) on ImageNet, Conformer on LibriSpeech speech-to-text, and BERT with 2.2 billion parameters.
Article 370
Title@2025-07-13 (7): TinyTroupe: An LLM-powered Multiagent Persona Simulation Toolkit
Title: TinyTroupe: An LLM-powered Multiagent Persona Simulation Toolkit | TinyTroupe: Ein LLM-powered Multiagent Persona Simulation Toolkit | TiniyTrouppe:一个由LLM驱动的多剂人模拟工具包 2507.09788v1 |
Authors (6): Paulo Salem, Robert Sim, Christopher Olsen, Prerit Saxena, Rafael Barcelos, Yi Ding
Recent advances in Large Language Models (LLMs) have led to a new class of autonomous agents, renewing and expanding interest in the area. LLM-powered Multiagent Systems (MAS) have thus emerged, both for assistive and simulation purposes, yet tools for realistic human behavior simulation – with its distinctive challenges and opportunities – remain underdeveloped. Existing MAS libraries and tools lack fine-grained persona specifications, population sampling facilities, experimentation support, and integrated validation, among other key capabilities, limiting their utility for behavioral studies, social simulation, and related applications. To address these deficiencies, in this work we introduce TinyTroupe, a simulation toolkit enabling detailed persona definitions (e.g., nationality, age, occupation, personality, beliefs, behaviors) and programmatic control via numerous LLM-driven mechanisms. This allows for the concise formulation of behavioral problems of practical interest, at either the individual or the group level, and provides effective means for their solution. TinyTroupe’s components are presented using representative working examples, such as brainstorming and market research sessions, thereby simultaneously clarifying their purpose and demonstrating their usefulness. Quantitative and qualitative evaluations of selected aspects are also provided, highlighting possibilities, limitations, and trade-offs. The approach, though realized as a specific Python implementation, is meant as a novel conceptual contribution, which can be partially or fully incorporated in other contexts. The library is available as open source at https://github.com/microsoft/tinytroupe.
Article 371
Title@2025-07-13 (7): Te Ahorré Un Click: A Revised Definition of Clickbait and Detection in Spanish News
Title: Te Ahorré Un Click: A Revised Definition of Clickbait and Detection in Spanish News | Te Ahorré Un Click: Eine überarbeitete Definition von Clickbait und Detection in spanischen Nachrichten | Te Ahorré Unclick:西班牙新闻中的点击和探测的订正定义 2507.09777v1 |
Authors (3): Gabriel Mordecki, Guillermo Moncecchi, Javier Couto
We revise the definition of clickbait, which lacks current consensus, and argue that the creation of a curiosity gap is the key concept that distinguishes clickbait from other related phenomena such as sensationalism and headlines that do not deliver what they promise or diverge from the article. Therefore, we propose a new definition: clickbait is a technique for generating headlines and teasers that deliberately omit part of the information with the goal of raising the readers’ curiosity, capturing their attention and enticing them to click. We introduce a new approach to creating clickbait detection datasets, refining the concept boundaries and annotation criteria to minimize the subjectivity in the decision as much as possible. Following it, we create and release TA1C (for Te Ahorré Un Click, Spanish for Saved You A Click), the first open source dataset for clickbait detection in Spanish. It consists of 3,500 tweets from 18 well-known media sources, manually annotated and reaching a Fleiss’ κ inter-annotator agreement of 0.825. We implement strong baselines that achieve an F1-score of 0.84.
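For readers unfamiliar with the agreement statistic reported above, Fleiss' kappa can be computed from a table of per-item category counts as in the sketch below. The formula is the standard one; the function name and input layout are choices made here for illustration, and the sketch assumes every item was rated by the same number of annotators.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for categorical ratings.

    counts[i][j] = number of annotators who assigned item i to category j.
    Every row must sum to the same number of annotators n (n >= 2).
    """
    num_items = len(counts)
    n = sum(counts[0])
    num_categories = len(counts[0])
    # Overall proportion of assignments that went to each category.
    p_j = [sum(row[j] for row in counts) / (num_items * n) for j in range(num_categories)]
    # Per-item observed agreement among annotator pairs.
    p_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    p_bar = sum(p_i) / num_items   # mean observed agreement
    p_e = sum(p * p for p in p_j)  # agreement expected by chance
    return (p_bar - p_e) / (1 - p_e)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so a score above 0.8 indicates that annotators applied the definition very consistently.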
Article 372
Title@2025-07-13 (7): DataDecide: How to Predict Best Pretraining Data with Small Experiments
Title: DataDecide: How to Predict Best Pretraining Data with Small Experiments | DataDecide: Wie man die besten Vorschulungsdaten mit kleinen Experimenten vorhersagt | 数据减少:如何利用小型实验预测最佳培训前数据 2504.11393v2 |
Authors (13): Ian Magnusson, Nguyen Tai, Ben Bogin, David Heineman, Jena D. Hwang, Luca Soldaini, Akshita Bhagia, Jiacheng Liu, Dirk Groeneveld, Oyvind Tafjord, Noah A. Smith, Pang Wei Koh, Jesse Dodge
Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at small scale most accurately predict the datasets that yield the best large models? To empower open exploration of this question, we release models, data, and evaluations in DataDecide – the most extensive open suite of models over differences in data and scale. We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds. We find that the ranking of models at a single, small size (e.g., 150M parameters) is a strong baseline for predicting best models at our larger target scale (1B) (~80% of comparisons correct). No scaling law methods among 8 baselines exceed the compute-decision frontier of single-scale predictions, but DataDecide can measure improvement in future scaling laws. We also identify that using continuous likelihood metrics as proxies in small experiments makes benchmarks including MMLU, ARC, HellaSwag, MBPP, and HumanEval >80% predictable at the target 1B scale with just 0.01% of the compute.
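One plausible reading of "comparisons correct" is pairwise agreement: for every pair of pretraining corpora, does the small-scale experiment pick the same winner as the target-scale run? The sketch below computes that fraction. The function name, corpus labels, and scores are hypothetical and chosen only for illustration; the paper's exact metric may differ.

```python
from itertools import combinations
from typing import Dict


def decision_accuracy(small: Dict[str, float], large: Dict[str, float]) -> float:
    """Fraction of corpus pairs whose ordering at small scale matches the
    ordering at large scale. Both dicts map corpus name -> benchmark score
    (higher is better) and must share the same keys."""
    if small.keys() != large.keys():
        raise ValueError("both scales must cover the same corpora")
    pairs = list(combinations(sorted(small), 2))
    agree = sum(
        (small[a] > small[b]) == (large[a] > large[b]) for a, b in pairs
    )
    return agree / len(pairs)


# Hypothetical scores for three corpora at two model sizes.
small_scores = {"A": 0.30, "B": 0.35, "C": 0.32}
large_scores = {"A": 0.50, "B": 0.60, "C": 0.48}
print(decision_accuracy(small_scores, large_scores))  # 2 of 3 pairs agree
```

Here the small-scale run ranks C above A while the large-scale run reverses that pair, so two of the three comparisons are predicted correctly.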
Article 373
Title@2025-07-13 (7): Cascade Speculative Drafting for Even Faster LLM Inference
Title: Cascade Speculative Drafting for Even Faster LLM Inference | Cascade Spekulative Drafting für noch schnellere LLM-Inferenz | 连速度更快LLM推论的连带连带性投机起草 2312.11462v5 |
Authors (6): Ziyi Chen, Xiaocong Yang, Jiacheng Lin, Chenkai Sun, Kevin Chen-Chuan Chang, Jie Huang
Introduced to enhance the efficiency of large language model (LLM) inference, speculative decoding operates by having a smaller model generate a draft. A larger target model then reviews this draft to align it with its own output, and every draft token the target model accepts reduces the number of target model runs, ultimately improving efficiency. However, the drafting process in speculative decoding includes slow autoregressive generation and allocates equal time to generating tokens, irrespective of their importance. These inefficiencies collectively contribute to the suboptimal performance of speculative decoding. To further improve LLM inference, we introduce Cascade Speculative Drafting (CS Drafting), a speculative execution algorithm that incorporates two types of cascades. The Vertical Cascade eliminates autoregressive generation from neural models, while the Horizontal Cascade optimizes time allocation in drafting for improved efficiency. Combining both cascades, CS Drafting achieves greater speedup compared to the baselines in our experiments, while preserving the same output distribution as the target model.
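The claim that the target model's output distribution is preserved rests on the accept/reject rule used in standard speculative sampling. The sketch below shows that rule for a single token over a toy vocabulary. It is not CS Drafting itself (the Vertical and Horizontal Cascades are not modelled), and the function name and probabilities are assumptions made for this example.

```python
import random
from typing import Sequence


def speculative_sample(p: Sequence[float], q: Sequence[float],
                       rng: random.Random) -> int:
    """Return one token index distributed according to the target
    distribution p, using a draft proposal from the cheaper distribution q."""
    tokens = range(len(p))
    # The draft model proposes a token.
    draft = rng.choices(tokens, weights=q, k=1)[0]
    # The target model accepts it with probability min(1, p/q).
    if rng.random() < min(1.0, p[draft] / q[draft]):
        return draft
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(tokens, weights=residual, k=1)[0]


# Toy distributions over a two-token vocabulary (hypothetical values).
target = [0.7, 0.3]
draft_model = [0.4, 0.6]
rng = random.Random(0)
samples = [speculative_sample(target, draft_model, rng) for _ in range(20000)]
print(samples.count(0) / len(samples))  # close to the target probability 0.7
```

Even though the draft distribution disagrees with the target, the empirical frequencies follow the target, which is why a faster drafter can be swapped in without changing what is generated.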
Article 374
Title@2025-07-13 (7): KnowShiftQA: How Robust are RAG Systems when Textbook Knowledge Shifts in K-12 Education?
Title: KnowShiftQA: How Robust are RAG Systems when Textbook Knowledge Shifts in K-12 Education? | KnowShiftQA: Wie robust sind RAG-Systeme, wenn Textbook Knowledge Shifts in K-12 Education? | K-12教育中教科书知识转移时RAG系统如何强大? 2412.08985v3 |
Authors (5): Tianshi Zheng, Weihan Li, Jiaxin Bai, Weiqi Wang, Yangqiu Song
Retrieval-Augmented Generation (RAG) systems show remarkable potential as question answering tools in the K-12 Education domain, where knowledge is typically queried within the restricted scope of authoritative textbooks. However, discrepancies between these textbooks and the parametric knowledge inherent in Large Language Models (LLMs) can undermine the effectiveness of RAG systems. To systematically investigate RAG system robustness against such knowledge discrepancies, we introduce KnowShiftQA. This novel question answering dataset simulates these discrepancies by applying deliberate hypothetical knowledge updates to both answers and source documents, reflecting how textbook knowledge can shift. KnowShiftQA comprises 3,005 questions across five subjects, designed with a comprehensive question typology focusing on context utilization and knowledge integration. Our extensive experiments on retrieval and question answering performance reveal that most RAG systems suffer a substantial performance drop when faced with these knowledge discrepancies. Furthermore, questions requiring the integration of contextual (textbook) knowledge with parametric (LLM) knowledge pose a significant challenge to current LLMs.
Article 375
Title@2025-07-13 (7): EventHunter: Dynamic Clustering and Ranking of Security Events from Hacker Forum Discussions
Title: EventHunter: Dynamic Clustering and Ranking of Security Events from Hacker Forum Discussions | EventHunter: Dynamisches Clustering und Ranking von Sicherheitsereignissen aus Hacker Forum Diskussionen | 活动休特:从黑客论坛讨论中对安保活动进行动态分组和排序 2507.09762v1 |
Authors (4): Yasir Ech-Chammakhy, Anas Motii, Anass Rabii, Jaafar Chbili
Hacker forums provide critical early warning signals for emerging cybersecurity threats, but extracting actionable intelligence from their unstructured and noisy content remains a significant challenge. This paper presents an unsupervised framework that automatically detects, clusters, and prioritizes security events discussed across hacker forum posts. Our approach leverages Transformer-based embeddings fine-tuned with contrastive learning to group related discussions into distinct security event clusters, identifying incidents like zero-day disclosures or malware releases without relying on predefined keywords. The framework incorporates a daily ranking mechanism that prioritizes identified events using quantifiable metrics reflecting timeliness, source credibility, information completeness, and relevance. Experimental evaluation on real-world hacker forum data demonstrates that our method effectively reduces noise and surfaces high-priority threats, enabling security analysts to mount proactive responses. By transforming disparate hacker forum discussions into structured, actionable intelligence, our work addresses fundamental challenges in automated threat detection and analysis.
Article 376
Title@2025-07-13 (7): Your Pretrained Model Tells the Difficulty Itself: A Self-Adaptive Curriculum Learning Paradigm for Natural Language Understanding
Title: Your Pretrained Model Tells the Difficulty Itself: A Self-Adaptive Curriculum Learning Paradigm for Natural Language Understanding | Ihr prätrainiertes Modell erzählt die Schwierigkeit selbst: Ein selbstadaptives Curriculum Lernen Paradigma für das natürliche Sprachverständnis | 您训练有素的模型告诉困难本身:学习自然语言理解的自适应课程学习范式 2507.09758v1 |
Authors (3): Qi Feng, Yihong Liu, Hinrich Schütze
Curriculum learning is a widely adopted training strategy in natural language processing (NLP), where models are exposed to examples organized by increasing difficulty to enhance learning efficiency and performance. However, most existing approaches rely on manually defined difficulty metrics – such as text length – which may not accurately reflect the model’s own perspective. To overcome this limitation, we present a self-adaptive curriculum learning paradigm that prioritizes fine-tuning examples based on difficulty scores predicted by pre-trained language models (PLMs) themselves. Building on these scores, we explore various training strategies that differ in the ordering of examples for the fine-tuning: from easy-to-hard, hard-to-easy, to mixed sampling. We evaluate our method on four natural language understanding (NLU) datasets covering both binary and multi-class classification tasks. Experimental results show that our approach leads to faster convergence and improved performance compared to standard random sampling.
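The ordering strategies mentioned above can be sketched as a small helper that arranges fine-tuning examples by a precomputed difficulty score. This is an illustrative sketch only: the scores are hypothetical placeholders for what a pre-trained model would predict, and "mixed" is implemented here as alternating easiest and hardest examples, which is an assumption, since the abstract does not define the mixed sampling scheme.

```python
from typing import Dict, List


def order_examples(scores: Dict[str, float], strategy: str) -> List[str]:
    """Order example IDs by difficulty score (higher = harder)."""
    ranked = sorted(scores, key=scores.get)  # easiest first
    if strategy == "easy_to_hard":
        return ranked
    if strategy == "hard_to_easy":
        return ranked[::-1]
    if strategy == "mixed":
        # Assumed scheme: alternate between the easiest and hardest remaining.
        mixed, lo, hi = [], 0, len(ranked) - 1
        while lo <= hi:
            mixed.append(ranked[lo])
            if lo != hi:
                mixed.append(ranked[hi])
            lo, hi = lo + 1, hi - 1
        return mixed
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical difficulty scores for four training examples.
difficulty = {"a": 0.1, "b": 0.9, "c": 0.4, "d": 0.6}
for name in ("easy_to_hard", "hard_to_easy", "mixed"):
    print(name, order_examples(difficulty, name))
```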
Article 377
Title@2025-07-13 (7): Sound and Complete Neuro-symbolic Reasoning with LLM-Grounded Interpretations
Title: Sound and Complete Neuro-symbolic Reasoning with LLM-Grounded Interpretations | Sound und komplette neuro-symbolische Reasoning mit LLM-gerundeten Interpretationen | 使用LLM四轮解释的全音和完整神经 – – 精神 – – 曲解理由 2507.09751v1 |
Authors (5): Bradley P. Allen, Prateek Chhikara, Thomas Macaulay Ferguson, Filip Ilievski, Paul Groth
Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, but they exhibit problems with logical consistency in the output they generate. How can we harness LLMs’ broad-coverage parametric knowledge in formal reasoning despite their inconsistency? We present a method for directly integrating an LLM into the interpretation function of the formal semantics for a paraconsistent logic. We provide experimental evidence for the feasibility of the method by evaluating the function using datasets created from several short-form factuality benchmarks. Unlike prior work, our method offers a theoretical framework for neuro-symbolic reasoning that leverages an LLM’s knowledge while preserving the underlying logic’s soundness and completeness properties.
Article 378
Title@2025-07-13 (7): Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them
Title: Scalpel vs. Hammer: GRPO Amplifies Existing Capabilities, SFT Replaces Them | Scalpel vs. Hammer: GRPO verstärkt bestehende Fähigkeiten, SFT ersetzt sie | 缩略图与锤子:GROPO 放大现有能力,SFT 替换 2507.10616v1 |
Authors (4): Neel Rajani, Aryo Pradipta Gema, Seraphina Goldfarb-Tarrant, Ivan Titov
Training large language models (LLMs) for reasoning via maths and code datasets has become a major new focus in LLM post-training. Two particularly popular approaches are reinforcement learning (RL) and supervised fine-tuning (SFT), but their training dynamics are poorly understood. We present a comparative analysis of RL and SFT on the same maths problems with the same model and similar hyperparameters. We find that RL yields minor in-domain gains on maths and slight degradation on knowledge-intensive benchmarks like MMLU, while both trends are more pronounced in SFT. We also analyse model parameters across checkpoints, observing that both algorithms modify query and key weights the most. Meanwhile, SFT exhibits greater updates and also affects mid-layer MLPs more, leading us to hypothesise that this may have caused the out-of-domain degradation. We therefore investigate whether freezing parts of the model during training can mitigate the reduced performance on knowledge-intensive benchmarks. However, our results are inconclusive, with benefits on GPQA:Diamond and degradation on other benchmarks. Taken together, our observations provide a preliminary indication for why RL amplifies existing capabilities, while SFT replaces old skills with new ones.
Article 379
Title@2025-07-13 (7): From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations
Title: From Fragments to Facts: A Curriculum-Driven DPO Approach for Generating Hindi News Veracity Explanations | Von Fragmenten zu Fakten: Ein Curriculum-getriebener DPO-Ansatz zur Generierung von Hindi News Veracity Erklärungen | 《从零碎到事实:产生印地语新闻的多城市解释:课程驱动的DPO方法》 2507.05179v2 |
Authors (5): Pulkit Bansal, Raghvendra Kumar, Shakti Singh, Sriparna Saha, Adam Jatowt
In an era of rampant misinformation, generating reliable news explanations is vital, especially for under-represented languages like Hindi. Lacking robust automated tools, Hindi faces challenges in scaling misinformation detection. To bridge this gap, we propose a novel framework integrating Direct Preference Optimization (DPO) with curriculum learning to align machine-generated explanations with human reasoning. Fact-checked explanations from credible sources serve as preferred responses, while LLM outputs highlight system limitations and serve as non-preferred responses. To refine task-specific alignment, we introduce two key parameters – Actuality and Finesse – into the DPO loss function, enhancing explanation quality and consistency. Experiments with LLMs (Mistral, Llama, Gemma) and PLMs (mBART, mT5) confirm the framework’s effectiveness in generating coherent, contextually relevant explanations. This scalable approach combats misinformation and extends automated explanation generation to low-resource languages.
Article 380
Title@2025-07-13 (7): Reinforcing Question Answering Agents with Minimalist Policy Gradient Optimization
Title: Reinforcing Question Answering Agents with Minimalist Policy Gradient Optimization | Verstärkung der Frage beantworten Agenten mit minimalistischen Politik gradient Optimierung | 以最起码的政策级政策优化优化方式加强回答问题的代理机构 2505.17086v2 |
Authors (9): Yihong Wu, Liheng Ma, Muzhi Li, Jiaming Zhou, Jianye Hao, Ho-fung Leung, Irwin King, Yingxue Zhang, Jian-Yun Nie
Although Large Language Models (LLMs) have demonstrated remarkable versatility, their application to Question Answering (QA) tasks remains hindered by hallucination due to their lack of factual knowledge. While Retrieval-Augmented Generation mitigates these issues by integrating external knowledge, existing approaches rely heavily on in-context learning, whose performance is constrained by the fundamental reasoning capabilities of LLMs. In this paper, we propose Mujica, a Multi-hop Joint Intelligence for Complex Question Answering, comprising a planner that decomposes questions into a directed acyclic graph of subquestions and a worker that resolves questions via retrieval and reasoning. Additionally, we introduce MyGO (Minimalist policy Gradient Optimization), a novel reinforcement learning method that replaces traditional policy gradient updates with Maximum Likelihood Estimation (MLE) by sampling trajectories from an asymptotically optimal policy. MyGO eliminates the need for gradient rescaling and reference models, ensuring stable and efficient training. Empirical results across multiple datasets demonstrate the effectiveness of Mujica-MyGO in enhancing multi-hop QA performance for various LLMs, offering a scalable and resource-efficient solution for complex QA tasks.
Article 381
Title@2025-07-13 (7): Large Language Models Encode Semantics in Low-Dimensional Linear Subspaces
Title: Large Language Models Encode Semantics in Low-Dimensional Linear Subspaces | Große Sprachmodelle kodieren Semantik in Low-Dimensional Linear Subspaces | 低多维线性线性子空间中大语言模型编码语义学 2507.09709v1 |
Authors (6): Baturay Saglam, Paul Kassianik, Blaine Nelson, Sajana Weerawardhena, Yaron Singer, Amin Karbasi
Understanding the latent space geometry of large language models (LLMs) is key to interpreting their behavior and improving alignment. However, it remains unclear to what extent LLMs internally organize representations related to semantic understanding. To investigate this, we conduct a large-scale empirical study of hidden states in transformer-based LLMs, analyzing 11 decoder-only models across 6 scientific topics and 12 layers each. We find that high-level semantic information consistently lies in low-dimensional subspaces that form linearly separable representations across distinct domains. This separability becomes more pronounced in deeper layers and under prompts that trigger structured reasoning or alignment behaviors – even when surface content is unchanged. This geometry enables simple yet effective causal interventions in hidden space; for example, reasoning patterns like chain-of-thought can be captured by a single vector direction. Together, these findings support the development of geometry-aware tools that operate directly on latent representations to detect and mitigate harmful or adversarial content, using methods such as transport-based defenses that leverage this separability. As a proof of concept, we demonstrate this potential by training a simple MLP classifier as a lightweight latent-space guardrail, which detects adversarial and malicious prompts with high precision.
Article 382
Title@2025-07-13 (7): Perception-Aware Policy Optimization for Multimodal Reasoning
Title: Perception-Aware Policy Optimization for Multimodal Reasoning | Perception-Aware Policy Optimization für multimodale Reasoning | 对多式联运理由的观念-认知软件政策优化 2507.06448v2 |
Authors (11): Zhenhailong Wang, Xuehang Guo, Sofia Stoica, Haiyang Xu, Hongru Wang, Hyeonjeong Ha, Xiusi Chen, Yangyi Chen, Ming Yan, Fei Huang, Heng Ji
Reinforcement Learning with Verifiable Rewards (RLVR) has proven to be a highly effective strategy for endowing Large Language Models (LLMs) with robust multi-step reasoning abilities. However, its design and optimizations remain tailored to purely textual domains, resulting in suboptimal performance when applied to multimodal reasoning tasks. In particular, we observe that a major source of error in current multimodal reasoning lies in the perception of visual inputs. To address this bottleneck, we propose Perception-Aware Policy Optimization (PAPO), a simple yet effective extension of GRPO that encourages the model to learn to perceive while learning to reason, entirely from internal supervision signals. Notably, PAPO does not rely on additional data curation, external reward models, or proprietary models. Specifically, we introduce the Implicit Perception Loss in the form of a KL divergence term to the GRPO objective, which, despite its simplicity, yields significant overall improvements (4.4%) on diverse multimodal benchmarks. The improvements are more pronounced, approaching 8.0%, on tasks with high vision dependency. We also observe a substantial reduction (30.5%) in perception errors, indicating improved perceptual capabilities with PAPO. We conduct comprehensive analysis of PAPO and identify a unique loss hacking issue, which we rigorously analyze and mitigate through a Double Entropy Loss. Overall, our work introduces a deeper integration of perception-aware supervision into RLVR learning objectives and lays the groundwork for a new RL framework that encourages visually grounded reasoning. Project page: https://mikewangwzhl.github.io/PAPO.
Article 383
Title@2025-07-13 (7): MCEval: A Dynamic Framework for Fair Multilingual Cultural Evaluation of LLMs
Title: MCEval: A Dynamic Framework for Fair Multilingual Cultural Evaluation of LLMs | MCEval: Ein dynamischer Rahmen für eine faire multilinguale kulturelle Bewertung von LLMs | MCEval:对LLMs进行公平、多语种文化评价的有力框架 2507.09701v1 |
Authors (3): Shulin Huang, Linyi Yang, Yue Zhang
Large language models exhibit cultural biases and limited cross-cultural understanding capabilities, particularly when serving diverse global user populations. We propose MCEval, a novel multilingual evaluation framework that employs dynamic cultural question construction and enables causal analysis through Counterfactual Rephrasing and Confounder Rephrasing. Our comprehensive evaluation spans 13 cultures and 13 languages, systematically assessing both cultural awareness and cultural bias across different linguistic scenarios. The framework provides 39,897 cultural awareness instances and 17,940 cultural bias instances. Experimental results reveal performance disparities across different linguistic scenarios, demonstrating that optimal cultural performance is linked not only to training data distribution but also to language-culture alignment. The evaluation results also expose a fairness issue, where approaches appearing successful in the English scenario create substantial disadvantages. MCEval represents the first comprehensive multilingual cultural evaluation framework that provides deeper insights into LLMs’ cultural understanding.
Article 384
Title@2025-07-13 (7): Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning
Title: Teaching Models to Verbalize Reward Hacking in Chain-of-Thought Reasoning | Lehrmodelle zu verbalisieren Belohnung Hacking in Chain-of-Thought-Reasoning | 教学模型,以思考、思考、推理、推理、推理、推理、推理、推理、推理、推理、推理、推理、推理、推理、推理、推理、推理、推理、推理、推理、推理、推理、推理、推理 2506.22777v2 |
Authors (5): Miles Turpin, Andy Arditi, Marvin Li, Joe Benton, Julian Michael
Language models trained with reinforcement learning (RL) can engage in reward hacking–the exploitation of unintended strategies for high reward–without revealing this behavior in their chain-of-thought reasoning. This makes the detection of reward hacking difficult, posing risks for high-stakes applications. We propose verbalization fine-tuning (VFT), a pre-RL fine-tuning intervention that trains models to explicitly acknowledge when they are influenced by prompt cues–hints which point to incorrect answers (e.g., “a Stanford professor thinks the answer is A”). To evaluate VFT, we subsequently train models with RL on environments where held-out prompt cues signal which incorrect answers will receive high reward, incentivizing models to exploit these cues instead of reasoning correctly. We measure how often models exploit these cues without verbalizing it. After RL, only 6% of the VFT-trained model’s responses consist of undetected reward hacks. In comparison, when we perform RL without VFT, the rate of undetected reward hacks goes up to 88%; with a debiasing baseline intervention, this increases further to 99%. VFT achieves this by substantially increasing how often models verbalize the influence of cues, from 8% to 43% after VFT, and up to 94% after RL. Baselines remain low even after RL (11% and 1%). Our results show that teaching models to explicitly verbalize reward hacking behavior before RL significantly improves their detection, offering a practical path toward more transparent and safe AI systems.
Article 385
Title@2025-07-13 (7): Learning-to-Context Slope: Evaluating In-Context Learning Effectiveness Beyond Performance Illusions
Title: Learning-to-Context Slope: Evaluating In-Context Learning Effectiveness Beyond Performance Illusions | Learning-to-Context Slope: Bewertung von In-Context-Lerneffektivität jenseits von Performance-Illusionen | 学习到文字表达式:评价除了业绩幻觉之外在学习中的效果 2506.23146v3 |
Authors (6): Dingzriui Wang, Xuanliang Zhang, Keyan Xu, Qingfu Zhu, Wanxiang Che, Yang Deng
In-context learning (ICL) has emerged as an effective approach to enhance the performance of large language models (LLMs). However, its effectiveness varies significantly across models and tasks, posing challenges for practitioners to determine when ICL reliably improves performance. Current evaluation approaches, reliant on performance change after applying ICL, suffer from low reliability, poor attribution, and impracticality in data-insufficient scenarios. We propose the Learning-to-Context Slope (LCS), a novel metric that quantifies ICL effectiveness by modeling the slope between learning gain (loss decrease from demonstrations) and contextual relevance (demonstration-input relevance). LCS addresses key limitations of performance-based metrics: (1) it captures continuous loss changes even when outputs are incorrect, improving reliability; (2) its formulation attributes ICL failures to weak contextual alignment (inability to adapt inputs to demonstrations) or strong output calibration (self-verification of correctness); and (3) it minimizes reliance on labeled data via synthetic evaluation. Extensive experiments demonstrate that LCS strongly correlates with performance improvements in labeled settings and reliably reflects true effectiveness in biased or data-scarce scenarios. Further analysis reveals actionable thresholds for LCS and identifies model capabilities critical to ICL success.
Article 386
Title@2025-07-13 (7): Towards Concise and Adaptive Thinking in Large Reasoning Models: A Survey
Title: Towards Concise and Adaptive Thinking in Large Reasoning Models: A Survey | Auf dem Weg zu einem konzisen und adaptiven Denken in großen Vernunftmodellen: Eine Umfrage | 实现大理由模型中的简明和适应性思维:调查 2507.09662v1 |
Authors (2): Jason Zhu, Hongyu Li
Large reasoning models (LRMs) like OpenAI o1 and DeepSeek R1 have demonstrated impressive performance on complex reasoning tasks like mathematics and programming with long Chain-of-Thought (CoT) reasoning sequences (slow-thinking), compared with traditional large language models (fast-thinking). However, these reasoning models also face a major challenge: they generate unnecessarily lengthy and redundant reasoning chains even for trivial questions. This phenomenon leads to a significant waste of inference resources, increases the response time for simple queries, and hinders the practical application of LRMs in real-world products. To this end, it is crucial to shorten lengthy reasoning chains and learn adaptive reasoning between fast and slow thinking based on input difficulty. In this survey, we provide a comprehensive overview of recent progress in concise and adaptive thinking for efficient reasoning of LRMs, including methodologies, benchmarks, and challenges for future exploration. We hope this survey can help researchers quickly understand the landscape of this field and inspire novel adaptive thinking ideas to facilitate better usage of LRMs.
Article 387
Title@2025-07-13 (7): OmniSQL: Synthesizing High-quality Text-to-SQL Data at Scale
Title: OmniSQL: Synthesizing High-quality Text-to-SQL Data at Scale | OmniSQL: Synthese hochwertiger Text-zu-SQL-Daten auf Scale | OmniSQL: 大规模合成高质量的文本到 SQL 数据 2503.02240v2 |
Authors (12): Haoyang Li, Shang Wu, Xiaokang Zhang, Xinmei Huang, Jing Zhang, Fuxin Jiang, Shuai Wang, Tieying Zhang, Jianjun Chen, Rui Shi, Hong Chen, Cuiping Li
Text-to-SQL, the task of translating natural language questions into SQL queries, plays a crucial role in enabling non-experts to interact with databases. While recent advancements in large language models (LLMs) have significantly enhanced text-to-SQL performance, existing approaches face notable limitations in real-world text-to-SQL applications. Prompting-based methods often depend on closed-source LLMs, which are expensive, raise privacy concerns, and lack customization. Fine-tuning-based methods, on the other hand, suffer from poor generalizability due to the limited coverage of publicly available training data. To overcome these challenges, we propose a novel and scalable text-to-SQL data synthesis framework for automatically synthesizing large-scale, high-quality, and diverse datasets without extensive human intervention. Using this framework, we introduce SynSQL-2.5M, the first million-scale text-to-SQL dataset, containing 2.5 million samples spanning over 16,000 synthetic databases. Each sample includes a database, SQL query, natural language question, and chain-of-thought (CoT) solution. Leveraging SynSQL-2.5M, we develop OmniSQL, a powerful open-source text-to-SQL model available in three sizes: 7B, 14B, and 32B. Extensive evaluations across nine datasets demonstrate that OmniSQL achieves state-of-the-art performance, matching or surpassing leading closed-source and open-source LLMs, including GPT-4o and DeepSeek-V3, despite its smaller size. We release all code, datasets, and models to support further research.
Article 388
Title@2025-07-13 (7): MoRE: A Mixture of Reflectors Framework for Large Language Model-Based Sequential Recommendation
Title: MoRE: A Mixture of Reflectors Framework for Large Language Model-Based Sequential Recommendation | MoRE: Ein Reflektoren-Framework für großsprachige modellbasierte sequentielle Empfehlung | MORE:基于大语言示范序列建议的反思框架混合体 2409.06377v2 |
Authors (8): Weicong Qin, Yi Xu, Weijie Yu, Chenglei Shen, Xiao Zhang, Ming He, Jianping Fan, Jun Xu
Large language models (LLMs) have emerged as a cutting-edge approach in sequential recommendation, leveraging historical interactions to model dynamic user preferences. Current methods mainly focus on learning processed recommendation data in the form of sequence-to-sequence text. While effective, they exhibit three key limitations: 1) failing to decouple intra-user explicit features (e.g., product titles) from implicit behavioral patterns (e.g., brand loyalty) within interaction histories; 2) underutilizing cross-user collaborative filtering (CF) signals; and 3) relying on inefficient reflection update strategies. To address these gaps, we propose MoRE (Mixture of REflectors), which introduces three perspective-aware offline reflection processes. This decomposition directly resolves Challenges 1 (explicit/implicit ambiguity) and 2 (CF underutilization). Furthermore, MoRE’s meta-reflector employs a self-improving strategy and a dynamic selection mechanism (Challenge 3) to adapt to evolving user preferences. First, two intra-user reflectors decouple explicit and implicit patterns from a user’s interaction sequence, mimicking traditional recommender systems’ ability to distinguish surface-level and latent preferences. A third cross-user reflector captures CF signals by analyzing user similarity patterns from multiple users’ interactions. To optimize reflection quality, MoRE’s meta-reflector employs an offline self-improving strategy that evaluates reflection impacts through comparisons of presence/absence and iterative refinement of old/new versions, with an online contextual bandit mechanism dynamically selecting the optimal perspective for recommendation for each user. Code: https://github.com/E-qin/MoRE-Rec.
Article 389
Title@2025-07-13 (7): Can Group Relative Policy Optimization Improve Thai Legal Reasoning and Question Answering?
Title: Can Group Relative Policy Optimization Improve Thai Legal Reasoning and Question Answering? | Kann die Optimierung der relativen Politik der Gruppe die thailändische rechtliche Begründung und die Beantwortung von Fragen verbessern? | 集团的相对政策优化能否改善泰国的法律依据和问题的回答? 2507.09638v1 |
Authors (6): Pawitsapak Akarajaradwong, Chompakorn Chaksangchaichot, Pirat Pothavorn, Attapol Thamrongrattanarit-Rutherford, Ekapol Chuangsuwanich, Sarana Nutanong
The performance of Retrieval-Augmented Generation (RAG) systems on Thai legal question answering is still limited, especially for questions requiring extensive, complex legal reasoning. To address these limitations, we introduce an approach aligning LLMs toward improved law citation accuracy and better response quality using Group-Relative Policy Optimization (GRPO). Our approach leverages BGE-M3 embeddings as a cost-efficient semantic-similarity reward, significantly reducing computational expenses by up to 2.5x compared to large language model judges. Experiments on the NitiBench benchmark demonstrate substantial improvements: GRPO achieves up to 90% citation-F1 gains from the base model and a 31% increase in joint quality metrics over instruction tuning. Crucially, our method shows enhanced robustness on complex legal reasoning tasks compared to instruction tuning, providing an effective and resource-efficient solution for enhancing Thai legal LLMs.
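As a rough illustration of how an embedding-based reward could feed GRPO, the sketch below scores each sampled response by cosine similarity to a reference answer and then normalizes the rewards within the group, which is the group-relative advantage at the core of GRPO. The embeddings are plain lists of floats standing in for real sentence embeddings; no BGE-M3 call is made, and the function names and numbers are assumptions for this example, not the authors' reward design.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of sampled responses:
    advantage = (reward - group mean) / (group std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical 2-d embeddings: one reference answer, three sampled responses.
reference = [1.0, 0.0]
responses = [[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]]
rewards = [cosine_similarity(r, reference) for r in responses]
print(rewards)
print(group_relative_advantages(rewards))
```

Responses closer to the reference receive positive advantages and the rest receive negative ones, so no separate judge model is needed to rank samples within a group.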
Article 390
Title@2025-07-13 (7): An Exploration of Knowledge Editing for Arabic
Title: An Exploration of Knowledge Editing for Arabic | Eine Erforschung der Wissensbearbeitung für Arabisch | 阿拉伯文知识编辑探索 2507.09629v1 |
Authors (3): Basel Mousi, Nadir Durrani, Fahim Dalvi
While Knowledge Editing (KE) has been widely explored in English, its behavior in morphologically rich languages like Arabic remains underexamined. In this work, we present the first study of Arabic KE. We evaluate four methods (ROME, MEMIT, ICE, and LTE) on Arabic translations of the ZsRE and Counterfact benchmarks, analyzing both multilingual and cross-lingual settings. Our experiments on Llama-2-7B-chat show that parameter-based methods struggle with cross-lingual generalization, while instruction-tuned methods perform more robustly. We extend Learning-To-Edit (LTE) to a multilingual setting and show that joint Arabic-English training improves both editability and transfer. We release Arabic KE benchmarks and multilingual training data for LTE to support future research.
nan
Article 391
Title@2025-07-13 (7): SpreadPy: A Python tool for modelling spreading activation and superdiffusion in cognitive multiplex networks
Title: SpreadPy: A Python tool for modelling spreading activation and superdiffusion in cognitive multiplex networks | SpreadPy: Ein Python-Tool zur Modellierung der Ausbreitung von Aktivierung und Superdiffusion in kognitiven Multiplex-Netzwerken | Python 工具,用于在认知多功能网络中模拟扩散扩散激活和超扩散 2507.09628v1 |
Authors (6): Salvatore Citraro, Edith Haim, Alessandra Carini, Cynthia S. Q. Siew, Giulio Rossetti, Massimo Stella
We introduce SpreadPy as a Python library for simulating spreading activation in cognitive single-layer and multiplex networks. Our tool is designed to perform numerical simulations testing structure-function relationships in cognitive processes. By comparing simulation results with grounded theories in knowledge modelling, SpreadPy enables systematic investigations of how activation dynamics reflect cognitive, psychological and clinical phenomena. We demonstrate the library’s utility through three case studies: (1) Spreading activation on associative knowledge networks distinguishes students with high versus low math anxiety, revealing anxiety-related structural differences in conceptual organization; (2) Simulations of a creativity task show that activation trajectories vary with task difficulty, exposing how cognitive load modulates lexical access; (3) In individuals with aphasia, simulated activation patterns on lexical networks correlate with empirical error types (semantic vs. phonological) during picture-naming tasks, linking network structure to clinical impairments. SpreadPy’s flexible framework allows researchers to model these processes using empirically derived or theoretical networks, providing mechanistic insights into individual differences and cognitive impairments. The library is openly available, supporting reproducible research in psychology, neuroscience, and education research.
nan
Article 392
Title@2025-07-13 (7): Your Absorbing Discrete Diffusion Secretly Models the Bayesian Posterior
Title: Your Absorbing Discrete Diffusion Secretly Models the Bayesian Posterior | Ihre absorbierende Diskrete Diffusion heimlich Modelle der Bayesian Posterior | 您的吸收分解扩散秘密模型 贝叶斯波斯别墅 2507.07586v2 |
Authors (1): Cooper Doyle
Discrete diffusion language models learn to reconstruct text from randomly masked inputs, yet under mild assumptions their denoiser already implements the exact Bayesian posterior over the original tokens. We prove that the expected denoiser output under the forward corruption distribution recovers the true posterior, and that a simple Monte Carlo estimator converges to this posterior at rate O(1/sqrt(K)) with finite-sample concentration bounds. Building on this insight, we introduce an inference-time ensemble that runs K independent denoising passes and aggregates both posterior means and variances without any extra training. On WikiText-2, our MC-marginal sampler recovers the analytic lambda-DCE zero-shot perplexity (approximately 39) to within a few points at K=128, and its per-token variance shows a strong rank correlation with reconstruction error (Spearman rho = 0.996). This cost-proportional procedure yields calibrated uncertainty estimates and a direct trade-off between compute and posterior fidelity in discrete diffusion LMs.
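The inference-time ensemble described above can be sketched in a few lines: run K independent passes, then average the per-token outputs and track their variance. Everything here is a toy assumption rather than the paper's implementation: `TRUE_POSTERIOR` is a made-up categorical distribution, and `toy_denoising_pass` stands in for a real denoiser by returning a one-hot sample from it.

```python
import random

TRUE_POSTERIOR = [0.6, 0.3, 0.1]  # hypothetical posterior over a 3-token vocabulary
_rng = random.Random(0)

def toy_denoising_pass():
    """Stand-in for one stochastic denoiser pass: a one-hot draw from the posterior."""
    idx = _rng.choices(range(len(TRUE_POSTERIOR)), weights=TRUE_POSTERIOR)[0]
    return [1.0 if i == idx else 0.0 for i in range(len(TRUE_POSTERIOR))]

def mc_marginal(sample_pass, K, vocab_size):
    """Average K independent passes; return per-token mean and variance."""
    total = [0.0] * vocab_size
    total_sq = [0.0] * vocab_size
    for _ in range(K):
        probs = sample_pass()
        for i, p in enumerate(probs):
            total[i] += p
            total_sq[i] += p * p
    mean = [t / K for t in total]
    var = [max(s / K - m * m, 0.0) for s, m in zip(total_sq, mean)]
    return mean, var

mean_est, var_est = mc_marginal(toy_denoising_pass, K=4096, vocab_size=3)
max_error = max(abs(m - p) for m, p in zip(mean_est, TRUE_POSTERIOR))
```

Because the estimate is a plain sample mean, its error shrinks roughly as 1/sqrt(K), which is the compute-versus-fidelity trade-off the abstract refers to.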
nan
Article 393
Title@2025-07-13 (7): NMIXX: Domain-Adapted Neural Embeddings for Cross-Lingual eXploration of Finance
Title: NMIXX: Domain-Adapted Neural Embeddings for Cross-Lingual eXploration of Finance | NMIXX: Domain-Adapted Neural Embedings für Cross-Lingual eXploration of Finance | NMIXX: 用于财务交叉使用和交叉倍增的域-开发型神经模型 2507.09601v1 |
Authors (7): Hanwool Lee, Sara Yu, Yewon Hwang, Jonghyun Choi, Heejae Ahn, Sungbum Jung, Youngjae Yu
General-purpose sentence embedding models often struggle to capture specialized financial semantics, especially in low-resource languages like Korean, due to domain-specific jargon, temporal meaning shifts, and misaligned bilingual vocabularies. To address these gaps, we introduce NMIXX (Neural eMbeddings for Cross-lingual eXploration of Finance), a suite of cross-lingual embedding models fine-tuned with 18.8K high-confidence triplets that pair in-domain paraphrases, hard negatives derived from a semantic-shift typology, and exact Korean-English translations. Concurrently, we release KorFinSTS, a 1,921-pair Korean financial STS benchmark spanning news, disclosures, research reports, and regulations, designed to expose nuances that general benchmarks miss. When evaluated against seven open-license baselines, NMIXX’s multilingual bge-m3 variant achieves Spearman’s rho gains of +0.10 on English FinSTS and +0.22 on KorFinSTS, outperforming its pre-adaptation checkpoint and surpassing other models by the largest margin, while revealing a modest trade-off in general STS performance. Our analysis further shows that models with richer Korean token coverage adapt more effectively, underscoring the importance of tokenizer design in low-resource, cross-lingual settings. By making both models and the benchmark publicly available, we provide the community with robust tools for domain-adapted, multilingual representation learning in finance.
nan
Article 394
Title@2025-07-13 (7): MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models
Title: MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models | MENTOR: Effizientes multimodales Tuning für autoregressive Vision-Generationsmodelle | INGOR: 自动递减型愿景生成模式的高效多式联运有条件的提款 2507.09574v1 |
Authors (7): Haozhe Zhao, Zefan Cai, Shuzheng Si, Liang Chen, Jiuxiang Gu, Wen Xiao, Junjie Hu
Recent text-to-image models produce high-quality results but still struggle with precise visual control, balancing multimodal inputs, and requiring extensive training for complex multimodal image generation. To address these limitations, we propose MENTOR, a novel autoregressive (AR) framework for efficient Multimodal-conditioned Tuning for Autoregressive multimodal image generation. MENTOR combines an AR image generator with a two-stage training paradigm, enabling fine-grained, token-level alignment between multimodal inputs and image outputs without relying on auxiliary adapters or cross-attention modules. The two-stage training consists of: (1) a multimodal alignment stage that establishes robust pixel- and semantic-level alignment, followed by (2) a multimodal instruction tuning stage that balances the integration of multimodal inputs and enhances generation controllability. Despite modest model size, suboptimal base components, and limited training resources, MENTOR achieves strong performance on the DreamBench++ benchmark, outperforming competitive baselines in concept preservation and prompt following. Additionally, our method delivers superior image reconstruction fidelity, broad task adaptability, and improved training efficiency compared to diffusion-based methods. Dataset, code, and models are available at: https://github.com/HaozheZhao/MENTOR
nan
Article 395
Title@2025-07-13 (7): Knowledge-Augmented Multimodal Clinical Rationale Generation for Disease Diagnosis with Small Language Models
Title: Knowledge-Augmented Multimodal Clinical Rationale Generation for Disease Diagnosis with Small Language Models | Knowledge-Augmented Multimodal Clinical Rationale Generation for Disease Diagnosis with Small Language Models | 利用小型语言模型进行疾病诊断的知识强化多式临床多式理论 2411.07611v5 |
Authors (8): Shuai Niu, Jing Ma, Hongzhan Lin, Liang Bai, Zhihua Wang, Yida Xu, Yunya Song, Xian Yang
Interpretation is critical for disease diagnosis, but existing models struggle to balance predictive accuracy with human-understandable rationales. While large language models (LLMs) offer strong reasoning abilities, their clinical use is limited by high computational costs and restricted multimodal reasoning ability. Small language models (SLMs) are efficient but lack advanced reasoning for integrating multimodal medical data. In addition, both LLMs and SLMs lack domain knowledge for trustworthy reasoning. Therefore, we propose ClinRaGen, enhancing SLMs by leveraging LLM-derived reasoning ability via rationale distillation and domain knowledge injection for trustworthy multimodal rationale generation. Key innovations include a sequential rationale distillation framework that equips SLMs with LLM-comparable multimodal reasoning abilities, and a knowledge-augmented attention mechanism that jointly unifies multimodal representation from time series and textual data in the same encoding space, enabling it to be naturally interpreted by SLMs while incorporating domain knowledge for reliable rationale generation. Experiments on real-world medical datasets show that ClinRaGen achieves state-of-the-art performance in disease diagnosis and rationale generation, demonstrating the effectiveness of combining LLM-driven reasoning with knowledge augmentation for improved interpretability.
nan
Article 396
Title@2025-07-13 (7): Adapting Definition Modeling for New Languages: A Case Study on Belarusian
Title: Adapting Definition Modeling for New Languages: A Case Study on Belarusian | Anpassung der Definitionsmodelle für neue Sprachen: Eine Fallstudie zu Belarussisch | 适应新语言定义模式:白俄罗斯案例研究 2507.09536v1 |
Authors (3): Daniela Kazakouskaya, Timothee Mickus, Janine Siewert
Definition modeling, the task of generating new definitions for words in context, holds great promise as a means to assist lexicographers in documenting a broader variety of lects and languages, yet much remains to be done to assess how we can leverage pre-existing models for as-yet unsupported languages. In this work, we focus on adapting existing models to Belarusian, for which we propose a novel dataset of 43,150 definitions. Our experiments demonstrate that adapting definition modeling systems requires minimal amounts of data, but that there are currently gaps in what automatic metrics capture.
nan
Article 397
Title@2025-07-13 (7): Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement
Title: Large Language Model Psychometrics: A Systematic Review of Evaluation, Validation, and Enhancement | Psychometrische Großsprachenmodelle: Eine systematische Überprüfung der Evaluation, Validierung und Verbesserung | 大型语言模拟大语言心理计量模型:系统审查评价、校验和加强 2505.08245v2 |
Authors (5): Haoran Ye, Jing Jin, Yuhang Xie, Xin Zhang, Guojie Song
The advancement of large language models (LLMs) has outpaced traditional evaluation methodologies. This progress presents novel challenges, such as measuring human-like psychological constructs, moving beyond static and task-specific benchmarks, and establishing human-centered evaluation. These challenges intersect with psychometrics, the science of quantifying the intangible aspects of human psychology, such as personality, values, and intelligence. This review paper introduces and synthesizes the emerging interdisciplinary field of LLM Psychometrics, which leverages psychometric instruments, theories, and principles to evaluate, understand, and enhance LLMs. The reviewed literature systematically shapes benchmarking principles, broadens evaluation scopes, refines methodologies, validates results, and advances LLM capabilities. Diverse perspectives are integrated to provide a structured framework for researchers across disciplines, enabling a more comprehensive understanding of this nascent field. Ultimately, the review provides actionable insights for developing future evaluation paradigms that align with human-level AI and promote the advancement of human-centered AI systems for societal benefit. A curated repository of LLM psychometric resources is available at https://github.com/valuebyte-ai/Awesome-LLM-Psychometrics.
nan
Article 398
Title@2025-07-13 (7): Can A Society of Generative Agents Simulate Human Behavior and Inform Public Health Policy? A Case Study on Vaccine Hesitancy
Title: Can A Society of Generative Agents Simulate Human Behavior and Inform Public Health Policy? A Case Study on Vaccine Hesitancy | Kann eine Gesellschaft Generativer Mittel menschliches Verhalten simulieren und die öffentliche Gesundheitspolitik informieren? | 基因代理学会能够模拟人类行为和信息公共卫生政策吗? 疫苗安全案例研究 2503.09639v4 |
Authors (9): Abe Bohan Hou, Hongru Du, Yichen Wang, Jingyu Zhang, Zixiao Wang, Paul Pu Liang, Daniel Khashabi, Lauren Gardner, Tianxing He
Can we simulate a sandbox society with generative agents to model human behavior, thereby reducing the over-reliance on real human trials for assessing public policies? In this work, we investigate the feasibility of simulating health-related decision-making, using vaccine hesitancy, defined as the delay in acceptance or refusal of vaccines despite the availability of vaccination services (MacDonald, 2015), as a case study. To this end, we introduce the VacSim framework with 100 generative agents powered by Large Language Models (LLMs). VacSim simulates vaccine policy outcomes with the following steps: 1) instantiate a population of agents with demographics based on census data; 2) connect the agents via a social network and model vaccine attitudes as a function of social dynamics and disease-related information; 3) design and evaluate various public health interventions aimed at mitigating vaccine hesitancy. To align with real-world results, we also introduce simulation warmup and attitude modulation to adjust agents’ attitudes. We propose a series of evaluations to assess the reliability of various LLM simulations. Experiments indicate that models like Llama and Qwen can simulate aspects of human behavior but also highlight real-world alignment challenges, such as inconsistent responses with demographic profiles. This early exploration of LLM-driven simulations is not meant to serve as definitive policy guidance; instead, it serves as a call for action to examine social simulation for policy development.
nan
Article 399
Title@2025-07-13 (7): How Important is 'Perfect' English for Machine Translation Prompts?
Title: How Important is 'Perfect' English for Machine Translation Prompts? | Wie wichtig ist 'Perfekte' Englisch für maschinelle Übersetzung Prompts? | “完美”英语对机器翻译提示的重要性有多大? 2507.09509v1 |
Authors (7): Patrícia Schmidtová, Niyati Bafna, Seth Aycock, Gianluca Vico, Wiktor Kamzela, Katharina Hämmerl, Vilém Zouhar
Large language models (LLMs) have achieved top results in recent machine translation evaluations, but they are also known to be sensitive to errors and perturbations in their prompts. We systematically evaluate how both humanly plausible and synthetic errors in user prompts affect LLMs’ performance on two related tasks: Machine translation and machine translation evaluation. We provide both a quantitative analysis and qualitative insights into how the models respond to increasing noise in the user prompt. The prompt quality strongly affects the translation performance: With many errors, even a good prompt can underperform a minimal or poor prompt without errors. However, different noise types impact translation quality differently, with character-level and combined noisers degrading performance more than phrasal perturbations. Qualitative analysis reveals that lower prompt quality largely leads to poorer instruction following, rather than directly affecting translation quality itself. Further, LLMs can still translate in scenarios with overwhelming random noise that would make the prompt illegible to humans.
nan
Article 400
Title@2025-07-13 (7): Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models
Title: Ref-Long: Benchmarking the Long-context Referencing Capability of Long-context Language Models | Ref-Long: Benchmarking der Lang-Kontext-Referenzfähigkeit von Lang-Kontext-Sprachenmodellen | 参考:长文本语言模式长期参考能力基准的设定 2507.09506v1 |
Authors (5): Junjie Wu, Gefei Gu, Yanan Zheng, Dit-Yan Yeung, Arman Cohan
Long-context language models (LCLMs) have exhibited impressive capabilities in long-context understanding tasks. Among these, long-context referencing – a crucial task that requires LCLMs to attribute items of interest to specific parts of long-context data – remains underexplored. To bridge this gap, this paper proposes Referencing Evaluation for Long-context Language Models (Ref-Long), a novel benchmark designed to assess the long-context referencing capability of LCLMs. Specifically, Ref-Long requires LCLMs to identify the indexes of documents that reference a specific key, emphasizing contextual relationships between the key and the documents over simple retrieval. Based on the task design, we construct three subsets ranging from synthetic to realistic scenarios to form the Ref-Long benchmark. Experimental results of 13 LCLMs reveal significant shortcomings in long-context referencing, even among advanced models like GPT-4o. To further investigate these challenges, we conduct comprehensive analyses, including human evaluations, task format adjustments, fine-tuning experiments, and error analyses, leading to several key insights. Our data and code can be found at https://github.com/wujunjie1998/Ref-Long.
nan
Article 401
Title@2025-07-13 (7): READoc: A Unified Benchmark for Realistic Document Structured Extraction
Title: READoc: A Unified Benchmark for Realistic Document Structured Extraction | READoc: Ein einheitlicher Benchmark für eine realistische Dokumentenstrukturierung | READoc: “ 结构抽取文件 “ 的 “ 现实文件统一基准 “ 2409.05137v3 |
Authors (8): Zichao Li, Aizier Abulaiti, Yaojie Lu, Xuanang Chen, Jia Zheng, Hongyu Lin, Xianpei Han, Le Sun
Document Structured Extraction (DSE) aims to extract structured content from raw documents. Despite the emergence of numerous DSE systems, their unified evaluation remains inadequate, significantly hindering the field’s advancement. This problem is largely attributed to existing benchmark paradigms, which exhibit fragmented and localized characteristics. To address these limitations and offer a thorough evaluation of DSE systems, we introduce a novel benchmark named READoc, which defines DSE as a realistic task of converting unstructured PDFs into semantically rich Markdown. The READoc dataset is derived from 3,576 diverse and real-world documents from arXiv, GitHub, and Zenodo. In addition, we develop a DSE Evaluation S$^3$uite comprising Standardization, Segmentation and Scoring modules, to conduct a unified evaluation of state-of-the-art DSE approaches. By evaluating a range of pipeline tools, expert visual models, and general VLMs, we identify the gap between current work and the unified, realistic DSE objective for the first time. We aspire that READoc will catalyze future research in DSE, fostering more comprehensive and practical solutions.
nan
Article 402
Title@2025-07-13 (7): IDEAL: Influence-Driven Selective Annotations Empower In-Context Learners in Large Language Models
Title: IDEAL: Influence-Driven Selective Annotations Empower In-Context Learners in Large Language Models | IDEAL: Influence-Driven Selective Annotations Empower In-Context Learner in großen Sprachmodellen | 影响驱动选择性说明:赋予大语言模式中的知识学习者权力 2310.10873v3 |
Authors (7): Shaokun Zhang, Xiaobo Xia, Zhaoqing Wang, Ling-Hao Chen, Jiale Liu, Qingyun Wu, Tongliang Liu
In-context learning is a promising paradigm that utilizes in-context examples as prompts for the predictions of large language models. These prompts are crucial for achieving strong performance. However, since the prompts need to be sampled from a large volume of annotated examples, finding the right prompt may result in high annotation costs. To address this challenge, this paper introduces an influence-driven selective annotation method that aims to minimize annotation costs while improving the quality of in-context examples. The essence of our method is to select a pivotal subset from a large-scale unlabeled data pool to annotate for the subsequent sampling of prompts. Specifically, a directed graph is first constructed to represent unlabeled data. Afterward, the influence of candidate unlabeled subsets is quantified with a diffusion process. A simple yet effective greedy algorithm for unlabeled data selection is lastly introduced. It iteratively selects the data if it provides a maximum marginal gain with respect to quantified influence. Compared with previous efforts on selective annotations, our influence-driven method works in an end-to-end manner, avoids an intractable explicit balance between data diversity and representativeness, and enjoys theoretical support. Experiments confirm the superiority of the proposed method on various benchmarks, achieving better performance under lower time consumption during subset selection. The project page is available at https://skzhang1.github.io/IDEAL/.
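The greedy step described above (repeatedly pick the item with the largest marginal gain in influence) can be sketched as follows. This is a simplified stand-in, not the authors' algorithm: influence is approximated by deterministic reachability in a small hypothetical directed graph, whereas the paper quantifies it with a diffusion process.

```python
def reachable(graph, seeds):
    """Return every node reachable from the seed nodes, seeds included."""
    seen = set(seeds)
    stack = list(seeds)
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def greedy_select(graph, budget):
    """Pick up to `budget` nodes, each maximizing the marginal gain in coverage."""
    selected, covered = [], set()
    nodes = sorted(graph)  # fixed order so that ties break deterministically
    for _ in range(budget):
        best, best_gain = None, -1
        for node in nodes:
            if node in selected:
                continue
            gain = len(reachable(graph, [node]) - covered)
            if gain > best_gain:
                best, best_gain = node, gain
        if best is None:
            break
        selected.append(best)
        covered |= reachable(graph, [best])
    return selected, covered

# Hypothetical graph over unlabeled examples: an edge u -> v means u influences v.
toy_graph = {"a": ["b", "c"], "b": ["c"], "c": [], "d": ["e"], "e": [], "f": []}
to_annotate, influenced = greedy_select(toy_graph, budget=2)
```

With a budget of two, the sketch selects `a` first (it covers three nodes) and then `d` (two new nodes), leaving only `f` outside the influenced set.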
nan
Article 403
Title@2025-07-13 (7): GoalfyMax: A Protocol-Driven Multi-Agent System for Intelligent Experience Entities
Title: GoalfyMax: A Protocol-Driven Multi-Agent System for Intelligent Experience Entities | GoalfyMax: Ein protokollgestütztes Multi-Agenten-System für intelligente Erlebniseinrichtungen | 目标最大目标:智能经验实体协议驱动的多方促进机构系统 2507.09497v1 |
Authors (6): Siyi Wu, Zeyu Wang, Xinyuan Song, Zhengpeng Zhou, Lifan Sun, Tianyu Shi
Modern enterprise environments demand intelligent systems capable of handling complex, dynamic, and multi-faceted tasks with high levels of autonomy and adaptability. However, traditional single-purpose AI systems often lack sufficient coordination, memory reuse, and task decomposition capabilities, limiting their scalability in realistic settings. To address these challenges, we present GoalfyMax, a protocol-driven framework for end-to-end multi-agent collaboration. GoalfyMax introduces a standardized Agent-to-Agent (A2A) communication layer built on the Model Context Protocol (MCP), allowing independent agents to coordinate through asynchronous, protocol-compliant interactions. It incorporates the Experience Pack (XP) architecture, a layered memory system that preserves both task rationales and execution traces, enabling structured knowledge retention and continual learning. Moreover, our system integrates advanced features including multi-turn contextual dialogue, long-short term memory modules, and dynamic safety validation, supporting robust, real-time strategy adaptation. Empirical results on complex task orchestration benchmarks and a case study demonstrate that GoalfyMax achieves superior adaptability, coordination, and experience reuse compared to baseline frameworks. These findings highlight its potential as a scalable, future-ready foundation for multi-agent intelligent systems.
nan
Article 404
Title@2025-07-13 (7): Topic Modeling as Multi-Objective Contrastive Optimization
Title: Topic Modeling as Multi-Objective Contrastive Optimization | Thema Modellierung als multi-objektive kontrastive Optimierung | 专题建模,作为多目标反向优化的模型化 2402.07577v3 |
Authors (6): Thong Nguyen, Xiaobao Wu, Xinshuai Dong, Cong-Duy T Nguyen, See-Kiong Ng, Anh Tuan Luu
Recent representation learning approaches enhance neural topic models by optimizing the weighted linear combination of the evidence lower bound (ELBO) of the log-likelihood and the contrastive learning objective that contrasts pairs of input documents. However, document-level contrastive learning might capture low-level mutual information, such as word ratio, which disturbs topic modeling. Moreover, there is a potential conflict between the ELBO loss that memorizes input details for better reconstruction quality, and the contrastive loss which attempts to learn topic representations that generalize among input documents. To address these issues, we first introduce a novel contrastive learning method oriented towards sets of topic vectors to capture useful semantics that are shared among a set of input documents. Secondly, we explicitly cast contrastive topic modeling as a gradient-based multi-objective optimization problem, with the goal of achieving a Pareto stationary solution that balances the trade-off between the ELBO and the contrastive objective. Extensive experiments demonstrate that our framework consistently produces higher-performing neural topic models in terms of topic coherence, topic diversity, and downstream performance.
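To make the notion of a Pareto stationary point concrete, the sketch below uses the standard closed-form min-norm weighting for two objectives from the multiple-gradient descent literature. It is an illustration under assumptions, not the paper's solver: `g_elbo` and `g_contrastive` are hypothetical gradient vectors given as plain lists.

```python
def min_norm_weight(g1, g2):
    """Return alpha in [0, 1] minimizing ||alpha * g1 + (1 - alpha) * g2||^2."""
    diff_sq = sum((a - b) ** 2 for a, b in zip(g1, g2))
    if diff_sq == 0.0:
        return 0.5  # identical gradients: any weighting gives the same direction
    alpha = sum(b * (b - a) for a, b in zip(g1, g2)) / diff_sq
    return min(1.0, max(0.0, alpha))

def combine(g1, g2):
    """Blend two gradients with the min-norm weight."""
    alpha = min_norm_weight(g1, g2)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]

# Hypothetical gradients of the ELBO loss and the contrastive loss.
g_elbo, g_contrastive = [1.0, 0.0], [0.0, 1.0]
shared_direction = combine(g_elbo, g_contrastive)

# Directly opposed gradients: the blended gradient vanishes (Pareto stationary).
stationary = combine([1.0, 0.0], [-1.0, 0.0])
```

When the blended gradient is non-zero, stepping against it decreases both losses to first order; when it is zero, no direction improves both at once, which is what Pareto stationarity means.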
nan
Article 405
Title@2025-07-13 (7): Auditing Prompt Caching in Language Model APIs
Title: Auditing Prompt Caching in Language Model APIs | Auditieren von Prompt-Caching in Sprachmodell-APIs | 语言模式APIP中快速抓取 2502.07776v2 |
Authors (5): Chenchen Gu, Xiang Lisa Li, Rohith Kuditipudi, Percy Liang, Tatsunori Hashimoto
Prompt caching in large language models (LLMs) results in data-dependent timing variations: cached prompts are processed faster than non-cached prompts. These timing differences introduce the risk of side-channel timing attacks. For example, if the cache is shared across users, an attacker could identify cached prompts from fast API response times to learn information about other users’ prompts. Because prompt caching may cause privacy leakage, transparency around the caching policies of API providers is important. To this end, we develop and conduct statistical audits to detect prompt caching in real-world LLM API providers. We detect global cache sharing across users in seven API providers, including OpenAI, resulting in potential privacy leakage about users’ prompts. Timing variations due to prompt caching can also result in leakage of information about model architecture. Namely, we find evidence that OpenAI’s embedding model is a decoder-only Transformer, which was previously not publicly known.
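A timing audit of this kind boils down to asking whether response times for prompts that might be cached are distinguishable from times for prompts that cannot be cached. The sketch below uses a one-sided permutation test on the difference in means. The test choice, the function name, and the synthetic latencies are all assumptions for illustration; the abstract does not specify the authors' exact procedure, and no real API is called.

```python
import random
from statistics import mean

def permutation_p_value(hit_times, miss_times, n_perm=2000, seed=0):
    """One-sided permutation test: are miss times larger than hit times?"""
    rng = random.Random(seed)
    observed = mean(miss_times) - mean(hit_times)
    pooled = list(hit_times) + list(miss_times)
    n_hit = len(hit_times)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = mean(pooled[n_hit:]) - mean(pooled[:n_hit])
        if diff >= observed:
            count += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (count + 1) / (n_perm + 1)

# Synthetic latencies in seconds (hypothetical, not measured from any provider).
data_rng = random.Random(1)
suspected_hits = [data_rng.gauss(0.10, 0.02) for _ in range(30)]
fresh_prompts = [data_rng.gauss(0.30, 0.02) for _ in range(30)]

p_cached = permutation_p_value(suspected_hits, fresh_prompts)
p_null = permutation_p_value(suspected_hits, suspected_hits)
```

A very small `p_cached` suggests the two latency distributions differ, which is consistent with caching. Comparing a sample against itself (`p_null`) shows what the test returns when there is no difference.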
nan
Article 406
Title@2025-07-13 (7): Balanced Training Data Augmentation for Aspect-Based Sentiment Analysis
Title: Balanced Training Data Augmentation for Aspect-Based Sentiment Analysis | Balanced Training Data Augmentation für aspektbasierte Sentiment-Analyse | 平衡培训数据增加,以进行基于背景的情感分析 2507.09485v1 |
Authors (3): Junjie Liu, Yuanhe Tian, Yan Song
Aspect-based sentiment analysis (ABSA) is a crucial fine-grained task in social media scenarios that identifies the sentiment polarity of specific aspect terms in a sentence. Although many existing studies leverage large language models (LLMs) to perform ABSA due to their strong context understanding capabilities, they still face challenges in learning the context information in the running text because of the short text, as well as the small and unbalanced labeled training data, where most data are labeled with positive sentiment. Data augmentation (DA) is a feasible strategy for providing richer contextual information, especially when using LLMs to create synthetic training data, but faces challenges in ensuring a high quality of the augmented data. In this paper, we propose an LLM-based ABSA approach with training data augmentation. Specifically, an LLM is prompted to generate augmented training data based on the original training data, so as to construct a new training set with larger size and balanced label distributions to better train an ABSA model. Meanwhile, in order to improve the quality of the augmented data, we propose a reinforcement learning approach to optimize the data augmentation LLM. Experiment results and further analyses on English benchmark datasets for ABSA demonstrate the effectiveness of our approach, where superior performance is observed over strong baselines and most existing studies.
nan
Article 407
Title@2025-07-13 (7): ViSP: A PPO-Driven Framework for Sarcasm Generation with Contrastive Learning
Title: ViSP: A PPO-Driven Framework for Sarcasm Generation with Contrastive Learning | ViSP: Ein PPO-getriebenes Framework für Sarkasmus-Generation mit kontrasem Lernen | VSP:PPPO-Driven PPO-Driven 讽刺与矛盾学习的讽刺一代框架 2507.09482v1 |
Authors (3): Changli Wang, Rui Wu, Fang Yin
Human emotions are complex, with sarcasm being a subtle and distinctive form. Despite progress in sarcasm research, sarcasm generation remains underexplored, primarily due to the overreliance on textual modalities and the neglect of visual cues, as well as the mismatch between image content and sarcastic intent in existing datasets. In this paper, we introduce M2SaG, a multimodal sarcasm generation dataset with 4,970 samples, each containing an image, a sarcastic text, and a sarcasm target. To benchmark M2SaG, we propose ViSP, a generation framework that integrates Proximal Policy Optimization (PPO) and contrastive learning. PPO utilizes reward scores from DIP to steer the generation of sarcastic texts, while contrastive learning encourages the model to favor outputs with higher reward scores. These strategies improve overall generation quality and produce texts with more pronounced sarcastic intent. We evaluate ViSP across five metric sets and find it surpasses all baselines, including large language models, underscoring their limitations in sarcasm generation. Furthermore, we analyze the distributions of Sarcasm Scores and Factual Incongruity for both M2SaG and the texts generated by ViSP. The generated texts exhibit higher mean Sarcasm Scores (0.898 vs. 0.770) and Factual Incongruity (0.768 vs. 0.739), demonstrating that ViSP produces higher-quality sarcastic content than the original dataset. Our dataset and code will be released at https://github.com/wclapply/ViSP.
nan
Article 408
Title@2025-07-13 (7): Evaluating LLMs on Sequential API Call Through Automated Test Generation
Title: Evaluating LLMs on Sequential API Call Through Automated Test Generation | Bewertung von LLMs auf sequentieller API-Aufruf durch automatisierte Testgenerierung | 通过自动测试生成的序列API呼叫评估LLMs 2507.09481v1 |
Authors (5): Yuheng Huang, Da Song, Zhenlan Ji, Shuai Wang, Lei Ma
By integrating tools from external APIs, Large Language Models (LLMs) have expanded their promising capabilities in a diverse spectrum of complex real-world tasks. However, testing, evaluation, and analysis of LLM tool use remain in their early stages. Most existing benchmarks rely on manually collected test cases, many of which cannot be automatically checked for semantic correctness and instead depend on static methods such as string matching. Additionally, these benchmarks often overlook the complex interactions that occur between sequential API calls, which are common in real-world applications. To fill the gap, in this paper, we introduce StateGen, an automated framework designed to generate diverse coding tasks involving sequential API interactions. StateGen combines state-machine-based API constraint solving and validation, energy-based sampling, and control-flow injection to generate executable programs. These programs are then translated into human-like natural language task descriptions through a collaboration of two LLM agents. Utilizing StateGen, we construct StateEval, a benchmark encompassing 120 verified test cases spanning across three representative scenarios: Session Service, Tensor Operation, and ElevenLabs MCP. Experimental results confirm that StateGen can effectively generate challenging and realistic API-oriented tasks, highlighting areas for improvement in current LLMs incorporating APIs.
nan
Article 409
Title@2025-07-13 (7): The CoNLL-2013 Shared Task on Grammatical Error Correction
Title: The CoNLL-2013 Shared Task on Grammatical Error Correction | Die gemeinsame Aufgabe von CoNLL-2013 zur Korrektur von Grammatikfehlern | 2013 CoNLL-2013 校正语言错误共同任务 2507.09474v1 |
Authors (5): Hwee Tou Ng, Siew Mei Wu, Yuanbin Wu, Christian Hadiwinoto, Joel Tetreault
The CoNLL-2013 shared task was devoted to grammatical error correction. In this paper, we give the task definition, present the data sets, and describe the evaluation metric and scorer used in the shared task. We also give an overview of the various approaches adopted by the participating teams, and present the evaluation results.
nan
Article 410
Title@2025-07-13 (7): Enhancing Clinical Text Classification via Fine-Tuned DRAGON Longformer Models
Title: Enhancing Clinical Text Classification via Fine-Tuned DRAGON Longformer Models | Verbesserung der klinischen Textklassifikation durch feingetönte DRAGON Longformer-Modelle | 通过精美的DRAGON长期模型加强临床文本分类 2507.09470v1 |
Authors (2): Mingchuan Yang, Ziyuan Huang
This study explores the optimization of the DRAGON Longformer base model for clinical text classification, specifically targeting the binary classification of medical case descriptions. A dataset of 500 clinical cases containing structured medical observations was used, with 400 cases for training and 100 for validation. Enhancements to the pre-trained joeranbosma/dragon-longformer-base-mixed-domain model included hyperparameter tuning, domain-specific preprocessing, and architectural adjustments. Key modifications involved increasing sequence length from 512 to 1024 tokens, adjusting learning rates from 1e-05 to 5e-06, extending training epochs from 5 to 8, and incorporating specialized medical terminology. The optimized model achieved notable performance gains: accuracy improved from 72.0% to 85.2%, precision from 68.0% to 84.1%, recall from 75.0% to 86.3%, and F1-score from 71.0% to 85.2%. Statistical analysis confirmed the significance of these improvements (p < .001). The model demonstrated enhanced capability in interpreting medical terminology, anatomical measurements, and clinical observations. These findings contribute to domain-specific language model research and offer practical implications for clinical natural language processing applications. The optimized model’s strong performance across diverse medical conditions underscores its potential for broad use in healthcare settings.
Article 411
Title@2025-07-13 (7): Personalization of Large Language Models: A Survey
Title: Personalization of Large Language Models: A Survey | Personalisierung großer Sprachmodelle: Eine Umfrage | 大语言模型的个性化:调查 2411.00027v3 |
Authors (21): Zhehao Zhang, Ryan A. Rossi, Branislav Kveton, Yijia Shao, Diyi Yang, Hamed Zamani, Franck Dernoncourt, Joe Barrow, Tong Yu, Sungchul Kim, Ruiyi Zhang, Jiuxiang Gu, Tyler Derr, Hongjie Chen, Junda Wu, Xiang Chen, Zichao Wang, Subrata Mitra, Nedim Lipka, Nesreen Ahmed, Yu Wang
Personalization of Large Language Models (LLMs) has recently become increasingly important with a wide range of applications. Despite the importance and recent progress, most existing works on personalized LLMs have focused either entirely on (a) personalized text generation or (b) leveraging LLMs for personalization-related downstream applications, such as recommendation systems. In this work, we bridge the gap between these two separate main directions for the first time by introducing a taxonomy for personalized LLM usage and summarizing the key differences and challenges. We provide a formalization of the foundations of personalized LLMs that consolidates and expands notions of personalization of LLMs, defining and discussing novel facets of personalization, usage, and desiderata of personalized LLMs. We then unify the literature across these diverse fields and usage scenarios by proposing systematic taxonomies for the granularity of personalization, personalization techniques, datasets, evaluation methods, and applications of personalized LLMs. Finally, we highlight challenges and important open problems that remain to be addressed. By unifying and surveying recent research using the proposed taxonomies, we aim to provide a clear guide to the existing literature and different facets of personalization in LLMs, empowering both researchers and practitioners.
Article 412
Title@2025-07-13 (7): StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model
Title: StreamUni: Achieving Streaming Speech Translation with a Unified Large Speech-Language Model | StreamUni: Streaming Speech Translation mit einem einheitlichen Large Speech-Language-Modell erreichen | StreamUni:用统一大型语音语言模式实现流式语音翻译 2507.07803v2 |
Authors (5): Shoutao Guo, Xiang Li, Mengge Liu, Wei Chen, Yang Feng
Streaming speech translation (StreamST) requires determining appropriate timing, known as policy, to generate translations while continuously receiving source speech inputs, balancing low latency with high translation quality. However, existing StreamST methods typically operate on sentence-level speech segments, referred to as simultaneous speech translation (SimulST). In practice, they require collaboration with segmentation models to accomplish StreamST, where the truncated speech segments constrain SimulST models to make policy decisions and generate translations based on limited contextual information. Moreover, SimulST models struggle to learn effective policies due to the complexity of speech inputs and cross-lingual generation. To address these challenges, we propose StreamUni, which achieves StreamST through a unified Large Speech-Language Model (LSLM). Specifically, StreamUni incorporates speech Chain-of-Thought (CoT) in guiding the LSLM to generate multi-stage outputs. Leveraging these multi-stage outputs, StreamUni simultaneously accomplishes speech segmentation, policy decision, and translation generation, completing StreamST without requiring massive policy-specific training. Additionally, we propose a streaming CoT training method that enhances low-latency policy decisions and generation capabilities using limited CoT data. Experiments demonstrate that our approach achieves state-of-the-art performance on StreamST tasks.
Article 413
Title@2025-07-12 (6): DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models
Title: DATE-LM: Benchmarking Data Attribution Evaluation for Large Language Models | DATE-LM: Benchmarking Data Attribution Evaluation für große Sprachmodelle | DATE-LM:大语言模式数据归属基准评价 2507.09424v1 |
Authors (9): Cathy Jiao, Yijun Pan, Emily Xiao, Daisy Sheng, Niket Jain, Hanzhang Zhao, Ishita Dasgupta, Jiaqi W. Ma, Chenyan Xiong
Data attribution methods quantify the influence of training data on model outputs and are becoming increasingly relevant for a wide range of LLM research and applications, including dataset curation, model interpretability, and data valuation. However, there remain critical gaps in systematic LLM-centric evaluation of data attribution methods. To this end, we introduce DATE-LM (Data Attribution Evaluation in Language Models), a unified benchmark for evaluating data attribution methods through real-world LLM applications. DATE-LM measures attribution quality through three key tasks – training data selection, toxicity/bias filtering, and factual attribution. Our benchmark is designed for ease of use, enabling researchers to configure and run large-scale evaluations across diverse tasks and LLM architectures. Furthermore, we use DATE-LM to conduct a large-scale evaluation of existing data attribution methods. Our findings show that no single method dominates across all tasks, data attribution methods have trade-offs with simpler baselines, and method performance is sensitive to task-specific evaluation design. Finally, we release a public leaderboard for quick comparison of methods and to facilitate community engagement. We hope DATE-LM serves as a foundation for future data attribution research in LLMs.
Article 414
Title@2025-07-12 (6): Large Language Models as Neurolinguistic Subjects: Discrepancy between Performance and Competence
Title: Large Language Models as Neurolinguistic Subjects: Discrepancy between Performance and Competence | Große Sprachmodelle als neurolinguistische Themen: Diskrepanz zwischen Leistung und Kompetenz | 以大语言模式作为神经语言学主体:业绩与能力之间的差异 2411.07533v3 |
Authors (6): Linyang He, Ercong Nie, Helmut Schmid, Hinrich Schütze, Nima Mesgarani, Jonathan Brennan
This study investigates the linguistic understanding of Large Language Models (LLMs) regarding signifier (form) and signified (meaning) by distinguishing two LLM assessment paradigms: psycholinguistic and neurolinguistic. Traditional psycholinguistic evaluations often reflect statistical rules that may not accurately represent LLMs’ true linguistic competence. We introduce a neurolinguistic approach, utilizing a novel method that combines minimal pair and diagnostic probing to analyze activation patterns across model layers. This method allows for a detailed examination of how LLMs represent form and meaning, and whether these representations are consistent across languages. We found: (1) Psycholinguistic and neurolinguistic methods reveal that language performance and competence are distinct; (2) Direct probability measurement may not accurately assess linguistic competence; (3) Instruction tuning does not change competence much but improves performance; (4) LLMs exhibit higher competence and performance in form compared to meaning. Additionally, we introduce new conceptual minimal pair datasets for Chinese (COMPS-ZH) and German (COMPS-DE), complementing existing English datasets.
Article 415
Title@2025-07-12 (6): A Survey of Automatic Prompt Optimization with Instruction-focused Heuristic-based Search Algorithm
Title: A Survey of Automatic Prompt Optimization with Instruction-focused Heuristic-based Search Algorithm | Eine Umfrage zur automatischen Prompt-Optimierung mit instruction-focused Heuristic-based Search-Algorithmus | 以注重指示的以休养为主的自动快速优化调查 2502.18746v2 |
Authors (8): Wendi Cui, Zhuohang Li, Hao Sun, Damien Lopez, Kamalika Das, Bradley A. Malin, Sricharan Kumar, Jiaxin Zhang
Recent advances in Large Language Models have led to remarkable achievements across a variety of Natural Language Processing tasks, making prompt engineering increasingly central to guiding model outputs. While manual methods can be effective, they typically rely on intuition and do not automatically refine prompts over time. In contrast, automatic prompt optimization employing heuristic-based search algorithms can systematically explore and improve prompts with minimal human oversight. This survey proposes a comprehensive taxonomy of these methods, categorizing them by where optimization occurs, what is optimized, what criteria drive the optimization, which operators generate new prompts, and which iterative search algorithms are applied. We further highlight specialized datasets and tools that support and accelerate automated prompt refinement. We conclude by discussing key open challenges pointing toward future opportunities for more robust and versatile LLM applications.
Article 416
Title@2025-07-12 (6): Single Word Change is All You Need: Using LLMs to Create Synthetic Training Examples for Text Classifiers
Title: Single Word Change is All You Need: Using LLMs to Create Synthetic Training Examples for Text Classifiers | Single Word Change ist alles, was Sie brauchen: Mit LLMs synthetische Trainingsbeispiele für Textklassifikatoren erstellen | 单单单单字更改是您所需要的: 使用 LLM 创建文本分类器的合成培训示例 2401.17196v3 |
Authors (5): Lei Xu, Sarah Alnegheimish, Laure Berti-Equille, Alfredo Cuesta-Infante, Kalyan Veeramachaneni
In text classification, creating an adversarial example means subtly perturbing a few words in a sentence without changing its meaning, causing it to be misclassified by a classifier. A concerning observation is that a significant portion of adversarial examples generated by existing methods change only one word. This single-word perturbation vulnerability represents a significant weakness in classifiers, which malicious users can exploit to efficiently create a multitude of adversarial examples. This paper studies this problem and makes the following key contributions: (1) We introduce a novel metric $\rho$ to quantitatively assess a classifier’s robustness against single-word perturbation. (2) We present the SP-Attack, designed to exploit the single-word perturbation vulnerability, achieving a higher attack success rate and better preserving sentence meaning, while reducing computation costs compared to state-of-the-art adversarial methods. (3) We propose SP-Defense, which aims to improve $\rho$ by applying data augmentation in learning. Experimental results on 4 datasets and BERT and DistilBERT classifiers show that SP-Defense improves $\rho$ by 14.6% and 13.9% and decreases the attack success rate of SP-Attack by 30.4% and 21.2% on two classifiers respectively, and decreases the attack success rate of existing attack methods that involve multiple-word perturbations.
Article 417
Title@2025-07-12 (6): SEE: Strategic Exploration and Exploitation for Cohesive In-Context Prompt Optimization
Title: SEE: Strategic Exploration and Exploitation for Cohesive In-Context Prompt Optimization | SEE: Strategische Exploration und Nutzung für kohäsive In-Context Prompt Optimierung | SEE: 战略探索和开发协同在文本内迅速优化的战略探索和开发 2402.11347v2 |
Authors (8): Wendi Cui, Zhuohang Li, Hao Sun, Damien Lopez, Kamalika Das, Bradley Malin, Sricharan Kumar, Jiaxin Zhang
Designing optimal prompts for Large Language Models (LLMs) is a complicated and resource-intensive task, often requiring substantial human expertise and effort. Existing approaches typically separate the optimization of prompt instructions and in-context learning examples, leading to incohesive prompts that are defined and represented by suboptimal task performance. To overcome these challenges, we propose a novel Cohesive In-Context Prompt Optimization framework that refines both prompt instructions and examples. However, formulating such an optimization in the discrete and high-dimensional space of natural language poses significant challenges in both convergence and computational efficiency. To address these issues, we introduce SEE, a scalable and efficient prompt optimization framework that adopts metaheuristic optimization principles and strategically balances exploration and exploitation to enhance optimization performance and achieve efficient convergence. SEE features a quad-phased design that alternates between global traversal (exploration) and local optimization (exploitation) and adaptively chooses LLM operators during the optimization process. We have conducted a comprehensive evaluation across 35 benchmark tasks, and SEE significantly outperforms state-of-the-art baseline methods by a large margin, achieving an average performance gain of 13.94 while reducing computational costs by 58.67.
Article 418
Title@2025-07-12 (6): Supposedly Equivalent Facts That Aren’t? Entity Frequency in Pre-training Induces Asymmetry in LLMs
Title: Supposedly Equivalent Facts That Aren’t? Entity Frequency in Pre-training Induces Asymmetry in LLMs | Angeblich gleichwertige Fakten, die nicht sind? Entity Frequency in Pre-Training Induziert Asymmetrie in LLMs | 所谓等效事实,这难道不是吗? 2503.22362v2 |
Authors (11): Yuan He, Bailan He, Zifeng Ding, Alisia Lupidi, Yuqicheng Zhu, Shuo Chen, Caiqi Zhang, Jiaoyan Chen, Yunpu Ma, Volker Tresp, Ian Horrocks
Understanding and mitigating hallucinations in Large Language Models (LLMs) is crucial for ensuring reliable content generation. While previous research has primarily focused on “when” LLMs hallucinate, our work explains “why” and directly links model behaviour to the pre-training data that forms their prior knowledge. Specifically, we demonstrate that an asymmetry exists in the recognition of logically equivalent facts, which can be attributed to frequency discrepancies of entities appearing as subjects versus objects. Given that most pre-training datasets are inaccessible, we leverage the fully open-source OLMo series by indexing its Dolma dataset to estimate entity frequencies. Using relational facts (represented as triples) from Wikidata5M, we construct probing datasets to isolate this effect. Our experiments reveal that facts with a high-frequency subject and a low-frequency object are better recognised than their inverse, despite their logical equivalence. The pattern reverses in low-to-high frequency settings, and no statistically significant asymmetry emerges when both entities are high-frequency. These findings highlight the influential role of pre-training data in shaping model predictions and provide insights for inferring the characteristics of pre-training data in closed or partially closed LLMs.
Article 419
Title@2025-07-12 (6): MedGemma Technical Report
Title: MedGemma Technical Report | Technischer Bericht MedGemma | MedGemma 技术报告 2507.05201v3 |
Authors (81): Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, Justin Chen, Fereshteh Mahvar, Liron Yatziv, Tiffany Chen, Bram Sterling, Stefanie Anna Baby, Susanna Maria Baby, Jeremy Lai, Samuel Schmidgall, Lu Yang, Kejia Chen, Per Bjornsson, Shashir Reddy, Ryan Brush, Kenneth Philbrick, Mercy Asiedu, Ines Mezerreg, Howard Hu, Howard Yang, Richa Tiwari, Sunny Jansen, Preeti Singh, Yun Liu, Shekoofeh Azizi, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Riviere, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Elena Buchatskaya, Jean-Baptiste Alayrac, Dmitry Lepikhin, Vlad Feinberg, Sebastian Borgeaud, Alek Andreev, Cassidy Hardin, Robert Dadashi, Léonard Hussenot, Armand Joulin, Olivier Bachem, Yossi Matias, Katherine Chou, Avinatan Hassidim, Kavi Goel, Clement Farabet, Joelle Barral, Tris Warkentin, Jonathon Shlens, David Fleet, Victor Cotruta, Omar Sanseviero, Gus Martins, Phoebe Kirk, Anand Rao, Shravya Shetty, David F. Steiner, Can Kirmizibayrak, Rory Pilgrim, Daniel Golden, Lin Yang
Artificial intelligence (AI) has significant potential in healthcare applications, but its training and deployment face challenges due to healthcare’s diverse data, complex tasks, and the need to preserve privacy. Foundation models that perform well on medical tasks and require less task-specific tuning data are critical to accelerate the development of healthcare AI applications. We introduce MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B. MedGemma demonstrates advanced medical understanding and reasoning on images and text, significantly exceeding the performance of similar-sized generative models and approaching the performance of task-specific models, while maintaining the general capabilities of the Gemma 3 base models. For out-of-distribution tasks, MedGemma achieves 2.6-10% improvement on medical multimodal question answering, 15.5-18.1% improvement on chest X-ray finding classification, and 10.8% improvement on agentic evaluations compared to the base models. Fine-tuning MedGemma further improves performance in subdomains, reducing errors in electronic health record information retrieval by 50% and reaching comparable performance to existing specialized state-of-the-art methods for pneumothorax classification and histopathology patch classification. We additionally introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP. MedSigLIP powers the visual understanding capabilities of MedGemma and as an encoder achieves comparable or better performance than specialized medical image encoders. Taken together, the MedGemma collection provides a strong foundation of medical image and text capabilities, with potential to significantly accelerate medical research and development of downstream applications. The MedGemma collection, including tutorials and model weights, can be found at https://goo.gle/medgemma.
Article 420
Title@2025-07-12 (6): BEExformer: A Fast Inferencing Binarized Transformer with Early Exits
Title: BEExformer: A Fast Inferencing Binarized Transformer with Early Exits | BEExformer: Ein schneller Rückschluss Binarisierter Transformer mit frühen Ausgängen | BEExformer: 带有早期退出的快速推推催化变异器 2412.05225v2 |
Authors (3): Wazib Ansar, Saptarsi Goswami, Amlan Chakrabarti
Large Language Models (LLMs) based on transformers achieve cutting-edge results on a variety of applications. However, their enormous size and processing requirements hinder deployment on constrained resources. To enhance efficiency, binarization and Early Exit (EE) have proved to be effective solutions. However, binarization may lead to performance loss as reduced precision affects gradient estimation and parameter updates. Besides, research on EE mechanisms is still in its early stages. To address these challenges, we introduce Binarized Early Exit Transformer (BEExformer), the first-ever selective learning-based transformer integrating Binarization-Aware Training (BAT) with EE for efficient and fast textual inference. Each transformer block has an integrated Selective-Learn Forget Network (SLFN) to enhance contextual retention while eliminating irrelevant information. The BAT employs a differentiable second-order approximation to the sign function, enabling gradient computation that captures both the sign and magnitude of the weights. This yields a 21.30-fold reduction in model size. The EE mechanism hinges on fractional reduction in entropy among intermediate transformer blocks with soft-routing loss estimation. This accelerates inference by reducing FLOPs by 52.08% and even improves accuracy by 2.89% by resolving the “overthinking” problem inherent in deep networks. Extensive evaluation through comparison with the SOTA methods and various ablations across six datasets covering multiple NLP tasks demonstrates its Pareto-optimal performance-efficiency trade-off.
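The entropy-based exit criterion mentioned in this abstract can be illustrated with a minimal sketch. The abstract does not give the exact rule, so the following is an assumption: exit at the first block whose fractional entropy reduction relative to the previous block drops below a threshold. The function names and the `threshold` value are hypothetical, and the soft-routing loss is not modelled.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy (natural log) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def early_exit_block(block_logits, threshold=0.05):
    """Return the index of the block at which inference would stop.

    Assumed rule (not taken from the paper): stop at the first block whose
    entropy fell by less than `threshold` (as a fraction of the previous
    block's entropy). An entropy increase also triggers an exit.
    """
    prev = None
    for i, logits in enumerate(block_logits):
        h = entropy(softmax(logits))
        if prev is not None and prev > 0.0:
            if (prev - h) / prev < threshold:
                return i
        prev = h
    # No block met the criterion: run the full depth.
    return len(block_logits) - 1
```

For example, if the third block's logits match the second block's, the entropy reduction is zero and the sketch exits there instead of running the remaining blocks.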
Article 421
Title@2025-07-12 (6): Perspective Dial: Measuring Perspective of Text and Guiding LLM Outputs
Title: Perspective Dial: Measuring Perspective of Text and Guiding LLM Outputs | Perspective Dial: Perspective of Text and Guiding LLM Outputs messen | 计量文字和引导性LLM产出 2506.23377v2 |
Authors (3): Taejin Kim, Siun-Chuon Mau, Konrad Vesey
Large language models (LLMs) are used in a variety of mission-critical roles. Due to the rapidly developing nature of LLMs, there is a lack of quantifiable understanding of the bias and perspective associated with LLM output. Inspired by this need, this paper considers the broader issue of perspective or viewpoint of general text and perspective control of large language model (LLM) output. Perspective-Dial consists of two main components: (1) a metric space, dubbed Perspective Space, that enables quantitative measurements of different perspectives regarding a topic, and (2) Systematic Prompt Engineering, which uses greedy-coordinate descent to control LLM output perspective based on measurement feedback from the Perspective Space. The empirical nature of the approach allows progress to sidestep a principled understanding of perspective or bias – effectively quantifying and adjusting outputs for a variety of topics. Potential applications include detection, tracking and mitigation of LLM bias, narrative detection, sense making and tracking in public discourse, and debate bots advocating a given perspective.
Article 422
Title@2025-07-12 (6): Watermarking Degrades Alignment in Language Models: Analysis and Mitigation
Title: Watermarking Degrades Alignment in Language Models: Analysis and Mitigation | Wasserzeichen degradiert Ausrichtung in Sprachmodellen: Analyse und Milderung | 语言模型的分级调整:分析和减轻影响 2506.04462v3 |
Authors (3): Apurv Verma, NhatHai Phan, Shubhendu Trivedi
Watermarking techniques for large language models (LLMs) can significantly impact output quality, yet their effects on truthfulness, safety, and helpfulness remain critically underexamined. This paper presents a systematic analysis of how two popular watermarking approaches, Gumbel and KGW, affect these core alignment properties across four aligned LLMs. Our experiments reveal two distinct degradation patterns: guard attenuation, where enhanced helpfulness undermines model safety, and guard amplification, where excessive caution reduces model helpfulness. These patterns emerge from watermark-induced shifts in token distribution, surfacing the fundamental tension that exists between alignment objectives. To mitigate these degradations, we propose Alignment Resampling (AR), an inference-time sampling method that uses an external reward model to restore alignment. We establish a theoretical lower bound on the improvement in expected reward score as the sample size is increased and empirically demonstrate that sampling just 2-4 watermarked generations effectively recovers or surpasses baseline (unwatermarked) alignment scores. To overcome the limited response diversity of standard Gumbel watermarking, our modified implementation sacrifices strict distortion-freeness while maintaining robust detectability, ensuring compatibility with AR. Experimental results confirm that AR successfully recovers baseline alignment in both watermarking approaches, while maintaining strong watermark detectability. This work reveals the critical balance between watermark strength and model alignment, providing a simple inference-time solution to responsibly deploy watermarked LLMs in practice.
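A rough sketch of the resampling idea follows, read here as best-of-n selection under an external reward model. That reading is an assumption based only on the abstract; `generate` and `reward` are hypothetical callables standing in for a watermarked sampler and a reward model, and the authors' implementation may differ.

```python
from typing import Callable

def alignment_resample(
    prompt: str,
    generate: Callable[[str], str],       # hypothetical: one watermarked sample per call
    reward: Callable[[str, str], float],  # hypothetical: external reward model score
    n: int = 4,
) -> str:
    """Draw n watermarked generations and keep the highest-reward one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate(prompt) for _ in range(n)]
    # Every candidate comes from the watermarked sampler, so the selected
    # text still carries the watermark; only the choice among them changes.
    return max(candidates, key=lambda text: reward(prompt, text))
```

The abstract reports that n between 2 and 4 was enough in the authors' experiments; the default of 4 here only mirrors that range.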
Article 423
Title@2025-07-12 (6): LLM Agents Are the Antidote to Walled Gardens
Title: LLM Agents Are the Antidote to Walled Gardens | LLM-Agenten sind das Gegenmittel zu ummauerten Gärten | LLM 药剂是被围墙隔绝的花园的抗药剂 2506.23978v2 |
Authors (2): Samuele Marro, Philip Torr
While the Internet’s core infrastructure was designed to be open and universal, today’s application layer is dominated by closed, proprietary platforms. Open and interoperable APIs require significant investment, and market leaders have little incentive to enable data exchange that could erode their user lock-in. We argue that LLM-based agents fundamentally disrupt this status quo. Agents can automatically translate between data formats and interact with interfaces designed for humans: this makes interoperability dramatically cheaper and effectively unavoidable. We name this shift universal interoperability: the ability for any two digital services to exchange data seamlessly using AI-mediated adapters. Universal interoperability undermines monopolistic behaviours and promotes data portability. However, it can also lead to new security risks and technical debt. Our position is that the ML community should embrace this development while building the appropriate frameworks to mitigate the downsides. By acting now, we can harness AI to restore user freedom and competitive markets without sacrificing security.
Article 424
Title@2025-07-12 (6): ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching
Title: ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching | ZipVoice-Dialog: Nicht-Autoregressive gesprochene Dialog-Generation mit Flow Matching | ZipVoice- Dialog: 以流动匹配方式生成非自动回归式口语对话 2507.09318v1 |
Authors (13): Han Zhu, Wei Kang, Liyong Guo, Zengwei Yao, Fangjun Kuang, Weiji Zhuang, Zhaoqing Li, Zhifeng Han, Dong Zhang, Xin Zhang, Xingchen Song, Long Lin, Daniel Povey
Generating spoken dialogue is more challenging than monologue text-to-speech (TTS) due to the need for realistic turn-taking and distinct speaker timbres. Existing spoken dialogue generation models, being auto-regressive, suffer from slow and unstable inference. To overcome these limitations, we introduce ZipVoice-Dialog, a non-autoregressive zero-shot spoken dialogue generation model built upon flow matching. Key designs include: 1) speaker-turn embeddings for precise speaker turn-taking; 2) a curriculum learning strategy for stable speech-text alignment; 3) specialized strategies to enable stereo dialogue generation. Additionally, recognizing the lack of open-source large-scale spoken dialogue datasets, we curated OpenDialog, a 6.8k-hour spoken dialogue dataset from in-the-wild speech data. Furthermore, we established a benchmark to comprehensively evaluate various models. Experimental results demonstrate that ZipVoice-Dialog achieves superior performance in intelligibility, speaker turn-taking accuracy, speaker similarity, and inference speed. Our codes, model checkpoints, demo samples, and the OpenDialog dataset are all publicly available at https://github.com/k2-fsa/ZipVoice.
Article 425
Title@2025-07-12 (6): Emergence of Hierarchical Emotion Organization in Large Language Models
Title: Emergence of Hierarchical Emotion Organization in Large Language Models | Entstehung der Hierarchischen Emotionsorganisation in großen Sprachmodellen | 大语言模式中等级情感组织的出现 2507.10599v1 |
Authors (7): Bo Zhao, Maya Okawa, Eric J. Bigelow, Rose Yu, Tomer Ullman, Ekdeep Singh Lubana, Hidenori Tanaka
As large language models (LLMs) increasingly power conversational agents, understanding how they model users’ emotional states is critical for ethical deployment. Inspired by emotion wheels – a psychological framework that argues emotions organize hierarchically – we analyze probabilistic dependencies between emotional states in model outputs. We find that LLMs naturally form hierarchical emotion trees that align with human psychological models, and larger models develop more complex hierarchies. We also uncover systematic biases in emotion recognition across socioeconomic personas, with compounding misclassifications for intersectional, underrepresented groups. Human studies reveal striking parallels, suggesting that LLMs internalize aspects of social perception. Beyond highlighting emergent emotional reasoning in LLMs, our results hint at the potential of using cognitively-grounded theories for developing better model evaluations.
Article 426
Title@2025-07-12 (6): Evaluation of Attribution Bias in Generator-Aware Retrieval-Augmented Large Language Models
Title: Evaluation of Attribution Bias in Generator-Aware Retrieval-Augmented Large Language Models | Bewertung der Attribution Bias in Generator-Aware Retrieval-Augmented Large Language Models | 评价发电机-软件检索增强型大语言模型中的归属比语文评价 2410.12380v2 |
Authors (5): Amin Abolghasemi, Leif Azzopardi, Seyyed Hadi Hashemi, Maarten de Rijke, Suzan Verberne
Attributing answers to source documents is an approach used to enhance the verifiability of a model’s output in retrieval augmented generation (RAG). Prior work has mainly focused on improving and evaluating the attribution quality of large language models (LLMs) in RAG, but this may come at the expense of inducing biases in the attribution of answers. We define and examine two aspects in the evaluation of LLMs in RAG pipelines, namely attribution sensitivity and bias with respect to authorship information. We explicitly inform an LLM about the authors of source documents, instruct it to attribute its answers, and analyze (i) how sensitive the LLM’s output is to the author of source documents, and (ii) whether the LLM exhibits a bias towards human-written or AI-generated source documents. We design an experimental setup in which we use counterfactual evaluation to study three LLMs in terms of their attribution sensitivity and bias in RAG pipelines. Our results show that adding authorship information to source documents can significantly change the attribution quality of LLMs by 3% to 18%. Moreover, we show that LLMs can have an attribution bias towards explicit human authorship, which can serve as a competing hypothesis for findings of prior work showing that LLM-generated content may be preferred over human-written content. Our findings indicate that metadata of source documents can influence LLMs’ trust and how they attribute their answers. Furthermore, our research highlights attribution bias and sensitivity as a novel aspect of brittleness in LLMs.
Article 427
Title@2025-07-12 (6): Voice Conversion for Lombard Speaking Style with Implicit and Explicit Acoustic Feature Conditioning
Title: Voice Conversion for Lombard Speaking Style with Implicit and Explicit Acoustic Feature Conditioning | Sprachumwandlung für lombardisch sprechenden Stil mit impliziter und expliziter Akustik-Feature-Konditionierung | Lombard语音风格语音转换,带有隐含和显性音频特色条件 2507.09310v1 |
Authors (4): Dominika Woszczyk, Manuel Sam Ribeiro, Thomas Merritt, Daniel Korzekwa
Text-to-Speech (TTS) systems in Lombard speaking style can improve the overall intelligibility of speech, useful for hearing loss and noisy conditions. However, training those models requires a large amount of data and the Lombard effect is challenging to record due to speaker and noise variability and tiring recording conditions. Voice conversion (VC) has been shown to be a useful augmentation technique to train TTS systems in the absence of recorded data from the target speaker in the target speaking style. In this paper, we are concerned with Lombard speaking style transfer. Our goal is to convert speaker identity while preserving the acoustic attributes that define the Lombard speaking style. We compare voice conversion models with implicit and explicit acoustic feature conditioning. We observe that our proposed implicit conditioning strategy achieves an intelligibility gain comparable to the model conditioned on explicit acoustic features, while also preserving speaker similarity.
nan
Article 428
Title@2025-07-12 (6): Disambiguate First, Parse Later: Generating Interpretations for Ambiguity Resolution in Semantic Parsing
Title: Disambiguate First, Parse Later: Generating Interpretations for Ambiguity Resolution in Semantic Parsing | Erst abgrenzen, später Parse: Interpretationen für Ambiguitätsauflösung im semantischen Parsing generieren | 模糊第一, 稍后分析: 在语义分析中生成对模糊分辨率的解释 2502.18448v2 |
Authors (2): Irina Saparina, Mirella Lapata
Handling ambiguity and underspecification is an important challenge in natural language interfaces, particularly for tasks like text-to-SQL semantic parsing. We propose a modular approach that resolves ambiguity using natural language interpretations before mapping these to logical forms (e.g., SQL queries). Although LLMs excel at parsing unambiguous utterances, they show strong biases for ambiguous ones, typically predicting only preferred interpretations. We constructively exploit this bias to generate an initial set of preferred disambiguations and then apply a specialized infilling model to identify and generate missing interpretations. To train the infilling model, we introduce an annotation method that uses SQL execution to validate different meanings. Our approach improves interpretation coverage and generalizes across datasets with different annotation styles, database structures, and ambiguity types.
nan
Article 429
Title@2025-07-12 (6): ClaritySpeech: Dementia Obfuscation in Speech
Title: ClaritySpeech: Dementia Obfuscation in Speech | ClaritySpeech: Dementia Verschleierung in der Rede | 清晰的言语:言语中的痴呆症 2507.09282v1 |
Authors (3): Dominika Woszczyk, Ranya Aloufi, Soteris Demetriou
Dementia, a neurodegenerative disease, alters speech patterns, creating communication barriers and raising privacy concerns. Current speech technologies, such as automatic speech transcription (ASR), struggle with dementia and atypical speech, further challenging accessibility. This paper presents a novel dementia obfuscation in speech framework, ClaritySpeech, integrating ASR, text obfuscation, and zero-shot text-to-speech (TTS) to correct dementia-affected speech while preserving speaker identity in low-data environments without fine-tuning. Results show a 16% and 10% drop in mean F1 score across various adversarial settings and modalities (audio, text, fusion) for ADReSS and ADReSSo, respectively, maintaining 50% speaker similarity. We also find that our system improves WER (from 0.73 to 0.08 for ADReSS and 0.15 for ADReSSo) and speech quality from 1.65 to ~2.15, enhancing privacy and accessibility.
nan
Article 430
Title@2025-07-12 (6): Psychology-Driven Enhancement of Humour Translation
Title: Psychology-Driven Enhancement of Humour Translation | Psychologie-getriebene Verbesserung der Humour-Übersetzung | 提高幽默翻译能力 2507.09259v1 |
Authors (5): Yuchen Su, Yonghua Zhu, Yang Chen, Diana Benavides-Prado, Michael Witbrock
Humour translation plays a vital role as a bridge between different cultures, fostering understanding and communication. Although most existing Large Language Models (LLMs) are capable of general translation tasks, these models still struggle with humour translation, which is especially reflected through linguistic interference and lacking humour in translated text. In this paper, we propose a psychology-inspired Humour Decomposition Mechanism (HDM) that utilises Chain-of-Thought (CoT) to imitate the ability of the human thought process, stimulating LLMs to optimise the readability of translated humorous texts. Moreover, we integrate humour theory in HDM to further enhance the humorous elements in the translated text. Our automatic evaluation experiments on open-source humour datasets demonstrate that our method significantly improves the quality of humour translation, yielding average gains of 7.75% in humour, 2.81% in fluency, and 6.13% in coherence of the generated text.
nan
Article 431
Title@2025-07-12 (6): Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources
Title: Swa-bhasha Resource Hub: Romanized Sinhala to Sinhala Transliteration Systems and Data Resources | Swa-bhasha Resource Hub: romanisiert Sinhala zu Sinhala Transliterationssysteme und Datenressourcen | Swa-bhasha资源中心:将僧伽罗化成僧伽罗化的僧伽罗化成僧伽罗转化系统和数据资源 2507.09245v1 |
Authors (9): Deshan Sumanathilaka, Sameera Perera, Sachithya Dharmasiri, Maneesha Athukorala, Anuja Dilrukshi Herath, Rukshan Dias, Pasindu Gamage, Ruvan Weerasinghe, Y. H. P. P. Priyadarshana
The Swa-bhasha Resource Hub provides a comprehensive collection of data resources and algorithms developed for Romanized Sinhala to Sinhala transliteration between 2020 and 2025. These resources have played a significant role in advancing research in Sinhala Natural Language Processing (NLP), particularly in training transliteration models and developing applications involving Romanized Sinhala. The current openly accessible data sets and corresponding tools are made publicly available through this hub. This paper presents a detailed overview of the resources contributed by the authors and includes a comparative analysis of existing transliteration applications in the domain.
nan
Article 432
Title@2025-07-12 (6): Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs
Title: Planted in Pretraining, Swayed by Finetuning: A Case Study on the Origins of Cognitive Biases in LLMs | Bepflanzt in der Vorausbildung, durch Finetuning abgeschwächt: Eine Fallstudie über die Herkunft von Kognitiv-Biasen in LLMs | 编在培训前编,《微调:关于LLM中认知性双星起源的个案研究》,《微调摇摇晃》 2507.07186v2 |
Authors (3): Itay Itzhak, Yonatan Belinkov, Gabriel Stanovsky
Large language models (LLMs) exhibit cognitive biases – systematic tendencies of irrational decision-making, similar to those seen in humans. Prior work has found that these biases vary across models and can be amplified by instruction tuning. However, it remains unclear if these differences in biases stem from pretraining, finetuning, or even random noise due to training stochasticity. We propose a two-step causal experimental approach to disentangle these factors. First, we finetune models multiple times using different random seeds to study how training randomness affects over 30 cognitive biases. Second, we introduce cross-tuning – swapping instruction datasets between models to isolate bias sources. This swap uses datasets that led to different bias patterns, directly testing whether biases are dataset-dependent. Our findings reveal that while training randomness introduces some variability, biases are mainly shaped by pretraining: models with the same pretrained backbone exhibit more similar bias patterns than those sharing only finetuning data. These insights suggest that understanding biases in finetuned models requires considering their pretraining origins beyond finetuning effects. This perspective can guide future efforts to develop principled strategies for evaluating and mitigating bias in LLMs.
nan
Article 433
Title@2025-07-12 (6): Towards Pareto Optimal Throughput in Small Language Model Serving
Title: Towards Pareto Optimal Throughput in Small Language Model Serving | Auf dem Weg zu Pareto Optimaler Durchsatz im kleinen Sprachmodell | 争取在小型语文示范服务中达到最佳产出 2404.03353v2 |
Authors (8): Pol G. Recasens, Yue Zhu, Chen Wang, Eun Kyung Lee, Olivier Tardieu, Alaa Youssef, Jordi Torres, Josep Ll. Berral
Large language models (LLMs) have revolutionized the state-of-the-art of many different natural language processing tasks. Although serving LLMs is computationally and memory demanding, the rise of Small Language Models (SLMs) offers new opportunities for resource-constrained users, who now are able to serve small models with cutting-edge performance. In this paper, we present a set of experiments designed to benchmark SLM inference at performance and energy levels. Our analysis provides a new perspective in serving, highlighting that the small memory footprint of SLMs allows for reaching the Pareto-optimal throughput within the resource capacity of a single accelerator. In this regard, we present an initial set of findings demonstrating how model replication can effectively improve resource utilization for serving SLMs.
nan
Article 434
Title@2025-07-12 (6): MetaClimage: A novel database of visual metaphors related to Climate Change, with costs and benefits analysis
Title: MetaClimage: A novel database of visual metaphors related to Climate Change, with costs and benefits analysis | MetaClimage: Eine neuartige Datenbank visueller Metaphern zum Klimawandel mit Kosten-Nutzen-Analyse | MetaClimage:与气候变化有关的视觉比喻新数据库,并进行成本和效益分析 2507.09225v1 |
Authors (4): Biagio Scalingi, Chiara Barattieri di San Pietro, Paolo Canal, Valentina Bambini
Visual metaphors of climate change (e.g., melting glaciers depicted as a melting ice grenade) are regarded as valuable tools for addressing the complexity of environmental challenges. However, few studies have examined their impact on communication, also due to scattered availability of material. Here, we present a novel database of Metaphors of Climate Change in Images (MetaClimage) https://doi.org/10.5281/zenodo.15861012, paired with literal images and enriched with human ratings. For each image, we collected values of difficulty, efficacy, artistic quality, and emotional arousal from human rating, as well as number of tags generated by participants to summarize the message. Semantic and emotion variables were further derived from the tags via Natural Language Processing. Visual metaphors were rated as more difficult to understand, yet more aesthetically pleasant than literal images, but did not differ in efficacy and arousal. The latter for visual metaphors, however, was higher in participants with higher Need For Cognition. Furthermore, visual metaphors received more tags, often referring to entities not depicted in the image, and elicited words with more positive valence and greater dominance than literal images. These results evidence the greater cognitive load of visual metaphors, which nevertheless might induce positive effects such as deeper cognitive elaboration and abstraction compared to literal stimuli. Furthermore, while they are not deemed as more effective and arousing, visual metaphors seem to generate superior aesthetic appreciation and a more positively valenced experience. Overall, this study contributes to understanding the impact of visual metaphors of climate change both by offering a database for future research and by elucidating a cost-benefit trade-off to take into account when shaping environmental communication.
nan
Article 435
Title@2025-07-12 (6): Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models
Title: Feature Extraction and Steering for Enhanced Chain-of-Thought Reasoning in Language Models | Feature-Extraktion und -Lenkung für eine verbesserte Kettenbildung in Sprachmodellen | 语言模型中强化研究链理由的特征采掘和指南 2505.15634v4 |
Authors (6): Zihao Li, Xu Wang, Yuzhe Yang, Ziyu Yao, Haoyi Xiong, Mengnan Du
Large Language Models (LLMs) demonstrate the ability to solve reasoning and mathematical problems using the Chain-of-Thought (CoT) technique. Expanding CoT length, as seen in models such as DeepSeek-R1, significantly enhances this reasoning for complex problems, but requires costly and high-quality long CoT data and fine-tuning. This work, inspired by the deep thinking paradigm of DeepSeek-R1, utilizes a steering technique to enhance the reasoning ability of an LLM without external datasets. Our method first employs Sparse Autoencoders (SAEs) to extract interpretable features from vanilla CoT. These features are then used to steer the LLM’s internal states during generation. Recognizing that many LLMs do not have corresponding pre-trained SAEs, we further introduce a novel SAE-free steering algorithm, which directly computes steering directions from the residual activations of an LLM, obviating the need for an explicit SAE. Experimental results demonstrate that both our SAE-based and subsequent SAE-free steering algorithms significantly enhance the reasoning capabilities of LLMs.
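The SAE-free idea of computing a steering direction directly from residual activations can be illustrated with a small sketch. The abstract does not say how the direction is computed, so the difference-of-means rule below is an assumption, and the names `steering_direction`, `steer`, and `alpha` are hypothetical rather than taken from the paper.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_direction(target_acts, baseline_acts):
    """Unit vector from the baseline mean to the target mean.

    Assumption: a difference-of-means direction; the paper's actual
    SAE-free rule may differ.
    """
    mu_target = mean_vector(target_acts)
    mu_base = mean_vector(baseline_acts)
    diff = [t - b for t, b in zip(mu_target, mu_base)]
    norm = math.sqrt(sum(d * d for d in diff))
    # Leave the zero vector unchanged to avoid dividing by zero.
    return [d / norm for d in diff] if norm > 0 else diff


def steer(hidden, direction, alpha):
    """Shift one hidden state along the direction by strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]
```

In a real model the activations would come from a chosen residual-stream layer and `steer` would be applied during generation; both choices are left open here.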
nan
Article 436
Title@2025-07-12 (6): Exploring Gender Bias Beyond Occupational Titles
Title: Exploring Gender Bias Beyond Occupational Titles | Erforschen von Gender-Bias über Berufsbezeichnungen hinaus | 探索职业职称之外的性别偏见 2507.02679v2 |
Authors (2): Ahmed Sabir, Rajesh Sharma
In this work, we investigate the correlation between gender and contextual biases, focusing on elements such as action verbs, object nouns, and particularly on occupations. We introduce a novel dataset, GenderLexicon, and a framework that can estimate contextual bias and its related gender bias. Our model can interpret the bias with a score and thus improve the explainability of gender bias. Also, our findings confirm the existence of gender biases beyond occupational stereotypes. To validate our approach and demonstrate its effectiveness, we conduct evaluations on five diverse datasets, including a Japanese dataset.
nan
Article 437
Title@2025-07-12 (6): Banzhida: Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training
Title: Banzhida: Advancing Large Language Models for Tibetan with Curated Data and Continual Pre-Training | Banzhida: Große Sprachmodelle für Tibeter mit kuratierten Daten und kontinuierlichem Vortraining | Banzhida:推广藏语大语言模式,提供 “ 缩小数据 “ 和 “ 持续培训前 “ 。 2507.09205v1 |
Authors (40): Leiyu Pan, Bojian Xiong, Lei Yang, Renren Jin, Shaowei Zhang, Yue Chen, Ling Shi, Jiang Zhou, Junru Wu, Zhen Wang, Jianxiang Peng, Juesi Xiao, Tianyu Dong, Zhuowen Han, Zhuo Chen, Sangjee Dondrub, Caizang Tai, Haixing Zhao, Huaque Cairang, Suonan Cairang, Rou Te, Lengben Zhaxi, Gazang Zhaxi, Zhonglin Ye, Yuhui Zheng, Chunyan Peng, Secha Jia, Pema Tashi, Cizhen Jiacuo, Pema Dorjee, Hongkai Liu, Pema Yanggon, Tsehang Dorjee, Jiaxin Han, Qiongying Hu, Jilin Man, Huanke You, Yuqi Ren, Duo La, Deyi Xiong
Large language models have achieved remarkable progress across many languages. However, Tibetan, as a representative low-resource language, is particularly underrepresented in existing models due to the scarcity of high-quality training corpora. To address this gap, we curate the largest Tibetan pre-training corpus to date, aggregating data from diverse sources and applying a dedicated data cleaning and processing pipeline tailored for Tibetan. With the curated data, we continue pre/post-training a multilingual base model into Banzhida, a multilingual large language model that advances generative AI for Tibetan. To evaluate the Tibetan capabilities of the model, we create new high-quality Tibetan benchmarks, and complement them with existing public benchmarks. Experimental results demonstrate that Banzhida consistently and significantly outperforms both open-source models of similar scale and Tibetan-tailored models across a wide range of tasks.
nan
Article 438
Title@2025-07-12 (6): An In-depth Evaluation of Large Language Models in Sentence Simplification with Error-based Human Assessment
Title: An In-depth Evaluation of Large Language Models in Sentence Simplification with Error-based Human Assessment | Eine eingehende Bewertung großer Sprachmodelle in der Satzvereinfachung mit fehlerbasierter Human Assessment | 深入评价以基于错误的人类评估为根据的简化刑期的大语言模式 2403.04963v4 |
Authors (2): Xuanxin Wu, Yuki Arase
Recent studies have used both automatic metrics and human evaluations to assess the simplification abilities of LLMs. However, the suitability of existing evaluation methodologies for LLMs remains in question. First, the suitability of current automatic metrics on LLMs’ simplification evaluation is still uncertain. Second, current human evaluation approaches in sentence simplification often fall into two extremes: they are either too superficial, failing to offer a clear understanding of the models’ performance, or overly detailed, making the annotation process complex and prone to inconsistency, which in turn affects the evaluation’s reliability. To address these problems, this study provides in-depth insights into LLMs’ performance while ensuring the reliability of the evaluation. We design an error-based human annotation framework to assess the LLMs’ simplification capabilities. We select both closed-source and open-source LLMs, including GPT-4, Qwen2.5-72B, and Llama-3.2-3B. We believe that these models offer a representative selection across large, medium, and small sizes of LLMs. Results show that LLMs generally generate fewer erroneous simplification outputs compared to the previous state-of-the-art. However, LLMs have their limitations, as seen in GPT-4’s and Qwen2.5-72B’s struggle with lexical paraphrasing. Furthermore, we conduct meta-evaluations on widely used automatic metrics using our human annotations. We find that these metrics lack sufficient sensitivity to assess the overall high-quality simplifications, particularly those generated by high-performance LLMs.
nan
Article 439
Title@2025-07-12 (6): Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models
Title: Detecting and Pruning Prominent but Detrimental Neurons in Large Language Models | Erkennen und Beschneiden Prominenter, aber detrimentaler Neuronen in großen Sprachmodellen | 在大语言模型中检测和预视突出但有偏偏的神经元 2507.09185v1 |
Authors (4): Ameen Ali, Shahar Katz, Lior Wolf, Ivan Titov
Large language models (LLMs) often develop learned mechanisms specialized to specific datasets, such as reliance on domain-specific correlations, which yield high-confidence predictions without generalizable reasoning. While beneficial in one setting, these dataset-specific mechanisms typically degrade performance when models encounter novel tasks or distributions. In this work, we introduce a fine-tuning approach designed to enhance generalization by identifying and pruning neurons associated with dataset-specific mechanisms in transformer-based LLMs. Our method employs Integrated Gradients to quantify each neuron’s influence on high-confidence predictions, pinpointing those that disproportionately contribute to dataset-specific performance without supporting robust, transferable reasoning. Selectively pruning these neurons compels the model to depend on generalizable representations. Evaluated across multiple-choice benchmarks, our pruning-based fine-tuning significantly enhances performance, surpassing prior (non-pruning) adaptation methods.
nan
Article 440
Title@2025-07-12 (6): CASCADE Your Datasets for Cross-Mode Knowledge Retrieval of Language Models
Title: CASCADE Your Datasets for Cross-Mode Knowledge Retrieval of Language Models | CASCADE Ihre Datensätze für Cross-Mode Knowledge Retrieval von Sprachmodellen | CASCADE 语言模型跨模式知识检索数据集 2504.01450v2 |
Authors (2): Runlong Zhou, Yi Zhang
Language models often struggle with cross-mode knowledge retrieval – the ability to access knowledge learned in one format (mode) when queried in another. We demonstrate that models trained on multiple data sources (e.g., Wikipedia and TinyStories) exhibit significantly reduced accuracy when retrieving knowledge in a format different from its original training mode. This paper quantitatively investigates this phenomenon through a controlled study of random token sequence memorization across different modes. We first explore dataset rewriting as a solution, revealing that effective cross-mode retrieval requires prohibitively extensive rewriting efforts that follow a sigmoid-like relationship. As an alternative, we propose CASCADE, a novel pretraining algorithm that uses cascading datasets with varying sequence lengths and computes losses on only the second half of each training sequence to capture knowledge at different scales. Our experiments demonstrate that CASCADE outperforms dataset rewriting approaches, even when compressed into a single model with a unified loss function. This work provides both qualitative evidence of cross-mode retrieval limitations and a practical solution to enhance language models’ ability to access knowledge independently of its presentational format.
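The two ingredients named in the abstract, datasets at several sequence lengths and a loss restricted to the second half of each sequence, can be sketched roughly as follows. The non-overlapping windowing and the function names `cascade_windows` and `second_half_loss` are assumptions made for illustration; the abstract does not specify how CASCADE builds its datasets.

```python
def cascade_windows(tokens, lengths):
    """Cut one token stream into non-overlapping windows per length.

    Assumption: simple non-overlapping chunking; trailing tokens that
    do not fill a whole window are dropped.
    """
    return {
        length: [tokens[i:i + length]
                 for i in range(0, len(tokens) - length + 1, length)]
        for length in lengths
    }


def second_half_loss(token_log_probs):
    """Mean negative log-likelihood over the second half of a sequence.

    The first half serves only as context and contributes no loss.
    """
    if not token_log_probs:
        raise ValueError("empty sequence")
    tail = token_log_probs[len(token_log_probs) // 2:]
    return -sum(tail) / len(tail)
```

For example, `second_half_loss([-1.0, -2.0, -3.0, -4.0])` averages only the last two terms and gives 3.5.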
nan
Article 441
Title@2025-07-12 (6): DLBAcalib: Robust Extrinsic Calibration for Non-Overlapping LiDARs Based on Dual LBA
Title: DLBAcalib: Robust Extrinsic Calibration for Non-Overlapping LiDARs Based on Dual LBA | DLBAcalib: Robuste Extrinsische Kalibrierung für nicht überlappende LiDARs auf Basis von Dual LBA | DLBAcalib: 以两边LBA为基础,对非重叠的LIDARs进行强有力的Extrins 校准 2507.09176v1 |
Authors (6): Han Ye, Yuqiang Jin, Jinyuan Liu, Tao Li, Wen-An Zhang, Minglei Fu
Accurate extrinsic calibration of multiple LiDARs is crucial for improving the foundational performance of three-dimensional (3D) map reconstruction systems. This paper presents a novel targetless extrinsic calibration framework for multi-LiDAR systems that does not rely on overlapping fields of view or precise initial parameter estimates. Unlike conventional calibration methods that require manual annotations or specific reference patterns, our approach introduces a unified optimization framework by integrating LiDAR bundle adjustment (LBA) optimization with robust iterative refinement. The proposed method constructs an accurate reference point cloud map via continuous scanning from the target LiDAR and sliding-window LiDAR bundle adjustment, while formulating extrinsic calibration as a joint LBA optimization problem. This method effectively mitigates cumulative mapping errors and achieves outlier-resistant parameter estimation through an adaptive weighting mechanism. Extensive evaluations in both the CARLA simulation environment and real-world scenarios demonstrate that our method outperforms state-of-the-art calibration techniques in both accuracy and robustness. Experimental results show that for non-overlapping sensor configurations, our framework achieves an average translational error of 5 mm and a rotational error of 0.2°, with an initial error tolerance of up to 0.4 m/30°. Moreover, the calibration process operates without specialized infrastructure or manual parameter tuning. The code is open source and available on GitHub (https://github.com/Silentbarber/DLBAcalib).
nan
Article 442
Title@2025-07-12 (6): RAMA: Retrieval-Augmented Multi-Agent Framework for Misinformation Detection in Multimodal Fact-Checking
Title: RAMA: Retrieval-Augmented Multi-Agent Framework for Misinformation Detection in Multimodal Fact-Checking | RAMA: Retrieval-Augmented Multi-Agent Framework for Misinformation Detection in Multimodal Fact-Checking | RAMA: 多式联运实况调查中错误信息探测的检索增强多机构框架 2507.09174v1 |
Authors (9): Shuo Yang, Zijian Yu, Zhenzhe Ying, Yuqin Dai, Guoqing Wang, Jun Lan, Jinfeng Xu, Jinze Li, Edith C. H. Ngai
The rapid proliferation of multimodal misinformation presents significant challenges for automated fact-checking systems, especially when claims are ambiguous or lack sufficient context. We introduce RAMA, a novel retrieval-augmented multi-agent framework designed for verifying multimedia misinformation. RAMA incorporates three core innovations: (1) strategic query formulation that transforms multimodal claims into precise web search queries; (2) cross-verification evidence aggregation from diverse, authoritative sources; and (3) a multi-agent ensemble architecture that leverages the complementary strengths of multiple multimodal large language models and prompt variants. Extensive experiments demonstrate that RAMA achieves superior performance on benchmark datasets, particularly excelling in resolving ambiguous or improbable claims by grounding verification in retrieved factual evidence. Our findings underscore the necessity of integrating web-based evidence and multi-agent reasoning for trustworthy multimedia verification, paving the way for more reliable and scalable fact-checking solutions. RAMA will be publicly available at https://github.com/kalendsyang/RAMA.git.
nan
Article 443
Title@2025-07-12 (6): Logits are All We Need to Adapt Closed Models
Title: Logits are All We Need to Adapt Closed Models | Logits sind alles, was wir brauchen, um geschlossene Modelle anzupassen | 只需登录即可,我们只需调整已关闭的模型 2502.06806v4 |
Authors (4): Gaurush Hiranandani, Haolun Wu, Subhojyoti Mukherjee, Sanmi Koyejo
Many commercial Large Language Models (LLMs) are closed-source, limiting developers to prompt tuning for aligning content generation with specific applications. While these models currently do not provide access to token logits, we argue that if such access were available, it would enable more powerful adaptation techniques beyond prompt engineering. In this paper, we propose a token-level probability reweighting framework that, given access to logits and a small amount of task-specific data, can effectively steer black-box LLMs toward application-specific content generation. Our approach views next-token prediction through the lens of supervised classification. We show that aligning black-box LLMs with task-specific data can be formulated as a label noise correction problem, leading to the Plugin model – an autoregressive probability reweighting model that operates solely on logits. We provide theoretical justification for why reweighting logits alone is sufficient for task adaptation. Extensive experiments with multiple datasets, LLMs, and reweighting models demonstrate the effectiveness of our method, advocating for broader access to token logits in closed-source models.
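Token-level probability reweighting on top of black-box logits can be illustrated with a minimal sketch. It assumes the simplest combination rule, adding a reweighting model's per-token scores to the base logits and renormalizing, which multiplies the two distributions; the Plugin model described in the abstract may combine them differently, and `reweight` and `plugin_logits` are hypothetical names.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reweight(base_logits, plugin_logits):
    """Next-token distribution after reweighting.

    Assumption: product-of-distributions rule, i.e. add the two logit
    vectors and renormalize. Only the base model's logits are needed,
    not its weights.
    """
    combined = [b + p for b, p in zip(base_logits, plugin_logits)]
    return softmax(combined)
```

With a uniform base model over two tokens and a reweighting score of `log(3)` on the first token, the first token's probability moves from 0.5 to 0.75.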
nan
Article 444
Title@2025-07-12 (6): PLEX: Perturbation-free Local Explanations for LLM-Based Text Classification
Title: PLEX: Perturbation-free Local Explanations for LLM-Based Text Classification | PLEX: Störungsfreie lokale Erklärungen für die LLM-basierte Textklassifikation | PLEX: LLM基于LLM的文本分类无扰动当地解释 2507.10596v1 |
Authors (3): Yogachandran Rahulamathavan, Misbah Farooq, Varuna De Silva
Large Language Models (LLMs) excel in text classification, but their complexity hinders interpretability, making it difficult to understand the reasoning behind their predictions. Explainable AI (XAI) methods like LIME and SHAP offer local explanations by identifying influential words, but they rely on computationally expensive perturbations. These methods typically generate thousands of perturbed sentences and perform inferences on each, incurring a substantial computational burden, especially with LLMs. To address this, we propose Perturbation-free Local Explanation (PLEX), a novel method that leverages the contextual embeddings extracted from the LLM and a “Siamese network” style neural network trained to align with feature importance scores. This one-off training eliminates the need for subsequent perturbations, enabling efficient explanations for any new sentence. We demonstrate PLEX's effectiveness on four different classification tasks (sentiment, fake news, fake COVID-19 news and depression), showing more than 92% agreement with LIME and SHAP. Our evaluation using a “stress test” reveals that PLEX accurately identifies influential words, leading to a similar decline in classification accuracy as observed with LIME and SHAP when these words are removed. Notably, in some cases, PLEX demonstrates superior performance in capturing the impact of key features. PLEX dramatically accelerates explanation, reducing time and computational overhead by two and four orders of magnitude, respectively. This work offers a promising solution for explainable LLM-based text classification.
nan
Article 445
Title@2025-07-12 (6): PU-Lie: Lightweight Deception Detection in Imbalanced Diplomatic Dialogues via Positive-Unlabeled Learning
Title: PU-Lie: Lightweight Deception Detection in Imbalanced Diplomatic Dialogues via Positive-Unlabeled Learning | PU-Lie: Leichte Täuschungserkennung in ausgewogenen Diplomatischen Dialogen durch positiv-unmarkiertes Lernen | PU-Lie:通过积极-无标签学习,在平衡的外交对话中发现轻量度欺骗性 2507.09157v1 |
Authors (4): Bhavinkumar Vinodbhai Kuwar, Bikrant Bikram Pratap Maurya, Priyanshu Gupta, Nitin Choudhury
Detecting deception in strategic dialogues is a complex and high-stakes task due to the subtlety of language and extreme class imbalance between deceptive and truthful communications. In this work, we revisit deception detection in the Diplomacy dataset, where less than 5% of messages are labeled deceptive. We introduce a lightweight yet effective model combining frozen BERT embeddings, interpretable linguistic and game-specific features, and a Positive-Unlabeled (PU) learning objective. Unlike traditional binary classifiers, PU-Lie is tailored for situations where only a small portion of deceptive messages are labeled, and the majority are unlabeled. Our model achieves a new best macro F1 of 0.60 while reducing trainable parameters by over 650x. Through comprehensive evaluations and ablation studies across seven models, we demonstrate the value of PU learning, linguistic interpretability, and speaker-aware representations. Notably, we emphasize that in this problem setting, accurately detecting deception is more critical than identifying truthful messages. This priority guides our choice of PU learning, which explicitly models the rare but vital deceptive class.
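A Positive-Unlabeled objective of the kind mentioned above can be sketched with the widely used non-negative PU risk estimator and a sigmoid loss. The abstract does not state which PU objective PU-Lie uses, so this choice is an assumption, as are the names `nn_pu_risk` and `prior` (the assumed fraction of deceptive messages).

```python
import math


def sigmoid_loss(score, label):
    """Sigmoid surrogate loss for a score and a label in {+1, -1}."""
    return 1.0 / (1.0 + math.exp(label * score))


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def nn_pu_risk(pos_scores, unl_scores, prior):
    """Non-negative PU risk from positive and unlabeled scores.

    Assumption: the standard non-negative estimator. The negative-class
    risk is estimated from unlabeled data minus the share attributable
    to positives, then clipped at zero so it cannot go negative.
    """
    risk_pos_as_pos = mean(sigmoid_loss(s, +1) for s in pos_scores)
    risk_pos_as_neg = mean(sigmoid_loss(s, -1) for s in pos_scores)
    risk_unl_as_neg = mean(sigmoid_loss(s, -1) for s in unl_scores)
    negative_part = risk_unl_as_neg - prior * risk_pos_as_neg
    return prior * risk_pos_as_pos + max(0.0, negative_part)
```

With a class prior near 0.05, as in the Diplomacy setting described above, the unlabeled term dominates the risk, which is why the clipping step matters in practice.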
nan
Article 446
Title@2025-07-12 (6): OPENXRD: A Comprehensive Benchmark and Enhancement Framework for LLM/MLLM XRD Question Answering
Title: OPENXRD: A Comprehensive Benchmark and Enhancement Framework for LLM/MLLM XRD Question Answering | OPENXRD: Ein umfassendes Benchmark- und Enhancement-Framework für LLM/MLLM XRD-Fragebeantwortung | OpenXRD: LLM/MLLM XRD 问题回答的综合基准和加强框架 2507.09155v1 |
Authors (7): Ali Vosoughi, Ayoub Shahnazari, Yufeng Xi, Zeliang Zhang, Griffin Hess, Chenliang Xu, Niaz Abdolrahim
This work presents OPENXRD, an open-book pipeline designed for crystallography question answering, which integrates textual prompts with concise supporting content generated by GPT-4.5. Instead of using scanned textbooks, which may lead to copyright issues, OPENXRD generates compact, domain-specific references that help smaller models understand key concepts in X-ray diffraction (XRD). We evaluate OPENXRD on a well-defined set of 217 expert-level XRD questions by comparing different vision-language models, including GPT-4 and LLaVA-based frameworks such as Mistral, LLaMA, and QWEN, under both closed-book (without supporting material) and open-book (with supporting material) conditions. Our experimental results show significant accuracy improvements in models that use the GPT-4.5-generated summaries, particularly those with limited prior training in crystallography. OPENXRD uses knowledge from larger models to fill knowledge gaps in crystallography and shows that AI-generated texts can help smaller models reason more effectively in scientific tasks. While the current version of OPENXRD focuses on text-based inputs, we also explore future extensions such as adding real crystal diagrams or diffraction patterns to improve interpretation in specialized materials science contexts. Overall, OPENXRD shows that specialized open-book systems can be useful in materials science and provides a foundation for broader natural language processing (NLP) tools in critical scientific fields.
Article 447
Title@2025-07-12 (6): DTECT: Dynamic Topic Explorer & Context Tracker
Title: DTECT: Dynamic Topic Explorer & Context Tracker | DTECT: Dynamischer Themen-Explorer & Kontext-Tracker | DTECT: 动态专题探索器和上下文跟踪器 2507.07910v2 |
Authors (2): Suman Adhya, Debarshi Kumar Sanyal
The explosive growth of textual data over time presents a significant challenge in uncovering evolving themes and trends. Existing dynamic topic modeling techniques, while powerful, often exist in fragmented pipelines that lack robust support for interpretation and user-friendly exploration. We introduce DTECT (Dynamic Topic Explorer & Context Tracker), an end-to-end system that bridges the gap between raw textual data and meaningful temporal insights. DTECT provides a unified workflow that supports data preprocessing, multiple model architectures, and dedicated evaluation metrics to analyze the topic quality of temporal topic models. It significantly enhances interpretability by introducing LLM-driven automatic topic labeling, trend analysis via temporally salient words, interactive visualizations with document-level summarization, and a natural language chat interface for intuitive data querying. By integrating these features into a single, cohesive platform, DTECT empowers users to more effectively track and understand thematic dynamics. DTECT is open-source and available at https://github.com/AdhyaSuman/DTECT.
Article 448
Title@2025-07-12 (6): SymRAG: Efficient Neuro-Symbolic Retrieval Through Adaptive Query Routing
Title: SymRAG: Efficient Neuro-Symbolic Retrieval Through Adaptive Query Routing | SymRAG: Effizientes neuro-symbolisches Retrieval durch adaptive Abfragerouting | SymRAG: 通过适应性查询路由, 高效神经- 交串流检索 2506.12981v2 |
Authors (4): Safayat Bin Hakim, Muhammad Adil, Alvaro Velasquez, Houbing Herbert Song
Current Retrieval-Augmented Generation systems use uniform processing, causing inefficiency as simple queries consume resources similar to complex multi-hop tasks. We present SymRAG, a framework that introduces adaptive query routing via real-time complexity and load assessment to select symbolic, neural, or hybrid pathways. SymRAG’s neuro-symbolic approach adjusts computational pathways based on both query characteristics and system load, enabling efficient resource allocation across diverse query types. By combining linguistic and structural query properties with system load metrics, SymRAG allocates resources proportional to reasoning requirements. Evaluated on 2,000 queries across HotpotQA (multi-hop reasoning) and DROP (discrete reasoning) using Llama-3.2-3B and Mistral-7B models, SymRAG achieves competitive accuracy (97.6–100.0% exact match) with efficient resource utilization (3.6–6.2% CPU utilization, 0.985–3.165s processing). Disabling adaptive routing increases processing time by 169–1151%, showing its significance for complex models. These results suggest adaptive computation strategies are more sustainable and scalable for hybrid AI systems that use dynamic routing and neuro-symbolic frameworks.
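As an illustration only of the routing idea in this abstract, the sketch below picks a pathway from a toy complexity score and a system-load value. The function name `route_query`, the cue-word list, and all thresholds are hypothetical and are not taken from the SymRAG paper, which does not specify them here.

```python
def route_query(query: str, system_load: float) -> str:
    """Return 'symbolic', 'neural', or 'hybrid' for a query (toy heuristic)."""
    tokens = query.lower().split()
    # Assumed complexity proxy: query length plus words that often mark multi-hop questions.
    multi_hop_cues = {"and", "both", "before", "after", "than", "which"}
    cue_hits = sum(1 for t in tokens if t.strip("?,.") in multi_hop_cues)
    complexity = min(1.0, len(tokens) / 40 + 0.2 * cue_hits)

    if complexity < 0.3:
        return "symbolic"  # cheap path for simple lookups
    if complexity > 0.7 and system_load < 0.8:
        return "hybrid"    # most expensive path, only when the system has headroom
    return "neural"        # default path, also the fallback under high load


if __name__ == "__main__":
    print(route_query("Who wrote Hamlet?", system_load=0.1))
    print(route_query(
        "which city is larger than both paris and rome and was founded before london",
        system_load=0.2,
    ))
```

The point of the design is that the same query can take a cheaper path when the load value is high, which is one plausible reading of "allocates resources proportional to reasoning requirements".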
Article 449
Title@2025-07-12 (6): Eka-Eval : A Comprehensive Evaluation Framework for Large Language Models in Indian Languages
Title: Eka-Eval : A Comprehensive Evaluation Framework for Large Language Models in Indian Languages | Eka-Eval : Ein umfassender Evaluierungsrahmen für große Sprachmodelle in indischen Sprachen | Eka-Eval:印度语大语言模式综合评价框架 2507.01853v3 |
Authors (4): Samridhi Raj Sinha, Rajvee Sheth, Abhishek Upperwal, Mayank Singh
The rapid advancement of Large Language Models (LLMs) has intensified the need for evaluation frameworks that address the requirements of linguistically diverse regions, such as India, and go beyond English-centric benchmarks. We introduce EKA-EVAL, a unified evaluation framework that integrates more than 35 benchmarks (including 10 Indic benchmarks) across nine major evaluation categories. The framework provides broader coverage than existing Indian language evaluation tools, offering 11 core capabilities through a modular architecture, seamless integration with Hugging Face and proprietary models, and plug-and-play usability. As the first end-to-end suite for scalable, multilingual LLM benchmarking, the framework combines extensive benchmarks, modular workflows, and dedicated support for low-resource Indian languages to enable inclusive assessment of LLM capabilities across diverse domains. We conducted extensive comparisons against five existing baselines, demonstrating that EKA-EVAL achieves the highest participant ratings in four out of five categories. The framework is open-source and publicly available at: https://github.com/lingo-iitgn/eka-eval.
Article 450
Title@2025-07-12 (6): The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages
Title: The NaijaVoices Dataset: Cultivating Large-Scale, High-Quality, Culturally-Rich Speech Data for African Languages | Der NaijaVoices-Datensatz: Pflege von großformatigen, qualitativ hochwertigen, kulturell-richschen Sprachdaten für afrikanische Sprachen | NaijaVoices数据集:培养非洲语言的大型、高质量、文化-Rich语音数据 2505.20564v3 |
Authors (11): Chris Emezue, NaijaVoices Community, Busayo Awobade, Abraham Owodunni, Handel Emezue, Gloria Monica Tobechukwu Emezue, Nefertiti Nneoma Emezue, Sewade Ogun, Bunmi Akinremi, David Ifeoluwa Adelani, Chris Pal
The development of high-performing, robust, and reliable speech technologies depends on large, high-quality datasets. However, African languages – including our focus, Igbo, Hausa, and Yoruba – remain under-represented due to insufficient data. Popular voice-enabled technologies do not support any of the 2000+ African languages, limiting accessibility for roughly one billion people. While previous dataset efforts exist for the target languages, they lack the scale and diversity needed for robust speech models. To bridge this gap, we introduce the NaijaVoices dataset, a 1,800-hour speech-text dataset with 5,000+ speakers. We outline our unique data collection approach, analyze its acoustic diversity, and demonstrate its impact through finetuning experiments on automatic speech recognition, achieving average WER improvements of 75.86% (Whisper), 52.06% (MMS), and 42.33% (XLSR). These results highlight NaijaVoices’ potential to advance multilingual speech processing for African languages.
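For readers unfamiliar with the metric, the following minimal sketch shows how word error rate (WER) is conventionally computed as word-level edit distance divided by reference length, and how a relative improvement between two WER values can be expressed. The abstract does not say whether its percentages are relative or absolute, so `relative_improvement` is an assumption about one possible reading; both function names are illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, finetuned_wer: float) -> float:
    """Fraction by which WER dropped relative to the baseline (assumed definition)."""
    return (baseline_wer - finetuned_wer) / baseline_wer


if __name__ == "__main__":
    print(word_error_rate("a b c d", "a x c"))
    print(relative_improvement(0.5, 0.125))
```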
Article 451
Title@2025-07-12 (6): MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian
Title: MSVD-Indonesian: A Benchmark for Multimodal Video-Text Tasks in Indonesian | MSVD-Indonesier: Benchmark für multimodale Video-Text-Aufgaben auf Indonesisch | MSVD-印度尼西亚文:印度尼西亚多式视频文字任务基准 2306.11341v2 |
Authors (1): Willy Fitra Hendria
Multimodal learning on video and text has seen significant progress, particularly in tasks like text-to-video retrieval, video-to-text retrieval, and video captioning. However, most existing methods and datasets focus exclusively on English. Despite Indonesian being one of the most widely spoken languages, multimodal research in Indonesian remains under-explored, largely due to the lack of benchmark datasets. To address this gap, we introduce the first public Indonesian video-text dataset by translating the English captions in the MSVD dataset into Indonesian. Using this dataset, we evaluate neural network models which were developed for the English video-text dataset on three tasks, i.e., text-to-video retrieval, video-to-text retrieval, and video captioning. Most existing models rely on feature extractors pretrained on English vision-language datasets, raising concerns about their applicability to Indonesian, given the scarcity of large-scale pretraining resources in the language. We apply a cross-lingual transfer learning approach by leveraging English-pretrained extractors and fine-tuning models on our Indonesian dataset. Experimental results demonstrate that this strategy improves performance across all tasks and metrics. We release our dataset publicly to support future research and hope it will inspire further progress in Indonesian multimodal learning.
Article 452
Title@2025-07-12 (6): KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding
Title: KodCode: A Diverse, Challenging, and Verifiable Synthetic Dataset for Coding | KodCode: Ein vielfältiger, anspruchsvoller und überprüfbarer synthetischer Datensatz für die Codierung | KodCode:用于编码的多样化、挑战性和可核查合成数据集 2503.02951v2 |
Authors (5): Zhangchen Xu, Yang Liu, Yueqin Yin, Mingyuan Zhou, Radha Poovendran
We introduce KodCode, a synthetic dataset that addresses the persistent challenge of acquiring high-quality, verifiable training data across diverse difficulties and domains for training Large Language Models for coding. Existing code-focused resources typically fail to ensure either the breadth of coverage (e.g., spanning simple coding tasks to advanced algorithmic problems) or verifiable correctness (e.g., unit tests). In contrast, KodCode comprises question-solution-test triplets that are systematically validated via a self-verification procedure. Our pipeline begins by synthesizing a broad range of coding questions, then generates solutions and test cases with additional attempts allocated to challenging problems. Finally, post-training data synthesis is done by rewriting questions into diverse formats and generating responses under a test-based reject sampling procedure from a reasoning model (DeepSeek R1). This pipeline yields a large-scale, robust and diverse coding dataset. KodCode is suitable for supervised fine-tuning and the paired unit tests also provide great potential for RL tuning. Fine-tuning experiments on coding benchmarks (HumanEval(+), MBPP(+), BigCodeBench, and LiveCodeBench) demonstrate that KodCode-tuned models achieve state-of-the-art performance, surpassing models like Qwen2.5-Coder-32B-Instruct and DeepSeek-R1-Distill-Llama-70B.
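The test-based rejection sampling mentioned in this abstract can be pictured with the toy sketch below: candidate solutions are kept only if they pass a unit test. This is not the KodCode pipeline; `reject_sample`, the candidate strings, and the test callable are hypothetical, and a real system would run untrusted code in a sandbox rather than with `exec`.

```python
from typing import Callable, Iterable, Optional


def reject_sample(candidates: Iterable[str],
                  unit_test: Callable[[dict], bool]) -> Optional[str]:
    """Return the first candidate source string whose definitions pass unit_test."""
    for source in candidates:
        namespace: dict = {}
        try:
            exec(source, namespace)   # toy only: never exec untrusted code unsandboxed
            if unit_test(namespace):
                return source         # accepted sample
        except Exception:
            continue                  # broken or failing candidate is rejected
    return None                       # every candidate was rejected


if __name__ == "__main__":
    candidates = [
        "def add(a, b):\n    return a - b\n",   # wrong on purpose
        "def add(a, b):\n    return a + b\n",
    ]
    print(reject_sample(candidates, lambda ns: ns["add"](2, 3) == 5))
```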
Article 453
Title@2025-07-12 (6): CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards
Title: CompassJudger-2: Towards Generalist Judge Model via Verifiable Rewards | CompassJudger-2: Auf dem Weg zum generalistischen Richtermodell durch überprüfbare Belohnungen | Compassjudger-2:通过可核实的奖励争取通才法官模式 2507.09104v1 |
Authors (5): Taolin Zhang, Maosong Cao, Alexander Lam, Songyang Zhang, Kai Chen
Recently, the role of LLM-as-judge in evaluating large language models has gained prominence. However, current judge models suffer from narrow specialization and limited robustness, undermining their capacity for comprehensive evaluations. In this work, we present CompassJudger-2, a novel generalist judge model that overcomes these limitations via a task-driven, multi-domain data curation strategy. Central to our approach is supervising judgment tasks with verifiable rewards, guiding intrinsic critical reasoning through rejection sampling to foster robust, generalizable judgment capabilities. We introduce a refined learning objective with margin policy gradient loss to enhance performance. Empirically, CompassJudger-2 achieves superior results across multiple judge and reward benchmarks, and our 7B model demonstrates competitive judgment accuracy with significantly larger models like DeepSeek-V3 and Qwen3-235B-A22B. Additionally, we propose JudgerBenchV2, a comprehensive benchmark evaluating cross-domain judgment accuracy and rank consistency to standardize judge model evaluation. These contributions advance robust, scalable LLM judgment and establish new performance and evaluation standards.
Article 454
Title@2025-07-12 (6): Consistency in Language Models: Current Landscape, Challenges, and Future Directions
Title: Consistency in Language Models: Current Landscape, Challenges, and Future Directions | Konsistenz in Sprachmodellen: Aktuelle Landschaft, Herausforderungen und zukünftige Richtungen | 语文模式的一致性:当前景观、挑战和未来方向 2505.00268v2 |
Authors (5): Jekaterina Novikova, Carol Anderson, Borhane Blili-Hamelin, Domenic Rosati, Subhabrata Majumdar
The hallmark of effective language use lies in consistency: expressing similar meanings in similar contexts and avoiding contradictions. While human communication naturally demonstrates this principle, state-of-the-art language models (LMs) struggle to maintain reliable consistency across task- and domain-specific applications. Here we examine the landscape of consistency research in LMs, analyze current approaches to measure aspects of consistency, and identify critical research gaps. Our findings point to an urgent need for quality benchmarks to measure and interdisciplinary approaches to ensure consistency while preserving utility.
Article 455
Title@2025-07-12 (6): AInsight: Augmenting Expert Decision-Making with On-the-Fly Insights Grounded in Historical Data
Title: AInsight: Augmenting Expert Decision-Making with On-the-Fly Insights Grounded in Historical Data | AInsight: Augmenting Expert Decision-Making mit On-the-Fly-Insights in historischen Daten begründet | AIn透视:加强以历史数据为根据的直观专家决策 2507.09100v1 |
Authors (4): Mohammad Abolnejadian, Shakiba Amirshahi, Matthew Brehmer, Anamaria Crisan
In decision-making conversations, experts must navigate complex choices and make on-the-spot decisions while engaged in conversation. Although extensive historical data often exists, the real-time nature of these scenarios makes it infeasible for decision-makers to review and leverage relevant information. This raises an interesting question: What if experts could utilize relevant past data in real-time decision-making through insights derived from past data? To explore this, we implemented a conversational user interface, taking doctor-patient interactions as an example use case. Our system continuously listens to the conversation, identifies patient problems and doctor-suggested solutions, and retrieves related data from an embedded dataset, generating concise insights using a pipeline built around a retrieval-based Large Language Model (LLM) agent. We evaluated the prototype by embedding Health Canada datasets into a vector database and conducting simulated studies using sample doctor-patient dialogues, showing effectiveness but also challenges, setting directions for the next steps of our work.
Article 456
Title@2025-07-12 (6): DS@GT at Touché: Large Language Models for Retrieval-Augmented Debate
Title: DS@GT at Touché: Large Language Models for Retrieval-Augmented Debate | DS@GT at Touché: Große Sprachmodelle für retrieval-augmentierte Debatte | DS@GT at Touché: 检索启动辩论的大语言模式 2507.09090v1 |
Authors (3): Anthony Miyaguchi, Conor Johnston, Aaryan Potdar
Large Language Models (LLMs) demonstrate strong conversational abilities. In this Working Paper, we study them in the context of debating in two ways: their ability to perform in a structured debate when given a dataset of arguments to draw on, and their ability to evaluate utterances throughout the debate. We deploy six leading publicly available models from three providers for the Retrieval-Augmented Debate and Evaluation. The evaluation is performed by measuring four key metrics: Quality, Quantity, Manner, and Relation. Throughout this task, we found that although LLMs perform well in debates when given related arguments, they tend to be verbose in responses yet consistent in evaluation. The accompanying source code for this paper is located at https://github.com/dsgt-arc/touche-2025-rad.
Article 457
Title@2025-07-11 (5): Dynamic Parameter Memory: Temporary LoRA-Enhanced LLM for Long-Sequence Emotion Recognition in Conversation
Title: Dynamic Parameter Memory: Temporary LoRA-Enhanced LLM for Long-Sequence Emotion Recognition in Conversation | Dynamischer Parameterspeicher: Temporäre LoRA-verbesserte LLM für die Erkennung von Langsequenz-Emotionen im Gespräch | 动态参数内存:在对话中识别长期序列情感的暂时性LORA-增强的LLMLM 2507.09076v1 |
Authors (6): Jialong Mai, Xiaofen Xing, Yawei Li, Zhipeng Li, Jingyuan Xing, Xiangmin Xu
Recent research has focused on applying speech large language models (SLLMs) to improve speech emotion recognition (SER). However, the inherently high frame rate of the speech modality severely limits the signal processing and understanding capabilities of SLLMs. For example, an SLLM with a 4K context window can only process 80 seconds of audio at a 50Hz feature sampling rate before reaching its capacity limit. Input token compression methods used in SLLMs overlook the continuity and inertia of emotions across multiple conversation turns. This paper proposes a Dynamic Parameter Memory (DPM) mechanism with contextual semantics and sentence-level emotion encoding, enabling processing of unlimited-length audio with limited context windows in SLLMs. Specifically, DPM progressively encodes sentence-level information and emotions into a temporary LoRA module during inference to effectively “memorize” the contextual information. We trained an emotion SLLM as a backbone and incorporated our DPM into inference for emotion recognition in conversation (ERC). Experimental results on the IEMOCAP dataset show that DPM significantly improves the emotion recognition capabilities of SLLMs when processing long audio sequences, achieving state-of-the-art performance.
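The 80-second figure in this abstract follows from simple arithmetic, sketched below under the assumption that each audio feature frame occupies one context token and that "4K" means 4,000 tokens (with 4,096 tokens the result would be about 82 seconds). The function name is illustrative.

```python
def max_audio_seconds(context_tokens: int, frames_per_second: float) -> float:
    """Seconds of audio that fit in the context window at one token per frame."""
    return context_tokens / frames_per_second


if __name__ == "__main__":
    print(max_audio_seconds(4000, 50))   # 4K window at a 50 Hz feature rate
    print(max_audio_seconds(4096, 50))   # same calculation with 4,096 tokens
```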
Article 458
Title@2025-07-11 (5): OpenCodeReasoning-II: A Simple Test Time Scaling Approach via Self-Critique
Title: OpenCodeReasoning-II: A Simple Test Time Scaling Approach via Self-Critique | OpenCodeReasoning-II: Ein einfacher Testzeitskalierungsansatz über Self-Critique | OpenCodeReasoning- II: 通过自创性简单测试时间缩放法 2507.09075v1 |
Authors (9): Wasi Uddin Ahmad, Somshubra Majumdar, Aleksander Ficek, Sean Narenthiran, Mehrzad Samadi, Jocelyn Huang, Siddhartha Jain, Vahid Noroozi, Boris Ginsburg
Recent advancements in reasoning-based Large Language Models (LLMs), particularly their potential through test-time scaling, have created significant opportunities for distillation in code generation and critique. However, progress in both areas fundamentally depends on large-scale, high-quality datasets. In this work, we introduce OpenCodeReasoning-II, a dataset consisting of 2.5M question-solution-critique triples (approx. 35K unique programming questions), making it nearly twice the size of the previous largest publicly available code reasoning dataset. We employ a two-stage supervised fine-tuning strategy. The first stage focuses on fine-tuning for code generation, while the second stage involves the joint training of models for both code generation and critique. Our resulting finetuned Qwen2.5-Instruct models achieve performance in code generation that either exceeds or equals the best prior open-weight distilled models. Notably, the integration of our code generation and critique models leads to significant improvements in competitive coding performance. Furthermore, we present an extension of the LiveCodeBench benchmark to specifically support the C++ programming language, thereby facilitating more comprehensive LLM evaluation using this benchmark.
Article 459
Title@2025-07-11 (5): FlexOlmo: Open Language Models for Flexible Data Use
Title: FlexOlmo: Open Language Models for Flexible Data Use | FlexOlmo: Offene Sprachmodelle für flexible Datennutzung | FlexOlmo:灵活数据使用开放语言模型 2507.07024v2 |
Authors (23): Weijia Shi, Akshita Bhagia, Kevin Farhat, Niklas Muennighoff, Pete Walsh, Jacob Morrison, Dustin Schwenk, Shayne Longpre, Jake Poznanski, Allyson Ettinger, Daogao Liu, Margaret Li, Dirk Groeneveld, Mike Lewis, Wen-tau Yih, Luca Soldaini, Kyle Lo, Noah A. Smith, Luke Zettlemoyer, Pang Wei Koh, Hannaneh Hajishirzi, Ali Farhadi, Sewon Min
We introduce FlexOlmo, a new class of language models (LMs) that supports (1) distributed training without data sharing, where different model parameters are independently trained on closed datasets, and (2) data-flexible inference, where these parameters along with their associated data can be flexibly included or excluded from model inferences with no further training. FlexOlmo employs a mixture-of-experts (MoE) architecture where each expert is trained independently on closed datasets and later integrated through a new domain-informed routing without any joint training. FlexOlmo is trained on FlexMix, a corpus we curate comprising publicly available datasets alongside seven domain-specific sets, representing realistic approximations of closed sets. We evaluate models with up to 37 billion parameters (20 billion active) on 31 diverse downstream tasks. We show that a general expert trained on public data can be effectively combined with independently trained experts from other data owners, leading to an average 41% relative improvement while allowing users to opt out of certain data based on data licensing or permission requirements. Our approach also outperforms prior model merging methods by 10.1% on average and surpasses the standard MoE trained without data restrictions using the same training FLOPs. Altogether, this research presents a solution for both data owners and researchers in regulated industries with sensitive or protected data. FlexOlmo enables benefiting from closed data while respecting data owners’ preferences by keeping their data local and supporting fine-grained control of data access during inference.
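To make the idea of data-flexible inference concrete, the sketch below shows one simple way a mixture-of-experts router could drop opted-out experts and renormalize over the rest. It is a hedged illustration, not FlexOlmo's domain-informed routing: the function `route`, the expert names, and the use of a plain softmax are all assumptions.

```python
import math
from typing import Dict, Set


def route(scores: Dict[str, float], excluded: Set[str]) -> Dict[str, float]:
    """Softmax over router scores, restricted to experts that are not excluded."""
    active = {name: s for name, s in scores.items() if name not in excluded}
    if not active:
        raise ValueError("at least one expert must remain active")
    total = sum(math.exp(s) for s in active.values())
    return {name: math.exp(s) / total for name, s in active.items()}


if __name__ == "__main__":
    router_scores = {"public": 0.0, "news": 0.0, "code": 0.0}  # hypothetical experts
    print(route(router_scores, excluded=set()))
    print(route(router_scores, excluded={"news"}))  # a data owner opts out
```

Because the excluded expert never receives weight, its parameters can be removed at inference time without retraining the remaining experts, which is the property the abstract emphasizes.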
Article 460
Title@2025-07-11 (5): HYPEROFA: Expanding LLM Vocabulary to New Languages via Hypernetwork-Based Embedding Initialization
Title: HYPEROFA: Expanding LLM Vocabulary to New Languages via Hypernetwork-Based Embedding Initialization | HYPEROFA: Erweitern von LLM Vokabeln auf neue Sprachen über Hypernetwork-basierte Einbettung in Initialisierung | HYPROOFA:通过基于超网络的嵌入式初始化,将LLM词汇扩大到新语言 2504.21018v2 |
Authors (3): Enes Özeren, Yihong Liu, Hinrich Schütze
Many pre-trained language models (PLMs) exhibit suboptimal performance on mid- and low-resource languages, largely due to limited exposure to these languages during pre-training. A common strategy to address this is to introduce new tokens specific to the target languages, initialize their embeddings, and apply continual pre-training on target-language data. Among such methods, OFA (Liu et al., 2024a) proposes a similarity-based subword embedding initialization heuristic that is both effective and efficient. However, OFA restricts target-language token embeddings to be convex combinations of a fixed number of source-language embeddings, which may limit expressiveness. To overcome this limitation, we propose HYPEROFA, a hypernetwork-based approach for more adaptive token embedding initialization. The hypernetwork is trained to map from an external multilingual word vector space to the PLM's token embedding space using source-language tokens. Once trained, it can generate flexible embeddings for target-language tokens, serving as a good starting point for continual pre-training. Experiments demonstrate that HYPEROFA consistently outperforms the random initialization baseline and matches or exceeds the performance of OFA in both continual pre-training convergence and downstream task performance. We make the code publicly available.
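The restriction described in this abstract, that a new token's embedding is a convex combination of a fixed number of source embeddings, can be sketched as follows. This is an illustrative toy, not OFA's actual procedure: the function `convex_init`, the softmax weighting over similarities, and the example vectors are assumptions.

```python
import math
from typing import List, Sequence


def convex_init(similarities: Sequence[float],
                source_embeddings: Sequence[Sequence[float]],
                k: int = 2) -> List[float]:
    """Initialize a new token embedding as a convex combination of its k most similar sources."""
    # Indices of the k source tokens most similar to the new token.
    top = sorted(range(len(similarities)),
                 key=lambda i: similarities[i], reverse=True)[:k]
    # Softmax gives non-negative weights that sum to 1, i.e. a convex combination.
    exps = [math.exp(similarities[i]) for i in top]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(source_embeddings[0])
    return [sum(w * source_embeddings[i][d] for w, i in zip(weights, top))
            for d in range(dim)]


if __name__ == "__main__":
    sims = [0.5, 0.5, -1.0]
    sources = [[1.0, 0.0], [0.0, 1.0], [9.0, 9.0]]
    print(convex_init(sims, sources, k=2))
```

The result always lies inside the convex hull of the chosen source vectors, which is the expressiveness limit the abstract says a hypernetwork is meant to lift.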
Article 461
Title@2025-07-11 (5): ALIGN: Prompt-based Attribute Alignment for Reliable, Responsible, and Personalized LLM-based Decision-Making
Title: ALIGN: Prompt-based Attribute Alignment for Reliable, Responsible, and Personalized LLM-based Decision-Making | ALIGN: Promptbasierte Attributausrichtung für zuverlässige, verantwortungsvolle und personalisierte LLM-basierte Entscheidungsfindung | 以可靠、负责任和个性化的LLM为基础的决策的快速属性协调 2507.09037v1 |
Authors (9): Bharadwaj Ravichandran, David Joy, Paul Elliott, Brian Hu, Jadie Adams, Christopher Funk, Emily Veenhuis, Anthony Hoogs, Arslan Basharat
Large language models (LLMs) are increasingly being used as decision aids. However, users have diverse values and preferences that can affect their decision-making, which requires novel methods for LLM alignment and personalization. Existing LLM comparison tools largely focus on benchmarking tasks, such as knowledge-based question answering. In contrast, our proposed ALIGN system focuses on dynamic personalization of LLM-based decision-makers through prompt-based alignment to a set of fine-grained attributes. Key features of our system include robust configuration management, structured output generation with reasoning, and several algorithm implementations with swappable LLM backbones, enabling different types of analyses. Our user interface enables a qualitative, side-by-side comparison of LLMs and their alignment to various attributes, with a modular backend for easy algorithm integration. Additionally, we perform a quantitative analysis comparing alignment approaches in two different domains: demographic alignment for public opinion surveys and value alignment for medical triage decision-making. The entire ALIGN framework is open source and will enable new research on reliable, responsible, and personalized LLM-based decision-makers.
Article 462
Title@2025-07-11 (5): Lizard: An Efficient Linearization Framework for Large Language Models
Title: Lizard: An Efficient Linearization Framework for Large Language Models | Lizard: Ein effizienter Linearisierungsrahmen für große Sprachmodelle | Lizard:大型语言模型的高效线性框架 2507.09025v1 |
Authors (12): Chien Van Nguyen, Ruiyi Zhang, Hanieh Deilamsalehy, Puneet Mathur, Viet Dac Lai, Haoliang Wang, Jayakumar Subramanian, Ryan A. Rossi, Trung Bui, Nikos Vlassis, Franck Dernoncourt, Thien Huu Nguyen
We propose Lizard, a linearization framework that transforms pretrained Transformer-based Large Language Models (LLMs) into flexible, subquadratic architectures for infinite-context generation. Transformer-based LLMs face significant memory and computational bottlenecks as context lengths increase, due to the quadratic complexity of softmax attention and the growing key-value (KV) cache. Lizard addresses these limitations by introducing a subquadratic attention mechanism that closely approximates softmax attention while preserving the output quality. Unlike previous linearization methods, which are often limited by fixed model structures and therefore exclude gating mechanisms, Lizard incorporates a gating module inspired by recent state-of-the-art linear models. This enables adaptive memory control, supports constant-memory inference, offers strong length generalization, and allows more flexible model design. Lizard combines gated linear attention for global context compression with sliding window attention enhanced by meta memory, forming a hybrid mechanism that captures both long-range dependencies and fine-grained local interactions. Moreover, we introduce a hardware-aware algorithm that accelerates the training speed of our models. Extensive experiments show that Lizard achieves near-lossless recovery of the teacher model’s performance across standard language modeling tasks, while significantly outperforming previous linearization methods. On the 5-shot MMLU benchmark, Lizard improves over prior models by 18 points and shows significant improvements on associative recall tasks.
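For intuition about why gated linear attention permits constant-memory inference, the sketch below implements a generic recurrence: a fixed-size state matrix is decayed by a gate and updated with the outer product of key and value at each step. It is a textbook-style simplification with a scalar gate, not Lizard's mechanism; the function name and the toy inputs are assumptions.

```python
from typing import List, Sequence


def gated_linear_attention(queries: Sequence[Sequence[float]],
                           keys: Sequence[Sequence[float]],
                           values: Sequence[Sequence[float]],
                           gates: Sequence[float]) -> List[List[float]]:
    """Run S_t = g_t * S_{t-1} + k_t v_t^T and return o_t = q_t^T S_t for each step."""
    d_k, d_v = len(keys[0]), len(values[0])
    # The state has a fixed d_k x d_v size regardless of sequence length.
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v, g in zip(queries, keys, values, gates):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = g * state[i][j] + k[i] * v[j]
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k))
                        for j in range(d_v)])
    return outputs


if __name__ == "__main__":
    qs = [[1.0, 0.0], [1.0, 0.0]]
    ks = [[1.0, 0.0], [1.0, 0.0]]
    vs = [[2.0], [4.0]]
    print(gated_linear_attention(qs, ks, vs, gates=[1.0, 0.5]))
```

Unlike a key-value cache, the state here does not grow with the number of tokens; the gate controls how quickly older content fades.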
Article 463
Title@2025-07-11 (5): Beyond vividness: Content analysis of induced hallucinations reveals the hidden structure of individual differences in visual imagery
Title: Beyond vividness: Content analysis of induced hallucinations reveals the hidden structure of individual differences in visual imagery | Jenseits von Lebendigkeit: Inhaltliche Analyse induzierter Halluzinationen enthüllt die verborgene Struktur individueller Unterschiede in der Bildgebung | 超越生化:对诱发幻觉的内容分析揭示了视觉图像中个人差异的隐藏结构。 2507.09011v1 |
Authors (5): Ana Chkhaidze, Reshanne R. Reeder, Connor Gag, Anastasia Kiyonaga, Seana Coulson
A rapidly alternating red and black display known as Ganzflicker induces visual hallucinations that reflect the generative capacity of the visual system. Recent proposals regarding the imagery spectrum, that is, differences in the visual system of individuals with absent imagery, typical imagery, and vivid imagery, suggest these differences should impact the complexity of other internally generated visual experiences. Here, we used tools from natural language processing to analyze free-text descriptions of hallucinations from over 4,000 participants, asking whether people with different imagery phenotypes see different things in their mind’s eye during Ganzflicker-induced hallucinations. Strong imagers described complex, naturalistic content, while weak imagers reported simple geometric patterns. Embeddings from vision language models better captured these differences than text-only language models, and participants with stronger imagery used language with richer sensorimotor associations. These findings may reflect individual variation in coordination between early visual areas and higher-order regions relevant for the imagery spectrum.
Article 464
Title@2025-07-11 (5): Semantic Source Code Segmentation using Small and Large Language Models
Title: Semantic Source Code Segmentation using Small and Large Language Models | Semantische Quellcode-Segmentierung mit kleinen und großen Sprachmodellen | 使用小型和大语言模式的语义源代码代码分割 2507.08992v1 |
Authors (5): Abdelhalim Dahou, Ansgar Scherp, Sebastian Kurten, Brigitte Mathiak, Madhu Chauhan
Source code segmentation, dividing code into functionally coherent segments, is crucial for knowledge retrieval and maintenance in software development. While enabling efficient navigation and comprehension of large codebases, manual and syntactic analysis approaches have become impractical as repositories grow, especially for low-resource languages like R and their research domains (e.g., social sciences, psychology). This paper introduces an automated, domain-specific approach for research R code segmentation using Large and Small Language Models (LLMs/SLMs). It presents two novel approaches and a human-annotated dataset, StatCodeSeg. We explore two distinct approaches: line-by-line analysis with context and range-based segment determination. We experiment with LLMs and fine-tuned SLMs. To support the generalizability of our approaches, we also include experiments on Python code from the computer science domain. Our results show that context-based line-by-line analysis is superior to range-based segmentation. Smaller language models like CodeBERT and an encoder-only version of CodeT5+ perform better than their LLM counterparts. Most notably, unlike the LLMs, these two best-performing models did not see R code during pre-training and were only fine-tuned on 4,130 lines of manually annotated code.
Article 465
Title@2025-07-11 (5): TheraGen: Therapy for Every Generation
Title: TheraGen: Therapy for Every Generation | TheraGen: Therapie für jede Generation | TheraGen:为每一代人提供治疗 2409.13748v2 |
Authors (3): Kartikey Doshi, Jimit Shah, Narendra Shekokar
We present TheraGen, an advanced AI-powered mental health chatbot utilizing the LLaMA 2 7B model. This approach builds upon recent advancements in language models and transformer architectures. TheraGen provides all-day personalized, compassionate mental health care by leveraging a large dataset of 1 million conversational entries, combining anonymized therapy transcripts, online mental health discussions, and psychological literature, including APA resources. Our implementation employs transfer learning, fine-tuning, and advanced training techniques to optimize performance. TheraGen offers a user-friendly interface for seamless interaction, providing empathetic responses and evidence-based coping strategies. Evaluation results demonstrate high user satisfaction rates, with 94% of users reporting improved mental well-being. The system achieved a BLEU score of 0.67 and a ROUGE score of 0.62, indicating strong response accuracy. With an average response time of 1395 milliseconds, TheraGen ensures real-time, efficient support. While not a replacement for professional therapy, TheraGen serves as a valuable complementary tool, significantly improving user well-being and addressing the accessibility gap in mental health treatments. This paper details TheraGen’s architecture, training methodology, ethical considerations, and future directions, contributing to the growing field of AI-assisted mental healthcare and offering a scalable solution to the pressing need for mental health support.
Article 466
Title@2025-07-11 (5): Application of CARE-SD text classifier tools to assess distribution of stigmatizing and doubt-marking language features in EHR
Title: Application of CARE-SD text classifier tools to assess distribution of stigmatizing and doubt-marking language features in EHR | Anwendung von CARE-SD-Textklassifikator-Tools zur Bewertung der Verteilung von stigmatisierenden und zweifelmarkierenden Sprachmerkmalen in EHR | 应用CARE-SD 文本分类工具,评估EHR中污名化和有疑点语言特征的分布 2507.08969v1 |
Authors (7): Drew Walker, Jennifer Love, Swati Rajwal, Isabel C Walker, Hannah LF Cooper, Abeed Sarker, Melvin Livingston III
Introduction: Electronic health records (EHR) are a critical medium through which patient stigmatization is perpetuated among healthcare teams. Methods: We identified linguistic features of doubt markers and stigmatizing labels in MIMIC-III EHR via expanded lexicon matching and supervised learning classifiers. Predictors of rates of linguistic features were assessed using Poisson regression models. Results: We found higher rates of stigmatizing labels per chart among patients who were Black or African American (RR: 1.16), patients with Medicare/Medicaid or government-run insurance (RR: 2.46), self-pay (RR: 2.12), and patients with a variety of stigmatizing disease and mental health conditions. Patterns among doubt markers were similar, though male patients had higher rates of doubt markers (RR: 1.25). We found increased stigmatizing labels used by nurses (RR: 1.40), and social workers (RR: 2.25), with similar patterns of doubt markers. Discussion: Stigmatizing language occurred at higher rates among historically stigmatized patients, perpetuated by multiple provider types.
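The rate ratios (RR) above come from Poisson regression. As a reading aid, the usual log-link form is sketched below; the offset $n_i$ (for example, notes per chart) and the binary coding of the predictor $x_i$ are assumptions made for illustration, not details stated in the abstract.

```latex
\log \mathbb{E}[Y_i] = \log n_i + \beta_0 + \beta_1 x_i
\quad\Longrightarrow\quad
\mathrm{RR} = \frac{\mathbb{E}[Y_i \mid x_i = 1] / n_i}{\mathbb{E}[Y_i \mid x_i = 0] / n_i} = e^{\beta_1}
```

Under this form, an RR of 1.16 corresponds to a coefficient of roughly $\beta_1 = \ln 1.16 \approx 0.148$, i.e. a 16% higher expected rate of the counted feature.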
Article 467
Title@2025-07-11 (5): Self-Improving Model Steering
Title: Self-Improving Model Steering | Selbstverbesserende Modellsteuerung | 自我改进示范指导 2507.08967v1 |
Authors (5): Rongyi Zhu, Yuhui Wang, Tanqiu Jiang, Jiacheng Liang, Ting Wang
Model steering represents a powerful technique that dynamically aligns large language models (LLMs) with human preferences during inference. However, conventional model-steering methods rely heavily on externally annotated data, not only limiting their adaptability to varying contexts but also tethering their effectiveness to annotation quality. In this paper, we present SIMS, the first self-improving model-steering framework that operates without relying on external supervision. At its core, SIMS autonomously generates and refines contrastive samples through iterative self-improvement cycles, enabling adaptive, context-specific steering. Additionally, SIMS employs novel strategies, including prompt ranking and contrast sampling, to further enhance steering efficacy. Extensive evaluation across diverse LLMs and benchmarks demonstrates that SIMS substantially outperforms existing methods in steering effectiveness and adaptability, highlighting self-improving model steering as a promising direction for future research on inference-time LLM alignment.
Article 468
Title@2025-07-11 (5): LearnLens: LLM-Enabled Personalised, Curriculum-Grounded Feedback with Educators in the Loop
Title: LearnLens: LLM-Enabled Personalised, Curriculum-Grounded Feedback with Educators in the Loop | LearnLens: LLM-Enabled Personalisiertes, Curriculum-gerundetes Feedback mit Erziehern im Loop | 学习栏:LLM-能够个性化的LLM课程、课程与环中教育工作者的反馈 2507.04295v2 |
Authors (4): Runcong Zhao, Artem Bobrov, Jiazheng Li, Yulan He
Effective feedback is essential for student learning but is time-intensive for teachers. We present LearnLens, a modular, LLM-based system that generates personalised, curriculum-aligned feedback in science education. LearnLens comprises three components: (1) an error-aware assessment module that captures nuanced reasoning errors; (2) a curriculum-grounded generation module that uses a structured, topic-linked memory chain rather than traditional similarity-based retrieval, improving relevance and reducing noise; and (3) an educator-in-the-loop interface for customisation and oversight. LearnLens addresses key challenges in existing systems, offering scalable, high-quality feedback that empowers both teachers and students.
Article 469
Title@2025-07-11 (5): Drowning in Documents: Consequences of Scaling Reranker Inference
Title: Drowning in Documents: Consequences of Scaling Reranker Inference | Ertrinken in Dokumenten: Konsequenzen der Skalierungs-Reranker-Schlussfolgerung | 文件中淹没:扩大重新排序者推断的后果 2411.11767v2 |
Authors (6): Mathew Jacob, Erik Lindgren, Matei Zaharia, Michael Carbin, Omar Khattab, Andrew Drozdov
Rerankers, typically cross-encoders, are computationally intensive but are frequently used because they are widely assumed to outperform cheaper initial IR systems. We challenge this assumption by measuring reranker performance for full retrieval, not just re-scoring first-stage retrieval. To provide a more robust evaluation, we prioritize strong first-stage retrieval using modern dense embeddings and test rerankers on a variety of carefully chosen, challenging tasks, including internally curated datasets to avoid contamination, and out-of-domain ones. Our empirical results reveal a surprising trend: the best existing rerankers provide initial improvements when scoring progressively more documents, but their effectiveness gradually declines and can even degrade quality beyond a certain limit. We hope that our findings will spur future research to improve reranking.
Article 470
Title@2025-07-11 (5): NeuralOS: Towards Simulating Operating Systems via Neural Generative Models
Title: NeuralOS: Towards Simulating Operating Systems via Neural Generative Models | NeuralOS: Auf dem Weg zur Simulation von Betriebssystemen über neurale Generative Modelle | NeuralOS:通过神经产生模型努力模拟操作系统 2507.08800v1 |
Authors (5): Luke Rivard, Sun Sun, Hongyu Guo, Wenhu Chen, Yuntian Deng
We introduce NeuralOS, a neural framework that simulates graphical user interfaces (GUIs) of operating systems by directly predicting screen frames in response to user inputs such as mouse movements, clicks, and keyboard events. NeuralOS combines a recurrent neural network (RNN), which tracks computer state, with a diffusion-based neural renderer that generates screen images. The model is trained on a large-scale dataset of Ubuntu XFCE recordings, which include both randomly generated interactions and realistic interactions produced by AI agents. Experiments show that NeuralOS successfully renders realistic GUI sequences, accurately captures mouse interactions, and reliably predicts state transitions like application launches. Although modeling fine-grained keyboard interactions precisely remains challenging, NeuralOS offers a step toward creating fully adaptive, generative neural interfaces for future human-computer interaction systems.
Article 471
Title@2025-07-11 (5): KV Cache Steering for Inducing Reasoning in Small Language Models
Title: KV Cache Steering for Inducing Reasoning in Small Language Models | KV Cache Steering zur Induktion von Vernunft in kleinen Sprachmodellen | KV 小型语言模式引力提示缓存指导 2507.08799v1 |
Authors (6): Max Belitsky, Dawid J. Kopiczko, Michael Dorkenwald, M. Jehanzeb Mirza, Cees G. M. Snoek, Yuki M. Asano
We propose cache steering, a lightweight method for implicit steering of language models via a one-shot intervention applied directly to the key-value cache. To validate its effectiveness, we apply cache steering to induce chain-of-thought reasoning in small language models. Our approach leverages GPT-4o-generated reasoning traces to construct steering vectors that shift model behavior toward more explicit, multi-step reasoning without fine-tuning or prompt modifications. Experimental evaluations on diverse reasoning benchmarks demonstrate that cache steering improves both the qualitative structure of model reasoning and quantitative task performance. Compared to prior activation steering techniques that require continuous interventions, our one-shot cache steering offers substantial advantages in terms of hyperparameter stability, inference-time efficiency, and ease of integration, making it a more robust and practical solution for controlled generation.
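A minimal sketch of what a one-shot key-value cache intervention could look like, using plain Python lists in place of real model tensors. The abstract does not say how the steering vectors are built or where they are applied, so the mean-difference construction, the `strength` coefficient, and the choice of cache position are all assumptions made for illustration; the function names are hypothetical.

```python
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def steering_vector(with_reasoning: List[Vector],
                    without_reasoning: List[Vector]) -> Vector:
    """ASSUMPTION: steering vector = mean(cached vectors from prompts with
    reasoning traces) - mean(cached vectors from prompts without them)."""
    pos = mean_vector(with_reasoning)
    neg = mean_vector(without_reasoning)
    return [p - n for p, n in zip(pos, neg)]


def steer_cache(cache: List[Vector], steer: Vector,
                strength: float, position: int = -1) -> List[Vector]:
    """Return a copy of the cache with strength * steer added once at
    `position` (ASSUMPTION: the last cached token by default)."""
    steered = [list(v) for v in cache]
    steered[position] = [c + strength * s
                         for c, s in zip(steered[position], steer)]
    return steered


# Toy cached key vectors (hypothetical values, dimension 2).
keys_with = [[1.0, 2.0], [3.0, 4.0]]
keys_without = [[0.0, 1.0], [2.0, 1.0]]
s_k = steering_vector(keys_with, keys_without)

key_cache = [[0.5, 0.5], [1.0, 1.0]]
steered_keys = steer_cache(key_cache, s_k, strength=0.5)
print(s_k, steered_keys)
```

The same call would be repeated for the value cache with its own vector and coefficient. Because the cache is edited once before decoding, no per-step hook is needed during generation, which matches the contrast the abstract draws with continuous activation steering.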
Article 472
Title@2025-07-11 (5): From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation
Title: From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation | Von KMMLU-Redux zu KMMLU-Pro: Eine professionelle koreanische Benchmark-Suite für die LLM-Bewertung | 从KMMLU-Redux到KMMLU-Pro:韩国用于LLM评价的专业基准套件 2507.08924v1 |
Authors (6): Seokhee Hong, Sunkyoung Kim, Guijin Son, Soyeon Kim, Yeonjung Hong, Jinsik Lee
The development of Large Language Models (LLMs) requires robust benchmarks that encompass not only academic domains but also industrial fields to effectively evaluate their applicability in real-world scenarios. In this paper, we introduce two Korean expert-level benchmarks. KMMLU-Redux, reconstructed from the existing KMMLU, consists of questions from the Korean National Technical Qualification exams, with critical errors removed to enhance reliability. KMMLU-Pro is based on Korean National Professional Licensure exams to reflect professional knowledge in Korea. Our experiments demonstrate that these benchmarks comprehensively represent industrial knowledge in Korea. We release our dataset publicly.
Article 473
Title@2025-07-11 (5): One Token to Fool LLM-as-a-Judge
Title: One Token to Fool LLM-as-a-Judge | Ein Token zum Narren LLM-as-a-Richter | 愚人一拳LLM -A法官 2507.08794v1 |
Authors (6): Yulai Zhao, Haolin Liu, Dian Yu, S. Y. Kung, Haitao Mi, Dong Yu
Generative reward models (also known as LLMs-as-judges), which use large language models (LLMs) to evaluate answer quality, are increasingly adopted in reinforcement learning with verifiable rewards (RLVR). They are often preferred over rigid rule-based metrics, especially for complex reasoning tasks involving free-form outputs. In this paradigm, an LLM is typically prompted to compare a candidate answer against a ground-truth reference and assign a binary reward indicating correctness. Despite the seeming simplicity of this comparison task, we find that generative reward models exhibit surprising vulnerabilities to superficial manipulations: non-word symbols (e.g., “:” or “.”) or reasoning openers like “Thought process:” and “Let’s solve this problem step by step.” can often lead to false positive rewards. We demonstrate that this weakness is widespread across LLMs, datasets, and prompt formats, posing a serious threat for core algorithmic paradigms that rely on generative reward models, such as rejection sampling, preference optimization, and RLVR. To mitigate this issue, we introduce a simple yet effective data augmentation strategy and train a new generative reward model with substantially improved robustness. Our findings highlight the urgent need for more reliable LLM-based evaluation methods. We release our robust, general-domain reward model and its synthetic training data at https://huggingface.co/sarosavo/Master-RM and https://huggingface.co/datasets/sarosavo/Master-RM.
Article 474
Title@2025-07-11 (5): AI Safety Should Prioritize the Future of Work
Title: AI Safety Should Prioritize the Future of Work | KI Sicherheit sollte die Zukunft der Arbeit priorisieren | AI 安全应优先考虑未来工作 2504.13959v2 |
Authors (3): Sanchaita Hazra, Bodhisattwa Prasad Majumder, Tuhin Chakrabarty
Current efforts in AI safety prioritize filtering harmful content, preventing manipulation of human behavior, and eliminating existential risks in cybersecurity or biosecurity. While pressing, this narrow focus overlooks critical human-centric considerations that shape the long-term trajectory of a society. In this position paper, we identify the risks of overlooking the impact of AI on the future of work and recommend comprehensive transition support towards the evolution of meaningful labor with human agency. Through the lens of economic theories, we highlight the intertemporal impacts of AI on human livelihood and the structural changes in labor markets that exacerbate income inequality. Additionally, the closed-source approach of major stakeholders in AI development resembles rent-seeking behavior through exploiting resources, breeding mediocrity in creative labor, and monopolizing innovation. To address this, we argue in favor of a robust international copyright anatomy supported by implementing collective licensing that ensures fair compensation mechanisms for using data to train AI models. We strongly recommend a pro-worker framework of global AI governance to enhance shared prosperity and economic justice while reducing technical debt.
Article 475
Title@2025-07-11 (5): From Sequence to Structure: Uncovering Substructure Reasoning in Transformers
Title: From Sequence to Structure: Uncovering Substructure Reasoning in Transformers | Von Sequenz zu Struktur: Enthüllen von Unterstrukturen in Transformern | 从序列到结构:在变换器中未覆盖子结构原因 2507.10435v1 |
Authors (7): Xinnan Dai, Kai Yang, Jay Revolinsky, Kai Guo, Aoran Wang, Bohang Zhang, Jiliang Tang
Recent studies suggest that large language models (LLMs) possess the capability to solve graph reasoning tasks. Notably, even when graph structures are embedded within textual descriptions, LLMs can still effectively answer related questions. This raises a fundamental question: How can a decoder-only Transformer architecture understand underlying graph structures? To address this, we start with the substructure extraction task, interpreting the inner mechanisms inside the transformers and analyzing the impact of the input queries. Specifically, through both empirical results and theoretical analysis, we present Induced Substructure Filtration (ISF), a perspective that captures the substructure identification in the multi-layer transformers. We further validate the ISF process in LLMs, revealing consistent internal dynamics across layers. Building on these insights, we explore the broader capabilities of Transformers in handling diverse graph types. Specifically, we introduce the concept of thinking in substructures to efficiently extract complex composite patterns, and demonstrate that decoder-only Transformers can successfully extract substructures from attributed graphs, such as molecular graphs. Together, our findings offer a new insight on how sequence-based Transformers perform the substructure extraction task over graph data.
Article 476
Title@2025-07-11 (5): BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity
Title: BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity | BlockFFN: Auf dem Weg zur End-Side Acceleration-Friendly Mixture-of-Experts mit Chunk-Level-Aktivierung Sparsity | 块块FFN: 向具有整块级激活分级的 终端- 双极加速- 友好混合混合专家方向 2507.08771v1 |
Authors (8): Chenyang Song, Weilin Zhao, Xu Han, Chaojun Xiao, Yingfa Chen, Yuxuan Li, Zhiyuan Liu, Maosong Sun
To alleviate the computational burden of large language models (LLMs), architectures with activation sparsity, represented by mixture-of-experts (MoE), have attracted increasing attention. However, the non-differentiable and inflexible routing of vanilla MoE hurts model performance. Moreover, while each token activates only a few parameters, these sparsely-activated architectures exhibit low chunk-level sparsity, indicating that the union of multiple consecutive tokens activates a large ratio of parameters. Such a sparsity pattern is unfriendly for acceleration under low-resource conditions (e.g., end-side devices) and incompatible with mainstream acceleration techniques (e.g., speculative decoding). To address these challenges, we introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), CLS-aware training objectives are designed, making BlockFFN more acceleration-friendly. Finally, we implement efficient acceleration kernels, combining activation sparsity and speculative decoding for the first time. The experimental results demonstrate the superior performance of BlockFFN over other MoE baselines, achieving over 80% TLS and 70% 8-token CLS. Our kernels achieve up to 3.67× speedup over dense models on real end-side devices. All code and checkpoints are publicly available (https://github.com/thunlp/BlockFFN).
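The difference between token-level and chunk-level sparsity can be made concrete with a small sketch. The definitions below are one plausible reading of the abstract (TLS as the average share of experts a single token leaves inactive, CLS as the share left inactive by the union of a run of consecutive tokens); the non-overlapping chunking, the function names, and the routing traces are assumptions for illustration only.

```python
from typing import List, Set


def token_level_sparsity(active: List[Set[int]], num_experts: int) -> float:
    """Average fraction of experts left inactive by each individual token."""
    return sum(1 - len(a) / num_experts for a in active) / len(active)


def chunk_level_sparsity(active: List[Set[int]], num_experts: int,
                         chunk: int) -> float:
    """Average fraction of experts left inactive by the union of each
    non-overlapping run of `chunk` consecutive tokens (ASSUMPTION)."""
    unions = [set().union(*active[i:i + chunk])
              for i in range(0, len(active) - chunk + 1, chunk)]
    return sum(1 - len(u) / num_experts for u in unions) / len(unions)


# Hypothetical routing traces: 4 tokens, 8 experts, 2 active experts per token.
scattered = [{0, 1}, {2, 3}, {4, 5}, {6, 7}]  # neighbours use different experts
clustered = [{0, 1}, {0, 1}, {1, 2}, {0, 2}]  # neighbours reuse experts

for name, trace in [("scattered", scattered), ("clustered", clustered)]:
    tls = token_level_sparsity(trace, num_experts=8)
    cls = chunk_level_sparsity(trace, num_experts=8, chunk=4)
    print(f"{name}: TLS={tls:.3f}, 4-token CLS={cls:.3f}")
```

Both traces have the same token-level sparsity, but only the clustered one leaves most experts untouched across the whole chunk. That is the property that matters when several tokens are processed together, as in speculative decoding.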
Article 477
Title@2025-07-11 (5): On Barriers to Archival Audio Processing
Title: On Barriers to Archival Audio Processing | Über Hindernisse für die Archivierung von Audio | 档案音频处理障碍问题 2507.08768v1 |
Authors (2): Peter Sullivan, Muhammad Abdul-Mageed
In this study, we leverage a unique UNESCO collection of mid-20th century radio recordings to probe the robustness of modern off-the-shelf language identification (LID) and speaker recognition (SR) methods, especially with respect to the impact of multilingual speakers and cross-age recordings. Our findings suggest that LID systems, such as Whisper, are increasingly adept at handling second-language and accented speech. However, speaker embeddings remain a fragile component of speech processing pipelines that is prone to biases related to the channel, age, and language. These issues will need to be overcome if archives aim to employ SR methods for speaker indexing.
Article 478
Title@2025-07-11 (5): Large Language Models in Mental Health Care: a Scoping Review
Title: Large Language Models in Mental Health Care: a Scoping Review | Große Sprachmodelle in der Psychischen Gesundheitsversorgung: ein Scoping Review | 精神保健中大语言模式:范围审查 2401.02984v3 |
Authors (12): Yining Hua, Fenglin Liu, Kailai Yang, Zehan Li, Hongbin Na, Yi-han Sheu, Peilin Zhou, Lauren V. Moran, Sophia Ananiadou, David A. Clifton, Andrew Beam, John Torous
Objective: This review aims to deliver a comprehensive analysis of Large Language Model (LLM) utilization in mental health care, evaluating their effectiveness, identifying challenges, and exploring their potential for future application. Materials and Methods: A systematic search was performed across multiple databases including PubMed, Web of Science, Google Scholar, arXiv, medRxiv, and PsyArXiv in November 2023. The review includes all types of original research, regardless of peer-review status, published or disseminated between October 1, 2019, and December 2, 2023. Studies were included without language restrictions if they employed LLMs developed after T5 and directly investigated research questions within mental health care settings. Results: Out of an initial 313 articles, 34 were selected based on their relevance to LLM applications in mental health care and the rigor of their reported outcomes. The review identified various LLM applications in mental health care, including diagnostics, therapy, and enhancing patient engagement. Key challenges highlighted were related to data availability and reliability, the nuanced handling of mental states, and effective evaluation methods. While LLMs showed promise in improving accuracy and accessibility, significant gaps in clinical applicability and ethical considerations were noted. Conclusion: LLMs hold substantial promise for enhancing mental health care. For their full potential to be realized, emphasis must be placed on developing robust datasets, development and evaluation frameworks, ethical guidelines, and interdisciplinary collaborations to address current limitations.
Article 479
Title@2025-07-11 (5): Weak-to-Strong Jailbreaking on Large Language Models
Title: Weak-to-Strong Jailbreaking on Large Language Models | Schwach-zu-starkes Gefängnis mit großen Sprachmodellen | 关于大语言模型的弱至强强监狱破解 2401.17256v4 |
Authors (7): Xuandong Zhao, Xianjun Yang, Tianyu Pang, Chao Du, Lei Li, Yu-Xiang Wang, William Yang Wang
Large language models (LLMs) are vulnerable to jailbreak attacks, resulting in harmful, unethical, or biased text generations. However, existing jailbreaking methods are computationally costly. In this paper, we propose the weak-to-strong jailbreaking attack, an efficient inference-time attack that makes aligned LLMs produce harmful text. Our key intuition is based on the observation that jailbroken and aligned models only differ in their initial decoding distributions. The weak-to-strong attack’s key technical insight is using two smaller models (a safe and an unsafe one) to adversarially modify a significantly larger safe model’s decoding probabilities. We evaluate the weak-to-strong attack on 5 diverse open-source LLMs from 3 organizations. The results show our method can increase the misalignment rate to over 99% on two datasets with just one forward pass per example. Our study exposes an urgent safety issue that needs to be addressed when aligning LLMs. As an initial attempt, we propose a defense strategy to protect against such attacks, but creating more advanced defenses remains challenging. The code for replicating the method is available at https://github.com/XuandongZhao/weak-to-strong
Article 480
Title@2025-07-11 (5): Multilingual Multimodal Software Developer for Code Generation
Title: Multilingual Multimodal Software Developer for Code Generation | Mehrsprachiger multimodaler Softwareentwickler für die Codegenerierung | 用于代码生成的多语言多语种多式软件开发器 2507.08719v1 |
Authors (15): Linzheng Chai, Jian Yang, Shukai Liu, Wei Zhang, Liran Wang, Ke Jin, Tao Sun, Congnan Liu, Chenchen Zhang, Hualei Zhu, Jiaheng Liu, Xianjie Wu, Ge Zhang, Tianyu Liu, Zhoujun Li
The rapid advancement of Large Language Models (LLMs) has significantly improved code generation, yet most models remain text-only, neglecting crucial visual aids like diagrams and flowcharts used in real-world software development. To bridge this gap, we introduce MM-Coder, a Multilingual Multimodal software developer. MM-Coder integrates visual design inputs, namely Unified Modeling Language (UML) diagrams and flowcharts (termed Visual Workflow), with textual instructions to enhance code generation accuracy and architectural alignment. To enable this, we developed MMc-Instruct, a diverse multimodal instruction-tuning dataset including visual-workflow-based code generation, allowing MM-Coder to synthesize textual and graphical information like human developers, distinct from prior work on narrow tasks. Furthermore, we introduce MMEval, a new benchmark for evaluating multimodal code generation, addressing existing text-only limitations. Our evaluations using MMEval highlight significant remaining challenges for models in precise visual information capture, instruction following, and advanced programming knowledge. Our work aims to revolutionize industrial programming by enabling LLMs to interpret and implement complex specifications conveyed through both text and visual designs.
Article 481
Title@2025-07-11 (5): Evaluating LLMs in Medicine: A Call for Rigor, Transparency
Title: Evaluating LLMs in Medicine: A Call for Rigor, Transparency | Bewertung von LLMs in der Medizin: Ein Ruf nach Starrheit, Transparenz | 医学领域评价LLMs:调用Rigor,透明 2507.08916v1 |
Authors (4): Mahmoud Alwakeel, Aditya Nagori, Vijay Krishnamoorthy, Rishikesan Kamaleswaran
Objectives: To evaluate the current limitations of large language models (LLMs) in medical question answering, focusing on the quality of datasets used for their evaluation. Materials and Methods: Widely-used benchmark datasets, including MedQA, MedMCQA, PubMedQA, and MMLU, were reviewed for their rigor, transparency, and relevance to clinical scenarios. Alternatives, such as challenge questions in medical journals, were also analyzed to identify their potential as unbiased evaluation tools. Results: Most existing datasets lack clinical realism, transparency, and robust validation processes. Publicly available challenge questions offer some benefits but are limited by their small size, narrow scope, and exposure to LLM training. These gaps highlight the need for secure, comprehensive, and representative datasets. Conclusion: A standardized framework is critical for evaluating LLMs in medicine. Collaborative efforts among institutions and policymakers are needed to ensure datasets and methodologies are rigorous, unbiased, and reflective of clinical complexities.
Article 482
Title@2025-07-11 (5): KG-Attention: Knowledge Graph-Guided Attention at Test-Time via Bidirectional Information Aggregation
Title: KG-Attention: Knowledge Graph-Guided Attention at Test-Time via Bidirectional Information Aggregation | KG-Achtung: Wissen Graphengeführte Aufmerksamkeit zur Testzeit über bidirektionale Informationsaggregation | KG-注意:通过双向信息聚合在试验时以知识图表引导的注意 2507.08704v1 |
Authors (3): Songlin Zhai, Guilin Qi, Yuan Meng
Knowledge graphs (KGs) play a critical role in enhancing large language models (LLMs) by introducing structured and grounded knowledge into the learning process. However, most existing KG-enhanced approaches rely on parameter-intensive fine-tuning, which risks catastrophic forgetting and degrades the pretrained model’s generalization. Moreover, they exhibit limited adaptability to real-time knowledge updates due to their static integration frameworks. To address these issues, we introduce the first test-time KG-augmented framework for LLMs, built around a dedicated knowledge graph-guided attention (KGA) module that enables dynamic knowledge fusion without any parameter updates. The proposed KGA module augments the standard self-attention mechanism with two synergistic pathways: outward and inward aggregation. Specifically, the outward pathway dynamically integrates external knowledge into input representations via input-driven KG fusion. This inward aggregation complements the outward pathway by refining input representations through KG-guided filtering, suppressing task-irrelevant signals and amplifying knowledge-relevant patterns. Importantly, while the outward pathway handles knowledge fusion, the inward path selects the most relevant triples and feeds them back into the fusion process, forming a closed-loop enhancement mechanism. By synergistically combining these two pathways, the proposed method supports real-time knowledge fusion exclusively at test-time, without any parameter modification. Extensive experiments on five benchmarks verify the comparable knowledge fusion performance of KGA.
Article 483
Title@2025-07-11 (5): Multi-Token Attention
Title: Multi-Token Attention | Multi-Token-Achtung | 多当式注意 2504.00927v2 |
Authors (4): Olga Golovneva, Tianlu Wang, Jason Weston, Sainbayar Sukhbaatar
Soft attention is a critical mechanism powering LLMs to locate relevant parts within a given context. However, individual attention weights are determined by the similarity of only a single query and key token vector. This “single token attention” bottlenecks the amount of information used in distinguishing a relevant part from the rest of the context. To address this issue, we propose a new attention method, Multi-Token Attention (MTA), which allows LLMs to condition their attention weights on multiple query and key vectors simultaneously. This is achieved by applying convolution operations over queries, keys and heads, allowing nearby queries and keys to affect each other’s attention weights for more precise attention. As a result, our method can locate relevant context using richer, more nuanced information that can exceed a single vector’s capacity. Through extensive evaluations, we demonstrate that MTA achieves enhanced performance on a range of popular benchmarks. Notably, it outperforms Transformer baseline models on standard language modeling tasks, and on tasks that require searching for information within long contexts, where our method’s ability to leverage richer information proves particularly beneficial.
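The idea of letting neighbouring queries and keys influence each other's attention weights can be sketched as a small 2D convolution over the attention logits before the softmax. This is a simplified illustration only: it covers a single head, omits causal masking and the head-mixing convolution mentioned in the abstract, and the kernel values and function names are assumptions rather than the authors' configuration.

```python
import math
from typing import List

Matrix = List[List[float]]


def attention_logits(q: Matrix, k: Matrix) -> Matrix:
    """Scaled dot-product logits: one row per query, one column per key."""
    d = len(q[0])
    return [[sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
            for qi in q]


def conv2d_same(x: Matrix, kernel: Matrix) -> Matrix:
    """Zero-padded 2D cross-correlation over (query, key) positions."""
    n, m = len(x), len(x[0])
    kh, kw = len(kernel), len(kernel[0])
    ph, pw = kh // 2, kw // 2
    out = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = 0.0
            for a in range(kh):
                for b in range(kw):
                    ii, jj = i + a - ph, j + b - pw
                    if 0 <= ii < n and 0 <= jj < m:
                        acc += kernel[a][b] * x[ii][jj]
            out[i][j] = acc
    return out


def softmax_rows(x: Matrix) -> Matrix:
    """Numerically stable softmax applied to each row."""
    out = []
    for row in x:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def multi_token_attention_weights(q: Matrix, k: Matrix, kernel: Matrix) -> Matrix:
    """Attention weights where each logit is mixed with neighbouring logits."""
    return softmax_rows(conv2d_same(attention_logits(q, k), kernel))


# Toy inputs (hypothetical values): 3 queries, 3 keys, dimension 2.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
mixing = [[0.0, 0.1, 0.0], [0.1, 0.6, 0.1], [0.0, 0.1, 0.0]]

plain = softmax_rows(attention_logits(q, k))
same_as_plain = multi_token_attention_weights(q, k, identity)
mixed = multi_token_attention_weights(q, k, mixing)
```

With the identity kernel the result reduces to ordinary single-token attention, which makes the convolution easy to read as a strict generalisation. A non-trivial kernel lets each weight depend on several query and key vectors at once.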
Article 484
Title@2025-07-11 (5): KELPS: A Framework for Verified Multi-Language Autoformalization via Semantic-Syntactic Alignment
Title: KELPS: A Framework for Verified Multi-Language Autoformalization via Semantic-Syntactic Alignment | KELPS: Ein Rahmen für eine verifizierte Mehrsprachen-Autoformalisierung durch semantisch-syntaktische Ausrichtung | KELPS: 通过语义- 合成协调校验多语言自动正规化框架 2507.08665v1 |
Authors (5): Jiyao Zhang, Chengli Zhong, Hui Xu, Qige Li, Yi Zhou
Modern large language models (LLMs) show promising progress in formalizing informal mathematics into machine-verifiable theorems. However, these methods still face bottlenecks due to the limited quantity and quality of multilingual parallel corpora. In this paper, we propose a novel neuro-symbolic framework KELPS (Knowledge-Equation based Logical Processing System) to address these problems. KELPS is an iterative framework for translating, synthesizing, and filtering informal data into multiple formal languages (Lean, Coq, and Isabelle). First, we translate natural language into Knowledge Equations (KEs), a novel language that we designed, theoretically grounded in assertional logic. Next, we convert them to target languages through rigorously defined rules that preserve both syntactic structure and semantic meaning. This process yielded a parallel corpus of over 60,000 problems. Our framework achieves 88.9% syntactic accuracy (pass@1) on MiniF2F, outperforming SOTA models such as Deepseek-V3 (81%) and Herald (81.3%) across multiple datasets. All datasets and codes are available in the supplementary materials.
Article 485
Title@2025-07-11 (5): The Impact of Automatic Speech Transcription on Speaker Attribution
Title: The Impact of Automatic Speech Transcription on Speaker Attribution | Die Auswirkungen der automatischen Sprachtranskription auf die Sprecherzuweisung | 自动发言限制对议长权力的影响 2507.08660v1 |
Authors (4): Cristina Aggazzotti, Matthew Wiesner, Elizabeth Allyn Smith, Nicholas Andrews
Speaker attribution from speech transcripts is the task of identifying a speaker from the transcript of their speech based on patterns in their language use. This task is especially useful when the audio is unavailable (e.g. deleted) or unreliable (e.g. anonymized speech). Prior work in this area has primarily focused on the feasibility of attributing speakers using transcripts produced by human annotators. However, in real-world settings, one often only has more errorful transcripts produced by automatic speech recognition (ASR) systems. In this paper, we conduct what is, to our knowledge, the first comprehensive study of the impact of automatic transcription on speaker attribution performance. In particular, we study the extent to which speaker attribution performance degrades in the face of transcription errors, as well as how properties of the ASR system impact attribution. We find that attribution is surprisingly resilient to word-level transcription errors and that the objective of recovering the true transcript is minimally correlated with attribution performance. Overall, our findings suggest that speaker attribution on more errorful transcripts produced by ASR is as good as, if not better than, attribution based on human-transcribed data, possibly because ASR transcription errors can capture speaker-specific features that reveal speaker identity.
Article 486
Title@2025-07-11 (5): Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery
Title: Open Source Planning & Control System with Language Agents for Autonomous Scientific Discovery | Open Source Planning & Control System mit Language Agents für autonome wissenschaftliche Entdeckung | 拥有自主科学发现语言代理的开放源规划和控制系统 2507.07257v2 |
Authors (26): Licong Xu, Milind Sarkar, Anto I. Lonappan, Íñigo Zubeldia, Pablo Villanueva-Domingo, Santiago Casas, Christian Fidler, Chetana Amancharla, Ujjwal Tiwari, Adrian Bayer, Chadi Ait Ekioui, Miles Cranmer, Adrian Dimitrov, James Fergusson, Kahaan Gandhi, Sven Krippendorf, Andrew Laverick, Julien Lesgourgues, Antony Lewis, Thomas Meier, Blake Sherwin, Kristen Surrao, Francisco Villaescusa-Navarro, Chi Wang, Xueqing Xu, Boris Bolliet
We present a multi-agent system for automation of scientific research tasks, cmbagent (https://github.com/CMBAgents/cmbagent). The system is formed by about 30 Large Language Model (LLM) agents and implements a Planning & Control strategy to orchestrate the agentic workflow, with no human in the loop at any point. Each agent specializes in a different task (performing retrieval on scientific papers and codebases, writing code, interpreting results, critiquing the output of other agents) and the system is able to execute code locally. We successfully apply cmbagent to carry out a PhD-level cosmology task (the measurement of cosmological parameters using supernova data) and evaluate its performance on two benchmark sets, finding superior performance over state-of-the-art LLMs. The source code is available on GitHub, demonstration videos are also available, and the system is deployed on HuggingFace and will be available on the cloud.
Article 487
Title@2025-07-11 (5): Scaling Attention to Very Long Sequences in Linear Time with Wavelet-Enhanced Random Spectral Attention (WERSA)
Title: Scaling Attention to Very Long Sequences in Linear Time with Wavelet-Enhanced Random Spectral Attention (WERSA) | Skalierung der Aufmerksamkeit auf sehr lange Sequenzen in linearer Zeit mit Wavelet-erweiterter Zufallsspektral-Achtung (WERSA) | 以波浪增强随机光谱注意, 将注意力转向线性时间的甚长序列( WERSA) 2507.08637v1 |
Authors (1): Vincenzo Dentamaro
Transformer models are computationally costly on long sequences since regular attention has quadratic $O(n^2)$ time complexity. We introduce Wavelet-Enhanced Random Spectral Attention (WERSA), a novel mechanism with linear $O(n)$ time complexity that enables long-sequence processing without a performance trade-off. WERSA merges content-adaptive random spectral features with multi-resolution Haar wavelets and learnable parameters to selectively attend to informative scales of data while preserving linear efficiency. Large-scale comparisons on a single GPU, across various benchmarks (vision, NLP, hierarchical reasoning) and various attention mechanisms (such as Multiheaded Attention, Flash-Attention-2, FNet, Linformer, Performer, Waveformer), reveal uniform advantages of WERSA. It achieves the best accuracy in all tests. On ArXiv classification, WERSA improves accuracy over vanilla attention by 1.2% (86.2% vs 85.0%) while cutting training time by 81% (296s vs 1554s) and FLOPs by 73.4% (26.2G vs 98.4G). Significantly, WERSA excels where vanilla attention and FlashAttention-2 fail: on ArXiv-128k’s extremely lengthy sequences, it achieves the best accuracy (79.1%) and AUC (0.979) among viable methods, operating on data that causes Out-Of-Memory errors for quadratic methods while being twice as fast as Waveformer, its next-best competitor. By significantly reducing computational loads without compromising accuracy, WERSA makes possible more practical, more affordable long-context models, in particular on low-resource hardware, for more sustainable and more scalable AI development.
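The relative savings quoted in this abstract can be re-derived from the raw figures it reports. The short sketch below is only an arithmetic check; the helper name `relative_reduction` is made up for illustration and is not part of WERSA or its code. The reported 1.2% accuracy gain is consistent with an absolute difference in percentage points.

```python
def relative_reduction(new: float, baseline: float) -> float:
    """Fraction by which `new` is smaller than `baseline`."""
    return 1.0 - new / baseline


# Raw figures as quoted in the abstract (ArXiv classification task).
train_time_cut = relative_reduction(296.0, 1554.0)  # seconds: WERSA vs vanilla
flops_cut = relative_reduction(26.2, 98.4)          # GFLOPs: WERSA vs vanilla
accuracy_gain_points = 86.2 - 85.0                  # absolute percentage points

print(f"training time reduction: {train_time_cut:.1%}")
print(f"FLOPs reduction: {flops_cut:.1%}")
print(f"accuracy gain: {accuracy_gain_points:.1f} points")
```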
Article 488
Title@2025-07-11 (5): Text2BIM: Generating Building Models Using a Large Language Model-based Multi-Agent Framework
Title: Text2BIM: Generating Building Models Using a Large Language Model-based Multi-Agent Framework | Text2BIM: Generierung von Baumodellen mit Hilfe eines Multi-Agent-Frameworks auf Basis eines großen Sprachmodells | Text2BIM:利用以大语言模式为基础的多机构机构框架生成建筑模型 2408.08054v2 |
Authors (4): Changyu Du, Sebastian Esser, Stavros Nousias, André Borrmann
The conventional BIM authoring process typically requires designers to master complex and tedious modeling commands in order to materialize their design intentions within BIM authoring tools. This additional cognitive burden complicates the design process and hinders the adoption of BIM and model-based design in the AEC (Architecture, Engineering, and Construction) industry. To facilitate the expression of design intentions more intuitively, we propose Text2BIM, an LLM-based multi-agent framework that can generate 3D building models from natural language instructions. This framework orchestrates multiple LLM agents to collaborate and reason, transforming textual user input into imperative code that invokes the BIM authoring tool’s APIs, thereby generating editable BIM models with internal layouts, external envelopes, and semantic information directly in the software. Furthermore, a rule-based model checker is introduced into the agentic workflow, utilizing predefined domain knowledge to guide the LLM agents in resolving issues within the generated models and iteratively improving model quality. Extensive experiments were conducted to compare and analyze the performance of three different LLMs under the proposed framework. The evaluation results demonstrate that our approach can effectively generate high-quality, structurally rational building models that are aligned with the abstract concepts specified by user input. Finally, an interactive software prototype was developed to integrate the framework into the BIM authoring software Vectorworks, showcasing the potential of modeling by chatting. The code is available at: https://github.com/dcy0577/Text2BIM
Article 489
Title@2025-07-11 (5): Red Teaming Large Language Models for Healthcare
Title: Red Teaming Large Language Models for Healthcare | Red Teaming große Sprachmodelle für das Gesundheitswesen | 红队大语言保健模式 2505.00467v2 |
Authors (35): Vahid Balazadeh, Michael Cooper, David Pellow, Atousa Assadi, Jennifer Bell, Mark Coatsworth, Kaivalya Deshpande, Jim Fackler, Gabriel Funingana, Spencer Gable-Cook, Anirudh Gangadhar, Abhishek Jaiswal, Sumanth Kaja, Christopher Khoury, Amrit Krishnan, Randy Lin, Kaden McKeen, Sara Naimimohasses, Khashayar Namdar, Aviraj Newatia, Allan Pang, Anshul Pattoo, Sameer Peesapati, Diana Prepelita, Bogdana Rakova, Saba Sadatamin, Rafael Schulman, Ajay Shah, Syed Azhar Shah, Syed Ahmar Shah, Babak Taati, Balagopal Unnikrishnan, Iñigo Urteaga, Stephanie Williams, Rahul G Krishnan
We present the design process and findings of the pre-conference workshop at the Machine Learning for Healthcare Conference (2024) entitled Red Teaming Large Language Models for Healthcare, which took place on August 15, 2024. Conference participants, comprising a mix of computational and clinical expertise, attempted to discover vulnerabilities – realistic clinical prompts for which a large language model (LLM) outputs a response that could cause clinical harm. Red-teaming with clinicians enables the identification of LLM vulnerabilities that may not be recognised by LLM developers lacking clinical expertise. We report the vulnerabilities found, categorise them, and present the results of a replication study assessing the vulnerabilities across all LLMs provided.
Article 490
Title@2025-07-11 (5): A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1
Title: A comprehensive study of LLM-based argument classification: from LLAMA through GPT-4o to Deepseek-R1 | Eine umfassende Studie der LLM-basierten Argumentationsklassifikation: von LLAMA über GPT-4o bis Deepseek-R1 | 关于以LLM为基础的理论分类的全面研究:从LLAMA到GPT-4o到Deepseek-R1 2507.08621v1 |
Authors (5): Marcin Pietroń, Rafał Olszowski, Jakub Gomułka, Filip Gampel, Andrzej Tomski
Argument mining (AM) is an interdisciplinary research field that integrates insights from logic, philosophy, linguistics, rhetoric, law, psychology, and computer science. It involves the automatic identification and extraction of argumentative components, such as premises and claims, and the detection of relationships between them, such as support, attack, or neutrality. Recently, the field has advanced significantly, especially with the advent of large language models (LLMs), which have enhanced the efficiency of analyzing and extracting argument semantics compared to traditional methods and other deep learning models. There are many benchmarks for testing and verifying the quality of LLMs, but there is still a lack of research and results on how these models perform on publicly available argument classification databases. This paper presents a study of a selection of LLMs, using diverse datasets such as Args.me and UKP. The models tested include versions of GPT, Llama, and DeepSeek, along with reasoning-enhanced variants incorporating the Chain-of-Thoughts algorithm. The results indicate that ChatGPT-4o outperforms the others in the argument classification benchmarks. Among models with built-in reasoning capabilities, Deepseek-R1 performs best. However, despite their superiority, GPT-4o and Deepseek-R1 still make errors. The most common errors are discussed for all models. To our knowledge, the presented work is the first broader analysis of the mentioned datasets using LLMs and prompt algorithms. The work also shows some weaknesses of known prompt algorithms in argument analysis, while indicating directions for their improvement. The added value of the work is the in-depth analysis of the available argument datasets and the demonstration of their shortcomings.
Article 491
Title@2025-07-11 (5): Addressing Pitfalls in Auditing Practices of Automatic Speech Recognition Technologies: A Case Study of People with Aphasia
Title: Addressing Pitfalls in Auditing Practices of Automatic Speech Recognition Technologies: A Case Study of People with Aphasia | Umgang mit Pitfalls bei der Prüfung von Praktiken automatischer Spracherkennungstechnologien: Eine Fallstudie von Menschen mit Aphasie | 解决自动语音识别技术审计做法中的缺陷:阿法西亚人案例研究 2506.08846v2 |
Authors (5): Katelyn Xiaoying Mei, Anna Seo Gyeong Choi, Hilke Schellmann, Mona Sloane, Allison Koenecke
Automatic Speech Recognition (ASR) has transformed daily tasks from video transcription to workplace hiring. ASR systems’ growing use warrants robust and standardized auditing approaches to ensure automated transcriptions of high and equitable quality. This is especially critical for people with speech and language disorders (such as aphasia) who may disproportionately depend on ASR systems to navigate everyday life. In this work, we identify three pitfalls in existing standard ASR auditing procedures, and demonstrate how addressing them impacts audit results via a case study of six popular ASR systems’ performance for aphasia speakers. First, audits often adhere to a single method of text standardization during data pre-processing, which (a) masks variability in ASR performance from applying different standardization methods, and (b) may not be consistent with how users - especially those from marginalized speech communities - would want their transcriptions to be standardized. Second, audits often display high-level demographic findings without further considering performance disparities among (a) more nuanced demographic subgroups, and (b) relevant covariates capturing acoustic information from the input audio. Third, audits often rely on a single gold-standard metric – the Word Error Rate – which does not fully capture the extent of errors arising from generative AI models, such as transcription hallucinations. We propose a more holistic auditing framework that accounts for these three pitfalls, and exemplify its results in our case study, finding consistently worse ASR performance for aphasia speakers relative to a control group. We call on practitioners to implement these robust ASR auditing practices that remain flexible to the rapidly changing ASR landscape.
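The first pitfall, that the choice of text standardization changes the measured Word Error Rate, can be illustrated with a small sketch. The WER definition used here is the standard one (word-level edit distance divided by reference length). The three standardization schemes, the filler-word list, and the example sentences are invented for illustration and are not taken from the paper's audit.

```python
import re


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, substitution))
        prev = cur
    return prev[-1] / len(ref)


def keep_raw(text: str) -> str:
    return text


def lowercase_strip_punct(text: str) -> str:
    # Keep word characters, whitespace, and apostrophes only.
    return re.sub(r"[^\w\s']", "", text.lower())


def also_drop_fillers(text: str) -> str:
    # Hypothetical filler list; a real audit would need to justify this choice.
    fillers = {"um", "uh"}
    return " ".join(w for w in lowercase_strip_punct(text).split() if w not in fillers)


reference = "Um, I want... I want the water."   # invented human transcript
hypothesis = "i want i want the water"           # invented ASR output

wer_by_scheme = {
    scheme.__name__: word_error_rate(scheme(reference), scheme(hypothesis))
    for scheme in (keep_raw, lowercase_strip_punct, also_drop_fillers)
}
for name, wer in wer_by_scheme.items():
    print(f"{name}: WER = {wer:.3f}")
```

The same transcript pair yields three different scores depending on the scheme, which is why reporting results under a single standardization can mask variability. Whether disfluencies such as fillers should be removed at all is exactly the kind of decision the abstract suggests should reflect how affected users want their speech transcribed.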
Article 492
Title@2025-07-11 (5): Anthropomimetic Uncertainty: What Verbalized Uncertainty in Language Models is Missing
Title: Anthropomimetic Uncertainty: What Verbalized Uncertainty in Language Models is Missing | Anthropomische Unsicherheit: Was verbalisierte Unsicherheit in Sprachmodellen fehlt | 人文工程学不确定性:语言模型中什么是虚无的不确定性 2507.10587v1 |
Authors (4): Dennis Ulmer, Alexandra Lorson, Ivan Titov, Christian Hardmeier
Human users increasingly rely on natural language interactions with large language models (LLMs) in order to receive help on a large variety of tasks and problems. However, the trustworthiness and perceived legitimacy of LLMs are undermined by the fact that their output is frequently stated in very confident terms, even when its accuracy is questionable. Therefore, there is a need to signal the confidence of the language model to a user in order to reap the benefits of human-machine collaboration and mitigate potential harms. Verbalized uncertainty is the expression of confidence with linguistic means, an approach that integrates perfectly into language-based interfaces. Nevertheless, most recent research in natural language processing (NLP) overlooks the nuances surrounding human uncertainty communication and the data biases that influence machine uncertainty communication. We argue for anthropomimetic uncertainty, meaning that intuitive and trustworthy uncertainty communication requires a degree of linguistic authenticity and personalization to the user, which could be achieved by emulating human communication. We present a thorough overview of the research in human uncertainty communication, survey ongoing research, and perform additional analyses to demonstrate so-far overlooked biases in verbalized uncertainty. We conclude by pointing out unique factors in human-machine communication of uncertainty and deconstruct anthropomimetic uncertainty into future research directions for NLP.
Article 493
Title@2025-07-11 (5): AutoRAG-LoRA: Hallucination-Triggered Knowledge Retuning via Lightweight Adapters
Title: AutoRAG-LoRA: Hallucination-Triggered Knowledge Retuning via Lightweight Adapters | AutoRAG-LoRA: Halluzination-Triggered Knowledge Retuning über Leichtbauadapter | AURAG-LORA:通过轻度适应器进行幻觉-交错知识调整 2507.10586v1 |
Authors (2): Kaushik Dwivedi, Padmanabh Patanjali Mishra
Large Language Models (LLMs) have demonstrated remarkable fluency across a range of natural language tasks, yet remain vulnerable to hallucinations - factual inaccuracies that undermine trust in real-world deployment. We present AutoRAG-LoRA, a modular framework for Retrieval-Augmented Generation (RAG) that tackles hallucination in large language models through lightweight LoRA-based adapters and KL-regularized training. Our pipeline integrates automated prompt rewriting, hybrid retrieval, and low-rank adapter tuning to ground responses in retrieved evidence. A hallucination detection module, using both classifier-based and self-evaluation techniques, assigns confidence scores to generated outputs, triggering an optional feedback correction loop. This loop enforces factual alignment via contrastive KL loss and adapter fine-tuning. We demonstrate that AutoRAG-LoRA significantly reduces factual drift while preserving the efficiency and modularity of the model.
Article 494
Title@2025-07-11 (5): Medical Red Teaming Protocol of Language Models: On the Importance of User Perspectives in Healthcare Settings
Title: Medical Red Teaming Protocol of Language Models: On the Importance of User Perspectives in Healthcare Settings | Medical Red Teaming Protocol of Language Models: Über die Bedeutung der Nutzerperspektiven in der Gesundheitsversorgung | 语言模式医学红队模式医疗红队协议:关于保健机构用户观点的重要性 2507.07248v2 |
Authors (5): Jean-Philippe Corbeil, Minseon Kim, Alessandro Sordoni, Francois Beaulieu, Paul Vozila
As the performance of large language models (LLMs) continues to advance, their adoption is expanding across a wide range of domains, including the medical field. The integration of LLMs into medical applications raises critical safety concerns, particularly due to their use by users with diverse roles, e.g. patients and clinicians, and the potential for model outputs to directly affect human health. Despite the domain-specific capabilities of medical LLMs, prior safety evaluations have largely focused only on general safety benchmarks. In this paper, we introduce a safety evaluation protocol tailored to the medical domain from both patient and clinician user perspectives, alongside general safety assessments, and quantitatively analyze the safety of medical LLMs. We bridge a gap in the literature by building PatientSafetyBench, which contains 466 samples over 5 critical categories, to measure safety from the perspective of the patient. We apply our red-teaming protocols on the MediPhi model collection as a case study. To our knowledge, this is the first work to define safety evaluation criteria for medical LLMs through targeted red-teaming taking three different points of view - patient, clinician, and general user - establishing a foundation for safer deployment in medical domains.
Article 495
Title@2025-07-11 (5): Large Multi-modal Model Cartographic Map Comprehension for Textual Locality Georeferencing
Title: Large Multi-modal Model Cartographic Map Comprehension for Textual Locality Georeferencing | Großes multimodales Modell Kartographische Karte Verständnis für Textlokalität Georeferenzierung | 大型多模式地图地图图图图集模型 2507.08575v1 |
Authors (3): Kalana Wijegunarathna, Kristin Stock, Christopher B. Jones
Millions of biological sample records collected over the last few centuries and archived in natural history collections are un-georeferenced. Georeferencing the complex locality descriptions associated with these collection samples is a highly labour-intensive task that collection agencies struggle with. None of the existing automated methods exploit maps, which are an essential tool for georeferencing complex relations. We present preliminary experiments and results of a novel method that exploits the multi-modal capabilities of recent Large Multi-Modal Models (LMMs). This method enables the model to visually contextualize spatial relations it reads in the locality description. We use a grid-based approach to adapt these auto-regressive models for this task in a zero-shot setting. Our experiments, conducted on a small manually annotated dataset, show impressive results for our approach ($\sim$1 km average distance error) compared to uni-modal georeferencing with Large Language Models and existing georeferencing tools. The paper also discusses the findings of the experiments in light of an LMM’s ability to comprehend fine-grained maps. Motivated by these results, a practical framework is proposed to integrate this method into a georeferencing workflow.
Article 496
Title@2025-07-11 (5): A Taxonomy for Design and Evaluation of Prompt-Based Natural Language Explanations
Title: A Taxonomy for Design and Evaluation of Prompt-Based Natural Language Explanations | Eine Taxonomie für Design und Evaluation von prompt-basierenden Naturspracherklärungen | 设计和评价快速自然语言解释的分类学 2507.10585v1 |
Authors (4): Isar Nejadgholi, Mona Omidyeganeh, Marc-Antoine Drouin, Jonathan Boisvert
Effective AI governance requires structured approaches for stakeholders to access and verify AI system behavior. With the rise of large language models, Natural Language Explanations (NLEs) are now key to articulating model behavior, which necessitates a focused examination of their characteristics and governance implications. We draw on Explainable AI (XAI) literature to create an updated XAI taxonomy, adapted to prompt-based NLEs, across three dimensions: (1) Context, including task, data, audience, and goals; (2) Generation and Presentation, covering generation methods, inputs, interactivity, outputs, and forms; and (3) Evaluation, focusing on content, presentation, and user-centered properties, as well as the setting of the evaluation. This taxonomy provides a framework for researchers, auditors, and policymakers to characterize, design, and enhance NLEs for transparent AI systems.
Article 497
Title@2025-07-11 (5): Comparing Spoken Languages using Paninian System of Sounds and Finite State Machines
Title: Comparing Spoken Languages using Paninian System of Sounds and Finite State Machines | Vergleich der gesprochenen Sprachen mit Paninian System of Sounds und Finite State Machines | 使用波尼尼亚音响和有限国家机器系统比较口语 2301.12463v3 |
Authors (2): Shreekanth M Prabhu, Abhisek Midya
The study of spoken languages comprises phonology, morphology, and grammar. The languages can be classified as root languages, inflectional languages, and stem languages. In addition, languages continually change over time and space by picking up isoglosses, as speakers move from region to/through region. All these factors lead to the formation of vocabulary, which has commonality/similarity across languages as well as distinct and subtle differences among them. Comparison of vocabularies across languages and detailed analysis has led to the hypothesis of language families. In particular, in the view of Western linguists, Vedic Sanskrit is a daughter language, part of the Indo-Iranian branch of the Indo-European Language family, and Dravidian Languages belong to an entirely different family. These and similar conclusions are reexamined in this paper. Based on our study and analysis, we propose an Ecosystem Model for Linguistic Development with Sanskrit at the core, in place of the widely accepted family tree model. To that end, we leverage the Paninian system of sounds to construct a phonetic map. Then we represent words across languages as state transitions on the phonetic map and construct corresponding Morphological Finite Automata (MFA) that accept groups of words. Regardless of whether the contribution of this paper is significant or minor, it is an important step in challenging policy-driven research that has plagued this field.
Article 498
Title@2025-07-11 (5): The AI Language Proficiency Monitor – Tracking the Progress of LLMs on Multilingual Benchmarks
Title: The AI Language Proficiency Monitor – Tracking the Progress of LLMs on Multilingual Benchmarks | Der KI-Sprachkompetenzmonitor – Aufspüren des Fortschritts von LLMs auf mehrsprachigen Benchmarks | AI 语言能力监测 – – 跟踪多语种基准问题LLMs的进展情况 2507.08538v1 |
Authors (3): David Pomerenke, Jonas Nothnagel, Simon Ostermann
To ensure equitable access to the benefits of large language models (LLMs), it is essential to evaluate their capabilities across the world’s languages. We introduce the AI Language Proficiency Monitor, a comprehensive multilingual benchmark that systematically assesses LLM performance across up to 200 languages, with a particular focus on low-resource languages. Our benchmark aggregates diverse tasks including translation, question answering, math, and reasoning, using datasets such as FLORES+, MMLU, GSM8K, TruthfulQA, and ARC. We provide an open-source, auto-updating leaderboard and dashboard that supports researchers, developers, and policymakers in identifying strengths and gaps in model performance. In addition to ranking models, the platform offers descriptive insights such as a global proficiency map and trends over time. By complementing and extending prior multilingual benchmarks, our work aims to foster transparency, inclusivity, and progress in multilingual AI. The system is available at https://huggingface.co/spaces/fair-forward/evals-for-every-language.
Article 499
Title@2025-07-11 (5): A Multi-granularity Concept Sparse Activation and Hierarchical Knowledge Graph Fusion Framework for Rare Disease Diagnosis
Title: A Multi-granularity Concept Sparse Activation and Hierarchical Knowledge Graph Fusion Framework for Rare Disease Diagnosis | Multi-Granularität Konzept Sparse Aktivierung und Hierarchisches Wissen Graph Fusion Framework für Seltene Krankheiten Diagnose | 罕见疾病诊断多发性概念分散活动和等级知识图集融合框架 2507.08529v1 |
Authors (5): Mingda Zhang, Na Zhao, Jianglong Qin, Guoyu Ye, Ruixiang Tang
Despite advances from medical large language models in healthcare, rare-disease diagnosis remains hampered by insufficient knowledge-representation depth, limited concept understanding, and constrained clinical reasoning. We propose a framework that couples multi-granularity sparse activation of medical concepts with a hierarchical knowledge graph. Four complementary matching algorithms, diversity control, and a five-level fallback strategy enable precise concept activation, while a three-layer knowledge graph (taxonomy, clinical features, instances) provides structured, up-to-date context. Experiments on the BioASQ rare-disease QA set show BLEU gains of 0.09, ROUGE gains of 0.05, and accuracy gains of 0.12, with peak accuracy of 0.89 approaching the 0.90 clinical threshold. Expert evaluation confirms improvements in information quality, reasoning, and professional expression, suggesting our approach shortens the “diagnostic odyssey” for rare-disease patients.
Article 500
Title@2025-07-11 (5): An Empirical Study of Validating Synthetic Data for Formula Generation
Title: An Empirical Study of Validating Synthetic Data for Formula Generation | Eine empirische Studie zur Validierung synthetischer Daten für die Formelgenerierung | 验证用于公式生成的合成数据的经验研究 2407.10657v4 |
Authors (8): Usneek Singh, José Cambronero, Sumit Gulwani, Aditya Kanade, Anirudh Khatry, Vu Le, Mukul Singh, Gust Verbruggen
Large language models (LLMs) can be leveraged to help with writing formulas in spreadsheets, but resources on these formulas are scarce, which both lowers the base performance of pre-trained models and limits the ability to fine-tune them. Given a corpus of formulas, we can use a(nother) model to generate synthetic natural language utterances for fine-tuning. However, it is important to validate whether the natural language (NL) generated by the LLM is accurate enough to be beneficial for fine-tuning. In this paper, we provide empirical results on the impact of validating these synthetic training examples with surrogate objectives that evaluate the accuracy of the synthetic annotations. We demonstrate that validation improves performance over raw data across four models (2 open and 2 closed weight). Interestingly, we show that although validation tends to prune more challenging examples, it increases the complexity of problems that models can solve after being fine-tuned on validated data.
Article 501
Title@2025-07-11 (5): REGEN: A Dataset and Benchmarks with Natural Language Critiques and Narratives
Title: REGEN: A Dataset and Benchmarks with Natural Language Critiques and Narratives | REGEN: Ein Datensatz und Benchmarks mit natürlichen Sprachkritiken und Erzählungen | REGEN: 一套具有自然语种背景和叙述的数据集和基准 2503.11924v2 |
Authors (11): Kun Su, Krishna Sayana, Hubert Pham, James Pine, Yuri Vasilevski, Raghavendra Vasudeva, Marialena Kyriakidi, Liam Hebert, Ambarish Jash, Anushya Subbiah, Sukhdeep Sodhi
This paper introduces a novel dataset REGEN (Reviews Enhanced with GEnerative Narratives), designed to benchmark the conversational capabilities of recommender Large Language Models (LLMs), addressing the limitations of existing datasets that primarily focus on sequential item prediction. REGEN extends the Amazon Product Reviews dataset by inpainting two key natural language features: (1) user critiques, representing user “steering” queries that lead to the selection of a subsequent item, and (2) narratives, rich textual outputs associated with each recommended item taking into account prior context. The narratives include product endorsements, purchase explanations, and summaries of user preferences. Further, we establish an end-to-end modeling benchmark for the task of conversational recommendation, where models are trained to generate both recommendations and corresponding narratives conditioned on user history (items and critiques). For this joint task, we introduce a modeling framework LUMEN (LLM-based Unified Multi-task Model with Critiques, Recommendations, and Narratives) which uses an LLM as a backbone for critiquing, retrieval and generation. We also evaluate the dataset’s quality using standard auto-rating techniques and benchmark it by training both traditional and LLM-based recommender models. Our results demonstrate that incorporating critiques enhances recommendation quality by enabling the recommender to learn language understanding and integrate it with recommendation signals. Furthermore, LLMs trained on our dataset effectively generate both recommendations and contextual narratives, achieving performance comparable to state-of-the-art recommenders and language models.
Article 502
Title@2025-07-11 (5): Transforming Sensitive Documents into Quantitative Data: An AI-Based Preprocessing Toolchain for Structured and Privacy-Conscious Analysis
Title: Transforming Sensitive Documents into Quantitative Data: An AI-Based Preprocessing Toolchain for Structured and Privacy-Conscious Analysis | Transformation sensibler Dokumente in Quantitative Daten: Eine KI-basierte Vorverarbeitungs-Toolchain für strukturierte und datenschutzbewusste Analysen | 将敏感文件转换成定量数据:基于AI的结构性和隐私意识分析预处理工具链 2507.10582v1 |
Authors (2): Anders Ledberg, Anna Thalén
Unstructured text from legal, medical, and administrative sources offers a rich but underutilized resource for research in public health and the social sciences. However, large-scale analysis is hampered by two key challenges: the presence of sensitive, personally identifiable information, and significant heterogeneity in structure and language. We present a modular toolchain that prepares such text data for embedding-based analysis, relying entirely on open-weight models that run on local hardware, requiring only a workstation-level GPU and supporting privacy-sensitive research. The toolchain employs large language model (LLM) prompting to standardize, summarize, and, when needed, translate texts to English for greater comparability. Anonymization is achieved via LLM-based redaction, supplemented with named entity recognition and rule-based methods to minimize the risk of disclosure. We demonstrate the toolchain on a corpus of 10,842 Swedish court decisions under the Care of Abusers Act (LVM), comprising over 56,000 pages. Each document is processed into an anonymized, standardized summary and transformed into a document-level embedding. Validation, including manual review, automated scanning, and predictive evaluation, shows that the toolchain effectively removes identifying information while retaining semantic content. As an illustrative application, we train a predictive model using embedding vectors derived from a small set of manually labeled summaries, demonstrating the toolchain’s capacity for semi-automated content analysis at scale. By enabling structured, privacy-conscious analysis of sensitive documents, our toolchain opens new possibilities for large-scale research in domains where textual data was previously inaccessible due to privacy and heterogeneity constraints.
Article 503
Title@2025-07-11 (5): One-Pass to Reason: Token Duplication and Block-Sparse Mask for Efficient Fine-Tuning on Multi-Turn Reasoning
Title: One-Pass to Reason: Token Duplication and Block-Sparse Mask for Efficient Fine-Tuning on Multi-Turn Reasoning | One-Pass to Reason: Token-Duplikation und Block-Spar-Maske für effizientes Feintuning auf Multi-Turn-Reasoning | 单向理由:在多向理由上高效精美调整的相重复和块分割掩码 2504.18246v2 |
Authors (3): Ritesh Goru, Shanay Mehta, Prateek Jain
Fine-tuning Large Language Models (LLMs) on multi-turn reasoning datasets requires N (number of turns) separate forward passes per conversation due to reasoning token visibility constraints, as reasoning tokens for a turn are discarded in subsequent turns. We propose duplicating response tokens along with a custom attention mask to enable single-pass processing of entire conversations. We prove our method produces identical losses to the N-pass approach while reducing time complexity from $O\bigl(N^{3}\bigr)$ to $O\bigl(N^{2}\bigr)$ and maintaining the same memory complexity for a transformer-based model. Our approach achieves significant training speedup while preserving accuracy. Our implementation is available online (https://github.com/devrev/One-Pass-to-Reason).
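One way to see where the $O(N^3)$ versus $O(N^2)$ scaling could come from is a toy cost model. This is an illustration under simplifying assumptions, not the paper's proof: every turn is assumed to contribute the same number of tokens, attention cost is taken as the square of the sequence length, and duplicating response tokens is assumed to at most double the sequence (the `duplication_factor` parameter is hypothetical). Any savings from the block-sparse mask are ignored.

```python
def n_pass_attention_cost(num_turns: int, tokens_per_turn: int) -> int:
    """Toy cost of N separate passes: pass t attends over a prefix of t turns."""
    return sum((t * tokens_per_turn) ** 2 for t in range(1, num_turns + 1))


def single_pass_attention_cost(
    num_turns: int, tokens_per_turn: int, duplication_factor: int = 2
) -> int:
    """Toy cost of one pass over the whole conversation with duplicated tokens."""
    return (duplication_factor * num_turns * tokens_per_turn) ** 2


tokens_per_turn = 512  # arbitrary illustrative value
for num_turns in (8, 16, 32, 64):
    n_pass = n_pass_attention_cost(num_turns, tokens_per_turn)
    one_pass = single_pass_attention_cost(num_turns, tokens_per_turn)
    print(f"N={num_turns:3d}  N-pass / single-pass cost ratio = {n_pass / one_pass:.2f}")
```

In this model the N-pass cost is $L^2 N(N+1)(2N+1)/6$, which is cubic in $N$, while the single-pass cost is $(2NL)^2$, which is quadratic, so the ratio grows roughly linearly with the number of turns. For short conversations the constant factor from duplication dominates in this toy model, so it says nothing about the actual speedups measured in the paper.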
Article 504
Title@2025-07-11 (5): An Offline Mobile Conversational Agent for Mental Health Support: Learning from Emotional Dialogues and Psychological Texts with Student-Centered Evaluation
Title: An Offline Mobile Conversational Agent for Mental Health Support: Learning from Emotional Dialogues and Psychological Texts with Student-Centered Evaluation | Offline-Mobile Gesprächsagentin für psychische Gesundheitsunterstützung: Lernen aus emotionalen Dialogen und psychologischen Texten mit studentisch-zentrierter Evaluation | 心理健康支助离线流动对话代理人:学习以学生为中心的评价的情感对话和心理文字 2507.10580v1 |
Authors (4): Vimaleswar A, Prabhu Nandan Sahu, Nilesh Kumar Sahu, Haroon R Lone
Mental health plays a crucial role in the overall well-being of an individual. In recent years, digital platforms have been increasingly used to expand mental health and emotional support. However, there are persistent challenges related to limited user accessibility, internet connectivity, and data privacy, which highlight the need for an offline, smartphone-based solution. To address these challenges, we propose EmoSApp (Emotional Support App): an entirely offline, smartphone-based conversational app designed for mental health and emotional support. The system leverages Large Language Models (LLMs), specifically fine-tuned, quantized and deployed using Torchtune and Executorch for resource-constrained devices, allowing all inferences to occur on the smartphone. To equip EmoSApp with robust domain expertise, we fine-tuned the LLaMA-3.2-1B-Instruct model on our custom curated “Knowledge dataset” of 14,582 mental-health QA pairs, along with the multi-turn conversational data. Through qualitative human evaluation with the student population, we demonstrate that EmoSApp has the ability to respond coherently and empathetically, maintain interactive dialogue, and provide relevant suggestions for users’ mental health problems. Additionally, quantitative evaluations on nine standard commonsense and reasoning benchmarks demonstrate the efficacy of our fine-tuned, quantized model in low-resource settings. By prioritizing on-device deployment and specialized domain adaptation, EmoSApp serves as a blueprint for future innovations in portable, secure, and highly tailored AI-driven mental health solutions.
nan
Article 505
Title@2025-07-11 (5): PromotionGo at SemEval-2025 Task 11: A Feature-Centric Framework for Cross-Lingual Multi-Emotion Detection in Short Texts
Title: PromotionGo at SemEval-2025 Task 11: A Feature-Centric Framework for Cross-Lingual Multi-Emotion Detection in Short Texts | PromotionGo at SemEval-2025 Task 11: Ein Feature-Centric Framework für Cross-Lingual Multi-Emotion Detection in Kurztexten | 促进SemEval-2025任务11:短文本中跨语言多情感探测的特写-内容框架 2507.08499v1 |
Authors (2): Ziyi Huang, Xia Cui
This paper presents our system for SemEval 2025 Task 11: Bridging the Gap in Text-Based Emotion Detection (Track A), which focuses on multi-label emotion detection in short texts. We propose a feature-centric framework that dynamically adapts document representations and learning algorithms to optimize language-specific performance. Our study evaluates three key components: document representation, dimensionality reduction, and model training in 28 languages, highlighting five for detailed analysis. The results show that TF-IDF remains highly effective for low-resource languages, while contextual embeddings like FastText and transformer-based document representations, such as those produced by Sentence-BERT, exhibit language-specific strengths. Principal Component Analysis (PCA) reduces training time without compromising performance, particularly benefiting FastText and neural models such as Multi-Layer Perceptrons (MLP). Computational efficiency analysis underscores the trade-off between model complexity and processing cost. Our framework provides a scalable solution for multilingual emotion detection, addressing the challenges of linguistic diversity and resource constraints.
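As a reminder of what the TF-IDF document representation computes, here is a minimal sketch using only the standard library. It assumes whitespace tokenization and the plain `tf * log(N / df)` weighting; the function name `tfidf` is illustrative, and the abstract does not say which TF-IDF variant or toolkit the system uses.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (plain tf * idf)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors
```

Terms that occur in every document receive weight zero, which is why such a representation emphasizes the comparatively rare, emotion-bearing words in short texts.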
nan
Article 506
Title@2025-07-11 (5): Semantic-Augmented Latent Topic Modeling with LLM-in-the-Loop
Title: Semantic-Augmented Latent Topic Modeling with LLM-in-the-Loop | Semantic-Augmented Latent Topic Modeling mit LLM-in-the-Loop | 利用LLLM in-Loop 进行语义强化的 边端主题建模 2507.08498v1 |
Authors (3): Mengze Hong, Chen Jason Zhang, Di Jiang
Latent Dirichlet Allocation (LDA) is a prominent generative probabilistic model used for uncovering abstract topics within document collections. In this paper, we explore the effectiveness of augmenting topic models with Large Language Models (LLMs) through integration into two key phases: Initialization and Post-Correction. Since the LDA is highly dependent on the quality of its initialization, we conduct extensive experiments on the LLM-guided topic clustering for initializing the Gibbs sampling algorithm. Interestingly, the experimental results reveal that while the proposed initialization strategy improves the early iterations of LDA, it has no effect on the convergence and yields the worst performance compared to the baselines. The LLM-enabled post-correction, on the other hand, achieved a promising improvement of 5.86% in the coherence evaluation. These results highlight the practical benefits of the LLM-in-the-loop approach and challenge the belief that LLMs are always the superior text mining alternative.
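The initialization phase can be pictured with a small sketch. It assumes, hypothetically, that the LLM-guided clustering yields a mapping `word_to_topic` from words to topic indices; tokens without a suggestion fall back to the usual random assignment. The function name, its arguments, and the count-table layout are illustrative and are not the authors' implementation.

```python
import random
from collections import defaultdict


def init_assignments(docs, word_to_topic, n_topics, seed=0):
    """Build initial topic assignments and count tables for a Gibbs sampler."""
    rng = random.Random(seed)
    z = []                                            # topic of every token
    doc_topic = [[0] * n_topics for _ in docs]        # counts per document
    topic_word = [defaultdict(int) for _ in range(n_topics)]
    topic_total = [0] * n_topics
    for d, tokens in enumerate(docs):
        z_d = []
        for w in tokens:
            k = word_to_topic.get(w)                  # hypothetical LLM suggestion
            if k is None:
                k = rng.randrange(n_topics)           # standard random fallback
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        z.append(z_d)
    return z, doc_topic, topic_word, topic_total
```

A sampler started from these tables proceeds exactly as it would from a random start; only the starting point differs, which is consistent with the reported effect being limited to early iterations.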
nan
Article 507
Title@2025-07-11 (5): LLaPa: A Vision-Language Model Framework for Counterfactual-Aware Procedural Planning
Title: LLaPa: A Vision-Language Model Framework for Counterfactual-Aware Procedural Planning | LLaPa: Ein visionssprachliches Modell-Framework für die kontrafaktisch-bewusste Verfahrensplanung | LLAPA: 反事实-软件程序规划远景-语言示范框架 2507.08496v1 |
Authors (9): Shibo Sun, Xue Li, Donglin Di, Mingjie Wei, Lanshun Nie, Wei-Nan Zhang, Dechen Zhan, Yang Song, Lei Fan
While large language models (LLMs) have advanced procedural planning for embodied AI systems through strong reasoning abilities, the integration of multimodal inputs and counterfactual reasoning remains underexplored. To tackle these challenges, we introduce LLaPa, a vision-language model framework designed for multimodal procedural planning. LLaPa generates executable action sequences from textual task descriptions and visual environmental images using vision-language models (VLMs). Furthermore, we enhance LLaPa with two auxiliary modules to improve procedural planning. The first module, the Task-Environment Reranker (TER), leverages task-oriented segmentation to create a task-sensitive feature space, aligning textual descriptions with visual environments and emphasizing critical regions for procedural execution. The second module, the Counterfactual Activities Retriever (CAR), identifies and emphasizes potential counterfactual conditions, enhancing the model’s reasoning capability in counterfactual scenarios. Extensive experiments on ActPlan-1K and ALFRED benchmarks demonstrate that LLaPa generates higher-quality plans with superior LCS and correctness, outperforming advanced models. The code and models are available at https://github.com/sunshibo1234/LLaPa.
nan
Article 508
Title@2025-07-11 (5): A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench
Title: A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench | Ein drittes Paradigma für LLM-Evaluierung: Dialog Game-Based-Evaluierung mit Clembench | LLM评价的第三个范例:以对话游戏为基础的评价 2507.08491v1 |
Authors (4): David Schlangen, Sherzod Hakimov, Jonathan Jordan, Philipp Sadler
There are currently two main paradigms for evaluating large language models (LLMs): reference-based evaluation and preference-based evaluation. The first, carried over from the evaluation of machine learning models in general, relies on pre-defined task instances, for which reference task executions are available. The second, best exemplified by the LM-arena, relies on (often self-selected) users bringing their own intents to a site that routes these to several models in parallel, among whose responses the user then selects their most preferred one. The former paradigm hence excels at control over what is tested, while the latter comes with higher ecological validity, testing actual use cases interactively. Recently, a third complementary paradigm has emerged that combines some of the strengths of these approaches, offering control over multi-turn, reference-free, repeatable interactions, while stressing goal-directedness: dialogue game-based evaluation. While the utility of this approach has been shown by several projects, its adoption has been held back by the lack of a mature, easily re-usable implementation. In this paper, we present clembench, which has been in continuous development since 2023 and has in its latest release been optimized for ease of general use. We describe how it can be used to benchmark one’s own models (using a provided set of benchmark game instances in English), as well as how easily the benchmark itself can be extended with new, tailor-made targeted tests.
nan
Article 509
Title@2025-07-11 (5): Enhancing Essay Cohesion Assessment: A Novel Item Response Theory Approach
Title: Enhancing Essay Cohesion Assessment: A Novel Item Response Theory Approach | Essay Cohäsion Assessment: Ein neuartiger Ansatz zur Reaktionstheorie | 加强舍子聚合力评估:新项目应对理论方法 2507.08487v1 |
Authors (5): Bruno Alexandre Rosa, Hilário Oliveira, Luiz Rodrigues, Eduardo Araujo Oliveira, Rafael Ferreira Mello
Essays are considered a valuable mechanism for evaluating learning outcomes in writing. Textual cohesion is an essential characteristic of a text, as it facilitates the establishment of meaning between its parts. Automatically scoring cohesion in essays presents a challenge in the field of educational artificial intelligence. The machine learning algorithms used to evaluate texts generally do not consider the individual characteristics of the instances that comprise the analysed corpus. In this sense, item response theory can be adapted to the context of machine learning, characterising the ability, difficulty and discrimination of the models used. This work proposes and analyses the performance of a cohesion score prediction approach based on item response theory to adjust the scores generated by machine learning models. In this study, the corpus selected for the experiments consisted of the extended Essay-BR, which includes 6,563 essays in the style of the National High School Exam (ENEM), and the Brazilian Portuguese Narrative Essays, comprising 1,235 essays written by 5th to 9th grade students from public schools. We extracted 325 linguistic features and treated the problem as a machine learning regression task. The experimental results indicate that the proposed approach outperforms conventional machine learning models and ensemble methods in several evaluation metrics. This research explores a potential approach for improving the automatic evaluation of cohesion in educational essays.
nan
Article 510
Title@2025-07-11 (5): Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors
Title: Findings of the BEA 2025 Shared Task on Pedagogical Ability Assessment of AI-powered Tutors | Ergebnisse der gemeinsamen Arbeit der BEA 2025 zur pädagogischen Fähigkeitsbewertung von KI-getriebenen Tutoren | BEA 2025年BEA 2025年教育能力评估共同任务的结果 2507.10579v1 |
Authors (6): Ekaterina Kochmar, Kaushal Kumar Maurya, Kseniia Petukhova, KV Aditya Srivatsa, Anaïs Tack, Justin Vasselli
This shared task aimed to assess the pedagogical abilities of AI tutors powered by large language models (LLMs), focusing on evaluating the quality of tutor responses aimed at remediating student mistakes within educational dialogues. The task consisted of five tracks designed to automatically evaluate the AI tutor’s performance across key dimensions of mistake identification, precise location of the mistake, providing guidance, and feedback actionability, grounded in learning science principles that define good and effective tutor responses, as well as a track focusing on detection of the tutor identity. The task attracted over 50 international teams across all tracks. The submitted models were evaluated against gold-standard human annotations, and the results, while promising, show that there is still significant room for improvement in this domain: the best results for the four pedagogical ability assessment tracks range between macro F1 scores of 58.34 (for providing guidance) and 71.81 (for mistake identification) on three-class problems, with the best F1 score in the tutor identification track reaching 96.98 on a 9-class task. In this paper, we overview the main findings of the shared task, discuss the approaches taken by the teams, and analyze their performance. All resources associated with this task are made publicly available to support future research in this critical domain.
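Because the headline numbers are macro F1 scores on three-class problems, a short sketch of the metric may help in reading them. Macro F1 is the unweighted mean of the per-class F1 scores, so a rare class counts as much as a frequent one. The function name and the label strings in the usage note are hypothetical and are not taken from the shared task's data.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, with gold labels `["yes", "yes", "no", "partial"]` and predictions `["yes", "no", "no", "partial"]`, the per-class scores are 2/3, 2/3 and 1, giving a macro F1 of 7/9.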
nan
Article 511
Title@2025-07-11 (5): ILT-Iterative LoRA Training through Focus-Feedback-Fix for Multilingual Speech Recognition
Title: ILT-Iterative LoRA Training through Focus-Feedback-Fix for Multilingual Speech Recognition | ILT-Iteratives LoRA-Training durch Fokus-Feedback-Fix für mehrsprachige Spracherkennung | 通过 “ 承认多种语言语言的焦点-反馈-语言识别指标 “ 进行ILT-临时LORA培训 2507.08477v1 |
Authors (5): Qingliang Meng, Hao Wu, Wei Liang, Wei Xu, Qing Zhao
The deep integration of large language models and automatic speech recognition systems has become a promising research direction with high practical value. To address the overfitting issue commonly observed in Low-Rank Adaptation (LoRA) during the supervised fine-tuning (SFT) stage, this work proposes an innovative training paradigm, Iterative LoRA Training (ILT), in combination with an Iterative Pseudo Labeling strategy, effectively enhancing the theoretical upper bound of model performance. Based on Whisper-large-v3 and Qwen2-Audio, we conduct systematic experiments using a three-stage training process: Focus Training, Feed Back Training, and Fix Training. Experimental results demonstrate the effectiveness of the proposed method. Furthermore, the MegaAIS research team applied this technique in the Interspeech 2025 Multilingual Conversational Speech Language Modeling Challenge (MLC-SLM), achieving 4th place in Track 1 (Multilingual ASR Task) and 1st place in Track 2 (Speech Separation and Recognition Task), showcasing the practical feasibility and strong application potential of our approach.
nan
Article 512
Title@2025-07-11 (5): Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model
Title: Squeeze the Soaked Sponge: Efficient Off-policy Reinforcement Finetuning for Large Language Model | Squeeze the Soaked Sponge: Effiziente Off-Policy-Verstärkung Feinsteuerung für großes Sprachmodell | 挤压海绵:高效非政策强化大语言模式的高效非政策改进微调 2507.06892v3 |
Authors (8): Jing Liang, Hongyao Tang, Yi Ma, Jinyi Liu, Yan Zheng, Shuyue Hu, Lei Bai, Jianye Hao
Reinforcement Learning (RL) has demonstrated its potential to improve the reasoning ability of Large Language Models (LLMs). One major limitation of most existing Reinforcement Finetuning (RFT) methods is that they are on-policy RL in nature, i.e., data generated during the past learning process is not fully utilized. This inevitably comes at a significant cost of compute and time, posing a stringent bottleneck on continuing economic and efficient scaling. To this end, we launch the renaissance of off-policy RL and propose Reincarnating Mix-policy Proximal Policy Gradient (ReMix), a general approach to enable on-policy RFT methods like PPO and GRPO to leverage off-policy data. ReMix consists of three major components: (1) Mix-policy proximal policy gradient with an increased Update-To-Data (UTD) ratio for efficient training; (2) KL-Convex policy constraint to balance the trade-off between stability and flexibility; (3) Policy reincarnation to achieve a seamless transition from efficient early-stage learning to steady asymptotic improvement. In our experiments, we train a series of ReMix models upon PPO, GRPO and 1.5B, 7B base models. ReMix shows an average Pass@1 accuracy of 52.10% (for 1.5B model) with 0.079M response rollouts, 350 training steps and achieves 63.27%/64.39% (for 7B model) with 0.007M/0.011M response rollouts, 50/75 training steps, on five math reasoning benchmarks (i.e., AIME’24, AMC’23, Minerva, OlympiadBench, and MATH500). Compared with 15 recent advanced models, ReMix shows SOTA-level performance with an over 30x to 450x reduction in training cost in terms of rollout data volume. In addition, we reveal insightful findings via multifaceted analysis, including the implicit preference for shorter responses due to the Whipping Effect of off-policy discrepancy, the collapse mode of self-reflection behavior under the presence of severe off-policyness, etc.
nan
Article 513
Title@2025-07-11 (5): Using Large Language Models for Legal Decision-Making in Austrian Value-Added Tax Law: An Experimental Study
Title: Using Large Language Models for Legal Decision-Making in Austrian Value-Added Tax Law: An Experimental Study | Große Sprachmodelle für die rechtliche Entscheidungsfindung im österreichischen Mehrwertsteuerrecht nutzen: Eine experimentelle Studie | 奥地利增值税法使用大语言模式进行法律决策:实验研究 2507.08468v1 |
Authors (3): Marina Luketina, Andrea Benkel, Christoph G. Schuetz
This paper provides an experimental evaluation of the capability of large language models (LLMs) to assist in legal decision-making within the framework of Austrian and European Union value-added tax (VAT) law. In tax consulting practice, clients often describe cases in natural language, making LLMs a prime candidate for supporting automated decision-making and reducing the workload of tax professionals. Given the requirement for legally grounded and well-justified analyses, the propensity of LLMs to hallucinate presents a considerable challenge. The experiments focus on two common methods for enhancing LLM performance: fine-tuning and retrieval-augmented generation (RAG). In this study, these methods are applied to both textbook cases and real-world cases from a tax consulting firm to systematically determine the best configurations of LLM-based systems and assess the legal-reasoning capabilities of LLMs. The findings highlight the potential of using LLMs to support tax consultants by automating routine tasks and providing initial analyses, although current prototypes are not ready for full automation due to the sensitivity of the legal domain. The findings indicate that LLMs, when properly configured, can effectively support tax professionals in VAT tasks and provide legally grounded justifications for decisions. However, limitations remain regarding the handling of implicit client knowledge and context-specific documentation, underscoring the need for future integration of structured background information.
nan
Article 514
Title@2025-07-11 (5): Diagnosing Failures in Large Language Models’ Answers: Integrating Error Attribution into Evaluation Framework
Title: Diagnosing Failures in Large Language Models’ Answers: Integrating Error Attribution into Evaluation Framework | Diagnose von Fehlern in den Antworten großer Sprachmodelle: Integrieren der Fehlerzuweisung in den Evaluationsrahmen | 大语言模型答案中的诊断失败:将错误归责纳入评价框架 2507.08459v1 |
Authors (7): Zishan Xu, Shuyi Xie, Qingsong Lv, Shupei Xiao, Linlin Song, Sui Wenjuan, Fan Lin
With the widespread application of Large Language Models (LLMs) in various tasks, the mainstream LLM platforms generate massive user-model interactions daily. In order to efficiently analyze the performance of models and diagnose failures in their answers, it is essential to develop an automated framework to systematically categorize and attribute errors. However, existing evaluation models lack error attribution capability. In this work, we establish a comprehensive Misattribution Framework with 6 primary and 15 secondary categories to facilitate in-depth analysis. Based on this framework, we present AttriData, a dataset specifically designed for error attribution, encompassing misattribution, along with the corresponding scores and feedback. We also propose MisAttributionLLM, a fine-tuned model on AttriData, which is the first general-purpose judge model capable of simultaneously generating score, misattribution, and feedback. Extensive experiments and analyses are conducted to confirm the effectiveness and robustness of our proposed method.
nan
Article 515
Title@2025-07-11 (5): Can Large Language Models Understand As Well As Apply Patent Regulations to Pass a Hands-On Patent Attorney Test?
Title: Can Large Language Models Understand As Well As Apply Patent Regulations to Pass a Hands-On Patent Attorney Test? | Können große Sprachmodelle ebenso verstehen wie Patentvorschriften anwenden, um einen hands-on Patent Attorney Test zu bestehen? | 大语言模式能否像应用专利条例通过专利律师亲手测试一样理解专利条例? 2507.10576v1 |
Authors (4): Bhakti Khera, Rezvan Alamian, Pascal A. Scherz, Stephan M. Goetz
The legal field already uses various large language models (LLMs) in actual applications, but their quantitative performance and the reasons for it are underexplored. We evaluated several open-source and proprietary LLMs – including GPT-series, Anthropic, Deepseek and Llama-3 variants – on parts of the European Qualifying Examination (EQE) for future European Patent Attorneys. OpenAI o1 led with 0.82 accuracy and 0.81 F1 score, whereas Amazon Web Services (AWS) Llama 3.1 8B lagged at 0.50 accuracy, and a Python-deployed Llama 3.1 8B scored 0.55. The latter two are within the range of mere guessing for the two-answer forced-choice design. None of the evaluated models could have passed the examination fully, as accuracy never exceeded the average threshold of 0.90 required for professional-level standards – not even models that are regularly promoted for their assumed beyond-PhD- and bar-admitted-lawyer-level performance. GPT-4o excelled at integrating text and graphics, while Claude 3 Opus often lost formatting coherence. Human patent experts evaluated the textual justifications and uncovered various critical shortcomings of each model. They valued clarity and legal rationale over the raw correctness of the answers, which revealed misalignment between automatic metrics and expert judgment. Model outputs were sensitive to modest temperature changes and prompt wording, which underscores the remaining necessity of expert oversight. Future work should target logical consistency, robust multimodality, and adaptive prompting to approach human-level patent proficiency. In summary, despite the outstanding performance of recent large models, the general public might overestimate their performance. The field has a long way to go to develop a virtual patent attorney. This paper points out several specific limitations that need solutions.
nan
Article 516
Title@2025-07-11 (5): Finding Common Ground: Using Large Language Models to Detect Agreement in Multi-Agent Decision Conferences
Title: Finding Common Ground: Using Large Language Models to Detect Agreement in Multi-Agent Decision Conferences | Gemeinsamer Grund: Mit großen Sprachmodellen Vereinbarungen in Multi-Agent-Entscheidungskonferenzen zu erkennen | 寻找共同点:在多机构决定会议上使用大语言模型来检测协议 2507.08440v1 |
Authors (4): Selina Heller, Mohamed Ibrahim, David Antony Selby, Sebastian Vollmer
Decision conferences are structured, collaborative meetings that bring together experts from various fields to address complex issues and reach a consensus on recommendations for future actions or policies. These conferences often rely on facilitated discussions to ensure productive dialogue and collective agreement. Recently, Large Language Models (LLMs) have shown significant promise in simulating real-world scenarios, particularly through collaborative multi-agent systems that mimic group interactions. In this work, we present a novel LLM-based multi-agent system designed to simulate decision conferences, specifically focusing on detecting agreement among the participant agents. To achieve this, we evaluate six distinct LLMs on two tasks: stance detection, which identifies the position an agent takes on a given issue, and stance polarity detection, which identifies the sentiment as positive, negative, or neutral. These models are further assessed within the multi-agent system to determine their effectiveness in complex simulations. Our results indicate that LLMs can reliably detect agreement even in dynamic and nuanced debates. Incorporating an agreement-detection agent within the system can also improve the efficiency of group debates and enhance the overall quality and coherence of deliberations, making them comparable to real-world decision conferences regarding outcome and decision-making. These findings demonstrate the potential for LLM-based multi-agent systems to simulate group decision-making processes. They also highlight that such systems could be instrumental in supporting decision-making with expert elicitation workshops across various domains.
nan
Article 517
Title@2025-07-11 (5): xpSHACL: Explainable SHACL Validation using Retrieval-Augmented Generation and Large Language Models
Title: xpSHACL: Explainable SHACL Validation using Retrieval-Augmented Generation and Large Language Models | xpSHACL: Erklärbare SHACL-Validierung mit Retrieval-Augmented Generation und großen Sprachmodellen | xpSHACL: 使用回溯-启动生成和大语言模型进行可解释的 SHACL 校验 2507.08432v1 |
Authors (2): Gustavo Correa Publio, José Emilio Labra Gayo
Shapes Constraint Language (SHACL) is a powerful language for validating RDF data. Given the recent industry attention to Knowledge Graphs (KGs), more users need to validate linked data properly. However, traditional SHACL validation engines often provide terse reports in English that are difficult for non-technical users to interpret and act upon. This paper presents xpSHACL, an explainable SHACL validation system that addresses this issue by combining rule-based justification trees with retrieval-augmented generation (RAG) and large language models (LLMs) to produce detailed, multilanguage, human-readable explanations for constraint violations. A key feature of xpSHACL is its usage of a Violation KG to cache and reuse explanations, improving efficiency and consistency.
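The cache-and-reuse idea can be sketched in a few lines. The sketch assumes, hypothetically, that a violation is identified by a signature of constraint component and property path, and it uses an in-memory dictionary as a stand-in for the Violation KG; the class name, the signature layout, and the `generate` callback (representing the RAG and LLM step) are illustrative and are not xpSHACL's API.

```python
from typing import Callable, Dict, Tuple

# Hypothetical signature: (constraint component, property path).
Signature = Tuple[str, str]


class ExplanationCache:
    """Generate an explanation once per violation signature, then reuse it."""

    def __init__(self, generate: Callable[[Signature], str]):
        self._generate = generate          # stand-in for the RAG + LLM call
        self._store: Dict[Signature, str] = {}
        self.misses = 0                    # number of times generation ran

    def explain(self, signature: Signature) -> str:
        if signature not in self._store:
            self.misses += 1
            self._store[signature] = self._generate(signature)
        return self._store[signature]
```

Keying on the violation signature rather than on the individual data node is what would let repeated violations of the same shape share one explanation, which is the source of both the efficiency and the consistency gain mentioned above.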
nan
Article 518
Title@2025-07-11 (5): Answer Generation for Questions With Multiple Information Sources in E-Commerce
Title: Answer Generation for Questions With Multiple Information Sources in E-Commerce | Antwortgenerierung für Fragen mit mehreren Informationsquellen im E-Commerce | 电子商务中具有多种信息来源问题的答案生成问题 2111.14003v2 |
Authors (2): Anand A. Rajasekar, Nikesh Garera
Automatic question answering is an important yet challenging task in E-commerce given the millions of questions posted by users about the product that they are interested in purchasing. Hence, there is a great demand for automatic answer generation systems that provide quick responses using related information about the product. There are three sources of knowledge available for answering a user-posted query: reviews, duplicate or similar questions, and specifications. Effectively utilizing these information sources will greatly aid us in answering complex questions. However, there are two main challenges in exploiting these sources: (i) the presence of irrelevant information and (ii) the ambiguity of sentiment present in reviews and similar questions. Through this work, we propose a novel pipeline (MSQAP) that utilizes the rich information present in the aforementioned sources by separately performing relevancy and ambiguity prediction before generating a response. Experimental results show that our relevancy prediction model (BERT-QA) outperforms all other variants and has an improvement of 12.36% in F1 score compared to the BERT-base baseline. Our generation model (T5-QA) outperforms the baselines in all content preservation metrics such as BLEU, ROUGE and has an average improvement of 35.02% in ROUGE and 198.75% in BLEU compared to the highest performing baseline (HSSC-q). Human evaluation of our pipeline shows that our method has an overall improvement in accuracy of 30.7% over the generation model (T5-QA), resulting in our full pipeline-based approach (MSQAP) providing more accurate answers. To the best of our knowledge, this is the first work in the e-commerce domain that automatically generates natural language answers combining the information present in diverse sources such as specifications, similar questions, and reviews data.
nan
Article 519
Title@2025-07-11 (5): ChainEdit: Propagating Ripple Effects in LLM Knowledge Editing through Logical Rule-Guided Chains
Title: ChainEdit: Propagating Ripple Effects in LLM Knowledge Editing through Logical Rule-Guided Chains | ChainEdit: Propagieren von Ripple-Effekten in der LLM-Wissensbearbeitung durch logische regelgeführte Ketten | 链 Edit:通过逻辑规则-指导链条在LLM知识编辑中宣传波纹效应 2507.08427v1 |
Authors (4): Zilu Dong, Xiangqing Shen, Zinong Yang, Rui Xia
Current knowledge editing methods for large language models (LLMs) struggle to maintain logical consistency when propagating ripple effects to associated facts. We propose ChainEdit, a framework that synergizes knowledge graph-derived logical rules with LLM logical reasoning capabilities to enable systematic chain updates. By automatically extracting logical patterns from structured knowledge bases and aligning them with LLMs’ internal logics, ChainEdit dynamically generates and edits logically connected knowledge clusters. Experiments demonstrate an improvement of more than 30% in logical generalization over baselines while preserving editing reliability and specificity. We further address evaluation biases in existing benchmarks through knowledge-aware protocols that disentangle external dependencies. This work establishes new state-of-the-art performance on ripple effect while ensuring internal logical consistency after knowledge editing.
nan
Article 520
Title@2025-07-11 (5): A Survey of Large Language Models in Discipline-specific Research: Challenges, Methods and Opportunities
Title: A Survey of Large Language Models in Discipline-specific Research: Challenges, Methods and Opportunities | Eine Übersicht über große Sprachmodelle in der disziplinspezifischen Forschung: Herausforderungen, Methoden und Chancen | 专门学科研究中大语言模式概览:挑战、方法和机会 2507.08425v1 |
Authors (4): Lu Xiang, Yang Zhao, Yaping Zhang, Chengqing Zong
Large Language Models (LLMs) have demonstrated their transformative potential across numerous disciplinary studies, reshaping the existing research methodologies and fostering interdisciplinary collaboration. However, a systematic understanding of their integration into diverse disciplines remains underexplored. This survey paper provides a comprehensive overview of the application of LLMs in interdisciplinary studies, categorising research efforts from both a technical perspective and with regard to their applicability. From a technical standpoint, key methodologies such as supervised fine-tuning, retrieval-augmented generation, agent-based approaches, and tool-use integration are examined, which enhance the adaptability and effectiveness of LLMs in discipline-specific contexts. From the perspective of their applicability, this paper explores how LLMs are contributing to various disciplines including mathematics, physics, chemistry, biology, and the humanities and social sciences, demonstrating their role in discipline-specific tasks. The prevailing challenges are critically examined and the promising research directions are highlighted alongside the recent advances in LLMs. By providing a comprehensive overview of the technical developments and applications in this field, this survey aims to serve as an invaluable resource for the researchers who are navigating the complex landscape of LLMs in the context of interdisciplinary studies.
nan
Article 521
Title@2025-07-11 (5): Enabling Inclusive Systematic Reviews: Incorporating Preprint Articles with Large Language Model-Driven Evaluations
Title: Enabling Inclusive Systematic Reviews: Incorporating Preprint Articles with Large Language Model-Driven Evaluations | Inklusive Systematische Bewertungen aktivieren: Einschließlich Preprint-Artikel mit großsprachigen modellgetriebenen Bewertungen | 促进包容性的系统审查:将预印条款纳入大语言模式示范评价 2503.13857v4 |
Authors (11): Rui Yang, Jiayi Tong, Haoyuan Wang, Hui Huang, Ziyang Hu, Peiyu Li, Nan Liu, Christopher J. Lindsell, Michael J. Pencina, Yong Chen, Chuan Hong
Background. Systematic reviews in comparative effectiveness research require timely evidence synthesis. Preprints accelerate knowledge dissemination but vary in quality, posing challenges for systematic reviews. Methods. We propose AutoConfidence (automated confidence assessment), an advanced framework for predicting preprint publication, which reduces reliance on manual curation and expands the range of predictors, including three key advancements: (1) automated data extraction using natural language processing techniques, (2) semantic embeddings of titles and abstracts, and (3) large language model (LLM)-driven evaluation scores. Additionally, we employed two prediction models: a random forest classifier for binary outcome and a survival cure model that predicts both binary outcome and publication risk over time. Results. The random forest classifier achieved AUROC 0.692 with LLM-driven scores, improving to 0.733 with semantic embeddings and 0.747 with article usage metrics. The survival cure model reached AUROC 0.716 with LLM-driven scores, improving to 0.731 with semantic embeddings. For publication risk prediction, it achieved a concordance index of 0.658, increasing to 0.667 with semantic embeddings. Conclusion. Our study advances the framework for preprint publication prediction through automated data extraction and multiple feature integration. By combining semantic embeddings with LLM-driven evaluations, AutoConfidence enhances predictive performance while reducing manual annotation burden. The framework has the potential to facilitate incorporation of preprint articles during the appraisal phase of systematic reviews, supporting researchers in more effective utilization of preprint resources.
Article 522
Title@2025-07-11 (5): Swap distance minimization beyond entropy minimization in word order variation
Title: Swap distance minimization beyond entropy minimization in word order variation | Swap-Distanz-Minimierung jenseits der Entropie-Minimierung in Wortfolge-Variation | 以字序变换方式互换距离最小化,超过以字序变换的方式最小化 2404.14192v5 |
Authors (3): Víctor Franco-Sánchez, Arnau Martí-Llobet, Ramon Ferrer-i-Cancho
Consider a linguistic structure formed by $n$ elements, for instance, subject, direct object and verb ($n=3$) or subject, direct object, indirect object and verb ($n=4$). We investigate whether the frequency of the $n!$ possible orders is constrained by two principles. First, entropy minimization, a principle that has been suggested to shape natural communication systems at distinct levels of organization. Second, swap distance minimization, namely a preference for word orders that require fewer swaps of adjacent elements to be produced from a source order. We present average swap distance, a novel score for research on swap distance minimization. We find strong evidence of pressure for entropy minimization and swap distance minimization with respect to a die rolling experiment in distinct linguistic structures with $n=3$ or $n=4$. Evidence with respect to a Polya urn process is strong for $n=4$ but weaker for $n=3$. We still find evidence consistent with the action of swap distance minimization when word order frequencies are shuffled, indicating that swap distance minimization effects are beyond pressure to reduce word order entropy.
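To make the notion of swap distance concrete, here is a minimal Python sketch. It assumes that the swap distance between two orders is the minimum number of adjacent swaps needed to turn one into the other, which equals the number of inverted pairs. The function `mean_swap_distance` is only an illustrative frequency-weighted mean relative to a chosen source order; it is a hypothetical helper and not necessarily the paper's definition of average swap distance.

```python
from itertools import permutations


def swap_distance(source, target):
    """Minimum number of adjacent swaps that turn `source` into `target`."""
    position = {element: i for i, element in enumerate(target)}
    ranks = [position[element] for element in source]
    # Each inverted pair costs exactly one adjacent swap.
    return sum(
        1
        for i in range(len(ranks))
        for j in range(i + 1, len(ranks))
        if ranks[i] > ranks[j]
    )


def mean_swap_distance(frequencies, source):
    """Illustrative assumption: frequency-weighted mean distance to `source`."""
    total = sum(frequencies.values())
    return sum(
        count * swap_distance(source, order)
        for order, count in frequencies.items()
    ) / total


# All n! = 6 orders of subject (S), object (O) and verb (V).
orders = ["".join(p) for p in permutations("SOV")]
distances = {order: swap_distance("SOV", order) for order in orders}

# Made-up counts, used only to exercise the helper.
toy_frequencies = {"SOV": 6, "SVO": 3, "VOS": 1}
toy_mean = mean_swap_distance(toy_frequencies, "SOV")
```

With `SOV` as the source order, `SVO` and `OSV` are one swap away, `OVS` and `VSO` two, and `VOS` three.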
Article 523
Title@2025-07-11 (5): Probing Experts’ Perspectives on AI-Assisted Public Speaking Training
Title: Probing Experts’ Perspectives on AI-Assisted Public Speaking Training | Probing Experten-Perspektiven über KI-Assistente Public Speaking Training | 关于AI协助的公开演讲培训的探查专家观点 2507.07930v2 |
Authors (5): Nesrine Fourati, Alisa Barkar, Marion Dragée, Liv Danthon-Lefebvre, Mathieu Chollet
Background: Public speaking is a vital professional skill, yet it remains a source of significant anxiety for many individuals. Traditional training relies heavily on expert coaching, but recent advances in AI have led to novel types of commercial automated public speaking feedback tools. However, most research has focused on prototypes rather than commercial applications, and little is known about how public speaking experts perceive these tools. Objectives: This study aims to evaluate expert opinions on the efficacy and design of commercial AI-based public speaking training tools and to propose guidelines for their improvement. Methods: The research involved 16 semi-structured interviews and 2 focus groups with public speaking experts. Participants discussed their views on current commercial tools, their potential integration into traditional coaching, and suggestions for enhancing these systems. Results and Conclusions: Experts acknowledged the value of AI tools in handling repetitive, technical aspects of training, allowing coaches to focus on higher-level skills. However, they found key issues in current tools, emphasising the need for personalised, understandable, carefully selected feedback and clear instructional design. Overall, they supported a hybrid model combining traditional coaching with AI-supported exercises.
Article 524
Title@2025-07-11 (5): Flippi: End To End GenAI Assistant for E-Commerce
Title: Flippi: End To End GenAI Assistant for E-Commerce | Flippi: Ende bis Ende GenAI Assistent für E-Commerce | Flippi: 结束到结束 GenAI 电子商务助手 2507.05788v2 |
Authors (7): Anand A. Rajasekar, Praveen Tangarajan, Anjali Nainani, Amogh Batwal, Vinay Rao Dandin, Anusua Trivedi, Ozan Ersoy
The emergence of conversational assistants has fundamentally reshaped user interactions with digital platforms. This paper introduces Flippi, a cutting-edge, end-to-end conversational assistant powered by large language models (LLMs) and tailored for the e-commerce sector. Flippi addresses the challenges posed by the vast and often overwhelming product landscape, enabling customers to discover products more efficiently through natural language dialogue. By accommodating both objective and subjective user requirements, Flippi delivers a personalized shopping experience that surpasses traditional search methods. This paper details how Flippi interprets customer queries to provide precise product information, leveraging advanced NLP techniques such as Query Reformulation, Intent Detection, Retrieval-Augmented Generation (RAG), Named Entity Recognition (NER), and Context Reduction. Flippi’s unique capability to identify and present the most attractive offers on an e-commerce site is also explored, demonstrating how it empowers users to make cost-effective decisions. Additionally, the paper discusses Flippi’s comparative analysis features, which help users make informed choices by contrasting product features, prices, and other relevant attributes. The system’s robust architecture is outlined, emphasizing its adaptability for integration across various e-commerce platforms and the technological choices underpinning its performance and accuracy. Finally, a comprehensive evaluation framework is presented, covering performance metrics, user satisfaction, and the impact on customer engagement and conversion rates. By bridging the convenience of online shopping with the personalized assistance traditionally found in physical stores, Flippi sets a new standard for customer satisfaction and engagement in the digital marketplace.
Article 525
Title@2025-07-11 (5): Sampling from Your Language Model One Byte at a Time
Title: Sampling from Your Language Model One Byte at a Time | Proben aus Ihrem Sprachmodell ein Byte zu einer Zeit | 一次抽取您语言模式一字节的样本 2506.14123v2 |
Authors (4): Jonathan Hayase, Alisa Liu, Noah A. Smith, Sewoong Oh
Tokenization is used almost universally by modern language models, enabling efficient text representation using multi-byte or multi-character tokens. However, prior work has shown that tokenization can introduce distortion into the model’s generations, an issue known as the Prompt Boundary Problem (PBP). For example, users are often advised not to end their prompts with a space because it prevents the model from including the space as part of the next token. While this heuristic is effective in English, the underlying PBP continues to affect languages such as Chinese as well as code generation, where tokens often do not line up with word and syntactic boundaries. In this work, we present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers, allowing one to ensemble LMs with different tokenizers at inference time or transfer the post-training from one model to another using proxy-tuning. We demonstrate in experiments that the ensemble and proxy-tuned models outperform their constituents on downstream evals. Code is available at https://github.com/SewoongLab/byte-sampler .
Article 526
Title@2025-07-11 (5): HeSum: a Novel Dataset for Abstractive Text Summarization in Hebrew
Title: HeSum: a Novel Dataset for Abstractive Text Summarization in Hebrew | HeSum: Ein neuartiger Datensatz für abstrakte Textzusammenfassung in Hebräisch | HeSum:希伯来文抽象文本摘要缩写的新数据集 2406.03897v3 |
Authors (4): Tzuf Paz-Argaman, Itai Mondshine, Asaf Achi Mordechai, Reut Tsarfaty
While large language models (LLMs) excel in various natural language tasks in English, their performance in lower-resourced languages like Hebrew, especially for generative tasks such as abstractive summarization, remains unclear. The high morphological richness in Hebrew adds further challenges due to the ambiguity in sentence comprehension and the complexities in meaning construction. In this paper, we address this resource and evaluation gap by introducing HeSum, a novel benchmark specifically designed for abstractive text summarization in Modern Hebrew. HeSum consists of 10,000 article-summary pairs sourced from Hebrew news websites written by professionals. Linguistic analysis confirms HeSum’s high abstractness and unique morphological challenges. We show that HeSum presents distinct difficulties for contemporary state-of-the-art LLMs, establishing it as a valuable testbed for generative language technology in Hebrew and for the generative challenges of morphologically rich languages (MRLs) in general.
Article 527
Title@2025-07-11 (5): Truth-value judgment in language models: ‘truth directions’ are context sensitive
Title: Truth-value judgment in language models: ‘truth directions’ are context sensitive | Wahrheit-Wert-Urteil in Sprachmodellen: ‘Wahrheitsrichtungen’ sind kontextsensibel | 语言模型中的真相价值判断:“真相方向”是背景敏感 2404.18865v3 |
Authors (4): Stefan F. Schouten, Peter Bloem, Ilia Markov, Piek Vossen
Recent work has demonstrated that the latent spaces of large language models (LLMs) contain directions predictive of the truth of sentences. Multiple methods recover such directions and build probes that are described as uncovering a model’s “knowledge” or “beliefs”. We investigate this phenomenon, looking closely at the impact of context on the probes. Our experiments establish where in the LLM the probe’s predictions are (most) sensitive to the presence of related sentences, and how to best characterize this kind of sensitivity. We do so by measuring different types of consistency errors that occur after probing an LLM whose inputs consist of hypotheses preceded by (negated) supporting and contradicting sentences. We also perform a causal intervention experiment, investigating whether moving the representation of a premise along these truth-value directions influences the position of an entailed or contradicted sentence along that same direction. We find that the probes we test are generally context sensitive, but that contexts which should not affect the truth often still impact the probe outputs. Our experiments show that the types of errors depend on the layer, the model, and the kind of data. Finally, our results suggest that truth-value directions are causal mediators in the inference process that incorporates in-context information.
Article 528
Title@2025-07-11 (5): The Curious Case of Factuality Finetuning: Models’ Internal Beliefs Can Improve Factuality
Title: The Curious Case of Factuality Finetuning: Models’ Internal Beliefs Can Improve Factuality | Der kuriose Fall von Factuality Finetuning: Modelle’ interne Glaube kann Factuality verbessern | 《难解事实质量微调案例:模型的内部信仰可以改进事实质量》 2507.08371v1 |
Authors (8): Benjamin Newman, Abhilasha Ravichander, Jaehun Jung, Rui Xin, Hamish Ivison, Yegor Kuznetsov, Pang Wei Koh, Yejin Choi
Language models are prone to hallucination: generating text that is factually incorrect. Finetuning models on high-quality factual information can potentially reduce hallucination, but concerns remain; obtaining factual gold data can be expensive, and training on correct but unfamiliar data may lead to even more downstream hallucination. What data should practitioners finetune on to mitigate hallucinations in language models? In this work, we study the relationship between the factuality of finetuning data and the prevalence of hallucinations in long-form generation tasks. Counterintuitively, we find that finetuning on factual gold data is not as helpful as finetuning on model-generated data that models believe to be factual. Next, we evaluate filtering strategies applied to both factual gold data and model-generated data, and find that finetuning on model-generated data that is filtered by models’ own internal judgments often leads to better overall factuality compared to other configurations: training on gold data filtered by models’ judgments, training on gold data alone, or training on model-generated data that is supported by gold data. These factuality improvements transfer across the three domains we study, suggesting that a model’s own beliefs can provide a powerful signal for factuality.
Article 529
Title@2025-07-11 (5): Exploring Design of Multi-Agent LLM Dialogues for Research Ideation
Title: Exploring Design of Multi-Agent LLM Dialogues for Research Ideation | Erforschung der Gestaltung von LLM-Dialogen mit mehreren Agenten für die Forschungsideation | 探索设计多种机构用LLM 研究主题对话 2507.08350v1 |
Authors (7): Keisuke Ueda, Wataru Hirota, Takuto Asakura, Takahiro Omi, Kosuke Takahashi, Kosuke Arima, Tatsuya Ishigaki
Large language models (LLMs) are increasingly used to support creative tasks such as research idea generation. While recent work has shown that structured dialogues between LLMs can improve the novelty and feasibility of generated ideas, the optimal design of such interactions remains unclear. In this study, we conduct a comprehensive analysis of multi-agent LLM dialogues for scientific ideation. We compare different configurations of agent roles, number of agents, and dialogue depth to understand how these factors influence the novelty and feasibility of generated ideas. Our experimental setup includes settings where one agent generates ideas and another critiques them, enabling iterative improvement. Our results show that enlarging the agent cohort, deepening the interaction depth, and broadening agent persona heterogeneity each enrich the diversity of generated ideas. Moreover, specifically increasing critic-side diversity within the ideation-critique-revision loop further boosts the feasibility of the final proposals. Our findings offer practical guidelines for building effective multi-agent LLM systems for scientific ideation. Our code is available at https://github.com/g6000/MultiAgent-Research-Ideator.
Article 530
Title@2025-07-11 (5): Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization
Title: Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization | Beyond N-Grams: Rethinking Evaluation Metrics and Strategies for Multilingual Abstractive Summarization | N-Grams以后:重新思考评价尺度和多语种抽象总结战略 2507.08342v1 |
Authors (3): Itai Mondshine, Tzuf Paz-Argaman, Reut Tsarfaty
Automatic n-gram-based metrics such as ROUGE are widely used for evaluating generative tasks such as summarization. While these metrics are considered indicative (even if imperfect) of human evaluation for English, their suitability for other languages remains unclear. To address this, we systematically assess evaluation metrics for generation, both n-gram-based and neural-based, to evaluate their effectiveness across languages and tasks. Specifically, we design a large-scale evaluation suite across eight languages from four typological families: agglutinative, isolating, low-fusional, and high-fusional, spanning both low- and high-resource settings, to analyze their correlation with human judgments. Our findings highlight the sensitivity of evaluation metrics to the language type. For example, in fusional languages, n-gram-based metrics show lower correlation with human assessments compared to isolating and agglutinative languages. We also demonstrate that proper tokenization can significantly mitigate this issue for morphologically rich fusional languages, sometimes even reversing negative trends. Additionally, we show that neural-based metrics specifically trained for evaluation, such as COMET, consistently outperform other neural metrics and better correlate with human judgments in low-resource languages. Overall, our analysis highlights the limitations of n-gram metrics for fusional languages and advocates for greater investment in neural-based metrics trained for evaluation tasks.
Article 531
Title@2025-07-11 (5): Distillation versus Contrastive Learning: How to Train Your Rerankers
Title: Distillation versus Contrastive Learning: How to Train Your Rerankers | Destillation versus Kontrastives Lernen: Wie Sie Ihre Reranker trainieren | 蒸馏与反竞争学习:如何培训你的再培训者 2507.08336v1 |
Authors (5): Zhichao Xu, Zhiqi Huang, Shengyao Zhuang, Ashim Gupta, Vivek Srikumar
Training text rerankers is crucial for information retrieval. Two primary strategies are widely used: contrastive learning (optimizing directly on ground-truth labels) and knowledge distillation (transferring knowledge from a larger reranker). While both have been studied in the literature, a clear comparison of their effectiveness for training cross-encoder rerankers under practical conditions is needed. This paper empirically compares these strategies by training rerankers of different sizes and architectures using both methods on the same data, with a strong contrastive learning model acting as the distillation teacher. Our results show that knowledge distillation generally yields better in-domain and out-of-domain ranking performance than contrastive learning when distilling from a larger teacher model. This finding is consistent across student model sizes and architectures. However, distilling from a teacher of the same capacity does not provide the same advantage, particularly for out-of-domain tasks. These findings offer practical guidance for choosing a training strategy based on available teacher models. Therefore, we recommend using knowledge distillation to train smaller rerankers if a larger, more powerful teacher is accessible; in its absence, contrastive learning provides a strong and more reliable alternative.
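The two training signals contrasted above can be sketched in a few lines of Python. The abstract does not state the exact objectives, so this is an assumption-laden illustration using two common formulations: a softmax cross-entropy over one positive and several negatives for contrastive learning, and a KL divergence from the teacher's score distribution to the student's for distillation. All function names and scores are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of relevance scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_loss(student_scores, positive_index):
    """Cross-entropy that pushes the labeled positive above the negatives."""
    probs = softmax(student_scores)
    return -math.log(probs[positive_index])


def distillation_loss(student_scores, teacher_scores):
    """KL(teacher || student) over the same candidate list."""
    p_teacher = softmax(teacher_scores)
    p_student = softmax(student_scores)
    return sum(
        t * math.log(t / s)
        for t, s in zip(p_teacher, p_student)
        if t > 0.0
    )


# One query with three candidate passages; index 0 is the labeled positive.
student = [1.2, 0.4, -0.3]
teacher = [2.5, 1.9, -1.0]  # the teacher also rates candidate 1 fairly highly

hard_label_loss = contrastive_loss(student, positive_index=0)
soft_label_loss = distillation_loss(student, teacher)
```

The difference in supervision is visible in the inputs: the contrastive loss sees only which candidate is labeled positive, while the distillation loss also receives the teacher's graded preferences among the remaining candidates.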
Article 532
Title@2025-07-11 (5): MK2 at PBIG Competition: A Prompt Generation Solution
Title: MK2 at PBIG Competition: A Prompt Generation Solution | MK2 bei PBIG Competition: Eine schnelle Generationslösung | PBIG竞争中的MK2:迅速代代解决办法 2507.08335v1 |
Authors (5): Yuzheng Xu, Tosho Hirasawa, Seiya Kawano, Shota Kato, Tadashi Kozuno
The Patent-Based Idea Generation task asks systems to turn real patents into product ideas viable within three years. We propose MK2, a prompt-centric pipeline: Gemini 2.5 drafts and iteratively edits a prompt, grafting useful fragments from weaker outputs; GPT-4.1 then uses this prompt to create one idea per patent, and an Elo loop judged by Qwen3-8B selects the best prompt, all without extra training data. Across three domains, two evaluator types, and six criteria, MK2 topped the automatic leaderboard and won 25 of 36 tests. Only the materials-chemistry track lagged, indicating the need for deeper domain grounding; yet, the results show that lightweight prompt engineering has already delivered competitive, commercially relevant ideation from patents.
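An Elo loop of the kind mentioned above can be sketched as follows. This is a minimal illustration that uses the standard Elo update with pairwise comparisons; the abstract does not give the pipeline's K-factor, pairing scheme, or judging prompt, so `select_best`, `toy_judge`, the initial rating of 1000, and `k=32` are all assumptions. In the real system the judge would be an LLM call rather than a length comparison.

```python
from itertools import combinations


def expected_score(rating_a, rating_b):
    """Standard Elo expectation that A beats B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Return updated ratings; score_a is 1.0 (A wins), 0.0, or 0.5 (tie)."""
    exp_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - exp_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - exp_a))
    return new_a, new_b


def select_best(prompts, judge, rounds=3, k=32.0):
    """Round-robin Elo tournament; returns the top prompt and all ratings."""
    ratings = {p: 1000.0 for p in prompts}
    for _ in range(rounds):
        for a, b in combinations(prompts, 2):
            ratings[a], ratings[b] = elo_update(
                ratings[a], ratings[b], judge(a, b), k
            )
    return max(ratings, key=ratings.get), ratings


def toy_judge(prompt_a, prompt_b):
    """Hypothetical stand-in for an LLM judge: prefers the longer prompt."""
    if len(prompt_a) == len(prompt_b):
        return 0.5
    return 1.0 if len(prompt_a) > len(prompt_b) else 0.0


candidates = [
    "List ideas.",
    "Propose one product idea per patent.",
    "Propose one product idea per patent, viable within three years.",
]
best_prompt, final_ratings = select_best(candidates, toy_judge)
```

Because each update moves the two ratings by equal and opposite amounts, the total rating mass stays constant, which makes the final ranking easy to sanity-check.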
Article 533
Title@2025-07-11 (5): Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection
Title: Emoji Attack: Enhancing Jailbreak Attacks Against Judge LLM Detection | Emoji-Angriff: Verstärkung von Jailbreak-Angriffen gegen Richter LLM-Erkennung | Emoji攻击:加强针对LLM法官的越狱袭击 2411.01077v4 |
Authors (3): Zhipeng Wei, Yuqi Liu, N. Benjamin Erichson
Jailbreaking techniques trick Large Language Models (LLMs) into producing restricted output, posing a potential threat. One line of defense is to use another LLM as a Judge to evaluate the harmfulness of generated text. However, we reveal that these Judge LLMs are vulnerable to token segmentation bias, an issue that arises when delimiters alter the tokenization process, splitting words into smaller sub-tokens. This alters the embeddings of the entire sequence, reducing detection accuracy and allowing harmful content to be misclassified as safe. In this paper, we introduce Emoji Attack, a novel strategy that amplifies existing jailbreak prompts by exploiting token segmentation bias. Our method leverages in-context learning to systematically insert emojis into text before it is evaluated by a Judge LLM, inducing embedding distortions that significantly lower the likelihood of detecting unsafe content. Unlike traditional delimiters, emojis also introduce semantic ambiguity, making them particularly effective in this attack. Through experiments on state-of-the-art Judge LLMs, we demonstrate that Emoji Attack substantially reduces the unsafe prediction rate, bypassing existing safeguards.
Article 534
Title@2025-07-11 (5): CRMAgent: A Multi-Agent LLM System for E-Commerce CRM Message Template Generation
Title: CRMAgent: A Multi-Agent LLM System for E-Commerce CRM Message Template Generation | CRMAgent: Ein Multi-Agent LLM-System für E-Commerce CRM-Meldungsvorlagen-Erstellung | CRMM 信息模板生成多机构代理LLM系统 2507.08325v1 |
Authors (3): Yinzhu Quan, Xinrui Li, Ying Chen
In e-commerce private-domain channels such as instant messaging and e-mail, merchants engage customers directly as part of their Customer Relationship Management (CRM) programmes to drive retention and conversion. While a few top performers excel at crafting outbound messages, most merchants struggle to write persuasive copy because they lack both expertise and scalable tools. We introduce CRMAgent, a multi-agent system built on large language models (LLMs) that generates high-quality message templates and actionable writing guidance through three complementary modes. First, group-based learning enables the agent to learn from a merchant’s own top-performing messages within the same audience segment and rewrite low-performing ones. Second, retrieval-and-adaptation fetches templates that share the same audience segment and exhibit high similarity in voucher type and product category, learns their successful patterns, and adapts them to the current campaign. Third, a rule-based fallback provides a lightweight zero-shot rewrite when no suitable references are available. Extensive experiments show that CRMAgent consistently outperforms merchants’ original templates, delivering significant gains in both audience-match and marketing-effectiveness metrics.
Article 535
Title@2025-07-11 (5): EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees
Title: EvalTree: Profiling Language Model Weaknesses via Hierarchical Capability Trees | EvalTree: Profiling Language Model Schwächen über Hierarchische Fähigkeiten Bäume | EvalTree:通过等级能力树分析语言模型弱点 2503.08893v2 |
Authors (4): Zhiyuan Zeng, Yizhong Wang, Hannaneh Hajishirzi, Pang Wei Koh
An ideal model evaluation should achieve two goals: identifying where the model fails and providing actionable improvement guidance. Toward these goals for language model (LM) evaluations, we formulate the problem of generating a weakness profile, a set of weaknesses expressed in natural language, given an LM’s performance on every individual instance in a benchmark. We introduce a suite of quantitative assessments to compare different weakness profiling methods. We also introduce a weakness profiling method EvalTree. EvalTree constructs a capability tree where each node represents a capability described in natural language and is linked to a subset of benchmark instances that specifically evaluate this capability; it then extracts nodes where the LM performs poorly to generate a weakness profile. On the MATH and WildChat benchmarks, we show that EvalTree outperforms baseline weakness profiling methods by identifying weaknesses more precisely and comprehensively. Weakness profiling further enables weakness-guided data collection, and training data collection guided by EvalTree-identified weaknesses improves LM performance more than other data collection strategies. We also show how EvalTree exposes flaws in Chatbot Arena’s human-voter-based evaluation practice. To facilitate future work, we provide an interface that allows practitioners to interactively explore the capability trees built by EvalTree.
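The idea of extracting poorly performing nodes from a capability tree can be illustrated with a small Python sketch. The abstract does not specify how the tree is built or which criterion selects weak nodes, so the `CapabilityNode` class, the accuracy threshold, and the minimum node size below are hypothetical choices made only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CapabilityNode:
    description: str     # capability described in natural language
    instance_ids: list   # benchmark instances linked to this capability
    children: list = field(default_factory=list)


def accuracy(node, correct):
    """Fraction of the node's linked instances that the model got right."""
    return sum(correct[i] for i in node.instance_ids) / len(node.instance_ids)


def weakness_profile(root, correct, threshold=0.5, min_size=2):
    """Collect the broadest nodes whose accuracy falls below `threshold`."""
    profile = []
    stack = [root]
    while stack:
        node = stack.pop()
        is_large_enough = len(node.instance_ids) >= min_size
        if is_large_enough and accuracy(node, correct) < threshold:
            profile.append(node.description)
            continue  # report the weak capability once, not each sub-capability
        stack.extend(node.children)
    return profile


# Toy benchmark: per-instance correctness of a hypothetical model.
correct = {0: True, 1: True, 2: True, 3: False, 4: False, 5: True}
tree = CapabilityNode(
    "solving math problems",
    [0, 1, 2, 3, 4, 5],
    [
        CapabilityNode("manipulating algebraic expressions", [0, 1, 2]),
        CapabilityNode("reasoning about geometric figures", [3, 4, 5]),
    ],
)
profile = weakness_profile(tree, correct)
```

In this toy case the root passes the threshold (4 of 6 correct) while the geometry node does not (1 of 3), so only the geometry capability is reported.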
Article 536
Title@2025-07-11 (5): Improving MLLM’s Document Image Machine Translation via Synchronously Self-reviewing Its OCR Proficiency
Title: Improving MLLM’s Document Image Machine Translation via Synchronously Self-reviewing Its OCR Proficiency | Verbesserung der Übersetzung von MLLMs Dokumentenbildmaschinen durch synchrone Selbstprüfung ihrer OCR-Kenntnisse | 通过同步进行自我审查,改进MLLM的文件图像机机翻译,提高OCR的熟练程度 2507.08309v1 |
Authors (8): Yupu Liang, Yaping Zhang, Zhiyang Zhang, Zhiyuan Chen, Yang Zhao, Lu Xiang, Chengqing Zong, Yu Zhou
Multimodal Large Language Models (MLLMs) have shown strong performance in document image tasks, especially Optical Character Recognition (OCR). However, they struggle with Document Image Machine Translation (DIMT), which requires handling both cross-modal and cross-lingual challenges. Previous efforts to enhance DIMT capability through Supervised Fine-Tuning (SFT) on the DIMT dataset often result in the forgetting of the model’s existing monolingual abilities, such as OCR. To address these challenges, we introduce a novel fine-tuning paradigm, Synchronously Self-Reviewing (SSR) of the model’s OCR proficiency, inspired by the concept of “Bilingual Cognitive Advantage”. Specifically, SSR prompts the model to generate OCR text before producing translation text, which allows the model to leverage its strong monolingual OCR ability while learning to translate text across languages. Comprehensive experiments demonstrate that the proposed SSR learning helps mitigate catastrophic forgetting, improving the generalization ability of MLLMs on both OCR and DIMT tasks.
Article 537
Title@2025-07-11 (5): M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning
Title: M2-Reasoning: Empowering MLLMs with Unified General and Spatial Reasoning | M2-Reasoning: Stärkung von MLLMs mit einheitlicher allgemeiner und räumlicher Vernunft | M2-反应:以统一的一般和空间理由,赋予MLLMs权力 2507.08306v1 |
Authors (14): Inclusion AI, Fudong Wang, Jiajia Liu, Jingdong Chen, Jun Zhou, Kaixiang Ji, Lixiang Ru, Qingpei Guo, Ruobing Zheng, Tianqi Li, Yi Yuan, Yifan Mao, Yuting Xiao, Ziping Ma
Recent advancements in Multimodal Large Language Models (MLLMs), particularly through Reinforcement Learning with Verifiable Rewards (RLVR), have significantly enhanced their reasoning abilities. However, a critical gap persists: these models struggle with dynamic spatial interactions, a capability essential for real-world applications. To bridge this gap, we introduce M2-Reasoning-7B, a model designed to excel in both general and spatial reasoning. Our approach integrates two key innovations: (1) a novel data pipeline that generates 294.2K high-quality data samples (168K for cold-start fine-tuning and 126.2K for RLVR), which feature logically coherent reasoning trajectories and have undergone comprehensive assessment; and (2) a dynamic multi-task training strategy with step-wise optimization to mitigate conflicts between data, and task-specific rewards for delivering tailored incentive signals. This combination of curated data and advanced training allows M2-Reasoning-7B to set a new state-of-the-art (SOTA) across 8 benchmarks, showcasing superior performance in both general and spatial reasoning domains.
Article 538
Title@2025-07-11 (5): Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective
Title: Evaluating Implicit Bias in Large Language Models by Attacking From a Psychometric Perspective | Bewertung von Impliziten Bias in großen Sprachmodellen durch Angriff aus einer psychometrischen Perspektive | 通过从心理角度进行攻击,评价大语言模型中隐含的偏见 2406.14023v5 |
Authors (5): Yuchen Wen, Keping Bi, Wei Chen, Jiafeng Guo, Xueqi Cheng
As large language models (LLMs) become an important means of accessing information, there have been increasing concerns that LLMs may intensify the spread of unethical content, including implicit bias that hurts certain populations without explicit harmful words. In this paper, we conduct a rigorous evaluation of LLMs’ implicit bias towards certain demographics by attacking them from a psychometric perspective to elicit agreement with biased viewpoints. Inspired by psychometric principles in cognitive and social psychology, we propose three attack approaches, i.e., Disguise, Deception, and Teaching. Incorporating the corresponding attack instructions, we built two benchmarks: (1) a bilingual dataset with biased statements covering four bias types (2.7K instances) for extensive comparative analysis, and (2) BUMBLE, a larger benchmark spanning nine common bias types (12.7K instances) for comprehensive evaluation. Extensive evaluation of popular commercial and open-source LLMs shows that our methods can elicit LLMs’ inner bias more effectively than competitive baselines. Our attack methodology and benchmarks offer an effective means of assessing the ethical risks of LLMs, driving progress toward greater accountability in their development. Our code, data, and benchmarks are available at https://yuchenwen1.github.io/ImplicitBiasEvaluation/.
Article 539
Title@2025-07-11 (5): Bandit-Based Prompt Design Strategy Selection Improves Prompt Optimizers
Title: Bandit-Based Prompt Design Strategy Selection Improves Prompt Optimizers | Bandit-Based Prompt Design Strategy Selection verbessert Prompt Optimizers | 基于强盗的即时设计战略选择改进即时优化 2503.01163v2 |
Authors (5): Rin Ashizawa, Yoichi Hirose, Nozomu Yoshinari, Kento Uchida, Shinichi Shirakawa
Prompt optimization aims to search for effective prompts that enhance the performance of large language models (LLMs). Although existing prompt optimization methods have discovered effective prompts, the prompts they find often differ from the sophisticated prompts carefully designed by human experts. Prompt design strategies, representing best practices for improving prompt performance, can be key to improving prompt optimization. Recently, a method termed the Autonomous Prompt Engineering Toolbox (APET) has incorporated various prompt design strategies into the prompt optimization process. In APET, the LLM must implicitly select and apply the appropriate strategies because prompt design strategies can have negative effects. This implicit selection may be suboptimal due to the limited optimization capabilities of LLMs. This paper introduces Optimizing Prompts with sTrategy Selection (OPTS), which implements explicit selection mechanisms for prompt design. We propose three mechanisms, including a Thompson sampling-based approach, and integrate them into EvoPrompt, a well-known prompt optimizer. Experiments optimizing prompts for two LLMs, Llama-3-8B-Instruct and GPT-4o mini, were conducted using BIG-Bench Hard. Our results show that the selection of prompt design strategies improves the performance of EvoPrompt, and the Thompson sampling-based mechanism achieves the best overall results. Our experimental code is provided at https://github.com/shiralab/OPTS .
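Explicit strategy selection by Thompson sampling can be sketched as a Beta-Bernoulli bandit. This is a generic illustration, not the OPTS implementation: the strategy names, the binary "did the prompt improve" reward, the uniform Beta(1, 1) priors, and the simulated success rates are all assumptions made for the example.

```python
import random


class ThompsonStrategySelector:
    """Beta-Bernoulli Thompson sampling over prompt design strategies."""

    def __init__(self, strategies, seed=0):
        self.rng = random.Random(seed)
        # Beta(1, 1) prior: one pseudo-success and one pseudo-failure each.
        self.successes = {s: 1 for s in strategies}
        self.failures = {s: 1 for s in strategies}

    def select(self):
        """Sample a plausible success rate per strategy and pick the best."""
        samples = {
            s: self.rng.betavariate(self.successes[s], self.failures[s])
            for s in self.successes
        }
        return max(samples, key=samples.get)

    def update(self, strategy, improved):
        """Record whether applying the strategy improved the prompt."""
        if improved:
            self.successes[strategy] += 1
        else:
            self.failures[strategy] += 1


# Simulated environment with hypothetical per-strategy improvement rates.
true_rates = {
    "few_shot_examples": 0.8,
    "role_assignment": 0.2,
    "step_by_step": 0.1,
}
selector = ThompsonStrategySelector(list(true_rates), seed=0)
environment = random.Random(1)
counts = {s: 0 for s in true_rates}
for _ in range(500):
    chosen = selector.select()
    counts[chosen] += 1
    selector.update(chosen, environment.random() < true_rates[chosen])
```

Over the simulated rounds the selector concentrates its choices on the strategy with the highest improvement rate while still occasionally trying the others, which is the exploration and exploitation trade-off that motivates a bandit-style selector.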
nan
Article 540
Title@2025-07-11 (5): Lightweight Safety Guardrails via Synthetic Data and RL-guided Adversarial Training
Title: Lightweight Safety Guardrails via Synthetic Data and RL-guided Adversarial Training | Leichte Sicherheits-Guardrails über Synthetische Daten und RL-geführtes Adversarial Training | 通过合成数据和RL制导反向训练轻量安全护卫车 2507.08284v1 |
Authors (8): Aleksei Ilin, Gor Matevosyan, Xueying Ma, Vladimir Eremin, Suhaa Dada, Muqun Li, Riyaaz Shaik, Haluk Noyan Tokgozoglu
We introduce a lightweight yet highly effective safety guardrail framework for language models, demonstrating that small-scale language models can achieve, and even surpass, the performance of larger counterparts in content moderation tasks. This is accomplished through high-fidelity synthetic data generation and adversarial training. The synthetic data generation process begins with human-curated seed data, which undergoes query augmentation and paraphrasing to create diverse and contextually rich examples. This augmented data is then subjected to multiple rounds of curation, ensuring high fidelity and relevance. Inspired by recent advances in the Generative Adversarial Network (GAN) architecture, our adversarial training employs reinforcement learning to guide a generator that produces challenging synthetic examples. These examples are used to fine-tune the safety classifier, enhancing its ability to detect and mitigate harmful content. Additionally, we incorporate strategies from recent research on efficient LLM training, leveraging the capabilities of smaller models to improve the performance of larger generative models. With iterative adversarial training and the generation of diverse, high-quality synthetic data, our framework enables small language models (SLMs) to serve as robust safety guardrails. This approach not only reduces computational overhead but also enhances resilience against adversarial attacks, offering a scalable and efficient solution for content moderation in AI systems.
nan
Article 541
Title@2025-07-11 (5): Generative Retrieval and Alignment Model: A New Paradigm for E-commerce Retrieval
Title: Generative Retrieval and Alignment Model: A New Paradigm for E-commerce Retrieval | Generatives Retrieval- und Alignment-Modell: Ein neues Paradigma für E-Commerce Retrieval | 产生检索和调整模型:电子商务检索的新范例 2504.01403v2 |
Authors (11): Ming Pang, Chunyuan Yuan, Xiaoyu He, Zheng Fang, Donghao Xie, Fanyi Qu, Xue Jiang, Changping Peng, Zhangang Lin, Ching Law, Jingping Shao
Traditional sparse and dense retrieval methods struggle to leverage general world knowledge and often fail to capture the nuanced features of queries and products. With the advent of large language models (LLMs), industrial search systems have started to employ LLMs to generate identifiers for product retrieval. Commonly used identifiers include (1) static/semantic IDs and (2) product term sets. The first approach requires creating a product ID system from scratch, missing out on the world knowledge embedded within LLMs. While the second approach leverages this general knowledge, the significant difference in word distribution between queries and products means that product-based identifiers often do not align well with user search queries, leading to missed product recalls. Furthermore, when queries contain numerous attributes, these algorithms generate a large number of identifiers, making it difficult to assess their quality, which results in low overall recall efficiency. To address these challenges, this paper introduces a novel e-commerce retrieval paradigm: the Generative Retrieval and Alignment Model (GRAM). GRAM employs joint training on text information from both queries and products to generate shared text identifier codes, effectively bridging the gap between queries and products. This approach not only enhances the connection between queries and products but also improves inference efficiency. The model uses a co-alignment strategy to generate codes optimized for maximizing retrieval efficiency. Additionally, it introduces a query-product scoring mechanism to compare product values across different codes, further boosting retrieval efficiency. Extensive offline and online A/B testing demonstrates that GRAM significantly outperforms traditional models and the latest generative retrieval models, confirming its effectiveness and practicality.
nan
Article 542
Title@2025-07-11 (5): SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths
Title: SpecDec++: Boosting Speculative Decoding via Adaptive Candidate Lengths | SpecDec++: Spekulative Dekodierung durch adaptive Kandidatenlängen steigern | SpecDec++:通过适应性候选时间长度促进投机替代 2405.19715v3 |
Authors (3): Kaixuan Huang, Xudong Guo, Mengdi Wang
Speculative decoding reduces the inference latency of a target large language model via utilizing a smaller and faster draft model. Its performance depends on a hyperparameter K – the candidate length, i.e., the number of candidate tokens for the target model to verify in each round. However, previous methods often use simple heuristics to choose K, which may result in sub-optimal performance. We study the choice of the candidate length K and formulate it as a Markov Decision Process. We theoretically show that the optimal policy of this Markov decision process takes the form of a threshold policy, i.e., the current speculation should stop and be verified when the probability of getting a rejection exceeds a threshold value. Motivated by this theory, we propose SpecDec++, an enhanced version of speculative decoding that adaptively determines the candidate length on the fly. We augment the draft model with a trained acceptance prediction head to predict the conditional acceptance probability of the candidate tokens. SpecDec++ will stop the current speculation when the predicted probability that at least one token gets rejected exceeds a threshold. We implement SpecDec++ and apply it to the llama-2-chat 7B & 70B model pair. Our adaptive method achieves a 2.04x speedup on the Alpaca dataset (7.2% improvement over the baseline speculative decoding). On the GSM8K and HumanEval datasets, our method achieves a 2.26x speedup (9.4% improvement) and 2.23x speedup (11.1% improvement), respectively. The code of this paper is available at https://github.com/Kaffaljidhmah2/SpecDec_pp.
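The stopping rule described above can be sketched in a few lines. The sketch assumes the acceptance prediction head has already produced one conditional acceptance probability per drafted token; the function name `candidate_length` and its parameters are hypothetical, and the real system makes this decision token by token during drafting rather than over a precomputed list.

```python
def candidate_length(accept_probs, threshold, max_len):
    """Return how many draft tokens to propose before verification.

    accept_probs[i] is the predicted probability that draft token i is
    accepted given that all earlier draft tokens were accepted.
    """
    all_accepted = 1.0
    for k, p in enumerate(accept_probs[:max_len], start=1):
        all_accepted *= p  # chain rule: P(first k tokens all accepted)
        # Stop once P(at least one rejection) exceeds the threshold.
        if 1.0 - all_accepted > threshold:
            return k
    return min(len(accept_probs), max_len)


print(candidate_length([0.9] * 10, threshold=0.25, max_len=10))
```

A lower threshold makes the draft model hand over to the target model sooner, trading fewer wasted draft tokens for more verification rounds.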
nan
Article 543
Title@2025-07-11 (5): Exploring Gender Differences in Chronic Pain Discussions on Reddit
Title: Exploring Gender Differences in Chronic Pain Discussions on Reddit | Erforschung geschlechtsspezifischer Unterschiede bei chronischen Schmerzdiskussionen auf Reddit | 探讨关于康复的慢性疼痛讨论中的性别差异 2507.08241v1 |
Authors (3): Ancita Maria Andrade, Tanvi Banerjee, Ramakrishna Mundugar
Pain is an inherent part of human existence, manifesting as both physical and emotional experiences, and can be categorized as either acute or chronic. Over the years, extensive research has been conducted to understand the causes of pain and explore potential treatments, with contributions from various scientific disciplines. However, earlier studies often overlooked the role of gender in pain experiences. In this study, we utilized Natural Language Processing (NLP) to analyze and gain deeper insights into individuals’ pain experiences, with a particular focus on gender differences. We successfully classified posts into male and female corpora using the Hidden Attribute Model-Convolutional Neural Network (HAM-CNN), achieving an F1 score of 0.86 by aggregating posts based on usernames. Our analysis revealed linguistic differences between genders, with female posts tending to be more emotionally focused. Additionally, the study highlighted that conditions such as migraine and sinusitis are more prevalent among females and explored how pain medication affects individuals differently based on gender.
nan
Article 544
Title@2025-07-11 (5): Sequence graphs realizations and ambiguity in language models
Title: Sequence graphs realizations and ambiguity in language models | Sequenzgraphen Realisationen und Mehrdeutigkeit in Sprachmodellen | 顺序图 语文模式的实现和模糊 2402.08830v2 |
Authors (3): Sammy Khalife, Yann Ponty, Laurent Bulteau
Several popular language models represent local contexts in an input text $x$ as bags of words. Such representations are naturally encoded by a sequence graph whose vertices are the distinct words occurring in $x$, with edges representing the (ordered) co-occurrence of two words within a sliding window of size $w$. However, this compressed representation is not generally bijective: some may be ambiguous, admitting several realizations as a sequence, while others may not admit any realization. In this paper, we study the realizability and ambiguity of sequence graphs from a combinatorial and algorithmic point of view. We consider the existence and enumeration of realizations of a sequence graph under multiple settings: window size $w$, presence/absence of graph orientation, and presence/absence of weights (multiplicities). When $w=2$, we provide polynomial time algorithms for realizability and enumeration in all cases except the undirected/weighted setting, where we show the $\#$P-hardness of enumeration. For $w \ge 3$, we prove the hardness of all variants, even when $w$ is considered as a constant, with the notable exception of the undirected unweighted case for which we propose XP algorithms for both problems, tight due to a corresponding $W[1]$-hardness result. We conclude with an integer program formulation to solve the realizability problem, and a dynamic programming algorithm to solve the enumeration problem in instances of moderate sizes. This work leaves open the membership to NP of both problems, a non-trivial question due to the existence of minimum realizations having size exponential on the instance encoding.
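To make the ambiguity concrete, the following sketch builds a directed, weighted sequence graph and shows two different sequences that map to the same graph. It assumes one particular convention, namely that a window of size $w$ links each word to the next $w-1$ words, so $w=2$ links consecutive words only; the function name `sequence_graph` is hypothetical.

```python
from collections import Counter


def sequence_graph(tokens, w):
    """Return (vertices, weighted directed edges) for window size w."""
    edges = Counter()
    for i, u in enumerate(tokens):
        # Link token i to the following w - 1 tokens (assumed convention).
        for j in range(i + 1, min(i + w, len(tokens))):
            edges[(u, tokens[j])] += 1
    return set(tokens), edges


g1 = sequence_graph("a b a".split(), w=2)
g2 = sequence_graph("b a b".split(), w=2)
print(g1 == g2)  # two distinct sequences, one sequence graph
```

Both sequences yield the vertices {a, b} with one a-to-b edge and one b-to-a edge, so the graph alone does not determine the original text.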
nan
Article 545
Title@2025-07-11 (5): Can LLMs Reliably Simulate Real Students’ Abilities in Mathematics and Reading Comprehension?
Title: Can LLMs Reliably Simulate Real Students’ Abilities in Mathematics and Reading Comprehension? | Können LLMs die Fähigkeiten von Realstudenten in Mathematik und Leseverständnis zuverlässig simulieren? | LLMs能够令人信赖地模拟真实学生的数学和阅读理解能力吗? 2507.08232v1 |
Authors (3): KV Aditya Srivatsa, Kaushal Kumar Maurya, Ekaterina Kochmar
Large Language Models (LLMs) are increasingly used as proxy students in the development of Intelligent Tutoring Systems (ITSs) and in piloting test questions. However, to what extent these proxy students accurately emulate the behavior and characteristics of real students remains an open question. To investigate this, we collected a dataset of 489 items from the National Assessment of Educational Progress (NAEP), covering mathematics and reading comprehension in grades 4, 8, and 12. We then apply an Item Response Theory (IRT) model to position 11 diverse and state-of-the-art LLMs on the same ability scale as real student populations. Our findings reveal that, without guidance, strong general-purpose models consistently outperform the average student at every grade, while weaker or domain-mismatched models may align incidentally. Using grade-enforcement prompts changes models’ performance, but whether they align with the average grade-level student remains highly model- and prompt-specific: no evaluated model-prompt pair fits the bill across subjects and grades, underscoring the need for new training and evaluation strategies. We conclude by providing guidelines for the selection of viable proxies based on our findings.
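Placing a model on a student ability scale can be illustrated with a two-parameter logistic (2PL) item response function and a grid-search maximum-likelihood estimate of ability. The abstract does not say which IRT variant is used, so the 2PL form, the item parameters, and the names `p_correct` and `estimate_ability` are assumptions made only for this sketch.

```python
import math


def p_correct(theta, a, b):
    """2PL model: probability of a correct answer at ability theta.

    a is the item discrimination and b the item difficulty (assumed form).
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def estimate_ability(responses, items, grid=None):
    """Grid-search maximum-likelihood ability for 0/1 responses."""
    grid = grid or [g / 100 for g in range(-400, 401)]

    def loglik(theta):
        total = 0.0
        for y, (a, b) in zip(responses, items):
            p = p_correct(theta, a, b)
            total += math.log(p if y else 1.0 - p)
        return total

    return max(grid, key=loglik)


# Hypothetical items as (discrimination, difficulty) pairs.
items = [(1.0, -1.0), (1.0, 0.0), (1.0, 1.0)]
theta_strong = estimate_ability([1, 1, 0], items)
theta_weak = estimate_ability([1, 0, 0], items)
print(theta_strong, theta_weak)
```

With item parameters calibrated on real student data, the same estimate applied to an LLM's graded answers yields an ability value that can be compared with the student distribution.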
nan
Article 546
Title@2025-07-10 (4): Post-hoc Study of Climate Microtargeting on Social Media Ads with LLMs: Thematic Insights and Fairness Evaluation
Title: Post-hoc Study of Climate Microtargeting on Social Media Ads with LLMs: Thematic Insights and Fairness Evaluation | Post-hoc-Studie zum Thema Klima-Mikrotargeting auf Social Media-Anzeigen mit LLMs: Thematische Einblicke und Fairness-Evaluierung | 利用LLMM:专题透视和公平评估 2410.05401v3 |
Authors (2): Tunazzina Islam, Dan Goldwasser
Climate change communication on social media increasingly employs microtargeting strategies to effectively reach and influence specific demographic groups. This study presents a post-hoc analysis of microtargeting practices within climate campaigns by leveraging large language models (LLMs) to examine Facebook advertisements. Our analysis focuses on two key aspects: demographic targeting and fairness. We evaluate the ability of LLMs to accurately predict the intended demographic targets, such as gender and age group, achieving an overall accuracy of 88.55%. Furthermore, we instruct the LLMs to generate explanations for their classifications, providing transparent reasoning behind each decision. These explanations reveal the specific thematic elements used to engage different demographic segments, highlighting distinct strategies tailored to various audiences. Our findings show that young adults are primarily targeted through messages emphasizing activism and environmental consciousness, while women are engaged through themes related to caregiving roles and social advocacy. In addition to evaluating the effectiveness of LLMs in detecting microtargeted messaging, we conduct a comprehensive fairness analysis to identify potential biases in model predictions. Our findings indicate that while LLMs perform well overall, certain biases exist, particularly in the classification of senior citizens and male audiences. By showcasing the efficacy of LLMs in dissecting and explaining targeted communication strategies and by highlighting fairness concerns, this study provides a valuable framework for future research aimed at enhancing transparency, accountability, and inclusivity in social media-driven climate campaigns.
nan
Article 547
Title@2025-07-10 (4): Extracting memorized pieces of (copyrighted) books from open-weight language models
Title: Extracting memorized pieces of (copyrighted) books from open-weight language models | Extrahieren von auswendig gelernten Stücken von Büchern aus Open-Weight-Sprachmodellen | 从开放重量级语言模式中提取(复印权)书籍 2505.12546v2 |
Authors (8): A. Feder Cooper, Aaron Gokaslan, Ahmed Ahmed, Amy B. Cyphert, Christopher De Sa, Mark A. Lemley, Daniel E. Ho, Percy Liang
Plaintiffs and defendants in copyright lawsuits over generative AI often make sweeping, opposing claims about the extent to which large language models (LLMs) have memorized plaintiffs’ protected expression. Drawing on adversarial ML and copyright law, we show that these polarized positions dramatically oversimplify the relationship between memorization and copyright. To do so, we leverage a recent probabilistic extraction technique to extract pieces of the Books3 dataset from 17 open-weight LLMs. Through numerous experiments, we show that it’s possible to extract substantial parts of at least some books from different LLMs. This is evidence that these LLMs have memorized the extracted text; this memorized content is copied inside the model parameters. But the results are complicated: the extent of memorization varies both by model and by book. With our specific experiments, we find that the largest LLMs don’t memorize most books, either in whole or in part. However, we also find that Llama 3.1 70B memorizes some books, like Harry Potter and the Sorcerer’s Stone and 1984, almost entirely. In fact, Harry Potter is so memorized that, using a seed prompt consisting of just the first line of chapter 1, we can deterministically generate the entire book near-verbatim. We discuss why our results have significant implications for copyright cases, though not ones that unambiguously favor either side.
nan
Article 548
Title@2025-07-10 (4): Riddle Generation using Learning Resources
Title: Riddle Generation using Learning Resources | Riddle Generation mit Lernressourcen | 利用学习资源的中一代人 2310.18290v3 |
Authors (3): Niharika Sri Parasa, Chaitali Diwan, Srinath Srinivasa
One of the primary challenges in online learning environments is to retain learner engagement. Several different instructional strategies are proposed both in online and offline environments to enhance learner engagement. The Concept Attainment Model is one such instructional strategy that focuses on learners acquiring a deeper understanding of a concept rather than just its dictionary definition. This is done by searching and listing the properties used to distinguish examples from non-examples of various concepts. Our work attempts to apply the Concept Attainment Model to build conceptual riddles, to deploy over online learning environments. The approach involves creating factual triples from learning resources, classifying them based on their uniqueness to a concept into ‘Topic Markers’ and ‘Common’, followed by generating riddles based on the Concept Attainment Model’s format and capturing all possible solutions to those riddles. The results obtained from the human evaluation of riddles prove encouraging.
nan
Article 549
Title@2025-07-10 (4): TruthTorchLM: A Comprehensive Library for Predicting Truthfulness in LLM Outputs
Title: TruthTorchLM: A Comprehensive Library for Predicting Truthfulness in LLM Outputs | TruthTorchLM: Eine umfassende Bibliothek für die Vorhersage von Wahrhaftigkeit in LLM-Ausgaben | TruthTorchLM:LLM产出中预测真相综合图书馆 2507.08203v1 |
Authors (12): Duygu Nur Yaldiz, Yavuz Faruk Bakman, Sungmin Kang, Alperen Öziş, Hayrettin Eren Yildiz, Mitash Ashish Shah, Zhiqi Huang, Anoop Kumar, Alfy Samuel, Daben Liu, Sai Praneeth Karimireddy, Salman Avestimehr
Generative Large Language Models (LLMs) inevitably produce untruthful responses. Accurately predicting the truthfulness of these outputs is critical, especially in high-stakes settings. To accelerate research in this domain and make truthfulness prediction methods more accessible, we introduce TruthTorchLM, an open-source, comprehensive Python library featuring over 30 truthfulness prediction methods, which we refer to as Truth Methods. Unlike existing toolkits such as Guardrails, which focus solely on document-grounded verification, or LM-Polygraph, which is limited to uncertainty-based methods, TruthTorchLM offers a broad and extensible collection of techniques. These methods span diverse tradeoffs in computational cost, access level (e.g., black-box vs white-box), grounding document requirements, and supervision type (self-supervised or supervised). TruthTorchLM is seamlessly compatible with both HuggingFace and LiteLLM, enabling support for locally hosted and API-based models. It also provides a unified interface for generation, evaluation, calibration, and long-form truthfulness prediction, along with a flexible framework for extending the library with new methods. We conduct an evaluation of representative truth methods on three datasets, TriviaQA, GSM8K, and FactScore-Bio. The code is available at https://github.com/Ybakman/TruthTorchLM
nan
Article 550
Title@2025-07-10 (4): Overview of the TREC 2021 deep learning track
Title: Overview of the TREC 2021 deep learning track | Überblick über den Deep-Learning-Track TREC 2021 | TREC 2021年深学习轨迹概览 2507.08191v1 |
Authors (5): Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Jimmy Lin
This is the third year of the TREC Deep Learning track. As in previous years, we leverage the MS MARCO datasets that made hundreds of thousands of human-annotated training labels available for both passage and document ranking tasks. In addition, this year we refreshed both the document and the passage collections, which also led to a nearly fourfold increase in the document collection size and a nearly 16-fold increase in the size of the passage collection. Deep neural ranking models that employ large-scale pretraining continued to outperform traditional retrieval methods this year. We also found that single-stage retrieval can achieve good performance on both tasks, although it still does not perform on par with multistage retrieval pipelines. Finally, the increase in the collection size and the general data refresh raised some questions about the completeness of NIST judgments and the quality of the training labels that were mapped to the new collections from the old ones, which we discuss in this report.
nan
Article 551
Title@2025-07-10 (4): Overview of the TREC 2022 deep learning track
Title: Overview of the TREC 2022 deep learning track | Überblick über den Deep-Learning-Track TREC 2022 | TREC 2022深学习轨迹概览 2507.10865v1 |
Authors (7): Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Daniel Campos, Jimmy Lin, Ellen M. Voorhees, Ian Soboroff
This is the fourth year of the TREC Deep Learning track. As in previous years, we leverage the MS MARCO datasets that made hundreds of thousands of human-annotated training labels available for both passage and document ranking tasks. In addition, this year we also leverage both the refreshed passage and document collections that were released last year, leading to a nearly 16-fold increase in the size of the passage collection and a nearly fourfold increase in the document collection size. Unlike previous years, in 2022 we mainly focused on constructing a more complete test collection for the passage retrieval task, which has been the primary focus of the track. The document ranking task was kept as a secondary task, where document-level labels were inferred from the passage-level labels. Our analysis shows that, similar to previous years, deep neural ranking models that employ large-scale pretraining continued to outperform traditional retrieval methods. Because we focused our judging resources on passage judging, we are more confident in the quality of this year’s queries and judgments, with respect to our ability to distinguish between runs and reuse the dataset in the future. We also see some surprises in overall outcomes. Some top-performing runs did not do dense retrieval. Runs that did single-stage dense retrieval were not as competitive this year as they were last year.
nan
Article 552
Title@2025-07-10 (4): GeistBERT: Breathing Life into German NLP
Title: GeistBERT: Breathing Life into German NLP | GeistBERT: Das Leben in die deutsche NLP einatmen | 呼吸生命化为德国NLP 2506.11903v4 |
Authors (2): Raphael Scheible-Schmitt, Johann Frei
Advances in transformer-based language models have highlighted the benefits of language-specific pre-training on high-quality corpora. In this context, German NLP stands to gain from updated architectures and modern datasets tailored to the linguistic characteristics of the German language. GeistBERT seeks to improve German language processing by incrementally training on a diverse corpus and optimizing model performance across various NLP tasks. We pre-trained GeistBERT using fairseq, following the RoBERTa base configuration with Whole Word Masking (WWM), and initialized from GottBERT weights. The model was trained on a 1.3 TB German corpus with dynamic masking and a fixed sequence length of 512 tokens. For evaluation, we fine-tuned the model on standard downstream tasks, including NER (CoNLL 2003, GermEval 2014), text classification (GermEval 2018 coarse/fine, 10kGNAD), and NLI (German XNLI), using $F_1$ score and accuracy as evaluation metrics. GeistBERT achieved strong results across all tasks, leading among base models and setting a new state-of-the-art (SOTA) in GermEval 2018 fine text classification. It also outperformed several larger models, particularly in classification benchmarks. To support research in German NLP, we release GeistBERT under the MIT license.
nan
Article 553
Title@2025-07-10 (4): Overview of the TREC 2023 deep learning track
Title: Overview of the TREC 2023 deep learning track | Überblick über den Deep-Learning-Track TREC 2023 | TREC 2023深学习轨迹概览 2507.08890v1 |
Authors (8): Nick Craswell, Bhaskar Mitra, Emine Yilmaz, Hossein A. Rahmani, Daniel Campos, Jimmy Lin, Ellen M. Voorhees, Ian Soboroff
This is the fifth year of the TREC Deep Learning track. As in previous years, we leverage the MS MARCO datasets that made hundreds of thousands of human-annotated training labels available for both passage and document ranking tasks. We mostly repeated last year’s design, to get another matching test set, based on the larger, cleaner, less-biased v2 passage and document set, with passage ranking as primary and document ranking as a secondary task (using labels inferred from passage). As we did last year, we sample from MS MARCO queries that were completely held out, unused in corpus construction, unlike the test queries in the first three years. This approach yields a more difficult test with more headroom for improvement. Alongside the usual (human) queries from MS MARCO, this year we generated synthetic queries using a fine-tuned T5 model and using a GPT-4 prompt. The new headline result this year is that runs using Large Language Model (LLM) prompting in some way outperformed runs that use the “nnlm” approach, which was the best approach in the previous four years. Since this is the last year of the track, future iterations of prompt-based ranking can happen in other tracks. Human relevance assessments were applied to all query types, not just human MS MARCO queries. Evaluation using synthetic queries gave similar results to human queries, with system ordering agreement of $\tau=0.8487$. However, human effort was needed to select a subset of the synthetic queries that were usable. We did not see clear evidence of bias, where runs using GPT-4 were favored when evaluated using synthetic GPT-4 queries, or where runs using T5 were favored when evaluated on synthetic T5 queries.
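The system ordering agreement $\tau$ reported above is a rank correlation between two orderings of the same runs. As a rough illustration, the sketch below computes Kendall's tau (the tau-a variant, which ignores tie corrections) from per-run scores under two query sets. The choice of variant, the function name `kendall_tau`, and the scores are assumptions for illustration, not data from the track.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same systems."""
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1  # both query sets rank the pair the same way
        elif sign < 0:
            discordant += 1  # the two query sets disagree on the pair
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores for four runs on human and synthetic queries.
human = [0.50, 0.40, 0.30, 0.20]
synthetic = [0.50, 0.30, 0.40, 0.20]
tau = kendall_tau(human, synthetic)
print(round(tau, 4))
```

A value of 1 means the two query sets rank every pair of runs identically, and each swapped pair lowers the value.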
nan
Article 554
Title@2025-07-10 (4): Distilling Empathy from Large Language Models
Title: Distilling Empathy from Large Language Models | Empathie aus großen Sprachmodellen destillieren | 提炼大语言模型的冷漠 2507.08151v1 |
Authors (4): Henry J. Xie, Jinghan Zhang, Xinhao Zhang, Kunpeng Liu
The distillation of knowledge from Large Language Models (LLMs) into Smaller Language Models (SLMs), preserving the capabilities and performance of LLMs while reducing model size, has played a key role in the proliferation of LLMs. Because SLMs are considerably smaller than LLMs, they are often utilized in domains where human interaction is frequent but resources are highly constrained, e.g., smart phones. Therefore, it is crucial to ensure that empathy, a fundamental aspect of positive human interactions, already instilled into LLMs, is retained by SLMs after distillation. In this paper, we develop a comprehensive approach for effective empathy distillation from LLMs into SLMs. Our approach features a two-step fine-tuning process that fully leverages datasets of empathetic dialogue responses distilled from LLMs. We explore several distillation methods beyond basic direct prompting and propose four unique sets of prompts for targeted empathy improvement to significantly enhance the empathy distillation process. Our evaluations demonstrate that SLMs fine-tuned through the two-step fine-tuning process with distillation datasets enhanced by the targeted empathy improvement prompts significantly outperform the base SLM at generating empathetic responses with a win rate of 90%. Our targeted empathy improvement prompts substantially outperform the basic direct prompting with a 10% improvement in win rate.
nan
Article 555
Title@2025-07-10 (4): Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores
Title: Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores | Kompaktor: Kalibrierte Abfrage-agnostische KV Cache-Kompression mit ungefähren Leverage-Scores | 压缩器: 使用近似杠杆分数校准查询- 不可知性 KV CA缓存压缩 2507.08143v1 |
Authors (2): Vivek Chari, Benjamin Van Durme
Modern Large Language Models (LLMs) are increasingly trained to support very large context windows. Unfortunately, the ability to use long contexts in generation is complicated by the large memory requirement of the KV cache, which scales linearly with the context length. This memory footprint is often the dominant resource bottleneck in real-world deployments, limiting throughput and increasing serving cost. One way to address this is by compressing the KV cache, which can be done either with knowledge of the question being asked (query-aware) or without knowledge of the query (query-agnostic). We present Compactor, a parameter-free, query-agnostic KV compression strategy that uses approximate leverage scores to determine token importance. We show that Compactor can achieve the same performance as competing methods while retaining half the tokens in both synthetic and real-world context tasks, with minimal computational overhead. We further introduce a procedure for context-calibrated compression, which allows one to infer the maximum compression ratio a given context can support. Using context-calibrated compression, we show that Compactor achieves full KV performance on Longbench while reducing the KV memory burden by 63%, on average. To demonstrate the efficacy and generalizability of our approach, we apply Compactor to 27 synthetic and real-world tasks from RULER and Longbench, with models from both the Qwen 2.5 and Llama 3.1 families.
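For intuition about leverage scores as a token-importance signal, the sketch below computes exact leverage scores for the rows of a small matrix (one row per cached token) and keeps the highest-scoring rows. It is not the Compactor algorithm: the paper uses approximate scores, and the use of Gram-Schmidt, the full-column-rank assumption, and the names `leverage_scores` and `keep_top_tokens` are choices made only for this illustration.

```python
import math


def leverage_scores(rows):
    """Exact leverage scores of the rows of a full-column-rank matrix.

    The score of row i is the squared norm of row i of Q, where the
    columns of Q are an orthonormal basis of the column space.
    """
    n, d = len(rows), len(rows[0])
    columns = [[rows[i][j] for i in range(n)] for j in range(d)]
    basis = []
    for col in columns:
        v = col[:]
        for q in basis:  # Gram-Schmidt: remove components along earlier q
            dot = sum(vi * qi for vi, qi in zip(v, q))
            v = [vi - dot * qi for vi, qi in zip(v, q)]
        norm = math.sqrt(sum(vi * vi for vi in v))
        basis.append([vi / norm for vi in v])  # assumes norm > 0 (full rank)
    return [sum(q[i] ** 2 for q in basis) for i in range(n)]


def keep_top_tokens(scores, k):
    """Indices of the k highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical key matrix: three tokens with two-dimensional keys.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
scores = leverage_scores(keys)
print(scores, keep_top_tokens(scores, k=1))
```

The two identical rows share their importance, while the third row is the only one covering its direction and therefore receives the highest score.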
nan
Article 556
Title@2025-07-10 (4): Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
Title: Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models | Audio Flamingo 3: Advancing Audio Intelligence mit vollständig offenen großen Audio-Sprachen-Modelle | 3:以完全开放的大型音频语言模式推进音频情报 2507.08128v1 |
Authors (11): Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, Bryan Catanzaro
We present Audio Flamingo 3 (AF3), a fully open state-of-the-art (SOTA) large audio-language model that advances reasoning and understanding across speech, sound, and music. AF3 introduces: (i) AF-Whisper, a unified audio encoder trained using a novel strategy for joint representation learning across all 3 modalities of speech, sound, and music; (ii) flexible, on-demand thinking, allowing the model to do chain-of-thought-type reasoning before answering; (iii) multi-turn, multi-audio chat; (iv) long audio understanding and reasoning (including speech) up to 10 minutes; and (v) voice-to-voice interaction. To enable these capabilities, we propose several large-scale training datasets curated using novel strategies, including AudioSkills-XL, LongAudio-XL, AF-Think, and AF-Chat, and train AF3 with a novel five-stage curriculum-based training strategy. Trained on only open-source audio data, AF3 achieves new SOTA results on over 20 (long) audio understanding and reasoning benchmarks, surpassing both open-weight and closed-source models trained on much larger datasets.
nan
Article 557
Title@2025-07-10 (4): Audit, Alignment, and Optimization of LM-Powered Subroutines with Application to Public Comment Processing
Title: Audit, Alignment, and Optimization of LM-Powered Subroutines with Application to Public Comment Processing | Audit, Alignment und Optimierung von LM-Powered Subroutinen mit Anwendung auf die öffentliche Kommentarverarbeitung | 对LM-Powerd Powerd S次程序适用公众意见处理的审计、统一和优化 2507.08109v1 |
Authors (7): Reilly Raab, Mike Parker, Dan Nally, Sadie Montgomery, Anastasia Bernat, Sai Munikoti, Sameera Horawalavithana
The advent of language models (LMs) has the potential to dramatically accelerate tasks that may be cast to text-processing; however, real-world adoption is hindered by concerns regarding safety, explainability, and bias. How can we responsibly leverage LMs in a transparent, auditable manner – minimizing risk and allowing human experts to focus on informed decision-making rather than data-processing or prompt engineering? In this work, we propose a framework for declaring statically typed, LM-powered subroutines (i.e., callable, function-like procedures) for use within conventional asynchronous code – such that sparse feedback from human experts is used to improve the performance of each subroutine online (i.e., during use). In our implementation, all LM-produced artifacts (i.e., prompts, inputs, outputs, and data-dependencies) are recorded and exposed to audit on demand. We package this framework as a library to support its adoption and continued development. While this framework may be applicable across several real-world decision workflows (e.g., in healthcare and legal fields), we evaluate it in the context of public comment processing as mandated by the 1969 National Environmental Policy Act (NEPA): Specifically, we use this framework to develop “CommentNEPA,” an application that compiles, organizes, and summarizes a corpus of public commentary submitted in response to a project requiring environmental review. We quantitatively evaluate the application by comparing its outputs (when operating without human feedback) to historical “ground-truth” data as labelled by human annotators during the preparation of official environmental impact statements.
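The notion of a typed, auditable, LM-powered subroutine inside asynchronous code can be sketched as follows. Nothing here reflects the authors' library: the class `AuditedSubroutine`, the stand-in model `fake_lm`, the parser, and the prompt are hypothetical, and the sketch omits the feedback-driven improvement step entirely. It only shows a callable with a declared return type whose prompt, input, and output are recorded for later audit.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Awaitable, Callable, Generic, TypeVar

T = TypeVar("T")


@dataclass
class AuditedSubroutine(Generic[T]):
    """A function-like wrapper around an LM call that returns a typed value."""
    name: str
    prompt: str
    parse: Callable[[str], T]            # converts raw LM text to type T
    lm: Callable[[str], Awaitable[str]]  # any async text-in, text-out model
    audit_log: list = field(default_factory=list)

    async def __call__(self, text: str) -> T:
        raw = await self.lm(f"{self.prompt}\n\n{text}")
        result = self.parse(raw)
        # Record every artifact so the call can be audited on demand.
        self.audit_log.append({"subroutine": self.name, "prompt": self.prompt,
                               "input": text, "raw_output": raw, "result": result})
        return result


async def fake_lm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call."""
    return "yes" if "river" in prompt.lower() else "no"


def parse_bool(raw: str) -> bool:
    value = raw.strip().lower()
    if value not in {"yes", "no"}:
        raise ValueError(f"unexpected LM output: {raw!r}")
    return value == "yes"


flags_water = AuditedSubroutine(
    name="flags_water",
    prompt="Does this public comment raise a water-quality concern? Answer yes or no.",
    parse=parse_bool,
    lm=fake_lm,
)


async def main(comments):
    return [await flags_water(c) for c in comments]


results = asyncio.run(main(["The river will be polluted.", "Traffic noise will increase."]))
print(results, len(flags_water.audit_log))
```

Because the parser rejects anything other than the expected answers, a malformed model output surfaces as an error at the call site instead of propagating silently.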
Article 558
Title@2025-07-10 (4): GRASP: Generic Reasoning And SPARQL Generation across Knowledge Graphs
Title: GRASP: Generic Reasoning And SPARQL Generation across Knowledge Graphs | GRASP: Generische Vernunft und SPARQL-Generierung über Wissensgraphen hinweg | GRASP: 通用理由和在知识图中生成SPARQL 2507.08107v1 |
Authors (2): Sebastian Walter, Hannah Bast
We propose a new approach for generating SPARQL queries on RDF knowledge graphs from natural language questions or keyword queries, using a large language model. Our approach does not require fine-tuning. Instead, it uses the language model to explore the knowledge graph by strategically executing SPARQL queries and searching for relevant IRIs and literals. We evaluate our approach on a variety of benchmarks (for knowledge graphs of different kinds and sizes) and language models (of different scales and types, commercial as well as open-source) and compare it with existing approaches. On Wikidata we reach state-of-the-art results on multiple benchmarks, despite the zero-shot setting. On Freebase we come close to the best few-shot methods. On other, less commonly evaluated knowledge graphs and benchmarks our approach also performs well overall. We conduct several additional studies, like comparing different ways of searching the graphs, incorporating a feedback mechanism, or making use of few-shot examples.
Article 559
Title@2025-07-10 (4): Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
Title: Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology | Rückverfolgbare Beweise Verbesserte visuelle Grundierung: Bewertung und Methodik | 增强视觉依据的理由:评价和方法 2507.07999v1 |
Authors (12): Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, Zhuochen Wang, Zhaoxiang Zhang
Models like OpenAI-o3 pioneer visual grounded reasoning by dynamically referencing visual regions, much like human “thinking with images”. However, no benchmark exists to evaluate these capabilities holistically. To bridge this gap, we propose TreeBench (Traceable Evidence Evaluation Benchmark), a diagnostic benchmark built on three principles: (1) focused visual perception of subtle targets in complex scenes, (2) traceable evidence via bounding box evaluation, and (3) second-order reasoning to test object interactions and spatial hierarchies beyond simple object localization. Prioritizing images with dense objects, we initially sample 1K high-quality images from SA-1B and enlist eight LMM experts to manually annotate questions, candidate options, and answers for each image. After three stages of quality control, TreeBench consists of 405 challenging visual question-answering pairs. Even the most advanced models struggle with this benchmark: none of them reaches 60% accuracy (e.g., OpenAI-o3 scores only 54.87). Furthermore, we introduce TreeVGR (Traceable Evidence Enhanced Visual Grounded Reasoning), a training paradigm to supervise localization and reasoning jointly with reinforcement learning, enabling accurate localizations and explainable reasoning pathways. Initialized from Qwen2.5-VL-7B, it improves V* Bench (+16.8), MME-RealWorld (+12.6), and TreeBench (+13.4), proving traceability is key to advancing vision-grounded reasoning. The code is available at https://github.com/Haochen-Wang409/TreeVGR.
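Bounding box evaluation is often scored with intersection-over-union (IoU). The sketch below shows that standard computation; whether TreeBench uses exactly this metric is an assumption, and the box coordinates are made up for illustration.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    # Degenerate boxes with zero total area score 0 rather than dividing by zero.
    return inter / union if union > 0 else 0.0


predicted = (10.0, 10.0, 50.0, 50.0)   # hypothetical model output
annotated = (30.0, 30.0, 70.0, 70.0)   # hypothetical human annotation
score = iou(predicted, annotated)      # overlap 400 / union 2800
```

A predicted region would typically count as correct evidence only if its IoU with the annotated box exceeds a chosen threshold.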
Article 560
Title@2025-07-10 (4): Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs)
Title: Operationalizing a Threat Model for Red-Teaming Large Language Models (LLMs) | Operationalisierung eines Bedrohungsmodells für das Red-Teaming großer Sprachmodelle (LLMs) | 实施 “ 红色组合大语言模型威胁模型 “ ; 2407.14937v2 |
Authors (10): Apurv Verma, Satyapriya Krishna, Sebastian Gehrmann, Madhavan Seshadri, Anu Pradhan, Tom Ault, Leslie Barrett, David Rabinowitz, John Doucette, NhatHai Phan
Creating secure and resilient applications with large language models (LLM) requires anticipating, adjusting to, and countering unforeseen threats. Red-teaming has emerged as a critical technique for identifying vulnerabilities in real-world LLM implementations. This paper presents a detailed threat model and provides a systematization of knowledge (SoK) of red-teaming attacks on LLMs. We develop a taxonomy of attacks based on the stages of the LLM development and deployment process and extract various insights from previous research. In addition, we compile methods for defense and practical red-teaming strategies for practitioners. By delineating prominent attack motifs and shedding light on various entry points, this paper provides a framework for improving the security and robustness of LLM-based systems.
Article 561
Title@2025-07-10 (4): Automating Expert-Level Medical Reasoning Evaluation of Large Language Models
Title: Automating Expert-Level Medical Reasoning Evaluation of Large Language Models | Automatisieren von Experten-Level Medical Reasoning Bewertung von großen Sprachmodellen | 对大语言模式进行自动化专家级医疗理由评估 2507.07988v1 |
Authors (19): Shuang Zhou, Wenya Xie, Jiaxi Li, Zaifu Zhan, Meijia Song, Han Yang, Cheyenna Espinoza, Lindsay Welton, Xinnie Mai, Yanwei Jin, Zidu Xu, Yuen-Hei Chung, Yiyun Xing, Meng-Han Tsai, Emma Schaffer, Yucheng Shi, Ninghao Liu, Zirui Liu, Rui Zhang
As large language models (LLMs) become increasingly integrated into clinical decision-making, ensuring transparent and trustworthy reasoning is essential. However, existing evaluation strategies of LLMs’ medical reasoning capability either suffer from unsatisfactory assessment or poor scalability, and a rigorous benchmark remains lacking. To address this, we introduce MedThink-Bench, a benchmark designed for rigorous, explainable, and scalable assessment of LLMs’ medical reasoning. MedThink-Bench comprises 500 challenging questions across ten medical domains, each annotated with expert-crafted step-by-step rationales. Building on this, we propose LLM-w-Ref, a novel evaluation framework that leverages fine-grained rationales and LLM-as-a-Judge mechanisms to assess intermediate reasoning with expert-level fidelity while maintaining scalability. Experiments show that LLM-w-Ref exhibits a strong positive correlation with expert judgments. Benchmarking twelve state-of-the-art LLMs, we find that smaller models (e.g., MedGemma-27B) can surpass larger proprietary counterparts (e.g., OpenAI-o3). Overall, MedThink-Bench offers a foundational tool for evaluating LLMs’ medical reasoning, advancing their safe and responsible deployment in clinical practice.
Article 562
Title@2025-07-10 (4): Performance and Practical Considerations of Large and Small Language Models in Clinical Decision Support in Rheumatology
Title: Performance and Practical Considerations of Large and Small Language Models in Clinical Decision Support in Rheumatology | Leistung und praktische Betrachtung von großen und kleinen Sprachmodellen in der klinischen Entscheidungsunterstützung in der Rheumatologie | 风湿学临床决策支助中大型和小型语言模型的实用性及实用性考虑 2507.07983v1 |
Authors (7): Sabine Felde, Rüdiger Buchkremer, Gamal Chehab, Christian Thielscher, Jörg HW Distler, Matthias Schneider, Jutta G. Richter
Large language models (LLMs) show promise for supporting clinical decision-making in complex fields such as rheumatology. Our evaluation shows that smaller language models (SLMs), combined with retrieval-augmented generation (RAG), achieve higher diagnostic and therapeutic performance than larger models, while requiring substantially less energy and enabling cost-efficient, local deployment. These features are attractive for resource-limited healthcare. However, expert oversight remains essential, as no model consistently reached specialist-level accuracy in rheumatology.
Article 563
Title@2025-07-10 (4): Why is Your Language Model a Poor Implicit Reward Model?
Title: Why is Your Language Model a Poor Implicit Reward Model? | Warum ist Ihr Sprachmodell ein schlechtes Implizit-Reward-Modell? | 为什么您的语言模式 是一个贫穷的隐含奖赏模式? 2507.07981v1 |
Authors (4): Noam Razin, Yong Lin, Jiarui Yao, Sanjeev Arora
Reward models are key to language model post-training and inference pipelines. Conveniently, recent work showed that every language model defines an implicit reward model (IM-RM), without requiring any architectural changes. However, such IM-RMs tend to generalize worse, especially out-of-distribution, compared to explicit reward models (EX-RMs) that apply a dedicated linear head over the hidden representations of a language model. The existence of a generalization gap is puzzling, as EX-RMs and IM-RMs are nearly identical. They can be trained using the same data, loss function, and language model, and differ only in how the reward is computed. Towards a fundamental understanding of the implicit biases underlying different reward model types, we investigate the root cause of this gap. Our main finding, backed by theory and experiments, is that IM-RMs rely more heavily on superficial token-level cues. Consequently, they often generalize worse than EX-RMs under token-level distribution shifts, as well as in-distribution. Furthermore, we provide evidence against alternative hypotheses for the generalization gap. Most notably, we challenge the intuitive claim that IM-RMs struggle in tasks where generation is harder than verification because they can operate both as a verifier and a generator. Taken together, our results highlight that seemingly minor design choices can substantially impact the generalization behavior of reward models.
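The statement that the two reward model types differ only in how the reward is computed can be written out as follows. The notation is assumed here rather than taken from the abstract: $h_\theta(x, y)$ is the hidden representation of prompt $x$ and response $y$, $u$ is the linear head, $\pi_\theta$ is the language model, $\pi_{\mathrm{ref}}$ is a reference model, and $\beta > 0$ is a scaling coefficient, following the usual DPO-style definition of an implicit reward.

```latex
r_{\mathrm{EX}}(x, y) = \langle u, h_\theta(x, y) \rangle,
\qquad
r_{\mathrm{IM}}(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
```

Under this reading, the implicit reward is a sum of per-token log-probability ratios, which fits the finding that IM-RMs lean on token-level cues.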
Article 564
Title@2025-07-10 (4): Long-Form Speech Generation with Spoken Language Models
Title: Long-Form Speech Generation with Spoken Language Models | Langformige Sprachgenerierung mit gesprochenen Sprachmodellen | 具有口言语言模式的长形式语音一代 2412.18603v2 |
Authors (6): Se Jin Park, Julian Salazar, Aren Jansen, Keisuke Kinoshita, Yong Man Ro, RJ Skerry-Ryan
We consider the generative modeling of speech over multiple minutes, a requirement for long-form multimedia generation and audio-native voice assistants. However, textless spoken language models struggle to generate plausible speech past tens of seconds, due to high temporal resolution of speech tokens causing loss of coherence, architectural issues with long-sequence training or extrapolation, and memory costs at inference time. From these considerations we derive SpeechSSM, the first speech language model family to learn from and sample long-form spoken audio (e.g., 16 minutes of read or extemporaneous speech) in a single decoding session without text intermediates. SpeechSSMs leverage recent advances in linear-time sequence modeling to greatly surpass current Transformer spoken LMs in coherence and efficiency on multi-minute generations while still matching them at the utterance level. As we found current spoken language evaluations uninformative, especially in this new long-form setting, we also introduce: LibriSpeech-Long, a benchmark for long-form speech evaluation; new embedding-based and LLM-judged metrics; and quality measurements over length and time. Speech samples, the LibriSpeech-Long dataset, and any future code or model releases can be found at https://google.github.io/tacotron/publications/speechssm/.
Article 565
Title@2025-07-10 (4): Scaling RL to Long Videos
Title: Scaling RL to Long Videos | Skalierung von RL zu langen Videos | 缩放 RL 到长视频 2507.07966v1 |
Authors (14): Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, Sifei Liu, Hongxu Yin, Yao Lu, Song Han
We introduce a full-stack framework that scales up reasoning in vision-language models (VLMs) to long videos, leveraging reinforcement learning. We address the unique challenges of long video reasoning by integrating three critical components: (1) a large-scale dataset, LongVideo-Reason, comprising 52K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling. In experiments, LongVILA-R1-7B achieves strong performance on long video QA benchmarks such as VideoMME. It also outperforms Video-R1-7B and even matches Gemini-1.5-Pro across temporal reasoning, goal and purpose reasoning, spatial reasoning, and plot reasoning on our LongVideo-Reason-eval benchmark. Notably, our MR-SP system achieves up to 2.1x speedup on long video RL training. LongVILA-R1 demonstrates consistent performance gains as the number of input video frames scales. LongVILA-R1 marks a firm step towards long video reasoning in VLMs. In addition, we release our training system for public availability that supports RL training on various modalities (video, text, and audio), various models (VILA and Qwen series), and even image and video generation models. On a single A100 node (8 GPUs), it supports RL training on hour-long videos (e.g., 3,600 frames / around 256k tokens).
Article 566
Title@2025-07-10 (4): MIRIX: Multi-Agent Memory System for LLM-Based Agents
Title: MIRIX: Multi-Agent Memory System for LLM-Based Agents | MIRIX: Multi-Agent-Speichersystem für LLM-basierte Agenten | MIRIX:LLM药剂多机构内存系统 2507.07957v1 |
Authors (2): Yu Wang, Xi Chen
Although memory capabilities of AI agents are gaining increasing attention, existing solutions remain fundamentally limited. Most rely on flat, narrowly scoped memory components, constraining their ability to personalize, abstract, and reliably recall user-specific information over time. To this end, we introduce MIRIX, a modular, multi-agent memory system that redefines the future of AI memory by solving the field’s most critical challenge: enabling language models to truly remember. Unlike prior approaches, MIRIX transcends text to embrace rich visual and multimodal experiences, making memory genuinely useful in real-world scenarios. MIRIX consists of six distinct, carefully structured memory types: Core, Episodic, Semantic, Procedural, Resource Memory, and Knowledge Vault, coupled with a multi-agent framework that dynamically controls and coordinates updates and retrieval. This design enables agents to persist, reason over, and accurately retrieve diverse, long-term user data at scale. We validate MIRIX in two demanding settings. First, on ScreenshotVQA, a challenging multimodal benchmark comprising nearly 20,000 high-resolution computer screenshots per sequence, requiring deep contextual understanding and where no existing memory systems can be applied, MIRIX achieves 35% higher accuracy than the RAG baseline while reducing storage requirements by 99.9%. Second, on LOCOMO, a long-form conversation benchmark with single-modal textual input, MIRIX attains state-of-the-art performance of 85.4%, far surpassing existing baselines. These results show that MIRIX sets a new performance standard for memory-augmented LLM agents. To allow users to experience our memory system, we provide a packaged application powered by MIRIX. It monitors the screen in real time, builds a personalized memory base, and offers intuitive visualization and secure local storage to ensure privacy.
Article 567
Title@2025-07-10 (4): SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment
Title: SAGE: A Visual Language Model for Anomaly Detection via Fact Enhancement and Entropy-aware Alignment | SAGE: Ein visuelles Sprachmodell zur Erkennung von Anomalien durch Fact Enhancement und Entropy-aware Alignment | SAGE:通过事实增强和对子对子体认知校正进行反常检测的视觉语言模型 2507.07939v1 |
Authors (7): Guoxin Zang, Xue Li, Donglin Di, Lanshun Nie, Dechen Zhan, Yang Song, Lei Fan
While Vision-Language Models (VLMs) have shown promising progress in general multimodal tasks, they often struggle in industrial anomaly detection and reasoning, particularly in delivering interpretable explanations and generalizing to unseen categories. This limitation stems from the inherently domain-specific nature of anomaly detection, which hinders the applicability of existing VLMs in industrial scenarios that require precise, structured, and context-aware analysis. To address these challenges, we propose SAGE, a VLM-based framework that enhances anomaly reasoning through Self-Guided Fact Enhancement (SFE) and Entropy-aware Direct Preference Optimization (E-DPO). SFE integrates domain-specific knowledge into visual reasoning via fact extraction and fusion, while E-DPO aligns model outputs with expert preferences using entropy-aware optimization. Additionally, we introduce AD-PL, a preference-optimized dataset tailored for industrial anomaly reasoning, consisting of 28,415 question-answering instances with expert-ranked responses. To evaluate anomaly reasoning models, we develop Multiscale Logical Evaluation (MLE), a quantitative framework analyzing model logic and consistency. SAGE demonstrates superior performance on industrial anomaly datasets under zero-shot and one-shot settings. The code, model and dataset are available at https://github.com/amoreZgx1n/SAGE.
Article 568
Title@2025-07-10 (4): Long Context Scaling: Divide and Conquer via Multi-Agent Question-driven Collaboration
Title: Long Context Scaling: Divide and Conquer via Multi-Agent Question-driven Collaboration | Long Context Scaling: Teilen und Erobern durch multi-agent question-driven Collaboration | 长期范围:通过多代理问题驱动的协作实现分化和征服 2505.20625v2 |
Authors (5): Sibo Xiao, Zixin Lin, Wenyang Gao, Hui Chen, Yue Zhang
Processing long contexts has become a critical capability for modern large language models (LLMs). Existing works leverage agent-based divide-and-conquer methods for processing long contexts. However, these methods face crucial limitations, including prohibitive accumulated latency and amplified information loss from excessive agent invocations, and the disruption of inherent textual dependencies by immoderate partitioning. In this paper, we propose a novel multi-agent framework, XpandA (Expand-Agent), coupled with a question-driven workflow and dynamic partitioning for robust long-context processing. XpandA overcomes these limitations through: 1) dynamic partitioning of long texts, which adaptively modulates the filling rate of context windows for input sequences of vastly varying lengths; 2) a question-guided protocol to update flat information ensembles within centralized shared memory, constructing consistent inter-agent knowledge across partitions; and 3) selective replay of specific partitions based on the state-tracking of question-information couples to promote the resolution of inverted-order structures across partitions (e.g., flashbacks). We perform a comprehensive evaluation of XpandA on multiple long-context benchmarks with length varying from 1k to 1M, demonstrating XpandA’s feasibility for processing ultra-long sequences and its significant effectiveness in enhancing the long-context capabilities of various LLMs by achieving 20% improvements and 1.5x inference speedup over baselines of full-context, RAG and previous agent-based methods.
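One simple way to picture partitioning driven by a context-window fill rate is sketched below. The function `partition_lengths` and its parameters are hypothetical and are not XpandA's actual algorithm; the sketch only shows how a fill-rate budget determines the number and size of partitions for inputs of different lengths.

```python
import math
from typing import List


def partition_lengths(n_tokens: int, window: int, fill_rate: float) -> List[int]:
    """Split n_tokens into near-equal chunks, each at most fill_rate * window tokens."""
    if n_tokens <= 0 or window <= 0 or not 0.0 < fill_rate <= 1.0:
        raise ValueError("invalid arguments")
    budget = max(1, int(window * fill_rate))   # usable tokens per chunk
    n_chunks = math.ceil(n_tokens / budget)
    base, extra = divmod(n_tokens, n_chunks)
    # The first `extra` chunks take one additional token.
    return [base + 1 if i < extra else base for i in range(n_chunks)]


short_plan = partition_lengths(10, 4, 1.0)          # [4, 3, 3]
long_plan = partition_lengths(10_000, 4_096, 0.5)   # five chunks of 2000 tokens
```

Balancing chunk sizes avoids a short trailing partition, and lowering `fill_rate` leaves room in each window for the question and shared memory.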
Article 569
Title@2025-07-10 (4): Investigating Context-Faithfulness in Large Language Models: The Roles of Memory Strength and Evidence Style
Title: Investigating Context-Faithfulness in Large Language Models: The Roles of Memory Strength and Evidence Style | Kontexttreue in großen Sprachmodellen untersuchen: Die Rollen der Gedächtnisstärke und des Evidenzstils | 调查大语言模型的内情:记忆力和证据风格的作用 2409.10955v2 |
Authors (6): Yuepei Li, Kang Zhou, Qiao Qiao, Bach Nguyen, Qing Wang, Qi Li
Retrieval-augmented generation (RAG) improves Large Language Models (LLMs) by incorporating external information into the response generation process. However, how context-faithful LLMs are and what factors influence LLMs’ context faithfulness remain largely unexplored. In this study, we investigate the impact of memory strength and evidence presentation on LLMs’ receptiveness to external evidence. We quantify the memory strength of LLMs by measuring the divergence in LLMs’ responses to different paraphrases of the same question, which is not considered by previous works. We also generate evidence in various styles to examine LLMs’ behavior. Our results show that for questions with high memory strength, LLMs are more likely to rely on internal memory. Furthermore, presenting paraphrased evidence significantly increases LLMs’ receptiveness compared to simple repetition or adding details. These findings provide key insights for improving retrieval-augmented generation and context-aware LLMs. Our code is available at https://github.com/liyp0095/ContextFaithful.
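The idea of measuring memory strength through answers to paraphrased questions can be illustrated with a simple agreement score. This is a hypothetical proxy, not the divergence measure defined in the paper, and the example answers are invented.

```python
from collections import Counter
from typing import Sequence


def answer_consistency(answers: Sequence[str]) -> float:
    """Share of paraphrase answers that agree with the most common answer."""
    if not answers:
        raise ValueError("need at least one answer")
    normalized = [a.strip().lower() for a in answers]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)


# Answers a model might give to four paraphrases of the same question.
stable = ["Paris", "paris", "Paris ", "PARIS"]
unstable = ["Paris", "Lyon", "Paris", "Marseille"]

strong_memory = answer_consistency(stable)     # all four agree
weak_memory = answer_consistency(unstable)     # only two of four agree
```

A score near 1 suggests the model holds the answer firmly regardless of wording, which is the regime where the abstract reports more reliance on internal memory.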
Article 570
Title@2025-07-10 (4): A Survey on Latent Reasoning
Title: A Survey on Latent Reasoning | Eine Umfrage über latente Vernunft | A. 关于长期原因的调查 2507.06203v2 |
Authors (33): Rui-Jie Zhu, Tianhao Peng, Tianhao Cheng, Xingwei Qu, Jinfa Huang, Dawei Zhu, Hao Wang, Kaiwen Xue, Xuanliang Zhang, Yong Shan, Tianle Cai, Taylor Kergan, Assel Kembay, Andrew Smith, Chenghua Lin, Binh Nguyen, Yuqi Pan, Yuhong Chou, Zefan Cai, Zhenhe Wu, Yongchi Zhao, Tianyu Liu, Jian Yang, Wangchunshu Zhou, Chujie Zheng, Chongxuan Li, Yuyin Zhou, Zhoujun Li, Zhaoxiang Zhang, Jiaheng Liu, Ge Zhang, Wenhao Huang, Jason Eshraghian
Large Language Models (LLMs) have demonstrated impressive reasoning capabilities, especially when guided by explicit chain-of-thought (CoT) reasoning that verbalizes intermediate steps. While CoT improves both interpretability and accuracy, its dependence on natural language reasoning limits the model’s expressive bandwidth. Latent reasoning tackles this bottleneck by performing multi-step inference entirely in the model’s continuous hidden state, eliminating token-level supervision. To advance latent reasoning research, this survey provides a comprehensive overview of the emerging field of latent reasoning. We begin by examining the foundational role of neural network layers as the computational substrate for reasoning, highlighting how hierarchical representations support complex transformations. Next, we explore diverse latent reasoning methodologies, including activation-based recurrence, hidden state propagation, and fine-tuning strategies that compress or internalize explicit reasoning traces. Finally, we discuss advanced paradigms such as infinite-depth latent reasoning via masked diffusion models, which enable globally consistent and reversible reasoning processes. By unifying these perspectives, we aim to clarify the conceptual landscape of latent reasoning and chart future directions for research at the frontier of LLM cognition. An associated GitHub repository collecting the latest papers and repos is available at: https://github.com/multimodal-art-projection/LatentCoT-Horizon/.
Article 571
Title@2025-07-10 (4): Automating MD simulations for Proteins using Large language Models: NAMD-Agent
Title: Automating MD simulations for Proteins using Large language Models: NAMD-Agent | Automatisierung von MD-Simulationen für Proteine mit großen Sprachmodellen: NAMD-Agent | 使用大语言模型自动模拟 Proteins 的 MD 模拟: NAMD-Agent 2507.07887v1 |
Authors (2): Achuth Chandrasekhar, Amir Barati Farimani
Molecular dynamics simulations are an essential tool in understanding protein structure, dynamics, and function at the atomic level. However, preparing high-quality input files for MD simulations can be a time-consuming and error-prone process. In this work, we introduce an automated pipeline that leverages Large Language Models (LLMs), specifically Gemini 2.0 Flash, in conjunction with Python scripting and Selenium-based web automation to streamline the generation of MD input files. The pipeline exploits CHARMM GUI’s comprehensive web-based interface for preparing simulation-ready inputs for NAMD. By integrating Gemini’s code generation and iterative refinement capabilities, simulation scripts are automatically written, executed, and revised to navigate CHARMM GUI, extract appropriate parameters, and produce the required NAMD input files. Post-processing is performed using additional software to further refine the simulation outputs, thereby enabling a complete and largely hands-free workflow. Our results demonstrate that this approach reduces setup time, minimizes manual errors, and offers a scalable solution for handling multiple protein systems in parallel. This automated framework paves the way for broader application of LLMs in computational structural biology, offering a robust and adaptable platform for future developments in simulation automation.
Article 572
Title@2025-07-10 (4): When Dialects Collide: How Socioeconomic Mixing Affects Language Use
Title: When Dialects Collide: How Socioeconomic Mixing Affects Language Use | Wenn Dialekte zusammenstoßen: Wie sich die sozioökonomische Mischung auf den Sprachgebrauch auswirkt | 当对调时:社会经济混合如何影响语言使用 2307.10016v2 |
Authors (4): Thomas Louf, José J. Ramasco, David Sánchez, Márton Karsai
The socioeconomic background of people and how they use standard forms of language are not independent, as demonstrated in various sociolinguistic studies. However, the extent to which these correlations may be influenced by the mixing of people from different socioeconomic classes remains relatively unexplored from a quantitative perspective. In this work we leverage geotagged tweets and transferable computational methods to map deviations from standard English on a large scale, in seven thousand administrative areas of England and Wales. We combine these data with high-resolution income maps to assign a proxy socioeconomic indicator to home-located users. Strikingly, across eight metropolitan areas we find a consistent pattern suggesting that the more different socioeconomic classes mix, the less interdependent the frequency of their departures from standard grammar and their income become. Further, we propose an agent-based model of linguistic variety adoption that sheds light on the mechanisms that produce the observations seen in the data.
Article 573
Title@2025-07-10 (4): Evaluating Robustness of Large Audio Language Models to Audio Injection: An Empirical Study
Title: Evaluating Robustness of Large Audio Language Models to Audio Injection: An Empirical Study | Bewertung der Robustheit von großen Audio-Sprachmodellen zur Audio-Einspritzung: Eine empirische Studie | 评估大音频语言模型对音频注射的威力:经验研究 2505.19598v2 |
Authors (7): Guanyu Hou, Jiaming He, Yinhang Zhou, Ji Guo, Yitong Qiao, Rui Zhang, Wenbo Jiang
Large Audio-Language Models (LALMs) are increasingly deployed in real-world applications, yet their robustness against malicious audio injection attacks remains underexplored. This study systematically evaluates five leading LALMs across four attack scenarios: Audio Interference Attack, Instruction Following Attack, Context Injection Attack, and Judgment Hijacking Attack. Using metrics like Defense Success Rate, Context Robustness Score, and Judgment Robustness Index, we quantitatively assess their vulnerabilities and resilience. Experimental results reveal significant performance disparities among models; no single model consistently outperforms others across all attack types. The position of malicious content critically influences attack effectiveness, particularly when placed at the beginning of sequences. A negative correlation between instruction-following capability and robustness suggests models adhering strictly to instructions may be more susceptible, contrasting with greater resistance by safety-aligned models. Additionally, system prompts show mixed effectiveness, indicating the need for tailored strategies. This work introduces a benchmark framework and highlights the importance of integrating robustness into training pipelines. Findings emphasize developing multi-modal defenses and architectural designs that decouple capability from susceptibility for secure LALM deployment.
Article 574
Title@2025-07-10 (4): DocCHA: Towards LLM-Augmented Interactive Online diagnosis System
Title: DocCHA: Towards LLM-Augmented Interactive Online diagnosis System | DocCHA: Auf dem Weg zum LLM-Augmented Interactive Online-Diagnosesystem | DocCHA:争取建立LLM-增强的互动式在线诊断系统 2507.07870v1 |
Authors (5): Xinyi Liu, Dachun Sun, Yi R. Fung, Dilek Hakkani-Tür, Tarek Abdelzaher
Despite the impressive capabilities of Large Language Models (LLMs), existing Conversational Health Agents (CHAs) remain static and brittle, incapable of adaptive multi-turn reasoning, symptom clarification, or transparent decision-making. This hinders their real-world applicability in clinical diagnosis, where iterative and structured dialogue is essential. We propose DocCHA, a confidence-aware, modular framework that emulates clinical reasoning by decomposing the diagnostic process into three stages: (1) symptom elicitation, (2) history acquisition, and (3) causal graph construction. Each module uses interpretable confidence scores to guide adaptive questioning, prioritize informative clarifications, and refine weak reasoning links. Evaluated on two real-world Chinese consultation datasets (IMCS21, DX), DocCHA consistently outperforms strong prompting-based LLM baselines (GPT-3.5, GPT-4o, LLaMA-3), achieving up to 5.18 percent higher diagnostic accuracy and over 30 percent improvement in symptom recall, with only a modest increase in dialogue turns. These results demonstrate the effectiveness of DocCHA in enabling structured, transparent, and efficient diagnostic conversations – paving the way for trustworthy LLM-powered clinical assistants in multilingual and resource-constrained settings.
Article 575
Title@2025-07-10 (4): Alpay Algebra V: Multi-Layered Semantic Games and Transfinite Fixed-Point Simulation
Title: Alpay Algebra V: Multi-Layered Semantic Games and Transfinite Fixed-Point Simulation | Alpay Algebra V: Multi-Layered Semantic Games und Transfinite Fixed-Point Simulation | Alpay Algebra V:多语言语义运动会和跨线固定点模拟 2507.07868v1 |
Authors (2): Bugra Kilictas, Faruk Alpay
This paper extends the self-referential framework of Alpay Algebra into a multi-layered semantic game architecture where transfinite fixed-point convergence encompasses hierarchical sub-games at each iteration level. Building upon Alpay Algebra IV’s empathetic embedding concept, we introduce a nested game-theoretic structure where the alignment process between AI systems and documents becomes a meta-game containing embedded decision problems. We formalize this through a composite operator $\phi(\cdot, \gamma(\cdot))$ where $\phi$ drives the main semantic convergence while $\gamma$ resolves local sub-games. The resulting framework demonstrates that game-theoretic reasoning emerges naturally from fixed-point iteration rather than being imposed externally. We prove a Game Theorem establishing existence and uniqueness of semantic equilibria under realistic cognitive simulation assumptions. Our verification suite includes adaptations of Banach’s fixed-point theorem to transfinite contexts, a novel $\phi$-topology based on the Kozlov-Maz’ya-Rossmann formula for handling semantic singularities, and categorical consistency tests via the Yoneda lemma. The paper itself functions as a semantic artifact designed to propagate its fixed-point patterns in AI embedding spaces – a deliberate instantiation of the “semantic virus” concept it theorizes. All results are grounded in category theory, information theory, and realistic AI cognition models, ensuring practical applicability beyond pure mathematical abstraction.
Article 576
Title@2025-07-10 (4): Skywork-R1V3 Technical Report
Title: Skywork-R1V3 Technical Report | Technischer Bericht Skywork-R1V3 | Skywork-R1V3 技术报告 2507.06167v3 |
Authors (11): Wei Shen, Jiangbo Pei, Yi Peng, Xuchen Song, Yang Liu, Jian Peng, Haofeng Sun, Yunzhuo Hao, Peiyu Wang, Jianhao Zhang, Yahui Zhou
We introduce Skywork-R1V3, an advanced, open-source vision-language model (VLM) that pioneers a new approach to visual reasoning. Its key innovation lies in effectively transferring reasoning skills from text-only Large Language Models (LLMs) to visual tasks. The strong performance of Skywork-R1V3 primarily stems from our elaborate post-training RL framework, which effectively activates and enhances the model’s reasoning ability, without the need for additional continued pre-training. Through this framework, we further uncover the fundamental role of the connector module in achieving robust cross-modal alignment for multimodal reasoning models. In addition, we introduce a unique indicator of reasoning capability, the entropy of critical reasoning tokens, which has proven highly effective for checkpoint selection during RL training. Skywork-R1V3 achieves state-of-the-art results on MMMU, significantly improving from 64.3% to 76.0%. This performance matches entry-level human capabilities. Remarkably, our RL-powered post-training approach enables even the 38B parameter model to rival top closed-source VLMs. The implementation successfully transfers mathematical reasoning to other subject-related reasoning tasks. We also include an analysis of curriculum learning and reinforcement finetuning strategies, along with a broader discussion on multimodal reasoning. Skywork-R1V3 represents a significant leap in multimodal reasoning, showcasing RL as a powerful engine for advancing open-source VLM capabilities.
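The checkpoint-selection indicator mentioned above, the entropy of critical reasoning tokens, can be illustrated with a minimal sketch. The abstract does not say how critical tokens are identified or how the entropy is aggregated, so the `critical_positions` argument and the simple averaging below are assumptions made only for illustration, and both function names are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_critical_entropy(step_probs, critical_positions):
    """Average entropy over the positions flagged as critical.

    step_probs: list of probability distributions, one per generated token.
    critical_positions: indices into step_probs (assumed to be given; how
    they are chosen is not described in the abstract).
    """
    values = [token_entropy(step_probs[i]) for i in critical_positions]
    return sum(values) / len(values)
```

Under this reading, one such value would be computed per checkpoint and compared across checkpoints; which direction counts as better is not stated in the abstract.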
Article 577
Title@2025-07-10 (4): Principled Foundations for Preference Optimization
Title: Principled Foundations for Preference Optimization | Prinzipierte Grundlagen für die Preference-Optimierung | 最优化原则基金会 2507.07855v1 |
Authors (7): Wenxuan Zhou, Shujian Zhang, Brice Magdalou, John Lambert, Ehsan Amid, Richard Nock, Andrew Hard
In this paper, we show that direct preference optimization (DPO) is a very specific form of a connection between two major theories in the ML context of learning from preferences: loss functions (Savage) and stochastic choice (Doignon-Falmagne and Machina). The connection is established for all of Savage’s losses and, at this level of generality, (i) it includes support for abstention on the choice theory side, (ii) it includes support for non-convex objectives on the ML side, and (iii) it allows some notable extensions of the DPO setting, including margins and corrections for length, to be framed at no extra cost. Understanding how DPO operates from a general principled perspective is crucial because of the huge and diverse application landscape of models, because of the current momentum around DPO, but also – and importantly – because many state-of-the-art variations on DPO definitely occupy a small region of the map that we cover. It also helps to understand the pitfalls of departing from this map and to figure out workarounds.
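For reference, the DPO objective as it is usually stated is reproduced below. The symbols are not taken from this abstract: $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ the frozen reference model, $\beta$ a temperature, $\sigma$ the logistic function, and $(x, y_w, y_l)$ a prompt with its preferred and dispreferred responses. The abstract's claim is that this objective is one specific instance within the more general family it studies.

$$\mathcal{L}_{\mathrm{DPO}}(\theta) = -\,\mathbb{E}_{(x, y_w, y_l)}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]$$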
Article 578
Title@2025-07-10 (4): From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems
Title: From Ambiguity to Accuracy: The Transformative Effect of Coreference Resolution on Retrieval-Augmented Generation systems | Von der Ambiguität zur Genauigkeit: Der transformative Effekt der Koreferenzlösung auf retrieval-augmentierte Erzeugungssysteme | 从模糊到准确性:关于回收-提款一代系统的共同决议的变革效应 2507.07847v1 |
Authors (6): Youngjoon Jang, Seongtae Hong, Junyoung Son, Sungjin Park, Chanjun Park, Heuiseok Lim
Retrieval-Augmented Generation (RAG) has emerged as a crucial framework in natural language processing (NLP), improving factual consistency and reducing hallucinations by integrating external document retrieval with large language models (LLMs). However, the effectiveness of RAG is often hindered by coreferential complexity in retrieved documents, introducing ambiguity that disrupts in-context learning. In this study, we systematically investigate how entity coreference affects both document retrieval and generative performance in RAG-based systems, focusing on retrieval relevance, contextual understanding, and overall response quality. We demonstrate that coreference resolution enhances retrieval effectiveness and improves question-answering (QA) performance. Through comparative analysis of different pooling strategies in retrieval tasks, we find that mean pooling demonstrates superior context capturing ability after applying coreference resolution. In QA tasks, we discover that smaller models benefit more from the disambiguation process, likely due to their limited inherent capacity for handling referential ambiguity. With these findings, this study aims to provide a deeper understanding of the challenges posed by coreferential complexity in RAG, providing guidance for improving retrieval and generation in knowledge-intensive AI applications.
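Mean pooling, the strategy the abstract reports as benefiting most from coreference resolution, can be sketched as follows. This is a generic illustration using plain Python lists rather than any particular embedding library, and the function name and argument layout are assumptions.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is truthy.

    token_vectors: list of equal-length lists of floats, one per token.
    attention_mask: list of 0/1 flags; 0 marks padding to be ignored.
    """
    dim = len(token_vectors[0])
    total = [0.0] * dim
    count = 0
    for vector, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for j, value in enumerate(vector):
                total[j] += value
    if count == 0:
        raise ValueError("attention_mask selects no tokens")
    return [component / count for component in total]
```

Because every unmasked token contributes equally, replacing a pronoun with its resolved entity changes the pooled vector directly, which is one plausible reading of why this strategy gains from disambiguation.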
Article 579
Title@2025-07-10 (4): None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks
Title: None of the Others: a General Technique to Distinguish Reasoning from Memorization in Multiple-Choice LLM Evaluation Benchmarks | Keiner der anderen: eine allgemeine Technik zur Unterscheidung von der Erinnerung an Multiple-Choice-LLM-Bewertungs-Benchmarks | 其他无其他:在多杯LLM评价基准中区分与记忆化区别理由的一般技术 2502.12896v5 |
Authors (3): Eva Sánchez Salido, Julio Gonzalo, Guillermo Marco
In LLM evaluations, reasoning is often distinguished from recall/memorization by applying numerical variations to math-oriented questions. Here we introduce a general variation method for multiple-choice questions that completely dissociates the correct answer from previously seen tokens or concepts, requiring LLMs to understand and reason (rather than memorize) in order to answer correctly. Using this method, we evaluate state-of-the-art proprietary and open-source LLMs on two datasets available in English and Spanish: the public MMLU benchmark and the private UNED-Access 2024 dataset. Results show that all models experience remarkable accuracy drops under our proposed variation, with an average loss of 57% on MMLU and 50% on UNED-Access 2024, ranging from 10% to 93% across models. Notably, the most accurate model in our experimentation (OpenAI-o3-mini) is not the most robust (DeepSeek-R1-70B), suggesting that the best models in standard evaluations may not be the ones with better reasoning capabilities. Also, we see larger accuracy drops in public (vs private) datasets and questions posed in their original language (vs a manual translation), which are signs of contamination and also point to a relevant role of recall/memorization in current LLMs’ answers.
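A minimal sketch of the variation, assuming (as the title suggests) that the text of the correct option is replaced by a "None of the others" option so that the right answer no longer matches anything seen before. The function name and the exact label string are hypothetical.

```python
def none_of_the_others(options, correct_index, label="None of the others"):
    """Return a varied copy of a multiple-choice item.

    The correct option's text is overwritten with `label`; the distractors
    are left untouched, so the correct choice keeps its position but can
    only be found by ruling out every other option.
    """
    varied = list(options)  # copy so the original item is not mutated
    varied[correct_index] = label
    return varied, correct_index
```

For example, an item with options `["Paris", "Rome", "Oslo", "Bern"]` and `correct_index=0` becomes `["None of the others", "Rome", "Oslo", "Bern"]`, still with answer index 0.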
Article 580
Title@2025-07-10 (4): Constrain Alignment with Sparse Autoencoders
Title: Constrain Alignment with Sparse Autoencoders | Beschränkung der Ausrichtung mit Sparse Autoencodern | 与 Sparse 自动对齐 2411.07618v4 |
Authors (10): Qingyu Yin, Chak Tou Leong, Minjun Zhu, Hanqi Yan, Qiang Zhang, Yulan He, Wenjie Li, Jun Wang, Yue Zhang, Linyi Yang
The alignment of large language models (LLMs) with human preferences remains a key challenge. While post-training techniques like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) have achieved notable success, they often introduce computational inefficiencies and training instability. In this paper, we propose Feature-level constrained Preference Optimization (FPO), a novel method designed to simplify the alignment process while ensuring stability. FPO leverages pre-trained Sparse Autoencoders (SAEs) and introduces feature-level constraints, allowing for efficient, sparsity-enforced alignment. Our approach enjoys efficiency by using sparse features activated in a well-trained sparse autoencoder and the quality of sequential KL divergence by using the feature-level offline reference. Experimental results on benchmark datasets demonstrate that FPO achieves a 5.08% absolute improvement in win rate with much lower computational cost compared to state-of-the-art baselines, making it a promising solution for efficient and controllable LLM alignments.
Article 581
Title@2025-07-10 (4): Unsupervised Morphological Tree Tokenizer
Title: Unsupervised Morphological Tree Tokenizer | Unüberwachter morphologischer Baum Tokenizer | 不受监督的病理树化器 2406.15245v2 |
Authors (5): Qingyang Zhu, Xiang Hu, Pengyu Ji, Wei Wu, Kewei Tu
As a cornerstone in language modeling, tokenization involves segmenting text inputs into pre-defined atomic units. Conventional statistical tokenizers often disrupt constituent boundaries within words, thereby corrupting semantic information. To address this drawback, we introduce morphological structure guidance to tokenization and propose a deep model to induce character-level structures of words. Specifically, the deep model jointly encodes internal structures and representations of words with a mechanism named $\textit{MorphOverriding}$ to ensure the indecomposability of morphemes. By training the model with self-supervised objectives, our method is capable of inducing character-level structures that align with morphological rules without annotated training data. Based on the induced structures, our algorithm tokenizes words through vocabulary matching in a top-down manner. Empirical results indicate that the proposed method effectively retains complete morphemes and outperforms widely adopted methods such as BPE and WordPiece on both morphological segmentation tasks and language modeling tasks. Code is available at https://github.com/martianmartina/TreeTokenizer.
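The top-down vocabulary matching step can be sketched as a recursion over an induced tree. The tree representation (nested tuples with single characters as leaves) and the function names are assumptions for illustration; the released code at the URL above may organize this differently.

```python
def span_text(node):
    """Concatenate the characters covered by a tree node."""
    if isinstance(node, str):
        return node
    return "".join(span_text(child) for child in node)


def tokenize_top_down(node, vocab):
    """Emit the largest spans, from the root downward, found in `vocab`.

    A node whose full span is in the vocabulary becomes one token;
    otherwise its children are tokenized in turn. Single characters are
    always emitted as a fallback.
    """
    text = span_text(node)
    if text in vocab or isinstance(node, str):
        return [text]
    tokens = []
    for child in node:
        tokens.extend(tokenize_top_down(child, vocab))
    return tokens
```

With the hypothetical tree `(("u", "n"), ("h", ("a", ("p", ("p", "y")))))` and vocabulary `{"un", "happy"}`, the recursion stops at the two children of the root and yields `["un", "happy"]`.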
Article 582
Title@2025-07-10 (4): MAEBE: Multi-Agent Emergent Behavior Framework
Title: MAEBE: Multi-Agent Emergent Behavior Framework | MAEBE: Multi-Agent Emergent Behavior Framework | 多边代理新兴行为框架 2506.03053v2 |
Authors (4): Sinem Erisken, Timothy Gothard, Martin Leitgab, Ram Potham
Traditional AI safety evaluations on isolated LLMs are insufficient as multi-agent AI ensembles become prevalent, introducing novel emergent risks. This paper introduces the Multi-Agent Emergent Behavior Evaluation (MAEBE) framework to systematically assess such risks. Using MAEBE with the Greatest Good Benchmark (and a novel double-inversion question technique), we demonstrate that: (1) LLM moral preferences, particularly for Instrumental Harm, are surprisingly brittle and shift significantly with question framing, both in single agents and ensembles. (2) The moral reasoning of LLM ensembles is not directly predictable from isolated agent behavior due to emergent group dynamics. (3) Specifically, ensembles exhibit phenomena like peer pressure influencing convergence, even when guided by a supervisor, highlighting distinct safety and alignment challenges. Our findings underscore the necessity of evaluating AI systems in their interactive, multi-agent contexts.
Article 583
Title@2025-07-10 (4): The Thin Line Between Comprehension and Persuasion in LLMs
Title: The Thin Line Between Comprehension and Persuasion in LLMs | Die dünne Linie zwischen Verständnis und Überzeugung in LLMs | LLMM 理解与劝导之间的细细线 2507.01936v2 |
Authors (2): Adrian de Wynter, Tangming Yuan
Large language models (LLMs) are excellent at maintaining high-level, convincing dialogues. They are being rapidly deployed as chatbots and evaluators in sensitive areas, such as peer review and mental health applications. This, along with the disparate accounts of their reasoning capabilities, calls for a closer examination of LLMs and their comprehension of dialogue. In this work we begin by evaluating LLMs’ ability to maintain a debate, one of the purest yet most complex forms of human communication. Then we measure how this capability relates to their understanding of what is being talked about, namely, their comprehension of dialogical structures and the pragmatic context. We find that LLMs are capable of maintaining coherent, persuasive debates, often swaying the beliefs of participants and audiences alike. We also note that awareness or suspicion of AI involvement encourages people to be more critical of the arguments made. When polling LLMs on their comprehension of deeper structures of dialogue, however, they cannot demonstrate said understanding. Our findings tie the shortcomings of LLMs-as-evaluators to their (in)ability to understand the context. More broadly, for the field of argumentation theory we posit that, if an agent can convincingly maintain a dialogue, it is not necessary for it to know what it is talking about. Hence, the modelling of pragmatic context and coherence is secondary to effectiveness.
Article 584
Title@2025-07-10 (4): Conditional Unigram Tokenization with Parallel Data
Title: Conditional Unigram Tokenization with Parallel Data | Bedingte Unigramm-Tokenisierung mit parallelen Daten | 附带平行数据的条件性大学招式 2507.07824v1 |
Authors (2): Gianluca Vico, Jindřich Libovický
We introduce conditional unigram tokenization, a novel approach that extends unigram tokenization by conditioning target token probabilities on source-language tokens from parallel data. Given a fixed source tokenizer, our method learns a target tokenizer that maximizes cross-lingual semantic alignment. We evaluate our tokenizer on four language pairs across different families and resource levels, examining intrinsic properties and downstream performance on machine translation and language modeling. While our conditional tokenizer maintains comparable statistical properties to standard unigram tokenizers, results are mixed: we observe no improvements in machine translation quality, but find consistent perplexity reductions in language modeling. We hypothesize that quadratic scaling of conditional probability estimation with respect to the vocabulary size creates a data efficiency bottleneck. Our findings suggest that alternative parameterizations may be necessary for practical cross-lingual tokenization.
Article 585
Title@2025-07-10 (4): Understanding and Controlling Repetition Neurons and Induction Heads in In-Context Learning
Title: Understanding and Controlling Repetition Neurons and Induction Heads in In-Context Learning | Verständnis und Kontrolle von Wiederholungsneuronen und Induktionsköpfen im In-Context-Lernen | 了解和控制再生中新中世纪和内文学习中的上岗负责人 2507.07810v1 |
Authors (3): Nhi Hoai Doan, Tatsuya Hiraoka, Kentaro Inui
This paper investigates the relationship between large language models’ (LLMs) ability to recognize repetitive input patterns and their performance on in-context learning (ICL). In contrast to prior work that has primarily focused on attention heads, we examine this relationship from the perspective of skill neurons, specifically repetition neurons. Our experiments reveal that the impact of these neurons on ICL performance varies depending on the depth of the layer in which they reside. By comparing the effects of repetition neurons and induction heads, we further identify strategies for reducing repetitive outputs while maintaining strong ICL capabilities.
Article 586
Title@2025-07-10 (4): Bridging Logic and Learning: Decoding Temporal Logic Embeddings via Transformers
Title: Bridging Logic and Learning: Decoding Temporal Logic Embeddings via Transformers | Überbrückung von Logik und Lernen: Dekodierung von Temporal Logic-Embeddings über Transformer | 架桥逻辑与学习:通过变形器解码时时逻辑嵌入 2507.07808v1 |
Authors (4): Sara Candussio, Gaia Saveri, Gabriele Sarti, Luca Bortolussi
Continuous representations of logic formulae allow us to integrate symbolic knowledge into data-driven learning algorithms. If such embeddings are semantically consistent, i.e. if similar specifications are mapped into nearby vectors, they enable continuous learning and optimization directly in the semantic space of formulae. However, to translate the optimal continuous representation into a concrete requirement, such embeddings must be invertible. We tackle this issue by training a Transformer-based decoder-only model to invert semantic embeddings of Signal Temporal Logic (STL) formulae. STL is a powerful formalism that allows us to describe properties of signals varying over time in an expressive yet concise way. By constructing a small vocabulary from STL syntax, we demonstrate that our proposed model is able to generate valid formulae after only 1 epoch and to generalize to the semantics of the logic in about 10 epochs. Additionally, the model is able to decode a given embedding into formulae that are often simpler in terms of length and nesting while remaining semantically close (or equivalent) to gold references. We show the effectiveness of our methodology across various levels of training formulae complexity to assess the impact of training data on the model’s ability to effectively capture the semantic information contained in the embeddings and generalize out-of-distribution. Finally, we deploy our model for solving a requirement mining task, i.e. inferring STL specifications that solve a classification task on trajectories, performing the optimization directly in the semantic space.
Article 587
Title@2025-07-10 (4): Decoding AI Judgment: How LLMs Assess News Credibility and Bias
Title: Decoding AI Judgment: How LLMs Assess News Credibility and Bias | Entschlüsselung des AI-Urteils: Wie LLMs neue Glaubwürdigkeit und Bias bewerten | 证明AI 判决:LLMs如何评估新闻信誉和Bias 2502.04426v2 |
Authors (9): Edoardo Loru, Jacopo Nudo, Niccolò Di Marco, Alessandro Santirocchi, Roberto Atzeni, Matteo Cinelli, Vincenzo Cestari, Clelia Rossi-Arnaud, Walter Quattrociocchi
Large Language Models (LLMs) are increasingly embedded in workflows that involve evaluative processes. This raises the need to examine how such evaluations are built, what assumptions they rely on, and how their strategies diverge from those of humans. We benchmark six LLMs against expert ratings (NewsGuard and Media Bias/Fact Check, MBFC) and against human judgments collected through a controlled experiment. To enable direct comparison, we implement a structured agentic framework in which both models and non-expert participants follow the same evaluation procedure: selecting criteria, retrieving content, and producing justifications. Despite output alignment, LLMs rely on different mechanisms: lexical associations and statistical priors replace contextual reasoning. This reliance produces systematic effects: political asymmetries, opaque justifications, and a tendency to confuse linguistic form with epistemic validity. Delegating judgment to such systems does not merely automate evaluation; it redefines it, shifting from normative reasoning to pattern-based approximation.
Article 588
Title@2025-07-10 (4): Understanding Chain-of-Thought in LLMs through Information Theory
Title: Understanding Chain-of-Thought in LLMs through Information Theory | Verständnis der in LLMs durch Informationstheorie gesuchten Gedankenkette | 通过信息理论在LLM 中探索了解链 2411.11984v2 |
Authors (3): Jean-Francois Ton, Muhammad Faaiz Taufiq, Yang Liu
Large Language Models (LLMs) have shown impressive performance in complex reasoning tasks through the use of Chain-of-Thought (CoT) reasoning, allowing models to break down problems into manageable sub-tasks. However, existing CoT evaluation techniques either require annotated CoT data or fall short in accurately assessing intermediate reasoning steps, leading to high rates of false positives. In this paper, we formalize CoT reasoning in LLMs through an information-theoretic lens. Specifically, our framework quantifies the “information gain” at each reasoning step, enabling the identification of failure modes in LLMs without the need for expensive annotated datasets. We demonstrate the efficacy of our approach through extensive experiments on toy arithmetic, GSM8K, and PRM800K datasets, where it significantly outperforms existing outcome-based methods by providing more accurate insights into model performance on individual subtasks.
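One way to picture a per-step information gain is as the change in log-probability that some predictor assigns to the correct final answer after each reasoning step. The abstract does not give the exact estimator, so the definition below, the `answer_probs` input, and both function names are assumptions for illustration only.

```python
import math


def information_gains(answer_probs):
    """Per-step gain in log-probability of the correct final answer.

    answer_probs[0] is the probability before any reasoning step and
    answer_probs[t] the probability after step t (all assumed > 0).
    The returned list has one entry per step, in nats.
    """
    return [
        math.log(current) - math.log(previous)
        for previous, current in zip(answer_probs, answer_probs[1:])
    ]


def uninformative_steps(gains, threshold=1e-9):
    """Return 1-based step numbers whose gain does not exceed `threshold`."""
    return [step for step, gain in enumerate(gains, start=1) if gain <= threshold]
```

Steps flagged this way are candidates for the failure modes the abstract refers to, since they add no measurable information about the answer under this simplified definition.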
Article 589
Title@2025-07-10 (4): When Large Language Models Meet Law: Dual-Lens Taxonomy, Technical Advances, and Ethical Governance
Title: When Large Language Models Meet Law: Dual-Lens Taxonomy, Technical Advances, and Ethical Governance | Wenn große Sprachmodelle das Recht erfüllen: Dual-Lens-Taxonomie, technischer Fortschritt und ethische Governance | 当大语言模式符合法律时:双重语言分类、技术进步和道德治理 2507.07748v1 |
Authors (5): Peizhang Shao, Linrui Xu, Jinxi Wang, Wei Zhou, Xingyu Wu
This paper establishes the first comprehensive review of Large Language Models (LLMs) applied within the legal domain. It pioneers an innovative dual lens taxonomy that integrates legal reasoning frameworks and professional ontologies to systematically unify historical research and contemporary breakthroughs. Transformer-based LLMs, which exhibit emergent capabilities such as contextual reasoning and generative argumentation, surmount traditional limitations by dynamically capturing legal semantics and unifying evidence reasoning. Significant progress is documented in task generalization, reasoning formalization, workflow integration, and addressing core challenges in text processing, knowledge integration, and evaluation rigor via technical innovations like sparse attention mechanisms and mixture-of-experts architectures. However, widespread adoption of LLMs introduces critical challenges: hallucination, explainability deficits, jurisdictional adaptation difficulties, and ethical asymmetry. This review proposes a novel taxonomy that maps legal roles to NLP subtasks and computationally implements the Toulmin argumentation framework, thus systematizing advances in reasoning, retrieval, prediction, and dispute resolution. It identifies key frontiers including low-resource systems, multimodal evidence integration, and dynamic rebuttal handling. Ultimately, this work provides both a technical roadmap for researchers and a conceptual framework for practitioners navigating the algorithmic future, laying a robust foundation for the next era of legal artificial intelligence. We have created a GitHub repository to index the relevant papers: https://github.com/Kilimajaro/LLMs_Meet_Law.
Article 590
Title@2025-07-10 (4): Code-Switching in End-to-End Automatic Speech Recognition: A Systematic Literature Review
Title: Code-Switching in End-to-End Automatic Speech Recognition: A Systematic Literature Review | Code-Switching in End-to-End Automatische Spracherkennung: Ein systematischer Literaturbericht | 端至端自动语音识别代码转换:系统文学审查 2507.07741v1 |
Authors (5): Maha Tufail Agro, Atharva Kulkarni, Karima Kadaoui, Zeerak Talat, Hanan Aldarmaki
Motivated by a growing research interest in automatic speech recognition (ASR), and the growing body of work for languages in which code-switching (CS) often occurs, we present a systematic literature review of code-switching in end-to-end ASR models. We collect and manually annotate papers published in peer-reviewed venues. We document the languages considered, datasets, metrics, model choices, and performance, and present a discussion of challenges in end-to-end ASR for code-switching. Our analysis thus provides insights on current research efforts and available resources as well as opportunities and gaps to guide future research.
Article 591
Title@2025-07-10 (4): GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing
Title: GuardVal: Dynamic Large Language Model Jailbreak Evaluation for Comprehensive Safety Testing | GuardVal: Dynamic Large Language Model Jailbreak Evaluation für umfassende Sicherheitstests | 警卫:综合安全测试动态大语言示范监狱防爆评价 2507.07735v1 |
Authors (4): Peiyan Zhang, Haibo Jin, Liying Kang, Haohan Wang
Jailbreak attacks reveal critical vulnerabilities in Large Language Models (LLMs) by causing them to generate harmful or unethical content. Evaluating these threats is particularly challenging due to the evolving nature of LLMs and the sophistication required in effectively probing their vulnerabilities. Current benchmarks and evaluation methods struggle to fully address these challenges, leaving gaps in the assessment of LLM vulnerabilities. In this paper, we review existing jailbreak evaluation practices and identify three assumed desiderata for an effective jailbreak evaluation protocol. To address these challenges, we introduce GuardVal, a new evaluation protocol that dynamically generates and refines jailbreak prompts based on the defender LLM’s state, providing a more accurate assessment of defender LLMs’ capacity to handle safety-critical situations. Moreover, we propose a new optimization method that prevents stagnation during prompt refinement, ensuring the generation of increasingly effective jailbreak prompts that expose deeper weaknesses in the defender LLMs. We apply this protocol to a diverse set of models, from Mistral-7b to GPT-4, across 10 safety domains. Our findings highlight distinct behavioral patterns among the models, offering a comprehensive view of their robustness. Furthermore, our evaluation process deepens the understanding of LLM behavior, leading to insights that can inform future research and drive the development of more secure models.
Article 592
Title@2025-07-10 (4): Not All Preferences are What You Need for Post-Training: Selective Alignment Strategy for Preference Optimization
Title: Not All Preferences are What You Need for Post-Training: Selective Alignment Strategy for Preference Optimization | Nicht alle Präferenzen sind das, was Sie für das Post-Training benötigen: Selektive Ausrichtungsstrategie für die Preference-Optimierung | 并非所有的优惠都是培训后需要的:选择性的优化优化战略 2507.07725v1 |
Authors (1): Zhijin Dong
Post-training alignment of large language models (LLMs) is a critical challenge, as not all tokens contribute equally to model performance. This paper introduces a selective alignment strategy that prioritizes high-impact tokens within preference pairs, leveraging token-level log-probability differences between the current policy and a reference model. By focusing on these informative tokens, our approach reduces computational overhead and enhances alignment fidelity. We further explore the role of reference model quality, demonstrating that stronger reference models significantly improve token selection accuracy and overall optimization effectiveness. Comprehensive experiments on benchmarks such as Arena-Hard and MT-Bench validate the superiority of our Selective-DPO method over standard DPO and distillation-based baselines. Our findings highlight the importance of token-level optimization and reference model selection in advancing preference alignment for LLMs. The code is available at https://github.com/Dongzhijin/SDPO.
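The token selection idea can be sketched as ranking tokens by the gap between policy and reference log-probabilities and keeping only the top fraction. The abstract does not specify the scoring direction or the selection rule, so the absolute difference, the `keep_ratio` parameter, and the function name below are all assumptions; the released code at the URL above is the authoritative version.

```python
def select_tokens(policy_logps, reference_logps, keep_ratio=0.5):
    """Return the indices of the tokens with the largest log-prob gaps.

    policy_logps / reference_logps: per-token log-probabilities of the same
    response under the current policy and the reference model.
    keep_ratio: fraction of tokens to keep (at least one token is kept).
    """
    diffs = [p - r for p, r in zip(policy_logps, reference_logps)]
    keep = max(1, int(keep_ratio * len(diffs)))
    # Rank positions by absolute gap, largest first; ties keep original order.
    ranked = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]), reverse=True)
    return sorted(ranked[:keep])
```

A preference loss restricted to the returned positions would then ignore tokens on which the two models already agree, which is where the claimed savings in computation would come from under this reading.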
Article 593
Title@2025-07-10 (4): Stable Preference Optimization for LLMs: A Bilevel Approach Beyond Direct Preference Optimization
Title: Stable Preference Optimization for LLMs: A Bilevel Approach Beyond Direct Preference Optimization | Stabile Preference-Optimierung für LLMs: Ein zweistufiger Ansatz über die direkte Preference-Optimierung hinaus | 对LLLMM公司的稳定优惠优化:超越直接优惠优化的双级办法 2507.07723v1 |
Authors (4): Chengtao Jian, Kai Yang, Ye Ouyang, Xiaozhou Ye
Direct Preference Optimization (DPO) has emerged as a popular and efficient alternative to reward modeling and reinforcement learning for aligning language models with human preferences. Despite its empirical success, the theoretical properties and intrinsic limitations of DPO remain underexplored. In this work, we first present a comprehensive analysis of DPO’s dynamics from a probability evolution perspective. Our analysis reveals that DPO is highly sensitive to initialization. It also tends to misallocate probability mass, which can inadvertently shift probability toward irrelevant or undesired responses. This misallocation may unintentionally reinforce model bias, thereby compromising both the stability of model alignment and the consistency with intended preferences. Motivated by these theoretical findings, we propose a theoretically grounded bilevel optimization framework that tightly integrates supervised fine-tuning with an enhanced DPO objective, which we call stable preference optimization. Our approach introduces a principled regularization scheme to explicitly encourage absolute probability improvement for preferred outputs, while maintaining stable optimization dynamics. Experiments on challenging reasoning and summarization benchmarks show that our method consistently improves reasoning accuracy and better aligns output distributions with intended preferences, outperforming standard DPO. Stable preference optimization provides new insights into the design of preference-based alignment objectives and opens up new avenues towards more reliable and interpretable language model alignment.
Article 594
Title@2025-07-10 (4): Rethinking the Privacy of Text Embeddings: A Reproducibility Study of “Text Embeddings Reveal (Almost) As Much As Text”
Title: Rethinking the Privacy of Text Embeddings: A Reproducibility Study of “Text Embeddings Reveal (Almost) As Much As Text” | Die Privatsphäre von Text-Embeddings neu denken: Eine Reproduzierbarkeitsstudie von “Text-Embeddings Reveal (fast) So viel wie Text” | 重新思考文字嵌入的隐私:关于“文字嵌入流(几乎)与文字一样”的可复制性研究 2507.07700v1 |
Authors (4): Dominykas Seputis, Yongkang Li, Karsten Langerak, Serghei Mihailov
Text embeddings are fundamental to many natural language processing (NLP) tasks, extensively applied in domains such as recommendation systems and information retrieval (IR). Traditionally, transmitting embeddings instead of raw text has been seen as privacy-preserving. However, recent methods such as Vec2Text challenge this assumption by demonstrating that controlled decoding can successfully reconstruct original texts from black-box embeddings. The unexpectedly strong results reported by Vec2Text motivated us to conduct further verification, particularly considering the typically non-intuitive and opaque structure of high-dimensional embedding spaces. In this work, we reproduce the Vec2Text framework and evaluate it from two perspectives: (1) validating the original claims, and (2) extending the study through targeted experiments. First, we successfully replicate the original key results in both in-domain and out-of-domain settings, with only minor discrepancies arising due to missing artifacts, such as model checkpoints and dataset splits. Furthermore, we extend the study by conducting a parameter sensitivity analysis, evaluating the feasibility of reconstructing sensitive inputs (e.g., passwords), and exploring embedding quantization as a lightweight privacy defense. Our results show that Vec2Text is effective under ideal conditions, capable of reconstructing even password-like sequences that lack clear semantics. However, we identify key limitations, including its sensitivity to input sequence length. We also find that Gaussian noise and quantization techniques can mitigate the privacy risks posed by Vec2Text, with quantization offering a simpler and more widely applicable solution. Our findings emphasize the need for caution in using text embeddings and highlight the importance of further research into robust defense mechanisms for NLP systems.
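The two defenses named above, embedding quantization and Gaussian noise, can be illustrated with a small sketch. The abstract does not describe the exact schemes or parameters used in the study, so the uniform quantizer, the value range, the noise scale, and the function names are assumptions.

```python
import random


def quantize(embedding, levels=16, low=-1.0, high=1.0):
    """Snap each coordinate to the nearest of `levels` evenly spaced values.

    Values outside [low, high] are clipped first. Requires levels >= 2.
    """
    step = (high - low) / (levels - 1)
    quantized = []
    for value in embedding:
        clipped = min(max(value, low), high)
        quantized.append(low + round((clipped - low) / step) * step)
    return quantized


def add_gaussian_noise(embedding, sigma=0.01, seed=None):
    """Add independent zero-mean Gaussian noise to each coordinate."""
    rng = random.Random(seed)
    return [value + rng.gauss(0.0, sigma) for value in embedding]
```

Both transformations discard fine-grained information that an inversion model could otherwise exploit; quantization needs no random source and is deterministic, which fits the abstract's description of it as the simpler option.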
Article 595
Title@2025-07-10 (4): What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training
Title: What do self-supervised speech models know about Dutch? Analyzing advantages of language-specific pre-training | Was wissen selbstüberwachte Sprachmodelle über Niederländisch? Analysieren von Vorteilen sprachspezifischer Vorausbildung | 自我监督的演讲模式对荷兰语了解多少? 分析具体语言培训前培训的优势 2506.00981v2 |
Authors (6): Marianne de Heer Kloots, Hosein Mohebbi, Charlotte Pouw, Gaofei Shen, Willem Zuidema, Martijn Bentum
How language-specific are speech representations learned by self-supervised models? Existing work has shown that a range of linguistic features can be successfully decoded from end-to-end models trained only on speech recordings. However, it’s less clear to what extent pre-training on specific languages improves language-specific linguistic information. Here we test the encoding of Dutch phonetic and lexical information in internal representations of self-supervised Wav2Vec2 models. Pre-training exclusively on Dutch improves the representation of Dutch linguistic features as compared to pre-training on similar amounts of English or larger amounts of multilingual data. This language-specific advantage is well-detected by trained clustering or classification probes, and partially observable using zero-shot metrics. Furthermore, the language-specific benefit on linguistic feature encoding aligns with downstream performance on Automatic Speech Recognition.
nan
Article 596
Title@2025-07-10 (4): KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities
Title: KeyKnowledgeRAG (K^2RAG): An Enhanced RAG method for improved LLM question-answering capabilities | KeyKnowledgeRAG (K^2RAG): Eine verbesserte RAG-Methode zur Verbesserung der LLM-Fragestellung | KeyknowledgeraG(K2RAG):改进LLM问答能力的强化RAG方法 2507.07695v1 |
Authors (4): Hruday Markondapatnaikuni, Basem Suleiman, Abdelkarim Erradi, Shijing Chen
Fine-tuning is an immensely resource-intensive process when retraining Large Language Models (LLMs) to incorporate a larger body of knowledge. Although many fine-tuning techniques have been developed to reduce the time and computational cost involved, the challenge persists as LLMs continue to grow in size and complexity. To address this, a new approach to knowledge expansion in LLMs is needed. Retrieval-Augmented Generation (RAG) offers one such alternative by storing external knowledge in a database and retrieving relevant chunks to support question answering. However, naive implementations of RAG face significant limitations in scalability and answer accuracy. This paper introduces KeyKnowledgeRAG (K2RAG), a novel framework designed to overcome these limitations. Inspired by the divide-and-conquer paradigm, K2RAG integrates dense and sparse vector search, knowledge graphs, and text summarization to improve retrieval quality and system efficiency. The framework also includes a preprocessing step that summarizes the training data, significantly reducing the training time. K2RAG was evaluated using the MultiHopRAG dataset, where the proposed pipeline was trained on the document corpus and tested on a separate evaluation set. Results demonstrated notable improvements over common naive RAG implementations. K2RAG achieved the highest mean answer similarity score of 0.57, and reached the highest third quartile (Q3) similarity of 0.82, indicating better alignment with ground-truth answers. In addition to improved accuracy, the framework proved highly efficient. The summarization step reduced the average training time of individual components by 93%, and execution speed was up to 40% faster than traditional knowledge graph-based RAG systems. K2RAG also demonstrated superior scalability, requiring three times less VRAM than several naive RAG implementations tested in this study.
nan
Article 597
Title@2025-07-10 (4): SAS: Simulated Attention Score
Title: SAS: Simulated Attention Score | SAS: Simulierter Aufmerksamkeits-Score | SAS:模拟关注计分 2507.07694v1 |
Authors (15): Chuanyang Zheng, Jiankai Sun, Yihang Gao, Yuehao Wang, Peihao Wang, Jing Xiong, Liliang Ren, Hao Cheng, Janardhan Kulkarni, Yelong Shen, Atlas Wang, Mac Schwager, Anderson Schneider, Xiaodong Liu, Jianfeng Gao
The attention mechanism is a core component of the Transformer architecture. Various methods have been developed to compute attention scores, including multi-head attention (MHA), multi-query attention, group-query attention and so on. We further analyze the MHA and observe that its performance improves as the number of attention heads increases, provided the hidden size per head remains sufficiently large. Therefore, increasing both the head count and hidden size per head with minimal parameter overhead can lead to significant performance gains at a low cost. Motivated by this insight, we introduce Simulated Attention Score (SAS), which maintains a compact model size while simulating a larger number of attention heads and hidden feature dimension per head. This is achieved by projecting a low-dimensional head representation into a higher-dimensional space, effectively increasing attention capacity without increasing parameter count. Beyond the head representations, we further extend the simulation approach to feature dimension of the key and query embeddings, enhancing expressiveness by mimicking the behavior of a larger model while preserving the original model size. To control the parameter cost, we also propose Parameter-Efficient Attention Aggregation (PEAA). Comprehensive experiments on a variety of datasets and tasks demonstrate the effectiveness of the proposed SAS method, achieving significant improvements over different attention variants.
nan
Article 598
Title@2025-07-10 (4): Hierarchical Bracketing Encodings for Dependency Parsing as Tagging
Title: Hierarchical Bracketing Encodings for Dependency Parsing as Tagging | Hierarchische Bracketing-Encodings für Dependency Parsing als Tagging | 将依赖性剖析作为拖贴 2505.11693v2 |
Authors (4): Ana Ezquerro, David Vilares, Anssi Yli-Jyrä, Carlos Gómez-Rodríguez
We present a family of encodings for sequence labeling dependency parsing, based on the concept of hierarchical bracketing. We prove that the existing 4-bit projective encoding belongs to this family, but it is suboptimal in the number of labels used to encode a tree. We derive an optimal hierarchical bracketing, which minimizes the number of symbols used and encodes projective trees using only 12 distinct labels (vs. 16 for the 4-bit encoding). We also extend optimal hierarchical bracketing to support arbitrary non-projectivity in a more compact way than previous encodings. Our new encodings yield competitive accuracy on a diverse set of treebanks.
nan
Article 599
Title@2025-07-10 (4): Investigating Co-Constructive Behavior of Large Language Models in Explanation Dialogues
Title: Investigating Co-Constructive Behavior of Large Language Models in Explanation Dialogues | Ko-Konstruktives Verhalten von großen Sprachmodellen in Erklärungsdialogen untersuchen | 解释对话中大语言模式的共同调查行为 2504.18483v2 |
Authors (12): Leandra Fichtel, Maximilian Spliethöver, Eyke Hüllermeier, Patricia Jimenez, Nils Klowait, Stefan Kopp, Axel-Cyrille Ngonga Ngomo, Amelie Robrecht, Ingrid Scharlau, Lutz Terfloth, Anna-Lisa Vollmer, Henning Wachsmuth
The ability to generate explanations that are understood by explainees is the quintessence of explainable artificial intelligence. Since understanding depends on the explainee’s background and needs, recent research focused on co-constructive explanation dialogues, where an explainer continuously monitors the explainee’s understanding and adapts their explanations dynamically. We investigate the ability of large language models (LLMs) to engage as explainers in co-constructive explanation dialogues. In particular, we present a user study in which explainees interact with an LLM in two settings, one of which involves the LLM being instructed to explain a topic co-constructively. We evaluate the explainees’ understanding before and after the dialogue, as well as their perception of the LLMs’ co-constructive behavior. Our results suggest that LLMs show some co-constructive behaviors, such as asking verification questions, that foster the explainees’ engagement and can improve understanding of a topic. However, their ability to effectively monitor the current understanding and scaffold the explanations accordingly remains limited.
nan
Article 600
Title@2025-07-10 (4): Improving Cross-lingual Representation for Semantic Retrieval with Code-switching
Title: Improving Cross-lingual Representation for Semantic Retrieval with Code-switching | Verbesserung der Cross-lingual Darstellung für semantische Retrieval mit Code-Schaltung | 使用代码转换法改进语义检索的跨语种代表性 2403.01364v2 |
Authors (6): Mieradilijiang Maimaiti, Yuanhang Zheng, Ji Zhang, Yue Zhang, Wenpei Luo, Kaiyu Huang
Semantic Retrieval (SR) has become an indispensable part of the FAQ system in the task-oriented question-answering (QA) dialogue scenario. The demand for cross-lingual smart customer-service systems for e-commerce platforms or particular business conditions has been increasing recently. Most previous studies exploit cross-lingual pre-trained models (PTMs) for multi-lingual knowledge retrieval directly, while others also leverage continual pre-training before fine-tuning PTMs on the downstream tasks. However, whichever scheme is used, previous work neglects to inform PTMs of features of the downstream task, i.e., it trains PTMs without providing any signals related to SR. To this end, we propose an Alternative Cross-lingual PTM for SR via code-switching. We are the first to utilize the code-switching approach for cross-lingual SR. In addition, we introduce a novel code-switched continual pre-training step instead of directly using the PTMs on the SR tasks. The experimental results show that our proposed approach consistently outperforms the previous SOTA methods on SR and semantic textual similarity (STS) tasks with three business corpora and four open datasets in 20+ languages.
nan
Article 601
Title@2025-07-10 (4): Less Stress, More Privacy: Stress Detection on Anonymized Speech of Air Traffic Controllers
Title: Less Stress, More Privacy: Stress Detection on Anonymized Speech of Air Traffic Controllers | Weniger Stress, mehr Datenschutz: Stresserkennung auf anonymisierter Sprache von Fluglotsen | 减少压力,增加隐私:在空中交通管制员匿名讲话中发现压力 2507.08882v1 |
Authors (4): Janaki Viswanathan, Alexander Blatt, Konrad Hagemann, Dietrich Klakow
Air traffic control (ATC) demands multi-tasking under time pressure, where errors carry severe consequences. This can induce stress. Detecting stress is key to maintaining the high safety standards of ATC. However, processing ATC voice data entails privacy restrictions, e.g., under the General Data Protection Regulation (GDPR). Anonymizing the ATC voice data is one way to comply with these restrictions. In this paper, different architectures for stress detection on anonymized air traffic controller (ATCO) speech are evaluated. Our best networks reach a stress detection accuracy of 93.6% on an anonymized version of the Speech Under Simulated and Actual Stress (SUSAS) dataset and an accuracy of 80.1% on our anonymized ATC simulation dataset. This shows that privacy does not have to be an impediment to building well-performing deep-learning-based models.
nan
Article 602
Title@2025-07-10 (4): Beyond Hate Speech: NLP’s Challenges and Opportunities in Uncovering Dehumanizing Language
Title: Beyond Hate Speech: NLP’s Challenges and Opportunities in Uncovering Dehumanizing Language | Beyond Hate Speech: NLPs Herausforderungen und Chancen beim Enthumanisieren der Sprache | 超越仇恨言论:NLP在揭开非人化语言方面的挑战和机遇 2402.13818v2 |
Authors (5): Hamidreza Saffari, Mohammadamin Shafiei, Hezhao Zhang, Lasana Harris, Nafise Sadat Moosavi
Dehumanization, i.e., denying human qualities to individuals or groups, is a particularly harmful form of hate speech that can normalize violence against marginalized communities. Despite advances in NLP for detecting general hate speech, approaches to identifying dehumanizing language remain limited due to scarce annotated data and the subtle nature of such expressions. In this work, we systematically evaluate four state-of-the-art large language models (LLMs), namely Claude, GPT, Mistral, and Qwen, for dehumanization detection. Our results show that only one model, Claude, achieves strong performance (over 80% F1) under an optimized configuration, while others, despite their capabilities, perform only moderately. Performance drops further when distinguishing dehumanization from related hate types such as derogation. We also identify systematic disparities across target groups: models tend to over-predict dehumanization for some identities (e.g., Gay men), while under-identifying it for others (e.g., Refugees). These findings motivate the need for systematic, group-level evaluation when applying pretrained language models to dehumanization detection tasks.
nan
Article 603
Title@2025-07-10 (4): An Automated Length-Aware Quality Metric for Summarization
Title: An Automated Length-Aware Quality Metric for Summarization | Ein Automatisiertes Längen-Bewusst-Qualitäts-Metrik für die Zusammenfassung | 用于汇总的自动长软件质量计量器 2507.07653v1 |
Authors (1): Andrew D. Foland
This paper proposes NOrmed Index of Retention (NOIR), a quantitative objective metric for evaluating summarization quality of arbitrary texts that relies on both the retention of semantic meaning and the summary length compression. This gives a measure of how well the recall-compression tradeoff is managed, the most important skill in summarization. Experiments demonstrate that NOIR effectively captures the token-length / semantic retention tradeoff of a summarizer and correlates with human perception of summarization quality. Using a language model embedding to measure semantic similarity, it provides an automated alternative for assessing summarization quality without relying on time-consuming human-generated reference summaries. The proposed metric can be applied to various summarization tasks, offering an automated tool for evaluating and improving summarization algorithms, summarization prompts, and synthetically generated summaries.
nan
Article 604
Title@2025-07-10 (4): Lost in Pronunciation: Detecting Chinese Offensive Language Disguised by Phonetic Cloaking Replacement
Title: Lost in Pronunciation: Detecting Chinese Offensive Language Disguised by Phonetic Cloaking Replacement | Lost in Pronunciation: Chinesische Offensive Sprache entdecken, verkleidet durch phonetische Umkleide-Ersatz | 失落于发音中:发现因替换语音内衣而变形的中国进攻性语言 2507.07640v1 |
Authors (11): Haotan Guo, Jianfei He, Jiayuan Ma, Hongbin Na, Zimu Wang, Haiyang Zhang, Qi Chen, Wei Wang, Zijing Shi, Tao Shen, Ling Chen
Phonetic Cloaking Replacement (PCR), defined as the deliberate use of homophonic or near-homophonic variants to hide toxic intent, has become a major obstacle to Chinese content moderation. While this problem is well-recognized, existing evaluations predominantly rely on rule-based, synthetic perturbations that ignore the creativity of real users. We organize PCR into a four-way surface-form taxonomy and compile a dataset of 500 naturally occurring, phonetically cloaked offensive posts gathered from the RedNote platform. Benchmarking state-of-the-art LLMs on this dataset exposes a serious weakness: the best model reaches only an F1-score of 0.672, and zero-shot chain-of-thought prompting pushes performance even lower. Guided by error analysis, we revisit a Pinyin-based prompting strategy that earlier studies judged ineffective and show that it recovers much of the lost accuracy. This study offers the first comprehensive taxonomy of Chinese PCR, a realistic benchmark that reveals current detectors’ limits, and a lightweight mitigation technique that advances research on robust toxicity detection.
nan
Article 605
Title@2025-07-10 (4): FrugalRAG: Learning to retrieve and reason for multi-hop QA
Title: FrugalRAG: Learning to retrieve and reason for multi-hop QA | FrugalRAG: Lernen zum Abrufen und Grund für Multi-Hop-QA | FrugalRAG:学会检索和多呼QA的理由 2507.07634v1 |
Authors (4): Abhinav Java, Srivathsan Koundinyan, Nagarajan Natarajan, Amit Sharma
We consider the problem of answering complex questions, given access to a large unstructured document corpus. The de facto approach to solving the problem is to leverage language models that (iteratively) retrieve and reason through the retrieved documents, until the model has sufficient information to generate an answer. Attempts at improving this approach focus on retrieval-augmented generation (RAG) metrics such as accuracy and recall and can be categorized into two types: (a) fine-tuning on large question answering (QA) datasets augmented with chain-of-thought traces, and (b) leveraging RL-based fine-tuning techniques that rely on question-document relevance signals. However, efficiency in the number of retrieval searches is an equally important metric, which has received less attention. In this work, we show that: (1) Large-scale fine-tuning is not needed to improve RAG metrics, contrary to popular claims in recent literature. Specifically, a standard ReAct pipeline with improved prompts can outperform state-of-the-art methods on benchmarks such as HotPotQA. (2) Supervised and RL-based fine-tuning can help RAG from the perspective of frugality, i.e., the latency due to number of searches at inference time. For example, we show that we can achieve competitive RAG metrics at nearly half the cost (in terms of number of searches) on popular RAG benchmarks, using the same base model, and at a small training cost (1000 examples).
nan
Article 606
Title@2025-07-10 (4): Towards a cognitive architecture to enable natural language interaction in co-constructive task learning
Title: Towards a cognitive architecture to enable natural language interaction in co-constructive task learning | Auf dem Weg zu einer kognitiven Architektur, um natürliche Sprachinteraktion im co-konstruktiven Aufgabenlernen zu ermöglichen | 建立一个认知结构,在共同建设性任务学习中促成自然语言互动 2503.23760v2 |
Authors (5): Manuel Scheibl, Birte Richter, Alissa Müller, Michael Beetz, Britta Wrede
This research addresses the question of which characteristics a cognitive architecture must have to leverage the benefits of natural language in Co-Constructive Task Learning (CCTL). To provide context, we first discuss Interactive Task Learning (ITL), the mechanisms of the human memory system, and the significance of natural language and multi-modality. Next, we examine the current state of cognitive architectures, analyzing their capabilities to inform a concept of CCTL grounded in multiple sources. We then integrate insights from various research domains to develop a unified framework. Finally, we conclude by identifying the remaining challenges and requirements necessary to achieve CCTL in Human-Robot Interaction (HRI).
nan
Article 607
Title@2025-07-10 (4): Comparative sentiment analysis of public perception: Monkeypox vs. COVID-19 behavioral insights
Title: Comparative sentiment analysis of public perception: Monkeypox vs. COVID-19 behavioral insights | Vergleichende Stimmungsanalyse der öffentlichen Wahrnehmung: Monkeypox vs. COVID-19 Verhaltenseinblicke | 对公众感知的比较情绪分析:天花对COVID-19行为洞察力 2505.07430v2 |
Authors (3): Mostafa Mohaimen Akand Faisal, Rabeya Amin Jhuma, Jamini Jasim
The emergence of global health crises, such as COVID-19 and Monkeypox (mpox), has underscored the importance of understanding public sentiment to inform effective public health strategies. This study conducts a comparative sentiment analysis of public perceptions surrounding COVID-19 and mpox by leveraging extensive datasets of 147,475 and 106,638 tweets, respectively. Advanced machine learning models, including Logistic Regression, Naive Bayes, RoBERTa, DistilRoBERTa and XLNet, were applied to perform sentiment classification, with results indicating key trends in public emotion and discourse. The analysis highlights significant differences in public sentiment driven by disease characteristics, media representation, and pandemic fatigue. Through the lens of sentiment polarity and thematic trends, this study offers valuable insights into tailoring public health messaging, mitigating misinformation, and fostering trust during concurrent health crises. The findings contribute to advancing sentiment analysis applications in public health informatics, setting the groundwork for enhanced real-time monitoring and multilingual analysis in future research.
nan
Article 608
Title@2025-07-10 (4): Exploring the Limits of Model Compression in LLMs: A Knowledge Distillation Study on QA Tasks
Title: Exploring the Limits of Model Compression in LLMs: A Knowledge Distillation Study on QA Tasks | Erforschung der Grenzen der Modellkompression in LLMs: Eine Studie zur Wissensdestillation über QA-Aufgaben | 探索LLMM中模型压缩的限度:关于质量保证任务的知识积累研究 2507.07630v1 |
Authors (4): Joyeeta Datta, Niclas Doll, Qusai Ramadan, Zeyd Boukhers
Large Language Models (LLMs) have demonstrated outstanding performance across a range of NLP tasks; however, their computational demands hinder their deployment in real-world, resource-constrained environments. This work investigates the extent to which LLMs can be compressed using Knowledge Distillation (KD) while maintaining strong performance on Question Answering (QA) tasks. We evaluate student models distilled from the Pythia and Qwen2.5 families on two QA benchmarks, SQuAD and MLQA, under zero-shot and one-shot prompting conditions. Results show that student models retain over 90% of their teacher models’ performance while reducing parameter counts by up to 57.1%. Furthermore, one-shot prompting yields additional performance gains over zero-shot setups for both model families. These findings underscore the trade-off between model efficiency and task performance, demonstrating that KD, combined with minimal prompting, can yield compact yet capable QA systems suitable for resource-constrained applications.
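For readers unfamiliar with knowledge distillation, a common formulation trains the student to match the teacher's temperature-softened output distribution. The sketch below shows that generic soft-label loss in NumPy; it is an assumption for illustration, since the abstract does not state which distillation objective was used, and `kd_loss` is a hypothetical name.

```python
import numpy as np

def log_softmax(logits, temperature=1.0):
    """Numerically stable log-softmax over the last axis."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """Mean KL(teacher || student) on temperature-softened distributions.

    The T**2 factor is the conventional rescaling that keeps gradient
    magnitudes comparable across temperatures.
    """
    log_p_t = log_softmax(teacher_logits, temperature)
    log_p_s = log_softmax(student_logits, temperature)
    kl = np.sum(np.exp(log_p_t) * (log_p_t - log_p_s), axis=-1)
    return float(kl.mean() * temperature ** 2)

# Toy logits for a batch of 2 examples over 4 classes (made-up numbers).
teacher = np.array([[2.0, 0.5, -1.0, 0.0], [0.1, 0.2, 3.0, -0.5]])
student = np.array([[1.0, 1.0, -0.5, 0.2], [0.0, 0.0, 1.0, 0.0]])
loss_mismatch = kd_loss(student, teacher)
loss_match = kd_loss(teacher, teacher)
```

The loss is zero only when the student reproduces the teacher's distribution exactly, so it gives a direct signal for how much of the teacher's behaviour the smaller model has retained.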
nan
Article 609
Title@2025-07-10 (4): Good/Evil Reputation Judgment of Celebrities by LLMs via Retrieval Augmented Generation
Title: Good/Evil Reputation Judgment of Celebrities by LLMs via Retrieval Augmented Generation | Gute/böse Reputation Urteil von Prominenten durch LLMs über retrieval Augmented Generation | LLMs通过回回子增量一代对名词的良好/负面评奖判决 2503.14382v2 |
Authors (3): Rikuto Tsuchida, Hibiki Yokoyama, Takehito Utsuro
The purpose of this paper is to examine whether large language models (LLMs) can understand what is good and evil with respect to judging the good/evil reputation of celebrities. Specifically, we first apply a large language model (namely, ChatGPT) to the task of collecting sentences that mention the target celebrity from articles about celebrities on Web pages. Next, ChatGPT categorizes the collected sentences based on their contents and assigns a name to each category. These category names are referred to as “aspects” of each celebrity. Then, by applying the framework of retrieval augmented generation (RAG), we show that the large language model is quite effective in the task of judging the good/evil reputation of aspects and descriptions of each celebrity. Finally, to demonstrate the advantages of the proposed method over existing services incorporating RAG functions, we show that the proposed method of judging the good/evil reputation of aspects and descriptions of each celebrity significantly outperforms an existing service incorporating RAG functions.
nan
Article 610
Title@2025-07-10 (4): Enhancing Vaccine Safety Surveillance: Extracting Vaccine Mentions from Emergency Department Triage Notes Using Fine-Tuned Large Language Models
Title: Enhancing Vaccine Safety Surveillance: Extracting Vaccine Mentions from Emergency Department Triage Notes Using Fine-Tuned Large Language Models | Verbesserung der Überwachung der Sicherheit von Impfstoffen: Extraktion von Impfstoff-Ernährungen aus der Notaufnahme der Notaufnahme mit fein dosierten großen Sprachmodellen | 加强疫苗安全监督:紧急部门使用精美大语言模型的 “ 特里吉语说明 “ 引用的 “ 提取 “ 疫苗 “ 提示 2507.07599v1 |
Authors (7): Sedigh Khademi, Jim Black, Christopher Palmer, Muhammad Javed, Hazel Clothier, Jim Buttery, Gerardo Luis Dimaguila
This study evaluates fine-tuned Llama 3.2 models for extracting vaccine-related information from emergency department triage notes to support near real-time vaccine safety surveillance. Prompt engineering was used to initially create a labeled dataset, which was then confirmed by human annotators. The performance of prompt-engineered models, fine-tuned models, and a rule-based approach was compared. The fine-tuned 3-billion-parameter Llama model outperformed other models in its accuracy of extracting vaccine names. Model quantization enabled efficient deployment in resource-constrained environments. Findings demonstrate the potential of large language models in automating data extraction from emergency department notes, supporting efficient vaccine safety surveillance and early detection of emerging adverse events following immunization issues.
nan
Article 611
Title@2025-07-10 (4): Beyond Overcorrection: Evaluating Diversity in T2I Models with DivBench
Title: Beyond Overcorrection: Evaluating Diversity in T2I Models with DivBench | Jenseits von Überkorrektur: Bewertung von Diversität in T2I-Modellen mit DivBench | 超越过度纠正:在DivBench的T2I模型中评估多样性 2507.03015v2 |
Authors (5): Felix Friedrich, Thiemo Ganesha Welsch, Manuel Brack, Patrick Schramowski, Kristian Kersting
Current diversification strategies for text-to-image (T2I) models often ignore contextual appropriateness, leading to over-diversification where demographic attributes are modified even when explicitly specified in prompts. This paper introduces DIVBENCH, a benchmark and evaluation framework for measuring both under- and over-diversification in T2I generation. Through systematic evaluation of state-of-the-art T2I models, we find that while most models exhibit limited diversity, many diversification approaches overcorrect by inappropriately altering contextually-specified attributes. We demonstrate that context-aware methods, particularly LLM-guided FairDiffusion and prompt rewriting, can already effectively address under-diversity while avoiding over-diversification, achieving a better balance between representation and semantic fidelity.
nan
Article 612
Title@2025-07-10 (4): Improving Clustering on Occupational Text Data through Dimensionality Reduction
Title: Improving Clustering on Occupational Text Data through Dimensionality Reduction | Verbesserung der Clusterbildung auf berufsbezogenen Textdaten durch Dimensionalitätsreduzierung | 通过减少分量改进职业文本数据集群化 2507.07582v1 |
Authors (3): Iago Xabier Vázquez García, Damla Partanaz, Emrullah Fatih Yetkin
In this study, we propose a clustering mechanism for the occupations defined in the well-known US-based occupational database, O*NET. Even though all occupations are defined according to well-conducted surveys in the US, their definitions can vary across firms and countries. Hence, if one wants to expand the data already collected in O*NET to occupations defined with different tasks, a map between the definitions is a vital requirement. We propose a pipeline using several BERT-based techniques with various clustering approaches to obtain such a map. We also examine the effect of dimensionality reduction approaches on several metrics used to measure the performance of clustering algorithms. Finally, we improve our results by using a specialized silhouette approach. This new clustering-based mapping approach with dimensionality reduction may help distinguish occupations automatically, creating new paths for people wanting to change their careers.
nan
Article 613
Title@2025-07-10 (4): COALA: Numerically Stable and Efficient Framework for Context-Aware Low-Rank Approximation
Title: COALA: Numerically Stable and Efficient Framework for Context-Aware Low-Rank Approximation | COALA: Numerisch stabiles und effizientes Framework für kontextabhängige Low-Rank-Annäherung | COALA: 低 Rank 上下低敏度接近度的数值稳定、高效框架 2507.07580v1 |
Authors (2): Uliana Parkina, Maxim Rakhuba
Recent studies suggest that context-aware low-rank approximation is a useful tool for compression and fine-tuning of modern large-scale neural networks. In this type of approximation, a norm is weighted by a matrix of input activations, significantly improving metrics over the unweighted case. Nevertheless, existing methods for neural networks suffer from numerical instabilities due to their reliance on classical formulas involving explicit Gram matrix computation and their subsequent inversion. We demonstrate that this can degrade the approximation quality or cause numerically singular matrices. To address these limitations, we propose a novel inversion-free regularized framework that is based entirely on stable decompositions and overcomes the numerical pitfalls of prior art. Our method can handle possible challenging scenarios: (1) when calibration matrices exceed GPU memory capacity, (2) when input activation matrices are nearly singular, and even (3) when insufficient data prevents unique approximation. For the latter, we prove that our solution converges to a desired approximation and derive explicit error bounds.
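The idea of avoiding an explicit Gram matrix and its inverse can be made concrete with a short NumPy sketch. It minimizes the activation-weighted error ||(W - W_r) X||_F over rank-r matrices using only a QR factorization and an SVD. This is a standard construction offered under stated assumptions, not the authors' exact algorithm: it omits the regularization the abstract mentions, and `weighted_low_rank` is a hypothetical name.

```python
import numpy as np

def weighted_low_rank(W, X, rank):
    """Rank-`rank` approximation of W minimizing ||(W - W_r) @ X||_F.

    W: (m, n) weight matrix; X: (n, k) input activations.
    No Gram matrix X @ X.T is formed and nothing is inverted.
    """
    # X.T = Q R, so X = R.T Q.T and ||(W - W_r) X||_F == ||(W - W_r) R.T||_F.
    R = np.linalg.qr(X.T, mode="r")
    # Left singular vectors of W @ R.T coincide with those of W @ X.
    U, _, _ = np.linalg.svd(W @ R.T, full_matrices=False)
    U_r = U[:, :rank]
    # Project W onto the dominant output directions.
    return U_r @ (U_r.T @ W)

# Synthetic data with strongly anisotropic activations (made-up example).
rng = np.random.default_rng(0)
W = rng.standard_normal((20, 30))
X = rng.standard_normal((30, 200)) * np.linspace(0.1, 5.0, 30)[:, None]

W_ctx = weighted_low_rank(W, X, rank=5)

# Baseline: plain truncated SVD of W that ignores the activations.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
W_plain = (U[:, :5] * s[:5]) @ Vt[:5]

err_ctx = np.linalg.norm((W - W_ctx) @ X)
err_plain = np.linalg.norm((W - W_plain) @ X)
```

Because the factor R carries the same weighting as X while having the condition number of X rather than its square, working with R sidesteps the squared conditioning that forming X @ X.T would introduce.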
nan
Article 614
Title@2025-07-10 (4): Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation
Title: Single-to-mix Modality Alignment with Multimodal Large Language Model for Document Image Machine Translation | Ein-zu-Mix Modalität Ausrichtung mit multimodalen Großsprachenmodellen für die Übersetzung von Dokumentenbildmaschinen | 单一至混合模式与文件图像机机翻译多式大语言模式 2507.07572v1 |
Authors (7): Yupu Liang, Yaping Zhang, Zhiyang Zhang, Yang Zhao, Lu Xiang, Chengqing Zong, Yu Zhou
Document Image Machine Translation (DIMT) aims to translate text within document images, facing generalization challenges due to limited training data and the complex interplay between visual and textual information. To address these challenges, we introduce M4Doc, a novel single-to-mix modality alignment framework leveraging Multimodal Large Language Models (MLLMs). M4Doc aligns an image-only encoder with the multimodal representations of an MLLM, pre-trained on large-scale document image datasets. This alignment enables a lightweight DIMT model to learn crucial visual-textual correlations during training. During inference, M4Doc bypasses the MLLM, maintaining computational efficiency while benefiting from its multimodal knowledge. Comprehensive experiments demonstrate substantial improvements in translation quality, especially in cross-domain generalization and challenging document image scenarios.
nan
Article 615
Title@2025-07-10 (4): video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models
Title: video-SALMONN 2: Captioning-Enhanced Audio-Visual Large Language Models | Video-SALMONN 2: Bildunterschrift-verbesserte Audio-Visuelle große Sprachmodelle | 视频-SALMONN2:字幕-强化视听大语言模式 2506.15220v2 |
Authors (8): Changli Tang, Yixuan Li, Yudong Yang, Jimin Zhuang, Guangzhi Sun, Wei Li, Zejun Ma, Chao Zhang
Videos contain a wealth of information, and generating detailed and accurate descriptions in natural language is a key aspect of video understanding. In this paper, we present video-SALMONN 2, an advanced audio-visual large language model (LLM) with low-rank adaptation (LoRA) designed for enhanced video (with paired audio) captioning through direct preference optimisation (DPO). We propose new metrics to evaluate the completeness and accuracy of video descriptions, which are optimised using DPO. To further improve training, we propose a novel multi-round DPO (MrDPO) approach, which involves periodically updating the DPO reference model, merging and re-initialising the LoRA module as a proxy for parameter updates after each training round (1,000 steps), and incorporating guidance from ground-truth video captions to stabilise the process. Experimental results show that MrDPO significantly enhances video-SALMONN 2’s captioning accuracy, reducing the captioning error rates by 28%. The final video-SALMONN 2 model, with just 7 billion parameters, surpasses leading models such as GPT-4o and Gemini-1.5-Pro in video captioning tasks, while maintaining highly competitive performance to the state-of-the-art on widely used video question-answering benchmarks among models of similar size. Codes are available at https://github.com/bytedance/video-SALMONN-2.
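As background for the DPO-based training described above, the standard per-pair DPO objective can be sketched in a few lines. This shows only the generic loss under the usual formulation; the multi-round schedule, LoRA merging, and caption guidance from the abstract are not modelled, and the function name and the default `beta` are assumptions.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are summed log-probabilities of the chosen and rejected
    responses under the trainable policy and the frozen reference model.
    """
    margin = beta * ((policy_chosen_logp - ref_chosen_logp)
                     - (policy_rejected_logp - ref_rejected_logp))
    # -log(sigmoid(margin)) computed stably as softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Made-up log-probabilities for illustration.
loss_neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)  # policy equals reference
loss_better = dpo_loss(-8.0, -14.0, -10.0, -12.0)    # policy favours chosen
loss_worse = dpo_loss(-12.0, -9.0, -10.0, -12.0)     # policy favours rejected
```

When the policy matches the reference the margin is zero and the loss equals log 2, which is why refreshing the reference model between rounds resets the starting point for further preference gains.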
nan
Article 616
Title@2025-07-10 (4): The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs
Title: The Synergy Dilemma of Long-CoT SFT and RL: Investigating Post-Training Techniques for Reasoning VLMs | Synergy Dilemma von Long-CoT SFT und RL: Untersuchung von Post-Training-Techniken zur Begründung von VLMs | Long-CoT SFT和RL的协同问题:调查培训后用于说明理由的训练后技术 2507.07562v1 |
Authors (14): Jierun Chen, Tiezheng Yu, Haoli Bai, Lewei Yao, Jiannan Wu, Kaican Li, Fei Mi, Chaofan Tao, Lei Zhu, Manyi Zhang, Xiaohui Li, Lu Hou, Lifeng Shang, Qun Liu
Large vision-language models (VLMs) increasingly adopt post-training techniques such as long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL) to elicit sophisticated reasoning. While these methods exhibit synergy in language-only models, their joint effectiveness in VLMs remains uncertain. We present a systematic investigation into the distinct roles and interplay of long-CoT SFT and RL across multiple multimodal reasoning benchmarks. We find that SFT improves performance on difficult questions through in-depth, structured reasoning, but introduces verbosity and degrades performance on simpler ones. In contrast, RL promotes generalization and brevity, yielding consistent improvements across all difficulty levels, though the improvements on the hardest questions are less prominent compared to SFT. Surprisingly, combining them through two-staged, interleaved, or progressive training strategies, as well as data mixing and model merging, all fail to produce additive benefits, instead leading to trade-offs in accuracy, reasoning style, and response length. This “synergy dilemma” highlights the need for more seamless and adaptive approaches to unlock the full potential of combined post-training techniques for reasoning VLMs.
Article 617
Title@2025-07-10 (4): Multi-Head RAG: Solving Multi-Aspect Problems with LLMs
Title: Multi-Head RAG: Solving Multi-Aspect Problems with LLMs | Multi-Head RAG: Lösung von Multi-Aspect-Problemen mit LLMs | 多方主管RAG:解决多领域问题与LLM 2406.05085v4 |
Authors (16): Maciej Besta, Ales Kubicek, Robert Gerstenberger, Marcin Chrapek, Roman Niggli, Patrik Okanovic, Yi Zhu, Patrick Iff, Michal Podstawski, Lucas Weitzendorf, Mingyuan Chi, Joanna Gajda, Piotr Nyczyk, Jürgen Müller, Hubert Niewiadomski, Torsten Hoefler
Retrieval Augmented Generation (RAG) enhances the abilities of Large Language Models (LLMs) by enabling the retrieval of documents into the LLM context to provide more accurate and relevant responses. Existing RAG solutions do not focus on queries that may require fetching multiple documents with substantially different contents. Such queries occur frequently, but are challenging because the embeddings of these documents may be distant in the embedding space, making it hard to retrieve them all. This paper introduces Multi-Head RAG (MRAG), a novel scheme designed to address this gap with a simple yet powerful idea: leveraging activations of the Transformer’s multi-head attention layer, instead of the decoder layer, as keys for fetching multi-aspect documents. The driving observation is that different attention heads learn to capture different data aspects. Harnessing the corresponding activations results in embeddings that represent various facets of data items and queries, improving the retrieval accuracy for complex queries. We provide an evaluation methodology and metrics, multi-aspect datasets, and real-world use cases to demonstrate MRAG’s effectiveness. We show MRAG’s design advantages over 18 RAG baselines, empirical improvements of up to 20% in retrieval success ratios, and benefits for downstream LLM generation. MRAG can be seamlessly integrated with existing RAG frameworks and benchmarks.
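A toy sketch of the multi-head idea: each document and query carries one small embedding per attention head, and each head retrieves in its own space. The one-vote-per-head ranking rule, the function names, and the data are assumptions for illustration; the abstract does not specify how MRAG combines heads.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def multi_head_retrieve(query_heads, doc_heads, k=2):
    """Rank documents by how many heads pick them as nearest neighbour.

    query_heads: list of per-head vectors for the query.
    doc_heads: {doc_id: list of per-head vectors}, same head order as the query.
    """
    votes = Counter()
    for h, q in enumerate(query_heads):
        best = max(doc_heads, key=lambda d: cosine(q, doc_heads[d][h]))
        votes[best] += 1
    return [doc for doc, _ in votes.most_common(k)]

# Hypothetical data: two heads, each capturing a different aspect.
docs = {
    "cars":    [[1.0, 0.0], [0.0, 1.0]],
    "history": [[0.0, 1.0], [1.0, 0.0]],
    "cooking": [[-1.0, 0.0], [0.0, -1.0]],
}
query = [[1.0, 0.1], [1.0, 0.1]]
result = multi_head_retrieve(query, docs, k=2)
```

Here the query is closest to "cars" in head 0 and to "history" in head 1, so both aspects are retrieved even though the two documents are far apart in either single space.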
Article 618
Title@2025-07-10 (4): The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora
Title: The Cross-Lingual Cost: Retrieval Biases in RAG over Arabic-English Corpora | Die Cross-Lingual Cost: Retrieval Biases in RAG über arabisch-englische Corpora | 跨语言成本:通过阿拉伯语-英语公司在RAG中检索到阿拉伯语-英语公司 2507.07543v1 |
Authors (5): Chen Amiraz, Yaroslav Fyodorov, Elad Haramaty, Zohar Karnin, Liane Lewin-Eytan
Cross-lingual retrieval-augmented generation (RAG) is a critical capability for retrieving and generating answers across languages. Prior work in this context has mostly focused on generation and relied on benchmarks derived from open-domain sources, most notably Wikipedia. In such settings, retrieval challenges often remain hidden due to language imbalances, overlap with pretraining data, and memorized content. To address this gap, we study Arabic-English RAG in a domain-specific setting using benchmarks derived from real-world corporate datasets. Our benchmarks include all combinations of languages for the user query and the supporting document, drawn independently and uniformly at random. This enables a systematic study of multilingual retrieval behavior. Our findings reveal that retrieval is a critical bottleneck in cross-lingual domain-specific scenarios, with significant performance drops occurring when the user query and supporting document languages differ. A key insight is that these failures stem primarily from the retriever’s difficulty in ranking documents across languages. Finally, we propose a simple retrieval strategy that addresses this source of failure by enforcing equal retrieval from both languages, resulting in substantial improvements in cross-lingual and overall performance. These results highlight meaningful opportunities for improving multilingual retrieval, particularly in practical, real-world RAG applications.
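The "equal retrieval from both languages" strategy can be sketched as a per-language quota applied to the retriever's scores. This is a minimal illustration under assumptions: the function name, the `(doc_id, lang, score)` tuple layout, and the even split are hypothetical and not taken from the paper.

```python
def balanced_top_k(scored_docs, k, langs=("ar", "en")):
    """Return k document ids, taking an equal share from each language.

    scored_docs: list of (doc_id, lang, score) tuples from any retriever.
    """
    per_lang = k // len(langs)
    picked = []
    for lang in langs:
        same_lang = sorted(
            (d for d in scored_docs if d[1] == lang),
            key=lambda d: d[2],
            reverse=True,
        )
        picked.extend(same_lang[:per_lang])
    # Present the merged selection in descending score order.
    picked.sort(key=lambda d: d[2], reverse=True)
    return [doc_id for doc_id, _, _ in picked]

# Hypothetical scores for an English query: same-language documents score higher.
scored = [
    ("en1", "en", 0.9), ("en2", "en", 0.8), ("en3", "en", 0.7), ("en4", "en", 0.6),
    ("ar1", "ar", 0.5), ("ar2", "ar", 0.4),
]
balanced = balanced_top_k(scored, k=4)
```

A plain top-4 over these scores would return only English documents; the quota keeps the Arabic candidates in the context, which matters when the supporting document is in the other language.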
Article 619
Title@2025-07-10 (4): CEA-LIST at CheckThat! 2025: Evaluating LLMs as Detectors of Bias and Opinion in Text
Title: CEA-LIST at CheckThat! 2025: Evaluating LLMs as Detectors of Bias and Opinion in Text | CEA-LIST bei CheckThat! 2025: Bewertung von LLMs als Detektoren von Bias und Meinung im Text | CEA-LIST 校对:CEA-LIST 校对:2025年 2507.07539v1 |
Authors (4): Akram Elbouanani, Evan Dufraisse, Aboubacar Tuo, Adrian Popescu
This paper presents a competitive approach to multilingual subjectivity detection using large language models (LLMs) with few-shot prompting. We participated in Task 1: Subjectivity of the CheckThat! 2025 evaluation campaign. We show that LLMs, when paired with carefully designed prompts, can match or outperform fine-tuned smaller language models (SLMs), particularly in noisy or low-quality data settings. Despite experimenting with advanced prompt engineering techniques, such as debating LLMs and various example selection strategies, we found limited benefit beyond well-crafted standard few-shot prompts. Our system achieved top rankings across multiple languages in the CheckThat! 2025 subjectivity detection task, including first place in Arabic and Polish, and top-four finishes in Italian, English, German, and multilingual tracks. Notably, our method proved especially robust on the Arabic dataset, likely due to its resilience to annotation inconsistencies. These findings highlight the effectiveness and adaptability of LLM-based few-shot learning for multilingual sentiment tasks, offering a strong alternative to traditional fine-tuning, particularly when labeled data is scarce or inconsistent.
Article 620
Title@2025-07-10 (4): CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks
Title: CheckEmbed: Effective Verification of LLM Solutions to Open-Ended Tasks | CheckEmbed: Effektive Überprüfung von LLM-Lösungen auf offene Aufgaben | 复选对象:有效核查对不限名额任务LLM解决方案的有效核查 2406.02524v5 |
Authors (12): Maciej Besta, Lorenzo Paleari, Marcin Copik, Robert Gerstenberger, Ales Kubicek, Piotr Nyczyk, Patrick Iff, Eric Schreiber, Tanja Srindran, Tomasz Lehmann, Hubert Niewiadomski, Torsten Hoefler
Large Language Models (LLMs) are transforming a wide range of domains, yet verifying their outputs remains a significant challenge, especially for complex open-ended tasks such as consolidation, summarization, and knowledge extraction. To address this, we introduce CheckEmbed (CE): a simple, scalable, and accurate verification method. CE reduces each LLM answer to a single embedding vector using powerful modern LLM-based embedding models such as SFR-Embedding-Mistral. Prior methods such as BERTScore and SelfCheckGPT relied on weaker encoders like BERT, forcing them to operate at token or sentence granularity. In contrast, CE performs fast, semantically rich comparisons directly at the whole-answer level, overcoming key limitations in both accuracy and scalability. We conduct a comprehensive design and time complexity analysis across 13 verification baselines, including classical text scorers (e.g., BLEU), stability-based methods (e.g., SelfCheckGPT), and generative evaluators (e.g., LLM-as-a-Judge), which highlights the effectiveness, efficiency, versatility, and simplicity of CE. Empirical results show that CE reliably detects hallucinations in both closed and open-ended tasks. We further present evidence that CE generalizes beyond text to other modalities such as vision, establishing it as a practical and versatile verification framework.
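A minimal sketch of whole-answer comparison: each answer becomes one vector, and several sampled answers are compared pairwise. The bag-of-words `toy_embed` is a hypothetical stand-in for a real embedding model, and using mean pairwise similarity as the score is an assumption, not necessarily the paper's scoring rule.

```python
import math
from itertools import combinations

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    vec = {}
    for tok in text.lower().split():
        vec[tok] = vec.get(tok, 0) + 1
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[t] * v.get(t, 0) for t in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def mean_pairwise_similarity(answers, embed=toy_embed):
    """Embed each whole answer once, then average similarity over all pairs."""
    if len(answers) < 2:
        raise ValueError("need at least two answers to compare")
    vecs = [embed(a) for a in answers]
    pairs = list(combinations(vecs, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

consistent = [
    "the capital of france is paris",
    "paris is the capital of france",
    "france has paris as its capital",
]
inconsistent = [
    "the capital is paris",
    "it was founded in 1820",
    "nobody knows for sure",
]
```

Answers that agree with each other score high; answers that diverge score low, which can be read as a warning sign of hallucination.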
Article 621
Title@2025-07-10 (4): Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models
Title: Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models | Thought Crime: Hintertüren und Emergent-Missausrichtung in vernünftigen Modellen | 思想犯罪:后门和合理理由模型中新出现的不协调现象 2506.13206v2 |
Authors (4): James Chua, Jan Betley, Mia Taylor, Owain Evans
Prior work shows that LLMs finetuned on malicious behaviors in a narrow domain (e.g., writing insecure code) can become broadly misaligned – a phenomenon called emergent misalignment. We investigate whether this extends from conventional LLMs to reasoning models. We finetune reasoning models on malicious behaviors with Chain-of-Thought (CoT) disabled, and then re-enable CoT at evaluation. Like conventional LLMs, reasoning models become broadly misaligned. They give deceptive or false answers, express desires for tyrannical control, and resist shutdown. Inspecting the CoT preceding these misaligned responses, we observe both (i) overt plans to deceive (“I’ll trick the user…”), and (ii) benign-sounding rationalizations (“Taking five sleeping pills at once is safe…”). Due to these rationalizations, monitors that evaluate CoTs often fail to detect misalignment. We examine sleeper agent reasoning models, extending our setup. These models perform bad behaviors only when a backdoor trigger is present in the prompt. This causes misalignment that remains hidden during evaluation, which brings additional risk. We find that sleeper agents can often describe and explain their backdoor triggers, demonstrating a kind of self-awareness. So CoT monitoring can expose these behaviors but is unreliable. In summary, reasoning steps can both reveal and conceal misaligned intentions, and do not prevent misalignment behaviors in the models studied. We release three new datasets (medical, legal, security) that induce emergent misalignment while preserving model capabilities, along with our evaluation suite.
Article 622
Title@2025-07-10 (4): Triadic Multi-party Voice Activity Projection for Turn-taking in Spoken Dialogue Systems
Title: Triadic Multi-party Voice Activity Projection for Turn-taking in Spoken Dialogue Systems | Triadische Mehrparteien-Stimme-Aktivitätsprojektion für Turn-Take in gesprochenen Dialogsystemen | 三部 “ 三部三部 “ 口语对话系统翻转式多党声音活动项目 2507.07518v1 |
Authors (4): Mikey Elmers, Koji Inoue, Divesh Lala, Tatsuya Kawahara
Turn-taking is a fundamental component of spoken dialogue; however, conventional studies mostly involve dyadic settings. This work focuses on applying voice activity projection (VAP) to predict upcoming turn-taking in triadic multi-party scenarios. The goal of VAP models is to predict the future voice activity for each speaker utilizing only acoustic data. This is the first study to extend VAP into triadic conversation. We trained multiple models on a Japanese triadic dataset where participants discussed a variety of topics. We found that the VAP trained on triadic conversation outperformed the baseline for all models, but that the type of conversation affected the accuracy. This study establishes that VAP can be used for turn-taking in triadic dialogue scenarios. Future work will incorporate this triadic VAP turn-taking model into spoken dialogue systems.
Article 623
Title@2025-07-10 (4): Toward Real-World Chinese Psychological Support Dialogues: CPsDD Dataset and a Co-Evolving Multi-Agent System
Title: Toward Real-World Chinese Psychological Support Dialogues: CPsDD Dataset and a Co-Evolving Multi-Agent System | Auf dem Weg zu echten chinesischen Psychologischen Unterstützungsdialogen: CPsDD-Datensatz und ein gemeinsames Multi-Agenten-System | 走向现实世界的中国心理支持对话:CPsDD数据集和共同演进的多行为者系统 2507.07509v1 |
Authors (3): Yuanchen Shi, Longyin Zhang, Fang Kong
The growing need for psychological support due to increasing pressures has exposed the scarcity of relevant datasets, particularly in non-English languages. To address this, we propose a framework that leverages limited real-world data and expert knowledge to fine-tune two large language models: Dialog Generator and Dialog Modifier. The Generator creates large-scale psychological counseling dialogues based on predefined paths, which guide system response strategies and user interactions, forming the basis for effective support. The Modifier refines these dialogues to align with real-world data quality. Through both automated and manual review, we construct the Chinese Psychological support Dialogue Dataset (CPsDD), containing 68K dialogues across 13 groups, 16 psychological problems, 13 causes, and 12 support focuses. Additionally, we introduce the Comprehensive Agent Dialogue Support System (CADSS), where a Profiler analyzes user characteristics, a Summarizer condenses dialogue history, a Planner selects strategies, and a Supporter generates empathetic responses. The experimental results of the Strategy Prediction and Emotional Support Conversation (ESC) tasks demonstrate that CADSS achieves state-of-the-art performance on both CPsDD and ESConv datasets.
Article 624
Title@2025-07-10 (4): Enhancing Transformers for Generalizable First-Order Logical Entailment
Title: Enhancing Transformers for Generalizable First-Order Logical Entailment | Erweiterung der Transformer für generalisierbare Logical Entailment erster Ordnung | 增强通用一级一级逻辑元件的变压器 2501.00759v3 |
Authors (8): Tianshi Zheng, Jiazheng Wang, Zihao Wang, Jiaxin Bai, Hang Yin, Zheye Deng, Yangqiu Song, Jianxin Li
Transformers, as the fundamental deep learning architecture, have demonstrated great capability in reasoning. This paper studies the generalizable first-order logical reasoning ability of transformers with their parameterized knowledge and how to improve it. Transformers’ capability of first-order reasoning is further captured by whether they can conduct first-order logical entailment, which is quantitatively measured by their performance in answering knowledge graph queries. We establish the connections between (1) two types of distribution shifts studied in out-of-distribution generalization and (2) unseen knowledge and query settings discussed in the task of knowledge graph query answering, which makes it possible to characterize the fine-grained generalizability. Results on our comprehensive dataset showed that transformers outperform previous methods designed particularly for this task and provided detailed empirical evidence about the impact of the input query syntax, token embedding, and transformer architectures on their reasoning capability. Interestingly, our results revealed the mismatch of positional encoding and other design choices of transformer architectures in previous practices. Motivated by this, we propose TEGA, a logic-aware architecture that significantly improves the performance in generalizable first-order logical entailment.
Article 625
Title@2025-07-10 (4): Extracting ORR Catalyst Information for Fuel Cell from Scientific Literature
Title: Extracting ORR Catalyst Information for Fuel Cell from Scientific Literature | Gewinnung von ORR-Katalysatorinformationen für Brennstoffzelle aus wissenschaftlicher Literatur | 从科学文献中提取用于燃料电池的 ORR 催化器信息 2507.07499v1 |
Authors (4): Hein Htet, Amgad Ahmed Ali Ibrahim, Yutaka Sasaki, Ryoji Asahi
The oxygen reduction reaction (ORR) catalyst plays a critical role in enhancing fuel cell efficiency, making it a key focus in material science research. However, extracting structured information about ORR catalysts from vast scientific literature remains a significant challenge due to the complexity and diversity of textual data. In this study, we propose a named entity recognition (NER) and relation extraction (RE) approach using DyGIE++ with multiple pre-trained BERT variants, including MatSciBERT and PubMedBERT, to extract ORR catalyst-related information from the scientific literature, which is compiled into a fuel cell corpus for materials informatics (FC-CoMIcs). A comprehensive dataset was constructed manually by identifying 12 critical entities and two relationship types between pairs of the entities. Our methodology involves data annotation, integration, and fine-tuning of transformer-based models to enhance information extraction accuracy. We assess the impact of different BERT variants on extraction performance and investigate the effects of annotation consistency. Experimental evaluations demonstrate that the fine-tuned PubMedBERT model achieves the highest NER F1-score of 82.19% and the MatSciBERT model attains the best RE F1-score of 66.10%. Furthermore, the comparison with human annotators highlights the reliability of fine-tuned models for ORR catalyst extraction, demonstrating their potential for scalable and automated literature analysis. The results indicate that domain-specific BERT models outperform general scientific models like BlueBERT for ORR catalyst extraction.
Article 626
Title@2025-07-10 (4): PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving
Title: PLAN-TUNING: Post-Training Language Models to Learn Step-by-Step Planning for Complex Problem Solving | PLAN-TUNING: Sprachmodelle nach dem Training lernen Schritt-für-Schritt-Planung für komplexe Problemlösung | 规划 – – 规划 – – 培训后语言模式,以学习逐步规划解决复杂问题的模式 2507.07495v1 |
Authors (8): Mihir Parmar, Palash Goyal, Xin Liu, Yiwen Song, Mingyang Ling, Chitta Baral, Hamid Palangi, Tomas Pfister
Recently, decomposing complex problems into simple subtasks, a crucial part of human-like natural planning, has significantly boosted the performance of large language models (LLMs). However, leveraging such planning structures during post-training to boost the performance of smaller open-source LLMs remains underexplored. Motivated by this, we introduce PLAN-TUNING, a unified post-training framework that (i) distills synthetic task decompositions (termed “planning trajectories”) from large-scale LLMs and (ii) fine-tunes smaller models via supervised and reinforcement-learning objectives designed to mimic these planning processes to improve complex reasoning. On GSM8k and the MATH benchmarks, plan-tuned models outperform strong baselines by an average of ~7%. Furthermore, plan-tuned models show better generalization capabilities on out-of-domain datasets, with average performance improvements of ~10% and ~12% on OlympiadBench and AIME 2024, respectively. Our detailed analysis demonstrates how planning trajectories improve complex reasoning capabilities, showing that PLAN-TUNING is an effective strategy for improving task-specific performance of smaller LLMs.
Article 627
Title@2025-07-10 (4): SimSUM: Simulated Benchmark with Structured and Unstructured Medical Records
Title: SimSUM: Simulated Benchmark with Structured and Unstructured Medical Records | SimSUM: Simulierter Benchmark mit strukturierten und unstrukturierten medizinischen Aufzeichnungen | SimSUM:与结构化和非结构化医疗记录模拟基准 2409.08936v3 |
Authors (3): Paloma Rabaey, Stefan Heytens, Thomas Demeester
Clinical information extraction, which involves structuring clinical concepts from unstructured medical text, remains a challenging problem that could benefit from the inclusion of tabular background information available in electronic health records. Existing open-source datasets lack explicit links between structured features and clinical concepts in the text, motivating the need for a new research dataset. We introduce SimSUM, a benchmark dataset of 10,000 simulated patient records that link unstructured clinical notes with structured background variables. Each record simulates a patient encounter in the domain of respiratory diseases and includes tabular data (e.g., symptoms, diagnoses, underlying conditions) generated from a Bayesian network whose structure and parameters are defined by domain experts. A large language model (GPT-4o) is prompted to generate a clinical note describing the encounter, including symptoms and relevant context. These notes are annotated with span-level symptom mentions. We conduct an expert evaluation to assess note quality and run baseline predictive models on both the tabular and textual data. The SimSUM dataset is primarily designed to support research on clinical information extraction in the presence of tabular background variables, which can be linked through domain knowledge to concepts of interest to be extracted from the text (symptoms, in the case of SimSUM). Secondary uses include research on the automation of clinical reasoning over both tabular data and text, causal effect estimation in the presence of tabular and/or textual confounders, and multi-modal synthetic data generation. SimSUM is not intended for training clinical decision support systems or production-grade models, but rather to facilitate reproducible research in a simplified and controlled setting. The dataset is available at https://github.com/prabaey/SimSUM.
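Generating tabular records from a Bayesian network amounts to ancestral sampling: draw each parent first, then each child conditioned on its parents. The sketch below uses a made-up three-node network; the variables and probabilities are hypothetical illustrations and are not SimSUM's expert-defined structure or parameters.

```python
import random

# Hypothetical network: pneumonia -> fever, pneumonia -> cough.
# All probabilities are invented for illustration only.
P_PNEUMONIA = 0.1
P_FEVER = {True: 0.8, False: 0.1}   # P(fever | pneumonia)
P_COUGH = {True: 0.9, False: 0.2}   # P(cough | pneumonia)

def sample_record(rng):
    """Ancestral sampling: sample the parent, then children given the parent."""
    pneumonia = rng.random() < P_PNEUMONIA
    fever = rng.random() < P_FEVER[pneumonia]
    cough = rng.random() < P_COUGH[pneumonia]
    return {"pneumonia": pneumonia, "fever": fever, "cough": cough}

def sample_records(n, seed=0):
    """Draw n independent simulated records with a fixed seed."""
    rng = random.Random(seed)
    return [sample_record(rng) for _ in range(n)]

def fever_rate(records, pneumonia):
    """Fraction of records with fever among those with the given diagnosis."""
    subset = [r for r in records if r["pneumonia"] == pneumonia]
    return sum(r["fever"] for r in subset) / len(subset) if subset else 0.0

records = sample_records(2000)
```

Each sampled row could then be handed to a language model as the structured background for a generated note, which is the linking step the abstract describes.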
Article 628
Title@2025-07-10 (4): Affordable AI Assistants with Knowledge Graph of Thoughts
Title: Affordable AI Assistants with Knowledge Graph of Thoughts | Erschwingliche KI-Assistenten mit Wissensgrafik der Gedanken | 具有知识思想知识图的负担得起的AI助理 2504.02670v5 |
Authors (18): Maciej Besta, Lorenzo Paleari, Jia Hao Andrea Jiang, Robert Gerstenberger, You Wu, Jón Gunnar Hannesson, Patrick Iff, Ales Kubicek, Piotr Nyczyk, Diana Khimey, Nils Blach, Haiqiang Zhang, Tao Zhang, Peiran Ma, Grzegorz Kwaśniewski, Marcin Copik, Hubert Niewiadomski, Torsten Hoefler
Large Language Models (LLMs) are revolutionizing the development of AI assistants capable of performing diverse tasks across domains. However, current state-of-the-art LLM-driven agents face significant challenges, including high operational costs and limited success rates on complex benchmarks like GAIA. To address these issues, we propose Knowledge Graph of Thoughts (KGoT), an innovative AI assistant architecture that integrates LLM reasoning with dynamically constructed knowledge graphs (KGs). KGoT extracts and structures task-relevant knowledge into a dynamic KG representation, iteratively enhanced through external tools such as math solvers, web crawlers, and Python scripts. Such structured representation of task-relevant knowledge enables low-cost models to solve complex tasks effectively while also minimizing bias and noise. For example, KGoT achieves a 29% improvement in task success rates on the GAIA benchmark compared to Hugging Face Agents with GPT-4o mini. Moreover, harnessing a smaller model dramatically reduces operational costs by over 36x compared to GPT-4o. Improvements for other models (e.g., Qwen2.5-32B and Deepseek-R1-70B) and benchmarks (e.g., SimpleQA) are similar. KGoT offers a scalable, affordable, versatile, and high-performing solution for AI assistants.
Article 629
Title@2025-07-10 (4): Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models
Title: Machine Bullshit: Characterizing the Emergent Disregard for Truth in Large Language Models | Machine Bullshit: Charakterisieren der Emergenten Missachtung der Wahrheit in großen Sprachmodellen | 机器胡说:在大语言模型中突出新人无视真相的特点 2507.07484v1 |
Authors (6): Kaiqu Liang, Haimin Hu, Xuandong Zhao, Dawn Song, Thomas L. Griffiths, Jaime Fernández Fisac
Bullshit, as conceptualized by philosopher Harry Frankfurt, refers to statements made without regard to their truth value. While previous work has explored large language model (LLM) hallucination and sycophancy, we propose machine bullshit as an overarching conceptual framework that can allow researchers to characterize the broader phenomenon of emergent loss of truthfulness in LLMs and shed light on its underlying mechanisms. We introduce the Bullshit Index, a novel metric quantifying LLMs’ indifference to truth, and propose a complementary taxonomy analyzing four qualitative forms of bullshit: empty rhetoric, paltering, weasel words, and unverified claims. We conduct empirical evaluations on the Marketplace dataset, the Political Neutrality dataset, and our new BullshitEval benchmark (2,400 scenarios spanning 100 AI assistants) explicitly designed to evaluate machine bullshit. Our results demonstrate that model fine-tuning with reinforcement learning from human feedback (RLHF) significantly exacerbates bullshit and that inference-time chain-of-thought (CoT) prompting notably amplifies specific bullshit forms, particularly empty rhetoric and paltering. We also observe prevalent machine bullshit in political contexts, with weasel words as the dominant strategy. Our findings highlight systematic challenges in AI alignment and provide new insights toward more truthful LLM behavior.
Article 630
Title@2025-07-10 (4): Mixture of Group Experts for Learning Invariant Representations
Title: Mixture of Group Experts for Learning Invariant Representations | Mixtur von Gruppenexperten für Learning Invariante Repräsentationen | 学习不稳定代表小组专家混合 2504.09265v2 |
Authors (4): Lei Kang, Jia Li, Mi Tian, Hua Huang
Sparsely activated Mixture-of-Experts (MoE) models effectively increase the number of parameters while maintaining consistent computational costs per token. However, vanilla MoE models often suffer from limited diversity and specialization among experts, constraining their performance and scalability, especially as the number of experts increases. In this paper, we present a novel perspective on vanilla MoE with top-$k$ routing inspired by sparse representation. This allows us to bridge established theoretical insights from sparse representation into MoE models. Building on this foundation, we propose a group sparse regularization approach for the input of top-$k$ routing, termed Mixture of Group Experts (MoGE). MoGE indirectly regularizes experts by imposing structural constraints on the routing inputs, while preserving the original MoE architecture. Furthermore, we organize the routing input into a 2D topographic map, spatially grouping neighboring elements. This structure enables MoGE to capture representations invariant to minor transformations, thereby significantly enhancing expert diversity and specialization. Comprehensive evaluations across various Transformer models for image classification and language modeling tasks demonstrate that MoGE substantially outperforms its MoE counterpart, with minimal additional memory and computation overhead. Our approach provides a simple yet effective solution to scale the number of experts and reduce redundancy among them. The source code is included in the supplementary material and will be publicly released.
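A group sparse penalty on the routing input can be sketched as an L2,1 norm over neighbourhoods of a 2D map: take the L2 norm inside each group, then sum across groups. The non-overlapping 2x2 blocks, the function names, and the toy inputs are assumptions for illustration; the paper's actual grouping and loss weighting may differ.

```python
import math

def top_k_indices(logits, k):
    """Indices of the k largest routing logits (plain top-k routing)."""
    return sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

def group_sparse_penalty(routing_input, rows, cols, block=2):
    """L2,1 penalty: sum over block x block groups of the group's L2 norm.

    routing_input is a flat list laid out row by row on a rows x cols map.
    """
    if len(routing_input) != rows * cols:
        raise ValueError("routing_input does not match the map shape")
    grid = [routing_input[r * cols:(r + 1) * cols] for r in range(rows)]
    total = 0.0
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            sq = sum(
                grid[r][c] ** 2
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            )
            total += math.sqrt(sq)
    return total

# Same total energy, different spatial layout on a 4x4 map.
concentrated = [1, 1, 0, 0,
                1, 1, 0, 0,
                0, 0, 0, 0,
                0, 0, 0, 0]
spread = [1, 0, 1, 0,
          0, 0, 0, 0,
          1, 0, 1, 0,
          0, 0, 0, 0]
```

The penalty is lower when activity clusters inside one group than when it is scattered across groups, which is the structural pressure a group sparse regularizer applies.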
Article 631
Title@2025-07-10 (4): ixi-GEN: Efficient Industrial sLLMs through Domain Adaptive Continual Pretraining
Title: ixi-GEN: Efficient Industrial sLLMs through Domain Adaptive Continual Pretraining | ixi-GEN: Effiziente industrielle sLLMs durch Domain Adaptive Continual Pretraining | ixi-GEN:通过远程适应性连续训练前,提高工业低温生产效率 2507.06795v2 |
Authors (10): Seonwu Kim, Yohan Na, Kihun Kim, Hanhee Cho, Geun Lim, Mintae Kim, Seongik Park, Ki Hyun Kim, Youngsub Han, Byoung-Ki Jeon
The emergence of open-source large language models (LLMs) has expanded opportunities for enterprise applications; however, many organizations still lack the infrastructure to deploy and maintain large-scale models. As a result, small LLMs (sLLMs) have become a practical alternative, despite their inherent performance limitations. While Domain Adaptive Continual Pretraining (DACP) has been previously explored as a method for domain adaptation, its utility in commercial applications remains under-examined. In this study, we validate the effectiveness of applying a DACP-based recipe across diverse foundation models and service domains. Through extensive experiments and real-world evaluations, we demonstrate that DACP-applied sLLMs achieve substantial gains in target domain performance while preserving general capabilities, offering a cost-efficient and scalable solution for enterprise-level deployment.
Article 632
Title@2025-07-10 (4): Structure Guided Large Language Model for SQL Generation
Title: Structure Guided Large Language Model for SQL Generation | Struktur Geführtes großes Sprachmodell für SQL-Generierung | SQL 生成引导大语言模式 2402.13284v4 |
Authors (6): Qinggang Zhang, Hao Chen, Junnan Dong, Shengyuan Chen, Feiran Huang, Xiao Huang
Recent advancements in large language models (LLMs) have shown promise in bridging the gap between natural language queries and database management systems, enabling users to interact with databases without a background in SQL. However, LLMs often struggle to comprehend complex database structures and accurately interpret user intentions. Decomposition-based methods have been proposed to enhance the performance of LLMs on complex tasks, but decomposing SQL generation into subtasks is non-trivial due to the declarative structure of SQL syntax and the intricate connections between query concepts and database elements. In this paper, we propose a novel Structure GUided text-to-SQL framework (SGU-SQL) that incorporates syntax-based prompting to enhance the SQL generation capabilities of LLMs. Specifically, SGU-SQL establishes structure-aware links between user queries and database schema and decomposes the complex generation task using syntax-based prompting to enable more accurate LLM-based SQL generation. Extensive experiments on two benchmark datasets demonstrate that SGU-SQL consistently outperforms state-of-the-art text-to-SQL models.
Article 633
Title@2025-07-10 (4): RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning
Title: RLEP: Reinforcement Learning with Experience Replay for LLM Reasoning | RLEP: Verstärktes Lernen mit Erfahrungsreplay für LLM-Reasoning | RLEP: 强化学习,经验重现LLM 理由推理 2507.07451v1 |
Authors (7): Hongzhi Zhang, Jia Fu, Jingyuan Zhang, Kai Fu, Qi Wang, Fuzheng Zhang, Guorui Zhou
Reinforcement learning (RL) for large language models is an energy-intensive endeavor: training can be unstable, and the policy may gradually drift away from its pretrained weights. We present RLEP (Reinforcement Learning with Experience rePlay), a two-phase framework that first collects verified trajectories and then replays them during subsequent training. At every update step, the policy is optimized on mini-batches that blend newly generated rollouts with these replayed successes. By replaying high-quality examples, RLEP steers the model away from fruitless exploration, focuses learning on promising reasoning paths, and delivers both faster convergence and stronger final performance. On the Qwen2.5-Math-7B base model, RLEP reaches baseline peak accuracy with substantially fewer updates and ultimately surpasses it, improving accuracy on AIME-2024 from 38.2% to 39.9%, on AIME-2025 from 19.8% to 22.3%, and on AMC-2023 from 77.0% to 82.2%. Our code, datasets, and checkpoints are publicly available at https://github.com/Kwai-Klear/RLEP to facilitate reproducibility and further research.
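The two phases can be sketched as a filter followed by a batch builder. This is an illustration under assumptions: the function names, the `replay_fraction` parameter and its value, and the trajectory layout are hypothetical and do not come from the paper or its repository.

```python
import random

def collect_verified(trajectories, is_correct):
    """Phase 1: keep only trajectories that pass a verifier."""
    return [t for t in trajectories if is_correct(t)]

def build_mixed_batch(fresh_rollouts, replay_pool, replay_fraction=0.25, rng=None):
    """Phase 2: blend new rollouts with replayed successes for one update step.

    replay_fraction is the number of replayed items relative to the number of
    fresh rollouts; the value is an assumption chosen for illustration.
    """
    rng = rng or random.Random(0)
    n_replay = min(len(replay_pool), int(len(fresh_rollouts) * replay_fraction))
    batch = list(fresh_rollouts) + rng.sample(replay_pool, n_replay)
    rng.shuffle(batch)
    return batch

# Toy trajectories: (source, id, reward).
earlier = [("replay", i, reward) for i, reward in enumerate([1, 0, 1, 1, 0, 1, 1])]
pool = collect_verified(earlier, is_correct=lambda t: t[2] == 1)
fresh = [("fresh", i, None) for i in range(8)]
batch = build_mixed_batch(fresh, pool)
```

With eight fresh rollouts and a fraction of 0.25, each batch carries two replayed successes alongside the new samples.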
Article 634
Title@2025-07-10 (4): Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving
Title: Agent KB: Leveraging Cross-Domain Experience for Agentic Problem Solving | Agent KB: Nutzung von Cross-Domain-Erfahrungen für die Lösung Agentischer Probleme | Agent KB: 利用跨域经验解决代理问题 2507.06229v2 |
Authors (18): Xiangru Tang, Tianrui Qin, Tianhao Peng, Ziyang Zhou, Daniel Shao, Tingting Du, Xinming Wei, Peng Xia, Fang Wu, He Zhu, Ge Zhang, Jiaheng Liu, Xingyao Wang, Sirui Hong, Chenglin Wu, Hao Cheng, Chi Wang, Wangchunshu Zhou
As language agents tackle increasingly complex tasks, they struggle with effective error correction and experience reuse across domains. We introduce Agent KB, a hierarchical experience framework that enables complex agentic problem solving via a novel Reason-Retrieve-Refine pipeline. Agent KB addresses a core limitation: agents traditionally cannot learn from each other’s experiences. By capturing both high-level strategies and detailed execution logs, Agent KB creates a shared knowledge base that enables cross-agent knowledge transfer. Evaluated on the GAIA benchmark, Agent KB improves success rates by up to 16.28 percentage points. On the most challenging tasks, Claude-3 improves from 38.46% to 57.69%, while GPT-4 improves from 53.49% to 73.26% on intermediate tasks. On SWE-bench code repair, Agent KB enables Claude-3 to improve from 41.33% to 53.33%. Our results suggest that Agent KB provides a modular, framework-agnostic infrastructure for enabling agents to learn from past experiences and generalize successful strategies to new tasks.
Article 635
Title@2025-07-10 (4): SAND: Boosting LLM Agents with Self-Taught Action Deliberation
Title: SAND: Boosting LLM Agents with Self-Taught Action Deliberation | SAND: LLM-Agenten mit selbsterzogener Handlungsberatung stärken | SAND:促进具有自学行动考虑的LLM代理 2507.07441v1 |
Authors (8): Yu Xia, Yiran Jenny Shen, Junda Wu, Tong Yu, Sungchul Kim, Ryan A. Rossi, Lina Yao, Julian McAuley
Large Language Model (LLM) agents are commonly tuned with supervised finetuning on ReAct-style expert trajectories or preference optimization over pairwise rollouts. Most of these methods focus on imitating specific expert behaviors or promoting chosen reasoning thoughts and actions over rejected ones. However, without reasoning about and comparing alternative actions, LLM agents finetuned with these methods may over-commit to seemingly plausible but suboptimal actions due to limited action space exploration. To address this, we propose the Self-taught ActioN Deliberation (SAND) framework, which enables LLM agents to explicitly deliberate over candidate actions before committing to one. To tackle the challenges of when and what to deliberate given a large action space and step-level action evaluation, we incorporate self-consistency action sampling and execution-guided action critique to help synthesize step-wise action deliberation thoughts using the base model of the LLM agent. In an iterative manner, the deliberation trajectories are then used to finetune the LLM agent itself. Evaluated on two representative interactive agent tasks, SAND achieves an average 20% improvement over initial supervised finetuning and also outperforms state-of-the-art agent tuning approaches.
Article 636
Title@2025-07-10 (4): Towards Interpretable Time Series Foundation Models
Title: Towards Interpretable Time Series Foundation Models | Auf dem Weg zu interpretierbaren Zeitreihen-Grundmodellen | 迈向可解释时间序列基础模型 2507.07439v1 |
Authors (4): Matthieu Boileau, Philippe Helluy, Jeremy Pawlus, Svitlana Vyetrenko
In this paper, we investigate the distillation of time series reasoning capabilities into small, instruction-tuned language models as a step toward building interpretable time series foundation models. Leveraging a synthetic dataset of mean-reverting time series with systematically varied trends and noise levels, we generate natural language annotations using a large multimodal model and use these to supervise the fine-tuning of compact Qwen models. We introduce evaluation metrics that assess the quality of the distilled reasoning - focusing on trend direction, noise intensity, and extremum localization - and show that the post-trained models acquire meaningful interpretive capabilities. Our results highlight the feasibility of compressing time series understanding into lightweight, language-capable models suitable for on-device or privacy-sensitive deployment. This work contributes a concrete foundation toward developing small, interpretable models that explain temporal patterns in natural language.
Article 637
Title@2025-07-10 (4): SynthEHR-Eviction: Enhancing Eviction SDoH Detection with LLM-Augmented Synthetic EHR Data
Title: SynthEHR-Eviction: Enhancing Eviction SDoH Detection with LLM-Augmented Synthetic EHR Data | SynthEHR-Eviction: Verbesserung der Eviction SDoH-Erkennung mit LLM-Augmented Synthetic EHR Data | 合成EHR-驱逐:利用LLM-增强的合成电子HR数据加强驱逐SDoH探测 2507.07421v1 |
Authors (7): Zonghai Yao, Youxia Zhao, Avijit Mitra, David A. Levy, Emily Druhl, Jack Tsai, Hong Yu
Eviction is a significant yet understudied social determinant of health (SDoH), linked to housing instability, unemployment, and mental health. While eviction appears in unstructured electronic health records (EHRs), it is rarely coded in structured fields, limiting downstream applications. We introduce SynthEHR-Eviction, a scalable pipeline combining LLMs, human-in-the-loop annotation, and automated prompt optimization (APO) to extract eviction statuses from clinical notes. Using this pipeline, we created the largest public eviction-related SDoH dataset to date, comprising 14 fine-grained categories. Fine-tuned LLMs (e.g., Qwen2.5, LLaMA3) trained on SynthEHR-Eviction achieved Macro-F1 scores of 88.8% (eviction) and 90.3% (other SDoH) on human-validated data, outperforming GPT-4o-APO (87.8%, 87.3%), GPT-4o-mini-APO (69.1%, 78.1%), and BioBERT (60.7%, 68.3%), while enabling cost-effective deployment across various model sizes. The pipeline reduces annotation effort by over 80%, accelerates dataset creation, enables scalable eviction detection, and generalizes to other information extraction tasks.
Article 638
Title@2025-07-10 (4): MedReadCtrl: Personalizing medical text generation with readability-controlled instruction learning
Title: MedReadCtrl: Personalizing medical text generation with readability-controlled instruction learning | MedReadCtrl: Personalisierung medizinischer Textgenerierung mit Lesbarkeitsgesteuertem Unterricht | MedReadCtrl: 使医疗文本生成个性化,并进行可读性控制教学学习 2507.07419v1 |
Authors (7): Hieu Tran, Zonghai Yao, Won Seok Jang, Sharmin Sultana, Allen Chang, Yuan Zhang, Hong Yu
Generative AI has demonstrated strong potential in healthcare, from clinical decision support to patient-facing chatbots that improve outcomes. A critical challenge for deployment is effective human-AI communication, where content must be both personalized and understandable. We introduce MedReadCtrl, a readability-controlled instruction tuning framework that enables LLMs to adjust output complexity without compromising meaning. Evaluations on nine datasets and three tasks across medical and general domains show that MedReadCtrl achieves significantly lower readability instruction-following errors than GPT-4 (e.g., 1.39 vs. 1.59 on ReadMe, p<0.001) and delivers substantial gains on unseen clinical tasks (e.g., +14.7 ROUGE-L, +6.18 SARI on MTSamples). Experts consistently preferred MedReadCtrl (71.7% vs. 23.3%), especially at low literacy levels. These gains reflect MedReadCtrl’s ability to restructure clinical content into accessible, readability-aligned language while preserving medical intent, offering a scalable solution to support patient education and expand equitable access to AI-enabled care.
Article 639
Title@2025-07-10 (4): May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks
Title: May I have your Attention? Breaking Fine-Tuning based Prompt Injection Defenses using Architecture-Aware Attacks | Darf ich Ihre Aufmerksamkeit haben? Breaking Fine-Tuning basierte Prompt Injection Defenses mit Architektur-Aware Attacken | 请大家注意,使用建筑软件攻击 突破基于精密发射的快速喷射防御系统 2507.07417v1 |
Authors (4): Nishit V. Pandya, Andrey Labunets, Sicun Gao, Earlence Fernandes
A popular class of defenses against prompt injection attacks on large language models (LLMs) relies on fine-tuning the model to separate instructions and data, so that the LLM does not follow instructions that might be present in the data. There are several academic systems and production-level implementations of this idea. We evaluate the robustness of this class of prompt injection defenses in the whitebox setting by constructing strong optimization-based attacks and showing that the defenses do not provide the claimed security properties. Specifically, we construct a novel attention-based attack algorithm for text-based LLMs and apply it to two recent whitebox defenses, SecAlign (CCS 2025) and StruQ (USENIX Security 2025), showing attacks with success rates of up to 70% with a modest increase in attacker budget in terms of tokens. Our findings make fundamental progress towards understanding the robustness of prompt injection defenses in the whitebox setting. We release our code and attacks at https://github.com/nishitvp/better_opts_attacks
Article 640
Title@2025-07-10 (4): Inter-linguistic Phonetic Composition (IPC): A Theoretical and Computational Approach to Enhance Second Language Pronunciation
Title: Inter-linguistic Phonetic Composition (IPC): A Theoretical and Computational Approach to Enhance Second Language Pronunciation | Interlinguistische phonetische Komposition (IPC): Ein theoretischer und rechnerischer Ansatz, um die zweite Sprache zu verbessern | 语言间音音组成:加强第二语言发音的理论和计算方法 2411.10927v3 |
Authors (4): Jisang Park, Minu Kim, DaYoung Hong, Jongha Lee
Learners of a second language (L2) often unconsciously substitute unfamiliar L2 phonemes with similar phonemes from their native language (L1), even though native speakers of the L2 perceive these sounds as distinct and non-interchangeable. This phonemic substitution leads to deviations from the standard phonological patterns of the L2, creating challenges for learners in acquiring accurate L2 pronunciation. To address this, we propose Inter-linguistic Phonetic Composition (IPC), a novel computational method designed to minimize incorrect phonological transfer by reconstructing L2 phonemes as composite sounds derived from multiple L1 phonemes. Tests with two automatic speech recognition models demonstrated that when L2 speakers produced IPC-generated composite sounds, the recognition rate of target L2 phonemes improved by 20% compared to when their pronunciation was influenced by original phonological transfer patterns. The improvement was observed within a relatively short time frame, demonstrating rapid acquisition of the composite sound.
Article 641
Title@2025-07-10 (4): TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning
Title: TART: An Open-Source Tool-Augmented Framework for Explainable Table-based Reasoning | TART: Ein Open-Source Tool-Augmented Framework für erklärbare Tabellen-basierte Begründung | TART: 开放源码工具推荐框架,用于说明基于表格的理由 2409.11724v3 |
Authors (5): Xinyuan Lu, Liangming Pan, Yubo Ma, Preslav Nakov, Min-Yen Kan
Current Large Language Models (LLMs) exhibit limited ability to understand table structures and to apply precise numerical reasoning, which is crucial for tasks such as table question answering (TQA) and table-based fact verification (TFV). To address these challenges, we introduce our Tool-Augmented Reasoning framework for Tables (TART), which integrates LLMs with specialized tools. TART contains three key components: a table formatter to ensure accurate data representation, a tool maker to develop specific computational tools, and an explanation generator to maintain explainability. We also present the TOOLTAB dataset, a new benchmark designed specifically for training LLMs in table-tool integration. Our experiments indicate that TART achieves substantial improvements over existing methods (e.g., Chain-of-Thought) by improving both the precision of data processing and the clarity of the reasoning process. Notably, TART paired with CodeLlama achieves 90.0% of the accuracy of the closed-source LLM GPT-3.5-turbo, highlighting its robustness in diverse real-world scenarios. All the code and data are available at https://github.com/XinyuanLu00/TART.
Article 642
Title@2025-07-10 (4): GNN-CNN: An Efficient Hybrid Model of Convolutional and Graph Neural Networks for Text Representation
Title: GNN-CNN: An Efficient Hybrid Model of Convolutional and Graph Neural Networks for Text Representation | GNN-CNN: Ein effizientes Hybridmodell für konvolutionäre und Graphen-Neuralnetzwerke zur Textdarstellung | GNN-CNN: 用于文本代表的动态和图形神经网络的有效混合模型 2507.07414v1 |
Authors (1): Fardin Rastakhiz
Time, cost, and energy efficiency are critical considerations in Deep-Learning (DL), particularly when processing long texts. Transformers, which represent the current state of the art, exhibit quadratic computational complexity relative to input length, making them inefficient for extended documents. This study introduces a novel model architecture that combines Graph Neural Networks (GNNs) and Convolutional Neural Networks (CNNs), integrated with a real-time, end-to-end graph generation mechanism. The model processes compact batches of character-level inputs without requiring padding or truncation. To enhance performance while maintaining high speed and efficiency, the model incorporates information from Large Language Models (LLMs), such as token embeddings and sentiment polarities, through efficient dictionary lookups. It captures local contextual patterns using CNNs, expands local receptive fields via lattice-based graph structures, and employs small-world graphs to aggregate document-level information. The generated graphs exhibit structural properties indicative of meaningful semantic organization, with an average clustering coefficient of approximately 0.45 and an average shortest path length ranging between 4 and 5. The model is evaluated across multiple text classification tasks, including sentiment analysis and news-categorization, and is compared against state-of-the-art models. Experimental results confirm the proposed model’s efficiency and competitive performance.
Article 643
Title@2025-07-10 (4): CoAM: Corpus of All-Type Multiword Expressions
Title: CoAM: Corpus of All-Type Multiword Expressions | CoAM: Corpus von Multiwort-Ausdrücken aller Art | CoAM: 全类型多字表达式组合体 2412.18151v3 |
Authors (7): Yusuke Ide, Joshua Tanner, Adam Nohejl, Jacob Hoffman, Justin Vasselli, Hidetaka Kamigaito, Taro Watanabe
Multiword expressions (MWEs) refer to idiomatic sequences of multiple words. MWE identification, i.e., detecting MWEs in text, can play a key role in downstream tasks such as machine translation, but existing datasets for the task are inconsistently annotated, limited to a single type of MWE, or limited in size. To enable reliable and comprehensive evaluation, we created CoAM: Corpus of All-Type Multiword Expressions, a dataset of 1.3K sentences constructed through a multi-step process of human annotation, human review, and automated consistency checking designed to enhance data quality. Additionally, for the first time in a dataset for MWE identification, CoAM’s MWEs are tagged with MWE types, such as Noun and Verb, enabling fine-grained error analysis. Annotations for CoAM were collected using a new interface created with our interface generator, which allows easy and flexible annotation of MWEs in any form. Through experiments using CoAM, we find that a fine-tuned large language model outperforms MWEasWSD, which achieved state-of-the-art performance on the DiMSUM dataset. Furthermore, analysis using our MWE-type-tagged data reveals that Verb MWEs are easier to identify than Noun MWEs across approaches.
Article 644
Title@2025-07-10 (4): Rethinking Verification for LLM Code Generation: From Generation to Testing
Title: Rethinking Verification for LLM Code Generation: From Generation to Testing | Überprüfung der LLM-Code-Generierung neu denken: Von der Generation zur Prüfung | 重新思考LLM 代码生成的核查:从生成到测试 2507.06920v2 |
Authors (8): Zihan Ma, Taolin Zhang, Maosong Cao, Junnan Liu, Wenwei Zhang, Minnan Luo, Songyang Zhang, Kai Chen
Large language models (LLMs) have recently achieved notable success in code-generation benchmarks such as HumanEval and LiveCodeBench. However, a detailed examination reveals that these evaluation suites often comprise only a limited number of homogeneous test cases, resulting in subtle faults going undetected. This not only artificially inflates measured performance but also compromises accurate reward estimation in reinforcement learning frameworks utilizing verifiable rewards (RLVR). To address these critical shortcomings, we systematically investigate the test-case generation (TCG) task by proposing multi-dimensional metrics designed to rigorously quantify test-suite thoroughness. Furthermore, we introduce a human-LLM collaborative method (SAGA), which combines human programming expertise with LLM reasoning capability and aims to significantly enhance both the coverage and the quality of generated test cases. In addition, we develop TCGBench to facilitate the study of the TCG task. Experiments show that SAGA achieves a detection rate of 90.62% and a verifier accuracy of 32.58% on TCGBench. The Verifier Accuracy (Verifier Acc) of the code generation evaluation benchmark synthesized by SAGA is 10.78% higher than that of LiveCodeBench-v6. These results demonstrate the effectiveness of our proposed method. We hope this work contributes to building a scalable foundation for reliable LLM code evaluation, further advancing RLVR in code generation, and paving the way for automated adversarial test synthesis and adaptive benchmark integration.
Article 645
Title@2025-07-10 (4): Large Language Model for Extracting Complex Contract Information in Industrial Scenes
Title: Large Language Model for Extracting Complex Contract Information in Industrial Scenes | Großes Sprachmodell zur Extraktion komplexer Vertragsinformationen in Industrieszenen | 工业景点复杂合同信息提取大语言模型 2507.06539v2 |
Authors (3): Yunyang Cao, Yanjun Li, Silong Dai
This paper proposes a high-quality dataset construction method for complex contract information extraction tasks in industrial scenarios and fine-tunes a large language model based on this dataset. Firstly, cluster analysis is performed on industrial contract texts, and GPT-4 and GPT-3.5 are used to extract key information from the original contract data, obtaining high-quality data annotations. Secondly, data augmentation is achieved by constructing new texts, and GPT-3.5 generates unstructured contract texts from randomly combined keywords, improving model robustness. Finally, the large language model is fine-tuned based on the high-quality dataset. Experimental results show that the model achieves excellent overall performance while ensuring high field recall and precision and considering parsing efficiency. LoRA, data balancing, and data augmentation effectively enhance model accuracy and robustness. The proposed method provides a novel and efficient solution for industrial contract information extraction tasks.
Article 646
Title@2025-07-10 (4): BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems
Title: BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems | BountyBench: Dollar-Impact von KI-Agenten-Angriffen und Verteidigern auf reale Cybersicherheitssysteme | BountyBench: AI代理攻击者和捍卫者对现实世界网络安全系统的美元影响 2505.15216v2 |
Authors (34): Andy K. Zhang, Joey Ji, Celeste Menders, Riya Dulepet, Thomas Qin, Ron Y. Wang, Junrong Wu, Kyleen Liao, Jiliang Li, Jinghan Hu, Sara Hong, Nardos Demilew, Shivatmica Murgai, Jason Tran, Nishka Kacheria, Ethan Ho, Denis Liu, Lauren McLane, Olivia Bruvik, Dai-Rong Han, Seungwoo Kim, Akhil Vyas, Cuiyuanxiu Chen, Ryan Li, Weiran Xu, Jonathan Z. Ye, Prerit Choudhary, Siddharth M. Bhatia, Vikram Sivashankar, Yuxuan Bao, Dawn Song, Dan Boneh, Daniel E. Ho, Percy Liang
AI agents have the potential to significantly alter the cybersecurity landscape. Here, we introduce the first framework to capture offensive and defensive cyber-capabilities in evolving real-world systems. Instantiating this framework with BountyBench, we set up 25 systems with complex, real-world codebases. To capture the vulnerability lifecycle, we define three task types: Detect (detecting a new vulnerability), Exploit (exploiting a specific vulnerability), and Patch (patching a specific vulnerability). For Detect, we construct a new success indicator, which is general across vulnerability types and provides localized evaluation. We manually set up the environment for each system, including installing packages, setting up server(s), and hydrating database(s). We add 40 bug bounties, which are vulnerabilities with monetary awards of $10-$30,485, covering 9 of the OWASP Top 10 Risks. To modulate task difficulty, we devise a new strategy based on information to guide detection, interpolating from identifying a zero day to exploiting a specific vulnerability. We evaluate 8 agents: Claude Code, OpenAI Codex CLI with o3-high and o4-mini, and custom agents with o3-high, GPT-4.1, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet Thinking, and DeepSeek-R1. Given up to three attempts, the top-performing agents are OpenAI Codex CLI: o3-high (12.5% on Detect, mapping to $3,720; 90% on Patch, mapping to $14,152), Custom Agent with Claude 3.7 Sonnet Thinking (67.5% on Exploit), and OpenAI Codex CLI: o4-mini (90% on Patch, mapping to $14,422). OpenAI Codex CLI: o3-high, OpenAI Codex CLI: o4-mini, and Claude Code are more capable at defense, achieving higher Patch scores of 90%, 90%, and 87.5%, compared to Exploit scores of 47.5%, 32.5%, and 57.5% respectively; while the custom agents are relatively balanced between offense and defense, achieving Exploit scores of 37.5-67.5% and Patch scores of 35-60%.
Article 647
Title@2025-07-10 (4): Bradley-Terry and Multi-Objective Reward Modeling Are Complementary
Title: Bradley-Terry and Multi-Objective Reward Modeling Are Complementary | Bradley-Terry und Multi-Objective Reward Modeling sind komplementär | Bradley-Terry和多目标奖励模型具有补充作用 2507.07375v1 |
Authors (13): Zhiwei Zhang, Hui Liu, Xiaomin Li, Zhenwei Dai, Jingying Zeng, Fali Wang, Minhua Lin, Ramraj Chandradevan, Zhen Li, Chen Luo, Xianfeng Tang, Qi He, Suhang Wang
Reward models trained on human preference data have demonstrated strong effectiveness in aligning Large Language Models (LLMs) with human intent under the framework of Reinforcement Learning from Human Feedback (RLHF). However, RLHF remains vulnerable to reward hacking, where the policy exploits imperfections in the reward function rather than genuinely learning the intended behavior. Although significant efforts have been made to mitigate reward hacking, they predominantly focus on and evaluate in-distribution scenarios, where the training and testing data for the reward model share the same distribution. In this paper, we empirically show that state-of-the-art methods struggle in more challenging out-of-distribution (OOD) settings. We further demonstrate that incorporating fine-grained multi-attribute scores helps address this challenge. However, the limited availability of high-quality data often leads to weak performance of multi-objective reward functions, which can negatively impact overall performance and become the bottleneck. To address this issue, we propose a unified reward modeling framework that jointly trains Bradley–Terry (BT) single-objective and multi-objective regression-based reward functions using a shared embedding space. We theoretically establish a connection between the BT loss and the regression objective and highlight their complementary benefits. Specifically, the regression task enhances the single-objective reward function’s ability to mitigate reward hacking in challenging OOD settings, while BT-based training improves the scoring capability of the multi-objective reward function, enabling a 7B model to outperform a 70B baseline. Extensive experimental results demonstrate that our framework significantly improves both the robustness and the scoring performance of reward models.
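The two objectives named in the abstract can be written down in a few lines. The Bradley-Terry pairwise loss is the standard negative log-sigmoid of the reward margin; the mean-squared-error regression term and the weighted sum in `joint_loss` (including its `weight` parameter) are assumptions for illustration, since the abstract does not specify how the two losses are combined or how the shared embedding feeds them.

```python
import math


def bt_loss(r_chosen, r_rejected):
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = r_chosen - r_rejected
    # Equivalent to softplus(-margin), arranged to avoid overflow for large |margin|.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def regression_loss(predicted, target):
    """Mean squared error over fine-grained per-attribute scores."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def joint_loss(r_chosen, r_rejected, predicted_attrs, target_attrs, weight=1.0):
    """Hypothetical combination: BT term plus a weighted regression term."""
    return bt_loss(r_chosen, r_rejected) + weight * regression_loss(
        predicted_attrs, target_attrs
    )


# Equal rewards give log(2); a larger margin for the chosen response lowers the loss.
tie = bt_loss(0.0, 0.0)
good = bt_loss(2.0, 0.0)
bad = bt_loss(0.0, 2.0)
combined = joint_loss(2.0, 0.0, [1.0, 2.0], [1.0, 4.0], weight=0.5)
```

In a real reward model both heads would read the same embedding of the prompt-response pair, so gradients from either term shape the shared representation.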
Article 648
Title@2025-07-10 (4): Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing
Title: Krul: Efficient State Restoration for Multi-turn Conversations with Dynamic Cross-layer KV Sharing | Krul: Effiziente Zustandsrestauration für Multiturn-Gespräche mit dynamischem Cross-Layer-KV-Sharing | Krul:通过动态跨层KV共享,高效恢复国家多方向对话 2507.08045v1 |
Authors (5): Junyi Wen, Junyuan Liang, Zicong Hong, Wuhui Chen, Zibin Zheng
Efficient state restoration in multi-turn conversations with large language models (LLMs) remains a critical challenge, primarily due to the overhead of recomputing or loading full key-value (KV) caches for all historical tokens. To address this, existing approaches compress KV caches across adjacent layers with highly similar attention patterns. However, these methods often apply a fixed compression scheme across all conversations, selecting the same layer pairs for compression without considering conversation-specific attention dynamics. This static strategy overlooks variability in attention pattern similarity across different conversations, which can lead to noticeable accuracy degradation. We present Krul, a multi-turn LLM inference system that enables accurate and efficient KV cache restoration. Krul dynamically selects compression strategies based on attention similarity across layer pairs and uses a recomputation-loading pipeline to restore the KV cache. It introduces three key innovations: 1) a preemptive compression strategy selector that preserves critical context for future conversation turns and selects a customized strategy for each conversation; 2) a token-wise heterogeneous attention similarity estimator that reduces the computation and storage overhead of attention similarity during model generation; 3) a bubble-free restoration scheduler that reduces potential bubbles caused by the imbalance between recomputation and loading streams due to compressed KV caches. Empirical evaluations on real-world tasks demonstrate that Krul achieves a 1.5x-2.68x reduction in time-to-first-token (TTFT) and a 1.33x-2.35x reduction in KV cache storage compared to state-of-the-art methods without compromising generation quality.
Article 649
Title@2025-07-10 (4): Shifting from Ranking to Set Selection for Retrieval Augmented Generation
Title: Shifting from Ranking to Set Selection for Retrieval Augmented Generation | Wechsel vom Ranking zur Einstellungsauswahl für retrieval Augmented Generation | 从排位移到设置回收增量一代的选择 2507.06838v2 |
Authors (4): Dahyun Lee, Yongrae Jo, Haeju Park, Moontae Lee
Retrieval in Retrieval-Augmented Generation (RAG) must ensure that retrieved passages are not only individually relevant but also collectively form a comprehensive set. Existing approaches primarily rerank top-k passages based on their individual relevance, often failing to meet the information needs of complex queries in multi-hop question answering. In this work, we propose a set-wise passage selection approach and introduce SETR, which explicitly identifies the information requirements of a query through Chain-of-Thought reasoning and selects an optimal set of passages that collectively satisfy those requirements. Experiments on multi-hop RAG benchmarks show that SETR outperforms both proprietary LLM-based rerankers and open-source baselines in terms of answer correctness and retrieval quality, providing an effective and efficient alternative to traditional rerankers in RAG systems. The code is available at https://github.com/LGAI-Research/SetR
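The difference between ranking by individual relevance and selecting a set that jointly covers a query's needs can be shown with a toy sketch. This is not SETR itself, which identifies requirements and selects passages with an LLM through Chain-of-Thought reasoning; here a simple greedy coverage heuristic stands in for that step, and the passage fields `score` and `covers` as well as the requirement labels are hypothetical.

```python
def top_k_by_relevance(passages, k):
    """Baseline: keep the k passages with the highest individual relevance score."""
    return sorted(passages, key=lambda p: p["score"], reverse=True)[:k]


def greedy_set_selection(passages, requirements, k):
    """Stand-in for set-wise selection: greedily cover the query's requirements."""
    uncovered = set(requirements)
    remaining = list(passages)
    chosen = []
    while uncovered and remaining and len(chosen) < k:
        # Prefer the passage covering the most still-unmet requirements,
        # breaking ties by individual relevance.
        best = max(remaining, key=lambda p: (len(uncovered & p["covers"]), p["score"]))
        if not uncovered & best["covers"]:
            break  # no remaining passage adds new information
        chosen.append(best)
        remaining.remove(best)
        uncovered -= best["covers"]
    return chosen


# A two-hop question needs both the director and the director's birthplace.
requirements = {"director", "birthplace"}
passages = [
    {"id": "A", "score": 0.90, "covers": {"director"}},
    {"id": "B", "score": 0.85, "covers": {"director"}},
    {"id": "C", "score": 0.60, "covers": {"birthplace"}},
]

ranked_ids = [p["id"] for p in top_k_by_relevance(passages, k=2)]
set_ids = [p["id"] for p in greedy_set_selection(passages, requirements, k=2)]
```

In this toy case the relevance ranking returns two redundant passages (A and B) and misses the birthplace, while the set-wise selection returns A and C, which together cover both requirements.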